AndreaDev237/IndigoDocumentIntelligenceServer

Document Intelligence Server

An MCP-first knowledge-base server built as an AI Solutions Engineer assignment.
Upload internal documents (PDF), query them with natural language, and expose the knowledge base to any MCP-compatible AI agent.

Live demo: https://indigo.coolify.bonsicorp.ovh
MCP endpoint (live): https://indigo.coolify.bonsicorp.ovh/mcp


Architecture

┌──────────────────────────────────────────────────────────────┐
│                     Frontend (React SPA)                      │
│              upload · list · tag · delete · view              │
└────────────────────────┬─────────────────────────────────────┘
                         │ HTTP/JSON (cookie or bearer)
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                 FastAPI Backend (REST)                        │
│  /documents  /search  /tags  /auth                           │
│  → orchestrates ingestion pipeline                            │
│  → single source of truth for DB access                       │
└──────┬──────────────────────────────────────────┬────────────┘
       │ SQL + pgvector                            │ internal HTTP
       ▼                                           ▼
┌──────────────────────┐       ┌──────────────────────────────┐
│  PostgreSQL 16       │       │   MCP Server (FastMCP)       │
│  + pgvector          │       │   Streamable HTTP :8001/mcp   │
│  HNSW + tsvector     │       │   thin client — calls API    │
└──────────────────────┘       │   5 tools: search · by_tag ·  │
                               │   by_document · list_documents│
                               │   · list_tags                 │
                               └──────────────────────────────┘

Four Docker services: frontend (React + nginx), api (FastAPI), mcp (FastMCP Streamable HTTP), db (PostgreSQL 16 + pgvector).

The MCP server is a thin HTTP client — it has no direct database access. All reads and writes flow through the backend API. This keeps the data model in one place and means the MCP server has no SQLAlchemy models to drift.


Stack choices and rationale

Vector store — pgvector

A dedicated vector database (Qdrant, Pinecone, Weaviate) would introduce a fifth service and a second consistency boundary. For a corpus in the hundreds-of-documents range, pgvector on PostgreSQL 16 covers every requirement:

  • Hybrid search in a single SQL query: HNSW index on vector(1536) for dense retrieval, GIN index on a tsvector GENERATED column for BM25-style sparse retrieval, then Reciprocal Rank Fusion in a CTE. No orchestration between two stores.
  • SQL JOINs for metadata filtering: tag filters, status filters, and pagination are plain WHERE clauses on the same relation.
  • Operational simplicity: one service to back up, one schema to migrate, one connection pool.
  • Scaling ceiling: pgvector with HNSW handles ~1 M chunks at sub-10 ms comfortably. Beyond that, or when multi-tenancy requires namespace isolation, a dedicated vector DB would be the right call.
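The fusion step described above is easy to illustrate outside SQL. A minimal Python sketch of Reciprocal Rank Fusion over two ranked chunk-ID lists — the real system computes the same formula inside a single CTE, and the constant `k = 60` is the conventional default from the RRF literature, not a value taken from this codebase:

```python
def rrf_fuse(dense_ids, sparse_ids, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).

    `dense_ids` and `sparse_ids` are chunk IDs ordered best-first by the
    HNSW (dense) and tsvector (sparse) retrievers respectively.
    """
    scores = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# "c1" appears in both rankings, so it fuses to the top.
fused = rrf_fuse(["c3", "c1", "c2"], ["c1", "c4"])
```

Because RRF only consumes ranks, it needs no score normalization between the two retrievers — which is exactly why it composes cleanly inside one query.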

Embedding model — text-embedding-3-small

  • 1536 dimensions, 8,191-token context window (well above the 800-token chunk target).
  • ~$0.02 per 1M tokens — effectively free at document-corpus scale.
  • Solid multilingual quality; the reference documents mix Italian and English.
  • text-embedding-3-large was benchmarked: marginally better recall, 6× the cost, 2× the storage. Not justified here.
  • Self-hosted alternative for a privacy-sensitive production deployment: BAAI/bge-m3 (multilingual, runs on a single A10 GPU).

Chunking strategy — structure-aware recursive with header/footer stripping

The pipeline is two-phase:

Phase 1 — Parser (pypdf + heuristics)

  • POST /documents dispatches on the request Content-Type. application/pdf flows through the PDF parser below; text/plain is ingested directly as a single-page document, skipping the PDF-specific heuristics.
  • Lines appearing on ≥ 50 % of pages are detected as headers/footers and removed before any text reaches the chunker.
  • Pages where ≥ 40 % of lines end with a page-number pattern are flagged is_toc = True and excluded from the chunk stream (kept in the document record for future outline features).
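Both page-level heuristics can be sketched compactly. The following is an illustrative simplification, not the pipeline's actual code: exact-match line counting for header/footer detection and a naive trailing-page-number regex for ToC flagging (the 50% and 40% thresholds from above are kept; a minimum repeat count of 2 guards very short documents):

```python
import re
from collections import Counter

def strip_repeated_lines(pages, threshold=0.5):
    """Remove lines appearing on >= threshold of pages (headers/footers)."""
    counts = Counter()
    for page in pages:
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    boiler = {line for line, n in counts.items()
              if n >= 2 and n / len(pages) >= threshold}
    return ["\n".join(l for l in page.splitlines() if l.strip() not in boiler)
            for page in pages]

def looks_like_toc(page, threshold=0.4):
    """Flag a page as ToC when >= threshold of lines end in a page number."""
    lines = [l for l in page.splitlines() if l.strip()]
    if not lines:
        return False
    pattern = re.compile(r"\.{2,}\s*\d+$|\s\d{1,3}$")  # "Intro .... 12" or "Intro 12"
    hits = sum(1 for l in lines if pattern.search(l.rstrip()))
    return hits / len(lines) >= threshold
```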

Phase 2 — Chunker (recursive, heading-bounded)

  • The document is split into sections at validated headings first, then each oversized section is split on token budget.
  • Heading validation rejects ToC line residue, brand banners, and punctuation-terminated lines.
  • Target: 800 tokens, overlap: 100 tokens (tiktoken counter).
  • Each chunk carries: section_heading, page_number, chunk_index, char_start, char_end.
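A simplified sketch of the Phase 2 budget split. The real chunker splits at validated headings first and counts tokens with tiktoken; here a whitespace split stands in for the token counter, and char offsets are omitted:

```python
def split_on_budget(text, target=800, overlap=100):
    """Split one section into ~target-token chunks with token overlap.

    Whitespace tokenization is a stand-in for the tiktoken counter.
    """
    tokens = text.split()
    if len(tokens) <= target:
        return [text]
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + target]))
        if start + target >= len(tokens):
            break
        start += target - overlap  # step back by the overlap
    return chunks

def chunk_document(sections, target=800, overlap=100):
    """sections: list of (heading, body) pairs from the heading splitter."""
    out = []
    for heading, body in sections:
        for piece in split_on_budget(body, target, overlap):
            out.append({"section_heading": heading,
                        "chunk_index": len(out),
                        "text": piece})
    return out
```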

This two-phase approach was developed iteratively across six atomic commits (see docs/adr/0001-chunking-refactor.md) after fixed-size-only chunking produced ToC lines as headings and brand banners dominating chunk 0.

Scope of these heuristics. The pipeline was designed and tuned against the reference corpus shipped with the assignment: single-column, well-structured, machine-generated PDFs in Italian and English, plus plain text. It is not a general-purpose document understanding stack. Multi-column layouts, scanned PDFs, complex tables, CJK or RTL scripts, and Markdown/HTML/DOCX inputs are out of scope and would require a different parser path (see Known limitations → Chunking & extraction below).

Why not semantic chunking (LLM-based)? Non-deterministic, adds an extra API call per ingestion, and the structure-aware heuristics already capture most of the signal available in well-formatted PDFs.

Why not unstructured.io, marker, or a managed extraction API? A production-grade pipeline serving heterogeneous inputs would absolutely use one of these (or Azure Document Intelligence / AWS Textract, plus OCR and semantic chunking on top). They were not adopted here because the assignment corpus does not exercise the failure modes they exist to solve, and pulling in a heavy parser stack would have traded review-time clarity for capability the test set cannot demonstrate. The honest framing is: this is a deliberate scope choice for an assignment, not a recommendation for production.

MCP transport — Streamable HTTP

The MCP specification defines three transports: stdio, SSE (now deprecated), and Streamable HTTP. Stdio suits local process launching (Claude Desktop spawning a binary) but requires the client to manage the subprocess lifecycle. SSE is server-to-client only. Streamable HTTP is the current recommended transport for remotely hosted servers: it is stateless-friendly, works behind any reverse proxy, and is the default in the FastMCP SDK.

The server is mounted as a Starlette ASGI app on :8001/mcp, with an additional /health route for orchestration health checks. This means the same container works both locally (Claude Desktop, OpenCode) and as a remote endpoint (Coolify-hosted, any HTTP client).


MCP tool design

Tool inventory

| Tool | Parameters | Purpose |
| --- | --- | --- |
| search | query, top_k (5), hybrid (true) | Unfiltered semantic + BM25 hybrid search over all chunks |
| search_by_tag | query, tags, top_k (5), hybrid (true) | Hybrid search restricted to documents carrying any of the given tags |
| search_by_document | query, documents, top_k (5), hybrid (true) | Hybrid search restricted to specific documents (documents is a list of UUIDs or filenames — auto-detected by the backend) |
| list_documents | tags (null), status (null), limit (50) | Browse the document catalogue |
| list_tags | (none) | List all tags with document counts |
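The UUID-versus-filename auto-detection for search_by_document can be as simple as a parse attempt. This helper is hypothetical, not the backend's actual code (note the caveat that a bare 32-hex filename would also parse as a UUID):

```python
import uuid

def classify_document_ref(ref):
    """Classify a `documents` entry as "uuid" or "filename" (hypothetical helper)."""
    try:
        uuid.UUID(ref)  # accepts canonical dashed form and bare 32-hex strings
        return "uuid"
    except ValueError:
        return "filename"
```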

Why these five tools

The goal was to cover the minimal set of operations an LLM needs to answer questions from a document corpus, while giving the tool-selection step the strongest possible signal at the schema level:

  • search is the primary entry point: a user question maps directly to a query string against the whole knowledge base.
  • search_by_tag handles taxonomy-scoped queries ("search only in compliance documents"). tags is a required parameter — calling this tool is itself the declaration of intent to filter.
  • search_by_document handles narrow-scope follow-ups ("look only inside aml_kyc_policy.pdf and data_privacy_compliance_policy.pdf"). The documents parameter accepts UUIDs or filenames so the model can pass through whatever identifier surfaced earlier in the conversation.
  • list_documents supports exploratory tasks ("what documents are available?", "show me all documents tagged gdpr") that search alone cannot handle — you cannot search for what you do not know exists.
  • list_tags covers the taxonomy discovery step: before filtering by tag, the model can check which tags exist and how many documents each covers.

Why separate search_by_tag / search_by_document instead of optional tags / documents parameters on search? Three reasons:

  1. The brief is prescriptive about this exact tool surface — separation is a stated requirement, not an internal preference.
  2. Separate tools give the LLM a clearer mental model: "search filtered by tag" is a categorically different intent from "search the whole knowledge base", and the schema should make that legible at tool-selection time rather than burying it inside an optional parameter.
  3. Required-vs-optional parameter signals strengthen tool selection. With tags required on search_by_tag, the model cannot accidentally invoke a filtered search with no filter (a common failure mode when filters are optional booleans on a generic tool).

A get_document MCP tool was deliberately not included: full-document fetches are an out-of-band frontend concern, served directly by the REST endpoint GET /documents/{id}/chunks. Adding it to the MCP surface would dilute the tool-selection signal without enabling any agent workflow that the five tools above don't already cover. See docs/adr/0002-mcp-tool-surface-alignment.md.

Why these names

Verb-noun and verb-noun-by-qualifier pairs, all lowercase with underscores, matching the convention of well-known MCP tool registries. The names are self-explanatory to an LLM that has never seen the schema: search searches, search_by_tag searches with a tag filter, list_documents lists documents, list_tags lists tags.

Why these parameters

  • top_k default 5, max 20: Five chunks is enough context for most single-question answers. The max-20 cap prevents the model from accidentally pulling the entire corpus into a context window.
  • hybrid default true: Keyword-heavy queries (product codes, policy names, article numbers) score poorly on dense vectors alone. Hybrid is strictly better on recall and costs zero extra — it is a single SQL query with two ranking components.
  • tags as list[str] (required on search_by_tag): Supports multi-tag filtering with OR semantics. The most common real-world case: "search within these two policy categories."
  • documents as list[str] (required on search_by_document): Each entry is either a document UUID or a filename; the backend auto-detects per element. UUIDs are precise (recommended when the model already has them from a prior list_documents call); filenames are ergonomic when the user names a document directly.
  • Response envelope {data, metadata, message}: Every tool returns a consistent three-field envelope. data carries the payload, metadata carries auxiliary numeric info (scores, counts), and message is a dynamic, context-sensitive sentence that guides the model's next step — including low-confidence warnings (top_score < 0.3) and suggestions to rephrase or remove tag filters on empty results.
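A sketch of how such an envelope might be assembled. The message strings, and any field names beyond the three documented keys, are illustrative assumptions rather than the server's actual wording:

```python
def build_envelope(results, query):
    """Assemble the {data, metadata, message} response envelope."""
    top_score = max((r["score"] for r in results), default=0.0)
    if not results:
        message = "No results. Try rephrasing the query or removing tag filters."
    elif top_score < 0.3:
        message = "Low-confidence results (top score below 0.3); consider rephrasing."
    else:
        message = f"Found {len(results)} relevant chunks for '{query}'."
    return {
        "data": results,                                   # payload
        "metadata": {"count": len(results), "top_score": top_score},
        "message": message,                                # guides the model's next step
    }
```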

How to run locally

Prerequisites: Docker, Docker Compose, an OpenAI API key.

# 1. Clone and configure
git clone <repo-url>
cd project
cp .env.example .env
# Edit .env — at minimum set:
#   OPENAI_API_KEY=sk-...
#   JWT_SECRET_KEY=$(openssl rand -hex 32)
#   BACKEND_API_KEY=$(openssl rand -hex 32)
#   ADMIN_EMAIL=admin@example.com
#   ADMIN_PASSWORD=<choose a password>
#   MCP_API_KEY=<choose a key>

# 2. Start
docker compose up

# 3. Open
# Frontend:  http://localhost:3000
# API docs:  http://localhost:8000/docs
# MCP:       http://localhost:8001/mcp

Log in with ADMIN_EMAIL / ADMIN_PASSWORD. Upload PDFs via the frontend or POST /documents. The ingestion pipeline runs in the background; poll GET /documents until status: ready.
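The polling step can be wrapped in a small helper. Here `fetch_status` is a stand-in for an HTTP call to the documents endpoint, injected so the loop is testable; the 3 s interval matches the frontend's polling cadence:

```python
import time

def wait_until_ready(fetch_status, doc_id, timeout=120, interval=3, sleep=time.sleep):
    """Poll `fetch_status(doc_id)` until the document reaches status "ready"."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(doc_id)
        if status == "ready":
            return True
        if status == "failed":
            raise RuntimeError(f"ingestion failed for {doc_id}")
        sleep(interval)
    raise TimeoutError(f"document {doc_id} not ready after {timeout}s")
```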

Test documents

A set of sample PDFs is in test-docs/ for immediate testing.

Running the test suite

# Backend (170 passed, 2 xfailed)
docker compose exec api pytest

# MCP server (21 passed)
docker compose exec mcp pytest

The 2 xfails are documented in docs/adr/0001-chunking-refactor.md: ToC line residue in two documents is marginally above the 5 % acceptance threshold. They are tracked, not ignored.


Connecting an MCP-compatible client

The server speaks Streamable HTTP MCP over plain HTTP (or HTTPS in production). Any client that supports this transport can connect.

OpenCode

Add to ~/.config/opencode/opencode.json:

{
  "mcp": {
    "document-intelligence": {
      "type": "remote",
      "url": "http://localhost:8001/mcp",
      "headers": {
        "Authorization": "Bearer <MCP_API_KEY>"
      }
    }
  }
}

Claude Desktop

Add to claude_desktop_config.json (location varies by OS):

{
  "mcpServers": {
    "document-intelligence": {
      "url": "http://localhost:8001/mcp",
      "headers": {
        "Authorization": "Bearer <MCP_API_KEY>"
      }
    }
  }
}

Remote (production) endpoint

Replace http://localhost:8001/mcp with https://indigo.coolify.bonsicorp.ovh/mcp and use the production MCP_API_KEY.


Known limitations and what I would improve with more time

Limitations

| Area | Limitation |
| --- | --- |
| Auth | Single hardcoded admin user. No registration, no roles, no per-user isolation. JWT has no refresh token (60 min expiry, then re-login). |
| Ingestion | BackgroundTasks (in-process thread). If the API process restarts mid-ingestion, the document is stuck at status: processing with no recovery path. No progress reporting. |
| Chunking & extraction | The parser + chunker are heuristics tuned to the assignment corpus (single-column, machine-generated PDF and plain text, IT/EN). Known failure modes: multi-column layouts (text interleaved across columns), scanned/image-only PDFs (no OCR), complex tables (flattened to text), CJK/RTL scripts (heading and ToC heuristics are Latin-script-biased), DOCX/HTML/Markdown (not parsed at all). 2 xfails on ToC residue. No min_chunk_tokens floor (a product-catalog PDF may produce very small chunks). A production deployment serving arbitrary documents would replace this layer with unstructured.io / marker / a managed extraction API, plus OCR for scans and semantic chunking on top. |
| Embeddings | Vendor lock-in to OpenAI. No fallback if the API is unavailable. Embeddings are not cached: dedup prevents re-embedding an identical re-upload, but a forced re-ingest re-embeds every chunk. |
| Search | No cross-encoder re-ranker. RRF is parameter-free and robust, but a fine-tuned re-ranker would improve precision on ambiguous queries. Score normalization is a linear cosine transform, not calibrated to a probability. |
| MCP | No streaming tool responses: all results are returned at once. Large search responses on densely chunked documents could push context limits. |
| Frontend | Search UI (/search) is intentionally minimal: query + top_k + hybrid toggle + comma-separated tag/document filters + result list. No URL state sync, no debouncing, no autocomplete on tags, no pagination on the documents list. No dark mode. |
| Operations | COOKIE_SECURE is hardcoded False in dev; it must be flipped for HTTPS but is not yet env-driven. No structured alerting, no metrics endpoint. |

What I would improve with more time

Short term (days)

  • Replace BackgroundTasks with a proper task queue (Celery + Redis or arq). Gives retry logic, progress streaming, and crash recovery.
  • Add a min_chunk_tokens floor (e.g. 50 tokens) to prevent micro-chunks on catalog-style documents.
  • Extend the ingestion parser to DOCX — pypdf is the only real coupling left to PDF, and plain text is already supported.
  • Fix the 2 xfails: tighten the ToC page-number regex to catch the edge cases.
  • Make COOKIE_SECURE env-driven.
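The proposed min_chunk_tokens floor could be a small post-pass that merges each undersized chunk into its predecessor. A sketch, with a whitespace split standing in for the tiktoken counter:

```python
def enforce_min_chunk_tokens(chunks, min_tokens=50):
    """Merge undersized chunks into the preceding chunk (proposed post-pass).

    A leading undersized chunk has no predecessor and is kept as-is.
    """
    merged = []
    for chunk in chunks:
        if merged and len(chunk.split()) < min_tokens:
            merged[-1] = merged[-1] + " " + chunk  # absorb the micro-chunk
        else:
            merged.append(chunk)
    return merged
```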

Medium term (weeks)

  • Multi-user with document ownership and per-collection ACL. This is the biggest architectural gap for a real SaaS product.
  • Cross-encoder re-ranker as an optional post-processing step (bge-reranker-v2-m3 via a sidecar container, opt-in via a search parameter).
  • Streaming ingestion progress over SSE — the frontend polls at 3 s; a push model would be cleaner.
  • Self-hosted embedding model (BAAI/bge-m3) for data privacy and vendor independence.
  • Evaluation harness: a small labelled QA set to measure retrieval recall@k and MRR before and after chunking or retrieval changes. Currently all chunking improvements are validated by heuristic assertions and xfail bookkeeping, not by end-to-end retrieval metrics.

Longer term

  • Hybrid re-ranking with a ColBERT-style late-interaction model for significantly better precision at scale.
  • A get_document_summary MCP tool (section headings + first-page preview) to let the model decide whether to drill into a document via search_by_document before paying for the full retrieval.
  • Namespace/collection support: logical groupings of documents independent of tags, enabling strict multi-tenant isolation at the vector layer.
