An MCP-first knowledge-base server built as an AI Solutions Engineer assignment.
Upload internal documents (PDF), query them with natural language, and expose the knowledge base to any MCP-compatible AI agent.
Live demo: https://indigo.coolify.bonsicorp.ovh
MCP endpoint (live): https://indigo.coolify.bonsicorp.ovh/mcp
```
┌──────────────────────────────────────────────────────────────┐
│                     Frontend (React SPA)                     │
│              upload · list · tag · delete · view             │
└────────────────────────┬─────────────────────────────────────┘
                         │ HTTP/JSON (cookie or bearer)
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                   FastAPI Backend (REST)                     │
│            /documents  /search  /tags  /auth                 │
│            → orchestrates ingestion pipeline                 │
│            → single source of truth for DB access            │
└──────┬──────────────────────────────────────────┬────────────┘
       │ SQL + pgvector                           │ internal HTTP
       ▼                                          ▼
┌──────────────────────┐          ┌────────────────────────────────┐
│  PostgreSQL 16       │          │  MCP Server (FastMCP)          │
│  + pgvector          │          │  Streamable HTTP :8001/mcp     │
│  HNSW + tsvector     │          │  thin client — calls API       │
└──────────────────────┘          │  5 tools: search · by_tag ·    │
                                  │  by_document · list_documents  │
                                  │  · list_tags                   │
                                  └────────────────────────────────┘
```
Four Docker services: frontend (React + nginx), api (FastAPI), mcp (FastMCP Streamable HTTP), db (PostgreSQL 16 + pgvector).
The MCP server is a thin HTTP client — it has no direct database access. All reads and writes flow through the backend API. This keeps the data model in one place and means the MCP server has no SQLAlchemy models to drift.
A dedicated vector database (Qdrant, Pinecone, Weaviate) would introduce a fifth service and a second consistency boundary. For a corpus in the hundreds-of-documents range, pgvector on PostgreSQL 16 covers every requirement:
- Hybrid search in a single SQL query: HNSW index on `vector(1536)` for dense retrieval, GIN index on a `tsvector GENERATED` column for BM25-style sparse retrieval, then Reciprocal Rank Fusion in a CTE. No orchestration between two stores.
- SQL JOINs for metadata filtering: tag filters, status filters, and pagination are plain WHERE clauses on the same relation.
- Operational simplicity: one service to back up, one schema to migrate, one connection pool.
- Scaling ceiling: pgvector with HNSW handles ~1 M chunks at sub-10 ms comfortably. Beyond that, or when multi-tenancy requires namespace isolation, a dedicated vector DB would be the right call.
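The Reciprocal Rank Fusion step is small enough to show in isolation. A minimal Python sketch of the fusion logic (not the project's actual implementation — the real system computes this inside a SQL CTE, and `k = 60` is the conventional RRF constant, assumed here):

```python
def rrf_fuse(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    """Fuse two rankings: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; a chunk present in both rankings wins
    # over one present in only one of them.
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks enter the formula, RRF needs no score calibration between the cosine and BM25 components, which is why it is robust without per-corpus tuning.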
- 1536 dimensions, 8,191-token context window (well above the 800-token chunk target).
- ~$0.02 / 1 M tokens — effectively free at document-corpus scale.
- Solid multilingual quality; the reference documents mix Italian and English.
- `text-embedding-3-large` was benchmarked: marginally better recall, 6× the cost, 2× the storage. Not justified here.
- Self-hosted alternative for a privacy-sensitive production deployment: `BAAI/bge-m3` (multilingual, runs on a single A10 GPU).
The pipeline is two-phase:
Phase 1 — Parser (pypdf + heuristics)
- `POST /documents` dispatches on the request `Content-Type`: `application/pdf` flows through the PDF parser below; `text/plain` is ingested directly as a single-page document, skipping the PDF-specific heuristics.
- Lines appearing on ≥ 50 % of pages are detected as headers/footers and removed before any text reaches the chunker.
- Pages where ≥ 40 % of lines end with a page-number pattern are flagged `is_toc = True` and excluded from the chunk stream (kept in the document record for future outline features).
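Both page-level heuristics reduce to frequency counts. A sketch under stated assumptions — the function names and the exact page-number regex are illustrative, not the project's API; the thresholds are the ones quoted above:

```python
import re
from collections import Counter

def find_repeated_lines(pages: list[list[str]], threshold: float = 0.5) -> set[str]:
    """Lines appearing on >= 50% of pages are treated as headers/footers."""
    counts = Counter(line for page in pages for line in set(page))
    min_pages = threshold * len(pages)
    return {line for line, n in counts.items() if n >= min_pages}

# Illustrative pattern: a line ending in a 1-4 digit page number,
# e.g. "Introduction ....... 12"
PAGE_NUM_RE = re.compile(r"[.\s]\d{1,4}\s*$")

def is_toc_page(lines: list[str], threshold: float = 0.4) -> bool:
    """Pages where >= 40% of non-empty lines end in a page number are flagged as ToC."""
    lines = [l for l in lines if l.strip()]
    if not lines:
        return False
    hits = sum(1 for l in lines if PAGE_NUM_RE.search(l))
    return hits / len(lines) >= threshold
```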
Phase 2 — Chunker (recursive, heading-bounded)
- The document is split into sections at validated headings first, then each oversized section is split on token budget.
- Heading validation rejects ToC line residue, brand banners, and punctuation-terminated lines.
- Target: 800 tokens, overlap: 100 tokens (`tiktoken` counter).
- Each chunk carries: `section_heading`, `page_number`, `chunk_index`, `char_start`, `char_end`.
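The second phase can be sketched as two small functions. This is a simplification: whitespace tokenisation stands in for the `tiktoken` counter, and phase 1's heading validation is assumed to have already produced `(heading, body)` sections:

```python
def split_on_budget(tokens: list[str], target: int = 800, overlap: int = 100) -> list[list[str]]:
    """Greedy split into windows of `target` tokens with `overlap` tokens carried over."""
    if len(tokens) <= target:
        return [tokens]
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + target])
        if start + target >= len(tokens):
            break
        start += target - overlap
    return chunks

def chunk_document(sections: list[tuple[str, str]], target: int = 800, overlap: int = 100) -> list[dict]:
    """Heading-bounded first, then token budget: never merge across section boundaries."""
    out: list[dict] = []
    for heading, body in sections:
        for window in split_on_budget(body.split(), target, overlap):
            out.append({"section_heading": heading,
                        "chunk_index": len(out),
                        "text": " ".join(window)})
    return out
```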
This two-phase approach was developed iteratively across six atomic commits (see docs/adr/0001-chunking-refactor.md) after fixed-size-only chunking produced ToC lines as headings and brand banners dominating chunk 0.
Scope of these heuristics. The pipeline was designed and tuned against the reference corpus shipped with the assignment: single-column, well-structured, machine-generated PDFs in Italian and English, plus plain text. It is not a general-purpose document understanding stack. Multi-column layouts, scanned PDFs, complex tables, CJK or RTL scripts, and Markdown/HTML/DOCX inputs are out of scope and would require a different parser path (see Known limitations → Chunking & extraction below).
Why not semantic chunking (LLM-based)? Non-deterministic, adds an extra API call per ingestion, and the structure-aware heuristics already capture most of the signal available in well-formatted PDFs.
Why not unstructured.io, marker, or a managed extraction API? A production-grade pipeline serving heterogeneous inputs would absolutely use one of these (or Azure Document Intelligence / AWS Textract, plus OCR and semantic chunking on top). They were not adopted here because the assignment corpus does not exercise the failure modes they exist to solve, and pulling in a heavy parser stack would have traded review-time clarity for capability the test set cannot demonstrate. The honest framing is: this is a deliberate scope choice for an assignment, not a recommendation for production.
The MCP specification supports three transports: stdio, SSE, and Streamable HTTP. Stdio is suitable for local process launching (Claude Desktop spawning a binary) but requires the client to manage the subprocess lifecycle. SSE is one-directional. Streamable HTTP is the current recommended transport for remotely hosted servers: it is stateless-friendly, works behind any reverse proxy, and is the default in the FastMCP SDK.
The server is mounted as a Starlette ASGI app on :8001/mcp, with an additional /health route for orchestration health checks. This means the same container works both locally (Claude Desktop, OpenCode) and as a remote endpoint (Coolify-hosted, any HTTP client).
| Tool | Parameters | Purpose |
|---|---|---|
| `search` | `query`, `top_k` (5), `hybrid` (true) | Unfiltered semantic + BM25 hybrid search over all chunks |
| `search_by_tag` | `query`, `tags`, `top_k` (5), `hybrid` (true) | Hybrid search restricted to documents carrying any of the given tags |
| `search_by_document` | `query`, `documents`, `top_k` (5), `hybrid` (true) | Hybrid search restricted to specific documents (`documents` is a list of UUIDs or filenames — auto-detected by the backend) |
| `list_documents` | `tags` (null), `status` (null), `limit` (50) | Browse the document catalogue |
| `list_tags` | — | List all tags with document counts |
The goal was to cover the minimal set of operations an LLM needs to answer questions from a document corpus, while giving the tool-selection step the strongest possible signal at the schema level:
- `search` is the primary entry point: a user question maps directly to a query string against the whole knowledge base.
- `search_by_tag` handles taxonomy-scoped queries ("search only in compliance documents"). `tags` is a required parameter — calling this tool is itself the declaration of intent to filter.
- `search_by_document` handles narrow-scope follow-ups ("look only inside `aml_kyc_policy.pdf` and `data_privacy_compliance_policy.pdf`"). The `documents` parameter accepts UUIDs or filenames so the model can pass through whatever identifier surfaced earlier in the conversation.
- `list_documents` supports exploratory tasks ("what documents are available?", "show me all documents tagged `gdpr`") that search alone cannot handle — you cannot search for what you do not know exists.
- `list_tags` covers the taxonomy discovery step: before filtering by tag, the model can check which tags exist and how many documents each covers.
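The UUID-or-filename auto-detection in `search_by_document` amounts to a per-element parse attempt. A sketch (the function name is illustrative, not the backend's actual code):

```python
import uuid

def classify_document_ref(ref: str) -> str:
    """Treat anything that parses as a UUID as an id; everything else as a filename."""
    try:
        uuid.UUID(ref)
        return "id"
    except ValueError:
        return "filename"
```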
Why separate search_by_tag / search_by_document instead of optional tags / documents parameters on search? Three reasons:
- The brief is prescriptive about this exact tool surface — separation is a stated requirement, not an internal preference.
- Separate tools give the LLM a clearer mental model: "search filtered by tag" is a categorically different intent from "search the whole knowledge base", and the schema should make that legible at tool-selection time rather than burying it inside an optional parameter.
- Required-vs-optional parameter signals strengthen tool selection. With `tags` required on `search_by_tag`, the model cannot accidentally invoke a filtered search with no filter (a common failure mode when filters are optional booleans on a generic tool).
A get_document MCP tool was deliberately not included: full-document fetches are an out-of-band frontend concern, served directly by the REST endpoint GET /documents/{id}/chunks. Adding it to the MCP surface would dilute the tool-selection signal without enabling any agent workflow that the five tools above don't already cover. See docs/adr/0002-mcp-tool-surface-alignment.md.
Verb-noun and verb-noun-by-qualifier pairs, all lowercase with underscores, matching the convention of well-known MCP tool registries. The names are self-explanatory to an LLM that has never seen the schema: search searches, search_by_tag searches with a tag filter, list_documents lists documents, list_tags lists tags.
- `top_k` default 5, max 20: Five chunks is enough context for most single-question answers. The max-20 cap prevents the model from accidentally pulling the entire corpus into a context window.
- `hybrid` default `true`: Keyword-heavy queries (product codes, policy names, article numbers) score poorly on dense vectors alone. Hybrid is strictly better on recall and costs zero extra — it is a single SQL query with two ranking components.
- `tags` as `list[str]` (required on `search_by_tag`): Supports multi-tag filtering with OR semantics. The most common real-world case: "search within these two policy categories."
- `documents` as `list[str]` (required on `search_by_document`): Each entry is either a document UUID or a filename; the backend auto-detects per element. UUIDs are precise (recommended when the model already has them from a prior `list_documents` call); filenames are ergonomic when the user names a document directly.
- Response envelope `{data, metadata, message}`: Every tool returns a consistent three-field envelope. `data` carries the payload, `metadata` carries auxiliary numeric info (scores, counts), and `message` is a dynamic, context-sensitive sentence that guides the model's next step — including low-confidence warnings (`top_score < 0.3`) and suggestions to rephrase or remove tag filters on empty results.
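The envelope behaviour can be illustrated with a small builder. The message wording and field details here are illustrative; the 0.3 threshold is the one quoted above:

```python
def build_envelope(results: list[dict], query: str) -> dict:
    """Wrap tool results in the {data, metadata, message} envelope with guidance text."""
    if not results:
        return {"data": [],
                "metadata": {"count": 0},
                "message": f"No chunks matched {query!r}. "
                           f"Try rephrasing or removing tag filters."}
    top_score = max(r["score"] for r in results)
    message = f"Found {len(results)} chunks."
    if top_score < 0.3:
        # Low-confidence warning steers the model toward a retry
        message += " Low confidence in the top result; consider rephrasing the query."
    return {"data": results,
            "metadata": {"count": len(results), "top_score": top_score},
            "message": message}
```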
Prerequisites: Docker, Docker Compose, an OpenAI API key.
```shell
# 1. Clone and configure
git clone <repo-url>
cd project
cp .env.example .env
# Edit .env — at minimum set:
#   OPENAI_API_KEY=sk-...
#   JWT_SECRET_KEY=$(openssl rand -hex 32)
#   BACKEND_API_KEY=$(openssl rand -hex 32)
#   ADMIN_EMAIL=admin@example.com
#   ADMIN_PASSWORD=<choose a password>
#   MCP_API_KEY=<choose a key>

# 2. Start
docker compose up

# 3. Open
# Frontend: http://localhost:3000
# API docs: http://localhost:8000/docs
# MCP:      http://localhost:8001/mcp
```

Log in with `ADMIN_EMAIL` / `ADMIN_PASSWORD`. Upload PDFs via the frontend or `POST /documents`. The ingestion pipeline runs in the background; poll `GET /documents` until `status: ready`.
A set of sample PDFs is in test-docs/ for immediate testing.
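Polling for `status: ready` can be scripted. The sketch below separates the loop from the HTTP call: `fetch_status` stands in for an authenticated `GET /documents/{id}` request, whose exact response shape is an assumption here:

```python
import time

def wait_until_ready(fetch_status, poll_interval: float = 3.0, timeout: float = 300.0) -> str:
    """Call `fetch_status()` until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("ready", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("document did not reach a terminal status in time")
```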
```shell
# Backend (170 passed, 2 xfailed)
docker compose exec api pytest

# MCP server (21 passed)
docker compose exec mcp pytest
```

The 2 xfails are documented in `docs/adr/0001-chunking-refactor.md`: ToC line residue in two documents is marginally above the 5 % acceptance threshold. They are tracked, not ignored.
The server speaks Streamable HTTP MCP over plain HTTP (or HTTPS in production). Any client that supports this transport can connect.
Add to `~/.config/opencode/opencode.json`:

```json
{
  "mcp": {
    "document-intelligence": {
      "type": "remote",
      "url": "http://localhost:8001/mcp",
      "headers": {
        "Authorization": "Bearer <MCP_API_KEY>"
      }
    }
  }
}
```

Add to `claude_desktop_config.json` (location varies by OS):
```json
{
  "mcpServers": {
    "document-intelligence": {
      "url": "http://localhost:8001/mcp",
      "headers": {
        "Authorization": "Bearer <MCP_API_KEY>"
      }
    }
  }
}
```

Replace `http://localhost:8001/mcp` with `https://indigo.coolify.bonsicorp.ovh/mcp` and use the production `MCP_API_KEY`.
| Area | Limitation |
|---|---|
| Auth | Single hardcoded admin user. No registration, no roles, no per-user isolation. JWT has no refresh token (60 min expiry, then re-login). |
| Ingestion | BackgroundTasks (in-process thread). If the API process restarts mid-ingestion, the document is stuck at status: processing with no recovery path. No progress reporting. |
| Chunking & extraction | The parser + chunker are heuristics tuned to the assignment corpus (single-column, machine-generated PDF and plain text, IT/EN). Known failure modes: multi-column layouts (text interleaved across columns), scanned/image-only PDFs (no OCR), complex tables (flattened to text), CJK/RTL scripts (heading and ToC heuristics are Latin-script-biased), DOCX/HTML/Markdown (not parsed at all). 2 xfails on ToC residue. No min_chunk_tokens floor (a product-catalog PDF may produce very small chunks). A production deployment serving arbitrary documents would replace this layer with unstructured.io / marker / a managed extraction API, OCR for scans, and semantic chunking on top. |
| Embeddings | Vendor lock-in to OpenAI. No fallback if the API is unavailable. Embeddings are not cached; re-ingesting the same document re-embeds all chunks (dedup prevents this, but a forced re-ingest would). |
| Search | No cross-encoder re-ranker. RRF is parameter-free and robust, but a fine-tuned re-ranker would improve precision on ambiguous queries. Score normalization is linear cosine transform, not calibrated to a probability. |
| MCP | No streaming tool responses — all results are returned at once. Large search responses on densely-chunked documents could push context limits. |
| Frontend | Search UI (/search) is intentionally minimal — query + top_k + hybrid toggle + comma-separated tag/document filters + result list. No URL state sync, no debouncing, no autocomplete on tags, no pagination on the documents list. No dark mode. |
| Operations | COOKIE_SECURE is hardcoded False in dev — it must be flipped for HTTPS but is not yet env-driven. No structured alerting, no metrics endpoint. |
Short term (days)
- Replace `BackgroundTasks` with a proper task queue (Celery + Redis or `arq`). Gives retry logic, progress streaming, and crash recovery.
- Add a `min_chunk_tokens` floor (e.g. 50 tokens) to prevent micro-chunks on catalog-style documents.
- Extend the ingestion parser to DOCX — `pypdf` is the only real coupling left to PDF, and plain text is already supported.
- Fix the 2 xfails: tighten the ToC page-number regex to catch the edge cases.
- Make `COOKIE_SECURE` env-driven.
Medium term (weeks)
- Multi-user with document ownership and per-collection ACL. This is the biggest architectural gap for a real SaaS product.
- Cross-encoder re-ranker as an optional post-processing step (`bge-reranker-v2-m3` via a sidecar container, opt-in via a search parameter).
- Streaming ingestion progress over SSE — the frontend polls at 3 s; a push model would be cleaner.
- Self-hosted embedding model (`BAAI/bge-m3`) for data privacy and vendor independence.
- Evaluation harness: a small labelled QA set to measure retrieval recall@k and MRR before and after chunking or retrieval changes. Currently all chunking improvements are validated by heuristic assertions and xfail bookkeeping, not by end-to-end retrieval metrics.
Longer term
- Hybrid re-ranking with a ColBERT-style late-interaction model for significantly better precision at scale.
- A `get_document_summary` MCP tool (section headings + first-page preview) to let the model decide whether to drill into a document via `search_by_document` before paying for the full retrieval.
- Namespace/collection support: logical groupings of documents independent of tags, enabling strict multi-tenant isolation at the vector layer.