EHrekov/askdocs

AskDocs

A bilingual (Ukrainian/English) Retrieval-Augmented Generation system for document analysis: ask questions about your uploaded files and get grounded, cited answers. It features hybrid search, prompt-injection and PII guardrails, token-budgeted context, and a streaming single-page UI.

This is a personal portfolio project. It is intentionally compact but built with production-minded engineering: retry/back-off, caching, rate limiting, input/output guardrails, and graceful degradation.


What it does

Upload documents, ask questions in natural language, and get answers grounded only in your own corpus, with mandatory source citations. The assistant refuses to answer when the context does not contain the answer (hallucination mitigation). It is domain-agnostic — contracts, reports, manuals, research papers, notes, spreadsheets, anything with text.

Supported formats: PDF, TXT, MD, DOCX, XLSX, PPTX, CSV.


Architecture

flowchart TD
    U["Browser — vanilla-JS SPA"] -->|SSE /api/query/stream| MW["FastAPI<br/>Request-ID · CORS · Rate-limit"]

    subgraph SEC["Input guardrails"]
        G1[check_length]
        G2["check_injection<br/>EN + UK patterns"]
        G3["mask_pii (optional)"]
    end

    subgraph RAG["RAG pipeline"]
        E1["embed_query<br/>OpenAI + LRU cache + retry"]
        H["Hybrid search<br/>0.6·vector + 0.4·BM25"]
        R["Cross-encoder rerank<br/>(optional)"]
        O["Reorder — Lost-in-the-Middle"]
        B["Token-budget guard"]
    end

    MW --> SEC --> RAG
    E1 --> H --> R --> O --> B --> LLM["OpenAI chat<br/>token streaming"]
    H <-->|vectors + metadata| VS[("ChromaDB")]
    LLM --> OG["check_output_safety"] --> UM["unmask_pii"] --> SESS["Session history<br/>in-memory · TTL 1h"]
    UM -->|SSE tokens| U

    subgraph IDX["Indexing"]
        L["Loader<br/>PDF/DOCX/XLSX/PPTX/TXT/MD/CSV"] --> C["Sentence-aware<br/>chunker + overlap"] --> E2[embed_texts] --> VS
    end
    U -->|POST /api/documents/upload| L

Request flow

POST /api/query/stream
  → rate-limit (per-IP, sliding window)
  → input guardrails (length, injection EN/UK, optional PII masking)
  → session get/create  → embed(question)  [LRU cache + exponential back-off]
  → ChromaDB vector search → BM25 re-score → hybrid top-K
  → [optional] cross-encoder rerank → reorder (Lost-in-the-Middle) → token-budget trim
  → OpenAI chat (streamed) ──SSE──> tokens to browser
  → output guardrail (prompt-leak / suspicious-URL) → PII unmask → persist to session
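
The hybrid step in the flow above can be sketched as follows. This is a minimal illustration, not the project's code: the 0.6 / 0.4 weights come from the architecture diagram, while the function names (`bm25_scores`, `min_max`, `hybrid_rank`), the whitespace tokenizer, the BM25 parameters, and the min-max normalization are all assumptions made for the example.

```python
import math
from collections import Counter

VECTOR_WEIGHT = 0.6  # weights as shown in the architecture diagram
BM25_WEIGHT = 0.4


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 computed over the candidate set only, not the whole corpus."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n if n else 0.0
    # Document frequency of each term within the candidates.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_max(values):
    """Scale scores to [0, 1] so both signals are comparable (an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query, candidates, top_k=5):
    """candidates: list of (text, cosine_similarity) pairs from the vector search."""
    if not candidates:
        return []
    texts = [text for text, _ in candidates]
    lexical = min_max(bm25_scores(query, texts))
    fused = [
        (VECTOR_WEIGHT * cos + BM25_WEIGHT * lex, text)
        for (text, cos), lex in zip(candidates, lexical)
    ]
    fused.sort(key=lambda pair: pair[0], reverse=True)
    return fused[:top_k]
```

Because BM25 runs only over the chunks returned by the vector search, its cost grows with the number of candidates rather than with the size of the corpus.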

Quick start

pip install -r requirements.txt

cp .env.example .env          # then set OPENAI_API_KEY in .env
python main.py                # or: python start.py  (runs pre-flight checks)

Open http://localhost:8000. Sample documents in data/sample_docs/ are auto-indexed on first run.

Optional cross-encoder re-ranker (heavy, pulls in torch):

pip install -r requirements-optional.txt   # then set USE_RERANKER=true in .env

Configuration (.env)

| Variable | Default | Description |
| --- | --- | --- |
| OPENAI_API_KEY | | Required. OpenAI API key |
| APP_HOST / APP_PORT | 0.0.0.0 / 8000 | Bind address |
| DEBUG | false | Enables auto-reload + debug logs |
| ALLOWED_ORIGINS | http://localhost:8000 | Comma-separated CORS origins |
| CHAT_MODEL | gpt-4o-mini | Chat model. Legacy models use temperature/max_tokens; newer ones use max_completion_tokens |
| EMBEDDING_MODEL | text-embedding-3-small | Embedding model |
| CHUNK_SIZE | 512 | Chunk size (characters) |
| CHUNK_OVERLAP | 100 | Overlap between chunks (~20%) |
| TOP_K | 5 | Chunks retrieved per query |
| MAX_CONTEXT_TOKENS | 6000 | Token budget for RAG context |
| USE_RERANKER | false | Enable cross-encoder re-ranking |
| MAX_INPUT_CHARS | 10000 | Max question length |
| RATE_LIMIT_PER_MINUTE | 30 | Requests per IP per minute |
| OPENAI_MAX_RETRIES | 3 | OpenAI client retry attempts |
| OPENAI_TIMEOUT_SECONDS | 60 | OpenAI request timeout |
| CHROMA_PATH | ./data/chroma_db | ChromaDB persistence path |
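
To make the interaction between CHUNK_SIZE and CHUNK_OVERLAP concrete, here is a rough sketch of sentence-aware chunking measured in characters. It is not the project's chunker: `split_sentences` and `chunk_text` are hypothetical names, the sentence splitter is deliberately naive (no abbreviation or section handling), and a single sentence longer than the chunk size is kept whole.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text, chunk_size=512, chunk_overlap=100):
    """Pack whole sentences into chunks of at most chunk_size characters.

    Each new chunk is seeded with trailing sentences of the previous one,
    up to chunk_overlap characters, so context carries across boundaries.
    """
    chunks, current = [], []
    for sentence in split_sentences(text):
        if current and len(" ".join(current + [sentence])) > chunk_size:
            chunks.append(" ".join(current))
            # Collect trailing sentences that fit in the overlap budget.
            tail = []
            for prev in reversed(current):
                if len(" ".join([prev] + tail)) > chunk_overlap:
                    break
                tail.insert(0, prev)
            # Drop overlap sentences if they would push the chunk over size.
            while tail and len(" ".join(tail + [sentence])) > chunk_size:
                tail.pop(0)
            current = tail
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With the defaults from the table, each chunk holds up to 512 characters and repeats up to 100 characters of whole sentences from its predecessor.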

API

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/health | Health check |
| GET | /api/stats | DB + session statistics |
| POST | /api/query | Ask a question (non-streaming) |
| POST | /api/query/stream | Ask a question (SSE streaming) |
| DELETE | /api/sessions/{id} | Clear a conversation session |
| POST | /api/documents/upload | Upload & index a document |
| GET | /api/documents | List indexed documents |
| DELETE | /api/documents/{name} | Delete a document and its chunks |

Interactive API docs: /api/docs (Swagger), /api/redoc.

curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is this document about?", "language": "en", "mask_pii": false}'
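
The same request can be sent from Python using only the standard library. This is a small sketch: the helper names `build_query_request` and `ask` are made up for the example, and because the response schema is not documented here, the body is returned as a plain dict without assuming any field names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from the quick start


def build_query_request(question, language="en", mask_pii=False, base_url=BASE_URL):
    """Build the same POST /api/query request as the curl command."""
    payload = {"question": question, "language": language, "mask_pii": mask_pii}
    return urllib.request.Request(
        url=f"{base_url}/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, **kwargs):
    """Send the request and return the decoded JSON body as-is."""
    request = build_query_request(question, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `ask("What is this document about?")` performs the call; `/api/docs` shows the actual response fields.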

RAG features

  • Hybrid search: 0.6 · cosine + 0.4 · BM25, with BM25 rescored over the vector candidate set (keeps it O(candidates), not O(corpus)).
  • Cross-encoder re-ranking (optional) — offloaded to a thread pool so the async event loop is never blocked; degrades gracefully if unavailable.
  • Lost-in-the-Middle mitigation — top-ranked chunks anchored at the start and end of the context (Liu et al., 2023).
  • Token-budget guard — context trimmed to MAX_CONTEXT_TOKENS (tiktoken).
  • Sentence-aware chunking — respects paragraph/section boundaries and common abbreviations; configurable overlap.
  • Embedding cache — SHA-256-keyed, size-capped LRU.
  • Resilience — exponential back-off on transient OpenAI errors.
  • Re-indexing — re-uploading a document replaces its old chunks.
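
Two of the items above, the Lost-in-the-Middle reorder and the token-budget guard, are easy to illustrate in isolation. The sketch below is an assumption-laden illustration rather than the project's implementation: the function names are hypothetical, and a whitespace word count stands in for tiktoken so the example runs without extra dependencies.

```python
def reorder_for_long_context(ranked):
    """Place the best chunks at the edges and the weakest in the middle.

    `ranked` is ordered best-first. Even positions fill the front, odd
    positions fill the back (reversed), so rank 1 opens the context and
    rank 2 closes it.
    """
    front, back = [], []
    for index, chunk in enumerate(ranked):
        (front if index % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def count_words(text):
    """Stand-in token counter; the project uses tiktoken instead."""
    return len(text.split())


def trim_to_budget(ranked, max_tokens, count_tokens=count_words):
    """Keep chunks in rank order until the next one would exceed the budget."""
    kept, used = [], 0
    for chunk in ranked:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

For five chunks ranked 1 to 5, the reorder yields the sequence 1, 3, 5, 4, 2. How the two steps are composed in the real pipeline is not shown here; passing a tiktoken-based counter as `count_tokens` would bring the trim closer to what the README describes.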

Security features

  • Prompt-injection detection — 40+ regex patterns, English and Ukrainian, risk-scored (BLOCK ≥ 0.9, WARN ≥ 0.6), case-insensitive.
  • PII masking / unmasking — email, UA phone, tax ID, IBAN, card; masked before the LLM, restored in the response.
  • Output guardrail — system-prompt-leak and suspicious-URL detection (advisory; logged).
  • File magic-byte validation — rejects extension-spoofed uploads.
  • Filename sanitization: Path().name + allow-list regex; prevents path traversal and null-byte/RTL tricks.
  • Decompression-bomb guard — OOXML (DOCX/XLSX/PPTX) archives are inspected and rejected if the inflated size / compression ratio is implausible.
  • Streamed-size enforcement — upload capped without loading it all in RAM.
  • Per-IP rate limiting — sliding window with periodic purge.
  • No tracebacks to clients — errors return generic localized messages; attacker payloads are never logged verbatim.
  • Dependency auditing: pip-audit is part of the dev tooling; the pinned runtime tree is CVE-clean (pip-audit -r requirements.txt).
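
As an illustration of the PII masking / unmasking round trip listed above, here is a minimal sketch. The two regexes (email and a `+380` phone format) and the `[LABEL_n]` placeholder scheme are assumptions for the example; the project's actual patterns, which also cover tax ID, IBAN and card numbers, are not reproduced here.

```python
import re

# Illustrative patterns only; real-world PII detection needs more care.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+380\d{9}"),
}


def mask_pii(text):
    """Replace each match with a placeholder and remember the original value."""
    mapping = {}
    for label, pattern in PII_PATTERNS.items():
        def replace(match, label=label):
            token = f"[{label}_{len(mapping) + 1}]"
            mapping[token] = match.group(0)
            return token
        text = pattern.sub(replace, text)
    return text, mapping


def unmask_pii(text, mapping):
    """Restore the original values in the model's response."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

The mapping stays on the server for the duration of the request, so the LLM provider only ever sees the placeholders from the question.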

Residual risks (by design, documented — not silently ignored)

  • No authentication / multi-tenancy. Anyone with network access can query, upload and delete, and drive OpenAI cost. Intended for local / single-user portfolio use; production would add auth + quotas.
  • Indirect prompt injection. Injection scoring runs on the question, not on retrieved document text. A hostile uploaded document could try to steer the model; the system prompt mitigates but does not eliminate this.
  • PII in document context still reaches the LLM provider (only the question is masked). This is expected for a RAG system, but it remains a privacy consideration.
  • Rate limiting keys on request.client.host. Behind a reverse proxy, all clients appear to share one IP unless the proxy is configured otherwise; X-Forwarded-For is intentionally not trusted because it is spoofable. Deploy accordingly.
  • Regex-based prompt-injection detection is defense-in-depth, not a guarantee — it raises the bar, it is not a security boundary.
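
The rate-limiting caveat above is easier to see with a small model of a per-host sliding window. This is a sketch under assumptions: the class name, the injectable clock, and the `purge` method are illustrative and not taken from the project; only the keying on the client host and the 30-per-minute default come from this README.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per host within the last `window_seconds`."""

    def __init__(self, limit=30, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self.hits = defaultdict(deque)

    def _expire(self, bucket, now):
        # Drop timestamps that have slid out of the window.
        while bucket and now - bucket[0] >= self.window:
            bucket.popleft()

    def allow(self, client_host):
        now = self.clock()
        bucket = self.hits[client_host]
        self._expire(bucket, now)
        if len(bucket) >= self.limit:
            return False
        bucket.append(now)
        return True

    def purge(self):
        """Periodic cleanup: forget hosts with no recent requests."""
        now = self.clock()
        for host in list(self.hits):
            self._expire(self.hits[host], now)
            if not self.hits[host]:
                del self.hits[host]
```

Since the key is the connecting host, every client behind the same proxy address draws from one shared bucket, which is the deployment concern described in the list.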

Tests

pip install -r requirements-dev.txt
pytest -q

Covers chunking, PII mask/unmask round-trip, injection detection (EN/UK), length guard, and file magic-byte validation.

Audit dependencies for known CVEs:

pip-audit -r requirements.txt

Project structure

askdocs/
├── main.py                 FastAPI app entry point + lifespan
├── start.py                Pre-flight checks + launcher
├── config.py               pydantic-settings configuration
├── backend/
│   ├── rag/
│   │   ├── loader.py       Multi-format document loaders
│   │   ├── chunker.py      Sentence-aware chunking
│   │   ├── embeddings.py   OpenAI embeddings + LRU cache + retry
│   │   ├── vector_store.py ChromaDB + hybrid (vector + BM25) search
│   │   ├── reranker.py     Optional cross-encoder re-ranking
│   │   ├── session.py      In-memory session manager (TTL)
│   │   └── pipeline.py     RAG orchestration (sync + streaming)
│   ├── security/guards.py  Injection / PII / output / file guards
│   └── api/routes.py       FastAPI routes
├── frontend/               Vanilla-JS SPA (SSE streaming, i18n)
├── tests/                  pytest suite
└── data/sample_docs/       Auto-indexed samples

Limitations & future work

These are conscious trade-offs for a single-process portfolio project:

  • Sessions & rate limiting are in-memory — not shared across workers and reset on restart. Production would use Redis.
  • No authentication / multi-tenancy — anyone with network access can query.
  • No OCR — scanned / image-only PDFs have no text layer and are rejected with a clear message; only PDFs with embedded text are supported.
  • Document identity is the filename stem — two different files with the same stem would collide in the vector store.
  • Chunk char offsets are approximate (stored for debugging only).
  • Single-node ChromaDB — fine for thousands of chunks, not millions.
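
The filename-stem limitation can be demonstrated with a few lines. The helpers `document_id` and `find_collisions` are hypothetical and exist only to show the effect; the project's own identity logic may differ in detail.

```python
from pathlib import Path


def document_id(filename):
    """Hypothetical helper: derive the document identity from the filename stem."""
    return Path(filename).stem


def find_collisions(filenames):
    """Return (first, later) pairs of filenames that map to the same identity."""
    seen, collisions = {}, []
    for name in filenames:
        doc_id = document_id(name)
        if doc_id in seen:
            collisions.append((seen[doc_id], name))
        else:
            seen[doc_id] = name
    return collisions
```

Here `report.pdf` and `report.docx` both resolve to `report`, so under this scheme the second upload would be treated as the same document as the first.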

Possible next steps: Redis-backed sessions/rate-limiting, auth, evaluation harness (retrieval precision / answer faithfulness), Dockerfile + CI.


License

MIT — see LICENSE.
