A small, fully local HTTP API that answers questions over a fixed set of reference documents using retrieval-augmented-generation style logic. There are no external LLM calls, no vector database, and no hosted services — all retrieval, ranking, and confidence math runs in-process in pure Python.
Requires Python 3.10+. From the project root, create a virtual environment and install the dependencies (a venv is recommended, and required on systems whose Python is "externally managed", e.g. Homebrew):
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt

The commands below assume the venv is activated. If you prefer not to activate it, prefix
each command with the venv interpreter instead, e.g. .venv/bin/python -m pytest -q.
# Start the HTTP API (serves on http://127.0.0.1:8000)
uvicorn app.main:app --reload
# Chat in the terminal (runs the engine in-process, no server needed)
python -m app.tui
# Run the tests
pytest -q

# Answerable
curl -s http://127.0.0.1:8000/answer \
  -H 'content-type: application/json' \
  -d '{"question": "How long do refunds take?"}'
# Weak evidence -> fallback
curl -s http://127.0.0.1:8000/answer \
  -H 'content-type: application/json' \
  -d '{"question": "Do you support SSO?"}'

Response shape:
{
  "answer": "string",
  "citations": [{ "doc_id": "string", "title": "string", "snippet": "string" }],
  "confidence": 0.0,
  "fallback": false
}

Interactive docs are available at http://127.0.0.1:8000/docs once the server is running.
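For callers who prefer Python over curl, the request and response shape above can be exercised with the standard library alone. This is a minimal client sketch, not part of the project: the helper names build_request, summarize, and ask are hypothetical, and the URL assumes the default uvicorn address shown earlier.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/answer"  # assumes the default local server address


def build_request(question: str, url: str = API_URL) -> urllib.request.Request:
    """Build the POST /answer request without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def summarize(payload: dict) -> str:
    """Render a response dict (the shape documented above) as one line of text."""
    if payload["fallback"]:
        return f"[fallback] {payload['answer']}"
    sources = ", ".join(c["doc_id"] for c in payload["citations"])
    return f"{payload['answer']} (confidence {payload['confidence']:.2f}; sources: {sources})"


def ask(question: str) -> dict:
    """Send the question to a running server and decode the JSON reply."""
    with urllib.request.urlopen(build_request(question), timeout=10) as resp:
        return json.load(resp)
```

With the server running, print(summarize(ask("How long do refunds take?"))) prints the answer together with its confidence and the cited document ids.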
A small Rich-based terminal chat runs the engine in-process — no server needed:
python -m app.tui

Type a question to see the grounded answer, a colored confidence bar, and citations.
Commands: /help, /quit (Ctrl+C / Ctrl+D also exit cleanly).
- Sentence-level chunking. Each document is split into sentences. Sentences are the unit of retrieval, which yields precise citations (the exact supporting sentence) and lets answers be assembled extractively.
- Hand-written BM25. Retrieval ranks sentences with a from-scratch BM25 (k1=1.5, b=0.75). BM25 handles term frequency saturation and length normalization well on short text, and writing it by hand keeps the relevance math visible instead of hidden behind a library. Tokenization lowercases, drops a small stop list, and applies conservative singularization so query and document word forms align (refunds->refund, members->member). Each document's title is folded into every one of its chunks, since the title is real evidence that applies to all its sentences. Without this, "What is the API rate limit?" would fall back, because "rate" appears only in the "API Rate Limits" title.
- Extractive answers (no hallucination). The answer is the top matching sentence plus any close runner-ups (capped at 3), returned verbatim. The same sentences are the citation snippets, so every word of the answer is provably present in a source document.
- Confidence = IDF-weighted query-term coverage. Raw BM25 scores are unbounded and not comparable across questions, so they make a poor confidence signal. Instead, each distinct query content term is weighted by its IDF, and confidence is the share of that weighted mass that the retrieved sentences actually cover. Rare, informative terms dominate; common ones barely move the needle. A query term that does not appear in the corpus at all is assigned the maximum IDF (treated as df = 0), so an unknown key term drags confidence down hard. The score is bounded in [0, 1].
- Fallback logic. The service returns the fixed message "I could not find enough evidence in the provided documents." with empty citations when any of the following holds: the question has no content terms after stopword removal, nothing scores above zero, or confidence is below the threshold (0.4). Worked example, "Do you support SSO?": "support" is common and does appear in the corpus (low IDF), but "sso" is unknown (maximum IDF) and matches nothing, so confidence lands near 0.35 and the service falls back rather than answering from the weak "support" match.
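The chunking, tokenization, and BM25 ranking described in the list above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the code in app/retrieval.py: the document fields (id, title, text), the stop list, the singularization rule, the smoothed IDF variant, and the two sample documents are all invented here. Only k1=1.5 and b=0.75 are taken from the text.

```python
import math
import re
from collections import Counter

K1, B = 1.5, 0.75  # values stated above
# Illustrative stop list; the project's real list is not shown in this README.
STOP_WORDS = {"a", "an", "the", "is", "are", "do", "does", "how", "what", "you", "to", "of", "in"}


def singularize(token: str) -> str:
    """Crude plural stripping (refunds -> refund); an assumed rule, not the project's."""
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us", "is")):
        return token[:-1]
    return token


def tokenize(text: str) -> list[str]:
    """Lowercase, drop stop words, singularize."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [singularize(w) for w in words if w not in STOP_WORDS]


def chunk(doc: dict) -> list[dict]:
    """Split on . ! ? and fold the title tokens into every sentence chunk."""
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]?", doc["text"]) if s.strip()]
    title_tokens = tokenize(doc["title"])
    return [
        {"doc_id": doc["id"], "sentence": s, "tokens": title_tokens + tokenize(s)}
        for s in sentences
    ]


def bm25_rank(query: str, chunks: list[dict]) -> list[tuple[float, dict]]:
    """Score every chunk against the query and return (score, chunk) pairs, best first."""
    if not chunks:
        return []
    n = len(chunks)
    avg_len = sum(len(c["tokens"]) for c in chunks) / n
    df = Counter(t for c in chunks for t in set(c["tokens"]))
    query_terms = set(tokenize(query))
    scored = []
    for c in chunks:
        tf = Counter(c["tokens"])
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + K1 * (1 - B + B * len(c["tokens"]) / avg_len)
            score += idf * tf[term] * (K1 + 1) / norm
        scored.append((score, c))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Invented two-document corpus, for illustration only.
docs = [
    {"id": "rate-limits", "title": "API Rate Limits",
     "text": "Each key may send 100 requests per minute. "
             "Requests beyond that are rejected with an error until the next window opens."},
    {"id": "refunds", "title": "Refund Policy",
     "text": "Refunds are issued to the original payment method."},
]
chunks = [c for d in docs for c in chunk(d)]
best_score, best = bm25_rank("What is the API rate limit?", chunks)[0]
print(best["doc_id"], best["sentence"])
```

In this toy corpus neither rate-limit sentence contains "api", "rate", or "limit" on its own, so the match comes entirely from the folded-in title tokens, which is the situation the title-folding rule is meant to cover.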
All tunables (K1, B, RUNNER_UP_RATIO, CONFIDENCE_THRESHOLD, MAX_CITATIONS) are
named constants at the top of app/retrieval.py.
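How the tunables interact can be sketched as follows: pick the top sentence and close runner-ups, compute IDF-weighted coverage, and fall back when the evidence is weak. This is a hedged sketch rather than the code in app/qa.py. The threshold (0.4), the cap (3), and the fallback message come from the text above; the RUNNER_UP_RATIO value, the IDF formula, the chunk fields, the function names, and the sample counts are assumptions made for illustration.

```python
import math

CONFIDENCE_THRESHOLD = 0.4  # stated above
MAX_CITATIONS = 3           # "capped at 3", read here as a cap on the total
RUNNER_UP_RATIO = 0.8       # assumed value; the real constant is in app/retrieval.py
FALLBACK_MESSAGE = "I could not find enough evidence in the provided documents."


def idf(term: str, df: dict, n_chunks: int) -> float:
    """Smoothed IDF; a term missing from the corpus has df = 0 and gets the maximum weight."""
    d = df.get(term, 0)
    return math.log(1 + (n_chunks - d + 0.5) / (d + 0.5))


def confidence(query_terms: set, covered_terms: set, df: dict, n_chunks: int) -> float:
    """Share of the query's IDF mass that the retrieved sentences cover, in [0, 1]."""
    total = sum(idf(t, df, n_chunks) for t in query_terms)
    if total == 0:
        return 0.0
    hit = sum(idf(t, df, n_chunks) for t in query_terms & covered_terms)
    return hit / total


def select_chunks(ranked: list) -> list:
    """Keep the top chunk plus runner-ups scoring within RUNNER_UP_RATIO of it."""
    if not ranked:
        return []
    top_score = ranked[0][0]
    return [c for score, c in ranked[:MAX_CITATIONS]
            if score > 0 and score >= RUNNER_UP_RATIO * top_score]


def answer(query_terms: set, ranked: list, df: dict, n_chunks: int) -> dict:
    """Assemble the response dict, or the fixed fallback when evidence is weak."""
    picked = select_chunks(ranked)
    covered = set().union(*(c["tokens"] for c in picked))
    conf = confidence(query_terms, covered, df, n_chunks)
    if not query_terms or not picked or conf < CONFIDENCE_THRESHOLD:
        return {"answer": FALLBACK_MESSAGE, "citations": [],
                "confidence": round(conf, 2), "fallback": True}
    return {
        "answer": " ".join(c["sentence"] for c in picked),
        "citations": [{"doc_id": c["doc_id"], "title": c["title"], "snippet": c["sentence"]}
                      for c in picked],
        "confidence": round(conf, 2),
        "fallback": False,
    }


# Invented statistics: 40 chunks, "support" appears in 8 of them, "sso" in none.
df = {"support": 8}
support_chunk = {"doc_id": "support-hours", "title": "Support Hours",
                 "sentence": "Support is available on weekdays.",
                 "tokens": ["support", "hour", "support", "available", "weekday"]}
result = answer({"support", "sso"}, [(0.9, support_chunk)], df, 40)
print(result["fallback"], result["confidence"])
```

With these made-up counts the unknown term "sso" carries most of the IDF mass, so the coverage falls below the threshold and the fallback fires. The exact value differs from the 0.35 quoted above because that figure depends on the real corpus statistics.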
app/
  main.py          FastAPI app, startup index build, POST /answer
  models.py        Pydantic request/response schemas (the API contract)
  retrieval.py     tokenizer, sentence chunking, BM25 index, IDF
  qa.py            orchestration: retrieve -> score -> answer / cite / fallback
data/
  documents.json   the reference corpus
tests/
  test_answer.py   end-to-end tests (answerable + fallback + contract)
- Lexical only. Matching is on (singularized) surface tokens. Synonyms and paraphrases are not understood: "SSO" will not match "single sign-on", and "kept" will not match "retained". This is the intended trade-off for a no-embeddings, no-LLM design.
- Extractive, not synthesized. Answers are stitched from existing sentences and cannot combine facts across documents into a new phrasing, nor reformat to directly mirror the question.
- Small fixed corpus. The index is rebuilt from a local JSON file at startup; there is no incremental update path.
- Naive sentence splitter. Splitting on .!? is sufficient for this corpus but would mishandle abbreviations or decimal numbers in richer text.
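The splitter limitation is easy to reproduce. The sketch below assumes the splitter behaves like a plain split on the three punctuation characters; the function name naive_split and both sample sentences are invented for illustration, and the project's actual splitter may differ in detail.

```python
import re


def naive_split(text: str) -> list[str]:
    """Cut after every '.', '!' or '?', with no awareness of abbreviations or numbers."""
    return [s.strip() for s in re.findall(r"[^.!?]+[.!?]?", text) if s.strip()]


# Clean input: one chunk per sentence.
print(naive_split("Refunds take 5 days. Contact support!"))
# → ['Refunds take 5 days.', 'Contact support!']

# A decimal and an abbreviation are both cut apart.
print(naive_split("The fee is 2.5 percent, e.g. on card payments."))
# → ['The fee is 2.', '5 percent, e.', 'g.', 'on card payments.']
```

Fragments such as "g." would become their own retrieval units and citation snippets, which is why a richer corpus would need a smarter splitter.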
- Semantic retrieval. Add embeddings with an approximate-nearest-neighbor index to recover synonym/paraphrase recall, while keeping BM25 as a hybrid lexical signal.
- Optional constrained synthesis. Layer an LLM that rewrites the answer only from the retrieved snippets, with citation enforcement and a grounding check, so fluency improves without reintroducing hallucination.
- Evaluation harness. A labelled query set with retrieval and answer-quality metrics (precision@k, answerable/fallback accuracy) to tune thresholds with data instead of by hand.
- Operational hardening. Authentication, rate limiting, response caching, structured logging and metrics, and request tracing for observability.