---
title: paperQA
emoji: 📄
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.13.0
app_file: app.py
pinned: false
license: mit
short_description: Multi-paper QA with page-level citations.
---
Ask questions of one or more scientific PDFs. Get answers grounded in the source, with the exact file and page they came from.
👉 Try the live demo — upload one paper or several, ask a question, see the cited pages side-by-side.
- [What It Does](#what-it-does)
- [Why It's Different From a Generic RAG Demo](#why-its-different-from-a-generic-rag-demo)
- [Headline Numbers (Attention Is All You Need, 6 questions)](#headline-numbers-attention-is-all-you-need-6-questions)
- [How It Works](#how-it-works)
- [Architectural Decisions](#architectural-decisions)
- [Running Locally](#running-locally)
- [Engineering Rules](#engineering-rules)
- [Security](#security)
- [License](#license)
## What It Does

Upload one or more papers. Type a question. Get back an answer that:
- cites the exact file and page it came from (`[paper.pdf, page 8]`),
- shows the source passage beside the answer so you can verify it,
- pools all uploaded PDFs into one normalized retrieval space so cross-document scores are directly comparable (see ADR-0008 and the sketch at the end of this section),
- never invents page numbers — citations are filtered against retrieved passages first, then through a numerical-anchor grounding check (see ADR-0007).
The system is tuned for arXiv-style scientific PDFs. The same pipeline works on any PDF, but the prompts and the included gold set are calibrated for academic papers.
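Under the hood, the pooled retrieval space of ADR-0008 is just one L2-normalized embedding matrix covering every page of every uploaded PDF, so a single cosine score ranks pages across documents. A minimal sketch of that idea (the function and variable names are illustrative, not the repo's actual API):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


def build_pooled_index(page_texts: list[str]) -> np.ndarray:
    """One row per page, across *all* uploaded PDFs, L2-normalized."""
    return np.asarray(model.encode(page_texts, normalize_embeddings=True))


def top_k_pages(question: str, index: np.ndarray, k: int = 4) -> list[int]:
    """Cosine similarity collapses to a dot product because rows are unit-length."""
    query = model.encode([question], normalize_embeddings=True)[0]
    scores = index @ query
    return np.argsort(scores)[::-1][:k].tolist()
```

Because every page lives in the same normalized matrix, a given score means the same thing whether the page came from the first PDF or the fifth.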
Most "chat with your PDF" demos stop at "model returns text." This project's portfolio claim is the engineering around the model:
- Page-level citation grain is a design decision, not an afterthought (see ADR-0002). Every passage is a page; faithful citations follow by construction.
- Pluggable backends. `Answerer` and `Retriever` are protocols, not classes — the offline `StubAnswerer` (CI), `HFInferenceAnswerer` (live demo), and the visual `ColPaliRetriever` (CPU-only path implemented; GPU baseline parked) all satisfy the same contract (see the sketch after this list).
- Multi-document mode. The pooled-index design (ADR-0008) means the upload set is one cosine-comparable space, not N independently-normalized indexes. Single-doc behaviour is identical to before.
- Measurement, not vibes. `docs/baselines/` ships real numbers — every architecture change ships next to a recall@k / faithfulness delta. See `docs/adr/0005-evaluation-plan.md`.
- Mature CI. Lint + format + strict mypy + 60 unit tests on every push, across Python 3.11 and 3.12. Integration tests are gated behind `pytest -m integration` so the network never enters CI.
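A minimal sketch of what those contracts might look like; the real protocol signatures and `Passage` fields live in the repo, so treat the names below as illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Passage:
    """One PDF page, the citation grain from ADR-0002 (field names assumed)."""
    source: str  # e.g. "paper.pdf"
    page: int    # 1-based page number
    text: str    # extracted page text


class Retriever(Protocol):
    def retrieve(self, question: str, top_k: int = 4) -> list[Passage]:
        """Return the top-k pages most relevant to the question."""
        ...


class Answerer(Protocol):
    def answer(self, question: str, passages: list[Passage]) -> str:
        """Return an answer containing [<file>, page N] citation markers."""
        ...
```

Because these are `typing.Protocol`s, implementations need no shared base class; mypy checks the shape structurally, which is what lets the stub, the HF client, and ColPali coexist behind one interface.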
## Headline Numbers (Attention Is All You Need, 6 questions)

Measured on the included `tests/eval/gold.json` gold set, with `Qwen2.5-7B-Instruct` as the answerer:
| Metric | Value | What it means |
|---|---|---|
| mean recall@3 | 0.833 | The retriever puts the right page in the top 3 on 5/6 questions |
| mean recall@1 | 0.583 | Top-1 is less reliable — that's why the default is `top_k = 4` |
| mean citation faithfulness | 1.000 | Every emitted citation lands on a gold-relevant page |
| must-cite rate | 0.833 | Cited pages match the exact gold page (5/6 questions) |
The path to those numbers is documented as a series of baselines in `docs/baselines/` — every architectural change ships with the measured delta it produced.
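For reference, recall@k here asks whether a gold-labelled page shows up in the retriever's top k, averaged over the six questions. A rough sketch of that computation (the field shapes are assumptions, not the real `gold.json` schema):

```python
def recall_at_k(retrieved: list[tuple[str, int]],
                gold: set[tuple[str, int]],
                k: int) -> float:
    """1.0 if any gold (file, page) pair appears in the top-k retrieved pages."""
    return 1.0 if any(page in gold for page in retrieved[:k]) else 0.0

# Averaging over the six gold questions gives the headline numbers,
# e.g. mean recall@3 = 5/6 ≈ 0.833.
```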
## How It Works

- Ingest — `pypdf` extracts one `Passage` per PDF page; multiple uploaded PDFs are pooled into one passage list.
- Index — pages are embedded with `all-MiniLM-L6-v2`; the index is an in-memory NumPy matrix (per ADR-0003, no vector DB needed at this corpus size).
- Retrieve — the question is embedded and scored against the pooled page matrix; top-k pages are returned with their source filename and page number attached.
- Answer — the question + retrieved passages go to `Qwen2.5-7B-Instruct` via the HF Inference API. The system prompt forces the model to cite `[<file>, page N]` and refuse out-of-document questions.
- Cite — emitted `[<file>, page N]` markers are parsed back, filtered against the retrieved set (no hallucinated references), then run through a numerical-anchor grounding check: if the answer states a number, that number must appear on the cited page or the citation is dropped with a visible annotation (see the sketch below).
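The last step is the one that keeps citations honest. A simplified sketch of the filter plus the numerical-anchor check (regex and helper names are illustrative; the real logic lives in the repo):

```python
import re

CITATION = re.compile(r"\[([^,\[\]]+), page (\d+)\]")


def ground_citations(answer: str, retrieved: dict[tuple[str, int], str]) -> str:
    """Drop citations that point at non-retrieved pages or that state numbers
    which cannot be found on the cited page."""
    def check(match: re.Match[str]) -> str:
        key = (match.group(1).strip(), int(match.group(2)))
        page_text = retrieved.get(key)
        if page_text is None:                        # hallucinated reference
            return "[citation removed: page not retrieved]"
        # Numerical anchors: every number in the citing sentence must be on the page.
        sentence = answer[:match.start()].rsplit(".", 1)[-1]
        numbers = re.findall(r"\d+(?:\.\d+)?", sentence)
        if all(num in page_text for num in numbers):
            return match.group(0)
        return "[citation removed: number not found on cited page]"

    return CITATION.sub(check, answer)
```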
The visual-retrieval path (ColPali, ADR-0006) targets the one measured failure case: questions about table-heavy pages where `pypdf` text extraction loses the signal (e.g. the Table 3 ablation question in the gold set, where recall@k = 0 for every k). The implementation is in `paperqa/retrievers/colpali.py`, fully unit-tested with mocks, and gated behind a `[visual]` extra plus `PAPERQA_RETRIEVER=colpali`. The end-to-end GPU baseline is parked pending hardware access — DigitalOcean GPU droplets need a $250 pre-pay, HF GPU Spaces are paid, and Colab's GPU runtimes don't expose a stable URL for a Space deploy. The path is wired so a future GPU run is one `git push` away.
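Enabling the visual path locally should look roughly like this (the extras name and environment variable come from the repo; the exact command shape is an assumption, and the ColPali checkpoint download is large):

```bash
pip install -e ".[visual]"
PAPERQA_RETRIEVER=colpali python app.py
```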
## Architectural Decisions

The interesting calls are recorded as ADRs, not buried in commits:
- ADR-0001 — Scope, niche, and initial model choice
- ADR-0002 — Chunking strategy: page-level passages
- ADR-0003 — Embeddings and retrieval: MiniLM + in-memory cosine
- ADR-0004 — Answering model and inference backend
- ADR-0005 — Evaluation plan
- ADR-0006 — Visual retrieval: ColPali behind a Retriever abstraction
- ADR-0007 — Per-citation grounding check (post-hoc)
- ADR-0008 — Multi-document support: pooled index, file-prefixed citations
## Running Locally

```bash
git clone https://github.com/Joncik91/paperQA.git
cd paperQA
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,embed,llm,app]"
python app.py   # opens Gradio on localhost:7860
```

Set `HF_TOKEN` to use the real Inference API; without it the app falls back to the offline `StubAnswerer` so the demo never hard-errors.
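For example (`hf_xxx` is a placeholder for your own token):

```bash
export HF_TOKEN=hf_xxx   # a Hugging Face token with Inference API access
python app.py
```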
Full guide: `docs/running-locally.md`. Deploying to a Space: `docs/deploying.md`.
## Engineering Rules

This is also a portfolio piece, so the discipline is part of the product. The hard rules are codified in `CONTRIBUTING.md`:
- DRY on the second occurrence — no copy-paste tolerated.
- Code comments explain WHAT and WHY, never HOW. The code is the HOW.
- Commit messages are WHAT changed, WHY it changed, WHERE it landed. One logical change per commit.
- Documentation lands in the same commit as the code it describes. No "I'll write the docs later."
## Security

Two paths leave your machine when you run paperQA:
- The HF Inference API call. When `HF_TOKEN` is set, the question plus the retrieved passages from the uploaded PDFs are sent to Hugging Face's inference endpoint for `Qwen2.5-7B-Instruct`. That means snippets of your uploaded paper(s) leave the machine on every answered question. Without `HF_TOKEN`, the app falls back to the offline `StubAnswerer` and nothing leaves the machine.
- The model download (one-time). The first retrieval downloads the `all-MiniLM-L6-v2` weights (~90 MB) from the Hugging Face Hub. Read-only; nothing of yours is uploaded.
For proprietary or unpublished papers, run locally without `HF_TOKEN` (stub answers, no network), or swap the answerer for a self-hosted endpoint. The `Answerer` protocol is what makes that a one-line change — see ADR-0004.
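As an illustration of that swap (the endpoint URL, payload shape, and class name below are hypothetical; only the `Answerer` contract and the illustrative `Passage` fields sketched earlier are assumed):

```python
import requests


class SelfHostedAnswerer:
    """Satisfies the Answerer protocol but posts to your own endpoint,
    so passages never leave your infrastructure."""

    def __init__(self, url: str) -> None:
        self.url = url

    def answer(self, question: str, passages: list) -> str:
        # passages: Passage-like objects with .source, .page, .text (assumed fields)
        payload = {
            "question": question,
            "passages": [
                {"source": p.source, "page": p.page, "text": p.text}
                for p in passages
            ],
        }
        response = requests.post(self.url, json=payload, timeout=60)
        response.raise_for_status()
        return response.json()["answer"]
```

Plug an instance of this in wherever the app constructs its answerer and the rest of the pipeline (retrieval, citation filtering, grounding) is unchanged.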
## License

MIT © Joncik91. See `LICENSE`.
