What if, instead of chunking a document and searching by similarity, you just showed the LLM the table of contents and let it pick which sections to read?
That's the core idea behind this project (v2), and it scores 82% on FinanceBench (91% in 10-K category) - a jump from 64% with traditional chunk-based RAG (v1). No vector database, no embeddings, just SQLite and an LLM that navigates document structure the way a human would.
This repo contains both approaches, evaluated on FinanceBench (150 questions across 43 companies' SEC filings).
v1 - Traditional RAG (64.4%). We started with the standard playbook: chunk documents by section, embed with BGE-large, index into OpenSearch with hybrid BM25+KNN search, rerank with a cross-encoder. It works reasonably well for straightforward metric lookups ("what was revenue in FY2023?") but falls apart on questions that need context from specific document sections - the kind where you'd flip to a particular part of the filing if you were reading it yourself.
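The hybrid step boils down to score fusion: normalize the BM25 and KNN scores onto a common scale and take a weighted sum. A minimal sketch of that technique (the weights and min-max normalization here are illustrative, not v1's exact OpenSearch configuration):

```python
def min_max(scores):
    """Normalize a {doc_id: score} map into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_rank(bm25_scores, knn_scores, knn_weight=0.7):
    """Fuse lexical (BM25) and vector (KNN) rankings into one list.

    A doc missing from one retriever simply contributes 0 from that side.
    """
    bm25_n, knn_n = min_max(bm25_scores), min_max(knn_scores)
    fused = {
        doc: (1 - knn_weight) * bm25_n.get(doc, 0.0)
        + knn_weight * knn_n.get(doc, 0.0)
        for doc in set(bm25_n) | set(knn_n)
    }
    return sorted(fused, key=fused.get, reverse=True)
```

The cross-encoder reranker then reorders the top of this fused list using the full query-chunk text pair.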
Looking for better approaches. We came across PageIndex, which claims to achieve 98.7% on FinanceBench using a vectorless tree-indexing approach. Impressive, but it works by making recursive LLM calls over the full document structure at query time.
Their API costs $0.01/page for tree generation plus token-based charges for querying, and for SEC filings that can run 200+ pages, that's $2+ per document before a single question is asked. More importantly, we wanted to understand why structure-based retrieval works better, not just use a hosted API.
v2 - Tree-based section routing (82.0% overall, 91% for 10-K). We tried a much simpler version of the same idea. Instead of recursive LLM traversal, we just extract the heading hierarchy from each document, clean up opaque headings ("Note 7" → "Note 7 - Goodwill and Intangible Assets") with a single LLM call per PDF, and at query time show the LLM the heading tree so it can pick which sections to read.
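To make "opaque heading" concrete: the repo identifies these with an LLM call, but a heuristic version of the same idea might just flag headings that are bare numbered labels (the function and pattern below are illustrative, not the actual detection logic):

```python
import re

# Headings like "Note 7" or "Item 1A" say nothing about their topic on
# their own; the augmentation step rewrites them, e.g.
#   "Note 7" -> "Note 7 - Goodwill and Intangible Assets"
OPAQUE = re.compile(r"^(note|item|part|section)\s+[\dIVX]+[A-Za-z]?\.?$", re.I)

def is_opaque(heading: str) -> bool:
    """True if the heading alone doesn't reveal what the section covers."""
    return bool(OPAQUE.match(heading.strip()))
```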
Ingestion needs just 2 LLM calls per PDF (not per page) - one to identify opaque headings, one to augment them. At query time, it averages around 3 LLM calls (filter extraction, section routing, answer generation), so the total cost per question is a fraction of what PageIndex uses. It's not as accurate as PageIndex's 98.7%, but this is a proof of concept - and the fact that such a simple system handles 82% of FinanceBench correctly suggests there's a lot of room to push this further, whether by improving the routing itself or combining it with a vector-based fallback for the tougher queries it can't handle on its own.
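For the routing step, the heading tree has to be serialized into the prompt so the LLM can name the sections it wants. A minimal rendering might look like this (the node shape and section IDs are assumptions, not the repo's actual format):

```python
def render_tree(nodes, depth=0):
    """Render a heading hierarchy as an indented outline with section IDs,
    ready to paste into a routing prompt so the LLM can pick sections."""
    lines = []
    for node in nodes:
        lines.append(f"{'  ' * depth}[{node['id']}] {node['heading']}")
        lines.extend(render_tree(node.get("children", []), depth + 1))
    return lines

# Tiny example tree with augmented headings:
tree = [
    {"id": "s1", "heading": "Item 8 - Financial Statements", "children": [
        {"id": "s1.1", "heading": "Note 7 - Goodwill and Intangible Assets"},
    ]},
]
outline = "\n".join(render_tree(tree))
```

The LLM answers with section IDs, and only those sections' full text is loaded for answer generation.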
| Approach | Accuracy | Infrastructure |
|---|---|---|
| v1 - Chunk + Hybrid Search | 64.4% | OpenSearch, Embedding |
| v2 - Tree-Based Routing | 82.0% | SQLite only, no Embedding or Vector DB |
The key insight is that SEC filings (and many professional documents) have well-structured, self-explanatory section headings. When a human analyst looks for "goodwill impairment", they don't read the whole filing - they check the table of contents, go to the notes section, and find the relevant note. v2 does the same thing.
This approach works well when:
- Document headings are descriptive and well-organized
- The content you need is localized in specific sections
- Questions map naturally to document structure (as financial questions do)
It would struggle with documents that have poor structure, ambiguous headings, or where answers are scattered across many unrelated sections.
This is a proof of concept. 82% on FinanceBench with such a simple system is encouraging - it shows that structure-aware retrieval handles most scenarios well. The 18% it gets wrong are mostly edge cases that a more sophisticated system could address:
- Hybrid v1+v2. Use v2's structural routing as the primary path, fall back to v1's embedding search when the router is uncertain or the question doesn't map cleanly to a section.
- Better heading augmentation. Currently a single LLM call per document. Richer heading descriptions (using section content previews) would help the router pick better.
- Cross-document reasoning. Neither approach handles questions that require comparing across multiple filings well.
- Section-level embeddings. Embed entire sections alongside the tree - useful when headings alone aren't descriptive enough for routing.
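The hybrid v1+v2 idea above could be sketched as a simple dispatcher: route structurally first, and fall back to embedding search when the router reports low confidence. Every name and the threshold below are hypothetical; neither path is implemented this way in the repo:

```python
def answer_route(question, route_fn, vector_fn, min_confidence=0.6):
    """Try structural routing first; fall back to vector search when the
    router is unsure which sections to read.

    route_fn(question)  -> (sections, confidence in [0, 1])
    vector_fn(question) -> retrieved chunks
    """
    sections, confidence = route_fn(question)
    if sections and confidence >= min_confidence:
        return {"path": "tree", "context": sections}
    return {"path": "vector", "context": vector_fn(question)}
```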
| Question Type | v1 Accuracy |
|---|---|
| metrics-generated | 82.0% (41/50) |
| domain-relevant | 66.0% (33/50) |
| novel-generated | 44.9% (22/49) |
| Question Type | v2 Accuracy |
|---|---|
| metrics-generated | 90.0% (45/50) |
| domain-relevant | 82.0% (41/50) |
| novel-generated | 73.5% (36/49) |
The biggest gains are on domain-relevant (+16 points) and novel-generated (+28.6 points) questions - exactly where embedding similarity struggles but structural navigation shines.
```
FinRag/
  common/                  # Shared utilities
    table_formatter.py     # Structured table formatting (preserves column semantics)
    eval_utils.py          # LLM calls, judge, results I/O
  v1/                      # Chunk-based RAG (see v1/README.md)
    ingestion/             # Chunker, embedder, indexer
    data/                  # OpenSearch index interface
    retrieval/             # Query transform, hybrid search, reranker
    pipeline.py            # Ingestion entry point
    eval.py                # Evaluation
    colab_chunk.ipynb
  v2/                      # Tree-based routing (see v2/README.md)
    build_tree.py          # Heading hierarchy extraction
    augment_headings.py    # LLM clarifies opaque headings
    store.py               # SQLite storage
    retrieval.py           # Section routing + answer generation
    eval.py                # Evaluation
    colab_tree.ipynb
  docker-compose.yaml      # OpenSearch (v1 only)
  setup_hybrid_search.sh   # Hybrid search pipeline setup
  pyproject.toml
```
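For orientation, the SQLite layer in store.py could look roughly like this; the table and column names (and the sample row) are assumptions for illustration, not the actual schema:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS sections (
    doc_id     TEXT NOT NULL,  -- which filing the section belongs to
    section_id TEXT NOT NULL,  -- position in the heading tree
    heading    TEXT NOT NULL,  -- raw heading extracted from the PDF
    augmented  TEXT,           -- LLM-clarified heading, if the raw one was opaque
    body       TEXT,           -- full section text, loaded after routing
    PRIMARY KEY (doc_id, section_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO sections VALUES (?, ?, ?, ?, ?)",
    ("example_10K", "1.8.7", "Note 7",
     "Note 7 - Goodwill and Intangible Assets", "..."),
)
```

A plain relational table is enough here precisely because there are no embeddings: routing happens over headings in the prompt, and SQLite only has to hand back section text by ID.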
- Python 3.11+
- OpenAI API key
- Pre-parsed Docling JSON files in `parsed_docs/` (from financebench-parsed, or parse your own with Docling)
v1 (needs OpenSearch + GPU):

```
docker compose up -d
bash setup_hybrid_search.sh
python -m v1.pipeline --input-dir parsed_docs
python -m v1.index_from_jsonl
python -m v1.eval
```

v2 (just needs an OpenAI API key):

```
python -m v2.build_tree --input-dir parsed_docs
python -m v2.augment_headings
python -m v2.eval
```

See v1/README.md and v2/README.md for details.