What if, instead of chunking a document and searching by similarity, you just showed the LLM the table of contents and let it pick which sections to read?
That's the core idea behind this project (v2), and it scores 82% on FinanceBench (91% in 10-K category) - a jump from 64% with traditional chunk-based RAG (v1). No vector database, no embeddings, just SQLite and an LLM that navigates document structure the way a human would.
This repo contains both approaches, evaluated on FinanceBench (150 questions across 43 companies' SEC filings).
v1 - Traditional RAG (64.4%). We started with the standard playbook: chunk documents by section, embed with BGE-large, index into OpenSearch with hybrid BM25+KNN search, rerank with a cross-encoder. It works reasonably well for straightforward metric lookups ("what was revenue in FY2023?") but falls apart on questions that need context from specific document sections - the kind where you'd flip to a particular part of the filing if you were reading it yourself.
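The hybrid step boils down to score fusion: normalize the BM25 and KNN scores onto a common scale and take a weighted sum. A minimal sketch of that technique (the weights and min-max normalization here are illustrative, not v1's exact OpenSearch configuration):

```python
def min_max(scores):
    """Normalize a {doc_id: score} map into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_rank(bm25_scores, knn_scores, knn_weight=0.7):
    """Fuse lexical (BM25) and vector (KNN) rankings into one list.

    A doc missing from one retriever simply contributes 0 from that side.
    """
    bm25_n, knn_n = min_max(bm25_scores), min_max(knn_scores)
    fused = {
        doc: (1 - knn_weight) * bm25_n.get(doc, 0.0)
        + knn_weight * knn_n.get(doc, 0.0)
        for doc in set(bm25_n) | set(knn_n)
    }
    return sorted(fused, key=fused.get, reverse=True)
```

The cross-encoder reranker then reorders the top of this fused list using the full query-chunk text pair.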
Looking for better approaches. We came across PageIndex, which claims to achieve 98.7% on FinanceBench using a vectorless tree-indexing approach. Impressive, but it works by making recursive LLM calls over the full document structure at query time.
Their API costs $0.01/page for tree generation plus token-based charges for querying, and for SEC filings that can run 200+ pages, that's $2+ per document before a single question is asked. More importantly, we wanted to understand why structure-based retrieval works better, not just use a hosted API.
v2 - Tree-based section routing (82.0% overall, 91% for 10-K). We tried a much simpler version of the same idea. Instead of recursive LLM traversal, we just extract the heading hierarchy from each document, clean up opaque headings ("Note 7" → "Note 7 - Goodwill and Intangible Assets") with a single LLM call per PDF, and at query time show the LLM the heading tree so it can pick which sections to read.
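To make "opaque heading" concrete: the repo identifies these with an LLM call, but a heuristic version of the same idea might just flag headings that are bare numbered labels (the function and pattern below are illustrative, not the actual detection logic):

```python
import re

# Headings like "Note 7" or "Item 1A" say nothing about their topic on
# their own; the augmentation step rewrites them, e.g.
#   "Note 7" -> "Note 7 - Goodwill and Intangible Assets"
OPAQUE = re.compile(r"^(note|item|part|section)\s+[\dIVX]+[A-Za-z]?\.?$", re.I)

def is_opaque(heading: str) -> bool:
    """True if the heading alone doesn't reveal what the section covers."""
    return bool(OPAQUE.match(heading.strip()))
```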
Ingestion needs just 2 LLM calls per PDF (not per page) - one to identify opaque headings, one to augment them. At query time, it averages around 3 LLM calls (filter extraction, section routing, answer generation), so the total cost per question is a fraction of what PageIndex uses. It's not as accurate as PageIndex's 98.7%, but this is a proof of concept - and the fact that such a simple system handles 82% of FinanceBench correctly suggests there's a lot of room to push this further, whether by improving the routing itself or combining it with a vector-based fallback for the tougher queries it can't handle on its own.
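For the routing step, the heading tree has to be serialized into the prompt so the LLM can name the sections it wants. A minimal rendering might look like this (the node shape and section IDs are assumptions, not the repo's actual format):

```python
def render_tree(nodes, depth=0):
    """Render a heading hierarchy as an indented outline with section IDs,
    ready to paste into a routing prompt so the LLM can pick sections."""
    lines = []
    for node in nodes:
        lines.append(f"{'  ' * depth}[{node['id']}] {node['heading']}")
        lines.extend(render_tree(node.get("children", []), depth + 1))
    return lines

# Tiny example tree with augmented headings:
tree = [
    {"id": "s1", "heading": "Item 8 - Financial Statements", "children": [
        {"id": "s1.1", "heading": "Note 7 - Goodwill and Intangible Assets"},
    ]},
]
outline = "\n".join(render_tree(tree))
```

The LLM answers with section IDs, and only those sections' full text is loaded for answer generation.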
| Approach | Accuracy | Infrastructure |
|---|---|---|
| v1 - Chunk + Hybrid Search | 64.4% | OpenSearch, Embedding |
| v2 - Tree-Based Routing | 82.0% | SQLite only, no Embedding or Vector DB |
The key insight is that SEC filings (and many professional documents) have well-structured, self-explanatory section headings. When a human analyst looks for "goodwill impairment", they don't read the whole filing - they check the table of contents, go to the notes section, and find the relevant note. v2 does the same thing.
This approach works well when:
- Document headings are descriptive and well-organized
- The content you need is localized in specific sections
- Questions map naturally to document structure (as financial questions do)
It would struggle with documents that have poor structure, ambiguous headings, or where answers are scattered across many unrelated sections.
This is a proof of concept. 82% on FinanceBench with such a simple system is encouraging - it shows that structure-aware retrieval handles most scenarios well. The 18% it gets wrong are mostly edge cases that a more sophisticated system could address:
- Hybrid v1+v2. Use v2's structural routing as the primary path, fall back to v1's embedding search when the router is uncertain or the question doesn't map cleanly to a section.
- Better heading augmentation. Currently a single LLM call per document. Richer heading descriptions (using section content previews) would help the router pick better.
- Cross-document reasoning. Neither approach handles questions that require comparing across multiple filings well.
- Section-level embeddings. Embed entire sections alongside the tree - useful when headings alone aren't descriptive enough for routing.
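The hybrid v1+v2 idea above could be sketched as a simple dispatcher: route structurally first, and fall back to embedding search when the router reports low confidence. Every name and the threshold below are hypothetical; neither path is implemented this way in the repo:

```python
def answer_route(question, route_fn, vector_fn, min_confidence=0.6):
    """Try structural routing first; fall back to vector search when the
    router is unsure which sections to read.

    route_fn(question)  -> (sections, confidence in [0, 1])
    vector_fn(question) -> retrieved chunks
    """
    sections, confidence = route_fn(question)
    if sections and confidence >= min_confidence:
        return {"path": "tree", "context": sections}
    return {"path": "vector", "context": vector_fn(question)}
```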
| Question Type | v1 Accuracy |
|---|---|
| metrics-generated | 82.0% (41/50) |
| domain-relevant | 66.0% (33/50) |
| novel-generated | 44.9% (22/49) |
| Question Type | v2 Accuracy |
|---|---|
| metrics-generated | 90.0% (45/50) |
| domain-relevant | 82.0% (41/50) |
| novel-generated | 73.5% (36/49) |
The biggest gains are on domain-relevant (+16 points) and novel-generated (+28.6 points) questions - exactly where embedding similarity struggles but structural navigation shines.
```
FinRag/
  common/                  # Shared utilities
    table_formatter.py     # Structured table formatting (preserves column semantics)
    eval_utils.py          # LLM calls, judge, results I/O
  v1/                      # Chunk-based RAG (see v1/README.md)
    ingestion/             # Chunker, embedder, indexer
    data/                  # OpenSearch index interface
    retrieval/             # Query transform, hybrid search, reranker
    pipeline.py            # Ingestion entry point
    eval.py                # Evaluation
    colab_chunk.ipynb
  v2/                      # Tree-based routing (see v2/README.md)
    build_tree.py          # Heading hierarchy extraction
    augment_headings.py    # LLM clarifies opaque headings
    store.py               # SQLite storage
    retrieval.py           # Section routing + answer generation
    eval.py                # Evaluation
    colab_tree.ipynb
  docker-compose.yaml      # OpenSearch (v1 only)
  setup_hybrid_search.sh   # Hybrid search pipeline setup
  pyproject.toml
```
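For orientation, the SQLite layer in store.py could look roughly like this; the table and column names (and the sample row) are assumptions for illustration, not the actual schema:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS sections (
    doc_id     TEXT NOT NULL,  -- which filing the section belongs to
    section_id TEXT NOT NULL,  -- position in the heading tree
    heading    TEXT NOT NULL,  -- raw heading extracted from the PDF
    augmented  TEXT,           -- LLM-clarified heading, if the raw one was opaque
    body       TEXT,           -- full section text, loaded after routing
    PRIMARY KEY (doc_id, section_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO sections VALUES (?, ?, ?, ?, ?)",
    ("example_10K", "1.8.7", "Note 7",
     "Note 7 - Goodwill and Intangible Assets", "..."),
)
```

A plain relational table is enough here precisely because there are no embeddings: routing happens over headings in the prompt, and SQLite only has to hand back section text by ID.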
- Python 3.11+
- OpenAI API key
- Pre-parsed Docling JSON files in `parsed_docs/` (from financebench-parsed, or parse your own with Docling)
v1 (needs OpenSearch + GPU):

```
docker compose up -d
bash setup_hybrid_search.sh
python -m v1.pipeline --input-dir parsed_docs
python -m v1.index_from_jsonl
python -m v1.eval
```

v2 (just needs an OpenAI API key):

```
python -m v2.build_tree --input-dir parsed_docs
python -m v2.augment_headings
python -m v2.eval
```

See v1/README.md and v2/README.md for details.