FinRag

What if, instead of chunking a document and searching by similarity, you just showed the LLM the table of contents and let it pick which sections to read?

That's the core idea behind this project (v2), and it scores 82% on FinanceBench (91% in 10-K category) - a jump from 64% with traditional chunk-based RAG (v1). No vector database, no embeddings, just SQLite and an LLM that navigates document structure the way a human would.

This repo contains both approaches, evaluated on FinanceBench (150 questions across 43 companies' SEC filings).

The journey

v1 - Traditional RAG (64.4%). We started with the standard playbook: chunk documents by section, embed with BGE-large, index into OpenSearch with hybrid BM25+KNN search, rerank with a cross-encoder. It works reasonably well for straightforward metric lookups ("what was revenue in FY2023?") but falls apart on questions that need context from specific document sections - the kind where you'd flip to a particular part of the filing if you were reading it yourself.
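For flavor, a hybrid BM25+KNN request body for OpenSearch might look like the sketch below. Field names (`chunk_text`, `embedding`) are illustrative, not the repo's actual schema, and score normalization/combination is assumed to be handled by a search pipeline like the one `setup_hybrid_search.sh` configures:

```python
def build_hybrid_query(query_text, query_vector, k=10):
    """Build an OpenSearch hybrid query combining BM25 and KNN.

    Assumes an index with a text field `chunk_text` and a knn_vector
    field `embedding` (illustrative names), plus a search pipeline
    that normalizes and combines the two sub-query scores.
    """
    return {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # Lexical sub-query: BM25 over the chunk text
                    {"match": {"chunk_text": {"query": query_text}}},
                    # Semantic sub-query: approximate KNN over embeddings
                    {"knn": {"embedding": {"vector": query_vector, "k": k}}},
                ]
            }
        },
    }
```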

Looking for better approaches. We came across PageIndex which claims to achieve 98.7% on FinanceBench using a vectorless tree-indexing approach. Impressive, but it works by making recursive LLM calls over the full document structure at query time.

Their API costs $0.01/page for tree generation, plus token-based charges per query, and SEC filings can run 200+ pages, so the per-document cost adds up fast. More importantly, we wanted to understand why structure-based retrieval works better, not just use a hosted API.

v2 - Tree-based section routing (82.0% overall, 91% for 10-K). We tried a much simpler version of the same idea. Instead of recursive LLM traversal, we just extract the heading hierarchy from each document, clean up opaque headings ("Note 7" → "Note 7 - Goodwill and Intangible Assets") with a single LLM call per PDF, and at query time show the LLM the heading tree so it can pick which sections to read.
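Extracting the heading hierarchy is the cheap part. A minimal sketch, assuming the parser emits a flat list of (level, heading) pairs (the function name and input shape are illustrative, not the repo's actual interface):

```python
def build_heading_tree(headings):
    """Nest a flat list of (level, text) headings into a tree.

    `headings` is e.g. [(1, "Item 8. Financial Statements"),
    (2, "Notes to Consolidated Financial Statements"),
    (3, "Note 7 - Goodwill and Intangible Assets")].
    Returns a list of {"title", "children"} dicts.
    """
    root = {"children": []}
    stack = [(0, root)]  # (level, node) path from the root
    for level, text in headings:
        node = {"title": text, "children": []}
        # Pop back up the stack until we find this heading's parent
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]
```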

Ingestion needs just 2 LLM calls per PDF (not per page) - one to identify opaque headings, one to augment them. At query time, it averages around 3 LLM calls (filter extraction, section routing, answer generation), so the total cost per question is a fraction of PageIndex's. It's not as accurate as PageIndex's 98.7%, but this is a proof of concept. The fact that such a simple system answers 82% of FinanceBench correctly suggests there's a lot of room to push further, whether by improving the routing itself or adding a vector-based fallback for the tougher queries it can't handle on its own.
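The routing step boils down to: render the tree as a numbered table of contents, ask the LLM which entries to read, and parse its answer. A minimal sketch with a stubbed LLM call (the function names and prompt wording are illustrative, not the repo's actual code):

```python
import json

def render_toc(tree, lines=None, depth=0):
    """Flatten the heading tree into a numbered, indented TOC string."""
    if lines is None:
        lines = []
    for node in tree:
        lines.append(f"[{len(lines)}] {'  ' * depth}{node['title']}")
        render_toc(node["children"], lines, depth + 1)
    return lines

def route(question, tree, ask_llm):
    """Ask the LLM which TOC entries to read for this question.

    `ask_llm` is any callable prompt -> str; here it is expected to
    return a JSON list of section indices, e.g. "[2, 5]".
    """
    toc = render_toc(tree)
    prompt = (
        "Here is a document's table of contents:\n"
        + "\n".join(toc)
        + f"\n\nQuestion: {question}\n"
        "Reply with a JSON list of the [index] numbers of the sections "
        "most likely to contain the answer."
    )
    return json.loads(ask_llm(prompt))
```

The selected sections' text is then pulled from SQLite and handed to the answer-generation call.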

| Approach | Accuracy | Infrastructure |
|---|---|---|
| v1 - Chunk + Hybrid Search | 64.4% | OpenSearch, embeddings |
| v2 - Tree-Based Routing | 82.0% | SQLite only; no embeddings or vector DB |

Why does structure-based retrieval work?

The key insight is that SEC filings (and many professional documents) have well-structured, self-explanatory section headings. When a human analyst looks for "goodwill impairment", they don't read the whole filing - they check the table of contents, go to the notes section, and find the relevant note. v2 does the same thing.

This approach works well when:

  • Document headings are descriptive and well-organized
  • The content you need is localized in specific sections
  • Questions map naturally to document structure (as financial questions do)

It would struggle with documents that have poor structure, ambiguous headings, or where answers are scattered across many unrelated sections.

Where it could be improved

This is a proof of concept. 82% on FinanceBench with such a simple system is encouraging - it shows that structure-aware retrieval handles most scenarios well. The 18% it gets wrong are mostly edge cases that a more sophisticated system could address:

  • Hybrid v1+v2. Use v2's structural routing as the primary path, fall back to v1's embedding search when the router is uncertain or the question doesn't map cleanly to a section.
  • Better heading augmentation. Currently a single LLM call per document. Richer heading descriptions (using section content previews) would help the router pick better.
  • Cross-document reasoning. Neither approach does well on questions that require comparing across multiple filings.
  • Section-level embeddings. Embed entire sections alongside the tree - useful when headings alone aren't descriptive enough for routing.
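The hybrid fallback in the first bullet could be as simple as a confidence gate. A sketch with stubbed callables (all function names and the 0.5 threshold are illustrative stand-ins):

```python
def answer(question, route_sections, vector_search, generate,
           min_confidence=0.5):
    """Route by structure first; fall back to embedding search.

    `route_sections(question)` -> (list of section texts, confidence);
    `vector_search(question)` -> list of chunk texts;
    `generate(question, passages)` -> answer string.
    All callables and the threshold are illustrative stand-ins.
    """
    sections, confidence = route_sections(question)
    if confidence < min_confidence or not sections:
        # Router unsure, or the question doesn't map cleanly to a
        # section: fall back to v1-style embedding retrieval.
        sections = vector_search(question)
    return generate(question, sections)
```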

Results breakdown

v1 (64.4%)

| Question Type | Accuracy |
|---|---|
| metrics-generated | 82.0% (41/50) |
| domain-relevant | 66.0% (33/50) |
| novel-generated | 44.9% (22/49) |

v2 (82.0%)

| Question Type | Accuracy |
|---|---|
| metrics-generated | 90.0% (45/50) |
| domain-relevant | 82.0% (41/50) |
| novel-generated | 73.5% (36/49) |

The biggest gains are on domain-relevant (+16 points) and novel-generated (+29 points) questions - exactly where embedding similarity struggles but structural navigation shines.

Project structure

FinRag/
  common/                   # Shared utilities
    table_formatter.py      # Structured table formatting (preserves column semantics)
    eval_utils.py           # LLM calls, judge, results I/O
  v1/                       # Chunk-based RAG (see v1/README.md)
    ingestion/              # Chunker, embedder, indexer
    data/                   # OpenSearch index interface
    retrieval/              # Query transform, hybrid search, reranker
    pipeline.py             # Ingestion entry point
    eval.py                 # Evaluation
    colab_chunk.ipynb        
  v2/                       # Tree-based routing (see v2/README.md)
    build_tree.py           # Heading hierarchy extraction
    augment_headings.py     # LLM clarifies opaque headings
    store.py                # SQLite storage
    retrieval.py            # Section routing + answer generation
    eval.py                 # Evaluation
    colab_tree.ipynb        
  docker-compose.yaml       # OpenSearch (v1 only)
  setup_hybrid_search.sh    # Hybrid search pipeline setup
  pyproject.toml

Setup

Prerequisites

  • Python 3.11+
  • OpenAI API key
  • Pre-parsed Docling JSON files in parsed_docs/ (from financebench-parsed, or parse your own with Docling)

Quick start

v1 (needs OpenSearch + GPU):

docker compose up -d
bash setup_hybrid_search.sh
python -m v1.pipeline --input-dir parsed_docs
python -m v1.index_from_jsonl
python -m v1.eval

v2 (just needs OpenAI API key):

python -m v2.build_tree --input-dir parsed_docs
python -m v2.augment_headings
python -m v2.eval

See v1/README.md and v2/README.md for details.
