Financial RAG Pipeline

End-to-end structured financial data extraction from multi-page PDF reports.

IBM Docling → ChromaDB → CohereRerank → Gemini 1.5 Pro → Validated JSON

Architecture

Phase	Component	Purpose
1	IBM Docling	Parse PDFs → Markdown (text, tables, figures)
2	all-MiniLM-L6-v2 + ChromaDB	Semantic chunking & vector storage
3	Cosine Top-500 → CohereRerank v3	Two-stage retrieval → Top 40–50 chunks
4	Gemini 1.5 Pro (LangChain)	Structured extraction with Pydantic validation
5	Validated JSON	50+ financial fields per company/period

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Configure Environment

cp .env.example .env
# Edit .env with your API keys

Required API keys:

Google AI — for Gemini 1.5 Pro (Google AI Studio)
Cohere — for CohereRerank v3 (Cohere Dashboard)

3. Run the Pipeline

# Basic usage
python main.py path/to/annual_report.pdf

# Custom output path
python main.py path/to/annual_report.pdf --output results.json

# Custom ChromaDB directory
python main.py path/to/annual_report.pdf --chroma-dir ./my_chroma_db

Output Format

The pipeline extracts 50+ financial fields per company/period into a validated JSON structure:

[
  {
    "company_name": "Example Corp Ltd",
    "period": "FY2024",
    "currency_unit": "INR Crores",
    "income_statement": {
      "revenue": 15234.5,
      "ebitda": 3200.0,
      "pat": 1850.0,
      "eps_basic": 45.2,
      ...
    },
    "balance_sheet": {
      "fixed_assets": 8500.0,
      "borrowings": 2100.0,
      ...
    },
    "cash_flow": {
      "cash_flow_from_operations": 2800.0,
      "cash_at_end": 1200.0,
      ...
    }
  }
]

Project Structure

Financial Document Analyzer/
├── main.py                 # CLI entry point
├── requirements.txt        # Python dependencies
├── .env.example            # Environment variable template
├── README.md               # This file
└── src/
    ├── __init__.py
    ├── models.py           # Pydantic schemas (50+ fields)
    ├── parser.py           # IBM Docling parsing
    ├── embeddings.py       # Chunking & ChromaDB storage
    ├── retriever.py        # Two-stage retrieval + CohereRerank
    ├── extractor.py        # Gemini 1.5 Pro extraction chain
    └── pipeline.py         # Full orchestration + error handling

Cost Estimation (per 500-page document)

Service	Usage	Approx. Cost
IBM Docling	500 pages (Local CPU)	$0.00
all-MiniLM-L6-v2	~500K tokens (Local CPU)	$0.00
CohereRerank v3	500 docs reranked	~$0.001
Gemini 1.5 Pro	~100K tokens	~$0.35
Total	—	~$0.35

Known Limitations

Limitation	Mitigation
Docling misses scanned text	Pre-process with external OCR
Chunk boundary splits table row	Increase overlap to 200 tokens for table-heavy docs
Gemini hallucination on sparse context	Pydantic validation + null count thresholds
Multiple companies in one chunk	Post-process: validate company_name per row
Units vary across documents	Enforce `currency_unit` field; normalise downstream

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
chroma_db		chroma_db
src		src
.gitignore		.gitignore
Financial_RAG_Pipeline_Implementation_Plan.docx		Financial_RAG_Pipeline_Implementation_Plan.docx
Implementing Financial RAG Pipeline.md		Implementing Financial RAG Pipeline.md
README.md		README.md
financials_output.json		financials_output.json
main.py		main.py
requirements.txt		requirements.txt
sample.pdf		sample.pdf
test_parser.py		test_parser.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Financial RAG Pipeline

Architecture

Quick Start

1. Install Dependencies

2. Configure Environment

3. Run the Pipeline

Output Format

Project Structure

Cost Estimation (per 500-page document)

Known Limitations

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Financial RAG Pipeline

Architecture

Quick Start

1. Install Dependencies

2. Configure Environment

3. Run the Pipeline

Output Format

Project Structure

Cost Estimation (per 500-page document)

Known Limitations

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages