Build an AI-powered fund performance analysis system that enables Limited Partners (LPs) to:
- Upload fund performance PDF documents
- Automatically parse and extract structured data (tables → SQL, text → Vector DB)
- Ask natural language questions about fund metrics (DPI, IRR, etc.)
- Get accurate answers powered by RAG (Retrieval Augmented Generation) and SQL calculations
As an LP, you receive quarterly fund performance reports in PDF format. These documents contain:
- Capital Call tables: When and how much capital was called
- Distribution tables: When and how much was distributed back to LPs
- Adjustment tables: Rebalancing entries (recallable distributions, capital call adjustments)
- Text explanations: Definitions, investment strategies, market commentary
Your task: Build a system that automatically processes these documents and answers questions like:
- "What is the current DPI of this fund?"
- "Has the fund returned all invested capital to LPs?"
- "What does 'Paid-In Capital' mean in this context?"
- "Show me all capital calls in 2024"
This repository contains a project scaffold to help you get started quickly:
- Docker Compose configuration (PostgreSQL, Redis, Backend, Frontend)
- Database schema and models (SQLAlchemy)
- Basic API structure (FastAPI with endpoints)
- Frontend boilerplate (Next.js with TailwindCSS)
- Environment configuration
- Upload page layout
- Chat interface layout
- Fund dashboard layout
- Navigation and routing
- DPI (Distributions to Paid-In) - Fully implemented
- IRR (Internal Rate of Return) - Using numpy-financial
- PIC (Paid-In Capital) - With adjustments
- Calculation breakdown API - Shows all cash flows and transactions for debugging
- Located in: `backend/app/services/metrics_calculator.py`
Debugging Features:
- View all capital calls, distributions, and adjustments used in calculations
- See cash flow timeline for IRR calculation
- Verify intermediate values (total calls, total distributions, etc.)
- Trace calculation steps with detailed explanations
- Reference PDF: ILPA metrics explanation document
- Sample Fund Report: Generated with realistic data
- PDF Generator Script: `files/create_sample_pdf.py`
- Expected Results: Documented for validation
The following core functionalities are NOT implemented and need to be built by you:
- PDF parsing with Docling (integrate and test)
- Table detection and extraction logic
- Intelligent table classification (capital calls vs distributions vs adjustments)
- Data validation and cleaning
- Error handling for malformed PDFs
- Background task processing (Celery integration)
Files to implement:
- `backend/app/services/document_processor.py` (skeleton provided)
- `backend/app/services/table_parser.py` (needs implementation)
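A possible starting point for the table-classification task above is a keyword heuristic over header cells (a sketch only; the function name and keyword lists are illustrative assumptions, not part of the scaffold — a robust version should also look at the section heading above each table):

```python
# Hypothetical sketch: classify an extracted table by its header keywords.
def classify_table(headers: list[str]) -> str:
    """Return 'capital_calls', 'distributions', 'adjustments', or 'unknown'."""
    text = " ".join(h.lower() for h in headers)
    # Check the most specific vocabularies first to avoid false matches.
    if any(k in text for k in ("adjustment", "rebalance", "recalled")):
        return "adjustments"
    if any(k in text for k in ("distribution", "recallable")):
        return "distributions"
    if any(k in text for k in ("capital call", "call number", "drawdown")):
        return "capital_calls"
    return "unknown"  # fall through to manual review / error handling
```
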
- Text chunking strategy implementation
- Embedding generation
- pgvector storage and index management
- Semantic search implementation
- Context retrieval for LLM
- Prompt engineering for accurate responses
Files to implement:
- `backend/app/services/vector_store.py` (pgvector implementation with TODOs)
- `backend/app/services/rag_engine.py` (needs implementation)
Note: This project uses pgvector instead of FAISS. pgvector is a PostgreSQL extension that stores vectors directly in your database, eliminating the need for a separate vector database.
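The chunking strategy is left open; a fixed-size window with overlap is a common baseline. A minimal sketch (function name and defaults are illustrative, not part of the scaffold):

```python
# Hypothetical sketch: fixed-size word windows with overlap, so chunks
# never cut a word in half and adjacent chunks share context.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    words = text.split()
    if not words:
        return []
    chunks, step = [], max(chunk_size - overlap, 1)
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already reached the end of the text
    return chunks
```

Each chunk is then embedded and stored in a pgvector column; retrieval orders by pgvector's cosine-distance operator (`<=>`).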
- Intent classifier (calculation vs definition vs retrieval)
- Query router logic
- LLM integration
- Response formatting
- Source citation
- Conversation context management
Files to implement:
- `backend/app/services/query_engine.py` (needs implementation)
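As a baseline for the intent classifier, a keyword router can cover the three intent types before you swap in an LLM-based classifier (the keyword lists below are illustrative assumptions, not the scaffold's logic):

```python
# Hypothetical keyword-based intent router: calculation vs definition vs retrieval.
DEF_WORDS = ("what does", "mean", "explain", "definition", "what is a")
RETRIEVE_WORDS = ("show", "list", "largest", "all capital calls")
CALC_WORDS = ("calculate", "dpi", "irr", "current", "returned")

def classify_intent(query: str) -> str:
    q = query.lower()
    if any(w in q for w in DEF_WORDS):
        return "definition"    # answer from the vector store
    if any(w in q for w in RETRIEVE_WORDS):
        return "retrieval"     # answer with a SQL lookup
    if any(w in q for w in CALC_WORDS):
        return "calculation"   # answer with the metrics calculator
    return "retrieval"         # default: fall back to semantic search
```
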
- End-to-end document upload flow
- API integration tests
- Error handling and logging
- Performance optimization
Note: Metrics calculation is already implemented. You can focus on document processing and RAG!
┌─────────────────────────────────────────────────────────────┐
│ Frontend (Next.js) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Upload │ │ Chat │ │ Dashboard │ │
│ │ Page │ │ Interface │ │ Page │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└────────────────────────┬────────────────────────────────────┘
│ REST API
┌────────────────────────▼────────────────────────────────────┐
│ Backend (FastAPI) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Document Processor │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Docling │────────▶│ Table │ │ │
│ │ │ Parser │ │ Extractor │ │ │
│ │ └──────────────┘ └──────┬───────┘ │ │
│ │ │ │ │
│ │ ┌──────────────┐ ┌──────▼───────┐ │ │
│ │ │ Text │────────▶│ Embedding │ │ │
│ │ │ Chunker │ │ Generator │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Query Engine (RAG) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ │ │
│ │ │ Intent │─▶│ Vector │─▶│ LLM │ │ │
│ │ │ Classifier │ │ Search │ │ Response │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────┘ │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Metrics │─▶│ SQL │ │ │
│ │ │ Calculator │ │ Queries │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
└────────────────────────┬────────────────────────────────────┘
│
┌────────────────┼────────────────┐
│ │ │
┌───────▼────────┐ ┌────▼─────┐ ┌────────▼────────┐
│ PostgreSQL │ │ pgvector │ │ Redis │
│ (Structured) │ │ (Vectors)│ │ (Task Queue) │
└────────────────┘ └──────────┘ └─────────────────┘
CREATE TABLE funds (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
gp_name VARCHAR(255),
fund_type VARCHAR(100),
vintage_year INTEGER,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE capital_calls (
id SERIAL PRIMARY KEY,
fund_id INTEGER REFERENCES funds(id),
call_date DATE NOT NULL,
call_type VARCHAR(100),
amount DECIMAL(15, 2) NOT NULL,
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE distributions (
id SERIAL PRIMARY KEY,
fund_id INTEGER REFERENCES funds(id),
distribution_date DATE NOT NULL,
distribution_type VARCHAR(100),
is_recallable BOOLEAN DEFAULT FALSE,
amount DECIMAL(15, 2) NOT NULL,
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE adjustments (
id SERIAL PRIMARY KEY,
fund_id INTEGER REFERENCES funds(id),
adjustment_date DATE NOT NULL,
adjustment_type VARCHAR(100),
category VARCHAR(100),
amount DECIMAL(15, 2) NOT NULL,
is_contribution_adjustment BOOLEAN DEFAULT FALSE,
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
fund_id INTEGER REFERENCES funds(id),
file_name VARCHAR(255) NOT NULL,
file_path VARCHAR(500),
upload_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
parsing_status VARCHAR(50) DEFAULT 'pending',
error_message TEXT
);
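Text chunks for the RAG pipeline need a home too. One possible shape for a pgvector-backed chunks table, extending the schema above (table and column names here are assumptions; 1536 is the dimensionality of OpenAI's text-embedding-3-small):

```sql
-- Hypothetical chunks table for pgvector-backed semantic search.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
    id SERIAL PRIMARY KEY,
    document_id INTEGER REFERENCES documents(id),
    chunk_text TEXT NOT NULL,
    embedding vector(1536),  -- matches text-embedding-3-small
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Top-k retrieval uses pgvector's cosine-distance operator:
-- SELECT chunk_text FROM document_chunks
-- ORDER BY embedding <=> :query_embedding LIMIT 5;
```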
- Docker setup with PostgreSQL, Redis
- FastAPI backend with CRUD endpoints
- Next.js frontend with basic layout
- Database schema implementation
- Environment configuration
- File upload API endpoint
- Docling integration for PDF parsing
- Table extraction and SQL storage
- Text chunking and embedding
- Parsing status tracking
- pgvector setup (PostgreSQL extension)
- Embedding generation (OpenAI/local)
- Similarity search using pgvector operators
- LangChain integration
- Basic chat interface
- DPI calculation function
- IRR calculation function
- Metrics API endpoints
- Query engine integration
- Fund list page with metrics
- Fund detail page with charts
- Transaction tables with pagination
- Error handling improvements
- Loading states
- Conversation history
- Multi-fund comparison
- Excel export
- Custom calculation formulas
- Test coverage (50%+)
- Docker & Docker Compose
- Node.js 18+ (for local frontend development)
- Python 3.11+ (for local backend development)
- OpenAI API key (or use free alternatives - see below)
- Clone the repository
git clone <your-repo-url>
cd fund-analysis-system
- Set up environment variables
# Copy example env file
cp .env.example .env
# Edit .env and add your API keys
# OPENAI_API_KEY=sk-...
# DATABASE_URL=postgresql://user:password@localhost:5432/funddb
- Start with Docker Compose
docker-compose up -d
- Access the application
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- API Docs: http://localhost:8000/docs
- Upload sample document
- Navigate to http://localhost:3000/upload
- Upload the provided PDF:
files/ILPA based Capital Accounting and Performance Metrics_ PIC, Net PIC, DPI, IRR .pdf
- Wait for parsing to complete
- Start asking questions
- Go to http://localhost:3000/chat
- Try: "What is DPI?"
- Try: "Calculate the current DPI for this fund"
fund-analysis-system/
├── backend/
│ ├── app/
│ │ ├── api/
│ │ │ ├── endpoints/
│ │ │ │ ├── documents.py
│ │ │ │ ├── funds.py
│ │ │ │ ├── chat.py
│ │ │ │ └── metrics.py
│ │ │ └── deps.py
│ │ ├── core/
│ │ │ ├── config.py
│ │ │ └── security.py
│ │ ├── db/
│ │ │ ├── base.py
│ │ │ ├── session.py
│ │ │ └── init_db.py
│ │ ├── models/
│ │ │ ├── fund.py
│ │ │ ├── transaction.py
│ │ │ └── document.py
│ │ ├── schemas/
│ │ │ ├── fund.py
│ │ │ ├── transaction.py
│ │ │ ├── document.py
│ │ │ └── chat.py
│ │ ├── services/
│ │ │ ├── document_processor.py
│ │ │ ├── table_parser.py
│ │ │ ├── vector_store.py
│ │ │ ├── query_engine.py
│ │ │ └── metrics_calculator.py
│ │ └── main.py
│ ├── tests/
│ │ ├── test_document_processor.py
│ │ ├── test_metrics.py
│ │ └── test_api.py
│ ├── requirements.txt
│ ├── Dockerfile
│ └── alembic/
│ └── versions/
├── frontend/
│ ├── app/
│ │ ├── layout.tsx
│ │ ├── page.tsx
│ │ ├── upload/
│ │ │ └── page.tsx
│ │ ├── chat/
│ │ │ └── page.tsx
│ │ └── funds/
│ │ ├── page.tsx
│ │ └── [id]/
│ │ └── page.tsx
│ ├── components/
│ │ ├── ui/
│ │ │ ├── button.tsx
│ │ │ ├── card.tsx
│ │ │ ├── input.tsx
│ │ │ └── ...
│ │ ├── FileUpload.tsx
│ │ ├── ChatInterface.tsx
│ │ ├── FundMetrics.tsx
│ │ └── TransactionTable.tsx
│ ├── lib/
│ │ ├── api.ts
│ │ └── utils.ts
│ ├── package.json
│ ├── tsconfig.json
│ ├── next.config.js
│ ├── tailwind.config.ts
│ └── Dockerfile
├── docker-compose.yml
├── .env.example
├── README.md
└── docs/
├── API.md
├── ARCHITECTURE.md
└── CALCULATIONS.md
POST /api/documents/upload
GET /api/documents/{doc_id}/status
GET /api/documents/{doc_id}
DELETE /api/documents/{doc_id}
GET /api/funds
POST /api/funds
GET /api/funds/{fund_id}
GET /api/funds/{fund_id}/transactions
GET /api/funds/{fund_id}/metrics
POST /api/chat/query
GET /api/chat/conversations/{conv_id}
POST /api/chat/conversations
See API.md for detailed documentation.
PIC = Total Capital Calls - Adjustments
DPI = Cumulative Distributions / PIC
IRR = Rate where NPV of all cash flows = 0
Uses the `numpy_financial.irr()` function
See CALCULATIONS.md for detailed formulas.
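As a dependency-free illustration of these formulas (not the scaffold's `metrics_calculator.py`, which uses numpy-financial): DPI is a simple ratio, and IRR is the root of the NPV function, found here by bisection over dated cash flows (XIRR-style, whereas `numpy_financial.irr` assumes evenly spaced periods):

```python
from datetime import date

def dpi(total_distributions: float, pic: float) -> float:
    """DPI = cumulative distributions / paid-in capital."""
    if pic <= 0:
        raise ValueError("PIC must be positive")
    return total_distributions / pic

def npv(rate: float, flows: list[tuple[date, float]]) -> float:
    """NPV of dated cash flows, discounted from the first flow's date."""
    t0 = flows[0][0]
    return sum(cf / (1 + rate) ** ((d - t0).days / 365.0) for d, cf in flows)

def irr(flows: list[tuple[date, float]], lo: float = -0.99, hi: float = 10.0) -> float:
    """Annualized IRR: the rate where NPV = 0, found by bisection.
    Assumes the flows change sign at least once so a root is bracketed."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if npv(lo, flows) * npv(mid, flows) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```
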
cd backend
pytest tests/ -v --cov=app
cd frontend
npm test
curl -X POST "http://localhost:8000/api/documents/upload" \
-F "file=@files/sample_fund_report.pdf"
curl -X POST "http://localhost:8000/api/chat/query" \
-H "Content-Type: application/json" \
-d '{
"query": "What is the current DPI?",
"fund_id": 1
}'
- Use Docling to extract document structure
- Identify tables by headers (e.g., "Capital Call", "Distribution")
- Parse table rows and map to SQL schema
- Extract text paragraphs for vector storage
- Handle parsing errors gracefully
- Retrieval: Vector similarity search (top-k=5)
- Augmentation: Combine retrieved context with SQL data
- Generation: LLM generates answer with citations
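The augmentation step might assemble the prompt along these lines (a sketch; the function signature and prompt wording are illustrative, not prescribed by the scaffold):

```python
# Hypothetical sketch of the augmentation step: merge retrieved text chunks
# and computed SQL metrics into one prompt the LLM can cite from.
def build_prompt(question: str, chunks: list[str], metrics: dict) -> str:
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    metric_lines = "\n".join(f"- {k}: {v}" for k, v in metrics.items())
    return (
        "Answer using only the context and metrics below. "
        "Cite sources as [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Fund metrics (from SQL):\n{metric_lines}\n\n"
        f"Question: {question}"
    )
```
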
- Always validate input data before calculation
- Handle edge cases (zero PIC, missing data)
- Return calculation breakdown for transparency
- Cache results for performance
- "What does DPI mean?"
- "Explain Paid-In Capital"
- "What is a recallable distribution?"
- "What is the current DPI?"
- "Calculate the IRR for this fund"
- "Has the fund returned all capital to LPs?"
- "Show me all capital calls in 2024"
- "What was the largest distribution?"
- "List all adjustments"
- "How is the fund performing compared to industry benchmarks?"
- "What percentage of distributions were recallable?"
- "Explain the trend in capital calls over time"
- Document upload and parsing works
- Tables correctly stored in SQL
- Text stored in vector DB
- DPI calculation is accurate
- Basic RAG Q&A works
- Application runs via Docker
- Structure: Modular, separation of concerns (10pts)
- Readability: Clear naming, comments (10pts)
- Error Handling: Try-catch, validation (10pts)
- Type Safety: TypeScript, Pydantic (10pts)
- Parsing Accuracy: Table recognition (10pts)
- Calculation Accuracy: DPI, IRR (10pts)
- RAG Quality: Relevant answers (10pts)
- Intuitiveness: Easy to use (10pts)
- Feedback: Loading, errors, success (5pts)
- Design: Clean, consistent (5pts)
- README: Setup instructions (5pts)
- API Docs: Endpoint descriptions (3pts)
- Architecture: Diagrams (2pts)
- Dashboard implementation (+5pts)
- Charts/visualization (+3pts)
- Multi-fund support (+3pts)
- Test coverage (+5pts)
- Live deployment (+4pts)
- GitHub Repository (public or private with access)
- Complete source code (backend + frontend)
- Docker configuration (docker-compose.yml)
- Documentation (README, API docs, architecture)
- Sample data (at least one test PDF)
- Project overview
- Tech stack
- Setup instructions (Docker)
- Environment variables (.env.example)
- API testing examples
- Features implemented
- Known limitations
- Future improvements
- Screenshots (minimum 3)
- Recommended: 1 week (Phase 1-4)
- Maximum: 2 weeks (Phase 1-6)
- Push code to GitHub
- Test that `docker-compose up` works
- Send repository URL via email
- Include any special instructions
- Framework: FastAPI (Python 3.11+)
- Document Parser: Docling
- Vector DB: pgvector (PostgreSQL extension)
- SQL DB: PostgreSQL 15+
- ORM: SQLAlchemy
- LLM Framework: LangChain
- LLM: OpenAI GPT-4 or any LLM
- Embeddings: OpenAI text-embedding-3-small
- Task Queue: Celery + Redis
- Framework: Next.js 14 (App Router)
- UI Library: shadcn/ui + Tailwind CSS
- State: Zustand or React Context
- Data Fetching: TanStack Query
- Charts: Recharts
- File Upload: react-dropzone
- Development: Docker + Docker Compose
- Deployment: Your choice (Vercel, Railway, AWS, etc.)
Problem: Docling can't extract tables
Solution:
- Check the PDF format (ensure it's not a scanned image)
- Add fallback parsing logic
- Manually define table structure patterns
Problem: OpenAI API is expensive
Solution:
- Use free alternatives (see "Free LLM Options" section below)
- Use caching for repeated queries
- Use cheaper models (gpt-3.5-turbo)
- Use a local LLM (Ollama) for development
Problem: IRR returns NaN or extreme values
Solution:
- Validate cash flow sequence
- Check for missing dates
- Handle edge cases (all positive/negative flows)
Problem: Frontend can't call backend API
Solution:
- Add CORS middleware in FastAPI
- Allow origin: http://localhost:3000
- Check network configuration in Docker
You don't need to pay for the OpenAI API. Here are free alternatives:
Completely free, runs locally on your machine
- Install Ollama
# Mac
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download
- Download a model
# Llama 3.2 (3B - fast, good for development)
ollama pull llama3.2
# Or Llama 3.1 (8B - better quality)
ollama pull llama3.1
# Or Mistral (7B - good balance)
ollama pull mistral
- Update your .env
# Use Ollama instead of OpenAI
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2
- Modify your code to use Ollama
# In backend/app/services/query_engine.py
from langchain_community.llms import Ollama
llm = Ollama(
base_url="http://localhost:11434",
model="llama3.2"
)
Pros: Free, private, no API limits, works offline
Cons: Requires decent hardware (8GB+ RAM), slower than cloud APIs
Free tier: 60 requests per minute
- Get a free API key
- Go to https://makersuite.google.com/app/apikey
- Click "Create API Key"
- Copy your key
- Install the package
pip install langchain-google-genai
- Update .env
GOOGLE_API_KEY=your-gemini-api-key
LLM_PROVIDER=gemini
- Use in code
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(
model="gemini-pro",
google_api_key=os.getenv("GOOGLE_API_KEY")
)
Pros: Free, fast, good quality
Cons: Rate limits, requires internet
Free tier: Very fast inference, generous limits
- Get a free API key
- Go to https://console.groq.com
- Sign up and get API key
- Install the package
pip install langchain-groq
- Update .env
GROQ_API_KEY=your-groq-api-key
LLM_PROVIDER=groq
- Use in code
from langchain_groq import ChatGroq
llm = ChatGroq(
api_key=os.getenv("GROQ_API_KEY"),
model="mixtral-8x7b-32768" # or "llama3-70b-8192"
)
Pros: Free, extremely fast, good quality
Cons: Rate limits, requires internet
Free inference API
- Get a free token
- Go to https://huggingface.co/settings/tokens
- Create a token
- Update .env
HUGGINGFACE_API_TOKEN=your-hf-token
LLM_PROVIDER=huggingface
- Use in code
from langchain_community.llms import HuggingFaceHub
llm = HuggingFaceHub(
repo_id="mistralai/Mistral-7B-Instruct-v0.2",
huggingfacehub_api_token=os.getenv("HUGGINGFACE_API_TOKEN")
)
Pros: Free, many models available
Cons: Can be slow, rate limits
| Provider | Cost | Speed | Quality | Setup Difficulty |
|---|---|---|---|---|
| Ollama | Free | Medium | Good | Easy |
| Gemini | Free | Fast | Very Good | Very Easy |
| Groq | Free | Very Fast | Good | Very Easy |
| Hugging Face | Free | Slow | Varies | Easy |
| OpenAI | Paid | Fast | Excellent | Very Easy |
For Development/Testing:
- Use Ollama with `llama3.2` (free, no limits)
For Production/Demo:
- Use Groq or Gemini (free tier is generous)
If you have budget:
- Use OpenAI GPT-4 (best quality)
Located in the `files/` directory:
ILPA based Capital Accounting and Performance Metrics_ PIC, Net PIC, DPI, IRR.pdf
- Reference document explaining fund metrics
- Contains definitions of PIC, DPI, IRR, TVPI
- Use this to test text extraction and RAG
For comprehensive testing, you should create mock fund performance reports with:
Date | Call Number | Amount | Description
-----------|-------------|-------------|------------------
2023-01-15 | Call 1 | $5,000,000 | Initial Capital
2023-06-20 | Call 2 | $3,000,000 | Follow-on
2024-03-10 | Call 3 | $2,000,000 | Bridge Round
Date | Type | Amount | Recallable | Description
-----------|-------------|-------------|------------|------------------
2023-12-15 | Return | $1,500,000 | No | Exit: Company A
2024-06-20 | Income | $500,000 | No | Dividend
2024-09-10 | Return | $2,000,000 | Yes | Partial Exit: Company B
Date | Type | Amount | Description
-----------|---------------------|-----------|------------------
2024-01-15 | Recallable Dist | -$500,000 | Recalled distribution
2024-03-20 | Capital Call Adj | $100,000 | Fee adjustment
For the sample data above:
- Total Capital Called: $10,000,000
- Total Distributions: $4,000,000
- Net PIC: $10,100,000 (after adjustments)
- DPI: 0.40 (4M / 10M)
- IRR: ~8-12% (depends on exact dates)
We've included a Python script to generate sample PDFs:
cd files/
pip install reportlab
python create_sample_pdf.py
This creates `Sample_Fund_Performance_Report.pdf` with:
- Capital calls table (4 entries)
- Distributions table (4 entries)
- Adjustments table (3 entries)
- Performance summary with definitions
You can create PDFs using:
- Google Docs/Word → Export as PDF
- Python libraries (reportlab, fpdf)
- Online PDF generators
Tip: Start with simple, well-structured tables before handling complex layouts.
- Docling: https://github.com/DS4SD/docling
- LangChain RAG: https://python.langchain.com/docs/use_cases/question_answering/
- pgvector: https://github.com/pgvector/pgvector
- ILPA Guidelines: https://ilpa.org/
- PE Metrics: https://www.investopedia.com/terms/d/dpi.asp
- Start Simple: Get Phase 1-4 working before adding features
- Test Early: Test document parsing with sample PDF immediately
- Use Tools: Leverage LangChain, shadcn/ui to save time
- Focus on Core: Perfect the RAG pipeline and calculations first
- Document Well: Clear README helps evaluators understand your work
- Handle Errors: Graceful error handling shows maturity
- Ask Questions: If requirements are unclear, document your assumptions
For questions about this coding challenge:
- Open an issue in this repository
- Email: [your-contact-email]
Good luck! Build something amazing!
PIC = Capital Contributions (Gross) - Adjustments
DPI = Cumulative Distributions / PIC
Cumulative Distributions =
Return of Capital +
Dividends Paid +
Interest Paid +
Realized Gains Distributed -
(Fees & Carried Interest Withheld)
Adjustments = Σ (Rebalance of Distribution + Rebalance of Capital Call)
Rebalance of Distribution:
- Nature: Clawback of over-distributed amounts
- Recording: Contribution (-)
- DPI Impact: Numerator ↓, Denominator ↑ → DPI ↓

Rebalance of Capital Call:
- Nature: Refund of over-called capital
- Recording: Distribution (+)
- DPI Impact: Denominator ↓, Numerator unchanged → Requires flag to prevent DPI inflation
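A small worked example of the two effects above, with illustrative numbers (hypothetical, not taken from the sample report):

```python
# Illustrative arithmetic for the two adjustment rules (hypothetical numbers).
pic, dist = 10_000_000.0, 4_000_000.0
base_dpi = dist / pic  # 0.40

# Rebalance of distribution: $500k recalled (clawback), recorded as a
# contribution -> distributions fall, PIC rises, DPI falls.
dpi_after_recall = (dist - 500_000) / (pic + 500_000)  # ~0.333

# Rebalance of capital call: $500k refunded, recorded as a distribution (+)
# but flagged so it reduces PIC rather than inflating the numerator.
dpi_after_refund = dist / (pic - 500_000)  # ~0.421
```
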
Version: 1.0
Last Updated: 2025-10-06
Author: InterOpera-Apps Hiring Team