Build an AI-powered fund performance analysis system that enables Limited Partners (LPs) to:
- Upload fund performance PDF documents
- Automatically parse and extract structured data (tables → SQL, text → Vector DB)
- Ask natural language questions about fund metrics (DPI, IRR, etc.)
- Get accurate answers powered by RAG (Retrieval Augmented Generation) and SQL calculations
As an LP, you receive quarterly fund performance reports in PDF format. These documents contain:
- Capital Call tables: When and how much capital was called
- Distribution tables: When and how much was distributed back to LPs
- Adjustment tables: Rebalancing entries (recallable distributions, capital call adjustments)
- Text explanations: Definitions, investment strategies, market commentary
Your task: Build a system that automatically processes these documents and answers questions like:
- "What is the current DPI of this fund?"
- "Has the fund returned all invested capital to LPs?"
- "What does 'Paid-In Capital' mean in this context?"
- "Show me all capital calls in 2024"
This repository contains a project scaffold to help you get started quickly:
- Docker Compose configuration (PostgreSQL, Redis, Backend, Frontend)
- Database schema and models (SQLAlchemy)
- Basic API structure (FastAPI with endpoints)
- Frontend boilerplate (Next.js with TailwindCSS)
- Environment configuration
- Upload page layout
- Chat interface layout
- Fund dashboard layout
- Navigation and routing
- DPI (Distributions to Paid-In) - Fully implemented
- IRR (Internal Rate of Return) - Using numpy-financial
- PIC (Paid-In Capital) - With adjustments
- Calculation breakdown API - Shows all cash flows and transactions for debugging
- Located in: backend/app/services/metrics_calculator.py
Debugging Features:
- View all capital calls, distributions, and adjustments used in calculations
- See cash flow timeline for IRR calculation
- Verify intermediate values (total calls, total distributions, etc.)
- Trace calculation steps with detailed explanations
- Reference PDF: ILPA metrics explanation document
- Sample Fund Report: Generated with realistic data
- PDF Generator Script: files/create_sample_pdf.py
- Expected Results: Documented for validation
The following core functionalities are NOT implemented and need to be built by you:
- PDF parsing with Docling (integrate and test)
- Table detection and extraction logic
- Intelligent table classification (capital calls vs distributions vs adjustments)
- Data validation and cleaning
- Error handling for malformed PDFs
- Background task processing (Celery integration)
Files to implement:
- backend/app/services/document_processor.py (skeleton provided)
- backend/app/services/table_parser.py (needs implementation)
- Text chunking strategy implementation
- Embedding generation
- FAISS index creation and management
- Semantic search implementation
- Context retrieval for LLM
- Prompt engineering for accurate responses
Files to implement:
- backend/app/services/vector_store.py (pgvector implementation with TODOs)
- backend/app/services/rag_engine.py (needs implementation)
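The 1000-character / 200-character-overlap chunking strategy mentioned elsewhere in this README can be sketched with a simple sliding window. This is a minimal illustration, not the skeleton's actual code:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with overlapping windows."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

# 2500 chars with 1000/200 windows → chunks of 1000, 1000, 900 chars.
chunks = chunk_text("a" * 2500, chunk_size=1000, overlap=200)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks.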
Note: This project uses pgvector instead of FAISS. pgvector is a PostgreSQL extension that stores vectors directly in your database, eliminating the need for a separate vector database.
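For intuition, the cosine distance that pgvector's vector_cosine_ops computes (exposed via the `<=>` operator in SQL) looks like this in pure Python — in production the computation runs inside PostgreSQL, not in application code:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as used by pgvector's vector_cosine_ops: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Identical directions → distance 0; orthogonal → distance 1.
assert cosine_distance([1, 0], [2, 0]) == 0.0
assert cosine_distance([1, 0], [0, 1]) == 1.0
```

Semantic search is then just "order rows by this distance to the query embedding and take the top-k".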
- Intent classifier (calculation vs definition vs retrieval)
- Query router logic
- LLM integration
- Response formatting
- Source citation
- Conversation context management
Files to implement:
- backend/app/services/query_engine.py (needs implementation)
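One possible starting point for the intent classifier is a keyword heuristic like the sketch below. The cue lists are illustrative assumptions; a production router might use an LLM call or a trained classifier instead:

```python
def classify_intent(query: str) -> str:
    """Naive keyword router: calculation vs definition vs retrieval."""
    q = query.lower()
    definition_cues = ("what does", "what is a", "mean", "explain", "define")
    calculation_cues = ("calculate", "dpi", "irr", "tvpi", "how much", "returned")
    if any(cue in q for cue in definition_cues):
        return "definition"
    if any(cue in q for cue in calculation_cues):
        return "calculation"
    return "retrieval"  # fall through to vector search + SQL

assert classify_intent("What does 'Paid-In Capital' mean?") == "definition"
assert classify_intent("Calculate the current DPI") == "calculation"
assert classify_intent("Show me all capital calls in 2024") == "retrieval"
```

Ambiguous queries ("What is DPI?" vs "What is the current DPI?") are where a keyword approach breaks down and an LLM-based classifier earns its cost.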
- End-to-end document upload flow
- API integration tests
- Error handling and logging
- Performance optimization
Note: Metrics calculation is already implemented. You can focus on document processing and RAG!
┌─────────────────────────────────────────────────────────────┐
│ Frontend (Next.js) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Upload │ │ Chat │ │ Funds │ │ Compare │ │
│ │ Page │ │ History │ │Dashboard │ │ Page │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────┬────────────────────────────────────┘
│ REST API
┌────────────────────────▼────────────────────────────────────┐
│ Backend (FastAPI) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Document Processor │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Docling │────────▶│ Table │ │ │
│ │ │ Parser │ │ Extractor │ │ │
│ │ └──────────────┘ └──────┬───────┘ │ │
│ │ │ │ │
│ │ ┌──────────────┐ ┌──────▼───────┐ │ │
│ │ │ Text │────────▶│ Embedding │ │ │
│ │ │ Chunker │ │ Generator │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Query Engine (RAG) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ │ │
│ │ │ Intent │─▶│ Vector │─▶│ LLM │ │ │
│ │ │ Classifier │ │ Search │ │ Response │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────┘ │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Metrics │─▶│ SQL │ │ │
│ │ │ Calculator │ │ Queries │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
└────────────────────────┬────────────────────────────────────┘
┌───────────────┬┴──────────┬──────────────┐
│ │ │ │
PostgreSQL Celery Worker Redis Gemini/Groq
(pgvector + (Background (Task (LLM)
Transactions) Tasks) Queue)
User Query
↓
Query Engine Classifier
├→ [Calculation] → Metrics API → PIC/DPI/IRR/TVPI
├→ [Definition] → RAG Pipeline (retrieval + LLM)
└→ [Retrieval] → Vector Search + SQL Query
↓
Vector Search (Top-5)
↓
Context Aggregation (RAG documents + SQL results + Memory)
↓
LLM Generate Answer (Gemini/Groq/Ollama)
↓
Format Response with Citations
↓
Return to Frontend
1. User sends query via Chat UI
↓
2. Frontend: POST /api/chat/query (query, fund_id, conversation_id)
↓
3. Backend Query Engine:
├─→ Classify intent (calculation/definition/retrieval)
├─→ If Calculation: Call Metrics API → SQL joins
├─→ If Retrieval: Vector search (pgvector) + SQL aggregation
└─→ Context → LLM with memory
↓
4. LLM Response with sources
↓
5. Store in Redis (conversation history with TTL)
↓
6. Return formatted response to Frontend
↓
7. Display in Chat UI with history sidebar
PostgreSQL (Persistent):
- Structured transaction data (capital calls, distributions, adjustments)
- Document metadata and references
- Vector embeddings for RAG (pgvector extension)
Redis (Ephemeral):
- Chat conversation history (with optional TTL)
- Session data
- Task queue for background processing (Celery)
CREATE TABLE funds (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
gp_name VARCHAR(255),
fund_type VARCHAR(100),
vintage_year INTEGER,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE capital_calls (
id SERIAL PRIMARY KEY,
fund_id INTEGER REFERENCES funds(id),
call_date DATE NOT NULL,
call_type VARCHAR(100),
amount DECIMAL(15, 2) NOT NULL,
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE distributions (
id SERIAL PRIMARY KEY,
fund_id INTEGER REFERENCES funds(id),
distribution_date DATE NOT NULL,
distribution_type VARCHAR(100),
is_recallable BOOLEAN DEFAULT FALSE,
amount DECIMAL(15, 2) NOT NULL,
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE adjustments (
id SERIAL PRIMARY KEY,
fund_id INTEGER REFERENCES funds(id),
adjustment_date DATE NOT NULL,
adjustment_type VARCHAR(100),
category VARCHAR(100),
amount DECIMAL(15, 2) NOT NULL,
is_contribution_adjustment BOOLEAN DEFAULT FALSE,
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE documents (
id SERIAL PRIMARY KEY,
fund_id INTEGER REFERENCES funds(id),
file_name VARCHAR(255) NOT NULL,
file_path VARCHAR(500),
upload_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
parsing_status VARCHAR(50) DEFAULT 'pending',
error_message TEXT
);

CREATE TABLE document_embeddings (
id SERIAL PRIMARY KEY,
document_id INTEGER REFERENCES documents(id),
fund_id INTEGER REFERENCES funds(id),
content TEXT NOT NULL,
embedding vector(768), -- pgvector extension (768-d Gemini embeddings)
metadata JSONB, -- Document source, section, page number, etc.
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- IVFFlat index for similarity search (for vectors ≤2000 dimensions)
CREATE INDEX idx_document_embeddings_embedding ON document_embeddings
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
-- GIN index for metadata filtering
CREATE INDEX idx_embedding_metadata ON document_embeddings USING GIN (metadata);

Conversations stored in Redis (not PostgreSQL):
- Key pattern: "conversation:{conversation_id}:meta" → fund_id, title, timestamps
- Key pattern: "conversation:{conversation_id}:messages" → [role, content, timestamp]
Benefits:
- Fast retrieval and updates
- Automatic TTL expiration (configurable)
- Fallback to in-memory dict if Redis unavailable
Example Redis structure:
conversation:uuid-1:meta → {fund_id: 1, title: "Fund Analysis", updated_at: "..."}
conversation:uuid-1:messages → [{role: "user", content: "What is DPI?"}, ...]
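The in-memory fallback mentioned above can mirror the same key layout. This sketch is illustrative — the real implementation would swap the dict for a redis-py client and add EXPIRE for TTL:

```python
import json
import time

class ConversationStore:
    """In-memory fallback mirroring the Redis key layout above.
    With Redis available, the same keys map to SET/RPUSH plus EXPIRE for TTL."""

    def __init__(self) -> None:
        self._data: dict = {}

    def create(self, conv_id: str, fund_id: int, title: str) -> None:
        self._data[f"conversation:{conv_id}:meta"] = {
            "fund_id": fund_id, "title": title, "updated_at": time.time()}
        self._data[f"conversation:{conv_id}:messages"] = []

    def append(self, conv_id: str, role: str, content: str) -> None:
        # Messages stored as JSON strings, as they would be in a Redis list.
        self._data[f"conversation:{conv_id}:messages"].append(
            json.dumps({"role": role, "content": content}))

    def history(self, conv_id: str) -> list:
        raw = self._data.get(f"conversation:{conv_id}:messages", [])
        return [json.loads(m) for m in raw]

store = ConversationStore()
store.create("uuid-1", fund_id=1, title="Fund Analysis")
store.append("uuid-1", "user", "What is DPI?")
```

Keeping the key scheme identical in both backends means the query engine never needs to know which store is active.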
PostgreSQL:
funds
├─→ documents (one-to-many)
│ └─→ document_embeddings (one-to-many) [pgvector + metadata]
├─→ capital_calls (one-to-many)
├─→ distributions (one-to-many)
└─→ adjustments (one-to-many)
Redis (Chat History):
conversations/{fund_id}
└─→ messages (stored as JSON arrays)
- Docker setup with PostgreSQL, Redis, backend, frontend
- FastAPI backend with full CRUD endpoints + health checks
- Next.js frontend with complete layout and routing (6+ pages)
- Database schema implementation (7 tables + pgvector)
- Environment configuration (.env.example with all required keys)
- Comprehensive error handling with ErrorBoundary
- Toast provider for user notifications
- File upload API endpoint with file validation
- Docling integration for PDF parsing and structure extraction
- Table extraction with intelligent classification (capital calls/distributions/adjustments)
- Text chunking (1000-char chunks, 200-char overlap)
- Metadata extraction from PDFs (fund name, GP, vintage year, strategy)
- Parsing status tracking with detailed error messages
- Celery background task processing with Redis queue
- Automatic fund assignment to documents
- Extract-metadata endpoint for pre-upload validation
- pgvector setup (PostgreSQL extension, 768-dimensional vectors)
- Embedding generation (Google text-embedding-004 API)
- IVFFlat ANN index for similarity search (100 lists)
- JSONB metadata filtering with GIN index
- LangChain integration with Gemini 2.0 Flash LLM
- RAG engine with document retrieval and context aggregation
- Complete chat interface with conversation history
- Intent classification (calculation vs definition vs retrieval)
- Query routing with conversation memory
- Source citations in LLM responses
- DPI calculation (Cumulative Distributions / PIC)
- IRR calculation with chronologically sorted cash flows
- TVPI, RVPI, NAV calculation from reported document values
- PIC calculation with adjustments (capital calls - adjustments)
- Metrics API endpoints with full breakdown
- Query engine integration with SQL joins and aggregations
- Calculation transparency showing all intermediate values
- Validation of cash flow sequences
- Fund list page with metrics and fund selector
- Fund detail page with 3 charts (distributions, cumulative flow, DPI/TVPI)
- Transaction tables with traditional pagination
- Sortable columns with visual indicators
- Date range and type filtering
- Results counter
- Error handling improvements (ErrorBoundary + try-catch)
- Loading states with spinners
- Toast notifications (auto-dismiss)
- Conversation history (Redis with TTL)
- Fund-specific chat conversations
- Multi-fund comparison page (9 metrics, 3 charts)
- CSV export for transactions
- Celery worker for background processing
- Delete conversations feature
- Delete documents feature
- Delete funds with cascade delete
- Advanced filtering and sorting
- System diagrams (RAG, Chat flow, ER)
- Chat History Persistence: Conversations stored in Redis with optional TTL (not persistent across server restarts without Redis persistence config)
- Embedding Dimension: Limited to 768-d vectors (Gemini model). IVFFlat index not available for >2000-d vectors
- PDF Support: Optimized for well-structured PDFs with clear tables; scanned PDFs or complex layouts may have lower accuracy
- LLM Responses: Quality depends on selected LLM provider (Gemini, Groq, etc.). Rate limits apply to free tiers
- Background Processing: Document processing may take 5-30 seconds depending on PDF size and server capacity
- Database: Single PostgreSQL instance (no replication/clustering for HA)
- Vector Index: 768-d vectors require ~600MB disk space per 100K documents
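The ~600MB figure follows from simple arithmetic, assuming float4 vector components and roughly 2x overhead for the IVFFlat index and row metadata (the overhead factor is an estimate, not a measured value):

```python
# Back-of-envelope check of the disk estimate above.
dims = 768
bytes_per_float = 4          # pgvector stores float4 components
docs = 100_000
raw_mb = dims * bytes_per_float * docs / 1024**2   # raw vector payload
with_index_mb = raw_mb * 2   # assumed ~2x for IVFFlat index + row overhead
print(round(raw_mb), round(with_index_mb))  # ≈ 293, 586
```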
- Implement Redis persistence for conversation durability
- Add caching layer (Redis) for frequently calculated metrics
- Batch embedding generation for faster PDF processing
- Implement query result pagination in RAG retrieval
- Live deployment with CI/CD pipeline (GitHub Actions)
- Playwright E2E tests with GitHub Actions integration
- Custom calculation formulas (user-defined PIC/DPI calculations)
- Support for XLSX/Excel exports (currently CSV only)
- Multi-language support for chat interface
- Real-time collaboration (multiple users per fund)
- Webhooks for external integrations
- Kubernetes deployment configuration
- Database connection pooling (pgBouncer)
- CDN integration for frontend assets
- Monitoring dashboard (Prometheus/Grafana)
- Comprehensive logging (ELK stack)
- API rate limiting per user
- Support for more table formats (nested, sparse)
- Scanned PDF OCR support
- Fuzzy matching for fund names
- Handling of multi-currency documents
- Fund Performance Analysis System for LPs
- Automatic PDF parsing, RAG Q&A, metrics calculation
- Docker deployment, Next.js + FastAPI stack
- Backend: FastAPI, Docling, pgvector, LangChain, Gemini 2.0 Flash, Celery, Redis
- Frontend: Next.js 14, shadcn/ui, Tailwind CSS, Recharts
- Infrastructure: PostgreSQL, Redis, Docker Compose
- Documented in "Quick Start" section (steps 1-6)
- docker-compose up -d command included
- Health check verification steps
- .env.example file in repository
- All required keys documented
- Free API key alternatives provided (Gemini, Groq, Ollama)
- Documented in "Testing" section
- curl examples for upload, metrics, chat
- Backend endpoint list with descriptions
- Comprehensive "Features Implemented" section above (96 items across 6 phases)
- All checkboxes marked [x]
- "Known Limitations" section above (8 limitations)
- Explains trade-offs and constraints
- "Future Improvements" section above (20+ planned items)
- Organized by category (Performance, Features, Infrastructure, Accuracy)
Screenshots are stored in docs/screenshots/ and embedded in this README below.
Upload PDF documents with automatic fund metadata extraction (name, GP, vintage year, strategy).
Features shown:
- File drag-and-drop zone
- Auto-extracted metadata from PDF
- Fund selector (create new or assign to existing)
- Upload status and progress
Ask natural language questions about fund metrics and get RAG-powered answers with citations.
Features shown:
- Fund selector dropdown
- Conversation history sidebar
- Chat messages with LLM responses
- Source citations from documents
- New conversation button
View comprehensive fund metrics, performance charts, and transaction details.
Features shown:
- Fund name and key metrics (DPI, IRR, TVPI)
- 3 charts (Distribution by Type, Cumulative Flow, DPI vs TVPI)
- Transaction tables (Capital Calls, Distributions, Adjustments)
- Sortable and filterable columns
- Export to CSV button
Compare metrics across multiple funds side-by-side.
Features shown:
- Fund selection checkboxes
- 9-metric comparison table (DPI, IRR, TVPI, RVPI, NAV, PIC, etc.)
- 3 comparison charts (DPI vs TVPI, Metrics bar chart, etc.)
- Color-coded fund rows
- Detailed breakdown per fund
View all uploaded documents with parsing status and management options.
Features shown:
- Document list with fund assignment
- Parsing status (pending/success/error)
- Delete document button
- Upload date and file size
- Error message display
- Docker & Docker Compose
- Node.js 18+ (for local frontend development)
- Python 3.11+ (for local backend development)
- OpenAI API key (or use free alternatives - see below)
- Clone the repository
git clone <your-repo-url>
cd fund-analysis-system
- Set up environment variables
# Copy example env file
cp .env.example .env
# Edit .env and select an LLM provider (see .env.example for all options):
#
# RECOMMENDED: Google Gemini (Free Tier - 60 requests/min)
# 1. Get API key: https://makersuite.google.com/app/apikey
# 2. Set: LLM_PROVIDER=gemini
# 3. Set: GOOGLE_API_KEY=your-api-key
#
# ALTERNATIVE: Groq (Free Tier - Very Fast)
# 1. Get API key: https://console.groq.com
# 2. Set: LLM_PROVIDER=groq
# 3. Set: GROQ_API_KEY=your-api-key
#
# ALTERNATIVE: Ollama (Local - Free, No Rate Limits)
# 1. Install: brew install ollama (Mac) or https://ollama.com
# 2. Run: ollama pull llama3.2
# 3. Set: LLM_PROVIDER=ollama
#
# ALTERNATIVE: OpenAI (Paid)
# 1. Get API key: https://platform.openai.com/account/api-keys
# 2. Set: LLM_PROVIDER=openai
# 3. Set: OPENAI_API_KEY=sk-...
- Start with Docker Compose
docker-compose up -d
- Access the application
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- API Docs: http://localhost:8000/docs
- Upload sample document
- Navigate to http://localhost:3000/upload
- Upload one of the provided sample PDFs:
- files/ILPA based Capital Accounting and Performance Metrics_ PIC, Net PIC, DPI, IRR.pdf (reference document with definitions)
- files/Sample_Fund_Tech_Ventures_III.pdf (sample data - recommended)
- files/Sample_Fund_Growth_Capital_IV.pdf (sample data)
- files/Sample_Fund_Innovation_II.pdf (sample data)
- Wait for parsing to complete (shows progress bar)
- Start asking questions
- Go to http://localhost:3000/chat
- Select the fund you just uploaded
- Try: "What is DPI?"
- Try: "Calculate the current DPI for this fund"
- Try: "Show me all capital calls"
- Try: "What is a recallable distribution?"
fund-analysis-system/
├── backend/
│ ├── app/
│ │ ├── api/
│ │ │ ├── __init__.py
│ │ │ ├── deps.py
│ │ │ └── endpoints/
│ │ │ ├── __init__.py
│ │ │ ├── documents.py # Upload, extract, list, delete
│ │ │ ├── funds.py # CRUD + metrics + transactions
│ │ │ ├── chat.py # Query + conversations
│ │ │ └── metrics.py # DPI, IRR, TVPI calculations
│ │ ├── core/
│ │ │ ├── __init__.py
│ │ │ ├── config.py # Settings, LLM provider config
│ │ │ └── celery_app.py # Celery worker configuration
│ │ ├── models/
│ │ │ ├── __init__.py
│ │ │ ├── fund.py # Fund, conversation models
│ │ │ ├── document.py # Document model
│ │ │ └── transaction.py # CapitalCall, Distribution, Adjustment
│ │ ├── services/
│ │ │ ├── __init__.py
│ │ │ ├── document_processor.py # Docling integration, chunking
│ │ │ ├── table_parser.py # Table extraction, classification
│ │ │ ├── vector_store.py # pgvector, embeddings, search
│ │ │ ├── rag_engine.py # RAG retrieval + augmentation
│ │ │ ├── query_engine.py # Intent classification, routing
│ │ │ ├── metrics_calculator.py # DPI, IRR, TVPI, NAV calculations
│ │ │ └── celery_tasks/
│ │ │ ├── __init__.py
│ │ │ └── document_tasks.py # Background document processing
│ │ ├── db/
│ │ │ ├── __init__.py
│ │ │ ├── session.py # Database session management
│ │ │ └── init_db.py # Database initialization
│ │ ├── schemas/ # Pydantic schemas (inline in endpoints)
│ │ └── main.py # FastAPI app, CORS, routes
│ │
│ ├── tests/
│ │ ├── __pycache__/
│ │ ├── test_chat_api.py
│ │ ├── test_chat_api_retrieval_sql.py
│ │ ├── test_documents_api.py
│ │ ├── test_documents_api_processing.py
│ │ ├── test_funds_transactions_csv.py
│ │ ├── test_metrics_api.py
│ │ ├── test_pipeline.py
│ │ ├── test_query_engine_*.py # (6 routing/process/retrieval tests)
│ │ └── test_table_parser_*.py # (2 basic tests)
│ │
│ ├── requirements.txt
│ ├── Dockerfile # Production-ready image
│ └── .dockerignore
│
├── frontend/
│ ├── app/
│ │ ├── layout.tsx # Root layout with providers
│ │ ├── page.tsx # Home/redirect page
│ │ ├── globals.css # Global tailwind styles
│ │ ├── upload/
│ │ │ └── page.tsx # PDF upload with metadata extraction
│ │ ├── chat/
│ │ │ ├── page.tsx # Chat interface + history sidebar
│ │ │ └── ChatContent.tsx # Chat message display
│ │ ├── funds/
│ │ │ ├── page.tsx # Fund list with metrics
│ │ │ ├── [id]/
│ │ │ │ └── page.tsx # Fund detail + 3 charts + 3 tables
│ │ │ ├── compare/
│ │ │ │ └── page.tsx # Multi-fund comparison
│ │ │ └── documents/
│ │ │ └── page.tsx # Document management
│ │ └── README.md
│ │
│ ├── components/
│ │ ├── ErrorBoundary.tsx # React error boundary
│ │ ├── Navigation.tsx # Top navbar/header
│ │ ├── ToastProvider.tsx # Toast notification system
│ │ └── TransactionTableWithFilters.tsx # Table with sorting/filtering
│ │
│ ├── lib/
│ │ ├── api.ts # API client (funds, chat, documents)
│ │ └── utils.ts # Formatting utilities
│ │
│ ├── public/ # Static assets
│ ├── package.json # Dependencies
│ ├── tsconfig.json # TypeScript config
│ ├── next.config.js # Next.js config
│ ├── tailwind.config.ts # Tailwind CSS config
│ ├── postcss.config.js # PostCSS config
│ ├── Dockerfile # Production image
│ └── .dockerignore
│
├── docker-compose.yml # PostgreSQL + Redis + backend + frontend
├── .env.example # Environment template (Gemini, Groq, etc.)
├── .railwayignore # (optional) Railway deployment
│
├── docs/
│ ├── API.md # API endpoint documentation
│ ├── ARCHITECTURE.md # System design & data flow
│ ├── CALCULATIONS.md # Metrics formulas (DPI, IRR, TVPI)
│ ├── SCREENSHOTS.md # How to capture/add screenshots
│ └── screenshots/ # Screenshot images (01-upload.png, etc.)
│
├── files/
│ ├── ILPA based Capital...pdf # Reference document (definitions)
│ ├── Sample_Fund_*.pdf # 3 sample fund reports
│ ├── create_sample_pdf.py # PDF generator script
│ └── README.md # Sample data guide
│
├── AGENTS.md # Feature tracking & context inventory
├── SETUP.md # Installation & Docker troubleshooting
├── TROUBLESHOOTING.md # Common issues & solutions
├── README.md # This file
└── .gitignore
Backend Core:
- backend/app/main.py - FastAPI app with all endpoints
- backend/app/core/config.py - Settings & LLM provider config
- backend/requirements.txt - Python dependencies
Backend Services (RAG + Parsing):
- backend/app/services/document_processor.py - PDF parsing (Docling)
- backend/app/services/table_parser.py - Table extraction
- backend/app/services/vector_store.py - pgvector embeddings
- backend/app/services/rag_engine.py - RAG retrieval
Backend Processing (Async):
- backend/app/core/celery_app.py - Celery configuration
- backend/app/services/celery_tasks/document_tasks.py - Background tasks
Frontend Pages:
- Upload: frontend/app/upload/page.tsx
- Chat: frontend/app/chat/page.tsx
- Funds: frontend/app/funds/page.tsx, [id]/page.tsx, compare/page.tsx
Docker & Deployment:
- docker-compose.yml - Local dev environment
- backend/Dockerfile - FastAPI image
- frontend/Dockerfile - Next.js image
Documentation:
- README.md - Main documentation (this file)
- docs/API.md - REST API reference
- docs/SCREENSHOTS.md - Screenshot capture guide
POST /api/documents/upload # Upload PDF with auto-assignment
POST /api/documents/extract-metadata # Extract fund metadata from PDF
GET /api/documents/ # List all documents (paginated)
GET /api/documents/{doc_id}/status # Get parsing status
GET /api/documents/{doc_id} # Get document details
DELETE /api/documents/{doc_id} # Delete document + embeddings
GET /api/funds # List all funds with metrics
POST /api/funds # Create new fund
GET /api/funds/{fund_id} # Get fund details + metrics
PUT /api/funds/{fund_id} # Update fund info
DELETE /api/funds/{fund_id} # Delete fund (cascade delete)
GET /api/funds/{fund_id}/transactions # Get capital calls/distributions/adjustments (paginated)
GET /api/funds/{fund_id}/transactions.csv # Export transactions as CSV
GET /api/funds/{fund_id}/metrics # Get calculated metrics (DPI/IRR/TVPI/RVPI/NAV)
POST /api/chat/query # Submit query + get response
GET /api/chat/conversations # List conversations for fund
POST /api/chat/conversations # Create new conversation
GET /api/chat/conversations/{conv_id} # Get conversation history
DELETE /api/chat/conversations/{conv_id} # Delete conversation
GET /api/metrics/funds/{fund_id}/metrics # Get all calculated metrics
See API.md for detailed documentation.
PIC = Total Capital Calls - Adjustments
DPI = Cumulative Distributions / PIC
IRR = Rate where NPV of all cash flows = 0
Uses numpy-financial.irr() function
See CALCULATIONS.md for detailed formulas.
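As a dependency-free illustration of these formulas (the project itself uses numpy-financial's irr()), DPI and a periodic IRR can be sketched as follows — the bisection solver here stands in for numpy_financial.irr and the cash flows are made up:

```python
def npv(rate: float, flows: list[float]) -> float:
    """Net present value of periodic cash flows (flows[0] at t=0)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(flows))

def irr(flows: list[float], lo: float = -0.99, hi: float = 10.0) -> float:
    """Periodic IRR by bisection: find the rate where NPV crosses zero."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if npv(lo, flows) * npv(mid, flows) <= 0:
            hi = mid  # root lies in [lo, mid]
        else:
            lo = mid
    return (lo + hi) / 2

# Illustrative fund: one $10M call, then three $4M distributions.
flows = [-10_000_000, 4_000_000, 4_000_000, 4_000_000]
pic, dists = 10_000_000, 12_000_000
dpi = dists / pic   # 1.2x — the fund has returned 120% of paid-in capital
rate = irr(flows)   # ≈ 9.7% periodic IRR
```

Unlike DPI, IRR is sensitive to *when* each flow occurs, which is why the implemented calculator sorts cash flows chronologically first.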
cd backend
pytest tests/ -v --cov=app

cd frontend
npm test

curl -X POST "http://localhost:8000/api/documents/upload" \
  -F "file=@files/sample_fund_report.pdf"

curl -X POST "http://localhost:8000/api/chat/query" \
-H "Content-Type: application/json" \
-d '{
"query": "What is the current DPI?",
"fund_id": 1
}'

- Use Docling to extract document structure
- Identify tables by headers (e.g., "Capital Call", "Distribution")
- Parse table rows and map to SQL schema
- Extract text paragraphs for vector storage
- Handle parsing errors gracefully
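The header-based identification step might start as a keyword heuristic like this sketch (keywords are illustrative assumptions; note the check order matters because "recallable" contains "call"):

```python
def classify_table(headers: list[str]) -> str:
    """Guess a table's type from its header row (heuristic sketch)."""
    joined = " ".join(h.lower() for h in headers)
    # Check distribution cues first: "recallable" would otherwise match "call".
    if "distribution" in joined or "recallable" in joined:
        return "distributions"
    if "adjustment" in joined or "rebalanc" in joined:
        return "adjustments"
    if "call" in joined:
        return "capital_calls"
    return "unknown"

assert classify_table(["Date", "Call Number", "Amount"]) == "capital_calls"
assert classify_table(["Date", "Type", "Amount", "Recallable"]) == "distributions"
```

Real reports vary their headers, so a production version would likely combine this with fuzzy matching or an LLM-based fallback.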
- Retrieval: Vector similarity search (top-k=5)
- Augmentation: Combine retrieved context with SQL data
- Generation: LLM generates answer with citations
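The augmentation step can be sketched as straightforward prompt assembly — the template below is illustrative, not the project's actual prompt:

```python
def build_prompt(question: str, chunks: list[str], sql_facts: dict) -> str:
    """Assemble the augmented prompt: retrieved text chunks + SQL results.
    The exact wording is a sketch; tune it for your chosen LLM."""
    context = "\n\n".join(f"[Source {i + 1}] {c}" for i, c in enumerate(chunks))
    facts = "\n".join(f"- {k}: {v}" for k, v in sql_facts.items())
    return (
        "Answer using ONLY the context below. Cite sources as [Source N].\n\n"
        f"Document context:\n{context}\n\n"
        f"Fund data (from SQL):\n{facts}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is the current DPI?",
    ["DPI measures cumulative distributions relative to paid-in capital."],
    {"DPI": "0.76x", "Total distributions": "$82.0M"},
)
```

Numbering the sources in the prompt is what makes the citation step cheap: the LLM can reference [Source N] and the frontend maps it back to document metadata.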
- Always validate input data before calculation
- Handle edge cases (zero PIC, missing data)
- Return calculation breakdown for transparency
- Cache results for performance
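Two of the edge-case guards above can be expressed directly. This is a sketch; the implemented metrics_calculator.py may handle these cases differently:

```python
def safe_dpi(distributions: float, pic: float):
    """DPI with a zero-PIC guard; None signals 'not computable' to the caller."""
    return distributions / pic if pic > 0 else None

def irr_defined(flows: list[float]) -> bool:
    """IRR needs at least one inflow and one outflow;
    with all flows the same sign, NPV never crosses zero."""
    return any(f > 0 for f in flows) and any(f < 0 for f in flows)

assert safe_dpi(4_000_000, 0) is None
assert not irr_defined([1_000_000, 2_000_000])  # all positive → IRR undefined
assert irr_defined([-5_000_000, 2_000_000])
```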
- "What does DPI mean?"
- "Explain Paid-In Capital"
- "What is a recallable distribution?"
- "What is the current DPI?"
- "Calculate the IRR for this fund"
- "Has the fund returned all capital to LPs?"
- "Show me all capital calls in 2024"
- "What was the largest distribution?"
- "List all adjustments"
- "How is the fund performing compared to industry benchmarks?"
- "What percentage of distributions were recallable?"
- "Explain the trend in capital calls over time"
- Document upload and parsing works
- Tables correctly stored in SQL
- Text stored in vector DB
- DPI calculation is accurate
- Basic RAG Q&A works
- Application runs via Docker
- Structure: Modular, separation of concerns (10pts)
- Readability: Clear naming, comments (10pts)
- Error Handling: Try-catch, validation (10pts)
- Type Safety: TypeScript, Pydantic (10pts)
- Parsing Accuracy: Table recognition (10pts)
- Calculation Accuracy: DPI, IRR (10pts)
- RAG Quality: Relevant answers (10pts)
- Intuitiveness: Easy to use (10pts)
- Feedback: Loading, errors, success (5pts)
- Design: Clean, consistent (5pts)
- README: Setup instructions (5pts)
- API Docs: Endpoint descriptions (3pts)
- Architecture: Diagrams (2pts)
- Dashboard implementation (+5pts)
- Charts/visualization (+3pts)
- Multi-fund support (+3pts)
- Test coverage (+5pts)
- Live deployment (+4pts)
- GitHub Repository (public or private with access)
- Complete source code (backend + frontend)
- Docker configuration (docker-compose.yml)
- Documentation (README, API docs, architecture)
- Sample data (at least one test PDF)
- Project overview
- Tech stack
- Setup instructions (Docker)
- Environment variables (.env.example)
- API testing examples
- Features implemented
- Known limitations
- Future improvements
- Screenshots (minimum 3)
- Recommended: 1 week (Phase 1-4)
- Maximum: 2 weeks (Phase 1-6)
- Push code to GitHub
- Test that docker-compose up works
- Send repository URL via email
- Include any special instructions
- Framework: FastAPI (Python 3.11+)
- Document Parser: Docling
- Vector DB: pgvector (PostgreSQL extension, 768-d vectors)
- SQL DB: PostgreSQL 15+
- ORM: SQLAlchemy
- LLM Framework: LangChain
- LLM: Gemini 2.0 Flash (or Groq, Ollama, OpenAI)
- Embeddings: Google text-embedding-004 (768-d, IVFFlat ANN index)
- Chat Storage: Redis (conversation history with TTL)
- Task Queue: Celery + Redis
- Framework: Next.js 14 (App Router)
- UI Library: shadcn/ui + Tailwind CSS
- State: Zustand or React Context
- Data Fetching: TanStack Query
- Charts: Recharts
- File Upload: react-dropzone
- Development: Docker + Docker Compose
- Deployment: Your choice (Vercel, Railway, AWS, etc.)
Problem: Docling can't extract tables
Solution:
- Check PDF format (ensure it's not scanned image)
- Add fallback parsing logic
- Manually define table structure patterns
Problem: OpenAI API is expensive
Solution: Use free alternatives (see "Free LLM Options" section below)
- Use caching for repeated queries
- Use cheaper models (gpt-3.5-turbo)
- Use local LLM (Ollama) for development
Problem: IRR returns NaN or extreme values
Solution:
- Validate cash flow sequence
- Check for missing dates
- Handle edge cases (all positive/negative flows)
Problem: Frontend can't call backend API
Solution:
- Add CORS middleware in FastAPI
- Allow origin: http://localhost:3000
- Check network configuration in Docker
You don't need to pay for OpenAI API! Here are free alternatives:
Completely free, runs locally on your machine
- Install Ollama
# Mac
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download

- Download a model
# Llama 3.2 (3B - fast, good for development)
ollama pull llama3.2
# Or Llama 3.1 (8B - better quality)
ollama pull llama3.1
# Or Mistral (7B - good balance)
ollama pull mistral

- Update your .env
# Use Ollama instead of OpenAI
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2

- Modify your code to use Ollama
# In backend/app/services/query_engine.py
from langchain_community.llms import Ollama
llm = Ollama(
base_url="http://localhost:11434",
model="llama3.2"
)

Pros: Free, private, no API limits, works offline
Cons: Requires decent hardware (8GB+ RAM), slower than cloud APIs
Free tier: 60 requests per minute
- Get free API key
- Go to https://makersuite.google.com/app/apikey
- Click "Create API Key"
- Copy your key
- Install package
pip install langchain-google-genai

- Update .env
GOOGLE_API_KEY=your-gemini-api-key
LLM_PROVIDER=gemini

- Use in code
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(
model="gemini-pro",
google_api_key=os.getenv("GOOGLE_API_KEY")
)

Pros: Free, fast, good quality
Cons: Rate limits, requires internet
Free tier: Very fast inference, generous limits
- Get free API key
- Go to https://console.groq.com
- Sign up and get API key
- Install package
pip install langchain-groq

- Update .env
GROQ_API_KEY=your-groq-api-key
LLM_PROVIDER=groq

- Use in code
from langchain_groq import ChatGroq
llm = ChatGroq(
api_key=os.getenv("GROQ_API_KEY"),
model="mixtral-8x7b-32768" # or "llama3-70b-8192"
)

Pros: Free, extremely fast, good quality
Cons: Rate limits, requires internet
Free inference API
- Get free token
- Go to https://huggingface.co/settings/tokens
- Create a token
- Update .env
HUGGINGFACE_API_TOKEN=your-hf-token
LLM_PROVIDER=huggingface

- Use in code
from langchain_community.llms import HuggingFaceHub
llm = HuggingFaceHub(
repo_id="mistralai/Mistral-7B-Instruct-v0.2",
huggingfacehub_api_token=os.getenv("HUGGINGFACE_API_TOKEN")
)

Pros: Free, many models available
Cons: Can be slow, rate limits
| Provider | Cost | Speed | Quality | Setup Difficulty |
|---|---|---|---|---|
| Ollama | Free | Medium | Good | Easy |
| Gemini | Free | Fast | Very Good | Very Easy |
| Groq | Free | Very Fast | Good | Very Easy |
| Hugging Face | Free | Slow | Varies | Easy |
| OpenAI | Paid | Fast | Excellent | Very Easy |
For Development/Testing:
- Use Ollama with llama3.2 (free, no limits)
For Production/Demo:
- Use Groq or Gemini (free tier is generous)
If you have budget:
- Use OpenAI GPT-4 (best quality)
All sample files are located in files/ directory:
- ILPA based Capital Accounting and Performance Metrics_ PIC, Net PIC, DPI, IRR.pdf - Reference document with definitions and explanations
- Contains: PIC, DPI, IRR, TVPI, RVPI definitions
- Use for: Testing text extraction, RAG retrieval, definition queries
- Size: ~80KB
- Sample_Fund_Tech_Ventures_III.pdf ⭐ Recommended
- Early-stage venture fund
- Metrics: DPI 0.76x, IRR 5.04%, TVPI 2.00x
- Capital: $107.9M invested
- Transactions: 4 calls, 4 distributions, 3 adjustments
- Sample_Fund_Growth_Capital_IV.pdf
- Growth equity fund (larger)
- Metrics: DPI 0.93x, IRR 5.40%, TVPI 2.00x
- Capital: $457.1M invested
- Transactions: 4 calls, 4 distributions, 3 adjustments
- Sample_Fund_Innovation_II.pdf
- Innovation/tech fund
- Metrics: DPI 0.59x, IRR 6.54%, TVPI 2.00x
- Capital: $63.2M invested
- Transactions: 4 calls, 4 distributions, 3 adjustments
For comprehensive testing, you should create mock fund performance reports with:
Date | Call Number | Amount | Description
-----------|-------------|-------------|------------------
2023-01-15 | Call 1 | $5,000,000 | Initial Capital
2023-06-20 | Call 2 | $3,000,000 | Follow-on
2024-03-10 | Call 3 | $2,000,000 | Bridge Round
Date | Type | Amount | Recallable | Description
-----------|-------------|-------------|------------|------------------
2023-12-15 | Return | $1,500,000 | No | Exit: Company A
2024-06-20 | Income | $500,000 | No | Dividend
2024-09-10 | Return | $2,000,000 | Yes | Partial Exit: Company B
Date | Type | Amount | Description
-----------|---------------------|-----------|------------------
2024-01-15 | Recallable Dist | -$500,000 | Recalled distribution
2024-03-20 | Capital Call Adj | $100,000 | Fee adjustment
For the sample data above:
- Total Capital Called: $10,000,000
- Total Distributions: $4,000,000
- Net PIC: $10,100,000 (after adjustments)
- DPI: ~0.40 (4M / 10.1M Net PIC)
- IRR: ~8-12% (depends on exact dates)
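A quick sanity check of these expected results (IRR is omitted since it depends on exact dates; the $10.1M Net PIC figure is taken from the summary above rather than recomputed):

```python
# Totals from the mock capital call and distribution tables above.
calls = [5_000_000, 3_000_000, 2_000_000]
distributions = [1_500_000, 500_000, 2_000_000]
net_pic = 10_100_000  # stated Net PIC after adjustments

assert sum(calls) == 10_000_000
assert sum(distributions) == 4_000_000
dpi = sum(distributions) / net_pic
print(f"DPI = {dpi:.2f}")  # → DPI = 0.40
```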
We've included a Python script to generate sample PDFs:
cd files/
pip install reportlab
python create_sample_pdf.py

This creates Sample_Fund_Performance_Report.pdf with:
- Capital calls table (4 entries)
- Distributions table (4 entries)
- Adjustments table (3 entries)
- Performance summary with definitions
You can create PDFs using:
- Google Docs/Word → Export as PDF
- Python libraries (reportlab, fpdf)
- Online PDF generators
Tip: Start with simple, well-structured tables before handling complex layouts.
- Docling: https://github.com/DS4SD/docling
- LangChain RAG: https://python.langchain.com/docs/use_cases/question_answering/
- FAISS: https://faiss.ai/
- ILPA Guidelines: https://ilpa.org/
- PE Metrics: https://www.investopedia.com/terms/d/dpi.asp
- Start Simple: Get Phase 1-4 working before adding features
- Test Early: Test document parsing with sample PDF immediately
- Use Tools: Leverage LangChain, shadcn/ui to save time
- Focus on Core: Perfect the RAG pipeline and calculations first
- Document Well: Clear README helps evaluators understand your work
- Handle Errors: Graceful error handling shows maturity
- Ask Questions: If requirements are unclear, document your assumptions
For questions about this coding challenge:
- Open an issue in this repository
- Email: [your-contact-email]
Good luck! Build something amazing!
PIC = Capital Contributions (Gross) - Adjustments
DPI = Cumulative Distributions / PIC
Cumulative Distributions =
Return of Capital +
Dividends Paid +
Interest Paid +
Realized Gains Distributed -
(Fees & Carried Interest Withheld)
Adjustments = Σ (Rebalance of Distribution + Rebalance of Capital Call)
- Nature: Clawback of over-distributed amounts
- Recording: Contribution (-)
- DPI Impact: Numerator ↓, Denominator ↑ → DPI ↓
- Nature: Refund of over-called capital
- Recording: Distribution (+)
- DPI Impact: Denominator ↓, Numerator unchanged → Requires flag to prevent DPI inflation
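The two directional effects can be checked numerically. The amounts below are illustrative; the signs follow the arrows above:

```python
def dpi(dists: float, pic: float) -> float:
    return dists / pic

# Baseline: $4M distributed against $10M paid-in.
base = dpi(4_000_000, 10_000_000)                          # 0.400

# Rebalance of distribution (clawback of $0.5M, recorded as a contribution):
# distributions fall AND paid-in rises, so DPI drops on both ends.
clawback = dpi(4_000_000 - 500_000, 10_000_000 + 500_000)  # ≈ 0.333

# Rebalance of capital call (refund of $0.5M over-called capital): the cash
# arrives like a distribution but must be flagged so that only the
# denominator shrinks, otherwise DPI would be artificially inflated.
refund = dpi(4_000_000, 10_000_000 - 500_000)              # ≈ 0.421

assert clawback < base < refund
```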
Version: 1.0 Last Updated: 2025-10-06 Author: InterOpera-Apps Hiring Team