Build an AI-powered fund performance analysis system that enables Limited Partners (LPs) to:
- Upload fund performance PDF documents
- Automatically parse and extract structured data (tables → SQL, text → Vector DB)
- Ask natural language questions about fund metrics (DPI, IRR, etc.)
- Get accurate answers powered by RAG (Retrieval Augmented Generation) and SQL calculations
As an LP, you receive quarterly fund performance reports in PDF format. These documents contain:
- Capital Call tables: When and how much capital was called
- Distribution tables: When and how much was distributed back to LPs
- Adjustment tables: Rebalancing entries (recallable distributions, capital call adjustments)
- Text explanations: Definitions, investment strategies, market commentary
Your task: Build a system that automatically processes these documents and answers questions like:
- "What is the current DPI of this fund?"
- "Has the fund returned all invested capital to LPs?"
- "What does 'Paid-In Capital' mean in this context?"
- "Show me all capital calls in 2024"
This repository contains a project scaffold to help you get started quickly:
- Docker Compose configuration (PostgreSQL, Redis, Backend, Frontend)
- Database schema and models (SQLAlchemy)
- Basic API structure (FastAPI with endpoints)
- Frontend boilerplate (Next.js with TailwindCSS)
- Environment configuration
- Upload page layout
- Chat interface layout
- Fund dashboard layout
- Navigation and routing
- DPI (Distributions to Paid-In) - Fully implemented
- IRR (Internal Rate of Return) - Using numpy-financial
- PIC (Paid-In Capital) - With adjustments
- Calculation breakdown API - Shows all cash flows and transactions for debugging
- Located in: backend/app/services/metrics_calculator.py
Debugging Features:
- View all capital calls, distributions, and adjustments used in calculations
- See cash flow timeline for IRR calculation
- Verify intermediate values (total calls, total distributions, etc.)
- Trace calculation steps with detailed explanations
- Reference PDF: ILPA metrics explanation document
- Sample Fund Report: Generated with realistic data
- PDF Generator Script: files/create_sample_pdf.py
- Expected Results: Documented for validation
The following core functionalities are NOT implemented and need to be built by you:
- PDF parsing with Docling (integrate and test)
- Table detection and extraction logic
- Intelligent table classification (capital calls vs distributions vs adjustments)
- Data validation and cleaning
- Error handling for malformed PDFs
- Background task processing (Celery integration)
Files to implement:
- backend/app/services/document_processor.py (skeleton provided)
- backend/app/services/table_parser.py (needs implementation)
- Text chunking strategy implementation
- Embedding generation
- FAISS index creation and management
- Semantic search implementation
- Context retrieval for LLM
- Prompt engineering for accurate responses
Files to implement:
- backend/app/services/vector_store.py (pgvector implementation with TODOs)
- backend/app/services/rag_engine.py (needs implementation)
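The 1000-character / 200-character-overlap chunking strategy mentioned elsewhere in this README can be sketched with a simple sliding window. This is a minimal illustration, not the skeleton's actual code:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with overlapping windows."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

# 2500 chars with 1000/200 windows → chunks of 1000, 1000, 900 chars.
chunks = chunk_text("a" * 2500, chunk_size=1000, overlap=200)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks.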
Note: This project uses pgvector instead of FAISS. pgvector is a PostgreSQL extension that stores vectors directly in your database, eliminating the need for a separate vector database.
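For intuition, the cosine distance that pgvector's vector_cosine_ops computes (exposed via the `<=>` operator in SQL) looks like this in pure Python — in production the computation runs inside PostgreSQL, not in application code:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as used by pgvector's vector_cosine_ops: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Identical directions → distance 0; orthogonal → distance 1.
assert cosine_distance([1, 0], [2, 0]) == 0.0
assert cosine_distance([1, 0], [0, 1]) == 1.0
```

Semantic search is then just "order rows by this distance to the query embedding and take the top-k".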
- Intent classifier (calculation vs definition vs retrieval)
- Query router logic
- LLM integration
- Response formatting
- Source citation
- Conversation context management
Files to implement:
- backend/app/services/query_engine.py (needs implementation)
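One possible starting point for the intent classifier is a keyword heuristic like the sketch below. The cue lists are illustrative assumptions; a production router might use an LLM call or a trained classifier instead:

```python
def classify_intent(query: str) -> str:
    """Naive keyword router: calculation vs definition vs retrieval."""
    q = query.lower()
    definition_cues = ("what does", "what is a", "mean", "explain", "define")
    calculation_cues = ("calculate", "dpi", "irr", "tvpi", "how much", "returned")
    if any(cue in q for cue in definition_cues):
        return "definition"
    if any(cue in q for cue in calculation_cues):
        return "calculation"
    return "retrieval"  # fall through to vector search + SQL

assert classify_intent("What does 'Paid-In Capital' mean?") == "definition"
assert classify_intent("Calculate the current DPI") == "calculation"
assert classify_intent("Show me all capital calls in 2024") == "retrieval"
```

Ambiguous queries ("What is DPI?" vs "What is the current DPI?") are where a keyword approach breaks down and an LLM-based classifier earns its cost.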
- End-to-end document upload flow
- API integration tests
- Error handling and logging
- Performance optimization
Note: Metrics calculation is already implemented. You can focus on document processing and RAG!
┌─────────────────────────────────────────────────────────────┐
│ Frontend (Next.js) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Upload │ │ Chat │ │ Funds │ │ Compare │ │
│ │ Page │ │ History │ │Dashboard │ │ Page │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────┬────────────────────────────────────┘
│ REST API
┌────────────────────────▼────────────────────────────────────┐
│ Backend (FastAPI) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Document Processor │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Docling │────────▶│ Table │ │ │
│ │ │ Parser │ │ Extractor │ │ │
│ │ └──────────────┘ └──────┬───────┘ │ │
│ │ │ │ │
│ │ ┌──────────────┐ ┌──────▼───────┐ │ │
│ │ │ Text │────────▶│ Embedding │ │ │
│ │ │ Chunker │ │ Generator │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Query Engine (RAG) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ │ │
│ │ │ Intent │─▶│ Vector │─▶│ LLM │ │ │
│ │ │ Classifier │ │ Search │ │ Response │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────┘ │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Metrics │─▶│ SQL │ │ │
│ │ │ Calculator │ │ Queries │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
└────────────────────────┬────────────────────────────────────┘
┌───────────────┬┴──────────┬──────────────┐
│ │ │ │
PostgreSQL Celery Worker Redis Gemini/Groq
(pgvector + (Background (Task (LLM)
Transactions) Tasks) Queue)
User Query
↓
Query Engine Classifier
├→ [Calculation] → Metrics API → PIC/DPI/IRR/TVPI
├→ [Definition] → RAG Pipeline (retrieval + LLM)
└→ [Retrieval] → Vector Search + SQL Query
↓
Vector Search (Top-5)
↓
Context Aggregation (RAG documents + SQL results + Memory)
↓
LLM Generate Answer (Gemini/Groq/Ollama)
↓
Format Response with Citations
↓
Return to Frontend
1. User sends query via Chat UI
↓
2. Frontend: POST /api/chat/query (query, fund_id, conversation_id)
↓
3. Backend Query Engine:
├─→ Classify intent (calculation/definition/retrieval)
├─→ If Calculation: Call Metrics API → SQL joins
├─→ If Retrieval: Vector search (pgvector) + SQL aggregation
└─→ Context → LLM with memory
↓
4. LLM Response with sources
↓
5. Store in Redis (conversation history with TTL)
↓
6. Return formatted response to Frontend
↓
7. Display in Chat UI with history sidebar
PostgreSQL (Persistent):
- Structured transaction data (capital calls, distributions, adjustments)
- Document metadata and references
- Vector embeddings for RAG (pgvector extension)
Redis (Ephemeral):
- Chat conversation history (with optional TTL)
- Session data
- Task queue for background processing (Celery)
CREATE TABLE funds (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
gp_name VARCHAR(255),
fund_type VARCHAR(100),
vintage_year INTEGER,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE capital_calls (
id SERIAL PRIMARY KEY,
fund_id INTEGER REFERENCES funds(id),
call_date DATE NOT NULL,
call_type VARCHAR(100),
amount DECIMAL(15, 2) NOT NULL,
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE distributions (
id SERIAL PRIMARY KEY,
fund_id INTEGER REFERENCES funds(id),
distribution_date DATE NOT NULL,
distribution_type VARCHAR(100),
is_recallable BOOLEAN DEFAULT FALSE,
amount DECIMAL(15, 2) NOT NULL,
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE adjustments (
id SERIAL PRIMARY KEY,
fund_id INTEGER REFERENCES funds(id),
adjustment_date DATE NOT NULL,
adjustment_type VARCHAR(100),
category VARCHAR(100),
amount DECIMAL(15, 2) NOT NULL,
is_contribution_adjustment BOOLEAN DEFAULT FALSE,
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE documents (
id SERIAL PRIMARY KEY,
fund_id INTEGER REFERENCES funds(id),
file_name VARCHAR(255) NOT NULL,
file_path VARCHAR(500),
upload_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
parsing_status VARCHAR(50) DEFAULT 'pending',
error_message TEXT
);

CREATE TABLE document_embeddings (
id SERIAL PRIMARY KEY,
document_id INTEGER REFERENCES documents(id),
fund_id INTEGER REFERENCES funds(id),
content TEXT NOT NULL,
embedding vector(768), -- pgvector extension (768-d Gemini embeddings)
metadata JSONB, -- Document source, section, page number, etc.
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- IVFFlat index for similarity search (for vectors ≤2000 dimensions)
CREATE INDEX idx_document_embeddings_embedding ON document_embeddings
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
-- GIN index for metadata filtering
CREATE INDEX idx_embedding_metadata ON document_embeddings USING GIN (metadata);

Conversations stored in Redis (not PostgreSQL):
- Key pattern: "conversation:{conversation_id}:meta" → fund_id, title, timestamps
- Key pattern: "conversation:{conversation_id}:messages" → [role, content, timestamp]
Benefits:
- Fast retrieval and updates
- Automatic TTL expiration (configurable)
- Fallback to in-memory dict if Redis unavailable
Example Redis structure:
conversation:uuid-1:meta → {fund_id: 1, title: "Fund Analysis", updated_at: "..."}
conversation:uuid-1:messages → [{role: "user", content: "What is DPI?"}, ...]
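The in-memory fallback mentioned above can mirror the same key layout. This sketch is illustrative — the real implementation would swap the dict for a redis-py client and add EXPIRE for TTL:

```python
import json
import time

class ConversationStore:
    """In-memory fallback mirroring the Redis key layout above.
    With Redis available, the same keys map to SET/RPUSH plus EXPIRE for TTL."""

    def __init__(self) -> None:
        self._data: dict = {}

    def create(self, conv_id: str, fund_id: int, title: str) -> None:
        self._data[f"conversation:{conv_id}:meta"] = {
            "fund_id": fund_id, "title": title, "updated_at": time.time()}
        self._data[f"conversation:{conv_id}:messages"] = []

    def append(self, conv_id: str, role: str, content: str) -> None:
        # Messages stored as JSON strings, as they would be in a Redis list.
        self._data[f"conversation:{conv_id}:messages"].append(
            json.dumps({"role": role, "content": content}))

    def history(self, conv_id: str) -> list:
        raw = self._data.get(f"conversation:{conv_id}:messages", [])
        return [json.loads(m) for m in raw]

store = ConversationStore()
store.create("uuid-1", fund_id=1, title="Fund Analysis")
store.append("uuid-1", "user", "What is DPI?")
```

Keeping the key scheme identical in both backends means the query engine never needs to know which store is active.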
PostgreSQL:
funds
├─→ documents (one-to-many)
│ └─→ document_embeddings (one-to-many) [pgvector + metadata]
├─→ capital_calls (one-to-many)
├─→ distributions (one-to-many)
└─→ adjustments (one-to-many)
Redis (Chat History):
conversations/{fund_id}
└─→ messages (stored as JSON arrays)
- Docker setup with PostgreSQL, Redis, backend, frontend
- FastAPI backend with full CRUD endpoints + health checks
- Next.js frontend with complete layout and routing (6+ pages)
- Database schema implementation (7 tables + pgvector)
- Environment configuration (.env.example with all required keys)
- Comprehensive error handling with ErrorBoundary
- Toast provider for user notifications
- File upload API endpoint with file validation
- Docling integration for PDF parsing and structure extraction
- Table extraction with intelligent classification (capital calls/distributions/adjustments)
- Text chunking (1000-char chunks, 200-char overlap)
- Metadata extraction from PDFs (fund name, GP, vintage year, strategy)
- Parsing status tracking with detailed error messages
- Celery background task processing with Redis queue
- Automatic fund assignment to documents
- Extract-metadata endpoint for pre-upload validation
- pgvector setup (PostgreSQL extension, 768-dimensional vectors)
- Embedding generation (Google text-embedding-004 API)
- IVFFlat ANN index for similarity search (100 lists)
- JSONB metadata filtering with GIN index
- LangChain integration with Gemini 2.0 Flash LLM
- RAG engine with document retrieval and context aggregation
- Complete chat interface with conversation history
- Intent classification (calculation vs definition vs retrieval)
- Query routing with conversation memory
- Source citations in LLM responses
- DPI calculation (Cumulative Distributions / PIC)
- IRR calculation with chronologically sorted cash flows
- TVPI, RVPI, NAV calculation from reported document values
- PIC calculation with adjustments (capital calls - adjustments)
- Metrics API endpoints with full breakdown
- Query engine integration with SQL joins and aggregations
- Calculation transparency showing all intermediate values
- Validation of cash flow sequences
- Fund list page with metrics and fund selector
- Fund detail page with 3 charts (distributions, cumulative flow, DPI/TVPI)
- Transaction tables with traditional pagination
- Sortable columns with visual indicators
- Date range and type filtering
- Results counter
- Error handling improvements (ErrorBoundary + try-catch)
- Loading states with spinners
- Toast notifications (auto-dismiss)
- Conversation history (Redis with TTL)
- Fund-specific chat conversations
- Multi-fund comparison page (9 metrics, 3 charts)
- CSV export for transactions
- Celery worker for background processing
- Delete conversations feature
- Delete documents feature
- Delete funds with cascade delete
- Advanced filtering and sorting
- System diagrams (RAG, Chat flow, ER)
- Chat History Persistence: Conversations stored in Redis with optional TTL (not persistent across server restarts without Redis persistence config)
- Embedding Dimension: Limited to 768-d vectors (Gemini model). IVFFlat index not available for >2000-d vectors
- PDF Support: Optimized for well-structured PDFs with clear tables; scanned PDFs or complex layouts may have lower accuracy
- LLM Responses: Quality depends on selected LLM provider (Gemini, Groq, etc.). Rate limits apply to free tiers
- Background Processing: Document processing may take 5-30 seconds depending on PDF size and server capacity
- Database: Single PostgreSQL instance (no replication/clustering for HA)
- Vector Index: 768-d vectors require ~600MB disk space per 100K documents
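The ~600MB figure follows from simple arithmetic, assuming float4 vector components and roughly 2x overhead for the IVFFlat index and row metadata (the overhead factor is an estimate, not a measured value):

```python
# Back-of-envelope check of the disk estimate above.
dims = 768
bytes_per_float = 4          # pgvector stores float4 components
docs = 100_000
raw_mb = dims * bytes_per_float * docs / 1024**2   # raw vector payload
with_index_mb = raw_mb * 2   # assumed ~2x for IVFFlat index + row overhead
print(round(raw_mb), round(with_index_mb))  # ≈ 293, 586
```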
- Implement Redis persistence for conversation durability
- Add caching layer (Redis) for frequently calculated metrics
- Batch embedding generation for faster PDF processing
- Implement query result pagination in RAG retrieval
- Live deployment with CI/CD pipeline (GitHub Actions)
- Playwright E2E tests with GitHub Actions integration
- Custom calculation formulas (user-defined PIC/DPI calculations)
- Support for XLSX/Excel exports (currently CSV only)
- Multi-language support for chat interface
- Real-time collaboration (multiple users per fund)
- Webhooks for external integrations
- Kubernetes deployment configuration
- Database connection pooling (pgBouncer)
- CDN integration for frontend assets
- Monitoring dashboard (Prometheus/Grafana)
- Comprehensive logging (ELK stack)
- API rate limiting per user
- Support for more table formats (nested, sparse)
- Scanned PDF OCR support
- Fuzzy matching for fund names
- Handling of multi-currency documents
- Fund Performance Analysis System for LPs
- Automatic PDF parsing, RAG Q&A, metrics calculation
- Docker deployment, Next.js + FastAPI stack
- Backend: FastAPI, Docling, pgvector, LangChain, Gemini 2.0 Flash, Celery, Redis
- Frontend: Next.js 14, shadcn/ui, Tailwind CSS, Recharts
- Infrastructure: PostgreSQL, Redis, Docker Compose
- Documented in "Quick Start" section (steps 1-6)
- docker-compose up -d command included
- Health check verification steps
- .env.example file in repository
- All required keys documented
- Free API key alternatives provided (Gemini, Groq, Ollama)
- Documented in "Testing" section
- curl examples for upload, metrics, chat
- Backend endpoint list with descriptions
- Comprehensive "Features Implemented" section above (96 items across 6 phases)
- All checkboxes marked [x]
- "Known Limitations" section above (8 limitations)
- Explains trade-offs and constraints
- "Future Improvements" section above (20+ planned items)
- Organized by category (Performance, Features, Infrastructure, Accuracy)
Screenshots are stored in docs/screenshots/ and embedded in this README below.
Upload PDF documents with automatic fund metadata extraction (name, GP, vintage year, strategy).
Features shown:
- File drag-and-drop zone
- Auto-extracted metadata from PDF
- Fund selector (create new or assign to existing)
- Upload status and progress
Ask natural language questions about fund metrics and get RAG-powered answers with citations.
Features shown:
- Fund selector dropdown
- Conversation history sidebar
- Chat messages with LLM responses
- Source citations from documents
- New conversation button
View comprehensive fund metrics, performance charts, and transaction details.
Features shown:
- Fund name and key metrics (DPI, IRR, TVPI)
- 3 charts (Distribution by Type, Cumulative Flow, DPI vs TVPI)
- Transaction tables (Capital Calls, Distributions, Adjustments)
- Sortable and filterable columns
- Export to CSV button
Compare metrics across multiple funds side-by-side.
Features shown:
- Fund selection checkboxes
- 9-metric comparison table (DPI, IRR, TVPI, RVPI, NAV, PIC, etc.)
- 3 comparison charts (DPI vs TVPI, Metrics bar chart, etc.)
- Color-coded fund rows
- Detailed breakdown per fund
View all uploaded documents with parsing status and management options.
Features shown:
- Document list with fund assignment
- Parsing status (pending/success/error)
- Delete document button
- Upload date and file size
- Error message display
- Docker & Docker Compose
- Node.js 18+ (for local frontend development)
- Python 3.11+ (for local backend development)
- OpenAI API key (or use free alternatives - see below)
- Clone the repository
git clone <your-repo-url>
cd fund-analysis-system
- Set up environment variables
# Copy example env file
cp .env.example .env
# Edit .env and select an LLM provider (see .env.example for all options):
#
# RECOMMENDED: Google Gemini (Free Tier - 60 requests/min)
# 1. Get API key: https://makersuite.google.com/app/apikey
# 2. Set: LLM_PROVIDER=gemini
# 3. Set: GOOGLE_API_KEY=your-api-key
#
# ALTERNATIVE: Groq (Free Tier - Very Fast)
# 1. Get API key: https://console.groq.com
# 2. Set: LLM_PROVIDER=groq
# 3. Set: GROQ_API_KEY=your-api-key
#
# ALTERNATIVE: Ollama (Local - Free, No Rate Limits)
# 1. Install: brew install ollama (Mac) or https://ollama.com
# 2. Run: ollama pull llama3.2
# 3. Set: LLM_PROVIDER=ollama
#
# ALTERNATIVE: OpenAI (Paid)
# 1. Get API key: https://platform.openai.com/account/api-keys
# 2. Set: LLM_PROVIDER=openai
# 3. Set: OPENAI_API_KEY=sk-...
- Start with Docker Compose
docker-compose up -d
- Access the application
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- API Docs: http://localhost:8000/docs
- Upload sample document
- Navigate to http://localhost:3000/upload
- Upload one of the provided sample PDFs:
- files/ILPA based Capital Accounting and Performance Metrics_ PIC, Net PIC, DPI, IRR.pdf (reference document with definitions)
- files/Sample_Fund_Tech_Ventures_III.pdf (sample data - recommended)
- files/Sample_Fund_Growth_Capital_IV.pdf (sample data)
- files/Sample_Fund_Innovation_II.pdf (sample data)
- Wait for parsing to complete (shows progress bar)
- Start asking questions
- Go to http://localhost:3000/chat
- Select the fund you just uploaded
- Try: "What is DPI?"
- Try: "Calculate the current DPI for this fund"
- Try: "Show me all capital calls"
- Try: "What is a recallable distribution?"
fund-analysis-system/
├── backend/
│ ├── app/
│ │ ├── api/
│ │ │ ├── __init__.py
│ │ │ ├── deps.py
│ │ │ └── endpoints/
│ │ │ ├── __init__.py
│ │ │ ├── documents.py # Upload, extract, list, delete
│ │ │ ├── funds.py # CRUD + metrics + transactions
│ │ │ ├── chat.py # Query + conversations
│ │ │ └── metrics.py # DPI, IRR, TVPI calculations
│ │ ├── core/
│ │ │ ├── __init__.py
│ │ │ ├── config.py # Settings, LLM provider config
│ │ │ └── celery_app.py # Celery worker configuration
│ │ ├── models/
│ │ │ ├── __init__.py
│ │ │ ├── fund.py # Fund, conversation models
│ │ │ ├── document.py # Document model
│ │ │ └── transaction.py # CapitalCall, Distribution, Adjustment
│ │ ├── services/
│ │ │ ├── __init__.py
│ │ │ ├── document_processor.py # Docling integration, chunking
│ │ │ ├── table_parser.py # Table extraction, classification
│ │ │ ├── vector_store.py # pgvector, embeddings, search
│ │ │ ├── rag_engine.py # RAG retrieval + augmentation
│ │ │ ├── query_engine.py # Intent classification, routing
│ │ │ ├── metrics_calculator.py # DPI, IRR, TVPI, NAV calculations
│ │ │ └── celery_tasks/
│ │ │ ├── __init__.py
│ │ │ └── document_tasks.py # Background document processing
│ │ ├── db/
│ │ │ ├── __init__.py
│ │ │ ├── session.py # Database session management
│ │ │ └── init_db.py # Database initialization
│ │ ├── schemas/ # Pydantic schemas (inline in endpoints)
│ │ └── main.py # FastAPI app, CORS, routes
│ │
│ ├── tests/
│ │ ├── __pycache__/
│ │ ├── test_chat_api.py
│ │ ├── test_chat_api_retrieval_sql.py
│ │ ├── test_documents_api.py
│ │ ├── test_documents_api_processing.py
│ │ ├── test_funds_transactions_csv.py
│ │ ├── test_metrics_api.py
│ │ ├── test_pipeline.py
│ │ ├── test_query_engine_*.py # (6 routing/process/retrieval tests)
│ │ └── test_table_parser_*.py # (2 basic tests)
│ │
│ ├── requirements.txt
│ ├── Dockerfile # Production-ready image
│ └── .dockerignore
│
├── frontend/
│ ├── app/
│ │ ├── layout.tsx # Root layout with providers
│ │ ├── page.tsx # Home/redirect page
│ │ ├── globals.css # Global tailwind styles
│ │ ├── upload/
│ │ │ └── page.tsx # PDF upload with metadata extraction
│ │ ├── chat/
│ │ │ ├── page.tsx # Chat interface + history sidebar
│ │ │ └── ChatContent.tsx # Chat message display
│ │ ├── funds/
│ │ │ ├── page.tsx # Fund list with metrics
│ │ │ ├── [id]/
│ │ │ │ └── page.tsx # Fund detail + 3 charts + 3 tables
│ │ │ ├── compare/
│ │ │ │ └── page.tsx # Multi-fund comparison
│ │ │ └── documents/
│ │ │ └── page.tsx # Document management
│ │ └── README.md
│ │
│ ├── components/
│ │ ├── ErrorBoundary.tsx # React error boundary
│ │ ├── Navigation.tsx # Top navbar/header
│ │ ├── ToastProvider.tsx # Toast notification system
│ │ └── TransactionTableWithFilters.tsx # Table with sorting/filtering
│ │
│ ├── lib/
│ │ ├── api.ts # API client (funds, chat, documents)
│ │ └── utils.ts # Formatting utilities
│ │
│ ├── public/ # Static assets
│ ├── package.json # Dependencies
│ ├── tsconfig.json # TypeScript config
│ ├── next.config.js # Next.js config
│ ├── tailwind.config.ts # Tailwind CSS config
│ ├── postcss.config.js # PostCSS config
│ ├── Dockerfile # Production image
│ └── .dockerignore
│
├── docker-compose.yml # PostgreSQL + Redis + backend + frontend
├── .env.example # Environment template (Gemini, Groq, etc.)
├── .railwayignore # (optional) Railway deployment
│
├── docs/
│ ├── API.md # API endpoint documentation
│ ├── ARCHITECTURE.md # System design & data flow
│ ├── CALCULATIONS.md # Metrics formulas (DPI, IRR, TVPI)
│ ├── SCREENSHOTS.md # How to capture/add screenshots
│ └── screenshots/ # Screenshot images (01-upload.png, etc.)
│
├── files/
│ ├── ILPA based Capital...pdf # Reference document (definitions)
│ ├── Sample_Fund_*.pdf # 3 sample fund reports
│ ├── create_sample_pdf.py # PDF generator script
│ └── README.md # Sample data guide
│
├── AGENTS.md # Feature tracking & context inventory
├── SETUP.md # Installation & Docker troubleshooting
├── TROUBLESHOOTING.md # Common issues & solutions
├── README.md # This file
└── .gitignore
Backend Core:
- backend/app/main.py - FastAPI app with all endpoints
- backend/app/core/config.py - Settings & LLM provider config
- backend/requirements.txt - Python dependencies
Backend Services (RAG + Parsing):
- backend/app/services/document_processor.py - PDF parsing (Docling)
- backend/app/services/table_parser.py - Table extraction
- backend/app/services/vector_store.py - pgvector embeddings
- backend/app/services/rag_engine.py - RAG retrieval
Backend Processing (Async):
- backend/app/core/celery_app.py - Celery configuration
- backend/app/services/celery_tasks/document_tasks.py - Background tasks
Frontend Pages:
- Upload: frontend/app/upload/page.tsx
- Chat: frontend/app/chat/page.tsx
- Funds: frontend/app/funds/page.tsx, [id]/page.tsx, compare/page.tsx
Docker & Deployment:
- docker-compose.yml - Local dev environment
- backend/Dockerfile - FastAPI image
- frontend/Dockerfile - Next.js image
Documentation:
- README.md - Main documentation (this file)
- docs/API.md - REST API reference
- docs/SCREENSHOTS.md - Screenshot capture guide
POST /api/documents/upload # Upload PDF with auto-assignment
POST /api/documents/extract-metadata # Extract fund metadata from PDF
GET /api/documents/ # List all documents (paginated)
GET /api/documents/{doc_id}/status # Get parsing status
GET /api/documents/{doc_id} # Get document details
DELETE /api/documents/{doc_id} # Delete document + embeddings
GET /api/funds # List all funds with metrics
POST /api/funds # Create new fund
GET /api/funds/{fund_id} # Get fund details + metrics
PUT /api/funds/{fund_id} # Update fund info
DELETE /api/funds/{fund_id} # Delete fund (cascade delete)
GET /api/funds/{fund_id}/transactions # Get capital calls/distributions/adjustments (paginated)
GET /api/funds/{fund_id}/transactions.csv # Export transactions as CSV
GET /api/funds/{fund_id}/metrics # Get calculated metrics (DPI/IRR/TVPI/RVPI/NAV)
POST /api/chat/query # Submit query + get response
GET /api/chat/conversations # List conversations for fund
POST /api/chat/conversations # Create new conversation
GET /api/chat/conversations/{conv_id} # Get conversation history
DELETE /api/chat/conversations/{conv_id} # Delete conversation
GET /api/metrics/funds/{fund_id}/metrics # Get all calculated metrics
See API.md for detailed documentation.
PIC = Total Capital Calls - Adjustments
DPI = Cumulative Distributions / PIC
IRR = Rate where NPV of all cash flows = 0
Uses numpy-financial.irr() function
See CALCULATIONS.md for detailed formulas.
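As a dependency-free illustration of these formulas (the project itself uses numpy-financial's irr()), DPI and a periodic IRR can be sketched as follows — the bisection solver here stands in for numpy_financial.irr and the cash flows are made up:

```python
def npv(rate: float, flows: list[float]) -> float:
    """Net present value of periodic cash flows (flows[0] at t=0)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(flows))

def irr(flows: list[float], lo: float = -0.99, hi: float = 10.0) -> float:
    """Periodic IRR by bisection: find the rate where NPV crosses zero."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if npv(lo, flows) * npv(mid, flows) <= 0:
            hi = mid  # root lies in [lo, mid]
        else:
            lo = mid
    return (lo + hi) / 2

# Illustrative fund: one $10M call, then three $4M distributions.
flows = [-10_000_000, 4_000_000, 4_000_000, 4_000_000]
pic, dists = 10_000_000, 12_000_000
dpi = dists / pic   # 1.2x — the fund has returned 120% of paid-in capital
rate = irr(flows)   # ≈ 9.7% periodic IRR
```

Unlike DPI, IRR is sensitive to *when* each flow occurs, which is why the implemented calculator sorts cash flows chronologically first.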
cd backend
pytest tests/ -v --cov=app

cd frontend
npm test

curl -X POST "http://localhost:8000/api/documents/upload" \
  -F "file=@files/sample_fund_report.pdf"

curl -X POST "http://localhost:8000/api/chat/query" \
-H "Content-Type: application/json" \
-d '{
"query": "What is the current DPI?",
"fund_id": 1
}'

- Use Docling to extract document structure
- Identify tables by headers (e.g., "Capital Call", "Distribution")
- Parse table rows and map to SQL schema
- Extract text paragraphs for vector storage
- Handle parsing errors gracefully
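The header-based identification step might start as a keyword heuristic like this sketch (keywords are illustrative assumptions; note the check order matters because "recallable" contains "call"):

```python
def classify_table(headers: list[str]) -> str:
    """Guess a table's type from its header row (heuristic sketch)."""
    joined = " ".join(h.lower() for h in headers)
    # Check distribution cues first: "recallable" would otherwise match "call".
    if "distribution" in joined or "recallable" in joined:
        return "distributions"
    if "adjustment" in joined or "rebalanc" in joined:
        return "adjustments"
    if "call" in joined:
        return "capital_calls"
    return "unknown"

assert classify_table(["Date", "Call Number", "Amount"]) == "capital_calls"
assert classify_table(["Date", "Type", "Amount", "Recallable"]) == "distributions"
```

Real reports vary their headers, so a production version would likely combine this with fuzzy matching or an LLM-based fallback.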
- Retrieval: Vector similarity search (top-k=5)
- Augmentation: Combine retrieved context with SQL data
- Generation: LLM generates answer with citations
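The augmentation step can be sketched as straightforward prompt assembly — the template below is illustrative, not the project's actual prompt:

```python
def build_prompt(question: str, chunks: list[str], sql_facts: dict) -> str:
    """Assemble the augmented prompt: retrieved text chunks + SQL results.
    The exact wording is a sketch; tune it for your chosen LLM."""
    context = "\n\n".join(f"[Source {i + 1}] {c}" for i, c in enumerate(chunks))
    facts = "\n".join(f"- {k}: {v}" for k, v in sql_facts.items())
    return (
        "Answer using ONLY the context below. Cite sources as [Source N].\n\n"
        f"Document context:\n{context}\n\n"
        f"Fund data (from SQL):\n{facts}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is the current DPI?",
    ["DPI measures cumulative distributions relative to paid-in capital."],
    {"DPI": "0.76x", "Total distributions": "$82.0M"},
)
```

Numbering the sources in the prompt is what makes the citation step cheap: the LLM can reference [Source N] and the frontend maps it back to document metadata.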
- Always validate input data before calculation
- Handle edge cases (zero PIC, missing data)
- Return calculation breakdown for transparency
- Cache results for performance
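Two of the edge-case guards above can be expressed directly. This is a sketch; the implemented metrics_calculator.py may handle these cases differently:

```python
def safe_dpi(distributions: float, pic: float):
    """DPI with a zero-PIC guard; None signals 'not computable' to the caller."""
    return distributions / pic if pic > 0 else None

def irr_defined(flows: list[float]) -> bool:
    """IRR needs at least one inflow and one outflow;
    with all flows the same sign, NPV never crosses zero."""
    return any(f > 0 for f in flows) and any(f < 0 for f in flows)

assert safe_dpi(4_000_000, 0) is None
assert not irr_defined([1_000_000, 2_000_000])  # all positive → IRR undefined
assert irr_defined([-5_000_000, 2_000_000])
```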
- "What does DPI mean?"
- "Explain Paid-In Capital"
- "What is a recallable distribution?"
- "What is the current DPI?"
- "Calculate the IRR for this fund"
- "Has the fund returned all capital to LPs?"
- "Show me all capital calls in 2024"
- "What was the largest distribution?"
- "List all adjustments"
- "How is the fund performing compared to industry benchmarks?"
- "What percentage of distributions were recallable?"
- "Explain the trend in capital calls over time"
- Document upload and parsing works
- Tables correctly stored in SQL
- Text stored in vector DB
- DPI calculation is accurate
- Basic RAG Q&A works
- Application runs via Docker
- Structure: Modular, separation of concerns (10pts)
- Readability: Clear naming, comments (10pts)
- Error Handling: Try-catch, validation (10pts)
- Type Safety: TypeScript, Pydantic (10pts)
- Parsing Accuracy: Table recognition (10pts)
- Calculation Accuracy: DPI, IRR (10pts)
- RAG Quality: Relevant answers (10pts)
- Intuitiveness: Easy to use (10pts)
- Feedback: Loading, errors, success (5pts)
- Design: Clean, consistent (5pts)
- README: Setup instructions (5pts)
- API Docs: Endpoint descriptions (3pts)
- Architecture: Diagrams (2pts)
- Dashboard implementation (+5pts)
- Charts/visualization (+3pts)
- Multi-fund support (+3pts)
- Test coverage (+5pts)
- Live deployment (+4pts)
- GitHub Repository (public or private with access)
- Complete source code (backend + frontend)
- Docker configuration (docker-compose.yml)
- Documentation (README, API docs, architecture)
- Sample data (at least one test PDF)
- Project overview
- Tech stack
- Setup instructions (Docker)
- Environment variables (.env.example)
- API testing examples
- Features implemented
- Known limitations
- Future improvements
- Screenshots (minimum 3)
- Recommended: 1 week (Phase 1-4)
- Maximum: 2 weeks (Phase 1-6)
- Push code to GitHub
- Test that docker-compose up works
- Send repository URL via email
- Include any special instructions
- Framework: FastAPI (Python 3.11+)
- Document Parser: Docling
- Vector DB: pgvector (PostgreSQL extension, 768-d vectors)
- SQL DB: PostgreSQL 15+
- ORM: SQLAlchemy
- LLM Framework: LangChain
- LLM: Gemini 2.0 Flash (or Groq, Ollama, OpenAI)
- Embeddings: Google text-embedding-004 (768-d, IVFFlat ANN index)
- Chat Storage: Redis (conversation history with TTL)
- Task Queue: Celery + Redis
- Framework: Next.js 14 (App Router)
- UI Library: shadcn/ui + Tailwind CSS
- State: Zustand or React Context
- Data Fetching: TanStack Query
- Charts: Recharts
- File Upload: react-dropzone
- Development: Docker + Docker Compose
- Deployment: Your choice (Vercel, Railway, AWS, etc.)
Problem: Docling can't extract tables
Solution:
- Check PDF format (ensure it's not scanned image)
- Add fallback parsing logic
- Manually define table structure patterns
Problem: OpenAI API is expensive
Solution: Use free alternatives (see "Free LLM Options" section below)
- Use caching for repeated queries
- Use cheaper models (gpt-3.5-turbo)
- Use local LLM (Ollama) for development
Problem: IRR returns NaN or extreme values
Solution:
- Validate cash flow sequence
- Check for missing dates
- Handle edge cases (all positive/negative flows)
Problem: Frontend can't call backend API
Solution:
- Add CORS middleware in FastAPI
- Allow origin: http://localhost:3000
- Check network configuration in Docker
You don't need to pay for OpenAI API! Here are free alternatives:
Completely free, runs locally on your machine
- Install Ollama
# Mac
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download

- Download a model
# Llama 3.2 (3B - fast, good for development)
ollama pull llama3.2
# Or Llama 3.1 (8B - better quality)
ollama pull llama3.1
# Or Mistral (7B - good balance)
ollama pull mistral

- Update your .env
# Use Ollama instead of OpenAI
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2

- Modify your code to use Ollama
# In backend/app/services/query_engine.py
from langchain_community.llms import Ollama
llm = Ollama(
base_url="http://localhost:11434",
model="llama3.2"
)

Pros: Free, private, no API limits, works offline
Cons: Requires decent hardware (8GB+ RAM), slower than cloud APIs
Free tier: 60 requests per minute
- Get free API key
- Go to https://makersuite.google.com/app/apikey
- Click "Create API Key"
- Copy your key
- Install package
pip install langchain-google-genai

- Update .env
GOOGLE_API_KEY=your-gemini-api-key
LLM_PROVIDER=gemini

- Use in code
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(
model="gemini-pro",
google_api_key=os.getenv("GOOGLE_API_KEY")
)

Pros: Free, fast, good quality
Cons: Rate limits, requires internet
Free tier: Very fast inference, generous limits
- Get free API key
- Go to https://console.groq.com
- Sign up and get API key
- Install package
pip install langchain-groq

- Update .env
GROQ_API_KEY=your-groq-api-key
LLM_PROVIDER=groq

- Use in code
from langchain_groq import ChatGroq
llm = ChatGroq(
api_key=os.getenv("GROQ_API_KEY"),
model="mixtral-8x7b-32768" # or "llama3-70b-8192"
)

Pros: Free, extremely fast, good quality
Cons: Rate limits, requires internet
Free inference API
- Get free token
- Go to https://huggingface.co/settings/tokens
- Create a token
- Update .env
HUGGINGFACE_API_TOKEN=your-hf-token
LLM_PROVIDER=huggingface

- Use in code
from langchain_community.llms import HuggingFaceHub
llm = HuggingFaceHub(
repo_id="mistralai/Mistral-7B-Instruct-v0.2",
huggingfacehub_api_token=os.getenv("HUGGINGFACE_API_TOKEN")
)

Pros: Free, many models available
Cons: Can be slow, rate limits
| Provider | Cost | Speed | Quality | Setup Difficulty |
|---|---|---|---|---|
| Ollama | Free | Medium | Good | Easy |
| Gemini | Free | Fast | Very Good | Very Easy |
| Groq | Free | Very Fast | Good | Very Easy |
| Hugging Face | Free | Slow | Varies | Easy |
| OpenAI | Paid | Fast | Excellent | Very Easy |
For Development/Testing:
- Use Ollama with llama3.2 (free, no limits)
For Production/Demo:
- Use Groq or Gemini (free tier is generous)
If you have budget:
- Use OpenAI GPT-4 (best quality)
All sample files are located in files/ directory:
- ILPA based Capital Accounting and Performance Metrics_ PIC, Net PIC, DPI, IRR.pdf - Reference document with definitions and explanations
- Contains: PIC, DPI, IRR, TVPI, RVPI definitions
- Use for: Testing text extraction, RAG retrieval, definition queries
- Size: ~80KB
- Sample_Fund_Tech_Ventures_III.pdf ⭐ Recommended
- Early-stage venture fund
- Metrics: DPI 0.76x, IRR 5.04%, TVPI 2.00x
- Capital: $107.9M invested
- Transactions: 4 calls, 4 distributions, 3 adjustments
- Sample_Fund_Growth_Capital_IV.pdf
- Growth equity fund (larger)
- Metrics: DPI 0.93x, IRR 5.40%, TVPI 2.00x
- Capital: $457.1M invested
- Transactions: 4 calls, 4 distributions, 3 adjustments
- Sample_Fund_Innovation_II.pdf
- Innovation/tech fund
- Metrics: DPI 0.59x, IRR 6.54%, TVPI 2.00x
- Capital: $63.2M invested
- Transactions: 4 calls, 4 distributions, 3 adjustments
For comprehensive testing, you should create mock fund performance reports with:
Date | Call Number | Amount | Description
-----------|-------------|-------------|------------------
2023-01-15 | Call 1 | $5,000,000 | Initial Capital
2023-06-20 | Call 2 | $3,000,000 | Follow-on
2024-03-10 | Call 3 | $2,000,000 | Bridge Round
Date | Type | Amount | Recallable | Description
-----------|-------------|-------------|------------|------------------
2023-12-15 | Return | $1,500,000 | No | Exit: Company A
2024-06-20 | Income | $500,000 | No | Dividend
2024-09-10 | Return | $2,000,000 | Yes | Partial Exit: Company B
Date | Type | Amount | Description
-----------|---------------------|-----------|------------------
2024-01-15 | Recallable Dist | -$500,000 | Recalled distribution
2024-03-20 | Capital Call Adj | $100,000 | Fee adjustment
For the sample data above:
- Total Capital Called: $10,000,000
- Total Distributions: $4,000,000
- Net PIC: $10,100,000 (after adjustments)
- DPI: ~0.40 (4M / 10.1M Net PIC)
- IRR: ~8-12% (depends on exact dates)
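A quick sanity check of these expected results (IRR is omitted since it depends on exact dates; the $10.1M Net PIC figure is taken from the summary above rather than recomputed):

```python
# Totals from the mock capital call and distribution tables above.
calls = [5_000_000, 3_000_000, 2_000_000]
distributions = [1_500_000, 500_000, 2_000_000]
net_pic = 10_100_000  # stated Net PIC after adjustments

assert sum(calls) == 10_000_000
assert sum(distributions) == 4_000_000
dpi = sum(distributions) / net_pic
print(f"DPI = {dpi:.2f}")  # → DPI = 0.40
```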
We've included a Python script to generate sample PDFs:
cd files/
pip install reportlab
python create_sample_pdf.py

This creates Sample_Fund_Performance_Report.pdf with:
- Capital calls table (4 entries)
- Distributions table (4 entries)
- Adjustments table (3 entries)
- Performance summary with definitions
You can create PDFs using:
- Google Docs/Word → Export as PDF
- Python libraries (reportlab, fpdf)
- Online PDF generators
Tip: Start with simple, well-structured tables before handling complex layouts.
- Docling: https://github.com/DS4SD/docling
- LangChain RAG: https://python.langchain.com/docs/use_cases/question_answering/
- FAISS: https://faiss.ai/
- ILPA Guidelines: https://ilpa.org/
- PE Metrics: https://www.investopedia.com/terms/d/dpi.asp
- Start Simple: Get Phase 1-4 working before adding features
- Test Early: Test document parsing with sample PDF immediately
- Use Tools: Leverage LangChain, shadcn/ui to save time
- Focus on Core: Perfect the RAG pipeline and calculations first
- Document Well: Clear README helps evaluators understand your work
- Handle Errors: Graceful error handling shows maturity
- Ask Questions: If requirements are unclear, document your assumptions
For questions about this coding challenge:
- Open an issue in this repository
- Email: [your-contact-email]
Good luck! Build something amazing!
PIC = Capital Contributions (Gross) - Adjustments
DPI = Cumulative Distributions / PIC
Cumulative Distributions =
Return of Capital +
Dividends Paid +
Interest Paid +
Realized Gains Distributed -
(Fees & Carried Interest Withheld)
Adjustments = Σ (Rebalance of Distribution + Rebalance of Capital Call)
- Nature: Clawback of over-distributed amounts
- Recording: Contribution (-)
- DPI Impact: Numerator ↓, Denominator ↑ → DPI ↓
- Nature: Refund of over-called capital
- Recording: Distribution (+)
- DPI Impact: Denominator ↓, Numerator unchanged → Requires flag to prevent DPI inflation
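The two directional effects can be checked numerically. The amounts below are illustrative; the signs follow the arrows above:

```python
def dpi(dists: float, pic: float) -> float:
    return dists / pic

# Baseline: $4M distributed against $10M paid-in.
base = dpi(4_000_000, 10_000_000)                          # 0.400

# Rebalance of distribution (clawback of $0.5M, recorded as a contribution):
# distributions fall AND paid-in rises, so DPI drops on both ends.
clawback = dpi(4_000_000 - 500_000, 10_000_000 + 500_000)  # ≈ 0.333

# Rebalance of capital call (refund of $0.5M over-called capital): the cash
# arrives like a distribution but must be flagged so that only the
# denominator shrinks, otherwise DPI would be artificially inflated.
refund = dpi(4_000_000, 10_000_000 - 500_000)              # ≈ 0.421

assert clawback < base < refund
```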
Version: 1.0 Last Updated: 2025-10-06 Author: InterOpera-Apps Hiring Team