Production-grade semantic search system for internal documents using embeddings, pgvector, and Retrieval-Augmented Generation.
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│    React     │─────▶│   FastAPI    │─────▶│  PostgreSQL  │
│   Frontend   │      │   REST API   │      │  + pgvector  │
└──────────────┘      └──────┬───────┘      └──────────────┘
                             │
                     ┌───────┴───────┐
                     │               │
               ┌─────▼─────┐   ┌─────▼─────┐
               │ Embedding │   │    RAG    │
               │  Service  │   │ Generator │
               │  (MiniLM) │   │(Flan-T5-L)│
               └─────┬─────┘   └─────┬─────┘
                     │               │
               ┌─────▼─────┐   ┌─────▼─────┐
               │   Redis   │   │  MLflow   │
               │   Cache   │   │ Tracking  │
               └───────────┘   └───────────┘
```
| Component | Technology |
|---|---|
| API | FastAPI (async) |
| Database | PostgreSQL + pgvector |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 |
| LLM | google/flan-t5-large |
| Experiment Tracking | MLflow |
| Cache | Redis |
| Auth | JWT (python-jose + passlib) |
| Frontend | React + Vite |
| Containerization | Docker + Docker Compose |
| Orchestration | Kubernetes |
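Retrieval is plain pgvector similarity search over chunk embeddings. As a rough sketch of the query shape (not the app's actual code, which goes through SQLAlchemy async sessions; the table name, column names, and DSN here are assumptions):

```python
# Hypothetical cosine-similarity lookup with pgvector's `<=>` operator.
# Assumes a `chunks` table with a vector(384) `embedding` column and a
# local DSN; the real schema lives in app/models/ and may differ.
import asyncio
import asyncpg

async def search_chunks(query_embedding: list[float], top_k: int = 5):
    conn = await asyncpg.connect("postgresql://postgres:postgres@localhost:5432/semantic_search")
    try:
        vec = "[" + ",".join(map(str, query_embedding)) + "]"  # pgvector text format
        return await conn.fetch(
            """
            SELECT content, 1 - (embedding <=> $1::vector) AS similarity
            FROM chunks
            ORDER BY embedding <=> $1::vector
            LIMIT $2
            """,
            vec,
            top_k,
        )
    finally:
        await conn.close()

# Example: asyncio.run(search_chunks([0.0] * 384, top_k=3))
```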
- Python 3.11+ (tested on 3.13)
- Docker & Docker Compose
- Node.js 18+ (for frontend)
```bash
docker compose -f infra/docker/docker-compose.yml up -d postgres redis mlflow
```

This starts:
- PostgreSQL 16 with pgvector extension on port 5432
- Redis 7 on port 6379
- MLflow tracking server on port 5000
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

The first request will download the ML models (~90MB for embeddings, ~3GB for flan-t5-large); subsequent starts use the cached models.
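To avoid the first-request stall, you can optionally pre-download both models with a short warm-up script (model names are the defaults from the configuration table below):

```python
# Warm-up sketch: pull both models into the local Hugging Face cache so
# the first API request doesn't block on a ~3GB download.
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
AutoTokenizer.from_pretrained("google/flan-t5-large")
AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
```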
```bash
cd frontend
npm install
npm run dev
```

The dev server opens at http://localhost:3000 with an API proxy to port 8000.
First grab a token, then upload:
```bash
TOKEN=$(curl -s -X POST http://localhost:8000/auth/token \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"admin"}' | python -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

curl -X POST http://localhost:8000/upload -H "Authorization: Bearer $TOKEN" -F "file=@data/sample_ml_basics.txt"
curl -X POST http://localhost:8000/upload -H "Authorization: Bearer $TOKEN" -F "file=@data/sample_kubernetes.txt"
```

Alternatively, build and run the full stack with Docker:

```bash
docker compose -f infra/docker/docker-compose.yml up --build
```

Check service health:

```bash
curl http://localhost:8000/health
# {"status":"healthy","version":"1.0.0","database":"connected"}
```

Obtain a token:

```bash
curl -X POST http://localhost:8000/auth/token \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "admin"}'
# {"access_token":"eyJ...","token_type":"bearer"}
```

Upload a document:

```bash
curl -X POST http://localhost:8000/upload \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@data/sample_ml_basics.txt"
# {"document_id":"...","filename":"sample_ml_basics.txt","chunk_count":3,"message":"Document indexed successfully with 3 chunks."}
```

Supported formats: PDF, TXT, DOCX. Max file size: 50MB.
List indexed documents:

```bash
curl http://localhost:8000/documents \
  -H "Authorization: Bearer $TOKEN"
```

Returns all indexed documents, paginated via the `skip` and `limit` query params.
Delete a document:

```bash
curl -X DELETE http://localhost:8000/documents/{document_id} \
  -H "Authorization: Bearer $TOKEN"
```

Deletes the document and all of its chunks.
Run a search:

```bash
curl -X POST http://localhost:8000/search \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is RAG?", "top_k": 3}'
```

The response includes:

- `generated_answer`: AI-generated answer grounded in the retrieved context
- `retrieved_chunks`: top-k source chunks with similarity scores
- `latency_ms`: end-to-end latency
- `model_info`: embedding model, LLM, and top-k used
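For scripting, the same flow works from Python with httpx (already installed for the test suite); a minimal sketch mirroring the curl calls above:

```python
# End-to-end client sketch: authenticate, search, print the answer.
# Endpoints and payloads mirror the curl examples; exact chunk fields
# depend on the API schema.
import httpx

BASE = "http://localhost:8000"

token = httpx.post(
    f"{BASE}/auth/token",
    json={"username": "admin", "password": "admin"},
).json()["access_token"]

resp = httpx.post(
    f"{BASE}/search",
    headers={"Authorization": f"Bearer {token}"},
    json={"query": "What is RAG?", "top_k": 3},
    timeout=60.0,  # first query can be slow while models load
).json()

print(resp["generated_answer"])
for chunk in resp["retrieved_chunks"]:
    print(chunk)
```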
```bash
kubectl apply -f infra/k8s/namespace.yaml
kubectl apply -f infra/k8s/postgres.yaml
kubectl apply -f infra/k8s/redis.yaml
kubectl apply -f infra/k8s/app.yaml
```

Includes an Ingress configured for the host search.local; update the host or add your own domain (for local testing, e.g., point search.local at your ingress IP in /etc/hosts).
```
project-root/
├── app/
│   ├── api/              # FastAPI route handlers (health, auth, upload, search)
│   ├── core/             # Config, logging, JWT auth
│   ├── services/         # Ingestion, retrieval, chunking, caching, MLflow tracking
│   ├── models/           # SQLAlchemy models + Pydantic schemas
│   └── db/               # Async session + DB initialization
├── ml/
│   ├── embedding/        # sentence-transformers embedding service
│   └── rag/              # Flan-T5 answer generation with prompt engineering
├── infra/
│   ├── docker/           # Dockerfile + docker-compose
│   └── k8s/              # Namespace, Postgres, Redis, App + Ingress
├── frontend/             # React (Vite), premium warm-tone UI
│   └── src/components/   # Header, SearchBar, ResultsPanel, UploadModal, HealthBadge
├── data/                 # Sample documents for testing
├── tests/                # Unit tests (chunker, parser, API)
├── uploads/              # Temporary upload directory (gitignored)
├── requirements.txt
├── .env.example          # Environment variable template
└── README.md
```
All settings are in `.env` (or environment variables):

| Variable | Default | Description |
|---|---|---|
| `DATABASE_URL` | `postgresql+asyncpg://...` | Async DB connection string |
| `SYNC_DATABASE_URL` | `postgresql://...` | Sync DB connection string (used by MLflow) |
| `REDIS_URL` | `redis://localhost:6379/0` | Redis cache URL |
| `EMBEDDING_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | HF embedding model |
| `LLM_MODEL` | `google/flan-t5-large` | HF text generation model |
| `EMBEDDING_DIMENSION` | `384` | Embedding vector dimension |
| `MLFLOW_TRACKING_URI` | `http://localhost:5000` | MLflow server URL |
| `CHUNK_SIZE` | `200` | Words per chunk |
| `CHUNK_OVERLAP` | `30` | Overlapping words between chunks (see the chunking sketch below) |
| `TOP_K` | `5` | Default retrieval count |
| `SECRET_KEY` | – | JWT signing key (change in production) |
| `ACCESS_TOKEN_EXPIRE_MINUTES` | `60` | JWT token lifetime |
| `UPLOAD_DIR` | `./uploads` | Temp directory for uploaded files |
| `LOG_LEVEL` | `INFO` | Application log level |
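`CHUNK_SIZE` and `CHUNK_OVERLAP` are counted in words. A minimal sketch of the sliding-window scheme these settings imply (the real implementation lives in app/services/ and may differ):

```python
# Word-based sliding-window chunker: each chunk holds up to `chunk_size`
# words and repeats the last `overlap` words of its predecessor.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 30) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks
```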
```bash
pip install pytest pytest-asyncio anyio httpx
pytest tests/ -v
```
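A new chunker test might look like this sketch (the import path and function signature are assumptions; adapt them to the actual module under app/services/):

```python
# Hypothetical chunker test: 450 words with chunk_size=200/overlap=30
# should yield 3 chunks, with consecutive chunks sharing 30 words.
from app.services.chunker import chunk_text  # assumed import path

def test_chunks_overlap():
    text = " ".join(f"w{i}" for i in range(450))
    chunks = chunk_text(text, chunk_size=200, overlap=30)
    assert len(chunks) == 3
    assert chunks[0].split()[-30:] == chunks[1].split()[:30]
```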
Performance notes:

- The first query is slow (~20s on CPU) due to model loading; subsequent queries take 2-5s.
- With GPU, inference drops to under 1s.
- Redis caching returns repeated queries instantly (300s TTL); see the sketch after this list.
- Chunk deduplication prevents duplicate results from re-uploaded documents.
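The caching pattern behind the TTL note is a hash-keyed Redis lookup; a sketch, with the key format and wiring as assumptions rather than the app's exact code:

```python
# Query-cache sketch: hash (query, top_k) into a Redis key, return hits,
# and store misses with a 300-second TTL.
import hashlib
import json

import redis

r = redis.Redis.from_url("redis://localhost:6379/0")

def cached_search(query: str, top_k: int, run_search):
    key = "search:" + hashlib.sha256(f"{query}|{top_k}".encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    result = run_search(query, top_k)
    r.setex(key, 300, json.dumps(result))
    return result
```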
Endpoints are rate-limited per client IP via Redis:
| Endpoint | Limit |
|---|---|
| `POST /upload` | 5 requests / 60s |
| `POST /search` | 20 requests / 60s |
If Redis is unavailable, rate limiting is silently skipped.
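A fixed-window limiter matching the table and the fail-open behavior above might look like this sketch (the key layout is an assumption):

```python
# Fixed-window rate limiter: INCR a per-IP counter, set its expiry on
# first use, and fail open if Redis is unreachable.
import redis

r = redis.Redis.from_url("redis://localhost:6379/0")

def allow(ip: str, endpoint: str, limit: int, window: int = 60) -> bool:
    key = f"ratelimit:{endpoint}:{ip}"
    try:
        count = r.incr(key)
        if count == 1:
            r.expire(key, window)
        return count <= limit
    except redis.RedisError:
        return True  # Redis unavailable: skip limiting, as documented above
```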
Visit http://localhost:5000 after starting MLflow to view:
- Search query metrics (latency, result count, top similarity score)
- Document ingestion tracking (chunk count, model params)
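In MLflow terms these are ordinary params and metrics; a sketch of what a logged search run might look like (run, experiment, and metric names here are illustrative, not the app's exact ones):

```python
# Illustrative MLflow logging for one search request.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("semantic-search")

with mlflow.start_run(run_name="search"):
    mlflow.log_param("top_k", 3)
    mlflow.log_metric("latency_ms", 2400)
    mlflow.log_metric("result_count", 3)
    mlflow.log_metric("top_similarity", 0.87)
```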