Production-ready Retrieval-Augmented Generation (RAG) system with dual vector stores, async processing, and FAISS IVF optimization. Achieves 4.2x throughput improvement with 337ms average latency across 15.5k+ documents.
- 4.2x Throughput: Parallel async queries vs sequential
- 15,580 Documents: 10,580 FAQs + 5,000 Support Tickets
- 337ms Latency: Average query response time (parallel mode)
- FAISS IVF: Optimized with 205/141 clusters for 3-10x faster search
- Dual Stores: Intelligent FAQ → Ticket fallback at 65% threshold
- FAQ Store: 10,580 FAQs (580 local CSV + 10,000 from Bitext HuggingFace dataset)
- Ticket Store: 5,000 historical support tickets with resolution status
- Smart Fallback: Searches FAQ first, falls back to Tickets if confidence < 65%
- FAISS IVF Indexing: Clustered indexing for 3-10x faster retrieval
- Parallel Queries: 4.2x throughput improvement over sequential
- Non-blocking I/O: ThreadPoolExecutor for concurrent operations
- Production-Ready: FastAPI async endpoints with proper error handling
- Gemini 2.0 Flash: State-of-the-art LLM for natural responses
- Context-Aware: Retrieves top-K relevant documents before generating
- Source Tracking: Shows whether answer came from FAQ or Ticket
- Rich Metadata: Resolution status, categories, confidence scores
- Real-time Metrics: Latency, confidence, source distribution
- Comprehensive Logging: Query history in JSONL format
- Performance Benchmarks: Continuous monitoring with test suite
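The FAQ → Ticket fallback described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `answer_with_fallback`, the store objects, and their `search(query, top_k)` methods (returning documents plus a best-match similarity) are hypothetical stand-ins for the pipeline's internals; only the 65% threshold comes from the README.

```python
# Sketch of the FAQ → Ticket fallback, assuming each store exposes
# search(query, top_k) -> (documents, best_similarity).
FAQ_THRESHOLD = 0.65  # confidence below this triggers the Ticket store

def answer_with_fallback(query, faq_store, ticket_store, top_k=3):
    docs, confidence = faq_store.search(query, top_k)
    source = "FAQ"
    if confidence < FAQ_THRESHOLD:
        # FAQ match too weak: fall back to historical support tickets
        docs, confidence = ticket_store.search(query, top_k)
        source = "Ticket"
    return {"docs": docs, "source": source, "confidence": confidence}
```

The key design point is that the Ticket store is only consulted when the FAQ store's best similarity falls under the threshold, which keeps the common case (a confident FAQ hit) cheap.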
| Category | Technologies |
|---|---|
| Backend | FastAPI, Python 3.10+ |
| Frontend | Streamlit |
| Vector DB | FAISS (with IVF optimization) |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) |
| LLM | Google Gemini 2.0 Flash |
| Framework | LangChain |
| Data Source | HuggingFace Datasets (Bitext 26.8k corpus) |
- Python 3.10 or higher
- Google Gemini API key
- 4GB+ RAM
- Clone the repository

```powershell
git clone https://github.com/Sakshamyadav15/SupportRAG.git
cd SupportRAG
```

- Create virtual environment

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1   # Windows
# source .venv/bin/activate    # Linux/Mac
```

- Install dependencies

```powershell
pip install -r requirements.txt
```

- Configure environment

```powershell
# Create .env file
copy .env.example .env
# Add your Gemini API key
# GEMINI_API_KEY=your_api_key_here
```

- Build vector stores (one-time setup, ~3-4 minutes)

```powershell
python -c "from src.core.dual_rag_pipeline import DualStoreRAGPipeline; p = DualStoreRAGPipeline(); p.build_vector_stores(use_ivf=True); p.save_vector_stores()"
```

Terminal 1 - Start API:

```powershell
.\start_api.ps1
# Or: uvicorn src.api.main:app --reload --host localhost --port 8000
```

Terminal 2 - Start Frontend:

```powershell
.\start_frontend.ps1
# Or: streamlit run frontend\app.py
```

Access:
- API: http://localhost:8000
- API Docs: http://localhost:8000/docs
- Frontend: http://localhost:8501
```python
import requests

# Query the RAG system
response = requests.post(
    "http://localhost:8000/query",
    json={"question": "How do I track my order?", "top_k": 3}
)
result = response.json()
print(f"Answer: {result['answer']}")
print(f"Source: {result['source']}")
print(f"Confidence: {result['confidence']:.1%}")
```

Request:

```json
{
  "question": "How do I reset my password?",
  "top_k": 3
}
```

Response:
```json
{
  "answer": "To reset your password...",
  "source": "FAQ",
  "confidence": 0.87,
  "citations": [
    {
      "rank": 1,
      "content": "Question: How do I reset my password?\nAnswer: ...",
      "similarity": 0.87,
      "source": "FAQ",
      "category": "Account"
    }
  ],
  "latency_ms": 342,
  "query": "How do I reset my password?",
  "timestamp": "2025-10-03T16:30:00"
}
```

Request body:

```json
{
  "rebuild": false  // true to rebuild from scratch, false to load existing
}
```

Returns total queries, average latency, confidence distribution, and source breakdown
Returns vector store status and document counts
Interactive Docs: http://localhost:8000/docs
```
┌─────────────────────────────────────────────────────────┐
│                        User Query                       │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                  FastAPI Async Endpoint                 │
│             (Parallel Vector Store Searches)            │
└─────────────┬─────────────────────────────┬─────────────┘
              │                             │
              ▼                             ▼
    ┌──────────────────┐          ┌──────────────────┐
    │    FAQ Store     │          │   Ticket Store   │
    │  (10,580 docs)   │          │   (5,000 docs)   │
    │    FAISS IVF     │          │    FAISS IVF     │
    │   205 clusters   │          │   141 clusters   │
    └─────────┬────────┘          └─────────┬────────┘
              │                             │
              └──────────────┬──────────────┘
                             │
                             ▼
                ┌─────────────────────────┐
                │     Fallback Logic      │
                │     (65% threshold)     │
                │      FAQ → Ticket       │
                └────────────┬────────────┘
                             │
                             ▼
                ┌─────────────────────────┐
                │  Gemini 2.0 Flash LLM   │
                │    (Context + Query)    │
                └────────────┬────────────┘
                             │
                             ▼
                ┌─────────────────────────┐
                │   Answer + Citations    │
                │       + Metadata        │
                └─────────────────────────┘
```
- Documents: 15,580 (10,580 FAQs + 5,000 Tickets)
- Index: FAISS IVF (205/141 clusters)
- Queries: 5 concurrent customer support questions
- Hardware: CPU-only (local machine)
| Metric | Before Optimization | After Optimization | Improvement |
|---|---|---|---|
| Documents | 5,580 | 15,580 | 2.8x more |
| Index Type | Flat | IVF Clustered | Advanced |
| API | Sync | Async | Concurrent |
| Latency (Sequential) | 850-1200ms | ~1,400ms | Baseline |
| Latency (Parallel) | N/A | 337ms | 4.2x faster |
| Throughput | 0.70 queries/sec | 2.96 queries/sec | 4.2x higher |
FAISS IVF Indexing:
- FAQ Store: 205 clusters, nprobe=51 (searches 25% of clusters)
- Ticket Store: 141 clusters, nprobe=35
- Benefit: 3-10x faster search at ~99% recall vs brute-force (flat) search
Async Processing:
- Parallel searches across both FAQ and Ticket stores
- ThreadPoolExecutor with 4 workers for non-blocking operations
- Benefit: 4.2x throughput improvement
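A minimal sketch of that async pattern, assuming the underlying store searches are blocking calls: `asyncio.gather` plus `run_in_executor` with a `ThreadPoolExecutor` lets both stores be queried concurrently. `blocking_search` here is a stand-in that sleeps instead of embedding and searching.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def blocking_search(store_name, query):
    time.sleep(0.1)  # stand-in for an embedding + FAISS search call
    return f"{store_name} results for {query!r}"

async def parallel_search(query):
    loop = asyncio.get_running_loop()
    # Dispatch both blocking searches to worker threads simultaneously
    faq_task = loop.run_in_executor(executor, blocking_search, "FAQ", query)
    ticket_task = loop.run_in_executor(executor, blocking_search, "Ticket", query)
    return await asyncio.gather(faq_task, ticket_task)

results = asyncio.run(parallel_search("How do I track my order?"))
print(results)
```

Both 0.1s searches overlap, so the pair completes in roughly 0.1s instead of 0.2s; the same idea applied to real FAISS queries is what drives the throughput gain above.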
Run Benchmarks:

```powershell
python test_async_performance.py
```

- Customer Support Automation: Instant answers to common questions
- Knowledge Base Search: Semantic search over FAQs and historical tickets
- Support Ticket Deflection: Reduce human agent workload by 20-40%
- Contextual Recommendations: Suggest solutions based on past ticket resolutions
```env
# Required
GEMINI_API_KEY=your_gemini_api_key

# Optional (defaults shown)
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
FAQ_SIMILARITY_THRESHOLD=0.65
TOP_K=3
LOG_LEVEL=INFO
```

Edit src/config/settings.py for fine-tuning:
- Embedding dimensions
- Vector store parameters
- LLM temperature/max tokens
- Logging settings
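As an illustration only (the project's actual settings.py may look quite different), tunables like these are often grouped into a single settings object; every field name and default below other than those listed in the README is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Settings:
    # Hypothetical mirror of the tunables listed above
    embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    embedding_dim: int = 384
    faq_similarity_threshold: float = 0.65
    top_k: int = 3
    llm_temperature: float = 0.2
    llm_max_tokens: int = 512
    log_level: str = "INFO"

settings = Settings(top_k=5)  # override a default at construction time
print(settings.top_k)
```

Centralizing configuration this way means the pipeline, API, and tests all read one source of truth instead of scattered constants.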
```
SupportRAG/
├── src/
│   ├── api/
│   │   └── main.py                  # FastAPI async endpoints
│   ├── core/
│   │   └── dual_rag_pipeline.py     # Main RAG logic with IVF + async
│   ├── config/
│   │   └── settings.py              # Configuration management
│   ├── models/
│   │   └── schemas.py               # Pydantic models
│   └── utils/
│       ├── logger.py                # Logging utilities
│       └── metrics.py               # Performance tracking
├── frontend/
│   └── app.py                       # Streamlit UI with dual store tracking
├── data/
│   ├── support_faqs.csv             # Local FAQ dataset (580 FAQs)
│   ├── support_tickets.csv          # Support tickets (5,000 tickets)
│   └── vector_stores/               # Saved FAISS indexes
│       ├── faq_store/
│       └── ticket_store/
├── requirements.txt
├── .env.example
├── start_api.ps1
├── start_frontend.ps1
├── test_async_performance.py        # Performance benchmarks
├── test_rag.py                      # RAG system tests
└── README.md
```
```powershell
python test_async_performance.py
```

Expected output:
- Sync query performance
- Async sequential performance
- Async parallel performance (4.2x speedup)
- Latency percentiles
```powershell
python test_rag.py
```

```powershell
# Start API first
uvicorn src.api.main:app --reload

# In another terminal
python -c "import requests; r=requests.post('http://localhost:8000/query', json={'question': 'How do I track my order?'}); print(r.json()['answer'])"
```
---
## Example Queries
Try these in the Streamlit interface or via API:
```python
# Account-related (usually from FAQ)
"How do I reset my password?"
"How do I change my email address?"
# Order-related (may fallback to Tickets)
"How do I track my order?"
"My order hasn't arrived yet"
# Payment issues (likely from Tickets)
"My payment was declined"
"I was charged twice"
# Refunds (mix of FAQ and Tickets)
"How long does a refund take?"
"My refund hasn't arrived after 12 days"
- Deploy to production (Render/Railway/Vercel)
- Add Redis caching for 80-90% speedup on repeated queries
- Implement evaluation metrics (precision@k, recall@k, MRR)
- Add user feedback loop (thumbs up/down)
- Multi-language support (Spanish, French, German)
- Hybrid search (BM25 + semantic)
- Docker Compose for one-command deployment
- A/B testing framework
- Query analytics dashboard
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Saksham Yadav
- GitHub: @Sakshamyadav15
- LinkedIn: Saksham Yadav
- LangChain for the RAG framework
- HuggingFace for the Bitext customer support dataset (26.8k records)
- Google for Gemini 2.0 Flash API
- FAISS (Facebook AI) for efficient vector similarity search
- FastAPI for the async web framework
Built with ❤️ for modern AI-powered customer support