RomanRosa/docschat-rag

DocsChat RAG 🚀


Enterprise-grade RAG system for technical documentation with advanced retrieval strategies and precise source citation

Features • Demo • Quick Start • Architecture • Documentation


📋 Overview

DocsChat RAG is a production-ready Retrieval-Augmented Generation system designed for querying technical documentation (Python, React, FastAPI) with enterprise-grade architecture patterns, multiple retrieval strategies, and accurate source citations.

What Makes This Different?

  • 🎯 Hybrid Search: Combines semantic (vector) and keyword (BM25) search, merged with reciprocal rank fusion (RRF)
  • 🔍 HyDE Query Transformation: Hypothetical Document Embeddings for improved retrieval
  • 📊 Intelligent Reranking: Cohere API integration for relevance optimization
  • 📖 Source Citation: Precise page numbers and document references
  • 💬 Conversational Memory: Multi-turn conversation context
  • 🏗️ Clean Architecture: SOLID principles, dependency injection, comprehensive testing
  • 🐳 Production Ready: Docker containerization, CI/CD pipelines, monitoring
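
To illustrate the hybrid search item above: reciprocal rank fusion scores each document by summing 1 / (k + rank) over every ranked list it appears in. The sketch below is not taken from this repository. The function name `reciprocal_rank_fusion`, the document IDs, and the constant k = 60 (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    with rank starting at 1. Higher fused scores rank first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score (descending); break ties by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from the semantic and keyword retrievers.
semantic_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print([doc_id for doc_id, _ in fused])
```

Because RRF uses only ranks, it needs no normalization between cosine similarities and BM25 scores, which live on different scales.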

✨ Features

Core Capabilities

  • Multi-Source Ingestion: Scrapes and processes Python, React, and FastAPI official documentation
  • Advanced Chunking: Semantic and recursive text splitting strategies
  • Hybrid Retrieval:
    • Semantic search (cosine similarity)
    • Keyword search (BM25)
    • RRF-based fusion
  • Query Enhancement: HyDE transformation for better retrieval
  • Smart Reranking: Cohere cross-encoder for relevance scoring
  • Citation Tracking: Page numbers and source URLs preserved
  • Conversational Context: Last N turns memory management
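
The "last N turns" memory in the list above can be pictured as a bounded queue of question/answer pairs. This is a minimal sketch under assumptions: the class name `WindowMemory`, its methods, and the prompt format are hypothetical and may differ from the project's own `ConversationalMemory`.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WindowMemory:
    """Keep only the most recent `max_turns` question/answer pairs."""

    max_turns: int = 3
    _turns: deque = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # A bounded deque drops the oldest turn automatically when full.
        self._turns = deque(maxlen=self.max_turns)

    def add_turn(self, question: str, answer: str) -> None:
        self._turns.append((question, answer))

    def as_prompt_context(self) -> str:
        """Render the remembered turns as plain text for a prompt."""
        return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self._turns)


memory = WindowMemory(max_turns=2)
memory.add_turn("What is FastAPI?", "A Python web framework.")
memory.add_turn("Does it support CORS?", "Yes, via CORSMiddleware.")
memory.add_turn("How do I enable it?", "Call app.add_middleware(...).")
print(memory.as_prompt_context())
```

With `max_turns=2`, the first exchange is discarded once the third is added, which keeps the prompt size bounded as a conversation grows.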

Technical Highlights

  • SOLID Principles: Clean, maintainable, extensible codebase
  • Type Safety: Comprehensive type hints with mypy validation
  • Testing: Unit and integration tests with pytest
  • Documentation: Detailed docstrings (Google style) and architecture docs
  • CI/CD: GitHub Actions for linting, testing, and deployment
  • Observability: Structured logging with loguru
  • Containerization: Docker and docker-compose setup

🎥 Demo

Live Demo: docschat-rag.streamlit.app (coming soon)

Example Query

User: "How do I handle CORS in FastAPI?"

DocsChat:
To handle CORS in FastAPI, use the CORSMiddleware:

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```

Sources:


🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Poetry (recommended) or pip
  • OpenAI API key
  • Cohere API key (optional, for reranking)

Installation

```bash
# Clone repository
git clone https://github.com/RomanRosa/docschat-rag.git
cd docschat-rag

# Install dependencies with Poetry
poetry install

# OR with pip
pip install -r requirements.txt

# Setup environment variables
cp .env.example .env
# Edit .env with your API keys
```

Environment Variables

# .env
OPENAI_API_KEY=your_openai_api_key_here
COHERE_API_KEY=your_cohere_api_key_here

# LLM Configuration
LLM_MODEL=gpt-4o-mini
EMBEDDING_MODEL=text-embedding-3-small
TEMPERATURE=0.0

# Retrieval Configuration
TOP_K=10
RERANK_TOP_K=3
CHUNK_SIZE=1000
CHUNK_OVERLAP=200

# Vector Store
CHROMADB_PERSIST_DIR=./data/vectorstore

Run Locally

# 1. Ingest documentation (one-time setup)
poetry run python scripts/ingest_docs.py --sources python react fastapi

# 2. Build vector index
poetry run python scripts/rebuild_index.py

# 3. Launch Streamlit UI
poetry run streamlit run src/ui/app.py

Access at: http://localhost:8501

Docker Quick Start

# Build and run with docker-compose
docker-compose up -d

# Access at http://localhost:8501

🏗️ Architecture

High-Level Design

┌─────────────┐
│   User UI   │  Streamlit Chat Interface
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────────────┐
│           RAG Pipeline Orchestrator         │
│  ┌─────────────────────────────────────┐    │
│  │  Query Processor                    │    │
│  │  - Validation                       │    │
│  │  - HyDE Transformation              │    │
│  └────────────┬────────────────────────┘    │
│               │                             │
│  ┌────────────▼────────────────────────┐    │
│  │  Hybrid Retriever                   │    │
│  │  ┌──────────────┐  ┌──────────────┐ │    │
│  │  │  Semantic    │  │  Keyword     │ │    │
│  │  │  (Vector)    │  │  (BM25)      │ │    │
│  │  └──────┬───────┘  └──────┬───────┘ │    │
│  │         │                 │         │    │
│  │         └────────┬────────┘         │    │
│  │                  │                  │    │
│  │         ┌────────▼────────┐         │    │
│  │         │  RRF Fusion     │         │    │
│  │         └────────┬────────┘         │    │
│  └──────────────────┼──────────────────┘    │
│                     │                       │
│  ┌──────────────────▼──────────────────┐    │
│  │  Cohere Reranker                    │    │
│  └──────────────────┬──────────────────┘    │
│                     │                       │
│  ┌──────────────────▼──────────────────┐    │
│  │  LLM Generator (GPT-4o-mini)        │    │
│  │  - Prompt with context              │    │
│  │  - Citation injection               │    │
│  │  - Memory management                │    │
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘

Key Components

| Module | Responsibility | Key Classes |
| --- | --- | --- |
| ingestion/ | Document scraping, parsing, chunking | PythonDocsIngester, SemanticChunker |
| vectorization/ | Embeddings generation, vector storage | OpenAIEmbedder, ChromaVectorStore |
| retrieval/ | Search strategies, reranking | HybridRetriever, CohereReranker |
| generation/ | LLM calls, prompt templating | OpenAIGenerator, ConversationalMemory |
| pipeline/ | End-to-end orchestration | RAGPipeline, QueryProcessor |
| ui/ | Streamlit interface | ChatInterface, SourcePanel |

See ARCHITECTURE.md for detailed design documentation.
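
As a rough picture of how these stages compose through dependency injection, the sketch below wires four injected callables into one `answer` method. Everything in it is hypothetical: `PipelineSketch`, the stub functions, and the toy corpus are stand-ins for illustration and do not reflect the real `RAGPipeline` interface. It also assumes the transformed (HyDE) query is used for retrieval while the original question is used for reranking and generation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineSketch:
    """Wire the stages together through injected callables."""

    transform_query: Callable[[str], str]               # e.g. HyDE
    retrieve: Callable[[str, int], list[str]]           # hybrid retrieval
    rerank: Callable[[str, list[str], int], list[str]]  # relevance reranking
    generate: Callable[[str, list[str]], str]           # LLM call
    top_k: int = 10
    rerank_top_k: int = 3

    def answer(self, question: str) -> str:
        search_query = self.transform_query(question)
        candidates = self.retrieve(search_query, self.top_k)
        context = self.rerank(question, candidates, self.rerank_top_k)
        return self.generate(question, context)


# Toy corpus and stubs so the sketch runs without any API keys.
CORPUS = [
    "React hooks manage state.",
    "CORS is configured with CORSMiddleware in FastAPI.",
    "Python lists are mutable.",
]


def fake_hyde(question: str) -> str:
    return f"A passage answering: {question}"


def fake_retrieve(query: str, k: int) -> list[str]:
    return CORPUS[:k]


def fake_rerank(question: str, docs: list[str], k: int) -> list[str]:
    # Stand-in scoring: count words shared with the question.
    words = set(question.lower().split())
    return sorted(docs, key=lambda d: -len(words & set(d.lower().split())))[:k]


def fake_generate(question: str, context: list[str]) -> str:
    return f"{context[0]} [sources: {len(context)}]"


pipeline = PipelineSketch(
    fake_hyde, fake_retrieve, fake_rerank, fake_generate, rerank_top_k=1
)
result = pipeline.answer("How do I handle CORS in FastAPI?")
print(result)
```

Because each stage is passed in rather than constructed internally, a unit test can swap any of them for a stub, as done here.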


📚 Documentation


🧪 Testing

# Run all tests
poetry run pytest

# Run with coverage
poetry run pytest --cov=src --cov-report=html

# Run specific test module
poetry run pytest tests/unit/test_retrieval.py

# Run integration tests
poetry run pytest tests/integration/

Test Coverage

Current coverage: 92% (target: 90%+)


🛠️ Development

Setup Development Environment

# Install dev dependencies
poetry install --with dev

# Setup pre-commit hooks
pre-commit install

# Run linting and type checks (black --check reports issues without rewriting files)
poetry run ruff check src/
poetry run black --check src/
poetry run mypy src/

# Format code
poetry run black src/

Code Style

  • Formatter: Black (line length: 100)
  • Linter: Ruff (replaces flake8, isort, pylint)
  • Type Checker: Mypy (strict mode)
  • Docstrings: Google style
  • Commits: Conventional Commits

📦 Tech Stack

| Category | Technology |
| --- | --- |
| Framework | LangChain 0.1.0 |
| LLM | OpenAI GPT-4o-mini |
| Embeddings | OpenAI text-embedding-3-small |
| Vector DB | ChromaDB |
| Reranking | Cohere API |
| UI | Streamlit |
| Testing | Pytest, pytest-cov |
| CI/CD | GitHub Actions |
| Containerization | Docker, docker-compose |
| Logging | Loguru |

🗺️ Roadmap

Phase 1: MVP ✅ (Current)

  • Basic ingestion pipeline
  • Hybrid retrieval
  • LLM generation with citations
  • Streamlit UI
  • Docker setup

Phase 2: Enhancement 🚧 (In Progress)

  • HyDE query transformation
  • Advanced reranking
  • Conversational memory
  • Performance benchmarking

Phase 3: Production 📋 (Planned)

  • Multi-tenancy support
  • API endpoints (FastAPI)
  • Admin dashboard
  • Usage analytics
  • Cost optimization

Phase 4: Advanced 🔮 (Future)

  • Multi-modal support (code screenshots)
  • Custom fine-tuned embeddings
  • Graph RAG integration
  • Real-time doc updates

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Development Workflow

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'feat(retrieval): add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.


🙏 Acknowledgments


📞 Contact

Francisco Román Peña de la Rosa



If this project helped you, please consider giving it a ⭐!

Made with ❤️ by Roman de la Rosa
