MohammadMawed/rag-python-mongodb

RAG System with MongoDB Cosmos DB

A production-ready Retrieval-Augmented Generation (RAG) system built with Python, featuring MongoDB Cosmos DB as the vector store and support for multiple LLM providers including Google Gemini and OpenAI.

🚀 Features

  • 🔄 Dual Engine Support: Seamlessly switch between Google Gemini and OpenAI models
  • 📊 Vector Database: MongoDB Cosmos DB for scalable, high-performance vector storage
  • ⚙️ Flexible Configuration: Environment-based configuration with validation
  • 📄 Smart Document Processing: Intelligent text chunking and metadata handling
  • 🛡️ Production Ready: Comprehensive error handling, logging, and connection management
  • 💬 Interactive CLI: User-friendly command-line interface for testing and management
  • 🔍 Hybrid Search: Vector similarity search with text search fallback
  • 📈 Scalable Architecture: Modular design for easy extension and maintenance

📋 Prerequisites

  1. MongoDB Cosmos DB Account

    • Create an Azure Cosmos DB account with the MongoDB API
    • Enable vector search capabilities in your Cosmos DB account
    • Obtain your connection string from the Azure portal
  2. API Keys

    • A Google API key (for Gemini) and/or an OpenAI API key, depending on the engine you choose
  3. Development Environment

    • Python 3.8 or higher
    • pip package manager
    • Git (for cloning the repository)

🛠️ Installation

  1. Clone the repository:
git clone https://github.com/your-username/RAG-python.git
cd RAG-python
  2. Create and activate a virtual environment:
# Create virtual environment
python -m venv venv

# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up environment variables:
    • Create a .env file in the project root
    • Copy the template below and fill in your credentials

βš™οΈ Configuration

Create a .env file in the project root with the following variables:

# Engine Selection: Choose "google" or "openai"
ENGINE=google

# API Keys (get one based on your chosen engine)
GOOGLE_API_KEY=your-google-api-key-here
OPENAI_API_KEY=your-openai-api-key-here

# MongoDB Cosmos DB Configuration
COSMOS_CONNECTION_STRING=mongodb://your-cosmos-connection-string
COSMOS_DATABASE_NAME=rag_database
COSMOS_COLLECTION_NAME=documents
COSMOS_VECTOR_INDEX_NAME=vector_index

# Document Processing Configuration
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
EMBEDDING_DIMENSION=768
TOP_K_RESULTS=5

Configuration Options

| Variable            | Description                             | Default | Options                    |
|---------------------|-----------------------------------------|---------|----------------------------|
| ENGINE              | LLM provider to use                     | google  | google, openai             |
| CHUNK_SIZE          | Maximum characters per text chunk       | 1000    | Integer                    |
| CHUNK_OVERLAP       | Overlap between consecutive chunks      | 200     | Integer                    |
| EMBEDDING_DIMENSION | Vector embedding dimensions             | 768     | 768 (Gemini), 1536 (OpenAI)|
| TOP_K_RESULTS       | Number of similar documents to retrieve | 5       | Integer                    |
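The project's config.py is responsible for reading and validating these variables. As an illustration only, a minimal stdlib sketch of that kind of loader (the Settings fields and load_settings helper are hypothetical; the real project also reads the .env file via python-dotenv) might look like:

```python
import os
from dataclasses import dataclass


@dataclass
class Settings:
    engine: str
    chunk_size: int
    chunk_overlap: int
    embedding_dimension: int
    top_k_results: int


def load_settings() -> Settings:
    """Read configuration from the environment, applying the documented defaults."""
    engine = os.getenv("ENGINE", "google")
    if engine not in ("google", "openai"):
        raise ValueError(f"ENGINE must be 'google' or 'openai', got {engine!r}")

    chunk_size = int(os.getenv("CHUNK_SIZE", "1000"))
    chunk_overlap = int(os.getenv("CHUNK_OVERLAP", "200"))
    if chunk_overlap >= chunk_size:
        # An overlap as large as the chunk itself would never advance.
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")

    return Settings(
        engine=engine,
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        embedding_dimension=int(os.getenv("EMBEDDING_DIMENSION", "768")),
        top_k_results=int(os.getenv("TOP_K_RESULTS", "5")),
    )
```

Validating early like this surfaces a misconfigured .env at startup instead of as a confusing runtime failure.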

🚀 Usage

Quick Start

  1. Start the interactive CLI:
python main.py
  2. Follow the prompts to:
    • Add documents to your knowledge base
    • Ask questions about your documents
    • View system status and statistics

CLI Commands

Once the application starts, you can use these commands:

  • Add text: Input text directly into the system
  • Ask questions: Query your knowledge base
  • Show stats: View system statistics
  • Clear database: Reset your knowledge base
  • Exit: Quit the application

Programmatic Usage

For integration into other applications:

from rag_chain import RAGSystem

# Initialize the RAG system
rag = RAGSystem()

# Add a document to the knowledge base
document_text = "Your document content here..."
rag.add_text(
    text=document_text, 
    metadata={"source": "example.pdf", "category": "research"}
)

# Query the system
question = "What is the main topic of the document?"
result = rag.query(question)

print(f"Answer: {result['answer']}")
print(f"Sources: {[doc.metadata for doc in result['source_documents']]}")

# Clean up resources
rag.close()

Advanced Usage

# Batch document processing
documents = [
    {"text": "Document 1 content", "metadata": {"source": "doc1.txt"}},
    {"text": "Document 2 content", "metadata": {"source": "doc2.txt"}},
]

for doc in documents:
    rag.add_text(doc["text"], doc["metadata"])

# Custom query with parameters
result = rag.query(
    question="Your question here",
    top_k=10,  # Retrieve more documents
    include_metadata=True
)

πŸ—οΈ Architecture

System Components

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   main.py       │    │   rag_chain.py   │    │  config.py      │
│ (CLI Interface) │◄──►│  (RAG System)    │◄──►│ (Configuration) │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                       ┌────────┴────────┐
                       │                 │
              ┌────────▼─────────┐    ┌──▼──────────────────┐
              │document_processor│    │vector_store_manager │
              │    .py           │    │       .py           │
              │ (Text Processing)│    │  (MongoDB Cosmos)   │
              └──────────────────┘    └─────────────────────┘

Core Modules

  1. config.py - Configuration Management

    • Environment variable handling
    • Settings validation
    • Default value management
  2. vector_store_manager.py - Vector Database Operations

    • MongoDB Cosmos DB integration
    • Vector index management
    • Similarity search implementation
    • Connection pooling
  3. document_processor.py - Text Processing

    • Document chunking using LangChain's RecursiveCharacterTextSplitter
    • Metadata preservation
    • Batch processing support
  4. rag_chain.py - RAG System Orchestration

    • LLM integration (Google Gemini / OpenAI)
    • Retrieval and generation pipeline
    • Context management
    • Response formatting
  5. main.py - Interactive CLI Interface

    • User interaction handling
    • Command processing
    • System status display

Data Flow

1. Document Input → 2. Text Chunking → 3. Embedding Generation → 4. Vector Storage
                                                                         │
8. Response ← 7. Answer Generation ← 6. Context Retrieval ← 5. Query Processing

Detailed Process:

  1. Document Ingestion: Text documents are input through CLI or API
  2. Text Chunking: Documents are split using RecursiveCharacterTextSplitter
  3. Embedding Generation: Chunks are converted to vectors using selected model
  4. Vector Storage: Embeddings stored in MongoDB Cosmos DB with metadata
  5. Query Processing: User queries are embedded using the same model
  6. Context Retrieval: Vector similarity search retrieves relevant chunks
  7. Answer Generation: LLM generates responses using retrieved context
  8. Response Delivery: Formatted answer returned with source information
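Step 6 above is executed server-side by the Cosmos DB vector index, but the underlying idea — rank stored chunk embeddings by cosine similarity to the query embedding and keep the TOP_K_RESULTS best — can be sketched in plain Python. The cosine_similarity and top_k helpers here are illustrative only, not part of the project's API:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, stored, k=5):
    """Return the k (doc_id, score) pairs whose vectors best match the query.

    `stored` is an iterable of (doc_id, embedding_vector) pairs.
    """
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A real vector index avoids this linear scan with approximate nearest-neighbor structures, but the ranking criterion is the same.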

Supported Models

Google Gemini

  • Model: gemini-pro
  • Embeddings: models/embedding-001
  • Dimensions: 768
  • Context Window: 32,768 tokens

OpenAI

  • Model: gpt-3.5-turbo / gpt-4
  • Embeddings: text-embedding-ada-002
  • Dimensions: 1,536
  • Context Window: 4,096 / 8,192 tokens
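Because the two engines emit embeddings of different sizes, the EMBEDDING_DIMENSION setting must match the chosen ENGINE or stored vectors and query vectors will not be comparable. A hypothetical lookup mirroring the table above (the real project may wire these names differently):

```python
# Model names and dimensions as listed in the "Supported Models" section.
ENGINE_PROFILES = {
    "google": {
        "chat_model": "gemini-pro",
        "embedding_model": "models/embedding-001",
        "dimension": 768,
    },
    "openai": {
        "chat_model": "gpt-3.5-turbo",
        "embedding_model": "text-embedding-ada-002",
        "dimension": 1536,
    },
}


def expected_dimension(engine):
    """Embedding dimension the Cosmos DB vector index must be created with."""
    return ENGINE_PROFILES[engine]["dimension"]
```

Switching engines after documents have been ingested therefore requires re-embedding and re-indexing the collection.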

📚 Dependencies

The project uses the following key dependencies:

langchain==0.1.0              # Core LangChain framework
langchain-openai==0.0.2       # OpenAI integration
langchain-google-genai==0.0.5 # Google Gemini integration
langchain-community==0.0.10   # Community integrations
pymongo==4.6.1                # MongoDB driver
azure-cosmos==4.5.1           # Azure Cosmos DB client
python-dotenv==1.0.0          # Environment variable management
tiktoken==0.5.2               # OpenAI tokenization
numpy==1.24.3                 # Numerical operations
pydantic==2.5.0               # Data validation

🎯 Best Practices

Document Processing

  • Optimal Chunk Size: 1000 characters works well for most content types
  • Chunk Overlap: 200 characters ensures context continuity
  • Metadata Usage: Include source, timestamp, and category information
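The project itself chunks text with LangChain's RecursiveCharacterTextSplitter; the following stdlib-only sketch (chunk_text is an illustrative helper, not the project's implementation) shows only what the size/overlap numbers mean — each chunk's tail is repeated at the head of the next chunk so that sentences crossing a boundary stay retrievable:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of at most `chunk_size` characters, where
    the last `overlap` characters of each chunk reappear at the start
    of the next one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

LangChain's splitter additionally prefers to break on paragraph, sentence, and word boundaries rather than mid-word, which is why it is the better choice in practice.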

Performance Optimization

  • Batch Processing: Add multiple documents in batches for better performance
  • Index Optimization: Ensure vector indexes are properly configured in Cosmos DB
  • Connection Pooling: System automatically manages database connections
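For the batch-processing advice above, a minimal sketch of the pattern (batched is a hypothetical helper; rag.add_text is the method shown in the Programmatic Usage section):

```python
def batched(items, batch_size):
    """Yield successive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Usage sketch, assuming `documents` and `rag` as in the examples above:
# for batch in batched(documents, 50):
#     for doc in batch:
#         rag.add_text(doc["text"], doc["metadata"])
```

Working in bounded batches keeps memory flat for large corpora and gives a natural point to pause if Cosmos DB RU throttling kicks in.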

Security Considerations

  • Environment Variables: Store all sensitive data in .env files
  • API Key Rotation: Regularly rotate API keys
  • Access Control: Implement proper database access controls
  • Data Validation: All inputs are validated before processing

Monitoring

  • RU Consumption: Monitor Cosmos DB Request Units usage
  • Response Times: Track query performance
  • Error Rates: Monitor API call success rates

🔧 Troubleshooting

Common Issues and Solutions

Connection Problems

Issue: Cannot connect to Cosmos DB

ConnectionError: Unable to connect to MongoDB Cosmos DB

Solutions:

  • ✅ Verify the connection string format and credentials
  • ✅ Check firewall settings and IP whitelisting
  • ✅ Ensure vector search is enabled in your Cosmos DB account
  • ✅ Verify network connectivity

API Authentication

Issue: API key authentication failed

AuthenticationError: Invalid API key

Solutions:

  • ✅ Verify API keys are correctly set in the .env file
  • ✅ Check API key permissions and quotas
  • ✅ Ensure there are no extra spaces or characters in the keys
  • ✅ Verify the selected engine matches the available API keys

Performance Issues

Issue: Slow query responses

Solutions:

  • ✅ Optimize chunk size for your content type
  • ✅ Reduce TOP_K_RESULTS if retrieving too many documents
  • ✅ Monitor Cosmos DB RU consumption
  • ✅ Consider upgrading your Cosmos DB tier

Memory Problems

Issue: Out of memory errors during document processing

Solutions:

  • ✅ Process documents in smaller batches
  • ✅ Temporarily reduce the chunk size
  • ✅ Increase system memory allocation
  • ✅ Implement document streaming for large files

Environment Setup Issues

Issue: Module import errors

ModuleNotFoundError: No module named 'langchain'

Solutions:

  • ✅ Ensure the virtual environment is activated
  • ✅ Run pip install -r requirements.txt
  • ✅ Check Python version compatibility (3.8+)

Getting Help

If you encounter issues not covered here:

  1. Check the Issues page
  2. Enable debug logging by setting LOG_LEVEL=DEBUG in your .env
  3. Review system logs for detailed error messages
  4. Create a new issue with error details and system information

🔒 Security

Environment Security

  • 🚫 Never commit .env files to version control
  • ✅ Use environment variables in production deployments
  • ✅ Implement proper access controls for your databases
  • ✅ Regularly rotate API keys and connection strings

Data Privacy

  • 🔒 All data is processed locally or through secure API endpoints
  • 🔒 Documents are stored with encryption at rest in Cosmos DB
  • 🔒 API communications use HTTPS/TLS encryption

Best Practices

  • ✅ Use role-based access control (RBAC) for Cosmos DB
  • ✅ Implement API rate limiting
  • ✅ Monitor for unusual access patterns
  • ✅ Keep dependencies updated for security patches

🤝 Contributing

We welcome contributions! Please follow these steps:

Development Setup

  1. Fork and clone the repository
git clone https://github.com/your-username/RAG-python.git
cd RAG-python
  2. Create a development environment
python -m venv dev-env
source dev-env/bin/activate  # On Windows: dev-env\Scripts\activate
pip install -r requirements.txt
  3. Create a feature branch
git checkout -b feature/your-feature-name

Contribution Guidelines

  • πŸ“ Write clear, descriptive commit messages
  • πŸ§ͺ Add tests for new functionality
  • πŸ“š Update documentation for new features
  • πŸ” Ensure code follows PEP 8 style guidelines
  • βœ… Test your changes thoroughly

Pull Request Process

  1. Update the README.md with details of changes if applicable
  2. Ensure your code includes appropriate error handling
  3. Add or update tests as needed
  4. Submit a pull request with a clear description

Code Style

  • Use type hints where possible
  • Follow PEP 8 style guidelines
  • Include docstrings for functions and classes
  • Use meaningful variable and function names

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License Summary

  • ✅ Commercial use allowed
  • ✅ Modification allowed
  • ✅ Distribution allowed
  • ✅ Private use allowed
  • ❌ No warranty provided
  • ❌ No liability assumed

πŸ™ Acknowledgments

  • LangChain: For the excellent RAG framework
  • MongoDB: For Cosmos DB vector search capabilities
  • Google: For Gemini AI model access
  • OpenAI: For GPT model access
  • Azure: For cloud infrastructure

📞 Support


🎯 Quick Reference

Essential Commands

# Setup
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # Edit with your credentials

# Run
python main.py

# Development
python -m pytest tests/
python -m black .
python -m flake8

Environment Template

ENGINE=google
GOOGLE_API_KEY=your-key-here
COSMOS_CONNECTION_STRING=mongodb://your-connection-string
COSMOS_DATABASE_NAME=rag_database

Ready to get started? Follow the Installation guide and start building your RAG system! 🚀
