
Enterprise RAG Stack

A complete, production-grade reference implementation of modern Retrieval-Augmented Generation systems for 2025 and beyond.

Python 3.10+ · FastAPI · License: MIT · Code style: black


This repository accompanies the article series "The Enterprise RAG Engineering Playbook (2025 Edition)" and provides modular, end-to-end components for building high-accuracy, scalable, and secure RAG pipelines.

🎯 Use Cases

Build production-grade systems for:

  • Enterprise Search – Find information across thousands of documents
  • Compliance Assistants – Policy Q&A with audit trails
  • Legal Reasoning Tools – Contract analysis with citations
  • Customer Support Copilots – Grounded answers from knowledge bases
  • Multi-hop Knowledge Systems – Complex queries requiring graph traversal

✨ Features

This repository includes reference implementations of all core RAG subsystems:

| # | Module | Description |
|----|--------|-------------|
| 1 | Document Ingestion | OCR cleanup, boilerplate removal, metadata extraction |
| 2 | Semantic Chunking | Title-aware, section-aware, token-budget segmentation |
| 3 | Embeddings | Versioning, drift detection, deterministic preprocessing |
| 4 | Hybrid Retrieval | Vector (HNSW) + BM25 + metadata filtering + score fusion |
| 5 | Reranking | Cross-encoder reranking, batch inference, deduplication |
| 6 | Context Assembly | Structured prompts, token budgeting, inline citations |
| 7 | Verification | Deterministic checks + LLM critic validation |
| 8 | Graph-RAG | Entity linking, knowledge graph traversal, multi-hop evidence |
| 9 | Monitoring | Recall@k, latency dashboards, drift detection |
| 10 | Deployment | FastAPI, circuit breakers, Docker, Kubernetes |
| 11 | Security | ACL filtering, PII redaction, audit logging |
| 12 | Cost Optimization | Model routing, caching, GPU batching |

πŸ“ Repository Structure

```
enterprise-rag-stack/
│
├── README.md
├── LICENSE
├── requirements.txt
│
├── docs/
│   ├── architecture_overview.md
│   ├── diagrams/
│   └── article_series_index.md
│
├── config/
│   ├── embedding_config.yaml
│   ├── retrieval_config.yaml
│   ├── model_routing.yaml
│   ├── index_config.yaml
│   ├── reranker_config.yaml
│   └── logging.yaml
│
├── data/
│   ├── sample_documents/
│   ├── golden_eval_set.json
│   └── pii_redaction_rules.json
│
├── ingestion/              # Document normalization pipeline
│   ├── normalize.py
│   ├── ingest_pipeline.py
│   └── dedupe.py
│
├── chunking/               # Semantic chunking strategies
│   ├── semantic_chunker.py
│   ├── sentence_splitter.py
│   └── chunk_eval_tools.py
│
├── embeddings/             # Embedding with versioning
│   ├── embedder.py
│   └── drift_monitor.py
│
├── retrieval/              # Hybrid retrieval system
│   ├── hybrid_retriever.py
│   ├── vector_retriever.py
│   ├── lexical_retriever.py
│   └── score_fusion.py
│
├── reranking/              # Cross-encoder reranking
│   ├── cross_encoder_reranker.py
│   └── batch_reranking.py
│
├── context/                # Context assembly & prompts
│   ├── context_builder.py
│   ├── context_budgeting.py
│   └── prompt_templates/
│
├── verification/           # Answer verification
│   ├── deterministic_checks.py
│   ├── critic_llm.py
│   └── verification_pipeline.py
│
├── graph_rag/              # Knowledge graph integration
│   ├── entity_linking.py
│   ├── graph_builder.py
│   ├── graph_traversal.py
│   └── kg_summarizer.py
│
├── monitoring/             # Evaluation & metrics
│   ├── eval_runner.py
│   ├── recall_metrics.py
│   ├── latency_metrics.py
│   └── evaluation_schema.md
│
├── deployment/             # FastAPI, Docker, K8s
│   ├── fastapi_app.py
│   ├── circuit_breaker.py
│   ├── docker/
│   │   ├── Dockerfile
│   │   └── docker-compose.yaml
│   └── k8s/
│
├── security/               # ACLs, PII, audit
│   ├── pii_redaction.py
│   ├── acl_filters.py
│   └── audit_logging.py
│
└── cost_optimization/      # Model routing, caching
    ├── model_router.py
    ├── caching_layer.py
    ├── gpu_batcher.py
    └── unit_economics_calculator.py
```

🚀 Quick Start

Installation

```bash
# Clone the repository
git clone https://github.com/AdnanSattar/enterprise-rag-stack.git
cd enterprise-rag-stack

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export OPENAI_API_KEY="your-api-key"
export CHROMA_PATH="./data/chroma"
```

Run Sample Pipeline

```bash
# Ingest sample documents
python ingestion/ingest_pipeline.py --input data/sample_documents --tenant default

# Run hybrid retriever
python retrieval/hybrid_retriever.py --query "What are the payment terms?"
```

Start the API Server

```bash
# Run FastAPI server
python deployment/fastapi_app.py

# Or with uvicorn
uvicorn deployment.fastapi_app:app --reload --port 8000
```

Docker Deployment

```bash
# Build and start all services
docker-compose up -d

# Check logs
docker-compose logs -f rag-api
```

📖 Key Concepts

The High-Accuracy RAG Formula

```
Accuracy = Quality(Ingestion) × Recall(Retrieval) × Precision(Reranking) × Grounding(Verification)
```

Hybrid Retrieval Score Fusion

```
score = vector_weight × s_vector + lexical_weight × s_bm25
```

Why hybrid? Pure vector search fails on keyword-heavy queries (SKU numbers, codes), while BM25 captures exact matches; fusing the two covers both query styles.
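
To make the fusion step concrete, here is a minimal sketch. The repository's actual logic lives in retrieval/score_fusion.py; the min-max normalization step and the function names below are illustrative assumptions, not its exact code.

```python
# Illustrative score fusion sketch; retrieval/score_fusion.py is the
# repository's real implementation. The normalization choice is an assumption.

def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores to [0, 1] so vector and BM25 scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    vector_scores: dict[str, float],
    bm25_scores: dict[str, float],
    vector_weight: float = 0.6,
    lexical_weight: float = 0.4,
) -> list[tuple[str, float]]:
    """score = vector_weight * s_vector + lexical_weight * s_bm25, per document."""
    v = min_max_normalize(vector_scores)
    b = min_max_normalize(bm25_scores)
    fused = {
        doc_id: vector_weight * v.get(doc_id, 0.0) + lexical_weight * b.get(doc_id, 0.0)
        for doc_id in set(v) | set(b)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Normalization matters here: cosine similarities and BM25 scores live on different scales, and fusing them raw would let one retriever dominate the other.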

Two-Level Verification

| Level | Method | What It Catches |
|-------|--------|-----------------|
| 1 | Deterministic | Numbers, dates, currencies not in context |
| 2 | LLM Critic | Unsupported claims, hallucinations |
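
A minimal sketch of the Level 1 check, assuming a regex scan for numeric tokens; the repository's real implementation is verification/deterministic_checks.py, and the pattern and pass/fail policy here are assumptions.

```python
# Sketch of a Level 1 deterministic check. The regex and the policy are
# assumptions, not the exact logic in deterministic_checks.py.
import re

# Integers, decimals, currency amounts, and percentages
NUMBER_PATTERN = re.compile(r"\$?\d[\d,]*(?:\.\d+)?%?")


def unsupported_numbers(answer: str, context: str) -> list[str]:
    """Return numeric tokens in the answer that never occur in the context.
    Any hit is a hard verification failure: the model asserted a figure
    its retrieved evidence does not contain."""
    context_numbers = set(NUMBER_PATTERN.findall(context))
    return [n for n in NUMBER_PATTERN.findall(answer) if n not in context_numbers]


# Example: "$1,200" is flagged because the context only mentions "$1,500"
unsupported_numbers("The fee is $1,200.", "The contract sets a fee of $1,500.")
```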

Embedding Versioning

```python
# CRITICAL: When model changes, re-index everything
index_name = f"docs_v1_{model_slug}_{embedding_version}"
```
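
One way to catch silent model or preprocessing changes is to periodically re-embed a fixed set of sentinel texts and compare them with stored baselines. The sketch below is in the spirit of embeddings/drift_monitor.py; the sentinel approach and the 0.98 threshold are assumptions, not the module's actual logic.

```python
# Hedged sketch of embedding drift detection; embeddings/drift_monitor.py
# is the repository's real implementation. The threshold is an assumption.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def drift_detected(
    baseline: list[np.ndarray],
    current: list[np.ndarray],
    threshold: float = 0.98,
) -> bool:
    """Compare freshly embedded sentinel texts against stored baselines.
    A mean similarity below the threshold means the model or preprocessing
    changed, so every index keyed to the old embedding_version must be rebuilt."""
    sims = [cosine(b, c) for b, c in zip(baseline, current)]
    return float(np.mean(sims)) < threshold
```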

πŸ—οΈ Architecture

```
┌───────────────────────────────────────────────────────────────────────┐
│                      Enterprise RAG Stack                             │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐              │
│   │ Ingest  │──▶│  Chunk  │──▶│  Embed  │──▶│  Index  │              │
│   │ & Clean │   │Semantic │   │ Version │   │  HNSW   │              │
│   └─────────┘   └─────────┘   └─────────┘   └─────────┘              │
│                                                                       │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐              │
│   │ Query   │──▶│ Hybrid  │──▶│ Rerank  │──▶│ Context │              │
│   │Classify │   │Retrieve │   │ Top-K   │   │Assemble │              │
│   └─────────┘   └─────────┘   └─────────┘   └─────────┘              │
│                                                                       │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐                            │
│   │   LLM   │──▶│ Verify  │──▶│Response │                            │
│   │Generate │   │ Answer  │   │ + Cache │                            │
│   └─────────┘   └─────────┘   └─────────┘                            │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
```

📊 Performance Targets

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| Recall@5 | > 0.85 | < 0.75 |
| Precision@5 | > 0.70 | < 0.60 |
| P95 Latency | < 2000 ms | > 5000 ms |
| Hallucination Rate | < 5% | > 10% |
| Cache Hit Rate | > 40% | < 20% |
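
For reference, Recall@5 from the first row can be computed as follows. The repository's version lives in monitoring/recall_metrics.py; the function name here is illustrative.

```python
# Minimal recall@k sketch; monitoring/recall_metrics.py holds the real version.

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Example: 2 of 3 relevant docs retrieved in the top 5 -> 0.667
recall_at_k(["d1", "d9", "d3", "d7", "d2"], {"d1", "d2", "d4"})
```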

🔧 Configuration

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `OPENAI_API_KEY` | OpenAI API key | Required |
| `CHROMA_PATH` | ChromaDB storage path | `./data/chroma` |
| `COLLECTION_NAME` | Vector collection name | `docs_v1` |
| `REDIS_URL` | Redis connection URL | `redis://localhost:6379` |
| `LOG_LEVEL` | Logging level | `INFO` |

YAML Configuration

  • `config/embedding_config.yaml` – Embedding model, versioning, drift detection
  • `config/retrieval_config.yaml` – Hybrid weights, BM25 params, reranking
  • `config/model_routing.yaml` – LLM selection based on query complexity
  • `config/index_config.yaml` – Vector store and ANN index settings (HNSW/IVF)
  • `config/reranker_config.yaml` – Cross-encoder and lightweight reranking config
  • `config/logging.yaml` – Structured logging configuration
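
As an example of how a file like `config/model_routing.yaml` might be consumed, the sketch below routes cheap, single-hop queries to a small model and reserves the large model for complex ones. The keys, thresholds, and model names are assumptions, not the file's actual schema.

```python
# Hypothetical model-routing sketch; keys and defaults are assumptions,
# not the actual schema of config/model_routing.yaml.
import yaml

with open("config/model_routing.yaml") as f:
    routing = yaml.safe_load(f)


def route_model(query: str, n_hops: int) -> str:
    """Pick an LLM from a rough complexity signal: hop count and query length."""
    if n_hops > 1 or len(query.split()) > routing.get("long_query_words", 40):
        return routing.get("complex_model", "gpt-4o")
    return routing.get("simple_model", "gpt-4o-mini")
```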

📚 Article Series

This repository accompanies "The Enterprise RAG Engineering Playbook" article series on Medium:

| Part | Title | Link | Modules Covered |
|------|-------|------|-----------------|
| Part 1 | Building a High-Quality Retrieval Foundation | Read on Medium | `ingestion/`, `chunking/`, `embeddings/` |
| Part 2 | High Recall Retrieval and Precision Reranking | Read on Medium | `retrieval/`, `reranking/` |
| Part 3 | Context Assembly and Grounded Prompting | Read on Medium | `context/` |
| Part 4 | Verification Layers and Graph-RAG | Read on Medium | `verification/`, `graph_rag/` |
| Part 5 | Monitoring, Evaluation and Lifecycle Management | Read on Medium | `monitoring/`, `deployment/` |
| Part 6 | Security, Compliance and Cost Optimization | Read on Medium | `security/`, `cost_optimization/` |

📖 Full article series index: `docs/article_series_index.md`


🧪 Testing

Run Evaluation

```python
import json

from monitoring.eval_runner import RAGEvaluator

# Load the golden evaluation set shipped in data/
with open("data/golden_eval_set.json") as f:
    eval_set = json.load(f)

# my_pipeline is your assembled RAG pipeline object
evaluator = RAGEvaluator(pipeline=my_pipeline)
results = evaluator.run_evaluation(eval_set)

print(f"Recall@5: {results.mean_recall:.3f}")
print(f"P95 Latency: {results.p95_latency_ms:.1f}ms")
```

Golden Dataset

```bash
# Run against golden evaluation set
python monitoring/eval_runner.py --eval-set data/golden_eval_set.json
```

🔒 Security

  • Row-level ACLs via metadata filters
  • PII redaction before indexing
  • Namespace isolation for multi-tenancy
  • Audit logging for compliance
  • Encrypted connections (TLS)
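
A minimal sketch of the first item above, row-level ACLs, assuming each chunk carries an `allowed_groups` metadata list (the repository's real implementation is security/acl_filters.py):

```python
# Illustrative ACL filter; the metadata schema is an assumption, and
# security/acl_filters.py holds the repository's real implementation.

def acl_filter(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Drop any chunk whose allowed_groups does not intersect the caller's
    groups. Applied before reranking, so unauthorized text never reaches
    the LLM context, not just the final answer."""
    return [
        c for c in chunks
        if user_groups & set(c.get("metadata", {}).get("allowed_groups", []))
    ]
```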

📈 Monitoring

Export metrics to Prometheus/Grafana:

```bash
curl http://localhost:8000/metrics
```

Key metrics:

  • `rag_query_latency_ms` – End-to-end latency
  • `rag_retrieval_recall` – Recall@k
  • `rag_cache_hit_rate` – Cache effectiveness
  • `rag_hallucination_rate` – Verification failures
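
A hedged sketch of how such metrics might be exported with prometheus_client; the metric objects and the `/metrics` mount below are assumptions about deployment/fastapi_app.py, not its actual code.

```python
# Assumed metrics wiring, sketched with prometheus_client; not the actual
# contents of deployment/fastapi_app.py.
import time

from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app

QUERY_LATENCY_MS = Histogram(
    "rag_query_latency_ms",
    "End-to-end query latency in milliseconds",
    buckets=(100, 250, 500, 1000, 2000, 5000),
)
CACHE_HITS = Counter("rag_cache_hits_total", "Queries answered from cache")

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # the endpoint Prometheus scrapes


@app.post("/query")
async def query(q: str) -> dict:
    start = time.perf_counter()
    answer = {"answer": "..."}  # placeholder for the real RAG pipeline
    QUERY_LATENCY_MS.observe((time.perf_counter() - start) * 1000.0)
    return answer
```

Rates such as the cache hit rate are then derived in Prometheus from the underlying counters rather than exported directly.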

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: `git checkout -b feature/amazing-feature`
  3. Commit your changes: `git commit -m 'Add amazing feature'`
  4. Push to the branch: `git push origin feature/amazing-feature`
  5. Open a Pull Request

Please ensure:

  • Code is formatted with `black`
  • Type hints are included
  • Tests pass
  • Documentation is updated

📄 License

This project is licensed under the MIT License – see the LICENSE file for details.


Enterprise RAG Stack – Built with ❤️ for production RAG systems

Companion code for the Medium article series: The Enterprise RAG Engineering Playbook

Report Bug · Request Feature