RAG AI Agent in Python
Production-Ready Retrieval-Augmented Generation (RAG) AI Agent in Python

This repository contains a production-oriented reference implementation of a Retrieval-Augmented Generation (RAG) AI agent built in Python. It is designed as a practical starting point for building, deploying, and operating RAG-powered assistants that combine vector search over your knowledge sources with a large language model (LLM) for accurate, up-to-date responses.
Key goals:
- Modular, maintainable architecture suitable for production.
- Clear examples for ingesting data, building a vector store, and serving a queryable RAG agent.
- Best-practice considerations for configuration, logging, testing, and deployment.
Table of contents
- Features
- Architecture overview
- Quickstart
- Installation
- Configuration
- Typical workflows
- Ingesting documents
- Building/using a vector store
- Querying the agent
- Deployment
- Testing & CI
- Observability & Monitoring
- Security & privacy
- Performance & scaling tips
- Best practices for production
- Contributing
- Recommended repository layout
- Troubleshooting
- License
- Acknowledgements & References
- Contact
Features
- Document ingestion pipeline (file system, PDFs, text, HTML — extendable)
- Embeddings + vector store (FAISS / Chroma / Milvus / Weaviate friendly)
- LLM orchestration layer for RAG responses (plug any LLM: OpenAI, Hugging Face, local)
- Conversation memory/session management
- API server example (FastAPI) for production endpoints
- Dockerfile and container-friendly configuration
- Example scripts for common tasks (ingest, index, query)
- Tests and example CI workflow
Architecture overview
- Data ingestion: preprocess raw docs -> text chunks -> metadata
- Embeddings: convert text chunks to vectors via an embeddings model
- Vector store: store and query embeddings (approximate nearest neighbors)
- RAG agent: retrieve context, construct prompts, call LLM, optionally perform chain-of-thought / tool use
- API & orchestration: expose endpoints, session handling, rate-limiting
- Observability: logging, request tracing, metrics for latency / error rates
Quickstart (fast path)

Prerequisites
- Python 3.10+
- Git
- Optional: Docker for containerized deployment
- LLM/embedding credentials (e.g., OPENAI_API_KEY) or local model runtime
Clone and install
```bash
git clone https://github.com/Parrot90/Production_Ready_RAG_AI_Agent_in_Python.git
cd Production_Ready_RAG_AI_Agent_in_Python
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Ingest your documents (example)

```bash
python scripts/ingest.py --source ./data/docs --output ./data/chunks
```

Build or update vector index

```bash
python scripts/index.py --chunks ./data/chunks --store ./data/vectorstore
```

Run the API server

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

Installation (detailed)
- Create and activate a virtual environment: python -m venv .venv, then source .venv/bin/activate
- Install dependencies: pip install -r requirements.txt
- Optional: install system packages required by some vector stores (e.g., FAISS).
Configuration

The project uses environment variables and a small config loader (config.py). Typical config options (an illustrative loader sketch follows the list):
- OPENAI_API_KEY or other provider credentials
- EMBEDDING_MODEL (provider/model for embeddings)
- LLM_MODEL (model for generation)
- VECTOR_STORE (faiss | chroma | milvus | weaviate)
- PERSIST_DIRECTORY (where to persist vector stores / metadata)
- MAX_CONTEXT_TOKENS, CHUNK_SIZE, CHUNK_OVERLAP, etc.
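For illustration only, an environment-driven loader along these lines could sit behind config.py. The field names mirror the variables above; the default model names and values are assumptions for this sketch, not the repository's actual interface.

```python
# Illustrative config loader; defaults and field names are assumptions for this sketch.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    embedding_model: str
    llm_model: str
    vector_store: str
    persist_directory: str
    chunk_size: int
    chunk_overlap: int
    max_context_tokens: int

    @classmethod
    def from_env(cls) -> "Settings":
        # Read every option from the environment, with conservative placeholder defaults.
        return cls(
            openai_api_key=os.getenv("OPENAI_API_KEY", ""),
            embedding_model=os.getenv("EMBEDDING_MODEL", "text-embedding-3-small"),
            llm_model=os.getenv("LLM_MODEL", "gpt-4o-mini"),
            vector_store=os.getenv("VECTOR_STORE", "faiss"),
            persist_directory=os.getenv("PERSIST_DIRECTORY", "./data/vectorstore"),
            chunk_size=int(os.getenv("CHUNK_SIZE", "800")),
            chunk_overlap=int(os.getenv("CHUNK_OVERLAP", "100")),
            max_context_tokens=int(os.getenv("MAX_CONTEXT_TOKENS", "4000")),
        )


settings = Settings.from_env()
```

Loading the settings once at startup keeps the rest of the code free of scattered os.getenv calls.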
Typical workflows
- Ingesting documents
- Purpose: convert raw documents into cleaned text chunks with metadata (filename, source, modified time).
- Steps:
- Detect file type (txt, md, pdf, html, docx)
- Extract text
- Split into chunks (size + overlap tuned for your LLM context window)
- Save chunk JSON/CSV ready for embedding (see the chunking sketch below)
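A minimal sketch of the chunking step is shown below, assuming plain-text input and character-based splitting. The function names, default sizes, and JSON layout are illustrative rather than the exact behavior of scripts/ingest.py.

```python
# Illustrative chunker: split extracted text into overlapping chunks with metadata.
import json
from pathlib import Path


def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into character-based chunks with a fixed overlap."""
    chunks, start = [], 0
    step = max(chunk_size - overlap, 1)
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks


def ingest_file(path: Path, out_dir: Path) -> None:
    """Write one JSON record per chunk, carrying source metadata for later indexing."""
    text = path.read_text(encoding="utf-8", errors="ignore")
    out_dir.mkdir(parents=True, exist_ok=True)
    records = [
        {"source": str(path), "chunk_id": i, "text": chunk}
        for i, chunk in enumerate(chunk_text(text))
    ]
    (out_dir / f"{path.stem}.json").write_text(json.dumps(records, indent=2))
```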
- Building/using a vector store
- Create embeddings for chunks using your chosen embedder
- Save vectors & metadata to a vector index (FAISS/Chroma); see the indexing sketch below
- For production, consider a database-backed vector store (Milvus/Weaviate) for persistence, scaling, and multi-node support
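The indexing sketch below assumes the openai client for embeddings and a local FAISS index; the helper names and file layout (index.faiss plus metadata.json) are assumptions for this example, not necessarily what scripts/index.py produces.

```python
# Illustrative indexing step: embed chunk texts and store them in a FAISS index.
# Assumes the openai and faiss-cpu packages are installed.
import json
from pathlib import Path

import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Batch-embed texts and return a float32 matrix of shape (n, dim)."""
    response = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in response.data], dtype="float32")


def build_index(chunks_dir: Path, store_dir: Path) -> None:
    """Read chunk records, embed them, and persist the index plus metadata."""
    records = []
    for file in chunks_dir.glob("*.json"):
        records.extend(json.loads(file.read_text()))
    vectors = embed_texts([r["text"] for r in records])
    index = faiss.IndexFlatL2(vectors.shape[1])  # exact search; swap for HNSW/IVF at scale
    index.add(vectors)
    store_dir.mkdir(parents=True, exist_ok=True)
    faiss.write_index(index, str(store_dir / "index.faiss"))
    (store_dir / "metadata.json").write_text(json.dumps(records))
```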
- Querying the agent (RAG flow)
- Receive a user query via API
- Embed the query and retrieve top-N relevant chunks
- Construct a prompt combining retrieved context and a system/instruction template
- Call the LLM for generation
- Optionally store the conversational turn and any analytics/metrics (a retrieval-and-generation sketch follows below)
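Putting those steps together, a minimal query flow might look like the sketch below. It reuses the index layout from the indexing sketch above; the prompt template and model names are placeholders, not the repository's exact implementation.

```python
# Illustrative RAG query flow: retrieve top-k chunks, build a prompt, call the LLM.
import json
from pathlib import Path

import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()


def retrieve(query: str, store_dir: Path, top_k: int = 4) -> list[dict]:
    """Embed the query and return the metadata records of the nearest chunks."""
    index = faiss.read_index(str(store_dir / "index.faiss"))
    records = json.loads((store_dir / "metadata.json").read_text())
    q = client.embeddings.create(model="text-embedding-3-small", input=[query])
    q_vec = np.array([q.data[0].embedding], dtype="float32")
    _, ids = index.search(q_vec, top_k)
    return [records[i] for i in ids[0] if i != -1]


def answer(query: str, store_dir: Path) -> str:
    """Combine retrieved context with an instruction template and generate a reply."""
    context = "\n\n".join(r["text"] for r in retrieve(query, store_dir))
    messages = [
        {"role": "system", "content": "Answer using only the provided context; say so if it is missing."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
    completion = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return completion.choices[0].message.content
```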
Example usage snippet (pseudo)

```python
from rag_agent import RAGAgent

agent = RAGAgent(config)
response = agent.query("How do I rotate a vector in 3D?", session_id="user-123")
print(response.answer)
```

Deployment
- Containerization: Dockerfile provided. Build and push image to your registry.
- Orchestration: Kubernetes manifests are not included by default, but you can use a simple Deployment + HPA, Service, and Ingress.
- Consider autoscaling, GPU-equipped nodes for local LLMs, or managed hosting for third-party LLM APIs.
- Secrets management: use your cloud provider or HashiCorp Vault for keys.
- Persistent volumes: for vector store persistence (if using self-hosted FAISS/Chroma).
Testing & CI
- Unit tests: pytest
- Integration tests: small sample document set + local vector store + mocked LLM responses (see the mocked-LLM test sketch below)
- Example GitHub Actions workflow (ci.yml) may include:
- linting (flake8/ruff)
- unit tests
- build and push Docker image on release
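As a sketch of the mocked-LLM approach, a unit test could look like the following. The RAGAgent interface, the _call_llm attribute, and the fixture paths are assumptions based on the usage snippet earlier in this README; adjust them to the repository's actual module layout.

```python
# Illustrative unit test: mock the LLM call so tests stay fast and offline.
from unittest.mock import patch

import pytest

from rag_agent import RAGAgent  # hypothetical import path


@pytest.fixture
def agent():
    # Hypothetical constructor arguments; point at a small fixture vector store.
    return RAGAgent(config={"vector_store": "faiss", "persist_directory": "./tests/fixtures/store"})


def test_query_returns_grounded_answer(agent):
    # _call_llm and the response.answer shape are assumed parts of the agent interface.
    with patch.object(agent, "_call_llm", return_value="Paris is the capital of France."):
        response = agent.query("What is the capital of France?", session_id="test")
    assert "Paris" in response.answer
```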
Observability & Monitoring
- Logging: structured JSON logs via Python logging; configurable level via LOG_LEVEL
- Metrics: expose Prometheus metrics (request latency, error count, embedding/LLM call latencies); see the instrumentation sketch below
- Tracing: add OpenTelemetry for distributed tracing through ingestion -> retrieval -> LLM calls
- Alerts: set up alerts for high error rates, high latency, or low recall (if you maintain a set of test queries)
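A minimal instrumentation sketch with prometheus_client is shown below; the metric names and the wrapper function are illustrative, not the project's existing instrumentation.

```python
# Illustrative metrics instrumentation using prometheus_client.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("rag_request_latency_seconds", "End-to-end query latency")
LLM_ERRORS = Counter("rag_llm_errors_total", "Failed LLM calls")


def timed_query(agent, question: str, session_id: str):
    """Record latency for every query and count failures raised by the LLM layer."""
    start = time.perf_counter()
    try:
        return agent.query(question, session_id=session_id)
    except Exception:
        LLM_ERRORS.inc()
        raise
    finally:
        REQUEST_LATENCY.observe(time.perf_counter() - start)


# Expose /metrics on a separate port for Prometheus to scrape.
start_http_server(9100)
```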
Security & privacy
- Do not log sensitive user data; sanitize before logging.
- If storing user conversations, provide a retention policy and opt-out mechanisms.
- Secure APIs with authentication (API keys, OAuth); a minimal API-key check is sketched below.
- Network: use TLS in production; limit access to vector store endpoints.
- Rate limit LLM/API calls to avoid cost spikes.
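For the API-key option, a minimal FastAPI dependency could look like the sketch below; the header name, environment variable, and route are assumptions for this example.

```python
# Illustrative API-key check for a FastAPI endpoint.
import os

from fastapi import Depends, FastAPI, HTTPException, status
from fastapi.security import APIKeyHeader

app = FastAPI()
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)


def require_api_key(api_key: str = Depends(api_key_header)) -> None:
    """Reject requests whose key does not match the server-side secret."""
    if not api_key or api_key != os.environ.get("SERVICE_API_KEY"):
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid API key")


@app.post("/query", dependencies=[Depends(require_api_key)])
def query_endpoint(payload: dict):
    # Hand off to the RAG agent here; sanitize anything you log.
    return {"answer": "..."}
```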
Performance & scaling tips
- Cache embeddings for repeated queries or common documents (see the caching sketch below).
- Use an approximate nearest neighbor (ANN) index such as HNSW or IVF in FAISS.
- Use batching for embedding requests to reduce API overhead.
- Use streaming responses (if your LLM supports them) to reduce perceived latency.
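A simple in-process embedding cache might look like the sketch below. It is keyed by a content hash and takes the embedding function as a parameter (for example, the embed_texts helper from the indexing sketch), so it is only a starting point before moving to Redis or disk-backed caching.

```python
# Illustrative embedding cache: avoid re-embedding identical texts across requests.
import hashlib
from typing import Callable

import numpy as np

_CACHE: dict[str, np.ndarray] = {}


def cached_embed(texts: list[str], embed_fn: Callable[[list[str]], np.ndarray]) -> np.ndarray:
    """Embed texts through embed_fn, skipping any text already cached by content hash."""
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    missing = [(k, t) for k, t in zip(keys, texts) if k not in _CACHE]
    if missing:
        fresh = embed_fn([t for _, t in missing])  # one batched provider call
        for (k, _), vector in zip(missing, fresh):
            _CACHE[k] = vector
    return np.array([_CACHE[k] for k in keys])
```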
Best practices for production
- Keep ingestion reproducible: store hashes & source metadata so you can rebuild the index deterministically (a manifest sketch follows this list).
- Version your vector store and model choices.
- Have a small set of evaluation queries for continuous monitoring of retrieval quality.
- Implement canary releases when changing LLMs or prompt templates.
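One way to make ingestion reproducible is a manifest that records a content hash per source plus the model and config used, as sketched below; the filename and fields are assumptions for this example.

```python
# Illustrative ingestion manifest: record content hashes and the embedding model so
# the index can be rebuilt (or invalidated) deterministically.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_manifest(source_dir: Path, manifest_path: Path, embedding_model: str) -> None:
    """Hash every source file and write a JSON manifest alongside the vector store."""
    entries = []
    for file in sorted(source_dir.rglob("*")):
        if file.is_file():
            digest = hashlib.sha256(file.read_bytes()).hexdigest()
            entries.append({"path": str(file), "sha256": digest})
    manifest = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "embedding_model": embedding_model,
        "sources": entries,
    }
    manifest_path.write_text(json.dumps(manifest, indent=2))
```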
Contributing

Contributions welcome! Please follow these steps:
- Fork the repo and create a branch: git checkout -b feat/my-feature
- Write tests for any new behavior.
- Run linting and tests: tox or pytest
- Submit a pull request describing the change.
Code of conduct: be respectful. See CODE_OF_CONDUCT.md (add one if the repository does not yet include it).
Recommended repository layout
- app/ # API server and agent orchestration
- scripts/ # Command-line ingestion / index / utils
- libs/ # core RAG logic: retriever, prompt templates, llm wrappers
- tests/ # unit and integration tests
- Dockerfile
- requirements.txt
- README.md
Troubleshooting
- If embeddings or LLM calls fail: check credentials and network access.
- If retrieval quality is poor: tune chunk size/overlap, increase top_k, or experiment with different embedding models.
- If performance is slow: profile embedding calls and vector queries, consider batching and index tuning.
License

This repository is provided under the MIT License. See LICENSE for details.
Acknowledgements & References
- LangChain: Patterns and chain design influenced by the LangChain ecosystem
- FAISS, Chroma, Milvus: popular vector stores
- OpenAI / Hugging Face: LLM and embeddings providers
Contact

Maintainer: Parrot90. For questions and help, open an issue in this repository.