LocalRAG is a privacy-first, retrieval-augmented generation (RAG) platform designed for autonomous document research.
Unlike standard linear RAG pipelines, LocalRAG implements a cyclic agentic architecture using LangGraph, allowing the system to audit its own answers, detect hallucinations, and self-correct in real time. It runs entirely offline on consumer hardware (RTX 3090/4090) using containerized microservices.
The system follows a microservices pattern orchestrated via Docker Compose. It decouples the Inference Engine (compute) from State Management (the vector DB) and the Application Logic.
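A minimal Compose sketch of that decoupling might look like the following. The service names, images, ports, and build paths here are illustrative assumptions, not the project's actual file:

```yaml
# Hypothetical docker-compose.yml sketch -- service names, ports, and
# build paths are assumptions for illustration only.
services:
  ollama:            # Inference Engine (GPU compute)
    image: ollama/ollama
    ports: ["11434:11434"]
  qdrant:            # State Management (vector DB)
    image: qdrant/qdrant
    ports: ["6333:6333"]
  api:               # Application Logic (FastAPI backend)
    build: ./backend
    ports: ["8000:8000"]
    depends_on: [ollama, qdrant]
  ui:                # Streamlit frontend
    build: ./frontend
    ports: ["8501:8501"]
    depends_on: [api]
```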
- User Flow: The User interacts with the Streamlit UI, which sends requests to the FastAPI backend.
- Agentic Loop: The LangGraph agent orchestrates retrieval from Qdrant, re-ranking via FlashRank, and generation via Ollama.
- Self-Correction: If the Hallucination Grader rejects an answer, the agent autonomously loops back and retries the generation.
Instead of a linear chain (Retrieve -> Generate), this system uses a State Graph.
- Hallucination Grader: After generating an answer, a secondary LLM call verifies if the claims are grounded in the retrieved context.
- Retry Mechanism: If a hallucination is detected, the graph loops back to the generation step with a penalty prompt.
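The grade-and-retry cycle described above can be sketched in plain Python. The actual project wires this as a LangGraph state graph; the retriever, generator, and grader below are hypothetical stubs, and the retry cap is an illustrative assumption:

```python
# Minimal sketch of the self-correcting loop. All functions are stand-ins
# for the real LangGraph nodes (Qdrant retrieval, Ollama generation, and
# a secondary LLM grader call); names and logic are illustrative only.
MAX_RETRIES = 2

def retrieve(question):
    # Stand-in for dense retrieval from Qdrant.
    return ["LocalRAG runs fully offline.", "It uses Qdrant for vectors."]

def generate(question, context, penalty=False):
    # Stand-in for an Ollama call; the penalty prompt tightens grounding.
    prefix = "Answer strictly from the context. " if penalty else ""
    return prefix + " ".join(context)

def grade_hallucination(answer, context):
    # Stand-in for the grader LLM: here, a naive check that every
    # retrieved chunk is reflected verbatim in the answer.
    return all(chunk in answer for chunk in context)

def run_agent(question):
    context = retrieve(question)
    answer = generate(question, context)
    for _attempt in range(MAX_RETRIES):
        if grade_hallucination(answer, context):
            return answer  # grounded: exit the loop
        # Grader failed: loop back to generation with a penalty prompt.
        answer = generate(question, context, penalty=True)
    return answer  # give up after MAX_RETRIES; surface best effort
```

The retry cap matters in practice: without it, a cyclic graph can loop indefinitely on a question the context cannot answer.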
To mitigate the "Lost in the Middle" phenomenon, retrieval runs in two stages:
- Stage 1: Broad retrieval of top 10 documents using Dense Vector Search (Cosine Similarity).
- Stage 2: Re-Ranking using a Cross-Encoder (ms-marco-MiniLM-L-12-v2) running locally on the CPU to filter for the top 3 semantically relevant chunks.
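The two stages can be sketched as follows. In the real system, stage 1 queries Qdrant and stage 2 runs the ms-marco-MiniLM-L-12-v2 cross-encoder via FlashRank; here, toy vectors and a crude token-overlap proxy stand in for both, purely to show the recall-then-precision shape:

```python
import math

def cosine(a, b):
    # Dense-vector similarity used for the broad stage-1 recall.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def dense_retrieve(query_vec, corpus, k=10):
    # Stage 1: cast a wide net -- cheap similarity over precomputed vectors.
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]),
                    reverse=True)
    return ranked[:k]

def cross_encoder_score(query, doc_text):
    # Stand-in for the cross-encoder, which scores the (query, doc) pair
    # jointly; here a naive token-overlap proxy for illustration.
    q, d = set(query.split()), set(doc_text.split())
    return len(q & d) / max(len(q), 1)

def rerank(query, candidates, k=3):
    # Stage 2: precise re-ranking of only the recalled candidates,
    # keeping the context window short and relevant.
    return sorted(candidates,
                  key=lambda d: cross_encoder_score(query, d["text"]),
                  reverse=True)[:k]
```

The design point: the expensive pairwise scorer only ever sees the handful of stage-1 candidates, so it stays fast even on CPU.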
Integrated Arize Phoenix (OpenTelemetry) to trace every step of the pipeline.
- Latency Tracing: Visualize exactly how long Retrieval took vs. Token Generation.
- Token Counting: Monitor cost (simulated) and throughput.
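Conceptually, each pipeline stage is wrapped in a span whose duration is exported to the collector. A stdlib-only stand-in (the real system exports OpenTelemetry spans to Phoenix; `span` and the stage names below are illustrative) looks like this:

```python
import time
from contextlib import contextmanager

# Conceptual stand-in for OTEL spans: records wall-clock duration per
# pipeline stage instead of exporting to a collector.
spans = []

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, time.perf_counter() - start))

with span("retrieval"):
    time.sleep(0.01)   # pretend Qdrant lookup
with span("generation"):
    time.sleep(0.02)   # pretend token generation

for name, seconds in spans:
    print(f"{name}: {seconds * 1000:.1f} ms")
```

This is exactly the view the trace waterfall gives you: retrieval time vs. generation time, stage by stage.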
Implements Test-Driven Development (TDD) for RAG.
- Uses DeepEval to run regression tests before deployment.
- A local Llama-3 model acts as a "Judge" to score answers for Faithfulness and Relevancy.
| Component | Tool Choice | Why this over the alternative? |
|---|---|---|
| Inference | Ollama (Docker) | Provides a stable, OpenAI-compatible API layer over raw llama.cpp bindings, simplifying container networking. |
| Vector DB | Qdrant | Chosen over ChromaDB for its Rust-based performance, ability to handle millions of vectors, and built-in hybrid search capabilities. |
| Orchestration | LangGraph | Chosen over standard LangChain Chains to enable Cyclic Graphs (Loops) required for self-correction. |
| Observability | Arize Phoenix | The only open-source, local-first OTEL collector that provides visual trace waterfalls without a cloud login. |
- Docker Desktop.
- NVIDIA GPU (RTX 30XX or 40XX recommended) with updated drivers.
- RAM: 32GB+ recommended (for running Docker + Chrome + VS Code).
This single command launches the Database, Inference Engine, Dashboard, and UI:

```shell
docker-compose up -d
```