A production-grade Retrieval-Augmented Generation (RAG) system that answers questions grounded in your private document corpus, not the open internet.

Features · Architecture · Quickstart · Usage · Configuration · API Reference · Roadmap
Knowledge Assistant is an end-to-end RAG pipeline that lets you ask natural language questions against your own documents (PDFs, DOCX, TXT, Markdown) and receive grounded, source-cited answers powered by LLMs.
Built for enterprise use cases (private documentation, policy libraries, technical knowledge bases) where accuracy and traceability matter more than generality.

Key principle: the system answers from your documents first. When the answer isn't in your corpus, it says so rather than hallucinating.
- **Multi-format ingestion:** PDF, DOCX, TXT, Markdown via LangChain document loaders
- **Intelligent chunking:** configurable `RecursiveCharacterTextSplitter` with overlap for context continuity
- **Pluggable embeddings:** Sentence Transformers (local), OpenAI, or Cohere embeddings
- **Swappable vector stores:** ChromaDB (default, persisted), FAISS, Pinecone
- **Flexible LLM backends:** Ollama (Mistral / Llama 3, fully local), OpenAI GPT-4, Anthropic Claude
- **REST API:** FastAPI with async endpoints, ready for integration
- **Dockerized:** a single `docker-compose up` launches the API + Ollama service
- **Retrieval metrics:** retrieval precision benchmarking and qualitative analysis utilities
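Chunking with overlap keeps context from being cut mid-thought at chunk boundaries. A minimal character-window sketch of the idea (the pipeline itself uses LangChain's `RecursiveCharacterTextSplitter`, which additionally splits on paragraph and sentence separators; this is only an illustration):

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `chunk_overlap` characters, so context carries across boundaries."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

With the defaults above, each 500-character chunk repeats the last 50 characters of its predecessor.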
```
                 +-----------------------------------+
                 |        INGESTION PIPELINE         |
 Raw Documents ->|  Load -> Chunk -> Embed -> Store  |
 (PDF/DOCX/TXT)  |                                   |
                 |           run_ingest.py           |
                 +-----------------+-----------------+
                                   | persisted vectors
                                   v
                          +----------------+
                          |   Vector DB    |
                          |  (ChromaDB /   |
                          |     FAISS)     |
                          +--------+-------+
                                   | similarity search (top-k)
                 +-----------------+-----------------+
                 |          QUERY PIPELINE           |
 User Query ---->|  Embed Query                      |
                 |   -> Retrieve Chunks              |
                 |   -> Build Prompt                 |
                 |   -> LLM (Ollama / GPT / Claude)  |
                 |   -> Return Answer + Sources      |
                 |                                   |
                 |          app.py (FastAPI)         |
                 +-----------------------------------+
```
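The query pipeline composes four stages: embed, retrieve, prompt, generate. A minimal sketch with stubbed components (function names here are hypothetical, not the repo's actual API, which lives under `src/rag/` and `src/llm/`):

```python
from typing import Callable

def answer_query(
    question: str,
    embed: Callable[[str], list[float]],                 # embedding model
    retrieve: Callable[[list[float], int], list[dict]],  # vector store search
    generate: Callable[[str], str],                      # LLM call
    top_k: int = 4,
) -> dict:
    """Embed the query, retrieve top-k chunks, build a grounded prompt,
    and return the LLM answer together with its source chunks."""
    query_vec = embed(question)
    chunks = retrieve(query_vec, top_k)
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = (
        "Answer ONLY from the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return {"answer": generate(prompt), "sources": [c["source"] for c in chunks]}
```

The "answer only from context" instruction in the prompt is what enforces the no-hallucination principle described above.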
`all-MiniLM-L6-v2` (default): 384-dimensional vectors, fast, runs fully offline via Sentence Transformers.
Cosine similarity search over the vector store. Top-k chunks (configurable) are injected into the LLM prompt as grounding context.
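Cosine top-k retrieval in pure Python, for illustration only (in this project the search is delegated to ChromaDB or FAISS):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: list[float], index: dict[str, list[float]], k: int = 4) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]
```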
- Python 3.10+
- Docker + Docker Compose
- Ollama (for local LLM inference)
```shell
# Clone and install dependencies
git clone https://github.com/Sumit1673/knowledge-assistant.git
cd knowledge-assistant
pip install -r requirements.txt

# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# In terminal 1: start the Ollama server
ollama serve

# In terminal 2: pull Mistral (used by default)
ollama pull mistral

# Copy the example configuration
cp config/config.example.yaml config/config.yaml
# Edit config.yaml to point to your documents directory and set your preferences
```

Ingest your documents:

```shell
python run_ingest.py
```

This reads documents from the configured `source_dir`, chunks them, generates embeddings, and persists the vector store to `output/vector_store/`.

Start the API:

```shell
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
```

Or with Docker:

```shell
docker-compose up --build
```

The API will be available at http://localhost:8000. Interactive docs at http://localhost:8000/docs.
Docker Compose runs two services:

| Service | Description |
|---|---|
| `rag_api` | FastAPI application (port 8000) |
| `ollama` | Local LLM inference server (port 11434) |

Note: Document ingestion (`run_ingest.py`) is intentionally separate from the API startup to keep cold-start time fast. Run it manually after adding new documents, or automate it via a file-upload trigger or cron job.
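One stdlib-only way to automate re-ingestion is to poll the corpus folder's modification times and re-run `run_ingest.py` when they change. A sketch (the polling loop and subprocess call are illustrative; adapt paths and invocation to your deployment):

```python
import os
import subprocess
import time

def latest_mtime(directory: str) -> float:
    """Newest modification time of any file under `directory` (0.0 if empty)."""
    times = [
        os.path.getmtime(os.path.join(root, name))
        for root, _, files in os.walk(directory)
        for name in files
    ]
    return max(times, default=0.0)

def watch(directory: str = "data/documents", interval: int = 60) -> None:
    """Poll the corpus folder and re-run ingestion whenever it changes."""
    last = latest_mtime(directory)
    while True:
        time.sleep(interval)
        current = latest_mtime(directory)
        if current > last:
            subprocess.run(["python", "run_ingest.py"], check=True)
            last = current
```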
```shell
# Start all services
docker-compose up -d

# Ingest documents (run once, or after adding new docs)
docker-compose exec rag_api python run_ingest.py

# View logs
docker-compose logs -f rag_api

# Stop
docker-compose down
```

Ask a question:

```shell
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the refund policy?"}'
```

Response:
```json
{
  "answer": "The refund policy allows returns within 30 days of purchase...",
  "sources": [
    {"document": "policy_handbook.pdf", "page": 12, "score": 0.91}
  ]
}
```

Or from Python:

```python
import requests

response = requests.post(
    "http://localhost:8000/query",
    json={"question": "Summarize the onboarding process"}
)
print(response.json()["answer"])
```

Edit `config/config.yaml` to customise the pipeline:
```yaml
# Document ingestion
ingestion:
  source_dir: "data/documents"   # Folder with your source documents
  chunk_size: 500                # Characters per chunk
  chunk_overlap: 50              # Overlap between consecutive chunks

# Embeddings
embeddings:
  model: "sentence-transformers/all-MiniLM-L6-v2"   # or "openai", "cohere"

# Vector store
vector_store:
  type: "chroma"                 # "chroma" | "faiss" | "pinecone"
  persist_dir: "output/vector_store"

# LLM
llm:
  provider: "ollama"             # "ollama" | "openai" | "anthropic"
  model: "mistral"               # Model name for the chosen provider
  temperature: 0.1

# Retrieval
retrieval:
  top_k: 4                       # Number of chunks to retrieve per query
```

```
knowledge-assistant/
├── app.py               # FastAPI application & query endpoint
├── run_ingest.py        # Document ingestion pipeline (run once / on update)
├── config/
│   └── config.yaml      # Runtime configuration
├── src/
│   ├── ingestion/       # Document loaders, text splitters
│   ├── embeddings/      # Embedding model wrappers
│   ├── vector_store/    # VectorStoreManager (Chroma / FAISS)
│   ├── llm/             # LLM handler (Ollama / OpenAI / Anthropic)
│   └── rag/             # RAGQueryHandler: retrieval + prompt + generation
├── data/
│   └── documents/       # <- Drop your source documents here
├── output/
│   └── vector_store/    # Auto-generated persisted vector store
├── docker-compose.yml
├── Dockerfile
└── requirements.txt
```
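Once `config.yaml` is parsed (for example with PyYAML's `yaml.safe_load`), it is just nested dicts. A sketch of typed access with defaults, so missing keys fall back sensibly (`RetrievalConfig` is illustrative, not a class from this repo):

```python
from dataclasses import dataclass

@dataclass
class RetrievalConfig:
    top_k: int = 4  # default mirrors config.yaml's retrieval.top_k

def retrieval_from(raw: dict) -> RetrievalConfig:
    """Build a typed retrieval config from a parsed YAML mapping,
    falling back to the default for missing keys."""
    section = raw.get("retrieval", {})
    return RetrievalConfig(top_k=int(section.get("top_k", 4)))
```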
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/query` | Ask a question against the document corpus |
| `GET` | `/health` | Health check |
| `GET` | `/docs` | Interactive Swagger UI |
Full schema available at `/docs` when the server is running.
- Streaming responses via Server-Sent Events
- File upload endpoint that auto-triggers ingestion
- Hybrid search (dense + BM25 sparse retrieval)
- Multi-tenant document namespacing
- Evaluation harness (RAGAS metrics: faithfulness, context recall)
- HuggingFace Spaces demo with Groq backend
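For the hybrid-search roadmap item, reciprocal rank fusion (RRF) is one common, score-free way to merge dense and BM25 result lists. A sketch (hypothetical helper, not yet part of this repo):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids with reciprocal rank fusion:
    each document scores sum(1 / (k + rank)) over the lists that contain it."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF avoids having to calibrate dense cosine scores against BM25 scores, since it only uses ranks.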
Contributions, issues and feature requests are welcome. Please open an issue first to discuss what you'd like to change.
- Fork the repo
- Create a feature branch (`git checkout -b feature/your-feature`)
- Commit your changes (`git commit -m 'Add your feature'`)
- Push to the branch (`git push origin feature/your-feature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for details.
Sumit Vaise · Senior ML Engineer