A production-grade Retrieval-Augmented Generation (RAG) system that answers questions grounded in your private document corpus, not the open internet.

Features · Architecture · Quickstart · Usage · Configuration · API Reference · Roadmap
Knowledge Assistant is an end-to-end RAG pipeline that lets you ask natural language questions against your own documents (PDFs, DOCX, TXT, Markdown) and receive grounded, source-cited answers powered by LLMs.
Built for enterprise use cases (private documentation, policy libraries, technical knowledge bases) where accuracy and traceability matter more than generality.

Key principle: the system answers from your documents first. When the answer isn't in your corpus, it says so rather than hallucinating.
- **Multi-format ingestion:** PDF, DOCX, TXT, Markdown via LangChain document loaders
- **Intelligent chunking:** configurable `RecursiveCharacterTextSplitter` with overlap for context continuity
- **Pluggable embeddings:** Sentence Transformers (local), OpenAI, or Cohere embeddings
- **Swappable vector stores:** ChromaDB (default, persisted), FAISS, Pinecone
- **Flexible LLM backends:** Ollama (Mistral / Llama 3, fully local), OpenAI GPT-4, Anthropic Claude
- **REST API:** FastAPI with async endpoints, ready for integration
- **Dockerized:** a single `docker-compose up` launches the API + Ollama service
- **Retrieval metrics:** retrieval precision benchmarking and qualitative analysis utilities
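Chunking with overlap keeps context from being cut mid-thought at chunk boundaries. A minimal character-window sketch of the idea (the pipeline itself uses LangChain's `RecursiveCharacterTextSplitter`, which additionally splits on paragraph and sentence separators; this is only an illustration):

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `chunk_overlap` characters, so context carries across boundaries."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

With the defaults above, each 500-character chunk repeats the last 50 characters of its predecessor.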
```
                 +-----------------------------------+
                 |        INGESTION PIPELINE         |
 Raw Documents ->|  Load -> Chunk -> Embed -> Store  |
 (PDF/DOCX/TXT)  |                                   |
                 |           run_ingest.py           |
                 +-----------------+-----------------+
                                   | persisted vectors
                                   v
                          +----------------+
                          |   Vector DB    |
                          |  (ChromaDB /   |
                          |     FAISS)     |
                          +--------+-------+
                                   | similarity search (top-k)
                 +-----------------+-----------------+
                 |          QUERY PIPELINE           |
 User Query ---->|  Embed Query                      |
                 |   -> Retrieve Chunks              |
                 |   -> Build Prompt                 |
                 |   -> LLM (Ollama / GPT / Claude)  |
                 |   -> Return Answer + Sources      |
                 |                                   |
                 |          app.py (FastAPI)         |
                 +-----------------------------------+
```
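The query pipeline composes four stages: embed, retrieve, prompt, generate. A minimal sketch with stubbed components (function names here are hypothetical, not the repo's actual API, which lives under `src/rag/` and `src/llm/`):

```python
from typing import Callable

def answer_query(
    question: str,
    embed: Callable[[str], list[float]],                 # embedding model
    retrieve: Callable[[list[float], int], list[dict]],  # vector store search
    generate: Callable[[str], str],                      # LLM call
    top_k: int = 4,
) -> dict:
    """Embed the query, retrieve top-k chunks, build a grounded prompt,
    and return the LLM answer together with its source chunks."""
    query_vec = embed(question)
    chunks = retrieve(query_vec, top_k)
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = (
        "Answer ONLY from the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return {"answer": generate(prompt), "sources": [c["source"] for c in chunks]}
```

The "answer only from context" instruction in the prompt is what enforces the no-hallucination principle described above.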
`all-MiniLM-L6-v2` (default): 384-dimensional vectors, fast, runs fully offline via Sentence Transformers.
Cosine similarity search over the vector store. Top-k chunks (configurable) are injected into the LLM prompt as grounding context.
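Cosine top-k retrieval in pure Python, for illustration only (in this project the search is delegated to ChromaDB or FAISS):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: list[float], index: dict[str, list[float]], k: int = 4) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]
```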
- Python 3.10+
- Docker + Docker Compose
- Ollama (for local LLM inference)
```shell
# Clone and install dependencies
git clone https://github.com/Sumit1673/knowledge-assistant.git
cd knowledge-assistant
pip install -r requirements.txt

# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# In terminal 1: start the Ollama server
ollama serve

# In terminal 2: pull Mistral (used by default)
ollama pull mistral

# Copy the example configuration
cp config/config.example.yaml config/config.yaml
# Edit config.yaml to point to your documents directory and set your preferences
```

Ingest your documents:

```shell
python run_ingest.py
```

This reads documents from the configured `source_dir`, chunks them, generates embeddings, and persists the vector store to `output/vector_store/`.

Start the API:

```shell
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
```

Or with Docker:

```shell
docker-compose up --build
```

The API will be available at http://localhost:8000. Interactive docs at http://localhost:8000/docs.
Docker Compose runs two services:

| Service | Description |
|---|---|
| `rag_api` | FastAPI application (port 8000) |
| `ollama` | Local LLM inference server (port 11434) |

Note: Document ingestion (`run_ingest.py`) is intentionally separate from the API startup to keep cold-start time fast. Run it manually after adding new documents, or automate it via a file-upload trigger or cron job.
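One stdlib-only way to automate re-ingestion is to poll the corpus folder's modification times and re-run `run_ingest.py` when they change. A sketch (the polling loop and subprocess call are illustrative; adapt paths and invocation to your deployment):

```python
import os
import subprocess
import time

def latest_mtime(directory: str) -> float:
    """Newest modification time of any file under `directory` (0.0 if empty)."""
    times = [
        os.path.getmtime(os.path.join(root, name))
        for root, _, files in os.walk(directory)
        for name in files
    ]
    return max(times, default=0.0)

def watch(directory: str = "data/documents", interval: int = 60) -> None:
    """Poll the corpus folder and re-run ingestion whenever it changes."""
    last = latest_mtime(directory)
    while True:
        time.sleep(interval)
        current = latest_mtime(directory)
        if current > last:
            subprocess.run(["python", "run_ingest.py"], check=True)
            last = current
```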
```shell
# Start all services
docker-compose up -d

# Ingest documents (run once, or after adding new docs)
docker-compose exec rag_api python run_ingest.py

# View logs
docker-compose logs -f rag_api

# Stop
docker-compose down
```

Ask a question:

```shell
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the refund policy?"}'
```

Response:
```json
{
  "answer": "The refund policy allows returns within 30 days of purchase...",
  "sources": [
    {"document": "policy_handbook.pdf", "page": 12, "score": 0.91}
  ]
}
```

Or from Python:

```python
import requests

response = requests.post(
    "http://localhost:8000/query",
    json={"question": "Summarize the onboarding process"}
)
print(response.json()["answer"])
```

Edit `config/config.yaml` to customise the pipeline:
```yaml
# Document ingestion
ingestion:
  source_dir: "data/documents"   # Folder with your source documents
  chunk_size: 500                # Characters per chunk
  chunk_overlap: 50              # Overlap between consecutive chunks

# Embeddings
embeddings:
  model: "sentence-transformers/all-MiniLM-L6-v2"   # or "openai", "cohere"

# Vector store
vector_store:
  type: "chroma"                 # "chroma" | "faiss" | "pinecone"
  persist_dir: "output/vector_store"

# LLM
llm:
  provider: "ollama"             # "ollama" | "openai" | "anthropic"
  model: "mistral"               # Model name for the chosen provider
  temperature: 0.1

# Retrieval
retrieval:
  top_k: 4                       # Number of chunks to retrieve per query
```

```
knowledge-assistant/
├── app.py               # FastAPI application & query endpoint
├── run_ingest.py        # Document ingestion pipeline (run once / on update)
├── config/
│   └── config.yaml      # Runtime configuration
├── src/
│   ├── ingestion/       # Document loaders, text splitters
│   ├── embeddings/      # Embedding model wrappers
│   ├── vector_store/    # VectorStoreManager (Chroma / FAISS)
│   ├── llm/             # LLM handler (Ollama / OpenAI / Anthropic)
│   └── rag/             # RAGQueryHandler: retrieval + prompt + generation
├── data/
│   └── documents/       # <- Drop your source documents here
├── output/
│   └── vector_store/    # Auto-generated persisted vector store
├── docker-compose.yml
├── Dockerfile
└── requirements.txt
```
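Once `config.yaml` is parsed (for example with PyYAML's `yaml.safe_load`), it is just nested dicts. A sketch of typed access with defaults, so missing keys fall back sensibly (`RetrievalConfig` is illustrative, not a class from this repo):

```python
from dataclasses import dataclass

@dataclass
class RetrievalConfig:
    top_k: int = 4  # default mirrors config.yaml's retrieval.top_k

def retrieval_from(raw: dict) -> RetrievalConfig:
    """Build a typed retrieval config from a parsed YAML mapping,
    falling back to the default for missing keys."""
    section = raw.get("retrieval", {})
    return RetrievalConfig(top_k=int(section.get("top_k", 4)))
```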
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/query` | Ask a question against the document corpus |
| `GET` | `/health` | Health check |
| `GET` | `/docs` | Interactive Swagger UI |
Full schema available at `/docs` when the server is running.
- Streaming responses via Server-Sent Events
- File upload endpoint that auto-triggers ingestion
- Hybrid search (dense + BM25 sparse retrieval)
- Multi-tenant document namespacing
- Evaluation harness (RAGAS metrics: faithfulness, context recall)
- HuggingFace Spaces demo with Groq backend
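For the hybrid-search roadmap item, reciprocal rank fusion (RRF) is one common, score-free way to merge dense and BM25 result lists. A sketch (hypothetical helper, not yet part of this repo):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids with reciprocal rank fusion:
    each document scores sum(1 / (k + rank)) over the lists that contain it."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF avoids having to calibrate dense cosine scores against BM25 scores, since it only uses ranks.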
Contributions, issues and feature requests are welcome. Please open an issue first to discuss what you'd like to change.
- Fork the repo
- Create a feature branch (`git checkout -b feature/your-feature`)
- Commit your changes (`git commit -m 'Add your feature'`)
- Push to the branch (`git push origin feature/your-feature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for details.
Sumit Vaise · Senior ML Engineer