<a href="https://colab.research.google.com/github/PeterTheMango/RagResearch/blob/main/Rag_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a RAG System
### Done by: Peter Sotomango [60301211]

In this notebook I explore how to design a RAG-based Q&A system using embedding models and large language models from Hugging Face.

I focused on lightweight models for now because of limited compute resources.

The dataset used is [G4KMU's T2 - RagBench](https://huggingface.co/datasets/G4KMU/t2-ragbench), which supplies the test documents stored in the database and the questions used to test the LLMs.

# CLAUDE PLANNING

## Project Overview
This section contains the implementation plan for the RAG (Retrieval-Augmented Generation) Q&A system.

## Current Status
- ✅ Planning phase complete
- Ready for implementation

---

## Architecture Decisions

### 1. **Device Management**
- Auto-detect GPU availability using `torch.cuda.is_available()`
- Fallback to CPU if GPU unavailable
- Move models to appropriate device automatically
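
A minimal sketch of the device selection described above. The helper name `pick_device` is an illustrative assumption, and the lazy import is only there so the cell still runs on a machine without `torch` installed.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


DEVICE = pick_device()
print(f"Using device: {DEVICE}")
# A loaded torch model can then be moved with model.to(DEVICE).
```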

### 2. **Models**

#### Embedding Model (Recommended: BAAI/bge-small-en-v1.5)
**Primary Choice:**
- **BAAI/bge-small-en-v1.5** 
  - Size: 33M parameters, 384 dimensions
  - Excellent performance-to-size ratio
  - Good for both CPU and GPU

**Alternatives:**
- **sentence-transformers/all-MiniLM-L6-v2** (faster, 22M params, 384 dims)
- **BAAI/bge-base-en-v1.5** (better quality, 109M params, 768 dims - GPU preferred)

#### LLM (Language Model)
**Primary Choice:**
- **mistralai/Mistral-7B-Instruct-v0.2** (GPU recommended)
  - 7B parameters
  - Strong instruction following
  - Good balance of quality and speed

**Alternatives:**
- **google/flan-t5-large** (780M params - lighter)
- **TinyLlama/TinyLlama-1.1B-Chat-v1.0** (1.1B params - CPU friendly)

### 3. **Vector Database**
- **ChromaDB** - Simple, lightweight, persistent storage
- Local storage for embeddings
- Supports similarity search with various distance metrics

### 4. **Data Source**
- PDFs from `data/` folder
- Subset of G4KMU T2-RagBench dataset
- Dynamic loading - add PDFs as needed

### 5. **Chunking Strategy**
- **Semantic Chunking** with sentence-level splitting
- Approach: Use sentence boundaries as natural breakpoints
- Recommended: RecursiveCharacterTextSplitter with sentence separators
- Target chunk size: 512-1024 characters (adjustable based on model context)
- Overlap: 50-100 characters to maintain context continuity

**Alternative Approaches:**
- Fixed-size chunking (simpler but less semantic)
- Paragraph-based chunking (larger chunks)
- Sliding window with larger overlap
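
To make the sentence-boundary chunking with overlap concrete, here is a rough pure-Python sketch. It is not the `RecursiveCharacterTextSplitter` itself: the regex splitter is a naive stand-in for `nltk` or `spacy`, and the function names and the default sizes (taken from the low end of the ranges above) are assumptions for illustration.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(text: str, max_chars: int = 512, overlap_chars: int = 50) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A new chunk starts with trailing sentences of the previous chunk (up to
    overlap_chars characters) when they fit. A single sentence longer than
    max_chars is kept whole as its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit in the overlap budget.
            carried: list[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + carried)) > overlap_chars:
                    break
                carried.insert(0, prev)
            # Drop the overlap if it would push the new chunk over the limit.
            if len(" ".join(carried + [sentence])) > max_chars:
                carried = []
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One. Two. Three. Four."
print(chunk_sentences(sample, max_chars=12, overlap_chars=5))
```

With the tiny limits used in the example call, the second chunk repeats the last sentence of the first one, which is the overlap behaviour the plan asks for.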

### 6. **Retrieval Strategy**
- **Similarity Search** using cosine similarity
- Top-k retrieval (k=3-5 most relevant chunks)
- Return chunks with similarity scores

**Future Enhancements:**
- Re-ranking with cross-encoder
- Hybrid search (keyword + semantic)
- MMR (Maximal Marginal Relevance) for diversity
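
The basic similarity search can be illustrated without any vector database. The sketch below scores plain Python lists with cosine similarity and keeps the top-k; in the actual pipeline ChromaDB would do this step, and the function names here are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[tuple[int, float]]:
    """Return (chunk index, similarity) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" purely for illustration.
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], toy_chunks, k=2))
```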

### 7. **Evaluation Metrics**

#### RAG-Specific Metrics:
1. **Context Relevance** - How relevant are retrieved documents to the query?
2. **Answer Relevance** - How relevant is the generated answer to the query?
3. **Faithfulness/Groundedness** - Is the answer consistent with retrieved context?
4. **Context Precision** - Precision of relevant chunks in top-k results
5. **Context Recall** - Coverage of relevant information
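
Context precision and recall can be approximated with simple set arithmetic when each test question comes with labelled relevant chunk IDs. That labelling is an assumption here, and evaluation libraries may define these metrics differently; the sketch only shows the set-based reading of the two definitions above.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labelled relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant_ids) / len(top)


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the labelled relevant chunks that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    top = set(retrieved_ids[:k])
    return len(top & relevant_ids) / len(relevant_ids)


retrieved = ["a", "b", "c", "d"]   # hypothetical chunk IDs, best match first
relevant = {"a", "c", "x"}         # hypothetical ground-truth labels
print(precision_at_k(retrieved, relevant, k=3), recall_at_k(retrieved, relevant, k=3))
```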

#### Retrieval Metrics:
- **Hit Rate** - Percentage of queries with at least one relevant result
- **MRR (Mean Reciprocal Rank)** - Average of reciprocal ranks of first relevant result
- **Similarity Scores** - Average cosine similarity of retrieved chunks
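
Hit Rate and MRR follow directly from their definitions. The sketch below assumes each evaluated query is represented as a pair of (retrieved chunk IDs in rank order, set of relevant chunk IDs); that data layout and the function names are illustrative choices, not something fixed by the plan.

```python
def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def hit_rate(results: list[tuple[list[str], set[str]]]) -> float:
    """Share of queries with at least one relevant chunk among the results."""
    if not results:
        return 0.0
    hits = sum(1 for retrieved, relevant in results if set(retrieved) & relevant)
    return hits / len(results)


def mean_reciprocal_rank(results: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over all queries."""
    if not results:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)


runs = [(["a", "b"], {"b"}), (["c"], {"z"})]  # two hypothetical queries
print(hit_rate(runs), mean_reciprocal_rank(runs))
```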

#### Answer Quality Metrics:
- **Answer Similarity** - Semantic similarity to ground truth (if available)
- **Response Time** - Latency for end-to-end query processing
- **BLEU/ROUGE** (optional) - If reference answers available
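
For a dependency-free starting point, the sketch below computes a unigram-overlap F1 score and measures latency with `time.perf_counter`. The F1 is a simplified stand-in in the spirit of ROUGE-1 (whitespace tokens, no stemming), not a replacement for a proper BLEU/ROUGE implementation, and both helper names are assumptions.

```python
import time
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 over lowercased whitespace tokens shared by prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


score, seconds = timed(unigram_f1, "the cat sat", "the cat")
print(f"F1={score:.2f} in {seconds:.6f}s")
```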

---

## Implementation Plan

### Phase 1: Environment Setup
1. Install required packages:
   - `transformers`, `sentence-transformers`, `torch`
   - `chromadb`
   - `PyPDF2` or `pypdf` for PDF processing
   - `langchain` (optional, for text splitting utilities)
   - `nltk` or `spacy` for sentence tokenization

2. Set up device detection and configuration
3. Create data/ folder structure

### Phase 2: Data Ingestion & Processing
1. **Load PDFs** from data/ folder
   - Extract text from each PDF
   - Maintain document metadata (filename, page numbers)

2. **Chunk Documents**
   - Implement semantic chunking with sentence boundaries
   - Create chunk metadata (source document, chunk index, page number)
   - Store original text alongside chunks

3. **Generate Embeddings**
   - Load embedding model (BAAI/bge-small-en-v1.5)
   - Batch process chunks for efficiency
   - Generate embeddings for all chunks

4. **Store in ChromaDB**
   - Initialize ChromaDB collection
   - Store embeddings with metadata
   - Create persistent storage
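
The embedding and storage steps above might look roughly like the following. The storage directory `chroma_db`, the collection name `rag_chunks`, the ID scheme, and both function names are assumptions for illustration; the heavy libraries are imported inside `store_chunks` so the metadata helper can be used on its own. The `hnsw:space` metadata key is how ChromaDB has commonly been told to use cosine distance, but check it against the installed version.

```python
def build_records(filename: str, chunks: list[str]) -> tuple[list[str], list[dict]]:
    """Create one ID and one metadata dict per chunk of a document.

    Page numbers could be added to the metadata if the PDF loader tracks them.
    """
    ids = [f"{filename}-{i}" for i in range(len(chunks))]
    metadatas = [{"source": filename, "chunk_index": i} for i in range(len(chunks))]
    return ids, metadatas


def store_chunks(chunks, ids, metadatas, persist_dir="chroma_db",
                 collection_name="rag_chunks", batch_size=32):
    """Embed chunks in batches and add them to a persistent ChromaDB collection."""
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},  # assumed setting for cosine distance
    )
    for start in range(0, len(chunks), batch_size):
        stop = start + batch_size
        embeddings = model.encode(chunks[start:stop], normalize_embeddings=True)
        collection.add(
            ids=ids[start:stop],
            embeddings=embeddings.tolist(),
            documents=chunks[start:stop],  # keeps the original text with the vector
            metadatas=metadatas[start:stop],
        )
    return collection


ids, metadatas = build_records("report.pdf", ["first chunk", "second chunk"])
print(ids, metadatas)
```

Calling `store_chunks(["first chunk", "second chunk"], ids, metadatas)` would download the embedding model on first use, so it is left out of the example run.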

### Phase 3: RAG Query Pipeline
1. **Load Models**
   - Load embedding model for query encoding
   - Load LLM for answer generation
   - Configure generation parameters

2. **Query Processing**
   - Accept user question
   - Generate query embedding
   - Retrieve top-k similar chunks from ChromaDB

3. **Answer Generation**
   - Construct prompt with retrieved context
   - Format: "Context: {chunks}\n\nQuestion: {question}\n\nAnswer:"
   - Generate answer using LLM
   - Return answer with sources and similarity scores
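
The prompt format listed above translates directly into a small helper. Joining chunks with a blank line is an assumption, as is the name `build_prompt`; the template string is the one from the plan.

```python
PROMPT_TEMPLATE = "Context: {chunks}\n\nQuestion: {question}\n\nAnswer:"


def build_prompt(question: str, chunks: list[str]) -> str:
    """Fill the plan's prompt template with retrieved chunks and the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return PROMPT_TEMPLATE.format(chunks=context, question=question.strip())


print(build_prompt("What is RAG?", ["RAG combines retrieval with generation."]))
```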

### Phase 4: Evaluation & Metrics
1. **Implement Metric Calculators**
   - Context relevance scorer
   - Answer relevance scorer
   - Faithfulness checker
   - Retrieval metrics (Hit Rate, MRR)

2. **Logging & Output**
   - Log queries, retrieved contexts, and answers
   - Save evaluation metrics to file
   - Create visualization of results (optional)

3. **Test Cases**
   - Create test questions for evaluation
   - Compare results across different configurations
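
One simple way to handle the logging step is an append-only JSON Lines file, one record per query. The record fields, the file layout, and the function name below are assumptions rather than a fixed format.

```python
import json
import time
from pathlib import Path
from typing import Optional


def log_interaction(log_path, question: str, contexts: list[str], answer: str,
                    metrics: Optional[dict] = None) -> dict:
    """Append one query record as a JSON line and return the record."""
    record = {
        "timestamp": time.time(),
        "question": question,
        "contexts": contexts,
        "answer": answer,
        "metrics": metrics or {},
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the log folder if needed
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

A call such as `log_interaction("logs/run.jsonl", question, contexts, answer, {"mrr": 0.5})` after each query keeps results comparable across configurations, and the file loads cleanly with `pandas.read_json(path, lines=True)`.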

---

## Notes & Considerations

### Performance Optimization:
- Use batch processing for embeddings
- Consider quantization (4-bit/8-bit) for LLM if memory constrained
- Cache embeddings to avoid recomputation
- Use GPU memory efficiently (offload when not in use)
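
Caching embeddings can be as simple as a dictionary keyed by a hash of the text. The sketch below is an in-memory version with a hypothetical `EmbeddingCache` class; `embed_fn` stands for whatever function turns a string into a vector, and persisting the cache to disk is left out.

```python
import hashlib


class EmbeddingCache:
    """Memoise an embedding function so identical text is embedded only once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        # Hash the text so long chunks do not become huge dictionary keys.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
        return self._store[key]

    def __len__(self) -> int:
        return len(self._store)


# Toy embedding function used only to demonstrate the cache.
cache = EmbeddingCache(lambda text: [float(len(text))])
cache.get("hello")
cache.get("hello")  # served from the cache, embed_fn is not called again
print(len(cache))
```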

### Quality Improvements:
- Experiment with different chunk sizes
- Tune top-k retrieval parameter
- Try different prompt templates
- Consider re-ranking retrieved results

### Future Enhancements:
- Add query expansion/reformulation
- Implement conversational memory for multi-turn QA
- Add citation/source attribution in answers
- Support multiple embedding models comparison
- Web interface for easier interaction

### Error Handling:
- Handle missing PDFs gracefully
- Validate embedding dimensions
- Catch model loading errors
- Log failures for debugging
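
Two of these checks are easy to sketch with the standard library alone. The helper names are hypothetical; the default folder `data` and the expected dimension of 384 come from the plan above (the `bge-small` and MiniLM embedding sizes).

```python
import logging
from pathlib import Path

logger = logging.getLogger("rag")


def list_pdfs(data_dir: str = "data") -> list[Path]:
    """Return the PDFs in data_dir, or an empty list (with a warning) if none exist."""
    folder = Path(data_dir)
    if not folder.is_dir():
        logger.warning("Data folder %s does not exist; nothing to ingest.", folder)
        return []
    pdfs = sorted(folder.glob("*.pdf"))
    if not pdfs:
        logger.warning("No PDF files found in %s.", folder)
    return pdfs


def validate_embeddings(embeddings: list[list[float]], expected_dim: int = 384) -> None:
    """Raise ValueError if any embedding has the wrong number of dimensions."""
    for index, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"Embedding {index} has {len(vector)} dimensions, expected {expected_dim}"
            )


print(list_pdfs("data"))
```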

---

## Dependencies
```text
# Core ML
torch
transformers
sentence-transformers

# Vector DB
chromadb

# Text Processing
pypdf  # or PyPDF2
langchain-text-splitters  # or langchain
nltk

# Evaluation
scikit-learn  # for metrics
numpy
pandas

# Optional
ragas  # for advanced RAG metrics
```

# ENVIRONMENT CONFIGURATION

# Ingesting Data


# Process Data

# Save to vector database

# Load Models

# Get User Question

# Prompt Model

# Get Output

# Save Outputs

# Metrics