<a href="https://colab.research.google.com/github/PeterTheMango/RagResearch/blob/main/Rag_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a RAG System
### Done by: Peter Sotomango [60301211]

In this notebook I explore how to design a RAG-based Q&A system using embedding models and large language models from Hugging Face.

I focus on lightweight models for now because of limited compute resources.

The dataset used is [G4KMU's T2-RagBench](https://huggingface.co/datasets/G4KMU/t2-ragbench), which supplies the test documents stored in the vector database and used to test the LLMs.

# CLAUDE PLANNING

## Project Overview
This section contains the implementation plan for the RAG (Retrieval-Augmented Generation) Q&A system.

## Current Status
- ✅ Planning phase complete
- Ready for implementation

---

## Architecture Decisions

### 1. **Device Management**
- Auto-detect GPU availability using `torch.cuda.is_available()`
- Fallback to CPU if GPU unavailable
- Move models to appropriate device automatically

### 2. **Models**

#### Embedding Model (Recommended: BAAI/bge-small-en-v1.5)
**Primary Choice:**
- **BAAI/bge-small-en-v1.5** 
  - Size: 33M parameters, 384 dimensions
  - Excellent performance-to-size ratio
  - Good for both CPU and GPU

**Alternatives:**
- **sentence-transformers/all-MiniLM-L6-v2** (faster, 22M params, 384 dims)
- **BAAI/bge-base-en-v1.5** (better quality, 109M params, 768 dims - GPU preferred)

#### LLM (Language Model)
**Primary Choice:**
- **mistralai/Mistral-7B-Instruct-v0.2** (GPU recommended)
  - 7B parameters
  - Strong instruction following
  - Good balance of quality and speed

**Alternatives:**
- **google/flan-t5-large** (780M params - lighter)
- **TinyLlama/TinyLlama-1.1B-Chat-v1.0** (1.1B params - CPU friendly)

### 3. **Vector Database**
- **ChromaDB** - Simple, lightweight, persistent storage
- Local storage for embeddings
- Supports similarity search with various distance metrics

### 4. **Data Source**
- PDFs from `data/` folder
- Subset of G4KMU T2-RagBench dataset
- Dynamic loading - add PDFs as needed

### 5. **Chunking Strategy**
- **Semantic Chunking** with sentence-level splitting
- Approach: Use sentence boundaries as natural breakpoints
- Recommended: RecursiveCharacterTextSplitter with sentence separators
- Target chunk size: 512-1024 characters (adjustable based on model context)
- Overlap: 50-100 characters to maintain context continuity
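
The sentence-boundary idea above can be sketched without any library. This is a minimal illustration, not the splitter the notebook uses: `split_sentences` and `pack_sentences` are hypothetical helper names, and the regex splitter is a naive assumption that will mis-split abbreviations such as "e.g.".

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def pack_sentences(sentences: List[str], chunk_size: int = 512, overlap: int = 100) -> List[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    Each new chunk starts with the trailing sentences of the previous chunk
    that fit inside `overlap` characters. A single sentence longer than
    chunk_size is kept whole and becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit in the overlap budget
            carried: List[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + carried)) > overlap:
                    break
                carried.insert(0, prev)
            # Drop the overlap if it would push the new chunk past chunk_size
            if len(" ".join(carried + [sentence])) > chunk_size:
                carried = []
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    demo = "Alpha one. Beta two. Gamma three. Delta four."
    for chunk in pack_sentences(split_sentences(demo), chunk_size=25, overlap=10):
        print(repr(chunk))
```

Because overlap is measured in whole sentences here, the actual overlap can be shorter than the budget (or empty) when the trailing sentence is long.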

**Alternative Approaches:**
- Fixed-size chunking (simpler but less semantic)
- Paragraph-based chunking (larger chunks)
- Sliding window with larger overlap

### 6. **Retrieval Strategy**
- **Similarity Search** using cosine similarity
- Top-k retrieval (k=3-5 most relevant chunks)
- Return chunks with similarity scores
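
To make the retrieval step concrete, here is a small pure-Python sketch of cosine-similarity top-k search over in-memory vectors. The names `cosine_similarity` and `top_k` are illustrative assumptions; in the notebook itself the vector database is expected to perform this search.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for index, score in top_k([1.0, 0.0], vectors, k=2):
        print(index, round(score, 4))
```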

**Future Enhancements:**
- Re-ranking with cross-encoder
- Hybrid search (keyword + semantic)
- MMR (Maximal Marginal Relevance) for diversity
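
As a rough illustration of the MMR idea, the sketch below greedily picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. The function name `mmr_select` and the parameter name `lambda_mult` are assumptions made for this example.

```python
import math
from typing import List, Sequence


def _cos(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int = 3,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k candidates chosen by Maximal Marginal Relevance."""
    relevance = [_cos(query, c) for c in candidates]
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    while remaining and len(selected) < k:
        def mmr_score(i: int) -> float:
            # Redundancy: similarity to the closest already-selected candidate
            redundancy = max((_cos(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lambda_mult * relevance[i] - (1.0 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 1.0]
    candidates = [[1.0, 0.9], [1.0, 0.8], [0.0, 1.0]]
    print(mmr_select(query, candidates, k=2, lambda_mult=0.5))
    print(mmr_select(query, candidates, k=2, lambda_mult=1.0))
```

With `lambda_mult=1.0` the selection reduces to plain relevance ranking; lower values trade relevance for diversity among the returned chunks.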

### 7. **Evaluation Metrics**

#### RAG-Specific Metrics:
1. **Context Relevance** - How relevant are retrieved documents to the query?
2. **Answer Relevance** - How relevant is the generated answer to the query?
3. **Faithfulness/Groundedness** - Is the answer consistent with retrieved context?
4. **Context Precision** - Precision of relevant chunks in top-k results
5. **Context Recall** - Coverage of relevant information

#### Retrieval Metrics:
- **Hit Rate** - Percentage of queries with at least one relevant result
- **MRR (Mean Reciprocal Rank)** - Average of reciprocal ranks of first relevant result
- **Similarity Scores** - Average cosine similarity of retrieved chunks
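
Hit Rate and MRR can be computed from ranked result IDs and a set of relevant IDs per query. The sketch below assumes chunk IDs are plain strings and that relevance labels are available; the function names are illustrative.

```python
from typing import List, Set


def hit_rate(results: List[List[str]], relevant: List[Set[str]]) -> float:
    """Fraction of queries whose retrieved list contains at least one relevant ID."""
    if not results:
        return 0.0
    hits = sum(
        1
        for retrieved, gold in zip(results, relevant)
        if any(doc_id in gold for doc_id in retrieved)
    )
    return hits / len(results)


def mean_reciprocal_rank(results: List[List[str]], relevant: List[Set[str]]) -> float:
    """Average of 1/rank of the first relevant result (0 when none is retrieved)."""
    if not results:
        return 0.0
    total = 0.0
    for retrieved, gold in zip(results, relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(results)


if __name__ == "__main__":
    results = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
    relevant = [{"a"}, {"f"}, {"z"}]
    print(hit_rate(results, relevant))
    print(mean_reciprocal_rank(results, relevant))
```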

#### Answer Quality Metrics:
- **Answer Similarity** - Semantic similarity to ground truth (if available)
- **Response Time** - Latency for end-to-end query processing
- **BLEU/ROUGE** (optional) - If reference answers available
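
When reference answers are available, a unigram-overlap F1 gives a quick lexical check in the spirit of ROUGE-1. This is a simplified stand-in and not the official ROUGE implementation: it lowercases, splits on whitespace, and applies no stemming. The name `unigram_f1` is hypothetical.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 over overlapping lowercase whitespace tokens (clipped counts)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(unigram_f1("the revenue was 5 million", "revenue was 5 million dollars"))
```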

---

## Implementation Plan

### Phase 1: Environment Setup
1. Install required packages:
   - `transformers`, `sentence-transformers`, `torch`
   - `chromadb`
   - `PyPDF2` or `pypdf` for PDF processing
   - `langchain` (optional, for text splitting utilities)
   - `nltk` or `spacy` for sentence tokenization

2. Set up device detection and configuration
3. Create data/ folder structure

### Phase 2: Data Ingestion & Processing
1. **Load PDFs** from data/ folder
   - Extract text from each PDF
   - Maintain document metadata (filename, page numbers)

2. **Chunk Documents**
   - Implement semantic chunking with sentence boundaries
   - Create chunk metadata (source document, chunk index, page number)
   - Store original text alongside chunks

3. **Generate Embeddings**
   - Load embedding model (BAAI/bge-small-en-v1.5)
   - Batch process chunks for efficiency
   - Generate embeddings for all chunks

4. **Store in ChromaDB**
   - Initialize ChromaDB collection
   - Store embeddings with metadata
   - Create persistent storage

### Phase 3: RAG Query Pipeline
1. **Load Models**
   - Load embedding model for query encoding
   - Load LLM for answer generation
   - Configure generation parameters

2. **Query Processing**
   - Accept user question
   - Generate query embedding
   - Retrieve top-k similar chunks from ChromaDB

3. **Answer Generation**
   - Construct prompt with retrieved context
   - Format: "Context: {chunks}\n\nQuestion: {question}\n\nAnswer:"
   - Generate answer using LLM
   - Return answer with sources and similarity scores
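
The prompt format above can be assembled with a small helper. This is a sketch only: `build_prompt` is a hypothetical name, and joining chunks with blank lines is an assumption about how the retrieved context should be laid out.

```python
from typing import List


def build_prompt(question: str, chunks: List[str]) -> str:
    """Fill the 'Context / Question / Answer' template with retrieved chunks."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"


if __name__ == "__main__":
    print(build_prompt("What was the 2020 revenue?", ["Revenue in 2020 was $5M.", "Costs were $3M."]))
```

Instruction-tuned models usually expect this text wrapped in their own chat template, so the string returned here would typically become the user message rather than the raw model input.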

### Phase 4: Evaluation & Metrics
1. **Implement Metric Calculators**
   - Context relevance scorer
   - Answer relevance scorer
   - Faithfulness checker
   - Retrieval metrics (Hit Rate, MRR)

2. **Logging & Output**
   - Log queries, retrieved contexts, and answers
   - Save evaluation metrics to file
   - Create visualization of results (optional)

3. **Test Cases**
   - Create test questions for evaluation
   - Compare results across different configurations
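
One simple way to handle the logging step is to append one JSON object per query to a JSON Lines file. The sketch below is an assumption about the log layout, not a fixed schema; `log_query` and the field names are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List


def log_query(
    log_path: str,
    question: str,
    contexts: List[str],
    answer: str,
    metrics: Dict[str, float],
) -> None:
    """Append one query record to a JSON Lines file, creating folders as needed."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "contexts": contexts,
        "answer": answer,
        "metrics": metrics,
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Each line can later be loaded independently, for example with `pandas.read_json(path, lines=True)`, to compare runs across configurations.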

---

## Notes & Considerations

### Performance Optimization:
- Use batch processing for embeddings
- Consider quantization (4-bit/8-bit) for LLM if memory constrained
- Cache embeddings to avoid recomputation
- Use GPU memory efficiently (offload when not in use)
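
Caching embeddings can be as simple as keying vectors by a hash of the chunk text. The sketch below is an in-memory illustration; `EmbeddingCache` is a hypothetical class, and `embed_fn` stands in for whatever function turns a list of texts into a list of vectors.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Wrap an embedding function so each distinct text is embedded only once."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Collect uncached texts once each, preserving first-seen order
        missing = list(dict.fromkeys(t for t in texts if self._key(t) not in self._store))
        if missing:
            # Single batched call for everything not yet cached
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._store[self._key(text)] = vector
        return [self._store[self._key(t)] for t in texts]
```

A persistent variant could write `_store` to disk, but the cache would then need to be invalidated whenever the embedding model changes.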

### Quality Improvements:
- Experiment with different chunk sizes
- Tune top-k retrieval parameter
- Try different prompt templates
- Consider re-ranking retrieved results

### Future Enhancements:
- Add query expansion/reformulation
- Implement conversational memory for multi-turn QA
- Add citation/source attribution in answers
- Support multiple embedding models comparison
- Web interface for easier interaction

### Error Handling:
- Handle missing PDFs gracefully
- Validate embedding dimensions
- Catch model loading errors
- Log failures for debugging
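
Validating embedding dimensions before writing to the vector store catches a mismatched model early. A minimal sketch, assuming embeddings arrive as lists of floats and the expected size is known (384 for the embedding model chosen above); `validate_embeddings` is a hypothetical helper.

```python
from typing import Sequence


def validate_embeddings(embeddings: Sequence[Sequence[float]], expected_dim: int) -> None:
    """Raise ValueError if the batch is empty or any vector has the wrong length."""
    if not embeddings:
        raise ValueError("No embeddings to validate")
    for i, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"Embedding {i} has dimension {len(vector)}, expected {expected_dim}"
            )


if __name__ == "__main__":
    validate_embeddings([[0.0] * 384, [1.0] * 384], expected_dim=384)
    print("All embeddings have the expected dimension")
```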

---

## Dependencies
```text
# Core ML
torch
transformers
sentence-transformers

# Vector DB
chromadb

# Text Processing
pypdf or PyPDF2
langchain or langchain-text-splitters
nltk

# Evaluation
scikit-learn (for metrics)
numpy
pandas

# Optional
ragas (for advanced RAG metrics)
```

# ENVIRONMENT CONFIGURATION

## Install Required Packages
Run this cell first to install all dependencies.

In [None]:
!pip install -q transformers sentence-transformers torch accelerate
!pip install -q chromadb
!pip install -q pypdf langchain-text-splitters
!pip install -q nltk scikit-learn pandas numpy
print("✓ All packages installed successfully!")

## Import Libraries

In [None]:
# Core ML libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from sentence_transformers import SentenceTransformer

# Vector database
import chromadb
from chromadb.config import Settings

# Text processing
from pypdf import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import nltk

# Utilities
import os
import numpy as np
import pandas as pd
from pathlib import Path
from typing import List, Dict, Tuple
import json
from datetime import datetime
import time

# Download NLTK data for sentence tokenization
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt', quiet=True)

print("✓ All libraries imported successfully!")

## Device Detection & Configuration

In [None]:
# Detect device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(f"Device: {device}")
if device.type == "cuda":
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
else:
    print("Running on CPU - models will load slower")

# Configuration
CONFIG = {
    "device": device,
    "embedding_model_name": "BAAI/bge-small-en-v1.5",
    "llm_model_name": "mistralai/Mistral-7B-Instruct-v0.2",  # Can be changed based on resources
    "chunk_size": 512,
    "chunk_overlap": 100,
    "top_k": 3,  # Number of chunks to retrieve
    "data_folder": "data",
    "chroma_db_path": "./chroma_db",
    "output_folder": "outputs"
}

print("\n✓ Device detection complete!")
print(f"Configuration: {json.dumps({k: str(v) for k, v in CONFIG.items()}, indent=2)}")

## Create Folder Structure

In [None]:
# Create necessary folders
folders = [CONFIG["data_folder"], CONFIG["output_folder"]]

for folder in folders:
    Path(folder).mkdir(parents=True, exist_ok=True)
    print(f"✓ Created/verified folder: {folder}")

# Check if there are any PDFs in the data folder
pdf_files = list(Path(CONFIG["data_folder"]).glob("*.pdf"))
print(f"\nFound {len(pdf_files)} PDF file(s) in {CONFIG['data_folder']} folder")

if len(pdf_files) == 0:
    print(f"\n⚠ No PDF files found. Please add PDF files to the '{CONFIG['data_folder']}' folder before proceeding.")
else:
    print("PDF files:")
    for pdf in pdf_files:
        print(f"  - {pdf.name}")

# Ingesting Data


## Load PDFs and Extract Text

In [None]:
def load_pdfs_from_folder(folder_path: str) -> List[Dict]:
    """
    Load all PDF files from a folder and extract text with metadata
    
    Args:
        folder_path: Path to folder containing PDF files
        
    Returns:
        List of dictionaries containing document text and metadata
    """
    documents = []
    pdf_files = list(Path(folder_path).glob("*.pdf"))
    
    if len(pdf_files) == 0:
        print(f"⚠ No PDF files found in {folder_path}")
        return documents
    
    print(f"Loading {len(pdf_files)} PDF file(s)...")
    
    for pdf_path in pdf_files:
        try:
            reader = PdfReader(str(pdf_path))
            num_pages = len(reader.pages)
            
            print(f"\n  Processing: {pdf_path.name} ({num_pages} pages)")
            
            # Extract text from each page
            for page_num, page in enumerate(reader.pages, start=1):
                # Fall back to an empty string if the extractor returns nothing
                text = page.extract_text() or ""
                
                if text.strip():  # Only add non-empty pages
                    documents.append({
                        "text": text,
                        "metadata": {
                            "source": pdf_path.name,
                            "page": page_num,
                            "total_pages": num_pages
                        }
                    })
            
            print(f"    ✓ Extracted {num_pages} pages")

        except Exception as e:
            print(f"    ✗ Error processing {pdf_path.name}: {str(e)}")
            continue
    
    print(f"\n✓ Successfully loaded {len(documents)} page(s) from {len(pdf_files)} PDF file(s)")
    return documents

# Load all PDFs
documents = load_pdfs_from_folder(CONFIG["data_folder"])

# Display summary
if documents:
    total_chars = sum(len(doc["text"]) for doc in documents)
    print(f"\nTotal characters extracted: {total_chars:,}")
    print(f"Average characters per page: {total_chars // len(documents):,}")

# Process Data

## Chunk Documents with Semantic Splitting

In [None]:
def chunk_documents(documents: List[Dict], chunk_size: int = 512, chunk_overlap: int = 100) -> List[Dict]:
    """
    Chunk documents using semantic splitting with sentence boundaries
    
    Args:
        documents: List of documents with text and metadata
        chunk_size: Target size for each chunk
        chunk_overlap: Number of characters to overlap between chunks
        
    Returns:
        List of chunks with associated metadata
    """
    # Initialize text splitter with sentence-aware splitting
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""],  # Prioritize semantic breaks
        is_separator_regex=False
    )
    
    chunks = []
    
    print(f"Chunking {len(documents)} document(s)...")
    print(f"  Chunk size: {chunk_size}, Overlap: {chunk_overlap}\n")
    
    for doc_idx, doc in enumerate(documents):
        text = doc["text"]
        metadata = doc["metadata"]
        
        # Split text into chunks
        text_chunks = text_splitter.split_text(text)
        
        # Create chunk objects with metadata
        for chunk_idx, chunk_text in enumerate(text_chunks):
            chunks.append({
                "text": chunk_text,
                "metadata": {
                    **metadata,  # Include original document metadata
                    "chunk_index": chunk_idx,
                    "doc_index": doc_idx,
                    "chunk_size": len(chunk_text)
                }
            })
    
    print(f"✓ Created {len(chunks)} chunks from {len(documents)} document(s)")
    if chunks:  # Avoid division by zero when no chunks were produced
        print(f"  Average chunk size: {sum(len(c['text']) for c in chunks) // len(chunks)} characters")
    
    return chunks

# Chunk the documents
if documents:
    chunks = chunk_documents(
        documents, 
        chunk_size=CONFIG["chunk_size"], 
        chunk_overlap=CONFIG["chunk_overlap"]
    )
    
    # Display some examples
    print(f"\n📄 Example chunks:")
    for i, chunk in enumerate(chunks[:2]):  # Show first 2 chunks
        print(f"\nChunk {i+1}:")
        print(f"  Source: {chunk['metadata']['source']}, Page: {chunk['metadata']['page']}")
        print(f"  Text preview: {chunk['text'][:150]}...")
else:
    print("⚠ No documents to chunk. Please load PDFs first.")
    chunks = []

# Save to vector database

## Generate Embeddings and Store in ChromaDB

In [None]:
def create_embeddings_and_store(chunks: List[Dict], model_name: str, device: torch.device, db_path: str) -> Tuple[chromadb.Collection, SentenceTransformer]:
    """
    Generate embeddings for chunks and store them in ChromaDB
    
    Args:
        chunks: List of text chunks with metadata
        model_name: Name of the embedding model
        device: Device to use (CPU/GPU)
        db_path: Path to ChromaDB storage
        
    Returns:
        Tuple of (ChromaDB collection with stored embeddings, loaded embedding model),
        or (None, None) if there are no chunks to embed
    """
    if not chunks:
        print("⚠ No chunks to embed!")
        return None, None  # Keep the two-value shape so callers can always unpack
    
    print(f"\n{'='*60}")
    print(f"Loading embedding model: {model_name}")
    print(f"{'='*60}")
    
    # Load embedding model
    embedding_model = SentenceTransformer(model_name, device=str(device))
    print(f"✓ Model loaded on {device}")
    
    # Extract texts and prepare metadata
    texts = [chunk["text"] for chunk in chunks]
    
    # Build string-valued copies of the metadata for ChromaDB,
    # leaving the original chunk dictionaries unmodified
    metadatas = [
        {key: str(value) for key, value in chunk["metadata"].items()}
        for chunk in chunks
    ]
    
    print(f"\n{'='*60}")
    print(f"Generating embeddings for {len(texts)} chunks...")
    print(f"{'='*60}")
    
    # Generate embeddings in batches for efficiency
    batch_size = 32
    embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        batch_embeddings = embedding_model.encode(
            batch,
            convert_to_numpy=True,
            show_progress_bar=False
        )
        embeddings.extend(batch_embeddings.tolist())
        
        if (i // batch_size + 1) % 10 == 0:
            print(f"  Processed {i + len(batch)}/{len(texts)} chunks")
    
    print(f"✓ Generated {len(embeddings)} embeddings")
    print(f"  Embedding dimension: {len(embeddings[0])}")
    
    # Initialize ChromaDB client
    print(f"\n{'='*60}")
    print(f"Initializing ChromaDB...")
    print(f"{'='*60}")
    
    client = chromadb.PersistentClient(path=db_path)
    
    # Delete existing collection if it exists (for fresh start)
    try:
        client.delete_collection("rag_collection")
        print("  Deleted existing collection")
    except Exception:
        # Collection did not exist yet; nothing to delete
        pass
    
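    # Create new collection
    # ChromaDB uses L2 distance by default; request cosine distance
    # so retrieval matches the cosine-similarity strategy in the plan
    collection = client.create_collection(
        name="rag_collection",
        metadata={
            "description": "RAG document chunks with embeddings",
            "hnsw:space": "cosine"
        }
    )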
    
    print(f"✓ Created collection: rag_collection")
    
    # Add documents to collection
    print(f"\n{'='*60}")
    print(f"Storing embeddings in ChromaDB...")
    print(f"{'='*60}")
    
    # ChromaDB requires unique IDs
    ids = [f"chunk_{i}" for i in range(len(chunks))]
    
    collection.add(
        embeddings=embeddings,
        documents=texts,
        metadatas=metadatas,
        ids=ids
    )
    
    print(f"✓ Successfully stored {collection.count()} chunks in vector database")
    print(f"  Database path: {db_path}")
    
    return collection, embedding_model

# Generate embeddings and store in ChromaDB
if chunks:
    collection, embedding_model = create_embeddings_and_store(
        chunks,
        CONFIG["embedding_model_name"],
        CONFIG["device"],
        CONFIG["chroma_db_path"]
    )
    
    print(f"\n{'='*60}")
    print(f"✓ Phase 2 Complete: Data Ingestion & Processing")
    print(f"{'='*60}")
    print(f"  Total chunks embedded: {len(chunks)}")
    print(f"  Vector database ready for querying")
else:
    print("⚠ No chunks available. Please ensure PDFs are loaded and chunked first.")
    collection = None
    embedding_model = None

# Load Models

# Get User Question

# Prompt Model

# Get Output

# Save Outputs

# Metrics