# Advanced RAG System Demo

### Objectives:
1. **System Resource Monitoring**: Track process memory (RSS) at each pipeline stage.
2. **Pipeline Execution**: Load data, chunk, index, retrieve, and generate.
3. **Evaluation**: Compare generated answers against reference answers using **BLEU-4** and **ROUGE-L** metrics.
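
For orientation, ROUGE-L scores a generated answer by the longest common subsequence (LCS) of tokens it shares with the reference. The block below is a minimal pure-Python sketch of the F1 variant. It assumes lowercased whitespace tokenization, and `lcs_length` and `rouge_l_f1` are illustrative names. The notebook's own `Evaluator` is not shown here and may tokenize or weight precision and recall differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercased whitespace tokens (assumed tokenization)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)
```

Because the LCS only needs matching tokens in the same order, not adjacent ones, ROUGE-L tends to stay above zero for loosely phrased answers where 4-gram overlap is absent.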

In [1]:
# Setup Environment & Utils
# %pip installs into the environment of the running kernel
%pip install -q -r requirements.txt

import sys
import os
import psutil
from dotenv import load_dotenv

# Ensure the project root (which contains the `rag` package) is on the Python path
if os.getcwd() not in sys.path:
    sys.path.append(os.getcwd())

def print_system_usage(stage=""):
    """Print the resident set size (RSS) of the current process in MB."""
    process = psutil.Process(os.getpid())
    mem_mb = process.memory_info().rss / 1024 / 1024
    print(f"[{stage}] Memory: {mem_mb:.2f} MB")



In [2]:
# Load Configuration & Modules
from rag.config import RAGConfig
from rag.data_loader import DataLoader
from rag.chunking import HierarchicalChunker
from rag.vector_db import VectorDBHandler
from rag.retriever import HierarchicalRetriever
from rag.generator import RAGGenerator
from rag.evaluator import Evaluator

config = RAGConfig()
print("Configuration Loaded.")
print_system_usage("Init")

Configuration Loaded.
[Init] Memory: 524.97 MB
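
`print_system_usage` reports a single point-in-time RSS reading, so it does not show how much a stage added or how long it ran. One possible extension, sketched below, is a context manager that records the RSS change together with wall-clock and CPU time for the enclosed block. `track_stage` and its `read_rss_mb` parameter are hypothetical names for this sketch. The memory reader is injected so the helper itself depends only on the standard library.

```python
import time
from contextlib import contextmanager


@contextmanager
def track_stage(stage, read_rss_mb, log=print):
    """Log RSS change, wall time, and CPU time for the enclosed block.

    `read_rss_mb` is any zero-argument callable that returns the current
    resident set size in MB (a hypothetical parameter for this sketch).
    """
    stats = {}
    rss_before = read_rss_mb()
    wall_before = time.perf_counter()
    cpu_before = time.process_time()
    try:
        yield stats
    finally:
        stats["rss_before_mb"] = rss_before
        stats["rss_after_mb"] = read_rss_mb()
        stats["wall_s"] = time.perf_counter() - wall_before
        stats["cpu_s"] = time.process_time() - cpu_before
        delta = stats["rss_after_mb"] - rss_before
        log(
            f"[{stage}] RSS {rss_before:.2f} -> {stats['rss_after_mb']:.2f} MB "
            f"({delta:+.2f}), wall {stats['wall_s']:.2f}s, cpu {stats['cpu_s']:.2f}s"
        )


# In this notebook the reader could wrap the same psutil call used above:
#   read_rss = lambda: psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024
#   with track_stage("Indexing", read_rss):
#       vdb.index_chunks(chunks)
```

`time.process_time()` counts CPU time of the Python process only, so work offloaded to a GPU or MPS device would show up in wall time but not in CPU time.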


In [3]:
# Authenticate with Hugging Face (Required for Gemma Model)
from huggingface_hub import login

# Load existing .env file
load_dotenv()
hf_token = os.getenv("HF_TOKEN")

if hf_token and hf_token != "your_huggingface_token_here":
    print("Logging in with token from .env...")
    login(token=hf_token)
else:
    print("Please paste your token manually or update the .env file.")
    print("Get token: https://huggingface.co/settings/tokens")
    login()

Logging in with token from .env...


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [4]:
# Data Loading
loader = DataLoader(config)

# Download Book
book_text = loader.download_book()
print(f"Book loaded. Length: {len(book_text)} chars")

# Load QA Pairs
qa_pairs = loader.load_qa_pairs()
print(f"Loaded {len(qa_pairs)} QA pairs for testing.")
print_system_usage("Data Loading")

`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'narrativeqa' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.


Book already exists at /Users/gizemcidal/Desktop/rag_and_finetuning_task_vdf/data/zuleika_dobson.txt
Book loaded. Length: 467598 chars
Loading NarrativeQA test split for ID 1845...


Found 40 QA pairs for Book ID 1845.
Loaded 40 QA pairs for testing.
[Data Loading] Memory: 559.77 MB


In [5]:
# Hierarchical Chunking
chunker = HierarchicalChunker(
    parent_chunk_size=config.PARENT_CHUNK_SIZE,
    child_chunk_size=config.CHILD_CHUNK_SIZE,
    overlap=config.CHUNK_OVERLAP
)

chunks = chunker.chunk_data(book_text)
print(f"Created {len(chunks['parents'])} parent chunks and {len(chunks['children'])} child chunks.")

parents = chunks['parents']
children = chunks['children']
print_system_usage("Chunking")

Created 346 parent chunks and 30866 child chunks.
[Chunking] Memory: 574.44 MB
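
The `HierarchicalChunker` source is not shown in this notebook. As a rough illustration of the idea, the sketch below splits text into non-overlapping parent windows and then splits each parent into overlapping child windows that remember their parent. It assumes character-based sizes, and `sliding_windows` and `chunk_hierarchically` are hypothetical names. The real class may split on tokens or sentences, so this sketch is not expected to reproduce the counts printed above.

```python
def sliding_windows(text, size, overlap):
    """Split text into windows of `size` chars that share `overlap` chars."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return windows


def chunk_hierarchically(text, parent_size, child_size, overlap):
    """Return parents and children; each child keeps its parent's id."""
    parents, children = [], []
    for parent_id, parent_text in enumerate(sliding_windows(text, parent_size, 0)):
        parents.append({"id": parent_id, "text": parent_text})
        for child_text in sliding_windows(parent_text, child_size, overlap):
            children.append({"parent_id": parent_id, "text": child_text})
    return {"parents": parents, "children": children}
```

Small children give precise embedding matches, while the `parent_id` link lets retrieval hand the generator the larger surrounding passage.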


In [6]:
# Indexing in Qdrant (Local Disk Mode)
# CRITICAL: Force cleanup of previous instances to release file locks
import gc
try:
    if 'vdb' in locals():
        print("Cleaning up previous DB instance...")
        if hasattr(vdb, 'close'):
            vdb.close()
        del vdb
        gc.collect() # Force garbage collection to release file handles
except Exception as e:
    print(f"Cleanup warning: {e}")

vdb = VectorDBHandler(config)
vdb.create_collection()

print("Indexing chunks... (this creates embeddings using CPU/GPU)")
vdb.index_chunks(chunks)
print_system_usage("Indexing")

Initializing Qdrant at /Users/gizemcidal/Desktop/rag_and_finetuning_task_vdf/data/qdrant_db


Collection dracula_chunks already exists.
Indexing chunks... (this creates embeddings using CPU/GPU)
Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
Generating embeddings for 30866 chunks...



Upserted 30866 points.
[Indexing] Memory: 8574.31 MB
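
The log above shows that the collection already existed before 30866 points were upserted. Whether a re-run overwrites or duplicates points depends on how `VectorDBHandler` assigns point IDs, which is not shown here. If the IDs were random, one way to keep re-indexing idempotent would be to derive them from the chunk itself. Qdrant accepts unsigned integers or UUIDs as point IDs. The sketch below builds a deterministic UUIDv5 per chunk, and `CHUNK_NAMESPACE` and `chunk_point_id` are hypothetical names.

```python
import uuid

# Hypothetical namespace for this sketch; any fixed UUID would do.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "rag-demo/chunks")


def chunk_point_id(parent_id, child_index, text):
    """Return the same UUID string every time for the same chunk."""
    key = f"{parent_id}:{child_index}:{text}"
    return str(uuid.uuid5(CHUNK_NAMESPACE, key))
```

With stable IDs, upserting the same chunk twice replaces the existing point instead of adding a second copy.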


In [7]:
# Initialize Components
retriever = HierarchicalRetriever(config, vdb, parents)
generator = RAGGenerator(config)
evaluator = Evaluator()
print("RAG Components Ready.")
print_system_usage("Model Load")

Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
Loading Reranker model: cross-encoder/ms-marco-MiniLM-L-6-v2


Loading LLM: google/gemma-3-1b-it


`torch_dtype` is deprecated! Use `dtype` instead!
Device set to use mps


RAG Components Ready.
[Model Load] Memory: 1234.86 MB
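
`HierarchicalRetriever` receives both the vector store and the parent chunks, which suggests a child-to-parent pattern: search over the small child embeddings, then return the larger parents as context. The sketch below illustrates that pattern on plain Python lists, under the assumption that each indexed child stores its parent's id. `cosine` and `retrieve_parents` are hypothetical names, and the cross-encoder reranking step seen in the log is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_parents(query_vec, child_index, parents, top_k_children=10, top_k_parents=3):
    """Rank children by similarity, then return their distinct parent texts.

    child_index: list of (parent_id, vector) pairs
    parents: dict mapping parent_id -> parent text
    """
    ranked = sorted(child_index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    seen, context = set(), []
    for parent_id, _ in ranked[:top_k_children]:
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
        if len(context) == top_k_parents:
            break
    return context
```

Deduplicating by parent matters because several top-ranked children often come from the same passage, and repeating that passage would waste the generator's context window.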


In [8]:
# Run RAG Loop & Evaluation
import pandas as pd

# Run on a subset or all pairs
test_pairs = qa_pairs[:5] # Testing on first 5 pairs for demo speed
results = []

print(f"Running RAG on {len(test_pairs)} queries...")

for qa in test_pairs:
    question = qa['question']
    reference = qa['answer1']
    
    # 1. Retrieve
    context = retriever.retrieve_context(question, top_k=config.TOP_K)
    
    # 2. Generate
    generated_answer = generator.generate_answer(question, context)
    
    # 3. Evaluate
    scores = evaluator.evaluate(generated_answer, reference)
    
    results.append({
        "Question": question,
        "Generated Answer": generated_answer,
        "Reference Answer": reference,
        "BLEU-4": scores['bleu'],
        "ROUGE-L": scores['rouge']
    })
    print(".", end="", flush=True)  # Progress indicator

print("\nDone!")
print_system_usage("Inference Complete")

Running RAG on 5 queries...
.....
Done!
[Inference Complete] Memory: 3512.67 MB


In [None]:
# Results Analysis
df_results = pd.DataFrame(results)

# Calculate Averages
avg_bleu = df_results['BLEU-4'].mean()
avg_rouge = df_results['ROUGE-L'].mean()

print("--- Evaluation Summary ---")
print(f"Average BLEU-4: {avg_bleu:.4f}")
print(f"Average ROUGE-L: {avg_rouge:.4f}")

# Display Table
df_results[['Question', 'Generated Answer', 'BLEU-4', 'ROUGE-L']]

--- Evaluation Summary ---
Average BLEU-4: 0.0039
Average ROUGE-L: 0.1023
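
A very low BLEU-4 average is common for short free-form answers. BLEU-4 is the geometric mean of clipped 1- to 4-gram precisions times a brevity penalty, so without smoothing a single missing 4-gram match drives the score to zero. The sketch below shows that behaviour. It assumes lowercased whitespace tokens and a single reference, `bleu4` is an illustrative name, and the optional add-one smoothing is one simple choice that is not necessarily what the notebook's `Evaluator` does.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference, smooth=False):
    """Sentence-level BLEU-4 against one reference (assumed tokenization)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens
        # Clip each candidate n-gram count by its count in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if smooth and n > 1:
            matches, total = matches + 1, total + 1  # add-one smoothing
        if matches == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # The brevity penalty discounts candidates shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / 4)
```

For example, "the cat sat on the mat" against "the cat is on the mat" shares five of six unigrams but no 4-gram, so the unsmoothed score is exactly zero while the smoothed score is positive.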


| | Question | Generated Answer | BLEU-4 | ROUGE-L |
|---|---|---|---|---|
| 0 | Who are Zuleika's most prominent suitors? | The text does not mention who Zuleika’s most p... | 0.010331 | 0.2 |
| 1 | Why does Zuleika reject the Duke? | Please provide me with the context! I need the... | 0.009134 | 0.060606 |
| 2 | Who is the first person Zuleika falls in love ... | According to the text, Zuleika falls in love w... | 0.0 | 0.117647 |
| 3 | Where do Zuleika and her suitors meet? | According to the text, Zuleika and her suitors... | 0.0 | 0.133333 |
| 4 | How does Zuleika stop the Duke's first suicide... | Please provide me with the context! I need the... | 0.0 | 0.0 |

