# Capstone: RAG Pipeline Optimization and Evaluator Reliability

Author: Darpan Beri, darpanberi (dot) 99 (at) gmail (dot) com

Last Updated: 2025-04-22

This notebook implements the experiment pipeline for my capstone project.
It involves:
1. Setting up the environment and loading necessary libraries and data.
2. Preprocessing text data and creating a knowledge base using Haystack.
3. Defining a RAG pipeline using a Llama 3.1 8B model for answer generation.
4. Defining an evaluation pipeline using another Llama 3.1 8B model to assess generated answers.
5. Running experiments by varying `chunk_size` and `top_k` hyperparameters.
6. Saving the results for analysis.
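
As a rough illustration of step 5, a hyperparameter grid can be enumerated as the Cartesian product of the two settings. The values below are hypothetical placeholders chosen for this sketch; they are not the grid used in the experiments.

```python
from itertools import product

# Hypothetical values for illustration only (not the experiment's actual grid)
chunk_sizes = [50, 100, 200]
top_ks = [1, 3, 5]

# One experiment configuration per (chunk_size, top_k) pair
experiment_grid = [
    {"chunk_size": cs, "top_k": k} for cs, k in product(chunk_sizes, top_ks)
]

print(f"{len(experiment_grid)} configurations")
print(experiment_grid[0])
```

Each dictionary can then be passed to a single experiment run, so every combination is evaluated exactly once.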

**Note:** You need a Hugging Face token and Meta-approved access to the gated Llama 3.1 8B model. Request access on the [model page](https://huggingface.co/meta-llama/Llama-3.1-8B).

## Dependencies
This section installs the required Python packages for the project.
- `datasets`: For loading data from Hugging Face Hub.
- `ftfy`: For fixing text encoding issues.
- `haystack-ai`: The core library for building the RAG pipeline (document stores, retrievers, embedders).
- `sentence-transformers`: Used for embedding documents and queries.
- `bitsandbytes`: Enables model quantization (loading models in 8-bit) to save memory.
- `transformers`, `accelerate`: Hugging Face libraries for loading and running LLMs.
- `tqdm`: For displaying progress bars during loops.

In [1]:
# Install required packages
!pip install datasets ftfy haystack-ai sentence-transformers bitsandbytes -q
!pip install --upgrade transformers accelerate -q
!pip install tqdm -q



In [2]:
# --- Core Library Imports ---
import pandas as pd         # For data manipulation and saving results
import torch              # PyTorch for tensor operations and GPU management
import gc                 # Garbage collector for memory management
import numpy as np          # Numerical operations (not heavily used here, but good practice)
import time               # For timing experiments
import json               # For saving experiment metadata (commented out currently)
import os                 # For interacting with the operating system (paths, environment variables)
import re                 # Regular expressions for text cleaning
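import nltk               # Natural Language Toolkit for sentence tokenization
from nltk.tokenize import sent_tokenize  # Sentence splitter called directly by the chunking function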

# --- Hugging Face Imports ---
from datasets import load_dataset         # Function to load datasets from Hugging Face Hub
from transformers import (
    AutoTokenizer,                      # Loads tokenizers for LLMs
    AutoModelForCausalLM,               # Loads causal LLMs (like Llama)
    pipeline,                           # High-level interface for using models
    BitsAndBytesConfig                  # Configuration for quantization
)
import huggingface_hub    # For interacting with the Hugging Face Hub (e.g., login)

# --- Haystack Imports ---
from haystack import Document             # Haystack's representation of text units
from haystack.document_stores.in_memory import InMemoryDocumentStore # Simple document store
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder, # Embeds Documents
    SentenceTransformersTextEmbedder      # Embeds text queries
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever # Retrieves documents based on embedding similarity
from haystack.utils.device import ComponentDevice # Utility for specifying device (CPU/GPU) for Haystack components

# --- Utilities ---
from tqdm.auto import tqdm # Progress bar utility
import ftfy               # Text fixing library

## Hugging Face Hub Authentication

This token is required to download gated models like Llama 3.1.

In [3]:
# Set your Hugging Face token for accessing gated models
huggingface_hub.login("Your_HuggingFace_Token_Here") # Replace with your actual token or use a secure method
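
One way to avoid pasting the token into the notebook is to read it from an environment variable. The sketch below assumes the token is stored in `HF_TOKEN`; the helper name `resolve_hf_token` is hypothetical and not part of the original notebook.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Optional[Mapping[str, str]] = None,
                     var_name: str = "HF_TOKEN") -> str:
    """Return the token stored in `var_name`, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running this notebook."
        )
    return token


# Usage in the notebook (relies on the huggingface_hub import from the imports cell):
# huggingface_hub.login(token=resolve_hf_token())
```

Failing early with a clear message is usually easier to debug than a gated-model download error later in the run.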

## Environment Setup and Data Loading
This section configures the environment, downloads necessary NLTK data, checks GPU availability, and loads the datasets.

In [4]:
# --- Environment Configuration ---
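# Download the NLTK 'punkt' sentence tokenizer models used for sentence splitting. 'quiet=True' suppresses output.
nltk.download('punkt', quiet=True)
# Newer NLTK releases load the sentence tokenizer data from 'punkt_tab' instead; requesting both covers either version.
nltk.download('punkt_tab', quiet=True)
# Make CUDA kernel launches synchronous so errors surface at the failing call (helps debugging, but slows execution)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"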

# --- Cache Directory Setup ---
# Define and create a cache directory for Hugging Face models/datasets
# This avoids re-downloading large files on subsequent runs.
CACHE_DIR = "./model_cache"
os.makedirs(CACHE_DIR, exist_ok=True)
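os.environ["HF_HOME"] = CACHE_DIR # Hugging Face cache root; the libraries read it at import time, so set it before importing them for it to take effect (the downloads below pass cache_dir explicitly)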

# --- GPU Availability Check ---
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Number of available GPUs: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("CUDA not available. Running on CPU will be very slow.")

# --- Dataset Loading ---
print("\nLoading datasets from Hugging Face Hub...")
try:
    # Load the question-answer pairs (test split)
    QA_dataset = load_dataset("rag-datasets/rag-mini-wikipedia", "question-answer", cache_dir=CACHE_DIR)
    # Load the text corpus used for the knowledge base
    text_dataset = load_dataset("rag-datasets/rag-mini-wikipedia", "text-corpus", cache_dir=CACHE_DIR)
    print("Successfully loaded datasets.")
    # Print the number of test questions available
    print(f"Loaded {len(QA_dataset['test'])} test questions from QA dataset.")
    # Optional: Display dataset structure
    # print("\nQA Dataset Structure:")
    # print(QA_dataset)
    # print("\nText Corpus Dataset Structure:")
    # print(text_dataset)
except Exception as e:
    print(f"Error loading datasets: {e}")
    print("Please ensure you are connected to the internet and the dataset name is correct.")
    # Later cells depend on QA_dataset and text_dataset, so stop here instead of continuing without them
    raise

PyTorch version: 2.5.1+cu121
CUDA available: True
Number of available GPUs: 2
GPU 0: Tesla T4
GPU 1: Tesla T4

Loading datasets from Hugging Face Hub...



Successfully loaded datasets.
Loaded 918 test questions from QA dataset.


## Text Preprocessing and Chunking Strategy
Defines functions to clean the text corpus and split it into manageable chunks for the RAG pipeline's knowledge base.
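
To make the cleaning rules concrete, here is a stdlib-only sketch of the string and regex steps. It omits the `ftfy` pass, and the function name `clean_without_ftfy` is a hypothetical stand-in, not the notebook's function.

```python
import re


def clean_without_ftfy(text: str) -> str:
    """Stdlib-only stand-in for the cleaning steps (the ftfy pass is omitted here)."""
    text = text.replace("--", "–")          # double hyphen -> en dash
    text = text.replace("\\'", "'")         # escaped apostrophe -> plain apostrophe
    text = re.sub(r"\([^)]*\)", "", text)   # drop non-nested parenthesised spans
    return text.strip()


sample = "Lincoln (born 1809) didn\\'t serve 1861--1865 alone."
print(clean_without_ftfy(sample))
```

Note that removing a parenthesised span leaves the spaces on both sides in place, so the cleaned text can contain a double space where the span used to be. The pattern also does not handle nested parentheses.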

In [5]:
def preprocess_text(text: str) -> str:
    """
    Applies basic text cleaning operations.
    - Fixes encoding issues using ftfy.
    - Normalizes hyphens.
    - Fixes escaped apostrophes.
    - Removes content within parentheses (potential noise).

    Args:
        text (str): The input text string.

    Returns:
        str: The cleaned text string.
    """
    if not isinstance(text, str): # Basic type check
        return ""
    text = ftfy.fix_text(text) # Fix potential encoding errors (e.g., mojibake)
    text = text.replace("--", "–") # Replace double hyphens with en-dash
    text = text.replace("\\'", "'") # Replace escaped apostrophes
    text = re.sub(r"\([^)]*\)", "", text) # Remove text within parentheses
    return text.strip() # Remove leading/trailing whitespace


def split_content_into_chunks(content: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """
    Splits a large text content into smaller chunks suitable for embedding.

    Strategy:
    1. Splits the content by paragraphs (`\\n\\n`).
    2. Iterates through paragraphs, adding them to the current chunk.
    3. If adding a paragraph exceeds `chunk_size` (in words):
       - Saves the current chunk.
       - Starts a new chunk, potentially with `overlap` words from the end of the previous chunk.
    4. If a single paragraph itself is larger than `chunk_size`:
       - Splits that paragraph into sentences using `nltk.sent_tokenize`.
       - Adds sentences one by one, creating new chunks with overlap as needed when `chunk_size` is exceeded.

    Args:
        content (str): The entire text corpus concatenated into a single string.
        chunk_size (int): The target maximum number of words per chunk (approximate).
        overlap (int): The target number of words to overlap between consecutive chunks.

    Returns:
        list[str]: A list of text chunks.
    """
    if not isinstance(content, str) or not content:
        return []

    # First split by paragraphs, filtering out empty ones
    paragraphs = [p for p in content.split('\n\n') if p.strip()]
    if not paragraphs: # Handle case where content only has whitespace or is structured differently
        # Fallback: split by single newline or sentence tokenize the whole thing
        paragraphs = [p for p in content.split('\n') if p.strip()]
        if not paragraphs:
            paragraphs = sent_tokenize(content) # Use sentence tokenization as last resort

    chunks = []           # List to store the final chunks
    current_chunk = []    # List of strings (paragraphs or sentences) forming the current chunk
    current_size = 0      # Approximate word count of the current chunk

    for paragraph in paragraphs:
        paragraph_words = len(paragraph.split()) # Word count of the current paragraph

        # --- Case 1: Paragraph itself is too large ---
        if paragraph_words > chunk_size:
            # Split large paragraph into sentences
            sentences = sent_tokenize(paragraph)
            for sentence in sentences:
                sentence_words = len(sentence.split())

                # If adding sentence exceeds chunk size, finalize previous chunk (if any)
                if current_size + sentence_words > chunk_size and current_chunk:
                    chunks.append(" ".join(current_chunk)) # Save the completed chunk

                    # Create overlap for the *next* chunk
                    overlap_tokens = []
                    overlap_size = 0
                    # Iterate backwards through the *saved* chunk's pieces
                    for piece in reversed(current_chunk):
                        piece_words = len(piece.split())
                        if overlap_size + piece_words <= overlap:
                            overlap_tokens.append(piece)
                            overlap_size += piece_words
                        else:
                            # If adding the whole piece exceeds overlap, try splitting it (optional, adds complexity)
                            # For simplicity here, we just stop accumulating overlap tokens
                            break # Stop if adding the next piece exceeds overlap target

                    # Start new chunk with overlap (reversed back to original order)
                    current_chunk = list(reversed(overlap_tokens))
                    current_size = overlap_size

                # Add the current sentence to the (potentially new) chunk
                current_chunk.append(sentence)
                current_size += sentence_words

        # --- Case 2: Adding paragraph exceeds chunk size ---
        elif current_size + paragraph_words > chunk_size and current_chunk:
            # Finalize the current chunk before adding the new paragraph
            chunks.append(" ".join(current_chunk))

            # Create overlap for the *next* chunk (similar logic as above)
            overlap_tokens = []
            overlap_size = 0
            for piece in reversed(current_chunk):
                piece_words = len(piece.split())
                if overlap_size + piece_words <= overlap:
                    overlap_tokens.append(piece)
                    overlap_size += piece_words
                else:
                    break

            # Start new chunk with overlap and the current paragraph
            current_chunk = list(reversed(overlap_tokens))
            current_size = overlap_size
            current_chunk.append(paragraph)
            current_size += paragraph_words

        # --- Case 3: Paragraph fits into the current chunk ---
        else:
            current_chunk.append(paragraph)
            current_size += paragraph_words

    # Add the last remaining chunk if it's not empty
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks
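
To see how `chunk_size` and `overlap` interact, the following is a deliberately simplified word-window sketch. Unlike the function above, it ignores paragraph and sentence boundaries and overlaps exact word counts; `sliding_word_chunks` is a hypothetical name used only for this illustration.

```python
def sliding_word_chunks(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


toy_text = " ".join(f"w{i}" for i in range(10))
for chunk in sliding_word_chunks(toy_text, chunk_size=4, overlap=1):
    print(chunk)
```

In the notebook's function the overlap is built from whole trailing paragraphs or sentences, so the realised overlap can be smaller than the target (or empty) when the last piece alone exceeds `overlap` words.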

## Creating the RAG Knowledge Base with Haystack
This section defines a function to prepare the Haystack `InMemoryDocumentStore`.
It processes the text corpus, splits it into chunks based on the specified `chunk_size`,
embeds these chunks using a sentence transformer model, and stores them in the document store.
A cache (`document_stores_cache`) is used to avoid recomputing the store for the same `chunk_size`.
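
The caching idea can be shown in isolation: build a store once per `chunk_size`, then return the same object on later requests. This is a minimal sketch with a stand-in builder; `make_cached_builder` and `fake_build` are hypothetical names, and no Haystack objects are involved.

```python
from typing import Callable, Dict


def make_cached_builder(build: Callable[[int], object]) -> Callable[[int], object]:
    """Wrap `build` so each chunk_size is built at most once."""
    cache: Dict[int, object] = {}

    def get(chunk_size: int) -> object:
        if chunk_size not in cache:
            cache[chunk_size] = build(chunk_size)  # expensive step runs only on a miss
        return cache[chunk_size]

    return get


build_calls = []


def fake_build(chunk_size: int) -> dict:
    """Stand-in for the expensive chunk-and-embed step."""
    build_calls.append(chunk_size)
    return {"chunk_size": chunk_size}


get_store = make_cached_builder(fake_build)
first = get_store(100)
second = get_store(100)   # served from the cache
third = get_store(200)    # new chunk_size, so the builder runs again
print(build_calls)
```

Because `top_k` only affects retrieval, every `top_k` value for a given `chunk_size` can reuse the same embedded store.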

In [6]:
# Cache for document stores to avoid re-computation for the same chunk size
document_stores_cache = {}

def prepare_document_store(chunk_size: int) -> InMemoryDocumentStore:
    """
    Prepares and populates an InMemoryDocumentStore with embedded text chunks.
    Uses a cache (`document_stores_cache`) to return existing stores for a given chunk_size.

    Args:
        chunk_size (int): The target chunk size (in words) for splitting the corpus.

    Returns:
        InMemoryDocumentStore: A Haystack document store containing the embedded chunks.
    """
    # Check cache first
    if chunk_size in document_stores_cache:
        print(f"Using cached document store for chunk_size={chunk_size}")
        return document_stores_cache[chunk_size]

    print(f"\n--- Preparing New Document Store (chunk_size={chunk_size}) ---")
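    # Haystack 2.x's InMemoryDocumentStore takes no embedding dimension argument;
    # all-MiniLM-L6-v2 produces 384-dimensional vectors, which are stored as-is.
    document_store = InMemoryDocumentStore()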

    # --- Step 1: Process and Concatenate Corpus ---
    print("Processing and concatenating text corpus...")
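    # Apply preprocessing to each passage and join the results into one large string.
    # Passages are joined with a single space, so passage boundaries do not become paragraph
    # breaks; the paragraph split only applies where a passage itself contains '\n\n'.
    # The list comprehension is wrapped in tqdm to show a progress bar.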
    all_text_passages = [preprocess_text(entry['passage']) for entry in tqdm(text_dataset['passages'], desc="Preprocessing Passages")]
    all_text = " ".join(filter(None, all_text_passages)) # Filter out potential empty strings after preprocessing
    print(f"Total characters in concatenated corpus: {len(all_text)}")

    # --- Step 2: Split into Chunks ---
    print("Splitting content into chunks...")
    # Use the previously defined function with the specified chunk_size
    # Using a default overlap of 20 words here, could be made configurable
    content_chunks = split_content_into_chunks(all_text, chunk_size=chunk_size, overlap=20)
    print(f"Created {len(content_chunks)} chunks.")
    if not content_chunks:
        print("Warning: No chunks were created. Check preprocessing and chunking logic.")
        # You might want to handle this case more robustly, maybe raise an error
        # For now, return the empty store
        document_stores_cache[chunk_size] = document_store
        return document_store

    # --- Step 3: Create Haystack Document Objects ---
    # Convert each text chunk string into a Haystack Document object
    docs = [Document(content=chunk) for chunk in content_chunks]
    print(f"Created {len(docs)} Haystack Document objects.")

    # --- Step 4: Embed Documents ---
    print("Embedding documents...")
    # Determine the device for embedding (GPU if available, else CPU)
    # Using Haystack's ComponentDevice for compatibility
    if torch.cuda.is_available():
        # Use the first GPU (cuda:0) for embedding by default
        embedding_device = ComponentDevice.from_str("cuda:0")
        print(f"Using device: {torch.cuda.get_device_name(0)} for document embedding.")
    else:
        embedding_device = ComponentDevice.from_str("cpu")
        print("Using CPU for document embedding.")

    # Initialize the document embedder
    doc_embedder = SentenceTransformersDocumentEmbedder(
        model="sentence-transformers/all-MiniLM-L6-v2", # A common, efficient embedding model
        device=embedding_device, # Assign the determined device
        batch_size=64,           # Adjust batch size based on GPU memory
        progress_bar=True        # Show progress during embedding
    )
    # Warm up the embedder (loads the model)
    try:
        doc_embedder.warm_up()
    except Exception as e:
        print(f"Error warming up document embedder: {e}")
        # Handle error appropriately, maybe fall back to CPU or exit
        raise e # Re-raise the exception

    # --- Step 5: Write Documents to Store (with embeddings) ---
    # Process documents in batches to manage memory usage during embedding
    write_batch_size = 128 # How many docs to embed and write at once
    print(f"Writing documents to store in batches of {write_batch_size}...")
    for i in tqdm(range(0, len(docs), write_batch_size), desc="Embedding and Writing Docs"):
        batch_docs = docs[i:i + write_batch_size]
        try:
            # Run embedding on the batch
            docs_with_embeddings = doc_embedder.run(documents=batch_docs) # Ensure keyword arg 'documents' is used
            # Write the embedded documents to the store
            document_store.write_documents(docs_with_embeddings["documents"])
        except Exception as e:
            print(f"Error processing or writing batch starting at index {i}: {e}")
            # Optional: Add logic to retry or skip the batch
            continue # Continue with the next batch

    # --- Caching and Return ---
    print(f"Document store preparation complete for chunk_size={chunk_size}.")
    document_stores_cache[chunk_size] = document_store # Cache the populated store
    return document_store
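
Conceptually, embedding retrieval scores every stored chunk vector against the query vector and keeps the `top_k` best. The toy sketch below uses a dot product on hand-written 2-D vectors for simplicity; it is not Haystack's implementation, and `top_k_by_dot_product` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot_product(query: Sequence[float],
                         doc_vectors: Sequence[Sequence[float]],
                         top_k: int) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the top_k highest-scoring vectors."""
    scored = [(i, dot(query, vec)) for i, vec in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


toy_docs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
toy_query = [1.0, 0.0]
print(top_k_by_dot_product(toy_query, toy_docs, top_k=2))
```

Raising `top_k` passes more chunks to the generator, which adds context but also lengthens the prompt.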

## Debugging Functions (Optional)
These functions were used during development to test specific components (retriever, RAG model, evaluator) in isolation with controlled inputs. They can be helpful for troubleshooting.

In [7]:
# 1. Function to inspect retrieved documents for a given question
def print_retrieved_documents(question: str, retrieved_docs: dict):
    """
    Prints the question and the content of documents retrieved for it.

    Args:
        question (str): The input question.
        retrieved_docs (dict): The output dictionary from a Haystack retriever run,
                                expected to have a "documents" key.
    """
    print(f"\n{'='*80}")
    print(f"DEBUG: Retrieved Documents for Question: {question}")
    print(f"{'='*80}")
    if "documents" in retrieved_docs and retrieved_docs["documents"]:
        for i, doc in enumerate(retrieved_docs["documents"]):
            print(f"\n--- DOCUMENT {i+1} (Score: {doc.score if hasattr(doc, 'score') else 'N/A'}) ---")
            print(f"Content Length: {len(doc.content)} chars")
            print(f"{'-'*60}")
            print(doc.content)
            print(f"{'-'*60}")
    else:
        print("No documents retrieved or 'documents' key missing.")
    print(f"{'='*80}\n")

# 2. Function to test the core generation capability of the RAG model (Generator LLM)
def test_rag_model(generator_model: 'ModelHandler', debug: bool = True):
    """
    Performs a sanity check on the generator model with simple, predefined
    context-question pairs. Bypasses the retrieval step.

    Args:
        generator_model (ModelHandler): An initialized instance of the ModelHandler class
                                         for the generator LLM.
        debug (bool): If True, also prints the context provided for each test case.
    """
    print("\n--- Testing RAG Generator Model (Direct Generation) ---")
    if not generator_model:
        print("Error: Generator model not provided.")
        return

    test_cases = [
        # Add more cases if needed
        {
            "question": "Was Abraham Lincoln the sixteenth President of the United States?",
            "context": "Abraham Lincoln (February 12, 1809 – April 15, 1865) was an American lawyer and statesman who served as the 16th president of the United States from 1861 until his assassination in 1865.",
            "expected_answer": "yes" # Or a similar affirmation
        },
        {
            "question": "When did Lincoln begin his political career?",
            "context": "Lincoln began his political career in 1832, at the age of 23, when he ran for the Illinois General Assembly.",
            "expected_answer": "1832"
        }
    ]

    for i, case in enumerate(test_cases):
        question = case["question"]
        context = case["context"]
        expected = case["expected_answer"]

        # Simplified prompt, similar to the main pipeline but with controlled context
        prompt = f"""Context: {context}

Question: {question}

Answer the question using only the context. Keep your answer brief.

Answer:"""

        print(f"\n--- Test Case {i+1} ---")
        print(f"Question: {question}")
        if debug: print(f"Provided Context: '{context}'")
        print(f"Expected Answer (approx): '{expected}'")

        # Generate with specific parameters for testing
        try:
            # Using a slightly larger max_tokens for testing to ensure output
            raw_result = generator_model.generate(prompt, max_tokens=20, temperature=0.0) # Temp 0 for deterministic output
            print(f"Generated Answer: '{raw_result}'")
        except Exception as e:
            print(f"Error during generation for test case {i+1}: {e}")

    print("\n--- RAG Generator Model Testing Complete ---")


# 3. Function to test the Evaluator LLM
def test_evaluator(evaluator_model: 'ModelHandler'):
    """
    Performs a sanity check on the evaluator model with simple, predefined
    question-answer pairs to see if it produces the expected "Yes" or "No".

    Args:
        evaluator_model (ModelHandler): An initialized instance of the ModelHandler class
                                        for the evaluator LLM.
    """
    print("\n--- Testing Evaluator Model ---")
    if not evaluator_model:
        print("Error: Evaluator model not provided.")
        return

    test_cases = [
        {"question": "Is the sky blue?", "generated_answer": "Yes, the sky is blue.", "ground_truth": "yes", "expected_eval": "Yes"},
        {"question": "Is the Earth flat?", "generated_answer": "No, the Earth is round.", "ground_truth": "no", "expected_eval": "Yes"}, # Generated matches GT meaning
        {"question": "What is 2+2?", "generated_answer": "2", "ground_truth": "4", "expected_eval": "No"},
        {"question": "What is 2+2?", "generated_answer": "4", "ground_truth": "4", "expected_eval": "Yes"},
        {"question": "What is the capital of France?", "generated_answer": "Paris", "ground_truth": "Paris", "expected_eval": "Yes"},
        {"question": "What is the capital of France?", "generated_answer": "Lyon", "ground_truth": "Paris", "expected_eval": "No"},
        {"question": "Who was the first US president?", "generated_answer": "George Washington was the first president.", "ground_truth": "George Washington", "expected_eval": "Yes"},
        {"question": "Who was the first US president?", "generated_answer": "", "ground_truth": "George Washington", "expected_eval": "No"},
    ]

    for i, case in enumerate(test_cases):
        query = case["question"]
        generated_answer = case["generated_answer"]
        ground_truth = case["ground_truth"]
        expected_eval = case["expected_eval"]

        # Evaluation prompt mirroring the main pipeline
        prompt = f"""Task: In the context of the question, is the semantic meaning of the generated the same as the truth? Please reply "Yes" if true, else "No."
Question: {query}
Generated: {generated_answer}
Truth: {ground_truth}

Answer:"""

        print(f"\n--- Test Case {i+1} ---")
        print(f"Question: {query}")
        print(f"Generated Answer: '{generated_answer}'")
        print(f"Ground Truth: '{ground_truth}'")
        print(f"Expected Evaluation: '{expected_eval}'")

        try:
            # Generate evaluation using very few tokens and zero temperature
            raw_result = evaluator_model.generate(prompt, max_tokens=3, temperature=0.0) # Increased slightly to catch variations
            print(f"Raw Evaluator Output: '{raw_result}'")

            # Check for expected output
            if expected_eval.lower() in raw_result.lower():
                print("Evaluation Result: Matches Expected")
            else:
                print("Evaluation Result: *** Does Not Match Expected ***")

        except Exception as e:
            print(f"Error during evaluation for test case {i+1}: {e}")

    print("\n--- Evaluator Model Testing Complete ---")
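
The substring check used in `test_evaluator` also matches "no" inside words such as "not" or "know". A stricter alternative, sketched here under the assumption that the evaluator is expected to emit a standalone "Yes" or "No", is to match whole words only. The helper `parse_verdict` is hypothetical and not part of the notebook.

```python
import re
from typing import Optional

# Match "yes" or "no" only as whole words, case-insensitively
_VERDICT = re.compile(r"\b(yes|no)\b", re.IGNORECASE)


def parse_verdict(raw_output: str) -> Optional[bool]:
    """Return True for 'yes', False for 'no', or None if neither appears as a word."""
    match = _VERDICT.search(raw_output or "")
    if match is None:
        return None
    return match.group(1).lower() == "yes"


for raw in ["Yes.", " No\n", "Not sure", ""]:
    print(repr(raw), "->", parse_verdict(raw))
```

Returning `None` for unparseable output keeps evaluator failures separate from genuine "No" verdicts, which matters when measuring evaluator reliability.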

## Model Handler Class
A wrapper class to manage loading LLMs (with quantization) and generating text using the `transformers` pipeline. This simplifies using multiple models (generator, evaluator) potentially on different devices.
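
A back-of-the-envelope estimate shows why quantization matters on the Tesla T4 GPUs listed in the environment output (16 GB each). The sketch assumes roughly 8 billion parameters and counts weights only; activations, the KV cache, and any layers kept in higher precision are ignored.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


params = 8e9  # assumption: roughly 8 billion parameters

for label, bits in [("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(params, bits):.1f} GiB")
```

Under these assumptions, half-precision weights alone would nearly fill one T4, while 8-bit weights leave headroom for generation.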

In [8]:
class ModelHandler:
    """
    Manages loading and interacting with a Hugging Face causal language model,
    specifically handling 8-bit quantization and text generation via pipelines.
    """
    def __init__(self, model_name: str, device_id: int = 0):
        """
        Initializes the ModelHandler.

        Args:
            model_name (str): The name of the model on Hugging Face Hub (e.g., "meta-llama/Llama-3.1-8B").
            device_id (int): The GPU device ID to load the model onto (e.g., 0 for cuda:0).
        """
        self.model_name = model_name
        self.device_id = device_id
        self.device = f"cuda:{self.device_id}" if torch.cuda.is_available() else "cpu"
        self.tokenizer = None
        self.model = None
        self.pipe = None
        print(f"ModelHandler created for {self.model_name} on device target: {self.device}")

    def load_model(self):
        """
        Loads the tokenizer and the specified LLM with 8-bit quantization.
        Sets up the text generation pipeline.
        """
        if self.pipe: # Avoid reloading if already loaded
            print(f"Model {self.model_name} already loaded.")
            return

        print(f"\n--- Loading Model: {self.model_name} ---")
        try:
            # --- Load Tokenizer ---
            print("Loading tokenizer...")
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_name, cache_dir=CACHE_DIR)
            # Ensure a padding token is set; if not, use the EOS token
            if self.tokenizer.pad_token is None:
                print("Warning: Tokenizer does not have a pad token. Setting pad_token=eos_token.")
                self.tokenizer.pad_token = self.tokenizer.eos_token
            print("Tokenizer loaded.")

            # --- GPU Memory Check (Optional but helpful) ---
            if self.device != "cpu":
                try:
                    gpu_mem = torch.cuda.get_device_properties(self.device_id).total_memory
                    gpu_mem_gb = gpu_mem / (1024**3)
                    print(f"Target GPU ({self.device}) total memory: {gpu_mem_gb:.2f} GB")
                except Exception as e:
                    print(f"Could not get GPU memory info: {e}")
            else:
                print("Target device is CPU.")


            # --- Configure Quantization ---
            print("Configuring 8-bit quantization...")
            quantization_config = BitsAndBytesConfig(
                load_in_8bit=True,
                # bnb_4bit_compute_dtype=torch.bfloat16 # This is for 4-bit, not needed for load_in_8bit=True
            )
            # Use bfloat16 for the non-quantized computations if the GPU supports it, otherwise float16
            compute_dtype = torch.bfloat16 if (torch.cuda.is_available() and torch.cuda.is_bf16_supported()) else torch.float16


            # --- Load Model ---
            print(f"Loading model {self.model_name} with quantization...")
            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_name,
                torch_dtype=compute_dtype, # Use appropriate compute dtype
                quantization_config=quantization_config,
                low_cpu_mem_usage=True, # Tries to reduce CPU RAM usage during loading
                device_map={"": self.device_id}, # Maps the entire model to the specified device ID
                cache_dir=CACHE_DIR
            )
            print("Model loaded.")

            # --- Create Pipeline ---
            # The pipeline handles tokenization, model inference, and decoding.
            # `device_map` in from_pretrained already placed the model, so `device` arg in pipeline is not needed/can cause issues.
            print("Creating text generation pipeline...")
            self.pipe = pipeline(
                "text-generation",
                model=self.model,
                tokenizer=self.tokenizer,
                # device=self.device # Not needed when device_map is used
            )
            print(f"Pipeline created successfully for {self.model_name}.")

        except Exception as e:
            print(f"!!! ERROR loading model {self.model_name}: {e}")
            # Clean up partially loaded components
            self.tokenizer = None
            self.model = None
            self.pipe = None
            # Re-raise the exception to halt execution if loading fails
            raise e

    def generate(self, prompt: str, max_tokens: int = 100, temperature: float = 0.1) -> str:
        """
        Generates text based on the input prompt using the loaded model pipeline.

        Args:
            prompt (str): The input text prompt for the model.
            max_tokens (int): The maximum number of new tokens to generate.
            temperature (float): Controls randomness. 0 for deterministic, >0 for more randomness.

        Returns:
            str: The generated text (only the response part, excluding the prompt).
                 Returns "Error: Model not loaded" if the model cannot be loaded, "" for an
                 empty or invalid prompt, and "Error in generation" if generation raises.
        """
        if self.pipe is None:
            print("Model pipeline not loaded. Attempting to load...")
            try:
                self.load_model()
            except Exception as e:
                print(f"Failed to load model for generation: {e}")
                return "Error: Model not loaded"

        if not isinstance(prompt, str) or not prompt:
            print("Warning: Empty or invalid prompt provided.")
            return "" # Return empty string for empty prompt


        try:
            # Generate response using the pipeline.
            # `max_new_tokens` limits only the newly generated tokens, not the prompt.
            # Sampling is enabled only for temperature > 0; temperature == 0 means greedy decoding.
            # pad_token_id is set to the EOS token id to suppress the missing-pad-token warning.
            do_sample = temperature > 0
            generation_kwargs = {
                "max_new_tokens": max_tokens,
                "do_sample": do_sample,
                "num_return_sequences": 1,
                "pad_token_id": self.tokenizer.eos_token_id,
            }
            if do_sample:
                # Temperature only applies when sampling, so it is omitted for greedy decoding
                generation_kwargs["temperature"] = temperature
            response = self.pipe(prompt, **generation_kwargs)

            # Extract only the generated text part
            generated_full_text = response[0]["generated_text"]

            # Check if the generated text contains the prompt and remove it
            if generated_full_text.startswith(prompt):
                answer = generated_full_text[len(prompt):].strip()
            else:
                # Sometimes the pipeline might not return the prompt; handle this gracefully
                print("Warning: Generated text did not start with the prompt. Returning full output.")
                answer = generated_full_text.strip()

            return answer

        except Exception as e:
            print(f"!!! ERROR during text generation: {e}")
            return "Error in generation" # Return a clear error message

    def unload_model(self):
        """Explicitly delete model and tokenizer and clear GPU cache."""
        print(f"Unloading model {self.model_name}...")
        # Drop the references so the objects become eligible for garbage collection
        self.pipe = None
        self.model = None
        self.tokenizer = None
        gc.collect() # Run Python garbage collection
        if self.device != 'cpu':
            torch.cuda.empty_cache() # Clear PyTorch's CUDA cache
        print(f"Model {self.model_name} unloaded.")
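
The `generate` method above switches between greedy decoding and sampling based on `temperature`. As a minimal sketch of that decision in isolation, the hypothetical helper below (`build_generation_kwargs` is not part of the notebook) builds the keyword arguments so that `temperature` is only passed when sampling is actually enabled. It is an alternative to clamping the temperature, not the notebook's own implementation.

```python
def build_generation_kwargs(max_tokens: int, temperature: float) -> dict:
    """Return keyword arguments for a text-generation call (hypothetical helper)."""
    kwargs = {"max_new_tokens": max_tokens, "num_return_sequences": 1}
    if temperature > 0:
        # Sampling path: temperature only has an effect when do_sample=True.
        kwargs["do_sample"] = True
        kwargs["temperature"] = temperature
    else:
        # Greedy path: leave temperature out so it cannot be mistaken for an active setting.
        kwargs["do_sample"] = False
    return kwargs


greedy = build_generation_kwargs(max_tokens=10, temperature=0.0)
sampled = build_generation_kwargs(max_tokens=10, temperature=0.7)
print(greedy)
print(sampled)
```

The resulting dictionary could be unpacked into the pipeline call (for example `self.pipe(prompt, **kwargs)`), assuming the same argument names used in `generate`.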

## RAG Pipeline: Answer Generation Function
This function orchestrates the process of generating answers for a set of questions using the RAG approach.
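
To make the retrieve-then-prompt flow concrete, here is a toy sketch that uses plain Python instead of Haystack. The two-dimensional "embeddings", the example chunks, and the helper names (`cosine`, `retrieve_top_k`) are invented for illustration; the real pipeline uses a sentence-transformers embedder and an in-memory embedding retriever.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, chunks, top_k):
    """Return the top_k chunks ranked by similarity to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return ranked[:top_k]


# Shortened stand-in for the notebook's prompt template
PROMPT = "Context:\n{}\n\nQuestion: {}\n\nAnswer:"

# Toy knowledge base with made-up 2-d embeddings
toy_chunks = [
    {"content": "Lincoln was the 16th president.", "embedding": [1.0, 0.0]},
    {"content": "Ducks are waterfowl.", "embedding": [0.0, 1.0]},
    {"content": "Lincoln began his career in 1832.", "embedding": [0.9, 0.1]},
]

query_vec = [1.0, 0.05]  # pretend embedding of the question
top_docs = retrieve_top_k(query_vec, toy_chunks, top_k=2)
context = "\n\n".join(doc["content"] for doc in top_docs)
prompt = PROMPT.format(context, "When did Lincoln begin his political career?")
print(prompt)
```

Raising `top_k` adds more chunks to the context, and `chunk_size` changes how much text each chunk carries; these are the two knobs the experiments vary.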

In [9]:
def generate_answers(generator_model: ModelHandler,
                     chunk_size: int,
                     top_k: int,
                     num_questions: int = None,
                     debug: bool = False) -> list[dict]:
    """
    Generates answers for questions using a RAG pipeline.

    Steps:
    1. Prepares/retrieves the document store for the given `chunk_size`.
    2. Initializes the text embedder and retriever components from Haystack.
    3. Iterates through the specified questions:
       a. Embeds the question.
       b. Retrieves the `top_k` relevant document chunks.
       c. Formats a prompt including the retrieved context and the question.
       d. Uses the `generator_model` to generate an answer based on the prompt.
    4. Stores the question, generated answer, and ground truth answer.

    Args:
        generator_model (ModelHandler): Initialized handler for the generator LLM.
        chunk_size (int): Chunk size used to prepare the document store.
        top_k (int): The number of documents to retrieve for each question.
        num_questions (int, optional): If provided, limits the number of questions processed.
                                       Defaults to None (process all questions).
        debug (bool): If True, prints detailed information during processing (retrieved docs, prompts).

    Returns:
        list[dict]: A list of dictionaries, each containing "question", "generated_answer",
                    and "ground_truth" for one question.
    """
    print(f"\n--- Starting Answer Generation (chunk_size={chunk_size}, top_k={top_k}) ---")
    start_time = time.time()

    # --- Step 1: Prepare Document Store and Retriever ---
    # Retrieve or create the document store using the helper function
    document_store = prepare_document_store(chunk_size)
    # Initialize the retriever component
    retriever = InMemoryEmbeddingRetriever(document_store=document_store, top_k=top_k)

    # --- Step 2: Prepare Query Embedder ---
    # Determine device for query embedding (can be different from generator/evaluator)
    # Place on GPU 1 if available and more than one GPU exists, otherwise GPU 0 or CPU
    if torch.cuda.is_available():
        embedder_device_id = 1 if torch.cuda.device_count() > 1 else 0
        embedder_device_str = f"cuda:{embedder_device_id}"
        print(f"Using device {torch.cuda.get_device_name(embedder_device_id)} for query embedding.")
    else:
        embedder_device_str = "cpu"
        print("Using CPU for query embedding.")
    embedding_device = ComponentDevice.from_str(embedder_device_str)

    # Initialize the text embedder for queries
    query_embedder = SentenceTransformersTextEmbedder(
        model="sentence-transformers/all-MiniLM-L6-v2", # Must match document embedder
        device=embedding_device,
        progress_bar=False # Usually fast enough not to need a progress bar
    )
    # Warm up the embedder
    try:
        query_embedder.warm_up()
    except Exception as e:
        print(f"Error warming up query embedder: {e}")
        raise e # Stop execution if embedder fails

    # --- Step 3: Select Questions ---
    questions_data = QA_dataset['test'] # Use the test split
    if num_questions is not None and 0 < num_questions < len(questions_data):
        print(f"Processing a subset of {num_questions} questions.")
        questions_data = questions_data.select(range(num_questions))
    else:
        print(f"Processing all {len(questions_data)} questions.")

    # --- Step 4: Prompt Template ---
    # The placeholders {} will be filled with context and question.
    prompt_template = """Answer the question based *only* on the context provided below. Be concise and provide only the answer, without explanation or preamble.

Context:
{}

Question: {}

Answer:"""

    # --- Step 5: Iterate and Generate ---
    results = [] # List to store results
    # Loop through selected questions with a progress bar
    for i in tqdm(range(len(questions_data)), desc="Generating Answers"):
        row = questions_data[i] # Fetch one row rather than re-reading whole columns on every iteration
        question = row['question']
        ground_truth = row['answer']

        if not isinstance(question, str) or not question.strip():
            print(f"Warning: Skipping invalid or empty question at index {i}")
            continue # Skip this iteration

        try:
            # Embed the current question
            query_embedding_result = query_embedder.run(text=question)
            query_embedding = query_embedding_result["embedding"]

            # Retrieve relevant documents using the embedding
            retrieved_docs = retriever.run(query_embedding=query_embedding)

            # --- Optional Debug Output ---
            if debug:
                print_retrieved_documents(question, retrieved_docs) # Use the debug function

            # Format the retrieved context
            # Join the content of retrieved documents, separated by newlines
            context_text = "\n\n".join([doc.content for doc in retrieved_docs["documents"]])
            if not context_text:
                print(f"Warning: No context retrieved for question: {question}")
                # No chunks were retrieved: still generate, but pass an explicit placeholder as the context.
                context_text = "No context available."

            # Construct the full prompt for the generator model
            full_prompt = prompt_template.format(context_text, question)

            if debug:
                print(f"\n--- Prompt for Generator (Question {i+1}) ---")
                print(full_prompt)
                print("-" * 60)

            # Generate the answer using the ModelHandler instance
            # Using max_tokens=10 as specified in project summary, temperature=0 for consistency
            generated_answer = generator_model.generate(full_prompt, max_tokens=10, temperature=0.0)

            if debug:
                print(f"Raw Generated Answer: '{generated_answer}'")
                print(f"Ground Truth: '{ground_truth}'")
                print("="*60)

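            # Basic validation/cleanup of the generated answer
            # ModelHandler.generate signals failure through these sentinel strings
            failure_markers = ("Error in generation", "Error: Model not loaded")
            if not generated_answer or generated_answer.isspace() or generated_answer in failure_markers:
                generated_answer = "NO_RESPONSE" # Use a placeholder for failed/empty generations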

            # Store the results for this question
            results.append({
                "question": question,
                "generated_answer": generated_answer,
                "ground_truth": ground_truth
            })

        except Exception as e:
            print(f"!!! ERROR processing question {i+1} ('{question[:50]}...'): {e}")
            # Store error indicator for this question
            results.append({
                "question": question,
                "generated_answer": "GENERATION_ERROR",
                "ground_truth": ground_truth
            })
            continue # Continue to the next question

    # --- Completion ---
    elapsed_time = (time.time() - start_time) / 60
    print(f"--- Answer Generation Completed in {elapsed_time:.2f} minutes ---")
    return results

## Evaluation Pipeline: Assessing Generated Answers
This function takes the generated answers and uses the evaluator LLM to determine if the generated answer semantically matches the ground truth answer.
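
Because the evaluator is a base language model, its raw output is not guaranteed to be exactly "Yes" or "No"; the test output later in this notebook shows `'Yes\n\nExplanation'`. The sketch below is one possible way to normalize such output before counting. `parse_verdict` is a hypothetical helper, not part of the notebook, and the `"unparsed"` label is an assumption.

```python
def parse_verdict(raw_output: str) -> str:
    """Map raw evaluator text to 'yes', 'no', or 'unparsed' (hypothetical helper)."""
    stripped = raw_output.strip()
    if not stripped:
        return "unparsed"
    # Look only at the first whitespace-separated token, minus surrounding punctuation.
    first_word = stripped.split()[0].strip(".,:;!\"'*").lower()
    return first_word if first_word in ("yes", "no") else "unparsed"


print(parse_verdict("Yes\n\nExplanation"))  # yes
print(parse_verdict(" no."))                # no
print(parse_verdict("EVALUATION_ERROR"))    # unparsed
```

Keeping an explicit `"unparsed"` bucket makes it possible to report how often the evaluator ignored the requested format, which matters when the evaluator's reliability is itself under study.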

In [10]:
def evaluate_answers(evaluator_model: ModelHandler,
                     generated_results: list[dict],
                     debug: bool = False) -> list[dict]:
    """
    Evaluates the generated answers against ground truth using an evaluator LLM.

    Steps:
    1. Iterates through the results from the `generate_answers` function.
    2. For each result:
       a. Marks entries flagged "NO_RESPONSE" or "GENERATION_ERROR" as "N/A_GENERATION_FAILED"
          without calling the evaluator.
       b. Formats a prompt asking the evaluator LLM to compare the generated answer
          and ground truth, requesting a "Yes" or "No" response.
       c. Uses the `evaluator_model` to get the evaluation result.
    3. Stores the original info plus the evaluator's raw output ("Evaluation").

    Args:
        evaluator_model (ModelHandler): Initialized handler for the evaluator LLM.
        generated_results (list[dict]): The list of dictionaries output by `generate_answers`.
        debug (bool): If True, prints detailed information during the evaluation process.

    Returns:
        list[dict]: The input list augmented with an "Evaluation" key containing
                    the raw output from the evaluator LLM.
    """
    print(f"\n--- Starting Answer Evaluation ---")
    start_time = time.time()
    final_results_with_eval = [] # Store results including the evaluation

    # --- Evaluation Prompt Template ---
    # This prompt is crucial for instructing the evaluator model correctly.
    # It clearly defines the task and the expected output format ("Yes" or "No").
    eval_prompt_template = """Task: In the context of the question, is the semantic meaning of the generated answer the same as the ground truth answer?
Please reply *only* with the word "Yes" if they are semantically the same, or *only* with the word "No" if they are not. Do not provide any explanation.

Question: {}
Generated Answer: {}
Ground Truth Answer: {}

Evaluation:""" # Changed "Answer:" to "Evaluation:" for clarity

    # --- Iterate and Evaluate ---
    for entry in tqdm(generated_results, desc="Evaluating Answers"):
        query = entry["question"]
        generated_answer = entry["generated_answer"]
        ground_truth = entry["ground_truth"]

        # Handle cases where generation failed or produced no response
        if generated_answer in ["NO_RESPONSE", "GENERATION_ERROR"]:
            evaluation_output = "N/A_GENERATION_FAILED" # Mark evaluation as not applicable
        else:
            try:
                # Construct the prompt for the evaluator
                eval_prompt = eval_prompt_template.format(query, generated_answer, ground_truth)

                if debug:
                    print(f"\n--- Prompt for Evaluator ---")
                    print(eval_prompt)
                    print("-" * 60)

                # Generate the evaluation using the evaluator model
                # max_tokens=2 or 3 should be enough for "Yes" or "No" + potential whitespace/eos
                # temperature=0 for deterministic evaluation
                evaluation_output = evaluator_model.generate(eval_prompt, max_tokens=3, temperature=0.0)

                # Basic cleanup of evaluator output (optional, depends on observed model behavior)
                # evaluation_output = evaluation_output.strip().capitalize() # e.g., "yes " -> "Yes"

                if debug:
                    print(f"Raw Evaluator Output: '{evaluation_output}'")
                    print("="*60)

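                # Handle potential errors during evaluation generation
                # (ModelHandler.generate returns these sentinel strings instead of raising)
                if evaluation_output in ("Error in generation", "Error: Model not loaded"):
                    evaluation_output = "EVALUATION_ERROR"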

            except Exception as e:
                print(f"!!! ERROR evaluating question ('{query[:50]}...'): {e}")
                evaluation_output = "EVALUATION_ERROR" # Mark as error

        # Append original info and the evaluation result
        final_results_with_eval.append({
            "question": query,
            "generated_answer": generated_answer,
            "ground_truth": ground_truth,
            "Evaluation": evaluation_output # Store the raw output from the evaluator
        })

    # --- Completion ---
    elapsed_time = (time.time() - start_time) / 60
    print(f"--- Evaluation Completed in {elapsed_time:.2f} minutes ---")
    return final_results_with_eval

## Experiment Execution Functions
These functions run a single experiment with specific parameters, or multiple experiments that iterate over parameter ranges.
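
Each run is saved under a name of the form `results_chunk{chunk_size}_top{top_k}_{timestamp}.csv`, with the timestamp formatted as `%Y%m%d-%H%M%S`. For later analysis it can help to recover the parameters from the filename. The following is a small sketch under that naming assumption; `parse_result_filename` is a hypothetical helper and the example filename is made up.

```python
import re

# Pattern mirrors the naming scheme used when results are saved
RESULT_NAME = re.compile(r"results_chunk(\d+)_top(\d+)_(\d{8}-\d{6})\.csv$")


def parse_result_filename(filename: str):
    """Extract chunk_size, top_k, and timestamp from a results filename, or return None."""
    match = RESULT_NAME.search(filename)
    if match is None:
        return None
    return {
        "chunk_size": int(match.group(1)),
        "top_k": int(match.group(2)),
        "timestamp": match.group(3),
    }


print(parse_result_filename("results_chunk150_top3_20250422-101500.csv"))
```

Files that do not follow the scheme, such as the summary CSVs, yield `None` and can be skipped.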

In [11]:
def run_experiment(chunk_size: int,
                   top_k: int,
                   num_questions: int = None,
                   debug: bool = False) -> pd.DataFrame:
    """
    Runs a complete single experiment: loads models, generates answers, evaluates them,
    and saves the results to a CSV file.

    Args:
        chunk_size (int): The chunk size for the document store.
        top_k (int): The number of documents to retrieve.
        num_questions (int, optional): Limit the number of questions for testing. Defaults to None (all).
        debug (bool): Enables detailed print statements in sub-functions.

    Returns:
        pd.DataFrame: A pandas DataFrame containing the final results with evaluations.
                      Returns an empty DataFrame if a critical error occurs.
    """
    experiment_start_time = time.time()
    print(f"\n{'='*70}")
    print(f"Starting Experiment: chunk_size={chunk_size}, top_k={top_k}")
    if num_questions:
        print(f"   (Using {num_questions} questions for this run)")
    print(f"{'='*70}")

    generator_model = None
    evaluator_model = None
    results_df = pd.DataFrame() # Initialize empty DataFrame

    try:
        # --- Initialize Models ---
        # Assign models to different GPUs if available (GPU 0 for Generator, GPU 1 for Evaluator)
        gen_device_id = 0
        eval_device_id = 1 if torch.cuda.device_count() > 1 else 0
        print(f"Assigning Generator to device cuda:{gen_device_id}")
        print(f"Assigning Evaluator to device cuda:{eval_device_id}")

        # Instantiate model handlers
        # Using Llama 3.1 8B for both generator and evaluator as per project summary
        generator_model = ModelHandler("meta-llama/Llama-3.1-8B", device_id=gen_device_id)
        evaluator_model = ModelHandler("meta-llama/Llama-3.1-8B", device_id=eval_device_id) # Changed from 1B

        # --- Load Models ---
        # Loading models sequentially can help manage memory demands
        generator_model.load_model()
        evaluator_model.load_model()

        # --- Optional: Run Debug Tests ---
        # These tests help verify models are loaded and behaving somewhat reasonably before the main run.
        if debug:
            test_rag_model(generator_model)
            test_evaluator(evaluator_model)

        # --- Step 1: Generate Answers ---
        generated_results = generate_answers(
            generator_model=generator_model,
            chunk_size=chunk_size,
            top_k=top_k,
            num_questions=num_questions,
            debug=debug # Pass debug flag down
        )

        # --- Step 2: Evaluate Answers ---
        final_results_with_eval = evaluate_answers(
            evaluator_model=evaluator_model,
            generated_results=generated_results,
            debug=debug # Pass debug flag down
        )

        # --- Step 3: Save Results ---
        if final_results_with_eval:
            results_df = pd.DataFrame(final_results_with_eval)
            # Define filename including parameters and timestamp
            timestamp = time.strftime("%Y%m%d-%H%M%S")
            base_filename = f'results_chunk{chunk_size}_top{top_k}_{timestamp}'
            csv_filename = f'{base_filename}.csv'

            print(f"\nSaving results to {csv_filename}...")
            results_df.to_csv(csv_filename, index=False)
            print(f"Results saved successfully ({len(results_df)} rows).")

            # --- Optional: Calculate and Print Accuracy based on Evaluator ---
            # Note: This accuracy is based on the *evaluator LLM's* output ("Yes"/"No"),
            # which is the subject of reliability analysis in the project.
            # With max_tokens=3 the raw output can carry trailing text (e.g. "Yes\n\nExplanation"),
            # so an exact match on "yes" would undercount. Match on the leading word instead.
            # Count outputs that start with "yes" (case-insensitive, stripping whitespace)
            correct_count = results_df['Evaluation'].str.strip().str.lower().str.startswith('yes', na=False).sum()
            total_evaluated = len(results_df[results_df['Evaluation'] != "N/A_GENERATION_FAILED"]) # Exclude non-evaluated items
            accuracy = (correct_count / total_evaluated) * 100 if total_evaluated > 0 else 0
            print(f"Evaluator-based Accuracy: {accuracy:.2f}% ({correct_count}/{total_evaluated})")
            # This accuracy is preliminary and needs comparison with human evaluation.

        else:
            print("Warning: No results were generated or evaluated.")

    except Exception as e:
        print(f"!!! CRITICAL ERROR during experiment (chunk={chunk_size}, top_k={top_k}): {e}")
        # Consider logging the full traceback
        # import traceback
        # traceback.print_exc()

    finally:
        # --- Step 4: Clean Up ---
        # Ensure models are unloaded and memory is freed, even if errors occurred
        print("\nCleaning up models and freeing memory...")
        if generator_model:
            generator_model.unload_model()
        if evaluator_model:
            evaluator_model.unload_model()
        # Explicitly delete references and run GC again
        del generator_model
        del evaluator_model
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        print("Cleanup complete.")

        total_time_minutes = (time.time() - experiment_start_time) / 60
        print(f"\n🏁 Experiment (chunk={chunk_size}, top_k={top_k}) Finished.")
        print(f"   Total Time: {total_time_minutes:.2f} minutes.")
        print(f"{'='*70}\n")

    return results_df # Return the DataFrame (might be empty if errors occurred)


# Function to run multiple experiments iterating through parameter lists.
def run_multiple_experiments(chunk_sizes: list[int],
                             top_k_values: list[int],
                             num_questions: int = None):
    """
    Runs a series of experiments for combinations of chunk sizes and top_k values.
    Logs progress and saves a summary of (evaluator-based) results.

    Args:
        chunk_sizes (list[int]): List of chunk sizes to test.
        top_k_values (list[int]): List of top_k values to test.
        num_questions (int, optional): Limit the number of questions for each experiment run. Defaults to None (all).

    Returns:
        pd.DataFrame: A summary DataFrame containing parameters and evaluator-based accuracy for each run.
                      May be incomplete if errors occur.
    """
    all_results_summary = [] # To store summary info from each run
    overall_start_time = time.time()
    log_filename = f"experiment_log_{time.strftime('%Y%m%d-%H%M%S')}.txt"

    # --- Logging Setup ---
    print(f"Starting multiple experiments. Logging progress to: {log_filename}")
    with open(log_filename, "w") as log:
        log.write(f"--- Experiment Log ---\n")
        log.write(f"Start Time: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
        log.write(f"Chunk Sizes: {chunk_sizes}\n")
        log.write(f"Top_k Values: {top_k_values}\n")
        log.write(f"Num Questions per run: {num_questions if num_questions else 'All'}\n")
        log.write("-" * 30 + "\n\n")

    # --- Experiment Loop ---
    total_experiments = len(chunk_sizes) * len(top_k_values)
    current_experiment = 0
    for chunk_size in chunk_sizes:
        for top_k in top_k_values:
            current_experiment += 1
            print(f"\n>>> Running Experiment {current_experiment}/{total_experiments}: chunk={chunk_size}, top_k={top_k} <<<")
            with open(log_filename, "a") as log:
                log.write(f"--- Starting Exp {current_experiment}/{total_experiments}: chunk={chunk_size}, top_k={top_k} ---\n")
                log.write(f"Time: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")

            try:
                # Run the single experiment function
                results_df = run_experiment(
                    chunk_size=chunk_size,
                    top_k=top_k,
                    num_questions=num_questions,
                    debug=False # Usually disable debug for multi-runs unless troubleshooting
                )

                # --- Log and Summarize Results ---
                if not results_df.empty:
                    # Recalculate evaluator accuracy for the summary.
                    # Match on a leading "yes" because the raw output can carry trailing text (e.g. "Yes\n\nExplanation").
                    correct_count = results_df['Evaluation'].str.strip().str.lower().str.startswith('yes', na=False).sum()
                    total_evaluated = len(results_df[results_df['Evaluation'] != "N/A_GENERATION_FAILED"])
                    accuracy = (correct_count / total_evaluated) * 100 if total_evaluated > 0 else 0
                    run_time_minutes = (time.time() - overall_start_time) / 60 # Cumulative minutes since the first experiment started (not per-experiment)

                    summary_entry = {
                        "chunk_size": chunk_size,
                        "top_k": top_k,
                        "num_questions_processed": len(results_df),
                        "evaluator_accuracy": accuracy,
                        # "time_minutes": run_time_minutes # Add if needed
                    }
                    all_results_summary.append(summary_entry)

                    with open(log_filename, "a") as log:
                        log.write(f"Completed Successfully.\n")
                        log.write(f"Evaluator Accuracy: {accuracy:.2f}% ({correct_count}/{total_evaluated})\n")
                        log.write(f"Results saved to: results_chunk{chunk_size}_top{top_k}_*.csv\n")
                        log.write("-" * 20 + "\n\n")
                else:
                    with open(log_filename, "a") as log:
                        log.write(f"Completed with WARNINGS (Empty Results DataFrame).\n")
                        log.write("-" * 20 + "\n\n")


            except Exception as e:
                print(f"!!! FATAL ERROR in multi-experiment run (chunk={chunk_size}, top_k={top_k}): {e}")
                # Log the error
                with open(log_filename, "a") as log:
                    log.write(f"!!! EXPERIMENT FAILED !!!\n")
                    log.write(f"Error: {str(e)}\n")
                    # Consider logging traceback here too
                    log.write("-" * 20 + "\n\n")
                # Optional: Decide whether to stop all runs or continue
                # continue # Continue to the next experiment

            # --- Intermediate Save (Optional but Recommended) ---
            # Save the summary DataFrame periodically in case of crashes
            if all_results_summary and current_experiment % 5 == 0: # Save every 5 experiments
                temp_summary_df = pd.DataFrame(all_results_summary)
                temp_summary_df.to_csv(f"experiment_summary_progress_{time.strftime('%Y%m%d-%H%M%S')}.csv", index=False)
                print(f"Saved intermediate progress summary ({current_experiment}/{total_experiments}).")

    # --- Final Summary ---
    print("\n--- All Experiments Attempted ---")
    overall_time_minutes = (time.time() - overall_start_time) / 60
    print(f"Total time for all experiments: {overall_time_minutes:.2f} minutes.")

    summary_df = pd.DataFrame(all_results_summary)
    if not summary_df.empty:
        print("\nSummary of Results (based on Evaluator LLM):")
        print(summary_df)

        # Save final summary
        summary_filename_base = f"experiment_summary_final_{time.strftime('%Y%m%d-%H%M%S')}"
        summary_df.to_csv(f"{summary_filename_base}.csv", index=False)
        # summary_df.to_excel(f"{summary_filename_base}.xlsx", index=False) # Optional Excel export
        print(f"Final summary saved to {summary_filename_base}.csv")

        # --- Find Best Config (based on evaluator) ---
        # Note: "Best" here is according to the possibly unreliable evaluator LLM
        try:
            best_idx = summary_df["evaluator_accuracy"].idxmax()
            best_config = summary_df.loc[best_idx]
            print("\nBest Configuration (according to Evaluator):")
            print(best_config)
            with open(log_filename, "a") as log:
                log.write(f"\n--- Run Summary ---\n")
                log.write(f"Total Time: {overall_time_minutes:.2f} minutes\n")
                log.write(f"Best Config (Evaluator): chunk={best_config['chunk_size']}, k={best_config['top_k']}, Acc={best_config['evaluator_accuracy']:.2f}%\n")
        except ValueError:
            print("\nCould not determine best configuration (summary might be empty or contain NaNs).")
            with open(log_filename, "a") as log:
                log.write("\nCould not determine best configuration from summary.\n")

    else:
        print("\nNo results were successfully summarized.")
        with open(log_filename, "a") as log:
            log.write("\n--- No results were successfully summarized. ---\n")


    return summary_df

## Running the Experiments
Define the parameter ranges and execute the experiment runs.
Start with a small test run (`run_experiment`) before launching the full loop (`run_multiple_experiments`).
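
Before launching the full loop, it is worth checking how many runs the grid implies. The sketch below enumerates the combinations for the chunk sizes and `top_k` values used in the commented-out grid search in the next cell. `itertools.product` yields them in the same order as the nested loops in `run_multiple_experiments` (chunk size outer, `top_k` inner).

```python
from itertools import product

chunk_sizes = [50, 100, 150, 200]
top_k_values = list(range(1, 11))  # 1 through 10

# Every (chunk_size, top_k) pair, in nested-loop order
grid = list(product(chunk_sizes, top_k_values))

print(len(grid))          # 40
print(grid[0], grid[-1])  # (50, 1) (200, 10)
```

Multiplying the number of combinations by the time taken by the single test run gives a rough estimate of the total compute needed, assuming each run processes the same number of questions.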

In [12]:
# --- Single Test Run (Recommended First) ---
# Use a small number of questions and enable debug to verify the pipeline
print("--- Running Single Test Experiment ---")
test_run_df = run_experiment(
    chunk_size=150,      # Example parameters
    top_k=3,
    num_questions=10,   # Use a small number for testing
    debug=True          # Enable debug output for inspection
)
print("\n--- Test Run Complete ---")
# # Display the results of the test run
# if not test_run_df.empty:
#     print(test_run_df.head())
# else:
#     print("Test run produced no results (check logs/output for errors).")


# --- Full Experiment Runs ---
# Uncomment and run this section to perform the full grid search.
# **WARNING:** This will take a significant amount of time and compute resources.
# Ensure you have sufficient GPU memory and time allocated. Monitor the process.

# print("\n--- !!! Starting Full Experiment Grid Search !!! ---")
# # Define the parameter grid based on the project description
# chunk_sizes_to_test = [50, 100, 150, 200]
# top_k_values_to_test = list(range(1, 11)) # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# # Run experiments for all combinations
# # Set num_questions=None to use all 918 questions, or set a number for a smaller run.
# full_summary_df = run_multiple_experiments(
#     chunk_sizes=chunk_sizes_to_test,
#     top_k_values=top_k_values_to_test,
#     num_questions=None # Use None for all questions, or e.g., 50 for a smaller test
# )

# print("\n--- !!! Full Experiment Grid Search Complete !!! ---")
# if not full_summary_df.empty:
#      print("\nFinal Summary DataFrame:")
#      print(full_summary_df)
# else:
#      print("Full experiment run produced no summary (check logs/output for errors).")

--- Running Single Test Experiment ---

Starting Experiment: chunk_size=150, top_k=3
   (Using 10 questions for this run)
Assigning Generator to device cuda:0
Assigning Evaluator to device cuda:1
ModelHandler created for meta-llama/Llama-3.1-8B on device target: cuda:0
ModelHandler created for meta-llama/Llama-3.1-8B on device target: cuda:1

--- Loading Model: meta-llama/Llama-3.1-8B ---
Loading tokenizer...


Tokenizer loaded.
Target GPU (cuda:0) total memory: 14.74 GB
Configuring 8-bit quantization...
Loading model meta-llama/Llama-3.1-8B with quantization...


Device set to use cuda:0


Model loaded.
Creating text generation pipeline...
Pipeline created successfully for meta-llama/Llama-3.1-8B.

--- Loading Model: meta-llama/Llama-3.1-8B ---
Loading tokenizer...
Tokenizer loaded.
Target GPU (cuda:1) total memory: 14.74 GB
Configuring 8-bit quantization...
Loading model meta-llama/Llama-3.1-8B with quantization...


Device set to use cuda:1


Model loaded.
Creating text generation pipeline...
Pipeline created successfully for meta-llama/Llama-3.1-8B.

--- Testing RAG Generator Model (Direct Generation) ---

--- Test Case 1 ---
Question: Was Abraham Lincoln the sixteenth President of the United States?
Provided Context: 'Abraham Lincoln (February 12, 1809 – April 15, 1865) was an American lawyer and statesman who served as the 16th president of the United States from 1861 until his assassination in 1865.'
Expected Answer (approx): 'yes'




Generated Answer: 'Yes, Abraham Lincoln was the sixteenth President of the United States.'

--- Test Case 2 ---
Question: When did Lincoln begin his political career?
Provided Context: 'Lincoln began his political career in 1832, at the age of 23, when he ran for the Illinois General Assembly.'
Expected Answer (approx): '1832'
Generated Answer: 'Lincoln began his political career in 1832, at the age of 23, when he ran'

--- RAG Generator Model Testing Complete ---

--- Testing Evaluator Model ---

--- Test Case 1 ---
Question: Is the sky blue?
Generated Answer: 'Yes, the sky is blue.'
Ground Truth: 'yes'
Expected Evaluation: 'Yes'
Raw Evaluator Output: 'Yes

Explanation'
Evaluation Result: Matches Expected

--- Test Case 2 ---
Question: Is the Earth flat?
Generated Answer: 'No, the Earth is round.'
Ground Truth: 'no'
Expected Evaluation: 'Yes'
Raw Evaluator Output: 'Yes

Explanation'
Evaluation Result: Matches Expected

--- Test Case 3 ---
Question: What is 2+2?
Generated Answer: '2'
G