# RAG Pipeline Tutorial 2

This notebook implements a simpler RAG evaluation approach:
- **No chunking**: Each context is stored as a single chunk
- **ID-based retrieval evaluation**: Checks if the golden context ID appears in top 3 retrieved chunk IDs
- Because there is no chunking, each context has exactly one ID, so the golden context ID can be compared directly with the IDs of the retrieved chunks

## Evaluation Approach
- Golden context ID = unique identifier for each context
- Retrieval Recall@3 = % of questions where golden context ID is found in top 3 retrieved chunk IDs
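
The metric above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's evaluation function: the name `recall_at_k` and the toy IDs are assumptions made for the example.

```python
from typing import Dict, List


def recall_at_k(golden_ids: Dict[str, str],
                retrieved_ids: Dict[str, List[str]],
                k: int = 3) -> float:
    """Fraction of questions whose golden context ID is among the top-k retrieved IDs."""
    if not golden_ids:
        return 0.0
    hits = sum(
        1
        for question_id, golden_id in golden_ids.items()
        if golden_id in retrieved_ids.get(question_id, [])[:k]
    )
    return hits / len(golden_ids)


# Toy data (hypothetical question and context IDs, not taken from the dataset)
toy_golden = {"q1": "context_1", "q2": "context_2", "q3": "context_3"}
toy_retrieved = {
    "q1": ["context_4", "context_1", "context_31"],  # hit at rank 2
    "q2": ["context_17", "context_4", "context_1"],  # miss
    "q3": ["context_3", "context_9", "context_8"],   # hit at rank 1
}
toy_score = recall_at_k(toy_golden, toy_retrieved, k=3)
print(f"Recall@3 = {toy_score:.4f}")  # Recall@3 = 0.6667
```

Two of the three toy questions have their golden ID in the top 3, so the score is 2/3.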


## Step 1: Setup and Imports


In [1]:
import os
import sys
import json
import numpy as np
from tqdm import tqdm
from datetime import datetime

# Add parent directory to path to import shared modules
import pathlib
notebook_dir = pathlib.Path().resolve()
parent_dir = notebook_dir.parent
if str(parent_dir) not in sys.path:
    sys.path.insert(0, str(parent_dir))

print(" Imports successful")


 Imports successful


## Step 2: Load Configuration


In [23]:
from config_local import (
    OLLAMA_BASE_URL,
    LLM_MODEL,
    EMBEDDING_MODEL,
    DEVICE,
    PERSIST_DIRECTORY,
    K_RETRIEVED,
    TEMPERATURE
)

print("Configuration loaded:")
print(f"  Ollama URL: {OLLAMA_BASE_URL}")
print(f"  LLM Model: {LLM_MODEL}")
print(f"  Embedding Model: {EMBEDDING_MODEL}")
print(f"  Device: {DEVICE}")
print(f"  K Retrieved: {K_RETRIEVED}")


Configuration loaded:
  Ollama URL: http://localhost:11434
  LLM Model: llama3:8b
  Embedding Model: BAAI/bge-base-en-v1.5
  Device: cuda
  K Retrieved: 3


## Step 3: Check Ollama Connection


In [4]:
import requests

def check_ollama():
    """Check if Ollama is running and model is available"""
    try:
        response = requests.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=5)
        if response.status_code == 200:
            models = response.json().get("models", [])
            model_names = [m.get("name", "") for m in models]
            if LLM_MODEL in model_names:
                print(f" Ollama is running")
                print(f" Model '{LLM_MODEL}' is available")
                return True
            else:
                print(f" Model '{LLM_MODEL}' not found")
                return False
        else:
            print(f" Ollama returned error: {response.status_code}")
            return False
    except Exception as e:
        print(f" Cannot connect to Ollama: {e}")
        return False

check_ollama()


 Ollama is running
 Model 'llama3:8b' is available


True

## Step 4: Load the Dataset


In [5]:
from dataset_loader import load_squad_v2

print("Loading SQuAD v2 dataset...")
examples = load_squad_v2(split='validation')

print(f"\n Loaded {len(examples)} examples")

# Examine one example
example = examples[0]
print("\nExample structure:")
print(f"  ID (unique index): {example['id']}")
print(f"  Title: {example['title']}")
print(f"  Question: {example['question']}")
print(f"  Context: {example['context'][:200]}...")
print(f"  Answers: {example['answers']['text']}")
print(f"  Is Impossible: {example['is_impossible']}")




Loading SQuAD v2 dataset...

 Loaded 11873 examples

Example structure:
  ID (unique index): 56ddde6b9a695914005b9628
  Title: Normans
  Question: In what country is Normandy located?
  Context: The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("...
  Answers: ['France', 'France', 'France', 'France']
  Is Impossible: False


## Step 5: Prepare Documents WITHOUT Chunking

**Key difference**: Each context becomes ONE document chunk. No splitting is performed.
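Each unique context is assigned its own ID (`context_1`, `context_2`, ...), which is stored in the document metadata and used for evaluation. Each question's `example['id']` is then mapped to the ID of its context.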


In [24]:
from langchain_core.documents import Document
from collections import OrderedDict

print("Preparing documents...")
print("Deduplicating contexts to ensure each unique context has only ONE ID")

# Step 1: Deduplicate contexts and assign unique context IDs
# Use OrderedDict to preserve first occurrence order
unique_contexts = OrderedDict()  # Maps context_text -> unique_context_id
question_to_context_id = {}  # Maps question_id -> context_id

context_counter = 0

for ex in examples:
    context_text = ex['context']
    question_id = ex['id']
    
    # If we've seen this context before, reuse its ID
    if context_text in unique_contexts:
        context_id = unique_contexts[context_text]
    else:
        # New unique context - assign a new ID
        context_counter += 1
        context_id = f"context_{context_counter}"  # Unique identifier for this context
        unique_contexts[context_text] = context_id
    
    # Map question ID to context ID
    question_to_context_id[question_id] = context_id

print(f"\n Found {len(unique_contexts)} unique contexts")
print(f"  Total questions: {len(examples)}")
print(f"  Deduplication: {len(examples) - len(unique_contexts)} duplicate contexts removed")

# Step 2: Create documents only for unique contexts
documents = []
for context_text, context_id in unique_contexts.items():
    doc = Document(
        page_content=context_text,  # Full context as one chunk
        metadata={
            'id': context_id,  # Unique context ID (not question ID)
            'unique_context_id': context_id,  # Store for clarity
        }
    )
    documents.append(doc)

print(f"\n Created {len(documents)} documents (one per unique context)")
print(f"  No chunking performed - each context is a complete chunk")

# Store the mapping for use in evaluation
# This will be available in the global scope
QUESTION_TO_CONTEXT_ID_MAP = question_to_context_id
CONTEXT_ID_TO_DOCS = {doc.metadata['id']: doc for doc in documents}

# Examine one document
print("\nExample document:")
print(f"  Content length: {len(documents[0].page_content)} characters")
print(f"  Unique Context ID: {documents[0].metadata['id']}")
print(f"\nExample mapping:")
sample_question_id = list(question_to_context_id.keys())[0]
sample_context_id = question_to_context_id[sample_question_id]
print(f"  Question ID '{sample_question_id}' -> Context ID '{sample_context_id}'")

Preparing documents...
Deduplicating contexts to ensure each unique context has only ONE ID

 Found 1204 unique contexts
  Total questions: 11873
  Deduplication: 10669 duplicate contexts removed

 Created 1204 documents (one per unique context)
  No chunking performed - each context is a complete chunk

Example document:
  Content length: 742 characters
  Unique Context ID: context_1

Example mapping:
  Question ID '56ddde6b9a695914005b9628' -> Context ID 'context_1'
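
The deduplication step can also be seen on a toy input. The sketch below is an illustrative variant, not the cell above: the function name `assign_context_ids` and the toy records are assumptions. It uses a plain `dict`, which preserves insertion order in Python 3.7+, so `OrderedDict` is not strictly required for first-occurrence ordering.

```python
from typing import Dict, List, Tuple


def assign_context_ids(examples: List[dict]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Map each distinct context text to one ID, and each question ID to that context ID."""
    context_to_id: Dict[str, str] = {}
    question_to_context: Dict[str, str] = {}
    for ex in examples:
        # setdefault stores the new ID only the first time a context text is seen
        context_id = context_to_id.setdefault(
            ex["context"], f"context_{len(context_to_id) + 1}"
        )
        question_to_context[ex["id"]] = context_id
    return context_to_id, question_to_context


# Toy records (hypothetical): two questions share the same context text
toy_examples = [
    {"id": "q1", "context": "Text A"},
    {"id": "q2", "context": "Text A"},
    {"id": "q3", "context": "Text B"},
]
toy_contexts, toy_mapping = assign_context_ids(toy_examples)
print(toy_mapping)  # {'q1': 'context_1', 'q2': 'context_1', 'q3': 'context_2'}
```

Questions that share a context text end up pointing at the same context ID, which is what makes the ID comparison in the evaluation meaningful.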


## Step 6: Initialize Embedding Model


In [7]:
from langchain_community.embeddings import HuggingFaceEmbeddings

print(f"Initializing embedding model: {EMBEDDING_MODEL}")
print(f"Using device: {DEVICE}")

embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL,
    model_kwargs={'device': DEVICE}
)

print("\n Embedding model loaded")


Initializing embedding model: BAAI/bge-base-en-v1.5
Using device: cuda





 Embedding model loaded


## Step 7: Create or Load Vector Store

Use a separate persist directory to avoid conflicts with the vector store from the chunked version.


In [8]:
from langchain_community.vectorstores import Chroma

# Use a separate directory for non-chunked version
VECTOR_STORE_DIR = "./chroma_db_no_chunking"

# Check if vector store already exists
if os.path.exists(VECTOR_STORE_DIR) and os.listdir(VECTOR_STORE_DIR):
    print(f"Loading existing vector store from {VECTOR_STORE_DIR}...")
    vectorstore = Chroma(
        persist_directory=VECTOR_STORE_DIR,
        embedding_function=embedding_model
    )
    print(f" Loaded {vectorstore._collection.count()} documents")
else:
    print(f"Creating new vector store at {VECTOR_STORE_DIR}...")
    print("This may take several minutes (embedding all contexts)...")
    
    vectorstore = Chroma.from_documents(
        documents=documents,
        embedding=embedding_model,
        persist_directory=VECTOR_STORE_DIR
    )
    print(f" Created vector store with {vectorstore._collection.count()} documents")

print("\nVector store ready for retrieval!")


Creating new vector store at ./chroma_db_no_chunking...
This may take several minutes (embedding all contexts)...
 Created vector store with 1204 documents

Vector store ready for retrieval!


## Step 8: Create Retriever


In [9]:
print(f"Creating retriever to fetch top {K_RETRIEVED} most relevant contexts...")

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": K_RETRIEVED}
)

print(" Retriever created")

# Test retrieval
test_question = "What is the capital of France?"
print(f"\nTesting retrieval with question: '{test_question}'")

retrieved_docs = retriever.invoke(test_question)
print(f"\nRetrieved {len(retrieved_docs)} contexts:")
for i, doc in enumerate(retrieved_docs[:2], 1):
    print(f"\n  Context {i}:")
    print(f"  Unique ID: {doc.metadata.get('id', 'N/A')}")
    print(f"  Content preview: {doc.page_content[:150]}...")


Creating retriever to fetch top 3 most relevant contexts...
 Retriever created

Testing retrieval with question: 'What is the capital of France?'

Retrieved 3 contexts:

  Context 1:
  Unique ID: context_1066
  Content preview: Warsaw (Polish: Warszawa [varˈʂava] ( listen); see also other names) is the capital and largest city of Poland. It stands on the Vistula River in east...

  Context 2:
  Unique ID: context_1073
  Content preview: Warsaw remained the capital of the Polish–Lithuanian Commonwealth until 1796, when it was annexed by the Kingdom of Prussia to become the capital of t...
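
A similarity retriever returns the k nearest items whether or not any of them is actually relevant, which is consistent with the France query above surfacing passages about Warsaw. The sketch below illustrates top-k ranking on toy vectors. It assumes cosine similarity, which may not be the distance the Chroma collection actually uses, and the vectors and names (`cosine`, `top_k`) are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec: List[float],
          doc_vecs: Dict[str, List[float]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings" (hypothetical values, for illustration only)
toy_docs = {
    "context_a": [1.0, 0.0],
    "context_b": [0.7, 0.7],
    "context_c": [0.0, 1.0],
    "context_d": [-1.0, 0.0],
}
toy_ranked = top_k([1.0, 0.1], toy_docs, k=3)
print([doc_id for doc_id, _ in toy_ranked])  # ['context_a', 'context_b', 'context_c']
```

Note that `context_c` is returned even though its similarity to the query is low: the ranking has no relevance threshold.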


## Step 9: Evaluate Retriever Accuracy (Recall@3)

**Evaluation Logic:**
- For each question, we have a **golden context** with a unique ID
- We retrieve top 3 chunks and get their IDs
- **Recall@3 = % of questions where golden context ID is found in top 3 retrieved chunk IDs**

Since we're not chunking, the golden context ID will exactly match the chunk ID if the correct context was retrieved.
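
The per-question check reduces to finding the position of the golden ID in the retrieved list. A minimal sketch, with `golden_rank` as a hypothetical helper name and ID lists copied from this notebook's recorded outputs:

```python
from typing import List, Optional


def golden_rank(golden_id: str, retrieved_ids: List[str]) -> Optional[int]:
    """Return the 1-based rank of golden_id in retrieved_ids, or None if it is absent."""
    for rank, retrieved_id in enumerate(retrieved_ids, start=1):
        if str(retrieved_id) == str(golden_id):
            return rank
    return None


# Golden context present at the second position
print(golden_rank("context_1", ["context_4", "context_1", "context_31"]))  # 2

# Golden context missing from the top 3
print(golden_rank("context_2", ["context_17", "context_4", "context_1"]))  # None
```

A question counts toward Recall@3 exactly when this helper returns a rank rather than `None`.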


In [10]:
def evaluate_retriever_accuracy_id_based(retriever, examples, max_samples=50):
    """
    Evaluate retriever accuracy using ID-based matching.
    
    Logic:
    - Each question maps to a unique context ID via QUESTION_TO_CONTEXT_ID_MAP
    - We retrieve top K chunks and check their context IDs
    - Success = golden context ID is found in retrieved chunk IDs
    """
    print(f"Evaluating retriever accuracy (ID-based Recall@{K_RETRIEVED}) on {max_samples} examples...")
    print(f"\nEvaluation method:")
    print(f"  - Map question ID to unique context ID")
    print(f"  - Check if golden context ID appears in top {K_RETRIEVED} retrieved chunk IDs")
    print(f"  - Only unique contexts stored (no duplicates)\n")
    
    # Filter to only answerable questions
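    sampled_examples = examples[:max_samples]
    answerable_examples = [ex for ex in sampled_examples if not ex['is_impossible']]
    # Count from the actual slice so the number stays correct if fewer than max_samples examples exist
    unanswerable_count = len(sampled_examples) - len(answerable_examples)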
    
    print(f"Answerable questions: {len(answerable_examples)}")
    print(f"Unanswerable questions: {unanswerable_count} (excluded from evaluation)\n")
    
    correct_retrievals = 0
    detailed_results = []
    
    for example in tqdm(answerable_examples, desc="Evaluating retrieval"):
        question = example['question']
        question_id = example['id']
        
        golden_context_id = QUESTION_TO_CONTEXT_ID_MAP.get(question_id)
        
        if golden_context_id is None:
            print(f"Warning: Question ID {question_id} not found in mapping!")
            continue
        
        retrieved_docs = retriever.invoke(question)
        retrieved_context_ids = [doc.metadata.get('id') for doc in retrieved_docs]
        
        found = False
        rank = None
        
        for rank_idx, retrieved_id in enumerate(retrieved_context_ids, start=1):
            if str(retrieved_id) == str(golden_context_id):
                found = True
                rank = rank_idx
                break
        
        if found:
            correct_retrievals += 1
        
        detailed_results.append({
            'question': question,
            'question_id': question_id,
            'golden_context_id': golden_context_id,
            'found': found,
            'rank': rank,
            'retrieved_context_ids': retrieved_context_ids
        })
    
    recall_at_k = correct_retrievals / len(answerable_examples) if len(answerable_examples) > 0 else 0.0
    
    print(f"\n{'='*60}")
    print("RETRIEVER ACCURACY RESULTS (ID-based Recall@K)")
    print(f"{'='*60}")
    print(f"Total Questions:           {max_samples}")
    print(f"Answerable Questions:      {len(answerable_examples)}")
    print(f"Unanswerable Questions:    {unanswerable_count} (excluded)")
    print(f"Golden Context Found:      {correct_retrievals}")
    print(f"\nRecall@{K_RETRIEVED}:            {recall_at_k:.4f} ({recall_at_k*100:.2f}%)")
    print(f"\nInterpretation:")
    print(f"  - {recall_at_k*100:.2f}% of questions had their golden context ID in top {K_RETRIEVED} retrieved chunk IDs")
    print(f"  - Only unique contexts stored - no duplicate contexts")
    print(f"{'='*60}\n")
    
    return {
        'recall_at_k': recall_at_k,
        'correct_retrievals': correct_retrievals,
        'total_answerable': len(answerable_examples),
        'detailed_results': detailed_results
    }

# Run retriever evaluation
retriever_metrics = evaluate_retriever_accuracy_id_based(retriever, examples, max_samples=50)

Evaluating retriever accuracy (ID-based Recall@3) on 50 examples...

Evaluation method:
  - Map question ID to unique context ID
  - Check if golden context ID appears in top 3 retrieved chunk IDs
  - Only unique contexts stored (no duplicates)

Answerable questions: 21
Unanswerable questions: 29 (excluded from evaluation)



Evaluating retrieval: 100%|██████████| 21/21 [00:00<00:00, 97.58it/s]


RETRIEVER ACCURACY RESULTS (ID-based Recall@K)
Total Questions:           50
Answerable Questions:      21
Unanswerable Questions:    29 (excluded)
Golden Context Found:      15

Recall@3:            0.7143 (71.43%)

Interpretation:
  - 71.43% of questions had their golden context ID in top 3 retrieved chunk IDs
  - Only unique contexts stored - no duplicate contexts






## Step 9b: Examine Retrieval Results

Let's look at examples where retrieval succeeded and failed.


In [11]:
print("Examples where golden context WAS retrieved:\n")
found_examples = [r for r in retriever_metrics['detailed_results'] if r['found']]
for i, result in enumerate(found_examples[:5], 1):
    print(f"{i}. Question: {result['question'][:80]}...")
    print(f"   Golden Context ID: {result['golden_context_id']}")
    print(f"   Found at rank: {result['rank']}")
    print(f"   Retrieved IDs: {result['retrieved_context_ids']}")
    print()

print("\n" + "="*60)
print("Examples where golden context was NOT retrieved:\n")
not_found_examples = [r for r in retriever_metrics['detailed_results'] if not r['found']]
for i, result in enumerate(not_found_examples[:5], 1):
    print(f"{i}. Question: {result['question'][:80]}...")
    print(f"   Golden Context ID: {result['golden_context_id']}")
    print(f"   Retrieved IDs: {result['retrieved_context_ids']}")
    print()
    
print(f"\nSummary:")
print(f"  Successfully retrieved: {len(found_examples)} / {len(retriever_metrics['detailed_results'])}")
print(f"  Failed to retrieve: {len(not_found_examples)} / {len(retriever_metrics['detailed_results'])}")


Examples where golden context WAS retrieved:

1. Question: In what country is Normandy located?...
   Golden Context ID: context_1
   Found at rank: 2
   Retrieved IDs: ['context_4', 'context_1', 'context_31']

2. Question: When were the Normans in Normandy?...
   Golden Context ID: context_1
   Found at rank: 1
   Retrieved IDs: ['context_1', 'context_17', 'context_2']

3. Question: From which countries did the Norse originate?...
   Golden Context ID: context_1
   Found at rank: 1
   Retrieved IDs: ['context_1', 'context_6', 'context_5']

4. Question: Who was the Norse leader?...
   Golden Context ID: context_1
   Found at rank: 1
   Retrieved IDs: ['context_1', 'context_20', 'context_6']

5. Question: What century did the Normans first gain their separate identity?...
   Golden Context ID: context_1
   Found at rank: 1
   Retrieved IDs: ['context_1', 'context_18', 'context_7']


Examples where golden context was NOT retrieved:

1. Question: Who was the duke in the battle of Hastings

In [12]:
# Prepare results
results = {
    'timestamp': datetime.now().isoformat(),
    'evaluation_method': 'ID-based Recall@K (no chunking)',
    'retriever_metrics': {
        'recall_at_k': retriever_metrics['recall_at_k'],
        'correct_retrievals': retriever_metrics['correct_retrievals'],
        'total_answerable': retriever_metrics['total_answerable']
    },
    'config': {
        'embedding_model': EMBEDDING_MODEL,
        'llm_model': LLM_MODEL,
        'device': DEVICE,
        'k_retrieved': K_RETRIEVED,
        'chunking': False,
        'description': 'Each context = one chunk, evaluation by ID matching'
    },
    'sample_results': retriever_metrics['detailed_results'][:10]
}

# Save to file
output_file = f"retrieval_results_no_chunking_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(output_file, 'w') as f:
    json.dump(results, f, indent=2)

print(f" Results saved to {output_file}")
print(f"\nResults summary:")
print(f"  Recall@{K_RETRIEVED}: {retriever_metrics['recall_at_k']:.4f} ({retriever_metrics['recall_at_k']*100:.2f}%)")


 Results saved to retrieval_results_no_chunking_20251225_213912.json

Results summary:
  Recall@3: 0.7143 (71.43%)


## Summary

### Key Differences from Chunked Approach:

1. **No Chunking**: Each context is stored as a single, complete document chunk
   - Simplifies evaluation: golden context ID = chunk ID
   - Easier to understand: did we retrieve the right context?

2. **ID-Based Evaluation**: 
   - Check if golden context ID (unique index) appears in top 3 retrieved chunk IDs
   - Direct matching: if IDs match, we retrieved the correct context

3. **Advantages**:
   - Simpler evaluation logic
   - Clear interpretation: % of questions where correct context was in top 3
   - No ambiguity from chunking variations

4. **Trade-offs**:
   - Larger chunks (full contexts) vs smaller chunks (subsections)
   - The entire context is returned even if only a small part of it is relevant
   - Different retrieval characteristics compared to chunked approach

### Evaluation Metric:
- **Recall@3**: Percentage of questions where golden context ID was found in top 3 retrieved chunk IDs
- Since no chunking, this directly measures: "Did we retrieve the correct context?"


## Step 10: Initialize LLM (Ollama)

Initialize the language model that will generate answers based on the retrieved context.


In [13]:
from langchain_ollama import ChatOllama

print(f"Initializing LLM: {LLM_MODEL}")
print(f"Temperature: {TEMPERATURE} (deterministic)")

llm = ChatOllama(
    model=LLM_MODEL,
    base_url=OLLAMA_BASE_URL,
    temperature=TEMPERATURE,
    num_ctx=4096,   # Context window size
    streaming=False  
)

print(" LLM initialized")

# Test LLM with a simple prompt
test_response = llm.invoke("Say 'Hello, RAG pipeline!'")
print(f"\nTest response: {test_response.content}")


Initializing LLM: llama3:8b
Temperature: 0 (deterministic)
 LLM initialized

Test response: HELLO, RAG PIPELINE!


## Step 11: Create RAG Prompt Template

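The prompt template structures how we present the context and question to the LLM. It restricts the model to the supplied context and tells it how to abstain, which matters for SQuAD v2 because some questions are unanswerable.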


In [14]:
from langchain_core.prompts import ChatPromptTemplate

# Create prompt template
prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are a question-answering assistant. "
        "Answer the question using ONLY the provided context. "
        "If the answer cannot be found in the context, respond with 'I don't know' "
        "or 'The answer is not available in the provided context.'"
    ),
    (
        "human",
        "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
    )
])

print(" Prompt template created")

# Show what a formatted prompt looks like
example_context = "Paris is the capital of France. It is located in the north-central part of the country."
example_question = "What is the capital of France?"

formatted = prompt.format_messages(context=example_context, question=example_question)
print("\nExample formatted prompt:")
for msg in formatted:
    print(f"\n[{msg.type.upper()}]")
    print(msg.content)


 Prompt template created

Example formatted prompt:

[SYSTEM]
You are a question-answering assistant. Answer the question using ONLY the provided context. If the answer cannot be found in the context, respond with 'I don't know' or 'The answer is not available in the provided context.'

[HUMAN]
Context:
Paris is the capital of France. It is located in the north-central part of the country.

Question:
What is the capital of France?

Answer:


## Step 12: Build the Complete RAG Chain

Now we combine all components into a single chain that:
1. Takes a question
2. Retrieves relevant documents
3. Formats them as context
4. Sends to LLM with prompt
5. Returns the answer
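
The five steps above can be sketched without LangChain, using stand-in callables for the retriever and the LLM. Everything here (`run_rag`, `format_contexts`, `fake_retrieve`, `fake_generate`) is hypothetical and only illustrates the data flow; the actual chain is built in the next cell.

```python
from typing import Callable, Dict, List


def format_contexts(contexts: List[str]) -> str:
    """Join retrieved context strings with a visible separator."""
    if not contexts:
        return "No relevant context found."
    return "\n\n---\n\n".join(contexts)


def run_rag(question: str,
            retrieve: Callable[[str], List[str]],
            generate: Callable[[str], str]) -> Dict[str, str]:
    """Retrieve, format, build the prompt, and generate an answer for one question."""
    contexts = retrieve(question)            # step 2: retrieve relevant documents
    context = format_contexts(contexts)      # step 3: format them as context
    prompt_text = (                          # step 4: build the prompt for the LLM
        f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
    )
    return {"answer": generate(prompt_text)}  # step 5: return the answer


# Stand-ins for the real retriever and LLM (hypothetical, for illustration only)
def fake_retrieve(question: str) -> List[str]:
    return ["Paris is the capital of France."]


def fake_generate(prompt_text: str) -> str:
    return "Paris" if "Paris" in prompt_text else "I don't know"


demo_result = run_rag("What is the capital of France?", fake_retrieve, fake_generate)
print(demo_result["answer"])  # Paris
```

Keeping retrieval and generation as separate callables makes it easy to test each stage on its own, which is also how Step 13 inspects the retriever before invoking the full chain.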


In [15]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda
from operator import itemgetter

def format_docs(docs):
    """
    Format retrieved documents into a single context string.
    Each document is a complete context (no chunking).
    """
    if not docs:
        return "No relevant context found."
    
    # Join contexts with clear separators
    contexts = [doc.page_content for doc in docs]
    return "\n\n---\n\n".join(contexts)

print("Building RAG chain...")

# The RAG chain pipeline
rag_chain = (
    # Step 1: Retrieve contexts for the question
    {
        "docs": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    # Step 2: Format contexts into a single string
    | RunnableLambda(lambda x: {
        "question": x["question"],
        "context": format_docs(x["docs"]),
    })
    # Step 3: Generate answer using LLM
    | RunnableLambda(lambda x: {
        "answer": (
            prompt
            | llm
            | StrOutputParser()
        ).invoke({
            "question": x["question"],
            "context": x["context"],
        })
    })
)

print(" RAG chain built")
print("\nChain flow:")
print("  Question → Retrieve Contexts → Format Context → LLM → Answer")

Building RAG chain...
 RAG chain built

Chain flow:
  Question → Retrieve Contexts → Format Context → LLM → Answer


## Step 13: Test the RAG Pipeline

Let's test the complete pipeline with a real question from the dataset.


In [16]:
import time

print("Testing RAG pipeline with a sample question...")

# Get a sample question from the dataset
test_example = examples[10]
test_question = test_example['question']
test_question_id = test_example['id']
ground_truth_answer = test_example['answers']['text'][0] if test_example['answers']['text'] else "N/A"

# Map question ID to context ID
golden_context_id = QUESTION_TO_CONTEXT_ID_MAP.get(test_question_id)

print(f"\nQuestion: {test_question}")
print(f"Question ID: {test_question_id}")
print(f"Golden Context ID: {golden_context_id}")
print(f"Ground Truth Answer: {ground_truth_answer}")

# Step 1: Check Retriever
print("\n" + "-"*60)
print("Step 1: Check Retriever")
print("-"*60)
retrieved = retriever.invoke(test_question)
print(f"Retrieved {len(retrieved)} contexts:")

golden_context_found = False
golden_context_rank = None

for i, doc in enumerate(retrieved, 1):
    doc_id = doc.metadata.get('id')
    is_golden_context = (str(doc_id).strip() == str(golden_context_id).strip())
    
    if is_golden_context:
        golden_context_found = True
        golden_context_rank = i
    
    marker = "  GOLDEN CONTEXT" if is_golden_context else ""
    print(f"\n  [Context {i}]{marker}")
    print(f"  Context ID: {doc_id}")
    print(f"  Content preview: {doc.page_content[:200]}...")

if golden_context_found:
    print(f"\n Golden context found at rank {golden_context_rank}!")
else:
    print(f"\n Golden context NOT found in top {K_RETRIEVED} retrieved contexts")

# Step 2: Generate Answer with LLM
print("\n" + "-"*60)
print("Step 2: Generate Answer with LLM")
print("-"*60)
start_time = time.time()

result = rag_chain.invoke({"question": test_question})

elapsed = time.time() - start_time

print(f"\n Pipeline completed in {elapsed:.2f} seconds")
print(f"\nGenerated Answer:")
print(f"{result['answer']}")

# Summary
print("\n" + "="*60)
print("SUMMARY")
print("="*60)
if golden_context_found:
    print(f"Golden context found: Yes (rank {golden_context_rank})")
else:
    print(f"Golden context found: No")
print(f"Answer generated: {len(result['answer'])} characters")
print(f"Processing time: {elapsed:.2f} seconds")
print("="*60)

Testing RAG pipeline with a sample question...

Question: Who ruled the duchy of Normandy
Question ID: 56dddf4066d3e219004dad60
Golden Context ID: context_2
Ground Truth Answer: Richard I

------------------------------------------------------------
Step 1: Check Retriever
------------------------------------------------------------
Retrieved 3 contexts:

  [Context 1]
  Context ID: context_17
  Content preview: In 1066, Duke William II of Normandy conquered England killing King Harold II at the Battle of Hastings. The invading Normans and their descendants replaced the Anglo-Saxons as the ruling class of Eng...

  [Context 2]
  Context ID: context_4
  Content preview: In the course of the 10th century, the initially destructive incursions of Norse war bands into the rivers of France evolved into more permanent encampments that included local women and personal prop...

  [Context 3]
  Context ID: context_1
  Content preview: The Normans (Norman: Nourmands; French: Normands; Latin: Nor

## Step 14: Calculate Answer Quality Metrics (Recall and Exact Match)

In [17]:
print("Calculating answer quality metrics...")

from metrics_extended import recall_ordered, exact_match

# Calculate metrics using the result from Step 13
recall = recall_ordered(result['answer'], ground_truth_answer)
em = exact_match(result['answer'], ground_truth_answer)

print(f"\n{'='*60}")
print("ANSWER QUALITY METRICS")
print(f"{'='*60}")
print(f"Recall:     {recall:.4f} ({'Yes' if recall == 1.0 else 'No'})")
print(f"Exact Match: {em:.4f} ({'Yes' if em == 1.0 else 'No'})")

# Show word matching details
from metrics_extended import normalize_answer
normalized_gen = normalize_answer(result['answer'])
normalized_gt = normalize_answer(ground_truth_answer)
gt_words = normalized_gt.split()
gen_words = normalized_gen.split()

print(f"\nWord Matching Details:")
print(f"  Ground truth words (in order): {gt_words}")
print(f"  Generated words: {gen_words}")

# Find if words appear in order
gt_idx = 0
matched_positions = []
for i, gen_word in enumerate(gen_words):
    if gt_idx < len(gt_words) and gen_word == gt_words[gt_idx]:
        matched_positions.append((i, gen_word))
        gt_idx += 1

if len(matched_positions) == len(gt_words):
    print(f"  Matched words in order: {[w for _, w in matched_positions]}")
    print(f"  Positions in generated answer: {[i for i, _ in matched_positions]}")
    print(f"  → Recall = 1.0 (all {len(gt_words)} word(s) found in order)")
else:
    print(f"  Matched words: {[w for _, w in matched_positions]}")
    print(f"  Expected {len(gt_words)} words, found {len(matched_positions)} in order")
    print(f"  → Recall = 0.0 (words not found in same order)")

print(f"{'='*60}")

Calculating answer quality metrics...

ANSWER QUALITY METRICS
Recall:     0.0000 (No)
Exact Match: 0.0000 (No)

Word Matching Details:
  Ground truth words (in order): ['richard', 'i']
  Generated words: ['according', 'to', 'provided', 'context', 'rollo', 'was', 'one', 'who', 'established', 'duchy', 'of', 'normandy', 'by', 'treaty', 'with', 'king', 'charles', 'iii', 'of', 'west', 'francia', 'in', '911']
  Matched words: []
  Expected 2 words, found 0 in order
  → Recall = 0.0 (words not found in same order)
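
The `metrics_extended` module is not shown in this notebook. The sketch below is one plausible reading of its helpers, inferred from the outputs above: it assumes SQuAD-style normalization (lowercasing, stripping punctuation and English articles) and treats recall as 1.0 only when every ground truth word appears in the prediction in the same order. The `_sketch` names are hypothetical and chosen to avoid shadowing the real imports.

```python
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer_sketch(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_sketch(prediction: str, ground_truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer_sketch(prediction) == normalize_answer_sketch(ground_truth))


def recall_ordered_sketch(prediction: str, ground_truth: str) -> float:
    """1.0 if all ground truth words occur in the prediction in order, else 0.0."""
    gt_words = normalize_answer_sketch(ground_truth).split()
    if not gt_words:
        return 0.0
    gt_idx = 0
    for word in normalize_answer_sketch(prediction).split():
        if word == gt_words[gt_idx]:
            gt_idx += 1
            if gt_idx == len(gt_words):
                return 1.0
    return 0.0


print(exact_match_sketch("France.", "France"))  # 1.0
print(recall_ordered_sketch(
    "The Norse originated from Denmark, Iceland, Norway.",
    "Denmark, Iceland and Norway",
))  # 0.0, because the word "and" is missing from the prediction
```

Under this reading, a correct but differently phrased answer can score 0.0 on both metrics, so the numbers are best treated as a strict lower bound on answer quality.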


## Step 15: Evaluate RAG Pipeline on Multiple Examples

In [18]:
from metrics_extended import recall_ordered, exact_match

# Select a subset of examples for evaluation
num_samples = 10  # Use 10 examples for quick evaluation
eval_examples = examples[:num_samples]

print(f"Evaluating RAG pipeline on {len(eval_examples)} examples...")
print("This may take a few minutes...\n")

predictions = []
ground_truths_list = []
is_impossible_list = []
processing_times = []

# Process each example
for i, example in enumerate(tqdm(eval_examples, desc="Evaluating")):
    start_time = time.time()
    
    try:
        # Get prediction from RAG pipeline
        result = rag_chain.invoke({"question": example['question']})
        prediction = result["answer"]
        predictions.append(prediction)
    except Exception as e:
        print(f"\nError on example {i}: {e}")
        predictions.append("")
    
    # Collect ground truth
    ground_truths_list.append(example['answers']['text'])
    is_impossible_list.append(example['is_impossible'])
    
    processing_times.append(time.time() - start_time)

print(f"\n Evaluation complete!")
print(f"Average time per question: {np.mean(processing_times):.2f} seconds")

Evaluating RAG pipeline on 10 examples...
This may take a few minutes...



Evaluating: 100%|██████████| 10/10 [00:03<00:00,  2.57it/s]


 Evaluation complete!
Average time per question: 0.39 seconds





## Step 16: Calculate Aggregate Metrics

In [19]:
print("Calculating aggregate metrics...")

from metrics_extended import recall_ordered, exact_match

# Calculate metrics for each prediction
recall_scores = []
em_scores = []

for i, (prediction, ground_truths) in enumerate(zip(predictions, ground_truths_list)):
    # Handle multiple ground truth answers - take the best score
    best_recall = 0.0
    best_em = 0.0
    
    for gt in ground_truths:
        if gt.strip():  # Skip empty answers
            recall_score = recall_ordered(prediction, gt)
            em_score = exact_match(prediction, gt)
            best_recall = max(best_recall, recall_score)
            best_em = max(best_em, em_score)
    
    recall_scores.append(best_recall)
    em_scores.append(best_em)

# Calculate aggregate metrics
mean_recall = np.mean(recall_scores)
mean_em = np.mean(em_scores)
std_recall = np.std(recall_scores)
std_em = np.std(em_scores)

# Display results
print(f"\n{'='*60}")
print("AGGREGATE METRICS")
print(f"{'='*60}")
print(f"Total Examples:        {len(eval_examples)}")
print(f"\nRecall:")
print(f"  Mean:                {mean_recall:.4f} ± {std_recall:.4f}")
print(f"  ({mean_recall*100:.2f}% of answers contain all ground truth words in order)")
print(f"\nExact Match:")
print(f"  Mean:                {mean_em:.4f} ± {std_em:.4f}")
print(f"  ({mean_em*100:.2f}% of answers match exactly)")
print(f"{'='*60}")

# Store for later use
metrics_results = {
    'recall': mean_recall,
    'exact_match': mean_em,
    'std_recall': std_recall,
    'std_em': std_em,
    'individual_recall': recall_scores,
    'individual_em': em_scores
}

Calculating aggregate metrics...

AGGREGATE METRICS
Total Examples:        10

Recall:
  Mean:                0.4000 ± 0.4899
  (40.00% of answers contain all ground truth words in order)

Exact Match:
  Mean:                0.2000 ± 0.4000
  (20.00% of answers match exactly)
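
In the aggregation loop above, a question with no reference answers keeps a best score of 0.0 whatever the model says, and `is_impossible_list` is collected in Step 15 but never used. The sketch below shows one possible alternative that credits an abstention on unanswerable questions. It is an assumption, not the notebook's method: `score_example`, `is_abstention`, the phrase list, and the toy metric are all hypothetical.

```python
from typing import Callable, List

# Phrases taken from the system prompt's abstention instruction (matching rule is an assumption)
ABSTAIN_PHRASES = ("i don't know", "not available in the provided context")


def is_abstention(prediction: str) -> bool:
    """True if the prediction contains one of the abstention phrases."""
    lowered = prediction.lower()
    return any(phrase in lowered for phrase in ABSTAIN_PHRASES)


def score_example(prediction: str,
                  ground_truths: List[str],
                  is_impossible: bool,
                  metric: Callable[[str, str], float]) -> float:
    """Best metric over the references, or abstention credit for unanswerable questions."""
    if is_impossible:
        return 1.0 if is_abstention(prediction) else 0.0
    scores = [metric(prediction, gt) for gt in ground_truths if gt.strip()]
    return max(scores, default=0.0)


def toy_exact(prediction: str, ground_truth: str) -> float:
    """Toy stand-in metric: case-insensitive equality, ignoring a trailing period."""
    return float(prediction.strip().lower().rstrip(".") == ground_truth.strip().lower())


print(score_example("France.", ["France", "France"], False, toy_exact))  # 1.0
print(score_example("I don't know.", [], True, toy_exact))               # 1.0
print(score_example("Paris", [], True, toy_exact))                       # 0.0
```

Whether abstentions should count as correct depends on what the evaluation is meant to measure, so results computed this way are not comparable with the aggregate numbers reported above.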


## Step 17: Examine Individual Results

In [20]:
print("Examining individual results...\n")

# Show first 5 examples with their scores
for i in range(min(5, len(eval_examples))):
    example = eval_examples[i]
    pred = predictions[i]
    gt = ground_truths_list[i][0] if ground_truths_list[i] else "(unanswerable)"
    recall_score = recall_scores[i]
    em_score = em_scores[i]
    
    print(f"{'='*60}")
    print(f"Example {i+1}")
    print(f"{'='*60}")
    print(f"Question: {example['question']}")
    print(f"\nGround Truth: {gt}")
    print(f"Prediction:   {pred}")
    print(f"\nMetrics:")
    print(f"  Recall:     {recall_score:.4f} ({'Yes' if recall_score == 1.0 else 'No'})")
    print(f"  Exact Match: {em_score:.4f} ({'Yes' if em_score == 1.0 else 'No'})")
    print()

Examining individual results...

Example 1
Question: In what country is Normandy located?

Ground Truth: France
Prediction:   France.

Metrics:
  Recall:     1.0000 (Yes)
  Exact Match: 1.0000 (Yes)

Example 2
Question: When were the Normans in Normandy?

Ground Truth: 10th and 11th centuries
Prediction:   According to the provided context, the Normans gave their name to Normandy, a region in France, in the 10th and 11th centuries.

Metrics:
  Recall:     1.0000 (Yes)
  Exact Match: 0.0000 (No)

Example 3
Question: From which countries did the Norse originate?

Ground Truth: Denmark, Iceland and Norway
Prediction:   The Norse originated from Denmark, Iceland, Norway.

Metrics:
  Recall:     0.0000 (No)
  Exact Match: 0.0000 (No)

Example 4
Question: Who was the Norse leader?

Ground Truth: Rollo
Prediction:   Rollo.

Metrics:
  Recall:     1.0000 (Yes)
  Exact Match: 1.0000 (Yes)

Example 5
Question: What century did the Normans first gain their separate identity?

Ground Truth: 10th c
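
Example 3 shows how strict the ordered-word metric is. The prediction lists all three countries, but the ground truth also contains the word "and", which the prediction omits, so recall is 0. The sketch below is one plausible reading of the metric and not necessarily the `recall_ordered` defined earlier in this notebook. It assumes lowercasing, punctuation stripping, and an in-order (not necessarily contiguous) word match; the name `ordered_word_recall_sketch` is hypothetical.

```python
import string


def _tokens(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def ordered_word_recall_sketch(prediction, ground_truth):
    """Return 1.0 if every ground-truth word appears in the prediction in order.

    Assumption: words must appear in the same order but need not be adjacent.
    """
    gt_tokens = _tokens(ground_truth)
    if not gt_tokens:
        return 0.0
    pred_iter = iter(_tokens(prediction))
    # `tok in pred_iter` advances the iterator, so later words must come later.
    return 1.0 if all(tok in pred_iter for tok in gt_tokens) else 0.0


print(ordered_word_recall_sketch("France.", "France"))
print(ordered_word_recall_sketch(
    "The Norse originated from Denmark, Iceland, Norway.",
    "Denmark, Iceland and Norway",
))
```

Under these assumptions the first call scores 1.0 and the second 0.0, matching Examples 1 and 3. A token-level F1 or a recall that ignores stop words would give partial credit in the second case.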

## Step 18: Save Results

In [21]:
# Prepare results dictionary
results = {
    'timestamp': datetime.now().isoformat(),
    'evaluation_method': 'ID-based Recall@K with ordered word recall',
    'retriever_metrics': {
        'recall_at_k': retriever_metrics['recall_at_k'],
        'correct_retrievals': retriever_metrics['correct_retrievals'],
        'total_answerable': retriever_metrics['total_answerable']
    },
    'answer_quality_metrics': {
        'recall': metrics_results['recall'],
        'exact_match': metrics_results['exact_match'],
        'std_recall': metrics_results['std_recall'],
        'std_em': metrics_results['std_em']
    },
    'dataset_stats': {
        'total_samples': len(eval_examples),
        'answerable_samples': sum(1 for x in is_impossible_list if not x),
        'unanswerable_samples': sum(is_impossible_list)
    },
    'config': {
        'embedding_model': EMBEDDING_MODEL,
        'llm_model': LLM_MODEL,
        'device': DEVICE,
        'k_retrieved': K_RETRIEVED,
        'chunking': False,
        'description': 'Each context = one chunk, evaluation by ID matching'
    },
    'sample_predictions': [
        {
            'question': eval_examples[i]['question'],
            'prediction': predictions[i],
            'ground_truth': ground_truths_list[i][0] if ground_truths_list[i] else None,
            'recall': recall_scores[i],
            'exact_match': em_scores[i]
        }
        for i in range(min(10, len(eval_examples)))
    ]
}

# Save to file
output_file = f"rag_evaluation_results_no_chunking_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(output_file, 'w') as f:
    json.dump(results, f, indent=2)

print(f" Results saved to {output_file}")
print("\nSaved metrics include:")
print("  - Answer quality (Recall, Exact Match)")
print("  - Individual predictions and scores")

 Results saved to rag_evaluation_results_no_chunking_20251225_214103.json

Saved metrics include:
  - Answer quality (Recall, Exact Match)
  - Individual predictions and scores


## Step 19: Summary


In [22]:
print("="*60)
print("NOTEBOOK SUMMARY")
print("="*60)
print("\nWhat we accomplished:")
print("1.  Set up RAG pipeline without chunking (each context = one chunk)")
print("2.  Created unique context IDs and mapped questions to contexts")
print("3.  Evaluated retriever accuracy using ID-based matching")
print("4.  Built complete RAG chain (retrieval + generation)")
print("5.  Tested pipeline on single example")
print("6.  Evaluated on multiple examples")
print("7.  Calculated recall (ordered words) and exact match metrics")
print("\nKey Results:")
print(f"  Retriever Recall@{K_RETRIEVED}: {retriever_metrics['recall_at_k']:.2%}")
print(f"  Answer Recall: {metrics_results['recall']:.2%}")
print(f"  Exact Match: {metrics_results['exact_match']:.2%}")
print("="*60)

NOTEBOOK SUMMARY

What we accomplished:
1.  Set up RAG pipeline without chunking (each context = one chunk)
2.  Created unique context IDs and mapped questions to contexts
3.  Evaluated retriever accuracy using ID-based matching
4.  Built complete RAG chain (retrieval + generation)
5.  Tested pipeline on single example
6.  Evaluated on multiple examples
7.  Calculated recall (ordered words) and exact match metrics

Key Results:
  Retriever Recall@3: 71.43%
  Answer Recall: 40.00%
  Exact Match: 20.00%



### Next Steps:
- Experiment with different `K_RETRIEVED` values (more documents = more context but potentially more noise)
- Try different embedding models to improve retrieval accuracy
- Adjust the prompt template to improve generation quality
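- Evaluate on a larger sample: with only 10 examples, each question moves a mean score by 10 percentage points
- Compare per-question retrieval hits with answer scores to see whether errors come from retrieval or from generation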
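
Two of these steps can be sketched in a few lines: sweeping K for the ID-based Recall@K, and cross-tabulating retrieval hits against answer scores. The sketch assumes you have kept, for each answerable question, the ranked list of retrieved context IDs and the golden context ID. The names `ranked_ids`, `golden_ids`, `recall_at_k`, and `cross_tab`, and all sample data, are hypothetical.

```python
from collections import Counter


def recall_at_k(ranked_ids, golden_ids, k):
    """Fraction of questions whose golden context ID is in the top-k retrieved IDs."""
    if not golden_ids:
        return 0.0
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, golden_ids))
    return hits / len(golden_ids)


def cross_tab(ranked_ids, golden_ids, answer_scores, k):
    """Count questions by (retrieval hit, answer fully correct)."""
    counts = Counter()
    for ranked, gold, score in zip(ranked_ids, golden_ids, answer_scores):
        counts[(gold in ranked[:k], score == 1.0)] += 1
    return counts


# Illustrative data only: ranked retrieved IDs per question, best match first.
ranked_ids = [
    ["ctx_0", "ctx_4", "ctx_2"],
    ["ctx_3", "ctx_1", "ctx_0"],
    ["ctx_2", "ctx_0", "ctx_5"],
    ["ctx_5", "ctx_4", "ctx_1"],
]
golden_ids = ["ctx_0", "ctx_1", "ctx_5", "ctx_3"]
answer_scores = [1.0, 0.0, 1.0, 0.0]  # e.g. per-question answer recall

for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(ranked_ids, golden_ids, k):.2%}")

table = cross_tab(ranked_ids, golden_ids, answer_scores, k=3)
for (hit, correct), n in sorted(table.items()):
    print(f"retrieval_hit={hit!s:<5} answer_correct={correct!s:<5} count={n}")
```

A question that is a retrieval hit but scores 0 points at the generation step or at a strict metric, while a retrieval miss points at the embedding model or the choice of K.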
