# LLM Zoomcamp - Evaluation Homework

This notebook demonstrates various evaluation techniques for search systems and RAG (Retrieval-Augmented Generation) pipelines. We'll explore:

1. **Text-based search evaluation** using MinSearch with different boosting parameters
2. **Vector search evaluation** using TF-IDF and SVD embeddings
3. **Advanced vector search** with Qdrant and modern embedding models
4. **Answer similarity evaluation** using cosine similarity and ROUGE metrics


In [None]:
# Install required packages for the evaluation homework
# - minsearch: A lightweight search engine for text and vector search
# - qdrant_client: Python client for Qdrant vector database
# The -q flag suppresses verbose output during installation
!pip install -U minsearch qdrant_client -q

Collecting minsearch
  Downloading minsearch-0.0.4-py3-none-any.whl (11 kB)
Collecting qdrant_client
  Downloading qdrant_client-1.15.1-py3-none-any.whl (337 kB)
     ------------------------------------- 337.3/337.3 kB 21.8 MB/s eta 0:00:00
Collecting pandas
  Using cached pandas-2.3.1-cp310-cp310-win_amd64.whl (11.3 MB)
Installing collected packages: pandas, minsearch, qdrant_client
  Attempting uninstall: qdrant_client
    Found existing installation: qdrant-client 1.14.3
    Uninstalling qdrant-client-1.14.3:
      Successfully uninstalled qdrant-client-1.14.3
Successfully installed minsearch-0.0.4 pandas-2.3.1 qdrant_client-1.15.1





## Evaluation data

For this homework, we will use the same dataset we generated
in the videos.

Let's load it:

In [None]:
# Import essential libraries for data handling and HTTP requests
import requests
import pandas as pd

# Define the base URL for accessing evaluation datasets from the LLM Zoomcamp repository
url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'

# Load the documents dataset
# This contains FAQ documents with unique IDs that we'll use for retrieval evaluation
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

# Load the ground truth dataset
# This contains question-answer pairs with known correct document associations
# Each record has a question, the course it belongs to, and the document ID that should be retrieved
ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)

# Convert DataFrame to list of dictionaries for easier processing
ground_truth = df_ground_truth.to_dict(orient='records')

In [None]:
# Examine the structure of the first 3 documents
# Each document contains: id, question, text (answer), section, and course
documents[:3]

[{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'course': 'data-engineering-zoomcamp',
  'id': 'c02e79ef'},
 {'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
  'section': 'General course-related questions',
  'question': 'Course - What are the prerequisites for this course?',
  'course': 'data-engineering-zoomcamp',
  'id': '1f6520ca'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware

In [None]:
# Examine the structure of the first 3 ground truth records
# Each record contains: question, course, and document (the correct document ID for this question)
ground_truth[:3]

[{'question': 'When does the course begin?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'How can I get the course schedule?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'What is the link for course registration?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'}]

Here, `documents` contains the documents from the FAQ database
with unique IDs, and `ground_truth` contains generated questions,
each paired with the ID of the document that answers it.
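Before evaluating, it can help to confirm that every ground truth record points at a document that actually exists. The sketch below uses small toy lists in place of the real `documents` and `ground_truth`; the helper names `find_unknown_documents` and `questions_per_document` are hypothetical, and the third toy question is made up for illustration.

```python
from collections import Counter

# Toy stand-ins for `documents` and `ground_truth` (same keys as the real data).
toy_documents = [
    {"id": "c02e79ef", "course": "data-engineering-zoomcamp"},
    {"id": "1f6520ca", "course": "data-engineering-zoomcamp"},
]
toy_ground_truth = [
    {"question": "When does the course begin?",
     "course": "data-engineering-zoomcamp", "document": "c02e79ef"},
    {"question": "How can I get the course schedule?",
     "course": "data-engineering-zoomcamp", "document": "c02e79ef"},
    # Hypothetical question, not taken from the real dataset
    {"question": "What should I know before enrolling?",
     "course": "data-engineering-zoomcamp", "document": "1f6520ca"},
]

def find_unknown_documents(documents, ground_truth):
    """Return ground truth records whose 'document' ID is missing from documents."""
    known_ids = {doc["id"] for doc in documents}
    return [rec for rec in ground_truth if rec["document"] not in known_ids]

def questions_per_document(ground_truth):
    """Count how many generated questions point at each document ID."""
    return dict(Counter(rec["document"] for rec in ground_truth))

unknown = find_unknown_documents(toy_documents, toy_ground_truth)
counts = questions_per_document(toy_ground_truth)
print(f"Records with unknown document IDs: {len(unknown)}")
print(f"Questions per document: {counts}")
```

The same two helpers can be called on the real `documents` and `ground_truth` lists, since they only rely on the `id` and `document` keys.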

Also, we will need the code for evaluating retrieval:

In [None]:
# Import tqdm for progress bars during evaluation
from tqdm.auto import tqdm

def hit_rate(relevance_total):
    """
    Calculate Hit Rate: the proportion of queries for which at least one relevant document 
    was retrieved in the top-k results.
    
    Args:
        relevance_total: List of lists, where each inner list contains boolean values
                        indicating whether each retrieved document is relevant
    
    Returns:
        float: Hit rate between 0 and 1
    """
    cnt = 0
    
    # Count queries where at least one relevant document was found
    for line in relevance_total:
        if True in line:  # At least one relevant document found
            cnt += 1
    
    return cnt / len(relevance_total)

def mrr(relevance_total):
    """
    Calculate Mean Reciprocal Rank (MRR): measures the average of reciprocal ranks
    of the first relevant document for each query.
    
    Args:
        relevance_total: List of lists, where each inner list contains boolean values
                        indicating whether each retrieved document is relevant
    
    Returns:
        float: MRR score between 0 and 1
    """
    total_score = 0.0
    
    # For each query, find the rank of the first relevant document
    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank]:  # Found first relevant document
                # Add reciprocal of rank (1-indexed) to total score
                total_score += 1 / (rank + 1)
                break  # Only consider the first relevant document
    
    return total_score / len(relevance_total)

def evaluate(ground_truth, search_function):
    """
    Evaluate a search function using hit rate and MRR metrics.
    
    Args:
        ground_truth: List of dictionaries containing question, course, and document
        search_function: Function that takes a ground truth record and returns search results
    
    Returns:
        dict: Dictionary containing 'hit_rate' and 'mrr' scores
    """
    relevance_total = []
    
    # Evaluate each question in the ground truth
    for q in tqdm(ground_truth):
        doc_id = q['document']  # The correct document ID for this question
        results = search_function(q)  # Get search results
        
        # Check which results are relevant (match the correct document ID)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)
    
    # Calculate and return both metrics
    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
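To make the two metrics concrete, here is a small worked example on hand-written relevance lists. It is a self-contained sketch: the two functions are compact re-statements of the definitions above (named `toy_hit_rate` and `toy_mrr` to avoid clashing with the notebook's versions), and the relevance data is invented.

```python
def toy_hit_rate(relevance_total):
    """Share of queries with at least one relevant result."""
    return sum(any(line) for line in relevance_total) / len(relevance_total)

def toy_mrr(relevance_total):
    """Mean of 1/rank of the first relevant result (0 if none is found)."""
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)

# Four queries, five results each (invented data)
example = [
    [True, False, False, False, False],   # first hit at rank 1 -> 1
    [False, False, True, False, False],   # first hit at rank 3 -> 1/3
    [False, False, False, False, False],  # no hit              -> 0
    [False, True, False, False, False],   # first hit at rank 2 -> 1/2
]

print(f"Hit rate: {toy_hit_rate(example):.3f}")  # 3 of 4 queries have a hit
print(f"MRR: {toy_mrr(example):.3f}")            # (1 + 1/3 + 0 + 1/2) / 4
```

Hit rate only asks whether the right document appears anywhere in the top results, while MRR also rewards placing it near the top, so MRR is never higher than hit rate.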

## Q1. Minsearch text

Now let's evaluate our usual minsearch approach, indexing documents with:
```python
text_fields=["question", "section", "text"],
keyword_fields=["course", "id"]
```
but tweak the parameters for search. Let's use the following boosting params:

```python
boost = {'question': 1.5, 'section': 0.1}
```
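As a rough intuition for what boosting does, the sketch below treats a document's final score as a weighted sum of per-field match scores. This is a conceptual illustration only, not minsearch's actual implementation: the per-field scores are invented, `boosted_score` is a hypothetical helper, and it assumes that fields missing from `boost` get a weight of 1.0.

```python
def boosted_score(field_scores, boost):
    """Weighted sum of per-field scores; unlisted fields are assumed to weigh 1.0."""
    return sum(score * boost.get(field, 1.0) for field, score in field_scores.items())

boost = {"question": 1.5, "section": 0.1}

# Invented per-field match scores for two documents
doc_a = {"question": 0.2, "section": 0.9, "text": 0.1}  # matches mostly on section
doc_b = {"question": 0.6, "section": 0.0, "text": 0.2}  # matches mostly on question

plain_a, plain_b = boosted_score(doc_a, {}), boosted_score(doc_b, {})
boosted_a, boosted_b = boosted_score(doc_a, boost), boosted_score(doc_b, boost)

print(f"Without boosting: A={plain_a:.2f}, B={plain_b:.2f}")
print(f"With boosting:    A={boosted_a:.2f}, B={boosted_b:.2f}")
```

In this toy case document A ranks first without boosting, but B overtakes it once question matches are weighted up and section matches are weighted down.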

In [None]:
# Import the MinSearch library for text-based search
import minsearch

# Create a MinSearch index with specified field configurations
# text_fields: Fields that will be tokenized and used for full-text search
# keyword_fields: Fields that will be used for exact matching (filtering)
index = minsearch.Index(
    text_fields=["question", "section", "text"],  # Searchable text fields
    keyword_fields=["course", "id"]  # Fields for exact filtering
)

# Fit the index with our documents dataset
# This builds the internal search structures (inverted index, etc.)
index.fit(documents)

<minsearch.minsearch.Index at 0x2bcea7a4130>

In [None]:
# Import tqdm again (if not already imported in current kernel session)
from tqdm.auto import tqdm

def minsearch_search(query, course):
    """
    Perform text-based search using MinSearch with custom boosting parameters.
    
    Args:
        query (str): The search query
        course (str): The course to filter by
    
    Returns:
        list: List of search results (documents)
    """
    # Define boosting parameters to weight different fields
    # Higher values = more importance in ranking
    boost = {
        'question': 1.5,  # Give more weight to matches in the question field
        'section': 0.1    # Give less weight to matches in the section field
    }
    
    # Perform the search with boosting and filtering
    results = index.search(
        query=query,
        filter_dict={'course': course},  # Only return documents from the specified course
        boost_dict=boost,  # Apply field boosting
        num_results=5  # Return top 5 results
    )
    
    return results


What's the hit rate for this approach?

In [None]:
# Evaluate the MinSearch approach with boosting parameters
# Using lambda function to create a search function that matches our evaluation interface
# This will calculate both hit rate and MRR for the text-based search approach
minsearch_results = evaluate(ground_truth, lambda q: minsearch_search(q['question'], q['course']))
print("MinSearch Evaluation Results:")
print(f"Hit Rate: {minsearch_results['hit_rate']:.3f}")
print(f"MRR: {minsearch_results['mrr']:.3f}")

minsearch_results

  0%|          | 0/4627 [00:00<?, ?it/s]

{'hit_rate': 0.848714069591528, 'mrr': 0.7288235717887772}

## Embeddings 

The latest version of minsearch also supports vector search. 
We will use it:

In [None]:
# Import VectorSearch from minsearch for vector-based similarity search
from minsearch import VectorSearch

We will also use TF-IDF and Singular Value Decomposition to 
create embeddings from texts. You can refer to our
["Create Your Own Search Engine" workshop](https://github.com/alexeygrigorev/build-your-own-search-engine)
if you want to know more about it.
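For intuition about the TF-IDF step and the `min_df` parameter used below, here is a pure-Python sketch. It is a simplification under stated assumptions: tokens are split on whitespace (scikit-learn's default tokenizer differs), the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, vectors are L2-normalized, and the SVD step is left out. The helper name `tfidf_vectors` and the toy texts are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(texts, min_df=1):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for a list of texts."""
    tokenized = [text.lower().split() for text in texts]

    # Document frequency: in how many texts does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # min_df drops terms that appear in too few documents
    vocab = sorted(term for term, count in df.items() if count >= min_df)

    n_docs = len(texts)
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = [tf[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm if norm else 0.0 for x in vec])
    return vocab, vectors

toy_texts = [
    "how do i join the course",
    "when does the course start",
    "can i still join",
]

vocab, vectors = tfidf_vectors(toy_texts, min_df=2)
print("Vocabulary after min_df=2:", vocab)
for text, vec in zip(toy_texts, vectors):
    print(f"{text!r} -> {[round(x, 3) for x in vec]}")
```

With `min_df=2`, only terms that occur in at least two of the three toy texts survive, which is the same filtering idea as `min_df=3` on the full FAQ corpus.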

In [None]:
# Import scikit-learn components for creating embeddings
from sklearn.feature_extraction.text import TfidfVectorizer  # Convert text to TF-IDF vectors
from sklearn.decomposition import TruncatedSVD  # Dimensionality reduction using SVD
from sklearn.pipeline import make_pipeline  # Create ML pipelines

Let's create embeddings for the "question" field:

In [None]:
# Extract text content from documents for embedding creation
# We'll use only the 'question' field for this first vector search experiment
texts = []

for doc in documents:
    text_content = doc['question']  # Extract the question text
    texts.append(text_content)

print(f"Extracted {len(texts)} question texts for embedding creation")

# Create a machine learning pipeline for text embeddings
# Pipeline steps:
# 1. TfidfVectorizer: Convert text to TF-IDF vectors (min_df=3 filters rare words)
# 2. TruncatedSVD: Reduce dimensionality to 128 components for efficiency
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),  # Ignore terms that appear in fewer than 3 documents
    TruncatedSVD(n_components=128, random_state=1)  # Reduce to 128 dimensions
)

# Fit the pipeline and transform texts to embeddings
print("Creating embeddings using TF-IDF + SVD pipeline...")
X = pipeline.fit_transform(texts)
print(f"Created embeddings with shape: {X.shape}")  # Should be (num_documents, 128)

## Q2. Vector search for question

Now let's index these embeddings with minsearch:

In [None]:
# Create a vector search index using the embeddings
# keyword_fields specifies which fields can be used for filtering
vindex = VectorSearch(keyword_fields={'course'})

# Fit the vector index with our embeddings and document metadata
# X: The embedding vectors we created
# documents: The original documents with metadata for filtering
print("Creating vector search index...")
vindex.fit(X, documents)
print("Vector index created successfully!")

<minsearch.vector.VectorSearch at 0x2bc8b5ed360>

Evaluate this search method. What's the MRR for it?

In [None]:
def vector_search(query, course):
    """
    Perform vector-based search using embeddings created with TF-IDF + SVD.
    
    Args:
        query (str): The search query
        course (str): The course to filter by
    
    Returns:
        list: List of search results (documents) ranked by vector similarity
    """
    # Transform the query using the same pipeline used for documents
    # This ensures the query is in the same vector space as the indexed documents
    X_query = pipeline.transform([query])
    
    # Perform vector similarity search
    results = vindex.search(
        query_vector=X_query,  # The query vector
        filter_dict={'course': course},  # Filter by course
        num_results=5  # Return top 5 most similar documents
    )
    
    return results

# Evaluate the vector search approach using questions from ground truth
print("Evaluating vector search approach...")
vector_results = evaluate(ground_truth, lambda q: vector_search(q['question'], q['course']))
print("\nVector Search Evaluation Results:")
print(f"Hit Rate: {vector_results['hit_rate']:.3f}")
print(f"MRR: {vector_results['mrr']:.3f}")

vector_results

  0%|          | 0/4627 [00:00<?, ?it/s]

{'hit_rate': 0.48173762697212014, 'mrr': 0.3572833369353793}

## Q3. Vector search for question and answer

We only used question in Q2. We can use both question and answer:

```python
texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)
```

Using the same pipeline (`min_df=3` for the TF-IDF vectorizer and `n_components=128` for SVD), evaluate the performance of this
approach.

In [None]:
# Create a new text corpus combining question and answer text
# This approach should provide richer context for vector search
texts_combined = []

for doc in documents:
    # Combine question and answer text with a space separator
    combined_text = doc['question'] + ' ' + doc['text']
    texts_combined.append(combined_text)

print(f"Created {len(texts_combined)} combined question+answer texts")

# Create a new pipeline with the same parameters for consistency
pipeline_combined = make_pipeline(
    TfidfVectorizer(min_df=3),  # Same filtering as before
    TruncatedSVD(n_components=128, random_state=1)  # Same dimensionality
)

# Fit the pipeline on the combined texts and create embeddings
print("Creating embeddings for combined question+answer texts...")
X_combined = pipeline_combined.fit_transform(texts_combined)
print(f"Created combined embeddings with shape: {X_combined.shape}")

# Create a new vector index with the combined embeddings
vindex_combined = VectorSearch(keyword_fields={'course'})
vindex_combined.fit(X_combined, documents)
print("Combined vector index created successfully!")

<minsearch.vector.VectorSearch at 0x2bc8b9466e0>

What's the hit rate?

In [None]:
def vector_search_combined(query, course):
    """
    Perform vector search using embeddings created from combined question+answer text.
    
    Args:
        query (str): The search query
        course (str): The course to filter by
    
    Returns:
        list: List of search results ranked by vector similarity
    """
    # Transform query using the combined text pipeline
    X_query = pipeline_combined.transform([query])
    
    # Search using the combined embeddings index
    results = vindex_combined.search(
        query_vector=X_query,
        filter_dict={'course': course},
        num_results=5
    )
    
    return results

# Evaluate the combined vector search approach
print("Evaluating combined question+answer vector search...")
combined_results = evaluate(ground_truth, lambda q: vector_search_combined(q['question'], q['course']))
print("\nCombined Vector Search Evaluation Results:")
print(f"Hit Rate: {combined_results['hit_rate']:.3f}")
print(f"MRR: {combined_results['mrr']:.3f}")

combined_results

  0%|          | 0/4627 [00:00<?, ?it/s]

{'hit_rate': 0.8210503566025502, 'mrr': 0.6717347453353508}

## Q4. Qdrant

Now let's evaluate the following settings in Qdrant:

- `text = doc['question'] + ' ' + doc['text']`
- `model_handle = "jinaai/jina-embeddings-v2-small-en"`
- `limit = 5`

In [None]:
# Import FastEmbed for modern embedding models
from fastembed import TextEmbedding

# Test query to understand embedding dimensions
query = 'I just discovered the course. Can I join now?'

# Initialize the Jina embedding model
# This is a compact English embedding model suited to semantic similarity search
model_name = 'jinaai/jina-embeddings-v2-small-en'
print(f"Loading embedding model: {model_name}")

model = TextEmbedding(model_name=model_name)

# Create embeddings for the test query to determine dimensionality
embeddings_query = list(model.embed([query]))
embedding_dim = len(embeddings_query[0])
print(f"Embedding dimensionality: {embedding_dim}")

embedding_dim

512

In [None]:
# Import Qdrant client and models for vector database operations
from qdrant_client import QdrantClient, models

# Initialize Qdrant client
# Assumes Qdrant is running locally on port 6333
client = QdrantClient("http://localhost:6333")

# Configuration for the vector collection
EMBEDDING_DIMENSIONALITY = 512  # Jina embeddings have 512 dimensions
collection_name = "ml-docs"

print(f"Creating Qdrant collection '{collection_name}' with {EMBEDDING_DIMENSIONALITY} dimensions...")

# Create a new collection in Qdrant
# Delete existing collection if it exists to ensure clean state
try:
    client.delete_collection(collection_name)
    print(f"Deleted existing collection '{collection_name}'")
except Exception:  # Avoid a bare except, which would also swallow KeyboardInterrupt
    print(f"No existing collection '{collection_name}' found")

client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,  # Vector dimension
        distance=models.Distance.COSINE  # Use cosine similarity for distance metric
    )
)
print("Collection created successfully!")

# Prepare data points for insertion into Qdrant
print("Creating embeddings and preparing data points...")
points = []
id_ = 0

for doc in documents:
    # Combine question and answer text (same as in Q3)
    text = doc['question'] + ' ' + doc['text']
    
    # Create embedding for this document
    embedding_vector = list(model.embed([text]))[0]
    
    # Create a Qdrant point with embedding and metadata
    point = models.PointStruct(
        id=id_,  # Unique identifier for this point
        vector=embedding_vector,  # The embedding vector
        payload={  # Metadata associated with this vector
            "id": doc["id"],  # Original document ID
            "text": text,  # The combined text
            "section": doc['section'],  # Document section
            "course": doc['course']  # Course name for filtering
        }
    )
    points.append(point)
    id_ += 1
    
    # Print progress every 100 documents
    if id_ % 100 == 0:
        print(f"Processed {id_} documents...")

print(f"Created {len(points)} data points")

# Insert all points into the Qdrant collection
print("Inserting points into Qdrant...")
client.upsert(
    collection_name=collection_name,
    points=points
)
print("Data insertion completed!")

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

What's the MRR?

In [None]:
def evaluate_qdrant(ground_truth, search_function):
    """
    Evaluate a Qdrant search function using hit rate and MRR metrics.
    This version is adapted for Qdrant's response format.
    
    Args:
        ground_truth: List of dictionaries containing question, course, and document
        search_function: Function that takes a ground truth record and returns Qdrant results
    
    Returns:
        dict: Dictionary containing 'hit_rate' and 'mrr' scores
    """
    relevance_total = []
    
    for q in tqdm(ground_truth):
        doc_id = q['document']  # The correct document ID
        results = search_function(q)  # Get Qdrant search results
        
        # Extract document IDs from Qdrant response format
        # Qdrant returns results in results.points where each point has a payload
        relevance = [d.payload["id"] == doc_id for d in results.points]
        relevance_total.append(relevance)
    
    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

def qdrant_search(query, course):
    """
    Perform vector search using Qdrant with Jina embeddings.
    
    Args:
        query (str): The search query
        course (str): The course to filter by
    
    Returns:
        QueryResponse: Qdrant response object; the matches are in its `points` attribute
    """
    # Create embedding for the query using the same model as documents
    query_embedding = list(model.embed([query]))[0]
    
    # Perform vector search in Qdrant with course filtering
    results = client.query_points(
        collection_name=collection_name,
        query=query_embedding,  # The query vector
        limit=5,  # Return top 5 results
        with_payload=True,  # Include metadata in results
        query_filter=models.Filter(  # Filter by course
            must=[
                models.FieldCondition(
                    key="course",
                    match=models.MatchValue(value=course)
                )
            ]
        )
    )
    
    return results

# Evaluate Qdrant approach
print("Evaluating Qdrant vector search with Jina embeddings...")
qdrant_results = evaluate_qdrant(ground_truth, lambda q: qdrant_search(q['question'], q['course']))
print("\nQdrant Evaluation Results:")
print(f"Hit Rate: {qdrant_results['hit_rate']:.3f}")
print(f"MRR: {qdrant_results['mrr']:.3f}")

qdrant_results

  0%|          | 0/4627 [00:00<?, ?it/s]

{'hit_rate': 0.9299762264966501, 'mrr': 0.8517722066133576}

## Q5. Cosine similarity

In the second part of the module, we looked at evaluating
the entire RAG approach. In particular, we looked at 
comparing the answer generated by our system with the actual
answer from the FAQ.

One of the ways of doing it is using the cosine similarity. 
Let's see how to calculate it.

Cosine similarity is the dot product of two normalized vectors.
Geometrically, it's the cosine of the angle between
the vectors. Look up "cosine similarity geometry" if you want to
learn more about it.
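A quick numeric check makes the geometric reading tangible. The sketch below works on plain Python lists rather than NumPy arrays, and `cosine_from_lists` is a hypothetical helper written only for this illustration.

```python
import math

def cosine_from_lists(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1.0, 0.0]

same_direction = cosine_from_lists(u, [3.0, 0.0])   # angle 0, length ignored
diagonal = cosine_from_lists(u, [1.0, 1.0])         # angle of 45 degrees
orthogonal = cosine_from_lists(u, [0.0, 2.0])       # angle of 90 degrees
opposite = cosine_from_lists(u, [-2.0, 0.0])        # angle of 180 degrees

print(f"same direction: {same_direction:.4f}")
print(f"45 degrees:     {diagonal:.4f} (cos(pi/4) = {math.cos(math.pi / 4):.4f})")
print(f"orthogonal:     {orthogonal:.4f}")
print(f"opposite:       {opposite:.4f}")
```

Note that scaling a vector changes its length but not the result, which is why cosine similarity compares direction rather than magnitude.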

For us, it means that we need two things:

- First, we normalize each of the vectors
- Then, compute the dot product

So, we get this:


In [None]:
def cosine(u, v):
    """
    Calculate cosine similarity between two vectors.
    
    Cosine similarity measures the cosine of the angle between two vectors,
    providing a metric of orientation rather than magnitude. It ranges from -1 to 1,
    where 1 indicates identical orientation, 0 indicates orthogonality, and -1 indicates opposite orientation.
    
    Mathematical formula: cos(θ) = (u · v) / (||u|| × ||v||)
    
    Args:
        u, v: Input vectors (numpy arrays or similar)
    
    Returns:
        float: Cosine similarity score
    """
    # Normalize both vectors to unit length
    u_normalized = normalize(u)
    v_normalized = normalize(v)
    
    # Compute dot product of normalized vectors
    return u_normalized.dot(v_normalized)

For normalization, we first compute the vector norm (its length),
and then divide the vector by it:


In [None]:
# Import NumPy for mathematical operations
import numpy as np

def normalize(u):
    """
    Normalize a vector to unit length (L2 normalization).
    
    This converts the vector to have a magnitude (length) of 1 while preserving its direction.
    The L2 norm (Euclidean norm) is calculated as: ||u|| = √(u₁² + u₂² + ... + uₙ²)
    
    Args:
        u: Input vector (numpy array)
    
    Returns:
        numpy array: Normalized vector with unit length
    """
    # Calculate the L2 norm (Euclidean norm) of the vector
    norm = np.sqrt(u.dot(u))  # This is equivalent to np.linalg.norm(u)
    
    # Divide vector by its norm to get unit vector
    # Add small epsilon to avoid division by zero
    return u / (norm + 1e-10)

Or we can simplify it:

In [None]:
def cosine_simplified(u, v):
    """
    Calculate cosine similarity using a simplified, more efficient approach.
    
    This version combines normalization and dot product calculation into a single function,
    avoiding the intermediate step of creating normalized vectors.
    
    Mathematical formula: cos(θ) = (u · v) / (||u|| × ||v||)
    
    Args:
        u, v: Input vectors (numpy arrays)
    
    Returns:
        float: Cosine similarity score between -1 and 1
    """
    # Calculate L2 norms of both vectors
    u_norm = np.sqrt(u.dot(u))  # ||u||
    v_norm = np.sqrt(v.dot(v))  # ||v||
    
    # Calculate cosine similarity directly
    # Add small epsilon to denominator to avoid division by zero
    return u.dot(v) / (u_norm * v_norm + 1e-10)

Now let's use this function to compute the
A->Q->A cosine similarity.

We will use the results from [our gpt-4o-mini evaluations](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-evaluation/rag_evaluation/data/results-gpt4o-mini.csv):

In [None]:
# Load pre-computed RAG evaluation results
# This dataset contains LLM-generated answers compared to original answers
results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)

print(f"Loaded {len(df_results)} answer pairs for evaluation")
print("\nDataset columns:", df_results.columns.tolist())
print(f"Sample data shape: {df_results.shape}")

# Display first few rows to understand the data structure
df_results.head()

| Unnamed: 0 | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


To create the embeddings, we will use the same simple approach
as in the [Embeddings](#embeddings) section:


In [None]:
# Create a new pipeline for answer similarity evaluation
# We'll use the same TF-IDF + SVD approach as before for consistency
pipeline_similarity = make_pipeline(
    TfidfVectorizer(min_df=3),  # Filter rare terms
    TruncatedSVD(n_components=128, random_state=1)  # Reduce dimensionality
)

print("Created pipeline for answer similarity evaluation")

Let's fit the vectorizer on all the text data we have:

In [None]:
# Fit the pipeline on all available text data for comprehensive vocabulary
# Combining LLM answers, original answers, and questions gives us the full text space
all_texts = df_results.answer_llm + ' ' + df_results.answer_orig + ' ' + df_results.question
print(f"Fitting pipeline on {len(all_texts)} combined text samples...")

# Fit the vectorizer on the combined corpus
pipeline_similarity.fit(all_texts)
print("Pipeline fitted successfully!")

# Check the vocabulary size and embedding dimensions
vectorizer = pipeline_similarity.named_steps['tfidfvectorizer']
svd = pipeline_similarity.named_steps['truncatedsvd']
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Embedding dimensions: {svd.n_components}")

all_texts


Now use the `transform` method of the pipeline to create the embeddings and calculate the cosine similarity between each
pair.

What's the average cosine?

This is how you do it:

- For each answer pair, compute
    - `v_llm` for the answer from the LLM 
    - `v_orig` for the original answer
    - then compute the cosine between them
- At the end, take the average

In [None]:
# Calculate cosine similarity between LLM-generated and original answers
print("Calculating cosine similarities between answer pairs...")

similarities = []

# Process each answer pair with progress tracking
for i, row in tqdm(df_results.iterrows(), total=len(df_results), desc="Computing similarities"):
    # Transform both answers to embedding vectors using our fitted pipeline
    v_llm = pipeline_similarity.transform([row["answer_llm"]])[0]  # LLM-generated answer
    v_orig = pipeline_similarity.transform([row["answer_orig"]])[0]  # Original answer
    
    # Calculate cosine similarity between the two answer embeddings
    similarity = cosine_simplified(v_llm, v_orig)
    similarities.append(similarity)
    
    # Log progress every 100 samples
    if (i + 1) % 100 == 0:
        current_avg = np.mean(similarities)
        print(f"Processed {i + 1}/{len(df_results)} pairs, current avg similarity: {current_avg:.3f}")

# Calculate final statistics
avg_cosine = np.mean(similarities)
std_cosine = np.std(similarities)
min_cosine = np.min(similarities)
max_cosine = np.max(similarities)

print(f"\nCosine Similarity Statistics:")
print(f"Average cosine similarity: {avg_cosine:.3f}")
print(f"Standard deviation: {std_cosine:.3f}")
print(f"Minimum similarity: {min_cosine:.3f}")
print(f"Maximum similarity: {max_cosine:.3f}")

avg_cosine

Average cosine similarity: 0.84


## Q6. Rouge

An alternative way to see how similar two texts are is ROUGE.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.
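To show what "overlap of n-grams" means in the simplest case, here is a hand-rolled sketch of ROUGE-1 (unigram overlap). It is an illustration under assumptions, not the exact computation of any particular ROUGE package: tokens are split on whitespace, repeated words are matched with clipped counts, and the helper name `rouge_1_sketch` and the two sentences are hypothetical.

```python
from collections import Counter

def rouge_1_sketch(hypothesis, reference):
    """Unigram-overlap precision, recall and F1 between two texts."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Words shared by both texts, counted at most as often as in either text
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return {"p": 0.0, "r": 0.0, "f": 0.0}

    precision = overlap / sum(hyp_counts.values())  # share of hypothesis words matched
    recall = overlap / sum(ref_counts.values())     # share of reference words matched
    f1 = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

hypothesis = "the course starts in january"
reference = "the course will start in january 2024"

toy_scores = rouge_1_sketch(hypothesis, reference)
print({key: round(value, 3) for key, value in toy_scores.items()})
```

Here four words are shared ("the", "course", "in", "january"), so precision is 4 of 5 hypothesis words and recall is 4 of 7 reference words. Unlike the embedding-based cosine score, "starts" and "start" count as different words.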

We don't need to implement it ourselves; there's a Python package for it:


In [None]:
# Install ROUGE package for text similarity evaluation
# ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is commonly used
# for evaluating text summarization and generation tasks
!pip install rouge -q




(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):


In [None]:
# Import and initialize ROUGE scorer
from rouge import Rouge
rouge_scorer = Rouge()

# Test ROUGE evaluation on a specific example (index 10)
print("Testing ROUGE evaluation on sample answer pair (index 10):")
sample_row = df_results.iloc[10]

print(f"Document ID: {sample_row['document']}")  # The ID column is named 'document'
print(f"LLM Answer: {sample_row.answer_llm[:100]}...")
print(f"Original Answer: {sample_row.answer_orig[:100]}...")

# Calculate ROUGE scores for this pair
# get_scores returns a list with one dictionary containing all ROUGE metrics
scores = rouge_scorer.get_scores(sample_row.answer_llm, sample_row.answer_orig)[0]

print("\nROUGE Scores for this pair:")
for rouge_type in ['rouge-1', 'rouge-2', 'rouge-l']:
    print(f"{rouge_type.upper()}:")
    print(f"  Precision: {scores[rouge_type]['p']:.3f}")
    print(f"  Recall: {scores[rouge_type]['r']:.3f}")
    print(f"  F1-Score: {scores[rouge_type]['f']:.3f}")

scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
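
To make these definitions concrete, here is a minimal sketch of ROUGE-N and the longest common subsequence computed by hand. It assumes lowercased whitespace tokenization and clipped n-gram counts; the helper names (`ngrams`, `rouge_n`, `lcs_length`) are hypothetical, and the `rouge` package may tokenize and count n-grams differently, so its numbers need not match this sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(hypothesis, reference, n=1):
    """Precision, recall and F1 of n-gram overlap (illustrative, not the package's code)."""
    hyp = Counter(ngrams(hypothesis.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((hyp & ref).values())
    p = overlap / sum(hyp.values()) if hyp else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return {"p": p, "r": r, "f": f}


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (basis of ROUGE-L)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


hyp_text = "the cat sat on the mat"
ref_text = "the cat lay on the mat"
print(rouge_n(hyp_text, ref_text, n=1))  # 5 of 6 unigrams overlap
print(rouge_n(hyp_text, ref_text, n=2))  # 3 of 5 bigrams overlap
print(lcs_length(hyp_text.split(), ref_text.split()))  # "the cat on the mat" has 5 tokens
```

Precision divides the overlap by the number of n-grams in the generated answer, recall divides it by the number in the reference answer, and F1 is their harmonic mean.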

For the document at index 10, the ROUGE-1 F1 score is 0.45.

Let's compute it for every pair in the dataframe.
What's the average ROUGE-1 F1?

In [None]:
# Calculate ROUGE-1 F1 scores for all answer pairs in the dataset
print("Calculating ROUGE-1 F1 scores for all answer pairs...")

rouge_1_f1_scores = []
rouge_2_f1_scores = []
rouge_l_f1_scores = []

# Process each answer pair with progress tracking
for i, row in tqdm(df_results.iterrows(), total=len(df_results), desc="Computing ROUGE scores"):
    try:
        # Calculate ROUGE scores for this answer pair
        scores = rouge_scorer.get_scores(row["answer_llm"], row["answer_orig"])[0]
        
        # Extract F1 scores for different ROUGE variants
        rouge_1_f1_scores.append(scores["rouge-1"]['f'])  # Unigram overlap F1
        rouge_2_f1_scores.append(scores["rouge-2"]['f'])  # Bigram overlap F1
        rouge_l_f1_scores.append(scores["rouge-l"]['f'])  # Longest common subsequence F1
        
    except Exception as e:
        print(f"Error processing row {i}: {e}")
        # Record zeros for failed rows so the lists stay aligned with df_results
        # (note: each failed row then counts as 0 and lowers the averages)
        rouge_1_f1_scores.append(0.0)
        rouge_2_f1_scores.append(0.0)
        rouge_l_f1_scores.append(0.0)

# Calculate statistics for all ROUGE metrics
def print_rouge_stats(scores, metric_name):
    """Print comprehensive statistics for a ROUGE metric."""
    avg_score = np.mean(scores)
    std_score = np.std(scores)
    min_score = np.min(scores)
    max_score = np.max(scores)
    
    print(f"\n{metric_name} Statistics:")
    print(f"  Average: {avg_score:.3f}")
    print(f"  Std Dev: {std_score:.3f}")
    print(f"  Min: {min_score:.3f}")
    print(f"  Max: {max_score:.3f}")
    
    return avg_score

# Print statistics for all ROUGE variants
rouge_1_avg = print_rouge_stats(rouge_1_f1_scores, "ROUGE-1 F1")
rouge_2_avg = print_rouge_stats(rouge_2_f1_scores, "ROUGE-2 F1")
rouge_l_avg = print_rouge_stats(rouge_l_f1_scores, "ROUGE-L F1")

print("\n=== FINAL RESULTS ===")
print(f"Average ROUGE-1 F1 score: {rouge_1_avg:.3f}")
print(f"Average ROUGE-2 F1 score: {rouge_2_avg:.3f}")
print(f"Average ROUGE-L F1 score: {rouge_l_avg:.3f}")

rouge_1_avg

Average ROUGE-1 F1 score: 0.35


## Summary and Key Takeaways

This notebook demonstrated comprehensive evaluation techniques for search systems and RAG pipelines:

### Search System Evaluation
1. **MinSearch with Boosting**: Evaluated text-based search with custom field weights
2. **Vector Search (TF-IDF + SVD)**: Compared single-field vs combined-field embeddings
3. **Modern Vector Search (Qdrant + Jina)**: Tested a pretrained Jina embedding model with Qdrant

### Answer Quality Evaluation
1. **Cosine Similarity**: Measured semantic similarity between generated and reference answers
2. **ROUGE Metrics**: Evaluated word-level overlap (unigrams, bigrams, longest common subsequence) between generated and reference answers

### Performance Insights
- **Hit Rate**: The fraction of queries for which a relevant document appears in the top-k results
- **MRR (Mean Reciprocal Rank)**: The average of 1/rank of the first relevant document, so earlier hits score higher
- **Combined text fields** often outperform single fields for vector search
- **Modern embedding models** (like Jina) typically provide better semantic understanding
- **Multiple evaluation metrics** provide different perspectives on system performance
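
As a reminder of how the two retrieval metrics are computed, here is a minimal sketch. It assumes each query's results are represented as a list of booleans, one per returned document in ranked order, with `True` marking the relevant document. The input layout and the names `relevance_total`, `hit_rate` and `mrr` are assumptions made for illustration.

```python
def hit_rate(relevance_total):
    """Fraction of queries with at least one relevant document in the results."""
    if not relevance_total:
        return 0.0
    hits = sum(1 for row in relevance_total if any(row))
    return hits / len(relevance_total)


def mrr(relevance_total):
    """Mean of 1/rank of the first relevant document; a query with no hit contributes 0."""
    if not relevance_total:
        return 0.0
    total = 0.0
    for row in relevance_total:
        for rank, is_relevant in enumerate(row, start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(relevance_total)


# Four toy queries with top-3 results each (hypothetical data)
relevance_total = [
    [True, False, False],   # relevant document at rank 1
    [False, True, False],   # rank 2
    [False, False, False],  # not retrieved
    [False, False, True],   # rank 3
]
print(f"Hit rate: {hit_rate(relevance_total):.3f}")  # 3 of 4 queries have a hit
print(f"MRR: {mrr(relevance_total):.3f}")            # (1 + 1/2 + 0 + 1/3) / 4
```

Hit rate only asks whether the document was found, while MRR also rewards ranking it near the top, which is why the two can disagree when comparing search configurations.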

### Best Practices Learned
1. Always use multiple evaluation metrics for comprehensive assessment
2. Compare different approaches systematically using the same ground truth
3. Consider both retrieval quality (Hit Rate, MRR) and answer quality (Cosine, ROUGE)
4. Use progress tracking and detailed logging for long-running evaluations
5. Implement proper error handling for robust evaluation pipelines