

### **Objective: Evaluating a Customer Support Chatbot**

Let's imagine we're building a chatbot for an e-commerce website called **"ZenCart"**. The chatbot's job is to answer customer questions by first **finding (retrieving)** the right information from its knowledge base and then **writing (generating)** a helpful, human-friendly answer.

Our goal is to measure how well it performs both of these tasks.

-----

### **Part A: Data Preparation**

First, we need to set up our data. This includes the chatbot's knowledge base, a sample user question, and the "perfect" answers we'll use for comparison.

In [None]:
# Cell 1: Install necessary libraries
# We need 'rouge-score' for the ROUGE metric and 'nltk' for the BLEU metric.
# The '!' tells Colab to run this line as a shell command.
!pip install rouge-score nltk

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=be3b043669af6dcbe0d0f75ce007c6a0cbf9a2cfa6f5faf32ecf9a48f01bb933
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


**Explanation:** This cell simply installs the Python libraries we'll need for our evaluation metrics later on. `rouge-score` is for the ROUGE metric, and `nltk` (Natural Language Toolkit) is for the BLEU metric.

In [None]:
# Cell 2: Setting up our Knowledge Base and Scenario
# This is the information our chatbot can search through.
knowledge_base = [
    "To return an item, go to your order history, select the item, and follow the return instructions. Items must be in original condition and returned within 30 days.", # Document 0
    "Standard shipping takes 3-5 business days. Expedited shipping takes 1-2 business days. You can track your order via the link in your confirmation email.", # Document 1
    "We accept all major credit cards, including Visa, MasterCard, and American Express. We also accept PayPal for online payments.", # Document 2
    "Our shoes are available in US sizes 5 through 12. A size chart is available on each product page to help you find the perfect fit." # Document 3
]

# --- Our Test Scenario ---

# 1. The user asks a question.
user_query = "How do I return my shoes?"

# 2. We define the 'Ground Truth' - the *correct* document(s) for this query.
# In this case, only the first document (at index 0) is about returns.
ground_truth_ids = [0]

# 3. Our system runs and *retrieves* what it *thinks* are the relevant documents.
# Let's pretend our system correctly found document 0 but also mistakenly pulled document 2.
retrieved_ids = [0, 2]

# 4. Our system then *generates* a summary based on the retrieved documents.
# This is the actual answer the chatbot gives to the user.
generated_answer = "You can return your item by going to your orders. Make sure it is returned within 30 days."

# 5. We create a 'Reference Answer' - a perfect, human-written answer.
# This is our gold standard for what a good answer looks like.
reference_answer = "To return an item, visit your order history, select the item, and follow the instructions. The item must be in its original condition and returned within 30 days."

**Explanation:** In this cell, we've created a mini-database for our ZenCart chatbot.

  * **`knowledge_base`**: A list of documents our chatbot can search.
  * **`user_query`**: The question we are testing.
  * **`ground_truth_ids`**: The indices of the document(s) that *should* be retrieved for this query. This is our source of truth.
  * **`retrieved_ids`**: What our system *actually* found. We'll use this to measure retrieval accuracy.
  * **`generated_answer`**: The answer our system *actually* wrote.
  * **`reference_answer`**: The "perfect" answer we'll compare against.

-----

### **Part B: Retrieval Evaluation (Precision & Recall)**

Now, let's see how good our system was at **finding** the right documents. For this, we use two key metrics: **Precision** and **Recall**.

#### **Formula Explanation with an Analogy: Fishing 🐟**

Imagine you're fishing in a pond that has 10 trout (the relevant documents) and a bunch of old boots (irrelevant documents).

  * **Precision**: **How much of your catch is actual fish?** If you catch 5 things, and 4 are trout and 1 is a boot, your precision is high (4/5 = 80%). It measures how **clean** your results are.

      * **Formula**: `Precision = Correctly Retrieved / Total Retrieved`

  * **Recall**: **How many of the total fish in the pond did you catch?** If there were 10 trout in the pond and you caught 4, your recall is 40% (4/10). It measures how **complete** your results are.

      * **Formula**: `Recall = Correctly Retrieved / Total Correct Available`
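To make the two formulas concrete, here is a minimal sketch of the fishing example in Python. The `catch` list and the `trout_in_pond` count are illustrative values taken from the analogy, not real data.

```python
# Hypothetical catch from the fishing analogy: 4 trout and 1 boot.
catch = ["trout", "trout", "trout", "trout", "boot"]
trout_in_pond = 10  # total relevant items available

trout_caught = catch.count("trout")  # the "correctly retrieved" items

fishing_precision = trout_caught / len(catch)   # 4 / 5
fishing_recall = trout_caught / trout_in_pond   # 4 / 10

print(f"Precision: {fishing_precision:.0%}")  # Precision: 80%
print(f"Recall: {fishing_recall:.0%}")        # Recall: 40%
```

Both metrics share the same numerator (the trout you caught). Only the denominator changes: everything you caught for precision, everything you could have caught for recall.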


In [None]:
# Cell 3: Calculating Precision and Recall

# Let's use our data from Part A
print(f"User Query: '{user_query}'")
print(f"Ground Truth Document IDs: {ground_truth_ids}")
print(f"Our System Retrieved IDs: {retrieved_ids}\n")

# To make calculation easier, we convert our lists to sets
ground_truth_set = set(ground_truth_ids)
retrieved_set = set(retrieved_ids)

# Correctly retrieved items are the ones present in BOTH sets (the intersection)
true_positives = len(ground_truth_set.intersection(retrieved_set))

# Precision = (Correctly Retrieved) / (Total Retrieved)
precision = true_positives / len(retrieved_set)

# Recall = (Correctly Retrieved) / (Total Correct Available)
recall = true_positives / len(ground_truth_set)

print(f"Number of correctly retrieved documents (True Positives): {true_positives}")
print("---")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")

User Query: 'How do I return my shoes?'
Ground Truth Document IDs: [0]
Our System Retrieved IDs: [0, 2]

Number of correctly retrieved documents (True Positives): 1
---
Precision: 0.50
Recall: 1.00


**Explanation:**
Our system retrieved two documents `[0, 2]`.

  * The correct document was `[0]`.
  * So, it correctly retrieved **1** document.
  * **Precision**: It retrieved 2 documents in total, but only 1 was correct. So, `1 / 2 = 0.50`. 50% of its "catch" was good.
  * **Recall**: There was only 1 relevant document in the whole knowledge base, and our system found it. So, `1 / 1 = 1.00`. It found 100% of the relevant documents.
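One practical caveat: the calculation in Cell 3 divides by the number of retrieved documents, so it would raise a `ZeroDivisionError` if the system retrieved nothing. The helper below is a minimal sketch of one way to guard against that. The function name `precision_recall` and the choice to report `0.0` for an empty denominator are assumptions made for illustration, not part of the original notebook.

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) for one query, treating the IDs as sets."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)  # documents that are both retrieved and relevant
    # Assumption: report 0.0 instead of raising when a denominator is zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The ZenCart scenario: one relevant document, two retrieved.
print(precision_recall([0, 2], [0]))        # (0.5, 1.0)
# Retrieving every document guarantees perfect recall but dilutes precision.
print(precision_recall([0, 1, 2, 3], [0]))  # (0.25, 1.0)
# An empty result list is handled without an exception.
print(precision_recall([], [0]))            # (0.0, 0.0)
```

The second call shows why neither number is enough on its own: a system can reach 100% recall simply by returning everything.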

-----

### **Part C: Generation Evaluation (BLEU & ROUGE)**

Great, we've measured the retrieval. Now let's measure how good the **written answer** was. We'll use two popular metrics: **BLEU** and **ROUGE**.

#### **BLEU (Bilingual Evaluation Understudy)**

  * **What it is**: BLEU measures how much the **generated text** overlaps with the **reference text**, focusing on shared word sequences (n-grams). It was originally invented for machine translation.
  * **Analogy: The Lego Builder 🧱**: Imagine the reference answer is a Lego model you built. The generated answer is a model your friend built trying to copy yours. BLEU checks how many of your friend's single bricks and 2-brick, 3-brick, and 4-brick combinations perfectly match yours. It cares about **precision** and getting the exact wording and order right, and it also penalizes copies that are much shorter than the original (the brevity penalty).
  * **Score**: Ranges from 0 (no match) to 1 (perfect match).
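The building block of BLEU is "clipped" n-gram precision: a generated n-gram only earns credit as many times as it appears in the reference. The sketch below illustrates the idea with plain Python. The helper names (`ngrams`, `clipped_precision`) and the toy sentences are made up for illustration and are not part of NLTK.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token tuples, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference, with clipping."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0

reference = "the cat is on the mat".split()
candidate = "the the the the".split()

# "the" appears 4 times in the candidate but only 2 times in the reference.
print(clipped_precision(candidate, reference, 1))  # 0.5
# No candidate bigram ("the", "the") exists in the reference.
print(clipped_precision(candidate, reference, 2))  # 0.0
```

Without clipping, the nonsense candidate would score a perfect unigram precision, which is why BLEU caps the credit per n-gram.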


In [None]:
# Cell 4: Calculating BLEU Score
from nltk.translate.bleu_score import sentence_bleu

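# sentence_bleu expects a list of references (each a list of tokens) and one tokenized hypothesis.
# Note: split() keeps punctuation attached to words, so "item," and "item" are different tokens.
reference_tokens = [reference_answer.split()]
generated_tokens = generated_answer.split()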

# Calculate BLEU score
bleu_score = sentence_bleu(reference_tokens, generated_tokens)

print(f"Reference Answer:  '{reference_answer}'")
print(f"Generated Answer:  '{generated_answer}'\n")
print(f"BLEU Score: {bleu_score:.4f}")
print("\n(A score closer to 1.0 is better)")

Reference Answer:  'To return an item, visit your order history, select the item, and follow the instructions. The item must be in its original condition and returned within 30 days.'
Generated Answer:  'You can return your item by going to your orders. Make sure it is returned within 30 days.'

BLEU Score: 0.0887

(A score closer to 1.0 is better)


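**Explanation:** The code splits our sentences into lists of words (tokens) and then compares the n-grams in the `generated_answer` with those in the `reference_answer`. The score of ~0.09 is low, even though the generated answer is a reasonable paraphrase. BLEU rewards exact wording: only one 4-word sequence ("returned within 30 days.") appears in both sentences, and the generated answer is also much shorter than the reference, which triggers BLEU's brevity penalty. Because `split()` leaves punctuation attached to words, "item," and "item" also count as different tokens.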
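To see where the 0.0887 comes from, the following sketch recomputes the score by hand for a single reference: clipped precision for 1- to 4-grams, a brevity penalty, and their geometric mean. It is an illustrative re-implementation under those assumptions (uniform weights, no smoothing), not NLTK's actual code, and `bleu_breakdown` is a hypothetical helper name.

```python
import math
from collections import Counter

reference_answer = "To return an item, visit your order history, select the item, and follow the instructions. The item must be in its original condition and returned within 30 days."
generated_answer = "You can return your item by going to your orders. Make sure it is returned within 30 days."

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_breakdown(reference, candidate, max_n=4):
    """Return (clipped precisions per order, brevity penalty, BLEU) for one reference."""
    if not candidate:
        return [0.0] * max_n, 0.0, 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        precisions.append(matched / total if total else 0.0)
    # Brevity penalty: candidates shorter than the reference are scaled down.
    c_len, r_len = len(candidate), len(reference)
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    # Unsmoothed geometric mean: a single zero precision makes the score zero.
    if min(precisions) == 0:
        return precisions, bp, 0.0
    score = bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
    return precisions, bp, score

precisions, bp, score = bleu_breakdown(reference_answer.split(), generated_answer.split())
for n, p in enumerate(precisions, start=1):
    print(f"{n}-gram precision: {p:.3f}")  # 0.389, 0.176, 0.125, 0.067
print(f"Brevity penalty: {bp:.3f}")        # 0.574
print(f"BLEU: {score:.4f}")                # 0.0887
```

The breakdown shows two separate effects: few long phrases match exactly (only 1 of 15 four-grams), and the 18-token answer is penalized for being shorter than the 28-token reference.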

#### **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**

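  * **What it is**: ROUGE is the recall-oriented counterpart to BLEU. Where BLEU asks how much of the **generated text** appears in the reference, ROUGE asks how much of the **reference text** is captured in the **generated text**. It's widely used for summarization tasks.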
  * **Analogy: The Recipe Checklist ✅**: Imagine the reference answer is a complete recipe. The generated answer is your friend's quick summary of that recipe. ROUGE checks how many of the original ingredients and steps from the **full recipe** made it into your friend's summary. It cares about **recall**—did you cover all the important points?
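  * **Score**: We look at `ROUGE-1` (overlap of single words) and `ROUGE-L` (longest common subsequence: the longest series of words that appear in the same order in both texts, not necessarily next to each other). Each comes with a precision, a recall, and an `f` score (f-measure), which is the harmonic mean of the two and is often the most reported number. It also ranges from 0 to 1.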
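The longest common subsequence behind `ROUGE-L` can be sketched with a small dynamic-programming table. The code below is an illustrative implementation on made-up toy sentences. The helper names `lcs_length` and `rouge_l` are assumptions for this sketch and are not the `rouge-score` library's internals.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(reference, candidate):
    """Return (precision, recall, f_measure) based on the LCS length."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return precision, recall, 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat".split()
candidate = "the mat the cat sat".split()

p, r, f = rouge_l(reference, candidate)
print(f"LCS length: {lcs_length(reference, candidate)}")      # LCS length: 3
print(f"ROUGE-L precision={p:.2f} recall={r:.2f} f={f:.2f}")  # precision=0.60 recall=0.50 f=0.55
```

Every word of the toy candidate appears in the reference, so its single-word precision would be perfect. The ROUGE-L precision is only 0.60 because just "the cat sat" keeps the reference's word order.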


In [None]:
# Cell 5: Calculating ROUGE Score
from rouge_score import rouge_scorer

# Initialize the scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

# Calculate the scores
scores = scorer.score(reference_answer, generated_answer)

print(f"Reference Answer:  '{reference_answer}'")
print(f"Generated Answer:  '{generated_answer}'\n")
print("ROUGE Scores:")

# Print the f-measure for ROUGE-1 and ROUGE-L
print(f"  ROUGE-1 F-Score: {scores['rouge1'].fmeasure:.4f}")
print(f"  ROUGE-L F-Score: {scores['rougeL'].fmeasure:.4f}")
print("\n(F-Scores closer to 1.0 are better)")

Reference Answer:  'To return an item, visit your order history, select the item, and follow the instructions. The item must be in its original condition and returned within 30 days.'
Generated Answer:  'You can return your item by going to your orders. Make sure it is returned within 30 days.'

ROUGE Scores:
  ROUGE-1 F-Score: 0.3913
  ROUGE-L F-Score: 0.3478

(F-Scores closer to 1.0 are better)


**Explanation:**

  * **`ROUGE-1`**: Measures the overlap of individual words. The moderate score here (~0.39) shows that our generated answer shares some important words with the reference answer (e.g., "return", "item", "order", "30 days") but leaves out much of it, such as selecting the item and the original-condition requirement.
  * **`ROUGE-L`**: Measures the longest common subsequence: words that appear in the same order in both texts, though not necessarily adjacent. The score (~0.35) is slightly lower than ROUGE-1 because not every shared word appears in the same order in both answers.
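The F-score hides how precision and recall differ, so it can help to break `ROUGE-1` apart. The sketch below counts shared words by hand with a simplified tokenizer (lowercase, alphanumeric runs only, **no stemming**). That tokenizer is an assumption made for illustration, so the result is close to, but not identical to, the library's number.

```python
import re
from collections import Counter

reference_answer = "To return an item, visit your order history, select the item, and follow the instructions. The item must be in its original condition and returned within 30 days."
generated_answer = "You can return your item by going to your orders. Make sure it is returned within 30 days."

def simple_tokens(text):
    """Lowercase the text and keep alphanumeric runs only (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())

ref_counts = Counter(simple_tokens(reference_answer))
gen_counts = Counter(simple_tokens(generated_answer))

# Counter intersection keeps the minimum count of each shared word.
overlap = ref_counts & gen_counts
matched = sum(overlap.values())

r1_precision = matched / sum(gen_counts.values())  # share of generated words that match
r1_recall = matched / sum(ref_counts.values())     # share of reference words covered
r1_f = 2 * r1_precision * r1_recall / (r1_precision + r1_recall)

print(sorted(overlap))
# ['30', 'days', 'item', 'return', 'returned', 'to', 'within', 'your']
print(f"matched={matched} precision={r1_precision:.4f} recall={r1_recall:.4f} f={r1_f:.4f}")
# matched=8 precision=0.4444 recall=0.2857 f=0.3478
```

Recall (0.29) is noticeably lower than precision (0.44): the short answer covers less than a third of the reference's words. The library's 0.3913 is a little higher than this sketch's 0.3478, most likely because `use_stemmer=True` also lets "orders" match "order".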

By combining these metrics, you get a multi-faceted view of your system's performance! You can tell whether the problem is in finding the right info (Retrieval) or in communicating it well (Generation).