If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [1]:

!pip install "langchain>=0.1.17" "openai>=1.13.3" "langchain_openai>=0.1.6" "transformers>=4.40.1" "datasets>=2.18.0" "accelerate>=0.27.2" "sentence-transformers>=2.5.1" "duckduckgo-search>=5.2.2"
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.3.9.tar.gz (67.9 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Building wheels for collected packages: llama-cpp-python
  error: subprocess-exited-with-error

  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code:

## Loading our model

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Create a pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda


### Evaluate the LLM using a curated benchmark set specific to our use case

1. Create a custom benchmark dataset specific to the use case
2. Compute summarization-specific evaluation metrics using the custom benchmark data set
3. Use an LLM-as-a-Judge approach to evaluate custom metrics

### Task 1: Create a Benchmark Dataset

Recall that ROUGE requires reference sets to compute scores. In our demo, we used a large, generic benchmark set.

In this lab, you will use a benchmark set that is specific to the use case.

#### Case-Specific Benchmark Set

While the case-specific data set likely won't be as large, it does have the advantage of being more representative of the task we're actually asking the LLM to perform.

Below, we've started to create a dataset for grocery product review summaries. It's your task to add two more product summaries to this dataset.

Hint: Try opening up another tab and using AI Playground to generate some examples! Just be sure to manually check them since this is our ground-truth evaluation data.

Note: For this task, we're creating an extremely small reference set. In practice, you'll want to create one with far more example records.

In [17]:

import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "This coffee is exceptional. Its intensely bold flavor profile is both nutty and fruity – especially with notes of plum and citrus. While the price is relatively good, I find myself needing to purchase bags too often. If this came in 16oz bags instead of just 12oz bags, I'd purchase it all the time. I highly recommend they start scaling up their bag size.",
            "The moment I opened the tub of Chocolate-Covered Strawberry Delight ice cream, I was greeted by the enticing aroma of fresh strawberries and rich chocolate. The appearance of the ice cream was equally appealing, with a swirl of pink strawberry ice cream and chunks of chocolate-covered strawberries scattered throughout. The first bite did not disappoint. The strawberry ice cream was creamy and flavorful, with a natural sweetness that was not overpowering. The chocolate-covered strawberries added a satisfying crunch fruity bite.",
            "Arroz Delicioso is a must-try for Mexican cuisine enthusiasts! This authentic Mexican rice, infused with a blend of tomatoes, onions, and garlic, brings a burst of flavor to any meal. Its vibrant color and delightful aroma will transport you straight to the heart of Mexico. The rice cooks evenly, resulting in separate, fluffy grains that hold their shape, making it perfect for dishes like arroz con pollo or as a side for tacos. With a cook time of just 20 minutes, Arroz Delicioso is a convenient and delicious addition to your pantry. Give it a try and elevate your Mexican food game!",
            "FreshCrunch salad mixes are revolutionizing the way we think about packaged salads! Each bag is packed with a vibrant blend of crisp, nutrient-rich greens, including baby spinach, arugula, and kale. The veggies are pre-washed and ready to eat, making meal prep a breeze. FreshCrunch sets itself apart with its innovative packaging that keeps the greens fresh for up to 10 days, reducing food waste and ensuring you always have a healthy option on hand. The salad mixes are versatile and pair well with various dressings and toppings. Try FreshCrunch for a convenient, delicious, and nutritious meal solution that doesn't compromise on quality or taste!",
            "If you're a grill enthusiast like me, you know the importance of having the right tools for the job. That's why I was thrilled to get my hands on the new Click-Clack Grill Tongs. These tongs are not just any ordinary grilling utensil; they're a game-changer. First impressions matter, and the Click-Clack Grill Tongs certainly deliver. The sleek, stainless steel design exudes a professional feel, and the ergonomic handle ensures a comfortable grip even during those long grilling sessions. But what truly sets these tongs apart is their innovative 'Click-Clack' mechanism. With a simple press of a button, the tongs automatically open and close, allowing for precise control when flipping or turning your food. No more struggling with stiff, unwieldy tongs that can ruin your carefully prepared meals. The tongs also feature a scalloped edge, which provides a secure grip on everything from juicy steaks to delicate vegetables. And with their generous length, you can keep your hands safely away from the heat while still maintaining optimal control. Cleanup is a breeze thanks to the dishwasher-safe construction, and the integrated hanging loop makes storage a snap. In conclusion, the Click-Clack Grill Tongs have earned a permanent spot in my grilling arsenal. They've made my grilling experience more enjoyable and efficient, and I'm confident they'll do the same for you. So, if you're looking to up your grilling game, I highly recommend giving these tongs a try. Happy grilling!",
            "As a parent, I understand the importance of providing my child with nutritious, wholesome food. That's why I was thrilled to discover Fresh 'n' Quik Baby Food, a new product that promises to deliver fresh, homemade baby food in minutes. The concept behind Fresh 'n' Quik is simple yet ingenious. The system consists of pre-portioned, organic fruit and vegetable purees that can be quickly and easily blended with breast milk, formula, or water to create a nutritious meal for your little one. The purees are made with high-quality ingredients, free from additives, preservatives, and artificial flavors, ensuring that your baby receives only the best. One of the standout features of Fresh 'n' Quik is the convenience it offers. The purees come in individual, resealable pouches that can be stored in the freezer until you're ready to use them. When it's time to feed your baby, simply pop a pouch into the Fresh 'n' Quik blender, add your liquid of choice, and blend. In less than a minute, you have a fresh, homemade meal that's ready to serve. The blender itself is compact, easy to use, and even easier to clean. The blades are removable, making it a breeze to rinse off any leftover puree. And the best part? The blender is whisper-quiet, so you don't have to worry about waking your sleeping baby while preparing their meal. But what truly sets Fresh 'n' Quik apart is the variety of flavors available. From classic combinations like apple and banana to more adventurous options like mango and kale, there's something for every palate. And because the purees are made with real fruits and vegetables, your baby is exposed to a wide range of flavors and textures, helping to cultivate a diverse and adventurous palate from an early age. In conclusion, Fresh 'n' Quik Baby Food is a game-changer for parents seeking a convenient, nutritious, and delicious option for their little ones. "
            "The system is easy to use, quick to clean, and offers a wide variety of flavors to keep your baby's taste buds excited. I highly recommend giving Fresh 'n' Quik a try – your baby (and your schedule) will thank you!"
        ],
        "ground_truth": [
            "This bold, nutty, and fruity coffee is delicious, and they need to start selling it in larger bags.",
            "Chocolate-Covered Strawberry Delight ice cream looks delicious with its aroma of strawberry and chocolate, and its creamy, naturally sweet taste did not disappoint.",
            "Arroz Delicioso offers authentic, flavorful Mexican rice with a blend of tomatoes, onions, and garlic, cooking evenly into separate, fluffy grains in just 20 minutes, making it a convenient and delicious choice for dishes like arroz con pollo or as a side for tacos.",
            "FreshCrunch salad mixes offer convenient, pre-washed, nutrient-rich greens in an innovative packaging that keeps them fresh for up to 10 days, providing a versatile, tasty, and waste-reducing healthy meal solution.",
            "The Click-Clack Grill Tongs are a high-quality, innovative grilling tool with a sleek design, comfortable grip, and an automatic opening/closing mechanism for precise control. These tongs have made grilling more enjoyable and efficient, and are highly recommended for anyone looking to improve their grilling experience.",
            "Fresh 'n' Quik Baby Food is a revolutionary product that delivers fresh, homemade baby food in minutes. With pre-portioned, organic fruit and vegetable purees, the system offers convenience, high-quality ingredients, and a wide range of flavors to cultivate a diverse palate in your little one. The blender is compact, easy to use, and whisper-quiet, making mealtime a breeze. Fresh 'n' Quik Baby Food is a must-try for parents seeking a nutritious and delicious option for their babies."
        ],
    }
)

display(eval_data)

| | inputs | ground_truth |
|---|---|---|
| 0 | This coffee is exceptional. Its intensely bold... | This bold, nutty, and fruity coffee is delicio... |
| 1 | The moment I opened the tub of Chocolate-Cover... | Chocolate-Covered Strawberry Delight ice cream... |
| 2 | Arroz Delicioso is a must-try for Mexican cuis... | Arroz Delicioso offers authentic, flavorful Me... |
| 3 | FreshCrunch salad mixes are revolutionizing th... | FreshCrunch salad mixes offer convenient, pre-... |
| 4 | If you're a grill enthusiast like me, you know... | The Click-Clack Grill Tongs are a high-quality... |
| 5 | As a parent, I understand the importance of pr... | Fresh 'n' Quik Baby Food is a revolutionary pr... |


Question: What are some strategies for evaluating your custom-generated benchmark data set? For example:

- How can you scale the curation?
- How do you know if the ground truth is correct?
- Who should have input?
- Should it remain static over time?
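
One inexpensive starting point for these questions is a set of automated sanity checks that run before any human review. The following is a minimal sketch, assuming the benchmark is available as a list of dicts (for example via `eval_data.to_dict("records")`). The function name `find_benchmark_issues` and the three checks are illustrative choices, not part of the lab, and they cannot confirm that a ground-truth summary is actually faithful.

```python
def find_benchmark_issues(records):
    """Return (row_position, issue) pairs for obviously suspicious records.

    Hypothetical helper: the checks below are cheap heuristics only.
    """
    issues = []
    seen_inputs = set()
    for i, rec in enumerate(records):
        text = rec.get("inputs", "").strip()
        summary = rec.get("ground_truth", "").strip()
        # A record with a missing review or summary cannot be scored
        if not text or not summary:
            issues.append((i, "empty field"))
            continue
        # A summary should be shorter (in words) than the text it summarizes
        if len(summary.split()) >= len(text.split()):
            issues.append((i, "summary not shorter than input"))
        # Repeated inputs give one example extra weight in the averages
        if text in seen_inputs:
            issues.append((i, "duplicate input"))
        seen_inputs.add(text)
    return issues
```

Rows flagged here are candidates for manual inspection; rows that pass still need a human to confirm the summary is correct.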

Next, we're saving this reference data set for future use.
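
The notebook does not show the saving step itself, so here is one possible sketch. It assumes the reference set is passed as a list of dicts and writes it as JSON Lines; the helper names `save_reference_set` and `load_reference_set` and the file format are assumptions made for illustration.

```python
import json
from pathlib import Path


def save_reference_set(records, path):
    """Write one JSON object per line (hypothetical helper)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            # ensure_ascii=False keeps characters such as en dashes readable
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return path


def load_reference_set(path):
    """Read the records back in the order they were written."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

In the notebook, `eval_data.to_dict("records")` would produce the `records` list, and `eval_data.to_csv(path, index=False)` is a one-line pandas alternative if CSV is preferred.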

In [18]:
# Define the first model for summarization
def query_summary_system(input: str) -> str:
    messages = [
        {
            "role": "system",
            "content": "You are an assistant that summarizes text. Given a text input, you need to provide a one-sentence summary. You specialize in summarizing reviews of grocery products. Please keep the reviews in first-person perspective if they're originally written in first person. Do not change the sentiment. Do not create a run-on sentence – be concise."
        },
        {
            "role": "user",
            "content": input
        }
    ]
    # Generate the output
    output = pipe(messages)
    print(output[0]["generated_text"])

    return output[0]["generated_text"]

In [19]:
generated_text = query_summary_system(input="This coffee is exceptional. Its intensely bold flavor profile is both nutty and fruity – especially with notes of plum and citrus. While the price is relatively good, I find myself needing to purchase bags too often. If this came in 16oz bags instead of just 12oz bags, I'd purchase it all the time. I highly recommend they start scaling up their bag size.")

 I highly recommend this coffee for its bold, nutty, and fruity flavor, but I wish it came in larger 16oz bags to reduce the frequency of my purchases.


### Task 2: Compute ROUGE on Custom Benchmark Data

Next, we will compute ROUGE-N metrics to understand how well our system summarizes grocery product reviews, using the reference set of reviews that we just created.
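
To make the metric concrete, ROUGE-N can be sketched in a few lines: count the n-grams shared by the reference and the prediction, then turn that overlap into precision, recall, and F1. This is an illustrative sketch that assumes simple lowercase alphanumeric tokenization and no stemming, so its numbers may differ from those of a full ROUGE library; the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric tokens, no stemming
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(tokens, n):
    # Multiset of all contiguous n-grams in the token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, prediction, n=1):
    ref = ngram_counts(tokenize(reference), n)
    pred = ngram_counts(tokenize(prediction), n)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((ref & pred).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, `rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2)` shares three of five bigrams in each direction, so precision, recall, and F1 all equal 0.6.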

Remember that the `mlflow.evaluate` function accepts the following parameters for this use case:

- An LLM model
- Reference data for evaluation
- Column with ground truth data
- The model/task type (e.g. "text-summarization")

#### Run the Evaluation

Instead of using the generic benchmark dataset, your task is to compute ROUGE metrics using the case-specific benchmark data that we just created.

In [20]:

# A custom function to iterate through our eval DF
def query_iteration(inputs):
    answers = []

    for index, row in inputs.iterrows():
        completion = query_summary_system(row["inputs"])
        answers.append(completion)

    return answers

# Test query_iteration function – it needs to return a list of output strings
query_iteration(eval_data)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


 I highly recommend this coffee for its bold, nutty, and fruity flavor, but I wish it came in larger 16oz bags to reduce the frequency of my purchases.
 I was captivated by the aroma and appearance of Chocolate-Covered Strawberry Delight ice cream, and the first bite was a delightful mix of creamy strawberry ice cream and crunchy chocolate-covered strawberries.
 Arroz Delicioso is an authentic, flavorful Mexican rice that cooks quickly and evenly, making it perfect for a variety of dishes and a convenient pantry staple.
 FreshCrunch salad mixes offer a convenient, nutritious, and delicious meal solution with innovative packaging that keeps greens fresh for up to 10 days.
 The Click-Clack Grill Tongs are a game-changer for grill enthusiasts, with their innovative 'Click-Clack' mechanism, ergonomic design, and scalloped edge for a secure grip, making grilling more enjoyable and efficient.
 Fresh 'n' Quik Baby Food is a convenient, nutritious, and delicious option for parents seeking a qu

[' I highly recommend this coffee for its bold, nutty, and fruity flavor, but I wish it came in larger 16oz bags to reduce the frequency of my purchases.',
 ' I was captivated by the aroma and appearance of Chocolate-Covered Strawberry Delight ice cream, and the first bite was a delightful mix of creamy strawberry ice cream and crunchy chocolate-covered strawberries.',
 ' Arroz Delicioso is an authentic, flavorful Mexican rice that cooks quickly and evenly, making it perfect for a variety of dishes and a convenient pantry staple.',
 ' FreshCrunch salad mixes offer a convenient, nutritious, and delicious meal solution with innovative packaging that keeps greens fresh for up to 10 days.',
 " The Click-Clack Grill Tongs are a game-changer for grill enthusiasts, with their innovative 'Click-Clack' mechanism, ergonomic design, and scalloped edge for a secure grip, making grilling more enjoyable and efficient.",
 " Fresh 'n' Quik Baby Food is a convenient, nutritious, and delicious option fo

#### To evaluate the model responses based on a ground truth, you can compute metrics such as ROUGE

In [21]:
!pip install rouge-score


Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=a5744f5032f1c395a2fdb7f3c3905e774b9b4668fb8818767f342a1193732e1a
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


### Python Evaluation Function

In [22]:
from rouge_score import rouge_scorer
import pandas as pd

def evaluate_model(eval_data: pd.DataFrame):
    # Generate predictions
    predictions = query_iteration(eval_data)

    # Ground truth references
    references = eval_data["ground_truth"].tolist()

    # Initialize ROUGE scorer
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = {
        "rouge1": [],
        "rouge2": [],
        "rougeL": []
    }

    # Evaluate each pair
    for pred, ref in zip(predictions, references):
        result = scorer.score(ref, pred)
        scores["rouge1"].append(result["rouge1"].fmeasure)
        scores["rouge2"].append(result["rouge2"].fmeasure)
        scores["rougeL"].append(result["rougeL"].fmeasure)

    # Compute average scores
    avg_scores = {k: sum(v) / len(v) for k, v in scores.items()}
    return avg_scores, predictions
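
The `rougeL` entry differs from `rouge1` and `rouge2` in that it is based on the longest common subsequence (LCS) of tokens rather than on fixed-size n-grams. A minimal sketch of that idea follows, assuming whitespace tokenization on lowercased text and no stemming; the helper names are hypothetical and the scores are not guaranteed to match a library's output exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence that ended before both items
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping x or skipping y
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, prediction):
    # Assumption: simple lowercase whitespace tokens
    ref = reference.lower().split()
    pred = prediction.lower().split()
    if not ref or not pred:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(ref, pred)
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Because the subsequence only needs to preserve order, not adjacency, ROUGE-L rewards summaries that keep the reference's word order even when extra words are inserted between matches.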

In [23]:
avg_scores, model_outputs = evaluate_model(eval_data)
print("Average ROUGE Scores:", avg_scores)

 I highly recommend this coffee for its bold, nutty, and fruity flavor, but I wish it came in larger 16oz bags to reduce the frequency of my purchases.
 I was captivated by the aroma and appearance of Chocolate-Covered Strawberry Delight ice cream, and the first bite was a delightful mix of creamy strawberry ice cream and crunchy chocolate-covered strawberries.
 Arroz Delicioso is an authentic, flavorful Mexican rice that cooks quickly and evenly, making it perfect for a variety of dishes and a convenient pantry staple.
 FreshCrunch salad mixes offer a convenient, nutritious, and delicious meal solution with innovative packaging that keeps greens fresh for up to 10 days.
 The Click-Clack Grill Tongs are a game-changer for grill enthusiasts, with their innovative 'Click-Clack' mechanism, ergonomic design, and scalloped edge for a secure grip, making grilling more enjoyable and efficient.
 Fresh 'n' Quik Baby Food is a convenient, nutritious, and delicious option for parents seeking a qu

### Add BERTScore (Semantic Similarity)

In [24]:
!pip install bert-score

Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert-score
Successfully installed bert-score-0.3.13


In [26]:
from bert_score import score as bert_score

def compute_bert_score(eval_data: pd.DataFrame):
    # Generate predictions
    predictions = query_iteration(eval_data)

    # Ground truth references
    references = eval_data["ground_truth"].tolist()
    P, R, F1 = bert_score(predictions, references, lang='en', verbose=True)
    return {
        "bert_precision": round(P.mean().item(), 4),
        "bert_recall": round(R.mean().item(), 4),
        "bert_f1": round(F1.mean().item(), 4)
    }
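
Conceptually, BERTScore matches each token in one text to its most similar token in the other, using cosine similarity between token embeddings. The sketch below illustrates only that greedy-matching step. It assumes the embeddings are already available as plain lists of non-zero vectors; the toy vectors in the usage note are made up, whereas the real metric takes contextual embeddings from a pretrained model and can optionally apply IDF weighting.

```python
import math


def cosine(u, v):
    # Assumes both vectors are non-zero and of equal length
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall, and F1 from greedy token matching (illustrative)."""
    # Precision: each candidate token is matched to its closest reference token
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    # Recall: each reference token is matched to its closest candidate token
    recall = sum(
        max(cosine(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With the toy inputs `cand_vecs=[[1, 0], [0, 1]]` and `ref_vecs=[[1, 0]]`, the second candidate token has no similar reference token, so precision is 0.5 while recall is 1.0.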


In [27]:
res = compute_bert_score(eval_data)
print("Average BERTScore:", res)

 I highly recommend this coffee for its bold, nutty, and fruity flavor, but I wish it came in larger 16oz bags to reduce the frequency of my purchases.
 I was captivated by the aroma and appearance of Chocolate-Covered Strawberry Delight ice cream, and the first bite was a delightful mix of creamy strawberry ice cream and crunchy chocolate-covered strawberries.
 Arroz Delicioso is an authentic, flavorful Mexican rice that cooks quickly and evenly, making it perfect for a variety of dishes and a convenient pantry staple.
 FreshCrunch salad mixes offer a convenient, nutritious, and delicious meal solution with innovative packaging that keeps greens fresh for up to 10 days.
 The Click-Clack Grill Tongs are a game-changer for grill enthusiasts, with their innovative 'Click-Clack' mechanism, ergonomic design, and scalloped edge for a secure grip, making grilling more enjoyable and efficient.
 Fresh 'n' Quik Baby Food is a convenient, nutritious, and delicious option for parents seeking a qu

### Task 3: Use an LLM-as-a-Judge Approach to Evaluate Custom Metrics


The LLM-as-a-Judge approach evaluates text summarization quality by asking a large language model (LLM) to compare a model-generated summary with the ground-truth summary, based on criteria like:

- Faithfulness
- Relevance
- Fluency
- Conciseness

In [32]:
def llm_as_judge_eval(eval_data):
    """
    Uses an LLM to compare model summaries with ground-truth summaries.

    Args:
        eval_data (DataFrame): Must contain 'inputs' and 'ground_truth'.

    Returns:
        List of dicts with the keys 'model_summary', 'ground_truth', and
        'judgment' (the judge's raw text response).

    Note:
        The judge is the same local `pipe` that generated the summaries.
    """
    # Step 1: Get model predictions
    predictions = query_iteration(eval_data)

    results = []

    # Step 2: Pair rows with predictions by position rather than index label,
    # so the loop also works when eval_data does not have a default index
    for (_, row), prediction in zip(eval_data.iterrows(), predictions):
        original = row["inputs"]
        reference = row["ground_truth"]

        prompt = f"""
                  You are an expert summarization evaluator. Compare the following two summaries of the same grocery product review. Judge them based on:
                  - Faithfulness to the original
                  - Fluency
                  - Relevance
                  - Conciseness

                  === Original Review ===
                  {original}

                  === Summary A ===
                  {prediction}

                  === Summary B ===
                  {reference}

                  Which summary is better?

                  Please respond in the following format:
                  Answer: <Summary A, Summary B, or Tie>
                  Explanation: <brief justification>
              """
        # Step 3: Send the prompt as a chat message, as query_summary_system does
        messages = [{"role": "user", "content": prompt}]
        output = pipe(messages)[0]["generated_text"].strip()

        results.append({
            "model_summary": prediction,
            "ground_truth": reference,
            "judgment": output
        })

    return results

In [33]:
from collections import Counter

def summarize_judgments(judgments):
    verdicts = [j["judgment"].split("\n")[0] for j in judgments]
    return Counter(verdicts)
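
Taking the first line of each judgment works only when the judge starts its reply with the verdict. As a hedged alternative, the sketch below searches for the `Answer:` line anywhere in the response and counts replies that do not follow the requested format separately. The names `extract_verdict` and `tally_verdicts` and the `"unparsed"` label are hypothetical.

```python
import re
from collections import Counter

# Matches a line such as "Answer: Summary A" anywhere in the response
ANSWER_RE = re.compile(r"^\s*Answer:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)


def extract_verdict(judgment):
    """Return the text after 'Answer:' or 'unparsed' if no such line exists."""
    match = ANSWER_RE.search(judgment)
    return match.group(1) if match else "unparsed"


def tally_verdicts(judgments):
    # Each judgment is a dict with a 'judgment' key, as built by llm_as_judge_eval
    return Counter(extract_verdict(j["judgment"]) for j in judgments)
```

A high `"unparsed"` count is itself a useful signal: it suggests the judge prompt or the judge model needs adjusting before the tallies can be trusted.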


In [34]:
judgments = llm_as_judge_eval(eval_data)
summary_stats = summarize_judgments(judgments)

print(summary_stats)


 I highly recommend this coffee for its bold, nutty, and fruity flavor, but I wish it came in larger 16oz bags to reduce the frequency of my purchases.
 I was captivated by the aroma and appearance of Chocolate-Covered Strawberry Delight ice cream, and the first bite was a delightful mix of creamy strawberry ice cream and crunchy chocolate-covered strawberries.
 Arroz Delicioso is an authentic, flavorful Mexican rice that cooks quickly and evenly, making it perfect for a variety of dishes and a convenient pantry staple.
 FreshCrunch salad mixes offer a convenient, nutritious, and delicious meal solution with innovative packaging that keeps greens fresh for up to 10 days.
 The Click-Clack Grill Tongs are a game-changer for grill enthusiasts, with their innovative 'Click-Clack' mechanism, ergonomic design, and scalloped edge for a secure grip, making grilling more enjoyable and efficient.
 Fresh 'n' Quik Baby Food is a convenient, nutritious, and delicious option for parents seeking a qu