
# Domain-Specific Evaluation 


In this lab, you will have the opportunity to evaluate a large language model on a specific task **using a dataset designed for this exact evaluation.**

**Lab Outline:**

*In this lab, you will need to complete the following tasks:*

- **Task 1:** Create a Benchmark Dataset
- **Task 2:** Compute ROUGE on Custom Benchmark Data
- **Task 3:** Use an LLM-as-a-Judge approach to evaluate custom metrics

In [0]:
%pip install -U -qq databricks-sdk rouge_score textstat mlflow tiktoken
dbutils.library.restartPython()

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jupyter-server 1.23.4 requires anyio<4,>=3.1.0, but you have anyio 4.9.0 which is incompatible.
Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


Before starting the Lab, run the provided classroom setup script. This script will define configuration variables necessary for the lab. Execute the following cell:

In [0]:
%run ../Includes/Classroom-Setup-03


The examples and models presented in this course are intended solely for demonstration and educational purposes.
 Please note that the models and prompt examples may sometimes contain offensive, inaccurate, biased, or harmful content.


**Other Conventions:**

Throughout this lab, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets}")

Username:          labuser11003544_1753435669@vocareum.com
Catalog Name:      dbacademy
Schema Name:       labuser11003544_1753435669
Working Directory: /Volumes/dbacademy/ops/labuser11003544_1753435669@vocareum_com
Dataset Location:  NestedNamespace (news='/Volumes/dbacademy_news/v01', arxiv='/Volumes/dbacademy_arxiv/v01')


# Lab Overview

In this lab, you will again be evaluating the performance of an AI system designed to summarize text.

In [0]:
query_product_summary_system(
    "This is the best frozen pizza I've ever had! Sure, it's not the healthiest, but it tasted just like it was delivered from our favorite pizzeria down the street. The cheese browned nicely and fresh tomatoes are a nice touch, too! I would buy it again despite it's high price. If I could change one thing, I'd made it a little healthier – could we get a gluten-free crust option? My son would love that."
)

"I think this is the best frozen pizza I've ever had, with a delicious taste similar to a pizzeria, and I would buy it again despite its high price."

This time, however, you will evaluate the LLM using a curated benchmark set built specifically for this use case.

This lab follows these steps:

1. Create a custom benchmark dataset specific to the use case
2. Compute summarization-specific evaluation metrics using the custom benchmark data set
3. Use an LLM-as-a-Judge approach to evaluate custom metrics

## Task 1: Create a Benchmark Dataset

Recall that ROUGE requires reference sets to compute scores. In our demo, we used a large, generic benchmark set.

In this lab, you will instead use a domain-specific benchmark set tailored to the use case.

### Case-Specific Benchmark Set

While the case-specific data set likely won't be as large, it does have the advantage of being **more representative of the task we're actually asking the LLM to perform.**

Below, we've started to create a dataset for grocery product review summaries. It's your task to add **two more** product reviews and their summaries to this dataset.

**Hint:** Try opening up another tab and using AI Playground to generate some examples! Just be sure to manually check them since this is our *ground-truth evaluation data*.

**Note:** For this task, we're creating an *extremely small* reference set. In practice, you'll want to create one with far more example records.

In [0]:

import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "This coffee is exceptional. Its intensely bold flavor profile is both nutty and fruity – especially with notes of plum and citrus. While the price is relatively good, I find myself needing to purchase bags too often. If this came in 16oz bags instead of just 12oz bags, I'd purchase it all the time. I highly recommend they start scaling up their bag size.",
            "The moment I opened the tub of Chocolate-Covered Strawberry Delight ice cream, I was greeted by the enticing aroma of fresh strawberries and rich chocolate. The appearance of the ice cream was equally appealing, with a swirl of pink strawberry ice cream and chunks of chocolate-covered strawberries scattered throughout. The first bite did not disappoint. The strawberry ice cream was creamy and flavorful, with a natural sweetness that was not overpowering. The chocolate-covered strawberries added a satisfying crunch fruity bite.",
            "Arroz Delicioso is a must-try for Mexican cuisine enthusiasts! This authentic Mexican rice, infused with a blend of tomatoes, onions, and garlic, brings a burst of flavor to any meal. Its vibrant color and delightful aroma will transport you straight to the heart of Mexico. The rice cooks evenly, resulting in separate, fluffy grains that hold their shape, making it perfect for dishes like arroz con pollo or as a side for tacos. With a cook time of just 20 minutes, Arroz Delicioso is a convenient and delicious addition to your pantry. Give it a try and elevate your Mexican food game!",
            "FreshCrunch salad mixes are revolutionizing the way we think about packaged salads! Each bag is packed with a vibrant blend of crisp, nutrient-rich greens, including baby spinach, arugula, and kale. The veggies are pre-washed and ready to eat, making meal prep a breeze. FreshCrunch sets itself apart with its innovative packaging that keeps the greens fresh for up to 10 days, reducing food waste and ensuring you always have a healthy option on hand. The salad mixes are versatile and pair well with various dressings and toppings. Try FreshCrunch for a convenient, delicious, and nutritious meal solution that doesn't compromise on quality or taste!",
            "If you're a grill enthusiast like me, you know the importance of having the right tools for the job. That's why I was thrilled to get my hands on the new Click-Clack Grill Tongs. These tongs are not just any ordinary grilling utensil; they're a game-changer. First impressions matter, and the Click-Clack Grill Tongs certainly deliver. The sleek, stainless steel design exudes a professional feel, and the ergonomic handle ensures a comfortable grip even during those long grilling sessions. But what truly sets these tongs apart is their innovative 'Click-Clack' mechanism. With a simple press of a button, the tongs automatically open and close, allowing for precise control when flipping or turning your food. No more struggling with stiff, unwieldy tongs that can ruin your carefully prepared meals. The tongs also feature a scalloped edge, which provides a secure grip on everything from juicy steaks to delicate vegetables. And with their generous length, you can keep your hands safely away from the heat while still maintaining optimal control. Cleanup is a breeze thanks to the dishwasher-safe construction, and the integrated hanging loop makes storage a snap. In conclusion, the Click-Clack Grill Tongs have earned a permanent spot in my grilling arsenal. They've made my grilling experience more enjoyable and efficient, and I'm confident they'll do the same for you. So, if you're looking to up your grilling game, I highly recommend giving these tongs a try. Happy grilling!",
            "As a parent, I understand the importance of providing my child with nutritious, wholesome food. That's why I was thrilled to discover Fresh 'n' Quik Baby Food, a new product that promises to deliver fresh, homemade baby food in minutes. The concept behind Fresh 'n' Quik is simple yet ingenious. The system consists of pre-portioned, organic fruit and vegetable purees that can be quickly and easily blended with breast milk, formula, or water to create a nutritious meal for your little one. The purees are made with high-quality ingredients, free from additives, preservatives, and artificial flavors, ensuring that your baby receives only the best. One of the standout features of Fresh 'n' Quik is the convenience it offers. The purees come in individual, resealable pouches that can be stored in the freezer until you're ready to use them. When it's time to feed your baby, simply pop a pouch into the Fresh 'n' Quik blender, add your liquid of choice, and blend. In less than a minute, you have a fresh, homemade meal that's ready to serve. The blender itself is compact, easy to use, and even easier to clean. The blades are removable, making it a breeze to rinse off any leftover puree. And the best part? The blender is whisper-quiet, so you don't have to worry about waking your sleeping baby while preparing their meal. But what truly sets Fresh 'n' Quik apart is the variety of flavors available. From classic combinations like apple and banana to more adventurous options like mango and kale, there's something for every palate. And because the purees are made with real fruits and vegetables, your baby is exposed to a wide range of flavors and textures, helping to cultivate a diverse and adventurous palate from an early age. In conclusion, Fresh 'n' Quik Baby Food is a game-changer for parents seeking a convenient, nutritious, and delicious option for their little ones. 
The system is easy to use, quick to clean, and offers a wide variety of flavors to keep your baby's taste buds excited. I highly recommend giving Fresh 'n' Quik a try – your baby (and your schedule) will thank you!"
        ],
        "ground_truth": [
            "This bold, nutty, and fruity coffee is delicious, and they need to start selling it in larger bags.",
            "Chocolate-Covered Strawberry Delight ice cream looks delicious with its aroma of strawberry and chocolate, and its creamy, naturally sweet taste did not disappoint.",
            "Arroz Delicioso offers authentic, flavorful Mexican rice with a blend of tomatoes, onions, and garlic, cooking evenly into separate, fluffy grains in just 20 minutes, making it a convenient and delicious choice for dishes like arroz con pollo or as a side for tacos.",
            "FreshCrunch salad mixes offer convenient, pre-washed, nutrient-rich greens in an innovative packaging that keeps them fresh for up to 10 days, providing a versatile, tasty, and waste-reducing healthy meal solution.",
            "The Click-Clack Grill Tongs are a high-quality, innovative grilling tool with a sleek design, comfortable grip, and an automatic opening/closing mechanism for precise control. These tongs have made grilling more enjoyable and efficient, and are highly recommended for anyone looking to improve their grilling experience.",
            "Fresh 'n' Quik Baby Food is a revolutionary product that delivers fresh, homemade baby food in minutes. With pre-portioned, organic fruit and vegetable purees, the system offers convenience, high-quality ingredients, and a wide range of flavors to cultivate a diverse palate in your little one. The blender is compact, easy to use, and whisper-quiet, making mealtime a breeze. Fresh 'n' Quik Baby Food is a must-try for parents seeking a nutritious and delicious option for their babies."
        ],
    }
)

display(eval_data)

inputs,ground_truth
"This coffee is exceptional. Its intensely bold flavor profile is both nutty and fruity – especially with notes of plum and citrus. While the price is relatively good, I find myself needing to purchase bags too often. If this came in 16oz bags instead of just 12oz bags, I'd purchase it all the time. I highly recommend they start scaling up their bag size.","This bold, nutty, and fruity coffee is delicious, and they need to start selling it in larger bags."
"The moment I opened the tub of Chocolate-Covered Strawberry Delight ice cream, I was greeted by the enticing aroma of fresh strawberries and rich chocolate. The appearance of the ice cream was equally appealing, with a swirl of pink strawberry ice cream and chunks of chocolate-covered strawberries scattered throughout. The first bite did not disappoint. The strawberry ice cream was creamy and flavorful, with a natural sweetness that was not overpowering. The chocolate-covered strawberries added a satisfying crunch fruity bite.","Chocolate-Covered Strawberry Delight ice cream looks delicious with its aroma of strawberry and chocolate, and its creamy, naturally sweet taste did not disappoint."
"Arroz Delicioso is a must-try for Mexican cuisine enthusiasts! This authentic Mexican rice, infused with a blend of tomatoes, onions, and garlic, brings a burst of flavor to any meal. Its vibrant color and delightful aroma will transport you straight to the heart of Mexico. The rice cooks evenly, resulting in separate, fluffy grains that hold their shape, making it perfect for dishes like arroz con pollo or as a side for tacos. With a cook time of just 20 minutes, Arroz Delicioso is a convenient and delicious addition to your pantry. Give it a try and elevate your Mexican food game!","Arroz Delicioso offers authentic, flavorful Mexican rice with a blend of tomatoes, onions, and garlic, cooking evenly into separate, fluffy grains in just 20 minutes, making it a convenient and delicious choice for dishes like arroz con pollo or as a side for tacos."
"FreshCrunch salad mixes are revolutionizing the way we think about packaged salads! Each bag is packed with a vibrant blend of crisp, nutrient-rich greens, including baby spinach, arugula, and kale. The veggies are pre-washed and ready to eat, making meal prep a breeze. FreshCrunch sets itself apart with its innovative packaging that keeps the greens fresh for up to 10 days, reducing food waste and ensuring you always have a healthy option on hand. The salad mixes are versatile and pair well with various dressings and toppings. Try FreshCrunch for a convenient, delicious, and nutritious meal solution that doesn't compromise on quality or taste!","FreshCrunch salad mixes offer convenient, pre-washed, nutrient-rich greens in an innovative packaging that keeps them fresh for up to 10 days, providing a versatile, tasty, and waste-reducing healthy meal solution."
"If you're a grill enthusiast like me, you know the importance of having the right tools for the job. That's why I was thrilled to get my hands on the new Click-Clack Grill Tongs. These tongs are not just any ordinary grilling utensil; they're a game-changer. First impressions matter, and the Click-Clack Grill Tongs certainly deliver. The sleek, stainless steel design exudes a professional feel, and the ergonomic handle ensures a comfortable grip even during those long grilling sessions. But what truly sets these tongs apart is their innovative 'Click-Clack' mechanism. With a simple press of a button, the tongs automatically open and close, allowing for precise control when flipping or turning your food. No more struggling with stiff, unwieldy tongs that can ruin your carefully prepared meals. The tongs also feature a scalloped edge, which provides a secure grip on everything from juicy steaks to delicate vegetables. And with their generous length, you can keep your hands safely away from the heat while still maintaining optimal control. Cleanup is a breeze thanks to the dishwasher-safe construction, and the integrated hanging loop makes storage a snap. In conclusion, the Click-Clack Grill Tongs have earned a permanent spot in my grilling arsenal. They've made my grilling experience more enjoyable and efficient, and I'm confident they'll do the same for you. So, if you're looking to up your grilling game, I highly recommend giving these tongs a try. Happy grilling!","The Click-Clack Grill Tongs are a high-quality, innovative grilling tool with a sleek design, comfortable grip, and an automatic opening/closing mechanism for precise control. These tongs have made grilling more enjoyable and efficient, and are highly recommended for anyone looking to improve their grilling experience."
"As a parent, I understand the importance of providing my child with nutritious, wholesome food. That's why I was thrilled to discover Fresh 'n' Quik Baby Food, a new product that promises to deliver fresh, homemade baby food in minutes. The concept behind Fresh 'n' Quik is simple yet ingenious. The system consists of pre-portioned, organic fruit and vegetable purees that can be quickly and easily blended with breast milk, formula, or water to create a nutritious meal for your little one. The purees are made with high-quality ingredients, free from additives, preservatives, and artificial flavors, ensuring that your baby receives only the best. One of the standout features of Fresh 'n' Quik is the convenience it offers. The purees come in individual, resealable pouches that can be stored in the freezer until you're ready to use them. When it's time to feed your baby, simply pop a pouch into the Fresh 'n' Quik blender, add your liquid of choice, and blend. In less than a minute, you have a fresh, homemade meal that's ready to serve. The blender itself is compact, easy to use, and even easier to clean. The blades are removable, making it a breeze to rinse off any leftover puree. And the best part? The blender is whisper-quiet, so you don't have to worry about waking your sleeping baby while preparing their meal. But what truly sets Fresh 'n' Quik apart is the variety of flavors available. From classic combinations like apple and banana to more adventurous options like mango and kale, there's something for every palate. And because the purees are made with real fruits and vegetables, your baby is exposed to a wide range of flavors and textures, helping to cultivate a diverse and adventurous palate from an early age. In conclusion, Fresh 'n' Quik Baby Food is a game-changer for parents seeking a convenient, nutritious, and delicious option for their little ones. 
The system is easy to use, quick to clean, and offers a wide variety of flavors to keep your baby's taste buds excited. I highly recommend giving Fresh 'n' Quik a try – your baby (and your schedule) will thank you!","Fresh 'n' Quik Baby Food is a revolutionary product that delivers fresh, homemade baby food in minutes. With pre-portioned, organic fruit and vegetable purees, the system offers convenience, high-quality ingredients, and a wide range of flavors to cultivate a diverse palate in your little one. The blender is compact, easy to use, and whisper-quiet, making mealtime a breeze. Fresh 'n' Quik Baby Food is a must-try for parents seeking a nutritious and delicious option for their babies."


**Question:** What are some strategies for evaluating your custom-generated benchmark data set? For example:
* How can you scale the curation?
* How do you know if the ground truth is correct?
* Who should have input?
* Should it remain static over time?
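One way to approach the "is the ground truth correct?" question is to automate the cheap structural checks, so that human reviewers can spend their time on meaning. The sketch below is only an illustration: `validate_benchmark` and the sample records are hypothetical, and the checks (non-empty fields, no duplicate inputs, summary shorter than its input) are assumptions about what a reasonable benchmark looks like, not a substitute for manual review.

```python
def validate_benchmark(records):
    """Return a list of human-readable issues found in benchmark records.

    Each record is a dict with "inputs" and "ground_truth" keys, the same
    shape that eval_data.to_dict("records") would produce.
    """
    issues = []
    seen_inputs = set()
    for i, record in enumerate(records):
        text = (record.get("inputs") or "").strip()
        summary = (record.get("ground_truth") or "").strip()

        # Both fields must be present before any other check makes sense.
        if not text or not summary:
            issues.append(f"row {i}: empty input or ground truth")
            continue

        # Duplicate inputs inflate the apparent size of the benchmark.
        if text in seen_inputs:
            issues.append(f"row {i}: duplicate input")
        seen_inputs.add(text)

        # A reference summary should be shorter than the text it summarizes.
        if len(summary.split()) >= len(text.split()):
            issues.append(f"row {i}: summary is not shorter than its input")
    return issues


# Hypothetical records, made up for illustration only.
sample_records = [
    {
        "inputs": "The bread was fresh and soft, but the loaf was too small for the price.",
        "ground_truth": "Fresh, soft bread that is too small for its price.",
    },
    {"inputs": "Great tea.", "ground_truth": "This tea is great and I would buy it again."},
    {"inputs": "Great tea.", "ground_truth": ""},
]

for issue in validate_benchmark(sample_records):
    print(issue)
```

Checks like these only catch structural problems. Whether a summary is faithful to its review still needs a human (or a separately validated judge) to confirm.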

Next, we're saving this reference data set for future use.

In [0]:
spark_df = spark.createDataFrame(eval_data)
spark_df.write.mode("overwrite").saveAsTable(f"{DA.catalog_name}.{DA.schema_name}.case_spec_summ_eval")

## Task 2: Compute ROUGE on Custom Benchmark Data

Next, compute ROUGE-N metrics to understand how well the system summarizes grocery product reviews, using the reference summaries you just created as ground truth.
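As a reminder of what ROUGE-N measures, the sketch below computes n-gram overlap between a reference and a candidate summary in plain Python. It is a simplified illustration, not the implementation MLflow uses: the `tokenize` and `rouge_n` helpers are hypothetical, tokenization is a bare lowercase alphanumeric split, and no stemming is applied, so scores will not match library output exactly.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only (simplified)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Return precision, recall, and F-measure of n-gram overlap."""
    ref = ngram_counts(tokenize(reference), n)
    cand = ngram_counts(tokenize(candidate), n)

    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n(reference, candidate, n=1))  # 5 of 6 unigrams overlap
print(rouge_n(reference, candidate, n=2))  # 3 of 5 bigrams overlap
```

Because the score depends entirely on overlap with the reference, the quality of the ground-truth summaries directly limits how meaningful the metric is.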

Remember that the `mlflow.evaluate` function accepts the following parameters for this use case:

* An LLM model
* Reference data for evaluation
* Column with ground truth data
* The model/task type (e.g. `"text-summarization"`)


### Step 2.1: Run the Evaluation

Instead of using the generic benchmark dataset like in the demo, your task is to **compute ROUGE metrics using the case-specific benchmark data that we just created.**

**Note:** If needed, refer back to the demo to complete the below code blocks.

First, define a function that iterates through the rows of the evaluation DataFrame so it can be passed to `mlflow.evaluate`.

In [0]:

## A custom function to iterate through our eval DF
def query_iteration(inputs):
    answers = []

    for index, row in inputs.iterrows():
        completion = query_product_summary_system(row["inputs"])
        answers.append(completion)

    return answers

## Test query_iteration function – it needs to return a list of output strings
query_iteration(eval_data)

['I highly recommend this exceptional coffee with its intensely bold and fruity flavor profile, but I wish it came in larger bag sizes to reduce frequent purchases.',
 'I was thoroughly impressed with the Chocolate-Covered Strawberry Delight ice cream, which had a delicious aroma, appealing appearance, and a creamy, flavorful taste with a perfect balance of sweet and fruity notes.',
 'I highly recommend Arroz Delicioso, an authentic Mexican rice that brings a burst of flavor to any meal with its blend of tomatoes, onions, and garlic, and cooks to perfect, fluffy grains in just 20 minutes.',
 'I love FreshCrunch salad mixes because they offer a convenient and delicious way to enjoy fresh, nutrient-rich greens that stay crisp for up to 10 days.',
 "I'm thrilled with the Click-Clack Grill Tongs, which have become a staple in my grilling arsenal due to their innovative mechanism, ergonomic design, and ease of use and cleanup.",
 "I highly recommend Fresh 'n' Quik Baby Food, a convenient an

Next, use the above function and `mlflow.evaluate` to perform the `text-summarization` evaluation.

In [0]:

import mlflow

## MLflow's `evaluate` with a custom function
results = mlflow.evaluate(
    query_iteration,                      ## iterative function from above
    eval_data,                            ## eval DF
    targets="ground_truth",               ## column with expected or "good" output
    model_type="text-summarization"       ## type of model or task
)


 - For traditional ML or deep learning models: Use `mlflow.models.evaluate`, which maintains full compatibility with the original `mlflow.evaluate` API.

 - For LLMs or GenAI applications: Use the new `mlflow.genai.evaluate` API, which offers enhanced features specifically designed for evaluating LLMs and GenAI applications.

2025/07/25 11:38:02 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/07/25 11:38:06 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


### Step 2.2: Evaluate the Results

Next, take a look at the results.

In [0]:

display(results.tables["eval_results_table"])


inputs,ground_truth,outputs,token_count,flesch_kincaid_grade_level/v1/score,ari_grade_level/v1/score
"This coffee is exceptional. Its intensely bold flavor profile is both nutty and fruity – especially with notes of plum and citrus. While the price is relatively good, I find myself needing to purchase bags too often. If this came in 16oz bags instead of just 12oz bags, I'd purchase it all the time. I highly recommend they start scaling up their bag size.","This bold, nutty, and fruity coffee is delicious, and they need to start selling it in larger bags.","I highly recommend this exceptional coffee with its intensely bold and fruity flavor profile, but I wish it came in larger bag sizes to reduce frequent purchases.",29,14.6066666667,15.7944444444
"The moment I opened the tub of Chocolate-Covered Strawberry Delight ice cream, I was greeted by the enticing aroma of fresh strawberries and rich chocolate. The appearance of the ice cream was equally appealing, with a swirl of pink strawberry ice cream and chunks of chocolate-covered strawberries scattered throughout. The first bite did not disappoint. The strawberry ice cream was creamy and flavorful, with a natural sweetness that was not overpowering. The chocolate-covered strawberries added a satisfying crunch fruity bite.","Chocolate-Covered Strawberry Delight ice cream looks delicious with its aroma of strawberry and chocolate, and its creamy, naturally sweet taste did not disappoint.","I was thoroughly impressed with the Chocolate-Covered Strawberry Delight ice cream, which had a delicious aroma, appealing appearance, and a creamy, flavorful taste with a perfect balance of sweet and fruity notes.",41,17.17125,21.5053125
"Arroz Delicioso is a must-try for Mexican cuisine enthusiasts! This authentic Mexican rice, infused with a blend of tomatoes, onions, and garlic, brings a burst of flavor to any meal. Its vibrant color and delightful aroma will transport you straight to the heart of Mexico. The rice cooks evenly, resulting in separate, fluffy grains that hold their shape, making it perfect for dishes like arroz con pollo or as a side for tacos. With a cook time of just 20 minutes, Arroz Delicioso is a convenient and delicious addition to your pantry. Give it a try and elevate your Mexican food game!","Arroz Delicioso offers authentic, flavorful Mexican rice with a blend of tomatoes, onions, and garlic, cooking evenly into separate, fluffy grains in just 20 minutes, making it a convenient and delicious choice for dishes like arroz con pollo or as a side for tacos.","I highly recommend Arroz Delicioso, an authentic Mexican rice that brings a burst of flavor to any meal with its blend of tomatoes, onions, and garlic.",33,13.6115384615,14.3953846154
"FreshCrunch salad mixes are revolutionizing the way we think about packaged salads! Each bag is packed with a vibrant blend of crisp, nutrient-rich greens, including baby spinach, arugula, and kale. The veggies are pre-washed and ready to eat, making meal prep a breeze. FreshCrunch sets itself apart with its innovative packaging that keeps the greens fresh for up to 10 days, reducing food waste and ensuring you always have a healthy option on hand. The salad mixes are versatile and pair well with various dressings and toppings. Try FreshCrunch for a convenient, delicious, and nutritious meal solution that doesn't compromise on quality or taste!","FreshCrunch salad mixes offer convenient, pre-washed, nutrient-rich greens in an innovative packaging that keeps them fresh for up to 10 days, providing a versatile, tasty, and waste-reducing healthy meal solution.","I love FreshCrunch salad mixes because they offer a convenient and nutritious meal solution with their pre-washed, long-lasting, and versatile blends of fresh greens.",32,12.945,18.63375
"If you're a grill enthusiast like me, you know the importance of having the right tools for the job. That's why I was thrilled to get my hands on the new Click-Clack Grill Tongs. These tongs are not just any ordinary grilling utensil; they're a game-changer. First impressions matter, and the Click-Clack Grill Tongs certainly deliver. The sleek, stainless steel design exudes a professional feel, and the ergonomic handle ensures a comfortable grip even during those long grilling sessions. But what truly sets these tongs apart is their innovative 'Click-Clack' mechanism. With a simple press of a button, the tongs automatically open and close, allowing for precise control when flipping or turning your food. No more struggling with stiff, unwieldy tongs that can ruin your carefully prepared meals. The tongs also feature a scalloped edge, which provides a secure grip on everything from juicy steaks to delicate vegetables. And with their generous length, you can keep your hands safely away from the heat while still maintaining optimal control. Cleanup is a breeze thanks to the dishwasher-safe construction, and the integrated hanging loop makes storage a snap. In conclusion, the Click-Clack Grill Tongs have earned a permanent spot in my grilling arsenal. They've made my grilling experience more enjoyable and efficient, and I'm confident they'll do the same for you. So, if you're looking to up your grilling game, I highly recommend giving these tongs a try. Happy grilling!","The Click-Clack Grill Tongs are a high-quality, innovative grilling tool with a sleek design, comfortable grip, and an automatic opening/closing mechanism for precise control. 
These tongs have made grilling more enjoyable and efficient, and are highly recommended for anyone looking to improve their grilling experience.","I'm thoroughly impressed with the Click-Clack Grill Tongs, which have become a staple in my grilling arsenal due to their innovative mechanism, comfortable design, and ease of use.",38,15.1371428571,18.3067857143
"As a parent, I understand the importance of providing my child with nutritious, wholesome food. That's why I was thrilled to discover Fresh 'n' Quik Baby Food, a new product that promises to deliver fresh, homemade baby food in minutes. The concept behind Fresh 'n' Quik is simple yet ingenious. The system consists of pre-portioned, organic fruit and vegetable purees that can be quickly and easily blended with breast milk, formula, or water to create a nutritious meal for your little one. The purees are made with high-quality ingredients, free from additives, preservatives, and artificial flavors, ensuring that your baby receives only the best. One of the standout features of Fresh 'n' Quik is the convenience it offers. The purees come in individual, resealable pouches that can be stored in the freezer until you're ready to use them. When it's time to feed your baby, simply pop a pouch into the Fresh 'n' Quik blender, add your liquid of choice, and blend. In less than a minute, you have a fresh, homemade meal that's ready to serve. The blender itself is compact, easy to use, and even easier to clean. The blades are removable, making it a breeze to rinse off any leftover puree. And the best part? The blender is whisper-quiet, so you don't have to worry about waking your sleeping baby while preparing their meal. But what truly sets Fresh 'n' Quik apart is the variety of flavors available. From classic combinations like apple and banana to more adventurous options like mango and kale, there's something for every palate. And because the purees are made with real fruits and vegetables, your baby is exposed to a wide range of flavors and textures, helping to cultivate a diverse and adventurous palate from an early age. In conclusion, Fresh 'n' Quik Baby Food is a game-changer for parents seeking a convenient, nutritious, and delicious option for their little ones. 
The system is easy to use, quick to clean, and offers a wide variety of flavors to keep your baby's taste buds excited. I highly recommend giving Fresh 'n' Quik a try – your baby (and your schedule) will thank you!","Fresh 'n' Quik Baby Food is a revolutionary product that delivers fresh, homemade baby food in minutes. With pre-portioned, organic fruit and vegetable purees, the system offers convenience, high-quality ingredients, and a wide range of flavors to cultivate a diverse palate in your little one. The blender is compact, easy to use, and whisper-quiet, making mealtime a breeze. Fresh 'n' Quik Baby Food is a must-try for parents seeking a nutritious and delicious option for their babies.","I highly recommend Fresh 'n' Quik Baby Food, a convenient and nutritious system that allows me to quickly blend homemade-style meals for my baby using pre-portioned, organic purees and my choice of liquid.",44,15.8739393939,19.7618181818


**Question:** How do we interpret these results? What do they tell us about the summarization quality? About our LLM?
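When interpreting the table, it helps to remember that the `flesch_kincaid_grade_level` and `ari_grade_level` columns are readability scores: they estimate the school grade level needed to read the output and say nothing about agreement with the ground truth. The sketch below computes the standard Automated Readability Index formula in plain Python. The `automated_readability_index` helper is hypothetical, and its word and sentence splitting is deliberately naive, so its values should not be expected to match the table.

```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    words = text.split()
    if not words:
        return 0.0
    # Count letters and digits only; punctuation and spaces are ignored.
    characters = sum(ch.isalnum() for ch in text)
    # Treat each run of terminal punctuation as one sentence boundary.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 4.71 * characters / len(words) + 0.5 * len(words) / sentences - 21.43


simple = "The cat sat."
dense = "Comprehensive evaluation necessitates representative benchmark datasets."

print(round(automated_readability_index(simple), 2))
print(round(automated_readability_index(dense), 2))
```

Longer words and longer sentences push the score up, which is why single-sentence summaries packed with clauses tend to land at high grade levels.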

Next, view the aggregated metrics to see how the LLM performed across the entire dataset.

In [0]:
results.metrics

{'flesch_kincaid_grade_level/v1/mean': 14.890922896547899,
 'flesch_kincaid_grade_level/v1/variance': 1.9551799096595943,
 'flesch_kincaid_grade_level/v1/p90': 16.522594696969698,
 'ari_grade_level/v1/mean': 18.06624924265549,
 'ari_grade_level/v1/variance': 5.619728195360746,
 'ari_grade_level/v1/p90': 20.633565340909087}
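
To help interpret the grade-level numbers above, the following is a minimal sketch of the Automated Readability Index (ARI) using its standard published formula. The function name `ari_grade_level` and the tokenization rules here are assumptions made for illustration; the library that MLflow calls may count sentences, words, and characters slightly differently, so expect small discrepancies from the values in the table.

```python
import re


def ari_grade_level(text: str) -> float:
    """Approximate Automated Readability Index for English text."""
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    # ARI counts letters and digits only, not punctuation or spaces.
    characters = sum(ch.isalnum() for ch in text)
    if not sentences or not words:
        return 0.0
    return (
        4.71 * characters / len(words)
        + 0.5 * len(words) / len(sentences)
        - 21.43
    )


simple = "The cat sat. The dog ran."
dense = "Organizational stakeholders systematically operationalize interdepartmental synergies."

print(round(ari_grade_level(simple), 2))
print(round(ari_grade_level(dense), 2))
```

Because the score grows with both average word length and average sentence length, a mean ARI near 18 suggests the generated summaries pack long words into long sentences, which is worth weighing against the intended audience.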

**Bonus:** Take a look at the results in the Experiment Tracking UI.

Do you see any summaries that you think are particularly good or problematic?

## Task 3: Use an LLM-as-a-Judge Approach to Evaluate Custom Metrics

In this task, you will define and evaluate a custom metric called "humor" using an LLM-as-a-Judge approach. The goal is to assess how funny the responses generated by the language model are, based on a set of predefined criteria.


### Step 3.1: Define a Humor Metric

- Define humor and create a grading prompt.

  **To Do:**

  For this task, you are provided with an initial example of humor (`humor_example_score_1`). Your task is to generate another evaluation example (`humor_example_score_2`).

  **Hint:** You can use AI Playground for this. Ensure that the generated example is relevant to the prompt and reflects a different humor score. Manually verify the generated example to ensure its correctness.


In [0]:
## Define an evaluation example for humor with a score of 2
humor_example_score_1 = mlflow.metrics.genai.EvaluationExample(
    input="Tell me a joke!",  
    output=(
        "Why don't scientists trust atoms? Because they make up everything!"  
    ),
    score=2,  ## Humor score assigned to the output
    justification=(
        "The joke uses a common pun and is somewhat humorous, but it may not elicit strong laughter or amusement from everyone."  ## Justification for the assigned score
    ),
)

## Define another evaluation example for humor with a score of 4
humor_example_score_2 = mlflow.metrics.genai.EvaluationExample(
    input="Tell me a joke!",  
    output=(
        "I told my wife she was drawing her eyebrows too high. She looked surprised!"  
    ),
    score=4,  ## Humor score assigned to the output
    justification=(
        "The joke is clever and unexpected, resulting in genuine amusement and laughter. It demonstrates wit and creativity, making it highly enjoyable."  ## Justification for the assigned score
    ),
)
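
Since the lab asks you to verify generated examples manually, a small sanity check can catch obvious mistakes first. The sketch below is an assumption-laden helper, not part of MLflow: `check_examples` is a hypothetical function that only reads the `score` and `justification` attributes, and the demo uses `SimpleNamespace` stand-ins so it runs without MLflow installed.

```python
from types import SimpleNamespace


def check_examples(examples, min_score=1, max_score=5):
    """Return a list of human-readable problems found in few-shot examples."""
    problems = []
    scores = [ex.score for ex in examples]
    for i, ex in enumerate(examples, start=1):
        if not (min_score <= ex.score <= max_score):
            problems.append(
                f"example {i}: score {ex.score} is outside {min_score}-{max_score}"
            )
        if not ex.justification or not ex.justification.strip():
            problems.append(f"example {i}: missing justification")
    # Repeated scores show the judge less of the grading scale.
    if len(set(scores)) < len(scores):
        problems.append("examples share a score; consider covering more of the scale")
    return problems


# Stand-ins that mirror the two examples defined above (hypothetical objects).
demo_examples = [
    SimpleNamespace(score=2, justification="Common pun, somewhat humorous."),
    SimpleNamespace(score=4, justification="Clever and unexpected."),
]
print(check_examples(demo_examples))
```

In the notebook, the same helper could be called as `check_examples([humor_example_score_1, humor_example_score_2])`.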

### Step 3.2: Use LLM-as-a-Judge to Score Responses

* **3.2.1:  Create a metric for comparing the responses for humor**

  Define a custom metric to evaluate the humor in generated responses. This metric will assess the level of humor present in the responses generated by the language model.

In [0]:
## Define the metric for the evaluation
comparison_humor_metric = mlflow.metrics.genai.make_genai_metric(
    name="comparison_humor",
    definition=(
        "Humor refers to the ability to evoke laughter, amusement, or enjoyment through cleverness, wit, or unexpected twists."
    ),
    grading_prompt=(
        ## Each piece ends with "\n" because adjacent string literals are joined without a separator
        "Humor: If the response is funny and induces laughter or amusement, below are the details for different scores:\n"
        "- Score 1: The response attempts humor but falls flat, eliciting little to no laughter or amusement.\n"
        "- Score 2: The response is somewhat humorous, eliciting mild laughter or amusement from some individuals.\n"
        "- Score 3: The response is moderately funny, eliciting genuine laughter or amusement from most individuals.\n"
        "- Score 4: The response is highly humorous, eliciting strong laughter or amusement from nearly everyone.\n"
        "- Score 5: The response is exceptionally funny, resulting in uncontrollable laughter or intense enjoyment."
    ),
    ## Examples for humor
    examples=[
        humor_example_score_1, 
        humor_example_score_2
    ],
    model="endpoints:/databricks-meta-llama-3-3-70b-instruct",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)
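
Python joins adjacent string literals with no separator, so a rubric written as consecutive literals runs together unless each piece ends in a newline. One way to avoid that class of mistake is to build the prompt from a dictionary. This is a minimal sketch; `build_grading_prompt` and `humor_rubric` are hypothetical names introduced here, and the shortened rubric text is for illustration only.

```python
def build_grading_prompt(metric_name, intro, rubric):
    """Join a score rubric into one newline-separated grading prompt."""
    lines = [f"{metric_name}: {intro}"]
    # Emit the levels in ascending score order, one per line.
    for score in sorted(rubric):
        lines.append(f"- Score {score}: {rubric[score]}")
    return "\n".join(lines)


humor_rubric = {
    1: "The response attempts humor but falls flat.",
    2: "The response is somewhat humorous.",
    3: "The response is moderately funny.",
    4: "The response is highly humorous.",
    5: "The response is exceptionally funny.",
}

prompt = build_grading_prompt(
    "Humor",
    "below are the details for different scores:",
    humor_rubric,
)
print(prompt)
```

The resulting string could be passed as the `grading_prompt` argument when defining the metric.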

* **3.2.2: Generate data with varying humor levels**

  Add input prompts and corresponding output responses with different levels of humor. 
  
  **Hint:** You can use AI Playground to generate these values.

In [0]:
## Define testing data with different humor scores for comparison
humor_data = pd.DataFrame(
    {
        "inputs": [
            "Tell me a joke about pandas.",
            "What's a programmer's favorite place to hang out?",
            "Why don't scientists trust atoms?",
            "Why did the scarecrow win an award?"
        ],
        "ground_truth": [
            "Why did the pandas break up? Because they couldn't bamboo-zle their problems away!",
            "The Foo Bar!",
            "Because they make up everything!",
            "Because he was outstanding in his field!"
        ],
    }
)

* **3.2.3: Evaluate the Comparison**

  Next, run the evaluation on the responses generated by the language model. This applies the custom metric to score the humor of each response against the predefined criteria.


In [0]:
benchmark_comparison_results = mlflow.evaluate(
    model="endpoints:/databricks-meta-llama-3-3-70b-instruct",  ## Model that generates the responses
    data=humor_data,                           ## Data for evaluation
    targets="ground_truth",                    ## Column with the ground truth data
    model_type="text-summarization",           ## Type of model or task
    extra_metrics=[comparison_humor_metric],   ## Custom metric for evaluating humor
)


 - For traditional ML or deep learning models: Use `mlflow.models.evaluate`, which maintains full compatibility with the original `mlflow.evaluate` API.

 - For LLMs or GenAI applications: Use the new `mlflow.genai.evaluate` API, which offers enhanced features specifically designed for evaluating LLMs and GenAI applications.

2025/07/25 11:38:15 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/07/25 11:38:22 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...



* **3.2.4: View Comparison Results**

  Now, let's take a look at the evaluation results. The table shows the judge's humor score and justification for each response generated by the language model.


In [0]:
display(benchmark_comparison_results.tables["eval_results_table"])


inputs,ground_truth,outputs,token_count,flesch_kincaid_grade_level/v1/score,ari_grade_level/v1/score,comparison_humor/v1/score,comparison_humor/v1/justification
Tell me a joke about pandas.,Why did the pandas break up? Because they couldn't bamboo-zle their problems away!,"Why did the panda go to the party? Because he heard it was a ""paws-itively"" good time! (get it?)",28,3.0202631579,2.6557894737,4,"The joke is clever and uses a play on words with ""paws-itively"" to create a pun, which is likely to elicit genuine laughter or amusement from most individuals, demonstrating a good level of wit and creativity, making it highly enjoyable."
What's a programmer's favorite place to hang out?,The Foo Bar!,"A programmer's favorite place to hang out... (dramatic pause)...is probably their local coffee shop, where they can fuel up on caffeine and code away on their laptop! But, if I'm being completely honest, their favorite place to hang out is likely to be... (dramatic pause again)...the Internet! Specifically, online communities like GitHub, Stack Overflow, Reddit's r/learnprogramming, and other forums where they can connect with fellow coders, share knowledge, and learn from each other. After all, programmers tend to be a curious and social bunch, and the internet provides them with a vast playground to explore, learn, and collaborate with others who share their passions! Am I right?",144,9.008811749,11.9197730307,2,"The response attempts humor by using dramatic pauses and a playful tone, but it falls short of being highly humorous, as it relies on a somewhat obvious punchline about programmers liking the internet, which may elicit mild amusement from some individuals but not strong laughter or intense enjoyment from nearly everyone."
Why don't scientists trust atoms?,Because they make up everything!,"Scientists actually do trust atoms! Atoms are the basic building blocks of matter, and scientists have a deep understanding of their behavior and properties. In fact, the study of atoms and their interactions is a fundamental part of physics, chemistry, and many other fields of science. Scientists use atomic theories and models to explain and predict the behavior of matter at the atomic and subatomic level, and these theories have been extensively tested and validated through experiments and observations. So, why might you think scientists don't trust atoms? Perhaps it's because of a play on words: ""Atoms"" sounds similar to ""a lot"", and the phrase ""I don't trust atoms"" sounds like ""I don't trust a lot"". This is just a clever pun, but not meant to be taken literally! In short, scientists do trust atoms and rely on our understanding of them to make new discoveries and advancements in various fields of science!",183,10.9184232026,11.8056372549,1,"The response attempts humor by explaining a potential pun, but it falls flat as it spends most of the time providing a serious explanation of atoms and their importance in science, only briefly touching on the play on words, which is not enough to elicit significant laughter or amusement."
Why did the scarecrow win an award?,Because he was outstanding in his field!,"A classic joke! The answer is: ""Because he was outstanding in his field!"" (get it? outstanding in his field... haha)",28,3.53,4.7468333333,4,"The model's response is highly humorous, as it delivers a classic joke with a clever punchline, ""outstanding in his field,"" which is a common play on words that often elicits genuine laughter or amusement from most individuals, demonstrating wit and creativity."
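
To summarize the four rows above, the sketch below recomputes the aggregate humor score by hand and compares short and long responses. The token counts and scores are copied from the table; the 100-token threshold and the use of population variance are assumptions made for illustration, so the figures reported in `benchmark_comparison_results.metrics` may differ if MLflow aggregates differently.

```python
from statistics import mean, pvariance

# (token_count, humor score) pairs copied from the results table above.
rows = [
    {"tokens": 28, "humor": 4},
    {"tokens": 144, "humor": 2},
    {"tokens": 183, "humor": 1},
    {"tokens": 28, "humor": 4},
]

scores = [row["humor"] for row in rows]
humor_mean = mean(scores)
humor_variance = pvariance(scores)  # population variance (assumed)

# Hypothetical cutoff separating one-liners from long explanations.
threshold = 100
short_mean = mean(row["humor"] for row in rows if row["tokens"] < threshold)
long_mean = mean(row["humor"] for row in rows if row["tokens"] >= threshold)

print(f"mean={humor_mean}, variance={humor_variance}")
print(f"short responses={short_mean}, long responses={long_mean}")
```

In this small sample the judge rated the two short one-liners higher than the two long explanatory answers, but four rows are too few to draw a general conclusion.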
