# Vertex AI GenAI Evaluation Service

Vertex AI's Gen AI evaluation service empowers you to assess any generative AI model or application according to your specific requirements.

Instead of relying solely on general leaderboards and reports, you can define your own evaluation criteria and directly compare how different models perform against your unique needs, use case and data.

This service allows you to:

- Gain a deeper understanding of model performance by going beyond general metrics to see how a model handles your specific data and tasks.
- Make informed decisions throughout the development lifecycle by using evaluations to guide model selection, refine prompt engineering, and optimize model customization.
- Streamline your evaluation workflow by leveraging Vertex AI's integrated tools to easily launch and reuse evaluations as needed.

Essentially, it puts you in control of evaluating generative AI, ensuring the models you choose are the best fit for your specific applications.

## Learning Objectives

In this notebook, you will learn:

- How to use the Vertex AI Gen AI evaluation service
- The different types of evaluation techniques (computation-based and model-based)
- How to prepare your dataset for evaluation
- How to analyze and understand the evaluation results



### Setup

In [None]:
# General
import os
import random
import string

import pandas as pd
from IPython.display import Markdown, display

# Main
from vertexai.evaluation import (
    EvalTask,
    MetricPromptTemplateExamples,
    PairwiseMetric,
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
)
from vertexai.generative_models import (
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
)

### Helper Function

This helper function displays the results from the evaluation SDK in a readable format.

In [None]:
def display_eval_report(eval_result, metrics=None):
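    """Display the evaluation results.

    `eval_result` is a `(title, summary_metrics, report_df)` tuple. If
    `metrics` is given, only columns whose names contain one of those
    strings are shown.
    """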

    title, summary_metrics, report_df = eval_result
    metrics_df = pd.DataFrame.from_dict(summary_metrics, orient="index").T
    if metrics:
        metrics_df = metrics_df.filter(
            [
                metric
                for metric in metrics_df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )
        report_df = report_df.filter(
            [
                metric
                for metric in report_df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )

    # Display the title with Markdown for emphasis
    display(Markdown(f"## {title}"))

    # Display the metrics DataFrame
    display(Markdown("### Summary Metrics"))
    display(metrics_df)

    # Display the detailed report DataFrame
    display(Markdown("### Report Metrics"))
    display(report_df)

## Evaluation Process

Vertex AI's Gen AI evaluation service lets you assess any generative model based on your specific needs and criteria.

These are the four steps you will follow to help you evaluate:

1. Define: Tailor metrics to your business goals and choose your evaluation approach (computation-based vs. model-based).
2. Prepare: Create a dataset that reflects your real-world use case.
3. Run: Easily launch evaluations using templates or existing examples. Define your models and create reusable `EvalTasks` within Vertex AI to use in other evaluations.
4. Analyze: Interpret your results and understand how each model performs against your specific criteria.

Let's take a look at these steps one by one.


### Computation Based Metrics

[Computation-based metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#computation-based-metrics) are computed using mathematical formulas that compare the model's output against a **ground truth or reference**. Commonly used examples include ROUGE and BLEU. These metrics can be categorized into the following groups:

- Lexicon-based metrics: Use math to calculate the string similarities between LLM-generated results and ground truth, such as `Exact Match` and `ROUGE`.
- Count-based metrics: Aggregate the number of rows that hit or miss certain ground-truth labels, such as `F1-score`, `Accuracy`, and `Tool Name Match`.
- Embedding-based metrics: Calculate the distance between the LLM-generated results and ground truth in the embedding space, reflecting their level of similarity. These include `cosine similarity` and `euclidean distance`.
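
To make these categories concrete, here is a minimal, dependency-free sketch of one lexicon-based and one embedding-based calculation. It is only an illustration: `unigram_f1` is a simplified, ROUGE-1-style overlap score over whitespace tokens, not the exact implementation the evaluation service uses, and all function names are hypothetical.

```python
import math
from collections import Counter


def exact_match(candidate: str, reference: str) -> float:
    """Return 1.0 when the two strings are identical after trimming, else 0.0."""
    return 1.0 if candidate.strip() == reference.strip() else 0.0


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 over lowercased whitespace tokens."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

All three functions need a reference to compare against, which is why computation-based evaluation requires ground truth in the dataset.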

Let's say that we want to evaluate how well different prompts work for summarization using Gemini. We'll start by defining a few articles in `context`. Because we are using computation-based metrics, we also need ground-truth summaries, which we define in `reference`. `eval_dataset` should be a DataFrame that contains the columns needed for evaluation.

In [None]:
instruction = "Summarize the following article"

context = [
    "To make a classic spaghetti carbonara, start by bringing a large pot of salted water to a boil. While the water is heating up, cook pancetta or guanciale in a skillet with olive oil over medium heat until it's crispy and golden brown. Once the pancetta is done, remove it from the skillet and set it aside. In the same skillet, whisk together eggs, grated Parmesan cheese, and black pepper to make the sauce. When the pasta is cooked al dente, drain it and immediately toss it in the skillet with the egg mixture, adding a splash of the pasta cooking water to create a creamy sauce.",
    "Preparing a perfect risotto requires patience and attention to detail. Begin by heating butter in a large, heavy-bottomed pot over medium heat. Add finely chopped onions and minced garlic to the pot, and cook until they're soft and translucent, about 5 minutes. Next, add Arborio rice to the pot and cook, stirring constantly, until the grains are coated with the butter and begin to toast slightly. Pour in a splash of white wine and cook until it's absorbed. From there, gradually add hot chicken or vegetable broth to the rice, stirring frequently, until the risotto is creamy and the rice is tender with a slight bite.",
    "To bake a decadent chocolate cake from scratch, start by preheating your oven to 350°F (175°C) and greasing and flouring two 9-inch round cake pans. In a large mixing bowl, cream together softened butter and granulated sugar until light and fluffy. Beat in eggs one at a time, making sure each egg is fully incorporated before adding the next. In a separate bowl, sift together all-purpose flour, cocoa powder, baking powder, baking soda, and salt. Divide the batter evenly between the prepared cake pans and bake for 25-30 minutes, or until a toothpick inserted into the center comes out clean.",
]

reference = [
    "The process of making spaghetti carbonara involves boiling pasta, crisping pancetta or guanciale, whisking together eggs and Parmesan cheese, and tossing everything together to create a creamy sauce.",
    "Preparing risotto entails sautéing onions and garlic, toasting Arborio rice, adding wine and broth gradually, and stirring until creamy and tender.",
    "Baking a decadent chocolate cake requires creaming butter and sugar, beating in eggs and alternating dry ingredients with buttermilk before baking until done.",
]

eval_dataset = pd.DataFrame(
    {
        "context": context,
        "reference": reference,
        "instruction": [instruction] * len(context),
    }
)

Now, we'll define different prompt templates to evaluate. This list can include all the different prompt templates you want to experiment with. For the purpose of demonstration, we'll use two different prompt templates below.

In [None]:
prompt_templates = [
    "Instruction: {instruction}. Article: {context}. Summary:",
    "Article: {context}. Complete this task: {instruction}, in one sentence. Summary:",
]
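
As a rough illustration of how a template's placeholders line up with dataset columns, the sketch below fills both templates from a single hypothetical row using Python's built-in `str.format`. This only mimics the idea; the way the SDK assembles prompts internally may differ, and `render_prompts` and `row` are names invented for this example.

```python
# Hypothetical row with the same column names as `eval_dataset`.
row = {
    "instruction": "Summarize the following article",
    "context": "Preparing a perfect risotto requires patience and attention to detail",
}

templates = [
    "Instruction: {instruction}. Article: {context}. Summary:",
    "Article: {context}. Complete this task: {instruction}, in one sentence. Summary:",
]


def render_prompts(templates: list[str], row: dict[str, str]) -> list[str]:
    """Fill each template's {placeholders} from the row's columns."""
    return [template.format(**row) for template in templates]


rendered = render_prompts(templates, row)
for prompt in rendered:
    print(prompt)
```

Each placeholder name must match a column name; a missing column would raise a `KeyError` in this sketch.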

Next, we will define the metrics we want to measure for this task. Vertex AI defines different metrics depending on the task. You can take a look at the [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#computation-based-metrics) to see the available metrics, and at the [API documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/reference/python/latest/vertexai.preview.evaluation.EvalTask#vertexai_preview_evaluation_EvalTask) for the exact string value of each one.

In [None]:
metrics = [
    "rouge_1",
    "rouge_l_sum",
    "bleu",
    "safety",
]

We'll configure the model that we are going to use for generating the responses.

In [None]:
generation_config = {
    "temperature": 0.3,
}

safety_settings = {
    HarmCategory.HARM_CATEGORY_UNSPECIFIED: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
}

gemini_model = GenerativeModel(
    "gemini-2.0-flash",
    generation_config=generation_config,
    safety_settings=safety_settings,
)

Now that we have configured all the parameters, we are ready to start evaluating. We will use [EvalTask](https://cloud.google.com/vertex-ai/generative-ai/docs/reference/python/latest/vertexai.preview.evaluation.EvalTask) to kick off a Vertex AI Experiment for our evaluation.

#### Understand the EvalTask class

The `EvalTask` class is a core component of the Gen AI Evaluation Service SDK framework. It allows you to define and run evaluation jobs against your Gen AI models and applications, providing a structured way to measure their performance on specific tasks. Think of an `EvalTask` as a blueprint for your evaluation process. An evaluation task must contain an evaluation dataset and a list of metrics to evaluate. Supported metrics are documented on the Generative AI on Vertex AI [Define your evaluation metrics page](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval). The dataset can be a `pandas.DataFrame`, a Python dictionary, or a file path URI, and can contain default column names such as `prompt`, `reference`, `response`, and `baseline_model_response`. How the dataset is used depends on the scenario:

- Bring-your-own-response (BYOR): You already have the data that you want to evaluate stored in the dataset. You can customize the response column names for both your model and the baseline model using parameters like `response_column_name` and `baseline_model_response_column_name`, or through the `metric_column_mapping`.

- Perform model inference without a prompt template: You have a dataset containing the input prompts to the model and want to perform model inference before evaluation. A column named `prompt` is required in the evaluation dataset and is used directly as input to the model.

- Perform model inference with a prompt template: You have a dataset containing the input variables to the prompt template and want to assemble the prompts for model inference. The evaluation dataset must contain column names corresponding to the variable names in the prompt template. For example, if the prompt template is "Instruction: {instruction}, context: {context}", the dataset must contain `instruction` and `context` columns.
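
The column requirements of the three scenarios above can be checked before launching a run. The sketch below is a simplified reading of those rules, not the SDK's own validation logic; `template_variables`, `missing_columns`, and `describe_scenario` are hypothetical helpers, and the precedence between scenarios is an assumption.

```python
from string import Formatter
from typing import Optional


def template_variables(prompt_template: str) -> set[str]:
    """Return the placeholder names used in a str.format-style template."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(prompt_template)
        if field_name
    }


def missing_columns(prompt_template: str, columns: list[str]) -> set[str]:
    """Placeholders that have no matching dataset column."""
    return template_variables(prompt_template) - set(columns)


def describe_scenario(
    columns: list[str], prompt_template: Optional[str] = None
) -> str:
    """Name the scenario a dataset fits (assumed precedence, for illustration)."""
    if "response" in columns:
        return "bring-your-own-response"
    if prompt_template is not None:
        missing = missing_columns(prompt_template, columns)
        if missing:
            raise ValueError(f"Dataset is missing columns: {sorted(missing)}")
        return "inference with a prompt template"
    if "prompt" in columns:
        return "inference without a prompt template"
    raise ValueError("Dataset needs a `response` column, a `prompt` column, or a template")
```

For example, `describe_scenario(["instruction", "context"], "Instruction: {instruction}, context: {context}")` identifies the prompt-template scenario, while dropping the `instruction` column raises a `ValueError`.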

`EvalTask` supports a wide range of evaluation scenarios, including BYOR and model inference with Gemini models, third-party model endpoints or SDK clients, or custom model generation functions, using computation-based metrics as well as model-based pointwise and pairwise metrics. The `evaluate()` method triggers the evaluation process, optionally taking a model, a prompt template, experiment logging configurations, and other evaluation run configurations. See the SDK reference documentation for the [Gen AI Evaluation](https://cloud.google.com/vertex-ai/generative-ai/docs/reference/python/latest/vertexai.evaluation) package for more details.

To start with, we'll be using model inference with prompt templates.

In [None]:
experiment_name = "prompt-engineering-eval"

summarization_eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
    experiment=experiment_name,
)

Please note: The cell below takes 10-15 mins to finish executing. 

In [None]:
def generate_uuid(length: int = 8) -> str:
    """Generate a random lowercase alphanumeric ID of the given length (default 8)."""
    return "".join(
        random.choices(string.ascii_lowercase + string.digits, k=length)
    )


run_id = generate_uuid()
eval_results = []


for i, prompt_template in enumerate(prompt_templates):
    experiment_run_name = f"eval-prompt-engineering-{run_id}-prompt-{i}"

    eval_result = summarization_eval_task.evaluate(
        prompt_template=prompt_template,
        experiment_run_name=experiment_run_name,
        model=gemini_model,
    )

    eval_results.append(
        (f"Prompt #{i}", eval_result.summary_metrics, eval_result.metrics_table)
    )

In [None]:
for eval_result in eval_results:
    display_eval_report(eval_result)
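
Once several prompt templates have been evaluated, it can help to pick the best run programmatically. The sketch below works on the same `(title, summary_metrics, report_df)` tuples collected in `eval_results`. The key name `rouge_1/mean` is an assumption about how the summary metrics are labeled, `best_run` is a hypothetical helper, and the numbers are made up purely for illustration.

```python
def best_run(eval_results, metric_key):
    """Return (title, value) of the run with the highest value for metric_key."""
    scored = [
        (title, summary_metrics[metric_key])
        for title, summary_metrics, _ in eval_results
        if metric_key in summary_metrics
    ]
    if not scored:
        raise KeyError(f"No run reports {metric_key!r}")
    return max(scored, key=lambda item: item[1])


# Made-up numbers for illustration only; they are not results from this notebook.
example_results = [
    ("Prompt #0", {"row_count": 3, "rouge_1/mean": 0.40}, None),
    ("Prompt #1", {"row_count": 3, "rouge_1/mean": 0.55}, None),
]
print(best_run(example_results, "rouge_1/mean"))
```

With only a handful of rows, small differences between prompts are unlikely to be meaningful, so treat a ranking like this as a starting point rather than a verdict.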

You can view the results from the experiment runs using the following function.

In [None]:
summarization_eval_task.display_runs()

### Model Based Metrics

Now suppose you do not have reference results or ground truth for your use case. You can still evaluate and compare different models by using another LLM. This technique is called **model-based evaluation**: model-based metrics use a judge model to assess your candidate model. The judge model for most use cases is Gemini, but you can also use models such as [MetricX](https://github.com/google-research/metricx) or [COMET](https://huggingface.co/Unbabel/wmt22-comet-da) for translation use cases.

You can measure model-based metrics pairwise or pointwise:

- **Pointwise metrics**: Let the judge model assess the candidate model's output based on the evaluation criteria. For example, the score could be 0 to 5, where 0 means the response does not fit the criteria, while 5 means the response fits the criteria well.

- **Pairwise metrics**: Let the judge model compare the responses of two models and pick the better one. This is often used when comparing a candidate model with the baseline model. Pairwise metrics are only supported with Gemini as a judge model.
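
The two styles produce different kinds of output, which are typically aggregated differently: pointwise scores are averaged, while pairwise choices are summarized as win rates. The following sketch illustrates that aggregation under stated assumptions; the labels `CANDIDATE`, `BASELINE`, and `TIE` and both function names are hypothetical and chosen for this example only.

```python
from collections import Counter


def pointwise_mean(scores: list[float]) -> float:
    """Average the per-response scores assigned by the judge model."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores) / len(scores)


def pairwise_win_rates(choices: list[str]) -> dict[str, float]:
    """Share of rows won by the candidate, won by the baseline, or tied."""
    if not choices:
        raise ValueError("choices must not be empty")
    counts = Counter(choices)
    total = len(choices)
    return {
        "candidate_win_rate": counts["CANDIDATE"] / total,
        "baseline_win_rate": counts["BASELINE"] / total,
        "tie_rate": counts["TIE"] / total,
    }
```

A pointwise mean tells you how well one model meets the criteria in absolute terms, whereas win rates only tell you which of two models the judge preferred more often.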

#### Pointwise Metrics

Now that we have decided on a prompt, let's say we also want to rate the generated summaries on two qualitative criteria, "fluency" and "entertaining". We will define a metric called `text_quality` using those two criteria.

In [None]:
# Your own definition of text_quality.
metric_prompt_template = PointwiseMetricPromptTemplate(
    criteria={
        "fluency": "Sentences flow smoothly and are easy to read, avoiding awkward phrasing or run-on sentences. Ideas and sentences connect logically, using transitions effectively where needed.",
        "entertaining": "Short, amusing text that incorporates emojis, exclamations and questions to convey quick and spontaneous communication and diversion.",
    },
    rating_rubric={
        "1": "The response performs well on both criteria.",
        "0": "The response is somewhat aligned with both criteria",
        "-1": "The response falls short on both criteria",
    },
)

text_quality = PointwiseMetric(
    metric="text_quality",
    metric_prompt_template=metric_prompt_template,
)

Let's take a look at what this `text_quality` metric looks like.

In [None]:
print(text_quality.metric_prompt_template)
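
To build intuition for what such a template combines, the sketch below assembles a judge prompt from a criteria dictionary, a rating rubric, and a response. This is an illustration of the idea only: the wording and layout are invented here and are not the text that the SDK generates, and `build_judge_prompt` is a hypothetical function.

```python
def build_judge_prompt(
    criteria: dict[str, str], rating_rubric: dict[str, str], response: str
) -> str:
    """Assemble an illustrative judge prompt; NOT the SDK's real template text."""
    criteria_lines = [
        f"- {name}: {description}" for name, description in criteria.items()
    ]
    rubric_lines = [f"- {score}: {meaning}" for score, meaning in rating_rubric.items()]
    return "\n".join(
        [
            "# Criteria",
            *criteria_lines,
            "",
            "# Rating Rubric",
            *rubric_lines,
            "",
            "# Response",
            response,
        ]
    )


example_prompt = build_judge_prompt(
    criteria={"fluency": "Sentences flow smoothly and are easy to read."},
    rating_rubric={"1": "Performs well.", "-1": "Falls short."},
    response="Boil pasta, crisp pancetta, toss with eggs and cheese.",
)
print(example_prompt)
```

The judge model sees the criteria and rubric alongside each response, which is why clear, non-overlapping rubric descriptions matter.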

We will first need to get the responses from our model of interest (`gemini-2.0-flash`) and store them in a list.


In [None]:
gemini_model = GenerativeModel(
    "gemini-2.0-flash",
    generation_config=generation_config,
    safety_settings=safety_settings,
)

context = [
    "To make a classic spaghetti carbonara, start by bringing a large pot of salted water to a boil. While the water is heating up, cook pancetta or guanciale in a skillet with olive oil over medium heat until it's crispy and golden brown. Once the pancetta is done, remove it from the skillet and set it aside. In the same skillet, whisk together eggs, grated Parmesan cheese, and black pepper to make the sauce. When the pasta is cooked al dente, drain it and immediately toss it in the skillet with the egg mixture, adding a splash of the pasta cooking water to create a creamy sauce.",
    "Preparing a perfect risotto requires patience and attention to detail. Begin by heating butter in a large, heavy-bottomed pot over medium heat. Add finely chopped onions and minced garlic to the pot, and cook until they're soft and translucent, about 5 minutes. Next, add Arborio rice to the pot and cook, stirring constantly, until the grains are coated with the butter and begin to toast slightly. Pour in a splash of white wine and cook until it's absorbed. From there, gradually add hot chicken or vegetable broth to the rice, stirring frequently, until the risotto is creamy and the rice is tender with a slight bite.",
    "For a flavorful grilled steak, start by choosing a well-marbled cut of beef like ribeye or New York strip. Season the steak generously with kosher salt and freshly ground black pepper on both sides, pressing the seasoning into the meat. Preheat a grill to high heat and brush the grates with oil to prevent sticking. Place the seasoned steak on the grill and cook for about 4-5 minutes on each side for medium-rare, or adjust the cooking time to your desired level of doneness. Let the steak rest for a few minutes before slicing against the grain and serving.",
]

In [None]:
responses = []
instruction = "Summarize the following article"
for article in context:
    prompt = f"Instruction: {instruction}. Article: {article}. Summary:"
    response = gemini_model.generate_content(prompt)
    responses.append(response.text)
responses

In [None]:
eval_dataset = pd.DataFrame(
    {
        "response": responses,
    }
)

In [None]:
EXPERIMENT_NAME = "pointwise-eval"
eval_task = EvalTask(
    dataset=eval_dataset, metrics=[text_quality], experiment=EXPERIMENT_NAME
)

pointwise_eval_results = eval_task.evaluate()

You can view the `summary_metrics` for all 3 articles here.

In [None]:
pointwise_eval_results.summary_metrics

We see that `text_quality` is `0.0` because, based on the criteria we defined above, the responses somewhat align with both criteria. To get the score for each article based on the criteria you defined in `text_quality`, you can access the `metrics_table`.

In [None]:
pointwise_eval_results.metrics_table

Based on the criteria we defined, it looks like the judge model gave a neutral score for each of the summaries. You can see the rationale behind the model's score in the `text_quality/explanation` column.

#### Pairwise Metrics - Compare Models Side-by-Side (SxS)

Now that we have decided on a prompt, let's say we also want to compare the summaries generated by two different LLMs (Gemini 2.0 Flash vs. Gemini 2.0 Flash Lite). You can do this using pairwise model evaluation, which compares the two models' summaries side by side.

To directly compare two models, you can define a `PairwiseMetric` within an `EvalTask` run. This approach allows for a head-to-head assessment of the models' performance.

In [None]:
instruction = "Summarize the following article"
prompt_template = "{instruction}. Article: {context}. Summary:"

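# Note: `context` was redefined in the pointwise section (its third article is
# now the steak recipe), while `reference` still holds the earlier chocolate
# cake summary, so the third row's reference does not match its context.
pairwise_eval_dataset = pd.DataFrame(
    {
        "context": context,
        "instruction": [instruction] * len(context),
        "reference": reference,
    }
)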

Once we have prepared the evaluation dataset we will define the two models that we want to compare. `model_a` will be `gemini-2.0-flash` and `model_b` will be `gemini-2.0-flash-lite`.

In [None]:
# Baseline model for pairwise comparison
model_a = GenerativeModel(
    "gemini-2.0-flash",
    generation_config=generation_config,
    safety_settings=safety_settings,
)

# Candidate model for pairwise comparison
model_b = GenerativeModel(
    "gemini-2.0-flash-lite",
    generation_config=generation_config,
    safety_settings=safety_settings,
)

Similar to how we defined the evaluation criteria for pointwise evaluation above, you could customize your evaluation prompt. However, to save time, Vertex AI provides [predefined evaluation prompts](https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates) in `MetricPromptTemplateExamples` that you can use. In this use case, we are going to compare the text quality of the two models. `pairwise_text_quality` uses the following evaluation steps to evaluate the model responses.

```
STEP 1: Analyze Response A based on all the Criteria provided, including Coherence, Fluency, Instruction following, Groundedness, and Verbosity. Provide assessment according to each criterion.
STEP 2: Analyze Response B based on all the Criteria provided, including Coherence, Fluency, Instruction following, Groundedness, and Verbosity. Provide assessment according to each criterion 
STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment of each criterion 
STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
STEP 5: Output your assessment reasoning in the explanation field, justifying your choice by highlighting the specific strengths and weaknesses of each response in terms of Text Quality
```

In [None]:
# Create a "Pairwise Text Quality" metric
text_quality_prompt_template = MetricPromptTemplateExamples.get_prompt_template(
    "pairwise_text_quality"
)

pairwise_text_quality_metric = PairwiseMetric(
    metric="pairwise_text_quality",
    metric_prompt_template=text_quality_prompt_template,
    baseline_model=model_a,
)

Once we have defined the dataset and the evaluation criteria, we are ready to kick off the evaluation job.

Please note: the cell below is going to take 10-15 mins to finish execution.

In [None]:
pairwise_text_quality_eval_task = EvalTask(
    dataset=pairwise_eval_dataset,
    metrics=[pairwise_text_quality_metric],
    experiment=EXPERIMENT_NAME,
)

# Specify candidate model for pairwise comparison
pairwise_text_quality_result = pairwise_text_quality_eval_task.evaluate(
    model=model_b,
    prompt_template=prompt_template,
)

Once the above experiment runs, you can view the results using the `display_eval_report` helper function.

In [None]:
display_eval_report(
    (
        "Side-by-side EvalTask",
        pairwise_text_quality_result.summary_metrics,
        pairwise_text_quality_result.metrics_table,
    )
)

All of the evaluation experiments we ran in this notebook are accessible from the [Experiments Tab](https://console.cloud.google.com/vertex-ai/experiments/experiments) in the Vertex AI UI.

Copyright 2024 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.