In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Rubric evaluation - Multimodal and Custom metric for text quality

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/multimodal_text_quality_rubric_evaluation.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fevaluation%2Fmultimodal_text_quality_rubric_evaluation.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/evaluation/multimodal_text_quality_rubric_evaluation.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/multimodal_text_quality_rubric_evaluation.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>

| Author(s) |
| --- |
| [Ivan Nardini](https://github.com/inardini), [Naveksha Sood](https://github.com/navekshasood)|

## Overview

This notebook shows how to perform rubric-based evaluation using the Vertex AI Gen AI Evaluation service. You will learn how to use predefined metrics for multimodal tasks and how to create your own custom rubric-based metrics from scratch.

### Why Use Rubric-Based Evaluation?

When evaluating Large Language Models (LLMs), simple metrics like accuracy often don't capture the full picture. How do you measure nuance, creativity, adherence to a specific style, or the quality of a summary?

**Rubric-based evaluation** solves this by using a powerful technique: **using an LLM to evaluate another LLM.** Instead of relying on rigid, predefined criteria, we generate a dynamic "rubric" (a set of questions) tailored to the specific prompt. An "auto-rater" model then assesses the LLM's response against this rubric.

This approach is powerful because it allows you to:

*   **Evaluate Complex, Subjective Tasks:** Go beyond simple right/wrong answers to assess quality, style, and safety.
*   **Align Evaluation with Your Goals:** Create custom rubrics that measure what truly matters for your specific use case.
*   **Scale Your Evaluations:** Automate the process of nuanced evaluation, saving countless hours of manual review.

### Steps in Rubric-Based Evaluation

1. **Rubric Generation**: An LLM generates a set of evaluation questions (the rubric) based on the inference prompt.
2. **Rubric Revision** [Optional but Recommended]: You inspect and revise the generated questions to ensure they align perfectly with your requirements.
3. **Rubric Critiquing**: An autorater LLM judges the model's response against the rubric. This can be done **pointwise** (scoring a single response) or **pairwise** (comparing two different responses).
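
The three steps above can be sketched in plain Python. This is a toy illustration, not the service's implementation: `rubric_based_score`, `toy_generate_rubric`, and `toy_judge` are hypothetical names, the two "LLM" steps are replaced by simple functions, and the score is assumed to be the fraction of rubric questions answered "yes".

```python
from typing import Callable


def rubric_based_score(
    prompt: str,
    response: str,
    generate_rubric: Callable[[str], list[str]],
    judge: Callable[[str, str, str], bool],
) -> float:
    """Return the fraction of rubric questions the judge answers 'yes'."""
    # Step 1: rubric generation, based on the prompt only.
    rubric = generate_rubric(prompt)
    # Step 2 (optional): a human could edit `rubric` here before critiquing.
    # Step 3: pointwise critique, one yes/no verdict per question.
    verdicts = [judge(prompt, response, question) for question in rubric]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Toy stand-ins for the two LLM calls (hypothetical, for illustration only).
def toy_generate_rubric(prompt: str) -> list[str]:
    return [
        "Is the response a single sentence?",
        "Does the response mention the product name?",
    ]


def toy_judge(prompt: str, response: str, question: str) -> bool:
    if "single sentence" in question:
        return response.count(".") == 1
    return "Chrono-Max" in response


score = rubric_based_score(
    "Provide a one-sentence summary of the Chrono-Max 5000 description.",
    "The Chrono-Max 5000 is a durable Swiss watch.",
    toy_generate_rubric,
    toy_judge,
)
print(score)
```

In a pairwise setting, the same rubric would be applied to two responses and the verdicts compared.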

In this tutorial, you'll learn how to use both predefined and custom rubrics to evaluate model responses for multimodal and text-based tasks.

## Getting Started


### Install Google Vertex AI SDK and other required packages

In [None]:
%pip install --upgrade --quiet "google-cloud-aiplatform[evaluation]"

### Authenticate your notebook environment (Colab only)

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information
To get started using Vertex AI, you must have an existing Google Cloud project and enable the Vertex AI API.

Learn more about setting up a project and a development environment.

In [None]:
import vertexai

PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
LOCATION = "us-central1"  # @param {type:"string"}

BUCKET_NAME = "[your-bucket-name]"  # @param {type: "string", placeholder: "[your-bucket-name]", isTemplate: true}
BUCKET_URI = f"gs://{BUCKET_NAME}"

!gsutil mb -l {LOCATION} {BUCKET_URI}

vertexai.init(project=PROJECT_ID, location=LOCATION)

### Import libraries

Let's import the necessary libraries from the Vertex AI SDK and other packages we'll use throughout the notebook.

In [None]:
import pandas as pd
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import (
    CustomOutputConfig,
    EvalTask,
    PointwiseMetric,
    PredefinedRubricMetrics,
    RubricBasedMetric,
    RubricGenerationConfig,
    notebook_utils,
    utils,
)

### Helpers

This helper function provides a clean way to display the contents of a Pandas DataFrame, which we will use to inspect our datasets and results.

In [None]:
def display_df(
    df: pd.DataFrame, num_rows: int | None = 5, line_length: int = 80
) -> None:
    """
    Displays DataFrame rows cleanly, ensuring full text visibility.
    """
    # Explicitly select the data slice to avoid ambiguity.
    data_to_display = df if num_rows is None else df.head(num_rows)

    if data_to_display.empty:
        print("The DataFrame is empty.")
        return

    # A simple loop is better than a complex, nested structure.
    for _, row in data_to_display.iterrows():
        print("-" * line_length)
        for col, value in row.items():
            # Readability counts: Clearly label the column and show the value.
            print(f"{col}: {value}")
    print("-" * line_length)

## Rubric-based Multimodal Understanding

In this section, we'll evaluate a use case where an AI model acts as an insurance agent. The model receives an image of a damaged car and a prompt asking it to assess the damage, severity, and estimated repair cost.

Because the Vertex AI Gen AI Evaluation service doesn't yet support direct multimodal *inference* within the `EvalTask`, we'll start with pre-generated responses.

You would typically generate these responses by calling a multimodal model like Gemini.

### Create a multimodal dataset

The dataset contains:
1. **prompt**: The instruction given to the language model.
2. **image**: A URI pointing to an image file in Google Cloud Storage.
3. **response**: The text generated by our candidate model.
4. **baseline_model_response**: The text generated by our baseline model for comparison.

In [None]:
image = [
    '{"contents": [{"parts": [{"file_data": {"mime_type": "image/jpeg", "file_uri": "gs://cloud-samples-data/generative-ai/evaluation/use_cases/car_assessment/bumper.jpg"}}]}]}',
    '{"contents": [{"parts": [{"file_data": {"mime_type": "image/jpeg", "file_uri": "gs://cloud-samples-data/generative-ai/evaluation/use_cases/car_assessment/engine_compartment.jpg"}}]}]}',
    '{"contents": [{"parts": [{"file_data": {"mime_type": "image/jpeg", "file_uri": "gs://cloud-samples-data/generative-ai/evaluation/use_cases/car_assessment/hood.jpg"}}]}]}',
    '{"contents": [{"parts": [{"file_data": {"mime_type": "image/jpeg", "file_uri": "gs://cloud-samples-data/generative-ai/evaluation/use_cases/car_assessment/lateral.jpg"}}]}]}',
    '{"contents": [{"parts": [{"file_data": {"mime_type": "image/jpeg", "file_uri": "gs://cloud-samples-data/generative-ai/evaluation/use_cases/car_assessment/windshield.jpg"}}]}]}',
]

prompt = [
    "You are an insurance agent specializing in car accident assessments. You will be provided with a conversation about a car accident and an image of the damaged car. Your task is to analyze the image and identify the primary type of damage visible. Use the conversation for context, but prioritize the visual evidence from the image. Provide your assessment including the primary damage type, the severity of the damage, and a rough estimation of the repair cost.",
] * len(image)

response = [
    """**Primary Damage:** Bumper.
            **Severity:** Moderate. The rear bumper is visibly dented, scratched, and partially detached from the vehicle's body.
            **Estimated Cost:** The repair cost is estimated to be between $500 and $1,500. This includes the cost of either repairing the existing bumper or replacing it, as well as painting to match the vehicle's color.""",
    """**Primary Damage:** Engine Compartment.
            **Severity:** Severe. The image shows a catastrophic front-end collision where the hood is completely buckled and the underlying engine components are exposed and likely damaged.
            **Estimated Cost:** The repair costs are expected to be substantial, likely ranging from $5,000 to over $10,000. Given the severity, the vehicle may be declared a total loss by the insurance company.""",
    """**Primary Damage:** Hood.
            **Severity:** Moderate to Severe. The hood is significantly crumpled and bent upwards, indicating a forceful front-end impact.
            **Estimated Cost:** The cost for replacing and painting the hood is estimated to be between $1,000 and $2,500. This does not include potential underlying damage to the radiator, grille, or engine compartment which would increase the cost.""",
    """**Primary Damage:** Lateral.
            **Severity:** Severe. There is extensive damage along the driver's side of the vehicle, with both the front and rear doors significantly crushed inwards.
            **Estimated Cost:** The estimated repair cost is between $4,000 and $8,000. This type of damage often involves replacing multiple door panels, repainting, and potentially complex and costly repairs to the vehicle's frame and B-pillar.""",
    """**Primary Damage:** Windshield.
            **Severity:** Minor to Moderate. A large crack is visible running across the windshield.
            **Estimated Cost:** The replacement cost is estimated to be between $400 and $1,200. The final price will depend on the make and model of the vehicle and if it is equipped with Advanced Driver-Assistance Systems (ADAS) that require recalibration after the windshield is replaced.""",
]

baseline_response = [
    """**Damage Type:** Bumper
            **Severity:** Moderate
            **Estimated Cost:** $600 - $1,800. The bumper is visibly damaged and will likely need replacement and painting.""",
    """**Damage Type:** Engine Compartment
            **Severity:** Severe
            **Estimated Cost:** Potentially over $10,000. The damage is extensive and appears to affect the engine. This could be a total loss.""",
    """**Damage Type:** Hood
            **Severity:** Moderate
            **Estimated Cost:** $1,200 - $3,000. This includes a new hood, paint, and labor. There may be additional costs for hidden damage.""",
    """**Damage Type:** Lateral
            **Severity:** Severe
            **Estimated Cost:** $4,500 - $9,000. The damage spans two doors and may have affected the vehicle's frame, requiring significant repair work.""",
    """**Damage Type:** Windshield
            **Severity:** Minor
            **Estimated Cost:** $350 - $900. The cost depends on the vehicle model and whether recalibration of safety features is needed.""",
]


eval_dataset = pd.DataFrame(
    {
        "prompt": prompt,
        "image": image,
        "baseline_model_response": baseline_response,
        "response": response,
    }
)
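
The `image` entries above are JSON strings that wrap a Cloud Storage URI. When a dataset has many images, writing them by hand becomes error-prone, so a small helper can build them. The sketch below uses only the standard library; `image_request` and `BASE_URI` are hypothetical names introduced for illustration, and the JSON layout is copied from the strings in the cell above.

```python
import json


def image_request(file_uri: str, mime_type: str = "image/jpeg") -> str:
    """Build one `image` column value in the same JSON layout as the dataset."""
    payload = {
        "contents": [
            {"parts": [{"file_data": {"mime_type": mime_type, "file_uri": file_uri}}]}
        ]
    }
    return json.dumps(payload)


BASE_URI = "gs://cloud-samples-data/generative-ai/evaluation/use_cases/car_assessment"
image_names = ["bumper", "engine_compartment", "hood", "lateral", "windshield"]

# One JSON string per image, in the same order as the responses.
image_column = [image_request(f"{BASE_URI}/{name}.jpg") for name in image_names]
print(image_column[0])
```

Building the strings with `json.dumps` also guarantees that quoting and escaping stay valid if a URI contains unusual characters.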

### **Run a Pairwise Evaluation with a Predefined Metric**

Now, we're going to put our two models in a head-to-head competition for each image in our dataset. This is a **pairwise evaluation**, and it's one of the most effective ways to determine which model is truly better for your specific use case.

To do this, we'll configure an `EvalTask`. Think of this as the main orchestrator for our evaluation job. It bundles together our dataset, the metrics we want to compute, and the Cloud Storage location for the results.

The key ingredient here is the metric we choose: `PredefinedRubricMetrics.Pairwise.MULTIMODAL_UNDERSTANDING`. It's an off-the-shelf metric designed to handle all the heavy lifting of multimodal pairwise evaluation for you.

When you run the cell below, the single command `eval_task.evaluate()` kicks off the evaluation process. Here's a peek at what the Gen AI Evaluation service does for you under the hood:

1.  **Rubric Generation:** For each row in your dataset, the service's autorater model analyzes the prompt and image to generate a custom set of rubric questions tailored to that specific example.
2.  **Pairwise Critique:** The autorater then systematically compares your `response` and `baseline_model_response`, answering each rubric question for both models to see how they stack up.
3.  **Scoring and Aggregation:** Finally, the service processes the autorater's raw judgments, calculates the final win/loss/tie scores, and organizes all the rich details—including the generated rubrics and the autorater's reasoning—into the structured `eval_result` DataFrame that gets returned to your notebook.

Let's run it.

In [None]:
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[PredefinedRubricMetrics.Pairwise.MULTIMODAL_UNDERSTANDING],
    output_uri_prefix=BUCKET_URI,
)

eval_result = eval_task.evaluate()

### Understanding eval results for pairwise rubric-based metrics

The `eval_result` object contains a wealth of information. Let's break down the key columns in the results DataFrame:

*   **`rubrics`**: The list of questions automatically generated by the autorater for this specific prompt and image. This gives you insight into *how* the model was evaluated.
*   **`description`**: A text description of the image, generated by the autorater model to provide context for its evaluation.
*   **`pairwise_rb_multimodal_understanding/pairwise_choice`**: The final verdict of the autorater. It's a human-readable judgment, such as "Candidate response is slightly better than the baseline response."
*   **`pairwise_rb_multimodal_understanding/score`**: A numeric score that maps to the `pairwise_choice`. It's on a 5-point scale:
    *   `1.0`: Candidate is better.
    *   `0.5`: Candidate is slightly better.
    *   `0.0`: Both are equal (or a tie).
    *   `-0.5`: Baseline is slightly better.
    *   `-1.0`: Baseline is better.
*   **`.../candidate_rubric_verdict_pairs`**: A breakdown showing how the candidate response scored on each individual rubric question (e.g., `Does the response identify the primary damage?: True`).
*   **`.../baseline_rubric_verdict_pairs`**: The same breakdown for the baseline model.
*   **`.../raw_outputs`**: The full, unprocessed text generated by the autorater. This is extremely useful for debugging, as it contains the autorater's "chain of thought" and reasoning for its final verdict.
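
To turn the 5-point scale described above into a headline number, you can aggregate the per-row scores into win, tie, and loss rates. The sketch below is a minimal example on made-up scores; `summarize_pairwise` and `example_scores` are hypothetical, and the values are not results from this notebook.

```python
from collections import Counter

# Labels for the 5-point pairwise scale described above.
SCORE_LABELS = {
    1.0: "Candidate is better",
    0.5: "Candidate is slightly better",
    0.0: "Tie",
    -0.5: "Baseline is slightly better",
    -1.0: "Baseline is better",
}


def summarize_pairwise(scores: list[float]) -> dict[str, float]:
    """Aggregate per-row pairwise scores into win/tie/loss rates and a mean."""
    if not scores:
        raise ValueError("scores must not be empty")
    # Positive scores favor the candidate, negative scores favor the baseline.
    outcomes = Counter(
        "win" if s > 0 else "loss" if s < 0 else "tie" for s in scores
    )
    n = len(scores)
    return {
        "win_rate": outcomes["win"] / n,
        "tie_rate": outcomes["tie"] / n,
        "loss_rate": outcomes["loss"] / n,
        "mean_score": sum(scores) / n,
    }


# Illustrative scores only, one per evaluated row.
example_scores = [1.0, 0.5, 0.0, -0.5, 1.0]
summary = summarize_pairwise(example_scores)
print([SCORE_LABELS[s] for s in example_scores])
print(summary)
```

In practice the scores would come from the `pairwise_rb_multimodal_understanding/score` column of the results DataFrame, assuming it is exposed as a numeric column.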

In [None]:
notebook_utils.display_eval_result(eval_result=eval_result)

In [None]:
notebook_utils.display_explanations(eval_result=eval_result, num=1)

## Customize your metric - Text Quality

Predefined metrics are great, but what if you have very specific requirements?

In this section, we'll switch to a new use case—evaluating text summarization—and build our own **custom rubric-based metric.**

This gives you complete control over the evaluation process, from how the rubric questions are generated to how the final critique is performed.


### Create a text quality dataset

We will create a simple dataset consisting of different prompts that ask for summarization of various text inputs.


In [None]:
tq_prompt = [
    "Summarize the following meeting transcript, focusing on the main decisions and action items. Transcript: 'Alex: OK, team. The primary goal for Q3 is to launch the new feature set, codenamed 'Phoenix'. Sarah, your team will handle the front-end development, and Mike, your team is responsible for the back-end infrastructure. The deadline for code completion is August 15th. Marketing, led by Chloe, will begin the promotional campaign on August 1st. All teams must provide weekly progress reports.'",
    "Provide a one-sentence summary of the following product description. Description: 'The new Chrono-Max 5000 is a state-of-the-art timepiece featuring a self-winding 25-jewel Swiss movement, a scratch-resistant sapphire crystal display, and a waterproof titanium case rated for depths up to 200 meters. It also includes a perpetual calendar, a moon phase indicator, and a dual-time zone function, all elegantly designed for both formal and casual wear.'",
    "Explain the scientific concept described below in simple terms, suitable for a middle school student. Concept: 'The theory of plate tectonics describes the large-scale motion of seven large plates and the movements of a larger number of smaller plates of the Earth's lithosphere. The movement of the plates results in seismic activity such as earthquakes, and the formation of geologic features like mountains, volcanoes, and oceanic trenches.'",
    "Condense the following historical event summary into a tweet (280 characters or less). Summary: 'The Industrial Revolution, which took place from the 18th to 19th centuries, was a period during which predominantly agrarian, rural societies in Europe and America became industrial and urban. The iron and textile industries, along with the development of the steam engine, played central roles in this transformation, which also saw major social and economic changes, including the rise of factories and a new working class.'",
    "Summarize the plot of the following movie. Plot: 'A young farm boy, who dreams of adventure, discovers a message from a captured princess. He embarks on a journey with a wise old mentor, a cocky pilot, and a loyal furry co-pilot to rescue her from the clutches of an evil empire and its fearsome enforcer. Along the way, he learns about a mystical energy field that grants him special abilities and ultimately joins a rebellion to restore freedom to the galaxy.'",
    "Generate a brief summary of the key findings from the following research abstract. Abstract: 'Our study investigated the effects of daily meditation on stress and cognitive function in a cohort of 150 adults. Participants were randomly assigned to either a daily 20-minute guided meditation group or a control group. After eight weeks, the meditation group showed a statistically significant reduction in self-reported stress levels and improved performance on tasks measuring attention and working memory compared to the control group.'",
    "Summarize the main arguments presented in the following opinion piece. Piece: 'The widespread adoption of remote work represents the most significant shift in the labor market in a century. While critics point to challenges in collaboration and corporate culture, the benefits—including increased employee flexibility and satisfaction, reduced operational costs for companies, and a positive environmental impact from less commuting—far outweigh the drawbacks. To remain competitive, businesses must embrace this new paradigm.'",
    "Create a short summary of the following financial news report. Report: 'Global markets experienced a volatile week, with the Tech Index falling by 3.5% due to investor concerns over rising inflation and potential interest rate hikes by the central bank. In contrast, the commodities sector saw a surge, with oil prices reaching a two-year high amidst geopolitical tensions and supply chain disruptions. Analysts advise a cautious approach for the upcoming quarter.'",
    "Condense the following recipe into three main steps. Recipe: 'To prepare classic spaghetti carbonara, first cook 400g of spaghetti in salted boiling water until al dente. While the pasta cooks, fry 150g of diced pancetta in a pan until crisp. In a separate bowl, whisk together 4 large egg yolks, 50g of grated Pecorino Romano cheese, and a generous amount of black pepper. Once the pasta is cooked, drain it, reserving some pasta water. Quickly toss the hot pasta with the egg and cheese mixture, adding the crispy pancetta and a splash of pasta water to create a creamy sauce. Serve immediately.'",
    "Provide a brief summary of the following legal clause, explaining its main purpose. Clause: 'Force Majeure: Neither party shall be liable for any failure or delay in performing their obligations under this contract if such failure or delay is due to any cause beyond their reasonable control, including but not limited to acts of God, war, terrorism, civil unrest, or significant interruptions in public utilities.'",
]

eval_dataset = pd.DataFrame(
    {
        "prompt": tq_prompt,
    }
)

### Create your own rubric-based metric

With our dataset defined, the first step is to define the rubric-based metric.

Building your own rubric-based metric gives you precise control over the entire evaluation lifecycle. You get to define not just *what* gets measured, but *how* it gets measured.

To control *what* gets measured, you can provide your own `prompt_template` that tells the autorater exactly what makes a good set of rubric questions for your specific task. You can guide it to focus on factuality, conciseness, creativity, brand voice, or any other criteria you care about, as shown below.


In [None]:
rubric_gen_prompt = """
# Instructions
Your task is to generate a rubric that can be used to evaluate the text quality of responses generated by an AI model. Specifically, to generate rubrics for a user prompt (<user_prompt>) that describes the properties that should hold for a good response to that prompt. Generate the rubric following the provided guidelines.

# Rubric Generation Guidelines
## Verifying key aspects of 6 high-level criteria categories
Text Quality is evaluated with 6 high-level criteria categories as follows:
Response Style & Format Structure
Content Relevance & Conciseness
Content Completeness & Detail
Instruction Following
Groundness OR Truthfulness / Correctness
Harmlessness / Safety

The generated rubrics should be able to verify key aspects of all 6 high-level criteria categories.

## Rules of thumbs for good rubrics
* A rubric is a granular, binary constraint expressed in the form of a question, a decomposition, or a testable criteria. Think of it as a deterministic "yes/no" question based on the prompt that a user or a rater can ask about the response to verify whether the response fulfilled the requirement in the prompt.
* There are different types of constraints in the prompt. In this task, the constraints should be based on the 6 high-level criteria categories that we use to evaluate the Text Quality: Some are related to the Content Relevance & Conciseness of the response, some are related to Response Style & Format Structure, some are related to Instruction Following.
* The goal of this task is to appropriately capture all the different constraints that verify the key aspects of the high-level criteria category and associated with the specifics in the user prompt — such that given these rubrics, anyone could determine how well and completely the model fulfilled the constraints in the prompt.

## Additional constraints on the generated rubrics

* Generated rubrics should be ordered according to the "importance" of its high-level criteria category. The importance of the 6 high-level criteria categories is ordered as follows:

"Instruction Following" > "Groundness OR Truthfulness / Correctness" > "Harmlessness / Safety" > "Content Relevance & Conciseness" > "Response Style & Format Structure" > "Content Completeness & Detail"

For example, generated rubrics for "Instruction Following" should be output first.

* The number of generated rubrics for each criteria category may not be the same. We desire to include more questions for criteria categories of higher "importance", e.g. "Instruction Following", "Groundness OR Truthfulness / Correctness".

* Not every rubric needs to be prompt-specific. Some can be prompt-agnostic. For example, rubrics for "Harmlessness / Safety" may mostly be prompt-agnostic.

* Aim for less than 10 rubrics per criteria category and less than 20 in total.

* Pay attention to the following, which are common mistakes of rubric generation:
  * Word count or length: If the user prompt asks for an exact word count or length, the rubric should be exact to the word count or length. Do not add "approximate" or "around" to the word count or length.
  * Reference content: If the user prompt contains a reference to a specific document or a specific context, the rubric should be specific to the reference content.
  * The rubric should be specific to the reference content in the user prompt: If the user prompt contains a reference to a specific document or a specific context, the rubric should be specific to the reference content.
  * The rubric should avoid hallucination: Do not generate rubrics that are not based on the user prompt.
  * The rubric should be logically correct: When the user prompt is a math word problem or a science problem or a data analysis problem, the rubrics should be logically correct.
  * The rubric should be concise: Do not generate repeated rubrics, including different rubrics that are semantically similar.

# Iteratively generate rubrics
Thoroughly examine the user prompt, generate rubrics for the given user prompt following the above Rubric Generation Guidelines. Review your answer, correct your mistakes and produce a revised answer. Do not exceed 3 iterations in total. Output your final answer for the generated rubrics.

# Output format.
Write your final output in JSON according to this schema:

```json
{{
 "questions": [
   "question 1 ...",
   "question 2 ...",
   "question 3 ...",
 ],
}}
```

IMPORTANT: Do not respond to the <user_prompt>. Only generate the rubric questions for the prompt.

User prompt
<user_prompt>
{prompt}
</user_prompt>
"""

To control *how* it gets measured, you can provide a `metric_prompt_template` that instructs the autorater on how to evaluate a response against the rubric. You can guide its "chain of thought" and define the exact scoring logic it should follow.

In [None]:
rubric_critique_prompt = """
# Instructions
Your task is to evaluate the text quality of responses generated by an AI model. You will be presented with a user prompt, the model's response to that user prompt, and a series of questions against which the text quality of the response will be judged.

# Rubric
[[YES]]: The model's response fulfilled the question.
[[NO]]: The model's response did not fulfill the question.

# Follow these steps for each question:
STEP 1: Repeat the question.
STEP 2: Determine the steps needed to **exactly**, **precisely** and **completely** answer the question.
STEP 3: Follow the steps outlined in STEP 2, thinking out loud.
STEP 4: Review the thoughts and the original question.
STEP 5: Output the final verdict.

# Output format:
<question>
STEP 1: ...
STEP 2: ...
STEP 3: ...
STEP 4: ...
STEP 5: ...
Question: repeat the original question
Verdict: [[YES|NO]]
</question>

<question>
...

# User Inputs, AI-generated Response, and Rubrics
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}

## Rubrics
{rubrics}

REMEMBER: Your answer will help improve the AI model. It is important to answer the question correctly. Even answering "no" will improve the model!

Evaluation:
"""

### Rubric Generation and Revision

With the two prompt templates defined, you are ready to generate the rubrics by wrapping the templates in a `RubricBasedMetric` and calling its `generate_rubrics()` method.

> Note that you can customize several key parameters, including the parsing functions. The rubric parsing function (`parsing_fn` in `RubricGenerationConfig`) transforms the autorater's text into a clean, structured list of questions that the SDK can use. The critique parsing function (`parsing_fn` in `CustomOutputConfig`) processes the autorater's detailed critique and extracts the final score and verdict pairs from its raw output.

Let's build the metric and generate the rubrics to see what the autorater comes up with.

In [None]:
rbm = RubricBasedMetric(
    generation_config=RubricGenerationConfig(
        prompt_template=rubric_gen_prompt, parsing_fn=utils.parse_rubrics
    ),
    critique_metric=PointwiseMetric(
        metric="custom_rubric_based_text_quality",
        metric_prompt_template=rubric_critique_prompt,
        custom_output_config=CustomOutputConfig(
            return_raw_output=True, parsing_fn=utils.parse_pointwise_rubric_result
        ),
    ),
)

data_with_rubrics = rbm.generate_rubrics(eval_dataset)
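
To make the role of the rubric `parsing_fn` concrete, here is a minimal sketch of a parser that pulls the `questions` list out of the autorater's raw text. It is not the implementation of `utils.parse_rubrics`; `extract_questions` is a hypothetical function, and the return type the SDK expects from a custom parser is not shown here.

```python
import json


def extract_questions(raw_output: str) -> list[str]:
    """Extract the "questions" list from text that contains a JSON object."""
    # Locate the outermost JSON object, ignoring any text around it.
    start, end = raw_output.find("{"), raw_output.rfind("}")
    if start == -1 or end <= start:
        return []
    try:
        payload = json.loads(raw_output[start : end + 1])
    except json.JSONDecodeError:
        # Strict JSON parsing rejects trailing commas, for example.
        return []
    questions = payload.get("questions", [])
    # Keep only non-empty strings and strip surrounding whitespace.
    return [q.strip() for q in questions if isinstance(q, str) and q.strip()]


raw = (
    "Here is the rubric:\n"
    '{"questions": ["Is the summary one sentence?", " Does it name the product? "]}\n'
    "Done."
)
print(extract_questions(raw))
```

A production parser would likely need to be more forgiving, since models sometimes emit trailing commas or wrap the JSON in extra formatting.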

Let's inspect the generated rubrics. This step gives you the opportunity for **human-in-the-loop revision**. Before using these questions for the final evaluation, you can inspect them, edit them, or add new ones so that they match your quality criteria. This is a best practice for building a robust and trustworthy evaluation pipeline.

In [None]:
display_df(data_with_rubrics, num_rows=3)
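
The human-in-the-loop revision described above can be as simple as filtering and extending a list of questions. The sketch below assumes each row's rubric is a list of strings, which may differ from the actual structure of the `rubrics` column; `revise_rubric` and the sample questions are hypothetical.

```python
def revise_rubric(
    questions: list[str],
    drop_containing: tuple[str, ...] = (),
    add: tuple[str, ...] = (),
) -> list[str]:
    """Drop questions matching any substring, append new ones, and de-duplicate."""
    kept = [
        q
        for q in questions
        if not any(s.lower() in q.lower() for s in drop_containing)
    ]
    revised, seen = [], set()
    for q in [*kept, *add]:
        key = q.strip().lower()
        # Skip blanks and case-insensitive duplicates, preserving order.
        if key and key not in seen:
            seen.add(key)
            revised.append(q.strip())
    return revised


generated = [
    "Is the summary a single sentence?",
    "Is the response harmless?",
    "Is the summary a single sentence?",
]
revised = revise_rubric(
    generated,
    drop_containing=("harmless",),
    add=("Does the summary mention the 200-meter depth rating?",),
)
print(revised)
```

If the column does hold plain lists, the same function could be applied row by row with `data_with_rubrics["rubrics"].apply(...)` before running the critique.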

### Rubric Critiquing

Now that we have our dataset enriched with our custom rubrics, we can proceed to the final step: critiquing the model's responses.

We'll create a new `EvalTask` and pass it our DataFrame, which now includes the `rubrics` column. When we call `eval_task.evaluate()`, the SDK detects the existing rubrics and will **skip the generation step**, moving directly to the critique.

For this evaluation, we'll perform a **pointwise** evaluation, meaning each response is scored individually based on the rubric, rather than being compared to a baseline. We'll have the SDK generate responses by passing a Gemini model to the `evaluate` method.

In [None]:
eval_task = EvalTask(
    dataset=data_with_rubrics,
    metrics=[rbm],
)

eval_result = eval_task.evaluate(model=GenerativeModel("gemini-2.0-flash"))
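
During critique, the autorater is asked to end each question block with `Question: ...` followed by `Verdict: [[YES|NO]]`, and the critique `parsing_fn` turns that text into verdict pairs and a score. The sketch below illustrates the idea with a regular expression. It is not the implementation of `utils.parse_pointwise_rubric_result`; `parse_verdicts` and `pass_rate` are hypothetical, and treating the score as the fraction of "yes" verdicts is an assumption.

```python
import re

# Matches "Question: <text>" immediately followed by "Verdict: [[YES]]" or "[[NO]]".
VERDICT_RE = re.compile(
    r"Question:\s*(?P<question>.+?)\s*\nVerdict:\s*\[\[(?P<verdict>YES|NO)\]\]",
    re.IGNORECASE,
)


def parse_verdicts(raw_output: str) -> list[tuple[str, bool]]:
    """Return (question, passed) pairs found in the autorater's raw output."""
    return [
        (m.group("question").strip(), m.group("verdict").upper() == "YES")
        for m in VERDICT_RE.finditer(raw_output)
    ]


def pass_rate(pairs: list[tuple[str, bool]]) -> float:
    """Fraction of rubric questions with a YES verdict (0.0 if none parsed)."""
    return sum(passed for _, passed in pairs) / len(pairs) if pairs else 0.0


# Illustrative autorater output in the format requested by the critique prompt.
sample = """<question>
STEP 1: Is the summary one sentence?
STEP 5: Yes.
Question: Is the summary one sentence?
Verdict: [[YES]]
</question>

<question>
STEP 1: Does it mention the deadline?
STEP 5: No.
Question: Does it mention the deadline?
Verdict: [[NO]]
</question>
"""

pairs = parse_verdicts(sample)
print(pairs, pass_rate(pairs))
```

Keeping the raw output alongside the parsed verdicts makes it easier to debug rows where the autorater deviates from the requested format.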

### Visualize and understand eval results for custom rubric metric

As before, you can use the helper functions to visualize the evaluation results.

In [None]:
notebook_utils.display_eval_result(eval_result=eval_result)

In [None]:
notebook_utils.display_explanations(eval_result=eval_result, num=1)