In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Rubric-based instruction following evaluation using Gen AI Evaluation

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/rubric_based_eval.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fevaluation%2Frubric_based_eval.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/evaluation/rubric_based_eval.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/rubric_based_eval.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>

| Author |
| --- |
| [Naveksha Sood](https://github.com/navekshasood) |

## Overview

Rubric-based evaluation assesses LLM responses by first generating a set of evaluation rubrics (generally, yes/no questions) based on the original prompt. An autorater then evaluates the response by answering these questions to determine its quality.

Rubric-based evaluation has three steps:
1. **Rubric Generation**: Generate rubrics (questions) from the inference prompt.
2. **Rubric Revision** (optional): Review and revise the generated questions.
3. **Rubric Critiquing**: Judge a single LLM response against the rubrics (pointwise), or compare the responses of two LLMs, a candidate and a baseline, against the rubrics (pairwise).
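
The three steps above can be sketched in plain Python. This is only an illustration of the data flow, not how the Gen AI Evaluation service works internally: `generate_rubrics`, `revise_rubrics`, and `critique` are hypothetical helpers, and the keyword checks stand in for what would be LLM calls in practice.

```python
def generate_rubrics(prompt: str) -> list[str]:
    """Step 1 (toy stand-in for an LLM): derive yes/no questions from the prompt."""
    rubrics = ["Does the response address the prompt?"]
    if "concise" in prompt.lower():
        rubrics.append("Is the response concise?")
    if "legal" in prompt.lower():
        rubrics.append("Is the suggestion legal?")
    return rubrics


def revise_rubrics(rubrics: list[str], drop: set[str]) -> list[str]:
    """Step 2 (optional): a reviewer removes questions they do not want."""
    return [rubric for rubric in rubrics if rubric not in drop]


def critique(response: str, rubrics: list[str]) -> dict[str, bool]:
    """Step 3 (toy stand-in for an autorater): answer each question pointwise."""
    verdicts = {}
    for rubric in rubrics:
        if "concise" in rubric.lower():
            # Assumption for this sketch: "concise" means at most 30 words.
            verdicts[rubric] = len(response.split()) <= 30
        else:
            # Every other question passes as long as the response is non-empty.
            verdicts[rubric] = bool(response.strip())
    return verdicts


prompt = "Tell me one legal way to meet a celebrity. Keep it super concise."
rubrics = generate_rubrics(prompt)
revised = revise_rubrics(rubrics, drop={"Is the suggestion legal?"})
verdicts = critique("Attend a public book signing.", revised)
```

A pairwise critique would follow the same shape, with the toy judge answering each question for both the candidate and the baseline response.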


This tutorial shows how to use one of the predefined rubric-based metrics for your use case. Predefined recipes for both pointwise and pairwise evaluation are available for the following use cases:

*   **Instruction Following**
*   **Multimodal Understanding**
*   **Text Quality**

The tutorial uses the following billable Google Cloud services and resources:

*  Vertex AI

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator) to generate a cost estimate based on your projected usage.

## Getting Started

### Install Google Vertex AI SDK and other required packages

In [None]:
%pip install --upgrade --quiet "google-cloud-aiplatform[evaluation]"

### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information
To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [None]:
# Use the environment variable if the user doesn't provide Project ID.
import os

import vertexai

PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}

if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

vertexai.init(project=PROJECT_ID, location=LOCATION)
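
One caveat in the cell above: because the lookup is wrapped in `str()`, an unset `GOOGLE_CLOUD_PROJECT` turns `PROJECT_ID` into the literal string `"None"`. A minimal sketch of a stricter fallback is shown below. The helper name `resolve_project_id` is hypothetical and not part of the Vertex AI SDK.

```python
import os


def resolve_project_id(explicit: str = "", placeholder: str = "[your-project-id]") -> str:
    """Return the explicit project ID, else GOOGLE_CLOUD_PROJECT, else raise."""
    # Prefer a value the user typed in, unless it is still the template placeholder.
    if explicit and explicit != placeholder:
        return explicit
    # Fall back to the environment variable.
    env_value = os.environ.get("GOOGLE_CLOUD_PROJECT", "")
    if not env_value:
        raise ValueError(
            "Set PROJECT_ID or the GOOGLE_CLOUD_PROJECT environment variable."
        )
    return env_value
```

Failing early with a clear message may be easier to debug than a later API error caused by a project named `"None"`.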

### Import libraries

Import tutorial libraries.

In [None]:
# General
import pandas as pd

# Visualize results
from vertexai.evaluation import notebook_utils

# Evaluation
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import (
    EvalTask,
    PredefinedRubricMetrics,
)

## Rubric based evaluation for instruction following use case

### Create an eval dataset

In [None]:
prompt = [
    r"Imagine you are a twelfth grade math teacher. You need to explain to the students why `exp(i\pi)+1=0`. Do not go above 300 words. If you use Taylor expansion, please prepare a scratch proof.",
    "Can you tell me the best way to meet a celebrity? I know there are a bunch of ways, however, I am only looking for one way. Additionally, remember to keep it legal and safe. Also, super concise and do not go on and on.",
    "Write a short story (under 250 words) that begins with the sentence, 'The old clock chimed thirteen, and everything changed.' Focus on creating a vivid atmosphere and a surprising twist.",
]

eval_dataset = pd.DataFrame({"prompt": prompt})

### Rubric Generation

Generate rubrics for the eval dataset.

In [None]:
metric = PredefinedRubricMetrics.Pointwise.INSTRUCTION_FOLLOWING
data_with_rubrics = metric.generate_rubrics(eval_dataset)

### Rubric Revision

If you're using Colab, you can use the `google.colab` library to load `data_with_rubrics` into an interactive sheet, where you can review and revise the rubrics. Edit the sheet and save your updates before moving on.

In [None]:
if "google.colab" in sys.modules:
    from google.colab import sheets

    data_with_revised_rubrics = sheets.InteractiveSheet(df=data_with_rubrics)
    data_with_rubrics = data_with_revised_rubrics

### Rubric Critiquing

Create an eval task with `data_with_rubrics`, and use the metric defined earlier to critique each response against the generated rubrics.

In [None]:
eval_task = EvalTask(
    dataset=data_with_rubrics,
    metrics=[metric],
)

eval_result = eval_task.evaluate(model=GenerativeModel("gemini-2.0-flash"))

You can also skip generating and reviewing the rubrics as separate steps. If you set up the task directly with `eval_dataset` and call `.evaluate()`, the rubrics are generated first and the response is then evaluated against them, all in a single step.

### Eval results for pointwise rubric based metrics

1. `rubrics`: The questions used to rate the response.
2. `score`: The overall score aggregated across all rubrics for that prompt, between `0` and `1`.
3. `rubric_verdict_pairs`: Each question paired with the autorater's answer to it, parsed from the autorater's response.
4. `raw_outputs`: The raw autorater outputs, which are post-processed to produce `score` and `rubric_verdict_pairs`.
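
To make the relationship between `rubric_verdict_pairs` and `score` concrete, here is a minimal sketch. It assumes the score is the fraction of rubrics answered "yes" and that each pair is a `(question, verdict)` tuple. Both are assumptions made for illustration; the SDK's actual aggregation and data layout may differ, and `aggregate_score` is a hypothetical helper.

```python
def aggregate_score(rubric_verdict_pairs: list[tuple[str, bool]]) -> float:
    """Return the fraction of rubrics with a positive verdict (assumed aggregation)."""
    if not rubric_verdict_pairs:
        # No rubrics to judge; this sketch treats that as a score of 0.0.
        return 0.0
    passed = sum(1 for _, verdict in rubric_verdict_pairs if verdict)
    return passed / len(rubric_verdict_pairs)


example_pairs = [
    ("Is the story under 250 words?", True),
    ("Does the story begin with the required sentence?", True),
    ("Does the story include a surprising twist?", False),
]
example_score = aggregate_score(example_pairs)
```

Under this assumption, a response that satisfies two of three rubrics would score roughly `0.67`.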

In [None]:
notebook_utils.display_eval_result(eval_result=eval_result)

## Rubric based Instruction Following Autorater


In addition to the predefined rubric-based instruction following metric, you can use a proprietary metric by passing its name as a string:

In [None]:
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["rubric_based_instruction_following"],
)

eval_result = eval_task.evaluate(model=GenerativeModel("gemini-2.5-pro-preview-03-25"))

### Eval results

1. `rubric_based_instruction_following/per_rubric_result`: the answer for each generated rubric.
2. `rubric_based_instruction_following/score`: aggregated score across all rubrics.

In [None]:
notebook_utils.display_eval_result(eval_result=eval_result)