In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Rubric-based instruction following evaluation using Gen AI Evaluation Service

 <table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaltask_approach/rubric_based_eval.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fevaluation%2Fevaltask_approach%2Frubric_based_eval.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/evaluation/evaltask_approach/rubric_based_eval.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaltask_approach/rubric_based_eval.ipynb">
      <img width="32px" src="https://raw.githubusercontent.com/primer/octicons/refs/heads/main/icons/mark-github-24.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>

<b>Share to:</b>

<a href="https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaltask_approach/rubric_based_eval.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg" alt="LinkedIn logo">
</a>

<a href="https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaltask_approach/rubric_based_eval.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg" alt="Bluesky logo">
</a>

<a href="https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaltask_approach/rubric_based_eval.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg" alt="X logo">
</a>

<a href="https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaltask_approach/rubric_based_eval.ipynb" target="_blank">
  <img width="20px" src="https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png" alt="Reddit logo">
</a>

<a href="https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaltask_approach/rubric_based_eval.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg" alt="Facebook logo">
</a>


| Author |
| --- |
| [Naveksha Sood](https://github.com/navekshasood) |

## Overview

Rubric-based evaluation assesses LLM responses by first generating a set of evaluation rubrics (generally, yes/no questions) based on the original prompt. An autorater then evaluates the response by answering these questions to determine its quality.

Rubric-based evaluation has three steps:
1. **Rubric generation**: Generate rubrics (questions) from the inference prompt.
2. **Rubric revision** (optional): Review and revise the generated questions.
3. **Rubric critiquing**: Use the rubrics to judge the response from one LLM (pointwise) or to compare the responses from two LLMs, a candidate and a baseline (pairwise).
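
The three steps can be sketched in plain Python. This is an illustrative toy, not the Gen AI Evaluation Service: `generate_rubrics`, `critique`, and `aggregate` are hypothetical helpers, a keyword check stands in for the LLM autorater, and averaging the yes/no verdicts is an assumed aggregation rule.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    rubric: str
    passed: bool


def generate_rubrics(required_fields: list[str]) -> list[str]:
    """Step 1 (toy): one yes/no question per field the prompt asks for."""
    return [f"Does the response include the {field}?" for field in required_fields]


def critique(response: str, required_fields: list[str]) -> list[Verdict]:
    """Step 3 (toy): a keyword check stands in for the LLM autorater."""
    rubrics = generate_rubrics(required_fields)
    lowered = response.lower()
    return [
        Verdict(rubric=rubric, passed=field.lower() in lowered)
        for rubric, field in zip(rubrics, required_fields)
    ]


def aggregate(verdicts: list[Verdict]) -> float:
    """Assumed aggregation: fraction of rubrics answered 'yes'."""
    if not verdicts:
        raise ValueError("at least one rubric is required")
    return sum(v.passed for v in verdicts) / len(verdicts)


# Hypothetical fields and response, used only to exercise the sketch.
fields = ["patient name", "diagnosis", "medical history", "allergies"]
response = "Patient name: Mr. Gupta. Diagnosis: hypertension. Medical history: diabetes."
verdicts = critique(response, fields)
score = aggregate(verdicts)
print(score)  # 0.75: three of the four rubrics pass
```

Step 2 (revision) is a human review of the list returned by `generate_rubrics` and is left out of the sketch.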


This tutorial shows how to use the predefined rubric-based metric that matches your use case. Predefined recipes for both pointwise and pairwise evaluation are offered for the following use cases:

*   **Instruction Following**
*   **Multimodal Understanding**
*   **Text Quality**

The tutorial uses the following billable Google Cloud services and resources:

*  Vertex AI

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator) to generate a cost estimate based on your projected usage.

## Getting Started

### Install Google Vertex AI SDK and other required packages

In [1]:
%pip install --upgrade --quiet "google-cloud-aiplatform[evaluation]"


### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [2]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information
To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [4]:
# Use the environment variable if the user doesn't provide Project ID.
import os

import vertexai

PROJECT_ID = "adk-mcp-client"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}

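if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    # Fall back to the environment. Avoid str(None), which would yield the string "None".
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "")
if not PROJECT_ID:
    raise ValueError("Set PROJECT_ID or the GOOGLE_CLOUD_PROJECT environment variable.")
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID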

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

vertexai.init(project=PROJECT_ID, location=LOCATION)

### Import libraries

Import tutorial libraries.

In [5]:
# General
import pandas as pd

# Visualize results
from vertexai.evaluation import notebook_utils

# Evaluation
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import (
    EvalTask,
    PredefinedRubricMetrics,
)

## Rubric based evaluation for instruction following use case

### Create an eval dataset

In [80]:
from vertexai.generative_models import GenerativeModel, GenerationConfig

def extract_patientinfo(patientsinfo_file: str, Patientid: str) -> str:
    """Extracts information about one patient from patient-record text using Gemini.

    Args:
        patientsinfo_file: The full text of the patients info PDF document.
        Patientid: The ID of the patient to look up.

    Returns:
        The patient's details found in the text, or a message stating that
        no information was found for the given ID.
    """
    model = GenerativeModel("gemini-2.0-flash")

    prompt_text = f"""
    You are an AI assistant specialized in analyzing patient records.
    From the patient record text below, extract the details for patient "{Patientid}":
    patient info, medical history, consultation, diagnosis, and surgical interventions.

    If "{Patientid}" is not explicitly mentioned or no information is found
    regarding it, state "No specific information found for {Patientid}."

    Patient record text: {patientsinfo_file}"""

    generation_config = GenerationConfig(
        temperature=0,  # Deterministic output for extraction
        top_p=0.95,
        max_output_tokens=8192,  # Upper limit on response length
    )

    response = model.generate_content(
        prompt_text,
        generation_config=generation_config,
    )

    return response.text
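
Separating prompt construction from the model call makes the prompt easy to inspect and test without network access. The sketch below is one possible arrangement: `build_patient_prompt` is a hypothetical helper, not part of the Vertex AI SDK, and the record text passed in is a placeholder.

```python
def build_patient_prompt(record_text: str, patient_id: str) -> str:
    """Builds the extraction prompt without calling any model (hypothetical helper)."""
    if not patient_id.strip():
        raise ValueError("patient_id must not be empty")
    return (
        "You are an AI assistant that extracts details from patient records.\n"
        f"Extract the information recorded for patient {patient_id}: patient info, "
        "medical history, consultation, diagnosis, and surgical interventions.\n"
        f'If "{patient_id}" is not mentioned, reply exactly: '
        f'"No specific information found for {patient_id}."\n\n'
        f"Patient record text:\n{record_text}"
    )


# Placeholder record text, used only to show the shape of the prompt.
example_prompt = build_patient_prompt("Patient ID: A-123456 ...", "A-123456")
print(example_prompt)
```

The returned string could then be passed to `model.generate_content(...)` in place of an inline f-string.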


In [105]:
# Call extract_patientinfo with the record text and patient ID.
# `data` must already hold the full text of the patient record.
patient_id = "A-123456"
patient_info_extracted = extract_patientinfo(patientsinfo_file=data, Patientid=patient_id)
print(patient_info_extracted)

Here's the extracted information from the provided text:

*   **Patient Name:** Mr. Ramesh Gupta
*   **Patient ID:** A-123456
*   **Consultation Details:** Mr. Gupta presents with respiratory issues (shortness of breath, wheezing, increased fatigue), episodes of high blood pressure, and worsening intermittent heart blocks affecting daily activities.
*   **Diagnosis:** Respiratory issues likely secondary to hypertension and intermittent heart blocks.
*   **Surgical Procedures:** Further evaluation by a cardiologist is needed for heart blocks. Pulmonary function tests may be required to assess respiratory function before any surgical intervention is planned.
*   **Medical History:**
    *   Controlled Type 2 Diabetes Mellitus (since 2010)
    *   Hypertension (since 2015)
    *   Intermittent heart blocks (recent diagnosis)
    *   Allergies: No known drug allergies.



In [82]:
prompt = [
    f"Extract the patient's name, ID, consultation details, diagnosis info, surgical procedures, and medical history from the following patient record: {data}",
    f"Provide the medical history and diagnosis for patient A-123456 from this record: {data}",
    f"What are the consultation details and surgical intervention recommendations for Mr. Ramesh Gupta? Patient record: {data}",
]

eval_dataset = pd.DataFrame({"prompt": prompt})

In [83]:
print(eval_dataset)

                                              prompt
0  Extract the patient's name, ID, consultation d...
1  Provide the medical history and diagnosis for ...
2  What are the consultation details and surgical...


In [84]:
eval_dataset.head(10)

|  | prompt |
| --- | --- |
| 0 | Extract the patient's name, ID, consultation d... |
| 1 | Provide the medical history and diagnosis for ... |
| 2 | What are the consultation details and surgical... |


In [91]:
# @title ### Set Google Cloud project information
# @markdown To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).
# @markdown Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

# @markdown ---

import os

from vertexai import Client, types

PROJECT_ID = "adk-mcp-client"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    # Fall back to the environment. Avoid str(None), which would yield the string "None".
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "")
if not PROJECT_ID:
    raise ValueError("Set PROJECT_ID or the GOOGLE_CLOUD_PROJECT environment variable.")
LOCATION = "us-central1"  # @param {type: "string", placeholder: "us-central1", isTemplate: true}
LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", LOCATION)

client = Client(project=PROJECT_ID, location=LOCATION)

In [94]:
# Example: Generate rubrics using a predefined method
data_with_rubrics = client.evals.generate_rubrics(
    src=eval_dataset,
    rubric_group_name="general_quality_rubrics",
    predefined_spec_name=types.RubricMetric.GENERAL_QUALITY,
)

# Display the dataset with the generated rubrics
data_with_rubrics.show()

In [96]:
import os
import sys

# Do not hardcode API keys; read them from environment variables or a secret store.
# In Colab, load GOOGLE_API_KEY from the Secrets manager.
if "google.colab" in sys.modules:
    from google.colab import userdata

    os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")


geminiai_responses = client.evals.run_inference(
    model="gemini-2.5-flash",
    src=eval_dataset
)
geminiai_responses.show()

Gemini Inference: 100%|██████████| 3/3 [00:11<00:00,  3.88s/it]


In [97]:
eval_result = client.evals.evaluate(
    dataset=geminiai_responses,
    metrics=[
        types.RubricMetric.COHERENCE,
        types.RubricMetric.FLUENCY,
        types.Metric(name='rouge_1'),
        types.Metric(name='bleu'),
    ]
)
eval_result.show()

Computing Metrics for Evaluation Dataset:   0%|          | 0/12 [00:00<?, ?it/s]ERROR:vertexai._genai._evals_metric_handlers:Error processing metric fluency for case eval_case_2: 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'Judge model resource exhausted. Please try again later.', 'status': 'RESOURCE_EXHAUSTED'}}
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vertexai/_genai/_evals_metric_handlers.py", line 666, in get_metric_result
    response = self.module.evaluate_instances(metric_config=payload)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vertexai/_genai/evals.py", line 904, in evaluate_instances
    return self._evaluate_instances(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vertexai/_genai/evals.py", line 662, in _evaluate_instances
    response = self._api_client.request("post", path, request_dict, http_options)
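
The `429 RESOURCE_EXHAUSTED` message in the log above indicates that the judge model ran out of quota and asks the caller to try again later. One common way to handle this is to retry with exponential backoff. The sketch below is generic Python, not an SDK feature: `call_with_backoff` and `looks_like_quota_error` are hypothetical helpers, and matching on the error text is an assumption. The log does not show whether the SDK raises the error to the caller or only records it per case, so treat this as a pattern to adapt.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Calls fn, retrying retryable failures with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts or not is_retryable(exc):
                raise
            # Delay doubles each attempt (base, 2*base, 4*base, ...) plus up to 1s of jitter.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 1))
    raise RuntimeError("unreachable")


def looks_like_quota_error(exc: Exception) -> bool:
    """Assumption: quota errors can be recognized from the exception text."""
    text = str(exc)
    return "429" in text or "RESOURCE_EXHAUSTED" in text


# Stand-in for an evaluation call that hits the quota twice, then succeeds.
attempts = {"count": 0}


def flaky_evaluate() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429 RESOURCE_EXHAUSTED. Judge model resource exhausted.")
    return "ok"


delays: list[float] = []  # Record delays instead of sleeping, to keep the demo fast.
result = call_with_backoff(flaky_evaluate, looks_like_quota_error, sleep=delays.append)
print(result, attempts["count"], len(delays))
```

In the notebook, `fn` could be a lambda that wraps the evaluation call, with the default `time.sleep` left in place.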
  

In [88]:
print(data_with_rubrics.head(5))

                                              prompt  \
0  Extract the patient's name, ID, consultation d...   
1  Provide the medical history and diagnosis for ...   
2  What are the consultation details and surgical...   

                                             rubrics  
0  [Does the response extract the patient's name?...  
1  [Does the response include the patient name as...  
2  [Does the response include the patient's name ...  


### Rubric Generation

Generate rubrics for the eval dataset.

In [98]:
metric = PredefinedRubricMetrics.Pointwise.INSTRUCTION_FOLLOWING
data_with_rubrics = metric.generate_rubrics(eval_dataset)

INFO:vertexai.preview.evaluation._pre_eval_utils:Generating a total of 3 responses from Gemini model gemini-2.0-flash-001.
100%|██████████| 3/3 [00:04<00:00,  1.58s/it]


### Rubric Revision

If you're using Colab, you can leverage the `google.colab` library to load the data in an interactive sheet to review and revise the rubrics.

Load the `data_with_rubrics` in an interactive sheet, edit the sheet and save the updates.

In [99]:
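if "google.colab" in sys.modules:
    from google.colab import sheets

    # Open an editable sheet and revise the rubrics there.
    data_with_revised_rubrics = sheets.InteractiveSheet(df=data_with_rubrics)
    # Read the sheet back into a DataFrame; re-run this line after editing.
    data_with_rubrics = data_with_revised_rubrics.as_df()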

https://docs.google.com/spreadsheets/d/1MrRnd5TLpZQrYWEUkc35znubyH6MFI1Y9m_Oms3IbP4/edit#gid=0
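
After manual revision, it can help to normalize the rubric cells before critiquing. The sketch below assumes that a rubric list may come back from a spreadsheet as a JSON-style string, and that blank or duplicate questions should be dropped. `normalize_rubrics` is a hypothetical helper, and whether the evaluation service needs this cleanup is not confirmed by this notebook.

```python
import json
from typing import List, Union


def normalize_rubrics(cell: Union[str, List[str]]) -> List[str]:
    """Returns a clean list of rubric questions from a list or a JSON-encoded string."""
    if isinstance(cell, str):
        try:
            parsed = json.loads(cell)
        except json.JSONDecodeError:
            # Not JSON: treat each non-empty line as one question.
            parsed = cell.splitlines()
        if not isinstance(parsed, list):
            parsed = [parsed]
    else:
        parsed = list(cell)

    cleaned: List[str] = []
    for item in parsed:
        question = str(item).strip()
        # Drop blanks and exact duplicates while preserving order.
        if question and question not in cleaned:
            cleaned.append(question)
    return cleaned


from_json = normalize_rubrics('["Is the patient ID stated?", "Is the diagnosis stated?", ""]')
from_list = normalize_rubrics(["Is the diagnosis stated? ", "Is the diagnosis stated?"])
from_lines = normalize_rubrics("Is the patient ID stated?\nIs the diagnosis stated?")
print(from_json, from_list, from_lines)
```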


### Rubric Critiquing

Create an eval task with the `data_with_rubrics`, and use the metric defined earlier to critique the response based on generated rubrics.

In [100]:
eval_task = EvalTask(
    dataset=data_with_rubrics,
    metrics=[metric],
)

eval_result = eval_task.evaluate(model=GenerativeModel("gemini-2.0-flash"))

INFO:vertexai.preview.evaluation._pre_eval_utils:Generating a total of 3 responses from Gemini model gemini-2.0-flash.
100%|██████████| 3/3 [00:03<00:00,  1.10s/it]
INFO:vertexai.preview.evaluation._pre_eval_utils:All 3 responses are successfully generated from model.
INFO:vertexai.preview.evaluation._evaluation:Multithreaded Batch Inference took: 3.323775941999884 seconds.
INFO:vertexai.preview.evaluation._evaluation:Computing metrics with a total of 3 Vertex Gen AI Evaluation Service API requests.
100%|██████████| 3/3 [00:45<00:00, 15.05s/it]
INFO:vertexai.preview.evaluation._evaluation:All 3 metric requests are successfully computed.
INFO:vertexai.preview.evaluation._evaluation:Evaluation Took:45.1850313570003 seconds


You can also skip the separate rubric generation and review steps. If you create an `EvalTask` directly from `eval_dataset` and call `.evaluate()`, the rubrics are generated first and the response is then evaluated against them, all in a single call.

In [104]:
eval_result = client.evals.evaluate(
    dataset=geminiai_responses,
    metrics=[
        types.RubricMetric.COHERENCE,
        types.RubricMetric.FLUENCY,
        types.Metric(name='rouge_1'),
        types.Metric(name='bleu'),
    ]
)
eval_result.show()

Computing Metrics for Evaluation Dataset:   0%|          | 0/12 [00:00<?, ?it/s]ERROR:vertexai._genai._evals_metric_handlers:Error processing metric coherence for case eval_case_1: 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'Judge model resource exhausted. Please try again later.', 'status': 'RESOURCE_EXHAUSTED'}}
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vertexai/_genai/_evals_metric_handlers.py", line 666, in get_metric_result
    response = self.module.evaluate_instances(metric_config=payload)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vertexai/_genai/evals.py", line 904, in evaluate_instances
    return self._evaluate_instances(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vertexai/_genai/evals.py", line 662, in _evaluate_instances
    response = self._api_client.request("post", path, request_dict, http_options)


### Eval results for pointwise rubric based metrics

1. `rubrics`: The questions used to rate the response.
2. `score`: The aggregated score across all rubrics for that prompt, between `0` and `1`.
3. `rubric_verdict_pairs`: Each question paired with the autorater's answer, parsed from the autorater's response.
4. `raw_outputs`: The raw autorater outputs that were post-processed to produce `score` and `rubric_verdict_pairs`.

In [101]:
notebook_utils.display_eval_result(eval_result=eval_result)

### Summary Metrics

|  | row_count | rb_instruction_following/mean | rb_instruction_following/std |
| --- | --- | --- | --- |
| 0 | 3.0 | 0.958333 | 0.072169 |


### Row-based Metrics

|  | prompt | rubrics | response | rb_instruction_following/score | rb_instruction_following/rubric_verdict_pairs | rb_instruction_following/raw_outputs |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Extract the patient's name, ID, consultation d... | ["Does the response extract the patient's name... | Here's the extracted information from the pati... | 1.0 | `<question>: True\nDoes the response extract th...` | [Does the response extract the patient's name ... |
| 1 | Provide the medical history and diagnosis for ... | ["Does the response include the patient's name... | Okay, here's the extracted information from th... | 0.875 | Does the response include the patient's name?:... | [Does the response include the patient's name?... |
| 2 | What are the consultation details and surgical... | ["Does the response provide the patient's name... | Okay, here's the extracted information from Mr... | 1.0 | Does the response provide the patient's name a... | [Does the response provide the patient's name ... |
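
The summary metrics can be reproduced from the per-row scores. The check below assumes the summary uses the sample standard deviation (dividing by n - 1), which is the pandas default. Under that assumption the numbers match the reported values.

```python
import statistics

# rb_instruction_following/score for the three rows shown above.
row_scores = [1.0, 0.875, 1.0]

mean_score = statistics.mean(row_scores)
std_score = statistics.stdev(row_scores)  # Sample standard deviation (n - 1).

print(round(mean_score, 6), round(std_score, 6))  # 0.958333 0.072169
```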




## Rubric based Instruction Following Autorater


In addition to the predefined metric for rubric-based instruction following, you can use the proprietary `rubric_based_instruction_following` autorater metric as follows:

In [102]:
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["rubric_based_instruction_following"],
)

eval_result = eval_task.evaluate(model=GenerativeModel("gemini-2.0-flash"))

INFO:vertexai.preview.evaluation._pre_eval_utils:Generating a total of 3 responses from Gemini model gemini-2.0-flash.
100%|██████████| 3/3 [00:02<00:00,  1.07it/s]
INFO:vertexai.preview.evaluation._pre_eval_utils:All 3 responses are successfully generated from model.
INFO:vertexai.preview.evaluation._evaluation:Multithreaded Batch Inference took: 2.8105573470002128 seconds.
INFO:vertexai.preview.evaluation._evaluation:Computing metrics with a total of 3 Vertex Gen AI Evaluation Service API requests.
100%|██████████| 3/3 [02:39<00:00, 53.18s/it]
INFO:vertexai.preview.evaluation._evaluation:Evaluation Took:159.5669184999997 seconds


### Eval results

1. `rubric_based_instruction_following/per_rubric_result`: the answer for each generated rubric.
2. `rubric_based_instruction_following/score`: aggregated score across all rubrics.

In [103]:
notebook_utils.display_eval_result(eval_result=eval_result)

### Summary Metrics

|  | row_count | rubric_based_instruction_following/mean | rubric_based_instruction_following/std |
| --- | --- | --- | --- |
| 0 | 3.0 | 0.933333 |  |


### Row-based Metrics

|  | prompt | response | rubric_based_instruction_following/per_rubric_result | rubric_based_instruction_following/score |
| --- | --- | --- | --- | --- |
| 0 | Extract the patient's name, ID, consultation d... | Here's the extracted information from the pati... | Error |  |
| 1 | Provide the medical history and diagnosis for ... | Here's the extracted information from the medi... | [{'rubric': 'does the response correctly extra... | 0.933333 |
| 2 | What are the consultation details and surgical... | Here's the extracted information from Mr. Rame... | Error |  |
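
In this run, two of the three rows returned `Error`, so the reported mean comes from a single scored row and the standard deviation is empty. The sketch below shows one way to summarize scores while making the number of scored rows explicit. `summarize` is a hypothetical helper, and skipping failed rows is an assumption that matches the values shown above.

```python
import statistics
from typing import List, Optional


def summarize(scores: List[Optional[float]]) -> dict:
    """Summarizes scores, skipping rows where evaluation failed (None)."""
    valid = [s for s in scores if s is not None]
    return {
        "row_count": len(scores),
        "scored_rows": len(valid),
        "mean": statistics.mean(valid) if valid else None,
        # A sample standard deviation needs at least two values.
        "std": statistics.stdev(valid) if len(valid) >= 2 else None,
    }


# None marks the rows that returned "Error".
summary = summarize([None, 0.933333, None])
print(summary)
```

Reporting `scored_rows` next to `row_count` makes it clear when a mean rests on only part of the dataset.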


