In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# AutoSxS: Evaluate an LLM in Vertex AI Model Registry against a third-party model


<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/model_evaluation/autosxs_llm_evaluation_for_summarization_task.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/model_evaluation/autosxs_llm_evaluation_for_summarization_task.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/model_evaluation/autosxs_llm_evaluation_for_summarization_task.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

## Overview

This notebook demonstrates how to use Vertex AI automatic side-by-side (AutoSxS) to compare the performance of a generative AI model in Vertex AI Model Registry with that of a third-party language model.

AutoSxS is a model-assisted evaluation tool that helps you compare two large language models (LLMs) side by side. As part of AutoSxS's preview release, we only support comparing models for summarization and question answering tasks. We will support more tasks and customization in the future.

Learn more about [Vertex AI AutoSxS Model Evaluation](https://cloud.google.com/vertex-ai/docs/generative-ai/models/side-by-side-eval#autosxs).

### Objective

In this tutorial, you learn how to use `Vertex AI Pipelines` and `google_cloud_pipeline_components` to compare the performance of two LLMs.

This tutorial uses the following Google Cloud ML services and resources:

- Vertex AI Model Registry
- Vertex AI Pipelines
- Vertex AI Batch Predictions


The steps performed include:

- Fetch the dataset from the public source.
- Preprocess the data locally and save test data in GCS.
- Create and run a Vertex AI AutoSxS pipeline that generates the judgments and evaluates the two candidate models using the generated judgments.
- Print the judgments and evaluation metrics.
- Clean up the resources created in this notebook.

### Dataset

The dataset used for this tutorial is [Extreme Summarization (XSum)](https://arxiv.org/abs/1808.08745). The dataset consists of BBC articles and accompanying single-sentence summaries. Specifically, each article is prefaced with an introductory sentence (the summary), which is professionally written, typically by the author of the article. The dataset has 226,711 articles divided into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.
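
As a quick sanity check on the figures quoted above, the following sketch recomputes the total and the share of each split. The numbers are copied from the description; nothing is downloaded.

```python
# Split sizes as quoted in the dataset description above.
splits = {"train": 204_045, "validation": 11_332, "test": 11_334}
total = sum(splits.values())

# Print each split with its size and its share of the whole dataset.
for name, size in splits.items():
    print(f"{name:<10} {size:>7}  {size / total:.1%}")
print(f"{'total':<10} {total:>7}")
```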


### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
and [Cloud Storage pricing](https://cloud.google.com/storage/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook.

In [None]:
! pip3 install --upgrade --force-reinstall $USER_FLAG \
    google-cloud-aiplatform \
    google-cloud-pipeline-components==2.9.0 \
    datasets

### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can change the `REGION` variable, which is used for operations
throughout the rest of this notebook. The following regions are supported for AutoSxS:

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-southeast1`

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}
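
If you change the region, a small guard can catch typos before any job is submitted. The sketch below is only an illustration: `check_region` is a hypothetical helper, and the set of regions is copied from the list in this notebook, which may be out of date relative to the service.

```python
# Regions listed in this notebook as supported for AutoSxS (may change over time).
SUPPORTED_AUTOSXS_REGIONS = {"us-central1", "europe-west4", "asia-southeast1"}


def check_region(region: str) -> str:
    """Return `region` unchanged, or raise if it is not in the list above."""
    if region not in SUPPORTED_AUTOSXS_REGIONS:
        raise ValueError(
            f"{region!r} is not in the AutoSxS regions listed in this notebook: "
            f"{sorted(SUPPORTED_AUTOSXS_REGIONS)}"
        )
    return region


checked_region = check_region("us-central1")
```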

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### UUID

We define a UUID generation function to avoid resource name collisions on resources created within the notebook.

In [None]:
import random
import string

def generate_uuid(length: int = 8) -> str:
    """Generate a random lowercase alphanumeric ID of a specified length (default=8)."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()
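
Note that the helper above returns a random alphanumeric string, not an RFC 4122 UUID. If you prefer to derive the suffix from a real UUID, one possible alternative is sketched below. `generate_suffix` is a hypothetical name that is not used elsewhere in this notebook.

```python
import uuid


def generate_suffix(length: int = 8) -> str:
    """Return the first `length` hex characters of a random UUID4."""
    if not 1 <= length <= 32:
        raise ValueError("length must be between 1 and 32")
    return uuid.uuid4().hex[:length]


suffix = generate_suffix()
```

Either approach yields only lowercase letters and digits, which keeps the suffix usable in bucket and job names.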

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts for the AutoSxS pipeline.

In [None]:
BUCKET_URI = "gs://your-bucket-name-unique"  # @param {type:"string"}

Create your Cloud Storage bucket if it doesn't already exist.

In [None]:
# Fall back to a generated bucket name if BUCKET_URI is unset or still a placeholder.
if not BUCKET_URI or BUCKET_URI in ("gs://[your-bucket-name]", "gs://your-bucket-name-unique"):
    BUCKET_URI = f"gs://{PROJECT_ID}-aip-{UUID}"

! gsutil ls -b $BUCKET_URI || gsutil mb -l $REGION $BUCKET_URI
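
Before calling `gsutil`, it can help to check that the URI at least looks like a valid bucket name. The sketch below covers only a simplified subset of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). `looks_like_valid_bucket_uri` is a hypothetical helper, and passing this check does not guarantee that the name is accepted or available.

```python
import re

# Simplified pattern: 3-63 characters, lowercase letters/digits/._-,
# starting and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_uri(uri: str) -> bool:
    """Return True if `uri` is 'gs://' followed by a plausible bucket name."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        return False
    return _BUCKET_NAME.fullmatch(uri[len(prefix):]) is not None


print(looks_like_valid_bucket_uri("gs://my-project-aip-ab12cd34"))
print(looks_like_valid_bucket_uri("gs://[your-bucket-name]"))
```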

### Import libraries

Import the Vertex AI Python SDK and other required Python libraries.

In [None]:
import json
import os
import urllib
import uuid
import pickle


from google.cloud import aiplatform
from google_cloud_pipeline_components.preview import model_evaluation
from kfp import compiler
from datasets import load_dataset
import pandas as pd


### Initialize Vertex AI SDK for Python

Initialize the Vertex SDK for Python for your project and corresponding bucket.



In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

## Tutorial

### Generate Evaluation Dataset for AutoSxS

Below you create your dataset, specifying the set of prompts to evaluate on.

In this notebook, we:
- Download the Extreme Summarization (XSum) dataset from its public source.
- Use 10 examples from the original dataset to create the evaluation dataset for AutoSxS.
  - Data in column `document` (renamed to `content`) is treated as model prompts.
  - Data in column `summary` is treated as responses for model B, because model B is a third-party model in this notebook.
- Store it as a JSONL file in Cloud Storage.

#### **Note: For best results, we recommend users input 100-500 examples. There are diminishing returns past 400 examples.**
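
To illustrate the target file format, the sketch below writes two records as JSON Lines (one JSON object per line) and reads them back. The texts are placeholders, not real XSum data. The column names `content` and `summary` match the ones used in the rest of this notebook.

```python
import json

# Placeholder rows: `content` holds the prompt, `summary` holds model B's response.
rows = [
    {"content": "First article text ...", "summary": "First reference summary."},
    {"content": "Second article text ...", "summary": "Second reference summary."},
]

# JSON Lines: each record is serialized on its own line.
jsonl = "\n".join(json.dumps(row) for row in rows)
print(jsonl)

# Reading the file back is the reverse: parse each line independently.
parsed = [json.loads(line) for line in jsonl.splitlines()]
```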

In [None]:
# Download the dataset.
raw_datasets = load_dataset("xsum", split="train")

# Fetch 10 examples from the original dataset.
datasets_10 = raw_datasets.select(range(10))
print('dataset structure: \n', datasets_10)

# Create the evaluation dataset with 10 examples.
prompts = datasets_10['document']
summaries = datasets_10['summary']
examples = pd.DataFrame({'content': prompts, 'summary': summaries})

examples.head()


dataset structure: 
 Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 10
})


| | content | summary |
|---|---|---|
| 0 | The full cost of damage in Newton Stewart, one... | Clean-up operations are continuing across the ... |
| 1 | A fire alarm went off at the Holiday Inn in Ho... | Two tourist buses have been destroyed by fire ... |
| 2 | Ferrari appeared in a position to challenge un... | Lewis Hamilton stormed to pole position at the... |
| 3 | John Edward Bates, formerly of Spalding, Linco... | A former Lincolnshire Police officer carried o... |
| 4 | Patients and staff were evacuated from Cerahpa... | An armed man who locked himself into a room at... |


#### [Optional] Load your JSONL evaluation dataset from GCS

Alternatively, you can load your own JSONL dataset from GCS.

In [None]:
# # Uncomment to read from GCS.
# GCS_PATH = 'gs://your-own-evaluation-dataset.jsonl'
# examples = pd.read_json(GCS_PATH, lines=True)

Next, we upload our final dataset to GCS to be used as input for AutoSxS.

In [None]:
examples.to_json('evaluation_dataset.json', orient='records', lines=True)
! gsutil cp evaluation_dataset.json $BUCKET_URI/input/evaluation_dataset.json
DATASET = f'{BUCKET_URI}/input/evaluation_dataset.json'

Copying file://evaluation_dataset.json [Content-Type=application/json]...
/ [1 files][ 21.8 KiB/ 21.8 KiB]
Operation completed over 1 objects/21.8 KiB.


### Create and Run AutoSxS Job

To run AutoSxS, we define an `autosxs_pipeline` job with the following parameters. For more details, see the [AutoSxS pipeline configuration reference](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-2.9.0/api/preview/model_evaluation.html#preview.model_evaluation.autosxs_pipeline).



**Required Parameters:**
  - **evaluation_dataset:** A list of GCS paths to a JSONL dataset containing
      evaluation examples.
  - **task:** Evaluation task in the form {task}@{version}. task can be one of
      "summarization", "question_answering". Version is an integer with 3 digits or
      "latest". Ex: summarization@001 or question_answering@latest.
  - **id_columns:** The columns which distinguish unique evaluation examples.
  - **autorater_prompt_parameters:** Map of autorater prompt parameters to columns
      or templates. The expected parameters are:
      - inference_instruction - Details on how to perform a task.
      - inference_context - Content to reference to perform the task.
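
The `task` format described above can be checked locally before a job is submitted. The following is a minimal sketch based only on the description in this notebook (task name, `@`, then three digits or `latest`). `is_valid_task` is a hypothetical helper, and the service may accept or reject values that this pattern does not anticipate.

```python
import re

# {task}@{version}: task is one of the two names listed above,
# version is exactly three digits or the literal "latest".
_TASK_PATTERN = re.compile(r"(summarization|question_answering)@(\d{3}|latest)")


def is_valid_task(task: str) -> bool:
    """Return True if `task` matches the documented {task}@{version} form."""
    return _TASK_PATTERN.fullmatch(task) is not None


print(is_valid_task("summarization@001"))
print(is_valid_task("summarization@1"))
```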

Additionally, we need to specify where the predictions for the candidate models (Model A and Model B) come from. AutoSxS can either run Vertex Batch Prediction to get predictions, or a predefined predictions column can be provided in the evaluation dataset.

**Model Parameters if using Batch Prediction (assuming Model A):**
  - **model_a:** A fully-qualified model resource name. This parameter is optional
      if Model A responses are specified.
  - **model_a_prompt_parameters:** Map of Model A prompt template parameters to
      columns or templates. In the case of [text-bison](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text#request_body), the only parameter needed is `prompt`.
  - **model_a_parameters:** The parameters that govern the predictions from model A such as the model temperature.

**Model Parameters if bringing your own predictions (assuming Model A):**
  - **response_column_a:** The column containing responses for model A. Required if
      any response tables are provided for model A.
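
When both models bring their own predictions, the parameter map might look like the sketch below. It mirrors the keys described in this section but has not been run against the pipeline. `build_byo_parameters`, the column names `summary_a` and `summary_b`, and the dataset URI are hypothetical placeholders.

```python
def build_byo_parameters(
    dataset_uri: str,
    prompt_column: str,
    response_column_a: str,
    response_column_b: str,
    task: str = "summarization@001",
) -> dict:
    """Build AutoSxS parameters for two models with precomputed responses."""
    return {
        "evaluation_dataset": dataset_uri,
        "id_columns": [prompt_column],
        "task": task,
        "autorater_prompt_parameters": {
            "inference_context": {"column": prompt_column},
            "inference_instruction": {"template": "{{ default_instruction }}"},
        },
        # No `model_a` or `model_b`: responses are read from dataset columns.
        "response_column_a": response_column_a,
        "response_column_b": response_column_b,
    }


byo_parameters = build_byo_parameters(
    dataset_uri="gs://your-bucket/input/evaluation_dataset.json",  # placeholder
    prompt_column="content",
    response_column_a="summary_a",  # hypothetical column
    response_column_b="summary_b",  # hypothetical column
)
```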

Lastly, there are parameters that configure additional features, such as exporting the judgments or comparing judgments to a human-preference dataset to check the autorater's alignment with human raters.
  - **judgments_format:** The format to write judgments to. Can be either 'json' or
      'bigquery'.
  - **bigquery_destination_prefix:** BigQuery table to write judgments to if the
      specified format is 'bigquery'.
  - **human_preference_column:** The column containing ground truths. Only required
      when users want to check the autorater alignment against human preference.

In this notebook, we will evaluate a third-party model's predictions (located in the `summary` column of `DATASET`) against the output of `text-bison@001` using a built-in summarization instruction. The task being performed is summarization.

First, compile the AutoSxS pipeline locally.

In [None]:
template_uri = 'pipeline.yaml'
compiler.Compiler().compile(
    pipeline_func=model_evaluation.autosxs_pipeline,
    package_path=template_uri,
)

The following code starts a Vertex AI Pipelines job, viewable from the Vertex AI UI. The pipeline job takes about 15 minutes.

The logs include the URL of the current pipeline run, so you can follow the pipeline's progress and view its outputs.

In [None]:
display_name = f'autosxs-summarization-{generate_uuid()}'
prompt_column = 'content'
response_column_b = 'summary'
DATASET = f'{BUCKET_URI}/input/evaluation_dataset.json'
parameters = {
    'evaluation_dataset': DATASET,
    'id_columns': [prompt_column],
    'autorater_prompt_parameters': {
        'inference_context': {'column': prompt_column},
        'inference_instruction': {'template': '{{ default_instruction }}'},
    },
    'task': 'summarization@001',
    'model_a': 'publishers/google/models/text-bison@001',
    'model_a_prompt_parameters': {
        'prompt': {
            'template': '{{ default_instruction }}: {{' + prompt_column + "}}.",
            # 'template': 'Summarize the following: {{' + prompt_column + "}}.",  - This is also okay.
        },
    },
    'response_column_b': response_column_b,
}

job = aiplatform.PipelineJob(
    job_id=display_name,
    display_name=display_name,
    pipeline_root=os.path.join(BUCKET_URI, display_name),
    template_path=template_uri,
    parameter_values=parameters,
    enable_caching=False,
)
job.run()

INFO:google.cloud.aiplatform.pipeline_jobs:Creating PipelineJob
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob created. Resource name: projects/942664513926/locations/us-central1/pipelineJobs/autosxs-summarization-8k0lwxvt
INFO:google.cloud.aiplatform.pipeline_jobs:To use this PipelineJob in another session:
INFO:google.cloud.aiplatform.pipeline_jobs:pipeline_job = aiplatform.PipelineJob.get('projects/942664513926/locations/us-central1/pipelineJobs/autosxs-summarization-8k0lwxvt')
INFO:google.cloud.aiplatform.pipeline_jobs:View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/autosxs-summarization-8k0lwxvt?project=942664513926
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob projects/942664513926/locations/us-central1/pipelineJobs/autosxs-summarization-8k0lwxvt current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob projects/942664513926/locations/us-central1/pipelineJobs/autos
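
The Model A prompt template in the cell above is assembled by string concatenation, which can be hard to read. The sketch below shows the string it produces and an equivalent f-string form. It only builds strings locally and does not call any service.

```python
prompt_column = "content"

# Same concatenation as in the parameters cell above.
template = "{{ default_instruction }}: {{" + prompt_column + "}}."
print(template)  # {{ default_instruction }}: {{content}}.

# Equivalent f-string: each literal brace must be doubled, so "{{" becomes "{{{{".
template_f = f"{{{{ default_instruction }}}}: {{{{{prompt_column}}}}}."
```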

### Get the judgments and AutoSxS win-rate metrics

Next, we load the judgments from the completed AutoSxS job.

The results are written to the Cloud Storage output bucket you specified in the AutoSxS job request.

In [None]:
# To use an existing pipeline, override job using the line below.
# job = aiplatform.PipelineJob.get('projects/[PROJECT_NUMBER]/locations/[REGION]/pipelineJobs/[PIPELINE_RUN_NAME]')

for details in job.task_details:
  if details.task_name == 'autosxs-arbiter':
    break

# Judgments
judgments_uri = details.outputs['judgments'].artifacts[0].uri
judgments_df = pd.read_json(judgments_uri, lines=True)
judgments_df.head()

| | content | inference_instruction | inference_context | response_a | response_b | choice | explanation | confidence |
|---|---|---|---|---|---|---|---|---|
| 0 | John Edward Bates, formerly of Spalding, Linco... | Summarize INPUT in a few sentences. Rely stric... | John Edward Bates, formerly of Spalding, Linco... | John Edward Bates, 67, is accused of sexually ... | A former Lincolnshire Police officer carried o... | A | Response (A) is more concise and contains only... | 0.7 |
| 1 | The full cost of damage in Newton Stewart, one... | Summarize INPUT in a few sentences. Rely stric... | The full cost of damage in Newton Stewart, one... | Flooding has caused damage in Newton Stewart, ... | Clean-up operations are continuing across the ... | A | Response (A) provides more details than Respon... | 1.0 |
| 2 | Belgian cyclist Demoitie died after a collisio... | Summarize INPUT in a few sentences. Rely stric... | Belgian cyclist Demoitie died after a collisio... | Belgian cyclist Antoine Demoitie died after a ... | Welsh cyclist Luke Rowe says changes to the sp... | A | Response (A) follows the instruction and summa... | 1.0 |
| 3 | Simone Favaro got the crucial try with the las... | Summarize INPUT in a few sentences. Rely stric... | Simone Favaro got the crucial try with the las... | Glasgow Warriors came back from a 10-10 half-t... | Defending Pro12 champions Glasgow Warriors bag... | A | Response (A) follows the instruction and provi... | 0.8 |
| 4 | A fire alarm went off at the Holiday Inn in Ho... | Summarize INPUT in a few sentences. Rely stric... | A fire alarm went off at the Holiday Inn in Ho... | Two tour buses were set on fire in the car par... | Two tourist buses have been destroyed by fire ... | B | Response (B) is more concise and contains all ... | 1.0 |
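
The `for ... break` lookup used above silently leaves `details` pointing at the last task if no task name matches. A stricter variant is sketched below. `find_task` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for the entries of `job.task_details` so that the example runs without a pipeline job.

```python
from types import SimpleNamespace


def find_task(task_details, task_name: str):
    """Return the first task whose `task_name` matches, or raise LookupError."""
    for details in task_details:
        if details.task_name == task_name:
            return details
    available = sorted(d.task_name for d in task_details)
    raise LookupError(f"Task {task_name!r} not found; available tasks: {available}")


# Stand-ins for job.task_details, for illustration only.
fake_task_details = [
    SimpleNamespace(task_name="autosxs-arbiter"),
    SimpleNamespace(task_name="autosxs-metrics-computer"),
]
arbiter = find_task(fake_task_details, "autosxs-arbiter")
```

With a real job, the call would be `find_task(job.task_details, 'autosxs-arbiter')`.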


If any examples fail to produce a result in AutoSxS, their error messages are stored in an error table. An empty error table means that no examples failed during the evaluation.

In [None]:
for details in job.task_details:
  if details.task_name == 'autosxs-arbiter':
    break

# Error table
error_messages_uri = details.outputs['error_messages'].artifacts[0].uri
errors_df = pd.read_json(error_messages_uri, lines=True)
errors_df.head()

We can also look at metrics computed from the judgments. AutoSxS outputs the win rate to show how often one model outperformed another.

In [None]:
# Metrics
for details in job.task_details:
  if details.task_name == 'autosxs-metrics-computer':
    break
pd.DataFrame([details.outputs['autosxs_metrics'].artifacts[0].metadata])

| | autosxs_model_a_win_rate | autosxs_model_b_win_rate |
|---|---|---|
| 0 | 0.9 | 0.1 |
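
To build intuition for these numbers, the sketch below computes a win rate as the plain fraction of examples in which the autorater chose each model. This is an assumption about how the metric is derived; the pipeline's metrics component may compute it differently. `win_rates` is a hypothetical helper, and the sample holds only the five `choice` values shown in the judgments preview, not all 10 examples.

```python
from collections import Counter


def win_rates(choices) -> dict:
    """Return the fraction of judgments won by model A and by model B."""
    choices = list(choices)
    if not choices:
        raise ValueError("no judgments to aggregate")
    counts = Counter(choices)
    return {model: counts[model] / len(choices) for model in ("A", "B")}


# `choice` values from the five rows shown by judgments_df.head().
sample_choices = ["A", "A", "A", "A", "B"]
print(win_rates(sample_choices))
```

With the real results, the input would be `judgments_df['choice']`.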


## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

Set `delete_bucket` to **True** to delete the Cloud Storage bucket.

In [None]:
import os

job.delete()

# Delete Cloud Storage objects that were created
delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI