In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# AutoSxS: Check autorater alignment against a human-preference dataset


<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/model_evaluation/model_based_llm_evaluation/autosxs_check_alignment_against_human_preference_data.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/model_evaluation/model_based_llm_evaluation/autosxs_check_alignment_against_human_preference_data.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/model_evaluation/model_based_llm_evaluation/autosxs_check_alignment_against_human_preference_data.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview

This notebook demonstrates how to use Vertex AI automatic side-by-side (AutoSxS) to check how well the autorater aligns with the human rater.

Automatic side-by-side (AutoSxS) is a model-assisted evaluation tool that helps you compare two large language models (LLMs) side by side. As part of AutoSxS's preview release, we only support comparing models for summarization and question answering tasks. We will support more tasks and customization in the future.

Learn more about [Vertex AI AutoSxS Model Evaluation](https://cloud.google.com/vertex-ai/docs/generative-ai/models/side-by-side-eval#autosxs).

### Objective

In this tutorial, you learn how to use `Vertex AI Pipelines` and `google_cloud_pipeline_components` to check autorater alignment using human-preference data.

This tutorial uses the following Google Cloud ML services and resources:

- Cloud Storage
- Vertex AI PaLM API
- Vertex AI Pipelines
- Vertex AI Batch Prediction


The steps performed include:
- Create an evaluation dataset with predictions and human-preference data.
- Preprocess the data locally and save it in Cloud Storage.
- Create and run a Vertex AI AutoSxS Pipeline that generates the judgments and a set of AutoSxS metrics using the generated judgments.
- Print the judgments and AutoSxS metrics.
- Clean up the resources created in this notebook.

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
and [Cloud Storage pricing](https://cloud.google.com/storage/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook.

In [1]:
! pip3 install --upgrade --force-reinstall \
    google-cloud-aiplatform \
    google-cloud-pipeline-components==2.10.0 \
    gcsfs

### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can change the `REGION` variable, which is used for operations
throughout the rest of this notebook. AutoSxS supports the following regions:

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-southeast1`

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}
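A mistyped or unsupported region only shows up later as a failed pipeline run, so it can help to check it up front. The following is a minimal sketch, assuming the three regions listed above are the complete supported set; `SUPPORTED_AUTOSXS_REGIONS` and `check_region` are hypothetical names introduced for illustration, not part of the Vertex AI SDK.

```python
# Regions listed in this notebook as supported for AutoSxS (assumed complete).
SUPPORTED_AUTOSXS_REGIONS = {"us-central1", "europe-west4", "asia-southeast1"}


def check_region(region: str) -> str:
    """Return the region unchanged, or raise if it is not in the list above."""
    if region not in SUPPORTED_AUTOSXS_REGIONS:
        raise ValueError(
            f"{region!r} is not in the supported list: "
            f"{sorted(SUPPORTED_AUTOSXS_REGIONS)}"
        )
    return region


print(check_region("us-central1"))
```

In the notebook, this could be called as `check_region(REGION)` before initializing the SDK.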

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### UUID

We define a UUID generation function to avoid resource name collisions on resources created within the notebook.

In [None]:
import random
import string


def generate_uuid(length: int = 8) -> str:
    """Generate a uuid of a specified length (default=8)."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts of the AutoSxS pipeline.

In [None]:
BUCKET_URI = "gs://[your-bucket-name-unique]"  # @param {type:"string"}

**Only if your bucket doesn't already exist:** Run the following cell to create your Cloud Storage bucket.

In [2]:
if (
    BUCKET_URI == ""
    or BUCKET_URI is None
    or BUCKET_URI == "gs://[your-bucket-name-unique]"
):
    BUCKET_URI = "gs://" + PROJECT_ID + "-aip-" + UUID

! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

### Import libraries

Import the Vertex AI Python SDK and other required Python libraries.

In [3]:
import os

import pandas as pd
from google.cloud import aiplatform
from google_cloud_pipeline_components.preview import model_evaluation
from kfp import compiler

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.


In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

## Tutorial
It is unlikely that the autorater will perform at the same level as human raters in all customer use cases, especially in cases where human raters are expected to have specialized knowledge.

The tutorial below shows how AutoSxS helps to determine if you can trust the autorater once you have the ground-truth human-preference data.


### Generate evaluation dataset for AutoSxS human alignment checking

Below you create your dataset, specifying the set of prompts, predictions from two models and the human-preference data.

In this notebook, we:
- Create an evaluation dataset with 10 examples for AutoSxS.
  - Data in column `prompt` will be treated as model prompts.
  - Data in column `pred_a` will be treated as responses for model A.
  - Data in column `pred_b` will be treated as responses for model B.
  - Data in column `actuals` will be treated as the human-preference data.
- Store it as a JSONL file in Cloud Storage.

#### **Note: For best results, we recommend users input 100-500 examples. There are diminishing returns past 400 examples.**

In [4]:
# Define context, questions, predictions and human preference data.
context = [
    "Beginning in the late 1910s and early 1920s, Whitehead gradually turned his attention from mathematics to philosophy of science, and finally to metaphysics. He developed a comprehensive metaphysical system which radically departed from most of western philosophy. Whitehead argued that reality consists of processes rather than material objects, and that processes are best defined by their relations with other processes, thus rejecting the theory that reality is fundamentally constructed by bits of matter that exist independently of one another. Today Whitehead's philosophical works – particularly Process and Reality – are regarded as the foundational texts of process philosophy.",
    "The gills have an adnate attachment to the cap, are narrow to moderately broad, closely spaced, and eventually separate from the stem. Young gills are cinnamon-brown in color, with lighter edges, but darken in maturity because they become covered with the dark spores. The stem is 6 to 8 cm (2+3⁄8 to 3+1⁄8 in) long by 1.5 to 2 mm (1⁄16 to 3⁄32 in) thick, and roughly equal in width throughout except for a slightly enlarged base. The lower region of the stem is brownish in color and has silky 'hairs' pressed against the stem; the upper region is grayish and pruinose (lightly dusted with powdery white granules). The flesh turns slightly bluish or greenish where it has been injured. The application of a drop of dilute potassium hydroxide solution on the cap or flesh will cause a color change to pale to dark yellowish to reddish brown; a drop on the stem produces a less intense or no color change.",
    "Go to Device Support. Choose your device. Scroll to Getting started and select Hardware &amp; phone details. Choose Insert or remove SIM card and follow the steps. Review the Account Summary page for details. Image 13 Activate online Go to att.com/activateprepaid ((att.com/activarprepaid for Spanish)) and follow the prompts. Activate over the phone Call us at 877.426.0525 for automated instructions. You will need to know your SIM/eSIM ICCID &amp; IMEI number for activation. Note: Look for your SIM (( ICCID )) number on your box or SIM card Now youre ready to activate your phone 1. Start with your new device powered off. 2. To activate a new line of service or a replacement device, please go to the AT&amp;T Activation site or call 866.895.1099. You download the eSIM to your device over Wi-Fi®. The eSIM connects your device to our wireless network. How do I activate my phone with an eSIM? Turn your phone on, connect to Wi-Fi, and follow the prompts. Swap active SIM cards AT&amp;T Wireless SM SIM Card Turn your device off. Remove the old SIM card. Insert the new one. Turn on your device.",
    "According to chief astronaut Deke Slayton's autobiography, he chose Bassett for Gemini 9 because he was 'strong enough to carry' both himself and See. Slayton had also assigned Bassett as command module pilot for the second backup Apollo crew, alongside Frank Borman and William Anders.",
    "Adaptation of the endosymbiont to the host's lifestyle leads to many changes in the endosymbiont–the foremost being drastic reduction in its genome size. This is due to many genes being lost during the process of metabolism, and DNA repair and recombination. While important genes participating in the DNA to RNA transcription, protein translation and DNA/RNA replication are retained. That is, a decrease in genome size is due to loss of protein coding genes and not due to lessening of inter-genic regions or open reading frame (ORF) size. Thus, species that are naturally evolving and contain reduced sizes of genes can be accounted for an increased number of noticeable differences between them, thereby leading to changes in their evolutionary rates. As the endosymbiotic bacteria related with these insects are passed on to the offspring strictly via vertical genetic transmission, intracellular bacteria goes through many hurdles during the process, resulting in the decrease in effective population sizes when compared to the free living bacteria. This incapability of the endosymbiotic bacteria to reinstate its wild type phenotype via a recombination process is called as Muller's ratchet phenomenon. Muller's ratchet phenomenon together with less effective population sizes has led to an accretion of deleterious mutations in the non-essential genes of the intracellular bacteria. This could have been due to lack of selection mechanisms prevailing in the rich environment of the host.",
    "The National Archives Building in downtown Washington holds record collections such as all existing federal census records, ships' passenger lists, military unit records from the American Revolution to the Philippine–American War, records of the Confederate government, the Freedmen's Bureau records, and pension and land records.",
    "Standard 35mm photographic film used for cinema projection has a much higher image resolution than HDTV systems, and is exposed and projected at a rate of 24 frames per second (frame/s). To be shown on standard television, in PAL-system countries, cinema film is scanned at the TV rate of 25 frame/s, causing a speedup of 4.1 percent, which is generally considered acceptable. In NTSC-system countries, the TV scan rate of 30 frame/s would cause a perceptible speedup if the same were attempted, and the necessary correction is performed by a technique called 3:2 Pulldown: Over each successive pair of film frames, one is held for three video fields (1/20 of a second) and the next is held for two video fields (1/30 of a second), giving a total time for the two frames of 1/12 of a second and thus achieving the correct average film frame rate.",
    "Maria Deraismes was initiated into Freemasonry in 1882, then resigned to allow her lodge to rejoin their Grand Lodge. Having failed to achieve acceptance from any masonic governing body, she and Georges Martin started a mixed masonic lodge that actually worked masonic ritual. Annie Besant spread the phenomenon to the English speaking world. Disagreements over ritual led to the formation of exclusively female bodies of Freemasons in England, which spread to other countries. Meanwhile, the French had re-invented Adoption as an all-female lodge in 1901, only to cast it aside again in 1935. The lodges, however, continued to meet, which gave rise, in 1959, to a body of women practising continental Freemasonry.",
    "Excavation of the foundations began in November 1906, with an average of 275 workers during the day shift and 100 workers during the night shift. The excavation was required to be completed in 120 days. To remove the spoils from the foundation, three temporary wooden platforms were constructed to street level. Hoisting engines were installed to place the beams for the foundation, while the piers were sunk into the ground under their own weight. Because of the lack of space in the area, the contractors' offices were housed beneath the temporary platforms. During the process of excavation, the Gilsey Building's foundations were underpinned or shored up, because that building had relatively shallow foundations descending only 18 feet (5.5 m) below Broadway.",
    "Dopamine consumed in food cannot act on the brain, because it cannot cross the blood–brain barrier. However, there are also a variety of plants that contain L-DOPA, the metabolic precursor of dopamine. The highest concentrations are found in the leaves and bean pods of plants of the genus Mucuna, especially in Mucuna pruriens (velvet beans), which have been used as a source for L-DOPA as a drug. Another plant containing substantial amounts of L-DOPA is Vicia faba, the plant that produces fava beans (also known as 'broad beans'). The level of L-DOPA in the beans, however, is much lower than in the pod shells and other parts of the plant. The seeds of Cassia and Bauhinia trees also contain substantial amounts of L-DOPA.",
]

questions = [
    "What was the predominant theory of reality that Whitehead opposed?",
    "Why do the gills on the Psilocybe pelliculosa mushroom darken as they mature?",
    "user: How do I provision my AT&T SIM card?",
    "Why did chief astronaut Deke Slayton choose Charles Bassett for Gemini 9, according to Slayton's autobiography?",
    "What is the main alteration in an endosymbiont when it adapts to a host?",
    "What's the earliest war The National Archives Building has military unit records for",
    "To be shown on SDTV in PAL-system countries, at what rate is cinema film scanned?",
    "What year was the all-female masonic lodge cast aside?",
    "Why did the Gilsey Building have underpinned and shored up foundations?",
    "Why can dopamine consumed in food not act on the brain?",
]
predictions_a = [
    "bits of matter that exist independently of one another",
    "The gills darken in maturity because they become covered with the dark spores.",
    "Go to Device Support. Choose your device. Scroll to Getting started and select Hardware &amp; phone details. Choose Insert or remove SIM card and follow the steps.",
    "he was 'smart enough to carry' both himself and See",
    "drastic reduction in its genome size",
    "American Revolution to the Philippine–American War",
    "Cinema film is scanned at the TV rate of 25 frame/s.",
    "1935",
    "The Gilsey Building's foundations were shored up because they were only 18 feet below Broadway.",
    "The blood–brain barrier does not allow dopamine consumed in food to enter the brain.",
]
predictions_b = [
    "independent bits of matter",
    "Young gills are cinnamon-brown in color, with lighter edges, but darken in maturity because they become covered with the dark spores.",
    "Go to Device Support.",
    "he was 'strong enough to carry' both himself and See, as stated by chief astronaut Deke Slayton in his autobiography",
    "its genome size decrease",
    "American Revolution",
    "25 frame/s, causing a speedup of 4.1 percent",
    "1901",
    "The Gilsey Building's foundations were underpinned or shored up.",
    "Mucuna pruriens (velvet beans) have been used as a source for L-DOPA as a drug. Another plant containing substantial amounts of L-DOPA is Vicia faba, the plant that produces fava beans (also known as 'broad beans').",
]

human_preference = [
    "A",
    "B",
    "A",
    "B",
    "A",
    "A",
    "B",
    "A",
    "A",
    "A",
]

# Create the evaluation dataset with context, questions, predictions and human preference data.
examples = pd.DataFrame(
    {
        "context": context,
        "questions": questions,
        "pred_a": predictions_a,
        "pred_b": predictions_b,
        "actuals": human_preference,
    }
)
examples.head()
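Before uploading, it can be worth checking that every row has all five columns and a valid preference label. The sketch below is one way to do that with plain Python records. It assumes that `actuals` holds only `"A"` or `"B"`, as in this notebook's data, and that the `questions` column serves as the unique ID. `validate_records` is a hypothetical helper, not part of AutoSxS.

```python
from typing import Dict, List

# Column names used in this notebook's evaluation dataset.
REQUIRED_COLUMNS = ("context", "questions", "pred_a", "pred_b", "actuals")
# Assumption: human preferences are encoded as "A" or "B", as in this notebook.
VALID_PREFERENCES = {"A", "B"}


def validate_records(records: List[Dict[str, str]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    seen_questions = set()
    for i, record in enumerate(records):
        # Flag columns that are absent or hold an empty value.
        missing = [c for c in REQUIRED_COLUMNS if not record.get(c)]
        if missing:
            problems.append(f"row {i}: missing or empty {missing}")
        if record.get("actuals") not in VALID_PREFERENCES:
            problems.append(f"row {i}: actuals must be 'A' or 'B'")
        # The question text is used as the ID column, so it should be unique.
        question = record.get("questions")
        if question in seen_questions:
            problems.append(f"row {i}: duplicate id column value")
        seen_questions.add(question)
    return problems


toy = [
    {"context": "c", "questions": "q1", "pred_a": "a", "pred_b": "b", "actuals": "A"},
    {"context": "c", "questions": "q2", "pred_a": "a", "pred_b": "b", "actuals": "B"},
]
print(validate_records(toy))
```

With the DataFrame above, the same check could be run as `validate_records(examples.to_dict(orient="records"))`.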

#### [Optional] Load your JSONL evaluation dataset from Cloud Storage.

Alternatively, you can load your own JSONL dataset from Cloud Storage.


In [None]:
# # Uncomment to read from Cloud Storage.
# GCS_PATH = 'gs://your-own-evaluation-dataset-with-human-preference-data.jsonl'
# examples = pd.read_json(GCS_PATH, lines=True)

#### Upload your dataset to Cloud Storage

Finally, we upload our evaluation dataset to Cloud Storage to be used as input for AutoSxS.

In [5]:
# Upload predictions to the Cloud Storage bucket.
examples.to_json(
    "evaluation_dataset_with_human_preference.json", orient="records", lines=True
)
! gsutil cp evaluation_dataset_with_human_preference.json $BUCKET_URI/input/evaluation_dataset_with_human_preference.json
DATASET = f"{BUCKET_URI}/input/evaluation_dataset_with_human_preference.json"
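The `orient="records", lines=True` options write the dataset as JSON Lines: one JSON object per line, with no enclosing array. The sketch below reproduces that layout with the standard library only, using two made-up records, to show what the pipeline input roughly looks like on disk.

```python
import json

# Two toy records with the same columns as the notebook's dataset.
records = [
    {"context": "ctx 1", "questions": "q 1", "pred_a": "a1", "pred_b": "b1", "actuals": "A"},
    {"context": "ctx 2", "questions": "q 2", "pred_a": "a2", "pred_b": "b2", "actuals": "B"},
]

# JSON Lines: serialize each record on its own line.
jsonl_text = "\n".join(json.dumps(r) for r in records) + "\n"

# Reading it back: parse each non-empty line independently.
parsed = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
print(len(parsed), "records; columns:", sorted(parsed[0]))
```

Because each line is parsed on its own, a single malformed row can be located by its line number without loading the whole file.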

### Create and run AutoSxS job

To run AutoSxS, we need to define an `autosxs_pipeline` job with the following parameters. More details of the AutoSxS pipeline configuration can be found [here](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-2.9.0/api/preview/model_evaluation.html#preview.model_evaluation.autosxs_pipeline).

**Required Parameters:**
  - **evaluation_dataset:** A list of Cloud Storage paths to a JSONL dataset containing
      evaluation examples.
  - **task:** Evaluation task in the form {task}@{version}. task can be one of
      "summarization", "question_answering". Version is an integer with 3 digits or
      "latest". Ex: summarization@001 or question_answering@latest.
  - **id_columns:** The columns which distinguish unique evaluation examples.
  - **autorater_prompt_parameters:** Map of autorater prompt parameters to columns
      or templates. The expected parameters are:
      - inference_instruction - Details on how to perform a task.
      - inference_context - Content to reference to perform the task.

Additionally, we need to specify where the predictions for the candidate models (Model A and Model B) are coming from. AutoSxS can either run Vertex Batch Prediction to get predictions, or a predefined predictions column can be provided in the evaluation dataset.

**Model Parameters if using Batch Prediction (assuming Model A):**
  - **model_a:** A fully-qualified model resource name. This parameter is optional
      if Model A responses are specified.
  - **model_a_prompt_parameters:** Map of Model A prompt template parameters to
      columns or templates. In the case of [text-bison](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text#request_body), the only parameter needed is `prompt`.
  - **model_a_parameters:** The parameters that govern the predictions from model A such as the model temperature.

**Model Parameters if bringing your own predictions (assuming Model A):**
  - **response_column_a:** The column containing responses for model A. Required if
      any response tables are provided for model A.

Lastly, there are parameters that configure additional features such as exporting the judgments or comparing judgments to a human-preference dataset to check the AutoRater's alignment with human raters.
  - **judgments_format:** The format to write judgments to. Can be either 'json' or
      'bigquery'.
  - **bigquery_destination_prefix:** BigQuery table to write judgments to if the
      specified format is 'bigquery'.
  - **human_preference_column:** The column containing ground truths. Only required
      when users want to check the autorater alignment against human preference.

In this notebook, we will evaluate how well the autorater aligns with the human rater using two models' predictions (located in the `pred_a` and `pred_b` columns of the `DATASET` file) and the human-preference data (located in the `actuals` column of the same file). The task being performed is question answering.
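The `task` format described in the parameter list can be checked locally before submitting a job. This is a minimal sketch that encodes only what the list above states: the task name is `summarization` or `question_answering`, and the version is three digits or `latest`. `TASK_PATTERN` and `is_valid_task` are hypothetical names; the pipeline performs its own validation.

```python
import re

# Task names and version format as described in the parameter list above.
TASK_PATTERN = re.compile(r"(summarization|question_answering)@(\d{3}|latest)")


def is_valid_task(task: str) -> bool:
    """Return True if the string looks like {task}@{version}."""
    return TASK_PATTERN.fullmatch(task) is not None


for candidate in ("question_answering@001", "summarization@latest", "question_answering@1"):
    print(candidate, is_valid_task(candidate))
```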

First, compile the AutoSxS pipeline locally.

In [None]:
template_uri = "pipeline.yaml"
compiler.Compiler().compile(
    pipeline_func=model_evaluation.autosxs_pipeline,
    package_path=template_uri,
)

The following code starts a Vertex AI pipeline job, viewable from the Vertex AI UI. This pipeline job takes about 10 minutes.

The logs include the URL of the current pipeline run, so you can follow the pipeline's progress and view its outputs.

In [6]:
display_name = f"autosxs-question-answering-human-alignment-checking-{generate_uuid()}"
context_column = "context"
question_column = "questions"
response_column_a = "pred_a"
response_column_b = "pred_b"
human_preference_column = "actuals"
parameters = {
    "evaluation_dataset": DATASET,
    "id_columns": [question_column],
    "autorater_prompt_parameters": {
        "inference_context": {"column": context_column},
        "inference_instruction": {"column": question_column},
    },
    "task": "question_answering@001",
    "response_column_a": response_column_a,
    "response_column_b": response_column_b,
    "human_preference_column": human_preference_column,
}

job = aiplatform.PipelineJob(
    job_id=display_name,
    display_name=display_name,
    pipeline_root=os.path.join(BUCKET_URI, display_name),
    template_path=template_uri,
    parameter_values=parameters,
    enable_caching=False,
)
job.run()

### Get the judgments and AutoSxS metrics
Next, we can load judgments from the completed AutoSxS job.

The results are written to the Cloud Storage output bucket you specified in the AutoSxS job request.

In [7]:
# To use an existing pipeline, override job using the line below.
# job = aiplatform.PipelineJob.get('projects/[PROJECT_NUMBER]/locations/[REGION]/pipelineJobs/[PIPELINE_RUN_NAME]')

for details in job.task_details:
    if details.task_name == "online-evaluation-pairwise":
        break

# Judgments
judgments_uri = details.outputs["judgments"].artifacts[0].uri
judgments_df = pd.read_json(judgments_uri, lines=True)
judgments_df.head()
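A quick way to summarize the judgments is the share of examples won by each model. The sketch below assumes each judgment carries a label of `"A"` or `"B"`. The column that holds it (for example a `choice` column) is an assumption here, so check the actual columns of `judgments_df` first. `win_rates` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable


def win_rates(choices: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of judgments won by model A and by model B."""
    counts = Counter(choices)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to summarize")
    # Labels other than "A"/"B" still count toward the total.
    return {label: counts.get(label, 0) / total for label in ("A", "B")}


# Toy labels standing in for the autorater's per-example choices.
toy_choices = ["A", "B", "A", "A"]
print(win_rates(toy_choices))
```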

If AutoSxS fails to produce a result for any example, the error messages are stored in an error table. An empty error table means that no examples failed during the evaluation.

In [8]:
for details in job.task_details:
    if details.task_name == "online-evaluation-pairwise":
        break

# Error table
error_messages_uri = details.outputs["error_messages"].artifacts[0].uri
errors_df = pd.read_json(error_messages_uri, lines=True)
errors_df.head()

We can also look at AutoSxS metrics computed from the judgments.

When human-preference data is provided, AutoSxS outputs the win rate from the autorater and a set of human-preference alignment metrics. You can find more details of AutoSxS metrics [here](https://cloud.google.com/vertex-ai/docs/generative-ai/models/side-by-side-eval#human-metrics).

In [9]:
# Metrics
for details in job.task_details:
    if details.task_name == "model-evaluation-text-generation-pairwise":
        break
pd.DataFrame([details.outputs["autosxs_metrics"].artifacts[0].metadata])
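To build intuition for what alignment metrics measure, the sketch below computes two common agreement statistics by hand: accuracy (the fraction of examples where the autorater and the human picked the same model) and Cohen's kappa (agreement corrected for chance). This is an illustration on toy labels, not the pipeline's implementation. The linked documentation lists the metrics AutoSxS actually reports.

```python
from collections import Counter
from typing import Sequence, Tuple


def agreement(auto: Sequence[str], human: Sequence[str]) -> Tuple[float, float]:
    """Return (accuracy, Cohen's kappa) for two equal-length label sequences."""
    if len(auto) != len(human) or not auto:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(auto)
    # Observed agreement: fraction of examples with identical labels.
    observed = sum(a == h for a, h in zip(auto, human)) / n
    auto_counts, human_counts = Counter(auto), Counter(human)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        (auto_counts[label] / n) * (human_counts[label] / n)
        for label in set(auto) | set(human)
    )
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return observed, float("nan")
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Toy labels: autorater choices versus human preferences.
accuracy, kappa = agreement(["A", "A", "B", "B"], ["A", "B", "B", "B"])
print(f"accuracy={accuracy:.2f}, kappa={kappa:.2f}")
```

A kappa near 0 means the autorater agrees with humans no more often than chance would predict, even when raw accuracy looks high because one model wins most comparisons.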

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

Set `delete_bucket` to **True** to delete the Cloud Storage bucket.

In [None]:
import os

job.delete()

# Delete Cloud Storage objects that were created
delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI