In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# AutoSxS: Check autorater alignment against a human-preference dataset


<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/model_evaluation/model_based_llm_evaluation/autosxs_check_alignment_against_human_preference_data.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/model_evaluation/model_based_llm_evaluation/autosxs_check_alignment_against_human_preference_data.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/model_evaluation/model_based_llm_evaluation/autosxs_check_alignment_against_human_preference_data.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

## Overview

This notebook demonstrates how to use Vertex AI automatic side-by-side (AutoSxS) to check how well the autorater aligns with the human rater.

Automatic side-by-side (AutoSxS) is a model-assisted evaluation tool that helps you compare two large language models (LLMs) side by side. As part of AutoSxS's preview release, we only support comparing models for summarization and question answering tasks. We will support more tasks and customization in the future.

Learn more about [Vertex AI AutoSxS Model Evaluation](https://cloud.google.com/vertex-ai/docs/generative-ai/models/side-by-side-eval#autosxs).

### Objective

In this tutorial, you learn how to use `Vertex AI Pipelines` and `google_cloud_pipeline_components` to check autorater alignment using human-preference data.

This tutorial uses the following Google Cloud ML services and resources:

- Vertex AI Model Registry
- Vertex AI Pipelines
- Vertex AI Batch Predictions


The steps performed include:
- Create an evaluation dataset with predictions and human-preference data.
- Preprocess the data locally and save it in Cloud Storage.
- Create and run a Vertex AI AutoSxS pipeline that generates the judgments and computes a set of AutoSxS metrics from them.
- Print the judgments and AutoSxS metrics.
- Clean up the resources created in this notebook.

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
and [Cloud Storage pricing](https://cloud.google.com/storage/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook.

In [1]:
! pip3 install --upgrade --force-reinstall $USER_FLAG \
    google-cloud-aiplatform \
    google-cloud-pipeline-components==2.9.0

### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can change the `REGION` variable, which is used for operations
throughout the rest of this notebook. AutoSxS supports the following regions:

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-southeast1`

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}
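
As a quick guard, the chosen region can be checked against the list above before any pipeline is launched. This is only a sketch: `SUPPORTED_AUTOSXS_REGIONS` and `check_region` are hypothetical names, and the set simply mirrors the three regions listed in this notebook, which may change over time.

```python
# Regions listed in this notebook as supported for AutoSxS (assumption: may change).
SUPPORTED_AUTOSXS_REGIONS = {"us-central1", "europe-west4", "asia-southeast1"}


def check_region(region: str) -> str:
    """Return the region unchanged, or raise if it is not in the list above."""
    if region not in SUPPORTED_AUTOSXS_REGIONS:
        raise ValueError(
            f"{region!r} is not one of {sorted(SUPPORTED_AUTOSXS_REGIONS)}"
        )
    return region


print(check_region("us-central1"))
```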

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### UUID

We define a UUID generation function to avoid resource name collisions on resources created within the notebook.

In [None]:
import random
import string

def generate_uuid(length: int = 8) -> str:
    """Generate a uuid of a specified length (default=8)."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts for the AutoSxS pipeline.

In [None]:
BUCKET_URI = "gs://your-bucket-name-unique"  # @param {type:"string"}

Create your Cloud Storage bucket if it doesn't already exist.

In [2]:
# Fall back to a generated bucket name if BUCKET_URI is empty or still a placeholder.
if BUCKET_URI in ("", None, "gs://[your-bucket-name]", "gs://your-bucket-name-unique"):
    BUCKET_URI = "gs://" + PROJECT_ID + "-aip-" + UUID

! gsutil ls -b $BUCKET_URI || gsutil mb -l $REGION $BUCKET_URI

### Import libraries

Import the Vertex AI Python SDK and other required Python libraries.

In [3]:
import json
import os
import urllib
import uuid
import pickle


from google.cloud import aiplatform
from google_cloud_pipeline_components.preview import model_evaluation
from kfp import compiler
import pandas as pd

### Initialize Vertex AI SDK for Python

Initialize the Vertex SDK for Python for your project and corresponding bucket.



In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

## Tutorial
It is unlikely that the autorater will perform at the same level as human raters in all customer use cases, especially in cases where human raters are expected to have specialized knowledge.

The tutorial below shows how AutoSxS helps to determine if you can trust the autorater once you have the ground-truth human-preference data.


### Generate Evaluation Dataset for AutoSxS Human Alignment checking

Below you create your dataset, specifying the set of prompts, predictions from two models and the human-preference data.

In this notebook, we:
- Create an evaluation dataset with 10 examples for AutoSxS.
  - Data in column `prompt` will be treated as model prompts.
  - Data in column `pred_a` will be treated as responses for model A.
  - Data in column `pred_b` will be treated as responses for model B.
  - Data in column `actuals` will be treated as the human-preference data.
- Store it as a JSON Lines file in Cloud Storage.

#### **Note: For best results, we recommend users input 100-500 examples. There are diminishing returns past 400 examples.**

In [4]:
# Define prompts, predictions and human preference data.
prompts = [
    "I've never really received anything from anyone, but I do go around F2P worlds and look for people in partial rune sets. About 2 months ago, I was at the GE and saw guy wearing an addy scimmy, rune legs, rune kite, and a rune full helm, so I bought a rune chain body and scimmy. He wasn't begging for anything, just using the GE. I had to wait a while to trade him, and he had no idea why I was trading him, but it was worth it to see his gratefulness. You should go around and try it sometime; really makes you feel good.",
    "According to Kassadins winrate is growing at an alarming rate. Since patch 4.5 where Kassadins winrate was 43.38% he has increased to a winrate of 51.45% which is a 8,07% increase. It might not look worrying at first sight, but the winrate is continiusly and rapidly increasing as people master the new Kassadin. The most scary thing is Kassadins winrate at challanger/diamond 1 level play where he has a whooping 68% winrate which is insane (According to Be prapared to see him stomp in All-star.",
    "This is the same company that had to move FFXIII from PS2 to PS3 and still didn't finish the damn game until four years into the generation. They spent so much of their time and money making the graphics perfect that the game itself is a hollow shell of what could have been a JRPG. It was so expensive and time consuming that they couldn't even consider designing FFXV for this generation and had to push out FFXIII-2 because they spent too much money on those art assets to not use them again. This is the company that epitomizes the problems with AAA gaming's focus on having the best graphics all the time. They spent half of this generation making sure FFXIII was the prettiest game ever and forgot to actually develop a real game to go along with those visuals. All a new generation would mean for Squeenix is 8 years spent making FFXV have the best looking shoelaces any game has ever had only to have the game be released to rampant hate because we don't give two shits about the fucking shoelaces (like how Snow's hair didn't convince us that FFXIII was a quality product).",
    "We bought a house! But the paint colors aren't great - lots of bright colors in what used to be kids' rooms and walls in need of retouching in the common areas. I'd like to get starting on the paint project right away and my family will be visiting in two weeks to help - so I'd like to develop a paint scheme to go off of. I get a little overwhelmed when contemplating the seemingly endless paint colors at Home Depot. I want neutral walls, but should I stick with one shade throughout or mix it up? Is beige blah, is cream too boring, and could anyone tell me what #%&! greige is? My couch is brown leather, the entertainment center, kitchen table, and bedroom furniture are a warm cherry, and the cabinetry is maple. Since we'll be buying furniture over time as funds allow, I think neutral walls that play well off other colors is my best bet. The house gets gorgeous light from western facing windows, but even still I don't want to go too dramatic or dark.",
    "I'd like to talk about bows. Why? Because they sit in a very strange place in the end game right now. If you don't want to use a bow, more often than not you'll end up in two very frustrating scenarios. You're either being kited to death by a jump shotting stamina shot user, or you're sitting behind whatever cover you can waiting for nothing in particular because someone is taking potshots at you. Now you may say 'Take the damage reduction perk' or 'just use your own bow'. And you'd be right. That would solve the issue somewhat. Problem is, those are basically the ONLY way to deal with it. There's a lack of diversity. When things become mandatory in a competitive setting, there's a balance issue. So. I want to talk about solutions on how to give bows fair counterplay without breaking them. And Ill start by sharing my own ideas on the matted, starting with a few simple ideas like these. Make a non-fully charged bow shot travel VERY little distance. Make charging bows in mid-air impossible. Make players momentum/speed slow when charging and shortly after shooting so they are easier to catch. Raise the stamina cost of charging a bow shot. Make charging a bow shot slower. Bows should be like snipers. Not SMGs. Using a bow effectively should be about keeping your enemy at a distance and whittling them down. If you have someone rushing you down and you try to snipe at them before they get to you, there should be some cost to it. If it used a noticeable amount of stamina to fire, then you have to start weighing risk and reward. Is it worth putting yourself at a stamina disadvantage should you miss? How much time would you have to recover that lost stamina? What if they get to you before you can switch weapons and guard? They'd get a free hit. As it stands, using bows is really risk free, and I believe that should change. \nAlso something to remember is that if you're comfortably set up without your opponent able to approach you, the stamina drain wouldn't do much aside from make you have to think about how youre firing. Do you unload arrows in a rapid volley, leaving your stamina low and risking being at a disadvantage if your opponent gets to you? Or do you space out your shots so you stay at max stamina, but fire less rapidly? I also feel the ending arena should be changed a bit to make approaching a little more viable, but I digress.",
    "I’d been watching [James Marshal]( videos for the past couple of days and with the concepts swirling around in my mind, I decided to smile less at work. I work in a shop, selling alcohol, cigarettes, confectionery, soft drinks amongst other things. In an attempt to offer the customer a pleasant experience, you smile, say please, say thank you and all other polite things. I tried to do the opposite. Not smiling, not saying thank you, not nodding, not saying anything for a few moments. It was difficult to do, because much of my behaviour is habitual through years of working in a retail environment. Some of my thoughts were “It’s going to be awkward; they won’t be happy about it; I’m going to feel uncomfortable; It’s scary.” It was awkward. It was scary. It was uncomfortable. But less than I’d imagined. And I don’t think anyone cared. Some of the regulars might have noticed something different, but nobody said anything. In fact, the hardest part was not nodding or saying thank you or cheers. Why smile less? Why stop saying thank you? In hindsight what I was doing was playing around with the concept of Pressure and Release. When people are nervous, be it talking to an attractive lady or whatever, there is the compulsion to say something to release the pressure. Sexual tension can’t build without any pressure. James Marshall loves to sit in that tension. When you watch him speak he has an enormous presence on stage, particular when he isn’t saying anything. Improving Pressure and Release will increase sexual tension will improve your results with women. However, in the shop, I was using it because it directly ties in with my social anxiety. When I’m compulsively smiling and talking it’s because I’m feeling uncomfortable and I’m not okay with pauses and silences of two seconds or more. Smiling less, delaying my thank yous, strong eye contact with no talking were all ways of exploring the silence, the tension and the pressure. \nJames Marshall also encourages to remove the judgement and labels from feelings and sensations. Social anxiety, approach anxiety, fear, embarrassment, shame, and so on are shorn of their names. Instead, Marshall encourages you to focus on the physical sensations and describe those without judgement. I can’t remember what the feeling was, but the other day I noticed there was tightness in the abdomen, the chest, the throat and the face. And with this feeling, everything was happening in the front of my body with nothing going on in the back. The Smiling Less experiment happened for the whole day. It felt quite weird at first, like I wasn’t being myself. But I remembered something Marshall said: “What possible new behaviours could you adopt and could be you?” This quote in turn reminded me of the [Feldenkrais Method]( which is all about doing and learning new movements to change what is habitual and usually painful to your body. What Marshall was saying was like the Feldenkrais Method but instead of movement it was social interaction. As the day wore on, I became a bit more like my old self. This happened because it was tiring to adopt new behaviours for a whole day and I also think it happened because my subconscious was calibrating on how to fit this with the rest of my personality. Another way of looking at this experiment is that I was [regulating my emotional compulsions.]( Instead of saying or doing the first habitual thing that came to mind, I paused or did something else. This goes against some ideas like going with your gut or the popular idea of improvisation where you blurt and do things with no filter. Impro is like that i.e. responding to your impulse or desire to do something, but oftentimes in bad improv it’s a result of adrenalin and nerves and a need to make things comfortable and safe. In impro, we channel what the audience wants most. And almost always it’s the uncomfortable and dangerous option. \nAs my facilitator has said, “When you’re in the shit, stay in the shit.” By smiling less, I was putting myself in the shit. At least what I thought was the shit. It wasn’t easy, but it was easy enough for it to be surmountable. I hoping to take some of what I learned at the shop to the open mic I’ll be hosting tomorrow. Breathe, pause and smile less.",
    "Ignore labels. Ignore all the labels. Without the presence of labels, you're only left with what your actual experiences have been. If you find yourself making your life harder, or not enjoying something, you have to start working from the actual descriptions of the experiences you've had to plan some new changes in your life that will improve things for you. Empathizing with other people is a good start.",
    "I've always been quite extroverted. However, following a 10 day retreat (my first) a few months ago, I've become significantly more introverted. I'm more comfortable in my own company. Social situations tire me out quicker. Instead of preferring and dominating group conversations I find myself often in 1-on-1 conversations. I talk less and listen more. I can't pin this down to the course definitely because a number of other factors in my life have changed. (After the course I went traveling and got sick of 3-day friends. I graduated university before the course. I've been teaching myself conversational skills, which often have an emphasis on letting other person talk. I've also had a lot of experiences in my life recently of making friends and then them or myself moving to different cities, and I've recently moved cities. It could just be me getting older (23M).) But I had a distinct experience following the course which makes me think it's highly related - on the night of the day we broke silence, everyone was talking loudly in a large group. The old me would have loved this situation and probably been one of the loudest in the group. But instead of joining the group I went for a walk into night by myself. When I did speak to people, I gravitated towards the quieter people in the course. This was the completely out of character for me, and it was the first time I felt like I did that day. I'm not necessarily concerned about this change, in fact it might make me more suited for my job (software engineer). But I'm still trying to make sense it - it's a pretty large change in my personality. If anyone has had a similar experience or can offer an explanation, I'd appreciate it.",
    "Many people seem to be concerned with the change they made to convert existing remove only tabs (from previous leagues, no races) into the new premium tabs that you purchased, I would say some people like the change and aren't commenting that they do. The ones that are voicing their thoughts are those that dislike the change as they feel they are being 'cheated' out of additional space. To me I think the change is good, they just need to make it clear that it converts previous league tabs into the new tabs (for the permeant leagues) I don't think the whole previous league tab thing was meant to be a way of artificially increasing your stash space, but some people got used to it and should get used to the new change, most of us really don't need a whole bunch of unique item tabs from previous leagues that match all other previous leagues.",
    "Roommate 1 wants something like Risk but with no luck (and no dice at all). Something competitive, with a map, preferably involving conquest. For future reference this pretty much sums up Small World to a T. Chaos in the Old World is also similar and very good, but does include dice in combat. Castles of Burgundy : The most like a step up from Catan on this list, but no direct competition (just like Catan). My real best suggestion for you however is: Kemet : Control some areas and multiple ways to victory. A little intimidating at first but once you get a few games under your belt it's not overly rule-tigious. This probably fits all of your requirements most closely and costs tons less than Terra Mystica.",
]
predictions_a = [
    "Go around F2P worlds and look for people in partial rune sets.",
    "According to Kassadins winrate is growing at an alarming rate. Since patch 4.5 where Kassadins winrate was 43.38% he has increased to a winrate of 51.45% which is a 8,07% increase. It might not look worrying at first",
    "This is the same company that had to move FFXIII from PS2 to PS3 and still didn't finish the damn game until four years into the generation. They spent so much of their time and money making the graphics perfect that the game itself is a hollow shell of what could have been a JRPG.",
    "I'd like to get started on the paint project right away and my family will be visiting in two weeks to help - so I'd like to develop a paint scheme to go off of.",
    "shots and use your stamina to whittle down your opponent?",
    "I work in a shop, selling alcohol, cigarettes, confectionery, soft drinks amongst other things. In an attempt to offer the customer a pleasant experience, you smile, say please, say thank you and all other polite things. I tried to do the opposite. Not smiling, not saying thank you",
    "Without the presence of labels, you're only left with what your actual experiences have been. If you find yourself making your life harder, or not enjoying something, you have to start working from the actual descriptions of the experiences you've had to plan some new changes in your life that will improve things for you.",
    "I've always been quite extroverted. However, following a 10 day retreat (my first) a few months ago, I've become significantly more introverted. I'm more comfortable in my own company. Social situations tire me out quicker. Instead of preferring and dominating group conversations I find myself often in",
    "I think the change is good, they just need to make it clear that it converts previous league tabs into the new tabs (for the permeant leagues)",
    "Small World is a great game, but it's not like Risk. It's not like anything else. It's not like anything else. It's not like anything else. It's not like anything else. It's not like anything else. It's not like anything else.",
]
predictions_b = [
    "I bought a rune chain body and scimmy for my fiance.",
    "Kassadins winrate is growing at an alarming rate.",
    "Squeenix is the same company that had to move FFXIII from PS2 to PS3 and still didn't finish the game until four years into the generation.",
    "I'd like to get a paint scheme to go off of.",
    "Make a non-fully charged bow shot.",
    "Smile less at work.",
    "Avoid labels.",
    "I've been a bit introverted.",
    "I would say some people like the change and aren't commenting that they do.",
    "Make a list of the best games you can play.",
]

human_preference = ["A", "A", "B", "A", "B", "B", "B", "A", "B", "A",]

# Create the evaluation dataset with prompts, predictions and human preference data.
examples = pd.DataFrame({
    'prompt': prompts,
    'pred_a': predictions_a,
    'pred_b': predictions_b,
    'actuals': human_preference,
})
examples.head()
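
Before uploading, it can help to check that every example carries the four columns described above. The sketch below works on plain records so that it runs without any cloud access; `validate_records`, `REQUIRED_COLUMNS`, and `ALLOWED_LABELS` are hypothetical names, and treating `"A"`/`"B"` as the only valid preference labels is an assumption based on the data in this notebook.

```python
REQUIRED_COLUMNS = ("prompt", "pred_a", "pred_b", "actuals")
ALLOWED_LABELS = {"A", "B"}  # assumption: mirrors the labels used in this notebook


def validate_records(records):
    """Return a list of problems found; an empty list means the records look fine."""
    problems = []
    seen_prompts = set()
    for i, row in enumerate(records):
        # A column counts as missing if it is absent or holds an empty value.
        missing = [c for c in REQUIRED_COLUMNS if not row.get(c)]
        if missing:
            problems.append(f"row {i}: missing or empty {missing}")
            continue
        if row["actuals"] not in ALLOWED_LABELS:
            problems.append(f"row {i}: unexpected label {row['actuals']!r}")
        # The prompt doubles as the example ID in this notebook, so it should be unique.
        if row["prompt"] in seen_prompts:
            problems.append(f"row {i}: duplicate prompt")
        seen_prompts.add(row["prompt"])
    return problems


toy_records = [
    {"prompt": "p1", "pred_a": "a1", "pred_b": "b1", "actuals": "A"},
    {"prompt": "p1", "pred_a": "a2", "pred_b": "b2", "actuals": "C"},
]
for problem in validate_records(toy_records):
    print(problem)
```

With the DataFrame built above, the same check could be run as `validate_records(examples.to_dict("records"))`.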

#### [Optional] Load your JSONL evaluation dataset from GCS

Alternatively, you can load your own JSONL dataset from GCS.



In [None]:
# # Uncomment to read your own dataset from GCS instead of using the examples above.
# GCS_PATH = 'gs://your-own-evaluation-dataset-with-human-preference-data.jsonl'
# examples = pd.read_json(GCS_PATH, lines=True)

We next upload our final dataset to GCS to be used as input for AutoSxS.

In [5]:
# Upload predictions to GCS.
examples.to_json('evaluation_dataset_with_human_preference.json', orient='records', lines=True)
! gsutil cp evaluation_dataset_with_human_preference.json $BUCKET_URI/input/evaluation_dataset_with_human_preference.json
DATASET = f'{BUCKET_URI}/input/evaluation_dataset_with_human_preference.json'
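
The upload cell writes JSON Lines: one JSON object per line, one line per example. The stdlib-only sketch below shows the same layout on a toy record list; `to_jsonl` and `from_jsonl` are hypothetical helper names and are not part of pandas or the Vertex AI SDK.

```python
import json


def to_jsonl(records):
    """Serialize a list of dicts as JSON Lines (one object per line)."""
    return "\n".join(json.dumps(r) for r in records) + "\n"


def from_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


toy_records = [
    {"prompt": "p1", "pred_a": "a1", "pred_b": "b1", "actuals": "A"},
    {"prompt": "p2", "pred_a": "a2", "pred_b": "b2", "actuals": "B"},
]
jsonl_text = to_jsonl(toy_records)
print(jsonl_text)
```

Reading the text back with `from_jsonl` returns the original records, which is a cheap way to confirm a file is well formed before it is copied to the bucket.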

### Create and Run AutoSxS Job

To run AutoSxS, we define an `autosxs_pipeline` job with the following parameters. More details of the AutoSxS pipeline configuration can be found [here](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-2.9.0/api/preview/model_evaluation.html#preview.model_evaluation.autosxs_pipeline).

**Required Parameters:**
  - **evaluation_dataset:** A comma-separated list of GCS paths (here, a single path) to a JSONL dataset containing evaluation examples.
  - **task:** Evaluation task in the form `{task}@{version}`. `task` can be one of `summarization` or `question_answering`. `version` is a three-digit integer or `latest`. Examples: `summarization@001`, `question_answering@latest`.
  - **id_columns:** The columns which distinguish unique evaluation examples.
  - **autorater_prompt_parameters:** Map of autorater prompt parameters to columns or templates. The expected parameters are:
      - `inference_instruction`: Details on how to perform a task.
      - `inference_context`: Content to reference to perform the task.
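
The `{task}@{version}` format described above can be checked locally before a job is submitted. This is a minimal sketch: `TASK_PATTERN` and `is_valid_task` are hypothetical names, and the pattern encodes only what the description in this notebook states (two task names, a three-digit version or `latest`).

```python
import re

# Built from the description above: task name, "@", then three digits or "latest".
TASK_PATTERN = re.compile(r"(summarization|question_answering)@([0-9]{3}|latest)")


def is_valid_task(task: str) -> bool:
    """Return True if the whole string matches the {task}@{version} format."""
    return TASK_PATTERN.fullmatch(task) is not None


for candidate in ["summarization@001", "question_answering@latest", "summarization@1"]:
    print(candidate, is_valid_task(candidate))
```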

Additionally, we need to specify where the predictions for the candidate models (Model A and Model B) are coming from. AutoSxS can either run Vertex Batch Prediction to get predictions, or a predefined predictions column can be provided in the evaluation dataset.

**Model Parameters if using Batch Prediction (assuming Model A):**
  - **model_a:** A fully-qualified model resource name. This parameter is optional
      if Model A responses are specified.
  - **model_a_prompt_parameters:** Map of Model A prompt template parameters to
      columns or templates. In the case of [text-bison](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text#request_body), the only parameter needed is `prompt`.
  - **model_a_parameters:** The parameters that govern the predictions from model A such as the model temperature.

**Model Parameters if bringing your own predictions (assuming Model A):**
  - **response_column_a:** The column containing responses for model A. Required if
      any response tables are provided for model A.

Lastly, there are parameters that configure additional features such as exporting the judgments or comparing judgments to a human-preference dataset to check the AutoRater's alignment with human raters.
  - **judgments_format:** The format to write judgments to. Can be either 'json' or
      'bigquery'.
  - **bigquery_destination_prefix:** BigQuery table to write judgments to if the
      specified format is 'bigquery'.
  - **human_preference_column:** The column containing ground truths. Only required
      when users want to check the autorater alignment against human preference.

In this notebook, we evaluate how well the autorater aligns with the human rater using two models' predictions (the `pred_a` and `pred_b` columns of the evaluation dataset) and the human-preference data (the `actuals` column of the same dataset). The task being performed is summarization.

First, compile the AutoSxS pipeline locally.

In [None]:
template_uri = 'pipeline.yaml'
compiler.Compiler().compile(
    pipeline_func=model_evaluation.autosxs_pipeline,
    package_path=template_uri,
)

The following code starts a Vertex AI Pipelines job, viewable from the Vertex AI UI. The job takes about 10 minutes.

The logs include the URL of the current pipeline run, so you can follow its progress and view the pipeline outputs.

In [6]:
display_name = f'autosxs-summarization-human-alignment-checking-{generate_uuid()}'
prompt_column = 'prompt'
response_column_a = 'pred_a'
response_column_b = 'pred_b'
human_preference_column = 'actuals'
parameters = {
    'evaluation_dataset': DATASET,
    'id_columns': [prompt_column],
    'autorater_prompt_parameters': {
        'inference_context': {'column': prompt_column},
        'inference_instruction': {'template': '{{ default_instruction }}'},
    },
    'task': 'summarization@001',
    'response_column_a': response_column_a,
    'response_column_b': response_column_b,
    'human_preference_column': human_preference_column,
}

job = aiplatform.PipelineJob(
    job_id=display_name,
    display_name=display_name,
    pipeline_root=os.path.join(BUCKET_URI, display_name),
    template_path=template_uri,
    parameter_values=parameters,
    enable_caching=False,
)
job.run()

### Get the judgments and autosxs metrics
Next, we can load judgments from the completed autosxs job.

The results are written to the Cloud Storage output bucket you specified in the autosxs job request.

In [7]:
# To use an existing pipeline, override job using the line below.
# job = aiplatform.PipelineJob.get('projects/[PROJECT_NUMBER]/locations/[REGION]/pipelineJobs/[PIPELINE_RUN_NAME]')

for details in job.task_details:
  if details.task_name == 'autosxs-arbiter':
    break

# Judgments
judgments_uri = details.outputs['judgments'].artifacts[0].uri
judgments_df = pd.read_json(judgments_uri, lines=True)
judgments_df.head()
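
Once the judgments are loaded, a per-model win rate is a simple tally of the autorater's picks. The sketch below works on a plain list of labels; `win_rates` is a hypothetical helper, and the name of the column that holds the pick in the real judgments table is not confirmed here, so inspect `judgments_df.columns` before applying it.

```python
from collections import Counter


def win_rates(choices):
    """Return the fraction of judgments won by each label."""
    counts = Counter(choices)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to summarize")
    return {label: counts[label] / total for label in sorted(counts)}


toy_choices = ["A", "A", "B", "A"]  # made-up autorater picks
print(win_rates(toy_choices))  # prints {'A': 0.75, 'B': 0.25}
```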

If AutoSxS fails to produce a result for any example, the corresponding error messages are stored in an error table. An empty error table means that no examples failed during the evaluation.

In [8]:
for details in job.task_details:
  if details.task_name == 'autosxs-arbiter':
    break

# Error table
error_messages_uri = details.outputs['error_messages'].artifacts[0].uri
errors_df = pd.read_json(error_messages_uri, lines=True)
errors_df.head()
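
The `for ... break` lookup used in the cells above leaves `details` pointing at the last task if no name matches, so a misspelled task name only surfaces as a confusing error further down. One possible alternative is a small helper that fails loudly. `find_task` is a hypothetical name, and the toy objects below only model the `task_name` attribute used in this notebook.

```python
from types import SimpleNamespace


def find_task(task_details, task_name):
    """Return the first task whose task_name matches, or raise if none does."""
    for details in task_details:
        if details.task_name == task_name:
            return details
    raise LookupError(f"no task named {task_name!r} in this pipeline run")


# Toy stand-ins for job.task_details entries (only task_name is modeled).
toy_tasks = [
    SimpleNamespace(task_name="autosxs-arbiter"),
    SimpleNamespace(task_name="autosxs-metrics-computer"),
]
print(find_task(toy_tasks, "autosxs-arbiter").task_name)
```

With a real job, the call would look like `find_task(job.task_details, 'autosxs-arbiter')`.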

We can also look at AutoSxS metrics computed from the judgments.

When human-preference data is provided, AutoSxS outputs the win rate from the autorater along with a set of human-preference alignment metrics. You can find more details of AutoSxS metrics [here](https://cloud.google.com/vertex-ai/docs/generative-ai/models/side-by-side-eval#human-metrics).

In [9]:
# Metrics
for details in job.task_details:
  if details.task_name == 'autosxs-metrics-computer':
    break
pd.DataFrame([details.outputs['autosxs_metrics'].artifacts[0].metadata])
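
To build intuition for what alignment metrics measure, the sketch below computes two common agreement measures locally: raw agreement and Cohen's kappa. Whether the pipeline computes its metrics in exactly this way is not confirmed here, and the linked documentation remains authoritative. `agreement_and_kappa` is a hypothetical helper, the human labels are the ones used in this notebook, and the autorater labels are made up for illustration.

```python
from collections import Counter


def agreement_and_kappa(autorater, human):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(autorater) != len(human) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    observed = sum(a == h for a, h in zip(autorater, human)) / n
    auto_counts, human_counts = Counter(autorater), Counter(human)
    # Chance agreement: probability that both raters pick the same label independently.
    expected = sum(
        (auto_counts[label] / n) * (human_counts[label] / n)
        for label in set(auto_counts) | set(human_counts)
    )
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined (0/0) in this case
    return observed, (observed - expected) / (1 - expected)


human = ["A", "A", "B", "A", "B", "B", "B", "A", "B", "A"]      # from this notebook
autorater = ["A", "A", "B", "B", "B", "B", "A", "A", "B", "A"]  # made-up choices
accuracy, kappa = agreement_and_kappa(autorater, human)
print(f"agreement={accuracy:.2f}, kappa={kappa:.2f}")  # prints agreement=0.80, kappa=0.60
```

Kappa discounts the agreement two raters would reach by chance, which is why it is lower than the raw agreement here: with both raters choosing "A" half the time, 50% agreement is expected from chance alone.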

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

Set `delete_bucket` to **True** to delete the Cloud Storage bucket.

In [None]:
import os

job.delete()

# Delete Cloud Storage objects that were created
delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI