In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Enhancing quality and explainability with Vertex AI Evaluation

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/enhancing_quality_and_explainability_with_eval.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/enhancing_quality_and_explainability_with_eval.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fevaluation%2Fenhancing_quality_and_explainability_with_eval.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>      

  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/evaluation/enhancing_quality_and_explainability_with_eval.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
</table>

<div style="clear: both;"></div>


| | |
|-|-|
| Author(s) | [Anant Nawalgaria](https://github.com/anantnawal) |

## Overview

### Vertex Gen AI Evaluation API

The [Vertex Gen AI Evaluation Service](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview), which can be accessed through both its SDK and its web API, lets you evaluate your large language models (LLMs), both pointwise and pairwise, across several metrics.

It is primarily used ad hoc in the initial experimental phase to evaluate which prompts and models work well for a use case. However, as described in detail in the corresponding blog, this notebook shows sample code, run on dummy data, for using Evaluation to enhance the quality of LLM-generated responses by combining the pairwise and pointwise capabilities of Gen AI Evaluation at generation time. The workflow also returns a human-readable explanation to help you understand the quality evaluation of the response. Note that although this notebook only demonstrates this workflow on text, it can be extended to any modality once an evaluation mechanism is available for that modality.

For more information about generative AI on Vertex AI, see the [Generative AI on Vertex AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview) documentation.

### Objectives

In this tutorial, you will learn how to combine the Gemini API in Vertex AI with the Gen AI Evaluation Service SDK for Python to improve the generation quality and explainability of responses.
You will complete the following tasks:

- Install the Vertex AI SDK for Python
- Use the Gemini API in Vertex AI to interact with the Gemini 1.5 Pro (`gemini-1.5-pro`) model:
  - Generate multiple responses for a given instruction and context
  - Use the pairwise and pointwise capabilities of the evaluation service to select the best response and return a human-readable explanation for it

### Costs

This tutorial uses billable components of Google Cloud:

- Vertex AI

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Getting Started

### Install the required libraries for Python

In [None]:
%pip install --upgrade --user --quiet google-cloud-aiplatform[evaluation]
%pip install --upgrade --user bigframes -q
%pip install --quiet --upgrade nest_asyncio

### Restart current runtime

To use the newly installed packages in this Jupyter runtime, it is recommended to restart the runtime. Run the following cell to restart the current kernel.

The restart process might take a minute or so.

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

After the restart is complete, continue to the next step.

<div class="alert alert-block alert-warning">
<b>⚠️ Wait for the kernel to finish restarting before you continue. ⚠️</b>
</div>

### Set your project ID and region

In [None]:
PROJECT_ID = "[your-project-id]"
LOCATION = "us-central1"

### Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench).

In [None]:
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

### Define Google Cloud project information (Colab only)

If you are running this notebook on Google Colab, the following cell imports the Vertex AI package and initializes it with the `PROJECT_ID` and `LOCATION` values you set earlier. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench).

In [None]:
if "google.colab" in sys.modules:
    # Initialize Vertex AI with the project ID and region set earlier
    import vertexai

    vertexai.init(project=PROJECT_ID, location=LOCATION)

### Import libraries & initialize project variables

In [1]:
import functools
from functools import partial
import uuid

from google.cloud import aiplatform
import nest_asyncio
import pandas as pd
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
from vertexai.generative_models import GenerationConfig, GenerativeModel

nest_asyncio.apply()

## Defining functions for ranking using evaluations

This section defines the helper functions that perform pairwise and pointwise evaluations, as well as the logic that combines them to select the best response and return its associated quality metrics and explanation.

The `pairwise_greater()` function simplifies AutoSxS (automatic side-by-side) comparisons between pairs of responses. It uses the Gen AI Evaluation Service API in Vertex AI and works well with Python's <code>max()</code> or <code>sorted()</code> functions, which lets you find the best response or rank a list of responses using pairwise comparisons. For other tasks, such as summarization SxS, the full list of metrics is available at the URL given in the function's docstring.

In [None]:
experiment_name = "qa-quality"


def pairwise_greater(
    instruction: str,
    context: str,
    project_id: str,
    location: str,
    experiment_name: str,
    baseline: str,
    candidate: str,
) -> tuple:
    """
    Takes an instruction, a context and two different responses.
    Returns a tuple of (chosen response, explanation), where the chosen response is
    the one that best matches the instruction and context for the given quality
    metric (in this case, question answering).
    Note: `project_id` and `location` are accepted but not used in the function body.
    More details on the web API and the different quality metrics that this function
    can be extended to can be found at
    https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/evaluation
    """
    # Single-row dataset: the candidate is compared against the baseline response.
    eval_dataset = pd.DataFrame(
        {
            "instruction": [instruction],
            "context": [context],
            "response": [candidate],
            "baseline_model_response": [baseline],
        }
    )

    eval_task = EvalTask(
        dataset=eval_dataset,
        metrics=[
            MetricPromptTemplateExamples.Pairwise.QUESTION_ANSWERING_QUALITY,
        ],
        experiment=experiment_name,
    )
    results = eval_task.evaluate(
        prompt_template="{instruction} \n {context}",
        experiment_run_name="gemini-qa-pairwise-" + str(uuid.uuid4()),
    )
    result = results.metrics_table[
        [
            "pairwise_question_answering_quality/pairwise_choice",
            "pairwise_question_answering_quality/explanation",
        ]
    ].to_dict("records")[0]
    choice = (
        baseline
        if result["pairwise_question_answering_quality/pairwise_choice"] == "BASELINE"
        else candidate
    )
    return (choice, result["pairwise_question_answering_quality/explanation"])


def greater(cmp: callable, a: str, b: str) -> int:
    """
    Adapts a pairwise comparison function for use with functools.cmp_to_key.
    Calls cmp(a, b), which returns a (choice, explanation) tuple, and returns 1 if
    `a` was chosen and -1 otherwise.
    """
    # The explanation is not needed for ordering, so it is discarded here.
    choice, _explanation = cmp(a, b)

    if choice == a:
        return 1
    return -1
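
To see how this adapter plugs into `max()` and `sorted()` without calling the evaluation service, here is a minimal sketch. The `judge` function is a hypothetical stand-in for `pairwise_greater` (it simply prefers the longer answer), and `greater` is restated so the snippet is self-contained. It also records how many comparisons `max()` makes, since each comparison corresponds to one evaluation call in the real workflow.

```python
import functools


def make_stub_judge():
    """Return (judge, calls). `judge` mimics the (choice, explanation) output shape."""
    calls = []

    def judge(baseline: str, candidate: str) -> tuple:
        calls.append((baseline, candidate))
        # Hypothetical rule standing in for the autorater: prefer the longer answer.
        choice = baseline if len(baseline) >= len(candidate) else candidate
        return choice, f"picked the longer answer ({len(choice)} chars)"

    return judge, calls


def greater(cmp, a: str, b: str) -> int:
    """Same adapter as above: 1 if `a` wins the pairwise comparison, else -1."""
    choice, _explanation = cmp(a, b)
    return 1 if choice == a else -1


judge, calls = make_stub_judge()
key = functools.cmp_to_key(functools.partial(greater, judge))

candidates = ["patch", "plug and patch", "use a repair stem backed by a patch"]

best = max(candidates, key=key)  # one comparison per remaining candidate
calls_for_max = len(calls)

ranked = sorted(candidates, key=key, reverse=True)  # full ranking, best first

print(best)
print(calls_for_max)
print(ranked)
```

With a real autorater, pairwise judgments are not guaranteed to be transitive, so a full `sorted()` ranking should be read as approximate; `max()` only needs one pass over the candidates.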

The following function performs a pointwise evaluation of the provided set of responses with respect to the provided metrics, instruction and context.

In [None]:
def pointwise_eval(
    instruction: str,
    context: str,
    responses: list[str],
    eval_metrics: list[object] = [
        MetricPromptTemplateExamples.Pointwise.QUESTION_ANSWERING_QUALITY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
    ],
    experiment_name: str = experiment_name,
) -> object:
    """
    Takes the instruction, context and a list of corresponding generated responses, and returns the pointwise evaluation results
    for each of the provided metrics. For this example the metrics are Q&A related; the full list can be found at:
    https://cloud.google.com/vertex-ai/generative-ai/docs/models/online-pipeline-services
    """

    instructions = [instruction] * len(responses)

    contexts = [context] * len(responses)

    eval_dataset = pd.DataFrame(
        {
            "instruction": instructions,
            "context": contexts,
            "response": responses,
        }
    )

    eval_task = EvalTask(
        dataset=eval_dataset, metrics=eval_metrics, experiment=experiment_name
    )
    results = eval_task.evaluate(
        prompt_template="{instruction} \n {context}",
        experiment_run_name="gemini-qa-pointwise-" + str(uuid.uuid4()),
    )
    return results

The `rank_responses()` function integrates the pairwise and pointwise logic to enhance response selection and evaluation. Here's the process:

**Current Workflow**:

1. Pairwise Comparison: Compares responses in pairs to identify the best one based on user-defined metrics.
2. Pointwise Evaluation: Assesses the quality of the chosen response, providing human-readable explanations to build trust.

**Alternative Workflow**:

1. Pointwise Evaluation: Evaluates each response individually based on its pointwise scores or the likelihood/logprobs of the response, filtering out those that don't meet specific quality criteria or selecting the top K responses.
2. Pairwise Comparison: Ranks the remaining high-quality responses using pairwise methods to determine the best one(s).

**Key Points**

- Combines pairwise and pointwise approaches for robust response selection.
- Offers flexibility with two possible workflows to suit different needs.
- Prioritizes response quality and provides explanations to support user confidence.
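
The alternative workflow is not implemented in this notebook. As a rough sketch of its shape, the snippet below filters by a pointwise score and then ranks the survivors pairwise. Everything in it is an assumption made for illustration: `filter_then_rank`, the `min_score` and `top_k` parameters, and the two stub scorers are hypothetical and stand in for calls to the evaluation service.

```python
import functools
from typing import Callable


def filter_then_rank(
    responses: list[str],
    pointwise_score: Callable[[str], float],
    pairwise_choice: Callable[[str, str], str],
    min_score: float = 3.0,
    top_k: int = 2,
) -> list[str]:
    """Keep the top_k responses scoring at least min_score, then rank them pairwise."""
    # Step 1: pointwise filter, highest score first.
    scored = sorted(
        ((pointwise_score(r), r) for r in responses),
        key=lambda pair: pair[0],
        reverse=True,
    )
    kept = [r for score, r in scored if score >= min_score][:top_k]

    # Step 2: pairwise ranking of the survivors, best first.
    def cmp(a: str, b: str) -> int:
        return 1 if pairwise_choice(a, b) == a else -1

    return sorted(kept, key=functools.cmp_to_key(cmp), reverse=True)


# Hypothetical pointwise scores standing in for an autorater's output.
stub_scores = {
    "A short answer.": 3.0,
    "A grounded, complete answer.": 5.0,
    "An off-topic reply.": 1.0,
    "A grounded answer.": 4.0,
}


def stub_pairwise(a: str, b: str) -> str:
    # Hypothetical pairwise rule: prefer the longer response.
    return a if len(a) >= len(b) else b


ranked = filter_then_rank(list(stub_scores), stub_scores.get, stub_pairwise)
print(ranked)
```

Filtering first keeps the number of pairwise comparisons small, at the cost of possibly discarding a response that would have won head-to-head.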

In [None]:
def rank_responses(instruction: str, context: str, responses: list[str]) -> tuple:
    """
    Takes the instruction, context and a list of responses as input, and returns the best performing response as well as its associated
    human-readable pointwise quality metrics for the criteria configured in the above functions.
    The process consists of two steps:
    1. Selecting the best response by using pairwise comparisons between the responses for the user-specified metric (e.g. Q&A)
    2. Doing a pointwise evaluation of the best response and returning human-readable quality metrics and explanation along with the best response.
    """
    cmp_f = partial(
        pairwise_greater, instruction, context, PROJECT_ID, LOCATION, experiment_name
    )
    cmp_greater = partial(greater, cmp_f)

    pairwise_best_response = max(responses, key=functools.cmp_to_key(cmp_greater))
    pointwise_metric = pointwise_eval(instruction, context, [pairwise_best_response])
    qa_metrics = pointwise_metric.metrics_table[
        [
            col
            for col in pointwise_metric.metrics_table.columns
            if ("question_answering" in col) or ("groundedness" in col)
        ]
    ].to_dict("records")[0]

    return pairwise_best_response, qa_metrics

### Load the Gemini 1.5 Pro model

Here we load the model, assign a temperature value in the range `0.3` to `1.0`, and configure it to generate multiple responses. A higher temperature value is critical, even for use cases such as Q&A where creativity is less important: with de-correlated responses, if the model gets its top choice wrong in one response, it still has a chance of getting it right in one of the others.
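
A back-of-the-envelope calculation illustrates the intuition. It rests on a strong simplifying assumption that is not from this notebook: each response is correct independently with the same probability. Responses sampled from one model are only partly de-correlated, so the result is an optimistic idealization, and the function name and the value `0.7` are purely illustrative.

```python
def p_at_least_one_correct(p_single: float, n: int) -> float:
    """Probability that at least one of n independent responses is correct."""
    # Complement of "all n responses are wrong".
    return 1.0 - (1.0 - p_single) ** n


for n in (1, 2, 3, 4):
    print(n, round(p_at_least_one_correct(0.7, n), 4))
```

The gain from each additional response shrinks quickly, and it only pays off if the ranking step actually picks the correct response out of the set.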

In [None]:
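# Number of candidate responses to generate per call.
# Example value only: the original cell used this name without defining it.
num_responses = 3

generation_model = GenerativeModel("gemini-1.5-pro-002")
generation_config = GenerationConfig(
    temperature=0.4, max_output_tokens=512, candidate_count=num_responses
)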

### Prompt Gemini

Now we prompt Gemini to generate multiple, slightly de-correlated responses based on the above configuration. All of the responses are generated in a single call.

In [None]:
instruction_qa = "Please answer the following question based on the context provided. Question: what is the correct process of fixing your tires?"
context_qa = (
    "Context:\n"
    + "the world is a magical place and fixing tires is one of those magical tasks. According to the Administration and Association (TIA), the only method to properly repair a tire puncture is to fill the injury with a repair stem and back the stem with a repair patch. This is commonly known as a combination repair or a patch/plug repair."
)
prompt_qa = instruction_qa + "\n" + context_qa + "\n\nAnswer:\n"
responses = [
    candidate.text
    for candidate in generation_model.generate_content(
        contents=prompt_qa,
        generation_config=generation_config,
    ).candidates
]

prompt_qa

Here we use the `rank_responses()` function to select the best response and fetch its associated quality metrics.

In [None]:
best_response, metrics = rank_responses(instruction_qa, context_qa, responses)

Now we print the following:
1. The raw responses generated by Gemini
2. The best performing response
3. Its associated pointwise quality metrics and explanation

In [None]:
for ix, response in enumerate(responses, start=1):
    print(f"Response no. {ix}: \n {response}")

In [None]:
print(best_response)

In [None]:
metrics
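
The raw dictionary can be hard to scan. Below is a small, optional sketch of a formatter that groups the entries by metric name. It assumes the keys follow a `<metric>/<field>` pattern, as the pairwise column names used earlier do; `format_metrics` is a hypothetical helper, and the values in `example_metrics` are made up for illustration rather than taken from a real evaluation run.

```python
from collections import defaultdict


def format_metrics(metrics: dict) -> str:
    """Group '<metric>/<field>' keys by metric and render them as indented text."""
    grouped = defaultdict(dict)
    for column, value in metrics.items():
        metric, _, field = column.partition("/")
        # Keys without a '/' are stored under a generic 'value' field.
        grouped[metric][field or "value"] = value

    lines = []
    for metric, fields in grouped.items():
        lines.append(f"{metric}:")
        for field, value in fields.items():
            lines.append(f"  {field}: {value}")
    return "\n".join(lines)


# Illustrative values only, not output from a real evaluation run.
example_metrics = {
    "question_answering_quality/score": 4.0,
    "question_answering_quality/explanation": "Answers the question using the context.",
    "groundedness/score": 1.0,
    "groundedness/explanation": "Every claim is supported by the context.",
}

print(format_metrics(example_metrics))
```

In the notebook itself, the same helper could be applied to the `metrics` dictionary returned by `rank_responses()`.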

## Cleaning Up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial.

In [None]:
aiplatform.init(project=PROJECT_ID, location=LOCATION)
experiment = aiplatform.Experiment(experiment_name)
experiment.delete()