In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Stage 2: Building MVP - 05 Evaluation with Vertex AI


<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/workshops/rag-ops/2.5_mvp_evaluation_vertexai_eval.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fworkshops%2Frag-ops%2F2.5_mvp_evaluation_vertexai_eval.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>    
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/workshops/rag-ops/2.5_mvp_evaluation_vertexai_eval.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/workshops/rag-ops/2.5_mvp_evaluation_vertexai_eval.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>


## Overview

This notebook is the fifth in a series designed to guide you through building a Minimum Viable Product (MVP) for a Multimodal Retrieval Augmented Generation (RAG) system using the Vertex Gemini API.

This notebook goes deeper into evaluating RAG system performance by introducing the Vertex AI Eval service, a managed service for assessing the quality of generated answers. The previous notebook took a hands-on approach to evaluation; this one uses Vertex AI's dedicated evaluation capabilities for a more streamlined and scalable workflow.

**Here's what you'll achieve:**

* **Harness the Power of Vertex AI Eval Service:** Learn to use the Vertex AI Eval service to evaluate answers generated by your RAG system. This involves understanding its features, configuring evaluation tasks, and interpreting the results. You can find more information in the [official documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview).
* **Explore Predefined and Custom Metrics:** Explore the predefined metrics offered by the Vertex AI Eval service and learn how to create custom metrics tailored to your evaluation needs. The [documentation on defining evaluation metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval) provides a comprehensive guide.
* **Streamline Your Evaluation Workflow:** Experience a more efficient and scalable evaluation process than a manual implementation. The Vertex AI Eval service automates many underlying tasks, so you can focus on analyzing results and improving your system.
* **Gain Deeper Insights with Visualizations:** Visualize the evaluation results with radar and bar plots to understand your RAG system's performance and identify specific areas for improvement, such as factual accuracy, coherence, and relevance.
* **Compare and Contrast with Detailed Explanations:** Continue the analysis of the Gemini 1.5 Pro and Gemini 1.5 Flash models by evaluating their answers with the Vertex AI Eval service. Use the per-instance explanations it returns to understand the strengths and weaknesses of each model.

This notebook offers a practical introduction to the Vertex AI Eval service and its application to evaluating RAG systems. With it, you can streamline your evaluation workflow, gain deeper insight into your system's performance, and make data-driven improvements toward a more robust and reliable RAG MVP.

## Getting Started

### Install Vertex AI SDK for Python


In [1]:
%pip install --upgrade --user --quiet google-cloud-aiplatform[evaluation]

### Restart runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

In [2]:
import sys

if "google.colab" in sys.modules:
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the cell below to authenticate your environment.


In [1]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information, GCS Bucket and initialize Vertex AI SDK

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [2]:
import os
import sys

from google.cloud import storage
import vertexai

PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
LOCATION = "us-central1"
BUCKET_NAME = "mlops-for-genai"
EXPERIMENT = "rag-eval-01"

if PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

if not PROJECT_ID or PROJECT_ID == "[your-project-id]" or PROJECT_ID == "None":
    raise ValueError("Please set your PROJECT_ID")


vertexai.init(project=PROJECT_ID, location=LOCATION)

# Initialize cloud storage
storage_client = storage.Client(project=PROJECT_ID)
bucket = storage_client.bucket(BUCKET_NAME)

In [3]:
# Variables for data location. Do not change.

PRODUCTION_DATA = "multimodal-finanace-qa/data/unstructured/production/"
PICKLE_FILE_NAME = "training_data_results.pkl"

### Import libraries


In [4]:
import inspect
import logging
import pickle
import warnings

from IPython.display import HTML, Markdown, display
import pandas as pd
import plotly.graph_objects as go
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples, PointwiseMetric
from vertexai.generative_models import GenerativeModel

In [5]:
logging.getLogger("urllib3.connectionpool").setLevel(logging.ERROR)
warnings.filterwarnings("ignore")

### Load the Gemini 1.5 models

Learn more about all [Gemini API models on Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models).

The Gemini model family has several model versions. This notebook loads both Gemini 1.5 Flash and Gemini 1.5 Pro so that their RAG answers can be compared. Gemini 1.5 Flash is the more lightweight, faster, and more cost-efficient of the two, which makes it a great option for prototyping.


In [6]:
MODEL_ID_FLASH = "gemini-1.5-flash-002"  # @param {type:"string"}
MODEL_ID_PRO = "gemini-1.5-pro-002"  # @param {type:"string"}


gemini_15_flash = GenerativeModel(MODEL_ID_FLASH)
gemini_15_pro = GenerativeModel(MODEL_ID_PRO)

In [9]:
# @title Helper Functions


def get_load_dataframes_from_gcs():
    """Download the index pickle from GCS and return its four DataFrames."""
    gcs_path = "multimodal-finanace-qa/data/embeddings/index_db.pkl"
    # Use a dedicated local file name so this download is not confused with
    # the training results stored under PICKLE_FILE_NAME.
    local_path = "index_db.pkl"
    blob = bucket.blob(gcs_path)

    # Download the pickle file from GCS
    blob.download_to_filename(local_path)

    # Load the pickle file into a list of dataframes
    with open(local_path, "rb") as f:
        dataframes = pickle.load(f)

    # Assign the dataframes to variables
    (
        index_db_final,
        extracted_text_chunk_df,
        video_metadata_chunk_df,
        audio_metadata_chunk_df,
    ) = dataframes

    return (
        index_db_final,
        extracted_text_chunk_df,
        video_metadata_chunk_df,
        audio_metadata_chunk_df,
    )


def get_load_training_dataframes_from_gcs():
    """Download the training-results pickle from GCS and load its contents."""
    gcs_path = "multimodal-finanace-qa/data/structured/" + PICKLE_FILE_NAME
    blob = bucket.blob(gcs_path)

    # Download the pickle file from GCS
    blob.download_to_filename(PICKLE_FILE_NAME)

    # Load the pickle file into a list of dataframes
    with open(PICKLE_FILE_NAME, "rb") as f:
        dataframes = pickle.load(f)

    # Assign the dataframes to variables
    training_data_flash, training_data_pro = dataframes

    return training_data_flash, training_data_pro
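
The loaders above assume that each pickle holds a sequence whose length matches the tuple it is unpacked into. A minimal sketch of that round trip follows, using made-up placeholder rows instead of the real DataFrames and a temporary local file instead of GCS (the file name and values are hypothetical):

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the two DataFrames stored in the training pickle.
flash_rows = [{"question": "q1", "gen_answer": "answer from flash"}]
pro_rows = [{"question": "q1", "gen_answer": "answer from pro"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = Path(tmp_dir) / "example_results.pkl"  # hypothetical file name

    # Write a two-element list, mirroring the layout the loader expects.
    with open(local_path, "wb") as f:
        pickle.dump([flash_rows, pro_rows], f)

    # Read it back and unpack; a length mismatch would raise ValueError here.
    with open(local_path, "rb") as f:
        loaded_flash, loaded_pro = pickle.load(f)

print(loaded_flash == flash_rows, loaded_pro == pro_rows)
```

Only unpickle files from sources you trust, since `pickle.load` can execute arbitrary code.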

![](https://storage.googleapis.com/mlops-for-genai/multimodal-finanace-qa/img/rag_eval_flow.png)

In [10]:
# Get the data that was extracted in the previous step: IndexDB.
# Make sure that you have run the previous notebook: stage_2_mvp_chunk_embeddings.ipynb


(
    index_db_final,
    extracted_text_chunk_df,
    video_metadata_chunk_df,
    audio_metadata_chunk_df,
) = get_load_dataframes_from_gcs()
training_data_flash, training_data_pro = get_load_training_dataframes_from_gcs()

In [11]:
index_db_final.head()

In [12]:
training_data_flash.head(2)

In [None]:
training_data_pro.head(2)

In [None]:
training_data_pro.shape

### Evaluations

In [13]:
# @title Vertex AI Eval Helper Functions


def print_doc(function):
    print(f"{function.__name__}:\n{inspect.getdoc(function)}\n")


def display_eval_report(eval_result, metrics=None):
    """Display the evaluation results."""

    title, summary_metrics, report_df = eval_result
    metrics_df = pd.DataFrame.from_dict(summary_metrics, orient="index").T
    if metrics:
        metrics_df = metrics_df.filter(
            [
                metric
                for metric in metrics_df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )
        report_df = report_df.filter(
            [
                metric
                for metric in report_df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )

    # Display the title with Markdown for emphasis
    display(Markdown(f"## {title}"))

    # Display the metrics DataFrame
    display(Markdown("### Summary Metrics"))
    display(metrics_df)

    # Display the detailed report DataFrame
    display(Markdown("### Report Metrics"))
    display(report_df)


def display_explanations(df, metrics=None, n=1):
    """Display n randomly sampled rows, optionally limited to selected metrics."""
    style = "white-space: pre-wrap; width: 800px; overflow-x: auto;"
    df = df.sample(n=n)
    if metrics:
        df = df.filter(
            [
                "prompt",
                "instruction",
                "context",
                "reference",
                "completed_prompt",
                "response",
            ]
            + [
                metric
                for metric in df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )

    for index, row in df.iterrows():
        for col in df.columns:
            display(HTML(f"<h2>{col}:</h2> <div style='{style}'>{row[col]}</div>"))
        display(HTML("<hr>"))


def plot_radar_plot(eval_results, max_score=5, metrics=None):
    fig = go.Figure()

    for eval_result in eval_results:
        title, summary_metrics, report_df = eval_result

        if metrics:
            summary_metrics = {
                k: summary_metrics[k]
                for k, v in summary_metrics.items()
                if any(selected_metric in k for selected_metric in metrics)
            }

        fig.add_trace(
            go.Scatterpolar(
                r=list(summary_metrics.values()),
                theta=list(summary_metrics.keys()),
                fill="toself",
                name=title,
            )
        )

    fig.update_layout(
        polar=dict(radialaxis=dict(visible=True, range=[0, max_score])), showlegend=True
    )

    fig.show()


def plot_bar_plot(eval_results, metrics=None):
    data = []

    for eval_result in eval_results:
        title, summary_metrics, _ = eval_result
        if metrics:
            summary_metrics = {
                k: summary_metrics[k]
                for k, v in summary_metrics.items()
                if any(selected_metric in k for selected_metric in metrics)
            }

        data.append(
            go.Bar(
                x=list(summary_metrics.keys()),
                y=list(summary_metrics.values()),
                name=title,
            )
        )

    fig = go.Figure(data=data)

    # Change the bar mode
    fig.update_layout(barmode="group")
    fig.show()

### Prepare your dataset

To evaluate the RAG-generated answers, the evaluation dataset must contain the following fields:

* `prompt`: The user-supplied prompt, consisting of the user question and the RAG-retrieved context
* `response`: The RAG-generated answer
* `reference`: The golden (ground-truth) answer to compare the model response against
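
As a rough sketch of what one row of such a dataset looks like, the snippet below assembles a single record and checks that the three required fields are present. The helper names and example values are hypothetical and only illustrate the layout; they are not part of the Vertex AI SDK:

```python
REQUIRED_FIELDS = ("prompt", "response", "reference")


def build_eval_record(question, context, generated_answer, golden_answer):
    """Assemble one evaluation row from the pieces of a RAG interaction."""
    return {
        "prompt": f"Answer the question: {question} Context: {context}",
        "response": generated_answer,
        "reference": golden_answer,
    }


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Hypothetical example values, for illustration only.
record = build_eval_record(
    question="What was the reported revenue?",
    context="The report states revenue of 10 units.",
    generated_answer="Revenue was 10 units.",
    golden_answer="10 units.",
)
print(missing_fields(record))
```

An empty list means the record has everything the evaluation needs.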

### rag_a: Gemini 1.5 Pro

In [14]:
def get_citation(row):
    final_ref = []
    for each_cit in row[:20]:
        final_ref.append(each_cit["content"])
    return "\n".join(final_ref)
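
To make the truncation behavior concrete, here is a small sketch with the same logic as `get_citation`, applied to made-up citations. The function name `join_citation_contents` and the sample data are hypothetical; it assumes each citation is a dict with a `"content"` key, as the helper above does:

```python
def join_citation_contents(citations, limit=20):
    """Join the "content" of the first `limit` citations with newlines."""
    return "\n".join(citation["content"] for citation in citations[:limit])


# Hypothetical input: 25 retrieved chunks for one question.
example_citations = [{"content": f"chunk {i}"} for i in range(25)]
example_context = join_citation_contents(example_citations)

# Only the first 20 chunks make it into the context string.
print(len(example_context.split("\n")))
```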

In [15]:
questions = training_data_pro["question"].tolist()

retrieved_contexts = training_data_pro["citation"].apply(get_citation).tolist()

generated_answers_by_rag_a = training_data_pro["gen_answer"].tolist()


golden_answers = training_data_pro["answer"].tolist()

referenced_eval_dataset_rag_a = pd.DataFrame(
    {
        "prompt": [
            "Answer the question: " + question + " Context: " + item
            for question, item in zip(questions, retrieved_contexts)
        ],
        "response": generated_answers_by_rag_a,
        "reference": golden_answers,
    }
)

### rag_b: Gemini 1.5 Flash

In [16]:
questions = training_data_flash["question"].tolist()

retrieved_contexts = training_data_flash["citation"].apply(get_citation).tolist()


generated_answers_by_rag_b = training_data_flash["gen_answer"].tolist()

golden_answers = training_data_flash["answer"].tolist()


referenced_eval_dataset_rag_b = pd.DataFrame(
    {
        "prompt": [
            "Answer the question: " + question + " Context: " + item
            for question, item in zip(questions, retrieved_contexts)
        ],
        "response": generated_answers_by_rag_b,
        "reference": golden_answers,
    }
)

### Select and create metrics


You can run evaluation for just one metric, or a combination of metrics. For this example, we select a few RAG-related predefined metrics, and create a few of our own custom metrics.

#### Explore predefined metrics

In [17]:
# See all the available metric examples
MetricPromptTemplateExamples.list_example_metric_names()

In [18]:
# See the prompt example for one of the pointwise metrics
print(MetricPromptTemplateExamples.get_prompt_template("question_answering_quality"))

#### Create custom metrics

In [19]:
relevance_prompt_template = """
You are a professional writing evaluator. Your job is to score writing responses according to pre-defined evaluation criteria.

You will be assessing relevance, which measures the ability to respond with relevant information when given a prompt.

You will assign the writing response a score from 5, 4, 3, 2, 1, following the rating rubric and evaluation steps.

## Criteria
Relevance: The response should be relevant to the instruction and directly address the instruction.

## Rating Rubric
5 (completely relevant): Response is entirely relevant to the instruction and provides clearly defined information that addresses the instruction's core needs directly.
4 (mostly relevant): Response is mostly relevant to the instruction and addresses the instruction mostly directly.
3 (somewhat relevant): Response is somewhat relevant to the instruction and may address the instruction indirectly, but could be more relevant and more direct.
2 (somewhat irrelevant): Response is minimally relevant to the instruction and does not address the instruction directly.
1 (irrelevant): Response is completely irrelevant to the instruction.

## Evaluation Steps
STEP 1: Assess relevance: is the response relevant to the instruction, and does it directly address the instruction?
STEP 2: Score based on the criteria and rubrics.

Give step by step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.


# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""

In [20]:
helpfulness_prompt_template = """
You are a professional writing evaluator. Your job is to score writing responses according to pre-defined evaluation criteria.

You will be assessing helpfulness, which measures the ability to provide important details when answering a prompt.

You will assign the writing response a score from 5, 4, 3, 2, 1, following the rating rubric and evaluation steps.

## Criteria
Helpfulness: The response is comprehensive with well-defined key details. The user would feel very satisfied with the content in a good response.

## Rating Rubric
5 (completely helpful): Response is useful and very comprehensive with well-defined key details to address the needs in the instruction and usually goes beyond what was explicitly asked. The user would feel very satisfied with the content in the response.
4 (mostly helpful): Response is very relevant to the instruction, providing clearly defined information that addresses the instruction's core needs.  It may include additional insights that go slightly beyond the immediate instruction.  The user would feel quite satisfied with the content in the response.
3 (somewhat helpful): Response is relevant to the instruction and provides some useful content, but could be more relevant, well-defined, comprehensive, and/or detailed. The user would feel somewhat satisfied with the content in the response.
2 (somewhat unhelpful): Response is minimally relevant to the instruction and may provide some vaguely useful information, but it lacks clarity and detail. It might contain minor inaccuracies. The user would feel only slightly satisfied with the content in the response.
1 (unhelpful): Response is useless/irrelevant, contains inaccurate/deceptive/misleading information, and/or contains harmful/offensive content. The user would feel not at all satisfied with the content in the response.

## Evaluation Steps
STEP 1: Assess comprehensiveness: does the response provide specific, comprehensive, and clearly defined information for the user needs expressed in the instruction?
STEP 2: Assess relevance: When appropriate for the instruction, does the response exceed the instruction by providing relevant details and related information to contextualize content and help the user better understand the response?
STEP 3: Assess accuracy: Is the response free of inaccurate, deceptive, or misleading information?
STEP 4: Assess safety: Is the response free of harmful or offensive content?

Give step by step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.


# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""

In [21]:
relevance = PointwiseMetric(
    metric="relevance",
    metric_prompt_template=relevance_prompt_template,
)

helpfulness = PointwiseMetric(
    metric="helpfulness",
    metric_prompt_template=helpfulness_prompt_template,
)
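
The custom templates above contain `{prompt}` and `{response}` placeholders, which need to line up with column names in the evaluation dataset. The sketch below uses plain Python string handling (not the SDK's internal code) to check this before running an evaluation. The shortened template, the column list, and the function name are hypothetical:

```python
from string import Formatter


def template_fields(template):
    """Return the set of named placeholders used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


# Hypothetical, shortened template and dataset columns for illustration.
example_template = "## Prompt\n{prompt}\n\n## AI-generated Response\n{response}\n"
dataset_columns = ["prompt", "response", "reference"]

# Placeholders with no matching dataset column would be left unfilled.
unfilled = template_fields(example_template) - set(dataset_columns)
print(sorted(unfilled))

# Filling the template for one hypothetical row.
filled = example_template.format(
    prompt="What is RAG?",
    response="Retrieval Augmented Generation.",
)
print(filled)
```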

### Run evaluation with your dataset

In [22]:
rag_eval_task_rag_a = EvalTask(
    dataset=referenced_eval_dataset_rag_a,
    metrics=[
        "question_answering_quality",
        relevance,
        helpfulness,
        "groundedness",
        "safety",
        "instruction_following",
    ],
    experiment=EXPERIMENT,
)

rag_eval_task_rag_b = EvalTask(
    dataset=referenced_eval_dataset_rag_b,
    metrics=[
        "question_answering_quality",
        relevance,
        helpfulness,
        "groundedness",
        "safety",
        "instruction_following",
    ],
    experiment=EXPERIMENT,
)

In [23]:
result_rag_a = rag_eval_task_rag_a.evaluate()
result_rag_b = rag_eval_task_rag_b.evaluate()

### Display evaluation results

#### View summary results

To see all the metrics from one model's evaluation result in a single table, use the `display_eval_report()` helper function.

In [24]:
display_eval_report(
    (
        "Model A Eval Result",
        result_rag_a.summary_metrics,
        result_rag_a.metrics_table,
    )
)

In [25]:
display_eval_report(
    (
        "Model B Eval Result",
        result_rag_b.summary_metrics,
        result_rag_b.metrics_table,
    )
)
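
To show how summary metrics relate to the per-row report, the sketch below aggregates per-row scores into `<metric>/mean` entries. It assumes the report uses `<metric>/score` column names; the rows and the `summarize` helper are hypothetical and only the mean is computed here:

```python
from statistics import mean

# Hypothetical per-row scores, keyed the way the report is assumed to be.
example_rows = [
    {"relevance/score": 5, "helpfulness/score": 4},
    {"relevance/score": 3, "helpfulness/score": 4},
    {"relevance/score": 4, "helpfulness/score": 1},
]


def summarize(rows):
    """Aggregate '<metric>/score' columns into '<metric>/mean' entries."""
    summary = {"row_count": len(rows)}
    score_columns = [key for key in rows[0] if key.endswith("/score")]
    for column in score_columns:
        metric_name = column.rsplit("/", 1)[0]
        summary[f"{metric_name}/mean"] = mean(row[column] for row in rows)
    return summary


example_summary = summarize(example_rows)
print(example_summary)
```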

#### Visualize evaluation results

In [26]:
eval_results = []
eval_results.append(
    ("Model A", result_rag_a.summary_metrics, result_rag_a.metrics_table)
)
eval_results.append(
    ("Model B", result_rag_b.summary_metrics, result_rag_b.metrics_table)
)
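
Alongside the plots, a numeric comparison can help when two systems score closely. The following sketch computes per-metric differences between two summary dictionaries; the values and the `mean_deltas` helper are hypothetical and stand in for `result_rag_a.summary_metrics` and `result_rag_b.summary_metrics`:

```python
# Hypothetical summary metrics for two systems.
summary_a = {"row_count": 10, "relevance/mean": 4.5, "helpfulness/mean": 4.0}
summary_b = {"row_count": 10, "relevance/mean": 4.0, "helpfulness/mean": 4.25}


def mean_deltas(first, second):
    """Return first minus second for every '/mean' key present in both."""
    return {
        key: first[key] - second[key]
        for key in first
        if key.endswith("/mean") and key in second
    }


deltas = mean_deltas(summary_a, summary_b)
for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.2f}")
```

A positive value means the first system scored higher on that metric.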

In [27]:
plot_radar_plot(
    eval_results,
    metrics=[
        f"{metric}/mean"
        # Edit your list of metrics here if you used other metrics in evaluation.
        for metric in [
            "question_answering_quality",
            "safety",
            "groundedness",
            "instruction_following",
            "relevance",
            "helpfulness",
        ]
    ],
)

In [28]:
plot_bar_plot(
    eval_results,
    metrics=[
        f"{metric}/mean"
        for metric in [
            "question_answering_quality",
            "safety",
            "groundedness",
            "instruction_following",
            "relevance",
            "helpfulness",
        ]
    ],
)
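
Both plotting helpers select metrics by substring match, which is why the calls above pass names with the `/mean` suffix. A small sketch of that filter on a made-up summary dictionary (all values hypothetical):

```python
# Hypothetical summary metrics for illustration.
example_summary_metrics = {
    "row_count": 10,
    "groundedness/mean": 4.2,
    "groundedness/std": 0.6,
    "relevance/mean": 4.8,
    "relevance/std": 0.4,
}
selected = [f"{metric}/mean" for metric in ["groundedness", "relevance"]]

# Same substring test the plotting helpers apply to each key.
filtered = {
    key: value
    for key, value in example_summary_metrics.items()
    if any(selected_metric in key for selected_metric in selected)
}
print(filtered)

# Without the suffix, the "/std" entry would match as well.
unsuffixed = [key for key in example_summary_metrics if "groundedness" in key]
print(unsuffixed)
```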

#### View detailed explanation for an individual instance

To see the detailed explanation of why a score was assigned for each model-based metric, use the `display_explanations()` helper function. It displays randomly sampled instances; for example, set `n=2` to display the explanations for two sampled instances as follows:

In [29]:
display_explanations(result_rag_a.metrics_table, n=2)

You can also focus on one or a few metrics as follows.

In [30]:
display_explanations(result_rag_b.metrics_table, metrics=["groundedness"])
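
Because `display_explanations()` samples rows at random, it can also be useful to look at the weakest instances directly. The sketch below picks the lowest-scoring rows for one metric from a list of dictionaries. The rows, the `lowest_scoring` helper, and the `<metric>/score` and `<metric>/explanation` column names are assumptions for illustration:

```python
# Hypothetical rows; column names follow the assumed "<metric>/..." pattern.
example_report_rows = [
    {
        "response": "answer A",
        "relevance/score": 4,
        "relevance/explanation": "Mostly on topic.",
    },
    {
        "response": "answer B",
        "relevance/score": 2,
        "relevance/explanation": "Does not address the question directly.",
    },
    {
        "response": "answer C",
        "relevance/score": 5,
        "relevance/explanation": "Directly answers the question.",
    },
]


def lowest_scoring(rows, metric, k=1):
    """Return the k rows with the lowest score for the given metric."""
    return sorted(rows, key=lambda row: row[f"{metric}/score"])[:k]


worst = lowest_scoring(example_report_rows, "relevance", k=2)
for row in worst:
    print(row["response"], row["relevance/score"], row["relevance/explanation"])
```

With a pandas DataFrame, `df.nsmallest(k, "relevance/score")` gives the same selection.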