In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Model Evaluation with Vertex AI

LLMOps, or Large Language Model Operations, is an increasingly important methodology as organizations adopt large language models (LLMs) for a wide range of applications. It is the set of tools, processes, and best practices for managing the LLM lifecycle, from development and deployment to monitoring and maintenance. Vertex AI offers services for managing LLMOps pipelines, as well as mechanisms for evaluating the quality of a new model after every pipeline run.

In the realm of Generative AI, evaluation is a critical aspect of assessing the quality and relevance of the generated text. It involves examining the output from a generative language model to determine its coherence, accuracy, and alignment with the provided prompt. Model evaluation helps identify areas for improvement, optimize model performance, and ensure that the generated text meets the desired standards for quality and usefulness. Read more about it in the official [documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/models/evaluate-models).

# Objective

This lab teaches you how to evaluate a foundation model or a fine tuned model using automatic metrics computed from the model's results on evaluation data. You will use the following Google Cloud products:
*  Vertex AI Pipelines
*  Vertex AI Evaluation Services
*  Vertex AI Model Registry
*  Vertex AI Endpoints

# Use Case

We will evaluate a generative model that produces a suitable title for the body of a news article from the BBC full-text dataset (sourced from the BigQuery public dataset *bigquery-public-data.bbc_news.fulltext*). The evaluation covers both a foundation model (text-bison@002) and a fine tuned model (called "bbc-news-summary-tuned"), using the automatic metrics method.

# Install and Import Dependencies

In [None]:
!pip install google-cloud-aiplatform
!pip install --user datasets
!pip install --user google-cloud-pipeline-components

In [2]:
# Restart the kernel so that the newly installed packages are picked up
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [1]:
from google.cloud import aiplatform
from google.colab import auth as google_auth
google_auth.authenticate_user()

In [2]:
import vertexai
PROJECT_ID = "YOUR_PROJECT_ID" #@param
vertexai.init(project=PROJECT_ID)

In [56]:
REGION = "europe-west4"
project_id = PROJECT_ID  # reuse the project set above for the gcloud command below

In [4]:
! gcloud config set project {project_id}

Updated property [core/project].


In [57]:
# Import the necessary libraries
import json
import os
import sys
import uuid
import warnings

# Silence TensorFlow C++ logs and Python warnings to keep the output readable
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
warnings.filterwarnings('ignore')

import kfp
import pandas as pd
import vertexai
from datasets import load_dataset
from google.auth import default
from google.cloud import aiplatform
from vertexai.preview.language_models import TextGenerationModel, EvaluationTextSummarizationSpec

vertexai.init(project=PROJECT_ID, location=REGION)

# Prepare & Load Evaluation Data

In [16]:
BUCKET_NAME = 'img_public_test/next_demo'
# Build the URI from BUCKET_NAME instead of repeating the path
BUCKET_URI = f"gs://{BUCKET_NAME}/EVALUATE.jsonl"
REGION = "europe-west4"

In [58]:
# Load the evaluation set: one JSON object per line with input_text and output_text
json_url = 'https://storage.googleapis.com/img_public_test/next_demo/EVALUATE.jsonl'
df = pd.read_json(json_url, lines=True)
print(df)

                                            input_text  \
0    Summarize this text to generate a title: A bro...   
1    Summarize this text to generate a title: Ninte...   
2    Summarize this text to generate a title: Gambl...   
3    Summarize this text to generate a title: The n...   
4    Summarize this text to generate a title: The t...   
..                                                 ...   
735  Summarize this text to generate a title: Film ...   
736  Summarize this text to generate a title: R&B s...   
737  Summarize this text to generate a title: Sir E...   
738  Summarize this text to generate a title: Seneg...   
739  Summarize this text to generate a title: Oscar...   

                            output_text  
0       US duo in first spam conviction  
1      Nintendo DS makes its Euro debut  
2     Rings of steel combat net attacks  
3    What's next for next-gen consoles?  
4     'Blog' picked as word of the year  
..                                  ...  
735    Be
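
The printed DataFrame shows that each evaluation record pairs a prompt (`input_text`) with a reference title (`output_text`). Before launching a long-running evaluation job, it can help to check that every line of the JSONL file has both fields. The following is a minimal sketch using only the standard library; the helper name `validate_jsonl` and the sample records are hypothetical and are not part of the Vertex AI SDK.

```python
import json

# Field names taken from the DataFrame columns shown above
REQUIRED_FIELDS = ("input_text", "output_text")


def validate_jsonl(lines):
    """Parse JSONL lines and return (valid_records, problems)."""
    records, problems = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        # A field counts as missing if it is absent, not a string, or empty
        missing = [
            field for field in REQUIRED_FIELDS
            if not isinstance(record.get(field), str) or not record[field].strip()
        ]
        if missing:
            problems.append(f"line {number}: missing {', '.join(missing)}")
        else:
            records.append(record)
    return records, problems


# Hypothetical sample lines, for illustration only
sample = [
    '{"input_text": "Summarize this text to generate a title: A short article body.", '
    '"output_text": "US duo in first spam conviction"}',
    '{"input_text": "A prompt without a reference title"}',
    'not valid json',
]
records, problems = validate_jsonl(sample)
print(len(records), "valid record(s)")
for problem in problems:
    print(problem)
```

To check a local copy of the evaluation file, you could pass an open file handle, for example `validate_jsonl(open("EVALUATE.jsonl", encoding="utf-8"))`.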

# Load Fine Tuned Model

In [45]:
tuned_model = TextGenerationModel.get_tuned_model("projects/273845608377/locations/europe-west4/models/4220809634753019904")
response = tuned_model.predict("Summarize this text to generate a title: \n Ever noticed how plane seats appear to be getting smaller and smaller? With increasing numbers of people taking to the skies, some experts are questioning if having such packed out planes is putting passengers at risk. They say that the shrinking space on aeroplanes is not only uncomfortable it it's putting our health and safety in danger. More than squabbling over the arm rest, shrinking space on planes putting our health and safety in danger? This week, a U.S consumer advisory group set up by the Department of Transportation said at a public hearing that while the government is happy to set standards for animals flying on planes, it doesn't stipulate a minimum amount of space for humans.")
print(response.text)

 Shrinking space on planes putting our health and safety in danger


# Evaluation of the Fine Tuned Model

In [47]:
# Define the evaluation specification for a text summarization task on the fine tuned model
task_spec = EvaluationTextSummarizationSpec(
    task_name="summarization",
    ground_truth_data=df,
)

In [48]:
# Evaluate the model
eval_metrics_finetuned = tuned_model.evaluate(task_spec=task_spec)

INFO:google.cloud.aiplatform.pipeline_jobs:Creating PipelineJob
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob created. Resource name: projects/273845608377/locations/europe-west4/pipelineJobs/evaluation-llm-text-generation-pipeline-20240404000646
INFO:google.cloud.aiplatform.pipeline_jobs:To use this PipelineJob in another session:
INFO:google.cloud.aiplatform.pipeline_jobs:pipeline_job = aiplatform.PipelineJob.get('projects/273845608377/locations/europe-west4/pipelineJobs/evaluation-llm-text-generation-pipeline-20240404000646')
INFO:google.cloud.aiplatform.pipeline_jobs:View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/europe-west4/pipelines/runs/evaluation-llm-text-generation-pipeline-20240404000646?project=273845608377
INFO:vertexai.language_models._evaluatable_language_models:Your evaluation job is running and will take 15-20 minutes to complete. Click on the PipelineJob link to view progress.
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob pro

In [49]:
print(eval_metrics_finetuned)

EvaluationMetric(bleu=None, rougeLSum=0.36600753600753694)


# Load Base Model

In [59]:
# Create a reference to the foundation model named in the Use Case section
base_model = TextGenerationModel.from_pretrained("text-bison@002")

# Evaluation of the Base Model

In [60]:
# Define the evaluation specification for a text summarization task on the base model
task_spec = EvaluationTextSummarizationSpec(
    task_name="summarization",
    ground_truth_data=df,
)

In [None]:
# Evaluate the model
eval_metrics_base = base_model.evaluate(task_spec=task_spec)

INFO:google.cloud.aiplatform.pipeline_jobs:Creating PipelineJob
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob created. Resource name: projects/273845608377/locations/europe-west4/pipelineJobs/evaluation-llm-text-generation-pipeline-20240404005730
INFO:google.cloud.aiplatform.pipeline_jobs:To use this PipelineJob in another session:
INFO:google.cloud.aiplatform.pipeline_jobs:pipeline_job = aiplatform.PipelineJob.get('projects/273845608377/locations/europe-west4/pipelineJobs/evaluation-llm-text-generation-pipeline-20240404005730')
INFO:google.cloud.aiplatform.pipeline_jobs:View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/europe-west4/pipelines/runs/evaluation-llm-text-generation-pipeline-20240404005730?project=273845608377
INFO:vertexai.language_models._evaluatable_language_models:Your evaluation job is running and will take 15-20 minutes to complete. Click on the PipelineJob link to view progress.


In [None]:
print(eval_metrics_base)

# Comparison

Compare `eval_metrics_finetuned` and `eval_metrics_base`, which hold the metrics of the fine tuned and base models respectively. The evaluation metric is relatively higher for the fine tuned model, because tuning teaches the model how to phrase its responses to this kind of prompt. The result for the fine tuned model was:

EvaluationMetric(bleu=None, rougeLSum=0.36600753600753694)

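**rougeLSum**: This is the summary-level ROUGE-L score. ROUGE-L measures the overlap between a generated summary and a reference summary using their longest common subsequence (LCS). Dividing the LCS length by the length of the reference summary gives recall, dividing it by the length of the generated summary gives precision, and the two are commonly combined into an F-measure. The `Sum` suffix denotes the variant that first splits each summary into sentences; for single-line outputs such as titles it coincides with plain ROUGE-L.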
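
To make the LCS-based definition concrete, here is a minimal pure-Python sketch of ROUGE-L for a single candidate and reference pair. It assumes lowercased whitespace tokenization, which is a simplification; the evaluation service may tokenize, stem, and aggregate differently, so do not expect it to reproduce the reported score exactly. The function names are hypothetical, and the candidate title below is an invented example (the reference is the first title in the evaluation data).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length of the tokens of `a` seen so far and b[:j]
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, f1) of ROUGE-L for two strings."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical candidate title scored against a reference from the dataset
precision, recall, f1 = rouge_l(
    "US pair convicted in first spam case",
    "US duo in first spam conviction",
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here the LCS is "us in first spam" (4 tokens), against 7 candidate tokens and 6 reference tokens, so precision is 4/7, recall is 4/6, and the F-measure is 8/13.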

The rougeLSum score reported above is 0.36600753600753694. On a scale from 0 to 1, this is a single aggregate figure for the evaluation dataset: the generated titles score roughly 0.37 for LCS-based overlap with the reference titles.
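
When both evaluation runs have finished, the two scores can be compared as a relative change. The sketch below is only an illustration: the fine tuned score is the one printed in this notebook, while the base score is a hypothetical placeholder, because the base model's output is not shown here. Replace it with the rougeLSum value printed by `print(eval_metrics_base)`.

```python
def relative_improvement(tuned_score, base_score):
    """Relative change of the tuned score over the base score."""
    if base_score <= 0:
        raise ValueError("base_score must be positive")
    return (tuned_score - base_score) / base_score


TUNED_ROUGE_L_SUM = 0.36600753600753694  # printed by this notebook
HYPOTHETICAL_BASE_ROUGE_L_SUM = 0.30     # placeholder, NOT a measured result

change = relative_improvement(TUNED_ROUGE_L_SUM, HYPOTHETICAL_BASE_ROUGE_L_SUM)
print(f"Relative change in rougeLSum: {change:.1%}")
```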

# View Evaluation Results in Cloud Storage and Console

You can find the evaluation results in the Cloud Storage output directory that you specified when creating the evaluation job. The file is named evaluation_metrics.json.
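
Once you have downloaded a copy of `evaluation_metrics.json`, you can read it with the standard library. This is a minimal sketch that assumes the file is a flat JSON object with one key per metric, such as `rougeLSum`; that layout is an assumption, so inspect the file if the lookup returns `None`. The helper name `read_metric` and the local path are hypothetical.

```python
import json


def read_metric(path, name="rougeLSum"):
    """Return the named metric from a local metrics file, or None if absent."""
    with open(path, encoding="utf-8") as handle:
        metrics = json.load(handle)
    if not isinstance(metrics, dict):
        return None  # unexpected layout: not a JSON object
    return metrics.get(name)


if __name__ == "__main__":
    # Hypothetical local path to a downloaded copy of the results file
    try:
        print(read_metric("evaluation_metrics.json"))
    except FileNotFoundError:
        print("Download evaluation_metrics.json from the output directory first.")
```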

For tuned models, you can also view evaluation results in the Google Cloud console:

1. In the Vertex AI section of the Google Cloud console, go to the Vertex AI Model Registry page.
2. Click the name of the model that you want to view evaluation metrics for.
3. In the Evaluate tab, click the name of the evaluation run that you want to view.