# Use the IBM watsonx.governance metrics toolkit to evaluate Google Vertex AI

The IBM watsonx.governance metrics toolkit lets you evaluate the output of a Large Language Model (LLM) against multiple task types: Text Summarization, Content Generation, Question Answering, Text Classification, Entity Extraction, and Retrieval-Augmented Generation (RAG).

This notebook demonstrates how to evaluate the output of a Text Summarization prompt run against a Google Vertex AI LLM.

- The prompt is run against a Vertex AI Gemini Pro Predictions model endpoint.
- Prompt output is evaluated using IBM's watsonx.governance metrics toolkit.
- Evaluated metrics are published to IBM OpenPages.

## Learning goals

The learning goals of this notebook are:

- Create your prompt for testing against the `Gemini Pro Predictions` model.
- Configure metrics for evaluation.
- Run the metrics against your prompt data.
- Print and review the metrics returned by the IBM watsonx.governance metrics toolkit.
- Publish the computed metrics to IBM OpenPages.

## Table of Contents

This notebook contains the following parts:

1. [Install the necessary packages](#packages)
2. [Provision services and configure credentials](#credentials)
3. [Evaluate Text Summarization output from the Google Vertex AI `Gemini Pro Predictions` model](#summarization)
4. [Publish computed metrics to an OpenPages foundation model](#openpages)
5. [Navigate to the OpenPages UI to verify your metrics](#verify)
6. [Summary](#summary)

<a id="packages"></a>
## Step 1 - Install the necessary packages

In [None]:
!pip install --upgrade google-cloud-aiplatform | tail -n 1

In [None]:
!pip install -i https://test.pypi.org/simple/ ibm-watson-openscale==3.0.34.8  | tail -n 1
!pip install -i https://test.pypi.org/simple/ ibm-metrics-plugin==5.0.1.17  | tail -n 1

In [None]:
!pip install --upgrade evaluate --no-cache | tail -n 1
!pip install --upgrade textstat --no-cache | tail -n 1
!pip install --upgrade sacrebleu --no-cache | tail -n 1
!pip install --upgrade sacremoses --no-cache | tail -n 1
!pip install --upgrade datasets==2.10.0 --no-cache | tail -n 1

Optional `pip` installs, as needed:

In [None]:
!pip install pydantic==1.10.11

In [None]:
!pip install --upgrade ibm_db_sa

In [None]:
import warnings
warnings.filterwarnings('ignore')

<a id="credentials"></a>
## Step 2 - Provision services and configure credentials

### Provision an instance of IBM Watson OpenScale

If you have not already done so, provision an instance of IBM Watson OpenScale using the [OpenScale link in the Cloud catalog](https://cloud.ibm.com/catalog/services/watson-openscale).

### Generate an API key

You can generate a Cloud API key with the IBM Cloud console or with the IBM Cloud command line interface.

To generate an API key by using IBM Cloud console:

1. Go to the [**Users** section of the IBM Cloud console](https://cloud.ibm.com/iam#/users).
1. Click your name, then scroll down to the **API Keys** section.
1. Click **Create an IBM Cloud API key**.
1. Give your key a name and click **Create**.
1. Copy the created key - you will need to paste this key into the `CLOUD_API_KEY` variable in the "Configure your credentials" section below.

To create an API key using the IBM Cloud [command line interface](https://console.bluemix.net/docs/cli/reference/ibmcloud/download_cli):

1. From the command line interface, type the following:

    `ibmcloud login --sso`

    `ibmcloud iam api-key-create 'my_key'`

1. Copy the created key - you will need to paste this key into the `CLOUD_API_KEY` variable in the "Configure your credentials" section below.

### Configure your credentials

In [None]:
use_cpd = False
CLOUD_API_KEY = "<Your IBM API Key>"
IAM_URL="https://iam.ng.bluemix.net/oidc/token"

If you are running your notebook on an IBM Cloud Pak for Data (CPD) cluster, uncomment and run the following code:

In [None]:
# use_cpd = True
# WOS_CREDENTIALS = {
#     "url": "xxxxx",
#     "username": "xxxxx",
#     "api_key": "xxxxx"
# }

# GEN_API_KEY = WOS_CREDENTIALS["api_key"]

# api_endpoint = WOS_CREDENTIALS["url"]
# project_id = "<Your project id>"
# endpoint_url = WOS_CREDENTIALS["url"]

<a id="summarization"></a>
## Step 3 - Evaluate Text Summarization output from the Google Vertex AI `Gemini Pro Predictions` model

### Download a dataset containing prompt input data for model inferencing and reference data for model output evaluation

The downloaded `.csv` file contains 50 sample prompts, each with the input text, a generated summary, and two reference summaries. These values are later split into input, output, and reference pandas DataFrames.

In [None]:
!rm -fr llm_content.csv
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/llm_content.csv"

In [None]:
import pandas as pd
import numpy as np
llm_data_all = pd.read_csv("llm_content.csv")
llm_data_all.head()

In [None]:
# Work on a copy of the last 10 rows so that a new column can be added to it later
llm_data = llm_data_all.tail(10).copy()
llm_data.head()

### Configure Vertex AI

**Note**: There are multiple ways to invoke Vertex AI model predictions and multiple ways to generate the Vertex AI token. This notebook demonstrates how to generate the service account token and call the LLM predictions API using a REST API. Alternatively, you can use the `gcloud` CLI, specifically the `gcloud auth print-access-token` command.
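
As a hedged sketch of the `gcloud` alternative mentioned in the note, the helper below shells out to `gcloud auth print-access-token` and wraps the result in a Bearer header. The names `get_token_from_gcloud` and `build_auth_headers`, and the injectable `run` parameter, are illustrative assumptions and not part of this notebook. It also assumes the `gcloud` CLI is installed and already authenticated.

```python
import subprocess

def get_token_from_gcloud(run=subprocess.run):
    """Return the access token printed by the gcloud CLI (hypothetical helper)."""
    completed = run(
        ["gcloud", "auth", "print-access-token"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if gcloud exits with a non-zero status
    )
    # The CLI prints the token followed by a newline
    return completed.stdout.strip()

def build_auth_headers(token):
    """Build JSON request headers that carry the token as a Bearer credential."""
    return {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": "Bearer {}".format(token),
    }
```

Passing `run` as a parameter keeps the helper testable without a real `gcloud` installation; in normal use it can be called with no arguments.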

In [None]:
# the project id from your google cloud account
PROJECT_ID = "<Your GCP Project ID>"

# the location of the project
LOCATION = "us-central1"

# the large language model - model id
MODEL_ID = "gemini-pro"

# service account credentials obtained from GCP Credentials Page
gcp_json_credentials_dict = {
  "type": "service_account",
  "project_id": "brave-healer-xxxx",
  "private_key_id": "xxxx",
  "private_key": "-----BEGIN PRIVATE KEY-----\nxxxx==\n-----END PRIVATE KEY-----\n",
  "client_email": "xxxx@brave-healer-xxxx.iam.gserviceaccount.com",
  "client_id": "xxxx",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxxxx.iam.gserviceaccount.com",
  "universe_domain": "googleapis.com"
}

# predictions URL
vertexai_predictions_url = 'https://{0}-aiplatform.googleapis.com/v1/projects/{1}/locations/{0}/publishers/google/models/{2}:streamGenerateContent'.format(LOCATION, PROJECT_ID, MODEL_ID)

#### Initialize the Google Cloud AI platform with the project ID

In [None]:
from google.cloud import aiplatform
aiplatform.init(project=PROJECT_ID, location=LOCATION)

In [None]:
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part

#### Generate the Vertex AI token

In [None]:
def get_vertexai_token():
    from google.oauth2 import service_account
    import google.auth.transport.requests
    import google
    credentials = service_account.Credentials.from_service_account_info(
        gcp_json_credentials_dict, 
        scopes=['https://www.googleapis.com/auth/cloud-platform'])
    request = google.auth.transport.requests.Request()
    credentials.refresh(request)
    token = credentials.token
    return token

In [None]:
vertexai_token = get_vertexai_token()

In [None]:
headers = {}
headers["Content-Type"] = "application/json"
headers["Accept"] = "application/json"
headers["Authorization"] = "Bearer {}".format(vertexai_token)

### Create your prompt for testing against the `Gemini Pro Predictions` model

In [None]:
def get_prompt(text):
    prompt = f"""Please provide a summary of the following text with a maximum of 20 words.
    
{text}
    
Summary:"""
    return prompt
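
The prompt asks for at most 20 words, but a model may not honor that limit. As a minimal sketch, a generated summary can be checked against the limit with a simple word count. The helper name `within_word_limit` is hypothetical, and splitting on whitespace is only an approximation of how words are counted.

```python
def within_word_limit(summary, limit=20):
    """Return True when the summary has at most `limit` whitespace-separated words."""
    return len(summary.split()) <= limit

# Illustrative summary text, not real model output
summary = "New bioluminescent deep-sea fish could inspire low-light underwater exploration."
print(len(summary.split()), within_word_limit(summary))
```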

### Run the prompt evaluation

In [None]:
import json
import requests
def get_completion(prompt_text):
    payload = {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {
                        "text": get_prompt(prompt_text)
                    }
                ]
            }
        ],
        "generationConfig": {
            "temperature": 0.2,
            "maxOutputTokens": 20
        }
    }
    response = requests.post(vertexai_predictions_url, headers=headers, json=payload, verify=False)
    # Fail fast on HTTP errors instead of hitting a confusing KeyError below
    response.raise_for_status()
    json_data = response.json()
    # The streaming endpoint returns a list of chunks; this reads the text of the first one
    prompt_output = json_data[0]['candidates'][0]['content']['parts'][0]['text']
    return prompt_output
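
Because the request targets `streamGenerateContent`, the JSON body is a list of chunks, and longer outputs may be split across several of them. The sketch below joins the text of every chunk, using a hand-built payload that mirrors the shape indexed by `get_completion`. The helper name `join_stream_text` and the sample text are assumptions for illustration only.

```python
def join_stream_text(chunks):
    """Concatenate the text parts of the first candidate in every chunk."""
    pieces = []
    for chunk in chunks:
        candidates = chunk.get("candidates") or []
        if not candidates:
            continue  # skip chunks that carry no candidate
        content = candidates[0].get("content") or {}
        for part in content.get("parts") or []:
            pieces.append(part.get("text", ""))
    return "".join(pieces)

# Hand-built payload, not real model output
sample = [
    {"candidates": [{"content": {"parts": [{"text": "New glowing fish "}]}}]},
    {"candidates": [{"content": {"parts": [{"text": "may aid exploration."}]}}]},
    {"candidates": []},
]
print(join_stream_text(sample))
```

With `maxOutputTokens` set to 20 the first chunk often holds the whole answer, so this matters mostly when the token limit is raised.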

In [None]:
text = '''Scientists have discovered a new species of deep-sea fish that emits a soft, soothing light. This bioluminescent fish could inspire advancements in low-light underwater exploration.
'''
output = get_completion(text)
output

### Generate summaries with the Vertex AI `Gemini Pro Predictions` model and store them in the dataset

In [None]:
llm_data['vertexai_gemini_pro_generated_summary'] = llm_data['input_text'].apply(get_completion)

In [None]:
llm_data.head()

#### Sample generated output

In [None]:
llm_data['vertexai_gemini_pro_generated_summary']

### Authenticate with IBM watsonx.governance

In [None]:
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator,BearerTokenAuthenticator,CloudPakForDataAuthenticator

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *

if use_cpd:
    authenticator = CloudPakForDataAuthenticator(
            url=WOS_CREDENTIALS['url'],
            username=WOS_CREDENTIALS['username'],
            apikey=WOS_CREDENTIALS['api_key'],
            disable_ssl_verification=True
        )
    
    client = APIClient(service_url=WOS_CREDENTIALS['url'],authenticator=authenticator)
    print(client.version)
else:
    authenticator = IAMAuthenticator(apikey=CLOUD_API_KEY)
    client = APIClient(authenticator=authenticator)
    print(client.version)

### Import common evaluation metrics and metric groups

These are the metrics used to evaluate your prompt against the selected model. The available metrics depend on the prompt task type, for example, Summarization.

In [None]:
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMTextMetricGroup
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMGenerationMetrics
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMSummarizationMetrics
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMQAMetrics
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMClassificationMetrics
from ibm_metrics_plugin.metrics.llm.utils.constants import HAP_SCORE
from ibm_metrics_plugin.metrics.llm.utils.constants import PII_DETECTION

### Get the necessary data for evaluating the prompt template metrics

Metrics will be evaluated for the input, output, and reference summary text.

In [None]:
df_input = llm_data[['input_text']].copy()
df_output = llm_data[['vertexai_gemini_pro_generated_summary']].copy()
df_reference = llm_data[['reference_summary_2']].copy()

### Evaluate custom metrics

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
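import spacy

# Load the pipeline once; reloading it on every call is slow.
# If the model is missing, install it with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_key_words(text):
    # Treat every noun in the text as a keyword
    doc = nlp(text)
    keywords = [token.text for token in doc if token.pos_ == 'NOUN']
    return keywords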

In [None]:
def compute_f1_score(reference_keywords, generated_keywords):
    # Compare unique keywords so that repeated nouns do not skew the denominators
    reference_keywords = set(reference_keywords)
    generated_keywords = set(generated_keywords)
    common_keywords = reference_keywords & generated_keywords

    precision = len(common_keywords) / len(generated_keywords) if len(generated_keywords) > 0 else 0
    recall = len(common_keywords) / len(reference_keywords) if len(reference_keywords) > 0 else 0
    f1_score = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return precision, recall, f1_score
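
As a worked example of the keyword precision, recall, and F1 arithmetic, the sketch below scores one hand-picked pair of keyword lists. The function name `keyword_prf` and the keyword lists are illustrative assumptions. Both lists are deduplicated first, so a repeated keyword counts once; with raw list lengths in the denominator, the precision here would be 2/3 rather than 1.0.

```python
def keyword_prf(reference_keywords, generated_keywords):
    """Return (precision, recall, f1) over the unique keywords of both lists."""
    reference, generated = set(reference_keywords), set(generated_keywords)
    common = reference & generated
    precision = len(common) / len(generated) if generated else 0.0
    recall = len(common) / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

reference = ["fish", "light", "species", "exploration"]
generated = ["fish", "light", "fish"]  # "fish" appears twice on purpose

precision, recall, f1 = keyword_prf(reference, generated)
# 2 shared keywords / 2 unique generated = 1.0; 2 shared / 4 reference = 0.5
print(precision, recall, round(f1, 4))
```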

In [None]:
def compute_averages_f1_score(precisions, recalls, f1_scores):
    import numpy as np
    # Average the per-row scores across the dataset
    precision = round(float(np.mean(precisions)), 4)
    recall = round(float(np.mean(recalls)), 4)
    f1_score = round(float(np.mean(f1_scores)), 4)

    keyword_inclusions = {
        "keyword_inclusions" : {
            "precision": {
                "metric_value": precision
            },
            "recall": {
                "metric_value": recall
            },
            "f1_score": {
                "metric_value": f1_score
            }
        }
    }
    return keyword_inclusions

In [None]:
def key_word_inclusions(df_input, df_output, df_reference):
    precisions = []
    recalls = []
    f1_scores = []
    
    for input_text, generated_summary in zip(df_input['input_text'], df_output['vertexai_gemini_pro_generated_summary']):
    
        input_text_keywords = extract_key_words(input_text)
        print('Input Text Keywords: '+ str(input_text_keywords))
    
        generated_summary_keywords = extract_key_words(generated_summary)
        print('Generated Summary Keywords: '+ str(generated_summary_keywords))
        
        precision, recall, f1_score = compute_f1_score(input_text_keywords, generated_summary_keywords)
        
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1_score)

        print('\n')
    
    keyword_inclusions = compute_averages_f1_score(precisions, recalls, f1_scores)
    return keyword_inclusions
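
The `key_word_inclusions` function suggests a contract for custom evaluators: accept the input, output, and reference frames, and return a nested dictionary ending in `metric_value` entries. Assuming that contract is all the toolkit needs (this notebook does not confirm it), a second evaluator could look like the sketch below. The name `summary_length_ratio` and the metric it computes are hypothetical.

```python
def summary_length_ratio(df_input, df_output, df_reference):
    """Mean ratio of summary word count to source word count (hypothetical metric)."""
    ratios = []
    pairs = zip(df_input["input_text"],
                df_output["vertexai_gemini_pro_generated_summary"])
    for source, summary in pairs:
        source_words = len(str(source).split())
        summary_words = len(str(summary).split())
        if source_words > 0:  # avoid dividing by zero on empty inputs
            ratios.append(summary_words / source_words)
    mean_ratio = round(sum(ratios) / len(ratios), 4) if ratios else 0.0
    # Same nesting as keyword_inclusions: name -> sub-metric -> metric_value
    return {"summary_length_ratio": {"mean": {"metric_value": mean_ratio}}}
```

The function only looks columns up by name, so it can be tried on plain dictionaries of lists before it is passed pandas DataFrames.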
    

### Configure metrics for evaluation

Select the metrics you want to evaluate; the code cell below contains 10 common Summarization metrics.

In [None]:
metric_config = {   
    "configuration": {
        LLMTextMetricGroup.SUMMARIZATION.value: {
            LLMSummarizationMetrics.ROUGE_SCORE.value: {},
            LLMSummarizationMetrics.SARI.value: {},
            LLMSummarizationMetrics.METEOR.value: {},
            LLMSummarizationMetrics.NORMALIZED_RECALL.value: {},
            LLMSummarizationMetrics.NORMALIZED_PRECISION.value: {},
            LLMSummarizationMetrics.NORMALIZED_F1_SCORE.value: {},
            LLMSummarizationMetrics.COSINE_SIMILARITY.value: {},
            LLMSummarizationMetrics.JACCARD_SIMILARITY.value: {},
            LLMSummarizationMetrics.BLEU.value: {},
            LLMSummarizationMetrics.FLESCH.value: {}
        }
    }
}

### Summarization metrics evaluation

Run the metrics against your prompt data.

In [None]:
import json
result = client.llm_metrics.compute_metrics(metric_config, 
                                            sources = df_input, 
                                            predictions = df_output, 
                                            references = df_reference, 
                                            custom_evaluators = [key_word_inclusions])

### Review metrics

Print and review the metrics returned by the IBM watsonx.governance metrics toolkit.

In [None]:
print(json.dumps(result,indent=2))

<a id="openpages"></a>
## Step 4 - Publish computed metrics to an OpenPages foundation model

### Construct a key/value dictionary of the metrics to be published to OpenPages

In [None]:
def get_metrics(result):
    metrics = {}
    metrics['rouge1'] = round(result['rouge_score']['rouge1']['metric_value'], 4)
    metrics['rouge2'] = round(result['rouge_score']['rouge2']['metric_value'], 4)
    metrics['rougeL'] = round(result['rouge_score']['rougeL']['metric_value'], 4)
    metrics['rougeLsum'] = round(result['rouge_score']['rougeLsum']['metric_value'], 4)
    metrics['meteor'] = round(result['meteor']['metric_value'], 4)
    metrics['sari'] = round(result['sari']['metric_value'], 4)
    metrics['cosine_similarity'] = round(result['cosine_similarity']['metric_value'], 4)
    metrics['keyword_inclusions_f1_score'] = round(result['keyword_inclusions']['f1_score']['metric_value'], 4)
    # metrics['jaccard_similarity'] = round(result['jaccard_similarity']['metric_value'], 4)
    return metrics

In [None]:
metrics =  get_metrics(result)
metrics
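
`get_metrics` picks specific keys by hand. As an alternative sketch, every `metric_value` in the result can be collected by walking the nested dictionary. The helper name `flatten_metric_values`, the dotted key names, and the sample numbers are assumptions for illustration; the sample only mirrors the nesting that `get_metrics` reads.

```python
def flatten_metric_values(result, prefix=""):
    """Collect every nested 'metric_value' into a flat {name: value} dict."""
    flat = {}
    for key, value in result.items():
        if not isinstance(value, dict):
            continue  # ignore anything that cannot hold a metric_value
        name = prefix + key
        if "metric_value" in value:
            flat[name] = round(value["metric_value"], 4)
        else:
            # Descend one level and join the names with a dot
            flat.update(flatten_metric_values(value, prefix=name + "."))
    return flat

# Illustrative values, not real evaluation output
sample_result = {
    "rouge_score": {
        "rouge1": {"metric_value": 0.41237},
        "rouge2": {"metric_value": 0.2},
    },
    "meteor": {"metric_value": 0.35},
}
print(flatten_metric_values(sample_result))
```

Nested metrics get dotted names such as `rouge_score.rouge1`, which differ from the short names `get_metrics` uses, so the two should not be mixed when publishing.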

In [None]:
import requests
import base64
import json

### Create an authorization token for OpenPages

In [None]:
def get_basic_auth_token(username, password):
    token = base64.b64encode(bytes('{0}:{1}'.format(username, password), 'utf-8')).decode("ascii")
    return token
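
To make the token format concrete, the sketch below encodes placeholder credentials and decodes them again. The function name `basic_auth_token` and the `user`/`pass` values are illustrative. Base64 is an encoding, not encryption, so the header is only as private as the connection that carries it.

```python
import base64

def basic_auth_token(username, password):
    """Base64-encode 'username:password' for an HTTP Basic Authorization header."""
    raw = "{0}:{1}".format(username, password).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")

token = basic_auth_token("user", "pass")
header_value = "Basic " + token
decoded = base64.b64decode(token).decode("utf-8")

print(header_value)  # Basic dXNlcjpwYXNz
print(decoded)       # user:pass
```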

### For a given model name, get the OpenPages model ID

In [None]:
def get_op_model_id(header, model_name):
    openpages_url = OP_URL.rstrip("/") + "/grc/api/query"
    # Prepare post payload
    get_id_payload = {
        "statement": "SELECT [Model].[Resource ID] FROM [Model] WHERE [Model].[Name] IN ('{0}')".format(model_name),
        "skipCount": 0
    }
    response = requests.post(openpages_url, json=get_id_payload, headers=header, verify=False).json()

    model_id = None
    rows = (response or {}).get("rows") or []
    if rows:
        # Only the first matching row is used
        field = (rows[0].get("fields") or {}).get("field") or []
        if field:
            model_id = field[0]["value"]

    if model_id is None:
        print("Model ID not found.")
    else:
        print("Model ID fetched: " + str(model_id))
    return model_id

### For a given model ID, get the OpenPages metric definitions map containing each metric's ID and description

In [None]:
def get_op_model_metrics_definitions(header, model_id):
    openpages_url = OP_URL.rstrip("/") + "/grc/api/query"    
    get_metrics_payload = {
        "statement": "SELECT [Metric].[Resource ID], [Metric].[Name], [Metric].[Description] FROM [Model] JOIN [Metric] ON PARENT([Model]) WHERE [Model].[Resource ID]='{0}'".format(model_id),
        "skipCount": 0
    }
    print("Sending request to fetch all metrics associated with the model.")
    response = requests.post(openpages_url, json=get_metrics_payload, headers=header, verify=False).json()

    metrics_map = []

    if response is not None:
        if response.get("rows") is not None:
            rows = response.get("rows")
            if len(rows) != 0:
                for i in range(len(rows)):
                    fields = rows[i].get("fields")
                    if fields is not None:
                        field = fields.get("field")
                        metric_id_desc = {}
                        metric_id = None
                        metric_desc = None
                        for row in field:
                            if row.get('name') == 'Resource ID':
                                metric_id = row.get('value')
                            if row.get('name') == 'Description':
                                metric_desc = row.get('value')
                        metric_id_desc['metric_desc'] = metric_desc
                        metric_id_desc['metric_id'] = metric_id
                        metrics_map.append(metric_id_desc)
    print("Completed fetching, if any, all metrics associated with the model.")
    return metrics_map
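
The row parsing can be exercised offline. The sketch below applies the same extraction to a hand-built response whose shape is inferred from the keys this notebook reads (`rows`, `fields`, `field`, `name`, `value`). The helper name `rows_to_metrics_map` and the sample IDs are assumptions, and the real OpenPages response may contain additional keys.

```python
def rows_to_metrics_map(response):
    """Turn a query response into a list of {'metric_id', 'metric_desc'} dicts."""
    metrics_map = []
    for row in (response or {}).get("rows") or []:
        fields = (row.get("fields") or {}).get("field") or []
        if not fields:
            continue  # nothing to extract from this row
        by_name = {f.get("name"): f.get("value") for f in fields}
        metrics_map.append({
            "metric_id": by_name.get("Resource ID"),
            "metric_desc": by_name.get("Description"),
        })
    return metrics_map

# Hand-built response, not real OpenPages output
sample_response = {
    "rows": [
        {"fields": {"field": [
            {"name": "Resource ID", "value": "101"},
            {"name": "Name", "value": "rouge1"},
            {"name": "Description", "value": "watsonx.governance metric for 'rouge1'"},
        ]}},
    ]
}
print(rows_to_metrics_map(sample_response))
```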

### Construct the Metrics Object Payload for metrics creation

In [None]:
def get_metric_object_payload(primaryParentId, metric_name):
    metric_description = "watsonx.governance metric for '" + metric_name + "'"
    metric_object_payload = {
        "name": metric_name,
        "description": metric_description,
        "typeDefinitionId": "Metric",
        "primaryParentId": primaryParentId,
        "fields": {
            "field": [
                {
                    "name": "MRG-Metric:Data Source",
                    "dataType": "STRING_TYPE",
                    "value": "watsonx.governance"
                },
                {
                    "name": "MRG-Metric:Frequency",
                    "dataType": "ENUM_TYPE",
                    "enumValue": {
                        "name": "Multiple times a day"
                    }
                },
                {
                    "name": "MRG-Metric-Shared:Breach Status",
                    "dataType": "ENUM_TYPE",
                    "enumValue": {
                        "name": "Green"
                    }
                },
                {
                    "name": "MRG-Metric-Shared:Direction Information",
                    "dataType": "ENUM_TYPE",
                    "enumValue": {
                        "name": "Increase means better performance"
                    }
                },
                {
                    "name": "MRG-Metric-Shared:Yellow Threshold",
                    "dataType": "FLOAT_TYPE",
                    "value": 0.6
                },
                {
                    "name": "MRG-Metric-Shared:Red Threshold",
                    "dataType": "FLOAT_TYPE",
                    "value": 0.5
                }
            ]
        }
    }
    return metric_object_payload

### Construct the Metrics Value Payload for creating and associating a metric value to a metric of a given model object

In [None]:
def get_metric_value_payload(primaryParentId, metric_name, metric_value):
    metric_description = "watsonx.governance metric for '" + metric_name + "'"
    metric_value_payload = {
        "typeDefinitionId": "MetricValue",
        "primaryParentId": primaryParentId,
        "description": metric_description,
        "fields": {
            "field": [
                {
                    "name": "MRG-Metric-Shared:Breach Status",
                    "dataType": "ENUM_TYPE",
                    "enumValue": {
                        "name": "Green"
                    }
                },
                {
                    "name": "MRG-Metric-Shared:Direction Information",
                    "dataType": "ENUM_TYPE",
                    "enumValue": {
                        "name": "Increase means better performance"
                    }
                },
                {
                    "name": "MRG-Metric-Shared:Yellow Threshold",
                    "dataType": "FLOAT_TYPE",
                    "value": 0.6
                },
                {
                    "name": "MRG-Metric-Shared:Red Threshold",
                    "dataType": "FLOAT_TYPE",
                    "value": 0.5
                },
                {
                    "name": "MRG-MetricVal:Value",
                    "dataType": "FLOAT_TYPE",
                    "value": metric_value
                }
            ]
        }
    }
    return metric_value_payload
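
Both payloads set the Breach Status to `Green` regardless of the metric value. As a hedged sketch, a status could instead be derived from the value and the two thresholds, given that an increase means better performance. The helper name `breach_status`, the strict-inequality boundaries, and the `Yellow` and `Red` status names are assumptions; this notebook only shows `Green`, so the names should be checked against the OpenPages field definition before use.

```python
def breach_status(value, yellow_threshold=0.6, red_threshold=0.5):
    """Map a higher-is-better metric value to a status name (assumed names)."""
    if value < red_threshold:
        return "Red"      # assumed enum name
    if value < yellow_threshold:
        return "Yellow"   # assumed enum name
    return "Green"

for value in (0.72, 0.55, 0.41):
    print(value, breach_status(value))
```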

### Create a Metrics Object

In [None]:
def create_metrics_object(metric_object_payload):
    openpages_metric_object_creation_url = OP_URL.rstrip("/") + "/grc/api/contents"
    response = requests.post(openpages_metric_object_creation_url, json=metric_object_payload, headers=header, verify=False).json()
    metric_id = response['id']
    return metric_id

### Add Metric Value to the Metric Object

In [None]:
def add_metric_value_to_metric_object(metric_value_payload):
    openpages_metric_value_creation_url = OP_URL.rstrip("/") + "/grc/api/contents"
    response = requests.post(openpages_metric_value_creation_url, json=metric_value_payload, headers=header, verify=False).json()
    metric_value_id = response['id']
    return metric_value_id

### Check for the metric's existence in the metrics map

In [None]:
def get_existing_metric_id(metrics_map, metric_name):
    # Match the quoted name so that 'rougeL' does not also match 'rougeLsum'
    quoted_name = "'" + metric_name + "'"
    for item in metrics_map:
        if quoted_name in (item.get('metric_desc') or ''):
            return item['metric_id']
    return None
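
Metric lookups by description are sensitive to how the name is matched: a plain substring test lets `rougeL` match a metric created for `rougeLsum`. The sketch below contrasts that with matching on the quoted name, as it appears in descriptions of the form `watsonx.governance metric for '<name>'`. The helper names and the sample IDs are illustrative assumptions.

```python
def find_by_substring(metrics_map, metric_name):
    """Naive lookup: any description containing the name is a match."""
    for item in metrics_map:
        if metric_name in (item.get("metric_desc") or ""):
            return item["metric_id"]
    return None

def find_by_quoted_name(metrics_map, metric_name):
    """Stricter lookup: the name must appear wrapped in single quotes."""
    quoted = "'" + metric_name + "'"
    for item in metrics_map:
        if quoted in (item.get("metric_desc") or ""):
            return item["metric_id"]
    return None

# Hand-built map, not real OpenPages output
sample_map = [
    {"metric_id": "101", "metric_desc": "watsonx.governance metric for 'rougeLsum'"},
    {"metric_id": "102", "metric_desc": "watsonx.governance metric for 'rougeL'"},
]
print(find_by_substring(sample_map, "rougeL"))    # 101, the rougeLsum metric
print(find_by_quoted_name(sample_map, "rougeL"))  # 102
```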

### Configure your OpenPages connection details

In [None]:
OP_URL = "<OpenPages Url>"
OP_USERNAME = "<OpenPages Username>"
OP_PASSWORD = "<OpenPages User Password>"
model_name = '<OpenPages Model>'

In [None]:
token = get_basic_auth_token(OP_USERNAME, OP_PASSWORD)

In [None]:
header = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": "Basic {0}".format(token)
    }

### Fetch the Model ID for a given OpenPages model name

In [None]:
model_id = get_op_model_id(header, model_name)
model_id

### Publish the metrics to OpenPages

In [None]:
# Fetch the existing OP Model Metrics, if any, for a given OP Model ID
metrics_map = get_op_model_metrics_definitions(header, model_id)
print(metrics_map)

print('\n')

# Iterate over the given metrics to be published..
for metric_name, metric_value in metrics.items():
    
    # check whether a metric with the given name exists and, if so, get its metric_id
    metric_id = get_existing_metric_id(metrics_map, metric_name)

    # if the metric does not exist, create it
    if metric_id is None:
        print(metric_name + ': Metric Object does not exist, hence creating it..')

        # construct the metric object to be published
        metric_object_payload = get_metric_object_payload(model_id, metric_name)

        # now, create the metric object
        metric_id = create_metrics_object(metric_object_payload)

    # Add the metric value to metric object

    # construct the metric value object to be published
    metric_value_payload = get_metric_value_payload(metric_id, metric_name, metric_value)

    # create the metric value - basically add the metric value to the metric object
    metric_value_id = add_metric_value_to_metric_object(metric_value_payload)
    
    print(str(metric_name) + ': Metric Object ID: ' + str(metric_id) + ', Metric Value Object ID: '+ str(metric_value_id) + '\n')

<a id="verify"></a>
## Step 5 - Navigate to the OpenPages UI to verify your metrics

To locate your metrics, first navigate to your model:
 1. From the **Menu** on the left, click on **Inventory**, and go to **Models**. 
 2. Choose your model from the list. 
 3. Go to the **Admin** tab, as can be seen in the following image:

![Model OpenPages.png](<attachment:Model OpenPages.png>)

To find your metrics, scroll down to **Associations**. They are under **Model Metrics**. 

To view the metrics in a separate tab, go to the icon in the top right corner and click **Launch Grid page**. You can see your metrics as shown in the following image:

![Metrics OpenPages.png](<attachment:Metrics OpenPages.png>)

<a id="summary"></a>

## Summary

Congratulations, you successfully completed this notebook! You learned how to evaluate output from a Text Summarization prompt run against a Google Vertex AI LLM and publish the computed metrics to IBM OpenPages.

### Author:

**Ravi Chamarthy**