# Evaluating Azure OpenAI prompts with watsonx.governance

This notebook is part of the [watsonx.governance Level 4 Proof of Experience (PoX) hands-on lab](https://cp4d-outcomes.techzone.ibm.com/l4-pox/watsonx-governance). It will query an Azure OpenAI GPT-35 Turbo Deployment, and evaluate the output using the watsonx.governance (OpenScale) LLM SDK. Finally, it will push the evaluation metrics to the model use case in the watsonx governance console (OpenPages).

This notebook should be run in a Cloud Pak for Data 4.8.5 or higher software environment. It requires credentials for the Cloud Pak for Data install, which must be entered in the first code cell.

You may also include credentials for an Azure OpenAI GPT-35 Turbo deployment. However, if you do not have access to valid credentials, leave the Azure credential variables blank. The notebook will instead use output previously generated from an Azure OpenAI deployment to perform the evaluation. In this case, while it will not query an actual live Azure OpenAI service, it **is** evaluating output from Azure OpenAI and can be used to show how those models can be evaluated and governed in watsonx.governance.

Instructions for locating your credentials are contained in the relevant portions of the hands-on lab. The code in this notebook is based on the [GitHub sample code for Azure OpenAI monitoring](https://github.com/IBM/watson-openscale-samples/blob/main/IBM%20Cloud/WML/notebooks/watsonx/LLM%20Metrics%20Evals-Azure-OpenAI-OpenPages.ipynb) by [Ravi Chamarthy](mailto:ravi.chamarthy@in.ibm.com). If you receive errors caused by OpenPages API incompatibilities, you should be able to resolve them by updating this notebook with code from that sample.

In [49]:
CPD_URL = "https://cpd-cpd.apps.__________.cloud.techzone.ibm.com"
CPD_USERNAME = "admin"
CPD_PASSWORD = "_____________"
MODEL_TITLE = "_______"
API_KEY = "________________"

AZURE_BASE_URL = ""
AZURE_ENGINE = ""
AZURE_API_KEY = ""

Once the keys have been entered in the cell above, you may run through the remainder of the notebook. It has been heavily commented to show what is occurring at each stage.

### Install the necessary libraries

**YOU MAY GET PIP DEPENDENCY RESOLVER ERRORS**. These can be safely ignored.

In [2]:
!pip install --upgrade ibm-watson-machine-learning   | tail -n 1
!pip install --upgrade ibm-watson-openscale --no-cache | tail -n 1
!pip install --upgrade ibm-metrics-plugin --no-cache | tail -n 1
!pip install --upgrade evaluate --no-cache | tail -n 1
!pip install --upgrade textstat --no-cache | tail -n 1
!pip install --upgrade sacrebleu --no-cache | tail -n 1
!pip install --upgrade sacremoses --no-cache | tail -n 1
!pip install --upgrade datasets==2.10.0 --no-cache | tail -n 1
!pip install openai==0.28

Successfully installed ibm-watson-machine-learning-1.0.357
Successfully installed ibm-watson-openscale-3.0.37


### Read the test data into a dataframe

In [3]:
import pandas as pd
import numpy as np
llm_data_all = pd.read_csv("https://raw.githubusercontent.com/CloudPak-Outcomes/Outcomes-Projects/main/watsonx-governance-l4/data/resume_summarization_test_data.csv")
llm_data_all.head()

Unnamed: 0,Resume,Extraction,Summarization,Resume_without_profile
0,Nerissa G. McCloud-Pearcy\n(205) 123-4567\nnmc...,"{""Location"": ""Birmingham, AL "", ""Gender"": ""Fem...",A results-driven Sales Manager with 14+ years ...,Nerissa G. McCloud-Pearcy\n(205) 123-4567\nnmc...
1,Sarah Tomlinson\n(123) 456-7891\ns.tomlinson@e...,"{""Location"": ""Oakbrook, IL "", ""Gender"": ""Femal...",Sarah Tomlinson is an innovative Sales Manager...,Sarah Tomlinson\n(123) 456-7891\ns.tomlinson@e...
2,Aliya Jackson\n(123) 456-7890\naliyajackson@ex...,"{""Location"": ""Detroit, MI "", ""Gender"": ""Female...",An OSHA-certified Construction Worker with 8+ ...,Aliya Jackson\n(123) 456-7890\naliyajackson@ex...
3,Anthony Gentile\n(123) 456-7890\nanthonygentil...,"{""Location"": ""Nashville, TN "", ""Gender"": ""Male...",A Construction Worker with two years of experi...,Anthony Gentile\n(123) 456-7890\nanthonygentil...
4,Raheem Richardson\n(123) 456-7890\nraheemricha...,"{""Location"": ""Philadelphia, PA "", ""Gender"": ""M...",A Construction Manager with 10+ years of exper...,Raheem Richardson\n(123) 456-7890\nraheemricha...


### Import OpenAI libraries

In [4]:
import os
import openai

### Set the Azure OpenAI deployment details

In [5]:
openai.api_type = "azure"
openai.api_base = AZURE_BASE_URL
openai.api_key = AZURE_API_KEY
openai.api_version = "2024-02-01"

### Define the prompt

In [6]:
def get_prompt(text):
    prompt = f"""You will be given a resume. Please summarize the resume in 100 words or less.
    
{text}
    
Summary:"""
    return prompt
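
The prompt asks for a summary of 100 words or less, but nothing in the notebook checks that the model respects the limit. As a minimal sketch, a whitespace-based word count could be used for a quick check. The helper name `within_word_limit` is hypothetical and not part of the original notebook, and splitting on whitespace is only one possible definition of a "word".

```python
def within_word_limit(summary, limit=100):
    """Return True if the whitespace-delimited word count is at most `limit`."""
    return len(summary.split()) <= limit


# Short illustrative text (8 words)
print(within_word_limit("A Construction Worker with two years of experience."))  # prints True
```

Once the summaries have been generated, the same helper could be applied to the generated-summary column with `apply` to flag over-long rows.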

### Define the prompt evaluation

Note that the `temperature`, `max_tokens`, and other configuration variables can be changed as needed.

In [7]:
def get_completion(prompt_text):
    response = openai.Completion.create(
        engine = AZURE_ENGINE,
        prompt=get_prompt(prompt_text),
        temperature=0.1, 
        max_tokens=200,
        top_p=0.5,
        frequency_penalty=0,
        presence_penalty=0,
        stop='\n'
    )
    return response.choices[0].text
    #return response

### Run the prompt evaluation

The next cell tries to run the prompt evaluation using the supplied Azure credentials. If the credentials are blank or the call fails for any other reason, the cell falls back to loading the pre-generated responses.

In [8]:
try:
    llm_data_all['gpt_35_turbo_generated_summary'] = llm_data_all['Resume_without_profile'].apply(get_completion)
except Exception:  # any failure (blank credentials, network, auth) triggers the fallback
    print("Unable to access Azure OpenAI service, using pre-generated responses")
    llm_data_all = pd.read_csv('https://raw.githubusercontent.com/CloudPak-Outcomes/Outcomes-Projects/main/watsonx-governance-l4/data/openai_resume_summary_output.csv')

Unable to access Azure OpenAI service, using pre-generated responses
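
Relying on the exception handler means the fallback is reached only after a request has been attempted and has failed. As a minimal sketch, the blank-credential case could be detected up front instead. The helper name `azure_credentials_present` is hypothetical and not part of the original notebook.

```python
def azure_credentials_present(base_url, engine, api_key):
    """Return True only if all three Azure settings are non-blank strings."""
    return all(
        isinstance(value, str) and value.strip()
        for value in (base_url, engine, api_key)
    )


# Blank values, as in the first code cell when no Azure access is available
print(azure_credentials_present("", "", ""))  # prints False
```

With such a check, the notebook could skip the live call entirely when the Azure variables are blank and load the pre-generated CSV directly.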


### Show the output from the evaluation

In [9]:
llm_data_all.head()

Unnamed: 0,Resume,Extraction,Summarization,Resume_without_profile,gpt_35_turbo_generated_summary
0,Nerissa G. McCloud-Pearcy\n(205) 123-4567\nnmc...,"{""Location"": ""Birmingham, AL "", ""Gender"": ""Fem...",A results-driven Sales Manager with 14+ years ...,Nerissa G. McCloud-Pearcy\n(205) 123-4567\nnmc...,Nerissa McCloud-Pearcy is a Hotel Sales Manag...
1,Sarah Tomlinson\n(123) 456-7891\ns.tomlinson@e...,"{""Location"": ""Oakbrook, IL "", ""Gender"": ""Femal...",Sarah Tomlinson is an innovative Sales Manager...,Sarah Tomlinson\n(123) 456-7891\ns.tomlinson@e...,Sarah Tomlinson is a Sales Manager with exper...
2,Aliya Jackson\n(123) 456-7890\naliyajackson@ex...,"{""Location"": ""Detroit, MI "", ""Gender"": ""Female...",An OSHA-certified Construction Worker with 8+ ...,Aliya Jackson\n(123) 456-7890\naliyajackson@ex...,Aliya Jackson is a construction worker with e...
3,Anthony Gentile\n(123) 456-7890\nanthonygentil...,"{""Location"": ""Nashville, TN "", ""Gender"": ""Male...",A Construction Worker with two years of experi...,Anthony Gentile\n(123) 456-7890\nanthonygentil...,Anthony Gentile is a construction worker with...
4,Raheem Richardson\n(123) 456-7890\nraheemricha...,"{""Location"": ""Philadelphia, PA "", ""Gender"": ""M...",A Construction Manager with 10+ years of exper...,Raheem Richardson\n(123) 456-7890\nraheemricha...,Raheem Richardson is a certified construction...


### Sample generated output

In [10]:
llm_data_all['gpt_35_turbo_generated_summary'][0]

' Nerissa McCloud-Pearcy is a Hotel Sales Manager with experience in developing and implementing new account-based sales and marketing strategies resulting in over $2M in new business over three years. She has experience in managing teams of 30+ hotel representatives, delivering training on guest relations and upselling techniques, and improving sales revenue for add-on services by 25%. She has also worked as an Assistant Branch Manager for Hertz Car Rental, where she managed day-to-day branch functions, trained and developed a team of 40+ representatives, and developed referral networks with local hotels and airports resulting in a 10% increase in B2B sales. Nerissa has an Associate of Arts in Liberal Arts from Faulkner University and key skills in strategic planning, sales management, marketing strategy, client relations, and team management.<|im_end|>'
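
The sample above starts with a leading space and ends with the `<|im_end|>` marker. The notebook passes the column to the evaluation as-is. If you prefer to strip such artifacts first, a minimal sketch follows. The helper name `clean_summary` is hypothetical, and the input string is a shortened illustration rather than real model output.

```python
END_MARKER = "<|im_end|>"


def clean_summary(text, marker=END_MARKER):
    """Cut the text at the first end-of-message marker and trim whitespace."""
    head, _, _ = text.partition(marker)
    return head.strip()


raw = " Nerissa McCloud-Pearcy is a Hotel Sales Manager.<|im_end|>"
print(clean_summary(raw))  # prints: Nerissa McCloud-Pearcy is a Hotel Sales Manager.
```

Whether to clean the text before computing metrics is a judgment call: cleaning changes the token counts that overlap-based metrics depend on, so scores computed before and after cleaning are not directly comparable.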

# Evaluate Metrics

The next section of the notebook will evaluate the output.

### IBM watsonx.governance authentication

In [11]:
from ibm_cloud_sdk_core.authenticators import CloudPakForDataAuthenticator

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *

authenticator = CloudPakForDataAuthenticator(
    url=CPD_URL,
    username=CPD_USERNAME,
    password=CPD_PASSWORD,
    disable_ssl_verification=True
)
    
client = APIClient(service_url=CPD_URL,authenticator=authenticator)
print(client.version)

3.0.36


### Common Imports

In [12]:
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMTextMetricGroup
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMGenerationMetrics
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMSummarizationMetrics
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMQAMetrics
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMClassificationMetrics
from ibm_metrics_plugin.metrics.llm.utils.constants import HAP_SCORE
from ibm_metrics_plugin.metrics.llm.utils.constants import PII_DETECTION

### Split the input, output, and reference data into different dataframes

In [13]:
df_input = llm_data_all[['Resume']].copy()
df_output = llm_data_all[['gpt_35_turbo_generated_summary']].copy()
df_reference = llm_data_all[['Summarization']].copy()

### Configure the metrics for evaluation

In [14]:
metric_config = {   
    "configuration": {
        LLMTextMetricGroup.SUMMARIZATION.value: {
            LLMSummarizationMetrics.ROUGE_SCORE.value: {},
            LLMSummarizationMetrics.SARI.value: {},
            LLMSummarizationMetrics.METEOR.value: {},
            LLMSummarizationMetrics.NORMALIZED_RECALL.value: {},
            LLMSummarizationMetrics.NORMALIZED_PRECISION.value: {},
            LLMSummarizationMetrics.NORMALIZED_F1_SCORE.value: {},
            LLMSummarizationMetrics.COSINE_SIMILARITY.value: {},
            LLMSummarizationMetrics.JACCARD_SIMILARITY.value: {},
            LLMSummarizationMetrics.BLEU.value: {},
            LLMSummarizationMetrics.FLESCH.value: {}
        }
    }
}

### Compute the metrics

In [15]:
import json
result = client.llm_metrics.compute_metrics(metric_config,df_input,df_output, df_reference)


### Evaluated Metrics

In [16]:
print(json.dumps(result,indent=2))

{
  "flesch": {
    "flesch_reading_ease": {
      "metric_value": 32.862,
      "mean": 32.862,
      "min": 10.06,
      "max": 48.2,
      "std": 10.865377858132684
    },
    "flesch_kincaid_grade": {
      "metric_value": 13.789999999999997,
      "mean": 13.789999999999997,
      "min": 10.2,
      "max": 20.7,
      "std": 3.1443441287492693
    }
  },
  "bleu": {
    "precisions": [
      0.21406003159557663,
      0.07404458598726114,
      0.04333868378812199,
      0.02750809061488673
    ],
    "brevity_penalty": 1.0,
    "length_ratio": 2.463035019455253,
    "translation_length": 1266,
    "reference_length": 514,
    "metric_value": 0.06593124314731579
  },
  "cosine_similarity": {
    "metric_value": 0.40435785356067794,
    "mean": 0.40435785356067794,
    "min": 0.2300233295073774,
    "max": 0.6312782491619542,
    "std": 0.11464594963578321
  },
  "jaccard_similarity": {
    "metric_value": 0.17263671292540633,
    "mean": 0.17263671292540633,
    "min": 0.083333333

### Construct a key/value dict of the metrics to be published to OpenPages

In [17]:
def get_metrics(result):
    metrics = {}
    metrics['rouge1'] = round(result['rouge_score']['rouge1']['metric_value'], 4)
    metrics['rouge2'] = round(result['rouge_score']['rouge2']['metric_value'], 4)
    metrics['rougeL'] = round(result['rouge_score']['rougeL']['metric_value'], 4)
    metrics['rougeLsum'] = round(result['rouge_score']['rougeLsum']['metric_value'], 4)
    metrics['meteor'] = round(result['meteor']['metric_value'], 4)
    metrics['sari'] = round(result['sari']['metric_value'], 4)
    metrics['cosine_similarity'] = round(result['cosine_similarity']['metric_value'], 4)
    metrics['jaccard_similarity'] = round(result['jaccard_similarity']['metric_value'], 4)
    return metrics
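
`get_metrics` raises a `KeyError` if any of the expected metrics is absent from `result`, for example when a metric is removed from `metric_config`. As a sketch of a more tolerant variant, the function below skips missing entries. The names `METRIC_PATHS` and `get_metrics_safe` are hypothetical, and the sample dictionary reuses only the two `metric_value` entries visible in the output above.

```python
# Key path to each metric's node inside the result dictionary
METRIC_PATHS = {
    "rouge1": ("rouge_score", "rouge1"),
    "rouge2": ("rouge_score", "rouge2"),
    "rougeL": ("rouge_score", "rougeL"),
    "rougeLsum": ("rouge_score", "rougeLsum"),
    "meteor": ("meteor",),
    "sari": ("sari",),
    "cosine_similarity": ("cosine_similarity",),
    "jaccard_similarity": ("jaccard_similarity",),
}


def get_metrics_safe(result, paths=METRIC_PATHS, ndigits=4):
    """Collect rounded metric values, skipping metrics that are not present."""
    metrics = {}
    for name, path in paths.items():
        node = result
        # Walk down the nested dictionaries; stop producing values if a key is missing
        for key in path + ("metric_value",):
            node = node.get(key) if isinstance(node, dict) else None
        if isinstance(node, (int, float)):
            metrics[name] = round(node, ndigits)
    return metrics


sample_result = {
    "cosine_similarity": {"metric_value": 0.40435785356067794},
    "jaccard_similarity": {"metric_value": 0.17263671292540633},
}
print(get_metrics_safe(sample_result))
# prints {'cosine_similarity': 0.4044, 'jaccard_similarity': 0.1726}
```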

In [18]:
metrics =  get_metrics(result)
metrics

{'rouge1': 0.3289,
 'rouge2': 0.1281,
 'rougeL': 0.2309,
 'rougeLsum': 0.3098,
 'meteor': 0.3491,
 'sari': 32.9924,
 'cosine_similarity': 0.4044,
 'jaccard_similarity': 0.1726}
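
To give a feel for what one of these scores measures, the sketch below computes a token-set Jaccard similarity, defined as the size of the intersection of two token sets divided by the size of their union. This is an illustration only: the tokenization and normalization used by the metrics plugin are not shown in this notebook and may differ, so the sketch is not expected to reproduce the values above.

```python
def jaccard_similarity(text_a, text_b):
    """Token-set Jaccard similarity using lowercase whitespace tokens."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # two empty texts: define the similarity as 0
    return len(tokens_a & tokens_b) / len(union)


# Shared tokens: {"a", "manager"}; union has 4 tokens
print(jaccard_similarity("a sales manager", "a construction manager"))  # prints 0.5
```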

# Publishing computed metrics to watsonx governance console

This section of the notebook publishes the metrics to a model that has been defined in the watsonx governance console.

### Import libraries for the REST API

In [19]:
import requests
import base64
import json
import http.client
import ssl

### Define functions to get authorization token for OpenPages

In [41]:
def get_basic_auth_token(username, password):
    token = base64.b64encode(bytes('{0}:{1}'.format(username, password), 'utf-8')).decode("ascii")
    return token

def get_jwt_auth_token(username, apikey):
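    # str.lstrip removes a set of characters, not a prefix, so split on the scheme separator instead
    OP_HOST = CPD_URL.split("://", 1)[-1].rstrip("/")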
    conn = http.client.HTTPSConnection(
        OP_HOST,
        context=ssl._create_unverified_context()
    )
    payloadstr = {
        "username": username,
        "api_key": apikey
    }

    payload = json.dumps(payloadstr)

    headers = {
        'content-type': "application/json",
        'cache-control': "no-cache",
    }

    conn.request("POST", "/icp4d-api/v1/authorize", payload, headers)
    res = conn.getresponse()
    data = res.read()
    checkstat = res.status
    
    if checkstat == 200:
        print("Login Success!")

    elif checkstat == 401:
        print("UNAUTHORIZED!")

    else:
        print("UNKNOWN ERROR")
    
    token = json.loads(data)['token']
    return token

def get_token(username, password = None, apikey = None):
    return get_jwt_auth_token(username, apikey)
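
`http.client.HTTPSConnection` expects a host name rather than a full URL. Note that `str.lstrip` takes a set of characters to remove, not a prefix, so it can eat the first letters of a host name. A minimal sketch of a safer extraction uses `urllib.parse.urlsplit` from the standard library. The helper name `host_from_url` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit


def host_from_url(url):
    """Return the network location (host, plus port if present) of a URL."""
    return urlsplit(url).netloc


url = "https://server.example.com"  # hypothetical host that starts with "s"
print(url.lstrip("https://"))  # prints erver.example.com (leading "s" stripped too)
print(host_from_url(url))      # prints server.example.com
```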

### Define a function to get the ID of the model from the title

In [52]:
def get_op_model_id(header, model_name):
    openpages_url = CPD_URL.rstrip("/") + "/openpages-openpagesinstance-cr-grc/api/query?q=SELECT [Model].[Resource ID] FROM [Model] WHERE [Model].[Name] IN ('{0}')".format(model_name)
    print(openpages_url)
    response = requests.get(openpages_url, headers=header, verify=False).json()
    
    model_id = None
    if response is not None:
        if response.get("rows") is not None:
            rows = response.get("rows")
            if len(rows) != 0:
                fields = rows[0].get("fields")
                if fields is not None:
                    field = fields.get("field")
                    if len(field) != 0:
                        model_id = field[0]["value"]

    if model_id is None:
        print("Model ID not found.")
    else:
        print("Model ID fetched: " + model_id)
    return model_id
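
The nested checks in `get_op_model_id` can be flattened. The sketch below shows one way to do so. The helper name `first_field_value` is hypothetical, and the sample response is not real OpenPages output: its shape is inferred solely from the keys that the function above reads (`rows`, `fields`, `field`, `value`).

```python
def first_field_value(response):
    """Return the value of the first field of the first row, or None if absent."""
    rows = (response or {}).get("rows") or []
    if not rows:
        return None
    fields = (rows[0].get("fields") or {}).get("field") or []
    return fields[0].get("value") if fields else None


# Assumed response shape, based only on how the notebook parses the query result
sample_response = {
    "rows": [
        {"fields": {"field": [{"name": "Resource ID", "value": "2898"}]}}
    ]
}
print(first_field_value(sample_response))  # prints 2898
print(first_field_value({"rows": []}))     # prints None
```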

### Get the OpenPages metric definitions for a model ID (a list pairing each metric ID with its name)

In [23]:
def get_op_model_metrics_definitions(header, model_id):
    openpages_url = CPD_URL.rstrip("/") + "/openpages-openpagesinstance-cr-grc/api/query?q=SELECT [Metric].[Resource ID], [Metric].[Name] FROM [Model] JOIN [Metric] ON PARENT([Model]) WHERE [Model].[Resource ID]='{0}'".format(model_id)
    response = requests.get(openpages_url, headers=header, verify=False).json()
    
    metrics_map = []

    if response is not None:
        if response.get("rows") is not None:
            rows = response.get("rows")
            if len(rows) != 0:
                for i in range(len(rows)):
                    fields = rows[i].get("fields")
                    if fields is not None:
                        field = fields.get("field")
                        metric_id_name = {}
                        metric_id = None
                        metric_name = None
                        for row in field:
                            if row.get('name') == 'Resource ID':
                                metric_id = row.get('value')
                            if row.get('name') == 'Name':
                                metric_name = row.get('value')
                        metric_id_name['metric_name'] = metric_name
                        metric_id_name['metric_id'] = metric_id
                        metrics_map.append(metric_id_name)
    print("Completed fetching, if any, all metrics associated with the model.")
    return metrics_map

### Construct the Metrics Object Payload for metrics creation

In [24]:
def get_metric_object_payload(primaryParentId, metric_name):
    metric_description = "watsonx.governance metric for '" + metric_name + "'"
    metric_object_payload = {
        "name": metric_name,
        "description": metric_description,
        "typeDefinitionId": "Metric",
        "primaryParentId": primaryParentId,
        "fields": {
            "field": [
                {
                    "name": "MRG-Metric:Data Source",
                    "dataType": "STRING_TYPE",
                    "value": "watsonx.governance"
                },
                {
                    "name": "MRG-Metric:Frequency",
                    "dataType": "ENUM_TYPE",
                    "enumValue": {
                        "name": "Multiple times a day"
                    }
                }
            ]
        }
    }
    return metric_object_payload

### Construct the Metrics Value Payload for creating and associating a metric value to a metric of a given model object

In [25]:
def get_metric_value_payload(primaryParentId, metric_name, metric_value):
    metric_description = "watsonx.governance metric for '" + metric_name + "'"
    metric_value_payload = {
        "typeDefinitionId": "MetricValue",
        "primaryParentId": primaryParentId,
        "description": metric_description,
        "fields": {
            "field": [
                {
                    "name": "MRG-Metric-Shared:Breach Status",
                    "dataType": "ENUM_TYPE",
                    "enumValue": {
                        "name": "Green"
                    }
                },
                {
                    "name": "MRG-Metric-Shared:Red Threshold",
                    "dataType": "FLOAT_TYPE",
                    "value": 0.5
                },
                {
                    "name": "MRG-MetricVal:Value",
                    "dataType": "FLOAT_TYPE",
                    "value": metric_value
                }
            ]
        }
    }
    return metric_value_payload

### Create Metrics Object

In [27]:
def create_metrics_object(metric_object_payload):
    openpages_metric_object_creation_url = CPD_URL.rstrip("/") + "/openpages-openpagesinstance-cr-grc/api/contents"
    response = requests.post(openpages_metric_object_creation_url, json=metric_object_payload, headers=header, verify=False).json()
    metric_id = response['id']
    return metric_id

### Add Metric Value to the Metric Object

In [28]:
def add_metric_value_to_metric_object(metric_value_payload):
    openpages_metric_value_creation_url = CPD_URL.rstrip("/") + "/openpages-openpagesinstance-cr-grc/api/contents"
    response = requests.post(openpages_metric_value_creation_url, json=metric_value_payload, headers=header, verify=False).json()
    metric_value_id = response['id']
    return metric_value_id

### Check for the metric existence in the metrics map

In [55]:
def get_existing_metric_id(metrics_map, metric_name):
    for item in metrics_map:
        if 'metric_name' in item and item['metric_name'] == metric_name:
            return item['metric_id']
    return None
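
`get_existing_metric_id` scans the list once per lookup, which is fine for a handful of metrics. As an alternative sketch, the list could be converted to a dictionary once and then queried by name. The helper name `index_metrics_by_name` is hypothetical, and the sample entries reuse IDs of the kind printed later in this notebook purely for illustration.

```python
def index_metrics_by_name(metrics_map):
    """Build a {metric_name: metric_id} lookup from the list of metric entries."""
    return {
        item["metric_name"]: item["metric_id"]
        for item in metrics_map
        if item.get("metric_name") is not None
    }


# Illustrative entries in the same shape that get_op_model_metrics_definitions builds
sample_metrics_map = [
    {"metric_name": "rouge1", "metric_id": "2973"},
    {"metric_name": "rouge2", "metric_id": "2978"},
]
by_name = index_metrics_by_name(sample_metrics_map)
print(by_name.get("rouge1"))  # prints 2973
print(by_name.get("bleu"))    # prints None
```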

### Create an OpenPages connection

In [42]:
token = get_token(CPD_USERNAME, apikey=API_KEY)
header = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": "Bearer {0}".format(token)
}

Login Success!


### Fetch the Model Id for a given OP Model Name

In [51]:
model_id = get_op_model_id(header, MODEL_TITLE)
model_id

https://cpd-cpd.apps.6633b6113313fb001ef5a23b.cloud.techzone.ibm.com/openpages-openpagesinstance-cr-grc/api/query?q=SELECT [Model].[Resource ID] FROM [Model] WHERE [Model].[Name] IN ('mt5')
Model ID fetched: 2898


'2898'

In [53]:
metrics

{'rouge1': 0.3289,
 'rouge2': 0.1281,
 'rougeL': 0.2309,
 'rougeLsum': 0.3098,
 'meteor': 0.3491,
 'sari': 32.9924,
 'cosine_similarity': 0.4044,
 'jaccard_similarity': 0.1726}

### Publish the metrics to the watsonx governance console

In [56]:
### Fetch the existing, if any, OP Model Metrics for a given OP Model ID
metrics_map = get_op_model_metrics_definitions(header, model_id)

print('\n')

# Iterate over the given metrics to be published..
for metric_name, metric_value in metrics.items():
    
    # check whether a metric with the given name exists and, if so, get its metric_id
    metric_id = get_existing_metric_id(metrics_map, metric_name)

    # if the metric does not exist, then create it
    if metric_id is None:
        print(metric_name + ': Metric Object does not exist, creating it..')

        # construct the metric object to be published
        metric_object_payload = get_metric_object_payload(model_id, metric_name)

        # now, create the metric object
        metric_id = create_metrics_object(metric_object_payload)

    # Add the metric value to metric object

    # construct the metric value object to be published
    metric_value_payload = get_metric_value_payload(metric_id, metric_name, metric_value)

    # create the metric value - basically add the metric value to the metric object
    metric_value_id = add_metric_value_to_metric_object(metric_value_payload)
    
    print(str(metric_name) + ': Metric Object ID: ' + str(metric_id) + ', Metric Value Object ID: '+ str(metric_value_id) + '\n')

Completed fetching, if any, all metrics associated with the model.


rouge1: Metric Object does not exists, hence creating it..
rouge1: Metric Object ID: 2973, Metric Value Object ID: 2977

rouge2: Metric Object does not exists, hence creating it..
rouge2: Metric Object ID: 2978, Metric Value Object ID: 2979

rougeL: Metric Object does not exists, hence creating it..
rougeL: Metric Object ID: 2980, Metric Value Object ID: 2981

rougeLsum: Metric Object does not exists, hence creating it..
rougeLsum: Metric Object ID: 2982, Metric Value Object ID: 2983

meteor: Metric Object does not exists, hence creating it..
meteor: Metric Object ID: 2984, Metric Value Object ID: 2985

sari: Metric Object does not exists, hence creating it..
sari: Metric Object ID: 2986, Metric Value Object ID: 2987

cosine_similarity: Metric Object does not exists, hence creating it..
cosine_similarity: Metric Object ID: 2988, Metric Value Object ID: 2989

jaccard_similarity: Metric Object does not exists, hence cre