# Use the IBM watsonx.governance metrics toolkit to evaluate AWS Bedrock

The IBM watsonx.governance metrics toolkit lets you evaluate the output of a Large Language Model (LLM) against multiple task types: Text Summarization, Content Generation, Question Answering, Text Classification, Entity Extraction, and Retrieval-Augmented Generation (RAG). 

This notebook demonstrates how to evaluate output from a Text Summarization prompt run against an Amazon Web Services (AWS) Bedrock LLM. It also demonstrates how to evaluate Content Generation, Question Answering, and Text Classification output by using prepared sample data.

## Learning goals

The learning goals of this notebook are:

-  Create your prompt for testing against the `anthropic.claude-v2` model.
-  Configure metrics for evaluation.
-  Run the metrics against your prompt data.
-  Print and review the metrics returned by the IBM watsonx.governance metrics toolkit. 

## Table of Contents

This notebook contains the following parts:

1. [Install the necessary packages](#packages)
2. [Provision services and configure credentials](#credentials)
3. [Evaluate Text Summarization output from the AWS Bedrock `anthropic.claude-v2` model](#summarization)
4. [Evaluate Content Generation output from the AWS Bedrock `anthropic.claude-v2` model](#contentgen)
5. [Evaluate Question Answering output from the AWS Bedrock `anthropic.claude-v2` model](#question)
6. [Evaluate Text Classification output from the AWS Bedrock `anthropic.claude-v2` model](#textclass)
7. [Summary](#summary)

<a id="packages"></a>
## Step 1 - Install the necessary packages

In [None]:
!pip install --upgrade ibm-watson-machine-learning   | tail -n 1
!pip install --upgrade ibm-watson-openscale --no-cache | tail -n 1
!pip install --upgrade ibm-metrics-plugin --no-cache | tail -n 1

In [None]:
!pip install --upgrade evaluate --no-cache | tail -n 1
!pip install --upgrade rouge_score --no-cache | tail -n 1
!pip install --upgrade textstat --no-cache | tail -n 1
!pip install --upgrade sacrebleu --no-cache | tail -n 1
!pip install --upgrade sacremoses --no-cache | tail -n 1
!pip install --upgrade datasets==2.10.0 --no-cache | tail -n 1

In [None]:
!pip install boto3 -U --no-cache | tail -n 1

In [None]:
import warnings
warnings.filterwarnings('ignore')
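
Because the install cells above pipe their output through `tail -n 1`, a failed installation can be easy to miss. The following minimal sketch uses only the Python standard library to report which of the packages named in the `pip` commands are installed. The helper names `installed_version` and `report_versions` are made up for this example.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
REQUIRED_PACKAGES = [
    "ibm-watson-machine-learning",
    "ibm-watson-openscale",
    "ibm-metrics-plugin",
    "evaluate",
    "rouge_score",
    "textstat",
    "sacrebleu",
    "sacremoses",
    "datasets",
    "boto3",
]


def installed_version(package_name):
    """Return the installed version string, or None if the package is not found."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def report_versions(package_names):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in package_names}


for name, version in report_versions(REQUIRED_PACKAGES).items():
    print(f"{name}: {version or 'not found'}")
```

If a package is reported as not found, rerun its install cell without `| tail -n 1` to see the full `pip` output.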

<a id="credentials"></a>
## Step 2 - Provision services and configure credentials

### Provision an instance of IBM Watson OpenScale

If you have not already done so, provision an instance of IBM Watson OpenScale using the [OpenScale link in the Cloud catalog](https://cloud.ibm.com/catalog/services/watson-openscale).

### Generate an API key

You can generate a Cloud API key with the IBM Cloud console or with the IBM Cloud command line interface.

To generate an API key by using the IBM Cloud console:

1. Go to the [**Users** section of the IBM Cloud console](https://cloud.ibm.com/iam#/users).
1. Click your name, then scroll down to the **API Keys** section.
1. Click **Create an IBM Cloud API key**.
1. Give your key a name and click **Create**.
1. Copy the created key - you will need to paste this key into the `CLOUD_API_KEY` variable in the "Configure your credentials" section below.

To create an API key using the IBM Cloud [command line interface](https://console.bluemix.net/docs/cli/reference/ibmcloud/download_cli):

1. From the command line interface, type the following:

    `ibmcloud login --sso`

    `ibmcloud iam api-key-create 'my_key'`

1. Copy the created key - you will need to paste this key into the `CLOUD_API_KEY` variable in the "Configure your credentials" section below.

### Configure your credentials

In [None]:
use_cpd = False
CLOUD_API_KEY = "<CLOUD_API_KEY>"
IAM_URL = "https://iam.ng.bluemix.net/oidc/token"
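
As an alternative to pasting the key directly into the notebook, you can read it from an environment variable so that it is not saved with the notebook. This is a minimal sketch: the variable name `IBM_CLOUD_API_KEY` and the helper `read_credential` are assumptions made for this example, not names the toolkit requires.

```python
import os


def read_credential(env_var, placeholder):
    """Return the value of env_var, or the placeholder if it is unset or empty."""
    value = os.environ.get(env_var, "").strip()
    return value if value else placeholder


# IBM_CLOUD_API_KEY is a hypothetical variable name chosen for this sketch.
api_key_from_env = read_credential("IBM_CLOUD_API_KEY", "<CLOUD_API_KEY>")

if api_key_from_env == "<CLOUD_API_KEY>":
    print("IBM_CLOUD_API_KEY is not set; paste your key into CLOUD_API_KEY instead.")
```

If the variable is set, you can assign `api_key_from_env` to `CLOUD_API_KEY` before you authenticate.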

If you are running your notebook on an IBM Cloud Pak for Data (CPD) cluster, uncomment and run the following code:

In [None]:
# use_cpd = True
# WOS_CREDENTIALS = {
#     "url": "xxxxx",
#     "username": "xxxxx",
#     "api_key": "xxxxx"
# }

# GEN_API_KEY = WOS_CREDENTIALS["api_key"]

# api_endpoint = WOS_CREDENTIALS["url"]
# project_id = "<Your project id>"
# endpoint_url = WOS_CREDENTIALS["url"]

### Authenticate with IBM watsonx.governance

In [None]:
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator,BearerTokenAuthenticator,CloudPakForDataAuthenticator
from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *

if use_cpd:
    # WOS_CREDENTIALS (defined above) stores an "api_key" rather than a "password"
    authenticator = CloudPakForDataAuthenticator(
        url=WOS_CREDENTIALS['url'],
        username=WOS_CREDENTIALS['username'],
        apikey=WOS_CREDENTIALS['api_key'],
        disable_ssl_verification=True
    )
    
    client = APIClient(service_url=WOS_CREDENTIALS['url'],authenticator=authenticator)
    print(client.version)
else:
    authenticator = IAMAuthenticator(apikey=CLOUD_API_KEY)
    client = APIClient(authenticator=authenticator)
    print(client.version)

### Import common evaluation metrics and metric groups

These are the metrics used to evaluate your prompt against the selected model, based on the prompt task type (Summarization, Classification, Question Answering, and so on).

In [None]:
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMTextMetricGroup
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMGenerationMetrics
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMSummarizationMetrics
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMQAMetrics
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMClassificationMetrics
from ibm_metrics_plugin.metrics.llm.utils.constants import HAP_SCORE
from ibm_metrics_plugin.metrics.llm.utils.constants import PII_DETECTION

<a id="summarization"></a>
## Step 3 - Evaluate Text Summarization output from the AWS Bedrock `anthropic.claude-v2` model

### Download a dataset containing prompt input data for model inferencing and reference data for model output evaluation

The downloaded `.csv` file contains the input text, a generated summary, and two reference summaries for each of 50 sample prompts. The values are later split into input, output, and reference pandas DataFrames.

In [None]:
!rm -fr llm_content.csv
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/llm_content.csv"

In [None]:
import pandas as pd
import numpy as np
llm_data_all = pd.read_csv("llm_content.csv")
llm_data_all.head()

In [None]:
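# Copy the first 10 rows so that a column can be added later without a pandas SettingWithCopyWarning
llm_data = llm_data_all.head(10).copy()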
llm_data.head()

In [None]:
import boto3, json

### Obtain your AWS security credentials

Copy or create your AWS [security credentials](https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html), and paste them in the code cell below.

In [None]:
aws_access_key_id = 'xxxxxx'
aws_secret_access_key = 'xxxxxx'

In [None]:
session = boto3.Session()

### Create an AWS Bedrock service client

Programmatically create a Bedrock service client.

In [None]:
bedrock = session.client(service_name='bedrock', 
                         aws_access_key_id = aws_access_key_id, 
                         aws_secret_access_key = aws_secret_access_key, 
                         region_name = 'us-east-1',
                         endpoint_url = 'https://bedrock.us-east-1.amazonaws.com')

### Select the `anthropic.claude-v2` model to use

In [None]:
#List the available foundation models in Bedrock

fm_model_list = bedrock.list_foundation_models()

fm_model_names = [x['modelId'] for x in fm_model_list['modelSummaries']]
print(*fm_model_names, sep = "\n")

In [None]:
#Specify the `anthropic.claude-v2` model

modelId = 'anthropic.claude-v2'
accept = 'application/json'
contentType = 'application/json'

### Create a `bedrock-runtime` client

The runtime client allows you to run your prompt against the `anthropic.claude-v2` model.

In [None]:
bedrock_runtime = session.client(service_name='bedrock-runtime', 
                         aws_access_key_id = aws_access_key_id, 
                         aws_secret_access_key = aws_secret_access_key, 
                         region_name = 'us-east-1',
                         endpoint_url = 'https://bedrock-runtime.us-east-1.amazonaws.com')

### Create your prompt for testing against the `anthropic.claude-v2` model

In [None]:
def get_prompt(text):
    # Claude text-completion prompts start with "\n\nHuman:" and end with "\n\nAssistant:"
    prompt = f"""

H: Please provide a summary of the following text with a maximum of 20 words.

{text}

A:"""
    return prompt

### Examine the generated prompt summary result

In [None]:
def prompt_evaluation(text):
    # Build the prompt and the inference parameters for the request body
    prompt = get_prompt(text)
    body = json.dumps({"prompt": prompt,
                       "max_tokens_to_sample": 2048,
                       "temperature": 0.1,
                       "top_k": 250,
                       "top_p": 0.5,
                       "stop_sequences": []
                       })
    # Invoke the model and parse the JSON response body
    response = bedrock_runtime.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    completion = response_body['completion']
    # If the completion contains a blank line, keep only the second block of text
    # (intended to skip an introductory line before the summary)
    summary = completion
    if '\n\n' in completion:
        summary = completion.split("\n\n")[1]
    print('-----')
    print(summary)
    print('-----')
    return summary
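
The request-building and response-parsing steps in `prompt_evaluation` can be checked without calling AWS by separating them into pure functions. The sketch below restates those two steps under the same parameter values used above. The function names `build_claude_body` and `extract_summary` are made up for this example, and the extra whitespace stripping is an addition that the original cell does not perform.

```python
import json


def build_claude_body(prompt, max_tokens=2048, temperature=0.1, top_k=250, top_p=0.5):
    """Serialize the inference parameters used in this notebook into a JSON request body."""
    return json.dumps({
        "prompt": prompt,
        "max_tokens_to_sample": max_tokens,
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
        "stop_sequences": [],
    })


def extract_summary(completion):
    """Keep the second blank-line-separated block if there is one, else the whole text.

    Surrounding whitespace is stripped (an addition in this sketch).
    """
    if "\n\n" in completion:
        return completion.split("\n\n")[1].strip()
    return completion.strip()


# Made-up completion text, used only to show the parsing behavior.
print(extract_summary(" Here is a summary:\n\nCats sleep for most of the day."))
```

Keeping these steps separate makes it easier to adjust the parsing heuristic if the model's completions do not follow the expected layout.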

### Append the generated prompt summary result to the model data set

In [None]:
llm_data['anthropic_generated_summary'] = llm_data['input_text'].apply(prompt_evaluation)

In [None]:
llm_data.head()

### Get the necessary data for evaluating the prompt template metrics

Metrics will be evaluated for the input, output, and reference summary text.

In [None]:
df_input = llm_data[['input_text']].copy()
df_output = llm_data[['anthropic_generated_summary']].copy()
df_reference = llm_data[['reference_summary_2']].copy()

### Configure metrics for evaluation

Select the metrics you want to evaluate; the code cell below contains 10 common Summarization metrics.

In [None]:
metric_config = {   
    "configuration": {
        LLMTextMetricGroup.SUMMARIZATION.value: {
            LLMSummarizationMetrics.ROUGE_SCORE.value: {},
            LLMSummarizationMetrics.SARI.value: {},
            LLMSummarizationMetrics.METEOR.value: {},
            LLMSummarizationMetrics.NORMALIZED_RECALL.value: {},
            LLMSummarizationMetrics.NORMALIZED_PRECISION.value: {},
            LLMSummarizationMetrics.NORMALIZED_F1_SCORE.value: {},
            LLMSummarizationMetrics.COSINE_SIMILARITY.value: {},
            LLMSummarizationMetrics.JACCARD_SIMILARITY.value: {},
            LLMSummarizationMetrics.BLEU.value: {},
            LLMSummarizationMetrics.FLESCH.value: {}
        }
    }
}
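
To build intuition for what a similarity metric in this list measures, here is a small sketch of Jaccard similarity over whitespace-separated, lowercased tokens. It is an illustration only: the tokenization is an assumption made for this example, and the toolkit's own implementation may tokenize or normalize text differently.

```python
def jaccard_similarity(text_a, text_b):
    """Return |A ∩ B| / |A ∪ B| for the sets of lowercased tokens in each text."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        # Two empty texts are treated as identical in this sketch.
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


score = jaccard_similarity("the cat sat on the mat", "the cat lay on the rug")
print(round(score, 4))
```

A score of 1.0 means that both texts use exactly the same set of words, and 0.0 means that they share none.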

### Summarization metrics evaluation

Run the metrics against your prompt data.

In [None]:
import json
result = client.llm_metrics.compute_metrics(metric_config, sources=df_input, predictions=df_output, references=df_reference)

### Review metrics

Print and review the metrics returned by the IBM watsonx.governance metrics toolkit.

In [None]:
print(json.dumps(result,indent=2))
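
The printed JSON can be long. If the result is a nested dictionary, a generic helper like the one below flattens it into dotted paths that are easier to scan or compare. This sketch does not assume any particular result layout. The `example_result` dictionary is a made-up placeholder, not real toolkit output.

```python
def flatten_metrics(nested, prefix=""):
    """Flatten a nested dict into {"outer.inner": value} pairs."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_metrics(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical shape and values, for illustration only.
example_result = {"group": {"metric_a": {"value": 0.5}, "metric_b": {"value": 0.25}}}

for path, value in flatten_metrics(example_result).items():
    print(f"{path}: {value}")
```

To try it on the real output, pass `result` in place of `example_result`.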

<a id="contentgen"></a>
## Step 4 - Evaluate Content Generation output from the AWS Bedrock `anthropic.claude-v2` model

### Download a dataset containing prompt input data for model inferencing and reference data for model output evaluation

The downloaded `.csv` file contains a question, generated answer text, and reference text for each of 50 sample prompts. These values are then split into input, output, and reference pandas DataFrames.

In [None]:
!rm -fr llm_content_generation.csv
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/llm_content_generation.csv"

In [None]:
data = pd.read_csv("llm_content_generation.csv")
data.head()

In [None]:
df_input = data[['question']].copy()
df_output = data[['generated_text']].copy()
df_reference = data[['reference_text']].copy()

### Configure metrics for evaluation

Select the metrics you want to evaluate; the code cell below contains 7 common Content Generation metrics.

In [None]:
metric_config = {   
    # All common parameters go here
    "configuration": {        
        LLMTextMetricGroup.GENERATION.value: { # metric group   
            LLMGenerationMetrics.BLEU.value: {},
            LLMGenerationMetrics.ROUGE_SCORE.value: {},
            LLMGenerationMetrics.FLESCH.value: {},
            LLMGenerationMetrics.METEOR.value: {},            
            LLMGenerationMetrics.NORMALIZED_RECALL.value: {},
            LLMGenerationMetrics.NORMALIZED_PRECISION.value: {},
            LLMGenerationMetrics.NORMALIZED_F1_SCORE.value: {}            
        }    
    }
}
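
Several metrics in this configuration compare word overlap between the generated text and the reference text. As an illustration, the sketch below computes a unigram-overlap (ROUGE-1 style) precision, recall, and F1 score on whitespace tokens. It is a simplified stand-in for explanation only: the `rouge_score` package installed earlier applies its own tokenization, and the toolkit's reported values may differ.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Return unigram-overlap precision, recall, and F1 for two texts."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    pred_total = sum(pred_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(rouge1("the cat sat", "the cat sat on the mat"))
```

High precision with low recall, as in this example, suggests that the generated text is accurate but leaves out part of the reference.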

### Content Generation metrics evaluation

Run the metrics against your prompt data.

In [None]:
result = client.llm_metrics.compute_metrics(metric_config,df_input,df_output, df_reference)

### Review metrics

Print and review the metrics returned by the IBM watsonx.governance metrics toolkit.

In [None]:
print(json.dumps(result,indent=2))

<a id="question"></a>
## Step 5 - Evaluate Question Answering output from the AWS Bedrock `anthropic.claude-v2` model

### Download a dataset containing prompt input data for model inferencing and reference data for model output evaluation

The downloaded `.csv` file contains question-and-answer pairs for 50 sample prompts. Values in the `question` column are the input, and values in the `answers` column serve as both the prompt output and the reference.

In [None]:
!rm -fr llm_content_qa.csv
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/llm_content_qa.csv"

In [None]:
data = pd.read_csv("llm_content_qa.csv")
data.head()

In [None]:
df_input = data[['question']].copy()
df_output = data[['answers']].copy()
df_reference = data[['answers']].copy()

### Configure metrics for evaluation

Select the metrics you want to evaluate; the code cell below contains 3 common Question Answering metrics.

In [None]:
metric_config = {   
    # All common parameters go here
    "configuration": {        
        LLMTextMetricGroup.QA.value: { # metric group   
            LLMQAMetrics.EXACT_MATCH.value: {},
            LLMQAMetrics.ROUGE_SCORE.value: {},
            LLMQAMetrics.BLEU.value: {}          
        }    
    }
}
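
Exact match is the strictest of these metrics: an answer counts only if it equals the reference. The sketch below shows one common way to compute it, after lowercasing, removing punctuation, and collapsing whitespace. The normalization steps are assumptions made for this example; the toolkit may normalize differently or not at all.

```python
import string


def normalize_answer(text):
    """Lowercase, remove ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match_rate(predictions, references):
    """Return the fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    matches = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return matches / len(predictions)


# Made-up answers, for illustration only.
print(exact_match_rate(["Paris.", "blue whale", "42"], ["paris", "Blue  Whale", "41"]))
```

Because this step uses the same column for output and reference, expect a perfect exact-match score here. The metric becomes informative once the output comes from a real model call.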

### Question Answering metrics evaluation

Run the metrics against your prompt data.

In [None]:
result = client.llm_metrics.compute_metrics(metric_config,df_input,df_output, df_reference)

### Review metrics

Print and review the metrics returned by the IBM watsonx.governance metrics toolkit.

In [None]:
print(json.dumps(result,indent=2))

<a id="textclass"></a>
## Step 6 - Evaluate Text Classification output from the AWS Bedrock `anthropic.claude-v2` model

### Download a dataset containing prompt input data for model inferencing and reference data for model output evaluation


The downloaded `.csv` file contains label-and-text pairs for 50 sample prompts. Values in the `text` column are the input, and values in the `label` column act as both output and reference.

In [None]:
!rm -fr llm_content_classification.csv
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/llm_content_classification.csv"

In [None]:
data = pd.read_csv("llm_content_classification.csv")
data.head()

In [None]:
data['label'] = data['label'].replace({'ham': 0, 'spam': 1})

In [None]:
df_input = data[['text']].copy()
df_output = data[['label']].copy()
df_reference = data[['label']].copy()

### Create a reference column

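Because the `label` column serves as both output and reference, every metric would report a perfect score. Shuffling the reference labels makes the output and reference differ, which provides a more realistic classification example.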

In [None]:
shuffled_column = df_reference['label'].sample(frac=1).reset_index(drop=True)
df_reference['label'] = shuffled_column

### Configure metrics for evaluation

Select the metrics you want to evaluate; the code cell below contains 5 common Text Classification metrics.

In [None]:
metric_config = {   
    #All Common parameters go here 
    "configuration": {        
        LLMTextMetricGroup.CLASSIFICATION.value: { # metric group   
            LLMClassificationMetrics.ACCURACY.value: {},
            LLMClassificationMetrics.PRECISION.value: {},
            LLMClassificationMetrics.RECALL.value: {},
            LLMClassificationMetrics.F1_SCORE.value: {},
            LLMClassificationMetrics.MATTHEWS_CORRELATION.value: {},            
        }    
    }
}
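
For reference, the five metrics configured above can all be derived from the four cells of a binary confusion matrix. The following sketch computes them by hand for labels encoded as 0 and 1, treating 1 as the positive class. It is an illustration of the standard definitions, not the toolkit's implementation, and the convention of returning 0.0 when a denominator is zero is an assumption of this example.

```python
import math


def binary_classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1, and Matthews correlation for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def safe_div(numerator, denominator):
        # Return 0.0 instead of raising when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1_score": safe_div(2 * precision * recall, precision + recall),
        "matthews_correlation": safe_div(
            tp * tn - fp * fn,
            math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        ),
    }


# Made-up labels, for illustration only.
print(binary_classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))
```

With shuffled reference labels, as in this step, the Matthews correlation is typically close to zero, because the shuffled references are unrelated to the output labels.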

### Text Classification metrics evaluation

Run the metrics against your prompt data.

In [None]:
result = client.llm_metrics.compute_metrics(metric_config,df_input,df_output, df_reference)

### Review metrics

Print and review the metrics returned by the IBM watsonx.governance metrics toolkit.

In [None]:
print(json.dumps(result,indent=2))

<a id="summary"></a>
## Summary

Congratulations, you successfully completed this notebook! You learned how to evaluate output from Text Summarization, Content Generation, Question Answering, and Text Classification prompts run against an Amazon Web Services (AWS) Bedrock LLM. 

### Authors:

**Kishore Patel**

**Ravi Chamarthy**