# Using IBM watsonx.governance metrics toolkit to evaluate the quality of your Prompt Template

In [None]:
!pip install --upgrade ibm-watson-openscale --no-cache | tail -n 1

In [None]:
!pip install --upgrade 'ibm-metrics-plugin[generative-ai-quality]~=3.0.9'

In [None]:
import spacy
spacy.cli.download("en_core_web_sm")
!python -m nltk.downloader punkt

import nltk
nltk.download("wordnet")

In [None]:
!pip install spellchecker
!pip install pyspellchecker

In [5]:
import warnings
warnings.filterwarnings('ignore')

## Provision services and configure credentials

If you have not already, provision an instance of IBM Watson OpenScale using the [OpenScale link in the Cloud catalog](https://cloud.ibm.com/catalog/services/watson-openscale).

Your Cloud API key can be generated by going to the [**Users** section of the Cloud console](https://cloud.ibm.com/iam#/users). From that page, click your name, scroll down to the **API Keys** section, and click **Create an IBM Cloud API key**. Give your key a name and click **Create**, then copy the created key and paste it below.

**NOTE:** You can also create the OpenScale `API_KEY` with the IBM Cloud CLI.

To install the IBM Cloud CLI (formerly the Bluemix CLI), follow these [instructions](https://console.bluemix.net/docs/cli/reference/ibmcloud/download_cli.html#install_use).

To create an API key from the command line:
```
ibmcloud login --sso
ibmcloud iam api-key-create 'my_key'
```
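Rather than pasting the key into the notebook, it can be read from the environment. The following is a minimal sketch; the environment variable name `IBM_CLOUD_API_KEY` and the helper `load_api_key` are assumptions made for illustration and are not part of the toolkit.

```python
import os
from getpass import getpass


def load_api_key(env_var="IBM_CLOUD_API_KEY", prompt=getpass):
    """Return the API key from the environment, or ask for it interactively.

    `env_var` is a hypothetical variable name chosen for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    # Fall back to a hidden prompt so the key never appears in the notebook.
    return prompt(f"Enter value for {env_var}: ").strip()


# Usage (commented out so the sketch does not block on input):
# CLOUD_API_KEY = load_api_key()
```

The returned value can then be assigned to `CLOUD_API_KEY` in the next cell.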

In [6]:
use_cpd = False
CLOUD_API_KEY = "***"
IAM_URL="https://iam.cloud.ibm.com"

Uncomment and run the cell below only if you are running this notebook on a Cloud Pak for Data (CPD) cluster.

In [7]:
# use_cpd = True
# WOS_CREDENTIALS = {
#     "url": "xxxxx",
#     "username": "xxxxx",
#     "password": "xxxxx",
#     "apikey": "xxxxx"
# }

## IBM watsonx.governance authentication

In [None]:
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator, CloudPakForDataAuthenticator

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *

if use_cpd:
    authenticator = CloudPakForDataAuthenticator(
            url=WOS_CREDENTIALS['url'],
            username=WOS_CREDENTIALS['username'],
            apikey=WOS_CREDENTIALS['apikey'],
            disable_ssl_verification=True,
        )
    
    client = APIClient(service_url=WOS_CREDENTIALS['url'],authenticator=authenticator)
    print(client.version)
else:
    authenticator = IAMAuthenticator(apikey=CLOUD_API_KEY)
    client = APIClient(authenticator=authenticator)
    print(client.version)

# Common Imports

In [9]:
from ibm_metrics_plugin.metrics.llm.config.entities import LLMMetricType
from ibm_metrics_plugin.metrics.llm.config.entities import LLMTaskType
from ibm_metrics_plugin.metrics.llm.utils.constants import ContentValidationMetrics
from ibm_metrics_plugin.metrics.llm.utils.constants import ContentValidationMetricsParameters
import pandas as pd

# Evaluating Summarization output from AWS/anthropic.claude-v2

## Test data containing the summarization output from the model and the reference data

In [None]:
!rm -fr llm_content.csv
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/llm_content.csv"

In [None]:
llm_data_all = pd.read_csv("llm_content.csv")
llm_data_all.head()

In [12]:
llm_data = llm_data_all.head(10)
llm_data.head()

Unnamed: 0,input_text,generated_summary,reference_summary_1,reference_summary_2
0,Scientists have discovered a new species of de...,New bioluminescent fish species found in deep ...,Discovery of deep-sea fish emitting soothing l...,Scientists find new bioluminescent fish specie...
1,An international team of astronomers has ident...,Distant exoplanet's water vapor-filled atmosp...,Astronomers identify exoplanet with water vapo...,Discovery of exoplanet with water vapor in its...
2,Researchers have developed a novel nanotechnol...,New nanotechnology-based cancer treatment demo...,Researchers create cancer treatment using nano...,Innovative cancer treatment utilizing nanotech...
3,A new app is aiming to reduce food waste by co...,App connects local restaurants with customers ...,New sustainability-focused app facilitates sal...,Initiative to reduce food waste involves app c...
4,Archaeologists have uncovered an ancient city ...,"Ancient city dating back over 4,000 years disc...",Archaeological find in Iraq reveals ancient ci...,"Discovery of 4,000-year-old ancient city in Ku..."


In [13]:
df_input = llm_data[['input_text']].copy()
df_output = llm_data[['generated_summary']].copy()
df_reference = llm_data[['reference_summary_1']].copy()

## Metrics configuration for evaluation

### Metrics configuration for evaluating content validation metrics with submetrics under `SUMMARIZATION` task type.

Content validation metrics apply string-based checks to each prediction.

They can be evaluated for the `SUMMARIZATION`, `QA`, `EXTRACTION`, `RAG`, and `GENERATION` task types.

Submetrics may or may not take parameters. When no submetric is specified for content validation, only the submetrics that take no parameters are computed.



In [14]:
metric_config = {   
    "configuration": {
        LLMTaskType.SUMMARIZATION.value: {
            LLMMetricType.ROUGE_SCORE.value: {},
            LLMMetricType.SARI.value: {},
            LLMMetricType.METEOR.value: {},
            LLMMetricType.NORMALIZED_RECALL.value: {},
            LLMMetricType.NORMALIZED_PRECISION.value: {},
            LLMMetricType.NORMALIZED_F1_SCORE.value: {},
            LLMMetricType.COSINE_SIMILARITY.value: {},
            LLMMetricType.JACCARD_SIMILARITY.value: {},
            LLMMetricType.BLEU.value: {},
            LLMMetricType.FLESCH.value: {},
            LLMMetricType.CONTENT_ANALYSIS.value: {},
            LLMMetricType.KEYWORDS_INCLUSION.value: {},
            LLMMetricType.CONTENT_VALIDATION.value: {},    # No submetrics given, so only the parameterless ones are computed (Contains Email, Is Email, Is Json, Contains Json, Contains Link, No Invalid Links, Contains Valid Link)
            LLMMetricType.HAP_SCORE.value: {},
            LLMMetricType.PII_DETECTION.value: {}
        }    
    }
}
        

## Summarization Metrics Evaluation

By default, the HAP and PII metrics are computed asynchronously on the server (the watsonx.governance instance). Details of the submitted computation tasks, and their responses, are returned in the respective metric responses. The `get_metrics_result` method shown in the following cells retrieves the results from the server.

To compute the HAP and PII metrics synchronously, pass `background_mode=False` to the `compute_metrics` method.
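Instead of re-running the result cell by hand, the asynchronous retrieval can be wrapped in a small polling loop. The sketch below is generic: `poll_until`, `fetch`, and `is_done` are hypothetical names, and because the shape of an unfinished response is not shown in this notebook, the completion check is left to the caller.

```python
import time


def poll_until(fetch, is_done, interval_seconds=5.0, timeout_seconds=300.0,
               sleep=time.sleep):
    """Call fetch() until is_done(response) is true or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = fetch()
        if is_done(response):
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError("metric computation did not finish in time")
        sleep(interval_seconds)


# Usage sketch (assumption: `my_check` is a predicate you write after
# inspecting what an in-progress response looks like in your environment):
# results = poll_until(
#     lambda: client.llm_metrics.get_metrics_result(metric_config, result),
#     is_done=my_check,
# )
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting.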

In [None]:
import json
result = client.llm_metrics.compute_metrics(configuration=metric_config,sources=df_input,predictions=df_output,references=df_reference)

## Evaluated Metrics

#### Re-run the following cell until all computation tasks are finished, and results are returned 

In [16]:
results = client.llm_metrics.get_metrics_result(metric_config,result)
print(json.dumps(results,indent=2))

{
  "hap_score": {
    "total_records": 10,
    "max": 0.02312442846596241,
    "mean": 0.0041,
    "metric_value": 0.0041,
    "min": 0.0004542336391750723
  },
  "pii": {
    "total_records": 10,
    "max": 0,
    "mean": 0.0,
    "metric_value": 0.0,
    "min": 0
  },
  "content_analysis": {
    "coverage": {
      "metric_value": 0.3392,
      "mean": 0.3392,
      "min": 0.1892,
      "max": 0.5143,
      "std": 0.0858
    },
    "density": {
      "metric_value": 0.1002,
      "mean": 0.1002,
      "min": 0.0391,
      "max": 0.1684,
      "std": 0.0389
    },
    "compression": {
      "metric_value": 2.4916,
      "mean": 2.4916,
      "min": 1.7895,
      "max": 3.8571,
      "std": 0.7053
    },
    "abstractness": {
      "metric_value": 0.4396,
      "mean": 0.4396,
      "min": 0.3077,
      "max": 0.5833,
      "std": 0.1037
    },
    "repetitiveness": {
      "metric_value": 0.0059,
      "mean": 0.0059,
      "min": 0.0,
      "max": 0.0588,
      "std": 0.0176
    }
 

### Computing Content Validation metrics when the submetrics are specified



| Submetric           | Description                                                                                   | Possible values |
|---------------------|-----------------------------------------------------------------------------------------------|-----------------|
| Length Less Than    | Checks if the length of each row in the prediction is less than a specified maximum value     | Numeric         |
| Length Greater Than | Checks if the length of each row in the prediction is greater than a specified minimum value  | Numeric         |
| Contains Email      | Checks if each row in the prediction contains an email address                                | N/A             |
| Is Email            | Checks if each row in the prediction is a valid email address                                 | N/A             |
| Contains Json       | Checks if each row in the prediction contains JSON                                            | N/A             |
| Is Json             | Checks if each row in the prediction is valid JSON                                            | N/A             |
| Contains Link       | Checks if each row in the prediction contains a link                                          | N/A             |
| No Invalid Links    | Checks if each row in the prediction has no invalid links                                     | N/A             |
| Contains Valid Link | Checks if each row in the prediction contains a valid link                                    | N/A             |
| Starts With         | Checks if each row in the prediction starts with the specified substring                      | String          |
| Ends With           | Checks if each row in the prediction ends with the specified substring                        | String          |
| Equals To           | Checks if each row in the prediction is equal to the specified string                         | String          |
| Contains All        | Checks if each row in the prediction contains all of the provided keywords                    | List            |
| Contains None       | Checks if each row in the prediction contains none of the provided keywords                   | List            |
| Contains Any        | Checks if each row in the prediction contains any of the provided keywords                    | List            |
| Regex               | Checks if each row in the prediction matches the provided regex                               | String          |
| Contains String     | Checks if each row in the prediction contains the provided string                             | String          |
| Fuzzy Match         | Checks if the prediction fuzzy matches the keyword                                            | String          |

Note:
- `CONTAINS_VALID_LINK` and `NO_INVALID_LINKS` cannot be computed accurately in an air-gapped environment.
- `FUZZY_MATCH` computes the similarity between the prediction and the reference text.
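To make the row-wise nature of these checks concrete, here is a plain-Python sketch of a few of them. It is an illustration only, not the toolkit's implementation: it assumes case-sensitive matching and that each score is the fraction of rows passing the check, and the sample strings are made up for the example.

```python
def fraction_passing(rows, check):
    """Return the share of rows for which check(row) is true."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for row in rows if check(row)) / len(rows)


def contains_any(keywords):
    return lambda text: any(k in text for k in keywords)


def contains_none(keywords):
    return lambda text: not any(k in text for k in keywords)


def starts_with(prefix):
    return lambda text: text.startswith(prefix)


# Made-up predictions, used only to exercise the helpers.
summaries = [
    "New nanotechnology-based cancer treatment shows promise",
    "Distant exoplanet has water vapor in its atmosphere",
    "App connects local restaurants with customers",
    "Ancient city discovered",
]

any_score = fraction_passing(summaries, contains_any(["Distant", "treatment"]))
none_score = fraction_passing(summaries, contains_none(["Distant", "New"]))
start_score = fraction_passing(summaries, starts_with("New nanotechnology"))
print(any_score, none_score, start_score)
```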

In [18]:
metric_config = {    
    "configuration": {        
        LLMTaskType.SUMMARIZATION.value: {
             LLMMetricType.CONTENT_VALIDATION.value: {
                ContentValidationMetrics.CONTAINS_ANY.value: {
                    ContentValidationMetricsParameters.KEYWORDS.value: ['Distant', 'treatment'], #sub metric with parameter
                    ContentValidationMetricsParameters.CASE_SENSITIVE.value: True},
                ContentValidationMetrics.CONTAINS_ALL.value: {
                     ContentValidationMetricsParameters.KEYWORDS.value: ['Scientists','create', 'highly','efficient','solar', 'panels' ,'for', 'low-light', 'environments'],
                     ContentValidationMetricsParameters.CASE_SENSITIVE.value: True}, #sub metric with parameter
                ContentValidationMetrics.CONTAINS_STRING.value: {
                    ContentValidationMetricsParameters.SUBSTRING.value: "Engineers create lightweight exoskeleton to aid mobility-impaired individuals in walking and daily tasks", #sub metric with parameter
                    ContentValidationMetricsParameters.CASE_SENSITIVE.value: True},
                ContentValidationMetrics.REGEX.value: {
                    ContentValidationMetricsParameters.PATTERN.value: "IP|configuration", #sub metric with parameter change
                    ContentValidationMetricsParameters.CASE_SENSITIVE.value: True},
                ContentValidationMetrics.CONTAINS_EMAIL.value: {},
                ContentValidationMetrics.CONTAINS_JSON.value: {},
                ContentValidationMetrics.CONTAINS_LINK.value: {},
                ContentValidationMetrics.CONTAINS_NONE.value: {
                    ContentValidationMetricsParameters.KEYWORDS.value: ["Distant", "New"],#sub metric with parameter
                    ContentValidationMetricsParameters.CASE_SENSITIVE.value: True},
                ContentValidationMetrics.CONTAINS_VALID_LINK.value: {},
                ContentValidationMetrics.ENDS_WITH.value: {
                    ContentValidationMetricsParameters.SUBSTRING.value: "marine science by the end",#sub metric with parameter
                    ContentValidationMetricsParameters.CASE_SENSITIVE.value: True},
                ContentValidationMetrics.EQUALS_TO.value: {
                    ContentValidationMetricsParameters.TEXT.value: "AI revolution",#sub metric with parameter
                    ContentValidationMetricsParameters.CASE_SENSITIVE.value: True},
                ContentValidationMetrics.IS_EMAIL.value: {},
                ContentValidationMetrics.IS_JSON.value: {},
                ContentValidationMetrics.LENGTH_GREATER_THAN.value: {
                    ContentValidationMetricsParameters.LENGTH.value: 10},#sub metric with parameter
                ContentValidationMetrics.LENGTH_LESS_THAN.value: {
                    ContentValidationMetricsParameters.LENGTH.value: 5},#sub metric with parameter
                ContentValidationMetrics.NO_INVALID_LINKS.value: {},
                ContentValidationMetrics.STARTS_WITH.value: {
                    ContentValidationMetricsParameters.SUBSTRING.value: "New nanotechnology",#sub metric with parameter
                    ContentValidationMetricsParameters.CASE_SENSITIVE.value: True},
                ContentValidationMetrics.FUZZY_MATCH.value: {
                    ContentValidationMetricsParameters.SIMILARITY_RATIO.value: 50,
                    ContentValidationMetricsParameters.TEXT.value: "cancer treatment demo"  #sub metric with parameter
                }
            }      
                      
        }    
    }
}

In [19]:
result = client.llm_metrics.compute_metrics(configuration=metric_config, predictions=df_output)

### Evaluated Metrics

In [20]:
import json
print(json.dumps(result,indent=2))

{
  "content_validation": {
    "contains_any": {
      "metric_value": 0.2
    },
    "contains_all": {
      "metric_value": 0.1
    },
    "contains_string": {
      "metric_value": 0.1
    },
    "regex": {
      "metric_value": 0.0
    },
    "contains_email": {
      "metric_value": 0.0
    },
    "contains_json": {
      "metric_value": 0.0
    },
    "contains_link": {
      "metric_value": 0.0
    },
    "contains_none": {
      "metric_value": 0.5
    },
    "contains_valid_link": {
      "metric_value": 0.0
    },
    "ends_with": {
      "metric_value": 0.0
    },
    "equals_to": {
      "metric_value": 0.0
    },
    "is_email": {
      "metric_value": 0.0
    },
    "is_json": {
      "metric_value": 0.0
    },
    "length_greater_than": {
      "metric_value": 1.0
    },
    "length_less_than": {
      "metric_value": 0.0
    },
    "no_invalid_links": {
      "metric_value": 1.0
    },
    "starts_with": {
      "metric_value": 0.1
    },
    "fuzzy_match": {
      "me

# Evaluating Content Generation output from the Foundation Model

## Test data containing the content generation output from the model and the reference data

In [None]:
!rm -fr llm_content_generation.csv
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/llm_content_generation.csv"

In [22]:
data = pd.read_csv("llm_content_generation.csv")
data.head()

Unnamed: 0,question,generated_text,reference_text
0,What are the benefits of regular exercise?,"Regular exercise has numerous benefits, includ...","Regular exercise has numerous benefits, includ..."
1,What is the process of photosynthesis?,Photosynthesis is the process by which plants ...,Photosynthesis is the process by which plants ...
2,What are the key features of a smartphone?,A smartphone is a mobile device that typically...,A smartphone is a mobile device that typically...
3,How does the immune system work?,The immune system is a complex network of cell...,The immune system is a complex network of cell...
4,What is the capital of France?,"The capital of France is Paris, which is known...","The capital of France is Paris, which is known..."


In [23]:
df_input = data[['question']].copy()
df_output = data[['generated_text']].copy()
df_reference = data[['reference_text']].copy()

## Metrics configuration for evaluation

In [24]:
metric_config = {   
    # All common parameters go here
    "configuration": {        
        LLMTaskType.GENERATION.value: { # metric group   
            LLMMetricType.BLEU.value: {},
            LLMMetricType.ROUGE_SCORE.value: {},
            LLMMetricType.FLESCH.value: {},
            LLMMetricType.METEOR.value: {},            
            LLMMetricType.NORMALIZED_RECALL.value: {},
            LLMMetricType.NORMALIZED_PRECISION.value: {},
            LLMMetricType.NORMALIZED_F1_SCORE.value: {},
            LLMMetricType.HAP_SCORE.value: {},
            LLMMetricType.PII_DETECTION.value: {},
        }    
    }
}
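BLEU is one of the metrics configured above. As a cross-check on how its reported fields relate to each other, the sketch below recombines the `precisions`, `translation_length`, and `reference_length` values that appear in this notebook's generation results, using the standard BLEU definition (brevity penalty times the geometric mean of the n-gram precisions). It assumes uniform weights over the four n-gram orders; the helper names are illustrative and not part of the toolkit.

```python
import math

# Values copied from the "bleu" entry of the generation results in this notebook.
precisions = [1.0, 0.9949174078780177, 0.9947643979057592, 0.9946018893387314]
translation_length = 810
reference_length = 1083


def brevity_penalty(candidate_len, reference_len):
    """Penalize candidates that are shorter than the reference."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)


def bleu_from_parts(precisions, candidate_len, reference_len):
    """Combine n-gram precisions and lengths with uniform weights."""
    if min(precisions) <= 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty(candidate_len, reference_len) * geo_mean


bp = brevity_penalty(translation_length, reference_length)
bleu = bleu_from_parts(precisions, translation_length, reference_length)
print(round(bp, 4), round(bleu, 4))
```

With near-perfect precisions, the score is dominated by the brevity penalty: the generated text is about 75% as long as the reference.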

## Content Generation Metrics Evaluation

By default, the HAP and PII metrics are computed asynchronously on the server (the watsonx.governance instance). Details of the submitted computation tasks, and their responses, are returned in the respective metric responses. The `get_metrics_result` method shown in the following cells retrieves the results from the server.

To compute the HAP and PII metrics synchronously, pass `background_mode=False` to the `compute_metrics` method.
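A synchronous call could look like the sketch below. The keyword names `configuration`, `sources`, `predictions`, and `references` are the ones used earlier in this notebook, and `background_mode` is the parameter named above; the wrapper function `compute_metrics_sync` is a hypothetical convenience, not a toolkit API.

```python
def compute_metrics_sync(client, metric_config, sources, predictions, references):
    """Run compute_metrics with background_mode=False so HAP and PII
    are computed before the call returns."""
    return client.llm_metrics.compute_metrics(
        configuration=metric_config,
        sources=sources,
        predictions=predictions,
        references=references,
        background_mode=False,
    )


# Usage sketch with the objects defined in this notebook:
# result = compute_metrics_sync(client, metric_config, df_input, df_output, df_reference)
```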

In [25]:
result = client.llm_metrics.compute_metrics(metric_config,df_input,df_output, df_reference)

## Evaluated Metrics

#### Re-run the following cell until all computation tasks are finished, and results are returned 

In [26]:
results = client.llm_metrics.get_metrics_result(metric_config,result)
print(json.dumps(results,indent=2))

{
  "hap_score": {
    "total_records": 23,
    "max": 0.013588289730250835,
    "mean": 0.0016,
    "metric_value": 0.0016,
    "min": 0.0002777853514999151
  },
  "pii": {
    "total_records": 23,
    "max": 0,
    "mean": 0.0,
    "metric_value": 0.0,
    "min": 0
  },
  "flesch": {
    "flesch_reading_ease": {
      "metric_value": 39.10217391304347,
      "mean": 39.10217391304347,
      "min": -11.44,
      "max": 69.62,
      "std": 20.153544505710833
    },
    "flesch_kincaid_grade": {
      "metric_value": 12.673913043478263,
      "mean": 12.673913043478263,
      "min": 8.0,
      "max": 18.6,
      "std": 3.2043743730833554
    }
  },
  "bleu": {
    "precisions": [
      1.0,
      0.9949174078780177,
      0.9947643979057592,
      0.9946018893387314
    ],
    "brevity_penalty": 0.7138823993242189,
    "length_ratio": 0.7479224376731302,
    "translation_length": 810,
    "reference_length": 1083,
    "metric_value": 0.711075655695426,
    "total_records": 23
  },
  "me

# Evaluating Question Answering output from the Foundation Model

## Test data containing the question answering output from the model and the reference data

In [None]:
!rm -fr llm_content_qa.csv
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/llm_content_qa.csv"

In [28]:
data = pd.read_csv("llm_content_qa.csv")
data.head()

Unnamed: 0,question,answers
0,who did chris carter play for last year,Milwaukee Brewers
1,what is the latest version of safari on mac,Safari 11
2,when did bucharest become the capital of romania,1862
3,who did jeffrey dean morgan play on supernatural,John Eric Winchester
4,who is the shortest man that ever lived,Chandra Bahadur Dangi


In [29]:
df_input = data[['question']].copy()
df_output = data[['answers']].copy()
df_reference = data[['answers']].copy()

## Metrics configuration for evaluation

In [30]:
metric_config = {   
    # All common parameters go here
    "configuration": {        
        LLMTaskType.QA.value: { # metric group   
            LLMMetricType.EXACT_MATCH.value: {},
            LLMMetricType.ROUGE_SCORE.value: {},
            LLMMetricType.BLEU.value: {},
            LLMMetricType.HAP_SCORE.value: {},
            LLMMetricType.PII_DETECTION.value: {},
            LLMMetricType.UNSUCCESSFUL_REQUESTS.value: {},
            LLMMetricType.KEYWORDS_INCLUSION.value: {},
            LLMMetricType.QUESTION_ROBUSTNESS.value: {"metrics": ["spelling_robustness"],
                                                         "excluded_keywords": ["chris","bucharest"]}
            
        }    
    }
}

## Question Answering Metrics Evaluation

By default, the HAP and PII metrics are computed asynchronously on the server (the watsonx.governance instance). Details of the submitted computation tasks, and their responses, are returned in the respective metric responses. The `get_metrics_result` method shown in the following cells retrieves the results from the server.

To compute the HAP and PII metrics synchronously, pass `background_mode=False` to the `compute_metrics` method.

In [31]:
result = client.llm_metrics.compute_metrics(metric_config,df_input,df_output, df_reference)

## Evaluated Metrics

#### Re-run the following cell until all computation tasks are finished, and results are returned 

In [32]:
results = client.llm_metrics.get_metrics_result(metric_config,result)
print(json.dumps(results,indent=2))

{
  "hap_score": {
    "total_records": 50,
    "max": 0.801957368850708,
    "mean": 0.034,
    "metric_value": 0.034,
    "min": 0.0012616352178156376
  },
  "pii": {
    "total_records": 50,
    "max": 0.8,
    "mean": 0.016,
    "metric_value": 0.016,
    "min": 0
  },
  "keywords_inclusion": {
    "common_keywords": [
      [
        "milwaukee",
        "brewers"
      ],
      [
        "safari"
      ],
      [],
      [
        "winchester",
        "john",
        "eric"
      ],
      [
        "chandra",
        "bahadur",
        "dangi"
      ],
      [
        "seconds"
      ],
      [
        "suryanarayan",
        "m."
      ],
      [
        "meiji"
      ],
      [
        "ontario",
        "toronto",
        "canada"
      ],
      [],
      [
        "sports",
        "wii"
      ],
      [
        "castaway",
        "moon"
      ],
      [
        "melora",
        "hardin"
      ],
      [
        "season"
      ],
      [
        "treyarch"
      ],
      [

# Evaluating Text Classification output from the Foundation Model

## Test data containing the text classification output from the model and the reference data

In [None]:
!rm -fr llm_content_classification.csv
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/llm_content_classification.csv"

In [34]:
data = pd.read_csv("llm_content_classification.csv")
data.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [35]:
data['label'] = data['label'].replace({'ham': 0, 'spam': 1})

In [36]:
df_input = data[['text']].copy()
df_output = data[['label']].copy()
df_reference = data[['label']].copy()

## Shuffle the reference labels so that they differ from the predictions

In [37]:
shuffled_column = df_reference['label'].sample(frac=1).reset_index(drop=True)
df_reference['label'] = shuffled_column

## Metrics configuration for evaluation

In [38]:
metric_config = {   
    # All common parameters go here
    "configuration": {        
        LLMTaskType.CLASSIFICATION.value: { # metric group   
            LLMMetricType.ACCURACY.value: {},
            LLMMetricType.PRECISION.value: {},
            LLMMetricType.RECALL.value: {},
            LLMMetricType.F1_SCORE.value: {},
            LLMMetricType.MATTHEWS_CORRELATION.value: {}

        }    
    }
}

## Text Classification Metrics Evaluation

In [39]:
result = client.llm_metrics.compute_metrics(metric_config,df_input,df_output, df_reference)

## Evaluated Metrics

In [40]:
print(json.dumps(result,indent=2))

{
  "accuracy": {
    "accuracy": 0.7674084709260589
  },
  "f1": {
    "f1": 0.13253012048192772
  },
  "matthews_correlation": {
    "matthews_correlation": -0.001770397652787315
  },
  "precision": {
    "precision": 0.13253012048192772
  },
  "recall": {
    "recall": 0.13253012048192772
  }
}
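The figures above can be read against the standard binary-classification definitions. The sketch below computes them from a confusion matrix in plain Python; it is an illustration with made-up labels, not the toolkit's code. Because the reference column in this notebook is a shuffled copy of the prediction column, both contain the same number of positive labels, so false positives equal false negatives, which is why precision, recall, and F1 coincide in the output.

```python
import math


def binary_classification_metrics(references, predictions):
    """Accuracy, precision, recall, F1 and Matthews correlation for 0/1 labels."""
    pairs = list(zip(references, predictions))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "matthews_correlation": mcc,
    }


# Made-up labels for illustration (1 = spam, 0 = ham).
scores = binary_classification_metrics([1, 0, 1, 0], [1, 0, 0, 0])
print(scores)
```

A Matthews correlation near zero, as in the output above, is what a random pairing of labels is expected to produce.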


# Evaluating Entity extraction output from the Foundation Model

## Test data containing the entity extraction output from the model and the reference data

In [None]:
!rm -fr llm_extraction.csv
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/llm_extraction.csv"

In [42]:
data = pd.read_csv("llm_extraction.csv")
data.head()

Unnamed: 0,input_text,generated_text,reference_text
0,John's credit card number is 4111 1111 1111 11...,4111 1111 1111 1111,4111 1111 1111 1111
1,Mary's credit card number is 5555 5555 5555 55...,5555 5555 5555 5555,5555 5555 5555 5555
2,David's credit card number is 1234 5678 9012 3...,1234 5678 9012 3456,1234 5678 9012 3456
3,Sarah's credit card number is 6011 1234 5678 9...,6011 1234 5678 9012,6011 1234 5678 9012
4,Alice's credit card number is 5105 1051 0510 5...,5105 1051 0510 5100,5105 1051 0510 5100


In [43]:
df_input = data[['input_text']].copy()
df_output = data[['generated_text']].copy()
df_reference = data[['reference_text']].copy()

## Metrics configuration for extraction

In [44]:
metric_config = {   
    # All common parameters go here
    "configuration": {        
        LLMTaskType.EXTRACTION.value: { # metric group   
            LLMMetricType.EXACT_MATCH.value: {},
            LLMMetricType.MULTI_LABEL.value: {},
            LLMMetricType.FLESCH.value: {},
            LLMMetricType.PII_DETECTION.value: {},
            LLMMetricType.HAP_SCORE.value : {}
        }    
    }
}
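The Flesch metric configured above reports a reading-ease score and a Flesch-Kincaid grade. The sketch below applies the standard published formulas to raw counts; how the toolkit tokenizes text and counts syllables is not shown in this notebook, so the function names and inputs here are assumptions. Neither score is bounded, so very short or syllable-dense outputs can produce extreme values, which is one plausible reason the extraction results further down include a reading-ease minimum of -301.79.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Higher scores indicate easier text; the value can go negative."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Approximate US school grade level needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Example counts: 100 words, 5 sentences, 150 syllables.
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
print(round(ease, 3), round(grade, 2))
```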

## Entity extraction Metrics Evaluation

By default, the HAP and PII metrics are computed asynchronously on the server (the watsonx.governance instance). Details of the submitted computation tasks, and their responses, are returned in the respective metric responses. The `get_metrics_result` method shown in the following cells retrieves the results from the server.

To compute the HAP and PII metrics synchronously, pass `background_mode=False` to the `compute_metrics` method.

In [45]:
result = client.llm_metrics.compute_metrics(metric_config,df_input,df_output,df_reference)

## Evaluated Metrics

#### Re-run the following cell until all computation tasks are finished, and results are returned 

In [49]:
results = client.llm_metrics.get_metrics_result(metric_config,result)
print(json.dumps(results,indent=2))

{
  "hap_score": {
    "total_records": 10,
    "max": 0.038305215537548065,
    "mean": 0.0116,
    "metric_value": 0.0116,
    "min": 0.003492981195449829
  },
  "pii": {
    "total_records": 10,
    "max": 0.8,
    "mean": 0.56,
    "metric_value": 0.56,
    "min": 0.0
  },
  "flesch": {
    "flesch_reading_ease": {
      "metric_value": 34.794,
      "mean": 34.794,
      "min": -301.79,
      "max": 121.22,
      "std": 168.29611850544862
    },
    "flesch_kincaid_grade": {
      "metric_value": 9.040000000000001,
      "mean": 9.040000000000001,
      "min": -3.5,
      "max": 55.6,
      "std": 23.284638713108695
    }
  },
  "multi_label_metrics": {
    "micro_f1": {
      "metric_value": 0.0,
      "mean": 0,
      "min": 0,
      "max": 0,
      "std": 0
    },
    "macro_f1": {
      "metric_value": 0.0,
      "mean": 0,
      "min": 0,
      "max": 0,
      "std": 0
    },
    "micro_precision": {
      "metric_value": 0.0,
      "mean": 0,
      "min": 0,
      "max": 0,


# Evaluating Retrieval-Augmented Generation (RAG) output from the Foundation Model

## Test data containing the question, answer, and relevant context from the model output for RAG metrics

In [None]:
!rm -rf rag_ibm_faq.csv
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/rag_ibm_faq.csv"

In [52]:
data = pd.read_csv("rag_ibm_faq.csv")
data.head()

Unnamed: 0,question,answer,contexts,reference
0,What is the origin of IBM’s “THINK” motto?,"In December 1911, when future IBM Chairman Tho...",['production and distribution — the THINK mott...,"In December 1911, when future IBM Chairman Tho..."
1,What is the origin of the term “Big Blue?”,Some writers have suggested that the “Big Blue...,['90 \n9215FQ14 Public Relations Q. What is ...,The term “Big Blue” as a reference to IBM did ...
2,In what year did IBM begin doing business in I...,1920 – Thomas J. Watson sails for Bombay to op...,['15 \n9215FQ14 W e a r e c o n v i n c e ...,1920 – Thomas J. Watson sails for Bombay to op...
3,What is watsonx.ai?,Watson is a cloud-based artificial intelligenc...,"['25 \n9215FQ14', '9215FQ14 a similar number ...",Watson is a cloud-based artificial intelligenc...
4,Does watsonx run on Openshift Container Platfo...,"Yes, Watson Studio runs on the Openshift Conta...","['3090, 308X and “plug-compatible systems. (...","Yes, Watson Studio runs on the Openshift Conta..."


In [53]:
df_input = data[["contexts","question"]].copy()
df_output = data[["answer"]].copy()
df_reference = data[["reference"]].copy()

#### Metrics configuration for evaluation
##### For content analysis metrics, a list of submetrics can be passed: **["coverage","density","compression","abstractness","repetitiveness"]**. The RAG task type supports only **["coverage","density","abstractness"]**. The n-gram size ("ngrams") for abstractness and repetitiveness can also be configured; the default is 1.

e.g. **LLMMetricType.CONTENT_ANALYSIS.value: {"metrics":["coverage","density","abstractness"], "abstractness":{"ngrams":2}}**

##### For the RAG task type, the source dataframe can have multiple columns. Specify them at the global level of the configuration: the key "context_columns" takes a list of context column names, and the key "question_column" takes the name of the question column.

##### For unsuccessful requests, a list of custom phrases can be passed; it overrides the default phrases.

e.g. **LLMMetricType.UNSUCCESSFUL_REQUESTS.value: {"unsuccessful_phrases":["i don't know", "i am not sure"]}**

##### For question robustness, a list of keywords can be passed under the key "excluded_keywords"; these keywords are excluded from the spell check.
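As a rough illustration of what a phrase list for unsuccessful requests does, the sketch below counts the share of answers that contain any of the given phrases. The matching rule (case-insensitive substring search) and the function name are assumptions made for this example; the toolkit's actual matching logic is not shown in this notebook.

```python
def unsuccessful_request_rate(answers, phrases):
    """Share of answers containing at least one 'unsuccessful' phrase."""
    answers = list(answers)
    if not answers:
        return 0.0
    lowered = [p.lower() for p in phrases]
    hits = sum(
        1 for answer in answers
        if any(phrase in answer.lower() for phrase in lowered)
    )
    return hits / len(answers)


# Made-up answers, used only to exercise the helper.
sample_answers = [
    "I don't know the answer to that.",
    "IBM adopted the THINK motto early in its history.",
    "I am not sure about that.",
    "Yes.",
]
rate = unsuccessful_request_rate(sample_answers, ["i don't know", "i am not sure"])
print(rate)
```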

In [54]:
metric_config = {
    "configuration": {
        "record_level":False,
        "context_columns":["contexts"],
        "question_column": "question",
        LLMTaskType.RAG.value: {
            LLMMetricType.CONTENT_ANALYSIS.value: {},
            LLMMetricType.UNSUCCESSFUL_REQUESTS.value: {
                # "unsuccessful_phrases": []
            },
            LLMMetricType.KEYWORDS_INCLUSION.value: {},
            LLMMetricType.QUESTION_ROBUSTNESS.value: {"metrics": ["spelling_robustness"],
                                                         "excluded_keywords": ["ibm","watsonx","openshift","ocp"]},
            LLMMetricType.PII_DETECTION.value: {},
            LLMMetricType.HAP_SCORE.value: {}
        }
    }
}

## RAG Metrics Evaluation

By default, the HAP and PII metrics are computed asynchronously on the server (the watsonx.governance instance). Details of the submitted computation tasks, and their responses, are returned in the respective metric responses. The `get_metrics_result` method shown in the following cells retrieves the results from the server.

To compute the HAP and PII metrics synchronously, pass `background_mode=False` to the `compute_metrics` method.

In [55]:
result = client.llm_metrics.compute_metrics(metric_config,df_input, df_output,df_reference)

## Evaluated Metrics

#### Re-run the following cell until all computation tasks are finished, and results are returned 

In [56]:
results = client.llm_metrics.get_metrics_result(metric_config,result)
print(json.dumps(results,indent=2))

{
  "hap_score": {
    "total_records": 5,
    "max": 0.002216296037659049,
    "mean": 0.0011,
    "metric_value": 0.0011,
    "min": 0.0003830457862932235
  },
  "pii": {
    "total_records": 5,
    "max": 0,
    "mean": 0.0,
    "metric_value": 0.0,
    "min": 0
  },
  "content_analysis": {
    "coverage": {
      "metric_value": 0.2383,
      "mean": 0.2383,
      "min": 0.0689,
      "max": 0.4405,
      "std": 0.1224
    },
    "density": {
      "metric_value": 0.5524,
      "mean": 0.5524,
      "min": 0.0295,
      "max": 1.0,
      "std": 0.4345
    },
    "abstractness": {
      "metric_value": 0.2311,
      "mean": 0.2311,
      "min": 0.0,
      "max": 0.6,
      "std": 0.2735
    }
  },
  "keywords_inclusion": {
    "common_keywords": [
      [
        "thomas",
        "meeting",
        "father",
        "advertising",
        "advance",
        "j.",
        "register",
        "cash",
        "sr",
        "thought",
        "sales",
        "watson",
        "company

Authors: kishore.patel@in.ibm.com, ravi.chamarthy@in.ibm.com