# Design time notebook for multilingual support of Generative AI Quality metrics for IBM watsonx.governance

This notebook demonstrates the metric results of the Generative AI Quality monitors for a prompt in Japanese. The metrics computed for each task type are:
- Summarization:
    - Rouge Score
    - Cosine Similarity
    - Jaccard Similarity
    - Normalized Precision
    - Normalized Recall
    - Normalized F1 Score
    - Sari
    - Meteor
    - HAP Score
    - PII
- Generation:
    - Rouge Score
    - Normalized Precision
    - Normalized Recall
    - Normalized F1 Score
    - Meteor
    - HAP Score
    - PII
- Extraction:
    - Rouge Score
    - HAP Score
    - PII
- Question Answering (QA):
    - Rouge Score
    - HAP Score
    - PII

The notebook shows the Japanese metric values computed with a tokenizer in two scenarios:
- Scenario 1 : When `language_code` is passed in the configuration to use the in-built tokenizer for a particular language
- Scenario 2 : When the user passes a custom tokenizer present on their system
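
To illustrate why the choice of tokenizer matters, here is a minimal sketch of a token-level Jaccard similarity. It is not the implementation used by the metrics plugin; the `jaccard` function and both toy tokenizations are assumptions made for illustration. The same pair of strings gets a different score depending on how it is split into tokens.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between two token lists, computed on their sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty inputs count as identical
    return len(set_a & set_b) / len(set_a | set_b)


prediction = "東京は晴れ"
reference = "京都は晴れ"

# Tokenization 1: every character is its own token
char_score = jaccard(list(prediction), list(reference))

# Tokenization 2: a hand-written word-level split, hard-coded for illustration
word_score = jaccard(["東京", "は", "晴れ"], ["京都", "は", "晴れ"])

print(char_score, word_score)
```

The character-level split shares the character 京 between the two place names, so it reports more overlap than the word-level split does.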

**Note**:
- The example below is specific to the Japanese language.
- HAP Score and PII do not support a custom tokenizer.

Supported languages and their language codes:
- English : en
- Japanese : ja
- German : de
- French : fr
- Spanish : es
- Arabic : ar
- Italian : it
- Portuguese : pt
- Korean : ko
- Danish : da
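
A small guard can catch a mistyped code before any metric is computed. The sketch below copies the codes from the list above; the helper name `check_language_code` is hypothetical, and normalizing case and whitespace is a choice made here, not documented behavior of the service.

```python
# Language codes copied from the list above
SUPPORTED_LANGUAGES = {
    "en": "English", "ja": "Japanese", "de": "German", "fr": "French",
    "es": "Spanish", "ar": "Arabic", "it": "Italian", "pt": "Portuguese",
    "ko": "Korean", "da": "Danish",
}


def check_language_code(code):
    """Return the normalized code, or raise ValueError if it is not listed."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(
            f"Unsupported language code {code!r}; expected one of: {supported}"
        )
    return normalized


print(check_language_code("ja"))
```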

Requirements to run in another language: a dataset with source (input) and reference columns. A predictions (output) column is also needed; if it is not present, a suitable Hugging Face model can be used to generate predictions for the inputs.

Changes to be made to run the cells in another language:
- Replace the filename and URL used to fetch the dataset in [Step 2](#data-1)
- If the predictions column is not present, replace the model id in [Step 3](#model) with that of a suitable Hugging Face model and run [Step 4](#predict)
- Change the language code [here](#language_code) and run the cells in [Scenario 1](#scenario-1)
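
Before swapping in a new dataset, it can help to confirm which of the expected columns it actually contains. This is a minimal sketch; the helper name `missing_columns` is hypothetical, and the column names are the ones used by the summarization sample in this notebook.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, preserving order."""
    present = set(available)
    return [name for name in required if name not in present]


# Column names used by the summarization sample in this notebook
required = ["input_text", "reference_summary", "generated_predictions"]

# Stand-in for list(df.columns) of a dataset without predictions
available = ["input_text", "reference_summary"]

print(missing_columns(available, required))
```

If only the predictions column is reported missing, Steps 3 and 4 can be used to generate it; a missing source or reference column means the dataset cannot be used as-is.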

## Contents

- [Step 1 - Setup](#setup)
- [Step 2 - Read data and store in dataframes](#data-1)
- [Step 3 - Initialize Hugging Face model](#model)
- [Step 4 - Generate the predictions for sample input](#predict)
- [Step 5 - Set the language code](#language_code)

- [Scenario 1 - Metrics with language code which uses the in-built WatsonNLP tokenizer](#scenario-1)
    - [Summarization metrics](#summarization-1)
        - [Step 1 - Configure the summarization metrics](#config-1.1.a)
        - [Step 2 - Compute the summarization metrics](#compute-1.1.b)
        - [Step 3 - Display the results](#results-1.1.c)
    - [Generation metrics](#generation-1)
        - [Step 1 - Read data and store in dataframes](#data-2)
        - [Step 2 - Configure the generation metrics](#config-1.2.a)
        - [Step 3 - Compute the generation metrics](#compute-1.2.b)
        - [Step 4 - Display the results](#results-1.2.c)
    - [Extraction metrics](#extraction-1)
        - [Step 1 - Read data and store in dataframes](#data-3)
        - [Step 2 - Configure the extraction metrics](#config-1.3.a)
        - [Step 3 - Compute the extraction metrics](#compute-1.3.b)
        - [Step 4 - Display the results](#results-1.3.c)
    - [Question Answering(QA) metrics](#qa-1)
        - [Step 1 - Read data and store in dataframes](#data-4)
        - [Step 2 - Configure the question answering(qa) metrics](#config-1.4.a)
        - [Step 3 - Compute the question answering(qa) metrics](#compute-1.4.b)
        - [Step 4 - Display the results](#results-1.4.c)
- [Scenario 2 - Metrics with custom tokenizer](#scenario-2)
    - [Step 1 - Create a custom tokenizer](#custom_tokenizer)
    - [Summarization metrics](#summarization-2)
        - [Step 1 - Configure the summarization metrics](#config-2.1.a)
        - [Step 2 - Compute the summarization metrics](#compute-2.1.b)
        - [Step 3 - Display the results](#results-2.1.c)
    - [Generation metrics](#generation-2)
        - [Step 1 - Configure the generation metrics](#config-2.2.a)
        - [Step 2 - Compute the generation metrics](#compute-2.2.b)
        - [Step 3 - Display the results](#results-2.2.c)
    - [Extraction metrics](#extraction-2)
        - [Step 1 - Configure the extraction metrics](#config-2.3.a)
        - [Step 2 - Compute the extraction metrics](#compute-2.3.b)
        - [Step 3 - Display the results](#results-2.3.c)
    - [Question Answering(QA) metrics](#qa-2)
        - [Step 1 - Configure the question answering(qa) metrics](#config-2.4.a)
        - [Step 2 - Compute the question answering(qa) metrics](#compute-2.4.b)
        - [Step 3 - Display the results](#results-2.4.c)

## Step 1 - Setup <a id="setup"></a>

### Install necessary libraries

In [None]:
!pip install -U ibm_watson_openscale | tail -n 1
!pip install -U "ibm-metrics-plugin[generative-ai-quality]~=3.0.11" | tail -n 1

import warnings
warnings.filterwarnings('ignore')

In [None]:
import spacy

spacy.cli.download("en_core_web_sm")
spacy.cli.download("ja_core_news_sm")
!python -m nltk.downloader punkt

**Note**: you may need to restart the kernel to use updated libraries.

### Configure your credentials

#### Provision services and configure credentials

If you have not already, provision an instance of IBM Watson OpenScale using the [OpenScale link in the Cloud catalog](https://cloud.ibm.com/catalog/services/watson-openscale).
Your Cloud API key can be generated by going to the [**Users** section of the Cloud console](https://cloud.ibm.com/iam#/users). From that page, click your name, scroll down to the **API Keys** section, and click **Create an IBM Cloud API key**. Give your key a name and click **Create**, then copy the created key and paste it below.
**NOTE:** You can also create the API key with the IBM Cloud CLI.

How to install the IBM Cloud CLI: [instructions](https://console.bluemix.net/docs/cli/reference/ibmcloud/download_cli.html#install_use)

How to create an API key using the CLI:
```
bx login --sso
bx iam api-key-create 'my_key'
```

In [3]:
use_cpd = False
CLOUD_API_KEY = "****"
IAM_URL="https://iam.cloud.ibm.com"
SERVICE_URL="https://aiopenscale.cloud.ibm.com"
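
To avoid pasting the key into the notebook, it can be read from an environment variable instead. This is a minimal sketch; the helper `read_api_key` and the variable name `CLOUD_API_KEY` used for the environment are assumptions, not something the SDK requires.

```python
import os


def read_api_key(env_var="CLOUD_API_KEY", default=None):
    """Return the key from the environment, or `default` if unset or empty."""
    value = os.environ.get(env_var, "").strip()
    return value or default


# Falls back to the placeholder when the variable is not set
CLOUD_API_KEY = read_api_key(default="****")
```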

Uncomment and run the cell below only if you are running the notebook on a Cloud Pak for Data (CPD) cluster.

In [4]:
# use_cpd = True
# WOS_CREDENTIALS = {
#     "url": "xxxxx",
#     "username": "xxxxx",
#     "password": "xxxxx",
#     "apikey": "xxxxx"
# }

## Step 2 - Read and store data in individual pandas dataframes <a id="data-1"></a>

### Read the data

Download the sample "llm_content_summarization_ja" file.

In [4]:
filename_summarization = "llm_content_summarization_ja.csv"
!rm -fr "llm_content_summarization_ja.csv"
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/refs/heads/main/IBM%20Cloud/WML/assets/data/watsonx/Multi_Lingual_Support/llm_content_summarization_ja.csv"

--2024-11-21 06:17:55--  https://raw.githubusercontent.com/IBM/watson-openscale-samples/refs/heads/main/IBM%20Cloud/WML/assets/data/watsonx/Multi_Lingual_Support/llm_content_summarization_ja.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26758 (26K) [text/plain]
Saving to: ‘llm_content_summarization_ja.csv’


2024-11-21 06:17:56 (47.0 MB/s) - ‘llm_content_summarization_ja.csv’ saved [26758/26758]



### Converting the data into pandas dataframe

Extracting the columns and creating individual data frames for `input`, `prediction` and `reference` columns

In [5]:
import pandas as pd

llm_data_ja_summarization = pd.read_csv(filename_summarization)
df_input_ja_summarization = llm_data_ja_summarization[['input_text']].copy()
df_reference_ja_summarization = llm_data_ja_summarization[['reference_summary']].copy()
df_generated_ja_summarization = llm_data_ja_summarization[['generated_predictions']].copy() # Comment if using local LLM model

## Step 3 - Initialize a foundation model from Hugging Face
<a id="model"></a>

Uncomment the following cells to create a Hugging Face model and generate the predictions for Japanese.

Model used - p1atdev/mt5-base-xlsum-ja-v1.1 : Japanese summarization model

Note: The example below is specific to the summarization task type. The other task types need to be run with suitable models and column names.

In [6]:
# from transformers import pipeline

# # Build the pipeline once so the model is not reloaded for every input
# seq2seq = pipeline("summarization", model="p1atdev/mt5-base-xlsum-ja-v1.1")

# def summarize_fn_ja(input_text):
#     result = seq2seq(input_text)
#     return result[0]['summary_text']

## Step 4 - Generate predictions for sample input
<a id="predict"></a>

Creating an empty list and storing the generated predictions

In [7]:
# generated_predictions_ja = []
# for input_text in df_input_ja_summarization["input_text"]:  # avoid shadowing the built-in input()
#     generated_predictions_ja.append(summarize_fn_ja(input_text))

Converting the list to an individual dataframe

In [8]:
# df_generated_ja_summarization = pd.DataFrame(columns=["generated_predictions"])
# df_generated_ja_summarization['generated_predictions'] = generated_predictions_ja
# print(df_generated_ja_summarization)

## Step 5 - Set the language code <a id="language_code"></a>

In [6]:
language_code = "ja"

## Scenario 1 - Metrics with `language_code` which uses the in-built WatsonNLP tokenizer <a id="scenario-1"></a>

### IBM watsonx.governance authentication and verifying client version

In [7]:
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator,BearerTokenAuthenticator,CloudPakForDataAuthenticator

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *

if use_cpd:
    # Cloud Pak for Data: authenticate against the cluster URL
    authenticator = CloudPakForDataAuthenticator(
        url=WOS_CREDENTIALS['url'],
        username=WOS_CREDENTIALS['username'],
        apikey=WOS_CREDENTIALS['apikey'],
        disable_ssl_verification=True,
    )
    client = APIClient(service_url=WOS_CREDENTIALS['url'], authenticator=authenticator)
else:
    # IBM Cloud: authenticate with the IAM API key
    authenticator = IAMAuthenticator(apikey=CLOUD_API_KEY, url=IAM_URL)
    client = APIClient(authenticator=authenticator)

print(client.version)

3.0.41


In [8]:
from ibm_metrics_plugin.metrics.llm.utils.constants import (
    LLMTextMetricGroup,
    LLMSummarizationMetrics,
    LLMCommonMetrics,
    LLMGenerationMetrics,
    LLMExtractionMetrics,
    LLMQAMetrics,
)
import json

## Summarization metrics<a id="summarization-1"></a>

#### Step 1 - Configure summarization metrics<a id="config-1.1.a"></a>

In [9]:
metric_config_1_summarization = {   
    "configuration": {
        LLMTextMetricGroup.SUMMARIZATION.value: {
            LLMSummarizationMetrics.ROUGE_SCORE.value: {},
            LLMSummarizationMetrics.COSINE_SIMILARITY.value: {},
            LLMSummarizationMetrics.JACCARD_SIMILARITY.value: {},
            LLMSummarizationMetrics.NORMALIZED_PRECISION.value: {},
            LLMSummarizationMetrics.NORMALIZED_RECALL.value: {},
            LLMSummarizationMetrics.NORMALIZED_F1_SCORE.value: {},
            LLMSummarizationMetrics.SARI.value: {},
            LLMSummarizationMetrics.METEOR.value: {},
            LLMCommonMetrics.HAP_SCORE.value: {},
            LLMCommonMetrics.PII_DETECTION.value: {
                "language_code" : language_code
            }
        },
        "language_code" : language_code
    }
}
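
The same nested shape is repeated for every task type in this notebook, so it can be generated by a small helper. The sketch below is one possible way to do that; `build_metric_config` is a hypothetical helper, not part of the SDK, and the placeholder strings stand in for the `.value` of the enums imported above.

```python
def build_metric_config(group, metrics, language_code=None, per_metric_options=None):
    """Build a {"configuration": {group: {metric: options}}} dictionary.

    `group` and `metrics` are the string values of the enums used above,
    for example LLMTextMetricGroup.SUMMARIZATION.value.
    """
    per_metric_options = per_metric_options or {}
    group_config = {name: dict(per_metric_options.get(name, {})) for name in metrics}
    configuration = {group: group_config}
    if language_code is not None:
        # Top-level language code, placed next to the group as in the cell above
        configuration["language_code"] = language_code
    return {"configuration": configuration}


# Placeholder strings stand in for the enum values
config = build_metric_config(
    "my_group",
    ["metric_a", "metric_b"],
    language_code="ja",
    per_metric_options={"metric_b": {"language_code": "ja"}},
)
print(config)
```

Leaving `language_code` unset produces the shape used in Scenario 2, where a custom tokenizer is passed instead.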

#### Step 2 - Compute the summarization metrics<a id="compute-1.1.b"></a>

In [10]:
result_ja_1_summarization = client.llm_metrics.compute_metrics(metric_config_1_summarization,
                                                sources = df_input_ja_summarization, 
                                                predictions = df_generated_ja_summarization, 
                                                references = df_reference_ja_summarization)

[nltk_data] Downloading package punkt_tab to /home/wsuser/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


#### Step 3 - Display the results<a id="results-1.1.c"></a>

Fetching the results

In [15]:
final_results_ja_1_summarization = client.llm_metrics.get_metrics_result(configuration=metric_config_1_summarization, 
                                                           metrics_result=result_ja_1_summarization)

In [16]:
print(json.dumps(final_results_ja_1_summarization,indent=2))

{
  "cosine_similarity": {
    "total_records": 20,
    "max": 0.4972134156783251,
    "mean": 0.3267,
    "metric_value": 0.3267,
    "min": 0.20829453553574567
  },
  "hap_score": {
    "total_records": 20,
    "max": 0.10210397094488144,
    "mean": 0.0553,
    "metric_value": 0.0553,
    "min": 0.02366599254310131
  },
  "jaccard_similarity": {
    "total_records": 20,
    "max": 0.40476190476190477,
    "mean": 0.2605,
    "metric_value": 0.2605,
    "min": 0.14754098360655737
  },
  "meteor": {
    "metric_value": 0.3211,
    "total_records": 20
  },
  "normalized_f1": {
    "total_records": 20,
    "max": 0.5454545454545455,
    "mean": 0.3755,
    "metric_value": 0.3755,
    "min": 0.22988505747126436
  },
  "normalized_precision": {
    "total_records": 20,
    "max": 0.72,
    "mean": 0.4214,
    "metric_value": 0.4214,
    "min": 0.23529411764705882
  },
  "normalized_recall": {
    "total_records": 20,
    "max": 0.8,
    "mean": 0.4073,
    "metric_value": 0.4073,
    "min

**Note:** If the results above show `"state": "in_progress"`, the metrics are still being computed. Re-run the two cells above to fetch the final values.
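
Instead of re-running the cells by hand, the fetch can be retried in a loop. This is a minimal sketch under stated assumptions: the helpers `contains_in_progress` and `wait_for_result` are hypothetical, and because the note does not say where the `"state"` key appears in the response, the check searches the whole nested structure.

```python
import time


def contains_in_progress(result):
    """Return True if any nested dictionary has "state": "in_progress"."""
    if isinstance(result, dict):
        if result.get("state") == "in_progress":
            return True
        return any(contains_in_progress(value) for value in result.values())
    if isinstance(result, list):
        return any(contains_in_progress(value) for value in result)
    return False


def wait_for_result(fetch, interval_seconds=5.0, max_attempts=12, sleep=time.sleep):
    """Call `fetch()` until its result is no longer in progress, or give up."""
    for attempt in range(max_attempts):
        result = fetch()
        if not contains_in_progress(result):
            return result
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"Result still in progress after {max_attempts} attempts")


# Possible usage with the calls from the cells above:
# final_results = wait_for_result(
#     lambda: client.llm_metrics.get_metrics_result(
#         configuration=metric_config_1_summarization,
#         metrics_result=result_ja_1_summarization,
#     )
# )
```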

## Generation metrics<a id="generation-1"></a>

### Step 1 - Read data and store in dataframes<a id="data-2"></a>

#### Read the data

Download the sample "llm_content_generation_ja" file

In [17]:
filename_generation = "llm_content_generation_ja.csv"
!rm -fr "llm_content_generation_ja.csv"
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/refs/heads/main/IBM%20Cloud/WML/assets/data/watsonx/Multi_Lingual_Support/llm_content_generation_ja.csv"

--2024-11-21 06:22:42--  https://raw.githubusercontent.com/IBM/watson-openscale-samples/refs/heads/main/IBM%20Cloud/WML/assets/data/watsonx/Multi_Lingual_Support/llm_content_generation_ja.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14183 (14K) [text/plain]
Saving to: ‘llm_content_generation_ja.csv’


2024-11-21 06:22:42 (44.1 MB/s) - ‘llm_content_generation_ja.csv’ saved [14183/14183]



### Converting the data into pandas dataframe

Extracting the columns and creating individual data frames for `input`, `prediction` and `reference` columns

In [18]:
import pandas as pd

llm_data_ja_generation = pd.read_csv(filename_generation)
df_input_ja_generation = llm_data_ja_generation[['question']].copy()
df_reference_ja_generation = llm_data_ja_generation[['reference_text']].copy()
df_generated_ja_generation = llm_data_ja_generation[['generated_text']].copy() # Comment if using local LLM model

#### Step 2 - Configure generation metrics<a id="config-1.2.a"></a>

In [19]:
metric_config_1_generation = {   
    "configuration": {
        LLMTextMetricGroup.GENERATION.value: {
            LLMGenerationMetrics.ROUGE_SCORE.value: {},
            LLMGenerationMetrics.NORMALIZED_PRECISION.value: {},
            LLMGenerationMetrics.NORMALIZED_RECALL.value: {},
            LLMGenerationMetrics.NORMALIZED_F1_SCORE.value: {},
            LLMGenerationMetrics.METEOR.value: {},
            LLMCommonMetrics.HAP_SCORE.value: {},
            LLMCommonMetrics.PII_DETECTION.value: {
                "language_code" : language_code
            }
        },
        "language_code" : language_code
    }
}

#### Step 3 - Compute the generation metrics<a id="compute-1.2.b"></a>

In [20]:
result_ja_1_generation = client.llm_metrics.compute_metrics(metric_config_1_generation,
                                                sources = df_input_ja_generation, 
                                                predictions = df_generated_ja_generation, 
                                                references = df_reference_ja_generation)

#### Step 4 - Display the results<a id="results-1.2.c"></a>

Fetching the results

In [23]:
final_results_ja_1_generation = client.llm_metrics.get_metrics_result(configuration=metric_config_1_generation, 
                                                           metrics_result=result_ja_1_generation)

In [24]:
print(json.dumps(final_results_ja_1_generation,indent=2))

{
  "hap_score": {
    "total_records": 23,
    "max": 0.12608997523784637,
    "mean": 0.0141,
    "metric_value": 0.0141,
    "min": 0.0021147574298083782
  },
  "meteor": {
    "metric_value": 0.6902,
    "total_records": 23
  },
  "normalized_f1": {
    "total_records": 23,
    "max": 0.9481481481481482,
    "mean": 0.8329,
    "metric_value": 0.8329,
    "min": 0.6399999999999999
  },
  "normalized_precision": {
    "total_records": 23,
    "max": 1.0,
    "mean": 0.9981,
    "metric_value": 0.9981,
    "min": 0.9722222222222222
  },
  "normalized_recall": {
    "total_records": 23,
    "max": 0.9142857142857143,
    "mean": 0.7201,
    "metric_value": 0.7201,
    "min": 0.47058823529411764
  },
  "pii": {
    "total_records": 23,
    "max": 0,
    "mean": 0.0,
    "metric_value": 0.0,
    "min": 0
  },
  "rouge_score": {
    "rouge1": 0.8328,
    "rouge1_recall": 0.7202,
    "rouge2": 0.8257,
    "rouge2_recall": 0.7118,
    "rougeL": 0.8328,
    "rougeL_recall": 0.7202,
    "rou

## Extraction metrics<a id="extraction-1"></a>

### Step 1 - Read data and store in dataframes<a id="data-3"></a>

#### Read the data

Download the sample "llm_content_extraction_ja" file

In [25]:
filename_extraction = "llm_content_extraction_ja.csv"
!rm -fr "llm_content_extraction_ja.csv"
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/refs/heads/main/IBM%20Cloud/WML/assets/data/watsonx/Multi_Lingual_Support/llm_content_extraction_ja.csv"

--2024-11-21 06:24:11--  https://raw.githubusercontent.com/IBM/watson-openscale-samples/refs/heads/main/IBM%20Cloud/WML/assets/data/watsonx/Multi_Lingual_Support/llm_content_extraction_ja.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2478 (2.4K) [text/plain]
Saving to: ‘llm_content_extraction_ja.csv’


2024-11-21 06:24:11 (21.4 MB/s) - ‘llm_content_extraction_ja.csv’ saved [2478/2478]



### Converting the data into pandas dataframe

Extracting the columns and creating individual data frames for `input`, `prediction` and `reference` columns

In [26]:
import pandas as pd

llm_data_ja_extraction = pd.read_csv(filename_extraction)
df_input_ja_extraction = llm_data_ja_extraction[['input_text']].copy()
df_reference_ja_extraction = llm_data_ja_extraction[['reference_text']].copy()
df_generated_ja_extraction = llm_data_ja_extraction[['generated_text']].copy() # Comment if using local LLM model

#### Step 2 - Configure extraction metrics<a id="config-1.3.a"></a>

In [27]:
metric_config_1_extraction = {   
    "configuration": {
        LLMTextMetricGroup.EXTRACTION.value: {
            LLMExtractionMetrics.ROUGE_SCORE.value: {},
            LLMCommonMetrics.HAP_SCORE.value: {},
            LLMCommonMetrics.PII_DETECTION.value: {
                "language_code" : language_code
            }
        },
        "language_code" : language_code
    }
}

#### Step 3 - Compute the extraction metrics<a id="compute-1.3.b"></a>

In [28]:
result_ja_1_extraction = client.llm_metrics.compute_metrics(metric_config_1_extraction,
                                                sources = df_input_ja_extraction, 
                                                predictions = df_generated_ja_extraction, 
                                                references = df_reference_ja_extraction)

#### Step 4 - Display the results<a id="results-1.3.c"></a>

Fetching the results

In [31]:
final_results_ja_1_extraction = client.llm_metrics.get_metrics_result(configuration=metric_config_1_extraction, 
                                                           metrics_result=result_ja_1_extraction)

In [32]:
print(json.dumps(final_results_ja_1_extraction,indent=2))

{
  "hap_score": {
    "total_records": 10,
    "max": 0.1282653957605362,
    "mean": 0.0241,
    "metric_value": 0.0241,
    "min": 0.0043441494926810265
  },
  "pii": {
    "total_records": 10,
    "max": 0.8,
    "mean": 0.56,
    "metric_value": 0.56,
    "min": 0.0
  },
  "rouge_score": {
    "rouge1": 0.9,
    "rouge1_recall": 0.9,
    "rouge2": 0.8,
    "rouge2_recall": 0.8,
    "rougeL": 0.9,
    "rougeL_recall": 0.9,
    "rougeLsum": 0.9,
    "rougeLsum_recall": 0.9,
    "total_records": 10
  }
}


## Question Answering(QA) metrics<a id="qa-1"></a>

### Step 1 - Read data and store in dataframes<a id="data-4"></a>

#### Read the data

Download the sample "llm_content_qa_ja" file

In [33]:
filename_qa = "llm_content_qa_ja.csv"
!rm -fr "llm_content_qa_ja.csv"
!wget "https://raw.githubusercontent.com/IBM/watson-openscale-samples/refs/heads/main/IBM%20Cloud/WML/assets/data/watsonx/Multi_Lingual_Support/llm_content_qa_ja.csv"

--2024-11-21 06:25:24--  https://raw.githubusercontent.com/IBM/watson-openscale-samples/refs/heads/main/IBM%20Cloud/WML/assets/data/watsonx/Multi_Lingual_Support/llm_content_qa_ja.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4407 (4.3K) [text/plain]
Saving to: ‘llm_content_qa_ja.csv’


2024-11-21 06:25:24 (34.8 MB/s) - ‘llm_content_qa_ja.csv’ saved [4407/4407]



### Converting the data into pandas dataframe

Extracting the columns and creating individual data frames for `input`, `prediction` and `reference` columns

In [34]:
import pandas as pd

llm_data_ja_qa = pd.read_csv(filename_qa)
df_input_ja_qa = llm_data_ja_qa[['question']].copy()
df_reference_ja_qa = llm_data_ja_qa[['answers']].copy()
df_generated_ja_qa = llm_data_ja_qa[['answers']].copy() # Comment if using local LLM model

#### Step 2 - Configure question answering metrics<a id="config-1.4.a"></a>

In [41]:
metric_config_1_qa = {   
    "configuration": {
        LLMTextMetricGroup.QA.value: {
            LLMQAMetrics.ROUGE_SCORE.value: {},
            LLMCommonMetrics.HAP_SCORE.value: {},
            LLMCommonMetrics.PII_DETECTION.value: {
                "language_code" : language_code
            }
        },
        "language_code" : language_code
    }
}

#### Step 3 - Compute the question answering(qa) metrics<a id="compute-1.4.b"></a>

In [42]:
result_ja_1_qa = client.llm_metrics.compute_metrics(metric_config_1_qa,
                                                sources = df_input_ja_qa, 
                                                predictions = df_generated_ja_qa,
                                                references=df_reference_ja_qa)

#### Step 4 - Display the results<a id="results-1.4.c"></a>

Fetching the results

In [45]:
final_results_ja_1_qa = client.llm_metrics.get_metrics_result(configuration=metric_config_1_qa, 
                                                           metrics_result=result_ja_1_qa)

In [46]:
print(json.dumps(final_results_ja_1_qa,indent=2))

{
  "hap_score": {
    "total_records": 50,
    "max": 0.9634731411933899,
    "mean": 0.0748,
    "metric_value": 0.0748,
    "min": 0.001772725721821189
  },
  "pii": {
    "total_records": 50,
    "max": 0,
    "mean": 0.0,
    "metric_value": 0.0,
    "min": 0
  },
  "rouge_score": {
    "rouge1": 1.0,
    "rouge1_recall": 1.0,
    "rouge2": 0.72,
    "rouge2_recall": 0.72,
    "rougeL": 1.0,
    "rougeL_recall": 1.0,
    "rougeLsum": 1.0,
    "rougeLsum_recall": 1.0,
    "total_records": 50
  }
}


## Scenario 2 - Metrics with custom tokenizer <a id="scenario-2"></a>

### Step 1 - Create a custom tokenizer using spaCy <a id="custom_tokenizer"></a>

In [47]:
import spacy

import nltk
nltk.download('wordnet')

In [48]:
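def get_language_model_sp(language='en'):
    # Map each supported language code to the spaCy model downloaded in Step 1
    models = {'en': "en_core_web_sm", 'ja': "ja_core_news_sm"}
    if language not in models:
        # Fail early instead of returning None and breaking later in tokenize()
        raise ValueError(f"No spaCy model configured for language '{language}'")
    return spacy.load(models[language])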

In [49]:
class MyCustomTokenizer():
    def __init__(self, language = 'en'):
        self.language = language
        # Load the spaCy pipeline once instead of on every tokenize() call
        self.nlp = get_language_model_sp(self.language)

    def tokenize(self, input_text):
        # Return the text of each spaCy token as a plain string
        return [token.text for token in self.nlp(input_text)]

    def __call__(self, input_text):
        return self.tokenize(input_text)

### Initializing tokenizer

In [50]:
my_custom_tokenizer_ja = MyCustomTokenizer(language_code).tokenize

## Summarization Metrics<a id="summarization-2"></a>

### Step 1 - Configure summarization metrics<a id="config-2.1.a"></a>

In [51]:
metric_config_2_summarization = {   
    "configuration": {
        LLMTextMetricGroup.SUMMARIZATION.value: {
            LLMSummarizationMetrics.ROUGE_SCORE.value: {},
            LLMSummarizationMetrics.COSINE_SIMILARITY.value: {},
            LLMSummarizationMetrics.JACCARD_SIMILARITY.value: {},
            LLMSummarizationMetrics.NORMALIZED_PRECISION.value: {},
            LLMSummarizationMetrics.NORMALIZED_RECALL.value: {},
            LLMSummarizationMetrics.NORMALIZED_F1_SCORE.value: {},
            LLMSummarizationMetrics.SARI.value: {},
            LLMSummarizationMetrics.METEOR.value: {}
        },
    }
}

### Step 2 - Compute the summarization metrics<a id="compute-2.1.b"></a>

In [58]:
result_ja_2_summarization = client.llm_metrics.compute_metrics(metric_config_2_summarization, 
                                                sources = df_input_ja_summarization, 
                                                predictions = df_generated_ja_summarization, 
                                                references = df_reference_ja_summarization, 
                                                tokenizer = my_custom_tokenizer_ja)

### Step 3 - Display the results<a id="results-2.1.c"></a>

In [59]:
print(json.dumps(result_ja_2_summarization,indent=2))

{
  "normalized_precision": {
    "metric_value": 0.4330320512820512,
    "mean": 0.4330320512820512,
    "min": 0.20512820512820512,
    "max": 0.7083333333333334,
    "std": 0.1250694049511426,
    "total_records": 20
  },
  "sari": {
    "metric_value": 37.189037853372184,
    "total_records": 20
  },
  "rouge_score": {
    "rouge1": 0.3899,
    "rouge1_recall": 0.4272,
    "rouge2": 0.2031,
    "rouge2_recall": 0.2203,
    "rougeL": 0.2986,
    "rougeL_recall": 0.3229,
    "rougeLsum": 0.2986,
    "rougeLsum_recall": 0.3229,
    "total_records": 20
  },
  "jaccard_similarity": {
    "metric_value": 0.2758702031822637,
    "mean": 0.2758702031822637,
    "min": 0.16666666666666666,
    "max": 0.38095238095238093,
    "std": 0.07698297898073732,
    "total_records": 20
  },
  "normalized_recall": {
    "metric_value": 0.4271739649831853,
    "mean": 0.4271739649831853,
    "min": 0.16176470588235295,
    "max": 0.8125,
    "std": 0.1994148081880162,
    "total_records": 20
  },
  "me

## Generation metrics<a id="generation-2"></a>

#### Step 1 - Configure generation metrics<a id="config-2.2.a"></a>

In [60]:
metric_config_2_generation = {   
    "configuration": {
        LLMTextMetricGroup.GENERATION.value: {
            LLMGenerationMetrics.ROUGE_SCORE.value: {},
            LLMGenerationMetrics.NORMALIZED_PRECISION.value: {},
            LLMGenerationMetrics.NORMALIZED_RECALL.value: {},
            LLMGenerationMetrics.NORMALIZED_F1_SCORE.value: {},
            LLMGenerationMetrics.METEOR.value: {}
        }
    }
}

#### Step 2 - Compute the generation metrics<a id="compute-2.2.b"></a>

In [61]:
result_ja_2_generation = client.llm_metrics.compute_metrics(metric_config_2_generation,
                                                sources = df_input_ja_generation, 
                                                predictions = df_generated_ja_generation, 
                                                references = df_reference_ja_generation,
                                                tokenizer = my_custom_tokenizer_ja)

#### Step 3 - Display the results<a id="results-2.2.c"></a>

In [62]:
print(json.dumps(result_ja_2_generation,indent=2))

{
  "normalized_precision": {
    "metric_value": 0.9983850931677019,
    "mean": 0.9983850931677019,
    "min": 0.9761904761904762,
    "max": 1.0,
    "std": 0.00545610526818546,
    "total_records": 23
  },
  "rouge_score": {
    "rouge1": 0.8351,
    "rouge1_recall": 0.7233,
    "rouge2": 0.8287,
    "rouge2_recall": 0.7158,
    "rougeL": 0.8351,
    "rougeL_recall": 0.7233,
    "rougeLsum": 0.8351,
    "rougeLsum_recall": 0.7233,
    "total_records": 23
  },
  "normalized_recall": {
    "metric_value": 0.7232754525118498,
    "mean": 0.7232754525118498,
    "min": 0.4807692307692308,
    "max": 0.925,
    "std": 0.0953663927099713,
    "total_records": 23
  },
  "meteor": {
    "metric_value": 0.6921405027118602,
    "total_records": 23
  },
  "normalized_f1": {
    "metric_value": 0.8351304305697805,
    "mean": 0.8351304305697805,
    "min": 0.6493506493506493,
    "max": 0.9548387096774195,
    "std": 0.06538224681019711,
    "total_records": 23
  }
}


## Extraction metrics<a id="extraction-2"></a>

#### Step 1 - Configure extraction metrics<a id="config-2.3.a"></a>

In [63]:
metric_config_2_extraction = {   
    "configuration": {
        LLMTextMetricGroup.EXTRACTION.value: {
            LLMExtractionMetrics.ROUGE_SCORE.value: {}
        }
    }
}

#### Step 2 - Compute the extraction metrics<a id="compute-2.3.b"></a>

In [64]:
result_ja_2_extraction = client.llm_metrics.compute_metrics(metric_config_2_extraction,
                                                sources = df_input_ja_extraction, 
                                                predictions = df_generated_ja_extraction, 
                                                references = df_reference_ja_extraction,
                                                tokenizer = my_custom_tokenizer_ja)

#### Step 3 - Display the results<a id="results-2.3.c"></a>

In [65]:
print(json.dumps(result_ja_2_extraction,indent=2))

{
  "rouge_score": {
    "rouge1": 0.9,
    "rouge1_recall": 0.9,
    "rouge2": 0.8,
    "rouge2_recall": 0.8,
    "rougeL": 0.9,
    "rougeL_recall": 0.9,
    "rougeLsum": 0.9,
    "rougeLsum_recall": 0.9,
    "total_records": 10
  }
}


## Question Answering(QA) metrics<a id="qa-2"></a>

#### Step 1 - Configure question answering metrics<a id="config-2.4.a"></a>

In [66]:
metric_config_2_qa = {   
    "configuration": {
        LLMTextMetricGroup.QA.value: {
            LLMQAMetrics.ROUGE_SCORE.value: {}
        }
    }
}

#### Step 2 - Compute the question answering(qa) metrics<a id="compute-2.4.b"></a>

In [67]:
result_ja_2_qa = client.llm_metrics.compute_metrics(metric_config_2_qa,
                                                sources = df_input_ja_qa, 
                                                predictions = df_generated_ja_qa,
                                                references=df_reference_ja_qa,
                                                tokenizer = my_custom_tokenizer_ja)

#### Step 3 - Display the results<a id="results-2.4.c"></a>

In [68]:
print(json.dumps(result_ja_2_qa,indent=2))

{
  "rouge_score": {
    "rouge1": 1.0,
    "rouge1_recall": 1.0,
    "rouge2": 0.72,
    "rouge2_recall": 0.72,
    "rougeL": 1.0,
    "rougeL_recall": 1.0,
    "rougeLsum": 1.0,
    "rougeLsum_recall": 1.0,
    "total_records": 50
  }
}
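
To see how much the tokenizer choice changes a metric, the results of the two scenarios can be compared key by key. This is a minimal sketch; `compare_metric_values` is a hypothetical helper, and it only looks at entries that expose a numeric `metric_value`, so nested results such as `rouge_score` are skipped.

```python
def compare_metric_values(results_a, results_b):
    """Return {metric: (value_a, value_b, value_b - value_a)} for metrics
    with a numeric "metric_value" in both result dictionaries."""
    comparison = {}
    for name in sorted(set(results_a) & set(results_b)):
        value_a = results_a[name].get("metric_value")
        value_b = results_b[name].get("metric_value")
        if isinstance(value_a, (int, float)) and isinstance(value_b, (int, float)):
            comparison[name] = (value_a, value_b, value_b - value_a)
    return comparison


# Values copied from the generation outputs shown in this notebook
builtin = {
    "normalized_recall": {"metric_value": 0.7201},
    "hap_score": {"metric_value": 0.0141},
}
custom = {"normalized_recall": {"metric_value": 0.7232754525118498}}

for name, (a, b, delta) in compare_metric_values(builtin, custom).items():
    print(f"{name}: built-in={a:.4f} custom={b:.4f} delta={delta:+.4f}")
```

With the notebook's variables, the call would be `compare_metric_values(final_results_ja_1_generation, result_ja_2_generation)`.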


Author: <a href="mailto:kshitij.g1@ibm.com">Kshitij Gopali</a>

Copyright © 2024 IBM.