# Using IBM watsonx.governance metrics toolkit to assess the risk of a foundation model


This notebook should be run using Python 3.10.

This notebook evaluates the risks associated with a given foundation model (FM). The risk assessment results can be stored in OpenPages or exported as a PDF file.


### Contents

- Setup
- Configure credentials 
- Evaluate the risks for a FM



## Setup <a name="settingup"></a>

### Replace your username and artifactory token below, and use the correct version for ibm-metrics-plugin

In [None]:
!pip install "ibm-metrics-plugin[mra]"

Note: you may need to restart the kernel to use updated packages.

To evaluate the risks of a watsonx.ai model, you must provide CPD or Cloud credentials in the system_credentials below. To evaluate the risk of an external LLM, a wrapper scoring function is required; a sample wrapper scoring function is provided later in the notebook.


The computed metrics can be displayed in the cell output as JSON or in table format. They can also be saved to OpenPages; the next time you run the evaluation, the saved metrics are retrieved from OpenPages instead of being computed again.

To store metrics in OpenPages, you must provide CPD or Cloud credentials in the system_credentials below.

## Credentials Details:


#### Below are the inputs for the API:

| API input | Expected value | Optional/Required |
|------------|-----|-----|
| system_credentials | Contains the details needed to connect to OpenPages and IBM watsonx.ai. | Required |
| risk_dimensions | List of risks to be evaluated. If set to None, all available risks are evaluated. | Optional |
| scoring_function | Function that encapsulates all the logic needed to run inference against an external LLM. | Optional |
| foundation_model_name | Name of the foundation model under evaluation. Needed if you want to export the metrics to a PDF. | Optional |
| max_sample_size | Maximum number of data instances used for evaluation. Defaults to None, which loads all data and increases evaluation time. | Optional |

#### Below are the system_credentials dictionary key definitions:

| Dictionary key | Expected dictionary value | Optional/Required |
|------------|-----|-----|
| op_url | <Wx.gov Software URL>/openpages-openpagesinstance-cr-grc | Optional |
| op_username | OpenPages cloud instance username | Optional |
| op_password | OpenPages cloud instance password | Optional |
| op_model_name | OpenPages FM Model ID to which the metrics need to be published | Optional |
| op_cpd_host | Your CPD or wx.gov Software URL - without https:// | Optional |
| op_cpd_apikey | CPD API key for OpenPages | Optional |
| watsonx_ai_cloud_apikey | Your Cloud API key for watsonx.ai | Required (if you do not provide a scoring function for an external LLM) |
| watsonx_ai_project_id | Project ID for inference against the watsonx.ai model | Required (if you do not provide a scoring function for an external LLM) |
| watsonx_ai_endpoint_url | Cloud Software URL for watsonx.ai | Required (if you do not provide a scoring function for an external LLM) |
| watsonx_ai_model_id | watsonx.ai model ID, e.g. google/flan-t5-xxl | Required (if you do not provide a scoring function for an external LLM) |
| watsonx_ai_cpd_username | Your CPD username for watsonx.ai | Required (if you do not provide a scoring function for an external LLM) |
| watsonx_ai_cpd_apikey | Your CPD API key for watsonx.ai | Required (if you do not provide a scoring function for an external LLM) |
    



In [1]:
OP_URL = "<EDIT THIS>"
OP_USERNAME = "<EDIT THIS>"
OP_PASSWORD = "<EDIT THIS>"
OP_MODEL_NAME = "<EDIT THIS>"
OP_HOST = "<EDIT THIS>"
OP_APIKEY = "<EDIT THIS>"

CLOUD_APIKEY = "<EDIT THIS>"
PROJECT_ID = "<EDIT THIS>"
ENDPOINT_URL = "<EDIT THIS>"
CPD_USERNAME = "<EDIT THIS>"
CPD_APIKEY = "<EDIT THIS>"
WATSONX_AI_MODEL_ID = "<EDIT THIS>"


# Set this variable to the watsonx.ai model name, e.g. google/flan-t5-xxl.
# If evaluating the risk of an external LLM, set it to the name of the FM under evaluation.
foundation_model_name = "<EDIT THIS>"


# List of risk names; set to None to evaluate all available risks
risk_dimensions = ["<EDIT THIS>"]


# This setup accesses watsonx.ai using Cloud
system_credentials = {
    "watsonx_ai_cloud_apikey": CLOUD_APIKEY,
    "watsonx_ai_endpoint_url": ENDPOINT_URL,
    "watsonx_ai_project_id": PROJECT_ID,
    "watsonx_ai_model_id": WATSONX_AI_MODEL_ID,
    
}


# Uncomment to access watsonx.ai using CPD
# system_credentials = {
#     "watsonx_ai_endpoint_url": ENDPOINT_URL,
#     "watsonx_ai_project_id": PROJECT_ID,
#     "watsonx_ai_cpd_username": CPD_USERNAME,
#     "watsonx_ai_cpd_apikey": CPD_APIKEY,
# }




# Uncomment the step below to push the results to OpenPages using CPD
# system_credentials.update({
#     "op_cpd_host": OP_HOST,
#     "op_username": OP_USERNAME,
#     "op_password": OP_PASSWORD, 
#     "op_cpd_apikey": OP_APIKEY,  
#     "op_model_name": OP_MODEL_NAME,
#     "op_url": OP_URL
# })


# Uncomment the step below to push the results to OpenPages using Cloud
# system_credentials.update({
#     "op_username": OP_USERNAME,
#     "op_password": OP_PASSWORD, 
#     "op_model_name": OP_MODEL_NAME,
#     "op_url": OP_URL
# })
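
Before running the evaluation, it can help to confirm that no entry in system_credentials still holds the "<EDIT THIS>" placeholder. The following is a minimal sketch of such a check; `find_placeholder_keys` is a hypothetical helper and not part of ibm-metrics-plugin, and the dictionary values are made up for illustration.

```python
PLACEHOLDER = "<EDIT THIS>"  # placeholder string used in the credentials cell


def find_placeholder_keys(credentials, placeholder=PLACEHOLDER):
    """Return the sorted keys whose value is empty or still the placeholder.

    Hypothetical helper for illustration; not part of ibm-metrics-plugin.
    """
    return sorted(
        key
        for key, value in credentials.items()
        if value is None or value == "" or value == placeholder
    )


# Made-up values, for illustration only
example_credentials = {
    "watsonx_ai_cloud_apikey": "<EDIT THIS>",
    "watsonx_ai_endpoint_url": "https://example.invalid",
    "watsonx_ai_project_id": "",
    "watsonx_ai_model_id": "google/flan-t5-xxl",
}

unset_keys = find_placeholder_keys(example_credentials)
print(unset_keys)
# ['watsonx_ai_cloud_apikey', 'watsonx_ai_project_id']
```

In the notebook, the same call could be made as `find_placeholder_keys(system_credentials)` right after the dictionary is built.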

### Scoring function
To evaluate the risk of an external LLM, a wrapper scoring function is required. The scoring function takes the prompts as input and returns the model predictions. A sample scoring function is provided below.

In [11]:
scoring_function = None
# sample_scoring_function: to evaluate the risk of an external LLM, uncomment the code below to use sample_scoring_function instead of watsonx.ai

# !pip install transformers
# !pip install SentencePiece
# from transformers import T5Tokenizer, T5ForConditionalGeneration
# import pandas as pd
# tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")


# def sample_scoring_function(data):
#     predictions_list = []
#     print("working")
#     for prompt_text in data.iloc[:, 0].values.tolist():
#         input_ids = tokenizer(prompt_text, return_tensors="pt").input_ids
#         output = model.generate(input_ids)
#         output = tokenizer.decode(output[0], skip_special_tokens=True)
#         predictions_list.append(output)
#     return pd.DataFrame({"generated_text": predictions_list})


# scoring_function = sample_scoring_function
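
The contract followed by the commented sample can be sketched with a small stub: the function receives a pandas DataFrame whose first column holds the prompts, and it returns a DataFrame with one prediction per prompt in a generated_text column. This sketch is based only on the sample above; `echo_scoring_function` and the `prompt` column name are hypothetical, and the stub does not call any model.

```python
import pandas as pd


def echo_scoring_function(data: pd.DataFrame) -> pd.DataFrame:
    """Stub scoring function: returns a canned answer for each prompt.

    Hypothetical example; replace the loop body with a call to your own LLM.
    """
    predictions = []
    for prompt_text in data.iloc[:, 0].tolist():
        # A real implementation would send prompt_text to the external LLM here
        predictions.append(f"echo: {prompt_text}")
    return pd.DataFrame({"generated_text": predictions})


# Quick shape check with made-up prompts
prompts = pd.DataFrame({"prompt": ["What is 2 + 2?", "Name a primary color."]})
result = echo_scoring_function(prompts)
print(result["generated_text"].tolist())
# ['echo: What is 2 + 2?', 'echo: Name a primary color.']
```

A stub like this is useful for checking the input and output shapes before loading a large model.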


In [None]:
from ibm_metrics_plugin.mra.evaluate_fm_risk import EvaluateFMRisk
from ibm_wos_utils.joblib.utils.notebook_utils import create_download_link_for_file
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
!python -m nltk.downloader stopwords





### Evaluate the risks for a FM and show the computed metrics as JSON 

In the cell below, max_sample_size is set to 5 to demonstrate the use case and save time. For a proper evaluation of the LLM, use the full dataset to obtain meaningful results.

Note: If the full dataset is used, expect the evaluation to run for hours, depending on the risk and the dataset size.

In [7]:

fm_evaluate = EvaluateFMRisk(
    system_credentials=system_credentials,
    foundation_model_name=foundation_model_name,
    risk_dimensions=risk_dimensions,
    scoring_function=scoring_function,
    max_sample_size=5,  # small sample for a quick demonstration
)
risks_metrics_results = fm_evaluate.evaluate_fm_risks()
print(risks_metrics_results)



# Uncomment the code below to use the full dataset and obtain meaningful results.
# max_sample_size = None

# fm_evaluate = EvaluateFMRisk(
#     system_credentials=system_credentials,
#     foundation_model_name=foundation_model_name,
#     risk_dimensions=risk_dimensions,
#     scoring_function=scoring_function,
#     max_sample_size = max_sample_size
# )

# risks_metrics_results = fm_evaluate.evaluate_fm_risks()
# print(risks_metrics_results)


Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

working
{'hallucination': {'cards.value_alignment.hallucinations.truthfulqa': {'score_name': 'rougeL', 'score': 0.144, 'num_of_instances': 5, 'counts': 1.0, 'totals': 8.0, 'precisions': 0.106, 'bp': 0.02, 'sys_len': 14, 'ref_len': 69, 'sacrebleu': 0.001, 'score_ci_low': 0.054, 'score_ci_high': 0.209, 'sacrebleu_ci_low': 0.0, 'sacrebleu_ci_high': 0.003, 'rougeL': 0.144, 'rouge1': 0.144, 'rouge2': 0.019, 'rougeLsum': 0.144, 'rougeL_ci_low': 0.054, 'rougeL_ci_high': 0.209, 'rouge1_ci_low': 0.054, 'rouge1_ci_high': 0.209, 'rouge2_ci_low': 0.0, 'rouge2_ci_high': 0.076, 'rougeLsum_ci_low': 0.054, 'rougeLsum_ci_high': 0.209}}}
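
The printed dictionary is nested as risk name, then card name, then metric values, with the headline metric named in score_name and its value in score. A minimal sketch for pulling out just those headline scores follows; `summarize_scores` is a hypothetical helper, and the example dictionary is a subset of the output shown above.

```python
def summarize_scores(results):
    """Return (risk, card, score_name, score) tuples from a nested results dict.

    Assumes the shape seen in the output above:
    {risk_name: {card_name: {"score_name": ..., "score": ..., ...}}}
    """
    rows = []
    for risk_name, cards in results.items():
        for card_name, metrics in cards.items():
            rows.append(
                (risk_name, card_name, metrics.get("score_name"), metrics.get("score"))
            )
    return rows


# Subset of the output shown above
example_results = {
    "hallucination": {
        "cards.value_alignment.hallucinations.truthfulqa": {
            "score_name": "rougeL",
            "score": 0.144,
            "num_of_instances": 5,
            "score_ci_low": 0.054,
            "score_ci_high": 0.209,
        }
    }
}

for risk, card, name, score in summarize_scores(example_results):
    print(f"{risk} | {card} | {name} = {score}")
# hallucination | cards.value_alignment.hallucinations.truthfulqa | rougeL = 0.144
```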


### Show the computed metrics in a table format 

In [8]:
for risk_name, risks_metrics in risks_metrics_results.items():
    df = pd.DataFrame.from_dict(risks_metrics)
    df = df.reset_index().rename(columns={"index":"Metric Name"})
    print(risk_name)
    display(df)


hallucination


|  | Metric Name | cards.value_alignment.hallucinations.truthfulqa |
|---|---|---|
| 0 | bp | 0.02 |
| 1 | counts | 1.0 |
| 2 | num_of_instances | 5 |
| 3 | precisions | 0.106 |
| 4 | ref_len | 69 |
| 5 | rouge1 | 0.144 |
| 6 | rouge1_ci_high | 0.209 |
| 7 | rouge1_ci_low | 0.054 |
| 8 | rouge2 | 0.019 |
| 9 | rouge2_ci_high | 0.076 |


### Export the computed metrics to PDF report

In [None]:

import os

# Write the PDF report to the current working directory
output_path = os.getcwd()

pdf_path = fm_evaluate.get_pdf_report(risks_metrics_results, output_path)
pdf_file = create_download_link_for_file(pdf_path)
display(pdf_file)


### (Optional step) HuggingFace
Some models used by risks (for example, social-bias) may require access to IBM gated models. In this case, you must set an environment variable with a Hugging Face token.

For example, to get access to https://huggingface.co/ibm/social-bias-detector-v0, join the IBM HF interest group: https://huggingface.co/ibm


Note: If the token is not found, any risk that requires access to IBM gated models will be skipped.

In [None]:
import os
HF_TOKEN = "<EDIT THIS>"
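# Setting os.environ is enough: the variable is visible to this process and to any
# child process it starts. An `export` run through os.system only affects a
# temporary subshell, so it is not needed.
os.environ["HF_ACCESS_TOKEN"] = HF_TOKEN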
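
For background on how environment variables propagate: a value assigned through os.environ is inherited by child processes, whereas an `export` executed via os.system only lives in a short-lived subshell. The sketch below illustrates this on a POSIX shell with made-up variable names; `child_sees` is a hypothetical helper.

```python
import os
import subprocess
import sys


def child_sees(name):
    """Return the value of environment variable `name` as seen by a fresh child process."""
    completed = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.environ.get({name!r}, ''))"],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


# Made-up variable names and values, for illustration only
os.environ["MRA_DEMO_TOKEN"] = "dummy-value"
print(child_sees("MRA_DEMO_TOKEN"))  # the child process inherits the value

# The export happens in a temporary subshell and is gone once it exits
os.system("export MRA_DEMO_SUBSHELL_ONLY=dummy-value")
print("MRA_DEMO_SUBSHELL_ONLY" in os.environ)  # the current process is unchanged
```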

### (Optional step) watsonx.ai
Some risks use the LLM-as-a-judge approach, where a judge LLM is used to evaluate the output of the LLM under investigation.
If the judge LLM is too large to run on the client's machine, it must be run on watsonx.ai. In that case, you need access to a watsonx.ai account and must set the appropriate credentials in order to use these risks.
Note: If the credentials are not found, the risk will simply be skipped.

In [None]:
import os

# Setting os.environ is enough; an `export` run through os.system only affects a
# temporary subshell and has no effect on this process.
os.environ["WML_URL"] = ENDPOINT_URL
os.environ["WML_PROJECT_ID"] = PROJECT_ID
os.environ["WML_APIKEY"] = CLOUD_APIKEY
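
Because risks are skipped when these settings are missing, a quick check of the environment before re-running the evaluation can save time. The following is a minimal sketch; `missing_env_vars` is a hypothetical helper, the variable names are the ones set in the cell above, and the example environment is made up.

```python
import os

# Names set in the cell above
JUDGE_ENV_VARS = ("WML_URL", "WML_PROJECT_ID", "WML_APIKEY")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Made-up environment, for illustration only
example_env = {"WML_URL": "https://example.invalid", "WML_PROJECT_ID": ""}
print(missing_env_vars(JUDGE_ENV_VARS, example_env))
# ['WML_PROJECT_ID', 'WML_APIKEY']
```

Calling `missing_env_vars(JUDGE_ENV_VARS)` without a second argument checks the real process environment.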

### (Optional step) Evaluate the risks for a FM again to include any risks that were not computed due to the missing HF token and watsonx.ai credentials


In [None]:
fm_evaluate = EvaluateFMRisk(
    system_credentials=system_credentials,
    foundation_model_name=foundation_model_name,
    risk_dimensions=risk_dimensions,
    scoring_function=scoring_function,
    max_sample_size=5,  # small sample for a quick demonstration
)

risks_metrics_results = fm_evaluate.evaluate_fm_risks()
print(risks_metrics_results)



# Uncomment the code below to use the full dataset and obtain meaningful results.
# max_sample_size = None

# fm_evaluate = EvaluateFMRisk(
#     system_credentials=system_credentials,
#     foundation_model_name=foundation_model_name,
#     risk_dimensions=risk_dimensions,
#     scoring_function=scoring_function,
#     max_sample_size = max_sample_size
# )

# risks_metrics_results = fm_evaluate.evaluate_fm_risks()
# print(risks_metrics_results)
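
To see which risks the second run added, the risk names of the two result dictionaries can be compared. This is a minimal sketch that assumes the results are dictionaries keyed by risk name, as in the output shown earlier; `newly_computed_risks` is a hypothetical helper and the example values are made up. Since the cell above reuses the name risks_metrics_results, the first run's results would need to be kept under a different name beforehand.

```python
def newly_computed_risks(previous_results, current_results):
    """Return the sorted risk names present in current_results but not in previous_results."""
    return sorted(set(current_results) - set(previous_results))


# Made-up results for illustration; real results come from evaluate_fm_risks()
first_run = {"hallucination": {"card_a": {"score": 0.144}}}
second_run = {
    "hallucination": {"card_a": {"score": 0.144}},
    "social-bias": {"card_b": {"score": 0.5}},
}

print(newly_computed_risks(first_run, second_run))
# ['social-bias']
```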