## IBM watsonx.governance - Generate Configuration Archive for LLMs

This notebook generates the drift v2 baseline archive for LLMs.

**Contents:**
1. [Setting up the environment](#setting-up-the-environment) - Pre-requisites: Install Libraries and required dependencies
2. [Training Data](#training-data) - Read the training data as a pandas DataFrame
3. [User Inputs Section](#user-inputs-section) - Provide Model Details, IBM watsonx.governance Services and their configuration
4. [Generate Configuration Archive](#generate-configuration-archive)
5. [Helper Methods](#helper-methods)
6. [Definitions](#definitions)

## Setting up the environment

<b> Installing required packages </b>

In [None]:
%pip install --upgrade "ibm-metrics-plugin~=5.0.0" "ibm-watson-openscale~=3.0.34" | tail -n 1

In [1]:
# ----------------------------------------------------------------------------------------------------
# IBM Confidential
# OCO Source Materials
# 5900-A3Q, 5737-H76
# Copyright IBM Corp. 2023, 2024
# The source code for this Notebook is not published or other-wise divested of its trade 
# secrets, irrespective of what has been deposited with the U.S.Copyright Office.
# ----------------------------------------------------------------------------------------------------

VERSION = "1.1"
#Version History
#1.1: Added support for RAG
#1.0: Initial release

<b> Action: Please restart the kernel </b>

In [2]:
import json
import asyncio
import aiohttp
import pandas as pd
import requests
from cachetools import TTLCache, cached

from ibm_metrics_plugin.metrics.drift_v2.utils.async_utils import \
    gather_with_concurrency
from ibm_watson_openscale.utils.configuration_utility import \
    ConfigurationUtilityLLM

## Training Data

The training data can be either scored or unscored.

*Note: pandas' `read_csv` method infers a data type for each column. To keep a column from being reinterpreted, pass the `dtype` parameter to `read_csv` in this cell. More on this method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).*
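
As a minimal sketch of the note above, the snippet below reads a small in-memory CSV twice: once with default type inference and once with `dtype` set for one column. The sample data and the `zip_code` column are hypothetical and only illustrate the effect.

```python
import io

import pandas as pd

# Hypothetical sample data; the leading zero in zip_code is lost under type inference.
csv_text = "text,zip_code\nhello,01234\nworld,98765\n"

inferred_df = pd.read_csv(io.StringIO(csv_text))
as_text_df = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})

print(inferred_df["zip_code"].tolist())  # values parsed as integers
print(as_text_df["zip_code"].tolist())   # values kept as strings
```

Passing `dtype=str` without a column mapping would keep every column as text.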

In [3]:
training_data_df = pd.read_csv("TO BE EDITED")
print(training_data_df.head())
print("Columns:{}".format(list(training_data_df.columns.values)))

                                                text  humor
0  What does a fastidious female call a condom? g...   True
1  Tennis legend althea gibson to be honored with...  False
2  I heard of a new sex position that i want to t...   True
3  Facebook's trending news topics will now be au...  False
4         Til the w in wnba dosen't stand for worse.   True
Columns:['text', 'humor']


## User Inputs Section

##### _1. Provide Common Parameters_:

Provide the common parameters, such as the problem type, asset type, and prompt variable columns. Read more about these [here](#definitions).

##### _2. Provide a scoring function_

The scoring function should adhere to the following guidelines:

- The scoring function should accept a `pandas.DataFrame` containing all the `prompt_variable_columns`.
- The scoring function should return:
    - a `pandas.DataFrame` containing:
        - all columns of the input `pandas.DataFrame`
        - `prediction_column`
        - `input_token_count`, if available
        - `generated_token_count`, if available
        - `prediction_probability`, obtained by aggregating the token log probabilities
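
The last item can be sketched as follows. This is an illustration only, assuming each generated token is a dictionary with an optional `logprob` key, which is the shape the scoring function in this notebook reads; `aggregate_logprobs` is a hypothetical helper name, not part of any IBM library.

```python
import math


def aggregate_logprobs(generated_tokens):
    """Sum the per-token log probabilities, skipping tokens without one."""
    return sum(token["logprob"] for token in generated_tokens if "logprob" in token)


# Hypothetical token list in the shape assumed above.
tokens = [
    {"text": "Tr", "logprob": -0.5},
    {"text": "ue", "logprob": -0.25},
    {"text": "</s>"},  # no logprob reported for this token
]

total_logprob = aggregate_logprobs(tokens)
print(total_logprob)            # summed log probability
print(math.exp(total_logprob))  # the same value expressed as a joint probability
```

The notebook stores the summed log probability itself; the exponential is shown only to make the value easier to interpret.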


In [4]:
# See the 'Definitions' section for details.

problem_type = "classification"  # Supported problem types: classification, extraction, generation,
                                 # question_answering, summarization and retrieval_augmented_generation.
asset_type = "prompt"
input_data_type = "unstructured_text"
prompt_variable_columns = ["text"] #Mandatory parameter.
meta_columns = []
prediction_column = "generated_text"
input_token_count_column = "input_token_count"
output_token_count_column = "generated_token_count"
prediction_probability_column = "prediction_probability"

common_parameters = {
    "asset_type": asset_type,
    "input_data_type": input_data_type,
    "problem_type" : problem_type, 
    "prompt_variable_columns": prompt_variable_columns,
    "meta_columns": meta_columns,
    "prediction_column": prediction_column,
    "input_token_count_column": input_token_count_column,
    "output_token_count_column": output_token_count_column,
    "prediction_probability_column": prediction_probability_column
}

drift_v2_parameters = {}
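
Because the scoring function reads every prompt variable column from the training data, it can help to confirm up front that those columns exist. The helper below is a hypothetical sketch, not part of the IBM SDK, and the example column names are placeholders.

```python
def find_missing_columns(available_columns, required_columns):
    """Return the required column names that are absent, preserving their order."""
    available = set(available_columns)
    return [column for column in required_columns if column not in available]


# Hypothetical values; in the notebook these would be
# training_data_df.columns and prompt_variable_columns.
example_columns = ["text", "humor"]
example_prompt_variables = ["text"]

missing = find_missing_columns(example_columns, example_prompt_variables)
if missing:
    raise ValueError(f"Prompt variable columns missing from training data: {missing}")
print("All prompt variable columns are present.")
```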


In [5]:
SCORING_URL = "Your deployment scoring URL" #Example: https://us-south.ml.cloud.ibm.com/ml/v1-beta/deployments/{deployment_id}/generation/text?version=2021-05-01
SCORING_BATCH_SIZE = 15

API_KEY = "Your API Key"
TOKEN_GENERATION_URL = "https://iam.cloud.ibm.com/identity/token"
# USERNAME = ""   #Uncomment and edit this line if you are using CPD cluster.


<b> The helper function below creates the IAM token required for scoring. </b>

In [6]:
@cached(cache=TTLCache(maxsize=1024, ttl=1800))
def get_iam_token(apikey=API_KEY,
                  url=TOKEN_GENERATION_URL):

    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/json",
    }
    data = "grant_type=urn%3Aibm%3Aparams%3Aoauth%3Agrant-type%3Aapikey&apikey=" + apikey

    resp = requests.post(url=url, headers=headers, data=data)
    if resp.status_code != 200:
        raise Exception(
            f"Error creating IAM Token. Status Code: {resp.status_code}")

    resp_data = resp.json()
    return resp_data["access_token"]

# Uncomment the following if you are using a CPD cluster

# @cached(cache=TTLCache(maxsize=1024, ttl=1800))
# def get_iam_token(apikey=API_KEY,
#                   url=TOKEN_GENERATION_URL, username=USERNAME):

#     headers = {
#         "Content-Type": "application/json",
#         "Accept": "application/json",
#     }
#     data = {"username": username, "api_key": apikey}

#     resp = requests.post(url=url, headers=headers, json=data)
#     if resp.status_code != 200:
#         raise Exception(
#             "Error creating IAM Token. Status Code: ", resp.status_code)

#     resp_data = resp.json()
#     return resp_data["token"]

In [7]:
SCORING_DELAY = False #Set it to True if you are on Lite plan.
SCORING_DELAY_THRESHOLD = 2

The scoring function defined below scores all rows of the DataFrame in batches of size <i>SCORING_BATCH_SIZE</i>, which is defined in the setup cell above.


In [8]:
async def scoring_fn(training_data, schema):
    data_df_size = len(training_data)

    prediction_column = schema.get("prediction_column")
    input_token_count_column = schema.get("input_token_count_column")
    output_token_count_column = schema.get("output_token_count_column")
    prediction_probability_column = schema.get("prediction_probability_column")
    prompt_variable_columns = schema.get("prompt_variable_columns")

    if prediction_column is None:
        raise ValueError("'prediction_column' must be present in schema")
    if input_token_count_column is None:
        raise ValueError("'input_token_count_column' must be present in schema")
    if prediction_probability_column is None:
        raise ValueError("'prediction_probability_column' must be present in schema")
    if output_token_count_column is None:
        raise ValueError("'output_token_count_column' must be present in schema")
    
    async def perform_scoring(session, training_data, row, index):

        token = get_iam_token()
        headers = {"Content-Type": "application/json",
                   "Authorization": f"Bearer {token}"}

        values = [row[col] for col in prompt_variable_columns]
        scoring_payload = {}
        prompts = {}

        #Generating payload 
        for field, value in zip(prompt_variable_columns, values):
            prompts[field] = value
        scoring_payload["parameters"] = {}
        scoring_payload["parameters"]["prompt_variables"] = prompts
        scoring_payload["parameters"]["return_options"] = {}
        scoring_payload["parameters"]["return_options"]["generated_tokens"] = True
        scoring_payload["parameters"]["return_options"]["token_logprobs"] = True

        scoring_payload = json.dumps(scoring_payload)

        # aiohttp does not raise on HTTP error statuses by default, so check the status explicitly.
        response = await session.post(SCORING_URL, headers=headers, data=scoring_payload)
        if response.status == 401:  # IAM token rejected: request a token and retry once
            # Note: get_iam_token() is cached, so a new token is only fetched after the cache entry expires.
            token = get_iam_token()
            headers["Authorization"] = f"Bearer {token}"
            response = await session.post(
                SCORING_URL, headers=headers, data=scoring_payload)
        try:
            result = await response.json()
        except Exception as e:
            print(str(e))
            return
        
        if SCORING_DELAY is True:
            await asyncio.sleep(SCORING_DELAY_THRESHOLD)
        try:
            output_token_count = result["results"][0]["generated_token_count"]
            training_data.at[index, output_token_count_column] = output_token_count
        except KeyError:
            pass

        try:
            input_token_count = result["results"][0]["input_token_count"]
            training_data.at[index, input_token_count_column] = input_token_count
        except KeyError:
            pass
            
        generated_text = result["results"][0]["generated_text"]

        result_set = result["results"][0]["generated_tokens"]
        log_probabilities = [token["logprob"]
                     for token in result_set if "logprob" in token]
        
        training_data.at[index, prediction_probability_column] = sum(log_probabilities)
        training_data.at[index, prediction_column] = generated_text
        
        print(f"Scored {index}th row of total {data_df_size} rows.", end="\r")
        return training_data
    
    coros = []
    connector = aiohttp.TCPConnector(limit=20)
    
    async with aiohttp.ClientSession(connector=connector) as session:
        scored, index = 0, 0
        for row in training_data.to_dict(orient="records"):
            if scored == SCORING_BATCH_SIZE:
                data = await gather_with_concurrency(*coros) #Wait till a batch is finished before beginning next
                coros.clear()
                scored = 0
            coros.append(perform_scoring(
                session, training_data, row, index=index))
            scored += 1
            index += 1
        if coros:
            data = await gather_with_concurrency(*coros)

    print("\nScoring has been completed.")
    # Every coroutine updates the same DataFrame in place, so return it directly.
    return training_data
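
The batching pattern used by `scoring_fn` can be illustrated in isolation. The sketch below uses only the standard library: `asyncio.gather` stands in for `gather_with_concurrency`, and `fake_score` is a hypothetical placeholder for the HTTP call, so this is not the notebook's implementation.

```python
import asyncio


async def fake_score(index):
    """Hypothetical stand-in for one scoring request."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return index


async def score_in_batches(total_rows, batch_size):
    """Run the coroutines batch by batch, waiting for each batch to finish."""
    results = []
    for start in range(0, total_rows, batch_size):
        stop = min(start + batch_size, total_rows)
        batch = [fake_score(i) for i in range(start, stop)]
        results.extend(await asyncio.gather(*batch))
    return results


batch_results = asyncio.run(score_in_batches(total_rows=7, batch_size=3))
print(batch_results)
```

Inside a notebook, where an event loop is already running, `await score_in_batches(...)` would be used in place of `asyncio.run(...)`.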



## Generate Configuration Archive

Run the following code to generate the drift v2 baseline archive for LLMs.

In [9]:
drift_config = ConfigurationUtilityLLM(training_data_df, common_parameters, scoring_fn=scoring_fn)
drift_config.generate_drift_v2_archive_llm(drift_v2_parameters)

Scoring data..
Scored 45th row of total 50 rows.
Scoring has been completed.
Baseline archive created at path:  /Users/nelwin/Desktop/-/Code/Issues/ntbk-updt/fix/notebooks/t/Cloud/baseline__a76e55be-ce97-4d7e-b0e3-2ce0c1256efd.tar.gz
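
To see which files the archive contains, it can be listed with the standard `tarfile` module. The sketch below builds a small stand-in archive in a temporary directory so that it runs on its own; the member name `example.json` is hypothetical and says nothing about the real archive's contents. `list_archive_members` is likewise a hypothetical helper.

```python
import io
import os
import tarfile
import tempfile


def list_archive_members(archive_path):
    """Return the names of all members in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()


# Build a stand-in archive so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.tar.gz")
    payload = b'{"hello": "world"}'
    with tarfile.open(demo_path, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="example.json")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    member_names = list_archive_members(demo_path)

print(member_names)
```

For the generated baseline, the path printed by the cell above would be passed to `list_archive_members` instead of `demo_path`.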


## Helper Methods

### Read file in COS to pandas dataframe

In [None]:
%pip install ibm-cos-sdk

import types

import ibm_boto3
import pandas as pd
from ibm_botocore.client import Config

# Stub __iter__ that is attached to the COS response body below if it lacks one.
def __iter__(self): return 0

api_key = "TO_BE_EDITED"  # COS API key
resource_instance_id = "TO_BE_EDITED"  # COS resource instance ID
service_endpoint = "TO_BE_EDITED"  # COS service region endpoint
bucket = "TO_BE_EDITED"  # COS bucket name
file_name = "TO_BE_EDITED"  # COS file name
auth_endpoint = "https://iam.ng.bluemix.net/oidc/token"

cos_client = ibm_boto3.client(service_name="s3",
    ibm_api_key_id=api_key,
    ibm_auth_endpoint=auth_endpoint,
    config=Config(signature_version="oauth"),
    endpoint_url=service_endpoint)

body = cos_client.get_object(Bucket=bucket,Key=file_name)["Body"]

# Add the missing __iter__ method so that pandas accepts body as a file-like object
if not hasattr(body, "__iter__"):
    body.__iter__ = types.MethodType(__iter__, body)

training_data_df = pd.read_csv(body)

## Definitions

### Common Parameters

| Parameter | Description | Default Value | Possible Value(s) |
| :- | :- | :- | :- |
| problem_type | One of the prompt task types supported by drift v2 |  | classification, extraction, generation, question_answering, summarization, retrieval_augmented_generation|
| asset_type | The asset type | prompt | prompt |
| input_data_type | The type of input from the dataframe | unstructured_text | unstructured_text |
| prompt_variable_columns | The names of all prompt variable columns | | |
| meta_columns | Optional parameter. List of all metadata columns | | |
| label_column | Optional parameter. The name of the label column | reference_output | |
| prediction_column | Optional parameter. The name of the prediction column | generated_text | |
| input_token_count_column | Optional parameter. The name of column representing token counts of input| input_token_count | |
| output_token_count_column | Optional parameter. The name of column representing token counts of output | generated_token_count | |
| prediction_probability_column | Optional parameter. The name of prediction probability column| prediction_probability | |

Example:
```python
problem_type = "classification"
asset_type = "prompt"
input_data_type = "unstructured_text"
prompt_variable_columns = ["text"]
meta_columns = []
prediction_column = "prediction"
input_token_count_column = "input_token_count"
output_token_count_column = "generated_token_count"
prediction_probability_column = "prediction_probability"
```