## IBM Watson OpenScale - Generate DriftV2 Baseline Archive 

This notebook generates drift v2 baseline archives for LLMs.

**Contents:**
1. [Setting up the environment](#setting-up-the-environment) - Pre-requisites: Install Libraries and required dependencies
2. [Training Data](#Training-Data) - Read the training data as a pandas DataFrame
3. [User Inputs Section](#user-inputs-section) - Provide Model Details, IBM Watson OpenScale Services and their configuration
4. [Generate Configuration Archive](#generate-configuration-archive)
5. [Helper Methods](#helper-methods)
6. [Definitions](#definitions)

## Setting up the environment

<b> Installing required packages </b>

In [None]:
%pip install --upgrade "ibm-metrics-plugin~=4.8.7" "ibm-watson-openscale~=3.0.34" --no-cache | tail -n 1

In [1]:
# ----------------------------------------------------------------------------------------------------
# IBM Confidential
# OCO Source Materials
# 5900-A3Q, 5737-H76
# Copyright IBM Corp. 2023, 2024
# The source code for this Notebook is not published or other-wise divested of its trade 
# secrets, irrespective of what has been deposited with the U.S.Copyright Office.
# ----------------------------------------------------------------------------------------------------

VERSION = "1.0"

<b> Action: Please restart the kernel </b>

In [2]:
import json
import asyncio
import aiohttp
import pandas as pd
import requests
from cachetools import TTLCache, cached

from ibm_watson_openscale.utils.configuration_utility import \
    ConfigurationUtilityLLM
from tqdm.asyncio import tqdm_asyncio

## Training Data

The training data can be either scored or unscored.

*Note: pandas' `read_csv` method infers a data type for each column. To keep a column from being reinterpreted, pass the `dtype` parameter to `read_csv` in this cell. More on this method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).*
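As a small illustration of the note above, the sketch below reads the same CSV text twice, once with inferred types and once with `dtype` set for one column. The `zip_code` column and its values are made up for the example and are not part of the notebook's training data.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; "zip_code" is an illustrative column only.
csv_text = "zip_code,article\n02139,Short text\n10001,Another text\n"

# Default behaviour: pandas infers an integer column and drops the leading zero.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, the column is kept as text exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})

print(inferred["zip_code"].tolist())
print(as_text["zip_code"].tolist())
```

The same `dtype` mapping can be passed to the `pd.read_csv` call in the next cell for any column that should stay as text.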

In [3]:
training_data_df = pd.read_csv("TO BE EDITED")

print(training_data_df.head())
print("Columns:{}".format(list(training_data_df.columns.values)))

          Location   News Agency  \
0   MIAMI, Florida           CNN   
1  LONDON, England           CNN   
2  FLINT, Michigan           CNN   
3              NaN           CNN   
4              NaN  Mental Floss   

                                             article  \
0  Four suspects indicted on murder and burglary ...   
1  Queen Elizabeth helped launch Heathrow's $8.6 ...   
2  Sen. Barack Obama Monday proposed spending bil...   
3  Steven Spielberg led the FBI straight to a sto...   
4  Starting a legitimate business is hard, boring...   

                                          highlights  
0  NEW: 17-year-old alleged shooter appears in Mi...  
1  Queen Elizabeth opens Heathrow Airport's $8.6 ...  
2  Sen. Obama offers plan to spend $10B on school...  
3  Stolen art can be lost for decades . Soft targ...  
4  When a business scam fails, it tends to fail i...  
Columns:['Location', 'News Agency', 'article', 'highlights']


## User Inputs Section

##### _1. Provide Common Parameters_:

Provide the common parameters like the basic problem type, asset type, prompt variable columns, etc. Read more about these [here](#definitions). 

##### _2. Provide a scoring function_

The scoring function should adhere to the following guidelines.

- The scoring function should accept a `pandas.DataFrame` containing all the `prompt_variable_columns`.
- The scoring function should return a `pandas.DataFrame` containing:
    - all columns of the input `pandas.DataFrame`
    - `prediction_column`
    - `input_token_count`, if available
    - `generated_token_count`, if available
    - `prediction_probability`, computed by aggregating the log probabilities of the generated tokens.
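To make the expected output shape concrete, here is a minimal sketch that attaches those columns to a copy of an input frame. `build_scored_frame` and the `fake_outputs` records are hypothetical stand-ins for a real model call, and the column names are the defaults used later in this notebook; the log probabilities are aggregated by summing, as an assumption that matches the scoring function further down.

```python
import pandas as pd


def build_scored_frame(inputs, fake_outputs):
    """Return a copy of `inputs` with the scoring output columns attached."""
    scored = inputs.copy()
    scored["generated_text"] = [o["text"] for o in fake_outputs]
    scored["input_token_count"] = [o["input_tokens"] for o in fake_outputs]
    scored["generated_token_count"] = [len(o["logprobs"]) for o in fake_outputs]
    # Aggregate the per-token log probabilities by summing them.
    scored["prediction_probability"] = [sum(o["logprobs"]) for o in fake_outputs]
    return scored


# Made-up inputs and model outputs, one output record per input row.
inputs = pd.DataFrame({"article": ["First article", "Second article"]})
fake_outputs = [
    {"text": "Summary one", "input_tokens": 2, "logprobs": [-0.5, -0.25]},
    {"text": "Summary two", "input_tokens": 2, "logprobs": [-1.0]},
]

scored = build_scored_frame(inputs, fake_outputs)
print(list(scored.columns))
```

The input columns come first and the four output columns follow, one row per input row.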


In [4]:
# See the 'Definitions' section to learn more.

# Supported problem types: classification, extraction, generation,
# question_answering and summarization.
problem_type = "summarization"
asset_type = "prompt"
input_data_type = "unstructured_text"
prompt_variable_columns = ["TO BE EDITED"] #Mandatory parameter.
meta_columns = []
prediction_column = "generated_text"
input_token_count_column = "input_token_count"
output_token_count_column = "generated_token_count"
prediction_probability_column = "prediction_probability"

common_parameters = {
    "asset_type": asset_type,
    "input_data_type": input_data_type,
    "problem_type" : problem_type, 
    "prompt_variable_columns": prompt_variable_columns,
    "meta_columns": meta_columns,
    "prediction_column": prediction_column,
    "input_token_count_column": input_token_count_column,
    "output_token_count_column": output_token_count_column,
    "prediction_probability_column": prediction_probability_column
}

drift_v2_parameters = {}


In [5]:
SCORING_URL = "Your deployment scoring URL" #Example: https://us-south.ml.cloud.ibm.com/ml/v1-beta/deployments/{deployment_id}/generation/text?version=2021-05-01
SCORING_BATCH_SIZE = 15

API_KEY = "Your API Key"
TOKEN_GENERATION_URL = "https://iam.cloud.ibm.com/identity/token"
# USERNAME = ""   #Uncomment and edit this line if you are using CPD cluster.


<b> The helper function below will be used to create the IAM token required for scoring </b>

In [6]:
@cached(cache=TTLCache(maxsize=1024, ttl=1800))
def get_iam_token(apikey=API_KEY,
                  url=TOKEN_GENERATION_URL):

    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/json",
    }
    data = "grant_type=urn%3Aibm%3Aparams%3Aoauth%3Agrant-type%3Aapikey&apikey=" + apikey

    resp = requests.post(url=url, headers=headers, data=data)
    if resp.status_code != 200:
        raise Exception(
            f"Error creating IAM Token. Status Code: {resp.status_code}")

    resp_data = resp.json()
    return resp_data["access_token"]

# Uncomment the following if you are using a CPD cluster

# @cached(cache=TTLCache(maxsize=1024, ttl=1800))
# def get_iam_token(apikey=API_KEY,
#                   url=TOKEN_GENERATION_URL, username=USERNAME):

#     headers = {
#         "Content-Type": "application/json",
#         "Accept": "application/json",
#     }
#     data = {"username": username, "api_key": apikey}

#     resp = requests.post(url=url, headers=headers, json=data)
#     if resp.status_code != 200:
#         raise Exception(
#             f"Error creating IAM Token. Status Code: {resp.status_code}")

#     resp_data = resp.json()
#     return resp_data["token"]
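The uncommented `get_iam_token` above assembles the form body by string concatenation, which leaves the API key unescaped. As a sketch of an alternative, the standard library's `urllib.parse.urlencode` produces the same body for ordinary keys and also escapes characters such as `&` or `=`. `build_token_request_body` is a hypothetical helper name, and the key values shown are placeholders.

```python
from urllib.parse import urlencode


def build_token_request_body(apikey):
    """Build the x-www-form-urlencoded body for the IAM token request."""
    return urlencode({
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": apikey,
    })


# Placeholder key: the result matches the hand-written string in get_iam_token.
body = build_token_request_body("example-key")
print(body)

# A key containing reserved characters is percent-encoded instead of
# being read as extra form fields.
print(build_token_request_body("a&b=c"))
```

The returned string can be passed as `data=` to `requests.post` in place of the concatenated one.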

In [7]:
SCORING_DELAY = False #Set it to True if you are on Lite plan.
SCORING_DELAY_THRESHOLD = 2

The scoring function defined below scores every row of the data frame, keeping at most <i>SCORING_BATCH_SIZE</i> connections to the scoring endpoint open at a time. <i>SCORING_BATCH_SIZE</i> is defined in the setup cell above.


In [8]:
async def scoring_fn(training_data, schema):
    """
    Perform scoring on the given training data using the schema provided.

    Parameters:
    training_data (DataFrame): The training data.
    schema (dict): The schema containing column names for predictions and token counts.

    Returns:
    DataFrame: The training data with updated scores.
    """
    # Extract required columns from the schema
    required_columns = ["prediction_column", "input_token_count_column",
                        "output_token_count_column", "prediction_probability_column",
                        "prompt_variable_columns"]

    for col in required_columns:
        if schema.get(col) is None:
            raise ValueError(f"'{col}' must be present in schema")

    async def score_single_row(row):
        """
        Perform scoring for a single row of data.

        Parameters:
        row (dict): A single row of training data.

        Returns:
        dict: Maps the prediction, prediction_probability, input_token_count and output_token_count column names to their values.
        """
        token = get_iam_token()
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}"
        }

        prompt_variables = {field: str(row[field])
                            for field in schema["prompt_variable_columns"]}
        scoring_payload = json.dumps({
            "parameters": {
                "prompt_variables": prompt_variables,
                "return_options": {
                    "generated_tokens": True,
                    "token_logprobs": True
                }
            }
        })

        try:
            result = await session.post(SCORING_URL, headers=headers, data=scoring_payload)
            if result.status == 401:
                # Retry once; get_iam_token() may return a cached token until its TTL expires
                token = get_iam_token()
                headers["Authorization"] = f"Bearer {token}"
                result = await session.post(SCORING_URL, headers=headers, data=scoring_payload)
            if not result.ok:
                raise Exception(f"Failed while scoring: Reason: {result.status}: {await result.text()}")
            # Parse the body exactly once, for both the first attempt and the retry
            result = await result.json()
        except aiohttp.ClientError as err:
            # Not every ClientError carries status/message, so report the error itself
            raise Exception(f"Failed while scoring: {err}") from err

        if SCORING_DELAY:
            await asyncio.sleep(SCORING_DELAY_THRESHOLD)

        result_data = result["results"][0]
        prediction = result_data.get("generated_text", "")
        prediction_probability = sum(
            token.get("logprob", 0) for token in result_data.get("generated_tokens", [])
        )
        input_token_count = result_data.get("input_token_count", None)
        output_token_count = result_data.get("generated_token_count", None)
        return {
            schema["prediction_column"]: prediction,
            schema["prediction_probability_column"]: prediction_probability,
            schema["input_token_count_column"]: input_token_count,
            schema["output_token_count_column"]: output_token_count
        }

    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=SCORING_BATCH_SIZE)) as session:
        tasks = [
            score_single_row(row)
            for row in training_data.to_dict(orient="records")
        ]
        results = await tqdm_asyncio.gather(*tasks, desc="Scoring training data... ", unit="rows")

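    # Reuse the input index so rows stay aligned even when it is not 0..n-1
    scores = pd.DataFrame(results, index=training_data.index)
    training_data = pd.concat([training_data, scores], axis=1)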
    return training_data
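The scoring function above caps concurrency through the `limit` argument of `aiohttp.TCPConnector`. The same idea can be tried without a network by swapping in a different mechanism, `asyncio.Semaphore`, as in the sketch below. `score_all` and `fake_score` are hypothetical names, and the short sleep stands in for the HTTP call.

```python
import asyncio


async def score_all(rows, max_concurrency):
    """Score every row while keeping at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def fake_score(row):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the scoring request
            in_flight -= 1
            return {"generated_text": row.upper()}

    # gather returns results in the order the tasks were passed in.
    results = await asyncio.gather(*(fake_score(row) for row in rows))
    return results, peak


results, peak = asyncio.run(score_all(["a", "b", "c", "d", "e"], max_concurrency=2))
print(peak, [r["generated_text"] for r in results])
```

Inside a notebook, where an event loop is already running, `await score_all(...)` would be used instead of `asyncio.run(...)`.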

## Generate Configuration Archive

Run the following code to generate the drift v2 baseline archive for LLMs

In [9]:
drift_config = ConfigurationUtilityLLM(training_data_df, common_parameters, scoring_fn=scoring_fn)
drift_config.generate_drift_v2_archive_llm(drift_v2_parameters)

Scoring data..


Scoring training data... : 100%|██████████| 50/50 [00:17<00:00,  2.93rows/s]


Please install correct version of ibm-metrics-plugin["notebook"] to compute embeddings.
Baseline archive created at path:  /Users/prempiyush/work/code/python-playground/notebooks/baseline__c19f5682-dd51-46bf-b723-408ae30011bf.tar.gz
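Once the archive path has been printed, the standard library's `tarfile` module can be used to check what the archive holds. The sketch below is a hedged example: `list_archive_members` is a hypothetical helper, and because the real archive's contents depend on the run, it builds a tiny stand-in archive (`demo_baseline.tar.gz` with a single `placeholder.json`) so that it runs on its own.

```python
import io
import os
import tarfile
import tempfile


def list_archive_members(archive_path):
    """Return the sorted member names of a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(archive.getnames())


# Build a stand-in archive so the sketch does not need a real baseline file.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_baseline.tar.gz")
with tarfile.open(demo_path, "w:gz") as archive:
    payload = b'{"placeholder": true}'
    info = tarfile.TarInfo(name="placeholder.json")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

print(list_archive_members(demo_path))
```

To inspect an actual baseline, pass the path printed by the cell above to `list_archive_members`.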


## Helper Methods

### Read file in COS to pandas dataframe

In [None]:
%pip install ibm-cos-sdk

import ibm_boto3
import pandas as pd
import sys
import types

from ibm_botocore.client import Config

def __iter__(self): return 0

api_key = "TO_BE_EDITED" # cos api key
resource_instance_id = "TO_BE_EDITED" # cos resource instance id
service_endpoint =  "TO_BE_EDITED" # cos service region endpoint
bucket =  "TO_BE_EDITED" # cos bucket name
file_name= "TO_BE_EDITED" # cos file name
auth_endpoint = "https://iam.ng.bluemix.net/oidc/token"

cos_client = ibm_boto3.client(service_name="s3",
    ibm_api_key_id=api_key,
    ibm_auth_endpoint=auth_endpoint,
    config=Config(signature_version="oauth"),
    endpoint_url=service_endpoint)

body = cos_client.get_object(Bucket=bucket,Key=file_name)["Body"]

# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

training_data_df = pd.read_csv(body)
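An alternative to patching `__iter__` onto the body is to read its bytes into memory and wrap them in `io.BytesIO`, which pandas accepts as a file-like object. The sketch below assumes the object under the `"Body"` key exposes a `read()` method, as botocore-style streaming bodies do. `FakeStreamingBody` and `body_to_dataframe` are hypothetical names used so the example runs without a COS connection.

```python
import io

import pandas as pd


class FakeStreamingBody:
    """Stand-in for the COS response body; only read() is modeled."""

    def __init__(self, payload):
        self._payload = payload

    def read(self):
        return self._payload


def body_to_dataframe(body):
    """Load the whole object into memory and parse it as CSV."""
    return pd.read_csv(io.BytesIO(body.read()))


demo_body = FakeStreamingBody(b"article,highlights\nSome text,Some summary\n")
demo_df = body_to_dataframe(demo_body)
print(demo_df.shape)
```

This reads the full object into memory before parsing, so it suits training files that fit comfortably in RAM.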

## Definitions

### Common Parameters

| Parameter | Description | Default Value | Possible Value(s) |
| :- | :- | :- | :- |
| problem_type | One of the prompt task types supported by drift v2 |  | classification, extraction, generation, question_answering, summarization|
| asset_type | The asset type | prompt | prompt |
| input_data_type | The type of input from the dataframe | unstructured_text | unstructured_text |
| prompt_variable_columns | The names of all prompt variable columns | | |
| meta_columns | Optional parameter. List of all metadata columns | | |
| label_column | Optional parameter. The name of label column| reference_output | |
| prediction_column | Optional parameter. The name of the prediction column | generated_text | |
| input_token_count_column | Optional parameter. The name of column representing token counts of input| input_token_count | |
| output_token_count_column | Optional parameter. The name of column representing token counts of output | generated_token_count | |
| prediction_probability_column | Optional parameter. The name of prediction probability column| prediction_probability | |

Example:
```python
problem_type = "classification"
asset_type = "prompt"
input_data_type = "unstructured_text"
prompt_variable_columns = ["text"]
meta_columns = []
prediction_column = "prediction"
input_token_count_column = "input_token_count"
output_token_count_column = "generated_token_count"
prediction_probability_column = "prediction_probability"
```
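Before building the archive, the values in the table can be checked with a few lines of plain Python. The sketch below is a hypothetical helper, not part of `ibm-watson-openscale`. It only enforces what the table and the cell comments state: the supported problem types, and that `prompt_variable_columns` is mandatory.

```python
SUPPORTED_PROBLEM_TYPES = {
    "classification",
    "extraction",
    "generation",
    "question_answering",
    "summarization",
}


def validate_common_parameters(params):
    """Return a list of human-readable problems; an empty list means none found."""
    errors = []
    problem_type = params.get("problem_type")
    if problem_type not in SUPPORTED_PROBLEM_TYPES:
        errors.append(f"unsupported problem_type: {problem_type!r}")
    columns = params.get("prompt_variable_columns")
    if not isinstance(columns, list) or not columns:
        errors.append("prompt_variable_columns must be a non-empty list")
    return errors


good = {"problem_type": "classification", "prompt_variable_columns": ["text"]}
bad = {"problem_type": "translation", "prompt_variable_columns": []}

print(validate_common_parameters(good))
print(validate_common_parameters(bad))
```

Running the check on `common_parameters` before calling `ConfigurationUtilityLLM` surfaces a misspelled problem type or an empty column list early.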