<img src="https://github.com/pmservice/ai-openscale-tutorials/raw/master/notebooks/images/banner.png" align="left" alt="banner">

# IBM watsonx.governance - Generate Configuration Archive for Unstructured Text models

This notebook demonstrates how to generate a configuration archive for monitoring deployments in IBM watsonx.governance. This configuration is targeted at `System-Managed` monitored deployments.

***Target audience for this notebook:***
This notebook is intended for:
- Users who want to monitor their subscriptions created on unstructured text data in IBM watsonx.governance

You must provide the necessary inputs where marked. The generated configuration package can be used in the IBM watsonx.governance UI when configuring monitoring for a model deployment.

**Contents:**
1. [Setting up the environment](#setting-up-the-environment) - Pre-requisites: Install Libraries and required dependencies
2. [Training Data](#training-data) - Read the training data as a pandas DataFrame
3. [User Inputs Section](#user-inputs-section) - Provide Model Details, IBM watsonx.governance Services and their configuration
4. [Drift v2 Archive](#drift-v2-archive) - Generate the configuration archive
5. [Helper Methods](#helper-methods)
6. [Definitions](#definitions)

## Setting up the environment

In [None]:
%pip install --upgrade "ibm-metrics-plugin[notebook]~=3.0.9" "ibm-watson-openscale~=3.0.34" | tail -n 1

In [1]:
# ----------------------------------------------------------------------------------------------------
# IBM Confidential
# OCO Source Materials
# 5900-A3Q, 5737-H76
# Copyright IBM Corp. 2024
# The source code for this Notebook is not published or other-wise divested of its trade 
# secrets, irrespective of what has been deposited with the U.S.Copyright Office.
# ----------------------------------------------------------------------------------------------------

VERSION = "1.0.0"

## Training Data
*Note: Pandas' `read_csv` method infers a data type for each column. If you do not want a column's type to be inferred, pass the `dtype` parameter to `read_csv` in this cell. More on this method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).*

*Note: By default, NA values are dropped when computing the training data distribution. Make sure NA values are handled when reading the data with `read_csv`.*
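
As a minimal sketch of both notes, the cell below reads a small inline CSV (hypothetical sample data standing in for your training file) with an explicit `dtype` so that `Rating` stays a string, and then handles NA values before any distribution is computed. The column names mirror the sample output in this notebook. Whether to drop or fill NA values is a choice that depends on your data, so treat the strategy here as an assumption.

```python
import io
import pandas as pd

# Hypothetical inline sample standing in for the training data CSV.
sample_csv = io.StringIO(
    "BrandName,Price,Rating,Reviews,ReviewVotes\n"
    "Samsung,199.99,5,Very pleased,0\n"
    "Samsung,199.99,4,,1\n"
    "Nokia,,3,Works fine,\n"
)

# Keep Rating as a string label instead of letting pandas infer an integer type.
sample_df = pd.read_csv(sample_csv, dtype={"Rating": str})

# Drop rows where the text column is missing, then fill the remaining numeric gaps.
cleaned_df = sample_df.dropna(subset=["Reviews"]).fillna(
    {"Price": 0.0, "ReviewVotes": 0}
)

print(sample_df.dtypes)
print(cleaned_df)
```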

In [2]:
import pandas as pd
training_data_df = pd.read_csv("TO BE EDITED")

print(training_data_df.head())
print("Columns:{}".format(list(training_data_df.columns.values)))

  BrandName   Price  Rating  \
0   Samsung  199.99       5   
1   Samsung  199.99       4   
2   Samsung  199.99       5   
3   Samsung  199.99       4   
4   Samsung  199.99       4   

                                             Reviews  ReviewVotes  
0  I feel so LUCKY to have found this used (phone...            1  
1  nice phone, nice up grade from my pantach revu...            0  
2                                       Very pleased            0  
3  It works good but it goes slow sometimes but i...            0  
4  Great phone to replace my lost phone. The only...            0  
Columns:['BrandName', 'Price', 'Rating', 'Reviews', 'ReviewVotes']


## User Inputs Section

##### _1. Provide Common Parameters_:

Provide the common parameters, which cover basic model details such as the problem type and feature columns. Read more about these [here](#common-parameters).

##### _2. Provide Drift v2 Parameters_
Read more about these parameters [here](#drift-v2-parameters).


##### _3. Provide a scoring function_
The scoring function is required and must follow these guidelines.

- The scoring function should accept `training_data` and `schema`. `training_data` is either a `pandas.DataFrame` or a local file path to images, with sub-folders acting as labels for the images. `schema` is a dictionary specifying the column names for the components of the scoring response, such as `prediction_column`, `probability_column`, `input_token_count_column`, `output_token_count_column`, `prediction_probability_column`, and `label_column`, depending on whether the input data is structured, prompt, or unstructured image data.
- The scoring function should return a `pandas.DataFrame` with all columns of the input DataFrame, plus additional columns that depend on the `problem_type`:
    - For binary and multiclass problems, both `probability_column` and `prediction_column` are included.
    - For regression, only `prediction_column` is included.
    - Prompt asset related problems may include columns like `input_token_count_column`, `output_token_count_column`, and `prediction_probability_column`.
    - For `unstructured_image` input types, the `label_column` is also included in the output DataFrame.
- The label column and the prediction column should have the same data type and the same set of unique class labels.
- A range of scoring function templates is provided [here](https://github.com/IBM/watson-openscale-samples/wiki/Score-function-templates-for-IBM-Watson-OpenScale).
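
The guidelines above can be sketched for a binary text classifier as follows. This is an illustration under stated assumptions, not one of the official templates. The function name `example_scoring_fn`, the keyword-counting "model", and the `Reviews` text column are all hypothetical, and the `(training_data, schema)` signature is inferred from the guidelines. Replace the body with a call to your deployed model.

```python
import pandas as pd

# Hypothetical stand-in for a real model: a handful of positive keywords.
POSITIVE_WORDS = {"great", "nice", "pleased", "good", "lucky"}
TEXT_COLUMN = "Reviews"  # assumed name of the text column in the training data


def example_scoring_fn(training_data: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Return all input columns plus prediction and probability columns."""
    prediction_column = schema.get("prediction_column", "prediction")
    probability_column = schema.get("probability_column", "probability")

    scored = training_data.copy()
    predictions, probabilities = [], []
    for text in scored[TEXT_COLUMN].fillna("").astype(str):
        words = [word.strip(".,!?") for word in text.lower().split()]
        hits = sum(word in POSITIVE_WORDS for word in words)
        # Toy probability of the positive class; a real model would supply this.
        p_positive = min(0.5 + 0.25 * hits, 1.0) if hits else 0.2
        predictions.append("positive" if p_positive >= 0.5 else "negative")
        probabilities.append([1.0 - p_positive, p_positive])

    # Same data type and class labels as the label column, per the guidelines.
    scored[prediction_column] = pd.Series(predictions, index=scored.index)
    scored[probability_column] = pd.Series(probabilities, index=scored.index)
    return scored


example_df = pd.DataFrame(
    {"Reviews": ["Very pleased", "It goes slow sometimes"],
     "label": ["positive", "negative"]}
)
example_schema = {"prediction_column": "prediction", "probability_column": "probability"}
example_scored_df = example_scoring_fn(example_df, example_schema)
print(example_scored_df)
```

The returned DataFrame keeps every input column and adds one prediction value and one probability array per row, which is the shape the guidelines describe for binary and multiclass problems.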

In [3]:

common_parameters = {
    "problem_type" : "TO_BE_EDITED",
    "input_data_type": "unstructured_text",
    "asset_type": "model",
    "meta_columns": ["TO_BE_EDITED"], # <- Not required if the model doesn't have any meta columns
    "label_column": "TO_BE_EDITED",
    "prediction_column": "TO_BE_EDITED",
    "probability_column": "TO_BE_EDITED", # <- Not required for Regression problems.
    "enable_drift_v2": True,
    "notebook_version": VERSION
}

drift_v2_parameters = {
    # "max_samples": 10000,
    "important_input_metadata_columns" : ["TO_BE_EDITED"] # <- Add this if input metadata drift should be calculated and meta columns are available
}

scoring_fn = None # TO BE EDITED: set this to your scoring function (required)
scoring_batch_size = None # Change this to control how many rows are scored at a time. The default is 50 for image models and 5000 for others.
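
Before generating the archive, it can help to confirm that no placeholder values are left in the parameter dictionaries. The helper below is a hypothetical convenience function, not part of the `ibm-watson-openscale` SDK. It simply scans a dictionary for the `TO_BE_EDITED` marker used in this notebook.

```python
PLACEHOLDER = "TO_BE_EDITED"


def find_unedited_placeholders(parameters: dict) -> list:
    """Return the keys whose value, or any item of a list value, is still the placeholder."""
    unedited = []
    for key, value in parameters.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        if any(item == PLACEHOLDER for item in values):
            unedited.append(key)
    return unedited


# Example dictionary with two fields left unedited (values are illustrative).
example_parameters = {
    "problem_type": "binary",
    "input_data_type": "unstructured_text",
    "meta_columns": ["TO_BE_EDITED"],
    "label_column": "TO_BE_EDITED",
}
unedited_keys = find_unedited_placeholders(example_parameters)
print(unedited_keys)
```

In this notebook you could call it on `common_parameters` and `drift_v2_parameters` and stop if the returned list is not empty.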

## Drift v2 Archive

Run the following code to generate the drift v2 archive for the IBM watsonx.governance monitors. The IBM watsonx.governance UI/SDK uses this archive as is to onboard the model for monitoring: it identifies the drift v2 artifacts and uploads them to the monitor.

In [4]:
from ibm_watson_openscale.utils.configuration_utility import ConfigurationUtility

config_util = ConfigurationUtility(
    training_data=training_data_df,
    common_parameters=common_parameters,
    scoring_fn=scoring_fn,
    batch_size=scoring_batch_size)

config_util.create_drift_configuration_package(
    drift_v2_parameters=drift_v2_parameters if "drift_v2_parameters" in locals() else {},
    display_link=True)

## Helper Methods

### Read file in COS to pandas dataframe

In [None]:
%pip install ibm-cos-sdk

import ibm_boto3
import pandas as pd
import sys
import types

from ibm_botocore.client import Config

def __iter__(self): return 0

api_key = "TO_BE_EDITED" # cos api key
resource_instance_id = "TO_BE_EDITED" # cos resource instance id
service_endpoint = "TO_BE_EDITED" # cos service region endpoint
bucket = "TO_BE_EDITED" # cos bucket name
file_name = "TO_BE_EDITED" # cos file name
auth_endpoint = "https://iam.ng.bluemix.net/oidc/token"

cos_client = ibm_boto3.client(service_name="s3",
    ibm_api_key_id=api_key,
    ibm_service_instance_id=resource_instance_id, # pass the resource instance id defined above
    ibm_auth_endpoint=auth_endpoint,
    config=Config(signature_version="oauth"),
    endpoint_url=service_endpoint)

body = cos_client.get_object(Bucket=bucket,Key=file_name)["Body"]

# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

training_data_df = pd.read_csv(body)
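
If patching `__iter__` onto the response body does not work with your pandas version, one alternative is to read the whole object into memory and wrap it in `io.BytesIO`, which pandas accepts as a file-like object. The sketch below uses a hypothetical `FakeBody` class in place of the real COS response body so that it runs without credentials. With a real client you would pass `cos_client.get_object(Bucket=bucket, Key=file_name)["Body"]` instead. It assumes the body exposes a `read()` method returning bytes and that the file fits in memory.

```python
import io
import pandas as pd


class FakeBody:
    """Hypothetical stand-in for the streaming body returned by get_object."""

    def __init__(self, payload: bytes):
        self._payload = payload

    def read(self) -> bytes:
        return self._payload


def body_to_dataframe(body) -> pd.DataFrame:
    """Read the full object into memory and parse it as CSV."""
    return pd.read_csv(io.BytesIO(body.read()))


fake_body = FakeBody(b"Reviews,Rating\nVery pleased,5\nToo slow,2\n")
cos_df = body_to_dataframe(fake_body)
print(cos_df)
```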

## Definitions

### Common Parameters

| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| problem_type | Enumeration classifying if your model is a binary or a multi-class classifier or a regressor. |  | `binary`, `multiclass`, `regression` |
| asset_type | The type of your asset. |  | `model`|
| input_data_type | The type of your input data. |  | `unstructured_text`|
| label_column | The column which contains the target field (also known as label column or the class label). |  | A string value referring to the column name |
| feature_columns | Columns identified as features by the model. The order of the feature columns should be the same as in the subscription. Use helper methods to compute these if required. |  | A list of column names |
| categorical_columns | Feature columns identified as categorical by the model. Use helper methods to compute these if required. |  | A list of column names |
| prediction_column | The column containing the model output. This should be of the same data type as the label column. |  | A string value referring to the column name |
| probability_column | The column (of type array) containing the model probabilities for all the possible prediction outcomes. This is not required for regression models. One of `probability_column` or `class_probabilities` must be specified for classification models. If both are specified, `class_probabilities` is preferred.|  | A string value referring to the column name |
| class_probabilities | The columns (of type double) containing the model probabilities of class labels. This is not required for regression models. For example, for the Go Sales model deployed in MS Azure ML Studio, the value of this property would be `["Scored Probabilities for Class \"Camping Equipment\"", "Scored Probabilities for Class \"Mountaineering Equipment\"", "Scored Probabilities for Class \"Personal Accessories\""]`. Note that escaping the double quotes is required in this example. One of `probability_column` or `class_probabilities` must be specified for classification models. If both are specified, `class_probabilities` is preferred. |  | A list of column names |
| enable_drift_v2 | Boolean value to allow generation of Drift v2 Archive. | `True` | `True` or `False` |
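
To illustrate how `probability_column` and `class_probabilities` relate, the sketch below starts from a scored DataFrame with one double column per class (the `class_probabilities` layout) and derives a single array-valued column (the `probability_column` layout). The column names and values are made up for illustration. This only shows the two layouts side by side and is not a step the configuration utility requires.

```python
import pandas as pd

# Hypothetical scoring output with one probability column per class.
layout_df = pd.DataFrame({
    "prediction": ["positive", "negative"],
    "prob_negative": [0.1, 0.7],
    "prob_positive": [0.9, 0.3],
})

# class_probabilities layout: a list of double-typed column names.
class_probabilities = ["prob_negative", "prob_positive"]

# probability_column layout: one column holding an array of probabilities per row.
layout_df["probability"] = pd.Series(
    layout_df[class_probabilities].values.tolist(), index=layout_df.index
)

print(layout_df[["prediction", "probability"]])
```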


### Drift v2 Parameters

| Parameter | Description | Default Value | Possible Value(s) |
| :- | :- | :- | :- |
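| max_samples | Defines the maximum sample size on which the drift v2 archive is created. | None | |
| important_input_metadata_columns | The meta columns to use for input metadata drift. Add this if input metadata drift should be calculated and meta columns are available. |  | A list of column names |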