<img src="https://github.com/pmservice/ai-openscale-tutorials/raw/master/notebooks/images/banner.png" align="left" alt="banner">

# IBM watsonx.governance - Generate Drift v2 Archive for Unstructured Image models

This notebook demonstrates how to generate a configuration archive for monitoring deployments in IBM watsonx.governance. This configuration is intended for `System-Managed` monitored deployments.

***Target audience for this notebook:***
This notebook is intended for users in the following category:
- Users who want to monitor their subscriptions created on unstructured image data in IBM watsonx.governance

Provide the necessary inputs where marked. The generated configuration package can be used in the IBM watsonx.governance UI when configuring monitoring for a model deployment.

**Contents:**
1. [Setting up the environment](#setting-up-the-environment) - Pre-requisites: Install Libraries and required dependencies
2. [Training Data](#training-data) - Read the training data as a pandas DataFrame
3. [User Inputs Section](#user-inputs-section) - Provide Model Details, IBM watsonx.governance Services and their configuration
4. [Drift v2 Archive](#drift-v2-archive)
5. [Integrating Unstructured Image Subscriptions with a Defined Training Data Schema](#integrating-unstructured-image-subscriptions-with-a-defined-training-data-schema)
6. [Drift Configuration](#drift-configuration)
7. [Definitions](#definitions)

## Setting up the environment

In [None]:
%pip install --upgrade "ibm-metrics-plugin[notebook]~=5.0.3" "ibm-watson-openscale~=3.0.34" | tail -n 1

In [1]:
# ----------------------------------------------------------------------------------------------------
# IBM Confidential
# OCO Source Materials
# 5900-A3Q, 5737-H76
# Copyright IBM Corp. 2024
# The source code for this Notebook is not published or other-wise divested of its trade 
# secrets, irrespective of what has been deposited with the U.S.Copyright Office.
# ----------------------------------------------------------------------------------------------------

VERSION = "1.0.0"

## Training Data

Training data can be either of the following:
- CSV: The training data can be organized in CSV format with columns for image paths, labels, and optional meta columns. Set `training_data_df` in the section below if the training data is a CSV file.
- String: Pointing to a local directory where subfolders denote distinct labels. For instance, if "/images" is provided, subdirectories like "/images/cats" and "/images/dogs" will correspond to labels for images within. Set the `image_dir` in the below section if the training data is an image directory.
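
To make the directory layout concrete, the sketch below walks a folder whose subfolders are labels and collects one record per image. It only illustrates the layout and is not the implementation of `convert_directory_to_dataframe`; the helper name `list_labelled_images`, the keys `path` and `label`, and the set of extensions are assumptions made for this example.

```python
import tempfile
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def list_labelled_images(image_dir):
    """Return one {"path", "label"} record per image, using the subfolder name as the label."""
    records = []
    for label_dir in sorted(Path(image_dir).iterdir()):
        if not label_dir.is_dir():
            continue  # files directly under image_dir carry no label
        for image_path in sorted(label_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_EXTENSIONS:
                records.append({"path": str(image_path), "label": label_dir.name})
    return records


# Build a throwaway directory tree to demonstrate the layout.
with tempfile.TemporaryDirectory() as root:
    for label, name in [("cats", "a.jpg"), ("cats", "b.png"), ("dogs", "c.jpg")]:
        folder = Path(root) / label
        folder.mkdir(exist_ok=True)
        (folder / name).touch()
    demo_records = list_labelled_images(root)

print([record["label"] for record in demo_records])
```

Passing such a list of records to `pd.DataFrame(...)` would give a frame with one row per image.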

*Note: pandas' `read_csv` method infers a data type for each column. To keep a column from being reinterpreted, pass the `dtype` parameter to `read_csv` in this cell. More on this method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)*

*Note: By default, NA values are dropped when the training data distribution is computed. Handle NA values when reading the data with `read_csv`.*
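
As a small illustration of both notes, the sketch below reads an in-memory CSV, pins the column types with `dtype`, and drops rows with a missing label explicitly. The CSV content and the column names `id`, `path`, and `label` are made up for this example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real training data file.
csv_text = (
    "id,path,label\n"
    "001,images/001.jpg,1\n"
    "002,images/002.jpg,\n"
    "003,images/003.jpg,0\n"
)

# dtype=str keeps "001" as text instead of letting pandas infer an integer.
df = pd.read_csv(io.StringIO(csv_text), dtype={"id": str, "label": str})

# Handle missing labels explicitly rather than relying on them being dropped later.
clean_df = df.dropna(subset=["label"]).reset_index(drop=True)

print(clean_df)
```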

In [2]:
import pandas as pd

image_dir = None  # <- If your images are in a directory, set this to the path of the directory

training_data_df = pd.read_csv("TO BE EDITED")

print(training_data_df.head())
print("Columns:{}".format(list(training_data_df.columns.values)))

      id gender masterCategory subCategory  articleType baseColour  season  \
0  15970    Men        Apparel     Topwear       Shirts  Navy Blue    Fall   
1  39386    Men        Apparel  Bottomwear        Jeans       Blue  Summer   
2  59263  Women    Accessories     Watches      Watches     Silver  Winter   
3  21379    Men        Apparel  Bottomwear  Track Pants      Black    Fall   
4  53759    Men        Apparel     Topwear      Tshirts       Grey  Summer   

   year   usage                             productDisplayName  \
0  2011  Casual               Turtle Check Men Navy Blue Shirt   
1  2012  Casual             Peter England Men Party Blue Jeans   
2  2016  Casual                       Titan Women Silver Watch   
3  2011  Casual  Manchester United Men Solid Black Track Pants   
4  2012  Casual                          Puma Men Grey T-shirt   

                               path  predictedLabel  
0  fashion_dataset/images/15970.jpg               1  
1  fashion_dataset/images/

## User Inputs Section

##### _1. Provide Common Parameters_:

Provide the common parameters, such as basic model details like the problem type and feature columns. Read more about these [here](#common-parameters).

##### _2. Provide Drift v2 Parameters_
Read more about these parameters [here](#drift-v2-parameters)


##### _3. Provide a scoring function_
The scoring function is required and it should adhere to the following guidelines.

- The scoring function should accept `training_data`, a `pandas.DataFrame`. For image models, the DataFrame should include the `image_path_column` and the corresponding `image_label`. The `schema` parameter is a dictionary that specifies the column names for the components of the scoring response, such as `image_path_column`, `prediction_column`, `probability_column`, `input_token_count_column`, `output_token_count_column`, `prediction_probability_column`, and `label_column`. Which of these apply depends on whether the input data is structured, prompt, or unstructured image data.
- The scoring function should return:
    - a `pandas.DataFrame` with all columns of the input DataFrame, plus additional columns that vary with the `problem_type`.
    - For binary and multiclass problems, both `probability_column` and `prediction_column` are included.
    - For regression, only `prediction_column` is included.
    - Prompt asset related problems may include columns like `input_token_count_column`, `output_token_count_column`, and `prediction_probability_column`.
    - For `unstructured_image` input types, the `label_column` is also included in the output DataFrame.
- The label column and the prediction column should have the same data type and contain the same set of unique class labels.
- A host of different scoring function templates are provided [here](https://github.com/IBM/watson-openscale-samples/wiki/Score-function-templates-for-IBM-Watson-OpenScale)
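
The guidelines above can be sketched as follows. This is a minimal illustration, not one of the official templates: the signature `score(training_data, schema)` is inferred from the guidelines, `dummy_model_predict` is a hypothetical stand-in for a real image classifier, and the column names in the demo are made up.

```python
import pandas as pd


def dummy_model_predict(image_paths):
    """Hypothetical stand-in for a real image classifier: fixed class probabilities."""
    return [[0.8, 0.2] for _ in image_paths]


def score(training_data: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Return the input frame plus probability and prediction columns."""
    class_labels = [0, 1]  # assumed classes; must match the labels in the label column
    probabilities = dummy_model_predict(training_data[schema["image_path_column"]])

    scored = training_data.copy()
    scored[schema["probability_column"]] = pd.Series(probabilities, index=scored.index)
    # Pick the class with the highest probability for each row.
    predictions = [class_labels[p.index(max(p))] for p in probabilities]
    # Keep the prediction dtype identical to the label dtype, as the guidelines require.
    label_dtype = scored[schema["label_column"]].dtype
    scored[schema["prediction_column"]] = pd.Series(predictions, index=scored.index).astype(label_dtype)
    return scored


demo = pd.DataFrame({"path": ["a.jpg", "b.jpg"], "label": [0, 1]})
demo_schema_columns = {
    "image_path_column": "path",
    "label_column": "label",
    "prediction_column": "prediction",
    "probability_column": "probability",
}
scored_demo = score(demo, demo_schema_columns)
print(scored_demo)
```

The output keeps every input column, which is why the function copies the input frame rather than building a new one.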

In [3]:
image_dir = "TO BE EDITED" 

In [4]:

common_parameters = {
    "problem_type" : "TO_BE_EDITED",
    "input_data_type": "unstructured_image",
    "asset_type": "model",
    "meta_columns": ["TO_BE_EDITED"], # <-  Not required if the model doesn't have any meta columns
    "label_column": "TO_BE_EDITED",
    "prediction_column": "TO_BE_EDITED",
    "probability_column": "TO_BE_EDITED", # <- Not required for Regression problems.
    "image_path_column": "TO_BE_EDITED", 
    "enable_drift_v2": True,
    "notebook_version": VERSION
}

drift_v2_parameters = {
    # "max_samples": 10000,
    "important_input_metadata_columns": ["TO_BE_EDITED"]  # <- Add this if input metadata drift is to be calculated and meta columns are available
}


scoring_fn = None
scoring_batch_size = None  # Controls how many rows are scored at a time. The default is 50 for image models and 5000 for others.

## Drift v2 Archive

Run the following code to generate the drift v2 archive for the IBM watsonx.governance monitors. The IBM watsonx.governance UI/SDK uses this archive as is to onboard a model for monitoring. The UI/SDK identifies the drift v2 artifacts and uploads them to the monitor.

In [5]:
from ibm_watson_openscale.utils.configuration_utility import ConfigurationUtility
from ibm_watson_openscale.utils import convert_directory_to_dataframe

# image_dir is always defined in an earlier cell, so test its value rather than its existence.
if image_dir is not None:
    training_data_df = convert_directory_to_dataframe(
        image_dir, **common_parameters)

config_util = ConfigurationUtility(
    training_data=training_data_df,
    common_parameters=common_parameters,
    scoring_fn=scoring_fn,
    batch_size=scoring_batch_size)

config_util.create_drift_configuration_package(
    drift_v2_parameters=drift_v2_parameters if "drift_v2_parameters" in locals() else {},
    display_link=True)

       id gender masterCategory subCategory            articleType baseColour  \
0   59263  Women    Accessories     Watches                Watches     Silver   
1   30039    Men    Accessories     Watches                Watches      Black   
2   29114    Men    Accessories       Socks                  Socks  Navy Blue   
3   29928    Men    Accessories     Watches                Watches      Black   
4   47957  Women    Accessories        Bags               Handbags       Blue   
5   12369    Men        Apparel     Topwear                 Shirts     Purple   
6   39386    Men        Apparel  Bottomwear                  Jeans       Blue   
7   15970    Men        Apparel     Topwear                 Shirts  Navy Blue   
8   53759    Men        Apparel     Topwear                Tshirts       Grey   
9   21379    Men        Apparel  Bottomwear            Track Pants      Black   
10   9204    Men       Footwear       Shoes           Casual Shoes      Black   
11  18653    Men       Footw

## Integrating Unstructured Image Subscriptions with a Defined Training Data Schema

Use the following code to configure the `training_data_schema` for unstructured image subscriptions.

##### _1. Configure Credentials_:

In [6]:
CLOUD_API_KEY = "Your API Key"
IAM_URL="https://iam.ng.bluemix.net/oidc/token"
WOS_URL="https://aiopenscale.cloud.ibm.com"


##### _2. Connect to the watsonx.governance instance_:

In [7]:
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson_openscale import APIClient
from ibm_watson_openscale.supporting_classes.enums import OperationTypes, TargetTypes
from ibm_watson_openscale.supporting_classes import Target
from ibm_watson_openscale.base_classes.watson_open_scale_v2 import JsonPatchOperation


authenticator = IAMAuthenticator(apikey=CLOUD_API_KEY, url=IAM_URL)
wos_client = APIClient(authenticator=authenticator,service_url=WOS_URL)
wos_client.version

'3.0.36.15'

##### _3. Set the data mart ID and subscription ID_:

In [8]:
wos_client.data_marts.show()
data_mart_id="TO BE EDITED"

0,1,2,3,4,5
AIOSFASTPATHYS1DEV-75DCF3BE-EB58-489E-942C-2A0689468177,,True,active,2024-05-12 08:22:03.343000+00:00,75dcf3be-eb58-489e-942c-2a0689468177


In [9]:
wos_client.subscriptions.show()
subscription_id="TO BE EDITED"

0,1,2,3,4,5,6,7,8,9
ae4de717-b1dd-41ac-abe8-b96c7c2e1388,model,Fashion Image Classifier,75dcf3be-eb58-489e-942c-2a0689468177,767f6481-aaf3-440e-a7bc-776e92906e27,Fashion Image Classifier deployment,53d7a352-634e-4c87-a7eb-58ec52d8c628,active,2024-06-06 09:06:13.754000+00:00,11082c3d-d9de-4c3b-a59d-7a6d49846e4d
cdacb9c3-fde5-4529-a97e-e732afc38bb6,model,gosales_two_model,75dcf3be-eb58-489e-942c-2a0689468177,3ce8243e-6ef6-41c3-8c3b-08e8190beeb2,Gosales_two_3,53d7a352-634e-4c87-a7eb-58ec52d8c628,active,2024-05-21 01:26:42.713000+00:00,45aabc34-be71-474d-8412-a9fab4323fc8
8b352c13-a2a2-409a-84f8-3097d25d648f,model,GermanCreditRiskModelPreProdYS1DEV,75dcf3be-eb58-489e-942c-2a0689468177,3522e2cf-a41e-4817-a1cf-5bcbd914019d,GermanCreditRiskModelPreProdYS1DEV,3b437b8b-ec37-451e-8a91-dc7e64a97f43,active,2024-05-12 08:25:21.924000+00:00,275b134b-3c0c-417b-9404-db2ff5ec800f
5e3654aa-f889-4717-ae70-b6906deb08c9,model,GermanCreditRiskModelYS1DEV,75dcf3be-eb58-489e-942c-2a0689468177,7700ae34-7c0e-4742-9edc-f58ae89a4892,GermanCreditRiskModelYS1DEV,4ceb9294-2dc9-4582-8a78-a6bb86c9b1b0,active,2024-05-12 08:27:48.013000+00:00,8c7ec566-cce7-4ee9-84e9-5912137e2221
e8b52763-cc42-4fd0-ac4f-7acb7044d608,model,GermanCreditRiskModelChallengerYS1DEV,75dcf3be-eb58-489e-942c-2a0689468177,1a86e270-6814-43af-b729-c2f0abf9d731,GermanCreditRiskModelChallengerYS1DEV,3b437b8b-ec37-451e-8a91-dc7e64a97f43,active,2024-05-12 08:22:52.697000+00:00,1c1ecfd6-b042-4f35-b7c4-4d362e2a9abe


##### _4. Patch the subscription_:

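Update the subscription with a `training_data_schema` that contains the information about the label column. The code below takes the label column data type from the prediction column in the subscription's output data schema, because the two columns must share the same data type.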

In [10]:


subscription = wos_client.subscriptions.get(subscription_id=subscription_id).result
output_fields = subscription.entity.asset_properties.output_data_schema.fields
# Find the prediction column; the label column must share its data type.
prediction_column_data = [f for f in output_fields if f["name"] == common_parameters.get("prediction_column")]
prediction_column_dtype = prediction_column_data[0].get("type") if prediction_column_data else None

training_data_schema = {
    "type": "struct",
    "id": "1",
    "fields": [
        {"name": common_parameters.get("label_column"), "type": prediction_column_dtype,
         "nullable": True, "metadata": {"modeling_role": "target"}},
    ],
}
training_data_schema_patch_document = [
    JsonPatchOperation(op=OperationTypes.REPLACE, path="/asset_properties/training_data_schema", value=training_data_schema)
]

wos_client.subscriptions.update(subscription_id=subscription_id, patch_document=training_data_schema_patch_document)

<ibm_cloud_sdk_core.detailed_response.DetailedResponse at 0x7fa1ac73a3b0>
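
If the prediction column is not found in the output data schema, the patch above would carry a data type of `None`. One way to guard against that, sketched here under the assumption that failing early is preferable, is to build the schema in a small helper that rejects a missing type. The helper name `build_training_data_schema` and the example type `integer` are hypothetical.

```python
def build_training_data_schema(label_column, label_dtype):
    """Build a single-field schema that marks the label column as the target."""
    if label_dtype is None:
        # Fail early instead of patching the subscription with an unknown type.
        raise ValueError("Label data type is unknown; check the prediction column name.")
    return {
        "type": "struct",
        "id": "1",
        "fields": [
            {
                "name": label_column,
                "type": label_dtype,
                "nullable": True,
                "metadata": {"modeling_role": "target"},
            }
        ],
    }


demo_schema = build_training_data_schema("label", "integer")
print(demo_schema["fields"][0])
```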

## Drift Configuration

##### _Upload drift archive_

In [None]:
archive_path = "TO BE EDITED"  # Path to the generated drift v2 archive

In [11]:
wos_client.monitor_instances.upload_drift_v2_archive(
    archive_path=archive_path,
    subscription_id=subscription_id
).result

{}

##### _Enable the drift monitor_

In [12]:
import time

target = Target(
    target_type=TargetTypes.SUBSCRIPTION,
    target_id=subscription_id
)

parameters = {
        #"min_samples": 10,
        #"max_samples": 1000,
        "train_archive": False
    }

drift_monitor_details = wos_client.monitor_instances.create(
    data_mart_id=data_mart_id,
    monitor_definition_id=wos_client.monitor_definitions.MONITORS.DRIFT_V2.ID,
    target=target,
    parameters=parameters
).result

drift_monitor_instance_id = drift_monitor_details.metadata.id
print(drift_monitor_details)

{
  "metadata": {
    "id": "efe61670-fe05-452f-a9b6-3312ef0aeaf6",
    "crn": "crn:v1:bluemix:public:aiopenscale:us-south:a/dcbe4da3c9574b28aa0e128376cfcef4:75dcf3be-eb58-489e-942c-2a0689468177:monitor_instance:efe61670-fe05-452f-a9b6-3312ef0aeaf6",
    "url": "/v2/monitor_instances/efe61670-fe05-452f-a9b6-3312ef0aeaf6",
    "created_at": "2024-06-06T09:12:38.686000Z",
    "created_by": "IBMid-662005298W"
  },
  "entity": {
    "data_mart_id": "75dcf3be-eb58-489e-942c-2a0689468177",
    "monitor_definition_id": "drift_v2",
    "target": {
      "target_type": "subscription",
      "target_id": "11082c3d-d9de-4c3b-a59d-7a6d49846e4d"
    },
    "parameters": {
      "train_archive": false
    },
    "thresholds": [
      {
        "metric_id": "confidence_drift_score",
        "type": "upper_limit",
        "value": 0.05
      },
      {
        "metric_id": "prediction_drift_score",
        "type": "upper_limit",
        "value": 0.05
      },
      {
        "metric_id": "model_qualit

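A newly created monitor may take some time to become ready. A generic way to wait for it is to poll a state getter until it reports a terminal state, as in the sketch below. The helper `wait_for_state` and the state names `preparing`, `active`, and `error` are assumptions for illustration; the demo uses a fake state source instead of a live service.

```python
import time


def wait_for_state(get_state, done_states=("active", "error"), interval=30.0, timeout=600.0):
    """Call get_state() until it returns a state in done_states or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state in done_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still in state {state!r} after {timeout} seconds")
        time.sleep(interval)


# Demonstration with a fake state source standing in for the monitoring service.
fake_states = iter(["preparing", "preparing", "active"])
final_state = wait_for_state(lambda: next(fake_states), interval=0.01, timeout=5.0)
print(final_state)
```

In this notebook, `get_state` could wrap a call that reads the monitor instance state through `wos_client`. The exact attribute path is not shown here, so confirm it against the SDK documentation.
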
## Definitions

### Common Parameters

| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| problem_type | Whether your model is a binary classifier, a multi-class classifier, or a regressor. |  | `binary`, `multiclass`, `regression` |
| input_data_type | The type of your input data. |  | `unstructured_image` |
| label_column | The column that contains the target field (also known as the label column or class label). |  | A string referring to the column name |
| feature_columns | Columns identified as features by the model. The order of the feature columns should be the same as in the subscription. Use helper methods to compute these if required. |  | A list of column names |
| categorical_columns | Feature columns identified as categorical by the model. Use helper methods to compute these if required. |  | A list of column names |
| prediction_column | The column containing the model output. This should be of the same data type as the label column. |  | A string referring to the column name |
| image_path_column | The column containing the path of each image. |  | A string referring to the column name |
| probability_column | The column (of type array) containing the model probabilities for all the possible prediction outcomes. This is not required for regression models. One of `probability_column` or `class_probabilities` must be specified for classification models. If both are specified, `class_probabilities` is preferred. |  | A string referring to the column name |
| class_probabilities | The columns (of type double) containing the model probabilities of class labels. This is not required for regression models. For example, for the Go Sales model deployed in MS Azure ML Studio, the value of this property would be `["Scored Probabilities for Class \"Camping Equipment\"", "Scored Probabilities for Class \"Mountaineering Equipment\"", "Scored Probabilities for Class \"Personal Accessories\""]`. Note that the double quotes in this example must be escaped. One of `probability_column` or `class_probabilities` must be specified for classification models. If both are specified, `class_probabilities` is preferred. |  | A list of column names |
| enable_drift_v2 | Boolean value to allow generation of Drift v2 Archive. | `True` | `True` or `False` |
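
Before generating the archive, it can help to confirm that none of the parameters described above still holds a placeholder. The sketch below is one way to do that; the helper name `find_unedited` is hypothetical, and it treats both `TO_BE_EDITED` and `TO BE EDITED` as placeholders because both spellings appear in this notebook.

```python
PLACEHOLDER = "TO_BE_EDITED"


def find_unedited(params):
    """Return the keys whose value (or any list item) is still a placeholder."""
    unedited = []
    for key, value in params.items():
        items = value if isinstance(value, list) else [value]
        if any(isinstance(item, str) and item.replace(" ", "_") == PLACEHOLDER for item in items):
            unedited.append(key)
    return unedited


# Example parameters with two values left unedited on purpose.
example_parameters = {
    "problem_type": "multiclass",
    "input_data_type": "unstructured_image",
    "label_column": "TO_BE_EDITED",
    "meta_columns": ["TO BE EDITED"],
    "enable_drift_v2": True,
}
print(find_unedited(example_parameters))
```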


### Drift v2 Parameters

| Parameter | Description | Default Value | Possible Value(s) |
| :- | :- | :- | :- |
| max_samples | Maximum number of samples used to create the drift v2 archive. | None | |