<img src="https://github.com/pmservice/ai-openscale-tutorials/raw/master/notebooks/images/banner.png" align="left" alt="banner">

# Notebook for generating configuration for batch subscriptions in IBM Watson OpenScale in IBM Cloud Pak for Data v4.7 - JDBC Storage

This notebook shows how to generate the following artefacts:
1. Configuration JSON needed to configure an IBM Watson OpenScale subscription.
2. Drift Configuration Archive
3. Explainability Perturbations Archive
4. DDLs for creating Feedback, Payload, Drifted Transactions and Explanations tables

The user needs to provide the necessary inputs (where marked) and download the generated artefacts. These artefacts 
then have to be uploaded to the IBM Watson OpenScale UI. 

Note: This notebook generates artefacts for one model at a time. For multiple models, run the notebook separately for each model.

**Contents:**
1. [Installing Dependencies](#Installing-Dependencies)
2. [Select IBM Watson OpenScale Services](#Select-IBM-Watson-OpenScale-Services)
3. [Read sample scoring data](#Read-sample-scoring-data)
4. [Specify Model Inputs](#Specify-Model-Inputs)
5. [Generate Common Configuration](#Generate-Common-Configuration)
6. [Generate DDL for creating Scored Training data table](#Generate-DDL-for-creating-Scored-Training-data-table)
7. [Generate DDL for creating Feedback table](#Generate-DDL-for-creating-Feedback-table)
8. [Generate DDL for creating Payload table](#Generate-DDL-for-creating-Payload-table)
9. [Provide Spark Connection Details](#Provide-Spark-Connection-Details)
10. [Provide Spark Resource Settings [Optional]](#Provide-Spark-Resource-Settings-[Optional])
11. [Provide Additional Spark Settings [Optional]](#Provide-Additional-Spark-Settings-[Optional])
12. [Provide Storage Inputs](#Provide-Storage-Inputs)
13. [Provide Drift Parameters [Optional]](#Provide-Drift-Parameters-[Optional])
14. [Provide Fairness Parameters [Optional]](#Provide-Fairness-Parameters-[Optional])
15. [Run Configuration Job](#Run-Configuration-Job)
16. [Download Configuration JSON](#Download-Configuration-JSON)
17. [Download Drift Archive](#Download-Drift-Archive)
18. [Generate DDL for creating Drifted Transactions Table](#Generate-DDL-for-creating-Drifted-Transactions-table)
19. [Generate Perturbations csv](#Generate-Perturbations-csv)
20. [Generate DDL for creating Explanations Queue table](#Generate-DDL-for-creating-Explanations-Queue-table)
21. [Generate DDL for creating Explanations Table](#Generate-DDL-for-creating-Explanations-Table)
22. [Create Configuration Archive](#Create-Configuration-Archive)

### Installing Dependencies

In [None]:
# Note: Restart kernel after the dependencies are installed
import sys

PYTHON = sys.executable

!$PYTHON -m pip install --no-warn-conflicts pyspark | tail -n 1  

**Note:** The following cell installs `ibm-wos-utils` for IBM Watson OpenScale on IBM Cloud Pak for Data version 4.7.

In [None]:
# When this notebook is to be run on a zLinux cluster,
# install scikit-learn==1.1.1 using conda before installing ibm-wos-utils
# !conda install scikit-learn=1.1.1

!$PYTHON -m pip install --no-warn-conflicts "ibm-wos-utils~=4.7.0" | tail -n 1

### Select IBM Watson OpenScale Services

Details of the service-specific flags available:

- ENABLE_QUALITY: Flag to allow generation of common configuration details needed if quality alone is selected
- ENABLE_FAIRNESS : Flag to allow generation of fairness specific data distribution needed for configuration
- ENABLE_MODEL_DRIFT: Flag to allow generation of Drift Archive containing relevant information for Model Drift.
- ENABLE_DATA_DRIFT: Flag to allow generation of Drift Archive containing relevant information for Data Drift.
- ENABLE_EXPLAINABILITY : Flag to allow generation of explainability configuration and perturbations

In [None]:
# ----------------------------------------------------------------------------------------------------
# IBM Confidential
# OCO Source Materials
# 5737-H76
# Copyright IBM Corp. 2021, 2023
# The source code for this Notebook is not published or other-wise divested of its trade
# secrets, irrespective of what has been deposited with the U.S.Copyright Office.
# ----------------------------------------------------------------------------------------------------

VERSION = "jdbc-1.1.7"

# Version history:

# jdbc-1.1.7 : Upgrade ibm-wos-utils to 4.7.0 (scikit-learn has been upgraded to 1.1.1)
# jdbc-1.1.6 : Upgrade ibm-wos-utils to 4.6.0 and update explainability archive with stats.
# jdbc-1.1.5 : Changed the way drift archive is created.
# jdbc-1.1.4 : Upgrade ibm-wos-utils to 4.1.1 (scikit-learn has been upgraded to 1.0.2)
# jdbc-1.1.3 : Add two drift tuning parameters: max_ranges_modifier and tail_discard_threshold; Upgrade ibm-wos-utils to 4.0.34
# jdbc-1.1.2 : Upgrade ibm-wos-utils to 4.0.31
# jdbc-1.1.1 : Add comment about conda install for zLinux environments; Upgrade ibm-wos-utils to 4.0.25
# jdbc-1.1   : Add partition information; Upgrade ibm-wos-utils to 4.0.24
# 1.0        : Initial release

In [None]:
# Optional Input: Keep an identifiable name. This id is used to append to various table creation DDLs.
# A random UUID is used if this is not present.
# NOTEBOOK_RUN_ID = "some_identifiable_name"
NOTEBOOK_RUN_ID = None


# Service Configuration Flags
ENABLE_QUALITY = True
ENABLE_MODEL_DRIFT = True
ENABLE_DATA_DRIFT = True
ENABLE_EXPLAINABILITY = True
ENABLE_FAIRNESS = True

RUN_JOB = ENABLE_QUALITY or ENABLE_MODEL_DRIFT or ENABLE_DATA_DRIFT or ENABLE_EXPLAINABILITY or ENABLE_FAIRNESS

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName(
    "Common Configuration Generation").getOrCreate()

### Read sample scoring data

Sample scoring data is required to infer the schema of the complete data, so choose a sample that is large enough to be representative of it. 

Additionally, the sample scoring data should have the following fields:
1. Feature Columns
2. Label/Target Column
3. Prediction Column (with same data type as the label column)
4. Probability Column (an array of model probabilities for all the class labels. Not required for regression models)
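
Before going further, it can help to confirm that the sample actually contains the key columns listed above. The sketch below is a minimal plain-Python check. `find_missing_columns` is a hypothetical helper (it is not part of `ibm-wos-utils`), and the column names are made up for illustration. With a real sample you would pass `spark_df.columns` as the first argument.

```python
def find_missing_columns(available_columns, required_columns):
    """Return the required columns that are absent from the sample, in order."""
    available = set(available_columns)
    return [column for column in required_columns if column not in available]


# Hypothetical column names; with a real sample, pass spark_df.columns instead.
sample_columns = ["AGE", "PROFESSION", "PRODUCT_LINE", "prediction"]
required = ["PRODUCT_LINE", "prediction", "probability"]

missing = find_missing_columns(sample_columns, required)
print("Missing columns:", missing)
```

An empty list means every required column was found. For regression models the probability column can be left out of `required`.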


The sample data should be of type `pyspark.sql.dataframe.DataFrame`. The cell below shows how to:
- read a CSV file from the local system into a PySpark DataFrame.
- read Parquet files in a directory from the local system into a PySpark DataFrame.
- read ORC files in a directory from the local system into a PySpark DataFrame.

In [None]:
# Load a csv or a directory containing csv files as PySpark DataFrame
# spark_df = spark.read.csv("/path/to/dir/containing/csv/files", header=True, inferSchema=True)

# Load a directory containing parquet files as PySpark DataFrame
# spark_df = spark.read.parquet("/path/to/dir/containing/parquet/files")

# Load a directory containing orc files as PySpark DataFrame
# spark_df = spark.read.orc("/path/to/dir/containing/orc/files")

spark_df.printSchema()

### Specify Model Inputs

#### Specify the Model Type

- Specify **binary** if the model is a binary classifier.
- Specify **multiclass** if the model is a multi-class classifier.
- Specify **regression** if the model is a regressor.

In [None]:
# MODEL_TYPE = "binary"
MODEL_TYPE = "multiclass"
# MODEL_TYPE = "regression"

#### Provide Column Details 

To proceed with this notebook, the following information is required:

- **LABEL_COLUMN**: The column which contains the target field (also known as label column or the class label).
- **PREDICTION_COLUMN**: The column containing the model output. This should be of the same data type as the label column.
- **PROBABILITY_COLUMN**: The column (of type array) containing the model probabilities for all the possible prediction outcomes. This is not required for regression models.
- **CLASS_PROBABILITIES**: The columns (of type double) containing the model probabilities of class labels. This is not required for regression models. For example, for Go Sales model deployed in MS Azure ML Studio, value of this property would be `["Scored Probabilities for Class \"Camping Equipment\"", "Scored Probabilities for Class \"Mountaineering Equipment\"", "Scored Probabilities for Class \"Personal Accessories\""]`. Please note escaping double quotes is a must-have requirement for above example.
- **PARTITION_COLUMN**: The column to help Spark read and write data using multiple workers in your JDBC storage. This will help improve the performance of your Spark jobs. The default value is set to `wos_partition_column`. The cells below will include this column for generating CREATE TABLE DDLs and ALTER TABLE DDLs for your data source. This column will not be used for computation purposes.
- **PROTECTED_ATTRIBUTES**: [Optional] The columns which exist in the training data but are not used to train the model. This is required to monitor fairness on non-feature columns, i.e. indirect bias. 

Note: Be careful when choosing an existing feature column as the partition column. If the data in that column is not evenly distributed across its possible values, Spark computation can suffer from data skew: the majority of the data is sent to one worker, which wastes compute resources and increases computation time. A column with monotonically increasing values is recommended as the partition column.
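
The effect of a poorly distributed partition column can be illustrated without Spark. The sketch below is a simplified model of range-based partitioning, not Spark's actual implementation: it splits the value range of a column into equal-width buckets and counts the rows that land in each. `rows_per_partition` and the sample values are hypothetical.

```python
from collections import Counter


def rows_per_partition(values, num_partitions):
    """Assign each value to one of `num_partitions` equal-width ranges
    between min(values) and max(values) and count the rows per range."""
    lower, upper = min(values), max(values)
    width = (upper - lower) / num_partitions or 1  # avoid a zero width
    counts = Counter()
    for value in values:
        # Clamp so that the maximum value falls into the last partition.
        index = min(int((value - lower) / width), num_partitions - 1)
        counts[index] += 1
    return [counts.get(i, 0) for i in range(num_partitions)]


# A monotonically increasing id spreads 1000 rows evenly over 4 partitions.
sequential_ids = list(range(1000))
print(rows_per_partition(sequential_ids, 4))

# A skewed feature (most rows share one value) overloads a single partition.
skewed_feature = [0] * 970 + [50] * 10 + [75] * 10 + [100] * 10
print(rows_per_partition(skewed_feature, 4))
```

In the second case almost all rows end up in the first bucket, which is the situation the note above warns about.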

In [None]:
LABEL_COLUMN = "<label_column>"
PREDICTION_COLUMN = "<model prediction column>"
PROBABILITY_COLUMN = "<model probability column. ignored in case of regression models>"
CLASS_PROBABILITIES = ["<list of columns containing class probabilities. Ignored in case of regression models>"]

PARTITION_COLUMN = "wos_partition_column"
# [Optional] Provide list of protected attributes i.e non-feature columns present in the data.
PROTECTED_ATTRIBUTES = []

Based on the sample data and key columns provided above, the notebook will deduce the feature columns and the categorical columns. They will be printed in the output of this cell. If you wish to make changes to them, you can do so in the subsequent cell.

In [None]:
from pyspark.sql.types import BooleanType, StringType

feature_columns = spark_df.columns.copy()
feature_columns.remove(LABEL_COLUMN)
feature_columns.remove(PREDICTION_COLUMN)

if MODEL_TYPE != "regression":
    feature_columns.remove(PROBABILITY_COLUMN)

if PROTECTED_ATTRIBUTES:
    for protected_attribute in PROTECTED_ATTRIBUTES:
        feature_columns.remove(protected_attribute)

print("Feature Columns : {}".format(feature_columns))

categorical_columns = [f.name for f in spark_df.schema.fields if isinstance(f.dataType, (BooleanType, StringType)) and f.name in feature_columns]
print("Categorical Columns : {}".format(categorical_columns))

In [None]:
config_info = {
    "problem_type": MODEL_TYPE,
    "model_type": MODEL_TYPE,
    "label_column": LABEL_COLUMN,
    "prediction": PREDICTION_COLUMN,
    "probability": PROBABILITY_COLUMN,
    "class_probabilities": CLASS_PROBABILITIES
}

config_info["feature_columns"] = feature_columns
config_info["categorical_columns"] = categorical_columns
config_info["protected_attributes"] = PROTECTED_ATTRIBUTES

In [None]:
from ibm_wos_utils.joblib.utils.notebook_utils import validate_config_info
validate_config_info(config_info)

### Generate Common Configuration

IBM Watson OpenScale requires two additional fields - a unique identifier for each record in your feedback/payload tables ("scoring_id") and a timestamp field ("scoring_timestamp") denoting when that record entered the table. These fields are automatically added in the common configuration. 

Please make sure that these fields are present in the respective tables.
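
As an illustration of what these two fields carry, the sketch below attaches them to a single record in plain Python. It is only a sketch under stated assumptions: `add_tracking_fields` is a hypothetical helper, and using a UUID string and a UTC ISO-8601 timestamp are illustrative choices. The column types your tables actually need are the ones in the DDLs generated by this notebook.

```python
import uuid
from datetime import datetime, timezone


def add_tracking_fields(record):
    """Return a copy of `record` with scoring_id and scoring_timestamp added."""
    enriched = dict(record)
    # Assumption: a random UUID string is an acceptable unique identifier.
    enriched["scoring_id"] = str(uuid.uuid4())
    # Assumption: the timestamp is recorded in UTC at insertion time.
    enriched["scoring_timestamp"] = datetime.now(timezone.utc).isoformat()
    return enriched


# Hypothetical record, for illustration only.
row = add_tracking_fields({"AGE": 42, "prediction": "Camping Equipment"})
print(sorted(row))
```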

In [None]:
from ibm_wos_utils.joblib.utils.notebook_utils import generate_schemas

common_config = config_info.copy()
common_configuration = generate_schemas(spark_df, common_config)

config_json = {}
config_json["common_configuration"] = common_configuration
config_json["batch_notebook_version"] = VERSION

In [None]:
from ibm_wos_utils.joblib.utils.notebook_utils import get_max_length_categories

max_length_categories = get_max_length_categories(spark_df)

### Generate DDL for creating Scored Training data table

In [None]:
from ibm_wos_utils.joblib.utils.ddl_utils_db2 import generate_scored_training_table_ddl

# Schema Name where Scored Training Table should be created.
SCORED_TRAINING_SCHEMA_NAME = None

generate_scored_training_table_ddl(config_json, schema_name=SCORED_TRAINING_SCHEMA_NAME,
                                   table_suffix=NOTEBOOK_RUN_ID, max_length_categories=max_length_categories,
                                   partition_column=PARTITION_COLUMN)

### Generate DDL for creating Feedback table


In [None]:
from ibm_wos_utils.joblib.utils.ddl_utils_db2 import generate_feedback_table_ddl

# Schema Name where Feedback Table should be created.
FEEDBACK_SCHEMA_NAME = None

if ENABLE_QUALITY:
    generate_feedback_table_ddl(config_json, schema_name=FEEDBACK_SCHEMA_NAME,
                                table_suffix=NOTEBOOK_RUN_ID, max_length_categories=max_length_categories,
                                partition_column=PARTITION_COLUMN)

### Generate DDL for creating Payload table


In [None]:
from ibm_wos_utils.joblib.utils.ddl_utils_db2 import generate_payload_table_ddl

# Schema Name where Payload Table should be created.
PAYLOAD_SCHEMA_NAME = None

if ENABLE_MODEL_DRIFT or ENABLE_DATA_DRIFT or ENABLE_EXPLAINABILITY or ENABLE_FAIRNESS:
    generate_payload_table_ddl(config_json, schema_name=PAYLOAD_SCHEMA_NAME,
                               table_suffix=NOTEBOOK_RUN_ID, max_length_categories=max_length_categories,
                               partition_column=PARTITION_COLUMN)

### Provide Spark Connection Details

1. If your job is going to run on Spark cluster as part of an IBM Analytics Engine instance on IBM Cloud Pak for Data, enter the following details:
    
    - **IAE_SPARK_DISPLAY_NAME**: Display Name of the Spark instance in IBM Analytics Engine
    - **IAE_SPARK_JOBS_ENDPOINT**: Spark Jobs Endpoint for IBM Analytics Engine
    - **IBM_CPD_VOLUME**: IBM Cloud Pak for Data storage volume name
    - **IBM_CPD_USERNAME**: IBM Cloud Pak for Data username
    - **IBM_CPD_APIKEY**: IBM Cloud Pak for Data API key


2. If your job is going to run on Spark Cluster as part of a Remote Hadoop Ecosystem, enter the following details:

    - **SPARK_MANAGER_ENDPOINT**: Endpoint URL where the Spark Manager Application is running
    - **SPARK_MANAGER_USERNAME**: Username to connect to Spark Manager Application
    - **SPARK_MANAGER_PASSWORD**: Password to connect to Spark Manager Application

#### Credentials Block for Spark in IAE

In [None]:
from ibm_wos_utils.joblib.utils.constants import SparkType

IAE_SPARK_DISPLAY_NAME = "<Display Name of the Spark instance in IBM Analytics Engine>"
IAE_SPARK_JOBS_ENDPOINT = "<Spark Jobs Endpoint for IBM Analytics Engine>"
IBM_CPD_VOLUME = "<IBM Cloud Pak for Data storage volume name>"
IBM_CPD_USERNAME = "<IBM Cloud Pak for Data username>"
IBM_CPD_APIKEY = "<IBM Cloud Pak for Data API key>"

# Credentials Block for Spark in IAE
credentials = {
    "connection": {
        "display_name": IAE_SPARK_DISPLAY_NAME,
        "endpoint": IAE_SPARK_JOBS_ENDPOINT,
        "location_type": SparkType.IAE_SPARK.value,
        "volume": IBM_CPD_VOLUME
    },
    "credentials": {
        "username": IBM_CPD_USERNAME,
        "apikey": IBM_CPD_APIKEY
    }
}

#### Credentials Block for Spark in Remote Hadoop Ecosystem

In [None]:
from ibm_wos_utils.joblib.utils.constants import SparkType

SPARK_MANAGER_ENDPOINT = "<Endpoint URL where Spark Manager Application is running>"
SPARK_MANAGER_USERNAME = "<Username to connect to Spark Manager Application>"
SPARK_MANAGER_PASSWORD = "<Password to connect to Spark Manager Application>"

# Credentials Block for Spark in Remote Hadoop Ecosystem
credentials = {
    "connection": {
        "endpoint": SPARK_MANAGER_ENDPOINT,
        "location_type": SparkType.REMOTE_SPARK.value
    },
    "credentials": {
        "username": SPARK_MANAGER_USERNAME,
        "password": SPARK_MANAGER_PASSWORD
    }
}

### Provide Spark Resource Settings [Optional]

Configure how much of your Spark cluster's resources this job can consume.

In [None]:
"""
spark_settings = {
    # max_num_executors: Maximum Number of executors to launch for this session
    "max_num_executors": "2",
    
    # min_executors: Minimum Number of executors to launch for this session
    "min_executors": "1",
    
    # executor_cores: Number of cores to use for each executor
    "executor_cores": "2",
    
    # executor_memory: Amount of memory (in GBs) to use per executor process
    "executor_memory": "2",
    
    # driver_cores: Number of cores to use for the driver process
    "driver_cores": "2",
    
    # driver_memory: Amount of memory (in GBs) to use for the driver process 
    "driver_memory": "1"
}
"""

spark_settings = None

### Provide Additional Spark Settings [Optional]

Provide any other Spark property that can be set via **SparkConf** in the next cell. These properties are sent to the Spark cluster verbatim. Leave `conf` set to `None` or `{}` if no additional property is required.

- [A list of available properties for Spark 2.4.6](https://spark.apache.org/docs/2.4.6/configuration.html#available-properties)

In [None]:
"""
conf = {
    "spark.yarn.maxAppAttempts": 1
}

"""

conf = None

### Provide Storage Inputs

Enter DB2 Storage details.
 - **JDBC_HOST**: Hostname of the JDBC Connection
 - **JDBC_PORT**: Port of the JDBC Connection
 - **JDBC_USE_SSL**: Boolean Flag to indicate whether to use SSL while connecting.
 - **JDBC_SSL_CERTIFICATE**: SSL Certificate [Base64 encoded string] of the JDBC Connection. Ignored if JDBC_USE_SSL is False.
 - **JDBC_DRIVER**: Optional. Class name of the JDBC driver to use to connect e.g. for DB2 use com.ibm.db2.jcc.DB2Driver
 - **JDBC_USERNAME**: Username of the JDBC Connection
 - **JDBC_PASSWORD**: Password of the JDBC Connection
 - **JDBC_DATABASE_NAME**: Name of the JDBC Database to use to connect.
 - **TRAINING_SCHEMA_NAME**: Name of the JDBC Schema that has training table.
 - **TRAINING_TABLE_NAME**: Name of the JDBC Table that has the scored training data.


In [None]:
JDBC_HOST = "<Hostname of the JDBC Connection>"
JDBC_PORT = "<Port of the JDBC Connection>"
JDBC_USE_SSL = "<Boolean Flag to indicate whether to use SSL while connecting.>"
JDBC_SSL_CERTIFICATE = "<SSL Certificate [Base64 encoded string] of the JDBC Connection. Ignored if JDBC_USE_SSL is False.>"
JDBC_DRIVER = "<Optional. Class name of the JDBC driver to use to connect e.g. for DB2 use com.ibm.db2.jcc.DB2Driver>"
JDBC_USERNAME = "<Username of the JDBC Connection>"
JDBC_PASSWORD = "<Password of the JDBC Connection>"

JDBC_DATABASE_NAME = "<Name of the JDBC Database to use to connect.>"
TRAINING_SCHEMA_NAME = "<Name of the JDBC Schema that has training table.>"
TRAINING_TABLE_NAME = "<Name of the JDBC Table that has the scored training data.>"
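
`JDBC_SSL_CERTIFICATE` is expected as a Base64 encoded string. A minimal sketch of producing such a string with the Python standard library follows. `encode_certificate` is a hypothetical helper and the bytes are a placeholder. Whether the PEM header and footer lines should be part of the encoded content is an assumption here, so check it against your OpenScale documentation.

```python
import base64


def encode_certificate(certificate_bytes):
    """Return the Base64 text form of the raw certificate bytes."""
    return base64.b64encode(certificate_bytes).decode("ascii")


# Placeholder bytes standing in for the contents of a certificate file, e.g.
# the result of open("<path to certificate>", "rb").read().
example_bytes = b"-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n"
encoded = encode_certificate(example_bytes)
print(encoded)
```

The result is a plain ASCII string that can be assigned to `JDBC_SSL_CERTIFICATE`.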

In [None]:
num_partitions_recommended = 12

if spark_settings:
    executors = int(spark_settings.get("max_num_executors", 2))
    cores = int(spark_settings.get("executor_cores", 2))
    num_partitions_recommended = 3 * executors * cores
    
print("{} is the recommended value for number of partitions in your data. Please change this value as per your data.".format(num_partitions_recommended))

- **NUM_PARTITIONS**: The maximum number of partitions that Spark can divide the data into. In JDBC, it also means the maximum number of connections that Spark can make to the JDBC store for reading/writing data. 

The recommended value is calculated in the above cell as a multiple of total workers allotted for this job. You can use the same value or change it in the next cell.

In [None]:
NUM_PARTITIONS = num_partitions_recommended

In [None]:
jdbc_url = "jdbc:db2://{}:{}/{}".format(JDBC_HOST, JDBC_PORT, JDBC_DATABASE_NAME)

storage_details = {
    "type": "jdbc",
    "connection": {
        "jdbc_driver": JDBC_DRIVER,
        "jdbc_url": jdbc_url,
        "use_ssl": JDBC_USE_SSL,
        "certificate": JDBC_SSL_CERTIFICATE,
        "location_type": "jdbc"
    },
    "credentials":{
        "username": JDBC_USERNAME,
        "password": JDBC_PASSWORD,
    }
}

tables = [
    {
        "database": JDBC_DATABASE_NAME,
        "schema": TRAINING_SCHEMA_NAME,
        "table": TRAINING_TABLE_NAME,
        "type": "training",
        "parameters": {
            "partition_column": PARTITION_COLUMN,
            "num_partitions": NUM_PARTITIONS
        }
    }
]

### Provide Drift Parameters [Optional]

Provide the optional drift parameters in this cell. Leave `drift_parameters` set to `None` or `{}` if no additional parameter is required.

In [None]:
"""
drift_parameters = {
    "model_drift": {
        # enable_drift_model_tuning - Controls whether there will be Hyper-Parameter 
        # Optimisation in the Drift Detection Model. Default: False
        "enable_drift_model_tuning": True,
        
        # max_bins - Specify the maximum number of categories in categorical columns.
        # Default: OpenScale will determine an approximate value. Use this only in cases
        # where OpenScale approximation fails.
        "max_bins": 10,
    },
    "data_drift": {
        # enable_two_col_learner - Enable learning of data constraints on two column 
        # combinations. Default: True
        "enable_two_col_learner": True,
        
        # use_alt_learner - Boolean parameter which switches learning method to help 
        # with performance during constraint learning process. Default: False
        "use_alt_learner": False,
        
        # categorical_unique_threshold - Used to discard categorical columns with a
        # large number of unique values relative to total rows in the column.
        # Should be between 0 and 1. Default: 0.8
        "categorical_unique_threshold": 0.7,
        
        # max_distinct_categories - Used to discard categorical columns with a large
        # absolute number of unique categories. Also, used for not learning
        # categorical-categorical constraint, if potential combinations of two columns
        # are more than this number. Default: 100000
        "max_distinct_categories": 10000,

        # max_ranges_modifier - Affects the number of ranges we find for a numerical column.
        # For a numerical column, we learn multiple ranges instead of one min-max depending
        # on how sparse data is. This modifier combined with approximate distinct values in
        # the column defines the upper limit on how many bins to divide data into during
        # multiple ranges computation. This can either be a float or a dictionary of column
        # names and float values. Its value should be greater than 0. Default: 0.01
        # 1. float: This value is applied for all numerical columns. Default value of 0.01
        # indicates total number of bins used during computation of ranges are not more than
        # 1% of distinct values in the column.
        # 2. dict of str -> float: A column name -> value, dict can be used to over-ride
        # individual modifier for each column. If not provided for a column, default value
        # of 0.01 will be used.
        "max_ranges_modifier": 0.01,
            
        # tail_discard_threshold - Used to discard values from either end of the data
        # distribution in a column if the data is found to have large ranges, which results in
        # the data being divided into a large number of bins for multiple ranges computation.
        # This threshold is used if these bins are found to be greater than
        # `max_ranges_modifier * approx_distinct_count` for a column. The default value indicates
        # that 1 percentile of data from either end will be discarded. Its value can be between
        # 0 and 0.1. Default: 0.01
        "tail_discard_threshold": 0.01,
        
        # user_overrides - Used to override drift constraint learning to selectively learn 
        # constraints on feature columns. It is a list of configurations, each specifying 
        # whether to learn distribution and/or range constraints on a given set of columns.
        # The first configuration for a given column takes precedence.
        # 
        # "constraint_type" can have two possible values : single|double - signifying 
        # if this configuration is for single column or two column constraint learning.
        #
        # "learn_distribution_constraint" : True|False - signifying whether to learn 
        # distribution constraint for given config or not.
        #
        # "learn_range_constraint" : True|False - signifying whether to learn range 
        # constraint for given config or not. Only applicable to numerical feature columns.
        # 
        # "features" : [] - provides a list of feature columns to be governed by 
        # the given configuration for constraint learning.
        # It is a list of strings containing feature column names if "constraint_type" is "single".
        # It is a list of lists of strings containing feature column names if "constraint_type" is 
        # "double". If only one column name is provided, all of the two column constraints 
        # involving this column will be dictated by the given configuration during constraint learning.
        # This list is case-insensitive.
        #
        # In the example below, first config block says do not learn distribution and range single 
        # column constraints for features "MARITAL_STATUS", "PROFESSION", "IS_TENT" and "age".
        # Second config block says do not learn distribution and range two column constraints 
        # where "IS_TENT", "PROFESSION", and "AGE" are one of the two columns. Whereas, specifically, 
        # do not learn two column distribution and range constraint on combination of "MARITAL_STATUS" 
        # and "PURCHASE_AMOUNT".
        "user_overrides": [
            {
                "constraint_type": "single",
                "learn_distribution_constraint": False,
                "learn_range_constraint": False,
                "features": [
                  "MARITAL_STATUS",
                  "PROFESSION",
                  "IS_TENT",
                  "age"
                ]
            },
            {
                "constraint_type": "double",
                "learn_distribution_constraint": False,
                "learn_range_constraint": False,
                "features": [
                  [
                    "IS_TENT"
                  ],
                  [
                    "MARITAL_STATUS",
                    "PURCHASE_AMOUNT"
                  ],
                  [
                    "PROFESSION"
                  ],
                  [
                    "AGE"
                  ]
                ]
            }
        ]
    }
}
"""

drift_parameters = None
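
The value ranges documented in the comments above can be checked before the job is submitted. The sketch below is a minimal validation pass over the `data_drift` block. `check_drift_parameters` is a hypothetical helper, not part of `ibm-wos-utils`, and treating the documented bounds as inclusive is an assumption.

```python
def check_drift_parameters(parameters):
    """Return a list of problems found in the data_drift tuning values."""
    problems = []
    data_drift = (parameters or {}).get("data_drift", {})

    unique_threshold = data_drift.get("categorical_unique_threshold")
    if unique_threshold is not None and not 0 <= unique_threshold <= 1:
        problems.append("categorical_unique_threshold must be between 0 and 1")

    tail_threshold = data_drift.get("tail_discard_threshold")
    if tail_threshold is not None and not 0 <= tail_threshold <= 0.1:
        problems.append("tail_discard_threshold must be between 0 and 0.1")

    modifier = data_drift.get("max_ranges_modifier")
    # The modifier is either one float or a dict of column name -> float.
    modifier_values = modifier.values() if isinstance(modifier, dict) else [modifier]
    if any(value is not None and value <= 0 for value in modifier_values):
        problems.append("max_ranges_modifier must be greater than 0")

    return problems


# Hypothetical values: the tail threshold is deliberately out of range.
example = {
    "data_drift": {
        "tail_discard_threshold": 0.5,
        "max_ranges_modifier": {"AGE": 0.02},
    }
}
print(check_drift_parameters(example))
```

An empty list means none of the checked values fall outside the documented ranges. `None` and `{}` pass trivially, matching the optional nature of `drift_parameters`.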

### Provide Fairness Parameters [REQUIRED if `ENABLE_FAIRNESS` is set to True]

Provide the fairness parameters in this cell. Leave `fairness_parameters` set to `None` or `{}` if fairness is not to be enabled.

In [None]:
"""
fairness_parameters = {
    "features": [
        {
            "feature": "<The fairness attribute name>", # The feature on which the fairness check is to be done
            "majority": [<majority groups/ranges for categorical/numerical columns respectively>],
            "minority": [<minority groups/ranges for categorical/numerical columns respectively>],
            "threshold": <The threshold value between 0 and 1> [OPTIONAL, default value is 0.8]
        }
    ],
    "class_label": LABEL_COLUMN,
    "favourable_class": [<favourable classes/ranges for classification/regression models respectively>],
    "unfavourable_class": [<unfavourable classes/ranges for classification/regression models respectively>],
    "min_records": <The minimum number of records on which the fairness check is to be done> [OPTIONAL],

    # The following parameters are only supported for subscriptions with a synchronous scoring endpoint.
    
    "perform_perturbation": <(Boolean) Whether the user wants to calculate the balanced (payload + perturbed) data.>,
    "sample_size_percent": <(Integer 1-100) How much percentage of data to be read for balanced data calculation.>,
    "numerical_perturb_count_per_row": <[Optional] The number of perturbed rows to be generated per row for numerical perturbation. [Default: 2]>,
    "float_decimal_place_precision": <[Optional] The decimal place precision to be used for numerical perturbation when data is float.>,
    "numerical_perturb_seed": <[Optional] The seed to be used for numerical perturbation while picking up random values.>,
    "scoring_page_size": <[Optional] The size of the page in the number of rows. [Default: 1000]>
}
"""

fairness_parameters = None
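
For orientation, a filled-in version of the template above might look like the sketch below for a classification model with a categorical fairness attribute. Every value is hypothetical (attribute name, groups, class labels, record count) and must be replaced with values from your own data. Only the categorical form is shown, and the synchronous-scoring parameters are left out.

```python
# Hypothetical label column, used only to keep this sketch self-contained.
EXAMPLE_LABEL_COLUMN = "Risk"

example_fairness_parameters = {
    "features": [
        {
            "feature": "Sex",        # hypothetical categorical fairness attribute
            "majority": ["male"],    # hypothetical majority group
            "minority": ["female"],  # hypothetical minority group
            "threshold": 0.8,        # optional; 0.8 is the default stated above
        }
    ],
    "class_label": EXAMPLE_LABEL_COLUMN,
    "favourable_class": ["No Risk"],  # hypothetical favourable outcome
    "unfavourable_class": ["Risk"],   # hypothetical unfavourable outcome
    "min_records": 100,               # optional
}

print(sorted(example_fairness_parameters))
```

The sketch uses its own variable names so that running it does not overwrite `LABEL_COLUMN` or `fairness_parameters`.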

### Run Configuration Job

In [None]:
SHOW_PROGRESS = True

arguments = {
    "batch_notebook_version": VERSION,
    "common_configuration" : common_configuration,
    "enable_data_drift": ENABLE_DATA_DRIFT,
    "enable_model_drift": ENABLE_MODEL_DRIFT,
    "enable_explainability": ENABLE_EXPLAINABILITY,
    "enable_fairness": ENABLE_FAIRNESS,
    "monitoring_run_id": NOTEBOOK_RUN_ID,
    "storage": storage_details,
    "tables": tables,
    "show_progress": SHOW_PROGRESS
}

if ENABLE_MODEL_DRIFT or ENABLE_DATA_DRIFT:
    arguments["drift_parameters"] = drift_parameters
    
if ENABLE_FAIRNESS:
    if fairness_parameters is None or fairness_parameters == {}:
        raise ValueError("Fairness parameters are required if fairness is enabled.")
    arguments["fairness_parameters"] = fairness_parameters

job_params = {
    "arguments": arguments,
    "spark_settings": spark_settings,
    "dependency_zip": [],
    "conf": conf
}

The following cell will run the Configuration job. If `SHOW_PROGRESS` is `True`, it will also print the status of the job in the output section. Please wait for the status to be **FINISHED**.

A successful job status goes through the following values:
1. STARTED
2. Model Drift Configuration STARTED
3. Data Drift Configuration STARTED
    - Data Drift: Summary Stats Calculated
    - Data Drift: Column Stats calculated.
    - Data Drift: (number/total) CategoricalDistributionConstraint columns processed
    - Data Drift: (number/total) NumericRangeConstraint columns processed
    - Data Drift: (number/total) CategoricalNumericRangeConstraint columns processed
    - Data Drift: (number/total) CatCatDistributionConstraint columns processed
4. Explainability Configuration STARTED
5. Explainability Configuration COMPLETED
6. Fairness Configuration STARTED
7. Fairness Configuration COMPLETED
8. FINISHED

If at any time there is a failure, you will see a **FAILED** status with an exception trace. 

In [None]:
from ibm_wos_utils.joblib.clients.engine_client import EngineClient
from ibm_wos_utils.common.batch.jobs.configuration import Configuration
from ibm_wos_utils.joblib.utils.notebook_utils import JobStatus


if RUN_JOB:
    job_name="Configuration_Job"
    client = EngineClient(credentials=credentials)
    job_response = client.engine.run_job(job_name=job_name, job_class=Configuration,
                                        job_args=job_params, background=True)
    # Print Job Status.
    if SHOW_PROGRESS:
        JobStatus(client, job_response).print_status()

If `SHOW_PROGRESS` is `False`, you can run the below cell to check the job status at any point manually.

In [None]:
if not SHOW_PROGRESS and RUN_JOB:
    job_id = job_response.get("id")
    print(client.engine.get_job_status(job_id))
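
Instead of re-running the status cell by hand, the check can be wrapped in a small polling loop. The sketch below is one possible way to do this. `wait_for_job` and its parameters are hypothetical, and because this notebook does not show the shape of the value returned by `client.engine.get_job_status`, the loop takes any zero-argument callable that returns a status string. Adapting a real call to that shape is left as an assumption.

```python
import time


def wait_for_job(get_status, poll_seconds=5.0, max_polls=120):
    """Call `get_status()` until it reports FINISHED or FAILED.

    `get_status` is any zero-argument callable returning a status string.
    """
    status = None
    for _ in range(max_polls):
        status = get_status()
        if status in ("FINISHED", "FAILED"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(
        "Job still in status {!r} after {} polls".format(status, max_polls))


# Stand-in for a real status call, used only to demonstrate the loop.
fake_statuses = iter(["STARTED", "Data Drift Configuration STARTED", "FINISHED"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

Bounding the loop with `max_polls` keeps the notebook from blocking indefinitely if the job never reaches a terminal status.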

### Download Configuration JSON

In [None]:
import json
from ibm_wos_utils.joblib.utils.notebook_utils import create_download_link

if RUN_JOB:
    configuration = client.engine.get_file(
        job_response.get("output_file_path") + "/configuration.json")
    config = json.loads(json.loads(configuration).get("configuration"))
else:
    config = config_json

# handle class probabilities explicitly
from ibm_wos_utils.joblib.utils.param_utils import get

class_probabilities = get(common_configuration, "class_probabilities")
if class_probabilities:
    # clean up any class probability columns already added
    updated_output_data_schema_fields = []
    for field in get(config, "common_configuration.output_data_schema.fields"):
        if get(field, "metadata.modeling_role") == "class_probability":
            continue

        updated_output_data_schema_fields.append(field)

    # add class probabilities to output_data_schema
    for class_probability in class_probabilities:
        updated_output_data_schema_fields.append({
            "name": class_probability,
            "type": "double",
            "nullable": True,
            "metadata": {
                "modeling_role": "class_probability"
            }
        })

    config["common_configuration"]["output_data_schema"]["fields"] = updated_output_data_schema_fields
    config["common_configuration"]["probability_fields"] = class_probabilities

display(create_download_link(config, "config"))
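
To illustrate what the class probability handling above does to the output schema, here is a minimal standalone sketch on plain dictionaries. The helper name `replace_class_probability_fields` and the sample field and class names are illustrative assumptions, not part of `ibm_wos_utils`.

```python
def replace_class_probability_fields(fields, class_probabilities):
    """Return a new field list without stale class_probability entries,
    followed by one double-typed field per name in class_probabilities."""
    kept = [
        field for field in fields
        if field.get("metadata", {}).get("modeling_role") != "class_probability"
    ]
    added = [
        {
            "name": name,
            "type": "double",
            "nullable": True,
            "metadata": {"modeling_role": "class_probability"},
        }
        for name in class_probabilities
    ]
    return kept + added


# Illustrative schema; these field and class names are made up.
sample_fields = [
    {"name": "prediction", "type": "string", "nullable": True,
     "metadata": {"modeling_role": "prediction"}},
    {"name": "old_prob", "type": "double", "nullable": True,
     "metadata": {"modeling_role": "class_probability"}},
]
updated = replace_class_probability_fields(sample_fields, ["prob_no_risk", "prob_risk"])
print([field["name"] for field in updated])
# prints ['prediction', 'prob_no_risk', 'prob_risk']
```

Dropping the existing `class_probability` fields first keeps the step idempotent: running it twice does not duplicate columns.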

### Download Drift Archive


In [None]:
import os
from ibm_wos_utils.joblib.utils.notebook_utils import create_download_link
    
if ENABLE_MODEL_DRIFT or ENABLE_DATA_DRIFT:
    drift_archive = client.engine.get_file(job_response.get(
            "output_file_path") + "/drift_configuration")

    with open("tmp_drift.tar.gz", mode="wb") as tf:
        tf.write(drift_archive)
        tf.flush()
        drift_archive = spark.sparkContext.sequenceFile(tf.name).collect()[0][1]
    os.remove("tmp_drift.tar.gz")

If `ENABLE_MODEL_DRIFT` is `True` and `MODEL_TYPE` is not `regression`, the cell below checks the training quality of the drift detection model, which is used to detect drops in accuracy. If the trained drift detection model does not meet the quality standards, a message is displayed stating that drops in accuracy cannot be detected. By default, the drift model is generated without any hyperparameter optimisation, i.e. `enable_drift_model_tuning` is `False`. If the quality check fails, try running the configuration job again with `enable_drift_model_tuning` set to `True` in the `drift_parameters` above.

In [None]:
from ibm_wos_utils.joblib.utils.notebook_utils import check_for_ddm_quality

if ENABLE_MODEL_DRIFT and (MODEL_TYPE != "regression"):
    check_for_ddm_quality(drift_archive)

In [None]:
if ENABLE_MODEL_DRIFT or ENABLE_DATA_DRIFT:
    display(create_download_link(drift_archive, "drift", client))

### Generate DDL for creating Drifted Transactions table


In [None]:
from ibm_wos_utils.joblib.utils.ddl_utils_db2 import generate_drift_table_ddl

# Schema Name where Drifted Transactions Table should be created.
DRIFT_SCHEMA_NAME = None

if ENABLE_MODEL_DRIFT or ENABLE_DATA_DRIFT:
    generate_drift_table_ddl(drift_archive, schema_name=DRIFT_SCHEMA_NAME, table_suffix=NOTEBOOK_RUN_ID, partition_column=PARTITION_COLUMN)

### Generate Perturbations csv

In [None]:
import pandas as pd
from ibm_wos_utils.explainability.utils.perturbations import Perturbations

if ENABLE_EXPLAINABILITY:
    # Number of perturbations to generate in perturbations.csv (default: 10000).
    # Adjust perturbations_count to suit your use case.
    perturbations_count = 10000
    perturbations = Perturbations(training_stats=config.get("explainability_configuration"),
                                  problem_type=MODEL_TYPE, perturbations_count=perturbations_count)
    perturbs_df = perturbations.generate_perturbations()
    perturbs_df.to_csv("perturbations.csv", index=False)

The perturbations required for explainability are stored in the file `perturbations.csv` by the step above.
Score these perturbations against your model and provide the scoring output as a dataframe with **prediction** and **probability** columns.

The prediction and probability column names in the dataframe must match the `PREDICTION_COLUMN` and `PROBABILITY_COLUMN` values provided in this notebook.

Note: For regression models, the probability column is not required.
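
The column requirement above can be checked before the explainability archive is built. This is a minimal sketch, assuming the scored output is available as CSV text. `read_header` and `missing_scored_columns` are hypothetical helpers, and the column names passed in stand in for your `PREDICTION_COLUMN` and `PROBABILITY_COLUMN` values.

```python
import csv
import io


def read_header(csv_text):
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)), [])


def missing_scored_columns(header, prediction_column, probability_column, model_type):
    """List the required scoring columns that are absent from header.

    The probability column is only required when model_type is not "regression".
    """
    required = [prediction_column]
    if model_type != "regression":
        required.append(probability_column)
    return [column for column in required if column not in header]
```

An empty list means the scored file carries every column the notebook expects for the given model type.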

In [None]:
from ibm_wos_utils.joblib.utils.notebook_utils import create_archive
from json import dumps

if ENABLE_EXPLAINABILITY:
    # Load a csv output of scored perturbations as pandas DataFrame
    scored_perturbations = pd.read_csv("scored_perturbations.csv")
    archive_data = {
        "perturbations.csv": scored_perturbations.to_csv(index=False),
        "training_statistics.json": dumps({"training_statistics": config.get("explainability_configuration")})
    }
    display(create_archive(data=archive_data, archive_name="explainability"))

### Generate DDL for creating Explanations Queue table [Optional]

Provide details for creating a separate Explanations Queue table. IBM Watson OpenScale generates explanations for all the transactions in this table. Alternatively, the payload table created earlier in this notebook can be used for this purpose.

In [None]:
from ibm_wos_utils.joblib.utils.ddl_utils_db2 import generate_payload_table_ddl

# Schema Name where Explanations Queue Table should be created.
EXPLANATIONS_QUEUE_SCHEMA_NAME = None

if ENABLE_EXPLAINABILITY:
    generate_payload_table_ddl(config_json, schema_name=EXPLANATIONS_QUEUE_SCHEMA_NAME,
                               table_prefix="explanations_queue", table_suffix=NOTEBOOK_RUN_ID,
                               max_length_categories=max_length_categories, partition_column=PARTITION_COLUMN)

### Generate DDL for creating Explanations Table

In [None]:
from ibm_wos_utils.joblib.utils.ddl_utils_db2 import generate_explanations_table_ddl

# Schema Name where Explanations Table should be created.
EXPLANATIONS_SCHEMA_NAME = None

if ENABLE_EXPLAINABILITY:
    generate_explanations_table_ddl(schema_name=EXPLANATIONS_SCHEMA_NAME, table_suffix=NOTEBOOK_RUN_ID, partition_column=PARTITION_COLUMN)

### Create Configuration Archive
Collect all the artefacts generated above (configuration JSON, drift archive, explainability archive) and bundle them into a single archive. The IBM Watson OpenScale UI/SDK uses this archive as is to onboard the model for monitoring.
The UI/SDK identifies the individual artefacts and uploads each one to its respective monitor.

In [None]:
import tarfile
import json

# update flags in configuration json
config["common_configuration"]["enable_drift"] = bool(ENABLE_MODEL_DRIFT or ENABLE_DATA_DRIFT)
config["common_configuration"]["enable_explainability"] = ENABLE_EXPLAINABILITY
config["common_configuration"]["enable_fairness"] = ENABLE_FAIRNESS
config["common_configuration"]["enable_quality"] = ENABLE_QUALITY

# write to local
with open("common_configuration.json", "wb") as f:
    f.write(json.dumps(config).encode('utf-8'))

if ENABLE_FAIRNESS:
    # write fairness_statistics.json to local
    with open("fairness_statistics.json", "wb") as f:
        f.write(json.dumps(config.get("fairness_configuration")).encode('utf-8'))

if ENABLE_MODEL_DRIFT or ENABLE_DATA_DRIFT:
    # build and write drift archive to local
    with open("drift_archive.tar.gz", "wb") as f:
        from ibm_wos_utils.joblib.utils.notebook_utils import bundle_drift_model
        f.write(bundle_drift_model(drift_archive, client))

if ENABLE_EXPLAINABILITY:
    # build and write explain archive to local
    from io import BytesIO
    with BytesIO() as archive:
        with tarfile.open(fileobj=archive, mode="w:gz") as tf:
            for filename, filedata in archive_data.items():
                content = BytesIO(filedata.encode("utf8"))
                tarinfo = tarfile.TarInfo(filename)
                tarinfo.size = len(content.getvalue())
                tf.addfile(
                    tarinfo=tarinfo, fileobj=content)

        with open("explainability.tar.gz", "wb") as f:
            f.write(archive.getvalue())

with tarfile.open("configuration_archive.tar.gz", "w:gz") as f:
    # collect all files from local and write to configuration archive
    f.add("common_configuration.json", arcname="common_configuration.json")
    
    if ENABLE_MODEL_DRIFT or ENABLE_DATA_DRIFT:
        f.add("drift_archive.tar.gz", arcname="drift_archive.tar.gz")
        
    if ENABLE_EXPLAINABILITY:
        f.add("explainability.tar.gz", arcname="explainability.tar.gz")
        
    if ENABLE_FAIRNESS:
        f.add("fairness_statistics.json", arcname="fairness_statistics.json")
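
Before uploading, it can help to confirm that the archive contains the files the enabled monitors require. The sketch below is an optional check that uses only the standard library. `expected_members`, `archive_members`, and `missing_members` are hypothetical helper names, and the expected file names are taken from the cell above.

```python
import io
import tarfile


def expected_members(enable_drift, enable_explainability, enable_fairness):
    """File names the configuration archive should contain for the given flags."""
    members = ["common_configuration.json"]
    if enable_drift:
        members.append("drift_archive.tar.gz")
    if enable_explainability:
        members.append("explainability.tar.gz")
    if enable_fairness:
        members.append("fairness_statistics.json")
    return members


def archive_members(path=None, fileobj=None):
    """Return the member names of a gzip-compressed tar archive.

    Pass either a file path or an open binary file object.
    """
    with tarfile.open(name=path, fileobj=fileobj, mode="r:gz") as tf:
        return tf.getnames()


def missing_members(actual, expected):
    """Return the expected names that are not present in actual."""
    return [name for name in expected if name not in actual]
```

For example, `missing_members(archive_members("configuration_archive.tar.gz"), expected_members(ENABLE_MODEL_DRIFT or ENABLE_DATA_DRIFT, ENABLE_EXPLAINABILITY, ENABLE_FAIRNESS))` should return an empty list when the archive is complete.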

In [None]:
# create download link for configuration package
import base64

data = None
with open('configuration_archive.tar.gz', 'rb') as f:
    # read configuration archive from local
    data = f.read()

format_args = {
    "payload": base64.b64encode(data).decode(),
    "title": "Download Configuration Archive",
    "filename": "configuration_archive.tar.gz"
}

from IPython.display import HTML
# The archive is a gzip-compressed tarball, so use a gzip MIME type in the data URI.
html = '<a download="{filename}" href="data:application/gzip;base64,{payload}" target="_blank">{title}</a>'
HTML(html.format(**format_args))

#### Authors
Developed by [Prem Piyush Goyal](mailto:prempiyush@in.ibm.com), [Pratap Kishore Varma V](mailto:pvemulam@in.ibm.com)