<img src="https://github.com/pmservice/ai-openscale-tutorials/raw/master/notebooks/images/banner.png" align="left" alt="banner">

# Notebook for generating configuration package for batch subscriptions in IBM Watson OpenScale in IBM Cloud Pak for Data v5.2

This notebook shows how to generate a configuration package containing the following artefacts:
1. Configuration JSON needed to configure an IBM Watson OpenScale subscription.
2. Drift Configuration Archive
3. Drift v2 Configuration Archive
4. Explainability Configuration Archive

Optionally, you can use the Configuration JSON to generate the following:
1. DDLs for creating the Feedback, Payload, Drifted Transactions and Explanations tables

Provide the necessary inputs (where marked) and download the generated configuration package.
This package contains artefacts for the different monitors, which must then be uploaded in the IBM Watson OpenScale UI during configuration.

Note: This notebook can generate a configuration package for only one model at a time. For multiple models, run the notebook once per model.

**Contents:**
1. [Install Pre-requisites and required dependencies](#Installing-Dependencies)
2. [Specify Model Details](#Specify-Model-Details)
3. [Select IBM Watson OpenScale Services and provide configuration options](#Select-IBM-Watson-OpenScale-Services)
4. [Provide Spark Connection Details](#Provide-Spark-Connection-Details)
5. [Provide Storage Inputs](#Provide-Storage-Inputs)
6. [Generate Configuration Package](#Generate-Configuration-Package)
    1. [Download Configuration Package](#Download-Configuration-Package)
7. [Generate DDLs for tables](#Generate-DDLs-For-Tables)
8. [Helper Methods](#Helper-Methods)
    1. [Use sample data and get feature and categorical columns](#Use-Sample-Data-And-Get-Feature-And-Categorical-Columns)
    2. [Generate DDL for creating Scored Training data table](#Generate-DDL-for-creating-Scored-Training-data-table)

In [None]:
# Note: Restart kernel after the dependencies are installed
import sys

PYTHON = sys.executable

!$PYTHON -m pip install --no-warn-conflicts pyspark | tail -n 1

# When this notebook is to be run on a zLinux cluster,
# install scikit-learn==1.5.1 using conda before installing ibm-wos-utils
# !conda install scikit-learn=1.5.1

!$PYTHON -m pip install --no-warn-conflicts "ibm-metrics-plugin[notebook]~=5.2.0"

## Provide Model Details

| Parameter | Description | Possible Value(s) |
| :- | :- | :- |
| label_column | The column which contains the target field (also known as label column or the class label). | |
| model_type | Enumeration classifying if your model is a binary or a multi-class classifier or a regressor. | `binary`, `multiclass`, `regression` |
| feature_columns | Columns identified as features by the model. The order of the feature columns should be the same as in the subscription. Use the helper methods to compute these if required. | A list of column names |
| categorical_columns | Feature columns identified as categorical by the model. Use the helper methods to compute these if required. | A list of column names |
| prediction | The column containing the model output. This should be of the same data type as the label column. | |
| probability | The column (of type array) containing the model probabilities for all the possible prediction outcomes. This is not required for regression models. | |
| class_probabilities | The columns (of type double) containing the model probabilities of class labels. This is not required for regression models. For example, for the Go Sales model deployed in MS Azure ML Studio, the value of this property would be `["Scored Probabilities for Class \"Camping Equipment\"", "Scored Probabilities for Class \"Mountaineering Equipment\"", "Scored Probabilities for Class \"Personal Accessories\""]`. Note that the double quotes in the example above must be escaped. | |
| protected_attributes | [Optional] The columns which exist in the training data but are not used to train the model. This is required to monitor fairness on non-feature columns, i.e. indirect bias. | A list of non-feature column names |
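
The constraints in the table above can be checked before the configuration job is submitted. The following is a minimal sketch and not part of the OpenScale libraries: `validate_model_details` is a hypothetical helper, and the sample column names are made up for illustration.

```python
ALLOWED_MODEL_TYPES = {"binary", "multiclass", "regression"}


def validate_model_details(params):
    """Return a list of problems found in the model details (hypothetical helper)."""
    problems = []
    if params.get("model_type") not in ALLOWED_MODEL_TYPES:
        problems.append(
            "model_type must be one of: " + ", ".join(sorted(ALLOWED_MODEL_TYPES))
        )

    features = params.get("feature_columns") or []

    # Categorical columns are feature columns, so they must appear in feature_columns.
    extra = [c for c in params.get("categorical_columns") or [] if c not in features]
    if extra:
        problems.append("categorical_columns not in feature_columns: {}".format(extra))

    # Protected attributes are non-feature columns.
    overlap = [c for c in params.get("protected_attributes") or [] if c in features]
    if overlap:
        problems.append("protected_attributes must not be feature columns: {}".format(overlap))
    return problems


# Made-up sample: "Sex" is listed as categorical but is not a feature column.
sample = {
    "model_type": "binary",
    "label_column": "Risk",
    "feature_columns": ["LoanDuration", "LoanPurpose"],
    "categorical_columns": ["LoanPurpose", "Sex"],
    "prediction": "prediction",
    "probability": "probability",
    "protected_attributes": ["Sex"],
}
print(validate_model_details(sample))
```

An empty list means that none of these checks failed; it does not guarantee that the configuration job itself will accept the values.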


## Select IBM Watson OpenScale services

| Parameter | Description | Possible Value(s) |
| :- | :- | :- |
| enable_quality | Boolean value to allow generation of common configuration details needed if quality alone is selected | `True` or `False` |
| enable_fairness | Boolean value to allow generation of fairness specific data distribution needed for configuration | `True` or `False` |
| enable_drift | Boolean value to allow generation of Drift Archive containing relevant information for Model and Data Drift. | `True` or `False` |
| enable_drift_v2 | Boolean value to allow generation of Drift v2 configuration  | `True` or `False` |
| enable_explainability | Boolean value to allow generation of explainability configuration  | `True` or `False` |


### Provide Drift Parameters [Required if enable_drift is set to True]
Provide the drift parameters. `model_drift.enable` and `data_drift.enable` flags must be set if drift is enabled.

### Provide Drift v2 Parameters [Required if enable_drift_v2 is set to True]
Provide the Drift v2 parameters. Set the variable `drift_v2_parameters` to `None` or `{}` if drift v2 is not to be enabled.


### Provide Fairness Parameters [Required if enable_fairness is set to True]
Provide the fairness parameters. Set the variable `fairness_parameters` to `None` or `{}` if fairness is not to be enabled.

### Provide a method to use for scoring [Required if enable_explainability is set to True]
As part of configuration, explainability requires a scoring function that takes a data frame as input and returns two arrays.

- The input data frame is expected to contain the feature columns.
- The score function must return a prediction column array and a probability column array.
- The data type of the label column and the prediction column must be the same. Make sure that the label column and the prediction column array contain the same unique class labels.
- Please update the score function below with the help of templates documented [here](https://github.com/IBM/watson-openscale-samples/wiki/Score-function-templates-for-IBM-Watson-OpenScale), if applicable.
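
As an illustration of the contract described above, here is a minimal sketch of a score function. It assumes the input is a pandas-style data frame and uses a hard-coded rule in place of a real model; `CLASS_LABELS`, the `LoanDuration` column and the rule itself are hypothetical. A real implementation would call your deployed model instead, following the templates linked above.

```python
import numpy as np
import pandas as pd

# Hypothetical class labels; they must match the unique values of the label column.
CLASS_LABELS = ["No Risk", "Risk"]


def score(training_data_frame):
    """Return (predictions, probabilities) for the rows of the input data frame."""
    # Stand-in for a real model: longer loans get a higher "Risk" probability.
    duration = training_data_frame["LoanDuration"].to_numpy(dtype=float)
    risk = np.clip(duration / 60.0, 0.0, 1.0)

    # One row per record, one column per class, in CLASS_LABELS order.
    probabilities = np.column_stack([1.0 - risk, risk])
    # Predictions use the same labels (and data type) as the label column.
    predictions = np.array(CLASS_LABELS)[probabilities.argmax(axis=1)]
    return predictions, probabilities


sample = pd.DataFrame({"LoanDuration": [12, 48]})
predictions, probabilities = score(sample)
print(predictions)
print(probabilities.shape)
```

A function of this shape would be assigned to `scoring_fn` in place of `None`; whether the library expects exactly this signature should be confirmed against the linked templates.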

In [18]:
common_params = {
    "model_type" : "<to_be_edited>",
    "label_column" : "<to_be_edited>",
    "feature_columns": ["<to_be_edited>"],
    "categorical_columns": ["<to_be_edited>"],
    "prediction" : "<to_be_edited>",
    "probability" : "<to_be_edited>",
    "class_probabilities": ["<to_be_edited>"],
    "class_labels": ["<to_be_edited>"], # [Optional]. The list of unique class labels in the order of model prediction.
    "enable_quality" : True,
    "enable_drift" : True,
    "enable_drift_v2" : True,
    "enable_fairness" : True,
    "enable_explainability" : True
}
# [Optional] Provide list of protected attributes i.e non-feature columns present in the data.
protected_attributes = []
common_params["protected_attributes"] = protected_attributes

"""
drift_parameters = {
    "model_drift": {
        "enable": True,
        # enable_drift_model_tuning - Controls whether there will be Hyper-Parameter 
        # Optimisation in the Drift Detection Model. Default: False
        "enable_drift_model_tuning": True,
        
        # max_bins - Specify the maximum number of categories in categorical columns.
        # Default: OpenScale will determine an approximate value. Use this only in cases
        # where OpenScale approximation fails.
        "max_bins": 10,
    },
    "data_drift": {
        "enable": True,
        # enable_two_col_learner - Enable learning of data constraints on two column 
        # combinations. Default: True
        "enable_two_col_learner": True,
        
        # use_alt_learner - Boolean parameter which switches learning method to help 
        # with performance during constraint learning process. Default: False
        "use_alt_learner": False,
        
        # categorical_unique_threshold - Used to discard categorical columns with a
        # large number of unique values relative to total rows in the column.
        # Should be between 0 and 1. Default: 0.8
        "categorical_unique_threshold": 0.7,
        
        # max_distinct_categories - Used to discard categorical columns with a large
        # absolute number of unique categories. Also, used for not learning
        # categorical-categorical constraint, if potential combinations of two columns
        # are more than this number. Default: 100000
        "max_distinct_categories": 10000,

        # max_ranges_modifier - Affects the number of ranges we find for a numerical column.
        # For a numerical column, we learn multiple ranges instead of one min-max depending
        # on how sparse data is. This modifier combined with approximate distinct values in
        # the column defines the upper limit on how many bins to divide data into during
        # multiple ranges computation. This can either be a float or a dictionary of column
        # names and float values. Its value should be greater than 0. Default: 0.01
        # 1. float: This value is applied for all numerical columns. Default value of 0.01
        # indicates total number of bins used during computation of ranges are not more than
        # 1% of distinct values in the column.
        # 2. dict of str -> float: A column name -> value, dict can be used to over-ride
        # individual modifier for each column. If not provided for a column, default value
        # of 0.01 will be used.
        "max_ranges_modifier": 0.01,
            
        # tail_discard_threshold -- Used to discard off values from either end of data
        # distribution in a column if the data is found to have large ranges which results in
        # data being divided into a large number of bins for multiple ranges computation. This
        # threshold will be used if these bins are found to be greater than
        # `max_ranges_modifier * approx_distinct_count` for a column. Default value indicates
        # that 1 percentile data from either ends will be discarded. Its value can be between
        # 0 and 0.1. Default: 0.01
        "tail_discard_threshold": 0.01,
        
        # user_overrides - Used to override drift constraint learning to selectively learn 
        # constraints on feature columns. It is a list of configurations, each specifying 
        # whether to learn distribution and/or range constraint on given set of columns.
        # First configuration of a given column would take preference.
        # 
        # "constraint_type" can have two possible values : single|double - signifying 
        # if this configuration is for single column or two column constraint learning.
        #
        # "learn_distribution_constraint" : True|False - signifying whether to learn 
        # distribution constraint for given config or not.
        #
        # "learn_range_constraint" : True|False - signifying whether to learn range 
        # constraint for given config or not. Only applicable to numerical feature columns.
        # 
        # "features" : [] - provides either a list of feature columns to be governed by 
        # given configuration for constraint learning.
        # It is a list of strings containing feature column names if "constraint_type" is "single".
        # It is a list of lists of strings containing feature column names if "constraint_type" is 
        # "double". If only one column name is provided, all of the two column constraints 
        # involving this column will be dictated by given configuration during constraint learning.
        # This list is case-insensitive.
        #
        # In the example below, first config block says do not learn distribution and range single 
        # column constraints for features "MARITAL_STATUS", "PROFESSION", "IS_TENT" and "age".
        # Second config block says do not learn distribution and range two column constraints 
        # where "IS_TENT", "PROFESSION", and "AGE" are one of the two columns. Whereas, specifically, 
        # do not learn two column distribution and range constraint on combination of "MARITAL_STATUS" 
        # and "PURCHASE_AMOUNT".
        "user_overrides": [
            {
                "constraint_type": "single",
                "learn_distribution_constraint": False,
                "learn_range_constraint": False,
                "features": [
                  "MARITAL_STATUS",
                  "PROFESSION",
                  "IS_TENT",
                  "age"
                ]
            },
            {
                "constraint_type": "double",
                "learn_distribution_constraint": False,
                "learn_range_constraint": False,
                "features": [
                  [
                    "IS_TENT"
                  ],
                  [
                    "MARITAL_STATUS",
                    "PURCHASE_AMOUNT"
                  ],
                  [
                    "PROFESSION"
                  ],
                  [
                    "AGE"
                  ]
                ]
            }
        ]
    }
}
"""
drift_parameters = {
    "model_drift": {
        "enable": True
    },
    "data_drift": {
        "enable": True,
        "drift_model_path": "./drift_detection_model"
    },
    "drift_model_path": "drift_detection_model"
}

"""
drift_v2_parameters = {
    "feature_importance": [                            
                            "LoanDuration",
                            "LoanPurpose",
                            "CreditHistory",
                            .
                            .
                            .
                            ],
    "most_important_features": [
                            "LoanDuration",
                            "LoanPurpose",
                            "CreditHistory",
                            .
                            .
                            .
                            ],
    "important_input_metadata_columns": [                            
                            "CheckingStatus",
                            "LoanDuration",
                            "CreditHistory"
                            ] 
}
"""
drift_v2_parameters = {  
    "train_archive": True,
    "feature_importance": [], #required field
    "most_important_features": [],
    "important_input_metadata_columns": []
}

"""
fairness_parameters = {
    "features": [
        {
            "feature": "<The fairness attribute name>", # The feature on which the fairness check is to be done
            "majority": [<majority groups/ranges for categorical/numerical columns respectively>],
            "minority": [<minority groups/ranges for categorical/numerical columns respectively>],
            "metric_ids": [<list of metric ids to be computed>],
            "threshold": <The threshold value between 0 and 1> #this is needed for disparate impact
            # Valid metrics are fairness_value (disparate impact),statistical_parity_difference,average_odds_difference, false_discovery_rate_difference, error_rate_difference, false_negative_rate_difference, false_omission_rate_difference, false_positive_rate_difference, true_positive_rate_difference, average_abs_odds_difference
        }
    ],
    "thresholds" : [
        {
        "metric_id": "<metric_id>",
        "specific_values": [
            {
                "applies_to": [
                    {
                        "key": "feature",
                        "type": "tag",
                        "value": "<fairness attribute name>"
                    }
                ],
                "value": <lower value>
            }
        ],
        "type": "lower_limit",
        "value": <lower value>
    },
    {
        "metric_id": "<metric_id>",
        "specific_values": [
            {
                "applies_to": [
                    {
                        "key": "feature",
                        "type": "tag",
                        "value": "<fairness attribute name>"
                    }
                ],
                "value": <upper_value>
            }
        ],
        "type": "upper_limit",
        "value": <upper_value>
    }
    ],
#    #example of fairness configuration:
#     "features": [
#         {
#             "feature": "Sex", # The feature on which the fairness check is to be done
#             "majority": ["male"],
#             "minority": ["female"],
#             "metric_ids": ["fairness_value","statistical_parity_difference"]
#         }
#     ],
#     "thresholds": [{
#         "metric_id": "fairness_value",
#         "specific_values": [
#             {
#                 "applies_to": [
#                     {
#                         "key": "feature",
#                         "type": "tag",
#                         "value": "Sex"
#                     }
#                 ],
#                 "value": 85
#             }
#         ],
#         "type": "lower_limit",
#         "value": 85
#     },
#     {
#         "metric_id": "fairness_value",
#         "specific_values": [
#             {
#                 "applies_to": [
#                     {
#                         "key": "feature",
#                         "type": "tag",
#                         "value": "Sex"
#                     }
#                 ],
#                 "value": 125
#             }
#         ],
#         "type": "upper_limit",
#         "value": 125
#     },
#     {
#         "metric_id": "statistical_parity_difference",
#         "specific_values": [
#             {
#                 "applies_to": [
#                     {
#                         "key": "feature",
#                         "type": "tag",
#                         "value": "Sex"
#                     }
#                 ],
#                 "value": -0.3
#             }
#         ],
#         "type": "lower_limit",
#         "value": -0.3
#     },
#     {
#         "metric_id": "statistical_parity_difference",
#         "specific_values": [
#             {
#                 "applies_to": [
#                     {
#                         "key": "feature",
#                         "type": "tag",
#                         "value": "Sex"
#                     }
#                 ],
#                 "value": 0.3
#             }
#         ],
#         "type": "upper_limit",
#         "value": 0.3
#     }],
    
    "class_label": common_params.get("label_column"),
    "favourable_class": [<favourable classes/ranges for classification/regression models respectively>],
    "unfavourable_class": [<unfavourable classes/ranges for classification/regression models respectively>],
    "min_records": <The minimum number of records on which the fairness check is to be done>,

    # The following parameters are only supported for subscriptions with a synchronous scoring endpoint.
    
    "perform_perturbation": <(Boolean) Whether the user wants to calculate the balanced (payload + perturbed) data.>,
    "sample_size_percent": <(Integer 1-100) How much percentage of data to be read for balanced data calculation.>,
    "numerical_perturb_count_per_row": <[Optional] The number of perturbed rows to be generated per row for numerical perturbation. [Default: 2]>,
    "float_decimal_place_precision": <[Optional] The decimal place precision to be used for numerical perturbation when data is float.>,
    "numerical_perturb_seed": <[Optional] The seed to be used for numerical perturbation while picking up random values.>,
    "scoring_page_size": <[Optional] The size of the page in the number of rows. [Default: 1000]>
}
"""
fairness_parameters = {}

"""
# Lime global explanation feature is available from Cloud Pak for Data version 4.6.4 onwards.
# Set the below explainability parameters to enable lime global explanation generation.
# Note: When LIME global explanation is enabled, the explainability archive upload and explainability monitor enablement should be done using python sdk/api. 
# LIME global explanation configuration is not supported from IBM Watson OpenScale GUI.
explainability_parameters = {
    "lime":{ # specify this attribute only if you want to generate lime global or local explanations
            "perturbations_count": 10000 # default value for the number of perturbations to be generated.
        },
    "global_explanation": {
        "enabled": True, # Enable global explanation
        "explanation_method": "lime", # The explanation method to use
        "training_data_sample_size": 1000, # [Optional] The sample size of records to be used for generating training data global explanation. If not specified entire training data is used.
        "sample_size": 1000, # [Optional] The sample size of records to be used for generating payload data global explanation. If not specified entire data in the payload window is used.
    }
}
"""
explainability_parameters = {}
scoring_fn = None

common_params["drift_parameters"] = drift_parameters
common_params["drift_v2_parameters"] = drift_v2_parameters  
common_params["fairness_parameters"] = fairness_parameters
common_params["explainability_parameters"] = explainability_parameters
common_params["score_function"] = scoring_fn
common_params["score_batch_size"] = 1000

## Provide Spark Connection Details

To generate the configuration for monitoring models in IBM Watson OpenScale, a Spark compute engine is required. It can be either IBM Analytics Engine or your own Spark cluster. Provide the details of one of them in this section.

Note: if you are using your own Spark cluster, check out the IBM Watson OpenScale documentation on how to set up the Spark Manager API for use with IBM Watson OpenScale services.

### Parameters for IBM Analytics Engine
If your job is going to run on a Spark cluster that is part of an IBM Analytics Engine instance on IBM Cloud Pak for Data, enter the following details:

| Parameter | Description | Possible Value(s) |
| :- | :- | :- |
| display_name | Display Name of the Spark instance in IBM Analytics Engine | |
| location_type | Identifies if compute engine is IBM IAE or Remote Spark. For IBM IAE, this must be set to `cpd_iae`. | `cpd_iae` |
| endpoint | Spark Jobs Endpoint for IBM Analytics Engine | |
| volume | IBM Cloud Pak for Data storage volume name | |
| username | IBM Cloud Pak for Data username | |
| apikey | IBM Cloud Pak for Data API key | |

### Parameters for Remote Spark Cluster
If your job is going to run on a Spark cluster that is part of a remote Hadoop ecosystem, enter the following details:

| Parameter | Description | Possible Value(s) |
| :- | :- | :- |
| location_type | Identifies if compute engine is IBM IAE or Remote Spark. For Remote Spark, this must be set to `custom`. | `custom` |
| endpoint | Endpoint URL where the Spark Manager Application is running | |
| username | Username to connect to Spark Manager Application | |
| password | Password to connect to Spark Manager Application | |
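
The two parameter tables above map onto one nested dictionary, with different fields filled in depending on `location_type`. The sketch below shows one way to assemble it; `build_spark_connection_info` is a hypothetical helper (not part of any IBM library), and the endpoint and credentials are placeholders.

```python
def build_spark_connection_info(location_type, endpoint, username,
                                apikey=None, password=None,
                                display_name=None, volume=None):
    """Build the Spark connection dictionary for IAE (cpd_iae) or remote Spark (custom)."""
    connection = {"endpoint": endpoint, "location_type": location_type}
    credentials = {"username": username}

    if location_type == "cpd_iae":
        # IBM Analytics Engine: instance display name, storage volume and API key.
        connection.update({"display_name": display_name, "volume": volume})
        credentials["apikey"] = apikey
    elif location_type == "custom":
        # Remote Spark: username and password for the Spark Manager Application.
        credentials["password"] = password
    else:
        raise ValueError("location_type must be 'cpd_iae' or 'custom'")

    # Report every field that was left empty for the chosen location type.
    missing = [k for k, v in {**connection, **credentials}.items() if not v]
    if missing:
        raise ValueError("Missing values for: {}".format(", ".join(missing)))
    return {"credentials": {"connection": connection, "credentials": credentials}}


remote = build_spark_connection_info(
    "custom", "https://spark-manager.example.com", "spark_user", password="secret")
print(remote["credentials"]["connection"])
```

The returned dictionary has the same nesting as the `spark_connection_info` variable defined in the next code cell.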


### Provide Spark Resource Settings [Optional]
Configure how much of your Spark cluster's resources this job can consume. Set the variable `spark_settings` to `{}` if no customisation is required.

| Parameter | Description |
| :- | :- |
| max_num_executors | Maximum Number of executors to launch for this session |
| min_executors | Minimum Number of executors to launch for this session |
| executor_cores | Number of cores to use for each executor |
| executor_memory | Amount of memory (in GBs) to use per executor process |
| driver_cores | Number of cores to use for the driver process |
| driver_memory | Amount of memory (in GBs) to use for the driver process |

### Provide Additional Spark Settings [Optional]

Any other Spark property that can be set via **SparkConf**. These properties are sent to the Spark cluster verbatim. Set the variable `conf` to `None` or `{}` if no additional property is required.
If `conf` is set, make sure to also set default values for the `spark_settings` parameters.

- [A list of available properties for Spark 2.4.6](https://spark.apache.org/docs/2.4.6/configuration.html#available-properties)

In [19]:
spark_connection_info = {
    "credentials": {
        "connection": {
            "endpoint": "<to_be_edited>",
            "location_type": "<to_be_edited>",
            "display_name": "<to_be_edited>",
            "volume": "<to_be_edited>"
        },
        "credentials": {
            "username": "<to_be_edited>",
            "password": "<to_be_edited>",
            "apikey": "<to_be_edited>"
        }
    }
}



"""
Example:

spark_settings = {
    # max_num_executors: Maximum Number of executors to launch for this session
    "max_num_executors": "2",
    
    # min_executors: Minimum Number of executors to launch for this session
    "min_executors": "1",
    
    # executor_cores: Number of cores to use for each executor
    "executor_cores": "2",
    
    # executor_memory: Amount of memory (in GBs) to use per executor process
    "executor_memory": "2",
    
    # driver_cores: Number of cores to use for the driver process
    "driver_cores": "2",
    
    # driver_memory: Amount of memory (in GBs) to use for the driver process 
    "driver_memory": "1"
}
"""
spark_settings = {}

"""
Example:

conf = {
    "spark.yarn.maxAppAttempts": 1
}
"""
# conf = {}
# spark_settings["conf"] = conf

spark_connection_info["spark_settings"] = spark_settings

## Provide Scored Training Data Table location

To generate the configuration for monitoring a model in IBM Watson OpenScale, the location of the scored training data table is required. Supported locations are Hive, DB2 or Postgres. If you do not already have such a table, refer to the helper methods section on how to generate a DDL for the scored training data table. Using this DDL, create your table, load the data and provide the location here.

### Provide DB2 or Postgres table details where training data is hosted

| Parameter | Description | Possible Value(s) |
| :- | :- | :- |
| type | Describes the type of storage being used. For DB2 and Postgres, this must be set to `jdbc`. | `jdbc` |
| jdbc_url | Connection string for jdbc. DB2 Example: `jdbc:db2://jdbc_host:jdbc_port/database_name`, Postgres Example: `jdbc:postgresql://jdbc_host:jdbc_port/database_name` | |
| jdbc_driver | Optional. Class name of the JDBC driver to use to connect. Example: for DB2 use `com.ibm.db2.jcc.DB2Driver`, for Postgres use `org.postgresql.Driver` | |
| use_ssl | Boolean Flag to indicate whether to use SSL while connecting | `True` or `False` |
| certificate | SSL Certificate [Base64 encoded string] of the JDBC Connection. Ignored if `use_ssl` is `False`. | |
| location_type | Identifies the type of location for connection to use. For DB2 and Postgres, this must be set to `jdbc`. | `jdbc` |
| username | Username of the JDBC Connection | |
| password | Password of the JDBC Connection | |
| database | Name of database hosting training data table | |
| schema | Name of schema hosting training data table | |
| table | Name of training data table | |
| partition_column | The column to help Spark read and write data using multiple workers in your JDBC storage. This will help improve the performance of your Spark jobs. Be careful when choosing an existing feature column as the partition column. If the data in this feature column is not evenly distributed across its possible values, it can cause data skew in Spark computation: the majority of the data is sent to one worker, which wastes compute resources and increases computation time. It is recommended to use a column with monotonically increasing values as the partition column. | |
| num_partitions | The maximum number of partitions that Spark can divide the data into. In JDBC, it also means the maximum number of connections that Spark can make to the JDBC store for reading/writing data. The recommended value is calculated as: 3 * num_executors * num_cores_per_executor. | |
| jdbc_connection_type | JDBC Connection type used, supported types: db2, postgresql | |
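
As a worked example of two rows in the table above, the sketch below computes the recommended `num_partitions` value and builds a `jdbc_url` in the documented form. Both helper functions, the host name, the database name and the executor counts are hypothetical and only illustrate the arithmetic and the URL pattern.

```python
def recommended_num_partitions(num_executors, cores_per_executor):
    """Recommended value from the table: 3 * num_executors * num_cores_per_executor."""
    return 3 * num_executors * cores_per_executor


def build_jdbc_url(connection_type, host, port, database):
    """Build a JDBC URL in the form shown in the table for db2 or postgresql."""
    if connection_type not in ("db2", "postgresql"):
        raise ValueError("supported types: db2, postgresql")
    return "jdbc:{}://{}:{}/{}".format(connection_type, host, port, database)


# Hypothetical cluster: 2 executors with 2 cores each.
print(recommended_num_partitions(2, 2))  # 12
print(build_jdbc_url("postgresql", "db.example.com", 5432, "wos"))
```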

In [20]:
training_data_connection = {
    "storage_details" : {
        "type": "jdbc",
        "connection": {
            "location_type": "jdbc",
            "jdbc_url": "<to_be_edited>",
            "jdbc_driver": "<to_be_edited>",
            "use_ssl": "<to_be_edited>",
            "certificate": "<to_be_edited>"
        },
        "credentials": {
            "username": "<to_be_edited>",
            "password": "<to_be_edited>"
        }
    },
    "tables" : [{
        "type": "training",
        "database": "<to_be_edited>",
        "schema": "<to_be_edited>",
        "table": "<to_be_edited>",
        "parameters": {
            "partition_column": "<to_be_edited>",
            "num_partitions": "<to_be_edited>"
        }
    }]
}

## Generate common configuration archive containing all the artefacts required for enabling monitors

The following cell will run the Configuration job. It will also print the status of job in the output section if available. Please wait for the status to be **FINISHED**.

A successful job status goes through the following values:
1. STARTED
2. Model Drift Configuration STARTED
3. Data Drift Configuration STARTED
    - Data Drift: Summary Stats Calculated
    - Data Drift: Column Stats calculated.
    - Data Drift: (number/total) CategoricalDistributionConstraint columns processed
    - Data Drift: (number/total) NumericRangeConstraint columns processed
    - Data Drift: (number/total) CategoricalNumericRangeConstraint columns processed
    - Data Drift: (number/total) CatCatDistributionConstraint columns processed
4. Drift v2 Configuration STARTED
5. Drift v2 Configuration COMPLETED
6. Explainability Configuration STARTED
7. Explainability Configuration COMPLETED
8. Fairness Configuration STARTED
9. Fairness Configuration COMPLETED
10. FINISHED

If there is a failure at any time, you will see a **FAILED** status with an exception trace.

In [21]:
%%time
from ibm_metrics_plugin.common.utils.configuration_utility import ConfigurationUtility
config_utility = ConfigurationUtility(common_params, training_data_connection, spark_connection_info)
config_utility.generate_configuration()

Application ID: None; Job ID: 6f95f881-1bbf-453b-ac5c-a43fbd135f55; Status: FINISHED.
Total Run Time: 15 minutes 40 seconds 
CPU times: user 7.2 s, sys: 62.5 ms, total: 7.26 s
Wall time: 18min 3s


## Download the generated configuration archive
**Note:**

**When LIME global explanation is enabled, the configuration archive upload and explainability monitor enablement should be done using the Python SDK or API.**

In [22]:
with open("./archives/configuration_archive.tar.gz", "rb") as binary_file:
    configuration_archive = binary_file.read()

display(config_utility.create_download_link(configuration_archive))

# OPTIONAL

The following cells need to be executed only if you are going to create the required tables on your own. Otherwise, you can create these tables in the IBM Watson OpenScale UI.

Using configuration package:
1. Load common configuration json
2. Load drift archive

**STORAGE_FORMAT** : One of [`csv`, `parquet`, `orc`]

**Note:** 
1. Please select the format in which your training data is stored in Hive. The same format will be used to generate the various CREATE DDLs in this notebook.
2. ORC format is not supported for zLinux environments

**Generate DDLs for creating required tables:**
1. Feedback Table
2. Payload Table
3. Drifted Transactions Table
4. Explanations Queue Table
5. Explanations Table

### Choose the JDBC Connection type (supported: db2, postgresql)

In [11]:
import importlib

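def get_table_ddl_module(db_type):
    # Map each supported JDBC connection type to its DDL utility module.
    if db_type == 'db2':
        return importlib.import_module('ibm_wos_utils.joblib.utils.ddl_utils_db2')
    if db_type == 'postgresql':
        return importlib.import_module('ibm_wos_utils.joblib.utils.ddl_utils_postgres')
    # Fail early instead of returning None for an unsupported or unedited value.
    raise ValueError("Unsupported JDBC connection type: {!r}. Supported: db2, postgresql".format(db_type))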

In [12]:
JDBC_CONNECTION_TYPE = "<to_be_edited>" # supported: db2, postgresql

table_ddl_module = get_table_ddl_module(JDBC_CONNECTION_TYPE)

In [None]:
import json
import tarfile

# Provide the path to the configuration file.
configuration_archive = "./archives/configuration_archive.tar.gz"

config_json = None
with tarfile.open(configuration_archive, 'r:gz') as tar:
    if "common_configuration.json" not in tar.getnames():
        raise Exception("common_configuration.json file is missing in archive file")

    json_content = tar.extractfile('common_configuration.json')
    data = json_content.read().decode()
    config_json = json.loads(data)
    
# print(config_json)

# Optional input: an identifiable name. This ID is appended as a suffix in the various table creation DDLs.
# A random UUID is used if this is not provided.
# NOTEBOOK_RUN_ID = "some_identifiable_name"
NOTEBOOK_RUN_ID = None

# The column to help Spark read and write data using multiple workers in your JDBC storage.
# This will help improve the performance of your Spark jobs. 
# The default value is set to `wos_partition_column`. 
# Included in CREATE TABLE DDLs and ALTER TABLE DDLs for your data source. 
# This column will not be used for computation purposes.

# Note: Be careful when choosing an existing feature column as the partition column.
# If the data in this column is not evenly distributed across its possible values,
# Spark computation can suffer from data skew: the majority of the data is sent to
# one worker, which wastes compute resources and increases computation time.
# It is recommended to use a column with monotonically increasing values as the
# partition column.

PARTITION_COLUMN = "<to_be_edited>"

# Schema Name where tables should be created.
SCHEMA_NAME = "<to_be_edited>"

# FEEDBACK TABLE DDL
#######################
print("Feedback Table DDLs:")
table_ddl_module.generate_feedback_table_ddl(
    config_json,
    schema_name=SCHEMA_NAME,
    table_suffix=NOTEBOOK_RUN_ID,
    partition_column=PARTITION_COLUMN)
print("=========================")

# PAYLOAD TABLE DDL
#######################
print("Payload Table DDLs:")
table_ddl_module.generate_payload_table_ddl(
    config_json,
    schema_name=SCHEMA_NAME,
    table_suffix=NOTEBOOK_RUN_ID,
    partition_column=PARTITION_COLUMN)
print("=========================")

if config_json["common_configuration"]["enable_drift"]:
    # DRIFTED TRANSACTIONS TABLE DDL
    #######################
    drift_archive = None
    with tarfile.open(configuration_archive, 'r:gz') as tar:
        if "drift_archive.tar.gz" not in tar.getnames():
            raise Exception("drift_archive.tar.gz file is missing in archive file")

        drift_archive = tar.extractfile("drift_archive.tar.gz").read()

    print("Drifted Transactions Table DDLs:")
    table_ddl_module.generate_drift_table_ddl(
        drift_archive,
        schema_name=SCHEMA_NAME,
        table_suffix=NOTEBOOK_RUN_ID,
        partition_column=PARTITION_COLUMN)
    print("=========================")


if config_json["common_configuration"]["enable_explainability"]:
    # EXPLAIN TABLES DDL
    #######################

    # Explain Queue Table - IBM Watson OpenScale generates explanations for
    # all the transactions in this table. Alternatively, the payload table created
    # earlier in this cell can also be used for this purpose.

    print("Explanations Queue Table DDLs:")
    table_ddl_module.generate_payload_table_ddl(
        config_json,
        schema_name=SCHEMA_NAME,
        table_prefix="explanations_queue",
        table_suffix=NOTEBOOK_RUN_ID,
        partition_column=PARTITION_COLUMN)
    print("=========================")

    print("Explanations Table DDLs:")
    table_ddl_module.generate_explanations_table_ddl(
        schema_name=SCHEMA_NAME,
        table_suffix=NOTEBOOK_RUN_ID,
        partition_column=PARTITION_COLUMN)
    print("=========================")
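
The comments in the cell above warn about data skew when an existing feature column is used as the partition column. As a rough illustration only, the sketch below computes the share of rows that fall under the single most common value of a candidate column. It is plain Python on an in-memory list; the function name `largest_partition_share` is hypothetical, and how Spark actually splits the data depends on the read options in use.

```python
from collections import Counter


def largest_partition_share(values):
    """Return the fraction of rows that share the most common value."""
    counts = Counter(values)
    if not counts:
        return 0.0
    return max(counts.values()) / sum(counts.values())


skewed = ["A"] * 9 + ["B"]   # 9 of 10 rows share one value
even = list(range(10))       # monotonically increasing values

print(largest_partition_share(skewed))  # 0.9
print(largest_partition_share(even))    # 0.1
```

A share close to 1.0 suggests that most rows would be grouped together, which is the situation the note recommends avoiding. On a PySpark DataFrame, the per-value counts could be obtained with `groupBy(column).count()` instead of `Counter`.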

## Helper methods

### Use sample scored training data to get feature and categorical columns

A sample of the scored data is required to infer the schema of the complete data, so the sample should be large enough to be representative of it.

Additionally, the sample scoring data should have the following fields:
1. Feature Columns
2. Label/Target Column
3. Prediction Column (with same data type as the label column)
4. Probability Column (an array of model probabilities for all the class labels. Not required for regression models)

**STORAGE_FORMAT** : One of ["csv", "parquet", "orc"]

**Note:** 
1. Please select the format in which your training data is stored in Hive. The same format will be used to generate the various CREATE DDLs in this notebook.
2. ORC format is not supported for zLinux environments

The sample data should be of type `pyspark.sql.dataframe.DataFrame`. The cell below gives samples on:
- how to read a CSV file from the local system into a PySpark DataFrame.
- how to read Parquet files in a directory from the local system into a PySpark DataFrame.
- how to read ORC files in a directory from the local system into a PySpark DataFrame. [Not supported for zLinux environments]

It is important to choose the same storage format as the training data; otherwise there could be schema mismatches.
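
The three read options listed above can be wrapped in one small helper so that an unsupported format fails immediately. This is a sketch only: `read_sample` is a hypothetical name, and it assumes `spark` is an existing `SparkSession`. The `spark.read.csv`, `spark.read.parquet` and `spark.read.orc` calls are the standard PySpark readers.

```python
def read_sample(spark, storage_format, path):
    """Load sample data with the reader that matches the storage format."""
    if storage_format == "csv":
        # header/inferSchema mirror the options used in this notebook's CSV sample
        return spark.read.csv(path, header=True, inferSchema=True)
    if storage_format == "parquet":
        return spark.read.parquet(path)
    if storage_format == "orc":
        return spark.read.orc(path)
    raise ValueError("Unsupported STORAGE_FORMAT: {!r}".format(storage_format))


# Example usage (the path is a placeholder):
# spark_df = read_sample(spark, "csv", "/path/to/dir/containing/csv/files")
```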

#### Specify the Model Type

- Specify **binary** if the model is a binary classifier.
- Specify **multiclass** if the model is a multi-class classifier.
- Specify **regression** if the model is a regressor.

#### Provide Column Details 

To proceed with this notebook, the following information is required:

- **LABEL_COLUMN**: The column which contains the target field (also known as label column or the class label).
- **PREDICTION_COLUMN**: The column containing the model output. This should be of the same data type as the label column.
- **PROBABILITY_COLUMN**: The column (of type array) containing the model probabilities for all the possible prediction outcomes. This is not required for regression models.

Based on the sample data and key columns provided above, the notebook will deduce the feature columns and the categorical columns. They will be printed in the output of this cell. If you wish to make changes to them, you can do so in the subsequent cell.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import BooleanType, StringType

# Read sample scoring data and identify feature and categorical columns
spark = SparkSession.builder.appName(
    "Common Configuration Generation").getOrCreate()

STORAGE_FORMAT = "csv"
# STORAGE_FORMAT = "parquet"
# STORAGE_FORMAT = "orc"

if STORAGE_FORMAT == "csv":
    # Load a csv or a directory containing csv files as PySpark DataFrame
    # spark_df = spark.read.csv("/path/to/dir/containing/csv/files", header=True, inferSchema=True)
    pass

elif STORAGE_FORMAT == "parquet":
    # Load a directory containing parquet files as PySpark DataFrame
    # spark_df = spark.read.parquet("/path/to/dir/containing/parquet/files")
    pass
    
elif STORAGE_FORMAT == "orc":
    # Load a directory containing orc files as PySpark DataFrame
    # spark_df = spark.read.orc("/path/to/dir/containing/orc/files")
    pass

else:
    # Load data from any source which matches the schema of the training data
    pass

# NOTE: `spark_df` is only defined once one of the read statements above is uncommented.
spark_df.printSchema()

MODEL_TYPE = "binary"
# MODEL_TYPE = "multiclass"
# MODEL_TYPE = "regression"

LABEL_COLUMN = "<to_be_edited>"
PREDICTION_COLUMN = "<to_be_edited>"
PROBABILITY_COLUMN = "<to_be_edited>"
# [Optional] Provide the list of protected attributes, i.e. non-feature columns present in the data.
PROTECTED_ATTRIBUTES = []

feature_columns = spark_df.columns.copy()
feature_columns.remove(LABEL_COLUMN)
feature_columns.remove(PREDICTION_COLUMN)

if MODEL_TYPE != "regression":
    feature_columns.remove(PROBABILITY_COLUMN)

if PROTECTED_ATTRIBUTES:
    for protected_attribute in PROTECTED_ATTRIBUTES:
        feature_columns.remove(protected_attribute)

print("Feature Columns : {}".format(feature_columns))

categorical_columns = [f.name for f in spark_df.schema.fields if isinstance(f.dataType, (BooleanType, StringType)) and f.name in feature_columns]
print("Categorical Columns : {}".format(categorical_columns))
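
The cell above calls `list.remove` for each key column, which raises a `ValueError` if a name is not present in the sample data. A small pre-check can report all missing names at once. This is a sketch; `find_missing_columns` is a hypothetical helper and the column names in the example are made up.

```python
def find_missing_columns(available_columns, required_columns):
    """Return the required column names that are absent, preserving their order."""
    available = set(available_columns)
    return [name for name in required_columns if name not in available]


columns = ["age", "income", "label", "prediction", "probability"]
print(find_missing_columns(columns, ["label", "prediction", "score"]))  # ['score']
```

In the notebook, the first argument would be `spark_df.columns` and the second the label, prediction, probability (for classifiers) and protected attribute columns.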

### Generate DDL for creating Scored Training data table

Read sample data to figure out feature columns, categorical columns and their datatypes.
Using this information, generate DDL for scored training data table.

In [None]:
from ibm_wos_utils.joblib.utils.notebook_utils import generate_schemas
from ibm_wos_utils.joblib.utils.notebook_utils import get_max_length_categories
from ibm_wos_utils.joblib.utils.notebook_utils import validate_config_info

from pyspark.sql import SparkSession
from pyspark.sql.types import BooleanType, StringType

# Read sample scoring data
spark = SparkSession.builder.appName(
    "Read sample data and generate training data ddl").getOrCreate()

STORAGE_FORMAT = "csv"
# STORAGE_FORMAT = "parquet"
# STORAGE_FORMAT = "orc"

if STORAGE_FORMAT == "csv":
    # Load a csv or a directory containing csv files as PySpark DataFrame
    # spark_df = spark.read.csv("/path/to/dir/containing/csv/files", header=True, inferSchema=True)
    pass

elif STORAGE_FORMAT == "parquet":
    # Load a directory containing parquet files as PySpark DataFrame
    # spark_df = spark.read.parquet("/path/to/dir/containing/parquet/files")
    pass
    
elif STORAGE_FORMAT == "orc":
    # Load a directory containing orc files as PySpark DataFrame
    # spark_df = spark.read.orc("/path/to/dir/containing/orc/files")
    pass

else:
    # Load data from any source which matches the schema of the training data
    pass

# model details
MODEL_TYPE = "binary"
# MODEL_TYPE = "multiclass"
# MODEL_TYPE = "regression"

LABEL_COLUMN = "<to_be_edited>"
PREDICTION_COLUMN = "<to_be_edited>"
PROBABILITY_COLUMN = "<to_be_edited>"
# [Optional] Provide the list of protected attributes, i.e. non-feature columns present in the data.
PROTECTED_ATTRIBUTES = []

# NOTE: `spark_df` is only defined once one of the read statements above is uncommented.
feature_columns = spark_df.columns.copy()
feature_columns.remove(LABEL_COLUMN)
feature_columns.remove(PREDICTION_COLUMN)

if MODEL_TYPE != "regression":
    feature_columns.remove(PROBABILITY_COLUMN)

if PROTECTED_ATTRIBUTES:
    for protected_attribute in PROTECTED_ATTRIBUTES:
        feature_columns.remove(protected_attribute)

print("Feature Columns : {}".format(feature_columns))

categorical_columns = [f.name for f in spark_df.schema.fields if isinstance(f.dataType, (BooleanType, StringType)) and f.name in feature_columns]
print("Categorical Columns : {}".format(categorical_columns))

config_info = {
    "problem_type": MODEL_TYPE,
    "label_column": LABEL_COLUMN,
    "prediction": PREDICTION_COLUMN,
    "probability": PROBABILITY_COLUMN
}

config_info["feature_columns"] = feature_columns
config_info["categorical_columns"] = categorical_columns
config_info["protected_attributes"] = PROTECTED_ATTRIBUTES

# validation
validate_config_info(config_info)

# generate schema json using columns and their datatypes
cmn_config_json = {
    "common_configuration": generate_schemas(spark_df, config_info.copy())
}

# get length of values in different columns
max_length_categories = get_max_length_categories(spark_df)

# generate ddl using schema json
# Schema Name where Scored Training Table should be created.
SCORED_TRAINING_SCHEMA_NAME = "<to_be_edited>"

# The column to help Spark read and write data using multiple workers in your JDBC storage.
# This will help improve the performance of your Spark jobs.
# Included in CREATE TABLE DDLs and ALTER TABLE DDLs for your data source. 
# This column will not be used for computation purposes.

# Note: Be careful when choosing an existing feature column as the partition column.
# If the data in this column is not evenly distributed across its possible values,
# Spark computation can suffer from data skew: the majority of the data is sent to
# one worker, which wastes compute resources and increases computation time.
# It is recommended to use a column with monotonically increasing values as the
# partition column.

PARTITION_COLUMN = "<to_be_edited>"

NOTEBOOK_RUN_ID = "<to_be_edited>"  # optional

scored_training_data_create_table_ddl = table_ddl_module.generate_scored_training_table_ddl(
    cmn_config_json,
    schema_name=SCORED_TRAINING_SCHEMA_NAME,
    table_suffix=NOTEBOOK_RUN_ID,
    max_length_categories=max_length_categories,
    partition_column=PARTITION_COLUMN)

print("Scored Training Data Table DDLs:")
print(scored_training_data_create_table_ddl)
print("=========================")