<img src="https://github.com/pmservice/ai-openscale-tutorials/raw/master/notebooks/images/banner.png" align="left" alt="banner">


# Notebook for generating configuration for online subscriptions in IBM Watson OpenScale 

This notebook shows how to generate the following artefacts:

1. Configuration JSON needed to configure an IBM Watson OpenScale online subscription. This JSON also contains information related to fairness configuration.
2. Drift Configuration Archive

In order to use this notebook you need to do the following:

1. Read the training data into a pandas dataframe called "data_df".  There is sample code below to show how this can be done if the training data is in IBM Cloud Object Storage. 
2. Edit the cells below to provide the training data and fairness configuration information. 
3. Run the notebook. It generates a JSON file, and a download link for it appears at the very end of the notebook.
4. Download the JSON file by clicking the link, then upload it in the IBM Watson OpenScale GUI.

If you have multiple models (deployments), you will have to repeat the above steps for each model (deployment).

**Note:** Please restart the kernel after executing the cell below.

In [None]:
!pip install pandas
!pip install ibm-cos-sdk
!pip install pyspark
!pip install --upgrade ibm-watson-openscale

**Note:** For IBM Watson OpenScale Cloud and Cloud Pak for Data version 4.6.x, use the cell below:

In [None]:
# When this notebook is to be run on a zLinux cluster,
# install scikit-learn==1.0.2 using conda before installing ibm-metrics-plugin
# !conda install scikit-learn=1.0.2
!pip install "numpy>1.20,<=1.22.3" "scipy==1.8.1"
!pip install "ibm-metrics-plugin"

## OpenScale Configuration

In [None]:
import warnings
warnings.filterwarnings("ignore")

Depending on the environment in which you run this notebook, use one of the methods below to configure OpenScale.

Provide your IBM Watson OpenScale Cloud credentials in the following cell:

In [None]:
CLOUD_API_KEY = "***"
IAM_URL="https://iam.ng.bluemix.net/oidc/token"

In [None]:
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator,BearerTokenAuthenticator

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *


authenticator = IAMAuthenticator(apikey=CLOUD_API_KEY)
#authenticator = BearerTokenAuthenticator(bearer_token=IAM_TOKEN) ## uncomment this line if using IAM token as authenticator
client = APIClient(authenticator=authenticator)
client.version

Provide your IBM Watson OpenScale CPD credentials in the following cell:

In [None]:
"""
WOS_CREDENTIALS = {
    "url": "<cluster url>",
    "username": "",
    "password": "",
    "instance_id": "<service instance id>"
}
"""

In [None]:
"""
from ibm_watson_openscale import APIClient as OpenScaleAPIClient
from ibm_cloud_sdk_core.authenticators import CloudPakForDataAuthenticator

authenticator = CloudPakForDataAuthenticator(
    url=WOS_CREDENTIALS["url"],
    username=WOS_CREDENTIALS["username"],
    password=WOS_CREDENTIALS["password"],
    disable_ssl_verification=True
)

client = OpenScaleAPIClient(
    service_url=WOS_CREDENTIALS["url"],
    service_instance_id=WOS_CREDENTIALS["instance_id"],
    authenticator=authenticator
)

client.version
"""

# Read training data into a pandas data frame

The first thing that you need to do is to read the training data into a pandas dataframe called "data_df".  Given below is sample code for doing this if the training data is in IBM Cloud Object Storage.  Please edit the below cell and make changes so that you can read your training data from the location where it is stored.  Please ensure that the training data is present in a data frame called "data_df".

*Note: Pandas' read\_csv method infers a data type for each column. If you do not want a column's type to be inferred, pass the dtype parameter to read\_csv in this cell. More on this method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)*

*Note: By default, NA values are dropped while computing the training data distribution and training the drift archive. Please make sure NA values are handled when the data is read with Pandas' read\_csv method.*
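As a small illustration of the two notes above, the sketch below reads an in-memory CSV with an explicit `dtype` and then inspects and fills NA values. The rows are made up, `Zip_Code` is a hypothetical column added only to show the effect of `dtype`, and the fill strategy is just one possible choice, not a recommendation from OpenScale.

```python
import io

import pandas as pd

# Made-up sample rows standing in for a real training data file.
csv_text = """Applicant_Age,Gender,Zip_Code,Approval
25,Female,02139,Loan Granted
47,Male,,Loan Denied
,Male,10001,Loan Granted
"""

# Without dtype, Zip_Code would be parsed as a number and lose its leading zero.
sample_df = pd.read_csv(io.StringIO(csv_text), dtype={"Zip_Code": str})

# Count NA values per column before deciding how to handle them.
na_counts = sample_df.isna().sum()
print(na_counts)

# One possible way to handle NA values: fill them explicitly.
sample_df["Zip_Code"] = sample_df["Zip_Code"].fillna("unknown")
sample_df["Applicant_Age"] = sample_df["Applicant_Age"].fillna(
    sample_df["Applicant_Age"].median()
)
print(sample_df)
```

If NA values are handled this way while reading the data, no rows are lost when NA values are dropped later in the notebook.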

In [None]:
# ----------------------------------------------------------------------------------------------------
# IBM Confidential
# OCO Source Materials
# 5900-A3Q, 5737-H76
# Copyright IBM Corp. 2018, 2023
# The source code for this Notebook is not published or other-wise divested of its trade 
# secrets, irrespective of what has been deposited with the U.S.Copyright Office.
# ----------------------------------------------------------------------------------------------------

VERSION = "5.4.3"

# Version history

# 5.4.3 : Add numpy and scipy versions to be installed.
# 5.4.2 : Remove explainability configuration while saving training_distribution
# 5.4.1 : Add sample size for generating global explanation
# 5.4.0 : Add support for SHAP Global explanation
# 5.3.6 : Fix issue with explainability archive generation for regression model
# 5.3.5 : Official notebook for IBM CPD 4.5.0. 
#         Upgrade ibm-wos-utils to 4.5.0. 
#         Added code to generate explainability perturbations archive.
# 5.3.4 : Upgrade ibm-wos-utils to 4.1.1 (scikit-learn has been upgraded to 1.0.2)
# 5.3.3 : Upgrade ibm-wos-utils to 4.0.34
# 5.3.2 : Upgrade ibm-wos-utils to 4.0.31
# 5.3.1 : Official notebook for IBM CPD 4.0.5

# code to read file in COS to pandas dataframe object
import sys
import types
import pandas as pd
from ibm_botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

api_key = "<API Key>"
resource_instance_id = "<COS Resource Instance ID>"
auth_endpoint = "https://iam.ng.bluemix.net/oidc/token"
service_endpoint =  "<COS Service Endpoint>"
bucket =  "<Bucket Name>"
file_name= "<File Name>"

cos_client = ibm_boto3.client(service_name="s3",
    ibm_api_key_id=api_key,
    ibm_auth_endpoint=auth_endpoint,
    config=Config(signature_version="oauth"),
    endpoint_url=service_endpoint)

body = cos_client.get_object(Bucket=bucket,Key=file_name)["Body"]

# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

data_df = pd.read_csv(body)
data_df.head()

#Print columns from data frames
#print("column names:{}".format(list(data_df.columns.values)))

# Uncomment following 2 lines if you want to read training data from local CSV file when running through local Jupyter notebook
#data_df = pd.read_csv("<FULLPATH_TO_CSV_FILE>")
#data_df.head()


## Select the services for which configuration information needs to be generated

This notebook generates configuration information related to the fairness, explainability, and drift services. Use the flags below to control which service-specific configuration information is generated.

Details of the service specific flags available:

- enable_fairness : Flag to allow generation of fairness specific data distribution needed for configuration
- enable_explainability : Flag to allow generation of explainability specific information
- enable_drift: Flag to allow generation of drift detection model needed by drift service


service_configuration_support = { <br>
&nbsp;&nbsp;&nbsp;&nbsp;"enable_fairness": True,   
&nbsp;&nbsp;&nbsp;&nbsp;"enable_explainability": True,    
&nbsp;&nbsp;&nbsp;&nbsp;"enable_drift": False  
    }  



In [None]:
service_configuration_support = {
    "enable_fairness": True,
    "enable_explainability": True,
    "enable_drift": True
}

## Training Data and Fairness Configuration Information

Please provide information about the training data which is used to train the model.  In order to explain the configuration better, let us first consider an example of a Loan Processing Model which is trying to predict whether a person should get a loan or not. The training data for such a model will potentially contain the following columns: Credit_History, Monthly_salary, Applicant_Age, Loan_amount, Gender, Marital_status, Approval.  The "Approval" column contains the target field (label column or class label) and it can have the following values: "Loan Granted", "Loan Denied" or "Loan Partially Granted".  In this model we would like to ensure that the model is not biased against Gender=Female or Gender=Transgender.  We would also like to ensure that the model is not biased against the age group 15 to 30 years or age group 61 to 120 years. 

For the above model, the configuration information that we need to provide is:

- class_label:  This is the name of the column in the training data dataframe (data_df) which contains the target field (also known as label column or the class label).  For the Loan Processing Model it would be "Approval".
- feature_columns: This is the list of names of the feature columns (in the training data dataframe data_df).  For the Loan Processing model this would be: ["Credit_History", "Monthly_salary", "Applicant_Age", "Loan_amount", "Gender", "Marital_status"]
- categorical_columns: The list of column names (in data_df) which contain categorical values.  This should also include those columns which originally contained categorical values and have now been converted to numeric values. E.g., in the Loan Processing Model, the Marital_status column originally could have values: Single, Married, Divorced, Separated, Widowed.  These could have been converted to numeric values as follows: Single -> 0, Married -> 1, Divorced -> 2, Separated -> 3 and Widowed -> 4.  Thus the training data will have numeric values.  Please identify such columns as categorical.  Thus the list of categorical columns for the Loan Processing Model will be Credit_History, Gender and Marital_status. 

For the Loan Processing Model, this information will be provided as follows:

training_data_info = { <br>
&nbsp;&nbsp;&nbsp;&nbsp;"class_label": "Approval",   
&nbsp;&nbsp;&nbsp;&nbsp;"feature_columns": ["Credit_History", "Monthly_salary", "Applicant_Age", "Loan_amount", "Gender", "Marital_status"],    
&nbsp;&nbsp;&nbsp;&nbsp;"categorical_columns": ["Credit_History","Gender","Marital_status"]   
    }  
    
  **Note:** The categorical columns selected must be a subset of the feature columns. If there are no categorical columns among the selected feature columns, please set "categorical_columns" to [] or None.
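The constraints described above can be checked before running the rest of the notebook. The following is a minimal sketch, not part of the OpenScale SDK: `validate_training_data_info` is a hypothetical helper, and in the notebook you would pass `list(data_df.columns)` as the second argument.

```python
def validate_training_data_info(info, available_columns):
    """Return a list of problems found in a training_data_info dictionary."""
    problems = []
    label = info.get("class_label")
    features = info.get("feature_columns") or []
    # Both [] and None mean "no categorical columns".
    categorical = info.get("categorical_columns") or []

    if label not in available_columns:
        problems.append("class_label '{}' is not a training data column".format(label))
    for col in features:
        if col not in available_columns:
            problems.append("feature column '{}' is not a training data column".format(col))
    for col in categorical:
        if col not in features:
            problems.append("categorical column '{}' is not a feature column".format(col))
    return problems


# Column names from the Loan Processing Model example.
example_columns = ["Credit_History", "Monthly_salary", "Applicant_Age",
                   "Loan_amount", "Gender", "Marital_status", "Approval"]
example_info = {
    "class_label": "Approval",
    "feature_columns": ["Credit_History", "Monthly_salary", "Applicant_Age",
                        "Loan_amount", "Gender", "Marital_status"],
    "categorical_columns": ["Credit_History", "Gender", "Marital_status"],
}
print(validate_training_data_info(example_info, example_columns))  # prints []
```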

Please edit the next cell and provide the above information for your model.

In [None]:
training_data_info = {
    "class_label": "<EDIT THIS>",
    "feature_columns": ["<EDIT THIS>"],
    "categorical_columns": ["<EDIT THIS>"]
}

## Specify the Model Type

In the next cell, specify the type of your model.  If your model is a binary classification model, then set the type to "binary". If it is a multi-class classifier then set the type to "multiclass". If it is a regression model (e.g., Linear Regression), then set it to "regression".

In [None]:
#Set model_type. Acceptable values are:["binary","multiclass","regression"]
model_type = "binary"
#model_type = "multiclass"
#model_type = "regression"

## Specify the Fairness Configuration

You need to provide the following information for the fairness configuration: 

- fairness_attributes:  These are the attributes on which you wish to monitor fairness. In the Loan Processing Model, we wanted to ensure that the model is not biased against people of specific age group and people belonging to a specific gender.  Hence "Applicant_Age" and "Gender" will be the fairness attributes for the Loan Processing Model.
- With Indirect Bias support, you can also monitor protected attributes for fairness. The protected attributes are those attributes which are present in the training data but are not used to train the model. For example, sensitive attributes like gender, race, age may be present in training data but are not used for training. To check if there exists indirect bias with respect to some protected attribute due to possible correlation with some feature column, it can be specified in fairness configuration.
- type: The data type of the fairness attribute (e.g., float or int or double)
- minority:  The minority group for which we want to ensure that the model is not biased.  For the Loan Processing Model we wanted to ensure that the model is not biased against people in the age group 15 to 30 years & 61 to 120 years as well as people with Gender = Female or Gender = Transgender.  Hence the minority group for the fairness attribute "Applicant_Age" will be [15,30] and [61,120] and the minority group for fairness attribute "Gender" will be: "Female", "Transgender".  
- majority: The majority group towards which the model might be biased.  For the Loan Processing Model, the majority group for the fairness attribute "Applicant_Age" will be [31,60], i.e., all the ages except the minority group.  For the fairness attribute "Gender" the majority group will be: "Male".  
- threshold:  The fairness threshold beyond which the Model is considered to be biased.  For the Loan Processing Model, let us say that the Bank is willing to tolerate Female and Transgender applicants getting up to 20% fewer approved loans than Males.  However, if the difference is more than 20% then the Loan Processing Model will be considered biased.  E.g., if the percentage of approved loans for Female or Transgender applicants is, say, 25% lower than that for Male applicants, then the Model is to be considered as acting in a biased manner.  Thus for this scenario, the fairness threshold will be 80 (100-20) (this is represented as a value normalized to 1, i.e., 0.8).  

The fairness attributes for Loan Processing Model will be specified as:

fairness_attributes = [  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"feature": "Applicant_Age",   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"type" : "int",   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"majority": [ [31,60] ],   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"minority": [ [15, 30], [61,120] ],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"threshold" : 0.8  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;},  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"feature": "Gender",   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"type" : "string",   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"majority": ["Male"],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"minority": ["Female", "Transgender"],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"threshold" : 0.8  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}  
&nbsp;&nbsp;&nbsp;&nbsp;]  
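To make the threshold concrete, the sketch below compares the rate of favourable outcomes for the minority group with the rate for the majority group and checks the ratio against 0.8. This is the commonly used disparate impact ratio, shown here under the assumption that it matches the description above; the notebook does not show how OpenScale computes its fairness score internally. The outcome counts are made up and the helper names are hypothetical.

```python
def favourable_rate(outcomes, favourable_values):
    """Fraction of outcomes that fall in the favourable class."""
    favourable = sum(1 for outcome in outcomes if outcome in favourable_values)
    return favourable / len(outcomes)


def fairness_ratio(minority_outcomes, majority_outcomes, favourable_values):
    """Ratio of the minority favourable rate to the majority favourable rate."""
    return (favourable_rate(minority_outcomes, favourable_values)
            / favourable_rate(majority_outcomes, favourable_values))


# Made-up outcomes: 80% of majority and 60% of minority applicants are approved.
majority_outcomes = ["Loan Granted"] * 80 + ["Loan Denied"] * 20
minority_outcomes = ["Loan Granted"] * 60 + ["Loan Denied"] * 40

threshold = 0.8
ratio = fairness_ratio(minority_outcomes, majority_outcomes, {"Loan Granted"})
print(round(ratio, 2), ratio < threshold)  # prints: 0.75 True
```

In this made-up case the minority group receives 25% fewer approvals than the majority group, so the ratio falls below the 0.8 threshold and the model would be treated as biased.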

Please edit the next cell and provide the fairness configuration for your model.

In [None]:
fairness_attributes = [{
                           "type" : "<DATA_TYPE>", #data type of the column eg: float or int or double
                           "feature": "<COLUMN_NAME>", 
                           "majority": [
                               [X, Y] # range of values for column eg: [31, 45] for int or [31.4, 45.1] for float
                           ],
                           "minority": [
                               [A, B], # range of values for column eg: [10, 15] for int or [10.5, 15.5] for float
                               [C, D]   # range of values for column eg: [80, 100] for int or [80.0, 99.9] for float                    
                           ],
                           "threshold": <VALUE> #such that 0<VALUE<=1. eg: 0.8
                       }]

## Specify the Favorable and Unfavorable class values

The second part of fairness configuration is about the favourable and unfavourable class values.  Recall that in the case of Loan Processing Model, the target field (label column or class label) can have the following values: "Loan Granted", "Loan Denied" and "Loan Partially Granted".  Out of these values "Loan Granted" and "Loan Partially Granted" can be considered as being favorable and "Loan Denied" is unfavorable.  In other words in order to measure fairness, we need to know the target field values which can be considered as being favourable and those values which can be considered as unfavourable.  

For the Loan Processing Model, the values can be specified as follows:

parameters = {  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"favourable_class" :  [ "Loan Granted", "Loan Partially Granted" ],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"unfavourable_class": [ "Loan Denied" ]  
&nbsp;&nbsp;&nbsp;&nbsp;}  

In the case of regression models, the favourable and unfavourable classes will be ranges.  For example, for a model which predicts medicine dosage, the favorable outcome could be between 80 ml and 120 ml or between 5 ml and 20 ml, whereas the unfavorable outcome would be values between 21 ml and 79 ml.  For such a model, the favorable and unfavorable values will be specified as follows:
     
parameters = {  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"favourable_class" :  [ [5, 20], [80, 120] ],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"unfavourable_class": [ [21, 79] ]  
&nbsp;&nbsp;&nbsp;&nbsp;}  
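The following sketch shows how range-based classes like the ones above can be interpreted for a single predicted value. `classify_outcome` is a hypothetical helper, and treating both range bounds as inclusive is an assumption; the notebook does not state how OpenScale treats boundary values or values that fall between ranges.

```python
def classify_outcome(value, favourable_ranges, unfavourable_ranges):
    """Label a numeric prediction using [low, high] ranges (bounds assumed inclusive)."""
    for low, high in favourable_ranges:
        if low <= value <= high:
            return "favourable"
    for low, high in unfavourable_ranges:
        if low <= value <= high:
            return "unfavourable"
    # The value falls in a gap between the configured ranges.
    return "unclassified"


dosage_parameters = {
    "favourable_class": [[5, 20], [80, 120]],
    "unfavourable_class": [[21, 79]],
}

for dosage in (10, 50, 100, 20.5):
    label = classify_outcome(dosage,
                             dosage_parameters["favourable_class"],
                             dosage_parameters["unfavourable_class"])
    print(dosage, label)
```

The last value, 20.5, falls between the configured ranges, which is a reminder to check that your ranges cover every value the model can predict.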

Please edit the next cell to provide information about your model.

In [None]:
# For classification models use the below.
parameters = {
        "favourable_class" :  [ "<EDIT THIS>", "<EDIT THIS>" ],
        "unfavourable_class": [ "<EDIT THIS>" ]
    }
# For regression models use the below.  Delete the entry which is not required.
parameters = {
        "favourable_class" :  [ [<EDIT THIS>, <EDIT THIS>], [<EDIT THIS>,<EDIT THIS>] ],
        "unfavourable_class": [ [<EDIT THIS>, <EDIT THIS>] ]
    }

## Specify the number of records which should be processed for Fairness

The final piece of information that needs to be provided is the number of records (min_records) that should be used for computing the fairness. Fairness checks run hourly.  If min_records is set to 5000, then every hour fairness checking will pick up the last 5000 records which were sent to the model for scoring and compute the fairness on those 5000 records.  Please note that fairness computation will not start until 5000 records have been sent to the model for scoring.

If we set the value of "min_records" to a small number, then fairness computation will get influenced by the scoring requests sent to the model in the recent past. In other words, the model might be flagged as being biased if it is acting in a biased manner on the last few records, but overall it might not be acting in a biased manner.  On the other hand, if the "min_records" is set to a very large number, then we will not be able to catch model bias quickly. Hence the value of min_records should be set such that it is neither too small nor too large.
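The windowing behaviour described above can be sketched as follows. `select_fairness_window` is a hypothetical helper written only to illustrate the description; it is not how the OpenScale service is implemented, and the record values are made up.

```python
def select_fairness_window(scored_records, min_records):
    """Return the last min_records records, or None if too few have been scored."""
    if len(scored_records) < min_records:
        return None  # fairness computation does not start yet
    return scored_records[-min_records:]


# Made-up record identifiers, oldest first.
scored_records = list(range(1, 13))

print(select_fairness_window(scored_records, 5))   # prints [8, 9, 10, 11, 12]
print(select_fairness_window(scored_records, 20))  # prints None
```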

Please update the next cell to specify a value for min_records.

In [None]:
# min_records = <Minimum number of records to be considered for performing scoring>
min_records = <EDIT THIS>
# max_records = <Maximum number of records to be considered while computing fairness> [OPTIONAL]
max_records = None

## [Optional] Specify the Explainability Configuration

Provide the following explainability configuration to enable SHAP global explanation. 
Set explainability parameters to None if SHAP global explanation is not required or if you are using a headless subscription, i.e., a subscription which is configured without a REST endpoint for scoring.

 - global_explanation: The global explanation parameters.
     - enabled: Enable the global explanation. SHAP explanation also should be enabled when global explanation is enabled.
     - sample_size: The sample size of the records to be considered for computing global explanation in the payload window.
 - shap: The shap explanation parameters
     - enabled: Enable the shap explanation
     - perturbations_count: The number of perturbations created when generating a local explanation
     - background_data_set: The background data set to be used when generating the SHAP explanation of a transaction. The background data is used to determine the average predicted value for regression models and the average confidence value for classification models. When generating a local explanation, SHAP computes the shapley values which signify how much each feature contributed to moving the model output or model confidence from the computed average value.
     - background_data_sets: The list of background data sets the user would like to configure.
 - local_explanation_method: The default local explanation method to be used when generating local explanation. Possible values are "shap" and "lime"

In [None]:
"""
explainability_parameters = {
    "global_explanation": {
        "enabled": True,
        "sample_size": 50 # the sample size of the records to be considered for computing payload global explanation
    },
    "local_explanation_method": "shap", # or lime
    "shap": {
        "enabled": True,
        "perturbations_count": 100
        #"background_data_set": "data_set_1", # If not set the background data is auto generated from training data
        #"background_data_sets": [{
        #    "name": "data_set_1",
        #    "file_name": "data_set_1.csv"
        #}]
    }
}
"""
# Uncomment the above lines to enable SHAP global explanation
explainability_parameters = None

## End of Input 

You need not edit anything beyond this point.  Run the notebook and go to the very last cell.  There will be a link to download the JSON file (called: "Download training data distribution JSON file").  Download the file and upload it using the IBM Watson OpenScale GUI.

*Note: The drop_na parameter of the TrainingStats object should be set to 'False' if NA values are taken care of while reading the training data in the above cells.*

In [None]:
from ibm_watson_openscale.utils.training_stats import TrainingStats

enable_explainability = service_configuration_support.get("enable_explainability")
enable_fairness = service_configuration_support.get("enable_fairness")

if enable_explainability or enable_fairness:
    fairness_inputs = None
    if enable_fairness:
        fairness_inputs = {
                "fairness_attributes": fairness_attributes,
                "min_records" : min_records,
                "favourable_class" :  parameters["favourable_class"],
                "unfavourable_class": parameters["unfavourable_class"]
            }
        if max_records is not None:
            fairness_inputs["max_records"] = max_records
    
    input_parameters = {
        "label_column": training_data_info["class_label"],
        "feature_columns": training_data_info["feature_columns"],
        "categorical_columns": training_data_info["categorical_columns"],
        "fairness_inputs": fairness_inputs,  
        "problem_type" : model_type  
    }

    training_stats = TrainingStats(data_df,input_parameters, explain=enable_explainability, fairness=enable_fairness, drop_na=True)
    config_json = training_stats.get_training_statistics()
    config_json["notebook_version"] = VERSION
#print(config_json)

## [Optional] Score Function

This is required if you are configuring :
- Drift (for classification models), and/or 
- Explainability (for headless subscriptions, for computing SHAP global explanation)

Please update the score function, which will be used to generate the drift detection model used for drift detection. Also, if you have a headless subscription, it will be used to generate the explainability archive which will be used for explanations. 

The output of the score function should be two arrays:
1. Array of probabilities
2. Array of model prediction


Please note:
- The user is expected to make sure that the data type of the "class label" column selected and that of the prediction column are the same. For example, if the class label is numeric, the prediction array should also be numeric.
- Each entry of the probability array should contain the probabilities of all the unique class labels.
  For example, if model_type=multiclass and the unique class labels are A, B, C, D, each entry in the probability array should be an array of size 4, e.g., [ [0.50,0.30,0.10,0.10], [0.40,0.20,0.30,0.10], ...]
- **Please update the score function below with the help of templates documented [here](https://github.com/IBM/watson-openscale-samples/blob/main/training%20statistics/Score%20function%20templates%20for%20drift%20detection.md)**
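The sketch below illustrates only the output contract described above: a score function that takes a data frame and returns an array of probabilities and an array of predictions. It is not one of the linked templates. `StandInModel`, its toy rule and the sample rows are hypothetical; replace them with a call to your own model or deployment.

```python
import numpy as np
import pandas as pd


class StandInModel:
    """Hypothetical stand-in for a deployed binary classifier."""

    classes_ = np.array(["Loan Denied", "Loan Granted"])

    def predict_proba(self, features):
        # Toy rule: the probability of "Loan Granted" grows with the salary.
        granted = np.clip(features["Monthly_salary"].to_numpy() / 10000.0, 0.0, 1.0)
        return np.column_stack([1.0 - granted, granted])


model = StandInModel()
feature_columns = ["Monthly_salary"]


def score(training_data_frame):
    """Return (probabilities, predictions) for the given rows."""
    # One row per record, one probability per unique class label.
    probabilities = model.predict_proba(training_data_frame[feature_columns])
    # Predictions use the same data type as the class label column (strings here).
    predictions = model.classes_[np.argmax(probabilities, axis=1)]
    return probabilities, predictions


sample = pd.DataFrame({
    "Monthly_salary": [2000, 9000],
    "Approval": ["Loan Denied", "Loan Granted"],
})
probabilities, predictions = score(sample)
print(probabilities.shape, predictions.tolist())
```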

In [None]:
#Update score function
# def score(training_data_frame):
#     <Fill in the template using the score function templates provided>
# Comment the below line when using score function
score=None

## [Optional] Generate explainability archive

This is required for generating artifacts required for configuring explainability in IBM Watson OpenScale 

Output of this is an explainability configuration archive which must be uploaded to IBM Watson OpenScale during explain monitor configuration.

Based on the configuration parameters the explainability archive contains some or all of the below files:
  - training_statistics.json : The explainability training statistics
  - configuration.json : The explainability configuration parameters used while creating the explainability monitor instance
  - lime_scored_perturbations.json : The scored lime perturbations response
  - shap_background_data_training.csv : The SHAP background data created using the training statistics
  - shap_training_data_global_explanation.json : The SHAP training data global explanation

In [None]:
if enable_explainability:
    # Note: Using the entire training data for generating global explanation will take longer duration.
    # Use a random sample of training data for faster execution.
    sample_size = 5000
    if len(data_df) > sample_size:
        training_data = data_df.sample(sample_size)
    else:
        training_data = data_df

    explainability_archive = client.config_util.create_explainability_archive(config=config_json, 
                                                                       training_data=training_data, 
                                                                       parameters=explainability_parameters, 
                                                                       scoring_fn=score)
    display(explainability_archive)

### Indirect Bias
In the case of indirect bias, i.e., if protected attributes (sensitive attributes like race, gender, etc. which are present in the training data but are not used to train the model) are being monitored for fairness:
- Bias service identifies correlations between the protected attribute and model features. Correlated attributes are also known as proxy features.
- Existence of correlations with model features can result in indirect bias w.r.t protected attribute even though it is not used to train the model.
- Highly correlated attributes based on their correlation strength are considered while computing bias for a given protected attribute.

The following cell identifies whether the user has configured protected attributes for fairness by checking the feature columns, the non-feature columns and the fairness configuration. If protected attributes are configured, it identifies correlations and stores them in the fairness configuration.

In [None]:
# Checking if protected attributes are configured for fairness monitoring. If yes, then computing correlation information for each meta-field and updating it in the fairness configuration
if enable_fairness:
    fairness_configuration = config_json.get("fairness_configuration")
    training_columns = data_df.columns.tolist()
    label_column = training_data_info.get("class_label")
    training_columns.remove(label_column)
    feature_columns = training_data_info.get("feature_columns")
    non_feature_columns = list(set(training_columns) - set(feature_columns))
    if non_feature_columns is not None and len(non_feature_columns) > 0:
        protected_attributes = []
        fairness_attributes_list = [attribute.get("feature") for attribute in fairness_attributes]
        for col in non_feature_columns:
            if col in fairness_attributes_list:
                protected_attributes.append(col)
        if len(protected_attributes) > 0:
            from ibm_watson_openscale.utils.indirect_bias_processor import IndirectBiasProcessor
            fairness_configuration = IndirectBiasProcessor().get_correlated_attributes(data_df, fairness_configuration, feature_columns, protected_attributes, label_column)        

In [None]:
import json

print("Finished generating training distribution data")

# Create a file download link
import base64
from IPython.display import HTML

def create_download_link( title = "Download training data distribution JSON file", filename = "training_distribution.json"):  
    if enable_explainability or enable_fairness:
        if "explainability_configuration" in config_json:
            del config_json["explainability_configuration"]
        output_json = json.dumps(config_json, indent=2)
        b64 = base64.b64encode(output_json.encode())
        payload = b64.decode()
        html = '<a download="{filename}" href="data:text/json;base64,{payload}" target="_blank">{title}</a>'
        html = html.format(payload=payload,title=title,filename=filename)
        return HTML(html)
    else:
        print("No download link generated as fairness/explainability services are disabled.")

create_download_link()

## Fairness Statistics

The following code snippet generates a JSON file containing the fairness configuration.
This file is used to configure the Fairness monitor in IBM Watson OpenScale.

In [None]:
fairness_configuration = None

if enable_fairness:
    fairness_configuration = config_json.get("fairness_configuration")

# create download link if required
def create_download_link( title = "Download fairness statistics JSON file", filename = "fairness_statistics.json"):
    if fairness_configuration:
        # Create a file download link
        import base64
        import json
        from IPython.display import HTML
        
        print("Fairness configuration found, persisting to a json file and creating download link...")
        output_json = json.dumps(fairness_configuration, indent=2)
        b64 = base64.b64encode(output_json.encode())
        payload = b64.decode()
        html = '<a download="{filename}" href="data:text/json;base64,{payload}" target="_blank">{title}</a>'
        html = html.format(payload=payload,title=title,filename=filename)
        return HTML(html)
    else:
        print("Fairness configuration not found.")
create_download_link()

## Drift configuration archive generation

The following code snippet generates the artefacts required for configuring drift detection for a model in IBM Watson OpenScale. The output is a drift archive which contains:

- Drift Detection Model (used for Accuracy Drift detection for classification models)
- Data Constraints (used for Data Consistency Drift detection for classification/regression models)

In [None]:
#Generate drift detection model
from ibm_wos_utils.drift.drift_trainer import DriftTrainer
enable_drift = service_configuration_support.get("enable_drift")
if enable_drift:
    drift_detection_input = {
        "feature_columns":training_data_info.get("feature_columns"),
        "categorical_columns":training_data_info.get("categorical_columns"),
        "label_column": training_data_info.get("class_label"),
        "problem_type": model_type
    }
    
    drift_trainer = DriftTrainer(data_df,drift_detection_input)
    if model_type != "regression":
        #Note: batch_size can be customized by user as per the training data size
        drift_trainer.generate_drift_detection_model(score, batch_size=data_df.shape[0], check_for_ddm_quality=False)
    
    # Note:
    # - Two-column constraints are not computed beyond two_column_learner_limit (default: 200).
    # - Categorical columns with a large number of unique values relative to the total rows
    #   in the column (determined by categorical_unique_threshold; default > 0.8) are discarded.
    # Both values can be adjusted as required.

    # user_overrides - Overrides drift constraint learning so that constraints are learned
    # selectively on feature columns. It is a list of configurations, each specifying
    # whether to learn distribution and/or range constraints on a given set of columns.
    # If a column appears in more than one configuration, the first one takes precedence.
    #
    # "constraint_type" : single|double - whether this configuration applies to
    # single-column or two-column constraint learning.
    #
    # "learn_distribution_constraint" : True|False - whether to learn the
    # distribution constraint for this configuration.
    #
    # "learn_range_constraint" : True|False - whether to learn the range constraint
    # for this configuration. Only applicable to numerical feature columns.
    #
    # "features" : [] - the feature columns governed by this configuration.
    # It is a list of strings (feature column names) if "constraint_type" is "single".
    # It is a list of lists of strings (feature column names) if "constraint_type" is
    # "double". If an inner list holds only one column name, all two-column constraints
    # involving that column are governed by this configuration.
    # Column names are case-insensitive.
    #
    # In the example below, the first config block says: do not learn single-column
    # distribution and range constraints for the features "MARITAL_STATUS", "PROFESSION",
    # "IS_TENT" and "age".
    # The second config block says: do not learn two-column distribution and range
    # constraints where "IS_TENT", "PROFESSION" or "AGE" is one of the two columns, and
    # likewise do not learn them for the specific combination of "MARITAL_STATUS" and
    # "PURCHASE_AMOUNT".
    # user_overrides = [
    #     {
    #         "constraint_type": "single",
    #         "learn_distribution_constraint": False,
    #         "learn_range_constraint": False,
    #         "features": [
    #           "MARITAL_STATUS",
    #           "PROFESSION",
    #           "IS_TENT",
    #           "age"
    #         ]
    #     },
    #     {
    #         "constraint_type": "double",
    #         "learn_distribution_constraint": False,
    #         "learn_range_constraint": False,
    #         "features": [
    #           [
    #             "IS_TENT"
    #           ],
    #           [
    #             "MARITAL_STATUS",
    #             "PURCHASE_AMOUNT"
    #           ],
    #           [
    #             "PROFESSION"
    #           ],
    #           [
    #             "AGE"
    #           ]
    #         ]
    #     }
    # ]
    
    drift_trainer.learn_constraints(
        two_column_learner_limit=200, categorical_unique_threshold=0.8, user_overrides=[])
    drift_trainer.create_archive()
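
The shape of `user_overrides` described in the comments above is easy to get wrong, for example by passing a flat list of names for a `"double"` configuration. The sketch below is a hypothetical pre-check, not part of `ibm_wos_utils`; the function `validate_user_overrides` and its rules are assumptions inferred only from those comments, and the library may apply its own, different validation.

```python
from typing import Any, Dict, List

VALID_CONSTRAINT_TYPES = {"single", "double"}


def validate_user_overrides(user_overrides: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems found; an empty list means the shape looks consistent."""
    problems = []
    for index, config in enumerate(user_overrides):
        constraint_type = config.get("constraint_type")
        if constraint_type not in VALID_CONSTRAINT_TYPES:
            problems.append(f"config {index}: constraint_type must be 'single' or 'double'")
            continue
        for flag in ("learn_distribution_constraint", "learn_range_constraint"):
            if not isinstance(config.get(flag), bool):
                problems.append(f"config {index}: {flag} must be True or False")
        for entry in config.get("features", []):
            if constraint_type == "single":
                # Single-column configs expect plain column names
                if not isinstance(entry, str):
                    problems.append(f"config {index}: expected a column name, got {entry!r}")
            else:
                # Two-column configs expect lists holding one or two column names
                is_valid = (
                    isinstance(entry, list)
                    and len(entry) in (1, 2)
                    and all(isinstance(name, str) for name in entry)
                )
                if not is_valid:
                    problems.append(f"config {index}: expected a list of 1 or 2 names, got {entry!r}")
    return problems


# The example from the comments above, written as real Python
user_overrides = [
    {
        "constraint_type": "single",
        "learn_distribution_constraint": False,
        "learn_range_constraint": False,
        "features": ["MARITAL_STATUS", "PROFESSION", "IS_TENT", "age"],
    },
    {
        "constraint_type": "double",
        "learn_distribution_constraint": False,
        "learn_range_constraint": False,
        "features": [["IS_TENT"], ["MARITAL_STATUS", "PURCHASE_AMOUNT"], ["PROFESSION"], ["AGE"]],
    },
]

# A malformed "double" config: column names are not wrapped in inner lists
bad_overrides = [
    {
        "constraint_type": "double",
        "learn_distribution_constraint": False,
        "learn_range_constraint": False,
        "features": ["IS_TENT"],
    }
]

problems_ok = validate_user_overrides(user_overrides)
problems_bad = validate_user_overrides(bad_overrides)
```

If a check like this reports no problems, the list can be passed as `user_overrides=user_overrides` in the `learn_constraints` call above in place of the empty list.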

In [None]:
#Generate a download link for drift detection model
from IPython.display import HTML
import base64
import io

def create_download_link_for_ddm( title = "Download Drift detection model", filename = "drift_detection_model.tar.gz"):  
    
    # Read the archive from disk and embed it in the link as base64
    if enable_drift:
        with open(filename,"rb") as file:
            ddm = file.read()
        b64 = base64.b64encode(ddm)
        payload = b64.decode()
        
        html = '<a download="{filename}" href="data:application/gzip;base64,{payload}" target="_blank">{title}</a>'
        html = html.format(payload=payload,title=title,filename=filename)
        return HTML(html)
    else:
        print("Drift detection is not enabled. Enable it and rerun the notebook.")

create_download_link_for_ddm()
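
Before uploading the archive, it can be useful to confirm what it contains. The sketch below lists the files in a gzip-compressed tar archive using only the standard library. The helper `list_archive_members` is hypothetical, and the demo archive and its member names (`model.bin`, `constraints.json`) are made-up stand-ins; the actual file names inside the drift archive are not documented in this notebook.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def list_archive_members(archive_path) -> list:
    """Return the sorted names of the regular files in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(member.name for member in archive.getmembers() if member.isfile())


# Demo on a stand-in archive with made-up member names
with tempfile.TemporaryDirectory() as workdir:
    demo_path = Path(workdir) / "demo_archive.tar.gz"
    with tarfile.open(demo_path, "w:gz") as archive:
        for name, body in (("model.bin", b"\x00\x01"), ("constraints.json", b"{}")):
            info = tarfile.TarInfo(name=name)
            info.size = len(body)
            archive.addfile(info, io.BytesIO(body))
    demo_members = list_archive_members(demo_path)
```

In the notebook itself, calling `list_archive_members("drift_detection_model.tar.gz")` should show what `create_archive()` wrote, assuming the archive uses the default file name that the download cell reads.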
