# Notebook for generating training data distribution and configuring Fairness

This notebook analyzes training data and outputs a JSON which contains information related to data distribution and fairness configuration.  In order to use this notebook you need to do the following:

1. Read the training data into a pandas dataframe called "data_df".  
2. Edit the cells below and provide the training data and fairness configuration information.
3. Run the notebook. It generates a JSON file, and a download link for it appears at the very end of the notebook.
4. Download the JSON by clicking on the link and upload it in the IBM AI OpenScale GUI.

If you have multiple models (deployments), you will have to repeat the above steps for each model (deployment).

**Note:** Restart the kernel after running the cell below.

In [None]:
!pip install pandas
!pip install ibm-cos-sdk
!pip install numpy
!pip install scikit-learn==0.24.1 
!pip install pyspark
!pip install lime
!pip install --upgrade ibm-watson-openscale
!pip install "ibm-wos-utils==4.1.1"

# Read training data into a pandas data frame

In [12]:
VERSION = "5.0.1"

In [1]:
import pandas as pd

In [2]:
data_df=pd.read_csv('german_credit_data_biased_training.csv')

In [3]:
data_df.dtypes

CheckingStatus              object
LoanDuration                 int64
CreditHistory               object
LoanPurpose                 object
LoanAmount                   int64
ExistingSavings             object
EmploymentDuration          object
InstallmentPercent           int64
Sex                         object
OthersOnLoan                object
CurrentResidenceDuration     int64
OwnsProperty                object
Age                          int64
InstallmentPlans            object
Housing                     object
ExistingCreditsCount         int64
Job                         object
Dependents                   int64
Telephone                   object
ForeignWorker               object
Risk                        object
dtype: object

# Select the services for which configuration information needs to be generated

This notebook can generate configuration information for the fairness, explainability, and drift services. Use the flags below to control which service-specific configuration information is generated.

Details of the service-specific flags available:

- enable_fairness : Flag to allow generation of fairness specific data distribution needed for configuration
- enable_explainability : Flag to allow generation of explainability specific information
- enable_drift: Flag to allow generation of drift detection model needed by drift service


service_configuration_support = { <br>
&nbsp;&nbsp;&nbsp;&nbsp;"enable_fairness": True,   
&nbsp;&nbsp;&nbsp;&nbsp;"enable_explainability": True,    
&nbsp;&nbsp;&nbsp;&nbsp;"enable_drift": False  
    }  



In [4]:
service_configuration_support = {
    "enable_fairness": True,
    "enable_explainability": True,
    "enable_drift": True
}

# Training Data and Fairness Configuration Information

Please provide information about the training data which is used to train the model.  

In [5]:
feature_columns = ['CheckingStatus', 'LoanDuration', 'CreditHistory', 'LoanPurpose', 'LoanAmount', 'ExistingSavings', 'EmploymentDuration', 'InstallmentPercent', 'Sex', 'OthersOnLoan', 'CurrentResidenceDuration', 'OwnsProperty', 'Age', 'InstallmentPlans', 'Housing', 'ExistingCreditsCount', 'Job', 'Dependents', 'Telephone', 'ForeignWorker']
categorical_columns = ['CheckingStatus', 'CreditHistory', 'LoanPurpose', 'ExistingSavings', 'EmploymentDuration', 'Sex', 'OthersOnLoan', 'OwnsProperty', 'InstallmentPlans', 'Housing', 'Job', 'Telephone', 'ForeignWorker']

In [6]:
training_data_info = {
    "class_label": "Risk",
    "feature_columns": feature_columns,
    "categorical_columns": categorical_columns
}
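Before moving on, it can help to sanity-check these lists. The sketch below is one possible check and is not part of the OpenScale tooling: `check_training_data_info` is a hypothetical helper, and the column lists are shortened for illustration.

```python
def check_training_data_info(info, available_columns):
    """Return a list of human-readable problems found in the column lists."""
    problems = []
    features = info["feature_columns"]
    categorical = info["categorical_columns"]
    label = info["class_label"]

    # Every feature and the label must exist in the training data
    missing = [c for c in features + [label] if c not in available_columns]
    if missing:
        problems.append(f"columns not in training data: {missing}")
    # The label is the prediction target, so it should not also be a feature
    if label in features:
        problems.append("class label must not be listed as a feature")
    # Categorical columns are expected to be a subset of the features
    not_features = [c for c in categorical if c not in features]
    if not_features:
        problems.append(f"categorical columns that are not features: {not_features}")
    return problems


example_info = {
    "class_label": "Risk",
    "feature_columns": ["CheckingStatus", "LoanDuration", "Sex", "Age"],
    "categorical_columns": ["CheckingStatus", "Sex", "Job"],  # "Job" is deliberately wrong
}
example_columns = ["CheckingStatus", "LoanDuration", "Sex", "Age", "Risk"]
problems = check_training_data_info(example_info, example_columns)
print(problems)
```

In this notebook the same helper could be called as `check_training_data_info(training_data_info, data_df.columns.tolist())`; an empty list means no problems were found.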

# Specify the Model Type

In the next cell, specify the type of your model.  If your model is a binary classification model, then set the type to "binary". If it is a multi-class classifier then set the type to "multiclass". If it is a regression model (e.g., Linear Regression), then set it to "regression".

In [7]:
#Set model_type. Acceptable values are:["binary","multiclass","regression"]
model_type = "binary"
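If you are unsure which value applies, the label column itself gives a hint. The following is only a rough heuristic under stated assumptions (the helper `suggest_model_type` and the `max_classes` cut-off are hypothetical); the model owner's knowledge of how the model was trained should take precedence.

```python
def suggest_model_type(label_values, max_classes=20):
    """Heuristic suggestion only; it does not inspect the model itself."""
    unique = set(label_values)
    if len(unique) == 2:
        return "binary"
    # Many distinct numeric values suggest a continuous target
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in unique
    )
    if all_numeric and len(unique) > max_classes:
        return "regression"
    return "multiclass"


binary_guess = suggest_model_type(["Risk", "No Risk", "Risk"])
multiclass_guess = suggest_model_type(["A", "B", "C", "A"])
regression_guess = suggest_model_type([i * 0.5 for i in range(50)])
print(binary_guess, multiclass_guess, regression_guess)
```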

# Specify the Fairness Configuration

You need to provide the following information for the fairness configuration: 

- fairness_attributes: The attributes on which you wish to monitor fairness.
- With indirect bias support, you can also monitor protected attributes for fairness. Protected attributes are attributes that are present in the training data but are not used to train the model. To check whether a protected attribute is subject to indirect bias through correlation with a feature column, specify it in the fairness configuration.
- type: The data type of the fairness attribute (e.g., int, float, double, or string).
- majority: The majority group against which the minority group is compared.
- minority: The minority group for which we want to ensure that the model is not biased.
- threshold: The fairness threshold for the attribute (0.95 in the cells below).

In [None]:
[
        {"feature": "Sex",
         "majority": ['male'],
         "minority": ['female'],
         "threshold": 0.95
         },
        {"feature": "Age",
         "majority": [[26, 75]],
         "minority": [[18, 25]],
         "threshold": 0.95
         }
    ]

In [8]:
fairness_attributes = [{
                           "type" : "int", #data type of the column eg: float or int or double
                           "feature": "Age", 
                           "majority": [
                               [26, 75] # range of values for column eg: [31, 45] for int or [31.4, 45.1] for float
                           ],
                           "minority": [
                               [18, 25],    # range of values for column eg: [80, 100] for int or [80.0, 99.9] for float                    
                           ],
                           "threshold": 0.95 
                       },
                       {
                           "type": "string",
                           "feature": "Sex",
                           "majority": ['male'],
                           "minority": ['female'],
                           "threshold": 0.95
                       }
                       ]
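A configuration like the one above is easy to get subtly wrong, for example by letting a majority range and a minority range overlap. The sketch below shows one way such a mistake could be caught before the configuration is used. It assumes ranges are closed intervals `[low, high]`; `check_fairness_attribute` is a hypothetical helper, not an OpenScale API.

```python
def ranges_overlap(a, b):
    """True if the closed intervals [a0, a1] and [b0, b1] share any value."""
    return a[0] <= b[1] and b[0] <= a[1]


def check_fairness_attribute(attr):
    """Return a list of problems found in one fairness attribute entry."""
    problems = []
    if attr["type"] == "string":
        # Categorical groups: a value must not appear on both sides
        shared = set(attr["majority"]) & set(attr["minority"])
        if shared:
            problems.append(f"values in both groups: {sorted(shared)}")
    else:
        # Numeric groups: no majority range may overlap a minority range
        for maj in attr["majority"]:
            for mino in attr["minority"]:
                if ranges_overlap(maj, mino):
                    problems.append(f"ranges overlap: {maj} and {mino}")
    return problems


age_ok = {"type": "int", "feature": "Age",
          "majority": [[26, 75]], "minority": [[18, 25]], "threshold": 0.95}
age_bad = dict(age_ok, minority=[[18, 30]])  # 26..30 is in both groups

ok_problems = check_fairness_attribute(age_ok)
bad_problems = check_fairness_attribute(age_bad)
print(ok_problems, bad_problems)
```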

# Specify the Favorable and Unfavorable class values

The second part of the fairness configuration concerns the favourable and unfavourable class values. In other words, to measure fairness, we need to know which target field values are considered favourable and which are considered unfavourable.

In [9]:
# For classification models use the below.
parameters = {
        "favourable_class" :  ["No Risk"],
        "unfavourable_class": ["Risk"]
    }
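To see how the favourable class and a threshold such as 0.95 fit together, one common reading is a disparate impact ratio: the rate of favourable outcomes in the minority group divided by the rate in the majority group. The sketch below illustrates that idea on made-up labels. It is an assumption for illustration and makes no claim about the exact calculation the service performs.

```python
def favourable_rate(labels, favourable):
    """Fraction of labels that belong to the favourable class values."""
    return sum(1 for label in labels if label in favourable) / len(labels)


def disparate_impact(minority_labels, majority_labels, favourable):
    """Ratio of favourable-outcome rates: minority over majority."""
    return (favourable_rate(minority_labels, favourable)
            / favourable_rate(majority_labels, favourable))


favourable = ["No Risk"]
majority_labels = ["No Risk"] * 8 + ["Risk"] * 2  # 80% favourable (made-up data)
minority_labels = ["No Risk"] * 4 + ["Risk"] * 6  # 40% favourable (made-up data)

ratio = disparate_impact(minority_labels, majority_labels, favourable)
is_fair = ratio >= 0.95  # compare against the configured threshold
print(ratio, is_fair)
```

Under this reading, a ratio well below the threshold, as in the made-up data here, would flag the attribute as potentially biased against the minority group.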

# Specify the number of records which should be processed for Fairness

The final piece of information that needs to be provided is the number of records (min_records) that should be used for computing the fairness.

In [10]:
# min_records = <Minimum number of records to be considered for performing scoring>
min_records = 10

# End of Input 

You need not edit anything beyond this point.  Run the notebook and go to the very last cell.  There will be a link to download the JSON file (called: "Download training data distribution JSON file").  Download the file and upload it using the IBM AI OpenScale GUI.

*Note: Set the drop_na parameter of the TrainingStats object to 'False' if NA values are already handled when reading the training data in the cells above.*

In [13]:
from ibm_watson_openscale.utils.training_stats import TrainingStats

enable_explainability = service_configuration_support.get('enable_explainability')
enable_fairness = service_configuration_support.get('enable_fairness')

if enable_explainability or enable_fairness:
    fairness_inputs = None
    if enable_fairness:
        fairness_inputs = {
                "fairness_attributes": fairness_attributes,
                "min_records" : min_records,
                "favourable_class" :  parameters["favourable_class"],
                "unfavourable_class": parameters["unfavourable_class"]
            }
    
    input_parameters = {
        "label_column": training_data_info["class_label"],
        "feature_columns": training_data_info["feature_columns"],
        "categorical_columns": training_data_info["categorical_columns"],
        "fairness_inputs": fairness_inputs,  
        "problem_type" : model_type  
    }

    training_stats = TrainingStats(data_df,input_parameters, explain=enable_explainability, fairness=enable_fairness, drop_na=True)
    config_json = training_stats.get_training_statistics()
    config_json["notebook_version"] = VERSION
    # config_json only exists when fairness or explainability is enabled
    print(config_json)

Installing collected packages: jenkspy, more-itertools, scikit-learn, ibm-db, retrying, ibm-wos-utils
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.22.1
    Uninstalling scikit-learn-0.22.1:
      Successfully uninstalled scikit-learn-0.22.1
Successfully installed ibm-db-3.1.3 ibm-wos-utils-4.5.24 jenkspy-0.2.0 more-itertools-8.14.0 retrying-1.3.3 scikit-learn-1.0.2
{'common_configuration': {'problem_type': 'binary', 'label_column': 'Risk', 'input_data_schema': {'type': 'struct', 'fields': [{'name': 'CheckingStatus', 'type': 'string', 'nullable': True, 'metadata': {'modeling_role': 'feature', 'measure': 'discrete'}}, {'name': 'LoanDuration', 'type': 'long', 'nullable': True, 'metadata': {'modeling_role': 'feature'}}, {'name': 'CreditHistory', 'type': 'string', 'nullable': True, 'metadata': {'modeling_role': 'feature', 'measure': 'discrete'}}, {'name': 'LoanPurpose', 'type': 'string', 'nullable': True, 'metadata': {'modeling_role': 'feature', 'meas

### Indirect Bias
With indirect bias, protected attributes (sensitive attributes such as race or gender that are present in the training data but are not used to train the model) are monitored for fairness:
- The bias service identifies correlations between the protected attribute and the model features. Correlated attributes are also known as proxy features.
- Correlations with model features can result in indirect bias with respect to a protected attribute even though it is not used to train the model.
- Highly correlated attributes, selected by correlation strength, are considered when computing bias for a given protected attribute.
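To make the idea of a proxy feature concrete, the sketch below measures the association between a protected attribute and a categorical feature using Cramér's V. This measure is a stand-in chosen for illustration; the notebook does not state which correlation measure the bias service uses, and the data here is made up.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences (0 = none, 1 = perfect)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    chi2 = 0.0
    # Chi-squared statistic over the full contingency table
    for x, x_n in x_counts.items():
        for y, y_n in y_counts.items():
            expected = x_n * y_n / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


protected = ["a", "a", "b", "b"] * 5   # made-up protected attribute
proxy = ["x", "x", "y", "y"] * 5       # moves in lockstep with it
unrelated = ["x", "y", "x", "y"] * 5   # independent of it

proxy_strength = cramers_v(protected, proxy)
unrelated_strength = cramers_v(protected, unrelated)
print(proxy_strength, unrelated_strength)
```

A feature that scores close to 1 against a protected attribute can carry that attribute's information into the model even when the attribute itself is excluded from training.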

The following cell checks whether the user has configured a protected attribute for fairness by comparing the feature columns, the non-feature columns, and the fairness configuration. If any protected attributes are configured, it identifies correlations and stores them in the fairness configuration.

In [14]:
# Checking if protected attributes are configured for fairness monitoring. If yes, then computing correlation information for each meta-field and updating it in the fairness configuration
if enable_fairness:
    fairness_configuration = config_json.get('fairness_configuration')
    training_columns = data_df.columns.tolist()
    label_column = training_data_info.get('class_label')
    training_columns.remove(label_column)
    feature_columns = training_data_info.get('feature_columns')
    non_feature_columns = list(set(training_columns) - set(feature_columns))
    if non_feature_columns is not None and len(non_feature_columns) > 0:
        protected_attributes = []
        fairness_attributes_list = [attribute.get('feature') for attribute in fairness_attributes]
        for col in non_feature_columns:
            if col in fairness_attributes_list:
                protected_attributes.append(col)
        if len(protected_attributes) > 0:
            from ibm_watson_openscale.utils.indirect_bias_processor import IndirectBiasProcessor
            fairness_configuration = IndirectBiasProcessor().get_correlated_attributes(data_df, fairness_configuration, feature_columns, protected_attributes, label_column)        

In [15]:
import json

print("Finished generating training distribution data")

# Create a file download link
import base64
from IPython.display import HTML

def create_download_link( title = "Download training data distribution JSON file", filename = "training_distribution.json"):  
    if enable_explainability or enable_fairness:
        output_json = json.dumps(config_json, indent=2)
        b64 = base64.b64encode(output_json.encode())
        payload = b64.decode()
        html = '<a download="{filename}" href="data:text/json;base64,{payload}" target="_blank">{title}</a>'
        html = html.format(payload=payload,title=title,filename=filename)
        return HTML(html)
    else:
        print('No download link generated as fairness/explainability services are disabled.')

create_download_link()

Finished generating training distribution data
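If the download link is inconvenient (for example when the notebook runs headless), the same JSON could instead be written to disk. This is a minimal sketch with a stand-in dictionary; `save_config` is a hypothetical helper, and in the notebook you would pass `config_json` rather than `example_config`.

```python
import json
import os
import tempfile


def save_config(config, path):
    """Write the configuration dictionary to `path` as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    return path


# Stand-in for config_json, kept small for illustration
example_config = {
    "notebook_version": "5.0.1",
    "common_configuration": {"problem_type": "binary", "label_column": "Risk"},
}
out_path = save_config(
    example_config,
    os.path.join(tempfile.gettempdir(), "training_distribution.json"),
)

# Read the file back to confirm it round-trips
with open(out_path, encoding="utf-8") as handle:
    reloaded = json.load(handle)
print(out_path)
```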


# Drift detection model generation

Update the score function below, which is used to generate the drift detection model. Generating the model can take some time, depending on the size of the training dataset. The score function should output two arrays: an array of model predictions and an array of probabilities. The example score function in this notebook returns them as (probabilities, predictions).

- Make sure that the data type of the selected "class label" column and the prediction column are the same. For example, if the class label is numeric, the prediction array should also be numeric.

- Each entry of the probability array should contain the probabilities of all the unique class labels.
  For example, if model_type=multiclass and the unique class labels are A, B, C, D, each entry in the probability array should be an array of size 4, e.g. [[0.50, 0.30, 0.10, 0.10], [0.40, 0.20, 0.30, 0.10], ...]
  
**Note:**
- *Add a "score" method that outputs the prediction column array and the probability column array.*
- *The data type of the label column and the prediction column should be the same. Make sure that the label column and the prediction column array have the same unique class labels.*
- **Please update the score function below with the help of templates documented [here](https://github.com/IBM-Watson/aios-data-distribution/blob/master/Score%20function%20templates%20for%20drift%20detection.md)**
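The requirements above can be checked on a small sample before the full drift training run. The sketch below is one possible check, not part of `ibm_wos_utils`: `check_score_output` is a hypothetical helper, and the test that each row sums to 1 is an extra assumption that goes beyond what is stated above.

```python
def check_score_output(probabilities, predictions, class_labels):
    """Return a list of problems found in the output of a score function."""
    problems = []
    if len(probabilities) != len(predictions):
        problems.append("probability and prediction arrays differ in length")
    for i, row in enumerate(probabilities):
        # One probability is expected per unique class label
        if len(row) != len(class_labels):
            problems.append(
                f"row {i}: expected {len(class_labels)} probabilities, got {len(row)}"
            )
        elif abs(sum(row) - 1.0) > 1e-6:  # assumption: rows are normalized
            problems.append(f"row {i}: probabilities sum to {sum(row):.3f}, not 1")
    # Predictions must use the same values as the label column
    unknown = sorted(set(predictions) - set(class_labels))
    if unknown:
        problems.append(f"predictions not among class labels: {unknown}")
    return problems


labels = ["A", "B", "C", "D"]
good = check_score_output(
    [[0.50, 0.30, 0.10, 0.10], [0.40, 0.20, 0.30, 0.10]], ["A", "A"], labels
)
bad = check_score_output([[0.7, 0.3]], ["E"], labels)
print(good)
print(bad)
```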

## Azure credentials are read from the [azuremlcredentials.json](azuremlcredentials.json) file in the cell below

In [16]:
import json
with open("azuremlcredentials.json", "r") as cred:
    creds = json.load(cred)

AZURE_ENGINE_CREDENTIALS =  {
    "client_id": creds.get('appId'),
    "client_secret": creds.get('password'),
    "tenant": creds.get('tenant'),
    "subscription_id": creds.get('subscriptionid')
}

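# Avoid echoing the client secret into the notebook output
print({key: ("***" if key == "client_secret" else value) for key, value in AZURE_ENGINE_CREDENTIALS.items()})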
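The cell above reads four keys from the credentials file: `appId`, `password`, `tenant`, and `subscriptionid`. Because `dict.get` silently returns `None` for a missing key, a quick check like the one below can surface an incomplete file early. `missing_credential_keys` is a hypothetical helper, and the example values are placeholders.

```python
# Keys read from azuremlcredentials.json by the cell above
REQUIRED_KEYS = ("appId", "password", "tenant", "subscriptionid")


def missing_credential_keys(creds):
    """Return the required keys that are absent or empty in the loaded JSON."""
    return [key for key in REQUIRED_KEYS if not creds.get(key)]


# Placeholder values; "subscriptionid" is deliberately left out
example_creds = {
    "appId": "my-app-id",
    "password": "not-a-real-secret",
    "tenant": "my-tenant",
}
missing = missing_credential_keys(example_creds)
print(missing)
```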

In [36]:
def score(training_data_frame):
  az_scoring_uri = "***" # eg http://1dbdef02-e155-49ff-854e-aeb6840ad4d9.eastus.azurecontainer.io/score
  
  headers = {'Content-Type':'application/json'} 
  
  input_values = training_data_frame.values.tolist()
  feature_cols = list(training_data_frame.columns)
  scoring_data = [{field: value  for field,value in zip(feature_cols, input_value)} for input_value in input_values]
  
  payload = {
  "input": scoring_data
  }
  
  import requests
  import json
  import numpy as np
  import time
  
  start_time = time.time()  
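  response = requests.post(az_scoring_uri, json=payload, headers=headers)
  response_time = int((time.time() - start_time)*1000)  # elapsed time in milliseconds
  response.raise_for_status()  # fail early on HTTP errors instead of parsing an error body
  response_dict = response.json()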
  results = response_dict['output']
  prediction_vector = np.array([x["Scored Labels"] for x in results])
  probability_array = np.array([x["Scored Probabilities"] for x in results])
  return probability_array, prediction_vector

In [37]:
print(data_df.shape[0])
print(score(data_df.sample(3)))

5000
(array([[0.83181918, 0.16818082],
       [0.82112752, 0.17887248],
       [0.33026548, 0.66973452]]), array(['No Risk', 'No Risk', 'Risk'], dtype='<U7'))


In [38]:
#Generate drift detection model
from ibm_wos_utils.drift.drift_trainer import DriftTrainer
enable_drift = service_configuration_support.get('enable_drift')
if enable_drift:
    drift_detection_input = {
        "feature_columns":training_data_info.get('feature_columns'),
        "categorical_columns":training_data_info.get('categorical_columns'),
        "label_column": training_data_info.get('class_label'),
        "problem_type": model_type
    }
    
    drift_trainer = DriftTrainer(data_df,drift_detection_input)
    if model_type != "regression":
        #Note: batch_size can be customized by user as per the training data size
        drift_trainer.generate_drift_detection_model(score,batch_size=32)
    
    #Note:
    # - Two column constraints are not computed beyond two_column_learner_limit(default set to 200)
    # - Categorical columns with large (determined by categorical_unique_threshold; default > 0.8) number of unique values relative to total rows in the column are discarded. 
    #User can adjust the value depending on the requirement
    
    drift_trainer.learn_constraints(two_column_learner_limit=2, categorical_unique_threshold=0.8)
    drift_trainer.create_archive()

Scoring training dataframe...: 100%|██████████| 4000/4000 [00:10<00:00, 388.34rows/s]
Optimising Drift Detection Model...: 100%|██████████| 40/40 [06:57<00:00, 10.44s/models]
Scoring training dataframe...: 100%|██████████| 1000/1000 [00:02<00:00, 349.35rows/s]
Computing feature stats...: 100%|██████████| 20/20 [00:01<00:00, 19.46features/s]
Learning single feature constraints...: 100%|██████████| 21/21 [00:00<00:00, 31.34constraints/s]
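The `categorical_unique_threshold` comment in the cell above refers to the number of unique values relative to the total number of rows. The sketch below illustrates that ratio on made-up data. It is not the `DriftTrainer` implementation: the strict `>` comparison is an assumption based on the "default > 0.8" wording, and `CustomerId` is a hypothetical column.

```python
def unique_ratio(values):
    """Number of distinct values divided by the number of rows."""
    return len(set(values)) / len(values)


def columns_to_discard(columns, threshold=0.8):
    """Names of categorical columns whose unique ratio exceeds the threshold."""
    return [
        name for name, values in columns.items()
        if unique_ratio(values) > threshold
    ]


example_columns = {
    # 3 distinct values in 10 rows: ratio 0.3, kept
    "Housing": ["own", "rent", "free", "own", "own",
                "rent", "own", "own", "rent", "own"],
    # 10 distinct values in 10 rows: ratio 1.0, discarded
    "CustomerId": [f"id-{i}" for i in range(10)],
}
discarded = columns_to_discard(example_columns)
print(discarded)
```

Identifier-like columns carry almost no repeated values, so constraints learned from them would rarely hold on new data, which is a plausible reason for excluding them.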


In [39]:
#Generate a download link for drift detection model
from IPython.display import HTML
import base64
import io

def create_download_link_for_ddm( title = "Download Drift detection model", filename = "drift_detection_model.tar.gz"):  
    
    # Read the drift archive and embed it as base64 in the download link
    if enable_drift:
        with open(filename,'rb') as file:
            ddm = file.read()
        b64 = base64.b64encode(ddm)
        payload = b64.decode()
        
        html = '<a download="{filename}" href="data:application/gzip;base64,{payload}" target="_blank">{title}</a>'
        html = html.format(payload=payload,title=title,filename=filename)
        return HTML(html)
    else:
        print("Drift Detection is not enabled. Please enable and rerun the notebook")

create_download_link_for_ddm()
