# Notebook for generating training data distribution and configuring Fairness

This notebook analyzes the training data and produces a JSON file that describes the data distribution and the fairness configuration.  To use this notebook, do the following:

1. Read the training data into a pandas dataframe called "data_df".  Sample code below shows how to do this when the training data is in IBM Cloud Object Storage. 
2. Edit the cells below to provide the training data and fairness configuration information. 
3. Run the notebook. It generates a JSON file, and a download link for it appears at the very end of the notebook.
4. Download the JSON file by clicking the link and upload it in the IBM AI OpenScale GUI.

If you have multiple models (deployments), you will have to repeat the above steps for each model (deployment).

# Read training data into a pandas data frame

First, read the training data into a pandas dataframe called "data_df".  The cell below contains sample code for training data stored in IBM Cloud Object Storage.  Edit it so that it reads your training data from the location where it is stored, and make sure the result ends up in a dataframe called "data_df".

In [None]:
# ----------------------------------------------------------------------------------------------------
# IBM Confidential
# OCO Source Materials
# 5900-A3Q, 5737-J33
# Copyright IBM Corp. 2018
# The source code for this Notebook is not published or other-wise divested of its trade 
# secrets, irrespective of what has been deposited with the U.S.Copyright Office.
# ----------------------------------------------------------------------------------------------------

!pip install pandas
!pip install ibm-cos-sdk
!pip install numpy
!pip install pyspark

VERSION = 1.2

# code to read file in COS to pandas dataframe object
import sys
import types
import pandas as pd
from ibm_botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

api_key = "<API Key>"
resource_instance_id = "crn:v1:bluemix:public:cloud-object-storage:global:a/111111aaa1a111aa11d111111aa11111:22b22bbb-b22b-22bb-2b22-22b22bB22b2b::"
auth_endpoint = "https://iam.ng.bluemix.net/oidc/token"
service_endpoint =  "https://s3-api.dal-us-geo.objectstorage.softlayer.net"
bucket =  "<Bucket Name>"
file_name= "<File Name>"

cos_client = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=api_key,
    ibm_auth_endpoint=auth_endpoint,
    config=Config(signature_version='oauth'),
    endpoint_url=service_endpoint)

body = cos_client.get_object(Bucket=bucket,Key=file_name)['Body']

# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

data_df = pd.read_csv(body)
data_df.head()

#Print columns from data frame
#print("column names:{}".format(list(data_df.columns.values)))

# Uncomment following 2 lines if you want to read training data from local CSV file when running through local Jupyter notebook
#data_df = pd.read_csv("<FULLPATH_TO_CSV_FILE>")
#data_df.head()


# Training Data and Fairness Configuration Information

Please provide information about the training data which is used to train the model.  In order to explain the configuration better, let us first consider an example of a Loan Processing Model which is trying to predict whether a person should get a loan or not. The training data for such a model will potentially contain the following columns: Credit_History, Monthly_salary, Applicant_Age, Loan_amount, Gender, Marital_status, Approval.  The "Approval" column contains the target field (label column or class label) and it can have the following values: "Loan Granted", "Loan Denied" or "Loan Partially Granted".  In this model we would like to ensure that the model is not biased against Gender=Female or Gender=Transgender.  We would also like to ensure that the model is not biased against the age group 15 to 30 years or age group 61 to 120 years. 

For the above model, the configuration information that we need to provide is:

- class_label:  This is the name of the column in the training data dataframe (data_df) which contains the target field (also known as label column or the class label).  For the Loan Processing Model it would be "Approval".
- feature_columns: This is the list of feature column names in the training data dataframe (data_df).  For the Loan Processing model this would be: ["Credit_History", "Monthly_salary", "Applicant_Age", "Loan_amount", "Gender", "Marital_status"]
- categorical_columns: The list of column names (in data_df) which contain categorical values.  This should also include those columns which originally contained categorical values and have now been converted to numeric values. E.g., in the Loan Processing Model, the Marital_status column originally could have values: Single, Married, Divorced, Separated, Widowed.  These could have been converted to numeric values as follows: Single -> 0, Married -> 1, Divorced -> 2, Separated -> 3 and Widowed -> 4.  Thus the training data will have numeric values.  Please identify such columns as categorical.  Thus the list of categorical columns for the Loan Processing Model will be Credit_History, Gender and Marital_status. 

For the Loan Processing Model, this information will be provided as follows:

training_data_info = { <br>
&nbsp;&nbsp;&nbsp;&nbsp;"class_label": "Approval",   
&nbsp;&nbsp;&nbsp;&nbsp;"feature_columns": ["Credit_History", "Monthly_salary", "Applicant_Age", "Loan_amount", "Gender", "Marital_status"],    
&nbsp;&nbsp;&nbsp;&nbsp;"categorical_columns": ["Credit_History","Gender","Marital_status"]   
    }  
    
  **Note:** The categorical columns must be a subset of the feature columns. If none of the selected feature columns is categorical, set "categorical_columns" to [] or None.
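
The subset rule in the note can be checked by hand before running the rest of the notebook. The sketch below is illustrative only: `check_training_data_info` is a hypothetical helper that is not part of this notebook, and the plain list `columns` stands in for `list(data_df.columns)`.

```python
def check_training_data_info(info, available_columns):
    """Return a list of problems found in a training_data_info dict (hypothetical helper)."""
    problems = []
    features = info.get("feature_columns") or []
    categorical = info.get("categorical_columns") or []  # None is treated like []

    if info.get("class_label") not in available_columns:
        problems.append("class_label not found in training data")
    missing = sorted(set(features) - set(available_columns))
    if missing:
        problems.append("feature columns missing from training data: {}".format(missing))
    not_features = sorted(set(categorical) - set(features))
    if not_features:
        problems.append("categorical columns that are not feature columns: {}".format(not_features))
    return problems


# Loan Processing Model example from the text above.
columns = ["Credit_History", "Monthly_salary", "Applicant_Age",
           "Loan_amount", "Gender", "Marital_status", "Approval"]
loan_info = {
    "class_label": "Approval",
    "feature_columns": columns[:-1],
    "categorical_columns": ["Credit_History", "Gender", "Marital_status"],
}
print(check_training_data_info(loan_info, columns))
```

An empty list means the three entries are consistent with each other and with the available columns.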

Please edit the next cell and provide the above information for your model.

In [None]:
training_data_info = {
    "class_label": "<EDIT THIS>",
    "feature_columns": ["<EDIT THIS>"],
    "categorical_columns": ["<EDIT THIS>"]
}

# Specify the Model Type

In the next cell, specify the type of your model.  If your model is a binary classification model, then set the type to "binary". If it is a multi-class classifier then set the type to "multiclass". If it is a regression model (e.g., Linear Regression), then set it to "regression".

In [None]:
#Set model_type. Acceptable values are:["binary","multiclass","regression"]
model_type = "binary"
#model_type = "multiclass"
#model_type = "regression"

# Specify the Fairness Configuration

You need to provide the following information for the fairness configuration: 

- fairness_attributes:  These are the attributes on which you wish to monitor fairness. In the Loan Processing Model, we wanted to ensure that the model is not biased against people in a specific age group or people belonging to a specific gender.  Hence "Applicant_Age" and "Gender" will be the fairness attributes for the Loan Processing Model.
- type: The data type of the fairness attribute (e.g., int, float, double or string)
- minority:  The minority group for which we want to ensure that the model is not biased.  For the Loan Processing Model we wanted to ensure that the model is not biased against people in the age groups 15 to 30 years and 61 to 120 years, as well as people with Gender = Female or Gender = Transgender.  Hence the minority group for the fairness attribute "Applicant_Age" will be [15,30] and [61,120] and the minority group for the fairness attribute "Gender" will be: "Female", "Transgender".  
- majority: The majority group towards which the model might be biased.  For the Loan Processing Model, the majority group for the fairness attribute "Applicant_Age" will be [31,60], i.e., all the ages except the minority group.  For the fairness attribute "Gender" the majority group will be: "Male".  
- threshold:  The fairness threshold beyond which the model is considered to be biased.  For the Loan Processing Model, suppose the bank is willing to tolerate Female and Transgender applicants getting up to 20% fewer approved loans than Male applicants.  If the gap is more than 20%, the Loan Processing Model is considered biased.  E.g., if the percentage of approved loans for Female or Transgender applicants is 25% lower than for Male applicants, the model is considered to be acting in a biased manner.  For this scenario the fairness threshold is 80 (100-20), which is specified as a value normalized to 1, i.e., 0.8.  

The fairness attributes for Loan Processing Model will be specified as:

fairness_attributes = [  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"feature": "Applicant_Age",   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"type" : "int",   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"majority": [ [31,60] ],   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"minority": [ [15, 30], [61,120] ],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"threshold" : 0.8  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;},  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"feature": "Gender",   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"type" : "string",   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"majority": ["Male"],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"minority": ["Female", "Transgender"],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"threshold" : 0.8  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}  
&nbsp;&nbsp;&nbsp;&nbsp;]  
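
To make the threshold concrete, the sketch below assumes that the fairness score is the ratio of the favourable-outcome rate of the minority group to that of the majority group, which matches the "20% fewer approved loans" reading given above. That interpretation is an assumption, and the function name and the counts are made up for illustration; none of them are taken from this notebook.

```python
def fairness_ratio(minority_favourable, minority_total,
                   majority_favourable, majority_total):
    """Ratio of favourable-outcome rates, minority over majority (illustrative)."""
    minority_rate = minority_favourable / minority_total
    majority_rate = majority_favourable / majority_total
    return minority_rate / majority_rate


threshold = 0.8

# Hypothetical counts: 60 of 100 Male applicants approved, 45 of 100 Female applicants.
ratio = fairness_ratio(45, 100, 60, 100)
print(round(ratio, 2), "biased" if ratio < threshold else "within threshold")
```

With these invented counts, Female applicants are approved about 25% less often than Male applicants, so the ratio falls below 0.8 and the model would be flagged.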

Please edit the next cell and provide the fairness configuration for your model.

In [None]:
fairness_attributes = [{
                           "type" : "<DATA_TYPE>", #data type of the column eg: float or int or double
                           "feature": "<COLUMN_NAME>", 
                           "majority": [
                               [X, Y] # range of values for column eg: [31, 45] for int or [31.4, 45.1] for float
                           ],
                           "minority": [
                               [A, B], # range of values for column eg: [10, 15] for int or [10.5, 15.5] for float
                               [C, D]   # range of values for column eg: [80, 100] for int or [80.0, 99.9] for float                    
                           ],
                           "threshold": <VALUE> #such that 0<VALUE<=1. eg: 0.8
                       }]

# Specify the Favorable and Unfavorable class values

The second part of fairness configuration is about the favourable and unfavourable class values.  Recall that in the case of Loan Processing Model, the target field (label column or class label) can have the following values: "Loan Granted", "Loan Denied" and "Loan Partially Granted".  Out of these values "Loan Granted" and "Loan Partially Granted" can be considered as being favorable and "Loan Denied" is unfavorable.  In other words in order to measure fairness, we need to know the target field values which can be considered as being favourable and those values which can be considered as unfavourable.  

For the Loan Processing Model, the values can be specified as follows:

parameters = {  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"favourable_class" :  [ "Loan Granted", "Loan Partially Granted" ],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"unfavourable_class": [ "Loan Denied" ]  
&nbsp;&nbsp;&nbsp;&nbsp;}  

In the case of a regression model, the favourable and unfavourable classes are ranges.  For example, for a model which predicts medicine dosage, the favourable outcome could be between 80 ml and 120 ml or between 5 ml and 20 ml, whereas the unfavourable outcome would be values between 21 ml and 79 ml.  For such a model, the favourable and unfavourable values are specified as follows:
     
parameters = {  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"favourable_class" :  [ [5, 20], [80, 120] ],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"unfavourable_class": [ [21, 79] ]  
&nbsp;&nbsp;&nbsp;&nbsp;}  
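
The way such ranges partition numeric predictions can be sketched as follows. This is a minimal illustration, assuming that both ends of every range are inclusive; `outcome_group` is a hypothetical helper and is not the logic used later in this notebook.

```python
def outcome_group(value, parameters):
    """Return 'favourable', 'unfavourable' or None for a numeric prediction.

    Assumes each class is a list of inclusive [start, end] ranges.
    """
    for start, end in parameters["favourable_class"]:
        if start <= value <= end:
            return "favourable"
    for start, end in parameters["unfavourable_class"]:
        if start <= value <= end:
            return "unfavourable"
    return None  # value falls outside every configured range


dosage_parameters = {
    "favourable_class": [[5, 20], [80, 120]],
    "unfavourable_class": [[21, 79]],
}

for dosage in (10, 50, 100, 20.5):
    print(dosage, outcome_group(dosage, dosage_parameters))
```

Note that a value such as 20.5 lands in the gap between [5, 20] and [21, 79], so for non-integer predictions it may be worth choosing range boundaries that leave no gaps.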

Please edit the next cell to provide information about your model.

In [None]:
# For classification models use the below.
parameters = {
        "favourable_class" :  [ "<EDIT THIS>", "<EDIT THIS>" ],
        "unfavourable_class": [ "<EDIT THIS>" ]
    }
# For regression models use the below.  Delete the entry which is not required.
parameters = {
        "favourable_class" :  [ [<EDIT THIS>, <EDIT THIS>], [<EDIT THIS>,<EDIT THIS>] ],
        "unfavourable_class": [ [<EDIT THIS>, <EDIT THIS>] ]
    }

# Specify the number of records which should be processed for Fairness

The final piece of information that needs to be provided is the number of records (min_records) that should be used for computing fairness. Fairness checks run hourly.  If min_records is set to 5000, then every hour the fairness check will pick up the last 5000 records which were sent to the model for scoring and compute fairness on those 5000 records.  Please note that fairness computation will not start until 5000 records have been sent to the model for scoring.

If we set the value of "min_records" to a small number, then the fairness computation will be influenced mostly by the scoring requests sent to the model in the recent past. In other words, the model might be flagged as being biased if it is acting in a biased manner on the last few records, even though overall it might not be acting in a biased manner.  On the other hand, if "min_records" is set to a very large number, then we will not be able to catch model bias quickly. Hence the value of min_records should be set such that it is neither too small nor too large.
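
The trade-off can be seen in a small simulation. The sketch below is only a model of the behaviour described above (evaluate the most recent `min_records` records, and do nothing until that many have arrived); `latest_window` and `favourable_rate` are hypothetical helpers and the record stream is invented.

```python
def latest_window(records, min_records):
    """Return the last min_records records, or None if too few have arrived."""
    if len(records) < min_records:
        return None
    return records[-min_records:]


def favourable_rate(window):
    """Fraction of records in the window with a favourable outcome (True)."""
    return sum(window) / len(window)


# Invented stream: 90 favourable outcomes followed by a recent run of 10 unfavourable ones.
scored = [True] * 90 + [False] * 10

for min_records in (10, 100, 500):
    window = latest_window(scored, min_records)
    if window is None:
        print(min_records, "not enough records yet")
    else:
        print(min_records, favourable_rate(window))
```

A window of 10 sees only the recent unfavourable run, a window of 100 sees the overall picture, and a window of 500 cannot be evaluated yet.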

Please update the next cell to specify a value for min_records.

In [None]:
# min_records = <Minimum number of records to be considered for performing scoring>
min_records = <EDIT THIS>

# End of Input 

You need not edit anything beyond this point.  Run the notebook and go to the very last cell.  There will be a link to download the JSON file (called: "Download training data distribution JSON file").  Download the file and upload it using the IBM AI OpenScale GUI.

In [None]:
# Parameter check
if model_type != "regression":
    favourable_class = parameters.get("favourable_class")
    if (favourable_class is not None):
        if not(isinstance(favourable_class,(list,tuple))):
            raise Exception("'favourable_class' in parameters must be a list of values")
        if isinstance(favourable_class[0],(list,tuple)):
            raise Exception("'favourable_class' in parameters should not be a list of lists. It should be a list of values.")
    
    unfavourable_class = parameters.get("unfavourable_class")
    if (unfavourable_class is not None):
        if not(isinstance(unfavourable_class,(list,tuple))):
            raise Exception("'unfavourable_class' in parameters must be a list of values")
        if isinstance(unfavourable_class[0],(list,tuple)):
            raise Exception("'unfavourable_class' in parameters should not be a list of lists. It should be a list of values.")
    
else:
    favourable_class = parameters.get("favourable_class")
    if(favourable_class is None):
        raise Exception("'favourable_class' values are required in parameters in case of regression model.")
    else:
        if not(isinstance(favourable_class[0],(list,tuple))):
            raise Exception("'favourable_class' in parameters must be a list of lists")
    
    unfavourable_class = parameters.get("unfavourable_class")
    if(unfavourable_class is None):
        raise Exception("'unfavourable_class' values are required in parameters in case of regression model.")
    else:
        if not(isinstance(unfavourable_class[0],(list,tuple))):
            raise Exception("'unfavourable_class' in parameters must be a list of lists")
    
# Existence check    
if "class_label" not in training_data_info:
    raise Exception("'class_label' attribute is missing in 'training_data_info' input")
if "feature_columns" not in training_data_info:
    raise Exception("'feature_columns' attribute is missing in 'training_data_info' input")
if "categorical_columns" not in training_data_info:
    raise Exception("'categorical_columns' attribute is missing in 'training_data_info' input")

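if not isinstance(training_data_info.get("feature_columns"), list):
    raise Exception("'feature_columns' should be a list of values")
# 'categorical_columns' may be None or [] when no feature column is categorical
if training_data_info.get("categorical_columns") is not None and not isinstance(training_data_info.get("categorical_columns"), list):
    raise Exception("'categorical_columns' should be a list of values")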
    
if not training_data_info.get("feature_columns"):
    raise Exception("'feature_columns' should not be none or empty list")
    
#Verify existence of feature columns in training data
feature_columns = training_data_info.get("feature_columns")
if feature_columns is None or len(feature_columns) == 0:
    raise Exception("'feature_columns' should not be empty")
    
columns_from_data_frame = list(data_df.columns.values)
check_feature_column_existence = list(set(feature_columns) - set(columns_from_data_frame))
if len(check_feature_column_existence) > 0:
    raise Exception("Feature columns missing in training data.Details:{}".format(check_feature_column_existence))

    
#Verify existence of  categorical columns in feature columns
categorical_columns = training_data_info.get("categorical_columns")
if categorical_columns is not None and len(categorical_columns) > 0:
    check_cat_col_existence = list(set(categorical_columns) - set(feature_columns))
    if len(check_cat_col_existence) > 0:
        raise Exception("'categorical_columns' should be subset of feature columns.Details:{}".format(check_cat_col_existence))
            
# Input validations
for fea in fairness_attributes:
    if "feature" not in fea:
        raise Exception("'feature' attribute is missing in 'fairness_attributes' input")
    if "majority" not in fea:
        raise Exception("'majority' attribute is missing in 'fairness_attributes' input")
    if "minority" not in fea:
        raise Exception("'minority' attribute is missing in 'fairness_attributes' input")
        
acceptable_model_types = ["binary","multiclass","regression"]
if model_type not in acceptable_model_types:
    # format() is used because a str cannot be concatenated with a list
    raise Exception("Invalid model type. Acceptable values are: {}".format(acceptable_model_types))
                        
if model_type=="regression":
    if "favourable_class" not in parameters:
        raise Exception("'favourable_class' attribute is missing in 'parameters' input")
    if "unfavourable_class" not in parameters:
        raise Exception("'unfavourable_class' attribute is missing in 'parameters' input")
      
fairness_attributes_list = []
for fea in fairness_attributes:
    fairness_attributes_list.append(fea["feature"])    

The following cell contains the methods that validate the fairness attributes, checking for overlapping majority/minority ranges and for a valid threshold:

In [None]:
def validate_numeric_attr(value, type, feature):
    # 'type' is the group name ('majority' or 'minority'); 'feature' is the feature name (a string)
    if len(value) != 2:
        error_msg = "Invalid syntax: The {0} value for the numerical attribute '{1}' must be specified as a list of ranges. Range format: [<begin_value>,<end_value>], Example: [[25,50],[60,75]]".format(type, feature)
        raise Exception(error_msg)
    start = value[0]
    end = value[1]
    if start > end:
        raise Exception("Invalid range: The numerical range for {0} value of the attribute '{1}' is incorrect, start value of range must not be greater than the end value.".format(type, feature))

def validate_maj_min(feature, maj_min,type):
    if maj_min is None or maj_min == '' or maj_min == []:
        error_msg = "Missing required field: You haven't specified {0} value for the feature '{1}'.".format(type, feature)
        raise Exception(error_msg)
    if not isinstance(maj_min, list):
        error_msg = "Invalid syntax: The {0} value for feature '{1}' must be specified as a list of categorical values or numerical ranges.".format(type, feature)
        raise Exception(error_msg)
    for value in maj_min:
        if isinstance(value, list):
            validate_numeric_attr(value, type, feature)
        elif not isinstance(value, str):
            error_msg = "Invalid syntax: The {0} value for feature '{1}' must be specified as a list of categorical values or numerical ranges.".format(type, feature)
            raise Exception(error_msg)
        else:
            if value.strip() == '':
                error_msg = "Value of {0} can not be empty.".format(type)
                raise Exception(error_msg)

def validate_threshold(feature, threshold):
    if threshold is None or threshold == '':
        error_msg = "Missing required field: You haven't specified any threshold value for the feature '{0}'.".format(feature)
        raise Exception(error_msg)
    if not isinstance(threshold, float) and not isinstance(threshold, int):
        error_msg = "Invalid type: only numerical values are supported for threshold."
        raise Exception(error_msg)
    if threshold <= 0 or threshold > 1:
        error_msg = "The threshold value provided is invalid, it must be in range 0 < threshold <=1"
        raise Exception(error_msg)

def validate_feature(feature):
    feature_name = feature.get('feature')
    if feature_name is None or feature_name == '':
        error_msg = "Missing required field: You haven't specified the feature name."
        raise Exception(error_msg)
    majority = feature.get('majority')
    validate_maj_min(feature_name, majority, 'majority')
    minority = feature.get('minority')
    validate_maj_min(feature_name, minority, 'minority')
    for min_value in minority:
        #if attribute is categorical, same value can not be specified in both majority and minority
        if type(min_value) != type(majority[0]):
            error_msg = "Type mismatch: The data types of majority and minority for feature '{0}' are not matching.".format(feature_name)
            raise Exception(error_msg)
        if isinstance(min_value, str):
            if min_value in majority:
                error_msg = "Same value can not be specified as both majority and minority."
                raise Exception(error_msg)
        else:
            min_start = min_value[0]
            min_end = min_value[1]
            for maj_value in majority:
                maj_start = maj_value[0]
                maj_end = maj_value[1]
                if (min_start >= maj_start and min_start <= maj_end) or (min_end >= maj_start and min_end <= maj_end) or (maj_start >= min_start and maj_start <= min_end) or (maj_end >= min_start and maj_end <= min_end):
                    error_msg = "The ranges you specified for the minority and majority values overlap."
                    raise Exception(error_msg)
    threshold = feature.get('threshold')
    validate_threshold(feature_name, threshold)

for feature in fairness_attributes:
    validate_feature(feature)
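
The four-clause overlap test in `validate_feature` can be read more easily in its two-comparison form. The sketch below is an illustration rather than a replacement for the cell above: it assumes well-formed closed ranges (start not greater than end), and both function names are hypothetical.

```python
def ranges_overlap(a, b):
    """True if closed ranges a=[a_start, a_end] and b=[b_start, b_end] share a value."""
    return a[0] <= b[1] and b[0] <= a[1]


def ranges_overlap_verbose(a, b):
    """The four-clause form used in validate_feature, kept for comparison."""
    return ((a[0] >= b[0] and a[0] <= b[1]) or (a[1] >= b[0] and a[1] <= b[1]) or
            (b[0] >= a[0] and b[0] <= a[1]) or (b[1] >= a[0] and b[1] <= a[1]))


print(ranges_overlap([15, 30], [31, 60]))   # adjacent integer ranges
print(ranges_overlap([15, 31], [31, 60]))   # shared endpoint 31
```

Because the ranges are treated as closed, a minority range of [15, 31] and a majority range of [31, 60] count as overlapping, while [15, 30] and [31, 60] do not.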

The following cell contains code to generate schema information:

In [None]:
# Cell-3 : Responsible for generating schema from training data 

import numpy as np
from pyspark.sql import SparkSession

def generate_training_schema(payload_df,feature_columns,categorical_columns=None):
    spark = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()
    df = spark.createDataFrame(payload_df)
    sc = df.schema
    fields = []
    for f in sc:
        field = f.jsonValue()
        column = field["name"]

        if column in feature_columns:
            field["metadata"]["modeling_role"] = "feature"

            #Set categorical column in input schema
        if categorical_columns is not None:
            if column in categorical_columns:
                field["metadata"]["measure"] = "discrete"

        fields.append(field)

    training_data_schema = {}
    training_data_schema["type"] = "struct"
    training_data_schema["fields"] = fields
    
    return training_data_schema

feature_columns = training_data_info.get("feature_columns")
categorical_columns = training_data_info.get("categorical_columns")
training_schema = generate_training_schema(data_df,feature_columns,categorical_columns)

The following cell contains the methods for generating the training data distribution:

In [None]:
# Cell-4: Define the function for computing distribution

# -----------------------------------------------------------------------------
# Licensed Materials - Property of IBM
# 
# (C) Copyright IBM Corp. 2018    All Rights Reserved.
# US Government Users Restricted Rights - Use, duplication or disclosure
# restricted by GSA ADP Schedule Contract with IBM Corp.
# -----------------------------------------------------------------------------
import math
import random
import numpy

def get_data_types(val):
     
    is_numeric = False
    is_int = False
    is_float = False
    try:
        float(val)
        is_numeric = True
        is_float = True
    except ValueError:
        pass
 
    try:
        import unicodedata
        unicodedata.numeric(val)
        is_numeric = True
        is_float = True
    except (TypeError, ValueError):
        pass
    
    try:
        int(val)
        is_numeric = True
        is_int = True
    except ValueError:
        pass
           
    return is_numeric, is_float, is_int

def getFrequencyTable(protected_attr_col, class_col):
    #Example:{Male: { class_outcome_1:28,class_outcome_2:29,class_outcome_3:50 }, Female: {class_outcome_1:28,class_outcome_2:29,class_outcome_3:50}}
    frequency_map={} 
    #class_label_types = set()
    str_flag=False
    other_flag=False
    nan_num=0
    for index,protected_attr_value in enumerate(protected_attr_col):
        
        #remove initial whitespaces
        if type(protected_attr_value) is str:
            protected_attr_value=protected_attr_value.lstrip()
        else:
            #remove nan
            if math.isnan(protected_attr_value):
                nan_num=nan_num+1
                continue
        
        class_value=class_col[index]
        if type(class_value) is str:
            class_value = class_value.lstrip()
            if class_value.isdigit():
                other_flag=True
            else:
                str_flag=True
        else:
            #remove nan
            if math.isnan(class_value):
                nan_num=nan_num+1
                continue 
            else:
                other_flag=True
        
            
        #Example is Yes        
        #Update frequency table               
        #checking if frequency map has value for this protected_attribute value. If Male already exists in the freq map
        if protected_attr_value in frequency_map:
            
            #get the dictionary for counts for different class values for this protected attribute value
            freq_count_dict=frequency_map[protected_attr_value]
            
            #check if this particular  counts for this particular class_value already exists: Example {Yes:50}
            if class_value in freq_count_dict:
                counts_for_class_value=freq_count_dict[class_value]
                counts_for_class_value=counts_for_class_value+1
                
                #Update counts for this particular class in the frequency count 
                freq_count_dict.update({class_value:counts_for_class_value})
                #Update the final map of freq
                frequency_map.update({protected_attr_value:freq_count_dict})
            else:
                counts_for_class_value=1  
                freq_count_dict.update({class_value:counts_for_class_value})
                frequency_map.update({protected_attr_value:freq_count_dict})            
            
        else:
            #This protected attribute does not exist in frequency map
            counts_for_class_value=1
            freq_count_dict={class_value:1}             
            frequency_map.update({protected_attr_value:freq_count_dict})
    #Log a warning if class label column contains mixed data
    if str_flag and other_flag:
        print("Class label column contains mixed data")       
    return frequency_map  

def modify_distribution(attributes_dataset,protected_attribute,class_dataset):
        
        #protected attribute column
        extracted_column = attributes_dataset[protected_attribute].tolist()       
        #class label column
        labels=class_dataset.tolist()        
        #creating the frequency map for attribute and concerned favourable class
        frequency_map=getFrequencyTable(extracted_column,labels)
        return frequency_map
    
def getDistribution(dataset,inputs): 
    
    distribution_map={}
    attributes_dataset = dataset[inputs['fairness_attributes']]
    class_dataset = dataset[inputs['class_label']]

    # Check if the class_label column is numerical or categorical
    class_labels = sorted(class_dataset.tolist())
    is_column_numeric = False
    count_numeric = 0
    row_num = class_dataset.shape[0]
    
    if row_num<1000:
        sample_size = row_num
    else:
        sample_size = 1000
    
    for i in range(sample_size):
        class_label = class_labels[random.randint(0,row_num-1)]
        is_numeric, is_float, is_int = get_data_types(class_label)
        if is_numeric or is_float or is_int :    
            count_numeric+=1

    if count_numeric > (sample_size-count_numeric):
        is_column_numeric = True

    #Get the additional info for class_label
    # Use the column-level flag; is_numeric only reflects the last value that was sampled
    if is_column_numeric:
        min = class_labels[0]
        for pos in range(1,row_num): 
            is_numeric, is_float, is_int = get_data_types(class_labels[-pos])
            if is_numeric or is_float or is_int :
                max = class_labels[-pos]
                break
        distinct_class_label_list = [min,max]

    else:
        distinct_class_label_values = set()
        for class_label in class_labels:
            is_numeric, is_float, is_int = get_data_types(class_label)
            if not(is_numeric or is_float or is_int):
                distinct_class_label_values.add(class_label)
        distinct_class_label_list = list(distinct_class_label_values)
    
    for protected_attribute in inputs['fairness_attributes']: 
        distribution_map[protected_attribute]=modify_distribution(attributes_dataset,protected_attribute,class_dataset)
    
    return distribution_map,distinct_class_label_list
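
The nested counting performed by `getFrequencyTable` can be summarised with the standard library. The sketch below is a simplified illustration, not the notebook's implementation: it omits the NaN skipping, whitespace stripping and mixed-type warning, and `frequency_table` and the sample values are hypothetical.

```python
from collections import Counter, defaultdict


def frequency_table(attribute_values, class_values):
    """Count class values per attribute value, e.g. {"Male": {"Yes": 2, "No": 1}}."""
    table = defaultdict(Counter)
    for attribute_value, class_value in zip(attribute_values, class_values):
        table[attribute_value][class_value] += 1
    return {key: dict(counts) for key, counts in table.items()}


# Invented sample: one fairness attribute column and the matching class label column.
gender = ["Male", "Female", "Male", "Female", "Male"]
approval = ["Loan Granted", "Loan Denied", "Loan Granted", "Loan Granted", "Loan Denied"]
print(frequency_table(gender, approval))
```

Each fairness attribute gets one such table, keyed first by attribute value and then by class label value.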

The following cell contains code for generating the training data distribution JSON for fairness. It uses the methods defined in the cell above and expects a pandas dataframe as input:

In [None]:
# Cell-5: Define the function for computing distribution

import math
import json
import numpy
import base64
import copy
import ast
import io,sys
import datetime
import re
import random
from IPython.display import HTML

total_rows = data_df.shape[0]
print("Total Rows retrieved " + str(total_rows))

fairness_params = { "class_label" : training_data_info["class_label"],
                    "fairness_attributes" : fairness_attributes_list }
data,distinct_class_label_values = getDistribution(data_df,fairness_params)
########################################################

# json, numpy, math, datetime, copy and ast are already imported at the top of this cell
import logging,time
from datetime import timedelta
import numpy as np


def computeTrainingDataDistribution():
      
    try:    
        
        class_label = training_data_info["class_label"]
        
        fairness_params = { "class_label" : training_data_info["class_label"],
                            "fairness_attributes" : fairness_attributes_list }
        
        data_frame, feature_data_types = cleanPayloadData(data_df,fairness_params)
        
        total_rows = data_frame.shape[0]
        print("Total Rows retrieved " + str(total_rows))
        
        distribution_data = []   
        favourable_unfavourable_class = []
        if model_type is not None and model_type=="regression":
            
            favourable_class = parameters["favourable_class"]
            unfavourable_class = parameters["unfavourable_class"]
            #Combine the favourable and unfavourable ranges 
            favourable_unfavourable_class.extend(parameters['favourable_class'])
            favourable_unfavourable_class.extend(parameters['unfavourable_class'])
            distribution_data = compute_regression_training_distribution(data,  fairness_attributes_list, favourable_unfavourable_class, feature_data_types)
        else:
            distribution_data =  compute_training_distribution(data_frame, data, training_data_info, fairness_attributes, feature_data_types ) 

        distinct_class_label_feature = {}
        distinct_class_label_feature["attribute"] = fairness_params["class_label"]
        distinct_class_label_feature["is_class_label"] = True
        if model_type is None or model_type!="regression":
            temp = distinct_class_label_values[0]
            if type(temp) is int or type(temp) is float:
                if(distinct_class_label_values[0]>distinct_class_label_values[1]):
                    distinct_class_label_feature["min"] = distinct_class_label_values[1]
                    distinct_class_label_feature["max"] = distinct_class_label_values[0]
                else:
                    distinct_class_label_feature["min"] = distinct_class_label_values[0]
                    distinct_class_label_feature["max"] = distinct_class_label_values[1]
            else:
                distinct_class_label_feature["distinct_values"] = distinct_class_label_values        
        else:
            vList = []
            vList.extend(favourable_unfavourable_class)
            distinct_class_label_feature["distinct_values"] = vList
        
        distribution_data.append(distinct_class_label_feature)    
            
        return distribution_data 
                    
    except Exception as exc:
        raise exc


def compute_regression_training_distribution(data, fairness_attributes, favourable_classes,feature_data_types):
    distribution_data = []    
    distinct_data = {}
    for fairness_attribute in fairness_attributes[:]:
        keys = sorted(data[fairness_attribute].keys())

        is_numeric, is_float, is_int = get_data_types(keys[0])
        
        if(feature_data_types[fairness_attribute]=="categorical"):
            is_float = is_int = is_numeric = False

        key_values = None
        min_value = None
        max_value = None
        if len(keys)>0:
            if is_numeric:
                if is_float:
                    key_values = sorted(list(map(float, keys)))
                else:
                    key_values = sorted(list(map(int, keys)))    

                min_value = key_values[0]
                max_value = key_values[len(key_values)-1]

        if not is_numeric:
            distinct_data[fairness_attribute] = keys

        feature = {}
        feature["attribute"] = fairness_attribute
        if is_numeric is True:
            feature["min"] = min_value
            feature["max"] = max_value
        else:    
            feature["distinct_values"] = keys    

        class_label_values = []
        for key1, value1 in sorted(data[fairness_attribute].items()):
            value_array = {}
            value_array["label"] = key1
            sortedValues = sorted(value1.items())
            ranges = {}
            for listValue in favourable_classes[:]:
                ranges[tuple(listValue)] = 0

            for key2, value2 in sortedValues:
                for rKey, rValue in ranges.items():
                    range_start = rKey[0]
                    range_end = rKey[1]
                    if key2>=range_start and key2<=range_end:
                        # value2 is the number of rows with class label key2, so add it to the range total
                        ranges[rKey] = ranges[rKey] + value2
                        break
            range_items = ranges.items()  
            range_to_delete = []     
            for k, v in range_items:
                if ranges[k]==0:
                    range_to_delete.append(k)
            for x in range_to_delete[:]:
                del ranges[x]        

            a = []
            for rng, cnt in ranges.items():       
                b = {}
                b["class_value"] = str(list(rng))
                b["count"] = cnt   
                a.append(b)    

            value_array["counts"] = a
            class_label_values.append(value_array)
            
        feature["class_labels"] =  class_label_values  
        distribution_data.append(feature)
             
    # Return only after every fairness attribute has been processed
    return distribution_data

def compute_training_distribution(payload_df,data, request_payload , feature_attributs, feature_data_types):  
       
    # Sorting values of majority minority range
    #parameters = request_payload.get("parameters")
    #training_data = request_payload.get("training_data")
    class_label = training_data_info["class_label"]
    fairness_params = { "class_label" : training_data_info["class_label"],"fairness_attributes" : fairness_attributes_list }

    sorted_data = {}
    # For each feature (if available) generate the initial set of boundaries
    bucket_size = 50
    for feature in feature_attributs[:]:
        values = {}
        feature_name = feature["feature"]
        majority = feature["majority"]
        minority = feature["minority"]
        data_type = None


        if payload_df[feature_name].dtype == np.float64 or payload_df[feature_name].dtype == np.float32 or payload_df[feature_name].dtype == np.double or payload_df[feature_name].dtype == np.longdouble:
            data_type = "float"
        elif(payload_df[feature_name].dtype == np.int64 or payload_df[feature_name].dtype == np.int32):
            data_type = "int"  

        if data_type is not None:

            for major in majority[:]:
                for maj in major[:]:
                    values[maj] = maj
            for minor in minority[:]:
                for min in minor[:]:
                    values[min] = min    
            boundaries = get_boundaries(majority, minority, data_type, bucket_size)   
            sorted_values = sorted(values.values())

            feature_data = {}
            feature_data["sorted_values"] = sorted_values
            feature_data["boundaries"] = boundaries
            sorted_data[feature_name] = feature_data


    distribution_data = []    
    distinct_data = {}
    bucket_data = {}


    #Combine all the count data under bucket as list for each fairness attributes if type is numeric ie int or float
    for fairness_attribute in fairness_params["fairness_attributes"]:
        keys = sorted(data[fairness_attribute].keys())

        is_numeric, is_float, is_int = get_data_types(keys[0])

        if(feature_data_types[fairness_attribute]=="categorical"):
            is_float = is_int = is_numeric = False

        # No need to count data for non numeric type since buckets are for int and float datatype
        if not is_numeric:
            continue

        bucket_dict = {}    
        boundaries = []
        if fairness_attribute in sorted_data:
            boundaries = sorted_data[fairness_attribute]["boundaries"]
            least_min = boundaries[0][0]
            highest_max = boundaries[-1][1]
            for key1, value1 in sorted(data[fairness_attribute].items()):
                idx = -1
                val = ast.literal_eval(str(key1))
                val = truncate(val,5)
                bucket = None
                a = []

                #Values less than the least minority value given as input
                if(val<boundaries[0][0]):
                    if(least_min!=boundaries[0][0]):
                        bucket = str(boundaries[0])
                        for key2, value2 in sorted(value1.items()):
                            b = {}
                            b["class_value"] = key2
                            b["count"] = value2   
                            a.append(b)

                        existing_values = None
                        if bucket in bucket_dict:
                            existing_values = bucket_dict[bucket]
                            existing_values.extend(a)
                            del bucket_dict[bucket]
                            boundaries[0][0] = val
                            bucket = str(boundaries[0])
                            bucket_dict[bucket] = existing_values
                    
                    else:
                        left_most_boundary_bucket = [val,least_min]
                        boundaries.insert(0,left_most_boundary_bucket)

                        bucket = str(boundaries[0])
                        for key2, value2 in sorted(value1.items()):
                            b = {}
                            b["class_value"] = key2
                            b["count"] = value2   
                            a.append(b)

                        existing_values = None
                        if bucket in bucket_dict:
                            existing_values = bucket_dict[bucket]
                            existing_values.extend(a)
                            bucket_dict[bucket] = existing_values
                        else:
                            bucket_dict[bucket] = a
                    #Value found and corresponding bucket has been inserted/modified so go to the next value
                    continue

                #Values greater than the highest majority value given as input
                if(val>boundaries[-1][1]):
                    if(highest_max!=boundaries[-1][1]):
                        bucket = str(boundaries[-1])
                        for key2, value2 in sorted(value1.items()):
                            b = {}
                            b["class_value"] = key2
                            b["count"] = value2   
                            a.append(b)

                        existing_values = None
                        if bucket in bucket_dict:
                            existing_values = bucket_dict[bucket]
                            existing_values.extend(a)
                            del bucket_dict[bucket]
                            boundaries[-1][1] = val
                            bucket = str(boundaries[-1])
                            bucket_dict[bucket] = existing_values
                        
                    else:
                        right_most_boundary_bucket = [highest_max,val]
                        boundaries.append(right_most_boundary_bucket)

                        bucket = str(boundaries[-1])
                        for key2, value2 in sorted(value1.items()):
                            b = {}
                            b["class_value"] = key2
                            b["count"] = value2   
                            a.append(b)

                        existing_values = None
                        if bucket in bucket_dict:
                            existing_values = bucket_dict[bucket]
                            existing_values.extend(a)
                            bucket_dict[bucket] = existing_values
                        else:
                            bucket_dict[bucket] = a
                    #Value found and corresponding bucket has been inserted/modified so go to the next value
                    continue
                
                for boundary_bucket in boundaries[:]: 
                    idx+=1
                    boundary_start = boundary_bucket[0]
                    boundary_end = boundary_bucket[1]
                    # fit the value in right boundary
                    if(val>=boundary_start and val<=boundary_end):
                        bucket = str(boundary_bucket)
                        for key2, value2 in sorted(value1.items()):
                            b = {}
                            b["class_value"] = key2
                            b["count"] = value2   
                            a.append(b)

                        existing_values = None
                        if bucket in bucket_dict:
                            existing_values = bucket_dict[bucket]
                            existing_values.extend(a)
                            bucket_dict[bucket] = existing_values
                        else:
                            bucket_dict[bucket] = a
                        #Value fits in the bucket so break, no need to further loop through remaining buckets
                        break

            bucket_data[fairness_attribute] = bucket_dict        

    # Sum up each of the bucket
    bucket_summation_data = {}
    for key, value in bucket_data.items():
        summation_array = {}
        for key1,value1 in value.items():
            bucket_data_count = {}
            for val in value1[:]:
                class_value = val["class_value"]
                count = val["count"]
                if not class_value in bucket_data_count:
                    bucket_data_count[class_value] = count
                else:
                    bucket_data_count[class_value] = bucket_data_count[class_value] + count  
            summation_array[key1] = bucket_data_count
        bucket_summation_data[key] =  summation_array


    distribution_data = []    
    distinct_data = {}

    #Build the json
    for fairness_attribute in fairness_params["fairness_attributes"]:

        keys = sorted(data[fairness_attribute].keys())

        min = None
        max = None

        is_numeric, is_float, is_int = get_data_types(keys[0])

        if(feature_data_types[fairness_attribute]=="categorical"):
            is_float = is_int = is_numeric = False

        key_values = None
        if len(keys)>0:
            if is_numeric:
                if is_float:
                    key_values = sorted(list(map(float, keys)))
                else:
                    key_values = sorted(list(map(int, keys)))  

                min = key_values[0]
                max = key_values[len(key_values)-1]

        feature = {}
        feature["attribute"] = fairness_attribute
        if is_numeric:
            feature["min"] = min
            feature["max"] = max
        else:    
            feature["distinct_values"] = keys    

        class_label_values = []
        if not is_numeric:     
            for key1, value1 in sorted(data[fairness_attribute].items()):
                value_array = {}
                value_array["label"] = key1
                #value_array["total_rows"] = total_rows

                a = []
                for key2, value2 in sorted(value1.items()):
                    b = {}
                    b["class_value"] = key2
                    b["count"] = value2   
                    a.append(b)
                value_array["counts"] = a
                class_label_values.append(value_array)
        else:
            bucket_data = None
            if fairness_attribute in bucket_summation_data and len(keys)>bucket_size:
                bucket_data = bucket_summation_data[fairness_attribute]            
            else:
                bucket_data = data[fairness_attribute] 

            for key,value in bucket_data.items():
                value_array = {}
                #value_array["label"] = key
                value_array["label"] = key
                a = []
                for key2, value2 in sorted(value.items()):
                    b = {}
                    b["class_value"] = key2
                    b["count"] = value2   
                    a.append(b)
                value_array["counts"] = a
                class_label_values.append(value_array)

        feature["class_labels"] =  class_label_values  
        distribution_data.append(feature)

    return distribution_data
 
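def get_data_types(val):
    """Return (is_numeric, is_float, is_int) flags for a single value."""
    import unicodedata

    is_numeric = False
    is_int = False
    is_float = False

    # Anything float() accepts (numbers, numeric strings) counts as a float
    try:
        float(val)
        is_numeric = True
        is_float = True
    except (TypeError, ValueError):
        pass

    # A single Unicode numeric character (for example a vulgar fraction) also counts
    try:
        unicodedata.numeric(val)
        is_numeric = True
        is_float = True
    except (TypeError, ValueError):
        pass

    # Anything int() accepts counts as an int; None, NaN and inf are rejected here
    try:
        int(val)
        is_numeric = True
        is_int = True
    except (TypeError, ValueError, OverflowError):
        pass

    return is_numeric, is_float, is_int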

def get_boundaries(majority, minority, data_type,bucket_size):
    values = {}
    boundaries = []
    if data_type is not None:    
        for major in majority[:]:
            for maj in major[:]:
                values[maj] = maj
        for minor in minority[:]:
            for min in minor[:]: 
                values[min] = min

        sorted_values = sorted(values.values())

        min = sorted_values[0]
        max = sorted_values[-1]
        diff = max - min

        range_cnt = diff/bucket_size
        range = None

        if data_type=="int":
            range = int(range_cnt)
            # A width of 0 would keep the while loop below from ever advancing
            if range < 1:
                range = 1
        else:
            range = truncate(range_cnt,5)

        temp = min
        distribution = {}
        output = {}
        start = min
        end = max

        # Create the initial boundaries
        length_sorted_values = len(sorted_values)
        idx = 0

        if not data_type is None and data_type=="int":
            next_start = temp
        else:
            next_start = truncate(temp,5)

        while temp <= max:
            #start = round(temp,1)
            if not data_type is None and data_type=="int":
                start = next_start
                end = temp+range

                if (idx<(length_sorted_values)):

                    if(sorted_values[idx] == end):
                        idx+=1
                        next_start = temp+range
                        boundary = [start,end]
                        boundaries.append(boundary)
                        temp = temp + range

                    elif(sorted_values[idx] == start):
                        idx+=1
                        while(idx<length_sorted_values and sorted_values[idx]<end):
                            boundary = [start,sorted_values[idx]]
                            boundaries.append(boundary)
                            start = sorted_values[idx]
                            idx+=1

                        next_start = temp+range
                        temp = temp + range
                        boundary = [start,end]
                        boundaries.append(boundary)   

                    elif(sorted_values[idx]>=start and sorted_values[idx]<=end):
                        while(idx<len(sorted_values) and sorted_values[idx]<end):
                            start_diff = sorted_values[idx] - start
                            end_diff = end - sorted_values[idx]

                            if start_diff>end_diff:
                                boundary = [start,sorted_values[idx]]
                                boundaries.append(boundary)
                                start = sorted_values[idx]
                                idx+=1

                            else:
                                boundary = [start,sorted_values[idx]]
                                boundaries.append(boundary)
                                start = sorted_values[idx]
                                idx+=1

                        next_start = temp+range
                        temp = temp + range
                        boundary = [start,end]
                        boundaries.append(boundary)       

                    else:
                        next_start = temp+range
                        boundary = [start,end]
                        boundaries.append(boundary)
                        temp = temp + range

                else:
                    next_start = temp+range
                    boundary = [start,end]
                    boundaries.append(boundary)
                    temp = temp + range

            else:
                start = next_start  
                end = truncate(temp+range,5)

                if (idx<(length_sorted_values)):
                    sorted_value = truncate(sorted_values[idx],5)

                    if(sorted_value == end):
                        idx+=1
                        next_start = truncate(temp+range,5)
                        boundary = [start,end]
                        boundaries.append(boundary)
                        temp = temp + range

                    elif(sorted_value == start):
                        idx+=1
                        if(idx<length_sorted_values):
                            sorted_value = truncate(sorted_values[idx],5)
                            while(sorted_value<end and idx<len(sorted_values)):
                                boundary = [start,sorted_value]
                                boundaries.append(boundary)
                                start = sorted_value
                                idx+=1
                                if(idx<length_sorted_values):
                                    sorted_value = truncate(sorted_values[idx],5)

                        next_start = truncate(temp+range,5)
                        temp = temp + range
                        boundary = [start,end]
                        boundaries.append(boundary)   

                    elif(sorted_value>=start and sorted_value<=end):
                        while(sorted_value<end and idx<len(sorted_values)):    
                            start_diff = sorted_value - start
                            end_diff = end - sorted_value

                            if start_diff>end_diff:
                                boundary = [start,sorted_value]
                                boundaries.append(boundary)
                                start = sorted_value
                                idx+=1
                                if(idx<length_sorted_values):
                                    sorted_value = truncate(sorted_values[idx],5)

                            else:
                                boundary = [start,sorted_value]
                                boundaries.append(boundary)
                                start = sorted_value
                                idx+=1
                                if(idx<len(sorted_values)):
                                    sorted_value = truncate(sorted_values[idx],5)

                        next_start = truncate(temp+range,5)
                        temp = temp + range
                        boundary = [start,end]
                        boundaries.append(boundary)       

                    else:
                        next_start = truncate(temp+range,5)
                        boundary = [start,end]
                        boundaries.append(boundary)
                        temp = temp + range

                else:
                    next_start = truncate(temp+range,5)
                    boundary = [start,end]
                    boundaries.append(boundary)
                    temp = temp + range

    return boundaries 

def getDistinctValues(dataset,inputs):  

    distinct_values_sync = {}
    attributes_dataset = dataset[inputs['fairness_attributes']]

    for protected_attribute in inputs['fairness_attributes']:

        protected_attribute_column =  attributes_dataset[protected_attribute].tolist()
        is_column_numeric = False
        count_numeric = 0
        row_num = attributes_dataset[protected_attribute].shape[0]

        if row_num<1000:
            sample_size = row_num
        else:
            sample_size = 1000

        for i in range(sample_size):
            protected_attribute_column_value = protected_attribute_column[random.randint(0,row_num-1)]
            is_numeric, is_float, is_int = get_data_types(protected_attribute_column_value)
            if is_numeric or is_float or is_int :
                count_numeric+=1

        if count_numeric > sample_size/2:
            is_column_numeric = True

        protected_attribute_column_sorted = custom_sorted(protected_attribute_column)

        if  is_column_numeric:
            min = protected_attribute_column_sorted[0]
            for pos in range(1,row_num):
                is_numeric, is_float, is_int = get_data_types(protected_attribute_column_sorted[-pos]) 
                if is_numeric or is_float or is_int :
                    max = protected_attribute_column_sorted[-pos]
                    break
            distinct_protected_attribute_column = [min,max]
            distinct_values_sync[protected_attribute] = distinct_protected_attribute_column

        else:
            distinct_protected_attribute_column = set()
            for attribute_value in protected_attribute_column_sorted:
                is_numeric, is_float, is_int = get_data_types(attribute_value)
                if not(is_numeric or is_float or is_int):
                    distinct_protected_attribute_column.add(attribute_value)

            distinct_values_sync[protected_attribute] = list(distinct_protected_attribute_column)

    return distinct_values_sync 

def truncate(float_value, number_of_digits):
    return math.floor(float_value * 10 ** number_of_digits) / 10 ** number_of_digits           

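def custom_sorted(l):
    # Natural sort: digit runs compare as integers, so "a2" sorts before "a10"
    try:
        convert = lambda text: int(text) if text.isdigit() else text
        alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
        return sorted(l, key = alphanum_key)
    except (AttributeError, TypeError):
        # Non-string values cannot be split; fall back to the default ordering
        return sorted(l)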

def cleanPayloadData(payload_df,fairness_params):

    # Classify each fairness attribute as numerical or categorical and drop
    # the rows whose value does not match the dominant type of its column
    feature_data_types = {}

    for protected_attribute in fairness_params['fairness_attributes']:
        row_num = payload_df.shape[0]
        protected_attribute_column = payload_df[protected_attribute].tolist()

        # Flag every row, including the last one, as numeric or not
        numeric_flags = []
        for protected_attribute_column_value in protected_attribute_column:
            is_numeric, is_float, is_int = get_data_types(protected_attribute_column_value)
            numeric_flags.append(is_numeric or is_float or is_int)
        count_numeric = sum(numeric_flags)

        #Clean the data
        if count_numeric >= (.98*row_num):
            # Mostly numeric column: drop the non-numeric rows
            indexes_to_drop = [idx for idx, flag in enumerate(numeric_flags) if not flag]
            feature_data_types[protected_attribute] = "numerical"
            payload_df = payload_df.drop(payload_df.index[indexes_to_drop])

        elif count_numeric <=(.02*row_num):
            # Mostly non-numeric column: drop the numeric rows
            indexes_to_drop = [idx for idx, flag in enumerate(numeric_flags) if flag]
            feature_data_types[protected_attribute] = "categorical"
            payload_df = payload_df.drop(payload_df.index[indexes_to_drop])

        else:
            raise Exception("Improper input data provided in " + protected_attribute + " column")

    return payload_df, feature_data_types

bias_training_distribution = computeTrainingDataDistribution()
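
To make the bucketing in `compute_training_distribution` easier to follow, here is a simplified sketch of the core idea: each numeric attribute value is assigned to the first `[start, end]` boundary that contains it, and the per-class counts of all values in a bucket are summed. The function name `sum_counts_per_bucket` and the sample data are assumptions made for illustration. The notebook code additionally widens the outer buckets for out-of-range values, which this sketch skips by ignoring such values.

```python
def sum_counts_per_bucket(value_counts, boundaries):
    """Aggregate {value: {class_value: count}} into {str(boundary): {class_value: count}}."""
    bucket_totals = {}
    for value, class_counts in sorted(value_counts.items()):
        for start, end in boundaries:
            if start <= value <= end:
                totals = bucket_totals.setdefault(str([start, end]), {})
                for class_value, count in class_counts.items():
                    totals[class_value] = totals.get(class_value, 0) + count
                # First matching bucket wins, as in the notebook code
                break
    return bucket_totals


# Hypothetical counts for an "Age" attribute: {age: {class label: number of rows}}
age_counts = {
    19: {"Risk": 2, "No Risk": 5},
    24: {"Risk": 1},
    30: {"No Risk": 4},
}
print(sum_counts_per_bucket(age_counts, [[18, 25], [25, 75]]))
# {'[18, 25]': {'Risk': 3, 'No Risk': 5}, '[25, 75]': {'No Risk': 4}}
```

Using the string form of the boundary as the key mirrors how the notebook labels buckets in the generated JSON.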

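The type check in `cleanPayloadData` can be summarized as a threshold rule: a column is treated as numerical when at least 98% of its values parse as numbers, as categorical when at most 2% do, and is rejected otherwise. A minimal sketch of that rule follows, using the hypothetical helper name `classify_column` and `float()` as a simplified stand-in for `get_data_types`.

```python
def classify_column(values, threshold=0.98):
    """Return "numerical" or "categorical", or raise if the column is too mixed."""
    def parses_as_number(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    numeric_share = sum(parses_as_number(v) for v in values) / len(values)
    if numeric_share >= threshold:
        return "numerical"
    if numeric_share <= 1 - threshold:
        return "categorical"
    raise ValueError("Column mixes numeric and non-numeric values")


print(classify_column([18, 25, "31", 47.5]))        # numerical
print(classify_column(["male", "female", "male"]))  # categorical
```

A column such as `[1, "a"]` falls between the two thresholds and raises an error, which corresponds to the "Improper input data" exception in the notebook.
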
The following cell generates the training data distribution JSON for explainability:

In [None]:
# Cell-6: Responsible for generating explainability service distribution 

!pip install lime
# -----------------------------------------------------------------------------
#  Copyright (c) 2016, Marco Tulio Correia Ribeiro All rights reserved
# -----------------------------------------------------------------------------


import collections
from collections import Counter
import numpy as np

from sklearn.preprocessing import LabelEncoder
from lime.discretize import QuartileDiscretizer

feature_cols=training_data_info["feature_columns"]
categorical_cols=training_data_info["categorical_columns"]
label_col=training_data_info["class_label"]
# Numeric columns are the feature columns that are not categorical
numeric_cols = [col for col in feature_cols if col not in categorical_cols]

# Convert columns to numeric in case the data frame read them as non-numeric
data_df[numeric_cols] = data_df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Drop rows with invalid values
data_df.dropna(axis="index", subset=feature_cols, inplace=True)

random_state=10

training_data_schema = list(data_df.columns.values)
training_data_shape = data_df.shape[1]

# Feature column index
feature_column_index = [training_data_schema.index(x) for x in feature_cols]

# Categorical columns index (positions within feature_cols)
categorical_column_index = [feature_cols.index(x) for x in categorical_cols]

# Numeric columns index (positions within feature_cols)
numeric_column_index = [i for i in range(len(feature_cols)) if i not in categorical_column_index]

# class labels
class_labels = []
if model_type != "regression":
    if label_col is not None:
        class_labels = data_df[label_col].unique()
        class_labels = class_labels.tolist()


# Filter feature columns from training data frames
data_frame = data_df.values
data_frame_features = data_frame[:, feature_column_index]

# Compute stats on complete training data
data_frame_num_features = data_frame_features[:,numeric_column_index]
num_base_values = np.median(data_frame_num_features,axis=0)
stds = np.std(data_frame_num_features, axis=0, dtype="float64")
mins = np.min(data_frame_num_features, axis=0)
maxs = np.max(data_frame_num_features, axis=0)

main_base_values = {}
main_cat_counts = {}
if(len(categorical_column_index) > 0):
    for cat_col in categorical_column_index:
        cat_col_value_counts = Counter(data_frame_features[:, cat_col])
        values, frequencies = map(list, zip(*(cat_col_value_counts.items())))
        max_freq_index = frequencies.index(np.max(frequencies))
        cat_base_value = values[max_freq_index]
        main_base_values[cat_col] = cat_base_value
        main_cat_counts[cat_col] = cat_col_value_counts

num_feature_range = np.arange(len(numeric_column_index))
main_stds = {}
main_mins = {}
main_maxs = {}
for x in num_feature_range:
    index = numeric_column_index[x]
    main_base_values[index] = num_base_values[x]
    main_stds[index] = stds[x]
    main_mins[index] = mins[x]
    main_maxs[index] = maxs[x]
    
# Encode categorical columns
categorical_columns_encoding_mapping = {}
for column_index_to_encode in categorical_column_index:
    le = LabelEncoder()
    le.fit(data_frame_features[:, column_index_to_encode])
    data_frame_features[:, column_index_to_encode] = le.transform(
        data_frame_features[:, column_index_to_encode])
    categorical_columns_encoding_mapping[column_index_to_encode] = le.classes_


# Compute training stats on discretized data
descritizer = QuartileDiscretizer(
    data_frame_features, categorical_features=categorical_column_index, feature_names=feature_cols, labels=class_labels, random_state= random_state)

d_means = descritizer.means
d_stds = descritizer.stds
d_mins = descritizer.mins
d_maxs = descritizer.maxs
d_bins = descritizer.bins(data_frame_features, labels=class_labels)

# Compute feature values and frequencies of all columns
cat_features = np.arange(data_frame_features.shape[1])
discretized_training_data = descritizer.discretize(data_frame_features)

feature_values = {}
feature_frequencies = {}
for feature in cat_features:
    column = discretized_training_data[:, feature]
    feature_count = collections.Counter(column)
    values, frequencies = map(list, zip(*(feature_count.items())))
    feature_values[feature] = values
    feature_frequencies[feature] = frequencies

# Map each numeric column index to its bin boundaries (as plain lists for json)
d_bins_revised = {}
for numeric_index, bin_edges in zip(numeric_column_index, d_bins):
    d_bins_revised[numeric_index] = bin_edges.tolist()

#Convert the label encoder classes of each categorical column to plain lists
cat_col_mapping = {}
for column_index_to_encode in categorical_column_index:
    cat_col_encoding_mapping_value = categorical_columns_encoding_mapping[column_index_to_encode]
    cat_col_mapping[column_index_to_encode] = cat_col_encoding_mapping_value.tolist()


# Construct stats
data_stats = {}
data_stats["feature_columns"] = feature_cols
data_stats["categorical_columns"] = categorical_cols

#Common
data_stats["feature_values"] = feature_values
data_stats["feature_frequencies"] = feature_frequencies
data_stats["class_labels"] = class_labels
data_stats["categorical_columns_encoding_mapping"] = cat_col_mapping

#Discretizer
data_stats["d_means"] = d_means
data_stats["d_stds"] = d_stds
data_stats["d_maxs"] = d_maxs
data_stats["d_mins"] = d_mins
data_stats["d_bins"] = d_bins_revised

#Full data
data_stats["base_values"] = main_base_values
data_stats["stds"] = main_stds
data_stats["mins"] = main_mins
data_stats["maxs"] = main_maxs
data_stats["categorical_counts"] = main_cat_counts


#Convert the keys of every dictionary to strings so that the stats can be serialized to json
exp_cofig = {}
for k in data_stats:
    key_details = data_stats.get(k)
    if (key_details is not None) and (not isinstance(key_details, list)):
        new_details = {}
        for key_in_details in key_details:
            new_details[str(key_in_details)] = key_details[key_in_details]
    else:
        new_details = key_details
    exp_cofig[k] = new_details


#print(exp_cofig)

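The cell above turns dictionary keys into strings because `json.dumps` rejects NumPy integer keys. NumPy integers and arrays that appear as values can cause the same failure. The following is a minimal sketch of one way to handle both cases, not part of the original notebook; the helper name `to_json_safe` and the sample values are hypothetical and only illustrate the idea.

```python
import json
import numpy as np

def to_json_safe(obj):
    """Recursively convert NumPy types and non-string dict keys to JSON-native values."""
    if isinstance(obj, dict):
        return {str(k): to_json_safe(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_json_safe(v) for v in obj]
    if isinstance(obj, np.ndarray):
        return to_json_safe(obj.tolist())
    if isinstance(obj, np.generic):
        # NumPy scalar (np.int64, np.float64, ...) -> plain Python scalar
        return obj.item()
    return obj

# Hypothetical stats shaped like the dictionaries built above
sample_stats = {
    "mins": {np.int64(0): np.float64(18.0), np.int64(3): np.int64(250)},
    "d_bins": {2: np.array([1.0, 2.5, 4.0])},
    "feature_columns": ["Age", "LoanAmount"],
}

safe_stats = to_json_safe(sample_stats)
json_text = json.dumps(safe_stats, indent=2)
```

If serialization of the real statistics fails with a `TypeError`, applying a helper like this to `data_stats` before building the final JSON is one possible remedy.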
In [None]:
# Cell-7: Build the final json

#Fairness final configuration
fairness_config_json = {}
parameters_json = {}
feature_json = fairness_attributes
parameters_json["features"] = feature_json

parameters_json["favourable_class"] = parameters["favourable_class"]
parameters_json["unfavourable_class"] = parameters["unfavourable_class"]
parameters_json["min_records"] = min_records
parameters_json["model_type"] = model_type

fairness_config_json["parameters"] = parameters_json
fairness_config_json["distributions"] = bias_training_distribution

#Set input data schema in common configuration
common_configuration = {}
common_configuration["problem_type"] = model_type
common_configuration["label_column"] = training_data_info.get("class_label")
common_configuration["input_data_schema"] = training_schema


#Add the common, bias and explainability data distributions to the main json
d = {}
d["fairness_configuration"] = fairness_config_json
d["common_configuration"] = common_configuration
d["explainability_configuration"] = exp_cofig

json_data = json.dumps(d,indent=2)

# Write the json to a file and create a download link

with open("training_distribution.json", "w") as f:
    f.write(json_data)
print("Finished writing data to training_distribution.json")

def create_download_link(title="Download training data distribution JSON file", filename="training_distribution.json"):
    # Embed the json as a base64 data URI so that it can be downloaded from the notebook
    b64 = base64.b64encode(json_data.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/json;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)

create_download_link()
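
Before uploading the file, it can be worth checking that the generated JSON parses and contains the three top-level sections assembled in Cell-7. The sketch below is an optional check and not part of the original notebook; the function name `missing_sections` is hypothetical, and the section names are taken from the cell above.

```python
import json

# Top-level sections written by Cell-7
EXPECTED_SECTIONS = (
    "fairness_configuration",
    "common_configuration",
    "explainability_configuration",
)

def missing_sections(json_text):
    """Return the expected top-level sections that are absent from the JSON text."""
    parsed = json.loads(json_text)
    return [name for name in EXPECTED_SECTIONS if name not in parsed]

# Small stand-in document; in the notebook, pass json_data instead
example_text = '{"fairness_configuration": {}, "common_configuration": {}}'
absent = missing_sections(example_text)
```

In the notebook, `missing_sections(json_data)` returning an empty list indicates that all three sections are present.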