# Bias Detection & Data Explainability with Amazon SageMaker Clarify

## Introduction
Bias can be present in many datasets, so the data should be inspected before model training. For example, insights into biased data may lead to rebalancing the dataset to avoid training a biased model.

Note that bias detection can be run at the pre-training or the post-training stage. The latter needs a model configuration in order to detect bias in the model's predictions.

In addition to bias, data may contain uninformative features that add little or nothing when fed to a model. Since the objective is to clean the data for Machine Learning, it is important to detect such features. **Data Explainability** can be run during data preparation to detect them. It involves techniques like *Partial Dependence* and *SHAP*, which also need a model configuration, as they are computed on top of predictions.

### Table of Contents

- 1. Import & Analyze the dataset
  - 1.1 Import & reshape data
  - 1.2 Investigate bias with data visualization
- 2. Analyze data bias & explainability with Amazon SageMaker Clarify
  - 2.1 Configure a `DataConfig`
  - 2.2 Configure the `BiasConfig`
  - 2.3 Configure Data Explainability
    - 2.3.1 Configure the `ModelConfig`
    - 2.3.2 Configure the desired post-training analysis
  - 2.4 Configure Amazon SageMaker Clarify as a processing job
  - 2.5 Run and review the Amazon SageMaker Clarify processing job on the dataset
  - 2.6 Analyze the bias report

In [None]:
import os
import sagemaker
import logging
import boto3
import time
import pandas as pd
import json
import botocore
from botocore.exceptions import ClientError


# ========================== low-level service client of the boto3 session ==========================
config = botocore.config.Config(user_agent_extra='bedissj-1699438736259')


sm = boto3.client(service_name='sagemaker', 
                  config=config)

sm_runtime = boto3.client('sagemaker-runtime',
                          config=config)

sess = sagemaker.Session(sagemaker_client=sm,
                         sagemaker_runtime_client=sm_runtime)

bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = sess.boto_region_name


# 1. Import & Analyze the dataset
The problem to solve is detecting bank customers who are likely to churn over a two-month period.

## 1.1 Import & reshape data
First, import the dataset, then visualize it to identify potential bias in the data.



In [None]:
!aws s3 cp s3://$bucket/data/raw/BankChurners.csv "./Data Acquisition & Registry/data/BankChurners.csv"


In [None]:
local_data_path = './Data Acquisition & Registry/data/BankChurners.csv'

# Import data
data = pd.read_csv(local_data_path)

# Shorten the names of the two churn columns
data.rename(
    columns={'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1': 'churn_mon1',
             'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2': 'churn_mon2'},
    inplace=True
)

# Move churn_mon1 to the first column (label) and drop churn_mon2.
# churn_mon1 is dropped from the right-hand side so it is not duplicated.
data = pd.concat((data['churn_mon1'], data.drop(columns=['churn_mon1', 'churn_mon2'])), axis=1)
data.info()

# Target & values
target = list(data.columns)[:1]


## 1.2 Investigate bias with data visualization

This section investigates two main imbalances in the dataset through the following metrics:
1. **Class Imbalance (CI):** measures the imbalance in the number of customers between different facet values.
2. **Difference in Proportion of Labels (DPL):** measures the imbalance of positive/negative labels between different facet values.
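
As a rough illustration of what these two metrics measure, the sketch below computes them with pandas on a tiny made-up frame. It assumes the usual definitions, CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d, where the facet value of interest forms group d and all remaining rows form group a. The toy column names and the helpers `class_imbalance` and `dpl` are hypothetical; they are not part of the dataset or of the Clarify API.

```python
import pandas as pd


def class_imbalance(df, facet, value):
    """CI = (n_a - n_d) / (n_a + n_d), with group d = rows where facet == value."""
    n_d = int((df[facet] == value).sum())
    n_a = len(df) - n_d
    return (n_a - n_d) / (n_a + n_d)


def dpl(df, facet, value, label, positive=1):
    """DPL = q_a - q_d, the gap in positive-label rates between groups a and d."""
    in_d = df[facet] == value
    q_d = (df.loc[in_d, label] == positive).mean()
    q_a = (df.loc[~in_d, label] == positive).mean()
    return float(q_a - q_d)


# Toy data, made up for illustration only
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M", "M"],
    "churn":  [1,   0,   1,   1,   1,   0],
})

ci_f = class_imbalance(toy, "gender", "F")   # (4 - 2) / (4 + 2)
dpl_f = dpl(toy, "gender", "F", "churn")     # 3/4 - 1/2
print(ci_f, dpl_f)
```

Both values range from -1 to 1, and values close to 0 indicate a more balanced facet.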

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'


# Categorical features used as facets
categorical_cols = data.drop(columns=['churn_mon1']).select_dtypes(['string', 'object']).columns.to_list()

# Investigate Class Imbalance: number of rows per facet value
for cat in categorical_cols:
    plt.figure(figsize=(12, 5))
    data[cat].value_counts().plot.bar(title='CI - Feature: {}'.format(cat))


# Investigate Difference in Proportion of Labels: label counts split by facet value
for cat in categorical_cols:
    plt.figure(figsize=(12, 5))
    sns.countplot(data=data, x='churn_mon1', hue=cat)
    plt.title('DPL - Feature: {}'.format(cat))
    

# 2. Analyze data bias & explainability with Amazon SageMaker Clarify
In this section, we analyze bias in `churn_mon1` with respect to the categorical variables, which we use as facets of the dataset.

## 2.1 Configure a `DataConfig`
To run a SageMaker Clarify processing job, we need a `DataConfig` that tells Clarify where to find the raw data and where to upload the generated report. The label must be specified as well. In this notebook we are interested in customer churn, given by the `churn_mon1` column.

In [None]:
from sagemaker.clarify import DataConfig


# Configure data for pre-training bias detection & explainability
raw_data_s3_uri = 's3://{}/data/raw/BankChurners.csv'.format(bucket)
clarify_report_s3_uri = 's3://{}/data/sagemaker-clarify-report'.format(bucket)
label_name = target[0]  # DataConfig expects a single column name, not a list


data_config = DataConfig(
        s3_data_input_path=raw_data_s3_uri,
        s3_output_path=clarify_report_s3_uri,
        label=label_name,
        headers=data.columns.to_list(),
        dataset_type='text/csv', 
)


## 2.2 Configure the BiasConfig
Bias is measured across the categories of interest. These categories are configured as `facets`.  

To run bias detection, one or more facets can be included in the analysis, along with the positive outcome for the target variable, or a threshold for the positive outcome in regression problems. This is configured as `label_values_or_threshold`.

As an additional input, it is possible to limit the bias metrics to one facet value (or to a subset). This approach is not considered in this notebook.  
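
To make the idea of a positive outcome concrete, here is a minimal pandas sketch of how a list of label values or a threshold could split rows into positive and negative outcomes. The helper `positive_outcome` is hypothetical, and treating values strictly above the threshold as positive is an assumption made for illustration; check the Clarify documentation for the exact rule.

```python
import pandas as pd


def positive_outcome(labels, label_values=None, threshold=None):
    """Boolean mask of rows counted as the positive outcome (illustrative only)."""
    if label_values is not None:
        # Categorical label: positive when the label is one of the listed values
        return labels.isin(label_values)
    # Continuous label: ASSUMPTION, positive when strictly above the threshold
    return labels > threshold


# Made-up labels: one categorical series and one continuous series
toy_labels = pd.Series([0, 1, 1, 0])
toy_scores = pd.Series([0.10, 0.80, 0.55, 0.50])

mask_cat = positive_outcome(toy_labels, label_values=[1])
mask_thr = positive_outcome(toy_scores, threshold=0.5)
print(mask_cat.tolist(), mask_thr.tolist())
```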

In [None]:
from sagemaker.clarify import BiasConfig


# Configure pre-training bias detection
label_values = [1]
facets = data.drop(columns=target).select_dtypes('object').columns.to_list()


bias_config = BiasConfig(
        label_values_or_threshold=label_values, 
        facet_name=facets, 
        facet_values_or_threshold=None, 
        group_name=None
)


## 2.3 Configure Data Explainability
In addition to bias detection, computing feature importance is a key step in data preparation, as it reveals both the most useful and the uninformative features. 

Explainability with **Amazon SageMaker Clarify** involves two main techniques that require a trained model: `Partial Dependence` and `SHAP`.

In this section, we perform the following steps: 
- Train a model and register it in SageMaker, then use it to configure a `ModelConfig` (Clarify hosts it on a temporary endpoint during the job)
- Configure explainability through `Partial Dependence` and `SHAP`
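
For intuition about what SHAP attributes to each feature, the sketch below computes exact Shapley values by enumerating every feature ordering for a tiny, made-up model. This brute-force version is only practical for a handful of features. Clarify relies on an approximation (Kernel SHAP) rather than this enumeration, and `shapley_values` and `toy_model` are hypothetical names used for illustration.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x; absent features take baseline values."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)          # start with every feature "absent"
        previous = predict(current)
        for i in order:
            current[i] = x[i]             # reveal feature i
            value = predict(current)
            phi[i] += value - previous    # marginal contribution in this ordering
            previous = value
    return [p / len(orderings) for p in phi]


# Hypothetical toy model with an interaction between features 0 and 2
def toy_model(v):
    return 2.0 * v[0] + 3.0 * v[1] + v[0] * v[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is split evenly between the two features involved.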

### 2.3.1 Configure the `ModelConfig`
As a first step, we instantiate an `XGBoost` classifier with a fixed set of hyperparameters, train it, and register it as a SageMaker model. We use the built-in algorithm to keep things simple.

In [None]:
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator


# Configure instance type and distributed processing 
instance_count = 1
instance_type = 'ml.t3.medium'


# Configure model training for data explainability
xgboost_image_uri = sagemaker.image_uris.retrieve("xgboost", region, "1.7-1")
xgboost_output_path = 's3://{}/model/estimator-training-job'.format(bucket)


# Configure hyperparameters
xgboost_hyperparameters = {
        'max_depth': 7,
        'num_round': 300,
        'alpha': 0,
        'gamma': 0,
        'num_class': 2,
        'booster': 'gbtree',
        'early_stopping_rounds': 20,
        'seed': 2023,
        'verbosity': 1
}


# Instantiate XGBoost estimator
xgboost_estimator = Estimator(
        image_uri=xgboost_image_uri,
        role=role,
        instance_count=instance_count,
        instance_type=instance_type,
        output_path=xgboost_output_path,
        sagemaker_session=sess,
        hyperparameters=xgboost_hyperparameters
)


# Configure data channels & train model
data_channels = {
        'train': TrainingInput(s3_data=raw_data_s3_uri, content_type='text/csv')
}
# Block until training finishes: create_model() needs the trained model artifacts
xgboost_estimator.fit(inputs=data_channels, wait=True)


# Register the trained model in SageMaker so that Clarify can reference it by name
from datetime import datetime

model_name = "clarify-model-{}".format(datetime.now().strftime("%d-%m-%Y-%H-%M-%S"))
model = xgboost_estimator.create_model(name=model_name)
container_def = model.prepare_container_def()
sess.create_model(model_name, role, container_def)


Once the model is registered, we set up a `ModelConfig` that references it by name and specifies the compute resources Clarify uses to host it.

The model's input and output formats are given by `content_type`, and `accept_type`, respectively. 
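
As a small illustration of what a `text/csv` exchange could look like, the sketch below serializes feature rows into a headerless CSV body and parses a CSV response back into numbers using only the standard library. The helper names and the sample values are made up, and the response layout (one score per line) is an assumption about the model container rather than something this notebook confirms.

```python
import csv
import io


def to_csv_payload(rows):
    """Serialize feature rows into a headerless text/csv request body."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def parse_csv_response(body):
    """Parse a text/csv response body into a flat list of floats."""
    reader = csv.reader(io.StringIO(body))
    return [float(cell) for row in reader for cell in row if cell]


# Made-up feature rows and a made-up response body
payload = to_csv_payload([[45, 3, 12691.0], [49, 5, 8256.0]])
scores = parse_csv_response("0.12\n0.87\n")
print(repr(payload), scores)
```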

In [None]:
from sagemaker.clarify import ModelConfig


# Configure model for data explainability
model_config = ModelConfig(
        model_name=model_name,
        instance_count=instance_count,
        instance_type=instance_type,
        accept_type='text/csv',
        content_type='text/csv'
)


### 2.3.2 Configure the desired post-training analysis
Data explainability is the second step. We are interested in `PartialDependence` and `SHAP` reports. Both rely on the model trained in the previous section and are generated from the model's predictions.

We run a simple `SHAPConfig`, in addition to a `PDPConfig`, where we limit the number of desired features to `top_k_features=7`.
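
To show what a partial dependence curve represents, here is a minimal NumPy sketch that computes one by hand: a single feature is forced to each grid value for every row, and the model's predictions are averaged. The function `partial_dependence`, the toy model, and the data are hypothetical; Clarify computes its own version of this inside the processing job.

```python
import numpy as np


def partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value          # overwrite the feature for every row
        curve.append(float(predict(X_mod).mean()))
    return curve


# Hypothetical toy model: linear in two features
def toy_predict(X):
    return 2.0 * X[:, 0] + 3.0 * X[:, 1]


X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])
pd_curve = partial_dependence(toy_predict, X, feature=0, grid=[0.0, 1.0])
print(pd_curve)
```

A flat curve suggests that the model's average prediction barely reacts to the feature, which is one sign that the feature may be uninformative.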

In [None]:
from sagemaker.clarify import PDPConfig, SHAPConfig


# Configure partial dependence & SHAP analysis
shap_config = SHAPConfig()

partial_dependence_config = PDPConfig(
        features=data.drop(columns = ['churn_mon1']).columns.to_list(),
        top_k_features=7   
)


## 2.4 Configure Amazon SageMaker Clarify as a processing job

In [None]:
from sagemaker import clarify


# Configure SageMaker Clarify Processor
clarify_processor = clarify.SageMakerClarifyProcessor(
        role=role,
        instance_count=instance_count,
        instance_type=instance_type,
        sagemaker_session=sess
)


clarify_processor.run_bias_and_explainability(
        data_config=data_config,
        model_config=model_config,
        bias_config=bias_config,
        explainability_config=[shap_config, 
                               partial_dependence_config],
        pre_training_methods='all',
        post_training_methods='all',
        wait=False,
        logs=False
)


## 2.5 Run and review the Amazon SageMaker Clarify processing job on the dataset

In [None]:
# Describe & access reports
run_bias_and_explainability_job_name = clarify_processor.latest_job.job_name
print(run_bias_and_explainability_job_name)


In [None]:
# Retrieve running processor
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(
                                        processing_job_name=run_bias_and_explainability_job_name,
                                        sagemaker_session=sess)

running_processor.wait(logs=False)


## 2.6 Analyze the bias report

In [None]:
!aws s3 ls $clarify_report_s3_uri/
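
Once the job has finished, the report files can be downloaded and inspected. The sketch below flattens the bias metrics of a report dictionary into rows that are easy to tabulate. It assumes the output includes an `analysis.json` file whose bias section is keyed by facet name, with a list of metrics per facet value. That layout, the sample values, and the helper `flatten_bias_metrics` are assumptions for illustration and should be checked against the actual files.

```python
import json


def flatten_bias_metrics(analysis, section="pre_training_bias_metrics"):
    """Return (facet, facet_value, metric_name, metric_value) rows from a report dict."""
    rows = []
    facets = analysis.get(section, {}).get("facets", {})
    for facet, entries in facets.items():
        for entry in entries:
            for metric in entry.get("metrics", []):
                rows.append((facet, entry.get("value_or_threshold"),
                             metric.get("name"), metric.get("value")))
    return rows


# Made-up excerpt standing in for a downloaded analysis.json (layout is an assumption)
sample = json.loads("""
{
  "pre_training_bias_metrics": {
    "facets": {
      "Gender": [
        {
          "value_or_threshold": "F",
          "metrics": [
            {"name": "CI", "value": 0.06},
            {"name": "DPL", "value": -0.02}
          ]
        }
      ]
    }
  }
}
""")

rows = flatten_bias_metrics(sample)
for row in rows:
    print(row)
```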
