# Using Clarify to Detect Pre-Training Bias

## Environment Setup

- Image: Data Science
- Kernel: Python 3
- Instance type: ml.t3.medium

## Background

This notebook uses SageMaker Clarify to detect pre-training bias in a home-loan dataset. The dataset, *loan_data.csv*, contains information about customers who applied for a home loan and whether they were approved. We also use SageMaker Experiments so that we can view the bias report directly from the Experiments UI.

The dataset was adapted from [Kaggle](https://www.kaggle.com/datasets/devzohaib/eligibility-prediction-for-loan).

## Initialize Environment and Variables

In [3]:
# What version of SageMaker are you running?
import sagemaker
print(sagemaker.__version__)

2.145.0


In [None]:
# To use the Experiments functionality in the SageMaker Python SDK, you need at least version 2.123.0
# If the version printed above is lower than that, run this line of code
# You will need to restart the kernel after the upgrade
!pip install --upgrade 'sagemaker>=2.123.0'
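
The version check above can also be done in code rather than by eye. The following is a minimal sketch, not part of the original notebook: `version_tuple` and `needs_upgrade` are hypothetical helper names, and the comparison only looks at the first three numeric components of the version string. Comparing the raw strings would be misleading, because `"2.99.0"` sorts after `"2.123.0"` alphabetically.

```python
import re

# Minimum SDK version stated in the comment above
MIN_SDK_VERSION = "2.123.0"


def version_tuple(version):
    """Turn a string such as '2.145.0' into (2, 145, 0) so it compares numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def needs_upgrade(installed, minimum=MIN_SDK_VERSION):
    """Return True when the installed version is older than the minimum."""
    return version_tuple(installed) < version_tuple(minimum)


print(needs_upgrade("2.145.0"))  # the version printed earlier in this notebook
print(needs_upgrade("2.99.0"))   # an older release, for comparison
```

In the notebook itself, you would pass `sagemaker.__version__` as the `installed` argument.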

In [5]:
# Import libraries
import boto3
import pandas as pd
import os
from time import sleep, gmtime, strftime

import sagemaker
from sagemaker import get_execution_role
from sagemaker.predictor import CSVSerializer
from sagemaker.inputs import TrainingInput

from sagemaker import clarify

from sagemaker.experiments.run import Run
from sagemaker.experiments.run import load_run

# Get the SageMaker session and the execution role from the SageMaker domain
sess = sagemaker.Session()
role = get_execution_role()

bucket = 'test-sagemaker-examples-1357942113492' # Update with the name of a bucket that is already created in S3
prefix = 'clarify-demo' # The name of the folder that will be created in the S3 bucket

---
## Data

For this lesson, the data in *loan_data.csv* has already been cleaned. We load it into a dataframe, select the feature columns and the target attribute ("approved"), then upload the local file to S3 so SageMaker can use it.

In [6]:
# Read data into a dataframe
data_path = 'loan_data.csv'

# Separate the feature columns from the target attribute ("approved")
attributes = ['loan_id', 'gender', 'married', 'dependents', 'education', 'self_employed', 'applicant_income', 'coapplicant_income', 'loan_amount', 'term', 'credit_history', 'property_area']
target_attribute = ['approved']
col_names = attributes + target_attribute

df = pd.read_csv(data_path, delimiter=',', index_col=None)
df = df[col_names]

# Print the first five rows
df.head()

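| | loan_id | gender | married | dependents | education | self_employed | applicant_income | coapplicant_income | loan_amount | term | credit_history | property_area | approved |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | 1 | No | 0 | Graduate | 0 | 5849 | 0.0 | 10 | 360 | 1 | Urban | Y |
| 1 | LP001003 | 1 | Yes | 1 | Graduate | 0 | 4583 | 1508.0 | 128 | 360 | 1 | Rural | N |
| 2 | LP001005 | 1 | Yes | 0 | Graduate | 1 | 3000 | 0.0 | 66 | 360 | 1 | Urban | Y |
| 3 | LP001006 | 1 | Yes | 0 | Not Graduate | 0 | 2583 | 2358.0 | 120 | 360 | 1 | Urban | Y |
| 4 | LP001008 | 1 | No | 0 | Graduate | 0 | 6000 | 0.0 | 141 | 360 | 1 | Urban | Y |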
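
Selecting columns with `df[col_names]` raises a `KeyError` if any expected column is missing from the file. As a hedged sketch (not part of the original notebook), the check below reads only the CSV header and reports which expected columns are absent. `missing_columns` is a hypothetical helper, and the sample text is a shortened stand-in for *loan_data.csv*.

```python
import csv
import io


def missing_columns(csv_text, expected):
    """Return the expected column names that do not appear in the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


# Shortened stand-in for the real file: the 'approved' column is deliberately left out
sample = "loan_id,gender,married\nLP001002,1,No\n"

print(missing_columns(sample, ["loan_id", "gender", "approved"]))  # ['approved']
```

An empty list means every expected column is present and the selection will succeed.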


In [7]:
# Upload the file to the S3 bucket defined above
# S3 keys always use "/" as the separator, so the key is built directly rather than with os.path.join
boto3.Session().resource('s3').Bucket(bucket).Object(f'{prefix}/loan_data.csv').upload_file(data_path)
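
The notebook builds several S3 locations from `bucket` and `prefix`. A small sketch of one way to keep those consistent is shown below. `s3_key` and `s3_uri` are hypothetical helpers, not SageMaker or boto3 functions, and `example-bucket` is a placeholder name. The sketch uses `posixpath` so the separator is always `/`, regardless of the operating system.

```python
import posixpath


def s3_key(*parts):
    """Join key segments with '/', ignoring empty segments and stray slashes."""
    cleaned = [part.strip("/") for part in parts if part]
    return posixpath.join(*cleaned)


def s3_uri(bucket, *parts):
    """Build an s3:// URI from a bucket name and one or more key segments."""
    return f"s3://{bucket}/{s3_key(*parts)}"


# Placeholder bucket name; the prefix matches the one used in this notebook
print(s3_key("clarify-demo", "loan_data.csv"))
print(s3_uri("example-bucket", "clarify-demo", "clarify-bias"))
```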

## Clarify and Experiments

In this section, we use Clarify to detect bias in our dataset. We start by defining a processor for the job, then set its configuration parameters. We run the pre-training bias job inside an Experiments `Run` context so that the job is associated with our experiment.
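
To make "pre-training bias" concrete, here is a minimal sketch of two of the metrics Clarify reports: Class Imbalance (CI) and Difference in Proportions of Labels (DPL). It is a hand-rolled illustration, not Clarify's implementation. The rows are invented toy data rather than values from *loan_data.csv*, and the function names are hypothetical. CI is `(n_a - n_d) / (n_a + n_d)` and DPL is `q_a - q_d`, where `a` is the advantaged group, `d` is the disadvantaged group, and `q` is the share of positive labels in a group.

```python
def split_by_facet(rows, facet, disadvantaged_values):
    """Split rows into (advantaged, disadvantaged) groups by a facet column."""
    disadvantaged = [row for row in rows if row[facet] in disadvantaged_values]
    advantaged = [row for row in rows if row[facet] not in disadvantaged_values]
    return advantaged, disadvantaged


def class_imbalance(rows, facet, disadvantaged_values):
    """CI = (n_a - n_d) / (n_a + n_d); assumes rows is not empty."""
    advantaged, disadvantaged = split_by_facet(rows, facet, disadvantaged_values)
    n_a, n_d = len(advantaged), len(disadvantaged)
    return (n_a - n_d) / (n_a + n_d)


def label_proportion_difference(rows, facet, disadvantaged_values, label, positive_values):
    """DPL = q_a - q_d; assumes both groups contain at least one row."""
    advantaged, disadvantaged = split_by_facet(rows, facet, disadvantaged_values)
    q_a = sum(row[label] in positive_values for row in advantaged) / len(advantaged)
    q_d = sum(row[label] in positive_values for row in disadvantaged) / len(disadvantaged)
    return q_a - q_d


# Invented toy rows: four applicants with gender 1, two with gender 0
toy_rows = [
    {"gender": 1, "approved": "Y"},
    {"gender": 1, "approved": "Y"},
    {"gender": 1, "approved": "N"},
    {"gender": 1, "approved": "Y"},
    {"gender": 0, "approved": "N"},
    {"gender": 0, "approved": "Y"},
]

ci = class_imbalance(toy_rows, "gender", [0])
dpl = label_proportion_difference(toy_rows, "gender", [0], "approved", ["Y"])
print(round(ci, 4), round(dpl, 4))
```

Both metrics are zero for a perfectly balanced dataset. Positive values indicate that the disadvantaged group is under-represented (CI) or receives the positive label less often (DPL).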

In [8]:
# Define the processor for the job
clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge', 
    sagemaker_session=sess,
    job_name_prefix='clarify-pre-training-bias-detection-job'
)

# Specify the path where the bias report will be saved once complete
bias_report_output_path = 's3://{}/{}/clarify-bias'.format(bucket, prefix)

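# Specify the S3 path to our input data
# Point at the uploaded file itself: the report is written under the same prefix, so using the
# bare prefix as input could pull earlier report files into the dataset when the job is re-run
s3_data_input_path = 's3://{}/{}/loan_data.csv'.format(bucket, prefix)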

# Specify inputs, outputs, columns and target names
bias_data_config = clarify.DataConfig(
    s3_data_input_path=s3_data_input_path,
    s3_output_path=bias_report_output_path,
    label='approved',
    headers=df.columns.to_list(),
    dataset_type='text/csv',
)

# Specify the configuration of the bias detection job
# For facet_name, we list the two sensitive features we want to check for bias: gender and self_employed
# For facet_values_or_threshold, we give one list per facet, in the same order, holding the values of the
# potentially disadvantaged group (gender 0 = female; self_employed 1 = self-employed)
bias_config = clarify.BiasConfig(
    label_values_or_threshold=['Y'], # The value that indicates someone received a home loan
    facet_name=['gender', 'self_employed'],
    facet_values_or_threshold=[[0], [1]]
)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: 1.0.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
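
The two lists passed to `BiasConfig` are matched by position: the first entry of `facet_values_or_threshold` belongs to the first entry of `facet_name`, and so on. The sketch below illustrates that pairing and produces the same structure as the `'facet'` entry in the Analysis Config log printed by the job. `build_facets` is a hypothetical helper written for illustration. It is not the SDK's internal code.

```python
def build_facets(facet_names, facet_values):
    """Pair each facet name with its list of disadvantaged values, by position."""
    if len(facet_names) != len(facet_values):
        raise ValueError("facet_names and facet_values must have the same length")
    return [
        {"name_or_index": name, "value_or_threshold": values}
        for name, values in zip(facet_names, facet_values)
    ]


facets = build_facets(["gender", "self_employed"], [[0], [1]])
for facet in facets:
    print(facet)
```

Swapping the order of one list without the other would silently test the wrong group, which is why the length check alone is not a full safeguard.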


In [9]:
# Create an experiment and start a new run
experiment_name = 'loan-approval-experiment'
run_name = 'pre-training-bias'

# Run the bias detection job, associating it with our Experiment
with Run(
    experiment_name=experiment_name,
    run_name=run_name,
    sagemaker_session=sess,
) as run:
    clarify_processor.run_pre_training_bias(
        data_config=bias_data_config,
        data_bias_config=bias_config,
        logs=False,
    )

INFO:sagemaker.clarify:Analysis Config: {'dataset_type': 'text/csv', 'headers': ['loan_id', 'gender', 'married', 'dependents', 'education', 'self_employed', 'applicant_income', 'coapplicant_income', 'loan_amount', 'term', 'credit_history', 'property_area', 'approved'], 'label': 'approved', 'label_values_or_threshold': ['Y'], 'facet': [{'name_or_index': 'gender', 'value_or_threshold': [0]}, {'name_or_index': 'self_employed', 'value_or_threshold': [1]}], 'methods': {'report': {'name': 'report', 'title': 'Analysis Report'}, 'pre_training_bias': {'methods': 'all'}}}
INFO:sagemaker:Creating processing-job with name clarify-pre-training-bias-detection-job-2023-05-15-06-20-04-920


.............................................................!
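
Once the job finishes, Clarify writes its results to the output path configured earlier, including a machine-readable `analysis.json` alongside the report. The sketch below shows one way to pull metric values out of that file after downloading it. The JSON layout is an assumption based on the general shape of Clarify analysis results, so check it against your own output. The metric values are placeholders, not results from this job, and `metrics_for_facet` is a hypothetical helper.

```python
import json

# Assumed layout of analysis.json; the values are placeholders, not real results
SAMPLE_ANALYSIS = json.loads("""
{
  "pre_training_bias_metrics": {
    "label": "approved",
    "label_value_or_threshold": "Y",
    "facets": {
      "gender": [
        {
          "value_or_threshold": "0",
          "metrics": [
            {"name": "CI", "value": 0.5},
            {"name": "DPL", "value": 0.1}
          ]
        }
      ]
    }
  }
}
""")


def metrics_for_facet(analysis, facet_name):
    """Map each facet value to a {metric name: metric value} dictionary."""
    groups = analysis["pre_training_bias_metrics"]["facets"][facet_name]
    return {
        group["value_or_threshold"]: {m["name"]: m["value"] for m in group["metrics"]}
        for group in groups
    }


print(metrics_for_facet(SAMPLE_ANALYSIS, "gender"))
```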

## Cleaning Up Experiments

In this section, we delete our experiment (this cannot currently be done through the SageMaker UI).

In [10]:
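# _Experiment is a private class in the SageMaker Python SDK, so its interface may change between releases
from sagemaker.experiments.experiment import _Experiment

# Load the experiment, then delete it together with all of its runs
exp = _Experiment.load(experiment_name=experiment_name, sagemaker_session=sess)
exp._delete_all(action="--force")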