# Statistical bias

## Introduction
- A data set is said to be biased if it cannot completely and accurately reflect the underlying problem space.
- Statistical bias is the tendency of a statistic to either overestimate or underestimate a parameter.

### Some causes of statistical bias include:
- `Activity bias:` People's tendency to choose activity or inactivity regardless of whether acting is the optimal choice.
- `Societal bias:` Preconceived notions or beliefs that exist in society.
- `Unconscious bias:` The natural tendency to like or dislike.
- `Feedback loops:` Models that take user input and reuse that input to retrain the model.
- `Data drift:` The data distribution changes significantly from that of the data used to train the model. Drift takes different forms:
    1. `Covariate drift:` occurs when the distribution of the independent variables (features) changes
    2. `Prior probability drift:` occurs when the distribution of the target variable (label) changes
    3. `Concept drift:` occurs when the relationship between the features and the label changes, e.g. when the definition of the label changes depending on a particular feature such as age
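
The three forms of drift can be told apart by which distribution moves: the features, the label, or the label given the features. The sketch below illustrates this on synthetic data only; the `sample` and `summarize` helpers, the normal feature, and the threshold labelling rule are all assumptions made for illustration and are not part of the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample(n, x_mean, threshold):
    """Draw a feature x ~ N(x_mean, 1) and label y = 1 when x > threshold."""
    x = rng.normal(loc=x_mean, scale=1.0, size=n)
    y = (x > threshold).astype(int)
    return x, y

def summarize(x, y):
    """One summary statistic per distribution that can drift."""
    in_band = (x > 0) & (x < 1)
    return {
        "feature_mean": float(x.mean()),                      # tracks P(x): covariate drift
        "positive_rate": float(y.mean()),                     # tracks P(y): prior probability drift
        "positive_rate_in_band": float(y[in_band].mean()),    # tracks P(y | 0 < x < 1): concept drift
    }

baseline = summarize(*sample(10_000, x_mean=0.0, threshold=0.0))
covariate = summarize(*sample(10_000, x_mean=1.0, threshold=0.0))  # features moved, labelling rule unchanged
concept = summarize(*sample(10_000, x_mean=0.0, threshold=1.0))    # same features, labelling rule moved

for name, stats in [("baseline", baseline), ("covariate", covariate), ("concept", concept)]:
    print(name, stats)
```

In this toy setup the covariate-drift sample also shows a higher positive rate, because shifting the features changes the label distribution even though the labelling rule is the same. Only the concept-drift sample changes the positive rate within a fixed band of feature values.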

### Measuring statistical bias
There are different metrics for measuring statistical bias. Some examples include:
1. `Class imbalance (CI):` measures the difference in the number of examples provided for each facet value. <ins>*A facet is the feature for which bias is checked*</ins>, e.g. does a particular product category have a disproportionately larger number of reviews than other product categories?
2. `Difference in Proportions of Labels (DPL):` measures the difference in positive outcomes between the different facet values, e.g. does a particular product category have disproportionately higher ratings than other product categories?

[More details on measuring statistical bias](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html)
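
As a rough illustration of the two metrics above, the following sketch computes CI and DPL with pandas on a small made-up table. The formulas CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d follow the definitions in the linked AWS documentation. Treating one facet value as facet d and all remaining rows as facet a is an assumption made here, and the helper names and toy rows are hypothetical.

```python
import pandas as pd

def class_imbalance(df, facet, value):
    """CI = (n_a - n_d) / (n_a + n_d), where facet d is the rows with facet == value."""
    n_d = int((df[facet] == value).sum())
    n_a = len(df) - n_d
    return (n_a - n_d) / (n_a + n_d)

def difference_in_proportions_of_labels(df, facet, value, label, positive=1):
    """DPL = q_a - q_d, the gap in the share of positive labels between the two groups."""
    in_d = df[facet] == value
    q_d = (df.loc[in_d, label] == positive).mean()
    q_a = (df.loc[~in_d, label] == positive).mean()
    return float(q_a - q_d)

# Toy data (hypothetical): 6 reviews for Dresses, 2 for Pants
toy = pd.DataFrame({
    "product_category": ["Dresses"] * 6 + ["Pants"] * 2,
    "sentiment": [1, 1, 1, 1, 0, 0, 1, 0],
})

ci = class_imbalance(toy, "product_category", "Pants")
dpl = difference_in_proportions_of_labels(toy, "product_category", "Pants", "sentiment")
print(f"CI  = {ci:.3f}")   # (6 - 2) / 8
print(f"DPL = {dpl:.3f}")  # 4/6 - 1/2
```

Both metrics range from -1 to 1. Values near 0 indicate that the facet value is neither under-represented (CI) nor receiving a different share of positive labels (DPL).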

### Detecting statistical bias using Amazon SageMaker Studio
Amazon SageMaker Studio provides a UI for generating bias reports. [Generate Reports for Bias in Pretraining Data in SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-data-bias-reports-ui.html)

### Detecting statistical bias using Amazon SageMaker Clarify
SageMaker Clarify lets you scale bias detection out by running it as a distributed processing job. Among other things, it can:
1. Perform statistical bias detection on a training data set
2. Generate statistical bias reports for the bias detection tasks performed
3. Perform statistical bias detection on trained and deployed models
4. Provide capabilities for machine learning explainability
5. Detect drift in data and models



In [2]:
## modules
import boto3
import sagemaker
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
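sess = sagemaker.Session()
bucket = sess.default_bucket()
account_id = sess.account_id()  # account_id is a method, so it has to be called
role = sagemaker.get_execution_role()
region = boto3.Session().region_name  # region_name is a property of a Session instance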

In [8]:
#!pip install awswrangler

In [20]:
df = pd.read_csv('Reviews.csv')

In [21]:
df.head()

Unnamed: 0,sentiment,review_body,product_category
0,1,If this product was in petite i would get the...,Blouses
1,1,Love this dress! it's sooo pretty. i happene...,Dresses
2,0,I had such high hopes for this dress and reall...,Dresses
3,1,I love love love this jumpsuit. it's fun fl...,Pants
4,1,This shirt is very flattering to all due to th...,Blouses


### Upload data to S3 bucket

In [71]:
data_path_unbal = sess.upload_data(bucket = bucket
                                   , key_prefix = 'data'
                                   ,path = './Reviews.csv'
                                  )

In [74]:
print(data_path_unbal)
!aws s3 ls $data_path_unbal

s3://sagemaker-us-east-1-387004069299/data/Reviews.csv
2022-03-10 06:48:43    7239779 Reviews.csv


### Analyzing class imbalance with Amazon SageMaker Clarify

In [14]:
from sagemaker import clarify

#### Configure DataConfig

In [75]:
# define output path
bias_report_output_unbal = f's3://{bucket}/bias/report/unbalanced'

data_config_unbal = clarify.DataConfig(
                                           s3_data_input_path = data_path_unbal
                                           ,s3_output_path = bias_report_output_unbal
                                           ,label = 'sentiment'
                                           ,headers = df.columns.to_list()
                                           ,dataset_type = 'text/csv'
                                          )

#### Configure BiasConfig

In [33]:
bias_config_unbal = clarify.BiasConfig(
                                       label_values_or_threshold = [1] #desired outcome
                                       ,facet_name = 'product_category'
                                      )

#### Configure a processing job

In [76]:
processor_unbal = clarify.SageMakerClarifyProcessor(
                                             role = role # previously defined <sagemaker.get_execution_role()>
                                             ,instance_count = 1
                                             ,instance_type = 'ml.m5.large'
                                             ,sagemaker_session = sess # previously defined <sagemaker.Session()>
                                            )

#### Run the processing job

In [77]:
processor_unbal.run_pre_training_bias(
                                      data_config = data_config_unbal
                                      ,data_bias_config = bias_config_unbal
                                      ,methods = ["CI", "DPL", "KL", "JS", "LP", "TVD", "KS"]
                                      ,wait = False
                                      ,logs = False
                                     )


Job Name:  Clarify-Pretraining-Bias-2022-03-10-06-49-39-319
Inputs:  [{'InputName': 'dataset', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-387004069299/data/Reviews.csv', 'LocalPath': '/opt/ml/processing/input/data', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'analysis_config', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-387004069299/bias/report/unbalanced/analysis_config.json', 'LocalPath': '/opt/ml/processing/input/config', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'analysis_result', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-387004069299/bias/report/unbalanced', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
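
Besides CI and DPL, the `methods` list above requests KL, JS, LP, TVD and KS. These compare the label distribution of one facet group against another. The sketch below is a minimal NumPy version of those comparisons for two discrete distributions, assuming the definitions given in the AWS documentation linked earlier (KL divergence, Jensen-Shannon divergence, L2 norm of the difference, half the L1 norm, and the maximum absolute difference). It is not Clarify's implementation, and the function names and example distributions are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def label_distribution_metrics(p_a, p_d):
    """Compare the label distributions of facet a and facet d."""
    p_a, p_d = np.asarray(p_a, dtype=float), np.asarray(p_d, dtype=float)
    mid = 0.5 * (p_a + p_d)   # average distribution used by Jensen-Shannon
    diff = p_a - p_d
    return {
        "KL": kl_divergence(p_a, p_d),
        "JS": 0.5 * (kl_divergence(p_a, mid) + kl_divergence(p_d, mid)),
        "LP": float(np.linalg.norm(diff, ord=2)),   # Lp-norm with p = 2
        "TVD": float(0.5 * np.abs(diff).sum()),     # total variation distance
        "KS": float(np.abs(diff).max()),            # Kolmogorov-Smirnov style maximum gap
    }

# Hypothetical label shares [negative, positive] for two facet groups
p_a = [0.2, 0.8]
p_d = [0.4, 0.6]
metrics = label_distribution_metrics(p_a, p_d)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

All five values are 0 when the two groups have identical label distributions and grow as the distributions diverge. KL is the only one of the five that is not symmetric in its arguments.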


#### Attach to the processing job for the unbalanced data set

In [78]:
# get processing job name
processor_job_name_unbal = processor_unbal.latest_job.job_name
print(processor_job_name_unbal)

# attach to the already submitted processing job by its name
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(
                                                                           sagemaker_session = sess
                                                                           ,processing_job_name = processor_job_name_unbal
                                                                          )



Clarify-Pretraining-Bias-2022-03-10-06-49-39-319


In [79]:
%%time
# wait for the processing job to finish --> takes some minutes to run
running_processor.wait(logs=False)

..................................................................!CPU times: user 274 ms, sys: 23.3 ms, total: 298 ms
Wall time: 5min 33s
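
Conceptually, waiting on a processing job amounts to polling its status until it reaches a terminal state. The sketch below is not the SageMaker SDK's implementation; `wait_for_processing_job` and its parameters are hypothetical. It assumes a `describe` callable shaped like the boto3 SageMaker client method `describe_processing_job`, which returns a dict containing `ProcessingJobStatus`. A stand-in function is used here so the example runs without AWS access.

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Stopped"}

def wait_for_processing_job(describe, job_name, poll_seconds=30, timeout_seconds=3600):
    """Poll `describe` until the job reaches a terminal state, then return that state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe(ProcessingJobName=job_name)["ProcessingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{job_name} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)

# Stand-in for a real client call (hypothetical): reports InProgress twice, then Completed
_fake_statuses = iter(["InProgress", "InProgress", "Completed"])

def fake_describe(ProcessingJobName):
    return {"ProcessingJobName": ProcessingJobName, "ProcessingJobStatus": next(_fake_statuses)}

final_status = wait_for_processing_job(fake_describe, "example-job", poll_seconds=0)
print(final_status)

# With AWS credentials, the same helper could be pointed at the real client:
#   sm = boto3.client("sagemaker")
#   wait_for_processing_job(sm.describe_processing_job, job_name)
```

A job that ends in `Failed` or `Stopped` produces no report, so it is worth checking the returned status before listing the output path.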


In [81]:
!aws s3 ls $bias_report_output_unbal/

2022-03-10 06:55:24      31732 analysis.json
2022-03-10 06:49:40        346 analysis_config.json
2022-03-10 06:55:24     385665 report.html
2022-03-10 06:55:24     121999 report.ipynb
2022-03-10 06:55:24     139371 report.pdf


#### Download generated bias reports

In [83]:
!aws s3 cp --recursive $bias_report_output_unbal ./Bias_reports/_unbalanced/

download: s3://sagemaker-us-east-1-387004069299/bias/report/unbalanced/analysis_config.json to Bias_reports/_unbalanced/analysis_config.json
download: s3://sagemaker-us-east-1-387004069299/bias/report/unbalanced/analysis.json to Bias_reports/_unbalanced/analysis.json
download: s3://sagemaker-us-east-1-387004069299/bias/report/unbalanced/report.ipynb to Bias_reports/_unbalanced/report.ipynb
download: s3://sagemaker-us-east-1-387004069299/bias/report/unbalanced/report.html to Bias_reports/_unbalanced/report.html
download: s3://sagemaker-us-east-1-387004069299/bias/report/unbalanced/report.pdf to Bias_reports/_unbalanced/report.pdf
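
Once downloaded, `analysis.json` can be flattened into a table for inspection. The sketch below assumes the file follows the layout described in the Clarify documentation: a `pre_training_bias_metrics` object whose `facets` entry maps each facet name to a list of `value_or_threshold` / `metrics` records. That layout should be checked against the actual file. The `metrics_table` helper is hypothetical, and the inline sample uses placeholder numbers, not results from the report above.

```python
import pandas as pd

def metrics_table(analysis):
    """Flatten the pre-training bias section into one row per facet value and metric."""
    rows = []
    facets = analysis["pre_training_bias_metrics"]["facets"]
    for facet_name, entries in facets.items():
        for entry in entries:
            for metric in entry["metrics"]:
                rows.append({
                    "facet": facet_name,
                    "facet_value": entry["value_or_threshold"],
                    "metric": metric["name"],
                    "value": metric["value"],
                })
    return pd.DataFrame(rows)

# Hypothetical excerpt with placeholder values (NOT taken from the generated report)
sample_analysis = {
    "version": "1.0",
    "pre_training_bias_metrics": {
        "label": "sentiment",
        "label_value_or_threshold": "1",
        "facets": {
            "product_category": [
                {"value_or_threshold": "Dresses",
                 "metrics": [{"name": "CI", "value": 0.1}, {"name": "DPL", "value": 0.02}]},
                {"value_or_threshold": "Pants",
                 "metrics": [{"name": "CI", "value": 0.7}, {"name": "DPL", "value": -0.05}]},
            ]
        },
    },
}

table = metrics_table(sample_analysis)
print(table.pivot(index="facet_value", columns="metric", values="value"))

# For the real report, load the downloaded file instead:
#   import json
#   with open("./Bias_reports/_unbalanced/analysis.json") as f:
#       table = metrics_table(json.load(f))
```

Pivoting by facet value puts one product category per row, which makes it easier to spot categories whose CI or DPL stands out from the rest.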
