# Run Pre-Training Data Bias Analysis with SageMaker Clarify

## Using [`smclarify`](https://github.com/aws/amazon-sagemaker-clarify)

---
# Amazon Science 

## _How Clarify helps machine learning developers detect unintended bias_ 

## [Read Post](https://www.amazon.science/latest-news/how-clarify-helps-machine-learning-developers-detect-unintended-bias)

[<img src="img/amazon_science_clarify.png"  width="100%" align="left">](https://www.amazon.science/latest-news/how-clarify-helps-machine-learning-developers-detect-unintended-bias)

----

# Terminology
https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-detect-data-bias.html

### Feature
An individual measurable property or characteristic of a phenomenon being observed, contained in a column for tabular data.

### Label
The feature that is the target for training a machine learning model. Also referred to as the observed label or observed outcome.

### Predicted label
The label as predicted by the model. Also referred to as the predicted outcome.

### Sample
An observed entity described by feature values and label value, contained in a row for tabular data.

### Dataset
A collection of samples.

### Bias
An imbalance in the training data or the prediction behavior of the model across different groups, such as age or income bracket. Biases can result from the data or algorithm used to train your model. For instance, if an ML model is trained primarily on data from middle-aged individuals, it may be less accurate when making predictions involving younger and older people.

### Bias metric
A function that returns numerical values indicating the level of a potential bias.

### Bias report
A collection of bias metrics for a given dataset, or a combination of a dataset and a model.

### Positive label values
Label values that are favorable to a demographic group observed in a sample. In other words, they designate a sample as having a positive result.

### Negative label values
Label values that are unfavorable to a demographic group observed in a sample. In other words, they designate a sample as having a negative result.
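
As a minimal sketch of how positive and negative label values split a label column, the toy data below marks 4 and 5 stars as positive, mirroring the choice made later in this notebook. The `toy` frame and the `is_positive` column are illustrative names, not part of `smclarify`.

```python
import pandas as pd

# Toy ratings; the column name mirrors the dataset used later in this notebook.
toy = pd.DataFrame({"star_rating": [5, 4, 3, 1, 5, 2]})

# Assumption for this sketch: 4 and 5 stars are the positive (favorable) outcome.
positive_label_values = [5, 4]

# 1 = positive label value, 0 = negative label value
toy["is_positive"] = toy["star_rating"].isin(positive_label_values).astype(int)

# Share of samples with a positive outcome
positive_share = toy["is_positive"].mean()
```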

### Group variable
A categorical column of the dataset used to form subgroups when measuring Conditional Demographic Disparity (CDD). It is required only for this metric, which conditions on subgroups to account for Simpson’s paradox.

### Facet
A column or feature that contains the attributes with respect to which bias is measured.

### Facet value
The feature values of attributes that bias might favor or disfavor.

### Predicted probability
The probability, as predicted by the model, of a sample having a positive or negative outcome.

---

# Pretraining Bias Metrics
https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html

### Class Imbalance (CI)
Measures the imbalance in the number of members between different facet values.
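
A minimal sketch of the CI formula from the linked documentation, `CI = (n_a - n_d) / (n_a + n_d)`, where `n_d` counts the members of one facet value and `n_a` counts everyone else. The helper `class_imbalance` is a hypothetical name for illustration, and treating the chosen facet value as facet `d` is an assumption of this sketch.

```python
import pandas as pd


def class_imbalance(facet: pd.Series, facet_value) -> float:
    """CI = (n_a - n_d) / (n_a + n_d); facet d = rows equal to facet_value."""
    n_d = int((facet == facet_value).sum())
    n_a = int(len(facet) - n_d)
    return (n_a - n_d) / (n_a + n_d)


# Toy facet column: 2 gift card, 3 software, and 5 video game reviews
toy_facet = pd.Series(["Gift Card"] * 2 + ["Software"] * 3 + ["Video Games"] * 5)

ci_gift_card = class_imbalance(toy_facet, "Gift Card")      # (8 - 2) / 10
ci_video_games = class_imbalance(toy_facet, "Video Games")  # (5 - 5) / 10
```

Values range from -1 to 1. A value near 0 means the facet value is about as frequent as the rest of the data.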

### Difference in Proportions of Labels (DPL)
Measures the imbalance of positive outcomes between different facet values.
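
A minimal sketch of DPL as `q_a - q_d`, the difference between the proportions of positive labels in the two facets. The function name and the toy data are illustrative assumptions, not the `smclarify` implementation.

```python
import pandas as pd


def difference_in_proportions_of_labels(facet, label, facet_value, positive_label_values):
    """DPL = q_a - q_d, where q is the share of positive labels within a facet."""
    is_d = facet == facet_value
    is_positive = label.isin(positive_label_values)
    q_d = is_positive[is_d].mean()   # positive share inside facet d
    q_a = is_positive[~is_d].mean()  # positive share in the rest of the data
    return q_a - q_d


toy_facet = pd.Series(["A", "A", "A", "A", "B", "B", "B", "B"])
toy_label = pd.Series([5, 4, 5, 1, 5, 2, 1, 3])

# Facet A has 3 of 4 positive ratings, facet B has 1 of 4.
dpl = difference_in_proportions_of_labels(toy_facet, toy_label, "B", [5, 4])
```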

### Kullback-Leibler Divergence (KL)
Measures how much the outcome distribution of one facet diverges from that of another entropically. The measure is not symmetric.

### Jensen-Shannon Divergence (JS)
Measures how much the outcome distributions of different facets diverge from each other entropically. It is a symmetric, bounded variant of KL divergence.
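
The two divergences can be sketched with NumPy as follows, using the standard textbook definitions with natural logarithms. The function names and the toy distributions `p_a` and `p_d` are assumptions for illustration; `kl_divergence` also assumes the second distribution is non-zero wherever the first one is.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(p || q) = sum_y p(y) * log(p(y) / q(y))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(y) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def js_divergence(p, q):
    """JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy outcome distributions (positive, negative) for two facets
p_a = [0.8, 0.2]
p_d = [0.5, 0.5]

kl_a_d = kl_divergence(p_a, p_d)
kl_d_a = kl_divergence(p_d, p_a)  # differs from kl_a_d: KL is not symmetric
js_a_d = js_divergence(p_a, p_d)
js_d_a = js_divergence(p_d, p_a)  # same as js_a_d: JS is symmetric
```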

### Lp-norm (LP)
Measures a p-norm difference between distinct demographic distributions of the outcomes associated with different facets in a dataset.

### Total Variation Distance (TVD)
Measures half of the L1-norm difference between distinct demographic distributions of the outcomes associated with different facets in a dataset.
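
A minimal sketch of how LP and TVD relate: both start from the element-wise difference of the two outcome distributions, and TVD is half of the L1 case. The helper names and toy distributions are illustrative assumptions.

```python
import numpy as np


def lp_norm(p_a, p_d, p=2):
    """p-norm of the difference between two outcome distributions."""
    diff = np.asarray(p_a, dtype=float) - np.asarray(p_d, dtype=float)
    return float(np.linalg.norm(diff, ord=p))


def total_variation_distance(p_a, p_d):
    """TVD is half of the L1-norm of the difference."""
    return 0.5 * lp_norm(p_a, p_d, p=1)


# Toy outcome distributions (positive, negative) for two facets
p_a = [0.8, 0.2]
p_d = [0.5, 0.5]

l2 = lp_norm(p_a, p_d, p=2)
tvd = total_variation_distance(p_a, p_d)
```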

### Kolmogorov-Smirnov (KS)
Measures maximum divergence between outcomes in distributions for different facets in a dataset.
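
A minimal sketch, assuming the metric is the maximum over label values of the absolute difference between the two outcome distributions, as the description above suggests. The function name and the toy three-class distributions are hypothetical.

```python
import numpy as np


def ks_divergence(p_a, p_d):
    """Largest absolute gap between two outcome distributions over all label values."""
    diff = np.asarray(p_a, dtype=float) - np.asarray(p_d, dtype=float)
    return float(np.max(np.abs(diff)))


# Toy distributions over three label values for two facets
p_a = [0.7, 0.2, 0.1]
p_d = [0.3, 0.3, 0.4]

ks = ks_divergence(p_a, p_d)  # gaps are 0.4, 0.1 and 0.3; the largest one is kept
```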

### Conditional Demographic Disparity (CDD)
Measures the disparity of outcomes between different facets as a whole, but also by subgroups.
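
A minimal sketch of CDD following the linked documentation: demographic disparity (DD) for facet `d` is its share of the negative outcomes minus its share of the positive outcomes, and CDD is the average of the per-subgroup DD values weighted by subgroup size. The function names, column names, and toy data are assumptions for illustration. Returning 0 for a subgroup with only one kind of outcome is a choice made for this sketch, not documented behavior.

```python
import pandas as pd


def demographic_disparity(is_d: pd.Series, is_positive: pd.Series) -> float:
    """DD_d = (share of negative outcomes in facet d) - (share of positive outcomes in facet d)."""
    n_neg = int((~is_positive).sum())
    n_pos = int(is_positive.sum())
    if n_neg == 0 or n_pos == 0:
        # Assumption of this sketch: a one-sided subgroup contributes no disparity.
        return 0.0
    share_of_negatives = int((is_d & ~is_positive).sum()) / n_neg
    share_of_positives = int((is_d & is_positive).sum()) / n_pos
    return share_of_negatives - share_of_positives


def conditional_demographic_disparity(df, facet_col, facet_value, label_col,
                                      positive_label_values, group_col):
    """CDD = (1 / n) * sum_i n_i * DD_i over the subgroups defined by group_col."""
    total = 0.0
    for _, sub in df.groupby(group_col):
        dd = demographic_disparity(
            sub[facet_col] == facet_value,
            sub[label_col].isin(positive_label_values),
        )
        total += len(sub) * dd
    return total / len(df)


toy = pd.DataFrame({
    "group":   ["g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2"],
    "facet":   ["d", "d", "a", "a", "d", "d", "a", "a"],
    "outcome": [0, 1, 1, 0, 0, 0, 1, 1],
})

# g1 is even (DD = 0); in g2 facet d receives every negative outcome (DD = 1).
cdd = conditional_demographic_disparity(toy, "facet", "d", "outcome", [1], "group")
```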

---

In [None]:
!pip install -q smclarify==0.1

In [None]:
from smclarify.bias import report
from typing import Dict
from collections import defaultdict
import pandas as pd
import seaborn as sns

# Read Dataset From S3

In [None]:
%store -r bias_data_s3_uri

In [None]:
print(bias_data_s3_uri)

In [None]:
%store -r balanced_bias_data_s3_uri

In [None]:
print(balanced_bias_data_s3_uri)

In [None]:
!aws s3 cp $bias_data_s3_uri ./data-clarify/

In [None]:
!aws s3 cp $balanced_bias_data_s3_uri ./data-clarify/

# Analyze Unbalanced Data

In [None]:
df = pd.read_csv("./data-clarify/amazon_reviews_us_giftcards_software_videogames.csv")
df.head()

In [None]:
sns.countplot(data=df, x="star_rating", hue="product_category")

# Calculate Bias Metrics on Unbalanced Data

## Define
* Facet Column (= Product Category)
* Label Column (= Star Rating)
* Positive Label Values (= 5, 4)

In [None]:
facet_column = report.FacetColumn(name="product_category")
label_column = report.LabelColumn(name="star_rating", data=df["star_rating"], positive_label_values=[5, 4])

## Run SageMaker Clarify Bias Report

In [None]:
report.bias_report(
    df, facet_column, label_column, stage_type=report.StageType.PRE_TRAINING, group_variable=df["product_category"]
)

# Balance the data

In [None]:
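df_grouped_by = df.groupby(["product_category", "star_rating"])
# Downsample every (category, rating) group to the size of the smallest group
min_group_size = df_grouped_by.size().min()
# Flatten the index afterwards so the group keys do not shadow the columns of the same name
df_balanced = df_grouped_by.apply(lambda x: x.sample(min_group_size)).reset_index(drop=True)
df_balanced.head()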

In [None]:
import seaborn as sns

sns.countplot(data=df_balanced, x="star_rating", hue="product_category")
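
Besides the plot, the balancing step can be checked numerically. The sketch below repeats the downsampling idea on a small toy frame and confirms that every (category, rating) group ends up the same size. The toy data, the fixed `random_state`, and the use of `pd.concat` in place of `groupby.apply` are assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "product_category": ["Gift Card"] * 3 + ["Software"] * 5,
    "star_rating": [5, 5, 1, 5, 5, 5, 1, 1],
})

grouped = toy.groupby(["product_category", "star_rating"])
min_group_size = int(grouped.size().min())

# Downsample each group to the size of the smallest one
balanced = pd.concat(
    [group.sample(min_group_size, random_state=0) for _, group in grouped],
    ignore_index=True,
)

# After balancing, all group sizes should be identical
group_sizes = balanced.groupby(["product_category", "star_rating"]).size()
```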

# Calculate Bias Metrics on Balanced Data

## Define
* Facet Column (= Product Category)
* Label Column (= Star Rating)
* Positive Label Values (= 5, 4)

In [None]:
from smclarify.bias import report

facet_column = report.FacetColumn(name="product_category")
label_column = report.LabelColumn(name="star_rating", data=df_balanced["star_rating"], positive_label_values=[5, 4])

## Run SageMaker Clarify Bias Report

In [None]:
report.bias_report(
    df_balanced,
    facet_column,
    label_column,
    stage_type=report.StageType.PRE_TRAINING,
    group_variable=df_balanced["product_category"],
)

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>