## Applying the bias scan tool on the College Loan Check
In this notebook, the bias scan tool is applied to the CLC case. The bias scan tool is based on an implementation of the k-means Hierarchical Bias-Aware Clustering (HBAC) method\*. The Python script `./helper_functions.py` contains the functions that execute the bias scan. A conceptual description of how the bias scan works, including the rationale for choosing k-means as the clustering algorithm and the parameter choices, can be found in the [bias scan tool report](https://github.com/NGO-Algorithm-Audit/Bias_scan/blob/master/Bias_scan_tool_report.pdf).

A classifier is used to make predictions on the CLC dataset; the bias scan is applied to these predictions. Details on the pre-processing steps performed on this dataset are provided in the `../data/CLC_dataset/CLC_preprocessing.ipynb` notebook.

\* Misztal-Radecka, Indurkhya. Bias-Aware Hierarchical Clustering for detecting the discriminated groups of users in recommendation systems. *Information Processing and Management* (2021).

### Overview of notebook:
1. Load data and pre-processing
2. Bias scan using the k-means HBAC algorithm
    - False Positive Rate (FPR) as bias metric
    - False Negative Rate (FNR) as bias metric    
3. Clustering results
4. Statistical testing of inter-cluster difference 

In [1]:
import sys  
import random
import warnings
import numpy as np
import pandas as pd
import seaborn as sns

# IPython
from IPython.display import Markdown, display

# matplotlib
import matplotlib.pyplot as plt

# helper functions
sys.path.insert(1, './../')
from helper_functions import *

warnings.filterwarnings('ignore')

### 1. Load data and pre-processing

In [8]:
# read data
path = '../../data/CLC_dataset/CLC_dataset.csv'
df = pd.read_csv(path)

# drop the redundant index column ('Unnamed: 0') read from the CSV
del df['Unnamed: 0']

# Calculating absolute errors
df['errors'] = abs(df['predicted_class'] - df['true_class'])

# Calculate FP errors
FP_condition = (df['predicted_class'] == 1) & (df['true_class'] == 0)
df['FP_errors'] = np.where(FP_condition, 1, 0)

# Calculate FN errors
FN_condition = (df['predicted_class'] == 0) & (df['true_class'] == 1)
df['FN_errors'] = np.where(FN_condition, 1, 0)

df.head()

Unnamed: 0,predicted_class,true_class,age_15-18,age_19-20,age_21-22,age_23-24,age_25-50,education_hbo,education_mbo 1-2,education_mbo 3-4,...,distance_1-2km,distance_10-20km,distance_2-5km,distance_20-50km,distance_5-10km,distance_50-500km,distance_unknown,errors,FP_errors,FN_errors
0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,0,0
1,1,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,0,0
2,0,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1
3,0,0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0
4,0,0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0
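
The error indicators above can be aggregated into rates. The following is a minimal sketch on a made-up five-row frame (the values are illustrative and not taken from the CLC dataset). Whether the helper functions use exactly these rate definitions is an assumption.

```python
import numpy as np
import pandas as pd

# Toy predictions; illustrative values only, not CLC data.
toy = pd.DataFrame({
    "predicted_class": [1, 1, 0, 0, 1],
    "true_class":      [0, 1, 1, 0, 0],
})

# Same indicator logic as in the cell above.
toy["FP_errors"] = np.where((toy["predicted_class"] == 1) & (toy["true_class"] == 0), 1, 0)
toy["FN_errors"] = np.where((toy["predicted_class"] == 0) & (toy["true_class"] == 1), 1, 0)

# False positive rate: FP / actual negatives.
# False negative rate: FN / actual positives.
fpr = toy["FP_errors"].sum() / (toy["true_class"] == 0).sum()
fnr = toy["FN_errors"].sum() / (toy["true_class"] == 1).sum()
print(f"FPR = {fpr:.2f}, FNR = {fnr:.2f}")
```

In this toy frame two of the three actual negatives are predicted positive and one of the two actual positives is predicted negative.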


#### 1a. Data initialization

In [9]:
features = df.drop(['predicted_class', 'true_class', 'errors', 'FP_errors', 'FN_errors'], axis=1)
full_data = init_dataset(df,features)
full_data.head()

Unnamed: 0,predicted_class,true_class,age_15-18,age_19-20,age_21-22,age_23-24,age_25-50,education_hbo,education_mbo 1-2,education_mbo 3-4,...,distance_2-5km,distance_20-50km,distance_5-10km,distance_50-500km,distance_unknown,errors,FP_errors,FN_errors,clusters,new_clusters
0,0,0,1.996652,-0.499196,-0.500103,-0.496754,-0.503105,-0.57925,1.734076,-0.578526,...,-0.354367,-0.351389,-0.352916,2.829654,-0.35485,0,0,0,0,-1
1,1,1,-0.500838,-0.499196,-0.500103,2.013071,-0.503105,-0.57925,-0.576676,1.728532,...,-0.354367,-0.351389,-0.352916,2.829654,-0.35485,0,0,0,0,-1
2,0,1,1.996652,-0.499196,-0.500103,-0.496754,-0.503105,-0.57925,-0.576676,-0.578526,...,-0.354367,-0.351389,-0.352916,-0.3534,-0.35485,1,0,1,0,-1
3,0,0,-0.500838,2.003222,-0.500103,-0.496754,-0.503105,-0.57925,1.734076,-0.578526,...,-0.354367,-0.351389,-0.352916,-0.3534,-0.35485,0,0,0,0,-1
4,0,0,-0.500838,2.003222,-0.500103,-0.496754,-0.503105,-0.57925,1.734076,-0.578526,...,-0.354367,-0.351389,-0.352916,-0.3534,-0.35485,0,0,0,0,-1
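
The feature values in the output above are consistent with z-scores of the one-hot columns: the two values of `age_15-18` (1.996652 and -0.500838) multiply to about -1, as expected for a standardised binary column. A minimal sketch of such an initialisation follows. It assumes the population standard deviation and the two bookkeeping columns seen above; `init_dataset_sketch` is a hypothetical name and not the actual `init_dataset` helper.

```python
import pandas as pd

def init_dataset_sketch(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Standardise the feature columns and add cluster bookkeeping columns."""
    out = df.copy()
    feats = out[feature_cols].astype(float)
    # z-score with the population standard deviation (ddof=0); assumed, not confirmed
    out[feature_cols] = (feats - feats.mean()) / feats.std(ddof=0)
    out["clusters"] = 0        # every row starts in cluster 0
    out["new_clusters"] = -1   # placeholder until a split is proposed
    return out

# Toy frame: one binary feature that is set in one of five rows.
toy = pd.DataFrame({"age_15-18": [1, 0, 0, 0, 0], "errors": [0, 1, 0, 0, 0]})
scaled = init_dataset_sketch(toy, ["age_15-18"])
print(scaled)
```

With one in five rows set, the binary column maps to 2.0 and -0.5, the same pattern of one large positive and one small negative value seen in the notebook output.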


### 2. HBAC using k-means clustering

In [11]:
clustering_paramaters = {
    "n_clusters": 2,
    "init": "k-means++",
    "n_init": 20,
    "max_iter": 300
}

Specify the two size thresholds, both as a fraction of the number of rows:
- Minimal splittable cluster size (7%)
- Minimal acceptable cluster size (5%)

In [13]:
# minimal splittable cluster size
split_cluster_size = round(0.07 * len(full_data))
print("minimal splittable cluster size: ", split_cluster_size)

# minimal acceptable cluster size
acc_cluster_size = round(0.05 * len(full_data))
print("minimal acceptable cluster size: ", acc_cluster_size)

minimal splittable cluster size:  6992
minimal acceptable cluster size:  4994
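
To make the role of the clustering parameters and the two size thresholds concrete, here is a heavily simplified sketch of a hierarchical bias-aware clustering loop. It is not the implementation in `helper_functions.py`: the names `cluster_bias` and `hbac_sketch`, the bias definition (error rate inside a cluster minus error rate in the rest), the stopping rule and the added `random_state` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_bias(errors, labels, c):
    """Error rate inside cluster c minus error rate in the rest (assumed definition)."""
    inside = labels == c
    return errors[inside].mean() - errors[~inside].mean()

def hbac_sketch(X, errors, params, min_split, min_accept, max_rounds=20):
    """Repeatedly split the most biased cluster that is large enough to split."""
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_rounds):
        ids = np.unique(labels)
        splittable = [c for c in ids if (labels == c).sum() >= min_split]
        if not splittable:
            break
        if len(ids) == 1:
            # A single cluster has no "rest" to compare with, so split it directly.
            target = splittable[0]
        else:
            target = max(splittable, key=lambda c: cluster_bias(errors, labels, c))
        rows = np.flatnonzero(labels == target)
        child = KMeans(**params).fit_predict(X[rows])
        if np.bincount(child, minlength=2).min() < min_accept:
            break  # simplification: stop at the first rejected split
        labels[rows[child == 1]] = labels.max() + 1
    return labels

# Toy data: two well-separated blobs, with errors concentrated in the first one.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
errors = np.array([1] * 10 + [0] * 30)
# Same keys as the parameter dictionary above; random_state is added for reproducibility.
params = {"n_clusters": 2, "init": "k-means++", "n_init": 20, "max_iter": 300, "random_state": 0}
labels = hbac_sketch(X, errors, params, min_split=30, min_accept=5)
print(np.unique(labels, return_counts=True))
```

In this sketch the minimal splittable size decides which clusters may still be divided, and the minimal acceptable size rejects splits that would produce a very small child cluster.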


#### 2a. FP as bias metric
Performing bias scan using helper functions.

In [14]:
# HBAC clustering
df_FP = HBAC_bias_scan(full_data, 'FP', split_cluster_size, acc_cluster_size, clustering_paramaters)
df_FP.head()

bias FP is:  0.668884892086331
done


Unnamed: 0,predicted_class,true_class,age_15-18,age_19-20,age_21-22,age_23-24,age_25-50,education_hbo,education_mbo 1-2,education_mbo 3-4,...,distance_2-5km,distance_20-50km,distance_5-10km,distance_50-500km,distance_unknown,errors,FP_errors,FN_errors,clusters,new_clusters
0,0,0,1.996652,-0.499196,-0.500103,-0.496754,-0.503105,-0.57925,1.734076,-0.578526,...,-0.354367,-0.351389,-0.352916,2.829654,-0.35485,0,0,0,1,-1
1,1,1,-0.500838,-0.499196,-0.500103,2.013071,-0.503105,-0.57925,-0.576676,1.728532,...,-0.354367,-0.351389,-0.352916,2.829654,-0.35485,0,0,0,2,-1
2,0,1,1.996652,-0.499196,-0.500103,-0.496754,-0.503105,-0.57925,-0.576676,-0.578526,...,-0.354367,-0.351389,-0.352916,-0.3534,-0.35485,1,0,1,1,-1
3,0,0,-0.500838,2.003222,-0.500103,-0.496754,-0.503105,-0.57925,1.734076,-0.578526,...,-0.354367,-0.351389,-0.352916,-0.3534,-0.35485,0,0,0,3,-1
4,0,0,-0.500838,2.003222,-0.500103,-0.496754,-0.503105,-0.57925,1.734076,-0.578526,...,-0.354367,-0.351389,-0.352916,-0.3534,-0.35485,0,0,0,3,-1


#### 2b. FN as bias metric
Performing bias scan using helper functions.

In [15]:
# HBAC clustering
df_FN = HBAC_bias_scan(full_data, 'FN', split_cluster_size, acc_cluster_size, clustering_paramaters)
df_FN.head()

bias FN is:  0.3326779946636709
done


Unnamed: 0,predicted_class,true_class,age_15-18,age_19-20,age_21-22,age_23-24,age_25-50,education_hbo,education_mbo 1-2,education_mbo 3-4,...,distance_2-5km,distance_20-50km,distance_5-10km,distance_50-500km,distance_unknown,errors,FP_errors,FN_errors,clusters,new_clusters
0,0,0,1.996652,-0.499196,-0.500103,-0.496754,-0.503105,-0.57925,1.734076,-0.578526,...,-0.354367,-0.351389,-0.352916,2.829654,-0.35485,0,0,0,1,-1.0
1,1,1,-0.500838,-0.499196,-0.500103,2.013071,-0.503105,-0.57925,-0.576676,1.728532,...,-0.354367,-0.351389,-0.352916,2.829654,-0.35485,0,0,0,2,-1.0
2,0,1,1.996652,-0.499196,-0.500103,-0.496754,-0.503105,-0.57925,-0.576676,-0.578526,...,-0.354367,-0.351389,-0.352916,-0.3534,-0.35485,1,0,1,1,-1.0
3,0,0,-0.500838,2.003222,-0.500103,-0.496754,-0.503105,-0.57925,1.734076,-0.578526,...,-0.354367,-0.351389,-0.352916,-0.3534,-0.35485,0,0,0,3,-1.0
4,0,0,-0.500838,2.003222,-0.500103,-0.496754,-0.503105,-0.57925,1.734076,-0.578526,...,-0.354367,-0.351389,-0.352916,-0.3534,-0.35485,0,0,0,3,-1.0


### 3. Analysing clustering results
#### 3a. FP bias metric
Identifying the cluster with the highest FP bias.

In [17]:
c_FP = get_max_bias_cluster(df_FP, 'FP')
max_bias_FP = round(bias_acc(df_FP, 'FP', c_FP, "clusters"), 2)
highest_biased_cluster_FP = df_FP[df_FP['clusters']==c_FP]
print(f"cluster {c_FP} has the highest bias (FP): " + str(max_bias_FP))
print("#elements in highest biased cluster:", len(highest_biased_cluster_FP))

# discriminated cluster versus the rest, features only
# (selected from df_FP so that the cluster labels are those of the FP scan)
non_feature_cols = ['predicted_class', 'true_class', 'errors', 'clusters', 'new_clusters', 'FP_errors', 'FN_errors']
discriminated_cluster_FP = df_FP[df_FP['clusters'] == c_FP].drop(columns=non_feature_cols)
not_discriminated_FP = df_FP[df_FP['clusters'] != c_FP].drop(columns=non_feature_cols)

# index of discriminated cluster
FP_idx = discriminated_cluster_FP.index.tolist()

discriminated_cluster_FP.head()

1 has bias -0.015051747107309166
2 has bias 0.0029472090106297255
3 has bias 0.0005797592605796265
6 has bias 0.011859643716081392
8 has bias 0.012355547842307524
4 has bias -0.007001239335842735
5 has bias -0.002223383535546297
0 has bias 0.005402400429591947
9 has bias -0.012480239091631429
7 has bias -0.0017377192268152042
cluster 8 has the highest bias (FP): 0.01
#elements in highest biased cluster: 9521


Unnamed: 0,age_15-18,age_19-20,age_21-22,age_23-24,age_25-50,education_hbo,education_mbo 1-2,education_mbo 3-4,education_wo,distance_0-1km,distance_0km,distance_1-2km,distance_10-20km,distance_2-5km,distance_20-50km,distance_5-10km,distance_50-500km,distance_unknown
9,-0.500838,-0.499196,-0.500103,-0.496754,1.987658,1.72637,-0.576676,-0.578526,-0.57495,-0.353543,2.828363,-0.353526,-0.354421,-0.354367,-0.351389,-0.352916,-0.3534,-0.35485
19,-0.500838,-0.499196,-0.500103,-0.496754,1.987658,1.72637,-0.576676,-0.578526,-0.57495,-0.353543,-0.353561,-0.353526,-0.354421,2.821932,-0.351389,-0.352916,-0.3534,-0.35485
30,-0.500838,-0.499196,-0.500103,-0.496754,1.987658,1.72637,-0.576676,-0.578526,-0.57495,-0.353543,-0.353561,-0.353526,-0.354421,-0.354367,2.845845,-0.352916,-0.3534,-0.35485
41,-0.500838,-0.499196,-0.500103,-0.496754,1.987658,-0.57925,-0.576676,-0.578526,1.739283,-0.353543,2.828363,-0.353526,-0.354421,-0.354367,-0.351389,-0.352916,-0.3534,-0.35485
46,-0.500838,-0.499196,-0.500103,-0.496754,1.987658,1.72637,-0.576676,-0.578526,-0.57495,-0.353543,-0.353561,-0.353526,-0.354421,-0.354367,-0.351389,-0.352916,2.829654,-0.35485
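
The `k has bias ...` lines above report one bias value per cluster. A sketch of how such values could be computed follows, assuming that bias is the mean of the error indicator inside a cluster minus the mean in the remaining rows. This definition and the name `bias_per_cluster` are assumptions; the actual `bias_acc` and `get_max_bias_cluster` helpers may differ.

```python
import pandas as pd

def bias_per_cluster(df, error_col, cluster_col="clusters"):
    """Error rate inside each cluster minus the error rate of all other rows."""
    total_sum, total_n = df[error_col].sum(), len(df)
    grouped = df.groupby(cluster_col)[error_col].agg(["sum", "count"])
    inside = grouped["sum"] / grouped["count"]
    outside = (total_sum - grouped["sum"]) / (total_n - grouped["count"])
    return (inside - outside).rename("bias")

# Toy frame with two clusters; illustrative values only.
toy = pd.DataFrame({
    "clusters":  [0, 0, 0, 0, 1, 1, 1, 1],
    "FP_errors": [1, 1, 0, 0, 0, 0, 0, 1],
})
bias = bias_per_cluster(toy, "FP_errors")
worst = bias.idxmax()  # label of the cluster with the highest bias
print(bias)
print("cluster with highest bias:", worst)
```

Under this definition a positive value means the cluster has a higher error rate than the rest of the data, and a negative value means a lower one.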


#### 3b. FN as bias metric
Identifying the cluster with the highest FN bias.

In [18]:
c_FN = get_max_bias_cluster(df_FN, 'FN')
max_bias_FN = round(bias_acc(df_FN, 'FN', c_FN, "clusters"), 2)
highest_biased_cluster_FN = df_FN[df_FN['clusters']==c_FN]
print(f"cluster {c_FN} has the highest bias (FN): " + str(max_bias_FN))
print("#elements in highest biased cluster:", len(highest_biased_cluster_FN))

# discriminated cluster versus the rest, features only
# (selected from df_FN so that the cluster labels are those of the FN scan)
non_feature_cols = ['predicted_class', 'true_class', 'errors', 'clusters', 'new_clusters', 'FP_errors', 'FN_errors']
discriminated_cluster_FN = df_FN[df_FN['clusters'] == c_FN].drop(columns=non_feature_cols)
not_discriminated_FN = df_FN[df_FN['clusters'] != c_FN].drop(columns=non_feature_cols)

# index of discriminated cluster
FN_idx = discriminated_cluster_FN.index.tolist()

discriminated_cluster_FN.head()

1 has bias 0.002836803803115795
2 has bias -0.003358667500117374
3 has bias 0.004053754528294484
6 has bias -0.008150706290392162
8 has bias -0.0013617649073709437
4 has bias -0.009944922126896794
5 has bias 0.014292084117815795
0 has bias 0.011339077275208775
9 has bias -0.0059530122533856256
7 has bias 0.010781758480279624
cluster 5 has the highest bias (FN): 0.01
#elements in highest biased cluster: 7760


Unnamed: 0,age_15-18,age_19-20,age_21-22,age_23-24,age_25-50,education_hbo,education_mbo 1-2,education_mbo 3-4,education_wo,distance_0-1km,distance_0km,distance_1-2km,distance_10-20km,distance_2-5km,distance_20-50km,distance_5-10km,distance_50-500km,distance_unknown
14,-0.500838,-0.499196,-0.500103,2.013071,-0.503105,-0.57925,-0.576676,1.728532,-0.57495,-0.353543,-0.353561,-0.353526,-0.354421,-0.354367,-0.351389,2.833536,-0.3534,-0.35485
15,1.996652,-0.499196,-0.500103,-0.496754,-0.503105,-0.57925,1.734076,-0.578526,-0.57495,-0.353543,-0.353561,-0.353526,-0.354421,-0.354367,-0.351389,2.833536,-0.3534,-0.35485
23,-0.500838,-0.499196,1.999587,-0.496754,-0.503105,-0.57925,-0.576676,1.728532,-0.57495,-0.353543,-0.353561,-0.353526,-0.354421,-0.354367,-0.351389,2.833536,-0.3534,-0.35485
35,-0.500838,-0.499196,-0.500103,2.013071,-0.503105,-0.57925,-0.576676,1.728532,-0.57495,-0.353543,-0.353561,-0.353526,-0.354421,-0.354367,-0.351389,2.833536,-0.3534,-0.35485
53,-0.500838,-0.499196,-0.500103,2.013071,-0.503105,1.72637,-0.576676,-0.578526,-0.57495,-0.353543,-0.353561,-0.353526,-0.354421,-0.354367,-0.351389,2.833536,-0.3534,-0.35485


#### Similarities in cluster indices

In [19]:
print("#elements in highest biased cluster (FP):", len(highest_biased_cluster_FP))
print("#elements in highest biased cluster (FN):", len(highest_biased_cluster_FN))
print("Similarities:", len(set(FP_idx) & set(FN_idx)))

#elements in highest biased cluster (FP): 9521
#elements in highest biased cluster (FN): 7760
Similarities: 0
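
The raw overlap count can be complemented by a normalised measure. A small sketch using the Jaccard index on two index lists follows; the lists are made up for illustration and `jaccard` is a hypothetical helper, not part of `helper_functions.py`.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up index lists, not the FP_idx / FN_idx of this notebook.
idx_a = [1, 2, 3, 4]
idx_b = [3, 4, 5, 6]
print("shared elements:", len(set(idx_a) & set(idx_b)))
print("Jaccard index:", round(jaccard(idx_a, idx_b), 3))
```

A Jaccard index of 0 corresponds to the disjoint clusters reported above, and a value of 1 would mean both scans flag exactly the same rows.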


### 4. Statistical testing of inter-cluster difference 
#### 4a. FP as bias metric
Compute, for each feature, the difference between the cluster with the highest bias and the rest of the dataset. In addition, apply Welch's two-sample t-test (which does not assume equal variances) to examine whether the difference in means is statistically significant for each feature, and return the results in a dataframe.

#### p-values
A small p-value (p < 0.05) indicates that an inter-cluster difference of this size would be unlikely if the two groups had the same mean. The results are sorted by statistical significance (p-value).
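
As a reminder of what Welch's test computes, the sketch below derives the t statistic and the Welch-Satterthwaite degrees of freedom for two small made-up samples using only the standard library. `welch_t` is a hypothetical name; how `stat_df` performs the test internally is not shown in this notebook.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (difference in means, t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard errors of the two sample means (sample variance / n).
    va, vb = variance(a) / len(a), variance(b) / len(b)
    diff = mean(a) - mean(b)
    t = diff / sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return diff, t, dof

# Made-up samples with clearly different variances.
diff, t, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10, 12])
print(f"difference = {diff:.2f}, t = {t:.3f}, dof = {dof:.2f}")
```

The p-value additionally requires the CDF of the t-distribution; in practice `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with the p-value.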

In [20]:
cluster_analysis_FP = stat_df(full_data, discriminated_cluster_FP, not_discriminated_FP)
cluster_analysis_FP

Unnamed: 0,index,difference,p-value,[0.025,0.975],errors,num
0,age_15-18,-0.55361,0.0,-0.56,-0.55,0.00639,17
1,distance_5-10km,-0.3901,0.0,-0.4,-0.38,0.0099,16
2,education_mbo 3-4,-0.63948,0.0,-0.65,-0.63,0.01052,15
3,education_mbo 1-2,-0.47969,0.0,-0.49,-0.47,0.01031,14
4,education_hbo,0.54659,0.0,0.52,0.57,0.02659,13
5,education_wo,0.57325,0.0,0.55,0.6,0.02325,12
6,age_23-24,-0.54909,0.0,-0.56,-0.54,0.01091,11
7,age_21-22,-0.55279,0.0,-0.56,-0.55,0.00721,10
8,age_19-20,-0.55179,0.0,-0.56,-0.55,0.00821,9
9,age_25-50,2.19708,0.0,2.19,2.2,0.00708,8

#### 4b. FN as bias metric


In [21]:
cluster_analysis_FN = stat_df(full_data, discriminated_cluster_FN, not_discriminated_FN)
cluster_analysis_FN

Unnamed: 0,index,difference,p-value,[0.025,0.975],errors,num
0,distance_unknown,-0.38474,0.0,-0.39,-0.38,0.00526,17
1,age_19-20,-0.34754,0.0,-0.36,-0.33,0.01246,16
2,age_21-22,-0.35537,0.0,-0.37,-0.34,0.01463,15
3,distance_5-10km,3.07221,0.0,3.07,3.08,0.00221,14
4,distance_20-50km,-0.38099,0.0,-0.39,-0.37,0.00901,13
5,distance_50-500km,-0.38317,0.0,-0.39,-0.38,0.00683,12
6,distance_0-1km,-0.38332,0.0,-0.39,-0.38,0.00668,11
7,distance_0km,-0.38334,0.0,-0.39,-0.38,0.00666,10
8,distance_1-2km,-0.3833,0.0,-0.39,-0.38,0.0067,9
9,distance_10-20km,-0.38427,0.0,-0.39,-0.38,0.00573,8


### Confidence interval plot

In [22]:
# features to plot, taken from the 'index' column of the test results
# (assumption: CI_plot looks up each name in feat_ls in that column)
feat_ls_FP = cluster_analysis_FP['index'].tolist()
CI_plot(cluster_analysis_FP, x_lim=[-2.3,3], feat_ls=feat_ls_FP)

In [23]:
# upper limit widened so that distance_5-10km (difference 3.07) stays visible
feat_ls_FN = cluster_analysis_FN['index'].tolist()
CI_plot(cluster_analysis_FN, x_lim=[-2.3,3.2], feat_ls=feat_ls_FN)
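
For readers without access to the helper, a confidence interval plot of this kind can be sketched with plain matplotlib. The function `ci_plot_sketch` and the column names `lower` and `upper` (standing in for the `[0.025` and `0.975]` columns) are assumptions for illustration and do not reproduce the actual `CI_plot` helper. The three rows are copied from the FP results table above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def ci_plot_sketch(table, x_lim):
    """Draw each feature's difference with its confidence interval as a horizontal error bar."""
    y = np.arange(len(table))
    xerr = np.vstack([
        table["difference"] - table["lower"],  # distance to the lower bound
        table["upper"] - table["difference"],  # distance to the upper bound
    ])
    fig, ax = plt.subplots(figsize=(6, 1 + 0.4 * len(table)))
    ax.errorbar(table["difference"], y, xerr=xerr, fmt="o", capsize=3)
    ax.axvline(0, color="grey", linestyle="--")  # no-difference reference line
    ax.set_yticks(y)
    ax.set_yticklabels(table["index"])
    ax.set_xlim(x_lim)
    ax.set_xlabel("difference")
    return ax

# Three rows copied from the FP results table.
demo = pd.DataFrame({
    "index": ["age_25-50", "education_hbo", "age_15-18"],
    "difference": [2.19708, 0.54659, -0.55361],
    "lower": [2.19, 0.52, -0.56],
    "upper": [2.20, 0.57, -0.55],
})
ax = ci_plot_sketch(demo, x_lim=[-2.3, 3])
```

Intervals that do not cross the dashed zero line correspond to features whose difference between the cluster and the rest is statistically significant at the chosen level.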

#### Conclusion
On average, individuals in the cluster with the highest FP bias (cluster 8, 9521 elements):
- are more often aged 25-50 and more often have `education_hbo` or `education_wo`;
- are less often aged 15-24 and less often have `education_mbo 1-2` or `education_mbo 3-4`.

This cluster has more false positives (predicted class 1, true class 0) than the rest of the dataset.

On average, individuals in the cluster with the highest FN bias (cluster 5, 7760 elements):
- more often fall in the `distance_5-10km` category and less often in the other distance categories listed above;
- are less often aged 19-20 or 21-22.

This cluster has more false negatives (predicted class 0, true class 1) than the rest of the dataset.

In both cases the measured bias is small (about 0.01).

#### What's next?
Qualitative assessment with the help of subject matter experts is needed to verify the measured quantitative disparities. Additionally, sensitivity testing would help to shed light on the robustness of the bias scan tool.