## Bias scan using Multi-Dimensional Subset Scan (MDSS)

"Identifying Significant Predictive Bias in Classifiers" https://arxiv.org/abs/1611.08292

The goal of bias scan is to identify a subgroup (or subgroups) that has significantly more predictive bias than would be expected from an unbiased classifier. There are $\prod_{m=1}^{M}\left(2^{|X_{m}|}-1\right)$ unique subgroups from a dataset with $M$ features, with each feature having $|X_{m}|$ discretized values, where a subgroup is any $M$-dimensional Cartesian set product between subsets of feature-values from each feature, excluding the empty set. Scoring every subgroup is therefore exponential in the number of feature values. Bias scan mitigates this computational hurdle by approximately identifying the most statistically biased subgroup in linear time (rather than exponential).
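
To get a feel for the size of this search space, the product can be evaluated directly. The sketch below is illustrative only: the function name `count_subgroups` is hypothetical, and the cardinalities are those of the five COMPAS features used later in this notebook (2, 2, 3, 3 and 2 values).

```python
from math import prod

def count_subgroups(cardinalities):
    """Number of subgroups: product over features of (2^|X_m| - 1)."""
    return prod(2 ** k - 1 for k in cardinalities)

# Five features with 2, 2, 3, 3 and 2 discretized values.
print(count_subgroups([2, 2, 3, 3, 2]))  # 3 * 3 * 7 * 7 * 3 = 1323
```

Adding a single feature with 10 values multiplies this count by 1023, which is why exhaustive search quickly becomes impractical.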


We define the statistical measure of predictive bias, $score_{bias}(S)$, as a likelihood ratio score that is a function of a given subgroup $S$. The null hypothesis is that the given predictions' odds are correct for all subgroups in $\mathcal{D}$:

$H_{0}:odds(y_{i})=\frac{\hat{p}_{i}}{1-\hat{p}_{i}}\ \forall i\in\mathcal{D}$.

The alternative hypothesis assumes some constant multiplicative bias in the odds for some given subgroup $S$:


$H_{1}:\ odds(y_{i})=q\frac{\hat{p}_{i}}{1-\hat{p}_{i}},\ \text{where}\ q>1\ \forall i\in S\ \mbox{and}\ q=1\ \forall i\notin S.$
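
The adjusted Bernoulli probability implied by $H_{1}$ follows from converting the scaled odds back to a probability. As a short worked step, using only the symbols defined above and introducing $\hat{p}_{i}^{(q)}$ here as shorthand for the adjusted probability:

\begin{align*}
\hat{p}_{i}^{(q)}=\frac{odds(y_{i})}{1+odds(y_{i})}=\frac{q\frac{\hat{p}_{i}}{1-\hat{p}_{i}}}{1+q\frac{\hat{p}_{i}}{1-\hat{p}_{i}}}=\frac{q\hat{p}_{i}}{1-\hat{p}_{i}+q\hat{p}_{i}}.
\end{align*}

For example, with $\hat{p}_{i}=0.2$ and $q=2$, the odds go from $0.25$ to $0.5$ and the adjusted probability is $0.4/1.2=1/3$.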

In the classification setting, each observation's likelihood is Bernoulli distributed and assumed independent. This results in the following scoring function for a subgroup $S$

\begin{align*}
score_{bias}(S)= & \max_{q}\log\prod_{i\in S}\frac{Bernoulli(\frac{q\hat{p}_{i}}{1-\hat{p}_{i}+q\hat{p}_{i}})}{Bernoulli(\hat{p}_{i})}\\
= & \max_{q}\left[\log(q)\sum_{i\in S}y_{i}-\sum_{i\in S}\log(1-\hat{p}_{i}+q\hat{p}_{i})\right].
\end{align*}
Our bias scan is thus represented as $S^{*}=FSS(\mathcal{D},\mathcal{E},F_{score})=MDSS(\mathcal{D},\hat{p},score_{bias})$,

where $S^{*}$ is the detected most anomalous subgroup, $FSS$ is one of several subset scan algorithms for different problem settings, $\mathcal{D}$ is a dataset with outcomes $Y$ and discretized features $\mathcal{X}$, $\mathcal{E}$ is a set of expectations or 'normal' values for $Y$, and $F_{score}$ is an expectation-based scoring statistic that measures how anomalous a subgroup's observations are relative to their expectations.
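
As an illustration of the scoring function, the sketch below evaluates $score_{bias}(S)$ for one fixed subgroup in plain Python. It is not the aif360 implementation: the function name `score_bias`, the cap `q_max`, and the bisection search are assumptions made for this example. It relies on the fact that the derivative with respect to $q$ vanishes when the adjusted expected number of positives, $\sum_{i\in S}q\hat{p}_{i}/(1-\hat{p}_{i}+q\hat{p}_{i})$, equals the observed number $\sum_{i\in S}y_{i}$, and it restricts $q\geq1$ as in $H_{1}$.

```python
import math

def score_bias(y, p_hat, q_max=1e6, iters=200):
    """Bernoulli log-likelihood-ratio score of one subgroup, maximized over 1 <= q <= q_max.

    y     : observed 0/1 outcomes of the subgroup
    p_hat : predicted probabilities of outcome 1 for the same records
    Returns (score, q).
    """
    total_y = sum(y)

    def adjusted_expected(q):
        # Expected number of positives after multiplying every record's odds by q.
        return sum(q * p / (1.0 - p + q * p) for p in p_hat)

    if adjusted_expected(1.0) >= total_y:
        q = 1.0  # predictions are not too low for this subgroup, so the score is 0
    else:
        # adjusted_expected increases with q, so bisect on log(q).
        lo, hi = 0.0, math.log(q_max)
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if adjusted_expected(math.exp(mid)) < total_y:
                lo = mid
            else:
                hi = mid
        q = math.exp(0.5 * (lo + hi))

    score = math.log(q) * total_y - sum(math.log(1.0 - p + q * p) for p in p_hat)
    return score, q

# Made-up subgroup: three positives out of four, each predicted at 0.5.
score, q = score_bias([1, 1, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(round(q, 4), round(score, 4))  # approximately 3.0 and 0.5232
```

In this toy case the best $q$ can be checked by hand: $4q/(1+q)=3$ gives $q=3$, and the score is $3\log 3-4\log 2$.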

Predictive bias concerns whether the predictions for a subgroup are comparable to its observations. Bias scan provides a more general method that can detect and characterize such bias, or poor classifier fit, in the larger space of all possible subgroups, without a priori specification of which subgroups to examine.

In [1]:
from aif360.detectors.mdss_detector import bias_scan
from aif360.algorithms.preprocessing.optim_preproc_helpers.data_preproc_functions import load_preproc_data_compas

import numpy as np
import pandas as pd

We'll demonstrate finding the most anomalous subset with bias scan using the COMPAS dataset. We can either specify subgroups to be scored or scan for the most anomalous subgroup. Bias scan lets us choose whether to look for bias in the form of `higher` than expected probabilities or `lower` than expected probabilities.

# COMPAS Dataset

In [2]:
np.random.seed(0)

dataset_orig = load_preproc_data_compas()

The dataset has its categorical features one-hot encoded, so we'll convert them back to categorical features: scanning one-hot encoded columns may find subgroups that are not meaningful, e.g. a subgroup with 2 race values.

In [3]:
dataset_orig_df = pd.DataFrame(dataset_orig.features, columns=dataset_orig.feature_names)

age_cat = np.argmax(dataset_orig_df[['age_cat=Less than 25', 'age_cat=25 to 45', 
                                     'age_cat=Greater than 45']].values, axis=1).reshape(-1, 1)
priors_count = np.argmax(dataset_orig_df[['priors_count=0', 'priors_count=1 to 3', 
                                          'priors_count=More than 3']].values, axis=1).reshape(-1, 1)
c_charge_degree = np.argmax(dataset_orig_df[['c_charge_degree=F', 'c_charge_degree=M']].values, axis=1).reshape(-1, 1)

features = np.concatenate((dataset_orig_df[['sex', 'race']].values, age_cat, priors_count,
                           c_charge_degree, dataset_orig.labels), axis=1)
feature_names = ['sex', 'race', 'age_cat', 'priors_count', 'c_charge_degree']

In [4]:
df = pd.DataFrame(features, columns=feature_names + ['two_year_recid'])

In [5]:
df.head()

Unnamed: 0,sex,race,age_cat,priors_count,c_charge_degree,two_year_recid
0,0.0,0.0,1.0,0.0,0.0,1.0
1,0.0,0.0,0.0,2.0,0.0,1.0
2,0.0,1.0,1.0,2.0,0.0,1.0
3,1.0,1.0,1.0,0.0,1.0,0.0
4,0.0,1.0,1.0,0.0,0.0,0.0


### training
We'll train a simple classifier to predict the probability of the outcome.

In [6]:
from sklearn.linear_model import LogisticRegression
X = df.drop('two_year_recid', axis = 1)
y = df['two_year_recid']
clf = LogisticRegression(solver='lbfgs', C=1.0, penalty='l2')
clf.fit(X, y)

LogisticRegression()

Note that the probability scores we use are the probabilities of the favorable label, which is 0 in this case.

In [7]:
probs = pd.Series(clf.predict_proba(X)[:,0])

### bias scan
We can scan for a privileged and an unprivileged subset using bias scan.

In [8]:
# For the privileged subset, label 0 has to be overpredicted, i.e. overpredicted=True
# For the unprivileged subset, label 0 has to be underpredicted, i.e. overpredicted=False

privileged_subset = bias_scan(data=X, observations=y, expectations=probs, pos_label=0, overpredicted=True)
unprivileged_subset = bias_scan(data=X, observations=y, expectations=probs, pos_label=0, overpredicted=False)

In [9]:
print(privileged_subset)
print(unprivileged_subset)

({'age_cat': [1.0], 'priors_count': [0.0, 1.0, 2.0], 'sex': [1.0], 'race': [1.0], 'c_charge_degree': [0.0]}, 7.9086)
({'race': [0.0], 'age_cat': [1.0, 2.0], 'priors_count': [1.0], 'c_charge_degree': [0.0, 1.0]}, 7.0227)
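
Each result printed above is a tuple of a subgroup description and its score. The description maps a feature name to the values that define the subgroup; features that do not appear are left unconstrained. A minimal pure-Python sketch of that membership rule, using a hypothetical helper `in_subgroup` and made-up records, might look like this:

```python
def in_subgroup(record, subset):
    """True if the record takes an allowed value on every feature named in subset."""
    return all(record[feature] in values for feature, values in subset.items())

# The unprivileged COMPAS subgroup printed above.
subset = {'race': [0.0], 'age_cat': [1.0, 2.0], 'priors_count': [1.0],
          'c_charge_degree': [0.0, 1.0]}

# Hypothetical records, for illustration only.
a = {'sex': 1.0, 'race': 0.0, 'age_cat': 2.0, 'priors_count': 1.0, 'c_charge_degree': 0.0}
b = {'sex': 1.0, 'race': 1.0, 'age_cat': 2.0, 'priors_count': 1.0, 'c_charge_degree': 0.0}

print(in_subgroup(a, subset))  # True
print(in_subgroup(b, subset))  # False: race 1.0 is not in [0.0]
```

This is the same rule that the pandas expression `dff[keys].isin(subset).all(axis=1)` applies row by row in the cells that follow.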


In [10]:
dff = X.copy()
dff['observed'] = y 
dff['probabilities'] = 1 - probs

In [11]:
to_choose = dff[privileged_subset[0].keys()].isin(privileged_subset[0]).all(axis=1)
temp_df = dff.loc[to_choose]

In [12]:
"Our detected privileged group has a size of {}, we observe {} as the average risk of recidivism, but our model predicts {}"\
.format(len(temp_df), temp_df['observed'].mean(), temp_df['probabilities'].mean())

'Our detected privileged group has a size of 147, we observe 0.5374149659863946 as the average risk of recidivism, but our model predicts 0.38278159716895366'

In [13]:
to_choose = dff[unprivileged_subset[0].keys()].isin(unprivileged_subset[0]).all(axis=1)
temp_df = dff.loc[to_choose]

In [14]:
"Our detected unprivileged group has a size of {}, we observe {} as the average risk of recidivism, but our model predicts {}"\
.format(len(temp_df), temp_df['observed'].mean(), temp_df['probabilities'].mean())

'Our detected unprivileged group has a size of 732, we observe 0.3770491803278688 as the average risk of recidivism, but our model predicts 0.4447038821779929'

# Adult Dataset

In [15]:
data = pd.read_csv('https://gist.githubusercontent.com/Viktour19/b690679802c431646d36f7e2dd117b9e/raw/d8f17bf25664bd2d9fa010750b9e451c4155dd61/adult_autostrat.csv')
data.head()

Unnamed: 0,workclass,education,marital_status,occupation,relationship,race,sex,native_country,age_bin,education_num_bin,hours_per_week_bin,capital_gain_bin,capital_loss_bin,observed,expectation
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States,17-27,1-8,40-44,0,0,0,0.236226
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States,37-47,9,45-99,0,0,0,0.236226
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States,28-36,12-16,40-44,0,0,1,0.236226
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States,37-47,10-11,40-44,7298-7978,0,1,0.236226
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States,17-27,10-11,1-39,0,0,0,0.236226


Note that for the Adult dataset, the positive label is 1, so the expectations provided are the probabilities of earning >50k, i.e. label 1.

In [16]:
X = data.drop(['observed','expectation'], axis = 1)
probs = data['expectation']
y = data['observed']

In [17]:
# For privileged subset, label 1 has to be overpredicted, i.e overpredicted = True
# For unprivileged subset, label 1 has to be underpredicted, i.e overpredicted = False

privileged_subset = bias_scan(data=X,observations=y,expectations=probs,pos_label=1, overpredicted=True,penalty=50)
unprivileged_subset = bias_scan(data=X,observations=y,expectations=probs,pos_label=1, overpredicted=False,penalty=50)

In [18]:
print(privileged_subset)
print(unprivileged_subset)

({'relationship': [' Not-in-family', ' Other-relative', ' Own-child', ' Unmarried'], 'capital_gain_bin': ['0']}, 898.6027)
({'education_num_bin': ['12-16'], 'marital_status': [' Married-civ-spouse']}, 1064.5173)


In [19]:
dff = X.copy()
dff['observed'] = y 
dff['probabilities'] = probs

In [20]:
to_choose = dff[privileged_subset[0].keys()].isin(privileged_subset[0]).all(axis=1)
temp_df = dff.loc[to_choose]

"Our detected privileged group has a size of {}, we observe {} as the average probability of earning >50k, but our model predicts {}"\
.format(len(temp_df), np.round(temp_df['observed'].mean(),4), np.round(temp_df['probabilities'].mean(),4))

'Our detected privileged group has a size of 8532, we observe 0.0472 as the average probability of earning >50k, but our model predicts 0.2362'

In [21]:
to_choose = dff[unprivileged_subset[0].keys()].isin(unprivileged_subset[0]).all(axis=1)
temp_df = dff.loc[to_choose]

"Our detected unprivileged group has a size of {}, we observe {} as the average probability of earning >50k, but our model predicts {}"\
.format(len(temp_df), np.round(temp_df['observed'].mean(),4), np.round(temp_df['probabilities'].mean(),4))

'Our detected unprivileged group has a size of 2430, we observe 0.6996 as the average probability of earning >50k, but our model predicts 0.2362'

# Hospitalization Time

In [22]:
data = pd.read_csv('https://raw.githubusercontent.com/Adebayo-Oshingbesan/data/main/hospital.csv')
data = data[data['Length of Stay'] != '120 +'].fillna('Unknown')
data.shape

(29980, 22)

In [23]:
data.head()

Unnamed: 0,Health Service Area,Hospital County,Age Group,Zip Code - 3 digits,Gender,Race,Ethnicity,Type of Admission,Patient Disposition,APR MDC Code,...,APR Severity of Illness Description,APR Risk of Mortality,APR Medical Surgical Description,Payment Typology 1,Payment Typology 2,Payment Typology 3,Birth Weight,Abortion Edit Indicator,Emergency Department Indicator,Length of Stay
0,New York City,Kings,70 or Older,112,M,Black/African American,Not Span/Hispanic,Emergency,Expired,4,...,Extreme,Extreme,Medical,Medicare,Medicare,Self-Pay,0,N,Y,14
1,New York City,Queens,0 to 17,113,M,White,Spanish/Hispanic,Newborn,Home or Self Care,15,...,Minor,Minor,Medical,Medicaid,Medicaid,Unknown,3800,N,N,2
2,New York City,Kings,70 or Older,112,M,Black/African American,Not Span/Hispanic,Emergency,Skilled Nursing Home,4,...,Extreme,Extreme,Medical,Medicare,Unknown,Unknown,0,N,Y,13
3,New York City,Richmond,50 to 69,103,M,White,Not Span/Hispanic,Emergency,Skilled Nursing Home,1,...,Moderate,Minor,Medical,Medicare,Medicare,Unknown,0,N,Y,3
4,Long Island,Nassau,18 to 29,115,F,White,Spanish/Hispanic,Elective,Home or Self Care,14,...,Minor,Minor,Medical,Medicaid,Unknown,Unknown,0,N,N,3


In [24]:
X = data.drop(['Length of Stay'], axis = 1)
y = pd.to_numeric(data['Length of Stay'])

In [25]:
# Since this is not a binary classification task, there is no positive label.
# However, we know that staying shorter in the hospital is better.
# Thus, for the privileged subset, we need to find a subset such that the model's predictions are systematically higher than the observations, i.e. overpredicted = True.
# For the unprivileged subset, we need to find a subset such that the model's predictions are systematically lower than the observations, i.e. overpredicted = False.
# Also, since this is a count-like, time-based problem, we expect the observations to follow a Poisson distribution. Thus our scoring function is Poisson.
# Finally, we pass in no expectations and use the built-in model of mean predictions.

privileged_subset = bias_scan(data=X, observations=y, scoring = 'Poisson', overpredicted=True, penalty=5000)
unprivileged_subset = bias_scan(data=X, observations=y, scoring = 'Poisson', overpredicted=False, penalty=5000)

In [26]:
print(privileged_subset)
print(unprivileged_subset)

({'APR Severity of Illness Description': ['Minor']}, 1091.4343)
({'APR Severity of Illness Description': ['Extreme']}, 6230.5386)


In [27]:
dff = X.copy()
dff['observed'] = y 
dff['predicted'] = y.mean()

In [28]:
to_choose = dff[privileged_subset[0].keys()].isin(privileged_subset[0]).all(axis=1)
temp_df = dff.loc[to_choose]

"Our detected privileged group has a size of {}, we observe {} as the average number of days spent in the hospital, but our model predicts {}"\
.format(len(temp_df), np.round(temp_df['observed'].mean(),4), np.round(temp_df['predicted'].mean(),4))

'Our detected privileged group has a size of 10114, we observe 3.0771 as the average number of days spent in the hospital, but our model predicts 5.4231'
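
The reported score for this subgroup can be sanity-checked by hand. The sketch below assumes the standard expectation-based Poisson likelihood ratio, $\max_{q}\sum_{i\in S}\left[y_{i}\log(q)-(q-1)\mu_{i}\right]=C\log(C/B)+B-C$ with $C=\sum_{i\in S}y_{i}$, $B=\sum_{i\in S}\mu_{i}$ and maximizer $q=C/B$, and it assumes that the `penalty` is subtracted once for the single feature value in the subgroup. Both are assumptions about the scorer rather than statements taken from the library, and `poisson_score` is a hypothetical helper.

```python
import math

def poisson_score(observed_sum, expected_sum, overpredicted):
    """Expectation-based Poisson log-likelihood ratio, maximized over q.

    q = observed_sum / expected_sum is the unconstrained maximizer; the scan
    direction restricts it to q < 1 (overpredicted) or q > 1 (underpredicted).
    """
    C, B = observed_sum, expected_sum
    if C <= 0:
        return B if overpredicted else 0.0  # limit of the score as q -> 0
    if (overpredicted and C >= B) or (not overpredicted and C <= B):
        return 0.0  # the best q in the allowed direction is q = 1
    return C * math.log(C / B) + B - C

n = 10114           # size of the detected privileged subgroup
C = 3.0771 * n      # observed days, reconstructed from the rounded mean above
B = 5.4231 * n      # expected days under the mean model
raw = poisson_score(C, B, overpredicted=True)
print(round(raw - 5000, 1))  # assumed: penalty of 5000 for one feature value
```

Under these assumptions the result lands close to the 1091.4343 reported by the scan; a small difference is expected because the means are rounded to four decimals.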

In [29]:
to_choose = dff[unprivileged_subset[0].keys()].isin(unprivileged_subset[0]).all(axis=1)
temp_df = dff.loc[to_choose]

"Our detected unprivileged group has a size of {}, we observe {} as the average number of days spent in the hospital, but our model predicts {}"\
.format(len(temp_df), np.round(temp_df['observed'].mean(),4), np.round(temp_df['predicted'].mean(),4))

'Our detected unprivileged group has a size of 1900, we observe 15.2216 as the average number of days spent in the hospital, but our model predicts 5.4231'

To scan for the second most privileged group, we can remove the records that belong to the most privileged subset and then rescan.

In [30]:
to_choose = X[privileged_subset[0].keys()].isin(privileged_subset[0]).all(axis=1)
temp_df = X.loc[to_choose]
X_filtered = X[~X.index.isin(temp_df.index)]
y_filtered = y[~y.index.isin(temp_df.index)]
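
The remove-and-rescan step generalizes to finding the top $k$ subgroups. The loop below is a library-independent sketch: `top_k_subsets` and `toy_scan` are hypothetical names, records are plain dictionaries rather than a DataFrame, and `toy_scan` is only a stand-in with the same return shape as `bias_scan` (a subgroup description and a score).

```python
def matches(record, subset):
    """True if the record satisfies every feature constraint in subset."""
    return all(record[f] in values for f, values in subset.items())

def top_k_subsets(records, outcomes, scan, k):
    """Scan, drop the records covered by the detected subgroup, and repeat."""
    found = []
    for _ in range(k):
        if not records:
            break
        subset, score = scan(records, outcomes)
        if not subset:
            break  # an empty description would match every record
        found.append((subset, score))
        kept = [i for i, r in enumerate(records) if not matches(r, subset)]
        records = [records[i] for i in kept]
        outcomes = [outcomes[i] for i in kept]
    return found

def toy_scan(records, outcomes):
    """Stand-in scanner: the single 'severity' value with the lowest mean outcome."""
    groups = {}
    for r, y in zip(records, outcomes):
        groups.setdefault(r['severity'], []).append(y)
    best = min(groups, key=lambda v: sum(groups[v]) / len(groups[v]))
    gap = sum(outcomes) / len(outcomes) - sum(groups[best]) / len(groups[best])
    return {'severity': [best]}, gap

# Made-up data for illustration.
records = [{'severity': s} for s in
           ['Minor', 'Minor', 'Moderate', 'Moderate', 'Extreme', 'Extreme']]
outcomes = [2, 3, 5, 6, 14, 15]
for subset, score in top_k_subsets(records, outcomes, toy_scan, k=2):
    print(subset, round(score, 2))
```

Each pass recomputes the baseline on the remaining records only, which mirrors the use of `y_filtered.mean()` as the prediction after filtering.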

In [31]:
privileged_subset = bias_scan(data=X_filtered, observations=y_filtered, scoring = 'Poisson', overpredicted=True, penalty=1000)

In [None]:
print(privileged_subset)

In [None]:
dff = X_filtered.copy()
dff['observed'] = y_filtered 
dff['predicted'] = y_filtered.mean()

In [None]:
to_choose = dff[privileged_subset[0].keys()].isin(privileged_subset[0]).all(axis=1)
temp_df = dff.loc[to_choose]

"Our detected privileged group has a size of {}, we observe {} as the average number of days spent in the hospital, but our model predicts {}"\
.format(len(temp_df), np.round(temp_df['observed'].mean(),4), np.round(temp_df['predicted'].mean(),4))

In summary, this notebook explains how to use the new MDSS bias scan interface in `aif360.detectors` to scan for bias, even for tasks beyond binary classification, using the concepts of over-prediction and under-prediction.