# **Measuring Bias in Regression Tasks**


## **Base Modules**

In [4]:
#!pip install scikit-learn==0.21.3
import sklearn
import pandas as pd
import numpy as np
%matplotlib inline

## **Load Data**

If running in Colab, the next cell downloads the dataset with the !wget command. If running locally, download the zip file from https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip and unpack it in the same folder as the notebook.

In [2]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip
!unzip student.zip

--2022-07-26 12:53:45--  https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20478 (20K) [application/x-httpd-php]
Saving to: ‘student.zip’


2022-07-26 12:53:46 (128 KB/s) - ‘student.zip’ saved [20478/20478]

Archive:  student.zip
  inflating: student-mat.csv         
  inflating: student-por.csv         
  inflating: student-merge.R         
  inflating: student.txt             


In [5]:
# read dataset with pandas
data = pd.read_csv('student-mat.csv',delimiter=';')
data

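| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 390 | MS | M | 20 | U | LE3 | A | 2 | 2 | services | services | ... | 5 | 5 | 4 | 4 | 5 | 4 | 11 | 9 | 9 | 9 |
| 391 | MS | M | 17 | U | LE3 | T | 3 | 1 | services | services | ... | 2 | 4 | 5 | 3 | 4 | 2 | 3 | 14 | 16 | 16 |
| 392 | MS | M | 21 | R | GT3 | T | 1 | 1 | other | other | ... | 5 | 5 | 3 | 3 | 3 | 3 | 3 | 10 | 8 | 7 |
| 393 | MS | M | 18 | R | LE3 | T | 3 | 2 | services | other | ... | 4 | 4 | 1 | 3 | 4 | 5 | 0 | 11 | 12 | 10 |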


In [17]:
# check data column types
# data.dtypes

## **Preprocess Data and Train Model**

In the following section we will:
1. Drop the first- and second-period grades (G1 and G2), which are highly correlated with the final grade (G3)
2. Separate categorical and numerical columns
3. Create a simple preprocessing Pipeline with scikit-learn
4. Train a LightGBM (LGBM) regression model wrapped in a scikit-learn Pipeline
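
The correlation that motivates dropping G1 and G2 can be checked before the columns are removed. The sketch below is only illustrative: `grade_correlations` is a hypothetical helper, and the toy frame holds made-up values rather than rows from the dataset. On the real data one would pass the `data` frame loaded above.

```python
import pandas as pd

def grade_correlations(df, target="G3", candidates=("G1", "G2")):
    """Pearson correlation of each candidate column with the target column."""
    return df[list(candidates)].corrwith(df[target])

# Made-up values for illustration only (not taken from student-mat.csv)
toy = pd.DataFrame({
    "G1": [5, 7, 15, 6, 12, 10],
    "G2": [6, 8, 14, 10, 12, 9],
    "G3": [6, 10, 15, 10, 13, 9],
})

corr = grade_correlations(toy)
print(corr)
```

Values close to 1 suggest the earlier grades would dominate the model, which is the reason given for removing them.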


In [9]:
# drop first- and second-period grades (G1, G2): highly correlated with the final grade (G3)

data.drop(columns = ['G1','G2'],inplace=True)

In [11]:
# separate categorical and numerical data

categorical = []
numerical = []

for col in data.columns:
  # skip the target ('G3') and the protected attribute ('sex')
  if col in ('G3', 'sex'):
    continue
  if data[col].dtype == object:
    categorical.append(col)
  else:
    numerical.append(col)

print(categorical)
print(numerical)

['school', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic']
['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


Below we create a simple pipeline that one-hot encodes the categorical columns and standardizes the numerical ones.

In [12]:
# create a pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer


categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical),
        ('cat', categorical_transformer, categorical)
    ])
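
To see what such a preprocessor produces, here is a minimal sketch on a made-up two-column frame. The names `toy` and `toy_preprocessor` are illustrative and not part of the notebook. It shows the output shape and how `handle_unknown='ignore'` treats a category that was not seen during fitting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up frame with one numerical and one categorical column
toy = pd.DataFrame({
    "age": [15, 16, 17, 18],
    "school": ["GP", "GP", "MS", "MS"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[("scaler", StandardScaler())]), ["age"]),
        ("cat", Pipeline(steps=[("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["school"]),
    ])

encoded = toy_preprocessor.fit_transform(toy)
# The result can be sparse or dense depending on the data; densify for inspection
encoded = encoded.toarray() if hasattr(encoded, "toarray") else encoded
print(encoded.shape)  # one scaled column plus one column per school category

# A school value unseen during fit is encoded as all zeros instead of raising an error
unseen = toy_preprocessor.transform(pd.DataFrame({"age": [16], "school": ["XX"]}))
unseen = unseen.toarray() if hasattr(unseen, "toarray") else unseen
print(unseen)
```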

Below we split the data into training and test sets with a 70/30 ratio. We also remove the 'sex' attribute, because we do not want to use protected attributes in training.

In [13]:
# train test split 70/30

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(data, test_size=0.3,random_state=42)

df_train_X = df_train.drop(columns=['G3','sex'])
df_test_X = df_test.drop(columns=['G3','sex'])
df_train_y = df_train['G3']
df_test_y = df_test['G3']

We train an LGBM regression model, which performed best of the 5 regression models tested.

In [14]:
# fit and predict.

from lightgbm import LGBMRegressor

lgbmr = Pipeline([
     ('preprocessor', preprocessor),
     ('reg', LGBMRegressor())])

lgbmr.fit(df_train_X,df_train_y)
y_pred = lgbmr.predict(df_test_X)
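
Before turning to bias, it can be useful to check how accurate the predictions are. The sketch below is an assumption about how one might do this, not a step from the original notebook: `regression_accuracy` is a hypothetical helper, and the arrays are made-up grades. In the notebook one would pass `df_test_y` and `y_pred` instead.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def regression_accuracy(y_true, y_hat):
    """Return root mean squared error and mean absolute error of the predictions."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_hat)))
    mae = float(mean_absolute_error(y_true, y_hat))
    return {"rmse": rmse, "mae": mae}

# Made-up grades for illustration only
toy_true = np.array([10, 12, 8, 15])
toy_pred = np.array([11, 12, 6, 14])

scores = regression_accuracy(toy_true, toy_pred)
print(scores)
```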

## **Display Performance and Bias Metrics**

To check for bias, we define boolean vectors that mark group membership by sex. Here male students form the majority group (gmaj) and female students the minority group (gmin).

In [15]:
# define minority and majority vectors
gmaj = df_test['sex']=='M'
gmin = df_test['sex']=='F'
gmaj = gmaj.to_numpy()
gmin = gmin.to_numpy()
yobs = df_test_y.to_numpy()

We can now compute a range of bias metrics, displayed in the dataframe below. Each metric is shown next to its reference value, i.e. the value it would take for a fair model, where one applies.

In [20]:
import sys
sys.path.append('../../holisticai')

In [21]:
# Holistic regression bias metrics batch computation
from holisticai.bias.metrics import regression_bias_metrics

# Display function documentation
?regression_bias_metrics

In [16]:
regression_bias_metrics(gmin, gmaj, y_pred, yobs)

| Metric | Value | Reference |
|---|---|---|
| Disparate Impact Q90 | 0.751232 | 1 |
| Disparate Impact Q80 | 0.525862 | 1 |
| Disparate Impact Q50 | 0.983871 | 1 |
| No Adverse Impact Level | 15.416289 | _ |
| Average Score Spread | -0.706965 | 0 |
| Average Score Spread Q80 | 0.312707 | 0 |
| Z Score spread | -0.254212 | 0 |
| Z Score spread Q80 | 0.337663 | 0 |
| Adverse Impact AUC | 0.555399 | 0.5 |
| Concurrent Validity Spread | 0.141491 | 0 |


## **Final Remarks**

Some of the metrics indicate a strong bias against female students. For instance, the Disparate Impact Q80 is 0.53 (a value of 1 would be fair). This means female students are predicted to be in the top 20% of grades at roughly half the rate of male students, i.e. male students are about 2 times more likely to be predicted there. We would like to avoid such prediction bias, as it might be detrimental to female students in admissions or other scenarios.
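
To make the interpretation concrete, the sketch below computes two of these quantities by hand on made-up predictions. It assumes that Disparate Impact at quantile q is the ratio of the two groups' rates of being predicted above the q-quantile of all predictions, and that Average Score Spread is the difference in mean predicted score. This is a reading of the metric names, not a copy of the library's implementation, and the function names are hypothetical.

```python
import numpy as np

def disparate_impact_q(group_a, group_b, y_pred, q=0.8):
    """Ratio of the shares of group_a and group_b predicted above the q-quantile."""
    threshold = np.quantile(y_pred, q)
    rate_a = np.mean(y_pred[group_a] > threshold)
    rate_b = np.mean(y_pred[group_b] > threshold)
    return rate_a / rate_b

def average_score_spread(group_a, group_b, y_pred):
    """Difference between the mean predictions of group_a and group_b."""
    return np.mean(y_pred[group_a]) - np.mean(y_pred[group_b])

# Made-up predictions 1..20 and a made-up split into two groups of ten
toy_pred = np.arange(1, 21, dtype=float)
toy_min = np.isin(toy_pred, [1, 2, 4, 5, 7, 8, 10, 11, 13, 17])  # minority group mask
toy_maj = ~toy_min                                                # majority group mask

di = disparate_impact_q(toy_min, toy_maj, toy_pred, q=0.8)
spread = average_score_spread(toy_min, toy_maj, toy_pred)
print(di, spread)

# Reading the value reported in the table above: the reciprocal of the ratio
# says how many times more likely the majority group is to be in the top 20%
times_more_likely = 1 / 0.525862
print(round(times_more_likely, 2))
```

In the toy data one of ten minority members and three of ten majority members exceed the 80% quantile, so the ratio is one third. For the value in the table, the reciprocal is close to 1.9, which is where the "about 2 times" statement comes from.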