# Summary of Findings


### Introduction
Prediction Problem: Predicting the board_disposition of an allegation based on complainant Ethnicity, Gender, Age and Fado_type. 

These predicting variables have been choosen because they provide a good sense of the complainant and also the type of allegation. Also this gives both categorical and numerical variables. The choosen evaluation metric is accuracy because the output column has more than 2 possible outcomes(It will be engineered to have 3 outcomes) and this is a classification task.

### Baseline Model
The number of features are 4 of which 3 are nominal and 1 is quantitative. The model accuracy was roughly 48.69% which is not great but seeing as there are 3 possible outcomes, any accuracy of over 33% tells us something about the most likely outcome. The reason it is not great is because in a real life context being able to predict an outcome 48.69% is not of much use or value.

### Final Model
2 Features were engineered. Firstly, the complainant age feature was Standard Scaled and secondly, the complainant ethncity feature was Binarized to whether the ethnicity was Black or not. The standard scaling helped because each value now signified how extreme or close to the mean it was in terms of its distance from the mean in terms of Standard deviations. Binarizing the complainant ethncity should help if the theory of policing being harsher towards complainants of Black ethnicities is to be believed. Also seeing as some ethnicities have very little data, grouping them together might help the accuracy so they are treated similarly by the model.


The choosen model was once again a Decision Tree Classifier. The parameters that performed best are as follows: max depth:4, min samples leaf: 2, min samples split: 2. The final accuracy was just over 50% which is an improvement but not as much as was hoped for.
### Fairness Evaluation
The interesting subset was choosen to be in the ethicity column - in particular, whether the model was more or less accurate for complainants of Black ethnicity compared to other ethnicities.

Null Hypothesis: The model is fair, there is no significant difference in its accuracy between the two subsets.

Alternative Hypothesis: The model is not fair, there is a significant difference in its accuracy between the two subsets.(two-sided)

The permutation test failed to reject the null hypothesis.

# Code

In [173]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import Binarizer, QuantileTransformer, FunctionTransformer
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

Loading Dataset and narrowing down on only the columns that will be used for prediction of board_disposition:

In [174]:
df = pd.read_csv('allegations_202007271729-Copy1.csv')
df = df[['complainant_ethnicity', 'complainant_gender', 
    'complainant_age_incident', 'fado_type', 'board_disposition']]
df

Unnamed: 0,complainant_ethnicity,complainant_gender,complainant_age_incident,fado_type,board_disposition
0,Black,Female,38.0,Abuse of Authority,Substantiated (Command Lvl Instructions)
1,Black,Male,26.0,Discourtesy,Substantiated (Charges)
2,Black,Male,26.0,Offensive Language,Substantiated (Charges)
3,Black,Male,45.0,Abuse of Authority,Substantiated (Charges)
4,,,16.0,Force,Substantiated (Command Discipline A)
...,...,...,...,...,...
33353,Asian,Male,21.0,Discourtesy,Unsubstantiated
33354,Asian,Male,21.0,Abuse of Authority,Unsubstantiated
33355,Asian,Male,21.0,Abuse of Authority,Substantiated (Formalized Training)
33356,Asian,Male,21.0,Abuse of Authority,Substantiated (Formalized Training)


Similar to how it was done in Project 3, cleaning the board_disposition columnn to only include 3 possible outcomes - Substantiated, Unsubstantiated and Exonerated.

In [175]:
def clean_board_disposition(arg):
    return arg.apply(lambda x: 'Subsantiated' if 'Substantiated' in x else x)

In [176]:
ft = FunctionTransformer(clean_board_disposition)
ft.fit(df['board_disposition'])
df['board_disposition'] = ft.transform(df['board_disposition'])

### Baseline Model

Splitting the data into Training and Testing:

In [201]:
X = df.drop('board_disposition', axis = 1)
y = df['board_disposition']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

Building the baseline model with imputed NaN values, OneHotEncoding for categorical columns and using a Decision Tree Classifier

In [204]:
cat_cols = ['complainant_ethnicity', 'complainant_gender', 'fado_type']
cat_trans = Pipeline(steps = [('imp', SimpleImputer(strategy='constant', fill_value='NULL')),
    ('oh', OneHotEncoder())])
preproc = ColumnTransformer([('cat', cat_trans, cat_cols),
            ('num', SimpleImputer(strategy='constant', fill_value=0), ['complainant_age_incident'])])
pl = Pipeline(steps = [('prep', preproc), ('class', DecisionTreeClassifier())])
pl.fit(X_train, y_train)
pl.score(X_test, y_test)

0.48691047162270185

### Final Model

Making the final model by onehotencoding fady_type and complainant_gender, Binarizing the ethnicity column through a functionTransformer and Standardizingly Scaling the age column while imputing missing values in all columns and using a Decision Tree Classifier. The complainant ethnicity column is being binarized in a manner wherein it would be True if its element is Black and False otherwise, this is due to the general debate around policing against this ethnicity in particular. Also certain ethnicities had very little data so that could have led to innacurate classifications in the Baseline model so treating all ethnicities other than Black the same in the model might improve its accuracy.

In [211]:
cat_cols = ['fado_type', 'complainant_gender']
cat_trans = Pipeline(steps = [('imp', SimpleImputer(strategy='constant', fill_value='NULL')),
    ('oh', OneHotEncoder(handle_unknown = 'ignore'))])
ethn_trans = Pipeline(steps = [('bin', FunctionTransformer(lambda x: x == "Black"))])
age_trans = Pipeline(steps = [('num', SimpleImputer(strategy='constant', fill_value=0)), 
                        ('std', StandardScaler())])
preproc = ColumnTransformer([('cat', cat_trans, cat_cols), 
                            ('fad', ethn_trans, ['complainant_ethnicity']), 
                            ('age', age_trans, ['complainant_age_incident'])])
pl = Pipeline(steps = [('prep', preproc), ('class', DecisionTreeClassifier())])
pl.fit(X_train, y_train)
pl.score(X_test, y_test)

0.4936051159072742

The accuracy is only slightly higher than the baseline score. Standardizing the age column was probably the biggest contributer to the increasing because binarizing the ethnicity column could have had the counter-effect of reducing the information given to make a classification as the number of categories of ethnicities was reduced to 2.

Now fine-tuning the model by finding the best parameters using GridSearch:

In [212]:
parameters = {
    'class__max_depth': [2,3,4,5,7,10,13,15,18,None], 
    'class__min_samples_split':[2,3,5,7,10,15,20],
    'class__min_samples_leaf':[2,3,5,7,10,15,20]
}

In [213]:
grids = GridSearchCV(pl, param_grid = parameters, cv = 3, return_train_score = True)

In [217]:
grids.fit(X_train, y_train)
grids.best_params_

{'class__max_depth': 4,
 'class__min_samples_leaf': 2,
 'class__min_samples_split': 2}

Inputting these parameters into the same model:

In [219]:
cat_cols = ['fado_type', 'complainant_gender']
cat_trans = Pipeline(steps = [('imp', SimpleImputer(strategy='constant', fill_value='NULL')),
    ('oh', OneHotEncoder(handle_unknown = 'ignore'))])
ethn_trans = Pipeline(steps = [('bin', FunctionTransformer(lambda x: x == "Asian"))])
age_trans = Pipeline(steps = [('num', SimpleImputer(strategy='constant', fill_value=0)), 
                        ('std', StandardScaler())])
preproc = ColumnTransformer([('cat', cat_trans, cat_cols), 
                            ('fad', ethn_trans, ['complainant_ethnicity']), 
                            ('age', age_trans, ['complainant_age_incident'])])
pl = Pipeline(steps = [('prep', preproc), ('class', DecisionTreeClassifier(max_depth = 4,
                                                                min_samples_leaf = 2, 
                                                                min_samples_split = 2))])
pl.fit(X_train, y_train)
pl.score(X_test, y_test)

0.5025979216626698

This accuracy of just over 50% is an improvement from the baseline model.

### Fairness Evaluation

Does the model perform better when the complainant_ethnicity is Black vs not Black?

The test-stat choosen is the absolute difference in accuracy between the two subsets because the outcome to be predicted is not Binary so getting a True Positive rate or something similar is up to interpretation of what is considered Positive so accuracy will provide a score of how often the model correctly predicts the board_disposition regardless of what the disposition actually is.

Null Hypothesis: The model is fair, there is no significant difference in its accuracy between the two subsets.

Alternative Hypothesis: The model is not fair, there is a significant difference in its accuracy between the two subsets.(two-sided)

In [194]:
test_df = pd.DataFrame({'complainant_ethnicity': X_test['complainant_ethnicity'], 
             'board_disposition': y_test, 'prediction': pl.predict(X_test)})
black = test_df.loc[test_df['complainant_ethnicity'] == 'Black']
black_acc = (black['prediction'] == black['board_disposition']).sum()/black.shape[0]
other = test_df.loc[test_df['complainant_ethnicity'] != 'Black']
other_acc = (other['prediction'] == other['board_disposition']).sum() / other.shape[0]
test_stat = abs(other_acc - black_acc)
test_stat

0.005597014925373123

The test-stat obtained is 0.00559 approximately.

Note: the accuracy is only found for the test data set. The splitting is done in every simulation below to keep the size of the test set the same. The model is not re-fitted.

In [220]:
obs = []
for _ in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
    test_df = pd.DataFrame({'complainant_ethnicity': X_test['complainant_ethnicity'], 
             'board_disposition': y_test, 'prediction': pl.predict(X_test)})
    black = test_df.loc[test_df['complainant_ethnicity'] == 'Black']
    black_acc = (black['prediction'] == black['board_disposition']).sum()/black.shape[0]
    other = test_df.loc[test_df['complainant_ethnicity'] != 'Black']
    other_acc = (other['prediction'] == other['board_disposition']).sum() / other.shape[0]
    obs.append(abs(other_acc - black_acc))
np.count_nonzero(np.array(obs) >= test_stat) / 100

0.5

The p-value of 0.5 suggests that an outcome equally or more extreme than the test-stat is seen 50% of the time in the 100 simulations carried out which is far more than the default significance level of 5% so we fail to reject the null hypothesis and conclude the model can be called fair in regards to whether the complainaint has an ethnicity of Black or not.