# Criminal Recidivism

For this assignment we'll be exploring a very topical application in the world of machine learning: predicting whether someone is going to commit a repeat offence (aka criminal recidivism). In the United States, a number of states use a tool called COMPAS to predict whether somebody will commit a repeat offence. Its output is often used to inform bail, sentencing, or parole decisions.

ProPublica, an awesome data-oriented news organization, did a really fascinating deep dive into COMPAS and criminal recidivism data and found that COMPAS was unfairly discriminating against certain offenders on the basis of race. In today's assignment we're going to create our own classifier to see if we can predict criminal recidivism, and see whether our classifiers suffer from the same issue. Your goal for this project is two-fold:

1. Create a classifier to predict whether or not somebody will be a repeat offender using the `compas.csv` data located in the data folder. In this case, you're trying to predict the variable 'two_year_recid'.
2. Using some tools we learned today, evaluate your model and its accuracy, both overall and separately for different races. Do you notice any unfairness?

You'll have to do some cleaning on the dataset (e.g. first and last name aren't going to be super helpful). This is a real dataset (the same one ProPublica used), so it's messy! Don't worry if you can't get every column cleaned up and useful; you can always build a simple model off of a subset of the data.
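
As a rough illustration of that kind of cleaning, here is a minimal sketch that drops identifier columns and one-hot encodes categorical ones. The toy frame and its column names (`first`, `last`, `race`, `sex`, `priors_count`) are assumptions made for the example; check them against the actual columns in `compas.csv` before reusing anything.

```python
import pandas as pd

# Toy stand-in for the raw file. Column names are assumed for illustration only.
raw = pd.DataFrame({
    "first": ["ann", "bob", "cy", "dee"],
    "last": ["a", "b", "c", "d"],
    "age": [23, 51, 34, 19],
    "race": ["African-American", "Caucasian", "Hispanic", "Other"],
    "sex": ["Female", "Male", "Male", "Female"],
    "priors_count": [0, 4, 1, 2],
    "two_year_recid": [0, 1, 0, 1],
})


def clean(frame):
    """Drop identifier columns and one-hot encode the categorical ones."""
    out = frame.drop(columns=["first", "last"])  # names carry no predictive signal
    out = pd.get_dummies(out, columns=["race", "sex"], dtype=int)
    return out


cleaned = clean(raw)
print(cleaned.columns.tolist())
```

Starting from a small subset of columns like this and adding more as you clean them is usually easier than trying to fix everything at once.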

Fair machine learning is a hot-button issue in the data science community. If this sounds interesting to you, here are some great resources to check out:

1. [ProPublica Article on Machine Bias](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing)
2. [A Tutorial on Fairness in Machine Learning](https://towardsdatascience.com/a-tutorial-on-fairness-in-machine-learning-3ff8ba1040cb)
3. Talk to your TAs!

In [195]:
from IPython.display import Image
import pandas as pd
import numpy as np
from sklearn import datasets

import seaborn as sns
sns.set(rc={'figure.figsize':(6,6)}) #Set our seaborn aesthetics (we're going to customize our figure size)

import warnings
warnings.simplefilter("ignore")

In [196]:
df = pd.read_csv("compas_clean.csv")
df.head()

Unnamed: 0,Two_yr_Recidivism,Number_of_Priors,score_factor,Age_Above_FourtyFive,Age_Below_TwentyFive,African_American,Asian,Hispanic,Native_American,Other,Female,Misdemeanor
0,0,0,0,1,0,0,0,0,0,1,0,0
1,1,0,0,0,0,1,0,0,0,0,0,0
2,1,4,0,0,1,1,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,1,0,1
4,1,14,1,0,0,0,0,0,0,0,0,0


In [197]:
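# Share of each class (normalize=True returns proportions instead of counts)
df["Two_yr_Recidivism"].value_counts(normalize=True)
# The two classes are fairly balanced (roughly 54% / 46%)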

0    0.54488
1    0.45512
Name: Two_yr_Recidivism, dtype: float64
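
This class balance also gives a yardstick for any model: always predicting the majority class would already be right about 54% of the time. The sketch below shows one way to compute such a baseline with scikit-learn's `DummyClassifier`. The labels are synthetic and only mimic the proportions above; in the notebook you would pass `df["Two_yr_Recidivism"]` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with roughly the class balance shown above (an assumption
# for illustration, not the real data).
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.455).astype(int)
X = np.zeros((len(y), 1))  # the baseline ignores the features entirely

# Always predict whichever class is most common in the training labels.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = accuracy_score(y, baseline.predict(X))
print(f"majority-class accuracy: {baseline_accuracy:.3f}")
```

A classifier is only useful to the extent that it beats this number.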

In [198]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import confusion_matrix

In [203]:
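import matplotlib.pyplot as plt

def roc_curve(true_classes, predictions, prediction_probabilities):
    """Plot the ROC curve for the positive class."""
    # Column 1 of predict_proba holds the probability of the positive class.
    positive_probs = prediction_probabilities[:, 1]
    fpr, tpr, _ = metrics.roc_curve(true_classes, positive_probs)
    # Score the probabilities, not the hard labels, so the reported area matches
    # the plotted curve. `predictions` is kept so existing calls still work.
    roc_auc = metrics.roc_auc_score(true_classes, positive_probs)

    plt.fill_between(fpr, tpr, step='post', alpha=0.2, color='b')
    sns.lineplot(x=fpr, y=tpr, linestyle='--', label='ROC Curve (area = %0.2f)' % roc_auc)

    plt.xlabel('FPR')
    plt.ylabel('TPR (recall)')
    plt.title('ROC Curve')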
    
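def func(cols=None):
    """Fit a logistic regression on df without `cols` and print test-set metrics."""
    cols = [] if cols is None else cols  # avoid a mutable default argument
    # `cols` must include 'Two_yr_Recidivism', otherwise the target leaks into the features.
    X = df.drop(columns=cols)
    y = df['Two_yr_Recidivism']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

    model = LogisticRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    prediction_probs = model.predict_proba(X_test)

    # Test rows alongside the true label, predicted label and class probabilities
    # (built for inspection only; it is not returned or printed).
    prediction_probs_df = pd.DataFrame(prediction_probs).round(2)
    results = X_test.reset_index().copy()
    results["Two_yr_Recidivism"] = y_test.tolist()
    results["prediction"] = predictions
    results = pd.concat([results, prediction_probs_df], axis=1)

    true_classes = y_test
    print("confusion matrix: \n", confusion_matrix(true_classes, predictions))
    print("accuracy:  ", metrics.accuracy_score(true_classes, predictions))
    print("precision: ", metrics.precision_score(true_classes, predictions))
    print("f1:             ", metrics.f1_score(true_classes, predictions))
    # Note: scored on hard 0/1 predictions, this "auc" equals balanced accuracy
    # (the mean of TPR and TNR); pass prediction_probs[:, 1] for the usual ROC AUC.
    print("auc:          ", metrics.roc_auc_score(true_classes, predictions))

    print()
    #roc_curve(y_test, predictions, prediction_probs)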
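
The `auc` line in `func` passes hard 0/1 predictions to `roc_auc_score`. With only two distinct scores the ROC curve has a single interior point, so the area reduces to the average of the true positive rate and the true negative rate, i.e. balanced accuracy. The sketch below checks this on synthetic data from `make_classification` (an assumption for illustration, not the COMPAS data) and computes the probability-based AUC for comparison.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem, used only to illustrate the metrics.
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

model = LogisticRegression().fit(X_train, y_train)
labels = model.predict(X_test)               # hard 0/1 predictions
scores = model.predict_proba(X_test)[:, 1]   # probability of the positive class

auc_from_labels = roc_auc_score(y_test, labels)
auc_from_scores = roc_auc_score(y_test, scores)
balanced = balanced_accuracy_score(y_test, labels)

print("AUC from hard labels:  ", auc_from_labels)
print("balanced accuracy:     ", balanced)
print("AUC from probabilities:", auc_from_scores)
```

The first two numbers coincide; the third is the quantity usually meant by "ROC AUC", since it uses the full ranking of the test rows rather than a single threshold.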

In [204]:
# Note: each call retrains on the full dataset with some race indicator columns
# dropped. The test set is the same 1,235 rows every time, so the labels below
# describe which race column is kept as a feature, not per-group metrics.
print("all races")
func(['Two_yr_Recidivism'])
print("African Americans")
func(['Two_yr_Recidivism', 'Asian', 'Hispanic', 'Native_American', 'Other'])
print("Asians")
func(['Two_yr_Recidivism', 'African_American', 'Hispanic', 'Native_American', 'Other'])
print("Hispanic")
func(['Two_yr_Recidivism', 'African_American', 'Asian', 'Native_American', 'Other'])
print("Other")
func(['Two_yr_Recidivism', 'Asian', 'Hispanic', 'Native_American', 'African_American'])

all races
confusion matrix: 
 [[528 138]
 [261 308]]
accuracy:   0.676923076923077
precision:  0.6905829596412556
f1:              0.606896551724138
auc:           0.667046660016783

African Americans
confusion matrix: 
 [[529 137]
 [258 311]]
accuracy:   0.680161943319838
precision:  0.6941964285714286
f1:              0.6116027531956736
auc:           0.6704336146339661

Asians
confusion matrix: 
 [[537 129]
 [267 302]]
accuracy:   0.6793522267206478
precision:  0.7006960556844548
f1:              0.604
auc:           0.6685310090406751

Hispanic
confusion matrix: 
 [[515 151]
 [257 312]]
accuracy:   0.6696356275303643
precision:  0.673866090712743
f1:              0.6046511627906976
auc:           0.6608018387455997

Other
confusion matrix: 
 [[538 128]
 [268 301]]
accuracy:   0.6793522267206478
precision:  0.7016317016317016
f1:              0.6032064128256512
auc:           0.6684030251692816
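
The five runs above all score the same 1,235 test rows; each one retrains with different race columns dropped, so the small differences reflect the feature set rather than how the model treats each group. To evaluate fairness per group, one option is to train once and then split the test set by group. The sketch below does that on synthetic data; the toy frame, the outcome model, and the `group_report` helper are hypothetical and only illustrate the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data (not the real COMPAS records).
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "Number_of_Priors": rng.poisson(2, n),
    "African_American": rng.integers(0, 2, n),
})
logit = 0.4 * toy["Number_of_Priors"] - 1.0
toy["Two_yr_Recidivism"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = toy.drop(columns=["Two_yr_Recidivism"])
y = toy["Two_yr_Recidivism"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# Train a single model, then keep predictions aligned with the test index.
model = LogisticRegression().fit(X_train, y_train)
predictions = pd.Series(model.predict(X_test), index=X_test.index)


def group_report(mask):
    """Metrics for the test rows selected by a boolean mask."""
    tn, fp, fn, tp = confusion_matrix(
        y_test[mask], predictions[mask], labels=[0, 1]
    ).ravel()
    return {
        "n": int(mask.sum()),
        "accuracy": accuracy_score(y_test[mask], predictions[mask]),
        "fpr": fp / (fp + tn),  # share of non-reoffenders flagged as high risk
        "fnr": fn / (fn + tp),  # share of reoffenders flagged as low risk
    }


reports = {
    "African_American == 1": group_report(X_test["African_American"] == 1),
    "African_American == 0": group_report(X_test["African_American"] == 0),
}
for name, report in reports.items():
    print(name, report)
```

Comparing false positive and false negative rates across groups, rather than accuracy alone, is the kind of check the ProPublica analysis focused on.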

