# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS-s109A Introduction to Data Science

## Lab 6: Case Studies - Olympic Medals and COMPAS

**Harvard University**<br/>
**Summer 2021**<br/>
**Authors:** Kevin Rader, Shivam Raval, Chris Gumb, Pavlos Protopapas and Chris Tanner

---

In [None]:
import random
random.seed(12345)

import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn as sk

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, roc_auc_score, accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

%matplotlib inline

# Lab 6 Part 1: Predicting Olympic Medal Counts in 2020

![](medalcount.png)

### References: 

World Bank Data: https://data.worldbank.org/indicator/

Olympic Medal Counts: https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table

We'd like to predict the total medal count for the current 2020 Olympics based on past history and some demographic data (really, just population and per capita GDP).  Let's first read in the data and do some quick EDA:

### 1. Load the datasets

In [None]:
# Read in the data
medals00 = pd.read_csv('data/medals2000.csv', encoding = "ISO-8859-1")
medals04 = pd.read_csv('data/medals2004.csv', encoding = "ISO-8859-1")
medals08 = pd.read_csv('data/medals2008.csv', encoding = "ISO-8859-1")
medals12 = pd.read_csv('data/medals2012.csv', encoding = "ISO-8859-1")
medals16 = pd.read_csv('data/medals2016.csv', encoding = "ISO-8859-1")

gdp = pd.read_csv('data/gdp_per_capita.csv')
pop = pd.read_csv('data/population.csv')

medals00.head()

Let's do some simple EDA, treating `total` medals in 2016 as the response:

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(18,15))

ax[0][0].hist(medals16['total'])
ax[0][1].scatter(medals12['total'],medals16['total'])
ax[1][0].hist(np.log2(medals16['total']+1))
ax[1][1].scatter(np.log2(medals12['total']+1),np.log2(medals16['total']+1))
plt.show()


Wait, what? A negative correlation?  What's going on?  Let's check one thing:

In [None]:
medals16.head()


In [None]:
medals12.head()

That explains it!  So we need to carefully [merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html), and then explore:

In [None]:
medals = medals16.merge(medals12,on="code",how="outer")
medals.columns = medals.columns.str.replace("_x","16").str.replace("_y","12")
medals.head()

In [None]:
fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(18,15))

ax[0][0].hist(medals['total16'])
ax[0][1].scatter(medals['total16'],medals['total12'])
ax[1][0].hist(np.log2(medals['total16']+1))
ax[1][1].scatter(np.log2(medals['total16']+1),np.log2(medals['total12']+1))
ax[2][0].hist(np.sqrt(medals['total16']))
ax[2][1].scatter(np.sqrt(medals['total16']),np.sqrt(medals['total12']))
plt.show()



Now that looks much better.  But what are the warning signs for?!?!?  Let's check one thing:

In [None]:
medals.tail()

OK, there's missingness!  But this is systematic :)

Note: this happened because of the way we merged.  We used an `outer` join, which takes the union of the two data sets and fills in `NaN`s where values are missing.  If we had used `inner`, it would have taken the intersection (and dropped the observations that do not show up in the other data set), which is not ideal here.
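To make the difference between the two join types concrete, here is a minimal sketch on two tiny made-up tables. The country codes and totals are hypothetical stand-ins, not values from the lab's data files.

```python
import pandas as pd

# Toy stand-ins for two medal tables (values are made up for illustration)
left = pd.DataFrame({"code": ["AAA", "BBB", "CCC"], "total": [10, 5, 1]})
right = pd.DataFrame({"code": ["BBB", "CCC", "DDD"], "total": [7, 2, 3]})

# Outer join: union of the codes, NaN where a country is missing from one table
outer = left.merge(right, on="code", how="outer", suffixes=("16", "12"))

# Inner join: intersection of the codes, so AAA and DDD are dropped
inner = left.merge(right, on="code", how="inner", suffixes=("16", "12"))

print(outer)
print(inner)
print(outer.isna().sum())  # one missing value in each total column
```

Counting `NaN`s right after an outer merge, as in the last line, is a quick way to see how many rows came from only one of the two tables.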

**Q1.1** What values should we impute into this data set for `total16` and `total12`?  What about for `country`? Use `DataFrame.fillna` ([docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)) to accomplish the task.  Be careful: the order of operations matters!

*your answer here*

In [None]:
#######
# your code here
#######

medals['country16'] = medals['country16'].fillna(medals['country12'])
medals['country12'] = medals['country12'].fillna(medals['country16'])
medals = medals.fillna(0)
medals.tail()
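The sketch below shows why the order of the fills matters, using a made-up two-row table rather than the real merged data. A blanket `fillna(0)` applied first would overwrite the missing country name with `0`, leaving nothing for the name-copying step to fill. The country columns are forced to `object` dtype here as an assumption, so the blanket fill behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Toy merged table: "DDD" only appears in the 2012 data (made-up values)
toy = pd.DataFrame({
    "code": ["BBB", "DDD"],
    "country16": pd.Series(["Bbb", np.nan], dtype=object),
    "total16": [7.0, np.nan],
    "country12": pd.Series(["Bbb", "Ddd"], dtype=object),
    "total12": [5.0, 3.0],
})

# Wrong order: the blanket fill turns the missing country name into 0
wrong = toy.fillna(0)
wrong["country16"] = wrong["country16"].fillna(wrong["country12"])

# Right order: copy names across first, then fill the remaining counts with 0
right = toy.copy()
right["country16"] = right["country16"].fillna(right["country12"])
right = right.fillna(0)

print(wrong.loc[1, "country16"])  # 0
print(right.loc[1, "country16"])  # Ddd
```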

**Q1.2** Fit a model (`lm1`) to predict total medal count in 2016 from total medal count in 2012.  Address the following:

- Print out the coefficients and $R^2$
- Interpret the coefficients carefully
- Investigate the assumptions
- Investigate how well Brazil ('BRA') and Great Britain ('GBR') are predicted.  Think: why are we checking these two countries?

In [None]:
#######
# your code here
#######
from sklearn.linear_model import LinearRegression

lm1 = LinearRegression().fit(medals[['total12']], medals['total16'])
print("Intercept =",lm1.intercept_,", Slope(s) =",lm1.coef_)
print("R-sq =",lm1.score(medals[['total12']], medals['total16']))

yhats = lm1.predict(medals[['total12']])
resids = medals['total16'] - yhats

print("Brazil: yhat = ", yhats[medals.code=='BRA'][0], ", resid =", resids[medals.code=='BRA'].iloc[0])
print("Great Britain: yhat = ", yhats[medals.code=='GBR'][0], ", resid =", resids[medals.code=='GBR'].iloc[0])

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(18,6))

ax[0].hist(resids)
ax[1].scatter(yhats,resids)
plt.show()




*your answer here*

**Q1.3** Incorporate total medals from 2008 as a second predictor and fit a second model (`lm2`).  Address the following:

- Print out the coefficients and $R^2$
- Interpret the coefficients carefully and compare to `lm1`

Note: you'll have to do some processing first.

In [None]:
medals_raw = medals.copy()
medals = medals.merge(medals08,on="code",how="outer")
medals.columns=medals.columns[:-5].append(medals.columns[-5:]+"08")
medals.head()

In [None]:
medals['country08'] = medals['country08'].fillna(medals['country12'])
medals = medals.fillna(0)
medals.tail()

In [None]:
#######
# your code here
#######

lm2 = LinearRegression().fit(medals[['total12','total08']], medals['total16'])
print("Intercept =",lm2.intercept_,", Slope(s) =",lm2.coef_)
print("R-sq =",lm2.score(medals[['total12','total08']], medals['total16']))



*your answer here*

**Q1.4** Incorporate Population and GDP into the model (use the 2016 versions of those measurements...now call it `lm3`).  Interpret the results and compare to previous work.  Do not forget to merge first (there will be some issues, so to simplify life, let's merge using `how = 'inner'`)!



In [None]:
#######
# your code here
#######
#medals = medals_raw.copy()
medals_raw = medals.copy()

medals = medals.merge(pop[['code','2016']],on="code",how="inner")
medals['pop'] = medals['2016']
medals = medals.merge(gdp[['code','2016']],on="code",how="inner")
medals['gdp'] = medals['2016_y']
medals.shape
#medals_raw.shape


In [None]:
medals.head()

In [None]:
medals[medals['gdp'].isnull()]

In [None]:
# Keep only rows with a known GDP; use ~ (not unary minus) to negate a boolean mask
medals = medals[~medals['gdp'].isnull()]

In [None]:
lm3 = LinearRegression().fit(medals[['total12','total08','pop','gdp']], medals['total16'])
print("Intercept =",lm3.intercept_,", Slope(s) =",lm3.coef_)
print("R-sq =",lm3.score(medals[['total12','total08','pop','gdp']], medals['total16']))



*your answer here*

**Q1.5** Take a step back and think about what we have done so far.  What would you do differently?  What other predictors would you want to include?



*your answer here*

---

## Lab 6 Part 2: COMPAS Case Study

---

In [None]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

In [None]:
import random
random.seed(112358)

import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, roc_auc_score, accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

%matplotlib inline

# COMPAS Algorithm

Reference: https://www.uclalawreview.org/injustice-ex-machina-predictive-algorithms-in-criminal-sentencing/

Further Reads: https://www.technologyreview.com/2019/10/17/75285/ai-fairer-than-judge-criminal-risk-assessment-algorithm/

Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) was designed to assess a defendant’s risk of recidivism, that is, the potential risk that the defendant will commit a crime in the future. The algorithm predicts a defendant's risk of being rearrested for a crime while awaiting trial (i.e., during the period between their initial arrest and their trial).

COMPAS’s algorithm uses a variety of factors to generate a recidivism-risk score between 1 and 10. It does this by comparing an individual’s attributes and qualities to those of known high-risk offenders and assigning a score to the individual. At an initial court hearing after someone has been arrested for a crime, the judge needs to decide whether the defendant should be put in jail while they await trial. In jurisdictions that use COMPAS, the judge is provided with the COMPAS risk assessment as an input to this decision; COMPAS recommends that "high risk" defendants be jailed pending their trial. As a result, a defendant’s sentence is determined, to at least some degree, by COMPAS’s recidivism risk assessment.

In forecasting who would re-offend, the algorithm made mistakes with black and white defendants, but in very different ways.
The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate of white defendants, while white defendants were mislabeled as low risk more often than black defendants. This is an example of bias, and we will try to investigate this effect.

### 1. Load the dataset

In [None]:
# Read in the data
compas_df = pd.read_csv('data/compas.csv')
compas_df.head()

The dataset contains many variables, some of which are raw data and others are processed:

**Unprocessed data**: `age`, `c_charge_degree`,`c_charge_desc`, `race`, `sex`, `priors_count`,`juv_fel_count`,`juv_misd_count`,`juv_other_count`,`length_of_stay`
\
<br>
**Pre-processed data**: `length_of_stay_thresh`, `priors_1`, `priors_234`, `priors_5plus`, `juv_fel_1plus`,`juv_misd_1plus`,`juv_other_1plus`,`charge_any_drug`,`charge_any_violence_aggression`,`charge_any_theft`\
<br>
**COMPAS Outputs**: `score_text`,`decile_score`
\
<br>
**Outcome Variable**: `two_year_recid`

NOTE:
* `score_text` and `decile_score` should not be used in building models, because they are the outputs of the COMPAS model.
* `length_of_stay` (and the processed `length_of_stay_thresh`) should not be used in the model, because it is an outcome of the judge's decision on pretrial risk, which may be informed by COMPAS (e.g., a defendant will not spend time in jail if they are not put in jail by the judge).

We will be looking at `two_year_recid`: the outcome indicating whether someone who is convicted is rearrested within the next two years.


<div class="exercise"><b>Exercise 1.1:</b> Compare the number of convictions based on an individual being: (1) Male or Female, (2) charged with a Misdemeanor (M) or Felony (F), (3) African-American or Caucasian. What does this tell you about the data?</div>

In [None]:
#Your code here

In [None]:
#Your code here

<div class="exercise"><b>Exercise 1.2:</b> Convert the above variable columns into binary or one-hot encoded columns as required (Hint: pd.get_dummies might be helpful, especially when there are multiple categories).</div>

In [None]:
#Your code here
# Process Binary Categorical Variables

compas_df['sex'] = ___

compas_df['felony'] = ___


# One Hot Encode the Race Var
one_hot_df = pd.get_dummies(___)
compas_race_df = pd.concat(___)

# Drop the Categorical Vars
compas_race_df = compas_race_df.drop(___)

In [None]:
compas_race_df.head()
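As an illustration of the encoding pattern in Exercise 1.2, here is a sketch on a three-row made-up frame. The column names mirror the lab's data, but the rows are hypothetical, and which level is coded as 1 (here `Male` and `F`) is an assumption, not necessarily the choice the lab intends.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the lab's data
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "c_charge_degree": ["F", "M", "F"],
    "race": ["African-American", "Caucasian", "Hispanic"],
})

# Binary variables: compare against one level and cast to 0/1
toy["sex"] = (toy["sex"] == "Male").astype(int)
toy["felony"] = (toy["c_charge_degree"] == "F").astype(int)

# Multi-level variable: one indicator column per category
dummies = pd.get_dummies(toy["race"], prefix="race", dtype=int)

# Attach the indicators, then drop the original categorical columns
encoded = pd.concat([toy, dummies], axis=1).drop(columns=["race", "c_charge_degree"])
print(encoded.columns.tolist())
```

The `prefix="race"` argument produces column names of the form `race_African-American`, which is the naming the later cells in this lab index on.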

<div class="exercise"><b>Exercise 1.3:</b> Create a train test split of the data, with train_size = 0.8, random_state = 209 and stratified on race </div>

In [None]:
#Your code here
# Make Train Test Split
train_df, test_df = train_test_split(___)
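A stratified split keeps the group proportions the same in the training and test sets. The sketch below uses a made-up frame with an 80/20 group mix; in the lab, the labels passed to `stratify` would presumably come from the original `race` column, which is an assumption about the intended solution.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 rows of group "a" and 20 of group "b" (made-up values)
toy = pd.DataFrame({"group": ["a"] * 80 + ["b"] * 20, "x": range(100)})

train, test = train_test_split(
    toy, train_size=0.8, random_state=209, stratify=toy["group"]
)

# Both pieces keep the original 80/20 group mix
print(train["group"].value_counts(normalize=True))
print(test["group"].value_counts(normalize=True))
```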

### 2. EDA on the unprocessed variables

In [None]:
unproces_cols = ['age', 'priors_count', 'juv_fel_count', 'juv_misd_count', 'juv_other_count','length_of_stay','decile_score']

#Separate the variables based on race
aa_idx_train = np.where(train_df['race_African-American']==1)[0]
cc_idx_train = np.where(train_df['race_Caucasian']==1)[0]
non_aa_cc_idx_train = np.where(np.all([train_df['race_Caucasian']==0, train_df['race_African-American']==0], axis=0))[0]

<div class="exercise"><b>Exercise 2.1:</b> Plot the above predictors in a suitable plot for the two races. You should have 7 different plots for the 7 predictors. Do you see any visible trends?</div>

<b>Note:</b> The way you present the data is extremely important in the real world. Plots can induce biases in the observer, so one should be really careful when deciding which type of plot best represents the data.

In [None]:
#Your code here


*Your interpretation here*

<div class="exercise"><b>Exercise 2.2:</b> The trends observed in the plot above may point to some of the biases present here. Discuss why the trends seem different for different races. Hint: it points to deeper societal issues and human biases that creep into the data.</div>

*Your interpretation here*

### 3. Fit a logistic regression model to the data

Let's build a logistic regression model to predict recidivism (`two_year_recid`) from the relevant predictors (including `race`).

In [None]:
# Drop variables in favor of their pre-processed equivalents; also drop decile_score which is the COMPAS output
X_drop = ['priors_count', 'juv_fel_count', 'juv_misd_count', 'juv_other_count',
          'decile_score', 'two_year_recid', 'length_of_stay', 'length_of_stay_thresh']
X_train, X_test = train_df.drop(columns = X_drop), test_df.drop(columns = X_drop)
y_train, y_test = train_df['two_year_recid'], test_df['two_year_recid']

# Fit the scaler on X_train only, then apply the same transformation to both sets
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
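Because the scaler learns its minimum and maximum from the training set alone, transformed test values are not guaranteed to fall in [0, 1]. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train_toy = np.array([[20.0], [30.0], [40.0]])   # made-up ages
X_test_toy = np.array([[25.0], [50.0]])

scaler_toy = MinMaxScaler().fit(X_train_toy)       # learns min=20, max=40

train_scaled_toy = scaler_toy.transform(X_train_toy).ravel()  # 0, 0.5, 1
test_scaled_toy = scaler_toy.transform(X_test_toy).ravel()    # 0.25, 1.5
print(train_scaled_toy, test_scaled_toy)
```

The test value of 50 maps to 1.5 because it lies outside the training range. That is expected, and it avoids leaking information about the test set into the preprocessing step.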

<div class="exercise"><b>Exercise 3.1:</b> Fit a logistic regression model of your choice and print the accuracy score and the model coefficients. What do the coefficients for the different races tell you about their relation to the predicted outcome?</div>

In [None]:
#Your code here
# Create and fit the model
logit_model = ____

# Print Accuracy score
print('The accuracy score for the logistic regression model on the training data:')
display(logit_model.score(X_train_scaled, y_train))

# Print Coefficients
print('The coefficients for the logistic regression model are:')
display(pd.DataFrame(index=X_train.columns, data={'coefficients': logit_model.coef_[0]}))

*Your interpretation here*
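One way to read logistic regression coefficients is to exponentiate them into odds ratios. The sketch below uses simulated data rather than the COMPAS data; the sample size, the true coefficients, and the large `C` (to make regularization negligible) are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(2000, 2))                # two features already in [0, 1]
logit = 2.0 * X_toy[:, 0] - 1.0                    # only the first feature matters
y_toy = (rng.uniform(size=2000) < 1 / (1 + np.exp(-logit))).astype(int)

model_toy = LogisticRegression(C=1e4, max_iter=1000).fit(X_toy, y_toy)

# exp(coefficient) = multiplicative change in the odds for a one-unit increase
odds_ratios = np.exp(model_toy.coef_[0])
print(odds_ratios)
```

With min-max scaled predictors, a one-unit increase means moving from the smallest to the largest training value. For a 0/1 indicator such as a race dummy, it means switching the indicator on.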

<div class="exercise"><b>Exercise 3.2:</b> The function below is given to you to obtain the confusion matrices for predictions for different races and obtain the relevant rates. Use it for your model and interpret the results</div>

In [None]:
# Function to report the FPR and FNR by model
def evaluate_model(model, X_test, y_true, aa_idx, cc_idx, threshold, make_cf = True):
    
    y_pred_proba = model.predict_proba(X_test)[:,1]
    y_pred = np.array([1 if y > threshold else 0 for y in y_pred_proba])
    model_accuracy = accuracy_score(y_pred, y_true)
    
    cf_afam = confusion_matrix(y_true.values[aa_idx], y_pred[aa_idx])
    confusion_afam = dict(zip(['tn','fp','fn','tp'], cf_afam.ravel()))
    
    cf_cau = confusion_matrix(y_true.values[cc_idx], y_pred[cc_idx])
    confusion_caucasian = dict(zip(['tn','fp','fn','tp'], cf_cau.ravel()))
    
    if make_cf == True:
        
        print('Confusion Matrix for African-American:')
        display(pd.DataFrame(cf_afam, 
        index=['true:0', 'true:1'], 
        columns=['pred:0', 'pred:1'] ))

        print('Confusion Matrix for Caucasian:')
        display(pd.DataFrame(cf_cau, 
        index=['true:0', 'true:1'], 
        columns=['pred:0', 'pred:1'] ))
        
    

    fpr_afam = confusion_afam['fp'] / (confusion_afam['fp'] + confusion_afam['tn'])
    fnr_afam = confusion_afam['fn'] / (confusion_afam['fn'] + confusion_afam['tp'])

    fpr_caucasian = confusion_caucasian['fp'] / (confusion_caucasian['fp'] + confusion_caucasian['tn'])
    fnr_caucasian = confusion_caucasian['fn'] / (confusion_caucasian['fn'] + confusion_caucasian['tp'])
    
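    # Under errstate(all='raise'), a zero denominator raises instead of returning inf/nan;
    # in that case the ratio is reported as 0
    with np.errstate(all='raise'):
        try:
            fpr_ratio = fpr_afam/fpr_caucasian
        except (FloatingPointError, ZeroDivisionError):
            fpr_ratio = 0

        try:
            fnr_ratio = fnr_afam/fnr_caucasian
        except (FloatingPointError, ZeroDivisionError):
            fnr_ratio = 0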

    return dict(zip(['model_accuracy', 'fpr_afam', 'fnr_afam', 'fpr_caucasian', 'fnr_caucasian','fpr_ratio', 'fnr_ratio'],
                    [model_accuracy, fpr_afam, fnr_afam, fpr_caucasian, fnr_caucasian, fpr_ratio, fnr_ratio]))
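The rates computed in `evaluate_model` follow the usual definitions, $\text{FPR} = \frac{FP}{FP + TN}$ and $\text{FNR} = \frac{FN}{FN + TP}$. Here is a small worked example on made-up labels, so the arithmetic can be checked by hand:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 0, 0, 1, 1])

# ravel() on a 2x2 confusion matrix returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()

fpr_toy = fp / (fp + tn)   # 1 / 4 = 0.25
fnr_toy = fn / (fn + tp)   # 2 / 4 = 0.5
print(fpr_toy, fnr_toy)
```

Passing `labels=[0, 1]` guarantees a 2x2 matrix even when one class is absent from a subgroup.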


In [None]:
# Save the indexes for the two races on the test set
aa_idx_test = np.where(test_df['race_African-American']==1)[0]
cc_idx_test = np.where(test_df['race_Caucasian']==1)[0]

In [None]:
#Your code here


*Your interpretation here*

### 4. Reducing bias: building a race-agnostic model

What if race was not included as a factor? Let's refit the logistic model but this time **without** `race` as a predictor. 

<div class="exercise"><b>Exercise 4.1:</b> Drop all columns related to race and refit the logistic regression model. Again obtain the accuracy score and the model coefficients, and compare your results with those of 3.1.</div>

In [None]:
#Your code here
# Drop the race columns, which are the last 5 columns in the df
X_train_scaled_no_race = ___
X_test_scaled_no_race = ___

# Fit the model
logit_model_no_race = ____ 

# Print Accuracy score
print('The accuracy score for the logistic regression model on the training data:')
display(logit_model_no_race.score(X_train_scaled_no_race, y_train))

# Print Coefficients
print('The coefficients for the logistic regression model excluding race are:')
display(pd.DataFrame(index=X_train.columns[:-5], data={'coefficients': logit_model_no_race.coef_[0]}))

In [None]:
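results = evaluate_model(logit_model_no_race, X_test_scaled_no_race, y_test, aa_idx_test, cc_idx_test, 0.5)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
# Assumes results_df was already created in Exercise 3.2.
results_df = pd.concat([results_df, pd.DataFrame(results, index=['logit_model_no_race'])])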
print('The accuracy and error rates for the model on the test set are:')
results_df

*Your interpretation here*

<div class="exercise"><b>Exercise 4.2:</b> Discuss whether such a model be trusted to be unbiased even if it doesn’t explicitly use a variable such as race to predict future crime? </div>

*Your interpretation here*

### Closing thoughts: 
Is algorithmic modeling appropriate in this use case, given the historical biases encoded in the data and the risk of amplifying these historical inequities?

# Bonus Material: Tweaking the decision threshold to make fairer models

Can we reduce bias further by changing the threshold? Let's make an ROC curve for each of the two subgroups:

In [None]:
plt.rcParams["figure.figsize"] = (10,8)

In [None]:
def make_roc(name, model, ytest, xtest, idx_test, ax=None, labe=5, proba=True, skip=0):
    initial=False
    if ax is None:
        ax=plt.gca()
        initial=True
    if proba: # for models with predict_proba, e.g. logistic regression
        scores = model.predict_proba(xtest)[:,1]
    else: # otherwise fall back to decision_function scores
        scores = model.decision_function(xtest)

    # Use the ytest argument (not the global y_test), restricted to the subgroup rows
    fpr, tpr, thresholds = roc_curve(np.asarray(ytest)[idx_test], scores[idx_test])
    roc_auc = auc(fpr, tpr)
    if skip:
        l=fpr.shape[0]
        ax.plot(fpr[0:l:skip], tpr[0:l:skip], '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    else:
        ax.plot(fpr, tpr, '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    label_kwargs = {}
    label_kwargs['bbox'] = dict(
        boxstyle='round,pad=0.3', alpha=0.2,
    )
    if labe!=None:
        for k in range(0, fpr.shape[0],labe):
            #from https://gist.github.com/podshumok/c1d1c9394335d86255b8
            threshold = str(np.round(thresholds[k], 2))
            ax.annotate(threshold, (fpr[k], tpr[k]), **label_kwargs)
    if initial:
        ax.plot([0, 1], [0, 1], 'k--')
        ax.set_xlim([0.0, 1.0])
        ax.set_ylim([0.0, 1.05])
        ax.set_xlabel('False Positive Rate')
        ax.set_ylabel('True Positive Rate')
        ax.set_title('ROC')
    ax.legend(loc="lower right")
    return ax


sns.set_context("poster")

In [None]:
make_roc("African-American", logit_model_no_race, y_test, X_test_scaled_no_race, aa_idx_test, ax=None, labe=60, proba=True, skip=1);

In [None]:
make_roc("Caucasian", logit_model_no_race, y_test, X_test_scaled_no_race, cc_idx_test, ax=None, labe=40, proba=True, skip=1);

<b>1. </b> Let's choose a new single threshold for our model that may reduce the bias between these two racial groups (as measured by the ratios of FPR and FNR).

In [None]:
results = evaluate_model(logit_model_no_race, X_test_scaled_no_race, y_test, aa_idx_test, cc_idx_test, 0.45)
# Start a fresh results table (DataFrame.append was removed in pandas 2.0)
results_df = pd.DataFrame(results, index=['logit_model'])
print('The accuracy and error rates for the model on the test set are:')
results_df

It doesn't seem to improve. Maybe there's an optimal threshold?

In [None]:
t_vals = np.linspace(0,1,100)
fpr_ratios = []
fnr_ratios = []
accuracies = []
for t in t_vals:
    result = evaluate_model(logit_model_no_race, X_test_scaled_no_race, y_test, aa_idx_test, cc_idx_test, threshold=t, make_cf = False)
    fpr_ratios.append(result['fpr_ratio'])
    fnr_ratios.append(result['fnr_ratio'])
    accuracies.append(result['model_accuracy'])

plt.figure(figsize=(8,10))
plt.subplot(3,1,1)
plt.plot(t_vals,fpr_ratios)
plt.title("Ratio of False Positive Rates (African-American / Caucasian)")
plt.xlabel("Threshold")
plt.ylabel("Ratio")

plt.subplot(3,1,2)
plt.plot(t_vals,fnr_ratios)
plt.title("Ratio of False Negative Rates (African-American / Caucasian)")
plt.xlabel("Threshold")
plt.ylabel("Ratio")

plt.subplot(3,1,3)
plt.plot(t_vals,accuracies)
plt.title("Model Accuracies")
plt.xlabel("Threshold")
plt.ylabel("Accuracy")
plt.tight_layout()
plt.show()

From the above graphs, a threshold around .8 could be good since the FNR ratio is nearly 1 while the FPR ratio dips close to 1. In practice, this still means we are accepting that a higher ratio of African Americans will be wrongly classified as "high risk". A selection that brings FPR ratio to 1 will run into the opposite problem where we have to accept a FNR ratio that is further from 1.

Selecting on model accuracy alone is not a great answer since it ignores the fact that False Negative and Positive ratios remain poor. We should make sure the model accuracy doesn't dip too close to .5 as that means we are essentially doing a coin toss.
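Rather than reading a threshold off the plots by eye, one option is to score each threshold by how far both ratios sit from parity. The sketch below uses short made-up arrays as stand-ins for `t_vals`, `fpr_ratios`, and `fnr_ratios`. The sum of absolute deviations from 1 is just one possible criterion, chosen here as an assumption.

```python
import numpy as np

# Hypothetical stand-ins for t_vals, fpr_ratios and fnr_ratios from the sweep
t_toy = np.array([0.2, 0.4, 0.6, 0.8])
fpr_ratio_toy = np.array([1.6, 1.9, 1.5, 1.2])
fnr_ratio_toy = np.array([0.5, 0.6, 0.8, 0.95])

# Score each threshold by how far both ratios sit from parity (a ratio of 1)
disparity = np.abs(fpr_ratio_toy - 1) + np.abs(fnr_ratio_toy - 1)
best_t = t_toy[np.argmin(disparity)]
print(best_t)  # 0.8
```

When applying this to the real sweep, keep in mind that `evaluate_model` reports a ratio of 0 whenever the division fails, so thresholds near 0 or 1 may need to be excluded before taking the minimum. It is also worth checking the accuracy at the chosen threshold.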

--------------------------

<b>2. </b> Another approach to reducing bias is to use different thresholds for the different racial groups to better ensure that the groups have similar false positive and false negative rates.

In [None]:
# Let's separate the dataset by race
X_test_scaled_no_race_aa = X_test_scaled_no_race[aa_idx_test]
X_test_scaled_no_race_cc = X_test_scaled_no_race[cc_idx_test]
y_test_aa = np.array(y_test)[aa_idx_test]
y_test_cc = np.array(y_test)[cc_idx_test]

In [None]:
# Getting FPR, FNR, and accuracy by threshold for both datasets
fpr_aa = []
fnr_aa = []
accuracy_aa = []
fpr_cc = []
fnr_cc = []
accuracy_cc = []
for threshold in t_vals:
    # Get values for AA dataset
    y_pred_proba = logit_model_no_race.predict_proba(X_test_scaled_no_race_aa)[:,1]
    y_preds = np.array([1 if y > threshold else 0 for y in y_pred_proba])
    accuracy_aa.append(accuracy_score(y_preds, y_test_aa))
    confusion_afam = dict(zip(['tn','fp','fn','tp'], confusion_matrix(y_test_aa, y_preds).ravel()))
    fpr_aa.append(confusion_afam['fp'] / (confusion_afam['fp'] + confusion_afam['tn']))
    fnr_aa.append(confusion_afam['fn'] / (confusion_afam['fn'] + confusion_afam['tp']))

    # Get values for CC dataset
    y_pred_proba = logit_model_no_race.predict_proba(X_test_scaled_no_race_cc)[:,1]
    y_preds = np.array([1 if y > threshold else 0 for y in y_pred_proba])
    accuracy_cc.append(accuracy_score(y_preds, y_test_cc))
    confusion_caucasian = dict(zip(['tn','fp','fn','tp'], confusion_matrix(y_test_cc, y_preds).ravel()))
    fpr_cc.append(confusion_caucasian['fp'] / (confusion_caucasian['fp'] + confusion_caucasian['tn']))
    fnr_cc.append(confusion_caucasian['fn'] / (confusion_caucasian['fn'] + confusion_caucasian['tp']))

In [None]:
plt.style.use('default')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
plt.plot(t_vals,fnr_cc,label='FNR')
plt.plot(t_vals,fpr_cc,label='FPR')
plt.plot(t_vals,accuracy_cc,label='Accuracy')
plt.title('Caucasian dataset')
plt.legend()

plt.subplot(1,2,2)
plt.plot(t_vals,fnr_aa,label='FNR')
plt.plot(t_vals,fpr_aa,label='FPR')
plt.plot(t_vals,accuracy_aa,label='Accuracy')
plt.title('African American dataset')
plt.legend()
plt.show()

In [None]:
print("Caucasian numbers w/threshold=.4")
print("FPR = ",fpr_cc[40])
print("FNR = ",fnr_cc[40])
print("Accuracy = ",accuracy_cc[40])

In [None]:
print("African American numbers w/threshold=.5")
print("FPR = ",fpr_aa[50])
print("FNR = ",fnr_aa[50])
print("Accuracy = ",accuracy_aa[50])

From these graphs, a threshold around .4 for Caucasians and around .5 for African Americans seems to provide equitable treatment while maintaining good accuracy rates. Model accuracy was .66 in both which was near the peak. The False Positive and False Negative rates were also very similar for both groups, hovering around .33.
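Applying group-specific thresholds amounts to comparing each predicted probability with the cutoff for that person's group. A minimal sketch follows; the probabilities and group labels are made up, and the two cutoffs are the ones discussed above.

```python
import numpy as np

# Made-up predicted probabilities and a group label for six people
proba_toy = np.array([0.42, 0.47, 0.55, 0.38, 0.45, 0.61])
group_toy = np.array(["aa", "aa", "aa", "cc", "cc", "cc"])

# Hypothetical per-group thresholds matching the ones discussed above
group_thresholds = {"aa": 0.5, "cc": 0.4}
cutoffs = np.array([group_thresholds[g] for g in group_toy])

# Each person is compared with their own group's cutoff
y_pred_by_group = (proba_toy > cutoffs).astype(int)
print(y_pred_by_group)  # [0 0 1 0 1 1]
```

Note that the second and fifth people have nearly the same predicted probability (0.47 and 0.45) but receive different labels, which is the individual-fairness tension this approach introduces.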

#### Comparing the fairness of the above two methods:
A model satisfies group fairness if the subjects in both race groups have equal probability of being assigned to the positive class. Individual fairness is achieved if two individuals with equal characteristics aside from their race have equal probability of being assigned to the positive class.
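Under the group-fairness definition above, a simple check is to compare the share of each group assigned to the positive class. The predictions and group labels below are made up for illustration:

```python
import numpy as np

pred_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])     # made-up predictions
race_toy = np.array(["aa"] * 4 + ["cc"] * 4)      # made-up group labels

# Share of each group assigned to the positive ("high risk") class
rate_aa = pred_toy[race_toy == "aa"].mean()       # 3 / 4
rate_cc = pred_toy[race_toy == "cc"].mean()       # 1 / 4

# A gap of 0 would mean both groups have the same positive rate
print(rate_aa - rate_cc)
```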

The model with a fixed threshold across subgroups (<b>1</b>) does not quite achieve group fairness, as the ratios of FPR and FNR between the two groups are not quite 1. However, it does maintain individual fairness by not changing the threshold depending on the subject's race. The model with different thresholds for each subgroup (<b>2</b>) achieves group fairness since the groups as a whole are equally classified. However, since the thresholds change based on the individual's race, it may not be individually fair (if we assume that the predictors capture all of the relevant information about what qualifies someone to be a recidivism risk).

Changing the thresholds can *reduce* bias between the two classes, but it can also affect model accuracy. We want our model to be **accurate** but also **fair**. 

## HW: Try different approaches introduced in the class to see if you can make a better performing fair model with higher accuracy