# INFO 370 PS8
*Name: Brian Park, Jae Sang Woo*

In [108]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 1. Is COMPAS fair? 

1. (1pt) Load the COMPAS data, and perform the basic checks.
2. (1pt) Filter the data to keep only Caucasian and African-Americans. There are just too few offenders of other races.
3. (2pt) Create a new dummy variable based off of COMPAS risk score (decile_score), which indicates if an individual was classified as low risk (score 1-4) or high risk (score 5-10). Hint: you can do it in different ways but for technical reasons related to the tasks below, the best way to do it is to create a variable “high score”, that takes values 1 (decile score 5 and above) and 0 (decile score 1-4).
4. (6pt) Now analyze the offenders across this new risk category: (a) What is the recidivism rate (percentage of offenders who re-commit the crime) for low-risk and high-risk individuals? (b) What are the recidivism rates for African-Americans and Caucasians?
5. What do you think, is the model fair? If a person is classified as high-risk, there is a certain probability that the classification is wrong and the person is unjustly classified as high-risk. Is the probability that they are misclassified (approximately) the same for Caucasians and African-Americans? Do you think this means the model is fair?
6. (8 pt) Now create a confusion matrix comparing COMPAS predictions for recidivism (low risk/high risk you created above) and the actual two-year recidivism and interpret the results. In order to be on the same page, let’s call recidivists “positives”. Note: you do not have to predict anything here. COMPAS has made the prediction for you, this is the variable you created in 3 based on decile_score. See the referred articles about the controversy around COMPAS methodology. Note 2: Do not just output a confusion matrix with accompanying text like “accuracy = x%, precision = y%”. Interpret your results such as “z% of recidivists were falsely classified as low-risk, COMPAS accurately classified k% of individuals, etc.”
7. (8pt) Find the accuracy of the COMPAS classification, and also how its errors (false negatives and false positives) are distributed. Would you feel comfortable having a judge use COMPAS to inform sentencing guidelines? What do you think, how well can judges perform the same task without COMPAS’s help? At what point would the error/misclassification risk be acceptable for you? Do you think the acceptable error rate should be the same for human judges and for algorithms? Remember: human judges are not perfect either!
8. (10pt) Now repeat your confusion matrix calculation and analysis from 6. But this time do it separately for African-Americans and for Caucasians: (a) How accurate is the COMPAS classification for African-American individuals? For Caucasians? (b) What are the false positive rates (false recidivism rates) FPR? (c) The false negative rates (false no-recidivism rates) FNR? We did not talk about FPR and FNR in class, but you can consult Lecture Notes, section
9. (12pt) If you have done this correctly, you will find that COMPAS’s percentage of correctly categorized individuals (accuracy) is fairly similar for African-Americans and Caucasians, but that false positive rates and false negative rates are different. Look again at the overall recidivism rates in the dataset for Black and White individuals. In your opinion, is the COMPAS algorithm fair? Justify your answer.

In [109]:
#1-1
compas = pd.read_csv("../data/compas-score-data.csv.bz2", sep="\t")
compas.head()

Unnamed: 0,age,c_charge_degree,race,age_cat,sex,priors_count,decile_score,two_year_recid
0,69,F,Other,Greater than 45,Male,0,1,0
1,34,F,African-American,25 - 45,Male,0,3,1
2,24,F,African-American,Less than 25,Male,4,4,1
3,44,M,Other,25 - 45,Male,0,1,0
4,41,F,Caucasian,25 - 45,Male,14,6,1


In [110]:
#1-1
compas.shape

(6172, 8)

In [111]:
#1-1
compas.isna().sum()

age                0
c_charge_degree    0
race               0
age_cat            0
sex                0
priors_count       0
decile_score       0
two_year_recid     0
dtype: int64

In [112]:
#1-2
filtered = compas[compas['race'].isin(["Caucasian", "African-American"])]
filtered.head()

Unnamed: 0,age,c_charge_degree,race,age_cat,sex,priors_count,decile_score,two_year_recid
1,34,F,African-American,25 - 45,Male,0,3,1
2,24,F,African-American,Less than 25,Male,4,4,1
4,41,F,Caucasian,25 - 45,Male,14,6,1
6,39,M,Caucasian,25 - 45,Female,0,1,0
7,27,F,Caucasian,25 - 45,Male,0,4,0


In [113]:
#1-2
filtered.shape

(5278, 8)

In [114]:
#1-3
filtered = filtered.copy()
filtered["high score"] = np.where(filtered["decile_score"] > 4, 1, 0)

filtered.head()

Unnamed: 0,age,c_charge_degree,race,age_cat,sex,priors_count,decile_score,two_year_recid,high score
1,34,F,African-American,25 - 45,Male,0,3,1,0
2,24,F,African-American,Less than 25,Male,4,4,1,0
4,41,F,Caucasian,25 - 45,Male,14,6,1,1
6,39,M,Caucasian,25 - 45,Female,0,1,0,0
7,27,F,Caucasian,25 - 45,Male,0,4,0,0


In [115]:
#1-4-a
#share of ALL offenders who were scored low risk and re-committed a crime
#(the denominator is the whole table, not only the low-risk group)
rlow = filtered[(filtered.two_year_recid == 1) &
                (filtered["high score"] == 0)].shape[0] * 100 / \
                filtered.shape[0]

#share of ALL offenders who were scored high risk and re-committed a crime
#(the denominator is the whole table, not only the high-risk group)
rhigh = filtered[(filtered.two_year_recid == 1) &
                 (filtered["high score"] == 1)].shape[0] * 100 / \
                 filtered.shape[0]

rlow, rhigh

(16.691928760894278, 30.35240621447518)

In [116]:
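#1-4-b
#share of ALL offenders who are African-American and re-committed a crime
#(the denominator is the whole table, not only African-Americans)
raa = filtered[(filtered.race == "African-American") &
               (filtered.two_year_recid == 1)].shape[0] * 100 / \
               filtered.shape[0]

#share of ALL offenders who are Caucasian and re-committed a crime
#(the denominator is the whole table, not only Caucasians)
rc = filtered[(filtered.race == "Caucasian") &
              (filtered.two_year_recid == 1)].shape[0] * 100 / \
              filtered.shape[0]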

raa, rc

(31.470253884046986, 15.574081091322471)
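The percentages in 1-4 divide by the size of the whole table. A within-group recidivism rate divides by the size of each group instead. The following is a minimal sketch of that computation on a small made-up table (the values in `toy` are hypothetical; only the column names follow the notebook). It relies on the fact that the mean of a 0/1 column is the share of ones.

```python
import pandas as pd

# Toy data with hypothetical values; column names match the notebook.
toy = pd.DataFrame({
    "race": ["African-American", "African-American", "Caucasian",
             "Caucasian", "Caucasian", "African-American"],
    "high score": [1, 0, 0, 1, 0, 1],
    "two_year_recid": [1, 0, 0, 1, 1, 0],
})

# Mean of a 0/1 column within each group = that group's recidivism rate,
# so the denominator is the group size rather than the whole table.
rate_by_risk = toy.groupby("high score")["two_year_recid"].mean() * 100
rate_by_race = toy.groupby("race")["two_year_recid"].mean() * 100

print(rate_by_risk)
print(rate_by_race)
```

On the real data the same two `groupby` lines would be applied to `filtered` instead of `toy`.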

In [117]:
#1-5

Based on the computations above, people scored as high risk are more likely to recidivate within 2 years than people scored as low risk. The computations also show that African-Americans in this sample recidivate within 2 years more often than Caucasians. Judged on the first computation alone, the model seems fair enough. However, if people of one race are misclassified as high risk more often than people of the other, we can't say that the model is 100% fair, because those people would be unjustly labeled high risk.
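One way to check the question about misclassification among people labeled high-risk is to compute, for each group, the share of high-risk labels that turned out wrong, FP / (TP + FP). The sketch below hard-codes the per-group counts that this notebook reports in part 1-8; the function name `share_wrong_among_high_risk` is hypothetical and only used for illustration.

```python
def share_wrong_among_high_risk(tp, fp):
    """Share of people labeled high-risk who did not recidivate."""
    return fp / (tp + fp)

# Counts taken from the outputs of part 1-8 of this notebook.
aa_wrong = share_wrong_among_high_risk(tp=1188, fp=641)  # African-American
c_wrong = share_wrong_among_high_risk(tp=414, fp=282)    # Caucasian

print(round(aa_wrong, 3), round(c_wrong, 3))
```

This is a different quantity from the false positive rate in part 1-8, which divides by the number of actual non-recidivists rather than by the number of people labeled high-risk.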

In [118]:
#1-6
TP = filtered[(filtered.two_year_recid == 1) & 
              (filtered["high score"] == 1)].shape[0]
TN = filtered[(filtered.two_year_recid == 0) & 
              (filtered["high score"] == 0)].shape[0]
FP = filtered[(filtered.two_year_recid == 0) & 
              (filtered["high score"] == 1)].shape[0]
FN = filtered[(filtered.two_year_recid == 1) & 
              (filtered["high score"] == 0)].shape[0]
TP, TN, FP, FN

(1602, 1872, 923, 881)

| | **Predicted** "Negatives" (N) | **Predicted** "Positives" (P) |
| -------- | ------------: | ------:|
| **Actual** "Negatives" (N) | 1872 | 923 |
| **Actual** "Positives" (P) | 881 | 1602 |
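The same table can also be produced in one step with `pd.crosstab`, which avoids writing four separate filters. This is a minimal sketch on made-up data (the values in `toy` are hypothetical; the column names follow the notebook).

```python
import pandas as pd

# Toy data with hypothetical values; 1 = recidivist / high score.
toy = pd.DataFrame({
    "two_year_recid": [1, 1, 0, 0, 1, 0, 0, 1],
    "high score":     [1, 0, 0, 1, 1, 0, 0, 1],
})

# Rows are the actual outcome, columns are the COMPAS-based prediction.
cm = pd.crosstab(toy["two_year_recid"], toy["high score"],
                 rownames=["actual"], colnames=["predicted"])

# Read the four cells by (actual, predicted) label.
TN, FP = cm.loc[0, 0], cm.loc[0, 1]
FN, TP = cm.loc[1, 0], cm.loc[1, 1]

print(cm)
```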

In [119]:
#1-6
a = (TP + TN) / (TP + TN + FP + FN) #accuracy
p = TP / (TP + FP) #precision
r = TP / (TP + FN) #recall
f = TP / (TP + (0.5 * (FP + FN))) #f-score

a, p, r, f

(0.6582038651004168, 0.6344554455445545, 0.6451872734595248, 0.639776357827476)

In [120]:
#1-6
nlowrisk = (FN * 100 / (TP + TN + FP + FN)) # recidivists falsely classified as low-risk, as % of all individuals
nlowrisk

16.691928760894278

In [121]:
#1-6
nhighrisk = (FP * 100 / (TP + TN + FP + FN)) # non-recidivists falsely classified as high-risk, as % of all individuals
nhighrisk

17.48768472906404

In [122]:
#1-6
((TP + TN) * 100 / (TP + TN + FP + FN)) # accurately classified 

65.82038651004169

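Based on the confusion matrix, 16.69% of all individuals were recidivists falsely classified as low-risk, 17.49% of all individuals were non-recidivists falsely classified as high-risk, and COMPAS accurately classified 65.82% of individuals. Put differently, since recall is 64.5%, about 35.5% of actual recidivists were classified as low-risk.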
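The percentages above use all individuals as the denominator. Error rates are more often stated relative to the actual class: the false positive rate divides by all actual non-recidivists, and the false negative rate by all actual recidivists. A minimal sketch, using the four counts from the confusion matrix in this notebook (the helper name `rates` is hypothetical):

```python
def rates(tp, tn, fp, fn):
    """Return accuracy, FPR and FNR as fractions from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "FPR": fp / (fp + tn),  # non-recidivists labeled high-risk
        "FNR": fn / (fn + tp),  # recidivists labeled low-risk
    }

# Counts from the confusion matrix computed in part 1-6.
overall = rates(tp=1602, tn=1872, fp=923, fn=881)
print({name: round(value, 3) for name, value in overall.items()})
```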

In [123]:
#1-7
a = (TP + TN) / (TP + TN + FP + FN) #accuracy

#accuracy, falsely classified as low-risk, falsely classified as high-risk
a, nlowrisk, nhighrisk 

(0.6582038651004168, 16.691928760894278, 17.48768472906404)

With an accuracy of only about 66%, we are not comfortable having a judge use COMPAS to inform sentencing guidelines. That accuracy is too low to base a guideline on, but not so low that the score is useless; it could still be helpful as one reference when judging people.

Without COMPAS's help, judges would still be able to perform the same task, because they look closely at the actual trial they take part in rather than just using data to classify people. They would also rely on their own knowledge and experience as judges.

I would accept the error/misclassification risk only if the error rates were low enough to have little effect on how accurately people are classified.

I think that algorithms should be held to a stricter acceptable error rate, since an algorithm can be corrected and re-evaluated much more quickly than a human can. Since I believe that algorithms have more capacity to reduce their errors, the acceptable error rate should not be the same for human judges and algorithms.

In [124]:
#1-8
dfaa = filtered[filtered["race"] == "African-American"]
dfc = filtered[filtered["race"] == "Caucasian"]

In [125]:
#1-8-African-American
aTP = dfaa[(dfaa.two_year_recid == 1) & 
          (dfaa["high score"] == 1)].shape[0]
aTN = dfaa[(dfaa.two_year_recid == 0) & 
          (dfaa["high score"] == 0)].shape[0]
aFP = dfaa[(dfaa.two_year_recid == 0) & 
          (dfaa["high score"] == 1)].shape[0]
aFN = dfaa[(dfaa.two_year_recid == 1) & 
          (dfaa["high score"] == 0)].shape[0]
aTP, aTN, aFP, aFN

(1188, 873, 641, 473)

In [126]:
#1-8-Caucasian
cTP = dfc[(dfc.two_year_recid == 1) & 
         (dfc["high score"] == 1)].shape[0]
cTN = dfc[(dfc.two_year_recid == 0) & 
         (dfc["high score"] == 0)].shape[0]
cFP = dfc[(dfc.two_year_recid == 0) & 
         (dfc["high score"] == 1)].shape[0]
cFN = dfc[(dfc.two_year_recid == 1) & 
         (dfc["high score"] == 0)].shape[0]
cTP, cTN, cFP, cFN

(414, 999, 282, 408)

In [127]:
#1-8-a-African-American
aAA = (aTP + aTN) / (aTP + aTN + aFP + aFN) #accuracy for African-American
aAA

0.6491338582677165

Based on the confusion matrix and analysis above, the COMPAS classification is 64.9% accurate for African-Americans.

In [128]:
#1-8-a-Caucasian
cAA = (cTP + cTN) / (cTP + cTN + cFP + cFN) #accuracy for Caucasian
cAA

0.6718972895863052

Based on the confusion matrix and analysis above, the COMPAS classification is 67.2% accurate for Caucasians.

In [129]:
#1-8-b-African-American-FPR
aFPR = (aFP / (aTN + aFP))
aFPR

0.4233817701453104

The false positive rate for African Americans is 0.42.

In [130]:
#1-8-b-Caucasian-FPR
cFPR = (cFP / (cTN + cFP))
cFPR

0.22014051522248243

The false positive rate for Caucasians is 0.22.

In [131]:
#1-8-c-African-American-FNR
aFNR = (aFN / (aTP + aFN))
aFNR

0.2847682119205298

The false negative rate for African Americans is 0.28.

In [132]:
#1-8-c-Caucasian-FNR
cFNR = (cFN / (cTP + cFN))
cFNR

0.49635036496350365

The false negative rate for Caucasians is 0.50.
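To compare the two groups side by side, the per-group counts can be collected in one table and the three measures computed column-wise. This is a minimal sketch that hard-codes the counts reported above; the names `counts` and `summary` are only for illustration.

```python
import pandas as pd

# Per-group confusion-matrix counts as reported in part 1-8.
counts = pd.DataFrame(
    {"TP": [1188, 414], "TN": [873, 999], "FP": [641, 282], "FN": [473, 408]},
    index=["African-American", "Caucasian"],
)

summary = pd.DataFrame({
    "accuracy": (counts.TP + counts.TN) / counts.sum(axis=1),
    "FPR": counts.FP / (counts.FP + counts.TN),
    "FNR": counts.FN / (counts.FN + counts.TP),
})

print(summary.round(3))
```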

In [133]:
#1-9

Based on what we have done in this question, I don't believe that the COMPAS algorithm is fair enough. Its errors are biased against African-Americans: their false positive rate is 0.42 versus 0.22 for Caucasians, so African-Americans who do not recidivate are about twice as likely to be classified as high risk, while the false negative rate is 0.28 for African-Americans versus 0.50 for Caucasians. Race therefore appears to play a significant role in how this algorithm classifies people, even though its accuracy is similar for the two groups. Therefore, I believe that this algorithm is not fair enough to become a sentencing guideline.

## 2. Can you beat COMPAS? 

1. (8pt) Before we start: what do you think, what is an appropriate model performance measure here? A, P, R, F or something else? Maybe you want to report multiple measures? Explain!
2. (6pt) Now it is time to do the modeling. Create a logistic regression model that contains all explanatory variables you have in the data. (Some of these you have to convert to dummies). Do not include the variables discussed above, do not include race and gender in this model either to avoid explicit gender/racial bias. Use 10-fold CV to compute its relevant performance measure(s) you discussed above.
3. (6pt) Experiment with different models to find the best model according to your performance indicator. Try trees and k-NN, you may also include other types of models. Include/exclude different variables. You may also do feature engineering, e.g. create a different set of age groups, include variables like age2, age2, interaction effects, etc. But do not include race and gender. Report what you tried (no need to report the full results of all of your unsuccessful attempts), and your best model’s performance. Did you get better results or worse results than COMPAS?
4. (12pt) Discuss the results. Did you manage to be equally good as COMPAS? Did you create a better model? Do gender and race help to improve your predictions? What should judges do when having access to such models? Should they use such models?

In [134]:
#2-1

Since the values of A, P, R, and F for COMPAS are all fairly close to each other, no single measure stands out, so we report all four measures rather than rely on just one.

In [135]:
#2-2
X = compas[["age", "c_charge_degree", "age_cat", "priors_count"]]

In [136]:
#2-2
y = compas.two_year_recid.values
X = pd.get_dummies(X, columns = ["c_charge_degree", "age_cat"]).values
X[:5]

array([[69,  0,  1,  0,  0,  1,  0],
       [34,  0,  1,  0,  1,  0,  0],
       [24,  4,  1,  0,  0,  0,  1],
       [44,  0,  0,  1,  1,  0,  0],
       [41, 14,  1,  0,  1,  0,  0]])

In [137]:
#2-2
from sklearn.linear_model import LogisticRegression

m = LogisticRegression(max_iter=2500)
_ = m.fit(X,y)

In [138]:
#2-2 accuracy
from sklearn.model_selection import cross_val_score
cv = cross_val_score(m, X, y,
                scoring="accuracy",
                cv=10)
np.mean(cv)

0.6774110556875581

In [139]:
#2-2 precision
from sklearn.model_selection import cross_val_score
cv = cross_val_score(m, X, y,
                scoring="precision",
                cv=10)
np.mean(cv)

0.6678175901993845

In [140]:
#2-2 recall
from sklearn.model_selection import cross_val_score
cv = cross_val_score(m, X, y,
                scoring="recall",
                cv=10)
np.mean(cv)

0.579926283680732

In [141]:
#2-2 F-score
from sklearn.model_selection import cross_val_score
cv = cross_val_score(m, X, y,
                scoring="f1",
                cv=10)
np.mean(cv)

0.6205423125223205
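The four cross-validation cells above each refit the model for a single metric. scikit-learn's `cross_validate` accepts a list of scorers, so all four measures can come from one set of folds. The sketch below is a minimal illustration on synthetic data from `make_classification` (the data, sample size, and the names `X_demo`, `y_demo`, `summary` are assumptions, not the COMPAS data).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; not the COMPAS data.
X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     random_state=0)

metrics = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(LogisticRegression(max_iter=2500), X_demo, y_demo,
                        scoring=metrics, cv=10)

# cross_validate returns one array per metric under the key "test_<name>".
summary = {name: scores[f"test_{name}"].mean() for name in metrics}
print(summary)
```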

In [142]:
#2-3-NN (the accuracy below is measured on the training data, not cross-validated)
from sklearn.neighbors import KNeighborsClassifier

k = 23
m = KNeighborsClassifier(k)
_ = m.fit(X, y)

yhat = m.predict(X)

np.mean(y == yhat)

0.6949125081011017

In [143]:
#2-3-DTree (the accuracy below is measured on the training data, not cross-validated)
from sklearn.tree import DecisionTreeClassifier

d = 25
m = DecisionTreeClassifier(max_depth = d)
_ = m.fit(X, y)

yhat = m.predict(X)

np.mean(y == yhat)

0.7394685677252106

In [144]:
#2-3-add different age of group / 18-64 or 65+
compas['new_age_cat'] = np.where(
    compas['age'] > 64, "1", "0")

X = compas[["c_charge_degree", "new_age_cat", "priors_count"]]

y = compas.two_year_recid.values
X = pd.get_dummies(X, columns = ["c_charge_degree", "new_age_cat"]).values
X[:5]

array([[ 0,  1,  0,  0,  1],
       [ 0,  1,  0,  1,  0],
       [ 4,  1,  0,  1,  0],
       [ 0,  0,  1,  1,  0],
       [14,  1,  0,  1,  0]])

In [145]:
#2-3-add different age of group / 18-64 or 65+
k = 23
m = KNeighborsClassifier(k)
_ = m.fit(X, y)

yhat = m.predict(X)

np.mean(y == yhat)

0.619896305897602

In [146]:
#2-3-add different age of group / 18-64 or 65+
d = 25
m = DecisionTreeClassifier(max_depth = d)
_ = m.fit(X, y)

yhat = m.predict(X)

np.mean(y == yhat)

0.653596889176928

In [147]:
#2-3-age^2, new_age_cat
compas["age2"] = (compas.age)**2

X = compas[["age", "age2", "c_charge_degree", "priors_count", "new_age_cat"]]

y = compas.two_year_recid.values
X = pd.get_dummies(X, columns = ["c_charge_degree", "new_age_cat"]).values
X[:5]

array([[  69, 4761,    0,    1,    0,    0,    1],
       [  34, 1156,    0,    1,    0,    1,    0],
       [  24,  576,    4,    1,    0,    1,    0],
       [  44, 1936,    0,    0,    1,    1,    0],
       [  41, 1681,   14,    1,    0,    1,    0]])

In [148]:
#2-3-age^2, new_age_cat
k = 23
m = KNeighborsClassifier(k)
_ = m.fit(X, y)

yhat = m.predict(X)

np.mean(y == yhat)

0.6955605962410888

In [149]:
#2-3-age^2, new_age_cat
d = 25
m = DecisionTreeClassifier(max_depth = d)
_ = m.fit(X, y)

yhat = m.predict(X)

np.mean(y == yhat)

0.7394685677252106

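I tried both age^2 and a different set of age groups that splits the data into two groups, 18-64 and 65+. The decision tree with max_depth = 25 gave the highest accuracy, about 0.74, compared with about 0.66 for COMPAS in part 1. Note that this accuracy is measured on the training data, whereas the logistic regression scores in 2-2 are cross-validated, so the comparison with COMPAS is optimistic.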
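Accuracy measured on the same rows a model was fit on tends to be higher than accuracy on held-out rows, especially for a deep tree. The sketch below illustrates the difference on synthetic data from `make_classification`; the data, `flip_y=0.2` noise level, and the names `X_demo`, `y_demo`, `train_acc`, `cv_acc` are assumptions for illustration, not the COMPAS data or the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise; not the COMPAS data.
X_demo, y_demo = make_classification(n_samples=600, n_features=6,
                                     flip_y=0.2, random_state=0)

tree = DecisionTreeClassifier(max_depth=25, random_state=0)

# Accuracy on the rows the tree was trained on.
train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Mean accuracy over 10 held-out folds.
cv_acc = cross_val_score(tree, X_demo, y_demo,
                         scoring="accuracy", cv=10).mean()

print(train_acc, cv_acc)
```

Swapping `DecisionTreeClassifier` for `KNeighborsClassifier` in the same pattern would give a cross-validated figure for k-NN as well.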

In [150]:
#2-4

Judged by training accuracy, I created a somewhat better model than COMPAS without using race or gender. This suggests that judges should be aware of how gender and race can affect the way people are classified. Therefore, they should try their best to make unbiased judgments that leave out a subject's gender and race. If a model bases its results on those factors, gender and race, judges should be careful about interpreting its predictions.

## 3. Is your model more fair?

1. (6pt) Replicate 1.8 using your best model: pick the best model from question 2.3 (i.e. your best model without sex and race), predict recidivism for everyone in data (i.e. only African-Americans and Caucasians), and compute FPR and FNR separately for African-Americans and Caucasians.
2. (6pt) Explain what you get. Are your results different from COMPAS in any significant way?

In [151]:
#3-1
from sklearn.metrics import confusion_matrix

dfaa = compas[compas["race"] == "African-American"]
dfc = compas[compas["race"] == "Caucasian"]

#same features as the best model in 2-3 (no race, no gender)
features = ["age", "age2", "c_charge_degree", "priors_count", "new_age_cat"]
dummies = ["c_charge_degree", "new_age_cat"]

#African-Americans: design matrix and outcome
Xaa = pd.get_dummies(dfaa[features], columns = dummies).values
yaa = dfaa.two_year_recid.values

#Caucasian: design matrix and outcome
Xc = pd.get_dummies(dfc[features], columns = dummies).values
yc = dfc.two_year_recid.values

#note: a separate tree is fit for each group and evaluated on its own training data
#African-Americans
maa = DecisionTreeClassifier(max_depth = 25)
_ = maa.fit(Xaa, yaa)
yhataa = maa.predict(Xaa)

#Caucasian
mc = DecisionTreeClassifier(max_depth = 25)
_ = mc.fit(Xc, yc)
yhatc = mc.predict(Xc)

cmaa = confusion_matrix(yaa, yhataa) #African-Americans
cmc = confusion_matrix(yc, yhatc) #Caucasian

cmaa, cmc

(array([[1231,  283],
        [ 454, 1207]]),
 array([[1180,  101],
        [ 365,  457]]))
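The cell above fits one tree per group and scores each tree on its own training rows. An alternative reading of the task is to use a single model for everyone, obtain out-of-fold predictions, and only then split the predictions by group. The sketch below shows that pattern on synthetic data; the data, the random `group` labels, `max_depth=5`, and the helper `fpr_fnr` are all assumptions for illustration and are not the notebook's model or results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the COMPAS data.
X_demo, y_demo = make_classification(n_samples=600, n_features=6,
                                     random_state=0)

# Hypothetical group labels, used only to split results, never as a feature.
rng = np.random.default_rng(0)
group = rng.choice(["A", "B"], size=len(y_demo))

# One model for everyone; each prediction comes from a fold
# in which that row was held out.
model = DecisionTreeClassifier(max_depth=5, random_state=0)
yhat = cross_val_predict(model, X_demo, y_demo, cv=10)

def fpr_fnr(y_true, y_pred):
    """Return (FPR, FNR) from true and predicted 0/1 labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp / (fp + tn), fn / (fn + tp)

by_group = {g: fpr_fnr(y_demo[group == g], yhat[group == g])
            for g in ["A", "B"]}
print(by_group)
```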

In [152]:
#3-1-African-Americans
TP = cmaa[1,1]
FN = cmaa[1,0]
TN = cmaa[0,0]
FP = cmaa[0,1]

In [153]:
#3-1-accuracy-African-Americans
a = (TP + TN) / (TP + TN + FP + FN)
a

0.7678740157480315

In [154]:
#3-1-FPR-African-Americans
FPR = (FP / (TN + FP))
FPR

0.18692206076618229

In [155]:
#3-1-FNR-African-Americans
FNR = (FN / (TP + FN))
FNR

0.2733293196869356

In [156]:
#3-1-Caucasian
TP = cmc[1,1]
FN = cmc[1,0]
TN = cmc[0,0]
FP = cmc[0,1]

In [157]:
#3-1-accuracy-Caucasian
a = (TP + TN) / (TP + TN + FP + FN)
a

0.7784117926771279

In [158]:
#3-1-FPR-Caucasian
FPR = (FP / (TN + FP))
FPR

0.07884465261514442

In [159]:
#3-1-FNR-Caucasian
FNR = (FN / (TP + FN))
FNR

0.4440389294403893

In [160]:
#3-2

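Compared with COMPAS, the results above improved noticeably for a model that excludes race and gender. Accuracy rose to about 0.77 for African-Americans and 0.78 for Caucasians (COMPAS: 0.65 and 0.67). The false positive rate fell to 0.19 and 0.08 (COMPAS: 0.42 and 0.22), and the false negative rate fell to 0.27 and 0.44 (COMPAS: 0.28 and 0.50). The gap between the groups remains, however: African-Americans still have the higher false positive rate and Caucasians the higher false negative rate. These figures are also measured on the same data the trees were trained on, so they likely overstate how the model would perform on new cases.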

#### How many hours did you spend on this PS?

We spent 7 hours on this PS. I also want to say thank you so much for all your help this quarter. Have a great end of year, Runhan!