## 1.1 Load and prepare

In [38]:
# 1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("data/compas-score-data.csv.bz2", sep = "\t")
df.head()

Unnamed: 0,age,c_charge_degree,race,age_cat,sex,priors_count,decile_score,two_year_recid
0,69,F,Other,Greater than 45,Male,0,1,0
1,34,F,African-American,25 - 45,Male,0,3,1
2,24,F,African-American,Less than 25,Male,4,4,1
3,44,M,Other,25 - 45,Male,0,1,0
4,41,F,Caucasian,25 - 45,Male,14,6,1


In [39]:
df.shape

(6172, 8)

In [40]:
# 2
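# Keep only the two groups compared below; .copy() avoids a SettingWithCopyWarning when columns are added later
df = df[df.race.isin(["Caucasian", "African-American"])].copy()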

In [41]:
# 3
# Create a new column 'high_score' and initialize it with 0
df['high_score'] = 0

# Update 'high_score' to 1 for individuals with decile_score 5 and above
df.loc[df['decile_score'] >= 5, 'high_score'] = 1
df.sample(5)

Unnamed: 0,age,c_charge_degree,race,age_cat,sex,priors_count,decile_score,two_year_recid,high_score
2302,21,F,African-American,Less than 25,Male,1,3,0,0
721,26,M,Caucasian,25 - 45,Male,4,7,1,1
907,27,F,African-American,25 - 45,Male,2,5,0,1
4782,29,M,African-American,25 - 45,Male,0,2,0,0
4916,32,F,Caucasian,25 - 45,Male,3,5,0,1
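
The two-step assignment above (initialize with 0, then overwrite with 1) can also be written as a single vectorized comparison. A minimal sketch on a made-up toy frame; the values are illustrative and are not taken from the COMPAS data:

```python
import pandas as pd

# Toy decile scores, made up for illustration only
toy = pd.DataFrame({"decile_score": [1, 4, 5, 7, 10]})

# True where the score is 5 or above, then cast the booleans to 0/1
toy["high_score"] = (toy["decile_score"] >= 5).astype(int)

print(toy["high_score"].tolist())
```

Either form should give the same column; the one-liner simply avoids the intermediate all-zero state.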


In [42]:
# 4
# a
recid_risk = df.groupby("high_score")["two_year_recid"].mean()
recid_risk

high_score
0    0.320015
1    0.634455
Name: two_year_recid, dtype: float64

The recidivism rate for low-risk individuals (decile score below 5) is 32.00%.

The recidivism rate for high-risk individuals (decile score 5 and above) is 63.45%.

In [43]:
# b
recid_race = df.groupby("race")["two_year_recid"].mean()
recid_race

race
African-American    0.52315
Caucasian           0.39087
Name: two_year_recid, dtype: float64

The recidivism rate for African-Americans is 52.31%.

The recidivism rate for Caucasians is 39.09%.

In [44]:
# 5
from sklearn.metrics import confusion_matrix

# "high_score" is COMPAS prediction and "two_year_recid" variable is actual recidivism
cm = confusion_matrix(df['two_year_recid'], df['high_score'])
cm

array([[1872,  923],
       [ 881, 1602]])

Accuracy = (TP + TN) / T = (1602 + 1872) / (1872 + 923 + 881 + 1602) = 0.658

Precision = TP / (TP + FP) = 1602 / (1602 + 923) = 0.634

Recall = TP / (TP + FN) = 1602 / (1602 + 881) = 0.645

True Negative (TN): Individuals correctly classified as low risk by COMPAS and did not recidivate.\
False Positive (FP): Individuals incorrectly classified as high risk by COMPAS but did not recidivate.\
False Negative (FN): Individuals incorrectly classified as low risk by COMPAS but recidivated.\
True Positive (TP): Individuals correctly classified as high risk by COMPAS and recidivated.

Accuracy = 0.658 means that about 65.8% of all individuals in the dataset were correctly classified by COMPAS as either low risk or high risk based on their decile scores.\
Precision = 0.634 means that around 63.4% of the individuals classified as high risk by COMPAS actually recidivated within two years.\
Recall = 0.645 means that around 64.5% of the individuals who actually recidivated within two years were correctly identified as high risk by COMPAS.
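
The hand calculations above can be reproduced directly from the confusion matrix. This is a small sketch that hard-codes the matrix reported in the cell output, so it does not need the data file; it assumes scikit-learn's layout of rows as actual classes and columns as predicted classes:

```python
import numpy as np

# Confusion matrix reported above: rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[1872, 923],
               [881, 1602]])

# For a 2x2 matrix, ravel() walks the rows in order: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```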

Q6

Accuracy: the proportion of correct predictions (both true positives and true negatives) out of the total number of predictions. In this case, accuracy = 0.658 means that out of all the individuals in the dataset, about 65.8% were correctly classified by COMPAS as either low risk or high risk based on their decile scores.

Percentage of low-risk individuals wrongly classified as high risk: This corresponds to the false positive rate (FPR), FPR = FP / (TN + FP) = 923 / (1872 + 923) = 0.3302. This means that about 33% of individuals who did not recidivate within two years were wrongly classified as high risk by COMPAS.

Percentage of high-risk individuals wrongly classified as low risk: This corresponds to the false negative rate (FNR), FNR = FN / (TP + FN) = 881 / (1602 + 881) = 0.3548. This means that approximately 35.48% of individuals who did recidivate within two years were wrongly classified as low risk by COMPAS.
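
The two error rates can be wrapped in a small helper so the same calculation is reused for any 2x2 confusion matrix. The function name `error_rates` is a hypothetical name chosen for this sketch, and the matrix is the one reported earlier:

```python
import numpy as np

def error_rates(cm):
    """Return (FPR, FNR) for a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    fpr = fp / (fp + tn)  # share of non-recidivists flagged as high risk
    fnr = fn / (fn + tp)  # share of recidivists flagged as low risk
    return fpr, fnr

fpr, fnr = error_rates([[1872, 923], [881, 1602]])
print(f"FPR={fpr:.4f} FNR={fnr:.4f}")
```

Note that FNR is simply 1 minus recall, so the FNR here and the recall computed earlier carry the same information.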

I do not feel comfortable having a judge use COMPAS to inform sentencing decisions, since the FPR and FNR show that roughly one-third of the individuals in each outcome group were placed in the wrong category. Relying on COMPAS alone could lead to harsher punishment for people who are not actually more dangerous, while letting some dangerous offenders escape the punishment they are due.

I believe that judges who perform the same task without COMPAS's help achieve better outcomes. Judges have many years of experience and training and a deep understanding of the law. More importantly, they can weigh qualitative information that quantitative data does not capture, such as the offender's attitude, criminal motives, and signs of repentance, when making informed decisions.

I would say the legal system should aim for errors to be as low as possible; in quantitative terms, the ideal error rate would be below 3% for me. Comparing the acceptable error rates for human judges and algorithms: if we assume that the algorithm's judgments involve no human intervention, then I think the algorithm's error rate should be 0%. This is because, when human judges decide, a large number of legal workers are involved in gathering evidence, mounting a defense, and deciding on the sentence before a suspect is convicted. More people are therefore checking that the conviction is correct, and the suspect gets more opportunities to defend themselves.

It is hard to say whether judges perform the same task better or worse. Both human judges and algorithms have strengths and weaknesses. Algorithms like COMPAS can provide consistency in their decision-making process, but they can also be influenced by biases present in the data used to train them. Human judges, on the other hand, bring contextual understanding and subjective judgment to their decisions, but they are susceptible to individual biases and inconsistencies.

The acceptable error rate may differ for human judges and algorithms. While fairness and accuracy should be sought in both cases, the mechanisms of bias and error can be different. In my view, a 5% misclassification risk is acceptable for human judges, and a 10% misclassification risk is acceptable for algorithms. This is because human judges are able to consider the nuances and context of individual cases. They can take into account a wide range of information, including personal history, mitigating circumstances, and individual characteristics. This contextual understanding allows judges to make informed decisions that may deviate from strict algorithmic predictions. It is essential to strike a balance between human judgment and algorithmic assistance. The goal should be to continually improve the accuracy, fairness, and transparency of both approaches.

## 1.2 Analysis by race

In [45]:
# 1
hAA = df[(df.race == "African-American") & (df.high_score == 1)]
lAA = df[(df.race == "African-American") & (df.high_score == 0)]
hCC = df[(df.race == "Caucasian") & (df.high_score == 1)]
lCC = df[(df.race == "Caucasian") & (df.high_score == 0)]

hAA_recid = np.mean(hAA.two_year_recid)
lAA_recid = np.mean(lAA.two_year_recid)
hCC_recid = np.mean(hCC.two_year_recid)
lCC_recid = np.mean(lCC.two_year_recid)

hAA_recid, lAA_recid, hCC_recid, lCC_recid

(0.6495352651722253,
 0.3514115898959881,
 0.5948275862068966,
 0.2899786780383795)

The recidivism rate for high-risk African-Americans is 64.95%.\
The recidivism rate for low-risk African-Americans is 35.14%.\
The recidivism rate for high-risk Caucasians is 59.48%.\
The recidivism rate for low-risk Caucasians is 29.00%.
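
The same four rates can be obtained in one step with a two-key `groupby`, instead of four separate filters. A minimal sketch on made-up toy rows (the group labels `"A"` and `"B"` and all values are illustrative, not the COMPAS data):

```python
import pandas as pd

# Toy rows, made up for illustration only (not the COMPAS data)
toy = pd.DataFrame({
    "race": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "high_score": [1, 1, 0, 0, 1, 1, 0, 0],
    "two_year_recid": [1, 0, 0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 outcome within each (race, high_score) cell is that cell's recidivism rate
rates = toy.groupby(["race", "high_score"])["two_year_recid"].mean()
print(rates)
```

On the real frame, the equivalent call would presumably be `df.groupby(["race", "high_score"])["two_year_recid"].mean()`.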

Q2

Based on these results, the recidivism rate for high-risk African-Americans (64.95%) is 5.47 percentage points higher than that of high-risk Caucasians (59.48%). Similarly, the recidivism rate for low-risk African-Americans (35.14%) is 6.14 percentage points higher than that of low-risk Caucasians (29.00%).

These findings indicate a racial disparity. In both the high-risk and low-risk categories, African-Americans recidivate at a higher rate than Caucasians, by roughly 5 to 6 percentage points. This suggests that the results favor Caucasians to a slight extent. However, these figures alone are not sufficient evidence to conclude that COMPAS is unfair. Assessing the fairness of COMPAS thoroughly requires further analysis that takes additional factors into account.

In [46]:
# 3
# "high_score" is COMPAS prediction and "two_year_recid" variable is actual recidivism
dfAA = df[df.race == "African-American"]
cmAA = confusion_matrix(dfAA['two_year_recid'], dfAA['high_score'])
cmAA

array([[ 873,  641],
       [ 473, 1188]])

Accuracy = (TP + TN) / T = (1188 + 873) / (873 + 641 + 473 + 1188) = 0.6491

Precision = TP / (TP + FP) = 1188 / (1188 + 641) = 0.6495

Recall = TP / (TP + FN) = 1188 / (1188 + 473) = 0.7152

In [47]:
dfCC = df[df.race == "Caucasian"]
cmCC = confusion_matrix(dfCC['two_year_recid'], dfCC['high_score'])
cmCC

array([[999, 282],
       [408, 414]])

Accuracy = (TP + TN) / T = (414 + 999) / (414 + 999 + 282 + 408) = 0.6719

Precision = TP / (TP + FP) = 414 / (414 + 282) = 0.5948

Recall = TP / (TP + FN) = 414 / (414 + 408) = 0.5036

a) The accuracy of COMPAS classification for African-Americans is 64.91%, and for Caucasians, it is 67.19%.

b) FPR = FP / (TN + FP) = 641 / (873 + 641) = 0.4234. The FPR for African-Americans is 42.34%, indicating that 42.34% of African-Americans who did not recidivate were wrongly classified as high risk by COMPAS.

FPR = FP / (TN + FP) = 282 / (999 + 282) = 0.2201. The FPR for Caucasians is 22.01%, meaning that 22.01% of Caucasians who did not recidivate were wrongly classified as high risk by COMPAS.

c) FNR = FN / (TP + FN) = 473 / (1188 + 473) = 0.2848. The FNR for African-Americans is 28.48%, indicating that 28.48% of African-Americans who did recidivate were wrongly classified as low risk by COMPAS.

FNR = FN / (TP + FN) = 408 / (414 + 408) = 0.4964. The FNR for Caucasians is 49.64%, meaning that 49.64% of Caucasians who did recidivate were wrongly classified as low risk by COMPAS.
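
As a cross-check on the hand calculations, the per-race rates and the gaps between the two groups can be computed from the two confusion matrices reported above. This sketch hard-codes those matrices and assumes the `[[TN, FP], [FN, TP]]` layout that `confusion_matrix` returns for 0/1 labels:

```python
import numpy as np

# Confusion matrices reported above, laid out as [[TN, FP], [FN, TP]]
matrices = {
    "African-American": np.array([[873, 641], [473, 1188]]),
    "Caucasian": np.array([[999, 282], [408, 414]]),
}

rates = {}
for group, cm in matrices.items():
    tn, fp, fn, tp = cm.ravel()
    rates[group] = {
        "accuracy": (tp + tn) / cm.sum(),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
    }
    print(group, {k: round(float(v), 4) for k, v in rates[group].items()})

# Gaps between groups in percentage points (African-American minus Caucasian)
fpr_gap = 100 * (rates["African-American"]["FPR"] - rates["Caucasian"]["FPR"])
fnr_gap = 100 * (rates["African-American"]["FNR"] - rates["Caucasian"]["FNR"])
print(f"FPR gap: {fpr_gap:.2f} pp, FNR gap: {fnr_gap:.2f} pp")
```

Expressing the gaps in percentage points keeps them from being read as relative (percent) changes.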

Q4

The accuracy of COMPAS in correctly categorizing individuals is quite similar for African-Americans and Caucasians, with a difference of only 2.28 percentage points.

However, the analysis in the previous question reveals a disparity in false positive rates that favors Caucasians. The false positive rate for Caucasians is about 20 percentage points lower than that for African-Americans. In other words, African-Americans who did not recidivate were about 20 percentage points more likely than their Caucasian counterparts to be wrongly classified as high risk.

The false negative rate also favors Caucasians: it is about 21 percentage points higher for Caucasians than for African-Americans. This implies that Caucasians who did recidivate were about 21 percentage points more likely than their African-American counterparts to be wrongly classified as low risk.

Based on this analysis, it becomes more evident that, under similar circumstances, Caucasian offenders are more likely to escape harsh sentences than African-Americans. Conversely, African-Americans may face more unjust sentences than Caucasian offenders at the same risk level.

These findings provide comparatively strong evidence that the COMPAS algorithm operates unfairly with regard to the offender's race, with a bias against African-Americans and in favor of Caucasian offenders.

Q5

My answer in question 4 aligns with my answer in question 2. In question 2, I discussed the racial disparity in recidivism rates, highlighting that African-Americans have higher rates compared to Caucasians in both the high-risk and low-risk categories. This suggests an imbalance that favors Caucasians to some extent.

In question 4, I analyzed the accuracy, false positive rate, and false negative rate of the COMPAS algorithm for African-Americans and Caucasians. I pointed out the disparities in false positive and false negative rates, which favor Caucasians. These disparities align with the racial disparity observed in the recidivism rates mentioned in question 2.

Therefore, my answer in question 4 further emphasizes the presence of racial bias in the COMPAS algorithm, supporting the notion that it operates unfairly by favoring Caucasians and potentially leading to unjust outcomes for African-Americans.

## 2.1 Create the model

Q1

I think accuracy (A), false positive rate (FPR), and false negative rate (FNR) are appropriate performance measures for the model. 

Accuracy (A): Accuracy is a commonly used performance measure that provides an overall understanding of the model's correctness in predicting recidivism. It calculates the proportion of correctly classified instances (both true positives and true negatives) out of the total predictions. However, accuracy alone may not be sufficient when the classes are imbalanced or when the costs of false positives and false negatives differ significantly.

False Positive Rate (FPR): FPR is the proportion of instances that are incorrectly classified as recidivists (positive class) among all the actual non-recidivists (negative class). FPR is particularly relevant in the context of fairness and avoiding biases. By assessing the FPR, we can measure how often the model wrongly identifies individuals as recidivists, which can help identify potential disparities in the treatment of different groups.

False Negative Rate (FNR): FNR is the proportion of instances that are incorrectly classified as non-recidivists (negative class) among all the actual recidivists (positive class). FNR is also crucial to consider, as it captures the instances where the model fails to identify individuals who are likely to recidivate. High FNR may result in increased risks and negative consequences for individuals who should have been flagged as high risk but were not.

By considering accuracy, FPR, and FNR together, we can gain a more comprehensive understanding of the model's performance, its ability to avoid biases, and its effectiveness in predicting recidivism accurately while minimizing both false positive and false negative errors.
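
For reference, the three measures can be written compactly in terms of the confusion-matrix counts (TP, TN, FP, FN), consistent with the formulas used in section 1:

$$
A = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad
\mathrm{FNR} = \frac{FN}{FN + TP}
$$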

Q2

We should not use the variable `decile_score`, because it is generated by the COMPAS model itself, which has been criticized for potential bias; using it would carry COMPAS's judgments into the new model. To avoid explicit race and gender bias, the model should also exclude variables that directly encode race or gender, so `race` and `sex` are left out as well. Excluding these variables and relying on the remaining predictors is a step toward a fairer model for predicting recidivism.

In [60]:
# 3
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_val_predict

df_m = df.drop(columns=["decile_score", "race", "sex", "high_score"])
y = df_m.two_year_recid
X = df_m.drop("two_year_recid", axis=1)
X = pd.get_dummies(X, columns = ["c_charge_degree", "age_cat"], drop_first=True)

# create Logistic Regression model
mLR = LogisticRegression()
# Fit the model 
mLR.fit(X, y)
# Perform 10-fold cross-validation
cv = cross_val_score(mLR, X, y, cv=10, scoring = "accuracy")
# Compute mean accuracy
mean_accuracy = cv.mean()
# Perform cross-validated predictions
y_pred = cross_val_predict(mLR, X, y, cv=10)
# Compute confusion matrix
cm = confusion_matrix(y, y_pred)
# Calculate FPR
FPR = cm[0, 1] / (cm[0, 0] + cm[0, 1])
# Calculate FNR
FNR = cm[1, 0] / (cm[1, 0] + cm[1, 1])

print("Accuracy:", mean_accuracy)
print("FPR:", FPR)
print("FNR:", FNR)

Accuracy: 0.736064990512334
FPR: 0.19105545617173525
FNR: 0.3459524768425292


In [61]:
# 4
# Decision Trees
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier()
dt_cv_scores = cross_val_score(dt_model, X, y, cv=10, scoring='accuracy')
dt_mean_accuracy = dt_cv_scores.mean()

dt_model.fit(X, y)
dt_y_pred = cross_val_predict(dt_model, X, y, cv=10)
dt_cm = confusion_matrix(y, dt_y_pred)
dt_FPR = dt_cm[0, 1] / (dt_cm[0, 0] + dt_cm[0, 1])
dt_FNR = dt_cm[1, 0] / (dt_cm[1, 0] + dt_cm[1, 1])

print("Accuracy:", dt_mean_accuracy)
print("FPR:", dt_FPR)
print("FNR:", dt_FNR)

Accuracy: 0.6761985365993904
FPR: 0.23685152057245082
FNR: 0.4220700765203383


In [62]:
# KNN
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier()
knn_cv_scores = cross_val_score(knn_model, X, y, cv=10, scoring='accuracy')
knn_mean_accuracy = knn_cv_scores.mean()

knn_model.fit(X, y)
knn_y_pred = cross_val_predict(knn_model, X, y, cv=10)
knn_cm = confusion_matrix(y, knn_y_pred)
knn_FPR = knn_cm[0, 1] / (knn_cm[0, 0] + knn_cm[0, 1])
knn_FNR = knn_cm[1, 0] / (knn_cm[1, 0] + knn_cm[1, 1])

print("Accuracy:", knn_mean_accuracy)
print("FPR:", knn_FPR)
print("FNR:", knn_FNR)

Accuracy: 0.6517562963601863
FPR: 0.2819320214669052
FNR: 0.42287555376560615




| | COMPAS | Logistic Regression | Decision Tree | KNN |
| -------- | ------: | ------: | ------: | ------: |
| Accuracy | 0.658 | 0.7361 | 0.6762 | 0.6518 |
| FPR | 0.330 | 0.1911 | 0.2369 | 0.2819 |
| FNR | 0.355 | 0.3460 | 0.4221 | 0.4229 |

In addition to the logistic regression model, I also implemented K-NN and decision tree models, using the same variables as the logistic regression model. The results are presented in the table above.

Based on these values, the Logistic Regression model has the highest accuracy, the lowest FPR, and the lowest FNR of the four models. A lower FPR means that fewer non-recidivists are falsely classified as high risk, and a lower FNR means that fewer recidivists are falsely classified as low risk, so in terms of both fairness and accuracy the Logistic Regression model appears to be the best choice among the four. My best model, Logistic Regression, performs better than COMPAS on all three measures.
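
The three model cells repeat the same evaluation steps, so the comparison could also be written as one loop. The sketch below is one possible arrangement, not the code that produced the table: it runs on synthetic data from `make_classification` (an assumption, so that it is self-contained), sets `max_iter` and `random_state` for reproducibility, and derives all three measures from a single set of cross-validated predictions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: these are not the COMPAS features)
X_toy, y_toy = make_classification(n_samples=500, n_features=6, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    # One round of 10-fold CV: each row is predicted by a model that did not train on it
    y_hat = cross_val_predict(model, X_toy, y_toy, cv=10)
    tn, fp, fn, tp = confusion_matrix(y_toy, y_hat).ravel()
    results[name] = {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
    }

for name, scores in results.items():
    print(name, {k: round(float(v), 4) for k, v in scores.items()})
```

Accuracy computed this way pools all out-of-fold predictions, so it can differ slightly from the mean of per-fold accuracies that `cross_val_score` reports.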

## 2.2 Is your model more fair?

In [53]:
# 1
# COMPAS result for recidivism risk:
lAA_recid, hAA_recid, lCC_recid, hCC_recid

(0.3514115898959881,
 0.6495352651722253,
 0.2899786780383795,
 0.5948275862068966)

In [66]:
# My result (Logistic Regression):
mLR = LogisticRegression()
# Train the best model on all data
mLR.fit(X, y)
# Predict recidivism for all individuals
all_predictions = mLR.predict(X)
# Add the predicted recidivism to the original DataFrame
df['predicted_r'] = all_predictions
# Compute the percentage of the predicted low-risk and high-risk individuals who recidivate, by race
df.groupby(['race', 'predicted_r'])['two_year_recid'].mean()

race              predicted_r
African-American  0              0.289409
                  1              0.767892
Caucasian         0              0.259853
                  1              0.714521
Name: two_year_recid, dtype: float64

| | COMPAS | Logistic Regression |
| -------- | ------------: | ------: |
| low-risk African-American | 0.3514 | 0.2894 |
| high-risk African-American | 0.6495 | 0.7679 |
| low-risk Caucasian | 0.2900 | 0.2599 |
| high-risk Caucasian | 0.5948 | 0.7145 |

For African-Americans, my Logistic Regression model has a lower recidivism rate for low-risk individuals compared to COMPAS, indicating it is slightly more fair in this regard. However, for high-risk individuals, the Logistic Regression has a significantly higher recidivism rate, indicating it is less fair in this aspect.

For Caucasians, my Logistic Regression model also has a lower recidivism rate for low-risk individuals compared to COMPAS, suggesting it is slightly more fair. However, similar to African-Americans, the recidivism rate for high-risk individuals is higher in the Logistic Regression model, indicating it is less fair in this aspect as well.

Overall, based on the comparison of recidivism rates for low-risk and high-risk individuals by race, it appears that my Logistic Regression model is less fair than COMPAS. It results in higher recidivism rates for high-risk individuals, regardless of race.

In [67]:
# 2
# Filter the dataframe for African-Americans
df_AA = df[df['race'] == 'African-American']
cm_AA = confusion_matrix(df_AA['two_year_recid'], df_AA['predicted_r'])
fpr_AA = cm_AA[0, 1] / (cm_AA[0, 0] + cm_AA[0, 1])
fnr_AA = cm_AA[1, 0] / (cm_AA[1, 0] + cm_AA[1, 1])

# Filter the dataframe for Caucasians
df_CC = df[df['race'] == 'Caucasian']
cm_CC = confusion_matrix(df_CC['two_year_recid'], df_CC['predicted_r'])
fpr_CC = cm_CC[0, 1] / (cm_CC[0, 0] + cm_CC[0, 1])
fnr_CC = cm_CC[1, 0] / (cm_CC[1, 0] + cm_CC[1, 1])

# Print the FPR and FNR by race
print("African-American:")
print("FPR:", fpr_AA)
print("FNR:", fnr_AA)

print("Caucasian:")
print("FPR:", fpr_CC)
print("FNR:", fnr_CC)

African-American:
FPR: 0.23778071334214002
FNR: 0.28296207104154125
Caucasian:
FPR: 0.13505074160811867
FNR: 0.4732360097323601


| | COMPAS | Logistic Regression |
| -------- | ------------: | ------: |
| FPR African-American | 0.4234 | 0.2378 |
| FNR African-American | 0.2848 | 0.2830 |
| FPR Caucasian | 0.2201 | 0.1351 |
| FNR Caucasian | 0.4964 | 0.4732 |

For African-Americans, my Logistic Regression model has a significantly lower FPR than COMPAS, indicating it is more fair in terms of falsely classifying non-recidivists as high risk. The FNR is also slightly lower in the Logistic Regression model, indicating it is slightly more fair in terms of falsely classifying recidivists as low risk.

For Caucasians, my Logistic Regression model has a lower FPR than COMPAS, indicating it is more fair in terms of falsely classifying non-recidivists as high risk. The FNR is also slightly lower in the Logistic Regression model, indicating it is slightly more fair in terms of falsely classifying recidivists as low risk.

Overall, considering both African-Americans and Caucasians, my Logistic Regression model appears to be more fair than COMPAS. It has significantly lower FPR values and slightly lower FNR values for both racial groups.
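
The per-race calculation above repeats the same steps for each group, so it could be folded into one helper. The function `rates_by_group` is a hypothetical name for this sketch, the toy rows are made up, and the helper assumes every group contains both actual outcomes (otherwise a denominator would be zero):

```python
import pandas as pd

def rates_by_group(frame, group_col, actual_col, predicted_col):
    """FPR and FNR per group, computed from 0/1 actual and predicted columns."""
    rows = {}
    for group, part in frame.groupby(group_col):
        actual = part[actual_col]
        predicted = part[predicted_col]
        fp = ((actual == 0) & (predicted == 1)).sum()
        tn = ((actual == 0) & (predicted == 0)).sum()
        fn = ((actual == 1) & (predicted == 0)).sum()
        tp = ((actual == 1) & (predicted == 1)).sum()
        rows[group] = {"FPR": fp / (fp + tn), "FNR": fn / (fn + tp)}
    # One row per group, one column per rate
    return pd.DataFrame(rows).T

# Toy rows, made up for illustration only
toy = pd.DataFrame({
    "race": ["A"] * 4 + ["B"] * 4,
    "two_year_recid": [0, 0, 1, 1, 0, 0, 1, 1],
    "predicted_r": [1, 0, 1, 1, 0, 0, 0, 1],
})
table = rates_by_group(toy, "race", "two_year_recid", "predicted_r")
print(table)
```

On the real frame, the call would presumably be `rates_by_group(df, "race", "two_year_recid", "predicted_r")`.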

Q3

Based on the results from 2.2.1 and 2.2.2, there are some differences in the interpretation of fairness between the recidivism rates and the false positive/negative rates for my Logistic Regression model compared to COMPAS. 

In terms of recidivism rates, my Logistic Regression model shows lower rates for low-risk individuals but higher rates for high-risk individuals compared to COMPAS. This suggests a trade-off between fairness for different risk categories.

However, when considering the false positive and false negative rates, my Logistic Regression model generally performs better than COMPAS. It has lower FPR values for both African-Americans and Caucasians, and slightly lower FNR values.

While my model may have certain advantages in terms of false positive/negative rates, the higher recidivism rates for high-risk individuals raise concerns about the fairness of the model in that aspect. Therefore, it would be necessary to carefully consider the specific fairness criteria, the trade-offs involved, and the context in which the model is being used to determine whether my model is better or worse than COMPAS in terms of overall fairness.

I spent 14 hours on this problem set.