<a href="https://colab.research.google.com/github/LarrySnyder/ASJ/blob/main/compas/COMPAS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COMPAS

This file is read-only. To work with it, you first need to **save a copy to your Google Drive:**

1. Go to the *File* menu. (The *File* menu inside the notebook, right below the filename—not the *File* menu in your browser, at the top of your screen.)
2. Choose *Save a copy in Drive*. (Log in to your Google account, if necessary.) Feel free to move it to a different folder in your Drive, if you want.
3. Colab should open up a new browser tab with your copy of the notebook. Double-click the filename at the top of the window and rename it `COMPAS [your name(s)]`. 
4. Close the original read-only notebook in your browser.


---
> 👓 **Note:** This notebook is part of the *Algorithms and Social Justice* course at Lehigh University, Profs. Larry Snyder and Suzanne Edwards.
---


---
> 📚 **Reference:** Portions of this notebook are adapted from the *Data-4ac* course at the University of California–Berkeley, Spring 2021, Prof. Margarita Boenig-Liptsin (available at https://github.com/ds-modules/data-4ac) and from Aaron Fraenkel, *Fairness and Algorithmic Decision Making*, UCSD course DSC 167 (available at https://afraenkel.github.io/fairness-book/content/04-compas.html).
---



## Introduction



Decision making within the United States criminal justice system relies heavily on risk assessment, which determines the potential risk that a released defendant will fail to appear in court, or will cause harm to the public. Judges use these assessments to decide whether bail can be set, or whether a defendant should be detained until their trial.

While risk assessment is not a new concept in the legal system, the use of risk scores determined by an algorithm is gaining prevalence and support. Proponents promote the use of risk scores to guide judges in their decision making, arguing that machine learning could improve efficiency and accountability and reduce bias in decision making compared with human judgement ([Henry 2019](https://theappeal.org/risk-assessment-explained/)). 

On the other hand, critical voices raise the concern that biases can creep into these algorithms at any point in the process, and that algorithms are often applied to the wrong situations. Further, they exacerbate the racism embedded deep within the criminal justice system by perpetuating inequalities found in historical data ([Henry 2019](https://theappeal.org/risk-assessment-explained/)).

In the debate about the use of risk assessment algorithms, people have used data analysis to determine the extent to which these algorithms are fair to different groups of people. The [ProPublica article](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) that we read in class uses data analysis to argue that the COMPAS risk assessment algorithm is racially biased. Northpointe, the company that created COMPAS, offered a different analysis, still based on data science, to argue that the results are in fact not biased. ProPublica and Northpointe used different **metrics** (ways of quantifying concepts like "fairness") to support their arguments. 

In this notebook, **you will explore the actual data used by ProPublica, examining their arguments and analyses** to gain a deeper understanding of the technical and societal interpretations and implications of fairness. (In the next notebook, you'll explore Northpointe's "rebuttal" analysis.)

This is a longer notebook than we've used so far. You can view an outline of the notebook by clicking the "bullet list" icon at the top of the left-hand toolbar.

---
> 👓 **Note:** When we discuss "bias" in this notebook, we define it generally as prejudice or an inclination in favor of one person, thing, or group compared to another. In the context of machine learning, "bias" has a more narrow [mathematical meaning](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff), but that is not the definition we will use here.
---

## Preliminary Python Stuff

First we need to import the Python packages that we'll need for our analysis. We've used all of these packages before.

* `pandas` (pronounced like the animal) handles data
* `numpy` ("num-pie") does numerical computations
* `seaborn` does data visualization
* `sklearn` ("s-k-learn") does machine learning
* `matplotlib` ("mat-plot-libe") does graphing


In [None]:
import pandas as pd
import numpy as np 
import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
import matplotlib.pyplot as plt

---

## COMPAS: Why It Was Created and How It Exists in the Court System <a id="compas"></a>

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a commercial tool produced by the for-profit company Northpointe (now called Equivant). It is known as a **recidivism risk assessment system.** Tools like COMPAS are used to **predict the risk of future crimes for an individual who has entered the U.S. criminal justice system, by outputting a risk score from 1 to 10.**

While COMPAS was initially intended to aid decisions made by probation officers on treatment and supervision of those who are incarcerated, Northpointe has since emphasized the scalability of the tool to "fit the needs of many different decision points," including pre-screening assessments, pre-trial release decisions (whether or not to hold an arrested individual in jail until their trial), and post-trial next steps for the defendant. These algorithms are believed by many to hold the power to relieve the court system of unfair human bias from criminal justice officials.



#### Questions

**Question 0a**

List 3 parties who are affected by the COMPAS tool. In what ways are they affected? (Can you think of impacts beyond those in the courtroom for at least one of your examples?)

**Answer:** *YOUR ANSWER HERE*

**Question 0b**

Based on your initial reading, what is one problem of the criminal justice system that the COMPAS tool could potentially alleviate? What is one potential problem that using the COMPAS algorithm could introduce?

**Answer:** *YOUR ANSWER HERE*

## The Coproduction of Justice and Data<a id="coproduction"></a>

### Understanding the Methods of Data Collection

Before a risk score is determined for a defendant, the defendant is asked to fill out a questionnaire with questions meant to help predict the defendant's level of risk. Let's take a look at this questionnaire to get a better understanding of what goes into determining a risk score. [Here](https://www.documentcloud.org/documents/2702103-Sample-Risk-Assessment-COMPAS-CORE.html) is a link to a sample questionnaire.


#### Questions

**Question 1a**

What aspects of the questionnaire were particularly striking to you?

**Answer:** *YOUR ANSWER HERE*

**Question 1b**

Does the questionnaire ask explicitly about race? If not, is race still embedded in the questionnaire? Explain.

**Answer:** *YOUR ANSWER HERE*

---
### Understanding the Data

We will be using the data that were obtained and utilized by ProPublica in their own analysis of the COMPAS tool. They used Broward County, Florida public records of people who were scored in the COMPAS system between 2013 and 2014 ([ProPublica 2016](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm)). ProPublica has made the dataset available on GitHub (a website for sharing code, data, and other files), and we'll be loading the data directly from there.

First, we'll load the dataset from ProPublica's GitHub repository. In the same command to load the data, we'll also specify which columns we want to include. We'll select only the data that ProPublica used in their study, such as severity of the charge, number of priors, demographics, COMPAS scores, and whether each person was accused of another crime within two years.

---
> 👓 **Note:** The dataset contains the full names of the people included. We're omitting these columns out of respect for these people's privacy. However, it's worth reflecting on what it means for their full names to be included in a publicly available dataset, even one that was posted by ProPublica in the interest of studying fairness.
---


In [None]:
# Load the dataset from ProPublica's GitHub repository.
# Specify the columns to include.
data = pd.read_csv('https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv',
                   usecols=["age", "c_charge_degree", "race", "age_cat", "score_text", "sex", "priors_count", 
                    "days_b_screening_arrest", "decile_score", "is_recid", "two_year_recid", "c_jail_in", "c_jail_out"])

ProPublica removed any cases in which the criminal charge was not within 30 days of the COMPAS score, since in those cases it was harder to match the COMPAS score with the correct criminal case. We'll do the same, to follow ProPublica's analysis. We're left with 6172 rows in the dataset.

In [None]:
pp_data = data.query('days_b_screening_arrest <= 30 & days_b_screening_arrest >= -30')
len(pp_data)

We'll also include only rows for defendants whose "race" column is either "Caucasian" or "African-American", since those are the only "race" values considered in the ProPublica study. This leaves us with 5278 rows. (The `|` in the condition means "or".)

In [None]:
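# .copy() makes pp_data an independent table, so that adding columns
# to it later does not trigger pandas' SettingWithCopyWarning.
pp_data = pp_data.query('race == "Caucasian" | race == "African-American"').copy()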
len(pp_data)
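
To see how `&` ("and") and `|` ("or") behave inside `query()`, here is a minimal sketch on a made-up table. The rows and the name `toy` are hypothetical and are not taken from the ProPublica dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not from the ProPublica dataset).
toy = pd.DataFrame({
    "race": ["Caucasian", "African-American", "Hispanic", "Caucasian"],
    "days_b_screening_arrest": [-5, 45, 0, 12],
})

# `&` keeps a row only if BOTH conditions are true.
within_30 = toy.query("days_b_screening_arrest <= 30 & days_b_screening_arrest >= -30")

# `|` keeps a row if AT LEAST ONE condition is true.
two_groups = toy.query('race == "Caucasian" | race == "African-American"')

print(len(within_30))   # the row with 45 days is dropped
print(len(two_groups))  # the "Hispanic" row is dropped
```

Both filters return a new table and leave `toy` unchanged, which is why the notebook assigns the result back to `pp_data`.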

Here's a look at (a handful of rows of) the ProPublica dataset.

In [None]:
pp_data

---
> ⚠️ **Important:** The dataset used by the COMPAS model and the dataset used by the ProPublica analysis are completely different.
>
> * **COMPAS** uses confidential, proprietary data for each person based on the answers to the questionnaire. It uses an algorithm that takes this dataset as an input and gives a risk assessment (COMPAS score) as an output.
> * **ProPublica** used data about the same defendants but containing public information about arrest records, COMPAS scores, and so on. ProPublica used data science algorithms to analyze fairness in the risk assessments.
>
> There is information in the **COMPAS** data that is not in the **ProPublica** data, such as how many times the person has violated their parole or how many of their friends are gang members. And there is information in the **ProPublica** data that is not in the **COMPAS** data, such as the COMPAS score and whether the person was accused of another crime within two years.
>
> The dataset we are working with in this notebook is the **ProPublica** dataset.
---



## ProPublica's Perspective<a id="propublica"></a>

### What is ProPublica?

[ProPublica](https://www.propublica.org/about/) is a nonprofit organization that "produces investigative journalism with moral force." ProPublica was founded as a nonpartisan newsroom aiming to expose and question abuses of power, justice, and public trust, often by systems and institutions deeply ingrained in the U.S.

In 2016, ProPublica investigated the COMPAS algorithm to assess the accuracy of and potential racial bias within the tool, as it became more popular in courts across the United States. In their analysis, ProPublica used data from defendants with risk scores from Broward County from 2013 to 2014 to test for statistical differences in outcomes for Black and white defendants, which ultimately highlighted racial disparities that exist within the algorithm. ProPublica came to the conclusion that COMPAS utilizes data from a criminal justice system with a history of racial injustices, thus continuing to disproportionately target and arrest Black people in comparison to their white counterparts. While the COMPAS algorithm treats racial groups similarly in a certain sense (which we will explore below), which may make it appear to be neutral, ProPublica's data analysis and reporting emphasized the bias against Black defendants and their communities that COMPAS produced, a claim that Northpointe has disputed (as we will see later).

Let's retrace ProPublica's statistical analysis in order to better understand ProPublica's argument and engage with the metric of fairness that it uses. In order to mimic their analysis more closely, we will use ProPublica's definitions of "high" and "low" COMPAS scores:
- Any score greater than 4 is considered a **high** score. A defendant with a high score is **predicted to recidivate.**
- Any score less than or equal to 4 is considered a **low** score. A defendant with a low score is **predicted not to recidivate.**

### Basic Data Analysis

Let's do some basic data analysis. First we'll plot histograms by "sex" and "race" (the terms used in the dataset). (Recall that a histogram indicates the number of observations with each value (or each range of values).) In previous notebooks, we used the `hist()` function from the `pandas` package, but this time we'll use the `histplot()` function in the `seaborn` package. They're similar, but the `seaborn` version is a little prettier and easier to use.



---
> 🤓 **Nerd detail:** The package is called `seaborn`, but when we imported it, we told Python to abbreviate it as `sns`, which is why it says `sns` below. It's common to abbreviate package names, such as `np` for `numpy`. The abbreviation `sns` for `seaborn` is traditional but non-intuitive; it turns out it's the initials of the character Sam Seaborn from *The West Wing*, after whom the developer of `seaborn` named the package. Coders are fond of inside jokes and references.
---

In [None]:
# Plot a histogram by sex. (stat="percent" tells seaborn to 
# plot percentages instead of total counts in each bin.)
sns.histplot(pp_data, x="sex", discrete=True, stat="percent");

---
> 🤓 **Nerd detail:** You might be wondering why there's a semicolon (`;`) at the end of the line of code above, since Python (unlike some other programming languages) doesn't usually use semicolons. If we omit it, the Jupyter notebook will print an extra line of text output that doesn't mean much to us, and the semicolon suppresses it. (Try removing it and re-running the cell, if you want.) It's not a big deal either way.
---
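
If you'd like numbers to go with a percentage histogram, one option is `value_counts(normalize=True)`, which reports the share of rows with each value. This is a small sketch on a made-up column (the name `toy_sex` and its values are hypothetical), not the notebook's actual data.

```python
import pandas as pd

# Toy column (hypothetical values, not from the ProPublica dataset).
toy_sex = pd.Series(["Male", "Male", "Female", "Male"], name="sex")

# normalize=True gives proportions; multiplying by 100 gives percentages.
percentages = toy_sex.value_counts(normalize=True) * 100
print(percentages)
```

These percentages are the heights of the bars that `sns.histplot(..., stat="percent")` would draw for the same column.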

#### Questions

**Question 2a**

Roughly what percentage of defendants in the dataset are male?


**Answer:** *YOUR ANSWER HERE*

**Question 2b**

Plot a histogram by race. (Remember that we already filtered the data to include only rows in which "race" is Caucasian or African-American.) 

In [None]:
[...]

**Question 2c**

Roughly what percentage of defendants in the dataset are African-American?

**Answer:** *YOUR ANSWER HERE*

**Question 2d**

Plot a histogram by "`decile_score`"—that's the COMPAS score.

In [None]:
[...]

**Question 2e**

Are the scores skewed toward lower risk or higher risk?

**Answer:** *YOUR ANSWER HERE*

### Visualization of Disparity

Let's visualize how ProPublica began its investigation of racial disparity within the COMPAS risk assessment. The histogram below separates the one that you created above by race, displaying the differences between the risk scores of Black and white defendants. 

In [None]:
# Create a histogram for the risk scores for Black defendants.
sns.histplot(pp_data[pp_data["race"] == "African-American"], x="decile_score", discrete=True, color='orange', alpha=0.5)

# Create a histogram for the risk scores for white defendants and
# display it on the same plot as the plot for Black defendants.
sns.histplot(pp_data[pp_data["race"] == "Caucasian"], x="decile_score", discrete=True, alpha=0.5)

# Add legend.
plt.legend(labels=["Black", "white"]);

#### Questions

**Question 3a**

Is one racial group more likely to get a high risk score than the other? If so, does this by itself imply that the COMPAS model is biased? Why or why not?

**Answer:** *YOUR ANSWER HERE*

**Question 3b**

Assuming the risk scores are accurate, what might make one racial group more likely to get a high risk score than the other? (*Hint*: connect this to your knowledge of the history of policing and institutionalized racism.)

**Answer:** *YOUR ANSWER HERE*

### The COMPAS Predictions

At the moment, our dataset contains a column that tells us the COMPAS score (1–10), but it will be convenient for us to translate this into a prediction. (Remember that a defendant is predicted to recidivate if their COMPAS score is greater than 4.)

In [None]:
threshold = 4
pp_data['COMPAS_prediction'] = 1 * (pp_data['decile_score'] > threshold)

This code is a little tricky, but you'll need some of the tricks in your code below, so let's unpack it:

* First, we create a new variable called `threshold` and set it to 4. We'll compare scores to this variable in the next line. Strictly speaking, we don't need to do this—we could just put `4` in the next line instead of `threshold`—but creating the variable makes the code more flexible and easier to understand.
* The second line adds a new column to the dataset called `'COMPAS_prediction'`. (Or, if we have already added such a column, it replaces its data.) This column will contain a 1 if the defendant's `decile_score` is greater than `threshold` and a 0 otherwise. Here's how it works:
    * The code `pp_data['decile_score'] > threshold` essentially creates a new column. Each value in the column equals `True` if the `decile_score` is greater than `threshold` and `False` otherwise.
    * But the code we use below will need the predictions to be labeled as 1 and 0, not `True` and `False`. Multiplying `pp_data['decile_score'] > threshold` by 1 converts `True` to 1 and `False` to 0.
    * The second line therefore sets `pp_data['COMPAS_prediction']` equal to this new column of 1s and 0s.
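
The steps above can be tried on a tiny made-up column of scores. The names `toy_scores` and `toy_threshold` are hypothetical, chosen so that nothing in the notebook is overwritten.

```python
import pandas as pd

# Toy scores (hypothetical values).
toy_scores = pd.Series([2, 4, 5, 9], name="decile_score")
toy_threshold = 4

# Comparing a column to a number gives a column of True/False values.
is_high = toy_scores > toy_threshold

# Multiplying by 1 turns True into 1 and False into 0.
as_ints = 1 * is_high

# .astype(int) is another common way to do the same conversion.
same_thing = is_high.astype(int)

print(is_high.tolist())
print(as_ints.tolist())
```

Note that a score of exactly 4 counts as low, because the comparison is "greater than" rather than "greater than or equal to."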

Take a look at the dataset and confirm that the new column equals 1 when the `decile_score` is greater than 4 and equals 0 otherwise.

In [None]:
pp_data

### Accuracy, FPR, and FNR

Recall that predictive models use algorithms to predict a target feature (in this case, whether or not the defendant will recidivate). We can evaluate how well a model predicts these features using **goodness metrics.** Goodness metrics typically summarize discrepancies between actual values and predicted values. Metrics like these are quite important in data science and are typically used by data scientists to determine how accurate and effective an algorithm is at predicting its predetermined goal.

In our case, we want to examine the performance of `COMPAS_prediction` as a predictor of `two_year_recid`. 
We'll start with simply calculating the **accuracy** of the model, defined as the number of correct predictions divided by the total number of predictions.

(More information about how to calculate the accuracy and other metrics we will discuss below can be found [here](https://en.wikipedia.org/wiki/Confusion_matrix).)

In [None]:
correct_predictions = pp_data[pp_data['two_year_recid'] == pp_data['COMPAS_prediction']]
accuracy = len(correct_predictions) / len(pp_data)
print(f"accuracy = {accuracy}")

Again, let's unpack the code above:

* In the first line, `pp_data['two_year_recid'] == pp_data['COMPAS_prediction']` is like a new column that is set to `True` if `two_year_recid` and `COMPAS_prediction` are equal and is set to `False` otherwise. (`==` is Python for "is equal to." `!=` means "is not equal to.")
* Writing `pp_data[*a column that contains True and False*]` means "only take the rows for which the column equals `True`."
* So the first line creates a new dataset called `correct_predictions`, which is the original dataset filtered so that it only includes rows in which `two_year_recid` and `COMPAS_prediction` are equal.
* The second line counts the number of rows in the `correct_predictions` dataset and divides it by the number in the original dataset to get the accuracy.

The calculation reveals that the overall accuracy is 0.658, or 65.8%. 
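
Here is the same filter-then-count idea on a five-row made-up table, so you can check each step by eye. The table `toy` and its column names are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values): what happened vs. what was predicted.
toy = pd.DataFrame({
    "actual":     [0, 1, 1, 0, 1],
    "prediction": [0, 1, 0, 0, 0],
})

# True where the prediction matches what actually happened.
matches = toy["actual"] == toy["prediction"]

# Keep only the rows where `matches` is True, then count them.
correct = toy[matches]
toy_accuracy = len(correct) / len(toy)

# Shortcut: the average of a True/False column is the fraction of Trues.
shortcut = matches.mean()

print(toy_accuracy, shortcut)
```

Three of the five predictions match, so both calculations give the same accuracy.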

#### Questions

**Question 4a**

Do you think that 65.8% is a reasonable accuracy for a model such as COMPAS?

**Answer:** *YOUR ANSWER HERE*

**Question 4b**

Write code to calculate the accuracy *only* for Black defendants. Set this equal to a variable called `accuracy_black`. 

Write code to calculate the accuracy *only* for white defendants. Set this equal to a variable called `accuracy_white`.

*Hint 1*: Create new datasets called `pp_data_black` and `pp_data_white` that contain only the rows of `pp_data` whose `race` column equals "African-American" or "Caucasian", respectively. (That is, create new datasets that *filter* by race.) Then perform the accuracy calculation on each of those datasets instead of on `pp_data`. The code block below gets you started.

*Hint 2*: Remember, to create a new dataset that filters by a certain condition, you can use code like:

```
pp_data[pp_data["age"] >= 25]
pp_data[pp_data["score_text"] == "Medium"]
pp_data[pp_data["two_year_recid"] == pp_data["COMPAS_prediction"]]
```

*Hint 3*: You should find that the accuracy for Black defendants is 0.6491. If not, something is wrong.

In [None]:
pp_data_black = [...]
correct_predictions_black = [...]
accuracy_black = [...]

In [None]:
pp_data_white = [...]
correct_predictions_white = [...]
accuracy_white = [...]

So far, the model seems pretty fair, in the sense that the accuracy is similar for Black and white defendants. 

Let's dig a little deeper and look at **false positives (FP)** and **false negatives (FN)**. We'll need to know the numbers of false positives, true positives, false negatives, and true negatives—overall, and for Black defendants, and for white defendants. We could write code for all of these...but fortunately there is a shortcut: We can use the `sklearn` function `confusion_matrix()`. 

Each defendant has a value for `COMPAS_prediction` (either 0 or 1) and a value for `two_year_recid` (either 0 or 1). A *confusion matrix* shows how these values line up with each other; that is, it shows the true and false positives and negatives. Because we pass `two_year_recid` as the first argument to `confusion_matrix()` below, the **rows** represent the number of observations for which the **actual** value is 0 or 1, and the **columns** represent the number of observations for which the **prediction** is 0 or 1.



In [None]:
cm = confusion_matrix(pp_data['two_year_recid'], pp_data['COMPAS_prediction'])
cm

The results tell us that there are 1872 true negatives (prediction and actual = 0), 923 false positives (prediction = 1, actual = 0), etc.

We can store these values in separate variables using the `ravel()` function, which flattens the matrix into individual values:

In [None]:
TN, FP, FN, TP = cm.ravel()
print(f"There are:\n{TN} true negatives\n{FP} false positives\n{FN} false negatives\n{TP} true positives")
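
To check which cell of the matrix is which, it can help to run `confusion_matrix()` on a handful of made-up values where the counts are easy to do by hand. The lists below are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Toy data (hypothetical values).
toy_actual    = [0, 0, 0, 1, 1, 1, 1]
toy_predicted = [0, 1, 1, 1, 1, 1, 0]

# First argument: actual values (rows). Second: predictions (columns).
toy_cm = confusion_matrix(toy_actual, toy_predicted)
print(toy_cm)
# Layout:
# [[TN FP]
#  [FN TP]]

# ravel() reads the matrix row by row: TN, FP, FN, TP.
tn, fp, fn, tp = toy_cm.ravel()
print(tn, fp, fn, tp)
```

Of the three people who did not recidivate, one was predicted correctly (a true negative) and two were not (false positives); of the four who did, three were predicted correctly (true positives) and one was not (a false negative).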

Let's dig a little deeper into these numbers. 



In [None]:
total = TN + FP + FN + TP
print(f"{(TN + FN)/total} predicted not to recidivate")
print(f"{(TP + FP)/total} predicted to recidivate")
print(f"{(TN + FP)/total} did not recidivate")
print(f"{(TP + FN)/total} did recidivate")

First, 52.2% of defendants were predicted not to recidivate (these are the true negatives and false negatives) and 47.8% were predicted to recidivate (true and false positives). This is pretty close to the actual percentages who did not recidivate (true negatives and false positives), 53.0%, and who did recidivate (true positives and false negatives), 47.0%.

In [None]:
print(f"Overall accuracy = {(TN + TP)/total}")

On the other hand, the overall accuracy is relatively low: 65.8%. (We also calculated this earlier.)

ProPublica used the **false positive rate (FPR)** and **false negative rate (FNR)** as their metrics to understand and quantify fairness.

In terms of COMPAS, the FPR is the proportion (i.e., the percentage, but expressed as a decimal) of non-recidivating defendants who were predicted to recidivate by the model. Mathematically,
$$\text{FPR} = \frac{\text{FP}}{\text{FP}+\text{TN}},$$
where $\text{FP}$ is the number of false positives (people who were predicted to recidivate but did not) and $\text{TN}$ is the number of true negatives (people who were predicted not to recidivate and did not).

In contrast, the FNR is the proportion of recidivating defendants who were predicted not to recidivate by the model:
$$\text{FNR} = \frac{\text{FN}}{\text{FN}+\text{TP}},$$
where $\text{FN}$ is the number of false negatives (people who were predicted not to recidivate but did) and $\text{TP}$ is the number of true positives (people who were predicted to recidivate and did).

In [None]:
FPR = FP / (FP + TN)
FNR = FN / (TP + FN)
print(f"Among all non-recidivating defendants, {FPR} were predicted (incorrectly) to recidivate")
print(f"Among all recidivating defendants, {FNR} were predicted (incorrectly) not to recidivate")

So, among all defendants who did not recidivate, 33.0% were falsely predicted to recidivate (the FPR); and among all defendants who did recidivate, 35.5% were falsely predicted not to recidivate (the FNR). The fact that these are roughly equal is another relatively good sign.
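
One thing that is easy to miss is that the FPR and FNR use different denominators: the FPR divides by everyone who did *not* recidivate, and the FNR divides by everyone who *did*. The sketch below uses made-up counts and a hypothetical helper function, `error_rates()`, to make that concrete.

```python
def error_rates(tn, fp, fn, tp):
    """Return (FPR, FNR) from the four confusion-matrix counts."""
    fpr = fp / (fp + tn)  # denominator: everyone who did NOT recidivate
    fnr = fn / (fn + tp)  # denominator: everyone who DID recidivate
    return fpr, fnr

# Toy counts (hypothetical): 100 non-recidivists and 100 recidivists.
toy_fpr, toy_fnr = error_rates(tn=80, fp=20, fn=30, tp=70)
print(toy_fpr, toy_fnr)

# The true positive rate is what is left over from the FNR.
toy_tpr = 1 - toy_fnr
print(toy_tpr)
```

Because the two rates are computed over different groups of people, they do not have to add up to 1, and one can be high while the other is low.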

### Does COMPAS Overpredict or Underpredict across Groups?

#### Questions

**Question 5a**

Calculate the FPR and FNR for Black defendants only.  

*Hint*: When we calculated FPR and FNR for *all* defendants, we used 3 steps:
1. Calculate the confusion matrix `cm`.
2. Extract `TN`, `FP`, `FN`, and `TP` using `ravel()`.
3. Calculate `FPR` and `FNR` from these.

Do the same 3 steps, but start with the dataset `pp_data_black` that you created earlier, instead of using `pp_data`.

In [None]:
[...]

**Question 5b**

Do the same thing, but for white defendants only.

In [None]:
[...]

Your results should roughly align with the table in the ProPublica article titled "Prediction Fails Differently for Black Defendants". They may differ a bit, due to small differences in the way the dataset was filtered.

**Question 5c**

What do you notice about these percentages? Are they different across racial groups?

**Answer:** *YOUR ANSWER HERE*

**Question 5d**

What can you conclude from these metrics about the *overprediction* of risk scores for Black and white defendants? 

**Answer:** *YOUR ANSWER HERE*

**Question 5e**

What can you conclude from these metrics about the *underprediction* of risk scores for Black and white defendants?

**Answer:** *YOUR ANSWER HERE*

### The ROC Curve

Let's plot the ROC curve for the COMPAS model. Recall that the ROC curve plots the TPR (the true positive rate, defined as $\text{TP}/(\text{TP}+\text{FN})$) vs. the FPR at different values of the threshold. A predictive model that is very good at predicting will have its ROC curve pushed up and to the left, whereas a poor model will be closer to the diagonal line that goes from the bottom-left to top-right.
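
To make "at different values of the threshold" concrete, here is a sketch that computes a few points of an ROC curve by hand on six made-up defendants. The arrays and the helper `roc_point()` are hypothetical. The sketch uses "score greater than threshold" to match this notebook's convention; `sklearn` uses a slightly different convention internally.

```python
import numpy as np

# Toy data (hypothetical values): 1 = recidivated, 0 = did not.
toy_actual = np.array([0, 0, 1, 0, 1, 1])
toy_scores = np.array([1, 3, 4, 6, 7, 9])

def roc_point(threshold):
    """Return (FPR, TPR) when scores above `threshold` are predicted positive."""
    predicted = toy_scores > threshold
    tp = np.sum(predicted & (toy_actual == 1))
    fp = np.sum(predicted & (toy_actual == 0))
    tpr = tp / np.sum(toy_actual == 1)
    fpr = fp / np.sum(toy_actual == 0)
    return fpr, tpr

# Each threshold gives one point on the ROC curve.
for t in [2, 4, 8]:
    fpr, tpr = roc_point(t)
    print(f"threshold = {t}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

Raising the threshold lowers both rates at once: fewer people are flagged, so there are fewer false positives but also fewer true positives.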


`sklearn` has built-in functions, `roc_curve()` and `roc_auc_score()`, to do the calculations for us.

In [None]:
# Calculate ROC curve data.
FPR, TPR, _ = roc_curve(pp_data['two_year_recid'], pp_data['decile_score'])
# Calculate AUC.
AUC = roc_auc_score(pp_data['two_year_recid'], pp_data['decile_score'])
# Plot curve.
plt.plot(FPR, TPR)
# Diagonal reference line from (0, 0) to (1, 1).
plt.plot([0, 1], [0, 1], 'r--')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title(f'ROC Curve: COMPAS\n AUC: {AUC:.4f}');

An AUC of 0.71 is borderline in terms of whether it would be considered acceptable. The low AUC reflects how much the COMPAS scores of defendants who recidivated overlap with the scores of those who did not: if the two groups' scores were more cleanly separated, the ROC curve would be pushed upward and to the left, and the AUC would be larger.
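
One way to build intuition for the AUC: it equals the probability that a randomly chosen person who recidivated has a higher score than a randomly chosen person who did not (with ties counting as half). The sketch below checks this on six made-up defendants; the arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data (hypothetical values): 1 = recidivated, 0 = did not.
toy_actual = np.array([0, 0, 1, 0, 1, 1])
toy_scores = np.array([1, 3, 4, 6, 7, 9])

pos = toy_scores[toy_actual == 1]  # scores of those who recidivated
neg = toy_scores[toy_actual == 0]  # scores of those who did not

# Compare every (recidivated, did-not-recidivate) pair of scores.
wins = 0.0
for p in pos:
    for n in neg:
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
pairwise_auc = wins / (len(pos) * len(neg))

sklearn_auc = roc_auc_score(toy_actual, toy_scores)
print(pairwise_auc, sklearn_auc)
```

The two numbers agree. An AUC of 0.5 would mean the scores rank people no better than a coin flip, and an AUC of 1.0 would mean every person who recidivated scored higher than every person who did not.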

### Choosing a Threshold

---
> 🛑 **STOP!** This section is **entirely optional.** If you've been enjoying the coding so far and want to keep going, go for it. (The coding that comes next isn't particularly easier or harder than what you've done already.) If not, skip to the section on "How to Submit this Notebook".
---

In Question 2d, you built a histogram of the COMPAS scores for all defendants. Then, we built a histogram that separates the data by race.

Now let's separate the data in a different way, based on whether or not the defendant recidivated. The plot below shows the frequency of decile scores (COMPAS scores) for defendants who did not recidivate (purple bars) and for those who did (green bars).

In [None]:
sns.histplot(pp_data[pp_data["two_year_recid"] == 0], x="decile_score", discrete=True, color='purple', alpha=0.5)
sns.histplot(pp_data[pp_data["two_year_recid"] == 1], x="decile_score", discrete=True, color='green', alpha=0.5)
plt.legend(labels=["did not recidivate", "did recidivate"])
plt.axvline(threshold+0.5, color='k'); # (threshold was assigned above)

In the aggregate, the model assigned lower decile scores to defendants who wound up not recidivating and higher decile scores to defendants who did. That's good. But there is a lot of overlap between the two groups of bars, suggesting that the model is not a particularly accurate predictor.

COMPAS basically draws a line right after a specific decile score and says "anybody above this line is predicted to recidivate; anybody below it is predicted not to." Northpointe set this line right after 4. This is called a **threshold model.**

Let's calculate our own threshold model. First we'll set the threshold to 4 to confirm that we get the same results as COMPAS.

#### Questions

**Question 6a**

Create a new variable called `my_threshold` and set it equal to 4.


In [None]:
[...]

**Question 6b** 

Write code to add a new column called `my_prediction` to `pp_data`; the values in the column should be set to 1 if the corresponding value of `decile_score` is greater than `my_threshold` and should be set to 0 otherwise.

*Hint*: Look back at the section titled "The COMPAS Predictions."

(You might want to view some rows of the new dataset to make sure your new column is set correctly.)

In [None]:
[...]

Here's a quick check to make sure that `my_prediction` currently agrees with `COMPAS_prediction`. (It should, since we're using the same threshold at the moment.)

In [None]:
# Does `COMPAS_prediction` column equal `my_prediction` column?
pp_data['COMPAS_prediction'].equals(pp_data['my_prediction'])

Here's another histogram. (It's the same as the histogram we plotted above, but the line is drawn at your threshold instead of Northpointe's.)

In [None]:
sns.histplot(pp_data[pp_data["two_year_recid"] == 0], x="decile_score", discrete=True, color='purple', alpha=0.5)
sns.histplot(pp_data[pp_data["two_year_recid"] == 1], x="decile_score", discrete=True, color='green', alpha=0.5)
plt.legend(labels=["did not recidivate", "did recidivate"])
plt.axvline(my_threshold+0.5, color='k'); 

**Question 6c**

For the model that uses `my_threshold`, calculate the FPR and FNR separately for Black and white defendants.

*Hint*: When you added the new column to `pp_data`, it did *not* automatically add this column to your datasets `pp_data_black` and `pp_data_white`. You'll have to add it yourself. You can either add it the same way you added it to `pp_data`, or you can create new versions of `pp_data_black` and `pp_data_white` based on the new `pp_data`, like you did earlier.

In [None]:
[...]

**Question 6d**

Now try other thresholds. Experiment with the value of `my_threshold`. For each value of `my_threshold` that you try, calculate the FPR and FNR. 

What value of `my_threshold` would you recommend, if you were designing the COMPAS system? For the value you choose, report the FPR and FNR.

**Answer:** *YOUR ANSWER HERE*

---

## How to Submit this Notebook

* Please click the "Share" button at the top of the page. In the Share options:

    * Under "General Access", change to "Restricted" (instead of "Anyone with the link").
    * At the top, share it with Oumaima (ous219@lehigh.edu), Larry (lvs2@lehigh.edu), and Suzanne (sme6@lehigh.edu).
    * Click "Copy Link".
    * Click "Done".

* Send a Slack DM to Oumaima, Larry, Suzanne, and your partner with the link you copied.
