
# **Law, Order, and Algorithms**
# Algorithmic fairness — Part 1 of 2

In this lab, we'll build a risk assessment algorithm to inform pretrial detention decisions, and examine its accuracy and equity.

This exercise is predicated on the assumption that it is acceptable to detain (at least some) individuals pretrial.
Many, though, argue that (nearly) all defendants should be released on their own recognizance, in part because they have not yet been convicted of a crime. Further, even modest bail requirements can impose disproportionate burdens on the most vulnerable members of society. Some jurisdictions have taken steps toward reforming the pretrial process, including [ending cash bail](https://www.npr.org/2021/02/22/970378490/illinois-becomes-first-state-to-eliminate-cash-bail), though pretrial detention is still the norm rather than the exception.

**Getting started**

Before you start, create a copy of this Jupyter notebook in your own Google Drive by clicking `Copy to Drive` in the menubar. If you do not do this, your work will not be saved!

Remember to save your work frequently by pressing command-S or clicking File > Save in the menubar.

We recommend completing this problem set in Google Chrome.

Run the cell below to load the `tidyverse` library and set some formatting options.

In [None]:
# Some initial setup

# load libraries
library(tidyverse)

# Set some formatting options
options(digits = 3,
  repr.matrix.max.rows = 10,
  repr.matrix.max.cols = 100,
  repr.plot.width = 8,
  repr.plot.height = 6)
theme_set(theme_bw())

## Background

In 2016, ProPublica published a [now-famous article](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) criticizing the fairness of COMPAS, a risk assessment tool used nationwide. Here, we will take a look at a cleaned-up version of the COMPAS data that ProPublica used, and try to better understand algorithmic fairness by investigating the claims ProPublica made, along with the [counterclaims](https://www.propublica.org/article/technical-response-to-northpointe) made by Northpointe (now re-branded as [Equivant](https://www.equivant.com/)).

While Northpointe notes that their algorithm does not use race information and that their model is _calibrated_ across racial groups, ProPublica points out that the COMPAS scores differ in false positive rates across racial groups (violating error-rate parity) and result in detaining relatively more Black defendants than white defendants. In this notebook, we will examine some of their claims by building and evaluating our own risk assessment tool.
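
To make the error-rate idea concrete, here is a minimal sketch of how a false positive rate could be computed separately for each group. The data and the column names (`group`, `flagged`, `reoffended`) are hypothetical and are not part of the COMPAS data.

```r
library(dplyr)

# Hypothetical toy data: `flagged` marks people predicted to be high risk,
# `reoffended` is the observed outcome. These are not COMPAS columns.
toy <- tibble(
  group      = c("A", "A", "A", "A", "B", "B", "B", "B"),
  flagged    = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  reoffended = c(1, 0, 0, 0, 1, 0, 0, 1)
)

# False positive rate: among people who did NOT reoffend,
# the share who were nonetheless flagged as high risk.
fpr_by_group <- toy %>%
  filter(reoffended == 0) %>%
  group_by(group) %>%
  summarize(false_positive_rate = mean(flagged), .groups = "drop")

fpr_by_group
```

In this toy example, one of the three non-reoffenders in group A is flagged, while neither of the two non-reoffenders in group B is, so the two groups have different false positive rates.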

## COMPAS data

We'll be working with publicly available COMPAS data collected and released by ProPublica. Run the cell below to load the data.

In [None]:
# Read the data
fname <- "https://github.com/5harad/DPI-617/blob/main/data/compas.rds?raw=true"
compas_df <- readRDS(url(fname))

head(compas_df)

A cleaned version of the COMPAS data is loaded as `compas_df`, with the following columns:

* `id`: unique identifiers for each case
* `sex`, `dob`, `age`, `race`: demographic information for each defendant
* `recid_score`, `violence_score`: COMPAS scores assessing risk that a defendant will recidivate (`violence_score` for violent crimes) within two years of release (higher scores correspond to higher risk)
* `priors_count`: number of prior arrests
* `is_recid`, `is_violent_recid`: indicator variables equal to `1` if the defendant was arrested for a new (violent) crime within two years of release, and `0` otherwise

### Exercise 1: Build a risk assessment model for recidivism

We start by building our own risk assessment tool using only prior arrests (`priors_count`) and age (`age`) to predict whether a defendant will recidivate within two years of release (`is_recid`).
First, fit a model to estimate the probability of this outcome for each defendant.
We will call this model `recid_model`.

Hint: Remember that the general syntax for fitting a regression model is `lm(OUTCOME ~ FACTOR1 + FACTOR2, data = DATA)`.
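
As an aside, a linear regression fit to a 0/1 outcome can produce estimates below 0 or above 1, whereas a logistic regression keeps them strictly between 0 and 1. The sketch below compares the two on simulated data; the variables `x` and `y` are hypothetical, and `glm` is shown only as a common alternative, not as the approach this exercise requires.

```r
# Simulated data; `x` and `y` are hypothetical names, not COMPAS columns.
set.seed(1)
sim <- data.frame(x = rnorm(500))
sim$y <- rbinom(500, size = 1, prob = plogis(-0.5 + 2 * sim$x))

# Linear probability model: ordinary least squares on the 0/1 outcome.
linear_fit <- lm(y ~ x, data = sim)

# Logistic regression: models the log-odds of the outcome.
logistic_fit <- glm(y ~ x, data = sim, family = binomial)

# Fitted values from the linear model may fall outside [0, 1].
linear_pred <- predict(linear_fit)
range(linear_pred)

# type = "response" returns probabilities rather than log-odds.
logistic_pred <- predict(logistic_fit, type = "response")
range(logistic_pred)
```

Because both models here are increasing in `x`, they rank individuals identically, so the choice matters more for the raw estimates than for rank-based bins.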

In [None]:
# Build a regression model estimating recidivism probability

# WRITE CODE HERE

# Fit a linear regression of the 0/1 outcome on priors and age
# (a linear probability model, matching the lm() hint above)
recid_model <- lm(is_recid ~ priors_count + age, data = compas_df)



Run the following cell to inspect your fitted model. How do you interpret the results?

In [None]:
summary(recid_model)

Run the code below to generate recidivism predictions, based on your model above. In addition to the raw probability estimates, we'll bin the risk scores into risk deciles, similar to COMPAS scores.
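
Before running the cell, it may help to see how `ntile` assigns bins. The sketch below uses a handful of made-up scores (not model output) and five bins instead of ten, so that the result is easy to read.

```r
library(dplyr)

# Hypothetical scores, used only to illustrate the binning.
scores <- c(0.05, 0.90, 0.30, 0.55, 0.10, 0.75, 0.20, 0.65, 0.40, 0.85)

# ntile() ranks the values and splits them into n roughly equal-sized bins;
# bin 1 holds the lowest scores and bin n the highest.
bins <- ntile(scores, 5)

# Show each score next to its bin, sorted from lowest to highest score.
data.frame(score = scores, bin = bins) %>%
  arrange(score)
```

With ten scores and five bins, each bin receives exactly two scores. Note that the bins are defined by rank, so a bin number says where a score falls relative to the others, not how large the score is in absolute terms.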

In [None]:
# Now we'll generate predictions for everyone in our dataset
compas_df <- compas_df %>%
    mutate(
      risk = predict(recid_model),
      risk_level = ntile(risk, 10)
    )

head(compas_df)

### Exercise 2: Calibration

Northpointe argued in part that its model was _fair_ because it was generally _well-calibrated_ across different race groups (i.e., for people who received similar risk scores, the actual rate of recidivism was similar across race groups).

Check the calibration of your model by computing the empirical `recidivism_rate` (based on `is_recid`) for each risk level and race group. In the end you should have a data frame with three columns:
`race`, `risk_level`, and `recidivism_rate`.

Hint: first use `group_by` and then use `summarize`.
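
To illustrate what a calibration check looks for, here is a minimal sketch on made-up data in which the two groups agree in one score bin and disagree in the other. The data frame and its columns (`group`, `bin`, `outcome`) are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; column names are illustrative, not COMPAS columns.
toy <- tibble(
  group   = rep(c("A", "B"), each = 6),
  bin     = rep(c(1, 1, 1, 2, 2, 2), times = 2),
  outcome = c(0, 0, 1, 1, 1, 0,   # group A
              0, 0, 1, 1, 1, 1)   # group B
)

# Observed outcome rate within each (bin, group) cell.
toy_calibration <- toy %>%
  group_by(bin, group) %>%
  summarize(outcome_rate = mean(outcome), .groups = "drop")

toy_calibration
```

In bin 1 both groups have the same observed rate (1/3), which is what calibration across groups requires. In bin 2 the rates differ (2/3 for group A versus 1 for group B), so the same score would mean something different depending on group.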

In [None]:
calibration_by_race <- compas_df %>%
    # WRITE CODE HERE
    group_by(race, risk_level) %>%
    summarize(recidivism_rate = mean(is_recid), .groups = "drop")

# Show the first 20 rows
print(calibration_by_race, n = 20)


We can visualize model calibration by plotting the risk score bins against their corresponding empirical recidivism rates, using the code below.

In [None]:
# Calibration plot
ggplot(calibration_by_race,
       aes(x = risk_level, y = recidivism_rate, color = race)) +
    geom_line() + geom_point() +
    scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
    scale_x_continuous(breaks = 1:10) +
    labs(x = "\nRisk level",
         y = "Recidivism rate\n")

How do you interpret the plot above? Do you think it's important for the model to be calibrated? What if the model weren't calibrated?

### Exercise 3: Disparities in detention

Part of the objection to COMPAS is that Black defendants are more likely to be detained than white defendants.

Examine detention rates by first selecting a detention threshold, and then computing the proportion of people above that threshold in each race group.

Hint: To compute detention rates, first compute a new column (using `mutate`) that indicates whether an individual is above the detention threshold. Then `group_by` `race` and `summarize`.

In [None]:
# Calculate detention rate by race

# With a detention threshold of 6, approximately half
# of individuals are detained
detention_threshold <- 6

# WRITE CODE HERE
compas_df %>%
    mutate(detained = risk_level >= detention_threshold) %>%
    group_by(race) %>%
    summarize(
        detention_rate = mean(detained)
    )

Run the cell below to help you visualize detention rates.

In [None]:
# Plot the risk distribution
options(repr.plot.width = 12, repr.plot.height = 6)

compas_df %>%
  count(race, risk_level) %>%
  group_by(race) %>%
  mutate(p = n/sum(n)) %>%
  ungroup() %>%
  ggplot() +
    geom_col(aes(x = risk_level, y = p)) +
    scale_x_continuous("Risk level", breaks = 1:10) +
    scale_y_continuous(name = NULL, labels = scales::percent_format()) +
    geom_vline(xintercept = detention_threshold - 0.5, color = "red", size = 2) +
    facet_wrap(~race)

What do you think of the results above? Do you think differences in detention rates are a reasonable measure of your risk assessment algorithm's _fairness_? Do you think the algorithm leads to _disparate treatment_? What about _disparate impact_?

### Exercise 4: Equalizing detention rates

After observing differences in the detention rates above, one might seek a policy that equalizes detention rates across race groups.

In this exercise, set different thresholds for Black and white defendants to achieve comparable detention rates across groups.
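
One way to approach this search is to tabulate the detention rate for every candidate threshold in each group, then pick the pair of thresholds whose rates are closest. The sketch below does this on simulated risk levels for two hypothetical groups; the data, the group labels, and the `rates` table are assumptions for illustration and do not use the COMPAS data.

```r
# Simulated risk levels for two hypothetical groups; group "B" skews higher.
set.seed(42)
toy <- data.frame(
  group = rep(c("A", "B"), each = 1000),
  risk_level = c(sample(1:10, 1000, replace = TRUE, prob = 10:1),
                 sample(1:10, 1000, replace = TRUE, prob = 1:10))
)

# Every combination of group and candidate threshold.
rates <- expand.grid(group = c("A", "B"), threshold = 1:10,
                     stringsAsFactors = FALSE)

# Detention rate: share of the group at or above the threshold.
rates$detention_rate <- unname(mapply(
  function(g, t) mean(toy$risk_level[toy$group == g] >= t),
  rates$group, rates$threshold
))

rates
```

Reading down the table for each group shows how quickly the detention rate falls as the threshold rises. Because thresholds on a ten-level scale are coarse, the rates can usually be made comparable but not exactly equal.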

In [None]:
# Modify the detention thresholds below to equalize
# detention rates across race groups.
black_threshold <- 8
white_threshold <- 6

# Calculate detention rate by race
compas_df %>%
    mutate(detained = risk_level >= if_else(race == "Caucasian", white_threshold, black_threshold)) %>%
    group_by(race) %>%
    summarize(
        detention_rate = mean(detained)
    )


What are the threshold values that you find? Do you think this policy is more or less _fair_ than the single-threshold policy above? Does a multiple-threshold policy create disparate treatment? What about disparate impact?