
# **Law, Order, and Algorithms**
# Algorithmic fairness: Part 2 of 2

Today, we will continue building and evaluating our own risk assessment tool using the COMPAS data to examine some additional aspects of fairness. We'll look specifically at gender disparities and at error rates.

**Getting started**

Before you start, create a copy of this Jupyter notebook in your own Google Drive by clicking `Copy to Drive` in the menubar. If you do not do this, your work will not be saved!

Remember to save your work frequently by pressing command-S or clicking File > Save in the menubar. 

We recommend completing this problem set in Google Chrome.

Run the cell below to load the `tidyverse` library and set some formatting options.

In [None]:
# Some initial setup

# load libraries
library(tidyverse)

# Set some formatting options
options(digits = 3, 
  repr.matrix.max.rows = 10, 
  repr.matrix.max.cols = 100, 
  repr.plot.width = 8, 
  repr.plot.height = 6)
theme_set(theme_bw())

## COMPAS Data and risk model revisited

As last time, we'll be working with publicly available COMPAS data collected and released by ProPublica. 

Run the cell below to load the data. The code also recreates the risk model we developed in the first part of the lab.

In [None]:
# Read the data
fname <- "https://github.com/5harad/DPI-617/blob/main/data/compas.rds?raw=true"
compas_df <- readRDS(url(fname))

# Fit a recidivism risk model
recid_model <- lm(is_recid ~ priors_count + age, data = compas_df)

# Generate predictions
compas_df <- compas_df %>%
    mutate(
      risk = predict(recid_model),
      risk_level = ntile(risk, 10)
    )

head(compas_df)

Recall that the cleaned version of the COMPAS data is loaded as `compas_df`, with the following columns:

* `id`: unique identifiers for each case
* `sex`, `dob`, `age`, `race`: demographic information for each defendant
* `recid_score`, `violence_score`: COMPAS scores assessing the risk that a defendant will recidivate (`violence_score` for violent crimes) within two years of release (higher scores correspond to higher risk)
* `priors_count`: number of prior arrests
* `is_recid`, `is_violent_recid`: Indicator variable that is `1` if the defendant was arrested for a new (violent) crime within two years of release, and `0` otherwise.

After fitting our model, we added the following columns:

* `risk`: the model-estimated probability of recidivism
* `risk_level`: an integer risk score between 1 and 10
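
As a reminder of how `ntile` turns continuous risk estimates into levels, here is a minimal sketch on made-up numbers (the `toy_risk` values are hypothetical and not taken from the COMPAS data). `ntile(x, n)` ranks the values and splits them into `n` roughly equal-sized buckets, so a `risk_level` reflects a defendant's rank relative to the other defendants rather than a fixed probability cutoff.

```r
library(dplyr)

# Hypothetical risk estimates for ten people (not real COMPAS values)
toy_risk <- c(0.10, 0.50, 0.30, 0.90, 0.70, 0.20, 0.80, 0.40, 0.60, 0.95)

# Split into 5 rank-based buckets; with 10 values, each bucket gets 2 people
toy_bucket <- ntile(toy_risk, 5)

# Count how many people fall in each bucket
table(toy_bucket)
```

In the lab itself, `ntile(risk, 10)` does the same thing with ten buckets, which is why each risk level contains about a tenth of the defendants.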

### Exercise 1: Calibration by gender

Last week we examined how our recidivism prediction model performed for different racial groups, and it turned out our model was well calibrated for white and Black defendants. We will now examine the calibration of our model by gender.

Calculate recidivism rates for male and female defendants in our dataset by creating a data frame called `calibration_by_gender` containing three columns: `sex`, `risk_level`, and `recidivism_rate`.

Once you complete the code below, run the subsequent cell to visualize your results. Based on the plot, do you think your gender-blind risk assessment model is "fair"?

In [None]:
# Compute the recidivism rate for each risk level,
# separately for men and women.

calibration_by_gender <- compas_df %>%
# WRITE CODE HERE
# START solution
  group_by(sex, risk_level) %>%
  summarize(
    recidivism_rate = mean(is_recid)
  )
# END solution

# output your results
calibration_by_gender

In [None]:
# Calibration plot
ggplot(calibration_by_gender, 
       aes(x = risk_level, y = recidivism_rate, color = sex)) +
    geom_line() + geom_point() +
    scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
    scale_x_continuous(breaks = 1:10) + 
    labs(x = "\nRisk level",
         y = "Recidivism rate\n")

### Exercise 2: Creating gender-specific risk scores

The plot above shows a roughly 1-point gap in calibration between men and women. For example, male defendants who were given a score of `4` recidivated at about the same rate as female defendants who were given a score of `5`. In part because the model is "blind" to gender, women have lower actual recidivism risk compared to their male counterparts who have the same nominal risk score.

One way to address this gap is to subtract a point from the risk scores of women (some judges have told us that they informally do this). Create a new column called `gendered_risk_level` that adjusts the risk scores in this way. Afterwards, run the subsequent cell to generate the calibration plot. Discuss the results. Do you think the gender-aware or the gender-blind model is more "fair"?

_Hint:_ Use `mutate` to create the new column. When creating the new gendered risk level, you can start with the original `risk_level` and then subtract `1` when `sex == "Female"`. Remember that `R` interprets `TRUE` as being the same as `1` in arithmetic.
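
To see the logical-arithmetic trick in isolation, here is a small sketch using made-up vectors (`toy_sex` and `toy_score` are hypothetical names used only for illustration). It relies only on base R.

```r
# Hypothetical inputs for illustration only
toy_sex   <- c("Male", "Female", "Female", "Male")
toy_score <- c(4, 5, 7, 9)

# The comparison yields TRUE/FALSE; in arithmetic, TRUE counts as 1 and FALSE as 0
toy_sex == "Female"

# So this subtracts 1 only in the positions where the comparison is TRUE
toy_adjusted <- toy_score - (toy_sex == "Female")
toy_adjusted
```

The same expression works inside `mutate`, where it is applied to whole columns at once.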

In [None]:
# Create gender-specific risk scores

# WRITE CODE HERE
compas_df <- compas_df %>%
# START solution
  mutate(
    gendered_risk_level = risk_level - (sex == 'Female')
  )
# END solution

# output your results
compas_df

In [None]:
# Calibration plot
compas_df %>%
  group_by(sex, gendered_risk_level) %>%
  summarize(recidivism_rate = mean(is_recid)) %>%
  ggplot(aes(x = gendered_risk_level, y = recidivism_rate, color = sex)) +
    geom_line() + geom_point() +
    scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
    scale_x_continuous(breaks = 1:10) + 
    labs(x = "\nGendered risk level",
         y = "Recidivism rate\n")

### Exercise 3: Comparing gender-specific and gender-blind models

Now let's compare our gender-specific and gender-blind models by computing the number of men and women detained at a risk threshold of 6 or above. 

_Hint:_ For each gender group, compute the `sum` of people with a (gendered) risk level of 6 or more.
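
Because `TRUE` counts as `1`, taking the `sum` of a comparison counts how many values satisfy it, and taking the `mean` gives the proportion. A minimal sketch on a made-up vector (`toy_levels` is hypothetical):

```r
# Hypothetical risk levels for five people
toy_levels <- c(3, 6, 8, 5, 10)

# Number of people at or above the threshold of 6
n_flagged <- sum(toy_levels >= 6)

# Proportion of people at or above the threshold
p_flagged <- mean(toy_levels >= 6)

c(n_flagged = n_flagged, p_flagged = p_flagged)
```

Here three of the five values (6, 8, and 10) meet the threshold, so the count is 3 and the proportion is 0.6.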

In [None]:
# Calculate number of men and women detained for gender-specific and gender-blind models

# WRITE CODE HERE
compas_df %>%
# START solution
  group_by(sex) %>%
  summarize(
    n_detained = sum(risk_level >= 6),
    n_gender_specific_detained = sum(gendered_risk_level >= 6)
  )
# END solution

By incorporating gender into the risk score, we are able to obtain a calibrated model with fewer women detained. However, by explicitly using gender, we violate the anti-classification principle. What do you think of this approach?

### Exercise 4: False positive rates

When ProPublica audited the COMPAS risk assessment tool, one of its most prominent critiques was that COMPAS had higher error rates for Black defendants than for white defendants. ProPublica (and others) have specifically considered _false positive rates_ (FPR): among individuals who did not ultimately recidivate, the proportion deemed high risk by the algorithm.

Compute the FPR for white and Black defendants in our dataset, assuming that those with a risk score of at least 6 are flagged as "high risk".

_Hint:_ Use `filter` to first restrict to those who did not recidivate, and then, for each race group, compute the proportion with `risk_level >= 6`.
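
The definition of the false positive rate can be sketched on a tiny made-up data frame before applying it to the real data. Everything in `toy_df` below is hypothetical (the group labels, outcomes, and risk levels are invented for illustration), and the threshold of 6 matches the one used in this lab.

```r
library(dplyr)

# Hypothetical toy data (not from COMPAS): group, observed outcome, risk level
toy_df <- data.frame(
  group      = c("A", "A", "A", "A", "B", "B", "B", "B"),
  is_recid   = c(0, 0, 1, 0, 0, 0, 1, 0),
  risk_level = c(7, 3, 8, 2, 6, 9, 4, 3)
)

toy_fpr <- toy_df %>%
  filter(is_recid == 0) %>%               # keep only people who did not recidivate
  group_by(group) %>%
  summarize(fpr = mean(risk_level >= 6))  # share of them flagged as high risk

toy_fpr
```

In this toy example, group A has three non-recidivists of whom one is flagged (FPR of 1/3), and group B has three non-recidivists of whom two are flagged (FPR of 2/3). People who did recidivate play no role in the calculation.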

In [None]:
# compute false positive rates for each race group

# WRITE CODE HERE
compas_df %>% 
# START solution
  filter(!is_recid) %>%
  group_by(race) %>%
  summarize(
    fpr = mean(risk_level >= 6)
  )  
# END solution


To help understand why the false positive rate is so much higher for Black defendants than for white defendants, we'll examine the overall distribution of risk for each group, and the distribution of risk for those who did not recidivate. Run the two cells below to generate these distributions.

In [None]:
# Plot the risk distribution
options(repr.plot.width = 12, repr.plot.height = 6)

compas_df %>%
  count(race, risk_level) %>%
  group_by(race) %>%
  mutate(p = n/sum(n)) %>%
  ungroup() %>%
  ggplot() +
    geom_col(aes(x = risk_level, y = p)) +
    scale_x_continuous("Risk level", breaks = 1:10) +
    scale_y_continuous(element_blank(), labels=scales::percent_format()) + 
    # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4)
    geom_vline(xintercept=5.5, color='red', linewidth=2) +
    facet_wrap(~race)


Run the cell below to see the distribution of risk across race groups for
**people who did not ultimately recidivate**.

In [None]:
# Plot the risk distribution for people **who did not recidivate**
options(repr.plot.width = 12, repr.plot.height = 6)

compas_df %>%
  filter(!is_recid) %>%
  count(race, risk_level) %>%
  group_by(race) %>%
  mutate(p = n/sum(n)) %>%
  ungroup() %>%
  ggplot() +
    geom_col(aes(x = risk_level, y = p)) +
    scale_x_continuous("Risk level", breaks = 1:10) +
    scale_y_continuous(element_blank(), labels=scales::percent_format()) + 
    # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4)
    geom_vline(xintercept=5.5, color='red', linewidth=2) +
    facet_wrap(~race)

How do the distributions above connect to false positive rates? What are some limitations of false positive rates as a measure of "fairness"?