<a href="https://colab.research.google.com/github/5harad/DPI-617/blob/main/labs/admissions-answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Law, Order, and Algorithms**
## Narrow tailoring and disparate impact in law school admissions

**Getting started**

Before you start, create a copy of this Jupyter notebook in your own Google Drive by clicking `Copy to Drive` in the menubar. If you do not do this, your work will not be saved!

Remember to save your work frequently by pressing command-S or clicking File > Save in the menubar. 

We recommend completing this problem set in Google Chrome.

Run the cell below to load the `tidyverse` library and set some formatting options.

In [None]:
# load libraries
library(tidyverse)

# Set some formatting options
options(digits = 3, repr.matrix.max.rows = 10, repr.matrix.max.cols = 100)
theme_set(theme_bw())

## Background
In this exercise, we'll examine admissions decisions at top-tier law schools using the dataset from the _LSAC National Longitudinal Bar Passage Study_ ([Wightman and Ramsey, 1998](https://files.eric.ed.gov/fulltext/ED469370.pdf)).
This study presents national longitudinal bar passage data gathered from the class that started law school
in fall 1991 over a 5-year period.
In our analysis, we will focus on diversity and affirmative action policies. We'll explore a simple method to reverse engineer admissions criteria, and investigate the extent to which race-blind policies can achieve diversity. We'll also consider the consequences for diversity of a hypothetical scenario in which admissions decisions are based on the statistical likelihood of bar passage.

Run the cell below to load the data that we'll be working with in this lab.

In [None]:
# Load data
fname = 'https://github.com/5harad/DPI-617/raw/main/data/bar_passage_data.csv?raw=true'
bar_data <- read_csv(url(fname), 
                 col_types = cols(MINORITY="l", TOP_TIER="l", MALE="l", PASS_BAR="l")) %>% 
    mutate(FAM_INC = as.factor(FAM_INC))

head(bar_data)

Each row in the data corresponds to a law school admit. The dataset contains the following variables:

* An ID number:
    * `ID`
    
    
* Basic demographic information about the applicant:
    * `MINORITY` is encoded as follows:        
        * `FALSE`: Non-Hispanic white
        * `TRUE`: Asian, Black, Hispanic, American Indian, Alaskan Native, or Other
    * `MALE` is coded as `TRUE` for male applicants and `FALSE` for female applicants
        
        
* Outcome of interest, Bar Passage:
    * `PASS_BAR` is a logical variable. It is `TRUE` if the student eventually passed the bar and `FALSE` otherwise, regardless of why the student did not pass: they may have dropped out of law school, never taken the bar, or failed the exam.
    * `BAR` provides more detail about bar results and test history
    
 
* Academic Indicators:
    * `UGPA` (undergraduate GPA), `LSAT` (LSAT score, scaled to be between 10 and 50)
    
    
* Tier of Law School Attended:
    * `TOP_TIER` is an indicator variable for whether an applicant ultimately attends a top tier school


* Family Income Quintile:
    * `FAM_INC` provides the family income quintile
    * `FAM_INC_1`, `FAM_INC_2`, `FAM_INC_3`, `FAM_INC_4`, `FAM_INC_5` are indicator variables for the income quintile, where `FAM_INC_1` is the lowest-income quintile

Law school admits whose entries had missing data have been removed.

### Exploratory Data Analysis

We start our analysis by exploring class composition.

#### Exercise 1: Demographic composition and academic performance

For top-tier schools and, separately, for non-top-tier schools, compute the total number of law school admits, the percentage of admits who are racial minorities, the average LSAT score, and the average undergraduate GPA. (Hint: first `group_by` `TOP_TIER` and then `summarize`.)

In [None]:
# WRITE CODE HERE
# START SOLUTION

bar_data %>%
    group_by(TOP_TIER) %>%
    summarize(
        total_admits = n(),
        p_minority = mean(MINORITY),
        mean_LSAT = mean(LSAT),
        mean_UGPA = mean(UGPA)
    )

# END SOLUTION

### Reverse Engineering Current Admissions

We now attempt to reverse engineer admissions criteria for top-tier law schools. To do so, we make three key assumptions. First, we assume that students in our dataset comprise the full set of students who _applied_ to law school. In reality, our dataset only contains students who ultimately enrolled at a law school. Second, we assume that [students accepted to top-tier law schools](https://abovethelaw.com/2013/03/which-law-schools-had-the-highest-yield-rate/) all decided to enroll at a top-tier school. Finally, we assume that admissions decisions are based on a relatively small set of factors that we have access to: LSAT score, GPA, minority status, and family income. This is a coarse approximation of actual admissions policies, but is instructive nevertheless.

Given these assumptions, we can try to reconstruct admissions policies by fitting a simple regression model that predicts acceptance to a top-tier school based on the available information. 

In R, you can specify statistical models using formulas of the form `outcome variable ~ input variables`, with the input variables separated by the `+` symbol. We'll learn more about these models in the coming weeks, but for now we'll treat them (mostly) as black boxes.
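
As a minimal sketch of the formula syntax, the snippet below fits a regression on a small invented data frame. The names `toy`, `y`, `x1`, `x2`, and `toy_model` are made up for illustration and are not part of the lab data.

```r
# Toy data, invented purely for illustration (not from the LSAC dataset)
toy <- data.frame(
    y  = c(1, 3, 4, 6, 8, 9),
    x1 = c(0, 1, 2, 3, 4, 5),
    x2 = c(1, 0, 1, 0, 1, 0)
)

# Regress y on x1 and x2; an intercept is included automatically
toy_model <- lm(y ~ x1 + x2, data = toy)

# One coefficient for the intercept and one per input variable
coef(toy_model)
```

The formula only names columns; the `data` argument tells `lm` where to find them.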

In [None]:
# fit a linear probability model to predict acceptance at a top-tier school
lr_admit <- lm(TOP_TIER ~ LSAT + UGPA + MINORITY + FAM_INC_1 + FAM_INC_2,
                    data = bar_data)

# summarize the model
summary(lr_admit)

The summary above shows the coefficient estimated for each covariate in our regression model. The coefficients (approximately) indicate how much weight each factor receives in admissions decisions. In particular, each coefficient shows how much the estimated probability of acceptance to a top-tier law school changes with a one-unit increase in the corresponding variable. For example, all else being equal, the model indicates that a one-point increase in LSAT score corresponds to an increase of about 2.5 percentage points in admissions probability.
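
To make the "all else being equal" reading concrete, here is a small sketch on invented data. The data frame `toy`, its columns, and the two hypothetical applicants are assumptions for illustration only. It checks that two applicants who differ by one point on a single input differ in predicted probability by exactly that input's coefficient.

```r
# Toy applicants, invented for illustration (not from the LSAC dataset)
toy <- data.frame(
    admitted = c(0, 0, 1, 0, 1, 1, 1, 0),
    score    = c(30, 32, 35, 33, 38, 40, 41, 31),
    gpa      = c(3.0, 3.4, 3.2, 2.9, 3.6, 3.5, 3.8, 3.1)
)

# Linear probability model: the 0/1 outcome is regressed directly on the inputs
toy_lpm <- lm(admitted ~ score + gpa, data = toy)

# Two hypothetical applicants who differ only by one point of `score`
applicants <- data.frame(score = c(35, 36), gpa = c(3.3, 3.3))
predicted <- predict(toy_lpm, newdata = applicants)

# The gap in predicted probability equals the `score` coefficient
diff(predicted)
coef(toy_lpm)["score"]
```

Because the model is linear, this gap is the same at every score level. It is also why a linear probability model can produce predictions below 0 or above 1 for extreme inputs.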

#### Exercise 2: 
Discuss the meaning of this model. What does it say about how law schools are admitting students? How accurate do you think it is? In what ways do you think it is misrepresenting or simplifying the law school admissions process?

### Simulating Law School Admissions

#### Exercise 3: Exploring Alternative Admissions Policies

You'll now create an algorithm for admitting students to top-tier schools based on any given weighting of LSAT, GPA, minority status, and low-income status. Once the weights are provided, the code below will sort all the applicants and return the subset of $n$ = 6,882 applicants ranked highest, where $n$ is the actual number admitted to the top-tier schools.

Explore various admissions policies. Are you able to create admissions criteria that match the nominal academic quality (as measured by GPA and LSAT scores) and diversity of the set of students actually admitted to top-tier schools? Are you able to do so without explicitly using race? Recall that _Gratz_ declared the points-based use of race in college admissions unconstitutional.
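
The ranking step in this lab relies on `slice_max()` from dplyr. The sketch below, on an invented table (`toy`, `id`, and `score` are illustrative names), shows how it treats tied scores. This matters when many applicants share the same weighted score at the admissions cutoff.

```r
library(dplyr)

# Toy applicants, invented for illustration; b, c, and d have identical scores
toy <- tibble(
    id    = c("a", "b", "c", "d", "e"),
    score = c(0.9, 0.7, 0.7, 0.7, 0.2)
)

# Default behaviour keeps every row tied with the cutoff score: 4 rows, not 2
with_ties <- toy %>% slice_max(score, n = 2)
nrow(with_ties)

# with_ties = FALSE returns exactly n rows, breaking ties by row order
exact <- toy %>% slice_max(score, n = 2, with_ties = FALSE)
nrow(exact)
```

Breaking ties by row order is arbitrary, so it is worth checking how many applicants sit exactly at the cutoff before reading much into small differences between policies.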

In [None]:
# Modify these weights to explore alternative policies.
# The initial weights are inferred from the regression above.
LSAT_wt <- 0.026
GPA_wt <- 0.17
MINORITY_wt <- 0.22
INC1_wt <- 0.07
INC2_wt <- 0

# Compute the number of students admitted to a top-tier school
admit_n <- sum(bar_data$TOP_TIER)

# Rank applicants by the given weights, and return the top admit_n
admitted <- bar_data %>% 
    mutate(score = 
               LSAT * LSAT_wt + 
               UGPA * GPA_wt + 
               MINORITY * MINORITY_wt + 
               FAM_INC_1 * INC1_wt + 
               FAM_INC_2 * INC2_wt) %>%
    slice_max(score, n = admit_n, with_ties = FALSE)

# Compute the diversity of the admitted student body
admitted %>%
    summarize(
        p_minority = mean(MINORITY),
        mean_gpa = mean(UGPA),
        mean_lsat = mean(LSAT)
    )


### Using Predicted Bar Passage as a Selection Criterion

Finally, we consider what would happen if law schools selected students to optimize bar passage rates. This approach might be motivated from two perspectives. First, perhaps using an outcome-based algorithm would allow schools to lessen the weight on LSAT scores, given the critiques of standardized tests as favoring affluent non-minority groups, and hence constitute a "workable race-neutral alternative." Second, more crudely, one of the major inputs into the U.S. News and World Report law school rankings is bar passage. Schools might want to admit a class to increase bar passage rates, or U.S. News might increase the weight of bar passage in its rankings. Our goal here is to examine whether the adoption of such a policy is a workable alternative and whether it might have disparate impact.

#### Exercise 4:

Based on a model to predict bar passage, simulate an admissions cycle where the students predicted as being the most likely to pass the bar are admitted into the highest tier law schools. We create the predictive model using a linear regression of the form above, but based only on LSAT scores and GPA.
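
One way to picture this simulation is sketched below on invented data. The data frame `toy`, its columns, and `n_admit` are assumptions for illustration, and the lab's own cells use `bar_data` and `slice_max()` instead. The idea is to fit the outcome model, score each applicant with the predicted pass probability, and admit the highest-scoring applicants.

```r
# Toy applicants, invented for illustration (not from the LSAC dataset)
toy <- data.frame(
    passed = c(1, 0, 1, 1, 0, 1, 0, 1),
    lsat   = c(38, 29, 41, 35, 30, 36, 33, 40),
    gpa    = c(3.5, 2.9, 3.7, 3.1, 3.0, 3.4, 2.8, 3.6)
)

# Fit the outcome model on the two academic indicators only
toy_model <- lm(passed ~ lsat + gpa, data = toy)

# Score every applicant with the model's predicted pass probability
toy$predicted_pass <- predict(toy_model, newdata = toy)

# Admit the three applicants with the highest predicted probability
n_admit <- 3
ranked <- toy[order(toy$predicted_pass, decreasing = TRUE), ]
toy_admitted <- head(ranked, n_admit)
toy_admitted
```

Ranking by predicted probability gives the same ordering as ranking by a weighted sum whose weights are the fitted coefficients, since the intercept shifts every score equally. That is why the weights cell below can reproduce this policy.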

Suppose an admissions office came to you and proposed using this model to determine which students are admitted. How would you evaluate the model and what would you recommend to the admissions office? If this model were used, would there be a valid disparate impact claim for any rejected applicants?

In [None]:
# predict bar passage rates via linear regression
bar_model <- lm(PASS_BAR ~ LSAT + UGPA, data = bar_data)
summary(bar_model)

In [None]:
# Modify these weights to explore alternative policies.
# The initial weights are set to zero.
LSAT_wt <- 0
GPA_wt <- 0
MINORITY_wt <- 0
INC1_wt <- 0
INC2_wt <- 0

# Compute the number of students admitted to a top-tier school
admit_n <- sum(bar_data$TOP_TIER)

# Rank applicants by the given weights, and return the top admit_n
admitted <- bar_data %>% 
    mutate(score = 
               LSAT * LSAT_wt + 
               UGPA * GPA_wt + 
               MINORITY * MINORITY_wt + 
               FAM_INC_1 * INC1_wt + 
               FAM_INC_2 * INC2_wt) %>%
    slice_max(score, n = admit_n, with_ties = FALSE)

# Compute the diversity of the admitted student body
admitted %>%
    summarize(
        p_minority = mean(MINORITY),
        mean_gpa = mean(UGPA),
        mean_lsat = mean(LSAT)
    )


#### Discussion Questions

* One way to characterize the use of bar passage information is as an attempt to reduce the importance of the LSAT in determining law school admissions. Does using bar passage data fulfill the goal of reducing emphasis on the LSAT?

* Consider what some of the potential problems with this dataset are. What factors are not represented in the data that might be relevant for predicting outcomes on the bar exam? For success as an attorney? Are there any concerns about state bar passage as an outcome measure?

* How well do these models mimic the actual admissions process? How does the performance of actual admissions officers compare to that of the models we have here? To the extent there are differences in outcomes, what factors might drive those differences?

* Are there important differences between the populations of interest that may influence the model in undesirable ways? Consider whether minority students are more likely to practice in jurisdictions with lower bar passage rates (e.g., NY or CA). Consider also whether stereotype threat or implicit bias might explain differences in academic or bar passage performance between white and minority students, and what implications that has for the approach you've studied above.