# Lecture 5: Source of Bias
This notebook is part of the [Algorithmic Fairness, Accountability and Ethics (Spring 2023)](https://learnit.itu.dk/course/view.php?id=3021608) course at the [IT-University of Copenhagen](https://itu.dk/).

#### Ex.5.4: Data Analysis on the ProPublica Dataset 

**The goal of this exercise is to have you interact with the COMPAS dataset: clean it for analysis, extract insights, visualize findings, and replicate part of ProPublica's analysis. If you have already worked with the COMPAS dataset and find the exercise boring or redundant, consider working on the other exercises, or analyzing possible biases in a dataset of your choice.**

Please remember to use materials on [LearnIT](https://learnit.itu.dk/course/view.php?id=3020962) under Lecture 2 - Study Materials:
* Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries
* A Survey on Bias and Fairness in Machine Learning
* Fairness and machine learning: Introduction Chapter

Also refer to the [How we analyzed the COMPAS Recidivism Algorithm](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm) (Article) and [ProPublica Github Repository](https://github.com/propublica/compas-analysis/).


#### Loading and surveying the data
* Load the dataset `compas-scores-two-years.csv`
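
A minimal sketch of the loading step, assuming `pandas` is installed and the CSV sits next to the notebook. The helper name `load_compas` is hypothetical and only used for illustration.

```python
import pandas as pd


def load_compas(path_or_buffer="compas-scores-two-years.csv"):
    """Read the COMPAS CSV into a DataFrame and print its size.

    `path_or_buffer` can be a file path or any file-like object
    accepted by `pandas.read_csv`.
    """
    df = pd.read_csv(path_or_buffer)
    print(f"{df.shape[0]} rows, {df.shape[1]} columns")
    return df
```

Typical usage would be `df = load_compas()` followed by `df.head()` and `df.info()` to survey the columns and their types.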

#### Columns of Interest:
* `age` - Age of the defendant. It is numeric.
* `age_cat` - Category of Age. It can be < 25, 25-45, >45.
* `sex` - Sex of the defendant. It is either 'Male' or 'Female'
* `race` - Race of the defendant. It can be 'African-American', 'Caucasian', 'Hispanic', 'Asian', 'Native American', or 'Other'.
* `c_charge_degree` - Degree of the crime. It is either M (Misdemeanor), F (Felony), or O (not causing jail time).
* `priors_count` - Count of prior crimes committed by the defendant. It is numeric.
* `days_b_screening_arrest` - Days between the arrest and COMPAS screening.
* `decile_score` - The COMPAS score predicted by the system. It is an integer between 1 and 10.
* `score_text` - Category of decile score. It can be Low (1-4), Medium (5-7), and High (8-10).
* `is_recid` - Indicates whether the defendant recidivated. It can be 0, 1, or -1.
* `two_year_recid` - Indicates whether the defendant recidivated within two years.
* `c_jail_in` - Time when the defendant was jailed.
* `c_jail_out` - Time when the defendant was released from the jail.

#### Data Cleaning
Now that we have surveyed the dataset, let's look into cleaning the data. This data cleaning is largely based on ProPublica's methods. Requirements for the data filtering:
1. We only keep cases where the COMPAS-scored crime falls within +/- 30 days of the arrest (column `days_b_screening_arrest`). If the value is missing, the record should be removed.
2. We also remove cases where `is_recid` is -1, since we only want binary values for the purpose of our model (0 for no recidivism, 1 for recidivism).
3. Finally, we remove cases where `c_charge_degree` is "O", which denotes ordinary traffic offenses (less serious crimes).

Finish cleaning the dataset by filling in the code below based on the description above. The cleaned dataset should have 6172 records and 13 features.
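
One possible way to express the three filters above, shown as a sketch rather than the reference solution. It assumes the 13 features are exactly the columns listed under "Columns of Interest"; the names `COLUMNS_OF_INTEREST` and `clean_compas` are hypothetical.

```python
import pandas as pd

# Assumption: the 13 features are the columns listed under "Columns of Interest".
COLUMNS_OF_INTEREST = [
    "age", "age_cat", "sex", "race", "c_charge_degree", "priors_count",
    "days_b_screening_arrest", "decile_score", "score_text",
    "is_recid", "two_year_recid", "c_jail_in", "c_jail_out",
]


def clean_compas(df):
    """Apply the three filters described in the exercise text."""
    df = df[COLUMNS_OF_INTEREST]
    keep = (
        # Within +/- 30 days of arrest; missing values compare as False
        # in `between`, so those rows are dropped as well.
        df["days_b_screening_arrest"].between(-30, 30)
        # Only binary recidivism labels.
        & (df["is_recid"] != -1)
        # No ordinary traffic offenses.
        & (df["c_charge_degree"] != "O")
    )
    return df.loc[keep].reset_index(drop=True)
```

After running `clean_compas` on the loaded data, checking `.shape` against the expected 6172 records and 13 features is a quick sanity test.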

*(**Optional**) Create a "Length of stay in jail" feature (you can compute it from `c_jail_in` and `c_jail_out`) and use it in the exercise.*
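
A sketch of how the optional feature could be derived, assuming `c_jail_in` and `c_jail_out` hold timestamp strings that `pandas.to_datetime` can parse. The function name `add_length_of_stay` and the column name `length_of_stay` are hypothetical.

```python
import pandas as pd


def add_length_of_stay(df):
    """Return a copy of `df` with a `length_of_stay` column in whole days."""
    out = df.copy()
    jail_in = pd.to_datetime(out["c_jail_in"])
    jail_out = pd.to_datetime(out["c_jail_out"])
    # The difference is a timedelta; `.dt.days` keeps the whole-day part.
    # Rows with a missing timestamp end up as NaN.
    out["length_of_stay"] = (jail_out - jail_in).dt.days
    return out
```

Stays shorter than 24 hours come out as 0 days here; using `.dt.total_seconds()` instead would keep the finer resolution if that matters for the analysis.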

#### Exploratory Data Analysis

First, study basic statistics of the dataset (if you make plots, make sure that you provide labels and titles):
* Frequency of different attributes (such as `race`, `age`, `decile_score`, `priors_count`)
* General descriptive statistics of the dataset
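
The two bullet points above could be approached along these lines. This is only a sketch using plain `pandas`; the helper names `frequency_table` and `summarize` are hypothetical.

```python
import pandas as pd


def frequency_table(df, column):
    """Counts and relative shares of each value in `column`."""
    counts = df[column].value_counts()  # sorted, most frequent first
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


def summarize(df):
    """Descriptive statistics for numeric and categorical columns alike."""
    return df.describe(include="all")
```

For example, `frequency_table(df, "race")` gives the counts behind a bar chart of the race attribute, and `summarize(df)` covers the general descriptive statistics.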

#### Bias Analysis

* Study the distribution of the recidivism score `decile_score` for different categories: does the score have the same distribution for different races? For different genders?
    * Make sure that your plots are comparable (e.g. axes have same scale)
* If it is not distributed in the same way, which biases do you identify in the input dataset that can lead to different distributions? Think about "how data can unintentionally discriminate" from the theory class
* Is there a measurement bias? Explain
* Is there a population bias? Explain
* Is there a sampling bias? Explain
* Look at the correlation between features. What do you notice? How could this affect the recidivism score? (*You can use the `nominal` module of the `dython` package to find correlations between categorical and continuous variables; if unsure, check the lecture slides. Read the documentation for more information.*)
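
To compare score distributions across groups of very different sizes, it can help to normalize within each group and plot every group on the same axes scale. The following is a sketch under those assumptions; `score_distribution` and `plot_score_distribution` are hypothetical helper names, and the plotting part assumes `matplotlib` is available.

```python
import pandas as pd


def score_distribution(df, group_col, score_col="decile_score"):
    """Share of each decile score within each group (every row sums to 1)."""
    table = pd.crosstab(df[group_col], df[score_col], normalize="index")
    # Keep all ten deciles as columns, even those a group never received.
    return table.reindex(columns=range(1, 11), fill_value=0.0)


def plot_score_distribution(table):
    """One bar chart per group, with shared x and y axes for comparability."""
    import matplotlib.pyplot as plt  # imported here: only needed for plotting

    fig, axes = plt.subplots(
        len(table), 1, sharex=True, sharey=True, squeeze=False,
        figsize=(6, 2 * len(table)),
    )
    for ax, (group, shares) in zip(axes[:, 0], table.iterrows()):
        ax.bar(shares.index, shares.values)
        ax.set_title(str(group))
        ax.set_ylabel("Share of group")
    axes[-1, 0].set_xlabel("Decile score")
    fig.tight_layout()
    return fig
```

Calling `plot_score_distribution(score_distribution(df, "race"))` and then the same with `"sex"` gives directly comparable panels, because shares rather than raw counts are plotted and the y-axis is shared.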


#### Replicating ProPublica Analysis
ProPublica treated a COMPAS score >= 5 as a prediction of recidivism and a score < 5 as a prediction of no recidivism.

This is not a complete analysis since it solely uses the decile score and does a hard thresholding for prediction, discarding all other aspects of individuals. But let's reproduce it anyway.

Let's call this thresholded version of predicted recidivism `predicted_recid`.
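
The thresholding described above can be sketched as follows. The function name `add_predicted_recid` and its `threshold` parameter are hypothetical; the cut-off of 5 comes from the text.

```python
import pandas as pd


def add_predicted_recid(df, threshold=5):
    """Return a copy of `df` with a binary `predicted_recid` column.

    A decile score >= `threshold` counts as predicted recidivism (1),
    anything below as predicted non-recidivism (0).
    """
    out = df.copy()
    out["predicted_recid"] = (out["decile_score"] >= threshold).astype(int)
    return out
```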

* Compute and compare the confusion matrix for each of the races
* Compute and compare the error rate, false positive rate, and false negative rate for each of the races
* What do you conclude?
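
A sketch of how the per-group confusion matrix entries and rates could be computed with plain `pandas`. It assumes `two_year_recid` is used as the ground truth and that a `predicted_recid` column already exists; the function name `group_error_rates` is hypothetical.

```python
import pandas as pd


def group_error_rates(df, group_col="race",
                      y_true="two_year_recid", y_pred="predicted_recid"):
    """Confusion-matrix counts and error rates for each value of `group_col`."""
    rows = {}
    for group, g in df.groupby(group_col):
        tp = int(((g[y_true] == 1) & (g[y_pred] == 1)).sum())
        fp = int(((g[y_true] == 0) & (g[y_pred] == 1)).sum())
        tn = int(((g[y_true] == 0) & (g[y_pred] == 0)).sum())
        fn = int(((g[y_true] == 1) & (g[y_pred] == 0)).sum())
        rows[group] = {
            "TP": tp, "FP": fp, "TN": tn, "FN": fn,
            # Share of all cases in the group that were misclassified.
            "error_rate": (fp + fn) / len(g),
            # FPR: non-recidivists wrongly flagged as high risk.
            "FPR": fp / (fp + tn) if (fp + tn) else float("nan"),
            # FNR: recidivists wrongly labelled as low risk.
            "FNR": fn / (fn + tp) if (fn + tp) else float("nan"),
        }
    return pd.DataFrame.from_dict(rows, orient="index")
```

Comparing the `FPR` and `FNR` columns across rows is the part most relevant to the question of what to conclude, since groups can have similar overall error rates while differing in which kind of error they experience.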

#### References
- https://github.com/propublica/compas-analysis/
- https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
- https://mit-serc.pubpub.org/pub/risk-prediction-in-cj/release/2