# DIY Classification

Recidivism Case Study

Copyright 2020 Allen B. Downey

License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

This is the seventh in a series of notebooks that make up a [case study on classification and algorithmic fairness](https://allendowney.github.io/RecidivismCaseStudy/).
This case study is part of the [*Elements of Data Science*](https://allendowney.github.io/ElementsOfDataScience/) curriculum.

In [1]:
def values(series):
    """Count the values and sort.
    
    series: pd.Series
    
    returns: series mapping from values to frequencies
    """
    return series.value_counts(dropna=False).sort_index()

## Data

The authors of "Machine Bias" published their data and analysis at <https://github.com/propublica/compas-analysis>.

The terms of use for the data are at <https://www.propublica.org/datastore/terms>.  In compliance with those terms, I am not redistributing the data.
The following cell downloads the data file we'll use directly from their repository.

In [2]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

download('https://github.com/propublica/compas-analysis/raw/master/' +
         'compas-scores-two-years.csv')

We can use Pandas to read the data file and make a `DataFrame`.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

cp = pd.read_csv('compas-scores-two-years.csv')
cp.shape

(7214, 53)

The dataset includes 7214 rows, one for each defendant, and 53 columns.  

Here are the names of the columns.

In [4]:
cp.columns

Index(['id', 'name', 'first', 'last', 'compas_screening_date', 'sex', 'dob',
       'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score',
       'juv_misd_count', 'juv_other_count', 'priors_count',
       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number',
       'c_offense_date', 'c_arrest_date', 'c_days_from_compas',
       'c_charge_degree', 'c_charge_desc', 'is_recid', 'r_case_number',
       'r_charge_degree', 'r_days_from_arrest', 'r_offense_date',
       'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid',
       'is_violent_recid', 'vr_case_number', 'vr_charge_degree',
       'vr_offense_date', 'vr_charge_desc', 'type_of_assessment',
       'decile_score.1', 'score_text', 'screening_date',
       'v_type_of_assessment', 'v_decile_score', 'v_score_text',
       'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1',
       'start', 'end', 'event', 'two_year_recid'],
      dtype='object')

I have not found documentation for the columns in this dataset; we have to infer what they mean based on the column names and how they are used in the original analysis.

In [5]:
# Use 30% of the rows for training; the remaining 70% are held out for testing
split = int(len(cp) * 0.3)
split

2164

In [6]:
# Shuffle the rows (no seed is set, so the split differs from run to run)
shuffled = cp.sample(frac=1)
train = shuffled.iloc[:split]
len(train)

2164

In [7]:
test = shuffled.iloc[split:].copy()
len(test)

5050
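Because `cp.sample(frac=1)` is called without a seed, the rows that land in `train` and `test` change on every run, so the numbers in the rest of this notebook can vary slightly. As a minimal sketch, assuming a fixed seed is acceptable, passing `random_state` makes the split repeatable. The helper name `split_frame`, the toy `DataFrame`, and the seed value 17 are stand-ins for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `cp`: ten rows with made-up values.
toy = pd.DataFrame({'age': np.arange(20, 30),
                    'two_year_recid': [0, 1] * 5})

def split_frame(df, frac_train=0.3, seed=17):
    """Shuffle with a fixed seed, then cut into train and test parts."""
    split = int(len(df) * frac_train)
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.iloc[:split], shuffled.iloc[split:].copy()

train_a, test_a = split_frame(toy)
train_b, test_b = split_frame(toy)

# With the same seed, both calls select the same rows in the same order.
print(train_a.index.equals(train_b.index))
print(len(train_a), len(test_a))
```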

In [8]:
import statsmodels.formula.api as smf

formula = 'two_year_recid ~ age + priors_count'
results = smf.logit(formula, data=train).fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.606212
         Iterations 6


| | | | |
|---|---|---|---|
| Dep. Variable: | two_year_recid | No. Observations: | 2164 |
| Model: | Logit | Df Residuals: | 2161 |
| Method: | MLE | Df Model: | 2 |
| Date: | Thu, 16 Mar 2023 | Pseudo R-squ.: | 0.119 |
| Time: | 17:52:48 | Log-Likelihood: | -1311.8 |
| converged: | True | LL-Null: | -1489.0 |
| Covariance Type: | nonrobust | LLR p-value: | 1.185e-77 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 1.0202 | 0.149 | 6.832 | 0.000 | 0.728 | 1.313 |
| age | -0.0530 | 0.004 | -11.770 | 0.000 | -0.062 | -0.044 |
| priors_count | 0.1761 | 0.013 | 13.640 | 0.000 | 0.151 | 0.201 |
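To see what these coefficients mean, it can help to compute one prediction by hand. The sketch below copies the three coefficients from the summary and applies the logistic function; the defendant with `age` 30 and `priors_count` 2 is a hypothetical example, and `predict_prob` is an illustrative helper rather than part of statsmodels.

```python
import math

# Coefficients copied from the summary table above.
intercept = 1.0202
b_age = -0.0530
b_priors = 0.1761

def predict_prob(age, priors_count):
    """Logistic model: p = 1 / (1 + exp(-log_odds))."""
    log_odds = intercept + b_age * age + b_priors * priors_count
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical defendant: age 30, priors_count 2.
p = predict_prob(30, 2)
print(round(p, 3))

# Under this model, each additional prior multiplies the odds by exp(b_priors).
odds_ratio = math.exp(b_priors)
print(round(odds_ratio, 3))
```

The result should agree, up to rounding of the coefficients, with what `results.predict` returns for a row with the same `age` and `priors_count`.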


In [9]:
test['logit_pred'] = results.predict(test)
test['logit_pred'].describe()

count    5050.000000
mean        0.449932
std         0.194447
min         0.023870
25%         0.318012
50%         0.450732
75%         0.547410
max         0.996028
Name: logit_pred, dtype: float64

In [10]:
high_risk = (test['logit_pred'] > 0.45)
high_risk.name = 'HighRisk'
values(high_risk)

False    2500
True     2550
Name: HighRisk, dtype: int64

In [11]:
values(test['two_year_recid'])

0    2772
1    2278
Name: two_year_recid, dtype: int64

In [12]:
new_charge_2 = (test['two_year_recid'] == 1)
new_charge_2.name = 'NewCharge2'
values(new_charge_2)

False    2772
True     2278
Name: NewCharge2, dtype: int64

In [13]:
white = (test['race'] == 'Caucasian')
white.name = 'white'
values(white)

False    3317
True     1733
Name: white, dtype: int64

In [14]:
black = (test['race'] == 'African-American')
black.name = 'black'
values(black)

False    2504
True     2546
Name: black, dtype: int64

In [15]:
male = (test['sex'] == 'Male')
male.mean()

0.8017821782178218

In [16]:
female = (test['sex'] == 'Female')
female.mean()

0.19821782178217823

In [17]:
def make_matrix(cp, threshold=0.45):
    """Make a confusion matrix.

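    cp: DataFrame with columns `logit_pred` and `two_year_recid`
    threshold: predicted probability above which a prediction is positive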

    returns: DataFrame containing the confusion matrix
    """
    a = np.where(cp['logit_pred'] > threshold,
                 'Positive',
                 'Negative')
    high_risk = pd.Series(a, name='Predicted')

    a = np.where(cp['two_year_recid'] == 1,
                 'Condition',
                 'No Condition')
    new_charge_2 = pd.Series(a, name='Actual')

    matrix = pd.crosstab(high_risk, new_charge_2)
    matrix.sort_index(axis=0, ascending=False, inplace=True)

    return matrix
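One caveat: `pd.crosstab` only creates rows and columns for categories that actually occur. For a small subgroup or an extreme threshold, the result might not be 2×2, and the functions later in this notebook that unpack four cells would fail. One possible guard, sketched here under the hypothetical name `make_matrix_2x2` and with made-up data, is to reindex the result so that all four cells are always present.

```python
import numpy as np
import pandas as pd

def make_matrix_2x2(df, threshold=0.45):
    """Variant of make_matrix that always returns a 2x2 table."""
    predicted = pd.Series(
        np.where(df['logit_pred'] > threshold, 'Positive', 'Negative'),
        name='Predicted')
    actual = pd.Series(
        np.where(df['two_year_recid'] == 1, 'Condition', 'No Condition'),
        name='Actual')
    matrix = pd.crosstab(predicted, actual)
    # Fill in any missing row or column with zeros, in a fixed order.
    return matrix.reindex(index=['Positive', 'Negative'],
                          columns=['Condition', 'No Condition'],
                          fill_value=0)

# Made-up data in which every prediction is positive.
toy = pd.DataFrame({'logit_pred': [0.9, 0.8, 0.7],
                    'two_year_recid': [1, 0, 1]})
m = make_matrix_2x2(toy)
print(m)
```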

Here are the confusion matrices for all defendants, for white and black defendants, and for male and female defendants.

In [18]:
matrix_all = make_matrix(test)
matrix_all

| Predicted \ Actual | Condition | No Condition |
|---|---|---|
| Positive | 1564 | 986 |
| Negative | 714 | 1786 |


In [19]:
matrix_white = make_matrix(test[white])
matrix_white

| Predicted \ Actual | Condition | No Condition |
|---|---|---|
| Positive | 375 | 268 |
| Negative | 316 | 774 |


In [20]:
matrix_black = make_matrix(test[black])
matrix_black

| Predicted \ Actual | Condition | No Condition |
|---|---|---|
| Positive | 1034 | 586 |
| Negative | 268 | 658 |


In [21]:
matrix_male = make_matrix(test[male])
matrix_male

| Predicted \ Actual | Condition | No Condition |
|---|---|---|
| Positive | 1357 | 786 |
| Negative | 576 | 1330 |


In [22]:
matrix_female = make_matrix(test[female])
matrix_female

| Predicted \ Actual | Condition | No Condition |
|---|---|---|
| Positive | 207 | 200 |
| Negative | 138 | 456 |


In [23]:
def percent(x, y):
    """Compute the percentage `x/(x+y)*100`."""
    return x / (x+y) * 100

In [24]:
def predictive_value(m):
    """Compute positive and negative predictive value.
    
    m: confusion matrix
    """
    tp, fp, fn, tn = m.to_numpy().flatten()
    ppv = percent(tp, fp)
    npv = percent(tn, fn)
    return ppv, npv

In [25]:
def sens_spec(m):
    """Compute sensitivity and specificity.
    
    m: confusion matrix
    """
    tp, fp, fn, tn = m.to_numpy().flatten()
    sens = percent(tp, fn)
    spec = percent(tn, fp)
    return sens, spec

In [26]:
def error_rates(m):
    """Compute false positive and false negative rate.
    
    m: confusion matrix
    """
    tp, fp, fn, tn = m.to_numpy().flatten()
    fpr = percent(fp, tn)
    fnr = percent(fn, tp)
    return fpr, fnr
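The error rates are the complements of sensitivity and specificity: every defendant who is charged with a new crime is either a true positive or a false negative, so sensitivity and the false negative rate add up to 100%, and likewise for specificity and the false positive rate. As a quick check, this sketch applies `percent` to the four cell counts from the confusion matrix for all defendants.

```python
def percent(x, y):
    """Compute the percentage `x/(x+y)*100`."""
    return x / (x + y) * 100

# Cell counts from the confusion matrix for all defendants.
tp, fp, fn, tn = 1564, 986, 714, 1786

sens = percent(tp, fn)   # actual positives that were predicted positive
fnr = percent(fn, tp)    # actual positives that were predicted negative
spec = percent(tn, fp)   # actual negatives that were predicted negative
fpr = percent(fp, tn)    # actual negatives that were predicted positive

# Each pair should sum to 100, up to floating-point error.
print(sens + fnr, spec + fpr)
```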

In [27]:
def prevalence(m):
    """Compute prevalence.
    
    m: confusion matrix
    """
    tp, fp, fn, tn = m.to_numpy().flatten()
    return percent(tp+fn, tn+fp)

In [28]:
def compute_metrics(m, name=''):
    """Compute all metrics.
    
    m: confusion matrix
    
    returns: DataFrame
    """
    fpr, fnr = error_rates(m)
    ppv, npv = predictive_value(m)
    prev = prevalence(m)
    
    index = ['FP rate', 'FN rate', 'PPV', 'NPV', 'Prevalence']
    df = pd.DataFrame(index=index, columns=['Percent'])
    df.Percent = fpr, fnr, ppv, npv, prev
    df.index.name = name
    return df

Here are the metrics for all defendants.

In [29]:
compute_metrics(matrix_all, 'All defendants')

| All defendants | Percent |
|---|---|
| FP rate | 35.569986 |
| FN rate | 31.343284 |
| PPV | 61.333333 |
| NPV | 71.44 |
| Prevalence | 45.108911 |


Here are the same metrics for black defendants.

In [30]:
compute_metrics(matrix_black, 'Black defendants')

| Black defendants | Percent |
|---|---|
| FP rate | 47.106109 |
| FN rate | 20.583717 |
| PPV | 63.82716 |
| NPV | 71.058315 |
| Prevalence | 51.139042 |


And for white defendants.

In [31]:
compute_metrics(matrix_white, 'White defendants')

| White defendants | Percent |
|---|---|
| FP rate | 25.71977 |
| FN rate | 45.730825 |
| PPV | 58.320373 |
| NPV | 71.009174 |
| Prevalence | 39.873053 |


In [32]:
compute_metrics(matrix_male, 'Male defendants')

| Male defendants | Percent |
|---|---|
| FP rate | 37.145558 |
| FN rate | 29.798241 |
| PPV | 63.322445 |
| NPV | 69.779643 |
| Prevalence | 47.740183 |


In [33]:
compute_metrics(matrix_female, 'Female defendants')

| Female defendants | Percent |
|---|---|
| FP rate | 30.487805 |
| FN rate | 40.0 |
| PPV | 50.859951 |
| NPV | 76.767677 |
| Prevalence | 34.465534 |
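The groups are easier to compare when the metrics sit side by side. The following sketch rebuilds them from the cell counts in the confusion matrices above and collects them in one `DataFrame`; the names `counts` and `comparison` are just for illustration. Because the train/test split is random, the exact numbers would differ on another run.

```python
import pandas as pd

# (tp, fp, fn, tn) copied from the confusion matrices above.
counts = {
    'All': (1564, 986, 714, 1786),
    'Black': (1034, 586, 268, 658),
    'White': (375, 268, 316, 774),
    'Male': (1357, 786, 576, 1330),
    'Female': (207, 200, 138, 456),
}

def percent(x, y):
    """Compute the percentage `x/(x+y)*100`."""
    return x / (x + y) * 100

rows = {}
for group, (tp, fp, fn, tn) in counts.items():
    rows[group] = {
        'FP rate': percent(fp, tn),
        'FN rate': percent(fn, tp),
        'PPV': percent(tp, fp),
        'NPV': percent(tn, fn),
        'Prevalence': percent(tp + fn, tn + fp),
    }

# One column per group, one row per metric.
comparison = pd.DataFrame(rows).round(1)
print(comparison)
```

In this split, the predictive values are fairly close for black and white defendants, while the false positive and false negative rates differ substantially.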


In [34]:
# Recompute the masks on the full dataset (the earlier ones were aligned with `test`)
male = (cp['sex'] == 'Male')
female = (cp['sex'] == 'Female')

In [35]:
formula = 'two_year_recid ~ age + priors_count'
results = smf.logit(formula, data=cp[male]).fit()
results.params

Optimization terminated successfully.
         Current function value: 0.619925
         Iterations 5


Intercept       1.052066
age            -0.049538
priors_count    0.149677
dtype: float64

In [36]:
formula = 'two_year_recid ~ age + priors_count'
results = smf.logit(formula, data=cp[female]).fit()
results.params

Optimization terminated successfully.
         Current function value: 0.587841
         Iterations 6


Intercept       0.198587
age            -0.037953
priors_count    0.219181
dtype: float64
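One way to read the two sets of parameters is to apply both to the same inputs. This sketch copies the fitted parameters from the two cells above and evaluates each model for a hypothetical defendant with `age` 30 and `priors_count` 2; `predict_prob` is an illustrative helper, not a statsmodels function.

```python
import math

# Parameters copied from the two fits above.
params = {
    'male': {'Intercept': 1.052066, 'age': -0.049538,
             'priors_count': 0.149677},
    'female': {'Intercept': 0.198587, 'age': -0.037953,
               'priors_count': 0.219181},
}

def predict_prob(p, age, priors_count):
    """Apply the logistic function to the linear predictor."""
    log_odds = (p['Intercept'] + p['age'] * age +
                p['priors_count'] * priors_count)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical inputs: age 30, priors_count 2.
probs = {sex: predict_prob(p, 30, 2) for sex, p in params.items()}
for sex, prob in probs.items():
    print(sex, round(prob, 3))
```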

In [38]:
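# Candidate feature set including the juvenile counts (overridden below)
features = ['age', 'juv_fel_count', 'juv_misd_count', 
            'juv_other_count', 'priors_count']

# Use the same two features as the statsmodels models above
features = ['age', 'priors_count']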

In [39]:
X = cp[features].values
np.isnan(X).sum()

0

In [40]:
y = cp['two_year_recid'].values
np.isnan(y).sum()

0

In [41]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

ModuleNotFoundError: No module named 'sklearn'

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logisticRegr = LogisticRegression()

In [None]:
logisticRegr.fit(x_train, y_train)

In [None]:
predictions = logisticRegr.predict(x_test)

In [None]:
score = logisticRegr.score(x_test, y_test)
score

In [None]:
from sklearn import metrics

cm = metrics.confusion_matrix(y_test, predictions)
cm
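Note that scikit-learn's `confusion_matrix` uses a different layout from `make_matrix`: its rows are the actual classes and its columns are the predicted classes, both in sorted label order, so for 0/1 labels the cells are `[[tn, fp], [fn, tp]]`. The sketch below uses made-up counts in a plain NumPy array (so it runs without scikit-learn) to show one way to rearrange such a matrix into the `tp, fp, fn, tn` order that the metric functions in this notebook expect.

```python
import numpy as np

# Hypothetical counts in scikit-learn's layout:
# rows are actual classes (0, 1), columns are predicted classes (0, 1).
tn, fp, fn, tp = 50, 10, 5, 35
cm = np.array([[tn, fp],
               [fn, tp]])

# Reverse both axes, then transpose, to get the layout used by make_matrix:
# rows are predicted (Positive, Negative),
# columns are actual (Condition, No Condition).
m = cm[::-1, ::-1].T
print(m)

# Flattening now yields the cells in the order tp, fp, fn, tn.
print(m.flatten())
```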