# DIY Classification

This is the seventh in a series of notebooks that make up a [case study on classification and algorithmic fairness](https://allendowney.github.io/RecidivismCaseStudy/).
This case study is part of the [*Elements of Data Science*](https://allendowney.github.io/ElementsOfDataScience/) curriculum.
[Click here to run this notebook on Colab](https://colab.research.google.com/github/AllenDowney/RecidivismCaseStudy/blob/v1/07_diy.ipynb).

In [1]:
from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download(
    "https://raw.githubusercontent.com/AllenDowney/RecidivismCaseStudy/v1/utils.py"
)


In [2]:
from utils import values


## Data

The authors of "Machine Bias" published their data and analysis at <https://github.com/propublica/compas-analysis>.

The terms of use for the data are at <https://www.propublica.org/datastore/terms>.  In compliance with those terms, I am not redistributing the data.
The following cell downloads the data file we'll use directly from their repository.

In [3]:
download(
    "https://github.com/propublica/compas-analysis/raw/master/compas-scores-two-years.csv"
)


We can use Pandas to read the data file and make a `DataFrame`.

In [4]:
import pandas as pd

cp = pd.read_csv("compas-scores-two-years.csv")
cp.shape

(7214, 53)


The dataset includes 7214 rows, one for each defendant, and 53 columns.  

Here are the names of the columns.

In [5]:
cp.columns

Index(['id', 'name', 'first', 'last', 'compas_screening_date', 'sex', 'dob',
       'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score',
       'juv_misd_count', 'juv_other_count', 'priors_count',
       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number',
       'c_offense_date', 'c_arrest_date', 'c_days_from_compas',
       'c_charge_degree', 'c_charge_desc', 'is_recid', 'r_case_number',
       'r_charge_degree', 'r_days_from_arrest', 'r_offense_date',
       'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid',
       'is_violent_recid', 'vr_case_number', 'vr_charge_degree',
       'vr_offense_date', 'vr_charge_desc', 'type_of_assessment',
       'decile_score.1', 'score_text', 'screening_date',
       'v_type_of_assessment', 'v_decile_score', 'v_score_text',
       'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1',
       'start', 'end', 'event', 'two_year_recid'],
      dtype='object')


I have not found documentation for the columns in this dataset; we have to infer what they mean based on the column names and how they are used in the original analysis.
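
One thing the names do suggest: pandas appends `.1` to a repeated header when it reads a CSV, so `decile_score.1` and `priors_count.1` are probably columns whose names appeared twice in the file. Here is a minimal sketch of how one might check whether such a pair holds identical values. It uses a made-up three-column CSV rather than the real data, and the name `toy` is only for illustration.

```python
import io

import pandas as pd

# Toy CSV with a repeated header; the values are invented for illustration.
text = "decile_score,age,decile_score\n3,25,3\n7,40,7\n"
toy = pd.read_csv(io.StringIO(text))

# pandas renames the second occurrence by appending ".1".
print(list(toy.columns))

# True if the two columns hold the same values row by row.
same = (toy["decile_score"] == toy["decile_score.1"]).all()
print(same)
```

Running the same comparison on `cp` would show whether the suffixed columns carry any information the originals do not.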

In [6]:
# Use 30% of the rows for training; the remaining 70% are held out for testing.
split = int(len(cp) * 0.3)
split

2164


In [7]:
# Shuffle the rows. No random_state is set, so the split differs from run to run.
shuffled = cp.sample(frac=1)
train = shuffled.iloc[:split]
len(train)

2164


In [8]:
test = shuffled.iloc[split:].copy()
len(test)

5050

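Because the shuffle is not seeded, the counts and metrics in the rest of this notebook change slightly each time it runs. A minimal sketch of a repeatable version of the same split, assuming a small stand-in DataFrame (`toy`) and an arbitrary seed of 17:

```python
import pandas as pd

# Toy stand-in for `cp`; the real DataFrame has 7214 rows.
toy = pd.DataFrame({"age": range(20, 30), "priors_count": range(10)})

split = int(len(toy) * 0.3)

# Passing random_state makes the shuffle repeatable across runs.
shuffled = toy.sample(frac=1, random_state=17)
train = shuffled.iloc[:split]
test = shuffled.iloc[split:].copy()

print(len(train), len(test))
```

The training and test sets never share a row, because both are slices of the same shuffled frame.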

In [9]:
import statsmodels.formula.api as smf

formula = "two_year_recid ~ age + priors_count"
results = smf.logit(formula, data=train).fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.621868
         Iterations 5


| | | | |
|---|---|---|---|
| Dep. Variable: | two_year_recid | No. Observations: | 2164 |
| Model: | Logit | Df Residuals: | 2161 |
| Method: | MLE | Df Model: | 2 |
| Date: | Thu, 04 Apr 2024 | Pseudo R-squ.: | 0.09738 |
| Time: | 11:42:29 | Log-Likelihood: | -1345.7 |
| converged: | True | LL-Null: | -1490.9 |
| Covariance Type: | nonrobust | LLR p-value: | 8.925e-64 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.9091 | 0.148 | 6.131 | 0.000 | 0.618 | 1.200 |
| age | -0.0468 | 0.004 | -10.704 | 0.000 | -0.055 | -0.038 |
| priors_count | 0.1506 | 0.012 | 12.816 | 0.000 | 0.128 | 0.174 |


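To see what these coefficients mean, it can help to compute a prediction by hand. The sketch below applies the logistic function to the fitted coefficients, which is the same calculation `results.predict` performs for this formula. The function name `predict_prob` and the example defendants are made up for illustration.

```python
import numpy as np

# Coefficients copied from the summary above.
intercept, b_age, b_priors = 0.9091, -0.0468, 0.1506


def predict_prob(age, priors_count):
    """Apply the logistic function to the linear combination of features."""
    log_odds = intercept + b_age * age + b_priors * priors_count
    return 1 / (1 + np.exp(-log_odds))


# Hypothetical defendants, chosen only for illustration.
print(predict_prob(age=30, priors_count=2))
print(predict_prob(age=30, priors_count=10))
```

The negative coefficient on `age` lowers the predicted probability for older defendants, and the positive coefficient on `priors_count` raises it for defendants with more prior offenses.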

In [10]:
test["logit_pred"] = results.predict(test)
test["logit_pred"].describe()

count    5050.000000
mean        0.452667
std         0.175459
min         0.036168
25%         0.337007
50%         0.456721
75%         0.538439
max         0.990998
Name: logit_pred, dtype: float64


In [11]:
# Classify a defendant as high risk if the predicted probability exceeds 0.45.
high_risk = test["logit_pred"] > 0.45
high_risk.name = "HighRisk"
values(high_risk)

HighRisk
False    2471
True     2579
Name: count, dtype: int64


In [12]:
values(test["two_year_recid"])

two_year_recid
0    2782
1    2268
Name: count, dtype: int64


In [13]:
new_charge_2 = test["two_year_recid"] == 1
new_charge_2.name = "NewCharge2"
values(new_charge_2)

NewCharge2
False    2782
True     2268
Name: count, dtype: int64


In [14]:
white = test["race"] == "Caucasian"
white.name = "white"
values(white)

white
False    3322
True     1728
Name: count, dtype: int64


In [15]:
black = test["race"] == "African-American"
black.name = "black"
values(black)

black
False    2508
True     2542
Name: count, dtype: int64


In [16]:
male = test["sex"] == "Male"
male.mean()

0.8073267326732674


In [17]:
female = test["sex"] == "Female"
female.mean()

0.19267326732673268


In [18]:
import numpy as np


def make_matrix(cp, threshold=0.45):
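    """Make a confusion matrix.

    cp: DataFrame with columns `logit_pred` and `two_year_recid`
    threshold: predicted probability above which a defendant is classified as Positive

    returns: DataFrame containing the confusion matrix
    """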
    a = np.where(cp["logit_pred"] > threshold, "Positive", "Negative")
    high_risk = pd.Series(a, name="Predicted")

    a = np.where(cp["two_year_recid"] == 1, "Condition", "No Condition")
    new_charge_2 = pd.Series(a, name="Actual")

    matrix = pd.crosstab(high_risk, new_charge_2)
    matrix.sort_index(axis=0, ascending=False, inplace=True)

    return matrix


Here are the confusion matrices for all defendants, white defendants, black defendants, male defendants, and female defendants.

In [19]:
matrix_all = make_matrix(test)
matrix_all

| Predicted | Actual: Condition | Actual: No Condition |
|---|---|---|
| Positive | 1578 | 1001 |
| Negative | 690 | 1781 |



In [20]:
matrix_white = make_matrix(test[white])
matrix_white

| Predicted | Actual: Condition | Actual: No Condition |
|---|---|---|
| Positive | 381 | 276 |
| Negative | 295 | 776 |



In [21]:
matrix_black = make_matrix(test[black])
matrix_black

| Predicted | Actual: Condition | Actual: No Condition |
|---|---|---|
| Positive | 1036 | 590 |
| Negative | 280 | 636 |



In [22]:
matrix_male = make_matrix(test[male])
matrix_male

| Predicted | Actual: Condition | Actual: No Condition |
|---|---|---|
| Positive | 1366 | 813 |
| Negative | 557 | 1341 |



In [23]:
matrix_female = make_matrix(test[female])
matrix_female

| Predicted | Actual: Condition | Actual: No Condition |
|---|---|---|
| Positive | 212 | 188 |
| Negative | 133 | 440 |



In [24]:
def percent(x, y):
    """Compute the percentage `x/(x+y)*100`."""
    return x / (x + y) * 100


In [25]:
def predictive_value(m):
    """Compute positive and negative predictive value.

    m: confusion matrix
    """
    tp, fp, fn, tn = m.to_numpy().flatten()
    ppv = percent(tp, fp)
    npv = percent(tn, fn)
    return ppv, npv


In [26]:
def sens_spec(m):
    """Compute sensitivity and specificity.

    m: confusion matrix
    """
    tp, fp, fn, tn = m.to_numpy().flatten()
    sens = percent(tp, fn)
    spec = percent(tn, fp)
    return sens, spec

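Sensitivity and specificity are the complements of the false negative and false positive rates, so each pair sums to 100%. A short check of that relationship, using the counts from the confusion matrix for all defendants and writing the arithmetic out directly rather than calling the notebook's functions:

```python
# Counts from the confusion matrix for all defendants.
tp, fp, fn, tn = 1578, 1001, 690, 1781

sens = tp / (tp + fn) * 100  # share of actual positives predicted Positive
fnr = fn / (fn + tp) * 100   # share of actual positives predicted Negative
spec = tn / (tn + fp) * 100  # share of actual negatives predicted Negative
fpr = fp / (fp + tn) * 100   # share of actual negatives predicted Positive

print(round(sens + fnr, 6), round(spec + fpr, 6))
```

Reporting either member of a pair therefore carries the same information as reporting the other.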

In [27]:
def error_rates(m):
    """Compute false positive and false negative rate.

    m: confusion matrix
    """
    tp, fp, fn, tn = m.to_numpy().flatten()
    fpr = percent(fp, tn)
    fnr = percent(fn, tp)
    return fpr, fnr


In [28]:
def prevalence(m):
    """Compute prevalence.

    m: confusion matrix
    """
    tp, fp, fn, tn = m.to_numpy().flatten()
    return percent(tp + fn, tn + fp)


In [29]:
def compute_metrics(m, name=""):
    """Compute all metrics.

    m: confusion matrix

    returns: DataFrame
    """
    fpr, fnr = error_rates(m)
    ppv, npv = predictive_value(m)
    prev = prevalence(m)

    index = ["FP rate", "FN rate", "PPV", "NPV", "Prevalence"]
    df = pd.DataFrame(index=index, columns=["Percent"])
    df.Percent = fpr, fnr, ppv, npv, prev
    df.index.name = name
    return df


Here are the metrics for all defendants.

In [30]:
compute_metrics(matrix_all, "All defendants")

| All defendants | Percent |
|---|---|
| FP rate | 35.981308 |
| FN rate | 30.42328 |
| PPV | 61.186506 |
| NPV | 72.076083 |
| Prevalence | 44.910891 |



Here are the same metrics for black defendants.

In [31]:
compute_metrics(matrix_black, "Black defendants")

| Black defendants | Percent |
|---|---|
| FP rate | 48.12398 |
| FN rate | 21.276596 |
| PPV | 63.714637 |
| NPV | 69.432314 |
| Prevalence | 51.77026 |



And for white defendants.

In [32]:
compute_metrics(matrix_white, "White defendants")

| White defendants | Percent |
|---|---|
| FP rate | 26.235741 |
| FN rate | 43.639053 |
| PPV | 57.990868 |
| NPV | 72.455649 |
| Prevalence | 39.12037 |



In [33]:
compute_metrics(matrix_male, "Male defendants")

| Male defendants | Percent |
|---|---|
| FP rate | 37.743733 |
| FN rate | 28.965159 |
| PPV | 62.689307 |
| NPV | 70.653319 |
| Prevalence | 47.167035 |



In [34]:
compute_metrics(matrix_female, "Female defendants")

| Female defendants | Percent |
|---|---|
| FP rate | 29.936306 |
| FN rate | 38.550725 |
| PPV | 53.0 |
| NPV | 76.788831 |
| Prevalence | 35.457348 |


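The group-by-group tables are easier to compare side by side. The following sketch rebuilds the two error rates from the confusion matrix counts shown earlier and collects them in one table. The dictionary `counts` and the name `summary` are introduced here for illustration, and the counts are typed in by hand rather than read from the matrices.

```python
import pandas as pd

# (tp, fp, fn, tn) copied from the confusion matrices shown earlier.
counts = {
    "All": (1578, 1001, 690, 1781),
    "Black": (1036, 590, 280, 636),
    "White": (381, 276, 295, 776),
    "Male": (1366, 813, 557, 1341),
    "Female": (212, 188, 133, 440),
}

rows = {}
for group, (tp, fp, fn, tn) in counts.items():
    rows[group] = {
        "FP rate": fp / (fp + tn) * 100,
        "FN rate": fn / (fn + tp) * 100,
    }

# One row per group, one column per error rate.
summary = pd.DataFrame.from_dict(rows, orient="index").round(1)
print(summary)
```

In this run, the false positive rate is about 48% for black defendants and 26% for white defendants, while the false negative rates run the other way (about 21% and 44%).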

In [35]:
# Recompute the masks on the full dataset; the earlier ones covered only the test set.
male = cp["sex"] == "Male"
female = cp["sex"] == "Female"


In [36]:
formula = "two_year_recid ~ age + priors_count"
results = smf.logit(formula, data=cp[male]).fit()
results.params

Optimization terminated successfully.
         Current function value: 0.619925
         Iterations 5


Intercept       1.052066
age            -0.049538
priors_count    0.149677
dtype: float64


In [37]:
formula = "two_year_recid ~ age + priors_count"
results = smf.logit(formula, data=cp[female]).fit()
results.params

Optimization terminated successfully.
         Current function value: 0.587841
         Iterations 6


Intercept       0.198587
age            -0.037953
priors_count    0.219181
dtype: float64

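The two sets of parameters are hard to compare by eye, because the intercepts and slopes differ at the same time. One way to get a feel for the difference is to evaluate both models for the same hypothetical defendant. This is a sketch only: `predict_prob` and the example defendant are made up for illustration.

```python
import numpy as np

# Parameters copied from the two fits above (Intercept, age, priors_count).
params = {
    "male": (1.052066, -0.049538, 0.149677),
    "female": (0.198587, -0.037953, 0.219181),
}


def predict_prob(group, age, priors_count):
    """Predicted probability of recidivism under the model fit to `group`."""
    b0, b_age, b_priors = params[group]
    log_odds = b0 + b_age * age + b_priors * priors_count
    return 1 / (1 + np.exp(-log_odds))


# Hypothetical defendant: 30 years old with 2 prior offenses.
for group in params:
    print(group, predict_prob(group, age=30, priors_count=2))
```

For this particular combination of age and priors, the model fit to male defendants gives the higher predicted probability. Other combinations may differ, since the female model has a larger coefficient on `priors_count`.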

In [38]:
cp.columns

Index(['id', 'name', 'first', 'last', 'compas_screening_date', 'sex', 'dob',
       'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score',
       'juv_misd_count', 'juv_other_count', 'priors_count',
       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number',
       'c_offense_date', 'c_arrest_date', 'c_days_from_compas',
       'c_charge_degree', 'c_charge_desc', 'is_recid', 'r_case_number',
       'r_charge_degree', 'r_days_from_arrest', 'r_offense_date',
       'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid',
       'is_violent_recid', 'vr_case_number', 'vr_charge_degree',
       'vr_offense_date', 'vr_charge_desc', 'type_of_assessment',
       'decile_score.1', 'score_text', 'screening_date',
       'v_type_of_assessment', 'v_decile_score', 'v_score_text',
       'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1',
       'start', 'end', 'event', 'two_year_recid'],
      dtype='object')


In [39]:
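# Candidate features (this list is overridden by the assignment below).
features = ["age", "juv_fel_count", "juv_misd_count", "juv_other_count", "priors_count"]

# Use the same two features as the statsmodels model above.
features = ["age", "priors_count"]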


In [40]:
X = cp[features].values
np.isnan(X).sum()

0


In [41]:
y = cp["two_year_recid"].values
np.isnan(y).sum()

0


In [42]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


In [43]:
from sklearn.linear_model import LogisticRegression


In [44]:
logisticRegr = LogisticRegression()


In [45]:
logisticRegr.fit(x_train, y_train)


In [46]:
predictions = logisticRegr.predict(x_test)


In [47]:
score = logisticRegr.score(x_test, y_test)
score

0.6757206208425721


In [48]:
from sklearn import metrics

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
cm = metrics.confusion_matrix(y_test, predictions)
cm

array([[812, 175],
       [410, 407]])

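scikit-learn puts actual classes in rows and predicted classes in columns, with class 0 first, so this matrix is laid out differently from the ones `make_matrix` produces. The sketch below unpacks the four counts and recomputes the accuracy and error rates by hand. Note that `predict` assigns the more probable class, which for two classes amounts to a 0.5 cutoff rather than the 0.45 used earlier, so these rates are not directly comparable to the earlier tables.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[812, 175], [410, 407]])

# With labels 0 and 1, the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
fpr = fp / (fp + tn) * 100
fnr = fn / (fn + tp) * 100
print(accuracy, fpr, fnr)
```

The accuracy computed this way matches the value returned by `logisticRegr.score`.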

Recidivism Case Study

Copyright 2020 Allen B. Downey

License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)