# Philly Bail Fund
## Analysis of factors related to Bail Amounts

For more details, see the GitHub repo: https://github.com/CodeForPhilly/pbf-analysis

### Library Imports

In [1]:
### Standard Imports - Sorry PEP8 fans, do not look below
import pandas as pd, numpy as np, os, re, json, pickle, math
from pathlib import Path
from datetime import datetime

## Specific Imports
import hashlib
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
import category_encoders as ce
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

### Display options for notebooks
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 25)

### set path directories
curr_dir = Path(os.getcwd())
#print('Current Directory is: ', str(curr_dir))
data_dir = Path(curr_dir.parents[0] / 'Data/')
artifacts_dir = Path(curr_dir / 'artifacts/')

In [2]:
### Common project specific variables
FILENAME = '0c_distinct_dockets.csv'  # original data
TARGET_VARIABLE_NAME = 'bail_amount'
HOLDOUT_INDICATOR_NAME = 'holdout_ind'
HOLDOUT_SIZE = 0.80  # fraction kept for training; the holdout receives 1 - HOLDOUT_SIZE (20%)

### Helper Functions

In [3]:
# helper function to reduce memory footprint of the dataframe
def reduce_mem_usage(df, verbose=True):
    import numpy as np
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: 
        print('Mem. usage decreased to {:5.2f} MB ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

### Data Import

Before loading the data, compute an MD5 hash of the file and compare it with the hash saved when this notebook was last updated. If the file has changed, someone should review the notebook to make sure the code still works.

In [4]:
BLOCKSIZE = 65536
hasher = hashlib.md5()
with open(Path(data_dir) / FILENAME, 'rb') as afile:
    buf = afile.read(BLOCKSIZE)
    while len(buf) > 0:
        hasher.update(buf)
        buf = afile.read(BLOCKSIZE)
        
filehash = hasher.hexdigest()

with open(Path(artifacts_dir) / 'data_file_hash.txt', 'rb') as f:
    checkhash = pickle.load(f)
    
if filehash != checkhash:
    print("!! Warning !! The file is different than when this code was last updated. \nProceed with caution.")

In [5]:
UPDATE_HASH_FLAG = False

if UPDATE_HASH_FLAG:
    with open(Path(artifacts_dir) / 'data_file_hash.txt', 'wb') as f:
        pickle.dump(filehash, f)
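
The cells above store the digest with pickle inside a file named `.txt`. As a minimal sketch of an alternative, assuming a plain-text digest file is acceptable for this project, the helpers below compute the digest in chunks and compare it with a stored hex string. `file_md5` and `hash_matches` are hypothetical names, and the paths are throwaway examples.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, blocksize=65536):
    """Return the MD5 hex digest of a file, reading it in fixed-size chunks."""
    hasher = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(blocksize), b''):
            hasher.update(chunk)
    return hasher.hexdigest()


def hash_matches(data_path, hash_path):
    """Compare a file's digest with the hex string stored in a text file."""
    stored = Path(hash_path).read_text().strip()
    return file_md5(data_path) == stored


# Demonstration on a throwaway file (illustrative content only)
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / 'example.csv'
    hash_file = Path(tmp) / 'data_file_hash.txt'
    data_file.write_bytes(b'id,bail_amount\n1,50000\n')
    hash_file.write_text(file_md5(data_file))
    unchanged = hash_matches(data_file, hash_file)

    # Any edit to the data file changes the digest
    data_file.write_bytes(b'id,bail_amount\n1,75000\n')
    changed = not hash_matches(data_file, hash_file)
```

A text file can be inspected and diffed directly, which is harder with a pickled string.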

In [6]:
indata = pd.read_csv(Path(data_dir) / FILENAME, parse_dates=['filing_date'], index_col='id')

indata.head(3)

| id | age | address | docket_number | filing_date | charge | represented_by | bail_type | bail_status | bail_amount | outstanding_bail_amount |
|---|---|---|---|---|---|---|---|---|---|---|
| 3909 | 27.0 | Philadelphia, PA 19141 | MC-51-CR-0011746-2020 | 2020-06-16 00:37:00+00:00 | DUI: Gen Imp/Inc of Driving Safely - 1st Off | Defender Association of Philadelphia | Posted | ROR | 0 | 0 |
| 4538 | 44.0 | Philadelphia, PA 19124 | MC-51-CR-0011747-2020 | 2020-06-16 00:41:00+00:00 | Verify Address or Photographed as Required | Defender Association of Philadelphia | Set | Monetary | 50000 | 0 |
| 120 | 24.0 | Philadelphia, PA 19142 | MC-51-CR-0011743-2020 | 2020-06-16 00:52:00+00:00 | Criminal Mischief | Defender Association of Philadelphia | Posted | ROR | 0 | 0 |


### Data Setup

A1. Keep = bail_amount, charge, bail_status, filing_date, age, represented_by

A2. Create hour of day and day of week from filing_date, then drop the original filing_date

A3. Keep only rows where a bail amount was set (bail_amount > 0). The original plan was to delete rows where bail_status = 'Denied'; the filter on the amount is what the code below applies.

##### A1: Keep only columns that might impact the bail amount


In [7]:
drop_list = ['address','docket_number','bail_type','outstanding_bail_amount']

indata.drop(columns=drop_list, inplace=True, errors='ignore')

##### A2: Parse Hour of Day and Day of Week, before dropping the date field

In [8]:
datecol = 'filing_date'

indata['filed_hour_of_day'] = indata[datecol].dt.hour

#The day of the week with Monday=0, Sunday=6
indata['filed_day_of_week'] = indata[datecol].dt.dayofweek

indata.drop(columns=[datecol], inplace=True, errors='ignore')
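
The sample rows show `filing_date` with a `+00:00` offset, so `.dt.hour` and `.dt.dayofweek` are taken in UTC. If the timestamps really are UTC (an assumption; the source does not say), converting to local time first changes both features. A minimal sketch, with `America/New_York` assumed as the local zone:

```python
import pandas as pd

# One timestamp copied from the sample output above, parsed as timezone-aware UTC
filing = pd.Series(pd.to_datetime(['2020-06-16 00:37:00+00:00'], utc=True))

utc_hour = filing.dt.hour.iloc[0]
utc_day = filing.dt.dayofweek.iloc[0]      # Monday=0, Sunday=6

# Assumption: filings happen in Philadelphia, so local time is US Eastern
local = filing.dt.tz_convert('America/New_York')
local_hour = local.dt.hour.iloc[0]
local_day = local.dt.dayofweek.iloc[0]

print(utc_hour, utc_day, local_hour, local_day)
```

For this row the local hour falls in the evening of the previous day, so a late-night cluster in UTC may really be an evening cluster in local time.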

##### A3: Remove rows where bail does not apply

In [9]:
clean = indata[indata['bail_amount']>0]

clean.head(5)

| id | age | charge | represented_by | bail_status | bail_amount | filed_hour_of_day | filed_day_of_week |
|---|---|---|---|---|---|---|---|
| 4538 | 44.0 | Verify Address or Photographed as Required | Defender Association of Philadelphia | Monetary | 50000 | 0 | 1 |
| 291 | 32.0 | Contempt For Violation of Order or Agreement | Defender Association of Philadelphia | Monetary | 50000 | 1 | 1 |
| 291 | 32.0 | Burglary - Overnight Accommodations Person Pre... | Defender Association of Philadelphia | Monetary | 75000 | 1 | 1 |
| 291 | 32.0 | Burglary - Overnight Accommodations Person Pre... | Defender Association of Philadelphia | Monetary | 75000 | 1 | 1 |
| 2396 | 51.0 | Simple Assault | Defender Association of Philadelphia | Unsecured | 25000 | 1 | 1 |


### Split data into Train and holdout

If we later use an encoding that draws on the target (target encoding, for example), it must be fitted on the training rows only, so we split the data now. The cost is that every transformation fitted on the training data also has to be applied to the holdout data.

To allow direct comparison with other software and techniques, each row gets an indicator marking it as holdout ('H') or as one of the 5 training folds ('T1' to 'T5'), so that results can be replicated.
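
To illustrate why the split has to come first, here is a minimal sketch of a target-based encoding that is fitted on training rows only and then applied to holdout rows. The toy data and the `fit_target_means` / `apply_target_means` helpers are hypothetical; the notebook itself uses ordinal encoding.

```python
import pandas as pd


def fit_target_means(train, column, target):
    """Learn the mean target per category from the training rows only."""
    return train.groupby(column)[target].mean(), train[target].mean()


def apply_target_means(df, column, means, fallback):
    """Map categories to the learned means; unseen categories get the fallback."""
    return df[column].map(means).fillna(fallback)


train = pd.DataFrame({
    'charge': ['Simple Assault', 'Simple Assault', 'Burglary'],
    'bail_amount': [10000, 30000, 75000],
})
holdout = pd.DataFrame({
    'charge': ['Burglary', 'Criminal Mischief'],  # the second charge is unseen
    'bail_amount': [50000, 5000],
})

means, overall = fit_target_means(train, 'charge', 'bail_amount')
encoded_holdout = apply_target_means(holdout, 'charge', means, overall)
print(encoded_holdout.tolist())
```

The holdout amounts never enter the learned means, which is the property the early split is meant to protect.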


In [27]:
with pd.option_context('mode.chained_assignment', None):
    temptrain, holdoutdata = train_test_split(
        clean,
        test_size=(1 - HOLDOUT_SIZE),
        random_state=1337
    )

    kf = KFold(n_splits=5)
    folds = []
    for i, (_, test_index) in enumerate(kf.split(temptrain), start=1):
        # label the rows of each fold T1..T5
        temp = temptrain.iloc[test_index].copy()
        temp[HOLDOUT_INDICATOR_NAME] = 'T' + str(i)
        folds.append(temp)

    # DataFrame.append was removed in pandas 2.0; concatenate the folds instead
    traindata = pd.concat(folds)

    holdoutdata[HOLDOUT_INDICATOR_NAME] = 'H'
    clean = pd.concat([traindata, holdoutdata])

clean.head(3)

| id | age | charge | represented_by | bail_status | bail_amount | filed_hour_of_day | filed_day_of_week | holdout_ind |
|---|---|---|---|---|---|---|---|---|
| 2271 | 44.0 | Burglary - Overnight Accommodations Person Pre... | Defender Association of Philadelphia | Monetary | 20000 | 4 | 4 | T1 |
| 4371 | 19.0 | Robbery-Inflict Threat Imm Bod Inj | Defender Association of Philadelphia | Monetary | 10000 | 18 | 3 | T1 |
| 762 | 43.0 | Arrest Prior To Requisition | Defender Association of Philadelphia | Monetary | 50000 | 1 | 5 | T1 |


In [28]:
clean.to_csv(Path(artifacts_dir) / 'ready_for_external_tests.csv')
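
An external tool can rebuild the same cross-validation splits from the saved indicator column. A minimal sketch, assuming the `T1`..`T5` / `H` labels written above; `iter_folds` is a hypothetical helper and the toy frame stands in for the exported CSV.

```python
import pandas as pd


def iter_folds(df, indicator='holdout_ind', holdout_label='H'):
    """Yield (fold_label, fit_rows, validation_rows) for each training fold.

    Holdout rows are excluded from every split.
    """
    training = df[df[indicator] != holdout_label]
    for label in sorted(training[indicator].unique()):
        is_validation = training[indicator] == label
        yield label, training[~is_validation], training[is_validation]


toy = pd.DataFrame({
    'bail_amount': [10000, 20000, 50000, 75000, 25000, 5000],
    'holdout_ind': ['T1', 'T1', 'T2', 'T2', 'T3', 'H'],
})

splits = {label: (len(fit), len(val)) for label, fit, val in iter_folds(toy)}
print(splits)
```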

### Feature Engineering

B1. Encode categorical variables (bail_status, hour of day, day of week, represented_by, even charge!) using the category_encoders library. Any method works, but Ordinal is probably the best starting point. Another option worth trying is to sort the categories alphabetically and number them 1, 2, 3, etc.

B2. Impute numeric variable (age) with -9999

##### B1 & B2: Categorical Encoding and Imputation

In [23]:
train = clean[clean['holdout_ind'] != 'H']
holdout = clean[clean['holdout_ind'] == 'H']


vars_ordinal = ['charge','represented_by','bail_status','filed_hour_of_day','filed_day_of_week']

vars_numeric = ['age']


pipeline_numeric = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value=-9999))])

pipeline_ordinal = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                                          ('lex encoding', ce.ordinal.OrdinalEncoder())])

preprocessor = ColumnTransformer(
    transformers=[('num', pipeline_numeric, vars_numeric),
                  ('cat', pipeline_ordinal, vars_ordinal)])
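
The alphabetical alternative mentioned in B1 can be sketched with plain pandas. This is not the `category_encoders` implementation; `fit_alphabetical` and `encode_alphabetical` are hypothetical helpers, and mapping unseen categories to -1 is an assumption.

```python
import pandas as pd


def fit_alphabetical(values):
    """Number the distinct categories 1, 2, 3, ... in sorted order."""
    categories = sorted(pd.Series(values).dropna().unique())
    return {category: code for code, category in enumerate(categories, start=1)}


def encode_alphabetical(values, mapping, unseen=-1):
    """Apply a fitted mapping; missing or unseen categories get `unseen`."""
    return pd.Series(values).map(mapping).fillna(unseen).astype(int)


train_status = ['Monetary', 'Unsecured', 'Monetary', 'ROR']
mapping = fit_alphabetical(train_status)

# 'Nonmonetary' does not appear in the training values
encoded = encode_alphabetical(['ROR', 'Monetary', 'Nonmonetary'], mapping)
print(mapping, encoded.tolist())
```

As with any ordinal scheme, the numbers imply an order the categories do not have; tree models such as the random forest used below can work with that, while linear models usually cannot.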

### Machine Learning / Predictive Modeling

D1. Random Forest model - expect around 1.4e5 RMSE.

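D2. Double-check RMSE against the holdout. Guessing the average for every observation was expected to yield an RMSE of about 1.7e5, so a model at 1.4e5 would be roughly 18% better than that baseline. These are planning estimates; the values printed in D2 are the ones this run produced.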


##### D1: Random Forest

In [94]:
# criterion='mse' was renamed to 'squared_error' in scikit-learn 1.0 and removed in 1.2
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regression', RandomForestRegressor(
        n_estimators=500, criterion='squared_error', max_depth=5,
        min_samples_split=20, min_samples_leaf=6, min_weight_fraction_leaf=0.0,
        max_features=0.3, max_leaf_nodes=1000, min_impurity_decrease=0.0,
        bootstrap=False, oob_score=False, n_jobs=None,
        random_state=1234, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None))])

#TODO: Use Cross Fold Validation to tune the hyperparameters

#scores = cross_val_score(pipe, train.drop(TARGET_VARIABLE_NAME, axis=1), train[TARGET_VARIABLE_NAME], cv=5)

model = pipe.fit(train.drop(TARGET_VARIABLE_NAME, axis=1), train[TARGET_VARIABLE_NAME])

train_y_pred = model.predict(train.drop(TARGET_VARIABLE_NAME, axis=1))
holdout_y_pred = model.predict(holdout.drop(TARGET_VARIABLE_NAME, axis=1))

naive_guess = pd.Series(np.mean(holdout[TARGET_VARIABLE_NAME]))
naive_y_pred = naive_guess.repeat(len(holdout_y_pred))
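
A constant guess equal to the mean of the same rows it is scored on has an RMSE equal to the population standard deviation of the target, and no other constant scores better on those rows. The sketch below checks this with NumPy on made-up amounts (not project data) and compares it with a mean taken from separate, equally made-up training rows.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up bail amounts for illustration only
train_y = np.array([5000, 10000, 25000, 50000])
holdout_y = np.array([10000, 20000, 75000, 500000])

# Baseline as in the cell above: the holdout mean, scored on the holdout
own_mean_rmse = rmse(holdout_y, np.full(len(holdout_y), holdout_y.mean()))

# Alternative baseline: the training mean, which never sees the holdout
train_mean_rmse = rmse(holdout_y, np.full(len(holdout_y), train_y.mean()))

print(own_mean_rmse, np.std(holdout_y), train_mean_rmse)
```

Using the holdout mean makes the naive baseline as strong as a constant can be, so the comparison in D2 is, if anything, conservative.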

##### D2: Check for stable Performance before continuing

In [101]:
train_mse = mean_squared_error(train[TARGET_VARIABLE_NAME], train_y_pred)
holdout_mse = mean_squared_error(holdout[TARGET_VARIABLE_NAME], holdout_y_pred)
naive_mse = mean_squared_error(holdout[TARGET_VARIABLE_NAME], naive_y_pred)

print('Train: ' + str(round(math.sqrt(train_mse))))
print('Holdout: ' + str(round(math.sqrt(holdout_mse))))
print('Naive: ' + str(round(math.sqrt(naive_mse))))

Train: 157398
Holdout: 194196
Naive: 210400


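The dataset is still small, but the model does slightly better than guessing the average: the holdout RMSE is 194,196 against 210,400 for the naive guess, about 8% lower. The training RMSE of 157,398 is well below the holdout RMSE, which suggests some overfitting.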
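
The printed values can be turned into relative figures with a few lines of arithmetic. The numbers below are copied from the output above; `relative_reduction` is a hypothetical helper.

```python
def relative_reduction(baseline, value):
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


# RMSE values as printed by the D2 cell
train_rmse = 157398
holdout_rmse = 194196
naive_rmse = 210400

improvement_over_naive = relative_reduction(naive_rmse, holdout_rmse)
holdout_to_train_ratio = holdout_rmse / train_rmse

print(f'Holdout RMSE is {improvement_over_naive:.1%} below the naive guess')
print(f'Holdout RMSE is {holdout_to_train_ratio:.2f}x the training RMSE')
```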

### Analysis

E1. Matrix of correlation (mutual information?) to check whether these variables are independent of one another

E2. Permutation Importance to show the relative importance of each variable in the model (this is a better interpretation than the tree-importance that comes from the model itself)

E3. Partial Dependence Plots for each of the variables (except the text)

E4. Score original training dataset with model. Filter for observations where predicted value is either top 10% or bottom 10%. Run SHAP to extract #1 reason for each observation in the top/bottom 10%.

E5. Look for any cases where age, represented_by is the #1 factor for the bail_amount. These could be interesting cases to highlight

E6. Word cloud of the terms - this could take some work, as I'm not too familiar with it


##### E1: Correlation Matrix 
Can we use Mutual Information?
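
One possible answer, sketched under assumptions: scikit-learn's `normalized_mutual_info_score` compares two discrete labelings, so it can fill a pairwise matrix for the categorical features. A continuous column such as age would need binning first. The toy data and the `pairwise_nmi` helper are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score


def pairwise_nmi(df):
    """Return a symmetric matrix of normalized mutual information scores."""
    # Integer codes per column; the scores depend only on the groupings
    codes = {col: pd.factorize(df[col])[0] for col in df.columns}
    matrix = pd.DataFrame(np.eye(len(df.columns)),
                          index=df.columns, columns=df.columns)
    for i, a in enumerate(df.columns):
        for b in df.columns[i + 1:]:
            score = normalized_mutual_info_score(codes[a], codes[b])
            matrix.loc[a, b] = matrix.loc[b, a] = score
    return matrix


toy = pd.DataFrame({
    'bail_status': ['Monetary', 'Monetary', 'Unsecured', 'Unsecured'] * 5,
    'status_copy': ['M', 'M', 'U', 'U'] * 5,       # same grouping, new labels
    'filed_day_of_week': [0, 1, 0, 1] * 5,         # unrelated to bail_status
})

nmi = pairwise_nmi(toy)
print(nmi.round(3))
```

Scores near 1 flag pairs that carry the same information, and scores near 0 are consistent with independence. Unlike a correlation coefficient, the score does not depend on how the categories are numbered.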

##### E2: Feature Impact

In [102]:
import eli5
from eli5.sklearn import PermutationImportance

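# pass the axis by keyword (required in pandas 2.0+) and name only the feature columns
X_train = train.drop(TARGET_VARIABLE_NAME, axis=1)
perm = PermutationImportance(pipe, random_state=1).fit(X_train, train[TARGET_VARIABLE_NAME])
eli5.show_weights(perm, feature_names = X_train.columns.tolist())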

ValueError: could not convert string to float: 'Criminal Attempt - Murder of A Law Enforcement Officer of the First Degree'
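
The error appears to arise because eli5's `PermutationImportance` converts the raw features to a numeric array before the pipeline's encoder sees them. One possible workaround, sketched here on invented data, is scikit-learn's own `sklearn.inspection.permutation_importance`, which shuffles DataFrame columns and passes them through a fitted pipeline. The sketch swaps in scikit-learn's `OrdinalEncoder` for `category_encoders` to stay self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
n = 300

# Invented data: the charge drives the amount, age is pure noise
charge = rng.choice(['Burglary', 'Criminal Mischief', 'Simple Assault'], size=n)
base = pd.Series(charge).map({'Burglary': 75000,
                              'Criminal Mischief': 5000,
                              'Simple Assault': 25000})
X = pd.DataFrame({'charge': charge, 'age': rng.integers(18, 70, size=n)})
y = base.to_numpy() + rng.normal(0, 1000, size=n)

preprocessor = ColumnTransformer(transformers=[
    ('cat', OrdinalEncoder(), ['charge']),
    ('num', 'passthrough', ['age']),
])
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regression', RandomForestRegressor(n_estimators=50, random_state=0)),
])
pipe.fit(X, y)

# Columns of X are shuffled one at a time, so string columns are fine
result = permutation_importance(pipe, X, y, n_repeats=5, random_state=0)
importances = pd.Series(result.importances_mean, index=X.columns)
print(importances.sort_values(ascending=False))
```

Importances computed on the training rows can flatter features the model has overfit; scoring on the holdout rows is a common alternative.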

##### E3: Feature Effects

##### E4: Feature Explanations

##### E5: Specific Examples

##### E6: Term Frequency / Word Cloud?