# Philly Bail Fund
## Analysis of factors related to Bail Amounts

For more details, see the GitHub repo: https://github.com/CodeForPhilly/pbf-analysis

[insert overview of approach here.. img?]


### Imports

In [7]:
### Standard Imports - Sorry PEP8 fans, do not look below
import pandas as pd, numpy as np, os, re, json, pickle, math, calendar
from time import time
from pprint import pprint
from pathlib import Path
from datetime import datetime


### Display options for notebooks
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 25)

### set path directories
curr_dir = Path(os.getcwd())
#print('Current Directory is: ', str(curr_dir))
data_dir = Path(curr_dir.parents[0] / 'Data/')
artifacts_dir = Path(curr_dir / 'artifacts/')

In [10]:
### Common project specific variables
FILENAME = '0c_distinct_dockets.csv'  # original data
TARGET_VARIABLE_NAME = 'bail_amount'
HOLDOUT_INDICATOR_NAME = 'holdout_ind'
DATECOL = 'filing_date'
HOLDOUT_SIZE = 0.80

### Helper Functions

In [2]:
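# helper function to reduce memory footprint of the dataframe:
# each numeric column is downcast to the smallest dtype that holds its min/max.
# Note: float16 keeps only about 3 significant decimal digits, so float columns can lose precision.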
def reduce_mem_usage(df, verbose=True):
    import numpy as np
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: 
        print('Mem. usage decreased to {:5.2f} MB ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
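
For comparison, pandas ships its own downcasting through `pd.to_numeric`. The sketch below applies it to a small made-up frame; the `downcast_numeric` name and the demo values are illustrative assumptions, not part of the project code.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns downcast via pd.to_numeric."""
    out = df.copy()
    # Integer columns: smallest signed integer dtype that fits the values.
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    # Float columns: pd.to_numeric never goes below float32.
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Made-up rows shaped like the docket data (illustrative only).
demo = pd.DataFrame({
    "bail_amount": np.array([0, 50000, 75000], dtype="int64"),
    "age": np.array([27.0, 44.0, 24.0], dtype="float64"),
})
small = downcast_numeric(demo)
print(small.dtypes)
```

Unlike `reduce_mem_usage`, this variant stops at float32, which avoids the precision loss that float16 can introduce.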

### Data Import

In [8]:
# read in the data and take a peek at it
indata = pd.read_csv(Path(data_dir) / FILENAME, parse_dates=['filing_date'], index_col='id')

indata.head(3)

| id | age | address | docket_number | filing_date | charge | represented_by | bail_type | bail_status | bail_amount | outstanding_bail_amount |
|---|---|---|---|---|---|---|---|---|---|---|
| 3909 | 27.0 | Philadelphia, PA 19141 | MC-51-CR-0011746-2020 | 2020-06-16 00:37:00+00:00 | DUI: Gen Imp/Inc of Driving Safely - 1st Off | Defender Association of Philadelphia | Posted | ROR | 0 | 0 |
| 4538 | 44.0 | Philadelphia, PA 19124 | MC-51-CR-0011747-2020 | 2020-06-16 00:41:00+00:00 | Verify Address or Photographed as Required | Defender Association of Philadelphia | Set | Monetary | 50000 | 0 |
| 120 | 24.0 | Philadelphia, PA 19142 | MC-51-CR-0011743-2020 | 2020-06-16 00:52:00+00:00 | Criminal Mischief | Defender Association of Philadelphia | Posted | ROR | 0 | 0 |


### Data Setup

A1. Keep = bail_amount, charge, bail_status, filing_date, age, represented_by

A2. Create hour of day and day of week from filing_date, then drop the original filing_date

A3. Delete rows where bail_status = 'Denied' (we will only worry about ones where there is a set amount)

##### A1: Keep only columns that might impact the bail amount

In [None]:
# drop things we know aren't useful
drop_list = ['address','docket_number','bail_type','outstanding_bail_amount']

indata.drop(columns=drop_list, inplace=True, errors='ignore')

##### A2: Parse Hour of Day and Day of Week, before dropping the date field

In [11]:
# hour of day
indata['filed_hour_of_day'] = indata[DATECOL].dt.hour

# day of the week with Monday=0, Sunday=6
indata['filed_day_of_week'] = indata[DATECOL].dt.dayofweek.apply(lambda x: calendar.day_abbr[x])

indata.drop(columns=[DATECOL], inplace=True, errors='ignore')

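# normalize the charge text: lowercase, then strip punctuation
# (regex=True is required; newer pandas treats the pattern as a literal string by default)
indata['charge'] = indata['charge'].str.lower().str.replace(r'[^\w\s]', '', regex=True)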
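
The feature engineering in this cell can be checked on a tiny made-up frame. This is a minimal sketch: the two rows are illustrative, and `dt.day_name()` is used in place of `calendar.day_abbr` as an assumption-free way to get English weekday names regardless of locale.

```python
import pandas as pd

# Two made-up rows in the same shape as the docket data (illustrative only).
demo = pd.DataFrame({
    "filing_date": pd.to_datetime(
        ["2020-06-16 00:37:00+00:00", "2020-06-21 13:05:00+00:00"]
    ),
    "charge": ["DUI: Gen Imp/Inc of Driving Safely - 1st Off", "Criminal Mischief"],
})

# Hour of day (0-23) taken from the timestamp as stored.
demo["filed_hour_of_day"] = demo["filing_date"].dt.hour

# Three-letter English weekday name.
demo["filed_day_of_week"] = demo["filing_date"].dt.day_name().str[:3]

# Lowercase the charge text and remove anything that is not a word character or whitespace.
demo["charge"] = demo["charge"].str.lower().str.replace(r"[^\w\s]", "", regex=True)

print(demo[["filed_hour_of_day", "filed_day_of_week", "charge"]])
```

The timestamps carry a +00:00 offset. If they really are UTC, the hour and weekday derived here are UTC values rather than Philadelphia local time, which would shift late-evening filings to the next day.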

##### A3: Remove rows where bail does not apply

In [12]:
# keep only rows where the bail amount is greater than zero
clean = indata[indata['bail_amount']>0]

clean.head(5)

| id | age | address | docket_number | charge | represented_by | bail_type | bail_status | bail_amount | outstanding_bail_amount | filed_hour_of_day | filed_day_of_week |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4538 | 44.0 | Philadelphia, PA 19124 | MC-51-CR-0011747-2020 | verify address or photographed as required | Defender Association of Philadelphia | Set | Monetary | 50000 | 0 | 0 | Tue |
| 291 | 32.0 | Philadelphia, PA 19440 | MC-51-CR-0011748-2020 | contempt for violation of order or agreement | Defender Association of Philadelphia | Set | Monetary | 50000 | 0 | 1 | Tue |
| 291 | 32.0 | Philadelphia, PA 19440 | MC-51-CR-0011749-2020 | burglary overnight accommodations person pres... | Defender Association of Philadelphia | Set | Monetary | 75000 | 0 | 1 | Tue |
| 291 | 32.0 | Philadelphia, PA 19440 | MC-51-CR-0011750-2020 | burglary overnight accommodations person pres... | Defender Association of Philadelphia | Set | Monetary | 75000 | 0 | 1 | Tue |
| 2396 | 51.0 | Philadelphia, PA 19145 | MC-51-CR-0011751-2020 | simple assault | Defender Association of Philadelphia | Posted | Unsecured | 25000 | 0 | 1 | Tue |


### Load Models
These were trained using a Random Forest regressor with an ElasticNet submodel built from words in the charges column. See other notebook for details.

In [15]:
with open(Path(artifacts_dir) / 'text_submodel.mdl', 'rb') as f:
    submodel = pickle.load(f)

with open(Path(artifacts_dir) / 'model.mdl', 'rb') as f:
    model = pickle.load(f)

In [None]:
# use the submodel to replace the charge field with a number
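
One way this step could look, as a rough sketch only: the interface of the pickled `submodel` is not shown in this notebook, so the code below builds a hypothetical stand-in (`text_submodel`, a TF-IDF vectorizer feeding an ElasticNet) on made-up rows and assumes it accepts raw charge text in `.predict`. The `charge_score` column name is also an assumption.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline

# Made-up training rows standing in for the real charges and bail amounts.
train = pd.DataFrame({
    "charge": [
        "simple assault",
        "burglary overnight accommodations",
        "criminal mischief",
        "simple assault",
        "burglary overnight accommodations",
    ],
    "bail_amount": [25000, 75000, 5000, 20000, 80000],
})

# Hypothetical stand-in for the pickled text submodel (assumption, not the project's model).
text_submodel = make_pipeline(TfidfVectorizer(), ElasticNet(alpha=0.1))
text_submodel.fit(train["charge"], train["bail_amount"])

# Replace the free-text charge with the submodel's numeric prediction.
scored = train.copy()
scored["charge_score"] = text_submodel.predict(scored["charge"])
scored = scored.drop(columns=["charge"])
print(scored)
```

If the real submodel expects pre-vectorized input instead of raw text, the `predict` call would need the matching vectorizer in front of it.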




### Analysis

E1. Matrix of correlation (mutual information?) to prove these are independent variables

E2. Permutation Importance to show the relative importance of each variable in the model (this is a better interpretation than the tree-importance that comes from the model itself)

E3. Partial Dependence Plots for each of the variables (except the text)

E4. Score original training dataset with model. Filter for observations where predicted value is either top 10% or bottom 10%. Run SHAP to extract #1 reason for each observation in the top/bottom 10%.

E5. Look for any cases where age or represented_by is the #1 factor for the bail_amount. These could be interesting cases to highlight

E6. Word cloud of the terms - this could take some work, since I'm not too familiar with this
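
The filtering part of E4 can be sketched with pandas quantiles. The `predicted_bail` column and the random values below are assumptions for illustration; in the notebook they would come from scoring the training data with the loaded model.

```python
import numpy as np
import pandas as pd

# Made-up predictions standing in for the model's scores (illustrative only).
rng = np.random.default_rng(0)
scored = pd.DataFrame({"predicted_bail": rng.uniform(1_000, 100_000, size=200)})

# Cut points for the bottom and top deciles of the predicted values.
low_cut, high_cut = scored["predicted_bail"].quantile([0.10, 0.90])

# Keep only observations in either tail and label which tail they fall in.
in_tail = (scored["predicted_bail"] <= low_cut) | (scored["predicted_bail"] >= high_cut)
extremes = scored[in_tail].copy()
extremes["tail"] = np.where(extremes["predicted_bail"] >= high_cut, "top 10%", "bottom 10%")

print(extremes["tail"].value_counts())
```

The resulting `extremes` frame is the subset that would then be passed to SHAP to pull out the top reason per observation.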


##### E1: Correlation Matrix 
Can we use mutual information?

In [17]:
# I checked this in external tools and confirmed that they are independent - I just need to find time to code this in python and make a nice graphic
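
Until that is coded up, here is a minimal sketch of a mutual information check between two categorical columns. The `mutual_information` helper and the demo values are assumptions for illustration; numeric columns such as age would need to be binned first.

```python
import numpy as np
import pandas as pd

def mutual_information(a: pd.Series, b: pd.Series) -> float:
    """Mutual information (in nats) between two discrete series."""
    # Joint probability table p(a, b); arrays are passed to avoid name clashes.
    joint = pd.crosstab(a.to_numpy(), b.to_numpy(), normalize=True).to_numpy()
    # Marginals p(a) as a column vector and p(b) as a row vector.
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    expected = pa @ pb  # joint table under independence
    nonzero = joint > 0  # skip empty cells, which contribute 0
    return float((joint[nonzero] * np.log(joint[nonzero] / expected[nonzero])).sum())

# Made-up categorical columns (illustrative values only).
demo = pd.DataFrame({
    "filed_day_of_week": ["Mon", "Mon", "Tue", "Tue"],
    "represented_by": ["public", "private", "public", "private"],
})

mi_indep = mutual_information(demo["filed_day_of_week"], demo["represented_by"])
mi_self = mutual_information(demo["filed_day_of_week"], demo["filed_day_of_week"])
print(mi_indep, mi_self)
```

Mutual information is zero exactly when two variables are independent, which makes it a stricter check than a correlation coefficient. Looping the helper over every pair of columns would give the matrix for a heatmap.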

##### E2: Feature Impact

In [20]:
import eli5
from eli5.sklearn import PermutationImportance

x = clean.drop([TARGET_VARIABLE_NAME], axis=1)
y = clean[TARGET_VARIABLE_NAME]

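# NOTE: final_pipe is never defined in this notebook, which causes the NameError below.
# It is presumably the fitted pipeline from the training notebook, with
# 'preprocessor' and 'final_model' steps.
estimator = final_pipe.named_steps['final_model']
imputer = final_pipe.named_steps['preprocessor']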

x2 = imputer.transform(x)

perm = PermutationImportance(estimator, random_state=1, cv=None)

perm.fit(x2, y)

features = list(x.columns)

eli5.explain_weights_df(perm, feature_names = features)

NameError: name 'final_pipe' is not defined
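
As an alternative sketch of the same idea, scikit-learn has its own `permutation_importance` in `sklearn.inspection`. Note that this swaps in scikit-learn for eli5 and uses a small random forest trained on synthetic data; the feature names and the target are made up for illustration and the model is not the project's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic features loosely named after the notebook's columns (illustrative only).
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=300),
    "filed_hour_of_day": rng.integers(0, 24, size=300),
    "charge_score": rng.uniform(0, 100_000, size=300),
})
# Synthetic target driven almost entirely by charge_score.
y = X["charge_score"] + rng.normal(0, 1_000, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(X, y)

# Shuffle each column in turn and measure how much the model's score drops.
result = permutation_importance(forest, X, y, n_repeats=5, random_state=1)

importances = pd.Series(result.importances_mean, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Here the importances are computed on the training rows for brevity; computing them on held-out rows generally gives a more honest picture of what the model relies on.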

In [None]:
perm = PermutationImportance(estimator, random_state=1, cv=None)