**Link to the 'data' folder containing all the .csv files used in this notebook**
https://drive.google.com/drive/folders/1yoFNldlv3zyhrLEpY58Gw2SDVm-MH50_?usp=sharing

**Add a shortcut to it in your Drive**, inside a folder called Big Data. The path should be: directory = '/content/drive/My Drive/Big Data/data/'

# Police Killings Test on Random Records - Team "How I Met Your Big Data"

## Imports, Drive Mounting and Function Definitions

In [1]:
from google.colab import drive
drive.mount('/content/drive')
directory = '/content/drive/My Drive/Big Data/data/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install shap
!pip install gender_guesser



In [3]:
%matplotlib inline
import math
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import random 
import time
import re
from collections import defaultdict
from google.colab import files

In [4]:
# Normalize the county names of the police records so that they match the county names in economy.csv
def county_manager(police):
  police_desc = police.copy()

  eco = pd.read_csv(directory+'economy.csv')
  del eco['Unnamed: 0']
  eco['county'] = eco['county'].str.strip()
  eco['state'] = eco['state'].str.strip()
  eco[eco['county']=='District of Colu'] = eco[eco['county']=='District of Colu'].replace('District of Colu', 'District of Columbia')
  eco[eco['state_po']=='ia'] = eco[eco['state_po']=='ia'].replace('ia', 'DC')

  #Preprocessing
  police_desc['death_county'] = police_desc['death_county'].str.capitalize()
  eco['county'] = eco['county'].str.capitalize()
  police_desc['death_county'].replace({"Saint ": "St. ", 
                                     "St ":"St. ", 
                                     " county": ""}, regex=True, inplace=True)
  police_desc['death_county'].replace({"mary":"mary's"}, regex=True, inplace=True)
  police_desc['death_county'].replace({"mary's's":"mary's"}, regex=True, inplace=True)

  county1 = police_desc['death_county'].unique()
  county2 = eco['county'].unique()
  z = set(county1).intersection(set(county2))

  #Check which counties in our dataset are not contained in eco
  not_in_eco = ~police_desc['death_county'].isin(county2)

  #Recover those counties from the coordinates via reverse geocoding
  if not_in_eco.any():
    police_desc.loc[not_in_eco, 'death_county'] = np.vectorize(get_location)(police_desc.loc[not_in_eco, 'latitude'], police_desc.loc[not_in_eco, 'longitude'], 'county')

  police_desc['death_city'].fillna('Unknown', inplace=True)
  police_desc['death_county'] = police_desc['death_county'].str.title()
  eco['county'] = eco['county'].str.title()
  county4 = eco['county'].unique()
  police_desc['death_county'].replace({" County":""}, regex=True, inplace=True)
  newcountyset = police_desc['death_county'].unique()
  w = set(newcountyset).intersection(set(county4))
  police_desc[~police_desc['death_county'].isin(county4)]

  police_desc['death_county'].replace({"Anchorage": "Anchorage Borough/Municipality", 
                                     "Northwest Arctic":"Northwest Arctic Borough", 
                                     "Fairbanks North Star":"Fairbanks North Star Borough", 
                                     "Ketchikan Gateway":"Ketchikan Gateway Borough", 
                                     "Matanuska-Susitna":"Matanuska-Susitna Borough", 
                                     "Saint ": "St. ", 
                                     "Kenai Peninsula": "Kenai Peninsula Borough", 
                                     "North Slope": "North Slope Borough", 
                                     "Athens-Clarke": "Athens", 
                                     "Nome": "Nome Census Area", 
                                     "Sitka": "Sitka City And Borough",
                                     "Petersburg Borough":"Petersburg Census Area", 
                                     "Newport News":"Newport News City", 
                                     "Denali":"Denali Borough",
                                     "Oglala Lakota":"Dakota", 
                                     "Kusilvak Census Area":"Unorganized Borough",
                                     }, regex=True, inplace=True)
  
  police_desc['death_county'].replace({"Newport News City City": "Newport News City"}, regex=True, inplace=True)
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_city']=='Washington'), 'death_county'] = 'District Of Columbia'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_county']=='') & (police_desc['death_city']=='New york'), 'death_county'] = 'New York'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_county']=='') & (police_desc['death_city']=='Brooklyn'), 'death_county'] = 'Kings'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_county']=='') & (police_desc['death_city']=='Chesapeake'), 'death_county'] = 'Kanawha'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_county']=='') & (police_desc['death_city']=='Virginia beach'), 'death_county'] = 'Virginia Beach City'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_county']=='') & (police_desc['death_city']=='Lynchburg'), 'death_county'] = 'Lynchburg City'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_city']=='Kaltag'), 'death_county'] = 'Yukon-Koyukuk Census Area'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_city']=='Nome'), 'death_county'] = 'Nome Census Area'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_city']=='Delta junction'), 'death_county'] = 'Southeast Fairbanks Census Area'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_city']=='Craig'), 'death_county'] = 'Prince Of Wales-Hyder Census Area'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_city']=='Bethel'), 'death_county'] = 'Bethel Census Area'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_city']=='Chevak'), 'death_county'] = 'Bethel Census Area'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_city']=='Pilot station'), 'death_county'] = 'Bethel Census Area'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_city']=='Copper center'), 'death_county'] = 'Valdez-Cordova Census Area'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_city']=="Saint mary's"), 'death_county'] = 'Bethel Census Area'
  police_desc.loc[(~police_desc['death_county'].isin(county4)) & (police_desc['death_city']=='Chesapeake'), 'death_county'] = 'Kanawha'

  police_desc['death_county'] = police_desc['death_county']+', '+police_desc['state_po']
  police_desc['death_city'] = police_desc['death_city']+', '+police_desc['death_county']

  return police_desc
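
The name clean-up at the start of `county_manager` can be illustrated on single strings. The sketch below is a simplified, standalone approximation and not the function used in this notebook: `normalize_county` is a hypothetical helper, and it only covers the "Saint"/"St" prefix, the trailing "county" suffix and the final title-casing.

```python
import re

def normalize_county(name):
    """Hypothetical helper: approximate the first clean-up steps of county_manager."""
    text = name.strip().capitalize()                    # 'SAINT LOUIS COUNTY' -> 'Saint louis county'
    text = re.sub(r"^(Saint|St\.?)\s+", "St. ", text)   # unify the 'Saint ' / 'St ' prefix
    text = re.sub(r"\s+county$", "", text)              # drop a trailing ' county'
    return text.title()                                 # title-case every word

examples = ["SAINT LOUIS COUNTY", "St Clair", "Harris County", "prince george's county"]
normalized = [normalize_county(e) for e in examples]
```

Note that `str.title()` also capitalizes the letter after an apostrophe, which is why names such as "Prince George'S" appear in the encoded labels later in the notebook.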


In [5]:
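def get_gender(name):
  #gender_guesser looks up first names, so pass only the first token of the full name
  detector = gender.Detector()
  tokens = str(name).split()
  first_name = tokens[0] if tokens else ''
  guess = detector.get_gender(first_name)
  #Anything not recognized as (mostly) female defaults to 'Male'
  if guess == 'female' or guess == 'mostly_female':
    return 'Female'
  else:
    return 'Male'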

In [6]:
def test_numeric(x):
    try:
        int(x)
        return True
    except Exception:
        return False

In [7]:
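def age_from_months(test_str):
  #Ages given in months: fewer than 23 months counts as age 0, otherwise as age 1
  for ele in test_str.split():
    if ele.isdigit():
      return 0 if int(ele) < 23 else 1
  return 0 #No number found in the string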

In [8]:
def age_from_s(test_str):
  intero = int(test_str[:len(test_str)-1]) + 5
  return intero
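
The helpers above (`test_numeric`, `age_from_months`, `age_from_s`) each handle one kind of non-numeric age entry. As a rough illustration of the same idea in one place, the sketch below parses a raw age string with regular expressions. `parse_age` is a hypothetical helper, not part of this notebook, and its rules are assumptions: decades map to their midpoint (as in `age_from_s`), months are converted to whole years, and ranges use the rounded-down midpoint, so its results can differ from the hand-picked replacements used later (for example 49 instead of 50 for '46/53').

```python
import re

def parse_age(raw):
    """Hypothetical helper: convert a raw age entry to an int, or None if it cannot be parsed."""
    if raw is None:
        return None
    text = str(raw).strip().lower()
    # Plain numbers, tolerating a trailing period such as '55.'
    if re.fullmatch(r"\d+(\.\d*)?", text):
        return int(float(text))
    # Infants reported in days or months
    match = re.fullmatch(r"(\d+)\s*(day|month)s?", text)
    if match:
        count, unit = int(match.group(1)), match.group(2)
        return 0 if unit == "day" else count // 12
    # Decades such as '40s' map to the midpoint of the decade
    match = re.fullmatch(r"(\d+)s", text)
    if match:
        return int(match.group(1)) + 5
    # Ranges such as '18-25' or '46/53' map to the rounded-down midpoint
    match = re.fullmatch(r"(\d+)\s*[-/]\s*(\d+)", text)
    if match:
        return (int(match.group(1)) + int(match.group(2))) // 2
    return None  # anything else (e.g. '20s-30s') is left for manual handling
```

Entries the sketch does not recognize come back as `None`, so they can still be imputed like any other missing age.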

In [9]:
def trimester(x):
  if x <= 3:
    trimester = 1
  elif x > 3 and x <= 6:
    trimester = 2
  elif x > 6 and x <= 9:
    trimester = 3
  else:
    trimester = 4
  return trimester
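
For valid month numbers the branch chain in `trimester` is equivalent to a single integer division. The sketch below shows that form; `quarter_of` is a hypothetical name, and unlike `trimester` (which maps every value above 9 to period 4) it rejects values outside 1 to 12.

```python
def quarter_of(month):
    """Return the three-month period (1 to 4) that contains a month numbered 1 to 12."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")
    return (month - 1) // 3 + 1

periods = [quarter_of(m) for m in range(1, 13)]
```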

In [10]:
import geopy
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="geoapiExercises2")
    
def get_location(lat, long, loc):
  location = geolocator.reverse(str(lat)+","+str(long), language='en', timeout=1)
  if location is None : return None
  else: 
    address = location.raw['address']
    if loc == 'county':
      loc = address.get('county', '')
    elif loc == 'city':
      loc = address.get('city', '')
  return loc

In [11]:
%matplotlib inline
import os, sys, inspect
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import warnings
warnings.filterwarnings('ignore')

#UTILITY FUNCTIONS
#Encode columns
def encoder(df, column): 
    le.fit(df[column])
    le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_))) 
    print('\t',le_name_mapping)
    return le.transform(df[column])

#Multicolumn encoder
class MultiColumnLabelEncoder:

    def __init__(self, columns=None):
        self.columns = columns # array of column names to encode

    def fit(self, X, y=None):
        self.encoders = {}
        columns = X.columns if self.columns is None else self.columns
        for col in columns:
            self.encoders[col] = LabelEncoder().fit(X[col])
        return self

    def transform(self, X):
        output = X.copy()
        columns = X.columns if self.columns is None else self.columns
        for col in columns:
            output[col] = self.encoders[col].transform(X[col])
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X,y).transform(X)

    def inverse_transform(self, X):
        output = X.copy()
        columns = X.columns if self.columns is None else self.columns
        for col in columns:
            if col != 'death_description':
              output[col] = self.encoders[col].inverse_transform(X[col].astype(int))
        return output

    def inverse_transform_desc(self, X):
        output = X.copy()
        columns = X.columns if self.columns is None else self.columns
        for col in columns:
            if col == 'death_description':
              output[col] = self.encoders[col].inverse_transform(X[col].astype(int))
        return output

In [12]:
def local_exp(id, ds, prediction, info, expl_df):
    chosen_instance = ds.loc[[id]]
    shap_values = explainer.shap_values(chosen_instance)
    print('Predicted value:', prediction.loc[id, 'predict'])

    if(prediction.shape[0] == police.shape[0]):
        print('True value:', target.loc[id])
    else:
        print('True value:', y_output.loc[id])
    
    print('Encounter Description: \n')
    print(re.sub("(.{180})", "\\1\n", info.loc[id, 'death_description'], 0, re.DOTALL))

    shap.initjs()
    expl_df['death_description'] = expl_df['death_description'].str[0:7] + '...'
    display(shap.force_plot(explainer.expected_value[1], shap_values[1], expl_df.loc[[id]]))

    #p = shap.force_plot(explainer.expected_value[1], shap_values[1], expl_df.loc[[id]])
    #return p

## Read Test Data

In [13]:
#If the given file has no header row, pass names=header_columns to pd.read_csv
header_columns = ['Unique ID', "Subject's name", "Subject's age", "Subject's gender",
       "Subject's race", "Subject's race with imputations",
       'Imputation probability', 'URL of image of deceased',
       'Date of injury resulting in death (month/day/year)',
       'Location of injury (address)', 'Location of death (city)',
       'Location of death (state)', 'Location of death (zip code)',
       'Location of death (county)', 'Full Address', 'Latitude', 'Longitude',
       'Agency responsible for death', 'Cause of death',
       'A brief description of the circumstances surrounding the death',
       'Dispositions/Exclusions INTERNAL USE, NOT FOR ANALYSIS',
       'Intentional Use of Force (Developing)',
       'Link to news article or photo of official document',
       'Symptoms of mental illness? INTERNAL USE, NOT FOR ANALYSIS', 'Video',
       'Date&Description', 'Unique ID formula',
       'Unique identifier (redundant)', 'Date (Year)']

police = pd.read_csv(directory+'police_test_2.txt', delimiter=',') 
police.dropna(thresh=10, axis=0, inplace=True) #The original dataset contains an empty row: drop every row with fewer than 10 non-null values to remove it.
print(len(police))
police

8


Unnamed: 0,Unique ID,Subject's name,Subject's age,Subject's gender,Subject's race,Subject's race with imputations,Imputation probability,URL of image of deceased,Date of injury resulting in death (month/day/year),Location of injury (address),Location of death (city),Location of death (state),Location of death (zip code),Location of death (county),Full Address,Latitude,Longitude,Agency responsible for death,Cause of death,A brief description of the circumstances surrounding the death,"Dispositions/Exclusions INTERNAL USE, NOT FOR ANALYSIS",Intentional Use of Force (Developing),Link to news article or photo of official document,"Symptoms of mental illness? INTERNAL USE, NOT FOR ANALYSIS",Video,Date&Description,Unique ID formula,Unique identifier (redundant),Date (Year)
0,11330,James Paul Galladora,43,Male,Hispanic/Latino,Hispanic/Latino,not imputed,http://ak-cache.legacy.net/legacy/Images/Cobra...,05/11/2012,Beacham Dr,Houston,TX,77070,Harris,Beacham Dr Houston TX 77070 Harris,29.993797,-95.58626,"Harris County Constable's Office Precinct 4, H...",Gunshot,"Violating a protective order, James Galladora ...",Ruled suicide,Suicide,http://abc13.com/archive/8666167/,No,,"5/11/2012: Violating a protective order, James...",,11330,2012
1,11333,Glen Wayne Kyle,51,Male,European-American/White,European-American/White,not imputed,https://www.fatalencounters.org/wp-content/upl...,05/12/2012,W Britton Rd & N Richland Rd,Yukon,OK,73099,Canadian,W Britton Rd & N Richland Rd Yukon OK 73099 Ca...,35.565674,-97.7984,"Canadian County Sheriff's Office, El Reno Poli...",Gunshot,Officers received a call about a driver firing...,Justified,"Intentional Use of Force, Deadly",http://newsok.com/article/3675257,No,,5/12/2012: Officers received a call about a dr...,,11333,2012
2,23240,Ernest Lynn France,47,Male,European-American/White,European-American/White,not imputed,,05/12/2012,Oak Street and Dale Street,Kingsport,TN,37660,Sullivan,Oak Street and Dale Street Kingsport TN 37660 ...,36.541674,-82.551778,Kingsport Police Department,Gunshot,Ernest Lynn France was allegedly off his menta...,Justified,"Intentional Use of Force, Deadly",http://www.heraldcourier.com/news/before-point...,Yes,,5/12/2012: Ernest Lynn France was allegedly of...,,23240,2012
3,11334,Xavier Gonzalez-Torres,30,Male,Hispanic/Latino,Hispanic/Latino,not imputed,,05/12/2012,15000 block Stafford Street,City of Industry,CA,91744,Los Angeles,15000 block Stafford Street City of Industry C...,34.028646,-117.963876,Los Angeles County Sheriff's Department,Gunshot,Deputies reported that they followed a suspect...,Unreported,"Intentional Use of Force, Deadly",http://homicide.latimes.com/post/xavier-gonzal...,Unknown,,5/12/2012: Deputies reported that they followe...,,11334,2012
4,11342,Brian Wesley King Jr.,25,Male,African-American/Black,African-American/Black,not imputed,https://fatalencounters.org/wp-content/uploads...,05/14/2012,2414 NW Cornell Avenue,Lawton,OK,73505,Comanche,2414 NW Cornell Avenue Lawton OK 73505 Comanche,34.607945,-98.426579,"Lawton Police Department, Comanche Nation Poli...",Gunshot,Comanche County Dispatch received a 911 call r...,Justified,"Intentional Use of Force, Deadly",http://www.kswo.com/story/18371858/man-dead-af...,No,,5/14/2012: Comanche County Dispatch received a...,,11342,2012
5,11349,Seth Isaac Adams,24,Male,European-American/White,European-American/White,not imputed,http://media.cmgdigital.com/shared/img/photos/...,05/16/2012,1950 A Road,Loxahatchee,FL,33470,Palm Beach,1950 A Road Loxahatchee FL 33470 Palm Beach,26.705838,-80.295364,Palm Beach County Sheriff's Office,Gunshot,Undercover Sgt Michael Mario Custer trespassin...,Justified,"Intentional Use of Force, Deadly",http://www.palmbeachpost.com/news/news/crime-l...,No,,5/16/2012: Undercover Sgt Michael Mario Custer...,,11349,2012
6,11346,John Melo,59,Male,European-American/White,European-American/White,not imputed,http://ak-cache.legacy.net/legacy/images/Cobra...,05/17/2012,South 33rd Street and Shortridge Ave.,San Jose,CA,95116,Santa Clara,South 33rd Street and Shortridge Ave. San Jose...,37.350311,-121.858389,California Highway Patrol,Vehicle,Bystander John Melo died hours after he was we...,Unreported,Vehicle/Pursuit,http://www.contracostatimes.com/ci_20742872,No,,5/17/2012: Bystander John Melo died hours afte...,,11346,2012
7,11348,Parish Laconley Powell,38,Male,African-American/Black,African-American/Black,not imputed,,05/17/2012,800 block Dartmouth Avenue,Bessemer,AL,35020,Jefferson,800 block Dartmouth Avenue Bessemer AL 35020 J...,33.386375,-86.955283,Bessemer Police Department,Gunshot,Officer Gabriel Kinderknecht shot and killed P...,Unreported,"Intentional Use of Force, Deadly",http://blog.al.com/spotnews/2012/05/man_fatall...,Yes,,5/17/2012: Officer Gabriel Kinderknecht shot a...,,11348,2012


# Preprocess Test Data 

In [14]:
police.rename(columns={
    "Unique ID": "id_1", 
    "Subject's name": "name",
    "Subject's age": "age",
    "Subject's gender": "gender",
    "Subject's race": "race",
    "Subject's race with imputations": "race_with_imputations",
    "Imputation probability": "imputation_probability", 
    "URL of image of deceased": "url_image",
    "Date of injury resulting in death (month/day/year)": "death_date", 
    "Location of injury (address)": 'death_address',
    "Location of death (city)": "death_city", 
    "Location of death (state)": "death_state",
    "Location of death (zip code)": "death_zipcode", 
    "Location of death (county)": "death_county", 
    "Full Address": "death_full_address", 
    "Latitude": "latitude",
    "Longitude": "longitude", 
    "Agency responsible for death": "police_agency",
    "Cause of death": "death_cause", 
    "A brief description of the circumstances surrounding the death": "death_description",
    "Dispositions/Exclusions INTERNAL USE, NOT FOR ANALYSIS": "dispositions_exclusions", 
    "Intentional Use of Force (Developing)": "use_of_force",
    "Link to news article or photo of official document": "death_article", 
    "Symptoms of mental illness? INTERNAL USE, NOT FOR ANALYSIS": "mental_illness",
    "Video": "video", 
    "Date&Description": "date_description",
    "Unique ID formula": "id_2", 
    "Unique identifier (redundant)": "id_3",
    "Date (Year)": "death_date_year", 
    }, inplace = True)

police.drop([
             'url_image', 
             'death_article', 
             'date_description', 
             'video', 
             'id_1',
             'id_2', 
             'id_3',
             'death_date_year'
             ], axis=1, inplace=True)
police.columns

Index(['name', 'age', 'gender', 'race', 'race_with_imputations',
       'imputation_probability', 'death_date', 'death_address', 'death_city',
       'death_state', 'death_zipcode', 'death_county', 'death_full_address',
       'latitude', 'longitude', 'police_agency', 'death_cause',
       'death_description', 'dispositions_exclusions', 'use_of_force',
       'mental_illness'],
      dtype='object')

## Age

In [15]:
police['age'].replace(['18-25'], 21, inplace=True)
police['age'].replace(['20s-30s'], 30, inplace=True)
police['age'].replace(['46/53'], 50, inplace=True)
police['age'].replace(['40-50','55.'], 55, inplace=True)
if(police['age'].dtype == object):
  police.loc[police['age'].str.contains('months', na=False), 'age'] = police['age'].apply(lambda x: age_from_months(x) if isinstance(x, str) and 'months' in x else x)
  police.loc[police['age'].str.contains('days', na=False), 'age'] = 0
  police.loc[police['age'].str.contains('s', na=False), 'age'] = police.age.apply(lambda x: age_from_s(x) if isinstance(x, str) and 's' in x else x)

np.random.seed(42) 
#calculate the probability associated with each age value
s = police['age'].value_counts(normalize=True)
#replacing missing values using the probability distribution
missing = police['age'].isnull()
police.loc[missing,'age'] = np.random.choice(s.index, size=len(police[missing]),p=s.values)
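
The imputation in the cell above draws each missing age from the empirical distribution of the observed ages, so frequent ages are proportionally more likely to be chosen. A minimal standalone sketch on made-up data follows; the toy values and the use of `np.random.default_rng` instead of `np.random.seed` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy ages with two missing entries (made-up values)
ages = pd.Series([25, 25, 30, np.nan, 43, np.nan, 25], name="age")

rng = np.random.default_rng(42)
freq = ages.value_counts(normalize=True)   # relative frequency of each observed age (NaN excluded)
missing = ages.isna()

ages_filled = ages.copy()
# Draw one replacement per missing entry, weighted by the observed frequencies
ages_filled[missing] = rng.choice(freq.index.to_numpy(), size=int(missing.sum()), p=freq.to_numpy())
```

Sampling keeps the shape of the age distribution roughly intact, whereas filling with a single value such as the mean would create an artificial spike at that value.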

## Gender

Fill missing gender values by guessing the gender from the subject's name with the gender-guesser library: https://pypi.org/project/gender-guesser/


In [16]:
import gender_guesser.detector as gender 
test = police.loc[police['gender'].isnull(), 'gender']
if(len(test) > 0):
  police.loc[police['gender'].isnull(), 'gender'] = np.vectorize(get_gender)(police[police['gender'].isnull()]['name'])
#Drop "Name"
police.drop(['name'], axis=1, inplace=True)

## Race, Race with imputations, Imputation Probability

Read the dataset containing the ethnic distribution of each state, which is used to fill missing race values.

In [17]:
df_pop = pd.read_csv(directory+'df_pop.csv') 
df_pop.drop(['Unnamed: 0', 'pop_2000', 'pop_2010', 'pop_2020', 'areaKM', 'density2020'], axis=1, inplace=True)
df_pop.head()

Unnamed: 0,state_po,State,European-American/White,African-American/Black,Hispanic/Latino,Other
0,AL,Alabama,0.6552,0.2649,0.0428,0.0371
1,AK,Alaska,0.6063,0.031,0.0705,0.2922
2,AZ,Arizona,0.5471,0.0421,0.3134,0.0974
3,AR,Arkansas,0.7243,0.1523,0.0747,0.0487
4,CA,California,0.3718,0.0552,0.3902,0.1828


In [18]:
police['race'].replace(['HIspanic/Latino'], 'Hispanic/Latino', inplace=True)
police['race'].replace(['Race unspecified'], np.nan, inplace=True)
police['race_with_imputations'].replace(['HIspanic/Latino'], 'Hispanic/Latino', inplace=True)
police['race_with_imputations'].replace(['Other Race'], 'Other', inplace=True)
police['race_with_imputations'].replace(['Race unspecified'], np.nan, inplace=True) 
police.race.fillna(police.race_with_imputations, inplace=True)
police.drop(['race_with_imputations', 'imputation_probability'], axis=1, inplace=True)
police['race'].replace(['Asian/Pacific Islander', 
                        'Native American/Alaskan', 
                        'Middle Eastern'], 'Other', inplace=True)
print(police['race'].unique())
states = df_pop['state_po'].unique() #list of states
ethnicdist = dict() #empty dict
for state in states: #create a dictionary where each state is associated with a pd.Series containing its ethnic distribution (from df_pop)
  ethnicdist[state] = pd.Series({'European-American/White': df_pop[df_pop['state_po']==state]['European-American/White'].values[0],
                       'African-American/Black': df_pop[df_pop['state_po']==state]['African-American/Black'].values[0],
                       'Hispanic/Latino': df_pop[df_pop['state_po']==state]['Hispanic/Latino'].values[0], 
                       'Other': df_pop[df_pop['state_po']==state]['Other'].values[0]})

for state in states: 
  missingrace = police[police['death_state']==state]['race'].isnull() #boolean series: rows with the current state and missing value for race have True
  police.loc[(police['death_state']==state) & (missingrace), 'race'] = np.random.choice(ethnicdist[state].index, 
                                                                                  size=len(police[police['death_state']==state][missingrace]),
                                                                                  p=ethnicdist[state].values)

['Hispanic/Latino' 'European-American/White' 'African-American/Black']
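
The race imputation above samples each missing value from the ethnic distribution of the state where the death occurred. The sketch below reproduces that idea on a toy frame; the two rows of shares are copied from the `df_pop` preview above (AL and CA), while the `records` frame and the random generator seed are made up for illustration.

```python
import numpy as np
import pandas as pd

categories = ["European-American/White", "African-American/Black", "Hispanic/Latino", "Other"]

# Shares for AL and CA as shown in the df_pop preview
shares = pd.DataFrame(
    [[0.6552, 0.2649, 0.0428, 0.0371],
     [0.3718, 0.0552, 0.3902, 0.1828]],
    index=["AL", "CA"], columns=categories)

# Made-up records: three of them have no race
records = pd.DataFrame({
    "death_state": ["AL", "CA", "CA", "AL"],
    "race": [None, "Hispanic/Latino", None, None],
})

rng = np.random.default_rng(0)
# Group the rows with a missing race by state and sample from that state's distribution
for state, group in records[records["race"].isna()].groupby("death_state"):
    p = shares.loc[state, categories].to_numpy(dtype=float)
    records.loc[group.index, "race"] = rng.choice(categories, size=len(group), p=p / p.sum())
```

Dividing by `p.sum()` guards against shares that do not add up to exactly 1 after rounding, which `choice` would otherwise reject.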


## Temporal Information

In [19]:
police['death_date'] = pd.to_datetime(police['death_date'])

In [20]:
#Date --> Year and Month
police['year'] = police['death_date'].dt.year
police['month'] = police['death_date'].dt.month
police.drop(labels=['death_date'], axis=1, inplace=True)

## Geographical Information


In [21]:
police.drop(['death_address', 'death_zipcode', 'death_full_address'], axis=1, inplace=True)
if(police['death_county'].isnull().sum() > 0):
  police.loc[police['death_county'].isnull(), 'death_county'] = np.vectorize(get_location)(police[police['death_county'].isnull()]['latitude'], 
                                                                                     police[police['death_county'].isnull()]['longitude'], 'county')
if(police['death_city'].isnull().sum() > 0):
  police.loc[police['death_city'].isnull(), 'death_city'] = np.vectorize(get_location)(police[police['death_city'].isnull()]['latitude'], 
                                                                                     police[police['death_city'].isnull()]['longitude'], 'city')  
police['death_city'] = police['death_city'].str.capitalize()
police['death_county'] = police['death_county'].str.capitalize()
police['death_state'] = police['death_state'].str.upper()

## Police_agency

In [22]:
police.drop(['police_agency'], axis=1, inplace=True)

## Death_cause

In [23]:
police['death_cause'].replace(['Unknown','Undetermined'], np.nan, inplace=True)
police['death_cause'].replace('Pursuit', 'Vehicle', inplace=True)
police['death_cause'].fillna('Gunshot', inplace=True)

## Dispositions_exclusions

In [24]:
police['dispositions_exclusions'] = police['dispositions_exclusions'].str.title()
police.loc[police['dispositions_exclusions'].str.contains('Justified|Jusified|Justifed|Excusable|Accidental|Acquitted|Cleared|Dismissed|Medical Emergency|Settled Out Of Court', na=False), 'dispositions_exclusions'] = 'Justified'
police.loc[police['dispositions_exclusions'].str.contains('Criminal|Guilty|Administrative Discipline|Unjustified', na=False), 'dispositions_exclusions'] = 'Guilty'
police.loc[police['dispositions_exclusions'].str.contains('Suicide|Overdose', na=False), 'dispositions_exclusions'] = 'Suicide'
police.loc[police['dispositions_exclusions'].str.contains('Pending', na=False), 'dispositions_exclusions'] = 'Pending'
police['dispositions_exclusions'] = police['dispositions_exclusions'].str.split('/').str[-1].str.strip()
police['dispositions_exclusions'].replace(['Unknown','Family Awarded Money','Results Unreported','Referred To Prosecutor','Drowned','Off-Duty'], 'Unknown', inplace=True)
police['dispositions_exclusions'].fillna('Unknown', inplace=True)
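
The cell above collapses the free-text dispositions into a handful of classes by applying keyword rules in a fixed order. The same logic can be sketched for a single value as shown below; `normalize_disposition` is a hypothetical helper that mirrors the rules of the cell, not code used elsewhere in this notebook.

```python
import re

# Keyword rules, applied in the same order as in the notebook cell
RULES = [
    ("Justified", re.compile(r"Justified|Jusified|Justifed|Excusable|Accidental|Acquitted|Cleared|Dismissed|Medical Emergency|Settled Out Of Court")),
    ("Guilty", re.compile(r"Criminal|Guilty|Administrative Discipline|Unjustified")),
    ("Suicide", re.compile(r"Suicide|Overdose")),
    ("Pending", re.compile(r"Pending")),
]
UNKNOWN = {"Unknown", "Family Awarded Money", "Results Unreported",
           "Referred To Prosecutor", "Drowned", "Off-Duty"}

def normalize_disposition(raw):
    """Hypothetical helper: map one raw disposition string to its class."""
    if raw is None or raw != raw:          # None or NaN
        return "Unknown"
    label = str(raw).title()
    for category, pattern in RULES:        # sequential: a later rule may override an earlier one
        if pattern.search(label):
            label = category
    label = label.split("/")[-1].strip()   # keep only the last part of 'a/b' entries
    return "Unknown" if label in UNKNOWN else label
```

Because the matching is case-sensitive and runs after title-casing, "Unjustified" does not trigger the "Justified" rule and ends up as "Guilty".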

## Use_of_force

In [25]:
police['use_of_force'] = police['use_of_force'].str.capitalize()
police['use_of_force'].replace(['Undetermined', 'Unknown'], np.nan, inplace=True)
police.loc[police['use_of_force'].str.contains('use of force', na=False), 'use_of_force'] = 'Yes' 
police.loc[police['use_of_force'].str.contains('Vehicle|Pursuit', na=False), 'use_of_force'] = 'Vehicle/Pursuit' 
police['use_of_force'] = police['use_of_force'].fillna('Yes')  

## Mental_illness

In [26]:
police['mental_illness'].replace('Unknown', np.nan, inplace=True)
police['mental_illness'].fillna('No', inplace=True)

In [27]:
#Reorder
police = police.reindex(columns=['age', 'gender', 'race', 'mental_illness', # who
                                 'death_cause', 'use_of_force', # how
                                 'year', 'month', # when
                                 'death_city', 'death_county', 'death_state', 'latitude', 'longitude', # where
                                 'death_description', 'dispositions_exclusions']) # description and target

police.rename(columns={"death_state": "state_po"}, inplace=True)

In [28]:
# Adjust geographical information about counties
police = county_manager(police)

In [29]:
#Target
target = police[['dispositions_exclusions']]
info = police.copy() #Need this to print descriptions during explainability phase
police.drop(['dispositions_exclusions'], axis=1, inplace=True)   
expl_df = police.copy() #Used to plot decoded values in the explainability phase 
police = police[["age", "gender", "race", "mental_illness", "death_cause", "use_of_force", "death_city", "death_county", "latitude", "longitude", "death_description", "state_po", "year", "month"]]

# Encode and Scale test rows + load model and make predictions

**Load complete and preprocessed dataset to encode the test rows accordingly**

In [30]:
all = pd.read_csv(directory + 'dataset_complete.csv') 
all.drop(['Unnamed: 0'], axis=1, inplace=True)  
all.dropna(inplace = True)

In [31]:
#Target
target_all = all[['dispositions_exclusions']]
all.drop(['dispositions_exclusions'], axis=1, inplace=True)  

In [32]:
#Attributes and encoding
numerics_attrs = all.select_dtypes(include=np.number).columns.tolist()
categorical_attrs = list(set(all.columns) - set(numerics_attrs))

In [33]:
le = LabelEncoder()
print('Variable Mapping:')
for col in categorical_attrs: 
  #for each column, encode on the original dataset, then apply the fitted encoder on the test data
  all[col] = encoder(all, col)
  police[col] = le.transform(police[col])


	 {'Female': 0, 'Male': 1, 'Transgender': 2}
	 {'Abbeville, Henry, AL': 0, 'Aberdeen, Grays Harbor, WA': 1, 'Aberdeen, Harford, MD': 2, 'Aberdeen, Moore, NC': 3, 'Abilene, Dickinson, KS': 4, 'Abilene, Taylor, TX': 5, 'Abingdon, Harford, MD': 6, 'Abingdon, Washington, VA': 7, 'Abington, Montgomery, PA': 8, 'Abington, Plymouth, MA': 9, 'Abita springs, St. Tammany Parish, LA': 10, 'Acampo, San Joaquin, CA': 11, "Accokeek, Prince George'S, MD": 12, 'Accomac, Accomack, VA': 13, 'Ackley, Hardin, IA': 14, 'Acton, York, ME': 15, 'Acworth, Cherokee, GA': 16, 'Acworth, Cobb, GA': 17, 'Ada, Hardin, OH': 18, 'Ada, Norman, MN': 19, 'Ada, Pontotoc, OK': 20, 'Adairsville, Bartow, GA': 21, 'Adamson, Pittsburg, OK': 22, 'Addis, West Baton Rouge Parish, LA': 23, 'Addison, Dallas, TX': 24, 'Addison, Dupage, IL': 25, 'Addison, Winston, AL': 26, 'Adel, Cook, GA': 27, 'Adel, Dallas, IA': 28, 'Adelanto, San Bernardino, CA': 29, "Adelphi, Prince George'S, MD": 30, 'Adger, Jefferson, AL': 31, 'Adolphus, Allen,
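
Fitting the encoder on the complete dataset and then transforming the test rows only works when every test label already occurs in the complete dataset: `LabelEncoder.transform` raises a `ValueError` for labels it did not see during `fit`. A library-free sketch of a pre-check is shown below. The helper names and the unseen label are hypothetical; the codes are assigned in sorted order, as `LabelEncoder` does.

```python
def fit_label_codes(values):
    """Assign integer codes to labels in sorted order, the order LabelEncoder uses for its classes."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

def unseen_labels(codes, values):
    """Return the labels in `values` that have no code, sorted and without duplicates."""
    return sorted({v for v in values if v not in codes})

reference = ["Male", "Female", "Transgender", "Male"]     # stands in for a column of the complete dataset
test_rows = ["Male", "Some Unlisted Label", "Female"]     # made-up test column with one unseen label

codes = fit_label_codes(reference)
problems = unseen_labels(codes, test_rows)
encoded = [codes[v] for v in test_rows if v in codes]
```

Running such a check before the transform makes it clear which test rows need manual correction, for example a city or county spelled differently from the complete dataset.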

In [34]:
#Target Variable Encoding
le = LabelEncoder()
print('Target Variable Mapping:')
target_all['dispositions_exclusions'] = encoder(target_all, 'dispositions_exclusions')

Target Variable Mapping:
	 {'Guilty': 0, 'Justified': 1, 'Pending': 2, 'Suicide': 3, 'Unknown': 4, 'Unreported': 5}


**Load model and scaler**

In [35]:
import pickle
loaded_model = pickle.load(open(directory + 'random_forest_model.sav', 'rb'))
scaler = pickle.load(open(directory + 'scaler.pkl', 'rb'))

In [36]:
#Distribution of the target classes in the test rows
print('Target Variables Present in Test Rows:')
print(target.value_counts())

Target Variables Present in Test Rows:
dispositions_exclusions
Justified                  4
Unreported                 3
Suicide                    1
dtype: int64


In [37]:
police = pd.DataFrame(scaler.transform(police), index=police.index, columns=police.columns)

**Test rows, encoded and scaled**

In [38]:
police.head()

Unnamed: 0,age,gender,race,mental_illness,death_cause,use_of_force,death_city,death_county,latitude,longitude,death_description,state_po,year,month
0,0.401869,0.5,0.666667,0.5,0.583333,0.5,1.056986,0.667878,0.227977,0.711383,3.51662,0.86,0.6,0.363636
1,0.476636,0.5,0.333333,0.5,0.583333,1.5,2.515132,0.209302,0.351369,0.688839,2.228502,0.72,0.6,0.363636
2,0.439252,0.5,0.333333,1.0,0.583333,1.5,1.171603,1.496366,0.372983,0.84422,1.316377,0.84,0.6,0.363636
3,0.280374,0.5,0.666667,0.5,0.583333,1.5,0.424018,0.945494,0.317331,0.483329,1.08949,0.08,0.6,0.363636
4,0.233645,0.5,0.0,0.5,0.583333,1.5,1.246619,0.340843,0.330159,0.682437,0.949939,0.72,0.6,0.363636
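
Several scaled values in the table above are greater than 1 (for example 3.51662 for death_description and 1.5 for use_of_force). The type of the pickled scaler is not shown in this notebook; assuming it is a min-max scaler, a value above 1 means that the encoded test value lies above the maximum seen when the scaler was fitted. The sketch below illustrates this with made-up numbers and hypothetical helper names.

```python
def fit_min_max(values):
    """Return the (min, max) pair a min-max scaler would learn from the fitting data."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("cannot scale a constant column")
    return low, high

def scale(value, low, high):
    """Map `low` to 0 and `high` to 1; values outside that range fall outside [0, 1]."""
    return (value - low) / (high - low)

low, high = fit_min_max([0, 1, 2])   # made-up codes seen at fit time
inside = scale(1, low, high)         # within the fitted range
outside = scale(3, low, high)        # above the fitted maximum
```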


**Make predictions**

In [39]:
X_output = police.copy()
X_output.loc[:,'predict'] = loaded_model.predict(X_output)
y_output = target.copy()

In [40]:
dict_target =  {0: 'Guilty', 1: 'Justified',2: 'Pending',3: 'Suicide',4: 'Unknown',5: 'Unreported'}
X_output=X_output.replace({"predict": dict_target})

# Local Explainability

In [41]:
import shap
explainer = shap.TreeExplainer(loaded_model)

In [42]:
for l in range(len(X_output)):
  local_exp(X_output.index[l], police, X_output, info, expl_df)
  print('\n\n\n')

Predicted value: Justified
True value: dispositions_exclusions    Suicide
Name: 0, dtype: object
Encounter Description: 

Violating a protective order, James Galladora stabbed his wife, Jeanne, in the chest at her residence. Constable's deputies and HPD SWAT officers responded. James, barricaded in Je
anne's residence, shot at the responding officers periodically for six hours, then shot himself fatally in the head. Precinct Four constable's deputies had already confronted James
 for visiting Jeanne's residence on May 6. They did not arrest him. Jeanne survived her injuries.






Predicted value: Justified
True value: dispositions_exclusions    Justified
Name: 1, dtype: object
Encounter Description: 

Officers received a call about a driver firing a shot at another vehicle near SH-66 and Cemetery Road, police said. He said a Canadian County sheriff's deputy and El Reno police of
ficer were sent to the location and found the suspect vehicle in the area. Officers tried to stop the vehicle, but the driver led them on a pursuit. The man then drove to his house
. Police said the man got out of the vehicle with a pistol in hand and fired four shots at officers. He was killed when Deputy Sgt. Paul Reynolds fired one shot and struck Kyle in 
the chest, and El Reno officer Jarad Loggins fired and struck the man in the thigh.






Predicted value: Justified
True value: dispositions_exclusions    Justified
Name: 2, dtype: object
Encounter Description: 

Ernest Lynn France was allegedly off his mental illness medication when he shot at a neighbor's vehicle and after a chase, pointed his gun at police, who shot and killed him.






Predicted value: Justified
True value: dispositions_exclusions    Unreported
Name: 3, dtype: object
Encounter Description: 

Deputies reported that they followed a suspected carjacking perpetrator, and that when the chase ended, he left the car and pointed a gun at deputies. At that point, deputies shot 
him to death.






Predicted value: Justified
True value: dispositions_exclusions    Justified
Name: 4, dtype: object
Encounter Description: 

Comanche County Dispatch received a 911 call reporting a man breaking into the caller's home. Police responded and found a male victim in the street who had been physically assault
ed. They also saw another man come out of the house. That man, Brian Wesley King Jr., fired several shots at officers and then fled the area in a van. Comanche County Deputies and 
Comanche Nation Officers located the van and began pursuit toward Lawton on U.S. Highway 62. Once inside the Lawton city limits, Lawton Police joined in pursuit. There, King rammed
 a Comanche Nation Police vehicle and a Comanche County Sheriff's vehicle. Four officers shot at the suspect, killing him.






Predicted value: Justified
True value: dispositions_exclusions    Justified
Name: 5, dtype: object
Encounter Description: 

Undercover Sgt Michael Mario Custer trespassing on private property. Adams arrived, asked Custer his business. Custer shot Adams four times, and did not render aid or restrain. Set
h opened the chained gate. Made it 300 feet, collapsed, called his brother for help. See Change.org Seth Adams






Predicted value: Justified
True value: dispositions_exclusions    Unreported
Name: 6, dtype: object
Encounter Description: 

Bystander John Melo died hours after he was wedged between two cars during a series of crashes involving a car fleeing the CHP.






Predicted value: Justified
True value: dispositions_exclusions    Unreported
Name: 7, dtype: object
Encounter Description: 

Officer Gabriel Kinderknecht shot and killed Parish Laconley Powell when he allegedly lunged at and threw a knife at him.
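
The eight predicted and true values printed above can be summarized as a simple accuracy figure. The lists in this sketch are copied from those printed outputs, in row order; the variable names are only illustrative.

```python
from collections import Counter

# Values copied from the printed outputs above (rows 0 to 7)
predicted = ["Justified"] * 8
actual = ["Suicide", "Justified", "Justified", "Unreported",
          "Justified", "Justified", "Unreported", "Unreported"]

correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
# Count the true classes of the rows that were predicted incorrectly
missed = Counter(a for p, a in zip(predicted, actual) if p != a)
```

On these rows the model predicts "Justified" every time, so it is right exactly on the four rows whose true class is "Justified" and wrong on the "Unreported" and "Suicide" rows.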






