The following packages/modules are used:
- pandas
- numpy
- NLTK
- scikit-learn
- scipy
- joblib

We also use functions from the NLP_PreProcessing module.

In [54]:
#Data Manipulation
import pandas as pd
import os
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from scipy.sparse import hstack 
from sklearn.model_selection import train_test_split

#NLP
from sklearn.feature_extraction.text import CountVectorizer

#NLP Preprocessing Functions
from NLP_PreProcessing import removeStopWords, removeFeatures, lemmatize

#Model Creation 
import joblib

## 1. Loading data

A dummy dataset of 400 reports was created for demonstration purposes. SPD uses more than 1.6 million incident/offense reports for training and validation. Note that the small size of the example data changes the considerations for data decisions, computational cost, algorithm selection and tuning, etc. A more detailed description of the dataset and of SPD's data engineering practices is included in 'feature_engineering_training_documentation.'

In [55]:
#read in dummy dataset
biasdf = pd.read_csv('dummy_narrative_github.csv')

The 'train_report_ids' and 'test_report_ids' files contain the report_ids of reports that have gone through the feature engineering process in past runs. report_ids of reports processed in this run are appended at the end of this notebook to the same files.

In [57]:
#Read report_ids that have already been processed
reports_train = pd.read_csv('train_report_ids.csv', header= None)
reports_train_list = set(reports_train[0].values)
reports_test = pd.read_csv('test_report_ids.csv', header= None)
reports_test_list = set(reports_test[0].values)

In [58]:
#create combined list of report ids to filter out in case they appear in the biasdf data source
reports_list = reports_test_list | reports_train_list

In [59]:
#filter dataset for report_ids not in reports_list (already processed)
biasdf = biasdf[~biasdf['report_id'].isin(reports_list)]
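The filtering logic above can be sketched on toy data. The helper `load_seen_ids` and the file names below are hypothetical (they are not part of the original notebook); the helper assumes a headerless, one-column CSV and returns an empty set when the file is missing, which is one way to handle a first run.

```python
import os
import tempfile

import pandas as pd


def load_seen_ids(path):
    """Return the report_ids stored in a headerless one-column CSV.

    Hypothetical helper: returns an empty set if the file is missing or empty.
    """
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return set()
    return set(pd.read_csv(path, header=None)[0])


with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "train_ids_toy.csv")
    test_path = os.path.join(tmp, "test_ids_toy.csv")  # never created on purpose
    pd.DataFrame({"report_id": [101, 102]}).to_csv(train_path, header=False, index=False)

    # Union of the ids seen in either file
    already_processed = load_seen_ids(train_path) | load_seen_ids(test_path)

# Toy version of the incoming reports
reports = pd.DataFrame({"report_id": [101, 104], "narrative": ["a", "b"]})

# Keep only reports whose report_id has not been processed before
new_reports = reports[~reports["report_id"].isin(already_processed)]
print(new_reports["report_id"].tolist())
```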

Reports in draft, pending, or rejected status are excluded because they might be modified in the future.

In [61]:
#keep only reports that are marked as completed
biasdf = biasdf[~biasdf['approval_status'].isin(['Draft', 'Pending Approval', 'Rejected'])].reset_index(drop=True)

### Feedback Step

In this step, the user reads their version of completed tasks (i.e., a file with the IDs of reports that have been flagged by the classifier and already adjusted as needed by the Bias Crime Unit). At SPD, we read the RMS tasks table and api_log to correctly categorize reports with bias events that have been associated with 'New Bias Crime Confirmation' tasks. Please see the documentation for more information.

In [62]:
#check whether completed_tasks.csv exists (it will not on the first run of the classifier)
if os.path.isfile('completed_tasks.csv'):

    #read completed_tasks table
    completed_tasks = pd.read_csv("completed_tasks.csv")

else:
    #no completed tasks file yet: use an empty table so the merge below still runs
    completed_tasks = pd.DataFrame(columns=['reporting_event_number'])
    print('no completed tasks.')

Read the offenses table (i.e., a table with the indicator used to label reports associated with tasks) with the 'is_suspected_hate_crime' flag to label reports that were confirmed as events with bias elements by the bias crime unit.

In [65]:
#read in offenses table to get 'is_suspected_hate_crime' field
label_df = pd.read_csv("offenses.csv")

#keep only positive class
positive_ren = label_df[label_df['is_suspected_hate_crime'] == 1]

#join positive rens to completed_tasks to get label (if completed task not in positive list, then it's negative)
positive_label = pd.merge(positive_ren, completed_tasks, on=['reporting_event_number'], how='inner')
positive_label.rename(columns= {'is_suspected_hate_crime':'actual_label'}, inplace=True)

#create list of reporting event numbers from completed tasks that are positive
list_positive_tasks = set(positive_label['reporting_event_number'].unique().tolist())
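As a minimal sketch of the feedback join on toy data: the reporting event numbers below are made up, and `negative_tasks` is an illustrative extra that the notebook does not compute explicitly.

```python
import pandas as pd

# Toy offenses table with the confirmation flag
offenses = pd.DataFrame({
    "reporting_event_number": ["A1", "A2", "A3"],
    "is_suspected_hate_crime": [1, 0, 1],
})

# Toy completed tasks (reports already reviewed by the unit)
tasks = pd.DataFrame({"reporting_event_number": ["A2", "A3", "A4"]})

# Inner join keeps only reviewed reports that were confirmed as positive
confirmed = offenses[offenses["is_suspected_hate_crime"] == 1]
positive = confirmed.merge(tasks, on="reporting_event_number", how="inner")
positive_tasks = set(positive["reporting_event_number"])

# Reviewed reports that are not in the positive set are treated as negative
negative_tasks = set(tasks["reporting_event_number"]) - positive_tasks
print(positive_tasks, negative_tasks)
```

Note that "A1" is confirmed in the offenses table but never appears in the tasks table, so the inner join leaves it out of `positive_tasks`; in the notebook such reports are still labeled through the offense description in the next step.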

## 2. Label Creation

To label reports for training, we use reports with offenses associated with incidents with bias elements, crimes with bias elements, and hate crimes. We also use output from previous tasks to label reports that were true positives.

In [66]:
#Create Label

def label_function(df):
    
    if df['reporting_event_number'] in list_positive_tasks:
        return 1
    
    elif df['crime_description_1'] in ['Incident Contains Bias Elements -- NO CRIME', 'Offense Contains Bias Elements -- CRIME',
                                     'RCW - 9A.36.080 | HATE CRIME OFFENSE', 'SMC - - 12A.06.115 | MALICIOUS HARASSMENT',
                                     'X91 | MALICIOUS HARASSMENT', 'X92 | BIAS INCIDENT', 'RCW - 9A.36.080 | MALICIOUS HARASSMENT']:
        return 1
    else:
        return 0

# Apply the function and create a new column
biasdf['label'] = biasdf.apply(label_function, axis=1)
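On large datasets, a row-wise `apply` can be slow. The same rule can be sketched with vectorized `isin` checks; this is an alternative formulation under the same assumptions, not the notebook's method. The event numbers and the 'THEFT' description are made up, and only two of the offense descriptions from the list above are used.

```python
import pandas as pd

# Subset of the bias-related offense descriptions listed above
bias_offense_descriptions = ["X92 | BIAS INCIDENT", "X91 | MALICIOUS HARASSMENT"]

# Hypothetical set of confirmed tasks (plays the role of list_positive_tasks)
confirmed_tasks = {"2023-0002"}

toy = pd.DataFrame({
    "reporting_event_number": ["2023-0001", "2023-0002", "2023-0003"],
    "crime_description_1": ["X92 | BIAS INCIDENT", "THEFT", "THEFT"],
})

# A report is positive if it was confirmed in a task OR carries a bias offense
is_positive = (
    toy["reporting_event_number"].isin(confirmed_tasks)
    | toy["crime_description_1"].isin(bias_offense_descriptions)
)
toy["label"] = is_positive.astype(int)
print(toy["label"].tolist())
```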


## 3. Text Preprocessing for NLP

In [68]:
#make sure date field has date format
biasdf['event_start_date'] = pd.to_datetime(biasdf['event_start_date'])

If some reports don't have narratives, we still want to use them if they have demographic information about the victim(s)/suspect(s). For this purpose, we fill missing narratives with a neutral word before preprocessing. This is a methodological decision based on the assumption that the chosen word is not more likely to be associated with the positive or negative class.

In [69]:
#replace empty narratives with neutral word
biasdf['narrative'] = biasdf['narrative'].fillna('narrative')

In [70]:
# Apply the pre-processing functions to the 'narrative' column
bias_df_nlp = biasdf.copy()
bias_df_nlp['corpus'] = bias_df_nlp['narrative'].apply(removeStopWords)
bias_df_nlp['corpus'] = bias_df_nlp['corpus'].apply(removeFeatures)
bias_df_nlp['corpus'] = bias_df_nlp['corpus'].apply(lemmatize)

## 4.  One-Hot Encoding Demographics

The next step is to create categorical variables (i.e., dummies) for demographic fields.

In [71]:
#If after preprocessing the narrative is still blank, replace with neutral word:
bias_df_nlp['corpus'] = bias_df_nlp['corpus'].replace('','narrative')

In [72]:
#if the jurisdiction uses special characters for 'missing' fields, make them consistent:

df = bias_df_nlp.copy()

#in our dataset unknown ages are inputted as -1
df['subject_age_1'] = df['subject_age_1'].replace(-1, np.nan)
df['victim_age_1'] = df['victim_age_1'].replace(-1, np.nan)

df['subject_race_1'] = df['subject_race_1'].replace('-', 'Sub_Race_Unknown')
df['subject_race_1'] = df['subject_race_1'].fillna('Sub_Race_Unknown')
df['subject_race_1'] = df['subject_race_1'].replace('Unknown', 'Sub_Race_Unknown')

df['victim_race_1'] = df['victim_race_1'].replace('-', 'Vic_Race_Unknown')
df['victim_race_1'] = df['victim_race_1'].fillna('Vic_Race_Unknown')
df['victim_race_1'] = df['victim_race_1'].replace('Unknown', 'Vic_Race_Unknown')

df['subject_gender_1'] = df['subject_gender_1'].replace('-', 'Sub_Gender_Unknown')
df['subject_gender_1'] = df['subject_gender_1'].fillna('Sub_Gender_Unknown')
df['subject_gender_1'] = df['subject_gender_1'].replace('Unknown', 'Sub_Gender_Unknown')

df['victim_gender_1'] = df['victim_gender_1'].replace('-', 'Vic_Gender_Unknown')
df['victim_gender_1'] = df['victim_gender_1'].fillna('Vic_Gender_Unknown')
df['victim_gender_1'] = df['victim_gender_1'].replace('Unknown', 'Vic_Gender_Unknown')

df['subject_ethnicity_1'] = df['subject_ethnicity_1'].replace('-', 'Sub_Ethni_Unknown')
df['subject_ethnicity_1'] = df['subject_ethnicity_1'].fillna('Sub_Ethni_Unknown')
df['subject_ethnicity_1'] = df['subject_ethnicity_1'].replace('Unknown', 'Sub_Ethni_Unknown')

df['victim_ethnicity_1'] = df['victim_ethnicity_1'].replace('-', 'Vic_Ethni_Unknown')
df['victim_ethnicity_1'] = df['victim_ethnicity_1'].fillna('Vic_Ethni_Unknown')
df['victim_ethnicity_1'] = df['victim_ethnicity_1'].replace('Unknown', 'Vic_Ethni_Unknown')

df['beat'] = df['beat'].replace('99', 'beat_Unknown')
df['beat'] = df['beat'].replace('-', 'beat_Unknown')
df['beat'] = df['beat'].replace('OOJ', 'beat_OOJ')
df['beat'] = df['beat'].replace('Unknown', 'beat_Unknown')
df['beat'] = df['beat'].fillna('beat_Unknown')

df['precinct'] = df['precinct'].replace('Unknown', 'precinct_Unknown')
df['precinct'] = df['precinct'].replace('-', 'precinct_Unknown')
df['precinct'] = df['precinct'].replace('OOJ', 'precinct_OOJ')
df['precinct'] = df['precinct'].fillna('precinct_Unknown')
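The repeated replace/fillna pattern above could also be expressed with a small helper and a mapping from column to its "unknown" label. The helper name `normalize_missing` and the toy values are assumptions for illustration only.

```python
import pandas as pd


def normalize_missing(series, unknown_label, placeholders=("-", "Unknown")):
    """Map placeholder strings and missing values to one 'unknown' label."""
    return series.replace(list(placeholders), unknown_label).fillna(unknown_label)


toy = pd.DataFrame({
    "victim_race_1": ["White", "-", None, "Unknown"],
    "subject_gender_1": ["Male", None, "-", "Female"],
})

# One entry per column, using the labels from the cell above
unknown_labels = {
    "victim_race_1": "Vic_Race_Unknown",
    "subject_gender_1": "Sub_Gender_Unknown",
}

for column, label in unknown_labels.items():
    toy[column] = normalize_missing(toy[column], label)

print(toy)
```

Columns with extra special values (such as 'OOJ' or '99' for beat) would still need their own replacements.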

We use Scikit-learn's OneHotEncoder to create dummies per unique category for each variable.

In [73]:
ohe = OneHotEncoder()

In [74]:
#create dummies

precinct_e = ohe.fit_transform(df[['precinct']])
new_column_names = [category for category in ohe.categories_[0]]
precinct_df = pd.DataFrame(precinct_e.toarray(), columns=new_column_names)
df = pd.concat([df, precinct_df], axis=1)

gender_e = ohe.fit_transform(df[['victim_gender_1']])
new_column_names = [category for category in ohe.categories_[0]]
gender_df = pd.DataFrame(gender_e.toarray(), columns=new_column_names)
df = pd.concat([df, gender_df], axis=1)

race_e = ohe.fit_transform(df[['victim_race_1']])
new_column_names = [category for category in ohe.categories_[0]]
race_df = pd.DataFrame(race_e.toarray(), columns=new_column_names)
df = pd.concat([df, race_df], axis=1)

ethnicity_e = ohe.fit_transform(df[['victim_ethnicity_1']])
new_column_names = [category for category in ohe.categories_[0]]
ethnicity_df = pd.DataFrame(ethnicity_e.toarray(), columns=new_column_names)
df = pd.concat([df, ethnicity_df], axis=1)

beat_e = ohe.fit_transform(df[['beat']])
new_column_names = [category for category in ohe.categories_[0]]
beat_df = pd.DataFrame(beat_e.toarray(), columns=new_column_names)
df = pd.concat([df, beat_df], axis=1)

In [75]:
# Create a new instance of OneHotEncoder for Subject's categories
ohe_subject = OneHotEncoder()
subject_race_e = ohe_subject.fit_transform(df[['subject_race_1']])

# Rename the one-hot encoded columns with a prefix to differentiate them
new_column_names = ['subject_' + category for category in ohe_subject.categories_[0]]


subject_race_df = pd.DataFrame(subject_race_e.toarray(), columns=new_column_names)
df = pd.concat([df, subject_race_df], axis=1)

In [76]:
subject_gender_e = ohe_subject.fit_transform(df[['subject_gender_1']])
new_column_names = ['subject_' + category for category in ohe_subject.categories_[0]]

subject_gender_df = pd.DataFrame(subject_gender_e.toarray(), columns=new_column_names)

df = pd.concat([df, subject_gender_df], axis=1)

In [77]:
subject_ethnicity_e = ohe_subject.fit_transform(df[['subject_ethnicity_1']])
new_column_names = ['subject_' + category for category in ohe_subject.categories_[0]]

subject_ethnicity_df = pd.DataFrame(subject_ethnicity_e.toarray(), columns=new_column_names)

df = pd.concat([df, subject_ethnicity_df], axis=1)
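The encode, name, and concatenate pattern used in the cells above can be wrapped in one function. This is a sketch, not the notebook's code: `add_dummies` is a hypothetical name, and it passes `index=frame.index` so the dummies stay aligned with the source rows even if the index is not 0..n-1.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def add_dummies(frame, column, prefix=""):
    """One-hot encode `column` and append the dummy columns to `frame`."""
    encoder = OneHotEncoder()
    encoded = encoder.fit_transform(frame[[column]])
    names = [prefix + str(category) for category in encoder.categories_[0]]
    dummies = pd.DataFrame(encoded.toarray(), columns=names, index=frame.index)
    return pd.concat([frame, dummies], axis=1)


toy = pd.DataFrame({"precinct": ["North", "East", "North"]})
toy = add_dummies(toy, "precinct")
print(toy)
```

For subject fields, the same call with `prefix="subject_"` would reproduce the prefixed column names.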

Sometimes the new data might not include all possible values (e.g., no event with a Native Hawaiian or Other Pacific Islander victim occurred in that month); as such, we need to make sure all possible values are represented:

In [80]:
#CHECK IF COLUMN IN LIST IS PRESENT, IF NOT, CREATE AND ASSIGN 0 TO IT

columns = ['victim_age_1','subject_age_1','East', 'North', 'precinct_OOJ', 'South', 'Southwest', 'West', 'precinct_Unknown', 'Female',
 'Gender Diverse (gender non-conforming and/or transgender)', 'Male', 'Vic_Gender_Unknown',
 'American Indian or Alaska Native', 'Asian', 'Black or African American', 'Native Hawaiian or Other Pacific Islander',
 'Vic_Race_Unknown', 'White', 'Hispanic Or Latino', 'Not Hispanic Or Latino', 'Vic_Ethni_Unknown',
 'subject_American Indian or Alaska Native', 'subject_Asian', 'subject_Black or African American',
 'subject_Native Hawaiian or Other Pacific Islander', 'subject_Sub_Race_Unknown', 'subject_White',
 'subject_Female', 'subject_Gender Diverse (gender non-conforming and/or transgender)',
 'subject_Male', 'subject_Sub_Gender_Unknown', 'subject_Hispanic Or Latino', 'subject_Not Hispanic Or Latino',
 'subject_Sub_Ethni_Unknown', 'B1', 'B2', 'B3', 'C1', 'C2', 'C3', 'D1', 'D2', 'D3', 'E1', 'E2', 'E3', 'F1', 'F2',
 'F3', 'G1', 'G2', 'G3', 'H1', 'H2', 'H3', 'J1', 'J2', 'J3', 'K1', 'K2', 'K3', 'L1', 'L2', 'L3', 'M1', 'M2', 'M3',
 'N1', 'N2', 'N3', 'O1', 'O2', 'O3', 'Q1', 'Q2', 'Q3', 'R1', 'R2', 'R3', 'S1', 'S2', 'S3', 'U1', 'U2', 'U3',
 'beat_Unknown', 'W1', 'W2', 'W3', 'beat_OOJ']

for column in columns:
    if column not in df.columns:
        df[column] = 0
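An alternative to adding the missing columns afterwards is to give the encoder the full category list up front. The sketch below uses scikit-learn's `categories` and `handle_unknown` parameters; the three-value precinct list is shortened for illustration and is an assumption, not the notebook's configuration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Shortened, illustrative list of every value the field can take
precinct_values = ["East", "North", "South"]

# Fixing the categories guarantees one output column per value,
# even when a value does not occur in this month's data
encoder = OneHotEncoder(categories=[precinct_values], handle_unknown="ignore")

month = pd.DataFrame({"precinct": ["North", "East"]})  # no 'South' this month
encoded = encoder.fit_transform(month[["precinct"]]).toarray()
dummies = pd.DataFrame(encoded, columns=precinct_values)
print(dummies)
```

With this approach the column order is also fixed by the list, which matters because the final files are saved without headers.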

In [81]:
#SAVE FINAL FEATURES (remove original categorical variables, leave dummies)

final_features = df[['label','report_id', 'reporting_event_number', 'event_start_date','victim_age_1', 
 'subject_age_1','corpus', 'East', 'North', 'precinct_OOJ', 'South', 'Southwest', 'West', 'precinct_Unknown', 'Female',
 'Gender Diverse (gender non-conforming and/or transgender)', 'Male', 'Vic_Gender_Unknown',
 'American Indian or Alaska Native', 'Asian', 'Black or African American', 'Native Hawaiian or Other Pacific Islander',
 'Vic_Race_Unknown', 'White', 'Hispanic Or Latino', 'Not Hispanic Or Latino', 'Vic_Ethni_Unknown',
 'subject_American Indian or Alaska Native', 'subject_Asian', 'subject_Black or African American',
 'subject_Native Hawaiian or Other Pacific Islander', 'subject_Sub_Race_Unknown', 'subject_White',
 'subject_Female', 'subject_Gender Diverse (gender non-conforming and/or transgender)',
 'subject_Male', 'subject_Sub_Gender_Unknown', 'subject_Hispanic Or Latino', 'subject_Not Hispanic Or Latino',
 'subject_Sub_Ethni_Unknown', 'B1', 'B2', 'B3', 'C1', 'C2', 'C3', 'D1', 'D2', 'D3', 'E1', 'E2', 'E3', 'F1', 'F2',
 'F3', 'G1', 'G2', 'G3', 'H1', 'H2', 'H3', 'J1', 'J2', 'J3', 'K1', 'K2', 'K3', 'L1', 'L2', 'L3', 'M1', 'M2', 'M3',
 'N1', 'N2', 'N3', 'O1', 'O2', 'O3', 'Q1', 'Q2', 'Q3', 'R1', 'R2', 'R3', 'S1', 'S2', 'S3', 'U1', 'U2', 'U3',
 'beat_Unknown', 'W1', 'W2', 'W3', 'beat_OOJ']]

## 5. Text Vectorizer

In [82]:
biasdf = final_features

We perform an 80/20 split between training and validation data, but with larger datasets a smaller portion of the data can be used for validation.

In [83]:
#split on index
train, test = train_test_split(biasdf.index, test_size=0.2, random_state=0)
X_train, y_train = biasdf.loc[train, 'corpus'],  biasdf.loc[train, 'label']
X_test, y_test = biasdf.loc[test, 'corpus'],  biasdf.loc[test, 'label']
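Splitting on the index, as above, makes it possible to select any set of columns for the same rows later on. A small sketch on made-up data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "corpus": [f"text {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

# Split the row labels, not the data itself
train_idx, test_idx = train_test_split(toy.index, test_size=0.2, random_state=0)

# The same index sets can then select text, labels, or any other columns
text_train, y_train_toy = toy.loc[train_idx, "corpus"], toy.loc[train_idx, "label"]
text_test, y_test_toy = toy.loc[test_idx, "corpus"], toy.loc[test_idx, "label"]
print(len(train_idx), len(test_idx))
```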

We train the vectorizer only on the training data to avoid data leakage. For the sake of the example, we only extract 50 features from the data. At SPD, we extract 3,000 text features.

In [84]:
#CountVectorizer fitted only to training set
 
#The max_features parameter tells the vectorizer to only consider the 50 most frequent words from the text corpus. 
cv = CountVectorizer(max_features=50)

cv.fit(X_train)
X_train = cv.transform(X_train)
X_test = cv.transform(X_test)

In [85]:
#save trained vectorizer for feature engineering
joblib.dump(cv, 'countvectorizer.pkl')

['countvectorizer.pkl']
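The saved vectorizer can later be reloaded with `joblib.load` so that new reports are encoded with the same vocabulary. A minimal round-trip sketch, using a temporary directory and a hypothetical file name:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer

cv_toy = CountVectorizer().fit(["bias incident report"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "countvectorizer_toy.pkl")  # hypothetical file name
    joblib.dump(cv_toy, path)
    restored = joblib.load(path)

# The restored object carries the same fitted vocabulary
same_vocabulary = restored.vocabulary_ == cv_toy.vocabulary_
counts = restored.transform(["incident report"]).toarray()
print(same_vocabulary, counts)
```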

In [86]:
#Get word vectors from vectorizer for column renaming
words = cv.get_feature_names_out()

In [87]:
#to avoid error with hstack make numeric columns numeric type
biasdf_add = biasdf.drop(columns=['report_id', 'event_start_date', 'reporting_event_number',
                                  'label', 'corpus']).apply(pd.to_numeric)

In [88]:
#Merge sparse matrix back with other features
X_train = hstack([X_train, biasdf_add.loc[train].values])

X_test = hstack([X_test, biasdf_add.loc[test].values])
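The merge above places the text features first and the demographic features after them, which is the column order the renaming step relies on. A toy sketch of stacking a sparse count matrix with a dense numeric array (all values are made up):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy word counts: 2 reports x 3 text features
text_counts = csr_matrix(np.array([[1, 0, 2], [0, 1, 0]]))

# Toy numeric features: an age (possibly missing) and one dummy
demographics = np.array([[34.0, 1.0], [np.nan, 0.0]])

# Columns 0-2 hold the text features, columns 3-4 the numeric features
combined = hstack([text_counts, demographics]).tocsr()
print(combined.shape)
print(combined.toarray())
```

Missing ages stay as NaN in the stacked matrix, which is why the numeric columns have to be converted to a numeric type first.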

### Save dataframes for training/tuning

In [90]:
#get report_ids to update processed report ids CSV file (keep track of which reports have been processed)
X_train_report_id = biasdf.loc[train][['report_id']]
X_test_report_id = biasdf.loc[test][['report_id']]

In [32]:
# Append the ids of the processed reports to the report_ids CSV files (mode='a' creates a file if it does not exist yet)
X_train_report_id.to_csv('train_report_ids.csv', mode='a', header=False, index=False)
X_test_report_id.to_csv('test_report_ids.csv', mode='a', header=False, index=False)

In [91]:
#Convert the SciPy sparse matrices to pandas DataFrames (the columns remain sparse-backed)
X_train_df = pd.DataFrame.sparse.from_spmatrix(X_train)
X_test_df = pd.DataFrame.sparse.from_spmatrix(X_test)

In [92]:
#y Series to DataFrames (each Series is already named 'label')
y_train_df = y_train.to_frame(name='label').reset_index(drop=True)
y_test_df = y_test.to_frame(name='label').reset_index(drop=True)

You should end up with 141 features: 50 text features and 91 demographic and location features (the two age fields plus 89 dummies).

In [93]:
print(y_train_df.shape)
print(X_train_df.shape)
print(y_test_df.shape)
print(X_test_df.shape)

(316, 1)
(316, 141)
(79, 1)
(79, 141)


In [94]:
#combine labels and features
X_train_df['label'] = y_train_df['label']
X_test_df['label'] = y_test_df['label']
train_df = X_train_df
test_df = X_test_df

In [95]:
train_df['label'] = train_df['label'].astype(int)
test_df['label'] = test_df['label'].astype(int)

In [96]:
#reorder so first column is label since XGBoost (Amazon's implementation) reads the first column as the label
train_cols = train_df.columns.tolist()
train_cols = train_cols[-1:] + train_cols[:-1]
train_df = train_df[train_cols]

test_cols = test_df.columns.tolist()
test_cols = test_cols[-1:] + test_cols[:-1]
test_df = test_df[test_cols]

In [97]:
#we can rename the columns using the words from the vectorizer

names = ['victim_age_1','subject_age_1','East', 'North', 'precinct_OOJ', 'South', 'Southwest', 'West', 'precinct_Unknown',
 'Female','Gender Diverse (gender non-conforming and/or transgender)', 'Male', 'Vic_Gender_Unknown',
 'American Indian or Alaska Native', 'Asian', 'Black or African American', 'Native Hawaiian or Other Pacific Islander',
 'Vic_Race_Unknown', 'White', 'Hispanic Or Latino', 'Not Hispanic Or Latino', 'Vic_Ethni_Unknown',
 'subject_American Indian or Alaska Native', 'subject_Asian', 'subject_Black or African American',
 'subject_Native Hawaiian or Other Pacific Islander', 'subject_Sub_Race_Unknown', 'subject_White',
 'subject_Female', 'subject_Gender Diverse (gender non-conforming and/or transgender)',
 'subject_Male', 'subject_Sub_Gender_Unknown', 'subject_Hispanic Or Latino', 'subject_Not Hispanic Or Latino',
 'subject_Sub_Ethni_Unknown', 'B1', 'B2', 'B3', 'C1', 'C2', 'C3', 'D1', 'D2', 'D3', 'E1', 'E2', 'E3', 'F1', 'F2',
 'F3', 'G1', 'G2', 'G3', 'H1', 'H2', 'H3', 'J1', 'J2', 'J3', 'K1', 'K2', 'K3', 'L1', 'L2', 'L3', 'M1', 'M2', 'M3',
 'N1', 'N2', 'N3', 'O1', 'O2', 'O3', 'Q1', 'Q2', 'Q3', 'R1', 'R2', 'R3', 'S1', 'S2', 'S3', 'U1', 'U2', 'U3',
 'beat_Unknown', 'W1', 'W2', 'W3', 'beat_OOJ']

col_names = np.concatenate((['label'], words, names))

train_df = train_df.rename(columns=dict(zip(train_df.columns, col_names)))
test_df = test_df.rename(columns=dict(zip(test_df.columns, col_names)))

We save the files without headers because the AWS XGBoost model object expects CSV files with no header row.
On later runs, we append newly processed reports to the training and testing datasets (i.e., each month we train on all the reports processed so far).

In [98]:
#Append new processed reports to training and testing file for xgboost

# Append the DataFrame to the CSV file
train_df.to_csv('train_final.csv', mode='a', header=False, index=False)
test_df.to_csv('test_final.csv', mode='a', header=False, index=False)
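A small sketch of the append behavior, using a temporary directory and made-up batches: each call with `mode='a'` adds rows to the end of the file, and a headerless file has to be read back with `header=None`, with the label in the first column.

```python
import os
import tempfile

import pandas as pd

# Two toy monthly batches with the label in the first column
batch_1 = pd.DataFrame({"label": [1, 0], "f0": [3, 0]})
batch_2 = pd.DataFrame({"label": [0], "f0": [5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_toy.csv")  # hypothetical file name
    for batch in (batch_1, batch_2):
        # mode='a' creates the file on the first call and appends afterwards
        batch.to_csv(path, mode="a", header=False, index=False)

    stacked = pd.read_csv(path, header=None)

print(stacked.shape)
print(stacked[0].tolist())
```

Because the files grow with every run, re-running this cell on the same data would append the same rows twice.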

In [99]:
#if needed, save with headers:

def save_dataframe(df, filename):
    if os.path.isfile(filename):
        df.to_csv(filename, mode='a', header=False, index=False)
    else:
        df.to_csv(filename, index=False)
            
# Save train and test dataframes
save_dataframe(train_df, 'train_final_wh.csv')
save_dataframe(test_df, 'test_final_wh.csv')