Due to computational cost, at SPD we have separate jobs for feature engineering for training and feature engineering for inference. See the documentation for more information.

The following main packages/modules are used:
- pandas
- numpy
- NLTK
- scikit-learn
- scipy
- joblib

We also use functions from the NLP_Preprocessing file.

In [80]:
import sys
import os.path

import sagemaker_pyspark
import sagemaker
import boto3
import os

#Data Manipulation
import pandas as pd
from collections import Counter
import numpy as np
from os import listdir
from os.path import isfile, join
from datetime import datetime
from sklearn.preprocessing import OneHotEncoder

#NLP Preprocessing Functions
from NLP_PreProcessing import removeStopWords, removeFeatures, lemmatize

#Model Creation 
from joblib import dump, load

## 1. Load new data

Unlike the report-level dataset used for monthly training, the data source for daily inference is a dataset with one unique offense per row. Flattening/pivoting the offense dataset is computationally expensive; as such, the report-level dataset is created once a month, exclusively for training. A dummy dataset of ~40 unique offenses was created for demonstration purposes. A more detailed description of the features in the dataset is included in 'feature_engineering_inference_documentation.'

In [81]:
#read incident offense dataset
demographics = pd.read_csv('incident_offense.csv', usecols=['report_id','reporting_event_number','subject_ethnicity',
                                                            'subject_gender','subject_personid','offense_id', 'offense_code_id',
                                                            'subject_race','subject_age', 'victim_age', 'victim_ethnicity',
                                                            'victim_gender', 'victim_personid', 'victim_race', 'beat',
                                                            'crime_description','precinct', 'event_start_date', 'approval_status', 'report_submitted_date','report_ucr_approved_by'])

In [82]:
#make sure date fields are in date format
demographics['event_start_date'] = pd.to_datetime(demographics['event_start_date'])
demographics['report_submitted_date'] = pd.to_datetime(demographics['report_submitted_date'])

demographics = demographics.sort_values(by='report_submitted_date', ascending=False)
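
If the source file might contain malformed dates, one option is to coerce them to `NaT` and count them before sorting. This is a minimal sketch on made-up values; it is not part of the original pipeline, and the sample strings are assumptions.

```python
import pandas as pd

# Hypothetical raw values; the second entry is deliberately malformed.
raw_dates = pd.Series(["2024-01-05", "not a date", "2024-01-07"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, errors="coerce")
n_bad = int(parsed.isna().sum())
print(f"{n_bad} unparseable date(s)")
```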

We add a condition that checks whether a table of processed reports exists (if it does not, the script assumes this is its first deployment). This script runs daily. However, we might want to pass more than 24 hours of reports the first time it runs. We look at 15 days' worth of reports, but the researcher can adjust this as needed.

In [83]:
#check that final_reports.csv exists:
if os.path.isfile('final_reports.csv'):
    
    #read final_reports table
    fe_last = pd.read_csv("final_reports.csv")
    fe_last['report_submitted_date'] = pd.to_datetime(fe_last['report_submitted_date'])
    
    # 1) Look for latest date in output table and take all reports at that date and later
    # (in case some reports with the last date were not processed) -->table gets replaced every time it runs:
    latest_date = sorted(fe_last['report_submitted_date'].unique().tolist(), reverse=True)[0]
    mask = demographics['report_submitted_date'] >= latest_date
    filtered_demo = demographics[mask]
    
    # 2) Filter out reports already in feature engineering
    recent_reports = set(fe_last['reporting_event_number'].unique().tolist())
    filtered_demo = filtered_demo[~filtered_demo['reporting_event_number'].isin(recent_reports)]
    
    # 3) Filter reports in 'draft' status (should not be any)
    filtered_demo = filtered_demo[filtered_demo['approval_status'] != 'Draft']
    
    # 4) Filter reports that have been updated by Tory (bias crimes unit research analyst):
    # if using mark43 table: filtered_demo = filtered_demo[filtered_demo['updated_by'] != 'TORY WHITE']
    filtered_demo = filtered_demo[filtered_demo['report_ucr_approved_by'] != '9601']
    
else: 
    # 1) Filter reports in 'draft' status
    filtered_demo = demographics[demographics['approval_status'] != 'Draft']
    
    # 2) Filter reports that have been updated by Tory (bias crimes unit research analyst):
    filtered_demo = filtered_demo[filtered_demo['report_ucr_approved_by'] != '9601']
    
    # 3) Look back over the past 15 days only:
    fifteen_days = datetime.now() - pd.DateOffset(days=15)
    filtered_demo = filtered_demo[filtered_demo['report_submitted_date'] >= fifteen_days]
    

In [84]:
df = filtered_demo
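
The incremental branch of the filter above (keep reports on or after the latest processed date, drop those already processed, drop drafts) can be illustrated on toy data. The column names follow the notebook; the report numbers, dates, and statuses are made up for illustration.

```python
import pandas as pd

# Hypothetical table of reports that were already processed.
processed = pd.DataFrame({
    "reporting_event_number": ["2024-001", "2024-002"],
    "report_submitted_date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})

# Hypothetical incoming offense-level data.
incoming = pd.DataFrame({
    "reporting_event_number": ["2024-001", "2024-002", "2024-003", "2024-004"],
    "report_submitted_date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"]
    ),
    "approval_status": ["Approved", "Approved", "Draft", "Approved"],
})

latest_date = processed["report_submitted_date"].max()
already_done = set(processed["reporting_event_number"])

new_reports = incoming[
    (incoming["report_submitted_date"] >= latest_date)          # on or after the latest processed date
    & ~incoming["reporting_event_number"].isin(already_done)    # not processed yet
    & (incoming["approval_status"] != "Draft")                  # no drafts
]
print(new_reports["reporting_event_number"].tolist())
```

Using `>=` rather than `>` re-examines the latest date, so reports submitted later on that same day are not missed; the `isin` filter then removes the ones already handled.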

We add an if statement that checks whether the dataframe is empty, which happens when there are no new (not yet processed) reports. In that case the cell raises an error, so the remaining code cells run only when there is data to process.

In [86]:
if df.empty:
    raise ValueError("Dataframe is empty.")
    
print("No error, continue running")

No error, continue running


## 2. Feature Engineering

To order offenses, we rank the offense codes related to events with bias elements. The ranking includes deprecated offense codes present in historical data, RCW and SMC offense codes, and SPD-specific codes used for reporting.

In [87]:
# dictionary of ranked crime descriptions
offense_code_ranks = {"RCW - 9A.36.080 | HATE CRIME OFFENSE": 1,
    "RCW - 9A.36.080 | MALICIOUS HARASSMENT": 2,
    "SMC - - 12A.06.115 | MALICIOUS HARASSMENT": 3,
    "Incident Contains Bias Elements -- NO CRIME": 4,
    "Offense Contains Bias Elements -- CRIME": 5,
    "X91 | MALICIOUS HARASSMENT": 6,
    "X92 | BIAS INCIDENT": 7}

#Create rank for offenses (9 for '-' in description)
df["offense_rank"] = df["crime_description"].apply(lambda x: offense_code_ranks.get(x, 9) if x == '-' else offense_code_ranks.get(x, 8))
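
The same ranking rule (listed codes keep their rank, `'-'` gets 9, everything else gets 8) can also be written without a lambda. This is a sketch on a shortened dictionary; `"THEFT"` is a made-up description used only for illustration.

```python
import pandas as pd

# Shortened copy of the ranking dictionary, for illustration only.
ranks = {
    "RCW - 9A.36.080 | HATE CRIME OFFENSE": 1,
    "X92 | BIAS INCIDENT": 7,
}

desc = pd.Series([
    "RCW - 9A.36.080 | HATE CRIME OFFENSE",
    "THEFT",  # hypothetical description that is not in the dictionary
    "-",
    "X92 | BIAS INCIDENT",
])

offense_rank = (
    desc.map(ranks)          # listed codes get their rank, others become NaN
        .fillna(8)           # unlisted descriptions get rank 8
        .mask(desc == "-", 9)  # the placeholder '-' gets rank 9
        .astype(int)
)
print(offense_rank.tolist())
```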

We also rank victims on each report by the completeness of their demographic data.

In [88]:
#Create victim rank by demographics completeness
def calculate_rank(group):
    def inner_calculate_rank(row):
        conditions_met = sum(
            [
                pd.notna(row['victim_age']) and row['victim_age'] != -1,
                row['victim_race'] not in ["Unknown", None, "-"],
                row['victim_gender'] not in ["Unknown", None, "-"],
                row['victim_ethnicity'] not in ["Unknown", None, "-"]
            ]
        )
        return conditions_met
     
    group['victim_rank'] = group.apply(inner_calculate_rank, axis=1)
    #rows with same completeness get same rank
    group['victim_rank'] = group['victim_rank'].rank(ascending=False, method='dense').astype(int)
    return group

# Group the DataFrame by 'report_id' and apply the ranking function to each group.
df = df.groupby('report_id').apply(calculate_rank).reset_index(drop=True)

# Sort the DataFrame by 'report_id' and 'victim_rank' in descending order (ascending=False).
df.sort_values(['report_id','victim_rank'], ascending=False, inplace=True)
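
To see what the dense ranking does, here is a small sketch that ranks rows within each report by a precomputed completeness count. The `fields_present` column is a hypothetical stand-in for the `conditions_met` sum computed above, and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "report_id": [1, 1, 1, 2, 2],
    "fields_present": [4, 2, 4, 1, 3],  # number of non-missing demographic fields
})

# Dense ranking within each report: ties share a rank and no rank is skipped.
toy["victim_rank"] = (
    toy.groupby("report_id")["fields_present"]
       .rank(ascending=False, method="dense")
       .astype(int)
)
print(toy["victim_rank"].tolist())
```

With `ascending=False`, the most complete rows in a report receive rank 1.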



Similarly, we rank suspects on each report by the completeness of their demographic data.

In [89]:
#Create subject rank by demographics completeness
def calculate_rank_s(group):
    def inner_calculate_rank_s(row):
        conditions_met = sum(
            [
                pd.notna(row['subject_age']) and row['subject_age'] != -1,
                row['subject_race'] not in ["Unknown", None, "-"],
                row['subject_gender'] not in ["Unknown", None, "-"],
                row['subject_ethnicity'] not in ["Unknown", None, "-"]
            ]
        )
        return conditions_met 
    

    group['subject_rank'] = group.apply(inner_calculate_rank_s, axis=1)
    #rows with same completeness get same rank
    group['subject_rank'] = group['subject_rank'].rank(ascending=False, method='dense').astype(int)
    return group

# Group the DataFrame by 'report_id' and apply the ranking function to each group.
df = df.groupby('report_id').apply(calculate_rank_s).reset_index(drop=True)

# Sort the DataFrame by 'report_id' and 'subject_rank' in descending order (ascending=False).
df.sort_values(['report_id','subject_rank'], ascending=False, inplace=True)



Create combined demographics completeness rank.

In [90]:
#create subject + victim rank 
df['demographics_rank'] = df['victim_rank'] + df['subject_rank']

#drop individual ranks (we'll use the combined rank)
df = df.drop(['victim_rank', 'subject_rank'], axis = 1)

### Pivot dataset (keep one row per unique report_id)

Note that you will get as many 'victim' columns as the maximum number of victims in a report in your dataset, which might change based on the data sample. This is also the case for suspects and offenses. In our application, we limit the number of offenses, suspects, and victims to 5 each, matching the format of the dummy_narrative_github dataset used in the feature engineering for training.

In [91]:
# Create a new DataFrame for the final dataset
final_data = pd.DataFrame()

# Group the original data by 'report_id'
grouped = df.groupby('report_id')

# Extract common columns for each report
final_data['reporting_event_number'] = grouped['reporting_event_number'].first()
final_data['report_id'] = grouped['report_id'].first()
final_data['precinct'] = grouped['precinct'].first()
final_data['beat'] = grouped['beat'].first()
final_data['event_start_date'] = grouped['event_start_date'].first()
final_data['report_submitted_date'] = grouped['report_submitted_date'].first()
final_data['approval_status'] = grouped['approval_status'].first()
final_data['report_ucr_approved_by'] = grouped['report_ucr_approved_by'].first()

# Extract unique values and labels for each category (offense, victim, subject)
for col_prefix in ['offense_id', 'crime_description', 'victim_personid', 'victim_age', 'victim_race', 'victim_gender', 'victim_ethnicity', 'subject_personid', 'subject_age', 'subject_race', 'subject_gender', 'subject_ethnicity', 'offense_rank', 'demographics_rank']:
    # Pivot the data to create separate columns for each unique value in the category
    # 1) lambda function by group of columns that resets index,2) pivoting, 3) read prefix 
    pivoted = grouped[col_prefix].apply(lambda x: x.reset_index(drop=True)).unstack().add_prefix(col_prefix + '_')
    final_data = pd.concat([final_data, pivoted], axis=1)

# Reset the index of the final dataset
final_data.reset_index(drop=True, inplace=True)
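
The long-to-wide step can be hard to picture, so here is a minimal sketch of the same idea on toy data. It uses `cumcount` plus `pivot` rather than the `apply`/`unstack` call above; that substitution is an assumption made for readability, and the ids are made up.

```python
import pandas as pd

long_df = pd.DataFrame({
    "report_id": [10, 10, 20],
    "offense_id": [501, 502, 601],
})

# Position of each offense within its report: 0, 1, ...
long_df["pos"] = long_df.groupby("report_id").cumcount()

# One row per report, one column per position.
wide = (
    long_df.pivot(index="report_id", columns="pos", values="offense_id")
           .add_prefix("offense_id_")
)
print(wide)
```

Report 20 has a single offense, so its `offense_id_1` cell is `NaN`. This is why the number of columns depends on the report with the most offenses in the sample.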

Create a flag for reports that have bias-related offense codes, since these reports are reordered differently.

Note that although the reports usually contain an indicator of whether they are events with bias elements or not (such as the offense code), our model does not use these offenses as features since this is precisely the field used for data labeling (see the feature engineering for training process). Instead, the goal of the classifier is to catch reports that might have an incorrect/missing offense code.

In [92]:
offense_rank_columns = final_data.filter(like='offense_rank')

# Check if any value in the 'offense_rank' columns is less than 8, then flag as bias
final_data['bias_flag'] = offense_rank_columns.lt(8).any(axis=1).astype(int)
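
A small sketch of the flag on toy data shows how missing ranks behave. The values are made up; the column naming follows the pivoted dataset.

```python
import pandas as pd

wide = pd.DataFrame({
    "report_id": [1, 2, 3],
    "offense_rank_0": [8, 1, 9],
    "offense_rank_1": [float("nan"), 8, 8],
})

rank_cols = wide.filter(like="offense_rank")

# NaN < 8 evaluates to False, so empty offense slots never set the flag.
wide["bias_flag"] = rank_cols.lt(8).any(axis=1).astype(int)
print(wide["bias_flag"].tolist())
```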

### Create separate dataframe for reports with bias events

For bias events, we use the offense rank to reorder offenses and their associated victims and suspects. We want to ensure that at least the first victim is the one associated with the highest-ranked offense, which is usually a bias-related offense (see the ranking above).

In [93]:
#Separate dataframe for bias events
bias_reports = final_data[final_data['bias_flag'] == 1]
offense_rank_columns = bias_reports.filter(like='offense_rank')

# Create an empty DataFrame to store the results
result_df_bias = pd.DataFrame()

# Reorder columns based on the 'offense_rank' values for flagged rows
for index, row in bias_reports.iterrows():
    # Get the 'offense_rank' values for the current row
    rank_values = row[offense_rank_columns.columns].values
    # Get the corresponding suffixes for the 'offense_rank' columns
    suffixes = [col.split('_')[-1] for col in offense_rank_columns.columns]
    # Get the corresponding 'demographics_rank' values for the current row
    demographics_values = [row[f'demographics_rank_{suffix}'] for suffix in suffixes]
    
    # Sort the columns based on 'offense_rank' x[1] values and, if tied, 'demographics_rank'-x[2]
    sorted_columns = sorted(zip(offense_rank_columns.columns, rank_values, demographics_values), 
                             key=lambda x: (x[1], -x[2]))
    
    # Create a mapping of old column names to new column names for this row
    column_mapping = {}
    
    for i, (old_name, _, _) in enumerate(sorted_columns, start=1):
        new_name = f'offense_rank_{i}'
        column_mapping[old_name] = new_name
        
        # Rename associated columns with the same suffix
        suffix = old_name.split('_')[-1]
        for column_prefix in ['crime_description', 'offense_id', 'victim_personid', 'victim_age',
                              'victim_race', 'victim_gender', 'victim_ethnicity', 'subject_personid',
                              'subject_age', 'subject_race', 'subject_gender',
                              'subject_ethnicity', 'demographics_rank']:
            associated_column = f'{column_prefix}_{suffix}'
            new_associated_column = f'{column_prefix}_{i}'
            column_mapping[associated_column] = new_associated_column
    
    # Rename columns for the current row
    renamed_row = row.rename(column_mapping)
    
    # Append the renamed row to the result DataFrame
    result_df_bias = pd.concat([result_df_bias, renamed_row.to_frame().T], ignore_index=True)
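
The core of the reordering is the sort key and the old-to-new column mapping. Stripped of the DataFrame handling, it can be sketched as follows; the tuples are made-up values for a single report.

```python
# (suffix, offense_rank, demographics_rank) for each slot of one report.
slots = [("0", 8, 5), ("1", 1, 3), ("2", 1, 6)]

# Same key as above: offense_rank ascending, ties broken by larger demographics_rank first.
ordered = sorted(slots, key=lambda s: (s[1], -s[2]))

# Map each old column name to its new 1-based position.
mapping = {
    f"offense_id_{old}": f"offense_id_{new}"
    for new, (old, _, _) in enumerate(ordered, start=1)
}
print(mapping)
```

Slots "1" and "2" tie on offense rank 1, so slot "2" (demographics rank 6) moves to position 1 and slot "1" to position 2.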

Keep only unique offenses and shift the crime descriptions accordingly. As a result, victims, subjects, and offenses after offense #1 are not necessarily linked by suffix. Without this step, a report with several unique offenses but only one victim would repeat that victim several times, which we don't want.

In [94]:
offense_rank_columns = bias_reports.filter(like='offense_rank')
# Determine the number of columns based on available data
num_columns = len(offense_rank_columns.columns)

# Iterate over rows in result_df
for index, row in result_df_bias.iterrows():
    # Create a set to keep track of unique offenses for the current row
    unique_offenses = set()
    
    for i in range(1, num_columns + 1):
        col_name = f'offense_id_{i}'
        desc_name = f'crime_description_{i}'
        
        # Check if the offense is already in the set
        if row[col_name] in unique_offenses:
            # Find the next unique offense not in the set
            j = i + 1
            while j <= num_columns and (row[col_name] in unique_offenses or row[f'offense_id_{j}'] in unique_offenses):
                col_name = f'offense_id_{j}'
                desc_name = f'crime_description_{j}'
                j += 1
        
        # Add the unique offense to the set of unique offenses
        unique_offenses.add(row[col_name])
        
        # Update the row with the adjusted offense and description
        result_df_bias.at[index, f'offense_id_{i}'] = row[col_name]
        result_df_bias.at[index, f'crime_description_{i}'] = row[desc_name]
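
For comparison, the general idea of "keep first occurrences and shift left" can be sketched with a plain helper. This is a simplified alternative, not the notebook's loop: it pads the freed slots with `None`, and `unique_in_order` is a hypothetical name.

```python
def unique_in_order(values, pad=None):
    """Keep the first occurrence of each value, then pad to the original length."""
    seen = set()
    unique = []
    for value in values:
        if value is not None and value not in seen:
            seen.add(value)
            unique.append(value)
    return unique + [pad] * (len(values) - len(unique))

# Made-up offense ids for one report, with a repeated offense.
offense_ids = [501, 501, 502, 501]
deduped = unique_in_order(offense_ids)
print(deduped)
```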

Get only unique victims and shift demographics accordingly.

In [95]:
offense_rank_columns = bias_reports.filter(like='offense_rank')
# Determine the number of columns based on available data
num_columns = len(offense_rank_columns.columns)

# Iterate over rows in result_df
for index, row in result_df_bias.iterrows():
    # Create a set to keep track of unique victims for the current row
    unique_victims = set()
    
    for i in range(1, num_columns + 1):
        col_name = f'victim_personid_{i}'
        age_name = f'victim_age_{i}'
        race_name = f'victim_race_{i}'
        gender_name = f'victim_gender_{i}'
        ethnicity_name = f'victim_ethnicity_{i}'
        
        # Check if the victim is already in the set
        if row[col_name] in unique_victims:
            # Find the next victim not already in the set
            j = i + 1
            while j <= num_columns and (row[col_name] in unique_victims or row[f'victim_personid_{j}'] in unique_victims):
                col_name = f'victim_personid_{j}'
                age_name = f'victim_age_{j}'
                race_name = f'victim_race_{j}'
                gender_name = f'victim_gender_{j}'
                ethnicity_name = f'victim_ethnicity_{j}'
                j += 1
        
        # Add the victim to the set of unique victims
        unique_victims.add(row[col_name])
        
        # Update the row with the adjusted victim and demographics
        result_df_bias.at[index, f'victim_personid_{i}'] = row[col_name]
        result_df_bias.at[index, f'victim_age_{i}'] = row[age_name]
        result_df_bias.at[index, f'victim_race_{i}'] = row[race_name]
        result_df_bias.at[index, f'victim_gender_{i}'] = row[gender_name]
        result_df_bias.at[index, f'victim_ethnicity_{i}'] = row[ethnicity_name]

Get only unique subjects and shift demographics accordingly.

In [96]:
offense_rank_columns = bias_reports.filter(like='offense_rank')
# Determine the number of columns based on available data
num_columns = len(offense_rank_columns.columns)

# Iterate over rows in result_df
for index, row in result_df_bias.iterrows():
    # Create a set to keep track of unique subjects for the current row
    unique_subjects = set()
    
    for i in range(1, num_columns + 1):
        col_name = f'subject_personid_{i}'
        age_name = f'subject_age_{i}'
        race_name = f'subject_race_{i}'
        gender_name = f'subject_gender_{i}'
        ethnicity_name = f'subject_ethnicity_{i}'
        
        # Check if the subject is already in the set
        if row[col_name] in unique_subjects:
            # Find the next subject not already in the set
            j = i + 1
            while j <= num_columns and (row[col_name] in unique_subjects or row[f'subject_personid_{j}'] in unique_subjects):
                col_name = f'subject_personid_{j}'
                age_name = f'subject_age_{j}'
                race_name = f'subject_race_{j}'
                gender_name = f'subject_gender_{j}'
                ethnicity_name = f'subject_ethnicity_{j}'
                j += 1
        
        # Add the subject to the set of unique subjects
        unique_subjects.add(row[col_name])
        
        # Update the row with the adjusted subject and demographics
        result_df_bias.at[index, f'subject_personid_{i}'] = row[col_name]
        result_df_bias.at[index, f'subject_age_{i}'] = row[age_name]
        result_df_bias.at[index, f'subject_race_{i}'] = row[race_name]
        result_df_bias.at[index, f'subject_gender_{i}'] = row[gender_name]
        result_df_bias.at[index, f'subject_ethnicity_{i}'] = row[ethnicity_name]

### Create separate dataframe for reports with no bias-related offenses

For reports without bias-related offenses, we reorder based on both the offense and demographics ranks.

In [97]:
#Separate dataframe for non-bias events
no_bias_reports = final_data[final_data['bias_flag'] == 0]
offense_rank_columns = no_bias_reports.filter(like='offense_rank')

# Create an empty DataFrame to store the results
result_df = pd.DataFrame()

# Reorder columns based on the 'offense_rank' values for unflagged rows
for index, row in no_bias_reports.iterrows():
    # Get the 'offense_rank' values for the current row
    rank_values = row[offense_rank_columns.columns].values
    # Get the corresponding suffixes for the 'offense_rank' columns
    suffixes = [col.split('_')[-1] for col in offense_rank_columns.columns]
    # Get the corresponding 'demographics_rank' values for the current row
    demographics_values = [row[f'demographics_rank_{suffix}'] for suffix in suffixes]
    
    # Sort the columns based on 'offense_rank' x[1] values and, if tied, 'demographics_rank'-x[2]
    sorted_columns = sorted(zip(offense_rank_columns.columns, rank_values, demographics_values), 
                             key=lambda x: (x[1], -x[2]))
    
    # Create a mapping of old column names to new column names for this row
    column_mapping = {}
    
    for i, (old_name, _, _) in enumerate(sorted_columns, start=1):
        new_name = f'offense_rank_{i}'
        column_mapping[old_name] = new_name
        
        # Rename associated columns with the same suffix
        suffix = old_name.split('_')[-1]
        for column_prefix in ['crime_description', 'offense_id', 'victim_personid', 'victim_age',
                              'victim_race', 'victim_gender', 'victim_ethnicity', 'subject_personid',
                              'subject_age', 'subject_race', 'subject_gender',
                              'subject_ethnicity', 'demographics_rank']:
            associated_column = f'{column_prefix}_{suffix}'
            new_associated_column = f'{column_prefix}_{i}'
            column_mapping[associated_column] = new_associated_column
    
    # Rename columns for the current row
    renamed_row = row.rename(column_mapping)
    
    # Append the renamed row to the result DataFrame
    result_df = pd.concat([result_df, renamed_row.to_frame().T], ignore_index=True)

Get only unique offenses and shift crime description accordingly.

In [98]:
demographics_rank_columns = no_bias_reports.filter(like='demographics_rank')
# Determine the number of columns based on available data
num_columns = len(demographics_rank_columns.columns)

# Iterate over rows in result_df
for index, row in result_df.iterrows():
    # Create a set to keep track of unique offenses for the current row
    unique_offenses = set()
    
    for i in range(1, num_columns + 1):
        col_name = f'offense_id_{i}'
        desc_name = f'crime_description_{i}'
        
        # Check if the offense is already in the set
        if row[col_name] in unique_offenses:
            # Find the next unique offense not in the set
            j = i + 1
            while j <= num_columns and (row[col_name] in unique_offenses or row[f'offense_id_{j}'] in unique_offenses):
                col_name = f'offense_id_{j}'
                desc_name = f'crime_description_{j}'
                j += 1
        
        # Add the unique offense to the set of unique offenses
        unique_offenses.add(row[col_name])
        
        # Update the row with the adjusted offense and description
        result_df.at[index, f'offense_id_{i}'] = row[col_name]
        result_df.at[index, f'crime_description_{i}'] = row[desc_name]

Get only unique victims and shift demographics accordingly.

In [99]:
demographics_rank_columns = no_bias_reports.filter(like='demographics_rank')
# Determine the number of columns based on available data
num_columns = len(demographics_rank_columns.columns)

# Iterate over rows in result_df
for index, row in result_df.iterrows():
    # Create a set to keep track of unique victims for the current row
    unique_victims = set()
    
    for i in range(1, num_columns + 1):
        col_name = f'victim_personid_{i}'
        age_name = f'victim_age_{i}'
        race_name = f'victim_race_{i}'
        gender_name = f'victim_gender_{i}'
        ethnicity_name = f'victim_ethnicity_{i}'
        
        # Check if the victim is already in the set
        if row[col_name] in unique_victims:
            # Find the next victim not already in the set
            j = i + 1
            while j <= num_columns and (row[col_name] in unique_victims or row[f'victim_personid_{j}'] in unique_victims):
                col_name = f'victim_personid_{j}'
                age_name = f'victim_age_{j}'
                race_name = f'victim_race_{j}'
                gender_name = f'victim_gender_{j}'
                ethnicity_name = f'victim_ethnicity_{j}'
                j += 1
        
        # Add the victim to the set of unique victims
        unique_victims.add(row[col_name])
        
        # Update the row with the adjusted victim and demographics
        result_df.at[index, f'victim_personid_{i}'] = row[col_name]
        result_df.at[index, f'victim_age_{i}'] = row[age_name]
        result_df.at[index, f'victim_race_{i}'] = row[race_name]
        result_df.at[index, f'victim_gender_{i}'] = row[gender_name]
        result_df.at[index, f'victim_ethnicity_{i}'] = row[ethnicity_name]

Get only unique subjects and shift demographics accordingly.

In [100]:
demographics_rank_columns = no_bias_reports.filter(like='demographics_rank')
# Determine the number of columns based on available data
num_columns = len(demographics_rank_columns.columns)

# Iterate over rows in result_df
for index, row in result_df.iterrows():
    # Create a set to keep track of unique subjects for the current row
    unique_subjects = set()
    
    for i in range(1, num_columns + 1):
        col_name = f'subject_personid_{i}'
        age_name = f'subject_age_{i}'
        race_name = f'subject_race_{i}'
        gender_name = f'subject_gender_{i}'
        ethnicity_name = f'subject_ethnicity_{i}'
        
        # Check if the subject is already in the set
        if row[col_name] in unique_subjects:
            # Find the next subject not already in the set
            j = i + 1
            while j <= num_columns and (row[col_name] in unique_subjects or row[f'subject_personid_{j}'] in unique_subjects):
                col_name = f'subject_personid_{j}'
                age_name = f'subject_age_{j}'
                race_name = f'subject_race_{j}'
                gender_name = f'subject_gender_{j}'
                ethnicity_name = f'subject_ethnicity_{j}'
                j += 1
        
        # Add the subject to the set of unique subjects
        unique_subjects.add(row[col_name])
        
        # Update the row with the adjusted subject and demographics
        result_df.at[index, f'subject_personid_{i}'] = row[col_name]
        result_df.at[index, f'subject_age_{i}'] = row[age_name]
        result_df.at[index, f'subject_race_{i}'] = row[race_name]
        result_df.at[index, f'subject_gender_{i}'] = row[gender_name]
        result_df.at[index, f'subject_ethnicity_{i}'] = row[ethnicity_name]

### Get only the first column of each for both datasets

Although the current version of the algorithm uses only the first offense, victim, and subject, we are working on a version of the model that will incorporate more of these features.

In [101]:
#Get only one column of each
def one_offense(column_name):
    try:
        # Extract the numerical part from the column name
        numerical_part = int(''.join(filter(str.isdigit, column_name)))
        return numerical_part > 1
    except ValueError:
        return False
    
# Get a list of columns to drop
columns_to_drop = [col for col in result_df.columns if one_offense(col)]
columns_to_drop_b = [col for col in result_df_bias.columns if one_offense(col)]

# Drop the columns
result_df.drop(columns=columns_to_drop, inplace=True)
result_df_bias.drop(columns=columns_to_drop_b, inplace=True)
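
`one_offense` joins every digit found anywhere in a column name. A stricter variant, sketched below as an assumption rather than the notebook's method, looks only at a trailing `_N` suffix; `is_extra_slot` and the sample column names are hypothetical.

```python
import re

def is_extra_slot(column_name):
    """Return True if the name ends in _N with N > 1 (e.g. 'offense_id_2')."""
    match = re.search(r"_(\d+)$", column_name)
    return bool(match) and int(match.group(1)) > 1

columns = ["report_id", "offense_id_1", "offense_id_2", "victim_age_10"]
kept = [col for col in columns if not is_extra_slot(col)]
print(kept)
```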

### Concatenate datasets

In [102]:
final_df = pd.concat([result_df, result_df_bias], axis=0).reset_index(drop = True)

### Read in narratives and merge

We store report narratives in a separate dataset.

In [105]:
narr = pd.read_csv('narratives.csv')

In [106]:
#merge left on report_id
df = final_df.merge(narr[['narrative', 'report_id']], how='left', left_on= 'report_id', right_on= 'report_id')
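
When merging, it can help to confirm that each report matches at most one narrative and to see which reports have none. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on made-up frames; whether narratives are unique per `report_id` in the real data is an assumption.

```python
import pandas as pd

reports = pd.DataFrame({"report_id": [1, 2, 3]})
narratives = pd.DataFrame({
    "report_id": [1, 2],
    "narrative": ["first narrative", "second narrative"],
})

merged = reports.merge(
    narratives,
    how="left",
    on="report_id",
    validate="one_to_one",  # raises MergeError if report_id is duplicated on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

missing = merged.loc[merged["_merge"] == "left_only", "report_id"].tolist()
print(missing)
```

Reports listed in `missing` are the ones whose narrative would be filled with the neutral placeholder in the next step.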

### NLP Preprocessing

Fill missing narratives with a neutral word (i.e., 'narrative').

In [107]:
df['narrative'] = df['narrative'].fillna('narrative')

In [108]:
# Apply the pre-processing functions to the 'narrative' column
df['corpus'] = df['narrative'].apply(removeStopWords)
df['corpus'] = df['corpus'].apply(removeFeatures)
df['corpus'] = df['corpus'].apply(lemmatize)

In [109]:
#If empty corpus after processing, replace with 'narrative':
df['corpus'] = df['corpus'].replace('','narrative')

### One-Hot Encoding Demographics

In [110]:
df['subject_age_1']= df['subject_age_1'].replace(-1, np.nan)
df['victim_age_1'] = df['victim_age_1'].replace(-1, np.nan)

df['subject_race_1'] = df['subject_race_1'].replace('-', 'Sub_Race_Unknown')
df['subject_race_1'] = df['subject_race_1'].fillna('Sub_Race_Unknown')
df['subject_race_1'] = df['subject_race_1'].replace('Unknown', 'Sub_Race_Unknown')

df['victim_race_1'] = df['victim_race_1'].replace('-', 'Vic_Race_Unknown')
df['victim_race_1'] = df['victim_race_1'].fillna('Vic_Race_Unknown')
df['victim_race_1'] = df['victim_race_1'].replace('Unknown', 'Vic_Race_Unknown')

df['subject_gender_1'] = df['subject_gender_1'].replace('-', 'Sub_Gender_Unknown')
df['subject_gender_1'] = df['subject_gender_1'].fillna('Sub_Gender_Unknown')
df['subject_gender_1'] = df['subject_gender_1'].replace('Unknown', 'Sub_Gender_Unknown')

df['victim_gender_1'] = df['victim_gender_1'].replace('-', 'Vic_Gender_Unknown')
df['victim_gender_1'] = df['victim_gender_1'].fillna('Vic_Gender_Unknown')
df['victim_gender_1'] = df['victim_gender_1'].replace('Unknown', 'Vic_Gender_Unknown')

df['subject_ethnicity_1'] = df['subject_ethnicity_1'].replace('-', 'Sub_Ethni_Unknown')
df['subject_ethnicity_1'] = df['subject_ethnicity_1'].fillna('Sub_Ethni_Unknown')
df['subject_ethnicity_1'] = df['subject_ethnicity_1'].replace('Unknown', 'Sub_Ethni_Unknown')

df['victim_ethnicity_1'] = df['victim_ethnicity_1'].replace('-', 'Vic_Ethni_Unknown')
df['victim_ethnicity_1'] = df['victim_ethnicity_1'].fillna('Vic_Ethni_Unknown')
df['victim_ethnicity_1'] = df['victim_ethnicity_1'].replace('Unknown', 'Vic_Ethni_Unknown')

df['beat'] = df['beat'].replace('99', 'beat_Unknown')
df['beat'] = df['beat'].replace('OOJ', 'beat_OOJ')
df['beat'] = df['beat'].replace('Unknown', 'beat_Unknown')
df['beat'] = df['beat'].fillna('beat_Unknown')


df['precinct'] = df['precinct'].replace('Unknown', 'precinct_Unknown')
df['precinct'] = df['precinct'].replace('OOJ', 'precinct_OOJ')
df['precinct'] = df['precinct'].fillna('precinct_Unknown')



In [111]:
ohe = OneHotEncoder()

In [112]:
precinct_e = ohe.fit_transform(df[['precinct']])
df[ohe.categories_[0]] = precinct_e.toarray()

gender_e = ohe.fit_transform(df[['victim_gender_1']])
df[ohe.categories_[0]] = gender_e.toarray()

race_e = ohe.fit_transform(df[['victim_race_1']])
df[ohe.categories_[0]] = race_e.toarray()

ethnicity_e = ohe.fit_transform(df[['victim_ethnicity_1']])
df[ohe.categories_[0]] = ethnicity_e.toarray()

beat_e = ohe.fit_transform(df[['beat']])
df[ohe.categories_[0]] = beat_e.toarray()

In [113]:
# Create a new instance of OneHotEncoder for Subject's categories
ohe_subject = OneHotEncoder()
subject_race_e = ohe_subject.fit_transform(df[['subject_race_1']])

# Rename the one-hot encoded columns with a prefix to differentiate them
new_column_names = ['subject_' + category for category in ohe_subject.categories_[0]]

# Create a DataFrame from the one-hot encoded array and set the column names
subject_race_df = pd.DataFrame(subject_race_e.toarray(), columns=new_column_names)

# Concatenate the one-hot encoded DataFrame with the original DataFrame
df = pd.concat([df, subject_race_df], axis=1)

In [114]:
subject_gender_e = ohe_subject.fit_transform(df[['subject_gender_1']])
new_column_names = ['subject_' + category for category in ohe_subject.categories_[0]]
# Create a DataFrame from the one-hot encoded array and set the column names
subject_gender_df = pd.DataFrame(subject_gender_e.toarray(), columns=new_column_names)

# Concatenate the one-hot encoded DataFrame with the original DataFrame
df = pd.concat([df, subject_gender_df], axis=1)

In [115]:
subject_ethnicity_e = ohe_subject.fit_transform(df[['subject_ethnicity_1']])
new_column_names = ['subject_' + category for category in ohe_subject.categories_[0]]
# Create a DataFrame from the one-hot encoded array and set the column names
subject_ethnicity_df = pd.DataFrame(subject_ethnicity_e.toarray(), columns=new_column_names)

# Concatenate the one-hot encoded DataFrame with the original DataFrame
df = pd.concat([df, subject_ethnicity_df], axis=1)
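Each `fit_transform` call above learns its categories from the current inference batch, so a category that does not occur in a given day's data produces no column; the next cell back-fills such columns with 0. A minimal sketch of an alternative, assuming the full category list is known ahead of time: pass it to `OneHotEncoder` through `categories` and set `handle_unknown='ignore'`, so the output columns are identical for every batch. The toy `demo` frame and the `precinct_levels` list are illustrative assumptions, not part of this pipeline.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category list; in practice it would mirror the training data.
precinct_levels = ['East', 'North', 'South', 'Southwest', 'West']

# Toy inference batch that contains only two of the five precincts.
demo = pd.DataFrame({'precinct': ['North', 'West', 'North']})

# Fixing the categories makes the output width independent of the batch;
# handle_unknown='ignore' encodes unseen values as an all-zero row.
ohe_fixed = OneHotEncoder(categories=[precinct_levels], handle_unknown='ignore')
encoded = ohe_fixed.fit_transform(demo[['precinct']])

encoded_df = pd.DataFrame(encoded.toarray(),
                          columns=ohe_fixed.categories_[0],
                          index=demo.index)
print(encoded_df)
```

With this approach the precincts missing from the batch still get their own all-zero columns, in a fixed order.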

In [116]:
#Check that every expected column is present; if one is missing, create it and fill it with 0

columns = ['victim_age_1','subject_age_1','East', 'North', 'precinct_OOJ', 'South', 'Southwest', 'West', 'precinct_Unknown', 'Female',
 'Gender Diverse (gender non-conforming and/or transgender)', 'Male', 'Vic_Gender_Unknown',
 'American Indian or Alaska Native', 'Asian', 'Black or African American', 'Native Hawaiian or Other Pacific Islander',
 'Vic_Race_Unknown', 'White', 'Hispanic Or Latino', 'Not Hispanic Or Latino', 'Vic_Ethni_Unknown',
 'subject_American Indian or Alaska Native', 'subject_Asian', 'subject_Black or African American',
 'subject_Native Hawaiian or Other Pacific Islander', 'subject_Sub_Race_Unknown', 'subject_White',
 'subject_Female', 'subject_Gender Diverse (gender non-conforming and/or transgender)',
 'subject_Male', 'subject_Sub_Gender_Unknown', 'subject_Hispanic Or Latino', 'subject_Not Hispanic Or Latino',
 'subject_Sub_Ethni_Unknown', 'B1', 'B2', 'B3', 'C1', 'C2', 'C3', 'D1', 'D2', 'D3', 'E1', 'E2', 'E3', 'F1', 'F2',
 'F3', 'G1', 'G2', 'G3', 'H1', 'H2', 'H3', 'J1', 'J2', 'J3', 'K1', 'K2', 'K3', 'L1', 'L2', 'L3', 'M1', 'M2', 'M3',
 'N1', 'N2', 'N3', 'O1', 'O2', 'O3', 'Q1', 'Q2', 'Q3', 'R1', 'R2', 'R3', 'S1', 'S2', 'S3', 'U1', 'U2', 'U3',
 'beat_Unknown', 'W1', 'W2', 'W3', 'beat_OOJ']

for column in columns:
    if column not in df.columns:
        df[column] = 0
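The same back-filling can be sketched with a single `reindex` call, which avoids inserting columns one at a time. This is an illustration on a toy frame: `demo` and `expected_columns` are stand-ins for `df` and `columns`. Unlike the loop, `reindex(columns=...)` also drops any column that is not in the list, so it would fit the final feature selection better than the intermediate `df`.

```python
import pandas as pd

# Illustrative stand-ins for `df` and the `columns` list above.
demo = pd.DataFrame({'East': [1.0, 0.0], 'Male': [0.0, 1.0], 'extra': ['a', 'b']})
expected_columns = ['East', 'North', 'Male', 'Female']

# Missing columns are created and filled with 0; the result follows the
# order of `expected_columns`, and columns outside the list are dropped.
aligned = demo.reindex(columns=expected_columns, fill_value=0)
print(aligned)
```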

The final_reports.csv file is overwritten every time this notebook runs. If you need the previous version, save a copy of it before running the next cell.

In [108]:
#save as final reports table (this table is rewritten every time the code runs)
#note: compression='gzip' writes gzip-compressed data even though the file name ends in .csv
df.to_csv("final_reports.csv", compression='gzip')
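One way to keep the previous output is sketched below, under the assumption that a timestamped copy in the same directory is acceptable. The helper name `backup_if_exists` and the naming scheme are hypothetical and not part of this notebook.

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_if_exists(path):
    """Copy `path` to a timestamped sibling file; return the copy or None."""
    src = Path(path)
    if not src.exists():
        return None
    stamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    dst = src.with_name(f'{src.stem}_{stamp}{src.suffix}')
    shutil.copy2(src, dst)
    return dst

# Call before df.to_csv(...) so the previous run's output is preserved.
backup_path = backup_if_exists('final_reports.csv')
```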

In [117]:
final_features = df[['reporting_event_number','report_id','victim_age_1', 'subject_age_1','corpus', 'East', 'North', 'precinct_OOJ', 'South', 'Southwest', 'West', 'precinct_Unknown', 'Female',
 'Gender Diverse (gender non-conforming and/or transgender)', 'Male', 'Vic_Gender_Unknown',
 'American Indian or Alaska Native', 'Asian', 'Black or African American', 'Native Hawaiian or Other Pacific Islander',
 'Vic_Race_Unknown', 'White', 'Hispanic Or Latino', 'Not Hispanic Or Latino', 'Vic_Ethni_Unknown',
 'subject_American Indian or Alaska Native', 'subject_Asian', 'subject_Black or African American',
 'subject_Native Hawaiian or Other Pacific Islander', 'subject_Sub_Race_Unknown', 'subject_White',
 'subject_Female', 'subject_Gender Diverse (gender non-conforming and/or transgender)',
 'subject_Male', 'subject_Sub_Gender_Unknown', 'subject_Hispanic Or Latino', 'subject_Not Hispanic Or Latino',
 'subject_Sub_Ethni_Unknown', 'B1', 'B2', 'B3', 'C1', 'C2', 'C3', 'D1', 'D2', 'D3', 'E1', 'E2', 'E3', 'F1', 'F2',
 'F3', 'G1', 'G2', 'G3', 'H1', 'H2', 'H3', 'J1', 'J2', 'J3', 'K1', 'K2', 'K3', 'L1', 'L2', 'L3', 'M1', 'M2', 'M3',
 'N1', 'N2', 'N3', 'O1', 'O2', 'O3', 'Q1', 'Q2', 'Q3', 'R1', 'R2', 'R3', 'S1', 'S2', 'S3', 'U1', 'U2', 'U3',
 'beat_Unknown', 'W1', 'W2', 'W3', 'beat_OOJ']]

### NLP: Vectorizer

In [118]:
#read in trained vectorizer from training feature engineering
import joblib

cv = joblib.load('countvectorizer.pkl')

In [119]:
vectorized_corpus = cv.transform(final_features['corpus'])

In [120]:
#keep processed reporting_event_number and report_ids
rens = final_features['reporting_event_number']
r_ids = final_features['report_id']

#to avoid error with hstack, make all columns numeric
final_features = final_features.drop(['corpus', 'reporting_event_number', 'report_id'], axis=1).apply(pd.to_numeric)

In [121]:
#Merge sparse matrix back with other features
from scipy.sparse import hstack

final_df = hstack([vectorized_corpus, final_features.values])
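`hstack` appends the dense feature columns to the right of the vectorized corpus, so the column order of the result is vocabulary first, then the engineered features. This is why `names` is concatenated after `words` when the columns are renamed later. A small sketch with made-up values:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Made-up stand-ins: 2 reports, 3 vocabulary columns, 2 engineered features.
text_part = csr_matrix(np.array([[1, 0, 2],
                                 [0, 1, 0]]))
dense_part = np.array([[34.0, 1.0],
                       [27.0, 0.0]])

# The result is sparse; row counts must match, columns are laid out left to right.
combined = hstack([text_part, dense_part])

print(combined.shape)
print(combined.toarray())
```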

In [122]:
words = cv.get_feature_names_out()

In [123]:
final_features = pd.DataFrame.sparse.from_spmatrix(final_df)

In [124]:
#Re-attach rens and r_ids by position (the rebuilt frame has a fresh 0..n-1 index)
final_features['reporting_event_number'] = rens.to_numpy()
final_features['report_id'] = r_ids.to_numpy()

In [125]:
names = ['victim_age_1', 'subject_age_1','East', 'North', 'precinct_OOJ', 'South', 'Southwest', 'West',
 'precinct_Unknown', 'Female', 'Gender Diverse (gender non-conforming and/or transgender)', 'Male', 'Vic_Gender_Unknown',
 'American Indian or Alaska Native', 'Asian', 'Black or African American', 'Native Hawaiian or Other Pacific Islander',
 'Vic_Race_Unknown', 'White', 'Hispanic Or Latino', 'Not Hispanic Or Latino', 'Vic_Ethni_Unknown',
 'subject_American Indian or Alaska Native', 'subject_Asian', 'subject_Black or African American',
 'subject_Native Hawaiian or Other Pacific Islander', 'subject_Sub_Race_Unknown', 'subject_White',
 'subject_Female', 'subject_Gender Diverse (gender non-conforming and/or transgender)',
 'subject_Male', 'subject_Sub_Gender_Unknown', 'subject_Hispanic Or Latino', 'subject_Not Hispanic Or Latino',
 'subject_Sub_Ethni_Unknown', 'B1', 'B2', 'B3', 'C1', 'C2', 'C3', 'D1', 'D2', 'D3', 'E1', 'E2', 'E3', 'F1', 'F2',
 'F3', 'G1', 'G2', 'G3', 'H1', 'H2', 'H3', 'J1', 'J2', 'J3', 'K1', 'K2', 'K3', 'L1', 'L2', 'L3', 'M1', 'M2', 'M3',
 'N1', 'N2', 'N3', 'O1', 'O2', 'O3', 'Q1', 'Q2', 'Q3', 'R1', 'R2', 'R3', 'S1', 'S2', 'S3', 'U1', 'U2', 'U3',
 'beat_Unknown', 'W1', 'W2', 'W3', 'beat_OOJ', 'reporting_event_number','report_id']

col_names = np.concatenate((words, names))

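#map the positional column labels to vocabulary words and feature names
#(inplace=True was previously passed to dict() by mistake, adding a stray 'inplace' key to the mapping)
final_features = final_features.rename(columns=dict(zip(final_features.columns, col_names)))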

In [126]:
#save report_id and reporting_event_number to identify positive predictions for tasks
final_features[['report_id', 'reporting_event_number']].to_csv('inference_reports.csv', index=False)

In [127]:
#Save without report_id and no index or header for xgboost
final_features = final_features.drop(['report_id', 'reporting_event_number'], axis = 1)
final_features.to_csv('inference_features.csv', index=False, header=False)
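Because `inference_features.csv` has no header or index, predictions can only be matched to reports by row position, which is why `inference_reports.csv` is written in the same row order. A minimal sketch of that join, assuming a hypothetical array `preds` of model scores and a 0.5 threshold; neither comes from this notebook.

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_csv('inference_reports.csv'); the IDs are made up.
reports = pd.DataFrame({'report_id': [101, 102, 103],
                        'reporting_event_number': ['A-1', 'A-2', 'A-3']})

# Hypothetical model output, one score per row of inference_features.csv.
preds = np.array([0.12, 0.91, 0.40])

# Row i of the feature file corresponds to row i of the reports file.
assert len(preds) == len(reports), 'row counts must match'
reports['score'] = preds
positives = reports[reports['score'] >= 0.5]
print(positives)
```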