### Predictive Modeling
In this notebook, we explore various models to predict crime type based on various time and location predictors. 

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier

In [61]:
def multi_performance(y_true, y_pred, classes, y_score=None):
    """
    Returns a dataframe that summarizes performance metrics (accuracy, f1, precision, recall)
    for each class and the overall score (which returns a weighted average of all the scores 
    based on the number of true instances)

    y_true: np.array(n, ); true class labels
    y_pred: np.array(n, ); predicted class labels
    classes: list of class labels (from label_encoder.classes_ for example)
    y_score: currently not used;
    """
    
    # Add overall to classes to create list of columns
    column_list = np.concatenate([classes, ['Overall']])

    result = np.empty(shape=(4, len(classes) + 1), dtype=np.dtype(object))
    # F1 scores
    result[0] = np.concatenate([f1_score(y_true, y_pred, average=None),
                               [f1_score(y_true, y_pred, average='micro')]])
    # Precision
    result[1] = np.concatenate([precision_score(y_true, y_pred, average=None),
                               [precision_score(y_true, y_pred, average='micro')]])
    # Recall
    result[2] = np.concatenate([precision_score(y_true, y_pred, average=None),
                               [precision_score(y_true, y_pred, average='micro')]])
    # Accuracy
    filler_list = [''] * len(classes)
    filler_list.append(accuracy_score(y_true, y_pred))

    result[3] = filler_list
                               
    # Convert result to pandas df
    df = pd.DataFrame(result, columns=column_list, index=['f1', 'precision', 'recall', 'accuracy'])

    return df

### Process Data
Since the data has many categorical features, we need to encode them in order to use sci-kit learn implementations of the different models. Options that we will explore are:

1. One-hot-encoding. Covert each level of the categorical feature into binary indicator variables. The problem with one-hot-encoding is that it can dramatically increase the dimensionality of the data which will increase the computational cost of training and increases the overfitting risk (increases model variance)

2. Ordinal encoding. Assign each level of the categorical feature an integer. While this does not increase the dimensionality of the data, it can introduce bias since the model can interpret the variables based on its magnitude while in reality the numerical values were arbitrarily assigned. This can be less of an issue with tree-based methods, however.

In [2]:
# Read csv data
df = pd.read_csv('data/clean_data.csv')
# Drop un-needed or already processed columns
df = df.drop(columns=['OBJECTID', 'OCC_HOUR', 'OCC_DATE', 'dayofweek'])

In [3]:
## Get one-hot-encoded data
df1 = pd.get_dummies(df[['NEIGHBOURHOOD_158', 'LOCATION_TYPE']])

# Combine with original data
df_one_hot = pd.concat([df, df1], axis=1)

X_one_hot = df_one_hot.drop(columns=['MCI_CATEGORY', 'NEIGHBOURHOOD_158', 'LOCATION_TYPE'])

In [4]:
## Get ordinal encoded data
ordinal_encoder = OrdinalEncoder()
df_ordinal = df.copy() # write on top of a copy of df
df_ordinal.loc[:, ['NEIGHBOURHOOD_158', 'LOCATION_TYPE']] = ordinal_encoder.fit_transform(df[['NEIGHBOURHOOD_158', 'LOCATION_TYPE']]) # create labels

X_ordinal = df_ordinal.drop(columns=['MCI_CATEGORY', 'NEIGHBOURHOOD_158', 'LOCATION_TYPE'])

In [10]:
## Process y with one-hot encoding
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['MCI_CATEGORY'])

# Encoded classes
classes = label_encoder.classes_
classes

array(['Assault', 'Auto Theft', 'Break and Enter', 'Robbery',
       'Theft Over'], dtype=object)

### Random Forest
Random Forest is capable of handling mixed data types, performs automatic feature selection, is robust to outliers, and discovers non-linear relationships. Since Random Forest averages many different decision trees, it is also not prone to overfitting and has low variance in the bias-variance tradeoff. RF is also useful because it discovers feature importance scores, although this needs to be carefully interpreted if there are colinear variables or if variables are of high cardinality.


In [11]:
# Train-test split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_one_hot, y, test_size=0.3, random_state=0)

# Fit model
rf = RandomForestClassifier()
rf.fit(X_train1, y_train1)

# Get predictions
y_pred1 = rf.predict(X_test1)

In [62]:
# Evaluate performance
multi_performance(y_true=y_test1, y_pred=y_pred1, classes=classes)

Unnamed: 0,Assault,Auto Theft,Break and Enter,Robbery,Theft Over,Overall
f1,0.757149,0.57925,0.551668,0.466324,0.080172,0.655847
precision,0.706495,0.598515,0.591509,0.553245,0.165517,0.655847
recall,0.706495,0.598515,0.591509,0.553245,0.165517,0.655847
accuracy,,,,,,0.655847


In [20]:
# Review class frequencies in the data
df.MCI_CATEGORY.value_counts() / df.MCI_CATEGORY.value_counts().sum()

Assault            0.534758
Break and Enter    0.193989
Auto Theft         0.142905
Robbery            0.095363
Theft Over         0.032985
Name: MCI_CATEGORY, dtype: float64