## San Francisco Crime Prediction
### Machine Learning I - ZHAW SoE - DS21a

Gérôme Meyer, Lea Keller, Alessio Drigatti

For this assignment we joined a Kaggle challenge from 2014. Kaggle provides a dataset with nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Each record contains the exact time, the category of the incident, a description of the incident, the day of the week, the name of the police department district, how the incident was resolved, the approximate street address, and the geographical longitude and latitude.
The goal is to predict the category of an incident, given its time and location.

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt

## Data exploration

In [None]:
data = pd.read_csv('../data/train.csv')

# There seems to be invalid data that contains latitude = 90. (Which would be the North Pole)
data = data[data['Y'] != 90]
data = data[data['Category'] != 'NONE']

# Standardize the coordinates for plotting
x = np.asarray(data['X']).reshape((-1, 1))
y = np.asarray(data['Y']).reshape((-1, 1))
data['X'] = StandardScaler().fit_transform(x)
data['Y'] = StandardScaler().fit_transform(y)

sns.set(rc={'figure.figsize': (20, 15)})
sns.scatterplot(data=data, x='Y', y='X', hue='Category')
plt.show()

## Data cleaning and feature engineering
First, we imported the data and removed the rows with an invalid latitude. Then we dropped the variables that are not part of the test data.
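
One way to check which variables are missing from the test data is to compare the column sets of the two files. The sketch below uses two tiny hand-made frames in place of `train.csv` and `test.csv`; the column names follow the files used in this notebook, while the row values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (row values are invented).
train = pd.DataFrame({
    "Dates": ["2015-05-13 23:53:00"],
    "Category": ["WARRANTS"],
    "Descript": ["WARRANT ARREST"],
    "DayOfWeek": ["Wednesday"],
    "PdDistrict": ["NORTHERN"],
    "Resolution": ["ARREST, BOOKED"],
    "Address": ["OAK ST / LAGUNA ST"],
    "X": [-122.4258],
    "Y": [37.7745],
})
test = pd.DataFrame({
    "Id": [0],
    "Dates": ["2015-05-10 23:59:00"],
    "DayOfWeek": ["Sunday"],
    "PdDistrict": ["BAYVIEW"],
    "Address": ["2000 Block of THOMAS AV"],
    "X": [-122.3996],
    "Y": [37.7351],
})

# Columns that exist only in the training data cannot be used as features.
train_only = sorted(set(train.columns) - set(test.columns))
print(train_only)
```

Of these columns, `Category` is the target, so it is kept separately as `y_train` rather than discarded.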

In [None]:
data = pd.read_csv('../data/train.csv')

# There seems to be invalid data that contains latitude = 90. (Which would be the North Pole)
data = data[data['Y'] != 90]

In [None]:
y_train = data['Category']

# The description and resolution are not part of the test data, therefore we cannot use them for training.
# And we need to drop the category variable since it is our target variable and also not part of the test data.
x_train = data.drop(['Descript', 'Category', 'Resolution'], axis=1)

Then we separated the date information into the subcomponents year, month, day, hour and minute and saved them as floats. We added the feature "federal_holiday" and set its value based on the date. We also added two further features, "isday" and "IsWeekend". Whether it is day or night is derived from the hour, and whether it is a weekday or the weekend is derived from the day of the week.
After the new features were added, the full date is no longer needed, so we dropped it.
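
The same decomposition can also be sketched with pandas' datetime accessor instead of a regular expression. This is an alternative under assumptions, not the notebook's own method: the timestamp format is assumed to be `YYYY-MM-DD HH:MM:SS`, and the two example timestamps are invented.

```python
import pandas as pd

# Toy frame; the timestamp format is assumed to match the "Dates" column.
df = pd.DataFrame({"Dates": ["2015-05-13 23:53:00", "2015-05-16 09:05:00"]})

dates = pd.to_datetime(df["Dates"], format="%Y-%m-%d %H:%M:%S")

# Date subcomponents, stored as floats as described in the text
df["year"] = dates.dt.year.astype("float64")
df["month"] = dates.dt.month.astype("float64")
df["day"] = dates.dt.day.astype("float64")
df["hour"] = dates.dt.hour.astype("float64")
df["minute"] = dates.dt.minute.astype("float64")

# Day means hours 8 through 19 (inclusive), everything else is night
df["isday"] = dates.dt.hour.between(8, 19).astype(int)

# dayofweek counts Monday as 0, so 5 and 6 are Saturday and Sunday
df["IsWeekend"] = (dates.dt.dayofweek >= 5).astype(int)

print(df[["hour", "isday", "IsWeekend"]])
```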

In [None]:
# Split the date into its subcomponents
x_train[['year', 'month', 'day', 'hour', 'minute']] = x_train.Dates.str.extract(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2}) (?P<hour>\d{2}):(?P<minute>\d{2})')

# Convert the date components from strings into floats
x_train = x_train.astype({
    'year': 'float64',
    'month': 'float64',
    'day': 'float64',
    'hour': 'float64',
    'minute': 'float64',
})

# Mark holidays by their calendar date. Thanksgiving falls on a different date each year,
# so the fixed date 11-24 is only an approximation. December 26 and December 31 are
# grouped with the neighbouring holiday.
x_train["federal_holiday"] = "None"
x_train.loc[x_train['Dates'].str.match(r'\d{4}-01-01'), 'federal_holiday'] = 'new_year'
x_train.loc[x_train['Dates'].str.match(r'\d{4}-07-04'), 'federal_holiday'] = 'independence_day'
x_train.loc[x_train['Dates'].str.match(r'\d{4}-11-24'), 'federal_holiday'] = 'thanksgiving'
x_train.loc[x_train['Dates'].str.match(r'\d{4}-12-25'), 'federal_holiday'] = 'christmas_day'
x_train.loc[x_train['Dates'].str.match(r'\d{4}-12-26'), 'federal_holiday'] = 'christmas_day'
x_train.loc[x_train['Dates'].str.match(r'\d{4}-12-31'), 'federal_holiday'] = 'new_year'

# Hours 8 to 19 count as day, everything else as night
x_train.loc[x_train['hour'] <= 7, 'isday'] = 0
x_train.loc[x_train['hour'] > 7, 'isday'] = 1
x_train.loc[x_train['hour'] > 19, 'isday'] = 0

# Saturday and Sunday count as weekend
x_train["IsWeekend"] = 0
x_train.loc[x_train['DayOfWeek'] == 'Sunday', 'IsWeekend'] = 1
x_train.loc[x_train['DayOfWeek'] == 'Saturday', 'IsWeekend'] = 1

x_train.drop(['Dates'], inplace=True, axis=1)
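
The fixed date used for Thanksgiving above only matches the real holiday in some years. As a hedged alternative, pandas ships a `USFederalHolidayCalendar` that computes the observed federal holidays for a date range. The sketch below shows how such a calendar could be used to flag holidays; it is not what the notebook does, and the two timestamps are invented for illustration.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Two invented timestamps: Thanksgiving 2014 and the Monday before it.
dates = pd.to_datetime(pd.Series(["2014-11-27 18:00:00", "2014-11-24 18:00:00"]))

# Strip the time of day so that the dates can be compared with the calendar.
days = dates.dt.normalize()

# All federal holidays between the first and the last day in the data.
calendar = USFederalHolidayCalendar()
holidays = calendar.holidays(start=days.min(), end=days.max())

is_holiday = days.isin(holidays)
print(is_holiday.tolist())
```

Note that this calendar only covers official federal holidays, so days such as December 26 or December 31 would still need the manual rules used above.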

# TODO Gérôme

In [None]:
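# One-hot encode the categorical columns and standardize the numerical columns.
# federal_holiday contains strings, so it is one-hot encoded as well.
# The remaining columns (e.g. Address) are passed through unchanged.
column_transformer = make_column_transformer(
    (OneHotEncoder(), ['DayOfWeek', 'PdDistrict', 'federal_holiday']),
    (StandardScaler(), ['X', 'Y', 'year', 'month', 'day', 'hour', 'minute']),
    remainder='passthrough'
)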

x_train_transformed = column_transformer.fit_transform(x_train)
x_train = pd.DataFrame(data=x_train_transformed, columns=column_transformer.get_feature_names_out())

x_train = x_train.rename(columns={element: re.sub(r'^(.+)__', '', element) for element in x_train.columns.tolist()})

In [None]:
x_train.to_csv('../data/x_train_cleaned.csv', index=False)
y_train.to_csv('../data/y_train_cleaned.csv', index=False)

x_train.head(10)

## Model Implementation

In order to try and predict the category we trained a fully-connected sequential Neural Network on the training data from Kaggle.


Architecture:
- Input Layer with a Neuron for each feature
- Hidden Layer with **2560** Neurons
- Hidden Layer with **780** Neurons
- Output Layer with **39** Categories

The output of the hidden Layers is normalized with a "[BatchNormalization](https://keras.io/api/layers/normalization_layers/batch_normalization/)" Layer.
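
To get a feeling for the size of this architecture, the number of parameters can be worked out by hand. The sketch below assumes the layer order listed above (each hidden layer followed by batch normalization) and uses a hypothetical input width of 30 features, since the real width depends on the cleaned data.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output neuron of a fully-connected layer."""
    return (n_in + 1) * n_out


def batchnorm_params(n_units):
    """Scale and shift (trainable) plus moving mean and variance (not trainable)."""
    return 4 * n_units


def total_params(n_features):
    """Parameter count for the 2560 -> 780 -> 39 architecture described above."""
    return (
        dense_params(n_features, 2560) + batchnorm_params(2560)
        + dense_params(2560, 780) + batchnorm_params(780)
        + dense_params(780, 39)
    )


# 30 input features is a hypothetical value used only for illustration.
print(total_params(30))
```

Almost all parameters sit in the connection between the two hidden layers, so changing the width of those layers has the largest effect on training time.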


Note: The code below contains a cell that can be used to work with only a fraction of the data, which is useful for trying out different architectures.

First we import the following libraries:
- matplotlib to plot the score of the model
- pandas to read and transform the data that we created earlier
- keras for the actual Neural Network
- sklearn (Scikit-Learn) for some more data preprocessing and splitting
- numpy for when we wish to reduce the amount of data

In [12]:
import matplotlib.pyplot as plt
import pandas as pd
from keras.layers import Input, Dense, BatchNormalization, Dropout
from keras.models import Sequential
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np

Next we read the data from the files created earlier.

In [19]:
x_clean = pd.read_csv('../data/x_train_cleaned.csv')
y_clean = pd.read_csv('../data/y_train_cleaned.csv')

For architecture experimentation you can run this cell with a custom "data_fraction".
E.g. with data_fraction = 2, you would work with only half of the data, therefore reducing training time.

In [20]:
data_fraction = 20
# Keep a random 1/data_fraction of the rows (the same rows in x and y)
keep_indices = np.random.choice(x_clean.index, int(len(x_clean) / data_fraction), replace=False)
x_clean = x_clean.loc[keep_indices]
y_clean = y_clean.loc[keep_indices]
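
As an alternative sketch, `DataFrame.sample` can draw the subset with a fixed seed, which makes architecture experiments repeatable. The frames and the seed below are invented for illustration; only the name `data_fraction` is taken from the cell above.

```python
import pandas as pd

# Toy stand-ins for x_clean and y_clean (values are invented).
x = pd.DataFrame({"hour": range(100)})
y = pd.DataFrame({"Category": ["A", "B"] * 50})

data_fraction = 20

# Keep 1/data_fraction of the rows; random_state makes the draw repeatable.
x_small = x.sample(frac=1 / data_fraction, random_state=42)

# Select the matching labels through the shared index.
y_small = y.loc[x_small.index]

print(len(x_small), len(y_small))
```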

In [21]:
# TODO: Deal with the Address. Maybe we can use it in a useful manner
x_clean.drop(['Address'], inplace=True, axis=1)

Now we can split the data into training and test data. We need the test part of the data to evaluate how well our model deals with similar data that it hasn't seen yet.
We also encode the category names as numbers, since our model cannot be trained to predict strings.

In [22]:
label_encoder = LabelEncoder()
y_clean = label_encoder.fit_transform(y_clean['Category'])

x_train, x_test, y_train, y_test = train_test_split(x_clean, y_clean)
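
The following sketch illustrates what the label encoding does on a handful of invented labels. The `stratify` and `random_state` arguments are assumptions added for illustration and are not used in the cell above; stratifying keeps the share of each category the same in both parts, which can matter when some categories are rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Invented labels and features, only for illustration.
labels = np.array(["ASSAULT", "WARRANTS", "ARSON", "ASSAULT"] * 5)
features = np.arange(len(labels)).reshape(-1, 1)

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# The classes are sorted alphabetically, so ARSON=0, ASSAULT=1, WARRANTS=2.
print(list(encoder.classes_))

x_tr, x_te, y_tr, y_te = train_test_split(
    features, encoded, test_size=0.25, stratify=encoded, random_state=0
)

# inverse_transform maps the numbers back to the category names.
decoded = encoder.inverse_transform(y_te)
print(decoded)
```

Since the encoder sorts the classes alphabetically, `encoder.classes_` also gives the order in which the output neurons of the model correspond to the categories.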

### Model definition

Creation of the model and its layers, followed by training and evaluation.

In [None]:
model = Sequential(name='san_francisco_sequential')
# The input shape is a tuple with one entry: the number of features
model.add(Input(shape=(x_train.shape[1],)))

# First hidden layer
model.add(Dense(2560, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2))

# Second hidden layer
model.add(Dense(780, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2))

# Output Layer for 39 categories
model.add(Dense(39, activation='softmax'))

model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])

history = model.fit(
    x_train,
    y_train,
    batch_size=16,
    epochs=10,
    validation_split=.1
)

model.save('./small_model')

for j in list(history.history.keys()):
    plt.plot(history.history[j])
    plt.title(j + ' over epochs')
    plt.ylabel(j)
    plt.xlabel('Epochs')
    plt.show()

print(model.evaluate(x_test, y_test))


In [None]:
data = pd.read_csv('../data/test.csv')

# There seems to be invalid data that contains latitude = 90. (Which would be the North Pole)
# TODO: What are we supposed to do to get predictions for the data sets that do not have proper coordinates?
data = data[data['Y'] != 90]

# Standardize the coordinates for plotting
x = np.asarray(data['X']).reshape((-1, 1))
y = np.asarray(data['Y']).reshape((-1, 1))
data['X'] = StandardScaler().fit_transform(x)
data['Y'] = StandardScaler().fit_transform(y)

print(data['Y'].max())

sns.set(rc={'figure.figsize': (20, 15)})
sns.scatterplot(data=data, x='Y', y='X')
plt.show()

In [None]:
data = pd.read_csv('../data/test.csv')

# There seems to be invalid data that contains latitude = 90. (Which would be the North Pole)
# These rows are kept here because the submission needs a prediction for every row.
# data = data[data['Y'] != 90]

# TODO: Extract this into a function. ###
# Split the date into its subcomponents
data[['year', 'month', 'day', 'hour', 'minute']] = data.Dates.str.extract(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2}) (?P<hour>\d{2}):(?P<minute>\d{2})')

# data["federal_holiday"] = "None"
# data.loc[data['Dates'].str.match('\d{4}-01-01'), 'federal_holiday'] = 'new_year'
# data.loc[data['Dates'].str.match('\d{4}-07-04'), 'federal_holiday'] = 'independence_day'
# data.loc[data['Dates'].str.match('\d{4}-11-24'), 'federal_holiday'] = 'thanksgiving'
# data.loc[data['Dates'].str.match('\d{4}-12-25'), 'federal_holiday'] = 'christmas_day'
# data.loc[data['Dates'].str.match('\d{4}-12-26'), 'federal_holiday'] = 'christmas_day'
# data.loc[data['Dates'].str.match('\d{4}-12-31'), 'federal_holiday'] = 'new_year'

data.drop(['Dates'], inplace=True, axis=1)

# Convert the date components from strings into floats
data = data.astype({
    'year': 'float64',
    'month': 'float64',
    'day': 'float64',
    'hour': 'float64',
    'minute': 'float64',
})

# One-hot encode the categorical columns and standardize the numerical columns
column_transformer = make_column_transformer(
    (OneHotEncoder(), ['DayOfWeek', 'PdDistrict']),
    (StandardScaler(), ['X', 'Y', 'year', 'month', 'day', 'hour', 'minute']),
    remainder='passthrough'
)

data_transformed = column_transformer.fit_transform(data)
data = pd.DataFrame(data=data_transformed, columns=column_transformer.get_feature_names_out())

data = data.rename(columns={element: re.sub(r'^(.+)__', '', element) for element in data.columns.tolist()})

data.to_csv('../data/test_cleaned.csv', index=False)

data

## Generate Kaggle submission

In [None]:
import keras
import pandas as pd

categories = ['ARSON', 'ASSAULT', 'BAD CHECKS', 'BRIBERY', 'BURGLARY', 'DISORDERLY CONDUCT',
              'DRIVING UNDER THE INFLUENCE', 'DRUG/NARCOTIC', 'DRUNKENNESS', 'EMBEZZLEMENT', 'EXTORTION',
              'FAMILY OFFENSES', 'FORGERY/COUNTERFEITING', 'FRAUD', 'GAMBLING', 'KIDNAPPING', 'LARCENY/THEFT',
              'LIQUOR LAWS', 'LOITERING', 'MISSING PERSON', 'NON-CRIMINAL', 'OTHER OFFENSES', 'PORNOGRAPHY/OBSCENE MAT',
              'PROSTITUTION', 'RECOVERED VEHICLE', 'ROBBERY', 'RUNAWAY', 'SECONDARY CODES', 'SEX OFFENSES FORCIBLE',
              'SEX OFFENSES NON FORCIBLE', 'STOLEN PROPERTY', 'SUICIDE', 'SUSPICIOUS OCC', 'TREA', 'TRESPASS',
              'VANDALISM', 'VEHICLE THEFT', 'WARRANTS', 'WEAPON LAWS']

model = keras.models.load_model('./small_model')
df = pd.read_csv('../data/test_cleaned.csv')

df.drop(['Address'], inplace=True, axis=1)
df.drop(['Id'], inplace=True, axis=1)

arr = df.to_numpy()

predictions = model.predict(arr)

# Set the largest probability in each row to 1 and all other entries to 0
for prediction in predictions:
    prediction_max = prediction.max()
    prediction[prediction != prediction_max] = 0
    prediction[prediction != 0] = 1

print(predictions.shape)

predictions = predictions.astype('int32')

df_predictions = pd.DataFrame(predictions, columns=categories)

df_predictions.to_csv('predictions.csv', index_label='Id')
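
The submission above contains hard 0/1 predictions. If the submission is scored with multi-class logarithmic loss (an assumption here, since the text above does not state the metric), a confident but wrong prediction is penalized very heavily, and submitting the predicted probabilities directly may score better. The sketch below compares both variants on invented probabilities; `log_loss` is a small helper written for this example.

```python
import numpy as np


def log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log probability assigned to the true class."""
    clipped = np.clip(probabilities, eps, 1 - eps)
    true_class_probs = clipped[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(true_class_probs))


# Invented model output for two incidents and three categories.
probs = np.array([[0.6, 0.3, 0.1],
                  [0.5, 0.4, 0.1]])
y_true = np.array([0, 1])

# Vectorized one-hot encoding of the most likely category per row.
one_hot = np.eye(probs.shape[1])[probs.argmax(axis=1)]

soft = log_loss(y_true, probs)
hard = log_loss(y_true, one_hot)
print(soft, hard)
```

In this toy case the second incident is predicted wrongly, so the one-hot variant assigns probability 0 to the true class and its loss becomes much larger than that of the probabilities.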