# Detecting NYC Traffic Anomalies: Modeling Notebook

In this notebook I will walk through the modeling process I used on the previously cleaned NYC traffic dataset. If you want to follow along and haven't cleaned the data yet, please go back and work through the EDA notebook first. Here we fit our models to only one sensor, because the goal is to settle on a final model that we can use on the rest of the sensors. By the end of this notebook we will have the final model, which we will then train on the rest of the sensors in the next notebook.

<b>Note: An ARIMA model was tried, but it consumed too much memory and kept crashing my kernel. Consequently, all trained models are variations of neural nets.</b>

## Libraries Needed

In [1]:
#import required libraries
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
from keras.losses import mean_squared_error
import keras
import tensorflow as tf
import pickle

## Custom Functions Needed 

In [2]:
def standardize_df(dataframe, column, X_steps, y_steps, num_train_sets):
    """
    This function standardizes the data of a column.
    
    dataframe: dataframe to standardize
    column: the column in the dataframe to standardize
    X_steps: the number of rows in each input window
    y_steps: the number of rows to predict after each input window
    num_train_sets: the number of training sets there are for this dataframe
    """
    #import needed libraries
    from sklearn.preprocessing import StandardScaler
    
    #slice the dataframe down to the column to be scaled
    df = pd.DataFrame(dataframe[column])
    
    # initialize the standard scaler
    scaler = StandardScaler()
    #train the standard scaler on the rows that will be in the training set
    scaler = scaler.fit(df.iloc[:(num_train_sets * (X_steps + y_steps))])
    #save the mean used to standardize the dataframe (taken from the fitted scaler)
    mean = scaler.mean_[0]
    #save the standard deviation used to standardize the dataframe
    #(scaler.scale_ is the population std, ddof=0, which is what transform() divides by)
    std = scaler.scale_[0]
    
    # standardize the dataframe
    df = pd.DataFrame(scaler.transform(df), columns=[column])
    
    # return the standardized dataframe with the mean and standard deviation used to standardize
    return df, mean, std
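
Because the function returns the mean and standard deviation alongside the scaled data, predictions made in standardized units can be mapped back to the original units later. Here is a minimal sketch of that round trip. The helper name `unstandardize` and the sample speeds are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def unstandardize(values, mean, std):
    """Map standardized values back to the original units."""
    return np.asarray(values) * std + mean

# made-up speeds, used only to illustrate the round trip
speeds = np.array([42.0, 45.5, 39.0, 50.0, 47.5])
mean, std = speeds.mean(), speeds.std()  # population std (ddof=0), like StandardScaler

scaled = (speeds - mean) / std
restored = unstandardize(scaled, mean, std)
```

The same mean and standard deviation must be reused for the validation and test data, which is why they are computed from the training rows only.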

In [3]:
def train_val_test_dfs(dataframe, column, X_steps=50, y_steps=1, val_size=0.2, test_size=0.1, standardize=False):
    """
    This function splits a dataframe into training, testing and
    validation sets.
    
    dataframe: dataframe to split
    column: string of the column in the dataframe to use since our application is a univariate timeseries
    X_steps: integer of the number of rows in each input window
    y_steps: integer of the number of rows to predict after each input window
    val_size: float less than 1.0 giving the fraction of the sets to use for validation
    test_size: float less than 1.0 giving the fraction of the sets to use for testing
    standardize: boolean indicating whether or not to standardize the sets
    """
    
    # calculate the number of sets that are possible given the amount of steps to train and predict on
    num_sets = len(dataframe) // (X_steps + y_steps)
    # offset the start of the windows by the leftover rows so the sets lean toward the most recent data;
    # clamp at 0 because an evenly divisible length would otherwise give -1 and produce empty slices
    offset = max((len(dataframe) % (X_steps + y_steps)) - 1, 0)
    # calculate the number of validation sets
    num_val_sets = int(num_sets * val_size)
    # calculate the number of test sets
    num_test_sets = int(num_sets * test_size)
    # calculate the number of training sets
    num_train_sets = num_sets - num_val_sets - num_test_sets
    
    # if the dataframe needs to be standardized then standardize it. Otherwise pass on the dataframe given
    if standardize:
        df, mean, std = standardize_df(dataframe, column, X_steps, y_steps, num_train_sets)
    else:
        df = dataframe.copy()
        
    # instantiate empty lists for each type of set
    X_train, y_train = [], []
    X_val, y_val = [], []
    X_test, y_test = [], []
    
    # loop the number of sets the dataframe can produce
    for i in range(num_sets):
        # if i is less than the number of training sets to be produced make another training set
        if i < num_train_sets:
            X_train.append(np.array(df[column].iloc[(i*X_steps) + offset: (i*X_steps) + X_steps + offset])[:, np.newaxis])
            y_train.append(np.array(df[column].iloc[(i*X_steps) + X_steps + offset: (i*X_steps) + X_steps + offset + y_steps])[:, np.newaxis])
        # else if i is less than the number of validation sets to be produced make another validation set
        elif i < (num_train_sets + num_val_sets):
            X_val.append(np.array(df[column].iloc[(i*X_steps) + offset: (i*X_steps) + X_steps + offset])[:, np.newaxis])
            y_val.append(np.array(df[column].iloc[(i*X_steps) + X_steps + offset: (i*X_steps) + X_steps + offset + y_steps])[:, np.newaxis])
        # else make a testing set
        else:
            X_test.append(np.array(df[column].iloc[(i*X_steps) + offset: (i*X_steps) + X_steps + offset])[:, np.newaxis])
            y_test.append(np.array(df[column].iloc[(i*X_steps) + X_steps + offset: (i*X_steps) + X_steps + offset + y_steps])[:, np.newaxis])
    # turn the lists into arrays so keras can process each set
    if standardize:
        return np.array(X_train), np.array(y_train), np.array(X_val), np.array(y_val), np.array(X_test), np.array(y_test), mean, std
    return np.array(X_train), np.array(y_train), np.array(X_val), np.array(y_val), np.array(X_test), np.array(y_test)
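
The arrays returned above have shape `(samples, X_steps, 1)` for the inputs and `(samples, y_steps, 1)` for the targets. To see how such windows are cut from a series, here is a stripped-down sketch on toy data. `make_windows` and its `stride` parameter are hypothetical and are not the notebook's function. Setting `stride = X_steps + y_steps` gives windows that do not overlap.

```python
import numpy as np

def make_windows(series, X_steps, y_steps, stride):
    """Slice a 1-D series into (input, target) windows.

    Hypothetical helper for illustration; it is not the notebook's function.
    """
    X, y = [], []
    last_start = len(series) - (X_steps + y_steps)
    for start in range(0, last_start + 1, stride):
        X.append(series[start:start + X_steps])
        y.append(series[start + X_steps:start + X_steps + y_steps])
    # add a trailing feature axis so the shapes are (samples, steps, 1)
    return np.array(X)[..., np.newaxis], np.array(y)[..., np.newaxis]

series = np.arange(30, dtype=float)  # toy data: 0, 1, ..., 29
X, y = make_windows(series, X_steps=6, y_steps=3, stride=9)
```

Splitting the resulting windows in chronological order (train first, then validation, then test) keeps future data out of the training set.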

---

## Modeling

Import the data that we will use.

In [4]:
with open('sensor_dfs.pickle', 'rb') as handle:
    sensor_dfs = pickle.load(handle)

First, let's establish how many data points we want to use to predict and how far out we want to predict. We are going to use 6 steps, which is 30 minutes of 5-minute data, as input, and we will forecast 3 steps, which is 15 minutes of 5-minute data.

In [5]:
X_steps = 6
y_steps = 3

Let's randomly choose a sensor to model.

In [6]:
import random
random.seed(28)
random.choice(list(sensor_dfs.keys()))

264

In [7]:
#split and standardize our data
X_train, y_train, X_val, y_val, X_test, y_test, mean, std = train_val_test_dfs(dataframe = sensor_dfs[264], column = 'SPEED', X_steps = X_steps, y_steps = y_steps, standardize = True)

### Baseline

Our baseline simply reuses the last `y_steps` observed values as the prediction for the next `y_steps` values. For all of our models we are going to use mean squared error as the loss. We want to reduce the amount our predictions are off by while punishing large errors, and squaring the error does exactly that: the model is penalized more heavily when it misses by a large amount. Large misses matter because we want to detect anomalies. The fewer normal readings we miss by a lot, the more accurately we can flag anomalies later.
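
A naive forecast can be built in more than one way. The sketch below contrasts reusing the last `y_steps` inputs (what the cell below does) with repeating only the final input. It uses random toy data rather than the sensor data, and the array names and the `mse` helper are assumptions made for illustration.

```python
import numpy as np

# toy standardized windows, shape (samples, steps, 1); values are made up
rng = np.random.default_rng(0)
X_steps, y_steps = 6, 3
X_toy = rng.normal(size=(100, X_steps, 1))
y_toy = rng.normal(size=(100, y_steps, 1))

# variant 1: reuse the last y_steps observations as the forecast
pred_window = X_toy[:, -y_steps:]

# variant 2: repeat only the final observation y_steps times
pred_last = np.repeat(X_toy[:, -1:], y_steps, axis=1)

def mse(y_true, y_pred):
    """Mean squared error over every sample and step."""
    return float(np.mean((y_true - y_pred) ** 2))

mse_window = mse(y_toy, pred_window)
mse_last = mse(y_toy, pred_last)
```

On real traffic data the two variants can score differently, so it may be worth evaluating both before settling on a baseline number.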

In [8]:
y_pred = X_val[:, -y_steps:]

np.mean(mean_squared_error(y_val, y_pred))

0.21101815302622678

As you can see, it didn't perform that well. An MSE of about 0.21 on standardized data corresponds to a root mean squared error of roughly 0.46 standard deviations per predicted step.
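
Because the data were standardized, the MSE is measured in squared standard deviations. Taking the square root gives a typical error in standard deviations, and multiplying by the training standard deviation (the `std` returned by `train_val_test_dfs`) converts it back to the units of `SPEED`. In this small sketch, `std_assumed` is a placeholder, not the sensor's actual standard deviation.

```python
import math

mse_standardized = 0.21101815302622678  # baseline validation MSE reported above
rmse_standardized = math.sqrt(mse_standardized)  # typical error in standard deviations

# hypothetical standard deviation of SPEED for this sensor, for illustration only
std_assumed = 10.0
rmse_original_units = rmse_standardized * std_assumed
```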

### Vanilla Neural Net

This model is as simple as it gets for a neural net: an input layer fully connected to an output layer. We will run this model for 5 epochs and use mean squared error for our loss function.
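
With no hidden layer and no activation on the output, this network computes a single affine map from the flattened inputs to the outputs, so it has the same form as a multi-output linear regression. Here is a minimal NumPy sketch of that forward pass. Random weights stand in for trained ones, and all names here are illustrative.

```python
import numpy as np

X_steps, y_steps = 6, 3
rng = np.random.default_rng(0)

# randomly initialised parameters standing in for the trained Dense layer
W = rng.normal(size=(X_steps, y_steps))
b = np.zeros(y_steps)

def forward(X):
    """Flatten (samples, X_steps, 1) inputs, then apply the linear Dense layer."""
    flat = X.reshape(len(X), -1)
    return flat @ W + b

X_toy = rng.normal(size=(4, X_steps, 1))  # four made-up input windows
preds = forward(X_toy)
n_params = W.size + b.size  # weights plus biases
```

The parameter count is `X_steps * y_steps + y_steps`, which is small enough that this model mostly serves as a sanity check for deeper architectures.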

In [9]:
model1 = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[X_steps, 1]),
    keras.layers.Dense(y_steps)
])

model1.compile(optimizer='Adam', loss=tf.keras.losses.MeanSquaredError())

model1.fit(X_train, y_train, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fa7a6696040>

In [10]:
model1.evaluate(X_val, y_val)



0.13528427481651306

We improved a significant amount on our baseline, but we are overfitting to our training set quite a bit.

### RNNs

The next two models use recurrent neural networks (RNNs) to see if we can improve our predictions. The first one is pretty simple: two RNN layers with 10 neurons each, followed by the output layer.
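
For readers new to recurrent layers, the core idea is that a hidden state is updated once per time step from the current input and the previous state. The sketch below is a plain NumPy version of that recurrence with a tanh activation, which is the default for Keras `SimpleRNN`. The weights are random and the function name is an assumption for illustration.

```python
import numpy as np

def simple_rnn_forward(x, W_x, W_h, b):
    """Run a single-layer tanh RNN over x of shape (steps, features).

    Returns the hidden state at every step, shape (steps, units).
    """
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:
        # new state mixes the current input with the previous state
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
units, features, steps = 10, 1, 6
W_x = rng.normal(scale=0.5, size=(features, units))
W_h = rng.normal(scale=0.5, size=(units, units))
b = np.zeros(units)

x = rng.normal(size=(steps, features))  # one made-up input window
states = simple_rnn_forward(x, W_x, W_h, b)
last_state = states[-1]  # what a layer without return_sequences would pass on
```

A layer with `return_sequences=True` passes all of `states` to the next layer, which is why every recurrent layer except the last one sets it.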

In [11]:
model2 = keras.models.Sequential([
    keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(10),
    keras.layers.Dense(y_steps, activation='relu')
])

model2.compile(optimizer='Adam', loss=tf.keras.losses.MeanSquaredError())

model2.fit(X_train, y_train, epochs = 10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fa677fccd30>

In [12]:
model2.evaluate(X_val, y_val)



0.501369833946228

This model is way worse than any of the previous models. We will try to improve upon this RNN in the next model, but RNNs might not be viable for this dataset.
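
One thing worth checking before ruling RNNs out: the output layer above uses a ReLU activation, which cannot produce negative values, while standardized targets are negative whenever the speed is below the training mean. The sketch below uses synthetic standard-normal targets to show that even a perfect predictor pushed through ReLU is left with a sizeable error. The normal distribution is an assumption, and the sensor's distribution will differ.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic standardized targets (mean 0, std 1); not the sensor data
y_true = rng.standard_normal(100_000)

# an "oracle" that knows the targets exactly, but must pass through ReLU
y_pred = np.maximum(y_true, 0.0)

# every negative target contributes its full squared value to the error
mse_floor = float(np.mean((y_true - y_pred) ** 2))
share_negative = float(np.mean(y_true < 0))
```

If this turns out to matter for the sensor data, rerunning the same architecture with a linear output layer would be a fairer test of the recurrent layers.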

In [13]:
model3 = keras.models.Sequential([
    keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 1]),
    keras.layers.Dropout(rate=0.1),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.SimpleRNN(30),
    keras.layers.Dense(y_steps, activation='relu')
])

model3.compile(optimizer='Adam', loss=tf.keras.losses.MeanSquaredError())

model3.fit(X_train, y_train, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fa7b430c9a0>

In [14]:
model3.evaluate(X_val, y_val)



0.5011864900588989

I tried various combinations to improve upon the previous model, but the improvements were only ever marginal, if there were any at all. I don't think RNNs are viable. Let's go back to fully connected neural nets and try to improve upon them.

### Fully Connected Neural Nets

This model adds six layers to our simple neural net from before. Three are dropout layers to help the model generalize. The other three are dense layers with varying numbers of neurons.
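
Dropout helps generalization by randomly zeroing a fraction of the activations during training and scaling the survivors by `1 / (1 - rate)` so the expected sum stays the same. At inference time the layer passes its input through unchanged. Here is a minimal NumPy sketch of that behavior. The function and array names are assumptions for illustration.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x  # at inference time the layer is an identity
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 40))  # made-up activations from a 40-unit layer

dropped = dropout(activations, rate=0.2, rng=rng)
fraction_zeroed = float(np.mean(dropped == 0.0))
unchanged = dropout(activations, rate=0.2, rng=rng, training=False)
```

One practical consequence is that the training loss is computed with dropout active while `evaluate` runs without it, so the two numbers are not directly comparable.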

In [15]:
model4 = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[X_steps, 1]),
    keras.layers.Dense(40),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(40),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(20),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(y_steps)
])

model4.compile(optimizer='Adam', loss=tf.keras.losses.MeanSquaredError())

model4.fit(X_train, y_train, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fa687eceb50>

In [16]:
model4.evaluate(X_val, y_val)



0.12884482741355896

This model performed the best so far, but there is overfitting occurring. Let's see if we can address that in the next model.

In [17]:
model5 = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[X_steps, 1]),
    keras.layers.Dense(200),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(20),
    keras.layers.Dense(y_steps)
])

model5.compile(optimizer='Adam', loss=tf.keras.losses.MeanSquaredError())

model5.fit(X_train, y_train, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fa78d8eb070>

In [18]:
model5.evaluate(X_val, y_val)



0.12480004876852036
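
For reference, the validation losses recorded in this notebook can be lined up side by side. The values below are copied from the evaluation cells above. The dictionary layout and variable names are just one convenient way to compare them.

```python
# validation MSE values copied from the evaluation cells in this notebook
val_mse = {
    "baseline": 0.21101815302622678,
    "model1": 0.13528427481651306,
    "model2": 0.501369833946228,
    "model3": 0.5011864900588989,
    "model4": 0.12884482741355896,
    "model5": 0.12480004876852036,
}

best_name = min(val_mse, key=val_mse.get)

# relative reduction in validation MSE compared with the baseline
improvement = {
    name: 1.0 - mse / val_mse["baseline"]
    for name, mse in val_mse.items()
    if name != "baseline"
}

# print the models from lowest to highest validation MSE
for name, mse in sorted(val_mse.items(), key=lambda item: item[1]):
    print(f"{name:<9} {mse:.4f}")
```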

Perfect! We are achieving a good MSE and our model isn't overfitting. Simplifying our model helped reduce the overfitting, which makes sense. We will use this for our final model. This concludes this notebook; we will pick up in the next notebook, Final Models, where I will walk through code that automatically fits a model to each of the sensors.