# Neural Networks for Sequences and Time Series

Let's try to use recurrent neural networks (RNNs) for the credit card fraud detection problem that we studied earlier. RNNs are typically better than CNNs at capturing the temporal nature of the data, and here we will be able to see how much better they are.

As before, this first cell sets up the notebook.

Additionally, we import two functions from the file `helpers.py`:

* `reshape_to_batches`
* `convert_3d_to_2d`

Both were defined in the previous notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
    
from sklearn.pipeline import Pipeline

import keras

In [None]:
from helpers import reshape_to_batches, convert_3d_to_2d

### Load the data

Load the `data/creditcard.parquet` using `pd.read_parquet()` into a DataFrame called `ccfd`.

In [None]:
# Your code here...
ccfd = pd.read_parquet('data/creditcard.parquet')
ccfd.head()


Use `sklearn`'s `train_test_split` to create a 70/30 train/test split, and ensure that the data remain ordered by time. Call the splits `X_train`, `X_test`, `y_train`, and `y_test`.

In [None]:
# Your code here...
X_train, X_test, y_train, y_test = train_test_split(
    ccfd.drop('Class', axis=1), 
    ccfd['Class'], 
    shuffle=False, 
    test_size=0.3
)

print(f'Shape of training features: {X_train.shape}')
print(f'Shape of training labels: {y_train.shape}')
print(f'Shape of test features: {X_test.shape}')
print(f'Shape of test labels: {y_test.shape}')
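To see why `shuffle=False` matters for time-ordered data, here is a small sketch on a made-up table (the `toy` frame and its values are illustrative assumptions, not the credit card data). With shuffling disabled, the first 70% of the rows go to the training set and the last 30% to the test set, so the model is never trained on transactions that happen after the ones it is tested on.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real data: 10 rows already sorted by time.
toy = pd.DataFrame({
    "Time": np.arange(10),
    "Amount": np.linspace(1.0, 10.0, 10),
    "Class": [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy.drop("Class", axis=1),
    toy["Class"],
    shuffle=False,   # keep the original row order
    test_size=0.3,
)

# Every training row precedes every test row in time.
print(toy_X_train["Time"].tolist())
print(toy_X_test["Time"].tolist())
```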


Scale the features of both the training and the test set, fitting the scaler on the training set only. Call the new scaled variables `X_train_s` and `X_test_s`.

_Hint: you may want to use a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)._

In [None]:
# Your code here...
pipeline = Pipeline([
    ('scaling', StandardScaler()),
])
preprocessor = pipeline.fit(X_train)
X_train_s = preprocessor.transform(X_train)
X_test_s = preprocessor.transform(X_test)
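The pattern above (fit on the training set, then transform both sets) can be checked on synthetic numbers. The sketch below uses made-up arrays (`toy_train` and `toy_test` are assumptions for illustration): the training features end up with zero mean and unit variance, while the test features are scaled with the *training* statistics and therefore generally do not.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
toy_test = rng.normal(loc=6.0, scale=2.0, size=(50, 3))   # slightly shifted

scaler = StandardScaler().fit(toy_train)    # statistics from training rows only
toy_train_s = scaler.transform(toy_train)
toy_test_s = scaler.transform(toy_test)     # reuses the training mean and std

print(toy_train_s.mean(axis=0).round(3))    # zero up to rounding
print(toy_test_s.mean(axis=0).round(3))     # generally not zero
```

Fitting the scaler on the full data set instead would leak information about the test period into the training features.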


Use the `reshape_to_batches` function with batch size 100 and apply it to the training data. Call the result `X_train_s_batch`.

In [None]:
# reshape to batches
BATCH_SIZE = 100
# Your code here...
X_train_s_batch = reshape_to_batches(X_train_s, BATCH_SIZE)
print(f'Shape after reshaping: {X_train_s_batch.shape}')


Note that the batch size is particularly important here because it is the length of the sequences that we are going to train the RNN on.
This means that any dependencies between time steps further apart than `BATCH_SIZE` **will not be taken into account**.

We could in theory give only one batch containing the entire sequence, but that would take an excessive amount of time to train and success is not guaranteed (vanishing gradient problem).
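As a reminder of what this reshaping does, here is a minimal sketch. The two functions below are hypothetical stand-ins, not the code in `helpers.py`: in particular, dropping the trailing rows that do not fill a whole sequence is an assumption, and the real helper may handle them differently.

```python
import numpy as np

def sketch_reshape_to_batches(a, batch_size):
    """Hypothetical stand-in for reshape_to_batches (assumption): drop the
    trailing rows that do not fill a whole sequence, then cut the rest into
    consecutive windows of `batch_size` time steps."""
    a = np.asarray(a)
    n_sequences = a.shape[0] // batch_size
    trimmed = a[: n_sequences * batch_size]
    return trimmed.reshape(n_sequences, batch_size, a.shape[1])

def sketch_convert_3d_to_2d(a):
    """Hypothetical inverse: stack the sequences back into one long table."""
    return a.reshape(-1, a.shape[-1])

toy = np.arange(23 * 2).reshape(23, 2)       # 23 time steps, 2 features
toy_batched = sketch_reshape_to_batches(toy, 5)
print(toy_batched.shape)                     # (4, 5, 2): the last 3 rows are dropped

# Rows 4 and 5 are neighbours in time but end up in different sequences,
# so a non-stateful RNN never sees them together.
print(toy_batched[0, -1], toy_batched[1, 0])
```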

### Re-encoding the data

Create a one-hot encoded `y_binary` with two columns (0, 1) from `y_train`, then batch it. Save the output as `y_train_batch`.

In [None]:
# Your code here...
from keras.utils import to_categorical

y_binary = to_categorical(y_train)

print(f'Shape of training labels: {y_train.shape}')
print(f'Shape of training labels (one-hot): {y_binary.shape}')

y_train_batch = reshape_to_batches(y_binary, BATCH_SIZE)

print(f'Shape of training labels after reshaping: {y_train_batch.shape}')
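For integer labels 0 and 1, the one-hot encoding can be reproduced in plain NumPy, which makes the layout of the two columns explicit. This is only an illustrative sketch on made-up labels; the notebook itself keeps using `to_categorical`.

```python
import numpy as np

labels = np.array([0, 0, 1, 0, 1])            # toy labels: 1 marks a fraud

# Row i of the identity matrix is the one-hot vector of class i.
one_hot = np.eye(2, dtype="float32")[labels]
print(one_hot)

# Column 1 is the fraud indicator, column 0 its complement.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

This is why the evaluation cells later select column `[:, 1]`: it holds the label (or predicted probability) of the fraud class.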


### Create model

In theory the RNN can read arbitrarily many time-steps, which is one of the reasons it can, theoretically, offer better performance than the CNN for time series. 
In practice, however, it is limited by the vanishing gradient problem and the exploding computational requirement implied by taking increasingly many time-steps.
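The vanishing gradient problem can be illustrated with a one-unit toy RNN, $h_t = \tanh(w\,h_{t-1} + x_t)$. By the chain rule, the sensitivity of $h_T$ to the initial state is a product of $T$ factors $w\,(1 - h_t^2)$, each of magnitude at most $|w|$. The sketch below is only an illustration: the weight `w=0.9` and the random inputs are arbitrary assumptions.

```python
import numpy as np

def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_T / d h_0| for T = 1..len(inputs) in a one-unit tanh RNN."""
    h, grad, grads = h0, 1.0, []
    for x in inputs:
        h = np.tanh(w * h + x)
        grad *= w * (1.0 - h ** 2)    # chain rule: one factor per time step
        grads.append(abs(grad))
    return np.array(grads)

rng = np.random.default_rng(0)
inputs = rng.normal(size=100)
grads = gradient_through_time(w=0.9, inputs=inputs)

for T in (1, 10, 50, 100):
    print(f"T={T:3d}  |dh_T/dh_0| = {grads[T - 1]:.2e}")
```

The printed values shrink quickly with `T`, which is why very long sequences contribute little learning signal to the earliest time steps.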

The cell below imports key Keras layers:
* `Input` and `Dense` which you already know
* `SimpleRNN` and `TimeDistributed` which are helpful for time series

In [None]:
# import all dependencies
from keras.layers import Input, Dense, SimpleRNN, TimeDistributed
from keras.models import Model

Create the input layer (called `inputs`) with the appropriate dimensions.

In [None]:
# add your code here for the input layer
inputs = Input(shape=(BATCH_SIZE, 30))


### Defining the architecture of the RNN

By default, a Keras recurrent layer returns only its last output, which corresponds to the **many-to-one** architecture, sometimes also known as an _encoder_.
However, we want to make a prediction at every time step.
Therefore, we make the RNN layer return its output at every time step with `return_sequences=True`.

The cell below defines an RNN layer chained to the `inputs` layer.
You should recognize a few things:

* how many neurons are there? (or what is the dimensionality of the output of that layer?)
* what is the activation function?
* the kernel initializer is the Glorot (uniform) initializer, centered at zero
* no dropout

The rest of the parameters don't really matter for now (we will modify some of them later) but feel free to have a look at the [documentation](https://keras.io/layers/recurrent/) for a definition of all the parameters.
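To make the difference between many-to-one and many-to-many concrete, here is a NumPy sketch of the forward pass of a tanh RNN. It is not Keras' implementation: the function name and the weight names `W_x`, `W_h` and `b` are hypothetical, and the weights are plain random numbers rather than Glorot or orthogonal initialisations.

```python
import numpy as np

def simple_rnn_forward(x, W_x, W_h, b, return_sequences=False):
    """Forward pass of a tanh RNN over x with shape (n_sequences, T, n_features)."""
    n_sequences, T, _ = x.shape
    h = np.zeros((n_sequences, W_h.shape[0]))
    outputs = []
    for t in range(T):
        h = np.tanh(x[:, t, :] @ W_x + h @ W_h + b)
        outputs.append(h)
    # many-to-many: every hidden state; many-to-one: only the last one
    return np.stack(outputs, axis=1) if return_sequences else h

rng = np.random.default_rng(0)
n_features, units = 30, 64
x = rng.normal(size=(8, 100, n_features))
W_x = rng.normal(scale=0.1, size=(n_features, units))
W_h = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

last = simple_rnn_forward(x, W_x, W_h, b)
full = simple_rnn_forward(x, W_x, W_h, b, return_sequences=True)
print(last.shape)                        # (8, 64)
print(full.shape)                        # (8, 100, 64)
print(W_x.size + W_h.size + b.size)      # 6080 trainable parameters
```

The same three weight arrays are reused at every time step, so the parameter count does not depend on the sequence length.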

In [None]:
rnn = SimpleRNN(
    64, 
    activation='tanh', 
    use_bias=True, 
    kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal', 
    bias_initializer='zeros', 
    kernel_regularizer=None,
    recurrent_regularizer=None, 
    bias_regularizer=None, 
    activity_regularizer=None, 
    kernel_constraint=None, 
    recurrent_constraint=None, 
    bias_constraint=None, 
    dropout=0.0, 
    recurrent_dropout=0.0, 
    return_sequences=True, 
    return_state=False, 
    go_backwards=False, 
    stateful=False, 
    unroll=False
)(inputs)


The next cell adds an output layer with 2 units, one per class (we're still in the classification context); `TimeDistributed` applies it at every time step.

Then, we wrap a model around the whole thing and compile it.
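Conceptually, `TimeDistributed(Dense(2, activation='softmax'))` applies one and the same dense layer to the RNN output of every time step. The NumPy sketch below mimics that behaviour under simplifying assumptions: the function name and the random weights `W` and `b` are hypothetical, and `h` is a random stand-in for the RNN output.

```python
import numpy as np

def time_distributed_softmax(h, W, b):
    """Apply one dense softmax layer to every time step of h (n_seq, T, units)."""
    logits = h @ W + b                              # same W and b for all time steps
    logits -= logits.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 100, 64))                   # stand-in for the RNN output
W = rng.normal(scale=0.1, size=(64, 2))
b = np.zeros(2)

probs = time_distributed_softmax(h, W, b)
print(probs.shape)                                  # (8, 100, 2)
print(probs[0, 0])                                  # two class probabilities summing to 1
```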

In [None]:
predictions = TimeDistributed(Dense(2, activation='softmax'))(rnn)

rnn_model = Model(
    inputs=inputs, 
    outputs=predictions
)

rnn_model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

Now we can fit the model for a few epochs and check its performance.

In [None]:
rnn_model.fit(X_train_s_batch, y_train_batch, epochs=15)

### Evaluation

To evaluate the performance of the model, we will:

- check the shapes of our data, convert as required
- make predictions using `rnn_model.predict(X_test_s_batch)`
- calculate AUC

In [None]:
# check the shapes of all relevant objects, reshape if necessary
# first transform the test data into the appropriate shape
print(f'Shape of test features: {X_test_s.shape}')

X_test_s_batch = reshape_to_batches(X_test_s, BATCH_SIZE)
print(f'Shape of test features after reshaping: {X_test_s_batch.shape}')

y_binary = to_categorical(y_test)
print(f'Shape of test labels: {y_test.shape}')
print(f'Shape of test labels (one-hot): {y_binary.shape}')

y_test_batch = reshape_to_batches(y_binary, BATCH_SIZE)
print(f'Shape of test labels after reshaping: {y_test_batch.shape}')

In [None]:
# make the prediction
y_pred_rnn = rnn_model.predict(X_test_s_batch)

In [None]:
# show the roc auc score
print('AUC:')
print(roc_auc_score(
    convert_3d_to_2d(y_test_batch)[:,1], 
    convert_3d_to_2d(y_pred_rnn)[:,1]
))
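As a quick sanity check of what this number means, the sketch below computes the AUC in three equivalent ways on a tiny made-up example (the four labels and scores are illustrative, not model output): with `roc_auc_score`, as the area under the curve returned by `roc_curve`, and as the fraction of (fraud, non-fraud) pairs in which the fraud gets the higher score.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))    # 0.75
print(auc(fpr, tpr))                     # same area, computed from the curve

# Ranking interpretation: share of (positive, negative) pairs ranked correctly.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])
print(pairwise)                          # 0.75
```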

### Comparison with the CNN results

Load the FPR and TPR of the CNN from `data/res_cnn.pkl`, then plot the ROC curves of both the CNN and the RNN you've just trained.

In [None]:
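# load the saved CNN results and compute the ROC curve of the RNN
import pickle

# the context manager closes the file once the results are loaded
with open("data/res_cnn.pkl", "rb") as f:
    fpr_cnn, tpr_cnn, thresh_cnn, y_pred_cnn = pickle.load(f)

fpr_rnn, tpr_rnn, thresh_rnn = roc_curve(
    convert_3d_to_2d(y_test_batch)[:, 1], convert_3d_to_2d(y_pred_rnn)[:, 1])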

In [None]:
# show the ROC curves for the CNN and the RNN
plt.figure(figsize=(8, 6))
lw = 2
plt.plot(fpr_cnn, tpr_cnn, color='C0',
         lw=lw, label='CNN')
plt.plot(fpr_rnn, tpr_rnn, color='C1',
         lw=lw, label='RNN')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, ls='--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('FPR', fontsize=12)
plt.ylabel('TPR', fontsize=12)
plt.legend(fontsize=12)
plt.show()


As you can observe, the RNN is better than the CNN here (its ROC curve lies mostly or completely above that of the CNN).
The AUC offers a convenient way to compare different classification models.

**Note**: remain careful though. The AUC summarises performance over all decision thresholds but, as you know, in this case we may care more about fraud *recall* at the threshold we actually use.
Don't forget to also check the confusion matrices etc.
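A minimal sketch of such a check is shown below, on made-up labels and fraud probabilities (both arrays are assumptions for illustration). In the notebook, `convert_3d_to_2d(y_test_batch)[:, 1]` and `convert_3d_to_2d(y_pred_rnn)[:, 1]` would play the roles of `y_true` and `p_fraud`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy data: 3 frauds among 10 transactions, with predicted fraud probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
p_fraud = np.array([0.05, 0.10, 0.20, 0.60, 0.02, 0.30, 0.90, 0.40, 0.70, 0.15])

for threshold in (0.5, 0.3):
    y_hat = (p_fraud >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
    print(f"threshold={threshold}: TP={tp} FN={fn} FP={fp} TN={tn} "
          f"recall={recall_score(y_true, y_hat):.2f}")
```

Lowering the threshold catches more frauds at the price of more false alarms, a trade-off that a single AUC value does not show.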

## LSTM

One of the best parts of using Keras' functional API is that we can easily reuse components. Let's replace the vanilla RNN with an LSTM.
Again, you should recognise a few things; in fact, pretty much everything is the same as for the `SimpleRNN` you used before.

In [None]:
from keras.layers import LSTM

# `implementation` selects how the layer organises its internal computations, not the device:
# mode 1 uses a larger number of smaller operations, mode 2 batches them into fewer, larger ones
lstm = LSTM(
    64, 
    activation='tanh', 
    recurrent_activation='hard_sigmoid', 
    use_bias=True, 
    kernel_initializer='glorot_uniform', 
    recurrent_initializer='orthogonal', 
    bias_initializer='zeros', 
    unit_forget_bias=True, 
    kernel_regularizer=None, 
    recurrent_regularizer=None, 
    bias_regularizer=None, 
    activity_regularizer=None, 
    kernel_constraint=None, 
    recurrent_constraint=None, 
    bias_constraint=None, 
    dropout=0.0, 
    recurrent_dropout=0.0, 
    implementation=1,      # mode 1: many smaller operations
    return_sequences=True, 
    return_state=False, 
    go_backwards=False, 
    stateful=False,
    unroll=False
)(inputs)

# finally we give a 2 dimensional softmax output layer
predictions = TimeDistributed(Dense(2, activation='softmax'))(lstm)

lstm_model = Model(
    inputs=inputs, 
    outputs=predictions
)

lstm_model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

In [None]:
# add your code to fit the model
lstm_model.fit(X_train_s_batch, y_train_batch, epochs=15)


### Evaluate the quality of the LSTM classifier

Compare the LSTM to both the RNN and the CNN.

In [None]:
# add your code to compare models
y_pred_lstm = lstm_model.predict(X_test_s_batch)

fpr_lstm, tpr_lstm, thresh_lstm = roc_curve(
    convert_3d_to_2d(y_test_batch)[:, 1], 
    convert_3d_to_2d(y_pred_lstm)[:, 1]
)

plt.figure(figsize=(8, 6))
lw = 2
plt.plot(
    fpr_cnn,
    tpr_cnn,
    color='C0',
    lw=lw,
    label='CNN'
)
plt.plot(
    fpr_rnn,
    tpr_rnn,
    color='C1',
    lw=lw,
    label='RNN'
)
plt.plot(
    fpr_lstm,
    tpr_lstm,
    color='C2',
    lw=lw,
    label='LSTM'
)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, ls='--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('FPR', fontsize=12)
plt.ylabel('TPR', fontsize=12)
plt.legend(fontsize=12)

print(f"CNN  AUC: {auc(fpr_cnn, tpr_cnn):.4f}")
print(f"RNN  AUC: {auc(fpr_rnn, tpr_rnn):.4f}")
print(f"LSTM AUC: {auc(fpr_lstm, tpr_lstm):.4f}")
plt.show()


## LSTM vs GRU

The last one we can test is the GRU. 

In [None]:
from keras.layers import GRU

gru = GRU(
    64, 
    activation='tanh', 
    recurrent_activation='hard_sigmoid',
    use_bias=True, 
    kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal', 
    bias_initializer='zeros',
    kernel_regularizer=None, 
    recurrent_regularizer=None, 
    bias_regularizer=None,
    activity_regularizer=None, 
    kernel_constraint=None, 
    recurrent_constraint=None,
    bias_constraint=None, 
    dropout=0.0, 
    recurrent_dropout=0.0, 
    implementation=1,
    return_sequences=True, 
    return_state=False, 
    go_backwards=False, 
    stateful=False, 
    unroll=False
)(inputs)

# output layer, as per usual
predictions = TimeDistributed(Dense(2, activation='softmax'))(gru)

In [None]:
# model compilation and fitting
gru_model = Model(
    inputs=inputs,
    outputs=predictions
)
gru_model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
gru_model.fit(X_train_s_batch, y_train_batch, epochs=15)

### Evaluation

In [None]:
# add your code to compare models
y_pred_gru = gru_model.predict(X_test_s_batch)

fpr_gru, tpr_gru, thresh_gru = roc_curve(convert_3d_to_2d(y_test_batch)[:, 1], 
                                         convert_3d_to_2d(y_pred_gru)[:, 1])

plt.figure(figsize=(8, 6))
lw = 2
plt.plot(
    fpr_cnn,
    tpr_cnn,
    color='C0',
    lw=lw,
    label='CNN'
)
plt.plot(
    fpr_rnn,
    tpr_rnn,
    color='C1',
    lw=lw,
    label='RNN'
)
plt.plot(
    fpr_lstm,
    tpr_lstm,
    color='C2',
    lw=lw,
    label='LSTM'
)
plt.plot(
    fpr_gru,
    tpr_gru,
    color='C3',
    lw=lw,
    label='GRU'
)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, ls='--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('FPR', fontsize=12)
plt.ylabel('TPR', fontsize=12)
plt.legend(fontsize=12)

print(f"CNN  AUC: {auc(fpr_cnn, tpr_cnn):.4f}")
print(f"RNN  AUC: {auc(fpr_rnn, tpr_rnn):.4f}")
print(f"LSTM AUC: {auc(fpr_lstm, tpr_lstm):.4f}")
print(f"GRU  AUC: {auc(fpr_gru, tpr_gru):.4f}")


## Stacking: combine NNs as lego blocks

Just as with CNNs, RNN units can be stacked on top of each other to form a more involved model. 
Since the weights of each RNN layer are shared across time steps, the hypothesis is that every stacked layer both forms new features and operates at a different time scale.

Try to build two LSTM layers with the same settings as before, stack one after the other and test the whole lot. 

In [None]:
# add your code here
lstm1 = LSTM(64, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
            bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, 
            recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, 
            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, 
            recurrent_dropout=0.0, implementation=1, return_sequences=True, return_state=False, 
            go_backwards=False, stateful=False, unroll=False)(inputs)

lstm2 = LSTM(64, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
            bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, 
            recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, 
            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, 
            recurrent_dropout=0.0, implementation=1, return_sequences=True, return_state=False, 
            go_backwards=False, stateful=False, unroll=False)(lstm1)

predictions = TimeDistributed(Dense(2, activation='softmax'))(lstm2)
lstm64x64_model = Model(
    inputs=inputs,
    outputs=predictions
)
lstm64x64_model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
lstm64x64_model.fit(X_train_s_batch, y_train_batch, epochs=15)
y_pred_lstm64x64 = lstm64x64_model.predict(X_test_s_batch)


Observe that the training is now a bit slower; you have more than twice as many parameters, after all...

Check the performance compared with the 1-layer LSTM.

In [None]:
# add your code to compare models
fpr_lstm64x64, tpr_lstm64x64, thresh_lstm64x64 = roc_curve(
    convert_3d_to_2d(y_test_batch)[:, 1],
    convert_3d_to_2d(y_pred_lstm64x64)[:, 1]
)

plt.figure(figsize=(8, 6))
lw = 2
plt.plot(
    fpr_lstm,
    tpr_lstm,
    color='C4',
    lw=lw,
    label='LSTM 64'
)
plt.plot(
    fpr_lstm64x64,
    tpr_lstm64x64,
    color='C5',
    lw=lw,
    label='LSTM 64 x 64'
)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, ls='--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('FPR', fontsize=12)
plt.ylabel('TPR', fontsize=12)
plt.legend(fontsize=12)

print(f"LSTM 64      AUC: {auc(fpr_lstm, tpr_lstm):.4f}")
print(f"LSTM 64 x 64 AUC: {auc(fpr_lstm64x64, tpr_lstm64x64):.4f}")


So that's worse. 
Quite likely we have started to overfit...

There are two ways to counter possible overfitting, in the hope that a more complex model might lead to better performance (which is not necessarily true):
1. decrease the number of parameters
2. introduce regularisation

Let's start by reducing the number of parameters: do exactly the same as before, but with LSTMs that have only 32 units per layer instead of 64.
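To put numbers on the model sizes discussed in this section, the sketch below counts the trainable parameters of the LSTM layers by hand. It uses the standard LSTM count of four weight blocks (input, forget and output gates plus the cell candidate), each with input weights, recurrent weights and a bias. The helper name `lstm_params` is hypothetical, and the small `Dense` output layer is left out.

```python
def lstm_params(units, input_dim):
    """Trainable parameters of one LSTM layer: four weight blocks, each with
    input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

n_features = 30   # matches Input(shape=(BATCH_SIZE, 30))

single_64 = lstm_params(64, n_features)
stacked_64 = lstm_params(64, n_features) + lstm_params(64, 64)
stacked_32 = lstm_params(32, n_features) + lstm_params(32, 32)

print(single_64, stacked_64, stacked_32)   # 24320 57344 16384
```

Note that the two stacked 32-unit layers together have fewer parameters than the single 64-unit layer. Calling `summary()` on the Keras models is a direct way to cross-check these counts.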

In [None]:
# add your code here
lstm1 = LSTM(32, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
            bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, 
            recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, 
            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, 
            recurrent_dropout=0.0, implementation=1, return_sequences=True, return_state=False, 
            go_backwards=False, stateful=False, unroll=False)(inputs)

lstm2 = LSTM(32, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
            bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, 
            recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, 
            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, 
            recurrent_dropout=0.0, implementation=1, return_sequences=True, return_state=False, 
            go_backwards=False, stateful=False, unroll=False)(lstm1)

predictions = TimeDistributed(Dense(2, activation='softmax'))(lstm2)
lstm32x32_model = Model(
    inputs=inputs,
    outputs=predictions
)
lstm32x32_model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

lstm32x32_model.fit(X_train_s_batch, y_train_batch, epochs=15)
y_pred_lstm32x32 = lstm32x32_model.predict(X_test_s_batch)


In [None]:
# add your code to compare models
fpr_lstm32x32, tpr_lstm32x32, thresh_lstm32x32 = roc_curve(
    convert_3d_to_2d(y_test_batch)[:, 1],
    convert_3d_to_2d(y_pred_lstm32x32)[:, 1]
)

plt.figure(figsize=(8, 6))
lw = 2
plt.plot(
    fpr_lstm,
    tpr_lstm,
    color='C4',
    lw=lw,
    label='LSTM 64'
)
plt.plot(
    fpr_lstm64x64,
    tpr_lstm64x64,
    color='C5',
    lw=lw,
    label='LSTM 64 x 64'
)
plt.plot(
    fpr_lstm32x32,
    tpr_lstm32x32,
    color='C6',
    lw=lw,
    label='LSTM 32 x 32'
)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, ls='--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('FPR', fontsize=12)
plt.ylabel('TPR', fontsize=12)
plt.legend(fontsize=12)

print(f"LSTM 64      AUC: {auc(fpr_lstm, tpr_lstm):.4f}")
print(f"LSTM 64 x 64 AUC: {auc(fpr_lstm64x64, tpr_lstm64x64):.4f}")
print(f"LSTM 32 x 32 AUC: {auc(fpr_lstm32x32, tpr_lstm32x32):.4f}")


Ok, that's better. 

You may get a slightly different result but it should be approximately:

* AUC LSTM 64    = 0.9734
* AUC LSTM 64x64 = 0.9573 (-1.6 %)
* AUC LSTM 32x32 = 0.9739 (+0.05 %)
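The differences quoted in brackets can be recomputed from the three AUC values listed above; the sketch below expresses them in percentage points relative to the single-layer LSTM (the variable names are illustrative).

```python
auc_lstm_64 = 0.9734
auc_lstm_64x64 = 0.9573
auc_lstm_32x32 = 0.9739

differences = {}
for name, value in [("64x64", auc_lstm_64x64), ("32x32", auc_lstm_32x32)]:
    # difference in AUC, expressed in percentage points
    differences[name] = 100 * (value - auc_lstm_64)
    print(f"LSTM {name}: {differences[name]:+.2f} percentage points vs LSTM 64")
```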

Of course, to be complete, you should also look at the fraud recall as we've already mentioned before.