# Neural Networks for Sequences and Time Series

Let's try to use the RNN for the Credit Card Fraud Detection problem. 

As before, this first cell sets up the notebook.
Additionally, we load the file `utils/helpers.py`, which defines the following functions from the previous notebook:

* `train_test_split_time_series`
* `reshape_to_batches` 
* `_3d_to_2d` 

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
    
from sklearn.pipeline import Pipeline

import keras

# this is the same as copy pasting a bunch of "def" here.
%run utils/helpers

### Load the data

Load the `creditcard.csv` data into a dataframe called `ccfd`

In [0]:
ccfd = pd.read_csv('./data/creditcard.csv')
ccfd.head()


The helpers file provides the same `train_test_split_time_series` function you used before. Use it to build a train-test split and check the shapes.

In [0]:
XTrain, yTrain, XTest, yTest = train_test_split_time_series(ccfd)
print(XTrain.shape)
print(yTrain.shape)
print(XTest.shape)
print(yTest.shape)


Apply the usual scaling preprocessing: fit the scaler on the training set only, then transform both the training and the test set.

In [0]:
pipeline = Pipeline([
    ('scaling', StandardScaler()),
])
preprocessor = pipeline.fit(XTrain)
XTrain_s = preprocessor.transform(XTrain)
XTest_s = preprocessor.transform(XTest)


Use the `reshape_to_batches` function with batch size 100 and apply it to the training data. 

In [0]:
# reshape to batches
BATCH_SIZE = 100
XTrain_s_batch = reshape_to_batches(XTrain_s, BATCH_SIZE)
print(XTrain_s_batch.shape)


### Remark
Note that the batch size is particularly important because it is the length of the sequences we train the RNN on. 
This means that any dependencies further apart than `BATCH_SIZE` **will not be taken into account**. 

We could in theory give only one batch with the entire sequence, but that would take an excessive amount of time to train and success is not guaranteed (vanishing gradient problem). 
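
As a rough illustration of what this reshaping does, here is a minimal NumPy sketch. The function `reshape_to_sequences_sketch` is hypothetical and is not the `reshape_to_batches` helper from `utils/helpers.py`; in particular, dropping the leftover rows is an assumption, and the real helper may handle them differently.

```python
import numpy as np

def reshape_to_sequences_sketch(X, seq_len):
    """Cut a (n_samples, n_features) array into consecutive sequences.

    Hypothetical stand-in for illustration only: rows that do not fill
    a complete sequence are dropped (an assumption).
    """
    n_sequences = X.shape[0] // seq_len
    X_trimmed = X[:n_sequences * seq_len]
    return X_trimmed.reshape(n_sequences, seq_len, X.shape[1])

# toy data: 1050 time-ordered rows with 30 features each
X_toy = np.arange(1050 * 30, dtype=float).reshape(1050, 30)
X_toy_seq = reshape_to_sequences_sketch(X_toy, seq_len=100)
print(X_toy_seq.shape)  # (10, 100, 30)
```

Row 100 of the toy array becomes the first time step of the second sequence, so rows 99 and 100 never appear in the same sequence: this is the cut-off that the remark above refers to.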

### Re-encoding the data

As in the previous notebook, create a `y_binary` with two columns (0, 1) and batch `yTrain`. 

In [0]:
# to_categorical is exposed directly by keras.utils
from keras.utils import to_categorical

y_binary = to_categorical(yTrain)

print(yTrain.shape)
print(y_binary.shape)

yTrain_batch = reshape_to_batches(y_binary, BATCH_SIZE)

print(yTrain_batch.shape)


### Create model

In theory the RNN can read arbitrarily many time-steps, which is one of the reasons it can, theoretically, offer better performance than the CNN for time series. 
In practice, however, it is limited by the vanishing gradient problem and by the computational cost, which grows with the number of time-steps.
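
To get a feel for the vanishing gradient problem, here is a small NumPy sketch. It is an illustration under assumptions, not the notebook's model: the recurrent weights are an orthogonal matrix scaled by 0.5 (an arbitrary choice that makes the effect easy to see), and the inputs are random. It tracks how much the hidden state at step t still depends on the initial hidden state.

```python
import numpy as np

rng = np.random.RandomState(0)
n_hidden, n_steps = 64, 100

# recurrent weights: an orthogonal matrix scaled by 0.5 (arbitrary choice)
Q, _ = np.linalg.qr(rng.normal(size=(n_hidden, n_hidden)))
W = 0.5 * Q

h = np.zeros(n_hidden)
jacobian_product = np.eye(n_hidden)
norms = []
for t in range(n_steps):
    x_t = rng.normal(size=n_hidden)      # toy input, already projected
    h = np.tanh(W @ h + x_t)
    # Jacobian of h_t with respect to h_{t-1} for a tanh RNN
    J = np.diag(1.0 - h ** 2) @ W
    # chain rule: sensitivity of h_t to the initial hidden state
    jacobian_product = J @ jacobian_product
    norms.append(np.linalg.norm(jacobian_product, 2))

print("after   1 step : {0:.3e}".format(norms[0]))
print("after 100 steps: {0:.3e}".format(norms[-1]))
```

With these weights each step shrinks the sensitivity by at least a factor of two, so after 100 steps the gradient signal from the start of the sequence is numerically negligible.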

The cell below imports key Keras layers:

* `Input` and `Dense` which you already know
* `SimpleRNN` and `TimeDistributed` which are helpful for time series

In [0]:
# import all dependencies
from keras.layers import Input, Dense, SimpleRNN, TimeDistributed
from keras.models import Model

Create the input layer with appropriate dimensions

In [0]:
inputs = Input(shape=(BATCH_SIZE, 30))


### Defining the architecture of the RNN

By default, Keras recurrent layers use the **many-to-one** architecture, sometimes also known as an _encoder_: only the output at the last time step is returned. 
However, we want to perform a prediction at every time step. 
Therefore, we make the RNN layer return its output at every time step with `return_sequences=True`.

The cell below, chained to the `inputs` layer, is an RNN layer.
You should recognise a few things:

* how many neurons are there? (or what's the dimensionality of the output of that layer?)
* what's the activation function?
* the kernel initializer is the Glorot initializer, centered at zero
* no dropout

The rest of the parameters don't really matter for now (we will modify some of them later) but feel free to have a look at the [documentation](https://keras.io/layers/recurrent/) for a definition of all the parameters. 

In [0]:
rnn = SimpleRNN(64, 
                activation='tanh', 
                use_bias=True, 
                kernel_initializer='glorot_uniform',
                recurrent_initializer='orthogonal', 
                bias_initializer='zeros', 
                kernel_regularizer=None,
                recurrent_regularizer=None, 
                bias_regularizer=None, 
                activity_regularizer=None, 
                kernel_constraint=None, 
                recurrent_constraint=None, 
                bias_constraint=None, 
                dropout=0.0, 
                recurrent_dropout=0.0, 
                return_sequences=True, 
                return_state=False, 
                go_backwards=False, 
                stateful=False, 
                unroll=False)(inputs)
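
To make the effect of `return_sequences` concrete, here is a minimal NumPy sketch of the forward pass of a tanh RNN. The function `simple_rnn_forward` and the random weights are hypothetical and only meant for illustration; they do not reproduce the Keras initialisers or internals.

```python
import numpy as np

def simple_rnn_forward(X, W_x, W_h, b, return_sequences=True):
    """Forward pass of a tanh RNN over X of shape (n_seq, n_steps, n_features)."""
    n_seq, n_steps, _ = X.shape
    n_units = W_h.shape[0]
    h = np.zeros((n_seq, n_units))
    outputs = []
    for t in range(n_steps):
        # the same weights are reused at every time step
        h = np.tanh(X[:, t, :] @ W_x + h @ W_h + b)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)   # one output per time step
    return h                               # only the last time step

rng = np.random.RandomState(0)
n_features, n_units = 30, 64
W_x = rng.normal(scale=0.1, size=(n_features, n_units))
W_h = rng.normal(scale=0.1, size=(n_units, n_units))
b = np.zeros(n_units)

X_toy = rng.normal(size=(5, 100, n_features))
out_all = simple_rnn_forward(X_toy, W_x, W_h, b, return_sequences=True)
out_last = simple_rnn_forward(X_toy, W_x, W_h, b, return_sequences=False)
print(out_all.shape)                 # (5, 100, 64)
print(out_last.shape)                # (5, 64)
print(W_x.size + W_h.size + b.size)  # 6080 weights in this sketch
```

The many-to-one output is simply the last slice of the many-to-many output, which is why switching `return_sequences` changes the output shape but not the number of weights.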


The next cell is an output layer with 2 dimensions given that there are two classes (we're still in the classification context). 

Then, we wrap a model around the whole and compile it.

In [0]:
predictions = TimeDistributed(Dense(2, activation='softmax'))(rnn)

rnn_model = Model(inputs=inputs, 
              outputs=predictions)

rnn_model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
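
Conceptually, wrapping the `Dense` layer in `TimeDistributed` applies one and the same dense layer to every time step. A minimal NumPy sketch of that idea, with random stand-in values for the RNN output and the dense weights (both hypothetical):

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.RandomState(0)
rnn_out = rng.normal(size=(5, 100, 64))     # toy stand-in for the RNN output
W_out = rng.normal(scale=0.1, size=(64, 2))
b_out = np.zeros(2)

# the same (W_out, b_out) is applied independently at every time step
probs = softmax(rnn_out @ W_out + b_out)
print(probs.shape)  # (5, 100, 2)
```

Each time step gets its own pair of class probabilities summing to one, which matches the shape of the batched one-hot targets.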

Now we're ready to fit the model for a few epochs and check its performance. 

In [0]:
rnn_model.fit(XTrain_s_batch, yTrain_batch, epochs=15)

### Evaluation

From the previous notebook, you know how to evaluate the performance of a model such as the one you just trained.

In [0]:
#first transform the test data into the appropriate shape
print(XTest_s.shape)

XTest_s_batch = reshape_to_batches(XTest_s, BATCH_SIZE)
print(XTest_s_batch.shape)

y_binary = to_categorical(yTest)
print(yTest.shape)
print(y_binary.shape)

yTest_batch = reshape_to_batches(y_binary, BATCH_SIZE)
print(yTest_batch.shape)


In [0]:
y_pred_rnn = rnn_model.predict(XTest_s_batch)


In [0]:
print(roc_auc_score(
        _3d_to_2d(yTest_batch)[:,1], 
        _3d_to_2d(y_pred_rnn)[:,1]))
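
The evaluation above flattens the batched arrays back to one row per transaction before computing the AUC. A minimal sketch of both steps on toy data follows; `flatten_sketch` is a hypothetical stand-in for `_3d_to_2d` (whose actual implementation lives in `utils/helpers.py`), and `auc_by_pairs` uses the pairwise-ranking interpretation of the AUC rather than scikit-learn's implementation.

```python
import numpy as np

def flatten_sketch(a):
    """(n_seq, n_steps, n_classes) -> (n_seq * n_steps, n_classes)."""
    return a.reshape(-1, a.shape[-1])

def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (fraud, non-fraud) pairs ranked correctly."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    # ties count for one half
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# toy one-hot labels and predicted probabilities: 2 sequences of 3 time steps
y_toy = np.array([[[1, 0], [1, 0], [0, 1]],
                  [[1, 0], [0, 1], [1, 0]]], dtype=float)
p_toy = np.array([[[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]],
                  [[0.7, 0.3], [0.65, 0.35], [0.8, 0.2]]])

y_flat = flatten_sketch(y_toy)[:, 1]   # column 1 = fraud class
p_flat = flatten_sketch(p_toy)[:, 1]
print(auc_by_pairs(y_flat, p_flat))    # 0.875
```

Seven of the eight (fraud, non-fraud) pairs are ranked correctly in this toy example, hence 0.875.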


### Comparison with the CNN results

Load the FPR and TPR from the CNN case, and plot the ROC curve of the RNN you've just trained alongside that of the CNN.

In [0]:
import pickle

# use a context manager so that the file is closed after loading
with open("res_cnn.pkl", "rb") as f:
    fpr_cnn, tpr_cnn, thresh_cnn, y_pred_cnn = pickle.load(f)
fpr_rnn, tpr_rnn, thresh_rnn = roc_curve(
    _3d_to_2d(yTest_batch)[:, 1], _3d_to_2d(y_pred_rnn)[:, 1])

plt.figure(figsize=(8, 6))
lw = 2
plt.plot(fpr_cnn, tpr_cnn, color='darkorange',
         lw=lw, label='CNN')
plt.plot(fpr_rnn, tpr_rnn, color='darkgreen',
         lw=lw, label='RNN')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, ls='--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('FPR', fontsize=12)
plt.ylabel('TPR', fontsize=12)
plt.legend(fontsize=12)


As you can observe, the RNN is globally better than the CNN here (its curve lies above the CNN's everywhere). 
The AUC offers a convenient way to compare different classification models.

**Note**: remain careful though. The AUC summarises how well the model ranks fraudulent transactions above legitimate ones across all thresholds but, as you know, in this case we may care more about fraud *recall* at the threshold we actually use. 
Don't forget to also check the confusion matrices etc.
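
As a reminder of how recall depends on the decision threshold, here is a small NumPy sketch on made-up labels and scores (the data and the helper `confusion_counts` are hypothetical; on the real predictions you would use the flattened test labels and fraud probabilities, or `confusion_matrix` from scikit-learn).

```python
import numpy as np

def confusion_counts(y_true, scores, threshold):
    """Return (tn, fp, fn, tp) when flagging every score >= threshold as fraud."""
    y_hat = scores >= threshold
    is_fraud = y_true == 1
    tp = int(np.sum(y_hat & is_fraud))
    fp = int(np.sum(y_hat & ~is_fraud))
    fn = int(np.sum(~y_hat & is_fraud))
    tn = int(np.sum(~y_hat & ~is_fraud))
    return tn, fp, fn, tp

y_true_toy = np.array([0, 0, 1, 0, 1, 0, 0, 1])
scores_toy = np.array([0.1, 0.4, 0.8, 0.3, 0.35, 0.2, 0.6, 0.9])

for threshold in (0.5, 0.25):
    tn, fp, fn, tp = confusion_counts(y_true_toy, scores_toy, threshold)
    print("threshold={0:.2f}  recall={1:.2f}  precision={2:.2f}".format(
        threshold, tp / (tp + fn), tp / (tp + fp)))
```

Lowering the threshold catches the remaining fraud case (higher recall) at the price of more false alarms (lower precision), a trade-off that a single AUC number does not show.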

## LSTM

One of the best parts of using Keras' functional API is that we can easily reuse components. Let's replace the vanilla RNN with an LSTM.
Again, you should recognise a few things; in fact, pretty much everything is similar to the `SimpleRNN` you used before. 

In [0]:
from keras.layers import LSTM

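# the implementation parameter selects how the computation is organised:
# mode 1 uses more, smaller operations; mode 2 batches them into fewer, larger ones
# (which one is faster depends on your hardware)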
lstm = LSTM(64, 
            activation='tanh', 
            recurrent_activation='hard_sigmoid', 
            use_bias=True, 
            kernel_initializer='glorot_uniform', 
            recurrent_initializer='orthogonal', 
            bias_initializer='zeros', 
            unit_forget_bias=True, 
            kernel_regularizer=None, 
            recurrent_regularizer=None, 
            bias_regularizer=None, 
            activity_regularizer=None, 
            kernel_constraint=None, 
            recurrent_constraint=None, 
            bias_constraint=None, 
            dropout=0.0, 
            recurrent_dropout=0.0, 
            implementation=1,      # 1: many small ops, 2: fewer, larger ops
            return_sequences=True, 
            return_state=False, 
            go_backwards=False, 
            stateful=False,
            unroll=False)(inputs)

#finally we give a 2 dimensional softmax output layer
predictions = TimeDistributed(Dense(2, activation='softmax'))(lstm)

lstm_model = Model(inputs=inputs, 
                   outputs=predictions)

lstm_model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
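
To see what the LSTM adds over the vanilla RNN, here is a minimal NumPy sketch of a single LSTM time step. It is an illustration under assumptions: the weights are random, the gate ordering inside the weight matrices is a common convention, and `hard_sigmoid` uses the slope 0.2, which is the definition found in older Keras versions and may differ in yours.

```python
import numpy as np

def hard_sigmoid(z):
    # piecewise-linear approximation of the sigmoid (slope 0.2 is an assumption)
    return np.clip(0.2 * z + 0.5, 0.0, 1.0)

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U and b hold the four transforms side by side."""
    n = h_prev.shape[-1]
    z = x_t @ W + h_prev @ U + b
    i = hard_sigmoid(z[:, 0 * n:1 * n])   # input gate
    f = hard_sigmoid(z[:, 1 * n:2 * n])   # forget gate
    g = np.tanh(z[:, 2 * n:3 * n])        # candidate cell state
    o = hard_sigmoid(z[:, 3 * n:4 * n])   # output gate
    c_t = f * c_prev + i * g              # additive update of the cell state
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.RandomState(0)
n_features, n_units = 30, 64
W = rng.normal(scale=0.1, size=(n_features, 4 * n_units))
U = rng.normal(scale=0.1, size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)
b[n_units:2 * n_units] = 1.0   # forget-gate bias of one, as with unit_forget_bias=True

h = np.zeros((5, n_units))
c = np.zeros((5, n_units))
X_toy = rng.normal(size=(5, 100, n_features))
for t in range(X_toy.shape[1]):
    h, c = lstm_step(X_toy[:, t, :], h, c, W, U, b)
print(h.shape, c.shape)             # (5, 64) (5, 64)
print(W.size + U.size + b.size)     # 24320 weights in this sketch
```

The cell state is updated additively (`f * c_prev + i * g`) rather than being squashed through a `tanh` at every step, which is what helps gradients survive over longer sequences.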

In [0]:
lstm_model.fit(XTrain_s_batch, yTrain_batch, epochs=15)


### Evaluate the quality of the LSTM classifier

Compare the LSTM to both the RNN and the CNN.

In [0]:
y_pred_lstm = lstm_model.predict(XTest_s_batch)

fpr_lstm, tpr_lstm, thresh_lstm = roc_curve(_3d_to_2d(yTest_batch)[:, 1], 
                                            _3d_to_2d(y_pred_lstm)[:, 1])

plt.figure(figsize=(8, 6))
lw = 2
plt.plot(fpr_cnn, tpr_cnn, color='darkorange',
         lw=lw, label='CNN')
plt.plot(fpr_rnn, tpr_rnn, color='darkgreen',
         lw=lw, label='RNN')
plt.plot(fpr_lstm, tpr_lstm, color='red',
         lw=lw, label='LSTM')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, ls='--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('FPR', fontsize=12)
plt.ylabel('TPR', fontsize=12)
plt.legend(fontsize=12)

print("CNN  AUC: {0:.4f}".format(auc(fpr_cnn, tpr_cnn)))
print("RNN  AUC: {0:.4f}".format(auc(fpr_rnn, tpr_rnn)))
print("LSTM AUC: {0:.4f}".format(auc(fpr_lstm, tpr_lstm)))


## LSTM vs GRU

A last recurrent layer we can test is the GRU (gated recurrent unit). 

In [0]:
from keras.layers import GRU

gru = GRU(64, 
          activation='tanh', 
          recurrent_activation='hard_sigmoid',
          use_bias=True, 
          kernel_initializer='glorot_uniform',
          recurrent_initializer='orthogonal', 
          bias_initializer='zeros',
          kernel_regularizer=None, 
          recurrent_regularizer=None, 
          bias_regularizer=None,
          activity_regularizer=None, 
          kernel_constraint=None, 
          recurrent_constraint=None,
          bias_constraint=None, 
          dropout=0.0, 
          recurrent_dropout=0.0, 
          implementation=1,
          return_sequences=True, 
          return_state=False, 
          go_backwards=False, 
          stateful=False, 
          unroll=False)(inputs)

# output layer, as per usual
predictions = TimeDistributed(Dense(2, activation='softmax'))(gru)

# model compilation and fitting
gru_model = Model(inputs=inputs, outputs=predictions)
gru_model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
gru_model.fit(XTrain_s_batch, yTrain_batch, epochs=15)

### Evaluation

In [0]:
y_pred_gru = gru_model.predict(XTest_s_batch)

fpr_gru, tpr_gru, thresh_gru = roc_curve(_3d_to_2d(yTest_batch)[:, 1], 
                                         _3d_to_2d(y_pred_gru)[:, 1])

plt.figure(figsize=(8, 6))
lw = 2
plt.plot(fpr_cnn, tpr_cnn, color='darkorange',
         lw=lw, label='CNN')
plt.plot(fpr_rnn, tpr_rnn, color='darkgreen',
         lw=lw, label='RNN')
plt.plot(fpr_lstm, tpr_lstm, color='red',
         lw=lw, label='LSTM')
plt.plot(fpr_gru, tpr_gru, color='magenta',
         lw=lw, label='GRU')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, ls='--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('FPR', fontsize=12)
plt.ylabel('TPR', fontsize=12)
plt.legend(fontsize=12)

print("CNN  AUC: {0:.4f}".format(auc(fpr_cnn, tpr_cnn)))
print("RNN  AUC: {0:.4f}".format(auc(fpr_rnn, tpr_rnn)))
print("LSTM AUC: {0:.4f}".format(auc(fpr_lstm, tpr_lstm)))
print("GRU  AUC: {0:.4f}".format(auc(fpr_gru, tpr_gru)))


## Stacking: combine NNs as lego blocks

Just as with CNNs, RNN units can be stacked on top of each other to form a more involved model. 
Since the weights are shared across time steps within each RNN layer, the hypothesis is that every layer in the stack learns both new features and a different time-scale at which it operates. 

Try to build two LSTM layers with the same settings as before, stack one after the other and test the whole lot. 

In [0]:
lstm1 = LSTM(64, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
            bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, 
            recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, 
            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, 
            recurrent_dropout=0.0, implementation=1, return_sequences=True, return_state=False, 
            go_backwards=False, stateful=False, unroll=False)(inputs)

lstm2 = LSTM(64, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
            bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, 
            recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, 
            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, 
            recurrent_dropout=0.0, implementation=1, return_sequences=True, return_state=False, 
            go_backwards=False, stateful=False, unroll=False)(lstm1)

predictions = TimeDistributed(Dense(2, activation='softmax'))(lstm2)
lstm64x64_model = Model(inputs=inputs, outputs=predictions)
lstm64x64_model.compile(optimizer='rmsprop',
                       loss='categorical_crossentropy',
                       metrics=['accuracy'])
lstm64x64_model.fit(XTrain_s_batch, yTrain_batch, epochs=15)
y_pred_lstm64x64 = lstm64x64_model.predict(XTest_s_batch)


Observe that the training is now a bit slower: the second layer more than doubles the number of LSTM parameters, after all... 
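
The parameter counts can be checked by hand. The sketch below uses the standard formula for a recurrent layer (an input kernel, a recurrent kernel and a bias for each of the four LSTM transforms); the helper name is hypothetical and the small output `Dense` layer is left out. A narrower two-layer stack with 32 units is included for comparison.

```python
def recurrent_param_count(n_inputs, n_units, n_gates):
    """Weights of one recurrent layer: per gate, an input kernel,
    a recurrent kernel and a bias vector."""
    return n_gates * (n_units * (n_inputs + n_units) + n_units)

N_FEATURES = 30
LSTM_GATES = 4   # input, forget, candidate and output transforms

single_64 = recurrent_param_count(N_FEATURES, 64, LSTM_GATES)
stacked_64 = single_64 + recurrent_param_count(64, 64, LSTM_GATES)
stacked_32 = (recurrent_param_count(N_FEATURES, 32, LSTM_GATES)
              + recurrent_param_count(32, 32, LSTM_GATES))

print("LSTM 64     :", single_64)    # 24320
print("LSTM 64 x 64:", stacked_64)   # 57344
print("LSTM 32 x 32:", stacked_32)   # 16384
```

You can compare these numbers with the output of `model.summary()` on the corresponding Keras models.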

Check the performance and compare it with that of the 1-layer LSTM.

In [0]:
fpr_lstm64x64, tpr_lstm64x64, thresh_lstm64x64 = roc_curve(
    _3d_to_2d(yTest_batch)[:, 1], _3d_to_2d(y_pred_lstm64x64)[:, 1])

plt.figure(figsize=(8, 6))
lw = 2
plt.plot(fpr_lstm, tpr_lstm, color='red',
         lw=lw, label='LSTM 64')
plt.plot(fpr_lstm64x64, tpr_lstm64x64, color='magenta',
         lw=lw, label='LSTM 64 x 64')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, ls='--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('FPR', fontsize=12)
plt.ylabel('TPR', fontsize=12)
plt.legend(fontsize=12)

print("LSTM 64      AUC: {0:.4f}".format(auc(fpr_lstm, tpr_lstm)))
print("LSTM 64 x 64 AUC: {0:.4f}".format(auc(fpr_lstm64x64, tpr_lstm64x64)))


So that's worse. 
Quite likely we have started to overfit...

There are two ways to tackle possible overfitting, in the hope that a more complex model might then lead to better performance (which is not necessarily true):
1. decrease the number of parameters
2. introduce regularisation

Let's start by reducing the number of parameters (`64-->32`): do exactly the same as before but with LSTMs with only 32 neurons per layer. 

In [0]:
lstm1 = LSTM(32, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
            bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, 
            recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, 
            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, 
            recurrent_dropout=0.0, implementation=1, return_sequences=True, return_state=False, 
            go_backwards=False, stateful=False, unroll=False)(inputs)

lstm2 = LSTM(32, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
            bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, 
            recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, 
            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, 
            recurrent_dropout=0.0, implementation=1, return_sequences=True, return_state=False, 
            go_backwards=False, stateful=False, unroll=False)(lstm1)

predictions = TimeDistributed(Dense(2, activation='softmax'))(lstm2)
lstm32x32_model = Model(inputs=inputs, outputs=predictions)
lstm32x32_model.compile(optimizer='rmsprop',
                        loss='categorical_crossentropy',
                        metrics=['accuracy'])

lstm32x32_model.fit(XTrain_s_batch, yTrain_batch, epochs=15)
y_pred_lstm32x32 = lstm32x32_model.predict(XTest_s_batch)


In [0]:
fpr_lstm32x32, tpr_lstm32x32, thresh_lstm32x32 = roc_curve(
    _3d_to_2d(yTest_batch)[:, 1], _3d_to_2d(y_pred_lstm32x32)[:, 1])

plt.figure(figsize=(8, 6))
lw = 2
plt.plot(fpr_lstm, tpr_lstm, color='red',
         lw=lw, label='LSTM 64')
plt.plot(fpr_lstm64x64, tpr_lstm64x64, color='magenta',
         lw=lw, label='LSTM 64 x 64')
plt.plot(fpr_lstm32x32, tpr_lstm32x32, color='darkgreen',
         lw=lw, label='LSTM 32 x 32')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, ls='--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('FPR', fontsize=12)
plt.ylabel('TPR', fontsize=12)
plt.legend(fontsize=12)

print("LSTM 64      AUC: {0:.4f}".format(auc(fpr_lstm, tpr_lstm)))
print("LSTM 64 x 64 AUC: {0:.4f}".format(auc(fpr_lstm64x64, tpr_lstm64x64)))
print("LSTM 32 x 32 AUC: {0:.4f}".format(auc(fpr_lstm32x32, tpr_lstm32x32)))


Ok, that's better. 

You may get slightly different results, but I currently have:

* AUC LSTM 64    = 0.9734
* AUC LSTM 64x64 = 0.9573 (-1.6 %)
* AUC LSTM 32x32 = 0.9739 (+0.05 %) 

Of course, to be complete, you should also look at the fraud recall as we've already mentioned before.

### Regularisation

With Keras it is particularly easy to add any form of regularisation you want, using either

* the `[component]_regularizer` parameter (penalise components that are too far from sensible values) or 
* the `[component]_constraint` parameter (clip components to be within a set range). 

In the first case, you can apply the `l1` and `l2` regularisation techniques you have learned so far, separately or combined (see the [regulariser docs](https://keras.io/regularizers/)).
In the second case, you can add constraints (max norm, unit norm, etc.; see the [constraint docs](https://keras.io/constraints/)).
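
The difference between a regulariser and a constraint can be sketched in a few lines of NumPy. Both functions below are hypothetical illustrations of the idea (a penalty added to the loss versus a rescaling of the weights after an update) and do not reproduce the exact Keras implementations.

```python
import numpy as np

def l1_l2_penalty(W, l1=0.01, l2=0.01):
    """Term added to the loss: l1 * sum(|w|) + l2 * sum(w^2)."""
    return l1 * np.sum(np.abs(W)) + l2 * np.sum(W ** 2)

def max_norm_constraint(W, max_value=2.0):
    """Rescale each column of W so that its Euclidean norm is at most max_value."""
    norms = np.sqrt(np.sum(W ** 2, axis=0, keepdims=True))
    scale = np.minimum(1.0, max_value / np.maximum(norms, 1e-12))
    return W * scale

W_toy = np.array([[3.0, 0.5],
                  [4.0, -0.5]])

print(l1_l2_penalty(W_toy))        # 0.01 * 8 + 0.01 * 25.5, i.e. about 0.335
print(max_norm_constraint(W_toy))  # first column rescaled from norm 5 to norm 2
```

The penalty only nudges the optimiser towards small weights, whereas the constraint enforces the bound exactly: the first column is shrunk to `[1.2, 1.6]` while the second, already small, is left untouched.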

Of course, picking the parameters of the regularisation is hard and there is no simple generic technique to do it. 
You could think about cross-validation (CV), but here it would be computationally too expensive. 
There are some rules of thumb for what is "big" and what is "small", but none are really justified. 
This is where resources can make all the difference.
If you have access to a bunch of GPUs (or better, TPUs), training one neural net with a given set of regularisation parameters can be done in a reasonable time, and you could therefore do a form of randomised CV. 
If you're on a single CPU on your laptop, however, you probably should not attempt hyperparameter tuning: your time is probably better invested in buying credits from a cloud computing provider and paying for their GPUs per hour of use. 
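
If you do have the resources, the sampling part of such a randomised search could look like the sketch below. The helper name and the search range (`1e-5` to `1e-1`) are assumptions made for illustration; each sampled candidate would then have to be trained and scored on held-out validation data, which is the expensive part and is not shown.

```python
import numpy as np

def sample_regularisation_params(rng, low=1e-5, high=1e-1):
    """Draw l1 and l2 strengths log-uniformly between low and high."""
    log_low, log_high = np.log10(low), np.log10(high)
    return {
        "l1": 10 ** rng.uniform(log_low, log_high),
        "l2": 10 ** rng.uniform(log_low, log_high),
    }

rng = np.random.RandomState(0)
candidates = [sample_regularisation_params(rng) for _ in range(5)]
for params in candidates:
    print("l1={l1:.2e}  l2={l2:.2e}".format(**params))
```

Sampling on a logarithmic scale spreads the candidates evenly across orders of magnitude, which is usually what matters for regularisation strengths.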

Let's see whether some basic regularisation helps in the double-LSTM case with 64 neurons. 

* add an l1_l2 regulariser (`keras.regularizers.l1_l2`) with parameter `0.01` for the `kernel_regularizer`
* do the same for the `recurrent_regularizer`
* keep 64 neurons on both layers

In [0]:
lstm1 = LSTM(64, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
            bias_initializer='zeros', unit_forget_bias=True, 
             ###### regularizers 
            kernel_regularizer=keras.regularizers.l1_l2(0.01), 
            recurrent_regularizer=keras.regularizers.l1_l2(0.01),
            bias_regularizer=None, activity_regularizer=None, 
            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, 
            recurrent_dropout=0.0, implementation=1, return_sequences=True, return_state=False, 
            go_backwards=False, stateful=False, unroll=False)(inputs)

lstm2 = LSTM(64, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
            bias_initializer='zeros', unit_forget_bias=True, 
            kernel_regularizer=keras.regularizers.l1_l2(0.01), 
            recurrent_regularizer=keras.regularizers.l1_l2(0.01), 
            bias_regularizer=None, activity_regularizer=None, 
            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, 
            recurrent_dropout=0.0, implementation=1, return_sequences=True, return_state=False, 
            go_backwards=False, stateful=False, unroll=False)(lstm1)


predictions = TimeDistributed(Dense(2, activation='softmax'))(lstm2)
lstm64x64_r_model = Model(inputs=inputs, outputs=predictions)
lstm64x64_r_model.compile(optimizer='rmsprop',
                          loss='categorical_crossentropy',
                          metrics=['accuracy'])

lstm64x64_r_model.fit(XTrain_s_batch, yTrain_batch, epochs=15)
y_pred_lstm64x64_r = lstm64x64_r_model.predict(XTest_s_batch)


In [0]:
fpr_lstm64x64_r, tpr_lstm64x64_r, thresh_lstm64x64_r = roc_curve(
    _3d_to_2d(yTest_batch)[:, 1], _3d_to_2d(y_pred_lstm64x64_r)[:, 1])

plt.figure(figsize=(8, 6))
lw = 2
plt.plot(fpr_lstm, tpr_lstm, color='red',
         lw=lw, label='LSTM 64')
plt.plot(fpr_lstm64x64, tpr_lstm64x64, color='magenta',
         lw=lw, label='LSTM 64 x 64')
plt.plot(fpr_lstm32x32, tpr_lstm32x32, color='darkgreen',
         lw=lw, label='LSTM 32 x 32')
plt.plot(fpr_lstm64x64_r, tpr_lstm64x64_r, color='orange',
         lw=lw, label='LSTM 64 x 64 + reg')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, ls='--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('FPR', fontsize=12)
plt.ylabel('TPR', fontsize=12)
plt.legend(fontsize=12)

print("LSTM 64       AUC: {0:.4f}".format(auc(fpr_lstm, tpr_lstm)))
print("LSTM 64 x 64  AUC: {0:.4f}".format(auc(fpr_lstm64x64, tpr_lstm64x64)))
print("LSTM 32 x 32  AUC: {0:.4f}".format(auc(fpr_lstm32x32, tpr_lstm32x32)))
print("LSTM 64 x 64r AUC: {0:.4f}".format(auc(fpr_lstm64x64_r, tpr_lstm64x64_r)))


That's a very significant performance drop... Again, you can get a sense of the difficulty of tuning a multi-layered neural net...

Before we give up on regularisation altogether, we have to consider the reasons behind such a significant drop. 
It could be either that we applied an unreasonably high regularisation value, or that the problem is much harder to optimise with regularisation and the optimisation algorithm needs more epochs... 

You could test the latter by going from 15 to 25 or 50 epochs; you should then obtain performance comparable with the other models (though it will take much more time). 

### Dropout

One of the most effective forms of regularisation in the context of neural networks is dropout.
There are two places where we can use dropout:

- on the input connections
- on the recurrent connections.

A dropout on a connection means that the data on that connection to each LSTM cell will be excluded from node activation and weight updates with a given probability. 
The dropout value is a fraction between 0 (no dropout) and 1 (no connection).
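
The basic mechanism can be sketched with NumPy as follows. This is the generic "inverted dropout" recipe on a toy array, written as a hypothetical helper; how Keras builds and reuses the masks inside the LSTM is not reproduced here.

```python
import numpy as np

def dropout_sketch(x, rate, rng):
    """Inverted dropout: zero each entry with probability `rate`
    and rescale the survivors so that the expected value is unchanged."""
    keep_mask = rng.uniform(size=x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.RandomState(0)
x_toy = np.ones((1000, 30))
x_dropped = dropout_sketch(x_toy, rate=0.2, rng=rng)

print("fraction zeroed: {0:.3f}".format(np.mean(x_dropped == 0)))
print("mean after drop: {0:.3f}".format(x_dropped.mean()))
```

Roughly 20% of the entries are zeroed and the others are scaled up to 1.25, so the mean stays close to one. Dropout is only active during training; at prediction time the data passes through unchanged.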

Go back to your 2-layer, 64-neuron LSTM with some regularisation and add

* a `dropout` with parameter 0.2
* a `recurrent_dropout` with parameter 0.05

You will need at least 25 epochs to get decent results. 

(why these parameters? well...)

In [0]:
lstm1 = LSTM(64, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
            bias_initializer='zeros', unit_forget_bias=True, 
            #-- regularizers 
            kernel_regularizer=keras.regularizers.l1_l2(0.01), 
            recurrent_regularizer=keras.regularizers.l1_l2(0.01),
            bias_regularizer=None, activity_regularizer=None, 
            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, 
            #-- dropout
            dropout=0.2, 
            recurrent_dropout=0.05,
            implementation=1, return_sequences=True, return_state=False, 
            go_backwards=False, stateful=False, unroll=False)(inputs)

lstm2 = LSTM(64, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
            bias_initializer='zeros', unit_forget_bias=True,
            #-- regularizers 
            kernel_regularizer=keras.regularizers.l1_l2(0.01), 
            recurrent_regularizer=keras.regularizers.l1_l2(0.01),
            bias_regularizer=None, activity_regularizer=None, 
            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None,
            #-- dropout
            dropout=0.2, 
            recurrent_dropout=0.05,
            implementation=1, return_sequences=True, return_state=False, 
            go_backwards=False, stateful=False, unroll=False)(lstm1)

predictions = TimeDistributed(Dense(2, activation='softmax'))(lstm2)
lstm64x64_r2_model = Model(inputs=inputs, outputs=predictions)
lstm64x64_r2_model.compile(optimizer='rmsprop',
                           loss='categorical_crossentropy',
                           metrics=['accuracy'])

lstm64x64_r2_model.fit(XTrain_s_batch, yTrain_batch, epochs=25)
y_pred_lstm64x64_r2 = lstm64x64_r2_model.predict(XTest_s_batch)


In [0]:
fpr_lstm64x64_r2, tpr_lstm64x64_r2, thresh_lstm64x64_r2 = roc_curve(
    _3d_to_2d(yTest_batch)[:, 1], _3d_to_2d(y_pred_lstm64x64_r2)[:, 1])

plt.figure(figsize=(8, 6))
lw = 2
plt.plot(fpr_lstm, tpr_lstm, color='red',
         lw=lw, label='LSTM 64')
plt.plot(fpr_lstm64x64, tpr_lstm64x64, color='magenta',
         lw=lw, label='LSTM 64 x 64')
plt.plot(fpr_lstm32x32, tpr_lstm32x32, color='darkgreen',
         lw=lw, label='LSTM 32 x 32')
plt.plot(fpr_lstm64x64_r, tpr_lstm64x64_r, color='orange',
         lw=lw, label='LSTM 64 x 64 + reg')
plt.plot(fpr_lstm64x64_r2, tpr_lstm64x64_r2, color='cyan',
         lw=lw, label='LSTM 64 x 64 + reg2')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, ls='--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('FPR', fontsize=12)
plt.ylabel('TPR', fontsize=12)
plt.legend(fontsize=12)

print("LSTM 64        AUC: {0:.4f}".format(auc(fpr_lstm, tpr_lstm)))
print("LSTM 64 x 64   AUC: {0:.4f}".format(auc(fpr_lstm64x64, tpr_lstm64x64)))
print("LSTM 32 x 32   AUC: {0:.4f}".format(auc(fpr_lstm32x32, tpr_lstm32x32)))
print("LSTM 64 x 64r  AUC: {0:.4f}".format(auc(fpr_lstm64x64_r, tpr_lstm64x64_r)))
print("LSTM 64 x 64r2 AUC: {0:.4f}".format(auc(fpr_lstm64x64_r2, tpr_lstm64x64_r2)))


Well, well... 

The take-home message here is a bit disappointing but very important: regularisation is difficult to tune, requires a lot of practice, and it doesn't hurt to have large computational resources...


## Bi-directional RNN

An RNN can be run simultaneously in "both directions":

* one "forward" in time
* one "backward" in time

If you think about a sentence as a sequence of words, this amounts to having one RNN read the words in their natural order and another one read the sentence backward (which can be useful, for example, in languages that place verbs at the end of the sentence). 

In the context of credit card fraud detection, it is not very appropriate (we want to learn online, as new transactions come in, and not a posteriori) but we can still show how it works. 

After our rather unsuccessful attempt with regularisation, we'll keep things simple and just duplicate the LSTM layer that worked well before: one forward, one backward. 

In [0]:
# Forward cell
lstm_fwd = LSTM(64, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
            bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, 
            recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, 
            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, 
            recurrent_dropout=0.0, implementation=1, return_sequences=True, return_state=False, 
            ### GO FORWARD
            go_backwards=False, stateful=False, unroll=False)(inputs)

# Note two important things
# 1) we turn on the go_backwards parameter
# 2) we give the same input (inputs) to the backward LSTM
lstm_bck = LSTM(64, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
            bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, 
            recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, 
            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, 
            recurrent_dropout=0.0, implementation=1, return_sequences=True, return_state=False, 
            ### GO BACKWARD
            go_backwards=True, stateful=False, unroll=False)(inputs)

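# with go_backwards=True, Keras returns the output sequence in reversed
# time order, so flip it back to align it with the forward output
lstm_bck = keras.layers.Lambda(
    lambda x: keras.backend.reverse(x, axes=1))(lstm_bck)

# now combine the results of the two layers; several options exist,
# but the most common one is to concatenate them
merge = keras.layers.Concatenate(axis=-1)([lstm_fwd, lstm_bck])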

predictions = TimeDistributed(Dense(2, activation='softmax'))(merge)
bidir_model = Model(inputs=inputs, 
                    outputs=predictions)
bidir_model.compile(optimizer='rmsprop',
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])

bidir_model.fit(XTrain_s_batch, yTrain_batch, epochs=15)
y_pred_bidir = bidir_model.predict(XTest_s_batch)
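
A detail worth keeping in mind when merging the two directions: a layer that reads the sequence backwards produces its outputs in reversed time order (Keras documents this for `go_backwards=True`), so they have to be flipped back along the time axis for position t to refer to the same transaction in both halves. The `keras.layers.Bidirectional` wrapper takes care of this for you. The toy NumPy sketch below uses a running sum as a stand-in for a recurrent layer (an assumption made purely to keep the numbers readable).

```python
import numpy as np

def running_sum(x):
    """Toy 'recurrent layer': the state at each step is the sum of inputs read so far."""
    return np.cumsum(x, axis=1)

x_toy = np.array([[1.0, 2.0, 3.0, 4.0]])   # one sequence, four time steps

fwd = running_sum(x_toy)                   # reads t = 0, 1, 2, 3 -> 1, 3, 6, 10
bck_raw = running_sum(x_toy[:, ::-1])      # reads t = 3, 2, 1, 0 -> 4, 7, 9, 10
bck_aligned = bck_raw[:, ::-1]             # back in time order   -> 10, 9, 7, 4

# stack the two views feature-wise, one pair per time step
merged = np.stack([fwd, bck_aligned], axis=-1)
print(merged.shape)   # (1, 4, 2)
print(merged[0, 0])   # time step 0: past so far (1) and everything from t=0 onwards (10)
```

After alignment, each time step combines a summary of the past (forward pass) with a summary of the future (backward pass), which is the whole point of a bidirectional model.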

In [0]:
fpr_bidir, tpr_bidir, thresh_bidir = roc_curve(
    _3d_to_2d(yTest_batch)[:, 1], _3d_to_2d(y_pred_bidir)[:, 1])

plt.figure(figsize=(8, 6))
lw = 2
plt.plot(fpr_lstm, tpr_lstm, color='red',
         lw=lw, label='LSTM 64')
plt.plot(fpr_lstm64x64, tpr_lstm64x64, color='magenta',
         lw=lw, label='LSTM 64 x 64')
plt.plot(fpr_lstm32x32, tpr_lstm32x32, color='darkgreen',
         lw=lw, label='LSTM 32 x 32')
plt.plot(fpr_bidir, tpr_bidir, color='orange',
         lw=lw, label='LSTM 64 bidir')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, ls='--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('FPR', fontsize=12)
plt.ylabel('TPR', fontsize=12)
plt.legend(fontsize=12)

print("LSTM 64        AUC: {0:.4f}".format(auc(fpr_lstm, tpr_lstm)))
print("LSTM 64 x 64   AUC: {0:.4f}".format(auc(fpr_lstm64x64, tpr_lstm64x64)))
print("LSTM 32 x 32   AUC: {0:.4f}".format(auc(fpr_lstm32x32, tpr_lstm32x32)))
print("LSTM 64 bidir  AUC: {0:.4f}".format(auc(fpr_bidir, tpr_bidir)))
