# The return of Time Series

### Can you use a CNN for Time Series Data?

**Answer:** Of course you can, with some creativity. The secret is to represent the time series as an "image".

In this notebook we consider the problem of detecting fraud in credit card transactions, this time including a temporal component (when each transaction took place). 

In [0]:
# set up the project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Nothing new here
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

# pipeline
from sklearn.pipeline import Pipeline

# Needed to build custom transformers in Sklearn
from sklearn.base import TransformerMixin, BaseEstimator

#our deep learning super power
import keras

### Load the dataset

Load the dataset `data/creditcard.csv` into Pandas and inspect it. 
The data is available at [this link](https://s3-eu-west-1.amazonaws.com/humongousdata/creditcard.csv); save it somewhere appropriate such as `./data/creditcard.csv`. 

In [0]:
# ccfd for "credit-card fraud data"
ccfd = pd.read_csv('./data/creditcard.csv')
ccfd.head()


As you can see, there are 4 types of columns:

- **Time**: these are ordinal values, not datetime objects. They are non-decreasing ($t_1\le t_2\le \dots$), so several transactions can share the same value and we cannot use them as a unique index
- **V[N]**: these are PCA features
- **Amount**: transaction amount (in USD)
- **Class**: 0 - Genuine/ 1 - Fraud

As you already know, credit card fraud detection is a *hard problem* because the data is highly imbalanced. 
Use an appropriate plot to show the imbalance between the two classes. 

In [0]:
plt.figure(figsize=(8, 6))
ccfd["Class"].hist(alpha=0.4)
plt.grid(False)  # a boolean is the reliable way to switch the grid off
plt.xticks([0, 1], ["Non-Fraud", "Fraud"], fontsize=12)
plt.yticks([1e5, 2e5], ["0.1", "0.2"], fontsize=12)
plt.ylabel("Number of occurrences in millions", fontsize=14)


### Exploring imbalance

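Write a function `print_class_distribution` that takes a `y` vector and prints the number of 0s (Genuine) and 1s (Fraud) along with the percentage of fraud cases. Apply it to the data. 

The function `np.bincount` is particularly useful here; `np.unique(..., return_counts=True)` is an alternative (the older `scipy.stats.itemfreq` has been removed from recent SciPy releases). 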

In [0]:
def print_class_distribution(ys):
    # flatten to a 1D integer vector (np.bincount needs non-negative integers)
    ys = np.asarray(ys).reshape(-1).astype(int)
    bins = np.bincount(ys)
    print("Genuine: {} / Fraud: {} = {:.3f}%".format(
        bins[0], bins[1], 100 * bins[1] / np.sum(bins)))

print_class_distribution(ccfd["Class"].values)


So **less than 0.5%** of the transactions are fraud.
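
As a small sketch of the alternative to `np.bincount`, the counts can also be obtained with `np.unique(..., return_counts=True)`, which does not require the labels to be small non-negative integers. The label vector `toy_labels` below is made up for illustration and is not the credit card data.

```python
import numpy as np

# hypothetical toy labels: 0 = genuine, 1 = fraud (not the real dataset)
toy_labels = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])

# np.unique returns the sorted distinct values and how often each occurs
values, counts = np.unique(toy_labels, return_counts=True)
distribution = dict(zip(values.tolist(), counts.tolist()))
fraud_pct = 100 * distribution[1] / toy_labels.size

print(distribution)
print("fraud share: {:.1f}%".format(fraud_pct))
```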

### Train-test split

How are we going to split the data? Can we use the built-in sklearn function as is? Unfortunately not: by default it shuffles the rows, whereas we need to preserve the chronological order (train on the past, test on the future). 

The function below is a rudimentary implementation that takes a dataframe and returns a training and test set. 

In [0]:
def train_test_split_time_series(df, test_ratio=0.3, time="Time", labels="Class"):
    # sort a copy by the time column (a stable sort keeps ties in their original
    # order) and reset the index so that positions follow chronology
    df = df.sort_values(time, kind="mergesort").reset_index(drop=True)
    # number of rows
    total_samples = df.shape[0]
    # splitting position (e.g.: to take first 80% as training then test_ratio=0.2)
    train_idx = int(total_samples * (1 - test_ratio))
    # separate the features from the labels
    features = df.drop(columns=[labels])
    targets = df[[labels]]  # double brackets keep the (samples, 1) shape
    # positional slicing: unlike .loc, .iloc excludes the end point, so no row
    # ends up in both sets
    XTrain = features.iloc[:train_idx].values
    yTrain = targets.iloc[:train_idx].values
    XTest = features.iloc[train_idx:].values
    yTest = targets.iloc[train_idx:].values

    return XTrain, yTrain, XTest, yTest

Apply the function above on the dataframe and show the shapes of the objects created, check everything makes sense.

In [0]:
XTrain, yTrain, XTest, yTest = train_test_split_time_series(ccfd)
print(XTrain.shape)
print(yTrain.shape)
print(XTest.shape)
print(yTest.shape)


Check the proportion of fraud cases in the training and test set, make sure it kind of looks ok (you can use the function you defined earlier for this). 

**Question**: could you use stratification here? If so, how? And what is the underlying assumption about the proportion of fraudulent transactions through time?

In [0]:
print_class_distribution(yTrain)
print_class_distribution(yTest)
print_class_distribution(ccfd["Class"].values)


You can observe that there are proportionally more fraud samples in the training set. 
What is the implication of this? (discuss)
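
One way to look at this more closely is to compute the fraud rate over consecutive chunks of the time-ordered labels. The sketch below assumes the labels are already sorted chronologically; the helper `fraud_rate_per_chunk` and the toy labels are hypothetical and only illustrate the idea.

```python
import numpy as np

def fraud_rate_per_chunk(labels_in_time_order, n_chunks=5):
    """Fraud percentage in each of n_chunks consecutive, roughly equal chunks."""
    chunks = np.array_split(np.asarray(labels_in_time_order), n_chunks)
    return [100 * chunk.mean() for chunk in chunks]

# hypothetical labels, already sorted by time: fraud gets rarer later on
toy_labels = np.array([1, 0, 1, 0, 0,
                       1, 0, 0, 0, 0,
                       0, 0, 1, 0, 0,
                       0, 0, 0, 0, 0])
rates = fraud_rate_per_chunk(toy_labels, n_chunks=4)
print(rates)
```

If the rate drifts like this in the real data, a model (and a decision threshold) tuned on the earlier period may not transfer well to the later one.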
 
### Preprocessing

As usual, it's a good idea to standardise the columns.
Note that PCA output may in fact already be scaled, so check this first; if it is not, apply a standard scaler (e.g. via a `Pipeline` fitted on the training data only, so that exactly the same transformation is applied to training and test data). 

In [0]:
pipeline = Pipeline([('scaling', StandardScaler())])
preprocessor = pipeline.fit(XTrain)
XTrain_s = preprocessor.transform(XTrain)
XTest_s = preprocessor.transform(XTest)


Check the mean and variance of each column of the transformed training set to confirm they match your intuition.

In [0]:
# np.abs avoids printing "-0.00" for means that are tiny negative numbers
for i in range(XTrain_s.shape[1]):
    print("mean: {0:.2f}, var: {1:.2f}".format(
            np.abs(np.mean(XTrain_s[:, i])), 
            np.var(XTrain_s[:, i])))


### Getting a feel for the data

Have a look at the first dimension (time), of the training data and show a histogram/distplot.  
What can you observe and how do you interpret it? 

In [0]:
plt.figure(figsize=(8, 6))
# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(XTrain_s[:, 0], kde=True, stat="density")
plt.xlabel("Time (normalised)", fontsize=12)
plt.ylabel("Density", fontsize=12)


What that histogram shows is effectively the _number of operations over a normalised timescale_ (left is earlier, right is later and, therefore, closer to the test set). 
So modes happen at times where there are a lot of transactions. 

It is a trimodal (*three modes*) distribution... How does that affect the design of our model? 

Note:
- We do not know the time span of the period we are looking at
- There could be different seasonality during each of the distinct time periods
- We could be evaluating on a different distribution (i.e. the test set does not bear much statistical resemblance to the training set), which will lead to poor generalisation. 

Long story short, always be careful and ask yourself whether your test set allows you to make any strong statement about the generalisability of your model. 
This is a *hard question* in general with no simple rule of thumb. 

### Prepare data for CNN application

The following function reshapes a matrix into a number of batches (smaller matrices with fewer rows and the same number of columns). 
Have a look at the code (and at the output) and check it makes sense. 

In [0]:
def reshape_to_batches(matrix, batch_size):
    # number of batches, rounded up (np.ceil(4.3) == 5.0) and cast to int
    batch_num = int(np.ceil(matrix.shape[0] / batch_size))
    # number of rows missing to fill the last batch
    missing = batch_num * batch_size - matrix.shape[0]
    if missing != 0: # number of rows not divisible by batch_size
        # pad the matrix with 0-rows at the end
        padding = np.zeros((missing, matrix.shape[1]))
        matrix = np.vstack((matrix, padding))
        
    return np.array(np.split(matrix, batch_num))


# Let's see how this works
matrix = np.zeros((7, 5))
print(matrix.shape)
print(reshape_to_batches(matrix, 3).shape)

Do the same on the training data with batches of size 100. 

In [0]:
print(XTrain_s.shape) # original dimensions
XTrain_s_batch = reshape_to_batches(XTrain_s, 100)

print(XTrain_s_batch.shape) # now in batches


### More dimension adjustments

Since we are going to use the `categorical_crossentropy` loss function (the standard loss for classification with one-hot encoded labels; with two classes it plays the same role as binary cross-entropy), we need to transform our class labels into a one-hot matrix of 1s and 0s of shape `(samples, classes)`. 
Execute the cell below and make sure you understand the result.

In [0]:
# importing from keras.utils also works in releases without keras.utils.np_utils
from keras.utils import to_categorical

y_binary = to_categorical(yTrain)

print(yTrain.shape)   # vector with 0 - 1 entries (-> 1 column)
print(y_binary.shape) # 2 classes: genuine or fraud (-> 2 columns)

print()
print("First three rows before binarization...")
print(yTrain[:3])
print("First three rows after binarization...")
print(y_binary[:3, :])

yTrain_batch = reshape_to_batches(y_binary, 100)

print()
print(yTrain_batch.shape)
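
To make the one-hot step concrete, here is a minimal NumPy sketch of what such an encoding looks like for 0/1 labels. The helper `one_hot` is hypothetical (it is not Keras' implementation of `to_categorical`) and the labels are made up.

```python
import numpy as np

def one_hot(labels, num_classes=2):
    """Turn integer labels of any shape into rows of a (samples, classes) matrix."""
    labels = np.asarray(labels).reshape(-1).astype(int)
    # row k of the identity matrix is the one-hot vector for class k
    return np.eye(num_classes)[labels]

toy_y = np.array([[0], [1], [0]])   # same (samples, 1) layout as yTrain
toy_binary = one_hot(toy_y)
print(toy_binary)
```

Note that the zero rows appended by `reshape_to_batches` are `[0, 0]` in this encoding, i.e. they belong to neither class.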

## A CNN for the data

We will try to apply a CNN on the data. 
Where we had used 2-D convolutions on images, we now use 1-D convolutions on a sequence but the principle is the same. 
Remember that a convolution in 2D is done with the following steps:

- for a given patch size e.g.: `3x3`
- for a given kernel `K` corresponding to that size (i.e. `(3, 3)`) 
- take patches of the given size on the image, e.g. let's call one `I` 
- compute the element-wise product of the entries of the patch `I` and the kernel `K` and sum:

$$ \sum_{r,s=1}^3 I_{rs} K_{rs} $$

So the adaptation to something that operates in "time" instead of "space" is reasonably straightforward:

- for a time segment size e.g.: 9 steps
- for a given "kernel" corresponding to that size (i.e. `(9,)`)
- take segments of the given size in the time series e.g. let's call one `I`
- compute the element-wise product of the entries and the kernel and sum:

$$ \sum_{r=1}^9 I_r K_r $$

So the principle is identical, we try to learn "kernels" which can pick up features / patterns over time windows of a given size.
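
The sum above can be sketched in a few lines of NumPy. This is only an illustration of the formula, using a kernel of width 2 instead of 9 to keep the numbers readable; the series and the kernel are made up, and `conv1d_valid` is a hypothetical helper.

```python
import numpy as np

def conv1d_valid(series, kernel):
    """Slide the kernel along the series; each output is sum_r I_r * K_r."""
    width = len(kernel)
    return np.array([np.dot(series[start:start + width], kernel)
                     for start in range(len(series) - width + 1)])

toy_series = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
diff_kernel = np.array([-1.0, 1.0])   # hypothetical kernel: responds to increases

out = conv1d_valid(toy_series, diff_kernel)
print(out)
```

Here the kernel happens to compute the step-to-step increase; a CNN learns the kernel entries instead of having them fixed by hand.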

In practice, with Keras, you need to import `Conv1D` to do this. 

In [0]:
#import all dependencies
from keras.layers import Input, Dense, Conv1D
from keras.models import Model

First we need to define what the input will look like. 
For computational reasons, we want to process several transactions at a time, but not so many that the transfer time outweighs the computational gain. 
Hence, we are going to feed 100 transactions at a time, each with the 30 features. 

We can specify this using the `Input` constructor from Keras.

In [0]:
inputs = Input(shape=(100, 30)) # This returns a tensor

### Now comes the most important part!

We use 1D Conv since we are only going to stride one way (along the time axis). 
We define 32 kernels (features) and a kernel size (sliding window) of 5. 
Note that the `strides` parameter indicates how much to shift the kernel, just as for an autoregressor.  
 
The last important point is that we give the output of the previous layer as input to this layer `Conv1D(...)`.
For this we use Keras' syntax to chain Layers:

`A_Layer_Constructor(...options to define the new layer...)(the_previous_layer)` 

or, more specifically here

`Conv1D(...)(inputs)`

where inputs was the layer we just defined indicating the dimension of the input data.

In [0]:
# a layer instance is callable on a tensor, and returns a tensor

conv1 = Conv1D(32, (5),           # 32 filters with a window of width 5
               strides=1,         # think autoregression
               padding='causal',  # forward in time
               dilation_rate=1,   # default value, ignore this for now
               activation='relu', # not a default: without it the layer has no activation
               use_bias=True,     # this and everything that follows are default values
               kernel_initializer='glorot_uniform', 
               bias_initializer='zeros',
               kernel_regularizer=None, # no regularisation for the moment
               bias_regularizer=None, 
               activity_regularizer=None,
               kernel_constraint=None, 
               bias_constraint=None)(inputs) # syntax to chain layers: Layer(...)(PreviousLayer)
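
To get a feel for the shapes involved, the sketch below reproduces in NumPy what a causal 1-D convolution with 32 filters of width 5 computes on one batch of 100 transactions with 30 features. It is an illustration under stated assumptions, not Keras' implementation: the kernels are random rather than learned, and `causal_conv1d` is a hypothetical helper.

```python
import numpy as np

def causal_conv1d(window, kernels, bias):
    """window: (steps, features); kernels: (width, features, filters); bias: (filters,)."""
    width = kernels.shape[0]
    # pad width-1 zero rows *before* the window so step t never sees later steps
    padded = np.vstack([np.zeros((width - 1, window.shape[1])), window])
    out = np.stack([
        np.tensordot(padded[t:t + width], kernels, axes=([0, 1], [0, 1])) + bias
        for t in range(window.shape[0])
    ])
    return np.maximum(out, 0.0)  # relu

rng = np.random.default_rng(0)
window = rng.normal(size=(100, 30))      # one batch: 100 transactions, 30 features
kernels = rng.normal(size=(5, 30, 32))   # width 5, 30 input features, 32 filters
bias = np.zeros(32)

out = causal_conv1d(window, kernels, bias)
n_params = kernels.size + bias.size      # 5 * 30 * 32 weights + 32 biases
print(out.shape, n_params)
```

With causal padding the output keeps the 100 time steps, and changing later transactions leaves the outputs for earlier ones untouched.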


Let's now add a fully-connected layer with 64 neurons and relu activations after that (note that the choice of 64 is fairly arbitrary, we picked it to have something "large but not too large" but there's not much more than guesswork here as, unfortunately, with much of "deep learning"). 
Again, we chain that layer to the previous one.

In [0]:
fc1 = Dense(64, activation='relu')(conv1)

Finally, for the output, we need a softmax layer with 2 neurons (two classes: 0/1). 

In [0]:
predictions = Dense(2, activation='softmax')(fc1)

Since credit card fraud detection is a binary classification problem, as an exercise, you can investigate the effect of changing the predictions layer from `softmax` to a 1-dimensional layer with a sigmoid activation (why? what does it change?).
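
As a hint for this exercise, the sketch below checks numerically that a 2-way softmax and a sigmoid applied to the difference of the two pre-activation outputs give the same fraud probability. The logits are made up, and `softmax` and `sigmoid` are plain NumPy helpers written for this illustration.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# hypothetical pre-activation outputs (logits) for [genuine, fraud]
logits = np.array([[2.0, -1.0], [0.5, 0.5], [-3.0, 1.0]])

p_fraud_softmax = softmax(logits)[:, 1]
p_fraud_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])
print(p_fraud_softmax)
print(p_fraud_sigmoid)
```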

Using Keras' functional API we can define the model and compile it with the loss etc (this part should feel familiar).
Note that a model can have **more than one** input and output!

In [0]:
# wrapping the model, mentioning the input and output layers
model = Model(inputs=inputs, 
              outputs=predictions)
# compiling the model, here we choose "rmsprop" to do the training but you could use Adam etc
# the loss is the standard loss for classification and we want to show the accuracy.
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
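
For reference, the categorical cross-entropy can be written out by hand for one-hot labels. The function below is a sketch of the textbook formula averaged over samples; how Keras clips and averages internally is not reproduced here, and the probabilities are made up.

```python
import numpy as np

def categorical_crossentropy(y_true_one_hot, y_prob, eps=1e-7):
    """Mean over samples of -sum_c y_c * log(p_c)."""
    y_prob = np.clip(y_prob, eps, 1.0)   # avoid log(0)
    return float(np.mean(-np.sum(y_true_one_hot * np.log(y_prob), axis=-1)))

y_true = np.array([[1, 0], [0, 1]])            # one genuine, one fraud
confident = np.array([[0.9, 0.1], [0.2, 0.8]]) # hypothetical model outputs
hesitant = np.array([[0.6, 0.4], [0.5, 0.5]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_hesitant = categorical_crossentropy(y_true, hesitant)
print(loss_confident, loss_hesitant)
```

Confident and correct predictions give a lower loss. In this formula an all-zero label row, such as the padding rows, contributes nothing to the sum.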

Fit the model on the batched training data for 30 epochs; you should see the reported accuracy increase dramatically.

In [0]:
model.fit(XTrain_s_batch, yTrain_batch, epochs=30) 

### Evaluation

Let's see how our model is doing. 

Questions:
1. what accuracy is reported here? (evaluated over what data?)
1. is accuracy a good metric to use in our current use case?
1. what would be a better evaluation metric?

### Evaluation on the test set

We need to apply the same pre-processing as on the training data.

In [0]:
#first transform the test data into the appropriate shape
print(XTest_s.shape)
xTest_s_batch = reshape_to_batches(XTest_s, 100)
print(xTest_s_batch.shape)
y_binary = to_categorical(yTest)
print(yTest.shape)
print(y_binary.shape)
yTest_batch = reshape_to_batches(y_binary, 100)
print(yTest_batch.shape)

In [0]:
#make the prediction with the trained model
y_pred = model.predict(xTest_s_batch)

In [0]:
# store the raw predictions we will need them in a bit
y_hat = np.copy(y_pred)

We will use the scikit-learn implementation of precision, recall and f1, which means we need to reshape the tensors (N-dimensional arrays) into 2D.

In [0]:
# a function to reshape batches into the original shape
def _3d_to_2d(arr):
    return arr.reshape(arr.shape[0] * arr.shape[1], arr.shape[2])


_3d_to_2d(y_pred).shape
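
Keep in mind that `reshape_to_batches` may have appended zero rows, and that these are still present after flattening; their label rows are `[0, 0]`, so in column 1 they look like genuine transactions. One way to keep them out of the metrics is sketched below, assuming the padding is always appended at the end (as in `reshape_to_batches`). The helper `unbatch` is hypothetical and the arrays are toy data.

```python
import numpy as np

def unbatch(batched, n_original_rows):
    """Flatten (batches, rows, cols) to 2D and drop the zero rows added as padding."""
    flat = batched.reshape(-1, batched.shape[-1])
    return flat[:n_original_rows]

# hypothetical example: 7 real rows padded to 3 batches of 3 rows
real = np.arange(14, dtype=float).reshape(7, 2)
padded = np.vstack([real, np.zeros((2, 2))]).reshape(3, 3, 2)

restored = unbatch(padded, n_original_rows=7)
print(restored.shape)
```

On the notebook's data the number of original rows would be `XTest_s.shape[0]`.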

Sklearn metrics functions expect 1D vectors: hard class labels for the confusion matrix and the classification report, and a probability or confidence score for the ROC curve. 
Further, since our labels are binary labels, we can only compare them if our results are also binary. 
Hence, we are going to make a simplifying assumption: all classifications where there is a higher than 50% chance for a given class are going to be assigned that class and vice versa. 

**Question**: is 50% a good threshold here? if not, why not?

Let's stick with that threshold for now and we will revisit in a bit. 

In [0]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc

threshold = 0.5
y_pred[np.where(y_pred >= threshold)] = 1
y_pred[np.where(y_pred < threshold)]  = 0

print(confusion_matrix(
        _3d_to_2d(yTest_batch)[:, 1], 
        _3d_to_2d(y_pred)[:, 1]))

print()

print(classification_report(
        _3d_to_2d(yTest_batch)[:, 1], 
        _3d_to_2d(y_pred)[:, 1],
        target_names = ["Genuine", "Fraud"],
        digits = 5))

This is mostly pretty good but, as always with highly imbalanced data, you need to be careful.
Remember that, taking fraud as the positive class, the confusion matrix reads as:

|                | Genuine (pred) | Fraud (pred) |
|----------------|----------------|--------------|
| Genuine (true) | TN             | FP           |
| Fraud (true)   | FN             | TP           |

so there were quite a few fraudulent transactions that were considered "fine" by the algorithm, which is really not what you want! `FN` is large compared to `FN+TP`, so the fraud recall `TP/(TP+FN)` is quite poor here even though everything else "looks great".
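
The sketch below shows how high accuracy and poor fraud recall can coexist, taking fraud (label 1) as the positive class. The labels and predictions are made up, and `confusion_counts` is a hypothetical helper that mirrors the row and column layout of sklearn's `confusion_matrix` for labels 0 and 1.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) with fraud (1) as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return tn, fp, fn, tp

# hypothetical labels and predictions: 8 genuine, 4 fraud
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)        # share of frauds that were caught
precision = tp / (tp + fp)     # share of fraud flags that were right
print([[tn, fp], [fn, tp]], accuracy, recall, precision)
```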

### ROC or how performances change with threshold

As hinted at above, the threshold was picked _arbitrarily_ at 50% and, as you may have guessed, the threshold is rather suboptimal in this case. 
The selection of threshold is a subtle operation which is best done using what is known as the "ROC curve". 
The ROC curve effectively computes the confusion matrix for a range of thresholds from 0 to 1 and displays the recall versus the Fall-Out or False Positive Rate (FPR). 

- True Positive Rate (TPR) also _Recall_ or _sensitivity_: `TP/(TP+FN)`
- False Positive Rate (FPR), also _fall-out_: `FP/(FP+TN)`

Try to think, in a general sense and for an arbitrary binary classifier, about what this curve should look like by reflecting on the extreme cases:

1. what happens if the threshold is set to something very close to `1` like `0.99`?
1. what happens if the threshold is set to something very close to `0` like `0.01`?
1. what is an ideal pair TPR/FPR?

Once this curve is computed, the AUC or _area under the curve_ is a metric of performance of the algorithm.
The ideal curve maximises the AUC, try to draw it and think about what it means. 
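
To see how the curve is built, the sketch below computes TPR and FPR by hand for a handful of thresholds on made-up scores, then the area with the trapezoidal rule. It is a simplified illustration of the idea, not sklearn's `roc_curve`; `roc_points` is a hypothetical helper.

```python
import numpy as np

def roc_points(y_true, scores, thresholds):
    """TPR and FPR when 'fraud' is predicted for every score >= threshold."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    tpr, fpr = [], []
    for thr in thresholds:
        flagged = scores >= thr
        tpr.append(np.sum(flagged & (y_true == 1)) / np.sum(y_true == 1))
        fpr.append(np.sum(flagged & (y_true == 0)) / np.sum(y_true == 0))
    return np.array(fpr), np.array(tpr)

# hypothetical fraud scores for 4 genuine (0) and 2 fraudulent (1) transactions
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.7, 0.6, 0.9]

# from strict (flag nothing) to lax (flag everything)
thresholds = [1.1, 0.9, 0.7, 0.6, 0.0]
fpr, tpr = roc_points(y_true, scores, thresholds)

# area under the curve with the trapezoidal rule
area = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(fpr, tpr, area)
```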

In the cell below, we compute the ROC and AUC for the classifier we just trained and for a range of thresholds. 


In [0]:
from sklearn.metrics import roc_curve, auc, roc_auc_score
# long way, allows to plot the curve
fpr, tpr, thresholds = roc_curve(_3d_to_2d(yTest_batch)[:, 1], 
                                 _3d_to_2d(y_hat)[:, 1])
print(auc(fpr, tpr))

# short way, gives the AUC directly
print(roc_auc_score(_3d_to_2d(yTest_batch)[:, 1], 
                    _3d_to_2d(y_hat)[:, 1]))

plt.figure(figsize=(8, 6))
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = {0:.2f})'.format(auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--') # random-guess baseline
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)', fontsize=12)
plt.ylabel('True Positive Rate (TPR)', fontsize=12)
plt.title('Receiver Operating Characteristic', fontsize=12)
plt.legend(loc="lower right", fontsize=12)

In [0]:
# we save the curve to facilitate comparisons later on
import pickle

# the context manager closes the file even if pickling fails
with open("res_cnn.pkl", "wb") as f:
    pickle.dump((fpr, tpr, thresholds, y_pred), f)

Using the ROC curve, we can now **select a threshold**, depending on our requirements. 
It is now a **business decision**.

For example, your client may say that they want the optimal performance for a maximum of 0.6% FPR. 

* Find the threshold where the false positive rate is larger or equal to 0.6%
* Show the threshold as well as the corresponding FPR and TPR 

In [0]:
# find the value of FPR, where it is 0.6% or slightly above
fpr_id = np.where(fpr >= 0.006)[0][0]

print("FPR {:2.2f}%".format(fpr[fpr_id]*100))
print("TPR {:2.2f}%".format(tpr[fpr_id]*100))
print("threshold {:.2e}".format(thresholds[fpr_id]))  # a probability, shown as is


That's a pretty extreme threshold... 
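
If the 0.6% is read as a hard upper limit, an alternative is to take the best operating point whose FPR does not exceed it, rather than the first one at or above it. The sketch below uses made-up arrays in the layout returned by sklearn's `roc_curve`; `pick_threshold` is a hypothetical helper.

```python
import numpy as np

def pick_threshold(fpr, tpr, thresholds, max_fpr):
    """Index of the highest-TPR operating point whose FPR stays within max_fpr."""
    allowed = np.where(np.asarray(fpr) <= max_fpr)[0]
    return int(allowed[np.argmax(np.asarray(tpr)[allowed])])

# hypothetical ROC arrays (not the notebook's results)
fpr = np.array([0.000, 0.002, 0.005, 0.009, 0.050, 1.000])
tpr = np.array([0.00, 0.40, 0.70, 0.75, 0.90, 1.00])
thresholds = np.array([1.9, 0.9, 0.4, 0.1, 0.01, 0.0])

idx = pick_threshold(fpr, tpr, thresholds, max_fpr=0.006)
print(fpr[idx], tpr[idx], thresholds[idx])
```

On these toy numbers the selected point has an FPR of 0.5%, whereas the first point with an FPR of at least 0.6% would sit at 0.9%.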

Show the confusion matrix and classification report with this threshold and discuss

- the precision
- the recall
- the f1-score

in the context of a business where each undetected fraud potentially costs `$$$$` whilst each false flag (flagging a genuine transaction as fraud) costs `$`. 

In [0]:
cutt_off_tr = thresholds[fpr_id]
y_pred[np.where(y_hat >= cutt_off_tr)] = 1
y_pred[np.where(y_hat < cutt_off_tr)]  = 0

print(confusion_matrix(
            _3d_to_2d(yTest_batch)[:, 1],
            _3d_to_2d(y_pred)[:, 1]))

print(classification_report(
        _3d_to_2d(yTest_batch)[:, 1], 
        _3d_to_2d(y_pred)[:, 1],
        target_names = ["Genuine", "Fraud"],
        digits=5))
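
To support the discussion, the business costs can be turned into a single number per threshold. The sketch below is only an illustration: the scores are made up, the 100-to-1 cost ratio is an assumption standing in for `$$$$` versus `$`, and `total_cost` is a hypothetical helper.

```python
import numpy as np

def total_cost(y_true, scores, threshold, cost_missed_fraud, cost_false_flag):
    """Cost of flagging every transaction whose fraud score is >= threshold."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    flagged = scores >= threshold
    missed_fraud = np.sum(~flagged & (y_true == 1))   # false negatives
    false_flags = np.sum(flagged & (y_true == 0))     # false positives
    return float(missed_fraud * cost_missed_fraud + false_flags * cost_false_flag)

# hypothetical scores and made-up costs: a missed fraud costs 100x a false flag
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.7, 0.6, 0.9]
costs = {thr: total_cost(y_true, scores, thr,
                         cost_missed_fraud=100.0, cost_false_flag=1.0)
         for thr in (0.5, 0.65, 0.8)}
best = min(costs, key=costs.get)
print(costs, best)
```

With such asymmetric costs, the cheapest threshold is the one that tolerates a false flag in order to catch every fraud.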
