# Anomaly detection in time series

In this notebook we will train an encoder-decoder model that learns to reconstruct its own inputs -- an autoencoder. We already investigated this type of model in notebook 13 where we used it for denoising images, but now we turn to a different type of task, which is to look for unexpected changes in data, so-called anomalies.

As before, this notebook is based on a Keras example, but give it a try yourself before checking the solution on the Keras website.

## Setup

In [None]:
import numpy as np
import pandas as pd
import keras
from matplotlib import pyplot as plt

## Load the data

The [Numenta Anomaly Benchmark (NAB)](https://www.kaggle.com/boltzmannbrain/nab) dataset includes simulated time series data, labelled in sections of normal and anomalous behaviour. We use two files: `art_daily_small_noise.csv`, which contains normal data that we will use for training, and `art_daily_jumpsup.csv`, which we use for testing.

For simplicity we read the CSV files with the Pandas library.

In [None]:
master_url_root = "https://raw.githubusercontent.com/numenta/NAB/master/data/"

df_small_noise_url_suffix = "artificialNoAnomaly/art_daily_small_noise.csv"
df_small_noise_url = master_url_root + df_small_noise_url_suffix
df_small_noise = pd.read_csv(
    df_small_noise_url, parse_dates=True, index_col="timestamp"
)

df_daily_jumpsup_url_suffix = "artificialWithAnomaly/art_daily_jumpsup.csv"
df_daily_jumpsup_url = master_url_root + df_daily_jumpsup_url_suffix
df_daily_jumpsup = pd.read_csv(
    df_daily_jumpsup_url, parse_dates=True, index_col="timestamp"
)

Print data contents:

In [None]:
print(df_small_noise.head())

print(df_daily_jumpsup.head())

## Visualise the data

The training data looks like this:

In [None]:
fig, ax = plt.subplots()
df_small_noise.plot(legend=False, ax=ax)
plt.show()

Our test data, by contrast, contains an unexpected jump that we should hopefully be able to detect.

In [None]:
fig, ax = plt.subplots()
df_daily_jumpsup.plot(legend=False, ax=ax)
plt.show()

## Prepare training data

Extract the values from the training time series and normalise the
`value` column. There is one `value` every 5 minutes for 14 days.

-   24 * 60 / 5 = **288 timesteps per day**
-   288 * 14 = **4032 data points** in total

As usual, we normalise by subtracting the mean and dividing by the standard deviation.

In [None]:
training_mean = df_small_noise.mean()
training_std = df_small_noise.std()
df_training_value = (df_small_noise - training_mean) / training_std
print("Number of training samples:", len(df_training_value))

### Create fixed-length sequences of data

Create sequences combining `TIME_STEPS` contiguous data values from the
training data.

In [None]:
TIME_STEPS = 288

# Generate overlapping training sequences (sliding window, step 1) for use in the model.
def create_sequences(values, time_steps=TIME_STEPS):
    output = []
    for i in range(len(values) - time_steps + 1):
        output.append(values[i : (i + time_steps)])
    return np.stack(output)


x_train = create_sequences(df_training_value.values)
print("Training input shape: ", x_train.shape)
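To see what the sliding window does, here is a minimal sketch on a toy array. The names `toy_values` and `toy_windows` are made up for illustration, and the helper is a compact re-statement of `create_sequences` above so that the cell runs on its own.

```python
import numpy as np


def create_sequences(values, time_steps):
    """Stack every contiguous window of length `time_steps` (step size 1)."""
    return np.stack(
        [values[i : i + time_steps] for i in range(len(values) - time_steps + 1)]
    )


# Toy column of 10 values, shaped (10, 1) like `df_training_value.values`.
toy_values = np.arange(10).reshape(-1, 1)
toy_windows = create_sequences(toy_values, time_steps=3)

print(toy_windows.shape)        # (8, 3, 1): windows, time steps, features
print(toy_windows[0].ravel())   # first window: [0 1 2]
print(toy_windows[-1].ravel())  # last window:  [7 8 9]

# The same formula applied to the real data: 4032 points, windows of 288 steps.
n_windows = 4032 - 288 + 1
print(n_windows)  # 3745
```

So for the training data we should expect an input shape of `(3745, 288, 1)`.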

### <span style="color: red; font-weight: bold;">Exercise: Build the autoencoder</span>

This time you build the model on your own! Here are some hints for a baseline architecture that should perform reasonably well:

- Two `Conv1D` layers with `kernel_size` around 5-7 (the input is a 1D time series of shape `(TIME_STEPS, 1)`, so we convolve along the time axis only)
- Two `Conv1DTranspose` layers to get back to the initial input shape
- Use either `strides=2` or pooling layers to downsample the data into our "compressed" or "encoded" representation
- Some dropout is probably nice.
- The output is a time series with identical shape to the input.
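
Without giving the model away, the following sketch only tracks how the length of the time axis changes through two strided convolutions and two strided transposed convolutions. It assumes `padding="same"` and `strides=2` everywhere, in which case a convolution yields `ceil(length / strides)` steps and a transposed convolution yields `length * strides` steps. The helper names are hypothetical and no Keras layers are built.

```python
import math


def conv_same_length(length, strides):
    # Assumes padding="same": output length is ceil(length / strides).
    return math.ceil(length / strides)


def conv_transpose_same_length(length, strides):
    # Assumes padding="same": output length is length * strides.
    return length * strides


length = 288  # TIME_STEPS
lengths = [length]

for _ in range(2):  # encoder: two strided convolutions
    length = conv_same_length(length, strides=2)
    lengths.append(length)

for _ in range(2):  # decoder: two strided transposed convolutions
    length = conv_transpose_same_length(length, strides=2)
    lengths.append(length)

print(lengths)  # [288, 144, 72, 144, 288]
```

Under these assumptions the decoder ends up back at 288 time steps, so a final layer with a single output channel can restore the input shape.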


In [None]:
model = ...

model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
model.summary()

## Train the model

For our autoencoder the target is the input, so we need to specify this correctly in the `fit` function.

In [None]:
history = model.fit(
    x_train,
    x_train,
    epochs=50,
    batch_size=128,
    validation_split=0.1,
    callbacks=[
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, mode="min")
    ],
)

Plot training and validation loss to evaluate the training procedure.

In [None]:
plt.plot(history.history["loss"], label="Training Loss")
plt.plot(history.history["val_loss"], label="Validation Loss")
plt.legend()
plt.show()

## Detecting anomalies

We will try to detect anomalies by determining how well our model is able to reconstruct the input data.

Let's use the mean absolute error (MAE) as our metric: the absolute difference between each data point and the prediction, averaged over all data points in the time series.

To classify anomalies, we need to set a threshold for how high an MAE value must be before we consider it anomalous. One way to select the threshold is to compute the MAE for every sequence in the training data and set the threshold to the highest (= worst) value we see. Anything above this value we then consider to be an anomaly.

In [None]:
# Reconstruct all training sequences
x_train_pred = model.predict(x_train)

### <span style="color: red; font-weight: bold;">Exercise:</span>

Compute the mean absolute errors, one value per training sequence -- either by writing the function yourself, or using `keras.metrics.MeanAbsoluteError`.

In [None]:
mean_average_errors = ...

Then we take the maximum as our threshold.

In [None]:
threshold = np.max(mean_average_errors)
print("Reconstruction error threshold: ", threshold)
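
As a toy illustration of the thresholding rule, the sketch below uses random arrays as stand-ins for the training windows and their reconstructions. `toy_x`, `toy_pred` and `new_mae` are hypothetical and have nothing to do with the trained model. It only shows the mechanics: one MAE value per window, the maximum as threshold, and a comparison against that threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 5 windows, 4 time steps, 1 feature.
toy_x = rng.normal(size=(5, 4, 1))
toy_pred = toy_x + rng.normal(scale=0.1, size=toy_x.shape)

# Average the absolute error over the time axis: one MAE value per window.
toy_mae = np.mean(np.abs(toy_pred - toy_x), axis=1).reshape(-1)
toy_threshold = toy_mae.max()

# Made-up errors for two new windows: one below, one above the threshold.
new_mae = np.array([0.5 * toy_threshold, 2.0 * toy_threshold])
flags = (new_mae > toy_threshold).tolist()

print(toy_mae.shape)  # (5,)
print(flags)          # [False, True]
```

Note that with this rule no training window can exceed the threshold, so every training sequence counts as normal by construction.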

## Compare reconstruction

Before we start looking for anomalies, let's see how our model has reconstructed the first sample. These are the 288 timesteps from day 1 of our training dataset.

In [None]:
plt.plot(x_train[0])
plt.plot(x_train_pred[0])
plt.show()

## Prepare test data

Normalise our test data, using the mean and standard deviation of the training data, and create sequences:

In [None]:
df_test_value = (df_daily_jumpsup - training_mean) / training_std
fig, ax = plt.subplots()
df_test_value.plot(legend=False, ax=ax)
plt.show()

# Create sequences from test values.
x_test = create_sequences(df_test_value.values)
print("Test input shape: ", x_test.shape)

## Find anomalies

Now for the real test: Compute MAE for all sequences in the test set, and check if any break the threshold.

In [None]:
x_test_pred = model.predict(x_test)
test_mae_loss = np.mean(np.abs(x_test_pred - x_test), axis=1)
test_mae_loss = test_mae_loss.reshape((-1))

plt.hist(test_mae_loss, bins=50)
plt.xlabel("test MAE loss")
plt.ylabel("No of samples")
plt.show()

# Detect all the samples which are anomalies.
anomalies = test_mae_loss > threshold
print("Number of anomaly samples: ", np.sum(anomalies))
print("Indices of anomaly samples: ", np.where(anomalies))

## Plot anomalies

We now know which samples (sequences) are anomalous. From these we can find
the corresponding timestamps in the original test data, using the following
method:

Let's say time_steps = 3 and we have 10 training values. Our `x_train` will
look like this:

- 0, 1, 2
- 1, 2, 3
- 2, 3, 4
- 3, 4, 5
- 4, 5, 6
- 5, 6, 7
- 6, 7, 8
- 7, 8, 9

Every data value except the first and last `time_steps - 1` values appears in
exactly `time_steps` samples. So, if we know that the samples
(3, 4, 5), (4, 5, 6) and (5, 6, 7) are all anomalies, we can say that the data point
5 is an anomaly.
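
The toy case above can be checked directly. This is a small sketch with made-up flags (`window_is_anomalous` is a hypothetical stand-in for the per-sample anomaly mask), marking a data point only if all `time_steps` windows containing it are anomalous.

```python
import numpy as np

time_steps = 3
n_values = 10

# One flag per window; windows 3, 4 and 5 are (3, 4, 5), (4, 5, 6), (5, 6, 7).
window_is_anomalous = np.zeros(n_values - time_steps + 1, dtype=bool)
window_is_anomalous[[3, 4, 5]] = True

anomalous_points = []
for idx in range(time_steps - 1, n_values - time_steps + 1):
    # The windows starting at idx - time_steps + 1, ..., idx all contain point idx.
    if np.all(window_is_anomalous[idx - time_steps + 1 : idx + 1]):
        anomalous_points.append(idx)

print(anomalous_points)  # [5]
```

Points 4 and 6 are not flagged, because each of them also appears in a window that was classified as normal.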

In [None]:
# Data point i is an anomaly if samples (i - TIME_STEPS + 1) to i, inclusive,
# are all anomalies. The slice end is exclusive, hence data_idx + 1.
anomalous_data_indices = []
for data_idx in range(TIME_STEPS - 1, len(df_test_value) - TIME_STEPS + 1):
    if np.all(anomalies[data_idx - TIME_STEPS + 1 : data_idx + 1]):
        anomalous_data_indices.append(data_idx)

Overlay the anomalies on the original test data plot:

In [None]:
df_subset = df_daily_jumpsup.iloc[anomalous_data_indices]
fig, ax = plt.subplots()
df_daily_jumpsup.plot(legend=False, ax=ax)
df_subset.plot(legend=False, ax=ax, color="r")
plt.show()