<a href="https://colab.research.google.com/github/KristynaPijackova/Tutorials_NNs_and_signals/blob/main/Anomaly_detection_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Anomaly detection with ECG dataset

In our previous task we used the ECG5000 dataset for classification.

However, as we noticed when analyzing the dataset, the classes are heavily imbalanced, which makes the dataset poorly suited to a classification task. We tried to improve this by creating synthetic data with the SMOTE method (Synthetic Minority Oversampling TEchnique). While this helped a little, we could still see that the underrepresented classes got mixed up with each other.

Because the dataset is so imbalanced, it is better to reconsider the problem: instead of classifying signals from hugely under-represented classes, we can learn features from the strongly represented class and do anomaly detection instead.



---

An interactive version of our example: https://anomagram.fastforwardlabs.com/#/


**A few Jupyter notebook shortcuts you can use to make your life easier**

Run cell: `shift + enter`

Undo last action (inside a cell): `ctrl + m + z`

Find and replace: `ctrl + h`

Insert code cell above: `ctrl + m + a`

Insert code cell below: `ctrl + m + b`

Delete cell: `ctrl + m + d`

Let's download the data from http://www.timeseriesclassification.com/Downloads/ECG5000.zip

Use `!wget` and `!unzip`.

Since we don't need separate train and test files, we can concatenate the two into a single file containing all the data.

We can do this with the Unix command `cat`.

We can then check that the file structure still looks the same with `!head`.
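If you prefer to stay in Python, the `cat` step can be sketched roughly as below. The helper `concatenate_files` is hypothetical, and the file names in the usage comment are assumptions about what the unzipped archive contains.

```python
from pathlib import Path


def concatenate_files(sources, target):
    """Write the contents of each source file to target, in order (like `cat a b > c`)."""
    with open(target, "wb") as out:
        for src in sources:
            out.write(Path(src).read_bytes())
    return Path(target)


# Usage (file names are assumptions, adjust them to what `!unzip` produced):
# concatenate_files(["ECG5000_TRAIN.txt", "ECG5000_TEST.txt"], "ECG5000.txt")
```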

Let's import the libraries we are about to use.

## Let's prepare our data

Once again, we create a dataframe with pandas...

`pd.read_csv`

...and add a prefix, because, as we know by now, we cannot work with column names that are purely numerical.

`add_prefix('c')`

So how many samples in each class do we have now? Let's see...

`.value_counts()`
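The three steps above can be sketched on a tiny stand-in for the real file. The whitespace-separated, header-less layout is an assumption about the data file, and the values are made up (the real file has one class column and 140 signal columns).

```python
import io

import pandas as pd

# Tiny stand-in for the real file: first column is the class, the rest is the signal.
raw = io.StringIO(
    "1.0 0.1 0.2 0.3\n"
    "1.0 0.2 0.1 0.4\n"
    "2.0 0.9 0.8 0.7\n"
)

# Whitespace-separated values, no header row (assumed file layout).
df = pd.read_csv(raw, sep=r"\s+", header=None)

# Integer column names 0, 1, 2, ... become 'c0', 'c1', 'c2', ...
df = df.add_prefix("c")

# Number of samples per class, most frequent class first.
class_counts = df["c0"].value_counts()
print(class_counts)
```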

### We split the original data into train and test sets

`train_test_split()`
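A minimal sketch of the split on random stand-in data. The `test_size` and `random_state` values are assumptions, not values prescribed by this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 rows, 1 class column + 140 signal columns (random values).
rng = np.random.default_rng(0)
data = np.hstack([rng.integers(1, 6, size=(100, 1)), rng.normal(size=(100, 140))])

# test_size and random_state are assumptions chosen for illustration.
train, test = train_test_split(data, test_size=0.2, random_state=42)
print(train.shape, test.shape)
```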

### Now we normalize the data

We use the `MinMaxScaler` that we imported from sklearn.

Here `data_scaler` holds the `preprocessing.MinMaxScaler()`. In the second line of the code we fit it on our training data. We then apply it to the data we want to scale with `data_scaler.transform(data_we_want_to_scale)`.

We take columns 1 to 140 since the 0th column holds the classes and we don't want to change their value.
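Here is one way the scaling could look, sketched on random stand-in arrays (the names `train`, `test`, and the `_scaled` copies are assumptions). The scaler is fitted on the training signals only and then applied to both sets, leaving the class column untouched.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)
# Stand-in arrays: column 0 is the class, columns 1..140 are the signal.
train = np.hstack([np.ones((80, 1)), rng.normal(size=(80, 140))])
test = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 140))])

data_scaler = preprocessing.MinMaxScaler()
data_scaler.fit(train[:, 1:])  # learn per-column min and max from the training data only

train_scaled = train.copy()
test_scaled = test.copy()
train_scaled[:, 1:] = data_scaler.transform(train[:, 1:])
# Test values can fall slightly outside [0, 1], because min/max come from the training set.
test_scaled[:, 1:] = data_scaler.transform(test[:, 1:])
```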

#### Separate the data into normal and abnormal sets

We know we have 5 classes: the first one, with index 1, represents normal signals, whilst the other 4 represent heart abnormalities.

What we want is a class 0 that represents the normal ECG signals and a class 1 that holds all the abnormalities. Once again, pandas functions will help us with this.

We also don't need the class labels (the 0th column), so we take only columns 1 to 140.

`query('c0 == 1').values[:,1:]`
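A small sketch of the separation on a made-up dataframe. The variable names `normal` and `abnormal` are assumptions, and using `c0 != 1` to collect all four abnormal classes at once is just one possible way to do it.

```python
import pandas as pd

# Stand-in dataframe: 'c0' is the class (1 = normal, 2..5 = abnormal), 'c1'..'c3' the signal.
df = pd.DataFrame(
    [[1, 0.1, 0.2, 0.3],
     [1, 0.2, 0.1, 0.4],
     [3, 0.9, 0.8, 0.7],
     [5, 0.5, 0.6, 0.9]],
    columns=["c0", "c1", "c2", "c3"],
)

normal = df.query("c0 == 1").values[:, 1:]    # class 1 -> normal signals, labels dropped
abnormal = df.query("c0 != 1").values[:, 1:]  # classes 2..5 -> anomalies, labels dropped
print(normal.shape, abnormal.shape)
```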

## Let's look at our data and the difference

## Create the autoencoder

Define autoencoder

Compile the model

Train the model
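The three steps could look roughly like the sketch below. The layer sizes, the `mae` loss, the optimizer, and the number of epochs are all assumptions made for illustration, and the training data is a random stand-in for the scaled normal signals.

```python
import numpy as np
import tensorflow as tf

n_features = 140  # samples per ECG signal

# Define: layer sizes are assumptions; any bottleneck narrower than the input shows the idea.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(32, activation="relu"),             # encoder
    tf.keras.layers.Dense(8, activation="relu"),              # bottleneck
    tf.keras.layers.Dense(32, activation="relu"),             # decoder
    tf.keras.layers.Dense(n_features, activation="sigmoid"),  # output in [0, 1], like the scaled input
])

# Compile: optimizer and loss are assumptions.
model.compile(optimizer="adam", loss="mae")

# Random stand-in for the scaled normal training signals.
normal_train = np.random.default_rng(0).random((64, n_features)).astype("float32")

# Train: the input is also the target, so the network learns to reproduce what it sees.
history = model.fit(normal_train, normal_train, epochs=2, batch_size=16, verbose=0)
```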

## So what did our model learn?

The autoencoder is made of dense layers and tries to learn the distribution of the signal, so that it can reproduce the signal at its output.

Since we are training it on the normal data, it will learn to reproduce the normal data and will struggle with the abnormalities.

We pass the normal test data into the model and plot the reconstructions in comparison to the original signals.

`model(data).numpy()`

Now we do the same thing, but with the abnormal data.

## Finding threshold

As we can see, the normal data is fitted pretty accurately, whilst the reconstruction of the abnormal data does not really follow the original data.

And that's what we are focusing on. We can now compute the mean squared error `mse` or mean absolute error `mae` of the normal and abnormal data, plot their distributions, and based on that decide where the threshold should be. So let's get into it!


For predictions you can use `model.predict(data)` and for the loss `tf.keras.losses.mse(reconstructed_data, original_data)`.
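To see what such a loss computes, here is a NumPy sketch (swapped in for the Keras function) that returns one error value per signal by averaging over the time axis. The helper `per_sample_errors` and the small arrays are hypothetical.

```python
import numpy as np


def per_sample_errors(original, reconstructed):
    """Return (mse, mae) with one value per signal, averaged over the last axis."""
    diff = np.asarray(reconstructed) - np.asarray(original)
    mse = np.mean(diff ** 2, axis=-1)
    mae = np.mean(np.abs(diff), axis=-1)
    return mse, mae


# Two made-up signals: the first is reconstructed badly at its last sample,
# the second is reconstructed perfectly.
original = np.array([[0.0, 0.5, 1.0], [0.2, 0.2, 0.2]])
reconstructed = np.array([[0.0, 0.5, 0.4], [0.2, 0.2, 0.2]])

mse, mae = per_sample_errors(original, reconstructed)
print(mse, mae)
```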



#### Normal train data


Plot the normal data distribution

`plt.hist(loss, bins=50)`

Get the mean value and standard deviation of the data

`np.mean()` and `np.std()`

Let's decide where our threshold is going to be...

We can take the mean value + 2*std and see what we get. 
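A quick sketch of this rule on made-up loss values (`train_loss` is a random stand-in, so the printed numbers will differ from the ones you get with the real model):

```python
import numpy as np

# Stand-in for the reconstruction losses of the normal training data.
rng = np.random.default_rng(0)
train_loss = rng.gamma(shape=2.0, scale=0.01, size=1000)

threshold = np.mean(train_loss) + 2 * np.std(train_loss)

# Fraction of normal training signals that this threshold would flag as anomalies.
flagged = np.mean(train_loss > threshold)
print(f"threshold = {threshold:.3f}, flagged normal signals = {flagged:.1%}")
```

A larger multiplier flags fewer normal signals but lets more anomalies slip through, so the factor 2 is a starting point rather than a fixed rule.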

If we look at our histogram, 0.064 seems like a good threshold, with most of our normal data lying to the left of it. Now we should see what the distribution of the abnormal data looks like.

## Evaluation

#### Normal data

Calculate the loss on the testing data and plot the histogram

#### Abnormal data

Again, compute the MAE of the predicted abnormal data and display it in the histogram.

It seems like our threshold should work well in detecting most of the abnormalities. So let's put it all together.



```
plt.axvline(threshold, color='r', linewidth=3, linestyle='dashed', label='{:0.3f}'.format(threshold))
```

```
# plot histograms of normal and abnormal data


# plot a vertical line, which displays the threshold


# add a legend and title
```
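One possible way to fill in this cell, sketched with random stand-in losses. The variable names, the loss values, and the title are assumptions; with the real model you would pass in the test losses computed earlier.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
normal_loss = rng.normal(0.03, 0.01, size=500)    # stand-in values
abnormal_loss = rng.normal(0.10, 0.02, size=300)  # stand-in values
threshold = 0.064

fig, ax = plt.subplots()

# plot histograms of normal and abnormal data
ax.hist(normal_loss, bins=50, alpha=0.6, label="normal")
ax.hist(abnormal_loss, bins=50, alpha=0.6, label="abnormal")

# plot a vertical line, which displays the threshold
ax.axvline(threshold, color="r", linewidth=3, linestyle="dashed",
           label="{:0.3f}".format(threshold))

# add a legend and title
ax.legend()
ax.set_title("Reconstruction loss of normal and abnormal test data")
```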



## Confusion matrix, ROC curve

To be able to plot the confusion matrix and the ROC curve, we first need to count how many type I and type II errors we have, i.e. how many false positives and false negatives.

For this we are going to use the `tf.math` functions, where we compare the threshold with the test losses of the normal and abnormal data and get an array of True/False values.

Next we count how many nonzero values we have (non-zero = 1 = True).

`tf.math.less(loss, threshold)`
`tf.math.greater(loss, threshold)`

`tf.math.count_nonzero`
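The same counting can be sketched with the NumPy equivalents `np.less`, `np.greater`, and `np.count_nonzero` (swapped in here for the `tf.math` functions). The loss values are made up, and treating "anomaly" as the positive class is an assumption.

```python
import numpy as np

threshold = 0.064
normal_loss = np.array([0.01, 0.03, 0.07, 0.02])    # stand-in test losses, normal signals
abnormal_loss = np.array([0.10, 0.05, 0.20, 0.09])  # stand-in test losses, abnormal signals

# "Anomaly" is the positive class here.
tn = np.count_nonzero(np.less(normal_loss, threshold))       # normal, correctly passed
fp = np.count_nonzero(np.greater(normal_loss, threshold))    # normal, wrongly flagged (type I)
fn = np.count_nonzero(np.less(abnormal_loss, threshold))     # abnormal, missed (type II)
tp = np.count_nonzero(np.greater(abnormal_loss, threshold))  # abnormal, correctly flagged

print(tn, fp, fn, tp)
```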

To **plot the confusion matrix**, we need to create a list which will hold the values we want to display. 

We can also define the categories to display as ticks: normal/anomalies.

And to plot the data we will use **seaborn**. Seaborn is a library for statistical data visualization that is based on matplotlib but is more user friendly.
Basically, all we need to do is write `sns.heatmap(cm)` and we have our confusion matrix. However, we add a few extras to make it nice and presentable.

Now we combine the first three cells of this block into one function, which will return the true positives, false negatives, true negatives and false positives. We will need this function to give us the values from which we can plot the ROC curve.

Calculate the roc values - tpr and fpr...

...first define empty list which will hold the values

...then create a for loop which will take 100 values from 0 to 1 (use `np.linspace(from, to, steps)`)

...now use the predictor function defined above, where the threshold is the current value (between 0 and 1) of the for loop

...calculate the tpr and fpr for each iteration: `tpr = tp/(tp + fn)` and `fpr = fp/(fp + tn)`

...`append` the tpr and fpr values in the roc list
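The steps above can be sketched as follows. The function name `predict_counts` and the random loss values are hypothetical, and "anomaly" is assumed to be the positive class.

```python
import numpy as np


def predict_counts(normal_loss, abnormal_loss, threshold):
    """Return (tp, fn, tn, fp), treating 'anomaly' as the positive class."""
    tp = np.count_nonzero(abnormal_loss > threshold)
    fn = np.count_nonzero(abnormal_loss <= threshold)
    tn = np.count_nonzero(normal_loss <= threshold)
    fp = np.count_nonzero(normal_loss > threshold)
    return tp, fn, tn, fp


rng = np.random.default_rng(0)
normal_loss = rng.normal(0.03, 0.01, size=500)    # stand-in values
abnormal_loss = rng.normal(0.10, 0.03, size=300)  # stand-in values

roc = []
for threshold in np.linspace(0, 1, 100):
    tp, fn, tn, fp = predict_counts(normal_loss, abnormal_loss, threshold)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    roc.append((fpr, tpr))

roc = np.array(roc)  # column 0: fpr, column 1: tpr
```

With losses as small as these stand-in values, most of the 100 thresholds lie above every loss, so a narrower threshold range may give a smoother curve.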

Plot the ROC curve