# Anomaly Detection: Comparison with Classification
M4U1 - Exercise 2

## What are we going to do?
- We will create a dataset for anomaly detection with normal and anomalous cases
- We will train 2 models: one with Gaussian distribution covariance and one with SVM classification
- We will evaluate both models and graphically represent their results

Remember to follow the instructions for the submission of assignments indicated in [Submission Instructions](https://github.com/Tokio-School/Machine-Learning-EN/blob/main/Submission_instructions.md).

## Instructions
Anomaly detection methods based on the covariance of a Gaussian distribution and the low probability of an event (used in the previous exercise) are similar to those based on classification, especially classification with a Gaussian kernel SVM, since both try to model a Gaussian distribution on the data.

Their main differences are only noticeable in some circumstances, e.g.:
- If the distribution of the normal examples is not Gaussian/normal or has multiple centroids that we have not detected beforehand.
- In a high-dimensional dataset, where determining the normal distribution of the data is more difficult.
- Classification, being a supervised learning method, requires a higher percentage of outliers than anomaly detection with a Gaussian distribution, which only needs them in the validation and test subsets.

In this exercise we will combine both methods, which you have already solved in previous exercises, to analyse their results and differences.

Follow the instructions below to solve the same dataset using both anomaly detection with Gaussian distribution, and SVM with a Gaussian kernel, copying code cells from previous exercises where possible:

In [None]:
# TODO: Use this cell to import all the necessary libraries

import time
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

## The steps we are going to take are

We will create synthetic data following the same steps as in the previous anomaly detection exercise. This time, however, we will create 2 different datasets, one for anomaly detection and one for classification, each with its own training, validation, and test subsets. For Gaussian distribution covariance detection we do not assign outliers to the training subset, whereas for SVM classification we need to do so.

The steps are:
1. Create a dataset with normal data and a dataset with outliers.
1. Preprocess, normalise, and randomly reorder the data.
1. Create training, validation, and test subsets to solve for Gaussian distribution covariance, with no outliers in the training subset.
1. Create training, validation, and test subsets to solve by SVM with Gaussian kernel, with anomalous data (outliers) distributed across all the subsets.
1. Plot the data for the 2 sets of subsets.

Fill in the following code cells, copying your code from previous exercises whenever possible. At the end you should have generated, normalised, split, and reordered the ndarrays *X_gdc_train, X_gdc_val, X_gdc_test, X_svm_train, X_svm_val, X_svm_test* and their respective *Y* counterparts.

In [None]:
# TODO: Generate two independent synthetic datasets with normal and outlier data

m = 1000
n = 2
outliers_ratio = 0.25    # Ratio of outliers vs. normal data, modifiable

[...]
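
As a hint, the generation step could look like the minimal sketch below. It assumes that normal examples follow a Gaussian around the origin, that outliers are spread uniformly over a wider region, that `outliers_ratio` is a fraction of the total `m`, and that `Y = 1` marks an outlier. The helper `make_dataset` and all distribution parameters are hypothetical; reuse your own code from the previous exercise where it differs.

```python
import numpy as np

rng = np.random.default_rng(42)

m = 1000
n = 2
outliers_ratio = 0.25

def make_dataset(m, n, outliers_ratio, rng):
    """Return X (m x n) and Y (m,), with Y = 1 for outliers and Y = 0 for normal data."""
    m_outliers = int(m * outliers_ratio)
    m_normal = m - m_outliers
    # Assumption: normal data is Gaussian around the origin
    x_normal = rng.normal(loc=0.0, scale=1.0, size=(m_normal, n))
    # Assumption: outliers are spread uniformly over a much wider region
    x_outliers = rng.uniform(low=-6.0, high=6.0, size=(m_outliers, n))
    X = np.vstack([x_normal, x_outliers])
    Y = np.concatenate([np.zeros(m_normal), np.ones(m_outliers)])
    return X, Y

# Two independent draws: one dataset per method
X_gdc, Y_gdc = make_dataset(m, n, outliers_ratio, rng)
X_svm, Y_svm = make_dataset(m, n, outliers_ratio, rng)

print(X_gdc.shape, Y_gdc.sum(), X_svm.shape, Y_svm.sum())
```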

In [None]:
# TODO: Normalise the data of both datasets using the same normalisation parameters
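
One possible way to share normalisation parameters is to compute the mean and standard deviation once and apply them to both datasets. The sketch below computes them over both datasets stacked together, which is an assumption; the toy arrays are stand-ins for your generated data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for the two generated datasets
X_gdc = rng.normal(loc=5.0, scale=2.0, size=(200, 2))
X_svm = rng.normal(loc=5.0, scale=2.0, size=(200, 2))

# Compute the parameters once (assumption: from both datasets stacked together)
X_all = np.vstack([X_gdc, X_svm])
mu = X_all.mean(axis=0)
sigma = X_all.std(axis=0)

# Apply the same parameters to both datasets
X_gdc_norm = (X_gdc - mu) / sigma
X_svm_norm = (X_svm - mu) / sigma
```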

In [None]:
# TODO: Randomly reorder the 2 datasets
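
When reordering, the same permutation has to be applied to *X* and *Y* so that each example keeps its label. A minimal sketch with toy data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 10 examples, the last 3 labelled as outliers
X = np.arange(20, dtype=float).reshape(10, 2)
Y = (np.arange(10) >= 7).astype(float)

# One permutation of the row indices, applied to both arrays
perm = rng.permutation(X.shape[0])
X_shuffled = X[perm]
Y_shuffled = Y[perm]
```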

In [None]:
# TODO: Divide the 1st dataset into training, validation, and test subsets for Gaussian distribution covariance, with outliers only in the validation and test subsets
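
A sketch of this split, assuming `Y = 1` marks an outlier, a 60/20/20 split of the normal data, and the outliers shared equally between validation and test. The proportions and the toy data are assumptions, not requirements of the exercise.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in: 1000 shuffled examples, 250 of them labelled as outliers
X = rng.normal(size=(1000, 2))
Y = np.zeros(1000)
Y[rng.choice(1000, size=250, replace=False)] = 1.0

X_normal, X_out = X[Y == 0], X[Y == 1]

n_train = int(0.6 * len(X_normal))
n_val = int(0.2 * len(X_normal))
n_test = len(X_normal) - n_train - n_val
half = len(X_out) // 2

# Training subset: normal data only
X_gdc_train = X_normal[:n_train]
Y_gdc_train = np.zeros(n_train)

# Validation and test subsets: normal data plus half of the outliers each
X_gdc_val = np.vstack([X_normal[n_train:n_train + n_val], X_out[:half]])
Y_gdc_val = np.concatenate([np.zeros(n_val), np.ones(half)])

X_gdc_test = np.vstack([X_normal[n_train + n_val:], X_out[half:]])
Y_gdc_test = np.concatenate([np.zeros(n_test), np.ones(len(X_out) - half)])
```

The validation and test subsets built this way have their outliers at the end, so they may need to be reordered again before use.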

In [None]:
# TODO: Divide the 2nd dataset into training, validation, and test subsets for Gaussian kernel SVM classification, with outliers distributed across all the subsets
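
For the SVM dataset, one option is a stratified split, which keeps the proportion of outliers roughly equal in every subset. The sketch below uses Scikit-learn's `train_test_split` twice; the 60/20/20 proportions and the toy data are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy stand-in: 750 normal examples (Y = 0) and 250 outliers (Y = 1)
X_svm = rng.normal(size=(1000, 2))
Y_svm = np.concatenate([np.zeros(750), np.ones(250)])

# First split: 60% training, 40% remaining, preserving the outlier ratio
X_svm_train, X_rest, Y_svm_train, Y_rest = train_test_split(
    X_svm, Y_svm, test_size=0.4, stratify=Y_svm, random_state=42)

# Second split: the remainder is halved into validation and test
X_svm_val, X_svm_test, Y_svm_val, Y_svm_test = train_test_split(
    X_rest, Y_rest, test_size=0.5, stratify=Y_rest, random_state=42)
```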

In [None]:
# TODO: Plot the 3 subsets on a 2D graph for both cases, indicating the normal data and the outliers

## Anomaly detection resolution using normal distribution covariance

To solve the dataset using normal distribution covariance, follow the steps from the previous exercise, copying the code of the corresponding cells and using the appropriate subsets:

In [None]:
# TODO: Model the Gaussian distribution
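
Modelling the distribution amounts to estimating the mean vector and the covariance matrix from the training subset, which contains only normal data. The sketch below uses `scipy.stats.multivariate_normal` to evaluate the density, which is one option among several; your previous exercise may have computed it by hand. The toy array is a stand-in.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(42)

# Toy stand-in for the training subset (normal data only)
X_gdc_train = rng.normal(size=(500, 2))

# Estimate the parameters of the multivariate Gaussian
mu = X_gdc_train.mean(axis=0)
sigma = np.cov(X_gdc_train, rowvar=False)  # full n x n covariance matrix

# Frozen distribution: pdf() returns the density of each example
model = multivariate_normal(mean=mu, cov=sigma)
p_train = model.pdf(X_gdc_train)
```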

In [None]:
# TODO: Determine the probability threshold for detecting outliers
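
A common way to choose the threshold is to try many candidate values of epsilon on the validation subset and keep the one with the best F1-score. The sketch below assumes `Y = 1` for outliers and flags an example as an outlier when its density is below epsilon; the toy data and the number of candidates are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

# Toy stand-ins: normal-only training data, mixed validation data
X_gdc_train = rng.normal(size=(500, 2))
X_gdc_val = np.vstack([rng.normal(size=(150, 2)),
                       rng.uniform(-6.0, 6.0, size=(50, 2))])
Y_gdc_val = np.concatenate([np.zeros(150), np.ones(50)])

model = multivariate_normal(mean=X_gdc_train.mean(axis=0),
                            cov=np.cov(X_gdc_train, rowvar=False))
p_val = model.pdf(X_gdc_val)

# Scan candidate thresholds between the lowest and highest validation density
best_epsilon, best_f1 = None, -1.0
for epsilon in np.linspace(p_val.min(), p_val.max(), 1000):
    Y_val_pred = (p_val < epsilon).astype(float)  # low density -> outlier
    score = f1_score(Y_gdc_val, Y_val_pred, zero_division=0)
    if score > best_f1:
        best_epsilon, best_f1 = epsilon, score

print('Best epsilon:', best_epsilon, 'F1-score on validation:', best_f1)
```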

In [None]:
# TODO: Evaluate the final performance of the model on the test subset using its F1-score

## Resolution using SVM Classification

Similarly, follow the steps from the previous SVM exercises to classify the data into normal and outliers using SVM, copying the code from the corresponding cells and using the appropriate subsets.

Use an RBF kernel with the Scikit-learn [OneClassSVM class](https://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html) and *outliers_ratio* as the *nu* parameter. To regularise the model, optimise *gamma* with [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html):

In [None]:
# TODO: Train a OneClassSVM model and optimise gamma on the validation subset
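
One possible way to make `GridSearchCV` score on the validation subset, rather than on cross-validation folds, is to pass it a `PredefinedSplit`. The sketch below is written under several assumptions: `OneClassSVM` predicts `+1` for inliers and `-1` for outliers, so the labels here follow that convention; the gamma grid is arbitrary; and the toy subsets and the helper `toy_subset` are hypothetical stand-ins.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.metrics import f1_score, make_scorer

rng = np.random.default_rng(42)
outliers_ratio = 0.25

def toy_subset(m_normal, m_outliers):
    """Hypothetical helper: labels are +1 (normal) and -1 (outlier)."""
    X = np.vstack([rng.normal(size=(m_normal, 2)),
                   rng.uniform(-6.0, 6.0, size=(m_outliers, 2))])
    Y = np.concatenate([np.ones(m_normal), -np.ones(m_outliers)])
    return X, Y

X_svm_train, Y_svm_train = toy_subset(450, 150)
X_svm_val, Y_svm_val = toy_subset(150, 50)

# -1 keeps an example in training for every split; 0 puts it in the single test fold
X_all = np.vstack([X_svm_train, X_svm_val])
Y_all = np.concatenate([Y_svm_train, Y_svm_val])
test_fold = np.concatenate([np.full(len(X_svm_train), -1),
                            np.zeros(len(X_svm_val), dtype=int)])

# F1-score with the outlier class (-1) as the positive class
scorer = make_scorer(f1_score, pos_label=-1)

search = GridSearchCV(
    OneClassSVM(kernel='rbf', nu=outliers_ratio),
    param_grid={'gamma': np.logspace(-3, 1, 9)},
    scoring=scorer,
    cv=PredefinedSplit(test_fold),
    refit=False,
)
search.fit(X_all, Y_all)  # OneClassSVM ignores Y when fitting; the scorer uses it
best_gamma = search.best_params_['gamma']

# Retrain on the training subset only, with the selected gamma
model = OneClassSVM(kernel='rbf', nu=outliers_ratio, gamma=best_gamma)
model.fit(X_svm_train)
Y_val_pred = model.predict(X_svm_val)
val_f1 = f1_score(Y_svm_val, Y_val_pred, pos_label=-1)

print('Best gamma:', best_gamma, 'F1-score on validation:', val_f1)
```

If your labels use `1` for outliers and `0` for normal data, map the predictions before scoring, e.g. `(Y_val_pred == -1).astype(int)`.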

In [None]:
# TODO: Evaluate the final performance of the model on the test subset using its F1-score

## Comparison of the results of the two methods

Now compare both methods, showing their F1-score and plotting their results:

In [None]:
# TODO: Display the F1-score results of both models

print('F1-score of the Gaussian distribution covariance:')
print()
print('F1-score of the classification by SVM:')
print()

Plot the results of both models on their test subsets:

In [None]:
# TODO: Plot errors and hits next to the distribution and the epsilon cutoff threshold contour line
# for the Gaussian distribution covariance

# Assign z = 1 for hits, and z = 0 for misses
# Hits: Y_test == Y_test_pred
z_gdc = [...]

# Plot the 2D graph
# Use different colours for hits and misses
[...]

plt.show()
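
As an orientation for this kind of plot, the sketch below draws the density as filled contours, the epsilon cutoff as a single contour line, and the hits and misses in different colours. The model (a standard 2D Gaussian), the value of `epsilon`, and the toy test data are hypothetical placeholders for your own fitted model, threshold, and subsets.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

rng = np.random.default_rng(42)

# Toy stand-ins for the test subset (Y = 1 for outliers)
X_gdc_test = np.vstack([rng.normal(size=(150, 2)),
                        rng.uniform(-6.0, 6.0, size=(50, 2))])
Y_gdc_test = np.concatenate([np.zeros(150), np.ones(50)])

# Hypothetical fitted model and threshold
model = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))
epsilon = 0.01

Y_test_pred = (model.pdf(X_gdc_test) < epsilon).astype(float)

# z = 1 for hits, z = 0 for misses
z_gdc = (Y_gdc_test == Y_test_pred).astype(int)

# Evaluate the density on a grid to draw the contours
xx, yy = np.meshgrid(np.linspace(-6, 6, 200), np.linspace(-6, 6, 200))
pp = model.pdf(np.dstack([xx, yy]))

fig, ax = plt.subplots(figsize=(6, 6))
ax.contourf(xx, yy, pp, levels=20, cmap='Blues', alpha=0.5)
ax.contour(xx, yy, pp, levels=[epsilon], colors='k')  # epsilon cutoff line
ax.scatter(X_gdc_test[z_gdc == 1, 0], X_gdc_test[z_gdc == 1, 1],
           c='g', s=10, label='Hits')
ax.scatter(X_gdc_test[z_gdc == 0, 0], X_gdc_test[z_gdc == 0, 1],
           c='r', marker='x', label='Misses')
ax.set_title('Gaussian distribution covariance: hits and misses')
ax.legend()

plt.show()
```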

In [None]:
# TODO: Plot errors and hits next to the distribution and the border between classes
# for classification by SVM

# Assign z = 1 for hits, and z = 0 for misses
# Hits: Y_test == Y_test_pred
z_svm = [...]

# Plot the graph
# Use different colours for hits and misses
[...]

plt.show()

*What conclusions can you draw? What are the differences between the two methods?*