# Anomaly Detection Using a Variational Autoencoder
- Anomaly detection is one of the most widespread use cases for unsupervised machine learning, especially in industrial applications.
- Applications of anomaly detection: fraud detection in banking, preventive maintenance in heavy industries, threat detection in cybersecurity
- Challenge: defining outliers explicitly
- Variational autoencoders (VAEs): 
    - The `encoder` is trained to learn the general structure of the training data to isolate only its discriminative features, which are summarised in a compact **latent vector**. 
    - The `latent vector` constitutes an information bottleneck that *forces* the model to be very *selective about what to encode*. 
    - The `decoder` is trained to reconstruct the original data from the latent vector as faithfully as possible. 
    - Outlier detection: when presented with an out-of-distribution sample (an outlier), the system will not be able to make an accurate reconstruction. 
        - By detecting inaccurate reconstructions, we can tell which examples are outliers.
- Reference:
    - [Kaggle Notebook](https://www.kaggle.com/code/lucfrachon/anomaly-detection-using-vaes)
    - [Anomaly Detection Using a Variational Autoencoder](https://medium.com/@luc.frachon/anomaly-detection-using-a-variational-autoencoder-part-i-e48cd26c027d)

## Project Overview
- Goal: track the internal temperature of a sizeable industrial machine about which we have no prior knowledge (such as the kind of machine or the industry) and detect any anomalies.
- Dataset: a one-dimensional time series → we engineer features to make the data multi-dimensional, which makes things more interesting and relevant.
- Approach:
    - Gather and preprocess data, including train/test split,
    - Build a VAE and train it on the training set,
    - Pass test samples to the VAE and record the reconstruction loss for each,
    - Identify test samples with a high reconstruction loss and flag them as anomalies.
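
The last two steps of the approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual implementation: the reconstruction loss is taken to be a per-sample mean squared error, and the threshold is taken to be a quantile of the training losses. The names `per_sample_mse` and `flag_anomalies` are hypothetical, and the arrays are toy numbers.

```python
import numpy as np

def per_sample_mse(x, x_hat):
    """Mean squared reconstruction error for each row (sample)."""
    return ((x - x_hat) ** 2).mean(axis=1)

def flag_anomalies(test_loss, train_loss, quantile=0.99):
    """Flag test samples whose loss exceeds a quantile of the training losses.

    The quantile rule is an assumption made for illustration; the source only
    says that samples with a high reconstruction loss are flagged.
    """
    threshold = np.quantile(train_loss, quantile)
    return test_loss > threshold, threshold

# Toy inputs and reconstructions (two samples, two features each).
x = np.array([[0.0, 0.0], [1.0, 1.0]])
x_hat = np.array([[0.0, 0.0], [3.0, 1.0]])
test_loss = per_sample_mse(x, x_hat)

# Toy training losses used only to set the threshold.
train_loss = np.array([0.1, 0.2, 0.3, 0.4])
flags, threshold = flag_anomalies(test_loss, train_loss, quantile=0.5)
```

Deriving the threshold from the training losses, rather than from the test losses, keeps the number of flagged test samples from being fixed in advance.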

## Feature Engineering
- Feature engineering is challenging because we know little about the machine or its industry, so we have to make assumptions.
- Assumptions (A):
    - A1: The timestamps cover the Christmas and New Year holidays. Since we are dealing with an industrial machine, it stands to reason that its workload might be affected by holidays and maybe even by the proximity (in time) to a holiday (`gap_holiday`). 
        - A1.1: applicable holidays are those typical in Europe and the Americas
    - A2: The temperature varies with
        - A2.1: the hour of the day
        - A2.2: weekday versus weekend
        - A2.3: the day of the month
        - A2.4: the month
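
The assumptions above could translate into calendar features along these lines. This is a sketch, assuming a pandas DataFrame with a `timestamp` column. The function name `add_calendar_features`, the column names `hour`, `day` and `month`, and the two holiday dates are illustrative choices rather than the notebook's own.

```python
import pandas as pd

def add_calendar_features(df, holidays):
    """Add hour, weekday, day, month and holiday-related columns."""
    out = df.copy()
    ts = out["timestamp"]
    out["hour"] = ts.dt.hour              # A2.1
    out["day_of_week"] = ts.dt.dayofweek  # A2.2 (pandas: 0 = Monday, 6 = Sunday)
    out["day"] = ts.dt.day                # A2.3
    out["month"] = ts.dt.month            # A2.4

    # A1: distance in days to the nearest holiday, and a 0/1 holiday flag.
    dates = ts.dt.normalize()
    gaps = [(dates - pd.Timestamp(h)).abs().dt.days for h in holidays]
    out["gap_holiday"] = pd.concat(gaps, axis=1).min(axis=1)
    out["holiday"] = (out["gap_holiday"] == 0).astype(int)
    return out

# Toy data; the holiday list is an assumption made for illustration.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2013-12-24 10:00", "2013-12-25 00:00", "2014-01-03 23:00"]
    ),
    "value": [70.1, 68.4, 71.9],
})
features = add_calendar_features(df, holidays=["2013-12-25", "2014-01-01"])
```

Note that pandas numbers weekdays from Monday, so any interpretation of a specific `day_of_week` value depends on the encoding actually used.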

## Data Pre-processing
- Categorical features (`day_of_week`, `holiday`): encode `day_of_week` from 0 to 6, and `holiday` as 0 or 1.
- Continuous features (`gap_holiday`, `t`, `value`): it is essential to normalise the continuous variables because the weights of a neural network are randomly initialised from a shared distribution, so they all tend to have similar scales initially. Inputs on very different scales would therefore contribute unevenly at the start of training.
    - Test data must be normalised using statistics observed on the train set to avoid leaking information from the test set into the model
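
The leakage point can be illustrated as follows. This is a minimal sketch that assumes z-score standardisation (the source only says "normalise"), with hypothetical helper names `fit_standardiser` and `standardise`.

```python
import numpy as np

def fit_standardiser(train):
    """Compute per-feature mean and standard deviation on the train set only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant features
    return mean, std

def standardise(x, mean, std):
    """Apply previously fitted statistics to any split."""
    return (x - mean) / std

# Toy continuous features (rows are samples, columns are features).
train = np.array([[1.0, 10.0], [3.0, 30.0]])
test = np.array([[5.0, 20.0]])

mean, std = fit_standardiser(train)            # statistics come from train only
train_scaled = standardise(train, mean, std)
test_scaled = standardise(test, mean, std)     # test reuses the train statistics
```

Because the test set never contributes to `mean` or `std`, its scaled values can fall well outside the range seen in training, which is the expected behaviour.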

## Train-Test Split
- Time series data: we take the last 30% of the observations (in chronological order) as our test set
    - Train set: 15,900 timestamps
    - Test set: 6,800 timestamps
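
A chronological split of this kind can be sketched in a few lines. The helper name `chronological_split` is hypothetical, and the data is assumed to be already sorted by timestamp.

```python
def chronological_split(data, test_fraction=0.3):
    """Split a time-ordered sequence, keeping the last fraction as the test set."""
    n_test = int(round(len(data) * test_fraction))
    split = len(data) - n_test
    return data[:split], data[split:]

# Toy example: ten time-ordered observations.
observations = list(range(10))
train, test = chronological_split(observations)
```

No shuffling takes place, so every test observation comes later in time than every training observation.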

## Model
- The encoder's first layer: a concatenation of the continuous variables and embedding vectors encoding the categorical variables.
    - Embedding vectors (learnable vectors of dimension 16 in our experiments) are heavily used in Natural Language Processing and, more generally, when working with discrete data. 
        - For example: the variable `day_of_week` can take seven values (one for each day), each mapped to its own 16-dimensional vector. After training, the vector for `day_of_week == 0` might capture the fact that the machine's activity is lower on Sundays (assuming 0 encodes Sunday). 
- Hyper-parameters of the fully connected layers (`layer_dims`): the number of layers and the number of neurons per layer, e.g. `(32, 64, 128, 256)`
    - Each layer can be batch-normalised. 
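
The concatenation described for the encoder's first layer might look like the sketch below. The class name `InputEmbedder` and the argument names are hypothetical. The cardinalities `(7, 2)` correspond to `day_of_week` and `holiday`, and three continuous features are assumed.

```python
import torch
import torch.nn as nn

class InputEmbedder(nn.Module):
    """Embed each categorical variable and concatenate with the continuous ones."""

    def __init__(self, cardinalities, embed_dim=16):
        super().__init__()
        # One learnable embedding table per categorical variable.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(n, embed_dim) for n in cardinalities]
        )

    def forward(self, x_cont, x_cat):
        # x_cont: (batch, n_continuous) floats; x_cat: (batch, n_categorical) integer codes.
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        return torch.cat([x_cont] + embedded, dim=1)

embedder = InputEmbedder(cardinalities=(7, 2))
x_cont = torch.randn(4, 3)                               # e.g. gap_holiday, t, value
x_cat = torch.tensor([[0, 0], [6, 1], [3, 0], [2, 1]])   # day_of_week, holiday
out = embedder(x_cont, x_cat)                            # width: 3 + 2 * 16 = 35
```

The output width (continuous features plus one embedding per categorical variable) is what the first fully connected layer would take as its input dimension.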

In [None]:
# Define a layer as a sequence of {fully connected unit, optional batch normalisation, leaky-ReLU activation}
import torch.nn as nn

class Layer(nn.Module):
    '''
    A single fully connected layer with optional batch
    normalisation, followed by a leaky-ReLU activation.
    '''
    def __init__(self, in_dim, out_dim, bn=True):
        super().__init__()
        layers = [nn.Linear(in_dim, out_dim)]
        if bn:
            # Normalise the linear output across the batch before the activation
            layers.append(nn.BatchNorm1d(out_dim))
        layers.append(nn.LeakyReLU(0.1, inplace=True))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)