## 1. Overview

This is a demo project for the DEGAN framework.

In this demo project, we train and configure a Generative Adversarial Network (GAN) through the DEGAN framework to detect anomalies in sample time series. The data is divided into three groups: (1) training, (2) validation, and (3) testing. Training and validation data are used to configure the Discriminator model (D), which is then used to detect anomalies in the test data.

The average performance metrics (Precision, Recall, F1) are calculated across all testing time series (TS), which may be a single TS or a collection of TSs. The data may come from one system or from different components of a system. For example, if the
problem is detecting anomalies in railroad data, the time series could be sensor readings across large
segments (10 miles) of the track or smaller segments (1 mile).
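
As a minimal sketch of what "average performance metrics across all testing TS" can mean, the snippet below computes Precision, Recall, and F1 per time series and then averages each metric. The counts are hypothetical, and the plain (macro) average is an assumption; the DEGAN code may aggregate differently.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical (tp, fp, fn) counts for three testing time series.
counts_per_ts = [(2, 0, 2), (4, 1, 0), (9, 1, 1)]

scores_per_ts = [precision_recall_f1(*c) for c in counts_per_ts]

# Average each metric over the time series (macro average, an assumption).
n_ts = len(scores_per_ts)
avg_precision, avg_recall, avg_f1 = (sum(col) / n_ts for col in zip(*scores_per_ts))
print(f"precision={avg_precision:.2f}, recall={avg_recall:.2f}, F1={avg_f1:.2f}")
```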

### Framework

The DEGAN framework includes three main components: GAN training, D model selection, and probabilistic anomaly detection.

<img src="imgs/degan_framework.png" width="600"/>

### Generator & Discriminator

The architectures of G and D employed in the above framework are shown below. A two-layer Dense neural network is adopted as the base model of G. Its input is a 1-D tensor of 128 random values drawn from a fixed standard Gaussian distribution (mean 0, standard deviation 1). The two fully connected layers are followed by a Tanh activation layer. For D, we employ a 1-D convolutional model, CNN-D, which consists of one convolutional layer (Conv1D) and two fully connected layers. The Conv1D layer has 16 filters (channels) of size 5. The output of Conv1D is flattened before being fed into the Dense layers.

***(Users of this demo can adjust the sizes of G and D as needed in*** `Conv1D_GAN.py` ***)***

<img src="imgs/d_architecture.png" width="700"/>
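
To make the tensor shapes concrete, here is a rough NumPy-only forward pass through a G and a D with the layer structure described above. It is an illustration, not the code in `Conv1D_GAN.py`: the hidden width of 256, the ReLU activations, the "valid" padding, the sigmoid output, and the random untrained weights are all assumptions, and dropout is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

NOISE_DIM = 128   # size of G's input noise vector (from the text)
WINDOW = 100      # length of one window (assumed to match "out_dim" in this demo)
HIDDEN = 256      # assumed hidden width; not stated in the text
N_CHANNELS = 16   # Conv1D filters in D (from the text)
CONV_SZ = 5       # Conv1D filter size in D (from the text)


def init(*shape):
    """Small random weights, for illustration only (nothing is trained here)."""
    return rng.normal(scale=0.05, size=shape)


def generator(z, p):
    """Two fully connected layers followed by tanh, so outputs lie in [-1, 1]."""
    hidden = np.maximum(z @ p["w1"] + p["b1"], 0.0)  # ReLU is an assumption
    return np.tanh(hidden @ p["w2"] + p["b2"])


def conv1d(x, kernels, bias):
    """'Valid' 1-D convolution: (batch, length) -> (batch, length - k + 1, channels)."""
    k = kernels.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, k, axis=1)
    return windows @ kernels.T + bias


def discriminator(x, p):
    """Conv1D -> flatten -> two dense layers -> score in [0, 1]."""
    features = np.maximum(conv1d(x, p["k"], p["kb"]), 0.0)
    flat = features.reshape(x.shape[0], -1)
    hidden = np.maximum(flat @ p["w1"] + p["b1"], 0.0)
    logit = hidden @ p["w2"] + p["b2"]
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid output is an assumption


g_params = {"w1": init(NOISE_DIM, HIDDEN), "b1": np.zeros(HIDDEN),
            "w2": init(HIDDEN, WINDOW), "b2": np.zeros(WINDOW)}
flat_dim = (WINDOW - CONV_SZ + 1) * N_CHANNELS  # 96 positions x 16 channels
d_params = {"k": init(N_CHANNELS, CONV_SZ), "kb": np.zeros(N_CHANNELS),
            "w1": init(flat_dim, HIDDEN), "b1": np.zeros(HIDDEN),
            "w2": init(HIDDEN, 1), "b2": np.zeros(1)}

z = rng.standard_normal((4, NOISE_DIM))          # noise from a standard Gaussian
fake_windows = generator(z, g_params)            # one generated window per noise vector
scores = discriminator(fake_windows, d_params)   # one score per window
print(fake_windows.shape, scores.shape)
```

With "valid" padding, the flattened Conv1D output has (100 - 5 + 1) × 16 = 1536 features; a different padding choice would change that number.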

## 2. Required Libraries

Import the following required libraries and files before running the demo.

First install all required packages listed in `requirements.txt`.


Alternatively, uncomment the line below to install all required packages directly.

In [None]:
# pip install -r requirements.txt

In [None]:
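import random

import numpy
import numpy as np  # later cells use both `numpy` and the `np` alias
import pandas as pd  # used to load the CSV files and build result tables
import tensorflow as tf
from matplotlib import pyplot as plt
from sklearn.preprocessing import MinMaxScaler  # used in the plotting cell

# Project modules (GAN_train_val and GAN_test are used in later cells)
from DEGAN.train_val import *
from DEGAN.test import *
from DEGAN.utils import *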

## 3. GPU Environment

Users can set up the GPU environment based on their needs and hardware capabilities.

In [None]:
physical_devices = tf.config.experimental.list_physical_devices('GPU')

# Uncomment the following lines if your system has GPU hardware (one line per GPU).
# tf.config.experimental.set_memory_growth(physical_devices[0], True)
# tf.config.experimental.set_memory_growth(physical_devices[1], True)

Users can use the `session_info` package to check the current environment.

In [None]:
import session_info
session_info.show()

## 4. Import Dataset

Import the sample dataset.

In this sample dataset, `training.csv`, `val.csv`, and `testing.csv` share the same layout: the first column holds the index number of the time series, and the remaining columns hold the values that make up one window. The sample dataset is not included in this demo project.

**(Users can change this into their customized time series datasets for training, validation, and testing.)**
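
For users preparing their own data, the sketch below shows one possible way to turn a raw series into the layout described above (an index column followed by the window values). The helper name `make_windows`, the `stride` parameter, and the choice of the window's start position as the index are assumptions for illustration, not part of DEGAN.

```python
import numpy as np


def make_windows(series, window_length, stride=1):
    """Slide a window over `series`; return rows of [start_index, v_0, ..., v_{w-1}]."""
    series = np.asarray(series, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(series, window_length)[::stride]
    start_idx = np.arange(0, len(series) - window_length + 1, stride)
    return np.column_stack([start_idx, windows])


# Toy series 0..9, windows of length 4, taken every 2 samples.
table = make_windows(np.arange(10), window_length=4, stride=2)
print(table.shape)
print(table[1])
```

The resulting array can be wrapped in a `pandas.DataFrame` and written to CSV; dropping the first column then yields the pure window values.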

In [None]:
X_train = pd.read_csv('Your training data.csv')
X_val = pd.read_csv('Your validation data.csv')
X_test = pd.read_csv('Your testing data.csv')
ts_df = pd.read_csv('Your testing time series data.csv') # This is adjusted to fit the Railroad Data in this sample.

## 5. Dictionary Settings

To keep the code simple and make the hyperparameters easy to adjust, this demo uses the following dictionaries:

**input_df_dic:**  Dictionary for the input data. Contains the input datasets used for processing, for example the training, validation, and testing datasets.

**GAN_param_dic:**  Dictionary for the hyperparameters of the GAN, for example the learning rates, number of epochs, and threshold.

**post_process_dic:**  Dictionary for data post-processing. Contains the ground truth list, the parameters used to obtain the anomaly index list, and the tolerance ranges used to evaluate the model's performance.

***(Users can customize the parameters in these dictionaries)***

In [None]:
input_df_dic = {
    "train_df": X_train,
    "val_df": X_val,
    "test_df": X_test,
    "ts_df": ts_df,
    "folder_num": 1,
    "num_test": 60000  # Originally X_test.shape[0]; adjusted to fit the Railroad Data.
}

# This is adjusted to fit the Railroad Data.
GAN_param_dic={
    "glr": 0.001,
    "dlr": 0.0001,
    "total_epochs": 500,
    "val_freq": 5,
    "val_criterion": "absolute",
    "ano_thr": 10,
    "random_seed": 3500,
    "regularization_coeff": 0.0001,
    "latent_dim": 256,
    "out_dim": 100,
    "conv_sz": 5,
    "n_channels": 16,
    "drate": .25,
    "noise_dim": 128
}

# This is adjusted to fit the Railroad Data.
post_process_dic={
    # the minimum distance between two KDE peaks
    "peak_dist": 50,
    # the number of bins set over the KDE profile height histogram
    "peak_bins": 21,
    # the number of the quantile of the bins to set the minimum peak height for selection
    "peak_quant": 11,
    # parameter of smoothing level
    "kde_bandwidth": 50,
    # tolerance that shows how effective the algorithm is in detecting anomalies in a given vicinity of the real anomalies.
    "tolerance":[50,100,150,200],
    # index of the ground truth anomalies in the data set.
    "ground_truth_list": [17225]
    }
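
When customizing these dictionaries, a quick sanity check can catch typos before a long training run. The sketch below is a hypothetical helper, not part of DEGAN. It assumes that `out_dim` should equal the window length and that `drate` is a dropout rate in [0, 1); both readings are guesses based on the values used in this demo.

```python
# Keys that appear in this demo's GAN_param_dic.
EXPECTED_GAN_KEYS = {
    "glr", "dlr", "total_epochs", "val_freq", "val_criterion", "ano_thr",
    "random_seed", "regularization_coeff", "latent_dim", "out_dim",
    "conv_sz", "n_channels", "drate", "noise_dim",
}


def check_gan_params(params, window_length):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    missing = sorted(EXPECTED_GAN_KEYS - set(params))
    if missing:
        problems.append(f"missing keys: {missing}")
    if "out_dim" in params and params["out_dim"] != window_length:
        # Assumption: the generated window length must match the data window length.
        problems.append(f"out_dim={params['out_dim']} differs from window_length={window_length}")
    if "drate" in params and not 0 <= params["drate"] < 1:
        # Assumption: drate is a dropout rate.
        problems.append(f"drate={params['drate']} is outside [0, 1)")
    return problems


example_params = {
    "glr": 0.001, "dlr": 0.0001, "total_epochs": 500, "val_freq": 5,
    "val_criterion": "absolute", "ano_thr": 10, "random_seed": 3500,
    "regularization_coeff": 0.0001, "latent_dim": 256, "out_dim": 100,
    "conv_sz": 5, "n_channels": 16, "drate": 0.25, "noise_dim": 128,
}
print(check_gan_params(example_params, window_length=100))
```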


## 6. Training & Validation

This step trains and validates the Generative Adversarial Network on the time series and extracts the discriminator model to be used for testing.

The pseudocode for GAN training and validation is shown below:

<img src="imgs/GAN_pseudocode.png" alt="Drawing" style="width: 700px;" align="left"/>

### Inputs & Outputs


**Inputs:**
   
    window_length: int
        Length of the window used in the sliding window method for subsequence extractions.
        
    GAN_param_dic: dict
        Dictionary containing the various parameters of the Generative Adversarial Network.
    
    input_df_dic: dict
        Dictionary containing the input dataset used for processing.
        
**Outputs:**
   
    model: model
        Trained discriminator model used for evaluation of the testing inspection.

In [None]:
# set a random seed for reproducibility
random.seed(GAN_param_dic["random_seed"])
numpy.random.seed(GAN_param_dic["random_seed"])
tf.random.set_seed(GAN_param_dic["random_seed"])

# the window length to extract subsequence of original time series data
# This window length is adjusted to fit the Railroad Data sample.
window_length = 100

input_df_dic["train_df"] = input_df_dic["train_df"].iloc[:,1:(window_length+1)]
input_df_dic["val_df"] = input_df_dic["val_df"].iloc[:,1:(window_length+1)]

model = GAN_train_val(window_length, input_df_dic, GAN_param_dic)

**Example Outputs:**

--------------------------------------------------------------------------------------------------

  100%|          |85/500 [00:00<?, ?it/s]
 
model saved, best epoch = 85

As shown above, the training and validation process stopped, and the trained model is temporarily stored.

The next step is to test the trained model and calculate the performance metrics.

## 7. Testing & Calculating Performance Metrics

This step tests the Generative Adversarial Network on the test time series data and calculates the average performance metrics (Precision, Recall, F1) across the testing datasets.

### Inputs & Outputs


**Inputs:**
   
    model: model
        Trained discriminator model used for evaluation of the testing inspection.
        
**Outputs:**
   
    KDE_scores: dataframe
        Kernel density scores of all the points in the time series.

    anomaly_index_list: list
        Indices where an anomaly has been detected by the trained model.
    
    predictedAnomalies: ndarray
        Predicted anomalies generated after applying kernel density estimation on all the predictions 
        of the model.
        
    metrics: ndarray
        Evaluation (Precision, Recall and F1) scores for the testing time series given the specified 
        tolerance ranges. 
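
To give a feel for how `KDE_scores` and `predictedAnomalies` relate to `anomaly_index_list`, the sketch below builds a Gaussian kernel density profile over a few flagged indices and takes its highest point. It is a simplified stand-in: the flagged indices are hypothetical, only the single global maximum is kept, and the peak-selection parameters (`peak_dist`, `peak_bins`, `peak_quant`) used by the demo are not modelled.

```python
import numpy as np


def kde_profile(flagged_idx, n_points, bandwidth):
    """Gaussian kernel density of the flagged indices, evaluated at 0..n_points-1."""
    grid = np.arange(n_points, dtype=float)[:, None]
    centers = np.asarray(flagged_idx, dtype=float)[None, :]
    kernel = np.exp(-0.5 * ((grid - centers) / bandwidth) ** 2)
    norm = len(flagged_idx) * bandwidth * np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / norm


# Hypothetical indices flagged as abnormal: a cluster near 415 and one isolated hit.
flagged = [400, 410, 420, 430, 900]
profile = kde_profile(flagged, n_points=1000, bandwidth=50)

# The densest region of flagged indices is taken as the predicted anomaly location.
predicted_location = int(np.argmax(profile))
print(predicted_location)
```

The isolated hit at 900 contributes a much lower density than the cluster, which is how this kind of smoothing can suppress sporadic false alarms.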
    

In [None]:
KDE_scores, anomaly_index_list, predictedAnomalies, metrics = GAN_test(window_length, 
                                                                       input_df_dic, post_process_dic, model)

**Example Outputs:**

--------------------------------------------------------------------------------------------------

[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

The above results can be rounded and put into a dataframe for clarity:

In [None]:
def round_metrics(metrics):
    rounded_metrics = np.round(metrics, 2)
    columns = []
    for tolerance in post_process_dic["tolerance"]:
        columns.append("Recall (" + str(tolerance) + ")")
        columns.append("Precision (" + str(tolerance) + ")")
        columns.append("F1 Score (" + str(tolerance) + ")")
    rounded_metrics_df = pd.DataFrame(rounded_metrics, columns = columns, index = ['Score'])
    return rounded_metrics_df

round_metrics(metrics)

**Example Outputs:**

--------------------------------------------------------------------------------------------------

|  | Recall (50) | Precision (50) | F1 Score (50) | Recall (100) | Precision (100) | F1 Score (100) | Recall (150) | Precision (150) | F1 Score (150) | Recall (200) | Precision (200) | F1 Score (200) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Score** | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |

## 8. Output of the Anomaly Detection Task

Apart from the performance metrics, the actual output of the anomaly detection task is shown below.

The output of the anomaly detection task can also be visualized in a plot that shows the predicted anomalies and the actual anomalies.

### Lists of Actual & Predicted Anomaly locations:

In [None]:
defect_index_list = post_process_dic["ground_truth_list"]
print('Actual anomaly locations:')
print(defect_index_list)

**Example Outputs:**

--------------------------------------------------------------------------------------------------

Actual anomaly locations:
 
[17225]

In [None]:
print('Predicted anomaly locations:')
print(predictedAnomalies)

**Example Outputs:**

--------------------------------------------------------------------------------------------------

Predicted anomaly locations:
     
[17253]
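
The predicted location (17253) lies 28 samples from the actual anomaly (17225), which is within every tolerance in `post_process_dic["tolerance"]`. The sketch below shows one plausible way tolerance-based metrics can be computed; the matching rule (a prediction counts as correct if it falls within the tolerance of any actual anomaly) is an assumption and may differ from the rule implemented in `GAN_test`.

```python
def tolerance_metrics(actual, predicted, tolerance):
    """Precision, recall and F1 where a match means |predicted - actual| <= tolerance."""
    hits = [p for p in predicted if any(abs(p - a) <= tolerance for a in actual)]
    found = [a for a in actual if any(abs(p - a) <= tolerance for p in predicted)]
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(found) / len(actual) if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


actual_locations = [17225]
predicted_locations = [17253]
for tol in [50, 100, 150, 200]:
    print(tol, tolerance_metrics(actual_locations, predicted_locations, tol))
```

Under this rule, a tolerance smaller than 28 (for example 20) would turn the same prediction into a miss.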

### Plot of Actual & Predicted Anomalies

The testing time series is drawn in gray, and the predicted abnormal windows are shown in green. Kernel density estimation is applied over these predictions to obtain the predicted anomalies, which are the locations in the time series with the highest kernel density scores. In the figure, each predicted anomaly is shown as a red dot. Actual anomalies are shown as blue crosses.

The plot below shows that the predicted abnormal windows align with the actual anomalies.

In [None]:
# The code below plots the results of the anomaly detection task

# A single call creates both the figure and the axes (a separate plt.figure() would open an empty extra figure)
fig, ax = plt.subplots(figsize=(14,4))
d = KDE_scores.iloc[predictedAnomalies]

# plot acceleration data
ts = input_df_dic['ts_df']

# This is adjusted to fit the Railroad Data.
ts_x = ts['Test_Channel']

scaler = MinMaxScaler()
ts_x_scaled = scaler.fit_transform(ts_x.values.reshape(-1,1))
ts_x_scaled = pd.DataFrame(ts_x_scaled)

# This is un-commented to fit the Railroad Data.
ts_x_scaled.index += input_df_dic['ts_df']['Unnamed: 0'][0]

ts_x_scaled.columns = ['Testing time series']  # to show legend
ts_x_scaled.plot(ax=ax, color='black',alpha=0.15,linewidth=1)
f = KDE_scores.iloc[ts_x_scaled.index]


# Plot actual defects
z = defect_index_list
w = ts_x_scaled.loc[defect_index_list]
plt.scatter(z, w, s=80, marker = "x", color='blue',label ="Actual Anomalies") 

# Plot predicted anomalies and KDE profile
f.columns = ['Kernel Density Score']
f.plot(ax=ax, alpha = 1,color='black', label = "Kernel Density Score")      
x = anomaly_index_list
y = ts_x_scaled.loc[anomaly_index_list]
plt.scatter(x, y, s=10, alpha=0.8,color='green',label ="Predicted Abnormal Windows")
# Predicted anomalies are drawn at the height of their kernel density score (d)
x = predictedAnomalies
plt.scatter(x, d, s=66, alpha=0.8,color='red',label ="Predicted Anomalies")


plt.legend(prop={'family': 'Times New Roman','size':13})
plt.xticks(fontproperties='Times New Roman',fontsize=13)
plt.yticks(fontproperties='Times New Roman',fontsize=13)
plt.show()


**Example Outputs:**

--------------------------------------------------------------------------------------------------


<img src="imgs/sample_results.png" width="700"/>