# Weekly Report - July 4
## Summary of Contents
- Goals
- Results & Findings
- Plan of the Next Week
- Appendices

## Goals
My goals for the week were:
- Implement more metrics to evaluate anomaly detectors
- Analyze and address the flaws of the Gaussian-based anomaly detectors
- Assess different configurations of deep autoencoders
- Apply anomaly detection to synthetic datasets

## Results & Findings
### 1. Anomaly Detection on Synthetic Datasets
I created four synthetic datasets of binary vectors. Their main characteristics are summarized here:

#### Characteristics of the 4 datasets
|Dataset|#1|#2|#3|#4|
|---|---|---|---|---|
|Sample Size|85,322|100,000|100,000|85,674|
|No. of Dimensions|16|16|16|16|
|No. of Gaussian distributions used to generate the dataset|1|1|3|2|
|% Anomalies|11.8%|13.1%|10.0%|11.2%|
|Anomaly definition|The total number of '1's in the vector is less than the threshold (4)|The sum of the right (`n-1`) digits is even **AND** the leftmost digit is even (1)|Generated from two distributions other than the normal one|The total number of '1's in the vector is less than the threshold (4)|

#### Basic Method to Generate Data - Achieving Correlated, Binary Features

- First, I generated a dataset from a multivariate normal distribution with a random mean vector and a random covariance matrix.
- Second, I converted the dataset to binary by setting every value greater than or equal to 0.5 to 1 and every other value to 0.
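
The two steps above can be sketched as follows. This is a minimal illustration rather than the exact code used for the report: the function name `make_correlated_binary` is hypothetical, and building the covariance as `A @ A.T` is an assumption made here so that the matrix is a valid (symmetric positive semidefinite) covariance.

```python
import numpy as np

def make_correlated_binary(n_samples, n_dimensions, rng):
    """Sample correlated Gaussian data and threshold it at 0.5 to get binary vectors."""
    mu = rng.random(n_dimensions)                  # random mean vector in [0, 1)
    a = rng.random((n_dimensions, n_dimensions))
    cov = a @ a.T                                  # assumption: A @ A.T gives a valid covariance
    real_valued = rng.multivariate_normal(mu, cov, size=n_samples)
    return (real_valued >= 0.5).astype(int)        # binarize: 1 if >= 0.5, else 0

rng = np.random.default_rng(0)
sample = make_correlated_binary(1000, 16, rng)
print(sample.shape)   # one row per sample, one column per dimension
print(sample[:3])     # first three binary vectors
```

Because the thresholding is applied to correlated Gaussian draws, the resulting binary columns stay correlated with each other.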

#### Detector Evaluation - Reconstruction Error with PCA
Works well on dataset 3.

|Dataset|#1|#2|#3|#4|
|---|---|---|---|---|
|Precision|16.4%|18.0%|63.7%|16.2%|
|Recall|100%|81.8%|50.4%|81.9%|
|R-Precision|18.6%|17.9%|63.7%|9.3%|
|Precision@50|12.0%|26.0%|100%|10.0%|
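
For reference, reconstruction-error scoring with PCA can be sketched as below. The number of components and the toy data are assumptions for illustration only, the function name `pca_reconstruction_error` is hypothetical, and PCA is computed directly with `np.linalg.svd` rather than with whichever library was used for the results above.

```python
import numpy as np

def pca_reconstruction_error(X, n_components):
    """Score each row by its squared error after projecting onto the top principal components."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    components = vt[:n_components]
    X_hat = Xc @ components.T @ components + mean  # project, then map back
    return np.sum((X - X_hat) ** 2, axis=1)

# Toy data: 200 points close to a 2-D subspace of a 16-D space, plus one outlier in row 0
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))
w = rng.normal(size=(2, 16))
X = z @ w + 0.01 * rng.normal(size=(200, 16))
X[0] += 3.0 * rng.normal(size=16)

scores = pca_reconstruction_error(X, n_components=2)
print(int(np.argmax(scores)))  # index of the row with the largest reconstruction error
```

Rows that do not fit the dominant correlation structure reconstruct poorly, so a higher score means "more anomalous".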

#### Detector Evaluation - Gaussian Models with PCA
|Dataset|#1|#2|#3|#4|
|---|---|---|---|---|
|Precision|12.9%||||
|Recall|100%||||
|R-Precision|22.9%||||
|Precision@50|0%||||
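
The report does not spell out how the Gaussian model is fitted, so the following is only a sketch under one assumption: a single multivariate Gaussian is fitted to the reduced features (for example, PCA projections), and each point is scored by its squared Mahalanobis distance, which grows as the fitted density falls. The function name `gaussian_anomaly_score` and the small ridge term are hypothetical choices.

```python
import numpy as np

def gaussian_anomaly_score(Z):
    """Squared Mahalanobis distance of each row to a single Gaussian fitted on Z."""
    Z = np.asarray(Z, dtype=float)
    mu = Z.mean(axis=0)
    # Small ridge keeps the covariance invertible (assumption, not from the report)
    cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])
    diff = Z - mu
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

# Toy features: 500 standard-normal points in 3-D, with row 0 moved far from the bulk
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))
Z[0] = [8.0, 8.0, 8.0]

gauss_scores = gaussian_anomaly_score(Z)
print(int(np.argmax(gauss_scores)))  # index of the lowest-density row
```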

#### Detector Evaluation - Reconstruction Error with Autoencoder
|Dataset|#1|#2|#3|#4|
|---|---|---|---|---|
|Precision|20.2%||||
|Recall|82.9%||||
|R-Precision|14.8%||||
|Precision@50|4.0%||||

#### Detector Evaluation - Gaussian Models with Autoencoder
Works amazingly well on dataset 1.

|Dataset|#1|#2|#3|#4|
|---|---|---|---|---|
|Precision|71.1%||||
|Recall|94.5%||||
|R-Precision|76.7%||||
|Precision@50|100.0%||||
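
The four metrics used in the tables of this section can be computed roughly as follows. The definitions are the common information-retrieval conventions and are an assumption here, since the report does not state them: Precision@k is the share of true anomalies among the k highest-scored rows, and R-Precision is Precision@R with R equal to the number of true anomalies. All function names are hypothetical.

```python
import numpy as np

def precision_recall(labels, predictions):
    """Precision and recall for binary predictions (1 = anomaly)."""
    labels = np.asarray(labels).astype(bool)
    predictions = np.asarray(predictions).astype(bool)
    tp = np.sum(labels & predictions)
    precision = tp / max(np.sum(predictions), 1)  # guard against no predicted anomalies
    recall = tp / max(np.sum(labels), 1)          # guard against no true anomalies
    return precision, recall

def precision_at_k(labels, scores, k):
    """Share of true anomalies among the k rows with the highest anomaly score."""
    top_k = np.argsort(scores)[::-1][:k]
    return np.mean(np.asarray(labels)[top_k])

def r_precision(labels, scores):
    """Precision@R, where R is the number of true anomalies."""
    return precision_at_k(labels, scores, int(np.sum(labels)))

# Tiny worked example: 3 true anomalies among 6 rows
labels = np.array([1, 0, 1, 0, 0, 1])
scores = np.array([0.9, 0.8, 0.7, 0.1, 0.2, 0.3])
p, r = precision_recall(labels, scores > 0.5)  # threshold of 0.5 is arbitrary
print(p, r)                                    # both are 2/3 here
print(precision_at_k(labels, scores, 2))       # 0.5
print(r_precision(labels, scores))             # 2/3
```

Precision and recall depend on the chosen score threshold, while Precision@k and R-Precision only depend on the ranking of the scores.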

## Appendices
### Appendix 1: Code to generate synthetic dataset 4

In [None]:
import numpy as np
import random
import matplotlib.pyplot as plt
from random import shuffle

# Generate 100K binary vectors, each of which has 16 digits
# Anomaly: the number of 1s is less than Anomaly_Threshold

# Set parameters
n_dimensions = 16
n_samples = 10**5
data1_ratio = 0.5 # Share of samples drawn from the first Gaussian
data2_ratio = 0.5 # Share of samples drawn from the second Gaussian
Anomaly_Threshold = 4 # Anomaly if total # 1s is less than the threshold

def generate_random_mg_data(n_dimensions, n_samples):
    """
    Generate a random binary dataset by thresholding multivariate Gaussian samples
    """
    mu = np.random.rand(n_dimensions) # Random vector for the mean
    # Random matrix used as the covariance. It is not guaranteed to be symmetric
    # positive semidefinite, so NumPy may emit a RuntimeWarning when sampling.
    cov = np.random.rand(n_dimensions,n_dimensions)
    data_mg = np.random.multivariate_normal(mu, cov, size=n_samples) # Sample from the multivariate normal distribution
    data_bi = data_mg >= 0.5 # Convert to binary - True if the value is at least 0.5; otherwise False
    data = data_bi*1 # Convert True/False to 1/0
    return data

def generate_random_data_2md(n_dimensions, data1_size, data2_size):
    """
    Generate a random dataset from two multivariate Gaussian distributions
    """
    # Generate the two datasets
    data1 = generate_random_mg_data(n_dimensions, data1_size)
    data2 = generate_random_mg_data(n_dimensions, data2_size)
    # Merge
    data = np.concatenate((data1,data2))

    # Shuffle the rows in place. np.random.shuffle is used because random.shuffle
    # can duplicate rows of a 2-D NumPy array and ignores np.random.seed.
    np.random.shuffle(data)

    return data

np.random.seed(9001)
# Generate a random dataset from two multivariate Gaussian distributions
data = generate_random_data_2md(n_dimensions,int(n_samples*data1_ratio),int(n_samples*data2_ratio))

# Label a row as an anomaly if its number of 1s is less than Anomaly_Threshold
data_rowsum = np.sum(data,axis = 1)
labels = data_rowsum < Anomaly_Threshold # Anomaly if total # 1s is less than the threshold
labels = labels*1

print("Percentage of Anomaly in the dataset: " + str(np.sum(labels)/len(labels))) # Find percentage of anomaly in the dataset
print(data[labels == 1][:5]) # Print the first 5 rows of anomaly data as examples

if np.sum(labels)/len(labels) > 0.2:
    print("Too many anomalies: start cleaning!")
    labels_remove = (labels==1) & (np.random.rand(len(labels)) <= 0.6) # Mark around 60% of the anomalies for removal
    print(str(np.sum(labels_remove)) + ' anomalies are going to be removed.')
    data = data[~labels_remove] # Remove the selected data
    labels = labels[~labels_remove] # Remove the corresponding labels
    print("Percentage of Anomaly in the dataset after cleaning: " + str(np.sum(labels)/len(labels))) # Find percentage of anomaly in the dataset


# Save the data and labels
np.save('data.npy',data)
np.save('labels.npy',labels)
print('Data and Labels have been saved!')