# Anomaly Detection


Anomaly detection, also called outlier detection, identifies samples that do not conform to an expected pattern or to the other items in a dataset. It is helpful for classification or unsupervised learning problems in which the cases of interest are rare. The following sections introduce how to build an anomaly detection method.

1, Define the notation:

m-- Number of training samples

n-- Number of sample features

X-- Features (vector)

$\mu$-- mean

$\sigma^2$-- variance

$\epsilon$-- threshold

2, Gaussian distribution:

To perform anomaly detection, you will first need to fit a model to the data's
distribution.

Given a training data set, you could estimate the Gaussian distribution for each of the features. 
<img src="https://github.com/GongliDuan/exdata-data-household_power_consumption/blob/master/1.png?raw=true" alt="1.png">
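
As a rough illustration, the per-feature density can be evaluated with NumPy as below. This is a minimal sketch that assumes the figure above shows the standard univariate Gaussian density; `gaussian_pdf` and the sample values are hypothetical, chosen only for this example.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Illustrative values only: the density at the mean and two standard deviations away.
mu, sigma2 = 0.0, 1.0
print(gaussian_pdf(np.array([0.0, 2.0]), mu, sigma2))
```

The density is highest at the mean and falls off quickly away from it, which is what lets a small density value signal an unusual sample.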

3, Estimating parameters for a Gaussian:

In the Gaussian distribution, to estimate the mean, you will use:
<img src="https://github.com/GongliDuan/exdata-data-household_power_consumption/blob/master/2.png?raw=true" alt="2.png">

And for the variance you will use:
<img src="https://github.com/GongliDuan/exdata-data-household_power_consumption/blob/master/3.png?raw=true" alt="3.png">
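
The two estimates can be sketched in a few lines of NumPy. The toy matrix `X` is hypothetical, and the sketch assumes the variance is normalised by 1/m, which is the default behaviour of `np.var` as used in the module later in this document.

```python
import numpy as np

# Hypothetical toy data: m = 4 samples (rows), n = 2 features (columns).
X = np.array([[1.0, 10.0],
              [2.0, 12.0],
              [3.0, 14.0],
              [4.0, 16.0]])

m = X.shape[0]
mu = X.sum(axis=0) / m                     # per-feature mean
sigma2 = ((X - mu) ** 2).sum(axis=0) / m   # per-feature variance, 1/m normalisation

# np.mean and np.var (default ddof=0) give the same estimates.
same_mean = np.allclose(mu, np.mean(X, axis=0))
same_var = np.allclose(sigma2, np.var(X, axis=0))
print(mu, sigma2, same_mean, same_var)
```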


4, F1 score:

tp: the number of true positives, where the ground truth label says it's an anomaly and our algorithm correctly classified it as an anomaly.

fp: the number of false positives, where the ground truth label says it's not an anomaly, but our algorithm incorrectly classified it as an anomaly.

fn: the number of false negatives, where the ground truth label says it's an anomaly, but our algorithm incorrectly classified it as not being anomalous.

precision:
<img src="https://github.com/GongliDuan/exdata-data-household_power_consumption/blob/master/4.png?raw=true" alt="4.png">

recall:
<img src="https://github.com/GongliDuan/exdata-data-household_power_consumption/blob/master/5.png?raw=true" alt="5.png">

F1 score:
<img src="https://github.com/GongliDuan/exdata-data-household_power_consumption/blob/master/6.png?raw=true" alt="6.png">
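
The three quantities can be computed from the counts as sketched below. The helper `f1_from_counts` and the example counts are hypothetical, and returning 0 when there are no true positives is a convention assumed here, since the score is otherwise undefined in that case.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) computed from raw counts."""
    if tp == 0:
        # Precision, recall or F1 would involve a division by zero; treat as 0.
        return 0.0, 0.0, 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 anomalies caught, 2 false alarms, 4 anomalies missed.
precision, recall, f1 = f1_from_counts(tp=8, fp=2, fn=4)
print(precision, recall, f1)
```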

5, Selecting the threshold:

Once the Gaussian model is fitted, we can evaluate it on a new test sample to get its probability density p. If $p < \epsilon$, we flag the sample as an anomaly.

Choosing the value of $\epsilon$ is therefore a key step. We can do this by borrowing the idea of cross-validation from supervised learning.

Assume we have $m_1$ normal samples and $m_2$ abnormal samples. We split $m_1$ into three subsets, $m_{11}$, $m_{12}$ and $m_{13}$, and split $m_2$ into two subsets, $m_{21}$ and $m_{22}$. We use $m_{11}$ to calculate the mean and variance, which gives us the Gaussian distribution. Next, we combine $m_{12}$ and $m_{21}$ into a validation set and assign targets: 0 for the samples from $m_{12}$ and 1 for the samples from $m_{21}$. We then loop through a list of candidate $\epsilon$ values, each of which yields a detection result on the validation set, and use the F1 score to judge which $\epsilon$ is best. Finally, the $m_{13}$ and $m_{22}$ samples are held out for testing.
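
The splitting and threshold search described above might look like the sketch below. Everything here is an assumption made for illustration: the synthetic data, the subset sizes, the grid of 101 candidate thresholds, and the helper names `density` and `f1`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 600 normal samples near the origin and
# 40 abnormal samples far from it, with 2 features each.
normal = rng.normal(loc=0.0, scale=1.0, size=(600, 2))
abnormal = rng.normal(loc=6.0, scale=1.0, size=(40, 2))

# Split the normal samples three ways and the abnormal samples two ways.
n_train, n_val, n_test = np.split(normal, [400, 500])
a_val, a_test = np.split(abnormal, [20])

# Validation and test sets mix normal (label 0) and abnormal (label 1) samples.
X_val = np.vstack([n_val, a_val])
y_val = np.concatenate([np.zeros(len(n_val), dtype=int), np.ones(len(a_val), dtype=int)])
X_test = np.vstack([n_test, a_test])
y_test = np.concatenate([np.zeros(len(n_test), dtype=int), np.ones(len(a_test), dtype=int)])

# The Gaussian parameters come from the first normal subset only.
mu = n_train.mean(axis=0)
sigma2 = n_train.var(axis=0)

def density(X):
    # Product of the per-feature Gaussian densities for each row of X.
    return np.prod(np.exp(-(X - mu) ** 2 / (2.0 * sigma2))
                   / np.sqrt(2.0 * np.pi * sigma2), axis=1)

def f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    # 2*tp / (2*tp + fp + fn) is algebraically equal to 2*P*R / (P + R).
    return 0.0 if tp == 0 else 2.0 * tp / (2.0 * tp + fp + fn)

# Score every candidate threshold on the validation set and keep the best.
p_val = density(X_val)
candidates = np.linspace(p_val.min(), p_val.max(), 101)
scores = [f1(y_val, (p_val < eps).astype(int)) for eps in candidates]
best_epsilon = candidates[int(np.argmax(scores))]

# The held-out subsets give an estimate of performance on unseen data.
test_f1 = f1(y_test, (density(X_test) < best_epsilon).astype(int))
print(best_epsilon, test_f1)
```

Keeping the test subsets out of both the parameter estimation and the threshold search means the final F1 score is not biased by the choice of $\epsilon$.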




###### The following is the module, written in Python:

In [None]:
import math
import numpy as np

class Anomaly_Detection():
    
    def __init__(self):
        
        self.mu = None
        self.sigma2 = None
        self.best_epsilon = 0
        self.best_f1_score = 0

    def gaussian_para(self, X):
        # estimate the per-feature mean and variance from a subset of the
        # normal training data (rows are samples, columns are features)
        self.mu = np.mean(X, axis=0)
        self.sigma2 = np.var(X, axis=0)
        
    def gaussian(self, X):
        # evaluate the density of each row of X as the product of the
        # per-feature Gaussian densities
        p = np.prod(np.exp(-1.0 * (X - self.mu)**2 / (2.0 * self.sigma2))
                    / np.sqrt(2.0 * math.pi * self.sigma2), axis=1)
        return p
    
    def fit_epsilon(self, X, y):
        # search for the best epsilon on a labelled validation set built from the
        # 2nd normal subset (label 0) and the 1st abnormal subset (label 1);
        # each candidate epsilon is scored with F1 and the best one is kept
        y = np.asarray(y)
        p_value = self.gaussian(X)
        p_max = np.max(p_value)
        p_min = np.min(p_value)
        step = (p_max - p_min) / 100.0
        if step == 0:
            # all densities are identical, so no threshold can separate them
            return

        for epsilon in np.arange(p_min, p_max + step, step):
            y_pred = (p_value < epsilon).astype(int)
            # count over the whole validation set before computing the score
            tp = np.sum((y_pred == 1) & (y == 1))
            fp = np.sum((y_pred == 1) & (y == 0))
            fn = np.sum((y_pred == 0) & (y == 1))

            # with no true positives, precision, recall or F1 is undefined
            if tp == 0:
                continue

            prec = float(tp) / (tp + fp)
            rec = float(tp) / (tp + fn)
            f1 = 2 * prec * rec / (prec + rec)

            if f1 > self.best_f1_score:
                self.best_f1_score = f1
                self.best_epsilon = epsilon
            
    def fit(self, X_train, X_val, y_val):
        # fit the model: estimate the Gaussian parameters on the normal training
        # subset, then choose epsilon on the labelled validation set
        self.gaussian_para(X_train)
        self.fit_epsilon(X_val, y_val)
        
    def predict(self, X):
        # classify the test samples as normal or abnormal;
        # in the result array, 0 represents normal and 1 represents abnormal
        p = self.gaussian(X)
        y = (p < self.best_epsilon).astype(int)
        return y