## Anomaly Detection

Here we model the normal data as a cluster of points around a centroid and use it to estimate the probability that a new point is normal. The farther a point lies from the centroid, the lower its probability of being a normal input and the higher its probability of being an anomaly.

## Gaussian / Normal / Bell-Shaped Distribution

The density of a Gaussian with mean $\mu$ and variance $\sigma^2$ is:
$$p(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{\frac{-(x-\mu)^2}{2\sigma^2}}$$
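
The density above can be sketched in NumPy as follows. The function name `gaussian_pdf` is an assumption made for illustration (it is not part of these notes), and `sigma` is taken to be the standard deviation, not the variance.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density p(x) with mean mu and standard deviation sigma."""
    x = np.asarray(x, dtype=float)
    coeff = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The density peaks at the mean and falls off symmetrically on both sides.
print(gaussian_pdf([-1.0, 0.0, 1.0], mu=0.0, sigma=1.0))
```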

Maximum likelihood estimates for $\mu$ and $\sigma^2$:
$$\mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}$$
$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(x^{(i)}-\mu)^2$$
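
As a quick worked example (values chosen only for illustration), take the three observations $x^{(1)}=1$, $x^{(2)}=3$, $x^{(3)}=5$, so $m=3$:
$$\mu = \frac{1+3+5}{3} = 3$$
$$\sigma^2 = \frac{(1-3)^2+(3-3)^2+(5-3)^2}{3} = \frac{8}{3} \approx 2.67$$
Note that the divisor is $m$, not $m-1$; this is also what `np.var` computes by default.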

In case of multiple features (treating the features as independent):
$$p(\vec{x}) = p(x_1;\mu_1,\sigma_1^2) \cdot p(x_2;\mu_2,\sigma_2^2) \cdots p(x_n;\mu_n,\sigma_n^2)$$
Or,
$$p(\vec{x})=\prod_{j=1}^{n}p(x_j;\mu_j, \sigma_j^2)$$
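
The product over features can be sketched as below. This is a minimal illustration, not a definitive implementation: `density_product` is a hypothetical name, and `mu` and `var` are assumed to hold the per-feature mean and variance.

```python
import numpy as np

def density_product(X, mu, var):
    """p(x) for each row of X as a product of per-feature Gaussian densities.

    X: (m, n) data matrix; mu, var: (n,) per-feature mean and variance.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    per_feature = np.exp(-((X - mu) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return np.prod(per_feature, axis=1)

mu = np.array([0.0, 0.0])
var = np.array([1.0, 4.0])
X = np.array([[0.0, 0.0],    # at the mean: highest density
              [3.0, 6.0]])   # three standard deviations out in both features
p = density_product(X, mu, var)
print(p)
```

With many features the product can underflow to zero; summing log-densities instead is a common alternative.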

## The importance of real-number evaluation

When developing a learning algorithm, making decisions is much easier if we have a way of evaluating it with a single real number.

Assume we have some labeled data: anomalous (1) and non-anomalous (0) examples.

Evaluation works better if we create a cross-validation set `x_cv` and include some of the anomalous examples in it; the training set itself contains only normal examples.

__For example:__ <br>
To tune the parameter $\epsilon$ <br>

Training set: 6000 good engines <br>
CV: 2000 good engines and 10 anomalous <br>
Test: 2000 good engines and 10 anomalous
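
The split above can be sketched as follows. The data here is synthetic and stands in for real engine measurements; the two features, the random distributions, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 10000 normal and 20 anomalous examples
X_normal = rng.normal(loc=0.0, scale=1.0, size=(10000, 2))
X_anom = rng.normal(loc=6.0, scale=1.0, size=(20, 2))

# Training set: normal examples only
X_train = X_normal[:6000]

# Cross-validation set: 2000 normal + 10 anomalous, with labels (1 = anomalous)
X_cv = np.vstack([X_normal[6000:8000], X_anom[:10]])
y_cv = np.concatenate([np.zeros(2000, dtype=int), np.ones(10, dtype=int)])

# Test set: the remaining 2000 normal + 10 anomalous
X_test = np.vstack([X_normal[8000:], X_anom[10:]])
y_test = np.concatenate([np.zeros(2000, dtype=int), np.ones(10, dtype=int)])

print(X_train.shape, X_cv.shape, X_test.shape)  # (6000, 2) (2010, 2) (2010, 2)
```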

### Algorithm evaluation
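Fit the model p(x) on the training set <br>
On a cross-validation/test example x, predict anomaly (1) if p(x) < $\epsilon$ and normal (0) otherwise <br>
Because the data set is skewed, evaluate with skewed-data diagnostics such as the F1 score rather than plain classification accuracy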
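
To see why accuracy is misleading on skewed data, here is a small sketch. The helper `f1_score` and the toy labels are assumptions for illustration; returning 0 when there are no true positives is a convention chosen here to avoid a 0/0.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 score for binary labels, where 1 marks the positive (anomalous) class."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_pred & y_true)     # true positives
    fp = np.sum(y_pred & ~y_true)    # false positives
    fn = np.sum(~y_pred & y_true)    # false negatives
    if tp == 0:
        return 0.0                   # convention: avoids dividing 0 by 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# A model that never flags anything is right 75% of the time but has F1 = 0
always_normal = np.zeros_like(y_true)
print(np.mean(always_normal == y_true))   # 0.75
print(f1_score(y_true, always_normal))    # 0.0

# One true positive, one false positive, one false negative
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0])
print(f1_score(y_true, y_pred))           # 0.5
```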

## Anomaly detection vs Supervised Learning

#### Anomaly Detection
1) Very small number of positive examples (0-20 is common) and a large number of negative examples
2) Many different "types" of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far

#### Supervised Learning
1) Large number of positive and negative examples
2) Enough positive examples for the algorithm to get a sense of what positive examples look like; future positive examples are likely to be similar to the ones in the training set.

## Choosing what features to use

It is very important to choose the right features in anomaly detection, even more rigorously than in supervised learning.

Make sure the features you give the model are approximately Gaussian; if a feature is not, transform it.

Plot the feature with `plt.hist()`; if its histogram looks Gaussian, use it as is <br>
If not, try taking the log of that feature, which may make it more Gaussian: try __log(x+c)__ or some __fractional power of x__
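
The effect of these transforms can also be checked numerically. The sketch below uses sample skewness as a rough stand-in for eyeballing the histogram; the synthetic exponential feature, the offset `c = 1.0`, the power `0.5`, and the helper `skewness` are all assumptions made for illustration.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third standardized moment (0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=5000)  # synthetic right-skewed feature

c = 1.0                  # hypothetical offset; tune it by inspecting the histogram
x_log = np.log(x + c)    # log(x + c) transform
x_root = x ** 0.5        # a fractional power of x

# Skewness closer to 0 suggests a more symmetric, more Gaussian-looking feature
for name, values in [("raw", x), ("log(x+c)", x_log), ("x**0.5", x_root)]:
    print(f"{name:>9}: skewness = {skewness(values):.2f}")
```

Skewness is only one summary; plotting the transformed feature with `plt.hist()` remains the more direct check.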

## Error analysis for anomaly detection

We want p(x) to be large for normal examples and very small for anomalous examples.

Error analysis helps when that fails. Say an anomaly goes undetected because it has a high p(x) value, that is, it looks very close to a normal example on the current features. If we investigate that example and find another feature that helps us distinguish the two, we can add it to the model. Asking yourself "What made me think this was an anomaly?" helps in creating a better algorithm.

In [1]:
import numpy as np

In [11]:
def estimate_gaussian(X): 
    """
    Calculates mean and variance of all features 
    in the dataset
    
    Args:
        X (ndarray): (m, n) Data matrix
    
    Returns:
        mu (ndarray): (n,) Mean of all features
        var (ndarray): (n,) Variance of all features
    """

    m, n = X.shape
    
    ### START CODE HERE ### 
    # Column-wise mean and variance; np.var divides by m (ddof=0),
    # which matches the maximum likelihood estimate
    mu = np.mean(X, axis=0)
    var = np.var(X, axis=0)
    ### END CODE HERE ### 
        
    return mu, var

In [12]:
estimate_gaussian(np.array([[1, 2], [3, 4], [5,6]]))

(array([3., 4.]), array([2.66666667, 2.66666667]))

In [None]:
p_val = [0.4, 0.3, ]

In [13]:
# UNQ_C2
# GRADED FUNCTION: select_threshold

def select_threshold(y_val, p_val): 
    """
    Finds the best threshold to use for selecting outliers 
    based on the results from a validation set (p_val) 
    and the ground truth (y_val)
    
    Args:
        y_val (ndarray): Ground truth on validation set
        p_val (ndarray): Results on validation set
        
    Returns:
        epsilon (float): Threshold chosen 
        F1 (float):      F1 score by choosing epsilon as threshold
    """ 

    best_epsilon = 0
    best_F1 = 0
    F1 = 0
    
    step_size = (max(p_val) - min(p_val)) / 1000
    
    for epsilon in np.arange(min(p_val), max(p_val), step_size):
    
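        ### START CODE HERE ### 
        # Flag an example as an anomaly when its probability is below epsilon
        predictions = (p_val < epsilon)
        
        tp = np.sum(predictions & (y_val == 1))   # true positives
        fn = np.sum(~predictions & (y_val == 1))  # false negatives
        fp = np.sum(predictions & (y_val == 0))   # false positives
        
        # With no true positives, precision and recall are 0 or undefined (0/0),
        # so this threshold cannot improve on the best F1; skip it
        if tp == 0:
            continue
        
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        
        F1 = 2 * (prec * rec) / (prec + rec)
        ### END CODE HERE ### 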
        
        if F1 > best_F1:
            best_F1 = F1
            best_epsilon = epsilon
        
    return best_epsilon, best_F1