# Lecture 14: Anomaly Detection

Given a dataset $\{ x^{(1)}, \dots, x^{(m)} \}$ of normal examples, decide whether a new example $x_{test}$ is anomalous.

Example applications: fraud detection, manufacturing, and monitoring computers in a data center.

**Gaussian distribution**: $x \sim \mathcal{N}(\mu, \sigma^2)$ has density $p(x;\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$


**Anomaly detection algorithm**:
1. choose features $x_j$ that you think might be indicative of anomalous examples.
2. fit parameters $\mu_1, \dots, \mu_n, \sigma^2_1, \dots, \sigma^2_n$.
    * $\mu_j = \frac{1}{m}\sum^m_{i=1}x^{(i)}_j$
    * $\sigma^2_j = \frac{1}{m}\sum^m_{i=1}(x^{(i)}_j - \mu_j)^2$
3. given a new example $x$, compute $p(x) = \prod_{j=1}^n p(x_j; \mu_j, \sigma^2_j) = \prod^n_{j=1} \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$; flag $x$ as an anomaly if $p(x) < \epsilon$.
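
The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names `fit_gaussian` and `density`, the synthetic training data, and the threshold value `epsilon = 1e-4` are all assumptions made for the example.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate per-feature mean and variance with the 1/m normalisation used above."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # np.var defaults to ddof=0, i.e. it divides by m
    return mu, sigma2

def density(X, mu, sigma2):
    """Return p(x) for each row of X as a product of univariate Gaussian densities."""
    X = np.atleast_2d(X)
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    expo = np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    return np.prod(coef * expo, axis=1)

# Synthetic "normal" data with two features (hypothetical values).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(1000, 2))

mu, sigma2 = fit_gaussian(X_train)

epsilon = 1e-4  # hypothetical threshold; in practice it is tuned on a CV set
x_new = np.array([[5.0, 10.0],    # close to the training mean
                  [12.0, 25.0]])  # far from the training data
p = density(x_new, mu, sigma2)
is_anomaly = p < epsilon
```

For many features the product can underflow, so summing log-densities and comparing against $\log \epsilon$ is a common variation on the same idea.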

## Algorithm evaluation

Example: 10000 good (normal) engines and 20 flawed (anomalous) engines.

1. Training set: 6000 good engines; CV: 2000 good engines ($y=0$), 10 anomalous ($y=1$); Test: 2000 good engines ($y=0$), 10 anomalous ($y=1$).
2. Fit model $p(x)$ on training set $\{x^{(1)}, \dots, x^{(m)}\}$.
3. On a cross validation/test example x, predict $y=1$ (anomaly) if $p(x) < \epsilon$; $y=0$ (normal) if $p(x) \geq \epsilon$.
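4. Because the classes are heavily skewed, plain accuracy is not informative. Evaluate with one of the following metrics, and use the cross-validation set to choose the threshold $\epsilon$: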
    * True positive, false positive, false negative, true negative.
    * Precision/Recall.
    * F1-score.
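
One way to choose $\epsilon$ on the cross-validation set is to scan candidate thresholds and keep the one with the highest F1-score. The sketch below assumes the densities `p_cv` have already been computed by the fitted model; the helper names `f1_score` and `select_epsilon`, the number of scan steps, and the toy values are hypothetical.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 with y=1 (anomaly) as the positive class; returns 0.0 when there are no true positives."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv, n_steps=1000):
    """Scan thresholds between min(p_cv) and max(p_cv); keep the first one with the best F1."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), n_steps):
        y_pred = (p_cv < eps).astype(int)  # predict anomaly when p(x) < eps
        f1 = f1_score(y_cv, y_pred)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Toy CV set: densities from a fitted model and their labels (hypothetical values).
p_cv = np.array([0.30, 0.25, 0.28, 0.22, 0.001, 0.002, 0.27, 0.26])
y_cv = np.array([0, 0, 0, 0, 1, 1, 0, 0])

eps, f1 = select_epsilon(p_cv, y_cv)
```

The test set is then used only once, with the chosen `eps`, to report the final metric.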
    
    
**Anomaly detection vs. supervised learning**
* anomaly detection:
    * very small number of positive examples.
    * large number of negative examples.
    * many different "types" of anomalies, so future anomalies may look nothing like the ones seen so far.
* supervised learning:
    * large number of both positive and negative examples.
    * enough positive examples for the algorithm to learn what positives look like.

## Multivariate Gaussian Distribution

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x - \mu)\right)$$

**Parameter fitting**: $\mu=\frac{1}{m}\sum^m_{i=1}x^{(i)}, \Sigma = \frac{1}{m}\sum^m_{i=1}(x^{(i)} - \mu)(x^{(i)} - \mu)^T$.
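
The fitting formulas and the density can be sketched in NumPy as below. The function names `fit_multivariate` and `mv_density` and the synthetic data are assumptions for illustration. The sketch solves a linear system instead of forming $\Sigma^{-1}$ explicitly, which is a numerical choice and not something the notes prescribe.

```python
import numpy as np

def fit_multivariate(X):
    """Estimate the mean vector and the covariance matrix with 1/m normalisation."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    D = X - mu
    Sigma = D.T @ D / m  # sum of outer products (x - mu)(x - mu)^T, divided by m
    return mu, Sigma

def mv_density(X, mu, Sigma):
    """Evaluate the multivariate Gaussian density for each row of X."""
    X = np.atleast_2d(X)
    n = mu.shape[0]
    D = X - mu
    # Solve Sigma z = (x - mu) for every row rather than inverting Sigma.
    sol = np.linalg.solve(Sigma, D.T).T
    quad = np.sum(D * sol, axis=1)  # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Correlated synthetic data (hypothetical parameters).
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.8],
                     [0.8, 1.0]])
X_train = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=2000)

mu, Sigma = fit_multivariate(X_train)

# Both points are equally far from the mean, but only the first follows the correlation.
p = mv_density(np.array([[1.5, 1.5], [1.5, -1.5]]), mu, Sigma)
```

Because the fitted $\Sigma$ captures the positive correlation, the second point receives a much lower density than the first, even though each coordinate on its own looks unremarkable.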


**Original model vs. multivariate Gaussian**:
* Original model:
    * Manually create features to capture anomalies where $x_1, x_2$ take unusual combinations of values.
    * Computationally cheaper.
    * OK even if the training set size $m$ is small.
* Multivariate Gaussian:
    * Automatically captures correlations between features.
    * Computationally more expensive.
    * Must have $m > n$ or else $\Sigma$ is non-invertible.
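
The last point can be checked numerically. The sketch below builds $\Sigma$ from random data and reports its rank; the helper name `covariance_rank`, the rank tolerance `1e-10`, and the sizes used are assumptions for the example. Since the data are mean-centred, the rank of $\Sigma$ is at most $m - 1$, so with $m \le n$ it cannot reach $n$.

```python
import numpy as np

def covariance_rank(m, n, seed=0):
    """Rank of the 1/m covariance matrix built from m random examples with n features."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(m, n))
    D = X - X.mean(axis=0)
    Sigma = D.T @ D / m
    # An explicit tolerance keeps rounding noise from being counted as rank.
    return np.linalg.matrix_rank(Sigma, tol=1e-10)

n = 5
rank_small = covariance_rank(m=3, n=n)   # m <= n: rank is at most m - 1, so Sigma is singular
rank_large = covariance_rank(m=50, n=n)  # m > n: Sigma has full rank n for this random data
```

Redundant features (for example duplicated or linearly dependent columns) also make $\Sigma$ singular, even when $m > n$.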