## Problem Motivation

Just like in other learning problems, we are given a dataset \\(\lbrace x^{(1)}, x^{(2)},\dots,x^{(m)}\rbrace\\).

We are then given a new example, \\(x_{test}\\), and we want to know whether this new example is abnormal/anomalous.

We define a "model" \\(p(x)\\) that tells us how probable an example is under normal (non-anomalous) behavior. We also use a threshold ϵ (epsilon) as a dividing line: examples with p(x)<ϵ are considered anomalous, and all others are considered normal.

A very common application of anomaly detection is **detecting fraud**:

* \\(x^{(i)}\\) = features of user i's activities.
* Model \\(p(x)\\) from the data.
* Identify unusual users by checking which have p(x)<ϵ.

If our anomaly detector is flagging **too many** examples as anomalous, then we need to **decrease** our threshold ϵ, so that fewer examples satisfy p(x)<ϵ.
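
To illustrate the effect of the threshold, here is a minimal sketch. The list `p_values` and the helper `flag` are hypothetical, made up for this example; they stand in for the values \\(p(x^{(i)})\\) that a fitted model would produce for six users.

```python
from typing import List

# Hypothetical model outputs p(x) for six users (not real data).
p_values = [0.21, 0.18, 0.002, 0.09, 0.0004, 0.03]

def flag(p_values: List[float], epsilon: float) -> List[int]:
    """Return the indices of examples whose p(x) falls below epsilon."""
    return [i for i, p in enumerate(p_values) if p < epsilon]

print(flag(p_values, epsilon=0.05))   # [2, 4, 5]
print(flag(p_values, epsilon=0.001))  # [4]
```

Lowering ϵ from 0.05 to 0.001 shrinks the set of flagged users from three to one, which is the adjustment described above.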


## Gaussian Distribution

The Gaussian Distribution is the familiar bell-shaped curve, written as \\(\mathcal{N}(\mu,\sigma^2)\\).

Let x∈ℝ. If the probability distribution of x is Gaussian with mean μ and variance \\(\sigma^2\\), then we write:

$$x \sim \mathcal{N}(\mu, \sigma^2)$$

The little ∼ or 'tilde' can be read as "distributed as."

The Gaussian Distribution is parameterized by a mean and a variance.

Mu, or μ, describes the center of the curve, called the mean. The width of the curve is described by sigma, or σ, called the standard deviation; its square, \\(\sigma^2\\), is the variance.

The full function is as follows:

$$\large p(x;\mu,\sigma^2) = \dfrac{1}{\sigma\sqrt{2\pi}}\exp\left(-\dfrac{1}{2}\left(\dfrac{x - \mu}{\sigma}\right)^2\right)$$
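
The density formula translates almost directly into code. The following is a minimal sketch using only Python's standard library; the function name `gaussian_pdf` and its argument names are choices made for this example.

```python
import math

def gaussian_pdf(x: float, mu: float, sigma2: float) -> float:
    """Density of N(mu, sigma2) evaluated at x; sigma2 is the variance."""
    sigma = math.sqrt(sigma2)
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coeff * math.exp(exponent)

# The density is highest at the mean and symmetric around it.
peak = gaussian_pdf(0.0, mu=0.0, sigma2=1.0)
left = gaussian_pdf(-1.0, mu=0.0, sigma2=1.0)
right = gaussian_pdf(1.0, mu=0.0, sigma2=1.0)
```

Note that the function takes the variance \\(\sigma^2\\), matching the notation \\(p(x;\mu,\sigma^2)\\), and computes σ internally.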

We can estimate the parameter μ from a given dataset by simply taking the average of all the examples:

$$\mu = \dfrac{1}{m}\displaystyle \sum_{i=1}^m x^{(i)}$$

We can estimate the other parameter, \\(\sigma^2\\), as the average squared deviation from μ, which resembles our familiar squared error formula:

$$\sigma^2 = \dfrac{1}{m}\displaystyle \sum_{i=1}^m(x^{(i)} - \mu)^2$$
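
As a small worked example of these two estimates, the sketch below computes μ and \\(\sigma^2\\) for a made-up one-dimensional dataset. The function name `estimate_gaussian` and the sample values are assumptions for illustration. It divides by m, as in the formulas above, rather than by m−1.

```python
from typing import List, Tuple

def estimate_gaussian(xs: List[float]) -> Tuple[float, float]:
    """Estimate the mean and variance of 1-D data, dividing by m."""
    m = len(xs)
    mu = sum(xs) / m
    sigma2 = sum((x - mu) ** 2 for x in xs) / m
    return mu, sigma2

# Hypothetical samples, chosen so the arithmetic is easy to check by hand.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma2 = estimate_gaussian(samples)
print(mu, sigma2)  # 5.0 4.0
```

Here the samples sum to 40, so μ = 40/8 = 5, and the squared deviations sum to 32, so \\(\sigma^2\\) = 32/8 = 4.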


## Algorithm

Given a training set of examples, \\(\lbrace x^{(1)},\dots,x^{(m)}\rbrace\\), where each example is a vector \\(x \in \mathbb{R}^n\\), we model \\(p(x)\\) as a product of one Gaussian density per feature:

$$p(x) = p(x_1;\mu_1,\sigma_1^2)p(x_2;\mu_2,\sigma^2_2)\cdots p(x_n;\mu_n,\sigma^2_n)$$

In statistics, writing \\(p(x)\\) as such a product corresponds to an "independence assumption" on the values of the features inside training example x.

More compactly, the above expression can be written as follows:

$$p(x) = \displaystyle \prod^n_{j=1} p(x_j;\mu_j,\sigma_j^2)$$

### The algorithm

1. Choose features \\(x_j\\) that you think might be indicative of anomalous examples.

2. Fit parameters \\(\mu_1,\dots,\mu_n,\sigma_1^2,\dots,\sigma_n^2\\) on the training set, using the two calculations that follow.

3. Calculate \\(\mu_j = \dfrac{1}{m}\displaystyle \sum_{i=1}^m x_j^{(i)}\\)

4. Calculate \\(\sigma^2_j = \dfrac{1}{m}\displaystyle \sum_{i=1}^m(x_j^{(i)} - \mu_j)^2\\)

5. Given a new example x, compute p(x):
$$p(x) = \displaystyle \prod^n_{j=1} p(x_j;\mu_j,\sigma_j^2) = \prod\limits^n_{j=1} \dfrac{1}{\sqrt{2\pi}\sigma_j}\exp\left(-\dfrac{(x_j - \mu_j)^2}{2\sigma^2_j}\right)$$
Flag x as an anomaly if p(x)<ϵ.<br>
A vectorized version of the calculation for μ is \\(\mu = \dfrac{1}{m}\displaystyle \sum_{i=1}^m x^{(i)}\\). You can vectorize \\(\sigma^2\\) similarly.
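
The steps above can be sketched end to end as follows. This is a plain-Python illustration under stated assumptions, not a reference implementation: the names `fit`, `p`, and `is_anomaly`, the training data `X_train`, and the threshold `epsilon` are all hypothetical. It also assumes every feature has nonzero variance, since a zero variance would cause a division by zero.

```python
import math
from typing import List, Tuple

def fit(X: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Per-feature mean and variance (dividing by m) of an m-by-n dataset."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2

def p(x: List[float], mu: List[float], sigma2: List[float]) -> float:
    """Product of per-feature Gaussian densities (independence assumption)."""
    prob = 1.0
    for xj, mj, s2j in zip(x, mu, sigma2):
        # Assumes s2j > 0; a constant feature would divide by zero here.
        prob *= math.exp(-((xj - mj) ** 2) / (2.0 * s2j)) / math.sqrt(2.0 * math.pi * s2j)
    return prob

def is_anomaly(x: List[float], mu: List[float], sigma2: List[float], epsilon: float) -> bool:
    """Flag x as anomalous when p(x) < epsilon."""
    return p(x, mu, sigma2) < epsilon

# Hypothetical training set: m = 5 examples, n = 2 features.
X_train = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 9.0], [2.0, 13.0]]
mu, sigma2 = fit(X_train)   # means 2 and 11, variances 0.4 and 2
epsilon = 1e-3              # hypothetical threshold, chosen by hand

typical = [2.0, 11.0]       # sits exactly at the per-feature means
outlier = [8.0, 30.0]       # far from the means in both features
print(is_anomaly(typical, mu, sigma2, epsilon))  # False
print(is_anomaly(outlier, mu, sigma2, epsilon))  # True
```

With many features, the product of small densities can underflow to zero; summing the logarithms of the per-feature densities and comparing against log ϵ is a common workaround.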