# 9) Unsupervised learning - Outlier detection

* Unsupervised learning (briefly)
* Outlier detection
* Dimensionality reduction

### Unsupervised learning

In unsupervised learning the training data has the form $D = \{(x_1, ?), (x_2, ?),...(x_n, ?)\}$. So we have no labels for the data points, and the goal is to gain insight into the data distribution and see if we can make some sense of it. __Three important problems__ are __clustering__, where we apply algorithms to try to find natural clusters within the data, __outlier detection__, where we try to identify data points which seem to deviate from the pattern in the data, and finally __dimensionality reduction__, where we seek to reduce the dimensionality of our data while preserving as much information as possible.

### Outlier detection

Outlier detection is in a sense the opposite problem of clustering. In clustering we want to find the points that are similar and group them. In outlier detection we seek to find the points that are not similar to other points.

The standard definition of an __outlier__ is that it is an object that deviates so much from the rest of the data set as to arouse suspicion that it was generated by a different mechanism. It could be that the point is just an error, and so we might want to remove it. It could also indicate fraud, or simply be a valid but rare point. An outlier can be either an abnormality or simply noise. Often the abnormalities are of high interest to the analyst.

Outlier detection can be applied to areas such as:

- Credit card fraud detection
- Telecom fraud detection 
- Customer segmentation
- Medical analysis
- Surveillance

The output of an outlier algorithm can be binary labels, or a score for how much an object is an outlier. 

Outlier detection can be done using supervised, unsupervised or semi-supervised approaches, but here we discuss unsupervised approaches.

#### DBSCAN

DBSCAN finds outliers as a by-product: they are the points that do not end up in any cluster.
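A minimal sketch of this idea, assuming the usual DBSCAN definitions: a point is a core point if at least `min_pts` points (itself included) lie within distance `eps`, and a point is noise if it is neither a core point nor within `eps` of one. The function name `dbscan_noise`, the toy data and the parameter values are made up for illustration. The sketch only extracts the noise points and does not build the clusters themselves.

```python
import math

def dbscan_noise(points, eps, min_pts):
    """Return the indices of the points that DBSCAN would leave unclustered."""
    n = len(points)
    # eps-neighbourhood of every point (each point is its own neighbour)
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # core points have a sufficiently dense neighbourhood
    core = {i for i in range(n) if len(neighbours[i]) >= min_pts}
    # noise: not a core point and not reachable from any core point
    return [
        i for i in range(n)
        if i not in core and not any(j in core for j in neighbours[i])
    ]

points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # dense group
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),  # second dense group
    (2.5, 9.0),                                      # isolated point
]
noise = dbscan_noise(points, eps=0.3, min_pts=3)
print(noise)  # [8]
```

Only the isolated point is reported, since every other point sits in a dense neighbourhood.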

#### Basic outlier detection models


#### Probabilistic and statistical approach

Often a first step before applying more sophisticated methods.

Limitations: Only works on single features. Sensitive to many outliers, because the outliers themselves affect what is learned from the data. Not all outliers are detected.
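As an illustration, here is a minimal sketch of one common statistical approach, the z-score rule. This is an assumption, since the notes do not name a specific method: a value is flagged if it lies more than `threshold` standard deviations from the mean of its feature. The function name, the threshold of 3 and the data are hypothetical.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the indices of values more than `threshold` std devs from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []                    # all values identical, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

values = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0, 25.0]

flagged = zscore_outliers(values)           # the single extreme value is found
masked = zscore_outliers(values + [25.0])   # a second outlier hides both
print(flagged, masked)  # [10] []
```

The second call shows the sensitivity mentioned above: adding another extreme value shifts the mean and inflates the standard deviation so much that neither outlier exceeds the threshold.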

#### Expectation maximization

Fit a mixture of several Gaussians to the data.


### Local outliers instead of global outliers

Each object has an outlier factor: degree to which the object deviates. 

k-distance of an object p:
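Assuming the standard definition from the local outlier factor (LOF) literature, the k-distance of p is the distance from p to its k-th nearest neighbour. A minimal sketch with Euclidean distance follows; the function name `k_distance` and the toy data are illustrative only.

```python
import math

def k_distance(data, i, k):
    """Distance from data[i] to its k-th nearest neighbour (data[i] itself excluded)."""
    dists = sorted(
        math.dist(data[i], q) for j, q in enumerate(data) if j != i
    )
    return dists[k - 1]

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]

print(k_distance(data, 0, k=2))  # small: point 0 lies in a dense region
print(k_distance(data, 4, k=2))  # large: point 4 is far from everything
```

A large k-distance relative to that of the neighbours is what makes a point look like a local outlier.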





## Dimensionality reduction

The goal of dimensionality reduction is to take high dimensional data, and reduce the number of dimensions while preserving as much information/structure as possible. If we have very high dimensional data, there might be some redundant attributes, and if two attributes "change together" and are almost linearly dependent, then discarding one of them doesn't cost us much in terms of information/structure (we can approximately infer the discarded one from the one we keep).

The motivations for doing dimensionality reduction include the following. In terms of generalization, lowering the dimension will lower the complexity of the chosen hypothesis, which helps combat overfitting. High dimensional data also requires more computation than lower dimensional data. Reducing to two or three dimensions is often used to facilitate visualization of higher dimensional data.


#### Principal Component Analysis (PCA)

Assumption: Each observed variable should be normally distributed

PCA is a technique for reducing the number of dimensions. The idea is to find the unit vectors $u_i$ that point in the directions of highest variance. Projecting onto these directions preserves as much of the spread between the data points as possible, and therefore preserves more information. These directions are called __principal components__, and if we want to reduce the dimensionality from d to k, we pick the top k principal components $u_1,...,u_k$, and compute a new representation $x_i'$ for each point $x_i$ as:



$$ x_i' =\begin{bmatrix}
           u_1^T x_{i} \\
           u_2^T x_{i} \\
           \vdots \\
           u_k^T x_{i}
         \end{bmatrix} \in \mathbb{R}^k
$$

Here each of the k entries in $x_i'$ is the projection of $x_i$ onto the principal component unit vector $u_j$. Since the principal components are unit vectors and the data is centered, this projection gives us the signed distance from the mean to the point (along the direction of $u_j$).

<img src="imgs/projection.png" style="width: 400px;"/>
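The projection can be sketched in NumPy as follows. This is a minimal illustration and not a reference implementation: the function name `pca_project` and the synthetic data are made up, and the principal components are obtained from the eigenvectors of the covariance matrix using `np.linalg.eigh`.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal components.

    Returns the (n, k) projected data and the (d, k) matrix U whose
    columns are the unit vectors u_1, ..., u_k.
    """
    Xc = X - X.mean(axis=0)                  # centre every attribute
    cov = (Xc.T @ Xc) / Xc.shape[0]          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest
    U = eigvecs[:, order]
    return Xc @ U, U                         # row i holds u_j^T x_i for j = 1..k

# Synthetic 2-d data lying close to the line y = 2x (illustrative only)
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200)])

X_proj, U = pca_project(X, k=1)
print(X_proj.shape)   # (200, 1)
print(U[:, 0])        # roughly parallel to (1, 2) / sqrt(5), up to sign
```

The sign of each principal component is arbitrary, so the direction found may be the negative of the one you expect.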

#### Finding the principal components

The top principal component is given by:

$$\underset{u}{ \operatorname{arg\,max}} \frac{1}{n}\sum_{i=1}^n(x_i^T u)^2$$

That is, it is the vector u that maximizes the average squared length of the projections of the data onto u. Further, we constrain the length of u to be 1. This formula can be rewritten as:

$$ \frac{1}{n}\sum_{i=1}^n(x_i^T u)^2 = \frac{1}{n}\sum_{i=1}^n(x_i^T u)^T(x_i^T u) = u^T \bigg[ \frac{1}{n}\sum_{i=1}^n x_i x_i^T \bigg] u$$ 

Here we can identify the term in big square brackets as the covariance matrix (notice that we don't subtract the mean because we assume the data has been normalized so mean = 0, see section on preprocessing). __Linear algebra tells us that the vector u that maximizes this expression is the eigenvector of the covariance matrix with the largest eigenvalue__. Hence finding the principal components is a matter of finding the eigenvectors of the covariance matrix.
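The linear algebra step can be sketched with a Lagrange multiplier (a standard argument, included here as a supplement). Writing $\Sigma = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$, we maximize $u^T \Sigma u$ subject to $u^T u = 1$:

$$\mathcal{L}(u, \lambda) = u^T \Sigma u - \lambda (u^T u - 1), \qquad \nabla_u \mathcal{L} = 2 \Sigma u - 2 \lambda u = 0 \;\Rightarrow\; \Sigma u = \lambda u$$

So every candidate solution is an eigenvector of $\Sigma$, and for such a unit vector the objective equals $u^T \Sigma u = \lambda u^T u = \lambda$. The maximum is therefore attained at the eigenvector whose eigenvalue $\lambda$ is largest.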

The covariance matrix tells us something about the ellipsoidal shape of the data. See the plots below of 2d data with different covariance matrices. From the plots it's easy to see in which direction the principal component points.

<img src="imgs/covintuition.png" style="width: 450px;"/>
http://www.visiondummy.com/2014/04/geometric-interpretation-covariance-matrix/


#### To sum up

$$\Sigma u = \lambda u$$

0) Normalize the data (see preprocessing section) 

1) Compute covariance matrix

2) Find the eigenvectors/eigenvalues and pick the eigenvectors with the largest eigenvalues (and normalize them to unit length)

The covariance matrix can be computed using the __outer product__ as follows:

$$Cov(X) = \frac{1}{n}\sum_{i=1}^n (x_i- \bar x) (x_i- \bar x)^T$$

Notice that $(x_i- \bar x)$ is a d dimensional column vector corresponding to a data point $x_i$ where the mean of each attribute is subtracted from the corresponding entry. Note also that if we have preprocessed the data, then all means are 0, so we don't have to subtract them.
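The outer product formula can be checked numerically. The sketch below (function name and random data are illustrative) sums the outer products explicitly and compares the result with NumPy's `np.cov`, using `bias=True` so that it also divides by n rather than n - 1.

```python
import numpy as np

def covariance_outer(X):
    """Covariance matrix of the rows of X as an average of outer products."""
    n, d = X.shape
    mean = X.mean(axis=0)
    cov = np.zeros((d, d))
    for x in X:
        diff = (x - mean).reshape(d, 1)   # d-dimensional column vector
        cov += diff @ diff.T              # outer product, a d x d matrix
    return cov / n

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))

cov = covariance_outer(X)
print(np.allclose(cov, np.cov(X, rowvar=False, bias=True)))  # True
```

In practice the loop is usually replaced by a single matrix product of the centered data with itself, which gives the same matrix.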

  


#### Preprocessing data

To compare the variances of each dimension fairly we must normalize the data, such that each dimension has mean = 0 and variance = 1 (attributes with high values like yearly salary typically have much higher variance than attributes with low values like years of employment).

We center the data (mean = 0) by subtracting the mean $\mu = \frac{1}{n}\sum_{i=1}^n x_i$ from each attribute.

We normalize the variance by dividing each attribute by its standard deviation $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i-\mu)^2}$, the square root of the variance, so that each attribute ends up with variance 1.
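A minimal NumPy sketch of this preprocessing, with a made-up salary / years-of-employment table. Note that it divides by the standard deviation (the square root of the variance), which is what yields unit variance, and it assumes no attribute is constant (otherwise the division would be by zero).

```python
import numpy as np

def standardize(X):
    """Give every column of X mean 0 and variance 1."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)   # population standard deviation (divides by n)
    return (X - mu) / sigma

# Hypothetical data: yearly salary and years of employment
X = np.array([
    [30000.0, 1.0],
    [45000.0, 3.0],
    [60000.0, 10.0],
    [52000.0, 6.0],
])

Z = standardize(X)
print(Z.mean(axis=0))  # approximately [0, 0]
print(Z.var(axis=0))   # approximately [1, 1]
```

After this step both attributes are on the same scale, so neither dominates the covariance matrix simply because of its units.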



