# 8) Unsupervised learning - Clustering

* Unsupervised learning (briefly)
* Clustering  
    * Representative-based clustering
    * Density-based clustering
    * Hierarchical clustering
    * Subspace clustering
* K-means
* Expectation-maximization clustering
* Cluster measurement
    * Silhouette coefficient
    * F1 score

### Unsupervised learning

In unsupervised learning the training data has the form $D = \{(x_1, ?), (x_2, ?), \dots, (x_n, ?)\}$. We have no labels for the data points, so the goal is to gain insight into the data distribution and see whether we can make sense of it. __Two important problems__ are __clustering__, where we apply algorithms to find natural clusters within the data, and __outlier detection__, where we try to identify data points that seem to deviate from the pattern in the data.


### Clustering

Clustering can be thought of as organizing the data into "natural" clusters, natural in the sense that members of the same cluster are similar/close to each other. Clustering can be used as a preprocessing step for supervised learning, to find groups in the data and assign labels to them.

<img src="imgs/clustering.png" style="width: 400px;"/>

There are several clustering approaches:

#### Representative-based clustering

The goal of representative-based clustering is to partition the dataset into k clusters where each cluster is represented by some point. For some algorithms the point is an actual point in the data set, but sometimes it is just the mean of the points.

Representative-based clustering methods find convex-shaped clusters, so they perform poorly when the clusters have nonconvex structure. Furthermore, the number of clusters k must be specified by the user, which is often not possible.

#### Density-based clustering

Density-based approaches look at the density of the neighborhood around each point, and choose clusters accordingly. These approaches determine the value of k themselves, so the user does not have to specify the number of clusters as is the case in representative-based clustering. Of course this does not imply that the algorithm will "detect" the correct number of clusters. Instead of the parameter k, the user has to provide the parameters $\epsilon$ and minpts. These parameters specify what level of density is enough to form a cluster, and thus have a huge impact on the result. Density-based clustering does not care whether the clusters are convex or not.

Contrary to representative-based algorithms, these types of algorithm can detect noise (points not in any cluster), and this is related to outlier detection.
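
One well-known density-based algorithm is DBSCAN, which these notes do not name explicitly. The sketch below is a minimal, unoptimized version of it, intended only to show how $\epsilon$ and minpts decide which points form clusters and which are noise. It assumes Euclidean distance and that a point counts as its own neighbor; the function names `region_query` and `dbscan` are chosen for this example.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, minpts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < minpts:
            labels[i] = NOISE            # may later be relabeled as a border point
            continue
        cluster += 1                     # i is dense enough: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a dense point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= minpts:
                queue.extend(j_neighbors)  # j is dense too: keep expanding
    return labels

# Two dense groups of four points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, minpts=3)
print(labels)
```

Note that the number of clusters is never passed in; it falls out of the density parameters, and the isolated point is reported as noise rather than forced into a cluster.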


#### Hierarchical clustering



In hierarchical clustering we wish to build a dendrogram (tree) of subclusters. The root of the tree is a cluster containing all data points, and each leaf is a cluster containing a single point. Thus, the meaningful clusters lie somewhere in the intermediate layers of the tree. The trees can be constructed bottom-up (the agglomerative approach) or top-down (the divisive approach). Sometimes different clusters have different densities; in such cases density-based clustering algorithms struggle because of the fixed parameters $\epsilon$ and minpts.
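
As a rough illustration of the bottom-up (agglomerative) approach, the sketch below repeatedly merges the two closest clusters until only the root remains, recording each merge. The single-linkage distance (the smallest point-to-point distance between two clusters) is an assumption made for this example, since the text does not fix a linkage criterion; the function names are also illustrative.

```python
import math

def single_link(a, b, points):
    """Single-linkage distance: smallest distance between a point in a and a point in b."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)

def agglomerative(points):
    """Bottom-up clustering; returns the merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(points))]   # leaves: one cluster per point
    merges = []
    while len(clusters) > 1:
        # Find the closest pair among the current clusters.
        a, b = min(
            ((x, y) for idx, x in enumerate(clusters) for y in clusters[idx + 1:]),
            key=lambda pair: single_link(pair[0], pair[1], points),
        )
        merges.append((a, b, single_link(a, b, points)))
        # Replace the two merged clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return merges

points = [(0, 0), (0, 1), (5, 0), (5, 2), (20, 0)]
merges = agglomerative(points)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

Each entry in `merges` corresponds to one internal node of the dendrogram. Cutting the tree at a distance threshold (keeping only merges below it) yields one particular clustering.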

#### Subspace clustering

If our data is high-dimensional, there is a risk that some of the features are irrelevant and thus obscure an otherwise good clustering. This is illustrated below, where the data in 3D looks random, but if one dimension is discarded some structure appears.

<img src="imgs/subspaceclustering.png" style="width: 450px;"/>



### K-means

The best-known clustering algorithm is the K-means algorithm. Here we want to split the dataset into k clusters $C = \{c_1, \dots, c_k\}$, each represented by a centroid $\mu_i$, such that the following error function is minimized:

$$SSE(C) = \sum_{i=1}^k \sum_{x_j \in c_i} \|x_j - \mu_i \|^2$$

This is called _sum of squared errors_ (sum of all squared distances from all points to their centroid $\mu$) and the algorithm iteratively tries to minimize this sum. It does so by repeatedly assigning the data points to the cluster defined by the nearest centroid, and recalculating the centroids of the new clusters. The algorithm is described in pseudocode below:

<img src="imgs/kmeans.png" style="width: 450px;"/>

The stopping criterion is given by:

$$\sum_{i=1}^k \| \mu_i^t - \mu_i^{t-1}\|^2 \le \epsilon$$

The running time of the algorithm is $\mathcal{O}(tknd)$, where n is the number of points, k is the number of clusters, t is the number of iterations, and d is the dimensionality of the data points.
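
The assignment and update steps, together with the stopping criterion above, can be sketched in plain Python as follows. This is a minimal illustration rather than a tuned implementation: it assumes the caller supplies the initial centroids, uses Euclidean distance, and leaves an empty cluster's centroid unchanged. The names `assign`, `sse` and `kmeans` are chosen for this example.

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points]

def sse(points, centroids, assignment):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[c]) ** 2 for p, c in zip(points, assignment))

def kmeans(points, centroids, eps=1e-9, max_iter=100):
    centroids = list(centroids)
    for _ in range(max_iter):
        assignment = assign(points, centroids)           # assignment step
        new_centroids = []
        for i, old in enumerate(centroids):              # update step
            members = [p for p, c in zip(points, assignment) if c == i]
            # An empty cluster keeps its old centroid (a choice made for this sketch).
            new_centroids.append(
                tuple(sum(xs) / len(members) for xs in zip(*members)) if members else old
            )
        # Stopping criterion: total squared movement of the centroids.
        shift = sum(math.dist(a, b) ** 2 for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= eps:
            break
    return centroids, assign(points, centroids)

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, assignment = kmeans(points, [(0, 0), (10, 10)])
print(centroids, assignment, sse(points, centroids, assignment))
```

Each iteration compares all n points with all k centroids in d dimensions, which is where the $\mathcal{O}(tknd)$ bound comes from. The result depends on the initial centroids, so different starting points can end in different local minima.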

The k-means approach to clustering has the following __advantages__:

- Relatively efficient (often k,t,d << n)
- Easy to understand and implement

And the following __disadvantages__:

- Only works if a mean is defined (we cannot compute the mean of categorical values such as 1=red, 2=green, etc.)
- Hard to know what k to pick
- Sensitive to noisy data and outliers
- Clusters will always have convex shapes
- All points are assigned to a cluster (which should not always be the case)
- Can terminate in local minima

For the problem of deciding the number of clusters (k), we could try all k from 1 to n-1 and pick the one with the least error. Since the minimum SSE never increases when more clusters are allowed, this would be the clustering with k = n-1, and it is unlikely that this clustering makes any sense. __A measurement of how good a clustering is must be independent of k__ (see silhouette coefficient section).
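
A short worked case shows why the raw error favors many clusters (assuming every cluster must be non-empty). If we allowed $k = n$, each point would be its own centroid and the error would vanish:

$$SSE = \sum_{i=1}^n \|x_i - x_i\|^2 = 0$$

With $k = n-1$, exactly one cluster contains two points $x_p, x_q$ with centroid $\mu = \frac{1}{2}(x_p + x_q)$, while every other cluster is a single point contributing zero error:

$$SSE = \|x_p - \mu\|^2 + \|x_q - \mu\|^2 = \frac{1}{2}\|x_p - x_q\|^2$$

This is minimized by grouping the two closest points, so the "best" clustering by SSE alone says almost nothing about the structure of the data.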

### Expectation Maximization 

K-means clustering provides _hard assignment_ clusterings. That is, for each point we report: _this_ is the cluster it belongs to. Expectation Maximization (EM) clustering provides a __soft assignment__ clustering, where each point has a probability of belonging to each cluster. Each cluster is represented by a Gaussian distribution, so if we require a hard assignment we can simply report the cluster with the highest probability of containing that point. Each Gaussian has a mean and a covariance matrix, which describe the cluster's ellipsoidal shape.



The density function for each cluster C is the multivariate Gaussian density function:

$$P(x\rvert C) = \frac{1}{\sqrt{(2\pi)^d\rvert \Sigma_C \rvert}}\cdot e^{-\frac{1}{2}(x-\mu_C)^T\cdot(\Sigma_C)^{-1}\cdot(x-\mu_C)}$$
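
To see the density formula in action, the sketch below evaluates it for the special case $d = 2$, where the determinant and inverse of $\Sigma_C$ can be written out by hand. Restricting to two dimensions is an assumption made to keep the example free of external libraries; the function name is illustrative.

```python
import math

def gaussian_density_2d(x, mu, sigma):
    """P(x | C) for a 2-D Gaussian with mean mu and 2x2 covariance matrix sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c                                  # |Sigma_C|
    inv = ((d / det, -b / det), (-c / det, a / det))     # Sigma_C^{-1}
    dx = (x[0] - mu[0], x[1] - mu[1])                    # x - mu_C
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    norm = math.sqrt((2 * math.pi) ** 2 * det)           # sqrt((2*pi)^d |Sigma_C|), d = 2
    return math.exp(-0.5 * q) / norm

identity = ((1.0, 0.0), (0.0, 1.0))
at_mean = gaussian_density_2d((0.0, 0.0), (0.0, 0.0), identity)
one_away = gaussian_density_2d((1.0, 0.0), (0.0, 0.0), identity)
print(at_mean, one_away)
```

The density is highest at the mean and decays with the distance measured through $\Sigma_C^{-1}$, which is what gives each cluster its ellipsoidal shape.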


#### Initialization

For each cluster $C_i$ we __initialize a random mean__ within the ranges of each dimension. We then __initialize the covariance matrix to be the identity matrix__. Finally we __initialize the prior probabilities, $P(C_i)$, to $\frac{1}{k}$__, such that each cluster has equal probability from the beginning.

##### E step


In each expectation step we compute the __posterior probability__ of a cluster $C_i$ given a point $x$, $P(C_i \rvert x)$. This probability can be thought of as a __weight or contribution of point x to cluster $C_i$__, and will be used in the M step. We use Bayes' theorem to compute the posteriors:

$$ P(C_i \rvert x) = \frac{P(C_i, x)}{P(x)} = \frac{P(x \rvert C_i) P(C_i)}{\sum_{a = 1}^k P(x \rvert C_a) P(C_a)} $$
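
The E step can be sketched as below. To keep the example short it assumes one-dimensional data, so each cluster has a scalar variance in place of a covariance matrix; the function names and the list-of-lists layout of the posteriors are choices made for this example.

```python
import math

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density, the d = 1 case of P(x | C)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def e_step(data, priors, means, variances):
    """Return posteriors, where posteriors[j][i] = P(C_i | x_j)."""
    posteriors = []
    for x in data:
        # Numerators of Bayes' theorem: P(x | C_i) * P(C_i) for every cluster.
        joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
        evidence = sum(joint)            # P(x): the sum over all clusters
        posteriors.append([j / evidence for j in joint])
    return posteriors

post = e_step([0.0, 2.0, 4.0], priors=[0.5, 0.5], means=[0.0, 4.0], variances=[1.0, 1.0])
for row in post:
    print(row)
```

Each row sums to 1, and a point halfway between two equally weighted clusters with equal variance is split evenly between them. For points extremely far from every cluster the denominator can underflow to zero, which a practical implementation would need to guard against.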

##### M step

In each maximization step we re-estimate 3 things:

__Priors__ $P(C_i)$ for each cluster as the fraction of weights that contribute to that cluster:

$$P(C_i) = \frac{1}{n}\sum_{x\in D}P(C_i \rvert x)$$


__Means__ $\mu_i$ of $C_i$, as the weighted average of all points:


$$\mu_i = \frac{\sum_{x\in D} x \cdot P(C_i \rvert x)}{\sum_{x\in D} P(C_i \rvert x)} $$

__Covariance matrix__ $\Sigma_i$ as the weighted covariance over all pairs of dimensions:


$$\Sigma_i = \frac{\sum_{x\in D} P(C_i \rvert x)(x-\mu_i)(x-\mu_i)^T}{\sum_{x\in D} P(C_i \rvert x)}$$
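
The three re-estimation formulas translate almost directly into code. The sketch below again assumes one-dimensional data, so the weighted covariance reduces to a weighted variance, and it expects the posteriors as a list of rows with `posteriors[j][i]` standing for $P(C_i \rvert x_j)$; both are assumptions made for this example.

```python
def m_step(data, posteriors):
    """Re-estimate priors, means and variances from the posteriors."""
    n, k = len(data), len(posteriors[0])
    priors, means, variances = [], [], []
    for i in range(k):
        weights = [row[i] for row in posteriors]   # P(C_i | x) for every point x
        total = sum(weights)
        mean = sum(w * x for w, x in zip(weights, data)) / total
        var = sum(w * (x - mean) ** 2 for w, x in zip(weights, data)) / total
        priors.append(total / n)                   # fraction of the total weight
        means.append(mean)                         # weighted average of the points
        variances.append(var)                      # weighted variance around the new mean
    return priors, means, variances

data = [0.0, 2.0, 10.0, 12.0]
# Hard 0/1 weights, used here only to make the result easy to check by hand.
posteriors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
priors, means, variances = m_step(data, posteriors)
print(priors, means, variances)
```

With soft weights every point contributes to every cluster, only to different degrees. Note that the variance uses the freshly computed mean, matching the formula for $\Sigma_i$ above.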

##### Stopping criterion

We can stop the algorithm when the means no longer change significantly (the same stopping criterion as in k-means). The algorithm will then have converged to a local maximum of the log-likelihood, and the parameters of the Gaussians now constitute a (local) __maximum likelihood estimate__.

### Cluster measurement

It is important to be able to measure how good a clustering is. There are two overall categories: __internal__ and __external__ evaluation measures. External measures assume that the "correct" clustering is known beforehand, that is, the labels are given for all data points. If we know the correct clustering beforehand there is no need for clustering, but the labels can instead be used to test and validate clustering methods. In most cases, however, we don't know the labels, so internal measures must evaluate the quality of a clustering from the data alone, by measuring the compactness and "similarity" of objects in the same cluster.


#### Silhouette coefficient

The silhouette coefficient approach is an __internal evaluation measure__, and it provides measurements of both how well each object belongs to its assigned cluster, and how good the total clustering is.
Let $a(o)$ be the average distance between object o and other objects in its cluster A, and let $b(o)$ be the average distance from object o to all objects _in its second closest cluster_. Formally:

$$
\begin{aligned}
a(o) &= \frac{1}{\vert C_A \vert} \sum_{p\in C_A} dist(o,p)\\
b(o) &= \underset{C_i \neq C_A}{\operatorname{min}} \bigg[ \frac{1}{\vert C_i \vert} \sum_{p\in C_i} dist(o,p) \bigg]
\end{aligned}
$$

The silhouette of an object o is then given by:

$$s(o) = \frac{b(o) - a(o)}{\max\{a(o),b(o)\}}$$

It's easy to see that if $b(o)$ is very big and $a(o)$ is very small (which sounds ideal), then the numerator and denominator are close to each other, and $s(o)$ will be close to 1. On the other hand, if $b(o)$ is small and $a(o)$ is big (which sounds bad), then the numerator is negative with a magnitude close to the positive denominator, so $s(o)$ will be close to -1. Hence object o seems to be in the correct cluster if $s(o)$ is close to 1, and in the wrong cluster if $s(o)$ is close to -1.

<img src="imgs/silhouette.png" style="width: 350px;"/>

The silhouette coefficient $s_C$ of a clustering is the average of all $s(o)$'s. This value is an indication of how strong the structure is. If $s_C$ is between 0.7 and 1, we say that the clustering has strong structure. Below 0.5 indicates weak to no structure. Often we use the technique to compare clusterings, for example to determine the parameter k, so we don't care what the value is, just which clustering scores highest.
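
The sketch below computes $s(o)$ and $s_C$ following the formulas in this section, and uses them to compare two labelings of the same points. It assumes Euclidean distance and at least two clusters, and it follows the definition of $a(o)$ given above, which averages over all of $C_A$ including o itself (some references divide by $\vert C_A \vert - 1$ instead). The function name is illustrative.

```python
import math

def silhouette(points, labels):
    """Return (s, s_C): the silhouette of every object and their average."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    def mean_dist(o, members):
        return sum(math.dist(o, p) for p in members) / len(members)

    s = []
    for o, lab in zip(points, labels):
        a = mean_dist(o, clusters[lab])   # includes dist(o, o) = 0, as in a(o) above
        b = min(mean_dist(o, members)
                for other, members in clusters.items() if other != lab)
        s.append((b - a) / max(a, b))
    return s, sum(s) / len(s)

points = [(0,), (1,), (10,), (11,)]
s_good, sc_good = silhouette(points, [0, 0, 1, 1])   # groups nearby points together
s_bad, sc_bad = silhouette(points, [0, 1, 0, 1])     # mixes the two groups
print(sc_good, sc_bad)
```

The labeling that keeps nearby points together scores much higher, which is exactly how the coefficient is used to choose between candidate clusterings, for example those produced with different values of k.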


#### F1 score

F1 score is an __external evaluation measure__, and thus it is useful only when we have the set of correct labels for the data points. 

TODO: more
