# 8) Unsupervised learning - Clustering

* Unsupervised learning (briefly)
* Clustering  
    * Representative-based
    * Density-based
    * Hierarchical 
    * Subspace clustering
* K-means
* Expectation-maximization clustering
* Cluster measuring
    * Silhouette coefficient
    * F1 score

### Unsupervised learning

In unsupervised learning the training data has the form $D = \{(x_1, ?), (x_2, ?), \ldots, (x_n, ?)\}$. We have no labels for the data points, and the goal is to gain insight into the data distribution and see if we can make some sense of it. __Two important problems__ are __clustering__, where we apply algorithms to try to find natural clusters within the data, and __outlier detection__, where we try to identify data points which seem to deviate from the pattern in the data.


### Clustering

Clustering can be thought of as organizing the data into "natural" clusters. Natural in the sense that members of the same cluster are similar/close to each other. Clustering can be used as a preprocessing step for supervised learning, to find groups in the data and assign labels to them.

<img src="imgs/clustering.png" style="width: 400px;"/>

There are several clustering approaches:

#### Representative-based clustering

The goal of representative-based clustering is to partition the dataset into k clusters where each cluster is represented by some point. For some algorithms the point is an actual point in the data set, but sometimes it is just the mean of the points.

Representative-based clustering methods find convex-shaped clusters, so they perform poorly when the clusters have a nonconvex structure. Furthermore, the number of clusters k must be specified by the user, which is often not possible.

#### Density-based clustering

Density-based approaches look at the density of the neighborhood around each point, and choose clusters accordingly. 
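
As a rough illustration of what "density of the neighborhood" can mean, the sketch below counts, for each point, how many points lie within a radius `eps` of it. The radius `eps`, the function name `neighborhood_sizes`, and the choice to count the point itself are assumptions made for this example; the notes do not fix a particular density-based algorithm.

```python
def neighborhood_sizes(points, eps):
    """For each point, count the points (itself included) within distance eps."""
    eps_sq = eps * eps
    return [
        sum(
            1
            for q in points
            if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps_sq
        )
        for p in points
    ]


# Four points close together and one isolated point (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
counts = neighborhood_sizes(points, eps=1.5)
print(counts)
```

Points with many neighbors sit in dense regions and are natural candidates for cluster members, while a point with an almost empty neighborhood is a candidate for noise.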

#### Hierarchical clustering

TODO: 

#### Subspace clustering

TODO: 

### K-means

The most well known clustering algorithm is the K-means algorithm. Here we want to split the dataset into k clusters $C = \{c_1,...c_k\}$, each represented by a centroid $\mu_i$, such that the following error function is minimized:

$$SSE(C) = \sum_{i=1}^k \sum_{x_j \in c_i} \|x_j - \mu_i \|^2$$

This is called _sum of squared errors_ (sum of all squared distances from all points to their centroid $\mu$) and the algorithm iteratively tries to minimize this sum. It does so by repeatedly assigning the data points to the cluster defined by the nearest centroid, and recalculating the centroids of the new clusters. The algorithm is described in pseudocode below:

<img src="imgs/kmeans.png" style="width: 450px;"/>

The stopping criterion is given by:

$$\sum_{i=1}^k \| \mu_i^t - \mu_i^{t-1}\|^2 \le \epsilon$$
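
The assignment step, the update step, and the stopping criterion can be sketched in plain Python as follows. This is a minimal sketch, not a reference implementation: the initialisation (k randomly chosen data points), the handling of empty clusters, and the parameters `eps`, `max_iter`, and `seed` are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, eps=1e-9, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of d-dimensional tuples."""
    rng = random.Random(seed)
    # Initialisation (an assumption): use k distinct data points as centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            # An empty cluster keeps its old centroid (one possible choice).
            new_centroids.append(mean_point(members) if members else centroids[i])
        # Stopping criterion: total squared centroid movement is at most eps.
        shift = sum(squared_distance(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= eps:
            break
    return centroids, labels


def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))


# Two well-separated groups of three points each (made-up data).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels, sse(points, centroids, labels))
```

On this small data set the two groups are recovered, but in general the result depends on the initial centroids, so it is common to run the algorithm several times with different seeds and keep the clustering with the lowest SSE.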

The running time of the algorithm is $\mathcal{O}(tknd)$, where n is the number of points, k is the number of clusters, t is the number of iterations, and d is the dimensionality of the data points.

The k-means approach to clustering has the following __advantages__:

- Relatively efficient (often k,t,d << n)
- Easy to understand and implement

And the following __disadvantages__:

- Only works if the mean is defined (we cannot compute the mean of categorical values such as 1=red, 2=green, etc.)
- Hard to know what k to pick
- Sensitive to noisy data and outliers
- Clusters will always have convex shapes
- Every point is assigned to a cluster (which should not always be the case)
- Can terminate in local minima

For the problem of deciding the number of clusters (k), we could try all k from 1 to n-1 and pick the one with the least error. This would be the clustering with k = n-1, since the minimal SSE never increases as k grows, but it is unlikely that this clustering makes any sense. __A measurement of how good a clustering is must be independent of k__ (see silhouette coefficient and F1 score).
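
To see why the raw error cannot be used to choose k, the sketch below computes the SSE of three hand-picked clusterings of the same six points: one cluster, the two "natural" clusters, and n-1 clusters. The data and the partitions are made up for illustration, and the helper names `centroid` and `sse` are assumptions of this example.

```python
def centroid(cluster):
    """Component-wise mean of a non-empty cluster of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def sse(clusters):
    """Sum of squared distances from every point to the centroid of its cluster."""
    total = 0.0
    for cluster in clusters:
        mu = centroid(cluster)
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, mu)) for p in cluster
        )
    return total


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

one_cluster = [points]                                   # k = 1
two_clusters = [points[:3], points[3:]]                  # k = 2, the natural grouping
almost_singletons = [points[:2]] + [[p] for p in points[2:]]  # k = n - 1

for clustering in (one_cluster, two_clusters, almost_singletons):
    print(len(clustering), sse(clustering))
```

The SSE keeps shrinking as k grows even though the clustering with k = n-1 says nothing useful about the data, which is why a quality measure should not reward a larger k by itself.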

### K-medoid

The idea with k-medoid is very much the same as k-means, in that we want to find k clusters and minimize the sum of the distances from points to their representative. The difference is that the representative is now a member of the cluster, and not the mean of the points in the cluster. "Medoid" is the "median" in several dimensions.

<img src="pics/kmedoid.png"/>

- More general than k-means (can handle categorical variables)
- Not as sensitive to noise and outliers
- Slow
- Still need to supply k
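
Because the representative is a cluster member, only a distance function is needed, not a mean. The sketch below picks the medoid of a small cluster of categorical records; the Hamming distance, the helper names `hamming` and `medoid`, and the records are assumptions chosen for illustration.

```python
def hamming(a, b):
    """Number of positions in which two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))


def medoid(cluster, dist):
    """Cluster member with the smallest total distance to all members."""
    return min(cluster, key=lambda c: sum(dist(c, p) for p in cluster))


# Categorical records (color, size): a mean is undefined here,
# but a medoid can still be chosen.
cluster = [
    ("red", "small"),
    ("red", "large"),
    ("red", "small"),
    ("green", "small"),
]
representative = medoid(cluster, hamming)
print(representative)
```

Evaluating every member as a candidate costs a number of distance computations that is quadratic in the cluster size, which is one reason k-medoid tends to be slow.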

### Expectation Maximization 

K-means clustering and k-medoid clustering provide _hard assignment_ clusterings. That is, for each point we report: _this_ is the cluster it belongs to. Expectation Maximization (EM) clustering provides a __soft assignment__ clustering, where each point has a probability of belonging to each cluster. Each cluster is represented by a normal distribution, so if we require a hard assignment, we can simply report the cluster which has the highest probability of containing that point.
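
The difference between soft and hard assignment can be sketched for one-dimensional data as follows. The sketch only covers the assignment of points to clusters whose parameters are already given, not the fitting of those parameters. The cluster weights, means, and standard deviations are made-up values, and the function names are assumptions of this example.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def soft_assignment(x, clusters):
    """Probability of x belonging to each cluster.

    clusters is a list of (weight, mean, std) triples (assumed given).
    """
    weighted = [w * normal_pdf(x, m, s) for w, m, s in clusters]
    total = sum(weighted)
    return [v / total for v in weighted]


def hard_assignment(x, clusters):
    """Index of the cluster with the highest probability for x."""
    probs = soft_assignment(x, clusters)
    return max(range(len(probs)), key=probs.__getitem__)


# Two equally weighted clusters centered at 0 and 5 (illustrative values).
clusters = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
for x in (0.0, 2.5, 4.0):
    print(x, soft_assignment(x, clusters), hard_assignment(x, clusters))
```

A point halfway between the two centers gets equal probability for both clusters, information that a hard assignment would discard.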




