# Clustering

_Kevin Siswandi_  
**Fundamentals of Machine Learning**  
June 2020

The goal of cluster analysis is to find groups in data, often summarized by representative examples. Examples of clustering methods:
* Mean shift: finds the modes of a density estimate
* k-means
* DBSCAN

There are many ways to classify clustering methods:
* Hierarchical
    - agglomerative (single-linkage, complete linkage, etc.)
    - divisive -- usually less computationally efficient than agglomerative
* Flat
    - crisp (e.g. k-means, mean-shift)
    - fuzzy (e.g. Gaussian Mixture Model)

## Hierarchical Clustering

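With hierarchical clustering, a dendrogram shows at which level clusters merge. Three common linkage criteria define the distance between two clusters:
1. Single linkage -- the shortest distance between any two points, one from each cluster
2. Complete linkage -- the largest distance between any two points, one from each cluster
3. Average linkage -- the mean distance over all pairs of points, one from each cluster

These criteria usually give different results.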
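As a small worked example, the three criteria can be evaluated by hand for two tiny clusters. The points in `C1` and `C2` are made up for illustration, and Euclidean distance is assumed.

```python
import numpy as np

# Two illustrative clusters on a line (hypothetical data).
C1 = np.array([[0.0, 0.0], [1.0, 0.0]])
C2 = np.array([[3.0, 0.0], [5.0, 0.0]])

# pairwise[i, j] = Euclidean distance between C1[i] and C2[j]
pairwise = np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=-1)

single = float(pairwise.min())     # shortest cross-cluster distance
complete = float(pairwise.max())   # largest cross-cluster distance
average = float(pairwise.mean())   # mean over all cross-cluster pairs

print(single, complete, average)   # 2.0 5.0 3.5
```

The same pair of clusters is 2.0, 5.0 or 3.5 apart depending on the criterion, which is why the merge order can differ between linkages.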

In **single linkage clustering**, the distance between cluster $C_1$ and cluster $C_2$ is

$$ \min_{i \in C_1, j \in C_2} d_{i,j} $$

where $d_{i,j}$ can be the Euclidean distance, the Mahalanobis distance, etc. This is equivalent to building a minimum spanning tree and truncating it:
- Spanning (a subgraph of a full graph, where each node is connected to any other through the subgraph)
- Minimum (the sum of edge weights used to construct the subgraph is minimal)
- Tree (no cycles)

Truncating is done by removing the edges whose weight exceeds some threshold; the connected components that remain are the clusters. This gives the same result as cutting the dendrogram at that threshold, but the tree can be built efficiently with Prim's or Kruskal's algorithm. The limitation of single linkage, however, is that noise points can 'bridge' or 'chain' clusters together.
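The truncated spanning tree idea can be sketched with Kruskal's algorithm and a union-find structure. This is a minimal illustration, not an optimized implementation: the function name `single_linkage_mst`, the sample points and the threshold are assumptions made for the example, and all pairwise edges are enumerated, which only suits small data sets.

```python
import numpy as np

def single_linkage_mst(points, threshold):
    """Single-linkage clusters from a minimum spanning tree cut at `threshold`."""
    n = len(points)
    # Candidate edges (weight, i, j) for all pairs, sorted by increasing weight.
    edges = sorted(
        (float(np.linalg.norm(points[i] - points[j])), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))  # union-find forest: each point starts as its own root

    def find(a):
        # Follow parent links to the root, shortening the path on the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for weight, i, j in edges:
        if weight > threshold:
            break  # every remaining edge is longer, so the tree is truncated here
        root_i, root_j = find(i), find(j)
        if root_i != root_j:      # adding the edge creates no cycle
            parent[root_i] = root_j

    # Relabel the roots as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
labels = single_linkage_mst(points, threshold=2.0)
print(labels)  # [0, 0, 0, 1, 1, 2]
```

Stopping Kruskal's algorithm at the threshold yields the same forest as building the full tree and then deleting the long edges, because edges are processed in order of increasing weight.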

In **complete linkage clustering**, the procedure is the same, but the distance between two clusters is now

$$ \max_{i \in C_1, j \in C_2} d_{i,j} $$

The limitation, however, is that complete linkage favors compact clusters and tends to break up elongated ones.

In general, we have the **Lance-Williams formulation for agglomerative clustering**. The pseudocode is:

```
init d_ij = || x_i - x_j ||   # e.g. L2 norm; any distance measure can be used

Repeat until a single cluster remains (or a stopping criterion is met) {
    (i, j) = arg min d_ij     # closest pair of clusters, i != j
    create new cluster (i, j)
    delete cluster i and cluster j
    for every remaining cluster k, update distance to new cluster (i,j):
        d_k(i,j) = a1 * d_ki + a2 * d_kj + b * d_ij + g * abs(d_kj - d_ki)
}
```

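With this formulation, all standard linkage criteria can be applied by choosing the appropriate coefficients. For example, `a1 = a2 = 1/2` and `b = 0` give single linkage with `g = -1/2` and complete linkage with `g = +1/2`.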
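A minimal Python sketch of the update rule is given below, assuming Euclidean initial distances. The function name `lance_williams`, the `method` argument and the test points are illustrative choices. The coefficient table covers only single, complete and average linkage, and the merged cluster reuses the index of its first member. The function returns the distance at which each merge happens.

```python
import numpy as np

def lance_williams(points, method="single"):
    """Agglomerative clustering via the Lance-Williams update; returns merge heights."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))  # initial pairwise L2 distances
    np.fill_diagonal(d, np.inf)            # a cluster is never merged with itself
    size = [1] * n                         # number of points in each cluster slot
    active = list(range(n))                # slots that still hold a cluster
    heights = []

    while len(active) > 1:
        # Closest pair among the active clusters.
        sub = d[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = active[a], active[b]
        d_ij = d[i, j]
        heights.append(float(d_ij))

        # Coefficients (a1, a2, b, g) for the chosen linkage.
        if method == "single":
            a1, a2, beta, gamma = 0.5, 0.5, 0.0, -0.5
        elif method == "complete":
            a1, a2, beta, gamma = 0.5, 0.5, 0.0, 0.5
        elif method == "average":
            total = size[i] + size[j]
            a1, a2, beta, gamma = size[i] / total, size[j] / total, 0.0, 0.0
        else:
            raise ValueError(f"unknown method: {method}")

        # Update distances from every other cluster k to the merged cluster,
        # which is stored in slot i; slot j is retired.
        for k in active:
            if k in (i, j):
                continue
            new = (a1 * d[k, i] + a2 * d[k, j] + beta * d_ij
                   + gamma * abs(d[k, j] - d[k, i]))
            d[k, i] = d[i, k] = new
        size[i] += size[j]
        active.remove(j)

    return heights

pts = [[0.0], [1.0], [3.0], [7.0]]
print(lance_williams(pts, "single"))    # [1.0, 2.0, 4.0]
print(lance_williams(pts, "complete"))  # [1.0, 3.0, 7.0]
```

Only the coefficients change between linkages; the merge loop itself is identical, which is the point of the formulation.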

## DBSCAN

Given a parameter k:
1. Find a coarse density estimate by computing, for each point $i$, the distance to its k-th nearest neighbor $d_i^k$ -- called the core distance.
2. Points with $d_i^k < \theta$, where $\theta$ is some threshold, are high-density points -- called core points.
3. Find the connected components of the core points.
4. Assign nearby non-core points to clusters.

This algorithm works well for clusters of similar density, even in the presence of noise. The drawback is that the result is sensitive to the parameters $k$ and $\theta$: changing them can give very different clusterings.
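The four steps can be sketched in NumPy as follows. The notes do not say when two core points count as connected or what "nearby" means for non-core points; this sketch assumes that the same threshold $\theta$ is used for both, as a radius. The function name `dbscan_sketch` and the sample data are illustrative, and unassigned points are labeled `-1` as noise.

```python
import numpy as np

def dbscan_sketch(points, k, theta):
    """Density-based clustering following the four steps; -1 marks noise."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Step 1: core distance = distance to the k-th nearest neighbor.
    # Column 0 of each sorted row is the point itself (distance 0).
    core_dist = np.sort(dist, axis=1)[:, k]

    # Step 2: core points are those with a small core distance.
    is_core = core_dist < theta

    # Step 3: connected components of core points.
    # Assumption: two core points are linked if they are closer than theta.
    labels = np.full(n, -1)
    cluster = 0
    for seed in np.flatnonzero(is_core):
        if labels[seed] != -1:
            continue
        labels[seed] = cluster
        stack = [seed]
        while stack:
            p = stack.pop()
            reachable = is_core & (dist[p] < theta) & (labels == -1)
            for q in np.flatnonzero(reachable):
                labels[q] = cluster
                stack.append(q)
        cluster += 1

    # Step 4: a non-core point joins the cluster of its nearest core point,
    # but only if that core point is closer than theta (assumption).
    core_idx = np.flatnonzero(is_core)
    if len(core_idx) > 0:
        for p in np.flatnonzero(~is_core):
            nearest = core_idx[np.argmin(dist[p, core_idx])]
            if dist[p, nearest] < theta:
                labels[p] = labels[nearest]
    return labels

data = [[0, 0], [0, 1], [1, 0], [1, 1], [2.2, 0.5],        # square plus a border point
        [10, 10], [10, 11], [11, 10], [11, 11], [5, 30]]   # second square plus an outlier
result = dbscan_sketch(data, k=3, theta=1.5)
print(result.tolist())  # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
```

In this example the point `[2.2, 0.5]` is not a core point but lies within $\theta$ of one, so it is attached to the first cluster, while `[5, 30]` stays unassigned.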