
# Hierarchical Agglomerative Clustering

## Visuals

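Hierarchical clustering lets us see clusters nested within clusters.

The standard visual is the dendrogram.

- Useful for visualizing how points separate in high-dimensional data
- To get only "n" clusters, cut the tree at the height that leaves n branches; cutting just the topmost link gives two clusters
    + The lower in the tree two clusters merge, the more alike they are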

![](images/clustergram.png)
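
Cutting the tree to keep n clusters can be thought of as merging bottom-up and stopping once n clusters remain. A minimal pure-Python sketch of that idea follows; `agglomerate` and `cluster_distance` are made-up names, the points are invented, and the closest-pair distance is just one assumed choice of cluster distance.

```python
from itertools import combinations
from math import dist  # Python 3.8+


def cluster_distance(a, b):
    # Assumption for this sketch: the distance between two clusters is the
    # smallest point-to-point distance (other choices are possible).
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest distance
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        # i < j, so removing cluster j leaves index i valid
        merged = clusters[i] + clusters.pop(j)
        clusters[i] = merged
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
three = agglomerate(points, n_clusters=3)  # stop early: three branches
two = agglomerate(points, n_clusters=2)    # one more merge: two branches
```

Stopping at two clusters here leaves the isolated point `(10.0, 0.0)` on its own, which is what cutting only the topmost link of the corresponding tree would give.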

## Types

### Single link

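(In scikit-learn since version 0.20 via `linkage='single'`; older releases did not include it)

- Cluster distance is the distance between the two closest points, one from each cluster
- Tends to create elongated clusters (it reaches out to the nearest outside points)
- One cluster tends to absorb a large share of the points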

### Complete link

- Same as single link, but uses the distance between the two farthest points
- Tends to create more compact clusters
- However, it can "ignore" nearby points that are *similar* to the cluster, because a single distant member is enough to block the merge

### Average link

- Same as complete link, but uses the average of all point-to-point distances between the two clusters
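
The three point-to-point linkages differ only in how they reduce the pairwise distances between two clusters. A small pure-Python sketch, with made-up function names and made-up points, to compare them side by side:

```python
from math import dist  # Python 3.8+


def pairwise(a, b):
    """All point-to-point distances between clusters a and b."""
    return [dist(p, q) for p in a for q in b]


def single_link(a, b):
    return min(pairwise(a, b))      # closest pair


def complete_link(a, b):
    return max(pairwise(a, b))      # farthest pair


def average_link(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)          # mean over all pairs


# Two made-up clusters on a line
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (6.0, 0.0)]

print(single_link(a, b))    # 2.0
print(complete_link(a, b))  # 6.0
print(average_link(a, b))   # 4.0
```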

### Ward

- Minimizes the increase in within-cluster variance caused by each merge
    + Compute the center $c_0$ of the merged cluster
    + Sum the squared distances from the points in both clusters to $c_0$
    + Subtract the squared distances from each point to its own cluster center ($c_a$ or $c_b$)


$$Dist(A, B) = \sum_{x_a \in A} \left[ (c_0-x_a)^2 - (c_a-x_a)^2 \right] + \sum_{x_b \in B} \left[ (c_0-x_b)^2 - (c_b-x_b)^2 \right]$$
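
The Ward steps above (squared distances to the merged center, minus squared distances to each cluster's own center) can be checked numerically. This is a sketch with made-up helper names and made-up points; it also compares the result against the closed form $\frac{|A||B|}{|A|+|B|}(c_a-c_b)^2$, which should give the same number.

```python
def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def sse(points, center):
    """Sum of squared distances from each point to center."""
    return sum(sq_dist(p, center) for p in points)


def ward_cost(a, b):
    c_a, c_b = centroid(a), centroid(b)
    c_0 = centroid(a + b)  # center of the merged cluster
    return sse(a + b, c_0) - sse(a, c_a) - sse(b, c_b)


# Two made-up clusters on a line
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(5.0, 0.0), (7.0, 0.0)]
cost = ward_cost(a, b)

# Closed form: |A||B| / (|A|+|B|) * squared distance between the two centers
closed_form = (len(a) * len(b) / (len(a) + len(b))
               * sq_dist(centroid(a), centroid(b)))

print(cost, closed_form)  # 25.0 25.0
```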

## Example Code

In [None]:
from sklearn import datasets, cluster
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, ward, single

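# First 30 rows of the iris measurements
X = datasets.load_iris().data[:30]

# fit_predict returns one cluster label per row
c = cluster.AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = c.fit_predict(X)

# Same clustering (linkage defaults to 'ward'); keep the fitted model for plotting
model = cluster.AgglomerativeClustering(n_clusters=3)
model = model.fit(X)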


### SciPy for dendrograms

In [None]:
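def plot_dendrogram(model, **kwargs):
    """Plot a dendrogram from a fitted AgglomerativeClustering model."""

    # Each row of children_ holds the two nodes merged at that step
    children = model.children_
    n_samples = len(model.labels_)

    # The model fitted above does not store merge distances, so use each
    # merge's step index as a placeholder height. Branch heights then show
    # merge order only, not true distances.
    distance = np.arange(children.shape[0])

    # Count the original observations under each merged node
    counts = np.zeros(children.shape[0])
    for i, merge in enumerate(children):
        for child in merge:
            # Indices below n_samples are leaves; larger ones refer to earlier merges
            counts[i] += 1 if child < n_samples else counts[child - n_samples]

    # Build the SciPy linkage matrix: [child_a, child_b, distance, count]
    linkage_matrix = np.column_stack([children, distance, counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)


plt.title('Hierarchical Clustering Dendrogram')
plot_dendrogram(model, labels=model.labels_)
plt.show()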

In [None]:
link_matrix = ward(X)
dendrogram(link_matrix)
plt.show()
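
A SciPy linkage matrix can also be cut into flat cluster labels without plotting. A short sketch using `fcluster` with `criterion='maxclust'`; the data and the names `X_demo`, `link_matrix_demo`, and `flat_labels` are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster

# Made-up data: three well-separated groups of two points each
X_demo = np.array([[0.0, 0.0], [0.1, 0.0],
                   [5.0, 5.0], [5.1, 5.0],
                   [10.0, 0.0], [10.1, 0.0]])

link_matrix_demo = ward(X_demo)

# Cut the tree so that at most 3 flat clusters remain
flat_labels = fcluster(link_matrix_demo, t=3, criterion='maxclust')
print(flat_labels)
```

Each entry of `flat_labels` is the cluster id of the matching row in `X_demo`.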

# DBSCAN

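Visualization: https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/

Good at identifying noise: points that fit no cluster are left unassigned

- Choose a point
    + Are there at least `min_samples` points within distance $\epsilon$ of it?
        - yes, then it is a core point
    + Otherwise, is it within $\epsilon$ of a core point?
        - yes, border point
        - no, noise point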
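
The point-labeling rule above can be sketched in plain Python. This only classifies points as core, border, or noise; it does not build the clusters themselves. `classify_points` is a made-up name, the points are invented, and counting a point as part of its own neighborhood is an assumption that follows scikit-learn's convention for `min_samples`.

```python
from math import dist  # Python 3.8+


def classify_points(points, eps, min_samples):
    """Label each point as 'core', 'border', or 'noise'."""
    # Neighborhoods include the point itself (assumed convention, as in
    # scikit-learn's min_samples).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_samples for n in neighbors]

    kinds = []
    for i in range(len(points)):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            kinds.append("border")  # not dense itself, but near a core point
        else:
            kinds.append("noise")
    return kinds


points = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (0.3, 0.3),  # dense square
          (0.75, 0.3),                                      # edge of the square
          (5.0, 5.0)]                                       # far away
kinds = classify_points(points, eps=0.5, min_samples=4)
print(kinds)
# ['core', 'core', 'core', 'core', 'border', 'noise']
```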

In [None]:
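# eps is the neighborhood radius; min_samples is the number of points
# (including the point itself) needed to make a core point
db = cluster.DBSCAN(eps=0.5, min_samples=4)
# After fitting, db.labels_ holds one cluster id per row; -1 marks noise
db.fit(X)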