# AGNES worked example

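We use the distance data that was presented in class. Note that SciPy's hierarchical clustering has more options and can also be used to draw a dendrogram.

Its `linkage` function expects a condensed (1-D) version of the distance matrix: the entries above the diagonal, read row by row. A square 2-D array passed to `linkage` would be treated as observation vectors rather than as distances.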

In [None]:
# See https://stackoverflow.com/a/50662956/1988855
from pathlib import Path
# See https://stackoverflow.com/a/65907473/1988855
import ipynbname

try:
    fname = Path(__file__).stem
except NameError:
    fname = ipynbname.name().replace(".ipynb","")
print(fname)

dataDir = "data"
# Make sure the outputDir subdirectory exists
outputDir = "output/" + fname
import os
# exist_ok=True makes this a no-op when the directory is already there
os.makedirs(outputDir, exist_ok=True)
import numpy as np

distMatrix = np.array([
    (0,  9,  3, 6, 11),
    (9,  0,  7, 5, 10),
    (3,  7,  0, 9, 2),
    (6,  5,  9, 0, 8),
    (11, 10, 2, 8, 0)
])
condensed = [9, 3, 6, 11, 7, 5, 10, 9, 2, 8]
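
The condensed vector above was typed by hand. As a cross-check, it can also be derived from the square matrix. The following is a minimal sketch, assuming `scipy.spatial.distance.squareform` is available; the names `derived` and `restored` are introduced here for illustration only.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Same distance matrix as in the cell above
distMatrix = np.array([
    (0,  9,  3, 6, 11),
    (9,  0,  7, 5, 10),
    (3,  7,  0, 9, 2),
    (6,  5,  9, 0, 8),
    (11, 10, 2, 8, 0)
])

# Square (symmetric, zero-diagonal) matrix -> condensed 1-D vector
derived = squareform(distMatrix)
print(derived.tolist())

# The conversion also works in the other direction
restored = squareform(derived)
print(np.array_equal(restored, distMatrix))
```

For n points the condensed vector has n(n-1)/2 entries, which is 10 for the five points here.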

We now generate the different clusterings and their associated dendrograms for this data.

In [None]:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
import time

%matplotlib inline

algName = "AgglomerativeClustering"
labels = ('A','B','C','D','E')
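# Note: SciPy documents 'ward' as correctly defined only for Euclidean
# distances, so treat its dendrogram for this matrix with some caution.
for link in ['complete', 'single', 'average', 'ward']:
    # Time only the linkage computation, not the plotting
    start_time = time.time()
    Z = linkage(condensed, method=link)
    end_time = time.time()
    elapsed_time = end_time - start_time
    # print(elapsed_time)
    # Create exactly one figure per linkage method
    plt.figure(figsize=(6, 4))
    plt.title(algName + " : " + link)
    R = dendrogram(Z, labels=labels, truncate_mode=None)
    plt.savefig(outputDir + '/AGNES' + link + '.pdf')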

Because there are only five points, the choice of linkage has a less dramatic effect than it might on a larger data set.

We can choose to "cut" the tree at different heights. For example, cutting the single-linkage dendrogram at a distance of 4 gives an ACE cluster and two singleton clusters, B and D.
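
The cut can also be made programmatically instead of being read off the plot. This is a minimal sketch, assuming SciPy's `fcluster` with `criterion='distance'`; the names `cluster_ids`, `clusters`, and `groups` are introduced here for illustration.

```python
from scipy.cluster.hierarchy import linkage, fcluster

# Condensed distances and labels from the cells above
condensed = [9, 3, 6, 11, 7, 5, 10, 9, 2, 8]
labels = ('A', 'B', 'C', 'D', 'E')

Z = linkage(condensed, method='single')

# Points whose merge height is at most t end up in the same flat cluster
cluster_ids = fcluster(Z, t=4, criterion='distance')

# Group the point labels by their flat cluster id
clusters = {}
for name, cid in zip(labels, cluster_ids):
    clusters.setdefault(int(cid), []).append(name)

groups = sorted(''.join(members) for members in clusters.values())
print(groups)  # ['ACE', 'B', 'D']
```

The numeric cluster ids themselves are arbitrary; only the grouping matters. Raising `t` above 5 would also merge B and D, since their distance is 5.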

As we did in the class notes, we also consider how the distance matrix can be used to recover coordinates for the points. We use multidimensional scaling (MDS), mapping the points into the x-y plane. Note that this embedding is _not unique_. You can shift (translate), rotate, or reflect the points to get a different embedding which still has the same between-point distances.
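
The non-uniqueness can be checked numerically. The sketch below uses made-up 2-D coordinates (hypothetical, not the MDS output) and a helper `pairwise_distances` written for this illustration. It rotates and shifts the points, then confirms that the between-point distances are unchanged.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distance between every pair of rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical coordinates for five points (illustration only)
P = np.array([[0.0, 0.0], [3.0, 0.0], [1.0, 2.0], [4.0, 5.0], [-2.0, 1.0]])

# Rotate by 30 degrees, then translate
theta = np.deg2rad(30)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
shift = np.array([10.0, -4.0])
Q = P @ rot.T + shift

# The coordinates differ, but the distance matrices agree
same_distances = np.allclose(pairwise_distances(P), pairwise_distances(Q))
print(same_distances)  # True
```

This is why two runs of MDS can produce plots that look different yet represent the same configuration.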

The sklearn `MDS` class can work with observation-feature data or with a distance matrix. We use the latter here (i.e., the distance matrix has been `precomputed`). Note that we can transform into any number of dimensions, but 2 is handy for putting a scatter plot on the screen!

In [None]:
from sklearn.manifold import MDS

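# Fix random_state so the embedding (and the plot) is reproducible between runs
embedding = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
X = embedding.fit_transform(distMatrix)

# Start a fresh figure so the scatter plot is not drawn over the last dendrogram
plt.figure()
plt.scatter(X[:,0], X[:,1], s=100, c='orange')
# Label each point so it can be matched to the dendrogram leaves
for name, (x, y) in zip(labels, X):
    plt.annotate(name, (x, y))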
plt.axis('square')
plt.title('MDS: Embedded points in 2-D')
plt.savefig(outputDir + '/AGNES_MDS.pdf')