# AGNES worked example

We use the distance data presented in class. Note that SciPy's hierarchical clustering has more options and can also be used to draw a dendrogram.

One quirk is that `linkage` expects a condensed (1-D) version of the distance matrix: the entries above the diagonal, read row by row.
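
As a minimal sketch of what "condensed" means, SciPy's `squareform` converts between the square and the condensed form. The 3x3 matrix below is a made-up example (not the class data), and `condensed_example` is just an illustrative name.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hypothetical 3-point distance matrix (symmetric, zero diagonal)
square = np.array([
    [0, 4, 7],
    [4, 0, 2],
    [7, 2, 0],
])

# Condensed form: the entries above the diagonal, read row by row
condensed_example = squareform(square)
print(condensed_example)  # [4 7 2]

# squareform also converts a condensed vector back to the square matrix
print(squareform(condensed_example))
```

For n points the condensed vector has n(n-1)/2 entries, so the five points used in this notebook give 10 distances.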

In [None]:
## See https://stackoverflow.com/a/50662956/1988855
#from pathlib import Path
# See https://stackoverflow.com/a/65907473/1988855
#import ipynbname

#try:
#  fname = Path(__file__).stem
#except NameError:
#  fname = ipynbname.name().replace(".ipynb","")
#print(fname)
fname = "GenDist_AGNES"

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
import time

dataDir = "data"
# Make sure the outputDir subdirectory exists
outputDir = "output/" + fname
import os
os.makedirs(outputDir, exist_ok=True)

## Create the distance matrix `distMatrix`

In [None]:
distMatrix = np.array([
    (0,  9,  3, 6, 11),
    (9,  0,  7, 5, 10),
    (3,  7,  0, 9, 2),
    (6,  5,  9, 0, 8),
    (11, 10, 2, 8, 0)
])
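# See https://stackoverflow.com/a/44395030
# linkage expects the upper triangle read row by row (AB, AC, AD, AE, BC, ...),
# which is the order triu_indices yields; tril_indices would give a different order.
condensed = list(distMatrix[np.triu_indices(5, k=1)]) # [9, 3, 6, 11, 7, 5, 10, 9, 2, 8]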
#print(f"condensed is {condensed} and len(condensed) is {len(condensed)}")

We now generate the different clusterings and their associated dendrograms for this data.

In [None]:
algName = "AgglomerativeClustering"
labels = ('A','B','C','D','E')
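# Note: SciPy's 'ward' method is only well defined for Euclidean distances, which
# these are not (e.g. d(A,C) + d(C,E) = 5 < d(A,E) = 11), so treat that plot with care.
for link in ['complete', 'single', 'average', 'ward']:
  #plt.figure()
  start_time = time.perf_counter()  # monotonic, high-resolution timer
  Z = linkage(condensed, method=link)
  end_time = time.perf_counter()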
  fig, ax = plt.subplots(layout='constrained')
  #plt.figure(figsize=(6, 4))
  ax.set_title(algName+" : "+link)
  R = dendrogram(Z, ax=ax, orientation='top', labels=labels, truncate_mode=None)
  fig.savefig(outputDir + '/AGNES'+link+'.pdf', bbox_inches='tight')
  elapsed_time = end_time-start_time
  print(f"{link} takes {elapsed_time:.6f} s")
  plt.show()

Because there are only five points, the choice of linkage has a less dramatic effect here than it can have on larger data sets.

We can choose to "cut" the tree at different heights. For example, cutting the single-linkage dendrogram at a height of 4 gives one cluster {A, C, E} and two singleton clusters, B and D.
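
The cut described above can also be computed rather than read off the plot. The following is a minimal sketch using SciPy's `fcluster` with `criterion='distance'`; the names `condensed_AE`, `Z_single` and `flat` are illustrative, and the condensed distances are typed in directly so that the snippet runs on its own.

```python
from scipy.cluster.hierarchy import linkage, fcluster

# Condensed distances for A..E, in the order AB, AC, AD, AE, BC, BD, BE, CD, CE, DE
condensed_AE = [9, 3, 6, 11, 7, 5, 10, 9, 2, 8]
Z_single = linkage(condensed_AE, method='single')

# Cut at height 4: points that were merged at or below this height share a cluster id
flat = fcluster(Z_single, t=4, criterion='distance')
for name, cluster_id in zip('ABCDE', flat):
    print(name, cluster_id)
```

A, C and E should receive the same cluster id, while B and D each get their own. Raising `t` above 5 would merge B and D as well.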

As in the class notes, we also consider how the distance matrix can be used to place the points in space. We use multidimensional scaling (MDS) to map the points into the x-y plane. Note that this embedding is _not unique_: you can shift (translate), rotate, or reflect the points to get a different embedding that still has the same between-point distances.
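
To see why the embedding is not unique, here is a small sketch that rotates and translates a set of points and checks that the pairwise distances are unchanged. The points, the angle and the shift are arbitrary values chosen for illustration; they are not the MDS output.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical 2-D points, used only to illustrate the invariance
points = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [-1.0, 2.0]])

theta = np.pi / 6                      # rotate by 30 degrees
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
shift = np.array([5.0, -2.0])          # arbitrary translation

moved = points @ rotation.T + shift

# The coordinates differ, but the pairwise distances are unchanged
print(np.allclose(points, moved))                 # False
print(np.allclose(pdist(points), pdist(moved)))   # True
```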

The sklearn `MDS` class can work with either observation-feature data or a distance matrix. We use the latter here (i.e., the distance matrix has been `precomputed`). Note that we can embed into any number of dimensions, but 2 is handy for putting a scatter plot on the screen!

In [None]:
from sklearn.manifold import MDS

embedding = MDS(n_components=2, dissimilarity='precomputed', normalized_stress='auto', random_state=42)
X = embedding.fit_transform(distMatrix)

fig, ax = plt.subplots(layout='constrained')
ax.scatter(X[:,0], X[:,1], s=100, c='orange')
ax.axis('equal')
ax.set_title('MDS: Embedded points in 2-D')
fig.savefig(outputDir + '/AGNES_MDS.pdf', bbox_inches='tight')
plt.show()