# Importing libraries...
  * #### Cell #1 imports the matplotlib modules needed to display figures outside the Jupyter cell (in a separate Qt window)
  * #### Cell #2 imports the essential numpy and scipy modules for our computations (including the agglomerative clustering)

In [1]:
import PyQt5
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
# run_line_magic replaces the deprecated get_ipython().magic(...) call
get_ipython().run_line_magic('matplotlib', 'qt')

In [2]:
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.cluster.hierarchy import fcluster

# Loading Processed Data Matrix...

In [5]:
X = np.load('comp-data/1-preprocessing-comp-data/user-feature-set.npy')

# Generating The Hierarchical Clustering Dendrogram...
  * #### Using Complete Linkage Method

In [6]:
# generate the linkage matrix
ZC = linkage(X, 'complete')
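
For reference, here is a minimal sketch of what `linkage` returns. It uses a small random matrix as a stand-in for `X`; the toy data and the names `X_demo` and `Z_demo` are assumptions for illustration, not part of the original data. Each of the n - 1 rows records one merge: the two cluster ids joined, the distance at which they were joined, and the size of the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy stand-in for the user-feature matrix (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.random((10, 3))

Z_demo = linkage(X_demo, 'complete')

# One row per merge: [cluster id a, cluster id b, merge distance, new cluster size]
print(Z_demo.shape)   # n - 1 rows, 4 columns
print(Z_demo[-1])     # the final merge joins all samples into one cluster
```

The third column (merge distance) is the value plotted on the dendrogram's y axis.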

In [7]:
# calculate full dendrogram
plt.figure(1, figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram -- Complete-Linkage')
plt.xlabel('X[i]')
plt.ylabel('distance')
dendrogram(
    ZC,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()

In [6]:
plt.figure(2, figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram -- Complete-Linkage (truncated)')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    ZC,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=20,  # number of leaves kept in the truncated tree
    show_leaf_counts=False,  # otherwise numbers in brackets are counts
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
)
plt.show()
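
The truncated layout can also be inspected without opening a figure. The sketch below assumes toy random data (`X_demo`, `Z_demo` and `info` are hypothetical names) and uses `no_plot=True`, which makes `dendrogram` return its layout dictionary without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Toy stand-in for the user-feature matrix (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
Z_demo = linkage(X_demo, 'complete')

# no_plot=True computes the dendrogram layout but skips the drawing step.
info = dendrogram(Z_demo, truncate_mode='lastp', p=5, no_plot=True)

# 'ivl' holds the x axis labels; entries in parentheses are counts of
# original samples contracted into that leaf.
print(info['ivl'])
```

This is handy for checking how many samples each contracted branch holds when `show_leaf_counts=False` hides the counts in the plot.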

# Getting the number of Clusters... (trimming the dendrogram)

In [47]:
max_d = 2.065
clusters_ = fcluster(ZC, max_d, criterion='distance')
clusters_

array([1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1,
       1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1,
       1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 3, 1, 2, 1, 1, 3, 1, 1, 1, 2, 1, 1, 1, 4, 1, 1, 1, 1, 2, 1, 1, 1,
       1, 4, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1,
       1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 3, 1, 1, 1, 2, 1, 1, 3, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1,
       2, 2, 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1,
       1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,
       1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
       4, 1,
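
A long label array is hard to read at a glance. As a minimal sketch (on toy random data, with a cut height of 0.8 chosen arbitrarily for the example), cluster sizes can be summarised with `np.unique`:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy stand-in for the user-feature matrix (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.random((40, 3))
Z_demo = linkage(X_demo, 'complete')

# 0.8 is an arbitrary cut height for this toy data, not a value from the notebook.
labels_demo = fcluster(Z_demo, 0.8, criterion='distance')

# Count how many samples fall into each flat cluster.
ids, sizes = np.unique(labels_demo, return_counts=True)
for cluster_id, size in zip(ids, sizes):
    print(f'cluster {cluster_id}: {size} samples')
```

Applied to `clusters_`, the same two lines would show how unevenly the users are spread across the clusters.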

# Getting the -optimal- number of Clusters... (k-means elbow method)
  * #### As seen in the dendrogram above, if we "trim" the tree at a given distance, we get between 2 and 6 clusters (2 and 6 are included for padding purposes).
  * #### For [2, 3, 4, 5, 6] clusters the breaking points are [2.3535, 2.311, 2.065, 1.9331, 1.749] respectively
  * #### Running k-means for k = 2, 3, 4, 5 (and optionally 6) and applying the elbow method should tell us the optimal number of clusters
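
Breaking points like the ones listed above can be read off the linkage matrix rather than the plot. The sketch below is an illustration on toy random data, not the procedure used to obtain the values above; `cut_interval` is a hypothetical helper. It assumes the merge distances are non-decreasing (true for complete linkage) and contain no ties.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy stand-in for the user-feature matrix (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.random((60, 4))
Z_demo = linkage(X_demo, 'complete')

heights = Z_demo[:, 2]  # merge distances, one per row of the linkage matrix


def cut_interval(heights, k):
    """Return (low, high): any low <= max_d < high yields k flat clusters (k >= 2)."""
    return heights[-k], heights[-(k - 1)]


for k in [2, 3, 4, 5, 6]:
    low, high = cut_interval(heights, k)
    max_d = (low + high) / 2  # any value inside the interval works
    n_found = fcluster(Z_demo, max_d, criterion='distance').max()
    print(k, round(low, 4), round(high, 4), n_found)
```

If only the cluster count matters, `fcluster(Z, k, criterion='maxclust')` asks for at most `k` clusters directly, without choosing a distance.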

  * ### Importing the essential sklearn and scipy modules for k-means

In [13]:
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist

In [28]:
# k means determine optimal k
distortions = []
K = [2, 3, 4, 5, 6]
for k in K:
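    # n_jobs and precompute_distances are no longer accepted by recent scikit-learn releases
    kmeanTest = KMeans(n_clusters=k, n_init=20, random_state=0, verbose=2)
    kmeanTest.fit(X)  # a single fit is enough; the second call only repeated the work
    # distortion: mean Euclidean distance from each sample to its nearest centroid
    distortions.append(np.min(cdist(X, kmeanTest.cluster_centers_, 'euclidean'), axis=1).sum() / X.shape[0])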
 
# Plot the elbow (figure 3, so the truncated dendrogram in figure 2 is not drawn over)
plt.figure(3, figsize=(25, 10))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
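
The distortion used in the loop above is the mean Euclidean distance from each sample to its nearest centroid. As a minimal sketch, the helper below isolates that computation and checks it on a hand-made example; `mean_distortion` and the toy points are hypothetical and not part of the original notebook.

```python
import numpy as np
from scipy.spatial.distance import cdist


def mean_distortion(points, centers):
    """Mean Euclidean distance from each point to its nearest center."""
    nearest = cdist(points, centers, 'euclidean').min(axis=1)
    return nearest.mean()


# Hand-made toy example (hypothetical data): two tight pairs of points.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
one_center = np.array([[5.0, 1.0]])
two_centers = np.array([[0.0, 1.0], [10.0, 1.0]])

# Adding the second center sharply reduces the distortion; the elbow
# method looks for the k after which such reductions become small.
print(mean_distortion(points, one_center))
print(mean_distortion(points, two_centers))
```

Note that this differs from the `inertia_` attribute of a fitted `KMeans`, which is a sum of squared distances; either can be plotted against k, but the two curves are on different scales.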

 * #### The elbow method suggests that the optimal number of clusters is 3
 * #### So, we trim the dendrogram at max_d = 2.311

In [46]:
max_d = 2.311
clusters_ = fcluster(ZC, max_d, criterion='distance')
clusters_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       3, 1,

 * #### However, at the postprocessing stage and in conjunction with the dendrogram above, we can conclude that there are two possible clusterings:
 * #### (red, green, cyan, purple) -- 4 clusters (dendrogram observation)
 * #### ((red, green), cyan, purple) -- 3 clusters (k-means elbow method)
 
#### Given the nature of the data (user tastes & preferences), it is better to split the users into as many clusters as possible while keeping the cut distance at a sensible value. We therefore choose 4 clusters to represent our users.
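
The two candidate groupings are nested: cutting the same tree lower only splits existing clusters, it never reshuffles them. A minimal sketch of that check on toy random data (all names and data here are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy stand-in for the user-feature matrix (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.random((60, 4))
Z_demo = linkage(X_demo, 'complete')
heights = Z_demo[:, 2]

# Cut heights taken midway between consecutive merges (assumes no tied heights).
max_d_4 = (heights[-4] + heights[-3]) / 2  # yields 4 clusters
max_d_3 = (heights[-3] + heights[-2]) / 2  # yields 3 clusters

labels_4 = fcluster(Z_demo, max_d_4, criterion='distance')
labels_3 = fcluster(Z_demo, max_d_3, criterion='distance')

# Every cluster of the finer cut lands inside exactly one cluster of the coarser cut.
for fine in np.unique(labels_4):
    coarse = np.unique(labels_3[labels_4 == fine])
    print(f'4-cluster label {fine} -> 3-cluster label(s) {coarse.tolist()}')
```

Two of the four fine labels map to the same coarse label, which corresponds to the pair of branches that the 3-cluster cut keeps merged.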

# ~ END OF CHAPTER 3 - AGGLOMERATIVE CLUSTERING ~