# Clustering generated data

Clustering is easier when the clusters have a simple shape (a hyperellipsoid, say) and are well separated from each other.

However, not all data has these characteristics. One data set that was chosen to be "difficult" was used to make the case for a new `hdbscan` clustering algorithm. This data set was published with the _Comparing Clustering Algorithms_ [page](https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html).

As usual, we import several libraries, including `clusSupport` (imported as `cs`), a set of Python functions that support the notebooks associated with week 10.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.cluster as cluster
import time

import sys
#sys.path.append('../../20-resources/files/python')
import clusSupport as cs

dataDir = "data"
# Make sure the outputDir subdirectory exists
outputDir = "output/Practical_A_GeneratedData"
import os
os.makedirs(outputDir, exist_ok=True)  # no error if the directory already exists

%matplotlib inline

Assign plot settings for seaborn and for matplotlib.

In [None]:
sns.set_context('poster')
sns.set_color_codes()
plot_kwds = {'alpha' : 0.25, 's' : 15, 'linewidths':0}
paletteName = 'deep'
fontSize = 10
fc = '#cccccc'

Load existing data stored in NumPy's binary format. This data has complex features and is a major challenge for k-means, Gaussian Mixture Models and similar algorithms, although suitably tuned density-based clustering algorithms work reasonably well.

In [None]:
data = np.load(dataDir + '/complexDataForClustering.npy')
data.T

Generate a scatter plot of the two columns after transposing them into rows.

In [None]:
plt.scatter(data.T[0], data.T[1], c='b', **plot_kwds)
frame = plt.gca()
frame.axes.get_xaxis().set_visible(False)
frame.axes.get_yaxis().set_visible(False)
plt.savefig(outputDir + '/complexDataForClustering.pdf')

As can be seen, the data has some areas where the points are closer together, but the boundaries are indistinct and the shapes are not simple shapes such as lines, circles or ellipses. It looks like there are _6_ clusters, so we set `n_clusters` = 6 below.

In [None]:
from scipy.spatial.distance import cdist
outFile = outputDir + '/kmeans6_generated.pdf'
nClusters = 6
clusterParams = {'n_clusters':nClusters, 'random_state':0}
algName = "KMeans"

start_time = time.time()
kmeans = cluster.KMeans(**clusterParams)
labels = kmeans.fit_predict(data)
# Subsequently, this will be invoked using a function call of the form
# kmeans, labels = w8s.fitClusterLabels(data, cluster.KMeans, (), clusterParams)
end_time = time.time()
elapsed_time = end_time-start_time
print(elapsed_time)

title = '{} Clusters found by {}'.format(str(nClusters),algName)
plt = cs.plot_2dClusters(data, labels, title, paletteName, fontSize, plot_kwds)
outFile = outputDir + '/{}{}_generated.pdf'.format(algName,str(nClusters))
plt.savefig(outFile)

title = '{} Clusters (with regions) found by {}'.format(str(nClusters),algName)
centres = kmeans.cluster_centers_
radii = [cdist(data[labels == i], [center]).max()
         for i, center in enumerate(centres)]
plt = cs.overlayDisks(plt, centres, radii, fc, plot_kwds)
outFile = outputDir + '/{}{}withCentres_generated.pdf'.format(algName,str(nClusters))
plt.savefig(outFile)

As might be expected, k-means does not do well with this data set. Its features are not remotely circular, so k-means needs to create larger cluster "regions" that overlap. It is clearly not the best choice for this data.
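The overlap between k-means "regions" can also be checked numerically: two disks overlap when the distance between their centres is less than the sum of their radii. The following is a minimal sketch on a synthetic two-moons data set from `sklearn.datasets.make_moons`, which is an assumption standing in for the notebook's data; the variable names are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Synthetic stand-in data (assumption): two interleaved, non-circular clusters
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
km_labels = km.fit_predict(X)
km_centres = km.cluster_centers_

# Radius of each disk: distance from the centre to its farthest member
km_radii = np.array([cdist(X[km_labels == i], [c]).max()
                     for i, c in enumerate(km_centres)])

# Two disks overlap when the centres are closer than the sum of the radii
centre_dist = np.linalg.norm(km_centres[0] - km_centres[1])
disks_overlap = centre_dist < km_radii.sum()
print(centre_dist, km_radii, disks_overlap)
```

The radii are computed in the same way as in the cell above, so the same check could be applied to each pair of the six k-means centres.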

In some ways, k-means can be viewed as a restricted special case of a Gaussian Mixture model: the latter can also handle hyperellipsoidal features. So we try this model instead. We can make the Gaussian Mixture model more like k-means by choosing a `'spherical'` covariance, or we can allow full flexibility (ellipsoids with arbitrary orientation and eccentricity) by choosing the `'full'` covariance model.
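The difference between the two covariance models shows up in the shape of the fitted `covariances_` attribute. The sketch below uses synthetic sheared blobs from `sklearn.datasets.make_blobs` (an assumption, not the notebook's data) and hypothetical variable names.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (assumption): three blobs, sheared so they are elongated
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

cov_shapes = {}
for cov_type in ['spherical', 'full']:
    gm = GaussianMixture(n_components=3, covariance_type=cov_type,
                         random_state=0).fit(X)
    # 'spherical': one variance per component -> shape (n_components,)
    # 'full': one matrix per component -> shape (n_components, n_features, n_features)
    cov_shapes[cov_type] = gm.covariances_.shape
    print(cov_type, cov_shapes[cov_type])
```

A single variance per component can only describe a circle (or sphere), whereas a full matrix per component can describe an ellipse of any orientation.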

In [None]:
import sklearn.mixture as mixture

algName = "GaussianMixture"
for covType in ['spherical', 'full']:
    clusterParams = {'n_components':6, 'covariance_type':covType, 'max_iter':100, 'random_state':0}
    start_time = time.time()
    gaussianMixture = mixture.GaussianMixture(**clusterParams)
    #print(dir(gaussianMixture))
    labels = gaussianMixture.fit(data).predict(data)
    # Subsequently, this will be invoked using a function call of the form
    # gaussianMixture, labels = w8s.fitClusterLabels(data, mixture.GaussianMixture, (), clusterParams)
    end_time = time.time()
    elapsed_time = end_time-start_time
    print(elapsed_time)
    #plt, elapsed_time = w8s.plot_clusters(data, mixture.GaussianMixture, (), clusterParams, plot_kwds)
    
    plt.clf() # Start new plot
    title = '{} Clusters found by {}'.format(str(nClusters),algName)
    plt = cs.plot_2dClusters(data, labels, title, paletteName, fontSize, plot_kwds)
    outFile = outputDir + '/{}{}_{}_generated.pdf'.format(algName,str(nClusters),covType)
    plt.savefig(outFile)

    title = '{} Clusters (with regions) found by {}'.format(str(nClusters),algName)
    weights = gaussianMixture.weights_
    means = gaussianMixture.means_
    covariances = gaussianMixture.covariances_
    plt = cs.overlayEllipses(plt, weights, means, covariances)
    outFile = outputDir + '/{}{}_{}_withEllipses_generated.pdf'.format(algName,str(nClusters),covType)
    plt.savefig(outFile)

The Gaussian Mixture model, when allowed to vary the covariance, generally does better in the sense that it can match the geometric features in the data more faithfully than k-means. However, it is clear that there is still a lot of overlap (so the clusters are not distinguishable in this case).
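One way to put a number on this overlap is to look at the soft assignments returned by `predict_proba`: points whose largest component probability is well below 1 lie where components overlap. This is a minimal sketch on synthetic two-moons data (an assumption), and the 0.9 threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (assumption)
X, _ = make_moons(n_samples=400, noise=0.08, random_state=0)
gm = GaussianMixture(n_components=2, covariance_type='full',
                     random_state=0).fit(X)

# Each row holds the posterior probability of every component for one point
proba = gm.predict_proba(X)

# Fraction of points not confidently assigned to any single component
# (the 0.9 threshold is an arbitrary illustrative choice)
ambiguous_fraction = np.mean(proba.max(axis=1) < 0.9)
print(proba.shape, ambiguous_fraction)
```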

A completely different approach is the density-based approach followed by `DBSCAN`. This model takes a more relaxed interpretation of clusters - they are merely regions where the data is closer together. The "shape" of the cluster, its centre and other parameters are immaterial to its definition.
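The notion of "regions where the data is closer together" can be made concrete through core points: a point is a core point if enough other points lie within a given distance of it. The sketch below counts neighbours directly and compares the result with scikit-learn's `core_sample_indices_`; the data set and the values of `eps` and `min_samples` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic stand-in data (assumption)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
eps, min_samples = 0.15, 5  # illustrative values (assumption)

# A point is a "core" point if at least min_samples points (itself included)
# lie within distance eps of it
neighbour_counts = (cdist(X, X) <= eps).sum(axis=1)
manual_core = np.flatnonzero(neighbour_counts >= min_samples)

# The indices found by hand should agree with those reported by DBSCAN
db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
print(len(manual_core), len(db.core_sample_indices_))
```

Nothing in this definition refers to a centre or a shape, which is why density-based methods can follow arbitrarily curved clusters.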

With `DBSCAN`, the user specifies a (global) distance tolerance rather than the number of clusters. We note that it can be difficult to choose this tolerance, as it can be too large in some areas and too small in others. Indeed, the quality of any clustering can be very sensitive in respect of this setting. That said, `DBSCAN` clearly performs much better than its peers in respect of this data set.
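The sensitivity to the tolerance is easiest to see at the extremes. In the sketch below (synthetic two-moons data and `eps` values chosen as assumptions), a tiny `eps` labels every point as noise, while a very large `eps` merges everything into a single cluster. Noise points carry the label `-1` and are excluded from the cluster count.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic stand-in data (assumption)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

eps_summary = {}
for eps in [1e-6, 0.05, 0.15, 10.0]:
    eps_labels = DBSCAN(eps=eps).fit_predict(X)
    n_noise = int(np.sum(eps_labels == -1))  # -1 marks noise points
    n_found = len(set(eps_labels)) - (1 if n_noise else 0)
    eps_summary[eps] = (n_found, n_noise)
    print(eps, n_found, n_noise)
```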

In [None]:
algName = "DBSCAN"
for eps in [0.01, 0.025, 0.05]:
    clusterParams = {'eps':eps}
    start_time = time.time()
    dbscan = cluster.DBSCAN(**clusterParams)
    labels = dbscan.fit_predict(data)
    # Subsequently, this will be invoked using a function call of the form
    # dbscan, labels = w8s.fitClusterLabels(data, cluster.DBSCAN, (), clusterParams)
    end_time = time.time()
    elapsed_time = end_time-start_time
    print(elapsed_time)
    
    plt.clf() # Start new plot
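    # Label -1 marks noise points, so it is not counted as a cluster
    nClusters = len(set(labels)) - (1 if -1 in labels else 0)
    title = '{} Clusters found by {} with eps={}'.format(str(nClusters),algName,str(eps))
    plt = cs.plot_2dClusters(data, labels, title, paletteName, fontSize, plot_kwds)
    # Replace the decimal point so that the file name has a single extension dot
    outFile = outputDir + '/{}{}_{}_generated.pdf'.format(algName,str(nClusters),str(eps).replace('.','p'))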
    plt.savefig(outFile)

As can be seen, `DBSCAN` starts to model "noise" when `eps` is too large. A more recent refinement of `DBSCAN` makes configuration much easier: `HDBSCAN` derives a hierarchical clustering and then uses the `min_cluster_size` setting to control how fine the clustering should be. Generally, smaller values of `min_cluster_size` are associated with more clusters. This setting is much more appealing than the distance threshold used by its parent `DBSCAN` algorithm. We try several options and see that, with the best setting, the clustering is surprisingly good and close to what human experts would consider a good clustering.

In [None]:
import hdbscan
algName = "HDBSCAN"
for minClusterSize in [10, 15, 30]:
    clusterParams = {'min_cluster_size':minClusterSize}
    start_time = time.time()
    hdbscanModel = hdbscan.HDBSCAN(**clusterParams)
    labels = hdbscanModel.fit_predict(data)
    # Subsequently, this will be invoked using a function call of the form
    # hdbscan, labels = w8s.fitClusterLabels(data, hdbscan.HDBSCAN, (), clusterParams)
    end_time = time.time()
    elapsed_time = end_time-start_time
    print(elapsed_time)
    
    plt.clf() # Start new plot
    # Label -1 marks noise points, so it is not counted as a cluster
    nClusters = len(set(labels)) - (1 if -1 in labels else 0)
    title = '{} Clusters found by {} minClusterSize={}'.format(str(nClusters),algName,str(minClusterSize))
    plt = cs.plot_2dClusters(data, labels, title, paletteName, fontSize, plot_kwds)
    outFile = outputDir + '/{}{}_{}_generated.pdf'.format(algName,str(nClusters),str(minClusterSize))
    plt.savefig(outFile)