# Hierarchical Clustering Practical

This notebook will look at:
    
   * How sklearn and scipy implement hierarchical clustering.
   * How to truncate a dendrogram for ease of visualisation.
   * How to get a better idea of what different clusters are capturing.
   * How well clustering works on simple data.
   

In [None]:
# Import libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
# Import online retail dataset

customers_ml_data = pd.read_csv('data/online_retail_afterEDA.csv', index_col='CustomerID')
non_binary_cols = [
    'balance', 'max_spent', 'mean_spent', 
    'min_spent', 'n_orders','total_items', 
    'total_refunded', 'total_spent' ]

customers = customers_ml_data[non_binary_cols]
Xscores = pd.read_csv('data/pca_scores.csv', index_col='CustomerID')

In [None]:
# This function generates a pairplot for some cluster assignment, for some chosen variables.

def pairplot_cluster(df, cols, cluster_assignment):
    """
    Input
        df, dataframe that contains the data to plot
        cols, columns to consider for the plot
        cluster_assignment, cluster assignment returned 
        by the clustering algorithm
    """
    # Work on a copy with an added 'cluster' column, so the caller's dataframe
    # is left untouched and no SettingWithCopyWarning is triggered.
    plot_df = df.assign(cluster=cluster_assignment)
    # seaborn will color the samples according to the column cluster
    sns.pairplot(plot_df, vars=cols, hue='cluster')

## Hierarchical clustering implementations.

First, let's take a look at existing implementations of agglomerative clustering in Python. Clustering in scipy is primarily done using the functions in [`scipy.cluster.hierarchy`](http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html). Sklearn implements hierarchical clustering in the class [`sklearn.cluster.AgglomerativeClustering`](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering), which is mainly a wrapper for the scipy functions.


The `customers` variable contains a dataset of retail data. If you wish, take the time now to have a dig through the data, and get a feel for what it contains. How many variables are there? How clusterable do you think it is likely to be?

Cluster the retail data provided in `customers` using the scipy function `linkage()`. In addition to the data, `linkage()` takes two important arguments: `method`, which determines which linkage method we want to use, and `metric`, which determines the distance measure.

To start, set `method='average'` and `metric = 'euclidean'`. Store the returned hierarchy in a variable `Z`.
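As a rough illustration of what `linkage()` returns, here is a minimal sketch on a small synthetic array rather than the retail data. The names `X_demo` and `Z_demo` are hypothetical and exist only for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Synthetic stand-in for a dataset: 20 points in 3 dimensions (an assumption,
# not the retail data used in this practical).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))

# Average linkage with Euclidean distances, as in the exercise.
Z_demo = linkage(X_demo, method='average', metric='euclidean')

# Each of the N-1 rows describes one merge:
# [index of cluster a, index of cluster b, merge distance, size of new cluster]
print(Z_demo.shape)
print(Z_demo[:3])
```

The last row of the hierarchy is the final merge, so its fourth column equals the number of points.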

In [None]:
from scipy.cluster.hierarchy import linkage

# Apply hierarchical clustering to retail dataset


Now we need to visualise our data. scipy has a function for this called, not surprisingly, `dendrogram()`. Plot the dendrogram of the hierarchy you've just learned.
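Besides drawing the tree, `dendrogram()` also returns a dictionary describing the layout. The sketch below, on hypothetical synthetic data (`X_demo`, `Z_demo`), uses `no_plot=True` so that only the dictionary is computed; removing that argument draws the figure with matplotlib instead.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical toy data: 12 points in 2 dimensions.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(12, 2))
Z_demo = linkage(X_demo, method='average', metric='euclidean')

# no_plot=True skips the drawing step and just returns the layout dictionary.
tree = dendrogram(Z_demo, no_plot=True)

# 'leaves' lists the original row indices in left-to-right plotting order,
# and 'ivl' holds the corresponding axis labels.
print(tree['leaves'])
print(tree['ivl'])
```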

In [None]:
from scipy.cluster.hierarchy import dendrogram

# Draw the dendrogram



The coloring of the figure highlights that the data can be segmented into a few big clusters that were merged only in the very last iterations of the algorithm. In particular, there is one huge cluster along with a couple of smaller clusters.

Let's try a different method. We're going to use an approach that, at each step, merges the pair of clusters whose combination gives the smallest increase in total within-cluster variance. This alternative to average linkage is called the [Ward variance minimisation algorithm, or Ward's method](https://en.wikipedia.org/wiki/Ward%27s_method). It tends to give more compact clusters, which might prevent the problem above, where most of our data ends up in one huge cluster. That might not be a problem, of course; but if we don't think our data is that unbalanced, it is an indication we might want to try a different approach.

Set `method = 'ward'`, and re-run the `linkage` function.
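To make the variance-minimisation idea concrete, the sketch below checks one consequence of it on hypothetical synthetic data (`X_demo`, `Z_ward`). Under scipy's convention for Ward distances, a merge at height $d$ raises the total within-cluster sum of squares by $d^2/2$, so summing that quantity over all merges should recover the total sum of squares of the data about its mean.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: 30 points in 4 dimensions.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(30, 4))

# Ward linkage (scipy only supports it with Euclidean distances).
Z_ward = linkage(X_demo, method='ward')

# Increase in within-cluster sum of squares caused by each merge.
increase_per_merge = Z_ward[:, 2] ** 2 / 2

# Total sum of squared deviations from the overall mean.
total_ss = ((X_demo - X_demo.mean(axis=0)) ** 2).sum()

print(increase_per_merge.sum(), total_ss)
```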

In [None]:
# Apply hierarchical clustering to retail dataset, using Ward's method.



Plot the resulting dendrogram.

In [None]:
# Draw the Ward dendrogram



Ok, so this looks as though it has split into more balanced super-clusters. You can also see the huge difference that a different choice of algorithm can make on the resultant clusters.

## Truncating the Dendrogram

We can improve the readability of the dendrogram by showing only the last merged clusters and by using a threshold to color the clusters. For this use:

* The option `truncate_mode` in `dendrogram`. The setting `'lastp'`, together with the parameter `p`, means the last `p` non-singleton clusters formed are the only non-leaf nodes in the dendrogram; all other non-singleton clusters are contracted into leaf nodes.
* Set `color_threshold=70`. Each subtree whose top merge lies below this height is drawn in its own colour, while links above the threshold share a single default colour.

In [None]:
# Draw the truncated dendrogram


Much better. The numbers in (brackets) on the horizontal axis indicate the size of the cluster corresponding to each leaf node. If there is a number not in brackets, the leaf only corresponds to a single data item, and the number is that data item's index.
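The leaf labels described above can also be inspected without plotting. This is a minimal sketch on hypothetical synthetic data; `leaf_size` is a helper invented for the example, and the threshold value is arbitrary rather than the `70` suggested for the retail data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical toy data: 40 points in 2 dimensions.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(40, 2))
Z_demo = linkage(X_demo, method='ward')

# Keep only the last p merged clusters as leaves; no_plot=True returns the
# layout dictionary without drawing (remove it to see the figure).
p = 6
tree = dendrogram(Z_demo, truncate_mode='lastp', p=p,
                  color_threshold=0.7 * Z_demo[:, 2].max(), no_plot=True)

def leaf_size(label):
    """Return the number of points behind a leaf label such as '(12)' or '7'."""
    return int(label.strip('()')) if label.startswith('(') else 1

sizes = [leaf_size(label) for label in tree['ivl']]
print(tree['ivl'])
print(sum(sizes))
```

The leaf sizes add up to the number of points, which is a quick sanity check that the truncated view still accounts for the whole dataset.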

## What are clusters capturing?

Let's switch to sklearn.

Use `sklearn.cluster.AgglomerativeClustering` to cluster the retail data. As `AgglomerativeClustering` is a class, you'll have to instantiate an object. Set `linkage='ward'` and `metric='euclidean'` when you do (in scikit-learn versions before 1.2 this parameter is named `affinity`).

We can also make `AgglomerativeClustering` return a flat cluster assignment. The dendrogram above suggests 3 might be a good number of clusters; set `n_clusters=3` when you instantiate the clusterer.

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Initialise an instance of AgglomerativeClustering with the appropriate settings



Now get our flat cluster assignment of the retail data with the function `fit_predict`.
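The instantiate-then-`fit_predict` pattern looks roughly like this. The sketch uses three hypothetical, well-separated synthetic blobs instead of the retail data, and leaves the distance argument at its default (Euclidean) so that it runs regardless of whether your scikit-learn version calls it `metric` or `affinity`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: three blobs of 30 points each, far apart.
rng = np.random.default_rng(4)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=1.0, size=(30, 2)) for c in centres])

# Ward linkage with a flat cut into 3 clusters.
clusterer = AgglomerativeClustering(n_clusters=3, linkage='ward')

# fit_predict builds the hierarchy and returns one integer label per row.
labels = clusterer.fit_predict(X_demo)

print(labels.shape)
print(np.bincount(labels))
```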

In [None]:
#cluster the retail dataset, and return a flat assignment.



We can try to visualise which clusters we've settled on, relative to the data. We've provided the function `pairplot_cluster` above, which will plot a pair of variables against one another using `seaborn`'s `pairplot`. Visualise the clusters with respect to the `'mean_spent'` and `'max_spent'` columns of the pandas dataframe, by setting the variable `cluster_assignment` in `pairplot_cluster` to the result of `fit_predict`, and setting the other parameters appropriately.

In [None]:
# Visualise the clusters using pairplot_cluster()



It certainly looks as though the clusters are capturing meaningful variation in the data. Feel free to play around with different flat cluster numbers and different pairs of variables to get a feel for what's going on.

## How well does clustering work on data?

Now you've had a chance to try out clustering, for this final part we're going to pose a more open question; how well does agglomerative clustering work, and what are the problems that might arise?

We're going to be working with a different, small dataset: the [Iris flowers dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which has the advantage of being small enough to visualise in its entirety. It is also labelled; we know that there are really three underlying clusters (corresponding to 50 examples of each of the three species of Iris). This means we have a ground truth, which is interesting for comparison to how the clustering algorithms work.

In [None]:
# Load the dataset

from sklearn import datasets
X, y = datasets.load_iris(return_X_y = True)

X should have shape `(150,4)`, which is 150 data points in 4 dimensions. To get a feel for the data, plot the first dimension against the second. Feel free to colour the points with their true cluster membership.

In [None]:
#Plot the first two dimensions of the Iris data.



It seems clear that one of the clusters is very distinct, but the other two are quite intermingled. Just in these two dimensions, it might be difficult to imagine an agglomerative clustering algorithm which is going to disentangle these two clusters well. This problem of noisy clusters is very difficult to overcome without additional knowledge about the data.

Let's try clustering the Iris dataset as before. We'll use a `'euclidean'` distance measure and `'average'` linkage, to begin. Plot the full dendrogram.

In [None]:
#Cluster the iris dataset and plot the dendrogram.



As we expected, the cluster which is easily separable in the first two dimensions is also easy to identify as an individual cluster in the dendrogram. Aligning clusters with actual classes might normally be an issue, but because I haven't shuffled the dataset, the first 50 sample indexes all belong to the same class, and they have also been sorted into the same cluster.

We can get a visual impression of how well the clusters might match the true classes by using `AgglomerativeClustering` to return 3 flat clusters. Do this now; make sure to set `linkage` to `'average'` and `metric` to `'euclidean'`.
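A numerical complement to the visual comparison is a contingency table of true species against cluster labels. This is only a sketch: `table` is a hypothetical name, and the distance argument is left at its default (Euclidean) to stay compatible across scikit-learn versions.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering

X_iris, y_iris = datasets.load_iris(return_X_y=True)

# Average linkage, cut into 3 flat clusters.
labels = AgglomerativeClustering(n_clusters=3, linkage='average').fit_predict(X_iris)

# Rows are the true species (0, 1, 2); columns are the cluster labels.
# Cluster labels are arbitrary, so only the pattern of counts matters.
table = np.zeros((3, 3), dtype=int)
np.add.at(table, (y_iris, labels), 1)

print(table)
```

A row with a single non-zero entry means that species landed entirely in one cluster; a column with a single non-zero entry means that cluster contains only one species.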

In [None]:
#Instantiate another instance of the AgglomerativeClustering class, and use its fit_predict() function.
#Store the cluster assignments in a variable.



Now plot the first two dimensions of the iris dataset twice; once coloured according to their true class, once according to the cluster assignments your agglomerative clusterer has learned.

In [None]:
#plot both cluster assignments and true labels alongside each other.



Not too terrible! Obviously the colours won't match, but we have recovered one of the clusters perfectly, and the other two reasonably well. However, at the boundary, the clusterer has clearly smoothed some of the noise. One question to think about would be: might other agglomerative approaches work better? Why?

However, to create this plot, we have been guilty of cheating. We knew in advance that there were 3 clusters, so it was easy to pick a number of flat clusters to return. Without this knowledge, what could we do? Just looking at the data, we might assume that there were only two clusters. The final part of this practical will be to introduce you to one approach (there are many!) we can use to try to work out how many clusters there are.

This is called the **elbow method**.

## The Elbow Method

The elbow method works according to the following intuition: adding a cluster should explain more of the **variance** of the data. 

If you imagine a cluster for every point, the average variance within each cluster will be 0. If you imagine a single cluster, its variance will be the variance of the dataset (>0). 

So we can assume that each time we add a cluster, the **average within-cluster variance** will reduce. However, at some point we will hit diminishing returns; this point will be marked by a transition from a steep gradient to a shallow one. This transition is the 'elbow' of the graph; hence the name.

This might sound confusing, so let's see it in action.

First, make a function `cluster_variance` that takes `points`, an $N \times D$ array, where $N$ is the number of points and $D$ is the dimensionality of the data. The function should return a scalar value `variance`. The variance of points $\{x_1, \dots, x_N\}$ is given by:

$$ \text{Var} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{1}{2} \lVert x_i - x_j \rVert^2 $$

If you're stuck, click on the details dropdown for a hint.

<details>
You *can* code the euclidean distance directly. Neater, however, is to use something like `np.linalg.norm` to compute the distance between each pair of points.
    
</details>
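One possible way to write this function is sketched below, using broadcasting and `np.linalg.norm` as the hint suggests. It is not the only valid solution. The check at the end relies on the fact that the pairwise formula equals the mean squared distance of the points from their centroid.

```python
import numpy as np

def cluster_variance(points):
    """Variance of an N x D array of points, via the pairwise-distance formula."""
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    # diffs[i, j] is the vector x_i - x_j, giving an N x N x D array.
    diffs = points[:, None, :] - points[None, :, :]
    # Squared Euclidean distance between every pair of points.
    sq_dists = np.linalg.norm(diffs, axis=-1) ** 2
    # (1 / N^2) * sum over all pairs of (1/2) * ||x_i - x_j||^2
    return sq_dists.sum() / (2 * n ** 2)

# Quick check on hypothetical random data against the centroid-based form.
rng = np.random.default_rng(5)
pts = rng.normal(size=(25, 4))
centroid_form = ((pts - pts.mean(axis=0)) ** 2).sum(axis=1).mean()
print(cluster_variance(pts), centroid_form)
```

The pairwise array takes memory proportional to $N^2 D$, which is fine for Iris-sized data but worth keeping in mind for larger clusters.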

In [None]:
# Make a function that computes the variance of a cluster of points.



Now, for the iris dataset, use `AgglomerativeClustering` to return cluster assignments for $C$ clusters, for each $C \in \{1,...,10\}$. Store the cluster assignments in an array or list.

In [None]:
# Compute cluster assignments for 1 cluster to 10 clusters.



Finally, for each number of clusters, compute the average within-cluster variance, and plot against number of clusters. You should use the function you made earlier.
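The whole elbow computation can be sketched in a few lines. This version assumes average linkage and takes "average within-cluster variance" to mean the unweighted mean of the per-cluster variances; both are choices made for the example, and `cluster_variance` is repeated here only so the block runs on its own. Plotting is left out; the resulting list can be passed straight to `plt.plot`.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering

def cluster_variance(points):
    """Variance of an N x D array of points, via the pairwise-distance formula."""
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    diffs = points[:, None, :] - points[None, :, :]
    return (np.linalg.norm(diffs, axis=-1) ** 2).sum() / (2 * n ** 2)

X_iris, _ = datasets.load_iris(return_X_y=True)

cluster_counts = list(range(1, 11))
avg_within_var = []
for k in cluster_counts:
    # Flat assignment into k clusters.
    labels = AgglomerativeClustering(n_clusters=k, linkage='average').fit_predict(X_iris)
    # Variance of each cluster, then the unweighted mean over clusters.
    per_cluster = [cluster_variance(X_iris[labels == c]) for c in np.unique(labels)]
    avg_within_var.append(np.mean(per_cluster))

for k, v in zip(cluster_counts, avg_within_var):
    print(k, round(float(v), 3))
```

With a single cluster the value is just the variance of the whole dataset, which gives a useful reference point for the rest of the curve.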

In [None]:
# Compute average within-cluster variance, and plot against number of clusters.



The elbow clearly indicates that after a cluster number of 2 or 3, we start to hit diminishing returns; there's little else to be explained by adding more clusters. This is often a good way to get a sense for how clusterable your data might be, particularly if you can't eyeball it. 

You may also see the number of clusters plotted against the fraction of variance explained, that is, one minus the ratio of within-cluster variance to total data variance. This will flip the graph vertically and rescale it, but the elbow will still be there.

This concludes the practical; feel free to spend some time reading the sklearn and scipy docs and trying out things I haven't mentioned.