# Hierarchical Clustering

In clustering, we try to group observations that are similar to each other. Similarity is measured with a distance measure, such as the Euclidean distance.
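To make the idea of a distance measure concrete, here is a minimal sketch that computes pairwise Euclidean and Manhattan (cityblock) distances with `scipy.spatial.distance.pdist`. The three points are made-up values chosen only for illustration and are not part of the supermarket data.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three made-up observations with two features each (illustrative values only)
points = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
])

# pdist returns the condensed pairwise distances; squareform expands them
# into a symmetric matrix that is easier to read
euclidean = squareform(pdist(points, metric="euclidean"))
manhattan = squareform(pdist(points, metric="cityblock"))

print(euclidean)
print(manhattan)
```

The smaller the distance between two observations, the more similar they are considered to be.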

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

In [None]:
import os

In [None]:
print(os.__file__)

In [None]:
pwd

In [None]:
df = pd.read_csv('Cust_Spend_Data.csv')
df

The data above describes supermarket customers: how often they visit and how much they spend.

The columns of the dataframe loaded in the previous step are explained below.

$\underline{CustID}$: This is a number kept by the supermarket to identify the customer on their next visit to the supermarket.

$\underline{Name}$: This refers to the name of the customer. Here, to maintain anonymity we have letters in place of the actual names of customers.

$\underline{AvgMthlySpend}$: The average amount of money the customer spends in the supermarket per month.

$\underline{NoofVisits}$: The number of visits by the customer to the supermarket.

$\underline{ApparelItems}$: The number of apparel items bought by the customer on their visits to the supermarket.

$\underline{FnVItems}$: The number of fruit and vegetable items bought by the customer.

$\underline{StaplesItems}$: The number of staple items bought by the customer.

For performing clustering, let us create a copy of the original loaded dataframe with only the relevant variables.

In [None]:
data = df.drop(['Name','Cust_ID'], axis=1)
data.head()

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

In [None]:
ward_link = linkage(data,method = 'ward',metric='euclidean')
ward_link

In [None]:
dend_ward = dendrogram(ward_link)

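Some food for thought: what is wrong with the above results? Hint: the variables are measured on different scales, so those with larger values dominate the distance calculation.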

#### In this step we import StandardScaler to scale the data (computing z-scores). StandardScaler subtracts the mean of each variable from every observation and divides the result by the standard deviation of that variable.

# z = $\frac{(x - \mu)}{\sigma}$

###### Note: All the symbols follow the usual nomenclature.
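As a quick check of the formula, the sketch below computes z-scores by hand with NumPy and compares them with the output of `StandardScaler`. The four values are made up for illustration. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single made-up variable with four observations (illustrative values only)
values = np.array([[10.0], [20.0], [30.0], [40.0]])

# Manual z-score: subtract the mean, divide by the population standard deviation
mu = values.mean(axis=0)
sigma = values.std(axis=0)  # ddof=0, matching StandardScaler
z_manual = (values - mu) / sigma

# The same computation through scikit-learn
z_sklearn = StandardScaler().fit_transform(values)

print(np.allclose(z_manual, z_sklearn))
```

After scaling, each variable has mean 0 and standard deviation 1, so no single variable dominates the distance calculation.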

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
#In this step we create a StandardScaler object that has not yet been fitted. Going forward, you will see this pattern
# of creating an object first for most estimators and transformers in the sklearn library

scale = StandardScaler()

In [None]:
#An sklearn scaler has to be fitted to the data before it can be used.
#Fitting computes and stores the mean and standard deviation of each variable in this particular dataset.

ss = scale.fit(data)
ss

In [None]:
#In this step we perform the standard scaling operation mentioned earlier.

ss1 = ss.transform(data)
ss1

In [None]:
data.columns

In [None]:
#Finally we go ahead and put the above values in a dataframe for easier understanding.

pd.DataFrame(ss1,columns=data.columns)

In [None]:
#FYI, fit_transform performs the fit and transform steps shown above in a single call,
# so they do not have to be run as separate steps

data_scaled = pd.DataFrame(scale.fit_transform(data),columns=data.columns)
data_scaled

#### Method 1: Performing Hierarchical Clustering with the 'scipy' package 

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

Let us now try to cluster the data with the Euclidean distance and Ward's method for linkage.

In [None]:
wardlink = linkage(data_scaled,method = 'ward')

In [None]:
warddend = dendrogram(wardlink)

The above code snippet uses the Euclidean distance, which is the default metric of `linkage`. To cluster with a different distance, pass the `metric` argument to `linkage` together with a linkage method such as 'single', 'complete' or 'average'; Ward's method itself is only well defined for the Euclidean distance.

Now that the dendrogram has shown us how many clusters to form, we need to assign each observation to a cluster.


In [None]:
from scipy.cluster.hierarchy import fcluster

The `fcluster` function extracts the cluster label of each observation once we have chosen where to cut the dendrogram.

We are going to perform this by two different methods.

From the dendrogram, we see that 3 clusters are optimum. Thus, we are going to form 3 clusters based on the 'maxclust' criterion of the fcluster function.

In [None]:
#Method 1
#The 'maxclust' criterion cuts the tree at a cophenetic distance that yields at most the requested number of clusters.
#The cophenetic distance between two objects is the height of the dendrogram at which the two branches containing them merge into a single branch.

clusters = fcluster(wardlink, 3, criterion='maxclust')
clusters

In the second method of extracting the clusters, we will use the 'distance' criterion. So, let us read a suitable cut-off distance from the Y-axis of the dendrogram plot and extract the clusters at that height.

In [None]:
# Method 2

clusters = fcluster(wardlink, 4, criterion='distance')
clusters
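To see how the two criteria relate, here is a small sketch on made-up data with three obvious groups. The array `toy` and the cut-off height of 2 are assumptions chosen for this illustration only; they are unrelated to the supermarket data and its threshold.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six made-up points forming three tight pairs that are far apart
toy = np.array([
    [0.0, 0.0], [0.0, 1.0],
    [10.0, 0.0], [10.0, 1.0],
    [20.0, 0.0], [20.0, 1.0],
])
Z = linkage(toy, method="ward")

# Criterion 1: ask for at most 3 clusters
by_count = fcluster(Z, 3, criterion="maxclust")

# Criterion 2: cut the tree at height 2, above the within-pair merges
# and below the merges between pairs
by_height = fcluster(Z, 2, criterion="distance")

print(by_count)
print(by_height)
```

Both calls group the points into the same three pairs here, because the chosen height falls in the gap between the small within-pair merge heights and the much larger between-pair merge heights.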

Let us now attach these cluster labels to the original dataframe and try to interpret them from a business perspective.

In [None]:
df['clusters'] = clusters

In [None]:
df.head()

Now, let us export this dataframe to a csv file and try to draw inferences from the clusters thus formed.

In [None]:
df.to_csv('Hierarchical Clustering output.csv')

#The above command exports the file as a .csv into the directory in which the Jupyter Notebook is currently running.

We will try to profile the clusters with the mean of the spending on each category. This will give us an idea about the various clusters thus built.

In [None]:
df1 = df.drop(['Cust_ID','Name'],axis=1)
df_clust = df1.groupby('clusters').mean()
df_clust.head()

In [None]:
#reset_index turns the 'clusters' index of the above output into a regular column.

df_clust = df_clust.reset_index()
df_clust

In [None]:
cluster_freq = df['clusters'].value_counts().sort_index()
cluster_freq

In [None]:
df_clust['Frequency'] = cluster_freq.values
df_clust

From the above data frame, we know how many customers fall into each cluster.
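The profiling steps can also be written compactly. The sketch below uses a made-up dataframe with hypothetical column names (`spend`, `visits`) and hand-assigned cluster labels, and lets pandas align the cluster sizes with the cluster means by index rather than by position.

```python
import pandas as pd

# Made-up data; 'spend' and 'visits' are hypothetical column names
toy = pd.DataFrame({
    "spend": [100, 120, 900, 950, 1000],
    "visits": [2, 4, 10, 12, 14],
    "clusters": [1, 1, 2, 2, 2],
})

# Mean of every variable per cluster
profile = toy.groupby("clusters").mean()

# Cluster sizes, aligned with the means through the shared 'clusters' index
profile["Frequency"] = toy.groupby("clusters").size()

profile = profile.reset_index()
print(profile)
```

Assigning the sizes as an indexed Series avoids relying on the two results being in the same order.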

#### The following is an example of performing Hierarchical Clustering on the same data set, this time with the 'Manhattan' (cityblock) distance and single linkage. Do try out other distance metrics along with different linkage methods.

In [None]:
#Store the result under a new name so that the imported dendrogram function is not overwritten
dend_single = dendrogram(linkage(data_scaled, method="single", metric="cityblock"))

You can play around with the data, forming clusters and building dendrograms with different distance metrics and different linkage algorithms.
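One way to organise such experiments is to loop over a few linkage methods and metrics and compare the resulting cluster sizes. The sketch below does this on randomly generated data with three well-separated groups; the data, the seed and the choice of 3 clusters are assumptions made for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up data: three groups of 10 points around different centres
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(10, 2))
    for center in (0.0, 5.0, 10.0)
])

results = {}
for method in ("single", "complete", "average"):
    for metric in ("euclidean", "cityblock"):
        Z = linkage(toy, method=method, metric=metric)
        labels = fcluster(Z, 3, criterion="maxclust")
        # Labels start at 1, so drop the empty count for label 0
        sizes = np.bincount(labels)[1:]
        results[(method, metric)] = sizes
        print(method, metric, sizes)
```

On real data the cluster sizes will usually differ between combinations, which is a useful prompt to look at the corresponding dendrograms in more detail.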

# END