# Hierarchical Clustering Using SciPy
In this notebook, we will use SciPy to make dendrograms. The notebook covers the basic steps for building a dendrogram from a numeric matrix, followed by a few customisations that are easy to apply. The examples use the car dataset, which contains attributes such as model, mpg, gear and other features related to cars.


## 1. Making a simple Dendrogram


### Importing libraries


In [None]:
import pandas as pd
from matplotlib import pyplot as plt
from scipy.cluster import hierarchy
import numpy as np

### Importing Data Set

In [None]:
df = pd.read_csv('mtcars.csv')

In [None]:
df = df.set_index('model')

**Agglomerative clustering** supports several linkage criteria, such as *Ward*, *complete-link* and *average*. We will use *Ward* linkage in this example.

In [None]:
# Build the linkage matrix: merge the samples step by step using Ward's criterion
Z = hierarchy.linkage(df, 'ward')
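
To make the returned linkage matrix less abstract, here is a minimal sketch on hypothetical toy data (five made-up 1-D samples, not the car dataset). The names `toy` and `Z_toy` are illustrative only.

```python
import numpy as np
from scipy.cluster import hierarchy

# Hypothetical toy data: five 1-D samples (not the mtcars dataset).
toy = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])

# Each row of the linkage matrix records one merge:
# [index of cluster A, index of cluster B, merge distance, size of the new cluster]
Z_toy = hierarchy.linkage(toy, method="ward")

print(Z_toy.shape)    # (n_samples - 1, 4)
print(Z_toy[:, 2])    # merge distances, in the order the merges happened
print(Z_toy[-1, 3])   # the final merge contains every sample
```

Indices below `n_samples` refer to original samples; larger indices refer to clusters created by earlier rows.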

In [None]:
hierarchy.dendrogram(Z, leaf_font_size=8)
plt.show()

The figure above shows a simple dendrogram without proper leaf labels. Let us explore further and see how the plot can be customized.

## 2. Adding leaf labels

In [None]:
# Clear the index name so it is not shown as an extra header row
df.index.name = None

In [None]:
df.head()

In [None]:
Z = hierarchy.linkage(df, 'ward')

**Task 1:** Plot a dendrogram with improved leaf rotation and label font size. Use the following settings:

- leaf_rotation = 90
- leaf_font_size= 10
- labels=df.index
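
As a rough illustration of how these parameters fit together, the sketch below uses hypothetical toy points and made-up label names rather than the car data. It passes `no_plot=True` so that only the layout is computed, which makes the effect of `labels` easy to inspect.

```python
import numpy as np
from scipy.cluster import hierarchy

# Hypothetical toy data standing in for a labelled DataFrame.
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [8.0, 8.0], [8.0, 9.0]])
toy_labels = ["car_a", "car_b", "car_c", "car_d"]

Z_toy = hierarchy.linkage(toy_points, method="ward")

# no_plot=True skips drawing and only returns the layout dictionary;
# remove it to draw the figure with rotated, resized labels.
layout = hierarchy.dendrogram(
    Z_toy,
    labels=toy_labels,
    leaf_rotation=90,
    leaf_font_size=10,
    no_plot=True,
)

# 'ivl' lists the leaf labels in the order they appear along the axis.
print(layout["ivl"])
```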

In [None]:
# Plot with Custom leaves


The plot above labels each leaf with the name of the car model, and the rotation and font size have been adjusted to make the dendrogram easier to read. We can improve the visualization further by changing the orientation of the plot.

## 3. Orientation

**Task 2:** Change the orientation of the dendrogram above by setting orientation = "left". Note that the label font size should be 10.

In [None]:
# Orientation of the dendrogram


The plot above shows how the orientation of the dendrogram and its labels can be changed. With orientation = "left" the root is drawn on the left and the branches extend to the right; the parameter also accepts "right", "top" (the default) and "bottom".

## 4. Number of clusters

In [None]:
# Control number of clusters in the plot + add horizontal line.
hierarchy.dendrogram(Z, color_threshold=240, labels=df.index, leaf_rotation=90)
plt.axhline(y=240, c='grey', lw=1, linestyle='dashed')
plt.show()

A horizontal line marks the threshold at which the clusters are identified. Currently 3 clusters are shown. Changing the threshold results in a different number of clusters.
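
The same kind of cut can also be turned into explicit cluster labels with SciPy's `fcluster`. The sketch below is an illustration on hypothetical toy data; the threshold of 5.0 is chosen for that toy data only and has nothing to do with the value 240 used above.

```python
import numpy as np
from scipy.cluster import hierarchy

# Hypothetical toy data: three tight, well-separated groups of three points.
toy = np.array([
    [0, 0], [0, 1], [1, 0],
    [10, 10], [10, 11], [11, 10],
    [20, 0], [20, 1], [21, 0],
], dtype=float)

Z_toy = hierarchy.linkage(toy, method="ward")

# Cut the tree at a height: merges above `threshold` are ignored.
threshold = 5.0  # illustrative value for this toy data
labels_by_height = hierarchy.fcluster(Z_toy, t=threshold, criterion="distance")

# Alternatively, ask directly for at most a given number of flat clusters.
labels_by_count = hierarchy.fcluster(Z_toy, t=3, criterion="maxclust")

print(labels_by_height)
print(len(set(labels_by_height)))
```

Raising or lowering `threshold` changes how many flat clusters come out, which mirrors moving the dashed line up or down in the plot.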

# Hierarchical Clustering using SKLearn

We have a CSV file that contains all the votes from the 114th Senate. Each row contains the votes of an individual senator. Votes are coded as 0 for “No”, 1 for “Yes”, and 0.5 for “Abstain”. Separate columns hold the votes on each bill, the party, and the state of each senator. Let's see how clustering can be applied to this data set.

## 1. Clustering
Let's now use sklearn's ```AgglomerativeClustering``` to conduct the hierarchical clustering.

In [None]:
from sklearn.cluster import AgglomerativeClustering

In [None]:
data=pd.read_csv('114_congress.csv')
df_senate= data.copy()

In [None]:
df_senate.shape

In [None]:
df_senate.head()

In [None]:
df_senate.dtypes

In [None]:
df_senate.set_index('name',inplace=True)

In [None]:
df_senate.head()

In [None]:
X=df_senate.drop(['party','state'], axis = 1)

In [None]:
X.head()

In [None]:
Y=df_senate['party']

In [None]:
Y.head()

### Hierarchical Clustering

Ward is the default linkage criterion, so we'll start with that.

In [None]:
ward = AgglomerativeClustering(n_clusters=3)
ward_pred = ward.fit_predict(X)
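
For readers who want to see what `fit_predict` returns before running it on the Senate data, here is a small sketch on hypothetical toy points. The names `toy`, `model` and `toy_pred` are illustrative only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: two tight groups of three points each.
toy = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

# linkage defaults to "ward"; it is spelled out here for clarity.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
toy_pred = model.fit_predict(toy)

print(toy_pred)        # one integer cluster label per row
print(model.labels_)   # the same labels, stored on the fitted estimator
```

The integer assigned to each cluster is arbitrary; only the grouping carries meaning.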

Let's also try complete and average linkages

**Task 3**:
* Conduct hierarchical clustering with complete linkage, store the predicted labels in the variable ```complete_pred```
* Conduct hierarchical clustering with average linkage, store the predicted labels in the variable ```avg_pred```

In [None]:
# Hierarchical clustering using complete linkage
# Write the code to create an instance of AgglomerativeClustering with the appropriate parameters
complete = None

# Fit & predict
# Write the code to make AgglomerativeClustering fit the dataset and predict the cluster labels
complete_pred = None

# Hierarchical clustering using average linkage
# Write the code to create an instance of AgglomerativeClustering with the appropriate parameters
avg = None

# Fit & predict
# Write the code to make AgglomerativeClustering fit the dataset and predict the cluster labels
avg_pred = None

To determine which clustering result better matches the original labels of the samples, we can use ```adjusted_rand_score```, an *external cluster validation index*. Its maximum value is 1, which means the two clusterings group the samples in exactly the same way (regardless of which label is assigned to each cluster). Scores near 0 indicate agreement no better than chance, and negative scores indicate agreement worse than chance.

Cluster validation indices are discussed later in the course.
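
To get a feel for the score in the meantime, the following sketch compares a hypothetical set of true labels with three made-up clusterings. All label lists here are invented for illustration.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground-truth labels and three candidate clusterings.
truth = ["a", "a", "a", "b", "b", "b"]
same_groups_renamed = [1, 1, 1, 0, 0, 0]   # identical grouping, different names
one_mistake = [0, 0, 1, 1, 1, 1]           # third sample put in the wrong group
single_cluster = [0, 0, 0, 0, 0, 0]        # everything lumped together

print(adjusted_rand_score(truth, same_groups_renamed))  # 1.0
print(adjusted_rand_score(truth, one_mistake))          # between 0 and 1
print(adjusted_rand_score(truth, single_cluster))       # 0.0
```

The first comparison shows that the names given to the clusters do not matter, only which samples end up together.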

In [None]:
from sklearn.metrics import adjusted_rand_score

ward_ar_score = adjusted_rand_score(Y, ward_pred)

**Task 4**:
* Calculate the Adjusted Rand score of the clusters resulting from complete linkage and average linkage

In [None]:
# Write the code to calculate the adjusted Rand score for the complete linkage clustering labels
complete_ar_score = None

# Write the code to calculate the adjusted Rand score for the average linkage clustering labels
avg_ar_score = None

Which algorithm results in the higher Adjusted Rand Score?

In [None]:
print( "Scores: \nWard:", ward_ar_score,"\nComplete: ", complete_ar_score, "\nAverage: ", avg_ar_score)

In [None]:
complete_pred

In [None]:
print(pd.crosstab(complete_pred, df_senate["party"]))
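
The cross-tabulation counts how many members of each party fall into each cluster. A minimal sketch with hypothetical labels for six made-up senators shows how to read it:

```python
import numpy as np
import pandas as pd

# Hypothetical cluster labels and party labels for six senators.
toy_clusters = np.array([0, 0, 1, 1, 1, 0])
toy_party = pd.Series(["D", "D", "R", "R", "D", "I"], name="party")

# Rows are cluster ids, columns are parties, cells are member counts.
table = pd.crosstab(toy_clusters, toy_party)
print(table)
```

A clustering that matches the parties well would concentrate each column's counts in a single row.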

## 3. Dendrogram visualization with scipy

Let's visualize the highest scoring clustering result. 

To do that, we'll need to use SciPy's [```linkage```](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html) function to perform the clustering again, so that we obtain the linkage matrix needed to visualize the hierarchy.

**Task 5:** Specify the linkage type. SciPy accepts 'ward', 'complete' and 'average', as well as other values. Pick the one that resulted in the highest Adjusted Rand score.

In [None]:
# Import scipy's linkage function to conduct the clustering
from scipy.cluster.hierarchy import linkage

#Write code here
linkage_type = None 

linkage_matrix = None
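
As a sanity check on the idea of repeating the clustering in SciPy, the sketch below runs both libraries on hypothetical toy data with the same linkage method and compares the resulting groupings. The method chosen here is only an illustration and is not meant as the answer to the task.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy data: three well-separated groups.
toy = np.array([
    [0, 0], [0, 1], [1, 0],
    [10, 10], [10, 11], [11, 10],
    [20, 0], [20, 1], [21, 0],
], dtype=float)

method = "average"  # illustrative choice, not the answer to the task

# scikit-learn: flat labels directly.
sk_labels = AgglomerativeClustering(n_clusters=3, linkage=method).fit_predict(toy)

# SciPy: linkage matrix first, then cut it into 3 flat clusters.
toy_linkage = linkage(toy, method=method)
sp_labels = fcluster(toy_linkage, t=3, criterion="maxclust")

# The label numbers may differ, but the grouping agrees on data this clean.
print(adjusted_rand_score(sk_labels, sp_labels))
```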

**Task 6:**

Plot a dendrogram using SciPy's [dendrogram()](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html) function with the following settings:

- leaf_font_size=10
- labels=X.index
- orientation="right"
- figsize=(15,18) (set on the figure, e.g. with `plt.figure`, since `dendrogram()` has no `figsize` argument)
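
One possible way to combine a figure size with the dendrogram options is sketched below on hypothetical toy data and made-up names. It assumes the figure is created first with `plt.subplots` and handed to `dendrogram()` through its `ax` argument; the smaller figure size is just for the toy example.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical toy data and labels (not the Senate dataset).
toy = np.array([[0.0, 0.0], [0.0, 1.0], [8.0, 8.0], [8.0, 9.0]])
toy_names = ["sen_a", "sen_b", "sen_c", "sen_d"]

toy_linkage = linkage(toy, method="complete")

# Size the figure first, then let dendrogram() draw into its axes.
fig, ax = plt.subplots(figsize=(6, 4))
result = dendrogram(
    toy_linkage,
    labels=toy_names,
    orientation="right",
    leaf_font_size=10,
    ax=ax,
)
# With a left/right orientation the merge distance runs along the x-axis.
ax.set_xlabel("merge distance")
```

In a notebook the figure is displayed at the end of the cell; in a script, call `plt.show()` afterwards.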

In [None]:
# write the code to plot using 'dendrogram()'



## 4. Visualization with Seaborn's ```clustermap``` 

The [seaborn](http://seaborn.pydata.org/index.html) plotting library for Python can plot a [clustermap](http://seaborn.pydata.org/generated/seaborn.clustermap.html), which combines a heatmap of the dataset with dendrograms along its rows and columns. It performs the clustering as well, so we only need to pass it the dataset and the linkage method we want, and it will use SciPy internally to do the clustering.

In [None]:
import seaborn as sns
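# Map each distinct party to one colour (zipping Y itself would pair colours with the first three rows)
lut = dict(zip(Y.unique(), "rbg"))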
row_colors = Y.map(lut)
sns.clustermap(X, row_colors=row_colors, figsize=(12,10),method='complete')
plt.show()
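
The colour lookup used for `row_colors` can be checked on its own. The sketch below uses a hypothetical Series of party labels for five made-up senators and assumes there are at most three distinct parties, since only three colour codes are supplied.

```python
import pandas as pd

# Hypothetical party labels indexed by senator name.
toy_party = pd.Series(
    ["R", "R", "D", "I", "D"],
    index=["sen_a", "sen_b", "sen_c", "sen_d", "sen_e"],
    name="party",
)

# One colour per distinct party; zip() stops at the shorter input,
# so this assumes at most three distinct labels.
lut = dict(zip(toy_party.unique(), "rbg"))
toy_row_colors = toy_party.map(lut)

print(lut)
print(toy_row_colors)
```

Because the result keeps the senator names as its index, seaborn can align each colour with the matching row of the heatmap.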

Looking at the colors in the heatmap, can you see how the different senators voted? The bar on the far left shows the party each senator belongs to. The plot as a whole shows which senators voted "Yes", which voted "No", and which abstained.