# Visualization with hierarchical clustering and t-SNE

In this chapter, you'll learn about two unsupervised learning techniques for data visualization: hierarchical clustering and t-SNE. Hierarchical clustering merges the data samples into ever-coarser clusters, yielding a tree visualization of the resulting cluster hierarchy. t-SNE maps the data samples into 2D space so that the proximity of the samples to one another can be visualized.

# (1) Visualizing hierarchies

## Visualizations communicate insight
- "t-SNE": Creates a 2D map of a dataset (later)
- "Hierarchical clustering" (this video)

## A hierarchy of groups
- Groups of living things can form a hierarchy
- Clusters are contained in one another

<p align='center'>
    <img src='image/Screenshot 2021-02-18 235251.png'>
</p>

## Eurovision scoring dataset
- Countries gave scores to songs performed at Eurovision 2016
- 2D array of scores
- Rows are countries, columns are songs

<p align='center'>
    <img src='image/Screenshot 2021-02-18 235505.png'>
</p>

## Hierarchical clustering of voting countries
<p align='center'>
    <img src='image/Screenshot 2021-02-18 235628.png'>
</p>

## Hierarchical clustering
- Every country begins in a separate cluster
- At each step, the two closest clusters are merged
- Continue until all countries are in a single cluster
- This is "agglomerative" hierarchical clustering
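
The steps above can be sketched as a naive loop. This is only an illustration, not how SciPy implements `linkage()`: the function name `naive_agglomerative`, the toy points, and the choice of complete linkage with Euclidean distance are all assumptions made for the example.

```python
import numpy as np

def naive_agglomerative(points):
    """Hypothetical helper: merge the two closest clusters until one remains."""
    points = np.asarray(points, dtype=float)
    # Every sample begins in its own cluster (a list of row indices).
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage (assumed here): the largest pairwise
                # distance between members of the two clusters.
                d = max(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Delete b first (b > a) so that index a stays valid.
        del clusters[b]
        del clusters[a]
        clusters.append(merged)
    return merges

# Toy data, invented for illustration.
toy_points = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.5], [10.0, 10.0]]
merges = naive_agglomerative(toy_points)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 3))
```

Each pass through the `while` loop performs one merge, so the loop runs once fewer than the number of samples.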

## The dendrogram of a hierarchical clustering
- Read from the bottom up
- Vertical lines represent clusters

<p align='center'>
    <img src='image/Screenshot 2021-02-19 000009.png'>
</p>

## Hierarchical clustering with SciPy
- Given `samples` (the array of scores), and `country_names`

In [None]:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
mergings = linkage(samples, method='complete')
dendrogram(mergings,
           labels=country_names,
           leaf_rotation=90,
           leaf_font_size=6)
plt.show()

# Exercise I: How many merges?

If there are 5 data samples, how many merge operations will occur in a hierarchical clustering? (To help answer this question, think back to the video, in which Ben walked through an example of hierarchical clustering using 6 countries.)

### Possible Answers
- 4 merges (T)
- 3 merges
- This can't be known in advance
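
One way to check the answer is to look at the shape of the array that `linkage()` returns, which has one row per merge. The sketch below uses random toy data; the names `toy_samples` and `toy_mergings` are placeholders and not part of the exercise.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five random samples with three features each (toy data, assumed).
rng = np.random.default_rng(0)
toy_samples = rng.normal(size=(5, 3))

# linkage() returns one row per merge, so 5 samples give 4 rows.
toy_mergings = linkage(toy_samples, method='complete')
print(toy_mergings.shape)  # (4, 4)
```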

# Exercise II: Hierarchical clustering of the grain data

In the video, you learned that the SciPy `linkage()` function performs hierarchical clustering on an array of samples. Use the `linkage()` function to obtain a hierarchical clustering of the grain samples, and use `dendrogram()` to visualize the result. A sample of the grain measurements is provided in the array `samples`, while the variety of each grain sample is given by the list `varieties`.

### Instructions

- Import:
    - `linkage` and `dendrogram` from `scipy.cluster.hierarchy`.
    - `matplotlib.pyplot` as `plt`.
- Perform hierarchical clustering on `samples` using the `linkage()` function with the `method='complete'` keyword argument. Assign the result to `mergings`.
- Plot a dendrogram using the `dendrogram()` function on `mergings`. Specify the keyword arguments `labels=varieties`, `leaf_rotation=90`, and `leaf_font_size=6`.


In [None]:
# Perform the necessary imports
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Calculate the linkage: mergings
mergings = linkage(samples, method='complete')

# Plot the dendrogram, using varieties as labels
dendrogram(mergings,
           labels=varieties,
           leaf_rotation=90,
           leaf_font_size=6,
)
plt.show()


## Plot

<p align='center'>
    <img src='image/[2021-02-19] 013215.svg'>
</p>

# Exercise III: Hierarchies of stocks

In chapter 1, you used k-means clustering to cluster companies according to their stock price movements. Now, you'll perform hierarchical clustering of the companies. You are given a NumPy array of price movements `movements`, where the rows correspond to companies, and a list of the company names `companies`. SciPy hierarchical clustering doesn't fit into a sklearn pipeline, so you'll need to use the `normalize()` function from `sklearn.preprocessing` instead of `Normalizer`.

`linkage` and `dendrogram` have already been imported from `scipy.cluster.hierarchy`, and PyPlot has been imported as `plt`.

### Instructions

- Import `normalize` from `sklearn.preprocessing`.
- Rescale the price movements for each stock by using the `normalize()` function on `movements`.
- Apply the `linkage()` function to `normalized_movements`, using `'complete'` linkage, to calculate the hierarchical clustering. Assign the result to `mergings`.
- Plot a dendrogram of the hierarchical clustering, using the list `companies` of company names as the `labels`. In addition, specify the `leaf_rotation=90` and `leaf_font_size=6` keyword arguments as you did in the previous exercise.


In [None]:
# Import normalize
from sklearn.preprocessing import normalize

# Normalize the movements: normalized_movements
normalized_movements = normalize(movements)

# Calculate the linkage: mergings
mergings = linkage(normalized_movements, method='complete')

# Plot the dendrogram
dendrogram(mergings, labels=companies, leaf_rotation=90, leaf_font_size=6)
plt.show()


## Plot

<p align='center'>
    <img src='image/[2021-02-19] 014008.svg'>
</p>
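
To see what `normalize()` does to each row, here is a small sketch on made-up numbers. With its default settings it rescales every row to unit Euclidean (L2) length; the array `toy_movements` is an assumption for illustration and is not the stock data.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Three made-up "stocks" with two price movements each.
toy_movements = np.array([[3.0, 4.0],
                          [0.5, 0.0],
                          [-6.0, 8.0]])

# Default behaviour: L2 norm, applied row by row.
toy_normalized = normalize(toy_movements)
print(toy_normalized)

# Every row now has length 1, so only the direction of movement remains.
row_lengths = np.linalg.norm(toy_normalized, axis=1)
print(row_lengths)
```

After this rescaling, stocks with large and small absolute price swings are compared on an equal footing.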

# (2) Cluster labels in hierarchical clustering

## Cluster labels in hierarchical clustering
- Not only a visualization tool!
- Cluster labels at any intermediate stage can be recovered
- For use in e.g. cross-tabulations

<p align='center'>
    <img src='image/Screenshot 2021-02-18 235628.png'>
</p>

## Intermediate clusterings & height on dendrogram
- E.g. at height 15:
    - Bulgaria, Cyprus, Greece are one cluster
    - Russia and Moldova are another
    - Armenia in a cluster on its own

<p align='center'>
    <img src='image/Screenshot 2021-02-19 125814.png'>
</p>

## Dendrograms show cluster distances
- Height on dendrogram = distance between merging clusters
- E.g. the clusters containing only Cyprus and only Greece had distance approx. 6
- This new cluster was at distance approx. 12 from the cluster containing only Bulgaria

<p align='center'>
    <img src='image/Screenshot 2021-02-19 130115.png'>
</p>
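
The heights drawn on a dendrogram come from the third column of the array returned by `linkage()`. A minimal sketch on three one-dimensional toy points (the data and variable names are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three points on a line: 0, 1 and 4.
toy_samples = np.array([[0.0], [1.0], [4.0]])
toy_mergings = linkage(toy_samples, method='complete')

# Each row: [cluster id, cluster id, merge distance, samples in new cluster]
print(toy_mergings)

# Column 2 holds the merge distances, i.e. the dendrogram heights.
heights = toy_mergings[:, 2]
print(heights)  # [1. 4.]
```

The points 0 and 1 merge first at height 1. With complete linkage, the resulting cluster is at distance 4 from the remaining point, because 4 is the largest distance between the point at 4 and the members 0 and 1.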

## Intermediate clusterings & height on dendrogram
- Height on dendrogram specifies max. distance between merging clusters
- Don't merge clusters further apart than this (e.g. 15)

## Distance between clusters
- Defined by a "linkage method"
- In "complete" linkage: distance between clusters is the max. distance between their samples
- Specified via the `method` parameter, e.g. `linkage(samples, method='complete')`
- Different linkage method, different hierarchical clustering!
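
The difference between linkage methods can be computed by hand for two small clusters. The sketch below assumes Euclidean distance and uses made-up points; `cluster_a` and `cluster_b` are hypothetical names.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two toy clusters of points on the x-axis.
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [6.0, 0.0]])

# All pairwise distances between a sample in A and a sample in B.
pairwise = cdist(cluster_a, cluster_b)

complete_distance = pairwise.max()  # furthest pair of samples
single_distance = pairwise.min()    # closest pair of samples
print(complete_distance, single_distance)  # 6.0 2.0
```

The same two clusters are 6 apart under complete linkage but only 2 apart under single linkage, which is why the choice of method can change the order of merges.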

## Extracting cluster labels
- Use the `fcluster()` function
- Returns a NumPy array of cluster labels

## Extracting cluster labels using fcluster

In [None]:
from scipy.cluster.hierarchy import linkage, fcluster

# Compute the hierarchical clustering with complete linkage
mergings = linkage(samples, method='complete')

# Keep only merges at height 15 or below, then read off the cluster labels
labels = fcluster(mergings, 15, criterion='distance')
print(labels)

## Aligning cluster labels with country names
Given a list of strings `country_names`:

In [None]:
import pandas as pd
pairs = pd.DataFrame({'labels': labels, 'countries': country_names})
print(pairs.sort_values('labels'))
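
Since `samples` and `country_names` are not defined in these notes, the same workflow can be tried end to end on toy data. Everything named `toy_*` below is an assumption made for illustration, as are the thresholds 5 and 15.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Five one-dimensional toy samples forming two tight pairs and one outlier.
toy_samples = np.array([[0.0], [1.0], [10.0], [11.0], [30.0]])
toy_names = ['a', 'b', 'c', 'd', 'e']

toy_mergings = linkage(toy_samples, method='complete')

# Threshold 5: the pairs stay separate and the outlier is on its own.
toy_labels = fcluster(toy_mergings, 5, criterion='distance')

# Threshold 15: the two pairs (complete distance 11) are allowed to merge.
toy_labels_coarse = fcluster(toy_mergings, 15, criterion='distance')

toy_pairs = pd.DataFrame({'labels': toy_labels, 'names': toy_names})
print(toy_pairs.sort_values('labels'))
print(len(set(toy_labels)), len(set(toy_labels_coarse)))  # 3 2
```

Raising the height threshold can only merge clusters, never split them, so the number of distinct labels shrinks as the threshold grows.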

# Exercise IV: Which clusters are closest?

In the video, you learned that the linkage method defines how the distance between clusters is measured. In complete linkage, the distance between clusters is the distance between the furthest points of the clusters. In single linkage, the distance between clusters is the distance between the closest points of the clusters.

Consider the three clusters in the diagram. Which of the following statements are true?

<p align='center'>
    <img src='image/cluster_linkage_riddle.png' width=50%>
</p>

**A.** In single linkage, Cluster 3 is the closest to Cluster 2.

**B.** In complete linkage, Cluster 1 is the closest to Cluster 2.

### Answer the question

Possible Answers

- Neither A nor B.
- A only.  
- Both A and B. (T)

# Exercise V: Different linkage, different hierarchical clustering!

In the video, you saw a hierarchical clustering of the voting countries at the Eurovision song contest using `'complete'` linkage. Now, perform a hierarchical clustering of the voting countries with `'single'` linkage, and compare the resulting dendrogram with the one in the video. Different linkage, different hierarchical clustering!

You are given an array `samples`. Each row corresponds to a voting country, and each column corresponds to a performance that was voted for. The list `country_names` gives the name of each voting country. This dataset was obtained from [Eurovision](www.eurovision.tv/page/results).

### Instructions

- Import `linkage` and `dendrogram` from `scipy.cluster.hierarchy`.
- Perform hierarchical clustering on `samples` using the `linkage()` function with the `method='single'` keyword argument. Assign the result to `mergings`.
- Plot a dendrogram of the hierarchical clustering, using the list `country_names` as the `labels`. In addition, specify the `leaf_rotation=90` and `leaf_font_size=6` keyword arguments as you have done earlier.


In [None]:
# Perform the necessary imports
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Calculate the linkage: mergings
mergings = linkage(samples, method='single')

# Plot the dendrogram
dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=6)
plt.show()


## Plot

<p align='center'>
    <img src='image/[2021-02-19] 132428.svg'>
</p>

# Exercise VI: Intermediate clusterings

Displayed on the right is the dendrogram for the hierarchical clustering of the grain samples that you computed earlier. If the hierarchical clustering were stopped at height 6 on the dendrogram, how many clusters would there be?

<p align='center'>
    <img src='image/[2021-02-19] 132645.svg'>
</p>

### Instructions

#### Possible Answers
- 1
- 3 (T)
- As many as there were at the beginning

# Exercise VII: Extracting the cluster labels

In the previous exercise, you saw that the intermediate clustering of the grain samples at height 6 has 3 clusters. Now, use the `fcluster()` function to extract the cluster labels for this intermediate clustering, and compare the labels with the grain varieties using a cross-tabulation.

The hierarchical clustering has already been performed and `mergings` is the result of the `linkage()` function. The list `varieties` gives the variety of each grain sample.

### Instructions

- Import:
    - `pandas` as `pd`.
    - `fcluster` from `scipy.cluster.hierarchy`.
- Perform a flat hierarchical clustering by using the `fcluster()` function on `mergings`. Specify a maximum height of `6` and the keyword argument `criterion='distance'`.
- Create a DataFrame `df` with two columns named `'labels'` and `'varieties'`, using `labels` and `varieties`, respectively, for the column values. This has been done for you.
- Create a cross-tabulation `ct` between `df['labels']` and `df['varieties']` to count the number of times each grain variety coincides with each cluster label.


In [None]:
# Perform the necessary imports
import pandas as pd
from scipy.cluster.hierarchy import fcluster

# Use fcluster to extract labels: labels
labels = fcluster(mergings, 6, criterion='distance')

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)
