# Data Science Ex 10 - Clustering (Hierarchical Methods)

03.05.2022, Lukas Kretschmar (lukas.kretschmar@ost.ch)

## Let's have some Fun with Hierarchical Clustering approaches!

In this exercise, we are going to have a look at hierarchical clustering approaches and how you can visualize them.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

## Introduction

### Hierarchical Methods (Agglomerative Clustering)

It's now time to have a look at how you can work with hierarchical clustering.
In this section, we will work with an agglomerative clustering algorithm.

But let's start with the data.

In [None]:
from sklearn.preprocessing import normalize

In [None]:
data = pd.read_csv("./Demo_WholesaleCustomers.csv", sep=";")
data.head(5)

Our dataset contains information on customers.
More precisely, we know how much each customer has spent on different segments of food.

Although the spending values share the same unit (money), they cover very different ranges.
`Channel` and `Region` are categorical, so we leave them out and normalize only the remaining columns.
Here, calling `normalize()` does a bunch of things and we won't go into details.
It is enough to know that, with its default settings, it rescales each row (i.e. each customer) to unit norm. Because spending is never negative, every value ends up in the range from `0` to `1` and describes a customer's spending mix rather than the absolute amounts.
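
To make the effect of `normalize()` concrete, here is a minimal sketch on a small made-up array (the numbers are purely illustrative and not taken from the wholesale dataset). It relies on the function's default arguments (`norm="l2"`, `axis=1`), which rescale each row. Passing `axis=0` would rescale each column instead.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Made-up spending values: 3 customers (rows) x 2 product groups (columns).
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 0.0]])

rows_scaled = normalize(X)          # default: norm="l2", axis=1 (per row)
cols_scaled = normalize(X, axis=0)  # per column instead

# Each row (or column) now has Euclidean length 1.
row_lengths = np.linalg.norm(rows_scaled, axis=1)
col_lengths = np.linalg.norm(cols_scaled, axis=0)

print(rows_scaled)
print(row_lengths, col_lengths)
```

Note that the first two customers end up with identical rows: row-wise normalization keeps the proportions between the columns but discards the total amount spent.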

In [None]:
features = data.columns.drop(["Channel", "Region"]).values
features

In [None]:
normalized = normalize(data[features])
data_n = data[["Channel", "Region"]].copy()
data_n[features] = pd.DataFrame(normalized, columns=features)
data_n.head(5)

Now, we can move on and apply a clustering algorithm.

#### Dendrograms

Reference (Linkage): https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html \
Reference (Dendrogram): https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html

The first thing we do is check how many clusters would make sense.
For this, we draw a *dendrogram*.

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

In [None]:
graph = linkage(data_n[features], method="ward")

fig, ax = plt.subplots(figsize=(20,10))
dendrogram(graph, ax=ax)
ax.set(title="Dendrogram")

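A possible approach to finding a good number of clusters is to locate the longest vertical line that is not crossed by any horizontal merge line. A long vertical line corresponds to a large jump in merge distance, so cutting through it separates groups that are clearly far apart.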

In [None]:
from matplotlib.patches import Ellipse

fig, ax = plt.subplots(figsize=(20,10))
dend = dendrogram(graph, ax=ax)

mark = Ellipse((2625, 8.5), 100, 8, lw=2, ls="--", color="r", fill=False) # the position of the ellipse is just a guess
ax.add_artist(mark)

ax.axhline(6, c="r", ls="--", lw=2)
ax.set(title="Dendrogram")

Here, we assume that we can split the data into 2 clusters.

But this is just a suggestion. We could also have chosen 3 or 4 clusters, as these numbers would have made sense too.
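
The "longest vertical line" heuristic can also be approximated numerically by looking for the largest jump between consecutive merge distances in the linkage matrix. The following is only a sketch on synthetic data (two artificial blobs, not the wholesale dataset). The variable names are made up for illustration, and the heuristic is a rule of thumb rather than a guarantee.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in data: two well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(20, 2)),
])

Z = linkage(points, method="ward")
heights = Z[:, 2]                # merge distance of each step (ascending for ward)

gaps = np.diff(heights)          # jump between consecutive merges
step = int(np.argmax(gaps))      # largest jump lies between merge `step` and `step + 1`
n_clusters = len(points) - (step + 1)   # clusters remaining after merge `step`

# Cut halfway through the largest jump and read off the flat clusters.
cut_height = (heights[step] + heights[step + 1]) / 2
labels = fcluster(Z, t=cut_height, criterion="distance")

print(n_clusters, cut_height)
```

The value of `cut_height` is what you would pass to `ax.axhline()` to draw the cut into the dendrogram.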

#### Agglomerative Clustering

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

In [None]:
from sklearn.cluster import AgglomerativeClustering

Building clusters with the `AgglomerativeClustering` algorithm works the same way as you've seen with other clustering algorithms.
Just call the `fit_predict()` method.
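
As a small sanity check, `fit_predict()` returns one integer label per row, and with the default `ward` linkage the result should match a cut of the SciPy linkage tree into the same number of clusters. The sketch below uses synthetic blobs (an assumption for illustration, not the course data) and compares both partitions with the adjusted Rand index.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data: blobs of 15 and 25 points.
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(15, 2)),
    rng.normal(loc=8.0, scale=0.5, size=(25, 2)),
])

model = AgglomerativeClustering(n_clusters=2)   # linkage defaults to "ward"
sk_labels = model.fit_predict(points)           # one integer label per row

# Same idea with SciPy: build the tree, then cut it into 2 flat clusters.
Z = linkage(points, method="ward")
sp_labels = fcluster(Z, t=2, criterion="maxclust")

sizes = np.bincount(sk_labels)
agreement = adjusted_rand_score(sk_labels, sp_labels)  # 1.0 means identical partitions
print(sizes, agreement)
```

The label numbers themselves are arbitrary (SciPy starts at 1, scikit-learn at 0), which is why the comparison uses a partition-based score instead of checking the labels for equality.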

In [None]:
model = AgglomerativeClustering(n_clusters=2)
data_n["Cluster"] = model.fit_predict(data_n[features])

We can also try to visualize the data.

In [None]:
n_features = len(features)
fig, ax = plt.subplots(n_features, n_features, figsize=(30,30))

for x in range(n_features):
    for y in range(n_features):
        data_n.plot.scatter(ax=ax[x,y], x=features[x], y=features[y], c="Cluster", cmap="rainbow", colorbar=False)

#### Pairplot

Reference: https://seaborn.pydata.org/generated/seaborn.pairplot.html

The code above is a bit tedious to write, and there is already a solution that simplifies it.
Seaborn offers a simple function, called `pairplot()`, that does the same.

In [None]:
sns.pairplot(data_n)

And we can use the `pairplot()` to color our clusters (`hue`) as well.

In [None]:
sns.pairplot(data_n, hue="Cluster", palette="rainbow")

And you can also select/limit the features you want to show (`vars`).

In [None]:
data["Cluster"] = data_n["Cluster"]
sns.pairplot(data, hue="Cluster", palette="rainbow", vars=features)

### Customization

Regarding the `AgglomerativeClustering` model, we've just used the default values for the hyperparameters.

There are some interesting hyperparameters that we could have changed and/or that could be changed in other use cases:
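- **affinity**: the distance metric used between points (default: `euclidean`); note that this parameter was renamed to `metric` in scikit-learn 1.2, and `affinity` was removed in 1.4
  - `euclidean`
  - `l1`
  - `l2`
  - `manhattan`
  - `cosine`
  - `precomputed`
- **linkage**: how the distance between two clusters is measured when deciding which ones to merge (default: `ward`)
  - `ward` (merge the pair that increases the within-cluster variance the least; only works with the `euclidean` metric)
  - `average` (average distance between all pairs of points from the two clusters)
  - `complete` (maximum distance between points of the two clusters)
  - `single` (minimum distance between points of the two clusters)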
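
To get a feeling for the `linkage` options, the following sketch fits one model per linkage on synthetic data (two artificial blobs, chosen for illustration only) and records the resulting cluster sizes. It deliberately leaves the distance metric at its default so that it does not depend on the parameter name used by your scikit-learn version.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in data: two well-separated blobs of 20 points each.
rng = np.random.default_rng(2)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(20, 2)),
])

sizes_by_linkage = {}
for linkage_name in ["ward", "average", "complete", "single"]:
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage_name)
    labels = model.fit_predict(points)
    # Sort the sizes so the arbitrary label order does not matter.
    sizes_by_linkage[linkage_name] = sorted(np.bincount(labels).tolist())

print(sizes_by_linkage)
```

On clean, well-separated data like this, all four options find the same split. On real data with noise or elongated groups they can differ noticeably, so it is worth comparing them.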

## Exercises

### Ex01 - Clustering McDonald's Menus

In this exercise, you are going to build clusters with menus from McDonald's.
First, load **Ex10_01_Data.csv**.

As you can see, you have detailed information on nutritional values of their offers that you will use for clustering.

Normalize the feature values and store them in a new dataset.
Features are all columns besides `Category`, `Item`, `Serving Size`, `Calories`, and `Calories from Fat`.

Show the dendrogram of this dataset.

How many clusters should you build?
Draw the horizontal line to indicate where to cut.

I'd suggest building 3 clusters.
Use the agglomerative clustering algorithm and predict the clusters for your data.
Assign these clusters directly to the original dataset that you've loaded at the beginning of this exercise.

Show the assigned clusters per category as a bar chart.
What do you think the clusters mean?

Looking at the plot, we could assume that the clusters distinguish between food and beverages.
Not perfectly, but it points in that direction.
Some further investigation into the data is probably needed.

Thus, list the mean and median values per cluster.
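
One possible way to get such a summary is a `groupby()` followed by `agg()`. The sketch below uses a tiny made-up table, and the column names `Sugars` and `Fat` are hypothetical placeholders, not necessarily the names in the exercise file.

```python
import pandas as pd

# Made-up values; "Cluster" plays the role of the assigned cluster labels.
toy = pd.DataFrame({
    "Sugars":  [40, 45, 2, 4, 10, 12],
    "Fat":     [0, 1, 30, 34, 10, 14],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

# One row per cluster, one (feature, statistic) column pair per feature.
summary = toy.groupby("Cluster").agg(["mean", "median"])
print(summary)
```

The result has two-level column names, so a single value is addressed as `summary.loc[0, ("Sugars", "mean")]`.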

Plot all combinations of features (as seen in the introduction) to see whether a visualization helps you interpret the clusters.

Pick two plots that look interesting to you and show them side-by-side and bigger.
Use the calories for the size of the points.

It seems that the three clusters were built around products with either high amounts of sugar or fat or those in between.
Did you find any other meanings for the clusters?

#### Solution

In [None]:
# %load ./Ex10_01_Sol.py

### Ex02 - Clustering Movies

In this exercise, you do a cluster analysis for movies.
You find your data in **Ex10_02_Data.csv**.

As you can see, you have scores and financial information for some movies between 2007 and 2013.

Create a new dataset that does not contain the `Movie` and `Year` columns.
Then normalize the data in the following subsets (one normalization per subset):
- `RottenTomatoes`, `AudienceScore`
- `TheatersOpenWeek`, `OpeningWeekend`, `BOAvgOpenWeekend`
- `DomesticGross`, `ForeignGross`, `WorldGross`, `Budget`, `OpenProfit`
- `Profitability`

Use `normalize()` for all groups but `Profitability`.
For `Profitability` import the `MinMaxScaler` class from `sklearn.preprocessing` and call the `fit_transform()` method for the data in that column.
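
As a hint on how `MinMaxScaler` is used, here is a minimal sketch on made-up numbers (not the movie data). The scaler expects a two-dimensional input, hence the double brackets when selecting the single column. The name `Profitability_scaled` is just an illustrative choice.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values for a single column.
toy = pd.DataFrame({"Profitability": [-50.0, 0.0, 150.0, 350.0]})

scaler = MinMaxScaler()                                # maps min -> 0 and max -> 1
scaled = scaler.fit_transform(toy[["Profitability"]])  # 2D in, 2D out
toy["Profitability_scaled"] = scaled.ravel()           # flatten back to one column

print(toy)
```

A single column is a case where row-wise `normalize()` would not be useful, since every non-zero row would simply become `1` or `-1`.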

Draw a dendrogram of this new dataset.

Let's say we want to build 3 clusters.
Where (height) should we make the split?
Draw a red line to show the 3 clusters in the dendrogram.

Create the agglomerative clustering algorithm model and predict the clusters for the given data.
Assign the clusters to the dataset.

Since we have a lot of features and we cannot show all of them in one simple plot, let's plot all combinations of features with the pair plot.
Use the cluster for coloring.

As you can see, for some feature combinations you can actually see "good" clusters.

#### Solution

In [None]:
# %load ./Ex10_02_Sol.py

## Self-Study

### [As Reference] Principal Component Analysis (PCA)

Note: PCA is not relevant within this course.
I've just added a section so you know how to apply it.

Reference (StandardScaler): https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html \
Reference (PCA): https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

You have already heard of PCA in lecture 4.
The idea of Principal Component Analysis (PCA for short) is to calculate a defined number of new attributes, called principal components, that explain the variance in your data.
With these attributes, we can reduce the dimension of our data while still retaining a large part of the information it contains.
Depending on how far we can reduce, we may be able to bring n dimensions down to 2 or 3
and, as a result, visualize the data.

The theory behind PCA is out of scope for this exercise, but we need to know this handy tool.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [None]:
data = pd.read_csv("./Demo_Credit.csv", sep=";")
data.head(5)

Before we run a PCA, we have to prepare the data.
This means we use a `StandardScaler` to rescale every variable to zero mean and unit variance, so that variables with large numeric ranges do not dominate the principal components.
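
What `StandardScaler` does can be checked on a tiny made-up array (values chosen for illustration only): after scaling, each column has mean 0 and standard deviation 1, regardless of its original range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0],
              [4.0, 7000.0]])

X_scaled = StandardScaler().fit_transform(X)

means = X_scaled.mean(axis=0)
stds = X_scaled.std(axis=0)   # population standard deviation (ddof=0)
print(means, stds)
```

In this example the second column is just a shifted and stretched copy of the first one, so both columns become identical after scaling.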

In [None]:
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
pca_data = pd.DataFrame(data_scaled, columns=data.columns)
pca_data.head(5)

We can now run a PCA on our dataset.
As usual, there is a `fit()` method to run the analysis.

In [None]:
pca = PCA()
pca.fit(pca_data)

Having completed the analysis, we can check how much of the variance in the data each principal component explains.

In [None]:
np.around(pca.explained_variance_ratio_, 3)

In [None]:
n = range(1, len(pca_data.columns) + 1)
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(n, pca.explained_variance_ratio_.cumsum(), "o--")
ax.set(ylim=(-.1,1.1), xlabel="Number of Components", ylabel="Explained Variance (Cumulative)", title="PCA Variance")

As you can see, with 7 components we can explain roughly 95% of the variance in the dataset.
This means that we can reduce our 10 original features to 7 components and still keep most of the information in the data.
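
The number of components needed for a target share of variance can also be determined programmatically instead of reading it off the plot. The sketch below uses synthetic data (10 features built from 3 hidden factors, an assumption made purely for illustration) and shows two options: counting along the cumulative curve, or passing the target fraction as `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Synthetic stand-in: 10 features driven by 3 hidden factors plus a little noise.
factors = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = factors @ mixing + 0.1 * rng.normal(size=(200, 10))
X_scaled = StandardScaler().fit_transform(X)

# Option 1: fit all components and count along the cumulative curve.
cumulative = PCA().fit(X_scaled).explained_variance_ratio_.cumsum()
k_manual = int(np.searchsorted(cumulative, 0.95, side="right")) + 1

# Option 2: pass the target fraction and let PCA pick the count.
pca95 = PCA(n_components=0.95).fit(X_scaled)

print(k_manual, pca95.n_components_)
```

Both options should report the same number of components. The second one is shorter, the first one makes the threshold logic explicit.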

Depending on the dataset, it's sometimes possible to break it down to 2 or 3 principal components,
which is the range where we can visualize the data.
That is not the case here, but let's visualize it anyway.

First, we need to run the analysis again.
But this time, we specify the number of components we want.

In [None]:
pca = PCA(n_components=7)
pca_data_trans = pca.fit_transform(pca_data)
pca_data_trans[:5]

Now, we can put these points into a `DataFrame` and add the known gender again for coloring purposes.

In [None]:
data_p = pd.DataFrame(pca_data_trans, columns=["PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7"])
data_p["Gender"] = data["Gender"]
data_p.head(5)

The 2D plot looks like the following:

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
data_p.plot.scatter("PC1", "PC2", ax=ax, c="Gender", cmap="rainbow")
ax.set(xlabel="PC1", ylabel="PC2", title="Credit PCA")

And based on what we have learned in the last exercise, we can also show 3 principal components.

In [None]:
from mpl_toolkits.mplot3d import Axes3D
fig, ax = plt.subplots(figsize=(10,10), subplot_kw={"projection": "3d"})
ax.scatter(data_p["PC1"], data_p["PC2"], data_p["PC3"], c=data_p["Gender"], cmap="rainbow")
ax.set(xlabel="PC1", ylabel="PC2", zlabel="PC3", title="Credit PCA")

Well, we haven't done any clustering so far.

We could now look for a clustering algorithm that works on this data (but that is not in the scope of this exercise).

### Ex - PCA with Income Data

In this exercise, you'll run a PCA for a given dataset containing income and personal data.
So, load **Ex10_PCA_Data.csv**.

Since you need to run a PCA, you have to scale the data with a `StandardScaler`.

Run a PCA and show how much each component contributes to the variance in the data.

Show the cumulative sum of the variance for each component.

So, we have seen that 3 components can explain 95% of the total variance.
Let's do the analysis again, but this time just with 2 components (we don't need more to visualize the data in 2D).
The result should be a new dataset (`DataFrame`) containing the 2 components and the loaded data (4 columns).

Plot the components as a scatter plot 4 times (2x2) and use the other columns for coloring (1 column per plot).

#### Solution

In [None]:
# %load ./Ex10_PCA_Sol.py