# Week 07

Unsupervised Learning: Clustering

## Setup

Run the following 2 cells to import all necessary libraries and helpers for this week's exercises.

In [None]:
!wget -q https://github.com/DM-GY-9103-2024F-H/9103-utils/raw/main/src/data_utils.py

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from data_utils import StandardScaler
from data_utils import KMeansClustering, GaussianClustering, SpectralClustering
from data_utils import object_from_json_url

## More Wine ! 🍷🍷🍷

Let's pretend we own an online wine store.

Last week we created a model that predicts wine quality based on a bunch of its properties. We could use this model to figure out how much to pay suppliers for the wine, and how much to charge customers.

But this "`quality`" feature might not be something we want to share with our customers. Even though it's based on data, it sounds abstract and subjective and would require explanations about our data and our process, which could create confusion.

Using all six features from the original dataset (`alcohol`, `acidity`, `density`, etc.) might also not be very useful for customers who want to buy new wines that are similar to ones that they have previously liked.

What we can do instead is classify the wines into groups that take into account all of the features of the dataset, but present customers with a more manageable amount of information.

### Recommendations

What we're really hoping to have is a simple recommendation system for our customers, where we can recommend wines based on previous wines they liked, without them having to know the $6$ features of the previous wines.

There are a few ways of doing this, but the strategy we'll take is called clustering.

### Clustering

[Clustering](https://en.wikipedia.org/wiki/Cluster_analysis), or cluster analysis, is an example of an *unsupervised* learning method that groups items based on their many features and properties.

We'll use it to divide our wines in such a way that wines in the same group, or *cluster*, are more similar to each other than to wines in other clusters.

These clusters won't necessarily correlate directly to the features in our dataset, but will be computed using a combination of the features.
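To make the link between clusters and recommendations concrete, here is a minimal sketch of how cluster labels could drive a recommendation. The wine names, the labels and the `recommend()` helper are all hypothetical and made up for illustration; they are not part of the course helpers or the wine dataset.

```python
# Hypothetical wine names and cluster labels, made up for illustration.
wine_clusters = {
    "wine_a": 0,
    "wine_b": 1,
    "wine_c": 0,
    "wine_d": 2,
    "wine_e": 1,
    "wine_f": 0,
}

def recommend(liked_wine, clusters):
    """Return the other wines that share a cluster with liked_wine."""
    target = clusters[liked_wine]
    return sorted(
        name for name, label in clusters.items()
        if label == target and name != liked_wine
    )

print(recommend("wine_a", wine_clusters))
```

The customer only needs to tell us a wine they liked; the features that produced the clusters stay behind the scenes.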

### Supervised Learning

The models that we've trained so far for doing regression and classification are considered *supervised* models. During training we give the model our input features, but also provide it with the *correct* values for the output signals. These output signals tend to be human-labeled values, and are sometimes called the *supervisory signals*.

When fully-labeled training data is processed during training, we are hoping that the model learns to extrapolate what it *sees* in the labeled data to new, unseen, unlabeled instances of data with the same input features, but unknown output values.

#### Supervised Classification:

Given a set of initial data points with labels:<br>
<img src="./imgs/classification-02.jpg" width="620px"/>

We create a model that learns to assign labels to the original points:
<img src="./imgs/classification-03.jpg" width="620px"/>

so that later we can assign correct labels to new data points:
<img src="./imgs/classification-04.jpg" width="620px"/>

### Unsupervised Learning

Unlike supervised learning, unsupervised models learn patterns from unlabeled data. This means all of the features are considered input features, and there are no separate output features or signals. The idea is that by analyzing and processing data in specific ways, the model is able to build a concise representation of its features and create new ways of interpreting, visualizing or generating similar data.

We can use unsupervised learning models to explore new datasets and try to simplify our data before we do any kind of supervised learning.

We can also use unsupervised learning to build recommendation systems that learn how to group items by their many features or characteristics.

The steps for training an unsupervised model should seem familiar:

1. Load dataset
2. Encode label features as numbers
3. Normalize the data
4. Select variables and features to be considered
5. Create a model
6. Run model on input data and test data
7. Measure error

Even though it all looks familiar, that last step isn't very obvious.

How do we measure error on a model that doesn't have a set of correct answers?

Maybe *error* is not the right term, but we'll see how to define *metrics* to score and measure our unsupervised models.

#### Unsupervised Clusterings:
Since there are no correct labels, both of the following clusterings are valid!

<img src="./imgs/clustering-00.jpg" width="620px"/>

<img src="./imgs/clustering-01.jpg" width="620px"/>

Let's run it !

### Preparing Data

We'll load the same wine dataset as last week and normalize its features:

In [None]:
## 1. Load Dataset
WINE_FILE = "https://raw.githubusercontent.com/DM-GY-9103-2024F-H/9103-utils/main/datasets/json/wines.json"

# Read into DataFrame
wines_data = object_from_json_url(WINE_FILE)
wines_df = pd.DataFrame.from_records(wines_data)

## 3. Normalize
wine_scaler = StandardScaler()
wines_scaled = wine_scaler.fit_transform(wines_df)

## 4. Select variables to be considered
##    We're gonna drop the quality feature to avoid re-clustering by quality
features = wines_scaled.drop(columns=["quality"])
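As a reminder of what the normalization step does, here is a minimal sketch of standardization on a single column. It assumes the `StandardScaler` helper rescales each feature to mean $0$ and standard deviation $1$, which is the usual meaning of the name; the `standardize()` function and the alcohol values below are hypothetical and only for illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

# Hypothetical alcohol percentages, made up for illustration
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
scaled = standardize(alcohol)

print([round(v, 3) for v in scaled])
print(round(mean(scaled), 3), round(pstdev(scaled), 3))
```

Scaling matters for clustering because the algorithms below rely on distances: without it, a feature measured in large units would dominate the grouping.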

### Clusterings

Let's look at our first clustering algorithm:

#### [K-means Clustering](https://scikit-learn.org/stable/modules/clustering.html#k-means):
Tries to separate the data into $k$ groups with similar statistical properties. Requires the number of clusters to be determined beforehand, and the algorithm tries to minimize the difference between objects in a cluster.

In [None]:
n_clusters = 4

## 5. Create Clustering object
km_model = KMeansClustering(n_clusters=n_clusters)

## 6. Run the model on the training data
km_predicted = km_model.fit_predict(features)
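To get a feel for what happens inside a k-means model, here is a minimal sketch of the textbook algorithm (often called Lloyd's algorithm) on a handful of made-up 2D points. This is not the implementation behind `KMeansClustering`; the `kmeans()` function, its parameters and the points are assumptions for illustration only.

```python
import math

def kmeans(points, centers, n_iter=10):
    """Alternate between assigning points to centers and moving the centers."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # empty cluster: leave center in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers

# Two made-up groups of 2D points
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])

print(labels)
print(centers)
```

The starting centers are picked by hand here to keep the example deterministic. Real implementations usually pick them randomly, which is one reason two runs can produce different clusterings.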

### Plots

Since we can't see in $4D$ or $5D$ yet, let's pick $2$ or $3$ variables to visualize our data and clusters.

This could be any of our features, but let's look at the *covariances* related to the `quality` of the wine and pick the top $2$ or $3$ variables related to it.

In [None]:
## Look at covariances again
wines_scaled.cov()["quality"].sort_values()
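Covariance is easier to read on standardized data: when every column has mean $0$ and standard deviation $1$, the covariance between two columns lands between roughly $-1$ and $1$ and behaves like a correlation. The sketch below shows this with made-up numbers. The helper functions and the values are hypothetical, and it uses the population formula, so results can differ slightly from what `DataFrame.cov()` reports.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def covariance(xs, ys):
    """Population covariance: average product of deviations from the mean."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# Made-up values: quality rises with alcohol and falls with density
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
density = [1.000, 0.998, 0.997, 0.995, 0.993]
quality = [4.0, 5.0, 5.0, 6.0, 7.0]

q = standardize(quality)
cov_alcohol = covariance(standardize(alcohol), q)
cov_density = covariance(standardize(density), q)

print(cov_alcohol)  # positive: moves with quality
print(cov_density)  # negative: moves against quality
```

Both large positive and large negative values indicate a strong relationship, so it's worth looking at both ends of the sorted list.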

Let's plot `alcohol`, `chlorides` and `density`.

In [None]:
# For plotting
xl, yl, zl = "alcohol", "chlorides", "density"
x = wines_scaled[xl]
y = wines_scaled[yl]
z = wines_scaled[zl]

clusters = km_predicted["clusters"]

plt.scatter(x, z, c=clusters, marker='o', linestyle='', alpha=0.5)
plt.title("k-means clustering")
plt.xlabel(xl)
plt.ylabel(zl)
plt.xlim(-2.2, 3.2)
plt.ylim(-2.5, 3.5)
plt.show()


fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection='3d')

ax.scatter(x, y, z, c=clusters, marker='o', linestyle='', alpha=0.5)

ax.set_title("k-means clustering")
ax.set_xlabel(xl)
ax.set_ylabel(yl)
ax.set_zlabel(zl)

ax.set_ylim(-2.5, 8)
ax.set_zlim(-2.5, 2.5)

plt.show()

### More Clusterings!

Let's look at another clustering method:

#### [Gaussian Clustering](https://scikit-learn.org/stable/modules/mixture.html#mixture):
This is similar to K-means, but this model assumes that all features of our data can be modeled as [Gaussian distributions](https://en.wikipedia.org/wiki/Normal_distribution).
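One way to picture the difference from K-means: instead of asking which center is closest, a Gaussian model asks how probable a point is under each cluster's distribution. Here is a minimal one-feature sketch of that idea, assuming the model behaves like the Gaussian mixture described in the linked documentation. The functions and the two made-up clusters are hypothetical and are not the `GaussianClustering` implementation.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    """Probability that x belongs to each (weight, mean, std) component."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]

# Two made-up clusters along a single feature: (weight, mean, std)
components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]

for x in (0.0, 2.0, 4.0):
    print(x, [round(r, 3) for r in responsibilities(x, components)])
```

A point halfway between the two centers is split evenly between them, while points near a center belong almost entirely to that cluster.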

Repeat steps $5$ and $6$ for this clustering method.

The Gaussian Clustering class is called `GaussianClustering` and its constructor takes the same parameters as the `KMeansClustering` constructor.

In [None]:
# TODO: Gaussian clustering

## 5. Create Clustering object
## 6. Run the model on the training data

### Plot Gaussian clusters

Just like the plots for `K-means`:

In [None]:
# TODO: plot Gaussian Clustering results

### One More Method!

Let's look at our final clustering method:

#### [Spectral Clustering](https://scikit-learn.org/stable/modules/clustering.html#spectral-clustering):
When appropriate, this method automatically combines and removes a few of our features, before doing K-means clustering. It often does as well as, or better than, regular K-means clustering, but that isn't guaranteed, which is why we'll score the results later.

The Spectral Clustering class is called `SpectralClustering` and its constructor takes the same parameters as the previous two methods.

Repeat steps $5$ and $6$ to create a model and run it on our data.

In [None]:
# TODO: Spectral clustering

## 5. Create Clustering object
## 6. Run the model on the training data

### Plot results

In [None]:
# TODO: plot Spectral Clustering results

### Scoring

It would be nice to have a way to measure how good these clusters actually are.

It would help determine if we need more clusters, or if one method is actually better than the others.

There are a few ways to do this. We'll look at three of them.

### Distance

The first kind of scoring uses the distances between each point and its cluster's center as a metric.

This is sometimes called the L2-distance, and it's just like the more familiar [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) from geometry, but extended to measure more than just $2$ or $3$ dimensions/parameters.

Each cluster's center is represented by the average values of all of the features of all of its members: $(\overline{F_0}, \overline{F_1}, \overline{F_2}, ...)$ 

A smaller cluster distance means that the cluster center is a good representation of its members.
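Here is a minimal sketch of this calculation for two made-up clusters with three features each. The helper names are hypothetical, and averaging the distances is an assumption: the course helper may aggregate them differently.

```python
import math

def centroid(members):
    """Average of every feature across a cluster's members."""
    return tuple(sum(dim) / len(members) for dim in zip(*members))

def mean_distance_to_center(members):
    """Average Euclidean (L2) distance from each member to the cluster center."""
    center = centroid(members)
    return sum(math.dist(p, center) for p in members) / len(members)

# Made-up clusters with three features each
tight = [(1.0, 2.0, 3.0), (1.0, 2.0, 5.0), (1.0, 4.0, 3.0), (1.0, 4.0, 5.0)]
loose = [(0.0, 0.0, 0.0), (0.0, 0.0, 8.0), (0.0, 8.0, 0.0), (0.0, 8.0, 8.0)]

print(centroid(tight))
print(mean_distance_to_center(tight))
print(mean_distance_to_center(loose))
```

The `loose` cluster has the larger distance, so its center is a worse summary of its members.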

Luckily, our clustering models have a `distance_error()` function that can be used to report the distance error, after the model has been fit (here, by calling `fit_predict()`).

In [None]:
km_model.distance_error()

### Repeat for the other $2$ methods

Run the `distance_error()` function for the other methods:

In [None]:
## TODO: run the other methods' distance_error()

### Likelihood

The second way of scoring clusters treats each cluster as a potential normal distribution, and then calculates the likelihood that each point came from its cluster distribution.

Values closer to zero mean that the clusters' statistical properties (mean, variance) are good estimators for the data.
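As a rough illustration of the idea, the sketch below fits a normal distribution to a one-feature cluster and averages the log of each point's density under that fit. The function name, the use of the logarithm and the averaging are assumptions for illustration, and the exact formula in the course helper may differ. The values are made up.

```python
import math
from statistics import mean, pstdev

def avg_log_likelihood(values):
    """Average log-density of values under a normal fit to those same values."""
    mu, sigma = mean(values), pstdev(values)
    total = 0.0
    for v in values:
        z = (v - mu) / sigma
        total += -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))
    return total / len(values)

# Made-up 1D clusters: same center, different spread
tight = [4.0, 5.0, 6.0, 5.0, 4.5, 5.5]
loose = [1.0, 9.0, 3.0, 7.0, 2.0, 8.0]

print(avg_log_likelihood(tight))
print(avg_log_likelihood(loose))
```

In this example both values are negative, and the tighter cluster scores closer to zero.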

Our model objects also have a `likelihood_error()` function we can use:

In [None]:
km_model.likelihood_error()

### Repeat for the other methods

Run the `likelihood_error()` function for the Gaussian and Spectral clustering models:

In [None]:
# TODO: run likelihood_error() for other clustering methods

Although somewhat related, the `distance` and `likelihood` calculations measure different things, and are in different units.

We can't compare distances to likelihoods to draw any kind of conclusion.

What we want to do is use either one of these metrics to select a clustering method and tune its parameters.

### Balance

A final metric we can consider when analyzing different clustering algorithms and strategies is to see how balanced the resulting clusters are. This isn't always important; we might have categories of items or events that are more common than others, and will produce unequal cluster groups.

In other cases, where we know we want to have groups of similar sizes, this is a good metric to look at. For example, if we were to use the body measurement dataset for deciding how many sizes of bike helmets to produce, we should probably have sizes that cover similar portions of the population, and avoid very bespoke sizes that only fit a few people.

We compute `balance error` by summing the differences between our cluster sizes and the sizes of a perfectly balanced clustering. Once we have this sum, we scale it to get a number between $0$, for a perfectly balanced clustering, and $1$, for a most-unbalanced clustering.

$\displaystyle balance\ error = \frac{1}{2} \left(\frac{n}{n-1}\right) \sum_{i=1}^{n}{\left|\frac{C_i}{C_1 + C_2 + ... + C_n} - \frac{1}{n}\right|}$

The $\frac{C_i}{C_1 + C_2 + ... + C_n}$ terms are the sizes of our $n$ clusters expressed as the percentage of the total number of items in all clusters. The $\frac{1}{n}$ term is the size of each cluster in a perfectly balanced clustering. We sum up these differences and scale it all by $\frac{1}{2} \left(\frac{n}{n-1}\right)$ to get a number between $0$ and $1$.

We don't have to focus too much on this math right now. It's here for completeness and because it's good to practice reading an algorithm described as text, math equations and code.

In [None]:
# If we have a list of clusters, like this:
print(km_predicted[:10], "\n...")

# This gives us the counts for each label:
label_counts = km_predicted['clusters'].value_counts()
print(label_counts)

# This gives each cluster's size as a percentage of total number of items
cluster_sizes_pct = label_counts / len(km_predicted['clusters'])
print(cluster_sizes_pct)

# This is the size of all clusters in a fully balanced clustering, expressed as a percentage
balanced_cluster_size_pct = 1 / n_clusters
print(balanced_cluster_size_pct)

# This is the sum of the distances between cluster sizes and perfectly-balanced sizes
sum_distances = (cluster_sizes_pct - balanced_cluster_size_pct).abs().sum()
print(sum_distances)

scale_factor = 0.5 * n_clusters / (n_clusters - 1)

balance_error = scale_factor * sum_distances
print(balance_error)
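To check that the scaling really gives $0$ and $1$ at the two extremes, here is the same calculation wrapped in a small standalone function and run on made-up cluster sizes. The function below is a sketch that follows the formula above; it is a plain function for illustration and not the model's own method.

```python
def balance_error(cluster_sizes):
    """Balance error in [0, 1] for a list of cluster sizes (needs 2+ clusters)."""
    n = len(cluster_sizes)
    total = sum(cluster_sizes)
    sum_distances = sum(abs(size / total - 1 / n) for size in cluster_sizes)
    return 0.5 * n / (n - 1) * sum_distances

print(balance_error([25, 25, 25, 25]))  # perfectly balanced
print(balance_error([100, 0, 0, 0]))    # everything in one cluster
print(balance_error([70, 10, 10, 10]))  # somewhere in between
```

The cluster sizes have to include empty clusters as zeros for the formula to work as described, since $n$ is the number of clusters we asked for.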

### Balance Error

Luckily this has also been implemented for us and we can get our model's `balance error` by calling the `balance_error()` function of our clustering object:

In [None]:
km_model.balance_error()

### Repeat for `Gaussian` and `Spectral` clustering

Get `balance error` for all clustering methods by calling their `balance_error()` method.

In [None]:
# TODO: get balance_error for gaussian and spectral clustering

### Number of clusters

If we consider the $3$ metrics for the $3$ methods, it seems like the `Spectral Clustering` algorithm performs a little bit better, even though it doesn't produce the most balanced clusters.

Once we have chosen a method, we can tune its parameters to see if we can find a combination that produces "better" clusters.

Since the only parameter our model has is the number of clusters, let's try different cluster numbers to see if there's a *better* way of clustering our wines:

In [None]:
# try 2 - 9 clusters (range() excludes its end value)
num_clusters = list(range(2,10))

# collect distance, likelihood and balance errors
dist_err = []
like_err = []
bala_err = []

# get distance, likelihood and balance for different clustering sizes
for n in num_clusters:
  mm = SpectralClustering(n_clusters=n)
  mm.fit_predict(features)
  dist_err.append(mm.distance_error())
  like_err.append(mm.likelihood_error())
  bala_err.append(mm.balance_error())


# plot errors as function of number of clusters
plt.plot(num_clusters, dist_err, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Distance Error")
plt.show()

plt.plot(num_clusters, like_err, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Likelihood Error")
plt.show()

plt.plot(num_clusters, bala_err, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Balance Error")
plt.show()

### Interpretation

Looks like $6$ could be a good number of clusters for this model, since adding additional clusters doesn't seem to make the errors go down that much.
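Reading the "elbow" off a plot is a judgment call, but the same reasoning can be sketched in code: look at how much the error drops each time a cluster is added, and stop once the drop gets small. The functions, the $10\%$ threshold and the error values below are all made up for illustration. They are not results from the wine dataset, and the sketch assumes positive error values such as distances.

```python
def relative_improvements(errors):
    """Fractional drop in error each time one more cluster is added."""
    return [(prev - curr) / prev for prev, curr in zip(errors, errors[1:])]

def pick_elbow(cluster_counts, errors, threshold=0.10):
    """First cluster count after which one more cluster improves error by < threshold."""
    for n, gain in zip(cluster_counts, relative_improvements(errors)):
        if gain < threshold:
            return n
    return cluster_counts[-1]

# Made-up error values, NOT results from the wine dataset
cluster_counts = [2, 3, 4, 5, 6, 7, 8, 9]
errors = [10.0, 8.0, 6.5, 5.5, 4.6, 4.4, 4.3, 4.25]

print(pick_elbow(cluster_counts, errors))
```

A different threshold gives a different answer, which is a good reminder that this choice is partly subjective.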

Let's look at our data and how it got clustered:

In [None]:
predicted = {}

n = 6
m_model = SpectralClustering(n_clusters=n)
predicted = m_model.fit_predict(features)

# For plotting
xl, yl, zl = "alcohol", "chlorides", "density"
x = wines_scaled[xl]
y = wines_scaled[yl]
z = wines_scaled[zl]

plt.scatter(x, z, c=predicted["clusters"], marker='o', linestyle='', alpha=0.5)
plt.title("Spectral Clustering n = %s" % n)
plt.xlabel(xl)
plt.ylabel(zl)
plt.xlim(-2.2, 3.2)
plt.ylim(-2.5, 3.5)
plt.show()

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection='3d')

ax.scatter(x, y, z, c=predicted["clusters"], marker='o', linestyle='', alpha=0.5)
ax.set_title("Spectral Clustering n = %s" % n)
ax.set_xlabel(xl)
ax.set_ylabel(yl)
ax.set_zlabel(zl)
ax.set_ylim(-2.5, 8)
ax.set_zlim(-2.5, 2.5)
plt.show()

### Analysis

So, even though $6$ clusters gives us low error values, some of the clusters are really small and hard to find on the graphs.

And the cluster sizes are really unequal.

If this clustering is to be used for recommending wines to customers, maybe using $4$ or $3$ clusters is a more sensible way of grouping our wines. Not because we have to balance the cluster sizes, but because the subtleties of having $6$ categories of wine might be less easy to explain.

Using $4$ categories is probably more legible. The categories could be something like: `strong` for the more alcoholic wines, `bold` and `dense` for the ones that are less alcoholic, but have high density, and `wild` for the ones high in chlorides.

### Conclusion

Unsupervised learning can be a very powerful and useful tool for performing exploratory data analysis and creating recommendation systems.

Because there's no labeled/correct answer in unsupervised learning, we can be a bit more subjective in how we pick our metrics for success.