# Data Science Ex 12 - Clustering (Density-based Methods)

15.05.2022, Lukas Kretschmar (lukas.kretschmar@ost.ch)

## Let's have some Fun with Density-based Clustering approaches!

In this exercise, we are going to have a look at density-based clustering.
We will also look at ways to scale and normalize data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

## Introduction

Before we turn to the density-based clustering approach, we want to introduce some preprocessing steps every data scientist should know.

- Scaling numerical data
- Normalizing numerical data
- Encoding categorical data

This knowledge is important in general, but especially in this exercise since we need to prepare the data for clustering.
For one, it can be a good idea to reduce the dimensions (number of features) of our data, which improves the runtime of our algorithms and makes the data easier to visualize.
Further, since clustering needs to calculate the distance between points, the values should be in the same range.
Otherwise, some features will dominate others.

## Preprocessing

### Preprocessing Data

When building clusters, it's essential that the values are in a comparable range.
Otherwise, calculating distances will include biases (higher values have a larger impact - e.g. one column is in `km`, another in `mm`).

Therefore, we need to know some techniques to scale our data.
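
To make the bias tangible, here is a minimal sketch in plain Python. The points and the helper `nearest_to_first` are made up for this illustration. The same three points are compared once with mixed units and once with both columns in `km`.

```python
import math

# Hypothetical points: (length in km, width in mm)
points = [(1.0, 2000.0), (1.2, 2600.0), (5.0, 2100.0)]

def nearest_to_first(points):
    """Index of the point closest to points[0] (Euclidean distance)."""
    return min(range(1, len(points)), key=lambda i: math.dist(points[0], points[i]))

# Same data, but with the width converted from mm to km
points_km = [(length, width / 1_000_000) for length, width in points]

print("Nearest neighbor (km and mm):", nearest_to_first(points))
print("Nearest neighbor (km only):  ", nearest_to_first(points_km))
```

With mixed units, the `mm` column decides which point counts as the nearest neighbor; after the conversion, the `km` column does.
The data is the same in both cases, only the choice of unit changed the result.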

#### Numerical Data

References: https://scikit-learn.org/stable/modules/preprocessing.html & https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html

We will have a look at the following possibilities (this list is not exhaustive, there are many more):
- **Normalizer**: Scales each row to unit norm (with `l2`, the default, the squared values of a row sum to `1`; with `l1`, the absolute values do).
- **MinMaxScaler**: Transforms features into a defined range.
- **RobustScaler**: Centers features on the median and scales them by a given quantile range, which limits the influence of outliers.
- **StandardScaler**: Centers features on a mean of `0` and scales them to unit variance.

In [None]:
from sklearn.preprocessing import Normalizer, MinMaxScaler, RobustScaler, StandardScaler

In [None]:
rng = np.random.RandomState(42)
data = pd.DataFrame(rng.randn(100000) + 5, columns=["Values"])
data

In [None]:
data.hist(bins=100)

The sample data has a normal distribution around a mean of 5.

Now, let's have a look at the effects the scalers and normalizers have on this data.

##### Normalizer

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html

The `Normalizer()` scales the values of each row so that the norm of the row is `1`.
We already saw this method in action in the last exercise - when we called `normalize()`.
`Normalizer()` is just the class wrapping this method and can be used in `Pipelines` or wherever a class is needed instead of a function.

In [None]:
norm = Normalizer()
data_norm = norm.fit_transform(data.T) # Since the algorithm works on rows, we have to transpose the data
data_norm = pd.DataFrame(data_norm.T)
fig, ax = plt.subplots(1,2,figsize=(20,5))

data_norm.hist(bins=100, ax=ax[0])
ax[0].set(title="Histogram of Normalizer (l2, default)")

(data_norm[0]
     .sort_values()             # Sorting all values in ascending order
     .reset_index(drop=True)    # Removing index
     .apply(lambda v: v**2)     # Squaring values
     .cumsum()                  # Taking the cumulative sum
     .plot(ax=ax[1]))           # Plotting the line
ax[1].set(title="Cumulative sum of normalized data")

We had to square each value since `Normalizer()` uses the `l2` norm by default: the squared values sum to `1`, not the values themselves.
If this behavior is not needed, but just the values relative to each other, we can set the `norm` parameter to `l1`, and the absolute values sum to `1`.

In [None]:
norm = Normalizer(norm="l1")
data_norm = norm.fit_transform(data.T)
data_norm = pd.DataFrame(data_norm.T)
fig, ax = plt.subplots(1,2,figsize=(20,5))

data_norm.hist(bins=100, ax=ax[0])
ax[0].set(title="Histogram of Normalizer (l1)")

(data_norm[0]
    .sort_values()          # Sorting all values in ascending order
    .reset_index(drop=True) # Removing index
    .cumsum()               # Taking the cumulative sum
    .plot(ax=ax[1]))         # Plotting the line
ax[1].set(title="Cumulative sum of normalized data")

As you can see, the values changed but the sum is still `1`.
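
The difference between the two norms is easy to verify by hand. The following sketch uses plain Python and a made-up row `[3.0, 4.0]`; it only mirrors the arithmetic and is not how scikit-learn implements it.

```python
import math

row = [3.0, 4.0]

# l2 (default): divide by the Euclidean length, so the squared values sum to 1
l2_length = math.sqrt(sum(v ** 2 for v in row))
row_l2 = [v / l2_length for v in row]

# l1: divide by the sum of the absolute values, so the absolute values sum to 1
l1_length = sum(abs(v) for v in row)
row_l1 = [v / l1_length for v in row]

print("l2:", row_l2, "sum of squares:", sum(v ** 2 for v in row_l2))
print("l1:", row_l1, "sum:", sum(row_l1))
```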

##### MinMaxScaler

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

The `MinMaxScaler` transforms the data so it is in a given range.
The default range is `(0,1)`.
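
The transformation behind it is a simple linear mapping. Here is a minimal sketch in plain Python; the function `min_max` is a made-up helper for illustration, not part of scikit-learn.

```python
def min_max(values, feature_range=(0.0, 1.0)):
    """Map the smallest value to the lower bound and the largest to the upper bound."""
    lo, hi = feature_range
    v_min, v_max = min(values), max(values)
    return [(v - v_min) / (v_max - v_min) * (hi - lo) + lo for v in values]

values = [3.0, 5.0, 7.0]
print(min_max(values))               # [0.0, 0.5, 1.0]
print(min_max(values, (2.0, 4.0)))   # [2.0, 3.0, 4.0]
```

Since minimum and maximum define the mapping, a single extreme value can squeeze all other values into a small part of the range.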

In [None]:
minmax = MinMaxScaler()
data_minmax = minmax.fit_transform(data)
data_minmax = pd.DataFrame(data_minmax)
print(f"Min: {data_minmax.min()[0]}")
print(f"Max: {data_minmax.max()[0]}")
fig, ax = plt.subplots(figsize=(10,5))
data_minmax.hist(bins=100, ax=ax)
ax.set(title="Histogram of MinMaxScaler")

As you can see, all the values are now scaled to a range from `0` to `1`.
We can change this by providing a range to `feature_range`.

In [None]:
minmax = MinMaxScaler(feature_range=(2,4))
data_minmax = minmax.fit_transform(data)
data_minmax = pd.DataFrame(data_minmax)
print(f"Min: {data_minmax.min()[0]}")
print(f"Max: {data_minmax.max()[0]}")
fig, ax = plt.subplots(figsize=(10,5))
data_minmax.hist(bins=100, ax=ax)
ax.set(title="Histogram of MinMaxScaler (2,4)")

##### RobustScaler

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html

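The `RobustScaler` removes the median and scales the data by a given quantile range (default is the range between the 1st quartile (25%) and the 3rd quartile (75%)).
Since the median and the quartiles hardly react to extreme values, this approach reduces the impact of outliers on the scaling.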
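
Why this helps becomes visible on a tiny example with one outlier. The sketch below uses Python's `statistics` module and made-up values. It assumes quartiles with linear interpolation between data points and compares the result with standard scaling (mean and standard deviation).

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]   # 100.0 is an outlier

# Robust scaling: subtract the median, divide by the interquartile range
median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
robust = [(v - median) / (q3 - q1) for v in values]

# Standard scaling for comparison: subtract the mean, divide by the standard deviation
mean = statistics.fmean(values)
std = statistics.pstdev(values)
standard = [(v - mean) / std for v in values]

print("Robust:  ", [round(v, 2) for v in robust])
print("Standard:", [round(v, 2) for v in standard])
```

With robust scaling, the five regular values keep a clear distance from each other.
With standard scaling, the outlier inflates mean and standard deviation, and the regular values end up squeezed together.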

In [None]:
robust = RobustScaler()
data_robust = robust.fit_transform(data)
data_robust = pd.DataFrame(data_robust)
fig, ax = plt.subplots(1,2,figsize=(20, 5))

(data - 5).hist(bins=100, ax=ax[0])
ax[0].set(title="Original Data (shifted to 0)")

data_robust.hist(bins=100, ax=ax[1])
ax[1].set(title="Data with RobustScaler")

As you can see, the scaled data has a smaller range of values, since all values were divided by the interquartile range (about `1.35` for this data).

##### StandardScaler

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

The `StandardScaler` removes the mean and scales the data to unit variance (it divides the values by the standard deviation).

In [None]:
std = StandardScaler()
data_std = std.fit_transform(data)
data_std = pd.DataFrame(data_std)
fig, ax = plt.subplots(1,2,figsize=(20,5))

data.hist(bins=100, ax=ax[0])
ax[0].set(title="Original Data")

data_std.hist(bins=100, ax=ax[1])
ax[1].set(title="Data with StandardScaler")

In [None]:
rng = np.random.RandomState(42)
data_exp = pd.DataFrame(rng.uniform(5,10,size=100000), columns=["Values"])
data_exp_std = StandardScaler().fit_transform(data_exp)
data_exp_std = pd.DataFrame(data_exp_std)
fig, ax = plt.subplots(1,2,figsize=(20,5))

data_exp.hist(bins=100, ax=ax[0])
ax[0].set(title="Original Data (Uniform Distribution)")

data_exp_std.hist(bins=100, ax=ax[1])
ax[1].set(title="Data with StandardScaler")

As we can see, the scaled data now has a mean of `0`, but the shape of the distribution hasn't changed.
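
The reason is that standard scaling is a linear transformation: every value is shifted by the same amount and divided by the same number. A minimal sketch with made-up, evenly spread values (plain Python, population standard deviation):

```python
import statistics

values = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]   # evenly spread, similar to the uniform sample

mean = statistics.fmean(values)
std = statistics.pstdev(values)             # population standard deviation (divides by n)
scaled = [(v - mean) / std for v in values]

# The gaps between neighboring values stay equal, so the shape is preserved
gaps = [b - a for a, b in zip(scaled, scaled[1:])]

print("Mean:", statistics.fmean(scaled))
print("Std: ", statistics.pstdev(scaled))
print("Gaps:", [round(g, 4) for g in gaps])
```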

#### Categorical Data

When working with categorical data, we usually need to transform it into numbers.
We have already seen one approach with the `pd.get_dummies()` method.
Here, we will introduce another method that accomplishes the same.

In [None]:
jobs = ["Engineer", "Accountant", "Manager", "Professor", "Student"]

In [None]:
people = pd.DataFrame({"Name" : ["Johnny", "Jenny", "Jake"], "Job":["Engineer", "Manager", "Student"]})
people

##### OneHotEncoder

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
enc = OneHotEncoder(sparse=False) # sparse=False returns a dense array instead of a sparse matrix (scikit-learn 1.2 renamed this parameter to sparse_output)

data_cat = enc.fit_transform(people["Job"].to_numpy().reshape(-1,1))
print(f"Raw data:\r\n{data_cat}")
print()

data_cat = pd.DataFrame(data_cat, columns=["Is_" + str(c) for c in enc.categories_[0]])
data_cat

As we can see, the result is a dataset with multiple columns containing a `1` if a category was present in a row.
Or `0` to indicate the absence of this category.
We can also provide a complete list of all possible values.
And then for every possibility a column is provided.
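
The idea behind the encoding fits into a few lines of plain Python. The function `one_hot` below is a made-up helper for illustration. It assumes that categories found in the data are sorted and that a provided category list keeps its order.

```python
def one_hot(values, categories=None):
    """Return the categories and one row of 0/1 flags per value."""
    if categories is None:
        categories = sorted(set(values))   # categories found in the data
    rows = [[1.0 if value == cat else 0.0 for cat in categories] for value in values]
    return categories, rows

all_jobs = ["Engineer", "Accountant", "Manager", "Professor", "Student"]
observed_jobs = ["Engineer", "Manager", "Student"]

print(one_hot(observed_jobs))                        # 3 columns
print(one_hot(observed_jobs, categories=all_jobs))   # 5 columns
```

Note that this sketch silently returns a row of zeros for a value that is not in the category list.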

In [None]:
enc = OneHotEncoder(categories=[jobs], sparse=False)
data_cat = enc.fit_transform(people["Job"].to_numpy().reshape(-1,1))
data_cat = pd.DataFrame(data_cat, columns=["Is_" + str(c) for c in enc.categories_[0]])
data_cat

*Please note:* In contrast to the scaler examples above, we had to reshape the values first.
Throwing a whole dataset (as shown in the scaler examples) at a scaler or encoder works well.
But when we just want to use it for one column, we have to reshape the values into a two-dimensional array first.

The column `Job` looks like this:

In [None]:
people["Job"]

When we just take the numpy array, we get a one-dimensional array of all values.
But scikit-learn expects a two-dimensional input with one row per sample.

In [None]:
people["Job"].to_numpy()

Calling `reshape(-1,1)` turns the array into a single column with one value per row.

In [None]:
people["Job"].to_numpy().reshape(-1,1)

And with this kind of input, the scalers can work.

As an alternative, we could have also created a `DataFrame` with the one `Series`.

##### Distances with Categories

We've learned that when working with categorical values, we have to use specific ways to calculate the distances.
Using methods from the [`DistanceMetric` class](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html) or our own implementations, we can use some clustering algorithms on categorical data as well.
But we won't go into details in this exercise.
It's primarily an FYI.

### DENCLUE

Unfortunately, there is no implementation of DENCLUE in any package provided by Anaconda or in other Python packages.
Therefore, we had to find another way to get a DENCLUE algorithm up and running.
Luckily for us, there is an [open-source implementation](https://github.com/mgarrett57/DENCLUE) available which we can use.

**Disclaimer:** While testing the implementation, I ran into several issues with the code.
The algorithm was implemented in April 2017, and since then, a module and methods the algorithm uses were changed.
Thus, I had to fix the implementation so it runs with the current version of the `networkx` module.
It works now, but I cannot guarantee that no other problems will occur.

Since `networkx` is also not a standard module, we have to install it first with `conda install networkx` or `pip install networkx`.
At the time of this exercise, `conda` offered version `2.7.1` and `pip` offered version `2.8`.

If you want to see how it's implemented, feel free to check out the [code in denclue.py](./denclue.py) yourself.

In [None]:
from denclue import DENCLUE

And now we can simply use the clustering algorithm by calling it.

In [None]:
DENCLUE()

The algorithm offers some hyperparameters as well (all are optional):
- **h**: Smoothing parameter of the kernel used during hill-climbing (you can define the size of the neighborhood)
- **eps**: Convergence threshold for density (you can stop hill-climbing at a certain level)
- **min_density**: Threshold to consider a cluster and not discard it as noise
- **metric**: Distance metric used
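
The roles of `h`, `eps` and `min_density` are easier to grasp with a toy version of the idea behind DENCLUE. The following is a minimal sketch in plain Python for one-dimensional data, assuming a Gaussian kernel. It is not the code from `denclue.py`, and the names `gaussian_density` and `hill_climb` as well as the sample points are made up for this illustration.

```python
import math

def gaussian_density(x, points, h):
    """Kernel density estimate at x: the average of Gaussian bumps of width h."""
    bumps = sum(math.exp(-((x - p) / h) ** 2 / 2) for p in points)
    return bumps / (len(points) * h * math.sqrt(2 * math.pi))

def hill_climb(x, points, h, eps=1e-6, max_iter=200):
    """Move x uphill until the density gain of a step drops below eps."""
    density = gaussian_density(x, points, h)
    for _ in range(max_iter):
        # Weighted mean of all points; nearby points get the largest weights
        weights = [math.exp(-((x - p) / h) ** 2 / 2) for p in points]
        x = sum(w * p for w, p in zip(weights, points)) / sum(weights)
        new_density = gaussian_density(x, points, h)
        converged = new_density - density < eps
        density = new_density
        if converged:
            break
    return x, density

points = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1, 9.0]
min_density = 0.3

for p in points:
    attractor, density = hill_climb(p, points, h=0.3)
    # Points that climb to the same attractor belong to the same cluster
    label = round(attractor, 1) if density >= min_density else "noise"
    print(f"{p:4.1f} -> {label}")
```

A larger `h` smooths the density and merges neighboring hills, while a higher `min_density` turns more of the sparse hills into noise.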

Compared to the algorithms from sklearn, the interface of this algorithm is limited.
There is only a `fit()` method that we can use; the assigned clusters are stored in the `labels_` attribute, and we get information on all clusters from `clust_info_`.

Let's start with some data.

In [None]:
data = pd.read_csv("./Demo_3Cluster_Noise.csv", sep=";")
data.head(5)

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
data.plot.scatter("x", "y", c="b", ax=ax)

You can see that we have 3 clusters in here, but there are points that aren't that close to an obvious cluster.

So let's see what the DENCLUE algorithm can do with such data.
*Note:* The execution might take a while.

In [None]:
model = DENCLUE()
model.fit(data.to_numpy()) # Unfortunately, the algorithm cannot handle DataFrames - so we need to provide an array.

By calling `labels_`, we get the cluster numbers assigned to each point.

In [None]:
model.labels_

And with `clust_info_`, we get some more insight into how the clusters are composed.

In [None]:
model.clust_info_

So we see that the algorithm found a total of 7 clusters in the given data.
But we also see that 4 of the 7 clusters contain only 1 or 2 points.

Before we head into plotting, let's introduce a helper method.
This method creates a new dataset containing the centroids of each cluster.
And we can filter the clusters by specifying a `min_density`.

In [None]:
def get_centroids(model, min_density=0.0):
    centroids = pd.DataFrame(columns=["x", "y", "density"])
    for i in range(len(model.clust_info_)):
        clust = model.clust_info_[i]
        if(clust["density"] < min_density):
            continue
        centroid = pd.DataFrame([clust["centroid"]], columns=["x", "y"])
        centroid["density"] = clust["density"]
        centroids = pd.concat([centroids, centroid], ignore_index=True)
    return centroids

In [None]:
centroids = get_centroids(model)
fig, ax = plt.subplots(figsize=(10,10))
data.plot.scatter("x", "y", ax=ax, c=model.labels_, cmap="rainbow", colorbar=False)
centroids.plot.scatter("x", "y", ax=ax, c="k", s=200, alpha=.5)

By calling `set_minimum_density()`, we can change the number of clusters found in the data.
Clusters that do not fulfill the density requirement count as outliers.

In [None]:
model.set_minimum_density(0.01)
centroids = get_centroids(model, min_density=0.01)
fig, ax = plt.subplots(figsize=(10,10))
data.plot.scatter("x", "y", ax=ax, c=model.labels_, cmap="rainbow", colorbar=False)
centroids.plot.scatter("x", "y", ax=ax, c="k", s=200, alpha=.5)

If we are honest, although some outliers were detected, the clusters still have some points that are quite far away from the center and could also be counted as outliers.
To reduce the cluster size, we can limit the boundaries of a cluster.

In [None]:
model_lim = DENCLUE(h=.5, min_density=0.01)
model_lim.fit(data.to_numpy())

In [None]:
len(model_lim.clust_info_)

In [None]:
centroids_lim = get_centroids(model_lim, min_density=0.01)
fig, ax = plt.subplots(figsize=(10,10))
data.plot.scatter("x", "y", ax=ax, c=model_lim.labels_, cmap="rainbow", colorbar=False)
centroids_lim.plot.scatter("x", "y", ax=ax, c="k", s=200, alpha=.5) # Plot the centroids of the limited model

The clusters detected are now smaller.
And the one on the bottom left even got split into two clusters.

## Exercises

### Ex01 - Preprocessing

In this exercise, you are going to use the scalers and normalizer introduced above.
First, load the data from **Ex12_01_Data.csv**.

What you have here are some specs on cars.
And you will now scale this data.
But before you start, create a new dataset containing just the name and year since you won't scale these columns.

First, use a `MinMaxScaler` with a range of `(-1,1)` to scale the `mpg`.
Assign the results to the new dataset you've created above.

Next, the `cylinders`.
Use a `MinMaxScaler` again, but this time with a range of `(-2,2)`.

Now, use a `StandardScaler` for the `horsepower`.

Do the same for the `acceleration`.

In the last step, use a `RobustScaler` for `displacement` and `weight`.

And you are finished.
You've successfully scaled some features into more comparable ranges.

#### Solution

In [None]:
# %load ./Ex12_01_Sol.py

### Ex02 - Simple DENCLUE

In this exercise, you are going to use the DENCLUE algorithm in its simplest form.
To begin, load **Ex12_02_Data.csv**.

Plot the clusters.
Use the value in the `label` column for coloring.

Run the `DENCLUE` algorithm for the dataset (use only columns `x` & `y`).

How many clusters were found?

Get the centroids of these clusters.
*Hint:* You may use the method defined in the introduction.
But feel free to code it yourself.

Plot the data.
Use the clusters assigned by the DENCLUE algorithm for coloring.
Plot the centroids as well.

We didn't limit the cluster density.
So, we will do it now to get only our 4 expected clusters.
Use the `set_minimum_density` method and use a reasonable value for the density so only the 4 clusters remain.

Plot the data again, with the centroids.

Congratulations!
You have used DENCLUE successfully.

#### Solution

In [None]:
# %load ./Ex12_02_Sol.py

### Ex03 - More DENCLUE

Let's use DENCLUE again.
Load **Ex12_03_Data.csv**.

Plot the data so you have an idea of what you are dealing with.

Use the DENCLUE algorithm again.
But this time, specify from the beginning a `min_density` of `0.01`.
And set `h=.75`.

Get the centroids for the clusters.

How many clusters were found? - Only count those with a density equal to or greater than the one specified above.

Plot the data with the found clusters.
Also plot the centroids.

So you see, clusters were found, but not the number we expected.
You know from the original data that we expect 6 clusters.
Try to find good values for the parameters to get close to these 6 clusters.
This exercise has no right or wrong answer.
It shows how hard it can be to find good parameters.

A good starting point is the following model:
```python
DENCLUE(h=.11, eps=.005, min_density=0.1)
```
But there is room for improvement.

#### Solution

In [None]:
# %load ./Ex12_03_Sol.py

### Ex04 - Airbnb Clustering

In this exercise, you are going to find clusters of Airbnb listings in Zurich.
In the dataset **Ex12_04_Data.csv**, you'll find the raw data from Zurich published by Airbnb.

Plot these listings in a scatter plot using `longitude` and `latitude` and use the `availability_365` for coloring.

Select all listings that are *Entire home/apt* (`room_type`) and are available at least *300* days a year (`availability_365`).
How many listings would that be?

**Note:** During the rest of the exercise, we will just use this reduced dataset.
Working with the full dataset would result in a quite long execution of the *DENCLUE* algorithm.
And you want to finish this exercise, eventually.

Create and run the *DENCLUE* algorithm with `h=0.003`.
This code will take some time to run.

Set the minimum density to *300* (use `set_minimum_density`).

Get the centroids of the model.

Plot the data again, but this time show
- Outliers (grayed out, so they are visible as outliers)
- Clusters
- Centroids

#### Solution

In [None]:
# %load ./Ex12_04_Sol.py