:::{.callout-info appearance="simple"}
For questions and/or suggestions regarding this notebook, please contact [Pieter Overdevest](https://www.linkedin.com/in/pieteroverdevest/) (pieter@innovatewithdata.nl).
:::

## Aim


In this exercise we explore K-means clustering.



## Questions

1. An interesting parameter to explore in this notebook is `n_cluster_std`, which is initially set to two. It sets the standard deviation of each cluster and thereby the separation or overlap between clusters. Observe what its value means for the visualisations below. What number of clusters would you conclude from the last plot? Then set it to one and do the same. Lastly, set it to three and do the same.

2. What does the line shape in the last plot tell you about the separation of the clusters?

## Reference

1. "An Introduction to Statistical Learning" by G. James, D. Witten, T. Hastie, and R. Tibshirani, section 12.4.1, "[K-Means Clustering](https://hastie.su.domains/ISLR2/ISLRv2_website.pdf)".

## Import libraries

We start by importing a few libraries.

In [104]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import altair as alt
import string


from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

## Initialization

In [105]:
# We use `c_alphabet` to replace numbers by letters.
c_alphabet = string.ascii_lowercase

In [106]:
# Constants used in Altair visualisations.
N_TITLEFONTSIZE_TITLE  = 24
N_LABELFONTSIZE_AXIS   = 18
N_TITLEFONTSIZE_AXIS   = 24
N_LABELFONTSIZE_LEGEND = 18
N_TITLEFONTSIZE_LEGEND = 18


## Generate random data


In [107]:
# Generate random data.
n_samples     = 300
n_centers     = 4
n_cluster_std = 2
random_state  = 170

X, y = make_blobs(
    n_samples    = n_samples,     # Default is 100
    centers      = n_centers,     # Default is 3
    cluster_std  = n_cluster_std, # Default is 1
    random_state = random_state
)

# Format the data as a Pandas data frame and add the label data.
df_data = pd.DataFrame(X, columns=['x1','x2']).assign(label=y)

# Replace cluster numbers by letters; otherwise we get a continuous color scale.
# You can try it out by uncommenting the next line, commenting out the line after it, and rerunning this cell and the next one.
#df_data['label'] = y
df_data['label'] = [c_alphabet[x] for x in y]

df_data.head(10)

|   | x1 | x2 | label |
|---|---|---|---|
| 0 | 7.082163 | -4.391706 | d |
| 1 | -5.718273 | 3.525807 | b |
| 2 | -1.808874 | 0.367458 | b |
| 3 | -5.484652 | 2.22596 | b |
| 4 | -11.87059 | -6.219351 | a |
| 5 | -7.406979 | -1.395809 | b |
| 6 | -3.82796 | 3.038423 | b |
| 7 | -10.928136 | -2.273432 | a |
| 8 | -3.854599 | -3.739893 | b |
| 9 | -7.028461 | -2.27709 | a |
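To get a feel for what `cluster_std` controls, the sketch below measures the spread of the generated points around their own cluster mean for three settings. The helper `mean_within_cluster_std` and the variable `spread` are hypothetical names introduced for this illustration; the other settings mirror the cell above.

```python
import numpy as np
from sklearn.datasets import make_blobs


def mean_within_cluster_std(cluster_std, n_samples=300, centers=4, random_state=170):
    """Average per-coordinate standard deviation of points around their cluster mean."""
    X, y = make_blobs(
        n_samples=n_samples,
        centers=centers,
        cluster_std=cluster_std,
        random_state=random_state,
    )
    # Standard deviation per coordinate within each true cluster, averaged.
    stds = [X[y == c].std(axis=0).mean() for c in np.unique(y)]
    return float(np.mean(stds))


# Hypothetical illustration: compare the three settings from question 1.
spread = {s: mean_within_cluster_std(s) for s in (1, 2, 3)}
for s, value in spread.items():
    print(f"cluster_std={s}: measured spread ~ {value:.2f}")
```

The measured spread should be close to the requested `cluster_std`, while the cluster centers stay where they are, so a larger value means more overlap between neighbouring clusters.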


## Visualize random data

In [108]:
alt.Chart(df_data).mark_circle(size=100, opacity=0.75).encode(
    x     = 'x1',
    y     = 'x2',
    color = 'label'
).properties(   
    title  = 'Generated data', 
    width  = 500,
    height = 500
).configure_title(
    fontSize      = N_TITLEFONTSIZE_TITLE
).configure_axis(
    labelFontSize = N_LABELFONTSIZE_AXIS,
    titleFontSize = N_TITLEFONTSIZE_AXIS
).configure_legend(
    labelFontSize = N_LABELFONTSIZE_LEGEND,
    titleFontSize = N_TITLEFONTSIZE_LEGEND,
    orient        = 'bottom'
)

## Assess the data with K-Means

Let's consider three clusters.

In [109]:
# Number of clusters.
k = 3  

# Create a KMeans object and fit it to our data with the fit() method.
# Set 'n_init' to 10 so that the algorithm runs 10 times with different initial centroids, lowering the risk of ending up in a local minimum.
kmeans = KMeans(n_clusters=k, n_init=10)
kmeans.fit(X)

# First ten estimated cluster labels.
kmeans.labels_[0:10]

array([2, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int32)
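To illustrate why `n_init` matters, the sketch below runs K-means ten times with a single initialisation each and compares the resulting inertia values (the within-cluster sum of squared distances). With `n_init=10`, scikit-learn keeps the run with the lowest inertia. The seeds 0 to 9 and the variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same data settings as in the notebook.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2, random_state=170)

# Ten independent runs, each with only one initialisation (n_init=1).
single_run_inertias = [
    KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(10)
]

best_inertia = min(single_run_inertias)
worst_inertia = max(single_run_inertias)

print(f"best  single-run inertia: {best_inertia:.2f}")
print(f"worst single-run inertia: {worst_inertia:.2f}")
```

If the best and worst values differ, some initialisations ended in a poorer local minimum. On well-separated data all runs may give the same value.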

Add the estimated cluster labels to `df_data`. As before, we replace the numerical labels by letters. We also create a data frame of the cluster centers.

In [110]:
df_data['label_estimated'] = [c_alphabet[x] for x in kmeans.labels_]

df_centers = pd.DataFrame({
    'x1':kmeans.cluster_centers_[:, 0],
    'x2':kmeans.cluster_centers_[:, 1]
})

df_centers

|   | x1 | x2 |
|---|---|---|
| 0 | -7.823946 | -3.198092 |
| 1 | 1.26731 | 0.594209 |
| 2 | 4.866481 | -9.380856 |


In [111]:
# Layer 1 - the circles.
sc1 = alt.Chart(df_data).mark_circle(size=100, opacity=0.75).encode(
    x='x1',
    y='x2',
    color = 'label_estimated'
)

# Layer 2 - the crosses.
sc2 = alt.Chart(df_centers).mark_point(size=200, opacity=1, color='purple', fill='yellow', shape='cross').encode(
    x='x1',
    y='x2'
)

# Combined plot:
(sc1 + sc2).properties(   
    title         = 'Estimated clusters and cluster centers', 
    width         = 500,
    height        = 500
).configure_title(
    fontSize      = N_TITLEFONTSIZE_TITLE
).configure_axis(
    labelFontSize = N_LABELFONTSIZE_AXIS,
    titleFontSize = N_TITLEFONTSIZE_AXIS
).configure_legend(
    labelFontSize = N_LABELFONTSIZE_LEGEND,
    titleFontSize = N_TITLEFONTSIZE_LEGEND,
    orient        = 'bottom'
)


## Search for optimal K

We obtain the `inertia_` value for each k. It equals the sum of squared distances between each data point and its closest cluster center (i.e., centroid). 
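In symbols, with data points $x_i$ and centroids $\mu_j$, the inertia is $\sum_i \min_j \lVert x_i - \mu_j \rVert^2$. As a minimal check of this definition, the sketch below recomputes it by hand for k = 3 and compares it with `inertia_`. The variable names `assigned_centers` and `manual_inertia`, and the fixed `random_state=0`, are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same data settings as in the notebook.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2, random_state=170)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up, for every data point, the centroid of the cluster it was assigned to.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]

# Sum of squared Euclidean distances between points and their centroids.
manual_inertia = float(np.sum((X - assigned_centers) ** 2))

print(f"manual : {manual_inertia:.4f}")
print(f"sklearn: {kmeans.inertia_:.4f}")
```

Both numbers should agree up to floating-point rounding.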

In [112]:
# Range of k-values to evaluate.
l_k       = np.arange(1,11)

# Perform K-Means for each k-value and obtain the inertia.
l_inertia = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in l_k]

df_inertia = pd.DataFrame({'k': l_k, 'inertia': l_inertia})

df_inertia

|   | k | inertia |
|---|---|---|
| 0 | 1 | 17057.64029 |
| 1 | 2 | 8073.006303 |
| 2 | 3 | 3860.253378 |
| 3 | 4 | 2286.98488 |
| 4 | 5 | 2021.272807 |
| 5 | 6 | 1759.648111 |
| 6 | 7 | 1516.25379 |
| 7 | 8 | 1354.571702 |
| 8 | 9 | 1186.814127 |
| 9 | 10 | 1096.573163 |


### Plot inertia



In [113]:
lp = alt.Chart(df_inertia).mark_line().encode(
    x='k',
    y='inertia'
)

sc = alt.Chart(df_inertia).mark_circle(size=100).encode(
    x='k',
    y='inertia'
)

(lp+sc).properties( 
    title         = 'Evaluate K-Means for range of k-values',
    width         = 700,
    height        = 400,
).configure_title(
    fontSize      = N_TITLEFONTSIZE_TITLE
).configure_axis(
    labelFontSize = N_LABELFONTSIZE_AXIS,
    titleFontSize = N_TITLEFONTSIZE_AXIS
)
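One possible way to read the line shape numerically is to look at how much the inertia drops with each extra cluster. The sketch below is only an illustration under the same data settings; the `relative_drop` column, the name `df_drop`, and the fixed `random_state=0` are assumptions, not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same data settings as in the notebook.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2, random_state=170)

ks = np.arange(1, 11)
inertias = [
    KMeans(n_clusters=int(k), n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

df_drop = pd.DataFrame({"k": ks, "inertia": inertias})

# Fraction by which inertia decreases when going from k-1 to k clusters.
# The first row has no predecessor and is therefore NaN.
df_drop["relative_drop"] = -df_drop["inertia"].pct_change()

print(df_drop.round(3))
```

A large relative drop followed by consistently small ones corresponds to a visible bend in the line plot; when clusters overlap strongly, the drops tend to shrink more gradually and the bend is harder to pin down.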