# <font color='#31394d'>$k$-Means Clustering</font>

In this notebook, we apply a clustering algorithm to identify homogeneous groups of customers in the `mall_customers.csv` dataset.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### <font color='#31394d'>Import and Explore the Data</font>

In [2]:
df = pd.read_csv("data/mall_customers.csv")
df.head()

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


🚀 <font color='#31394d'>Exercise: </font> Explore your data. How large is it? Are there any missing values? What are the data types?

In [None]:
# your code goes here

🚀 <font color='#31394d'>Exercise: </font> The income and spending score columns have pretty awkward names. Rename them as "AnnualIncome" and "SpendingScore", respectively.

In [None]:
# your code goes here

🚀 <font color='#31394d'>Exercise: </font> Visualize the `Age` and `SpendingScore` distinguished by `Gender`.

In [None]:
# your code goes here

### <font color='#31394d'> Apply k-Means Clustering </font>

Let's attempt to identify clusters based on `Age` and `SpendingScore`. Using two variables allows us to visualize the results, but feel free to re-run this with all the continuous variables. K-means clustering is not suitable for categorical variables, because it relies on means and Euclidean distances, which are not meaningful for categories.

In [None]:
from sklearn.cluster import KMeans

🚀 <font color='#eb3483'>Exercise: </font> Have a look at the help for `KMeans`. What do the "init" and "n_init" arguments do?

In [None]:
# your code goes here

Let's perform K-means clustering with K=4 (no particular reason, just as an example!)...

In [None]:
km = KMeans(n_clusters=4, n_init=10, random_state=42)  # K = 4; fixed seed so the labels are reproducible
km.fit(df[['Age','SpendingScore']])

🚀 <font color='#eb3483'>Exercise: </font> What attributes does the `km` object have?

In [None]:
# your code goes here

Let's have a look at cluster centroids:

In [None]:
pd.DataFrame(km.cluster_centers_, columns=['Age','SpendingScore'], index=['Cluster1', 'Cluster2', 'Cluster3', 'Cluster4'])

Let's add a column with the predicted cluster label:

In [None]:
df['Cluster'] = km.labels_ + 1
df.head()
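
To make the link between the centroids and the cluster labels concrete, here is a minimal sketch on synthetic data (the `toy` data frame, its two blobs, and the seeds are assumptions made for illustration, not the mall data). Once k-means has converged, each centroid should coincide with the mean of the points assigned to that cluster:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the Age / SpendingScore columns (assumption: not the real data)
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    np.vstack([
        rng.normal(loc=[25, 80], scale=2, size=(40, 2)),
        rng.normal(loc=[60, 20], scale=2, size=(40, 2)),
    ]),
    columns=["Age", "SpendingScore"],
)

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_km.fit(toy[["Age", "SpendingScore"]])
toy["Cluster"] = toy_km.labels_ + 1  # 1-based labels, as in the notebook

# Mean of each feature within each cluster
cluster_means = toy.groupby("Cluster")[["Age", "SpendingScore"]].mean()

# Fitted centroids, indexed with the same 1-based labels
centroids = pd.DataFrame(
    toy_km.cluster_centers_,
    columns=["Age", "SpendingScore"],
    index=pd.Index([1, 2], name="Cluster"),
)

print(cluster_means.round(2))
print(centroids.round(2))
```

The two printed tables should agree, which is why the centroids can be read as a "typical customer" for each cluster.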

Let's visualize our clusters. We can only do this because this is a toy example with two features, so the points can be plotted in 2D. With more than three features we could not plot the clusters directly, but we could still examine the cluster centroids to determine what the clusters represent.

In [None]:
sns.relplot(x="Age", y="SpendingScore", data=df, hue="Cluster")

### <font color='#eb3483'> Finding the Best $K$ </font>

The `sklearn` `KMeans` class calls the total within-cluster variation (the sum of squared distances from each point to its cluster centroid) "inertia". It is stored in the `inertia_` attribute of the fitted object. Next, we will loop over different values of $K$, store the inertia for each, and choose the best value of $K$ using the "elbow" method.

In [None]:
inertia = []
for k in range(1, 21):  # try K = 1, ..., 20
    estimator = KMeans(n_clusters=k)
    estimator.fit(df[['Age','SpendingScore']])
    inertia.append(estimator.inertia_)  # within-cluster sum of squares for this K

In [None]:
inertia

In [None]:
sns.pointplot(x=np.arange(1,21), y=inertia)
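
To see what the inertia values measure, the sketch below recomputes the inertia by hand as the sum of squared Euclidean distances from each point to its assigned centroid, and compares it with `inertia_`. The data are synthetic (three made-up blobs standing in for `Age` and `SpendingScore`), so the numbers are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the Age / SpendingScore columns (assumption: not the real data)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[25, 80], scale=3, size=(50, 2)),
    rng.normal(loc=[50, 40], scale=3, size=(50, 2)),
    rng.normal(loc=[65, 15], scale=3, size=(50, 2)),
])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up each point's own centroid via its cluster label
assigned_centers = model.cluster_centers_[model.labels_]

# Inertia: sum of squared Euclidean distances from each point to its centroid
manual_inertia = float(((X - assigned_centers) ** 2).sum())

print(manual_inertia, model.inertia_)
```

The two numbers should match up to floating-point error. Because adding clusters can only move points closer to a centroid, inertia tends to fall as $K$ grows, which is why we look for an "elbow" rather than the minimum.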

🚀 <font color='#eb3483'>Exercise: </font> How many clusters would you select? (There is no single "right" answer.)

# <font color='#eb3483'> Hierarchical Clustering </font>

Let's try hierarchical clustering instead...

In [None]:
from sklearn.cluster import AgglomerativeClustering
#?AgglomerativeClustering

In [None]:
# distance_threshold=0 with n_clusters=None builds the full tree, which the dendrogram needs
hier = AgglomerativeClustering(n_clusters=None, distance_threshold=0, linkage='complete')
hier.fit(df[['Age', 'SpendingScore']])

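To plot the dendrogram, we need a helper function (taken from the [scikit-learn dendrogram example](https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html)) that converts the fitted model into the linkage matrix that `scipy` expects: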

In [None]:
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

In [None]:
plt.figure(figsize=(10,10))
plt.title('Hierarchical Clustering Dendrogram')

plot_dendrogram(hier)
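
The helper above assembles a linkage matrix from the fitted `sklearn` model. As a point of comparison, the sketch below lets `scipy` build a linkage matrix directly, which makes the four-column layout easier to inspect. The `points` array is synthetic and made up for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# 8 synthetic observations with 2 features (assumption: not the real data)
rng = np.random.default_rng(2)
points = rng.uniform(0, 100, size=(8, 2))

# Complete linkage on Euclidean distances, matching linkage='complete' above
Z = linkage(points, method="complete", metric="euclidean")

# One row per merge: [child_a, child_b, merge_distance, samples_in_new_cluster]
print(Z.shape)
print(Z[-1])
```

With `n` observations there are `n - 1` merges, so `Z` has `n - 1` rows, and the last row describes the final merge that contains every observation.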

Let's apply a distance threshold of 60, so that clusters whose linkage distance is 60 or more are not merged...

In [None]:
hier_thresh = AgglomerativeClustering(n_clusters=None, distance_threshold=60, linkage='complete')
hier_thresh.fit(df[['Age', 'SpendingScore']])

In [None]:
hier_thresh.labels_
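
The number of clusters is not set directly here: it follows from where the threshold cuts the tree, and is available afterwards as `n_clusters_`. A minimal sketch on synthetic data (two made-up, well-separated blobs; the thresholds 5 and 50 are chosen for this toy data, not for the mall data) illustrates the effect:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two tight, well-separated synthetic blobs (assumption: not the real data)
rng = np.random.default_rng(3)
blob_a = rng.uniform(0, 1, size=(20, 2))
blob_b = rng.uniform(10, 11, size=(20, 2))
X = np.vstack([blob_a, blob_b])

n_found = {}
for threshold in (5, 50):
    model = AgglomerativeClustering(
        n_clusters=None, distance_threshold=threshold, linkage="complete"
    ).fit(X)
    # n_clusters_ holds the number of clusters left after cutting at the threshold
    n_found[threshold] = model.n_clusters_

print(n_found)  # {5: 2, 50: 1}
```

A threshold below the distance between the blobs keeps them apart, while a threshold above every merge distance collapses everything into a single cluster.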

Let's add the cluster labels to our data frame (this overwrites the k-means labels in the `Cluster` column):

In [None]:
df['Cluster'] = hier_thresh.labels_ + 1
df.head()

Let's plot the clusters:

In [None]:
sns.relplot(x="Age", y="SpendingScore", data=df, hue="Cluster")

Let's summarise the features by cluster:

In [None]:
df.groupby('Cluster')[['Age','SpendingScore']].describe()