# <font color='#eb3483'>$K$-Means Clustering</font>

In this notebook, we are going to apply clustering algorithms to identify homogeneous groups of customers in the `mall_customer.csv` dataset.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

### <font color='#eb3483'>Import and Explore the Data</font>

In [None]:
# Import your data (assumes mall_customer.csv is in the working directory)
df = pd.read_csv("mall_customer.csv")

<font color='#eb3483'> Explore your data. How large is it? Are there any missing values? What are the data types?</font>
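
As a hint, the three questions above map onto three pandas attributes and methods. The sketch below uses a small made-up DataFrame called `toy` (a hypothetical stand-in, not the mall data), so the values are illustrative only; the same calls should work on your own `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; the values are made up
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 21, np.nan, 23],
})

n_rows, n_cols = toy.shape   # how large is it?
missing = toy.isna().sum()   # number of missing values per column
dtypes = toy.dtypes          # data type of each column

print(f"{n_rows} rows, {n_cols} columns")
print(missing)
print(dtypes)
```

`toy.info()` reports much of the same information in a single call.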

<font color='#eb3483'> The income and spending score columns have pretty awkward names. Rename them as "AnnualIncome" and "SpendingScore", respectively.</font>

In [None]:
# rename the columns
df.rename(columns={"Annual Income (k$)": "AnnualIncome", "Spending Score (1-100)": "SpendingScore"}, inplace=True)
df.head()

<font color='#eb3483'>Plot `SpendingScore` against `Age`, with points colored by `Gender`.</font>

In [None]:
sns.relplot(x="Age", y="SpendingScore", data=df, hue="Gender")

### <font color='#eb3483'> Apply $K$-Means Clustering </font>

Let's attempt to identify clusters based on `Age` and `SpendingScore`. Using two variables will allow us to visualize the results, but feel free to re-run this with all the continuous variables. K-means relies on Euclidean distances to cluster means, so it is not suitable for categorical variables.

In [None]:
from sklearn.cluster import KMeans

<font color='#eb3483'>Have a look at the help for `KMeans`. What do the "init" and "n_init" arguments do?</font>

In [None]:
?KMeans
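
To experiment with these two arguments, here is a minimal sketch on synthetic data (two made-up blobs, not the mall data; the names `X`, `km_pp` and `km_rand` are introduced only for this illustration). `init` controls how the starting centroids are chosen, and `n_init` controls how many times the algorithm is restarted from different starting centroids, with the lowest-inertia run being kept.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated blobs (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[20, 80], scale=3, size=(50, 2)),
    rng.normal(loc=[60, 20], scale=3, size=(50, 2)),
])

# init="k-means++" spreads the starting centroids out;
# init="random" picks starting centroids at random from the data.
# n_init=10 runs the algorithm 10 times and keeps the best run.
km_pp = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
km_rand = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(X)

print(km_pp.inertia_, km_rand.inertia_)
```

On data this cleanly separated, both initialization schemes should reach the same solution; differences tend to show up on messier data or with `n_init=1`.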

Let's perform K-means clustering with K=4 (no particular reason, just as an example!)...

In [None]:
km = KMeans(n_clusters=4, random_state=0)  # K = 4; fixed seed so the cluster labels are reproducible
km.fit(df[['Age','SpendingScore']])

In [None]:
# Create a DataFrame from the cluster centers of a KMeans model
# km.cluster_centers_ contains the centroids of each cluster

pd.DataFrame(km.cluster_centers_, columns=['Age','SpendingScore'], index=['Cluster1', 'Cluster2', 'Cluster3', 'Cluster4'])

In [None]:
# Add a new column 'Cluster' to the DataFrame df
# km.labels_ contains the cluster labels assigned by KMeans (starting from 0)

df['Cluster'] = km.labels_ + 1
df.head()

In [None]:
sns.relplot(x="Age", y="SpendingScore", data=df, hue="Cluster")

### <font color='#eb3483'> Finding the Best $K$ </font>

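The `sklearn` `KMeans` class calls the total within-cluster variation (the sum of squared distances from each point to its assigned centroid) "inertia". This is stored as the `inertia_` attribute of the fitted object. Next, we will loop over different values of $K$, store the inertia for each, and choose a value of $K$ using the "elbow" method: look for the point where adding more clusters stops reducing the inertia substantially.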

In [None]:
inertia = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(df[['Age','SpendingScore']])
    inertia.append(kmeans.inertia_)

# Plotting the inertia values
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Inertia vs. Number of clusters')
plt.xticks(np.arange(1, 11, 1))  # Set x-axis ticks to integers
plt.grid(True)
plt.show()
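
To make the meaning of `inertia_` concrete, the following sketch recomputes it by hand on synthetic data (random points, not the mall data; `X`, `km_demo` and `manual_inertia` are names introduced for this illustration). It sums the squared distance from each point to the centroid of its assigned cluster, which should agree with the fitted attribute up to floating-point error.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to each point, then sum squared distances
assigned_centers = km_demo.cluster_centers_[km_demo.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, km_demo.inertia_)
```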

<font color='#eb3483'>Exercise: </font> How many clusters would you select? (There is no single "right" answer.)

# <font color='#eb3483'> Hierarchical Clustering </font>

Let's try hierarchical clustering instead...

In [None]:
from sklearn.cluster import AgglomerativeClustering
#?AgglomerativeClustering

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

# Perform hierarchical clustering with Ward linkage, matching the
# AgglomerativeClustering settings used below (linkage() defaults to 'single')
Z = linkage(df[['Age', 'SpendingScore']], method='ward')

# Plot the dendrogram
plt.figure(figsize=(12, 6))
dendrogram(Z)
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.title('Hierarchical Clustering Dendrogram')
plt.show()
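
A dendrogram can also be cut into flat cluster labels directly with SciPy's `fcluster`. The sketch below assumes Ward linkage and uses three made-up blobs rather than the mall data; `X`, `Z_demo` and `flat_labels` are names introduced for this illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data: three well-separated blobs of 20 points each
rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[10, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[0, 10], scale=0.5, size=(20, 2)),
])

# Build the merge tree, then cut it so that at most 3 clusters remain
Z_demo = linkage(X, method="ward")
flat_labels = fcluster(Z_demo, t=3, criterion="maxclust")  # labels start at 1

print(np.unique(flat_labels, return_counts=True))
```

With `criterion="maxclust"`, `t` is the maximum number of clusters; with `criterion="distance"`, `t` is instead the height at which the tree is cut.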

Let's cut the tree into 3 clusters.

In [None]:
# Perform agglomerative clustering
hier = AgglomerativeClustering(n_clusters=3, linkage='ward')
hier.fit(df[['Age', 'SpendingScore']])

In [None]:
df['Cluster'] = hier.labels_ + 1
df.head()

In [None]:
sns.relplot(x="Age", y="SpendingScore", data=df, hue="Cluster")

## Evaluating clustering

* Silhouette Score: The silhouette score measures how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

* Davies-Bouldin Index: This index averages, over all clusters, the similarity between each cluster and its most similar cluster, where similarity is the ratio of within-cluster scatter to the distance between cluster centroids. It ranges from 0 to infinity, with lower values indicating better clustering.

* Calinski-Harabasz Index (Variance Ratio Criterion): This index is the ratio of between-cluster dispersion to within-cluster dispersion. A higher value indicates better clustering.

* Visual Inspection: Sometimes, simply visualizing the clusters can provide insight into the quality of clustering. Scatter plots, heatmaps, and other visualization techniques can help assess how well the data points are grouped.

* Domain Knowledge: In many cases, domain knowledge is essential for evaluating clustering results. Subject matter experts can assess whether the clusters make sense in the context of the data and the problem domain.

In [None]:
from sklearn.metrics import silhouette_score

# Refit the agglomerative model and take its cluster labels

hier = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = hier.fit_predict(df[['Age', 'SpendingScore']])

silhouette_avg = silhouette_score(df[['Age', 'SpendingScore']], labels)
print(f"Silhouette Score: {silhouette_avg}")
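
The Davies-Bouldin and Calinski-Harabasz indices listed above are also available in `sklearn.metrics`. The sketch below computes both on synthetic data (three made-up blobs, not the mall data; `X` and `demo_labels` are names introduced for this illustration), so the printed values say nothing about the customer clusters.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

# Synthetic data: three blobs (illustrative only)
rng = np.random.default_rng(3)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(30, 2)),
    rng.normal(loc=[8, 0], scale=1.0, size=(30, 2)),
    rng.normal(loc=[0, 8], scale=1.0, size=(30, 2)),
])

demo_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

db = davies_bouldin_score(X, demo_labels)     # lower is better
ch = calinski_harabasz_score(X, demo_labels)  # higher is better

print(f"Davies-Bouldin Index: {db:.3f}")
print(f"Calinski-Harabasz Index: {ch:.1f}")
```

Both functions take the same `(X, labels)` arguments as `silhouette_score`, so they can be applied to the `Age` and `SpendingScore` columns in the same way.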