# K-Means Clustering

Let's import the packages that we will use during the practical:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## The dataset

As a first step, we import the data from `retail_dataset.csv` using the `read_csv()` function from `pandas` (`pd`). We also set the column that will be used as the row labels (index) of the dataframe: *CustomerID*. Once the data is loaded, we can call the `head()` method to preview the first five rows of the dataframe.

In [None]:
# Import the data from the retail_dataset.csv

customers_data = pd.read_csv('data/retail_dataset.csv', index_col='CustomerID')
customers_data.head()

We will start by looking specifically at the numerical features. Below we list non-binary features and separate them into a dataframe called `customers`:

In [None]:
non_binary_cols = [
    'balance', 'max_spent', 'mean_spent', 
    'min_spent', 'n_orders', 'total_items',
    'total_refunded', 'total_spent']

customers = customers_data[non_binary_cols]
customers.head()

## Clustering with K-Means

K-Means clustering is a method for finding clusters and cluster centroids (the centre point of a cluster) in a set of points. The K-Means algorithm is quite simple and alternates between two steps:

1. For each centroid, identify the subset of training points that are closer to it than to any other centroid.
2. Move each centroid to the mean of the points assigned to it.

These two steps are repeated until the centroids no longer move (significantly) or the assignments no longer change. Then a new point $x$ can be assigned to the nearest cluster.
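The two alternating steps can be sketched in plain NumPy. This is a minimal illustration on synthetic data, not the implementation used by `sklearn`; the function name `kmeans_sketch`, its parameters, and the random initialisation from data points are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-Means loop on an (n_samples, n_features) array (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct points drawn from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated blobs (synthetic, not the retail dataset)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])
centroids, labels = kmeans_sketch(X, k=2)

# A new point x is assigned to the cluster of its nearest centroid
x_new = np.array([4.8, 5.1])
nearest = np.linalg.norm(centroids - x_new, axis=1).argmin()
print(f'x_new is assigned to cluster {nearest}')
```

In practice `sklearn.cluster.KMeans` adds a smarter initialisation and several restarts, so the sketch is only meant to make the two steps concrete.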

### Run K-Means with two features

Isolate the features `mean_spent` and `max_spent`, then run the K-Means algorithm on the resulting dataset using $k=2$ and visualise the result. You will need to:

* create an instance of `KMeans` with 2 clusters,
* fit it to the isolated features (via the `.fit` method),
* check how it is doing by showing the predicted assignment (via the `.predict` method).

This is the standard `sklearn` workflow for most of its algorithms.

In [None]:
from sklearn.cluster import KMeans

# Apply K-Means with 2 clusters using a subset of features
# (mean_spent and max_spent)
cust2 = customers[['mean_spent', 'max_spent']]

kmeans = KMeans(n_clusters=2)
kmeans.fit(cust2)


In [None]:
# Store the cluster assignment
cluster_assignment = kmeans.predict(cust2)

Let's introduce a simple function to better visualise what is going on:

In [None]:
# This function generates a pairplot enhanced with the result of K-Means
def pairplot_cluster(df, cols, cluster_assignment):
    """
    Input
        df, dataframe that contains the data to plot
        cols, columns to consider for the plot
        cluster_assignment, cluster assignment returned
        by the clustering algorithm
    """
    # seaborn will color the samples according to the column cluster
    df_tmp = df.copy() # create a copy so we don't modify the original dataframe
    df_tmp['cluster'] = cluster_assignment 
    sns.pairplot(df_tmp, vars=cols, hue='cluster')

Let's use the function now to see how we did previously (ignore any warnings that come up):

In [None]:
# Visualise the clusters using pairplot_cluster()
pairplot_cluster(customers, ['mean_spent', 'max_spent'], cluster_assignment)


#### What can you observe?

* The separation between the two clusters is "clean" (the two clusters can be separated with a line).
* One cluster contains customers with low spending, the other customers with high spending.

### Run K-Means with all the features
Run K-Means using all the features available and visualise the result in the subspace of `mean_spent` and `max_spent`.

In [None]:
# Apply K-Means with 2 clusters using all features
kmeans = KMeans(n_clusters=2)
kmeans.fit(customers)
cluster_assignment = kmeans.predict(customers)


Visualise the cluster assignment using the same subset of variables as before. What has changed?

In [None]:
# Visualise the clusters using pairplot_cluster()
pairplot_cluster(customers, ['mean_spent', 'max_spent'], cluster_assignment)


***Question***: Why can't the clusters be separated with a line as before?
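One possible explanation, to be checked against the actual data: K-Means now measures distances in the full feature space, so the boundary between the clusters lives in that space and its projection onto two features need not be a line. The sketch below illustrates the effect on synthetic data with three hypothetical features `a`, `b` and `c`, where only `c` separates the groups.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 200
# Features a and b overlap completely between the groups; only c separates them
a = rng.normal(0, 1, size=2 * n)
b = rng.normal(0, 1, size=2 * n)
c = np.concatenate([rng.normal(-10, 1, size=n), rng.normal(10, 1, size=n)])
X = np.column_stack([a, b, c])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The clusters follow the sign of c (cluster ids are arbitrary, hence the max)
agreement = np.mean(labels == (c > 0))
agreement = max(agreement, 1 - agreement)
print(f'Agreement with the split on c: {agreement:.2f}')

# In the (a, b) plane both clusters occupy the same region
for j in (0, 1):
    print(f'cluster {j}: mean a = {a[labels == j].mean():.2f}, '
          f'mean b = {b[labels == j].mean():.2f}')
```

A pair plot of `a` against `b` coloured by `labels` would show the two clusters mixed together, even though they are cleanly separated along `c`.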

### Compare expenditure between clusters

Select the features `mean_spent` and `max_spent` and compare the two clusters obtained above using them.

In [None]:
# Compare expenditure between clusters
features = ['mean_spent', 'max_spent']

# create a dataframe corresponding to the case
# cluster_assignment == 0
cluster1_df = customers.loc[cluster_assignment == 0, features]

cluster1_desc = cluster1_df.describe()


In [None]:
# then with cluster_assignment == 1
cluster2_df = customers.loc[cluster_assignment == 1, ['mean_spent', 'max_spent']]

cluster2_desc = cluster2_df.describe()


In [None]:
# Join both
compare_df = cluster1_desc.join(cluster2_desc, lsuffix='_cluster1', rsuffix='_cluster2')
compare_df


### Look at the centroids

Look at the centroids of the clusters by calling `kmeans.cluster_centers_` and check the values of the centroids for the features `mean_spent`, `max_spent`. You will need to create a new dataframe where the data is simply `kmeans.cluster_centers_`.

In [None]:
# Get the centroids and display them
centers_df = pd.DataFrame(data=kmeans.cluster_centers_, columns=customers.columns)
print(centers_df[features])


### Compare mean expenditure with a box plot

Compare the distribution of the feature `mean_spent` in the two clusters using a box plot. You will need:

* `sns.boxplot` (seaborn's boxplot)

In [None]:
# Compare mean expenditure with a box plot

#plt.figure(figsize = (10,6))
sns.boxplot(data=[cluster1_df.mean_spent, cluster2_df.mean_spent])
plt.show()


Does this seem to make sense? How can you interpret the plots?

### Look at the inertia
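Inertia measures the internal coherence of clusters: it is the sum of squared distances between each sample and its closest centroid, so lower values mean tighter clusters. You can read it from the fitted model via ``kmeans.inertia_``: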

In [None]:
# Look at the inertia
print('Inertia: {0:.2f}'.format(kmeans.inertia_))


The value of inertia on its own does not say much as it is not normalized. However, it can be used for selecting a suitable number of clusters as part of the elbow method.
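To make the quantity concrete, the sketch below recomputes the inertia by hand, using the `sklearn` definition (sum of squared distances of samples to their closest cluster centre). It runs on synthetic data from `make_blobs` rather than the retail dataset, and the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for the customer features
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Centroid assigned to each sample, then the sum of squared distances to it
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print(f'manual: {manual_inertia:.2f}, sklearn: {km.inertia_:.2f}')
```

Because this is a raw sum, it grows with the number of samples and with the scale of the features, which is why a single value is hard to interpret.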

In the elbow method, we first calculate the inertia for clusterings with different numbers of clusters. We then choose the number at which the rate of decline changes the most (the "elbow" of the curve), as explained in the lecture.
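One way to turn "largest change in rate of decline" into a number is to look at the second difference of the inertia curve. The sketch below does this on synthetic data built to contain three groups; this heuristic is an assumption of the example, not necessarily the procedure from the lecture, and it can be misled by other cluster layouts, so the plot should still be inspected by eye.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three synthetic groups placed at the corners of a triangle
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

k_values = list(range(1, 9))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in k_values
]

# First difference: change in inertia when going from k to k + 1
drops = np.diff(inertias)
# Second difference: how sharply that change itself changes around each k
curvature = np.diff(drops)
# curvature[i] describes the bend at k_values[i + 1]
elbow_k = k_values[int(np.argmax(curvature)) + 1]
print(f'Suggested number of clusters: {elbow_k}')
```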

### Compute the silhouette score
Compute the silhouette score of the clusters resulting from the application of K-Means.

The silhouette coefficient of a sample represents how similar it is to the samples in its own cluster compared to samples in other clusters; the silhouette score is the mean of this coefficient over all samples. The best value is 1, while the worst value is -1. Values close to 0 suggest overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster (a different cluster is more similar).

`sklearn` provides the function `silhouette_score`, which you can call and display.

In [None]:
from sklearn.metrics import silhouette_score

# Computing the silhouette score
print('Silhouette score: {0:.2f}'.format(silhouette_score(customers, cluster_assignment)))


This silhouette score is reasonably high, which we can interpret by saying that the corresponding clusters are quite compact and well separated.
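The overall score hides how individual samples behave. As a sketch on synthetic data (not the retail dataset), `silhouette_samples` from `sklearn.metrics` returns one coefficient per sample, and their mean matches `silhouette_score`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# One coefficient per sample: s = (b - a) / max(a, b), where a is the mean
# distance to the other points of the same cluster and b is the mean distance
# to the points of the nearest other cluster
per_sample = silhouette_samples(X, labels)
overall = silhouette_score(X, labels)

print(f'mean of per-sample values: {np.mean(per_sample):.3f}')
print(f'silhouette_score:          {overall:.3f}')
print(f'samples with a negative value: {np.sum(per_sample < 0)}')
```

Counting the negative values gives a rough idea of how many samples sit closer to a neighbouring cluster than to their own.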

### Finding the optimal number of clusters

Try plotting the inertia and silhouette score for different numbers of clusters (e.g. between 1 and 20).

Note that silhouette score can only be calculated for two or more clusters.

In [None]:
k_vals = list(range(1, 21))
silhouette_scores = []
inertias = []

# Calculate the scores for each number of clusters
for k in k_vals:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(customers)
    cluster_assignment = kmeans.labels_
    if k > 1:
        silhouette_scores.append(silhouette_score(customers, cluster_assignment))
    else:
        # The silhouette score is undefined for a single cluster
        silhouette_scores.append(None)
    inertias.append(kmeans.inertia_)

print(f'Silhouette scores: {silhouette_scores} \n\n Inertias: {inertias}')


In [None]:
fig, ax1 = plt.subplots()

color = 'tab:orange'
ax1.set_xlabel('k')
ax1.set_ylabel('inertias', color=color)
ax1.plot(k_vals, inertias, color=color)
ax1.tick_params(axis='y', labelcolor=color)

ax1.set_xlim(0,20)
ax1.set_xticks(k_vals)
ax1.set_ylim(0,26000)
ax1.grid(visible=True, axis='x', linestyle='--')

ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

color = 'tab:blue'
ax2.set_ylabel('silhouette score', color=color) #  we already handled the x-label with ax1
ax2.plot(k_vals, silhouette_scores, color=color)
ax2.tick_params(axis='y', labelcolor=color)
fig.tight_layout()  # otherwise the right y-label is slightly clipped
plt.show()