# **Advanced Analytics Showcase - Clustering**

![kmeans-meme](images/kmeans_meme.png)

## Learning Objectives

* Explain the key concepts of clustering and the most common clustering algorithm - KMeans
* Recognize common use cases and potential business impact of clustering
* Become familiar with key clustering vocabulary and understand the language of a Data Analyst/Data Scientist
* Appreciate the advantages of clustering over Excel-based segmentation

<img src = "images/clusteringcartoon3.png" height = 400 width = 600><br>

## Introduction to Python Jupyter Notebook

**What is a Jupyter Notebook**

What you are working with right now is a "Python Jupyter Notebook".

- Python Jupyter Notebook is a popular and powerful tool used by data scientists and analysts to conduct analytics projects.
- It allows users to create and share documents containing live code, equations, visualizations, and text.
- Jupyter Notebook provides an interactive environment where data scientists can explore data, build models, and communicate their findings effectively.
- The flexibility of Jupyter Notebook makes it easy to experiment with different data analysis techniques and share the results with stakeholders.

**How to run a Jupyter Notebook?**

- Click on a code cell to edit its contents.
- To execute the code, select the code cell and press "Shift" + "Enter" keys on the keyboard.
- Wait for the code to execute, which might take some time for longer cells.
- Once executed, the output will be displayed below the code cell.

Practice on the cell below

In [None]:
PRACTICE_TEXT = "REPLACE WITH YOUR NAME"
print("Hi {}! Welcome to the clustering showcase session. You have learnt how to execute a Jupyter code cell!".format(PRACTICE_TEXT))

## Introduction to Clustering

- A cluster is a group of similar data objects that are different from objects in other clusters.
- Clustering is an unsupervised learning technique where data is grouped or labeled into separate classes based on similarities.
- RFM segmentation is an example of clustering using transactional data. There are more advanced and powerful clustering techniques available for better customer segmentation.
- Clustering can be used as a standalone tool to gain insights into data distribution or as a preprocessing step in other algorithms.
<br>

<table><tr>
<td><img src = "images/cluster1.png" height = "300" width = "300"/></td>
<td><img src = "images/cluster2.png" height = "300" width = "300"/></td>
</tr></table>

</font>

## Possible Applications

Clustering algorithms can be leveraged by SMEs in various industries. They are more powerful than RFM analysis, as they can be used to solve a wider range of business problems.

| **Application / Use Cases**                                                                                  | **Clustering Algorithms** | **RFM Analysis** |
|--------------------------------------------------------------------------------------------------------------|--------------------------|------------------|
| Customer segmentation for a local coffee shop based on customer demographics and purchase history.           | Yes                      | Restricted       |
| Product segmentation for a fashion boutique based on customer purchase behavior and product features.        | Yes                      | Restricted       |
| Anomaly detection for a manufacturing company based on sensor data and production statistics.                | Yes                      | No               |
| Traffic pattern analysis for a local delivery service based on real-time traffic data and customer location. | Yes                      | No               |
| Fraud detection for an online retailer based on historical order data and user behavior.                     | Yes                      | No               |
| Image segmentation for a marketing agency based on customer preferences and engagement.                      | Yes                      | No               |
| Voice recognition for a virtual assistant based on audio data and user patterns.                             | Yes                      | No               |
| Clustering of genes for a biotech company based on expression data.                                          | Yes                      | No               |
| Personalized medicine for a healthcare clinic based on patient genetics and clinical data.                   | Yes                      | No               |
| Customer segmentation for a telecom company based on customer usage and demographic data.                    | Yes                      | Restricted       |
| Product improvement for a bakery store based on customer purchase history, feedback, and demographics.       | Yes                      | Restricted       |
| Market basket analysis for a grocery store based on transaction data.                                        | Yes                      | Yes              |
| Product recommendations for an e-commerce website based on customer purchase history and product features.   | Yes                      | Restricted       |
| Customer churn prediction for a F&B company based on transaction data and customer demographics.             | Yes                      | Restricted       |
| Sales territory planning for a real estate agency based on customer demographics and housing preferences.    | Yes                      | Restricted       |

<br>

<table><tr>
<td><img src = "images/discussion.png" height = "150" width = "150"/></td>
<td><b>Discussion: </b> Which of these potential use cases of clustering are relevant to your company? Can you use RFM analysis to solve them?</td>
</tr></table>


## Advantages of Clustering Algorithm over RFM Analysis

**Recap: Hyper-personalization introduced in M1**

<img src = "images/ml_usecases_v4.png">

**Challenge: Many customer micro-segments**
- Imagine you want to segment your customers into 200 non-overlapping clusters. That would be nearly impossible to complete with RFM analysis
- You will need a Machine Learning algorithm to discover segments in the data automatically - that's what clustering is good at

**Challenge: Segmentation based on behaviours, preferences and demographics**
- The RFM customer segments were defined by purchasing history only. Other useful information, such as demographics and products purchased, was not considered when segmenting customers
- A machine learning clustering algorithm takes in not only transaction data, but also other features about a customer, to derive better segments

**Challenge: Continuous monitoring and adjustment**
- To come up with the most sensible customer segments, you had to manually decide the definition and criteria for each segment. Iterating through the RFM Excel template was a time-consuming and tedious process
- Machine learning clustering automatically decides cluster assignment, which makes iterations much easier and faster

The table below summarises the advantages of clustering over RFM analysis:

|                         | Clustering                                       | RFM                                         |
|-------------------------|--------------------------------------------------|---------------------------------------------|
| Segment definition      | The algorithm discovers natural groups           | User has to define each segment             |
| Column inputs           | Any numeric variables                            | Only RFM columns                            |
| Limit to No. segments   | No theoretical limit                             | Difficult to define more than 8 segments    |
| Prone to bias           | Model decides segments based on distance metrics | Manual judgment and assumptions             |
| Time to iterate         | Fast                                             | Slow                                        |
| Scale                   | Suitable for datasets of all sizes               | Suitable for small to medium datasets       |

## How Do We Define Good Clustering Algorithms?

High-quality clusters keep objects in the same cluster close together (intra-cluster minimization) and objects in different clusters far apart (inter-cluster maximization).

- <b>Intra-cluster Minimization:</b> Minimize the distance between objects within a cluster. The closer two objects are, the more similar they are, and the more likely they belong together.
- <b>Inter-cluster Maximization:</b> Maximize the distance between clusters, so that each cluster is clearly separated from the others.

<img src = "images/good_clusters.jpeg" height="700" width="700">

The distance represents the similarity between any pairs of data. Some commonly used distance metrics are given below:

1. __Euclidean Distance:__ The Euclidean distance or Euclidean metric is the ordinary distance between two points that one would measure with a ruler.
2. __Manhattan Distance:__ The Manhattan distance is the sum of the absolute differences along each axis. Instead of the direct straight line between two points, it follows the two perpendicular sides of a right-angled triangle. It is named after the grid layout of the streets in Manhattan, New York City, where you travel along the blocks to get from one point to another. This metric is generally less affected by outliers than the Euclidean metric.

<img src = "images/Example-of-Euclidean-and-Manhattan-distances-between-two-points-A-and-B-The-Euclidean.png" height="500" width="500">
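To make the two metrics concrete, here is a minimal sketch that computes both for a pair of points. The points and the function names are made up for illustration and are not part of the workshop's helper library.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute differences along each axis (grid-style path)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two illustrative points (made up for this example)
point_a = (1, 2)
point_b = (4, 6)

print(euclidean_distance(point_a, point_b))  # 5.0
print(manhattan_distance(point_a, point_b))  # 7
```

The points differ by 3 on one axis and 4 on the other, so the straight line is 5 while the grid path is 3 + 4 = 7.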

## The Most Widely Used Clustering Algorithm - KMeans

K-means clustering is a simple unsupervised learning algorithm for solving clustering problems. It partitions a data set into a number of clusters, "k", which is fixed beforehand. Each cluster is represented by a center point, and every data point is assigned to its nearest center. The centers are then recomputed from their assigned points, and the process repeats until the assignments stop changing.

K-Means includes the following 4 steps:

1. Start with the number of clusters we want, e.g., 3. The K-Means algorithm places that many random centers in the data, then assigns each data point to its nearest center
2. The algorithm then moves each center to the mean of the group of points assigned to it
3. Data points are reassigned to the nearest of these updated centers
4. Steps 2 & 3 are repeated until no data point changes its group or the time budget runs out

[Interactive Demo of K-Means Algorithm](https://www.naftaliharris.com/blog/visualizing-k-means-clustering/)
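The four steps above can be sketched in a few lines of NumPy. This is a toy version for illustration only: the function name `simple_kmeans`, its parameters and the sample data are assumptions made for this example, and production libraries use more careful initialisation.

```python
import numpy as np

def simple_kmeans(points, k, max_iterations=100, seed=0):
    """Toy K-Means: returns (centers, labels) for a 2-D array of points."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial random centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iterations):
        # Steps 1 and 3: assign every point to its nearest center (Euclidean)
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop once no point changes its cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: move each center to the mean of the points assigned to it
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers, labels

# Made-up data: two visibly separate groups of three points each
data = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                 [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
centers, labels = simple_kmeans(data, k=2)
print(labels)   # cluster number assigned to each point
print(centers)  # final center of each cluster
```

The cluster numbers themselves are arbitrary: which group ends up as cluster 0 depends on the random starting centers.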

**Why do we use K-Means?**

* It is one of the most common clustering algorithms, with many proven successes across industries and use cases
* K-Means is useful when we have an idea of how many clusters actually exist in the data
* With a large number of variables, K-Means is computationally faster than many other clustering techniques (if K is small)

## Apply KMeans Like a Data Scientist

In this section, we will apply the K-Means algorithm to the FFC data. You will be able to:
- Observe the differences between clusters identified from KMeans and RFM analysis
- Recognize different types of visualisations created from Python & Excel
- Iterate KMeans quickly with different configurations and parameters

### Load required libraries

Python is an open-source programming language with many contributors creating libraries for different purposes. These libraries need to be loaded at the beginning of a script. After loading, we can use the functions they provide.

In [None]:
# Note - all these lines in the code block starting with # signs are comments
# Comments provide explanations of the code and also instructions for execution

# Importing required Python libraries for this session
# Press Shift + Enter

!wget https://raw.githubusercontent.com/RISEBCG/DAB/main/M3A2-clustering/cluster_helper.py
from cluster_helper import *

### Load the dataset

Python needs to load the data into memory before it can use or transform it.

In [None]:
# Load the dataset prepared for clustering.
# Press Shift + Enter

df = pd.read_csv('https://raw.githubusercontent.com/RISEBCG/DAB/main/M3A2-clustering/customer_clustering_v1.csv')

In [None]:
# Let's take a glance at the data
# .head(10) shows the top 10 rows of the dataset (.head() alone shows the top 5)
# Key in df.head(10) and press Shift + Enter



From the top 10 rows, we can tell that the dataset contains the following columns in 4 categories:

*ID:*
- customer_id

*Demographics:*
- age
- annual_income

*RFM features:*
- most_recent_purchase_days
- total_net_sales
- total_no_transactions
- overall seg

*Purchase behaviours:*
- total_sku_quantities
- Greek yogurt_count
- Nuts_count
- Apple slices_count
- Chips_count
- Soda_count
- Chocolate bar_count
- Seasonal special_count
- Promotion item_count
- Lunchbox - Beef_count
- Lunchbox - Vegan_count

The next step is to examine if there are any outliers, missing values or any other issues with this data

### Data inspection

Recall the concepts of "Dirty data" and "Cleaned data". Before conducting any analysis or modelling, it is critical to inspect the data for any issues. These issues have to be fixed before proceeding with the analytics.

In some contexts, this step is also referred to as "Exploratory Data Analysis (EDA)", as you might be able to draw some initial insights from the data while doing the inspection.
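As a rough illustration of what to look for during inspection, the sketch below runs the same kind of summary on a tiny made-up table. The values are invented and the column names only mimic the workshop dataset.

```python
import numpy as np
import pandas as pd

# A tiny made-up table; the column names only mimic the workshop dataset
toy = pd.DataFrame({
    "age": [25, 31, 45, 52, np.nan],
    "annual_income": [30000, 42000, 51000, 48000, 900000],
})

# Summary statistics: count, mean, std, min, quartiles and max per numeric column
print(toy.describe())

# Number of missing values in each column
print(toy.isna().sum())
```

In the summary, a `count` lower than the number of rows signals missing values (here, one missing age), and a `max` far above the 75% quartile hints at an outlier (here, the 900000 income).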

In [None]:
# We use .describe() to generate a data summary for all the numerical values.
# From the data summary, we could easily spot any missing values or outliers
# Key in df.describe() and press Shift + Enter



### Data pre-processing

This step includes both handling the issues spotted in the previous step and transforming the data to meet the requirements of the algorithm. In this case:

- The dataset does not have any missing values or extreme outliers.
- However, the numeric values in the dataset have different scales, which can cause issues with distance metrics.
- Variables with larger scales can dominate distance metrics mathematically, even if they are less significant than variables with smaller scales.
- For example, take two customers aged 25 and 45, with annual incomes of 10k and 11k. Under the Euclidean distance, the income gap of 1,000 dwarfs the age gap of 20, so income dominates the result.
- However, based on our intuition, the annual income difference is less significant than the age difference.
- To fix this issue, **data standardization** can be used.

In this session, we will use Min-Max Standardization (also known as min-max scaling), which rescales each numeric column to the range 0 to 1. More details can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
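The effect of min-max scaling on the age and income example can be sketched as follows. The first two rows are the two customers from the example; the other two rows are made up to give each column a range. The function `min_max_scale` is written for this illustration and is not the workshop's `standardize_numeric_data` helper, whose internals are not shown here.

```python
import numpy as np

# Columns: age, annual_income. The first two rows are the customers from the
# example; the last two are made-up rows that set each column's range.
customers = np.array([
    [25.0, 10_000.0],
    [45.0, 11_000.0],
    [20.0, 10_000.0],
    [60.0, 50_000.0],
])

def min_max_scale(values):
    """Rescale each column to the range 0 to 1: (x - min) / (max - min)."""
    col_min = values.min(axis=0)
    col_max = values.max(axis=0)
    return (values - col_min) / (col_max - col_min)

scaled = min_max_scale(customers)

# Gap between the first two customers, before and after scaling
raw_gap = np.abs(customers[0] - customers[1])  # age gap 20, income gap 1000
scaled_gap = np.abs(scaled[0] - scaled[1])     # age gap 0.5, income gap 0.025
print(raw_gap, scaled_gap)
```

Before scaling, the income gap is 50 times the age gap. After scaling, the age gap is the larger of the two, which matches the intuition described above.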

In [None]:
# Data standardization can be applied to numerical values
# Press Shift + Enter

numeric_columns = ['age', 'annual_income', 'total_sku_quantities',
                   'total_net_sales', 'most_recent_purchase_days', 'total_no_transactions',
                   'Greek yogurt_count', 'Nuts_count', 'Apple slices_count', 'Chips_count',
                   'Soda_count', 'Chocolate bar_count', 'Seasonal special_count',
                   'Promotion item_count', 'Lunchbox - Beef_count', 'Lunchbox - Vegan_count']

df, df_original = standardize_numeric_data(df, numeric_columns)

### Training KMeans model

The process of fitting data to the algorithm is referred to as "model training". Since KMeans can take in more features than RFM analysis, all numeric columns in this dataset are used for training the KMeans model.
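For readers curious about what model training looks like in code, here is a minimal sketch using scikit-learn's `KMeans` on made-up data. It is an assumption that the workshop's `run_kmeans_model` helper does something similar; its actual implementation is not shown in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up, already-scaled features for 6 customers (2 columns for brevity)
features = np.array([[0.10, 0.20], [0.15, 0.25], [0.20, 0.10],
                     [0.80, 0.90], [0.85, 0.80], [0.90, 0.95]])

# n_clusters is the "k"; random_state makes the random starting centers repeatable
model = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_labels = model.fit_predict(features)

print(cluster_labels)          # one cluster number per customer
print(model.cluster_centers_)  # the final center of each cluster
```

`fit_predict` trains the model and returns the cluster assignment for each row in one call.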

In [None]:
# Now, the data is ready for clustering!
# From RFM analysis, the customers were segmented into 4 groups
# Let's start with the same number of segments
# Recall that RFM only takes in 3 columns: recency, frequency and monetary value
# In K-Means, we can use more variables for clustering - in this case, let's use all the numeric columns
# Press Shift + Enter

df_pred = run_kmeans_model(4, df, df_original, numeric_columns)

In [None]:
# Let's take a look at the output dataframe
# The last column indicates the cluster that the customer is assigned to based on K-Means clustering
# Key in df_pred.head(10) and press Shift + Enter
# Note: In Python, numbering usually starts at 0. Hence, the 4 clusters are cluster 0, 1, 2 and 3



### Explain clusters

After the model assigns data points to different clusters, we want to explain the clusters to draw useful business insights.
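One simple way to start explaining clusters is to compare their sizes and average feature values. The sketch below does this on a made-up table. The column name `cluster` is an assumption standing in for whatever label column `df_pred` actually contains.

```python
import pandas as pd

# Made-up output table; "cluster" stands in for the label column in df_pred
toy_pred = pd.DataFrame({
    "age": [25, 30, 52, 48, 35],
    "total_net_sales": [120.0, 150.0, 900.0, 1100.0, 130.0],
    "cluster": [0, 0, 1, 1, 0],
})

# Segment size and average feature values per cluster
profile = toy_pred.groupby("cluster").agg(
    customers=("age", "size"),
    avg_age=("age", "mean"),
    avg_net_sales=("total_net_sales", "mean"),
)
print(profile)
```

In this toy table, cluster 1 is the smaller group, with older customers and much higher average net sales.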

In [None]:
# How would you visualise the clusters?
# If we plot each data point in a 3-D space, with the axes representing the recency, frequency and monetary value of the data point,
# we can get a rough understanding of each segment from the data points' positions in that space
# Press Shift + Enter

create_cluster_visualisation(df_pred)

Each dot represents a data point and each color represents the cluster assignment. You can drag the visualisation to view these data points from different angles. You can also zoom in and out.

You might have noticed that while some clusters are distinct, data points from different clusters can still mix together in the 3-D space. Ideally, you would want to see clearly separated clusters. So, why does this happen?

There are several possible explanations:
- KMeans uses other input columns, together with the RFM columns, to identify clusters
- KMeans model can be fine-tuned further to achieve better distinction
- There are no natural groups in the dataset

Now, let's dive deeper into each of these clusters and see what they look like.

In [None]:
# Let's take a look at the first cluster from the KMeans model
# Press Shift + Enter

create_segment_profiling_chart(df_pred, 0)

You might have realised - this is very similar to the Star segment we have identified from the RFM analysis!
- These are mostly working adults, with relatively high income
- Favorite SKUs are healthy food - Nuts, Greek Yogurt and Apple Slices
- High in Monetary, Frequency and Recency values

However, note that this segment is significantly smaller than our Star segment - it only has 18 customers. Can you think of potential reasons?

Can you create the same chart for the other 3 clusters created by the KMeans model?

In [None]:
# In each tab below, replace the 0 with the matching cluster number, i.e., 1, 2, and 3. Press Shift + Enter
t = widgets.TabBar(["Cluster 0", "Cluster 1", "Cluster 2", "Cluster 3"])
with t.output_to(0):
    create_segment_profiling_chart(df_pred, 0)

with t.output_to(1):
    create_segment_profiling_chart(df_pred, 0) # Replace this 0 with 1

with t.output_to(2):
    create_segment_profiling_chart(df_pred, 0) # Replace this 0 with 2

with t.output_to(3):
    create_segment_profiling_chart(df_pred, 0) # Replace this 0 with 3

<table><tr>
<td><img src = "images/discussion.png" height = "150" width = "150"/></td>
<td><b>Discussion: </b> How are the clusters from KMeans different from RFM? What are the new insights you can get from these KMeans clusters?</td>
</tr></table>

### Model evaluation

The process of evaluating a machine learning model against statistical and business metrics is commonly referred to as "model evaluation".

How can you be sure that 4 is the right number of clusters? You can evaluate the clusters using the Elbow Method.

<b>Algorithm for Elbow Method</b>

1. Run k-means clustering on the dataset for a range of values of k (say, k from 1 to 10)
2. For each value of k, calculate the [sum of squared errors (SSE)](https://hlab.stanford.edu/brian/error_sum_of_squares.html) - a statistical metric for evaluating clusters
3. Plot a line chart of the SSE for each value of k
4. If the line chart looks like an arm, then the "elbow" on the arm is the best value of k.

The plot should look like this:
<img src = "images/elbow1.png"><br>

In our plot we see a pretty clear elbow at k = 3, indicating that 3 is the best number of clusters.
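The elbow calculation itself can be sketched with scikit-learn, which exposes the SSE of a fitted model as `inertia_`. The data below is made up (three well-separated groups), and this is only an assumption about the kind of calculation the workshop's `plot_elbow_chart` helper performs.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: three well-separated blobs of 30 points each
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in blob_centers])

# Fit K-Means for k = 1..6 and record the SSE (scikit-learn calls it inertia_)
sse = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    sse[k] = model.inertia_

for k, value in sse.items():
    print(k, round(value, 1))
```

On data like this, the SSE drops sharply up to k = 3 and only slightly afterwards. Plotting these values as a line chart would show the elbow at k = 3.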

Let's apply this method on our models now!


In [None]:
# Let's create the elbow chart for this problem!
# Press Shift + Enter

plot_elbow_chart(df_pred, numeric_columns)

From the plot, 2 is the "clear elbow" of the chart. Shall we re-run the algorithm with number of clusters = 2?

Well, you have to consider the following questions before making this call:
- Is it enough to have just 2 clusters? Will you ignore some smaller segments among your customers?
- Are the 2 clusters identified suitable for marketing purpose?

In practice, the elbow method serves more as a reference. You might want to iterate KMeans with different parameters to create the most meaningful clusters.

### Iterate KMeans with different number of clusters

Different choices of numbers of clusters will lead to different cluster assignments, which may bring out different insights about customer segmentations.
Let's use this exercise to explore different configurations to the KMeans model. Observe the different customer segments.

In the next cell, replace number_of_clusters with your choice.
- number_of_clusters: an integer, recommended range is from 2 to 6

Hint:
- Try number_of_clusters = 5 and observe the resulting clusters
- What are the issues with some clusters? (Check the segment size)
- What can you do with these clusters?

In [None]:
# Replace number_of_clusters with your choice
# Explore the output clusters, observe differences, and identify new insights
# Press Shift + Enter

experiment_different_cluster(df, df_original, numeric_columns, number_of_clusters)

<table><tr>
<td><img src = "images/discussion.png" height = "150" width = "150"/></td>
<td><b>Discussion: </b> What do you observe from the new clustering results?</td>
</tr></table>

## Summary

- Machine learning-based clustering algorithms are more flexible than Excel-based segmentation analysis and apply to a wider range of use cases and industries.
- Clustering algorithms have several benefits over RFM analysis, including the ability to process more information than the RFM features alone, scalability to micro-segments, and faster iteration times.
- K-Means is one of the most widely used clustering algorithms: it is simple, computationally fast, and quick to iterate with different numbers of clusters.

<font style="font-family:Trebuchet MS;">

***

*This marks the end of this lesson*<br><br>

<div style="text-align: center"><font size="8"><font style="font-family:Trebuchet MS;">Happy Clustering !!!</font></font></div>