# **<h1 align="center">Data Mining 2024-25</h1>**
## **<h3 align="center">Customer Segmentation - ABCDEats Inc.</h3>**
### **<h3 align="center">Clustering</h3>**


**Group 10 members:**<br>Alexandra Pinto - 20211599@novaims.unl.pt - 20211599<br>
Marco Galão  - r20201545@novaims.unl.pt - r20201545<br>
Sven Goerdes - 20240503@novaims.unl.pt - 20240503<br>
Tim Straub  - 20240505@novaims.unl.pt - 20240505<br>

<a id = "toc"></a>

# Table of Contents

* [1. Import the Libraries](#import_libraries)
* [2. Import the Dataset](#import_dataset)
* [3. Clustering](#clustering)
    * [3.1 SOM](#som)
    * [3.2 RFM](#rfm)
    * [3.3 Mean Shift](#mean_shift)
    * [3.4 Hierarchical Clustering](#hierarchical_clustering)
    * [3.5 K-medoids](#k_medoids)
    * [3.6 DBSCAN](#dbscan)
    





# 1. Import the Libraries <a class="anchor" id="import_libraries"></a>
[Back to ToC](#toc)<br>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import math
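# Only `dendrogram` is needed from SciPy; the name `linkage` is later reused for a string
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering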

import warnings
warnings.filterwarnings("ignore")

# Import the helper functions defined in main.py.
# Reload the module first so that edits to main.py are picked up,
# then star-import so the notebook sees the refreshed names.
import importlib
import main
importlib.reload(main)
from main import *

# 2. Import the Dataset <a class="anchor" id="import_dataset"></a>
[Back to ToC](#toc)<br>

In this section, we import the preprocessed dataset and set `customer_id` as the index column. We then inspect the first and last five rows (transposed, so that each column is one customer).

In [10]:
df_clustering = pd.read_csv("../Data/Preprocessed_Data.csv", index_col="customer_id")
df_clustering.head().T

| customer_id | 1b8f824d5e | f6d1b2ba63 | 180c632ed8 | 4eb37a6705 | 6aef2b6726 |
| --- | --- | --- | --- | --- | --- |
| customer_region | 2360 | 4660 | 4660 | 4660 | 8670 |
| customer_age | 18 | 38 | 26 | 20 | 40 |
| vendor_count | 2 | 1 | 2 | 2 | 2 |
| product_count | 5 | 2 | 3 | 5 | 2 |
| chain_orders | 1 | 2 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| avg_daily_orders | 0.022222 | 0.022222 | 0.022222 | 0.022222 | 0.022222 |
| avg_order_value | 14.44 | 4.6 | 15.78 | 27.72 | 12.46 |
| promo_used | 1 | 1 | 1 | 0 | 1 |
| chain_orders_prop | 0.5 | 1.0 | 0.5 | 0.0 | 0.0 |


In [11]:
df_clustering.tail().T

| customer_id | f4e366c281 | f6b6709018 | f74ad8ce3f | f7b19c0241 | fd40d3b0e0 |
| --- | --- | --- | --- | --- | --- |
| customer_region | 8670 | 8670 | 8670 | 8670 | 4660 |
| customer_age | 30 | 26 | 24 | 34 | 30 |
| vendor_count | 1 | 1 | 1 | 1 | 1 |
| product_count | 1 | 1 | 1 | 1 | 1 |
| chain_orders | 1 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| avg_daily_orders | 0.011111 | 0.011111 | 0.011111 | 0.011111 | 0.011111 |
| avg_order_value | 18.04 | 18.04 | 17.79 | 12.03 | 7.91 |
| promo_used | 1 | 1 | 0 | 1 | 0 |
| chain_orders_prop | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |


# 3. Clustering <a class="anchor" id="clustering"></a>
[Back to ToC](#toc)<br>

In [12]:
# Placeholders for the feature lists used in the clustering.
# metric_features must be set to a list of column names before running the cells below.
metric_features = None
categorical_features = None
unused_features = None
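
The feature lists above are still placeholders. As a rough sketch of one way they could be populated, the cell below splits the columns of a small toy frame into metric, categorical, and unused groups. The toy data, the `toy_*` names, and the choice of which columns count as categorical are assumptions made for illustration only; the actual selection for this project is not decided here.

```python
import pandas as pd

# Toy frame that reuses a few column names from the preprocessed dataset
toy = pd.DataFrame({
    "customer_region": ["2360", "4660", "8670"],
    "customer_age": [18, 38, 26],
    "avg_order_value": [14.44, 4.60, 15.78],
    "promo_used": [1, 1, 0],
})

# Assumption: these columns are treated as categorical, even if stored as numbers
categorical_candidates = ["customer_region", "promo_used"]

toy_categorical = [c for c in toy.columns if c in categorical_candidates]

# Metric features: numeric columns that are not flagged as categorical
toy_metric = [
    c for c in toy.select_dtypes(include="number").columns
    if c not in toy_categorical
]

# Anything left over is not used by the clustering
toy_unused = [c for c in toy.columns if c not in toy_metric + toy_categorical]

print(toy_metric, toy_categorical, toy_unused)
```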

> ## 3.1 SOM <a class="anchor" id="som"></a>
[Back to 3. Clustering](#clustering)<br>

> ## 3.2 RFM <a class="anchor" id="rfm"></a>
[Back to 3. Clustering](#clustering)<br>

> ## 3.3 Mean Shift <a class="anchor" id="mean_shift"></a>
[Back to 3. Clustering](#clustering)<br>

> ## 3.4 Hierarchical Clustering <a class="anchor" id="hierarchical_clustering"></a>
[Back to 3. Clustering](#clustering)<br>

In [None]:
df_hierarchicalClustering = df_clustering.copy()

Find the best linkage method by computing `get_r2_hc` (from `main.py`) for each candidate method over 1 to 10 clusters

In [None]:
hc_methods = ["ward", "complete", "average", "single"]
max_nclus = 10

r2_hc = np.vstack([ get_r2_hc(df_hierarchicalClustering[metric_features], 
                              link, 
                              max_nclus=max_nclus, 
                              min_nclus=1, 
                              dist="euclidean") 
                              for link in hc_methods])
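
`get_r2_hc` is defined in `main.py`, which is not shown in this notebook. As a minimal sketch of what such a helper might compute, the cell below assumes that R² means the share of total variance explained by the partition, i.e. 1 - SSW/SST. The functions `r2_of_partition` and `r2_by_n_clusters` and the toy data are hypothetical and are not the project's implementation.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def r2_of_partition(X, labels):
    """R² of a partition: 1 - (within-cluster sum of squares / total sum of squares)."""
    X = np.asarray(X, dtype=float)
    sst = ((X - X.mean(axis=0)) ** 2).sum()
    ssw = sum(
        ((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
        for k in np.unique(labels)
    )
    return 1.0 - ssw / sst

def r2_by_n_clusters(X, link_method, max_nclus, min_nclus=1):
    """R² of the hierarchical solution for each number of clusters in [min_nclus, max_nclus]."""
    scores = []
    for k in range(min_nclus, max_nclus + 1):
        # Default distance is Euclidean, which is what Ward linkage requires
        labels = AgglomerativeClustering(n_clusters=k, linkage=link_method).fit_predict(X)
        scores.append(r2_of_partition(X, labels))
    return np.array(scores)

# Toy data: three well-separated blobs in two dimensions
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0.0, 4.0, 8.0)])

methods = ["ward", "complete", "average", "single"]
# One row per linkage method, one column per number of clusters
r2_toy = np.vstack([r2_by_n_clusters(X_toy, m, max_nclus=5) for m in methods])
print(np.round(r2_toy, 3))
```

Plotting each row of such a matrix against the number of clusters gives one curve per linkage method; the method whose curve rises fastest explains the most variance with the fewest clusters.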

Define the number of clusters based on the dendrogram

In [None]:
# setting distance_threshold=0 and n_clusters=None ensures we compute the full tree
linkage = 'ward'
distance = 'euclidean'


hclust = AgglomerativeClustering(linkage=linkage, metric=distance, distance_threshold=0, n_clusters=None)
hclust.fit_predict(df_hierarchicalClustering[metric_features])

In [None]:
# Adapted from:
# https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html#sphx-glr-auto-examples-cluster-plot-agglomerative-dendrogram-py

# create the counts of samples under each node (number of points being merged)
counts = np.zeros(hclust.children_.shape[0])
n_samples = len(hclust.labels_)

# hclust.children_ contains the observation ids that are being merged together
# At the i-th iteration, children[i][0] and children[i][1] are merged to form node n_samples + i
for i, merge in enumerate(hclust.children_):
    # track the number of observations in the current cluster being formed
    current_count = 0
    for child_idx in merge:
        if child_idx < n_samples:
            # If this is True, then we are merging an observation
            current_count += 1  # leaf node
        else:
            # Otherwise, we are merging a previously formed cluster
            current_count += counts[child_idx - n_samples]
    counts[i] = current_count

# the hclust.children_ is used to indicate the two points/clusters being merged (dendrogram's u-joins)
# the hclust.distances_ indicates the distance between the two points/clusters (height of the u-joins)
# the counts indicate the number of points being merged (dendrogram's x-axis)
linkage_matrix = np.column_stack(
    [hclust.children_, hclust.distances_, counts]
).astype(float)



In [None]:
# Plot the corresponding dendrogram
sns.set_theme()
fig = plt.figure(figsize=(11,5))
# The Dendrogram parameters need to be tuned
y_threshold = 100
dendrogram(linkage_matrix, truncate_mode='level', p=5, color_threshold=y_threshold, above_threshold_color='k')
plt.hlines(y_threshold, 0, 1000, colors="r", linestyles="dashed")
plt.title(f'Hierarchical Clustering Dendrogram: {linkage.title()} Linkage', fontsize=21)
plt.xlabel('Number of points in node (or index of point if no parenthesis)')
plt.ylabel(f'{distance.title()} Distance', fontsize=13)
plt.show()

In [None]:
##########################################
# Visualize the Dendrogram with y_threshold = 75
##########################################

# Plot the corresponding dendrogram
sns.set_theme()
fig = plt.figure(figsize=(11,5))
# The Dendrogram parameters need to be tuned
y_threshold = 75
dendrogram(linkage_matrix, truncate_mode='level', p=5, color_threshold=y_threshold, above_threshold_color='k')
plt.hlines(y_threshold, 0, 1000, colors="r", linestyles="dashed")
plt.title(f'Hierarchical Clustering Dendrogram: {linkage.title()} Linkage', fontsize=21)
plt.xlabel('Number of points in node (or index of point if no parenthesis)')
plt.ylabel(f'{distance.title()} Distance', fontsize=13)
plt.show()
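
The dashed line in the dendrogram marks the height at which the tree is cut. Instead of counting the coloured branches by eye, the number of clusters below a given threshold can also be computed with SciPy's `fcluster`. The sketch below is an illustration on toy data: the helper `n_clusters_at_threshold`, the toy blobs, and the threshold value are assumptions, and the linkage matrix is built directly with SciPy rather than from the fitted `hclust` object.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster
from scipy.cluster.hierarchy import linkage as scipy_linkage

def n_clusters_at_threshold(Z, threshold):
    """Number of flat clusters obtained by cutting the linkage matrix Z at `threshold`."""
    labels = fcluster(Z, t=threshold, criterion="distance")
    return len(np.unique(labels))

# Toy data: three tight, well-separated blobs
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in (0.0, 5.0, 10.0)])

# Ward linkage matrix in SciPy's format (same layout as linkage_matrix above)
Z = scipy_linkage(X_toy, method="ward")

n_at_cut = n_clusters_at_threshold(Z, threshold=3.0)
print(n_at_cut)
```

Applied to the `linkage_matrix` built earlier, the same helper would report how many clusters each candidate `y_threshold` produces.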

Test cluster solutions for different numbers of clusters

In [None]:
linkage = 'ward'
distance = 'euclidean'

In [None]:
# 4 cluster solution
n_clusters = 4

hc4_clust = AgglomerativeClustering(linkage=linkage, metric=distance, n_clusters=n_clusters)
hc4_labels = hc4_clust.fit_predict(df_hierarchicalClustering[metric_features])

In [None]:
# Characterizing the 4 clusters
df_hierarchicalClustering_concat = pd.concat([df_hierarchicalClustering[metric_features], 
                       pd.Series(hc4_labels, 
                                 name='labels', 
                                 index=df_hierarchicalClustering.index)], 
                    axis=1)

df_hierarchicalClustering_concat.groupby('labels').mean()

In [None]:
# 5 cluster solution
n_clusters=5

hc5_clust = AgglomerativeClustering(linkage=linkage, metric=distance, n_clusters=n_clusters)
hc5_labels = hc5_clust.fit_predict(df_hierarchicalClustering[metric_features])

In [None]:
# Characterizing the 5 clusters
df_hierarchicalClustering_concat = pd.concat([df_hierarchicalClustering[metric_features], 
                       pd.Series(hc5_labels, 
                                 name='labels', 
                                 index=df_hierarchicalClustering.index)], 
                    axis=1)

df_hierarchicalClustering_concat.groupby('labels').mean()

In [None]:
pd.crosstab(
    pd.Series(hc5_labels, name='hc5_labels', index=df_hierarchicalClustering.index),
    pd.Series(hc4_labels, name='hc4_labels', index=df_hierarchicalClustering.index),
    )
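
Because both solutions are cut from the same tree, the 5-cluster solution should be the 4-cluster solution with exactly one cluster split in two, and the crosstab makes this visible. The sketch below checks that property on random toy data; the data and the names `ct`, `is_nested`, and `n_split` are hypothetical and only illustrate how to read such a table.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Toy data with no particular cluster structure
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))

labels_4 = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X_toy)
labels_5 = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X_toy)

# Rows: 5-cluster labels, columns: 4-cluster labels
ct = pd.crosstab(pd.Series(labels_5, name="k5"), pd.Series(labels_4, name="k4"))

# Nested solutions: every 5-cluster label falls inside exactly one 4-cluster label
is_nested = bool(((ct > 0).sum(axis=1) == 1).all())

# Number of 4-cluster labels that were split into two 5-cluster labels
n_split = int(((ct > 0).sum(axis=0) == 2).sum())

print(ct)
print(is_nested, n_split)
```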

Final hierarchical clustering solution

In [None]:
# final cluster solution
linkage = "ward"
distance = "euclidean"
n_clusters = 4

hclust = AgglomerativeClustering(linkage=linkage, metric=distance, n_clusters=n_clusters)

hc_labels = hclust.fit_predict(df_hierarchicalClustering[metric_features])

In [None]:
# Characterizing the final clusters

df_hierarchicalClustering_concat = pd.concat([
    df_hierarchicalClustering[metric_features], 
    pd.Series(hc_labels, name='labels', index=df_hierarchicalClustering.index)
    ], 
    axis=1)
df_hierarchicalClustering_concat.groupby('labels').mean()
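
The characterization step (concatenate labels, then group by label) is repeated for every candidate solution. A small helper could wrap it and also report the cluster sizes, which the mean table alone does not show. The sketch below is one possible version; `cluster_profile`, the `n_customers` column name, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def cluster_profile(features: pd.DataFrame, labels) -> pd.DataFrame:
    """Per-cluster feature means, with the cluster size as the first column."""
    # Attach the labels positionally so the customer index is preserved
    labelled = features.assign(labels=np.asarray(labels))
    grouped = labelled.groupby("labels")
    profile = grouped.mean()
    profile.insert(0, "n_customers", grouped.size())
    return profile

# Toy data that reuses two column names from the dataset
toy = pd.DataFrame(
    {
        "avg_order_value": [10.0, 12.0, 30.0, 34.0, 32.0],
        "vendor_count": [1, 1, 4, 5, 3],
    },
    index=list("abcde"),
)
toy_profile = cluster_profile(toy, [0, 0, 1, 1, 1])
print(toy_profile)
```

With the notebook's variables, the equivalent call would be `cluster_profile(df_hierarchicalClustering[metric_features], hc_labels)`.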

> ## 3.5 K-medoids <a class="anchor" id="k_medoids"></a>
[Back to 3. Clustering](#clustering)<br>

> ## 3.6 DBSCAN <a class="anchor" id="dbscan"></a>
[Back to 3. Clustering](#clustering)<br>