Import packages

In [0]:
%pip install dbscan scikit-learn scipy

# DBSCAN

DBSCAN groups points that are closely packed together. A point with at least a minimum number of neighbors within a distance $\epsilon$ (the radius) is a core point, and core points that lie within $\epsilon$ of each other form a cluster together with the points in their neighborhoods. Points that are not within $\epsilon$ of any core point are labeled as noise. The algorithm does not require specifying the number of clusters in advance.
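To make the roles of $\epsilon$ and the minimum neighbor count concrete, here is a minimal sketch on made-up toy points. It assumes scikit-learn's `sklearn.cluster.DBSCAN` (imported under the alias `SklearnDBSCAN`), not the `dbscan` package used in the cell below, and the `eps` and `min_samples` values are chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN as SklearnDBSCAN  # alias avoids clashing with the dbscan package

# Hypothetical toy data: two tight groups plus one isolated point.
toy_points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # group B
    [10.0, 0.0],                          # isolated point
])

# In scikit-learn, min_samples counts the point itself as a neighbor.
toy_model = SklearnDBSCAN(eps=0.5, min_samples=3).fit(toy_points)
toy_labels = toy_model.labels_

# Noise points receive the label -1 and do not count as a cluster.
n_clusters = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)
n_noise = int(np.sum(toy_labels == -1))

print(toy_labels)           # → [ 0  0  0  1  1  1 -1]
print(n_clusters, n_noise)  # → 2 1
```

The number of clusters is an output of the run rather than an input: it follows from how the points sit relative to `eps` and `min_samples`.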

In [0]:
import mlflow
from dbscan import DBSCAN
import pandas as pd

customer_pd = spark.read.table("workspace.default.clustering_df").toPandas()

customer_pd = customer_pd.drop(columns = ["customer_id", "last_order_date", "first_order_date", "std_order_value"])
customer_pd.head()

# set the experiment id
mlflow.set_experiment(experiment_id="3004562734275050")
mlflow.autolog()


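# The dbscan package documents a 2-D NumPy array as input, so convert the DataFrame first
labels, core_sample_mask = DBSCAN(customer_pd.to_numpy(dtype=float), eps=0.01, min_samples=2)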

customer_pd["cluster"] = labels

import matplotlib.pyplot as plt
import numpy as np


from sklearn.manifold import TSNE  # required: TSNE is used below

n_samples = customer_pd.shape[0]
perplexity = min(30, n_samples - 1)  # t-SNE requires perplexity < n_samples; cap at 30

tsne = TSNE(
    n_components=2,
    random_state=42,
    perplexity=perplexity
)
tsne_results = tsne.fit_transform(
    customer_pd.drop(columns=['cluster'])
)

plt.figure(figsize=(8, 6))
unique_labels = np.unique(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    class_member_mask = (labels == k)
    xy = tsne_results[class_member_mask]
    plt.scatter(
        xy[:, 0],
        xy[:, 1],
        c=[col],
        label=f'Cluster {k}',
        edgecolors='k',
        s=50
    )
plt.title('DBSCAN Clusters (t-SNE Transformed)')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.legend()
plt.show()

# Hierarchical Clustering

Hierarchical clustering builds a tree of clusters by either merging smaller clusters into larger ones (agglomerative) or splitting larger clusters into smaller ones (divisive). Merging continues until all points are in a single cluster; splitting continues until each point is its own cluster. The result is often visualized as a dendrogram, showing the hierarchy of cluster merges or splits.
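As a small illustration of the agglomerative direction, the sketch below runs SciPy's `linkage` and `fcluster` on four made-up points. The toy data and the choice of two flat clusters are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy data: two pairs of nearby points, far apart from each other.
toy_points = np.array([
    [0.0, 0.0], [0.0, 1.0],
    [10.0, 0.0], [10.0, 1.0],
])

# Each row of the linkage matrix records one merge:
# (cluster index a, cluster index b, merge distance, size of the merged cluster).
toy_linkage = linkage(toy_points, method="ward")
print(toy_linkage.shape)  # n - 1 merges for n points → (3, 4)

# Cut the tree so that at most two flat clusters remain.
toy_clusters = fcluster(toy_linkage, t=2, criterion="maxclust")
print(toy_clusters)
```

The final merge joins the two distant pairs, so its merge distance is much larger than the earlier ones. A large jump like this in the dendrogram is a common visual cue for where to cut the tree.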

In [0]:
customer_pd = customer_pd.drop(columns = ["cluster"])

In [0]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import mlflow

mlflow.set_experiment(experiment_id="3004562734275050")
mlflow.autolog()

linked = linkage(customer_pd, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

In [0]:
from scipy.cluster.hierarchy import fcluster

customer_pd["cluster"] = fcluster(
    linked,
    t = 4,
    criterion="maxclust"
)

In [0]:
from sklearn.metrics import silhouette_score, davies_bouldin_score

silhouette_avg = silhouette_score(customer_pd.drop(columns=["cluster"]), customer_pd["cluster"])
print(f"Silhouette Score: {silhouette_avg}")

davies_bouldin_score_metrics = davies_bouldin_score(customer_pd.drop(columns=["cluster"]), customer_pd["cluster"])
print(f"Davies-Bouldin Score: {davies_bouldin_score_metrics}")



# Evaluation

After applying the clustering algorithms, every customer has a cluster label; the analysis below uses the four clusters from the hierarchical clustering step. To assess their business relevance, we aggregate the features within each cluster and look for patterns and distinguishing characteristics. This helps us determine whether the clusters align with business objectives and provide actionable insights.

In [0]:
customer_pd.groupby("cluster").agg(["mean", "min", "max", "count"])


In [0]:
customer_pd["cluster"].value_counts()


### Cluster Analysis & Interpretation

Below are the summary statistics for each cluster. Let's interpret the clusters based on their characteristics:

 Cluster | Purchase Profile | Loyalty and Diversity | Interpretation | Count |
---------|------------------|-----------------------|----------------|-------|
 **1**   | Moderate units, moderate spend, moderate tenure, low recency | Average loyalty, moderate order count, moderate product/category diversity | Represents steady, engaged customers with consistent purchasing behavior. | **16** |
 **2**   | Lower units, higher spend, high avg order value, long tenure | Slightly higher loyalty, fewer orders, higher product/category diversity | Likely high-value, less frequent buyers who purchase larger baskets. | **7** |
 **3**   | Very high units, very high spend, single order, long tenure | No loyalty, single product/category | Outlier: Possibly a one-time bulk purchase, not a typical loyal customer. | **1** |
 **4**   | High units, extremely high spend, single order, moderate tenure | High loyalty, single product/category | Outlier: Another one-time, very high-value purchase, possibly a business or special event. | **1** |

**Business Insights:**
- **Cluster 1:** Target for loyalty programs and retention strategies; these are your core, repeat customers (**16** customers).
- **Cluster 2:** Upsell/cross-sell opportunities; focus on increasing purchase frequency (**7** customers).
- **Cluster 3 & 4:** Investigate for special cases (bulk/corporate orders); may require different engagement or service models (**1** customer each).

These interpretations can guide marketing, retention, and product strategies tailored to each segment.

In [0]:
import os
os.getcwd()
os.chdir("/Workspace/Users/moritzmm02@gmail.com/clustering/")

In [0]:
customer_pd.to_csv("Data/clustered_customers.csv")