#Customer Segmentation with K-Means

In [12]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score
import plotly.express as px

# Load the datasets
customers = pd.read_csv('Customers.csv')
transactions = pd.read_csv('Transactions.csv')

# Merge the data on CustomerID
data = pd.merge(transactions, customers, on='CustomerID', how='inner')



In [13]:
# Feature engineering: aggregate transaction data for each customer (e.g., total spending, number of transactions)
customer_data = data.groupby('CustomerID').agg(
    total_spending=('TotalValue', 'sum'),
    num_transactions=('TransactionID', 'count'),
    avg_price=('Price', 'mean')
).reset_index()
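
The aggregation above collapses one row per transaction into one row per customer. A minimal sketch on a hypothetical three-row table (the values and the `toy_` names are invented for illustration and are not from `Transactions.csv`) shows what each named aggregation produces:

```python
import pandas as pd

# Hypothetical transactions: customer C1 has two purchases, C2 has one.
toy_transactions = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "TransactionID": ["T1", "T2", "T3"],
    "TotalValue": [100.0, 50.0, 30.0],
    "Price": [50.0, 25.0, 30.0],
})

# Same named-aggregation pattern as the notebook cell.
toy_customer_data = toy_transactions.groupby("CustomerID").agg(
    total_spending=("TotalValue", "sum"),        # sum of transaction values
    num_transactions=("TransactionID", "count"),  # number of rows per customer
    avg_price=("Price", "mean"),                  # mean of the Price column
).reset_index()

print(toy_customer_data)
```

Here C1 ends up with a total spending of 150.0 over 2 transactions and an average price of 37.5, while C2 keeps its single transaction unchanged.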



In [14]:
# Merge the aggregated transaction data with customer profile data
customer_profile = pd.merge(customer_data, customers[['CustomerID', 'Region']], on='CustomerID', how='left')

# Data Preprocessing: Standardize the data
features = customer_profile[['total_spending', 'num_transactions', 'avg_price']]
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
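
Standardization matters here because K-Means relies on Euclidean distance, and `total_spending` is on a much larger scale than `num_transactions`. The sketch below, using a small hypothetical table (values invented for illustration), checks that `StandardScaler` is equivalent to subtracting each column's mean and dividing by its population standard deviation:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features, not taken from the real dataset.
toy = pd.DataFrame({
    "total_spending": [6000.0, 3500.0, 1400.0, 1700.0, 3900.0],
    "num_transactions": [8.0, 6.0, 3.0, 3.0, 4.0],
    "avg_price": [280.0, 260.0, 170.0, 285.0, 370.0],
})

scaled = StandardScaler().fit_transform(toy)

# Manual z-score; StandardScaler uses the population std (ddof=0).
manual = ((toy - toy.mean()) / toy.std(ddof=0)).to_numpy()

print("matches manual z-score:", np.allclose(scaled, manual))
print("column means:", scaled.mean(axis=0).round(6))
print("column stds: ", scaled.std(axis=0).round(6))
```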



In [15]:
# Set the number of clusters (between 2 and 10)
num_clusters = 5  # Change this based on your choice (experiment with 2-10 clusters)
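
Rather than fixing `num_clusters` by hand, one option is to sweep the 2-10 range and compare both metrics for each value. The sketch below is one possible way to do this; it uses synthetic `make_blobs` data as a stand-in (an assumption, so that the block runs on its own), and in the notebook `X` would be replaced by `scaled_features`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace X with the notebook's scaled_features.
X_raw, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=42)
X = StandardScaler().fit_transform(X_raw)

results = []
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    results.append({
        "k": k,
        "db_index": davies_bouldin_score(X, labels),  # lower is better
        "silhouette": silhouette_score(X, labels),    # higher is better
    })

for row in results:
    print(f"k={row['k']:2d}  DB={row['db_index']:.3f}  silhouette={row['silhouette']:.3f}")

best_by_db = min(results, key=lambda r: r["db_index"])["k"]
best_by_silhouette = max(results, key=lambda r: r["silhouette"])["k"]
print("lowest DB Index at k =", best_by_db)
print("highest silhouette at k =", best_by_silhouette)
```

The two metrics do not always agree on the best value, so the final choice may also depend on how interpretable the resulting cluster profiles are.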



In [16]:
# K-Means Clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
customer_profile['Cluster'] = kmeans.fit_predict(scaled_features) + 1


In [17]:

# Calculate clustering metrics
db_index = davies_bouldin_score(scaled_features, customer_profile['Cluster'])
silhouette_avg = silhouette_score(scaled_features, customer_profile['Cluster'])

# Print the clustering metrics
print(f"Number of Clusters: {num_clusters}")
print(f"DB Index: {db_index}")
print(f"Silhouette Score: {silhouette_avg}")


Number of Clusters: 5
DB Index: 0.9427359293109949
Silhouette Score: 0.3095796428769709


In [18]:

# Cluster Profiling: Analyzing each cluster's characteristics
cluster_profile = customer_profile.groupby('Cluster').agg(
    avg_total_spending=('total_spending', 'mean'),
    avg_num_transactions=('num_transactions', 'mean'),
    avg_avg_price=('avg_price', 'mean')
).reset_index()

print("\nCluster Profiling:")
print(cluster_profile)



Cluster Profiling:
   Cluster  avg_total_spending  avg_num_transactions  avg_avg_price
0        1         6075.814359              8.102564     281.756511
1        2         3469.361690              5.563380     261.751267
2        3         1393.015517              3.206897     166.593580
3        4         1671.106333              2.633333     283.176333
4        5         3872.782667              3.900000     372.656050


In [19]:

# PCA for dimensionality reduction (to reduce to 2D for visualization)
pca = PCA(n_components=2)
pca_components = pca.fit_transform(scaled_features)
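
Projecting three features onto two components discards some variance, so it can help to check how much the 2D plot actually retains. A minimal sketch, again on synthetic stand-in data (an assumption; in the notebook the already fitted `pca` object exposes the same `explained_variance_ratio_` attribute):

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): three standardized features, like scaled_features.
X_raw, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=42)
X = StandardScaler().fit_transform(X_raw)

pca_check = PCA(n_components=2).fit(X)

# Fraction of total variance captured by each retained component.
ratios = pca_check.explained_variance_ratio_
retained = ratios.sum()

print(f"PC1: {ratios[0]:.1%}, PC2: {ratios[1]:.1%}, total retained: {retained:.1%}")
```

If the retained share is low, clusters that look overlapping in the scatter plot may still be separated along the dropped component.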


In [20]:

# Add the PCA components to the dataframe for visualization
customer_profile['PCA1'] = pca_components[:, 0]
customer_profile['PCA2'] = pca_components[:, 1]


In [21]:
# Visualize the clusters with an interactive plot using Plotly
fig = px.scatter(customer_profile, x='PCA1', y='PCA2', color='Cluster',
                 hover_data=['CustomerID', 'Region', 'avg_price'],
                 title='Customer Segmentation with K-Means Clustering (PCA Reduced)',
                 labels={'PCA1': 'PCA Component 1', 'PCA2': 'PCA Component 2'})

# Show the plot
fig.show()


#Cluster Interpretation:

Cluster 1: High spenders with frequent transactions.   
Cluster 2: Moderate spenders with moderate transaction frequency.   
Cluster 3: Low spenders with fewer transactions.   
Cluster 4: Occasional high-price buyers with fewer transactions.   
Cluster 5: Mid-range spenders with the highest average price (about 373) and relatively few transactions.   

K-Means produced a usable clustering, with a DB Index of 0.94 and a silhouette score of 0.31. For the DB Index lower is better, and for the silhouette score values closer to 1 are better, so a score of 0.31 suggests moderate separation with some overlap between neighboring clusters rather than sharply distinct groups. The cluster profiles above are nonetheless interpretable in terms of spending and transaction behavior. K-Means is a suitable choice for this dataset, particularly if the data follows a spherical or convex cluster pattern.
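
The overall silhouette score is an average over all points, so it can hide clusters that are much weaker than the rest. One way to look closer is `silhouette_samples`, which returns a score per point that can then be averaged per cluster. The sketch below uses synthetic stand-in data (an assumption); in the notebook, `X` and `labels` would be `scaled_features` and `customer_profile['Cluster']`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Stand-in data (assumption): replace with scaled_features and the K-Means labels.
X, _ = make_blobs(n_samples=200, centers=5, n_features=3, random_state=42)
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

# One silhouette value per point, in the range [-1, 1].
sample_scores = silhouette_samples(X, labels)

# Average the per-point values within each cluster.
per_cluster = {
    int(c): float(sample_scores[labels == c].mean()) for c in np.unique(labels)
}
overall = silhouette_score(X, labels)

for cluster_id, score in per_cluster.items():
    print(f"cluster {cluster_id}: mean silhouette = {score:.3f}")
print(f"overall: {overall:.3f}")
```

A cluster whose mean is close to zero or negative is a candidate for merging with a neighbor or for revisiting the choice of `num_clusters`.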

#Customer Segmentation with DBSCAN Clustering

In [24]:
# DBSCAN Clustering
from sklearn.cluster import DBSCAN  # not imported in the first cell

# eps and min_samples are dataset-dependent; these values are only a starting point
dbscan = DBSCAN(eps=0.5, min_samples=5)
customer_profile['Cluster'] = dbscan.fit_predict(scaled_features)

# DBSCAN labels noise points as -1; after shifting by 1, Cluster 0 holds the
# noise points and the actual clusters are numbered from 1
customer_profile['Cluster'] = customer_profile['Cluster'] + 1

# Calculate clustering metrics
# Note: the noise points (Cluster 0) are scored here as if they formed a cluster
db_index = davies_bouldin_score(scaled_features, customer_profile['Cluster'])
silhouette_avg = silhouette_score(scaled_features, customer_profile['Cluster'])

# Print the clustering metrics
print(f"Number of Clusters: {customer_profile['Cluster'].nunique()}")
print(f"DB Index: {db_index}")
print(f"Silhouette Score: {silhouette_avg}")



Number of Clusters: 3
DB Index: 7.888924736382755
Silhouette Score: 0.08463096031893222
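
Because DBSCAN marks noise points with the label -1, scoring them as if they were a cluster can distort both metrics. One common alternative, sketched here on synthetic stand-in data (an assumption), is to report the noise count separately and compute the metrics on the clustered points only. Whether that is the right comparison against K-Means depends on how noise customers should be treated in the analysis.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace X with the notebook's scaled_features.
X_raw, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=42)
X = StandardScaler().fit_transform(X_raw)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

noise_mask = labels == -1
n_noise = int(noise_mask.sum())
n_clusters = len(set(labels.tolist()) - {-1})  # real clusters, noise excluded
print(f"clusters: {n_clusters}, noise points: {n_noise}")

db = sil = None
if n_clusters >= 2:
    # Both metrics need at least two clusters; score only the non-noise points.
    db = davies_bouldin_score(X[~noise_mask], labels[~noise_mask])
    sil = silhouette_score(X[~noise_mask], labels[~noise_mask])
    print(f"DB Index (noise excluded): {db:.3f}")
    print(f"Silhouette (noise excluded): {sil:.3f}")
else:
    print("Fewer than two clusters found; metrics are undefined.")
```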


In [25]:

# Cluster Profiling: Analyzing each cluster's characteristics
cluster_profile = customer_profile.groupby('Cluster').agg(
    avg_total_spending=('total_spending', 'mean'),
    avg_num_transactions=('num_transactions', 'mean'),
    avg_avg_price=('avg_price', 'mean')
).reset_index()

print("\nCluster Profiling:")
print(cluster_profile)

# PCA for dimensionality reduction (to reduce to 2D for visualization)
pca = PCA(n_components=2)
pca_components = pca.fit_transform(scaled_features)

# Add the PCA components to the dataframe for visualization
customer_profile['PCA1'] = pca_components[:, 0]
customer_profile['PCA2'] = pca_components[:, 1]


Cluster Profiling:
   Cluster  avg_total_spending  avg_num_transactions  avg_avg_price
0        0         3710.714186              5.651163     266.628186
1        1         3518.479664              5.006711     271.754590
2        2          883.054286              1.571429     303.226190


In [23]:

# Visualize the clusters with an interactive plot using Plotly
fig = px.scatter(customer_profile, x='PCA1', y='PCA2', color='Cluster',
                 hover_data=['CustomerID', 'Region', 'avg_price'],
                 title='Customer Segmentation with DBSCAN Clustering (PCA Reduced)',
                 labels={'PCA1': 'PCA Component 1', 'PCA2': 'PCA Component 2'})

# Show the plot
fig.show()


DBSCAN Clustering: With eps=0.5 and min_samples=5, DBSCAN yielded a DB Index of 7.89 and a silhouette score of 0.08, both much worse than the K-Means results (0.94 and 0.31). Because the labels were shifted by 1, Cluster 0 in the profile table holds the points DBSCAN marked as noise, so the algorithm found only two actual clusters and the metrics treat the noise points as a third. The poor scores suggest that DBSCAN did not capture meaningful groupings, likely because of the eps and min_samples settings, or because the data lacks the well-defined dense regions that DBSCAN is designed to find.
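
A common heuristic for choosing `eps` is the k-distance plot: for each point, compute the distance to its k-th nearest neighbor (with k tied to `min_samples`), sort those distances, and look for the "elbow" where they start rising sharply. The sketch below computes the sorted k-distances on synthetic stand-in data (an assumption) and prints a few percentiles instead of plotting. It is a starting point for tuning rather than a guaranteed fix for this dataset.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace X with the notebook's scaled_features.
X_raw, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=42)
X = StandardScaler().fit_transform(X_raw)

min_samples = 5

# Querying the training data returns each point as its own nearest neighbor
# (distance 0), which matches DBSCAN counting the point itself in min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance from each point to its farthest retained neighbor, sorted ascending.
k_distances = np.sort(distances[:, -1])

for pct in (50, 75, 90, 95):
    print(f"{pct}th percentile k-distance: {np.percentile(k_distances, pct):.3f}")
```

Candidate `eps` values near the elbow of the sorted curve can then be tried in `DBSCAN(eps=..., min_samples=min_samples)` and compared using the same metrics as above.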

#Conclusion:
In this customer segmentation analysis, two clustering algorithms, K-Means and DBSCAN, were applied to the dataset. K-Means performed better, with a DB Index of 0.94 and a silhouette score of 0.31, indicating moderately separated clusters whose profiles differ clearly in spending and transaction behavior. This suggests that K-Means is suitable for this dataset, as it grouped customers meaningfully based on their profile and transaction data.

On the other hand, DBSCAN struggled, yielding a DB Index of 7.89 and a silhouette score of 0.08, which indicate that the clusters were not well-separated. This could be due to DBSCAN’s sensitivity to parameter settings or to the dataset not being well suited to DBSCAN's density-based approach.

Overall, K-Means is the preferred method for this task, given its superior clustering performance. However, further fine-tuning of DBSCAN’s parameters might be explored if needed for datasets with varying cluster densities or shapes.