# <span style="color: green"> Customer Segmentation Example </span>


In this notebook, we explore customer segmentation with three clustering techniques: KMeans, DBSCAN, and Gaussian Mixture Models. We segment the provided dataset, evaluate the resulting clusters, and visualize them.

In [1]:
# Importing Necessary Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans, MeanShift, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score, silhouette_score
from sklearn.neighbors import KernelDensity

## <span style="color: Blue"> 1. Data Loading and Exploration </span>
In this section, we load the dataset and perform an initial exploration to understand its structure and basic statistics.

In [None]:
# Load the dataset
df = pd.read_excel('./Online Retail.xlsx')  # read_excel already returns a DataFrame

df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|-----------|-----------|-------------|----------|-------------|-----------|------------|---------|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
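
Beyond `head()`, a few more calls give a quick picture of structure and basic statistics. The sketch below runs on a small hand-made frame so that it works without the Excel file; the values and the name `sample` are illustrative assumptions, and only the column names are borrowed from the dataset. With the real data, the same calls would presumably be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the dataset (values are illustrative only)
sample = pd.DataFrame({
    'InvoiceNo': ['536365', '536365', '536366', '536367'],
    'Quantity': [6, 8, 2, 32],
    'UnitPrice': [2.55, 2.75, 1.85, 1.69],
    'CustomerID': [17850.0, 17850.0, np.nan, 13047.0],
    'Country': ['United Kingdom', 'United Kingdom', 'France', 'United Kingdom'],
})

print(sample.dtypes)                       # column types
print(sample.describe())                   # basic statistics for numeric columns

missing_per_column = sample.isna().sum()   # number of missing values per column
print(missing_per_column)

rows_per_country = sample['Country'].value_counts()  # rows per category
print(rows_per_country)
```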


## <span style="color: Blue"> Selecting a Subset of the Data </span>


My laptop is not very powerful, so I work with only a subset of the dataset.

In [None]:
# Group by 'Country' and keep at most the first 300 rows for each country
data = df.groupby('Country').head(300).reset_index(drop=True)
data.head()
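
The effect of a per-group row cap is easier to see on a toy frame. This is a minimal sketch with made-up data (the frame `toy` is hypothetical); it uses `groupby(...).head(n)`, which keeps up to `n` rows from each group and preserves the original row order.

```python
import pandas as pd

# Hypothetical toy data: three rows for country 'A', one row for country 'B'
toy = pd.DataFrame({'Country': ['A', 'A', 'B', 'A'], 'Quantity': [1, 2, 3, 4]})

# Keep at most the first 2 rows of each country
subset = toy.groupby('Country').head(2).reset_index(drop=True)
print(subset)
```

Because this takes the first rows of each group rather than a random sample, the subset reflects whatever ordering the file happens to have.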

## <span style="color: Blue"> 2. Data Preprocessing </span>
In this section, we will preprocess the data by handling missing values, encoding categorical variables, and scaling the features.

In [3]:
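# Handling missing values: fill numeric columns with their column mean
data = data.fillna(data.mean(numeric_only=True))

# Encoding categorical variables
# The dataset has no 'Category' column, so 'Country' serves as the categorical example
le = LabelEncoder()
data['Country'] = le.fit_transform(data['Country'])

# Feature scaling
# Keep numeric columns only and drop CustomerID, which is an identifier rather than a feature
features = data.select_dtypes(include='number').drop(columns=['CustomerID'])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(features)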
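
The three preprocessing steps can be checked end to end on a tiny example. The sketch below is illustrative only: the frame `toy` and its values are made up, and it simply confirms that imputation removes missing values and that standardized columns end up with mean 0 and unit variance.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy frame with one missing numeric value and one categorical column
toy = pd.DataFrame({
    'Quantity': [6.0, np.nan, 2.0, 32.0],
    'UnitPrice': [2.55, 2.75, 1.85, 1.69],
    'Country': ['United Kingdom', 'France', 'France', 'Germany'],
})

# Mean imputation, applied to numeric columns only
toy = toy.fillna(toy.mean(numeric_only=True))

# Integer-encode the categorical column (classes are sorted alphabetically)
encoder = LabelEncoder()
toy['Country'] = encoder.fit_transform(toy['Country'])

# Standardize every column to mean 0 and unit variance
scaled = StandardScaler().fit_transform(toy)
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

Label encoding imposes an arbitrary numeric order on the categories, and distance-based clustering treats that order as meaningful; one-hot encoding is a common alternative when that is a concern.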

## <span style="color: Blue"> 3. Clustering the Data</span>
Now that we have preprocessed the data, we will apply different clustering algorithms to segment the customers into groups.

In [4]:
# KMeans Clustering
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans_labels = kmeans.fit_predict(scaled_data)
data['KMeans_Cluster'] = kmeans_labels

# DBSCAN Clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(scaled_data)
data['DBSCAN_Cluster'] = dbscan_labels

# Gaussian Mixture Model
gmm = GaussianMixture(n_components=5, random_state=42)
gmm_labels = gmm.fit_predict(scaled_data)
data['GMM_Cluster'] = gmm_labels
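
The number of clusters above is fixed at 5. One common way to choose it, sketched here on synthetic data rather than on the retail dataset, is to fit KMeans for several values of `k` and keep the one with the highest silhouette score. The blob centers and the range of `k` are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated groups (illustrative only)
X, _ = make_blobs(
    n_samples=400,
    centers=[[-10, -10], [-10, 10], [10, -10], [10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# Fit KMeans for each candidate k and record the silhouette score
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(f'Best k by silhouette: {best_k}')
```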

## <span style="color: Blue"> 4. Dimensionality Reduction and Visualization</span>
To better visualize the clusters, we will reduce the dimensionality of the data using PCA and plot the results.

In [5]:
# Apply PCA for 2D visualization
pca = PCA(n_components=2)
pca_components = pca.fit_transform(scaled_data)
data['PCA1'] = pca_components[:, 0]
data['PCA2'] = pca_components[:, 1]

# Visualize KMeans Clusters
plt.figure(figsize=(8,6))
sns.scatterplot(x='PCA1', y='PCA2', hue='KMeans_Cluster', data=data, palette='viridis')
plt.title('KMeans Clusters (PCA Projection)')
plt.show()
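
A 2D PCA plot is only as informative as the variance the two components retain. The sketch below, on synthetic data with made-up dimensions, shows how `explained_variance_ratio_` reports that share; on the notebook's data the same attribute could be read from the fitted `pca` object.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 5-feature data (illustrative only); two features are strongly correlated
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

X_scaled = StandardScaler().fit_transform(X)
pca_demo = PCA(n_components=2).fit(X_scaled)

# Fraction of the total variance captured by each of the two components
print(pca_demo.explained_variance_ratio_)
retained = pca_demo.explained_variance_ratio_.sum()
print(f'Variance retained in 2D: {retained:.2%}')
```

If the retained share is low, clusters that look overlapping in the plot may still be well separated in the full feature space.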

## <span style="color: Blue">  5. Clustering Performance Evaluation</span>
In this section, we will evaluate the performance of the clustering algorithms using various metrics such as Silhouette Score, Calinski-Harabasz Score, and Davies-Bouldin Score.

In [6]:
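# Silhouette Score
kmeans_score = silhouette_score(scaled_data, kmeans_labels)
# Note: DBSCAN labels noise as -1, which is scored here as if it were a cluster;
# the call raises ValueError if DBSCAN produces fewer than 2 distinct labels
dbscan_score = silhouette_score(scaled_data, dbscan_labels)
gmm_score = silhouette_score(scaled_data, gmm_labels)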

# Calinski-Harabasz Score
kmeans_calinski = calinski_harabasz_score(scaled_data, kmeans_labels)
dbscan_calinski = calinski_harabasz_score(scaled_data, dbscan_labels)
gmm_calinski = calinski_harabasz_score(scaled_data, gmm_labels)

# Davies-Bouldin Score
kmeans_db = davies_bouldin_score(scaled_data, kmeans_labels)
dbscan_db = davies_bouldin_score(scaled_data, dbscan_labels)
gmm_db = davies_bouldin_score(scaled_data, gmm_labels)

# Display Scores
print(f'KMeans Silhouette Score: {kmeans_score}')
print(f'DBSCAN Silhouette Score: {dbscan_score}')
print(f'GMM Silhouette Score: {gmm_score}')
print(f'KMeans Calinski-Harabasz Score: {kmeans_calinski}')
print(f'DBSCAN Calinski-Harabasz Score: {dbscan_calinski}')
print(f'GMM Calinski-Harabasz Score: {gmm_calinski}')
print(f'KMeans Davies-Bouldin Score: {kmeans_db}')
print(f'DBSCAN Davies-Bouldin Score: {dbscan_db}')
print(f'GMM Davies-Bouldin Score: {gmm_db}')
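
DBSCAN needs some care when scored this way, because it marks noise points with the label `-1` and may return fewer than two clusters, in which case these metrics are undefined. The helper below is a hypothetical sketch (`silhouette_without_noise` is not part of scikit-learn) that excludes noise points and returns NaN when no valid score exists. It is demonstrated on synthetic blobs.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_without_noise(X, labels):
    """Silhouette score on non-noise points, or NaN if it is undefined."""
    labels = np.asarray(labels)
    mask = labels != -1                      # DBSCAN marks noise points with -1
    n_clusters = len(set(labels[mask]))
    # silhouette_score needs at least 2 clusters and fewer clusters than samples
    if n_clusters < 2 or n_clusters >= mask.sum():
        return float('nan')
    return silhouette_score(X[mask], labels[mask])


# Synthetic data (illustrative only)
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, 0], [0, 10], [10, 0]],
    cluster_std=0.8,
    random_state=0,
)
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

score = silhouette_without_noise(X, labels)
all_noise = silhouette_without_noise(X, np.full(len(X), -1))
print(score, all_noise)
```

Dropping noise points makes the DBSCAN score more optimistic than a score over all samples, so the share of points labeled as noise is worth reporting alongside it.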

## <span style="color: Blue"> 6. Results Interpretation</span>

In this section, we will summarize the findings from the clustering and evaluate which clustering algorithm performed best based on the evaluation metrics.



### 📊 **Clustering Evaluation Metrics: A Comprehensive Overview**

When assessing the quality of clustering results—especially in unsupervised learning—it's crucial to rely on quantitative metrics that provide insight into how well-separated and cohesive the clusters are. Below are three widely used internal evaluation metrics that help determine the effectiveness of a clustering algorithm.

---

#### 1️⃣ **Silhouette Score**  
📏 *Range:* **-1 to +1**

> *"A measure of how similar an object is to its own cluster compared to others."*

- **+1**: Indicates that the sample is far away from neighboring clusters — ideal scenario.
- **0**: Suggests overlapping clusters; samples lie between clusters.
- **-1**: Implies misclassified samples; likely assigned to the wrong cluster.

✅ **Best Used When:** You want a per-sample metric to understand cluster cohesion and separation.

🔍 **Interpretation Guidelines:**
- Above 0.7 → Strong structure
- 0.5 – 0.7 → Reasonable structure
- 0.25 – 0.5 → Weak structure that could be artificial
- Below 0.25 → No substantial structure
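
The per-sample view mentioned above is available through `silhouette_samples`. The following sketch uses synthetic blobs (an assumption for illustration) to show that the overall score is the mean of the per-sample coefficients, and how a per-cluster average can point to weaker clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

per_sample = silhouette_samples(X, labels)  # one coefficient per sample, each in [-1, 1]
overall = silhouette_score(X, labels)       # mean of the per-sample coefficients

# Average coefficient within each cluster
for cluster_id in np.unique(labels):
    print(cluster_id, per_sample[labels == cluster_id].mean())
print(overall)
```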

---

#### 2️⃣ **Calinski-Harabasz Index (Variance Ratio Criterion)**  
🔢 *Range:* **Positive values only (no upper bound)**

> *"Evaluates the ratio of between-cluster dispersion to within-cluster dispersion."*

- **Higher Values**: Indicate better-defined, more separated clusters.
- This index favors convex-shaped clusters and performs well with algorithms like K-Means.

✅ **Best Used When:** Comparing different clustering models or tuning the number of clusters.

💡 **Tip:** The Calinski-Harabasz score increases as the clusters become denser and more separated.
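
To make the ratio concrete, the index can be computed by hand and compared with scikit-learn's value. This is a sketch on synthetic data; the variable names are illustrative. With `k` clusters and `n` samples, the between-cluster dispersion is divided by `k - 1` and the within-cluster dispersion by `n - k`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic data (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

n_samples = X.shape[0]
cluster_ids = np.unique(labels)
k = len(cluster_ids)
overall_mean = X.mean(axis=0)

between = 0.0  # between-cluster dispersion
within = 0.0   # within-cluster dispersion
for cluster_id in cluster_ids:
    members = X[labels == cluster_id]
    centroid = members.mean(axis=0)
    between += len(members) * np.sum((centroid - overall_mean) ** 2)
    within += np.sum((members - centroid) ** 2)

manual_ch = (between / (k - 1)) / (within / (n_samples - k))
library_ch = calinski_harabasz_score(X, labels)
print(manual_ch, library_ch)
```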

---

#### 3️⃣ **Davies-Bouldin Score**  
📉 *Range:* **0 to ∞**

> *"Measures the average similarity between each cluster and its most similar one."*

- **Lower Values**: Indicate better clustering — ideally close to **0**.
- **Higher Values**: Suggest clusters overlap significantly or are poorly separated.

✅ **Best Used When:** Seeking a computationally efficient metric for evaluating compactness and separation.

⚠️ **Note:** Like the other two metrics, this score **does not require ground truth labels**. It tends to favor clusters that are spherical and evenly sized.

---

### 🧭 Choosing the Right Metric

| Metric | Best For | Ideal Value |
|--------|----------|-------------|
| Silhouette Score | Detailed per-sample analysis | Close to **+1** |
| Calinski-Harabasz | Model comparison & selection | As **high** as possible |
| Davies-Bouldin | Compactness & separation assessment | As **low** as possible |
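
The direction of each metric in the table can be checked by scoring a sensible labeling against a deliberately poor one. The sketch below uses synthetic blobs and random labels, both of which are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score, silhouette_score

# Synthetic, well-separated blobs (illustrative only)
X, blob_labels = make_blobs(
    n_samples=300,
    centers=[[-10, 0], [0, 10], [10, 0]],
    cluster_std=1.0,
    random_state=42,
)

# A deliberately poor labeling: assign the same points to 3 groups at random
rng = np.random.default_rng(42)
random_labels = rng.integers(0, 3, size=len(X))

results = {}
for name, labels in [('blob labels', blob_labels), ('random labels', random_labels)]:
    results[name] = {
        'silhouette': silhouette_score(X, labels),
        'calinski_harabasz': calinski_harabasz_score(X, labels),
        'davies_bouldin': davies_bouldin_score(X, labels),
    }
    print(name, results[name])
```

The blob labeling should score higher on silhouette and Calinski-Harabasz and lower on Davies-Bouldin than the random one.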

---

### ✅ Final Thoughts

Each of these metrics provides unique insights into the performance of clustering algorithms. To get a well-rounded understanding, it's often recommended to use multiple metrics in combination, especially since no single measure universally defines "the best" clustering solution.



In [7]:
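# Results Interpretation
# Select the algorithm with the highest silhouette score computed above
silhouette_scores = {'KMeans': kmeans_score, 'DBSCAN': dbscan_score, 'GMM': gmm_score}
best_algorithm = max(silhouette_scores, key=silhouette_scores.get)
print(f'The best performing algorithm by silhouette score is {best_algorithm}.')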

## <span style="color: Blue"> 7. Conclusion</span>
This notebook demonstrated how to perform customer segmentation using clustering techniques. We applied KMeans, DBSCAN, and a Gaussian Mixture Model, evaluated them with three internal metrics, and visualized the KMeans clusters using PCA. In this run, KMeans was selected as the best-performing algorithm on the basis of its silhouette score.