# Customer Clustering – Online Retail Dataset

> **Purpose:** Segment customers based on purchasing behavior to inform marketing and retention strategies.

📌 **Includes:**
- Data preprocessing
- K-Means clustering with Elbow method
- Hierarchical clustering with dendrogram
- Ready-to-use visuals (inserted below)

**Dataset:** Kaggle – *Online Retail Customer Clustering* (hellbuoy/online-retail-customer-clustering)

In [None]:
# === Image URL parameters (pre-filled) ===
IMAGE_COVER_URL = ""

IMAGE_ELBOW_URL = "https://raw.githubusercontent.com/Lucas-Peterson/my-analysis/main/analys/Online-Retail/Elbow_Method_for_K-Means.png"
IMAGE_CLUSTERS_URL = "https://raw.githubusercontent.com/Lucas-Peterson/my-analysis/main/analys/Online-Retail/Customer_Clustering(K-Means).png"

IMAGE_EXTRA_1 = "https://raw.githubusercontent.com/Lucas-Peterson/my-analysis/main/analys/Online-Retail/Customer_Clustering(Hierarchical).png"
IMAGE_EXTRA_2 = "https://raw.githubusercontent.com/Lucas-Peterson/my-analysis/main/analys/Online-Retail/Dendogram.png"

print("Image URLs ready. Run 'Display images' below.")

In [None]:
# === Display images ===
from IPython.display import Image, display

if IMAGE_COVER_URL:
    display(Image(url=IMAGE_COVER_URL, embed=True))
display(Image(url=IMAGE_ELBOW_URL, embed=True))
display(Image(url=IMAGE_CLUSTERS_URL, embed=True))
display(Image(url=IMAGE_EXTRA_1, embed=True))
display(Image(url=IMAGE_EXTRA_2, embed=True))

## Data & Preprocessing
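- Drop rows with a missing `CustomerID`
- Keep only rows with positive `Quantity` and `UnitPrice`
- Aggregate at **customer level**: total `Quantity` (sum), average `UnitPrice` (mean), and distinct `InvoiceNo` and `StockCode` counts (nunique)
- Standardize features with `StandardScaler` before clustering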
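
To make the aggregation step concrete, here is a minimal sketch on a tiny hand-made table. The rows and values are invented for illustration; only the column names mirror the dataset, and `toy`, `toy_customers`, and `toy_scaled` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy transactions (invented values); column names mirror the dataset
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 2.0, 2.0, None, 3.0],
    "InvoiceNo":  ["A1", "A2", "B1", "B1", "C1", "D1"],
    "StockCode":  ["X", "Y", "X", "Z", "X", "Y"],
    "Quantity":   [2, 3, 10, -1, 5, 4],
    "UnitPrice":  [1.0, 3.0, 2.0, 2.0, 1.0, 0.0],
})

# Same filters as the steps listed above
toy = toy.dropna(subset=["CustomerID"])
toy = toy[(toy["Quantity"] > 0) & (toy["UnitPrice"] > 0)]

# One row per customer
toy_customers = toy.groupby("CustomerID").agg({
    "Quantity": "sum",
    "UnitPrice": "mean",
    "InvoiceNo": "nunique",
    "StockCode": "nunique",
}).reset_index()

# Standardize every feature column (CustomerID is an identifier, not a feature)
toy_scaled = StandardScaler().fit_transform(toy_customers.drop(columns=["CustomerID"]))
print(toy_customers)
```

In this toy table, customer 3 disappears entirely because its only row has a zero `UnitPrice`, so the filters can change which customers get clustered, not just the totals.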

In [None]:
import pandas as pd
import kagglehub
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.cluster.hierarchy as sch

In [None]:
# Download dataset
print("Downloading dataset from KaggleHub...")
path = kagglehub.dataset_download("hellbuoy/online-retail-customer-clustering")
dataset_path = f"{path}/OnlineRetail.csv"
print(f"Path to dataset: {dataset_path}")

In [None]:
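# Load the raw transactions
df = pd.read_csv(dataset_path, encoding='ISO-8859-1')

# Drop rows without a customer and keep only positive quantities and prices
df = df.dropna(subset=['CustomerID'])
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]

# One row per customer: total quantity, mean unit price,
# number of distinct invoices, number of distinct products
customer_df = df.groupby('CustomerID').agg({
    'Quantity': 'sum',
    'UnitPrice': 'mean',
    'InvoiceNo': 'nunique',
    'StockCode': 'nunique'
}).reset_index()

# Standardize the features so that no single scale dominates the distances
scaler = StandardScaler()
scaled_data = scaler.fit_transform(customer_df.drop(columns=['CustomerID']))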

## K-Means Clustering
- Elbow method to choose optimal **k**
- Assign clusters and visualize
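
Reading the elbow off a plot is a judgment call. As a rough cross-check, the bend can also be located numerically. The sketch below uses synthetic blobs rather than the retail data, and the "largest second difference" rule is one simple heuristic assumed here for illustration, not the method used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (not the retail data)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [5, 9]],
    cluster_std=0.5,
    random_state=42,
)

# SSE (inertia) for k = 1..7
k_range = list(range(1, 8))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    for k in k_range
]

# Heuristic: the elbow is where the curve bends most sharply,
# i.e. where the discrete second difference of the SSE is largest.
# second_diff[i] is centered on k_range[i + 1].
second_diff = np.diff(inertias, n=2)
elbow_k = k_range[int(np.argmax(second_diff)) + 1]
print(f"Suggested k: {elbow_k}")
```

On real customer data the curve is usually smoother than on these blobs, so the suggestion is best treated as a starting point to compare against the plot.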

In [None]:
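# Fit K-Means for k = 1..10 and record the within-cluster sum of squares (inertia)
sse = []
k_values = list(range(1, 11))
for k in k_values:
    # n_init is set explicitly because its default differs across scikit-learn versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(scaled_data)
    sse.append(kmeans.inertia_)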

plt.figure(figsize=(8, 5))
plt.plot(k_values, sse, marker='o')
plt.title('Elbow Method for K-Means')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.grid(True)
plt.show()

In [None]:
# k chosen by inspecting the elbow plot above
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
customer_df['KMeans_Cluster'] = kmeans.fit_predict(scaled_data)

plt.figure(figsize=(8, 6))
sns.scatterplot(x='Quantity', y='UnitPrice', hue='KMeans_Cluster', data=customer_df, s=80, palette='viridis')
plt.title('Customer Clustering (K-Means)')
plt.xlabel('Total Quantity Purchased')
plt.ylabel('Average Unit Price')
plt.legend(title='Cluster')
plt.grid(True)
plt.show()

## Hierarchical Clustering
- Ward linkage with Euclidean distance
- Dendrogram + scatter plot
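
A dendrogram and a flat cluster assignment come from the same merge table. The sketch below builds a Ward linkage on synthetic points (not the retail data) and cuts it into three groups with SciPy's `fcluster`. The data, the group count, and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Three synthetic groups of 20 points each
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Merge table: one row per merge, so (n_points - 1) rows and 4 columns
Z = linkage(points, method="ward")

# Cut the tree so that at most three flat clusters remain (labels start at 1)
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])  # cluster sizes
```

The third column of `Z` holds the merge heights that a dendrogram draws on its vertical axis, so a large jump between consecutive heights points to a natural place to cut.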

In [None]:
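plt.figure(figsize=(10, 7))
# Ward linkage merges the pair of clusters that least increases within-cluster variance;
# leaf labels are hidden because there is one leaf per customer
dendrogram = sch.dendrogram(sch.linkage(scaled_data, method='ward'), no_labels=True)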
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean Distance')
plt.show()

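# Ward linkage only supports Euclidean distance, which is the default,
# so the metric argument is omitted for compatibility across scikit-learn versions
hierarchical = AgglomerativeClustering(n_clusters=4, linkage='ward')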
customer_df['Hierarchical_Cluster'] = hierarchical.fit_predict(scaled_data)

plt.figure(figsize=(8, 6))
sns.scatterplot(x='Quantity', y='UnitPrice', hue='Hierarchical_Cluster', data=customer_df, s=80, palette='viridis')
plt.title('Customer Clustering (Hierarchical)')
plt.xlabel('Total Quantity Purchased')
plt.ylabel('Average Unit Price')
plt.legend(title='Cluster')
plt.grid(True)
plt.show()

## Next Steps
- Add **RFM analysis**
- Validate clusters with **customer lifetime value (CLV)**
- Use PCA/UMAP for better visual separation
- Build **personas** based on cluster summary
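
As a starting point for the RFM step, the sketch below computes Recency, Frequency, and Monetary values on a tiny invented table. It assumes the real data has an `InvoiceDate` column, which this notebook does not use. The `Revenue` column, the one-day snapshot offset, and all values are illustrative choices.

```python
import pandas as pd

# Toy transactions (invented values); InvoiceDate is an assumed column
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceNo": ["A1", "A2", "B1", "C1", "C2", "C2"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-03-01", "2011-02-10",
        "2011-01-20", "2011-03-09", "2011-03-09",
    ]),
    "Quantity": [2, 1, 5, 3, 1, 4],
    "UnitPrice": [10.0, 20.0, 2.0, 5.0, 8.0, 1.0],
})
tx["Revenue"] = tx["Quantity"] * tx["UnitPrice"]

# Reference date: one day after the last transaction in the data
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),                            # distinct invoices
    Monetary=("Revenue", "sum"),                                   # total spend
)
print(rfm)
```

The resulting `rfm` table has one row per customer, so it could be scaled and clustered the same way as `customer_df`.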