# Project 09: Customer Segmentation

**Difficulty**: ⭐⭐ Intermediate
**Estimated Time**: 4-5 hours
**Prerequisites**: 
- Basic understanding of unsupervised learning
- Pandas for data manipulation
- Familiarity with scikit-learn

## Learning Objectives

By the end of this notebook, you will be able to:
1. Perform RFM (Recency, Frequency, Monetary) analysis on customer transaction data
2. Apply and compare multiple clustering algorithms (K-Means, DBSCAN, Hierarchical, GMM)
3. Determine the optimal number of clusters using the elbow method and silhouette analysis
4. Visualize high-dimensional customer segments using PCA and t-SNE
5. Profile customer segments and derive actionable business insights
6. Develop targeted marketing strategies for different customer groups

## 1. Setup and Imports

First, let's import all necessary libraries and configure our environment.

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning - Clustering
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Machine Learning - Preprocessing and Metrics
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.metrics import silhouette_samples

# Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hierarchical clustering visualization
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import cdist

# Date and time handling
from datetime import datetime, timedelta

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure visualization defaults
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("All libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## 2. Load and Explore Dataset

We'll use the Online Retail dataset, which contains transactions from a UK-based online retailer.

**Note**: Download the dataset from [Kaggle](https://www.kaggle.com/datasets/vijayuv/onlineretail) and place it in the `data/` directory.

In [None]:
# Load the dataset
# Using relative path so it works on any computer
data_path = 'data/online_retail.csv'

try:
    retail_data = pd.read_csv(data_path, encoding='ISO-8859-1')
    print("Dataset loaded successfully!")
    print(f"Shape: {retail_data.shape}")
except FileNotFoundError:
    print(f"Error: File not found at {data_path}")
    print("Please download the dataset from Kaggle and place it in the data/ directory")
    # Create sample data for demonstration
    print("\nCreating sample data for demonstration...")
    np.random.seed(42)
    n_transactions = 10000
    n_customers = 500
    
    retail_data = pd.DataFrame({
        'InvoiceNo': [f'INV{i:06d}' for i in range(n_transactions)],
        'StockCode': np.random.choice(['PROD' + str(i) for i in range(100)], n_transactions),
        'Description': np.random.choice(['Product ' + str(i) for i in range(100)], n_transactions),
        'Quantity': np.random.randint(1, 50, n_transactions),
        'InvoiceDate': pd.date_range(start='2010-01-01', periods=n_transactions, freq='h'),
        'UnitPrice': np.random.uniform(1, 100, n_transactions),
        'CustomerID': np.random.choice(range(10000, 10000 + n_customers), n_transactions),
        'Country': np.random.choice(['United Kingdom', 'France', 'Germany', 'Spain'], n_transactions)
    })
    print(f"Sample data created with shape: {retail_data.shape}")

In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
retail_data.head()

In [None]:
# Dataset information
print("Dataset Information:")
print("=" * 50)
retail_data.info()

In [None]:
# Basic statistics
print("\nNumerical columns statistics:")
retail_data.describe()

In [None]:
# Check for missing values
print("Missing values per column:")
missing_values = retail_data.isnull().sum()
missing_percentage = (missing_values / len(retail_data)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Percentage': missing_percentage
})

print(missing_df[missing_df['Missing Count'] > 0])

### Exercise 1: Data Exploration

1. How many unique customers are in the dataset?
2. What is the date range of transactions?
3. Which country has the most transactions?

In [None]:
# Your code here
# Hint: Use unique(), value_counts(), and date operations


## 3. Data Cleaning and Preprocessing

Before performing RFM analysis, we need to clean the data:
- Remove rows with missing CustomerID
- Remove cancelled transactions (InvoiceNo starting with 'C')
- Remove rows with non-positive (zero or negative) quantities or prices
- Convert InvoiceDate to datetime

In [None]:
# Create a copy for cleaning
clean_data = retail_data.copy()

print(f"Original data shape: {clean_data.shape}")

# Remove missing CustomerID
clean_data = clean_data.dropna(subset=['CustomerID'])
print(f"After removing missing CustomerID: {clean_data.shape}")

# Convert CustomerID to integer
clean_data['CustomerID'] = clean_data['CustomerID'].astype(int)

# Convert InvoiceDate to datetime
clean_data['InvoiceDate'] = pd.to_datetime(clean_data['InvoiceDate'])

# Remove cancelled transactions (InvoiceNo starting with 'C')
clean_data = clean_data[~clean_data['InvoiceNo'].astype(str).str.startswith('C')]
print(f"After removing cancellations: {clean_data.shape}")

# Keep only rows with positive quantities and prices (drops zeros and negatives)
clean_data = clean_data[(clean_data['Quantity'] > 0) & (clean_data['UnitPrice'] > 0)]
print(f"After removing non-positive values: {clean_data.shape}")

# Calculate total amount for each transaction
clean_data['TotalAmount'] = clean_data['Quantity'] * clean_data['UnitPrice']

# Remove extreme outliers (transactions > 99.9th percentile)
threshold = clean_data['TotalAmount'].quantile(0.999)
clean_data = clean_data[clean_data['TotalAmount'] <= threshold]
print(f"After removing extreme outliers: {clean_data.shape}")

print(f"\nFinal cleaned data shape: {clean_data.shape}")
print(f"Data retained: {(len(clean_data) / len(retail_data)) * 100:.2f}%")

In [None]:
# Verify cleaned data
print("Cleaned data summary:")
print(f"Number of unique customers: {clean_data['CustomerID'].nunique()}")
print(f"Number of transactions: {len(clean_data)}")
print(f"Date range: {clean_data['InvoiceDate'].min()} to {clean_data['InvoiceDate'].max()}")
print(f"Total revenue: £{clean_data['TotalAmount'].sum():,.2f}")

## 4. RFM Analysis

**RFM Analysis** segments customers based on:
- **Recency (R)**: Days since last purchase
- **Frequency (F)**: Total number of purchases
- **Monetary (M)**: Total amount spent

Lower recency is better (more recent), while higher frequency and monetary values are better.
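
To make the three definitions concrete, here is a minimal sketch on a hand-made table. The `toy` DataFrame and its values are hypothetical and are not taken from the Online Retail dataset; the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy transactions (not from the Online Retail dataset)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-01-05", "2011-03-01",
        "2011-02-10", "2011-02-20", "2010-12-01",
    ]),
    "TotalAmount": [10.0, 5.0, 20.0, 7.5, 2.5, 100.0],
})

# Reference date: one day after the most recent transaction
ref = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (ref - d.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),                       # distinct invoices
    Monetary=("TotalAmount", "sum"),                          # total spend
)
print(toy_rfm)
```

Customer 1 has two rows on invoice `A1`, but Frequency counts distinct invoices, so that customer gets a Frequency of 2 rather than 3. Customer 3 spent the most but bought long ago, which is the kind of profile that Recency separates from the others.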

In [None]:
# Define reference date (day after last transaction)
reference_date = clean_data['InvoiceDate'].max() + timedelta(days=1)
print(f"Reference date for recency calculation: {reference_date}")

# Calculate RFM metrics for each customer
rfm_data = clean_data.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (reference_date - x.max()).days,  # Recency
    'InvoiceNo': 'nunique',  # Frequency (number of unique invoices)
    'TotalAmount': 'sum'  # Monetary
}).reset_index()

# Rename columns
rfm_data.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']

print(f"\nRFM data created for {len(rfm_data)} customers")
print("\nFirst 5 customers:")
rfm_data.head()

In [None]:
# RFM statistics
print("RFM Statistics:")
print("=" * 50)
rfm_data[['Recency', 'Frequency', 'Monetary']].describe()

In [None]:
# Visualize RFM distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Recency distribution
axes[0].hist(rfm_data['Recency'], bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Recency (days)')
axes[0].set_ylabel('Number of Customers')
axes[0].set_title('Recency Distribution')
axes[0].axvline(rfm_data['Recency'].median(), color='red', linestyle='--', 
                label=f"Median: {rfm_data['Recency'].median():.0f} days")
axes[0].legend()

# Frequency distribution (raw scale)
axes[1].hist(rfm_data['Frequency'], bins=50, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Frequency (number of orders)')
axes[1].set_ylabel('Number of Customers')
axes[1].set_title('Frequency Distribution')
axes[1].axvline(rfm_data['Frequency'].median(), color='red', linestyle='--',
                label=f"Median: {rfm_data['Frequency'].median():.0f} orders")
axes[1].legend()

# Monetary distribution (log scale)
axes[2].hist(np.log1p(rfm_data['Monetary']), bins=50, edgecolor='black', alpha=0.7)
axes[2].set_xlabel('Monetary (log scale)')
axes[2].set_ylabel('Number of Customers')
axes[2].set_title('Monetary Distribution (Log Scale)')
axes[2].axvline(np.log1p(rfm_data['Monetary'].median()), color='red', linestyle='--',
                label=f"Median: £{rfm_data['Monetary'].median():.0f}")
axes[2].legend()

plt.tight_layout()
plt.show()

print("Note: Monetary is shown on log scale due to high skewness")

### Exercise 2: RFM Analysis

1. What percentage of customers made only one purchase?
2. Calculate the correlation between Frequency and Monetary values
3. Identify the top 10 customers by Monetary value

In [None]:
# Your code here


## 5. Feature Engineering and Scaling

Before clustering, we need to:
1. Handle skewness using log transformation
2. Standardize features to have mean=0 and std=1

This ensures all features contribute equally to distance calculations.
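
The effect of the two steps can be checked on synthetic data. The sketch below draws hypothetical right-skewed "spend" values from a log-normal distribution (an assumption made only for illustration) and measures skewness with `scipy.stats.skew` before and after `np.log1p`.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical right-skewed "spend" values (log-normal, illustrative only)
spend = rng.lognormal(mean=5.0, sigma=1.0, size=2000)

# Step 1: log transformation to reduce skewness
spend_log = np.log1p(spend)

# Step 2: standardize to mean 0 and standard deviation 1
spend_scaled = StandardScaler().fit_transform(spend_log.reshape(-1, 1)).ravel()

print(f"Skewness before log1p: {skew(spend):.2f}")
print(f"Skewness after log1p:  {skew(spend_log):.2f}")
print(f"Scaled mean: {spend_scaled.mean():.3f}, std: {spend_scaled.std():.3f}")
```

Standardizing alone would not change the skewness, because it only shifts and rescales the values. That is why the log transformation comes first.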

In [None]:
# Apply log transformation to reduce skewness
# Adding 1 to handle zero values (log1p)
rfm_log = rfm_data.copy()
rfm_log['Recency_log'] = np.log1p(rfm_log['Recency'])
rfm_log['Frequency_log'] = np.log1p(rfm_log['Frequency'])
rfm_log['Monetary_log'] = np.log1p(rfm_log['Monetary'])

# Select features for clustering
features_for_clustering = ['Recency_log', 'Frequency_log', 'Monetary_log']
X = rfm_log[features_for_clustering].values

# Standardize features (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Features shape: {X_scaled.shape}")
print(f"\nScaled features statistics:")
print(f"Mean: {X_scaled.mean(axis=0)}")
print(f"Std: {X_scaled.std(axis=0)}")

In [None]:
# Visualize before and after transformation
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Original features
for idx, col in enumerate(['Recency', 'Frequency', 'Monetary']):
    axes[0, idx].hist(rfm_data[col], bins=50, edgecolor='black', alpha=0.7)
    axes[0, idx].set_title(f'{col} (Original)')
    axes[0, idx].set_ylabel('Count')

# Scaled features
for idx, col in enumerate(features_for_clustering):
    axes[1, idx].hist(X_scaled[:, idx], bins=50, edgecolor='black', alpha=0.7)
    axes[1, idx].set_title(f'{col} (Scaled)')
    axes[1, idx].set_ylabel('Count')

plt.tight_layout()
plt.show()

print("Transformation reduces skewness and standardizes scale")

## 6. Determining Optimal Number of Clusters

We'll use two methods:
1. **Elbow Method**: Plot WCSS (Within-Cluster Sum of Squares) vs. K
2. **Silhouette Analysis**: Measure cluster cohesion and separation
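
Before running both methods on the customer data, it can help to see how they behave when the answer is known. The following sketch uses `make_blobs` to generate four well-separated synthetic groups; the data, the centers, and names such as `X_toy` and `best_k` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (illustrative only)
centers = [[-6, -6], [-6, 6], [6, -6], [6, 6]]
X_toy, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

results = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X_toy)
    # Store (WCSS, silhouette) for each K
    results[k] = (model.inertia_, silhouette_score(X_toy, model.labels_))
    print(f"K={k}: WCSS={results[k][0]:.1f}, silhouette={results[k][1]:.3f}")

# Pick the K with the highest silhouette score
best_k = max(results, key=lambda k: results[k][1])
print(f"K with the highest silhouette score: {best_k}")
```

On clean data like this, WCSS drops sharply up to the true number of groups and then flattens, while the silhouette score peaks there. Real customer data is rarely this clear-cut, so treat both curves as guidance rather than a definitive answer.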

In [None]:
# Elbow Method - Calculate WCSS for different values of K
wcss_values = []
silhouette_scores = []
K_range = range(2, 11)

print("Calculating metrics for different K values...")
for k in K_range:
    # Fit K-Means
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    
    # Calculate WCSS (inertia)
    wcss_values.append(kmeans.inertia_)
    
    # Calculate silhouette score
    silhouette_avg = silhouette_score(X_scaled, kmeans.labels_)
    silhouette_scores.append(silhouette_avg)
    
    print(f"K={k}: WCSS={kmeans.inertia_:.2f}, Silhouette={silhouette_avg:.3f}")

print("\nCalculations complete!")

In [None]:
# Plot Elbow Curve and Silhouette Scores
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Elbow curve
axes[0].plot(K_range, wcss_values, marker='o', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Clusters (K)')
axes[0].set_ylabel('WCSS (Within-Cluster Sum of Squares)')
axes[0].set_title('Elbow Method for Optimal K')
axes[0].grid(True, alpha=0.3)
axes[0].set_xticks(K_range)

# Silhouette scores
axes[1].plot(K_range, silhouette_scores, marker='s', linewidth=2, markersize=8, color='green')
axes[1].set_xlabel('Number of Clusters (K)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Score for Different K')
axes[1].grid(True, alpha=0.3)
axes[1].set_xticks(K_range)

# Highlight optimal K (highest silhouette score)
optimal_k = K_range[np.argmax(silhouette_scores)]
axes[1].axvline(optimal_k, color='red', linestyle='--', 
                label=f'Optimal K={optimal_k}')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"\nBased on silhouette score, optimal K = {optimal_k}")
print(f"Highest silhouette score: {max(silhouette_scores):.3f}")

### Exercise 3: Optimal Clusters

1. Looking at the elbow curve, where does the "elbow" appear to be?
2. Calculate the Davies-Bouldin Index for K=4, 5, 6 (lower is better)
3. Based on business constraints, would you choose 4, 5, or 6 segments? Why?

In [None]:
# Your code here
# Hint: Use davies_bouldin_score from sklearn.metrics


## 7. K-Means Clustering

We'll apply K-Means with our optimal K value.
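
K-Means stores its centroids in the scaled, log-transformed feature space, which makes them hard to read directly. One way to interpret them is to undo both transformations, as in this sketch. The data and the names `toy_scaler`, `toy_kmeans`, and `centers_original` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical RFM-like values (Recency, Frequency, Monetary) for two obvious groups
low = np.column_stack([rng.uniform(200, 300, 50), rng.uniform(1, 3, 50),
                       rng.uniform(20, 80, 50)])
high = np.column_stack([rng.uniform(1, 20, 50), rng.uniform(20, 40, 50),
                        rng.uniform(2000, 5000, 50)])
raw = np.vstack([low, high])

# Same preprocessing order as in this notebook: log1p, then standardize
toy_scaler = StandardScaler()
scaled = toy_scaler.fit_transform(np.log1p(raw))
toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(scaled)

# Undo the scaling first, then undo log1p with expm1
centers_original = np.expm1(toy_scaler.inverse_transform(toy_kmeans.cluster_centers_))
print(np.round(centers_original, 1))
```

Because the averaging happened in log space, each recovered centroid behaves like a geometric mean. It will usually sit below the arithmetic mean that a `groupby(...).mean()` on the raw values reports.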

In [None]:
# Fix K at 5 for the rest of the notebook. This overrides the silhouette-based
# value computed in Section 6; adjust it based on your own analysis.
optimal_k = 5

# Fit K-Means model
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Add cluster labels to RFM data
rfm_data['KMeans_Cluster'] = kmeans_labels

# Calculate clustering metrics
silhouette_avg = silhouette_score(X_scaled, kmeans_labels)
davies_bouldin = davies_bouldin_score(X_scaled, kmeans_labels)
calinski_harabasz = calinski_harabasz_score(X_scaled, kmeans_labels)

print(f"K-Means Clustering with K={optimal_k}")
print("=" * 50)
print(f"Silhouette Score: {silhouette_avg:.3f} (higher is better, range: -1 to 1)")
print(f"Davies-Bouldin Index: {davies_bouldin:.3f} (lower is better)")
print(f"Calinski-Harabasz Index: {calinski_harabasz:.3f} (higher is better)")
print("\nCluster sizes:")
print(rfm_data['KMeans_Cluster'].value_counts().sort_index())

In [None]:
# Visualize cluster distribution
plt.figure(figsize=(10, 5))
cluster_counts = rfm_data['KMeans_Cluster'].value_counts().sort_index()
bars = plt.bar(cluster_counts.index, cluster_counts.values, edgecolor='black', alpha=0.7)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{int(height)}\n({height/len(rfm_data)*100:.1f}%)',
             ha='center', va='bottom')

plt.xlabel('Cluster')
plt.ylabel('Number of Customers')
plt.title('K-Means Cluster Distribution')
plt.xticks(cluster_counts.index)
plt.grid(axis='y', alpha=0.3)
plt.show()

## 8. DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering) can find arbitrarily shaped clusters and identify outliers.
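
DBSCAN results depend heavily on `eps`. A common heuristic, which this notebook does not otherwise use, is to look at each point's distance to its k-th nearest neighbor and pick `eps` near the point where that sorted curve bends. The sketch below applies the idea to synthetic data. Taking the 95th percentile as the candidate is an assumption made to keep the example free of plotting; in practice you would usually inspect the curve.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic data with two dense groups (illustrative only)
X_toy, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8]],
                      cluster_std=0.6, random_state=0)

min_samples = 5
# Querying the training data returns each point as its own nearest neighbor,
# which matches DBSCAN counting the point itself toward min_samples
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_toy)
distances, _ = nn.kneighbors(X_toy)
k_distances = np.sort(distances[:, -1])

# Candidate eps: a high percentile of the k-distances (assumption for this sketch)
eps_candidate = float(np.percentile(k_distances, 95))
labels = DBSCAN(eps=eps_candidate, min_samples=min_samples).fit_predict(X_toy)

n_found = len(set(labels)) - (1 if -1 in labels else 0)
print(f"eps candidate: {eps_candidate:.3f}")
print(f"Clusters found: {n_found}, noise points: {(labels == -1).sum()}")
```

With this choice, about 95% of the points qualify as core points, so at most about 5% can end up labeled as noise. A lower percentile makes the clustering stricter and flags more outliers.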

In [None]:
# DBSCAN requires eps and min_samples parameters
# eps: Maximum distance between two samples for them to be considered neighbors
# min_samples: Minimum number of samples (including the point itself) within eps
#              for a point to count as a core point

# These are starting values; rerun with different eps values and compare the results
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)

# Add cluster labels to RFM data
rfm_data['DBSCAN_Cluster'] = dbscan_labels

# Count clusters (excluding outliers labeled as -1)
n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_outliers = list(dbscan_labels).count(-1)

print(f"DBSCAN Clustering Results (eps={dbscan.eps}, min_samples={dbscan.min_samples})")
print("=" * 50)
print(f"Number of clusters: {n_clusters}")
print(f"Number of outliers: {n_outliers} ({n_outliers/len(dbscan_labels)*100:.1f}%)")

# Calculate silhouette score (excluding outliers)
if n_clusters > 1:
    mask = dbscan_labels != -1
    if sum(mask) > n_clusters:  # Need enough points for silhouette
        silhouette_dbscan = silhouette_score(X_scaled[mask], dbscan_labels[mask])
        print(f"Silhouette Score (excluding outliers): {silhouette_dbscan:.3f}")

print("\nCluster sizes:")
print(rfm_data['DBSCAN_Cluster'].value_counts().sort_index())

## 9. Hierarchical Clustering

Hierarchical clustering creates a tree-like structure (dendrogram) showing cluster relationships.
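
A dendrogram becomes a flat clustering once you cut it, either by asking for a number of clusters or by choosing a height. The sketch below shows both with SciPy's `fcluster` on synthetic data. The data and the way `cut_height` is chosen are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only)
X_toy, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 0], [5, 9]],
                      cluster_std=0.8, random_state=0)
Z = linkage(X_toy, method="ward")

# Option 1: ask for a fixed number of clusters
labels_k = fcluster(Z, t=3, criterion="maxclust")

# Option 2: cut at a height between the merge that leaves 3 clusters (Z[-3])
# and the merge that reduces them to 2 (Z[-2])
cut_height = (Z[-3, 2] + Z[-2, 2]) / 2
labels_h = fcluster(Z, t=cut_height, criterion="distance")

print(f"Cut height: {cut_height:.2f}")
print(f"Cluster sizes (maxclust): {np.bincount(labels_k)[1:]}")
print(f"Cluster sizes (distance): {np.bincount(labels_h)[1:]}")
```

The third column of the linkage matrix holds the merge heights, so a large jump between consecutive heights marks a natural place to cut. A fixed height taken from one dataset will not transfer to another, because the heights depend on the scale and size of the data.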

In [None]:
# Create dendrogram (using subset for visualization)
# Using all data can be too dense to visualize
sample_size = min(1000, len(X_scaled))
sample_indices = np.random.choice(len(X_scaled), sample_size, replace=False)
X_sample = X_scaled[sample_indices]

# Calculate linkage matrix
linkage_matrix = linkage(X_sample, method='ward')

# Plot dendrogram
plt.figure(figsize=(15, 7))
dendrogram(linkage_matrix, truncate_mode='lastp', p=30, 
           leaf_font_size=10, show_contracted=True)
plt.xlabel('Sample Index or (Cluster Size)')
plt.ylabel('Distance')
plt.title(f'Hierarchical Clustering Dendrogram (sample of {sample_size} customers)')
plt.axhline(y=10, color='red', linestyle='--', label='Potential cut height')
plt.legend()
plt.show()

print("Dendrogram shows hierarchical cluster structure")
print("Red line suggests a potential cut height for cluster formation")

In [None]:
# Apply Agglomerative Clustering with optimal K
hierarchical = AgglomerativeClustering(n_clusters=optimal_k, linkage='ward')
hierarchical_labels = hierarchical.fit_predict(X_scaled)

# Add cluster labels to RFM data
rfm_data['Hierarchical_Cluster'] = hierarchical_labels

# Calculate metrics
silhouette_hier = silhouette_score(X_scaled, hierarchical_labels)
davies_bouldin_hier = davies_bouldin_score(X_scaled, hierarchical_labels)

print(f"Hierarchical Clustering with K={optimal_k}")
print("=" * 50)
print(f"Silhouette Score: {silhouette_hier:.3f}")
print(f"Davies-Bouldin Index: {davies_bouldin_hier:.3f}")
print("\nCluster sizes:")
print(rfm_data['Hierarchical_Cluster'].value_counts().sort_index())

## 10. Gaussian Mixture Model (GMM)

GMM provides probabilistic (soft) clustering where each customer has a probability of belonging to each cluster.
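
The difference between soft and hard assignment is easiest to see on overlapping groups. This sketch fits a two-component mixture to synthetic data and counts the points whose strongest membership probability is below 0.9. The data, the 0.9 cutoff, and names such as `toy_gmm` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two overlapping synthetic groups (illustrative only)
X_toy, _ = make_blobs(n_samples=300, centers=[[0, 0], [3, 0]],
                      cluster_std=1.0, random_state=0)
toy_gmm = GaussianMixture(n_components=2, random_state=0, n_init=5).fit(X_toy)

proba = toy_gmm.predict_proba(X_toy)   # soft assignment: one probability per component
hard = toy_gmm.predict(X_toy)          # hard assignment: most probable component

# Points near the overlap have no dominant component
uncertain = int((proba.max(axis=1) < 0.9).sum())
print(f"Each row of probabilities sums to 1: {np.allclose(proba.sum(axis=1), 1.0)}")
print(f"Points with maximum probability below 0.9: {uncertain} of {len(X_toy)}")
```

In a marketing context, those uncertain customers sit between two segments and could reasonably receive either treatment. A hard clustering method such as K-Means hides that ambiguity.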

In [None]:
# Fit Gaussian Mixture Model
gmm = GaussianMixture(n_components=optimal_k, random_state=42, n_init=10)
gmm.fit(X_scaled)
gmm_labels = gmm.predict(X_scaled)

# Get probability of belonging to each cluster
gmm_proba = gmm.predict_proba(X_scaled)

# Add cluster labels to RFM data
rfm_data['GMM_Cluster'] = gmm_labels

# Calculate metrics
silhouette_gmm = silhouette_score(X_scaled, gmm_labels)
davies_bouldin_gmm = davies_bouldin_score(X_scaled, gmm_labels)

print(f"Gaussian Mixture Model with K={optimal_k}")
print("=" * 50)
print(f"Silhouette Score: {silhouette_gmm:.3f}")
print(f"Davies-Bouldin Index: {davies_bouldin_gmm:.3f}")
print(f"BIC Score: {gmm.bic(X_scaled):.2f} (lower is better)")
print(f"AIC Score: {gmm.aic(X_scaled):.2f} (lower is better)")
print("\nCluster sizes:")
print(rfm_data['GMM_Cluster'].value_counts().sort_index())

In [None]:
# Examine cluster assignment confidence
max_probabilities = gmm_proba.max(axis=1)

plt.figure(figsize=(10, 5))
plt.hist(max_probabilities, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Maximum Cluster Probability')
plt.ylabel('Number of Customers')
plt.title('GMM Cluster Assignment Confidence')
plt.axvline(max_probabilities.mean(), color='red', linestyle='--',
            label=f'Mean: {max_probabilities.mean():.3f}')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.show()

print(f"Average cluster assignment confidence: {max_probabilities.mean():.3f}")
print(f"Customers with >90% confidence: {(max_probabilities > 0.9).sum()} "
      f"({(max_probabilities > 0.9).sum()/len(max_probabilities)*100:.1f}%)")

### Exercise 4: Comparing Clustering Methods

1. Create a comparison table of silhouette scores for all methods
2. Which method produces the most balanced cluster sizes?
3. For which business scenarios would you prefer GMM over K-Means?

In [None]:
# Your code here


## 11. Dimensionality Reduction for Visualization

We'll use PCA and t-SNE to visualize our 3D clusters in 2D space.
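
How much a 2D PCA view can be trusted depends on how much variance the first two components keep. The sketch below illustrates this on hypothetical three-feature data in which two features are strongly correlated, so little is lost by dropping the third component. The data and all names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: features 1 and 2 share a common signal, feature 3 is independent
base = rng.normal(size=500)
X_toy = np.column_stack([
    base + 0.1 * rng.normal(size=500),
    base + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
])
X_toy = StandardScaler().fit_transform(X_toy)

pca_full = PCA(n_components=3).fit(X_toy)   # all components: ratios sum to 1
pca_2d = PCA(n_components=2).fit(X_toy)     # the 2D projection used for plotting

retained = pca_2d.explained_variance_ratio_.sum()
print(f"Variance ratio per component: {np.round(pca_full.explained_variance_ratio_, 3)}")
print(f"Variance retained in 2D: {retained:.1%}")
```

Frequency and Monetary are often correlated in RFM data, so a similar pattern is plausible here, but check the printed ratio rather than assume it. t-SNE has no comparable measure of retained variance.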

In [None]:
# PCA - Linear dimensionality reduction
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

print("PCA Results:")
print("=" * 50)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.2%}")
print(f"\nPCA components shape: {X_pca.shape}")

In [None]:
# t-SNE - Non-linear dimensionality reduction
# Note: t-SNE can be slow for large datasets
print("Running t-SNE (this may take a minute for large datasets)...")
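# The iteration count is left at its default (1000); the keyword for it was
# renamed from n_iter to max_iter in recent scikit-learn releases
tsne = TSNE(n_components=2, random_state=42, perplexity=30)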
X_tsne = tsne.fit_transform(X_scaled)
print(f"t-SNE components shape: {X_tsne.shape}")
print("t-SNE complete!")

In [None]:
# Visualize K-Means clusters using PCA and t-SNE
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# PCA visualization
scatter1 = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], 
                          c=rfm_data['KMeans_Cluster'], 
                          cmap='viridis', alpha=0.6, s=50)
axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
axes[0].set_title('K-Means Clusters (PCA Projection)')
plt.colorbar(scatter1, ax=axes[0], label='Cluster')

# t-SNE visualization
scatter2 = axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], 
                          c=rfm_data['KMeans_Cluster'], 
                          cmap='viridis', alpha=0.6, s=50)
axes[1].set_xlabel('t-SNE Component 1')
axes[1].set_ylabel('t-SNE Component 2')
axes[1].set_title('K-Means Clusters (t-SNE Projection)')
plt.colorbar(scatter2, ax=axes[1], label='Cluster')

plt.tight_layout()
plt.show()

print("PCA keeps the directions of largest variance (global structure); "
      "t-SNE emphasizes local neighborhoods, so distances between far-apart groups are not meaningful")

## 12. Cluster Profiling and Analysis

Now let's analyze what characterizes each cluster and give them business-meaningful names.

In [None]:
# Calculate cluster profiles using K-Means results
cluster_profile = rfm_data.groupby('KMeans_Cluster').agg({
    'Recency': ['mean', 'median'],
    'Frequency': ['mean', 'median'],
    'Monetary': ['mean', 'median', 'sum'],
    'CustomerID': 'count'
}).round(2)

# Flatten column names
cluster_profile.columns = ['_'.join(col).strip() for col in cluster_profile.columns.values]
cluster_profile = cluster_profile.rename(columns={'CustomerID_count': 'Customer_Count'})

# Add percentage of total customers
cluster_profile['Percentage'] = (cluster_profile['Customer_Count'] / 
                                 cluster_profile['Customer_Count'].sum() * 100).round(1)

# Add revenue percentage
cluster_profile['Revenue_Percentage'] = (cluster_profile['Monetary_sum'] / 
                                         cluster_profile['Monetary_sum'].sum() * 100).round(1)

print("Cluster Profile Summary:")
print("=" * 100)
cluster_profile

In [None]:
# Visualize cluster characteristics
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

clusters = sorted(rfm_data['KMeans_Cluster'].unique())

# Recency by cluster
recency_means = [rfm_data[rfm_data['KMeans_Cluster']==c]['Recency'].mean() for c in clusters]
axes[0, 0].bar(clusters, recency_means, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Cluster')
axes[0, 0].set_ylabel('Average Recency (days)')
axes[0, 0].set_title('Average Recency by Cluster')
axes[0, 0].set_xticks(clusters)

# Frequency by cluster
frequency_means = [rfm_data[rfm_data['KMeans_Cluster']==c]['Frequency'].mean() for c in clusters]
axes[0, 1].bar(clusters, frequency_means, edgecolor='black', alpha=0.7, color='orange')
axes[0, 1].set_xlabel('Cluster')
axes[0, 1].set_ylabel('Average Frequency (orders)')
axes[0, 1].set_title('Average Frequency by Cluster')
axes[0, 1].set_xticks(clusters)

# Monetary by cluster
monetary_means = [rfm_data[rfm_data['KMeans_Cluster']==c]['Monetary'].mean() for c in clusters]
axes[0, 2].bar(clusters, monetary_means, edgecolor='black', alpha=0.7, color='green')
axes[0, 2].set_xlabel('Cluster')
axes[0, 2].set_ylabel('Average Monetary (£)')
axes[0, 2].set_title('Average Monetary Value by Cluster')
axes[0, 2].set_xticks(clusters)

# Customer count by cluster
customer_counts = rfm_data['KMeans_Cluster'].value_counts().sort_index()
axes[1, 0].bar(customer_counts.index, customer_counts.values, edgecolor='black', alpha=0.7, color='red')
axes[1, 0].set_xlabel('Cluster')
axes[1, 0].set_ylabel('Number of Customers')
axes[1, 0].set_title('Customer Count by Cluster')
axes[1, 0].set_xticks(clusters)

# Revenue contribution by cluster
revenue_by_cluster = rfm_data.groupby('KMeans_Cluster')['Monetary'].sum().sort_index()
axes[1, 1].bar(revenue_by_cluster.index, revenue_by_cluster.values, 
               edgecolor='black', alpha=0.7, color='purple')
axes[1, 1].set_xlabel('Cluster')
axes[1, 1].set_ylabel('Total Revenue (£)')
axes[1, 1].set_title('Total Revenue by Cluster')
axes[1, 1].set_xticks(clusters)

# Revenue percentage pie chart
axes[1, 2].pie(revenue_by_cluster.values, labels=[f'Cluster {i}' for i in clusters],
               autopct='%1.1f%%', startangle=90)
axes[1, 2].set_title('Revenue Distribution Across Clusters')

plt.tight_layout()
plt.show()

In [None]:
# Assign business-meaningful names to clusters
# Based on RFM characteristics, we'll name each cluster

# Calculate normalized scores for easier interpretation
# Lower recency is better, so we invert it
cluster_summary = rfm_data.groupby('KMeans_Cluster').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary': 'mean',
    'CustomerID': 'count'
})

# Create scoring system (higher is better)
cluster_summary['Recency_Score'] = 1 / (cluster_summary['Recency'] + 1)  # Inverse for recency
cluster_summary['Frequency_Score'] = cluster_summary['Frequency']
cluster_summary['Monetary_Score'] = cluster_summary['Monetary']

# Normalize scores to 0-1 range
for col in ['Recency_Score', 'Frequency_Score', 'Monetary_Score']:
    min_val = cluster_summary[col].min()
    max_val = cluster_summary[col].max()
    cluster_summary[f'{col}_Norm'] = (cluster_summary[col] - min_val) / (max_val - min_val)

print("Normalized Cluster Scores (0=Worst, 1=Best):")
print("=" * 70)
print(cluster_summary[['Recency_Score_Norm', 'Frequency_Score_Norm', 'Monetary_Score_Norm']])

# Define cluster names based on characteristics
# You may need to adjust these based on your actual data
cluster_names = {}
for cluster_id in cluster_summary.index:
    r_score = cluster_summary.loc[cluster_id, 'Recency_Score_Norm']
    f_score = cluster_summary.loc[cluster_id, 'Frequency_Score_Norm']
    m_score = cluster_summary.loc[cluster_id, 'Monetary_Score_Norm']
    
    # Classification logic
    if r_score > 0.7 and f_score > 0.7 and m_score > 0.7:
        cluster_names[cluster_id] = "Champions"
    elif r_score > 0.6 and f_score > 0.5:
        cluster_names[cluster_id] = "Loyal Customers"
    elif m_score > 0.7:
        cluster_names[cluster_id] = "Big Spenders"
    elif r_score < 0.3 and f_score < 0.3:
        cluster_names[cluster_id] = "Lost Customers"
    elif r_score < 0.4:
        cluster_names[cluster_id] = "At Risk"
    else:
        cluster_names[cluster_id] = "Potential Loyalists"

# Apply names to dataframe
rfm_data['Segment_Name'] = rfm_data['KMeans_Cluster'].map(cluster_names)

print("\nCluster Names:")
for cluster_id, name in cluster_names.items():
    print(f"Cluster {cluster_id}: {name}")

## 13. Business Recommendations by Segment

Based on our cluster analysis, let's develop targeted strategies for each customer segment.

In [None]:
# Create comprehensive segment profiles
segment_profiles = rfm_data.groupby('Segment_Name').agg({
    'CustomerID': 'count',
    'Recency': ['mean', 'median'],
    'Frequency': ['mean', 'median'],
    'Monetary': ['mean', 'median', 'sum']
}).round(2)

segment_profiles.columns = ['_'.join(col).strip() for col in segment_profiles.columns.values]
segment_profiles = segment_profiles.rename(columns={'CustomerID_count': 'Customer_Count'})

# Calculate percentages
segment_profiles['Pct_Customers'] = (
    segment_profiles['Customer_Count'] / segment_profiles['Customer_Count'].sum() * 100
).round(1)

segment_profiles['Pct_Revenue'] = (
    segment_profiles['Monetary_sum'] / segment_profiles['Monetary_sum'].sum() * 100
).round(1)

# Average historical spend per customer, used here as a simple proxy for
# customer lifetime value (it does not forecast future purchases)
segment_profiles['Avg_CLV'] = segment_profiles['Monetary_mean']

print("Detailed Segment Profiles:")
print("=" * 100)
segment_profiles

In [None]:
# Define marketing strategies for each segment
marketing_strategies = {
    "Champions": {
        "Priority": "HIGH",
        "Objective": "Retention and Advocacy",
        "Strategies": [
            "Exclusive VIP rewards and early access to new products",
            "Personalized thank-you messages and recognition",
            "Referral program incentives",
            "Request reviews and testimonials",
            "Upsell premium products and services"
        ],
        "Budget_Allocation": "25-30%",
        "Expected_ROI": "High"
    },
    "Loyal Customers": {
        "Priority": "HIGH",
        "Objective": "Increase spending and frequency",
        "Strategies": [
            "Loyalty points program",
            "Cross-sell complementary products",
            "Exclusive member-only discounts",
            "Educational content about products",
            "Subscription or bundle offers"
        ],
        "Budget_Allocation": "20-25%",
        "Expected_ROI": "High"
    },
    "Big Spenders": {
        "Priority": "HIGH",
        "Objective": "Increase frequency of purchases",
        "Strategies": [
            "Personalized product recommendations",
            "Time-limited premium offers",
            "Concierge-style customer service",
            "Bundle deals on high-value items",
            "Free shipping and premium delivery options"
        ],
        "Budget_Allocation": "15-20%",
        "Expected_ROI": "Medium-High"
    },
    "Potential Loyalists": {
        "Priority": "MEDIUM",
        "Objective": "Convert to loyal customers",
        "Strategies": [
            "Onboarding email series",
            "First-purchase discount for next order",
            "Product education and tutorials",
            "Engagement campaigns (contests, surveys)",
            "Introduce loyalty program benefits"
        ],
        "Budget_Allocation": "15-20%",
        "Expected_ROI": "Medium"
    },
    "At Risk": {
        "Priority": "MEDIUM-HIGH",
        "Objective": "Re-engagement and retention",
        "Strategies": [
            "Win-back email campaigns",
            "Special 'we miss you' discounts (15-20%)",
            "Survey to understand why they left",
            "Highlight new products or improvements",
            "Limited-time reactivation offers"
        ],
        "Budget_Allocation": "10-15%",
        "Expected_ROI": "Medium"
    },
    "Lost Customers": {
        "Priority": "LOW",
        "Objective": "Cost-effective reactivation",
        "Strategies": [
            "Automated low-cost email campaigns",
            "Aggressive discounts (25-30%) if high historical value",
            "Survey for feedback and improvements",
            "Remarketing ads with special offers",
            "Consider removing from active lists if no response"
        ],
        "Budget_Allocation": "5-10%",
        "Expected_ROI": "Low-Medium"
    }
}

# Display strategies
for segment, strategy in marketing_strategies.items():
    if segment in segment_profiles.index:
        print(f"\n{'='*80}")
        print(f"SEGMENT: {segment}")
        print(f"{'='*80}")
        print(f"Priority: {strategy['Priority']}")
        print(f"Objective: {strategy['Objective']}")
        print(f"Budget Allocation: {strategy['Budget_Allocation']} of marketing budget")
        print(f"Expected ROI: {strategy['Expected_ROI']}")
        print(f"\nCustomers: {segment_profiles.loc[segment, 'Customer_Count']:.0f} "
              f"({segment_profiles.loc[segment, 'Pct_Customers']:.1f}% of total)")
        print(f"Revenue: £{segment_profiles.loc[segment, 'Monetary_sum']:,.2f} "
              f"({segment_profiles.loc[segment, 'Pct_Revenue']:.1f}% of total)")
        print("\nRecommended Strategies:")
        for i, strat in enumerate(strategy['Strategies'], 1):
            print(f"  {i}. {strat}")

### Exercise 5: Marketing Strategy Development

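1. Calculate the total marketing budget allocation percentage across all segments. Each allocation is a range, so pick a convention first (for example, the midpoint of each range); the total should sum to ~100%
2. If you have a £100,000 marketing budget, how much would you allocate to each segment?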
3. Which segment offers the best balance of size and revenue potential?
4. Design a specific email campaign for the "At Risk" segment

In [None]:
# Your code here


## 14. Key Insights and Action Items

Let's summarize the most important findings and next steps.

In [None]:
# Generate executive summary
print("CUSTOMER SEGMENTATION - EXECUTIVE SUMMARY")
print("=" * 80)
print(f"\nAnalysis Date: {datetime.now().strftime('%Y-%m-%d')}")
print(f"Total Customers Analyzed: {len(rfm_data):,}")
print(f"Total Revenue: £{rfm_data['Monetary'].sum():,.2f}")
print(f"Average Customer Value: £{rfm_data['Monetary'].mean():,.2f}")
print(f"\nNumber of Segments Identified: {optimal_k}")
print("Clustering Method: K-Means")
print(f"Silhouette Score: {silhouette_avg:.3f}")

print(f"\n{'='*80}")
print("SEGMENT BREAKDOWN")
print(f"{'='*80}\n")

for segment in segment_profiles.index:
    count = segment_profiles.loc[segment, 'Customer_Count']
    pct = segment_profiles.loc[segment, 'Pct_Customers']
    revenue = segment_profiles.loc[segment, 'Monetary_sum']
    rev_pct = segment_profiles.loc[segment, 'Pct_Revenue']
    avg_value = segment_profiles.loc[segment, 'Monetary_mean']
    
    print(f"{segment}:")
    print(f"  - Customers: {count:.0f} ({pct:.1f}%)")
    print(f"  - Revenue: £{revenue:,.2f} ({rev_pct:.1f}%)")
    print(f"  - Avg Customer Value: £{avg_value:,.2f}")
    print()

print(f"{'='*80}")
print("TOP 3 PRIORITY ACTIONS")
print(f"{'='*80}")
print("1. PROTECT HIGH-VALUE SEGMENTS")
print("   → Launch VIP program for Champions and Big Spenders")
print("   → Implement personalized communication strategy")
print("\n2. REDUCE CHURN IN AT-RISK SEGMENT")
print("   → Immediate win-back campaign with special offers")
print("   → Conduct customer satisfaction surveys")
print("\n3. GROW POTENTIAL LOYALISTS")
print("   → Onboarding program to increase engagement")
print("   → Introduce to loyalty rewards program")

print(f"\n{'='*80}")
print("RECOMMENDED NEXT STEPS")
print(f"{'='*80}")
print("□ Present findings to marketing team")
print("□ Develop segment-specific email templates")
print("□ Set up automated customer journey triggers based on segments")
print("□ Create dashboard to monitor segment migration")
print("□ A/B test different strategies within each segment")
print("□ Re-run segmentation quarterly to track changes")
print("□ Integrate segments into CRM system")
print("□ Train customer service team on segment characteristics")
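
The next steps above mention monitoring segment migration and re-running the segmentation quarterly. One simple way to summarize migration is a transition table built with `pd.crosstab`. The sketch below uses made-up customer IDs and labels; `previous` and `current` are hypothetical Series that stand in for segment assignments from two consecutive runs.

```python
import pandas as pd

customer_ids = [101, 102, 103, 104, 105]  # hypothetical customer IDs

# Hypothetical segment labels for the same customers in two consecutive quarters
previous = pd.Series(
    ["Champions", "At Risk", "Potential Loyalists", "At Risk", "Champions"],
    index=customer_ids, name="Previous",
)
current = pd.Series(
    ["Champions", "Lost Customers", "Champions", "At Risk", "At Risk"],
    index=customer_ids, name="Current",
)

# Rows: segment last quarter; columns: segment this quarter; cells: customer counts
migration = pd.crosstab(previous, current)

# Row-normalized version: share of each previous segment that ended up in each current segment
migration_pct = pd.crosstab(previous, current, normalize="index")

print(migration)
print(migration_pct.round(2))
```

Off-diagonal cells show movement between segments; a growing share flowing from any segment into "At Risk" or "Lost Customers" is the kind of signal a dashboard would surface.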

## 15. Summary and Key Takeaways

### What We Learned

In this comprehensive customer segmentation project, we:

1. **Performed RFM Analysis**: Calculated Recency, Frequency, and Monetary metrics for each customer
2. **Applied Multiple Clustering Algorithms**: 
   - K-Means for efficient partitioning
   - DBSCAN for density-based clustering
   - Hierarchical clustering to understand segment relationships
   - GMM for probabilistic clustering
3. **Determined Optimal K**: Used the elbow method and silhouette analysis
4. **Visualized Clusters**: Applied PCA and t-SNE for 2D visualization
5. **Created Business Segments**: Translated clusters into actionable customer segments
6. **Developed Marketing Strategies**: Designed targeted campaigns for each segment

### Key Concepts

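- **RFM Analysis**: Scores each customer on how recently, how often, and how much they purchase
- **Feature Scaling**: Critical for distance-based algorithms such as K-Means; without it, features with large numeric ranges (e.g., Monetary) dominate the distance calculation
- **Cluster Validation**: No single metric is definitive; the silhouette score (higher is better), Davies-Bouldin index (lower is better), and Calinski-Harabasz index (higher is better) each measure cohesion and separation differently
- **Dimensionality Reduction**: PCA is a linear projection that preserves global variance, while t-SNE is nonlinear and preserves local neighborhoods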
- **Business Translation**: Technical clusters must map to actionable segments
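
Two of the concepts above, feature scaling and cluster validation, can be illustrated on a small synthetic dataset. The sketch below is an assumption-laden toy: the data are randomly generated with two known groups that differ only in Recency and Frequency, and the `agreement` helper is hypothetical and only valid for two clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

rng = np.random.default_rng(42)

# Synthetic RFM-like data (made up): the two groups differ in Recency and Frequency,
# while Monetary is uninformative noise on a much larger numeric scale
n = 100
recency = np.concatenate([rng.normal(10, 2, n), rng.normal(60, 2, n)])
frequency = np.concatenate([rng.normal(20, 2, n), rng.normal(3, 1, n)])
monetary = rng.normal(5000, 2000, 2 * n)
X = np.column_stack([recency, frequency, monetary])
true_groups = np.repeat([0, 1], n)

def agreement(labels, truth):
    """Share of points matching the known groups, ignoring label order (two clusters only)."""
    match = np.mean(labels == truth)
    return max(match, 1 - match)

# K-Means on raw features: distances are dominated by the Monetary column
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# K-Means on standardized features: every column contributes comparably
X_scaled = StandardScaler().fit_transform(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)

raw_agreement = agreement(raw_labels, true_groups)
scaled_agreement = agreement(scaled_labels, true_groups)
print(f"Agreement with known groups, raw features:    {raw_agreement:.2f}")
print(f"Agreement with known groups, scaled features: {scaled_agreement:.2f}")

# Three validation metrics for the scaled solution
sil = silhouette_score(X_scaled, scaled_labels)
db = davies_bouldin_score(X_scaled, scaled_labels)
ch = calinski_harabasz_score(X_scaled, scaled_labels)
print(f"Silhouette: {sil:.3f} | Davies-Bouldin: {db:.3f} | Calinski-Harabasz: {ch:.1f}")
```

Real customer data has no known groups to compare against, which is why the internal validation metrics matter; the synthetic labels here exist only to make the effect of scaling visible.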

### Practical Applications

Customer segmentation enables:
- **Personalized Marketing**: Tailor messages to segment characteristics
- **Resource Optimization**: Focus budget on high-value/high-potential segments
- **Churn Prevention**: Identify and act on at-risk customers
- **Product Development**: Understand needs of different customer types
- **Pricing Strategy**: Segment-based pricing and promotions

### Next Steps

To extend this analysis:
1. **Predictive Modeling**: Build a classifier to assign new customers to segments
2. **Temporal Analysis**: Track how customers move between segments over time
3. **Advanced Features**: Include product categories, channel preferences, geography
4. **A/B Testing**: Measure effectiveness of segment-specific strategies
5. **Customer Lifetime Value**: Predict CLV for each segment
6. **Cohort Analysis**: Combine segmentation with cohort tracking
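
As a starting point for the predictive modeling step above, the sketch below trains a classifier to reproduce K-Means segment labels from raw RFM values. Everything in it is illustrative: `rfm_demo` is synthetic data standing in for the real RFM table, the choice of four clusters and a random forest is an assumption, and `new_customer` is a hypothetical record.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = ["Recency", "Frequency", "Monetary"]

# Synthetic RFM table (values are made up for illustration)
rfm_demo = pd.DataFrame({
    "Recency": rng.integers(1, 365, 300),
    "Frequency": rng.integers(1, 50, 300),
    "Monetary": rng.gamma(2.0, 500.0, 300),
})

# Step 1: derive segment labels by scaling the features and running K-Means
scaled = StandardScaler().fit_transform(rfm_demo[features])
rfm_demo["Segment"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(scaled)

# Step 2: train a classifier to predict those labels from the raw RFM values
X_train, X_test, y_train, y_test = train_test_split(
    rfm_demo[features], rfm_demo["Segment"], test_size=0.25, random_state=42
)
clf = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Hold-out accuracy: {accuracy:.2f}")

# Step 3: assign a new (hypothetical) customer to a segment
new_customer = pd.DataFrame({"Recency": [12], "Frequency": [8], "Monetary": [950.0]})
predicted_segment = clf.predict(new_customer)[0]
print("Predicted segment:", predicted_segment)
```

A fitted K-Means model can also assign new points directly with its `predict` method; a separate classifier becomes more useful when segments are defined by business rules layered on top of the clusters, or when extra features are available at prediction time.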

### Additional Resources

- [RFM Analysis Guide](https://www.putler.com/rfm-analysis/)
- [Scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)
- [Customer Segmentation Best Practices](https://www.optimove.com/resources/learning-center/customer-segmentation)
- [Marketing Analytics with Python](https://www.datacamp.com/courses/marketing-analytics-with-python)

---

**Congratulations!** You've completed a comprehensive customer segmentation project. You now have the skills to segment customers, interpret clusters, and develop data-driven marketing strategies.