## **5. Customer Segmentation Modeling**

### **5.1 Overview**
This notebook segments customers with K-means clustering on RFM (Recency, Frequency, Monetary) features. The logic is packaged in the `src/clustering.py` module so the same functions can be called from production code.

**Modeling Steps:**
1. Load customer features
2. Scale RFM features
3. Find optimal number of clusters (Elbow Method, Silhouette Score)
4. Perform K-means clustering
5. Create cluster profiles and segment names
6. Visualize results

**Production Usage:**
```python
from src.clustering import run_clustering_pipeline
customer_segments = run_clustering_pipeline(customer_features, n_clusters=3)
```
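The modeling steps listed above map onto a handful of scikit-learn calls. The block below is a minimal sketch of what a pipeline of this kind could look like. The function name `sketch_clustering_pipeline`, the fixed `random_state`, and the synthetic data are assumptions for illustration; they are not the contents of `src/clustering.py`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

RFM_COLUMNS = ["Recency", "Frequency", "Monetary"]


def sketch_clustering_pipeline(features: pd.DataFrame, n_clusters: int = 3) -> pd.DataFrame:
    """Hypothetical stand-in for a clustering pipeline: scale RFM, fit K-means, label rows."""
    scaled = StandardScaler().fit_transform(features[RFM_COLUMNS])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    result = features.copy()
    result["Cluster"] = model.fit_predict(scaled)
    return result


# Synthetic customers, only to show the call shape (not the notebook's data).
rng = np.random.default_rng(0)
demo_features = pd.DataFrame({
    "Recency": rng.integers(1, 700, size=200),
    "Frequency": rng.integers(1, 50, size=200),
    "Monetary": rng.gamma(2.0, 300.0, size=200),
})
demo_segments = sketch_clustering_pipeline(demo_features, n_clusters=3)
print(demo_segments["Cluster"].value_counts().sort_index())
```

The sketch returns a copy with a `Cluster` column rather than mutating its input, which keeps the original feature table reusable across runs with different `n_clusters`.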

### **5.2 Load Customer Features**

In [1]:
# Import libraries and load customer features
import pandas as pd
import numpy as np
import sys
import os

# Add src directory to path for importing our module
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))

# Load customer features
df = pd.read_csv('../data/processed/Customer_RFM_Features.csv')
print(f"Customer features shape: {df.shape}")
print("\nFeature columns:")
print(df.columns.tolist())
df.head()

Customer features shape: (13196, 9)

Feature columns:
['CustomerID', 'Recency', 'Frequency', 'Monetary', 'TotalItems', 'UniqueProducts', 'Country', 'AvgOrderValue', 'ItemsPerOrder']


|   | CustomerID | Recency | Frequency | Monetary | TotalItems | UniqueProducts | Country | AvgOrderValue | ItemsPerOrder |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12347.0 | 96 | 7 | 3412.53 | 1905 | 97 | Iceland | 487.504286 | 272.142857 |
| 1 | 12348.0 | 221 | 3 | 90.2 | 140 | 6 | Finland | 30.066667 | 46.666667 |
| 2 | 12349.0 | 698 | 1 | 1197.15 | 547 | 64 | Italy | 1197.15 | 547.0 |
| 3 | 12350.0 | 312 | 1 | 294.4 | 196 | 16 | Norway | 294.4 | 196.0 |
| 4 | 12352.0 | 275 | 7 | 1147.44 | 502 | 50 | Norway | 163.92 | 71.714286 |


### **5.3 Feature Scaling**

In [2]:
# Select and scale RFM features for clustering
from feature_engineering import get_rfm_feature_columns
from sklearn.preprocessing import StandardScaler

# Get RFM features
rfm_features = get_rfm_feature_columns()
print(f"RFM features for clustering: {rfm_features}")

# Extract RFM data
X = df[rfm_features].copy()
print(f"\nRFM data shape: {X.shape}")

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f"Scaled features shape: {X_scaled.shape}")

# Display scaled features
scaled_df = pd.DataFrame(X_scaled, columns=rfm_features)
print("\nSample scaled features:")
print(scaled_df.head())

RFM features for clustering: ['Recency', 'Frequency', 'Monetary']

RFM data shape: (13196, 3)
Scaled features shape: (13196, 3)

Sample scaled features:
    Recency  Frequency  Monetary
0 -0.253606  -0.464574  2.144055
1  0.345852  -1.092186 -0.228514
2  2.633386  -1.405993  0.561990
3  0.782258  -1.405993 -0.082690
4  0.604818  -0.464574  0.526490
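To make the scaling step concrete, the sketch below checks that `StandardScaler` is equivalent to subtracting each column's mean and dividing by its population standard deviation. It uses the first four RFM rows shown in 5.2 as a toy input, so the scaled values differ from the output above, which is computed over all 13,196 customers. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First four (Recency, Frequency, Monetary) rows from the table in 5.2.
X_demo = np.array([
    [96.0, 7.0, 3412.53],
    [221.0, 3.0, 90.20],
    [698.0, 1.0, 1197.15],
    [312.0, 1.0, 294.40],
])

scaled_demo = StandardScaler().fit_transform(X_demo)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for ndarray.std().
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled_demo, manual))
print(scaled_demo.mean(axis=0).round(6), scaled_demo.std(axis=0).round(6))
```

Scaling matters here because K-means uses Euclidean distance: without it, `Monetary` (values in the thousands) would dominate `Frequency` (values mostly in single or double digits).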


### **5.4 Find Optimal Clusters**

In [3]:
# Find optimal number of clusters using our modular function
from clustering import find_optimal_clusters, plot_elbow_method, plot_silhouette_scores

cluster_range, wcss, silhouette_scores = find_optimal_clusters(X_scaled)
print(f"Cluster range: {cluster_range}")
print(f"WCSS values: {wcss}")
print(f"Silhouette scores: {silhouette_scores}")

# Note: Functions available in src/clustering.py

Cluster range: [2, 3, 4, 5, 6, 7, 8, 9, 10]
WCSS values: [25066.98151184842, 18354.503070359628, 13178.526585619384, 9548.786432350447, 7645.093517026175, 6222.63444908569, 5150.244140859604, 4654.606970967973, 4223.284876301929]
Silhouette scores: [0.6085455122757862, 0.613783132954403, 0.5269359971457813, 0.5419525948894539, 0.5422866418688904, 0.43481850733726873, 0.45683103715378864, 0.4585345608270636, 0.4061609859885725]
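The source of `find_optimal_clusters` is not shown in this notebook. A minimal sketch of a function with the same return shape, assuming it fits one K-means model per candidate `k` and records the inertia (WCSS) and the mean silhouette score, might look like this. The name `sketch_find_optimal_clusters`, its parameters, and the synthetic blobs are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def sketch_find_optimal_clusters(X, k_min=2, k_max=10, random_state=42):
    """Hypothetical sketch: return (cluster_range, wcss, silhouette_scores)."""
    cluster_range = list(range(k_min, k_max + 1))
    wcss, scores = [], []
    for k in cluster_range:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        wcss.append(model.inertia_)  # within-cluster sum of squares
        scores.append(silhouette_score(X, model.labels_))
    return cluster_range, wcss, scores


# Synthetic data with three well-separated blobs, for illustration only.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)
ks, wcss_demo, sil_demo = sketch_find_optimal_clusters(X_demo, k_max=6)
best_k = ks[int(np.argmax(sil_demo))]
print(f"Best k by silhouette: {best_k}")
```

Computing the silhouette score requires pairwise distances, so on larger datasets it can be much slower than the K-means fit itself; `silhouette_score` accepts a `sample_size` argument for that case.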


In [4]:
# # Plot Elbow Method
# plot_elbow_method(cluster_range, wcss, save_path='../data/processed/images/elbow_method_notebook.png')

# # Plot Silhouette Scores
# plot_silhouette_scores(cluster_range, silhouette_scores, 
#                       save_path='../data/processed/images/silhouette_scores_notebook.png')

# Determine optimal k (highest silhouette score)
optimal_k = cluster_range[np.argmax(silhouette_scores)]
print(f"Optimal number of clusters: {optimal_k} (Silhouette Score: {max(silhouette_scores):.3f})")

Optimal number of clusters: 3 (Silhouette Score: 0.614)
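For reference, the two quantities compared above have standard definitions. With clusters $C_1, \dots, C_K$ and centroids $\mu_1, \dots, \mu_K$:

$$
\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
$$

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$

Here $a(i)$ is the mean distance from point $i$ to the other points in its own cluster, and $b(i)$ is the smallest mean distance from point $i$ to the points of any other cluster. The silhouette score reported for each $k$ is the mean of $s(i)$ over all points and lies in $[-1, 1]$. WCSS always tends to fall as $k$ grows, which is why the elbow method looks for a bend rather than a minimum. Note that the scores for $k = 2$ (0.6085) and $k = 3$ (0.6138) differ by only about 0.005, so the preference for $k = 3$ is a narrow one.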


### **5.5 Perform K-means Clustering**

In [5]:
# Perform clustering using our modular function
from clustering import perform_kmeans_clustering

optimal_k = 3  # Based on silhouette analysis
kmeans = perform_kmeans_clustering(X_scaled, optimal_k)

# Add cluster assignments to dataframe
df['Cluster'] = kmeans.labels_
print(f"Clustering completed with {optimal_k} clusters")
print("\nCluster distribution:")
print(df['Cluster'].value_counts().sort_index())

# Note: Function available in src/clustering.py

Clustering completed with 3 clusters

Cluster distribution:
Cluster
0     1930
1    11259
2        7
Name: count, dtype: int64
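The distribution above is heavily skewed: cluster 2 holds 7 of 13,196 customers (about 0.05%). K-means commonly isolates a few extreme points this way when features contain large outliers. The sketch below reproduces the effect on synthetic data and reports each cluster's share. `sketch_perform_kmeans` is a hypothetical stand-in, assuming `perform_kmeans_clustering` wraps scikit-learn's `KMeans`.

```python
import numpy as np
from sklearn.cluster import KMeans


def sketch_perform_kmeans(X, n_clusters, random_state=42):
    """Hypothetical sketch: fit K-means and return the fitted model."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)


# Synthetic data: two ordinary groups plus a handful of extreme points.
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(0.0, 0.3, size=(500, 3)),
    rng.normal(3.0, 0.3, size=(100, 3)),
    rng.normal(40.0, 0.3, size=(5, 3)),
])
model = sketch_perform_kmeans(X_demo, n_clusters=3)

# Count rows per cluster and express each count as a share of the total.
sizes = np.bincount(model.labels_, minlength=3)
shares = sizes / sizes.sum()
for cluster_id, (n, share) in enumerate(zip(sizes, shares)):
    print(f"Cluster {cluster_id}: {n} rows ({share:.1%})")
```

A size check like this is a cheap way to flag clusters that may be too small to act on as a segment.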


### **5.6 Create Cluster Profiles**

In [7]:
# Create cluster profiles using our modular function
from clustering import create_cluster_profiles

cluster_profile = create_cluster_profiles(df)
print("Cluster Profiles:")
cluster_profile

# Note: Function available in src/clustering.py

Cluster Profiles:


| Cluster | Recency | Frequency | Monetary | TotalItems | UniqueProducts | AvgOrderValue | ItemsPerOrder | Count |
|---|---|---|---|---|---|---|---|---|
| 2 | 24.0 | 106.571429 | 38371.682857 | 20779.714286 | 1027.714286 | 636.288499 | 299.538391 | 7 |
| 0 | 607.935751 | 1.957513 | 419.407416 | 244.2 | 30.621244 | 243.053078 | 142.679258 | 1930 |
| 1 | 70.269829 | 11.27276 | 385.009413 | 198.742339 | 25.667466 | 61.685331 | 34.288669 | 11259 |
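The profile values above are consistent with per-cluster means plus a row count. A minimal sketch of such a profile with `groupby`, assuming that is roughly what `create_cluster_profiles` does, is shown below. The function name and the toy rows are made up for illustration.

```python
import pandas as pd

PROFILE_COLUMNS = ["Recency", "Frequency", "Monetary"]


def sketch_cluster_profiles(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: per-cluster feature means plus a customer count."""
    profile = df.groupby("Cluster")[PROFILE_COLUMNS].mean()
    profile["Count"] = df.groupby("Cluster").size()
    return profile


# Toy rows, not the notebook's data.
demo = pd.DataFrame({
    "Cluster":   [0, 0, 1, 1, 1],
    "Recency":   [600, 620, 60, 70, 80],
    "Frequency": [1, 3, 10, 12, 14],
    "Monetary":  [400.0, 440.0, 300.0, 380.0, 460.0],
})
demo_profile = sketch_cluster_profiles(demo)
print(demo_profile)
```

Means are sensitive to outliers, so for skewed features such as `Monetary` it can be worth reporting the median alongside the mean.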


### **5.7 Define Segment Names**

In [8]:
# Define segment names based on RFM characteristics
from clustering import define_segment_names, assign_segments

segment_names = define_segment_names(cluster_profile)
print("Segment Names:")
for cluster_id, segment_name in segment_names.items():
    print(f"Cluster {cluster_id}: {segment_name}")

# Assign segments to customers
df = assign_segments(df, segment_names)
print("\nSegment distribution:")
print(df['Segment'].value_counts())

# Note: Functions available in src/clustering.py

Segment Names:
Cluster 2: Regular Customers
Cluster 0: At-Risk Customers
Cluster 1: Recent Engaged Customers

Segment distribution:
Segment
Recent Engaged Customers    11259
At-Risk Customers            1930
Regular Customers               7
Name: count, dtype: int64
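Once a cluster-to-name mapping exists, attaching names to customers is a dictionary lookup. The sketch below assumes `assign_segments` behaves like a `Series.map` over the `Cluster` column; `sketch_assign_segments` and the customer rows are hypothetical. The naming rules inside `define_segment_names` are not shown in this notebook, so the sketch takes the mapping as given.

```python
import pandas as pd


def sketch_assign_segments(df: pd.DataFrame, segment_names: dict) -> pd.DataFrame:
    """Hypothetical sketch: add a Segment column by mapping cluster ids to names."""
    # Fail loudly if a cluster has no name, instead of silently producing NaN.
    missing = set(df["Cluster"].unique()) - set(segment_names)
    if missing:
        raise KeyError(f"No segment name for cluster(s): {sorted(missing)}")
    result = df.copy()
    result["Segment"] = result["Cluster"].map(segment_names)
    return result


# Names copied from the output above; the customer rows are made up.
names = {2: "Regular Customers", 0: "At-Risk Customers", 1: "Recent Engaged Customers"}
demo = pd.DataFrame({"CustomerID": [1, 2, 3, 4], "Cluster": [1, 0, 1, 2]})
labeled = sketch_assign_segments(demo, names)
print(labeled["Segment"].value_counts())
```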


### **5.8 Visualize Segments**


In [None]:
# # Create RFM segment visualizations
# from visualization import plot_rfm_segments, plot_cluster_scatter

# # Plot RFM characteristics by segment
# plot_rfm_segments(df, save_path='../data/processed/images/rfm_segments_notebook.png')

# # Plot cluster scatter plots
# plot_cluster_scatter(df, 'Frequency', 'Monetary', 
#                    save_path='../data/processed/images/frequency_monetary_scatter_notebook.png')

# plot_cluster_scatter(df, 'Recency', 'Monetary', 
#                    save_path='../data/processed/images/recency_monetary_scatter_notebook.png')

# Note: Functions available in src/visualization.py

### **5.9 Save Results and Models**

In [None]:
# # Save clustering results and models
# from clustering import save_model

# # Save customer segments
# df.to_csv('../data/processed/Customer_Segments.csv', index=False)
# print("Customer segments saved to: ../data/processed/Customer_Segments.csv")

# # Save trained models
# save_model(kmeans, scaler, 
#           '../models/kmeans_customer_segmentation_notebook.pkl',
#           '../models/scaler_customer_segmentation_notebook.pkl')

# print("\nModels saved successfully!")
# print(f"Final segments shape: {df.shape}")
# print(f"Segments created: {df['Segment'].nunique()}")
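The save cell above is commented out and the body of `save_model` is not shown. As a rough sketch of the idea, the block below pickles a fitted K-means model and its scaler to separate files and reloads them. `sketch_save_model`, the file names, and the use of `pickle` (rather than, say, `joblib`) are assumptions.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def sketch_save_model(model, scaler, model_path, scaler_path):
    """Hypothetical sketch of a save helper: pickle each object to its own file."""
    for obj, path in ((model, model_path), (scaler, scaler_path)):
        Path(path).parent.mkdir(parents=True, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(obj, f)


# Synthetic training data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
scaler_demo = StandardScaler().fit(X_demo)
kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit(
    scaler_demo.transform(X_demo)
)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "models" / "kmeans_demo.pkl"
    scaler_path = Path(tmp) / "models" / "scaler_demo.pkl"
    sketch_save_model(kmeans_demo, scaler_demo, model_path, scaler_path)
    with open(model_path, "rb") as f:
        kmeans_loaded = pickle.load(f)
    with open(scaler_path, "rb") as f:
        scaler_loaded = pickle.load(f)

# New data must pass through the saved scaler before prediction.
same = np.array_equal(
    kmeans_loaded.predict(scaler_loaded.transform(X_demo)),
    kmeans_demo.predict(scaler_demo.transform(X_demo)),
)
print(same)
```

Saving the scaler together with the model matters because cluster centroids live in scaled space; predicting on unscaled features would assign customers to the wrong clusters. Pickle files should only be loaded from trusted sources.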