# E-commerce Customer Segmentation - Complete ML Pipeline

This notebook demonstrates the complete end-to-end unsupervised learning pipeline for customer segmentation, combining all modular components in execution sequence.

## Project Overview
- **Task**: Unsupervised clustering for customer segmentation
- **Dataset**: E-commerce behavioral data from SQLite database (1M events, 146K users)
- **Algorithms**: K-means, DBSCAN, Hierarchical clustering
- **Goal**: Identify distinct customer groups based on purchasing patterns and behavior

## Workflow Overview
1. **Setup and Imports** - Import libraries and configure project
2. **Configuration** - Set up directories and verify settings
3. **Data Loading** - Extract event-level data from SQLite database
4. **Exploratory Data Analysis** - Understand data distribution and patterns
5. **Data Aggregation** - Aggregate events to customer-level features
6. **Feature Engineering** - Create RFM, behavioral, and temporal features
7. **Feature Selection** - Select features suitable for clustering
8. **Preprocessing** - Handle outliers, scale, and transform data
9. **Optimal Cluster Selection** - Determine best number of clusters
10. **Clustering Model Training** - Train K-means with the selected number of clusters
11. **Cluster Evaluation** - Evaluate clustering quality
12. **Cluster Analysis and Visualization** - Profile, visualize, and interpret each cluster
13. **Evaluation Report** - Generate a markdown report of metrics and cluster profiles
14. **Summary** - Review findings and business insights


## 1. Setup and Imports

**Justification**: Import all necessary libraries and set up the project structure. This ensures all dependencies are available and the project root is in the Python path for module imports. We use pandas for data manipulation, sklearn for clustering algorithms, and matplotlib/seaborn for visualizations.


In [None]:
# Import system and path utilities
import sys
from pathlib import Path

# Set project root for imports
project_root = Path().absolute().parent.parent
sys.path.insert(0, str(project_root))

# Standard libraries
import pandas as pd
import numpy as np
import joblib
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
plt.style.use('seaborn-v0_8')
sns.set_palette("Set2")

# Machine Learning - Clustering
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.decomposition import PCA

# Project modules
from config.config import (
    ensure_directories, MODEL_PATHS, DIRECTORIES, 
    DATABASE_PATH, DB_TABLE_NAME, DATA_CONFIG,
    KMEANS_CONFIG, PREPROCESSING_CONFIG, FEATURE_CONFIG
)
from src.data.data_loader import load_and_prepare_data
from src.data.data_aggregator import aggregate_data
from src.preprocessing.feature_engineering import engineer_features, select_features_for_clustering
from src.preprocessing.preprocessor import ClusteringPreprocessor
from src.clustering.cluster_selector import ClusterSelector
from src.clustering.cluster_trainer import ClusteringTrainer
from src.evaluation.cluster_evaluator import ClusterEvaluator
from src.utils.visualizations import (
    plot_pca_clusters, plot_cluster_comparison, plot_feature_distributions_by_cluster,
    plot_correlation_matrix, plot_elbow_curve, plot_silhouette_scores
)
from src.utils.logger import setup_logger

print("✓ All imports successful")
print(f"Project root: {project_root}")
print(f"Database path: {DATABASE_PATH}")


## 2. Configuration Setup

**Justification**: Ensure all necessary directories exist (models, logs, reports, etc.) and verify configuration. This is critical for saving artifacts and logs throughout the pipeline. We also set up logging to track the pipeline execution.


In [None]:
# Create all necessary directories
ensure_directories()

# Setup logger
logger = setup_logger("notebook_pipeline")

# Display configuration
print("Configuration:")
print(f"  - Database table: {DB_TABLE_NAME}")
print(f"  - Chunk size: {DATA_CONFIG.get('chunk_size', 100000):,}")
print(f"  - K-means n_clusters range: {KMEANS_CONFIG.get('n_clusters_range', [2, 10])}")
print(f"  - Preprocessing scaling: {PREPROCESSING_CONFIG.get('scaling_method', 'robust')}")
print(f"  - Use PCA: {PREPROCESSING_CONFIG.get('use_pca', False)}")
print("\n✓ Configuration loaded")
print(f"✓ Directories created: {list(DIRECTORIES.values())}")


## 3. Data Loading

**Justification**: Load event-level data from the SQLite database. The data loader handles:
- Database connection with chunking for memory efficiency (1M+ rows)
- Type conversion (event_time to datetime, numeric columns)
- Data validation (missing columns, invalid values)
- Basic statistics logging

This is the raw event-level data where each row represents a user interaction (view, cart, purchase).
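
The project's `load_and_prepare_data` is not shown in this notebook, so the following is only a minimal sketch of the chunked-read pattern described above. It assumes pandas' `read_sql_query` with `chunksize`; the in-memory `events` table, its rows, and the helper name `load_events_in_chunks` are made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table standing in for the real events database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_time TEXT, event_type TEXT, user_id INTEGER, price TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2019-10-01 00:00:00", "view", 1, "10.5"),
        ("2019-10-01 00:05:00", "cart", 1, "10.5"),
        ("2019-10-01 01:00:00", "purchase", 2, "99.0"),
        ("2019-10-01 02:00:00", "view", 3, "bad"),
    ],
)
conn.commit()


def load_events_in_chunks(connection, table_name, chunk_size=2):
    """Read a table chunk by chunk, converting types on each chunk."""
    chunks = []
    query = f"SELECT * FROM {table_name}"
    for chunk in pd.read_sql_query(query, connection, chunksize=chunk_size):
        chunk["event_time"] = pd.to_datetime(chunk["event_time"], errors="coerce")
        # Unparseable prices become NaN instead of raising an error.
        chunk["price"] = pd.to_numeric(chunk["price"], errors="coerce")
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)


events = load_events_in_chunks(conn, "events")
conn.close()
print(events.dtypes)
print(f"Rows with invalid price: {events['price'].isna().sum()}")
```

Converting types per chunk keeps peak memory close to one chunk plus the accumulated result, which is the main reason to chunk a read of this size.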


In [None]:
# Load data from database
print("Loading event-level data from database...")
print(f"This may take a few minutes for large datasets (1M+ events)...")

df = load_and_prepare_data(
    db_path=DATABASE_PATH,
    table_name=DB_TABLE_NAME,
    chunk_size=DATA_CONFIG.get("chunk_size", 100000),
    max_rows=None  # Load all data
)

print(f"\n✓ Data loaded successfully")
print(f"  - Shape: {df.shape}")
print(f"  - Columns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()


## 4. Exploratory Data Analysis (EDA)

**Justification**: Understanding the data is crucial before modeling. EDA helps identify:
- Event type distribution (view, cart, purchase funnel)
- Temporal patterns (hourly activity, peak times)
- User behavior patterns (engagement levels, spending)
- Data quality issues (missing values, outliers)
- Product and category preferences

This analysis informs feature engineering and helps understand the business context.


In [None]:
# Basic statistics
print("=== Data Overview ===")
print(f"Total events: {len(df):,}")
print(f"Date range: {df['event_time'].min()} to {df['event_time'].max()}")

print(f"\n=== Missing Values ===")
missing = df.isnull().sum()
print(missing[missing > 0] if missing.sum() > 0 else "No missing values")

print(f"\n=== Event Type Distribution ===")
event_counts = df['event_type'].value_counts()
print(event_counts)
print(f"\nEvent type proportions:")
print(df['event_type'].value_counts(normalize=True))

# Visualize event type distribution
plt.figure(figsize=(10, 6))
event_counts.plot(kind='bar', color='steelblue', alpha=0.7)
plt.title('Event Type Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Event Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate conversion funnel
if 'view' in event_counts.index and 'cart' in event_counts.index and 'purchase' in event_counts.index:
    view_to_cart = (event_counts.get('cart', 0) / event_counts.get('view', 1)) * 100
    cart_to_purchase = (event_counts.get('purchase', 0) / event_counts.get('cart', 1)) * 100
    view_to_purchase = (event_counts.get('purchase', 0) / event_counts.get('view', 1)) * 100
    
    print(f"\n=== Conversion Funnel ===")
    print(f"View → Cart: {view_to_cart:.2f}%")
    print(f"Cart → Purchase: {cart_to_purchase:.2f}%")
    print(f"View → Purchase: {view_to_purchase:.2f}%")


In [None]:
# Temporal patterns
print("=== Temporal Analysis ===")
df['hour'] = df['event_time'].dt.hour
df['day_of_week'] = df['event_time'].dt.day_name()

# Hourly activity
hourly_activity = df.groupby('hour').size()
print(f"\nPeak activity hour: {hourly_activity.idxmax()}:00 ({hourly_activity.max():,} events)")

# Visualize hourly activity
plt.figure(figsize=(12, 6))
hourly_activity.plot(kind='line', marker='o', linewidth=2, markersize=8)
plt.title('Hourly Activity Pattern', fontsize=14, fontweight='bold')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Events')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Event types by hour
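fig, ax = plt.subplots(figsize=(14, 6))
hourly_by_type = df.groupby(['hour', 'event_type']).size().unstack(fill_value=0)
# Pass ax explicitly: DataFrame.plot otherwise opens a new figure and leaves this one empty
hourly_by_type.plot(kind='bar', stacked=False, width=0.8, ax=ax)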
plt.title('Event Types by Hour', fontsize=14, fontweight='bold')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Events')
plt.legend(title='Event Type')
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()


In [None]:
# User and product statistics
print("=== User and Product Statistics ===")
print(f"Unique users: {df['user_id'].nunique():,}")
print(f"Unique products: {df['product_id'].nunique():,}")
print(f"Unique categories: {df['category_id'].nunique():,}")
print(f"Unique sessions: {df['user_session'].nunique():,}")

# User engagement distribution
user_event_counts = df.groupby('user_id').size()
print(f"\n=== User Engagement Distribution ===")
print(f"Average events per user: {user_event_counts.mean():.2f}")
print(f"Median events per user: {user_event_counts.median():.2f}")
print(f"Max events per user: {user_event_counts.max()}")

# Visualize user engagement
plt.figure(figsize=(12, 6))
plt.hist(user_event_counts, bins=50, color='steelblue', alpha=0.7, edgecolor='black')
plt.title('User Engagement Distribution (Events per User)', fontsize=14, fontweight='bold')
plt.xlabel('Number of Events')
plt.ylabel('Number of Users')
plt.yscale('log')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Spending statistics (for purchases)
purchase_data = df[df['event_type'] == 'purchase']
if len(purchase_data) > 0:
    print(f"\n=== Purchase Statistics ===")
    print(f"Total purchases: {len(purchase_data):,}")
    # Each purchase row is a single item, so these are per-item prices, not per-order totals
    print(f"Average purchase price: ${purchase_data['price'].mean():.2f}")
    print(f"Median purchase price: ${purchase_data['price'].median():.2f}")
    print(f"Total revenue: ${purchase_data['price'].sum():,.2f}")
    
    # Price distribution
    plt.figure(figsize=(12, 6))
    plt.hist(purchase_data['price'], bins=50, color='green', alpha=0.7, edgecolor='black')
    plt.title('Purchase Price Distribution', fontsize=14, fontweight='bold')
    plt.xlabel('Price ($)')
    plt.ylabel('Frequency')
    plt.yscale('log')
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()


## 5. Data Aggregation

**Justification**: Aggregate event-level data to customer-level features. This transformation is essential because:
- Clustering operates on customers, not individual events
- We need to capture each customer's overall behavior pattern
- Features include: event counts, spending metrics, engagement levels, temporal patterns, diversity metrics

The aggregation creates one row per customer with comprehensive behavioral features.
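
As a rough illustration of the event-to-customer transformation (not the project's `aggregate_data`, whose internals are not shown here), the sketch below uses a pandas `groupby` on a toy event log. The column names mirror those used later in this notebook. Defining the conversion rate as purchases divided by total events, and counting all products a user interacted with as "viewed", are assumptions made for the example.

```python
import pandas as pd

# Toy event log; values are invented for illustration.
events = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 3],
        "event_type": ["view", "cart", "purchase", "view", "view", "purchase"],
        "product_id": [10, 10, 10, 11, 12, 13],
        "price": [20.0, 20.0, 20.0, 5.0, 7.0, 50.0],
    }
)

is_purchase = events["event_type"] == "purchase"
events = events.assign(
    is_purchase=is_purchase.astype(int),
    # Only purchase rows contribute to spending.
    purchase_amount=events["price"].where(is_purchase, 0.0),
)

# One row per customer: activity counts, spend, and product diversity.
customers = events.groupby("user_id").agg(
    total_events=("event_type", "size"),
    purchase_count=("is_purchase", "sum"),
    total_spending=("purchase_amount", "sum"),
    unique_products_viewed=("product_id", "nunique"),
)
# Assumed definition: share of a customer's events that are purchases.
customers["purchase_conversion_rate"] = (
    customers["purchase_count"] / customers["total_events"]
)
customers = customers.reset_index()
print(customers)
```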


In [None]:
# Aggregate events to customer level
print("Aggregating events to customer-level features...")
print("This may take a few minutes...")

customer_df = aggregate_data(df, include_rfm=True)

print(f"\n✓ Aggregation completed")
print(f"  - Original events: {len(df):,}")
print(f"  - Unique customers: {len(customer_df):,}")
print(f"  - Features created: {len(customer_df.columns)}")

print(f"\nCustomer-level features:")
print(list(customer_df.columns))

print(f"\nFirst few rows:")
customer_df.head()


In [None]:
# Summary statistics of customer features
print("=== Customer Feature Statistics ===")
print(customer_df.describe())

# Visualize key customer metrics
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Total events distribution
customer_df['total_events'].hist(bins=50, ax=axes[0, 0], color='steelblue', alpha=0.7)
axes[0, 0].set_title('Total Events per Customer', fontweight='bold')
axes[0, 0].set_xlabel('Total Events')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_yscale('log')

# Purchase count distribution
customer_df['purchase_count'].hist(bins=50, ax=axes[0, 1], color='green', alpha=0.7)
axes[0, 1].set_title('Purchase Count per Customer', fontweight='bold')
axes[0, 1].set_xlabel('Purchase Count')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_yscale('log')

# Total spending distribution (for customers with purchases)
spending_customers = customer_df[customer_df['total_spending'] > 0]
if len(spending_customers) > 0:
    spending_customers['total_spending'].hist(bins=50, ax=axes[1, 0], color='orange', alpha=0.7)
    axes[1, 0].set_title('Total Spending per Customer (Purchasers)', fontweight='bold')
    axes[1, 0].set_xlabel('Total Spending ($)')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].set_yscale('log')

# Conversion rate distribution
customer_df['purchase_conversion_rate'].hist(bins=50, ax=axes[1, 1], color='purple', alpha=0.7)
axes[1, 1].set_title('Purchase Conversion Rate', fontweight='bold')
axes[1, 1].set_xlabel('Conversion Rate')
axes[1, 1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()


## 6. Feature Engineering

**Justification**: Create advanced features that capture customer behavior patterns:
- **RFM Features**: Recency, Frequency, Monetary scores for classic customer segmentation
- **Behavioral Features**: Engagement intensity, loyalty scores, activity levels
- **Temporal Features**: Peak activity hours, session patterns
- **Engagement Features**: Exploration vs focus, diversity metrics
- **Price Sensitivity**: Price preference, tolerance, sensitivity metrics

These features help the clustering algorithm identify distinct customer segments with meaningful behavioral differences.
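
The RFM scores mentioned above can be sketched with quantile binning. This is one common way to score RFM, not necessarily what `engineer_features` does. The data, the three-bin choice, and the helper `quantile_score` are hypothetical.

```python
import pandas as pd

# Hypothetical per-customer purchase summary.
purchases = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4, 5, 6],
        "last_purchase": pd.to_datetime(
            ["2019-10-30", "2019-10-01", "2019-10-15",
             "2019-10-29", "2019-10-05", "2019-10-20"]
        ),
        "purchase_count": [8, 1, 3, 5, 1, 2],
        "total_spending": [400.0, 20.0, 90.0, 250.0, 15.0, 60.0],
    }
)

# Recency is measured in days before a snapshot date just after the last event.
snapshot = purchases["last_purchase"].max() + pd.Timedelta(days=1)
purchases["recency_days"] = (snapshot - purchases["last_purchase"]).dt.days


def quantile_score(values, n_bins=3, higher_is_better=True):
    """Score values from 1 (worst) to n_bins (best) by quantile.

    Ranking first breaks ties, so qcut never sees duplicate bin edges.
    """
    ranked = values.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranked, q=n_bins, labels=False) + 1


# Fewer days since the last purchase is better, so recency is scored in reverse.
purchases["recency_score"] = quantile_score(
    purchases["recency_days"], higher_is_better=False
)
purchases["frequency_score"] = quantile_score(purchases["purchase_count"])
purchases["monetary_score"] = quantile_score(purchases["total_spending"])
print(purchases[["user_id", "recency_score", "frequency_score", "monetary_score"]])
```

Ranking before binning is a design choice for skewed data: purchase counts often have many ties at 1, which would otherwise make `pd.qcut` fail on duplicate bin edges.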


In [None]:
# Engineer additional features
print("Engineering additional features...")

engineered_df = engineer_features(
    customer_df,
    include_behavioral=True,
    include_temporal=True,
    include_engagement=True,
    include_price=True
)

print(f"\n✓ Feature engineering completed")
print(f"  - Original features: {len(customer_df.columns)}")
print(f"  - Engineered features: {len(engineered_df.columns)}")
print(f"  - New features added: {len(engineered_df.columns) - len(customer_df.columns)}")

# Show new features
new_features = [col for col in engineered_df.columns if col not in customer_df.columns]
print(f"\nNew features created: {new_features}")


## 7. Feature Selection

**Justification**: Select only numeric features suitable for clustering. We exclude:
- `user_id` (identifier, not a feature)
- Categorical variables (will be handled separately if needed)
- Features with infinite or invalid values

This ensures the clustering algorithm receives clean, numeric input.
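
A minimal sketch of the selection rules listed above, assuming numeric columns are picked by dtype and invalid values are imputed with the column median. The helper `select_numeric_features` and the sample data are hypothetical and are not the project's `select_features_for_clustering`.

```python
import numpy as np
import pandas as pd

customers = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "segment_label": ["a", "b", "a"],            # categorical, excluded
        "total_events": [10, 3, 7],
        "avg_order_value": [25.0, np.inf, np.nan],   # contains invalid values
    }
)


def select_numeric_features(df, exclude_cols=("user_id",)):
    """Keep numeric columns, drop identifiers, and impute invalid values."""
    numeric = df.select_dtypes(include=[np.number])
    numeric = numeric.drop(columns=list(exclude_cols), errors="ignore")
    # Treat infinities as missing so they can be imputed with the median.
    numeric = numeric.replace([np.inf, -np.inf], np.nan)
    return numeric.fillna(numeric.median())


features = select_numeric_features(customers)
print(features)
```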


In [None]:
# Select features for clustering
print("Selecting features for clustering...")

features_df = select_features_for_clustering(engineered_df, exclude_cols=["user_id"])

print(f"\n✓ Feature selection completed")
print(f"  - Selected features: {len(features_df.columns)}")
print(f"  - Features: {list(features_df.columns)[:10]}...")  # Show first 10

# Check for infinite or invalid values
print(f"\n=== Data Quality Check ===")
print(f"Rows with infinite values: {(np.isinf(features_df).any(axis=1)).sum()}")
print(f"Rows with NaN values: {features_df.isnull().any(axis=1).sum()}")

# Replace infinite values with NaN, then fill
features_df = features_df.replace([np.inf, -np.inf], np.nan)
features_df = features_df.fillna(features_df.median())

print(f"✓ Data cleaned")
print(f"  - Final shape: {features_df.shape}")


## 8. Preprocessing

**Justification**: Preprocess data for clustering:
- **Outlier Treatment**: Use IQR method to cap extreme values that could distort clusters
- **Scaling**: Apply RobustScaler (more robust to outliers than StandardScaler) to normalize features
- **Feature Selection**: Remove highly correlated features and low-variance features
- **Optional PCA**: Can reduce dimensionality while retaining most variance

Preprocessing ensures features are on similar scales and clustering isn't dominated by features with large values.
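
The three main steps above can be sketched with pandas and scikit-learn's `RobustScaler`. This is an illustrative stand-in for `ClusteringPreprocessor`, whose implementation is not shown here. The helpers `cap_outliers_iqr` and `drop_correlated` and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
base = rng.normal(loc=50.0, scale=10.0, size=200)
features = pd.DataFrame(
    {
        "total_events": base,
        "total_events_copy": base * 2.0 + 1.0,  # perfectly correlated duplicate
        "total_spending": rng.exponential(scale=100.0, size=200),
    }
)
features.loc[0, "total_spending"] = 1e6  # an extreme outlier


def cap_outliers_iqr(df, multiplier=1.5):
    """Clip each column to [Q1 - m*IQR, Q3 + m*IQR]."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    return df.clip(lower=q1 - multiplier * iqr, upper=q3 + multiplier * iqr, axis=1)


def drop_correlated(df, threshold=0.95):
    """Drop the later column of any pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


capped = cap_outliers_iqr(features)
reduced = drop_correlated(capped)
scaled = pd.DataFrame(
    RobustScaler().fit_transform(reduced), columns=reduced.columns
)
print(f"Columns kept: {list(reduced.columns)}")
print(scaled.describe().loc[["50%", "max"]])
```

`RobustScaler` centers on the median and scales by the interquartile range, so a handful of extreme spenders shifts the scaled values far less than it would with mean and standard deviation.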


In [None]:
# Initialize and fit preprocessor
print("Preprocessing data...")
print(f"  - Scaling method: {PREPROCESSING_CONFIG.get('scaling_method', 'robust')}")
print(f"  - Correlation threshold: {PREPROCESSING_CONFIG.get('correlation_threshold', 0.95)}")
print(f"  - Variance threshold: {PREPROCESSING_CONFIG.get('variance_threshold', 0.01)}")

preprocessor = ClusteringPreprocessor(
    scaling_method=PREPROCESSING_CONFIG.get("scaling_method", "robust"),
    use_pca=PREPROCESSING_CONFIG.get("use_pca", False),
    correlation_threshold=PREPROCESSING_CONFIG.get("correlation_threshold", 0.95),
    variance_threshold=PREPROCESSING_CONFIG.get("variance_threshold", 0.01),
    outlier_iqr_multiplier=FEATURE_CONFIG.get("outlier_iqr_multiplier", 1.5)
)

X_processed = preprocessor.fit_transform(features_df)

print(f"\n✓ Preprocessing completed")
print(f"  - Original features: {features_df.shape[1]}")
print(f"  - Processed features: {X_processed.shape[1]}")
print(f"  - Features removed: {features_df.shape[1] - X_processed.shape[1]}")

# Save preprocessor
preprocessor.save(MODEL_PATHS["preprocessor"])
joblib.dump(preprocessor.feature_names_, MODEL_PATHS["feature_names"])
print(f"  - Preprocessor saved to: {MODEL_PATHS['preprocessor']}")

# Visualize feature distributions before and after scaling
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Before scaling (sample features)
sample_features = features_df.select_dtypes(include=[np.number]).iloc[:, :5]
sample_features.boxplot(ax=axes[0])
axes[0].set_title('Feature Distributions (Before Scaling)', fontweight='bold')
axes[0].set_ylabel('Value')
axes[0].tick_params(axis='x', rotation=45)

# After scaling
X_processed_df = pd.DataFrame(X_processed, columns=preprocessor.feature_names_)
X_processed_df.iloc[:, :5].boxplot(ax=axes[1])
axes[1].set_title('Feature Distributions (After Scaling)', fontweight='bold')
axes[1].set_ylabel('Scaled Value')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()


## 9. Optimal Cluster Selection

**Justification**: Determine the optimal number of clusters using multiple methods:
- **Elbow Method**: Find the "elbow" where inertia (within-cluster SS) stops decreasing significantly
- **Silhouette Score**: Higher is better; measures how well-separated clusters are
- **Davies-Bouldin Index**: Lower is better; measures average similarity between clusters
- **Calinski-Harabasz Index**: Higher is better; ratio of between-cluster to within-cluster variance

We use Silhouette Score as the primary method as it provides a good balance between cluster separation and cohesion.
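
To show how the four criteria are computed for each candidate K, here is a small sketch using scikit-learn directly on synthetic data. The blob centers are chosen for illustration only, and `ClusterSelector` may organize the same calculations differently.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data with 4 well-separated groups, standing in for X_processed.
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.6,
    random_state=42,
)

results = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = {
        "inertia": model.inertia_,                                    # elbow method
        "silhouette": silhouette_score(X, model.labels_),             # higher is better
        "davies_bouldin": davies_bouldin_score(X, model.labels_),     # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, model.labels_),  # higher is better
    }

best_k = max(results, key=lambda k: results[k]["silhouette"])
for k, scores in results.items():
    print(k, {name: round(value, 3) for name, value in scores.items()})
print(f"Best K by silhouette: {best_k}")
```

On real customer data the criteria often disagree, which is why the comparison table is worth inspecting before accepting the silhouette-based choice.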


In [None]:
# Determine optimal number of clusters
print("Determining optimal number of clusters...")
print(f"Testing K values: {KMEANS_CONFIG.get('n_clusters_range', list(range(2, 11)))}")

selector = ClusterSelector(
    n_clusters_range=KMEANS_CONFIG.get("n_clusters_range", list(range(2, 11))),
    random_state=KMEANS_CONFIG.get("random_state", 42),
    n_init=KMEANS_CONFIG.get("n_init", 10)
)

# Evaluate all methods
comparison_df = selector.evaluate_all_methods(X_processed)

print(f"\n✓ Cluster selection evaluation completed")
print("\nComparison of methods:")
print(comparison_df)

# Select optimal K using silhouette score
optimal_k = selector.select_optimal_k(X_processed, method="silhouette")
print(f"\n✓ Optimal number of clusters: {optimal_k} (selected using Silhouette Score)")

# Visualize cluster selection metrics
# (each plot below draws its own figure, so no shared subplot grid is created here)

# Elbow curve
if "elbow" in selector.scores_:
    elbow_scores = selector.scores_["elbow"]
    k_values = sorted(elbow_scores.keys())
    inertias = [elbow_scores[k] for k in k_values]
    plot_elbow_curve(k_values, inertias)

# Silhouette scores
if "silhouette" in selector.scores_:
    silhouette_scores = selector.scores_["silhouette"]
    k_values = sorted([k for k in silhouette_scores.keys() if silhouette_scores[k] > -1])
    scores = [silhouette_scores[k] for k in k_values]
    plot_silhouette_scores(k_values, scores)

# Davies-Bouldin scores
if "davies_bouldin" in selector.scores_:
    db_scores = selector.scores_["davies_bouldin"]
    k_values = sorted([k for k in db_scores.keys() if db_scores[k] < float("inf")])
    scores = [db_scores[k] for k in k_values]
    plt.figure(figsize=(10, 6))
    plt.plot(k_values, scores, 'ro-', linewidth=2, markersize=8)
    plt.xlabel("Number of Clusters (K)", fontsize=12)
    plt.ylabel("Davies-Bouldin Index", fontsize=12)
    plt.title("Davies-Bouldin Index for Different K Values", fontsize=14, fontweight="bold")
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()

# Save comparison
comparison_path = DIRECTORIES["reports"] / "cluster_selection_comparison.csv"
comparison_df.to_csv(comparison_path, index=False)
print(f"\n✓ Comparison saved to: {comparison_path}")


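## 10. Clustering Model Training

**Justification**: Train the final K-means model with the number of clusters selected in the previous step, save it for reuse, and attach the cluster labels to the customer-level data for the analysis that follows.


In [None]: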
# Train K-means clustering model
print(f"Training K-means clustering with {optimal_k} clusters...")

trainer = ClusteringTrainer(
    n_clusters=optimal_k,
    algorithm="kmeans",
    random_state=KMEANS_CONFIG.get("random_state", 42)
)

trainer.fit(X_processed)

print(f"\n✓ Model training completed")
print(f"  - Number of clusters: {len(np.unique(trainer.labels_))}")
print(f"  - Cluster sizes: {dict(zip(*np.unique(trainer.labels_, return_counts=True)))}")

# Save model
trainer.save(MODEL_PATHS["clusterer"])
print(f"  - Model saved to: {MODEL_PATHS['clusterer']}")

# Add cluster labels to customer dataframe for analysis
engineered_df['cluster'] = trainer.labels_
features_df['cluster'] = trainer.labels_

print(f"\nCluster distribution:")
print(engineered_df['cluster'].value_counts().sort_index())


## 11. Cluster Evaluation

**Justification**: Evaluate clustering quality using internal metrics:
- **Silhouette Score**: Measures how similar customers are to their own cluster vs other clusters (range: -1 to 1, higher is better)
- **Davies-Bouldin Index**: Average similarity ratio of clusters (lower is better)
- **Calinski-Harabasz Index**: Ratio of between-cluster to within-cluster variance (higher is better)

These metrics help validate that clusters are well-separated and internally cohesive.
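
Beyond the single overall score, per-sample silhouette values show which clusters contain poorly assigned points. The sketch below uses scikit-learn's `silhouette_samples` on synthetic data. It is an illustrative addition, not part of `ClusterEvaluator`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for the processed customer features.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 0], [3, 5]],
    cluster_std=1.0,
    random_state=0,
)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sample_scores = silhouette_samples(X, labels)  # one value per point, in [-1, 1]
overall = silhouette_score(X, labels)          # mean of the per-sample values

print(f"Overall silhouette: {overall:.3f}")
for cluster_id in np.unique(labels):
    cluster_scores = sample_scores[labels == cluster_id]
    negative_share = (cluster_scores < 0).mean()
    print(
        f"Cluster {cluster_id}: mean={cluster_scores.mean():.3f}, "
        f"share below 0={negative_share:.1%}"
    )
```

A cluster with a low mean or a large share of negative values is a candidate for merging or for a closer look at the features that define it.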


In [None]:
# Evaluate clustering results
print("Evaluating clustering results...")

evaluator = ClusterEvaluator(
    model=trainer.model,
    preprocessor=preprocessor,
    save_plots=True
)

metrics = evaluator.evaluate(X_processed, trainer.labels_, feature_names=preprocessor.feature_names_)

print(f"\n✓ Evaluation completed")
print(f"\n=== Evaluation Metrics ===")
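# Format each metric only if present; applying ':.4f' to the 'N/A' fallback would raise ValueError
for label, key in [
    ("Silhouette Score", "silhouette_score"),
    ("Davies-Bouldin Index", "davies_bouldin"),
    ("Calinski-Harabasz Index", "calinski_harabasz"),
]:
    value = metrics.get(key)
    print(f"{label}: {value:.4f}" if value is not None else f"{label}: N/A")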
if 'inertia' in metrics:
    print(f"Inertia: {metrics['inertia']:.2f}")
print(f"\nCluster sizes:")
for cluster, size in sorted(metrics['cluster_sizes'].items()):
    print(f"  Cluster {int(cluster)}: {size} customers ({size/len(trainer.labels_)*100:.1f}%)")

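# Generate cluster profiles (exclude the 'cluster' label column added after training)
profile_features = [col for col in features_df.columns if col != 'cluster']
profiles = evaluator.generate_cluster_profiles(
    features_df[profile_features],
    trainer.labels_,
    feature_names=profile_features
)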

print(f"\n✓ Cluster profiles generated")
print(f"\nCluster Profiles (mean feature values):")
print(profiles.head(10))  # Show first 10 features


## 12. Cluster Analysis and Visualization

**Justification**: Visualize and interpret clusters to understand:
- How clusters are separated in feature space (PCA visualization)
- Cluster size distribution
- Feature characteristics of each cluster
- Business interpretation of each segment

This helps translate clustering results into actionable business insights.
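
The 2D view relies on projecting the scaled features onto their first two principal components. As a minimal sketch with scikit-learn's `PCA` on invented data (the project's `plot_pca_clusters` helper is not shown here), it is worth checking how much variance the projection retains before reading too much into the plot:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical scaled feature matrix: 100 customers, 6 features.
X = rng.normal(size=(100, 6))
# Make two features co-vary so the first component captures shared structure.
X[:, 1] = X[:, 0] * 0.9 + rng.normal(scale=0.1, size=100)

pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X)

explained = pca.explained_variance_ratio_
print(f"Projected shape: {coords.shape}")
print(f"Variance explained by 2 components: {explained.sum():.1%}")
```

When the two components explain only a small share of the variance, clusters that overlap in the plot may still be well separated in the full feature space.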


In [None]:
# Visualize clusters in PCA space
print("Creating visualizations...")

# 2D PCA visualization
plot_pca_clusters(
    X_processed,
    trainer.labels_,
    n_components=2,
    save_path=DIRECTORIES["reports"] / "pca_clusters_2d.png",
    title=f"Customer Clusters in 2D PCA Space (K={optimal_k})"
)

# Cluster distribution
evaluator.plot_cluster_distribution(
    trainer.labels_,
    save_path=DIRECTORIES["reports"] / "cluster_distribution.png"
)

# Cluster profiles heatmap
evaluator.plot_cluster_profiles(
    profiles,
    save_path=DIRECTORIES["reports"] / "cluster_profiles.png",
    top_n_features=15
)

# Feature distributions by cluster
key_features = ['total_events', 'purchase_count', 'total_spending', 'purchase_conversion_rate', 
                'avg_order_value', 'unique_products_viewed']
available_features = [f for f in key_features if f in features_df.columns]

if len(available_features) > 0:
    plot_feature_distributions_by_cluster(
        features_df,
        trainer.labels_,
        available_features,
        n_features=len(available_features),
        save_path=DIRECTORIES["reports"] / "feature_distributions_by_cluster.png"
    )

print("✓ Visualizations created and saved")


In [None]:
# Detailed cluster analysis
print("\n=== Detailed Cluster Analysis ===")

for cluster_id in sorted(engineered_df['cluster'].unique()):
    cluster_data = engineered_df[engineered_df['cluster'] == cluster_id]
    print(f"\n--- Cluster {int(cluster_id)} ({len(cluster_data)} customers, {len(cluster_data)/len(engineered_df)*100:.1f}%) ---")
    
    # Key metrics
    print(f"  Average Events: {cluster_data['total_events'].mean():.1f}")
    print(f"  Average Purchases: {cluster_data['purchase_count'].mean():.1f}")
    if cluster_data['total_spending'].sum() > 0:
        print(f"  Average Spending: ${cluster_data[cluster_data['total_spending'] > 0]['total_spending'].mean():.2f}")
        print(f"  Total Revenue: ${cluster_data['total_spending'].sum():,.2f}")
    print(f"  Conversion Rate: {cluster_data['purchase_conversion_rate'].mean():.3f}")
    print(f"  Unique Products: {cluster_data['unique_products_viewed'].mean():.1f}")
    print(f"  Unique Categories: {cluster_data['unique_categories'].mean():.1f}")
    
    # Segment characteristics
    if 'recency_score' in cluster_data.columns:
        print(f"  RFM - Recency Score: {cluster_data['recency_score'].mean():.1f}")
        print(f"  RFM - Frequency Score: {cluster_data['frequency_score'].mean():.1f}")
        print(f"  RFM - Monetary Score: {cluster_data['monetary_score'].mean():.1f}")


## 13. Generate Evaluation Report

**Justification**: Create a comprehensive markdown report summarizing:
- Model configuration and parameters
- Evaluation metrics
- Cluster statistics and profiles
- Business interpretations and recommendations

This report serves as documentation for stakeholders and future reference.


In [None]:
# Generate and save evaluation report
print("Generating evaluation report...")

model_info = {
    "Algorithm": "K-means",
    "Number of Clusters": optimal_k,
    "Number of Features": X_processed.shape[1],
    "Number of Customers": len(customer_df),
    "Selection Method": "Silhouette Score",
    "Preprocessing": f"Scaling: {PREPROCESSING_CONFIG.get('scaling_method', 'robust')}, Correlation threshold: {PREPROCESSING_CONFIG.get('correlation_threshold', 0.95)}"
}

report = evaluator.generate_report(metrics, profiles, model_info)
evaluator.save_report(report, DIRECTORIES["reports"] / "cluster_report.md")

print(f"✓ Report generated and saved to: {DIRECTORIES['reports'] / 'cluster_report.md'}")
print("\nReport preview:")
print(report[:500] + "...")


## 14. Summary and Business Insights

**Justification**: Summarize key findings and provide actionable business recommendations based on the identified customer segments. This helps translate technical results into business value.


In [None]:
print("=" * 60)
print("PIPELINE SUMMARY")
print("=" * 60)

print(f"\n✓ Successfully segmented {len(customer_df):,} customers into {optimal_k} clusters")
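# Format the score only if present; applying ':.4f' to the 'N/A' fallback would raise ValueError
silhouette = metrics.get('silhouette_score')
silhouette_text = f"{silhouette:.4f}" if silhouette is not None else "N/A"
print(f"✓ Clustering Quality (Silhouette Score): {silhouette_text}")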

print(f"\n=== Key Findings ===")
print(f"1. Data Processed: {len(df):,} events from {customer_df['user_id'].nunique():,} customers")
print(f"2. Features Created: {len(engineered_df.columns)} customer-level features")
print(f"3. Optimal Clusters: {optimal_k} (determined using Silhouette Score)")
print(f"4. Cluster Quality: {'Excellent' if metrics.get('silhouette_score', 0) > 0.5 else 'Good' if metrics.get('silhouette_score', 0) > 0.3 else 'Fair'}")

print(f"\n=== Business Recommendations ===")
print("1. **High-Value Customers**: Focus retention efforts on clusters with high spending and purchase frequency")
print("2. **At-Risk Customers**: Identify clusters with declining engagement and implement re-engagement campaigns")
print("3. **New Customer Onboarding**: Tailor onboarding for clusters with low engagement but potential")
print("4. **Personalization**: Use cluster profiles to personalize product recommendations and marketing messages")
print("5. **Pricing Strategy**: Adjust pricing for clusters with high price sensitivity")

print(f"\n=== Next Steps ===")
print("1. Deploy clustering model for real-time customer segmentation")
print("2. Integrate with marketing automation for personalized campaigns")
print("3. Monitor cluster evolution over time")
print("4. A/B test marketing strategies by cluster")
print("5. Refine clusters with additional features or domain knowledge")

print("\n" + "=" * 60)
print("Pipeline completed successfully!")
print("=" * 60)
