# Machine Learning: Customer Segmentation

**Author:** Aleksandre Chakhvashvili
**Purpose:** Implement K-Means clustering and classification models for customer segmentation

## Objectives
- Implement K-Means clustering to segment customers
- Determine the optimal number of clusters using the Elbow method and Silhouette score
- Visualize and analyze customer segments
- Build classification models to predict customer segments
- Compare model performance

## Models Implemented
1. K-Means Clustering (Unsupervised)
2. Logistic Regression (Supervised)
3. Decision Tree Classifier (Supervised)

## 1. Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import warnings
warnings.filterwarnings('ignore')

# Add src directory to path
sys.path.append('../src')

# Import custom ML functions
from models import (
    prepare_clustering_features,
    find_optimal_clusters_elbow,
    calculate_silhouette_scores,
    plot_elbow_curve,
    plot_silhouette_scores,
    train_kmeans,
    visualize_clusters_2d,
    analyze_clusters,
    describe_clusters,
    prepare_classification_data,
    train_logistic_regression,
    train_decision_tree,
    evaluate_classifier,
    plot_confusion_matrix,
    compare_models
)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Load processed data
data_path = '../data/processed/mall_customers_processed.csv'
df = pd.read_csv(data_path)

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
df.head()

## 2. Part 1: K-Means Clustering

### 2.1 Feature Selection and Preparation

For customer segmentation, we will use:
- Annual Income (k$)
- Spending Score (1-100)

Together these two features describe purchasing power (income) and spending behavior (spending score), and a two-dimensional feature space lets us plot the resulting segments directly.
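
Because K-Means relies on Euclidean distance, features on different scales are usually standardized first. The helper `prepare_clustering_features` lives in `src/models` and is not shown in this notebook, so the following is only a minimal sketch of what such a step might look like, assuming it uses scikit-learn's `StandardScaler`. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows with the same column names as the notebook; values are illustrative only.
toy = pd.DataFrame({
    "Annual Income (k$)": [15, 40, 60, 90, 137],
    "Spending Score (1-100)": [39, 81, 50, 13, 83],
})

scaler = StandardScaler()
# fit_transform returns a NumPy array; wrap it to keep column names and index.
toy_scaled = pd.DataFrame(
    scaler.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(toy_scaled.round(3))
# After scaling, each column has mean ~0 and population standard deviation ~1.
print(toy_scaled.mean().abs().max())
print(toy_scaled.std(ddof=0).tolist())
```

Keeping the fitted `scaler` around matters: new customers would need to be transformed with the same mean and standard deviation before being assigned to a cluster.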

In [None]:
# Select features for clustering
clustering_features = ['Annual Income (k$)', 'Spending Score (1-100)']

print("Features selected for clustering:")
print(clustering_features)

# Display feature statistics
print("\nFeature Statistics:")
print(df[clustering_features].describe())

In [None]:
# Prepare and scale features
X_scaled, scaler = prepare_clustering_features(df, clustering_features)

print("\nScaled features:")
print(X_scaled.head())
print(f"\nScaled data shape: {X_scaled.shape}")

### 2.2 Determining Optimal Number of Clusters

We will use two methods to determine the optimal number of clusters:
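1. Elbow Method - plots WCSS (Within-Cluster Sum of Squares) against k; the "elbow" where the curve stops dropping sharply suggests a good k
2. Silhouette Score - measures how close each point is to its own cluster compared with the nearest other cluster; it ranges from -1 to 1, and higher is better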
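
The two methods above can be sketched with scikit-learn directly. This is not the code behind `find_optimal_clusters_elbow` or `calculate_silhouette_scores` (those are defined in `src/models`); it is a rough illustration on synthetic data from `make_blobs`, and the range of k starting at 2 is an assumption made here because the Silhouette score needs at least two clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with 4 groups (a stand-in, not the mall dataset).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

inertias = {}
silhouettes = {}
for k in range(2, 11):  # silhouette is undefined for a single cluster
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_  # WCSS for this k
    silhouettes[k] = silhouette_score(X, model.labels_)

for k in inertias:
    print(f"k={k}: WCSS={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")
```

WCSS keeps shrinking as k grows, so it cannot be minimized directly; the Silhouette score gives a second signal that penalizes splitting natural groups.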

In [None]:
# Calculate inertias for Elbow method
print("Calculating WCSS for different values of k...")
inertias = find_optimal_clusters_elbow(X_scaled, max_k=10)

print("\nWCSS values:")
for k, inertia in inertias.items():
    print(f"k={k}: {inertia:.2f}")

In [None]:
# Plot Elbow curve
plot_elbow_curve(
    inertias,
    save_path='../reports/figures/20_elbow_curve.png'
)

In [None]:
# Calculate Silhouette scores
print("Calculating Silhouette scores for different values of k...")
silhouette_scores = calculate_silhouette_scores(X_scaled, max_k=10)

print("\nSilhouette scores:")
for k, score in silhouette_scores.items():
    print(f"k={k}: {score:.4f}")

In [None]:
# Plot Silhouette scores
plot_silhouette_scores(
    silhouette_scores,
    save_path='../reports/figures/21_silhouette_scores.png'
)

### 2.3 Optimal Cluster Selection

Use the Elbow curve and Silhouette scores together to choose the number of clusters: look for the k where the WCSS curve flattens and the Silhouette score is high.
Customer segmentation commonly ends up with between 3 and 5 clusters.
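
As a complement to reading the plots, the choice can be roughed out programmatically from a dictionary shaped like `silhouette_scores`. The numbers in `example_scores` are made up for illustration and are not results from this dataset; the `tolerance` value is likewise an arbitrary assumption.

```python
# Made-up silhouette scores keyed by k, for illustration only.
example_scores = {2: 0.32, 3: 0.47, 4: 0.49, 5: 0.55, 6: 0.54, 7: 0.53}

# Pick the k with the highest average silhouette score.
best_k = max(example_scores, key=example_scores.get)

# List every k within a small tolerance of the best score, since near-ties
# are common and a smaller k is often easier to interpret.
tolerance = 0.015
candidates = sorted(
    k for k, score in example_scores.items()
    if example_scores[best_k] - score <= tolerance
)

print(f"best k by silhouette: {best_k}")
print(f"near-best candidates: {candidates}")
```

A shortlist like this is only a starting point; the final choice should still be checked against the Elbow curve and against whether the segments make business sense.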

After examining the plots, select the optimal k value below:

In [None]:
# Set optimal number of clusters based on analysis
# k=5 is a common choice for this dataset; confirm it against the elbow and silhouette plots above
optimal_k = 5

print(f"Optimal number of clusters selected: {optimal_k}")
print(f"WCSS for k={optimal_k}: {inertias[optimal_k]:.2f}")
print(f"Silhouette score for k={optimal_k}: {silhouette_scores[optimal_k]:.4f}")

### 2.4 Train K-Means Model

In [None]:
# Train K-Means clustering model
kmeans_model, cluster_labels = train_kmeans(X_scaled, n_clusters=optimal_k, random_state=42)

# Add cluster labels to original dataframe
df['Cluster'] = cluster_labels

print("\nCluster distribution:")
print(df['Cluster'].value_counts().sort_index())

### 2.5 Visualize Customer Segments

In [None]:
# Visualize clusters in 2D space
visualize_clusters_2d(
    df,
    cluster_labels,
    'Annual Income (k$)',
    'Spending Score (1-100)',
    title='Customer Segments - K-Means Clustering',
    save_path='../reports/figures/22_customer_segments.png'
)

### 2.6 Analyze Cluster Characteristics

In [None]:
# Analyze cluster characteristics
cluster_analysis = analyze_clusters(df, cluster_labels, clustering_features)

print("Cluster Analysis Summary:")
print("=" * 80)
print(cluster_analysis)

In [None]:
# Get cluster descriptions
cluster_descriptions = describe_clusters(df, cluster_labels)

print("\nCluster Descriptions:")
print("=" * 80)
for cluster_id, description in cluster_descriptions.items():
    print(f"Cluster {cluster_id}: {description}")

In [None]:
# Additional cluster analysis - demographic breakdown
print("\nCluster Demographics:")
print("=" * 80)

for cluster in sorted(df['Cluster'].unique()):
    cluster_data = df[df['Cluster'] == cluster]
    
    print(f"\nCluster {cluster}: {cluster_descriptions[cluster]}")
    print("-" * 80)
    print(f"Size: {len(cluster_data)} customers ({len(cluster_data)/len(df)*100:.1f}%)")
    print(f"Average Age: {cluster_data['Age'].mean():.1f} years")
    print(f"Average Income: ${cluster_data['Annual Income (k$)'].mean():.1f}k")
    print(f"Average Spending Score: {cluster_data['Spending Score (1-100)'].mean():.1f}")
    print(f"Gender Distribution: {cluster_data['Gender'].value_counts().to_dict()}")

## 3. Part 2: Classification Models

### 3.1 Prepare Data for Classification

Next we build classification models that predict a customer's K-Means cluster from their attributes.
A trained classifier lets us assign new customers to a segment without re-running the clustering.

Features used:
- Age
- Annual Income (k$)
- Spending Score (1-100)
- Gender_Encoded

Target: Cluster label from K-Means
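
The overall flow (cluster labels as target, a held-out test split, then two classifiers) can be sketched end to end as follows. This is a hedged illustration on synthetic data, not the implementation of `prepare_classification_data`, `train_logistic_regression`, or `train_decision_tree`; the stratified split, the scaling step, `max_iter=1000`, and `max_depth=5` are all assumptions made here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; labels come from K-Means, mirroring the notebook's target.
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

# stratify keeps each cluster's share roughly equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

lr = LogisticRegression(max_iter=1000, random_state=42).fit(X_train_s, y_train)
dt = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train_s, y_train)

print(f"Logistic Regression test accuracy: {lr.score(X_test_s, y_test):.3f}")
print(f"Decision Tree test accuracy: {dt.score(X_test_s, y_test):.3f}")
```

Note that when the classifier sees the same features that defined the clusters, it is essentially learning to reproduce the K-Means boundaries, so high accuracy is expected rather than surprising.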

In [None]:
# Define features for classification
classification_features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)', 'Gender_Encoded']
target = 'Cluster'

print("Classification Features:")
print(classification_features)
print(f"\nTarget: {target}")

In [None]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = prepare_classification_data(
    df,
    classification_features,
    target,
    test_size=0.2,
    random_state=42
)

### 3.2 Model 1: Logistic Regression

In [None]:
# Train Logistic Regression model
lr_model = train_logistic_regression(X_train, y_train, random_state=42)

In [None]:
# Evaluate Logistic Regression
lr_results = evaluate_classifier(lr_model, X_test, y_test, model_name="Logistic Regression")

In [None]:
# Plot confusion matrix for Logistic Regression
plot_confusion_matrix(
    lr_results['confusion_matrix'],
    'Logistic Regression',
    save_path='../reports/figures/23_confusion_matrix_lr.png'
)

### 3.3 Model 2: Decision Tree Classifier

In [None]:
# Train Decision Tree model
dt_model = train_decision_tree(X_train, y_train, random_state=42)

In [None]:
# Evaluate Decision Tree
dt_results = evaluate_classifier(dt_model, X_test, y_test, model_name="Decision Tree")

In [None]:
# Plot confusion matrix for Decision Tree
plot_confusion_matrix(
    dt_results['confusion_matrix'],
    'Decision Tree',
    save_path='../reports/figures/24_confusion_matrix_dt.png'
)

### 3.4 Model Comparison
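
The comparison relies on the `accuracy`, `precision`, `recall`, and `confusion_matrix` entries returned by `evaluate_classifier`. How that helper computes them is not shown in this notebook; one plausible way, sketched here with scikit-learn metrics and weighted averaging (an assumption), is given below. The labels are hand-written for illustration and are not notebook output.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Small hand-written labels for three segments (illustrative only).
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

results = {
    "accuracy": accuracy_score(y_true, y_pred),
    # "weighted" averages per-class scores by class frequency, one common
    # choice for multi-class targets; the notebook's helper may use another.
    "precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
    "recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
    "confusion_matrix": confusion_matrix(y_true, y_pred),
}

for name in ("accuracy", "precision", "recall"):
    print(f"{name}: {results[name]:.3f}")
print(results["confusion_matrix"])  # rows are true labels, columns are predictions
```

With weighted averaging, recall always equals accuracy, so if the two columns in the comparison table match exactly, that is a property of the averaging choice rather than a coincidence.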

In [None]:
# Compare model performance
model_results = {
    'Logistic Regression': lr_results,
    'Decision Tree': dt_results
}

comparison_table = compare_models(model_results)

print("\nModel Performance Comparison:")
print("=" * 80)
print(comparison_table)

# Identify best model
best_model_idx = comparison_table['Accuracy'].idxmax()
best_model_name = comparison_table.loc[best_model_idx, 'Model']
best_accuracy = comparison_table.loc[best_model_idx, 'Accuracy']

print(f"\nBest performing model: {best_model_name}")
print(f"Accuracy: {best_accuracy:.4f}")

In [None]:
# Visualize model comparison
plt.figure(figsize=(10, 6))

metrics = ['Accuracy', 'Precision', 'Recall']
x = np.arange(len(metrics))
width = 0.35

lr_values = [lr_results['accuracy'], lr_results['precision'], lr_results['recall']]
dt_values = [dt_results['accuracy'], dt_results['precision'], dt_results['recall']]

plt.bar(x - width/2, lr_values, width, label='Logistic Regression', color='steelblue')
plt.bar(x + width/2, dt_values, width, label='Decision Tree', color='orange')

plt.xlabel('Metrics', fontsize=12)
plt.ylabel('Score', fontsize=12)
plt.title('Model Performance Comparison', fontsize=14, fontweight='bold')
plt.xticks(x, metrics)
plt.legend()
plt.ylim(0, 1.1)
plt.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for i, (lr_val, dt_val) in enumerate(zip(lr_values, dt_values)):
    plt.text(i - width/2, lr_val + 0.02, f'{lr_val:.3f}', ha='center', fontsize=9)
    plt.text(i + width/2, dt_val + 0.02, f'{dt_val:.3f}', ha='center', fontsize=9)

plt.tight_layout()
plt.savefig('../reports/figures/25_model_comparison.png', dpi=300, bbox_inches='tight')
print("Model comparison chart saved to: ../reports/figures/25_model_comparison.png")
plt.show()

## 4. Save Results

In [None]:
# Save data with cluster assignments
output_path = '../data/processed/mall_customers_with_clusters.csv'
df.to_csv(output_path, index=False)
print(f"Data with cluster assignments saved to: {output_path}")

In [None]:
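# Save comprehensive ML report
import os

report_path = '../reports/results/ml_results_report.txt'

# Create the output directory if it does not exist yet
os.makedirs(os.path.dirname(report_path), exist_ok=True)

with open(report_path, 'w', encoding='utf-8') as f: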
    f.write("MACHINE LEARNING RESULTS REPORT\n")
    f.write("=" * 80 + "\n\n")
    
    f.write("1. K-MEANS CLUSTERING\n")
    f.write("-" * 80 + "\n")
    f.write(f"Features Used: {', '.join(clustering_features)}\n")
    f.write(f"Optimal Number of Clusters: {optimal_k}\n")
    f.write(f"Silhouette Score: {silhouette_scores[optimal_k]:.4f}\n")
    f.write(f"WCSS (Inertia): {inertias[optimal_k]:.2f}\n\n")
    
    f.write("Cluster Descriptions:\n")
    for cluster_id, description in cluster_descriptions.items():
        f.write(f"  Cluster {cluster_id}: {description}\n")
    f.write("\n")
    
    f.write("Cluster Statistics:\n")
    f.write(cluster_analysis.to_string())
    f.write("\n\n")
    
    f.write("2. CLASSIFICATION MODELS\n")
    f.write("-" * 80 + "\n")
    f.write(f"Features Used: {', '.join(classification_features)}\n")
    f.write("Target Variable: Cluster labels from K-Means\n")
    f.write("Train/Test Split: 80/20\n\n")
    
    f.write("Model Performance Comparison:\n")
    f.write(comparison_table.to_string(index=False))
    f.write("\n\n")
    
    f.write(f"Best Model: {best_model_name}\n")
    f.write(f"Best Accuracy: {best_accuracy:.4f}\n\n")
    
    f.write("3. DETAILED EVALUATION METRICS\n")
    f.write("-" * 80 + "\n")
    
    f.write("\nLogistic Regression:\n")
    f.write(f"  Accuracy: {lr_results['accuracy']:.4f}\n")
    f.write(f"  Precision: {lr_results['precision']:.4f}\n")
    f.write(f"  Recall: {lr_results['recall']:.4f}\n")
    f.write(f"  Confusion Matrix:\n{lr_results['confusion_matrix']}\n")
    
    f.write("\nDecision Tree:\n")
    f.write(f"  Accuracy: {dt_results['accuracy']:.4f}\n")
    f.write(f"  Precision: {dt_results['precision']:.4f}\n")
    f.write(f"  Recall: {dt_results['recall']:.4f}\n")
    f.write(f"  Confusion Matrix:\n{dt_results['confusion_matrix']}\n")

print(f"ML results report saved to: {report_path}")

## 5. Summary and Insights

Run all cells above and document your findings here:

### K-Means Clustering Results:
- Optimal number of clusters: [To be filled after running]
- Silhouette score: [To be filled after running]
- Cluster characteristics: [To be filled after running]

### Customer Segments Identified:
1. [To be filled after running]
2. [To be filled after running]
3. [To be filled after running]
4. [To be filled after running]
5. [To be filled after running]

### Classification Model Performance:
- Logistic Regression Accuracy: [To be filled after running]
- Decision Tree Accuracy: [To be filled after running]
- Best performing model: [To be filled after running]

### Business Recommendations:
Based on the customer segments identified:
1. [To be filled after running]
2. [To be filled after running]
3. [To be filled after running]

### Model Selection Justification:
- [To be filled after running - explain why one model performed better]
- [Discuss trade-offs between models]

### Next Steps:
1. Update README.md with final results
2. Prepare presentation materials
3. Deploy model for new customer classification (optional)