# Scikit-learn for Machine Learning (Beginner-friendly)

**Learning Objectives:**
- Build and evaluate classification, regression, and clustering models
- Master the complete ML pipeline from data to predictions
- Apply preprocessing techniques and model evaluation metrics
- Understand when to use different algorithms and how to tune them

**Prerequisites:** Python basics, NumPy fundamentals, Pandas data preprocessing (complete previous notebooks first)

**Estimated Time:** ~90 minutes

---

Scikit-learn is the go-to library for machine learning in Python. This notebook brings together everything you've learned in NumPy and Pandas to build actual ML models that can make predictions on real data.

**Why Scikit-learn?** It provides:
- Consistent API across all algorithms (fit, predict, score)
- Built-in preprocessing tools that work seamlessly with Pandas
- Comprehensive model evaluation and validation tools
- Production-ready implementations of proven algorithms

**Learning Path Connection:** This notebook uses:
- **NumPy skills**: Array operations, mathematical functions, broadcasting
- **Pandas skills**: Data cleaning, feature engineering, train/test splits
- **New ML skills**: Model training, evaluation, and prediction

**What You'll Build:** Complete ML projects including customer classification, sales prediction, and customer segmentation - exactly what data scientists do every day!

**🎯 Success Indicators:** By the end, you should be able to:
- Train models and make accurate predictions on new data
- Evaluate model performance using appropriate metrics
- Choose the right algorithm for different types of problems
- Build complete ML pipelines from raw data to final predictions

**💡 Beginner Tips:**
- Start simple - basic models often work surprisingly well
- Always split your data before training (never test on training data!)
- Focus on understanding the problem before choosing algorithms
- Model evaluation is as important as model training

**🔗 ML Problem Types We'll Cover:**
- **Classification**: Predicting categories (premium vs regular customers)
- **Regression**: Predicting numbers (sales amounts, prices)
- **Clustering**: Finding hidden groups in data (customer segments)


In [None]:
# Essential imports for ML
import numpy as np
import pandas as pd
from datetime import datetime

# Scikit-learn core modules
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# ML Algorithms we'll use
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Set random seed for reproducibility (remember this from NumPy and Pandas!)
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', 15)
pd.set_option('display.max_rows', 10)

print("Scikit-learn ready! Using reproducible random seed: 42")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

# Import sklearn and check version
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")
print("\n🚀 Ready to build ML models!")

## 1. Understanding ML Problem Types

Before diving into algorithms, let's understand the three main types of ML problems. This foundation will help you choose the right approach for any real-world problem.

**Connection to Previous Notebooks:**
- **NumPy**: Provided the mathematical foundation (arrays, linear algebra)
- **Pandas**: Handled data cleaning and preprocessing
- **Scikit-learn**: Now we apply algorithms to make predictions

**The Three Types of ML Problems:**

1. **Supervised Learning**: Learning from labeled examples
   - **Classification**: Predicting categories (spam/not spam, premium/regular)
   - **Regression**: Predicting continuous numbers (price, temperature, sales)

2. **Unsupervised Learning**: Finding patterns in data without labels
   - **Clustering**: Grouping similar items (customer segments, product categories)
   - **Dimensionality Reduction**: Simplifying complex data while keeping important patterns

3. **Reinforcement Learning**: Learning through trial and error (not covered in this notebook)

**How to Choose:**
- Got labeled data and want to predict categories? → **Classification**
- Got labeled data and want to predict numbers? → **Regression**  
- No labels but want to find hidden patterns? → **Clustering**
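
The decision rules above can be sketched as a small helper. This is only an illustration: `suggest_problem_type` and its `max_classes` cutoff are hypothetical names and values, not part of scikit-learn, and real projects should also consider what the numbers mean (a numeric ID column is not a regression target).

```python
import numbers

def suggest_problem_type(y=None, max_classes=20):
    """Hypothetical helper: map a target column to a problem type."""
    if y is None:
        # No labels at all: only unsupervised methods apply
        return "clustering"
    values = list(y)
    numeric = all(
        isinstance(v, numbers.Real) and not isinstance(v, bool) for v in values
    )
    if numeric:
        # Fractional values or very many distinct values suggest a continuous target
        has_fractions = any(not float(v).is_integer() for v in values)
        many_values = len(set(values)) > max_classes
        if has_fractions or many_values:
            return "regression"
    # Few distinct values (numbers or strings) are treated as categories
    return "classification"

print(suggest_problem_type([0, 1, 1, 0]))           # labels like is_premium
print(suggest_problem_type([199.5, 20.25, 310.0]))  # amounts in dollars
print(suggest_problem_type(["North", "South"]))     # string categories
print(suggest_problem_type())                       # no labels available
```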

In [None]:
# Create the same customer dataset from Pandas notebook for consistency
print("Creating Customer Dataset (same as Pandas notebook for consistency)")
print("="*70)

# Generate the exact same dataset as Pandas notebook
np.random.seed(42)  # Same seed = same data!
n_samples = 1000

# Generate synthetic customer data
data = {
    'customer_id': range(1, n_samples + 1),
    'age': np.random.normal(35, 12, n_samples).astype(int),
    'income': np.random.lognormal(10, 0.5, n_samples),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 
                                 n_samples, p=[0.3, 0.4, 0.2, 0.1]),
    'experience_years': np.random.exponential(5, n_samples),
    'num_purchases': np.random.poisson(3, n_samples),
    'satisfaction_score': np.random.uniform(1, 5, n_samples),
    'is_premium': np.random.choice([0, 1], n_samples, p=[0.7, 0.3]),  # Our target! (drawn at random, independent of the features)
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_samples)
}

# Create DataFrame
df = pd.DataFrame(data)

# Add some missing values (realistic scenario)
missing_indices = np.random.choice(df.index, size=int(0.05 * len(df)), replace=False)
df.loc[missing_indices[:20], 'income'] = np.nan
df.loc[missing_indices[20:40], 'satisfaction_score'] = np.nan

print(f"Dataset created: {df.shape[0]} customers, {df.shape[1]} features")
print(f"Target variable: is_premium (0=Regular, 1=Premium)")
print(f"Premium customers: {df['is_premium'].sum()} ({df['is_premium'].mean():.1%})")
print("\nFirst few rows:")
print(df.head())

print("\n🎯 CLASSIFICATION GOAL: Predict which customers will become premium members")
print("This is a binary classification problem (2 classes: 0 or 1)")

## 2. Classification: Predicting Customer Premium Status

Classification is about predicting categories or classes. Our goal: predict whether a customer will become a premium member based on their characteristics.

**Real-world Applications:**
- Email spam detection (spam/not spam)
- Medical diagnosis (disease/healthy)
- Customer churn prediction (will leave/will stay)
- Image recognition (cat/dog/bird)

**Our Classification Problem:**
- **Features (X)**: age, income, education, experience, etc.
- **Target (y)**: is_premium (0=Regular, 1=Premium)
- **Goal**: Build a model that can predict premium status for new customers

**The ML Workflow:**
1. **Prepare Data**: Clean, encode, and split
2. **Train Model**: Fit algorithm on training data
3. **Evaluate**: Test performance on unseen data
4. **Predict**: Make predictions on new customers
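
As a rough sketch of these four steps in one place, the example below chains imputation, scaling, and a classifier with scikit-learn's `make_pipeline`. The toy arrays (`X_demo`, `y_demo`) and the choice of median imputation are assumptions for illustration; the notebook's own cells perform the same steps one at a time on the customer data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (assumed for illustration): 200 rows, 3 numeric features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)
X_demo[rng.choice(200, size=10, replace=False), 2] = np.nan  # a few missing values

# 1. Prepare: split first so the test rows stay unseen
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# 2. Train: imputation and scaling are fitted on the training rows only
workflow = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)
workflow.fit(X_tr, y_tr)

# 3. Evaluate on unseen data
test_accuracy = workflow.score(X_te, y_te)

# 4. Predict for a new row (missing values are imputed by the pipeline)
new_customer = np.array([[1.2, 0.3, np.nan]])
new_prediction = workflow.predict(new_customer)
print(f"Test accuracy: {test_accuracy:.2f}, prediction: {new_prediction[0]}")
```

Wrapping the preprocessing in a pipeline is one way to make sure nothing is fitted on test data, since `fit` only ever sees the training split.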

In [None]:
# Step 1: Data Preprocessing (applying Pandas skills!)
print("Step 1: Data Preprocessing")
print("="*40)

# Create a copy for ML processing
df_ml = df.copy()

# Handle missing values (remember from Pandas!)
print("Missing values before cleaning:")
print(df_ml.isnull().sum())

# Fill missing values with median/mean
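# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas versions and may not modify df_ml
df_ml['income'] = df_ml['income'].fillna(df_ml['income'].median())
df_ml['satisfaction_score'] = df_ml['satisfaction_score'].fillna(df_ml['satisfaction_score'].mean())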

print("\nMissing values after cleaning:")
print(df_ml.isnull().sum())

# PARAMETER EXPLANATION: Ordinal encoding vs one-hot encoding
print("\nPARAMETER EXPLANATION: Encoding Categorical Variables")
print("• Ordinal encoding: Converts categories to ordered numbers (0, 1, 2, 3...)")
print("  - Use for: Ordinal data (High School < Bachelor < Master < PhD)")
print("  - How: An explicit mapping (used below) or OrdinalEncoder with a category order")
print("  - Pros: Simple, compact, preserves order")
print("  - Cons: Implies evenly spaced numerical steps between categories")
print("• LabelEncoder: Meant for the target (y); it numbers classes alphabetically,")
print("  so it would NOT preserve the education order")
print("• OneHotEncoder: Creates binary columns for each category")
print("  - Use for: Nominal data (North, South, East, West - no order)")
print("  - Pros: No false numerical relationships")
print("  - Cons: Creates many columns, can cause 'curse of dimensionality'")
print("• ML Rule: Use ordinal encoding for ordinal features, one-hot for nominal ones")

# Encode education (ordinal - has natural order)
education_mapping = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
df_ml['education_encoded'] = df_ml['education'].map(education_mapping)

print("\nEducation encoding (ordinal):")
print(df_ml[['education', 'education_encoded']].drop_duplicates().sort_values('education_encoded'))

# One-hot encode region (nominal - no natural order)
region_dummies = pd.get_dummies(df_ml['region'], prefix='region')
df_ml = pd.concat([df_ml, region_dummies], axis=1)

print("\nRegion encoding (one-hot):")
print(f"Original region column: {df_ml['region'].unique()}")
print(f"New binary columns: {list(region_dummies.columns)}")
print("Sample of encoded regions:")
print(df_ml[['region'] + list(region_dummies.columns)].head())

In [None]:
# Step 2: Feature Selection and Preparation
print("Step 2: Feature Selection")
print("="*30)

# Select features for our model (X) and target (y)
feature_columns = [
    'age', 'income', 'education_encoded', 'experience_years',
    'num_purchases', 'satisfaction_score',
    'region_East', 'region_North', 'region_South', 'region_West'
]

X = df_ml[feature_columns].copy()
y = df_ml['is_premium'].copy()

print(f"Features (X): {X.shape[1]} columns")
print(f"Target (y): {y.shape[0]} samples")
print(f"\nFeature columns: {list(X.columns)}")
print(f"Target distribution: {y.value_counts().to_dict()}")

# Check for any remaining issues
print(f"\nData quality check:")
print(f"Missing values in X: {X.isnull().sum().sum()}")
print(f"Missing values in y: {y.isnull().sum()}")
print(f"Data types: {X.dtypes.value_counts().to_dict()}")

print("\nFirst few rows of features:")
print(X.head())

In [None]:
# Step 3: Train-Test Split (crucial for honest evaluation!)
print("Step 3: Train-Test Split")
print("="*30)

# PARAMETER EXPLANATION: train_test_split parameters
print("PARAMETER EXPLANATION: train_test_split()")
print("• test_size: Fraction of data to use for testing (0.2 = 20%)")
print("• random_state: Seed for reproducible splits (same idea as np.random.seed)")
print("• stratify: Ensures same class distribution in train and test sets")
print("• Why stratify: Prevents imbalanced splits (e.g., all premium in train)")
print("• Common test_size values: 0.2 (80/20), 0.3 (70/30), 0.25 (75/25)")
print("• ML Rule: NEVER look at test data during model development!")

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing
    random_state=42,    # Reproducible splits
    stratify=y          # Keep same class distribution
)

print(f"\nDataset split:")
print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X):.1%})")
print(f"Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X):.1%})")

# Verify stratification worked
print(f"\nClass distribution check:")
print(f"Original: {y.value_counts(normalize=True).round(3).to_dict()}")
print(f"Training: {y_train.value_counts(normalize=True).round(3).to_dict()}")
print(f"Test: {y_test.value_counts(normalize=True).round(3).to_dict()}")
print("✅ With stratify=y, the three distributions above should match closely")

print("\n🚨 CRITICAL ML RULE: Test set is now 'locked away' until final evaluation!")
print("We'll only use X_train and y_train for model development.")

In [None]:
# Step 4: Feature Scaling (important for many algorithms)
print("Step 4: Feature Scaling")
print("="*25)

# PARAMETER EXPLANATION: Why scaling matters
print("WHY FEATURE SCALING MATTERS:")
print("• Income: values in the tens of thousands of dollars")
print("• Age: values in the tens of years")
print("• Without scaling: Income can dominate distance- and penalty-based models because of its larger numbers")
print("• With scaling: All features are on a comparable scale")
print("• Algorithms that need scaling: Logistic Regression (regularized), KNN, SVM, Neural Networks")
print("• Algorithms that don't: Decision Trees, Random Forest")

# Check feature scales before scaling
print("\nFeature scales BEFORE scaling:")
print(X_train.describe().round(2))

# Scale features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same scaling as training!

# Convert back to DataFrame for easier viewing
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("\nFeature scales AFTER scaling:")
print(X_train_scaled.describe().round(2))

print("\n🎯 KEY INSIGHT: All features now have mean≈0 and std≈1")
print("This ensures fair treatment of all features in the model.")

# CRITICAL ML CONCEPT: Fit on train, transform on test
print("\n🚨 CRITICAL CONCEPT: Data Leakage Prevention")
print("• scaler.fit_transform(X_train): Learn scaling parameters from training data")
print("• scaler.transform(X_test): Apply same scaling to test data")
print("• NEVER fit scaler on test data - that's data leakage!")
print("• Same rule applies to all preprocessing: fit on train, transform on test")

## 3. Classification Algorithms

Now let's train different classification algorithms and compare their performance. Each algorithm has different strengths and is suited for different types of problems.

**Algorithms We'll Compare:**
1. **Logistic Regression**: Simple, interpretable, good baseline
2. **Decision Tree**: Easy to understand, handles non-linear patterns
3. **Random Forest**: Combines many trees, usually more accurate
4. **K-Nearest Neighbors**: Simple concept, good for local patterns

**The Scikit-learn Pattern:**
All algorithms follow the same 3-step pattern:
1. **Create**: `model = Algorithm()`
2. **Train**: `model.fit(X_train, y_train)`
3. **Predict**: `predictions = model.predict(X_test)`
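
Because every estimator shares this interface, the same loop can train and compare several of them. The following is a minimal sketch on a generated toy dataset (`make_classification`); the variable names and hyperparameter values are illustrative assumptions, not the notebook's customer data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy dataset, assumed only for this illustration
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# 1. Create: each algorithm is just a different class with the same methods
candidates = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

toy_scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)               # 2. Train
    predictions = model.predict(X_te)   # 3. Predict
    toy_scores[name] = (predictions == y_te).mean()  # share of correct predictions
    print(f"{name}: {toy_scores[name]:.2f}")
```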

In [None]:
# Algorithm 1: Logistic Regression
print("🔍 Algorithm 1: Logistic Regression")
print("="*45)

# PARAMETER EXPLANATION: Logistic Regression parameters
print("ALGORITHM EXPLANATION: Logistic Regression")
print("• What it does: Finds a linear decision boundary that best separates the classes")
print("• Strengths: Fast, interpretable, probabilistic predictions")
print("• Weaknesses: Assumes linear relationships")
print("• Best for: When you need to understand feature importance")
print("• Output: Probability between 0 and 1 (>0.5 = class 1)")
print("• Connection to NumPy: Uses matrix operations for optimization")

# Create and train the model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Make predictions
y_pred_log = log_reg.predict(X_test_scaled)
y_pred_proba_log = log_reg.predict_proba(X_test_scaled)[:, 1]  # Probability of class 1

# Evaluate performance
accuracy_log = accuracy_score(y_test, y_pred_log)
print(f"\n📊 Logistic Regression Results:")
print(f"Accuracy: {accuracy_log:.3f} ({accuracy_log:.1%})")
print(f"Correct predictions: {(y_pred_log == y_test).sum()} out of {len(y_test)}")

# Show some example predictions
print("\nExample predictions (first 10 test samples):")
results_df = pd.DataFrame({
    'Actual': y_test.iloc[:10].values,
    'Predicted': y_pred_log[:10],
    'Probability': y_pred_proba_log[:10].round(3),
    'Correct': (y_test.iloc[:10].values == y_pred_log[:10])
})
print(results_df)

# Feature importance (coefficients)
print("\n🎯 Feature Importance (coefficients):")
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': log_reg.coef_[0],
    'Abs_Coefficient': np.abs(log_reg.coef_[0])
}).sort_values('Abs_Coefficient', ascending=False)

print(feature_importance)
print("\n💡 Interpretation: With standardized features, larger absolute coefficients = more influential features")
print("Positive coefficients increase premium probability, negative decrease it")

In [None]:
# Algorithm 2: Decision Tree
print("🌳 Algorithm 2: Decision Tree")
print("="*35)

# PARAMETER EXPLANATION: Decision Tree parameters
print("ALGORITHM EXPLANATION: Decision Tree")
print("• What it does: Creates a series of yes/no questions to classify data")
print("• Strengths: Easy to understand, handles non-linear patterns, no scaling needed")
print("• Weaknesses: Can overfit, unstable (small data changes = different tree)")
print("• Best for: When you need interpretable rules (if age > 30 AND income > 50k...)")
print("• max_depth: Limits tree depth to prevent overfitting")
print("• min_samples_split: Minimum samples needed to split a node")

# Create and train the model (using original features, not scaled)
tree_clf = DecisionTreeClassifier(
    max_depth=5,           # Limit depth to prevent overfitting
    min_samples_split=20,  # Need at least 20 samples to split
    random_state=42
)
tree_clf.fit(X_train, y_train)  # Note: using unscaled data!

# Make predictions
y_pred_tree = tree_clf.predict(X_test)
y_pred_proba_tree = tree_clf.predict_proba(X_test)[:, 1]

# Evaluate performance
accuracy_tree = accuracy_score(y_test, y_pred_tree)
print(f"\n📊 Decision Tree Results:")
print(f"Accuracy: {accuracy_tree:.3f} ({accuracy_tree:.1%})")
print(f"Correct predictions: {(y_pred_tree == y_test).sum()} out of {len(y_test)}")

# Feature importance (different from logistic regression!)
print("\n🎯 Feature Importance (based on Gini impurity reduction):")
tree_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': tree_clf.feature_importances_
}).sort_values('Importance', ascending=False)

print(tree_importance)
print("\n💡 Interpretation: Higher importance = more useful for splitting data")
print("Tree importance shows which features create the purest splits")

# Show the kind of rules a tree encodes (illustrative, not read from tree_clf)
print("\n🌳 Example Decision Rules (illustrative):")
print("A tree encodes rules of this form:")
print("• If income > $45,000 AND satisfaction > 3.2 → Likely Premium")
print("• If age < 25 AND num_purchases < 2 → Likely Regular")
print("(These thresholds are made up; sklearn.tree.export_text(tree_clf) prints the actual rules)")

In [None]:
# Algorithm 3: Random Forest
print("🌲🌳🌲 Algorithm 3: Random Forest")
print("="*40)

# PARAMETER EXPLANATION: Random Forest parameters
print("ALGORITHM EXPLANATION: Random Forest")
print("• What it does: Combines predictions from many decision trees")
print("• Strengths: Usually more accurate, reduces overfitting, no scaling needed")
print("• Weaknesses: Less interpretable, slower than single tree")
print("• Best for: When accuracy is more important than interpretability")
print("• n_estimators: Number of trees (more trees = more stable but slower, with diminishing returns)")
print("• max_depth: Depth of each tree")
print("• Voting: Each tree votes, majority wins (ensemble method)")

# Create and train the model
rf_clf = RandomForestClassifier(
    n_estimators=100,      # Use 100 trees
    max_depth=5,           # Limit depth of each tree
    min_samples_split=20,  # Same as single tree
    random_state=42
)
rf_clf.fit(X_train, y_train)  # Using unscaled data

# Make predictions
y_pred_rf = rf_clf.predict(X_test)
y_pred_proba_rf = rf_clf.predict_proba(X_test)[:, 1]

# Evaluate performance
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"\n📊 Random Forest Results:")
print(f"Accuracy: {accuracy_rf:.3f} ({accuracy_rf:.1%})")
print(f"Correct predictions: {(y_pred_rf == y_test).sum()} out of {len(y_test)}")

# Feature importance (averaged across all trees)
print("\n🎯 Feature Importance (averaged across 100 trees):")
rf_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf_clf.feature_importances_
}).sort_values('Importance', ascending=False)

print(rf_importance)
print("\n💡 Interpretation: More stable importance scores than single tree")
print("Random Forest importance is more reliable due to averaging")

# Show confidence in predictions
print("\n🎯 Prediction Confidence (first 10 samples):")
confidence_df = pd.DataFrame({
    'Actual': y_test.iloc[:10].values,
    'Predicted': y_pred_rf[:10],
    'Confidence': np.maximum(y_pred_proba_rf[:10], 1-y_pred_proba_rf[:10]).round(3),
    'Correct': (y_test.iloc[:10].values == y_pred_rf[:10])
})
print(confidence_df)
print("Higher confidence = more certain prediction")

In [None]:
# Algorithm 4: K-Nearest Neighbors
print("👥 Algorithm 4: K-Nearest Neighbors (KNN)")
print("="*50)

# PARAMETER EXPLANATION: KNN parameters
print("ALGORITHM EXPLANATION: K-Nearest Neighbors")
print("• What it does: Classifies based on the K closest training examples")
print("• Strengths: Simple concept, works well with local patterns")
print("• Weaknesses: Slow with large datasets, sensitive to irrelevant features")
print("• Best for: When similar items should have similar labels")
print("• n_neighbors (K): How many neighbors to consider (odd numbers avoid ties)")
print("• Distance: Usually Euclidean distance (requires scaling!)")
print("• Lazy learning: fit() mostly stores the training data, the real work happens during prediction")

# Create and train the model
knn_clf = KNeighborsClassifier(
    n_neighbors=5,  # Look at 5 nearest neighbors
    weights='distance'  # Closer neighbors have more influence
)
knn_clf.fit(X_train_scaled, y_train)  # KNN needs scaled data!

# Make predictions
y_pred_knn = knn_clf.predict(X_test_scaled)
y_pred_proba_knn = knn_clf.predict_proba(X_test_scaled)[:, 1]

# Evaluate performance
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"\n📊 K-Nearest Neighbors Results:")
print(f"Accuracy: {accuracy_knn:.3f} ({accuracy_knn:.1%})")
print(f"Correct predictions: {(y_pred_knn == y_test).sum()} out of {len(y_test)}")

# KNN doesn't have feature importance, but we can show prediction examples
print("\n🎯 How KNN Makes Predictions (conceptual):")
print("For each test sample:")
print("1. Find the 5 most similar customers in training data")
print("2. Look at their premium status (0 or 1)")
print("3. Let each neighbor vote for its class")
print("4. Weight each vote by 1/distance (closer neighbors count more); the largest weighted total wins")

# Show some prediction probabilities
print("\n📊 KNN Prediction Examples (first 5 samples):")
knn_examples = pd.DataFrame({
    'Actual': y_test.iloc[:5].values,
    'Predicted': y_pred_knn[:5],
    'Probability': y_pred_proba_knn[:5].round(3),
    'Interpretation': [
        f"{p:.0%} of the distance-weighted vote went to premium" for p in y_pred_proba_knn[:5]
    ]
})
print(knn_examples)

## 4. Model Evaluation and Comparison

Accuracy is just one metric. Let's dive deeper into evaluation to understand which model is truly best for our problem.

**Why Multiple Metrics Matter:**
- **Accuracy**: Overall correctness, but can be misleading with imbalanced data
- **Precision**: Of predicted positives, how many were actually positive?
- **Recall**: Of actual positives, how many did we catch?
- **F1-Score**: Harmonic mean of precision and recall
- **Confusion Matrix**: Shows exactly where the model makes mistakes

**Business Context Matters:**
- High precision: Avoid false positives (don't waste premium offers on unlikely customers)
- High recall: Catch all potential premiums (don't miss valuable customers)
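
To make these definitions concrete, here is a small worked example that computes each metric from confusion-matrix counts. The counts are hypothetical (a 200-customer test set with 60 premium customers), and `binary_metrics` is an illustrative helper rather than a scikit-learn function.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a model never predicts the positive class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical model: finds half of the 60 premium customers
model_metrics = binary_metrics(tp=30, fp=10, fn=30, tn=130)

# Baseline that always predicts "Regular"
baseline_metrics = binary_metrics(tp=0, fp=0, fn=60, tn=140)

for label, metrics in [("Model", model_metrics), ("Always Regular", baseline_metrics)]:
    summary = ", ".join(f"{name}={value:.2f}" for name, value in metrics.items())
    print(f"{label}: {summary}")
```

In this example the baseline still reaches 0.70 accuracy while its recall is 0.00, which is why accuracy alone can mislead on imbalanced data.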

In [None]:
# Compare all models side by side
print("📊 MODEL COMPARISON SUMMARY")
print("="*50)

# Calculate accuracies for all models
models_comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest', 'K-Nearest Neighbors'],
    'Accuracy': [accuracy_log, accuracy_tree, accuracy_rf, accuracy_knn],
    'Correct_Predictions': [
        (y_pred_log == y_test).sum(),
        (y_pred_tree == y_test).sum(), 
        (y_pred_rf == y_test).sum(),
        (y_pred_knn == y_test).sum()
    ]
})

# Sort by accuracy
models_comparison = models_comparison.sort_values('Accuracy', ascending=False)
models_comparison['Accuracy_Percent'] = (models_comparison['Accuracy'] * 100).round(1)

print(models_comparison)

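# Identify best model (by test accuracy; about 70% of customers are Regular,
# so always predicting 'Regular' already scores roughly 70%)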
best_model_name = models_comparison.iloc[0]['Model']
best_accuracy = models_comparison.iloc[0]['Accuracy']
print(f"\n🏆 Best Model: {best_model_name} with {best_accuracy:.1%} accuracy")

# But let's look deeper with classification reports
print("\n📋 DETAILED CLASSIFICATION REPORTS")
print("="*45)

models_and_predictions = [
    ('Logistic Regression', y_pred_log),
    ('Decision Tree', y_pred_tree),
    ('Random Forest', y_pred_rf),
    ('K-Nearest Neighbors', y_pred_knn)
]

for model_name, predictions in models_and_predictions:
    print(f"\n{model_name}:")
    print("-" * len(model_name))
    # zero_division=0 avoids a warning if a model never predicts 'Premium'
    print(classification_report(y_test, predictions, target_names=['Regular', 'Premium'], zero_division=0))

In [None]:
# Confusion Matrices - Show exactly where models make mistakes
print("🔍 CONFUSION MATRICES - Where Do Models Make Mistakes?")
print("="*65)

# PARAMETER EXPLANATION: Confusion Matrix
print("CONFUSION MATRIX EXPLANATION:")
print("• Rows: Actual classes (what really happened)")
print("• Columns: Predicted classes (what model predicted)")
print("• Diagonal: Correct predictions")
print("• Off-diagonal: Mistakes")
print("• Top-left: True Negatives (correctly predicted Regular)")
print("• Top-right: False Positives (predicted Premium, actually Regular)")
print("• Bottom-left: False Negatives (predicted Regular, actually Premium)")
print("• Bottom-right: True Positives (correctly predicted Premium)")

# Print numerical confusion matrices
print("\nNumerical Confusion Matrices:")
for model_name, predictions in models_and_predictions:
    cm = confusion_matrix(y_test, predictions)
    print(f"\n{model_name}:")
    print(f"                Predicted")
    print(f"Actual    Regular  Premium")
    print(f"Regular      {cm[0,0]:3d}      {cm[0,1]:3d}")
    print(f"Premium      {cm[1,0]:3d}      {cm[1,1]:3d}")
    
    # Calculate error types
    false_positives = cm[0,1]  # Predicted premium, actually regular
    false_negatives = cm[1,0]  # Predicted regular, actually premium
    
    print(f"False Positives: {false_positives} (wasted premium offers)")
    print(f"False Negatives: {false_negatives} (missed premium customers)")

In [None]:
# Business Impact Analysis
print("💰 BUSINESS IMPACT ANALYSIS")
print("="*35)

print("Let's translate model performance into business terms:")
print("\nScenario: Premium membership campaign")
print("• Cost of premium offer: $50 per customer")
print("• Revenue from premium customer: $200 per year")
print("• Net profit from correct premium prediction: $150")
print("• Cost of false positive (wasted offer): $50")
print("• Cost of false negative (missed customer): $150 (lost revenue)")

# Calculate business impact for each model
offer_cost = 50
premium_revenue = 200
net_profit = premium_revenue - offer_cost

print("\n💼 Business Impact by Model:")
print("-" * 40)

for model_name, predictions in models_and_predictions:
    cm = confusion_matrix(y_test, predictions)
    
    true_positives = cm[1,1]   # Correctly identified premium customers
    false_positives = cm[0,1]  # Wasted offers to regular customers
    false_negatives = cm[1,0]  # Missed premium customers
    
    # Calculate financial impact
    profit_from_tp = true_positives * net_profit
    cost_from_fp = false_positives * offer_cost
    lost_revenue_fn = false_negatives * net_profit
    
    total_impact = profit_from_tp - cost_from_fp - lost_revenue_fn
    
    print(f"\n{model_name}:")
    print(f"  Profit from correct predictions: ${profit_from_tp:,}")
    print(f"  Cost from wasted offers: ${cost_from_fp:,}")
    print(f"  Lost revenue from missed customers: ${lost_revenue_fn:,}")
    print(f"  NET BUSINESS IMPACT: ${total_impact:,}")

print("\n🎯 KEY INSIGHT: The 'best' model depends on business priorities!")
print("• If minimizing wasted offers is critical → Choose high precision model")
print("• If catching all premium customers is critical → Choose high recall model")
print("• For balanced approach → Choose high F1-score model")

## 5. Regression: Predicting Customer Lifetime Value

Now let's switch from predicting categories (classification) to predicting continuous numbers (regression). We'll predict customer lifetime value based on their characteristics.

**Real-world Regression Applications:**
- House price prediction (real estate)
- Stock price forecasting (finance)
- Sales revenue prediction (business)
- Temperature forecasting (weather)
- Medical dosage optimization (healthcare)

**Our Regression Problem:**
- **Features (X)**: Same customer characteristics as before
- **Target (y)**: Customer lifetime value in dollars
- **Goal**: Predict how much revenue each customer will generate

**Key Differences from Classification:**
- **Output**: Continuous numbers instead of discrete categories
- **Metrics**: MSE, MAE, R² instead of accuracy, precision, recall
- **Algorithms**: Linear/Polynomial Regression, Decision Tree/Random Forest Regressors

In [None]:
# Create a regression target: Customer Lifetime Value (CLV)
print("Creating Regression Target: Customer Lifetime Value")
print("="*55)

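# Generate realistic CLV based on customer features
# Use a different seed than the dataset (42): reseeding with 42 would replay the
# same random draws that produced 'age', making the "noise" a function of age
np.random.seed(7)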

# CLV formula: Base value + bonuses based on customer characteristics + noise
base_clv = 500  # Base customer value

# Calculate CLV components (realistic business logic)
age_bonus = (df_ml['age'] - 25) * 10  # Older customers worth more
income_bonus = (df_ml['income'] / 1000) * 2  # Higher income = higher CLV
education_bonus = df_ml['education_encoded'] * 100  # Education increases value
satisfaction_bonus = df_ml['satisfaction_score'] * 150  # Happy customers spend more
purchase_bonus = df_ml['num_purchases'] * 80  # Purchase history matters
premium_bonus = df_ml['is_premium'] * 800  # Premium customers worth much more

# Combine all factors
clv_deterministic = (base_clv + age_bonus + income_bonus + 
                    education_bonus + satisfaction_bonus + 
                    purchase_bonus + premium_bonus)

# Add realistic noise (business is never perfectly predictable)
noise = np.random.normal(0, 200, len(df_ml))  # Random variation
clv = clv_deterministic + noise

# Ensure CLV is positive (can't have negative customer value)
clv = np.maximum(clv, 100)  # Minimum CLV of $100

# Add to our dataframe
df_ml['customer_lifetime_value'] = clv

print(f"Customer Lifetime Value Statistics:")
print(f"Mean CLV: ${clv.mean():.2f}")
print(f"Median CLV: ${clv.median():.2f}")
print(f"Min CLV: ${clv.min():.2f}")
print(f"Max CLV: ${clv.max():.2f}")
print(f"Standard Deviation: ${clv.std():.2f}")

# Show relationship between features and CLV
print("\n🔍 CLV Correlations with Features:")
clv_correlations = df_ml[feature_columns + ['customer_lifetime_value']].corr()['customer_lifetime_value'].sort_values(ascending=False)
print(clv_correlations.drop('customer_lifetime_value').round(3))

print("\n💡 Business Insights (from the CLV formula above):")
print("• Premium status adds $800 to CLV, but it is not one of the model features")
print("• Satisfaction ($150 per point) and purchases ($80 each) are strong drivers")
print("• Education adds $100 per level")
print("• Income adds only $2 per $1,000, so its correlation may be weak")

print("\n🎯 REGRESSION GOAL: Predict CLV for new customers to optimize marketing spend")

In [None]:
# Prepare regression data
print("Preparing Regression Data")
print("="*30)

# Use same features as classification, but different target
X_reg = df_ml[feature_columns].copy()
y_reg = df_ml['customer_lifetime_value'].copy()

print(f"Regression features: {X_reg.shape[1]} columns")
print(f"Regression target: {y_reg.shape[0]} CLV values")
print(f"Target range: ${y_reg.min():.0f} to ${y_reg.max():.0f}")

# Split data for regression (same random state for consistency)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg,
    test_size=0.2,
    random_state=42  # Same seed as before (the rows differ: no stratification here)
)

print(f"\nRegression data split:")
print(f"Training: {X_train_reg.shape[0]} samples")
print(f"Testing: {X_test_reg.shape[0]} samples")

# Scale features for regression (some algorithms need it)
scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)

# Convert back to DataFrames
X_train_reg_scaled = pd.DataFrame(X_train_reg_scaled, columns=X_train_reg.columns, index=X_train_reg.index)
X_test_reg_scaled = pd.DataFrame(X_test_reg_scaled, columns=X_test_reg.columns, index=X_test_reg.index)

print(f"\nTarget statistics:")
print(f"Training CLV mean: ${y_train_reg.mean():.2f}")
print(f"Testing CLV mean: ${y_test_reg.mean():.2f}")
print("✅ Similar distributions - good split!")
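
As a side note, the split-then-scale steps above can also be bundled into a scikit-learn `Pipeline`, which fits the scaler on the training rows only and reuses those statistics at prediction time. The sketch below is a minimal illustration on synthetic data; `X_demo`, `y_demo`, and the coefficients are hypothetical stand-ins, not the notebook's customer data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), with features on very different scales
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y_demo = X_demo @ np.array([100.0, 2.0, 0.5]) + rng.normal(scale=10.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the training rows only;
# score() reuses those same statistics on the test rows
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
r2_demo = pipe.score(X_te, y_te)
print(f"Pipeline test R²: {r2_demo:.3f}")
```

Keeping the scaler inside the pipeline makes it harder to accidentally fit it on test data.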

## 6. Regression Algorithms

Let's train different regression algorithms to predict customer lifetime value. Each has different strengths for different types of relationships.

**Regression Algorithms We'll Compare:**
1. **Linear Regression**: Simple, interpretable, assumes linear relationships
2. **Polynomial Regression**: Captures curved relationships
3. **Decision Tree Regressor**: Handles non-linear patterns, no scaling needed
4. **Random Forest Regressor**: Ensemble method, usually more accurate

**Regression Metrics:**
- **MAE (Mean Absolute Error)**: Average prediction error in dollars
- **MSE (Mean Squared Error)**: Penalizes large errors more heavily
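- **RMSE (Root MSE)**: Square root of MSE, expressed in the original units (dollars)
- **R² Score**: Proportion of variance explained (1 is perfect, 0 matches always predicting the mean, negative is worse than that; higher is better)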
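
To make the four metrics concrete, here is a small sketch that computes them by hand with NumPy and checks the results against scikit-learn. The CLV values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted CLV values in dollars (illustration only)
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred = np.array([1100.0, 1900.0, 3300.0, 3900.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                   # average absolute error
mse = np.mean(errors ** 2)                      # average squared error
rmse = np.sqrt(mse)                             # back in dollars
ss_res = np.sum(errors ** 2)                    # variation the model misses
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MAE:  ${mae:.2f}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: ${rmse:.2f}")
print(f"R²:   {r2:.3f}")

# The hand-rolled values should agree with scikit-learn's implementations
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

Note how the single $300 miss pulls RMSE above MAE: squaring gives large errors extra weight.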

In [None]:
# Algorithm 1: Linear Regression
print("📈 Algorithm 1: Linear Regression")
print("="*40)

# PARAMETER EXPLANATION: Linear Regression
print("ALGORITHM EXPLANATION: Linear Regression")
print("• What it does: Finds the best straight line through the data")
print("• Strengths: Simple, fast, interpretable coefficients")
print("• Weaknesses: Assumes linear relationships only")
print("• Best for: When relationships are roughly linear")
print("• Output: Continuous predictions (any real number)")
print("• Connection to NumPy: The least-squares solution is w = (X^T X)^-1 X^T y (scikit-learn computes it with a more stable solver)")

# Create and train the model
linear_reg = LinearRegression()
linear_reg.fit(X_train_reg_scaled, y_train_reg)

# Make predictions
y_pred_linear = linear_reg.predict(X_test_reg_scaled)

# Calculate regression metrics
mae_linear = mean_absolute_error(y_test_reg, y_pred_linear)
mse_linear = mean_squared_error(y_test_reg, y_pred_linear)
rmse_linear = np.sqrt(mse_linear)
r2_linear = r2_score(y_test_reg, y_pred_linear)

print(f"\n📊 Linear Regression Results:")
print(f"MAE: ${mae_linear:.2f} (average error)")
print(f"RMSE: ${rmse_linear:.2f} (root mean squared error)")
print(f"R² Score: {r2_linear:.3f} ({r2_linear:.1%} of variance explained)")

# Show some example predictions
print("\nExample predictions (first 10 test samples):")
linear_results = pd.DataFrame({
    'Actual_CLV': y_test_reg.iloc[:10].values.round(2),
    'Predicted_CLV': y_pred_linear[:10].round(2),
    'Error': (y_test_reg.iloc[:10].values - y_pred_linear[:10]).round(2),
    'Abs_Error': np.abs(y_test_reg.iloc[:10].values - y_pred_linear[:10]).round(2)
})
print(linear_results)

# Feature importance (coefficients)
print("\n🎯 Feature Coefficients (impact on CLV):")
linear_coef = pd.DataFrame({
    'Feature': X_train_reg.columns,
    'Coefficient': linear_reg.coef_,
    'Abs_Coefficient': np.abs(linear_reg.coef_)
}).sort_values('Abs_Coefficient', ascending=False)

print(linear_coef)
print(f"\nIntercept: ${linear_reg.intercept_:.2f}")
print("\n💡 Interpretation: Features are standardized, so each coefficient is the CLV change per one-standard-deviation increase in that feature")
print("Positive coefficients increase CLV, negative coefficients decrease it")
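
The matrix formula mentioned in the cell above can be checked directly in NumPy. This is a sketch on synthetic data (the weights and noise level are arbitrary assumptions); it solves the normal equations and compares the result with `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known weights (hypothetical values for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 5.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Add a column of ones so the intercept is learned as an extra weight
X_aug = np.column_stack([np.ones(len(X_demo)), X_demo])

# Normal equations: solve (X^T X) w = X^T y instead of inverting explicitly
w_normal = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_demo)

model = LinearRegression().fit(X_demo, y_demo)

print("Normal-equation intercept:", round(w_normal[0], 4))
print("scikit-learn intercept:   ", round(model.intercept_, 4))
print("Coefficients match:", np.allclose(w_normal[1:], model.coef_))
```

Both routes give the same weights here. The explicit formula becomes numerically fragile when features are strongly correlated, which is one reason library implementations prefer other solvers.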

In [None]:
# Algorithm 2: Polynomial Regression (Linear Regression with polynomial features)
print("📊 Algorithm 2: Polynomial Regression")
print("="*45)

# PARAMETER EXPLANATION: Polynomial Features
print("ALGORITHM EXPLANATION: Polynomial Regression")
print("• What it does: Creates curved relationships by adding polynomial terms such as x² and x*y")
print("• Strengths: Captures non-linear patterns, still interpretable")
print("• Weaknesses: Can overfit easily, creates many features")
print("• Best for: When you see curved relationships in data")
print("• degree=2: Adds squared terms (x²) and pairwise products (x*y); x³ would need degree=3")
print("• interaction_only=False: Keeps the x² terms (True would keep only x*y products)")

from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features (degree 2 for quadratic relationships)
poly_features = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_train_poly = poly_features.fit_transform(X_train_reg_scaled)
X_test_poly = poly_features.transform(X_test_reg_scaled)

print(f"\nFeature expansion:")
print(f"Original features: {X_train_reg_scaled.shape[1]}")
print(f"Polynomial features: {X_train_poly.shape[1]}")
print(f"Added {X_train_poly.shape[1] - X_train_reg_scaled.shape[1]} polynomial terms")

# Train polynomial regression
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train_reg)

# Make predictions
y_pred_poly = poly_reg.predict(X_test_poly)

# Calculate metrics
mae_poly = mean_absolute_error(y_test_reg, y_pred_poly)
mse_poly = mean_squared_error(y_test_reg, y_pred_poly)
rmse_poly = np.sqrt(mse_poly)
r2_poly = r2_score(y_test_reg, y_pred_poly)

print(f"\n📊 Polynomial Regression Results:")
print(f"MAE: ${mae_poly:.2f} (average error)")
print(f"RMSE: ${rmse_poly:.2f} (root mean squared error)")
print(f"R² Score: {r2_poly:.3f} ({r2_poly:.1%} of variance explained)")

# Compare with linear regression (positive values mean polynomial did better)
print(f"\n📈 Change vs. Linear Regression:")
mae_improvement = ((mae_linear - mae_poly) / mae_linear) * 100
r2_improvement = r2_poly - r2_linear
print(f"MAE change: {mae_improvement:+.1f}% (positive = lower error)")
print(f"R² change: {r2_improvement:+.3f} ({r2_improvement*100:+.1f} percentage points)")

if r2_poly > r2_linear:
    print("✅ Polynomial features captured additional patterns!")
else:
    print("⚠️ Polynomial features didn't help - relationships might be mostly linear")

In [None]:
# Algorithm 3: Decision Tree Regressor
print("🌳 Algorithm 3: Decision Tree Regressor")
print("="*45)

# PARAMETER EXPLANATION: Decision Tree Regressor
print("ALGORITHM EXPLANATION: Decision Tree Regressor")
print("• What it does: Creates rules to predict continuous values")
print("• Strengths: Handles non-linear patterns, no scaling needed, interpretable")
print("• Weaknesses: Can overfit, unstable with small data changes")
print("• Best for: When relationships are complex and non-linear")
print("• Prediction: Average of target values in each leaf node")
print("• Example rule: If income > $50k AND age > 30 → Predict CLV = $2,500")

# Create and train the model (using unscaled data)
tree_reg = DecisionTreeRegressor(
    max_depth=6,           # Slightly deeper for regression
    min_samples_split=20,  # Prevent overfitting
    min_samples_leaf=10,   # Ensure meaningful leaf nodes
    random_state=42
)
tree_reg.fit(X_train_reg, y_train_reg)

# Make predictions
y_pred_tree_reg = tree_reg.predict(X_test_reg)

# Calculate metrics
mae_tree = mean_absolute_error(y_test_reg, y_pred_tree_reg)
mse_tree = mean_squared_error(y_test_reg, y_pred_tree_reg)
rmse_tree = np.sqrt(mse_tree)
r2_tree = r2_score(y_test_reg, y_pred_tree_reg)

print(f"\n📊 Decision Tree Regressor Results:")
print(f"MAE: ${mae_tree:.2f} (average error)")
print(f"RMSE: ${rmse_tree:.2f} (root mean squared error)")
print(f"R² Score: {r2_tree:.3f} ({r2_tree:.1%} of variance explained)")

# Feature importance
print("\n🎯 Feature Importance (for splitting):")
tree_reg_importance = pd.DataFrame({
    'Feature': X_train_reg.columns,
    'Importance': tree_reg.feature_importances_
}).sort_values('Importance', ascending=False)

print(tree_reg_importance)
print("\n💡 Interpretation: Higher importance = more useful for predicting CLV")

# Show some example decision paths (conceptual)
print("\n🌳 Example Decision Rules (simplified):")
print("The tree learned rules like:")
print("• If income > $45,000 AND satisfaction > 3.5 → Predict CLV ≈ $2,800")
print("• If age < 30 AND num_purchases < 3 → Predict CLV ≈ $1,200")
print("• If premium=1 AND education=PhD → Predict CLV ≈ $4,500")
print("(Actual tree has more complex nested rules)")
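
The rules printed above are conceptual. To read the rules a fitted tree actually learned, scikit-learn provides `sklearn.tree.export_text`. The sketch below trains a shallow tree on synthetic data in which CLV jumps at an income of $50,000 (an assumption made purely for illustration) and prints its rules.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (hypothetical): CLV steps up when income exceeds $50,000
rng = np.random.default_rng(1)
income = rng.uniform(20_000, 100_000, size=300)
age = rng.uniform(18, 70, size=300)
X_demo = np.column_stack([income, age])
y_demo = np.where(income > 50_000, 3000.0, 1500.0) + rng.normal(scale=50.0, size=300)

tree_demo = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_demo, y_demo)

# export_text renders the learned splits as indented if/else rules
rules = export_text(tree_demo, feature_names=["income", "age"])
print(rules)
```

The same call should work on the notebook's `tree_reg` with `feature_names=list(X_train_reg.columns)`, though a depth-6 tree prints many more lines.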

In [None]:
# Algorithm 4: Random Forest Regressor
print("🌲🌳🌲 Algorithm 4: Random Forest Regressor")
print("="*50)

# PARAMETER EXPLANATION: Random Forest Regressor
print("ALGORITHM EXPLANATION: Random Forest Regressor")
print("• What it does: Averages predictions from many decision trees")
print("• Strengths: Usually more accurate than a single tree, reduces overfitting, no scaling needed")
print("• Weaknesses: Less interpretable, slower than single tree")
print("• Best for: When accuracy is more important than interpretability")
print("• Prediction: Average of all tree predictions")
print("• Example: Tree1=$2,400 + Tree2=$2,600 + Tree3=$2,500 → Predict $2,500")

# Create and train the model
rf_reg = RandomForestRegressor(
    n_estimators=100,      # Use 100 trees
    max_depth=6,           # Same depth as single tree
    min_samples_split=20,  # Prevent overfitting
    min_samples_leaf=10,   # Ensure meaningful predictions
    random_state=42
)
rf_reg.fit(X_train_reg, y_train_reg)

# Make predictions
y_pred_rf_reg = rf_reg.predict(X_test_reg)

# Calculate metrics
mae_rf = mean_absolute_error(y_test_reg, y_pred_rf_reg)
mse_rf = mean_squared_error(y_test_reg, y_pred_rf_reg)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(y_test_reg, y_pred_rf_reg)

print(f"\n📊 Random Forest Regressor Results:")
print(f"MAE: ${mae_rf:.2f} (average error)")
print(f"RMSE: ${rmse_rf:.2f} (root mean squared error)")
print(f"R² Score: {r2_rf:.3f} ({r2_rf:.1%} of variance explained)")

# Feature importance (averaged across all trees)
print("\n🎯 Feature Importance (averaged across 100 trees):")
rf_reg_importance = pd.DataFrame({
    'Feature': X_train_reg.columns,
    'Importance': rf_reg.feature_importances_
}).sort_values('Importance', ascending=False)

print(rf_reg_importance)
print("\n💡 Interpretation: More stable importance scores than single tree")

# Show prediction spread (disagreement between trees) as a rough, uncalibrated confidence signal
print("\n🎯 Prediction Examples with Confidence (first 5 samples):")
# The individual trees were fitted on NumPy arrays, so pass .values to avoid feature-name warnings
tree_predictions = np.array([tree.predict(X_test_reg.iloc[:5].values) for tree in rf_reg.estimators_])
prediction_std = np.std(tree_predictions, axis=0)

rf_confidence = pd.DataFrame({
    'Actual_CLV': y_test_reg.iloc[:5].values.round(2),
    'Predicted_CLV': y_pred_rf_reg[:5].round(2),
    'Prediction_Std': prediction_std.round(2),
    'Confidence_Range': [f"±${std:.0f}" for std in prediction_std]
})
print(rf_confidence)
print("Lower standard deviation = more confident prediction")
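
The "average of all tree predictions" idea can be verified directly. This sketch fits a small forest on synthetic data (all values are hypothetical) and confirms that averaging the individual trees reproduces `forest.predict`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (hypothetical), just to have something to fit
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 4))
y_demo = 3 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# One row of predictions per tree: shape (n_trees, n_samples)
per_tree = np.array([t.predict(X_demo[:5]) for t in forest.estimators_])
manual_mean = per_tree.mean(axis=0)
spread = per_tree.std(axis=0)  # how much the trees disagree

print("Forest prediction:", forest.predict(X_demo[:5]).round(3))
print("Mean of trees:    ", manual_mean.round(3))
print("Tree spread (std):", spread.round(3))
```

The spread is a useful hint about where the forest is unsure, but it is not a calibrated confidence interval.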

## 7. Regression Model Comparison

Let's compare all regression models to understand which performs best for predicting customer lifetime value.

**Regression Metrics Explained:**
- **MAE**: Mean Absolute Error - average prediction error in dollars (lower is better)
- **RMSE**: Root Mean Squared Error - penalizes large errors more (lower is better)
- **R² Score**: Coefficient of determination - percentage of variance explained (higher is better)

**Business Context:**
- **MAE**: "On average, our predictions are off by $X"
- **RMSE**: "Our typical error is about $X, with big mistakes counting extra"
- **R²**: "Our model explains X% of why CLV varies between customers"

In [None]:
# Compare all regression models
print("📊 REGRESSION MODEL COMPARISON")
print("="*45)

# Create comparison DataFrame
regression_comparison = pd.DataFrame({
    'Model': ['Linear Regression', 'Polynomial Regression', 'Decision Tree', 'Random Forest'],
    'MAE': [mae_linear, mae_poly, mae_tree, mae_rf],
    'RMSE': [rmse_linear, rmse_poly, rmse_tree, rmse_rf],
    'R²_Score': [r2_linear, r2_poly, r2_tree, r2_rf]
})

# Sort by R² score (higher is better)
regression_comparison = regression_comparison.sort_values('R²_Score', ascending=False)
regression_comparison['R²_Percent'] = (regression_comparison['R²_Score'] * 100).round(1)

print("Model Performance Ranking (by R² Score):")
print(regression_comparison.round(2))

# Identify best model
best_reg_model = regression_comparison.iloc[0]['Model']
best_r2 = regression_comparison.iloc[0]['R²_Score']
best_mae = regression_comparison.iloc[0]['MAE']

print(f"\n🏆 Best Regression Model: {best_reg_model}")
print(f"   R² Score: {best_r2:.3f} ({best_r2:.1%} variance explained)")
print(f"   Average Error: ${best_mae:.2f}")

# Calculate baseline comparison
baseline_mae = mean_absolute_error(y_test_reg, [y_train_reg.mean()] * len(y_test_reg))
print(f"\n📏 Baseline Comparison (predicting mean CLV):")
print(f"   Baseline MAE: ${baseline_mae:.2f}")
print(f"   Best Model MAE: ${best_mae:.2f}")
improvement = ((baseline_mae - best_mae) / baseline_mae) * 100
print(f"   Improvement: {improvement:.1f}% better than baseline")

# Show prediction accuracy ranges
print(f"\n🎯 Prediction Accuracy Interpretation:")
print(f"• Our best model is typically off by ${best_mae:.0f} when predicting CLV")
print(f"• For a customer with ${y_test_reg.mean():.0f} actual CLV:")
print(f"  - Prediction range: ${y_test_reg.mean()-best_mae:.0f} to ${y_test_reg.mean()+best_mae:.0f}")
print(f"  - That's ±{(best_mae/y_test_reg.mean())*100:.1f}% relative error")
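
The mean-prediction baseline above can also be expressed with scikit-learn's `DummyRegressor`, which follows the same fit/predict API as real models. The numbers in this sketch are made up; it also shows that R² can be negative when the test mean differs from the training mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical CLV targets; the features are ignored by the dummy model
y_train_demo = np.array([1000.0, 2000.0, 3000.0])
y_test_demo = np.array([1500.0, 3500.0])
X_train_demo = np.zeros((3, 1))
X_test_demo = np.zeros((2, 1))

dummy = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
dummy_pred = dummy.predict(X_test_demo)  # always the training mean

dummy_mae = mean_absolute_error(y_test_demo, dummy_pred)
dummy_r2 = r2_score(y_test_demo, dummy_pred)

print("Baseline predictions:", dummy_pred)
print(f"Baseline MAE: ${dummy_mae:.2f}")
print(f"Baseline R²: {dummy_r2:.2f}")
```

Any model worth deploying should beat this baseline on both metrics.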

In [None]:
# Detailed prediction analysis
print("🔍 DETAILED PREDICTION ANALYSIS")
print("="*40)

# Compare predictions from all models
prediction_comparison = pd.DataFrame({
    'Actual_CLV': y_test_reg.iloc[:10].values,
    'Linear_Pred': y_pred_linear[:10],
    'Poly_Pred': y_pred_poly[:10],
    'Tree_Pred': y_pred_tree_reg[:10],
    'RF_Pred': y_pred_rf_reg[:10]
})

# Calculate errors for each model
for model in ['Linear', 'Poly', 'Tree', 'RF']:
    prediction_comparison[f'{model}_Error'] = (
        prediction_comparison['Actual_CLV'] - prediction_comparison[f'{model}_Pred']
    ).abs()

print("Prediction Comparison (first 10 test samples):")
print(prediction_comparison.round(2))

# Analyze error patterns
print("\n📈 Error Pattern Analysis:")
models_and_preds = [
    ('Linear Regression', y_pred_linear),
    ('Polynomial Regression', y_pred_poly),
    ('Decision Tree', y_pred_tree_reg),
    ('Random Forest', y_pred_rf_reg)
]

for model_name, predictions in models_and_preds:
    errors = np.abs(y_test_reg - predictions)
    print(f"\n{model_name}:")
    print(f"  Mean Error: ${errors.mean():.2f}")
    print(f"  Median Error: ${errors.median():.2f}")
    print(f"  Max Error: ${errors.max():.2f}")
    print(f"  % predictions within $500: {(errors <= 500).mean():.1%}")
    print(f"  % predictions within $1000: {(errors <= 1000).mean():.1%}")

In [None]:
# Business impact of regression predictions
print("💰 BUSINESS IMPACT OF CLV PREDICTIONS")
print("="*45)

print("Business Scenario: Marketing Budget Allocation")
print("• High CLV customers (≥$3000): Premium marketing ($200 spend)")
print("• Medium CLV customers ($1500 to <$3000): Standard marketing ($100 spend)")
print("• Low CLV customers (<$1500): Basic marketing ($50 spend)")
print("• Goal: Maximize ROI by targeting right customers with right campaigns")

# Define CLV segments
def classify_clv(clv):
    if clv >= 3000:
        return 'High'
    elif clv >= 1500:
        return 'Medium'
    else:
        return 'Low'

# Marketing costs by segment
marketing_costs = {'High': 200, 'Medium': 100, 'Low': 50}

print("\n💼 Marketing ROI Analysis by Model:")
print("-" * 50)

for model_name, predictions in models_and_preds:
    # Classify actual and predicted CLV
    actual_segments = [classify_clv(clv) for clv in y_test_reg]
    predicted_segments = [classify_clv(clv) for clv in predictions]
    
    # Calculate marketing spend based on predictions
    predicted_spend = sum(marketing_costs[seg] for seg in predicted_segments)
    
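    # Net value = revenue - marketing cost. Revenue is the actual CLV total and is the
    # same for every model, so these figures differ only through predicted spend;
    # the segment accuracy below is what measures allocation quality.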
    actual_revenue = y_test_reg.sum()
    roi = actual_revenue - predicted_spend
    roi_ratio = actual_revenue / predicted_spend
    
    # Calculate segment accuracy
    segment_accuracy = sum(1 for a, p in zip(actual_segments, predicted_segments) if a == p) / len(actual_segments)
    
    print(f"\n{model_name}:")
    print(f"  Total marketing spend: ${predicted_spend:,}")
    print(f"  Total customer revenue: ${actual_revenue:,.0f}")
    print(f"  Net ROI: ${roi:,.0f}")
    print(f"  ROI ratio: {roi_ratio:.2f}x")
    print(f"  Segment classification accuracy: {segment_accuracy:.1%}")

# Optimal allocation (if we knew true CLV)
optimal_segments = [classify_clv(clv) for clv in y_test_reg]
optimal_spend = sum(marketing_costs[seg] for seg in optimal_segments)
optimal_roi = y_test_reg.sum() - optimal_spend

print(f"\n🎯 Optimal Allocation (perfect predictions):")
print(f"  Marketing spend: ${optimal_spend:,}")
print(f"  Net ROI: ${optimal_roi:,.0f}")
print(f"  ROI ratio: {y_test_reg.sum()/optimal_spend:.2f}x")

print("\n💡 Key Insight: Better CLV predictions → Better marketing allocation → Higher ROI")
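
The segment assignment in the cell above can also be done in a vectorized way with `pd.cut`, and a cross-tabulation shows where predicted segments disagree with actual ones. This is a sketch with made-up CLV values; the bin edges mirror the thresholds in `classify_clv`.

```python
import numpy as np
import pandas as pd

# Hypothetical actual and predicted CLV values in dollars
actual = pd.Series([900.0, 1500.0, 2999.0, 3000.0, 4200.0])
predicted = pd.Series([1600.0, 1400.0, 2500.0, 3100.0, 2900.0])

# right=False makes each bin include its lower edge, matching the >= checks
bins = [-np.inf, 1500, 3000, np.inf]
labels = ["Low", "Medium", "High"]
actual_seg = pd.cut(actual, bins=bins, labels=labels, right=False).astype(str)
pred_seg = pd.cut(predicted, bins=bins, labels=labels, right=False).astype(str)

seg_accuracy = (actual_seg == pred_seg).mean()
confusion = pd.crosstab(actual_seg, pred_seg, rownames=["Actual"], colnames=["Predicted"])

print(f"Segment accuracy: {seg_accuracy:.0%}")
print(confusion)
```

Off-diagonal cells in the table are customers who would receive the wrong campaign tier.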

## 8. Clustering: Discovering Customer Segments

Now let's explore unsupervised learning with clustering. Unlike classification and regression, we don't have a target variable - we're looking for hidden patterns in the data.

**Real-world Clustering Applications:**
- Customer segmentation (marketing)
- Market research (identifying consumer groups)
- Gene sequencing (biology)
- Image segmentation (computer vision)
- Anomaly detection (fraud, network security)

**Our Clustering Problem:**
- **Goal**: Discover natural customer segments based on behavior and characteristics
- **Features**: Customer demographics and behavior (no target variable!)
- **Output**: Group assignments (Cluster 0, 1, 2, etc.)
- **Business Value**: Targeted marketing, personalized products, customer insights

**Key Differences from Supervised Learning:**
- **No labels**: We don't know the "right" answer beforehand
- **Exploratory**: We're discovering patterns, not predicting outcomes
- **Evaluation**: Harder to measure - we use internal metrics and business interpretation

In [None]:
# Prepare data for clustering
print("Preparing Data for Customer Segmentation")
print("="*50)

# Select features for clustering (exclude target variables and IDs)
clustering_features = [
    'age', 'income', 'education_encoded', 'experience_years',
    'num_purchases', 'satisfaction_score'
    # Note: Excluding region dummies and premium status for unsupervised learning
]

X_cluster = df_ml[clustering_features].copy()

print(f"Clustering features: {list(X_cluster.columns)}")
print(f"Number of customers: {X_cluster.shape[0]}")
print(f"Number of features: {X_cluster.shape[1]}")

# Check data quality
print(f"\nData quality check:")
print(f"Missing values: {X_cluster.isnull().sum().sum()}")
print(f"Data types: {X_cluster.dtypes.value_counts().to_dict()}")

# Scale features for clustering (very important!)
print("\n🔧 SCALING FOR CLUSTERING:")
print("• Clustering algorithms use distance calculations")
print("• Features with larger scales dominate the distance")
print("• Example: Income ($50,000) vs Age (30) - income dominates")
print("• Solution: Scale all features to similar ranges")

scaler_cluster = StandardScaler()
X_cluster_scaled = scaler_cluster.fit_transform(X_cluster)
X_cluster_scaled = pd.DataFrame(X_cluster_scaled, columns=X_cluster.columns, index=X_cluster.index)

print("\nFeature scales BEFORE scaling:")
print(X_cluster.describe().round(2))

print("\nFeature scales AFTER scaling:")
print(X_cluster_scaled.describe().round(2))

print("\n✅ All features now have mean≈0 and std≈1 - ready for clustering!")
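
The effect of scale on distance is easy to see with three made-up customers. In this sketch, customer B differs from A mainly in age and customer C differs from A mainly in income; the values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: columns are age (years) and income (dollars)
customers = np.array([
    [25.0, 50_000.0],  # A
    [65.0, 50_500.0],  # B: 40 years older than A, similar income
    [26.0, 58_000.0],  # C: similar age to A, $8,000 higher income
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

raw_ab = dist(customers[0], customers[1])
raw_ac = dist(customers[0], customers[2])

scaled = StandardScaler().fit_transform(customers)
scaled_ab = dist(scaled[0], scaled[1])
scaled_ac = dist(scaled[0], scaled[2])

print(f"Raw distances:    A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"Scaled distances: A-B = {scaled_ab:.2f}, A-C = {scaled_ac:.2f}")
```

Before scaling, the 40-year age gap barely registers next to the income gap. After scaling, the two distances are of similar size, so both features influence the clusters.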

In [None]:
# Apply K-Means Clustering
print("🎯 Algorithm: K-Means Clustering")
print("="*40)

# PARAMETER EXPLANATION: K-Means parameters
print("ALGORITHM EXPLANATION: K-Means Clustering")
print("• What it does: Groups data into k clusters based on similarity")
print("• How it works: Finds k cluster centers that minimize distances to points")
print("• Strengths: Fast, simple, works well with spherical clusters")
print("• Weaknesses: Assumes spherical clusters, sensitive to initialization")
print("• n_clusters: Number of clusters to create")
print("• random_state: Seed for reproducible results")
print("• n_init: Number of random initializations (best result is kept)")

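# For simplicity, we'll use 3 clusters (a common starting point for customer segmentation).
# The name optimal_k is reused in later cells; the value is chosen by hand here, not tuned.
optimal_k = 3
print(f"\n🎯 Using k={optimal_k} clusters for clear customer segments")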

# Create and fit K-Means model
kmeans = KMeans(
    n_clusters=optimal_k,
    random_state=42,
    n_init=10  # Try 10 different initializations
)

# Fit the model and get cluster assignments
cluster_labels = kmeans.fit_predict(X_cluster_scaled)

# Add cluster labels to our dataframe
df_clustered = df_ml.copy()
df_clustered['Cluster'] = cluster_labels

print(f"\n📊 Clustering Results:")
print(f"Number of clusters: {optimal_k}")
print(f"Final WCSS (within-cluster sum of squares): {kmeans.inertia_:.2f}")
print(f"Number of iterations: {kmeans.n_iter_}")

# Show cluster distribution
cluster_counts = pd.Series(cluster_labels).value_counts().sort_index()
print(f"\nCluster Distribution:")
for cluster_id, count in cluster_counts.items():
    percentage = (count / len(cluster_labels)) * 100
    print(f"Cluster {cluster_id}: {count} customers ({percentage:.1f}%)")

print(f"\n✅ Successfully segmented {len(df_clustered)} customers into {optimal_k} clusters!")
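
The cell above fixes k=3 by hand. One common way to check such a choice is to compare the inertia (WCSS) and the silhouette score across several values of k. The sketch below does this on synthetic blobs with three well-separated centers (an assumption made so that the right answer is known); on real customer data the picture is usually less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (hypothetical)
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_blobs)
    # inertia_ is the WCSS; silhouette ranges from -1 to 1 (higher is better)
    results[k] = (km.inertia_, silhouette_score(X_blobs, km.labels_))
    print(f"k={k}: WCSS={results[k][0]:.1f}, silhouette={results[k][1]:.3f}")

best_k = max(results, key=lambda k: results[k][1])
print(f"Best k by silhouette: {best_k}")
```

WCSS keeps shrinking as k grows, so look for the "elbow" where the drop levels off rather than the minimum; the silhouette score gives a second opinion.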

In [None]:
# Analyze and interpret customer segments
print("🔍 CUSTOMER SEGMENT ANALYSIS")
print("="*40)

# Calculate cluster centers in original scale for interpretation
cluster_centers_scaled = kmeans.cluster_centers_
cluster_centers_original = scaler_cluster.inverse_transform(cluster_centers_scaled)

# Create cluster centers DataFrame
cluster_centers_df = pd.DataFrame(
    cluster_centers_original,
    columns=clustering_features,
    index=[f'Cluster_{i}' for i in range(optimal_k)]
)

print("Cluster Centers (Average Values):")
print(cluster_centers_df.round(2))

# Detailed analysis by cluster
print("\n📋 DETAILED CLUSTER PROFILES:")
print("="*35)

for cluster_id in range(optimal_k):
    cluster_data = df_clustered[df_clustered['Cluster'] == cluster_id]
    
    print(f"\n🎯 CLUSTER {cluster_id} PROFILE ({len(cluster_data)} customers):")
    print("-" * 50)
    
    # Demographics
    print(f"Demographics:")
    print(f"  Average Age: {cluster_data['age'].mean():.1f} years")
    print(f"  Average Income: ${cluster_data['income'].mean():.0f}")
    print(f"  Education: {cluster_data['education'].mode().iloc[0]} (most common)")
    
    # Behavior
    print(f"Behavior:")
    print(f"  Average Purchases: {cluster_data['num_purchases'].mean():.1f}")
    print(f"  Average Satisfaction: {cluster_data['satisfaction_score'].mean():.2f}/5.0")
    print(f"  Average Experience: {cluster_data['experience_years'].mean():.1f} years")
    
    # Business metrics
    print(f"Business Value:")
    premium_rate = cluster_data['is_premium'].mean()
    avg_clv = cluster_data['customer_lifetime_value'].mean()
    print(f"  Premium Rate: {premium_rate:.1%}")
    print(f"  Average CLV: ${avg_clv:.0f}")
    
    # Region distribution
    top_region = cluster_data['region'].mode().iloc[0]
    region_pct = (cluster_data['region'] == top_region).mean()
    print(f"  Top Region: {top_region} ({region_pct:.1%})")

# Compare clusters side by side
print("\n📊 CLUSTER COMPARISON TABLE:")
print("="*35)

comparison_metrics = []
for cluster_id in range(optimal_k):
    cluster_data = df_clustered[df_clustered['Cluster'] == cluster_id]
    
    metrics = {
        'Cluster': f'Cluster_{cluster_id}',
        'Size': len(cluster_data),
        'Avg_Age': cluster_data['age'].mean(),
        'Avg_Income': cluster_data['income'].mean(),
        'Avg_Satisfaction': cluster_data['satisfaction_score'].mean(),
        'Premium_Rate': cluster_data['is_premium'].mean(),
        'Avg_CLV': cluster_data['customer_lifetime_value'].mean()
    }
    comparison_metrics.append(metrics)

comparison_df = pd.DataFrame(comparison_metrics)
print(comparison_df.round(2))
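
The comparison table above is built with an explicit loop. The same kind of summary can be produced in one step with `groupby` and named aggregations. This sketch uses a tiny made-up DataFrame with a subset of the columns.

```python
import pandas as pd

# Tiny hypothetical dataset with cluster labels already assigned
demo = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 1],
    "age": [25, 35, 50, 60, 55],
    "income": [40_000, 50_000, 90_000, 110_000, 100_000],
    "is_premium": [0, 1, 1, 1, 0],
})

# Named aggregation: output_column=(input_column, function)
profile = demo.groupby("Cluster").agg(
    Size=("age", "size"),
    Avg_Age=("age", "mean"),
    Avg_Income=("income", "mean"),
    Premium_Rate=("is_premium", "mean"),
)

print(profile.round(2))
```

On the notebook's data, the equivalent call would start from `df_clustered.groupby('Cluster')`.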

In [None]:
# Business interpretation and actionable insights
print("💼 BUSINESS INTERPRETATION & MARKETING STRATEGY")
print("="*55)

# Analyze each cluster for business insights
cluster_insights = []

for cluster_id in range(optimal_k):
    cluster_data = df_clustered[df_clustered['Cluster'] == cluster_id]
    
    # Calculate key metrics
    avg_age = cluster_data['age'].mean()
    avg_income = cluster_data['income'].mean()
    avg_satisfaction = cluster_data['satisfaction_score'].mean()
    premium_rate = cluster_data['is_premium'].mean()
    avg_clv = cluster_data['customer_lifetime_value'].mean()
    size = len(cluster_data)
    
    # Generate business interpretation
    if avg_clv > 2500 and premium_rate > 0.4:
        segment_type = "High-Value Customers"
        strategy = "VIP treatment, loyalty programs, premium services"
        priority = "HIGH"
    elif avg_clv > 1800 and avg_satisfaction > 3.5:
        segment_type = "Growth Potential"
        strategy = "Upselling, premium conversion campaigns"
        priority = "MEDIUM"
    else:
        segment_type = "Standard Customers"
        strategy = "Retention programs, satisfaction improvement"
        priority = "LOW"
    
    cluster_insights.append({
        'cluster_id': cluster_id,
        'segment_type': segment_type,
        'strategy': strategy,
        'priority': priority,
        'size': size,
        'avg_clv': avg_clv
    })

# Display business insights
for insight in cluster_insights:
    print(f"\n🎯 CLUSTER {insight['cluster_id']}: {insight['segment_type']}")
    print(f"   Size: {insight['size']} customers")
    print(f"   Average CLV: ${insight['avg_clv']:.0f}")
    print(f"   Priority: {insight['priority']}")
    print(f"   Strategy: {insight['strategy']}")

# Calculate business impact
print(f"\n💰 BUSINESS IMPACT ANALYSIS:")
print("-" * 30)

total_clv = df_clustered['customer_lifetime_value'].sum()
print(f"Total Customer Value: ${total_clv:,.0f}")

for insight in cluster_insights:
    cluster_data = df_clustered[df_clustered['Cluster'] == insight['cluster_id']]
    cluster_clv = cluster_data['customer_lifetime_value'].sum()
    clv_percentage = (cluster_clv / total_clv) * 100
    
    print(f"\nCluster {insight['cluster_id']} ({insight['segment_type']}):")
    print(f"  Total Value: ${cluster_clv:,.0f} ({clv_percentage:.1f}% of total)")
    print(f"  Size: {insight['size']} customers ({insight['size']/len(df_clustered)*100:.1f}% of base)")
    print(f"  Value per Customer: ${cluster_clv/insight['size']:.0f}")

print(f"\n🎯 KEY INSIGHTS:")
print("• Customer segmentation reveals distinct behavioral patterns")
print("• High-value segments deserve premium marketing investment")
print("• Growth potential segments are prime for upselling campaigns")
print("• Targeted strategies can improve overall customer lifetime value")
print("• Regular re-segmentation helps track customer evolution")