# Module 10: Model Training and Evaluation

**Difficulty**: ⭐⭐⭐  
**Estimated Time**: 90 minutes  
**Prerequisites**: 
- [Module 08: MLlib Basics](08_pyspark_machine_learning_mllib_basics.ipynb)
- [Module 09: Feature Engineering at Scale](09_feature_engineering_at_scale.ipynb)
- Understanding of model evaluation metrics

## Learning Objectives

By the end of this notebook, you will be able to:

1. Train and compare multiple classification algorithms (Logistic Regression, Decision Trees, Random Forests, GBT)
2. Train and evaluate regression models (Linear Regression, Decision Tree Regressor, Random Forest Regressor, Gradient Boosted Trees)
3. Implement cross-validation using CrossValidator for robust model selection
4. Perform hyperparameter tuning with ParamGridBuilder to optimize model performance
5. Use comprehensive evaluation metrics to select the best model for production deployment

## 1. Setup and Introduction

**Model Selection Process:**

1. **Algorithm Selection**: Choose candidate algorithms based on problem type
2. **Training**: Fit models on training data
3. **Validation**: Evaluate on validation set or cross-validation
4. **Hyperparameter Tuning**: Optimize model parameters
5. **Final Evaluation**: Test best model on held-out test set
6. **Deployment**: Deploy the selected model to production

**Classification Algorithms in MLlib:**
- **Logistic Regression**: Linear classifier, fast, interpretable
- **Decision Tree**: Non-linear, handles interactions, prone to overfitting
- **Random Forest**: Ensemble of trees, reduces overfitting, robust
- **Gradient Boosted Trees**: Sequential ensemble, often highest accuracy

**Regression Algorithms in MLlib:**
- **Linear Regression**: Simple, fast, assumes linear relationships
- **Decision Tree Regressor**: Captures non-linearity, interpretable
- **Random Forest Regressor**: Robust ensemble method
- **Gradient Boosted Tree Regressor**: High accuracy, slower training

**Cross-Validation:**
- Splits data into k folds
- Trains on k-1 folds, validates on 1 fold
- Repeats k times and averages results
- Provides more reliable performance estimates
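
The four steps above can be sketched in plain Python. This is only an illustration of the idea, not how Spark's `CrossValidator` assigns rows to folds; the helper names `k_fold_indices` and `cross_val_score` are hypothetical and exist only for this example.

```python
import random


def k_fold_indices(n_rows, k, seed=42):
    """Yield (train_idx, val_idx) pairs, one per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


def cross_val_score(n_rows, k, fit_and_score):
    """Average the score returned by fit_and_score(train_idx, val_idx) over k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_rows, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(10, 3))
for train_idx, val_idx in splits:
    print(f"train={sorted(train_idx)} val={sorted(val_idx)}")
```

Every row lands in the validation set exactly once, which is why the averaged score uses all of the data.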

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rand, when, expr
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

# Feature engineering
from pyspark.ml.feature import VectorAssembler, StringIndexer, StandardScaler

# Classification algorithms
from pyspark.ml.classification import (
    LogisticRegression,
    DecisionTreeClassifier,
    RandomForestClassifier,
    GBTClassifier
)

# Regression algorithms
from pyspark.ml.regression import (
    LinearRegression,
    DecisionTreeRegressor,
    RandomForestRegressor,
    GBTRegressor
)

# Model selection and tuning
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit
from pyspark.ml import Pipeline

# Evaluation
from pyspark.ml.evaluation import (
    BinaryClassificationEvaluator,
    MulticlassClassificationEvaluator,
    RegressionEvaluator
)

# Utilities
import numpy as np
import random
import time

# Set random seeds
np.random.seed(42)
random.seed(42)

In [None]:
# Create Spark session with more memory for model training
spark = SparkSession.builder \
    .appName("Model Training and Evaluation") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
print(f"Spark version: {spark.version}")
print("Spark session created successfully!")

## 2. Classification: Comparing Multiple Algorithms

We'll train and compare four classification algorithms on the same dataset.

**Scenario**: Predict customer churn based on usage patterns and demographics.

In [None]:
# Generate comprehensive customer churn dataset
n_customers = 5000
churn_data = []

for _ in range(n_customers):
    # Features
    tenure_months = np.random.randint(1, 72)
    monthly_charges = np.random.uniform(20, 120)
    total_charges = monthly_charges * tenure_months + np.random.normal(0, 100)
    num_products = np.random.randint(1, 5)
    support_calls = np.random.randint(0, 10)
    data_usage_gb = np.random.exponential(50)
    
    # Churn probability based on multiple factors
    # Higher churn if: short tenure, high charges, many support calls, few products
    churn_score = (
        -tenure_months * 0.05 +
        monthly_charges * 0.02 +
        support_calls * 0.3 -
        num_products * 0.5 +
        np.random.normal(0, 1)
    )
    
    churn_prob = 1 / (1 + np.exp(-churn_score))
    churned = 1 if np.random.random() < churn_prob else 0
    
    churn_data.append((
        tenure_months,
        float(monthly_charges),
        float(total_charges),
        num_products,
        support_calls,
        float(data_usage_gb),
        churned
    ))

df_churn = spark.createDataFrame(
    churn_data,
    ["tenure_months", "monthly_charges", "total_charges", "num_products", "support_calls", "data_usage_gb", "churned"]
)

print(f"Total customers: {df_churn.count()}")
print("\nChurn distribution:")
df_churn.groupBy("churned").count().show()
print("\nSample data:")
df_churn.show(10)
df_churn.describe().show()
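
To see what the models are being asked to recover, it can help to evaluate the generating formula directly. The sketch below repeats the score from the cell above with the noise term set to zero by default; `churn_probability` is a hypothetical helper for illustration. Note that `total_charges` and `data_usage_gb` do not enter the score at all, so any importance a model assigns to them comes from correlation or noise.

```python
import math


def churn_probability(tenure_months, monthly_charges, support_calls, num_products, noise=0.0):
    """Churn score from the data generator, mapped to (0, 1) by the logistic function."""
    score = (
        -tenure_months * 0.05
        + monthly_charges * 0.02
        + support_calls * 0.3
        - num_products * 0.5
        + noise
    )
    return 1 / (1 + math.exp(-score))


new_customer = churn_probability(tenure_months=2, monthly_charges=100, support_calls=8, num_products=1)
loyal_customer = churn_probability(tenure_months=60, monthly_charges=40, support_calls=0, num_products=4)
print(f"short tenure, high charges, many calls: {new_customer:.3f}")
print(f"long tenure, low charges, no calls:     {loyal_customer:.3f}")
```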

In [None]:
# Prepare features and split data
# We'll use this same split for all models to ensure fair comparison
feature_cols = ["tenure_months", "monthly_charges", "total_charges", "num_products", "support_calls", "data_usage_gb"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features", withStd=True, withMean=False)

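# Prepare features. For simplicity the scaler is fit on the full dataset before splitting,
# which leaks test-set statistics; Section 6 fits it inside a Pipeline on training data only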
df_assembled = assembler.transform(df_churn)
df_scaled = scaler.fit(df_assembled).transform(df_assembled)

# Split: 60% train, 20% validation, 20% test
train_df, val_df, test_df = df_scaled.randomSplit([0.6, 0.2, 0.2], seed=42)

print(f"Training samples: {train_df.count()}")
print(f"Validation samples: {val_df.count()}")
print(f"Test samples: {test_df.count()}")
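
With `withStd=True` and `withMean=False`, each feature is divided by its standard deviation but not centered. A minimal plain-Python sketch of that transformation on one column follows; `scale_to_unit_std` is a hypothetical helper, and the sample values are made up for illustration.

```python
import statistics


def scale_to_unit_std(column):
    """Divide a column by its sample standard deviation without centering it."""
    std = statistics.stdev(column)  # corrected (n - 1) sample standard deviation
    if std == 0:
        # A constant column carries no information; map it to 0.0
        return [0.0 for _ in column]
    return [value / std for value in column]


tenure = [1, 12, 24, 48, 71]  # illustrative values only
scaled = scale_to_unit_std(tenure)
print([round(v, 3) for v in scaled])
print(f"std after scaling:  {statistics.stdev(scaled):.3f}")
print(f"mean after scaling: {statistics.mean(scaled):.3f}")
```

Skipping the centering step keeps sparse vectors sparse, at the cost of leaving the feature means non-zero.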

In [None]:
# Model 1: Logistic Regression
print("Training Logistic Regression...")
start_time = time.time()

lr = LogisticRegression(
    labelCol="churned",
    featuresCol="features",
    maxIter=20,
    regParam=0.01
)

lr_model = lr.fit(train_df)
lr_train_time = time.time() - start_time

# Evaluate on validation set
lr_predictions = lr_model.transform(val_df)

# Multiple metrics
acc_evaluator = MulticlassClassificationEvaluator(labelCol="churned", predictionCol="prediction", metricName="accuracy")
auc_evaluator = BinaryClassificationEvaluator(labelCol="churned", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
f1_evaluator = MulticlassClassificationEvaluator(labelCol="churned", predictionCol="prediction", metricName="f1")

lr_accuracy = acc_evaluator.evaluate(lr_predictions)
lr_auc = auc_evaluator.evaluate(lr_predictions)
lr_f1 = f1_evaluator.evaluate(lr_predictions)

print(f"Logistic Regression - Time: {lr_train_time:.2f}s, Accuracy: {lr_accuracy:.4f}, AUC: {lr_auc:.4f}, F1: {lr_f1:.4f}")
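
Accuracy depends on the 0.5 decision threshold, while AUC only depends on how the scores rank positives against negatives. The sketch below computes both by hand on a tiny made-up sample. It is a plain-Python illustration under the pairwise-ranking definition of AUC; Spark derives the value from the ROC curve rather than enumerating pairs, and the function names here are hypothetical.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the predicted label equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def area_under_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0, 1, 0]                # made-up ground truth
scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]    # made-up predicted probabilities
predictions = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(labels, predictions)
auc = area_under_roc(labels, scores)
print(f"accuracy={acc:.4f}, AUC={auc:.4f}")
```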

In [None]:
# Model 2: Decision Tree Classifier
print("Training Decision Tree...")
start_time = time.time()

dt = DecisionTreeClassifier(
    labelCol="churned",
    featuresCol="features",
    maxDepth=10,
    minInstancesPerNode=20
)

dt_model = dt.fit(train_df)
dt_train_time = time.time() - start_time

dt_predictions = dt_model.transform(val_df)

dt_accuracy = acc_evaluator.evaluate(dt_predictions)
dt_auc = auc_evaluator.evaluate(dt_predictions)
dt_f1 = f1_evaluator.evaluate(dt_predictions)

print(f"Decision Tree - Time: {dt_train_time:.2f}s, Accuracy: {dt_accuracy:.4f}, AUC: {dt_auc:.4f}, F1: {dt_f1:.4f}")
print(f"Tree depth: {dt_model.depth}, Number of nodes: {dt_model.numNodes}")

In [None]:
# Model 3: Random Forest Classifier
print("Training Random Forest...")
start_time = time.time()

rf = RandomForestClassifier(
    labelCol="churned",
    featuresCol="features",
    numTrees=50,
    maxDepth=10,
    minInstancesPerNode=20,
    seed=42
)

rf_model = rf.fit(train_df)
rf_train_time = time.time() - start_time

rf_predictions = rf_model.transform(val_df)

rf_accuracy = acc_evaluator.evaluate(rf_predictions)
rf_auc = auc_evaluator.evaluate(rf_predictions)
rf_f1 = f1_evaluator.evaluate(rf_predictions)

print(f"Random Forest - Time: {rf_train_time:.2f}s, Accuracy: {rf_accuracy:.4f}, AUC: {rf_auc:.4f}, F1: {rf_f1:.4f}")

# Feature importances
print("\nFeature Importances:")
for i, importance in enumerate(rf_model.featureImportances):
    print(f"{feature_cols[i]}: {importance:.4f}")

In [None]:
# Model 4: Gradient Boosted Trees Classifier
print("Training Gradient Boosted Trees...")
start_time = time.time()

gbt = GBTClassifier(
    labelCol="churned",
    featuresCol="features",
    maxIter=20,
    maxDepth=5,
    stepSize=0.1,
    seed=42
)

gbt_model = gbt.fit(train_df)
gbt_train_time = time.time() - start_time

gbt_predictions = gbt_model.transform(val_df)

gbt_accuracy = acc_evaluator.evaluate(gbt_predictions)
gbt_auc = auc_evaluator.evaluate(gbt_predictions)
gbt_f1 = f1_evaluator.evaluate(gbt_predictions)

print(f"GBT Classifier - Time: {gbt_train_time:.2f}s, Accuracy: {gbt_accuracy:.4f}, AUC: {gbt_auc:.4f}, F1: {gbt_f1:.4f}")

# Feature importances
print("\nFeature Importances:")
for i, importance in enumerate(gbt_model.featureImportances):
    print(f"{feature_cols[i]}: {importance:.4f}")

In [None]:
# Compare all classification models
print("\n" + "="*80)
print("CLASSIFICATION MODEL COMPARISON (Validation Set)")
print("="*80)
print(f"{'Model':<25} {'Time (s)':<12} {'Accuracy':<12} {'AUC':<12} {'F1 Score':<12}")
print("-"*80)
print(f"{'Logistic Regression':<25} {lr_train_time:<12.2f} {lr_accuracy:<12.4f} {lr_auc:<12.4f} {lr_f1:<12.4f}")
print(f"{'Decision Tree':<25} {dt_train_time:<12.2f} {dt_accuracy:<12.4f} {dt_auc:<12.4f} {dt_f1:<12.4f}")
print(f"{'Random Forest':<25} {rf_train_time:<12.2f} {rf_accuracy:<12.4f} {rf_auc:<12.4f} {rf_f1:<12.4f}")
print(f"{'Gradient Boosted Trees':<25} {gbt_train_time:<12.2f} {gbt_accuracy:<12.4f} {gbt_auc:<12.4f} {gbt_f1:<12.4f}")
print("="*80)

## 3. Regression: Comparing Multiple Algorithms

Now let's compare regression algorithms on a house price prediction task.

**Scenario**: Predict house prices based on various features.

In [None]:
# Generate house price dataset
n_houses = 5000
house_data = []

for _ in range(n_houses):
    # Features
    square_feet = np.random.uniform(800, 4000)
    bedrooms = np.random.randint(1, 6)
    bathrooms = np.random.randint(1, 5)
    age_years = np.random.randint(0, 50)
    lot_size = np.random.uniform(1000, 10000)
    garage_spaces = np.random.randint(0, 4)
    
    # Price with non-linear relationships and interactions
    base_price = (
        square_feet * 200 +
        bedrooms * 50000 +
        bathrooms * 30000 -
        age_years * 2000 +
        lot_size * 10 +
        garage_spaces * 15000 +
        100000
    )
    
    # Add some non-linearity
    if square_feet > 3000:
        base_price *= 1.2
    if age_years > 30:
        base_price *= 0.85
    
    # Add noise
    price = base_price + np.random.normal(0, 50000)
    price = max(50000, price)  # Minimum price
    
    house_data.append((
        float(square_feet),
        bedrooms,
        bathrooms,
        age_years,
        float(lot_size),
        garage_spaces,
        float(price)
    ))

df_houses = spark.createDataFrame(
    house_data,
    ["square_feet", "bedrooms", "bathrooms", "age_years", "lot_size", "garage_spaces", "price"]
)

print(f"Total houses: {df_houses.count()}")
print("\nSample data:")
df_houses.show(10)
df_houses.describe().show()
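
The two multipliers in the generator create jumps that a purely linear model cannot represent. The sketch below reproduces the price formula without the noise term and evaluates it on either side of the 3000 square foot threshold; `noise_free_price` is a hypothetical helper for illustration.

```python
def noise_free_price(square_feet, bedrooms, bathrooms, age_years, lot_size, garage_spaces):
    """Price formula from the data generator with the random noise removed."""
    price = (
        square_feet * 200
        + bedrooms * 50000
        + bathrooms * 30000
        - age_years * 2000
        + lot_size * 10
        + garage_spaces * 15000
        + 100000
    )
    if square_feet > 3000:
        price *= 1.2   # premium for large houses
    if age_years > 30:
        price *= 0.85  # discount for old houses
    return price


just_below = noise_free_price(3000, 3, 2, 10, 5000, 2)
just_above = noise_free_price(3001, 3, 2, 10, 5000, 2)
print(f"3000 sq ft: ${just_below:,.0f}")
print(f"3001 sq ft: ${just_above:,.0f}")
print(f"jump from one extra square foot: ${just_above - just_below:,.0f}")
```

Tree-based models can place a split at such a threshold, which is one reason to expect them to fit this dataset better than Linear Regression.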

In [None]:
# Prepare regression features
reg_feature_cols = ["square_feet", "bedrooms", "bathrooms", "age_years", "lot_size", "garage_spaces"]

reg_assembler = VectorAssembler(inputCols=reg_feature_cols, outputCol="features")
df_reg_assembled = reg_assembler.transform(df_houses)

# Split data
reg_train, reg_val, reg_test = df_reg_assembled.randomSplit([0.6, 0.2, 0.2], seed=42)

print(f"Training samples: {reg_train.count()}")
print(f"Validation samples: {reg_val.count()}")
print(f"Test samples: {reg_test.count()}")

In [None]:
# Regression evaluator
reg_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction")

# Model 1: Linear Regression
print("Training Linear Regression...")
start_time = time.time()

lin_reg = LinearRegression(
    labelCol="price",
    featuresCol="features",
    maxIter=20,
    regParam=0.01,
    elasticNetParam=0.0
)

lin_reg_model = lin_reg.fit(reg_train)
lin_reg_time = time.time() - start_time

lin_reg_pred = lin_reg_model.transform(reg_val)

lin_reg_rmse = reg_evaluator.evaluate(lin_reg_pred, {reg_evaluator.metricName: "rmse"})
lin_reg_r2 = reg_evaluator.evaluate(lin_reg_pred, {reg_evaluator.metricName: "r2"})
lin_reg_mae = reg_evaluator.evaluate(lin_reg_pred, {reg_evaluator.metricName: "mae"})

print(f"Linear Regression - Time: {lin_reg_time:.2f}s, RMSE: ${lin_reg_rmse:,.0f}, R²: {lin_reg_r2:.4f}, MAE: ${lin_reg_mae:,.0f}")
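
The three regression metrics used throughout this section can be computed by hand as a sanity check. The sketch below uses the standard definitions on a few made-up prices; `regression_metrics` is a hypothetical helper, not part of MLlib.

```python
import math


def regression_metrics(actual, predicted):
    """Return RMSE, MAE, and R-squared for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


actual = [300000, 450000, 520000, 610000]     # made-up prices
predicted = [310000, 440000, 500000, 650000]  # made-up predictions
metrics = regression_metrics(actual, predicted)
print(f"RMSE: ${metrics['rmse']:,.0f}, MAE: ${metrics['mae']:,.0f}, R2: {metrics['r2']:.4f}")
```

RMSE is never smaller than MAE, and the gap widens when a few predictions are far off, as with the last house here.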

In [None]:
# Model 2: Decision Tree Regressor
print("Training Decision Tree Regressor...")
start_time = time.time()

dt_reg = DecisionTreeRegressor(
    labelCol="price",
    featuresCol="features",
    maxDepth=10,
    minInstancesPerNode=20
)

dt_reg_model = dt_reg.fit(reg_train)
dt_reg_time = time.time() - start_time

dt_reg_pred = dt_reg_model.transform(reg_val)

dt_reg_rmse = reg_evaluator.evaluate(dt_reg_pred, {reg_evaluator.metricName: "rmse"})
dt_reg_r2 = reg_evaluator.evaluate(dt_reg_pred, {reg_evaluator.metricName: "r2"})
dt_reg_mae = reg_evaluator.evaluate(dt_reg_pred, {reg_evaluator.metricName: "mae"})

print(f"Decision Tree - Time: {dt_reg_time:.2f}s, RMSE: ${dt_reg_rmse:,.0f}, R²: {dt_reg_r2:.4f}, MAE: ${dt_reg_mae:,.0f}")

In [None]:
# Model 3: Random Forest Regressor
print("Training Random Forest Regressor...")
start_time = time.time()

rf_reg = RandomForestRegressor(
    labelCol="price",
    featuresCol="features",
    numTrees=50,
    maxDepth=10,
    minInstancesPerNode=20,
    seed=42
)

rf_reg_model = rf_reg.fit(reg_train)
rf_reg_time = time.time() - start_time

rf_reg_pred = rf_reg_model.transform(reg_val)

rf_reg_rmse = reg_evaluator.evaluate(rf_reg_pred, {reg_evaluator.metricName: "rmse"})
rf_reg_r2 = reg_evaluator.evaluate(rf_reg_pred, {reg_evaluator.metricName: "r2"})
rf_reg_mae = reg_evaluator.evaluate(rf_reg_pred, {reg_evaluator.metricName: "mae"})

print(f"Random Forest - Time: {rf_reg_time:.2f}s, RMSE: ${rf_reg_rmse:,.0f}, R²: {rf_reg_r2:.4f}, MAE: ${rf_reg_mae:,.0f}")

print("\nFeature Importances:")
for i, importance in enumerate(rf_reg_model.featureImportances):
    print(f"{reg_feature_cols[i]}: {importance:.4f}")

In [None]:
# Model 4: Gradient Boosted Tree Regressor
print("Training Gradient Boosted Tree Regressor...")
start_time = time.time()

gbt_reg = GBTRegressor(
    labelCol="price",
    featuresCol="features",
    maxIter=20,
    maxDepth=5,
    stepSize=0.1,
    seed=42
)

gbt_reg_model = gbt_reg.fit(reg_train)
gbt_reg_time = time.time() - start_time

gbt_reg_pred = gbt_reg_model.transform(reg_val)

gbt_reg_rmse = reg_evaluator.evaluate(gbt_reg_pred, {reg_evaluator.metricName: "rmse"})
gbt_reg_r2 = reg_evaluator.evaluate(gbt_reg_pred, {reg_evaluator.metricName: "r2"})
gbt_reg_mae = reg_evaluator.evaluate(gbt_reg_pred, {reg_evaluator.metricName: "mae"})

print(f"GBT Regressor - Time: {gbt_reg_time:.2f}s, RMSE: ${gbt_reg_rmse:,.0f}, R²: {gbt_reg_r2:.4f}, MAE: ${gbt_reg_mae:,.0f}")

print("\nFeature Importances:")
for i, importance in enumerate(gbt_reg_model.featureImportances):
    print(f"{reg_feature_cols[i]}: {importance:.4f}")

In [None]:
# Compare all regression models
print("\n" + "="*90)
print("REGRESSION MODEL COMPARISON (Validation Set)")
print("="*90)
print(f"{'Model':<25} {'Time (s)':<12} {'RMSE':<15} {'R² Score':<12} {'MAE':<15}")
print("-"*90)
print(f"{'Linear Regression':<25} {lin_reg_time:<12.2f} ${lin_reg_rmse:<14,.0f} {lin_reg_r2:<12.4f} ${lin_reg_mae:<14,.0f}")
print(f"{'Decision Tree':<25} {dt_reg_time:<12.2f} ${dt_reg_rmse:<14,.0f} {dt_reg_r2:<12.4f} ${dt_reg_mae:<14,.0f}")
print(f"{'Random Forest':<25} {rf_reg_time:<12.2f} ${rf_reg_rmse:<14,.0f} {rf_reg_r2:<12.4f} ${rf_reg_mae:<14,.0f}")
print(f"{'Gradient Boosted Trees':<25} {gbt_reg_time:<12.2f} ${gbt_reg_rmse:<14,.0f} {gbt_reg_r2:<12.4f} ${gbt_reg_mae:<14,.0f}")
print("="*90)

## 4. Cross-Validation

Cross-validation provides more reliable performance estimates by averaging results across multiple train/validation splits.

**Benefits:**
- Reduces variance in performance estimates
- Uses all data for both training and validation
- Detects overfitting more reliably

**Trade-off:**
- Training time grows roughly in proportion to k (for example, about 3x for 3-fold CV), because every parameter combination is fit once per fold

**When to use:**
- Small to medium datasets where each sample is valuable
- When you need high confidence in model selection
- For final model selection before production

In [None]:
# Prepare data for cross-validation (use train + val for CV)
cv_data = train_df.union(val_df)

print(f"Cross-validation data size: {cv_data.count()}")

# Create the estimator to tune (a single model, no preprocessing stages)
cv_lr = LogisticRegression(
    labelCol="churned",
    featuresCol="features",
    maxIter=10
)

# Create evaluator
cv_evaluator = BinaryClassificationEvaluator(
    labelCol="churned",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC"
)

# Create CrossValidator
# We'll test different values of regParam
param_grid = ParamGridBuilder() \
    .addGrid(cv_lr.regParam, [0.001, 0.01, 0.1]) \
    .build()

crossval = CrossValidator(
    estimator=cv_lr,
    estimatorParamMaps=param_grid,
    evaluator=cv_evaluator,
    numFolds=3,  # 3-fold cross-validation
    seed=42
)

print("\nStarting 3-fold cross-validation...")
print(f"Testing {len(param_grid)} parameter combinations")
print(f"Total model training runs: {3 * len(param_grid)} (3 folds × {len(param_grid)} params)")

In [None]:
# Run cross-validation
start_time = time.time()
cv_model = crossval.fit(cv_data)
cv_time = time.time() - start_time

print(f"\nCross-validation completed in {cv_time:.2f}s")

# Get best model and its parameters
best_lr_model = cv_model.bestModel
print(f"\nBest regParam: {best_lr_model.getRegParam()}")

# Average metrics for each parameter combination
print("\nAverage AUC for each parameter combination:")
for params, avg_metric in zip(param_grid, cv_model.avgMetrics):
    reg_param = params[cv_lr.regParam]
    print(f"regParam={reg_param}: {avg_metric:.4f}")

In [None]:
# Evaluate best model on test set
cv_test_predictions = cv_model.transform(test_df)
cv_test_auc = cv_evaluator.evaluate(cv_test_predictions)

print(f"\nBest model performance on test set:")
print(f"AUC: {cv_test_auc:.4f}")

# Also calculate accuracy
cv_test_accuracy = acc_evaluator.evaluate(cv_test_predictions)
print(f"Accuracy: {cv_test_accuracy:.4f}")

## 5. Hyperparameter Tuning with Grid Search

**Hyperparameter Tuning** finds the best configuration for a model by systematically trying different combinations.

**Common hyperparameters:**

**Random Forest:**
- `numTrees`: Number of trees (more trees reduce variance with diminishing returns, at higher training cost)
- `maxDepth`: Maximum tree depth (higher = more complex)
- `minInstancesPerNode`: Minimum samples per leaf (higher = more regularization)

**Gradient Boosted Trees:**
- `maxIter`: Number of boosting iterations
- `maxDepth`: Tree depth per iteration
- `stepSize`: Learning rate (smaller = more conservative)

**TrainValidationSplit vs CrossValidator:**
- TrainValidationSplit: Faster (single split), less reliable
- CrossValidator: Slower (k splits), more reliable
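
The cost difference follows from how many models each approach fits. The sketch below expands a grid the way a Cartesian-product grid search does and counts the fits, using the Random Forest values tuned in this section. The helpers `expand_grid` and `fits_required` are hypothetical; the `+ 1` reflects that both tuners refit the best configuration on the full input data at the end.

```python
from itertools import product


def expand_grid(grid):
    """Return every combination of the listed parameter values as a dict."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[name] for name in names))]


def fits_required(n_combinations, n_splits):
    """One fit per combination per split, plus a final refit of the winner."""
    return n_combinations * n_splits + 1


rf_grid = {"numTrees": [20, 50], "maxDepth": [5, 10], "minInstancesPerNode": [10, 20]}
combos = expand_grid(rf_grid)

print(f"combinations: {len(combos)}")
print(f"TrainValidationSplit fits: {fits_required(len(combos), 1)}")
print(f"3-fold CrossValidator fits: {fits_required(len(combos), 3)}")
```

Adding one more value to any parameter multiplies the grid size, so small grids grow expensive quickly.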

In [None]:
# Hyperparameter tuning for Random Forest
print("Hyperparameter Tuning for Random Forest Classifier")
print("="*60)

# Create Random Forest estimator
rf_tuning = RandomForestClassifier(
    labelCol="churned",
    featuresCol="features",
    seed=42
)

# Build parameter grid
# Testing combinations of numTrees, maxDepth, and minInstancesPerNode
param_grid_rf = ParamGridBuilder() \
    .addGrid(rf_tuning.numTrees, [20, 50]) \
    .addGrid(rf_tuning.maxDepth, [5, 10]) \
    .addGrid(rf_tuning.minInstancesPerNode, [10, 20]) \
    .build()

print(f"Total parameter combinations: {len(param_grid_rf)}")
print(f"Parameters being tuned:")
print(f"  - numTrees: [20, 50]")
print(f"  - maxDepth: [5, 10]")
print(f"  - minInstancesPerNode: [10, 20]")

# Use TrainValidationSplit for faster tuning
# Splits data into 80% train, 20% validation
tvs = TrainValidationSplit(
    estimator=rf_tuning,
    estimatorParamMaps=param_grid_rf,
    evaluator=auc_evaluator,
    trainRatio=0.8,
    seed=42
)

print("\nRunning hyperparameter tuning...")
start_time = time.time()
tvs_model = tvs.fit(cv_data)
tuning_time = time.time() - start_time

print(f"Tuning completed in {tuning_time:.2f}s")

In [None]:
# Get best Random Forest model
best_rf_model = tvs_model.bestModel

print("\nBest Random Forest Parameters:")
print(f"  numTrees: {best_rf_model.getNumTrees}")
print(f"  maxDepth: {best_rf_model.getMaxDepth()}")
print(f"  minInstancesPerNode: {best_rf_model.getMinInstancesPerNode()}")

# Show all results
print("\nAll parameter combinations and their validation AUC:")
print("-"*70)
for params, metric in zip(param_grid_rf, tvs_model.validationMetrics):
    num_trees = params[rf_tuning.numTrees]
    max_depth = params[rf_tuning.maxDepth]
    min_instances = params[rf_tuning.minInstancesPerNode]
    print(f"Trees={num_trees:2d}, Depth={max_depth:2d}, MinInst={min_instances:2d} → AUC: {metric:.4f}")

In [None]:
# Evaluate tuned model on test set
tuned_predictions = tvs_model.transform(test_df)

tuned_auc = auc_evaluator.evaluate(tuned_predictions)
tuned_accuracy = acc_evaluator.evaluate(tuned_predictions)
tuned_f1 = f1_evaluator.evaluate(tuned_predictions)

print("\n" + "="*60)
print("TUNED MODEL PERFORMANCE ON TEST SET")
print("="*60)
print(f"AUC:      {tuned_auc:.4f}")
print(f"Accuracy: {tuned_accuracy:.4f}")
print(f"F1 Score: {tuned_f1:.4f}")
print("="*60)

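# Compare to the untuned Random Forest from Section 2, scored on the same test set
# (comparing validation metrics with test metrics would not be like for like)
rf_test_predictions = rf_model.transform(test_df)
rf_test_auc = auc_evaluator.evaluate(rf_test_predictions)
rf_test_accuracy = acc_evaluator.evaluate(rf_test_predictions)
rf_test_f1 = f1_evaluator.evaluate(rf_test_predictions)
print("\nComparison to Untuned Random Forest (both on test set):")
print(f"Untuned RF: AUC={rf_test_auc:.4f}, Acc={rf_test_accuracy:.4f}, F1={rf_test_f1:.4f}")
print(f"Tuned RF:   AUC={tuned_auc:.4f}, Acc={tuned_accuracy:.4f}, F1={tuned_f1:.4f}")
print(f"Difference: AUC={tuned_auc-rf_test_auc:+.4f}, Acc={tuned_accuracy-rf_test_accuracy:+.4f}, F1={tuned_f1-rf_test_f1:+.4f}")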

## 6. Complete Model Selection Pipeline

Let's build a complete workflow that:
1. Preprocesses data
2. Trains multiple models
3. Uses cross-validation
4. Tunes hyperparameters
5. Selects the best model
6. Evaluates on test set

In [None]:
# Complete pipeline with preprocessing and model
# We'll use the original unprocessed churn data

# Split original data
pipeline_train, pipeline_test = df_churn.randomSplit([0.8, 0.2], seed=42)

# Build comprehensive pipeline
# Stage 1: Assemble features
pipeline_assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="raw_features"
)

# Stage 2: Scale features
pipeline_scaler = StandardScaler(
    inputCol="raw_features",
    outputCol="features",
    withStd=True,
    withMean=False
)

# Stage 3: Model (its hyperparameters are varied during tuning)
pipeline_gbt = GBTClassifier(
    labelCol="churned",
    featuresCol="features",
    seed=42
)

# Create pipeline
complete_pipeline = Pipeline(stages=[
    pipeline_assembler,
    pipeline_scaler,
    pipeline_gbt
])

print("Complete ML Pipeline Created:")
print("  1. VectorAssembler: Combine features")
print("  2. StandardScaler: Scale features to unit standard deviation")
print("  3. GBTClassifier: Train model")

In [None]:
# Hyperparameter grid for GBT
gbt_param_grid = ParamGridBuilder() \
    .addGrid(pipeline_gbt.maxIter, [10, 20]) \
    .addGrid(pipeline_gbt.maxDepth, [3, 5]) \
    .addGrid(pipeline_gbt.stepSize, [0.05, 0.1]) \
    .build()

print(f"Testing {len(gbt_param_grid)} GBT parameter combinations:")
print(f"  - maxIter: [10, 20]")
print(f"  - maxDepth: [3, 5]")
print(f"  - stepSize: [0.05, 0.1]")

# Use CrossValidator for more reliable results
pipeline_cv = CrossValidator(
    estimator=complete_pipeline,
    estimatorParamMaps=gbt_param_grid,
    evaluator=auc_evaluator,
    numFolds=3,
    seed=42
)

print(f"\nUsing 3-fold cross-validation")
print(f"Total training runs: {3 * len(gbt_param_grid)}")

In [None]:
# Train with cross-validation and hyperparameter tuning
print("\nStarting complete model selection process...")
start_time = time.time()
final_model = pipeline_cv.fit(pipeline_train)
total_time = time.time() - start_time

print(f"Model selection completed in {total_time:.2f}s ({total_time/60:.1f} minutes)")

In [None]:
# Extract best model and parameters
best_pipeline_model = final_model.bestModel
best_gbt = best_pipeline_model.stages[-1]  # Last stage is the GBT model

print("\nBest Model Parameters:")
print(f"  maxIter: {best_gbt.getMaxIter()}")
print(f"  maxDepth: {best_gbt.getMaxDepth()}")
print(f"  stepSize: {best_gbt.getStepSize()}")

# Show all CV results
print("\nCross-Validation Results (Average AUC across 3 folds):")
print("-"*80)
for params, avg_metric in zip(gbt_param_grid, final_model.avgMetrics):
    max_iter = params[pipeline_gbt.maxIter]
    max_depth = params[pipeline_gbt.maxDepth]
    step_size = params[pipeline_gbt.stepSize]
    print(f"Iter={max_iter:2d}, Depth={max_depth}, Step={step_size:.2f} → Avg AUC: {avg_metric:.4f}")

best_idx = final_model.avgMetrics.index(max(final_model.avgMetrics))
print("-"*80)
print(f"Best configuration: Index {best_idx} with AUC={max(final_model.avgMetrics):.4f}")

In [None]:
# Final evaluation on test set
final_predictions = final_model.transform(pipeline_test)

final_auc = auc_evaluator.evaluate(final_predictions)
final_accuracy = acc_evaluator.evaluate(final_predictions)
final_f1 = f1_evaluator.evaluate(final_predictions)
final_precision = MulticlassClassificationEvaluator(
    labelCol="churned", predictionCol="prediction", metricName="weightedPrecision"
).evaluate(final_predictions)
final_recall = MulticlassClassificationEvaluator(
    labelCol="churned", predictionCol="prediction", metricName="weightedRecall"
).evaluate(final_predictions)

print("\n" + "="*80)
print("FINAL MODEL PERFORMANCE ON TEST SET")
print("="*80)
print(f"AUC:       {final_auc:.4f}")
print(f"Accuracy:  {final_accuracy:.4f}")
print(f"Precision: {final_precision:.4f}")
print(f"Recall:    {final_recall:.4f}")
print(f"F1 Score:  {final_f1:.4f}")
print("="*80)

# Feature importances from best model
print("\nFeature Importances (from best model):")
for i, importance in enumerate(best_gbt.featureImportances):
    print(f"  {feature_cols[i]:<20} {importance:.4f}")
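
The precision and recall printed above are the `weightedPrecision` and `weightedRecall` metrics: per-class values averaged with each class weighted by its share of the rows. The plain-Python sketch below shows that calculation on made-up labels; `weighted_precision_recall` is a hypothetical helper. Weighted recall always equals accuracy, so it says little about how well the minority (churned) class is detected.

```python
from collections import Counter


def weighted_precision_recall(labels, predictions):
    """Average per-class precision and recall, weighted by class frequency."""
    support = Counter(labels)
    total = len(labels)
    weighted_precision = 0.0
    weighted_recall = 0.0
    for cls, count in support.items():
        true_pos = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        predicted_as_cls = sum(1 for p in predictions if p == cls)
        precision = true_pos / predicted_as_cls if predicted_as_cls else 0.0
        recall = true_pos / count
        weight = count / total
        weighted_precision += weight * precision
        weighted_recall += weight * recall
    return weighted_precision, weighted_recall


labels =      [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # made-up: 4 churners out of 10
predictions = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]  # made-up: only 2 churners caught

w_precision, w_recall = weighted_precision_recall(labels, predictions)
churn_recall = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1) / labels.count(1)
print(f"weighted precision: {w_precision:.4f}")
print(f"weighted recall:    {w_recall:.4f}")
print(f"recall on churners: {churn_recall:.4f}")
```

When the cost of missing a churner matters, the per-class recall for label 1 is the more informative number.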

## 7. Exercises

### Exercise 1: Multi-Model Comparison

Create a comprehensive comparison of all classification algorithms.

**Tasks:**
1. Generate a binary classification dataset with 3000 samples
2. Train: Logistic Regression, Decision Tree, Random Forest, and GBT
3. For each model, record: training time, accuracy, AUC, F1, precision, recall
4. Create a summary table and identify the best model
5. Discuss which model you would choose for production and why

In [None]:
# Your code here
# TODO: Generate classification data
# TODO: Train all four models
# TODO: Evaluate and compare
# TODO: Create summary table

### Exercise 2: Cross-Validation Deep Dive

Explore the impact of cross-validation fold count.

**Tasks:**
1. Use the churn dataset
2. Perform cross-validation with k=2, 3, 5, and 10 folds
3. For each k, record: total training time, average AUC, standard deviation
4. Compare the trade-off between reliability (higher k) and speed (lower k)
5. Recommend an optimal k value

In [None]:
# Your code here
# TODO: Run CV with different fold counts
# TODO: Measure time and performance
# TODO: Analyze trade-offs

### Exercise 3: Regression Hyperparameter Tuning

Tune a Random Forest Regressor for the house price prediction task.

**Tasks:**
1. Use the house price dataset
2. Create a parameter grid for Random Forest Regressor with:
   - numTrees: [30, 50, 100]
   - maxDepth: [5, 10, 15]
   - minInstancesPerNode: [5, 10, 20]
3. Use TrainValidationSplit to find the best parameters
4. Evaluate the tuned model on the test set
5. Compare to a default Random Forest Regressor

In [None]:
# Your code here
# TODO: Create parameter grid
# TODO: Run hyperparameter tuning
# TODO: Evaluate and compare to default model

## 8. Exercise Solutions

### Solution 1: Multi-Model Comparison

In [None]:
# Generate dataset
n = 3000
compare_data = []
for _ in range(n):
    x1 = np.random.uniform(0, 10)
    x2 = np.random.uniform(0, 10)
    x3 = np.random.uniform(0, 10)
    x4 = np.random.uniform(0, 10)
    
    score = x1 * 0.5 + x2 * 0.3 + x3 * x4 * 0.02 - 5
    prob = 1 / (1 + np.exp(-score))
    label = 1 if np.random.random() < prob else 0
    
    compare_data.append((float(x1), float(x2), float(x3), float(x4), label))

df_compare = spark.createDataFrame(compare_data, ["x1", "x2", "x3", "x4", "label"])

# Prepare features
comp_assembler = VectorAssembler(inputCols=["x1", "x2", "x3", "x4"], outputCol="features")
df_comp_features = comp_assembler.transform(df_compare)
comp_train, comp_test = df_comp_features.randomSplit([0.7, 0.3], seed=42)

print(f"Dataset: {df_compare.count()} samples")
df_compare.groupBy("label").count().show()

In [None]:
# Dictionary to store results
results = {}

# Evaluators
comp_acc_eval = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
comp_auc_eval = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
comp_f1_eval = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
comp_prec_eval = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedPrecision")
comp_rec_eval = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedRecall")

# Model 1: Logistic Regression
start = time.time()
comp_lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=20)
comp_lr_model = comp_lr.fit(comp_train)
comp_lr_pred = comp_lr_model.transform(comp_test)
results['Logistic Regression'] = {
    'time': time.time() - start,
    'accuracy': comp_acc_eval.evaluate(comp_lr_pred),
    'auc': comp_auc_eval.evaluate(comp_lr_pred),
    'f1': comp_f1_eval.evaluate(comp_lr_pred),
    'precision': comp_prec_eval.evaluate(comp_lr_pred),
    'recall': comp_rec_eval.evaluate(comp_lr_pred)
}

# Model 2: Decision Tree
start = time.time()
comp_dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=10)
comp_dt_model = comp_dt.fit(comp_train)
comp_dt_pred = comp_dt_model.transform(comp_test)
results['Decision Tree'] = {
    'time': time.time() - start,
    'accuracy': comp_acc_eval.evaluate(comp_dt_pred),
    'auc': comp_auc_eval.evaluate(comp_dt_pred),
    'f1': comp_f1_eval.evaluate(comp_dt_pred),
    'precision': comp_prec_eval.evaluate(comp_dt_pred),
    'recall': comp_rec_eval.evaluate(comp_dt_pred)
}

# Model 3: Random Forest
start = time.time()
comp_rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50, maxDepth=10, seed=42)
comp_rf_model = comp_rf.fit(comp_train)
comp_rf_pred = comp_rf_model.transform(comp_test)
results['Random Forest'] = {
    'time': time.time() - start,
    'accuracy': comp_acc_eval.evaluate(comp_rf_pred),
    'auc': comp_auc_eval.evaluate(comp_rf_pred),
    'f1': comp_f1_eval.evaluate(comp_rf_pred),
    'precision': comp_prec_eval.evaluate(comp_rf_pred),
    'recall': comp_rec_eval.evaluate(comp_rf_pred)
}

# Model 4: GBT
start = time.time()
comp_gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=20, maxDepth=5, seed=42)
comp_gbt_model = comp_gbt.fit(comp_train)
comp_gbt_pred = comp_gbt_model.transform(comp_test)
results['GBT'] = {
    'time': time.time() - start,
    'accuracy': comp_acc_eval.evaluate(comp_gbt_pred),
    'auc': comp_auc_eval.evaluate(comp_gbt_pred),
    'f1': comp_f1_eval.evaluate(comp_gbt_pred),
    'precision': comp_prec_eval.evaluate(comp_gbt_pred),
    'recall': comp_rec_eval.evaluate(comp_gbt_pred)
}

print("All models trained!")

In [None]:
# Summary table
print("\n" + "="*120)
print("COMPREHENSIVE MODEL COMPARISON")
print("="*120)
print(f"{'Model':<20} {'Time(s)':<10} {'Accuracy':<12} {'AUC':<12} {'F1':<12} {'Precision':<12} {'Recall':<12}")
print("-"*120)
for model_name, metrics in results.items():
    print(f"{model_name:<20} {metrics['time']:<10.2f} {metrics['accuracy']:<12.4f} {metrics['auc']:<12.4f} "
          f"{metrics['f1']:<12.4f} {metrics['precision']:<12.4f} {metrics['recall']:<12.4f}")
print("="*120)

# Find best model by AUC
best_model = max(results.items(), key=lambda x: x[1]['auc'])
print(f"\nBest Model (by AUC): {best_model[0]} with AUC={best_model[1]['auc']:.4f}")
print("\nRecommendation for Production:")
print(f"  - {best_model[0]} has the highest test AUC in this comparison")
print("  - Consider Random Forest for robustness and built-in feature importances")
print("  - Use Logistic Regression if speed and explainability are critical")
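
Picking the model with the highest AUC ignores training cost. The sketch below, in plain Python with no Spark dependency, picks the fastest model whose AUC is within a tolerance of the best one. The function name `pick_model`, the `tolerance` parameter, and the numbers in `example_results` are hypothetical; they only mirror the shape of the `results` dictionary and are not output from this notebook.

```python
def pick_model(results, metric="auc", tolerance=0.01):
    """Return the name of the fastest model whose metric is within
    `tolerance` of the best value.

    `results` maps a model name to a dict with at least `metric` and 'time'.
    """
    best_value = max(m[metric] for m in results.values())
    # Keep every model that is close enough to the best score.
    candidates = {
        name: m for name, m in results.items()
        if best_value - m[metric] <= tolerance
    }
    # Among the candidates, prefer the one that trained fastest.
    return min(candidates, key=lambda name: candidates[name]["time"])


# Made-up numbers for illustration only (not results from this notebook).
example_results = {
    "Logistic Regression": {"time": 2.0, "auc": 0.845},
    "Random Forest": {"time": 9.0, "auc": 0.850},
    "GBT": {"time": 15.0, "auc": 0.852},
}

chosen = pick_model(example_results, tolerance=0.01)
print(chosen)
```

Setting `tolerance=0.0` reduces this to the plain "highest AUC wins" rule; a larger tolerance trades a little AUC for faster training.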

### Solution 2: Cross-Validation Deep Dive

In [None]:
# Test different fold counts
fold_counts = [2, 3, 5]
cv_results = {}

base_lr = LogisticRegression(labelCol="churned", featuresCol="features", maxIter=10)
base_evaluator = BinaryClassificationEvaluator(labelCol="churned", rawPredictionCol="rawPrediction")

# Use smaller parameter grid for speed
simple_grid = ParamGridBuilder().addGrid(base_lr.regParam, [0.01, 0.1]).build()

print("Testing different cross-validation fold counts...\n")

for k in fold_counts:
    print(f"Running {k}-fold cross-validation...")
    
    cv = CrossValidator(
        estimator=base_lr,
        estimatorParamMaps=simple_grid,
        evaluator=base_evaluator,
        numFolds=k,
        seed=42
    )
    
    start = time.time()
    cv_model = cv.fit(cv_data)
    elapsed = time.time() - start
    
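    # avgMetrics holds the mean AUC across folds, one entry per parameter map
    avg_metrics = cv_model.avgMetrics
    best_metric = max(avg_metrics)
    # Spread across parameter maps, not the fold-to-fold variation
    std_dev = np.std(avg_metrics)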
    
    cv_results[k] = {
        'time': elapsed,
        'best_auc': best_metric,
        'std_dev': std_dev
    }
    
    print(f"  Time: {elapsed:.2f}s, Best AUC: {best_metric:.4f}, Std Dev: {std_dev:.4f}\n")

print("Cross-validation comparison complete!")

In [None]:
# Summary
print("\n" + "="*70)
print("CROSS-VALIDATION FOLD COUNT COMPARISON")
print("="*70)
print(f"{'Folds':<10} {'Time (s)':<15} {'Best AUC':<15} {'Std Dev':<15}")
print("-"*70)
for k, metrics in cv_results.items():
    print(f"{k:<10} {metrics['time']:<15.2f} {metrics['best_auc']:<15.4f} {metrics['std_dev']:<15.6f}")
print("="*70)

print("\nRecommendation:")
print("  - For rapid prototyping: Use 2-3 folds (faster)")
print("  - For production model selection: Use 5 folds (good balance)")
print("  - For small datasets: Use 10 folds (each model trains on 90% of the data)")
print("  - Trade-off: Higher k = more reliable but slower")
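
The time differences between fold counts follow from how much work each setting requires. The sketch below is plain Python, not Spark code. It assumes `CrossValidator` fits every parameter map once per fold and then refits the best map on the full input. The helper names `cv_fit_count` and `train_fraction` are hypothetical.

```python
def cv_fit_count(num_folds, grid_size):
    """Model fits for k-fold CV over a grid: one fit per
    (fold, parameter map) pair, plus one final refit of the best map."""
    return num_folds * grid_size + 1


def train_fraction(num_folds):
    """Fraction of rows each fold's model is trained on."""
    return (num_folds - 1) / num_folds


grid_size = 2  # two regParam values, as in simple_grid
for k in [2, 3, 5, 10]:
    print(f"k={k}: {cv_fit_count(k, grid_size)} fits, "
          f"{train_fraction(k):.0%} of rows used for training per fold")
```

Higher k means each model sees more of the data, but the number of fits grows linearly with k.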

### Solution 3: Regression Hyperparameter Tuning

In [None]:
# Prepare house data
tune_train, tune_test = df_reg_assembled.randomSplit([0.8, 0.2], seed=42)

# Default Random Forest
print("Training default Random Forest Regressor...")
default_rfr = RandomForestRegressor(labelCol="price", featuresCol="features", seed=42)
default_rfr_model = default_rfr.fit(tune_train)
default_pred = default_rfr_model.transform(tune_test)

default_rmse = reg_evaluator.evaluate(default_pred, {reg_evaluator.metricName: "rmse"})
default_r2 = reg_evaluator.evaluate(default_pred, {reg_evaluator.metricName: "r2"})

print(f"Default RF: RMSE=${default_rmse:,.0f}, R²={default_r2:.4f}\n")

In [None]:
# Hyperparameter tuning
print("Starting hyperparameter tuning...")

tune_rfr = RandomForestRegressor(labelCol="price", featuresCol="features", seed=42)

tune_grid = ParamGridBuilder() \
    .addGrid(tune_rfr.numTrees, [30, 50, 100]) \
    .addGrid(tune_rfr.maxDepth, [5, 10, 15]) \
    .addGrid(tune_rfr.minInstancesPerNode, [5, 10, 20]) \
    .build()

print(f"Total combinations: {len(tune_grid)}\n")

tune_tvs = TrainValidationSplit(
    estimator=tune_rfr,
    estimatorParamMaps=tune_grid,
    evaluator=RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse"),
    trainRatio=0.8,
    seed=42
)

start = time.time()
tune_model = tune_tvs.fit(tune_train)
tune_time = time.time() - start

print(f"Tuning completed in {tune_time:.2f}s\n")
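
To see where the combination count comes from, the grid can be reproduced without Spark as a Cartesian product of the candidate values. This is only an illustration of what `ParamGridBuilder` enumerates, using the same values as `tune_grid`. The variable names are hypothetical, and the fit counts assume one fit per combination plus one final refit of the best combination.

```python
from itertools import product

# Same candidate values as tune_grid.
param_values = {
    "numTrees": [30, 50, 100],
    "maxDepth": [5, 10, 15],
    "minInstancesPerNode": [5, 10, 20],
}

# Cartesian product: one dict per parameter combination.
names = list(param_values)
combinations = [
    dict(zip(names, values)) for values in product(*param_values.values())
]

print(len(combinations))  # 3 * 3 * 3 = 27

# Single train/validation split: one fit per combination plus a final refit.
tvs_fits = len(combinations) + 1
# For comparison, 5-fold cross-validation over the same grid.
cv5_fits = 5 * len(combinations) + 1
print(tvs_fits, cv5_fits)
```

Adding a fourth parameter with three values would triple the grid, so it pays to keep each list short.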

In [None]:
# Best model
best_rfr = tune_model.bestModel
print("Best Parameters:")
print(f"  numTrees: {best_rfr.getNumTrees}")  # property on PySpark tree-ensemble models, so no parentheses
print(f"  maxDepth: {best_rfr.getMaxDepth()}")
print(f"  minInstancesPerNode: {best_rfr.getMinInstancesPerNode()}\n")

# Evaluate tuned model
tuned_pred = tune_model.transform(tune_test)
tuned_rmse = reg_evaluator.evaluate(tuned_pred, {reg_evaluator.metricName: "rmse"})
tuned_r2 = reg_evaluator.evaluate(tuned_pred, {reg_evaluator.metricName: "r2"})

print("\n" + "="*70)
print("COMPARISON: Default vs Tuned Random Forest Regressor")
print("="*70)
print(f"{'Model':<20} {'RMSE':<20} {'R² Score':<20}")
print("-"*70)
print(f"{'Default RF':<20} ${default_rmse:<19,.0f} {default_r2:<20.4f}")
print(f"{'Tuned RF':<20} ${tuned_rmse:<19,.0f} {tuned_r2:<20.4f}")
print("-"*70)
print(f"{'Improvement':<20} ${default_rmse - tuned_rmse:<19,.0f} {tuned_r2 - default_r2:<20.4f}")
print("="*70)
print(f"\nRMSE improvement: {((default_rmse - tuned_rmse) / default_rmse * 100):.1f}%")
if tuned_rmse < default_rmse:
    print("Hyperparameter tuning improved test RMSE.")
else:
    print("Hyperparameter tuning did not improve test RMSE on this split.")

## 9. Summary

Congratulations! You've mastered model training and evaluation in PySpark.

### Key Concepts:

1. **Algorithm Selection**:
   - Logistic Regression: Fast, interpretable, linear decision boundaries
   - Decision Trees: Non-linear, interpretable, can overfit
   - Random Forests: Robust, reduces overfitting, feature importances
   - Gradient Boosted Trees: Often highest accuracy, slower, sequential training

2. **Model Evaluation**:
   - Classification: Accuracy, AUC, F1, Precision, Recall
   - Regression: RMSE, MAE, R²
   - Choose metrics based on business requirements
   - Always evaluate on held-out test data

3. **Cross-Validation**:
   - Provides more reliable performance estimates
   - Reduces variance in model selection
   - Trade-off between reliability (higher k) and speed (lower k)
   - Most valuable for small to medium datasets, where a single split gives a noisy estimate

4. **Hyperparameter Tuning**:
   - ParamGridBuilder creates parameter combinations
   - CrossValidator: More reliable (k-fold), slower
   - TrainValidationSplit: Faster (single split), less reliable
   - Can significantly improve model performance

5. **Complete ML Pipelines**:
   - Combine preprocessing, feature engineering, and modeling
   - Improve reproducibility and help prevent data leakage (transformers are fit on training data only)
   - Make deployment straightforward
   - Can be saved and loaded for production use
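
As a reminder of what the metrics in item 2 measure, here is a minimal plain-Python sketch of the standard textbook definitions. It is not the Spark implementation: the notebook's evaluators may report weighted or per-label variants, so values can differ. The function names and the sample numbers are hypothetical.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Binary metrics from confusion-matrix counts (positive class)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    """RMSE, MAE and R² for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Made-up inputs for illustration only.
clf = classification_metrics(tp=40, fp=10, fn=20, tn=30)
reg = regression_metrics([100, 200, 300], [110, 190, 310])
print(clf)
print(reg)
```

Precision penalizes false positives and recall penalizes false negatives, which is why the choice between them depends on which error is more costly for the business.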

### Best Practices:

- Start with simple models (Logistic/Linear Regression) as baselines
- Always use a separate test set for final evaluation
- Use cross-validation for model selection, not just a single validation split
- Monitor both training and validation performance to detect overfitting
- Consider training time vs accuracy trade-offs for production
- Document hyperparameter choices and model selection rationale
- Use feature importances to understand and explain model decisions
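
One simple way to act on the overfitting advice is to compare the same metric on training and validation data. The sketch below is plain Python. The helper names and the `max_gap` threshold of 0.05 are hypothetical choices for illustration, not a recommended standard, and it assumes a metric where higher is better (such as AUC or accuracy).

```python
def overfitting_gap(train_score, validation_score):
    """Difference between training and validation score
    (for a metric where higher is better)."""
    return train_score - validation_score


def looks_overfit(train_score, validation_score, max_gap=0.05):
    """Flag a model whose training score exceeds its validation
    score by more than `max_gap`."""
    return overfitting_gap(train_score, validation_score) > max_gap


# Made-up scores for illustration only.
print(looks_overfit(train_score=0.99, validation_score=0.85))
print(looks_overfit(train_score=0.86, validation_score=0.85))
```

A large gap suggests the model has memorized the training data; reducing `maxDepth` or increasing `minInstancesPerNode` are typical responses for tree models.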

### Production Considerations:

- **Model Deployment**: Save the best model using `model.save(path)` and reload it with the matching model class's `load(path)`
- **Monitoring**: Track model performance over time
- **Retraining**: Schedule periodic retraining with new data
- **A/B Testing**: Compare new models against production baseline
- **Explainability**: Prefer interpretable models when transparency matters

### What's Next?

In [Module 11: Spark Streaming Basics](11_spark_streaming_basics.ipynb), you'll learn:
- Real-time data processing with Structured Streaming
- Reading from streaming sources
- Window operations and aggregations
- Output modes and sinks

### Additional Resources:

- [MLlib Classification and Regression Guide](https://spark.apache.org/docs/latest/ml-classification-regression.html)
- [Model Selection and Tuning](https://spark.apache.org/docs/latest/ml-tuning.html)
- [Evaluation Metrics](https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html)

In [None]:
# Clean up
spark.stop()
print("Spark session stopped. Excellent work on model training and evaluation!")