# Module 3 Notebook 3: Hyperparameter Tuning and Model Optimization

In this notebook, we optimize the Linear Regression model we trained in Notebook 2 to predict `total_purchase_amount`. We use hyperparameter tuning with cross-validation to search for the best model configuration and check whether it improves predictive performance.

In [None]:
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression, LinearRegressionModel
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import col
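import time

# Reuse the active session (predefined as `spark` on Databricks) or create one elsewhere
spark = SparkSession.builder.getOrCreate()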

## 1. Load Data

First, we load the pre-processed training and testing datasets created in Module 2, specifically for the regression task.

In [None]:
# Load training and testing data
train_data = spark.table("ecommerce.regression_train_features")
test_data = spark.table("ecommerce.regression_test_features")

print("\nSample Training Data:")
train_data.select("customer_id", "product_id", "features", "total_purchase_amount").limit(5).display()


Sample Training Data:


customer_id,product_id,features,total_purchase_amount
1,154,"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 7, 11, 12, 16, 24), values -> List(0.3548387096774194, 0.2552140504939627, 3.9096937406569814E-4, 0.6551724137931034, 0.1875, 0.5, 0.15797645978130034, 0.6551724137931034, 1.0, 1.0))",41.26
4,244,"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 14, 17), values -> List(0.564516129032258, 0.22420417124039518, 0.043811568093832645, 0.4137931034482758, 0.3125, 0.5, 0.5, 0.3333333333333333, 0.01304181051016494, 1.603082406851895E-5, 0.6516631232308321, 0.4137931034482758, 1.0, 1.0))",137.89
6,373,"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 7, 9, 11, 12, 13, 14, 18), values -> List(0.22580645161290322, 0.0570801317233809, 0.061044884817355975, 0.5862068965517241, 0.125, 0.5, 0.009589566551591868, 0.9311661982921617, 0.5862068965517241, 1.0, 1.0, 1.0))",112.2
7,58,"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 7, 9, 11, 12, 15, 17), values -> List(0.564516129032258, 0.2897914379802415, 0.08802943769404731, 0.5172413793103449, 0.3125, 0.5, 0.01649405446873801, 0.568094947294946, 0.5172413793103449, 1.0, 1.0))",219.74
7,222,"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 15, 19), values -> List(0.564516129032258, 0.2897914379802415, 0.021342328184292228, 0.2758620689655172, 0.4375, 0.5, 0.5, 0.3333333333333333, 0.01457614115841964, 1.803467707708382E-5, 0.6960536549098333, 0.2758620689655172, 1.0, 1.0))",106.71


## 2. Baseline Model Recap

Let's quickly re-evaluate the baseline Linear Regression model (from Notebook 2) on the test set to establish a performance benchmark before tuning.

In [None]:
# Define a baseline Linear Regression model
lr_baseline = LinearRegression(featuresCol="features", labelCol="total_purchase_amount")

# Train the baseline model
start_time_baseline = time.time()
lr_baseline_model = lr_baseline.fit(train_data)
end_time_baseline = time.time()

# Make predictions on the test data
predictions_baseline = lr_baseline_model.transform(test_data)

# Evaluate the baseline model
evaluator_rmse = RegressionEvaluator(labelCol="total_purchase_amount", predictionCol="prediction", metricName="rmse")
evaluator_mae = RegressionEvaluator(labelCol="total_purchase_amount", predictionCol="prediction", metricName="mae")
evaluator_r2 = RegressionEvaluator(labelCol="total_purchase_amount", predictionCol="prediction", metricName="r2")

rmse_baseline = evaluator_rmse.evaluate(predictions_baseline)
mae_baseline = evaluator_mae.evaluate(predictions_baseline)
r2_baseline = evaluator_r2.evaluate(predictions_baseline)

print("--- Baseline Model Performance ---")
print(f"Training Time: {end_time_baseline - start_time_baseline:.2f} seconds")
print(f"RMSE: {rmse_baseline:.2f}")
print(f"MAE: {mae_baseline:.2f}")
print(f"R2: {r2_baseline:.4f}")

--- Baseline Model Performance ---
Training Time: 33.26 seconds
RMSE: 25.81
MAE: 12.70
R2: 0.9911
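
For reference, the three metrics reported above can be written out in plain Python. This is a minimal sketch on made-up numbers: the `labels` and `preds` lists and the `regression_metrics` helper are illustrative and not taken from the dataset or the Spark API. `RegressionEvaluator` should compute the same quantities over the prediction DataFrame.

```python
import math

def regression_metrics(labels, preds):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, preds)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_y = sum(labels) / n
    ss_tot = sum((y - mean_y) ** 2 for y in labels)  # total sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Illustrative values only (hypothetical, not from the e-commerce data)
labels = [10.0, 20.0, 30.0, 40.0]
preds = [12.0, 18.0, 33.0, 37.0]
rmse, mae, r2 = regression_metrics(labels, preds)
print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}, R2: {r2:.4f}")
```

RMSE is never smaller than MAE, and a gap as wide as the one in the baseline output (25.81 versus 12.70) usually points to a small number of large errors.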


## 3. Hyperparameter Tuning with CrossValidator

Now, we set up the hyperparameter tuning process using `CrossValidator`. This involves:
1.  **Estimator:** Defining the model we want to tune (Linear Regression).
2.  **Parameter Grid:** Specifying the hyperparameters and the range of values to test.
3.  **Evaluator:** Defining the metric used to compare model performance (RMSE).
4.  **CrossValidator Setup:** Configuring the number of folds (splits of the training data).

In [None]:
# 1. Define the Estimator
lr = LinearRegression(featuresCol="features", labelCol="total_purchase_amount")

# 2. Define the Parameter Grid
# We'll test combinations of the iteration cap, regularization strength, and the L1/L2 mix.
paramGrid = ParamGridBuilder() \
    .addGrid(lr.maxIter, [10, 20]) \
    .addGrid(lr.regParam, [0.01, 0.1, 0.5]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

print(f"Number of parameter combinations to test: {len(paramGrid)}") # 2 * 3 * 3 = 18 combinations

# 3. Define the Evaluator
# We'll use RMSE as the primary metric for optimization.
evaluator = RegressionEvaluator(
    labelCol="total_purchase_amount", 
    predictionCol="prediction", 
    metricName="rmse"
)

# 4. Setup CrossValidator
# We use 3 folds for a balance between robustness and computation time.
cv = CrossValidator(
    estimator=lr, 
    estimatorParamMaps=paramGrid, 
    evaluator=evaluator, 
    numFolds=3, # K=3 folds
    seed=42 # For reproducibility
)

print("CrossValidator configured. Ready for training...")

Number of parameter combinations to test: 18
CrossValidator configured. Ready for training...
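
The count of 18 comes from taking the Cartesian product of the values passed to each `addGrid` call. A plain-Python sketch of the same idea follows; `grid_values` and `grid` are illustrative names rather than part of the Spark API, and the order of Spark's actual param maps may differ.

```python
from itertools import product

# Same candidate values as the grid defined above
grid_values = {
    "maxIter": [10, 20],
    "regParam": [0.01, 0.1, 0.5],
    "elasticNetParam": [0.0, 0.5, 1.0],
}

# One dict per combination, analogous to one param map per combination
names = list(grid_values)
grid = [dict(zip(names, combo)) for combo in product(*grid_values.values())]

print(len(grid))  # 18
print(grid[0])    # {'maxIter': 10, 'regParam': 0.01, 'elasticNetParam': 0.0}
```

Because the grid size is a product, adding a fourth hyperparameter with three values would triple the number of models to fit.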


## 4. Fit CrossValidator

Now we train the `CrossValidator` on the training data. This fits one Linear Regression model per fold and parameter combination (Number of Folds × Number of Parameter Combinations), scores each on its validation fold using the specified evaluator (RMSE), and then refits the best configuration on the full training set.

In [None]:
# Train the CrossValidator model
# This will run 3 * 18 = 54 training jobs, plus one final refit of the best configuration.
print("Starting CrossValidator fitting...")
start_time_cv = time.time()
cvModel = cv.fit(train_data)
end_time_cv = time.time()
print(f"CrossValidator fitting finished in {end_time_cv - start_time_cv:.2f} seconds.")

Starting CrossValidator fitting...


CrossValidator fitting finished in 105.18 seconds.
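
To make the bookkeeping concrete, here is a simplified sketch of k-fold splitting with random fold assignment, together with the number of fits it implies. The helper names (`k_fold_indices`, `total_fits`) and the 12-row toy dataset are assumptions for illustration; Spark's own fold assignment differs in detail.

```python
import random

def k_fold_indices(n_rows, num_folds, seed):
    """Randomly assign each row index to exactly one of num_folds folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(num_folds)]
    for i in range(n_rows):
        folds[rng.randrange(num_folds)].append(i)
    return folds

def total_fits(num_folds, grid_size):
    """Fits during cross-validation plus one final refit of the best params."""
    return num_folds * grid_size + 1

folds = k_fold_indices(n_rows=12, num_folds=3, seed=42)
for k, validation in enumerate(folds):
    # Every fold serves once as validation data; the rest is training data
    training = [i for j, fold in enumerate(folds) if j != k for i in fold]
    print(f"fold {k}: train on {len(training)} rows, validate on {len(validation)} rows")

print(total_fits(num_folds=3, grid_size=18))  # 55
```

With random assignment the folds are only approximately equal in size, which is one reason average metrics can shift slightly when the seed changes.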


## 5. Analyze Tuning Results

After fitting, we can extract the best model found by the `CrossValidator` and examine the optimal hyperparameters.

In [None]:
# Get the best model from CrossValidation
bestModel = cvModel.bestModel

# Print the best parameters found
print("--- Best Model Parameters ---")
print(f"MaxIter: {bestModel.getMaxIter()}")
print(f"RegParam: {bestModel.getRegParam()}")
print(f"ElasticNetParam: {bestModel.getElasticNetParam()}")

# You can also examine the average metrics across folds for each parameter combination
# Combine parameters and metrics for easier viewing
avg_metrics = cvModel.avgMetrics
param_maps = cvModel.getEstimatorParamMaps()

print("\n--- Average RMSE per Parameter Combination (across folds) ---")
results = []
for params, metric in zip(param_maps, avg_metrics):
    results.append((
        params[lr.maxIter], 
        params[lr.regParam], 
        params[lr.elasticNetParam], 
        metric
    ))

# Sort results by RMSE (lower is better)
results.sort(key=lambda x: x[3])

for iter_val, reg_param, enet_param, rmse_val in results[:5]: # Show top 5
    print(f"  maxIter={iter_val}, regParam={reg_param}, elasticNetParam={enet_param} -> Avg RMSE: {rmse_val:.4f}")

--- Best Model Parameters ---
MaxIter: 20
RegParam: 0.01
ElasticNetParam: 1.0

--- Average RMSE per Parameter Combination (across folds) ---
  maxIter=20, regParam=0.01, elasticNetParam=1.0 -> Avg RMSE: 27.8287
  maxIter=20, regParam=0.01, elasticNetParam=0.5 -> Avg RMSE: 27.8288
  maxIter=10, regParam=0.1, elasticNetParam=0.0 -> Avg RMSE: 27.8288
  maxIter=20, regParam=0.1, elasticNetParam=0.0 -> Avg RMSE: 27.8288
  maxIter=10, regParam=0.01, elasticNetParam=0.0 -> Avg RMSE: 27.8293


## 6. Evaluate Tuned Model

Finally, we evaluate the performance of the *best model* (selected through cross-validation) on the held-out *test* dataset.

In [None]:
# Make predictions on the test data using the best model
predictions_tuned = bestModel.transform(test_data)

# Evaluate the tuned model using the same evaluators
rmse_tuned = evaluator_rmse.evaluate(predictions_tuned)
mae_tuned = evaluator_mae.evaluate(predictions_tuned)
r2_tuned = evaluator_r2.evaluate(predictions_tuned)

print("--- Tuned Model Performance (on Test Set) ---")
print(f"RMSE: {rmse_tuned:.2f}")
print(f"MAE: {mae_tuned:.2f}")
print(f"R2: {r2_tuned:.4f}")

--- Tuned Model Performance (on Test Set) ---
RMSE: 25.81
MAE: 12.70
R2: 0.9911


## 7. Compare and Conclude

Let's compare the performance of the tuned model against the baseline.

In [None]:
print("--- Performance Comparison --- Gherkin table format ---")
print("| Metric | Baseline Model | Tuned Model |")
print("|--------|----------------|-------------|")
print(f"| RMSE   | {rmse_baseline:<14.2f} | {rmse_tuned:<11.2f} |")
print(f"| MAE    | {mae_baseline:<14.2f} | {mae_tuned:<11.2f} |")
print(f"| R2     | {r2_baseline:<14.4f} | {r2_tuned:<11.4f} |")

# Discussion based on typical results
print("\nDiscussion:")
print("Hyperparameter tuning aimed to find the optimal settings for the Linear Regression model.")
print(f"By exploring different values for maxIter, regParam, and elasticNetParam using {cv.getNumFolds()}-fold cross-validation, we identified a configuration that performs best according to the RMSE metric on the validation folds.")
print("Comparing the tuned model's performance on the held-out test set to the baseline model:")
if rmse_tuned < rmse_baseline:
    print("- The tuned model shows an improvement in RMSE, indicating better predictive accuracy on average.")
    print("- The MAE likely also improved, suggesting smaller average prediction errors.")
    print("- The R-squared value might be slightly higher, explaining a tiny bit more variance.")
    print("Overall, the tuning process successfully refined the model, leading to more reliable purchase amount predictions.")
elif rmse_tuned == rmse_baseline:
    print("- The tuned model's performance is identical to the baseline. This might happen if the default parameters were already near optimal for this dataset, or if the search grid didn't contain significantly better configurations.")
else:
    print("- The tuned model performed slightly worse or the same as the baseline on the test set. This can sometimes occur due to the randomness in cross-validation splits or if the chosen parameter grid didn't contain better options than the default. The baseline model was already quite strong.")
print("Even small improvements can be valuable, and the tuning process provides confidence that we've explored reasonable model configurations.")

--- Performance Comparison --- Gherkin table format ---
| Metric | Baseline Model | Tuned Model |
|--------|----------------|-------------|
| RMSE   | 25.81          | 25.81       |
| MAE    | 12.70          | 12.70       |
| R2     | 0.9911         | 0.9911      |

Discussion:
Hyperparameter tuning aimed to find the optimal settings for the Linear Regression model.
By exploring different values for maxIter, regParam, and elasticNetParam using 3-fold cross-validation, we identified a configuration that performs best according to the RMSE metric on the validation folds.
Comparing the tuned model's performance on the held-out test set to the baseline model:
- The tuned model performed slightly worse or the same as the baseline on the test set. This can sometimes occur due to the randomness in cross-validation splits or if the chosen parameter grid didn't contain better options than the default. The baseline model was already quite strong.
Even small improvements can be valuable, and the tuning process provides confidence that we've explored reasonable model configurations.
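
Note that the table shows identical values to two decimals, yet the `else` branch fired: `==` on floats compares full precision, so a difference far below the printed precision still counts as "worse". One possible alternative is a tolerance-based comparison. This is a sketch only; the `compare_rmse` helper and the 0.1% relative tolerance are assumptions, and apart from 25.81 the example values are made up.

```python
import math

def compare_rmse(rmse_baseline, rmse_tuned, rel_tol=1e-3):
    """Classify the tuned RMSE relative to the baseline, ignoring tiny gaps."""
    if math.isclose(rmse_tuned, rmse_baseline, rel_tol=rel_tol):
        return "equivalent"
    return "improved" if rmse_tuned < rmse_baseline else "worse"

print(compare_rmse(25.81, 25.81))    # equivalent
print(compare_rmse(25.81, 25.8149))  # equivalent (gap is below 0.1%)
print(compare_rmse(25.81, 24.90))    # improved
print(compare_rmse(25.81, 27.00))    # worse
```

The right tolerance depends on how much error reduction matters for the use case, so treat the default here as a placeholder.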

## 8. Model Persistence

Now that we have identified the best model through hyperparameter tuning, we should save it for future use (e.g., deployment, batch predictions). We save the trained `LinearRegressionModel` object.

In [None]:
# Define the path to save the model
model_save_path = "dbfs:/Workspace/Users/war_che@hotmail.com/PySpark MLlib/models/tuned_linear_regression_model"

# Save the best model
bestModel.write().overwrite().save(model_save_path)

print(f"Best tuned Linear Regression model saved to: {model_save_path}")

# Demonstrate loading the model back (optional)
try:
    loaded_model = LinearRegressionModel.load(model_save_path)
    print("\nSuccessfully loaded the saved model.")
    print(f"Loaded Model MaxIter: {loaded_model.getMaxIter()}")
    print(f"Loaded Model RegParam: {loaded_model.getRegParam()}")
except Exception as e:
    print(f"\nError loading the model: {e}")

Best tuned Linear Regression model saved to: dbfs:/Workspace/Users/war_che@hotmail.com/PySpark MLlib/models/tuned_linear_regression_model

Successfully loaded the saved model.
Loaded Model MaxIter: 20
Loaded Model RegParam: 0.01


## Summary & Next Steps

In this notebook, we successfully:
1. Loaded the pre-processed regression data.
2. Established a baseline performance using the default Linear Regression model.
3. Configured and executed hyperparameter tuning using `CrossValidator` with a `ParamGridBuilder` and `RegressionEvaluator`.
4. Identified the best set of hyperparameters (`maxIter`, `regParam`, `elasticNetParam`) based on cross-validated RMSE.
5. Evaluated the tuned model on the test set and found that its performance matched the baseline to two decimal places (RMSE 25.81, MAE 12.70).
6. Saved the best tuned model for future use.

This concludes Module 3 on training and evaluating models. In Module 4, we will focus on building end-to-end ML Pipelines, integrating all the steps from data loading and feature engineering to model training and persistence.