# Module 3 Notebook 2: Regression with Linear Regression

**Objective:** Train a basic Linear Regression model to predict purchase amounts and evaluate its performance.

In this notebook, we focus on the regression task: predicting the `total_purchase_amount` for customer-product pairs where a purchase occurred. We'll use the preprocessed data from Module 2, train a simple `LinearRegression` model, and evaluate it using standard regression metrics.

In [0]:
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Reuse the notebook's active session if one exists; otherwise create one
spark = SparkSession.builder.getOrCreate()


## 1. Load Data

We load the training and testing datasets created in Module 2, Notebook 3. These datasets contain the final features and the target variable (`total_purchase_amount`) for the regression task.

In [0]:
# Load training and testing data
train_data = spark.table("ecommerce.regression_train_features")
test_data = spark.table("ecommerce.regression_test_features")

# Display schema and a sample
print("Training Data Schema:")
train_data.printSchema()

print("Sample Training Data:")
train_data.select("features", "total_purchase_amount").limit(5).display()

Training Data Schema:
root
 |-- customer_id: integer (nullable = true)
 |-- product_id: long (nullable = true)
 |-- features: vector (nullable = true)
 |-- total_purchase_amount: double (nullable = true)

Sample Training Data:


features,total_purchase_amount
"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 7, 11, 12, 16, 24), values -> List(0.3548387096774194, 0.2552140504939627, 3.9096937406569814E-4, 0.6551724137931034, 0.1875, 0.5, 0.15797645978130034, 0.6551724137931034, 1.0, 1.0))",41.26
"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 14, 17), values -> List(0.564516129032258, 0.22420417124039518, 0.043811568093832645, 0.4137931034482758, 0.3125, 0.5, 0.5, 0.3333333333333333, 0.01304181051016494, 1.603082406851895E-5, 0.6516631232308321, 0.4137931034482758, 1.0, 1.0))",137.89
"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 7, 9, 11, 12, 13, 14, 18), values -> List(0.22580645161290322, 0.0570801317233809, 0.061044884817355975, 0.5862068965517241, 0.125, 0.5, 0.009589566551591868, 0.9311661982921617, 0.5862068965517241, 1.0, 1.0, 1.0))",112.2
"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 7, 9, 11, 12, 15, 17), values -> List(0.564516129032258, 0.2897914379802415, 0.08802943769404731, 0.5172413793103449, 0.3125, 0.5, 0.01649405446873801, 0.568094947294946, 0.5172413793103449, 1.0, 1.0))",219.74
"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 15, 19), values -> List(0.564516129032258, 0.2897914379802415, 0.021342328184292228, 0.2758620689655172, 0.4375, 0.5, 0.5, 0.3333333333333333, 0.01457614115841964, 1.803467707708382E-5, 0.6960536549098333, 0.2758620689655172, 1.0, 1.0))",106.71
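
The `features` column is displayed as a sparse vector: a length plus parallel lists of indices and values, with every unlisted position equal to zero. As a minimal sketch in plain Python (the helper `to_dense` is hypothetical and not part of PySpark), the first sample row above can be expanded like this:

```python
def to_dense(length, indices, values):
    """Expand a sparse (indices, values) representation into a dense list."""
    dense = [0.0] * length
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# First sample row, copied from the output above.
indices = [0, 1, 2, 3, 4, 7, 11, 12, 16, 24]
values = [0.3548387096774194, 0.2552140504939627, 3.9096937406569814e-4,
          0.6551724137931034, 0.1875, 0.5, 0.15797645978130034,
          0.6551724137931034, 1.0, 1.0]

dense = to_dense(27, indices, values)
print(len(dense))                          # number of feature slots
print(sum(1 for x in dense if x != 0.0))   # number of non-zero entries
```

In PySpark itself, `pyspark.ml.linalg.SparseVector` provides a `toArray()` method that performs the same expansion.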


## 2. Define the Model

We define a `LinearRegression` model, specifying the input feature column (`features`) and the target label column (`total_purchase_amount`). We'll use default hyperparameters for now; tuning will be covered in the next notebook.

In [0]:
# Define the Linear Regression model
lr = LinearRegression(featuresCol="features", labelCol="total_purchase_amount")

# We can optionally set some basic parameters, but defaults are often fine for a first pass
# lr.setMaxIter(10)
# lr.setRegParam(0.1)

print(lr)

LinearRegression_8a1012325e28


## 3. Train the Model

We train the model using the `fit()` method on our training data.

In [0]:
# Train the model
print("Training the Linear Regression model...")
lr_model = lr.fit(train_data)
print("Training complete.")

Training the Linear Regression model...


Training complete.


## 4. Make Predictions

We use the trained model's `transform()` method to make predictions on the unseen test data.

In [0]:
# Make predictions on the test data
predictions = lr_model.transform(test_data)

# Show some predictions
print("Sample Predictions:")
predictions.select("features", "total_purchase_amount", "prediction").limit(10).display()

Sample Predictions:


features,total_purchase_amount,prediction
"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 7, 9, 11, 12, 14, 22), values -> List(0.22580645161290322, 0.13913282107574096, 0.03621449653110506, 0.7586206896551724, 0.25, 0.5, 0.004602991944764097, 0.08516270620618045, 0.7586206896551724, 1.0, 1.0))",83.96,76.57551367920041
"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 16, 23), values -> List(0.1774193548387097, 0.2906147091108672, 0.1626969220744375, 0.5517241379310346, 0.4375, 0.5, 0.5, 0.3333333333333333, 0.013425393172228614, 1.402697105995408E-5, 0.23073548105098457, 0.5517241379310346, 1.0, 1.0))",308.6,316.0477194324582
"Map(vectorType -> sparse, length -> 27, indices -> List(1, 2, 3, 4, 7, 9, 11, 12, 16, 24), values -> List(0.04226125137211855, 0.033830349955920117, 0.5862068965517241, 0.0625, 0.5, 0.006137322593018795, 0.9249645777973425, 0.5862068965517241, 1.0, 1.0))",46.75,45.775889680244674
"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 7, 9, 11, 12, 15, 17), values -> List(0.27419354838709675, 0.19401756311745336, 0.15273103606884128, 0.7586206896551724, 0.3125, 0.5, 0.01457614115841964, 0.7265411662724947, 0.7586206896551724, 1.0, 1.0))",309.9,310.84511551268184
"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 7, 9, 11, 12, 14, 17), values -> List(0.4193548387096774, 0.1432491767288694, 0.15273103606884128, 0.7586206896551724, 0.25, 0.5, 0.02454929037207518, 0.2227386663062664, 0.7586206896551724, 1.0, 1.0))",292.21,304.72883295495126
"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 15, 18), values -> List(0.24193548387096775, 0.2327113062568606, 0.08346046226378934, 0.6551724137931034, 0.3125, 0.5, 0.5, 0.3333333333333333, 0.01304181051016494, 1.402697105995408E-5, 0.5179645315634482, 0.6551724137931034, 1.0, 1.0))",190.89,190.3363753536686
"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 19), values -> List(0.0967741935483871, 0.2189901207464325, 0.025068036337153588, 0.7241379310344829, 0.125, 0.5, 0.5, 0.3333333333333333, 0.00728807057920982, 1.603082406851895E-5, 0.8600121083334342, 0.7241379310344829, 1.0))",69.83,65.17806029189988
"Map(vectorType -> sparse, length -> 27, indices -> List(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 20), values -> List(0.22580645161290322, 0.03924259055982437, 0.022837211085131662, 0.44827586206896547, 0.0625, 0.5, 0.5, 0.3333333333333333, 0.02838511699271193, 2.404623610277842E-5, 0.4282070705132843, 0.44827586206896547, 1.0))",40.66,29.968923453589703
"Map(vectorType -> sparse, length -> 27, indices -> List(1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 15, 22), values -> List(0.4080680570801317, 0.06253210165203726, 0.5172413793103449, 0.375, 0.5, 0.5, 0.3333333333333333, 0.002685078634445723, 2.404623610277842E-5, 0.7075910409748363, 0.5172413793103449, 1.0, 1.0))",146.27,153.99310609483825
"Map(vectorType -> sparse, length -> 27, indices -> List(1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 15, 20), values -> List(0.4080680570801317, 0.028632757100693778, 0.7586206896551724, 0.75, 0.5, 0.5, 0.3333333333333333, 0.02531645569620253, 1.603082406851895E-5, 0.17139224842149936, 0.7586206896551724, 1.0, 1.0))",122.93,128.69544086611415
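
For a linear regression model, each value in the `prediction` column is the intercept plus the dot product of the learned coefficients with the feature vector. The sketch below shows that computation on a sparse row. The coefficient and intercept values are made up for illustration and are not the fitted values from `lr_model`, and `predict_sparse` is a hypothetical helper.

```python
def predict_sparse(indices, values, coefficients, intercept):
    """Return intercept + sum of coefficient * value over the non-zero features."""
    return intercept + sum(coefficients[i] * v for i, v in zip(indices, values))

# Hypothetical model with three features (NOT the fitted coefficients).
coefficients = [2.0, 0.0, 3.0]
intercept = 1.0

# Sparse row with non-zero values at positions 0 and 2.
prediction = predict_sparse([0, 2], [0.5, 1.0], coefficients, intercept)
print(prediction)  # 1.0 + 2.0*0.5 + 3.0*1.0 = 5.0
```

The fitted values for the trained model are available as `lr_model.coefficients` and `lr_model.intercept`.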


## 5. Evaluate the Model

We use `RegressionEvaluator` to assess the model's performance using standard metrics: RMSE, MAE, and R².

In [0]:
# Create evaluators for different metrics
rmse_evaluator = RegressionEvaluator(labelCol="total_purchase_amount", predictionCol="prediction", metricName="rmse")
mae_evaluator = RegressionEvaluator(labelCol="total_purchase_amount", predictionCol="prediction", metricName="mae")
r2_evaluator = RegressionEvaluator(labelCol="total_purchase_amount", predictionCol="prediction", metricName="r2")

# Calculate metrics
rmse = rmse_evaluator.evaluate(predictions)
mae = mae_evaluator.evaluate(predictions)
r2 = r2_evaluator.evaluate(predictions)

# Print the metrics
print(f"Root Mean Squared Error (RMSE) on test data: {rmse:.4f}")
print(f"Mean Absolute Error (MAE) on test data: {mae:.4f}")
print(f"R-squared (R²) on test data: {r2:.4f}")

Root Mean Squared Error (RMSE) on test data: 25.8060
Mean Absolute Error (MAE) on test data: 12.7021
R-squared (R²) on test data: 0.9911


### Interpreting the Metrics

*   **RMSE (Root Mean Squared Error):** The square root of the mean squared difference between predicted and actual values, in the same units as the target (`total_purchase_amount`). Because errors are squared before averaging, large errors weigh more heavily. Lower is better.
*   **MAE (Mean Absolute Error):** The mean absolute difference between predictions and actuals, also in the target's units. It is less sensitive to large errors (outliers) than RMSE. Lower is better.
*   **R² (R-squared):** The proportion of the variance in the target variable (`total_purchase_amount`) that the model explains. It is at most 1, equals 0 for a model that always predicts the mean of the target, and can be negative when the model does worse than that. Values closer to 1 indicate that the model explains more of the variability.
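
To make the three definitions concrete, here is a minimal sketch in plain Python that computes them by hand on the ten sample predictions shown in section 4. Ten rows are not the full test set, so the results will differ from the evaluator's output; the `sample_` variable names are illustrative.

```python
import math

# (actual, predicted) pairs copied from the sample predictions in section 4.
pairs = [
    (83.96, 76.57551367920041),
    (308.6, 316.0477194324582),
    (46.75, 45.775889680244674),
    (309.9, 310.84511551268184),
    (292.21, 304.72883295495126),
    (190.89, 190.3363753536686),
    (69.83, 65.17806029189988),
    (40.66, 29.968923453589703),
    (146.27, 153.99310609483825),
    (122.93, 128.69544086611415),
]

n = len(pairs)
errors = [actual - predicted for actual, predicted in pairs]

# MAE: mean of absolute errors. RMSE: square root of the mean squared error.
sample_mae = sum(abs(e) for e in errors) / n
sample_rmse = math.sqrt(sum(e * e for e in errors) / n)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
mean_actual = sum(actual for actual, _ in pairs) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((actual - mean_actual) ** 2 for actual, _ in pairs)
sample_r2 = 1 - ss_res / ss_tot

print(f"MAE:  {sample_mae:.4f}")
print(f"RMSE: {sample_rmse:.4f}")
print(f"R²:   {sample_r2:.4f}")
```

RMSE is never smaller than MAE on the same data; the gap between them grows when a few errors are much larger than the rest.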

In [0]:
from pyspark.sql.functions import mean, stddev

# --- Configuration ---
test_table = "ecommerce.regression_test_features"
target_col = "total_purchase_amount"
# -------------------

try:
    print(f"Loading test data from: {test_table}")
    test_data = spark.table(test_table)

    # Calculate mean and standard deviation of the actual target variable
    print(f"Calculating statistics for target column: {target_col}")
    stats = test_data.select(
        mean(target_col).alias("mean_target"),
        stddev(target_col).alias("stddev_target")
    ).first()  # select() with only aggregate expressions returns a single-row DataFrame

    if stats and stats["mean_target"] is not None and stats["stddev_target"] is not None:
        mean_target = stats["mean_target"]
        stddev_target = stats["stddev_target"]

        print("--- Target Variable Statistics ---")
        print(f"Mean '{target_col}':   {mean_target:.2f}")
        print(f"Std Dev '{target_col}': {stddev_target:.2f}")

        print("--- Model Error Metrics ---")
        print(f"MAE:  {mae:.2f}")
        print(f"RMSE: {rmse:.2f}")

        print("--- Diagnosis ---")
        # Compare MAE to Mean
        mae_perc_mean = (mae / mean_target) * 100 if mean_target != 0 else float('inf')
        print(f"MAE ({mae:.2f}) is {mae_perc_mean:.1f}% of the Mean Target ({mean_target:.2f}).")
        print("  Interpretation: This shows the average error relative to the average purchase size.")

        # Compare RMSE to Standard Deviation
        rmse_vs_stddev = rmse / stddev_target if stddev_target != 0 else float('inf')
        print(f"RMSE ({rmse:.2f}) is {rmse_vs_stddev:.2f} times the Standard Deviation ({stddev_target:.2f}).")
        print("  Interpretation: ")
        print("    - If < 1.0: Errors are typically smaller than the natural spread of the data.")
        print("    - If ~ 1.0: Errors are roughly the same size as the natural spread (model isn't reducing uncertainty much).")
        print("    - If > 1.0: Errors are generally larger than the natural spread (model might be performing poorly).")

        # Overall comment based on comparisons
        if mae_perc_mean > 30 or rmse_vs_stddev > 0.8:  # Example thresholds, adjust as needed
            print("Overall: The errors appear relatively high compared to the target variable's scale. Tuning may help.")
        elif mae_perc_mean < 15 and rmse_vs_stddev < 0.6:
            print("Overall: The errors seem reasonable relative to the target variable's scale. Tuning might still yield improvements.")
        else:
            print("Overall: The errors are moderate. Tuning is recommended to see if performance can be improved.")

    else:
        print(f"Could not calculate statistics for column '{target_col}' in table '{test_table}'. Check table/column names and data.")

except Exception as e:
    print(f"An error occurred: {e}")



Loading test data from: ecommerce.regression_test_features
Calculating statistics for target column: total_purchase_amount
--- Target Variable Statistics ---
Mean 'total_purchase_amount':   285.96
Std Dev 'total_purchase_amount': 273.52
--- Model Error Metrics ---
MAE:  12.70
RMSE: 25.81
--- Diagnosis ---
MAE (12.70) is 4.4% of the Mean Target (285.96).
  Interpretation: This shows the average error relative to the average purchase size.
RMSE (25.81) is 0.09 times the Standard Deviation (273.52).
  Interpretation: 
    - If < 1.0: Errors are typically smaller than the natural spread of the data.
    - If ~ 1.0: Errors are roughly the same size as the natural spread (model isn't reducing uncertainty much).
    - If > 1.0: Errors are generally larger than the natural spread (model might be performing poorly).
Overall: The errors seem reasonable relative to the target variable's scale. Tuning might still yield improvements.


The baseline model performs well. The Mean Absolute Error (MAE) shows that predictions are off by $12.70 on average. The Root Mean Squared Error (RMSE) of $25.81 weights large misses more heavily, and the fact that it is roughly twice the MAE suggests that a minority of predictions have much larger errors than the typical one. Against a mean purchase amount of $285.96 and a standard deviation of $273.52, the MAE is 4.4% of the mean and the RMSE is 0.09 standard deviations, so the errors are small relative to the target's scale and the model has learned meaningful patterns.
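
The ratio of RMSE to the target's standard deviation is closely tied to R². For a large test set, R² is approximately 1 − (RMSE/σ)², where σ is the standard deviation of the target. The sketch below checks this with the values reported above. It is only an approximation, because `stddev` returns the sample standard deviation, and the variable names are illustrative.

```python
# Values copied from the notebook outputs above.
reported_rmse = 25.8060
reported_mae = 12.7021
mean_target = 285.96
stddev_target = 273.52

# RMSE expressed in standard deviations of the target.
rmse_ratio = reported_rmse / stddev_target

# Large-sample approximation: R² ≈ 1 - (RMSE / std dev)².
approx_r2 = 1 - rmse_ratio ** 2

# MAE expressed as a percentage of the mean purchase amount.
mae_pct_of_mean = 100 * reported_mae / mean_target

print(f"RMSE / std dev:       {rmse_ratio:.4f}")
print(f"Approximate R²:       {approx_r2:.4f}")
print(f"MAE as % of the mean: {mae_pct_of_mean:.1f}%")
```

The approximate R² agrees closely with the 0.9911 reported by `RegressionEvaluator`, so the RMSE-to-standard-deviation ratio and R² carry essentially the same information here.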

The R² of 0.9911 indicates that the model explains about 99% of the variance in purchase amounts on the test set. While these results are good, tuning the model's hyperparameters (settings such as the regularization strength that are chosen before training) may improve predictions further. The next notebook applies hyperparameter tuning techniques to see whether the model's accuracy can be improved.

## 6. Business Value

Predicting purchase amounts helps businesses forecast revenue, identify potentially high-spending customer segments for targeted offers, and optimize inventory based on expected demand value.