# Module 4 Notebook 1: Pipeline Integration for Regression

**Objective:** Consolidate the feature engineering, scaling, and model training steps for our linear regression (purchase amount prediction) task into a single, end-to-end PySpark ML `Pipeline`.

**Recap:** In previous modules, we performed these steps individually:
1. **M2N1:** Joined data, engineered customer journey features (counts, time, purchase history) and categorical features.
2. **M2N2:** Scaled numerical features.
3. **M2N3:** Assembled features, split data.
4. **M3N2:** Trained a baseline `LinearRegression` model.
5. **M3N3:** Tuned hyperparameters for `LinearRegression`.

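**Goal:** Build a `Pipeline` that takes the joined and aggregated interaction, customer, and product data as input and outputs predictions using the optimized regression model found in M3N3. The aggregation and joins are done up front with DataFrame operations; the `Pipeline` itself covers encoding, scaling, feature assembly, and the model. This is a more streamlined and production-ready approach than running those steps by hand.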

In [0]:
from pyspark.sql import SparkSession
# Note: sum, min, and max imported here shadow the Python built-ins for the rest of the notebook
from pyspark.sql.functions import col, count, sum, avg, min, max, first, when, unix_timestamp
from pyspark.sql.window import Window
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.types import DoubleType
import mlflow
import mlflow.spark
# Set a seed for reproducibility
SEED = 42

## 1. Load and Prepare Raw Data

We start by loading the raw data sources and replicating the aggregation logic from Module 2 Notebook 1 to create our base dataset for the regression task.

In [0]:
# Load raw data
interactions_df = spark.table("ecommerce.interactions")
customers_df = spark.table("ecommerce.customers")
products_df = spark.table("ecommerce.products")

# --- Replicating Feature Engineering Logic from M2N1 --- 

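# Calculate first and last interaction times per customer-product pair.
# The window is deliberately left unordered: adding orderBy() would change the
# default frame to "start of partition up to the current row", which turns max()
# into a running maximum rather than the partition-wide last timestamp.
window_spec = Window.partitionBy("customer_id", "product_id")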

interactions_with_time = interactions_df \
    .withColumn("first_interaction_time", min("timestamp").over(window_spec)) \
    .withColumn("last_interaction_time", max("timestamp").over(window_spec))

# Aggregate interactions at customer-product level
customer_product_agg = interactions_with_time \
    .groupBy("customer_id", "product_id") \
    .agg(
        count(when(col("interaction_type") == "view", 1)).alias("view_count"),
        count(when(col("interaction_type") == "add_to_cart", 1)).alias("add_to_cart_count"),
        count(when(col("interaction_type") == "review", 1)).alias("review_count"),
        count(when(col("interaction_type") == "purchase", 1)).alias("purchase_count"),
        count("interaction_type").alias("total_interactions"),
        sum("time_spent_seconds").alias("total_time_spent"),
        sum(when(col("interaction_type") == "purchase", col("purchase_amount")).otherwise(0)).alias("total_purchase_amount"),
        avg(when(col("interaction_type") == "review", col("user_rating"))).alias("avg_user_rating"), # Needs imputation later
        first("first_interaction_time").alias("first_interaction_time"),
        first("last_interaction_time").alias("last_interaction_time"),
        first("previous_visits").alias("previous_visits") # Assuming it's consistent per customer-product pair
    )

# Calculate derived time features 
customer_product_agg = customer_product_agg \
    .withColumn("interaction_time_span_seconds", unix_timestamp(col("last_interaction_time")) - unix_timestamp(col("first_interaction_time"))) \
    .withColumn("interaction_time_span_days", col("interaction_time_span_seconds") / (60*60*24)) \
    .withColumn("avg_purchase_amount", 
                when(col("purchase_count") > 0, col("total_purchase_amount") / col("purchase_count"))
                .otherwise(0.0))

# --- Impute Missing Values (Simplified for Pipeline Demo) --- 
# In a real pipeline, imputation might be a separate stage (e.g., Imputer)
# Here, we fill directly for simplicity, assuming logic from M2N2

# Impute avg_user_rating (e.g., with global mean or a fixed value like 3.0)
# Calculating global mean requires an extra step, let's use a fixed value for demo
customer_product_agg = customer_product_agg.fillna(3.0, subset=["avg_user_rating"])
# Fill other potential nulls introduced by aggregation (e.g., counts, sums)
numeric_cols_to_fill_zero = [
    "view_count", "add_to_cart_count", "review_count", "purchase_count",
    "total_interactions", "total_time_spent", "total_purchase_amount",
    "interaction_time_span_seconds", "interaction_time_span_days",
    "avg_purchase_amount", "previous_visits"
]
customer_product_agg = customer_product_agg.fillna(0, subset=numeric_cols_to_fill_zero)

# --- Join with Customer and Product Data --- 
base_df = customer_product_agg \
    .join(customers_df, "customer_id") \
    .join(products_df, "product_id")

# --- Filter for Regression Task --- 
# Keep only records where a purchase actually occurred for amount prediction
regression_base_data = base_df.filter(col("purchase_count") > 0)

# --- Select Final Columns --- 
# Select features needed for the regression model (consistent with M2/M3)
# Exclude the target variable itself ('total_purchase_amount') from features
feature_columns = [
    # Numerical (Original - to be scaled)
    "age", "tenure_days", "price", "avg_rating", # product avg_rating
    "previous_visits", 
    # Journey Features (Engineered - to be scaled)
    "view_count", "add_to_cart_count", "review_count", 
    "total_interactions", "total_time_spent", 
    "interaction_time_span_days",
    "avg_user_rating", # customer avg rating per product
    # Categorical (Original - to be indexed/encoded)
    "gender", "country", "membership_level", "category"  # 'device' is left out; it is assumed to need more complex handling than this basic pipeline demo covers
]
target_column = "total_purchase_amount"

regression_input_data = regression_base_data.select(feature_columns + [target_column])

print(f"Prepared {regression_input_data.count()} records for regression pipeline.")
regression_input_data.printSchema()

Prepared 25238 records for regression pipeline.
root
 |-- age: integer (nullable = true)
 |-- tenure_days: integer (nullable = true)
 |-- price: double (nullable = true)
 |-- avg_rating: double (nullable = true)
 |-- previous_visits: long (nullable = true)
 |-- view_count: long (nullable = false)
 |-- add_to_cart_count: long (nullable = false)
 |-- review_count: long (nullable = false)
 |-- total_interactions: long (nullable = false)
 |-- total_time_spent: long (nullable = true)
 |-- interaction_time_span_days: double (nullable = false)
 |-- avg_user_rating: double (nullable = false)
 |-- gender: string (nullable = true)
 |-- country: string (nullable = true)
 |-- membership_level: string (nullable = true)
 |-- category: string (nullable = true)
 |-- total_purchase_amount: double (nullable = false)
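
The conditional counts in the aggregation rely on `when(...)` without `otherwise(...)` yielding null for non-matching rows, which `count` then skips. A minimal plain-Python sketch of the same idea follows; the rows and the helper name `aggregate_pair` are made up for illustration and are not part of the notebook.

```python
from collections import defaultdict

# Hypothetical rows: (customer_id, product_id, interaction_type, purchase_amount)
rows = [
    ("c1", "p1", "view", None),
    ("c1", "p1", "view", None),
    ("c1", "p1", "add_to_cart", None),
    ("c1", "p1", "purchase", 49.16),
    ("c2", "p1", "view", None),
]

def aggregate_pair(rows):
    """Group by (customer_id, product_id) and compute conditional counts and sums."""
    agg = defaultdict(lambda: {"view_count": 0, "purchase_count": 0,
                               "total_interactions": 0, "total_purchase_amount": 0.0})
    for customer_id, product_id, interaction_type, amount in rows:
        stats = agg[(customer_id, product_id)]
        stats["total_interactions"] += 1
        # Mirrors count(when(type == "view", 1)): only matching rows are counted
        if interaction_type == "view":
            stats["view_count"] += 1
        # Mirrors sum(when(type == "purchase", amount).otherwise(0))
        if interaction_type == "purchase":
            stats["purchase_count"] += 1
            stats["total_purchase_amount"] += amount
    return dict(agg)

result = aggregate_pair(rows)
print(result[("c1", "p1")])
print(result[("c2", "p1")])
```

In this toy data only the `("c1", "p1")` pair would survive the `purchase_count > 0` filter used for the regression task.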



## 2. Define Pipeline Stages

Now, we define each step of our data processing and modeling workflow as a stage in the pipeline.

1.  **Categorical Handling:** `StringIndexer` -> `OneHotEncoder`
2.  **Numerical Handling:** `VectorAssembler` (raw numerical) -> `StandardScaler`
3.  **Final Assembly:** `VectorAssembler` (scaled numerical + OHE categorical)
4.  **Model:** `LinearRegression` (with best hyperparameters)
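
To make the scaling stage concrete: with `withMean=True` and `withStd=True`, `StandardScaler` subtracts each column's mean and divides by its standard deviation, with both statistics learned from the training data only. The plain-Python sketch below shows this for a single column. The `price` values are made up, and the sample (n-1) standard deviation is used on the understanding that this matches Spark's behaviour.

```python
import statistics

def fit_scaler(train_values):
    """Learn mean and sample standard deviation from training values only."""
    return statistics.mean(train_values), statistics.stdev(train_values)

def scale(values, mean, std):
    """Standardize values using previously learned statistics."""
    return [(v - mean) / std for v in values]

train_price = [10.0, 20.0, 30.0, 40.0]  # hypothetical training values
test_price = [25.0, 50.0]               # hypothetical unseen values

price_mean, price_std = fit_scaler(train_price)
scaled_train = scale(train_price, price_mean, price_std)
scaled_test = scale(test_price, price_mean, price_std)  # reuses training statistics
print(price_mean, scaled_test)
```

Reusing the training statistics on the test values is what keeps test information from leaking into the fitted transformation.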

In [0]:
# Identify categorical and numerical columns from our selected features
categorical_cols = ["gender", "country", "membership_level", "category"]
numerical_cols = [
    "age", "tenure_days", "price", "avg_rating", "previous_visits",
    "view_count", "add_to_cart_count", "review_count", 
    "total_interactions", "total_time_spent", 
    "interaction_time_span_days", "avg_user_rating"
]

# --- Stage 1: Categorical Feature Processing --- 
# The loop variable is named `c` so it does not shadow the imported `col` function
indexers = [
    StringIndexer(inputCol=c, outputCol=f"{c}_index", handleInvalid="keep")
    for c in categorical_cols
]
encoder = OneHotEncoder(
    inputCols=[f"{c}_index" for c in categorical_cols],
    outputCols=[f"{c}_vec" for c in categorical_cols],
    dropLast=True  # Drop the last category so the dummy columns are not collinear
)

# --- Stage 2: Numerical Feature Processing --- 
numerical_assembler = VectorAssembler(
    inputCols=numerical_cols,
    outputCol="unscaled_numerical_features",
    handleInvalid="keep"  # Nulls become NaN in the vector; use 'skip' or an upstream Imputer if nulls are possible
)

scaler = StandardScaler(
    inputCol="unscaled_numerical_features",
    outputCol="scaled_numerical_features",
    withStd=True,
    withMean=True
)

# --- Stage 3: Final Feature Assembly --- 
final_feature_cols = ["scaled_numerical_features"] + [f"{c}_vec" for c in categorical_cols]
final_assembler = VectorAssembler(
    inputCols=final_feature_cols,
    outputCol="features",
    handleInvalid="keep"
)

# --- Stage 4: Model Definition --- 
# Use the best hyperparameters found in Module 3 Notebook 3
# Example values - replace with actual best params from M3N3
best_maxIter = 20 
best_regParam = 0.01
best_elasticNetParam = 0.0 # Corresponds to L2 regularization (Ridge)

lr = LinearRegression(
    featuresCol="features", 
    labelCol=target_column,
    maxIter=best_maxIter,
    regParam=best_regParam,
    elasticNetParam=best_elasticNetParam
)

# Combine all stages into a list
all_stages = indexers + [encoder, numerical_assembler, scaler, final_assembler, lr]
print(f"Defined {len(all_stages)} stages for the pipeline.")

Defined 9 stages for the pipeline.
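
The categorical stages can be pictured in plain Python. This is a simplified sketch, not Spark's implementation. It assumes labels are indexed by descending frequency (the `StringIndexer` default, with ties broken alphabetically here), that `handleInvalid="keep"` reserves one extra index for unseen labels, and that `dropLast=True` then drops that final slot so unseen labels encode as all zeros. The `membership_level` values and helper names are made up.

```python
from collections import Counter

def fit_indexer(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def encode(label, index_map):
    """One-hot encode a label; unseen labels become an all-zero vector."""
    unseen_index = len(index_map)   # extra index reserved for unseen labels
    idx = index_map.get(label, unseen_index)
    size = len(index_map)           # seen labels + unseen slot, minus the dropped last slot
    vec = [0.0] * size
    if idx < size:
        vec[idx] = 1.0
    return vec

train_levels = ["gold", "silver", "silver", "bronze", "silver", "gold"]
index_map = fit_indexer(train_levels)
gold_vec = encode("gold", index_map)
unseen_vec = encode("platinum", index_map)
print(index_map)
print(gold_vec, unseen_vec)
```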


## 3. Create and Train the Pipeline

We assemble the stages into a `Pipeline` object and fit it to the training data. This process trains the indexers, learns the scaling parameters, and trains the linear regression model.

In [0]:
# Create the Pipeline
pipeline = Pipeline(stages=all_stages)

# Split data into training and testing sets
train_data, test_data = regression_input_data.randomSplit([0.8, 0.2], seed=SEED)

print(f"Training data count: {train_data.count()}")
print(f"Test data count: {test_data.count()}")

# Fit the pipeline to the training data
print("Fitting the pipeline...")
pipelineModel = pipeline.fit(train_data)
print("Pipeline fitting complete.")

Training data count: 20299
Test data count: 4939
Fitting the pipeline...


Pipeline fitting complete.
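
Conceptually, `Pipeline.fit()` walks the stages in order: each estimator is fitted on the data as transformed by all earlier stages, and the fitted stages are then replayed unchanged at prediction time. The toy classes below (`Doubler`, `MeanCenterer`, `ToyPipeline`) are hypothetical plain-Python stand-ins that illustrate this contract. They are not PySpark classes, and unlike Spark, which returns a separate `PipelineModel`, the toy `fit()` simply returns itself.

```python
import statistics

class Doubler:
    """Toy stateless transformer: nothing to learn."""
    def fit(self, data):
        return self

    def transform(self, data):
        return [2 * x for x in data]

class MeanCenterer:
    """Toy estimator: learns a mean in fit(), subtracts it in transform()."""
    def fit(self, data):
        self.mean_ = statistics.mean(data)
        return self

    def transform(self, data):
        return [x - self.mean_ for x in data]

class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        # Fit each stage on the output of the stages before it
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return self

    def transform(self, data):
        # Replay the already-fitted stages; no statistics are re-learned
        for stage in self.stages:
            data = stage.transform(data)
        return data

toy = ToyPipeline([Doubler(), MeanCenterer()]).fit([1.0, 2.0, 3.0])
toy_output = toy.transform([2.0, 10.0])
print(toy_output)
```

The mean is learned from the doubled training data, and new data is centered with that same mean rather than its own.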


## 4. Apply Pipeline for Predictions

Use the fitted `pipelineModel` to transform the test data. This applies all the learned transformations and generates predictions using the trained model.

In [0]:
# Make predictions on the test data
print("Generating predictions on test data...")
predictions = pipelineModel.transform(test_data)

# Show some predictions
predictions.select(target_column, "prediction", "features").limit(5).display()

Generating predictions on test data...


total_purchase_amount,prediction,features
49.16,45.41873102566308,"Map(vectorType -> sparse, length -> 39, indices -> List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 16, 25, 35), values -> List(-1.5085758549084674, -1.4524684713685045, -0.7679363155102512, -0.042036733454782794, 0.07336906942525259, -0.05422856352508102, -1.0263281838169567, 0.9966548401187559, -0.027705520508331027, -0.47281951575762976, 1.4149879965977856, 1.0, 1.0, 1.0, 1.0))"
355.44,364.30572797598495,"Map(vectorType -> sparse, length -> 39, indices -> List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 19, 26, 29), values -> List(-1.5085758549084674, -1.3927622473103776, 0.26979550322483287, -0.7532590280468658, -1.2939594695424854, -0.05422856352508102, 0.9669873940383505, -0.9993332198025627, -0.027705520508331027, 2.5199788725051966, -0.6248383701675808, 1.0, 1.0, 1.0, 1.0))"
71.25,59.951812608822,"Map(vectorType -> sparse, length -> 39, indices -> List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 13, 17, 26, 34), values -> List(-1.5085758549084674, -1.3600201244397918, -0.6691096674426527, -1.6422868962869699, 0.07336906942525259, -0.05422856352508102, -1.0263281838169567, -0.9993332198025627, -1.410987848419265, -0.6526194189284584, -0.6248383701675808, 1.0, 1.0, 1.0, 1.0))"
121.38,131.88300002720143,"Map(vectorType -> sparse, length -> 39, indices -> List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 23, 28, 34), values -> List(-1.5085758549084674, -1.3484640810737027, -0.31910729120745734, -1.286675748990928, -1.2939594695424854, -0.05422856352508102, 0.9669873940383505, -0.9993332198025627, -0.027705520508331027, -0.35391957978982364, -0.6248383701675808, 1.0, 1.0, 1.0, 1.0))"
151.47,147.56074593390602,"Map(vectorType -> sparse, length -> 39, indices -> List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 13, 23, 27, 32), values -> List(-1.5085758549084674, -1.3388340449352951, -0.21778396992551943, -0.5754534543988449, -1.2939594695424854, -0.05422856352508102, -1.0263281838169567, 0.9966548401187559, -0.027705520508331027, -0.37421956885749785, -0.9920311188926247, 1.0, 1.0, 1.0, 1.0))"
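
The `features` column is displayed in sparse form: a total length plus parallel lists of non-zero positions and their values. Given the assembly order defined earlier, the first 12 slots hold the scaled numerical features and the remaining 27 hold the concatenated one-hot blocks. The helper below (`to_dense`, a hypothetical name) expands the first displayed row, with values rounded to four decimals for brevity.

```python
def to_dense(length, indices, values):
    """Expand a sparse (indices, values) representation into a dense list."""
    dense = [0.0] * length
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# First displayed row, values rounded to 4 decimals
indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 16, 25, 35]
values = [-1.5086, -1.4525, -0.7679, -0.0420, 0.0734, -0.0542, -1.0263,
          0.9967, -0.0277, -0.4728, 1.4150, 1.0, 1.0, 1.0, 1.0]

dense = to_dense(39, indices, values)
numerical_part = dense[:12]   # scaled numerical features (12 columns)
one_hot_part = dense[12:]     # concatenated one-hot blocks
hot_positions = [i for i, v in enumerate(one_hot_part) if v == 1.0]
print(numerical_part)
print(hot_positions)
```

Each of the four categorical columns contributes exactly one active position in this row, and any slot missing from `indices` (such as slot 10) is zero.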


## 5. Evaluate the Pipeline Model

Evaluate the performance of the model trained within the pipeline using standard regression metrics.

In [0]:
# Create an evaluator
evaluator_rmse = RegressionEvaluator(labelCol=target_column, predictionCol="prediction", metricName="rmse")
evaluator_mae = RegressionEvaluator(labelCol=target_column, predictionCol="prediction", metricName="mae")
evaluator_r2 = RegressionEvaluator(labelCol=target_column, predictionCol="prediction", metricName="r2")

# Calculate metrics
rmse = evaluator_rmse.evaluate(predictions)
mae = evaluator_mae.evaluate(predictions)
r2 = evaluator_r2.evaluate(predictions)

print("Pipeline Model Performance on Test Data:")
print(f"  Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"  Mean Absolute Error (MAE): {mae:.2f}")
print(f"  R-squared (R²): {r2:.4f}")

# Compare against the M3N3 results; they are not recomputed here, so check them manually
print("These metrics should be very close to the results obtained for the tuned model in M3N3.")

Pipeline Model Performance on Test Data:
  Root Mean Squared Error (RMSE): 20.51
  Mean Absolute Error (MAE): 12.35
  R-squared (R²): 0.9941
These metrics should be very close to the results obtained for the tuned model in M3N3.
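
For reference, the three metrics reduce to simple arithmetic on label/prediction pairs. The sketch below recomputes them in plain Python on the five rows displayed in the previous section, with predictions rounded to two decimals. With so few rows the numbers will not match the full test-set metrics; the point is only to show the formulas. The `sample_*` names are illustrative.

```python
import math
import statistics

labels = [49.16, 355.44, 71.25, 121.38, 151.47]
preds = [45.42, 364.31, 59.95, 131.88, 147.56]

errors = [y - p for y, p in zip(labels, preds)]

# RMSE: square root of the mean squared error
sample_rmse = math.sqrt(statistics.mean([e ** 2 for e in errors]))
# MAE: mean of the absolute errors
sample_mae = statistics.mean([abs(e) for e in errors])
# R²: 1 - (residual sum of squares / total sum of squares)
label_mean = statistics.mean(labels)
ss_res = math.fsum([e ** 2 for e in errors])
ss_tot = math.fsum([(y - label_mean) ** 2 for y in labels])
sample_r2 = 1 - ss_res / ss_tot

print(f"RMSE={sample_rmse:.2f}  MAE={sample_mae:.2f}  R2={sample_r2:.4f}")
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and reacts more strongly to a few large misses.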


## 6. Save Pipeline Model with MLflow

Finally, we save the trained `PipelineModel` using MLflow. This packages the entire pipeline (including all preprocessing stages and the trained model) for easy loading and deployment later.

In [0]:
# Start an MLflow run to log the model
print("Starting MLflow run to log pipeline model...")
with mlflow.start_run() as run:
    # Log the pipeline model
    mlflow.spark.log_model(pipelineModel, "ecommerce_regression_model")
    
    # Get the run ID for reference in the next notebook
    run_id = run.info.run_id
    print(f'Pipeline model saved successfully under MLflow Run ID: {run_id}')
    print("Artifact Path: 'ecommerce_regression_model'")

# You can optionally save the run_id to a file or variable for M4N2 
# For now, we just print it.

Starting MLflow run to log pipeline model...


2025/04/21 01:18:19 INFO mlflow.spark: Inferring pip requirements by reloading the logged model from the databricks artifact repository, which can be time-consuming. To speed up, explicitly specify the conda_env or pip_requirements when calling log_model().


2025/04/21 01:18:52 INFO mlflow.tracking._tracking_service.client: 🏃 View run stylish-wren-332 at: adb-2187061022924749.9.azuredatabricks.net/ml/experiments/232784118800640/runs/2a473aa8df5f45209a4dd93b22d7fc63.
2025/04/21 01:18:52 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: adb-2187061022924749.9.azuredatabricks.net/ml/experiments/232784118800640.


Pipeline model saved successfully under MLflow Run ID: 2a473aa8df5f45209a4dd93b22d7fc63
Artifact Path: 'ecommerce_regression_model'
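
One lightweight way to hand the run ID to the next notebook is to write it to a small JSON file and rebuild the model URI from it. The sketch below uses only the standard library. The file name `pipeline_run.json` and the helper names are assumptions for illustration; `runs:/<run_id>/<artifact_path>` is MLflow's run-relative URI form.

```python
import json
import tempfile
from pathlib import Path

def save_run_reference(path, run_id, artifact_path):
    """Persist the run ID and artifact path so another notebook can find the model."""
    Path(path).write_text(json.dumps({"run_id": run_id, "artifact_path": artifact_path}))

def load_model_uri(path):
    """Rebuild the MLflow run-relative model URI from the saved reference."""
    ref = json.loads(Path(path).read_text())
    return f"runs:/{ref['run_id']}/{ref['artifact_path']}"

# Hypothetical location; a shared storage path may be more suitable in practice
ref_file = Path(tempfile.gettempdir()) / "pipeline_run.json"
save_run_reference(ref_file, "2a473aa8df5f45209a4dd93b22d7fc63", "ecommerce_regression_model")
model_uri = load_model_uri(ref_file)
print(model_uri)
# In a later notebook, the URI could be passed to mlflow.spark.load_model(model_uri)
```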


## 7. Conclusion and Next Steps

In this notebook, we successfully:
*   Loaded raw data and replicated the necessary feature engineering for the regression task.
*   Defined all preprocessing steps (indexing, encoding, scaling) and the regression model as stages.
*   Assembled these stages into a single PySpark `Pipeline`.
*   Trained the entire pipeline using `.fit()` on the training data.
*   Applied the fitted pipeline using `.transform()` to generate predictions on the test data.
*   Evaluated the model on the test set (RMSE 20.51, MAE 12.35, R² 0.9941); these figures should be checked against the tuned M3N3 results.

**Benefits Demonstrated:**
*   **Streamlined Workflow:** Combines multiple steps into one object.
*   **Consistency:** Ensures the same transformations are applied during training and prediction.
*   **Reduced Complexity:** Simplifies the process of applying the model to new data.
*   **Deployment Ready:** The `pipelineModel` encapsulates the entire workflow, making it easier to save, load, and deploy (as we'll see in the next notebook).

**Next Steps:** In Module 4 Notebook 2, we will load the saved `pipelineModel` from MLflow and use it for distributed inference on new data.