# Module 4 Notebook 2: Model Persistence and Distributed Inference

**Objective:** Demonstrate how to load a saved PySpark ML Pipeline (persisted using MLflow in M4N1) and use it for scalable batch inference on new, unseen data.

**Recap:** In M4N1, we built an end-to-end `Pipeline` for our regression task and saved the resulting `PipelineModel` using MLflow, logging its `run_id`.

**Goal:** Load the saved `PipelineModel` and apply it to new raw data to generate predictions, showcasing the inference process and discussing Spark's ability to handle this at scale.


In [0]:
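import mlflow
import mlflow.spark
import pandas as pd
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, LongType

# Databricks notebooks predefine `spark`; getOrCreate() reuses that session
# and creates one when the notebook runs elsewhere.
spark = SparkSession.builder.getOrCreate()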

# MLflow details from M4N1
MLFLOW_RUN_ID = "2a473aa8df5f45209a4dd93b22d7fc63"
MLFLOW_ARTIFACT_PATH = "ecommerce_regression_model"


## 1. Load the Trained Pipeline Model via MLflow

We use the `run_id` and artifact path logged during the M4N1 execution to load the complete `PipelineModel`.
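A mistyped run ID only surfaces once MLflow tries to resolve the artifacts, so it can help to check the URI components first. The sketch below is one way to do that. `build_runs_uri` is a hypothetical helper (not part of MLflow), and the 32-character hexadecimal check is an assumption based on the format of the run ID used in this notebook.

```python
import re

# Assumption: run IDs are 32 lowercase hex characters, like the one in this notebook.
_RUN_ID_PATTERN = re.compile(r"[0-9a-f]{32}")


def build_runs_uri(run_id: str, artifact_path: str) -> str:
    """Return a 'runs:/<run_id>/<artifact_path>' URI after basic validation."""
    run_id = run_id.strip()
    artifact_path = artifact_path.strip().strip("/")
    if not _RUN_ID_PATTERN.fullmatch(run_id):
        raise ValueError(f"Unexpected run ID format: {run_id!r}")
    if not artifact_path:
        raise ValueError("Artifact path must not be empty")
    return f"runs:/{run_id}/{artifact_path}"


model_uri = build_runs_uri("2a473aa8df5f45209a4dd93b22d7fc63", "ecommerce_regression_model")
print(model_uri)
```

The returned string has the same shape as the f-string used in the next cell, so it can be passed to `mlflow.spark.load_model` unchanged.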


In [0]:
# Construct the model URI
model_uri = f"runs:/{MLFLOW_RUN_ID}/{MLFLOW_ARTIFACT_PATH}"
print(f"Loading model from: {model_uri}")

# Load the PipelineModel
loaded_pipeline_model = None # Initialize variable
try:
    loaded_pipeline_model = mlflow.spark.load_model(model_uri)
    print("PipelineModel loaded successfully!")
    print(f"Type: {type(loaded_pipeline_model)}")
except Exception as e:
    print(f"Error loading model: {e}")
    print("Please ensure the MLFLOW_RUN_ID is correct and the model was logged successfully in M4N1.")
    # Optional: stop the notebook if model loading fails (Databricks only; uncomment to enable)
    # dbutils.notebook.exit("Model loading failed")

# Display the stages of the loaded pipeline
if loaded_pipeline_model is not None:
    print("\nStages in the loaded pipeline:")
    for i, stage in enumerate(loaded_pipeline_model.stages):
        print(f'  Stage {i}: {stage.__class__.__name__}')


Loading model from: runs:/2a473aa8df5f45209a4dd93b22d7fc63/ecommerce_regression_model


2025/04/21 01:55:31 INFO mlflow.spark: 'runs:/2a473aa8df5f45209a4dd93b22d7fc63/ecommerce_regression_model' resolved as 'dbfs:/databricks/mlflow-tracking/232784118800640/2a473aa8df5f45209a4dd93b22d7fc63/artifacts/ecommerce_regression_model'


PipelineModel loaded successfully!
Type: <class 'pyspark.ml.pipeline.PipelineModel'>

Stages in the loaded pipeline:
  Stage 0: StringIndexerModel
  Stage 1: StringIndexerModel
  Stage 2: StringIndexerModel
  Stage 3: StringIndexerModel
  Stage 4: OneHotEncoderModel
  Stage 5: VectorAssembler
  Stage 6: StandardScalerModel
  Stage 7: VectorAssembler
  Stage 8: LinearRegressionModel


## 2. Prepare Sample New Data

Now, let's create a small batch of new, raw customer-product data that simulates what might arrive for prediction. It must contain the same raw feature columns, with the same names and types, as the *initial* input to the pipeline created in M4N1 (before any transformations like StringIndexer or StandardScaler were applied). Extra identifier columns such as `customer_id` and `product_id` are carried through to the output unchanged. The data should **not** contain the target variable (`total_purchase_amount`).
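A quick column check before inference can catch a missing feature or a leaked target early. The following is a minimal sketch: `check_input_columns` is a hypothetical helper, and `required_columns` lists only a few of the raw feature names as an illustration, not the full list from M4N1.

```python
def check_input_columns(actual, required, forbidden=()):
    """Return (missing, leaked): required columns absent and forbidden columns present."""
    actual_set = set(actual)
    missing = [c for c in required if c not in actual_set]
    leaked = [c for c in forbidden if c in actual_set]
    return missing, leaked


# Hypothetical subset of the raw feature columns; the full list comes from M4N1.
required_columns = ["age", "tenure_days", "price", "category"]
incoming_columns = ["customer_id", "age", "price", "category", "total_purchase_amount"]

missing, leaked = check_input_columns(
    incoming_columns, required_columns, forbidden=["total_purchase_amount"]
)
print(f"Missing: {missing}")  # Missing: ['tenure_days']
print(f"Leaked:  {leaked}")   # Leaked:  ['total_purchase_amount']
```

With a Spark DataFrame, the first argument would be `df.columns`. The helper compares names only, so column types still need to be checked against the schema.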


In [0]:
# Define the schema based on the raw features used in M4N1's input
# Ensure this matches the columns selected in M4N1 before the pipeline
# feature_columns = ["age", "tenure_days", ..., "category"]
schema = StructType([
    StructField("customer_id", LongType(), True), # Added for context
    StructField("product_id", LongType(), True),   # Added for context
    StructField("age", IntegerType(), True),
    StructField("tenure_days", IntegerType(), True),
    StructField("price", DoubleType(), True),
    StructField("avg_rating", DoubleType(), True), # Product avg rating
    StructField("previous_visits", LongType(), True),
    StructField("view_count", LongType(), True),
    StructField("add_to_cart_count", LongType(), True),
    StructField("review_count", LongType(), True),
    StructField("total_interactions", LongType(), True),
    StructField("total_time_spent", LongType(), True),
    StructField("interaction_time_span_days", DoubleType(), True),
    StructField("avg_user_rating", DoubleType(), True), # Customer avg rating
    StructField("gender", StringType(), True),
    StructField("country", StringType(), True),
    StructField("membership_level", StringType(), True),
    StructField("category", StringType(), True)
])

# Create sample raw data (using Pandas for convenience)
new_data_pd = pd.DataFrame([
    {
        "customer_id": 20001, "product_id": 501, "age": 65, "tenure_days": 1500, "price": 499.99, "avg_rating": 4.7, # High Value
        "previous_visits": 20, "view_count": 30, "add_to_cart_count": 5, "review_count": 2, "total_interactions": 37,
        "total_time_spent": 2500, "interaction_time_span_days": 30.0, "avg_user_rating": 4.5,
        "gender": "F", "country": "US", "membership_level": "Platinum", "category": "Electronics"
    },
    {
        "customer_id": 20002, "product_id": 602, "age": 40, "tenure_days": 600, "price": 75.00, "avg_rating": 4.1, # Mid-Range
        "previous_visits": 10, "view_count": 15, "add_to_cart_count": 3, "review_count": 1, "total_interactions": 19,
        "total_time_spent": 900, "interaction_time_span_days": 10.0, "avg_user_rating": 4.0,
        "gender": "M", "country": "CA", "membership_level": "Gold", "category": "Clothing"
    },
    {
        "customer_id": 20003, "product_id": 703, "age": 25, "tenure_days": 150, "price": 19.95, "avg_rating": 3.8, # Lower-End Purchaser
        "previous_visits": 4, "view_count": 8, "add_to_cart_count": 2, "review_count": 0, "total_interactions": 10,
        "total_time_spent": 500, "interaction_time_span_days": 5.0, "avg_user_rating": 3.5,
        "gender": "Other", "country": "UK", "membership_level": "Bronze", "category": "Books"
    },
    {
        "customer_id": 20004, "product_id": 804, "age": 33, "tenure_days": 800, "price": 120.00, "avg_rating": 4.4, # Established Purchaser
        "previous_visits": 12, "view_count": 22, "add_to_cart_count": 4, "review_count": 1, "total_interactions": 27,
        "total_time_spent": 1500, "interaction_time_span_days": 25.5, "avg_user_rating": 4.2,
        "gender": "F", "country": "DE", "membership_level": "Silver", "category": "Home & Kitchen"
    }
])

# Convert Pandas DataFrame to Spark DataFrame with the defined schema
sample_new_data_df = spark.createDataFrame(new_data_pd, schema=schema)

print("Schema of the new data:")
sample_new_data_df.printSchema()

print("\nSample new data content:")
sample_new_data_df.limit(5).display()


Schema of the new data:
root
 |-- customer_id: long (nullable = true)
 |-- product_id: long (nullable = true)
 |-- age: integer (nullable = true)
 |-- tenure_days: integer (nullable = true)
 |-- price: double (nullable = true)
 |-- avg_rating: double (nullable = true)
 |-- previous_visits: long (nullable = true)
 |-- view_count: long (nullable = true)
 |-- add_to_cart_count: long (nullable = true)
 |-- review_count: long (nullable = true)
 |-- total_interactions: long (nullable = true)
 |-- total_time_spent: long (nullable = true)
 |-- interaction_time_span_days: double (nullable = true)
 |-- avg_user_rating: double (nullable = true)
 |-- gender: string (nullable = true)
 |-- country: string (nullable = true)
 |-- membership_level: string (nullable = true)
 |-- category: string (nullable = true)


Sample new data content:


customer_id,product_id,age,tenure_days,price,avg_rating,previous_visits,view_count,add_to_cart_count,review_count,total_interactions,total_time_spent,interaction_time_span_days,avg_user_rating,gender,country,membership_level,category
20001,501,65,1500,499.99,4.7,20,30,5,2,37,2500,30.0,4.5,F,US,Platinum,Electronics
20002,602,40,600,75.0,4.1,10,15,3,1,19,900,10.0,4.0,M,CA,Gold,Clothing
20003,703,25,150,19.95,3.8,4,8,2,0,10,500,5.0,3.5,Other,UK,Bronze,Books
20004,804,33,800,120.0,4.4,12,22,4,1,27,1500,25.5,4.2,F,DE,Silver,Home & Kitchen


## 3. Perform Batch Inference

Now, we apply the loaded `PipelineModel` to our new data using the `.transform()` method. This single call executes all the necessary preprocessing steps (StringIndexing, OneHotEncoding, Scaling, Assembling) using the parameters learned during training, followed by the prediction step from the trained Linear Regression model.
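Conceptually, a `PipelineModel` passes the DataFrame through each fitted stage in order, with each stage's output feeding the next. The sketch below illustrates that chaining with toy stand-ins. `apply_stages` and `AddSuffix` are hypothetical names used only for illustration, and the real stages operate on Spark DataFrames rather than Python lists.

```python
from functools import reduce


def apply_stages(stages, dataset):
    """Feed `dataset` through each stage's transform() in order."""
    return reduce(lambda current, stage: stage.transform(current), stages, dataset)


class AddSuffix:
    """Toy stand-in for a fitted Spark transformer (illustration only)."""

    def __init__(self, suffix):
        self.suffix = suffix

    def transform(self, dataset):
        # Return a new list, mirroring how Spark transformers return a new DataFrame.
        return dataset + [self.suffix]


toy_stages = [AddSuffix("indexed"), AddSuffix("scaled"), AddSuffix("predicted")]
result = apply_stages(toy_stages, ["raw"])
print(result)  # ['raw', 'indexed', 'scaled', 'predicted']
```

Because the fitted stages are stored in order inside the saved model, the caller never has to repeat this chaining by hand.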


In [0]:
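# Apply the pipeline to the new data.
# Note: .transform() is lazy and only defines the computation; the work
# runs when an action such as .display() is called below.
print("Applying loaded pipeline model to new data...")
new_predictions_df = loaded_pipeline_model.transform(sample_new_data_df)
print("Inference complete.")

# Linear regression can output negative values, but a purchase amount
# cannot be negative, so clip predictions at 0.
new_predictions_df = new_predictions_df.withColumn(
    "prediction",
    when(col("prediction") < 0, 0.0).otherwise(col("prediction"))
)

# Show the results: select some key input columns and the prediction
print("\nPredictions on new data:")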

new_predictions_df.select(
    "customer_id", 
    "product_id", 
    "price", 
    "category", 
    "prediction" # The output column from the LinearRegressionModel stage
).limit(5).display()


Applying loaded pipeline model to new data...
Inference complete.

Predictions on new data:


customer_id,product_id,price,category,prediction
20001,501,499.99,Electronics,1112.6040491092283
20002,602,75.0,Clothing,79.93871112852068
20003,703,19.95,Books,0.0
20004,804,120.0,Home & Kitchen,300.53514159022046


## 4. Distributed Inference Discussion

- **Scalability:** We used a small, manually created DataFrame here for demonstration. In practice, the input could be a large dataset in cloud storage (e.g., read from Parquet files with `spark.read.parquet("/path/to/new/data/")`). Spark distributes the `.transform()` operation across the cluster, applying the entire pipeline (preprocessing + model prediction) in parallel to partitions of the data. Because no single machine has to hold the full dataset, inference can scale to data far larger than one node's memory by adding executors.
- **Consistency:** Because the saved `PipelineModel` stores the fitted parameters of every stage (such as StringIndexer mappings and StandardScaler means/stds), the *exact same* preprocessing is applied to new data as was applied during training. This prevents the inconsistencies and errors that can arise from manually reimplementing the preprocessing logic for inference.
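The batch pattern described above can be sketched as a small function. This is an illustration under stated assumptions rather than a production job: `score_batch` is a hypothetical helper, the paths are placeholders, and overwriting the output directory is just one possible write mode.

```python
def score_batch(spark, pipeline_model, input_path, output_path):
    """Read raw rows from Parquet, score them, and write the predictions back out."""
    raw_df = spark.read.parquet(input_path)
    # transform() is lazy; the distributed work happens during the write below.
    scored_df = pipeline_model.transform(raw_df)
    # Assumption: replacing any previous output is acceptable for this job.
    scored_df.write.mode("overwrite").parquet(output_path)
    return scored_df
```

In this notebook it could be called as `score_batch(spark, loaded_pipeline_model, "/path/to/new/data/", "/path/to/predictions/")`, where the output path is hypothetical. The same loaded model is used whether the input has four rows or many partitions.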


## 5. Conclusion and Next Steps

In this notebook, we successfully:
*   Loaded a pre-trained `PipelineModel` from MLflow using its `run_id`.
*   Prepared sample raw input data mirroring the schema expected by the pipeline.
*   Performed batch inference by applying the loaded `PipelineModel`'s `.transform()` method to the new data.
*   Observed the generated predictions alongside original features.
*   Discussed how this process scales seamlessly in Spark and ensures consistent preprocessing.

**Benefits Demonstrated:**
*   **Model Reusability:** Easily load and reuse complex, trained pipelines.
*   **Simplified Inference:** A single `.transform()` call handles all steps.
*   **Production Readiness:** Using MLflow for persistence provides a standard way to manage models for deployment.
*   **Scalable Batch Prediction:** Leverage Spark's distributed nature for large datasets.

**Next Steps:** In the final notebook (Module 4 Notebook 3), we will explore techniques to optimize the performance of Spark ML workloads, such as caching, partitioning, and configuration tuning.
