# Module 6: Machine Learning Pipelines

**Training Objective:** Master Spark ML Pipelines to create reproducible, production-ready ML workflows with MLflow tracking.

**Scope:**
- Pipeline Concepts: Why use Pipelines?
- Defining Stages: Chaining Imputers, Encoders, Scalers, and Models
- MLflow Tracking: Logging experiments, parameters, and metrics
- Hyperparameter Tuning: Using CrossValidator for automatic model selection
- Model Persistence: Saving the pipeline for production use

## Context and Requirements

- **Training day:** Day 1 - Data Preparation Fundamentals
- **Notebook type:** Demo
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
  - MLflow enabled (default in Databricks)
  - Permissions: CREATE TABLE, SELECT, MODIFY
- **Dependencies:** `02_Data_Splitting.ipynb` (creates `customer_train`, `customer_test` tables)
- **Execution time:** ~30 minutes

> **Note:** This module brings together all previous concepts into a production-ready workflow.

## Theoretical Introduction

**Why use Pipelines?**

| Benefit | Description |
|---------|-------------|
| **Data Leakage Prevention** | `fit()` on train, `transform()` on test - automatically |
| **Reproducibility** | Single artifact contains all preprocessing + model |
| **Simplicity** | Save/load entire workflow as one object |
| **Consistency** | Same transformations in training and production |

**Pipeline Components:**

```
Data → [Stage 1: Imputer] → [Stage 2: Encoder] → [Stage 3: Scaler] → [Stage 4: Model] → Predictions
```

**CrossValidator for Hyperparameter Tuning:**
- Define a **ParamGrid** (list of hyperparameter combinations)
- CrossValidator trains $k$ models for each combination (k-fold CV)
- Picks the best model based on evaluation metric
- Total models trained: `len(paramGrid) × numFolds`, plus one final refit of the best combination on the full training set
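
The fit count above can be checked with a few lines of plain Python. This is only an illustrative sketch: it builds the grid with `itertools.product` rather than Spark's `ParamGridBuilder`, and the values simply mirror the grid used later in this notebook.

```python
from itertools import product

# Grid values mirroring the tuning section of this notebook.
reg_params = [0.01, 0.1, 0.5]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 3

# Every pairing of the two lists is one hyperparameter combination.
param_grid = [
    {"regParam": r, "elasticNetParam": e}
    for r, e in product(reg_params, elastic_net_params)
]

cv_fits = len(param_grid) * num_folds  # fits performed during cross-validation
total_fits = cv_fits + 1               # plus one refit of the winner on all training rows

print(f"{len(param_grid)} combinations x {num_folds} folds = {cv_fits} CV fits")
print(f"Including the final refit: {total_fits} fits")
```

Adding a third hyperparameter with three values would triple these numbers, which is why small grids are a sensible starting point.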

**MLflow Integration:**
- **Experiments**: Group related runs together
- **Runs**: Single training execution with params, metrics, artifacts
- **Model Registry**: Version and stage models (Staging → Production)

## Per-User Isolation

Run the initialization script for per-user catalog and schema isolation:

In [None]:
%run ./00_Setup

**Import Libraries and Load Data:**

In [None]:
import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer, StringIndexer, OneHotEncoder, VectorAssembler, RobustScaler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Load Raw Split Data (We start from scratch in the pipeline!)
train_df = spark.table("customer_train")
test_df = spark.table("customer_test")

## Section 1: Defining Pipeline Stages

**Why use a Pipeline?**
1.  **Prevention of Data Leakage:** When we calculate things like "Mean" for imputation or "Max" for scaling, we must calculate them **only on the Training set** and apply them to the Test set. Fitting the Pipeline on Train makes every stage learn its statistics from Train only; `transform()` then reuses those statistics on Test without recomputing them.
2.  **Reproducibility:** It bundles all preprocessing steps and the model into a single artifact.
3.  **Simplicity:** You can save/load the entire workflow as one object.
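
To make the leakage point concrete, here is a minimal plain-Python sketch of the fit/transform split for median imputation. The ages are made up and the helper names `fit_median` and `transform` are hypothetical; Spark's `Imputer` does the equivalent work inside the Pipeline.

```python
from statistics import median

# Hypothetical ages; None marks a missing value.
train_ages = [25, 31, None, 40, 58]
test_ages = [22, None, 24]

def fit_median(values):
    """'fit': learn the imputation statistic from the given rows only."""
    return median(v for v in values if v is not None)

def transform(values, fill_value):
    """'transform': apply an already-learned statistic without recomputing it."""
    return [fill_value if v is None else v for v in values]

train_median = fit_median(train_ages)              # learned from training rows only
test_imputed = transform(test_ages, train_median)  # test rows never influence the statistic

# The leaky alternative learns the statistic from train and test together.
leaky_median = fit_median(train_ages + test_ages)

print(f"train-only median: {train_median}, leaky median: {leaky_median}")
print(f"imputed test ages: {test_imputed}")
```

The two medians differ, so a model evaluated after the leaky version has already seen information from the test set.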

We will reconstruct our manual steps into a reusable Pipeline.

In [None]:
# 1. Imputation (features only: "salary" is the label, so it must not become a feature)
imputer = Imputer(inputCols=["age"], outputCols=["age_imp"]).setStrategy("median")

# 2. Encoding
indexer = StringIndexer(inputCol="country", outputCol="country_idx", handleInvalid="keep")
encoder = OneHotEncoder(inputCols=["country_idx"], outputCols=["country_vec"])

# 3. Assembly (Age and Country only; including salary here would leak the target)
assembler = VectorAssembler(inputCols=["age_imp", "country_vec"], outputCol="features_raw")

# 4. Scaling (RobustScaler because we have outliers!)
scaler = RobustScaler(inputCol="features_raw", outputCol="features")

# 5. Model (Predicting Salary based on Age and Country - just for demo)
lr = LinearRegression(labelCol="salary", featuresCol="features")

# --- The Pipeline ---
pipeline = Pipeline(stages=[imputer, indexer, encoder, assembler, scaler, lr])
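
The `handleInvalid="keep"` setting on the indexer is easy to overlook, so here is a small plain-Python imitation of what it does. This is a sketch, not Spark's implementation: `fit_indexer` and `transform_keep` are hypothetical helpers, and the country codes are invented. It assumes the default ordering (most frequent label first, ties broken alphabetically).

```python
from collections import Counter

def fit_indexer(values):
    """Assign indices by descending frequency (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_keep(values, mapping):
    """Map known labels to their index; all unseen labels share one extra index."""
    unseen_index = float(len(mapping))
    return [mapping.get(v, unseen_index) for v in values]

train_countries = ["PL", "DE", "PL", "US", "DE", "PL"]  # hypothetical training data
test_countries = ["DE", "BR", "PL"]                     # "BR" never appears in training

mapping = fit_indexer(train_countries)
print(mapping)
print(transform_keep(test_countries, mapping))
```

Without the extra bucket, a country that shows up only at inference time would make the transform fail instead of producing a prediction.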

## Section 2: Training with MLflow

We use `mlflow.start_run()` to track this experiment.

In [None]:
# Set Experiment
username = spark.sql("SELECT current_user()").collect()[0][0]
experiment_path = f"/Users/{username}/dp4ml_pipeline_demo"
mlflow.set_experiment(experiment_path)

with mlflow.start_run(run_name="salary_prediction_v1"):
    
    # Log Parameters
    mlflow.log_param("model", "LinearRegression")
    mlflow.log_param("scaler", "RobustScaler")
    
    # Fit Pipeline
    print("Training Pipeline...")
    model = pipeline.fit(train_df)
    
    # Evaluate
    predictions = model.transform(test_df)
    evaluator = RegressionEvaluator(labelCol="salary", metricName="rmse")
    rmse = evaluator.evaluate(predictions)
    
    print(f"RMSE: {rmse}")
    
    # Log Metrics
    mlflow.log_metric("rmse", rmse)
    
    # Log Model
    mlflow.spark.log_model(model, "model")
    
    print("Run saved to MLflow.")
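
For reference, the `rmse` metric logged above is the square root of the mean squared difference between label and prediction. A minimal plain-Python sketch with made-up salaries (the helper `rmse` is hypothetical and only illustrates the formula; in the notebook the value comes from `RegressionEvaluator`):

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((label - prediction)^2))."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical salaries and predictions.
actual = [50_000.0, 62_000.0, 48_000.0]
predicted = [53_000.0, 58_000.0, 48_000.0]
print(f"RMSE: {rmse(actual, predicted):.2f}")
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result is expressed in the same unit as the label.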

In [None]:
# Save the Pipeline Model for Production
# This allows us to load the exact same transformations and model later for inference.

model_path = f"/Users/{username}/dp4ml_pipeline_model"
model.write().overwrite().save(model_path)

print(f"Pipeline Model saved to: {model_path}")
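
Loading the saved pipeline later could look roughly like the sketch below. It assumes the model was written with `model.write().save(...)` as above, so `PipelineModel.load` is the matching reader; the helper name `load_and_score` is hypothetical.

```python
def load_and_score(model_path, input_df):
    """Load a saved PipelineModel and apply it to a DataFrame of raw rows."""
    # Imported inside the function so it can be defined outside a Spark environment.
    from pyspark.ml import PipelineModel

    loaded_model = PipelineModel.load(model_path)
    # transform() replays imputation, encoding, assembly, and scaling with the
    # statistics learned at training time, then adds the prediction column.
    return loaded_model.transform(input_df)

# Usage inside the notebook (assumes `model_path` and `test_df` from the cells above):
# scored_df = load_and_score(model_path, test_df)
# scored_df.select("prediction").show(5)
```

The input only needs the raw columns the pipeline was trained on, since every preprocessing stage is stored inside the loaded object.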


In [None]:
# Evaluate the best model on the TEST set (uses `best_model` from the CrossValidator cell)
predictions_cv = best_model.transform(test_df)
rmse_cv = evaluator.evaluate(predictions_cv)

# Extract the best hyperparameters from the LinearRegression stage
# The last stage in the pipeline is the LinearRegressionModel
lr_model_stage = best_model.stages[-1]

print(f"Best Model RMSE: {rmse_cv}")
print(f"Best regParam: {lr_model_stage.getRegParam()}")
print(f"Best elasticNetParam: {lr_model_stage.getElasticNetParam()}")


## Best Practices

### 🎯 Pipeline Strategy Guide:

| Component | Best Practice | Why |
|-----------|--------------|-----|
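| **Order of stages** | Impute → Encode → Assemble → Scale → Model | Each stage consumes the previous stage's output |
| **handleInvalid** | Use `"keep"` for StringIndexer | Unseen categories get their own index instead of raising an error |
| **Scaler choice** | RobustScaler when outliers are present | Median and IQR are less sensitive to extreme values than mean and standard deviation |
| **CrossValidator folds** | 3-5 for large data, 5-10 for small | Trade compute cost against the stability of the metric estimate |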
| **MLflow logging** | Log params, metrics, AND model | Full reproducibility |
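
The scaler recommendation in the table can be illustrated with a plain-Python sketch of median/IQR scaling. The helpers `fit_robust_scaler` and `robust_scale` are hypothetical, the salaries are invented, and the quartiles are computed exactly here, whereas Spark estimates them approximately. To my knowledge Spark's `RobustScaler` scales by the IQR without centering by default, which is why `with_centering` defaults to `False` below.

```python
from statistics import quantiles

def fit_robust_scaler(values):
    """Learn the median and interquartile range (IQR) from training values."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"median": q2, "iqr": q3 - q1}

def robust_scale(values, stats, with_centering=False):
    """Divide by the IQR; optionally subtract the median first."""
    center = stats["median"] if with_centering else 0.0
    return [(v - center) / stats["iqr"] for v in values]

# Hypothetical salaries with one extreme outlier.
salaries = [40_000, 45_000, 50_000, 55_000, 60_000, 1_000_000]
stats = fit_robust_scaler(salaries)
print(stats)
print(robust_scale(salaries, stats, with_centering=True))
```

The outlier barely moves the median or the IQR, so the typical salaries keep a usable spread after scaling, while a mean/standard-deviation scaler would squeeze them together.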

### ⚠️ Common Mistakes to Avoid:

1. **Fitting pipeline on all data** → Data leakage
2. **Too many CV folds** → Slow training, diminishing returns
3. **Not logging to MLflow** → Lost experiments
4. **Huge param grids** → Combinatorial explosion
5. **Not saving the best model** → Can't reproduce

### 💡 Pro Tips:

- Use `parallelism` parameter in CrossValidator for faster training
- Start with small param grid, expand based on results
- Always evaluate on holdout TEST set (not validation)
- Use MLflow Model Registry for production deployment
- Save both the pipeline AND the fitted model

## Summary

### What we achieved:

- **Pipeline Definition**: Created end-to-end workflow (Impute → Encode → Scale → Model)
- **MLflow Tracking**: Logged parameters, metrics, and model artifacts
- **CrossValidator**: Automated hyperparameter search with 3-fold CV
- **Model Persistence**: Saved pipeline for production use

### Key Takeaways:

| # | Principle |
|---|-----------|
| 1 | **Pipelines prevent data leakage** - fit on train, transform on test |
| 2 | **MLflow is essential** - track all experiments |
| 3 | **CrossValidator automates tuning** - finds best hyperparameters |
| 4 | **Save entire pipeline** - includes all preprocessing |
| 5 | **Evaluate on TEST only once** - final unbiased estimate |

### Artifacts Created:

| Artifact | Location | Purpose |
|----------|----------|---------|
| Experiment | `/Users/{user}/dp4ml_pipeline_demo` | Group runs |
| Run | Auto-generated | Track this training |
| Model | `/Users/{user}/dp4ml_pipeline_model` | Production deployment |

### Next Steps:

📚 **Next Module:** Module 7 - Feature Store & MLflow (production ML)

## Cleanup

Optionally remove demo artifacts created during exercises:

In [None]:
# Cleanup - remove demo artifacts created in this notebook

# Uncomment the lines below to remove demo artifacts:

# model_path is a DBFS path, so remove it with dbutils rather than shutil:
# dbutils.fs.rm(model_path, recurse=True)
# mlflow.delete_experiment(mlflow.get_experiment_by_name(experiment_path).experiment_id)

# print("✅ All demo artifacts removed")

print("ℹ️ Cleanup disabled (uncomment code to remove demo artifacts)")

In [None]:
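# Create CrossValidator (uses `pipeline_tune` and `paramGrid` from the parameter-grid cell)
# numFolds=3 means we do 3-fold cross-validation for each param combo
# Total CV fits = len(paramGrid) * numFolds = 9 * 3 = 27 models, plus one final refit of the best combo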

crossval = CrossValidator(
    estimator=pipeline_tune,
    estimatorParamMaps=paramGrid,
    evaluator=RegressionEvaluator(labelCol="salary", metricName="rmse"),
    numFolds=3,
    parallelism=4  # Train 4 models in parallel (faster on clusters)
)

print("CrossValidator configured. Training will evaluate 27 models...")

# Fit CrossValidator (This takes longer!)
cv_model = crossval.fit(train_df)

# Get best model
best_model = cv_model.bestModel
print("✅ Best model found!")


In [None]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# We reuse our pipeline but replace the model stage to allow tuning
# Let's create a fresh pipeline for tuning
lr_tune = LinearRegression(labelCol="salary", featuresCol="features")

pipeline_tune = Pipeline(stages=[imputer, indexer, encoder, assembler, scaler, lr_tune])

# Define the Parameter Grid
# We will try different values of regParam (overall regularization strength)
# and elasticNetParam (the L1/L2 mix: 0.0 = pure L2 / ridge, 1.0 = pure L1 / lasso)
paramGrid = ParamGridBuilder() \
    .addGrid(lr_tune.regParam, [0.01, 0.1, 0.5]) \
    .addGrid(lr_tune.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

print(f"Number of hyperparameter combinations to test: {len(paramGrid)}")
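
How the two tuned hyperparameters interact can be seen from the elastic-net penalty that is added to the loss. The sketch below writes it in plain Python, following the form given in the Spark ML documentation as I understand it: `regParam * (alpha * L1 + (1 - alpha) / 2 * L2^2)`, with `alpha = elasticNetParam`. The function name and the weights are hypothetical.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """regParam * (alpha * |w|_1 + (1 - alpha) / 2 * |w|_2^2), alpha = elasticNetParam."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

weights = [3.0, -4.0]  # hypothetical coefficients
for alpha in (0.0, 0.5, 1.0):
    penalty = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=alpha)
    print(f"elasticNetParam={alpha}: penalty={penalty:.3f}")
```

So `regParam` controls how strong the penalty is overall, while `elasticNetParam` only decides how that strength is split between the L1 and L2 terms.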


## Section 3: Hyperparameter Tuning with CrossValidator

**Why tune hyperparameters?**
In the previous example, we used default settings for `LinearRegression`. But models have "knobs" (hyperparameters) that can drastically change performance (e.g., `regParam` for regularization).

**CrossValidator** automates this:
1.  Define a **ParamGrid** (list of hyperparameters to try).
2.  CrossValidator trains $k$ models for each combination (k-fold).
3.  It picks the best model based on the evaluation metric.
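
The three steps above can be sketched end to end in plain Python on a toy problem. Everything here is an illustrative assumption: the data is invented, the model is a one-feature ridge regression through the origin, and the folds are assigned by interleaving rows, whereas Spark's `CrossValidator` assigns rows to folds randomly.

```python
import math

def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs; every row is validated exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, valid_idx

def fit_ridge(xs, ys, reg_param):
    """Minimize sum((y - w*x)^2) + reg_param * w^2 in closed form."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + reg_param)

def rmse(xs, ys, weight):
    return math.sqrt(sum((y - weight * x) ** 2 for x, y in zip(xs, ys)) / len(xs))

def cross_validate(xs, ys, grid, k):
    """Return (best_reg_param, avg_rmse_by_param, number_of_fits)."""
    avg_rmse, n_fits = {}, 0
    for reg_param in grid:
        scores = []
        for train_idx, valid_idx in k_fold_indices(len(xs), k):
            w = fit_ridge([xs[i] for i in train_idx], [ys[i] for i in train_idx], reg_param)
            n_fits += 1
            scores.append(rmse([xs[i] for i in valid_idx], [ys[i] for i in valid_idx], w))
        avg_rmse[reg_param] = sum(scores) / k
    best = min(avg_rmse, key=avg_rmse.get)  # lower RMSE is better
    return best, avg_rmse, n_fits

# Hypothetical toy data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
best, scores, n_fits = cross_validate(xs, ys, grid=[0.01, 0.1, 0.5], k=3)
final_weight = fit_ridge(xs, ys, best)  # refit the winner on all training rows
print(f"best regParam: {best}, CV fits: {n_fits}, final weight: {final_weight:.3f}")
```

The test set plays no role anywhere in this loop; it is only used afterwards, once, to report the final metric.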