# Module 6: Machine Learning Pipelines

**Training Objective:** Master Spark ML Pipelines to create reproducible, production-ready ML workflows with MLflow tracking.

**Scope:**
- Pipeline Concepts: Why use Pipelines?
- Defining Stages: Chaining Imputers, Encoders, Scalers, and Models
- MLflow Tracking: Logging experiments, parameters, and metrics
- Hyperparameter Tuning: Using CrossValidator for automatic model selection
- Model Persistence: Saving the pipeline for production use

## Context and Requirements

- **Training day:** Day 1 - Data Preparation Fundamentals
- **Notebook type:** Demo
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
  - MLflow enabled (default in Databricks)
  - Permissions: CREATE TABLE, SELECT, MODIFY
- **Dependencies:** `02_Data_Splitting.ipynb` (creates `customer_train`, `customer_test` tables)
- **Execution time:** ~30 minutes

> **Note:** This module brings together all previous concepts into a production-ready workflow.

## Theoretical Background

**Spark ML Pipelines:**
A Pipeline is a sequence of stages (Transformers and Estimators) executed in order. This ensures:
1. Consistent data transformation between training and inference
2. No data leakage (statistics such as medians and scaling ranges are learned from the training data only)
3. Easy saving/loading of the entire workflow
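
To make the Estimator/Transformer distinction concrete, here is a minimal plain-Python sketch of how a pipeline chains stages. It is not Spark code: the class names (`MedianImputer`, `FittedImputer`, `DivideBy`, `ToyPipeline`, `ToyPipelineModel`) and the sample values are hypothetical, chosen only to mirror the `fit()` / `transform()` contract described above.

```python
from statistics import median


class MedianImputer:
    """Toy Estimator: fit() learns a median and returns a fitted Transformer."""

    def fit(self, values):
        learned = median(v for v in values if v is not None)
        return FittedImputer(learned)


class FittedImputer:
    """Toy Transformer: fills gaps with the value learned during fit()."""

    def __init__(self, fill_value):
        self.fill_value = fill_value

    def transform(self, values):
        return [self.fill_value if v is None else v for v in values]


class DivideBy:
    """Toy Transformer with nothing to learn."""

    def __init__(self, divisor):
        self.divisor = divisor

    def transform(self, values):
        return [v / self.divisor for v in values]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        fitted = []
        for stage in self.stages:
            # Estimators are fit on the output of the previous stages
            if hasattr(stage, "fit"):
                stage = stage.fit(values)
            values = stage.transform(values)
            fitted.append(stage)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


train = [20, None, 40, 60]  # hypothetical training column
test = [None, 100]          # hypothetical test column

model = ToyPipeline([MedianImputer(), DivideBy(10)]).fit(train)
result = model.transform(test)  # the gap in test is filled with the TRAIN median (40)
print(result)
```

The fitted model carries the learned median with it, which is why saving the fitted pipeline is enough to reproduce the same preprocessing at inference time.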

**CrossValidator:**
Performs k-fold cross-validation with hyperparameter tuning:
- Splits training data into k folds
- Trains model on k-1 folds, validates on remaining fold
- Repeats for all combinations of hyperparameters
- Models trained during the search: `len(paramGrid) × numFolds`, plus one final refit of the best combination on the full training set
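
The counting rule can be checked with a few lines of plain Python. This sketch only enumerates combinations and does no training; the values mirror the grid used later in this notebook, and the variable names are illustrative. It also counts the single refit that Spark's `CrossValidator` performs with the winning combination on the full training set.

```python
from itertools import product

reg_params = [0.01, 0.1, 0.5]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 3

# A parameter grid is the Cartesian product of the candidate values
param_grid = [
    {"regParam": reg, "elasticNetParam": mix}
    for reg, mix in product(reg_params, elastic_net_params)
]

cv_fits = len(param_grid) * num_folds  # one fit per (combination, fold)
total_fits = cv_fits + 1               # plus the final refit of the best combination

print(len(param_grid), cv_fits, total_fits)
```

Adding a third hyperparameter with three values would triple every number here, which is why grids should start small.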

**MLflow with Unity Catalog Models:**

| Feature | Description |
|---------|-------------|
| **Model Registry** | `catalog.schema.model_name` format |
| **Versioning** | Automatic version tracking (v1, v2, ...) |
| **Aliases** | `@champion`, `@challenger` for deployment stages |
| **Governance** | Unity Catalog permissions apply |
| **Lineage** | Track data and model dependencies |

**⚠️ Unity Catalog requires Model Signature:**

Unity Catalog models **MUST** include a signature (input/output schema). Without it, registration fails.

```python
import mlflow
import mlflow.spark
from mlflow.models.signature import infer_signature

# A small sample is enough to infer the schema; avoid collecting the full table to the driver
input_example = train_df.limit(5).toPandas()
output_example = predictions.select("prediction").limit(5).toPandas()
signature = infer_signature(input_example, output_example)

# Log the fitted pipeline and register it with the signature
mlflow.spark.log_model(
    model,
    "model",
    signature=signature,
    input_example=input_example,
    registered_model_name="catalog.schema.model"
)
```

## Per-User Isolation

Run the initialization script for per-user catalog and schema isolation:

In [0]:
%run ./00_Setup

**Import Libraries and Load Data:**

In [0]:
import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer, StringIndexer, OneHotEncoder, VectorAssembler, RobustScaler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Load Raw Split Data (We start from scratch in the pipeline!)
train_df = spark.table("customer_train")
test_df = spark.table("customer_test")

## Section 1: Defining Pipeline Stages

**Why use a Pipeline?**
1.  **Prevention of Data Leakage:** Statistics such as the median used for imputation or the range used for scaling must be computed **only on the Training set** and then applied unchanged to the Test set. With a Pipeline, `pipeline.fit(train_df)` learns every statistic from Train, and the fitted model's `transform(test_df)` reuses them on Test.
2.  **Reproducibility:** It bundles all preprocessing steps and the model into a single artifact.
3.  **Simplicity:** You can save/load the entire workflow as one object.

We will reconstruct our manual steps into a reusable Pipeline.
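
The leakage point is easy to see with numbers. The following plain-Python sketch (hypothetical `age` values, helper names invented for illustration) compares a median learned from the training split alone with one learned from train and test combined.

```python
from statistics import median

train_age = [25, 31, None, 40]  # hypothetical training values
test_age = [None, 62, 70]       # hypothetical test values


def fit_median(values):
    """Learn the imputation value, ignoring missing entries."""
    return median(v for v in values if v is not None)


def impute(values, fill_value):
    """Apply a previously learned imputation value."""
    return [fill_value if v is None else v for v in values]


clean_fill = fit_median(train_age)             # learned from train only
leaky_fill = fit_median(train_age + test_age)  # test rows influence the statistic

print(clean_fill, leaky_fill)
print(impute(test_age, clean_fill))
```

The two medians differ because the test rows shifted the statistic. Any metric computed after imputing with `leaky_fill` would be based on information the model should not have seen.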

In [0]:
# 1. Imputation (features only: "salary" is the label, so it is neither imputed nor used as an input)
imputer = Imputer(inputCols=["age"], outputCols=["age_imp"]).setStrategy("median")

# 2. Encoding
indexer = StringIndexer(inputCol="country", outputCol="country_idx", handleInvalid="keep")
encoder = OneHotEncoder(inputCols=["country_idx"], outputCols=["country_vec"])

# 3. Assembly (including a salary-derived column here would leak the target into the features)
assembler = VectorAssembler(inputCols=["age_imp", "country_vec"], outputCol="features_raw")

# 4. Scaling (RobustScaler because we have outliers!)
scaler = RobustScaler(inputCol="features_raw", outputCol="features")

# 5. Model (Predicting Salary based on Age and Country - just for demo)
lr = LinearRegression(labelCol="salary", featuresCol="features")

# --- The Pipeline ---
pipeline = Pipeline(stages=[imputer, indexer, encoder, assembler, scaler, lr])
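
To see why a quantile-based scaler copes better with outliers, the sketch below compares min-max scaling with division by the interquartile range (IQR) in plain Python. The sample values and function names are hypothetical. Spark's `RobustScaler` computes approximate quantiles on distributed data and, by default, scales without centering, so exact numbers may differ from this illustration.

```python
from statistics import quantiles

ages = [30, 32, 35, 38, 40, 1000]  # hypothetical column with one extreme outlier


def min_max_scale(values):
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def iqr_scale(values):
    # Quartiles via linear interpolation; the IQR ignores the extreme tail
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return [v / (q3 - q1) for v in values]


min_max = min_max_scale(ages)
robust = iqr_scale(ages)

# Spread of the five "normal" values under each scaler
min_max_spread = max(min_max[:5]) - min(min_max[:5])
robust_spread = max(robust[:5]) - min(robust[:5])

print(round(min_max_spread, 4), round(robust_spread, 4))
```

With min-max scaling the single outlier squeezes the typical values into a sliver of the range, while IQR scaling keeps them well separated.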

## Section 2: Training with MLflow

We wrap training in `mlflow.start_run()` so that the parameters, metrics, and fitted model are recorded together as one run of the experiment.

In [0]:
# Set MLflow to use Unity Catalog for model registry
mlflow.set_registry_uri("databricks-uc")

# Import signature inference for Unity Catalog (REQUIRED!)
from mlflow.models.signature import infer_signature

# Set Experiment
username = spark.sql("SELECT current_user()").collect()[0][0]
experiment_path = f"/Users/{username}/dp4ml_pipeline_demo"
mlflow.set_experiment(experiment_path)

# Model name for Unity Catalog
model_name = f"{catalog_name}.{schema_name}.salary_prediction_model"

with mlflow.start_run(run_name="salary_prediction_v1"):
    
    # Log Parameters
    mlflow.log_param("model", "LinearRegression")
    mlflow.log_param("scaler", "RobustScaler")
    
    # Fit Pipeline
    print("Training Pipeline...")
    model = pipeline.fit(train_df)
    
    # Evaluate
    predictions = model.transform(test_df)
    evaluator = RegressionEvaluator(labelCol="salary", metricName="rmse")
    rmse = evaluator.evaluate(predictions)
    
    print(f"RMSE: {rmse}")
    
    # Log Metrics
    mlflow.log_metric("rmse", rmse)
    
    # Infer the model signature from a small input sample and matching predictions
    # (a few rows are enough for the schema; avoid collecting the full table to the driver)
    # Unity Catalog REQUIRES a signature for model registration
    input_example = train_df.limit(5).toPandas()
    signature = infer_signature(
        input_example,
        predictions.select("prediction").limit(5).toPandas()
    )
    
    # Log Model and Register to Unity Catalog with signature
    mlflow.spark.log_model(
        model, 
        "model",
        signature=signature,
        input_example=input_example,
        registered_model_name=model_name  # Auto-register to UC
    )
    
    print(f"Model registered to Unity Catalog: {model_name}")

In [0]:
# Register the Pipeline Model in Unity Catalog from an existing run
# Unity Catalog Models provide governance, versioning, and lineage tracking
# Note: the previous cell already registered this model via registered_model_name,
# so running this cell creates an additional version.

# Set registry to Unity Catalog
mlflow.set_registry_uri("databricks-uc")

# Model name in Unity Catalog format: catalog.schema.model_name
model_name = f"{catalog_name}.{schema_name}.salary_prediction_model"

# The run has already ended, so mlflow.active_run() returns None here;
# look up the most recent run instead
run_id = mlflow.last_active_run().info.run_id

# Register the model
model_info = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name=model_name
)

print(f"Model registered in Unity Catalog: {model_name}")
print(f"Model version: {model_info.version}")
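
Once a version is registered, a common next step is to point an alias at it and load the model by that alias. The helper below is a sketch under assumptions: the function name `promote_and_load` is hypothetical, it assumes MLflow 2.x with the Unity Catalog registry available, and the imports are kept local so the function can be defined even where MLflow is not installed.

```python
def promote_and_load(model_name: str, version: int, alias: str = "champion"):
    """Point `alias` at `version`, then load that version as a Spark PipelineModel."""
    # Local imports: only needed when the function is actually called on Databricks
    import mlflow
    import mlflow.spark
    from mlflow import MlflowClient

    mlflow.set_registry_uri("databricks-uc")

    # Move (or create) the alias so that it references the given version
    client = MlflowClient()
    client.set_registered_model_alias(name=model_name, alias=alias, version=version)

    # Resolve the alias back to a model and load the fitted pipeline
    return mlflow.spark.load_model(f"models:/{model_name}@{alias}")


# Example usage in this notebook (assumes model_name, model_info and test_df exist):
# champion = promote_and_load(model_name, model_info.version)
# champion.transform(test_df).select("prediction").show(5)
```

Loading by alias rather than by version number means downstream jobs do not need to change when a new version is promoted.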

## Section 3: Hyperparameter Tuning with CrossValidator

**Why tune hyperparameters?**
In the previous example, we used default settings for `LinearRegression`. But models have "knobs" (hyperparameters) that can drastically change performance (e.g., `regParam` for regularization).

**CrossValidator** automates this:
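1.  Define a **ParamGrid** (the hyperparameter combinations to try).
2.  For each combination, CrossValidator trains $k$ models, one per fold, and averages the evaluation metric across folds.
3.  It picks the combination with the best average metric and refits it on the full training set.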
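
The three steps above can be sketched in plain Python with a toy one-feature ridge model. Everything here is illustrative: the data, the grid, and the helper names are hypothetical, and folds are assigned deterministically by row index, whereas Spark assigns rows to folds randomly.

```python
from math import sqrt


def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs; row i belongs to fold i % k."""
    for fold in range(k):
        valid = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, valid


def fit_ridge_slope(xs, ys, reg):
    """Closed-form slope of y = w * x with an L2 penalty of strength reg."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + reg)


def rmse(labels, predictions):
    return sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels))


def cross_validate(xs, ys, grid, k):
    avg_scores = {}
    for reg in grid:                                   # step 1: each grid entry
        fold_scores = []
        for train, valid in k_fold_indices(len(xs), k):  # step 2: k fits per entry
            w = fit_ridge_slope([xs[i] for i in train], [ys[i] for i in train], reg)
            fold_scores.append(rmse([ys[i] for i in valid], [w * xs[i] for i in valid]))
        avg_scores[reg] = sum(fold_scores) / k
    best = min(avg_scores, key=avg_scores.get)         # step 3: lowest average RMSE
    return best, avg_scores, fit_ridge_slope(xs, ys, best)  # refit on all rows


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]  # noise-free toy data, so no regularization is needed
best_reg, avg_scores, final_slope = cross_validate(xs, ys, grid=[0.0, 1.0, 10.0], k=3)
print(best_reg, final_slope)
```

Because the toy data has no noise, the unregularized fit wins. On real data the penalty often helps, which is exactly what the search is meant to discover.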

In [0]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# We reuse the preprocessing stages but create a fresh LinearRegression,
# so the tuning grid refers to its own estimator instance
lr_tune = LinearRegression(labelCol="salary", featuresCol="features")

pipeline_tune = Pipeline(stages=[imputer, indexer, encoder, assembler, scaler, lr_tune])

# Define the Parameter Grid
# We will try different values of regParam (overall regularization strength)
# and elasticNetParam (mix: 0.0 = pure L2/ridge, 1.0 = pure L1/lasso)
paramGrid = ParamGridBuilder() \
    .addGrid(lr_tune.regParam, [0.01, 0.1, 0.5]) \
    .addGrid(lr_tune.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

print(f"Number of hyperparameter combinations to test: {len(paramGrid)}")
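
As a reference for what the two tuned parameters mean, the sketch below computes the elastic net penalty in plain Python, following the form given in the Spark MLlib documentation: `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`. The function name and the example weights are hypothetical.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss for a given weight vector."""
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1_norm
        + (1 - elastic_net_param) / 2 * l2_norm_squared
    )


weights = [3.0, -4.0]  # hypothetical coefficients

ridge_penalty = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0)
mixed_penalty = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.5)
lasso_penalty = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0)

print(ridge_penalty, mixed_penalty, lasso_penalty)
```

`regParam` scales the whole penalty, while `elasticNetParam` only decides how it is split between the L1 and L2 terms.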

In [0]:
# Create CrossValidator
# numFolds=3 means we do 3-fold cross-validation for each param combo
# Fits during the search = len(paramGrid) * numFolds = 9 * 3 = 27,
# plus 1 final refit of the best combo on all of train_df

crossval = CrossValidator(
    estimator=pipeline_tune,
    estimatorParamMaps=paramGrid,
    evaluator=RegressionEvaluator(labelCol="salary", metricName="rmse"),
    numFolds=3,
    parallelism=4  # Train 4 models in parallel (faster on clusters)
)

print("CrossValidator configured. Training will evaluate 27 models...")

# Fit CrossValidator (This takes longer!)
cv_model = crossval.fit(train_df)

# Get best model
best_model = cv_model.bestModel
print("Best model found!")

In [0]:
print(best_model)

In [0]:
# Evaluate the best model on TEST set
predictions_cv = best_model.transform(test_df)
rmse_cv = evaluator.evaluate(predictions_cv)

# Extract the best hyperparameters from the LinearRegression stage
# The last stage in the pipeline is the LinearRegressionModel
lr_model_stage = best_model.stages[-1]

# Compute R2 on test set
evaluator_r2 = RegressionEvaluator(labelCol="salary", predictionCol="prediction", metricName="r2")
r2_cv = evaluator_r2.evaluate(predictions_cv)

print(f"Best Model RMSE: {rmse_cv}")
print(f"Best Model R2: {r2_cv}")
print(f"Best regParam: {lr_model_stage.getRegParam()}")
print(f"Best elasticNetParam: {lr_model_stage.getElasticNetParam()}")

These lines print the evaluation metrics and hyperparameters of the best model found by cross-validation:

- **Best Model RMSE: `{rmse_cv}`**  
  Shows the Root Mean Squared Error (RMSE) of the best model on the test set. Lower RMSE means better predictive accuracy.

- **Best Model R2: `{r2_cv}`**  
  Shows the R-squared (coefficient of determination) of the best model on the test set. Higher R2 (closer to 1) means the model explains more variance in the target variable.

- **Best regParam: `{lr_model_stage.getRegParam()}`**  
  Displays the value of the regularization parameter (regParam) chosen by cross-validation. This controls the amount of regularization applied to the model to prevent overfitting.

- **Best elasticNetParam: `{lr_model_stage.getElasticNetParam()}`**  
  Displays the value of the elasticNet mixing parameter (elasticNetParam) chosen by cross-validation. This controls the mix between L1 (lasso) and L2 (ridge) regularization.
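
For readers who want to see what the two metrics measure, here is a small plain-Python computation of RMSE and R2 on hypothetical values. It uses the standard definitions (R2 as one minus the ratio of residual to total sum of squares); the function names are illustrative.

```python
from math import sqrt


def rmse(labels, predictions):
    """Root of the mean squared difference between labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return sqrt(sum(squared_errors) / len(labels))


def r2(labels, predictions):
    """Share of label variance explained by the predictions."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


labels = [1.0, 2.0, 3.0]     # hypothetical true values
perfect = [1.0, 2.0, 3.0]    # predictions that match exactly
mean_only = [2.0, 2.0, 2.0]  # always predicting the label mean

print(rmse(labels, perfect), r2(labels, perfect))
print(rmse(labels, mean_only), r2(labels, mean_only))
```

A model that always predicts the mean scores an R2 of 0, which makes it a useful baseline: any R2 below that means the model is worse than a constant.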

## Best Practices

### Pipeline Strategy Guide:

| Component | Best Practice | Why |
|-----------|--------------|-----|
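| **Order of stages** | Impute → Encode → Assemble → Scale → Model | Each stage consumes columns produced by earlier stages |
| **handleInvalid** | Use "keep" for StringIndexer | Unseen categories get an extra index instead of raising an error |
| **Scaler choice** | RobustScaler when outliers are present | Scales by quantile range, so extreme values have less influence |
| **CrossValidator folds** | 3-5 for large data, 5-10 for small | Balance estimate reliability against training cost |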
| **MLflow logging** | Log params, metrics, AND model | Full reproducibility |
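
The `handleInvalid="keep"` recommendation can be illustrated with a plain-Python sketch of the indexing logic. The helper names and country codes are hypothetical. The ordering mimics StringIndexer's default of most frequent label first, and Spark returns the indices as doubles rather than integers.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Assign indices by descending frequency (ties broken alphabetically)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def transform_keep(mapping, labels):
    """Map labels to indices; unseen labels share one extra index."""
    unknown_index = len(mapping)
    return [mapping.get(label, unknown_index) for label in labels]


train_countries = ["PL", "PL", "DE", "US", "DE", "PL"]  # hypothetical training column
mapping = fit_string_indexer(train_countries)

# "FR" never appeared in training, so it lands in the extra bucket
indexed = transform_keep(mapping, ["DE", "FR", "PL"])
print(mapping, indexed)
```

With the default `handleInvalid="error"`, the unseen label would stop the job instead, which is rarely what you want at inference time.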

### Common Mistakes to Avoid:

1. **Fitting the pipeline on all data (train + test)** → Data leakage
2. **Too many CV folds** → Much longer training for little extra reliability
3. **Not logging to MLflow** → Lost experiments
4. **Huge param grids** → Combinatorial explosion
5. **Not saving the best model** → Can't reproduce

### Pro Tips:

- Use `parallelism` parameter in CrossValidator for faster training
- Start with small param grid, expand based on results
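- Report final performance on the held-out TEST set, not on the cross-validation folds used for tuning
- Use MLflow Model Registry for production deployment
- Save both the unfitted `Pipeline` (the recipe) AND the fitted `PipelineModel` (the trained result)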

## Summary

### What we achieved:

- **Pipeline Definition**: Created end-to-end workflow (Impute → Encode → Assemble → Scale → Model)
- **MLflow Tracking**: Logged parameters, metrics, and model artifacts
- **Unity Catalog Models**: Registered model with governance and versioning
- **CrossValidator**: Automated hyperparameter search with 3-fold CV

### Key Takeaways:

| # | Principle |
|---|-----------|
| 1 | **Pipelines prevent data leakage** - fit on train, transform on test |
| 2 | **MLflow is essential** - track all experiments |
| 3 | **Unity Catalog Models** - governance, versioning, lineage |
| 4 | **CrossValidator automates tuning** - finds best hyperparameters |
| 5 | **Evaluate on TEST only once** - final unbiased estimate |

### MLflow and Unity Catalog Artifacts Created:

| Artifact | Location | Purpose |
|----------|----------|---------|
| Experiment | `/Users/{user}/dp4ml_pipeline_demo` | Group runs |
| Model | `{catalog}.{schema}.salary_prediction_model` | Production deployment |
| Versions | v1, v2, ... | Track model iterations |

### Next Steps:

**Next Module:** Module 7 - Feature Store & MLflow (production ML)

## Cleanup

Optionally remove demo artifacts created during exercises:

In [0]:
# Cleanup - remove demo artifacts created in this notebook

# Uncomment the lines below to remove demo artifacts:

# from mlflow import MlflowClient
# MlflowClient(registry_uri="databricks-uc").delete_registered_model(name=model_name)
# mlflow.delete_experiment(mlflow.get_experiment_by_name(experiment_path).experiment_id)

# print("All demo artifacts removed")

print("Cleanup disabled (uncomment code to remove demo artifacts)")