# Workshop 3: ML Pipeline with MLflow

## Objective

Build a complete ML pipeline for customer segmentation with Spark ML, track experiments with MLflow, and prepare the resulting model for registration in Unity Catalog, which provides governance and versioning.

## Context and Requirements

- **Workshop:** Customer Segmentation for RetailMax
- **Notebook type:** Hands-on Exercise
- **Prerequisites:** `02_Workshop_Data_Cleaning_and_Features.ipynb` completed
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
  - MLflow enabled (default in Databricks)
- **Execution time:** ~30 minutes

---

## Theoretical Background

**Why use Pipelines?**

| Benefit | Description |
|---------|-------------|
| **Data Leakage Prevention** | Stages are fitted with `fit()` on training data only; `transform()` then applies the learned statistics to test data |
| **Reproducibility** | Single artifact contains all preprocessing + model |
| **Simplicity** | Save/load entire workflow as one object |
| **Consistency** | Same transformations in training and production |

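The leakage-prevention row can be illustrated without Spark. The sketch below is a plain-Python stand-in for what a scaling stage does inside a pipeline; the function names and toy numbers are invented for illustration, and Spark's `StandardScaler` differs in its defaults (it scales by standard deviation without centering).

```python
from statistics import mean, stdev

def fit_scaler(values):
    """Learn scaling statistics from the given values only (the 'fit' step)."""
    return {"mean": mean(values), "std": stdev(values)}

def transform(values, stats):
    """Apply previously learned statistics (the 'transform' step)."""
    return [(v - stats["mean"]) / stats["std"] for v in values]

# Toy spend values; the test set contains an outlier the model must not "see".
train = [10.0, 12.0, 14.0]
test = [13.0, 100.0]

# Correct: statistics come from the training data alone.
train_stats = fit_scaler(train)
test_scaled = transform(test, train_stats)

# Leaky: statistics are computed over train + test, so information from the
# test set changes how the training data would be scaled.
leaky_stats = fit_scaler(train + test)

print("train-only stats:", train_stats)
print("leaky stats:     ", leaky_stats)
```
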
**Pipeline Components:**

```
Data -> [Indexer] -> [Imputer] -> [Assembler] -> [Scaler] -> [Model] -> Predictions
```

**Unity Catalog Models (recommended):**

| Feature | Description |
|---------|-------------|
| **Model Registry** | `catalog.schema.model_name` format |
| **Versioning** | Automatic version tracking (v1, v2, ...) |
| **Aliases** | `@champion`, `@challenger` for deployment |
| **Governance** | Unity Catalog permissions apply |

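As a rough sketch of how the naming and alias conventions in the table fit together: the helpers below build the three-level name and an alias URI, and `register_and_promote` shows one possible registration flow. The catalog and schema names (`main`, `retailmax`) are hypothetical, the artifact path `best_model` is taken from the reference solution in this notebook, and the registration function is defined but not executed here. Unity Catalog generally expects models to be logged with a signature, which the workshop's `log_model` call does not set, so treat this as an outline rather than a drop-in step.

```python
def uc_model_name(catalog: str, schema: str, model: str) -> str:
    """Build the three-level Unity Catalog model name."""
    return f"{catalog}.{schema}.{model}"

def alias_uri(full_name: str, alias: str) -> str:
    """Build a models:/ URI that resolves a model version by alias."""
    return f"models:/{full_name}@{alias}"

def register_and_promote(run_id: str, full_name: str, alias: str = "champion"):
    """Register the model logged under a run and point an alias at it.

    Needs an MLflow-enabled workspace; defined here but not called.
    """
    import mlflow
    from mlflow import MlflowClient

    mlflow.set_registry_uri("databricks-uc")  # target Unity Catalog
    version = mlflow.register_model(f"runs:/{run_id}/best_model", full_name)
    MlflowClient().set_registered_model_alias(full_name, alias, version.version)
    return version

# Hypothetical catalog and schema names, for illustration only.
name = uc_model_name("main", "retailmax", "customer_segmentation")
print(name)
print(alias_uri(name, "champion"))
```
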
---

## Section 1: Load Feature Data

In [None]:
# Load customer features from previous workshop
df_features = spark.table("workshop_customer_features")
display(df_features)

## Section 2: Data Splitting

Split data into training (80%) and testing (20%) sets.

**Important:** `randomSplit` samples rows at random and does not stratify. For imbalanced datasets, use stratified sampling (for example `sampleBy()`) to preserve the class distribution in both sets.

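One way to approximate a stratified split is sketched below. It assumes `df` is a Spark DataFrame with a `customer_segment` column; the segment names in the final line are hypothetical. `sampleBy` draws each row independently, so the per-class proportions end up close to, not exactly, 80%. If the source table is not deterministic, caching `df` before splitting may be needed so that `exceptAll` sees the same sample.

```python
def train_fractions(classes, train_fraction=0.8):
    """Map every class label to the same sampling fraction."""
    return {label: train_fraction for label in classes}

def stratified_split(df, label_col="customer_segment", train_fraction=0.8, seed=42):
    """Approximate stratified split of a Spark DataFrame into (train, test)."""
    # Collect the distinct class labels; null labels cannot be used as strata.
    classes = [
        row[0]
        for row in df.select(label_col).distinct().collect()
        if row[0] is not None
    ]
    # Sample the same fraction from every class for the training set.
    train = df.sampleBy(
        label_col, fractions=train_fractions(classes, train_fraction), seed=seed
    )
    # Everything not drawn into the training sample becomes the test set.
    test = df.exceptAll(train)
    return train, test

# Hypothetical segment names, just to show the fractions dictionary.
print(train_fractions(["Premium", "Standard", "Basic"]))
```
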
In [None]:
# Exercise 1: Split data into training (80%) and testing (20%) sets
# Set seed=42 for reproducibility

train_df, test_df = None, None  # TODO: Split data using randomSplit
# print(f"Train: {train_df.count()}, Test: {test_df.count()}")

In [None]:
# Exercise 1b (Challenge): Stratified Split
# Check class distribution before and after split
# Use sampleBy() for stratified sampling if classes are imbalanced

# Check distribution
print("Full dataset distribution:")
display(df_features.groupBy("customer_segment").count())

# TODO: Verify that train and test have similar distributions
# print("Train distribution:")
# display(train_df.groupBy("customer_segment").count())

## Section 3: Define ML Pipeline

A Pipeline encapsulates a sequence of transformations into a single object. This is critical for reproducibility (same steps for training and inference).

**Pipeline stages:**
1. **Indexer:** Convert the text label (`customer_segment`) to a numeric `label` column
2. **Imputer:** Fill missing values in numeric features
3. **Assembler:** Combine features into a single vector
4. **Scaler:** Standardize features (`StandardScaler`) so that large-valued features do not dominate
5. **Model:** Classifier (e.g., Logistic Regression)

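Stage order matters because each stage reads columns that an earlier stage (or the source table) must already have produced. The plain-Python sketch below checks that wiring on paper before any Spark code runs. `check_wiring` is a hypothetical helper, not part of Spark ML, and the reduced column lists are for illustration only.

```python
def check_wiring(source_columns, stages):
    """Return the first (stage, missing columns) pair, or None if wiring is valid.

    stages is an ordered list of (name, input_columns, output_columns).
    """
    available = set(source_columns)
    for name, inputs, outputs in stages:
        missing = [c for c in inputs if c not in available]
        if missing:
            return name, missing
        available.update(outputs)
    return None

source = ["customer_segment", "total_spend", "order_count"]

indexer = ("indexer", ["customer_segment"], ["label"])
imputer = ("imputer", ["total_spend"], ["total_spend_imp"])
assembler = ("assembler", ["total_spend_imp", "order_count"], ["features_raw"])
scaler = ("scaler", ["features_raw"], ["features"])
model = ("model", ["features", "label"], ["prediction"])

# Valid order: every input exists by the time its stage runs.
print(check_wiring(source, [indexer, imputer, assembler, scaler, model]))

# Scaler placed before the assembler: 'features_raw' does not exist yet.
print(check_wiring(source, [indexer, imputer, scaler, assembler, model]))
```
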
In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler, StringIndexer, Imputer
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

# 1. Label Indexer (Target: customer_segment)
label_indexer = StringIndexer(inputCol="customer_segment", outputCol="label")

# Exercise 2a: Define Imputer for columns 'total_spend', 'recency', 'tenure'
imputer = None  # TODO: Create Imputer

# Exercise 2b: Define VectorAssembler using imputed columns plus 'order_count' and 'country_index'
# Its output column must be "features_raw", which the scaler below reads
assembler = None  # TODO: Create VectorAssembler

# 4. Scaler
scaler = StandardScaler(inputCol="features_raw", outputCol="features")

# Exercise 2c: Choose model (LogisticRegression or RandomForestClassifier)
lr = None  # TODO: Create classifier (labelCol="label", featuresCol="features")

In [None]:
# Exercise 3: Create Pipeline combining all stages

pipeline = None  # TODO: Create Pipeline with all stages

## Section 4: Training, Evaluation, and Tuning with MLflow

Track experiments and find optimal hyperparameters.

**Tasks:**
1. Start MLflow experiment run
2. Log model parameters
3. Calculate and log metrics (Accuracy, F1-Score)
4. (Challenge) Use CrossValidator for hyperparameter search

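To make the two metrics in task 3 concrete, the sketch below computes them by hand on toy labels. It assumes the `f1` metric of `MulticlassClassificationEvaluator` is the support-weighted average of per-class F1 scores; the labels are invented, and the example is chosen so that a model predicting only the majority class looks acceptable on accuracy but worse on F1.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true-label share."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

# Imbalanced toy data: 8 rows of class 0, 2 rows of class 1.
y_true = [0] * 8 + [1] * 2
# A model that always predicts the majority class.
y_pred = [0] * 10

print(f"accuracy:    {accuracy(y_true, y_pred):.4f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.4f}")
```
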
In [None]:
import mlflow
import mlflow.spark
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Setup Experiment
username = spark.sql("SELECT current_user()").collect()[0][0]
experiment_path = f"/Users/{username}/workshop_customer_segmentation"
mlflow.set_experiment(experiment_path)

# Exercise 4: Run MLflow experiment
# Inside 'with mlflow.start_run():' block:
# 1. Train pipeline on training set
# 2. Make predictions on test set
# 3. Calculate Accuracy and F1 Score
# 4. Log metrics and model to MLflow

# with mlflow.start_run(run_name="LR_Baseline"):
#     TODO: Implement training and logging
    
# Exercise 4b (Challenge): Implement CrossValidation (Grid Search)
# Create paramGrid for regParam (0.1, 0.01) and elasticNetParam (0.0, 0.5, 1.0)
# Use CrossValidator with 3 folds

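For the challenge, it can help to log which grid point won. The sketch below picks the best parameter map from the two lists a fitted `CrossValidatorModel` exposes (`getEstimatorParamMaps()` and `avgMetrics`). `best_params` is a hypothetical helper, `FakeParam` is a stand-in for Spark's `Param` objects so the example runs without a cluster, and the metric values are invented.

```python
from collections import namedtuple

def best_params(param_maps, avg_metrics, larger_is_better=True):
    """Pick the parameter map with the best average cross-validation metric.

    Returns ({param name: value}, metric).
    """
    pick = max if larger_is_better else min
    best_index = pick(range(len(avg_metrics)), key=avg_metrics.__getitem__)
    flat = {param.name: value for param, value in param_maps[best_index].items()}
    return flat, avg_metrics[best_index]

# Stand-in for pyspark.ml.param.Param; only the .name attribute is used.
FakeParam = namedtuple("FakeParam", "name")
reg, enet = FakeParam("regParam"), FakeParam("elasticNetParam")

maps = [{reg: 0.1, enet: 0.0}, {reg: 0.01, enet: 0.5}]
metrics = [0.81, 0.86]  # invented average accuracies, one per map

params, metric = best_params(maps, metrics)
print(params, metric)

# With a real model, inside an active MLflow run:
#   params, metric = best_params(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics)
#   mlflow.log_params(params)
```
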
---

# Solutions

Reference solutions for the exercises above.

In [None]:
# 1. Split
train_df, test_df = df_features.randomSplit([0.8, 0.2], seed=42)
print(f"Train: {train_df.count()}, Test: {test_df.count()}")

# 1b. Verify distribution (stratification check)
print("Train distribution:")
display(train_df.groupBy("customer_segment").count())
print("Test distribution:")
display(test_df.groupBy("customer_segment").count())

# 2. Pipeline Definition
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler, StringIndexer, Imputer
from pyspark.ml.classification import LogisticRegression

label_indexer = StringIndexer(inputCol="customer_segment", outputCol="label")
imputer = Imputer(inputCols=["total_spend", "recency", "tenure"], outputCols=["total_spend_imp", "recency_imp", "tenure_imp"])
assembler = VectorAssembler(inputCols=["total_spend_imp", "recency_imp", "tenure_imp", "order_count", "country_index"], outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=[label_indexer, imputer, assembler, scaler, lr])

# 3. MLflow & CrossValidation
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
import mlflow

with mlflow.start_run(run_name="LR_GridSearch"):
    # Grid
    paramGrid = ParamGridBuilder() \
        .addGrid(lr.regParam, [0.1, 0.01]) \
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
        .build()
    
    # CV: 3-fold cross-validation over the grid
    crossval = CrossValidator(estimator=pipeline,
                              estimatorParamMaps=paramGrid,
                              evaluator=MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy"),
                              numFolds=3,
                              seed=42)  # fixed seed so fold assignment is reproducible
    
    # Fit
    cvModel = crossval.fit(train_df)
    
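    # Best model: evaluate on the held-out test set
    best_model = cvModel.bestModel
    predictions = best_model.transform(test_df)
    accuracy = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy").evaluate(predictions)
    f1 = MulticlassClassificationEvaluator(labelCol="label", metricName="f1").evaluate(predictions)

    # Log the winning hyperparameters (the last pipeline stage is the fitted classifier)
    best_lr = best_model.stages[-1]
    mlflow.log_param("best_lr_regParam", best_lr.getRegParam())
    mlflow.log_param("best_lr_elasticNetParam", best_lr.getElasticNetParam())

    # Log cross-validation and test metrics under distinct names
    mlflow.log_metric("cv_accuracy", max(cvModel.avgMetrics))
    mlflow.log_metric("test_accuracy", accuracy)
    mlflow.log_metric("test_f1", f1)
    mlflow.spark.log_model(best_model, "best_model")
    print(f"Test accuracy: {accuracy:.4f}, test F1: {f1:.4f}")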

## Summary

Workshop complete. The following has been achieved:

| Component | Description |
|-----------|-------------|
| Data Splitting | 80/20 train/test split with seed for reproducibility |
| Pipeline | Indexer -> Imputer -> Assembler -> Scaler -> Classifier |
| MLflow Tracking | Experiment logged with params and metrics |
| Hyperparameter Tuning | CrossValidator with grid search |

---

## Best Practices: ML Pipelines

| Practice | Description |
|----------|-------------|
| **Use Pipelines** | Prevents data leakage, ensures reproducibility |
| **Set random seed** | `seed=42` for reproducible splits |
| **Stratify if imbalanced** | Preserve class distribution in splits |
| **Log everything** | Params, metrics, and model artifacts to MLflow |
| **Evaluate on holdout** | Report the final metric from the held-out test set only, never from the validation folds used for tuning |
| **Save fitted pipeline** | Not just model - entire transformation chain |

## Common Mistakes to Avoid

| Mistake | Consequence |
|---------|-------------|
| Fitting scaler on all data | Data leakage - inflated metrics |
| Using test set for tuning | Overfitting to test distribution |
| No random seed | Non-reproducible results |
| Large param grid | Combinatorial explosion, slow training |
| Not logging to MLflow | Lost experiments, no comparison |

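The "large param grid" row can be quantified with a quick count. The sketch below assumes k-fold grid search trains every combination once per fold and then refits the best combination once on the full training set. `count_fits` is a hypothetical helper, and the `maxIter` values in the larger grid are invented.

```python
from itertools import product

def count_fits(grid, num_folds):
    """Number of model fits a k-fold grid search performs."""
    combinations = list(product(*grid.values()))
    # One fit per combination per fold, plus one final refit of the winner.
    return len(combinations) * num_folds + 1

# The grid from the challenge exercise: 2 * 3 = 6 combinations.
workshop_grid = {"regParam": [0.1, 0.01], "elasticNetParam": [0.0, 0.5, 1.0]}
print(count_fits(workshop_grid, num_folds=3))  # 6 * 3 + 1 = 19

# Adding one more three-valued parameter triples the work.
larger_grid = {**workshop_grid, "maxIter": [10, 50, 100]}
print(count_fits(larger_grid, num_folds=3))  # 18 * 3 + 1 = 55
```
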
---

**Next Steps:**
- Register the best model in Unity Catalog (`catalog.schema.model_name`)
- Deploy model for batch or real-time inference
- Set up monitoring for model performance