# Workshop 3: ML Pipeline with MLflow

## Business Context: From Analysis to Prediction

In Workshops 1 and 2, we cleaned data and built the Customer 360 table with RFM features. Now it's time to **teach a computer to predict customer segments**.

**The business goal:**
- Currently: Manual segment assignment by analysts (slow, inconsistent)
- Target: Automatic classification for new customers in real time

**How it will work in production:**

```
New Customer Signs Up → Customer 360 Calculated → ML Model → Segment Prediction
                                                              ↓
                                                    Marketing Automation
                                                    • Premium → VIP offer
                                                    • Standard → Loyalty program
                                                    • Basic → Welcome discount
```

---

## What the ML Model Will Learn

We're training a **classification model** that learns patterns from historical data:

| Customer Features | Actual Segment | Model Learns |
|-------------------|----------------|--------------|
| Spend: $5000, Orders: 50, Recency: 3 days | Premium | High spend + frequent + recent = Premium |
| Spend: $500, Orders: 5, Recency: 30 days | Standard | Medium activity = Standard |
| Spend: $50, Orders: 1, Recency: 180 days | Basic | Low engagement = Basic |

**Once trained, the model can predict:**
- "This new customer spent $3000 in the first week → likely Premium (87% confidence)"
- Marketing can treat them as Premium from day 1, not after 6 months

---

## Why Pipelines? Preventing Expensive Mistakes

**The problem:** In production, you need the EXACT same data transformations as in training.

| Training | Production | Result |
|----------|------------|--------|
| Scale features 0-1 | Forget to scale | Garbage predictions |
| Impute missing with mean=500 | Use mean=600 | Inconsistent results |
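
The second row can be reproduced in plain Python. This is a minimal sketch, assuming a made-up `total_spend` column with one missing value; the helpers `fit_mean` and `impute` are illustrative names, not Spark APIs. The fill value is learned once from training data and stored, and the mismatch appears as soon as production recomputes it from its own data.

```python
from statistics import mean

def fit_mean(values):
    """Learn the fill value from the non-missing values."""
    return mean(v for v in values if v is not None)

def impute(values, fill):
    """Replace missing entries with a previously learned fill value."""
    return [fill if v is None else v for v in values]

train_spend = [100.0, 500.0, None, 900.0]   # hypothetical training column
prod_spend = [600.0, None]                  # hypothetical production batch

stored_fill = fit_mean(train_spend)                 # learned once, at training time
consistent = impute(prod_spend, stored_fill)        # production reuses the stored value
skewed = impute(prod_spend, fit_mean(prod_spend))   # production recomputes its own mean

print(stored_fill, consistent, skewed)
```

In the skewed case the same missing value is filled with 600 instead of 500, so the model sees inputs it was never trained on.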

**The solution:** Spark ML Pipeline bundles ALL steps into one object:

```
Pipeline = [Imputer → Assembler → Scaler → Model]
           ↓
        model = pipeline.fit(training_data)
           ↓
        model.save("production_model")
           ↓
        PipelineModel.load("production_model") → model.transform(new_data) → Predictions
```

---

## Context and Requirements

- **Workshop:** Customer Segmentation for RetailMax
- **Notebook type:** Hands-on Exercise
- **Prerequisites:** `02_Workshop_Data_Cleaning_and_Features.ipynb` completed
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
  - MLflow enabled (default in Databricks)
- **Execution time:** ~30 minutes

---

## Theoretical Background

**Pipeline Components:**

| Stage | Purpose | Business Value |
|-------|---------|----------------|
| **Imputer** | Fill missing values | Handle incomplete customer data |
| **Assembler** | Combine features into vector | Required format for ML |
| **Scaler** | Standardize features to a common scale | Fair comparison of features |
| **Model** | Learn patterns | Make predictions |

**Unity Catalog Models (recommended):**

| Feature | Business Value |
|---------|---------------|
| **Model Registry** | Central catalog of all models |
| **Versioning** | Track which version is in production |
| **Aliases** | `@champion` = current production model |
| **Governance** | Who can deploy? Who can access? |

---

In [0]:
%run ../demo/00_Setup

## Section 1: Load Feature Data

**Current state:** Customer 360 table with RFM features from Workshop 2

**What we have per customer:**
- `total_spend`, `order_count`, `recency`, `tenure` (numeric features)
- `country_index` (encoded categorical)
- `customer_segment` (our target label: Basic/Standard/Premium)

In [0]:
# Environment setup (same as Workshop Setup)
current_user_email = spark.sql("SELECT current_user()").collect()[0][0]
username = current_user_email.split("@")[0].replace(".", "_").replace("-", "_")

if "trainer" in username or "krzysztof_burejza" in username:
    effective_user = "trainer"
else:
    effective_user = username

catalog_name = "data_ml_preparation"
schema_name = f"ml_dp_{effective_user}"

spark.sql(f"USE CATALOG {catalog_name}")
spark.sql(f"USE SCHEMA {schema_name}")

print(f"Using: {catalog_name}.{schema_name}")

In [0]:
# Load customer features from previous workshop
df_features = spark.table("workshop_customer_features")
display(df_features)

## Section 2: Data Splitting

**Business context:** We need to simulate production conditions. The model must predict customers it has never seen.

**How it works:**
1. **Training set (80%):** Model learns patterns from these customers
2. **Test set (20%):** We pretend these are "new" customers and measure accuracy

**Why 80/20?**
- More training data → better learning
- Enough test data → reliable accuracy estimate
- Industry standard balance

**Warning: Class Imbalance**
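If 70% of customers are Basic, a lazy model could predict "Basic" always and get 70% accuracy. Stratified sampling ensures both train and test have the same class proportions. Note that `randomSplit` samples rows without regard to class, so on small or imbalanced datasets the proportions can drift; check them after splitting.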

In [0]:
# Exercise 1: Split data into training (80%) and testing (20%) sets
# Set seed=42 for reproducibility

train_df, test_df = # TODO: Split data using randomSplit
# print(f"Train: {train_df.count()}, Test: {test_df.count()}")

In [0]:
# Exercise 1b (Challenge): Stratified Split
# Check class distribution before and after split
# Use sampleBy() for stratified sampling if classes are imbalanced

# Check distribution
print("Full dataset distribution:")
display(df_features.groupBy("customer_segment").count())

# TODO: Verify that train and test have similar distributions
# print("Train distribution:")
# display(train_df.groupBy("customer_segment").count())

## Section 3: Define ML Pipeline

**Business context:** This is where the magic happens. We define the "recipe" that transforms raw features into predictions.

**Think of it as a production line:**

```
Customer Data → [Clean] → [Assemble] → [Scale] → [Classify] → Segment Prediction
```

**Pipeline stages explained:**

| Stage | What it does | Example |
|-------|--------------|---------|
| **Imputer** | Fill missing values with mean | NULL → 500.0 |
| **Assembler** | Combine columns into one vector | [5000, 50, 3, 365] |
| **Scaler** | Rescale to unit standard deviation | 5000 → 2.5 (if std = 2000) |
| **Indexer** | Convert text labels to numeric indices (most frequent label gets 0) | "Premium" → 2.0 |
| **Model** | Learn patterns, make predictions | Features → Segment |

**Why Pipeline, not separate steps?**
- Guarantees same transformations in training and production
- One object to save, load, deploy
- Prevents "I forgot to scale" errors
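
The scaler used in this workshop is `StandardScaler`. With its default settings (`withStd=True`, `withMean=False`) it divides each feature by its sample standard deviation and does not center or squeeze values into a 0-1 range. A plain-Python sketch of that arithmetic, using made-up spend values and a hypothetical helper `standard_scale`:

```python
from statistics import stdev

def standard_scale(values):
    """Divide by the sample standard deviation, without mean-centering."""
    sd = stdev(values)
    return [v / sd for v in values]

total_spend = [1000.0, 3000.0, 5000.0]   # made-up values; sample std is 2000
scaled = standard_scale(total_spend)
print(scaled)  # [0.5, 1.5, 2.5]
```

If a strict 0-1 range is needed, Spark's `MinMaxScaler` is the stage that provides it.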

In [0]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler, StringIndexer, Imputer
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

# 1. Label Indexer (Target: customer_segment)
label_indexer = StringIndexer(inputCol="customer_segment", outputCol="label")

# Exercise 2a: Define Imputer for columns 'total_spend', 'recency', 'tenure'
imputer = # TODO: Create Imputer

# Exercise 2b: Define VectorAssembler using imputed columns plus 'order_count' and 'country_index'
assembler = # TODO: Create VectorAssembler

# Scaler (provided): reads the assembler output column 'features_raw'
scaler = StandardScaler(inputCol="features_raw", outputCol="features")

# Exercise 2c: Choose model (LogisticRegression or RandomForestClassifier)
lr = # TODO: Create classifier

In [0]:
# Exercise 3: Create Pipeline combining all stages

pipeline = # TODO: Create Pipeline with all stages

## Section 4: Training, Evaluation, and Tuning with MLflow

**Business context:** Training one model is not enough. We need to:
1. **Track experiments:** What parameters did we try? What worked?
2. **Compare versions:** Is the new model better than the old one?
3. **Deploy safely:** Register the best model for production use

**MLflow answers these questions:**

| Feature | Business Value |
|---------|---------------|
| **Experiment Tracking** | "The model with regParam=0.01 had 85% accuracy" |
| **Model Registry** | "Version 3 is in production, Version 4 is testing" |
| **Artifacts** | "Here's the exact model used to make this prediction" |

**Hyperparameter Tuning:**
Models have "settings" (hyperparameters) that affect performance. CrossValidator automatically tries different combinations and picks the best one.

```
regParam: [0.1, 0.01] × elasticNetParam: [0.0, 0.5, 1.0] = 6 combinations
→ CrossValidator tests all 6 → Returns best performing model
```
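
The grid size, and the training cost it implies, can be checked with a few lines of plain Python. This sketch assumes the 3 folds suggested in the challenge exercise. `CrossValidator` fits every combination once per fold and then refits the best combination on the full training set.

```python
from itertools import product

reg_params = [0.1, 0.01]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 3

# Every pairing of the two hyperparameters
grid = [
    {"regParam": reg, "elasticNetParam": enet}
    for reg, enet in product(reg_params, elastic_net_params)
]

cv_fits = len(grid) * num_folds   # one fit per combination per fold
total_fits = cv_fits + 1          # plus the final refit of the best combination

print(len(grid), cv_fits, total_fits)  # 6 18 19
```

Adding one more value to either list grows the cost multiplicatively, which is why grids are usually kept small.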

---

### Unity Catalog Requires Model Signature

Unity Catalog models **MUST** include a **signature** (input/output schema). Without it, you get:

```
MlflowException: Model passed for registration did not contain any signature metadata.
```

**Solution:** Use `infer_signature()` to automatically detect schema:

```python
import mlflow
from mlflow.models.signature import infer_signature

# input_columns: list of the column names the model expects as input
signature = infer_signature(
    train_df.select(input_columns).toPandas(),  # Input schema
    predictions.select("prediction").toPandas()  # Output schema
)

mlflow.spark.log_model(model, "model", signature=signature, registered_model_name=...)
```

In [0]:
import mlflow
import mlflow.spark
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from mlflow.models.signature import infer_signature

# Setup MLflow with Unity Catalog
mlflow.set_registry_uri("databricks-uc")

# Setup Experiment
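# Use the full e-mail address here; a separate name avoids overwriting the sanitized 'username'
user_email = spark.sql("SELECT current_user()").collect()[0][0]
experiment_path = f"/Users/{user_email}/workshop_customer_segmentation"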
mlflow.set_experiment(experiment_path)

# Model name for Unity Catalog (uses catalog and schema from Workshop Setup)
# These variables should be defined if you ran 00_Workshop_Setup.ipynb
model_name = f"{catalog_name}.{schema_name}.customer_segmentation_model"

# Exercise 4: Run MLflow experiment
# Inside 'with mlflow.start_run():' block:
# 1. Train pipeline on training set
# 2. Make predictions on test set
# 3. Calculate Accuracy and F1 Score
# 4. Log metrics and register model to Unity Catalog
#
# IMPORTANT: Unity Catalog requires model SIGNATURE (input/output schema)
# Use infer_signature() to automatically detect schema from data

# with mlflow.start_run(run_name="LR_Baseline"):
#     # Train
#     model = pipeline.fit(train_df)
#     
#     # Predict
#     predictions = model.transform(test_df)
#     
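#     # Evaluate
#     accuracy = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy").evaluate(predictions)
#     f1 = MulticlassClassificationEvaluator(labelCol="label", metricName="f1").evaluate(predictions)
#     mlflow.log_metric("accuracy", accuracy)
#     mlflow.log_metric("f1", f1)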
#     
#     # Infer signature for Unity Catalog (REQUIRED!)
#     input_cols = ["total_spend", "recency", "tenure", "order_count", "country_index", "customer_segment"]
#     signature = infer_signature(train_df.select(input_cols).toPandas(), predictions.select("prediction").toPandas())
#     
#     # Register model to Unity Catalog with signature
#     mlflow.spark.log_model(model, "model", signature=signature, registered_model_name=model_name)
    
# Exercise 4 (Challenge): Implement CrossValidation (Grid Search)
# Create paramGrid for regParam (0.1, 0.01) and elasticNetParam (0.0, 0.5, 1.0)
# Use CrossValidator with 3 folds

---

# Solutions

Reference solutions for the exercises above.

In [0]:
# Environment setup (same as Workshop Setup)
current_user_email = spark.sql("SELECT current_user()").collect()[0][0]
username = current_user_email.split("@")[0].replace(".", "_").replace("-", "_")

if "trainer" in username or "krzysztof_burejza" in username:
    effective_user = "trainer"
else:
    effective_user = username

catalog_name = "data_ml_preparation"
schema_name = f"ml_dp_{effective_user}"

spark.sql(f"USE CATALOG {catalog_name}")
spark.sql(f"USE SCHEMA {schema_name}")

print(f"Using: {catalog_name}.{schema_name}")

In [0]:
# Load customer features from previous workshop
df_features = spark.table("workshop_customer_features")
display(df_features)

In [0]:
# 1. Split
train_df, test_df = df_features.randomSplit([0.8, 0.2], seed=42)
print(f"Train: {train_df.count()}, Test: {test_df.count()}")

# 1b. Verify distribution (stratification check)
print("Train distribution:")
display(train_df.groupBy("customer_segment").count())
print("Test distribution:")
display(test_df.groupBy("customer_segment").count())

# 2. Pipeline Definition
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler, StringIndexer, Imputer
from pyspark.ml.classification import LogisticRegression

label_indexer = StringIndexer(inputCol="customer_segment", outputCol="label")

imputer = Imputer(inputCols=["total_spend", "recency", "tenure"], outputCols=["total_spend_imp", "recency_imp", "tenure_imp"])

assembler = VectorAssembler(inputCols=["total_spend_imp", "recency_imp", "tenure_imp", "order_count", "country_index"], outputCol="features_raw", handleInvalid='keep')

scaler = StandardScaler(inputCol="features_raw", outputCol="features")

lr = LogisticRegression(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=[label_indexer, imputer, assembler, scaler, lr])

# 3. MLflow with Unity Catalog & CrossValidation
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from mlflow.models.signature import infer_signature
import mlflow

# Setup Unity Catalog for model registry
mlflow.set_registry_uri("databricks-uc")
model_name = f"{catalog_name}.{schema_name}.customer_segmentation_model"

with mlflow.start_run(run_name="LR_GridSearch"):
    # Grid
    paramGrid = ParamGridBuilder() \
        .addGrid(lr.regParam, [0.1, 0.01]) \
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
        .build()
    
    # CV
    crossval = CrossValidator(estimator=pipeline,
                              estimatorParamMaps=paramGrid,
                              evaluator=MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy"),
                              numFolds=3)
    
    # Fit
    cvModel = crossval.fit(train_df)
    
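    # Cross-validated accuracy (mean over folds) and parameters of the best grid point
    best_idx = max(range(len(cvModel.avgMetrics)), key=lambda i: cvModel.avgMetrics[i])
    mlflow.log_params({f"best_{param.name}": value for param, value in paramGrid[best_idx].items()})
    mlflow.log_metric("accuracy_cv", cvModel.avgMetrics[best_idx])
    
    # Holdout metrics: evaluate the best model on the unseen test set
    best_model = cvModel.bestModel
    predictions = best_model.transform(test_df)
    accuracy = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy").evaluate(predictions)
    f1 = MulticlassClassificationEvaluator(labelCol="label", metricName="f1").evaluate(predictions)
    
    mlflow.log_metric("accuracy_test", accuracy)
    mlflow.log_metric("f1_test", f1)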
    
    # Infer model signature from input and output data
    # Unity Catalog requires signature for model registration
    input_example = train_df.select("total_spend", "recency", "tenure", "order_count", "country_index", "customer_segment").limit(5).toPandas()
    signature = infer_signature(
        train_df.select("total_spend", "recency", "tenure", "order_count", "country_index", "customer_segment").toPandas(),
        predictions.select("prediction").toPandas()
    )
    
    # Register model to Unity Catalog with signature
    mlflow.spark.log_model(
        best_model, 
        "best_model",
        signature=signature,
        input_example=input_example,
        registered_model_name=model_name
    )
    
    print(f"Test accuracy of best model: {accuracy}")
    print(f"Model registered to Unity Catalog: {model_name}")

## Summary: From Data to Business Value

### What We Built

```
Workshop 1          Workshop 2          Workshop 3
    ↓                   ↓                   ↓
Raw Transactions → Customer 360 → ML Model in Production
                       ↓                   ↓
                   RFM Features    Automatic Segment Prediction
```

### Business Impact

| Before (Manual) | After (ML Model) |
|-----------------|------------------|
| Analysts review customer data weekly | Real-time prediction for new customers |
| 2% campaign conversion rate | Targeted campaigns → 5%+ expected |
| 6 months to identify Premium customers | Day 1 segment assignment |
| Inconsistent segment definitions | Consistent, reproducible classification |

### What Marketing Can Now Do

1. **New customer signs up** → Model predicts "Premium" with 87% confidence
2. **Marketing automation** → Immediately sends VIP welcome package
3. **Result** → Customer feels valued, more likely to buy again

---

## Workshop Achievements

| Component | Description |
|-----------|-------------|
| Data Splitting | 80/20 train/test split with seed for reproducibility |
| Pipeline | Imputer → Assembler → Scaler → Classifier |
| MLflow Tracking | Experiment logged with params and metrics |
| Unity Catalog Model | Model registered with governance and versioning |
| Hyperparameter Tuning | CrossValidator with grid search |

### Unity Catalog Artifacts Created:

| Artifact | Location |
|----------|----------|
| Experiment | `/Users/{user}/workshop_customer_segmentation` |
| Model | `{catalog}.{schema}.customer_segmentation_model` |

---

## Best Practices: ML Pipelines

| Practice | Description |
|----------|-------------|
| **Use Pipelines** | Prevents data leakage, ensures reproducibility |
| **Set random seed** | `seed=42` for reproducible splits |
| **Stratify if imbalanced** | Preserve class distribution in splits |
| **Use Unity Catalog Models** | Governance, versioning, lineage tracking |
| **Log everything** | Params, metrics, and model artifacts to MLflow |
| **Evaluate on holdout** | Final metric only from test set (never validation) |

## Common Mistakes to Avoid

| Mistake | Consequence |
|---------|-------------|
| Fitting scaler on all data | Data leakage - inflated metrics |
| Using test set for tuning | Overfitting to test distribution |
| Saving model locally | No governance, hard to share |
| No random seed | Non-reproducible results |
| Not logging to MLflow | Lost experiments, no comparison |

---

## The Complete Journey

```
EDA (Workshop 1)
   "We have dirty data - 5% invalid transactions found"
           ↓
Data Cleaning (Workshop 2)
   "Silver layer ready - 10,000 transactions → 2,000 customers"
           ↓
Feature Engineering (Workshop 2)
   "Customer 360 with RFM features built"
           ↓
ML Pipeline (Workshop 3)
   "Model trained: 85% accuracy on segment prediction"
           ↓
Unity Catalog (Workshop 3)
   "Model registered, ready for production"
           ↓
Business Impact
   "Marketing can now personalize campaigns for 10,000+ customers"
```

---

**Next Steps (Production):**
- Set model alias `@champion` for production deployment
- Deploy model for batch or real-time inference
- Set up monitoring for model performance drift
- A/B test: ML segments vs manual segments
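
The first step above might look like the following sketch. The helper names `model_uri` and `promote_to_champion` and the example model name are hypothetical. `MlflowClient.set_registered_model_alias` requires a reasonably recent MLflow version and a workspace where you have permission to manage the model, so treat this as a starting point, not a tested deployment script.

```python
def model_uri(model_name: str, alias: str = "champion") -> str:
    """Build the MLflow URI that resolves a registered model by alias."""
    return f"models:/{model_name}@{alias}"

def promote_to_champion(model_name: str, version: int, alias: str = "champion") -> str:
    """Point the alias at a specific model version and return the alias URI."""
    # Imported here so the URI helper above can be used without MLflow installed
    import mlflow
    from mlflow import MlflowClient

    mlflow.set_registry_uri("databricks-uc")
    MlflowClient().set_registered_model_alias(model_name, alias, version)
    return model_uri(model_name, alias)

# Placeholder three-level name: catalog.schema.model
uri = model_uri("my_catalog.my_schema.customer_segmentation_model")
print(uri)
```

After an alias is set, downstream jobs can load the model with `mlflow.spark.load_model(uri)` and keep working unchanged when the alias is later moved to a newer version.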