# 🤖 ML Model Training Challenge

**Welcome to the Machine Learning Challenge!** Now that you've explored and prepared your data in **04.1**, it's time to **build, train, and optimize predictive models**.

---

## 🎯 Your Mission

Build a **predictive maintenance model** that can:
1. **Predict which sensor will fail** before it happens ❌
2. **Compare different ML algorithms** (Logistic Regression vs XGBoost)
3. **Optimize hyperparameters** to maximize accuracy
4. **Track experiments** using MLflow
5. **Register the best model** in Unity Catalog for production deployment

---

## 📊 The Challenge

**Scenario:** Your energy company loses **millions** every time a wind turbine breaks down unexpectedly. Your ML model must:
- **Classify sensor status:** `'ok'`, `'sensor_B'`, `'sensor_E'`, `'sensor_F'` (multi-class classification)
- **Achieve >85% accuracy** to be production-worthy
- **Be explainable** - which features matter most?

---

## 🏆 Challenge Levels

### **Level 1: Basic Model Training (30 min)**
- Load training data from `turbine_hourly_features`
- Train Logistic Regression model
- Log metrics to MLflow
- Achieve baseline accuracy

### **Level 2: Advanced Model Comparison (30 min)**
- Train XGBoost Classifier
- Compare multiple models
- Select best performing model

### **Level 3: Hyperparameter Optimization (45 min)**
- Use Optuna for automated hyperparameter tuning
- Run 10+ optimization trials
- Register best model to Unity Catalog with @prod alias

---

## 📝 What You'll Learn

✅ **Multi-class classification** - Predict multiple failure types  
✅ **MLflow experiment tracking** - Track every model, parameter, and metric  
✅ **Model comparison** - Logistic Regression vs XGBoost  
✅ **Hyperparameter tuning** - Optuna for automated optimization  
✅ **Unity Catalog model registry** - Production-grade model management  
✅ **Model signatures** - Ensure schema compatibility  

---

Let's start! 🚀

---

## 📦 Setup: Install Libraries

**What you need to do:**
Install required packages:
- `mlflow` - Experiment tracking & model registry
- `optuna` - Hyperparameter optimization
- `xgboost` - Gradient boosting library

**💡 Note:** After installation, the Python kernel restarts automatically.

In [None]:
# RUN THIS CELL - No changes needed
%pip install --quiet databricks-sdk==0.40.0 mlflow==2.22.0 optuna optuna-integration[mlflow] xgboost
dbutils.library.restartPython()

In [None]:
# RUN THIS CELL - No changes needed
%run ../_resources/00-setup $reset_all_data=false

---

## 📥 Task 1: Import Libraries & Configure MLflow

**What you need to do:**
1. Import necessary libraries for ML training
2. Configure MLflow to use Unity Catalog as model registry

**Libraries you'll need:**
- **sklearn:** `train_test_split`, `StandardScaler`, `LogisticRegression`
- **sklearn.metrics:** `accuracy_score`, `precision_score`, `recall_score`, `f1_score`
- **xgboost:** `XGBClassifier`
- **mlflow:** Model tracking and registry
- **optuna:** Hyperparameter optimization

**💡 Hint:** Use `mlflow.set_registry_uri('databricks-uc')` to enable Unity Catalog

In [None]:
# YOUR CODE HERE: Import libraries from sklearn
# Hint: from sklearn.model_selection import train_test_split
#       from sklearn.preprocessing import StandardScaler
#       from sklearn.linear_model import LogisticRegression
#       from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
# YOUR CODE HERE: Import XGBoost
# Hint: from xgboost import XGBClassifier

In [None]:
# YOUR CODE HERE: Import MLflow, Optuna, and other utilities
# Hint: import mlflow
#       from mlflow.models import infer_signature
#       from mlflow import MlflowClient
#       import optuna
#       from optuna.integration.mlflow import MLflowCallback
#       import numpy as np
#       import pandas as pd

In [None]:
# YOUR CODE HERE: Configure MLflow to use Unity Catalog
# Hint: mlflow.set_registry_uri('databricks-uc')

### ✅ Success Criteria:
- All libraries imported without errors
- MLflow configured to use Unity Catalog registry

---

## 📊 Task 2: Load & Prepare Training Data

**What you need to do:**
1. Load `turbine_hourly_features` table (created in 04.1)
2. Drop `turbine_id` column (not a predictive feature)
3. Separate features (X) and target (y)
4. Encode target labels as integers
5. Split into train/test sets (80/20 split)

**Feature columns (X):**
- `avg_energy`
- `std_sensor_A`, `std_sensor_B`, `std_sensor_C`, `std_sensor_D`, `std_sensor_E`, `std_sensor_F`

**Target column (y):**
- `abnormal_sensor` (multi-class: 'ok', 'sensor_B', 'sensor_E', 'sensor_F')

**💡 Hints:**
- Use `spark.table()` to load data
- Convert to pandas with `.toPandas()`
- Encode labels: `pd.factorize(y)` converts strings to integers
- Use `train_test_split()` with `test_size=0.2` and `random_state=42`
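
To make the label-encoding hint concrete, here is a small sketch on made-up labels (not the workshop data). It shows that `pd.factorize` returns both the integer codes and the unique labels, so keeping the second return value lets you translate predicted integers back to sensor names later. The sample values and the names `class_names` and `decoded` are illustrative assumptions.

```python
import pandas as pd

# Made-up labels standing in for the `abnormal_sensor` column (illustrative only)
y = pd.Series(["ok", "sensor_B", "ok", "sensor_E", "sensor_F", "sensor_B"])

# factorize returns (codes, uniques); codes follow the order of first appearance
y_encoded, class_names = pd.factorize(y)

# Index the uniques with the codes to recover the original strings
decoded = class_names[y_encoded]

print(y_encoded)
print(list(class_names))
print(list(decoded))
```

Because the codes depend on the order in which labels first appear, it is worth storing `class_names` alongside the model rather than assuming a fixed mapping.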

In [None]:
# YOUR CODE HERE: Define table name and load data
# Hint: features_table_name = f'{catalog}.{db}.turbine_hourly_features'
#       training_dataset = spark.table(features_table_name).drop('turbine_id')

In [None]:
# YOUR CODE HERE: Prepare features (X) and target (y)
# Hint: pdf = training_dataset.toPandas()   # convert once, then slice
#       X = pdf[['avg_energy', 'std_sensor_A', ...]]
#       y = pdf['abnormal_sensor']

In [None]:
# YOUR CODE HERE: Encode target labels
# Hint: y_encoded = pd.factorize(y)[0]
# This converts 'ok', 'sensor_B', etc. to 0, 1, 2, 3

In [None]:
# YOUR CODE HERE: Split into train and test sets
# Hint: X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

In [None]:
# YOUR CODE HERE: Print dataset shapes to verify
# How many samples in train vs test?
# Hint: print(f"Train size: {X_train.shape}, Test size: {X_test.shape}")

### ✅ Success Criteria:
- Training set: ~80% of data
- Test set: ~20% of data
- Features (X) contain 7 columns
- Target (y) encoded as integers (0, 1, 2, ...)
- No missing values in X_train or X_test
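
The criteria above can be checked programmatically. The sketch below runs the checks on random placeholder data that only borrows the feature column names from this notebook; the `stratify` argument is an optional addition (not required by the task) that keeps class proportions equal across both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

feature_cols = ["avg_energy"] + [f"std_sensor_{s}" for s in "ABCDEF"]

# Random placeholder data with the same column names as the feature table
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, len(feature_cols))), columns=feature_cols)
y_encoded = np.arange(100) % 4  # four balanced placeholder classes

# stratify keeps class proportions equal in both splits (optional addition)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)

print(f"Train size: {X_train.shape}, Test size: {X_test.shape}")
missing = int(X_train.isna().sum().sum()) + int(X_test.isna().sum().sum())
print(f"Missing values: {missing}")
print(f"Test class counts: {np.bincount(y_test)}")
```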

---

## 📝 Task 3: Create Model Signature

**What you need to do:**
1. Create MLflow model signature (defines input/output schema)
2. Create input example (for model testing)

**💡 Why?** Unity Catalog requires models to have signatures for schema validation and governance.

**Hints:**
- Use `mlflow.models.infer_signature(X_train, y_train)`
- Input example: `X_train.iloc[[0]]` (first row with column names)

In [None]:
# YOUR CODE HERE: Create model signature
# Hint: signature = infer_signature(X_train, y_train)

In [None]:
# YOUR CODE HERE: Display signature
# Verify it shows input features and output type

In [None]:
# YOUR CODE HERE: Create input example
# Hint: input_example = X_train.iloc[[0]]

In [None]:
# YOUR CODE HERE: Display input example
# Verify it shows all feature columns

### ✅ Success Criteria:
- Signature shows 7 input features (all sensors + avg_energy)
- Signature shows output type (integer for class)
- Input example contains 1 row with all features

---

# 🎯 LEVEL 1: Build Logistic Regression Model (30 min)

**Goal:** Train a baseline model and log it to MLflow

---

## 📊 Task 4: Train Logistic Regression

**What you need to do:**
1. Start an MLflow run (with run_name="Logistic Regression Run")
2. Define Logistic Regression model with `max_iter=200`
3. Fit model on training data
4. Predict on test data
5. Calculate metrics: accuracy, precision, recall, f1_score
6. Log parameters, metrics, and model to MLflow

**💡 Important:**
- Use `average="macro"` for multi-class metrics
- Log model with signature and input_example

**Hints:**
```python
with mlflow.start_run(run_name="..."):
    # 1. Define model
    # 2. Train model
    # 3. Make predictions
    # 4. Calculate metrics
    # 5. Log everything to MLflow
```
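
To show what `average="macro"` actually computes, here is a minimal sketch on synthetic four-class data (a stand-in for the turbine features, not the real table). MLflow logging is left out so the example runs anywhere. Wrapping the model in a `StandardScaler` pipeline is an assumption of this sketch, not something the task prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class problem with 7 features (placeholder for the turbine data)
X, y = make_classification(
    n_samples=400, n_features=7, n_informative=5, n_redundant=0,
    n_classes=4, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling first tends to help the solver converge within max_iter=200
lr_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
lr_model.fit(X_train, y_train)
predictions = lr_model.predict(X_test)

acc = accuracy_score(y_test, predictions)
per_class = precision_score(y_test, predictions, average=None, zero_division=0)
macro = precision_score(y_test, predictions, average="macro", zero_division=0)

# Macro averaging is the unweighted mean of the per-class scores
print(f"accuracy={acc:.3f}, macro precision={macro:.3f}")
print(f"per-class precision={per_class}")
```

Macro averaging weights every class equally, so a rare failure type counts as much as the common `'ok'` class.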

In [None]:
# YOUR CODE HERE: Train Logistic Regression model
# Remember to wrap everything in mlflow.start_run()

# Step 1: Start MLflow run

# Step 2: Define model (LogisticRegression with max_iter=200)

# Step 3: Train model (.fit)

# Step 4: Make predictions (.predict on X_test)

# Step 5: Calculate metrics
# Hint: accuracy_score(y_test, predictions)
#       precision_score(y_test, predictions, average="macro")
#       recall_score(y_test, predictions, average="macro")
#       f1_score(y_test, predictions, average="macro")

# Step 6: Log to MLflow
# mlflow.set_tag("model_family", "LogisticRegression")
# mlflow.log_param("max_iter", 200)
# mlflow.log_metric("accuracy", acc)
# mlflow.log_metric("precision", precision)
# mlflow.log_metric("recall", recall)
# mlflow.log_metric("f1_score", f1)

# Step 7: Log model
# mlflow.sklearn.log_model(
#     lr_model,
#     artifact_path="logreg_model",
#     input_example=input_example,
#     signature=signature
# )

### 🤔 Reflection Questions:

**1. What accuracy did you achieve?**
```
[Your answer: e.g., 87.5%]
```

**2. Is the model production-ready (>85% accuracy)?**
```
[Your answer: Yes/No]
```

**3. Which metric matters most for predictive maintenance?**
- Accuracy: Overall correctness
- Precision: Avoid false alarms
- Recall: Don't miss failures
- F1: Balance of precision and recall
```
[Your answer]
```

---

# 🔥 LEVEL 2: Build XGBoost Model (30 min)

**Goal:** Train a more advanced model and compare with baseline

---

## 📊 Task 5: Train XGBoost Classifier

**What you need to do:**
1. Start a new MLflow run (with run_name="XGBoost Run")
2. Define XGBoost model with:
   - `objective="multi:softprob"` (multi-class classification)
   - `num_class=4` (number of classes; must match the four labels in `abnormal_sensor`)
   - `max_depth=3`
   - `n_estimators=100`
   - `learning_rate=0.1`
   - `eval_metric="mlogloss"`
3. Train, predict, evaluate, and log to MLflow

**💡 Hint:** Structure is similar to Logistic Regression, but use `XGBClassifier` and `mlflow.xgboost.log_model()`

In [None]:
# YOUR CODE HERE: Train XGBoost model

# Step 1: Start MLflow run

# Step 2: Define XGBoost model
# Hint: xgb_model = XGBClassifier(
#           objective="multi:softprob",
#           num_class=4,
#           max_depth=3,
#           n_estimators=100,
#           learning_rate=0.1,
#           eval_metric="mlogloss"
#       )

# Step 3: Train model

# Step 4: Make predictions

# Step 5: Calculate metrics (same as before)

# Step 6: Log parameters
# mlflow.set_tag("model_family", "XGBOOST")
# mlflow.log_param("max_depth", 3)
# mlflow.log_param("n_estimators", 100)
# mlflow.log_param("learning_rate", 0.1)

# Step 7: Log metrics

# Step 8: Log model
# mlflow.xgboost.log_model(
#     xgb_model,
#     artifact_path="xgb_model",
#     input_example=input_example,
#     signature=signature
# )

---

## 📈 Task 6: Compare Models

**What you need to do:**
1. Search all MLflow runs
2. Sort by accuracy (descending)
3. Identify best performing model
4. Compare Logistic Regression vs XGBoost

**💡 Hints:**
- Use `mlflow.search_runs()` with `order_by=['metrics.accuracy DESC']`
- Display top 5 runs
- Check `tags.model_family` column to identify model type
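
`mlflow.search_runs()` returns a pandas DataFrame, so the comparison itself is ordinary DataFrame work. The sketch below uses a hand-written stand-in with the same column names; the run IDs and accuracies are placeholders, not real results.

```python
import pandas as pd

# Hand-written stand-in for the DataFrame returned by mlflow.search_runs();
# run IDs and accuracies are placeholders, not real results
runs = pd.DataFrame({
    "run_id": ["run_a", "run_b", "run_c"],
    "metrics.accuracy": [0.81, 0.93, 0.88],
    "tags.model_family": ["LogisticRegression", "XGBOOST", "XGBOOST"],
})

# Same ordering that order_by=['metrics.accuracy DESC'] would produce
ranked = runs.sort_values("metrics.accuracy", ascending=False).reset_index(drop=True)
best_run = ranked.iloc[0]

# Best accuracy per model family, for a side-by-side comparison
by_family = ranked.groupby("tags.model_family")["metrics.accuracy"].max()

print(best_run["run_id"], best_run["tags.model_family"])
print(by_family)
```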

In [None]:
# YOUR CODE HERE: Search and display all runs
# Hint: mlflow.search_runs(order_by=['metrics.accuracy DESC','start_time DESC'])

In [None]:
# YOUR CODE HERE: Get best run details
# Hint: best_run = mlflow.search_runs(order_by=['metrics.accuracy DESC']).iloc[0]
#       print(f"Best accuracy: {best_run['metrics.accuracy']}")
#       print(f"Best model type: {best_run['tags.model_family']}")
#       best_model_run_id = best_run['run_id']

### 🤔 Reflection Questions:

**1. Which model performed better: Logistic Regression or XGBoost?**
```
[Your answer]
```

**2. By how much did accuracy improve?**
```
[Your answer: e.g., +3.2%]
```

**3. Why might XGBoost perform better for this problem?**
```
[Your answer: Think about non-linear relationships, feature interactions]
```

---

# 🚀 LEVEL 3: Hyperparameter Optimization (45 min)

**Goal:** Use Optuna to find optimal XGBoost parameters

---

## 🔧 Task 7: Set Up Optuna Optimization

**What you need to do:**
1. Get current experiment ID from MLflow
2. Create Optuna MLflow callback
3. Define objective function for hyperparameter tuning
4. Run optimization trials

**Hyperparameters to tune:**
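- `num_class`: 2-10 (strictly the number of target classes rather than a tuning knob; `XGBClassifier` normally infers it from the labels, so expect little effect from this one)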
- `max_depth`: 2-15
- `n_estimators`: 10-100

**💡 How Optuna works:**
- Tries different parameter combinations
- Learns from previous trials
- Finds optimal parameters to maximize accuracy

In [None]:
# YOUR CODE HERE: Get current experiment ID
# This finds the MLflow experiment associated with this notebook

# Hint:
# client = MlflowClient()
# experiments = client.search_experiments()
# notebook_name = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get().split("/")[-1]

# Then loop through experiments to find matching name
# Store experiment_id for later use

In [None]:
# YOUR CODE HERE: Create Optuna MLflow callback
# This automatically logs each Optuna trial to MLflow

# Hint:
# mlflow_callback = MLflowCallback(
#     tracking_uri='databricks',
#     metric_name="accuracy",
#     create_experiment=False,
#     mlflow_kwargs={"nested": True},
#     tag_trial_user_attrs=True
# )

---

## 🎯 Task 8: Define Objective Function

**What you need to do:**
Create a function that:
1. Receives hyperparameter suggestions from Optuna
2. Trains XGBoost model with those parameters
3. Evaluates model on test set
4. Logs everything to MLflow
5. Returns accuracy (optimization target)

**💡 Structure:**
```python
def objective(trial):
    with mlflow.start_run(nested=True):
        # 1. Suggest hyperparameters
        num_class = trial.suggest_int('num_class', 2, 10)
        max_depth = trial.suggest_int('max_depth', 2, 15)
        n_estimators = trial.suggest_int('n_estimators', 10, 100)
        
        # 2. Train model
        # 3. Predict and evaluate
        # 4. Log to MLflow
        
        return accuracy
```

In [None]:
# YOUR CODE HERE: Define objective function for Optuna

def objective(trial):
    """Objective function for hyperparameter tuning."""
    
    # Step 1: Start nested MLflow run
    
    # Step 2: Suggest hyperparameters
    # Hint: trial.suggest_int('num_class', 2, 10)
    
    # Step 3: Define XGBoost model with suggested parameters
    
    # Step 4: Train model
    
    # Step 5: Predict and calculate metrics
    
    # Step 6: Log parameters and metrics to MLflow
    
    # Step 7: Return accuracy (Optuna will maximize this)
    
    pass  # Remove this and write your code

---

## 🔬 Task 9: Run Optimization Trials

**What you need to do:**
1. Start parent MLflow run
2. Create Optuna study (maximize accuracy)
3. Run 10 optimization trials
4. Log best parameters and metrics
5. Train final model with best parameters
6. Save model to MLflow

**💡 This takes time:** 10 trials = ~5-10 minutes depending on data size

In [None]:
# YOUR CODE HERE: Run optimization trials

# Step 1: Start parent MLflow run
# with mlflow.start_run(experiment_id=experiment_id, run_name="optimize", nested=True):

    # Step 2: Create Optuna study
    # Hint: study = optuna.create_study(direction="maximize", load_if_exists=True)
    
    # Step 3: Run optimization
    # Hint: study.optimize(objective, n_trials=10, callbacks=[mlflow_callback])
    
    # Step 4: Log best parameters
    # Hint: mlflow.log_params(study.best_params)
    #       mlflow.log_metric("accuracy", study.best_value)
    
    # Step 5: Set metadata tags
    # mlflow.set_tags({"project": "Turbine maintenance predictor", ...})
    
    # Step 6: Train final model with best parameters
    # Hint: model = XGBClassifier(**study.best_params).fit(X_train, y_train)
    
    # Step 7: Calculate final metrics
    
    # Step 8: Log final model
    # mlflow.xgboost.log_model(
    #     xgb_model=model,
    #     artifact_path="xgb_model",
    #     input_example=input_example,
    #     signature=signature
    # )

---

## 📊 Task 10: Compare Optimization Results

**What you need to do:**
1. Search for all runs with logged models
2. Compare best pre-optimization vs post-optimization accuracy
3. Determine if optimization improved performance

**💡 Hint:** Filter for runs with models: `filter_string="tags.'mlflow.log-model.history' !=''"`

In [None]:
# YOUR CODE HERE: Search for runs with logged models
# Hint: mlflow.search_runs(
#           filter_string="tags.'mlflow.log-model.history' !=''",
#           order_by=['metrics.accuracy DESC']
#       )

In [None]:
# YOUR CODE HERE: Compare pre vs post optimization
# Get best_run from Task 6 (before optimization)
# Get best_opt_run from optimization results
# Compare accuracies

# Hint:
# print(f"Best previous accuracy: {best_run['metrics.accuracy']}")
# best_opt_run = mlflow.search_runs(...).iloc[0]
# print(f"Best optimized accuracy: {best_opt_run['metrics.accuracy']}")
# if best_opt_run['metrics.accuracy'] > best_run['metrics.accuracy']:
#     print("Optimization improved accuracy!")

### 🤔 Reflection Questions:

**1. Did hyperparameter tuning improve accuracy?**
```
[Your answer: Yes/No, by how much?]
```

**2. What were the optimal hyperparameters?**
```
[Your answer: num_class, max_depth, n_estimators]
```

**3. Would you run more trials (20, 50, 100) in production?**
```
[Your answer: Consider time vs accuracy tradeoff]
```

---

## 🧪 Task 11: Test Best Model

**What you need to do:**
1. Get best model URI from MLflow
2. Load model from artifact store
3. Make predictions on test data
4. Verify model works correctly

**💡 This validates the model before registering to Unity Catalog**
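
A model logged in a run is addressed as `runs:/<run_id>/<artifact_path>`, where the artifact path is whatever was passed to `log_model`. The sketch below picks the path from the `model_family` tag set in Tasks 4 and 5; `build_model_uri` is a hypothetical helper and `"abc123"` is a placeholder run ID.

```python
# Artifact paths used when each model family was logged in Tasks 4 and 5
ARTIFACT_PATHS = {
    "LogisticRegression": "logreg_model",
    "XGBOOST": "xgb_model",
}

def build_model_uri(run_id: str, model_family: str) -> str:
    """Return a runs:/ URI for the model logged under the given run."""
    return f"runs:/{run_id}/{ARTIFACT_PATHS[model_family]}"

# "abc123" is a placeholder run ID, not a real run
print(build_model_uri("abc123", "XGBOOST"))
```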

In [None]:
# YOUR CODE HERE: Get best model URI
# Hint: Construct URI based on run_id and artifact_path
# Format: f"runs:/{run_id}/xgb_model" or f"runs:/{run_id}/logreg_model"

# best_model_uri = f"runs:/{best_opt_run_id}/..."
# model_name = "turbine_maintenance"

In [None]:
# YOUR CODE HERE: Load model from MLflow
# Hint: loaded_model = mlflow.pyfunc.load_model(best_model_uri)

In [None]:
# YOUR CODE HERE: Make predictions with loaded model
# Hint: loaded_model.predict(X_test)

### ✅ Success Criteria:
- Model loads successfully
- Predictions match previous results
- No errors during inference

---

## 🏆 Task 12: Register Model to Unity Catalog

**What you need to do:**
1. Register best model to Unity Catalog
2. Set model alias to `@prod` (production-ready)

**💡 Why Unity Catalog?**
- **Governance:** Track who uses the model
- **Versioning:** Multiple model versions with aliases
- **Access control:** Fine-grained permissions
- **Lineage:** See which data created the model

**Hints:**
- Use `mlflow.register_model()` with full UC path: `f"{catalog}.{db}.{model_name}"`
- Set alias: `MlflowClient().set_registered_model_alias()`
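
Once an alias is set, the model can be addressed by alias instead of by version number. The sketch below only builds the relevant strings; `uc_model_name` and `alias_uri` are hypothetical helpers, and `"main"` and `"iot"` are placeholder catalog and schema names.

```python
def uc_model_name(catalog: str, db: str, model_name: str) -> str:
    """Three-level Unity Catalog name: <catalog>.<schema>.<model>."""
    return f"{catalog}.{db}.{model_name}"

def alias_uri(full_name: str, alias: str) -> str:
    """URI that resolves to whichever version currently holds the alias."""
    return f"models:/{full_name}@{alias}"

# "main" and "iot" are placeholder catalog and schema names
full_name = uc_model_name("main", "iot", "turbine_maintenance")
print(alias_uri(full_name, "prod"))
```

A URI of this form can be passed to `mlflow.pyfunc.load_model()`, so downstream code keeps working when a new version takes over the `prod` alias.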

In [None]:
# YOUR CODE HERE: Register model to Unity Catalog
# Hint: latest_model = mlflow.register_model(
#           best_model_uri,
#           f"{catalog}.{db}.{model_name}"
#       )

In [None]:
# YOUR CODE HERE: Set @prod alias
# Hint: MlflowClient().set_registered_model_alias(
#           name=f"{catalog}.{db}.{model_name}",
#           alias="prod",
#           version=latest_model.version
#       )

In [None]:
# YOUR CODE HERE: Verify model is registered
# Display model info from Unity Catalog
# Hint: MlflowClient().get_registered_model(f"{catalog}.{db}.{model_name}")

### ✅ Success Criteria:
- Model registered in Unity Catalog
- Model has @prod alias
- Model version number visible
- You can see model in Catalog Explorer UI

---

## 🎉 Congratulations!

You've completed the ML Model Training Challenge! 🏆

### What You've Accomplished:

✅ **Data Preparation** - Loaded and split training data  
✅ **Baseline Model** - Trained Logistic Regression (Level 1)  
✅ **Advanced Model** - Trained XGBoost Classifier (Level 2)  
✅ **Model Comparison** - Evaluated multiple algorithms  
✅ **Hyperparameter Optimization** - Used Optuna for tuning (Level 3)  
✅ **MLflow Tracking** - Logged all experiments  
✅ **Unity Catalog Registry** - Registered production model  

### Key Results:

**Best Model Accuracy:** [Your result]  
**Model Type:** [Logistic Regression / XGBoost]  
**Production Ready:** [Yes/No]  

---

## 📈 Next Steps:

**📘 Notebook 04.3: Model Deployment**
- Deploy model for batch inference (Spark UDF)
- Create REST API endpoint for real-time predictions
- Monitor model performance in production

---

## 💭 Final Reflection

**1. What was the most challenging part of model training?**
```
[Your answer]
```

**2. How would you improve the model further?**
Options:
- Add more features (from 04.1 bonus exercises)
- Try more algorithms (Random Forest, Neural Networks)
- Handle class imbalance (SMOTE, class weights)
- Collect more training data
```
[Your answer]
```

**3. What did you learn about MLflow and experiment tracking?**
```
[Your answer]
```

**4. How would you explain this model to a business stakeholder?**
```
[Your answer: Focus on business value, not technical details]
```

---

**Great work!** 🚀 Ready to deploy in 04.3!