# Databricks Data Preparation in ML - Notebook 08
## AutoML & MLflow Integration Demo

**Part of the Databricks Data Preparation in ML Training Series**

---

## Objectives

This hands-on demo shows practical AutoML and MLflow usage in Databricks:

- **Simple Data Setup** - Create clean demo dataset
- **AutoML Experiment** - Run classification with AutoML
- **MLflow Tracking** - Track experiments and models
- **Model Registry** - Register models for production
- **Model Serving** - Load the registered model and run test predictions

## Duration: ~25 minutes | Level: Beginner → Intermediate

---

## Why AutoML + MLflow?

**Perfect combination for rapid ML development:**
- **AutoML**: Automatic feature engineering and model selection
- **MLflow**: Experiment tracking and model management
- **Databricks**: Unified platform for end-to-end ML workflows
- **Focus on Business**: Less time on code, more time on insights

## AutoML Quick Overview

Databricks AutoML automates the core steps of a supervised machine learning workflow:

**🔧 Data Preparation:**
- Missing value imputation
- Categorical encoding
- Feature scaling and engineering

**🤖 Model Building:**
- Algorithm selection (XGBoost, Random Forest, etc.)
- Hyperparameter optimization
- Evaluation on held-out validation and test splits

**📊 Results:**
- Performance metrics
- Feature importance
- Generated code notebooks
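
To make the data preparation steps listed above concrete, here is a minimal sketch in plain Python of what imputation, one-hot encoding, and standardization do to a couple of toy columns. It is an illustration only and not AutoML's actual implementation; the values and the `<missing>` placeholder category are assumptions made for this example.

```python
from statistics import mean, pstdev

# Toy columns; None marks a missing value. Values are illustrative only.
ages = [25, None, 40, 55]
cities = ["Warsaw", "Krakow", None, "Warsaw"]

# 1. Missing value imputation: fill numeric gaps with the mean of observed values.
observed_ages = [a for a in ages if a is not None]
age_mean = mean(observed_ages)
ages_imputed = [a if a is not None else age_mean for a in ages]

# 2. Categorical encoding: one-hot, treating missing as its own category.
cities_filled = [c if c is not None else "<missing>" for c in cities]
categories = sorted(set(cities_filled))
one_hot = [[1 if c == cat else 0 for cat in categories] for c in cities_filled]

# 3. Feature scaling: standardize to zero mean and unit (population) variance.
mu, sigma = mean(ages_imputed), pstdev(ages_imputed)
ages_scaled = [(a - mu) / sigma for a in ages_imputed]

print(ages_imputed)
print(categories, one_hot)
print(ages_scaled)
```

AutoML applies comparable transformations inside the pipelines it generates, and the generated notebooks show the exact steps it chose for a given dataset.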

## 1. Environment Setup

In [0]:
# Import essential libraries
from databricks import automl
import mlflow
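# Explicit imports avoid shadowing Python built-ins such as sum() and max()
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType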
import numpy as np

## 2. Create Sample Dataset

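Create a synthetic customer dataset for the AutoML run. The generated rows contain no missing values, so AutoML's imputation step has nothing to fill in here. The `is_high_value` label is drawn at random, independently of the features, so expect validation scores near chance; the goal is to exercise the workflow, not to build a useful model.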

In [0]:
import numpy as np
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Set the seed for reproducibility
np.random.seed(42)

# Number of records
n = 1000

# Generate the data as plain Python types (int, str)
customer_data_clean = []
for i in range(n):
    customer_data_clean.append((
        f"CUST_{i:04d}",
        int(np.random.randint(18, 80)),
        str(np.random.choice(['Male', 'Female'])),
        int(np.random.randint(25000, 120000)),
        str(np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'])),
        str(np.random.choice(['Warsaw', 'Krakow', 'Gdansk', 'Wroclaw', 'Other'])),
        int(np.random.randint(1, 20)),
        int(np.random.randint(50, 500)),
        int(np.random.randint(1, 60)),
        int(np.random.choice([0, 1]))
    ))

# Schema
schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("gender", StringType(), True),
    StructField("income", IntegerType(), True),
    StructField("education", StringType(), True),
    StructField("city", StringType(), True),
    StructField("monthly_purchases", IntegerType(), True),
    StructField("avg_purchase_amount", IntegerType(), True),
    StructField("tenure_months", IntegerType(), True),
    StructField("is_high_value", IntegerType(), True)
])

# Create the Spark DataFrame
df_customers = spark.createDataFrame(customer_data_clean, schema)
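
The rows generated above are complete. If you want AutoML's imputation to have something to do, one option is to blank out a fraction of values before building the DataFrame. The sketch below is an assumption-laden illustration: `inject_missing`, the chosen column positions, and the 50% rate are hypothetical and not part of the original notebook.

```python
import random

def inject_missing(rows, nullable_idx, rate=0.1, seed=42):
    """Return a copy of `rows` in which roughly `rate` of the values at the
    given column positions are replaced by None."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        out.append(tuple(
            None if i in nullable_idx and rng.random() < rate else value
            for i, value in enumerate(row)
        ))
    return out

# Two hand-written rows in the same column order as the schema above.
demo_rows = [
    ("CUST_0000", 34, "Male", 52000, "Bachelor", "Warsaw", 5, 120, 12, 1),
    ("CUST_0001", 61, "Female", 87000, "PhD", "Krakow", 9, 310, 40, 0),
]

# Positions 1 (age), 3 (income), 4 (education); the id and the target stay intact.
demo_with_gaps = inject_missing(demo_rows, nullable_idx={1, 3, 4}, rate=0.5)
print(demo_with_gaps)

# In the notebook this could be applied to the full list before creating the DataFrame:
# df_customers = spark.createDataFrame(
#     inject_missing(customer_data_clean, {1, 3, 4}), schema)
```

Every field in the schema is declared nullable, so `None` values pass through `spark.createDataFrame` as SQL nulls.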


In [0]:
# Quick overview of our customer dataset
display(df_customers.limit(10))

## 3. AutoML Setup

Now let's configure and run Databricks AutoML on our customer dataset. AutoML will:
- Automatically handle missing values
- Select and engineer features
- Try multiple ML algorithms
- Optimize hyperparameters
- Provide interpretability insights

In [0]:
# Run AutoML classification experiment
summary = automl.classify(
    dataset=df_customers,
    target_col="is_high_value",
    primary_metric="f1",
    timeout_minutes=10,  # Short timeout for the demo (AutoML's minimum is 5)
    # max_trials is omitted: it is deprecated and not supported on newer Databricks Runtime ML versions
    experiment_name="customer_value_automl"
)
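
The `primary_metric="f1"` setting above asks AutoML to rank trials by F1, the harmonic mean of precision and recall. As a worked example, here is a minimal plain-Python sketch for the binary case. The helper name `f1_score_binary` and the toy labels are assumptions for illustration; the exact averaging AutoML uses may differ.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for one positive class: 2 * precision * recall / (precision + recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
f1 = f1_score_binary(y_true, y_pred)
print(round(f1, 4))
```

Here precision and recall are both 2/3, so F1 is also 2/3. F1 is a reasonable choice when false positives and false negatives both matter and classes may be imbalanced.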

In [0]:
# Access AutoML experiment results
experiment_id = summary.experiment.experiment_id
best_trial = summary.best_trial

# Display basic results
experiment_id, best_trial.mlflow_run_id

## 4. MLflow Experiment Tracking

AutoML logs every trial as a run in an MLflow experiment, including its parameters, metrics, and model artifacts. Let's explore the experiment results.

In [0]:
# Browse experiment runs in MLflow
runs = mlflow.search_runs(experiment_ids=[experiment_id])

# Display top performing models
top_runs = runs.sort_values('metrics.val_f1_score', ascending=False).head(3)
display(top_runs[['run_id', 'metrics.val_f1_score', 'metrics.val_precision_score', 'metrics.val_recall_score']])

In [0]:
# Load best model for prediction
best_model_uri = f"runs:/{best_trial.mlflow_run_id}/model"
model = mlflow.sklearn.load_model(best_model_uri)

# Show sample predictions on the feature columns (everything except the target)
feature_cols = [c for c in df_customers.columns if c != 'is_high_value']
sample_data = df_customers.select(feature_cols).limit(5).toPandas()
predictions = model.predict(sample_data)
predictions

## 5. Model Registry

Register the best model to MLflow Model Registry for production deployment and versioning.

In [0]:
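# Register the best model in the MLflow Model Registry.
# A single-level name targets the Workspace Model Registry; Unity Catalog
# requires a three-level name (catalog.schema.model).
from mlflow.models import infer_signature

# Infer the model signature for inspection only (it is not passed to the registry)
sample_data = df_customers.select([c for c in df_customers.columns if c != 'is_high_value']).limit(100).toPandas()
signature = infer_signature(sample_data, model.predict(sample_data))

# Register model. mlflow.register_model() has no `signature` argument;
# a signature is saved with the model when it is logged.
model_name = "customer_value_classifier"
registered_model = mlflow.register_model(
    model_uri=best_model_uri,
    name=model_name
)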

registered_model.version

In [0]:
# Load model from registry for serving
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Get the newest version of the model (compare versions as integers, not strings)
versions = client.search_model_versions(f"name='{model_name}'")
latest_version = max(versions, key=lambda v: int(v.version))

model_version_uri = f"models:/{model_name}/{latest_version.version}"
serving_model = mlflow.sklearn.load_model(model_version_uri)

# Test prediction with the registered model
test_prediction = serving_model.predict(sample_data.head(1))
test_prediction[0]
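
When picking the newest registered version, note that registry version numbers may come back as strings, and string comparison puts `"9"` after `"10"`. The sketch below shows the pitfall and a safe selection using a hypothetical stand-in class, `FakeModelVersion`, so it runs without a registry; it is not an MLflow type.

```python
from dataclasses import dataclass

@dataclass
class FakeModelVersion:
    """Hypothetical stand-in for a registry entry; only the fields used here."""
    name: str
    version: str

def latest_version_of(versions):
    """Pick the entry with the highest version, comparing numerically."""
    return max(versions, key=lambda v: int(v.version))

def model_version_uri_for(name, version):
    """Build a models:/<name>/<version> URI like the one used above."""
    return f"models:/{name}/{version}"

versions = [FakeModelVersion("customer_value_classifier", v) for v in ["9", "10", "2"]]

naive_latest = max(v.version for v in versions)   # string comparison
newest = latest_version_of(versions)              # numeric comparison
uri = model_version_uri_for(newest.name, newest.version)
print(naive_latest, newest.version, uri)
```

The same `key=lambda v: int(v.version)` pattern can be applied to the objects returned by the registry client.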

## 6. Summary and Next Steps

### What we accomplished:
✅ **Automated Data Preparation**: AutoML handled categorical encoding and feature preprocessing automatically (the demo data has no missing values to impute)  
✅ **Model Training**: Multiple algorithms tested and optimized automatically  
✅ **MLflow Tracking**: All experiments, metrics, and models tracked automatically  
✅ **Model Registry**: Best model registered for production use  
✅ **Model Serving**: Model ready for batch or real-time predictions  

### Next Steps:
1. **Monitor Model Performance**: Set up model monitoring in production
2. **A/B Testing**: Test model performance against existing solutions
3. **Feature Store**: Move to centralized feature management
4. **Model Serving**: Deploy model as REST API endpoint
5. **Continuous Training**: Set up automated retraining pipelines
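
As a starting point for the REST deployment step above, the sketch below builds a request body in the `dataframe_split` format that MLflow scoring servers accept. It only constructs the JSON and does not call any endpoint. The helper name `build_scoring_payload` and the sample record are assumptions for illustration, and the endpoint URL and authentication depend on your workspace.

```python
import json

def build_scoring_payload(records):
    """Convert a list of row dicts into a `dataframe_split` JSON request body."""
    if not records:
        raise ValueError("at least one record is required")
    columns = list(records[0].keys())
    data = [[record[col] for col in columns] for record in records]
    return json.dumps({"dataframe_split": {"columns": columns, "data": data}})

# One illustrative customer row using the feature columns from this notebook.
sample_records = [{
    "customer_id": "CUST_0000",
    "age": 34,
    "gender": "Male",
    "income": 52000,
    "education": "Bachelor",
    "city": "Warsaw",
    "monthly_purchases": 5,
    "avg_purchase_amount": 120,
    "tenure_months": 12,
}]

payload = build_scoring_payload(sample_records)
print(payload)
```

The resulting string would be sent as the body of a POST request with a JSON content type to the serving endpoint's invocation URL.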

### Databricks AutoML Benefits:
- **Minimal Code Required**: A single `automl.classify()` call (or the UI) runs the full training pipeline
- **Best Practices**: Follows ML engineering best practices
- **Transparency**: All code and notebooks generated for review
- **Integration**: Native MLflow and Unity Catalog integration