# 🚲 MLflow via Delta Sharing: Training with AutoML

This notebook demonstrates how to train a model and register it in Unity Catalog in one workspace (Dev), then share it so that a second workspace (Prod) can load it for inference.

Databricks provides a hosted MLflow Model Registry in Unity Catalog, fully compatible with the open-source MLflow Python client.  
Key benefits include:
- Centralized access control
- Full auditing and lineage tracking
- Cross-workspace model discovery and collaboration

**Dataset:** [Bike Sharing Dataset](https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset)

---

## 📚 Workflow Overview

- Load the Bike Sharing dataset.
- Train a model using AutoML with hyperparameter tuning.
- Register the best-performing model to Unity Catalog (UC).
- Share the model via Delta Sharing to another workspace.
- Load the shared model in a second workspace and perform predictions.

---

## ⚙️ Requirements

- Databricks Runtime: **15.4 LTS ML** or later
- Unity Catalog enabled in both workspaces
- AutoML enabled

---

## Workspace A (Train + Register)

1. Trigger an AutoML experiment (including hyperparameter tuning).
2. Log the best model and experiment runs with MLflow.
3. Register the best model to UC Model Registry as:  
   `alexander_booth.default.bike_sharing_uc_model`
4. Share the registered model via Delta Sharing to Workspace B.

---

## Workspace B (Consume + Predict)

5. Accept the shared model via Unity Catalog UI or API.
6. Access the shared model with a URI that includes a version (or an `@alias` suffix), for example:  
   `models:/shared_catalog.shared_schema.bike_sharing_model/<version>`
7. Load the model and run batch predictions on new data.
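
Steps 5 and 6 could be scripted roughly as follows. This is a minimal sketch rather than the notebook's own code: the provider name `dev_provider` and the catalog name `shared_catalog` are hypothetical placeholders, and it assumes the share is mounted as a catalog with `CREATE CATALOG ... USING SHARE`.

```python
def create_catalog_from_share_sql(catalog: str, provider: str, share: str) -> str:
    """Build the SQL that mounts a Delta Sharing share as a read-only catalog."""
    return (
        f"CREATE CATALOG IF NOT EXISTS `{catalog}` "
        f"USING SHARE `{provider}`.`{share}`"
    )


def shared_model_uri(catalog, schema, model, version=None, alias=None) -> str:
    """Build a models:/ URI; MLflow needs a version or an alias, not a bare name."""
    full_name = f"{catalog}.{schema}.{model}"
    if alias is not None:
        return f"models:/{full_name}@{alias}"
    if version is not None:
        return f"models:/{full_name}/{version}"
    raise ValueError("Provide a version or an alias.")


# Hypothetical names: replace with the provider and catalog used in your workspace.
mount_sql = create_catalog_from_share_sql(
    "shared_catalog", "dev_provider", "abooth_bike_sharing_model_share"
)
model_uri = shared_model_uri(
    "shared_catalog", "default", "bike_sharing_uc_model", version=1
)
print(mount_sql)
print(model_uri)
# In a Databricks notebook in Workspace B, the mount would be run with:
#   spark.sql(mount_sql)
```

The URI helper refuses to build a bare `models:/<name>` string on purpose, since that form cannot be resolved to a specific model version.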

---


## 🗂️ Unity Catalog Setup

Set the catalog and schema where the model will be registered.

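You will need the following privileges:
- `USE CATALOG` on the target catalog
- `USE SCHEMA` and `CREATE MODEL` on the target schema
- `CREATE TABLE` on the target schema, because this notebook also saves the training data as a table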
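
If these privileges are missing, a catalog or schema owner could grant them with statements along these lines. This is a sketch under assumptions: the group name `ml-engineers` is hypothetical, and `CREATE TABLE` is included only because the notebook later writes the training data to a table.

```python
from typing import List


def grant_statements(catalog: str, schema: str, principal: str) -> List[str]:
    """Return GRANT statements for the privileges this notebook relies on."""
    return [
        f"GRANT USE CATALOG ON CATALOG `{catalog}` TO `{principal}`",
        # CREATE TABLE is needed for the saveAsTable call, CREATE MODEL for registration.
        f"GRANT USE SCHEMA, CREATE MODEL, CREATE TABLE "
        f"ON SCHEMA `{catalog}`.`{schema}` TO `{principal}`",
    ]


# `ml-engineers` is a hypothetical group name used only for illustration.
statements = grant_statements("alexander_booth", "default", "ml-engineers")
for statement in statements:
    print(statement)
    # In a Databricks notebook, an owner could run: spark.sql(statement)
```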

Update the catalog and schema below if necessary before proceeding.


In [0]:
# Catalog and schema (database) in Unity Catalog where the training table and model are stored
CATALOG_NAME = "alexander_booth"
SCHEMA_NAME = "default"
TABLE_NAME = "bike_sharing_training_data"  # Table that holds the training data
MODEL_NAME = "bike_sharing_uc_model"       # Name of the registered model

## 📊 Data Description

Dataset: [Bike Sharing Dataset (UCI ML Repository)](https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset)

---

### 🔹 Features

| Column      | Description |
|:------------|:-------------|
| dteday      | Date |
| season      | Season (1: Spring, 2: Summer, 3: Fall, 4: Winter) |
| yr          | Year (0: 2011, 1: 2012) |
| mnth        | Month (1–12) |
| hr          | Hour of day (0–23) |
| holiday     | 1 if holiday, 0 otherwise |
| weekday     | Day of week (0 = Sunday) |
| workingday  | 1 if working day, 0 if weekend/holiday |
| weathersit  | Weather condition (1–4 scale) |
| temp        | Normalized temperature |
| atemp       | Normalized "feels like" temperature |
| hum         | Normalized humidity |
| windspeed   | Normalized wind speed |

---

### 🔹 Labels

| Column      | Description |
|:------------|:-------------|
| casual      | Count of casual users |
| registered  | Count of registered users |
| cnt         | Total rental count (casual + registered) |

---

### 🔹 Extra

| Column      | Description |
|:------------|:-------------|
| instant     | Record index (row ID) |

> Example:  
> Hour 0 on January 1, 2011 — 16 rentals recorded around midnight!
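
Because `cnt` is defined as `casual + registered`, keeping either of those columns as a feature would leak the target. The relationship can be checked with a small sketch, here using the `casual`, `registered`, and `cnt` values from the first ten rows of `hour.csv` as they appear in this notebook's data preview:

```python
# (hr, casual, registered, cnt) for the first ten hourly records of 2011-01-01
sample_rows = [
    (0, 3, 13, 16),
    (1, 8, 32, 40),
    (2, 5, 27, 32),
    (3, 3, 10, 13),
    (4, 0, 1, 1),
    (5, 0, 1, 1),
    (6, 2, 0, 2),
    (7, 1, 2, 3),
    (8, 1, 7, 8),
    (9, 8, 6, 14),
]

# Collect any row where the two user counts do not add up to the total.
mismatches = [row for row in sample_rows if row[1] + row[2] != row[3]]
print(f"Rows checked: {len(sample_rows)}, mismatches: {len(mismatches)}")
```

On the full dataset the same check could be expressed as a Spark filter on `casual + registered != cnt`.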


In [0]:
# Load the bike-sharing dataset from the specified CSV file into a Spark DataFrame.
# The dataset includes hourly bike rental data. The `header=True` option indicates that the first row contains column names,
# and `inferSchema=True` allows Spark to automatically infer data types for each column
bike_df = spark.read.csv(
    "/databricks-datasets/bikeSharing/data-001/hour.csv",
    header=True,
    inferSchema=True
)

# Print the schema to inspect the column names and inferred data types.
# printSchema() prints directly and returns None, so it is not wrapped in print().
bike_df.printSchema()
display(bike_df)

root
 |-- instant: integer (nullable = true)
 |-- dteday: date (nullable = true)
 |-- season: integer (nullable = true)
 |-- yr: integer (nullable = true)
 |-- mnth: integer (nullable = true)
 |-- hr: integer (nullable = true)
 |-- holiday: integer (nullable = true)
 |-- weekday: integer (nullable = true)
 |-- workingday: integer (nullable = true)
 |-- weathersit: integer (nullable = true)
 |-- temp: double (nullable = true)
 |-- atemp: double (nullable = true)
 |-- hum: double (nullable = true)
 |-- windspeed: double (nullable = true)
 |-- casual: integer (nullable = true)
 |-- registered: integer (nullable = true)
 |-- cnt: integer (nullable = true)


instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1
6,2011-01-01,1,0,1,5,0,6,0,2,0.24,0.2576,0.75,0.0896,0,1,1
7,2011-01-01,1,0,1,6,0,6,0,1,0.22,0.2727,0.8,0.0,2,0,2
8,2011-01-01,1,0,1,7,0,6,0,1,0.2,0.2576,0.86,0.0,1,2,3
9,2011-01-01,1,0,1,8,0,6,0,1,0.24,0.2879,0.75,0.0,1,7,8
10,2011-01-01,1,0,1,9,0,6,0,1,0.32,0.3485,0.76,0.0,8,6,14


In [0]:
# Prepare the data for training: keep only the feature columns and the target.

# The features cover season, year, month, hour, day type, and weather conditions; the target is `cnt`.
# `instant` and `dteday` are dropped, as are `casual` and `registered`, which sum to `cnt`.
selected_columns = ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 
                    'workingday', 'weathersit', 'temp', 'atemp', 'hum', 
                    'windspeed', 'cnt']

# Filter the DataFrame to include only the selected columns
bike_df = bike_df.select(selected_columns)
display(bike_df)

season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,16
1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,40
1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,32
1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,13
1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,1
1,0,1,5,0,6,0,2,0.24,0.2576,0.75,0.0896,1
1,0,1,6,0,6,0,1,0.22,0.2727,0.8,0.0,2
1,0,1,7,0,6,0,1,0.2,0.2576,0.86,0.0,3
1,0,1,8,0,6,0,1,0.24,0.2879,0.75,0.0,8
1,0,1,9,0,6,0,1,0.32,0.3485,0.76,0.0,14


In [0]:
# Save the prepared data as a Unity Catalog table so AutoML can read it by name
bike_df.write.mode("overwrite").saveAsTable(f"{CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}")

## Machine Learning

In [0]:
# 1. Import packages
import databricks.automl
import mlflow

# Set the MLflow registry URI to use Databricks Unity Catalog (UC) for model registry management
mlflow.set_registry_uri("databricks-uc")

In [0]:
# 2. Define configs
input_table = f"{CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}"
target_col = "cnt"
uc_model_name = f"{CATALOG_NAME}.{SCHEMA_NAME}.{MODEL_NAME}"
experiment_name = "bike_sharing_uc"

# 3. Launch AutoML
summary = databricks.automl.regress(
    dataset = input_table,
    target_col = target_col,
    timeout_minutes = 5,
    primary_metric = "rmse",
    experiment_name = experiment_name
)

print(summary)

2025/04/18 19:15:02 INFO databricks.automl.client.manager: AutoML will optimize for root mean squared error metric, which is tracked as val_root_mean_squared_error in the MLflow experiment.
2025/04/18 19:15:04 INFO databricks.automl.client.manager: MLflow Experiment ID: 2674917388788605
2025/04/18 19:15:04 INFO databricks.automl.client.manager: MLflow Experiment: https://e2-demo-field-eng.cloud.databricks.com/?o=1444828305810485#mlflow/experiments/2674917388788605


🏃 View run masked-sow-621 at: https://e2-demo-field-eng.cloud.databricks.com/ml/experiments/2674917388788605/runs/7a5b6f89946248ae830079aa8e7139f2
🧪 View experiment at: https://e2-demo-field-eng.cloud.databricks.com/ml/experiments/2674917388788605


2025/04/18 19:17:21 INFO databricks.automl.client.manager: Data exploration notebook: https://e2-demo-field-eng.cloud.databricks.com/?o=1444828305810485#notebook/2674917388788641
2025/04/18 19:20:58 INFO databricks.automl.client.manager: AutoML experiment completed successfully.


Unnamed: 0,Train,Validation,Test
root_mean_squared_error,36.71,44.281,41.954
mean_squared_error,1347.6,1960.771,1760.157
example_count,10441.0,3403.0,3535.0
r2_score,0.959,0.942,0.947
sum_on_target,1971133.0,651061.0,670485.0
score,0.959,0.942,0.947
mean_absolute_error,24.045,28.664,27.583
mean_on_target,188.788,191.32,189.67
max_error,381.302,361.703,386.103
mean_absolute_percentage_error,0.499,0.531,0.574


Overall summary:
	Experiment ID: 2674917388788605
	Number of trials: 4
	Evaluation metric distribution: min: 264.298, median: 91.804, max: 44.281
	Semantic type conversions: {'categorical': ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']}
Best trial:

	Model: Pipeline
	Model path: dbfs:/databricks/mlflow-tracking/2674917388788605/c8059d59c6624144a5020db3017da1a8/artifacts/model
	Preprocessors: [('boolean', Pipeline(steps=[('cast_type',
	         FunctionTransformer(func=<function <lambda> at 0x7fd7609d9080>)),
	        ('imputers',
	         ColumnTransformer(remainder='passthrough', transformers=[])),
	        ('onehot',
	         OneHotEncoder(drop='first', handle_unknown='ignore'))]), ['holiday', 'yr', 'workingday']), ('numerical', Pipeline(steps=[('converter',
	         FunctionTransformer(func=<function <lambda> at 0x7fd7609d9300>)),
	        ('imputers',
	         ColumnTransformer(transformers=[('impute_mean',
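
The metrics table above is internally consistent, which can be confirmed with a little arithmetic: RMSE should be the square root of MSE, and the three `example_count` values should add up to the size of the training table. The sketch below copies the reported numbers by hand; it does not call AutoML.

```python
import math

# Values copied from the AutoML metrics table for the best trial.
splits = {
    "train": {"mse": 1347.6, "rmse": 36.71, "n": 10441},
    "validation": {"mse": 1960.771, "rmse": 44.281, "n": 3403},
    "test": {"mse": 1760.157, "rmse": 41.954, "n": 3535},
}

total_rows = sum(s["n"] for s in splits.values())
print(f"Total rows across splits: {total_rows}")

for name, s in splits.items():
    recomputed_rmse = math.sqrt(s["mse"])  # RMSE is the square root of MSE
    share = s["n"] / total_rows            # fraction of rows in this split
    print(
        f"{name}: sqrt(MSE)={recomputed_rmse:.3f}, "
        f"reported RMSE={s['rmse']}, share of rows={share:.1%}"
    )
```

The split shares come out close to 60/20/20, and the validation RMSE of 44.281 is the value AutoML reports as the best trial's score.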
	                                         

In [0]:
# 4. Best trial info
best_run_id = summary.best_trial.mlflow_run_id
best_model_uri = f"runs:/{best_run_id}/model"

print(f"Best Run ID: {best_run_id}")
print(f"Best Model URI: {best_model_uri}")

# 5. Register the best model into UC
registered_model = mlflow.register_model(
    model_uri = best_model_uri,
    name = uc_model_name
)

print(f"Registered Model Version: {registered_model.version}")

Best Run ID: c8059d59c6624144a5020db3017da1a8
Best Model URI: runs:/c8059d59c6624144a5020db3017da1a8/model


Successfully registered model 'alexander_booth.default.bike_sharing_uc_model'.


Registered Model Version: 1


Created version '1' of model 'alexander_booth.default.bike_sharing_uc_model'.
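
After registration it can be useful to confirm that the stored signature covers every feature column used for training. The following is a sketch, not part of the original notebook: it assumes the model was logged with a signature, and the helper names are hypothetical.

```python
from typing import Iterable, List

FEATURE_COLUMNS = [
    "season", "yr", "mnth", "hr", "holiday", "weekday",
    "workingday", "weathersit", "temp", "atemp", "hum", "windspeed",
]


def missing_features(expected: Iterable[str], signature_inputs: Iterable[str]) -> List[str]:
    """Return the expected columns that are absent from the model signature."""
    present = set(signature_inputs)
    return [name for name in expected if name not in present]


def check_registered_version(full_name: str, version) -> List[str]:
    """Look up a registered version's signature and report missing feature columns."""
    # Imported lazily so the pure helper above works without MLflow installed.
    import mlflow
    from mlflow.models import get_model_info

    mlflow.set_registry_uri("databricks-uc")
    info = get_model_info(f"models:/{full_name}/{version}")
    return missing_features(FEATURE_COLUMNS, info.signature.inputs.input_names())


# In the notebook, after registration:
#   check_registered_version(uc_model_name, registered_model.version)
# An empty list means every feature column appears in the signature.
```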


## ✅ Next Steps

Now that the best model is trained and registered, the next steps are:

1. **Alias the model as "staging"**  
   - Assign the "staging" alias to the newly registered model version.
   - This allows easy future access without hardcoding specific model versions.

2. **Add the model to a Delta Sharing share**  
   - Create a share and grant a recipient access to it.
   - Add the registered model to the share so that other workspaces (or external consumers) can access it.

Both steps can also be done in the UI; they are shown programmatically here to provide a complete, end-to-end example.

In [0]:
# Initialize MLflow client
client = mlflow.MlflowClient()

# Set alias "staging" to the newly registered model version
client.set_registered_model_alias(
    name = uc_model_name,                       # Same model registered
    alias = "staging",
    version = registered_model.version          # Version registered
)
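
Once an alias is set, callers can refer to the model by alias instead of by version number. A minimal sketch of how that lookup might work, assuming the alias `staging` set in the cell above (the helper names are hypothetical):

```python
def alias_model_uri(full_name: str, alias: str) -> str:
    """Build a models:/ URI that resolves through an alias rather than a fixed version."""
    return f"models:/{full_name}@{alias}"


def resolve_alias(full_name: str, alias: str) -> str:
    """Return the version number that an alias currently points to."""
    # Imported lazily so the URI helper works without MLflow installed.
    import mlflow

    mlflow.set_registry_uri("databricks-uc")
    client = mlflow.MlflowClient()
    return client.get_model_version_by_alias(full_name, alias).version


uri = alias_model_uri("alexander_booth.default.bike_sharing_uc_model", "staging")
print(uri)
# In the notebook: resolve_alias(uc_model_name, "staging")
```

Moving the alias to a newer version later changes what this URI loads without touching any consumer code.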

In [0]:
%sql
-- Create a new Delta Sharing share
CREATE SHARE abooth_bike_sharing_model_share
COMMENT "Share for production ML models.";


info_name,info_value
share_name,abooth_bike_sharing_model_share
owner,alexander.booth@databricks.com
created_at,2025-04-18 19:35:51.558
created_by,alexander.booth@databricks.com
updated_at,2025-04-18 19:35:51.558
updated_by,alexander.booth@databricks.com
comment,Share for production ML models.


In [0]:
%sql
-- Grant an existing recipient read access to the share. Replace the placeholder with your recipient name; recipients can be created via SQL or in the UI.
GRANT SELECT ON SHARE abooth_bike_sharing_model_share TO RECIPIENT `xxxxxxxxxxxx`;
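
The recipient has to exist before the `GRANT` above can succeed. For Databricks-to-Databricks sharing, one way to create it from SQL is with the sharing identifier of the recipient's metastore. The sketch below only builds the statement; the recipient name and the identifier are hypothetical placeholders.

```python
def create_recipient_sql(recipient: str, sharing_identifier: str) -> str:
    """Build the SQL that creates a Databricks-to-Databricks recipient."""
    return (
        f"CREATE RECIPIENT IF NOT EXISTS `{recipient}` "
        f"USING ID '{sharing_identifier}'"
    )


# Hypothetical values: the identifier has the form <cloud>:<region>:<metastore-uuid>
# and is obtained from the recipient workspace's metastore.
recipient_sql = create_recipient_sql(
    "prod_workspace", "aws:us-west-2:00000000-0000-0000-0000-000000000000"
)
print(recipient_sql)
# In a Databricks notebook: spark.sql(recipient_sql)
```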


In [0]:
%sql
-- Add model to the Delta Sharing share
ALTER SHARE abooth_bike_sharing_model_share
ADD MODEL alexander_booth.default.bike_sharing_uc_model;
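
To confirm that the share now contains the model and that the recipient has access, the share can be inspected afterwards. A minimal sketch, assuming the share name used above (the helper name is hypothetical):

```python
from typing import List


def share_audit_statements(share: str) -> List[str]:
    """Return SQL statements that list a share's contents and its grants."""
    return [
        f"SHOW ALL IN SHARE `{share}`",     # objects (tables, models, ...) in the share
        f"SHOW GRANTS ON SHARE `{share}`",  # recipients that can read the share
    ]


audit_sql = share_audit_statements("abooth_bike_sharing_model_share")
for statement in audit_sql:
    print(statement)
    # In a Databricks notebook: display(spark.sql(statement))
```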

## 🚀 Pivot to Inference Demo

Now that the model has been trained, registered, and shared through Delta Sharing, we will pivot to a second workspace and notebook to demonstrate inference.

In the next section:
- We will simulate how an external consumer (another workspace or partner) would access the shared model.
- We will load the model directly from the Delta Sharing share.
- We will perform batch inference using the shared model on new data.

This approach demonstrates how Delta Sharing enables cross-workspace, cross-cloud model collaboration without manual exports.
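
As a preview, batch inference in the consuming workspace might look like the sketch below. It is an illustration under assumptions, not the inference notebook itself: the model URI is a hypothetical example, and `features` is expected to be a pandas DataFrame with the same feature columns and dtypes used for training.

```python
def score_frame(model, features):
    """Return a copy of `features` with a `predicted_cnt` column added."""
    scored = features.copy()
    scored["predicted_cnt"] = model.predict(features)
    return scored


def batch_predict(model_uri: str, features):
    """Load a Unity Catalog model by URI and score a pandas DataFrame."""
    # Imported lazily so score_frame can be used without MLflow installed.
    import mlflow

    mlflow.set_registry_uri("databricks-uc")
    model = mlflow.pyfunc.load_model(model_uri)
    return score_frame(model, features)


# Hypothetical usage in Workspace B, where `new_data` is a pandas DataFrame:
#   scored = batch_predict(
#       "models:/shared_catalog.default.bike_sharing_uc_model/1", new_data
#   )
```

For larger datasets, `mlflow.pyfunc.spark_udf` is an alternative that applies the model to a Spark DataFrame instead of collecting the data into pandas.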
