![](../_resources/images/e2eai-4.jpg)


# Data Science with Databricks

## ML is key to wind turbine farm optimization

The current market makes energy even more strategic than before. Being able to ingest and analyze the state of our wind turbines is a first step, but it isn't enough to thrive in a very competitive market.

We need to go further to optimize our energy production, reduce maintenance costs and reduce downtime. Modern data companies achieve this with AI.

<style>
.right_box{
  margin: 30px; box-shadow: 10px -10px #CCC; width:650px;height:300px; background-color: #1b3139ff; box-shadow:  0 0 10px  rgba(0,0,0,0.6);
  border-radius:25px;font-size: 35px; float: left; padding: 20px; color: #f9f7f4; }
.badge {
  clear: left; float: left; height: 30px; width: 30px;  display: table-cell; vertical-align: middle; border-radius: 50%; background: #fcba33ff; text-align: center; color: white; margin-right: 10px}
.badge_b { 
  height: 35px}
</style>
<link href='https://fonts.googleapis.com/css?family=DM Sans' rel='stylesheet'>
<div style="font-family: 'DM Sans'; display: flex; align-items: flex-start;">
  <!-- Left Section -->
  <div style="width: 50%; color: #1b3139; padding-right: 20px;">
    <div style="color: #ff5f46; font-size:80px;">90%</div>
    <div style="font-size:30px; margin-top: -20px; line-height: 30px;">
      Enterprise applications will be AI-augmented by 2025 —IDC
    </div>
    <div style="color: #ff5f46; font-size:80px;">$10T+</div>
    <div style="font-size:30px; margin-top: -20px; line-height: 30px;">
       Projected business value creation by AI in 2030 —PWC
    </div>
  </div>

  <!-- Right Section -->
  <div class="right_box" style="width: 50%; color: red; font-size: 30px; line-height: 1.5; padding-left: 20px;">
    But—huge challenges getting ML to work at scale!<br/><br/>
    In fact, most ML projects still fail before getting to production
  </div>
</div>

## Machine learning is data + transforms.

ML is hard because delivering value to business lines isn't only about building a model. <br>
The ML lifecycle is made of data pipelines: data preprocessing, feature engineering, training, inference, monitoring and retraining...<br>
Stepping back, all pipelines are data + code.


<img style="float: right; margin-top: 10px" width="500px" src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/refs/heads/main/images/manufacturing/lakehouse-iot-turbine/team_flow_marc.png" />

<img src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/refs/heads/main/images/marc.png" style="float: left;" width="80px"> 
<h3 style="padding: 10px 0px 0px 5px">Marc, as a Data Scientist, needs a data + ML platform accelerating all the ML & DS steps:</h3>

<div style="font-size: 19px; margin-left: 73px; clear: left">
<div class="badge_b"><div class="badge">1</div> Build Data Pipeline</div>
<div class="badge_b"><div class="badge">2</div> Data Exploration</div>
<div class="badge_b"><div class="badge">3</div> Feature creation</div>
<div class="badge_b"><div class="badge">4</div> Build & train model</div>
<div class="badge_b"><div class="badge">5</div> Deploy Model (Batch or serverless realtime)</div>
<div class="badge_b"><div class="badge">6</div> Monitoring</div>
</div>

**Marc needs a Data Intelligence Platform**. Let's see how we can deploy a Predictive Maintenance model in production with Databricks.


# Predictive maintenance

Let's see how we can now leverage the sensor data to build a predictive maintenance model.

Our first step as a Data Scientist is to analyze the data and build the features we'll use to train our model.

The sensor table enriched with turbine data has been saved by our Delta Live Tables pipeline. All we have to do is read this information, analyze it and create an ML model.

<img src="https://github.com/databricks-demos/dbdemos-resources/raw/main/images/manufacturing/lakehouse-iot-turbine/lakehouse-manuf-iot-ds-flow.png" width="1000px">




In [0]:
%pip install --quiet databricks-sdk==0.40.0 mlflow==2.22.0 optuna optuna-integration[mlflow] xgboost
dbutils.library.restartPython()

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
databricks-connect 16.4.6 requires databricks-sdk>=0.46.0, but you have databricks-sdk 0.40.0 which is incompatible.
Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
%run ../_resources/00-setup $reset_all_data=false

## Configuration file

Please change your catalog and schema here to run the demo on a different catalog.

 


# Technical Setup notebook. Hide this cell results
Initialize dataset to the current user and cleanup data when reset_all_data is set to true

Do not edit

USE CATALOG `main`
using catalog.database `main`.`e2eai_iot_turbine`


data already existing. Run with reset_all_data=true to force a data cleanup for your local demo.


In [0]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)

from xgboost import XGBClassifier
import numpy as np
import pandas as pd
import mlflow
from mlflow.models import infer_signature
from mlflow import MlflowClient
from mlflow.deployments import get_deploy_client
import optuna
from optuna.integration.mlflow import MLflowCallback
import os
import requests
import json





## Accelerating Predictive maintenance model creation using MLflow in Databricks

MLflow is an open source project for model tracking, packaging and deployment. Every time your data science team works on a model, Databricks tracks the parameters and data used and saves them. This ensures ML traceability and reproducibility, making it easy to know which model was built using which parameters and data.

### A glass-box solution that empowers data teams without taking away control

While Databricks simplifies model deployment and governance (MLOps) with MLflow, bootstrapping new ML projects can still be long and inefficient. 


<img width="1000" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/retail/resources/images/auto-ml-full.png"/>


Models can be deployed directly, or you can use the generated notebooks to bootstrap projects with best practices, saving you weeks of effort.

<br style="clear: both">

<img width="600" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/retail/resources/images/churn-auto-ml.png"/>



In [0]:
mlflow.set_registry_uri('databricks-uc')

### Create Train / Test Splits

In [0]:
features_table_name=f'{catalog}.{db}.turbine_hourly_features'

# Read training dataset from catalog
training_dataset = spark.table(features_table_name).drop('turbine_id')

# Prepare features and labels
# Convert to pandas once so the Spark table is not collected twice
training_pdf = training_dataset.toPandas()
X = training_pdf[['avg_energy', 'std_sensor_A', 'std_sensor_B', 'std_sensor_C', 'std_sensor_D', 'std_sensor_E', 'std_sensor_F']]

y = training_pdf['abnormal_sensor']

# Encode labels
y_encoded = pd.factorize(y)[0]

# Split the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y_encoded,
    test_size=0.2, 
    random_state=42
)
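`pd.factorize` assigns each distinct label an integer code in order of first appearance and also returns the distinct labels; the cell above keeps only the codes. As a rough plain-Python sketch of that behaviour (the label values below are hypothetical, not read from the `abnormal_sensor` column, and missing values, which pandas encodes as -1, are not handled):

```python
def factorize(labels):
    """Return (codes, uniques), assigning codes in order of first appearance."""
    uniques = []   # distinct labels, in the order they are first seen
    index = {}     # label -> integer code
    codes = []
    for label in labels:
        if label not in index:
            index[label] = len(uniques)
            uniques.append(label)
        codes.append(index[label])
    return codes, uniques


# Hypothetical label values, for illustration only
codes, uniques = factorize(["ok", "sensor_B", "ok", "sensor_D", "sensor_B"])
print(codes)    # integer class indices used as training targets
print(uniques)  # lookup list for turning indices back into labels
```

Keeping the second return value makes it possible to map predicted class indices back to label names later.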


In [0]:

# Infer the model signature (the schema of the model's inputs and outputs)
signature = infer_signature(X_train, y_train)
signature

inputs: 
  ['avg_energy': double (required), 'std_sensor_A': double (required), 'std_sensor_B': double (required), 'std_sensor_C': double (required), 'std_sensor_D': double (required), 'std_sensor_E': double (required), 'std_sensor_F': double (required)]
outputs: 
  [Tensor('int64', (-1,))]
params: 
  None
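A signature acts as a schema contract: at inference time, inputs can be checked against the column names and types recorded here. The snippet below is a simplified plain-Python illustration of that kind of check, not MLflow's actual enforcement logic. `EXPECTED_INPUTS` and `validate_row` are hypothetical names, and the column list is copied from the signature above.

```python
# Column names copied from the inferred signature; every input is a required double.
EXPECTED_INPUTS = {
    "avg_energy": float,
    "std_sensor_A": float,
    "std_sensor_B": float,
    "std_sensor_C": float,
    "std_sensor_D": float,
    "std_sensor_E": float,
    "std_sensor_F": float,
}


def validate_row(row: dict) -> None:
    """Raise ValueError if a required column is missing or has the wrong type."""
    missing = [name for name in EXPECTED_INPUTS if name not in row]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    wrong = [
        name
        for name, expected in EXPECTED_INPUTS.items()
        if not isinstance(row[name], expected)
    ]
    if wrong:
        raise ValueError(f"Columns with unexpected type: {wrong}")


good_row = {name: 1.0 for name in EXPECTED_INPUTS}  # made-up values
validate_row(good_row)  # passes silently

bad_row = dict(good_row)
del bad_row["std_sensor_F"]  # drop a required column
try:
    validate_row(bad_row)
    rejected = False
except ValueError:
    rejected = True
```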

In [0]:

# This ensures column names are included
input_example = X_train.iloc[[0]]  # Picks first row, keeps column names!
input_example

| | avg_energy | std_sensor_A | std_sensor_B | std_sensor_C | std_sensor_D | std_sensor_E | std_sensor_F |
|---|---|---|---|---|---|---|---|
| 2279 | 0.927237 | 1.100594 | 2.021509 | 3.124961 | 2.277981 | 3.052224 | 5.503263 |


### Build Logistic Regression Model

In [0]:
# ML model versions in UC must have a model signature (Or input example). If you want to set a signature on a model that's already logged or saved, the mlflow.models.set_signature() API is available for this purpose.


# Start MLflow run
with mlflow.start_run(run_name="Logistic Regression Run"):

    # Define and train the model
    lr_model = LogisticRegression(max_iter=200)
    lr_model.fit(X_train, y_train)


    # Predict and evaluate
    predictions = lr_model.predict(X_test)

    acc = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions, average="macro")
    recall = recall_score(y_test, predictions, average="macro")
    f1 = f1_score(y_test, predictions, average="macro")

    # Log model parameters
    mlflow.set_tag("model_family", "LogisticRegression")
    mlflow.log_param("max_iter", 200)

    # Log metrics
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("f1_score", f1)

    # Log the model
    mlflow.sklearn.log_model(lr_model, 
                            artifact_path="logreg_model",
                            input_example=input_example,
                            signature=signature,
                            #  registered_model_name="main.dbdemos_iot_turbine.dbdemos_turbine_maintenance"
                            )

    print(f"Logged with accuracy: {acc}")



Logged with accuracy: 0.9693165969316597
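The run above logs precision, recall and F1 with `average="macro"`: each metric is computed per class and then averaged with equal weight per class, regardless of how many samples each class has. A minimal plain-Python sketch of that computation, using made-up labels rather than the turbine data (`macro_scores` is a hypothetical helper, not a scikit-learn function):

```python
def macro_scores(y_true, y_pred):
    """Return (precision, recall, f1), each averaged with equal weight per class."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Made-up labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(precision, recall, f1)
```

Because every class counts equally, a rare failure class with poor recall pulls the macro average down even when overall accuracy stays high.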


### Build XGBoost Model

In [0]:
# Start MLflow run
with mlflow.start_run(run_name="XGBoost Run"):

    # Define and train the model
    xgb_model = XGBClassifier(
        objective="multi:softprob",
        num_class=3,
        max_depth=3,
        n_estimators=100,
        learning_rate=0.1,
        use_label_encoder=False,
        eval_metric="mlogloss"
    )
    xgb_model.fit(X_train, y_train)


    # Predict and evaluate
    predictions = xgb_model.predict(X_test)

    acc = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions, average="macro")
    recall = recall_score(y_test, predictions, average="macro")
    f1 = f1_score(y_test, predictions, average="macro")

    # Log model parameters
    mlflow.set_tag("model_family", "XGBOOST")
    mlflow.log_param("max_depth", 3)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("learning_rate", 0.1)

    # Log metrics
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("f1_score", f1)

    mlflow.xgboost.log_model(xgb_model, 
                             artifact_path="xgb_model",
                             input_example=input_example,
                             signature=signature,
                             )
    print(f"Logged with accuracy: {acc}")



Logged with accuracy: 0.9790794979079498


### Compare ML models to find the best performer

In [0]:
mlflow.search_runs(order_by=['metrics.accuracy DESC','start_time DESC'])



Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.recall,metrics.accuracy,metrics.precision,metrics.f1_score,params.learning_rate,params.n_estimators,params.max_depth,params.model_type,params.max_iter,tags.model_family,tags.mlflow.databricks.gitRepoCommit,tags.mlflow.user,tags.mlflow.source.name,tags.mlflow.runName,tags.mlflow.databricks.cluster.info.error,tags.mlflow.runColor,tags.mlflow.databricks.notebook.commandID,tags.mlflow.databricks.gitRepoProvider,tags.mlflow.databricks.workspaceURL,tags.mlflow.databricks.gitRepoStatus,tags.mlflow.databricks.gitRepoRelativePath,tags.mlflow.log-model.history,tags.mlflow.databricks.cluster.id,tags.mlflow.databricks.cluster.libraries.error,tags.mlflow.databricks.gitRepoReferenceType,tags.mlflow.databricks.notebookID,tags.mlflow.databricks.notebookPath,tags.mlflow.databricks.workspaceID,tags.mlflow.databricks.gitRepoUrl,tags.mlflow.databricks.gitRepoReference,tags.mlflow.databricks.webappURL,tags.mlflow.source.type,tags.mlflow.databricks.cluster.info,tags.mlflow.databricks.cluster.libraries
0,50b4b3f21e2a4caeafd89010a92df935,01751c9c53794379b592485104bc6ab0,FINISHED,dbfs:/databricks/mlflow-tracking/01751c9c53794...,2025-10-15 19:07:08.375000+00:00,2025-10-15 19:07:13.409000+00:00,0.969357,0.979079,0.976699,0.972979,0.1,100.0,3.0,,,XGBOOST,f4a651fbeb75be2fba52badb789bbc515600d2d3,kemalcan@berkeley.edu,/Users/kemalcan@berkeley.edu/e2e-data-science/...,XGBoost Run,INVALID_PARAMETER_VALUE: Cluster 0809-154351-l...,#7d54b2,1760555170114_7218925565958051044_85acce513b25...,gitHub,https://dbc-3a8de830-d6a2.cloud.databricks.com,unknown,lakehouse-iot-platform/04-Data-Science-ML/04.2...,"[{""artifact_path"":""xgb_model"",""flavors"":{""pyth...",0809-154351-lgw1ydqn-v2n,INVALID_ARGUMENT: INVALID_PARAMETER_VALUE: Clu...,branch,4368963137813901,/Users/kemalcan@berkeley.edu/e2e-data-science/...,4259139077557821,https://github.com/Berkeley-Data/e2e-data-scie...,kemalcan-feature-2,https://dbc-3a8de830-d6a2.cloud.databricks.com,NOTEBOOK,,
1,5d4cc080ec414352806cef86432d7c44,01751c9c53794379b592485104bc6ab0,FINISHED,dbfs:/databricks/mlflow-tracking/01751c9c53794...,2025-08-09 15:45:23.954000+00:00,2025-08-09 15:45:31.312000+00:00,0.969357,0.979079,0.976699,0.972979,0.1,100.0,3.0,XGBOOST,,,2268011268b41a756fcf8b247f32353b11fb18bf,kemalcan@berkeley.edu,/Users/kemalcan@berkeley.edu/e2e-data-science/...,XGBoost Run,,#da4c4c,1754754232754_8874398606133209524_ef18b40c8538...,gitHub,https://dbc-3a8de830-d6a2.cloud.databricks.com,unknown,lakehouse-iot-platform/04-Data-Science-ML/04.2...,"[{""artifact_path"":""xgb_model"",""flavors"":{""pyth...",0809-154351-lgw1ydqn-v2n,,branch,4368963137813901,/Users/kemalcan@berkeley.edu/e2e-data-science/...,4259139077557821,https://github.com/Berkeley-Data/e2e-data-scie...,final-student-version,https://dbc-3a8de830-d6a2.cloud.databricks.com,NOTEBOOK,"{""cluster_name"":"""",""spark_version"":""client.2.5...","{""installable"":[],""redacted"":[]}"
2,e1099a96af044b8c8aeebd95f7bfb4f0,01751c9c53794379b592485104bc6ab0,FINISHED,dbfs:/databricks/mlflow-tracking/01751c9c53794...,2025-10-15 19:07:02.384000+00:00,2025-10-15 19:07:08.073000+00:00,0.947482,0.969317,0.970082,0.958351,,,,,200.0,LogisticRegression,f4a651fbeb75be2fba52badb789bbc515600d2d3,kemalcan@berkeley.edu,/Users/kemalcan@berkeley.edu/e2e-data-science/...,Logistic Regression Run,INVALID_PARAMETER_VALUE: Cluster 0809-154351-l...,#479a5f,1760555170114_8324409258527997713_85acce513b25...,gitHub,https://dbc-3a8de830-d6a2.cloud.databricks.com,unknown,lakehouse-iot-platform/04-Data-Science-ML/04.2...,"[{""artifact_path"":""logreg_model"",""flavors"":{""p...",0809-154351-lgw1ydqn-v2n,INVALID_ARGUMENT: INVALID_PARAMETER_VALUE: Clu...,branch,4368963137813901,/Users/kemalcan@berkeley.edu/e2e-data-science/...,4259139077557821,https://github.com/Berkeley-Data/e2e-data-scie...,kemalcan-feature-2,https://dbc-3a8de830-d6a2.cloud.databricks.com,NOTEBOOK,,
3,8fe3da939ecc4683a6e71482fc91cc86,01751c9c53794379b592485104bc6ab0,FINISHED,dbfs:/databricks/mlflow-tracking/01751c9c53794...,2025-08-09 15:45:16.702000+00:00,2025-08-09 15:45:22.964000+00:00,0.947482,0.969317,0.970082,0.958351,,,,LogisticRegression,200.0,,2268011268b41a756fcf8b247f32353b11fb18bf,kemalcan@berkeley.edu,/Users/kemalcan@berkeley.edu/e2e-data-science/...,Logistic Regression Run,,#5387dd,1754754232754_7186461927566157459_ef18b40c8538...,gitHub,https://dbc-3a8de830-d6a2.cloud.databricks.com,unknown,lakehouse-iot-platform/04-Data-Science-ML/04.2...,"[{""artifact_path"":""logreg_model"",""flavors"":{""p...",0809-154351-lgw1ydqn-v2n,,branch,4368963137813901,/Users/kemalcan@berkeley.edu/e2e-data-science/...,4259139077557821,https://github.com/Berkeley-Data/e2e-data-scie...,final-student-version,https://dbc-3a8de830-d6a2.cloud.databricks.com,NOTEBOOK,"{""cluster_name"":"""",""spark_version"":""client.2.5...","{""installable"":[],""redacted"":[]}"


In [0]:
best_run = mlflow.search_runs(order_by=['metrics.accuracy DESC','start_time DESC']).iloc[0]

print(f"Best accuracy:                   {best_run['metrics.accuracy']}")
print(f"Best model type:                 {best_run['tags.model_family']}")

best_model_run_id = best_run['run_id']

print(f"Best model run id:               {best_model_run_id}")



Best accuracy:                   0.9790794979079498
Best model type:                 XGBOOST
Best model run id:               50b4b3f21e2a4caeafd89010a92df935
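`order_by=['metrics.accuracy DESC', 'start_time DESC']` sorts by accuracy first and breaks ties by recency, so `.iloc[0]` picks the newest of the top-scoring runs. The same ordering can be sketched in plain Python. The three records below are abbreviated from the run table above; `run_name`, `accuracy` and `start_time` are shortened field names chosen for this illustration.

```python
# Abbreviated copies of three runs from the search_runs output
runs = [
    {"run_name": "Logistic Regression Run", "accuracy": 0.969317, "start_time": "2025-10-15T19:07:02"},
    {"run_name": "XGBoost Run", "accuracy": 0.979079, "start_time": "2025-08-09T15:45:23"},
    {"run_name": "XGBoost Run", "accuracy": 0.979079, "start_time": "2025-10-15T19:07:08"},
]

# Sort by accuracy descending, then by start time descending.
# ISO-8601 timestamps in the same format compare correctly as strings.
ranked = sorted(runs, key=lambda r: (r["accuracy"], r["start_time"]), reverse=True)
best = ranked[0]
print(best["run_name"], best["start_time"])
```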


### Optimize our best performing model
In a real-world scenario, we would start by comparing various model types, then tune hyperparameters for the best performing model type. For this demo, we have only compared two model types: a logistic regression model and an XGBoost classifier.

To illustrate an optimization process, let's use [Optuna](https://optuna.org/) to tune our XGBoost model.

Optuna is an open-source hyperparameter optimization framework for machine learning and deep learning models. It automates the process of finding the optimal hyperparameters for a given model and dataset, which can significantly improve model performance and reduce the manual effort involved in hyperparameter tuning.
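Conceptually, a tuning run is a loop: propose hyperparameters, score them with an objective function, and remember the best result. The sketch below shows that loop with plain random search on a toy objective. It only illustrates the idea: Optuna's samplers are more sophisticated than uniform random sampling, and `toy_objective` and `random_search` are hypothetical stand-ins for a real train-and-evaluate step.

```python
import random


def toy_objective(max_depth, n_estimators):
    """Made-up score that peaks at max_depth=5, n_estimators=60."""
    return 1.0 - 0.01 * abs(max_depth - 5) - 0.001 * abs(n_estimators - 60)


def random_search(n_trials, seed=0):
    """Sample hyperparameters uniformly and keep the best-scoring combination."""
    rng = random.Random(seed)  # seeded for repeatable runs
    best_params, best_value, history = None, float("-inf"), []
    for _ in range(n_trials):
        params = {
            "max_depth": rng.randint(2, 15),
            "n_estimators": rng.randint(10, 100),
        }
        value = toy_objective(**params)
        history.append(value)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value, history


best_params, best_value, history = random_search(n_trials=10)
print(best_params, best_value)
```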

In [0]:
# Set the mlflow experiment
client = MlflowClient()
experiments = client.search_experiments()
notebook_name = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get().split("/")[-1]

# By default, the MLflow experiment name will be the notebook name
cnt = 0
for exp in experiments:
    if notebook_name in exp.name:
        print(f"Experiment Name: {exp.name}, Experiment ID: {exp.experiment_id}\n")
        cnt = cnt+1
        experiment_id = exp.experiment_id
if cnt > 1:
    print("Multiple experiments found. Please delete the other experiment manually.")
elif cnt == 1:
    print(f"Using experiment {experiment_id}")
    mlflow.set_experiment(experiment_id=experiment_id)

Experiment Name: /Users/kemalcan@berkeley.edu/e2e-data-science/lakehouse-iot-platform/04-Data-Science-ML/04.2-predictive_model_creation, Experiment ID: 01751c9c53794379b592485104bc6ab0

Using experiment 01751c9c53794379b592485104bc6ab0


Next, we instantiate an MLflow callback for Optuna. It automatically logs Optuna trial metrics, parameters, and user attributes to MLflow during hyperparameter optimization. The arguments specify:

- `tracking_uri='databricks'`: Use Databricks as the MLflow tracking server.  

- `metric_name="accuracy"`: Track the "accuracy" metric for each trial.  

- `create_experiment=False`: Do not create a new MLflow experiment; use the existing one.

- `mlflow_kwargs={"nested": True}`: Log each Optuna trial as a nested MLflow run under a parent run.  

- `tag_trial_user_attrs=True`: Log Optuna trial user attributes as MLflow tags.  

This callback is passed to Optuna's `study.optimize()` so that all trial results are automatically tracked in MLflow for easy comparison and analysis later.
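The callback mechanism itself is simple: after each trial finishes, the study invokes every registered callback with the study and the finished trial. The toy classes below mimic that hand-off in plain Python so the flow is visible. `ToyStudy`, `ToyTrial` and `RecordingCallback` are hypothetical and far simpler than Optuna's real classes, and the in-memory list stands in for the MLflow tracking server.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTrial:
    number: int
    params: dict
    value: float


@dataclass
class ToyStudy:
    trials: list = field(default_factory=list)

    def optimize(self, objective, param_grid, callbacks=()):
        """Evaluate each parameter set, then notify every callback."""
        for number, params in enumerate(param_grid):
            trial = ToyTrial(number, params, objective(params))
            self.trials.append(trial)
            for callback in callbacks:
                callback(self, trial)  # same (study, trial) call shape as Optuna callbacks


class RecordingCallback:
    """Stores one record per finished trial, standing in for a tracking server."""

    def __init__(self, metric_name):
        self.metric_name = metric_name
        self.records = []

    def __call__(self, study, trial):
        self.records.append(
            {"trial": trial.number, **trial.params, self.metric_name: trial.value}
        )


callback = RecordingCallback(metric_name="accuracy")
study = ToyStudy()
study.optimize(
    objective=lambda params: 1.0 / (1 + abs(params["max_depth"] - 5)),  # made-up score
    param_grid=[{"max_depth": 3}, {"max_depth": 5}],
    callbacks=[callback],
)
print(callback.records)
```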

In [0]:
# Instantiate the MLflow callback for logging Optuna runs to MLflow
mlflow_callback = MLflowCallback(
    tracking_uri='databricks',
    metric_name="accuracy",
    create_experiment=False,
    mlflow_kwargs={"nested": True},
    tag_trial_user_attrs=True
)

In [0]:
def objective(trial):
    """Define the objective function for hyperparameter tuning."""

    with mlflow.start_run(nested=True):
        # Invoke suggest methods of a Trial object to generate hyperparameters
        cl_num_class = trial.suggest_int('num_class', 2, 10)
        cl_max_depth = trial.suggest_int('max_depth', 2, 15)
        cl_n_estimators = trial.suggest_int('n_estimators', 10, 100)

        # Define our model, referencing the hyperparameters we just generated
        classifier_obj = XGBClassifier(
            objective="multi:softprob",
            num_class=cl_num_class,
            max_depth=cl_max_depth,
            n_estimators=cl_n_estimators,
            learning_rate=0.1,
            use_label_encoder=False,
            eval_metric="mlogloss"
        )
        classifier_obj.fit(X_train, y_train)
        y_pred = classifier_obj.predict(X_test)

        acc = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average="macro")
        recall = recall_score(y_test, y_pred, average="macro")
        f1 = f1_score(y_test, y_pred, average="macro")

        # Log model parameters
        mlflow.set_tag("model_family", "XGBOOST")
        mlflow.log_param("num_class", cl_num_class)
        mlflow.log_param("max_depth", cl_max_depth)
        mlflow.log_param("n_estimators", cl_n_estimators)
        mlflow.log_param("learning_rate", 0.1)

        # Log metrics
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("precision", precision)
        mlflow.log_metric("recall", recall)
        mlflow.log_metric("f1_score", f1)

    return acc  # An objective value linked with the Trial object

In [0]:
# Run the optimize trials
run_name = "optimize"

# Initiate the parent run and call the hyperparameter tuning child run logic
with mlflow.start_run(experiment_id=experiment_id, run_name=run_name, nested=True):
  # Initialize the Optuna study
  # Our goal is to maximize the accuracy (as defined in our objective function)
  study = optuna.create_study(direction="maximize", load_if_exists=True)

  # Execute the hyperparameter optimization trials.
  study.optimize(objective, n_trials=10, callbacks=[mlflow_callback])

  mlflow.log_params(study.best_params)
  mlflow.log_metric("accuracy", study.best_value)

  # Log tags
  mlflow.set_tags(
      tags={
          "project": "Turbine maintenance predictor",
          "optimizer_engine": "optuna",
          "model_family": "XGBOOST",
          "feature_set_version": 1,
      }
  )

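  # Refit a model with the best hyperparameters and log it.
  # best_params is a dict, so it must be unpacked into keyword arguments;
  # the fixed settings below mirror the ones used in objective().
  model = XGBClassifier(
      **study.best_params,
      objective="multi:softprob",
      learning_rate=0.1,
      eval_metric="mlogloss",
  ).fit(X_train, y_train)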
  y_pred = model.predict(X_test)

  precision = precision_score(y_test, y_pred, average="macro")
  recall = recall_score(y_test, y_pred, average="macro")
  f1 = f1_score(y_test, y_pred, average="macro")

  mlflow.log_metric("precision", precision)
  mlflow.log_metric("recall", recall)
  mlflow.log_metric("f1_score", f1)

  mlflow.xgboost.log_model(
    xgb_model=model,
    artifact_path="xgb_model",
    input_example=input_example,
    signature=signature,
  )

  # Get the logged model uri so that we can load it from the artifact store
  #model_uri = mlflow.get_artifact_uri("xgb_model")

[I 2025-10-15 19:07:16,288] A new study created in memory with name: no-name-3a2c11b5-ceb6-4dfb-b511-377e97b981a4
[I 2025-10-15 19:07:17,046] Trial 0 finished with value: 0.9832635983263598 and parameters: {'num_class': 6, 'max_depth': 5, 'n_estimators': 62}. Best is trial 0 with value: 0.9832635983263598.
[I 2025-10-15 19:07:18,080] Trial 1 finished with value: 0.9748953974895398 and parameters: {'num_class': 10, 'max_depth': 4, 'n_estimators': 10}. Best is trial 0 with value: 0.9832635983263598.
[I 2025-10-15 19:07:19,298] Trial 2 finished with value: 0.9832635983263598 and parameters: {'num_class': 10, 'max_depth': 5, 'n_estimators': 60}. Best is trial 0 with value: 0.9832635983263598.
[I 2025-10-15 19:07:20,396] Trial 3 finished with value: 0.9735006973500697 and parameters: {'num_class': 4, 'max_depth': 2, 'n_estimators': 42}. Best is trial 0 with value: 0.9832635983263598.
[I 2025-10-15 19:07:21,415] Trial 4 finished with 


In [0]:
# Show only runs with a logged model (our objective function did not save a model for every run) 
# and sort by accuracy
mlflow.search_runs(filter_string="tags.\"mlflow.log-model.history\" !=''", order_by=['metrics.accuracy DESC'])



Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.recall,metrics.accuracy,metrics.precision,metrics.f1_score,params.n_estimators,params.max_depth,params.num_class,params.learning_rate,params.model_type,params.max_iter,tags.model_family,tags.mlflow.databricks.gitRepoCommit,tags.optimizer_engine,tags.mlflow.rootRunId,tags.mlflow.user,tags.mlflow.source.name,tags.mlflow.runName,tags.mlflow.databricks.cluster.info.error,tags.mlflow.runColor,tags.mlflow.databricks.notebook.commandID,tags.mlflow.databricks.gitRepoProvider,tags.mlflow.databricks.workspaceURL,tags.mlflow.databricks.gitRepoStatus,tags.mlflow.databricks.gitRepoRelativePath,tags.mlflow.log-model.history,tags.mlflow.databricks.cluster.id,tags.feature_set_version,tags.mlflow.databricks.cluster.libraries.error,tags.mlflow.databricks.gitRepoReferenceType,tags.mlflow.databricks.notebookID,tags.mlflow.databricks.notebookPath,tags.mlflow.databricks.workspaceID,tags.mlflow.databricks.gitRepoUrl,tags.project,tags.mlflow.databricks.gitRepoReference,tags.mlflow.databricks.webappURL,tags.mlflow.source.type,tags.mlflow.databricks.cluster.info,tags.mlflow.databricks.cluster.libraries
0,1fd6d7f9beff4184889f7340767865f3,01751c9c53794379b592485104bc6ab0,FINISHED,dbfs:/databricks/mlflow-tracking/01751c9c53794...,2025-10-15 19:07:16.202000+00:00,2025-10-15 19:07:31.891000+00:00,0.968822,0.983264,0.973525,0.971155,62.0,5.0,6.0,,,,XGBOOST,f4a651fbeb75be2fba52badb789bbc515600d2d3,optuna,1fd6d7f9beff4184889f7340767865f3,kemalcan@berkeley.edu,/Users/kemalcan@berkeley.edu/e2e-data-science/...,optimize,INVALID_PARAMETER_VALUE: Cluster 0809-154351-l...,#e87b9f,1760555170114_5778336553049096846_85acce513b25...,gitHub,https://dbc-3a8de830-d6a2.cloud.databricks.com,unknown,lakehouse-iot-platform/04-Data-Science-ML/04.2...,"[{""artifact_path"":""xgb_model"",""flavors"":{""pyth...",0809-154351-lgw1ydqn-v2n,1.0,INVALID_ARGUMENT: INVALID_PARAMETER_VALUE: Clu...,branch,4368963137813901,/Users/kemalcan@berkeley.edu/e2e-data-science/...,4259139077557821,https://github.com/Berkeley-Data/e2e-data-scie...,Turbine maintenance predictor,kemalcan-feature-2,https://dbc-3a8de830-d6a2.cloud.databricks.com,NOTEBOOK,,
1,50b4b3f21e2a4caeafd89010a92df935,01751c9c53794379b592485104bc6ab0,FINISHED,dbfs:/databricks/mlflow-tracking/01751c9c53794...,2025-10-15 19:07:08.375000+00:00,2025-10-15 19:07:13.409000+00:00,0.969357,0.979079,0.976699,0.972979,100.0,3.0,,0.1,,,XGBOOST,f4a651fbeb75be2fba52badb789bbc515600d2d3,,,kemalcan@berkeley.edu,/Users/kemalcan@berkeley.edu/e2e-data-science/...,XGBoost Run,INVALID_PARAMETER_VALUE: Cluster 0809-154351-l...,#7d54b2,1760555170114_7218925565958051044_85acce513b25...,gitHub,https://dbc-3a8de830-d6a2.cloud.databricks.com,unknown,lakehouse-iot-platform/04-Data-Science-ML/04.2...,"[{""artifact_path"":""xgb_model"",""flavors"":{""pyth...",0809-154351-lgw1ydqn-v2n,,INVALID_ARGUMENT: INVALID_PARAMETER_VALUE: Clu...,branch,4368963137813901,/Users/kemalcan@berkeley.edu/e2e-data-science/...,4259139077557821,https://github.com/Berkeley-Data/e2e-data-scie...,,kemalcan-feature-2,https://dbc-3a8de830-d6a2.cloud.databricks.com,NOTEBOOK,,
2,5d4cc080ec414352806cef86432d7c44,01751c9c53794379b592485104bc6ab0,FINISHED,dbfs:/databricks/mlflow-tracking/01751c9c53794...,2025-08-09 15:45:23.954000+00:00,2025-08-09 15:45:31.312000+00:00,0.969357,0.979079,0.976699,0.972979,100.0,3.0,,0.1,XGBOOST,,,2268011268b41a756fcf8b247f32353b11fb18bf,,,kemalcan@berkeley.edu,/Users/kemalcan@berkeley.edu/e2e-data-science/...,XGBoost Run,,#da4c4c,1754754232754_8874398606133209524_ef18b40c8538...,gitHub,https://dbc-3a8de830-d6a2.cloud.databricks.com,unknown,lakehouse-iot-platform/04-Data-Science-ML/04.2...,"[{""artifact_path"":""xgb_model"",""flavors"":{""pyth...",0809-154351-lgw1ydqn-v2n,,,branch,4368963137813901,/Users/kemalcan@berkeley.edu/e2e-data-science/...,4259139077557821,https://github.com/Berkeley-Data/e2e-data-scie...,,final-student-version,https://dbc-3a8de830-d6a2.cloud.databricks.com,NOTEBOOK,"{""cluster_name"":"""",""spark_version"":""client.2.5...","{""installable"":[],""redacted"":[]}"
3,e1099a96af044b8c8aeebd95f7bfb4f0,01751c9c53794379b592485104bc6ab0,FINISHED,dbfs:/databricks/mlflow-tracking/01751c9c53794...,2025-10-15 19:07:02.384000+00:00,2025-10-15 19:07:08.073000+00:00,0.947482,0.969317,0.970082,0.958351,,,,,,200.0,LogisticRegression,f4a651fbeb75be2fba52badb789bbc515600d2d3,,,kemalcan@berkeley.edu,/Users/kemalcan@berkeley.edu/e2e-data-science/...,Logistic Regression Run,INVALID_PARAMETER_VALUE: Cluster 0809-154351-l...,#479a5f,1760555170114_8324409258527997713_85acce513b25...,gitHub,https://dbc-3a8de830-d6a2.cloud.databricks.com,unknown,lakehouse-iot-platform/04-Data-Science-ML/04.2...,"[{""artifact_path"":""logreg_model"",""flavors"":{""p...",0809-154351-lgw1ydqn-v2n,,INVALID_ARGUMENT: INVALID_PARAMETER_VALUE: Clu...,branch,4368963137813901,/Users/kemalcan@berkeley.edu/e2e-data-science/...,4259139077557821,https://github.com/Berkeley-Data/e2e-data-scie...,,kemalcan-feature-2,https://dbc-3a8de830-d6a2.cloud.databricks.com,NOTEBOOK,,
4,8fe3da939ecc4683a6e71482fc91cc86,01751c9c53794379b592485104bc6ab0,FINISHED,dbfs:/databricks/mlflow-tracking/01751c9c53794...,2025-08-09 15:45:16.702000+00:00,2025-08-09 15:45:22.964000+00:00,0.947482,0.969317,0.970082,0.958351,,,,,LogisticRegression,200.0,,2268011268b41a756fcf8b247f32353b11fb18bf,,,kemalcan@berkeley.edu,/Users/kemalcan@berkeley.edu/e2e-data-science/...,Logistic Regression Run,,#5387dd,1754754232754_7186461927566157459_ef18b40c8538...,gitHub,https://dbc-3a8de830-d6a2.cloud.databricks.com,unknown,lakehouse-iot-platform/04-Data-Science-ML/04.2...,"[{""artifact_path"":""logreg_model"",""flavors"":{""p...",0809-154351-lgw1ydqn-v2n,,,branch,4368963137813901,/Users/kemalcan@berkeley.edu/e2e-data-science/...,4259139077557821,https://github.com/Berkeley-Data/e2e-data-scie...,,final-student-version,https://dbc-3a8de830-d6a2.cloud.databricks.com,NOTEBOOK,"{""cluster_name"":"""",""spark_version"":""client.2.5...","{""installable"":[],""redacted"":[]}"


In [0]:
# Compare previous best_run to the best optimized run
print(f"Best previous accuracy:          {best_run['metrics.accuracy']}")
print(f"Best previous run id:            {best_model_run_id}")

best_opt_run = mlflow.search_runs(filter_string="tags.\"mlflow.log-model.history\" !=''", order_by=['metrics.accuracy DESC']).iloc[0]

print(f"Best optimized accuracy:         {best_opt_run['metrics.accuracy']}")

best_opt_run_id = best_opt_run['run_id']

print(f"Best optimized model run id:     {best_opt_run_id}")

if best_opt_run['metrics.accuracy'] > best_run['metrics.accuracy']:
    print("The accuracy improved by running parameter tuning")
else:
    print("The accuracy did not improve by running parameter tuning")    


Best previous accuracy:          0.9790794979079498
Best previous run id:            50b4b3f21e2a4caeafd89010a92df935
Best optimized accuracy:         0.9832635983263598
Best optimized model run id:     1fd6d7f9beff4184889f7340767865f3
The accuracy improved by running parameter tuning
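To put the improvement in perspective, the two accuracies printed above differ by well under one percentage point. A minimal sketch of the arithmetic, using the values from this run (the variable names are illustrative, not part of the notebook):

```python
# Accuracies copied from the output above.
previous_accuracy = 0.9790794979079498
optimized_accuracy = 0.9832635983263598

# Absolute gain, expressed in percentage points.
gain_points = (optimized_accuracy - previous_accuracy) * 100
print(f"Accuracy gain: {gain_points:.2f} percentage points")
```

A gain of this size can be within run-to-run variation, so it may be worth confirming it on a separate hold-out set before relying on it.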


Now we can test the best model: load it back from MLflow and check that it produces predictions as expected.

In [0]:
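# The artifact path must match the run referenced in the URI, so check the optimized run's model family
best_model_uri = f"runs:/{best_opt_run_id}/logreg_model" if best_opt_run['tags.model_family'] == "LogisticRegression" else f"runs:/{best_opt_run_id}/xgb_model"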

model_name = "turbine_maintenance"
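A `runs:/` URI combines a run id with the artifact path under which the model was logged, so the path has to belong to the same run as the id. The sketch below isolates that mapping in a small function. `build_model_uri` is a hypothetical helper, not part of MLflow; the two artifact paths are the ones used in this notebook, and any family other than `LogisticRegression` falls back to `xgb_model`, mirroring the cell above.

```python
# Artifact paths used when the models were logged in this notebook.
ARTIFACT_PATHS = {"LogisticRegression": "logreg_model"}
DEFAULT_ARTIFACT_PATH = "xgb_model"


def build_model_uri(run_id: str, model_family: str) -> str:
    """Return the runs:/ URI for a run, given its model family (hypothetical helper)."""
    artifact_path = ARTIFACT_PATHS.get(model_family, DEFAULT_ARTIFACT_PATH)
    return f"runs:/{run_id}/{artifact_path}"


# Run id taken from the comparison output above.
print(build_model_uri("1fd6d7f9beff4184889f7340767865f3", "XGBOOST"))
```

Keeping the mapping in one place makes it easier to add another model family later without touching the URI formatting.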

In [0]:
# Make sure the model is loaded correctly from MLflow
loaded_model = mlflow.pyfunc.load_model(best_model_uri)
loaded_model

mlflow.pyfunc.loaded_model:
  artifact_path: xgb_model
  flavor: mlflow.xgboost
  run_id: 1fd6d7f9beff4184889f7340767865f3



In [0]:
loaded_model.predict(X_test)

array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 3, 1, 0, 1, 1,
       1, 1, 1, 3, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 0, 2, 1, 1, 1, 3, 1, 3, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 2, 1,
       1, 1, 0, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 2, 1, 2, 1, 1, 1,
       3, 1, 1, 3, 3, 1, 2, 1, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 0, 1, 2, 1,
       1, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 3, 3, 1, 1, 1, 1, 0, 0, 3, 3, 2,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 3, 1, 1, 1, 2, 2, 3, 1, 2, 2, 1,
       3, 1, 0, 3, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 3, 1, 1, 2, 3, 1,
       3, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 3, 1, 2, 1, 0, 1,
       3, 0, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,
       3, 1, 1, 1, 0, 0, 1, 1, 1, 1, 3, 1, 3, 1, 1, 1, 3, 1, 2, 1, 1, 3,
       1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 0, 1, 1, 1, 1, 2, 3, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 3, 1, 1, 1, 1, 3, 0, 1, 1, 3,
       1, 1, 1, 1, 3, 1, 3, 1, 1, 1, 2, 1, 2, 2, 1,
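A raw array of labels is hard to read at a glance. One quick sanity check is to count how often each class is predicted. The sketch below does this for the first row of the output above only; in the notebook, the same `Counter` call could presumably be applied to the full result of `loaded_model.predict(X_test)`.

```python
from collections import Counter

# First row of the prediction array printed above.
sample_predictions = [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 3, 1, 0, 1, 1]

# Count how many times each predicted class appears.
class_counts = Counter(sample_predictions)
for label, count in sorted(class_counts.items()):
    print(f"class {label}: {count}")
```

If a single class dominates the predictions, accuracy alone can be misleading, and per-class metrics such as precision and recall give a better picture.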

Now that we are satisfied with the model, we can register it in Unity Catalog.

In [0]:

latest_model = mlflow.register_model(best_model_uri, f"{catalog}.{db}.{model_name}")


Successfully registered model 'main.e2eai_iot_turbine.turbine_maintenance'.


Created version '1' of model 'main.e2eai_iot_turbine.turbine_maintenance'.


Once we're ready, we can flag this model version as production-ready, either in a click from the UI or through the API. The model is already registered in Unity Catalog, so all that remains is to assign the `prod` alias to the version we just created.


In [0]:
# Flag it as Production ready using UC Aliases
MlflowClient().set_registered_model_alias(name=f"{catalog}.{db}.{model_name}", 
                                          alias="prod", 
                                          version=latest_model.version)
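Once the alias is set, downstream code can refer to the model by alias instead of a fixed version number, using MLflow's `models:/<name>@<alias>` URI form. The sketch below only builds that URI. `alias_model_uri` is a hypothetical helper, and the catalog and schema names are the ones shown in the registration output above.

```python
def alias_model_uri(catalog: str, db: str, model_name: str, alias: str) -> str:
    """Build a models:/ URI that points at whichever version holds the alias."""
    return f"models:/{catalog}.{db}.{model_name}@{alias}"


# Names taken from the registration output above.
prod_uri = alias_model_uri("main", "e2eai_iot_turbine", "turbine_maintenance", "prod")
print(prod_uri)

# In the notebook, this URI could then be passed to MLflow, for example:
#   prod_model = mlflow.pyfunc.load_model(prod_uri)
```

Because the alias is resolved at load time, moving `prod` to a newer version later does not require changing the consumers.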

