![](../_resources/images/e2eai-4.jpg)


# Data Science with Databricks

## ML is key to wind turbine farm optimization

The current market makes energy even more strategic than before. Being able to ingest and analyze our wind turbine state is a first step, but it isn't enough to thrive in a very competitive market.

We need to go further to optimize our energy production, reduce maintenance costs, and reduce downtime. Modern data companies achieve this with AI.

<style>
.right_box{
  margin: 30px; box-shadow: 10px -10px #CCC; width:650px;height:300px; background-color: #1b3139ff; box-shadow:  0 0 10px  rgba(0,0,0,0.6);
  border-radius:25px;font-size: 35px; float: left; padding: 20px; color: #f9f7f4; }
.badge {
  clear: left; float: left; height: 30px; width: 30px;  display: table-cell; vertical-align: middle; border-radius: 50%; background: #fcba33ff; text-align: center; color: white; margin-right: 10px}
.badge_b { 
  height: 35px}
</style>
<link href='https://fonts.googleapis.com/css?family=DM Sans' rel='stylesheet'>
<div style="font-family: 'DM Sans'; display: flex; align-items: flex-start;">
  <!-- Left Section -->
  <div style="width: 50%; color: #1b3139; padding-right: 20px;">
    <div style="color: #ff5f46; font-size:80px;">90%</div>
    <div style="font-size:30px; margin-top: -20px; line-height: 30px;">
      Enterprise applications will be AI-augmented by 2025 —IDC
    </div>
    <div style="color: #ff5f46; font-size:80px;">$10T+</div>
    <div style="font-size:30px; margin-top: -20px; line-height: 30px;">
       Projected business value creation by AI in 2030 —PWC
    </div>
  </div>

  <!-- Right Section -->
  <div class="right_box" style="width: 50%; color: red; font-size: 30px; line-height: 1.5; padding-left: 20px;">
    But—huge challenges getting ML to work at scale!<br/><br/>
    In fact, most ML projects still fail before getting to production
  </div>
</div>

## Machine learning is data + transforms.

ML is hard because delivering value to business lines isn't only about building a model. <br>
The ML lifecycle is made of data pipelines: data preprocessing, feature engineering, training, inference, monitoring, and retraining...<br>
Stepping back, all pipelines are data + code.


<img style="float: right; margin-top: 10px" width="500px" src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/refs/heads/main/images/manufacturing/lakehouse-iot-turbine/team_flow_marc.png" />

<img src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/refs/heads/main/images/marc.png" style="float: left;" width="80px"> 
<h3 style="padding: 10px 0px 0px 5px">Marc, as a Data Scientist, needs a data + ML platform accelerating all the ML & DS steps:</h3>

<div style="font-size: 19px; margin-left: 73px; clear: left">
<div class="badge_b"><div class="badge">1</div> Build Data Pipeline</div>
<div class="badge_b"><div class="badge">2</div> Data Exploration</div>
<div class="badge_b"><div class="badge">3</div> Feature creation</div>
<div class="badge_b"><div class="badge">4</div> Build & train model</div>
<div class="badge_b"><div class="badge">5</div> Deploy Model (Batch or serverless realtime)</div>
<div class="badge_b"><div class="badge">6</div> Monitoring</div>
</div>

**Marc needs a Data Intelligence Platform**. Let's see how we can deploy a Predictive Maintenance model in production with Databricks.


# Predictive maintenance

Let's see how we can now leverage the sensor data to build a predictive maintenance model.

Our first step as a Data Scientist is to analyze the data and build the features we'll use to train our model.

The sensor table enriched with turbine data has been saved within our Delta Live Tables pipeline. All we have to do is read this information, analyze it, and create an ML model.

<img src="https://github.com/databricks-demos/dbdemos-resources/raw/main/images/manufacturing/lakehouse-iot-turbine/lakehouse-manuf-iot-ds-flow.png" width="1000px">




In [0]:
%pip install --quiet databricks-sdk==0.40.0 mlflow==2.22.0 optuna optuna-integration[mlflow] xgboost
dbutils.library.restartPython()

In [0]:
%run ../_resources/00-setup $reset_all_data=false

In [0]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)

from xgboost import XGBClassifier
import numpy as np
import pandas as pd
import mlflow
from mlflow.models import infer_signature
from mlflow import MlflowClient
from mlflow.deployments import get_deploy_client
import optuna
from optuna.integration.mlflow import MLflowCallback
import os
import requests
import json



## Accelerating Predictive maintenance model creation using MLflow in Databricks

MLflow is an open source project for model tracking, packaging, and deployment. Every time your data science team works on a model, Databricks tracks the parameters and data used and saves them. This ensures ML traceability and reproducibility, making it easy to know which model was built with which parameters and data.

### A glass-box solution that empowers data teams without taking away control

While Databricks simplifies model deployment and governance (MLOps) with MLflow, bootstrapping new ML projects can still be long and inefficient. 


<img width="1000" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/retail/resources/images/auto-ml-full.png"/>


Models can be deployed directly, or you can use the generated notebooks to bootstrap projects with best practices, saving you weeks of effort.

<br style="clear: both">

<img width="600" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/retail/resources/images/churn-auto-ml.png"/>



In [0]:
mlflow.set_registry_uri('databricks-uc')

### Create Train / Test Splits

In [0]:
features_table_name=f'{catalog}.{db}.turbine_hourly_features'

# Read training dataset from catalog
training_dataset = spark.table(features_table_name).drop('turbine_id')

# Prepare features and labels (convert to pandas once, then select columns)
training_pdf = training_dataset.toPandas()
X = training_pdf[['avg_energy', 'std_sensor_A', 'std_sensor_B', 'std_sensor_C', 'std_sensor_D', 'std_sensor_E', 'std_sensor_F']]

y = training_pdf['abnormal_sensor']

# Encode labels
y_encoded = pd.factorize(y)[0]

# Split the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y_encoded,
    test_size=0.2, 
    random_state=42
)
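
The cell above keeps only the integer codes returned by `pd.factorize` (index `[0]`) and discards the second return value, the unique labels. Keeping that second value is what lets you turn a predicted code back into a sensor name. The sketch below is a dependency-free illustration of the same first-appearance encoding; the function name and the sample label values are hypothetical and not taken from the features table.

```python
def factorize_labels(labels):
    """Encode labels as integers in order of first appearance.

    Returns (codes, uniques) such that uniques[code] recovers the original label.
    """
    uniques = []   # label for each code, in order of first appearance
    index = {}     # label -> code lookup
    codes = []
    for label in labels:
        if label not in index:
            index[label] = len(uniques)
            uniques.append(label)
        codes.append(index[label])
    return codes, uniques


# Hypothetical label values, for illustration only
sample_labels = ["ok", "sensor_B", "ok", "sensor_D", "sensor_B"]
codes, uniques = factorize_labels(sample_labels)

# Decode integer predictions back to label names
decoded = [uniques[c] for c in codes]
print(codes, uniques)
```

With pandas, the equivalent would be to unpack both return values, for example `y_encoded, label_names = pd.factorize(y)`, and use `label_names[predicted_code]` when presenting predictions.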


In [0]:

# Infer the model signature (the schema of the model inputs and outputs)
signature = infer_signature(X_train, y_train)
signature

In [0]:

# This ensures column names are included
input_example = X_train.iloc[[0]]  # Picks first row, keeps column names!
input_example

### Build Logistic Regression Model

In [0]:
# ML model versions in UC must have a model signature (Or input example). If you want to set a signature on a model that's already logged or saved, the mlflow.models.set_signature() API is available for this purpose.


# Start MLflow run
with mlflow.start_run(run_name="Logistic Regression Run"):

    # Define and train the model
    lr_model = LogisticRegression(max_iter=200)
    lr_model.fit(X_train, y_train)


    # Predict and evaluate
    predictions = lr_model.predict(X_test)

    acc = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions, average="macro")
    recall = recall_score(y_test, predictions, average="macro")
    f1 = f1_score(y_test, predictions, average="macro")

    # Log model parameters
    mlflow.set_tag("model_family", "LogisticRegression")
    mlflow.log_param("max_iter", 200)

    # Log metrics
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("f1_score", f1)

    # Log the model
    mlflow.sklearn.log_model(lr_model, 
                            artifact_path="logreg_model",
                            input_example=input_example,
                            signature=signature,
                            #  registered_model_name="main.dbdemos_iot_turbine.dbdemos_turbine_maintenance"
                            )

    print(f"Logged with accuracy: {acc}")
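
The metrics above use `average="macro"`, which computes each metric per class and then takes the unweighted mean, so a rarely occurring fault class counts as much as the majority class. The following is a minimal, dependency-free sketch of that calculation, assuming a score of 0 when a class has no predicted or true samples; the labels are made up for illustration.

```python
def macro_scores(y_true, y_pred):
    """Return (precision, recall, f1), each averaged equally over classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Made-up labels: class 1 is rare and is never predicted
y_true_demo = [0, 0, 0, 0, 1, 2]
y_pred_demo = [0, 0, 0, 0, 0, 2]

accuracy_demo = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
precision_demo, recall_demo, f1_demo = macro_scores(y_true_demo, y_pred_demo)
print(accuracy_demo, precision_demo, recall_demo, f1_demo)
```

In this toy case accuracy stays high while the macro scores drop, because the missed class pulls the average down. That is why it is worth logging the macro metrics alongside accuracy when fault classes are imbalanced.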


### Build XGBoost Model

In [0]:
# Start MLflow run
with mlflow.start_run(run_name="XGBoost Run"):

    # Define and train the model
    xgb_model = XGBClassifier(
        objective="multi:softprob",
        num_class=3,
        max_depth=3,
        n_estimators=100,
        learning_rate=0.1,
        use_label_encoder=False,
        eval_metric="mlogloss"
    )
    xgb_model.fit(X_train, y_train)


    # Predict and evaluate
    predictions = xgb_model.predict(X_test)

    acc = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions, average="macro")
    recall = recall_score(y_test, predictions, average="macro")
    f1 = f1_score(y_test, predictions, average="macro")

    # Log model parameters
    mlflow.set_tag("model_family", "XGBOOST")
    mlflow.log_param("max_depth", 3)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("learning_rate", 0.1)

    # Log metrics
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("f1_score", f1)

    mlflow.xgboost.log_model(xgb_model, 
                             artifact_path="xgb_model",
                             input_example=input_example,
                             signature=signature,
                             )
    print(f"Logged with accuracy: {acc}")


### Compare ML models to find best performing

In [0]:
mlflow.search_runs(order_by=['metrics.accuracy DESC','start_time DESC'])

In [0]:
best_run = mlflow.search_runs(order_by=['metrics.accuracy DESC','start_time DESC']).iloc[0]

print(f"Best accuracy:                   {best_run['metrics.accuracy']}")
print(f"Best model type:                 {best_run['tags.model_family']}")

best_model_run_id = best_run['run_id']

print(f"Best model run id:               {best_model_run_id}")
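
The `order_by` list used above sorts by accuracy first and uses the start time as a tie-breaker, so among equally accurate runs the most recent one is selected. Here is a small sketch of that selection rule on plain dictionaries; the run records and their values are hypothetical stand-ins for the rows returned by `mlflow.search_runs`.

```python
# Hypothetical run records, loosely shaped like the columns used in the cell above
runs = [
    {"run_id": "run_a", "accuracy": 0.91, "start_time": 100, "model_family": "LogisticRegression"},
    {"run_id": "run_b", "accuracy": 0.95, "start_time": 200, "model_family": "XGBOOST"},
    {"run_id": "run_c", "accuracy": 0.95, "start_time": 300, "model_family": "XGBOOST"},
]


def pick_best_run(runs):
    """Return the run with the highest accuracy; the most recent run wins ties."""
    return max(runs, key=lambda r: (r["accuracy"], r["start_time"]))


best = pick_best_run(runs)
print(best["run_id"], best["model_family"])
```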

### Optimize our best performing model
In a real-world scenario, we would start by comparing various model types, then tune hyperparameters for the best performing model type. For this demo, we have only compared two model types: a logistic regression model and an XGBoost classifier.

To illustrate an optimization process, let's use [Optuna](https://optuna.org/) to tune our XGBoost model.

Optuna is an open-source hyperparameter optimization framework for machine learning and deep learning models. It automates the process of finding the optimal hyperparameters for a given model and dataset, which can significantly improve model performance and reduce the manual effort involved in hyperparameter tuning.
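
To make the idea concrete, the sketch below shows the loop that any hyperparameter optimizer runs: propose parameters, score them with an objective function, and keep the best result. It deliberately uses plain random search from the standard library rather than Optuna, whose default sampler (a Tree-structured Parzen Estimator) chooses new candidates based on earlier trials. The objective here is a hypothetical toy function, not a trained model.

```python
import random


def toy_objective(params):
    """Hypothetical score surface that peaks at max_depth=6, n_estimators=80."""
    depth_penalty = (params["max_depth"] - 6) ** 2
    trees_penalty = ((params["n_estimators"] - 80) / 10) ** 2
    return -depth_penalty - trees_penalty


def random_search(objective, n_trials, seed=0):
    """Sample parameters at random and return the best (score, params) seen."""
    rng = random.Random(seed)  # seeded so the search is reproducible
    history = []
    for _ in range(n_trials):
        params = {
            "max_depth": rng.randint(2, 15),       # same range as a suggest_int(2, 15)
            "n_estimators": rng.randint(10, 100),  # same range as a suggest_int(10, 100)
        }
        history.append((objective(params), params))
    best_score, best_params = max(history, key=lambda item: item[0])
    return best_score, best_params, history


best_score, best_params, history = random_search(toy_objective, n_trials=10)
print(best_score, best_params)
```

In the notebook, the objective trains and evaluates an XGBoost model instead of calling a toy function, and Optuna takes over the sampling and bookkeeping.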

In [0]:
# Set the mlflow experiment
client = MlflowClient()
experiments = client.search_experiments()
notebook_name = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get().split("/")[-1]

# By default, the MLflow experiment name will be the notebook name
cnt = 0
for exp in experiments:
    if notebook_name in exp.name:
        print(f"Experiment Name: {exp.name}, Experiment ID: {exp.experiment_id}\n")
        cnt = cnt+1
        experiment_id = exp.experiment_id
if cnt > 1:
    print("Multiple experiments found. Please delete the other experiment manually.")
elif cnt == 1:
    print(f"Using experiment {experiment_id}")
    mlflow.set_experiment(experiment_id=experiment_id)

In [0]:
# Instantiate the MLflow callback for logging Optuna runs to MLflow
mlflow_callback = MLflowCallback(
    tracking_uri='databricks',
    metric_name="accuracy",
    create_experiment=False,
    mlflow_kwargs={"nested": True},
    tag_trial_user_attrs=True
)

In [0]:
def objective(trial):
    """Define the objective function for hyperparameter tuning."""

    with mlflow.start_run(nested=True):
        # Invoke suggest methods of a Trial object to generate hyperparameters
        cl_num_class = trial.suggest_int('num_class', 2, 10)
        cl_max_depth = trial.suggest_int('max_depth', 2, 15)
        cl_n_estimators = trial.suggest_int('n_estimators', 10, 100)

        # Define our model, referencing the hyperparameters we just generated
        classifier_obj = XGBClassifier(
            objective="multi:softprob",
            num_class=cl_num_class,
            max_depth=cl_max_depth,
            n_estimators=cl_n_estimators,
            learning_rate=0.1,
            use_label_encoder=False,
            eval_metric="mlogloss"
        )
        classifier_obj.fit(X_train, y_train)
        y_pred = classifier_obj.predict(X_test)

        acc = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average="macro")
        recall = recall_score(y_test, y_pred, average="macro")
        f1 = f1_score(y_test, y_pred, average="macro")

        # Log model parameters
        mlflow.set_tag("model_family", "XGBOOST")
        mlflow.log_param("num_class", cl_num_class)
        mlflow.log_param("max_depth", cl_max_depth)
        mlflow.log_param("n_estimators", cl_n_estimators)
        mlflow.log_param("learning_rate", 0.1)

        # Log metrics
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("precision", precision)
        mlflow.log_metric("recall", recall)
        mlflow.log_metric("f1_score", f1)

    return acc  # An objective value linked with the Trial object

In [0]:
# Run the optimize trials
run_name = "optimize"

# Initiate the parent run and call the hyperparameter tuning child run logic
with mlflow.start_run(experiment_id=experiment_id, run_name=run_name, nested=True):
  # Initialize the Optuna study
  # Our goal is to maximize the accuracy (as defined in our objective function)
  study = optuna.create_study(direction="maximize", load_if_exists=True)

  # Execute the hyperparameter optimization trials.
  study.optimize(objective, n_trials=10, callbacks=[mlflow_callback])

  mlflow.log_params(study.best_params)
  mlflow.log_metric("accuracy", study.best_value)

  # Log tags
  mlflow.set_tags(
      tags={
          "project": "Turbine maintenance predictor",
          "optimizer_engine": "optuna",
          "model_family": "XGBOOST",
          "feature_set_version": 1,
      }
  )

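  # Refit a model with the best hyperparameters found by the study
  # (unpack the dict as keyword arguments and reuse the fixed settings from the objective)
  model = XGBClassifier(
      objective="multi:softprob",
      learning_rate=0.1,
      eval_metric="mlogloss",
      **study.best_params
  ).fit(X_train, y_train)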
  y_pred = model.predict(X_test)

  precision = precision_score(y_test, y_pred, average="macro")
  recall = recall_score(y_test, y_pred, average="macro")
  f1 = f1_score(y_test, y_pred, average="macro")

  mlflow.log_metric("precision", precision)
  mlflow.log_metric("recall", recall)
  mlflow.log_metric("f1_score", f1)

  mlflow.xgboost.log_model(
    xgb_model=model,
    artifact_path="xgb_model",
    input_example=input_example,
    signature=signature,
  )

  # Get the logged model uri so that we can load it from the artifact store
  #model_uri = mlflow.get_artifact_uri("xgb_model")

In [0]:
# Show only runs with a logged model (our objective function did not save a model for every run) 
# and sort by accuracy
mlflow.search_runs(filter_string="tags.\"mlflow.log-model.history\" !=''", order_by=['metrics.accuracy DESC'])

In [0]:
# Compare previous best_run to the best optimized run
print(f"Best previous accuracy:          {best_run['metrics.accuracy']}")
print(f"Best previous run id:            {best_model_run_id}")

best_opt_run = mlflow.search_runs(filter_string="tags.\"mlflow.log-model.history\" !=''", order_by=['metrics.accuracy DESC']).iloc[0]

print(f"Best optimized accuracy:         {best_opt_run['metrics.accuracy']}")

best_opt_run_id = best_opt_run['run_id']

print(f"Best optimized model run id:     {best_opt_run_id}")

if best_opt_run['metrics.accuracy'] > best_run['metrics.accuracy']:
    print("The accuracy improved by running parameter tuning")
else:
    print("The accuracy did not improve by running parameter tuning")    

Now we can load the best model and check that it works as expected.

In [0]:
# Use the model family of the selected run so the artifact path matches the run id
best_model_uri = f"runs:/{best_opt_run_id}/logreg_model" if best_opt_run['tags.model_family'] == "LogisticRegression" else f"runs:/{best_opt_run_id}/xgb_model"

model_name = "turbine_maintenance"
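
A `runs:/` URI has the form `runs:/<run_id>/<artifact_path>`, and the artifact path must be the one used when that same run logged its model (`logreg_model` or `xgb_model` in this notebook). As a sketch, the mapping can be made explicit with a small helper; the helper name and the example run id are hypothetical.

```python
# Artifact paths as used by the log_model calls earlier in this notebook
ARTIFACT_PATHS = {
    "LogisticRegression": "logreg_model",
    "XGBOOST": "xgb_model",
}


def run_model_uri(run_id, model_family):
    """Build the runs:/ URI for the model logged by a run of the given family."""
    if model_family not in ARTIFACT_PATHS:
        raise ValueError(f"Unknown model family: {model_family!r}")
    return f"runs:/{run_id}/{ARTIFACT_PATHS[model_family]}"


# Hypothetical run id, for illustration only
example_uri = run_model_uri("abc123", "XGBOOST")
print(example_uri)
```

Failing loudly on an unknown family avoids silently pointing at an artifact path that the run never wrote.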

In [0]:
# Make sure the model is loaded correctly from MLflow
loaded_model = mlflow.pyfunc.load_model(best_model_uri)
loaded_model

In [0]:
loaded_model.predict(X_test)

Now that we are satisfied with the model, we can register it in Unity Catalog.

In [0]:

latest_model = mlflow.register_model(best_model_uri, f"{catalog}.{db}.{model_name}")


Once we're ready, we can flag this model version as production-ready, either in a click from the UI or using the API. With Unity Catalog, this is done by assigning an alias (here `prod`) to the registered model version.


In [0]:
# Flag it as Production ready using UC Aliases
MlflowClient().set_registered_model_alias(name=f"{catalog}.{db}.{model_name}", 
                                          alias="prod", 
                                          version=latest_model.version)
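
Once the alias is set, downstream code can refer to the model by alias instead of by version number, using a URI of the form `models:/<catalog>.<schema>.<model_name>@<alias>`. The sketch below only builds that URI; the helper name is hypothetical, and the catalog and schema values are example placeholders rather than the values of `catalog` and `db` in your workspace.

```python
def alias_model_uri(catalog, schema, model_name, alias):
    """Build a models:/ URI that resolves a Unity Catalog model by alias."""
    return f"models:/{catalog}.{schema}.{model_name}@{alias}"


# Example placeholder values; substitute your own catalog and schema
uri = alias_model_uri("main", "dbdemos_iot_turbine", "turbine_maintenance", "prod")
print(uri)

# On Databricks, with the registry URI set to 'databricks-uc', the model could then be loaded with:
# prod_model = mlflow.pyfunc.load_model(uri)
```

Because the alias can be reassigned to a newer version later, consumers that load by alias pick up the new version without any code change.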