
# Data Science with Databricks

## ML is key to wind turbine farm optimization

The current market makes energy even more strategic than before. Being able to ingest and analyze our wind turbine state is a first step, but it isn't enough to thrive in a very competitive market.

We need to go further to optimize our energy production, reduce maintenance costs, and reduce downtime. Modern data companies achieve this with AI.

<style>
.right_box{
  margin: 30px; box-shadow: 10px -10px #CCC; width:650px;height:300px; background-color: #1b3139ff; box-shadow:  0 0 10px  rgba(0,0,0,0.6);
  border-radius:25px;font-size: 35px; float: left; padding: 20px; color: #f9f7f4; }
.badge {
  clear: left; float: left; height: 30px; width: 30px;  display: table-cell; vertical-align: middle; border-radius: 50%; background: #fcba33ff; text-align: center; color: white; margin-right: 10px}
.badge_b { 
  height: 35px}
</style>
<link href='https://fonts.googleapis.com/css?family=DM Sans' rel='stylesheet'>
<div style="font-family: 'DM Sans'; display: flex; align-items: flex-start;">
  <!-- Left Section -->
  <div style="width: 50%; color: #1b3139; padding-right: 20px;">
    <div style="color: #ff5f46; font-size:80px;">90%</div>
    <div style="font-size:30px; margin-top: -20px; line-height: 30px;">
      Enterprise applications will be AI-augmented by 2025 —IDC
    </div>
    <div style="color: #ff5f46; font-size:80px;">$10T+</div>
    <div style="font-size:30px; margin-top: -20px; line-height: 30px;">
       Projected business value creation by AI in 2030 —PWC
    </div>
  </div>

  <!-- Right Section -->
  <div class="right_box" style="width: 50%; color: red; font-size: 30px; line-height: 1.5; padding-left: 20px;">
    But—huge challenges getting ML to work at scale!<br/><br/>
    In fact, most ML projects still fail before getting to production
  </div>
</div>

## Machine learning is data + transforms.

ML is hard because delivering value to business lines isn't only about building a model. <br>
The ML lifecycle is made of data pipelines: data preprocessing, feature engineering, training, inference, monitoring, and retraining...<br>
Stepping back, all pipelines are data + code.


<img style="float: right; margin-top: 10px" width="500px" src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/refs/heads/main/images/manufacturing/lakehouse-iot-turbine/team_flow_marc.png" />

<img src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/refs/heads/main/images/marc.png" style="float: left;" width="80px"> 
<h3 style="padding: 10px 0px 0px 5px">Marc, as a Data Scientist, needs a data + ML platform accelerating all the ML & DS steps:</h3>

<div style="font-size: 19px; margin-left: 73px; clear: left">
<div class="badge_b"><div class="badge">1</div> Build Data Pipeline supporting real time (with DLT)</div>
<div class="badge_b"><div class="badge">2</div> Data Exploration</div>
<div class="badge_b"><div class="badge">3</div> Feature creation</div>
<div class="badge_b"><div class="badge">4</div> Build & train model</div>
<div class="badge_b"><div class="badge">5</div> Deploy Model (Batch or serverless realtime)</div>
<div class="badge_b"><div class="badge">6</div> Monitoring</div>
</div>

**Marc needs a Data Intelligence Platform**. Let's see how we can deploy a Predictive Maintenance model in production with Databricks.


# Predictive maintenance - Single click deployment with AutoML

Let's see how we can now leverage the sensor data to build a predictive maintenance model.

Our first step as a Data Scientist is to analyze the data and build the features we'll use to train our model.

The sensor table enriched with turbine data has been saved within our Delta Live Tables pipeline. All we have to do is read this information, analyze it, and start an AutoML run.

<img src="https://github.com/databricks-demos/dbdemos-resources/raw/main/images/manufacturing/lakehouse-iot-turbine/lakehouse-manuf-iot-ds-flow.png" width="1000px">

*Note: Make sure you have switched to the "Machine Learning" persona in the top-left menu.*



In [0]:
%pip install --quiet databricks-sdk==0.40.0 databricks-feature-engineering==0.8.0 mlflow==2.22.0
%pip install --quiet xgboost
dbutils.library.restartPython()

In [0]:
%run ../_resources/00-setup $reset_all_data=false

In [0]:
catalog

In [0]:
schema

## Data exploration and analysis

Let's review our dataset and start analyzing the data we have to predict turbine failures.

In [0]:
def plot(sensor_report):
  # Pick one turbine whose label matches the requested report ('ok' or the name of the faulty sensor)
  turbine_id = spark.table('turbine_training_dataset').where(f"abnormal_sensor = '{sensor_report}' ").limit(1).collect()[0]['turbine_id']
  # Explore the first 500 readings of that turbine with pandas on Spark
  df = spark.table('sensor_bronze').where(f"turbine_id == '{turbine_id}' ").orderBy('timestamp').limit(500).pandas_api()
  df.plot(x="timestamp", y=["sensor_B"], kind="line", title=f'Sensor report: {sensor_report}').show()
plot('ok')

In [0]:
plot('sensor_B')

As these graphs show, the readings from sensor B contain clear anomalies. Let's continue our exploration and use the standard deviation (std) we computed in our main feature table.

In [0]:
from pyspark.sql.functions import col

# Read our turbine training dataset and flag any turbine with an abnormal sensor as damaged
turbine_dataset = spark.table('turbine_training_dataset').withColumn('damaged', col('abnormal_sensor') != 'ok')
display(turbine_dataset)
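
To make the two ingredients of this dataset concrete, here is a minimal plain-Python sketch of how a single row could be derived: `std_sensor_A` and `avg_energy` as aggregates over one turbine's readings, and `damaged` as a boolean computed from `abnormal_sensor`. The readings, the `energy` field name, the `hourly_features` helper, the one-hour window, and the choice of the sample standard deviation are all assumptions made for illustration; the real features come from the pipeline's feature table.

```python
from statistics import mean, stdev

# Hypothetical raw readings for one turbine within a single hour (made-up values).
readings = [
    {"sensor_A": 1.0, "energy": 0.2},
    {"sensor_A": 2.0, "energy": 0.4},
    {"sensor_A": 3.0, "energy": 0.6},
    {"sensor_A": 4.0, "energy": 0.8},
]

def hourly_features(readings, abnormal_sensor):
    """Aggregate raw readings into one feature row (illustrative helper, not the pipeline's code)."""
    return {
        "avg_energy": mean([r["energy"] for r in readings]),
        # Sample standard deviation; assumed here, the pipeline may use a different estimator.
        "std_sensor_A": stdev([r["sensor_A"] for r in readings]),
        "abnormal_sensor": abnormal_sensor,
        # Same rule as the notebook cell: anything other than 'ok' counts as damaged.
        "damaged": abnormal_sensor != "ok",
    }

features = hourly_features(readings, "sensor_B")
print(features)
```

A noisy sensor produces a larger standard deviation than a stable one, which is why the `std_sensor_*` columns are useful signals for spotting a faulty turbine.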

In [0]:
import seaborn as sns
g = sns.PairGrid(turbine_dataset.sample(0.01).toPandas()[['std_sensor_A', 'std_sensor_E', 'damaged','avg_energy']], diag_sharey=False, hue="damaged")
g.map_lower(sns.kdeplot).map_diag(sns.kdeplot, lw=3).map_upper(sns.regplot).add_legend()

### Further data analysis and preparation using pandas API

Because our Data Scientist team is familiar with pandas, we'll use `pandas on spark` to scale `pandas` code. The pandas instructions are converted to Spark operations under the hood and distributed at scale.

Typically, a Data Science project would involve more advanced preparation and would likely require extra data prep steps, including more complex feature preparation. We'll keep it simple for this demo.

*Note: Starting from `spark 3.2`, koalas is built in and we can get a pandas-on-Spark DataFrame using `pandas_api()`.*

In [0]:
# Convert to a pandas-on-Spark DataFrame (koalas)
dataset = turbine_dataset.pandas_api()

# Select the columns we would like to use as ML model features.
# Note: we removed the percentiles_sensor_A/B/C... features to keep the demo simple.
columns = [
    "turbine_id",
    "hourly_timestamp",
    "avg_energy",
    "std_sensor_A",
    "std_sensor_B",
    "std_sensor_C",
    "std_sensor_D",
    "std_sensor_E",
    "std_sensor_F",
    "location",
    "model",
    "state",
    "abnormal_sensor"
]
dataset = dataset[columns]

# Drop missing values
dataset = dataset.dropna()   
display(dataset)


## Accelerating Predictive maintenance model creation using MLflow and Databricks AutoML

MLflow is an open source project allowing model tracking, packaging and deployment. Every time your data science team works on a model, Databricks tracks all the parameters and data used and saves them. This ensures ML traceability and reproducibility, making it easy to know which model was built using which parameters/data.

### A glass-box solution that empowers data teams without taking away control

While Databricks simplifies model deployment and governance (MLOps) with MLflow, bootstrapping new ML projects can still be long and inefficient.

Instead of creating the same boilerplate for each new project, Databricks AutoML can automatically generate state-of-the-art models for classification, regression, and forecasting.


<img width="1000" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/retail/resources/images/auto-ml-full.png"/>


Models can be deployed directly, or you can leverage the generated notebooks to bootstrap projects with best practices, saving you weeks of effort.

<br style="clear: both">

<img style="float: right" width="600" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/retail/resources/images/churn-auto-ml.png"/>

### Using Databricks AutoML with our turbine dataset

AutoML is available in the "Machine Learning" space. All we have to do is start a new AutoML experiment and select the feature table we just created (`turbine_hourly_features`).

Our prediction target is the `abnormal_sensor` column.

Click on Start, and Databricks will do the rest.

While this is done using the UI, you can also leverage the [Python API](https://docs.databricks.com/applications/machine-learning/automl.html#automl-python-api-1).

In [0]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)

import mlflow
import pandas as pd
from mlflow.models import infer_signature
from mlflow import MlflowClient


In [0]:
mlflow.set_registry_uri('databricks-uc')



In [0]:
name = f'{catalog}.{db}.turbine_hourly_features'

# Read the training dataset from the catalog
training_dataset = spark.table(name).drop('turbine_id')

# Prepare features and labels (convert to pandas once instead of once per selection)
training_pdf = training_dataset.toPandas()
X = training_pdf[['avg_energy', 'std_sensor_A', 'std_sensor_B', 'std_sensor_C', 'std_sensor_D', 'std_sensor_E', 'std_sensor_F']]
y = training_pdf['abnormal_sensor']

# Encode labels as integer codes
y_encoded = pd.factorize(y)[0]


# Split the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y_encoded,
    test_size=0.2, 
    random_state=42
)
# Standardize the features
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)



# Start MLflow run
with mlflow.start_run():

    # Define and train the model
    lr_model = LogisticRegression(max_iter=200)
    lr_model.fit(X_train_sc, y_train)

    # Predict and evaluate
    predictions = lr_model.predict(X_test_sc)
    acc = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions, average="macro")
    recall = recall_score(y_test, predictions, average="macro")
    f1 = f1_score(y_test, predictions, average="macro")

    # Log model parameters
    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_param("max_iter", 200)

    # Log metrics
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("f1_score", f1)

    # Log the model
    mlflow.sklearn.log_model(lr_model, "logistic_regression_model")

    print(f"Logged with accuracy: {acc}")
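
The precision, recall, and F1 values logged above use `average="macro"`: each metric is computed per class and the per-class values are then averaged with equal weight, so a rare fault class counts as much as the common `ok` class. The sketch below reimplements that averaging in plain Python on made-up labels to show how it can differ from accuracy. The `macro_scores` helper and the example data are illustrative assumptions, not part of the notebook or of scikit-learn.

```python
def macro_scores(y_true, y_pred):
    """Return (precision, recall, f1), each averaged with equal weight per class."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # A class that is never predicted (or never present) scores 0 for that metric.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Made-up, imbalanced example: 8 healthy turbines (0), 2 faulty ones (1),
# and a model that always predicts "healthy".
demo_true = [0] * 8 + [1] * 2
demo_pred = [0] * 10

demo_accuracy = sum(t == p for t, p in zip(demo_true, demo_pred)) / len(demo_true)
macro_precision, macro_recall, macro_f1 = macro_scores(demo_true, demo_pred)
print(demo_accuracy, macro_precision, macro_recall, macro_f1)
```

In this toy case accuracy is 0.8 while macro recall is only 0.5, because the faulty class is never detected. When most turbines are healthy, the macro metrics are therefore worth reading alongside accuracy.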


In [0]:
# Train an XGBoost model
from xgboost import XGBClassifier

# Start MLflow run
with mlflow.start_run():

    # Define and train the model
    xgb_model = XGBClassifier()
    xgb_model.fit(X_train_sc, y_train)

    # Predict and evaluate
    predictions = xgb_model.predict(X_test_sc)
    acc = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions, average="macro")
    recall = recall_score(y_test, predictions, average="macro")
    f1 = f1_score(y_test, predictions, average="macro")

    # Log model parameters (XGBClassifier runs with its defaults here and has no max_iter setting)
    mlflow.log_param("model_type", "XGBOOST")

    # Log metrics
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("f1_score", f1)

    # Log the model
    mlflow.sklearn.log_model(xgb_model, "xgboost_model")
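
Both models are trained on the integer codes returned by `pd.factorize`, which numbers labels in order of first appearance. The cell that builds `y_encoded` keeps only the codes, so the list of unique labels is needed to turn a prediction back into a sensor name (in pandas, `codes, uniques = pd.factorize(y)` returns both). Here is a minimal plain-Python sketch of the same encoding and its inverse; the `factorize` and `decode` helpers and the sample labels are illustrative assumptions.

```python
def factorize(labels):
    """Encode labels as integers in order of first appearance; return (codes, uniques)."""
    codes, uniques, index = [], [], {}
    for label in labels:
        if label not in index:
            index[label] = len(uniques)
            uniques.append(label)
        codes.append(index[label])
    return codes, uniques

def decode(codes, uniques):
    """Map integer predictions back to their original label names."""
    return [uniques[code] for code in codes]

# Made-up labels in the style of the abnormal_sensor column.
sample_labels = ["ok", "sensor_B", "ok", "sensor_F", "sensor_B"]
codes, uniques = factorize(sample_labels)
print(codes, uniques)

# Hypothetical model output expressed as integer codes.
print(decode([2, 0], uniques))
```

Keeping `uniques` next to the model (for example as a logged parameter or artifact) avoids having to guess later which integer stands for which sensor.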



In [0]:
# xp_path = "/Shared/dbdemos/experiments/lakehouse-iot-platform"

# xp_name = f"automl_iot_{datetime.now().strftime('%Y-%m-%d_%H:%M:%S')}"

# # training_dataset = fe.read_table(name=f'{catalog}.{db}.turbine_hourly_features').drop('turbine_id').sample(0.1) #Reduce the dataset size to speedup 

# #the demo
# try:
#     from databricks import automl
#     automl_run = automl.classify(
#         experiment_name = xp_name,
#         experiment_dir = xp_path,
#         dataset = training_dataset,
#         target_col = "abnormal_sensor",
#         timeout_minutes = 10
#     )
#     #Make sure all users can access dbdemos shared experiment
#     DBDemos.set_experiment_permission(f"{xp_path}/{xp_name}")


# except Exception as e:
#     if "cannot import name 'automl'" in str(e):
#         # Note: cannot import name 'automl' from 'databricks' likely means you're using serverless. Dbdemos doesn't support autoML serverless API - this will be improved soon.
#         # Adding a temporary workaround to make sure it works well for now - ignore this for classic run
#         DBDemos.create_mockup_automl_run(f"{xp_path}/{xp_name}", training_dataset.toPandas())
#     else:
#         raise e

In [0]:
mlflow.search_runs(order_by=['metrics.accuracy DESC'])

In [0]:
mlflow.search_runs(order_by=['metrics.accuracy DESC'])['tags.mlflow.log-model.history'][0]

In [0]:
# Runs are sorted by accuracy, so the first row is the best run
best_run = mlflow.search_runs(order_by=['metrics.accuracy DESC']).iloc[0]

print(f"Best accuracy:                   {best_run['metrics.accuracy']}")
print(f"Best model type:                 {best_run['params.model_type']}")

best_model_run_id = best_run['run_id']



AutoML saved our best model in the MLflow registry. Open the experiment from the AutoML run to explore its artifacts and analyze the parameters used, including traceability to the notebook used for its creation.

If we're ready, we can move this model into the Production stage in a click, or using the API. Let's register the model to Unity Catalog and move it to production.

You can programmatically get the best run from your AutoML training:
```
from mlflow import MlflowClient

# Retrieve the best model trial run
trial_id = automl_run.best_trial.mlflow_run_id
model_uri = "runs:/{}/model".format(trial_id)
# Use Databricks Unity Catalog to save our model
latest_model = mlflow.register_model(model_uri, f"{catalog}.{db}.{model_name}")
# Flag it as Production ready using UC Aliases
MlflowClient().set_registered_model_alias(name=f"{catalog}.{db}.{model_name}", alias="prod", version=latest_model.version)
```


### Next step: Explore the best notebook generated by Databricks AutoML and deploy our model in the registry!

Databricks AutoML generates state-of-the-art notebooks for you to deploy your models!

Open [the generated AutoML notebook]($./04.2-automl-generated-notebook-iot-turbine) and deploy this model in production.

In [0]:
best_model_run_id

In [0]:


# # Define the model URI
# model_uri = "runs:/{}/model".format(best_model_run_id)

# # Download the model artifacts to a local path
# local_path = "/local_disk0/model"

# mlflow.artifacts.download_artifacts(artifact_uri=model_uri, dst_path=local_path)

# # Register the model with Unity Catalog
# latest_model = mlflow.register_model(model_uri, f"{catalog}.{db}.{model_name}")

# # Flag it as Production ready using UC Aliases
# MlflowClient().set_registered_model_alias(
#     name=f"{catalog}.{db}.{model_name}",
#     alias="prod",
#     version=latest_model.version
# )

In [0]:

# model_uri = "runs:/{}/model".format(best_model_run_id)

# model_uri

# # #Use Databricks Unity Catalog to save our model
# latest_model = mlflow.register_model(model_uri, f"{catalog}.{db}.{model_name}")


# # # Flag it as Production ready using UC Aliases
# # MlflowClient().set_registered_model_alias(name=f"{catalog}.{db}.{model_name}", alias="prod", version=latest_model.version)
