
# Data Science with Databricks

## ML is key to wind turbine farm optimization

The current market makes energy even more strategic than before. Being able to ingest and analyze our wind turbine state is a first step, but it isn't enough to thrive in a very competitive market.

We need to go further to optimize our energy production and reduce both maintenance costs and downtime. Modern data companies achieve this with AI.

<style>
.right_box{
  margin: 30px; box-shadow: 10px -10px #CCC; width:650px;height:300px; background-color: #1b3139ff; box-shadow:  0 0 10px  rgba(0,0,0,0.6);
  border-radius:25px;font-size: 35px; float: left; padding: 20px; color: #f9f7f4; }
.badge {
  clear: left; float: left; height: 30px; width: 30px;  display: table-cell; vertical-align: middle; border-radius: 50%; background: #fcba33ff; text-align: center; color: white; margin-right: 10px}
.badge_b { 
  height: 35px}
</style>
<link href='https://fonts.googleapis.com/css?family=DM Sans' rel='stylesheet'>
<div style="font-family: 'DM Sans'; display: flex; align-items: flex-start;">
  <!-- Left Section -->
  <div style="width: 50%; color: #1b3139; padding-right: 20px;">
    <div style="color: #ff5f46; font-size:80px;">90%</div>
    <div style="font-size:30px; margin-top: -20px; line-height: 30px;">
      Enterprise applications will be AI-augmented by 2025 —IDC
    </div>
    <div style="color: #ff5f46; font-size:80px;">$10T+</div>
    <div style="font-size:30px; margin-top: -20px; line-height: 30px;">
       Projected business value creation by AI in 2030 —PWC
    </div>
  </div>

  <!-- Right Section -->
  <div class="right_box" style="width: 50%; color: red; font-size: 30px; line-height: 1.5; padding-left: 20px;">
    But—huge challenges getting ML to work at scale!<br/><br/>
    In fact, most ML projects still fail before getting to production
  </div>
</div>

## Machine learning is data + transforms.

ML is hard because delivering value to business lines isn't only about building a model. <br>
The ML lifecycle is made of data pipelines: data preprocessing, feature engineering, training, inference, monitoring and retraining.<br>
Stepping back, all pipelines are data + code.


<img style="float: right; margin-top: 10px" width="500px" src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/refs/heads/main/images/manufacturing/lakehouse-iot-turbine/team_flow_marc.png" />

<img src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/refs/heads/main/images/marc.png" style="float: left;" width="80px"> 
<h3 style="padding: 10px 0px 0px 5px">Marc, as a Data Scientist, needs a data + ML platform accelerating all the ML & DS steps:</h3>

<div style="font-size: 19px; margin-left: 73px; clear: left">
<div class="badge_b"><div class="badge">1</div> Build Data Pipeline supporting real time (with DLT)</div>
<div class="badge_b"><div class="badge">2</div> Data Exploration</div>
<div class="badge_b"><div class="badge">3</div> Feature creation</div>
<div class="badge_b"><div class="badge">4</div> Build & train model</div>
<div class="badge_b"><div class="badge">5</div> Deploy Model (Batch or serverless realtime)</div>
<div class="badge_b"><div class="badge">6</div> Monitoring</div>
</div>

**Marc needs a Data Intelligence Platform**. Let's see how we can deploy a Predictive Maintenance model in production with Databricks.


# Predictive maintenance 

Let's see how we can now leverage the sensor data to build a predictive maintenance model.

Our first step as a Data Scientist is to analyze the data and build the features we'll use to train our model.

The sensor table enriched with turbine data has been saved by our Delta Live Tables pipeline. 

<img src="https://github.com/databricks-demos/dbdemos-resources/raw/main/images/manufacturing/lakehouse-iot-turbine/lakehouse-manuf-iot-ds-flow.png" width="1000px">

*Note: Make sure you have switched to the "Machine Learning" persona in the top left menu.*



In [0]:
%pip install --quiet databricks-sdk==0.40.0 mlflow==2.22.0
dbutils.library.restartPython()

In [0]:
%run ../_resources/00-setup $reset_all_data=false

In [0]:

import numpy as np
import pandas as pd
import mlflow
from mlflow.models import infer_signature
from mlflow import MlflowClient
from mlflow.deployments import get_deploy_client
import os
import requests
import json


In [0]:
mlflow.set_registry_uri('databricks-uc')

In [0]:
# Creating a User-Defined Function (UDF) with an ML model in Spark allows you to apply the model to data within a Spark DataFrame.
# This means you can use the model to make predictions directly in your Spark SQL queries or DataFrame operations,
# and integrate machine learning predictions into your data processing workflows in Spark.

predict_maintenance = mlflow.pyfunc.spark_udf(
    spark,
    f"models:/{catalog}.{db}.dbdemos_turbine_maintenance@prod",
    "float",  # output type
    env_manager='virtualenv',
)


#This registers the UDF with Spark SQL, allowing you to use it in SQL queries.
spark.udf.register("predict_maintenance", predict_maintenance)


# This retrieves the names of the input columns that the model expects.
columns = predict_maintenance.metadata.get_input_schema().input_names()

columns

In [0]:
predict_maintenance.metadata.get_input_schema()
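
The UDF is called with the feature columns in the order given by the model's input schema, so column order matters. As a rough, Spark-free sketch of that idea, the snippet below applies a function to each row by picking values in schema order. The `toy_model` function and the sample row are hypothetical stand-ins for illustration, not the registered model or real turbine data.

```python
from typing import Callable, Dict, List, Sequence

# Feature names as listed later in this notebook; in practice the order
# comes from the model signature (get_input_schema().input_names()).
FEATURE_COLUMNS = [
    "avg_energy", "std_sensor_A", "std_sensor_B", "std_sensor_C",
    "std_sensor_D", "std_sensor_E", "std_sensor_F",
]


def toy_model(*features: float) -> float:
    """Hypothetical stand-in for the model: first feature minus last feature."""
    return features[0] - features[-1]


def apply_by_schema(
    rows: List[Dict[str, float]],
    columns: Sequence[str],
    model: Callable[..., float],
) -> List[float]:
    """Call `model` once per row, passing values in the order given by `columns`."""
    return [model(*(row[name] for name in columns)) for row in rows]


# Made-up row; key order in the dict does not matter, the schema order does.
rows = [{
    "std_sensor_F": 7.0, "avg_energy": 1.0, "std_sensor_A": 2.0,
    "std_sensor_B": 3.0, "std_sensor_C": 4.0, "std_sensor_D": 5.0,
    "std_sensor_E": 6.0,
}]
predictions = apply_by_schema(rows, FEATURE_COLUMNS, toy_model)
print(predictions)
```

Because the toy function depends on argument position, shuffling `FEATURE_COLUMNS` changes the result, which is why the notebook reads the column list from the model metadata rather than hard-coding it.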

In [0]:


# Optional check: apply the UDF to a one-row sample DataFrame with the same schema as the input data.
# from pyspark.sql.functions import col
# sample_data = [(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0)]
# sample_df = spark.createDataFrame(sample_data, columns)
# result_df = sample_df.withColumn("prediction", predict_maintenance(*[col(c) for c in columns]))

# Display the result to verify the function
# display(result_df)

# Create a table in the catalog
# result_df.write.mode("overwrite").saveAsTable("turbine_hourly_predictions")

In [0]:
# This applies the UDF to a Spark DataFrame, adding a new column with the model's predictions.


batch_pred_df = spark.table('turbine_hourly_features').withColumn("dbdemos_turbine_maintenance", predict_maintenance(*columns))

batch_pred_df.display()

# create a table in the catalog
batch_pred_df.write.mode("overwrite").saveAsTable("turbine_hourly_predictions")


In [0]:
# %sql
# SELECT turbine_id, 
# predict_maintenance(avg_energy, std_sensor_A, std_sensor_B, std_sensor_C, std_sensor_D, std_sensor_E, std_sensor_F) as prediction 

# FROM turbine_hourly_features




In [0]:
MODEL_SERVING_ENDPOINT_NAME

In [0]:
client = get_deploy_client("databricks")

for each in client.list_endpoints():
    if each['name'] == MODEL_SERVING_ENDPOINT_NAME:
        client.delete_endpoint(MODEL_SERVING_ENDPOINT_NAME)


In [0]:
# Endpoint creation spins up a container that will run the model for inference. This can take 12+ minutes to complete.
import time  # required by the polling loop at the end of this cell

client = get_deploy_client("databricks")

try:
    endpoint = client.create_endpoint(
        name=MODEL_SERVING_ENDPOINT_NAME,
        config={
            "served_entities": [
                {
                    "name": "iot-maintenance-serving-endpoint",
                    "entity_name": f"{catalog}.{db}.{model_name}",
                    "entity_version": get_last_model_version(f"{catalog}.{db}.{model_name}"),
                    "workload_size": "Small",
                    "scale_to_zero_enabled": True
                }
            ]
        }
    )
except Exception as e:
    if "already exists" in str(e):
        print(f"Endpoint {MODEL_SERVING_ENDPOINT_NAME} already exists. Skipping creation.")
    else:
        raise

# Poll until the configuration update has finished
while client.get_endpoint(MODEL_SERVING_ENDPOINT_NAME)['state']['config_update'] == 'IN_PROGRESS':
    time.sleep(10)

if client.get_endpoint(MODEL_SERVING_ENDPOINT_NAME)['state']['ready'] != 'READY':
    print(f"Endpoint {MODEL_SERVING_ENDPOINT_NAME} creation failed.")
else:
    print(f"Endpoint {MODEL_SERVING_ENDPOINT_NAME} created successfully.")
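
The wait loop above polls until the configuration update is no longer in progress, with no upper bound on the waiting time. A minimal sketch of the same pattern with a timeout follows. It assumes a hypothetical `get_state` callable that returns the endpoint's `state` dictionary (in this notebook it could wrap `client.get_endpoint(MODEL_SERVING_ENDPOINT_NAME)['state']`); the simulated states are illustrative only.

```python
import time
from typing import Callable, Dict


def wait_until_ready(
    get_state: Callable[[], Dict[str, str]],
    timeout_s: float = 1800.0,
    poll_s: float = 10.0,
) -> bool:
    """Poll `get_state` until the config update finishes; return True if the endpoint is ready.

    Raises TimeoutError if the update is still in progress after `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state.get("config_update") != "IN_PROGRESS":
            return state.get("ready") == "READY"
        if time.monotonic() >= deadline:
            raise TimeoutError("Endpoint update still in progress after timeout")
        time.sleep(poll_s)


# Simulated sequence of states (hypothetical values for the demo run).
_states = iter([
    {"config_update": "IN_PROGRESS", "ready": "NOT_READY"},
    {"config_update": "IN_PROGRESS", "ready": "NOT_READY"},
    {"config_update": "NOT_UPDATING", "ready": "READY"},
])
ready = wait_until_ready(lambda: next(_states), poll_s=0.0)
print(ready)
```

Passing the state getter as a callable keeps the loop testable without a live endpoint.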

In [0]:
# Get the API endpoint and token for the current notebook context
# No trailing slash here: score_model appends "/serving-endpoints/..." to API_ROOT
API_ROOT = f"https://{dbutils.notebook.entry_point.getDbutils().notebook().getContext().browserHostName().value()}"
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().getOrElse(None)

In [0]:
def create_tf_serving_json(data):
    return {'inputs': {name: data[name].tolist() for name in data.keys()} if isinstance(data, dict) else data.tolist()}

def score_model(dataset):

    url = f'{API_ROOT}/serving-endpoints/{MODEL_SERVING_ENDPOINT_NAME}/invocations'

    headers = {'Authorization': f'Bearer {API_TOKEN}', 
               'Content-Type': 'application/json'}


    ds_dict = {'dataframe_split': dataset.to_dict(orient='split')} if isinstance(dataset, pd.DataFrame) else create_tf_serving_json(dataset)

    data_json = json.dumps(ds_dict, allow_nan=True)

    response = requests.request(method='POST', headers=headers, url=url, data=data_json)
    
    if response.status_code != 200:
        raise Exception(f'Request failed with status {response.status_code}, {response.text}')
    return response.json()
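
`score_model` sends one of two JSON shapes: `dataframe_split` when given a pandas DataFrame, and `inputs` for array-like data. The sketch below builds both shapes from plain Python lists so the structure is easy to see. The helper names and sample values are hypothetical, and the sketch leaves out the `index` key that pandas `to_dict(orient='split')` also emits.

```python
import json
from typing import Dict, List, Sequence


def build_split_payload(columns: Sequence[str], rows: Sequence[Sequence[float]]) -> Dict:
    """Named-column shape: column names once, then one list of values per row."""
    return {"dataframe_split": {"columns": list(columns), "data": [list(r) for r in rows]}}


def build_inputs_payload(rows: Sequence[Sequence[float]]) -> Dict:
    """Positional shape: a bare list of rows, with no column names."""
    return {"inputs": [list(r) for r in rows]}


# Made-up feature values for a single turbine reading.
columns = ["avg_energy", "std_sensor_A", "std_sensor_B"]
rows = [[0.19, 0.96, 2.66]]

split_json = json.dumps(build_split_payload(columns, rows))
inputs_json = json.dumps(build_inputs_payload(rows))
print(split_json)
print(inputs_json)
```

The named-column shape is the safer default when the model signature has column names, since the positional shape relies on the caller getting the feature order right.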

In [0]:
# Limit in Spark before converting, so only 5 rows are collected to the driver
spark.table('turbine_hourly_features').limit(5).toPandas()

In [0]:
columns = ['avg_energy', 'std_sensor_A', 'std_sensor_B', 'std_sensor_C', 'std_sensor_D', 'std_sensor_E', 'std_sensor_F']
dataset = spark.table('turbine_hourly_features').select(*columns).limit(5).toPandas()

dataset

In [0]:
# Run live inference against the serving endpoint (the endpoint created above must be ready)
score_model(dataset)

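To query the endpoint from another computer, send the same kind of request with curl, replacing the workspace URL with your own: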


curl \
  -u token:$DATABRICKS_TOKEN \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"inputs": [[0.1889792 , 0.9644652 , 2.65583866, 3.4528106 , 2.48515875,
        2.28840325, 4.70213899],
       [0.19212258, 1.06818556, 2.38481843, 3.30341204, 2.17225129,
        2.34259302, 4.87087542],
       [0.17356345, 1.14208877, 2.0627087 , 3.01932966, 2.33955204,
        2.73069787, 4.23719664],
       [0.10343409, 1.04987272, 2.21921651, 3.24672614, 2.32046658,
        2.66270018, 4.28940458],
       [0.15481244, 1.03255521, 2.14210166, 2.72984232, 2.35974868,
        2.7614664 , 4.58878877]]}' \
  https://dbc-0664a3f5-7bb4.cloud.databricks.com/serving-endpoints/dbdemos_iot_turbine_prediction_endpoint/invocations
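
The curl call above authenticates with HTTP basic auth (`-u token:$DATABRICKS_TOKEN`) and posts an `inputs` payload. As a rough Python equivalent using only the standard library, the sketch below builds the same request without sending it. The workspace URL and token are placeholders, and `build_invocation_request` is a hypothetical helper name.

```python
import base64
import json
import urllib.request
from typing import List


def build_invocation_request(
    workspace_url: str,
    endpoint_name: str,
    token: str,
    rows: List[List[float]],
) -> urllib.request.Request:
    """Build (but do not send) a POST request to a serving endpoint's invocations URL."""
    url = f"{workspace_url.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    # Same credentials as `curl -u token:<token>`: basic auth with user name "token".
    credentials = base64.b64encode(f"token:{token}".encode("utf-8")).decode("ascii")
    body = json.dumps({"inputs": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
    )


# Placeholder workspace URL and token; replace with real values before sending.
req = build_invocation_request(
    "https://example.cloud.databricks.com/",
    "dbdemos_iot_turbine_prediction_endpoint",
    "dummy-token",
    [[0.1889792, 0.9644652, 2.65583866, 3.4528106, 2.48515875, 2.28840325, 4.70213899]],
)
print(req.get_method(), req.full_url)
```

To actually call the endpoint, the request would be passed to `urllib.request.urlopen(req)` and the JSON response decoded, which requires a reachable workspace and a valid token.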

In [0]:
# %sql
# SELECT ai_query('dbdemos_iot_turbine_prediction_endpoint',
#     request => {
#   "dataframe_split": {
#     "data": [
#       [
#         0.3343003711119671,
#         0.3250868023612564,
#         -0.3970504309971035,
#         -0.26756059753270023,
#         -0.38967895864662727,
#         1.606278850727433,
#         4.490631184834478
#       ]
#     ]
#   }
# })