
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>



# Pipeline Deployment

In this demo, we will show how to use a model as part of a data pipeline for inference. In the first section, we will prepare data and perform some basic feature engineering. Then, we will fit the model and register it to the model registry. These two steps are covered in other courses and are not the main focus of this demo. In the last section, which is the main focus, we will create a Lakeflow Declarative Pipelines pipeline (formerly known as Delta Live Tables, or DLT) and use the registered model as part of the pipeline.

**Learning Objectives:**

*By the end of this demo, you will be able to:*

* Describe steps for deploying a model within a pipeline.

* Develop a simple pipeline that performs batch inference in its final step.


## REQUIRED - SELECT CLASSIC COMPUTE
Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:
1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

2. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

   - Click **More** in the drop-down.

   - In the **Attach to an existing compute resource** window, use the first drop-down to select your unique cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

2. Find the triangle icon to the right of your compute cluster name and click it.

3. Wait a few minutes for the cluster to start.

4. Once the cluster is running, complete the steps above to select your cluster.

## Requirements

Please review the following requirements before starting the lesson:

* To run this notebook, you need to use one of the following Databricks runtime(s): **17.3.x-cpu-ml-scala2.13**


## Classroom Setup

Before starting the demo, run the provided classroom setup script. This script will define configuration variables necessary for the demo. Execute the following cell:

In [0]:
%run ../Includes/Classroom-Setup-3.1a

**Other Conventions:**

Throughout this demo, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"User DB Location:  {DA.paths.datasets}")

## Data Preparation

For this demonstration, we will utilize a fictional dataset from a Telecom Company, which includes customer information. This dataset encompasses **customer demographics**, including gender, as well as internet subscription details such as subscription plans and payment methods.

After loading the dataset, we will perform simple **data cleaning and feature selection**. 

In the final step, we will split the dataset into **features** and **response** sets.

In [0]:
import pyspark.sql.functions as F  # `F.expr` is used below

# Load dataset with spark
shared_volume_name = 'telco' # From Marketplace
csv_name = 'telco-customer-churn-missing' # CSV file name
dataset_p_telco = f"{DA.paths.datasets.telco}/{shared_volume_name}/{csv_name}.csv" # Full path

# Dataset specs
primary_key = "customerID"
response = "Churn"
features = ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"] # Keeping numerical only for simplicity and demo purposes

# Read dataset (and drop nan)
telco_df = spark.read.csv(dataset_p_telco, inferSchema=True, header=True, multiLine=True, escape='"')\
            .withColumn("TotalCharges", F.expr("try_cast(trim(TotalCharges) as double)"))\
            .na.drop(how='any')

# Separate features and ground-truth
features_df = telco_df.select(primary_key, *features)
response_df = telco_df.select(primary_key, response)

# Prepare training data for the scikit-learn decision tree
# Convert data to pandas dataframes
X_train_pdf = features_df.drop(primary_key).toPandas()
Y_train_pdf = response_df.drop(primary_key).toPandas()

# Cast int32 columns to double so all features share a floating-point type
for column in X_train_pdf.select_dtypes("int32"):
    X_train_pdf[column] = X_train_pdf[column].astype("double")

In [0]:
print(X_train_pdf.info())
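
The cleaning step above coerces `TotalCharges` to a double with `try_cast` (unparseable values become null) and then drops any row containing a null. The following is a rough plain-Python analogue of that behavior, not the Spark implementation: the helper `try_cast_double` and the sample rows are hypothetical, and Python's `float` accepts a few inputs that Spark would reject, so treat it only as an illustration.

```python
from typing import Optional


def try_cast_double(raw: Optional[str]) -> Optional[float]:
    """Return the trimmed value as a float, or None when it cannot be parsed."""
    if raw is None:
        return None
    try:
        return float(raw.strip())
    except ValueError:
        return None


# Hypothetical rows: (customerID, TotalCharges as read from the CSV)
rows = [("A-1", " 29.85"), ("B-2", " "), ("C-3", "1889.5"), ("D-4", None)]

# Cast first, then drop every row that contains a missing value
cast_rows = [(customer_id, try_cast_double(total)) for customer_id, total in rows]
clean_rows = [row for row in cast_rows if all(value is not None for value in row)]

print(clean_rows)
```

Rows whose `TotalCharges` is blank or missing are removed, which is why the training frame contains no nulls.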

## Model Preparation

**Note:** This section is not the main focus of this course. We are just repeating the model development and registration process here.


### Set Up Model Registry with UC

Before we start model deployment, we need to fit and register a model. In this demo, **we will log models to Unity Catalog**, which means we first need to set the **MLflow Model Registry URI**.

In [0]:
import mlflow

# Point to UC model registry
mlflow.set_registry_uri("databricks-uc")
client = mlflow.MlflowClient()


def get_latest_model_version(model_name):
    """Helper function to get latest model version"""
    model_version_infos = client.search_model_versions("name = '%s'" % model_name)
    # Compare versions as integers so that, for example, 10 ranks above 9
    return max(int(model_version_info.version) for model_version_info in model_version_infos)
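
When picking the latest version, note that version values may be returned as strings depending on the MLflow client, and strings compare lexicographically. The minimal sketch below uses made-up version values (not real registry output) to show why a numeric comparison is the safer choice.

```python
# Hypothetical version values, as they might look if returned as strings
versions = ["1", "2", "9", "10"]

# Lexicographic comparison: "9" sorts after "10"
latest_as_text = max(versions)

# Numeric comparison after casting each value to int
latest_as_number = max(int(version) for version in versions)

print(latest_as_text)
print(latest_as_number)
```

The text comparison picks `"9"`, while the numeric comparison picks `10`.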

### Fit and Register a Model with UC

In [0]:
from sklearn.tree import DecisionTreeClassifier
from mlflow.models import infer_signature

# Use 3-level namespace for model name
model_name = f"{DA.catalog_name}.{DA.schema_name}.ml_model" 

alias_name = "pipeline"

# model to use for classification
clf = DecisionTreeClassifier(max_depth=4, random_state=10)

with mlflow.start_run(run_name="Model-Deployment demo") as mlflow_run:

    # Enable automatic logging of input samples, metrics, parameters, and models
    mlflow.sklearn.autolog(
        log_input_examples=True,
        log_models=False,
        log_post_training_metrics=True,
        silent=True)
    
    clf.fit(X_train_pdf, Y_train_pdf)

    # Log model and push to registry
    signature = infer_signature(X_train_pdf, Y_train_pdf)
    mlflow.sklearn.log_model(
        clf,
        artifact_path="decision_tree",
        signature=signature,
        registered_model_name=model_name
    )

    # Set model alias
    client.set_registered_model_alias(model_name, alias_name, get_latest_model_version(model_name))

## Configure Pipeline to Run Batch Inference

Now that our model is registered and ready, we can move on to the most important part: using the model for inference inside a pipeline.

**Note: The pipeline is already defined in `3.1b` notebook.**
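
The `3.1b` notebook is not reproduced here. As a rough sketch of what a batch inference step can look like, the code below loads a registered model by alias as a Spark UDF and appends a prediction column. The function names `build_model_uri` and `add_predictions`, the `prediction` column name, and the `"string"` result type are assumptions for illustration and may differ from the actual pipeline code.

```python
def build_model_uri(model_name: str, alias: str) -> str:
    """Build a 'models:/<catalog>.<schema>.<model>@<alias>' URI."""
    if len(model_name.split(".")) != 3:
        raise ValueError("Expected a three-level name: catalog.schema.model")
    return f"models:/{model_name}@{alias}"


def add_predictions(spark, df, model_name, alias, feature_cols, result_type="string"):
    """Append a 'prediction' column by applying the registered model as a Spark UDF."""
    # Imported lazily so the URI helper works without MLflow or Spark installed
    import mlflow
    import mlflow.pyfunc
    from pyspark.sql.functions import col

    # Resolve model names against the Unity Catalog registry
    mlflow.set_registry_uri("databricks-uc")
    predict = mlflow.pyfunc.spark_udf(
        spark,
        model_uri=build_model_uri(model_name, alias),
        result_type=result_type,  # assumption: the model returns string labels
    )
    return df.withColumn("prediction", predict(*[col(c) for c in feature_cols]))


# Hypothetical catalog and schema names
print(build_model_uri("my_catalog.my_schema.ml_model", "pipeline"))
```

Referencing the model by alias rather than by version number means the pipeline picks up whichever version currently holds the alias, without any change to the pipeline code.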


### Config Variables

While defining the pipeline, you will need to use the following variables. Run the code block below first. Then, use the output in the next section while creating the pipeline.


In [0]:
print(f"mlpipeline.bronze_dataset_path: {dataset_p_telco}")
print(f"mlpipeline.model_name: {model_name}")
print(f"mlpipeline.model_alias: {alias_name}")
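
For context, a pipeline notebook typically reads such key-value pairs back from the Spark configuration. The sketch below assumes the keys printed above and a hypothetical helper `read_pipeline_config`; a plain dictionary stands in for the pipeline settings so the example runs anywhere, and the values shown are placeholders.

```python
from typing import Callable, Dict, Optional

REQUIRED_KEYS = (
    "mlpipeline.bronze_dataset_path",
    "mlpipeline.model_name",
    "mlpipeline.model_alias",
)


def read_pipeline_config(get_value: Callable[[str], Optional[str]]) -> Dict[str, str]:
    """Collect the required settings, failing early if one is missing or empty."""
    config = {}
    for key in REQUIRED_KEYS:
        value = get_value(key)
        if not value:
            raise ValueError(f"Missing pipeline configuration: {key}")
        config[key] = value
    return config


# Stand-in for the pipeline settings; all values are placeholders
fake_settings = {
    "mlpipeline.bronze_dataset_path": "<path printed above>",
    "mlpipeline.model_name": "my_catalog.my_schema.ml_model",
    "mlpipeline.model_alias": "pipeline",
}

config = read_pipeline_config(fake_settings.get)
print(config["mlpipeline.model_alias"])
```

Inside a pipeline, the same helper could be called as `read_pipeline_config(spark.conf.get)`, assuming the keys were added under **Configuration** in the pipeline settings.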

## Create the ETL Pipeline
This Vocareum environment has been configured with **Lakeflow Declarative Pipelines** enabled (this feature is currently in Beta).
> **Note:** To enable the Lakeflow Pipelines Editor, open your user settings, go to **Developer**, and enable **Lakeflow Pipelines Editor**.
### Instructions
1. Navigate to **Jobs & Pipelines** in the left side menu and click the **ETL pipeline** card at the top of the screen.
2. Give the pipeline the name `<labuserXXXXXXXX_XXXXXXXXXX>-pipeline`, replacing `<labuserXXXXXXXX_XXXXXXXXXX>` with your labuser name.
    - Click the profile icon at the top right to copy your labuser name, or see the output of cell 8 above.
3. Make sure the catalog `dbacademy` is selected. 
4. Select your labuser schema, which is of the form `<labuserXXXXXXXX_XXXXXXXXXX>`. 
5. Select **Add existing assets** near the bottom under **Advanced options** 
<br/>
<img src="../Includes/Images/etl-pipeline-1.png" width="500"/>
<br/>
6. In **Pipeline root folder**, locate and open the folder **M03 - Pipeline Deployment/Pipeline**. 
7. In **Source code paths**, click on the folder icon and select **3.1b Demo - Inference Pipeline** and click **Select**. 
8. Back in **Add existing assets**, select **Add** at the bottom right.
9. Click on the **Pipeline** menu item at the top left and select the notebook **3.1b Demo - Inference Pipeline**. 
10. This new editor will display the notebook in the center of the screen and the **Pipeline graph** on the right of the screen. We will need to configure the variables shown in the notebook **3.1b Demo - Inference Pipeline** in the **Pipeline settings**. To do this, click on the **settings** icon next to **Pipeline configuration** to open the pipeline settings. Then, scroll down to the **Configuration** section and click **Add configuration** to set up the necessary variables for the pipeline.
<br/>
<img src="../Includes/Images/add-config.png" width="500"/>
<br/>
11. The config variable values are defined in the section **Config Variables** in this notebook (**3.1a Demo - Pipeline Deployment**). Copy and paste each key-value pair into the configuration and click **Save**.
12. Back in **Pipeline settings**, click **Dry run** at the top right.
    - A dry run validates the pipeline's source code and configuration without creating or updating any tables.
13. Once the dry run is completed, click **Run pipeline**. This run creates or updates the tables defined in the pipeline.

> Note we did not use classic compute for this pipeline run. We left **Serverless** as our compute by default.

## Additional Resources and Trainings
This demo is not a comprehensive introduction to **Lakeflow Declarative Pipelines**. For a deeper dive into this Databricks feature, check out our course **[Build Data Pipeline with Lakeflow Declarative Pipelines](https://www.databricks.com/training/catalog?search=lakeflow+declarative+pipelines)**.


## Conclusion

In this demonstration, we walked through training, registering, and deploying a model within a pipeline. After fitting and registering the model, we created a Lakeflow Declarative Pipelines (formerly Delta Live Tables) pipeline. This pipeline ingests data from a source file, applies the necessary data transformations, and uses the registered model for batch inference in its final step. While your specific project requirements may vary, this example illustrates how to set up and integrate a model for inference within a pipeline.

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>