
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>


# 3 - Adding Data Quality Expectations

In this demonstration, we will add data quality expectations: constraints that validate data as it flows through Lakeflow Declarative Pipelines. Expectations surface data quality metrics and let you fail updates or drop records when invalid records are detected.


### Learning Objectives

By the end of this lesson, you will be able to:
- Add quality constraints within a Lakeflow Declarative Pipeline to trigger appropriate actions (warn, drop, or fail) based on data expectations.
- Analyze pipeline metrics to identify and interpret data quality issues across different data flows.
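
To make the three actions concrete, here is a minimal plain-Python sketch of what warn, drop, and fail mean for a batch of records. It only simulates the semantics and does not use the Lakeflow Declarative Pipelines API. The function `apply_expectation`, the rule name `valid_notifications`, and the sample rows are hypothetical. In pipeline SQL, the corresponding constraint is declared with `CONSTRAINT ... EXPECT (...)`, optionally followed by `ON VIOLATION DROP ROW` or `ON VIOLATION FAIL UPDATE`.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def apply_expectation(
    records: List[Record],
    name: str,
    predicate: Callable[[Record], bool],
    action: str = "warn",
) -> Tuple[List[Record], Dict[str, int]]:
    """Simulate one expectation; return (records kept, pass/fail counts)."""
    if action not in {"warn", "drop", "fail"}:
        raise ValueError(f"unknown action: {action}")

    passed = [r for r in records if predicate(r)]
    failed = [r for r in records if not predicate(r)]
    metrics = {"passed": len(passed), "failed": len(failed)}

    if action == "fail" and failed:
        # fail: the whole update stops on the first batch with violations
        raise RuntimeError(
            f"expectation '{name}' violated by {len(failed)} record(s)"
        )
    if action == "drop":
        # drop: invalid records are removed from the target
        return passed, metrics
    # warn: invalid records are kept, only the metrics reveal them
    return list(records), metrics


def is_valid(record: Record) -> bool:
    """Hypothetical rule: notifications must be 'Y' or 'N'."""
    return record["notifications"] in ("Y", "N")


# Hypothetical sample rows, one of which violates the rule.
orders = [
    {"order_id": 1, "notifications": "Y"},
    {"order_id": 2, "notifications": None},
    {"order_id": 3, "notifications": "N"},
]

warned, warn_metrics = apply_expectation(orders, "valid_notifications", is_valid, "warn")
dropped, drop_metrics = apply_expectation(orders, "valid_notifications", is_valid, "drop")
```

In this sketch, `warn` and `drop` report the same counts; they differ only in whether the invalid record reaches the target.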

## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

    - In the drop-down, select **More**.

    - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.

## A. Classroom Setup

Run the following cell to configure your working environment for this course.

This cell also resets your `/Volumes/dbacademy/ops/labuser/` volume to its starting point, leaving one JSON file in each directory (**customers**, **orders**, and **status**).

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically create and reference the information needed to run the course.

In [0]:
%run ./Includes/Classroom-Setup-3

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


Schema labuser11197806_1755312348.1_bronze_db already exists. No action taken.
Schema labuser11197806_1755312348.2_silver_db already exists. No action taken.
Schema labuser11197806_1755312348.3_gold_db already exists. No action taken.
----------------------------------------------------------------------------------------
Directory /Volumes/dbacademy/ops/labuser11197806_1755312348@vocareum_com/customers already exists. No action taken.
Directory /Volumes/dbacademy/ops/labuser11197806_1755312348@vocareum_com/orders already exists. No action taken.
Directory /Volumes/dbacademy/ops/labuser11197806_1755312348@vocareum_com/status already exists. No action taken.
----------------------------------------------------------------------------------------


Searching for files in /Volumes/dbacademy/ops/labuser11197806_1755312348@vocareum_com/customers/ volume to delete prior to creating files...
Deleting file: /Volumes/dbacademy/ops/labuser11197806_1755312348@vocareum_com/customers/00.json

Searc

Schemas are available, lab check passed: ['1_bronze_db', '2_silver_db', '3_gold_db'].


0,1
Your catalog name variable reference: DA.catalog_name:,
"Variable reference to your source files (Python - DA.paths.working_dir, SQL - DA.paths_working_dir):",


Run the cell below to programmatically view the files in your `/Volumes/dbacademy/ops/labuser/orders` volume. Confirm you only see the original **00.json** file in the **orders** folder.

In [0]:
%python
spark.sql(f'LIST "{DA.paths.working_dir}/orders"').display()

path,name,size,modification_time
/Volumes/dbacademy/ops/labuser11197806_1755312348@vocareum_com/orders/00.json,00.json,15313,1755317295000
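
The same confirmation can be scripted rather than read off the listing. The sketch below uses only the Python standard library and assumes the volume path is reachable through ordinary file APIs on your cluster. The helper names `list_json_files` and `is_starting_point` are hypothetical and not part of the course code.

```python
import os
from typing import List


def list_json_files(folder: str) -> List[str]:
    """Return the sorted names of the .json files directly inside `folder`."""
    return sorted(name for name in os.listdir(folder) if name.endswith(".json"))


def is_starting_point(folder: str) -> bool:
    """True when the folder holds exactly the original 00.json file."""
    return list_json_files(folder) == ["00.json"]
```

In this lab, a call such as `is_starting_point(f"{DA.paths.working_dir}/orders")` would be expected to return `True` right after the setup cell has run.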


In [0]:
select * from json.`/Volumes/dbacademy/ops/labuser11197806_1755312348@vocareum_com/orders/00.json`;

customer_id,notifications,order_id,order_timestamp
23094,Y,75123,1640392092
23457,N,75124,1640392500
23564,Y,75125,1640394862
23392,N,75126,1640396067
23101,Y,75127,1640399066
23466,N,75128,1640404853
23834,Y,75129,1640407272
23852,Y,75130,1640419989
23483,Y,75131,1640422131
23821,N,75132,1640423697
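
The `order_timestamp` values above look like Unix epoch seconds. Under that assumption, the sketch below converts a value to a readable UTC timestamp and checks it against a date cutoff, which is the kind of condition a quality constraint might test. The helper names and the 2021-01-01 cutoff are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds: int) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def is_after(epoch_seconds: int, cutoff: datetime) -> bool:
    """True when the timestamp falls strictly after the cutoff."""
    return epoch_to_utc(epoch_seconds) > cutoff


# First row of the sample output, assuming epoch seconds.
first_order = epoch_to_utc(1640392092)
print(first_order.isoformat())  # 2021-12-25T00:28:12+00:00

# Hypothetical cutoff for illustration only.
cutoff = datetime(2021, 1, 1, tzinfo=timezone.utc)
print(is_after(1640392092, cutoff))  # True
```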


## B. Adding Data Quality Expectations

This demonstration includes a simple starter Lakeflow Declarative Pipeline that has already been created. We will continue to build on it to explore its capabilities.


1. Run the cell below to create your starter pipeline for this demonstration. The pipeline will set the following for you:

- Your default catalog: `labuser`

- Your configuration parameter: `source` = `/Volumes/dbacademy/ops/your-labuser-name`

  **NOTE:** If the pipeline already exists, an error will be returned. In that case, you'll need to delete the existing pipeline and rerun this cell.

  To delete the pipeline:

  - Select **Jobs and Pipelines** from the far-left navigation bar.  

  - Find the pipeline you want to delete.  

  - Click the three-dot menu ![ellipsis icon](./Includes/images/ellipsis_icon.png).  

  - Select **Delete**.

**NOTE:**  The `create_declarative_pipeline` function is a custom function built for this course to create the sample pipeline using the Databricks REST API. This avoids manually creating the pipeline and referencing the pipeline assets.

In [0]:
%python
create_declarative_pipeline(pipeline_name=f'3 - Adding Data Quality Expectations Project - {DA.catalog_name}', 
                            root_path_folder_name='3 - Adding Data Quality Expectations Project',
                            catalog_name = DA.catalog_name,
                            schema_name = 'default',
                            source_folder_names=['orders'],
                            configuration = {'source':DA.paths.working_dir})

Creating the Lakeflow Declarative Pipeline '3 - Adding Data Quality Expectations Project - labuser11197806_1755312348'...
Root folder path: /Workspace/Users/labuser11197806_1755312348@vocareum.com/build-data-pipelines-with-lakeflow-declarative-pipelines-3.0.2/Build Data Pipelines with Lakeflow Declarative Pipelines/3 - Adding Data Quality Expectations Project
Source folder path(s): [{'glob': {'include': '/Workspace/Users/labuser11197806_1755312348@vocareum.com/build-data-pipelines-with-lakeflow-declarative-pipelines-3.0.2/Build Data Pipelines with Lakeflow Declarative Pipelines/3 - Adding Data Quality Expectations Project/orders/**'}}]

Lakeflow Declarative Pipeline Creation '3 - Adding Data Quality Expectations Project - labuser11197806_1755312348' Complete!


2. Complete the following steps to open the starter pipeline for this demonstration:

   a. Click the folder icon ![Folder](./Includes/images/folder_icon.png) in the left navigation panel.
   
   b. In the **Build Data Pipelines with Lakeflow Declarative Pipelines** folder, find the **3 - Adding Data Quality Expectations Project** folder.
   
   c. Right-click the folder and select **Open in a new tab**.

   d. In the new tab:
      - Select the **orders** folder. (The main folder also contains a **python_excluded** folder with the Python version of the pipeline.)

      - Click on **orders_pipeline.sql**.
      

   e. In the navigation pane of the new tab, you should see **Pipeline** and **All Files**. Ensure you are in the **Pipeline** tab. This will list all files in your pipeline.
   <br><br>
   **Example**
   
   ![Pipeline and All Files Tab](./Includes/images/pipeline_projecttabs.png)

#### IMPORTANT
   **NOTE:** If opening the **orders_pipeline.sql** file does not launch the pipeline editor, the folder is not associated with a pipeline. Run the previous cell to associate the folder with the pipeline, then try again.

   **WARNING:** If you get the following warning when opening the **orders_pipeline.sql** file: 

   ```pipeline you are trying to access does not exist or is inaccessible. Please verify the pipeline ID, request access or detach this file from the pipeline.``` 

   Simply refresh the page and/or reselect the notebook.

3. In the new tab, follow the instructions provided in the comments within the **orders_pipeline.sql** file.

## Additional Resources

- [Manage data quality with pipeline expectations](https://docs.databricks.com/aws/en/dlt/expectations)

- [Expectation recommendations and advanced patterns](https://docs.databricks.com/aws/en/dlt/expectation-patterns)

- [Data Quality Management With Databricks](https://www.databricks.com/discover/pages/data-quality-management#expectations-with-delta-live-tables)


&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="blank">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy" target="blank">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use" target="blank">Terms of Use</a> | 
<a href="https://help.databricks.com/" target="blank">Support</a>