
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# 6 - Change Data Capture with AUTO CDC with Slowly Changing Dimensions (SCD) TYPE 1

##### NOTE: The AUTO CDC APIs replace the APPLY CHANGES APIs, and have the same syntax. The APPLY CHANGES APIs are still available, but Databricks recommends using the AUTO CDC APIs in their place.

In this demonstration, we will continue to build our pipeline by ingesting **customer** data. The customer data includes new customers, customers who have deleted their accounts, and customers who have updated their information (such as address or email). We will build our customer pipeline by implementing change data capture (CDC) for this data using SCD Type 1 (Type 2 is outside the scope of this course).

The customer pipeline flow works as follows:

- The bronze table uses **Auto Loader** to ingest JSON data from cloud object storage with SQL (`FROM STREAM`).
- A table is defined to enforce constraints before passing records to the silver layer.
- `AUTO CDC` is used to automatically process CDC data into the silver layer as an SCD Type 1 table.
- A gold table is defined as a materialized view of the current customers with up-to-date information (reflecting dropped customers, new customers, and updated customer information).



### Learning Objectives

By the end of this lesson, students should be able to:
- Apply the `AUTO CDC` operation in Lakeflow Spark Declarative Pipelines to process change data capture (CDC) by integrating and updating incoming data from a source stream into an existing Delta table, ensuring data accuracy and consistency.
- Analyze Slowly Changing Dimensions (SCD Type 1) tables within Lakeflow Spark Declarative Pipelines to effectively update, insert and drop customers in dimensional data, managing the state of records over time using appropriate keys, versioning, and timestamps.

## REQUIRED - SELECT CLASSIC COMPUTE (your cluster starts with **labuser**)

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

    - In the drop-down, select **More**.

    - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.

## A. Classroom Setup

Run the following cell to configure your working environment for this course. This setup will reset your volume to one JSON file in each directory.

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically create and reference the information needed to run the course.

In [0]:
%run ./Includes/Classroom-Setup-6

## B. Explore the Customer Data Source Files

1. Run the cell below to programmatically view the files in your `/Volumes/dbacademy/ops/lab-user-name/customers` volume. Confirm you only see one **00.json** file for customers.

In [0]:
%python
spark.sql(f'LIST "{DA.paths.working_dir}/customers"').display()

2. Run the query below to explore the customers **00.json** file located at `/Volumes/dbacademy/ops/lab-user-name/customers`. Note the following:

   a. The file contains **939 customers** (remember this number).

   b. It includes general customer information such as **email**, **name**, and **address**.

   c. The **timestamp** column specifies the logical order of customer events in the source data.

   d. The **operation** column indicates whether the entry is for a new customer, a deletion, or an update.
      - **NOTE:** Since this is the first JSON file, all rows will be considered new customers.


In [0]:
SELECT *
FROM read_files(
  DA.paths_working_dir || '/customers/00.json',
  format => "JSON"
)
ORDER BY operation;

### Question: 
How can we ingest new raw data source files (JSON) with customer updates into our pipeline to update the **customers_silver** table when inserts, updates, or deletes occur, without maintaining historical records (SCD Type 1)?

## C. Change Data Capture with AUTO CDC APIs in Lakeflow Spark Declarative Pipelines

1. Run the cell below to create your starter Spark Declarative Pipeline for this demonstration. The pipeline will set the following for you:
    - Your default catalog: `labuser`
    - Your configuration parameter: `source` = `/Volumes/dbacademy/ops/your-labuser-name`

    **NOTE:** If the pipeline already exists, an error will be returned. In that case, you'll need to delete the existing pipeline and rerun this cell.

**NOTE:**  The `create_declarative_pipeline` function is a custom function built for this course to create the sample pipeline using the Databricks REST API. This avoids manually creating the pipeline and referencing the pipeline assets.

In [0]:
%python
create_declarative_pipeline(
    pipeline_name=f'6 - Change Data Capture with AUTO CDC - {DA.catalog_name}',
    root_path_folder_name='6 - Change Data Capture with AUTO CDC Project',
    catalog_name=DA.catalog_name,
    schema_name='default',
    source_folder_names=['orders', 'status', 'customers'],
    configuration={'source': DA.paths.working_dir},
)

2. Complete the following steps to open the starter Spark Declarative Pipeline project for this demonstration:

   a. In the main navigation bar, right-click on **Jobs & Pipelines** and select **Open Link in New Tab**.

   b. In **Jobs & Pipelines** select your **6 - Change Data Capture with AUTO CDC - labuser** pipeline.

   c. **REQUIRED:** At the top near your pipeline name, turn on **New pipeline monitoring**.

   d. In the **Pipeline details** pane on the far right, select **Open in Editor** (field to the right of **Source code**) to open the pipeline in the **Lakeflow Pipeline Editor**.
   
   e. In the new tab you should see five folders: 
      - **explorations**
      - **orders**
      - **status**
      - **customers**
      - Plus the extra **python_excluded** folder that contains the Python version. 

   f. Open the **customers** folder and select the **customers_pipeline.sql** file.
      - **NOTE:** The **status** and **orders** pipelines are the same as we saw in the previous demonstrations.

## D. Spark Declarative Pipeline CDC SCD Type 1 Pipeline Steps
Follow the steps below using the **customers_pipeline.sql** file in the Lakeflow Pipelines editor.

### PLEASE COMPLETE FIRST: Click the 'Run Pipeline' button to execute the Pipeline
1. To save some time, let's run the entire pipeline for **status**, **orders** and **customers**. While the pipeline is running explore the code in the **customers_pipeline.sql** for the new customers flow.

##### While the pipeline is running continue through the steps below to review the customer pipeline code.

### STEP 1: JSON -> Bronze Ingestion
The code in **STEP 1** of the **customers_pipeline.sql** file:
   - Defines a bronze streaming table named **customers_bronze_raw_demo6** using a data source configured with Auto Loader (`FROM STREAM`).
   - Adds the table property `pipelines.reset.allowed = false` to prevent deletion of all ingested bronze data if a full refresh is triggered.
   - Creates columns to capture the time of data ingestion and the source file name for each row.

### STEP 2: Create the Bronze Clean Streaming Table with Data Quality Enforcement
##### **NOTE:** This displays how you can use advanced data quality techniques with expectations. Advanced expectations are outside the scope of this course.

The code in **STEP 2** of the **customers_pipeline.sql** file:

- Adds three violation constraint actions: **WARN**, **DROP**, and **FAIL**. Each defines how to handle constraint violations.
- Applies multiple conditions to a single constraint.
- Uses a built-in SQL function within a constraint.

#### About the data source:

- The data is a CDC feed that contains insert (**`NEW`**), **`UPDATE`**, and **`DELETE`** operations for customers.  
- REQUIREMENT: **UPDATE** and **NEW** (insert) operations should contain valid entries for all fields.  
- REQUIREMENT: **DELETE** operations should contain **`NULL`** values for all fields except the **timestamp**, **customer_id**, and **operation** fields.

**NOTE:** To ensure only valid data reaches our silver table, we'll write a series of quality enforcement rules that allow expected null values in **DELETE** operations while rejecting bad data elsewhere.


### We'll break down each of these constraints below:

##### 1. **`valid_id`**
This constraint will cause our transaction to fail if a record contains a null value in the **`customer_id`** field.

##### 2. **`valid_operation`**
This constraint will drop any records that contain a null value in the **`operation`** field.

##### 3. **`valid_name`**
This constraint will track any records that contain a null value in the **`name`** field. Because there is no additional instruction for what to do with invalid records, violating rows will be recorded in metrics but not dropped.

##### 4. **`valid_address`**
This constraint checks if the **`operation`** field is **`DELETE`**; if not, it checks for null values in any of the 4 fields comprising an address. Because there is no additional instruction for what to do with invalid records, violating rows will be recorded in metrics but not dropped.

##### 5. **`valid_email`**
This constraint uses regex pattern matching to check that the value in the **`email`** field is a valid email address. It contains logic to not apply this to records if the **`operation`** field is **`DELETE`** (because these will have a null value for the **`email`** field). Violating records are dropped.

**NOTE:** When a record is a **DELETE** operation, all values except **customer_id** and **operation** (and **timestamp**, not shown below) will be `null`.

| address                               | city         | customer_id | email                    | name           | operation | state |
|---------------------------------------|--------------|-------------|--------------------------|----------------|-----------|-------|
| null                                  | null         | 23617       | null                     | null           | DELETE    | null  |
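The five constraints described above can be modeled in plain Python to see how the **FAIL**, **DROP**, and **WARN** actions interact. This is only an illustrative sketch, not the pipeline's SQL: the `check_record` function, the simplified `EMAIL_PATTERN`, and the `zip_code` field name are assumptions made for this example, since the pipeline's actual regex and fourth address column are not shown in this notebook.

```python
import re

# Assumption: a simplified email pattern; the pipeline's real regex may differ.
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
# Assumption: "zip_code" stands in for the fourth address field.
ADDRESS_FIELDS = ("address", "city", "state", "zip_code")


def check_record(record):
    """Return (action, warnings) for one CDC record.

    action is "fail", "drop", or "keep"; warnings lists WARN-level violations.
    """
    is_delete = record.get("operation") == "DELETE"

    # valid_id: FAIL when customer_id is null.
    if record.get("customer_id") is None:
        return "fail", []
    # valid_operation: DROP records with a null operation.
    if record.get("operation") is None:
        return "drop", []
    # valid_email: DROP non-DELETE records whose email does not match the pattern.
    if not is_delete:
        email = record.get("email")
        if email is None or not EMAIL_PATTERN.match(email):
            return "drop", []

    warnings = []
    # valid_name: WARN only, so the record is kept and the violation is counted.
    if record.get("name") is None:
        warnings.append("valid_name")
    # valid_address: WARN when a non-DELETE record is missing any address field.
    if not is_delete and any(record.get(f) is None for f in ADDRESS_FIELDS):
        warnings.append("valid_address")
    return "keep", warnings


# A DELETE record shaped like the example row above (null everywhere except id and operation).
delete_row = {"customer_id": 23617, "operation": "DELETE", "name": None, "email": None,
              "address": None, "city": None, "state": None, "zip_code": None}
print(check_record(delete_row))
```

Note that in this model the DELETE row is kept: the email and address checks are skipped for deletes, and the null name only produces a warning, which matches the descriptions of `valid_name`, `valid_address`, and `valid_email` above.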


### STEP 3: Processing CDC Data with **`AUTO CDC INTO`**
Spark Declarative Pipelines introduces a new syntactic structure for simplifying CDC feed processing: `AUTO CDC INTO` (formerly `APPLY CHANGES INTO`).

The code in **STEP 3** of the **customers_pipeline.sql** file uses `AUTO CDC INTO` to:
- Create the **2_silver_db.scd_type_1_customers_silver_demo6** streaming table if it doesn't exist.
- Update the **2_silver_db.scd_type_1_customers_silver_demo6** streaming table with inserts, updates, and deletes using records from the **1_bronze_db.customers_bronze_clean_demo6** streaming table.

#### Additional Notes
**`AUTO CDC INTO`** has the following guarantees and requirements:
- Performs incremental/streaming ingestion of CDC data
- Provides simple syntax to specify one or many fields as the primary key for a table
- Default assumption is that rows will contain inserts and updates
- Can optionally apply deletes
- Automatically orders late-arriving records using user-provided sequencing key (order to process rows)
- Uses a simple syntax for specifying columns to ignore with the **`EXCEPT`** keyword
- The default behavior is SCD Type 1. SCD Type 2 is also available, but we will focus on SCD Type 1.
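
To make the behavior listed above more concrete, here is a minimal plain-Python model of SCD Type 1 processing: records are keyed, ordered by a sequencing column, upserted in place, and optionally deleted. It is a sketch of the semantics only and does not use the Lakeflow API. The `Scd1Table` class, its `apply` method, the integer timestamps, and the first address for customer *23617* are hypothetical values chosen for illustration.

```python
class Scd1Table:
    """Toy model of an SCD Type 1 target: one current row per key, no history."""

    def __init__(self):
        self.rows = {}       # key -> current row
        self._last_seq = {}  # key -> highest sequence value applied (kept for deleted keys too)

    def apply(self, events, key="customer_id", sequence_by="timestamp",
              except_columns=("operation",)):
        """Apply a batch of CDC events and return upsert/delete counts."""
        metrics = {"upserted": 0, "deleted": 0}
        # Process events in sequence order so arrival order within the batch does not matter.
        for event in sorted(events, key=lambda e: e[sequence_by]):
            k, seq = event[key], event[sequence_by]
            # Ignore late or duplicate events that are not newer than what was already applied.
            if k in self._last_seq and seq <= self._last_seq[k]:
                continue
            self._last_seq[k] = seq
            if event["operation"] == "DELETE":
                if self.rows.pop(k, None) is not None:
                    metrics["deleted"] += 1
            else:
                # NEW and UPDATE are both upserts: overwrite the row, dropping excluded columns.
                self.rows[k] = {c: v for c, v in event.items() if c not in except_columns}
                metrics["upserted"] += 1
        return metrics


table = Scd1Table()
first = table.apply([
    {"customer_id": 23225, "operation": "NEW", "timestamp": 1,
     "address": "76814 Jacqueline Mountains Suite 815", "state": "TX"},
    {"customer_id": 23617, "operation": "NEW", "timestamp": 1,
     "address": "1 Example Road", "state": "CA"},  # hypothetical address
])
second = table.apply([
    {"customer_id": 23617, "operation": "DELETE", "timestamp": 2, "address": None, "state": None},
    {"customer_id": 23225, "operation": "UPDATE", "timestamp": 2,
     "address": "512 John Stravenue Suite 239", "state": "TN"},
    # A stale event that arrives late; it is older than the applied state and is skipped.
    {"customer_id": 23225, "operation": "UPDATE", "timestamp": 0, "address": "stale", "state": "ZZ"},
])
print(first, second, sorted(table.rows))
```

In this model the second batch overwrites the row for customer *23225* and removes customer *23617*, leaving no trace of the earlier values, which is the defining property of SCD Type 1.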


#### Documentation
[AUTO CDC INTO (Lakeflow Spark Declarative Pipelines)](https://docs.databricks.com/aws/en/dlt-ref/dlt-sql-ref-apply-changes-into)

[The AUTO CDC APIs: Simplify change data capture with Lakeflow Spark Declarative Pipelines](https://docs.databricks.com/aws/en/dlt/cdc)

### STEP 4: Explore the Customers Pipeline Graph
After running the pipeline and reviewing the code cells, take time to explore the pipeline results for the **customers** flow following the steps below.

**Run with 1 JSON File**

![demo6_cdc_run01.png](./Includes/images/demo6_cdc_run_1.png)

<br></br>
Notice the following:
1. In the **customers** flow in the pipeline graph, notice that **939** rows were streamed into the three streaming tables. 
    - Because all records are new and valid, every row was ingested through the entire flow.

2. In the table window below, find the **scd_type_1_customers_silver_demo6** table and select **Table metrics**. Note the following:

    - The **Upserted** column indicates that all **939** rows were upserted into the table, as all rows are new.

### STEP 5: Explore the Customers Pipeline Tables

1. Run the query below to view the **scd_type_1_customers_silver_demo6** streaming table (the table with SCD Type 1 updates, inserts and deletes). 

    Notice the following after the first run ingests the **00.json** file:

   - The streaming table contains all **939 rows** from the **00.json** file, since they are all new customers being added to the target table.

   - Each record was inserted into the empty streaming table.

In [0]:
SELECT *
FROM 2_silver_db.scd_type_1_customers_silver_demo6;

2. Query the **scd_type_1_customers_silver_demo6** streaming table for the following **customer_id** values (*23225*, *23617*). 

   Notice the following:
      - **customer_id** = *23225*
         - **Address**: `76814 Jacqueline Mountains Suite 815`  
         - **State**: `TX`  
      - **customer_id** = *23617*
         - This customer exists in the first execution (in file **00.json**)

In [0]:
SELECT *
FROM 2_silver_db.scd_type_1_customers_silver_demo6
WHERE customer_id IN (23225, 23617);

## E. Land New Data to Your Data Source Volume
Complete the following after executing and reviewing the **customers** pipeline flow, which consisted of ingesting one file (**00.json**) from cloud storage.

1. Run the cell below to land a new JSON file to each volume (**customers**, **status** and **orders**) to simulate new files being added to your cloud storage locations.

In [0]:
%python
copy_file_for_multiple_sources(
    copy_n_files=2,
    sleep_set=1,
    copy_from_source='/Volumes/dbacademy_retail/v01/retail-pipeline',
    copy_to_target=DA.paths.working_dir,
)

2. Run the cell below to programmatically view the files in your `/Volumes/dbacademy/ops/labuser-name/customers` volume. Confirm your volume now contains the original **00.json** file and the new **01.json** file.

In [0]:
%python
spark.sql(f'LIST "{DA.paths.working_dir}/customers"').display()

3. Run the cell to explore the raw data in the new **01.json** file prior to ingesting it in your pipeline. 

   Notice the following:

   - This file contains **23** rows.

   - The **operation** column specifies **UPDATE**, **DELETE**, and **NEW** operations for customers.
      - **In the new 01.json file there are**:
         - 12 customers with **UPDATE** values
         - 1 customer with a **DELETE** value
         - 10 new customers with a **NEW** value

   - In the results below, find the row with **customer_id** *23225* and note the following:

      - The original address for **Sandy Adams** (from the streaming table, file **00.json**) was: `76814 Jacqueline Mountains Suite 815`, `TX`
      - The updated address for **Sandy Adams** (from the file below) is: `512 John Stravenue Suite 239`, `TN`

   - In the results below, find the row with **customer_id** *23617* and note the following:
      - The **operation** for this customer is **DELETE**.
      - When the **operation** is **DELETE**, all other column values except **customer_id** and **timestamp** are `null`.

In [0]:
SELECT *
FROM read_files(
  DA.paths_working_dir || '/customers/01.json',
  format => "JSON"
)
ORDER BY customer_id;
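
The operation counts noted in step 3 could also be tallied programmatically. The sketch below assumes the file is newline-delimited JSON (one record per line) and uses a tiny hypothetical sample rather than the real **01.json** contents; `count_operations` is a helper name invented for this example.

```python
import json
from collections import Counter


def count_operations(json_lines):
    """Count CDC operations in newline-delimited JSON text."""
    records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]
    return Counter(record["operation"] for record in records)


# Hypothetical sample rows; the real file contains 23 rows.
sample = "\n".join([
    '{"customer_id": 1, "operation": "NEW"}',
    '{"customer_id": 2, "operation": "UPDATE"}',
    '{"customer_id": 3, "operation": "UPDATE"}',
    '{"customer_id": 4, "operation": "DELETE"}',
])
counts = count_operations(sample)
print(dict(counts))
```

Applied to the real file, a tally like this should report 12 **UPDATE**, 1 **DELETE**, and 10 **NEW** rows, as described above.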

### E1. Go back to your pipeline and click the **'Run pipeline'** button to ingest the new JSON file (**01.json**) incrementally and perform CDC SCD Type 1 on the **scd_type_1_customers_silver_demo6** table.

## F. Explore the Customers Pipeline

After you have explored and landed 1 new JSON file into each of your cloud data sources, complete the following to explore the **customers** flow in the **Pipeline graph**:

a. The pipeline ingested and processed only the new **01.json** file. 23 rows were read into the:

  - **customers_bronze_raw_demo6** streaming table
  - **customers_bronze_clean_demo6** streaming table (all data quality checks passed)

b. The details for the **scd_type_1_customers_silver_demo6** streaming table (the CDC SCD Type 1 table) show:
  - **Upserted = 22**:
    - 12 customers with UPDATE values (existing customers were updated in place with the new values)
    - 10 new customers with a NEW value (new customers were inserted into the table)
  - **Deleted records = 1**:
    - 1 customer was marked as DELETE and deleted from the table

![Run 2](./Includes/images/demo6_cdc_run_2.png)

## G. Explore the CDC SCD Type 1 on the scd_type_1_customers_silver_demo6 Streaming Table

1. View the data in the **scd_type_1_customers_silver_demo6** streaming table with SCD Type 1 and observe the following:

   a. The table contains **948 rows**:
      - **initial 939 customers** 
      - \+ **10** new customers
      - \- **1** deleted customer
      - **NOTES:** 
         - The **12** updates to original customers were made in place and updated the original record (SCD Type 1 does not keep historical records).
         - The **1** record marked for deletion was deleted from the table.

In [0]:
SELECT customer_id, address, name
FROM 2_silver_db.scd_type_1_customers_silver_demo6;

2. Query the **2_silver_db.scd_type_1_customers_silver_demo6** table for the following **customer_id** values: *23225* and *23617*. These were the values we reviewed earlier.  

    Notice the following:  

    - **customer_id** *23225* has been updated to the new address. The historical address was not retained because we used SCD Type 1.  
    - **customer_id** *23617* has been deleted from the table. SCD Type 1 keeps only the current state, so no record of this customer remains.  


In [0]:
SELECT *
FROM 2_silver_db.scd_type_1_customers_silver_demo6
WHERE customer_id IN (23225, 23617);
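
The row counts reported in sections F and G can be reconciled with a short calculation. The numbers come from this notebook; the variable names are just for illustration.

```python
initial_customers = 939  # rows ingested from 00.json
new_customers = 10       # NEW rows in 01.json
updated_customers = 12   # UPDATE rows in 01.json (applied in place)
deleted_customers = 1    # DELETE rows in 01.json

# Upserts cover both inserts and in-place updates.
upserted = new_customers + updated_customers
# Updates do not change the row count; only inserts and deletes do.
expected_rows = initial_customers + new_customers - deleted_customers

print(upserted, expected_rows)  # prints: 22 948
```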

## Additional Resources

- [What is change data capture (CDC)?](https://docs.databricks.com/aws/en/dlt/what-is-change-data-capture)

- [AUTO CDC INTO (Lakeflow Spark Declarative Pipelines)](https://docs.databricks.com/gcp/en/dlt-ref/dlt-sql-ref-apply-changes-into) documentation

- [The AUTO CDC APIs: Simplify change data capture with Lakeflow Spark Declarative Pipelines](https://docs.databricks.com/aws/en/dlt/cdc) documentation

- [How to implement Slowly Changing Dimensions when you have duplicates - Part 1: What to look out for?](https://community.databricks.com/t5/technical-blog/how-to-implement-slowly-changing-dimensions-when-you-have/ba-p/40568)

&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>