
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# 7 Bonus Lab - AUTO CDC INTO with SCD Type 1

##### NOTE: The AUTO CDC APIs replace the APPLY CHANGES APIs, and have the same syntax. The APPLY CHANGES APIs are still available, but Databricks recommends using the AUTO CDC APIs in their place.

### Estimated Duration: ~15-20 minutes

#### This is an optional lab that can be completed after class if you're interested in practicing CDC.


In this lab you will use Change Data Capture (CDC) to detect changes and apply them using SCD Type 1 logic (overwrite, no historical records).

### Learning Objectives

By the end of this lesson, you will be able to:
- Use `AUTO CDC INTO` to perform Change Data Capture (CDC) using SCD Type 1.

## REQUIRED - SELECT CLASSIC COMPUTE (your cluster starts with **labuser**)

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

    - In the drop-down, select **More**.

    - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.

## A. Classroom Setup

Run the following cell to configure your working environment for this course.

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically create and reference the information needed to run the course.

In [0]:
%run ./Includes/Classroom-Setup-7

## B. SCENARIO

Your data engineering team wants to build a Lakeflow Spark Declarative Pipeline to maintain a record of current employees without keeping historical data (SCD Type 1). The project has been started, but the final step, updating the silver table with the current employee records, has not yet been completed.

There are already two files in a cloud storage location that contain information about employees and employee updates.

### REQUIREMENTS:
It’s your job to complete the Spark Declarative Pipeline by adding the `AUTO CDC` statement to perform SCD Type 1.

Follow the steps below to complete your task.

## C. Explore the Raw Data Source Files

1. Run the cell below to programmatically view the files in your `/Volumes/your-lab-catalog-name/default/lab_files` volume. Confirm that you see **employees_1.csv** and **employees_2.csv**.

**NOTE:** You can also manually navigate to your **labuser.default.lab_files** volume and view the files in the volume.


In [0]:
%python
spark.sql(f'LIST "/Volumes/{DA.catalog_name}/default/lab_files"').display()

2. Query the 2 CSV files in that volume. 

  Notice the following:

   - The files contain a list of employees.

   - The **employees_1.csv** contains the initial employees.  
   
   - The **employees_2.csv** contains an update, a delete, and two new employees.

   - The **Operation** column provides information about the action for each record (new employee, update employee information, or delete employee).

   - The **ProcessDate** column indicates when the records were processed (acts as a sequence column).

   - In total, there are 10 rows.

      - There are two duplicate **EmployeeID** values:
        - **EmployeeID 1** – Sophia was an employee but has since left, so her record should be deleted.
        - **EmployeeID 3** – Liam received a bonus, and his **Salary** needs to be updated.
        
      - **Employee 6 & 7** - New employees from the **employees_2.csv** file.

In [0]:
SELECT
  _metadata.file_name as source_file, 
  *
FROM read_files(
  '/Volumes/' || DA.catalog_name || '/default/lab_files',
  format => 'CSV'
)
ORDER BY source_file, EmployeeID, ProcessDate DESC;

3. Based on the output above, the final table after applying SCD Type 1 to the two files should:

   - Contain 6 rows of data, because:
      - the row with a `null` **EmployeeID** is removed by a data quality expectation
      - **EmployeeID 1** (the employee who left) is deleted

   - Show a current salary of 100,000 for **EmployeeID 3**, in a single row.

   - Include **EmployeeID 6 & 7**, the new employees from the **employees_2.csv** file.

   - Track no historical data.

<br/>

**FINAL TABLE OUTPUT**
| EmployeeID | FirstName | Country | Department | Salary | HireDate   | ProcessDate |
|------------|-----------|---------|------------|--------|------------|-------------|
| 2          | Nikos     | GR      | IT         | 55000  | 2025-04-10 | 2025-06-05  |
| 3          | Liam      | US      | Sales      | **100000** | 2025-05-03 | **2025-06-22**  |
| 4          | Elena     | GR      | IT         | 53000  | 2025-06-04 | 2025-06-05  |
| 5          | James     | US      | IT         | 60000  | 2025-06-05 | 2025-06-05  |
| 6          | Emily     | US      | Enablement | 80000  | 2025-06-09 | **2025-06-22**  |
| 7          | Yannis    | GR      | HR         | 70000  | 2025-06-20 | **2025-06-22**  |
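The expected table follows from three rules: rows with a null key are dropped, events are applied per **EmployeeID** in **ProcessDate** order, and a delete event removes the row. The sketch below simulates those rules in plain Python. It is only an illustration of SCD Type 1 semantics, not how `AUTO CDC INTO` is implemented. The events are hypothetical: the `Operation` values (`new`, `update`, `delete`), the starting salaries, and the null-key row are assumptions, and the function name `apply_scd_type_1` is made up for this example.

```python
def apply_scd_type_1(events, key="EmployeeID", sequence="ProcessDate"):
    """Simulate SCD Type 1: keep only the latest row per key, with no history."""
    current = {}
    # Rule 1: drop events that have no key (the data quality expectation).
    valid = [e for e in events if e[key] is not None]
    # Rule 2: apply events in sequence order, regardless of arrival order.
    for event in sorted(valid, key=lambda e: e[sequence]):
        if event["Operation"] == "delete":
            # Rule 3: a delete removes the row instead of flagging it.
            current.pop(event[key], None)
        else:
            # Inserts and updates overwrite whatever was stored for the key.
            current[event[key]] = {
                k: v for k, v in event.items() if k != "Operation"
            }
    return current


# Hypothetical events (assumed values), deliberately listed out of order.
events = [
    {"EmployeeID": 3, "FirstName": "Liam", "Salary": 100000,
     "ProcessDate": "2025-06-22", "Operation": "update"},
    {"EmployeeID": 1, "FirstName": "Sophia", "Salary": None,
     "ProcessDate": "2025-06-22", "Operation": "delete"},
    {"EmployeeID": 6, "FirstName": "Emily", "Salary": 80000,
     "ProcessDate": "2025-06-22", "Operation": "new"},
    {"EmployeeID": 1, "FirstName": "Sophia", "Salary": 50000,
     "ProcessDate": "2025-06-05", "Operation": "new"},
    {"EmployeeID": 3, "FirstName": "Liam", "Salary": 90000,
     "ProcessDate": "2025-06-05", "Operation": "new"},
    {"EmployeeID": None, "FirstName": "Unknown", "Salary": 40000,
     "ProcessDate": "2025-06-05", "Operation": "new"},
]

current_employees = apply_scd_type_1(events)
for employee_id in sorted(current_employees):
    print(current_employees[employee_id])
```

Only **EmployeeID** 3 and 6 remain in this small example: Sophia is deleted, the null-key row never enters the table, and Liam appears once with the later salary. ISO-formatted date strings are used so that plain string comparison matches chronological order.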


## D. TO DO: Complete the Pipeline with SCD Type 1

1. Run the cell below to create your starter Spark Declarative Pipeline for this lab. The pipeline will set the following for you:
    - Your default catalog: `labuser`
    - Your configuration parameter: `source` = `/Volumes/your-lab-catalog-name/default/lab_files`

    **NOTE:** If the pipeline already exists, an error will be returned. In that case, you'll need to delete the existing pipeline and rerun this cell.

    To delete the pipeline:

    a. Select **Jobs & Pipelines** from the far-left navigation bar.  

    b. Find the pipeline you want to delete.  

    c. Click the three-dot menu ![ellipsis icon](./Includes/images/ellipsis_icon.png).  

    d. Select **Delete**.

**NOTE:**  The `create_declarative_pipeline` function is a custom function built for this course to create the sample pipeline using the Databricks REST API. This avoids manually creating the pipeline and referencing the pipeline assets.

**NOTE:** To run the solution for this lab, go to step **F. Lab Solution (OPTIONAL)**.

In [0]:
%python
create_declarative_pipeline(
    pipeline_name=f'7 - CDC Lab Starter Project - {DA.catalog_name}',
    root_path_folder_name='7 - CDC Lab Starter Project',
    catalog_name=DA.catalog_name,
    schema_name='default',
    source_folder_names=['cdc_type_1_pipeline'],
    # Sets the pipeline's `source` configuration parameter
    configuration={'source': f'/Volumes/{DA.catalog_name}/default/lab_files'},
)

2. Complete the following steps to open the starter Spark Declarative Pipeline project for this lab:

   a. In the main navigation bar, right-click **Jobs & Pipelines** and select **Open Link in New Tab**.

   b. In **Jobs & Pipelines** select your **7 - CDC Lab Starter Project - labuser** pipeline.

   c. **REQUIRED:** At the top near your pipeline name, turn on **New pipeline monitoring**.

   d. In the **Pipeline details** pane on the far right, select **Open in Editor** (field to the right of **Source code**) to open the pipeline in the **Lakeflow Pipeline Editor**.

   e. In the new tab you should see the folder: **cdc_type_1_pipeline**. 

   f. Open the **cdc_type_1_pipeline** folder and select the **cdc_employees.sql** file.

#### TO DO: Review the code in the cdc_employees.sql file and complete the `AUTO CDC INTO` statement to perform SCD Type 1.
- For simplicity in training, all code for the pipeline is in one file **cdc_employees.sql**.

- Walk through the **cdc_employees.sql** file and read the comments.

- The **bronze** and **silver** table code is completed for you. You just need to complete the `AUTO CDC INTO` statement.

- If you need the solution to the pipeline you can view it in the **7 - CDC Lab Solution Project** folder.

[AUTO CDC INTO (Lakeflow Spark Declarative Pipelines)](https://docs.databricks.com/gcp/en/dlt-ref/dlt-sql-ref-apply-changes-into)

## E. Explore Your CDC SCD Type 1 Streaming Table

After you have completed the `AUTO CDC INTO` statement in the **cdc_employees.sql** file, compare your results to the solution image below.

**FINAL PIPELINE RUN**

![Lab 7 Pipeline Run](./Includes/images/lab_7_pipelinerun.png)

1. Run the cell below to view the data in your **lab_2_silver_db.current_employees_silver_demo7** streaming table, which was built with SCD Type 1, and compare it to the solution below.

    Notice that with SCD Type 1 no historical data is kept.

**FINAL TABLE SOLUTION**
| EmployeeID | FirstName | Country | Department | Salary | HireDate   | ProcessDate |
|------------|-----------|---------|------------|--------|------------|-------------|
| 2          | Nikos     | GR      | IT         | 55000  | 2025-04-10 | 2025-06-05  |
| 3          | Liam      | US      | Sales      | **100000** | 2025-05-03 | **2025-06-22**  |
| 4          | Elena     | GR      | IT         | 53000  | 2025-06-04 | 2025-06-05  |
| 5          | James     | US      | IT         | 60000  | 2025-06-05 | 2025-06-05  |
| 6          | Emily     | US      | Enablement | 80000  | 2025-06-09 | **2025-06-22**  |
| 7          | Yannis    | GR      | HR         | 70000  | 2025-06-20 | **2025-06-22**  |

**NOTE:** If you ran the solution pipeline, the streaming table is named **lab_2_silver_db.current_employees_silver_demo7_solution**.

In [0]:
SELECT *
FROM lab_2_silver_db.current_employees_silver_demo7
ORDER BY EmployeeID
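If you would rather check your table against the solution programmatically than by eye, a small comparison helper is one option. The sketch below is a plain-Python illustration: the function name `compare_to_expected` is hypothetical, the expected rows are taken from the solution table above (only **EmployeeID** and **Salary** are compared), and the `actual_rows` values are invented to show what a failed run might look like.

```python
def compare_to_expected(actual_rows, expected_rows, key="EmployeeID"):
    """Report keys that are missing, unexpected, or different from the expected rows."""
    actual = {row[key]: row for row in actual_rows}
    expected = {row[key]: row for row in expected_rows}
    shared = expected.keys() & actual.keys()
    return {
        "missing": sorted(expected.keys() - actual.keys()),
        "unexpected": sorted(actual.keys() - expected.keys()),
        "different": sorted(k for k in shared if actual[k] != expected[k]),
    }


# Expected values copied from the FINAL TABLE SOLUTION above.
expected_rows = [
    {"EmployeeID": 2, "Salary": 55000},
    {"EmployeeID": 3, "Salary": 100000},
    {"EmployeeID": 4, "Salary": 53000},
    {"EmployeeID": 5, "Salary": 60000},
    {"EmployeeID": 6, "Salary": 80000},
    {"EmployeeID": 7, "Salary": 70000},
]

# Hypothetical result of an incomplete pipeline: the delete and the update
# were not applied, and one new employee is absent.
actual_rows = [
    {"EmployeeID": 1, "Salary": 50000},
    {"EmployeeID": 2, "Salary": 55000},
    {"EmployeeID": 3, "Salary": 90000},
    {"EmployeeID": 4, "Salary": 53000},
    {"EmployeeID": 5, "Salary": 60000},
    {"EmployeeID": 6, "Salary": 80000},
]

report = compare_to_expected(actual_rows, expected_rows)
print(report)
```

In a `%python` cell of this notebook, `actual_rows` could instead be built from your streaming table, for example with `[row.asDict() for row in spark.table("lab_2_silver_db.current_employees_silver_demo7").select("EmployeeID", "Salary").collect()]`. This assumes both columns are read as integers; if they arrive as strings, cast them before comparing.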

## F. Lab Solution (OPTIONAL)
If you want to run the solution, execute the cell below to create a pipeline from the **7 - CDC Lab Solution Project** folder.

Each table created by the solution pipeline ends with **_solution**, so any query in this notebook that references a lab table must be modified to use the **_solution** name.

In [0]:
%python
create_declarative_pipeline(
    pipeline_name=f'7 - CDC Lab Solution Project - {DA.catalog_name}',
    root_path_folder_name='7 - CDC Lab Solution Project',
    catalog_name=DA.catalog_name,
    schema_name='default',
    source_folder_names=['cdc_type_1_pipeline'],
    # Sets the pipeline's `source` configuration parameter
    configuration={'source': f'/Volumes/{DA.catalog_name}/default/lab_files'},
)

## G. CHALLENGE SCENARIO
### Duration: ~10 minutes

**NOTE:** *If you finish early in a live class, feel free to complete the challenge below. The challenge is optional and most likely won't be completed during the live class. Only continue if your Spark Declarative Pipeline was set up correctly in the previous section, which you can confirm by comparing your pipeline to the solution image.*

**SCENARIO:** In the challenge, you will land a new CSV file in your **lab_files** cloud storage volume and rerun the pipeline to watch the Spark Declarative Pipeline perform CDC SCD Type 1 on the new data.

1. Run the cell below to land another file in your **lab_files** cloud storage location.

In [0]:
%python
LabSetup.copy_file(copy_file = 'employees_3.csv', 
                   to_target_volume = f'/Volumes/{DA.catalog_name}/default/lab_files')

2. Query the **employees_3.csv** file. Notice the following:

   - **EmployeeID** values **2** and **6** need to be removed.

   - **EmployeeID 8** is a new employee in our company.


In [0]:
SELECT 
  _metadata.file_name as source_file,
  *
FROM read_files(
  '/Volumes/' || DA.catalog_name || '/default/lab_files/employees_3.csv',
  format => 'CSV'
);

3. Go back to your pipeline and select **Run pipeline**. Examine the pipeline run. Confirm it shows the following:

![Lab 7 Challenge Run](./Includes/images/lab_7_challengesolution.png)

4. Run the cell below to query the table **lab_2_silver_db.current_employees_silver_demo7** and view the results. Notice that:

   - The two employees (**EmployeeID** 2 and 6) were deleted.

   - **EmployeeID 8** was added.

   - No historical data is kept with SCD Type 1.

    **NOTE:** If you ran the solution pipeline, the streaming table is named **lab_2_silver_db.current_employees_silver_demo7_solution**.


**FINAL TABLE**
| EmployeeID | FirstName  | Country | Department | Salary  | HireDate   | ProcessDate |
|------------|------------|---------|------------|---------|------------|-------------|
| 3          | Liam       | US      | Sales      | 100000  | 2025-05-03 | 2025-06-22  |
| 4          | Elena      | GR      | IT         | 53000   | 2025-06-04 | 2025-06-05  |
| 5          | James      | US      | IT         | 60000   | 2025-06-05 | 2025-06-05  |
| 7          | Yannis     | GR      | HR         | 70000   | 2025-06-20 | 2025-06-22  |
| 8          | Panagiotis | GR      | Enablement | 90000   | 2025-07-01 | 2025-07-22  |

In [0]:
SELECT *
FROM lab_2_silver_db.current_employees_silver_demo7
ORDER BY EmployeeID;
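The challenge differs from the first run in one respect: the new file is applied to a table that already holds rows, rather than building the table from scratch. The sketch below illustrates that incremental step in plain Python. The starting state is taken from the final table in section E and the new employee from the challenge table above (reduced to four columns); the `Operation` values and the **ProcessDate** of the two delete events are assumptions, and `apply_batch` is a hypothetical helper, not part of any Databricks API.

```python
def apply_batch(state, batch, key="EmployeeID", sequence="ProcessDate"):
    """Apply one batch of CDC events to an existing SCD Type 1 table."""
    updated = dict(state)  # copy so the caller's table is left untouched
    for event in sorted(batch, key=lambda e: e[sequence]):
        existing = updated.get(event[key])
        # Ignore events older than the row already stored for this key.
        if existing is not None and event[sequence] < existing[sequence]:
            continue
        if event["Operation"] == "delete":
            updated.pop(event[key], None)
        else:
            updated[event[key]] = {
                k: v for k, v in event.items() if k != "Operation"
            }
    return updated


def row(employee_id, first_name, salary, process_date):
    """Build a table row with the four columns used in this sketch."""
    return {"EmployeeID": employee_id, "FirstName": first_name,
            "Salary": salary, "ProcessDate": process_date}


# Table contents after the first pipeline run (from the section E solution).
state = {
    2: row(2, "Nikos", 55000, "2025-06-05"),
    3: row(3, "Liam", 100000, "2025-06-22"),
    4: row(4, "Elena", 53000, "2025-06-05"),
    5: row(5, "James", 60000, "2025-06-05"),
    6: row(6, "Emily", 80000, "2025-06-22"),
    7: row(7, "Yannis", 70000, "2025-06-22"),
}

# Events modeled on employees_3.csv; Operation values and the delete
# events' ProcessDate are assumed.
batch = [
    {**row(2, "Nikos", None, "2025-07-22"), "Operation": "delete"},
    {**row(6, "Emily", None, "2025-07-22"), "Operation": "delete"},
    {**row(8, "Panagiotis", 90000, "2025-07-22"), "Operation": "new"},
]

new_state = apply_batch(state, batch)
print(sorted(new_state))
```

The remaining keys match the challenge table: 3, 4, 5, 7, and 8. This sketch does not remember the sequence value of deleted keys, so a late-arriving older event for a deleted employee would re-insert the row; handling that case is left out to keep the example short.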

&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>