# Lecture 30. Delta Live Tables (Hands On)

## Requirements

- An Azure subscription with a remaining quota of at least 4 cores.
- The Demo Cluster must exist and hold the results of running the prior notebook.

## Delta Live Tables Overview

Delta Live Tables or **DLT** is a framework for building reliable and maintainable data processing pipelines.  
DLT simplifies the hard work of building large-scale ETL while maintaining table dependencies and data quality.

---

<div  style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="../../assets/images/Screen-Captures/Workflows - Delta Live Tables - Pipeline details.jpg" alt="Workflows - Delta Live Tables - Pipeline details.jpg" style="width: 1280px">
</div>

Here, our **DLT** multi-hop pipeline is visualized as a graph, and we can see our two bronze tables, `customers` and `orders_raw`.  
They are joined together into the silver table `orders_cleaned`, from which we calculate our gold table `cn_daily_customer_books`.

DLT pipelines are implemented using Databricks notebooks. In the pipeline details panel on the right, we can see the path to the notebook containing the DLT table definitions.  
We can simply click this path to navigate to the source code.

---


## Delta Live Tables Syntax and Table Declaration

Let us explore the content of this notebook to better understand the syntax used by Delta Live Tables.

In this SQL notebook, we declare our **Delta Live Tables** that together implement a simple multi-hop architecture.  
**DLT tables** will always be preceded by the `LIVE` keyword.

---



<div  style="text-align: center; ">
  <img src="../../assets/images/Presentation-Images/bookstore_schema.png" alt="Raw Data Schema" style="width: 480px;">
</div>

```sql
SET datasets.path=dbfs:/mnt/demo-datasets/bookstore;
```

### Bronze Layer Tables

Here, we start by declaring two tables implementing the **bronze layer**.  
These represent our data in its rawest form.

#### `orders_raw`

The table `orders_raw` ingests **JSON** data incrementally with **Auto Loader** from our dataset directory.  
Incremental processing via **Auto Loader** requires the `STREAMING` keyword in the declaration.

The `cloud_files` function enables Auto Loader to be used natively with SQL.  
This function takes three parameters:
- The data file source location.
- The source data format, which is `json` in this case.
- A map of **reader options**.

In this case, we set `cloudFiles.inferColumnTypes` to `true` so that Auto Loader infers the column types instead of reading every column as a string.  
Also, notice that we add a comment that will be visible to anyone exploring the data catalog.

Let us run this query and see what will happen.

```sql
CREATE OR REFRESH STREAMING LIVE TABLE orders_raw
COMMENT "The raw books orders, ingested from orders-json-raw"
AS SELECT * FROM cloud_files("${datasets.path}/orders-json-raw", "json",
                             map("cloudFiles.inferColumnTypes", "true"))
```


As you can see, running a **DLT** query from here only validates that it is syntactically valid.  
To define and populate this table, you must create a **DLT pipeline**.  
We will see later how to configure and run a new pipeline from this notebook.
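
The third argument of `cloud_files` can carry more than one reader option. The sketch below assumes we want to pin the types of two columns while still inferring the rest, using the Auto Loader option `cloudFiles.schemaHints`. The table name `orders_raw_hinted` and the hinted types (`BIGINT`, `INT`) are illustrative assumptions, not part of the course notebook.

```sql
-- Hypothetical variant of orders_raw: infer column types, but pin two of them.
-- The table name and the hinted types are assumptions made for illustration.
CREATE OR REFRESH STREAMING LIVE TABLE orders_raw_hinted
COMMENT "The raw books orders, ingested with schema hints (illustrative)"
AS SELECT * FROM cloud_files("${datasets.path}/orders-json-raw", "json",
                             map("cloudFiles.inferColumnTypes", "true",
                                 "cloudFiles.schemaHints", "order_timestamp BIGINT, quantity INT"))
```

As with the declaration above, running this cell interactively only checks the syntax. The table is created only when a pipeline runs the notebook.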

---



#### `customers`

The second **bronze table** is `customers`, which reads customer data stored in **JSON** format.  


```sql
CREATE OR REFRESH LIVE TABLE customers
COMMENT "The customers lookup table, ingested from customers-json"
AS SELECT * FROM json.`${datasets.path}/customers-json`
```

This table is used below in a **join** operation to look up customer information.

---



### Silver Layer Tables

Next, we declare tables implementing the **silver layer**.  
This layer represents a refined copy of data from the **bronze layer**.

At this level, we apply operations like **data cleansing** and **enrichment**.

#### `orders_cleaned`

Here we declare our silver table `orders_cleaned`, which enriches the order data with customer information.  

In addition, we implement **quality control** using the `CONSTRAINT` keyword.
Here, we reject records with no `order_id`.  
The `CONSTRAINT` keyword enables **DLT** to collect metrics on constraint violations.  
It accepts an optional `ON VIOLATION` clause that specifies the action to take on records that violate the constraint.

**DLT** currently supports three modes for handling constraint violations, summarized in the table below:

| **`ON VIOLATION`** | Behavior |
| --- | --- |
| **`DROP ROW`** | Discard records that violate constraints |
| **`FAIL UPDATE`** | Violated constraint causes the pipeline to fail  |
| Omitted | Records violating constraints will be kept, and reported in metrics |

Notice also that we need to use the `LIVE` prefix to refer to other **DLT** tables.  
And to read from a streaming **DLT table**, we need to wrap it in the `STREAM()` function.


```sql
CREATE OR REFRESH STREAMING LIVE TABLE orders_cleaned (
  CONSTRAINT valid_order_number EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "The cleaned books orders with valid order_id"
AS
  SELECT order_id, quantity, o.customer_id, c.profile:first_name as f_name, c.profile:last_name as l_name,
         cast(from_unixtime(order_timestamp, 'yyyy-MM-dd HH:mm:ss') AS timestamp) order_timestamp, o.books,
         c.profile:address:country as country
  FROM STREAM(LIVE.orders_raw) o
  LEFT JOIN LIVE.customers c
    ON o.customer_id = c.customer_id
```

---



### Gold Layer Tables

Lastly, we declare the **gold table**, which in this case holds the daily number of books per customer in a specific region.  
Here, the region is China.


```sql
CREATE OR REFRESH LIVE TABLE cn_daily_customer_books
COMMENT "Daily number of books per customer in China"
AS
  SELECT customer_id, f_name, l_name, date_trunc("DD", order_timestamp) order_date, sum(quantity) books_counts
  FROM LIVE.orders_cleaned
  WHERE country = "China"
  GROUP BY customer_id, f_name, l_name, date_trunc("DD", order_timestamp)
```


---



## Creating and Running a Delta Live Tables Pipeline

Let us now see how to use this notebook to create a new **DLT pipeline**.

- To do so, start by navigating to the **Workflows** tab on the sidebar.
- Select the **Delta Live Tables** tab.
- Click **Create Pipeline**.

<div  style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="../../assets/images/Screen-Captures/Workflows - Delta Live Tables tab.jpg" alt="Workflows - Delta Live Tables tab" style="width: 1280px">
</div>




### Configuring the Pipeline

- Under **General**:
  - Fill in a **Pipeline name**, for example, `demo_bookstore`.

  - The **Pipeline mode** specifies how the pipeline will be run.  
   **Triggered pipelines** run once and then shut down until the next manual or scheduled update.  
   **Continuous pipelines** keep running and ingest new data as it arrives.  
   For this demo, let us keep it **Triggered**.

- For **Source code**, use the navigator to locate and select the notebook with the DLT table definitions, that is, this notebook.

- Under **Destination**:
  - For **Storage options**, select **Hive Metastore**.
  - In the **Storage location** field, enter a path where the pipeline logs and data files will be stored (`dbfs:/mnt/demo/dlt/demo_bookstore`). 
   We will explore this directory later.
  - In the **Target schema** field, enter a target database name (`demo_bookstore_dlt_db`).

<div  style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="../../assets/images/Screen-Captures/Workflows - Delta Live Tables - Create pipeline 1.jpg" alt="Workflows - Delta Live Tables - Create pipeline 1.jpg" style="width: 1280px">
</div>

<div  style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="../../assets/images/Screen-Captures/Workflows - Delta Live Tables - Create pipeline 2.jpg" alt="Workflows - Delta Live Tables - Create pipeline 2.jpg" style="width: 1280px">
</div>

- Under **Compute**:
  - A new cluster will be created for our **DLT pipeline**.  
   For the **Cluster mode**, choose, for example, **Fixed size**.
  - Set the number of **Workers** to `0` to create a **single-node cluster**.

- Under **Advanced**:
  - For **Configuration**, add a new configuration parameter.
    Set the key to `datasets.path` and the value to the location of the bookstore dataset (`dbfs:/mnt/demo-datasets/bookstore`).  
    This parameter is referenced in the notebook as `${datasets.path}` to specify the path to our source data files.
  - For **Driver type**, select **Standard_DS3_v2** (4 cores).
  
Notice the **DBU estimate** shown on the right, similar to the one provided when configuring interactive clusters.  
Finally, click **Create**.

---



### Running the Pipeline

Great! The pipeline has been created.

  <div  style="text-align: center; line-height: 0; padding-top: 9px;">
    <img src="../../assets/images/Screen-Captures/Workflows - Delta Live Tables - demo_bookstore (just created).jpg" alt="Workflows - Delta Live Tables - demo_bookstore (just created).jpg" style="width: 1280px">
  </div>

- Select **Development** to run the pipeline in **development mode**.  
  This mode supports interactive development by reusing the cluster, whereas production mode creates a new cluster for each run.
  **Development mode** also disables retries so that we can quickly identify and fix errors.

- Now click **Start**.  
  The initial run will take several minutes while the cluster is provisioned.

  <div  style="text-align: center; line-height: 0; padding-top: 9px;">
    <img src="../../assets/images/Screen-Captures/Workflows - Delta Live Tables - demo_bookstore (start in progress).jpg" alt="" style="width: 1280px">
  </div>

---



## Visualizing and Inspecting the Pipeline

Great! Our pipeline ran successfully.

- At the bottom, we see all the **events** of the pipeline run, classified as **information**, **warning**, or **error**.
- On the right-hand side, we see all the pipeline details and information related to the cluster.
- In the middle, we see the execution flow visualized as a **Directed Acyclic Graph (DAG)**.  
  This **DAG** represents the entities involved in the pipeline and the relationships between them.

Click on an entity to view its summary, which includes the **run status** and other **metadata**, such as the comment we set in the table definition in the notebook.

We can also see the **schema** of the table.

---



### Data Quality Metrics

If you select the `orders_cleaned` table, you can see the results reported in the **data quality section**.  
Because this flow has a **data expectation** declared, its metrics are collected and shown here.

As you can see, no records violate our constraint.
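
The same metrics can also be read from the pipeline event log. The following is a rough sketch, assuming the event log is stored as a Delta table under `system/events` inside the storage location configured earlier (`dbfs:/mnt/demo/dlt/demo_bookstore`), and that expectation results appear in `flow_progress` events. It is meant to be run from a regular notebook attached to a cluster, not from the pipeline notebook.

```sql
-- Sketch: aggregate expectation results from the DLT event log.
-- The events path is an assumption derived from the storage location set in the pipeline.
SELECT
  exp.dataset,
  exp.name AS expectation,
  SUM(exp.passed_records) AS passed_records,
  SUM(exp.failed_records) AS failed_records
FROM (
  -- Each flow_progress event may carry a JSON array of expectation results.
  SELECT explode(
           from_json(
             details:flow_progress:data_quality:expectations,
             "array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>"
           )
         ) AS exp
  FROM delta.`dbfs:/mnt/demo/dlt/demo_bookstore/system/events`
  WHERE event_type = 'flow_progress'
)
GROUP BY exp.dataset, exp.name
```

For this pipeline, we would expect one row for the `valid_order_number` expectation on `orders_cleaned`.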

---



## Modifying the Pipeline

Let us now go back to our notebook to add another table and see how this change is reflected in the pipeline.

Open the notebook of this pipeline by clicking on its path in the pipeline details.  
Scroll to the end of the notebook and add a new cell.

We will add a new table similar to the previous **gold table** declaration.  
But this time, instead of China, we will filter for **France**.

Let us also try something different to see what happens if we remove, for example, the `LIVE` prefix from the source table name.

If we run this cell, the syntax of the query is reported as correct. However, let us see what happens in our pipeline.

Now click **Start** again to rerun our pipeline and examine the updated results.

As you can see, this generates an error: `Table or view not found`, because we omitted the `LIVE` prefix.  
Let us correct this.

Back in the notebook, we add the `LIVE` prefix again and rerun the query.
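
For reference, the corrected cell might look like the following sketch. It mirrors the China gold table with the filter changed to France. The table name `fr_daily_customer_books` is an assumption, since the new cell is not shown in this notebook.

```sql
-- Assumed declaration of the second gold table (the table name is hypothetical).
CREATE OR REFRESH LIVE TABLE fr_daily_customer_books
COMMENT "Daily number of books per customer in France"
AS
  SELECT customer_id, f_name, l_name, date_trunc("DD", order_timestamp) order_date, sum(quantity) books_counts
  FROM LIVE.orders_cleaned  -- without the LIVE prefix, the pipeline fails with "Table or view not found"
  WHERE country = "France"
  GROUP BY customer_id, f_name, l_name, date_trunc("DD", order_timestamp)
```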

---



### Re-running the Pipeline

Great! The syntax is valid.

Let us rerun our pipeline by clicking **Start**.

Great! Our pipeline completed successfully, and we can now see our two gold tables.
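
Once the update has finished, the tables are published to the target schema configured earlier (`demo_bookstore_dlt_db`). A minimal sketch for inspecting them, assuming it is run from a regular notebook or the SQL editor rather than from the pipeline notebook:

```sql
-- List the tables that the pipeline published to the target schema.
SHOW TABLES IN demo_bookstore_dlt_db;

-- Preview the China gold table declared in this notebook.
SELECT *
FROM demo_bookstore_dlt_db.cn_daily_customer_books
ORDER BY order_date, customer_id;
```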

---



## [Exploring Pipeline Logs and Data](./Lecture-30__Delta-Live-Tables-(Hands-On)-2.ipynb)
