

# Delta Live Tables


<div  style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://raw.githubusercontent.com/derar-alhussein/Databricks-Certified-Data-Engineer-Associate/main/Includes/images/bookstore_schema.png" alt="Bookstore schema" style="width: 600px">
</div>

In [0]:
SET datasets.path=dbfs:/mnt/demo-datasets/bookstore;

## 🔶 Bronze Layer Tables

The Bronze layer ingests raw, unprocessed data from external sources such as JSON or Parquet files, logs, or streaming systems.

- Acts as the landing zone for incremental data ingestion.
- Supports streaming ingestion of new files with Auto Loader (`cloud_files`) as well as batch reads of static files.
- Stores data in its original form for traceability and reprocessing if needed.

Typical tables in this layer include raw versions of operational data like customer and order records.


In [0]:
-- orders_raw
-- DLT tables are always declared with the LIVE keyword
CREATE OR REFRESH STREAMING LIVE TABLE orders_raw
COMMENT "The raw books orders, ingested from orders-json-raw"
AS SELECT * FROM cloud_files("${datasets.path}/orders-json-raw", "json",
                             map("cloudFiles.inferColumnTypes", "true"))

In [0]:
--customers
CREATE OR REFRESH LIVE TABLE customers
COMMENT "The customers lookup table, ingested from customers-json"
AS SELECT * FROM json.`${datasets.path}/customers-json`


## 🔷 Silver Layer Tables

Silver tables clean, enrich, and transform data from the Bronze layer to improve usability and structure.

- Joins data from multiple Bronze sources.
- Applies filters, formatting, and deduplication.
- Enforces **data quality expectations**, such as non-null checks or valid formats.
- Helps prepare datasets for analytics or further aggregation.

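The Silver table below, `orders_cleaned`, joins the streaming orders with the `customers` lookup table and drops any row whose `order_id` is null.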

In [0]:
--orders_cleaned
CREATE OR REFRESH STREAMING LIVE TABLE orders_cleaned (
  CONSTRAINT valid_order_number EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "The cleaned books orders with valid order_id"
AS
  SELECT order_id, quantity, o.customer_id, c.profile:first_name as f_name, c.profile:last_name as l_name,
         cast(from_unixtime(order_timestamp, 'yyyy-MM-dd HH:mm:ss') AS timestamp) order_timestamp, o.books,
         c.profile:address:country as country
  FROM STREAM(LIVE.orders_raw) o
  LEFT JOIN LIVE.customers c
    ON o.customer_id = c.customer_id

## Constraint violation

Delta Live Tables currently supports three modes for handling records that violate a constraint:

| **`ON VIOLATION`** | Behavior |
| --- | --- |
| **`DROP ROW`** | Discard records that violate constraints |
| **`FAIL UPDATE`** | A violated constraint causes the pipeline update to fail |
| Omitted | Records that violate constraints are kept, and the violations are reported in metrics |
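
The three modes can be combined in a single table declaration. The following is a minimal sketch rather than part of the pipeline above: the table name `orders_quality_demo` and the constraint names are hypothetical, and the columns are assumed to exist in `orders_raw` as they are used in `orders_cleaned`.

```sql
-- Hypothetical demo table showing the three ON VIOLATION behaviors together
CREATE OR REFRESH STREAMING LIVE TABLE orders_quality_demo (
  -- DROP ROW: records with a null order_id are discarded
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  -- FAIL UPDATE: a single non-positive quantity fails the pipeline update
  CONSTRAINT positive_quantity EXPECT (quantity > 0) ON VIOLATION FAIL UPDATE,
  -- Omitted: violating records are kept and only counted in metrics
  CONSTRAINT has_customer EXPECT (customer_id IS NOT NULL)
)
COMMENT "Illustration of the three constraint violation modes"
AS SELECT * FROM STREAM(LIVE.orders_raw)
```

Whether a check should drop rows, fail the update, or only warn depends on how critical it is. `FAIL UPDATE` is the strictest choice and stops processing until the offending data is dealt with.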

## 🟡 Gold Layer Tables

Gold tables contain **curated, aggregated, and business-level datasets** optimized for reporting, dashboards, and machine learning applications.

- Aggregations and KPIs are calculated at this stage.
- Supports business logic tailored to end-user needs.
- Offers clean, performant tables ready for decision-making.

These tables represent the final output of the multi-hop pipeline.

In [0]:
CREATE OR REFRESH LIVE TABLE cn_daily_customer_books
COMMENT "Daily number of books per customer in China"
AS
  SELECT customer_id, f_name, l_name, date_trunc("DD", order_timestamp) order_date, sum(quantity) books_counts
  FROM LIVE.orders_cleaned
  WHERE country = "China"
  GROUP BY customer_id, f_name, l_name, date_trunc("DD", order_timestamp)

In [0]:
CREATE OR REFRESH LIVE TABLE fr_daily_customer_books
COMMENT "Daily number of books per customer in France"
AS
  SELECT customer_id, f_name, l_name, date_trunc("DD", order_timestamp) order_date, sum(quantity) books_counts
  FROM LIVE.orders_cleaned
  WHERE country = "France"
  GROUP BY customer_id, f_name, l_name, date_trunc("DD", order_timestamp)

## ⚙️ Creating and Managing a DLT Pipeline

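Creating a DLT pipeline involves:

- Configuring the source notebooks, the storage location, and the target schema.
- Defining table logic in SQL (`CREATE OR REFRESH LIVE TABLE`) or in Python (with `@dlt.table` decorators).
- Selecting a **pipeline mode**: triggered (runs once over the available data, then stops) or continuous (keeps running and ingests new data as it arrives).
- Using **development mode** for iterative building and testing.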

## 🗺️ Execution Flow & Metadata

- **Directed Acyclic Graphs (DAGs)**: The pipeline UI renders table dependencies and data flow as a DAG.
- **Storage**: All DLT tables are stored as Delta tables.
- **Metadata**: The pipeline event log records update progress, data quality results for each expectation, and an audit trail of changes.
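
As a rough illustration of inspecting that metadata, the event log can be queried like any other Delta table. This sketch assumes the pipeline was created with an explicit storage location, in which case the event log is written under its `system/events` subdirectory. The path below is a placeholder, not a value taken from this notebook.

```sql
-- Placeholder path: replace with the pipeline's own storage location
SELECT
  timestamp,
  event_type,
  message,
  -- Per-expectation pass/fail counts, stored as JSON inside the details column
  details:flow_progress:data_quality:expectations AS expectations
FROM delta.`dbfs:/path/to/pipeline-storage/system/events`
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC
```

Rows with a non-null `expectations` value show how many records passed or failed each constraint, such as `valid_order_number` on `orders_cleaned`.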


## 🛑 Ending the Session

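After the pipeline update has finished:

- Clusters can be shut down manually, or they terminate automatically: in development mode the cluster is kept running for reuse, while in production mode it stops once the update completes.
- Output tables remain accessible for downstream use.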
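
For instance, once the pipeline has published its tables to a target schema, they can be queried from any notebook or SQL editor without the `LIVE` prefix. The schema name `demo_bookstore_dlt_db` below is a hypothetical example. Substitute the target configured for the pipeline.

```sql
-- Hypothetical target schema; the table name comes from the Gold layer above
SELECT customer_id, f_name, l_name, order_date, books_counts
FROM demo_bookstore_dlt_db.fr_daily_customer_books
ORDER BY order_date DESC, books_counts DESC
```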