# Delta Live Tables (DLT)

* Build batch or streaming pipelines.
* Define end-to-end data pipelines in SQL or Python.
* Prevent bad data from flowing into tables through validation and integrity checks, and avoid data quality errors with predefined error policies (fail, drop, alert, or quarantine data).
* Gain deep visibility into pipeline operations with tools to visually track operational stats and data lineage.
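
The error policies in the list above can be illustrated without DLT itself. The following is a minimal plain-Python sketch, not the DLT API: `apply_expectation`, its `policy` values, and the sample rows are hypothetical names chosen for illustration. It mimics what the `dlt.expect`, `dlt.expect_or_drop`, and `dlt.expect_or_fail` decorators do to rows that violate a constraint.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def apply_expectation(
    rows: List[Row],
    predicate: Callable[[Row], bool],
    policy: str = "warn",
) -> Tuple[List[Row], int]:
    """Apply one constraint to rows; return (kept_rows, violation_count).

    Hypothetical helper, not part of DLT. Policies:
      warn: keep every row and only count violations (like dlt.expect)
      drop: discard violating rows (like dlt.expect_or_drop)
      fail: raise if any row violates the constraint (like dlt.expect_or_fail)
    """
    if policy not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown policy: {policy!r}")

    valid = [row for row in rows if predicate(row)]
    violations = len(rows) - len(valid)

    if policy == "fail" and violations:
        raise ValueError(f"{violations} row(s) violated the expectation")
    kept = valid if policy == "drop" else list(rows)
    return kept, violations


# Two sample rows; the second one violates "click_count > 0".
rows = [{"click_count": 5}, {"click_count": 0}]
positive = lambda row: row["click_count"] > 0

print(apply_expectation(rows, positive, "warn"))
print(apply_expectation(rows, positive, "drop"))
```

Quarantining is not shown here. As far as I know it is usually built by routing the violating rows to a separate table rather than by a dedicated policy.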

# Example pipeline

Written as a Databricks notebook:

```python
import dlt
from pyspark.sql.functions import expr

# Sample clickstream data under /databricks-datasets.
json_path = "/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json"


@dlt.table(
  comment="The raw wikipedia clickstream dataset, ingested from /databricks-datasets."
)
def clickstream_raw():
  # `spark` is the SparkSession that Databricks provides in the notebook.
  return spark.read.format("json").load(json_path)


@dlt.table(
  comment="Wikipedia clickstream data cleaned and prepared for analysis."
)
# Violations are recorded in the pipeline metrics, but the rows are kept.
@dlt.expect("valid_current_page_title", "current_page_title IS NOT NULL")
# A single violating row fails the update.
@dlt.expect_or_fail("valid_count", "click_count > 0")
def clickstream_prepared():
  return (
    dlt.read("clickstream_raw")
      .withColumn("click_count", expr("CAST(n AS INT)"))
      .withColumnRenamed("curr_title", "current_page_title")
      .withColumnRenamed("prev_title", "previous_page_title")
      .select("current_page_title", "click_count", "previous_page_title")
  )
```

![dag](dlt-sales-graph.png)

![dataset details](dlt-sales-dataset-details.png)

# Limitations


* Pipelines run on a custom version of the Databricks Runtime; you cannot set the Spark version manually.
* Development can only be done as a DLT pipeline: you cannot run the pipeline as a normal notebook, and `import dlt` does not work there.
* Each pipeline uses only a single cluster, with optional [enhanced autoscaling](https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-concepts.html#enable-enhanced-autoscaling), so there is no direct control over concurrency (for example, multiple clusters executing tables in parallel).
* Vendor lock-in: a pipeline cannot be moved to EMR.
* Views cannot be queried from another cluster.
* Only Delta tables are supported.