# Delta Live Tables (Lakeflow) Pipelines

**KION Training - Day 3**

---

## 📚 Agenda

1. Introduction to Delta Live Tables (DLT)
2. Declarative pipeline definitions
3. Materialized Views vs Streaming Tables
4. Data Quality Expectations
5. Event Log i Lineage
6. Automatic Orchestration

---

## 🎯 Learning objectives

After this module you will be able to:
- Define declarative pipelines in DLT
- Distinguish between Materialized Views and Streaming Tables
- Implement Data Quality Expectations
- Monitor pipelines through the Event Log
- Configure automatic orchestration

---

## 1️⃣ Introduction to Delta Live Tables (DLT)

**Delta Live Tables (DLT)** is a framework for building ETL/ELT pipelines declaratively in Databricks.

### Key features:
- **Declarative**: you define the "what", not the "how"
- **Automatic orchestration**: DLT manages dependencies between datasets
- **Data Quality**: built-in expectations
- **Monitoring**: event log, lineage, metrics
- **SQL and Python API**: choose the language that suits the team

### DLT vs traditional notebooks:

| Aspect | Traditional notebooks | Delta Live Tables |
|--------|---------------------|------------------|
| Definition | Imperative (steps) | Declarative (result) |
| Dependencies | Manual | Automatic |
| Quality | Custom code | Built-in expectations |
| Monitoring | Custom logging | Event log + lineage |
| Orchestration | Databricks Jobs | Automatic within the pipeline |

---

## 2️⃣ Declarative pipeline definitions

### Python API - basic syntax:

In DLT, datasets are defined with the `@dlt.table()` and `@dlt.view()` decorators.

In [None]:
import dlt
from pyspark.sql.functions import *

# Example 1: A simple DLT table
@dlt.table(
    name="raw_orders",
    comment="Raw orders data from CSV source"
)
def raw_orders():
    return (
        spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("/Volumes/main/default/kion_data/orders/*.csv")
    )

In [None]:
# Example 2: A table with transformations
@dlt.table(
    name="cleaned_orders",
    comment="Cleaned orders with data quality checks"
)
def cleaned_orders():
    return (
        dlt.read("raw_orders")  # Read from another DLT dataset in the same pipeline
        .filter(col("order_id").isNotNull())
        .filter(col("amount") > 0)
        .withColumn("order_date", to_date(col("order_date")))
        .withColumn("processing_time", current_timestamp())
    )

### SQL API - alternative syntax:

DLT also supports pure SQL, which suits analytics teams.

In [None]:
# SQL in DLT (placed in a separate SQL notebook):

# CREATE OR REFRESH LIVE TABLE raw_orders
# COMMENT "Raw orders data from CSV source"
# AS
# SELECT * FROM read_files(
#   '/Volumes/main/default/kion_data/orders/*.csv',
#   format => 'csv',
#   header => true
# )

# CREATE OR REFRESH LIVE TABLE cleaned_orders
# COMMENT "Cleaned orders with data quality checks"
# AS
# SELECT
#   * EXCEPT (order_date),                    -- avoid a duplicate order_date column
#   CAST(order_date AS DATE) AS order_date,
#   CURRENT_TIMESTAMP() AS processing_time
# FROM LIVE.raw_orders
# WHERE order_id IS NOT NULL AND amount > 0

---

## 3️⃣ Materialized Views vs Streaming Tables

DLT offers two main dataset types:

### Materialized Views
- **Batch processing**: the query is evaluated over the full source
- **Refresh semantics**: each update brings the result in line with the current source data, recomputing it in full when an incremental refresh is not possible
- **Use case**: historical data, aggregations, dimension tables

### Streaming Tables
- **Incremental processing**: only new data is processed
- **Continuous updates**: append-only or upsert
- **Use case**: fact tables, real-time analytics, CDC

In [None]:
# Example: Materialized View (batch)
@dlt.table(
    name="daily_sales_summary",
    comment="Daily aggregated sales - full refresh"
)
def daily_sales_summary():
    return (
        dlt.read("cleaned_orders")
        .groupBy("order_date")
        .agg(
            count("order_id").alias("total_orders"),
            sum("amount").alias("total_revenue"),
            avg("amount").alias("avg_order_value")
        )
    )

In [None]:
# Example: Streaming Table (incremental)
@dlt.table(
    name="streaming_orders",
    comment="Streaming orders - incremental processing"
)
def streaming_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        # Auto Loader reads CSV columns as strings unless column type inference is enabled
        .option("cloudFiles.inferColumnTypes", "true")
        .load("/Volumes/main/default/kion_data/orders/")
    )

In [None]:
# Streaming table with transformations
@dlt.table(
    name="silver_orders_stream",
    comment="Silver layer - streaming incremental"
)
def silver_orders_stream():
    return (
        dlt.read_stream("streaming_orders")  # read_stream for a streaming source
        .filter(col("order_id").isNotNull())
        .withColumn("ingested_at", current_timestamp())
        .withColumn("year", year(col("order_date")))
        .withColumn("month", month(col("order_date")))
    )

### When to use a Materialized View vs a Streaming Table?

**Materialized View**:
- Aggregations and reports (Gold layer)
- Dimension tables (e.g. products, customers)
- Small to medium datasets
- Logic that needs a full recomputation

**Streaming Table**:
- Fact tables (transactions, events)
- Real-time/near-real-time processing
- Large data volumes
- CDC (Change Data Capture)
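
The difference between recomputing a result and processing only new data can be sketched in plain Python. This is a conceptual model under simplifying assumptions (an in-memory list as the source, a row count as the checkpoint); `full_refresh` and `IncrementalTable` are hypothetical names, not DLT APIs.

```python
from collections import defaultdict


def full_refresh(all_rows):
    """Materialized-view style: recompute the aggregate from every source row."""
    totals = defaultdict(float)
    for row in all_rows:
        totals[row["order_date"]] += row["amount"]
    return dict(totals)


class IncrementalTable:
    """Streaming-table style: remember how far we have read and append only new rows."""

    def __init__(self):
        self.rows = []
        self.checkpoint = 0  # number of source rows already processed

    def update(self, source):
        new_rows = source[self.checkpoint:]
        self.rows.extend(new_rows)
        self.checkpoint = len(source)
        return len(new_rows)


source = [
    {"order_date": "2024-01-01", "amount": 10.0},
    {"order_date": "2024-01-01", "amount": 5.0},
]
stream = IncrementalTable()
first_batch = stream.update(source)   # both initial rows are new
source.append({"order_date": "2024-01-02", "amount": 7.5})
second_batch = stream.update(source)  # only the appended row is processed
summary = full_refresh(stream.rows)   # aggregate recomputed over all rows
```

The second update touches one row instead of three, which is why streaming tables tend to suit large, append-heavy fact data, while the aggregate on top is cheap enough to recompute.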

---

## 4️⃣ Data Quality Expectations

**Expectations** are a declarative way to define data quality rules in DLT.

### Three types of expectations:

1. **WARN**: log violations but keep the rows
2. **DROP**: remove rows that violate the rule
3. **FAIL**: stop the pipeline update on a violation
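
The three behaviours can be modelled in a few lines of plain Python. This is only a sketch of the semantics on an in-memory list; `apply_expectation` is a hypothetical helper, not part of the `dlt` module.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def apply_expectation(
    rows: List[Row], rule: Callable[[Row], bool], action: str
) -> Tuple[List[Row], int]:
    """Return (rows kept, number of violations) for action 'warn', 'drop' or 'fail'."""
    violations = [r for r in rows if not rule(r)]
    if action == "fail" and violations:
        raise ValueError(f"{len(violations)} row(s) violated the expectation")
    if action == "drop":
        kept = [r for r in rows if rule(r)]
    else:
        kept = list(rows)  # warn: every row is kept, violations are only counted
    return kept, len(violations)


orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": None, "amount": 15.0},
    {"order_id": 3, "amount": -1.0},
]


def has_id(row: Row) -> bool:
    return row["order_id"] is not None


warned, warn_count = apply_expectation(orders, has_id, "warn")
dropped, drop_count = apply_expectation(orders, has_id, "drop")
try:
    apply_expectation(orders, has_id, "fail")
    failed = False
except ValueError:
    failed = True
```

In all three cases the violation is counted; the actions differ only in what happens to the offending rows and to the run itself.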

### Syntax:

In [None]:
# Example 1: WARN - log violations
@dlt.table(
    name="orders_with_quality_checks"
)
@dlt.expect("valid_order_id", "order_id IS NOT NULL")
@dlt.expect("positive_amount", "amount > 0")
def orders_with_quality_checks():
    return dlt.read("raw_orders")

# Violations are recorded in the Event Log, but the rows still flow downstream

In [None]:
# Example 2: DROP - remove bad rows
@dlt.table(
    name="clean_orders"
)
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "amount > 0")
@dlt.expect_or_drop("valid_date", "order_date IS NOT NULL")
def clean_orders():
    return dlt.read("raw_orders")

# Rows that do not meet the expectations are dropped automatically

In [None]:
# Example 3: FAIL - stop the pipeline
@dlt.table(
    name="critical_orders"
)
@dlt.expect_or_fail("no_nulls_in_key", "order_id IS NOT NULL AND customer_id IS NOT NULL")
def critical_orders():
    return dlt.read("raw_orders")

# The pipeline update stops if any row violates the rule

In [None]:
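# Example 4: Combined expectations
@dlt.table(
    name="validated_orders"
)
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("realistic_amount", "amount BETWEEN 1 AND 1000000")
@dlt.expect_or_drop("valid_status", "status IN ('pending', 'completed', 'cancelled')")
@dlt.expect_or_drop("recent_date", "order_date >= '2020-01-01'")
@dlt.expect("preferred_customer", "is_vip_customer")
def validated_orders():
    # Expectation constraints cannot contain subqueries, so the VIP lookup is a join.
    # vip_customers is assumed to be another dataset in this pipeline; it is not defined in this notebook.
    vip = (
        dlt.read("vip_customers")
        .select("customer_id")
        .distinct()
        .withColumn("is_vip_customer", lit(True))
    )
    return (
        dlt.read("raw_orders")
        .join(vip, "customer_id", "left")
        .withColumn("is_vip_customer", coalesce(col("is_vip_customer"), lit(False)))
        .withColumn("validation_timestamp", current_timestamp())
    )

# Combination of DROP (critical) and WARN (informational)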

### Best practices for expectations:

1. **Use FAIL only for critical conditions**: e.g. missing keys that would corrupt downstream tables
2. **DROP for data quality issues**: e.g. nulls, invalid values
3. **WARN for business logic**: e.g. suspicious patterns
4. **Monitor the Event Log**: check quality metrics regularly
5. **Naming**: give each expectation a readable name that describes the rule
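
One way to keep names and rules readable is to hold them in a single dictionary. The `dlt` module accepts such a dictionary in `@dlt.expect_all`, `@dlt.expect_all_or_drop` and `@dlt.expect_all_or_fail`. The sketch below is plain Python and does not call `dlt`; `quarantine_predicate` is a hypothetical helper that derives a filter for rows breaking at least one rule, which could feed a separate quarantine table.

```python
# Rule set reusing the expectation names and constraints from the examples above.
rules = {
    "valid_order_id": "order_id IS NOT NULL",
    "positive_amount": "amount > 0",
    "valid_date": "order_date IS NOT NULL",
}


def quarantine_predicate(rule_set: dict) -> str:
    """Build a SQL predicate that is true for rows violating at least one rule."""
    all_rules = " AND ".join(f"({constraint})" for constraint in rule_set.values())
    return f"NOT ({all_rules})"


predicate = quarantine_predicate(rules)
print(predicate)
```

Because SQL uses three-valued logic, a constraint that evaluates to NULL makes the whole predicate NULL rather than true, so explicit `IS NOT NULL` rules (as in this set) are worth keeping alongside value checks.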

---

## 5️⃣ Event Log and Lineage

### Event Log

Every DLT pipeline produces an **Event Log**, a detailed record of pipeline activity:
- Progress and status of each flow
- Number of rows processed
- Expectation violations
- Errors and warnings
- Cluster resource metrics

The Event Log is available through:
1. **DLT Pipeline UI**: graphical interface
2. **Event Log table**: a Delta table with pipeline metadata
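
In the event log schema documented by Databricks, expectation metrics are not top-level columns: they are nested as JSON in the `details` field of `flow_progress` events. The sketch below shows the parsing logic in plain Python on a made-up sample event; the counts and the `summarize_expectations` helper are illustrative assumptions, not real pipeline output.

```python
import json
from collections import defaultdict

# Illustrative events shaped like event log records; all values are made up.
sample_events = [
    {
        "event_type": "flow_progress",
        "details": json.dumps({
            "flow_progress": {
                "data_quality": {
                    "expectations": [
                        {"name": "valid_order_id", "dataset": "silver_orders",
                         "passed_records": 98, "failed_records": 2},
                    ]
                }
            }
        }),
    },
    {"event_type": "create_update", "details": "{}"},
]


def summarize_expectations(events):
    """Sum passed/failed counts per (dataset, expectation) pair."""
    totals = defaultdict(lambda: {"passed": 0, "failed": 0})
    for event in events:
        if event["event_type"] != "flow_progress":
            continue
        details = json.loads(event["details"])
        quality = details.get("flow_progress", {}).get("data_quality", {})
        for exp in quality.get("expectations", []):
            key = (exp["dataset"], exp["name"])
            totals[key]["passed"] += exp["passed_records"]
            totals[key]["failed"] += exp["failed_records"]
    return dict(totals)


summary = summarize_expectations(sample_events)
```

The same filter-parse-aggregate shape applies when the log is queried with Spark: filter on `event_type`, parse `details`, explode the expectations array, then group.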

### Querying the Event Log:

In [None]:
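# The event log is stored as a Delta table and exposed through the
# event_log() table-valued function. Replace <pipeline_id> with your pipeline's ID.
event_log_df = spark.sql('SELECT * FROM event_log("<pipeline_id>")')

# Expectation metrics are nested in the JSON `details` column of flow_progress events
expectation_schema = (
    "array<struct<name: string, dataset: string, "
    "passed_records: int, failed_records: int>>"
)
quality_events = (
    event_log_df
    .filter(col("event_type") == "flow_progress")
    .select(
        explode(
            from_json(
                get_json_object("details", "$.flow_progress.data_quality.expectations"),
                expectation_schema,
            )
        ).alias("exp")
    )
    .select("exp.dataset", col("exp.name").alias("expectation"),
            "exp.passed_records", "exp.failed_records")
)
quality_events.display()

# Data quality statistics
quality_summary = (
    quality_events
    .groupBy("dataset", "expectation")
    .agg(
        sum("passed_records").alias("total_passed"),
        sum("failed_records").alias("total_failed")
    )
)
quality_summary.display()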

In [None]:
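# Monitoring flow metrics
# The flow name is in the `origin` struct; status and row counts are nested in `details`
flow_progress = (
    event_log_df
    .filter(col("event_type") == "flow_progress")
    .select(
        "timestamp",
        col("origin.flow_name").alias("flow_name"),
        get_json_object("details", "$.flow_progress.status").alias("status"),
        get_json_object("details", "$.flow_progress.metrics.num_output_rows")
        .cast("long").alias("num_output_rows"),
    )
    .orderBy(desc("timestamp"))
)
flow_progress.display()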

### Data Lineage

DLT automatically tracks **lineage**, the relationships between tables:
- Which tables are sources (upstream)
- Which tables are targets (downstream)
- How data flows through the pipeline

**Lineage is visible in**:
1. **DLT Pipeline Graph**: visualization of dependencies
2. **Unity Catalog**: end-to-end lineage
3. **System tables**: metadata queries
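
Lineage records are essentially (source, target) edges, so upstream and downstream questions reduce to graph traversal. A minimal sketch in plain Python, assuming a hand-written edge list named after the tables in this notebook; `upstream_of` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical (source, target) edges, named after the tables in this notebook.
edges = [
    ("bronze_orders", "silver_orders"),
    ("silver_orders", "gold_daily_sales"),
    ("silver_orders", "gold_customer_ltv"),
]


def upstream_of(target, lineage_edges):
    """Return every table that feeds `target`, directly or transitively."""
    parents = defaultdict(set)
    for source, dest in lineage_edges:
        parents[dest].add(source)
    seen, stack = set(), [target]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


result = upstream_of("gold_daily_sales", edges)
print(sorted(result))
```

Reversing the edge direction in `parents` gives the downstream impact of a table instead, which is useful before changing a Bronze or Silver schema.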

### Example lineage query:

In [None]:
# Lineage z Unity Catalog system tables
lineage_df = spark.sql("""
    SELECT 
        source_table_full_name,
        target_table_full_name,
        source_type,
        event_time
    FROM system.access.table_lineage
    WHERE target_table_full_name LIKE '%kion_dlt%'
    ORDER BY event_time DESC
""")
lineage_df.display()

---

## 6️⃣ Automatic Orchestration

DLT automatically manages:
1. **Dependency resolution**: determines the execution order
2. **Parallelization**: runs independent tables in parallel
3. **Retry logic**: automatic retries on errors
4. **Checkpointing**: for streaming tables
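
Dependency resolution and parallelization can be illustrated with a topological sort. The sketch below uses Python's standard `graphlib` module on a hand-written dependency map named after the tables in this notebook. It shows the general idea only and makes no claim about how DLT schedules work internally; `execution_stages` is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# dataset -> set of datasets it reads from (names taken from this notebook)
dependencies = {
    "bronze_orders": set(),
    "silver_orders": {"bronze_orders"},
    "gold_daily_sales": {"silver_orders"},
    "gold_customer_ltv": {"silver_orders"},
}


def execution_stages(deps):
    """Group datasets into stages; everything in one stage can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(dependencies)
print(stages)
# [['bronze_orders'], ['silver_orders'], ['gold_customer_ltv', 'gold_daily_sales']]
```

Both Gold tables land in the same stage because neither reads from the other, so they are candidates for parallel execution.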

### Pipeline configuration:

In [None]:
# DLT pipeline settings (JSON configuration expressed as a Python dict)
pipeline_config = {
    "name": "KION_Orders_DLT_Pipeline",
    "storage": "/mnt/dlt/kion_orders",
    "target": "kion_dlt_db",
    # Source notebooks are listed under "libraries", one "notebook" entry each
    "libraries": [
        {"notebook": {"path": "/Workspace/KION/dlt_orders_bronze"}},
        {"notebook": {"path": "/Workspace/KION/dlt_orders_silver"}},
        {"notebook": {"path": "/Workspace/KION/dlt_orders_gold"}}
    ],
    # Free-form key/value parameters that pipeline code can read via spark.conf
    "configuration": {
        "source_path": "/Volumes/main/default/kion_data"
    },
    "clusters": [
        {
            "label": "default",
            "num_workers": 2,
            "node_type_id": "Standard_DS3_v2"
        }
    ],
    "continuous": False,  # False = triggered mode, True = continuous
    "development": True   # True = development mode (cluster reuse, no automatic retries)
}

print("DLT Pipeline configuration ready!")

### Modes of Execution:

**Development Mode**:
- Reuses the cluster between runs
- Disables automatic retries so errors surface immediately
- Fast iterations
- Use during development

**Production Mode**:
- Shuts the cluster down as soon as the update finishes
- Retries on recoverable errors, restarting the cluster when needed
- Cost-optimized
- Use in production

**Triggered vs Continuous**:
- **Triggered**: on-demand or scheduled
- **Continuous**: always running, minimal latency
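
Since the two mode switches are just the boolean `development` and `continuous` settings, they can be derived per environment instead of edited by hand. A minimal sketch, assuming a hypothetical `pipeline_settings` helper and a name suffix chosen only for illustration:

```python
def pipeline_settings(environment: str, continuous: bool = False) -> dict:
    """Build a minimal settings fragment for 'dev' or 'prod' (hypothetical helper)."""
    if environment not in {"dev", "prod"}:
        raise ValueError("environment must be 'dev' or 'prod'")
    return {
        # The environment suffix is an illustrative naming convention, not a requirement
        "name": f"KION_Orders_DLT_Pipeline_{environment}",
        "development": environment == "dev",  # dev: reuse cluster, no automatic retries
        "continuous": continuous,             # False = triggered, True = continuous
    }


dev = pipeline_settings("dev")
prod = pipeline_settings("prod", continuous=True)
print(dev)
print(prod)
```

The fragment would still need the remaining settings (libraries, target, clusters) merged in before it could describe a complete pipeline.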

---

## 🔨 Complete example: Bronze → Silver → Gold DLT Pipeline

### Pipeline Architecture:
```
raw_orders (CSV) 
    ↓
bronze_orders (Raw + Audit)
    ↓
silver_orders (Cleaned + Validated)
    ↓
gold_daily_sales (Aggregated)
```

In [None]:
import dlt
from pyspark.sql.functions import *

# BRONZE LAYER
@dlt.table(
    name="bronze_orders",
    comment="Bronze: Raw orders with audit columns",
    table_properties={
        "quality": "bronze",
        "pipelines.autoOptimize.zOrderCols": "order_date"
    }
)
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
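        # Auto Loader reads CSV columns as strings unless column type inference is enabled
        .option("cloudFiles.inferColumnTypes", "true")
        .load("/Volumes/main/default/kion_data/orders/")
        .withColumn("ingestion_timestamp", current_timestamp())
        # _metadata.file_path is the recommended replacement for input_file_name()
        .withColumn("source_file", col("_metadata.file_path"))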
    )

In [None]:
# SILVER LAYER
@dlt.table(
    name="silver_orders",
    comment="Silver: Cleaned and validated orders",
    table_properties={
        "quality": "silver"
    }
)
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("valid_customer_id", "customer_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "amount > 0")
@dlt.expect_or_drop("valid_date", "order_date IS NOT NULL AND order_date >= '2020-01-01'")
@dlt.expect("reasonable_amount", "amount < 1000000")
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .select(
            col("order_id").cast("int"),
            col("customer_id").cast("int"),
            to_date(col("order_date")).alias("order_date"),
            col("product_id").cast("int"),
            col("quantity").cast("int"),
            col("amount").cast("double"),
            lower(trim(col("status"))).alias("status"),
            col("ingestion_timestamp")
        )
        .withColumn("year", year(col("order_date")))
        .withColumn("month", month(col("order_date")))
        .withColumn("quarter", quarter(col("order_date")))
    )

In [None]:
# GOLD LAYER - Aggregated daily sales
@dlt.table(
    name="gold_daily_sales",
    comment="Gold: Daily sales aggregations",
    table_properties={
        "quality": "gold"
    }
)
def gold_daily_sales():
    return (
        dlt.read("silver_orders")
        .groupBy("order_date", "year", "month", "quarter")
        .agg(
            count("order_id").alias("total_orders"),
            countDistinct("customer_id").alias("unique_customers"),
            sum("amount").alias("total_revenue"),
            avg("amount").alias("avg_order_value"),
            max("amount").alias("max_order_value"),
            sum("quantity").alias("total_quantity")
        )
        .withColumn("calculated_at", current_timestamp())
    )

In [None]:
# GOLD LAYER - Customer lifetime value
@dlt.table(
    name="gold_customer_ltv",
    comment="Gold: Customer lifetime value metrics"
)
def gold_customer_ltv():
    return (
        dlt.read("silver_orders")
        .groupBy("customer_id")
        .agg(
            count("order_id").alias("total_orders"),
            sum("amount").alias("lifetime_value"),
            avg("amount").alias("avg_order_value"),
            min("order_date").alias("first_order_date"),
            max("order_date").alias("last_order_date"),
            datediff(max("order_date"), min("order_date")).alias("customer_age_days")
        )
        .withColumn("customer_segment",
            when(col("lifetime_value") > 10000, "VIP")
            .when(col("lifetime_value") > 5000, "High Value")
            .when(col("lifetime_value") > 1000, "Medium Value")
            .otherwise("Low Value")
        )
    )

---

## 📊 Monitoring and Troubleshooting

### Checking pipeline status:

In [None]:
# Query the Event Log for errors (replace <pipeline_id> with your pipeline's ID)
errors_df = spark.sql("""
    SELECT 
        timestamp,
        level,
        origin.flow_name AS flow_name,
        message
    FROM event_log("<pipeline_id>")
    WHERE level = 'ERROR'
    ORDER BY timestamp DESC
    LIMIT 20
""")
errors_df.display()

In [None]:
# Data quality violations
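quality_violations = spark.sql("""
    WITH expectations AS (
        -- Expectation metrics are nested in the JSON details of flow_progress events
        SELECT explode(
            from_json(
                details:flow_progress.data_quality.expectations,
                'array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>'
            )
        ) AS exp
        FROM event_log("<pipeline_id>")
        WHERE event_type = 'flow_progress'
    )
    SELECT 
        exp.dataset AS dataset,
        exp.name AS expectation,
        SUM(exp.failed_records) AS total_failures,
        SUM(exp.passed_records) AS total_passed,
        ROUND(SUM(exp.failed_records) * 100.0 / (SUM(exp.failed_records) + SUM(exp.passed_records)), 2) AS failure_rate_pct
    FROM expectations
    GROUP BY exp.dataset, exp.name
    HAVING SUM(exp.failed_records) > 0
    ORDER BY failure_rate_pct DESC
""")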
quality_violations.display()

In [None]:
# Pipeline execution time trends
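execution_trends = spark.sql("""
    -- The event log has no duration column; approximate each flow's run time
    -- as the span between its first and last flow_progress event per update.
    SELECT 
        date_trunc('hour', MIN(timestamp)) AS execution_hour,
        origin.flow_name AS flow_name,
        unix_timestamp(MAX(timestamp)) - unix_timestamp(MIN(timestamp)) AS execution_seconds,
        SUM(CAST(details:flow_progress.metrics.num_output_rows AS BIGINT)) AS total_rows_processed
    FROM event_log("<pipeline_id>")
    WHERE event_type = 'flow_progress'
    GROUP BY origin.update_id, origin.flow_name
    ORDER BY execution_hour DESC, flow_name
""")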
execution_trends.display()

---

## ✅ Summary

### You have learned:

✅ **Declarative pipelines**: `@dlt.table()` API  
✅ **Materialized Views vs Streaming Tables**: batch vs incremental  
✅ **Data Quality Expectations**: warn / drop / fail  
✅ **Event Log**: monitoring and troubleshooting  
✅ **Lineage tracking**: automatic dependency tracking  
✅ **Automatic Orchestration**: dependency resolution  

### Key Takeaways:

1. **DLT simplifies ETL**: declarative syntax, automatic orchestration
2. **Quality first**: built-in expectations enforce data quality rules
3. **Observability**: Event Log + Lineage give visibility into the pipeline
4. **Streaming and Batch**: one framework for both paradigms
5. **Production-ready**: retry, checkpointing, monitoring out-of-the-box

### Next steps:
- **Notebook 03**: Databricks Jobs Orchestration
- **Workshop 02**: Hands-on DLT + Orchestration

---

## 📚 Additional resources

- [Delta Live Tables Documentation](https://docs.databricks.com/delta-live-tables/index.html)
- [DLT Best Practices](https://docs.databricks.com/delta-live-tables/best-practices.html)
- [Event Log Reference](https://docs.databricks.com/delta-live-tables/observability.html)

---