# Lakeflow Pipelines - Demo

**Learning objective:** Understand the declarative approach to building data pipelines in Databricks Lakeflow.

**Topics covered:**
- Lakeflow concepts: the declarative way of defining pipelines
- SQL vs Python API
- Materialized views / streaming tables
- Expectations: warn / drop / fail
- Event log and per-table lineage
- Automatic orchestration

**IMPORTANT:** Lakeflow is the new product name for **Delta Live Tables (DLT)**. The code uses the `dlt` module, but conceptually we talk about "Lakeflow Pipelines".

## 1️⃣ Introduction to Lakeflow

**Lakeflow** is a framework for building ETL/ELT pipelines declaratively in Databricks.

### 📌 Note: Lakeflow = Delta Live Tables
**Lakeflow** is the new name for **Delta Live Tables (DLT)**.
- The API and syntax stay the same (`@dlt.table()`)
- The documentation may use either name
- The code uses the `dlt` module

### Key features:
- **Declarative**: you define "what you want to achieve", not "how to do it"
- **Automatic orchestration**: Lakeflow manages the dependencies between tables itself
- **Data Quality**: built-in expectations (warn/drop/fail)
- **Monitoring**: event log, lineage, quality metrics out-of-the-box
- **SQL and Python API**: you can choose the language that fits the task

### Lakeflow vs traditional Notebooks:

| Aspect | Traditional Notebooks | Lakeflow Pipelines |
|--------|---------------------|---------------------|
| **Definition** | Imperative (steps) | Declarative (result) |
| **Dependencies** | Manual (you must specify the order) | Automatic (DAG inference) |
| **Quality** | Custom validation code | Built-in expectations |
| **Monitoring** | Custom logging | Event log + lineage |
| **Orchestration** | Databricks Jobs (manual) | Automatic |
| **Incremental** | Manual implementation | Built-in streaming tables |
| **Retry logic** | Custom error handling | Automatic retries |

### Why Lakeflow?
1. **Less code**: declarative syntax = less boilerplate
2. **Better quality**: expectations are enforced by the engine
3. **Easier debugging**: the event log shows exactly where the problem is
4. **Production-ready**: retry, checkpointing, monitoring out-of-the-box
5. **Team collaboration**: SQL API for analysts, Python for engineers

---

## Context and requirements

- **Training day**: Day 3 - Transformation, Governance & Integrations
- **Notebook type**: Demo
- **Technical requirements**:
  - Databricks Runtime 13.3 LTS+ with Lakeflow support
  - Unity Catalog enabled
  - Permissions: CREATE CATALOG, CREATE SCHEMA, CREATE TABLE
  - Cluster: Standard with at least 2 workers

---

## Per-user isolation

Run the initialization script that sets up per-user isolation of catalogs and schemas:

In [None]:
%run ../../00_setup

## Configuration

Import the libraries and set the environment variables:

**NOTE:** Lakeflow Pipelines run as separate jobs, not directly in the notebook. This notebook contains the **definitions** of the Lakeflow tables, which are executed by the Lakeflow engine.

In [None]:
import dlt  # Lakeflow uses the 'dlt' module (Delta Live Tables API)
from pyspark.sql import functions as F
from pyspark.sql.types import *

# Show the user context
print("=== User context ===")
print(f"Catalog: {CATALOG}")
print(f"Schema (Lakeflow target): {BRONZE_SCHEMA}")  # Lakeflow will create tables here
print(f"User: {raw_user}")

# Paths to the source data (from dataset/)
ORDERS_JSON = f"{DATASET_BASE_PATH}/orders/orders_batch.json"
CUSTOMERS_CSV = f"{DATASET_BASE_PATH}/customers/customers.csv"
PRODUCTS_PARQUET = f"{DATASET_BASE_PATH}/products/products.parquet"

print("\n=== Source data paths ===")
print(f"Orders (JSON): {ORDERS_JSON}")
print(f"Customers (CSV): {CUSTOMERS_CSV}")
print(f"Products (Parquet): {PRODUCTS_PARQUET}")

print("\n✅ Lakeflow Pipeline configuration ready!")
print("📝 The table definitions below are executed by the Lakeflow engine, not directly in the notebook.")

---

## Section 1: Declarative pipeline definitions

**Goal:** Understand the basic Lakeflow syntax - declarative table definitions.

### Python API - basic syntax:

In Lakeflow, tables are defined with the `@dlt.table()` or `@dlt.view()` decorators.

**Key to understanding:** 
- A function decorated with `@dlt.table()` **declares** what the table should contain
- The Lakeflow engine runs the function automatically and materializes the result as a Delta table
- You do not need to call `.write.saveAsTable()` yourself - Lakeflow does it for you

**NOTE:** The code uses the `dlt` module (Delta Live Tables API), but conceptually we talk about "Lakeflow Pipelines".

In [None]:
import dlt  # Lakeflow uses the 'dlt' module (Delta Live Tables)
from pyspark.sql import functions as F
from pyspark.sql.functions import *

# Example 1.1: A simple Lakeflow table - Bronze Layer Orders

# This function defines WHAT the bronze_orders table should contain
# The Lakeflow engine automatically:
# 1. Runs this function
# 2. Materializes the result as a Delta table
# 3. Places the table in the target schema (from the pipeline configuration)

@dlt.table(
    name="bronze_orders",
    comment="Bronze layer: Raw orders from JSON source - immutable landing zone"
)
def bronze_orders():
    # Load the raw data from orders_batch.json
    # NOTE: ORDERS_JSON is defined in the configuration cell above; in a pipeline run,
    # values like this are typically supplied through the pipeline configuration
    return (
        spark.read
        .format("json")
        .option("multiLine", "true")
        .load(ORDERS_JSON)  # Path from the configuration
        .withColumn("_bronze_ingest_timestamp", F.current_timestamp())
        .withColumn("_bronze_source_file", F.input_file_name())
    )

# Lakeflow automatically:
# - Materializes this DataFrame as the Delta table `bronze_orders`
# - Places the table in the schema set in the pipeline config
# - Manages the life cycle (create, refresh, update)
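
The split between declaring a table and executing it can be illustrated without Databricks. The sketch below is a toy model, not the `dlt` implementation: `REGISTRY`, `table` and `run_pipeline` are hypothetical names invented for this illustration. It shows only that a decorator can record a function at definition time and leave the decision of when to call it to an "engine".

```python
from typing import Callable, Dict, List

# Hypothetical registry standing in for the engine's catalog of declared tables
REGISTRY: Dict[str, Callable[[], List[dict]]] = {}
calls: List[str] = []


def table(name: str):
    """Toy decorator: records the function under `name` without calling it."""
    def decorator(fn):
        REGISTRY[name] = fn
        return fn
    return decorator


@table(name="bronze_orders")
def bronze_orders():
    calls.append("bronze_orders")
    return [{"order_id": 1, "total_amount": 10.0}]


# At this point the table is declared, but the function has not run yet
calls_after_declaration = list(calls)


def run_pipeline() -> Dict[str, List[dict]]:
    """Toy engine: executes every declared function and keeps ("materializes") its result."""
    return {name: fn() for name, fn in REGISTRY.items()}


materialized = run_pipeline()
```

The function body runs only inside `run_pipeline()`, which is the property the bullet points above describe for `@dlt.table()`.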

In [None]:
# Example 1.2: A table with transformations - Silver Layer

@dlt.table(
    name="silver_orders",
    comment="Silver layer: Cleaned and validated orders with business logic"
)
def silver_orders():
    # dlt.read() - reads from another Lakeflow table (batch mode)
    # Lakeflow detects the dependency automatically: silver_orders depends on bronze_orders
    return (
        dlt.read("bronze_orders")
        
        # Data quality: keep only valid records
        .filter(F.col("order_id").isNotNull())
        .filter(F.col("customer_id").isNotNull())
        .filter(F.col("total_amount") > 0)
        
        # Business transformations
        .withColumn("order_date", F.to_date(F.col("order_datetime")))
        .withColumn("order_year", F.year(F.col("order_datetime")))
        .withColumn("order_month", F.month(F.col("order_datetime")))
        .withColumn("payment_method", F.upper(F.trim(F.col("payment_method"))))
        
        # Derived columns
        .withColumn("order_status", 
                    F.when(F.col("total_amount") > 0, "COMPLETED")
                     .otherwise("UNKNOWN"))
        
        # Silver metadata
        .withColumn("_silver_processed_timestamp", F.current_timestamp())
    )

# Lakeflow automatically:
# 1. Detects the dependency: silver_orders → bronze_orders
# 2. Ensures that bronze_orders is processed BEFORE silver_orders
# 3. Materializes the result as a Delta table
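
How an engine might derive the execution order from the reads inside each definition can be sketched in plain Python. This is an illustration under stated assumptions, not how Lakeflow is implemented: `table`, `read` and `infer_dependencies` are hypothetical helpers, and the ordering uses the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

definitions = {}       # table name -> defining function
current_reads = set()  # reads recorded while one definition is analysed


def table(fn):
    """Toy decorator: registers the function under its own name."""
    definitions[fn.__name__] = fn
    return fn


def read(name):
    """Toy stand-in for reading another table: only records the dependency."""
    current_reads.add(name)
    return []


# Declared out of order on purpose: declaration order does not matter
@table
def silver_orders():
    return read("bronze_orders")


@table
def bronze_orders():
    return []


@table
def gold_daily_sales():
    return read("silver_orders")


def infer_dependencies():
    """Call each definition once and collect the tables it reads from."""
    graph = {}
    for name, fn in definitions.items():
        current_reads.clear()
        fn()
        graph[name] = set(current_reads)
    return graph


graph = infer_dependencies()
execution_order = list(TopologicalSorter(graph).static_order())
# execution_order == ["bronze_orders", "silver_orders", "gold_daily_sales"]
```

Even though `silver_orders` is declared first, the inferred graph places `bronze_orders` ahead of it.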

---

### SQL vs Python API

Lakeflow supports **two APIs**:
1. **Python API**: `@dlt.table()` - for data engineers, flexible PySpark transformations
2. **SQL API**: `CREATE OR REFRESH LIVE TABLE` - for analytics engineers, readable SQL syntax

**Key point:** You can mix both APIs in one pipeline! Bronze in Python, Silver/Gold in SQL.

### SQL API - examples:

Lakeflow SQL uses the dedicated `CREATE OR REFRESH LIVE TABLE` syntax instead of a plain `CREATE TABLE`.

In [None]:
# SQL in Lakeflow (run in a separate SQL notebook):
# These examples show the Lakeflow SQL syntax

# ==============================================================================
# SQL EXAMPLE 1: Bronze layer in SQL
# ==============================================================================

"""
CREATE OR REFRESH LIVE TABLE bronze_customers
COMMENT "Bronze layer: Raw customers from CSV source"
AS
SELECT 
  *,
  current_timestamp() as _bronze_ingest_timestamp,
  input_file_name() as _bronze_source_file
FROM read_files(
  '/Volumes/.../dataset/customers/customers.csv',
  format => 'csv',
  header => true
)
"""

# ==============================================================================
# SQL EXAMPLE 2: Silver layer in SQL
# ==============================================================================

"""
CREATE OR REFRESH LIVE TABLE silver_customers
COMMENT "Silver layer: Cleaned and validated customers"
AS
SELECT 
  customer_id,
  UPPER(TRIM(customer_name)) as customer_name,
  LOWER(TRIM(customer_email)) as customer_email,
  customer_segment,
  current_timestamp() as _silver_processed_timestamp
FROM LIVE.bronze_customers  -- the LIVE. prefix reads from another Lakeflow table
WHERE customer_id IS NOT NULL
  AND customer_email IS NOT NULL
"""

# ==============================================================================
# SQL EXAMPLE 3: Gold layer aggregations in SQL
# ==============================================================================

"""
CREATE OR REFRESH LIVE TABLE gold_customer_summary
COMMENT "Gold layer: Customer aggregations for BI"
AS
SELECT 
  o.customer_id,
  c.customer_segment,
  COUNT(*) as total_orders,
  SUM(o.total_amount) as lifetime_value,
  AVG(o.total_amount) as avg_order_value,
  MIN(o.order_date) as first_order_date,
  MAX(o.order_date) as last_order_date
FROM LIVE.silver_orders o
-- customer_segment is a column of silver_customers, not silver_orders
JOIN LIVE.silver_customers c ON o.customer_id = c.customer_id
GROUP BY o.customer_id, c.customer_segment
"""

print("📝 The SQL examples above show the Lakeflow syntax for SQL notebooks")
print("💡 In practice: Python API for ingest/complex transforms, SQL API for analytics/aggregations")

---

## 3️⃣ Materialized Views vs Streaming Tables

Lakeflow offers two main table types:

### Materialized Views
- **Batch processing**: data is processed in batches
- **Full refresh**: each run recomputes the result from all source data
- **Use case**: historical data, aggregations, dimension tables

### Streaming Tables
- **Incremental processing**: only new data is processed
- **Continuous updates**: append-only or upsert
- **Use case**: fact tables, real-time analytics, CDC

In [None]:
# Example: Materialized View (batch)
@dlt.table(
    name="daily_sales_summary",
    comment="Daily aggregated sales - full refresh"
)
def daily_sales_summary():
    return (
        dlt.read("cleaned_orders")
        .groupBy("order_date")
        .agg(
            count("order_id").alias("total_orders"),
            sum("amount").alias("total_revenue"),
            avg("amount").alias("avg_order_value")
        )
    )

In [None]:
# Example: Streaming Table (incremental)
@dlt.table(
    name="streaming_orders",
    comment="Streaming orders - incremental processing"
)
def streaming_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.inferColumnTypes", "true")  # Auto Loader option for inferring column types
        .load("/Volumes/main/default/kion_data/orders/")
    )

In [None]:
# Streaming table with transformations
@dlt.table(
    name="silver_orders_stream",
    comment="Silver layer - streaming incremental"
)
def silver_orders_stream():
    return (
        dlt.read_stream("streaming_orders")  # read_stream for a streaming source
        .filter(col("order_id").isNotNull())
        .withColumn("ingested_at", current_timestamp())
        .withColumn("year", year(col("order_date")))
        .withColumn("month", month(col("order_date")))
    )

### When to use a Materialized View vs a Streaming Table?

**Materialized View**:
- Aggregations and reports (Gold layer)
- Dimension tables (e.g. products, customers)
- Small to medium datasets
- You need full-refresh logic

**Streaming Table**:
- Fact tables (transactions, events)
- Real-time/near-real-time processing
- Large data volumes
- CDC (Change Data Capture)
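
The difference between the two refresh models can be sketched in plain Python. This is a conceptual illustration only: `refresh_materialized_view` and `StreamingTable` are hypothetical names, and a simple row counter stands in for the checkpoint that a real streaming engine maintains.

```python
source = [
    {"order_id": 1, "amount": 10},
    {"order_id": 2, "amount": 20},
]


def refresh_materialized_view(rows):
    """Materialized-view style: recompute the aggregate from all source rows."""
    return {
        "total_orders": len(rows),
        "total_revenue": sum(r["amount"] for r in rows),
    }


class StreamingTable:
    """Streaming-table style: append only the rows not processed before."""

    def __init__(self):
        self.rows = []
        self.checkpoint = 0  # number of source rows already processed

    def update(self, rows):
        new_rows = rows[self.checkpoint:]
        self.rows.extend(new_rows)
        self.checkpoint = len(rows)
        return len(new_rows)


stream = StreamingTable()
first_update = stream.update(source)         # processes both existing rows
source.append({"order_id": 3, "amount": 5})
second_update = stream.update(source)        # processes only the new row
summary = refresh_materialized_view(source)  # recomputed over all three rows
```

The second update touches one row, while the materialized-view refresh reads the whole source again. That cost difference is why large fact tables fit the streaming model better.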

---

## 4️⃣ Data Quality Expectations

**Expectations** are a declarative way to define data quality rules in Lakeflow.

### Three types of expectations:

1. **WARN**: log violations, but keep the data
2. **DROP**: remove the rows that violate the rule
3. **FAIL**: stop the pipeline on a violation

### Syntax:

In [None]:
# Example 1: WARN - logging violations
@dlt.table(
    name="orders_with_quality_checks"
)
@dlt.expect("valid_order_id", "order_id IS NOT NULL")
@dlt.expect("positive_amount", "amount > 0")
def orders_with_quality_checks():
    return dlt.read("raw_orders")

# Violations are logged in the Event Log, but the data continues downstream

In [None]:
# Example 2: DROP - removing bad rows
@dlt.table(
    name="clean_orders"
)
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "amount > 0")
@dlt.expect_or_drop("valid_date", "order_date IS NOT NULL")
def clean_orders():
    return dlt.read("raw_orders")

# Rows that do not meet the expectations are dropped automatically

In [None]:
# Example 3: FAIL - stopping the pipeline
@dlt.table(
    name="critical_orders"
)
@dlt.expect_or_fail("no_nulls_in_key", "order_id IS NOT NULL AND customer_id IS NOT NULL")
def critical_orders():
    return dlt.read("raw_orders")

# The pipeline stops if any row violates the rule

In [None]:
# Example 4: Combined expectations
@dlt.table(
    name="validated_orders"
)
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("realistic_amount", "amount BETWEEN 1 AND 1000000")
@dlt.expect_or_drop("valid_status", "status IN ('pending', 'completed', 'cancelled')")
@dlt.expect_or_drop("recent_date", "order_date >= '2020-01-01'")
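# WARN only. Expectation constraints are evaluated per row and cannot use subqueries
# against other tables, so a lookup such as "is this a VIP customer" belongs in a join
@dlt.expect("known_customer", "customer_id IS NOT NULL")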
def validated_orders():
    return (
        dlt.read("raw_orders")
        .withColumn("validation_timestamp", current_timestamp())
    )

# A combination of DROP (critical) and WARN (informational)

### Best Practices for Expectations:

1. **Use FAIL only for critical conditions**: e.g. schema mismatch
2. **DROP for data quality issues**: e.g. nulls, invalid values
3. **WARN for business logic**: e.g. suspicious patterns
4. **Monitor the Event Log**: check the quality metrics regularly
5. **Naming expectations**: use readable names that describe the rule
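
The semantics of the three actions can be sketched on plain Python rows. This is a simulation for illustration, not the `dlt` API: `apply_expectation` and `ExpectationFailed` are hypothetical names, and the rules are Python predicates rather than SQL constraints.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


class ExpectationFailed(Exception):
    """Raised by the 'fail' action when at least one row violates the rule."""


def apply_expectation(
    rows: List[Row], name: str, predicate: Callable[[Row], bool], action: str
) -> Tuple[List[Row], int]:
    """Apply one rule and return (rows passed downstream, number of violations)."""
    if action not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown action: {action}")
    violations = [row for row in rows if not predicate(row)]
    if action == "fail" and violations:
        raise ExpectationFailed(f"{name}: {len(violations)} row(s) violated the rule")
    if action == "drop":
        return [row for row in rows if predicate(row)], len(violations)
    # 'warn' (and 'fail' without violations): keep every row, only report the count
    return list(rows), len(violations)


rows = [
    {"order_id": 1, "amount": 50},
    {"order_id": None, "amount": 20},
    {"order_id": 3, "amount": -5},
]

warned, warn_count = apply_expectation(
    rows, "valid_order_id", lambda r: r["order_id"] is not None, "warn"
)
dropped, drop_count = apply_expectation(
    rows, "positive_amount", lambda r: r["amount"] > 0, "drop"
)
```

With `"warn"` all three rows continue and one violation is counted. With `"drop"` the row with the negative amount is removed. With `"fail"` the same null `order_id` would raise `ExpectationFailed` and stop processing.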

---

## 5️⃣ Event Log and Lineage

### Event Log

Every Lakeflow pipeline produces an **Event Log** - a detailed record of all operations:
- Execution time of each table
- Number of processed rows
- Expectation violations
- Errors and warnings
- Resource usage (CPU, memory)

The Event Log is available through:
1. **Lakeflow Pipeline UI**: graphical interface
2. **Event Log Table**: a Delta table with metadata

### Querying the Event Log:

In [None]:
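# The event log is stored as a Delta table and read with the event_log() table-valued function
# "<pipeline-id>" is a placeholder: substitute the ID of your pipeline
event_log_df = spark.sql('SELECT * FROM event_log("<pipeline-id>")')

# Expectation metrics are nested as JSON in the `details` column of flow_progress events
expectations_schema = (
    "array<struct<name: string, dataset: string, "
    "passed_records: int, failed_records: int>>"
)
quality_events = (
    event_log_df
    .filter(col("event_type") == "flow_progress")
    .select(
        explode(
            from_json(expr("details:flow_progress.data_quality.expectations"), expectations_schema)
        ).alias("e")
    )
    .select("e.dataset", col("e.name").alias("expectation"), "e.passed_records", "e.failed_records")
)
quality_events.display()

# Data quality statistics
quality_summary = (
    quality_events
    .groupBy("dataset", "expectation")
    .agg(
        sum("passed_records").alias("total_passed"),
        sum("failed_records").alias("total_failed")
    )
)
quality_summary.display()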

In [None]:
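# Monitoring flow metrics
# Flow metrics are nested as JSON in the `details` column of flow_progress events
flow_progress = (
    event_log_df
    .filter(col("event_type") == "flow_progress")
    .select(
        "timestamp",
        col("origin.flow_name").alias("dataset"),
        expr("details:flow_progress.metrics.num_output_rows").alias("num_output_rows"),
        # Assumption: the `execution_duration` field name comes from the original example;
        # verify that it exists in your event log (a missing JSON key yields NULL)
        expr("details:flow_progress.metrics.execution_duration").alias("execution_duration")
    )
    .orderBy(desc("timestamp"))
)
flow_progress.display()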

### Data Lineage

Lakeflow automatically tracks **lineage** - the relationships between tables:
- Which tables are sources (upstream)
- Which tables are targets (downstream)
- How data flows through the pipeline

**Lineage is visible in**:
1. **Lakeflow Pipeline Graph**: visualization of the dependencies
2. **Unity Catalog**: end-to-end lineage
3. **System tables**: metadata queries

### Example lineage query:

In [None]:
# Lineage from the Unity Catalog system tables
lineage_df = spark.sql("""
    SELECT 
        source_table_full_name,
        target_table_full_name,
        source_type,
        event_time
    FROM system.access.table_lineage
    WHERE target_table_full_name LIKE '%kion_lakeflow%'
    ORDER BY event_time DESC
""")
lineage_df.display()

---

## 6️⃣ Automatic Orchestration

Lakeflow automatically manages:
1. **Dependency resolution**: determines the execution order
2. **Parallelization**: runs independent tables in parallel
3. **Retry logic**: automatic retries on errors
4. **Checkpointing**: for streaming tables
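
Dependency resolution and parallelization can be sketched together with the standard-library `graphlib` module. The graph below is a hypothetical example built from table names used in this notebook, and `parallel_batches` is an illustrative helper. It groups tables into batches in which every table only depends on tables from earlier batches, so the members of one batch could run concurrently. It does not describe the scheduler Lakeflow actually uses.

```python
from graphlib import TopologicalSorter

# table -> set of tables it reads from (hypothetical pipeline)
reads_from = {
    "bronze_orders": set(),
    "bronze_customers": set(),
    "silver_orders": {"bronze_orders"},
    "silver_customers": {"bronze_customers"},
    "gold_customer_summary": {"silver_orders", "silver_customers"},
}


def parallel_batches(graph):
    """Return lists of tables; tables in the same list have no dependency on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are finished
        batches.append(ready)
        sorter.done(*ready)                 # mark the whole batch as finished
    return batches


batches = parallel_batches(reads_from)
# batches == [
#     ["bronze_customers", "bronze_orders"],
#     ["silver_customers", "silver_orders"],
#     ["gold_customer_summary"],
# ]
```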

### Pipeline configuration:

In [None]:
# Lakeflow Pipeline configuration (Python dict mirroring the JSON settings)
pipeline_config = {
    "name": "KION_Orders_Lakeflow_Pipeline",
    "storage": "/mnt/lakeflow/kion_orders",
    "target": "kion_lakeflow_db",
    # Source notebooks are listed under "libraries" in the pipeline settings
    "libraries": [
        {"notebook": {"path": "/Workspace/KION/lakeflow_orders_bronze"}},
        {"notebook": {"path": "/Workspace/KION/lakeflow_orders_silver"}},
        {"notebook": {"path": "/Workspace/KION/lakeflow_orders_gold"}}
    ],
    "configuration": {
        "source_path": "/Volumes/main/default/kion_data",
        "pipeline.maxParallelTables": "4"
    },
    "clusters": [
        {
            "label": "default",
            "num_workers": 2,
            "node_type_id": "Standard_DS3_v2"
        }
    ],
    "continuous": False,  # False = triggered mode, True = continuous
    "development": True   # True = development mode (cluster reuse, no automatic retries)
}

print("Lakeflow Pipeline configuration ready!")
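
The cell above holds the settings as a Python dictionary, while pipeline settings are exchanged as JSON. A minimal sketch of the conversion with the standard `json` module follows. The dictionary here is a shortened, illustrative copy of the one above, and how the resulting document is submitted (UI, API or CLI) is outside the scope of this sketch.

```python
import json

# Shortened copy of the settings above, for illustration only
pipeline_config = {
    "name": "KION_Orders_Lakeflow_Pipeline",
    "target": "kion_lakeflow_db",
    "continuous": False,
    "development": True,
}

# Python booleans (False/True) become JSON booleans (false/true)
config_json = json.dumps(pipeline_config, indent=2)

# Round trip: parsing the JSON text gives back the same settings
restored = json.loads(config_json)
```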

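### Modes of Execution:

**Development Mode**:
- The cluster is reused between runs
- Automatic retries are disabled, so errors surface immediately
- Fast iterations
- Use it during development

**Production Mode**:
- The cluster is shut down after each run
- Recoverable errors are retried automatically
- Cost-optimized
- Use it in production

Whether a run refreshes tables fully or incrementally depends on the table type and on the kind of update you start, not on the development/production switch.

**Triggered vs Continuous**:
- **Triggered**: on-demand or scheduled
- **Continuous**: always running, minimal latency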

---

## 🔨 Complete example: Bronze → Silver → Gold Lakeflow Pipeline

### Pipeline Architecture:
```
raw_orders (JSON) 
    ↓
bronze_orders (Raw + Audit)
    ↓
silver_orders (Cleaned + Validated)
    ↓
gold_daily_sales (Aggregated)
```

In [None]:
# ================================================================================
# COMPLETE EXAMPLE: Bronze → Silver → Gold Lakeflow Pipeline
# Based on the files in dataset/: orders_batch.json, customers.csv
# ================================================================================

# BRONZE LAYER - Raw Data Landing
@dlt.table(
    name="lakeflow_bronze_orders",
    comment="Bronze: Raw orders from JSON - immutable landing zone",
    table_properties={
        "quality_layer": "bronze",
        "source_format": "json",
        "pipelines.autoOptimize.zOrderCols": "order_datetime"
    }
)
def lakeflow_bronze_orders():
    """
    Bronze layer: Load raw orders from JSON without transformations.
    
    Data quality: NONE (raw = raw)
    Processing: Full load (batch mode)
    """
    return (
        spark.read
        .format("json")
        .option("multiLine", "true")
        .load(ORDERS_JSON)
        # Audit metadata
        .withColumn("_bronze_ingest_timestamp", F.current_timestamp())
        .withColumn("_bronze_source_file", F.lit(ORDERS_JSON))
        .withColumn("_bronze_ingested_by", F.lit("lakeflow_pipeline"))
    )

In [None]:
# SILVER LAYER - Cleaned & Validated
@dlt.table(
    name="lakeflow_silver_orders",
    comment="Silver: Cleaned, validated, and standardized orders with business logic",
    table_properties={
        "quality_layer": "silver",
        "pipelines.autoOptimize.zOrderCols": "order_date"
    }
)
# Data Quality Expectations (section 4)
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("valid_customer_id", "customer_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "total_amount > 0")
@dlt.expect_or_drop("valid_datetime", "order_datetime IS NOT NULL")
@dlt.expect("reasonable_amount", "total_amount < 100000")  # WARN only
def lakeflow_silver_orders():
    """
    Silver layer: Apply data quality checks and business transformations.
    
    Data quality: NOT NULL validation, business rules (amount > 0)
    Processing: Batch mode (full refresh from Bronze)
    Expectations: DROP invalid records, WARN for suspicious values
    """
    return (
        dlt.read("lakeflow_bronze_orders")  # Dependency: Bronze → Silver
        
        # Deduplication (in case Bronze contains duplicates)
        .dropDuplicates(["order_id"])
        
        # Date transformations
        .withColumn("order_date", F.to_date(F.col("order_datetime")))
        .withColumn("order_timestamp", F.to_timestamp(F.col("order_datetime")))
        .withColumn("order_year", F.year(F.col("order_datetime")))
        .withColumn("order_month", F.month(F.col("order_datetime")))
        .withColumn("order_quarter", F.quarter(F.col("order_datetime")))
        
        # Text standardization
        .withColumn("payment_method", F.upper(F.trim(F.col("payment_method"))))
        
        # Type casting
        .withColumn("total_amount", F.col("total_amount").cast("decimal(10,2)"))
        .withColumn("quantity", F.col("quantity").cast("int"))
        
        # Derived business columns
        .withColumn("order_status", 
                    F.when(F.col("total_amount") > 0, "COMPLETED")
                     .otherwise("UNKNOWN"))
        
        # Silver audit metadata
        .withColumn("_silver_processed_timestamp", F.current_timestamp())
        .withColumn("_data_quality_flag", F.lit("VALID"))
    )

In [None]:
# GOLD LAYER - Business Aggregates (Daily)
@dlt.table(
    name="lakeflow_gold_daily_sales",
    comment="Gold: Daily sales aggregations for BI dashboards and reporting",
    table_properties={
        "quality_layer": "gold",
        "aggregation_level": "daily"
    }
)
def lakeflow_gold_daily_sales():
    """
    Gold layer: Business-level aggregations for BI consumption.
    
    Granularity: Daily (order_date)
    Use case: Sales dashboards, trend analysis, executive reporting
    Processing: Batch aggregation from Silver
    """
    return (
        dlt.read("lakeflow_silver_orders")  # Dependency: Silver → Gold
        .groupBy("order_date", "order_year", "order_month", "order_quarter", "order_status")
        .agg(
            # Volume metrics
            F.count("order_id").alias("total_orders"),
            F.countDistinct("customer_id").alias("unique_customers"),
            F.sum("quantity").alias("total_quantity"),
            
            # Revenue metrics
            F.sum("total_amount").alias("total_revenue"),
            F.avg("total_amount").alias("avg_order_value"),
            F.min("total_amount").alias("min_order_value"),
            F.max("total_amount").alias("max_order_value"),
            
            # Payment method distribution
            F.count(F.when(F.col("payment_method") == "CREDIT CARD", 1)).alias("orders_credit_card"),
            F.count(F.when(F.col("payment_method") == "CASH", 1)).alias("orders_cash"),
            F.count(F.when(F.col("payment_method") == "PAYPAL", 1)).alias("orders_paypal")
        )
        # Gold audit metadata
        .withColumn("_gold_created_timestamp", F.current_timestamp())
        .withColumn("_gold_aggregation_level", F.lit("DAILY"))
        .orderBy("order_date")
    )

In [None]:
# GOLD LAYER - Customer Lifetime Value
@dlt.table(
    name="lakeflow_gold_customer_ltv",
    comment="Gold: Customer lifetime value and segmentation for CRM and marketing",
    table_properties={
        "quality_layer": "gold",
        "aggregation_level": "customer"
    }
)
def lakeflow_gold_customer_ltv():
    """
    Gold layer: Customer-level aggregations for CRM, segmentation, targeting.
    
    Granularity: Customer (customer_id)
    Use case: Customer segmentation, retention analysis, personalization
    Processing: Batch aggregation from Silver
    """
    return (
        dlt.read("lakeflow_silver_orders")
        .groupBy("customer_id")
        .agg(
            # Purchase frequency
            F.count("order_id").alias("total_orders"),
            
            # Monetary value
            F.sum("total_amount").alias("lifetime_value"),
            F.avg("total_amount").alias("avg_order_value"),
            F.max("total_amount").alias("max_order_value"),
            
            # Recency
            F.min("order_date").alias("first_order_date"),
            F.max("order_date").alias("last_order_date"),
            F.datediff(F.max("order_date"), F.min("order_date")).alias("customer_age_days"),
            
            # Payment preferences
            # (F.first returns an arbitrary value per group, not the most frequent one)
            F.first(F.col("payment_method")).alias("preferred_payment_method")
        )
        # Customer segmentation (RFM-like)
        .withColumn("customer_segment",
            F.when(F.col("lifetime_value") > 10000, "VIP")
             .when(F.col("lifetime_value") > 5000, "High Value")
             .when(F.col("lifetime_value") > 1000, "Medium Value")
             .otherwise("Low Value")
        )
        .withColumn("purchase_frequency_segment",
            F.when(F.col("total_orders") >= 10, "Frequent")
             .when(F.col("total_orders") >= 5, "Regular")
             .otherwise("Occasional")
        )
        # Gold audit metadata
        .withColumn("_gold_created_timestamp", F.current_timestamp())
        .withColumn("_gold_aggregation_level", F.lit("CUSTOMER"))
    )
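
The segmentation thresholds in the cell above are easy to get wrong at the boundaries, so it can help to mirror them in plain functions that can be unit-tested without Spark. The sketch below repeats the `F.when` chains from `lakeflow_gold_customer_ltv`. The function names are chosen for this illustration and are not part of the pipeline.

```python
def customer_segment(lifetime_value: float) -> str:
    """Mirror of the customer_segment F.when chain (strict '>' comparisons)."""
    if lifetime_value > 10000:
        return "VIP"
    if lifetime_value > 5000:
        return "High Value"
    if lifetime_value > 1000:
        return "Medium Value"
    return "Low Value"


def purchase_frequency_segment(total_orders: int) -> str:
    """Mirror of the purchase_frequency_segment F.when chain ('>=' comparisons)."""
    if total_orders >= 10:
        return "Frequent"
    if total_orders >= 5:
        return "Regular"
    return "Occasional"


# Boundary values: exactly 10000 is not "VIP" because the comparison is strict
boundary_segment = customer_segment(10000)
boundary_frequency = purchase_frequency_segment(10)
```

Note the asymmetry: the value segments use strict `>`, while the frequency segments use `>=`, so a customer with exactly 10 orders is "Frequent" but a lifetime value of exactly 10000 is only "High Value".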

---

## 📊 Monitoring and Troubleshooting

### Checking the pipeline status:

In [None]:
# Query the Event Log for errors
# "<pipeline-id>" is a placeholder: substitute the ID of your pipeline
errors_df = spark.sql("""
    SELECT 
        timestamp,
        level,
        origin.flow_name AS dataset,
        message
    FROM event_log("<pipeline-id>")
    WHERE level = 'ERROR'
    ORDER BY timestamp DESC
    LIMIT 20
""")
errors_df.display()

In [None]:
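# Data quality violations
# Expectation metrics are nested as JSON in the `details` column of flow_progress events
quality_violations = spark.sql("""
    SELECT 
        e.dataset,
        e.name AS expectation,
        SUM(e.failed_records) AS total_failures,
        SUM(e.passed_records) AS total_passed,
        ROUND(SUM(e.failed_records) * 100.0 / (SUM(e.failed_records) + SUM(e.passed_records)), 2) AS failure_rate_pct
    FROM (
        SELECT explode(from_json(
            details:flow_progress.data_quality.expectations,
            'array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>'
        )) AS e
        FROM event_log("<pipeline-id>")  -- placeholder pipeline ID
        WHERE event_type = 'flow_progress'
    )
    GROUP BY e.dataset, e.name
    HAVING SUM(e.failed_records) > 0
    ORDER BY failure_rate_pct DESC
""")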
quality_violations.display()
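
The failure-rate arithmetic can be checked offline on a couple of sample payloads. The sketch below assumes that expectation metrics arrive as JSON under `flow_progress.data_quality.expectations` in an event's `details` field. The two payloads and their numbers are invented for illustration, and `summarize_expectations` is a hypothetical helper.

```python
import json
from collections import defaultdict

# Invented `details` payloads of two flow_progress events
sample_details = [
    json.dumps({"flow_progress": {"data_quality": {"expectations": [
        {"name": "valid_order_id", "dataset": "clean_orders",
         "passed_records": 95, "failed_records": 5},
    ]}}}),
    json.dumps({"flow_progress": {"data_quality": {"expectations": [
        {"name": "valid_order_id", "dataset": "clean_orders",
         "passed_records": 80, "failed_records": 20},
    ]}}}),
]


def summarize_expectations(payloads):
    """Aggregate passed/failed counts per (dataset, expectation) and add a failure rate."""
    totals = defaultdict(lambda: {"passed": 0, "failed": 0})
    for payload in payloads:
        details = json.loads(payload)
        expectations = (
            details.get("flow_progress", {})
            .get("data_quality", {})
            .get("expectations", [])
        )
        for item in expectations:
            key = (item["dataset"], item["name"])
            totals[key]["passed"] += item["passed_records"]
            totals[key]["failed"] += item["failed_records"]
    summary = {}
    for key, counts in totals.items():
        evaluated = counts["passed"] + counts["failed"]
        rate = round(counts["failed"] * 100.0 / evaluated, 2) if evaluated else 0.0
        summary[key] = {**counts, "failure_rate_pct": rate}
    return summary


summary = summarize_expectations(sample_details)
# 25 failures out of 200 evaluated rows -> 12.5 percent
```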

In [None]:
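# Pipeline execution time trends
# Flow metrics are nested as JSON in the `details` column of flow_progress events
execution_trends = spark.sql("""
    SELECT 
        date_trunc('hour', timestamp) AS execution_hour,
        origin.flow_name AS dataset,
        -- Assumption: the execution_duration field name comes from the original example;
        -- verify that it exists in your event log (a missing JSON key yields NULL)
        AVG(CAST(details:flow_progress.metrics.execution_duration AS DOUBLE) / 1000) AS avg_execution_seconds,
        SUM(CAST(details:flow_progress.metrics.num_output_rows AS BIGINT)) AS total_rows_processed
    FROM event_log("<pipeline-id>")  -- placeholder pipeline ID
    WHERE event_type = 'flow_progress'
    GROUP BY execution_hour, dataset
    ORDER BY execution_hour DESC, dataset
""")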
execution_trends.display()

---

## ✅ Summary

### What you learned:

✅ **Lakeflow concepts:**  
- Declarative pipeline definitions (`@dlt.table()`)  
- Automatic dependency management (DAG inference)  
- Lakeflow = Delta Live Tables (same technology, new name)  

✅ **SQL vs Python API:**  
- Python API: `@dlt.table()` - for complex transforms and data engineering  
- SQL API: `CREATE OR REFRESH LIVE TABLE` - for analytics and simple aggregations  
- Both APIs can be mixed in one pipeline!  

✅ **Materialized Views vs Streaming Tables:**  
- Materialized Views: batch processing, full refresh  
- Streaming Tables: incremental processing, append-only/upsert  

✅ **Data Quality Expectations:**  
- `@dlt.expect()`: WARN - log violations  
- `@dlt.expect_or_drop()`: DROP - remove bad records  
- `@dlt.expect_or_fail()`: FAIL - stop the pipeline  

✅ **Event Log and Lineage:**  
- Event Log: a detailed record of all pipeline operations  
- Lineage: automatic tracking of dependencies between tables  
- Monitoring: quality metrics, execution times, row counts  

✅ **Automatic Orchestration:**  
- Dependency resolution: automatic execution order  
- Parallelization: independent tables are processed in parallel  
- Retry logic: automatic retries on errors  

### Key Takeaways:

1. **Lakeflow simplifies ETL**: declarative syntax, automatic orchestration
2. **Quality first**: built-in expectations safeguard data quality
3. **Observability**: Event Log + Lineage = full visibility
4. **Streaming and Batch**: one framework for both paradigms
5. **Production-ready**: retry, checkpointing, monitoring out-of-the-box

### Next steps:
- **Notebook 03**: Databricks Jobs Orchestration
- **Workshop 02**: Hands-on Lakeflow + Orchestration

---

## 📚 Additional resources

- [Lakeflow Documentation](https://docs.databricks.com/delta-live-tables/index.html)
- [Lakeflow Best Practices](https://docs.databricks.com/delta-live-tables/best-practices.html)
- [Event Log Reference](https://docs.databricks.com/delta-live-tables/observability.html)

---

## Best Practices - Lakeflow Pipelines

**1. Layer design:**
```
Bronze: @dlt.table() without expectations (raw = raw)
Silver: @dlt.expect_or_drop() for data quality
Gold: @dlt.table() for aggregations (Silver is already clean)
```

**2. Naming Convention:**
```
lakeflow_bronze_orders
lakeflow_silver_orders
lakeflow_gold_daily_sales
```
The `lakeflow_` prefix distinguishes Lakeflow tables from traditional tables.

**3. Expectations Strategy:**
- **Bronze**: No expectations (immutable landing zone)
- **Silver**: `expect_or_drop` for critical columns (order_id, customer_id, amount > 0)
- **Silver**: `expect` (warn) for business logic (reasonable_amount < 100000)
- **Gold**: No expectations (Silver is already validated)
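
One way to keep this strategy in a single place is to store the rules as data and select them by action. The sketch below is plain Python: `RULES` and `rules_for` are hypothetical names, and the constraints repeat the ones used for `lakeflow_silver_orders`. The resulting name-to-constraint dictionaries have the shape accepted by the dictionary-based decorators `@dlt.expect_all_or_drop(...)` and `@dlt.expect_all(...)`, which is indicated only in comments because `dlt` is not importable outside a pipeline.

```python
# Rules kept as data: name, SQL constraint, and the action to take on violation
RULES = [
    {"name": "valid_order_id", "constraint": "order_id IS NOT NULL", "action": "drop"},
    {"name": "valid_customer_id", "constraint": "customer_id IS NOT NULL", "action": "drop"},
    {"name": "positive_amount", "constraint": "total_amount > 0", "action": "drop"},
    {"name": "reasonable_amount", "constraint": "total_amount < 100000", "action": "warn"},
]


def rules_for(action: str) -> dict:
    """Return {rule name: SQL constraint} for all rules with the given action."""
    return {r["name"]: r["constraint"] for r in RULES if r["action"] == action}


silver_drop_rules = rules_for("drop")
silver_warn_rules = rules_for("warn")

# In a pipeline notebook these dictionaries could then be applied as:
#   @dlt.expect_all_or_drop(silver_drop_rules)
#   @dlt.expect_all(silver_warn_rules)
```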

**4. Table Properties:**
```python
table_properties={
    "quality_layer": "silver",
    "pipelines.autoOptimize.zOrderCols": "order_date",
    "delta.autoOptimize.optimizeWrite": "true"
}
```

**5. Development vs Production:**
- **Development mode**: `development: true`, the cluster is reused between runs and automatic retries are disabled
- **Production mode**: `development: false`, the cluster is shut down after each run and recoverable errors are retried automatically

---

## Deployment Guide - How to run a Lakeflow Pipeline?

**Step 1: Prepare the notebook**
- A notebook with the table definitions (`@dlt.table()`)
- `import dlt` at the top
- Configuration variables (paths, schema)

**Step 2: Create the Lakeflow Pipeline (UI)**
1. In the Databricks UI: **Workflows** → **Delta Live Tables**
2. **Create Pipeline**
3. Configure:
   - **Pipeline name**: `KION_Lakeflow_Orders_Pipeline`
   - **Notebook path**: the path to this notebook
   - **Target schema**: `{BRONZE_SCHEMA}` (from 00_setup.ipynb)
   - **Storage location**: `/mnt/lakeflow/kion_orders`
   - **Configuration**: key-value pairs for the variables

**Step 3: Pipeline Configuration (JSON)**
```json
{
  "name": "KION_Lakeflow_Orders_Pipeline",
  "storage": "/mnt/lakeflow/kion_orders",
  "target": "kion_bronze_schema",
  "libraries": [
    {"notebook": {"path": "/Workspace/.../02_lakeflow_pipelines"}}
  ],
  "configuration": {
    "DATASET_BASE_PATH": "/Volumes/.../dataset",
    "CATALOG": "kion_catalog"
  },
  "clusters": [
    {
      "label": "default",
      "num_workers": 2
    }
  ],
  "continuous": false,
  "development": true
}
```

**Step 4: Run**
- **Start** → the pipeline executes all tables according to the dependency graph
- Monitor the execution in the **Event Log**
- Check the quality metrics in the **Data Quality** tab

**Step 5: Monitoring**
- Event Log: errors, warnings, execution times
- Lineage Graph: visualization of the dependencies
- Data Quality Dashboard: expectation violations

---

## Troubleshooting

**Problem 1: `dlt module not found`**
**Solution:** The `dlt` module is only available when the code runs as part of a pipeline. Attach the notebook to a Lakeflow pipeline and start the pipeline instead of running the cells interactively on a cluster.

**Problem 2: The table is not created**
**Solution:** Check the Event Log for errors. The cause is often the path to the data or missing schema permissions.

**Problem 3: Expectations fail the whole pipeline**
**Solution:** Use `@dlt.expect_or_drop()` instead of `@dlt.expect_or_fail()` for non-critical rules.

**Problem 4: Duplicate records in Silver**
**Solution:** Add `.dropDuplicates(["primary_key"])` to the Silver table definition.

---

## Next steps

- **Next notebook**: 03_databricks_jobs_orchestration.ipynb
- **Hands-on workshop**: 02_lakeflow_orchestration_workshop.ipynb
- **Documentation**: [Databricks Lakeflow](https://docs.databricks.com/delta-live-tables/index.html)

---

**🎯 Key conclusions:**

1. **Lakeflow = Declarative**: you define "what", not "how"
2. **Quality first**: expectations are built in, not custom code
3. **Observability**: Event Log + Lineage = full transparency
4. **Production-ready**: retry, monitoring, orchestration out-of-the-box
5. **Team collaboration**: SQL for analysts, Python for engineers