# Lakeflow Spark Declarative Pipelines - Demo

**Training objective:** Understand the declarative Lakeflow framework for building batch and streaming pipelines, and implement a Bronze→Silver→Gold flow in practice with the SQL API.

**Topics covered:**
- Lakeflow concepts: defining pipelines declaratively
- SQL vs Python API (focus on SQL)
- Materialized views and streaming tables
- Expectations: warn / drop / fail (data quality)
- Event log and per-table lineage
- Automatic orchestration

## Context and requirements

- **Training day**: Day 3 - Transformation & Governance
- **Notebook type**: Demo
- **Technical requirements**:
 - Databricks Runtime 16.4 LTS or newer (recommended: 17.3 LTS)
 - Unity Catalog enabled
 - Privileges: CREATE TABLE, CREATE SCHEMA, SELECT, MODIFY
 - Cluster: Standard or Serverless Compute

**Note:** This notebook demonstrates the **SQL API** for Lakeflow SDP. The Python API (`create_streaming_table()`, `table()`) is an alternative with the same functionality.

> **Update (June 2025):** The product name changed from "Delta Live Tables (DLT)" to "Lakeflow Spark Declarative Pipelines (SDP)". The functionality remains the same. In addition, "Databricks Jobs" is now "Lakeflow Jobs".

## Theoretical introduction - Lakeflow Spark Declarative Pipelines

**Section goal:** Understand what Lakeflow SDP is and how it changes the way ETL/ELT pipelines are built.

---

### What is Lakeflow Spark Declarative Pipelines?

**Lakeflow Spark Declarative Pipelines (SDP)** is a declarative framework for building batch and streaming data pipelines in SQL and Python. It extends Apache Spark Declarative Pipelines and runs on the optimized Databricks Runtime.

```
┌──────────────────────────────────────────────────────────────┐
│ TRADITIONAL APPROACH (procedural)                            │
├──────────────────────────────────────────────────────────────┤
│ 1. Write code: df = spark.read.table(...)                    │
│ 2. Write transformations: df.filter().groupBy()...           │
│ 3. Write orchestration: if/else, try/catch, retry logic      │
│ 4. Write monitoring: log metrics, track failures             │
│ 5. Write quality checks: manual assertions                   │
│ 6. Deploy: schedule in Jobs, manage dependencies             │
│                                                              │
│ = Hundreds of lines, manual orchestration, error handling    │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ LAKEFLOW SDP (declarative)                                   │
├──────────────────────────────────────────────────────────────┤
│ 1. Declare WHAT you want:                                    │
│    CREATE OR REFRESH STREAMING TABLE bronze AS ...           │
│    CREATE OR REFRESH MATERIALIZED VIEW silver AS ...         │
│    CREATE OR REFRESH MATERIALIZED VIEW gold AS ...           │
│                                                              │
│ 2. Lakeflow automatically:                                   │
│    • Orchestrates execution order (dependency DAG)           │
│    • Retries on failures                                     │
│    • Processes data incrementally                            │
│    • Records monitoring data (Event Log)                     │
│    • Enforces data quality (expectations)                    │
│    • Tracks lineage                                          │
│                                                              │
│ = A dozen or so lines of SQL, zero orchestration code        │
└──────────────────────────────────────────────────────────────┘
```

---

### Key benefits of Lakeflow SDP

**1. Automatic Orchestration**

Lakeflow automatically:
- Analyzes dependencies between tables (which table reads from which)
- Builds a DAG (Directed Acyclic Graph)
- Executes in the correct order with as much parallelism as the DAG allows
- Retries at escalating levels: task → flow → pipeline
```sql
-- It is enough to declare:
CREATE OR REFRESH MATERIALIZED VIEW silver AS
 SELECT * FROM bronze; -- Lakeflow infers: silver depends on bronze

CREATE OR REFRESH MATERIALIZED VIEW gold AS
 SELECT * FROM silver; -- Lakeflow infers: gold depends on silver

-- Execution order: bronze → silver → gold (derived automatically)
```

**2. Declarative Processing**

The declarative API can reduce hundreds of lines of code to a few:

```sql
-- Traditional (procedural):
-- 1. Read source
-- 2. Apply transformations
-- 3. Handle schema evolution
-- 4. Write to Delta
-- 5. Error handling
-- 6. Retry logic
-- 7. Metrics logging
-- = ~100+ lines of code

-- Lakeflow (declarative):
CREATE OR REFRESH STREAMING TABLE orders AS
 SELECT * FROM STREAM read_files('/path/to/orders');
-- = 2 lines; everything listed above is handled automatically
```

**3. Incremental Processing**

Lakeflow processes only new or changed data:

- **Streaming tables**: Append-only; each record is processed once
- **Materialized views**: Incremental refresh where the query supports it (Databricks detects changes in the source)
- **AUTO CDC**: Out-of-order events handling, SCD Type 1/2
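
The "each record once" behavior of streaming tables can be pictured with a small plain-Python sketch that runs without Spark. It assumes a hypothetical checkpoint (a set of already processed file names) and the made-up helper `incremental_run`; it only illustrates the idea and is not how Lakeflow stores its checkpoints:

```python
def incremental_run(available_files, checkpoint):
    """Return the files not yet processed and record them in the checkpoint."""
    new_files = [f for f in available_files if f not in checkpoint]
    checkpoint.update(new_files)
    return new_files


# Hypothetical landing-zone contents at two points in time
checkpoint = set()
first = incremental_run(["orders_001.json", "orders_002.json"], checkpoint)
second = incremental_run(
    ["orders_001.json", "orders_002.json", "orders_003.json"], checkpoint
)

print(first)   # ['orders_001.json', 'orders_002.json']
print(second)  # ['orders_003.json']
```

The second run picks up only the file that was not in the checkpoint, so re-running over the same input does no duplicate work.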

**4. Built-in Data Quality**

Expectations are SQL constraints with configurable violation handling:

```sql
CREATE OR REFRESH STREAMING TABLE orders (
 CONSTRAINT valid_amount EXPECT (total_amount > 0) ON VIOLATION DROP ROW,
 CONSTRAINT valid_date EXPECT (order_date IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM ...
```

---

### Core concepts

- **Lakeflow SDP**: Declarative framework for batch and streaming pipelines
- **Flow**: Unit of processing (Append, AUTO CDC, Materialized View)
- **STREAMING TABLE**: Delta table for streaming/incremental data (append-only, low latency)
- **MATERIALIZED VIEW**: Delta table with incremental refresh (batch, cached results)
- **VIEW (temporary)**: Ephemeral, not persisted, always recomputed
- **SINK**: Streaming target (Delta, Kafka, EventHub, custom Python)
- **Pipeline**: Set of flows, tables, views, and sinks (the unit of deployment)
- **Expectations**: Data quality constraints (warn/drop/fail)
- **Event Log**: Delta table with metrics, lineage, and quality metrics

**Why does this matter?**

Lakeflow SDP removes boilerplate code and lets you focus on business logic instead of orchestration. The declarative model provides:
- **Separation of concerns**: WHAT (the declaration) vs HOW (the execution engine)
- **Reusability**: The same declarations in dev/test/prod
- **Observability**: Event Log out of the box
- **Reliability**: Automatic retry and error handling

## Per-user isolation

Run the initialization script to set up per-user isolation of catalogs and schemas:

In [0]:
%run ../00_setup

## Configuration

Import libraries and set environment variables:

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from datetime import datetime
import uuid

# Set the catalog as the default
spark.sql(f"USE CATALOG {CATALOG}")

# Paths to the source data
ORDERS_JSON = f"{DATASET_BASE_PATH}/orders/orders_batch.json"
CUSTOMERS_CSV = f"{DATASET_BASE_PATH}/customers/customers.csv"
PRODUCTS_PARQUET = f"{DATASET_BASE_PATH}/products/products.parquet"

# Display the user context
display(spark.createDataFrame([
    ("Catalog", CATALOG),
    ("Schema Bronze", BRONZE_SCHEMA),
    ("Schema Silver", SILVER_SCHEMA),
    ("Schema Gold", GOLD_SCHEMA),
    ("User", raw_user),
    ("Orders path", ORDERS_JSON),
    ("Customers path", CUSTOMERS_CSV),
    ("Products path", PRODUCTS_PARQUET)
], ["Parameter", "Value"]))

---

## Section 1: Lakeflow Concepts - Flows, Tables, Views

**Theoretical introduction:**

Lakeflow SDP is built around three key concepts: **Flows** (how data moves), **Streaming Tables** (append-only targets), and **Materialized Views** (batch targets with incremental refresh).

---

### Flow Types

A **Flow** is the unit of data processing in Lakeflow: it defines HOW data moves from source to target.

```
┌──────────────────────────────────────────────────────────────┐
│ FLOW TYPES                                                   │
├──────────────────────────────────────────────────────────────┤
│ 1. APPEND FLOW                                               │
│    • Source: Append-only (files, Kafka, Kinesis, Delta)      │
│    • Semantics: Streaming (continuous processing)            │
│    • Guarantee: Exactly-once per record                      │
│    • Latency: Low (seconds)                                  │
│    • Use case: Real-time ingest, log streaming               │
│    • Target: STREAMING TABLE                                 │
│                                                              │
│    SQL example:                                              │
│    CREATE OR REFRESH STREAMING TABLE orders AS               │
│      SELECT * FROM STREAM read_files('/path');               │
│                                                              │
│ 2. AUTO CDC FLOW                                             │
│    • Source: Change Data Capture (CDF-enabled Delta)         │
│    • Semantics: Streaming with CDC operations                │
│    • Operations: INSERT, UPDATE, DELETE, TRUNCATE            │
│    • Sequencing: Out-of-order handling (automatic)           │
│    • SCD: Type 1 (update) or Type 2 (history tracking)       │
│    • Use case: Sync with transactional DB, audit trail       │
│    • Target: STREAMING TABLE                                 │
│                                                              │
│    SQL example:                                              │
│    CREATE OR REFRESH STREAMING TABLE target_table;           │
│    CREATE FLOW cdc_flow AS AUTO CDC INTO target_table        │
│      FROM STREAM(source_table)                               │
│      KEYS (user_id)                                          │
│      APPLY AS DELETE WHEN operation = 'DELETE'               │
│      SEQUENCE BY timestamp;                                  │
│                                                              │
│ 3. MATERIALIZED VIEW FLOW                                    │
│    • Source: Batch read (Delta tables, views)                │
│    • Semantics: Batch (scheduled/triggered)                  │
│    • Refresh: Incremental (only changed partitions)          │
│    • Cache: Results are persisted (performance)              │
│    • Recompute: Full on schema changes or on request         │
│    • Use case: Aggregations, slow queries, BI dashboards     │
│    • Target: MATERIALIZED VIEW                               │
│                                                              │
│    SQL example:                                              │
│    CREATE OR REFRESH MATERIALIZED VIEW daily_summary AS      │
│      SELECT date, SUM(amount) FROM orders GROUP BY date;     │
└──────────────────────────────────────────────────────────────┘
```

**Key differences:**

| Flow Type | Processing | Source | Latency | Incremental | Use Case |
|-----------|------------|--------|---------|-------------|----------|
| **Append** | Streaming | Append-only | Seconds | Yes (watermarks) | Real-time ingest |
| **AUTO CDC** | Streaming | CDC events | Seconds | Yes (sequencing) | DB sync, SCD |
| **Materialized View** | Batch | Any Delta | Minutes | Yes (smart refresh) | Aggregations, BI |
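
The sequencing behavior of an AUTO CDC flow can be illustrated with a minimal plain-Python sketch of SCD Type 1 semantics. The event layout (`user_id`, `operation`, `seq`) and the function name `apply_cdc_scd1` are assumptions made for this example; the real flow runs as a streaming operation inside the pipeline:

```python
def apply_cdc_scd1(events, key="user_id", sequence="seq"):
    """Fold CDC events into current state; the highest sequence per key wins."""
    latest = {}  # key -> most recent event seen for that key (deletes included)
    for event in events:
        k = event[key]
        if k not in latest or event[sequence] > latest[k][sequence]:
            latest[k] = event
    # A key whose latest event is a DELETE is absent from the target
    return {
        k: {c: v for c, v in e.items() if c not in ("operation", sequence)}
        for k, e in latest.items()
        if e["operation"] != "DELETE"
    }


events = [
    {"user_id": 1, "name": "Ann", "operation": "INSERT", "seq": 1},
    {"user_id": 1, "name": "Anna", "operation": "UPDATE", "seq": 3},
    {"user_id": 1, "name": "Annie", "operation": "UPDATE", "seq": 2},  # arrives late
    {"user_id": 2, "name": "Bob", "operation": "INSERT", "seq": 1},
    {"user_id": 2, "name": "Bob", "operation": "DELETE", "seq": 2},
]

print(apply_cdc_scd1(events))  # {1: {'user_id': 1, 'name': 'Anna'}}
```

The late event with `seq = 2` does not overwrite the newer `seq = 3` update, which is the point of `SEQUENCE BY`. SCD Type 2 would keep the earlier versions as history rows instead of discarding them.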

---

### STREAMING TABLE vs MATERIALIZED VIEW

| Aspect | STREAMING TABLE | MATERIALIZED VIEW |
|--------|-----------------|-------------------|
| **Semantics** | Streaming (continuous) | Batch (scheduled/triggered) |
| **Processing** | Exactly-once per record | Incremental refresh (changed data) |
| **Source** | `STREAM` keyword required | Batch read (no STREAM) |
| **Latency** | Low (seconds) | Higher (minutes) |
| **State** | Bounded (watermarks) | Stateless (recompute) |
| **Joins** | Stream-snapshot (static dims) | Full recompute (always correct) |
| **Use case** | Real-time ingest, CDC | Aggregations, slow queries |
| **Schema evolution** | Limited (full refresh) | Flexible |

**When to use:**
- **STREAMING TABLE**: Ingest from files/Kafka, CDC, low-latency transformations
- **MATERIALIZED VIEW**: Aggregations, joins with frequently changing dimensions, pre-computing slow queries

---

### VIEW (temporary)

A temporary **VIEW** is an ephemeral object: it is not persisted and is recomputed every time it is queried.

**Use cases:**
- Intermediate transformations (reusable logic)
- Data quality checks (not published to the catalog)
- Testing (nothing is written to Delta)

```sql
-- Temporary VIEW: nothing is written to Delta
CREATE TEMPORARY VIEW temp_filtered AS
 SELECT * FROM bronze WHERE status = 'ACTIVE';

-- Use it in a downstream table
CREATE OR REFRESH MATERIALIZED VIEW silver AS
 SELECT * FROM temp_filtered;
```

---

### Automatic Dependency Resolution (DAG)

Lakeflow automatically builds the DAG from the declared dependencies:

```sql
-- Declarations (you do not specify the order):
CREATE OR REFRESH STREAMING TABLE bronze AS ...;
CREATE OR REFRESH MATERIALIZED VIEW silver AS SELECT * FROM bronze;
CREATE OR REFRESH MATERIALIZED VIEW gold AS SELECT * FROM silver;

-- Lakeflow execution order (automatic):
-- 1. bronze (no dependencies)
-- 2. silver (depends on bronze) [parallel if multiple silvers]
-- 3. gold (depends on silver) [parallel if multiple golds]
```

**Key point:** You declare WHAT; Lakeflow decides HOW and WHEN.
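
How an execution order can be derived from declarations alone is easy to show with Python's standard `graphlib` module. This is a conceptual sketch, not Lakeflow's scheduler: the `dependencies` mapping and the helper `execution_stages` are hypothetical and stand in for what the engine extracts from the `FROM` clauses.

```python
from graphlib import TopologicalSorter

# Hypothetical declarations: each dataset maps to the datasets it reads from
dependencies = {
    "bronze": set(),
    "silver": {"bronze"},
    "gold": {"silver"},
}


def execution_stages(deps):
    """Group datasets into stages; datasets within one stage could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


print(execution_stages(dependencies))  # [['bronze'], ['silver'], ['gold']]
```

With two Silver datasets that both read from Bronze, the second stage would contain both of them, which corresponds to the "parallel if multiple silvers" remark in the SQL comments.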

### Example 1.1: STREAMING TABLE - Bronze Layer Ingest

**Goal:** Demonstrate a STREAMING TABLE for real-time ingest with Auto Loader

**Approach:**
1. Use `read_files()` for Auto Loader (SQL API)
2. Use the `STREAM` keyword for streaming semantics
3. Write to a STREAMING TABLE (append-only)

In [0]:
-- Example 1.1 - STREAMING TABLE (Bronze Layer)
-- NOTE: This code demonstrates the syntax. In a production pipeline,
-- you would run it as part of a Lakeflow pipeline definition.

-- For the notebook demonstration we use the traditional approach,
-- followed later by a conversion to Lakeflow syntax

In [0]:
# Example 1.1 - Bronze Layer (traditional approach for the demonstration)
# In a production pipeline we would use Lakeflow CREATE OR REFRESH STREAMING TABLE

spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

# Bronze layer: load raw data from JSON (batch for the demo)
orders_bronze_df = (
    spark.read
    .format("json")
    .option("multiLine", "true")
    .load(ORDERS_JSON)
    .withColumn("_bronze_ingest_timestamp", F.current_timestamp())
    # _metadata.file_path replaces input_file_name(), which Unity Catalog compute may reject
    .withColumn("_bronze_source_file", F.col("_metadata.file_path"))
    .withColumn("_bronze_ingested_by", F.lit(raw_user))
    .withColumn("_bronze_version", F.lit(1))
)

# Write to Delta (Bronze table)
bronze_table = "orders_bronze"
(
    orders_bronze_df
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(bronze_table)
)

# Preview
display(spark.table(bronze_table).limit(5))

---

## Section 2: Silver Layer - MATERIALIZED VIEW + Expectations

**Theoretical introduction:**

The Silver layer cleans and validates data from Bronze. A MATERIALIZED VIEW provides incremental refresh, processing only changed data. Expectations are built-in data quality constraints.

**Key concepts:**
- **MATERIALIZED VIEW**: Batch processing with incremental refresh
- **Expectations**: SQL constraints with actions: EXPECT (warn), DROP ROW, FAIL UPDATE
- **Data Quality Gates**: Validations between layers

**Practical applications:**
- Deduplication by business key
- Validation: NOT NULL, ranges, business rules
- Standardization: dates, text formats, type casting

### Example 2.1: MATERIALIZED VIEW with Expectations (Silver Layer)

**Goal:** Demonstrate a MATERIALIZED VIEW for Silver with data quality constraints

**Lakeflow SQL Syntax (Production):**

```sql
-- In a production Lakeflow pipeline:
CREATE OR REFRESH MATERIALIZED VIEW silver.orders_silver (
 -- Expectations: data quality constraints, evaluated against the query output
 CONSTRAINT valid_amount EXPECT (total_amount > 0) ON VIOLATION DROP ROW,
 CONSTRAINT valid_date EXPECT (order_date IS NOT NULL) ON VIOLATION DROP ROW,
 CONSTRAINT valid_ids EXPECT (order_id IS NOT NULL AND customer_id IS NOT NULL)
)
COMMENT 'Silver layer - cleaned orders with quality checks'
AS
SELECT
 order_id,
 customer_id,
 product_id,
 store_id,
 to_date(order_datetime) AS order_date,
 to_timestamp(order_datetime) AS order_timestamp,
 quantity,
 unit_price,
 CAST(total_amount AS DECIMAL(10,2)) AS total_amount,
 UPPER(TRIM(payment_method)) AS payment_method,
 CASE 
 WHEN total_amount > 0 THEN 'COMPLETED'
 ELSE 'UNKNOWN'
 END AS order_status,
 current_timestamp() AS _silver_processed_timestamp,
 'VALID' AS _data_quality_flag
FROM bronze.orders_bronze;
```

**Expectations explained:**
- **EXPECT (warn)**: Log the violation, keep the record (default)
- **ON VIOLATION DROP ROW**: Drop the invalid record
- **ON VIOLATION FAIL UPDATE**: Abort the pipeline update on a violation (strict mode)
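
The three violation actions can be compared side by side in a small plain-Python sketch. The names `apply_expectation` and `ExpectationResult` are hypothetical and the rows are made-up sample data; the sketch mirrors the semantics described above rather than Lakeflow's implementation:

```python
from dataclasses import dataclass, field


@dataclass
class ExpectationResult:
    kept: list = field(default_factory=list)
    passed: int = 0
    failed: int = 0


def apply_expectation(rows, predicate, on_violation="warn"):
    """Apply one rule to rows; on_violation is 'warn', 'drop', or 'fail'."""
    if on_violation not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown action: {on_violation}")
    result = ExpectationResult()
    for row in rows:
        if predicate(row):
            result.passed += 1
            result.kept.append(row)
            continue
        result.failed += 1
        if on_violation == "fail":
            raise ValueError(f"expectation violated by row: {row}")
        if on_violation == "warn":
            result.kept.append(row)  # keep the row, only count the violation
        # 'drop': the row is left out of the output
    return result


orders = [
    {"order_id": 1, "total_amount": 10.0},
    {"order_id": 2, "total_amount": -5.0},
]


def valid_amount(row):
    return row["total_amount"] > 0


warned = apply_expectation(orders, valid_amount, "warn")
dropped = apply_expectation(orders, valid_amount, "drop")
print(len(warned.kept), warned.failed)    # 2 1
print(len(dropped.kept), dropped.failed)  # 1 1
```

In both modes the violation is counted, which is what makes the quality metrics available later; only `drop` changes the output, and `fail` stops processing at the first invalid row.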

**Traditional implementation (for the demo notebook):**

In [0]:
# Example 2.1 - Silver Layer with data quality (traditional approach for the demo)

spark.sql(f"USE SCHEMA {SILVER_SCHEMA}")

# Load data from Bronze
orders_bronze_df = spark.table(f"{BRONZE_SCHEMA}.{bronze_table}")

# Silver transformations with validation (simulating Expectations)
orders_silver_df = (
    orders_bronze_df
    # NOT NULL validation (DROP ROW equivalent)
    .filter(F.col("order_id").isNotNull())
    .filter(F.col("customer_id").isNotNull())
    .filter(F.col("product_id").isNotNull())
    # Business rule validation
    .filter(F.col("total_amount") > 0)
    .filter(F.col("order_datetime").isNotNull())
    # Deduplicate after validation so an invalid duplicate cannot displace a valid row
    .dropDuplicates(["order_id"])
    # Standardization
    .withColumn("order_date", F.to_date(F.col("order_datetime")))
    .withColumn("order_timestamp", F.to_timestamp(F.col("order_datetime")))
    .withColumn("total_amount", F.col("total_amount").cast("decimal(10,2)"))
    .withColumn("payment_method", F.upper(F.trim(F.col("payment_method"))))
    # Derived columns
    .withColumn("order_status", 
                F.when(F.col("total_amount") > 0, "COMPLETED").otherwise("UNKNOWN"))
    # Silver metadata
    .withColumn("_silver_processed_timestamp", F.current_timestamp())
    .withColumn("_data_quality_flag", F.lit("VALID"))
)

# Quality metrics
bronze_count = orders_bronze_df.count()
silver_count = orders_silver_df.count()
rejected_count = bronze_count - silver_count
rejection_rate = (rejected_count / bronze_count * 100) if bronze_count > 0 else 0

# Data quality metrics (values as strings so the column has a single type)
display(spark.createDataFrame([
    ("Bronze input", str(bronze_count)),
    ("Silver output", str(silver_count)),
    ("Rejected", str(rejected_count)),
    ("Rejection rate %", f"{rejection_rate:.2f}")
], ["Metric", "Value"]))

# Write to the Silver schema
silver_table = "orders_silver"
(
    orders_silver_df
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(silver_table)
)

# Sample Silver data
display(spark.table(silver_table).limit(5))

---

## Section 3: Gold Layer - Business Aggregates

**Theoretical introduction:**

The Gold layer holds pre-aggregated business metrics, denormalized tables, and KPIs. A MATERIALIZED VIEW with incremental refresh aims to recompute only the changed partitions.

**Key concepts:**
- **Business-level aggregates**: Daily/Monthly summaries, KPIs
- **Denormalization**: Pre-computed joins for performance
- **Incremental refresh**: Only affected partitions

**Practical applications:**
- BI dashboards (Power BI, Tableau)
- Executive reporting
- ML feature stores

### Example 3.1: MATERIALIZED VIEW for Gold (Daily Aggregates)

**Goal:** Demonstrate the Gold layer with business aggregates and KPIs

**Lakeflow SQL Syntax (Production):**

```sql
-- In a production Lakeflow pipeline:
CREATE OR REFRESH MATERIALIZED VIEW gold.daily_order_summary
COMMENT 'Gold layer - daily order summary (KPI)'
AS
SELECT
 order_date,
 order_status,
 -- Volume metrics
 COUNT(order_id) AS total_orders,
 COUNT(DISTINCT customer_id) AS unique_customers,
 -- Revenue metrics
 SUM(total_amount) AS total_revenue,
 AVG(total_amount) AS avg_order_value,
 MIN(total_amount) AS min_order_value,
 MAX(total_amount) AS max_order_value,
 -- Derived KPIs
 ROUND(SUM(total_amount) / COUNT(DISTINCT customer_id), 2) AS revenue_per_customer,
 -- Gold metadata
 current_timestamp() AS _gold_created_timestamp,
 'DAILY' AS _gold_aggregation_level
FROM silver.orders_silver
GROUP BY order_date, order_status
ORDER BY order_date DESC, order_status;
```

**Automatic Dependency:** Lakeflow knows that `gold.daily_order_summary` depends on `silver.orders_silver`, so the execution order is derived automatically.

**Traditional implementation:**

In [0]:
# Example 3.1 - Gold Layer (Daily Aggregates)

spark.sql(f"USE SCHEMA {GOLD_SCHEMA}")

# Load data from Silver
orders_silver_df = spark.table(f"{SILVER_SCHEMA}.{silver_table}")

# Gold aggregation: daily order summary with KPIs
daily_summary_df = (
    orders_silver_df
    .groupBy("order_date", "order_status")
    .agg(
        # Volume metrics
        F.count("order_id").alias("total_orders"),
        F.countDistinct("customer_id").alias("unique_customers"),
        # Revenue metrics
        F.sum("total_amount").alias("total_revenue"),
        F.avg("total_amount").alias("avg_order_value"),
        F.min("total_amount").alias("min_order_value"),
        F.max("total_amount").alias("max_order_value")
    )
    # Derived KPIs
    .withColumn("revenue_per_customer", 
                F.round(F.col("total_revenue") / F.col("unique_customers"), 2))
    # Gold metadata
    .withColumn("_gold_created_timestamp", F.current_timestamp())
    .withColumn("_gold_aggregation_level", F.lit("DAILY"))
    .orderBy("order_date", "order_status")
)

# Summary statistics
total_days = daily_summary_df.select("order_date").distinct().count()
total_orders_gold = daily_summary_df.agg(F.sum("total_orders")).collect()[0][0]
total_revenue_gold = daily_summary_df.agg(F.sum("total_revenue")).collect()[0][0]

# Gold layer summary
display(spark.createDataFrame([
    ("Total days aggregated", str(total_days)),
    ("Total orders", f"{total_orders_gold:,}"),
    ("Total revenue", f"${total_revenue_gold:,.2f}")
], ["Metric", "Value"]))

# Write to the Gold schema
gold_table = "daily_order_summary"
(
    daily_summary_df
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(gold_table)
)

# Sample Gold data (daily KPIs)
display(spark.table(gold_table).limit(5))

---

## Section 4: Event Log and Lineage

**Theoretical introduction:**

Lakeflow automatically logs pipeline operations to the **Event Log** (a Delta table). The Event Log contains:
- Flow progress (success/failure per table)
- Data quality metrics (expectation violations)
- Lineage tracking (source → target)
- Performance metrics (duration, records processed)

**Key concepts:**
- **Event Log**: Delta table under `system/events` (per pipeline)
- **Event types**: for example `flow_definition`, `flow_progress`, `user_action`; expectation results are reported inside `flow_progress` events
- **Lineage**: Automatic tracking of the Bronze → Silver → Gold dependencies

**Practical applications:**
- Monitoring pipeline health
- Debugging failures
- Data quality reporting
- Audit and compliance

### Example 4.1: Event Log - Monitoring and Lineage

**Event Log Location:**
```
dbfs:/pipelines/<pipeline_id>/system/events
```

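The path above applies to pipelines that store the event log on DBFS. For pipelines published to Unity Catalog, the log is typically queried with the `event_log()` table-valued function instead.

**Example Event Log queries (in a production Lakeflow pipeline):**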

```python
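# 1. Load the Event Log (placeholder path; substitute a real pipeline ID)
event_log_path = "dbfs:/pipelines/<pipeline_id>/system/events"
event_log_df = spark.read.format("delta").load(event_log_path)

# `details` is stored as a JSON string, so nested fields are read with
# the `:` JSON path operator rather than struct dot notation.

# 2. Flow progress per table
flow_progress = (
    event_log_df
    .filter("event_type = 'flow_progress'")
    .selectExpr(
        "timestamp",
        "origin.flow_name AS flow_name",
        "details:flow_progress.metrics.num_output_rows AS output_records",
        "details:flow_progress.status AS status",
    )
)

# 3. Expectation results (reported inside flow_progress events)
expectation_schema = (
    "array<struct<name: string, dataset: string, "
    "passed_records: long, failed_records: long>>"
)
expectations_df = (
    event_log_df
    .filter("event_type = 'flow_progress'")
    .selectExpr(
        "timestamp",
        "explode(from_json(details:flow_progress.data_quality.expectations, "
        f"'{expectation_schema}')) AS e",
    )
    .select("timestamp", "e.dataset", "e.name", "e.passed_records", "e.failed_records")
)

# 4. Lineage tracking
lineage_df = (
    event_log_df
    .filter("event_type = 'flow_definition'")
    .selectExpr(
        "origin.flow_name AS flow_name",
        "details:flow_definition.input_datasets AS input_datasets",
        "details:flow_definition.output_dataset AS output_dataset",
    )
)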
```

**For the demo (without a production pipeline):**

In [0]:
# Example 4.1 - Lineage Tracking (simulation for the demo)

print("=" * 70)
print("LINEAGE TRACKING - Bronze → Silver → Gold")
print("=" * 70)

# Simulated lineage metadata (in production: from the Event Log)
lineage_data = [
    {
        "layer": "Bronze",
        "table": f"{BRONZE_SCHEMA}.{bronze_table}",
        "source": ORDERS_JSON,
        "target_tables": [f"{SILVER_SCHEMA}.{silver_table}"],
        "record_count": spark.table(f"{BRONZE_SCHEMA}.{bronze_table}").count()
    },
    {
        "layer": "Silver",
        "table": f"{SILVER_SCHEMA}.{silver_table}",
        "source": f"{BRONZE_SCHEMA}.{bronze_table}",
        "target_tables": [f"{GOLD_SCHEMA}.{gold_table}"],
        "record_count": spark.table(f"{SILVER_SCHEMA}.{silver_table}").count()
    },
    {
        "layer": "Gold",
        "table": f"{GOLD_SCHEMA}.{gold_table}",
        "source": f"{SILVER_SCHEMA}.{silver_table}",
        "target_tables": ["BI Dashboards", "ML Models"],
        "record_count": spark.table(f"{GOLD_SCHEMA}.{gold_table}").count()
    }
]

# Display lineage
import pandas as pd
lineage_df = pd.DataFrame(lineage_data)
print("\n[LINEAGE TABLE]")
print(lineage_df.to_string(index=False))

# Data flow diagram
print("\n[DATA FLOW DIAGRAM]")
print(f"""
Source Data (JSON)
    │
    ├─→ {BRONZE_SCHEMA}.{bronze_table} ({lineage_data[0]['record_count']} records)
        │   [STREAMING TABLE - Append-only]
        │
        ├─→ {SILVER_SCHEMA}.{silver_table} ({lineage_data[1]['record_count']} records)
            │   [MATERIALIZED VIEW - Validated + Cleaned]
            │   [Quality: {rejection_rate:.2f}% rejection rate]
            │
            ├─→ {GOLD_SCHEMA}.{gold_table} ({lineage_data[2]['record_count']} aggregates)
                    [MATERIALIZED VIEW - Business KPIs]
                    ├─→ Power BI Dashboards
                    └─→ ML Feature Store
""")

---

## Section 5: SQL vs Python API

**Introduction:**

Lakeflow SDP offers two equivalent APIs: **SQL** and **Python**. The choice depends on team preference and use case.

### Syntax comparison

| Aspect | SQL | Python |
|--------|-----|--------|
| **STREAMING TABLE** | `CREATE OR REFRESH STREAMING TABLE` | `@dp.table()` |
| **MATERIALIZED VIEW** | `CREATE OR REFRESH MATERIALIZED VIEW` | `@dp.materialized_view()` |
| **VIEW** | `CREATE TEMPORARY VIEW` | `@dp.view()` / `@dp.temporary_view()` |
| **Expectations** | `CONSTRAINT ... EXPECT ... ON VIOLATION` | `@dp.expect()`, `@dp.expect_or_drop()`, `@dp.expect_or_fail()` |
| **Streaming read** | `FROM STREAM table` | `spark.readStream.table()` |

---

### Example: The same pipeline in SQL and Python

**SQL Approach:**

```sql
-- Bronze
CREATE OR REFRESH STREAMING TABLE bronze.orders AS
SELECT * FROM STREAM read_files('/path/orders', format => 'json');

-- Silver
CREATE OR REFRESH MATERIALIZED VIEW silver.orders (
 CONSTRAINT valid_amount EXPECT (total_amount > 0) ON VIOLATION DROP ROW
)
AS SELECT
 order_id,
 customer_id,
 to_date(order_datetime) AS order_date, -- needed by the Gold view below
 CAST(total_amount AS DECIMAL(10,2)) AS total_amount
FROM bronze.orders;

-- Gold
CREATE OR REFRESH MATERIALIZED VIEW gold.daily_summary AS
SELECT 
 DATE(order_date) AS date,
 SUM(total_amount) AS revenue
FROM silver.orders
GROUP BY DATE(order_date);
```

**Python Approach (equivalent):**

```python
from pyspark import pipelines as dp
from pyspark.sql import functions as F

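# name= pins each dataset name so that downstream reads match the SQL version
# (schema-qualified names assume a pipeline that can publish to several schemas)

# Bronze
@dp.table(name="bronze.orders", comment="Bronze orders")
def orders_bronze():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/path/orders")
    )

# Silver
@dp.materialized_view(name="silver.orders", comment="Silver orders")
@dp.expect_or_drop("valid_amount", "total_amount > 0")
def orders_silver():
    return (
        spark.read.table("bronze.orders")
        .select(
            "order_id",
            "customer_id",
            # order_date is needed by the Gold aggregation below
            F.to_date("order_datetime").alias("order_date"),
            # the alias keeps the column name that the expectation refers to
            F.col("total_amount").cast("decimal(10,2)").alias("total_amount"),
        )
    )

# Gold
@dp.materialized_view(name="gold.daily_summary", comment="Gold daily summary")
def daily_summary():
    return (
        spark.read.table("silver.orders")
        .groupBy(F.to_date("order_date").alias("date"))
        .agg(F.sum("total_amount").alias("revenue"))
    )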
```

---

### When to use SQL vs Python?

**Use SQL if:**
- The team has strong SQL skills
- Transformations are simple (filters, aggregations)
- You integrate with BI tools (SQL-native workflows)
- You need little metaprogramming

**Use Python if:**
- You need loops or dynamic table creation
- Transformations are complex (UDFs, window functions)
- You integrate with ML workflows
- You want unit tests for transformations

**Best of both:** Mix them. Use SQL for simple steps and Python for complex ones.
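
The "loops / dynamic table creation" argument for Python can be shown without a pipeline runtime. The sketch below builds one Bronze declaration per entity as SQL text; in a real Python pipeline the loop would more likely wrap decorated functions. The helper `bronze_declarations`, the entity names, and the paths are hypothetical:

```python
def bronze_declarations(entities, base_path="/path"):
    """Build one streaming-table declaration per (entity name, file format) pair."""
    return [
        f"CREATE OR REFRESH STREAMING TABLE bronze_{name} AS\n"
        f"SELECT * FROM STREAM read_files('{base_path}/{name}', format => '{fmt}');"
        for name, fmt in entities.items()
    ]


# Hypothetical source entities and their file formats
entities = {"orders": "json", "customers": "csv", "products": "parquet"}

statements = bronze_declarations(entities)
print(statements[0])
print(len(statements))  # 3
```

Adding a fourth source then means adding one dictionary entry instead of copying another declaration, which is the kind of repetition that plain SQL cannot factor out.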

---

## Best Practices

### Pipeline design:

**1. Separation of Concerns:**
- Bronze: Ingest and audit metadata only (no business logic)
- Silver: Data quality and standardization (no aggregations)
- Gold: Business logic and aggregations (BI-ready)

**2. Expectations Strategy:**
- Bronze → Silver: NOT NULL, schema validation (DROP ROW)
- Silver → Gold: Business rules, referential integrity (FAIL UPDATE for critical rules)
- Start with EXPECT (warn), then tighten to DROP ROW after analyzing the violations

**3. Incremental Processing:**
- Use STREAMING TABLE for append-only sources (files, Kafka)
- Use MATERIALIZED VIEW for batch with incremental refresh
- Partition Silver/Gold by date for incremental MERGE

**4. Monitoring:**
- Query the Event Log regularly for expectation violations
- Alert on spikes in the rejection rate (> 5%)
- Dashboard per pipeline: throughput, latency, quality metrics
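
The rejection-rate alert can be sketched in plain Python. It assumes the expectation metrics were already extracted from the Event Log into dictionaries with `name`, `passed_records`, and `failed_records`; the helper `rejection_alerts` and the sample numbers are hypothetical, and the 5% threshold is the one suggested above:

```python
def rejection_alerts(metrics, threshold_pct=5.0):
    """Return (expectation name, rate) pairs whose rejection rate exceeds the threshold."""
    alerts = []
    for m in metrics:
        total = m["passed_records"] + m["failed_records"]
        if total == 0:
            continue  # nothing processed, nothing to alert on
        rate = 100.0 * m["failed_records"] / total
        if rate > threshold_pct:
            alerts.append((m["name"], round(rate, 2)))
    return alerts


# Made-up metrics for two expectations
metrics = [
    {"name": "valid_amount", "passed_records": 900, "failed_records": 100},
    {"name": "valid_date", "passed_records": 990, "failed_records": 10},
]

print(rejection_alerts(metrics))  # [('valid_amount', 10.0)]
```

Only `valid_amount` is reported, because its 10% rejection rate is above the threshold while the 1% rate of `valid_date` is not.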

### Performance:

**5. Partitioning:**
- Bronze: Minimal partitioning (bulk operations)
- Silver: Partition by date or region
- Gold: Partition according to BI query patterns

**6. Optimization:**
- Enable Auto Optimize for Silver/Gold
- ZORDER BY frequently filtered columns
- Run VACUUM regularly (retention policy)

### Governance:

**7. Unity Catalog:**
- Bronze/Silver/Gold as separate schemas
- Different access controls per layer
- Service principals for pipelines (not user accounts)

**8. Naming Conventions:**
- `bronze_<entity>`, `silver_<entity>`, `gold_<metric>`
- Comments on every table (business context)
- Expectations: descriptive names (`valid_customer_email` instead of `check1`)

### Deployment:

**9. CI/CD:**
- Git repos for pipeline code
- Databricks Asset Bundles (DAB) for deployment
- Separate environments: dev → test → prod

**10. Testing:**
- Unit tests for transformations (pytest)
- End-to-end integration tests (sample data)
- Data quality tests (Great Expectations)

---

## Summary

### What we covered:

**Understanding Lakeflow SDP:**
- Declarative framework for batch and streaming pipelines
- Automatic orchestration (DAG), retry, monitoring
- Zero boilerplate code

**Flow Types:**
- STREAMING TABLE (append-only, low-latency)
- MATERIALIZED VIEW (incremental refresh, batch)
- VIEW (ephemeral, reusable logic)

**Medallion Architecture with Lakeflow:**
- Bronze: STREAMING TABLE with Auto Loader
- Silver: MATERIALIZED VIEW with Expectations (data quality)
- Gold: MATERIALIZED VIEW with business aggregates

**Data Quality:**
- Expectations: EXPECT (warn) / DROP ROW / FAIL UPDATE
- Metrics tracked in the Event Log
- Automatic rejection rate calculation

**Event Log & Lineage:**
- Automatic logging to a Delta table
- Flow progress, expectations, lineage tracking
- Queries for monitoring and debugging

### Key takeaways:

1. **Declarative definitions reduce complexity**: `CREATE OR REFRESH` vs hundreds of lines of procedural code
2. **Automatic orchestration**: Lakeflow builds the DAG and executes it in the correct order
3. **Incremental processing**: Only new or changed data is processed, which improves performance
4. **Built-in quality**: Expectations as first-class citizens
5. **Observability out of the box**: Event Log for monitoring

### Quick Reference - Most important commands:

| Operation | SQL | Python |
|-----------|-----|--------|
| **Streaming table** | `CREATE OR REFRESH STREAMING TABLE` | `@dp.table()` |
| **Materialized view** | `CREATE OR REFRESH MATERIALIZED VIEW` | `@dp.materialized_view()` |
| **Expectations** | `CONSTRAINT ... EXPECT ... ON VIOLATION DROP ROW` | `@dp.expect_or_drop()` |
| **Auto Loader** | `FROM STREAM read_files(...)` | `.format("cloudFiles")` |
| **Streaming read** | `FROM STREAM table` | `spark.readStream.table()` |

### Next steps:

- **Next notebook**: 03_batch_streaming_load.ipynb - COPY INTO, Auto Loader deep dive
- **Hands-on workshop**: 02_lakeflow_orchestration_workshop.ipynb
- **Production deployment**: Lakeflow Jobs and the Lakeflow pipelines UI (formerly Databricks Jobs and Delta Live Tables)
- **Documentation**: [Lakeflow Pipelines Docs](https://docs.databricks.com/aws/en/ldp/)

### Homework (optional):

Create your own Lakeflow pipeline:
1. Bronze: Load customers.csv with Auto Loader
2. Silver: Validate the email format, deduplicate by customer_id
3. Gold: Aggregate customers per region (daily counts)
4. Add Expectations for email and phone number

---

## Resource cleanup

Clean up the resources created in this notebook:

In [0]:
# Optional cleanup of test resources
# NOTE: Run only if you want to delete all data created by this notebook

# Drop the demo tables
# spark.sql(f"DROP TABLE IF EXISTS {BRONZE_SCHEMA}.{bronze_table}")
# spark.sql(f"DROP TABLE IF EXISTS {SILVER_SCHEMA}.{silver_table}")
# spark.sql(f"DROP TABLE IF EXISTS {GOLD_SCHEMA}.{gold_table}")

# Clear the cache
# spark.catalog.clearCache()

displayHTML("<p>To drop the tables, uncomment the code above and run the cell.</p>")