# Streaming Data Ingestion - Demo

**Training goal:** Master techniques for streaming data ingestion into Delta Lake using Structured Streaming and Auto Loader.

**Topics covered:**
- Structured Streaming fundamentals
- Auto Loader (cloudFiles) deep dive
- readStream & writeStream API
- Watermarking & late data handling
- Checkpoint management
- Trigger modes (once, continuous, availableNow, processingTime)
- Schema evolution in streaming
- Stream-to-Delta patterns
- Exactly-once semantics
- Monitoring & troubleshooting

## Context and requirements

- **Training day**: Day 2 - Delta Lake & Lakehouse Architecture
- **Notebook type**: Demo
- **Technical requirements**:
  - Databricks Runtime 13.0+ (recommended: 14.3 LTS)
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, CREATE SCHEMA, SELECT, MODIFY
  - Cluster: Standard with at least 2 workers
- **Dependencies**:
  - Completed notebook 01_delta_lake_operations.ipynb
  - Completed notebook 02_batch_data_ingestion.ipynb
- **Estimated time**: ~60 minutes

## Theoretical introduction

**Section goal:** Understand the fundamentals of Structured Streaming and when to use streaming versus batch ingestion.

### Structured Streaming - Key Concepts

**What is Structured Streaming?**
- Streaming engine built on Spark SQL
- Treats a stream as an "unbounded table"
- Micro-batch processing (default) or continuous processing
- Exactly-once semantics with idempotent writes
- Fault-tolerant with checkpoint recovery

**Micro-batch Architecture:**
```
Input Stream → Micro-batch → Processing → Output Sink
     ↓              ↓             ↓            ↓
  (files)      (trigger)    (DataFrame)   (Delta)
                              API
```

**Why Streaming?**
- **Low latency**: Seconds/minutes instead of hours
- **Real-time insights**: Dashboards, alerts, ML inference
- **Continuous processing**: Does not wait for a batch window
- **Event-driven**: React to data as it arrives

### Batch vs Streaming Decision Matrix

| Cecha | Batch (COPY INTO) | Streaming (Auto Loader) |
|-------|-------------------|-------------------------|
| **Latency** | Minutes-Hours | Seconds-Minutes |
| **File arrival** | Large files, scheduled | Continuous small files |
| **Complexity** | Low | Medium |
| **Cost** | Lower (runs on schedule) | Higher (always on) |
| **Use Case** | Daily ETL, reports | Real-time dashboards, CDC |
| **Idempotency** | Built-in (file tracking) | Built-in (checkpoint) |
| **Schema evolution** | Manual | Automatic (rescue mode) |

**When to use Streaming:**
- ✅ Data arrives continuously (intervals < 1h)
- ✅ You need low latency (< 5 min)
- ✅ Small files (< 100MB each)
- ✅ Event-driven applications
- ✅ Real-time dashboards/analytics

**When to use Batch:**
- ✅ Data arrives at scheduled intervals (hourly/daily)
- ✅ Large files (> 1GB)
- ✅ Latency is not critical
- ✅ Lower cost requirement
- ✅ Simple operational model
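
The two checklists can be read as a rough scoring rule. The sketch below is only an illustration in plain Python: `IngestionProfile`, `recommend_ingestion`, and the "two or more criteria" cut-off are hypothetical names and assumptions, and the thresholds are simply the ones quoted in the lists.

```python
from dataclasses import dataclass


@dataclass
class IngestionProfile:
    """Hypothetical description of a data source (illustration only)."""
    arrival_interval_minutes: float  # how often new files land
    max_latency_minutes: float       # how stale the target table may be
    typical_file_mb: float           # typical size of one file


def recommend_ingestion(profile: IngestionProfile) -> str:
    """Return "streaming" or "batch" using the rules of thumb from the lists."""
    votes = 0
    if profile.arrival_interval_minutes < 60:  # arrives more often than hourly
        votes += 1
    if profile.max_latency_minutes < 5:        # low-latency requirement
        votes += 1
    if profile.typical_file_mb < 100:          # small files
        votes += 1
    # Assumption: two or more matching criteria tip the balance toward streaming.
    return "streaming" if votes >= 2 else "batch"


clickstream = IngestionProfile(arrival_interval_minutes=1, max_latency_minutes=2, typical_file_mb=5)
nightly_export = IngestionProfile(arrival_interval_minutes=1440, max_latency_minutes=720, typical_file_mb=2048)
print(recommend_ingestion(clickstream))     # streaming
print(recommend_ingestion(nightly_export))  # batch
```

In practice cost and operational simplicity also weigh in, so treat the result as a starting point for discussion rather than a decision.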

### Auto Loader (cloudFiles) - The Game Changer

**What is Auto Loader?**
- Databricks-managed streaming source (`cloudFiles`)
- Automatic file discovery (no need to list files manually)
- Incremental processing (new files only)
- Schema inference & evolution
- Optional file notifications (avoids repeatedly scanning the folder)
- Tracks processed files in the stream checkpoint

**Why Auto Loader over readStream.format("json")?**

| Feature | Auto Loader | Standard readStream |
|---------|-------------|---------------------|
| File discovery | Automatic (incremental listing or file notifications) | Full directory listing |
| Schema inference | Built-in + evolution | Manual definition |
| Small files | Optimized | Slow (many tasks) |
| Scalability | Millions of files | Struggles at 100k+ |
| Cost | Lower (especially with file notifications) | Higher (repeated directory scans) |

**Auto Loader Architecture:**
```
Cloud Storage → File Notification → Databricks → Processing → Delta Lake
    (S3)           (SQS/EventGrid)      (Auto        (Spark)      (Target)
                                        Loader)
```

## Per-user isolation

Run the initialization script that sets up per-user isolation of catalogs and schemas:

In [None]:
%run ../00_setup

## Configuration

Import libraries and set the environment variables used for streaming:

In [None]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from datetime import datetime, timedelta
import time

# Show the user context
print("=== User context ===")
print(f"Catalog: {CATALOG}")
print(f"Bronze schema: {BRONZE_SCHEMA}")
print(f"Silver schema: {SILVER_SCHEMA}")
print(f"User: {raw_user}")

# Set the default catalog and schema
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

# Paths to the streaming data
ORDERS_STREAMING_PATH = f"{DATASET_BASE_PATH}/orders"  # Folder with streaming files
CHECKPOINT_BASE = f"/tmp/{raw_user}/streaming_checkpoints"  # /tmp is acceptable for a demo only

# Clean up previous checkpoints (demo only)
try:
    dbutils.fs.rm(CHECKPOINT_BASE, True)
    print("✓ Removed previous checkpoints")
except Exception:
    pass

print("\n=== Streaming paths ===")
print(f"Streaming source: {ORDERS_STREAMING_PATH}")
print(f"Checkpoint base: {CHECKPOINT_BASE}")

# List the available streaming files
print("\n=== Available streaming files ===")
try:
    files = dbutils.fs.ls(ORDERS_STREAMING_PATH)
    stream_files = [f for f in files if f.name.startswith("orders_stream_")]
    print(f"Found {len(stream_files)} streaming files:")
    for f in stream_files[:5]:  # Show the first 5
        print(f"  - {f.name} ({f.size} bytes)")
    if len(stream_files) > 5:
        print(f"  ... and {len(stream_files) - 5} more")
except Exception as e:
    print(f"⚠️  Could not list files: {e}")

## Section 1: Structured Streaming Basics - readStream & writeStream

**Theoretical introduction:**

Structured Streaming is built on two basic operations:
- **`readStream`**: Reads data as a stream (unbounded DataFrame)
- **`writeStream`**: Writes a stream to a sink (Delta, Parquet, console)

**Basic syntax:**
```python
# Read stream
df_stream = spark.readStream \
    .format("json") \
    .option("maxFilesPerTrigger", 1) \
    .load("/path/to/files")

# Write stream
query = df_stream.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .toTable("target_table")
```

**Key concepts:**

**1. Output Modes:**
- `append`: Only new rows (most common)
- `complete`: The entire result table (aggregations only)
- `update`: Only rows changed since the last trigger (typically aggregations)

**2. Checkpoint Location:**
- Required for production streams
- Stores offsets/progress for fault tolerance
- Enables restart without duplicating or losing data

**3. Trigger Modes:**
- `once`: One-time processing in a single batch (deprecated in favor of `availableNow`)
- `availableNow`: Process everything that is available, then stop
- `processingTime`: Micro-batch every X seconds/minutes
- `continuous`: Low-latency continuous processing (experimental)

**Why it matters:**
- Exactly-once semantics through checkpointing
- Fault tolerance (restart without duplicates)
- Throughput control (maxFilesPerTrigger)
- Monitoring & observability

### Example 1.1: Basic readStream with JSON files

**Goal:** Build a simple streaming pipeline that reads JSON files and writes them to a Delta table.

**Approach:**
1. Prepare the target Delta table
2. Use readStream to read JSON
3. Use writeStream to write to Delta
4. Monitor the streaming query

In [None]:
# Example 1.1 - Basic Streaming with readStream/writeStream

TARGET_TABLE = f"{BRONZE_SCHEMA}.orders_streaming_basic"
CHECKPOINT_PATH_BASIC = f"{CHECKPOINT_BASE}/basic_stream"

print("=== Example 1.1: Basic Streaming ===\n")
print(f"Target table: {TARGET_TABLE}")
print(f"Checkpoint: {CHECKPOINT_PATH_BASIC}\n")

# Step 1: Create the target Delta table (optional - writeStream can auto-create it)
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {TARGET_TABLE} (
  order_id STRING,
  customer_id STRING,
  order_date STRING,
  total_amount DOUBLE,
  payment_method STRING,
  product_id STRING,
  quantity INT,
  _processing_timestamp TIMESTAMP
) USING DELTA
""")

print(f"✓ Table {TARGET_TABLE} is ready\n")

# Step 2: Define an explicit schema (recommended for production)
orders_schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("order_date", StringType(), True),
    StructField("total_amount", DoubleType(), True),
    StructField("payment_method", StringType(), True),
    StructField("product_id", StringType(), True),
    StructField("quantity", IntegerType(), True)
])

# Step 3: readStream - read the JSON files as a stream
df_stream = (spark.readStream
    .format("json")
    .schema(orders_schema)  # Explicit schema
    .option("maxFilesPerTrigger", 2)  # Process at most 2 files per micro-batch
    .load(ORDERS_STREAMING_PATH)
)

print("✓ Stream reader configured")
print("  Format: JSON")
print("  Max files per trigger: 2")
print(f"  Schema: Explicit ({len(orders_schema.fields)} columns)\n")

# Step 4: Add a processing timestamp
df_stream_with_ts = df_stream.withColumn(
    "_processing_timestamp",
    F.current_timestamp()
)

# Step 5: writeStream - write to the Delta table
print("🚀 Starting the streaming query...\n")

query = (df_stream_with_ts.writeStream
    .format("delta")
    .outputMode("append")  # Only new rows
    .option("checkpointLocation", CHECKPOINT_PATH_BASIC)
    .trigger(availableNow=True)  # Process everything available, then stop
    .toTable(TARGET_TABLE)
)

# Wait for completion (availableNow stops on its own)
query.awaitTermination()

print("✅ Stream finished\n")

# Step 6: Check the results
count = spark.table(TARGET_TABLE).count()
print("=== Results ===")
print(f"Rows loaded: {count}\n")

# Show sample data
print("Sample data:")
display(spark.table(TARGET_TABLE).limit(10))

**Explanation:**

**`readStream` options:**
- `schema`: Explicit schema (recommended) - faster than inference
- `maxFilesPerTrigger`: Throughput control - avoids overloading the cluster
- `format("json")`: The file source supports JSON, CSV, Parquet, Avro, ORC

**`writeStream` options:**
- `outputMode("append")`: Only new rows (the most efficient)
- `checkpointLocation`: REQUIRED for production - fault tolerance
- `trigger(availableNow=True)`: Process everything available, then stop (batch-like)

**`trigger` modes:**
- `availableNow=True`: One-time processing (recommended for scheduled jobs)
- `once=True`: Deprecated predecessor of `availableNow`; runs a single batch and ignores rate limits such as `maxFilesPerTrigger`
- `processingTime="10 seconds"`: Micro-batch every 10s (always-on)
- `continuous="1 second"`: Ultra-low latency (experimental)

**Checkpoint:**
- Stores offsets (which files have been processed)
- Enables restart without duplicates
- Do not delete the checkpoint location if you want incremental processing!

**💡 Best Practice**: Use `availableNow=True` for scheduled jobs (batch-like streaming).

## Section 2: Auto Loader (cloudFiles) - Deep Dive

**Theoretical introduction:**

Auto Loader is a Databricks-managed streaming source optimized for incremental file ingestion from cloud storage.

**Key advantages of Auto Loader:**

**1. Automatic File Discovery:**
- No need to list files in the folder manually
- Optional file notifications (SQS/Event Grid/Pub/Sub) instead of directory listing
- Scales to millions of files

**2. Schema Inference & Evolution:**
- Automatically infers the schema from sample files
- `cloudFiles.schemaEvolutionMode`: addNewColumns, rescue, failOnNewColumns, none
- Rescued data column for unexpected data

**3. File Detection Modes:**
- **Directory listing** (default): Lists the input directory to find new files
- **File notification** (opt-in via `cloudFiles.useNotifications`): Event-driven (AWS SNS/SQS, Azure Event Grid with Queue Storage, GCP Pub/Sub)
- Auto Loader does not switch modes on its own; file notification is generally worth enabling for very large or high-volume directories

**4. Performance Optimizations:**
- Batch small files together
- Parallel processing
- Efficient checkpointing
**Auto Loader Syntax:**
```python
# Parentheses instead of backslashes: a comment after "\" is a syntax error
df = (spark.readStream
    .format("cloudFiles")  # Magic format!
    .option("cloudFiles.format", "json")  # Source format
    .option("cloudFiles.schemaLocation", "/path")  # Schema persistence
    .option("cloudFiles.inferColumnTypes", "true")  # Type inference
    .load("/path/to/files")
)
```

**Comparison: Standard readStream vs Auto Loader:**

| Feature | readStream.format("json") | readStream.format("cloudFiles") |
|---------|---------------------------|----------------------------------|
| File discovery | Full directory listing | Incremental listing or file notifications |
| Performance | Slow for 10k+ files | Fast for millions |
| Schema inference | On each start | Cached, incremental |
| Schema evolution | Manual | Automatic |
| Small files | Many small tasks | Optimized batching |
| Cost | Higher (scan overhead) | Lower (event-driven) |
| Setup | Simple | Simple + notifications |

**When to use Auto Loader:**
- ✅ > 1000 files in the folder
- ✅ Files arrive continuously
- ✅ You need schema evolution
- ✅ Small files (< 10MB each)
- ✅ Production pipelines (scale, reliability)

### Example 2.1: Auto Loader with Schema Inference

**Goal:** Use Auto Loader (cloudFiles) with automatic schema inference - the most practical approach.

**Approach:**
1. Use `format("cloudFiles")` instead of `format("json")`
2. Enable schema inference and evolution
3. Persist the inferred schema to a schema location
4. Test auto-discovery of new files

In [None]:
# Example 2.1 - Auto Loader with Schema Inference

TARGET_TABLE_AL = f"{BRONZE_SCHEMA}.orders_autoloader"
CHECKPOINT_PATH_AL = f"{CHECKPOINT_BASE}/autoloader"
SCHEMA_LOCATION_AL = f"{CHECKPOINT_BASE}/autoloader_schema"

print("=== Example 2.1: Auto Loader ===\n")
print(f"Target table: {TARGET_TABLE_AL}")
print(f"Checkpoint: {CHECKPOINT_PATH_AL}")
print(f"Schema location: {SCHEMA_LOCATION_AL}\n")

# Clean up the previous run (demo only)
spark.sql(f"DROP TABLE IF EXISTS {TARGET_TABLE_AL}")

# Step 1: readStream with Auto Loader (cloudFiles)
df_autoloader = (spark.readStream
    .format("cloudFiles")  # 🌟 Auto Loader magic!
    .option("cloudFiles.format", "json")  # Source format
    .option("cloudFiles.schemaLocation", SCHEMA_LOCATION_AL)  # Persist schema
    .option("cloudFiles.inferColumnTypes", "true")  # Infer types (not just STRING)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # Auto-add new columns
    .option("cloudFiles.maxFilesPerTrigger", 3)  # Throttle processing
    .load(ORDERS_STREAMING_PATH)
)

print("✓ Auto Loader reader configured")
print("  Format: cloudFiles (JSON)")
print("  Schema inference: ENABLED")
print("  Schema evolution: addNewColumns")
print("  Max files per trigger: 3\n")

# Step 2: Add metadata columns
# _metadata.file_path is the recommended replacement for input_file_name() on Unity Catalog
df_autoloader_enriched = (df_autoloader
    .withColumn("_processing_time", F.current_timestamp())
    .withColumn("_source_file", F.col("_metadata.file_path"))
)

# Step 3: writeStream to Delta
print("🚀 Starting the Auto Loader stream...\n")

query_al = (df_autoloader_enriched.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", CHECKPOINT_PATH_AL)
    .trigger(availableNow=True)  # Process all available + stop
    .toTable(TARGET_TABLE_AL)
)

# Wait for completion
query_al.awaitTermination()

print("✅ Auto Loader stream finished\n")

# Step 4: Analyze the results
count = spark.table(TARGET_TABLE_AL).count()
print("=== Results ===")
print(f"Rows loaded: {count}")

# Check the inferred schema
print("\n=== Inferred Schema ===")
spark.table(TARGET_TABLE_AL).printSchema()

# Count the distinct source files
print("\n=== Source Files ===")
source_files = spark.table(TARGET_TABLE_AL).select("_source_file").distinct().count()
print(f"Files processed: {source_files}")

# Show sample data
print("\n=== Sample data ===")
display(spark.table(TARGET_TABLE_AL).limit(10))

**Explanation:**

**cloudFiles options:**

`cloudFiles.format`:
- Source format (json, csv, parquet, avro)
- Auto Loader handles the parsing

`cloudFiles.schemaLocation`:
- Persists the inferred schema
- Acts like a checkpoint for the schema
- Enables fast restarts (no re-inference)

`cloudFiles.inferColumnTypes`:
- `true`: Detects INT, DOUBLE, DATE, etc.
- `false`: Everything as STRING for JSON/CSV (default, faster)

`cloudFiles.schemaEvolutionMode`:
- `addNewColumns`: Stream stops on a new column, adds it to the stored schema, and picks it up on restart (default when no schema is provided)
- `rescue`: Schema stays fixed; new columns land in the `_rescued_data` JSON column
- `failOnNewColumns`: Fail if the schema changes (strict)
- `none`: No evolution; new columns are ignored (default when a schema is provided)
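
To make the four evolution modes concrete, here is a small plain-Python model of what happens to one record that carries an unexpected column. It is a simplified sketch, not Auto Loader's implementation: `handle_record` and its return shape are hypothetical, and the real rescued-data column holds more than the extra fields.

```python
import json


def handle_record(schema, record, mode):
    """Model how one record is treated under a given schemaEvolutionMode.

    Returns a dict with the schema after the event, the emitted row
    (or None), and whether the stream stops. Illustration only.
    """
    new_cols = [c for c in record if c not in schema]
    row = {c: record.get(c) for c in schema}
    if not new_cols:
        return {"schema": schema, "row": row, "stream_fails": False}
    if mode == "addNewColumns":
        # Stream stops, the stored schema is extended, a restart then succeeds.
        return {"schema": schema + new_cols, "row": None, "stream_fails": True}
    if mode == "failOnNewColumns":
        # Stream stops and the schema is left untouched.
        return {"schema": schema, "row": None, "stream_fails": True}
    if mode == "rescue":
        # Schema stays fixed; the extra fields are kept as a JSON string.
        row["_rescued_data"] = json.dumps({c: record[c] for c in new_cols})
        return {"schema": schema, "row": row, "stream_fails": False}
    if mode == "none":
        # Extra fields are silently ignored.
        return {"schema": schema, "row": row, "stream_fails": False}
    raise ValueError(f"unknown mode: {mode}")


schema = ["order_id", "total_amount"]
record = {"order_id": "A1", "total_amount": 10.0, "coupon": "X"}
for mode in ("addNewColumns", "rescue", "failOnNewColumns", "none"):
    print(mode, handle_record(schema, record, mode))
```

The main trade-off the model highlights: `addNewColumns` and `failOnNewColumns` interrupt the stream, while `rescue` and `none` keep it running at the cost of a fixed schema.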

**Auto Loader File Notification:**
- Directory listing mode is the default at any scale
- File notification mode is enabled explicitly with `cloudFiles.useNotifications` and suits large, high-volume directories
- AWS: SNS + SQS queue
- Azure: Event Grid + Queue Storage
- GCP: Pub/Sub

**💡 Best Practice:** On Databricks, prefer Auto Loader over the standard readStream for file-based sources.

---

## Section 3: Trigger Modes - Controlling Stream Execution

**Theoretical introduction:**

The trigger determines **how often** a streaming query runs micro-batches. Different modes fit different use cases.

### Trigger Modes - Detailed Overview

**1. `trigger(availableNow=True)` - Batch-like Streaming** ⭐ RECOMMENDED

```python
.trigger(availableNow=True)
```

**Behavior:**
- Processes all available data
- Stops automatically when done
- Incremental (uses the checkpoint)
- Idempotent (safe to run repeatedly)

**Use Cases:**
- ✅ Scheduled jobs (hourly, daily)
- ✅ Backfilling historical data
- ✅ Cost optimization (not always-on)
- ✅ Databricks Workflows scheduled runs

**2. `trigger(once=True)` - Legacy One-Time**

```python
.trigger(once=True)
```

**Behavior:**
- Predecessor of `availableNow`
- Processes all available data in a single micro-batch
- Ignores rate limits such as `maxFilesPerTrigger`, so a large backlog becomes one very large batch

**Use Cases:**
- ❌ Deprecated, use `availableNow` instead

**3. `trigger(processingTime="X seconds")` - Always-On Streaming**

```python
.trigger(processingTime="10 seconds")
```

**Behavior:**
- Runs a micro-batch every X seconds/minutes
- Always-on (never stops on its own)
- Continuous monitoring

**Use Cases:**
- ✅ Real-time dashboards (low latency)
- ✅ Monitoring & alerting
- ✅ CDC pipelines
- ⚠️ Higher cost (always running)

**4. `trigger(continuous="X seconds")` - Ultra-Low Latency** ⚠️ Experimental

```python
.trigger(continuous="1 second")
```

**Behavior:**
- Continuous processing (not micro-batches)
- Sub-second latency
- At-least-once semantics (not exactly-once!)

**Use Cases:**
- ⚠️ Experimental - do not use in production
- Research, POCs

**5. Default (no trigger specified)**

```python
.writeStream  # No trigger specified
```

**Behavior:**
- Micro-batch ASAP (each batch starts as soon as the previous one finishes)
- Similar to `processingTime="0 seconds"`
- Always-on

### Trigger Modes - Decision Matrix

| Trigger Mode | Latency | Cost | Use Case | Production Ready |
|--------------|---------|------|----------|------------------|
| `availableNow=True` | Minutes | Low | Scheduled jobs | ✅ YES |
| `processingTime="10s"` | Seconds | High | Real-time | ✅ YES |
| `once=True` | Minutes | Low | Legacy | ⚠️ Use availableNow |
| `continuous="1s"` | Milliseconds | High | Ultra low-latency | ❌ Experimental |
| Default (none) | Seconds | High | Always-on | ⚠️ Rare |

**💡 Best Practice:**
- **Scheduled jobs**: `availableNow=True`
- **Real-time dashboards**: `processingTime="30 seconds"` or `processingTime="1 minute"`
- **Cost optimization**: Prefer `availableNow` whenever you can tolerate latency measured in minutes
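
One way to keep that choice in a single place is a small helper that returns keyword arguments for `DataStreamWriter.trigger()`. The sketch below is plain Python; the function name `trigger_kwargs` and the 300-second and 60-second cut-offs are assumptions made for illustration, not Databricks guidance.

```python
def trigger_kwargs(max_latency_seconds: int) -> dict:
    """Map an acceptable end-to-end latency to trigger() keyword arguments.

    Thresholds are hypothetical; tune them to your own cost and latency targets.
    """
    if max_latency_seconds >= 300:
        # Minutes of latency are fine: run as a scheduled, batch-like job.
        return {"availableNow": True}
    if max_latency_seconds >= 60:
        return {"processingTime": "30 seconds"}
    # Tight latency target: short micro-batch interval, always-on cluster.
    return {"processingTime": "10 seconds"}


print(trigger_kwargs(3600))  # {'availableNow': True}
print(trigger_kwargs(120))   # {'processingTime': '30 seconds'}
print(trigger_kwargs(20))    # {'processingTime': '10 seconds'}
```

With a streaming DataFrame in hand, the result would be applied as `df.writeStream.trigger(**trigger_kwargs(3600))`.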

### Example 3.1: Comparing Trigger Modes

**Goal:** Compare different trigger modes and see how they affect execution.

**Approach:**
1. Run a stream with `availableNow=True`
2. Compare it with `processingTime`
3. Check the monitoring metrics

In [None]:
# Example 3.1 - Comparing Trigger Modes

print("=== Example 3.1: Trigger Modes ===\n")

# Test 1: availableNow (batch-like)
print("📊 Test 1: trigger(availableNow=True)")
print("  Type: Batch-like streaming")
print("  Behavior: Process everything → stop\n")

TARGET_TABLE_TRIGGER1 = f"{BRONZE_SCHEMA}.orders_trigger_availablenow"
CHECKPOINT_TRIGGER1 = f"{CHECKPOINT_BASE}/trigger_availablenow"

spark.sql(f"DROP TABLE IF EXISTS {TARGET_TABLE_TRIGGER1}")

start_time = time.time()

df_stream_trigger1 = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", f"{CHECKPOINT_TRIGGER1}_schema")
    .load(ORDERS_STREAMING_PATH)
)

query_trigger1 = (df_stream_trigger1.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", CHECKPOINT_TRIGGER1)
    .trigger(availableNow=True)  # 🎯 Batch-like
    .toTable(TARGET_TABLE_TRIGGER1)
)

query_trigger1.awaitTermination()
elapsed1 = time.time() - start_time

count1 = spark.table(TARGET_TABLE_TRIGGER1).count()
print(f"✅ Finished in {elapsed1:.2f}s")
print(f"   Loaded: {count1} rows\n")

print("-" * 60 + "\n")

# Test 2: processingTime (always-on simulation)
print("📊 Test 2: trigger(processingTime='5 seconds')")
print("  Type: Always-on streaming")
print("  Behavior: Micro-batch every 5s (simulation)\n")
print("  ⚠️  Note: This would start an always-on stream!")
print("  ⚠️  It has to be stopped manually\n")

print("  Sample code (DO NOT RUN now):")
print("""
    query = (df.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/path")
        .trigger(processingTime="5 seconds")  # Every 5s
        .toTable("target")
    )
    
    # The stream runs in the background
    # Stop it with: query.stop()
""")

print("\n" + "-" * 60 + "\n")

# Summary
print("=== Comparison ===\n")

print("| Metric | availableNow | processingTime |")
print("|--------|--------------|----------------|")
print(f"| Execution time | {elapsed1:.2f}s | Infinite (always-on) |")
print(f"| Records | {count1} | Continuous |")
print("| Cost | Low (one-time) | High (always running) |")
print("| Latency | Minutes | Seconds |")
print("| Use Case | Scheduled jobs | Real-time |")

print("\n💡 Recommendation: Use availableNow for scheduled jobs (cost-effective)")
print("💡 Use processingTime only when you need real-time results (<5min latency)")

## Section 4: Watermarking & Late Data Handling

**Theoretical introduction:**

Watermarking is the mechanism for handling **late-arriving data** in streaming aggregations.

### Problem: Late Data

In real-world streaming, data does not always arrive in order:
```
Event Time: 10:00 → 10:01 → 10:02 → 10:00 (LATE!)
Arrival Time: 10:05 → 10:06 → 10:07 → 10:08
```

**Question:** How long should we wait for late data before finalizing an aggregation?

### Watermark - The Solution

**Watermark** = threshold for late data tolerance

```python
df.withWatermark("event_time", "10 minutes")
```

**Meaning:**
- Wait up to 10 minutes for late data
- Data older than the watermark is **dropped**
- Finalize an aggregation once the watermark passes its window

### Watermark Behavior

**Example:** Watermark "10 minutes"

```
Current max event_time: 12:00
Watermark: 12:00 - 10min = 11:50

Incoming event @ 11:55 → ACCEPTED (> watermark)
Incoming event @ 11:45 → DROPPED (< watermark)
```

**Watermark Movement:**
- The watermark only moves forward (monotonic)
- Based on max observed event_time
- Never moves backward
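
The same arithmetic can be replayed in plain Python. This sketch applies the simplified per-event rule from the example and updates the watermark after every event; it is an assumption-laden model (Spark advances the watermark between micro-batches, and `classify_events` is a hypothetical helper), meant only to show the monotonic behavior.

```python
from datetime import datetime, timedelta


def classify_events(event_times, delay):
    """Replay events in arrival order using the simplified watermark rule.

    Returns (decisions, final_watermark). Illustration only.
    """
    watermark = None
    decisions = []
    for ts in event_times:
        if watermark is not None and ts < watermark:
            decisions.append((ts, "DROPPED"))
            continue
        decisions.append((ts, "ACCEPTED"))
        candidate = ts - delay
        # The watermark never moves backward.
        if watermark is None or candidate > watermark:
            watermark = candidate
    return decisions, watermark


day = datetime(2024, 1, 1)
events = [day.replace(hour=12, minute=0),
          day.replace(hour=11, minute=55),
          day.replace(hour=11, minute=45)]
decisions, watermark = classify_events(events, timedelta(minutes=10))
for ts, verdict in decisions:
    print(ts.strftime("%H:%M"), verdict)
print("watermark:", watermark.strftime("%H:%M"))
```

The 11:55 event is accepted without pulling the watermark back to 11:45, which is exactly the monotonic property listed above.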

### Watermark + Windows

**Tumbling Window Example:**
```python
df.withWatermark("event_time", "10 minutes") \
  .groupBy(
      F.window("event_time", "5 minutes")
  ).count()
```

**Output:**
- Window [10:00-10:05] is finalized once the watermark passes 10:05
- Late data < watermark → dropped
- Late data > watermark → included
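
Window assignment and finalization are just timestamp arithmetic, sketched below in plain Python. The helpers `tumbling_window` and `is_finalized` are hypothetical; windows are assumed to be aligned to the Unix epoch, and the exact handling of the boundary instant is an assumption rather than a statement about Spark internals.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumbling_window(ts, size):
    """Return the (start, end) of the epoch-aligned tumbling window containing ts."""
    offset = (ts - EPOCH) % size  # time elapsed since the window started
    start = ts - offset
    return start, start + size


def is_finalized(window_end, max_event_time, delay):
    """A window is final once the watermark (max event time - delay) reaches its end."""
    return max_event_time - delay >= window_end


size = timedelta(minutes=5)
delay = timedelta(minutes=10)
start, end = tumbling_window(datetime(2024, 1, 1, 10, 3), size)
print(start.strftime("%H:%M"), end.strftime("%H:%M"))
print(is_finalized(end, datetime(2024, 1, 1, 10, 10), delay))
print(is_finalized(end, datetime(2024, 1, 1, 10, 16), delay))
```

An event at 10:03 falls into the 10:00 to 10:05 window, which stays open while the newest event is 10:10 and is final by the time an event at 10:16 has been seen.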

### Output Modes with Watermark

**1. `append` mode:** (most common)
- Emits windows only once they are finalized (watermark passed)
- Once outputted, never updated
- Best for late data tolerance

**2. `update` mode:**
- Emits updates for windows
- Can update same window multiple times
- More output, less complete

**3. `complete` mode:**
- Outputs entire result table
- Not recommended for streaming (too much data)

### When to use a Watermark?

**You need a watermark when:**
- ✅ Streaming aggregations (groupBy + window)
- ✅ Joins with event-time
- ✅ Late data tolerance required
- ✅ Need to finalize windows

**You do NOT need a watermark when:**
- ❌ No aggregations (simple append)
- ❌ No event-time (only processing-time)
- ❌ No late data concerns

### Best Practices

**1. Choosing the watermark threshold:**
- Too small (1 min): A lot of late data is dropped
- Too large (1 day): State grows, memory issues
- Sweet spot: 10-30 minutes for most use cases

**2. Event-time column:**
- Must be a TIMESTAMP
- Should represent the event creation time (not arrival)
- Should be handled with a consistent time zone

**3. Monitoring:**
- Track dropped late data metrics
- Monitor watermark lag
- Adjust threshold based on observations
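
For the monitoring points, the per-batch progress report is the usual place to look. The sketch below assumes a plain dict shaped like `query.lastProgress` in PySpark 3.x, with `stateOperators[*].numRowsDroppedByWatermark` and `eventTime.watermark`; the sample values are invented for illustration and `late_data_summary` is a hypothetical helper.

```python
def late_data_summary(progress: dict) -> dict:
    """Summarize late-data drops and the current watermark from one progress report."""
    operators = progress.get("stateOperators", [])
    dropped = sum(op.get("numRowsDroppedByWatermark", 0) for op in operators)
    watermark = progress.get("eventTime", {}).get("watermark")
    return {"dropped_rows": dropped, "watermark": watermark}


# Hypothetical progress payload; in a notebook this would be query.lastProgress.
sample_progress = {
    "eventTime": {"watermark": "2024-01-01T11:50:00.000Z"},
    "stateOperators": [
        {"numRowsTotal": 120, "numRowsDroppedByWatermark": 7},
        {"numRowsTotal": 40, "numRowsDroppedByWatermark": 0},
    ],
}
print(late_data_summary(sample_progress))
```

A steadily growing `dropped_rows` count is a hint that the threshold is too tight for the observed lateness.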

### Example 4.1: Watermarking in practice

**Goal:** Implement watermarking for a streaming aggregation over a time window.

**Approach:**
1. Parse event_time from the data
2. Add a watermark (10 minutes tolerance)
3. Window aggregation (5-minute tumbling windows)
4. Observe the finalization behavior

In [None]:
# Example 4.1 - Watermarking for Late Data

TARGET_TABLE_WM = f"{SILVER_SCHEMA}.orders_windowed_aggregates"
CHECKPOINT_WM = f"{CHECKPOINT_BASE}/watermark_agg"

print("=== Example 4.1: Watermarking & Windowing ===\n")
print(f"Target table: {TARGET_TABLE_WM}")
print(f"Checkpoint: {CHECKPOINT_WM}\n")

# Cleanup
spark.sql(f"DROP TABLE IF EXISTS {TARGET_TABLE_WM}")

# Step 1: readStream source data
df_stream_wm = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", f"{CHECKPOINT_WM}_schema")
    .load(ORDERS_STREAMING_PATH)
)

# Step 2: Parse event_time (order_date string → timestamp)
df_with_event_time = df_stream_wm.withColumn(
    "event_time",
    F.to_timestamp(F.col("order_date"), "yyyy-MM-dd HH:mm:ss")
)

print("✓ Event time column added")
print("  Format: yyyy-MM-dd HH:mm:ss → TIMESTAMP\n")

# Step 3: Apply Watermark + Window Aggregation
df_windowed = (df_with_event_time
    .withWatermark("event_time", "10 minutes")  # 🌊 Watermark!
    .groupBy(
        F.window("event_time", "5 minutes"),  # 5-min tumbling windows
        "payment_method"
    )
    .agg(
        F.count("*").alias("order_count"),
        F.sum("total_amount").alias("total_revenue"),
        F.avg("total_amount").alias("avg_order_value")
    )
    .select(
        F.col("window.start").alias("window_start"),
        F.col("window.end").alias("window_end"),
        "payment_method",
        "order_count",
        "total_revenue",
        "avg_order_value"
    )
)

print("✓ Watermark & windowing configured")
print("  Watermark: 10 minutes (late data tolerance)")
print("  Window: 5 minutes tumbling")
print("  Aggregates: COUNT, SUM, AVG per payment_method\n")

# Step 4: writeStream in append mode (finalized windows only)
print("🚀 Starting the windowed aggregation stream...\n")

query_wm = (df_windowed.writeStream
    .format("delta")
    .outputMode("append")  # Only finalized windows
    .option("checkpointLocation", CHECKPOINT_WM)
    .trigger(availableNow=True)
    .toTable(TARGET_TABLE_WM)
)

query_wm.awaitTermination()

print("✅ Windowed aggregation finished\n")

# Step 5: Analyze the results
print("=== Aggregation Results ===\n")

result_df = spark.table(TARGET_TABLE_WM).orderBy("window_start", "payment_method")
window_count = result_df.select("window_start").distinct().count()

print(f"Number of time windows: {window_count}")
print(f"Total aggregate rows: {result_df.count()}\n")

print("Sample results (per window + payment method):")
display(result_df.limit(20))

# Step 6: Visualize windows
print("\n=== Time Window Visualization ===")
windows_summary = result_df.groupBy("window_start", "window_end").agg(
    F.sum("order_count").alias("total_orders"),
    F.sum("total_revenue").alias("total_revenue")
).orderBy("window_start")

display(windows_summary)

**Watermarking explained:**

**What happened:**
1. Event time was parsed from the `order_date` string
2. The watermark was set to 10 minutes
3. 5-minute tumbling windows were applied
4. Aggregates were computed per payment_method

**Watermark Behavior:**
- Window [10:00-10:05] is finalized when max event_time > 10:15 (10:05 + 10min watermark)
- Late data > watermark → included
- Late data < watermark → dropped (not shown in this demo)

**💡 Debugging Late Data:**
```python
# Report streaming query metrics through Spark's metrics system
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")

# Rows dropped as too late are counted in numRowsDroppedByWatermark
# (query.lastProgress, under stateOperators)

# Check watermark delays in Spark UI
# Streaming tab → Query details → Watermark
```

---

## Section 5: Checkpoint Management

**Theoretical introduction:**

The checkpoint is a **critical component** of streaming pipelines - it provides fault tolerance and exactly-once semantics.

### What Is in a Checkpoint?

**The checkpoint location stores:**

**1. Offsets:**
- Which files/partitions have been processed
- In what order
- Up to which position in the stream

**2. Metadata:**
- Query configuration
- Schema information
- State information (for stateful operations)

**3. State Store** (for stateful ops):
- Aggregation state
- Join state
- Watermark state

### Checkpoint Structure

```
checkpoint_location/
├── commits/
│   ├── 0         # Batch 0 completion marker
│   ├── 1
│   └── 2
├── offsets/
│   ├── 0         # Batch 0 offsets
│   ├── 1
│   └── 2
├── sources/
│   └── 0/
│       └── [source-specific data]
├── state/
│   └── [state store files for stateful ops]
└── metadata
```

### Checkpoint Behavior

**First Run:**
- Creates checkpoint location
- Starts from the beginning of the source (or from the position set by a source option such as Kafka's `startingOffsets`)
- Writes offset after each batch

**Restart:**
- Reads last committed offset
- Resumes from that point
- No duplicate processing (exactly-once)
- No data loss
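
The restart behavior can be imitated with a toy ledger in plain Python. This is only a conceptual sketch: `run_batch` is a hypothetical function, the checkpoint is a single JSON file in a temporary directory, and the sink write and the commit are not atomic here, whereas a real stream relies on the checkpoint together with idempotent sink commits.

```python
import json
import os
import tempfile


def run_batch(source_files, checkpoint_path, sink):
    """Process only files missing from the checkpoint, then record them as done."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = set(json.load(fh))
    new_files = [f for f in source_files if f not in done]
    sink.extend(new_files)  # stand-in for writing rows to the target table
    with open(checkpoint_path, "w") as fh:
        json.dump(sorted(done | set(new_files)), fh)
    return new_files


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "checkpoint.json")
    sink = []
    first = run_batch(["a.json", "b.json"], checkpoint, sink)
    # "Restart": same checkpoint, one new file has arrived in the meantime.
    second = run_batch(["a.json", "b.json", "c.json"], checkpoint, sink)

print(first)
print(second)
print(sink)
```

The second run picks up only `c.json`, so the sink contains each file once; deleting the ledger before the second run would reprocess everything.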

**Schema Evolution:**
- Checkpoint validates schema compatibility
- Incompatible changes → fail (protection)
- Use `cloudFiles.schemaEvolutionMode` for flexibility

### Common Checkpoint Issues

**Problem 1: Incompatible schema change**
```
AnalysisException: Incompatible format/schema
```

**Solution:**
- Option A: Delete checkpoint (reprocess all data)
- Option B: Use new checkpoint location
- Option C: Use schema evolution features

**Problem 2: Checkpoint corruption**
```
StreamingQueryException: Unable to read offsets
```

**Solution:**
- Backup important checkpoints
- Delete corrupted checkpoint
- Reprocess from beginning

**Problem 3: Checkpoint location full**

**Solution:**
- Monitor checkpoint size
- Clean old state (automatic with retention)
- Use appropriate checkpoint location (not /tmp)

### Checkpoint Best Practices

**1. Location Choice:**
```python
# ❌ BAD - ephemeral, may disappear
.option("checkpointLocation", "/tmp/checkpoint")

# ✅ GOOD - persistent storage
.option("checkpointLocation", "s3://bucket/checkpoints/job1")
# Spark APIs take dbfs:/ paths; the /dbfs/... form is for local file APIs
.option("checkpointLocation", "dbfs:/mnt/storage/checkpoints/job1")
```

**2. One Checkpoint Per Query:**
```python
# ❌ BAD - reusing checkpoint
query1.option("checkpointLocation", "/path/shared")
query2.option("checkpointLocation", "/path/shared")  # ERROR!

# ✅ GOOD - unique per query
query1.option("checkpointLocation", "/path/query1")
query2.option("checkpointLocation", "/path/query2")
```

**3. Checkpoint Cleanup:**
```python
# Delete the checkpoint for a fresh start
dbutils.fs.rm("/path/to/checkpoint", recurse=True)

# Only for development/testing!
# In production: keep the checkpoint for incremental processing
```

**4. Monitoring:**
```python
# Check checkpoint size
dbutils.fs.ls("/path/to/checkpoint")

# Monitor in Spark UI
# Streaming tab → Active Streams → Query Details
```

**5. Backup Critical Checkpoints:**
```bash
# Before major changes
aws s3 sync s3://bucket/checkpoint s3://bucket/checkpoint_backup
```

### Checkpoint vs Schema Location

**Auto Loader has TWO persistence locations:**

```python
.option("checkpointLocation", "/path/checkpoint")      # Stream offsets
.option("cloudFiles.schemaLocation", "/path/schema")   # Inferred schema
```

**Both are needed:**
- `checkpointLocation`: which files have been processed
- `schemaLocation`: which schema to use

**Lifecycle:**
- Delete both → reprocess all + re-infer schema
- Delete checkpoint only → reprocess all + use existing schema
- Delete schema only → keep offsets + re-infer schema
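
Because each query needs its own pair of locations, it can help to derive both from a single base path and to check for accidental sharing before starting anything. The following is a minimal sketch under that assumption. `stream_locations`, `shared_checkpoints`, and the `s3://bucket/streams` base path are illustrative names, not part of Auto Loader.

```python
from collections import Counter


def stream_locations(base: str, query_name: str) -> dict:
    """Derive a dedicated checkpoint and schema location for one query."""
    root = f"{base.rstrip('/')}/{query_name}"
    return {
        "checkpointLocation": f"{root}/checkpoint",
        "cloudFiles.schemaLocation": f"{root}/schema",
    }


def shared_checkpoints(configs: dict) -> list:
    """Return checkpoint paths that are used by more than one query."""
    counts = Counter(c["checkpointLocation"] for c in configs.values())
    return sorted(path for path, n in counts.items() if n > 1)


configs = {
    "orders": stream_locations("s3://bucket/streams", "orders"),
    "customers": stream_locations("s3://bucket/streams/", "customers"),
}
print(configs["orders"])
print(shared_checkpoints(configs))  # any path listed here is shared by mistake
```

The returned dictionaries hold option names and values, so they can be applied to the reader and writer one `.option(key, value)` call at a time.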

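### Example 5.1: Checkpoint Management & Recovery

**Goal:** Demonstrate checkpoint behavior: a restart does not duplicate data.

**Approach:**
1. Run the stream with a checkpoint until all available files are loaded
2. Restart the same query against the same checkpoint
3. Observe incremental processing: only files not yet recorded in the checkpoint are read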

In [None]:
# Example 5.1 - Checkpoint Management & Recovery

TARGET_TABLE_CP = f"{BRONZE_SCHEMA}.orders_checkpoint_demo"
CHECKPOINT_CP = f"{CHECKPOINT_BASE}/checkpoint_demo"

print("=== Example 5.1: Checkpoint & Recovery ===\n")

# Cleanup for the demo
spark.sql(f"DROP TABLE IF EXISTS {TARGET_TABLE_CP}")
try:
    dbutils.fs.rm(CHECKPOINT_CP, True)
    print("✓ Removed previous checkpoint\n")
except Exception:
    pass

# Run 1: initial stream. availableNow processes every file that is currently
# available, split into micro-batches of at most 3 files each.
print("📊 RUN 1: Initial streaming query\n")

df_stream_cp = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", f"{CHECKPOINT_CP}_schema")
    .option("cloudFiles.maxFilesPerTrigger", 3)  # At most 3 files per micro-batch
    .load(ORDERS_STREAMING_PATH)
    # Record the source file path so it can be verified after the run
    .withColumn("source_file", F.col("_metadata.file_path"))
)

query_cp1 = (df_stream_cp.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", CHECKPOINT_CP)
    .trigger(availableNow=True)
    .toTable(TARGET_TABLE_CP)
)

query_cp1.awaitTermination()

count_run1 = spark.table(TARGET_TABLE_CP).count()
print("✅ RUN 1 finished")
print(f"   Loaded: {count_run1} records\n")

# Inspect the checkpoint content
print("=== Checkpoint Structure ===")
checkpoint_files = dbutils.fs.ls(CHECKPOINT_CP)
print("Folders in checkpoint:")
for f in checkpoint_files:
    print(f"  - {f.name}")

print("\n" + "-" * 60 + "\n")

# Run 2: restart with the same checkpoint (incremental)
print("📊 RUN 2: Restart with existing checkpoint\n")
print("  Checkpoint exists - the stream resumes from the last offset")
print("  Only NEW files will be processed (no duplicates)\n")

df_stream_cp2 = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", f"{CHECKPOINT_CP}_schema")
    .option("cloudFiles.maxFilesPerTrigger", 3)
    .load(ORDERS_STREAMING_PATH)
    .withColumn("source_file", F.col("_metadata.file_path"))
)

query_cp2 = (df_stream_cp2.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", CHECKPOINT_CP)  # Same checkpoint!
    .trigger(availableNow=True)
    .toTable(TARGET_TABLE_CP)
)

query_cp2.awaitTermination()

count_run2 = spark.table(TARGET_TABLE_CP).count()
new_records = count_run2 - count_run1

print("✅ RUN 2 finished")
print(f"   Total records: {count_run2}")
print(f"   New records: {new_records} (0 unless new files arrived between runs)")
print(f"   Previous records: {count_run1}\n")

# Verify: count the distinct source files recorded at ingest time
source_files = (spark.table(TARGET_TABLE_CP)
    .select("source_file")
    .distinct()
    .count())

print("=== Verification ===")
print(f"Unique source files processed: {source_files}")
print(f"Total records: {count_run2}")
print("\n💡 The checkpoint provided:")
print("   ✅ No duplicates (exactly-once semantics)")
print("   ✅ Incremental processing")
print("   ✅ Resume from last offset")

## Section 6: Schema Evolution in Streaming

**Theoretical introduction:**

In production streaming pipelines, **schema changes** are inevitable:
- New columns in source data
- Changed data types
- Removed columns

Auto Loader offers several strategies for handling schema evolution.

### Schema Evolution Modes

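**1. `addNewColumns` (RECOMMENDED; default when no schema is provided)**
```python
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
```

**Behavior:**
- New columns → the stream stops with an `UnknownFieldException`, the new columns are added to the stored schema, and the next restart picks them up
- Existing columns → unchanged (data types are not evolved)
- Deleted columns → NULL in new data

**Use Case:** Production pipelines with a flexible schema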

**2. `rescue`**
```python
.option("cloudFiles.schemaEvolutionMode", "rescue")
.option("rescuedDataColumn", "_rescued_data")
```

**Behavior:**
- New/unexpected columns → stored in `_rescued_data` (JSON)
- Table schema → unchanged
- Manual inspection & processing later

**Use Case:** Strict schema enforcement + monitoring

**3. `failOnNewColumns`**
```python
.option("cloudFiles.schemaEvolutionMode", "failOnNewColumns")
```

**Behavior:**
- New columns → the streaming query FAILS
- Forces manual intervention
- No automatic changes

**Use Case:** Critical pipelines, strict governance

**4. `none` (default only when an explicit schema is provided)**
```python
.option("cloudFiles.schemaEvolutionMode", "none")
```

**Behavior:**
- Schema stays fixed; the stream does not fail on new columns
- New columns → ignored, and not rescued unless `rescuedDataColumn` is set
- Can cause silent data loss

**Use Case:** ❌ Avoid (dangerous)

### Schema Evolution Decision Matrix

| Mode | New Columns | Type Changes | Stream Fails? | Use Case |
|------|-------------|--------------|---------------|----------|
| `addNewColumns` | Added after restart | Not evolved | Once per new column, resumes on restart | Production (flexible) |
| `rescue` | To `_rescued_data` (JSON) | Not evolved | No | Strict + monitor |
| `failOnNewColumns` | Fail | Not evolved | Yes, until the schema is updated | Critical |
| `none` | Ignored | Not evolved | No | ❌ Avoid |

### mergeSchema dla Delta

Besides Auto Loader evolution, Delta Lake has its own `mergeSchema`:

```python
(df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")  # Delta schema evolution
    .saveAsTable("table"))
```

**Combination:**
- Auto Loader evolution: Source files → DataFrame
- Delta mergeSchema: DataFrame → Delta table

**Best Practice:** Use both for full pipeline evolution

### Monitoring Schema Changes

```python
from pyspark.sql.functions import col

# List the schema location to see the stored schema versions
schema_files = dbutils.fs.ls("/path/to/schemaLocation")

# Check Delta table history (schema changes appear as operations)
spark.sql("DESCRIBE HISTORY table_name")

# Monitor _rescued_data column (if using rescue mode)
spark.table("table").filter(col("_rescued_data").isNotNull()).count()
```
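
To see what actually changed between two schema versions, a plain comparison of column names and types is often enough. The sketch below assumes each schema is available as a `{column_name: type_name}` mapping (in a notebook, `dict(df.dtypes)` produces one). `schema_diff` and the sample columns are hypothetical.

```python
def schema_diff(old: dict, new: dict) -> dict:
    """Compare two {column_name: type_name} mappings."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    retyped = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return {"added": added, "removed": removed, "retyped": retyped}


# Sample schemas (assumption: made up for illustration).
before = {"order_id": "string", "amount": "double", "status": "string"}
after = {"order_id": "string", "amount": "string", "channel": "string"}

diff = schema_diff(before, after)
print(diff)
```

Columns under `added` are the ones `addNewColumns` would pick up, while entries under `retyped` are the changes that no evolution mode handles automatically and that deserve a manual look.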

---

## Section 7: Best Practices - Streaming Production Pipelines

### 1. Always use Auto Loader (cloudFiles) for file sources

```python
# ❌ DON'T - standard readStream
spark.readStream.format("json").load("/path")

# ✅ DO - Auto Loader
spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "json") \
  .load("/path")
```

**Why:** Scalability, performance, schema evolution

### 2. Explicit Checkpoint Locations

```python
# ❌ DON'T - ephemeral location
.option("checkpointLocation", "/tmp/checkpoint")

# ✅ DO - persistent storage
.option("checkpointLocation", "s3://bucket/checkpoints/pipeline_v1")
```

**Why:** Fault tolerance, no data loss

### 3. Use availableNow for Scheduled Jobs

```python
# ❌ DON'T - always-on unless necessary
.trigger(processingTime="10 seconds")

# ✅ DO - batch-like for scheduled runs
.trigger(availableNow=True)
```

**Why:** Cost optimization (pay only when running)

### 4. Schema Evolution Strategy

```python
# ✅ DO - flexible schema evolution
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
.option("cloudFiles.schemaLocation", "/path/schema")

# Combined with Delta mergeSchema
.option("mergeSchema", "true")
```

**Why:** Handle schema changes gracefully

### 5. Throttle Processing (maxFilesPerTrigger)

```python
# ❌ DON'T - rely on the default (1000 files per micro-batch) without checking it fits the cluster
spark.readStream.format("cloudFiles").load("/path")

# ✅ DO - set the limit explicitly and tune it for stability
.option("cloudFiles.maxFilesPerTrigger", 1000)
```

**Why:** Prevent cluster overload, stable processing

### 6. Monitoring & Alerting

```python
# Enable metrics
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")

# Log query progress
query.lastProgress  # JSON with metrics

# Monitor in Spark UI
# Streaming tab → Active/Completed Queries
```

**Metrics to monitor:**
- `inputRowsPerSecond`: Incoming rate
- `processedRowsPerSecond`: Processing rate
- `batchDuration`: Time per micro-batch
- `numInputRows`: Rows in batch
- `stateOperators[].memoryUsedBytes`: State size (for stateful ops)
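
These metrics can also be checked in code. Below is a small sketch that takes one progress report, shaped like the dictionary returned by `query.lastProgress` in PySpark, and flags two common warning signs. `throughput_warnings`, the 60-second threshold, and the sample values are assumptions chosen for illustration.

```python
def throughput_warnings(progress: dict, max_batch_ms: int = 60_000) -> list:
    """Flag signs of a stream falling behind, based on one progress report."""
    warnings = []
    rate_in = progress.get("inputRowsPerSecond") or 0.0
    rate_out = progress.get("processedRowsPerSecond") or 0.0
    if rate_in > rate_out:
        warnings.append("input rate exceeds processing rate")
    batch_ms = progress.get("durationMs", {}).get("triggerExecution", 0)
    if batch_ms > max_batch_ms:
        warnings.append("micro-batch took longer than the threshold")
    return warnings


# Sample report (assumption: values invented to trigger both warnings).
sample = {
    "batchId": 12,
    "numInputRows": 5000,
    "inputRowsPerSecond": 900.0,
    "processedRowsPerSecond": 400.0,
    "durationMs": {"triggerExecution": 75000},
}
print(throughput_warnings(sample))
```

A single slow batch is rarely a problem; the warnings matter when they persist across many consecutive progress reports.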

### 7. Error Handling

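```python
# ❌ DON'T - start the stream and never check on it
writer.start()

# ✅ DO - wait on the query and surface failures
query = writer.start()  # writer: a configured df.writeStream
try:
    query.awaitTermination()
except Exception as e:
    # Log error, send alert
    print(f"Stream failed: {e}")
    raise  # let the job fail so the scheduler can alert or retry
```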

### 8. Idempotency & Reprocessing

```python
# Design for idempotency
# - Checkpoint enables exactly-once
# - Delta MERGE for upserts
# - Unique keys for deduplication

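# A full reprocess is safe only when the sink is idempotent (MERGE/dedup)
# or the target table is reset first; a plain append would duplicate rows:
dbutils.fs.rm(checkpoint_path, True)  # Fresh start
query = writer.start()  # Reprocess all data (writer: a configured df.writeStream)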
```

### 9. Partitioning Strategy

```python
# Partition Bronze by date for efficient queries
.partitionBy("_processing_date")

# Don't over-partition (< 1GB per partition)
# Don't under-partition (> 10GB per partition)
```

### 10. Testing & Validation

```python
# Test with small dataset first
.option("cloudFiles.maxFilesPerTrigger", 1)

# Validate row counts
source_count = spark.read.format("json").load(source_path).count()  # batch read of the same source
target_count = spark.table("target").count()
assert source_count == target_count

# Check for duplicates
duplicates = spark.table("target") \
  .groupBy("id").count() \
  .filter("count > 1").count()
assert duplicates == 0
```

### Quick Reference Card

**Production Streaming Checklist:**

```python
df = (spark.readStream
    .format("cloudFiles")  # ✅ Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/schema")  # ✅ Schema persistence
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # ✅ Evolution
    .option("cloudFiles.maxFilesPerTrigger", 1000)  # ✅ Throttle
    .option("cloudFiles.inferColumnTypes", "true")  # ✅ Type inference
    .load("/data")
)

query = (df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoint")  # ✅ Persistent
    .option("mergeSchema", "true")  # ✅ Delta evolution
    .trigger(availableNow=True)  # ✅ Cost-effective
    .toTable("target")
)
```

**Monitoring:**
- Spark UI → Streaming tab
- CloudWatch/Log Analytics metrics
- Alert on failures
- Track watermark lag (stateful ops)

## Section 8: Troubleshooting Streaming Queries

### Problem 1: Stream is slow (low throughput)

**Symptoms:**
- `processedRowsPerSecond` << `inputRowsPerSecond`
- Batch duration > 1 minute
- Growing input backlog

**Causes & Solutions:**

**1. Too many small files**
```python
# Solution: Increase maxFilesPerTrigger
.option("cloudFiles.maxFilesPerTrigger", 1000)  # was: 10
```

**2. Insufficient cluster resources**
```python
# Solution: Scale up cluster
# Add workers or increase executor memory
```

**3. Expensive transformations**
```python
# Solution: Optimize queries
# - Avoid UDFs
# - Use built-in functions
# - Cache intermediate results (with care in streaming!)
```

**4. Small micro-batches**
```python
# Solution: Increase batch size
.option("cloudFiles.maxBytesPerTrigger", "10g")  # was: 1g
```

### Problem 2: Checkpoint incompatible schema

**Symptoms:**
```
StreamingQueryException: Incompatible checkpoint schema
```

**Causes:**
- Schema changed drastically (type changes)
- Checkpoint reused by a different query

**Solutions:**

**Option 1: Delete checkpoint (reprocess all)**
```python
dbutils.fs.rm(checkpoint_path, True)
# Start query → reprocesses from beginning
```

**Option 2: New checkpoint location**
```python
# Use different checkpoint
.option("checkpointLocation", "/checkpoint_v2")
```

**Option 3: Schema evolution mode**
```python
# Enable flexible schema
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
```

### Problem 3: OOM (Out of Memory) in Streaming

**Symptoms:**
```
OutOfMemoryError: Java heap space
Container killed: exceeding memory limits
```

**Causes & Solutions:**

**1. Large state (stateful aggregations)**
```python
# Solution: Add watermark to limit state
.withWatermark("event_time", "1 hour")  # Trim old state
```

**2. Too many partitions in memory**
```python
# Solution: Reduce shuffle partitions
spark.conf.set("spark.sql.shuffle.partitions", 200)  # was: 2000
```

**3. Large micro-batches**
```python
# Solution: Throttle input
.option("cloudFiles.maxBytesPerTrigger", "1g")
.option("cloudFiles.maxFilesPerTrigger", 100)
```
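
To build intuition for how the `withWatermark` call under cause 1 bounds state, the following pure-Python sketch replays a few micro-batches of event times. It is a simplified model, not Spark's implementation: it assumes the watermark for a batch equals the largest event time seen in earlier batches minus the delay, and that any event older than the watermark is dropped. `split_by_watermark` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def split_by_watermark(batches, delay):
    """Replay batches of event times; return (accepted, dropped) lists."""
    max_seen = None
    accepted, dropped = [], []
    for batch in batches:
        if not batch:
            continue
        # The watermark for this batch comes from event times seen in earlier batches.
        watermark = max_seen - delay if max_seen is not None else None
        for ts in batch:
            if watermark is not None and ts < watermark:
                dropped.append(ts)
            else:
                accepted.append(ts)
        newest = max(batch)
        max_seen = newest if max_seen is None else max(max_seen, newest)
    return accepted, dropped


t = datetime(2024, 1, 1, 12, 0)
batches = [
    [t, t + timedelta(minutes=30)],
    [t + timedelta(hours=2)],
    [t + timedelta(minutes=45), t + timedelta(hours=2, minutes=5)],
]
accepted, dropped = split_by_watermark(batches, timedelta(hours=1))
print(f"accepted={len(accepted)} dropped={len(dropped)}")
```

In this run the 12:45 event arrives after the stream has already seen 14:00, so with a one-hour delay it falls behind the watermark. A longer delay keeps more late data but also keeps more state in memory, which is the trade-off behind this OOM cause.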

### Problem 4: Duplicate data in output

**Symptoms:**
- Duplicate records in the Delta table
- Primary key violations

**Causes & Solutions:**

**1. No checkpoint location**
```python
# ❌ Missing checkpoint
.writeStream.toTable("target")  # NO checkpoint!

# ✅ Add checkpoint
.option("checkpointLocation", "/checkpoint")
```

**2. Checkpoint deleted between runs**
```python
# Don't delete checkpoint unless intended reprocessing
# Checkpoint = exactly-once semantics
```

**3. Multiple streams writing to same table**
```python
# Solution: Use MERGE instead of append
# Or ensure unique checkpoint per query
```

### Problem 5: Stream stuck (no progress)

**Symptoms:**
- No new batches processed
- `numInputRows = 0` consistently
- Stream "running" but idle

**Causes & Solutions:**

**1. No new files**
```python
# Check source location
dbutils.fs.ls("/source/path")

# Verify file notification (Auto Loader)
# Check SQS/EventGrid configuration
```

**2. Watermark blocking output**
```python
# Stateful query may wait for watermark
# Check watermark progress in Spark UI
# May need to adjust watermark threshold
```

**3. Trigger condition not met**
```python
# processingTime trigger may wait for next interval
# Solution: Use availableNow for immediate processing
```
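
The "running but idle" symptom can be detected from the progress history. The sketch below counts how many of the most recent batches read no rows. It assumes a list of dictionaries shaped like `query.recentProgress` in PySpark; `idle_streak` and the sample history are hypothetical.

```python
def idle_streak(recent_progress: list) -> int:
    """Count how many of the most recent batches read zero input rows."""
    streak = 0
    for progress in reversed(recent_progress):
        if progress.get("numInputRows", 0) > 0:
            break
        streak += 1
    return streak


# Sample history (assumption: one active batch followed by three empty ones).
history = [
    {"batchId": 7, "numInputRows": 1200},
    {"batchId": 8, "numInputRows": 0},
    {"batchId": 9, "numInputRows": 0},
    {"batchId": 10, "numInputRows": 0},
]
print(f"Idle batches in a row: {idle_streak(history)}")
```

What counts as too long depends on how often the source is expected to deliver files, so the alert threshold is best chosen per pipeline.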

### Problem 6: Schema mismatch errors

**Symptoms:**
```
AnalysisException: Cannot resolve column 'xyz'
Schema mismatch: expected INT, found STRING
```

**Solutions:**

**1. Use explicit schema**
```python
# Instead of inference
.schema(explicit_schema)
```

**2. Enable schema evolution**
```python
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
.option("mergeSchema", "true")
```

**3. CAST problematic columns**
```python
from pyspark.sql.functions import col

df.withColumn("amount", col("amount").cast("decimal(10,2)"))
```

### Debugging Checklist

When stream fails, check in order:

**1. Spark UI → Streaming Tab**
- Query status (Active/Failed)
- Last error message
- Batch statistics

**2. Check Logs**
```python
# Driver logs
# Databricks: Clusters → Driver Logs

# Query progress
query.lastProgress

# Query status
query.status
```

**3. Checkpoint Location**
```python
# Verify checkpoint exists & accessible
dbutils.fs.ls(checkpoint_path)

# Check checkpoint size (may indicate issues)
```

**4. Source Data**
```python
# Verify files exist
dbutils.fs.ls(source_path)

# Sample read (non-streaming)
spark.read.format("json").load(source_path).show()
```

**5. Cluster Resources**
- CPU utilization
- Memory usage
- Disk space

### Monitoring Commands

```python
# 1. Query Status
print(query.status)

# 2. Last Progress (metrics)
import json
print(json.dumps(query.lastProgress, indent=2))

# 3. Recent Progress
for progress in query.recentProgress:
    print(f"Batch {progress['batchId']}: {progress['numInputRows']} rows")

# 4. Exception (if failed)
if query.exception():
    print(query.exception())

# 5. Checkpoint contents
dbutils.fs.ls(checkpoint_path)
```

---

## Section 9: Summary & Next Steps

### What this notebook covered:

✅ **1. Structured Streaming Fundamentals**
- readStream & writeStream API
- Output modes (append, update, complete)
- Micro-batch architecture
- Exactly-once semantics

✅ **2. Auto Loader (cloudFiles) Deep Dive**
- cloudFiles format vs standard readStream
- Automatic file discovery & notification
- Schema inference & caching
- Performance optimizations for millions of files

✅ **3. Trigger Modes**
- `availableNow=True` for batch-like runs (RECOMMENDED)
- `processingTime` for always-on real-time
- `once` (legacy)
- Cost vs latency trade-offs

✅ **4. Watermarking & Late Data**
- Event-time processing
- Watermark threshold configuration
- Window aggregations
- Late data handling strategies

✅ **5. Checkpoint Management**
- Fault tolerance mechanism
- Exactly-once delivery guarantees
- Checkpoint structure & lifecycle
- Recovery patterns

✅ **6. Schema Evolution**
- `addNewColumns` mode (flexible)
- `rescue` mode (strict + monitoring)
- `failOnNewColumns` (critical systems)
- mergeSchema integration with Delta

✅ **7. Best Practices**
- Production-ready patterns
- Monitoring & alerting
- Performance optimization
- Error handling

✅ **8. Troubleshooting**
- Common issues & solutions
- Debugging workflow
- Performance tuning
- Monitoring commands

### Key Takeaways:

💡 **1. Always Use Auto Loader (cloudFiles)**
```
Auto Loader > standard readStream:
- Scalability (millions of files)
- Schema evolution (automatic)
- Performance (file notifications)
- Lower cost (event-driven)
```

💡 **2. availableNow for Scheduled Jobs**
```
Cost optimization:
- availableNow: batch-like, pay per run
- processingTime: always-on, continuous cost
- Most scheduled ingestion jobs: availableNow is sufficient
```

💡 **3. Checkpoint = Fault Tolerance**
```
Always specify checkpoint:
- Exactly-once semantics
- Incremental processing
- Restart without duplicates
- NO checkpoint = NO guarantees
```

💡 **4. Schema Evolution Strategy**
```
Production pipelines need:
- cloudFiles.schemaEvolutionMode: addNewColumns
- cloudFiles.schemaLocation: persist schema
- mergeSchema: true (Delta compatibility)
```

💡 **5. Watermark for Stateful Operations**
```
Aggregations require:
- withWatermark() for late data
- Threshold: 10-30min typical
- Monitor dropped data
- Balance latency vs completeness
```

### Decision Tree - Streaming Setup

```
I need a streaming pipeline...

├─ Source type?
│  ├─ Files (S3/ADLS/GCS)
│  │  └─ Use: readStream.format("cloudFiles")
│  │     Options: schemaEvolutionMode, maxFilesPerTrigger
│  │
│  └─ Kafka/EventHub
│     └─ Use: readStream.format("kafka")
│
├─ Execution frequency?
│  ├─ Scheduled (hourly/daily)
│  │  └─ Use: trigger(availableNow=True)
│  │     Lower cost, batch-like
│  │
│  └─ Real-time (< 5min latency)
│     └─ Use: trigger(processingTime="30 seconds")
│        Always-on, higher cost
│
├─ Schema changes expected?
│  ├─ YES (flexible)
│  │  └─ schemaEvolutionMode: addNewColumns
│  │
│  └─ NO (strict)
│     └─ schemaEvolutionMode: failOnNewColumns
│
└─ Aggregations needed?
   ├─ YES
   │  └─ Add watermark + window
   │     Handle late data
   │
   └─ NO
      └─ Simple append mode
         No watermark needed
```
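
The same decisions can be written down as a small function, which is handy when several pipelines must follow one convention. This is only a sketch of the tree above: `plan_stream` and its argument names are hypothetical, and the returned dictionary is an outline to read from, not something Spark accepts directly.

```python
def plan_stream(source: str, realtime: bool, schema_changes: bool,
                aggregations: bool) -> dict:
    """Translate the four decision-tree answers into a configuration outline."""
    plan = {}
    if source == "files":
        plan["format"] = "cloudFiles"
        plan["cloudFiles.schemaEvolutionMode"] = (
            "addNewColumns" if schema_changes else "failOnNewColumns")
    elif source == "kafka":
        plan["format"] = "kafka"
    else:
        raise ValueError(f"unsupported source: {source}")
    # Scheduled jobs use availableNow; low-latency jobs stay always-on.
    plan["trigger"] = ({"processingTime": "30 seconds"} if realtime
                       else {"availableNow": True})
    # Aggregations need a watermark plus a window; plain appends do not.
    plan["needs_watermark"] = aggregations
    return plan


print(plan_stream("files", realtime=False, schema_changes=True, aggregations=False))
```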

### Comparison Matrix - Final Reference

| Feature | Batch (COPY INTO) | Streaming (Auto Loader) |
|---------|-------------------|-------------------------|
| **Latency** | Hours | Seconds-Minutes |
| **Cost** | Low | Medium-High |
| **Complexity** | Low | Medium |
| **Use Case** | Daily ETL | Real-time CDC |
| **File Discovery** | Manual | Automatic |
| **Schema Evolution** | Manual | Automatic |
| **Exactly-Once** | Built-in (file tracking) | Built-in (checkpoint) |
| **Late Data** | N/A | Watermark |
| **Stateful Ops** | No | Yes (aggregations) |
| **Best For** | Large files, scheduled | Small files, continuous |

### Next Steps in the Training:

**📚 Next Notebook:**
- **04_bronze_silver_gold_pipeline.ipynb**
  - Medallion Architecture implementation
  - Multi-hop transformations
  - Data quality checks
  - Complete streaming pipeline

**🛠️ Hands-on Workshop:**
- **02_ingestion_pipeline_workshop.ipynb**
  - Build end-to-end streaming pipeline
  - Handle schema changes
  - Implement monitoring
  - Production deployment patterns

**📖 Additional Materials:**
- Databricks Auto Loader documentation
- Structured Streaming Programming Guide
- Delta Lake Streaming integration

### Homework (Optional):

**Task:** Build a production-ready streaming pipeline

**Requirements:**
1. ✅ Use Auto Loader (cloudFiles)
2. ✅ Enable schema evolution (addNewColumns)
3. ✅ Checkpoint configuration
4. ✅ Trigger: availableNow for cost optimization
5. ✅ Add watermark (if aggregations)
6. ✅ Monitoring metrics logging
7. ✅ Error handling & alerting
8. ✅ Unity Catalog integration

**Bonus:**
- Schedule as a Databricks Workflow (hourly)
- Add data quality checks
- Implement Bronze → Silver transformation
- Dashboard in PowerBI/Tableau

### Production Deployment Checklist:

Before deploying streaming pipeline:

- [ ] Checkpoint location persistent (not /tmp)
- [ ] Schema evolution mode configured
- [ ] maxFilesPerTrigger set (throttling)
- [ ] Monitoring enabled (metrics, logs)
- [ ] Alerting configured (failures)
- [ ] Resource sizing validated (cluster)
- [ ] Testing completed (small dataset)
- [ ] Documentation updated (runbook)
- [ ] Backup strategy (checkpoint, code)
- [ ] Rollback plan (if deployment fails)
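
Some of these checks can be automated before a deployment. The sketch below inspects a dictionary of stream options and reports the items from the checklist that can be verified mechanically. `preflight_issues` and the sample options are hypothetical, and the function deliberately covers only three of the ten items.

```python
def preflight_issues(options: dict) -> list:
    """Return checklist violations that can be detected from stream options."""
    issues = []
    checkpoint = options.get("checkpointLocation", "")
    if not checkpoint:
        issues.append("checkpointLocation is missing")
    elif checkpoint.startswith("/tmp"):
        issues.append("checkpointLocation points at /tmp")
    if "cloudFiles.schemaEvolutionMode" not in options:
        issues.append("schema evolution mode is not set explicitly")
    if "cloudFiles.maxFilesPerTrigger" not in options:
        issues.append("maxFilesPerTrigger is not set explicitly")
    return issues


# Sample options (assumption: a draft configuration with two gaps).
draft = {
    "checkpointLocation": "/tmp/checkpoint",
    "cloudFiles.schemaEvolutionMode": "addNewColumns",
}
for issue in preflight_issues(draft):
    print(f"- {issue}")
```

The remaining items, such as alerting, runbooks, and rollback plans, still need a human review.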

---

**Congratulations!** 🎉
You have completed the Streaming Data Ingestion notebook.
You are ready to build production-grade real-time data pipelines on Delta Lake!

## Section 10: Resource Cleanup

**Note:** This section is optional. Run it only if you want to remove all data created during the notebook.

In a training environment we usually want to **keep** the data for the following notebooks.

### Option 1: Check created resources (recommended)

Leave tables and checkpoints in place for the following notebooks:
- `04_bronze_silver_gold_pipeline.ipynb` will use this data
- The hands-on workshops will use the streaming tables
- Checkpoints are needed for incremental processing

In [None]:
# Option 1: Check created resources (without deleting)

print("=== Streaming tables created in this notebook ===\n")

streaming_tables = [
    f"{BRONZE_SCHEMA}.orders_streaming_basic",
    f"{BRONZE_SCHEMA}.orders_autoloader",
    f"{BRONZE_SCHEMA}.orders_trigger_availablenow",
    f"{BRONZE_SCHEMA}.orders_checkpoint_demo",
    f"{SILVER_SCHEMA}.orders_windowed_aggregates"
]

total_records = 0
total_size_bytes = 0

for table in streaming_tables:
    full_table = f"{CATALOG}.{table}"
    try:
        if spark.catalog.tableExists(full_table):
            count = spark.table(full_table).count()
            total_records += count

            # Get the table size
            detail = spark.sql(f"DESCRIBE DETAIL {full_table}").collect()[0]
            size_bytes = detail['sizeInBytes']
            size_mb = size_bytes / (1024 * 1024)
            total_size_bytes += size_bytes

            print(f"✅ {table}")
            print(f"   Records: {count:,}")
            print(f"   Size: {size_mb:.2f} MB")
            print()
        else:
            print(f"⚠️  {table} - does not exist")
            print()
    except Exception as e:
        print(f"⚠️  {table} - error: {str(e)}")
        print()


def dir_size(path):
    """Sum file sizes under path, descending into subfolders."""
    return sum(dir_size(f.path) if f.isDir() else f.size
               for f in dbutils.fs.ls(path))


# Check checkpoints
print("=== Checkpoints ===\n")
try:
    checkpoint_dirs = dbutils.fs.ls(CHECKPOINT_BASE)
    print(f"Checkpoints in {CHECKPOINT_BASE}:")
    for cp in checkpoint_dirs:
        size_mb = dir_size(cp.path) / (1024 * 1024)
        print(f"  - {cp.name}: {size_mb:.2f} MB")
except Exception as e:
    print(f"⚠️  Checkpoints: {e}")

print(f"\n{'='*60}")
print(f"Total records: {total_records:,}")
print(f"Total table size: {total_size_bytes / (1024 * 1024):.2f} MB")
print(f"{'='*60}")

print("\n💡 Data is kept for the following notebooks")
print("💡 Checkpoints enable incremental processing")
print("💡 To delete everything, run the cell below (Option 2)")

### Option 2: Remove all resources (only if you really want to)

**WARNING:** This removes all tables, checkpoints, and schema locations created in this notebook!

Run the cell below only if:
- You have finished the training and want to clean up
- You want to start over (fresh start)
- You are testing the notebook and need a clean slate

In [None]:
# Option 2: Remove all streaming resources (ONLY IF YOU ARE SURE!)

# ⚠️  WARNING: Uncomment the code below only if you want to delete everything!

"""
print("=== 🗑️  REMOVING STREAMING RESOURCES ===\n")
print("⚠️  This will delete all tables and checkpoints!\n")

# Tables to drop
streaming_tables = [
    f"{BRONZE_SCHEMA}.orders_streaming_basic",
    f"{BRONZE_SCHEMA}.orders_autoloader",
    f"{BRONZE_SCHEMA}.orders_trigger_availablenow",
    f"{BRONZE_SCHEMA}.orders_checkpoint_demo",
    f"{SILVER_SCHEMA}.orders_windowed_aggregates"
]

# Drop tables
print("Dropping tables...")
for table in streaming_tables:
    full_table = f"{CATALOG}.{table}"
    try:
        spark.sql(f"DROP TABLE IF EXISTS {full_table}")
        print(f"  ✓ Dropped: {table}")
    except Exception as e:
        print(f"  ⚠️  Error for {table}: {e}")

# Remove all checkpoints
print("\nRemoving checkpoints...")
try:
    dbutils.fs.rm(CHECKPOINT_BASE, True)
    print(f"  ✓ Removed all checkpoints from: {CHECKPOINT_BASE}")
except Exception as e:
    print(f"  ⚠️  Error: {e}")

print("\n✅ Cleanup finished!")
print("💡 All streaming tables and checkpoints have been removed")
print("💡 You can now run the notebook from the start")
"""

print("⚠️  CLEANUP CODE IS COMMENTED OUT")
print("⚠️  Uncomment the code above only if you want to remove all resources")
print("\n💡 Recommendation: keep the data for the following notebooks!")
print("💡 Next notebook: 04_bronze_silver_gold_pipeline.ipynb")