# Medallion Architecture

**Training objective:** Master the implementation of the medallion architecture (Bronze → Silver → Gold) in Databricks.

**Topics covered:**
- Bronze Layer: Ingestion of raw data (CSV, JSON, Parquet)
- Silver Layer: Data Quality, Deduplication & Validation
- SCD (Slowly Changing Dimensions): Implementing SCD Type 1 and Type 2
- Gold Layer: Business aggregations and Star Schema
- Best Practices: Partitioning, Z-Ordering, Data Retention

## Context and requirements

- **Training day**: Day 2 - Delta Lake & Lakehouse
- **Notebook type**: Demo
- **Technical requirements**:
  - Databricks Runtime 16.4 LTS or newer (recommended: 17.3 LTS)
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, CREATE SCHEMA, SELECT, MODIFY
  - Cluster: Standard or **Serverless Compute** (recommended)
- **Dependencies**:
  - Notebook `01_delta_lake_operations.ipynb` completed
  - Notebook `02_Lakeflow_Connection.ipynb` completed (for Bronze data)
- **Estimated time**: ~90 minutes

> **Note (2025):** Serverless Compute is now the default mode for new workloads.

## Theoretical introduction - Medallion Architecture

**Section objective:** Understand the medallion architecture as a fundamental design pattern for the data lakehouse.

---

### What is Medallion Architecture?

**Medallion Architecture** is a multi-layer pattern for organizing data in a data lakehouse. It splits data into three layers of increasing quality and business value:

```
DATA SOURCES
    ↓
🥉 BRONZE (Raw)
    ↓ cleansing
🥈 SILVER (Validated)
    ↓ aggregation
🥇 GOLD (Business)
    ↓
CONSUMPTION
```

### Layers - Detailed Description

#### 🥉 Bronze Layer - Raw / Landing Zone

**Characteristics:**
- Data stored "as-is", with no transformation of values
- Append-only, immutable
- Audit metadata: `_ingestion_timestamp`, `_source_file`, `_user`
- Multi-format: JSON, CSV, Parquet, Avro
- Schema-on-read approach

**Retention:** 3-7 years (long-term history)

**Use Cases:**
- Data recovery (reprocess pipeline)
- Audit trail & compliance
- Historical analysis
- Data science exploration

**Example Bronze table:**
```sql
CREATE TABLE bronze.orders_raw (
    order_id STRING,
    customer_id STRING,
    order_date STRING,        -- Raw string, not yet parsed
    total_amount STRING,      -- Raw string, not yet validated
    payment_method STRING,
    _ingestion_timestamp TIMESTAMP,
    _source_file STRING,
    _rescued_data STRING      -- Schema evolution
)
```

---

#### 🥈 Silver Layer - Cleansed / Validated

**Characteristics:**
- **Deduplication** by business key
- **Validation**: NOT NULL, data types, ranges
- **Standardization**: dates, text, formats
- **Business rules** enforcement
- **Schema enforcement** (strict schema)
- **Upsert/Merge** patterns (SCD)

**Retention:** 1-2 years (medium-term history)

**Use Cases:**
- Foundation for analytics
- Joins & enrichment
- ML feature engineering
- Data quality monitoring

**Example Silver table:**
```sql
CREATE TABLE silver.orders_clean (
    order_id BIGINT NOT NULL,     -- Validated, parsed
    customer_id BIGINT NOT NULL,
    order_date DATE NOT NULL,     -- Parsed to DATE
    total_amount DECIMAL(10,2),   -- Validated numeric
    payment_method STRING,
    _quality_score INT,           -- Data quality metric
    _processing_timestamp TIMESTAMP,
    _is_valid BOOLEAN
)
```

---

#### 🥇 Gold Layer - Business / Aggregates

**Characteristics:**
- **Pre-aggregated** summaries (daily, monthly, yearly)
- **Denormalized** tables (joins pre-computed)
- **KPI calculations** & business metrics
- **Star schema** / dimensional models
- **ML feature stores**
- **Query-optimized** (partitioned, Z-ordered)

**Retention:** 6-12 months (short-term, refreshable)

**Use Cases:**
- BI dashboards (Power BI, Tableau)
- Executive reports
- ML model training
- Self-service analytics

**Example Gold table:**
```sql
CREATE TABLE gold.daily_sales_summary (
    report_date DATE NOT NULL,
    payment_method STRING,
    total_orders BIGINT,
    total_revenue DECIMAL(15,2),
    avg_order_value DECIMAL(10,2),
    unique_customers BIGINT,
    _computation_timestamp TIMESTAMP
)
PARTITIONED BY (report_date)
```

---

### Key Principles of Medallion Architecture

**1. Separation of Concerns**
- Bronze: Ingestion
- Silver: Data Quality
- Gold: Business Logic

**2. Incremental Processing**
- Process only new or changed data
- Delta Lake MERGE operations
- Checkpoint management
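
The idea of processing only new data can be modeled with a watermark. The sketch below is plain Python rather than Spark, and the `select_new_rows` helper and the sample rows are hypothetical; in a real pipeline the same role is typically played by streaming checkpoints or a filter on `_ingestion_timestamp`.

```python
from datetime import datetime

def select_new_rows(rows, last_watermark):
    """Return rows ingested after `last_watermark` and the advanced watermark."""
    new_rows = [r for r in rows if r["_ingestion_timestamp"] > last_watermark]
    # If nothing is new, keep the old watermark so the next run starts from the same point
    new_watermark = max(
        (r["_ingestion_timestamp"] for r in new_rows), default=last_watermark
    )
    return new_rows, new_watermark

# Hypothetical Bronze rows
bronze_rows = [
    {"order_id": "1", "_ingestion_timestamp": datetime(2024, 6, 1, 8, 0)},
    {"order_id": "2", "_ingestion_timestamp": datetime(2024, 6, 1, 9, 0)},
    {"order_id": "3", "_ingestion_timestamp": datetime(2024, 6, 1, 10, 0)},
]

# First run: only rows after the stored watermark are picked up
batch, watermark = select_new_rows(bronze_rows, datetime(2024, 6, 1, 8, 30))

# Second run with the advanced watermark: nothing left to process
rerun, _ = select_new_rows(bronze_rows, watermark)
```

Storing the returned watermark between runs is what makes each run incremental: the second call sees the same input but selects no rows.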

**3. Idempotency**
- Can be re-run multiple times without creating duplicates
- Deterministic transformations
- Unique keys & deduplication
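
A minimal plain-Python sketch of why a keyed upsert is idempotent while a blind append is not. The `upsert` helper and the sample batch are hypothetical and only model the behavior; on Delta tables the equivalent operation would be a MERGE on the business key.

```python
def upsert(target, batch, key="order_id"):
    """Merge `batch` into `target` (a dict keyed by `key`); later rows replace earlier ones."""
    merged = dict(target)  # copy so the input is not mutated
    for row in batch:
        merged[row[key]] = row
    return merged

# Hypothetical batch of cleansed orders
batch = [
    {"order_id": 1, "total_amount": 100.0},
    {"order_id": 2, "total_amount": 250.0},
]

once = upsert({}, batch)
twice = upsert(once, batch)   # re-running the same batch changes nothing

appended_twice = batch + batch  # an append-based load would duplicate every row
```

Here `once` and `twice` are equal, whereas `appended_twice` holds four rows for two orders.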

**4. Schema Evolution**
- Bronze: Flexible (rescued data)
- Silver: Controlled (addNewColumns)
- Gold: Strict (versioned)

**5. Data Quality Gates**
- Validate before promoting to next layer
- Quarantine bad records
- Monitoring & alerting

---

### ETL vs ELT in Medallion

**Traditional ETL:**
```
Extract → Transform → Load
         (outside DB)
```

**Medallion ELT:**
```
Extract → Load (Bronze) → Transform (Silver) → Load (Gold)
                 ↓                    ↓              ↓
             raw data          cleansed data    aggregates
```

**Why ELT?**
- Raw data is preserved (compliance)
- Flexibility (re-transform later)
- Scalability (Spark distributed processing)
- Cost-effective (storage cheaper than compute)

---

### Medallion vs Traditional Data Warehouse

| Feature | Traditional DWH | Medallion Lakehouse |
|---------|-----------------|---------------------|
| **Storage** | Proprietary (expensive) | Cloud object storage (cheap) |
| **Schema** | Schema-on-write | Schema-on-read (Bronze) |
| **Data Types** | Structured only | Structured + semi-structured |
| **Flexibility** | Rigid | Flexible (schema evolution) |
| **Raw Data** | Discarded | Preserved (Bronze) |
| **Processing** | ETL (batch) | ELT (batch + streaming) |
| **Cost** | High (compute + storage) | Lower (decouple compute/storage) |
| **Use Cases** | BI & reporting | BI + ML + data science |

---

### Best Practices

**1. Naming Conventions:**
```
bronze.{source_system}_{entity}_raw
silver.{entity}_clean
gold.{business_domain}_{aggregation_level}
```

**2. Partitioning:**
- Bronze: Ingestion date (`_ingestion_date`)
- Silver: Business date (`order_date`, `transaction_date`)
- Gold: Report date (`report_date`)

**3. Metadata Columns:**
```python
# Bronze: _ingestion_timestamp, _source_file, _user
# Silver: _processing_timestamp, _quality_score, _is_valid
# Gold:   _computation_timestamp, _version
```

**4. Refresh Cadence:**
- Bronze: Real-time / hourly
- Silver: Hourly / daily
- Gold: Daily / on-demand

**5. Data Retention:**
```python
# Bronze: 3-7 years (compliance); 2555 days is roughly 7 years
# Delta retention properties expect an interval string
spark.sql("""
    ALTER TABLE bronze.orders SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 2555 days',
        'delta.deletedFileRetentionDuration' = 'interval 2555 days'
    )
""")

# Silver: 1-2 years
# Gold: 6-12 months (refreshable from Silver)
```

---

### Pipeline Architecture Diagram

```
┌───────────────────────────────────────────────────────────────┐
│                      DATA SOURCES                             │
│  • PostgreSQL  • MySQL  • APIs  • S3 Files  • Kafka           │
└────────────────────────┬──────────────────────────────────────┘
                         │ COPY INTO / Auto Loader
                         ▼
┌───────────────────────────────────────────────────────────────┐
│                   🥉 BRONZE LAYER                             │
│                                                               │
│  bronze.customers_raw      bronze.orders_raw                  │
│  bronze.products_raw       bronze.events_raw                  │
│                                                               │
│  Features:                                                    │
│  • Raw data (as-is)                                           │
│  • Append-only                                                │
│  • Audit metadata                                             │
│  • Long retention (3-7y)                                      │
└────────────────────────┬──────────────────────────────────────┘
                         │ MERGE (Dedup, Validate)
                         ▼
┌───────────────────────────────────────────────────────────────┐
│                   🥈 SILVER LAYER                             │
│                                                               │
│  silver.customers_clean    silver.orders_clean                │
│  silver.products_clean     silver.events_clean                │
│                                                               │
│  Features:                                                    │
│  • Deduplicated                                               │
│  • Validated (types, nulls)                                   │
│  • SCD Type 1/2 (history)                                     │
│  • Medium retention (1-2y)                                    │
└────────────────────────┬──────────────────────────────────────┘
                         │ GROUP BY / JOIN / AGGREGATE
                         ▼
┌───────────────────────────────────────────────────────────────┐
│                   🥇 GOLD LAYER                               │
│                                                               │
│  gold.daily_sales_summary                                     │
│  gold.customer_360                                            │
│  gold.product_performance                                     │
│                                                               │
│  Features:                                                    │
│  • Pre-aggregated                                             │
│  • Denormalized (star schema)                                 │
│  • KPIs & metrics                                             │
│  • Short retention (6-12m)                                    │
└────────────────────────┬──────────────────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────────────────┐
│                  CONSUMPTION LAYER                            │
│  • Power BI  • Tableau  • SQL Analytics  • ML Models          │
└───────────────────────────────────────────────────────────────┘
```

## Per-user isolation

Run the initialization script to set up per-user isolation of catalogs and schemas:

In [0]:
%run ../00_setup

## Configuration

Import the libraries used throughout the notebook:

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql.window import Window
from datetime import datetime, timedelta
import time

### User context

Display the current environment configuration and data paths:

In [None]:
# Show the user context
display({
    "Catalog": CATALOG,
    "Schema Bronze": BRONZE_SCHEMA,
    "Schema Silver": SILVER_SCHEMA,
    "Schema Gold": GOLD_SCHEMA,
    "User": raw_user,
    "Dataset base": DATASET_BASE_PATH
})

**Unity Catalog configuration:**

Set the default catalog for all subsequent operations:

In [None]:
spark.sql(f"USE CATALOG {CATALOG}")

## Section 1: Bronze Layer - Raw Data Landing

**Section objective:** Understand the role of the Bronze layer as the landing zone for raw data.

### Bronze Layer - Key Characteristics

**1. Raw Data "As-Is"**
- Data is written without transforming values
- The original format is preserved
- Multi-format support (JSON, CSV, Parquet)

**2. Append-Only Pattern**
- Data is never deleted or modified
- Immutable history
- Time-travel capability

**3. Audit Metadata**
```python
# Audit metadata columns in Bronze
# _ingestion_timestamp  - when the data was loaded
# _source_file          - where the data came from
# _user                 - who loaded it
# _rescued_data         - schema evolution (unexpected columns)
```

**4. Schema-on-Read**
- Flexible schema (it may change over time)
- Rescued data column dla unknown columns
- Reprocessing capability
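
To make the rescued-data idea concrete, here is a plain-Python model of it. The `rescue_unexpected` helper and `EXPECTED_COLUMNS` are hypothetical names used only for illustration; in Databricks the `_rescued_data` column is filled by the ingestion tooling itself, not by user code like this.

```python
import json

# Hypothetical set of columns the Bronze table currently expects
EXPECTED_COLUMNS = {"order_id", "customer_id", "total_amount"}

def rescue_unexpected(record, expected=EXPECTED_COLUMNS):
    """Keep expected columns and move everything else into a JSON `_rescued_data` field."""
    known = {k: v for k, v in record.items() if k in expected}
    extra = {k: v for k, v in record.items() if k not in expected}
    # None means nothing had to be rescued for this record
    known["_rescued_data"] = json.dumps(extra, sort_keys=True) if extra else None
    return known

# The source system started sending a new `coupon` field
row = rescue_unexpected(
    {"order_id": "7", "customer_id": "42", "total_amount": "19.99", "coupon": "SPRING"}
)
```

The unexpected `coupon` value is not lost and does not break the load; it can be promoted to a real column later by reprocessing from Bronze.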

### Bronze Tables - Structure

In this demo we load data directly from the source files (CSV, JSON, Parquet) into the Bronze layer, so that this notebook can run independently.

**Bronze tables:**
- `bronze.customers_raw` - customer data (CSV)
- `bronze.orders_raw` - orders (JSON)
- `bronze.products_raw` - products (Parquet)

### Why Does Bronze Matter?

**1. Data Recovery**
```python
# The pipeline can be reprocessed starting from Bronze
bronze_data = spark.table("bronze.orders_raw")
# Re-run transformations → Silver → Gold
```

**2. Schema Evolution**
```python
# New columns in the source do not break the pipeline
# They land in _rescued_data
```

**3. Compliance & Audit**
```python
# Full history: who loaded what, and when
# Retention: 3-7 years (legal regulations)
```

**4. Data Science Exploration**
```python
# Analysts can explore raw data
# and build new features from it
```

### Example 1.1: Ingestion into the Bronze Layer (Raw Data)

**Goal:** Load data into the Bronze layer, then inspect the resulting tables to understand their structure.

We read the data directly from the source files (CSV, JSON, Parquet) into Bronze tables.
This keeps the notebook independent of the previous steps.

In [None]:
# 1. Load Customers (CSV)
customers_path = f"{DATASET_BASE_PATH}/customers/customers.csv"

customers_df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true") # In Bronze we often infer or use string
    .load(customers_path)
)

display(customers_df.limit(5))

In [None]:
# Write to Bronze (Overwrite for full load simulation)
# Use BRONZE_SCHEMA so the table lands where later cells read it from
(customers_df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(f"{BRONZE_SCHEMA}.customers_raw")
)

display({"status": f"✅ Table {BRONZE_SCHEMA}.customers_raw created/overwritten"})

In [None]:
# 2. Load Orders (JSON)
orders_path = f"{DATASET_BASE_PATH}/orders/orders_batch.json"

orders_df = (spark.read
    .format("json")
    .load(orders_path)
)

# Write to Bronze
(orders_df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(f"{BRONZE_SCHEMA}.orders_raw")
)

display({"status": f"✅ Table {BRONZE_SCHEMA}.orders_raw created/overwritten"})

In [None]:
# 3. Load Products (Parquet)
products_path = f"{DATASET_BASE_PATH}/products/products.parquet"

products_df = (spark.read
    .format("parquet")
    .load(products_path)
)

# Write to Bronze
(products_df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(f"{BRONZE_SCHEMA}.products_raw")
)

display({"status": f"✅ Table {BRONZE_SCHEMA}.products_raw created/overwritten"})

**Bronze tables created:**

The batch source files have been loaded into the raw Bronze tables.

In [0]:
# Inspect the Bronze layer
bronze_tables = ["customers_raw", "orders_raw", "products_raw"]
results = []

for table in bronze_tables:
    full_table = f"{CATALOG}.{BRONZE_SCHEMA}.{table}"
    
    if spark.catalog.tableExists(full_table):
        df = spark.table(full_table)
        results.append({
            "table": table,
            "status": "✅",
            "records": df.count(),
            "columns": len(df.columns)
        })
    else:
        # Report missing tables instead of silently skipping them
        results.append({"table": table, "status": "❌", "records": 0, "columns": 0})

display(spark.createDataFrame(results))

**Bronze Layer inspection summary:**

The Bronze layer holds raw data with no transformations, keeps the full history (append-only), and is the foundation for all downstream transformations.

## Section 2: Silver Layer - Cleansing & Validation

**Section objective:** Transform data from Bronze to Silver by applying data quality rules.

### Silver Layer - Key Characteristics

**1. Data Cleansing**
- Parsing: string → proper types (INT, DATE, DECIMAL)
- Trimming: whitespace, special characters
- Standardization: dates, phone numbers, emails
- Null handling: replacements, defaults

**2. Deduplication**
- Identify the unique business key
- MERGE operation (upsert pattern)
- Keep the latest version based on timestamp
**3. Validation Rules**
```python
# Example validations
# - NOT NULL constraints
# - Range checks (amount > 0)
# - Referential integrity (FK exists)
# - Business rules (discount <= price)
```

**4. Schema Enforcement**
- Strict schema (vs Bronze flexible)
- Explicit data types
- Column constraints

### Bronze → Silver Transformation Pattern

**Typical Flow:**
```python
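from pyspark.sql.functions import col, current_timestamp, lit, to_date, trim, upper

bronze_df = spark.table("bronze.orders_raw")

silver_df = (bronze_df
    # 1. Parse & Cast
    .withColumn("order_id", col("order_id").cast("bigint"))
    .withColumn("customer_id", col("customer_id").cast("bigint"))
    .withColumn("order_date", to_date(col("order_date")))
    .withColumn("total_amount", col("total_amount").cast("decimal(10,2)"))
    
    # 2. Validate
    .filter(col("order_id").isNotNull())
    .filter(col("total_amount") > 0)
    
    # 3. Standardize
    .withColumn("payment_method", upper(trim(col("payment_method"))))
    
    # 4. Add metadata
    .withColumn("_processing_timestamp", current_timestamp())
    .withColumn("_is_valid", lit(True))

    # 5. Keep only the columns defined in the Silver table
    .select("order_id", "customer_id", "order_date", "total_amount",
            "payment_method", "_processing_timestamp", "_is_valid")
)

# 6. Append to Silver with a strict schema
#    (append alone does not deduplicate; use MERGE on order_id for that)
(silver_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "false")
    .saveAsTable("silver.orders_clean"))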
```

### MERGE Operation - Deduplication Pattern

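**Problem:** Bronze contains duplicates (append-only)

**Solution:** MERGE into Silver (upsert)

```sql
MERGE INTO silver.orders_clean AS target
USING (
    -- Newest Bronze row per order_id among rows ingested since the last Silver run
    SELECT CAST(b.order_id AS BIGINT) AS order_id,
           CAST(b.customer_id AS BIGINT) AS customer_id,
           TO_DATE(b.order_date) AS order_date,
           CAST(b.total_amount AS DECIMAL(10,2)) AS total_amount,
           b.payment_method
    FROM bronze.orders_raw AS b
    WHERE b._ingestion_timestamp > (
        SELECT COALESCE(MAX(_processing_timestamp), TIMESTAMP '1900-01-01 00:00:00')
        FROM silver.orders_clean
    )
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY b.order_id ORDER BY b._ingestion_timestamp DESC) = 1
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET
    customer_id = source.customer_id, order_date = source.order_date,
    total_amount = source.total_amount, payment_method = source.payment_method,
    _processing_timestamp = current_timestamp()
WHEN NOT MATCHED THEN INSERT
    (order_id, customer_id, order_date, total_amount, payment_method, _processing_timestamp)
    VALUES (source.order_id, source.customer_id, source.order_date,
            source.total_amount, source.payment_method, current_timestamp())
```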

### Data Quality Checks

**Levels of Quality:**

**Level 1: Schema Validation**
- Correct data types
- Required columns present
- No unexpected nulls

**Level 2: Business Rules**
- Ranges (amount between 0-1000000)
- Referential integrity (customer_id exists)
- Logical consistency (order_date <= ship_date)

**Level 3: Statistical Checks**
- Outlier detection
- Distribution monitoring
- Anomaly alerts
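
As an illustration of a Level 3 check, the sketch below flags outliers with the interquartile-range rule using only the Python standard library. The `iqr_outliers` helper, the multiplier `k=1.5`, and the sample amounts are assumptions chosen for the example, not a rule prescribed by this training.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical order amounts with one suspicious value
amounts = [100, 102, 98, 101, 99, 103, 97, 5000]
outliers = iqr_outliers(amounts)
```

The IQR rule is used here instead of a mean and standard deviation because a single extreme value inflates the standard deviation and can hide itself in small samples.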

**Quarantine Pattern:**
```python
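from pyspark.sql.functions import col

# `df` is a DataFrame that already carries the boolean `_is_valid` flag

# Valid records → Silver
valid_df = df.filter(col("_is_valid") == True)
valid_df.write.format("delta").mode("append").saveAsTable("silver.orders_clean")

# Invalid records → Quarantine
invalid_df = df.filter(col("_is_valid") == False)
invalid_df.write.format("delta").mode("append").saveAsTable("silver.orders_quarantine")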
```
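
A quarantine table is more useful when each rejected row records why it failed. The plain-Python sketch below models that idea; the rule names, the `_failed_rules` column, and the `split_by_quality` helper are hypothetical, and the rules mirror the Level 1 and Level 2 checks listed above.

```python
from datetime import date

# Hypothetical rule set: name -> predicate that must hold for a valid row
RULES = {
    "order_id_present": lambda r: r.get("order_id") is not None,
    "amount_in_range": lambda r: r.get("total_amount") is not None
    and 0 < r["total_amount"] <= 1_000_000,
    "ship_after_order": lambda r: r.get("ship_date") is None
    or r["order_date"] <= r["ship_date"],
}

def split_by_quality(rows, rules=RULES):
    """Split rows into (valid, quarantine); quarantined rows list the rules they failed."""
    valid, quarantine = [], []
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            quarantine.append({**row, "_failed_rules": failed})
        else:
            valid.append(row)
    return valid, quarantine

rows = [
    {"order_id": 1, "total_amount": 120.0, "order_date": date(2024, 6, 1), "ship_date": date(2024, 6, 3)},
    {"order_id": None, "total_amount": 50.0, "order_date": date(2024, 6, 1), "ship_date": None},
    {"order_id": 3, "total_amount": -5.0, "order_date": date(2024, 6, 5), "ship_date": date(2024, 6, 2)},
]

valid, quarantine = split_by_quality(rows)
```

Keeping the failed rule names alongside the row makes it possible to monitor which rule rejects the most data and to replay quarantined rows once the cause is fixed.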

### Example 2.1: Bronze → Silver Transformation (Orders)

**Goal:** Transform orders from Bronze to Silver with cleansing and validation.

In [0]:
# Bronze → Silver: Orders
bronze_orders = spark.table(f"{BRONZE_SCHEMA}.orders_raw")

# Transform & Validate
silver_orders = (bronze_orders
    .withColumn("order_id", F.col("order_id").cast("bigint"))
    .withColumn("customer_id", F.col("customer_id").cast("bigint"))
    .withColumn("order_datetime", F.to_timestamp(F.col("order_datetime"), "yyyy-MM-dd HH:mm:ss"))
    .withColumn("total_amount", F.col("total_amount").cast("decimal(10,2)"))
    .withColumn("payment_method", F.upper(F.trim(F.col("payment_method"))))
    .withColumn("_is_valid", 
        F.when(
            (F.col("order_id").isNotNull()) &
            (F.col("customer_id").isNotNull()) &
            (F.col("order_datetime").isNotNull()) &
            (F.col("total_amount") > 0),
            True
        ).otherwise(False)
    )
    .withColumn("_processing_timestamp", F.current_timestamp())
    .select(
        "order_id", "customer_id", "order_datetime", 
        "total_amount", "payment_method",
        "_is_valid", "_processing_timestamp"
    )
)

# Split valid/invalid
valid_orders = silver_orders.filter(F.col("_is_valid") == True)
invalid_orders = silver_orders.filter(F.col("_is_valid") == False)

# Write to Silver
valid_orders.write.format("delta").mode("overwrite").saveAsTable(f"{SILVER_SCHEMA}.orders_clean")

display({
    "bronze_records": bronze_orders.count(),
    "valid_orders": valid_orders.count(),
    "invalid_orders": invalid_orders.count(),
    "status": f"✅ Created {SILVER_SCHEMA}.orders_clean"
})

In [None]:
display(spark.table(f"{SILVER_SCHEMA}.orders_clean").limit(10))

**Note on production use:**

In a production environment we would use a MERGE operation for deduplication instead of a plain overwrite.

## Section 3: SCD (Slowly Changing Dimensions)

**Section objective:** Implement SCD Type 1 and Type 2 to track changes in data.

### What is SCD?

**Slowly Changing Dimensions (SCD)** are techniques for tracking changes in dimension tables in data warehouses.

**Problem:**
```
A customer changes address:
- Jan Kowalski, Warszawa → Kraków

Question: should the history be kept?
```

### SCD Types - Overview

| Type | Strategy | History | Use Case |
|------|----------|---------|----------|
| **Type 0** | No changes allowed | N/A | Reference data (countries) |
| **Type 1** | Overwrite | ❌ No | Current state only |
| **Type 2** | Add new row | ✅ Yes | Full history tracking |
| **Type 3** | Add new column | ⚠️ Limited | Previous value only |

---

### SCD Type 1 - Overwrite

**Strategy:** Overwrite the old value with the new one (no history)

**Implementation:** Simple UPDATE/MERGE

**Example:**
```
Before:
customer_id | name        | city
1           | Jan Kowalski| Warszawa

Change: city → Kraków

After:
customer_id | name        | city
1           | Jan Kowalski| Kraków     # Overwritten!
```

**SQL code:**
```sql
MERGE INTO silver.customers_dim AS target
USING updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
    target.city = source.city,
    target.updated_at = current_timestamp()
```

**Pros:**
- ✅ Simple implementation
- ✅ No history bloat
- ✅ Always current values

**Cons:**
- ❌ No historical tracking
- ❌ Can't analyze "as of date"
- ❌ Lose audit trail

**Use Cases:**
- Correcting data entry errors
- Non-critical attributes (e.g., marketing preferences)
- Reference data that shouldn't have history

---

### SCD Type 2 - Historical Tracking

**Strategy:** Add a new record for every change (full history)

**Implementation:** MERGE with version tracking

**Example:**
```
Before:
customer_id | name        | city    | effective_from | effective_to | is_current
1           | Jan Kowalski| Warszawa| 2023-01-01     | 9999-12-31   | true

Change: city → Kraków (2024-06-15)

After:
customer_id | name        | city    | effective_from | effective_to | is_current
1           | Jan Kowalski| Warszawa| 2023-01-01     | 2024-06-14   | false  # Closed
1           | Jan Kowalski| Kraków  | 2024-06-15     | 9999-12-31   | true   # New!
```

**SCD Type 2 columns:**
- `effective_from` / `valid_from`: Start date
- `effective_to` / `valid_to`: End date (9999-12-31 = current)
- `is_current` / `is_active`: Boolean flag
- `version`: Optional version number
- `surrogate_key`: Technical key (not business key)

**SQL code (simplified):**
```sql
-- Assumes `updates` contains only customers whose attributes actually changed

-- Step 1: Close old records
UPDATE silver.customers_dim
SET 
    effective_to = current_date() - 1,
    is_current = false
WHERE customer_id IN (SELECT customer_id FROM updates)
  AND is_current = true;

-- Step 2: Insert new records
INSERT INTO silver.customers_dim
SELECT 
    customer_id,
    name,
    city,
    current_date() AS effective_from,
    DATE '9999-12-31' AS effective_to,
    true AS is_current
FROM updates;
```
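
The close-and-insert logic can also be modeled in plain Python, which makes the change detection explicit. This is a sketch of the technique rather than the notebook's implementation: `apply_scd2`, `TRACKED`, and the sample rows are hypothetical, and the `email` attribute is included only to show tracking of more than one column.

```python
from datetime import date, timedelta

OPEN_END = date(9999, 12, 31)   # sentinel end date for the current version
TRACKED = ("city", "email")     # attributes whose change opens a new version

def apply_scd2(dim, updates, today):
    """Return a new dimension list with SCD Type 2 changes applied."""
    result = [dict(row) for row in dim]  # copy rows so the input is not mutated
    current = {r["customer_id"]: r for r in result if r["is_current"]}
    for upd in updates:
        old = current.get(upd["customer_id"])
        if old is not None and all(old[c] == upd[c] for c in TRACKED):
            continue  # unchanged: keep the current version as it is
        if old is not None:
            # Close the previous version the day before the new one starts
            old["effective_to"] = today - timedelta(days=1)
            old["is_current"] = False
        new_row = {
            **upd,
            "effective_from": today,
            "effective_to": OPEN_END,
            "is_current": True,
            "version": old["version"] + 1 if old else 1,
        }
        result.append(new_row)
        current[upd["customer_id"]] = new_row
    return result

dim = [{
    "customer_id": 1, "name": "Jan Kowalski", "city": "Warszawa",
    "email": "jan@example.com", "effective_from": date(2023, 1, 1),
    "effective_to": OPEN_END, "is_current": True, "version": 1,
}]
updates = [{"customer_id": 1, "name": "Jan Kowalski", "city": "Kraków",
            "email": "jan@example.com"}]

after = apply_scd2(dim, updates, today=date(2024, 6, 15))
```

Applying the same `updates` again leaves the table unchanged, because the incoming row now matches the current version on every tracked attribute.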

**Pros:**
- ✅ Full history preserved
- ✅ "As of date" queries possible
- ✅ Audit trail
- ✅ Temporal analytics

**Cons:**
- ❌ Table grows (more rows)
- ❌ More complex queries (need to filter is_current)
- ❌ Surrogate keys needed

**Use Cases:**
- Customer dimensions (address, preferences)
- Product dimensions (price history)
- Employee dimensions (salary, department)
- Compliance & audit requirements
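
An "as of date" lookup on a Type 2 table selects the version whose validity window contains the requested date. The plain-Python sketch below uses a hypothetical `as_of` helper and sample history; in SQL the same condition is a filter of the form `effective_from <= d AND effective_to >= d`.

```python
from datetime import date

def as_of(dim, customer_id, when):
    """Return the version of `customer_id` that was valid on `when`, or None."""
    for row in dim:
        if (row["customer_id"] == customer_id
                and row["effective_from"] <= when <= row["effective_to"]):
            return row
    return None

# Hypothetical history matching the example table above
history = [
    {"customer_id": 1, "city": "Warszawa",
     "effective_from": date(2023, 1, 1), "effective_to": date(2024, 6, 14)},
    {"customer_id": 1, "city": "Kraków",
     "effective_from": date(2024, 6, 15), "effective_to": date(9999, 12, 31)},
]

city_in_january = as_of(history, 1, date(2024, 1, 1))["city"]
city_after_move = as_of(history, 1, date(2024, 6, 15))["city"]
before_first_version = as_of(history, 1, date(2022, 1, 1))
```

Because the windows are closed on both ends and do not overlap, at most one version matches any given date.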

---

### SCD Type 3 - Limited History (less commonly used)

**Strategy:** Add a column holding the previous value

**Example:**
```
customer_id | name        | city    | previous_city
1           | Jan Kowalski| Kraków  | Warszawa
```
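
A short plain-Python sketch of the Type 3 update shows where the limitation comes from. The `apply_scd3` helper and the second move to Gdańsk are hypothetical additions for illustration.

```python
def apply_scd3(row, new_city):
    """Return a copy of `row` with `city` updated and the old value kept in `previous_city`."""
    if row["city"] == new_city:
        return dict(row)  # nothing changed
    return {**row, "city": new_city, "previous_city": row["city"]}

customer = {"customer_id": 1, "name": "Jan Kowalski",
            "city": "Warszawa", "previous_city": None}

moved_once = apply_scd3(customer, "Kraków")
moved_twice = apply_scd3(moved_once, "Gdańsk")
```

After the second move, `previous_city` holds Kraków and the original value Warszawa is no longer recoverable from the row.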

**Pros:**
- ✅ Simple (one previous value)
- ✅ No row explosion

**Cons:**
- ❌ Only 1 previous value
- ❌ Limited analytics

**Use Case:** Rarely used (SCD Type 2 is better)

---

### SCD Decision Matrix

**When to use which type?**

| Requirement | Recommended Type |
|-------------|------------------|
| No history needed | Type 1 |
| Full history required | Type 2 |
| Audit/compliance | Type 2 |
| Data corrections | Type 1 |
| Current state only | Type 1 |
| Temporal analytics | Type 2 |
| Growing table OK | Type 2 |
| Storage constrained | Type 1 |

**💡 Best Practice:**
- Use **Type 1** for Silver layer (current state)
- Use **Type 2** for Gold dimensional tables (history)

---

### MERGE Pattern for SCD Type 2 (Advanced)

**Complete Implementation:**

```sql
-- Step 1: Close changed current rows and insert brand-new customers
MERGE INTO silver.customers_dim AS target
USING (
    SELECT 
        customer_id,
        name,
        city,
        email,
        current_date() AS effective_from,
        CAST('9999-12-31' AS DATE) AS effective_to,
        true AS is_current
    FROM staging.customers_updates
) AS source
ON target.customer_id = source.customer_id 
   AND target.is_current = true

-- Case 1: No change → no clause needed, the current row is left untouched

-- Case 2: Change detected → close the old version (<=> is a NULL-safe comparison)
WHEN MATCHED AND (
    NOT (target.city <=> source.city) OR
    NOT (target.email <=> source.email)
) THEN UPDATE SET
    target.effective_to = current_date() - 1,
    target.is_current = false

-- Case 3: New customer → insert
WHEN NOT MATCHED THEN INSERT (
    customer_id, name, city, email,
    effective_from, effective_to, is_current
) VALUES (
    source.customer_id, source.name, source.city, source.email,
    source.effective_from, source.effective_to, source.is_current
);

-- Step 2: Insert new versions for changed records.
-- After Step 1 these are exactly the customers left without a current row.
INSERT INTO silver.customers_dim
    (customer_id, name, city, email, effective_from, effective_to, is_current)
SELECT 
    source.customer_id, source.name, source.city, source.email,
    current_date(), CAST('9999-12-31' AS DATE), true
FROM staging.customers_updates AS source
WHERE NOT EXISTS (
    SELECT 1
    FROM silver.customers_dim AS target
    WHERE target.customer_id = source.customer_id
      AND target.is_current = true
);
```

### Example 3.1: SCD Type 1 - Customers (Overwrite)

**Goal:** Implement SCD Type 1: a simple overwrite with no history.

In [0]:
# SCD Type 1: Customers
bronze_customers = spark.table(f"{BRONZE_SCHEMA}.customers_raw")

# Transform to Silver (SCD Type 1 - current state only)
customers_type1 = (bronze_customers
    .withColumn("customer_id", F.col("customer_id").cast("bigint"))
    .withColumn("name", F.trim(F.col("name")))
    .withColumn("email", F.lower(F.trim(F.col("email"))))
    .withColumn("city", F.initcap(F.trim(F.col("city"))))
    .withColumn("updated_at", F.current_timestamp())
    .select("customer_id", "name", "email", "city", "updated_at")
)

# Create/Replace table (Type 1 = overwrite)
customers_type1.write.format("delta").mode("overwrite").saveAsTable(f"{SILVER_SCHEMA}.customers_type1")

display({
    "status": f"✅ Created {SILVER_SCHEMA}.customers_type1",
    "records": customers_type1.count(),
    "note": "SCD Type 1: always the current state, no history"
})

**SCD Type 1 table created:**

The `customers_type1` table always holds the current state, with no history. Sample rows:

In [None]:
display(spark.table(f"{SILVER_SCHEMA}.customers_type1").limit(5))

### Simulating an UPDATE - changing the city for customer_id=1

**BEFORE THE CHANGE:** Current state of customer_id=1:

In [None]:
spark.sql(f"""
    SELECT * FROM {SILVER_SCHEMA}.customers_type1 
    WHERE customer_id = 1
""").show()

In [None]:
from pyspark.sql.types import StructType, StructField, LongType, StringType
updates_schema = StructType([
    StructField("customer_id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("city", StringType(), True)
])

updates_data = [(1, "Jan Kowalski", "jan@example.com", "Kraków")]  # Changed city!
updates_df = spark.createDataFrame(updates_data, updates_schema)

In [None]:
updates_df.createOrReplaceTempView("customer_updates")

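spark.sql(f"""
    MERGE INTO {SILVER_SCHEMA}.customers_type1 AS target
    USING customer_updates AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET
        target.city = source.city,
        target.updated_at = current_timestamp()
    -- The source has no updated_at column, so list the insert columns explicitly
    WHEN NOT MATCHED THEN INSERT (customer_id, name, email, city, updated_at)
        VALUES (source.customer_id, source.name, source.email, source.city, current_timestamp())
""")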

**AFTER THE CHANGE (SCD Type 1 - overwrite):** Updated state:

In [None]:
spark.sql(f"""
    SELECT * FROM {SILVER_SCHEMA}.customers_type1 
    WHERE customer_id = 1
""").show()

**⚠️ NOTE:** The history of the change has been LOST

**💡 The old value (Warszawa) was overwritten with the new one (Kraków)**

### Example 3.2: SCD Type 2 - Customers (Historical Tracking)

**Goal:** Implement SCD Type 2: full tracking of the change history.

In [0]:
# SCD Type 2: Customers (Historical Tracking)
bronze_customers = spark.table(f"{BRONZE_SCHEMA}.customers_raw")

customers_type2_initial = (bronze_customers
    .withColumn("customer_id", F.col("customer_id").cast("bigint"))
    .withColumn("name", F.trim(F.col("name")))
    .withColumn("email", F.lower(F.trim(F.col("email"))))
    .withColumn("city", F.initcap(F.trim(F.col("city"))))
    # SCD Type 2 columns
    .withColumn("effective_from", F.current_date())
    .withColumn("effective_to", F.lit("9999-12-31").cast("date"))
    .withColumn("is_current", F.lit(True))
    .withColumn("version", F.lit(1))
    .select(
        "customer_id", "name", "email", "city",
        "effective_from", "effective_to", "is_current", "version"
    )
)

# Create initial table
customers_type2_initial.write.format("delta").mode("overwrite").saveAsTable(f"{SILVER_SCHEMA}.customers_type2")

display({
    "status": f"✅ Created {SILVER_SCHEMA}.customers_type2",
    "records": customers_type2_initial.count(),
    "columns": "effective_from, effective_to, is_current, version"
})

**SCD Type 2 table created:**

The `customers_type2` table includes the columns effective_from, effective_to, is_current, and version. Sample rows:

In [None]:
display(spark.table(f"{SILVER_SCHEMA}.customers_type2").limit(5))

### Simulating a CHANGE - changing the city for customer_id=1

**BEFORE THE CHANGE:** Current state of customer_id=1:

In [None]:
spark.sql(f"""
    SELECT * FROM {SILVER_SCHEMA}.customers_type2 
    WHERE customer_id = 1
    ORDER BY effective_from
""").show()

In [None]:
updates_data = [(1, "Jan Kowalski", "jan@example.com", "Kraków")]  # Changed city!
updates_df = spark.createDataFrame(updates_data, ["customer_id", "name", "email", "city"])
updates_df.createOrReplaceTempView("customer_updates_type2")

**Step 1: Close old records** (set effective_to, is_current=false):

In [None]:
spark.sql(f"""
    MERGE INTO {SILVER_SCHEMA}.customers_type2 AS target
    USING (
        SELECT DISTINCT u.customer_id
        FROM customer_updates_type2 u
        INNER JOIN {SILVER_SCHEMA}.customers_type2 t
            ON u.customer_id = t.customer_id
        WHERE t.is_current = true
          AND (u.city != t.city OR u.email != t.email)  -- Detect changes
    ) AS changed
    ON target.customer_id = changed.customer_id 
       AND target.is_current = true
    WHEN MATCHED THEN UPDATE SET
        target.effective_to = current_date() - INTERVAL 1 DAY,
        target.is_current = false
""")

**Step 2: Insert new versions** (add new records with the updated values):

In [None]:
spark.sql(f"""
    INSERT INTO {SILVER_SCHEMA}.customers_type2
    SELECT 
        u.customer_id,
        u.name,
        u.email,
        u.city,
        current_date() AS effective_from,
        CAST('9999-12-31' AS DATE) AS effective_to,
        true AS is_current,
        COALESCE(MAX(t.version), 0) + 1 AS version
    FROM customer_updates_type2 u
    LEFT JOIN {SILVER_SCHEMA}.customers_type2 t
        ON u.customer_id = t.customer_id
    WHERE NOT EXISTS (
        SELECT 1 FROM {SILVER_SCHEMA}.customers_type2 existing
        WHERE existing.customer_id = u.customer_id
          AND existing.is_current = true
          AND existing.city = u.city
          AND existing.email = u.email
    )
    GROUP BY u.customer_id, u.name, u.email, u.city
""")

**AFTER THE CHANGE (SCD Type 2 - historical tracking):**

History has been preserved. All versions of customer_id=1:

In [None]:
spark.sql(f"""
    SELECT 
        customer_id,
        city,
        effective_from,
        effective_to,
        is_current,
        version
    FROM {SILVER_SCHEMA}.customers_type2 
    WHERE customer_id = 1
    ORDER BY effective_from
""").show()

**✅ History preserved!**

We now have 2 records:
- **Version 1**: Warszawa (effective_to = today - 1 day, is_current=false)
- **Version 2**: Kraków (effective_to = 9999-12-31, is_current=true)
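The two steps above (close the old record, then insert the new version) can also be traced without a cluster. The following is a minimal plain-Python sketch of the same logic, not the Delta implementation: the function name `apply_scd2`, the sample dates, and the in-memory list of dicts are assumptions made for illustration. As in the MERGE, only `city` and `email` are compared when detecting a change.

```python
from datetime import date, timedelta

OPEN_END = date(9999, 12, 31)  # sentinel meaning "still current", as in the table above


def apply_scd2(history, updates, today):
    """Return a new history list with changed customers closed and re-versioned.

    history: list of dicts with customer_id, name, email, city,
             effective_from, effective_to, is_current, version.
    updates: list of dicts with customer_id, name, email, city.
    """
    result = [dict(row) for row in history]  # copy so the input stays untouched
    for upd in updates:
        versions = [r for r in result if r["customer_id"] == upd["customer_id"]]
        current = next((r for r in versions if r["is_current"]), None)
        # Same change detection as the MERGE: only city and email are tracked.
        if current and (current["city"], current["email"]) == (upd["city"], upd["email"]):
            continue
        if current:
            # Step 1: close the old record.
            current["effective_to"] = today - timedelta(days=1)
            current["is_current"] = False
        # Step 2: insert the new version with the next version number.
        next_version = max((r["version"] for r in versions), default=0) + 1
        result.append({
            **upd,
            "effective_from": today,
            "effective_to": OPEN_END,
            "is_current": True,
            "version": next_version,
        })
    return result


# Hypothetical sample data; the dates are chosen only for illustration.
history = [{
    "customer_id": 1, "name": "Jan Kowalski", "email": "jan@example.com",
    "city": "Warszawa", "effective_from": date(2025, 1, 1),
    "effective_to": OPEN_END, "is_current": True, "version": 1,
}]
updates = [{"customer_id": 1, "name": "Jan Kowalski",
            "email": "jan@example.com", "city": "Kraków"}]

new_history = apply_scd2(history, updates, today=date(2025, 6, 1))
for row in new_history:
    print(row["version"], row["city"], row["effective_to"], row["is_current"])
```

Applying the same update a second time leaves the history unchanged, which mirrors the `NOT EXISTS` guard in Step 2.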

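### Example: 'as of date' query

**Where did the customer live one month ago?** SCD Type 2 enables temporal queries. Note that in this demo the initial load stamps every row with `effective_from = current_date()`, so when the notebook is run in one sitting a date 30 days back precedes all recorded history and the query returns no rows; with real history it returns the version valid on that date.

In [None]:
from datetime import datetime, timedelta

one_month_ago = (datetime.now() - timedelta(days=30)).date()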

In [None]:
spark.sql(f"""
    SELECT 
        customer_id,
        name,
        city,
        effective_from,
        effective_to
    FROM {SILVER_SCHEMA}.customers_type2
    WHERE customer_id = 1
      AND '{one_month_ago}' BETWEEN effective_from AND effective_to
""").show()
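The `BETWEEN effective_from AND effective_to` predicate is inclusive on both ends, which is why Step 1 closes the old record at "today - 1 day": the two versions then never overlap. A small plain-Python sketch of the same lookup, with hypothetical sample dates and a hypothetical helper name `as_of`, makes the boundary behaviour easy to check:

```python
from datetime import date

OPEN_END = date(9999, 12, 31)

# Hypothetical history for one customer; the dates are illustrative only.
versions = [
    {"customer_id": 1, "city": "Warszawa",
     "effective_from": date(2025, 1, 1), "effective_to": date(2025, 5, 31)},
    {"customer_id": 1, "city": "Kraków",
     "effective_from": date(2025, 6, 1), "effective_to": OPEN_END},
]


def as_of(rows, customer_id, when):
    """Return the version valid on `when`, or None if no version covers that date."""
    for row in rows:
        # Inclusive on both ends, like SQL BETWEEN.
        if (row["customer_id"] == customer_id
                and row["effective_from"] <= when <= row["effective_to"]):
            return row
    return None


print(as_of(versions, 1, date(2025, 5, 31))["city"])  # last day of the old version
print(as_of(versions, 1, date(2025, 6, 1))["city"])   # first day of the new version
print(as_of(versions, 1, date(2024, 12, 31)))         # before any history exists
```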

## Section 4: Gold Layer - Business Aggregates & Analytics

**Section goal:** Transform Silver → Gold with business aggregations.

### Gold Layer - Kluczowe Cechy

**1. Pre-Aggregated Data**
- Daily/Monthly/Yearly summaries
- Pre-computed KPIs
- Reduced data volume (faster queries)

**2. Denormalized Tables**
- Joins pre-computed (star schema)
- Wide tables for BI tools
- No complex joins needed

**3. Business Logic**
- Revenue calculations
- Customer segmentation
- Product performance metrics

**4. Query Optimization**
- Partitioned by report_date
- Z-ordered for common filters
- Materialized views

### Silver → Gold Transformation Pattern

**Typical Aggregation:**
```python
# Silver: Detail level (millions of rows)
silver_orders = spark.table("silver.orders_clean")

# Gold: Aggregated (thousands of rows)
gold_daily_sales = (silver_orders
    .groupBy(
        F.to_date("order_datetime").alias("report_date"),
        "payment_method"
    )
    .agg(
        F.count("*").alias("total_orders"),
        F.sum("total_amount").alias("total_revenue"),
        F.avg("total_amount").alias("avg_order_value"),
        F.countDistinct("customer_id").alias("unique_customers")
    )
)
```
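To see what this aggregation produces on concrete rows, here is a plain-Python sketch of the same group-by, assuming a handful of hypothetical sample orders (the field names and values are illustrative, not taken from the training dataset). It computes the same four metrics per `(report_date, payment_method)` group.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical detail-level rows standing in for the Silver table.
orders = [
    {"order_datetime": datetime(2025, 6, 1, 9, 30), "payment_method": "card",
     "customer_id": 1, "total_amount": 100.0},
    {"order_datetime": datetime(2025, 6, 1, 17, 5), "payment_method": "card",
     "customer_id": 1, "total_amount": 50.0},
    {"order_datetime": datetime(2025, 6, 1, 12, 0), "payment_method": "cash",
     "customer_id": 2, "total_amount": 30.0},
    {"order_datetime": datetime(2025, 6, 2, 8, 15), "payment_method": "card",
     "customer_id": 3, "total_amount": 70.0},
]


def daily_sales(rows):
    """Aggregate order rows to one summary row per (report_date, payment_method)."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["order_datetime"].date(), row["payment_method"])
        groups[key].append(row)

    summary = []
    for (report_date, payment_method), members in sorted(groups.items()):
        amounts = [m["total_amount"] for m in members]
        summary.append({
            "report_date": report_date,
            "payment_method": payment_method,
            "total_orders": len(members),
            "total_revenue": sum(amounts),
            "avg_order_value": sum(amounts) / len(amounts),
            "unique_customers": len({m["customer_id"] for m in members}),
        })
    return summary


summary = daily_sales(orders)
for line in summary:
    print(line)
```

Four detail rows collapse into three summary rows here; on real data the reduction is far larger because each day and payment method yields a single row.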

### Gold Layer Tables - Examples

**1. Daily Sales Summary**
```sql
gold.daily_sales_summary
- report_date, payment_method
- total_orders, total_revenue, avg_order_value
- PARTITIONED BY (report_date)
```

**2. Customer 360**
```sql
gold.customer_360
- customer_id, name, email, city
- total_lifetime_value, total_orders, first_order_date, last_order_date
- customer_segment (VIP, Regular, New)
```

**3. Product Performance**
```sql
gold.product_performance
- product_id, product_name, category
- total_sold, total_revenue, avg_price
- PARTITIONED BY (category)
```

### Star Schema Pattern

**Fact Table (Orders):**
- order_id, customer_key, product_key, date_key
- total_amount, quantity

**Dimension Tables:**
- dim_customers (SCD Type 2)
- dim_products
- dim_dates

**Benefits:**
- Simplified queries
- Better performance (fewer, simpler joins on dimension keys)
- BI tool friendly
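A minimal sketch of how a fact table resolves against dimensions through surrogate keys, in plain Python with hypothetical keys, products, and amounts. Because `dim_customers` is SCD Type 2, each customer version gets its own surrogate key, so an order stays attributed to the city that was valid when it was placed.

```python
# Hypothetical dimension tables keyed by surrogate key.
dim_customers = {
    101: {"customer_id": 1, "city": "Warszawa", "is_current": False},
    102: {"customer_id": 1, "city": "Kraków", "is_current": True},
}
dim_products = {
    501: {"product_name": "Keyboard", "category": "Accessories"},
    502: {"product_name": "Monitor", "category": "Displays"},
}

# Hypothetical fact rows: keys plus additive measures only.
fact_orders = [
    {"order_id": 1, "customer_key": 101, "product_key": 501, "quantity": 2, "total_amount": 200.0},
    {"order_id": 2, "customer_key": 102, "product_key": 502, "quantity": 1, "total_amount": 900.0},
    {"order_id": 3, "customer_key": 102, "product_key": 501, "quantity": 1, "total_amount": 100.0},
]


def revenue_by(fact_rows, dimension, key_column, attribute):
    """Sum total_amount per dimension attribute via a single key lookup per fact row."""
    totals = {}
    for row in fact_rows:
        value = dimension[row[key_column]][attribute]
        totals[value] = totals.get(value, 0.0) + row["total_amount"]
    return totals


print(revenue_by(fact_orders, dim_customers, "customer_key", "city"))
print(revenue_by(fact_orders, dim_products, "product_key", "category"))
```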

### Example 4.1: Gold - Daily Sales Summary

**Goal:** Aggregate orders into a daily sales summary.

In [0]:
# Gold: Daily Sales Summary
silver_orders = spark.table(f"{SILVER_SCHEMA}.orders_clean")

# Aggregate to daily summary
daily_sales = (silver_orders
    .withColumn("report_date", F.to_date("order_datetime"))
    .groupBy("report_date", "payment_method")
    .agg(
        F.count("*").alias("total_orders"),
        F.sum("total_amount").alias("total_revenue"),
        F.avg("total_amount").alias("avg_order_value"),
        F.countDistinct("customer_id").alias("unique_customers"),
        F.min("total_amount").alias("min_order"),
        F.max("total_amount").alias("max_order")
    )
    .withColumn("_computation_timestamp", F.current_timestamp())
    .orderBy("report_date", "payment_method")
)

# Write to Gold
daily_sales.write.format("delta").mode("overwrite").partitionBy("report_date").saveAsTable(f"{GOLD_SCHEMA}.daily_sales_summary")

display({
    "status": f"✅ Created {GOLD_SCHEMA}.daily_sales_summary",
    "aggregated_rows": daily_sales.count(),
    "partitioned_by": "report_date"
})

**Gold table created:**

The `daily_sales_summary` table is partitioned by `report_date`. Top 10 days by revenue:

In [None]:
display(
    spark.table(f"{GOLD_SCHEMA}.daily_sales_summary")
    .orderBy(F.desc("total_revenue"))
    .limit(10)
)

### Data Reduction Metrics

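Comparing data volume (row counts) between Silver and Gold:

In [None]:
silver_count = silver_orders.count()
gold_count = daily_sales.count()
# Guard against an empty Silver table to avoid division by zero
reduction_pct = ((silver_count - gold_count) / silver_count) * 100 if silver_count else 0.0
print(f"Silver rows: {silver_count}, Gold rows: {gold_count}, reduction: {reduction_pct:.1f}%")

**Data reduction:**

Gold tables are much smaller than Silver, so queries against them run faster.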

### Example 4.2: Gold - Customer 360 (Denormalized)

**Goal:** Build a denormalized customer view with KPIs.

In [0]:
# Gold: Customer 360
customers = spark.table(f"{SILVER_SCHEMA}.customers_type1")
orders = spark.table(f"{SILVER_SCHEMA}.orders_clean")

customer_360 = (customers
    .join(
        orders.groupBy("customer_id").agg(
            F.count("*").alias("total_orders"),
            F.sum("total_amount").alias("lifetime_value"),
            F.avg("total_amount").alias("avg_order_value"),
            F.min("order_datetime").alias("first_order_date"),
            F.max("order_datetime").alias("last_order_date")
        ),
        "customer_id",
        "left"
    )
    .withColumn("customer_segment",
        F.when(F.col("lifetime_value") > 1000, "VIP")
        .when(F.col("lifetime_value") > 500, "Regular")
        .otherwise("New")
    )
    .withColumn("_computation_timestamp", F.current_timestamp())
    .select(
        "customer_id", "name", "email", "city",
        "total_orders", "lifetime_value", "avg_order_value",
        "first_order_date", "last_order_date", "customer_segment",
        "_computation_timestamp"
    )
)

# Write to Gold
customer_360.write.format("delta").mode("overwrite").saveAsTable(f"{GOLD_SCHEMA}.customer_360")

display({
    "status": f"✅ Created {GOLD_SCHEMA}.customer_360",
    "customers": customer_360.count()
})

**Customer 360 table created:**

A denormalized view with customer KPIs. Top 10 customers by lifetime value:

In [None]:
display(
    spark.table(f"{GOLD_SCHEMA}.customer_360")
    .orderBy(F.desc("lifetime_value"))
    .limit(10)
)

### Customer Segmentation

Distribution of customer segments:

In [None]:
spark.sql(f"""
    SELECT 
        customer_segment,
        COUNT(*) as customer_count,
        SUM(lifetime_value) as total_value,
        AVG(lifetime_value) as avg_value
    FROM {GOLD_SCHEMA}.customer_360
    GROUP BY customer_segment
    ORDER BY total_value DESC
""").show()

**Customer 360:** Denormalized view for BI dashboards with precomputed KPIs
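The segmentation rule in Example 4.2 uses strict `>` comparisons, and customers without any orders come out of the left join with a NULL `lifetime_value`. A plain-Python sketch of the same thresholds (the function name is an assumption for illustration) shows where the boundary cases land:

```python
def customer_segment(lifetime_value):
    """Mirror the customer_360 thresholds: > 1000 is VIP, > 500 is Regular, else New."""
    # None stands in for SQL NULL: both comparisons fail, so the row falls to "New".
    if lifetime_value is not None and lifetime_value > 1000:
        return "VIP"
    if lifetime_value is not None and lifetime_value > 500:
        return "Regular"
    return "New"


for value in (None, 500, 500.01, 1000, 1000.01):
    print(value, customer_segment(value))
```

A customer with exactly 1000 in lifetime value is therefore "Regular", and one with exactly 500 is "New"; switch to `>=` if the business rule is meant to be inclusive.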

## Section 5: Summary & Best Practices

### What was achieved?

✅ **1. Medallion Architecture Implementation**
- 🥉 Bronze: Raw data landing (append-only, immutable)
- 🥈 Silver: Cleansed & validated data (deduplicated)
- 🥇 Gold: Business aggregates & KPIs (pre-computed)

✅ **2. SCD (Slowly Changing Dimensions)**
- **Type 1**: Overwrite (no history) - for current state
- **Type 2**: Historical tracking (full history) - for temporal analytics

✅ **3. Data Quality Patterns**
- Validation rules (NOT NULL, ranges, types)
- Quarantine pattern (valid/invalid split)
- Metadata tracking (_processing_timestamp, _is_valid)

✅ **4. Business Logic Implementations**
- Daily sales aggregations
- Customer 360 view (denormalized)
- Customer segmentation (VIP/Regular/New)
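The quarantine pattern listed under Data Quality Patterns above can be sketched in a few lines of plain Python. The specific rules (non-null `order_id`, positive numeric `total_amount`) and the `_errors` column are assumptions chosen for illustration; `_processing_timestamp` and `_is_valid` follow the metadata names used in this notebook.

```python
from datetime import datetime, timezone


def validate_order(row):
    """Return a list of rule violations; an empty list means the row is valid."""
    errors = []
    if row.get("order_id") is None:  # NOT NULL rule
        errors.append("order_id is NULL")
    amount = row.get("total_amount")
    if not isinstance(amount, (int, float)) or isinstance(amount, bool):  # type rule
        errors.append("total_amount is not numeric")
    elif amount <= 0:  # range rule
        errors.append("total_amount out of range")
    return errors


def split_valid_invalid(rows):
    """Tag every row with audit metadata and route it to valid or quarantine."""
    valid, quarantine = [], []
    processed_at = datetime.now(timezone.utc)
    for row in rows:
        errors = validate_order(row)
        tagged = {**row, "_processing_timestamp": processed_at, "_is_valid": not errors}
        if errors:
            quarantine.append({**tagged, "_errors": errors})
        else:
            valid.append(tagged)
    return valid, quarantine


# Hypothetical input batch with one good row and three bad ones.
rows = [
    {"order_id": 1, "total_amount": 120.0},
    {"order_id": None, "total_amount": 10.0},
    {"order_id": 3, "total_amount": -5},
    {"order_id": 4, "total_amount": "abc"},
]
valid, quarantine = split_valid_invalid(rows)
print(len(valid), "valid,", len(quarantine), "quarantined")
```

Keeping the rejected rows together with the reason they failed lets them be inspected and replayed later instead of being silently dropped.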

### Key Takeaways

💡 **1. Separation of Concerns**
```
Bronze = Ingestion (raw data)
Silver = Data Quality (cleansing, dedup)
Gold = Business Logic (aggregates, KPIs)
```

💡 **2. ELT > ETL**
```
Traditional ETL: Transform outside DB
Medallion ELT: Load first, transform in-place
- Preserve raw data (Bronze)
- Re-process capability
- Scalable (Spark distributed)
```

💡 **3. SCD Strategy**
```
Silver: Type 1 (current state)
Gold: Type 2 (history for dimensions)
Fact tables: Immutable (no SCD needed)
```

💡 **4. Incremental Processing**
```
Bronze → Silver: MERGE (deduplication)
Silver → Gold: Overwrite or MERGE (depends on use case)
Always use checkpoints for streaming
```
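As a rough illustration of "MERGE (deduplication)", the plain-Python sketch below upserts an incoming batch into a target keyed by `order_id`. The helper name, the `updated_at` ordering column, and the sample rows are assumptions. The batch is deduplicated first because a Delta `MERGE` typically fails when several source rows match the same target row.

```python
def merge_upsert(target, source, key, order_by):
    """Upsert `source` rows into `target`, keeping exactly one row per key."""
    merged = {row[key]: row for row in target}

    # Deduplicate the incoming batch: keep the newest row per key.
    latest = {}
    for row in source:
        k = row[key]
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row

    # WHEN MATCHED -> update the existing row; WHEN NOT MATCHED -> insert a new one.
    merged.update(latest)
    return sorted(merged.values(), key=lambda r: r[key])


# Hypothetical target table and incoming batch.
target = [
    {"order_id": 1, "status": "new", "updated_at": 1},
    {"order_id": 2, "status": "new", "updated_at": 1},
]
source = [
    {"order_id": 2, "status": "paid", "updated_at": 2},
    {"order_id": 2, "status": "shipped", "updated_at": 3},  # duplicate key, newer
    {"order_id": 3, "status": "new", "updated_at": 3},
]

result = merge_upsert(target, source, key="order_id", order_by="updated_at")
for row in result:
    print(row)
```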

💡 **5. Partitioning Strategy**
```
Bronze: _ingestion_date
Silver: Business date (order_date, transaction_date)
Gold: Report date (report_date)
```

### Medallion Architecture - Decision Matrix

| Layer | Purpose | Schema | Updates | Retention | Use Case |
|-------|---------|--------|---------|-----------|----------|
| **Bronze** | Raw landing | Flexible | Append-only | 3-7 years | Recovery, audit |
| **Silver** | Validated | Strict | MERGE (dedup) | 1-2 years | Analytics prep |
| **Gold** | Business | Optimized | Overwrite/MERGE | 6-12 months | BI, reports |

### Production Checklist

**Bronze Layer:**
- [ ] Audit metadata (_ingestion_timestamp, _source_file)
- [ ] Schema evolution enabled (rescued_data)
- [ ] Long retention (compliance)
- [ ] Partition by _ingestion_date

**Silver Layer:**
- [ ] Data quality rules implemented
- [ ] Deduplication logic (MERGE)
- [ ] SCD Type 1 for dimension tables
- [ ] Quarantine pattern for bad data
- [ ] Partition by business date

**Gold Layer:**
- [ ] Pre-aggregated summaries
- [ ] Denormalized tables (star schema)
- [ ] SCD Type 2 for dimensions (optional)
- [ ] Partition by report_date
- [ ] Z-ordering for common filters

### Next Steps

**📚 Upcoming Notebooks:**
- **05_optimization_best_practices.ipynb** - Performance tuning
- **Hands-on workshops** - End-to-end pipeline implementation

**🛠️ Homework:**
1. Implement a complete Bronze→Silver→Gold pipeline
2. Add SCD Type 2 for the products dimension
3. Create a Gold table: monthly_product_performance
4. Implement data quality monitoring

### Useful SQL Queries

**Query current customers only (SCD Type 2):**
```sql
SELECT * FROM silver.customers_type2
WHERE is_current = true
```

**Query historical data (as of date):**
```sql
SELECT * FROM silver.customers_type2
WHERE '2024-01-15' BETWEEN effective_from AND effective_to
```

**Gold aggregation refresh:**
```sql
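INSERT OVERWRITE gold.daily_sales_summary
SELECT 
    CAST(order_datetime AS DATE) AS report_date,
    payment_method,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_revenue,
    AVG(total_amount) AS avg_order_value,
    COUNT(DISTINCT customer_id) AS unique_customers,
    MIN(total_amount) AS min_order,
    MAX(total_amount) AS max_order,
    current_timestamp() AS _computation_timestamp
FROM silver.orders_clean
GROUP BY 1, 2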
```

---

**Congratulations!** 🎉
You have completed a Medallion Architecture implementation with SCD Type 1/2!
You are ready to build production-grade data lakehouse pipelines!

## Section 6: Resource Cleanup

**Note:** This section is optional. Run it only if you want to delete all data created in this notebook.

### Option 1: Check created resources (recommended)

In [0]:
# Check created resources
medallion_tables = {
    "Silver": [
        f"{SILVER_SCHEMA}.orders_clean",
        f"{SILVER_SCHEMA}.customers_type1",
        f"{SILVER_SCHEMA}.customers_type2"
    ],
    "Gold": [
        f"{GOLD_SCHEMA}.daily_sales_summary",
        f"{GOLD_SCHEMA}.customer_360"
    ]
}

results = []
for layer, tables in medallion_tables.items():
    for table in tables:
        full_table = f"{CATALOG}.{table}"
        if spark.catalog.tableExists(full_table):
            count = spark.table(full_table).count()
            detail = spark.sql(f"DESCRIBE DETAIL {full_table}").collect()[0]
            size_mb = detail['sizeInBytes'] / (1024 * 1024)
            results.append({"layer": layer, "table": table, "records": count, "size_mb": round(size_mb, 2)})

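# createDataFrame cannot infer a schema from an empty list, so guard that case
if results:
    display(spark.createDataFrame(results))
else:
    print("No Medallion tables found")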

**The data is kept for further use**

To delete all tables, run the next cell in the optional section.

### Option 2: Delete all resources (only if you really want to)

**WARNING:** This deletes all Silver and Gold tables created in this notebook!

In [0]:
# Option 2: Delete all resources (ONLY IF YOU ARE SURE!)

# ⚠️  WARNING: Uncomment the code below only if you want to delete everything!

"""
print("=== 🗑️  DELETING MEDALLION RESOURCES ===\n")

# List of tables to drop
tables_to_drop = [
    f"{SILVER_SCHEMA}.orders_clean",
    f"{SILVER_SCHEMA}.customers_type1",
    f"{SILVER_SCHEMA}.customers_type2",
    f"{GOLD_SCHEMA}.daily_sales_summary",
    f"{GOLD_SCHEMA}.customer_360"
]

print("Dropping tables...\n")
for table in tables_to_drop:
    full_table = f"{CATALOG}.{table}"
    try:
        spark.sql(f"DROP TABLE IF EXISTS {full_table}")
        print(f"  ✓ Dropped: {table}")
    except Exception as e:
        print(f"  ⚠️  Error for {table}: {e}")

print("\n✅ Cleanup finished!")
print("💡 All Medallion tables have been dropped")
print("💡 You can rerun the notebook from the start")
"""

print("⚠️  THE CLEANUP CODE IS COMMENTED OUT")
print("⚠️  Uncomment the code above only if you want to delete all resources")
print("\n💡 Recommendation: Keep the data for the upcoming notebooks and workshops!")
print("💡 Next notebook: 05_optimization_best_practices.ipynb")