# Bronze → Silver → Gold Pipeline - Demo

**Training objective:** Implement a complete end-to-end pipeline from Bronze through Silver to Gold.

**Topics covered:**
- Bronze: raw load + audit columns (ingest_ts, source_file, ingested_by)
- Silver: cleaning, deduplication, sanity checks, JSON flattening (from_json, explode)
- Gold: KPI modeling, aggregations (daily/weekly/monthly), star schema vs. denormalization
- End-to-end data lineage
- Performance monitoring per layer

## Context and requirements

- **Training day**: Day 2 - Lakehouse & Delta Lake
- **Notebook type**: Demo
- **Technical requirements**:
  - Databricks Runtime 13.0+ (recommended: 14.3 LTS)
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, CREATE SCHEMA, SELECT, MODIFY
  - Cluster: Standard with at least 2 workers

## Theoretical introduction

**Section goal:** Understand a complete data pipeline that implements the Medallion Architecture (Bronze → Silver → Gold).

---

### Medallion Architecture - Methodology

**Medallion Architecture** is a design pattern for lakehouse architecture that organizes data into three quality layers:

```
Raw Data Sources → [BRONZE] → [SILVER] → [GOLD] → BI/ML Consumers
```

#### 🥉 **BRONZE LAYER (Raw/Landing Zone)**

**Purpose:** Immutable landing zone for raw data

**Characteristics:**
- **1:1 copy of the source** - no business transformations
- **Multi-format support** - JSON, CSV, Parquet, Avro, XML
- **Audit metadata** - who loaded the data, when, and from where
- **Append-only** - history of all loads

**Key columns:**
- `_bronze_ingest_timestamp` - when the data landed in the lakehouse
- `_bronze_source_file` - which file the data came from
- `_bronze_ingested_by` - who or what loaded the data
- `_bronze_version` - schema/process version
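
The four audit columns can be illustrated without a cluster. This is a minimal plain-Python sketch, not the notebook's PySpark code; the helper `add_bronze_audit`, the sample record, and the user name are hypothetical and exist only to show that the raw values stay untouched while metadata is appended.

```python
from datetime import datetime, timezone


def add_bronze_audit(record, source_file, ingested_by, version=1):
    """Return a copy of `record` with the four Bronze audit columns added."""
    audited = dict(record)  # copy, so the raw business values are not modified
    audited["_bronze_ingest_timestamp"] = datetime.now(timezone.utc)
    audited["_bronze_source_file"] = source_file
    audited["_bronze_ingested_by"] = ingested_by
    audited["_bronze_version"] = version
    return audited


# Hypothetical raw record and loader identity
raw = {"order_id": "ORD-1", "total_amount": 120.5}
bronze_row = add_bronze_audit(raw, "orders/orders_batch.json", "demo_user")
print(sorted(bronze_row))
```

The original record is returned unchanged apart from the added keys, which is the property that makes Bronze usable for reprocessing.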

**Use cases:**
- Data recovery - ability to reprocess
- Audit trail for compliance (GDPR, SOX)
- Schema evolution without losing history

---

#### 🥈 **SILVER LAYER (Validated/Cleansed)**

**Purpose:** Clean, validated data ready for analysis

**Characteristics:**
- **Data quality enforcement** - reject/flag invalid records
- **Deduplication** - unique business keys
- **Standardization** - date formats, case sensitivity, trimming
- **Business rules** - business validations
- **Type casting** - correct data types

**Typical transformations:**
- Deduplication: `dropDuplicates(["business_key"])`
- NOT NULL validation: `filter(col("key").isNotNull())`
- Standardization: `withColumn("email", lower(trim(col("email"))))`
- Type casting: `withColumn("date", to_date(col("date_str")))`

**Data Quality Gates:**
- Rejection rate monitoring (alert if > threshold)
- Quality flags for suspicious records
- Logging invalid records for investigation
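
A quality gate of this kind can be sketched in plain Python. The function `apply_quality_gate`, the sample rows, and the 10% threshold are illustrative assumptions, not values from the notebook; the sketch rejects rows with a missing or repeated business key and raises an alert flag when the rejection rate exceeds the threshold.

```python
def apply_quality_gate(rows, key, max_rejection_pct):
    """Split rows into valid/rejected and report the rejection rate.

    A row is rejected when its business key is missing or already seen.
    """
    seen = set()
    valid, rejected = [], []
    for row in rows:
        value = row.get(key)
        if value is None or value in seen:
            rejected.append(row)  # kept for later investigation
            continue
        seen.add(value)
        valid.append(row)
    rejection_pct = 100.0 * len(rejected) / len(rows) if rows else 0.0
    alert = rejection_pct > max_rejection_pct
    return valid, rejected, rejection_pct, alert


# Hypothetical input: one duplicate key and one missing key
rows = [
    {"order_id": "O1", "total_amount": 10.0},
    {"order_id": "O1", "total_amount": 10.0},
    {"order_id": None, "total_amount": 5.0},
    {"order_id": "O2", "total_amount": 7.5},
]
valid, rejected, pct, alert = apply_quality_gate(rows, "order_id", 10.0)
print(f"valid={len(valid)} rejected={len(rejected)} rate={pct:.1f}% alert={alert}")
```

Returning the rejected rows instead of dropping them silently is what makes the "logging invalid records" point practical.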

---

#### 🥇 **GOLD LAYER (Business/Aggregates)**

**Purpose:** Business-level tables optimized for consumption

**Characteristics:**
- **Denormalization** - pre-computed joins for performance
- **Pre-aggregations** - daily, weekly, monthly summaries
- **KPI tables** - business metrics and calculations
- **Star schema** - fact tables + dimension tables

**Design Patterns:**

**A) Star Schema (Dimensional Model):**
```
              ┌────────────┐
              │  DIM_DATE  │
              └─────┬──────┘
                    │
      ┌─────────────┴─────────────┐
      │                           │
┌─────▼────────┐           ┌──────▼───────┐
│ DIM_CUSTOMER │◄──────────┤  FACT_ORDER  │
└──────────────┘           └──────┬───────┘
                                  │
                           ┌──────▼───────┐
                           │ DIM_PRODUCT  │
                           └──────────────┘
```

**B) Denormalized Wide Tables:**
- All dimensions merged into the fact table
- Eliminates joins at query time → better performance
- Trade-off: more storage vs. faster queries
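
The trade-off can be made concrete with a small plain-Python sketch of a pre-computed join. The function `denormalize` and the sample data are hypothetical; the lookup behaves like a left join, so orders with an unknown customer are kept with empty dimension attributes.

```python
def denormalize(orders, customers):
    """Copy customer attributes onto every order row (left-join semantics)."""
    wide = []
    for order in orders:
        dim = customers.get(order["customer_id"], {})
        wide.append({
            **order,
            "customer_name": dim.get("customer_name"),
            "country": dim.get("country"),
        })
    return wide


# Hypothetical dimension and fact data
customers = {"C1": {"customer_name": "Anna", "country": "PL"}}
orders = [
    {"order_id": "O1", "customer_id": "C1", "total_amount": 50.0},
    {"order_id": "O2", "customer_id": "C9", "total_amount": 20.0},  # unknown customer
]
wide = denormalize(orders, customers)
print(wide[0])
```

Every order row now repeats `customer_name` and `country`, which is the extra storage paid for join-free reads.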

---

### Data Flow & Lineage

```
┌──────────────────────────────────────────────────────────────┐
│                    SOURCE SYSTEMS                            │
│  • Transactional DBs  • APIs  • Files  • Streaming           │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────┐
│                    BRONZE LAYER                              │
│  • Raw ingestion (COPY INTO, Auto Loader)                    │
│  • Multi-format: JSON, CSV, Parquet                          │
│  • Audit columns: _bronze_ingest_timestamp, _source_file     │
│  • Immutable: append-only                                    │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────┐
│                    SILVER LAYER                              │
│  • Data quality checks & validation                          │
│  • Deduplication per business key                            │
│  • Standardization: dates, text, formats                     │
│  • Type casting & business rules                             │
│  • Rejection rate monitoring                                 │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────┐
│                    GOLD LAYER                                │
│  • Star schema: Fact + Dimensions                            │
│  • Denormalized tables (pre-computed joins)                  │
│  • Pre-aggregated summaries (daily, monthly)                 │
│  • KPI calculations & business metrics                       │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────┐
│                    CONSUMPTION                               │
│  • BI Dashboards (Power BI, Tableau)                         │
│  • ML Models (Feature Store)                                 │
│  • Ad-hoc Analytics (SQL)                                    │
│  • Data Apps                                                 │
└──────────────────────────────────────────────────────────────┘
```

---

### Star Schema - Example for e-commerce

In this notebook we will build the following star schema:

```
                    ┌──────────────────┐
                    │   DIM_TIME       │
                    │                  │
                    │ • order_date     │
                    │ • order_year     │
                    │ • order_month    │
                    │ • order_quarter  │
                    │ • day_of_week    │
                    └────────┬─────────┘
                             │
                             │ 1:N
                             │
    ┌────────────────────────┼──────────────────────┐
    │                        │                      │
    │                        ▼                      │
┌───┴─────────────┐   ┌─────────────────┐    ┌──────┴────────┐
│  DIM_CUSTOMER   │   │  FACT_ORDER     │    │ DIM_PRODUCT   │
│                 │   │  (Central)      │    │               │
│ • customer_id   │◄──┤                 ├───►│ • product_id  │
│ • customer_name │N:1│ • order_id (PK) │1:N │ • product_name│
│ • country       │   │ • customer_id   │    │ • category    │
│ • email         │   │ • product_id    │    │ • price       │
└─────────────────┘   │ • order_date    │    └───────────────┘
                      │ • total_amount  │
                      │ • payment_method│
                      │ • is_high_value │
                      └─────────────────┘
```

**Relationships:**
- **FACT_ORDER** (central fact table) - each row = one transaction
- **DIM_CUSTOMER** → FACT_ORDER: 1:N (one customer, many orders)
- **DIM_PRODUCT** → FACT_ORDER: 1:N (one product, many orders)
- **DIM_TIME** → FACT_ORDER: 1:N (one date, many orders)

**Why Star Schema?**
1. **Performance**: Simple joins that BI tools can optimize easily
2. **Readability**: Intuitive structure (fact = event, dim = context)
3. **Flexibility**: New dimensions are easy to add; the fact table only needs a new foreign key
4. **Aggregations**: BI tools can easily group by dimensions

---

### Why does this matter?

A production data pipeline must:
1. **Handle diverse sources** - JSON API, CSV exports, Parquet dumps
2. **Ensure data quality** - validations, rejections, monitoring
3. **Optimize for consumption** - denormalization, pre-aggregations
4. **Enable auditing** - data lineage from source to dashboard
5. **Scale** - from MB to PB of data

**In this notebook we will learn to:**
- Build a complete Bronze → Silver → Gold pipeline
- Implement data quality gates in Silver
- Design a star schema in Gold
- Monitor pipeline health

## Per-user isolation

Run the initialization script that sets up per-user isolation of catalogs and schemas:

In [0]:
%run ../00_setup

## Configuration

Import libraries and set up the configuration variables:

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql.window import Window
from datetime import datetime, timedelta

# Display the user context
print("=== User context ===")
print(f"Catalog: {CATALOG}")
print(f"Schema Bronze: {BRONZE_SCHEMA}")
print(f"Schema Silver: {SILVER_SCHEMA}")
print(f"Schema Gold: {GOLD_SCHEMA}")
print(f"User: {raw_user}")

# Set the catalog as the default
spark.sql(f"USE CATALOG {CATALOG}")

# Paths to the source data
ORDERS_JSON = f"{DATASET_BASE_PATH}/orders/orders_batch.json"
CUSTOMERS_CSV = f"{DATASET_BASE_PATH}/customers/customers.csv"
PRODUCTS_PARQUET = f"{DATASET_BASE_PATH}/products/products.parquet"

print(f"\n=== Data paths ===")
print(f"Orders: {ORDERS_JSON}")
print(f"Customers: {CUSTOMERS_CSV}")
print(f"Products: {PRODUCTS_PARQUET}")

---

## Section 1: Bronze Layer - Raw Data Ingestion

**Theoretical introduction:**

The Bronze layer accepts raw data from various sources and formats. The key step is adding audit metadata for data lineage and troubleshooting.

**Key operations:**
- Reading from different formats (JSON, CSV, Parquet)
- Adding audit columns: ingest_timestamp, source_file, ingested_by
- Writing to Delta without transforming business values
- Versioning for incremental loads

**Practical applications:**
- Immutable landing zone - ability to reprocess
- Audit trail for compliance
- Multiple source formats in one pipeline

### Example 1.1: Bronze - Orders (JSON)

**Goal:** Ingest orders from JSON into Bronze with audit metadata

In [0]:
# Example 1.1 - Bronze Orders (part 1: load)

spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

# Set the table variable
bronze_orders_table = f"{BRONZE_SCHEMA}.orders_bronze"

# Load raw orders from JSON
orders_raw = (
    spark.read
    .format("json")
    .option("multiLine", "true")
    .load(ORDERS_JSON)
)

print("=== Raw Orders Schema ===")
orders_raw.printSchema()
print(f"\n✓ Loaded {orders_raw.count()} records from JSON")

In [0]:
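# Example 1.1 - Bronze Orders (part 2: audit metadata)

# Add Bronze audit metadata
# Note: on some Unity Catalog cluster access modes input_file_name() may be
# unavailable; the _metadata.file_path column is the documented alternative.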
orders_bronze = (
    orders_raw
    .withColumn("_bronze_ingest_timestamp", F.current_timestamp())
    .withColumn("_bronze_source_file", F.input_file_name())
    .withColumn("_bronze_ingested_by", F.lit(raw_user))
    .withColumn("_bronze_version", F.lit(1))
)

print("=== Bronze Orders Schema (with audit columns) ===")
orders_bronze.printSchema()

In [0]:
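# Example 1.1 - Bronze Orders (part 3: write to Delta)

# Write to the Bronze table
# Note: mode("overwrite") keeps this demo re-runnable; an append-only Bronze
# layer, as described in the introduction, would use mode("append").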
(
    orders_bronze
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(bronze_orders_table)
)

print(f"\n✓ Bronze Orders saved: {bronze_orders_table}")
print(f"Record count: {spark.table(bronze_orders_table).count()}")

# Data preview
display(spark.table(bronze_orders_table).limit(5))

### Example 1.2: Bronze - Customers (CSV)

**Goal:** Ingest customer data from CSV into Bronze with audit metadata

In [0]:
# Example 1.2 - Bronze Customers (part 1: load from CSV)

bronze_customers_table = f"{BRONZE_SCHEMA}.customers_bronze"

# Load Customers from CSV
customers_raw = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(CUSTOMERS_CSV)
)

print("=== Raw Customers Schema ===")
customers_raw.printSchema()
print(f"\n✓ Loaded {customers_raw.count()} records from CSV")

In [0]:
# Example 1.2 - Bronze Customers (part 2: audit metadata)

# Add Bronze audit metadata
customers_bronze = (
    customers_raw
    .withColumn("_bronze_ingest_timestamp", F.current_timestamp())
    .withColumn("_bronze_source_file", F.input_file_name())
    .withColumn("_bronze_ingested_by", F.lit(raw_user))
    .withColumn("_bronze_version", F.lit(1))
)

print("=== Bronze Customers Schema (with audit columns) ===")
customers_bronze.printSchema()

In [0]:
# Example 1.2 - Bronze Customers (part 3: write to Delta)

# Write to the Bronze table
(
    customers_bronze
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(bronze_customers_table)
)

print(f"\n✓ Bronze Customers saved: {bronze_customers_table}")
print(f"Record count: {spark.table(bronze_customers_table).count()}")
display(spark.table(bronze_customers_table).limit(5))

### Example 1.3: Bronze - Products (Parquet)

**Goal:** Ingest products from Parquet into Bronze with audit metadata

In [0]:
# Example 1.3 - Bronze Products (part 1: load from Parquet)

bronze_products_table = f"{BRONZE_SCHEMA}.products_bronze"

# Load Products from Parquet
products_raw = (
    spark.read
    .format("parquet")
    .load(PRODUCTS_PARQUET)
)

print("=== Raw Products Schema ===")
products_raw.printSchema()
print(f"\n✓ Loaded {products_raw.count()} records from Parquet")

In [0]:
# Example 1.3 - Bronze Products (part 2: audit metadata)

# Add Bronze audit metadata
products_bronze = (
    products_raw
    .withColumn("_bronze_ingest_timestamp", F.current_timestamp())
    .withColumn("_bronze_source_file", F.input_file_name())
    .withColumn("_bronze_ingested_by", F.lit(raw_user))
    .withColumn("_bronze_version", F.lit(1))
)

print("=== Bronze Products Schema (with audit columns) ===")
products_bronze.printSchema()

In [0]:
# Example 1.3 - Bronze Products (part 3: write to Delta)

# Write to the Bronze table
(
    products_bronze
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(bronze_products_table)
)

print(f"\n✓ Bronze Products saved: {bronze_products_table}")
print(f"Record count: {spark.table(bronze_products_table).count()}")
display(spark.table(bronze_products_table).limit(5))

# Bronze Layer Summary
print("\n" + "=" * 70)
print("BRONZE LAYER SUMMARY")
print("=" * 70)
print(f"✓ Orders:    {spark.table(bronze_orders_table).count():,} records")
print(f"✓ Customers: {spark.table(bronze_customers_table).count():,} records")
print(f"✓ Products:  {spark.table(bronze_products_table).count():,} records")
print("=" * 70)

---

## Section 2: Silver Layer - Cleansing & Validation

**Theoretical introduction:**

The Silver layer performs data quality checks, deduplication, standardization, and flattening of nested structures. This is the layer where business rules are enforced.

**Key transformations:**
- Deduplication by business key
- Validation of NOT NULL, data types, ranges
- Standardization: dates, case sensitivity, formats
- JSON flattening for nested structures

**Data Quality Gates:**
- Reject invalid records (or flag them)
- Log data quality metrics
- Monitor rejection rates

### Example 2.1: Silver Orders - Cleansing & Validation

**Goal:** Transform Bronze Orders → Silver with quality checks

In [0]:
# Example 2.1 - Silver Orders (part 1: deduplication and NOT NULL validation)

spark.sql(f"USE SCHEMA {SILVER_SCHEMA}")

# Load from Bronze
orders_bronze_df = spark.table(bronze_orders_table)

print(f"=== Bronze Orders - Input ===")
print(f"Record count: {orders_bronze_df.count()}")

# Deduplicate by business key (dropDuplicates keeps an arbitrary row per order_id)
orders_deduped = orders_bronze_df.dropDuplicates(["order_id"])
print(f"\n✓ After deduplication: {orders_deduped.count()} records")

# NOT NULL validation
orders_validated = (
    orders_deduped
    .filter(F.col("order_id").isNotNull())
    .filter(F.col("customer_id").isNotNull())
    .filter(F.col("product_id").isNotNull())
)

print(f"✓ After NOT NULL validation: {orders_validated.count()} records")

In [0]:
# Example 2.1 - Silver Orders (part 2: business validation and type casting)

# Business validation: the amount must be > 0
orders_business_validated = (
    orders_validated
    .filter(F.col("total_amount") > 0)
)

print(f"=== Business validation ===")
print(f"✓ After total_amount > 0 validation: {orders_business_validated.count()} records")

# Type casting and standardization
orders_typed = (
    orders_business_validated
    
    # Parse the date from order_datetime
    .withColumn("order_date", F.to_date(F.col("order_datetime")))
    .withColumn("order_timestamp", F.to_timestamp(F.col("order_datetime")))
    
    # Standardize payment_method (uppercase, trim)
    .withColumn("payment_method", F.upper(F.trim(F.col("payment_method"))))
    
    # Type casting for consistency
    .withColumn("total_amount", F.col("total_amount").cast("decimal(10,2)"))
    .withColumn("quantity", F.col("quantity").cast("integer"))
)

print("\n=== Silver Orders Schema (after type casting) ===")
orders_typed.printSchema()

In [0]:
# Example 2.1 - Silver Orders (part 3: business logic and categories)

# Add business logic
orders_silver = (
    orders_typed
    
    # Categorize amounts (business logic)
    .withColumn(
        "order_value_category",
        F.when(F.col("total_amount") < 100, "LOW")
         .when(F.col("total_amount") < 500, "MEDIUM")
         .otherwise("HIGH")
    )
    
    # Compute the unit price
    .withColumn(
        "unit_price",
        F.when(F.col("quantity") > 0, F.col("total_amount") / F.col("quantity"))
         .otherwise(F.lit(None))
    )
    
    # Silver metadata
    .withColumn("_silver_processed_timestamp", F.current_timestamp())
    .withColumn("_data_quality_flag", F.lit("VALID"))
)

print("=== Silver Orders - Final Schema ===")
orders_silver.printSchema()
print(f"\n✓ Silver Orders ready to write: {orders_silver.count()} records")

In [0]:
# Example 2.1 - Silver Orders (part 4: quality metrics and write)

# Quality metrics
# Note: "rejected" here also counts duplicate rows removed by dropDuplicates
bronze_count = orders_bronze_df.count()
silver_count = orders_silver.count()
rejected_count = bronze_count - silver_count
rejection_rate = (rejected_count / bronze_count * 100) if bronze_count > 0 else 0

print("=" * 70)
print("SILVER ORDERS - DATA QUALITY METRICS")
print("=" * 70)
print(f"Bronze input:  {bronze_count:,} records")
print(f"Silver output: {silver_count:,} records")
print(f"Rejected:      {rejected_count:,} records ({rejection_rate:.2f}%)")
print("=" * 70)

# Write to Silver
silver_orders_table = f"{SILVER_SCHEMA}.orders_silver"

(
    orders_silver
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(silver_orders_table)
)

print(f"\n✓ Silver Orders saved: {silver_orders_table}")
display(spark.table(silver_orders_table).limit(5))
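
The categorization and unit-price rules from part 3 can be checked at their boundaries with a plain-Python sketch. The function `enrich_order` is a hypothetical stand-in for the two `F.when(...)` expressions, using the same thresholds (100 and 500) and the same guard against a zero or missing quantity.

```python
from decimal import Decimal


def enrich_order(total_amount, quantity):
    """Return (order_value_category, unit_price) for one order."""
    if total_amount < 100:
        category = "LOW"
    elif total_amount < 500:
        category = "MEDIUM"
    else:
        category = "HIGH"
    # The unit price is defined only for a positive quantity
    if quantity is not None and quantity > 0:
        unit_price = total_amount / quantity
    else:
        unit_price = None
    return category, unit_price


print(enrich_order(Decimal("99.99"), 3))
print(enrich_order(Decimal("100.00"), 0))
print(enrich_order(Decimal("500.00"), None))
```

Exactly 100 falls into MEDIUM and exactly 500 into HIGH, because both comparisons are strict.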

### Example 2.2: Silver Customers & Products

**Goal:** Cleanse the dimension tables

In [0]:
# Example 2.2 - Silver Customers (part 1: deduplication and validation)

customers_bronze_df = spark.table(bronze_customers_table)
silver_customers_table = f"{SILVER_SCHEMA}.customers_silver"

print(f"=== Bronze Customers - Input ===")
print(f"Record count: {customers_bronze_df.count()}")

# Deduplication and validation
customers_clean = (
    customers_bronze_df
    .dropDuplicates(["customer_id"])
    .filter(F.col("customer_id").isNotNull())
    .filter(F.col("customer_name").isNotNull())
)

print(f"\n✓ After deduplication and validation: {customers_clean.count()} records")

# Products (minimal cleaning - the data is already of good quality)
# Note: Example 2.3 rebuilds and overwrites this table with fuller cleaning.
products_bronze_df = spark.table(bronze_products_table)

products_silver = (
    products_bronze_df
    .dropDuplicates(["product_id"])
    .filter(F.col("product_id").isNotNull())
    .withColumn("_silver_processed_timestamp", F.current_timestamp())
)

silver_products_table = f"{SILVER_SCHEMA}.products_silver"

(
    products_silver
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(silver_products_table)
)

print(f"\n✓ Silver Products: {silver_products_table}")
print(f"Record count: {spark.table(silver_products_table).count()}")

In [0]:
# Example 2.2 - Silver Customers (part 2: standardization and email validation)

# Standardization
customers_standardized = (
    customers_clean
    
    # Standardize text fields
    .withColumn("customer_name", F.trim(F.col("customer_name")))
    .withColumn("email", F.lower(F.trim(F.col("email"))))
    .withColumn("country", F.upper(F.trim(F.col("country"))))
    
    # Email validation (basic regex pattern)
    .withColumn(
        "email_valid",
        F.when(
            F.col("email").rlike(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"),
            True
        ).otherwise(False)
    )
)

print("=== Silver Customers - After standardization ===")
print(f"Record count: {customers_standardized.count()}")
print("\nSample data:")
display(customers_standardized.select("customer_id", "customer_name", "email", "email_valid", "country").limit(5))

In [0]:
# Example 2.2 - Silver Customers (part 3: metadata and write)

# Add Silver metadata
customers_silver = (
    customers_standardized
    .withColumn("_silver_processed_timestamp", F.current_timestamp())
    .withColumn("_data_quality_flag", F.lit("VALID"))
)

# Write to Silver
(
    customers_silver
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(silver_customers_table)
)

print(f"✓ Silver Customers saved: {silver_customers_table}")
print(f"Record count: {spark.table(silver_customers_table).count()}")

# Email validation stats
email_stats = spark.table(silver_customers_table).groupBy("email_valid").count()
print("\n=== Email Validation Stats ===")
display(email_stats)
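
The email rule can be tried outside Spark with the same pattern. This is a plain-Python sketch under the assumption that Python's `re` module treats this particular pattern the same way as Spark's `rlike`; the helper `is_valid_email` and the sample addresses are hypothetical. As in the cell above, the value is trimmed and lowercased first, and a missing email counts as invalid.

```python
import re

# Same basic pattern as in the Silver Customers cell
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def is_valid_email(raw):
    """Standardize (trim, lowercase) and validate one email value."""
    if raw is None:
        return False
    return EMAIL_PATTERN.match(raw.strip().lower()) is not None


for sample in ["  Anna@Example.COM ", "no-at-sign.example.com", "user@host", None]:
    print(repr(sample), is_valid_email(sample))
```

The pattern is deliberately basic: it checks the overall shape of an address, not whether the domain exists.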

### Example 2.3: Silver Products - Minimal Cleaning

**Goal:** Cleanse the products (the data is already of good quality)

In [0]:
# Example 2.3 - Silver Products

products_bronze_df = spark.table(bronze_products_table)
silver_products_table = f"{SILVER_SCHEMA}.products_silver"

print(f"=== Bronze Products - Input ===")
print(f"Record count: {products_bronze_df.count()}")

# Minimal cleaning (the Parquet data is already of good quality)
products_silver = (
    products_bronze_df
    .dropDuplicates(["product_id"])
    .filter(F.col("product_id").isNotNull())
    .filter(F.col("product_name").isNotNull())
    
    # Standardize category
    .withColumn("category", F.upper(F.trim(F.col("category"))))
    
    # Type casting for price
    .withColumn("price", F.col("price").cast("decimal(10,2)"))
    
    # Silver metadata
    .withColumn("_silver_processed_timestamp", F.current_timestamp())
    .withColumn("_data_quality_flag", F.lit("VALID"))
)

# Write to Silver
(
    products_silver
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(silver_products_table)
)

print(f"\n✓ Silver Products saved: {silver_products_table}")
print(f"Record count: {spark.table(silver_products_table).count()}")
display(spark.table(silver_products_table).limit(5))

# Silver Layer Summary
print("\n" + "=" * 70)
print("SILVER LAYER SUMMARY")
print("=" * 70)
print(f"✓ Orders:    {spark.table(silver_orders_table).count():,} records")
print(f"✓ Customers: {spark.table(silver_customers_table).count():,} records")
print(f"✓ Products:  {spark.table(silver_products_table).count():,} records")
print("=" * 70)

---

## Section 3: Gold Layer - Business Modeling & Star Schema

**Theoretical introduction:**

The Gold layer creates business-level aggregates and KPI tables optimized for consumption by BI tools, dashboards, and ML models.

---

### Star Schema - Dimensional Modeling

**Star Schema** is one of the most widely used design patterns in data warehousing. It organizes data into:
- **Fact Tables** - business events, transactions, measurements
- **Dimension Tables** - business context for the facts

**Why "Star"?**
The diagram resembles a star: the fact table sits in the center, with dimension tables around it.

```
                 DIM_TIME
                    │
                    │ 1:N
                    │
DIM_CUSTOMER ── FACT_ORDER ── DIM_PRODUCT
                    │
                    │ 1:N
                    │
                DIM_REGION
```

---

### Our Star Schema implementation for e-commerce

In this notebook we will build the following model:

#### **FACT_ORDER** (Central Fact Table)
**Type:** Transaction fact table (each row = one transaction)

**Columns:**
- **Keys:** `order_id` (PK), `customer_id` (FK), `product_id` (FK)
- **Measures:** `total_amount`, `quantity`, `unit_price`
- **Time dimensions:** `order_date`, `order_year`, `order_month`, `order_quarter`, `day_of_week`
- **Flags:** `is_high_value`, `order_value_category`, `payment_method`
- **Denormalized dimensions:** `customer_name`, `country` (from DIM_CUSTOMER)

**Grain:** One transaction = one product in one order

---

#### **DIM_CUSTOMER** (Dimension Table)
**Type:** Slowly Changing Dimension (SCD Type 1 - overwrite)

**Columns:**
- `customer_id` (PK)
- `customer_name`
- `email`, `email_valid`
- `country`
- `_silver_processed_timestamp`

**Relationship to FACT_ORDER:** 1:N (one customer, many orders)

---

#### **DIM_PRODUCT** (Dimension Table)
**Type:** Slowly Changing Dimension (SCD Type 1)

**Columns:**
- `product_id` (PK)
- `product_name`
- `category`
- `price`
- `_silver_processed_timestamp`

**Relationship to FACT_ORDER:** 1:N (one product, many orders)

---

#### **DIM_TIME** (Dimension Table - implicit)
In our case the time dimensions are denormalized into FACT_ORDER:
- `order_date` (date)
- `order_year`, `order_month`, `order_quarter` (partitioning keys)
- `order_day_of_week` (for weekly patterns)

**Relationship to FACT_ORDER:** 1:N (one date, many orders)
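
Deriving these time attributes from `order_date` is simple arithmetic, sketched below in plain Python. The helper `time_dimensions` is hypothetical, and the weekday numbering is an assumption: `isoweekday()` counts 1 = Monday to 7 = Sunday, whereas Spark's `dayofweek` counts 1 = Sunday to 7 = Saturday, so the two are not interchangeable.

```python
from datetime import date


def time_dimensions(order_date):
    """Derive the denormalized time attributes for one order date."""
    return {
        "order_date": order_date,
        "order_year": order_date.year,
        "order_month": order_date.month,
        "order_quarter": (order_date.month - 1) // 3 + 1,
        "day_of_week": order_date.isoweekday(),  # 1 = Monday ... 7 = Sunday
    }


dims = time_dimensions(date(2024, 11, 15))
print(dims)
```

November falls in the fourth quarter because months 10 to 12 map to `(month - 1) // 3 + 1 == 4`.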

---

### Denormalization vs Normalization

**In this notebook we use denormalization:**

✅ **Denormalization (our approach):**
```
FACT_ORDER (denormalized):
- order_id, customer_id, customer_name, country, product_id, 
  order_date, total_amount, payment_method, ...
```

**Advantages:**
- ✅ Fast queries (no joins)
- ✅ Better performance in BI tools
- ✅ Simpler SQL for analysts

**Disadvantages:**
- ⚠️ More storage (customer_name and country are duplicated)
- ⚠️ Update complexity (a change must be applied in many places)

---

**Normalization (classic star schema):**
```
FACT_ORDER (normalized):
- order_id, customer_id, product_id, order_date, total_amount, ...

DIM_CUSTOMER:
- customer_id, customer_name, country, ...

DIM_PRODUCT:
- product_id, product_name, category, price, ...
```

**Advantages:**
- ✅ Less storage
- ✅ Easier dimension updates

**Disadvantages:**
- ⚠️ Requires joins at query time

---

### Pre-aggregated Summary Tables

In addition to the fact table, we create pre-aggregated summaries:

**1. DAILY_SALES_SUMMARY**
- **Grain:** day + country + payment_method
- **Measures:** total_orders, total_revenue, avg_order_value, unique_customers
- **Use case:** Daily sales dashboards

**2. MONTHLY_SALES_SUMMARY**
- **Grain:** month + country
- **Measures:** total_orders, total_revenue, avg_order_value
- **Use case:** Monthly business reviews

**3. CUSTOMER_ANALYTICS**
- **Grain:** customer_id
- **Measures:** lifetime_value, total_orders, customer_segment
- **Use case:** Customer segmentation, retention analysis
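
The grain of a summary table decides which rows collapse into one. The plain-Python sketch below aggregates to the DAILY_SALES_SUMMARY grain; the function `daily_sales_summary` and the sample fact rows are hypothetical, and the measure names follow the list above.

```python
from collections import defaultdict


def daily_sales_summary(fact_rows):
    """Aggregate fact rows to the grain: day + country + payment_method."""
    groups = defaultdict(list)
    for row in fact_rows:
        grain = (row["order_date"], row["country"], row["payment_method"])
        groups[grain].append(row)

    summary = []
    for (order_date, country, payment_method), rows in sorted(groups.items()):
        revenue = sum(r["total_amount"] for r in rows)
        summary.append({
            "order_date": order_date,
            "country": country,
            "payment_method": payment_method,
            "total_orders": len(rows),
            "total_revenue": revenue,
            "avg_order_value": revenue / len(rows),
            "unique_customers": len({r["customer_id"] for r in rows}),
        })
    return summary


# Hypothetical fact rows: two PL orders from one customer, one DE order
fact_rows = [
    {"order_date": "2024-01-01", "country": "PL", "payment_method": "CARD",
     "customer_id": "C1", "total_amount": 100.0},
    {"order_date": "2024-01-01", "country": "PL", "payment_method": "CARD",
     "customer_id": "C1", "total_amount": 50.0},
    {"order_date": "2024-01-01", "country": "DE", "payment_method": "CARD",
     "customer_id": "C2", "total_amount": 30.0},
]
summary = daily_sales_summary(fact_rows)
for line in summary:
    print(line)
```

Note that `unique_customers` cannot be summed across days to get a monthly figure, which is one reason the monthly summary is computed from the fact table rather than from the daily one.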

---

### Relationships between tables

```
┌──────────────────────────────────────────────────────────────┐
│                     GOLD LAYER SCHEMA                        │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────┐
│   DIM_CUSTOMER       │ 1:N Relationship
│                      │────────────┐
│ • customer_id (PK)   │            │
│ • customer_name      │            │
│ • email              │            ▼
│ • country            │     ┌──────────────────────┐
└──────────────────────┘     │   FACT_ORDER         │
                             │   (Central)          │
┌──────────────────────┐     │                      │
│   DIM_PRODUCT        │     │ • order_id (PK)      │
│                      │     │ • customer_id (FK)   │◄─── Foreign Key
│ • product_id (PK)    │◄────│ • product_id (FK)    │◄─── Foreign Key
│ • product_name       │ 1:N │ • order_date         │
│ • category           │     │ • total_amount       │
│ • price              │     │ • payment_method     │
└──────────────────────┘     │ • is_high_value      │
                             └──────────────────────┘
                                      │
                                      │ Source for
                                      │ aggregations
                                      ▼
            ┌─────────────────────────────────────────────┐
            │     PRE-AGGREGATED SUMMARY TABLES           │
            ├─────────────────────────────────────────────┤
            │ • DAILY_SALES_SUMMARY                       │
            │ • MONTHLY_SALES_SUMMARY                     │
            │ • CUSTOMER_ANALYTICS                        │
            └─────────────────────────────────────────────┘
```

**Key relationship rules:**
1. **FACT_ORDER.customer_id** → **DIM_CUSTOMER.customer_id** (N:1: many orders reference one customer)
2. **FACT_ORDER.product_id** → **DIM_PRODUCT.product_id** (N:1: many orders reference one product)
3. **FACT_ORDER** is the source for all summary tables

---

### Key operations in the Gold Layer

**Key transformations:**
- **Joins** between fact and dimension tables
- **Denormalization** (pre-compute joins for performance)
- **Aggregations:** daily, weekly, monthly summaries
- **KPI calculations:** lifetime value, customer segments
- **Time dimensions:** year, month, quarter, day_of_week

**Design patterns:**
- ✅ Denormalized fact tables for BI performance
- ✅ Pre-aggregated summary tables at different grains
- ✅ Customer-level analytics for segmentation
- ✅ Partitioning by date for query performance

### Example 3.1: Gold - Order Fact Table (Denormalized)

**Goal:** Build a denormalized fact table by joining orders to the dimensions

In [0]:
# Example 3.1 - Gold Order Fact Table (part 1: load the Silver tables)

spark.sql(f"USE SCHEMA {GOLD_SCHEMA}")

gold_order_fact_table = f"{GOLD_SCHEMA}.order_fact"

# Load the Silver tables
orders_silver_df = spark.table(silver_orders_table)
customers_silver_df = spark.table(silver_customers_table)
products_silver_df = spark.table(silver_products_table)

print("=== Silver Tables Loaded ===")
print(f"Orders:    {orders_silver_df.count():,} records")
print(f"Customers: {customers_silver_df.count():,} records")
print(f"Products:  {products_silver_df.count():,} records")

In [0]:
# Example 3.1 - Gold Order Fact Table (part 2: join with the Customer dimension)

# Prepare the Customer dimension (select only the columns needed)
dim_customer = customers_silver_df.select(
    F.col("customer_id").alias("cust_id"),
    F.col("customer_name"),
    F.col("country"),
    F.col("email_valid")
)

# Join Orders with the Customer dimension (denormalization)
order_with_customer = (
    orders_silver_df
    .join(
        dim_customer,
        orders_silver_df.customer_id == F.col("cust_id"),
        "left"  # LEFT JOIN - keep all orders, even those without a customer match
    )
    .drop("cust_id")  # Drop the aliased join key
)

print("=== After Customer Join ===")
print(f"Record count: {order_with_customer.count():,}")

# Check for unmatched orders (assumes customer_name is never null in Silver)
unmatched = order_with_customer.filter(F.col("customer_name").isNull())
print(f"⚠️  Unmatched orders (no customer): {unmatched.count()}")

In [0]:
# Example 3.1 - Gold Order Fact Table (part 3: join with the Product dimension)

# Prepare the Product dimension (select only the columns needed)
dim_product = products_silver_df.select(
    F.col("product_id").alias("prod_id"),
    F.col("product_name"),
    F.col("category"),
    F.col("price")
)

# Join with the Product dimension (denormalization)
order_with_dimensions = (
    order_with_customer
    .join(
        dim_product,
        order_with_customer.product_id == F.col("prod_id"),
        "left"  # LEFT JOIN - keep all orders, even those without a product match
    )
    .drop("prod_id")  # Drop the aliased join key
)

print("=== After Product Join ===")
print(f"Record count: {order_with_dimensions.count():,}")

# Check for unmatched products (assumes product_name is never null in Silver)
unmatched_products = order_with_dimensions.filter(F.col("product_name").isNull())
print(f"⚠️  Unmatched orders (no product): {unmatched_products.count()}")

In [0]:
# Example 3.1 - Gold Order Fact Table (part 4: time dimensions)

# Add time dimensions (for partitioning and time-based analytics)
order_with_time = (
    order_with_dimensions
    
    # Time dimensions derived from order_date
    .withColumn("order_year", F.year("order_date"))
    .withColumn("order_month", F.month("order_date"))
    .withColumn("order_quarter", F.quarter("order_date"))
    .withColumn("order_day_of_week", F.dayofweek("order_date"))  # 1 = Sunday, 7 = Saturday
    
    # Add month and day names for readability
    .withColumn("order_month_name", F.date_format("order_date", "MMMM"))
    .withColumn("order_day_name", F.date_format("order_date", "EEEE"))
)

print("=== Time Dimensions Added ===")
print("\nSample time dimensions:")
display(
    order_with_time
    .select("order_date", "order_year", "order_month", "order_month_name", 
            "order_quarter", "order_day_name", "order_day_of_week")
    .limit(5)
)
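
The `dayofweek` convention (1 = Sunday, 7 = Saturday) differs from Python's `isoweekday` (1 = Monday, 7 = Sunday), which is easy to get wrong when checking results by hand. The following is a minimal plain-Python sketch, not part of the pipeline, that reproduces the numeric time dimensions for a single date; the function name `time_dimensions` is a hypothetical helper introduced only for illustration.

```python
from datetime import date


def time_dimensions(d: date) -> dict:
    """Derive the numeric calendar attributes of the cell above in plain Python."""
    return {
        "order_year": d.year,
        "order_month": d.month,
        "order_quarter": (d.month - 1) // 3 + 1,
        # Spark dayofweek: 1 = Sunday ... 7 = Saturday.
        # Python isoweekday: 1 = Monday ... 7 = Sunday, hence the remapping.
        "order_day_of_week": d.isoweekday() % 7 + 1,
    }


sample = time_dimensions(date(2024, 3, 10))  # 10 March 2024 is a Sunday
print(sample)
```

A helper like this can serve as an independent reference when spot-checking a few rows of the `display` output.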

In [0]:
# Example 3.1 - Gold Order Fact Table (part 5: KPI calculations)

# Add business KPIs and calculated fields
order_fact = (
    order_with_time
    
    # KPI: high-value flag (orders >= 500); null amounts are flagged False
    .withColumn(
        "is_high_value",
        F.when(F.col("total_amount") >= 500, True).otherwise(False)
    )
    
    # KPI: order amount relative to the product price
    # (null when price is missing or not positive, which avoids division by zero)
    .withColumn(
        "revenue_vs_price_ratio",
        F.when(F.col("price") > 0, F.col("total_amount") / F.col("price"))
         .otherwise(F.lit(None))
    )
    
    # Gold metadata
    .withColumn("_gold_created_timestamp", F.current_timestamp())
    .withColumn("_gold_table_name", F.lit("order_fact"))
)

print("=== KPI Calculations Complete ===")
print(f"Record count: {order_fact.count():,}")
print("\nKPI distribution:")
display(
    order_fact
    .groupBy("is_high_value", "order_value_category")
    .agg(
        F.count("*").alias("order_count"),
        F.sum("total_amount").alias("total_revenue")
    )
    .orderBy(F.col("total_revenue").desc())
)

In [0]:
# Example 3.1 - Gold Order Fact Table (part 6: final selection and write)

# Select the final columns for the Fact Table
order_fact_final = order_fact.select(
    # Primary Key
    "order_id",
    
    # Foreign Keys (relationships to dimensions)
    "customer_id",
    "product_id",
    
    # Denormalized Customer dimension
    "customer_name",
    "country",
    "email_valid",
    
    # Denormalized Product dimension
    "product_name",
    "category",
    "price",
    
    # Time dimensions
    "order_date",
    "order_timestamp",
    "order_year",
    "order_month",
    "order_month_name",
    "order_quarter",
    "order_day_of_week",
    "order_day_name",
    
    # Measures
    "total_amount",
    "quantity",
    "unit_price",
    
    # Flags & Categories
    "payment_method",
    "order_value_category",
    "is_high_value",
    "revenue_vs_price_ratio",
    
    # Metadata
    "_gold_created_timestamp",
    "_gold_table_name"
)

print("=== Gold Order Fact - Final Schema ===")
order_fact_final.printSchema()

# Write to Gold
(
    order_fact_final
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(gold_order_fact_table)
)

print(f"\n✅ Gold Order Fact saved: {gold_order_fact_table}")
print(f"Record count: {spark.table(gold_order_fact_table).count():,}")
display(spark.table(gold_order_fact_table).limit(5))

### Example 3.2: Gold - Aggregated Summary Tables

**Goal:** Pre-aggregated tables for dashboards and reports

In [0]:
# Example 3.2 - Gold Daily Sales Summary (part 1: aggregation)

order_fact_df = spark.table(gold_order_fact_table)
gold_daily_summary_table = f"{GOLD_SCHEMA}.daily_sales_summary"

print("=== Daily Sales Summary - Aggregation ===")

# Daily aggregation: day + country + payment_method
daily_sales_summary = (
    order_fact_df
    .groupBy("order_date", "country", "payment_method")
    .agg(
        # Order metrics
        F.count("order_id").alias("total_orders"),
        F.countDistinct("customer_id").alias("unique_customers"),
        
        # Revenue metrics
        F.sum("total_amount").alias("total_revenue"),
        F.avg("total_amount").alias("avg_order_value"),
        F.min("total_amount").alias("min_order_value"),
        F.max("total_amount").alias("max_order_value"),
        
        # Product metrics
        F.sum("quantity").alias("total_quantity"),
        F.countDistinct("product_id").alias("unique_products"),
        
        # High value orders
        F.sum(
            F.when(F.col("is_high_value") == True, 1).otherwise(0)
        ).alias("high_value_orders_count"),
        
        F.sum(
            F.when(F.col("is_high_value") == True, F.col("total_amount")).otherwise(0)
        ).alias("high_value_revenue")
    )
    .withColumn("_gold_created_timestamp", F.current_timestamp())
    .orderBy("order_date", "country", "payment_method")
)

print(f"✓ Daily summary aggregated: {daily_sales_summary.count():,} rows")

In [0]:
# Example 3.2 - Gold Daily Sales Summary (part 2: write)

# Write to Gold
(
    daily_sales_summary
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(gold_daily_summary_table)
)

print(f"✅ Gold Daily Sales Summary saved: {gold_daily_summary_table}")
print(f"Record count: {spark.table(gold_daily_summary_table).count():,}")

print("\n=== Top 10 daily groups (date, country, payment method) by revenue ===")
display(
    spark.table(gold_daily_summary_table)
    .orderBy(F.col("total_revenue").desc())
    .limit(10)
)

### Example 3.3: Gold - Monthly Sales Summary

**Goal:** Monthly pre-aggregation for business reviews

In [0]:
# Example 3.3 - Gold Monthly Sales Summary

gold_monthly_summary_table = f"{GOLD_SCHEMA}.monthly_sales_summary"

# Monthly aggregation: year + month + country
monthly_sales_summary = (
    order_fact_df
    .groupBy("order_year", "order_month", "order_month_name", "country")
    .agg(
        # Order metrics
        F.count("order_id").alias("total_orders"),
        F.countDistinct("customer_id").alias("unique_customers"),
        
        # Revenue metrics
        F.sum("total_amount").alias("total_revenue"),
        F.avg("total_amount").alias("avg_order_value"),
        
        # Product metrics
        F.sum("quantity").alias("total_quantity"),
        F.countDistinct("product_id").alias("unique_products"),
        
        # Category breakdown
        F.countDistinct("category").alias("unique_categories")
    )
    .withColumn("_gold_created_timestamp", F.current_timestamp())
    .orderBy("order_year", "order_month", "country")
)

# Write to Gold
(
    monthly_sales_summary
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(gold_monthly_summary_table)
)

print(f"✅ Gold Monthly Sales Summary saved: {gold_monthly_summary_table}")
print(f"Record count: {spark.table(gold_monthly_summary_table).count():,}")
display(spark.table(gold_monthly_summary_table))
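
Both the daily and the monthly summary are aggregated from `order_fact`, not from each other. One practical consequence worth keeping in mind is that distinct counts such as `unique_customers` are not additive across grains, so the monthly figure cannot be obtained by summing the daily rows. The sketch below illustrates this in plain Python with made-up `(order_date, customer_id)` pairs; the data is hypothetical and not taken from the demo tables.

```python
# Hypothetical (order_date, customer_id) pairs, for illustration only.
orders = [
    ("2024-01-05", "C1"),
    ("2024-01-05", "C2"),
    ("2024-01-06", "C1"),
    ("2024-01-07", "C1"),
]

# Daily grain: the set of distinct customers seen on each day.
daily_unique = {}
for order_date, customer_id in orders:
    daily_unique.setdefault(order_date, set()).add(customer_id)

# Summing daily distinct counts counts C1 once per active day.
sum_of_daily = sum(len(customers) for customers in daily_unique.values())

# Monthly grain: distinct customers over the whole month.
monthly_unique = len({customer_id for _, customer_id in orders})

print(sum_of_daily)    # 4
print(monthly_unique)  # 2
```

Additive measures such as `total_orders` or `total_revenue` do roll up safely; only the distinct counts (and averages) need to be recomputed at each grain.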

### Example 3.4: Gold - Customer Analytics & Segmentation

**Goal:** Customer lifetime value, tenure, and segmentation for retention analysis

In [0]:
# Example 3.4 - Gold Customer Analytics & Segmentation

gold_customer_analytics_table = f"{GOLD_SCHEMA}.customer_analytics"

# Customer-level aggregation
customer_analytics = (
    order_fact_df
    .groupBy("customer_id", "customer_name", "country", "email_valid")
    .agg(
        # Order metrics
        F.count("order_id").alias("total_orders"),
        F.min("order_date").alias("first_order_date"),
        F.max("order_date").alias("last_order_date"),
        
        # Revenue metrics
        F.sum("total_amount").alias("lifetime_value"),
        F.avg("total_amount").alias("avg_order_value"),
        F.max("total_amount").alias("max_order_value"),
        
        # Product diversity
        F.countDistinct("product_id").alias("unique_products_purchased"),
        F.countDistinct("category").alias("unique_categories_purchased"),
        
        # High value behavior
        F.sum(
            F.when(F.col("is_high_value") == True, 1).otherwise(0)
        ).alias("high_value_orders_count"),
        
        # Payment method preferences
        F.collect_set("payment_method").alias("payment_methods_used")
    )
    
    # Customer tenure (days between first and last order)
    .withColumn(
        "customer_tenure_days",
        F.datediff(F.col("last_order_date"), F.col("first_order_date"))
    )
    
    # Order frequency (orders per day)
    .withColumn(
        "order_frequency",
        F.when(
            F.col("customer_tenure_days") > 0,
            F.col("total_orders") / F.col("customer_tenure_days")
        ).otherwise(F.lit(None))
    )
    
    # RFM-inspired segmentation
    .withColumn(
        "customer_segment",
        F.when(F.col("lifetime_value") >= 1000, "PREMIUM")
         .when(F.col("lifetime_value") >= 500, "GOLD")
         .when(F.col("lifetime_value") >= 200, "SILVER")
         .otherwise("BRONZE")
    )
    
    # Customer tier based on order count
    .withColumn(
        "customer_tier",
        F.when(F.col("total_orders") >= 10, "FREQUENT")
         .when(F.col("total_orders") >= 5, "REGULAR")
         .when(F.col("total_orders") >= 2, "OCCASIONAL")
         .otherwise("ONE_TIME")
    )
    
    .withColumn("_gold_created_timestamp", F.current_timestamp())
    .orderBy(F.col("lifetime_value").desc())
)

# Write to Gold
(
    customer_analytics
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(gold_customer_analytics_table)
)

print(f"✅ Gold Customer Analytics saved: {gold_customer_analytics_table}")
print(f"Customer count: {spark.table(gold_customer_analytics_table).count():,}")

print("\n=== Top 10 Customers by Lifetime Value ===")
# Sort explicitly: row order is not guaranteed when a table is read back
display(
    spark.table(gold_customer_analytics_table)
    .orderBy(F.col("lifetime_value").desc())
    .limit(10)
)

print("\n=== Customer Segmentation Distribution ===")
display(
    spark.table(gold_customer_analytics_table)
    .groupBy("customer_segment", "customer_tier")
    .agg(
        F.count("*").alias("customer_count"),
        F.sum("lifetime_value").alias("total_revenue"),
        F.avg("lifetime_value").alias("avg_lifetime_value")
    )
    .orderBy(F.col("total_revenue").desc())
)

print("\n" + "=" * 70)
print("GOLD LAYER SUMMARY")
print("=" * 70)
print(f"✓ Order Fact:         {spark.table(gold_order_fact_table).count():,} records")
print(f"✓ Daily Summary:      {spark.table(gold_daily_summary_table).count():,} aggregates")
print(f"✓ Monthly Summary:    {spark.table(gold_monthly_summary_table).count():,} aggregates")
print(f"✓ Customer Analytics: {spark.table(gold_customer_analytics_table).count():,} customers")
print("=" * 70)
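
The segment and tier rules in Example 3.4 are chains of threshold checks, and the boundaries are where mistakes tend to hide. As a minimal sketch, the same thresholds can be mirrored in plain Python and used as a reference when unit-testing the Spark logic; the function names `customer_segment` and `customer_tier` are hypothetical helpers, only the thresholds come from the cell above.

```python
def customer_segment(lifetime_value: float) -> str:
    """Mirror of the lifetime_value thresholds used for customer_segment."""
    if lifetime_value >= 1000:
        return "PREMIUM"
    if lifetime_value >= 500:
        return "GOLD"
    if lifetime_value >= 200:
        return "SILVER"
    return "BRONZE"


def customer_tier(total_orders: int) -> str:
    """Mirror of the total_orders thresholds used for customer_tier."""
    if total_orders >= 10:
        return "FREQUENT"
    if total_orders >= 5:
        return "REGULAR"
    if total_orders >= 2:
        return "OCCASIONAL"
    return "ONE_TIME"


# Boundary values land in the higher bucket because the checks use >=.
print(customer_segment(500), customer_tier(5))     # GOLD REGULAR
print(customer_segment(499.99), customer_tier(1))  # SILVER ONE_TIME
```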

---

## Section 3.5: Star Schema - Relationships and Queries

**Goal:** Understand the relationships between tables and see sample queries that use the star schema

### Example 3.5.1: Verifying Star Schema relationships

**Goal:** Verify referential integrity between tables

In [0]:
# Example 3.5.1 - Star Schema relationship verification

print("=" * 80)
print("STAR SCHEMA - RELATIONSHIP VERIFICATION")
print("=" * 80)

# Load tables
fact_orders = spark.table(gold_order_fact_table)
dim_customers = spark.table(silver_customers_table)
dim_products = spark.table(silver_products_table)

print("\n[1] FACT_ORDER → DIM_CUSTOMER Relationship (N:1)")
print("-" * 70)

# Check that every customer_id in FACT has a match in DIM_CUSTOMER
unmatched_customers = (
    fact_orders
    .select("customer_id")
    .distinct()
    .join(
        dim_customers.select("customer_id"),
        ["customer_id"],
        "left_anti"  # Keep only customer_ids with no matching customer
    )
)

unmatched_count = unmatched_customers.count()
total_customers_in_fact = fact_orders.select("customer_id").distinct().count()

print(f"Unique customers in FACT_ORDER: {total_customers_in_fact}")
print(f"Unmatched customers (orphans): {unmatched_count}")
print(f"{'✓' if unmatched_count == 0 else '✗'} Referential integrity: {'OK' if unmatched_count == 0 else 'FAILED'}")

print("\n[2] FACT_ORDER → DIM_PRODUCT Relationship (N:1)")
print("-" * 70)

# Check that every product_id in FACT has a match in DIM_PRODUCT
unmatched_products = (
    fact_orders
    .select("product_id")
    .distinct()
    .join(
        dim_products.select("product_id"),
        ["product_id"],
        "left_anti"  # Keep only product_ids with no matching product
    )
)

unmatched_products_count = unmatched_products.count()
total_products_in_fact = fact_orders.select("product_id").distinct().count()

print(f"Unique products in FACT_ORDER: {total_products_in_fact}")
print(f"Unmatched products (orphans): {unmatched_products_count}")
print(f"{'✓' if unmatched_products_count == 0 else '✗'} Referential integrity: {'OK' if unmatched_products_count == 0 else 'FAILED'}")

print("\n[3] Cardinality Analysis")
print("-" * 70)

# Customer → Orders (1:N)
customer_orders_stats = (
    fact_orders
    .groupBy("customer_id")
    .agg(F.count("order_id").alias("order_count"))
    .agg(
        F.min("order_count").alias("min_orders_per_customer"),
        F.avg("order_count").alias("avg_orders_per_customer"),
        F.max("order_count").alias("max_orders_per_customer")
    )
)

print("Customer → Orders cardinality:")
display(customer_orders_stats)

# Product → Orders (1:N)
product_orders_stats = (
    fact_orders
    .groupBy("product_id")
    .agg(F.count("order_id").alias("order_count"))
    .agg(
        F.min("order_count").alias("min_orders_per_product"),
        F.avg("order_count").alias("avg_orders_per_product"),
        F.max("order_count").alias("max_orders_per_product")
    )
)

print("\nProduct → Orders cardinality:")
display(product_orders_stats)

print("\n" + "=" * 80)
print("✅ Star Schema relationship checks complete (see OK/FAILED above)")
print("=" * 80)

### Example 3.5.2: Business Queries using the Star Schema

**Goal:** Sample analytical queries on the Gold Layer (denormalized fact table)

In [0]:
# Example 3.5.2 - Business queries on the Star Schema

# Window is used by Query 2 and Query 3 (harmless if already imported earlier in the notebook)
from pyspark.sql import Window

print("=" * 80)
print("BUSINESS QUERIES - STAR SCHEMA IN PRACTICE")
print("=" * 80)

# Query 1: Revenue by Country and Quarter (Time + Geographic dimension)
print("\n[Query 1] Revenue by Country and Quarter")
print("-" * 70)

revenue_by_country_quarter = (
    spark.table(gold_order_fact_table)
    .groupBy("country", "order_year", "order_quarter")
    .agg(
        F.sum("total_amount").alias("total_revenue"),
        F.count("order_id").alias("total_orders"),
        F.countDistinct("customer_id").alias("unique_customers")
    )
    .orderBy("order_year", "order_quarter", F.col("total_revenue").desc())
)

display(revenue_by_country_quarter)

# Query 2: Top Products by Category (Product dimension)
print("\n[Query 2] Top 3 Products by Revenue per Category")
print("-" * 70)

top_products_by_category = (
    spark.table(gold_order_fact_table)
    .groupBy("category", "product_name")
    .agg(
        F.sum("total_amount").alias("total_revenue"),
        F.sum("quantity").alias("total_quantity"),
        F.countDistinct("customer_id").alias("unique_buyers")
    )
    .withColumn(
        "rank",
        F.row_number().over(
            Window.partitionBy("category")
            .orderBy(F.col("total_revenue").desc())
        )
    )
    .filter(F.col("rank") <= 3)  # Top 3 per category
    .orderBy("category", "rank")
)

display(top_products_by_category)

# Query 3: Payment Method Trends by Month (Time + Payment dimension)
print("\n[Query 3] Payment Method Trends Over Time")
print("-" * 70)

payment_trends = (
    spark.table(gold_order_fact_table)
    .groupBy("order_year", "order_month", "payment_method")
    .agg(
        F.count("order_id").alias("order_count"),
        F.sum("total_amount").alias("revenue")
    )
    .withColumn(
        "revenue_share",
        F.round(
            F.col("revenue") / F.sum("revenue").over(Window.partitionBy("order_year", "order_month")) * 100,
            2
        )
    )
    .orderBy("order_year", "order_month", F.col("revenue").desc())
)

display(payment_trends)

print("\n" + "=" * 80)
print("✅ The sample queries show how a denormalized fact table")
print("   removes the need for joins at query time → faster BI tools")
print("=" * 80)

---

## Section 4: Pipeline Monitoring & Lineage

**Theoretical introduction:**

A production pipeline needs monitoring at every stage: data volumes, quality metrics, processing time.

**Key metrics:**
- Record counts per layer
- Rejection rates
- Processing time
- Data freshness
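
Of these metrics, data freshness is the one the dashboard below does not compute. A minimal sketch, assuming the latest Bronze ingest timestamp has already been collected into a Python `datetime`; the 24-hour limit and the helper names `data_freshness` and `is_stale` are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def data_freshness(last_ingest: datetime, now: datetime) -> timedelta:
    """Time elapsed since the most recent ingest."""
    return now - last_ingest


def is_stale(last_ingest: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=24)) -> bool:
    """True when the newest data is older than the allowed age (assumed: 24 hours)."""
    return data_freshness(last_ingest, now) > max_age


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_ingest = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)

print(data_freshness(last_ingest, now))  # 1 day, 6:00:00
print(is_stale(last_ingest, now))        # True
```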

### Example 4.1: Pipeline Health Dashboard

**Goal:** Monitor the complete pipeline

In [0]:
# Example 4.1 - Pipeline Health Dashboard

print("=" * 80)
print("                    PIPELINE HEALTH DASHBOARD                    ")
print("=" * 80)

# Bronze layer metrics
print("\n┌──────────────────────────────────────────────────────────────┐")
print("│                     BRONZE LAYER (Raw Data)                  │")
print("└──────────────────────────────────────────────────────────────┘")

bronze_orders_count = spark.table(bronze_orders_table).count()
bronze_customers_count = spark.table(bronze_customers_table).count()
bronze_products_count = spark.table(bronze_products_table).count()

print(f"  📦 Orders:    {bronze_orders_count:>8,} records")
print(f"  👥 Customers: {bronze_customers_count:>8,} records")
print(f"  📦 Products:  {bronze_products_count:>8,} records")
print(f"  Total:        {bronze_orders_count + bronze_customers_count + bronze_products_count:>8,} records")

# Silver layer metrics
print("\n┌──────────────────────────────────────────────────────────────┐")
print("│                  SILVER LAYER (Cleansed Data)                │")
print("└──────────────────────────────────────────────────────────────┘")

silver_orders_count = spark.table(silver_orders_table).count()
silver_customers_count = spark.table(silver_customers_table).count()
silver_products_count = spark.table(silver_products_table).count()

orders_rejection_rate = ((bronze_orders_count - silver_orders_count) / bronze_orders_count * 100) if bronze_orders_count > 0 else 0
customers_rejection_rate = ((bronze_customers_count - silver_customers_count) / bronze_customers_count * 100) if bronze_customers_count > 0 else 0
products_rejection_rate = ((bronze_products_count - silver_products_count) / bronze_products_count * 100) if bronze_products_count > 0 else 0

print(f"  📦 Orders:    {silver_orders_count:>8,} records (rejection: {orders_rejection_rate:>5.2f}%)")
print(f"  👥 Customers: {silver_customers_count:>8,} records (rejection: {customers_rejection_rate:>5.2f}%)")
print(f"  📦 Products:  {silver_products_count:>8,} records (rejection: {products_rejection_rate:>5.2f}%)")

# Gold layer metrics
print("\n┌──────────────────────────────────────────────────────────────┐")
print("│              GOLD LAYER (Business-Ready Data)                │")
print("└──────────────────────────────────────────────────────────────┘")

gold_fact_count = spark.table(gold_order_fact_table).count()
gold_daily_count = spark.table(gold_daily_summary_table).count()
gold_monthly_count = spark.table(gold_monthly_summary_table).count()
gold_customer_count = spark.table(gold_customer_analytics_table).count()

print(f"  ⭐ Order Fact Table:    {gold_fact_count:>8,} records")
print(f"  📊 Daily Summary:       {gold_daily_count:>8,} aggregates")
print(f"  📊 Monthly Summary:     {gold_monthly_count:>8,} aggregates")
print(f"  👥 Customer Analytics:  {gold_customer_count:>8,} customers")

# Data quality summary
print("\n┌──────────────────────────────────────────────────────────────┐")
print("│                     DATA QUALITY METRICS                     │")
print("└──────────────────────────────────────────────────────────────┘")

print(f"  ✅ Orders rejection rate:     {orders_rejection_rate:>6.2f}%")
print(f"  ✅ Customers rejection rate:  {customers_rejection_rate:>6.2f}%")
print(f"  ✅ Products rejection rate:   {products_rejection_rate:>6.2f}%")
print(f"  ✅ Silver→Gold propagation:  {(gold_fact_count / silver_orders_count * 100) if silver_orders_count > 0 else 0:>6.2f}%")

# Data flow summary
print("\n┌──────────────────────────────────────────────────────────────┐")
print("│                      DATA FLOW SUMMARY                       │")
print("└──────────────────────────────────────────────────────────────┘")

print(f"  Bronze → Silver: {bronze_orders_count:>8,} → {silver_orders_count:>8,} orders")
print(f"  Silver → Gold:   {silver_orders_count:>8,} → {gold_fact_count:>8,} fact records")

# Overall status
overall_rejection = (orders_rejection_rate + customers_rejection_rate + products_rejection_rate) / 3

print("\n" + "=" * 80)
if overall_rejection < 5:
    print("                 ✅ Pipeline Status: HEALTHY")
elif overall_rejection < 10:
    print("                 ⚠️  Pipeline Status: WARNING (High Rejection)")
else:
    print("                 ❌ Pipeline Status: CRITICAL (Very High Rejection)")
print("=" * 80)

---

## Best Practices - Production Pipeline

### Bronze Layer Best Practices

**✅ Audit Metadata:**
- Always add `_bronze_ingest_timestamp`, `_bronze_source_file`, `_bronze_ingested_by`
- Enables data lineage and troubleshooting
- Lets you verify when and from where data arrived in the lakehouse

**✅ Immutability:**
- Never UPDATE/DELETE in Bronze - APPEND only
- Bronze = landing zone for raw data recovery
- Use `_bronze_version` for schema evolution

**✅ Idempotency:**
- Use COPY INTO or Auto Loader
- Use checkpoint locations for streaming
- Prevents duplicate loads on retry
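
COPY INTO and Auto Loader both keep track of which files have already been ingested, which is what makes a retry safe. The plain-Python sketch below shows only the underlying idea, not how either tool is implemented; `select_new_files`, the in-memory `ledger`, and the file names are hypothetical.

```python
def select_new_files(candidate_files, already_ingested):
    """Return files not yet ingested, preserving order and dropping repeats."""
    seen = set(already_ingested)
    new_files = []
    for path in candidate_files:
        if path not in seen:
            new_files.append(path)
            seen.add(path)
    return new_files


# Hypothetical ledger of files loaded by earlier runs.
ledger = {"orders_2024_01.json"}
batch = ["orders_2024_01.json", "orders_2024_02.json", "orders_2024_02.json"]

to_load = select_new_files(batch, ledger)
ledger.update(to_load)  # record the load only after it succeeds

retry = select_new_files(batch, ledger)  # re-running the same batch

print(to_load)  # ['orders_2024_02.json']
print(retry)    # []
```

In a real pipeline the ledger must be durable and updated atomically with the load, which is the part the managed tools take care of.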

**✅ Multi-format Support:**
```python
# Multi-line JSON
spark.read.format("json").option("multiLine", "true").load(path)

# CSV with a header row and schema inference
spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(path)

# Parquet (binary efficient)
spark.read.format("parquet").load(path)
```

---

### Silver Layer Best Practices

**✅ Data Quality Gates:**
```python
# NOT NULL validation
.filter(F.col("customer_id").isNotNull())

# Range validation
.filter(F.col("total_amount") > 0)

# Regex validation (email)
.filter(F.col("email").rlike(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"))
```

**✅ Rejection Rate Monitoring:**
- Log rejection rates for alerting (threshold: 5%)
- Write rejected records to a quarantine table for investigation
```python
rejected_count = bronze_count - silver_count
# Guard against an empty Bronze table to avoid division by zero
rejection_rate = (rejected_count / bronze_count * 100) if bronze_count > 0 else 0
if rejection_rate > 5:
    # Alert operations team
    pass
```
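
Counting rejections says how many records were dropped but not why. A minimal plain-Python sketch of splitting records into valid and quarantined sets while recording the reason, using the same three rules as the quality gates (non-null `order_id`, non-null `customer_id`, positive `total_amount`); the function name and the `_rejection_reasons` field are hypothetical.

```python
def split_valid_rejected(records):
    """Split records into (valid, rejected); rejected rows carry their reasons."""
    valid, rejected = [], []
    for record in records:
        reasons = []
        if record.get("order_id") is None:
            reasons.append("missing order_id")
        if record.get("customer_id") is None:
            reasons.append("missing customer_id")
        amount = record.get("total_amount")
        if amount is None or amount <= 0:
            reasons.append("non-positive total_amount")
        if reasons:
            # _rejection_reasons is a hypothetical quarantine column
            rejected.append({**record, "_rejection_reasons": reasons})
        else:
            valid.append(record)
    return valid, rejected


rows = [
    {"order_id": 1, "customer_id": "C1", "total_amount": 120.0},
    {"order_id": 2, "customer_id": None, "total_amount": 0.0},
]
valid, rejected = split_valid_rejected(rows)

print(len(valid), len(rejected))          # 1 1
print(rejected[0]["_rejection_reasons"])  # ['missing customer_id', 'non-positive total_amount']
```

Grouping the quarantine table by reason then shows which rule drives a spike in the rejection rate.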

**✅ Standardization:**
```python
# Dates
.withColumn("order_date", F.to_date(F.col("order_datetime")))

# Text normalization
.withColumn("email", F.lower(F.trim(F.col("email"))))

# Case consistency
.withColumn("country", F.upper(F.trim(F.col("country"))))
```

**✅ Type Casting:**
```python
# Explicit type casting
.withColumn("total_amount", F.col("total_amount").cast("decimal(10,2)"))
.withColumn("quantity", F.col("quantity").cast("integer"))
```

**✅ Slowly Changing Dimensions (SCD):**
- SCD Type 1: Overwrite (for dimension tables such as Customer, Product)
- SCD Type 2: History tracking (when an audit history of changes is needed)
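
The difference between the two types is easiest to see on a single change. Type 1 would simply overwrite the row; the sketch below shows the Type 2 behaviour in plain Python: the current row is closed and a new version is appended. The column names `valid_from`, `valid_to`, and `is_current` and the helper `apply_scd2` are hypothetical; in Delta this is typically done with a MERGE, which is not shown here.

```python
from datetime import date


def apply_scd2(history, customer_id, new_attrs, effective_date):
    """Close the current row if attributes changed, then append a new version."""
    current = next(
        (row for row in history
         if row["customer_id"] == customer_id and row["is_current"]),
        None,
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return history  # nothing changed, keep history as is
        current["valid_to"] = effective_date
        current["is_current"] = False
    history.append({
        "customer_id": customer_id,
        "attrs": dict(new_attrs),
        "valid_from": effective_date,
        "valid_to": None,
        "is_current": True,
    })
    return history


history = []
apply_scd2(history, "C1", {"country": "PL"}, date(2024, 1, 1))
apply_scd2(history, "C1", {"country": "DE"}, date(2024, 6, 1))
apply_scd2(history, "C1", {"country": "DE"}, date(2024, 7, 1))  # no change

print(len(history))            # 2
print(history[0]["valid_to"])  # 2024-06-01
```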

---

### Gold Layer Best Practices

**✅ Denormalization for Performance:**
- Pre-compute joins between fact and dimensions
- Trade-off: more storage vs faster queries
- Well suited to BI dashboards (removes joins at query time)

```python
# Denormalized fact table
fact_with_dimensions = (
    fact
    .join(dim_customer, "customer_id", "left")  # LEFT JOIN!
    .join(dim_product, "product_id", "left")
)
```

**⚠️ Use LEFT JOIN:**
- Keep all fact records, even those without a dimension match
- Monitor unmatched records (orphans)

**✅ Pre-aggregations:**
```python
# Daily summary
.groupBy("order_date", "country", "payment_method")

# Monthly summary
.groupBy("order_year", "order_month", "country")

# Customer-level
.groupBy("customer_id")
```

**✅ Partitioning Strategy:**
```python
# Partition by date for time-based queries
.write.partitionBy("order_year", "order_month")

# ZORDER BY for multi-dimensional filtering
spark.sql(f"OPTIMIZE {table_name} ZORDER BY (country, payment_method)")
```

**✅ Time Dimensions:**
```python
# Add time dimensions for analytics
.withColumn("order_year", F.year("order_date"))
.withColumn("order_month", F.month("order_date"))
.withColumn("order_quarter", F.quarter("order_date"))
.withColumn("order_day_of_week", F.dayofweek("order_date"))
```

---

### Monitoring Best Practices

**✅ Pipeline Health Metrics:**
- Record counts per layer
- Rejection rates (Bronze → Silver)
- Processing time per stage
- Data freshness (last ingest timestamp)

**✅ Alerting Thresholds:**
- Rejection rate > 5% → WARNING
- Rejection rate > 10% → CRITICAL
- Unmatched dimensions > 1% → WARNING
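
These thresholds can be captured in one small function so that every job applies them the same way. A minimal sketch in plain Python; `alert_level` is a hypothetical helper and the return value `"OK"` is an assumed label for the case where no threshold is exceeded.

```python
def alert_level(rejection_rate_pct: float, unmatched_dimension_pct: float = 0.0) -> str:
    """Map pipeline metrics (in percent) to an alert level using the thresholds above."""
    if rejection_rate_pct > 10:
        return "CRITICAL"
    if rejection_rate_pct > 5 or unmatched_dimension_pct > 1:
        return "WARNING"
    return "OK"


print(alert_level(3.2))        # OK
print(alert_level(7.5))        # WARNING
print(alert_level(12.0))       # CRITICAL
print(alert_level(0.0, 1.5))   # WARNING
```

The comparisons here are strict, following the list above, so exactly 5% is still `OK`. The dashboard cell in Example 4.1 uses `< 5` and `< 10`, which puts exactly 5% in WARNING; whichever convention is chosen, it helps to use the same one in both places.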

**✅ Data Lineage:**
```python
# DESCRIBE HISTORY for auditing
spark.sql(f"DESCRIBE HISTORY {table_name}").show()

# Track data flow through the per-layer timestamps:
# Bronze (_bronze_ingest_timestamp)
#   → Silver (_silver_processed_timestamp)
#   → Gold (_gold_created_timestamp)
```

---

### Performance Optimization

**✅ File Size Optimization:**
```python
# OPTIMIZE for the small-files problem
spark.sql(f"OPTIMIZE {table_name}")

# Auto optimize (Databricks specific)
spark.sql(f"ALTER TABLE {table_name} SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)")
```

**✅ ZORDER BY for Multi-dimensional Queries:**
```python
# Colocate related information
spark.sql(f"OPTIMIZE {table_name} ZORDER BY (country, payment_method, order_date)")
```

**✅ VACUUM for Storage Management:**
```python
# Clean up old files (default retention: 7 days)
spark.sql(f"VACUUM {table_name} RETAIN 168 HOURS")  # 7 days
```

**✅ Predicate Pushdown:**
```python
# Filter early in the pipeline
.filter(F.col("order_date") >= "2024-01-01")  # Pushed to file scan
```

---

## Troubleshooting - Common problems and solutions

### Problem 1: High Rejection Rate in Silver (> 5%)

**Symptom:** A large number of records is rejected during Bronze → Silver

**Diagnosis:**
```python
# Analyze the rejected records
bronze_df = spark.table(bronze_orders_table)
rejected = bronze_df.filter(
    F.col("order_id").isNull() | 
    F.col("customer_id").isNull() |
    (F.col("total_amount") <= 0)
)

print(f"Rejected count: {rejected.count()}")
display(rejected.groupBy("_bronze_source_file").count())
```

**Solution:**
1. Identify the root cause: null values, invalid formats, business rule violations
2. Communicate data quality issues to the data source team
3. Consider a quarantine table for rejected records:
```python
# Save rejected records
rejected.write.format("delta").mode("append").saveAsTable("quarantine.rejected_orders")
```
4. Implement auto-remediation for known issues (e.g. fill defaults)
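
To narrow down the root cause, it can help to label each rejected record with the rule it violated. The sketch below mirrors the three conditions from the diagnosis filter on plain Python dictionaries; the function name and reason labels are hypothetical, and the sample records are made up for illustration.

```python
from collections import Counter


def rejection_reasons(record: dict) -> list:
    """Return the list of rules a record violates (empty list means valid)."""
    reasons = []
    if record.get("order_id") is None:
        reasons.append("null_order_id")
    if record.get("customer_id") is None:
        reasons.append("null_customer_id")
    amount = record.get("total_amount")
    # A missing amount is not flagged, matching SQL semantics of `total_amount <= 0`
    if amount is not None and amount <= 0:
        reasons.append("non_positive_amount")
    return reasons


sample = [
    {"order_id": 1, "customer_id": 10, "total_amount": 99.0},
    {"order_id": None, "customer_id": 11, "total_amount": 15.0},
    {"order_id": 3, "customer_id": None, "total_amount": -5.0},
    {"order_id": 4, "customer_id": 12, "total_amount": 0.0},
]
reason_counts = Counter(r for rec in sample for r in rejection_reasons(rec))
print(reason_counts.most_common())
```

A breakdown like this shows whether one rule dominates the rejections, which is usually the first thing the source team will ask about.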

---

### Problem 2: Unmatched Foreign Keys in Gold (Orphan Records)

**Symptom:** Orders without a matching customer/product after the LEFT JOIN

**Diagnosis:**
```python
# Check unmatched customers
fact_orders = spark.table(gold_order_fact_table)
unmatched_customers = fact_orders.filter(F.col("customer_name").isNull())

print(f"Unmatched customers: {unmatched_customers.count()}")
display(unmatched_customers.select("order_id", "customer_id"))
```

**Solution:**
1. **Referential integrity check** in Silver before Gold:
```python
# Validate foreign keys before the joins
valid_customer_ids = dim_customer.select("customer_id").distinct()
orders_validated = orders.join(valid_customer_ids, "customer_id", "inner")
```

2. **Default handling** for orphans:
```python
# Use coalesce for missing dimensions
.withColumn("customer_name", F.coalesce(F.col("customer_name"), F.lit("UNKNOWN")))
```

3. **Monitor orphan rate** in pipeline metrics
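
The orphan rate itself is the share of fact rows whose foreign key has no match in the dimension. A minimal sketch on plain Python collections follows; `orphan_rate` is a hypothetical helper, and the 1% limit is the "unmatched dimensions" threshold from the monitoring section.

```python
from typing import Iterable


def orphan_rate(order_customer_ids: Iterable, known_customer_ids: Iterable) -> float:
    """Fraction of orders whose customer_id is missing from the customer dimension."""
    ids = list(order_customer_ids)
    if not ids:
        return 0.0
    known = set(known_customer_ids)
    orphans = sum(1 for cid in ids if cid not in known)
    return orphans / len(ids)


# Illustrative IDs: customer 99 does not exist in the dimension
rate = orphan_rate([1, 2, 2, 99], [1, 2, 3])
print(f"Orphan rate: {rate:.1%}, above 1% threshold: {rate > 0.01}")
```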

---

### Problem 3: Long Processing Time for Gold Aggregations

**Symptom:** Gold pipeline execution takes > 10 minutes for small data volumes

**Diagnosis:**
```python
# Explain query plan
spark.table(silver_orders_table).explain(True)

# Check file statistics
spark.sql(f"DESCRIBE DETAIL {silver_orders_table}").show()
```

**Solution:**

1. **Incremental Processing:**
```python
# Process only new/updated dates
max_processed_date = spark.table(gold_daily_summary_table).agg(F.max("order_date")).collect()[0][0]

orders_incremental = (
    spark.table(silver_orders_table)
    .filter(F.col("order_date") > max_processed_date)
)
```

2. **Cache Silver tables** before running multiple aggregations:
```python
orders_silver_df.cache()
# Multiple aggregations...
orders_silver_df.unpersist()
```

3. **Optimize Silver tables:**
```python
spark.sql(f"OPTIMIZE {silver_orders_table}")
spark.sql(f"OPTIMIZE {silver_orders_table} ZORDER BY (order_date, customer_id)")
```

4. **Partitioning:**
```python
# Partition Gold tables by date
.write.partitionBy("order_year", "order_month").saveAsTable(...)
```
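
One edge case of the incremental approach in step 1 is the first run: when the Gold summary table is empty, `F.max("order_date")` yields `None`, and a comparison against `None` selects no rows. The sketch below illustrates the watermark logic with a fallback to a full load, using plain Python dates; `dates_to_process` is a hypothetical helper.

```python
from datetime import date
from typing import Iterable, List, Optional


def dates_to_process(available: Iterable[date], watermark: Optional[date]) -> List[date]:
    """Return dates strictly after the watermark, or all dates on the first run."""
    if watermark is None:
        # Gold table is empty: fall back to a full load
        return sorted(available)
    return sorted(d for d in available if d > watermark)


silver_dates = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)]
print(dates_to_process(silver_dates, None))
print(dates_to_process(silver_dates, date(2024, 1, 2)))
```

Because the comparison is strict, late-arriving records for an already processed date are skipped; whether that is acceptable depends on the source system.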

---

### Problem 4: Small Files Problem in Bronze

**Symptom:** Thousands of small files in the Bronze Delta table

**Diagnosis:**
```python
# Check file count
detail = spark.sql(f"DESCRIBE DETAIL {bronze_orders_table}").collect()[0]
print(f"Number of files: {detail['numFiles']}")
print(f"Size in bytes: {detail['sizeInBytes']}")
```

**Solution:**

1. **Run OPTIMIZE regularly:**
```python
# Manual optimize
spark.sql(f"OPTIMIZE {bronze_orders_table}")

# Auto-optimize (Databricks)
spark.sql(f"""
ALTER TABLE {bronze_orders_table} 
SET TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact = true
)
""")
```

2. **Batch load instead of per-file:**
```python
# Load all files in a single operation
spark.read.format("json").load("path/to/folder/*.json")
```
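
The `numFiles` and `sizeInBytes` values from the diagnosis step can be turned into an average file size, which is easier to reason about than a raw file count. The sketch below is one possible heuristic; the 32 MB cutoff is an assumption chosen for illustration, not a Databricks recommendation, and both function names are hypothetical.

```python
SMALL_FILE_THRESHOLD_MB = 32  # assumed cutoff, tune for your workload


def average_file_size_mb(num_files: int, size_in_bytes: int) -> float:
    """Average file size in MiB; 0.0 for an empty table."""
    if num_files == 0:
        return 0.0
    return size_in_bytes / num_files / (1024 * 1024)


def has_small_files_problem(num_files: int, size_in_bytes: int,
                            threshold_mb: float = SMALL_FILE_THRESHOLD_MB) -> bool:
    """Flag tables with more than one file whose average size is below the threshold."""
    return num_files > 1 and average_file_size_mb(num_files, size_in_bytes) < threshold_mb


# Illustrative numbers: 5,000 files of about 1 MiB each
print(average_file_size_mb(5_000, 5_000 * 1024 * 1024))
print(has_small_files_problem(5_000, 5_000 * 1024 * 1024))
```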

---

### Problem 5: Schema Evolution Failures

**Symptom:** Pipeline fails with a schema mismatch error

**Diagnosis:**
```python
# Compare schemas
bronze_schema = spark.table(bronze_orders_table).schema
new_data_schema = spark.read.json(new_file_path).schema

print("Bronze schema:", bronze_schema)
print("New data schema:", new_data_schema)
```

**Solution:**

1. **Enable schema evolution:**
```python
# Merge schema mode
.write.format("delta").mode("append").option("mergeSchema", "true").saveAsTable(...)
```

2. **Schema validation before write:**
```python
# Validate schema compatibility
if new_data_schema != expected_schema:
    # Handle schema change
    pass
```

3. **Track schema changes in the audit trail:**
```python
# DESCRIBE HISTORY shows schema changes
spark.sql(f"DESCRIBE HISTORY {bronze_orders_table}").filter("operation = 'WRITE'").show()
```
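
For the schema validation in step 2, a plain inequality check only says that something changed, not what. The sketch below compares two schemas represented as `{column_name: type_string}` dictionaries and reports added, removed, and retyped columns. With Spark, such a dictionary could be built as `{f.name: f.dataType.simpleString() for f in schema.fields}`; the helper names here are hypothetical.

```python
def diff_schemas(expected: dict, incoming: dict) -> dict:
    """Compare two {column: type} mappings and list the differences."""
    common = set(expected) & set(incoming)
    return {
        "added": sorted(set(incoming) - set(expected)),
        "removed": sorted(set(expected) - set(incoming)),
        "type_changed": sorted(c for c in common if expected[c] != incoming[c]),
    }


def is_additive_only(diff: dict) -> bool:
    """True when the change only adds columns, the case mergeSchema is meant for."""
    return not diff["removed"] and not diff["type_changed"]


# Illustrative schemas
expected_schema = {"order_id": "string", "total_amount": "double"}
incoming_schema = {"order_id": "string", "total_amount": "string", "channel": "string"}

diff = diff_schemas(expected_schema, incoming_schema)
print(diff)
print("Safe to append with mergeSchema:", is_additive_only(diff))
```

Separating additive changes from type changes makes it possible to append automatically in the first case and stop the pipeline for review in the second.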

---

### Problem 6: Data Quality Regression

**Symptom:** Sudden spike in rejection rate or invalid values

**Diagnosis:**
```python
# Trend analysis rejection rates
rejection_history = spark.sql(f"""
SELECT 
    date(_bronze_ingest_timestamp) as ingest_date,
    count(*) as total_records,
    sum(case when _data_quality_flag = 'INVALID' then 1 else 0 end) as invalid_count
FROM {silver_orders_table}
GROUP BY date(_bronze_ingest_timestamp)
ORDER BY ingest_date DESC
""")

display(rejection_history)
```

**Solution:**

1. **Automated data quality checks:**
```python
# Define quality rules
quality_checks = [
    ("not_null", F.col("order_id").isNotNull()),
    ("positive_amount", F.col("total_amount") > 0),
    ("valid_date", F.col("order_date") <= F.current_date())
]

for check_name, condition in quality_checks:
    invalid_count = df.filter(~condition).count()
    if invalid_count > threshold:
        # Alert
        pass
```

2. **Quarantine invalid data:**
```python
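from functools import reduce
all_conditions = reduce(lambda a, b: a & b, [cond for _, cond in quality_checks])  # every rule must pass
invalid_df = df.filter(~all_conditions)
invalid_df.write.format("delta").mode("append").saveAsTable("quarantine_table")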
```
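
Once the daily rejection history from the diagnosis query is available, a spike can be detected by comparing the latest rate with a trailing baseline. This is a sketch of one simple heuristic; the 7-day window and the 2x factor are assumptions, `detect_spike` is a hypothetical name, and the input is expected oldest first (the query above sorts newest first).

```python
from typing import Sequence


def detect_spike(daily_rates: Sequence[float], window: int = 7, factor: float = 2.0) -> bool:
    """Return True when the latest rate exceeds `factor` times the trailing mean."""
    if len(daily_rates) < 2:
        return False  # not enough history to form a baseline
    history = daily_rates[-(window + 1):-1]  # up to `window` days before the latest
    baseline = sum(history) / len(history)
    return daily_rates[-1] > baseline * factor


# Illustrative invalid-record rates per ingest date, oldest first
print(detect_spike([0.02, 0.02, 0.03, 0.02, 0.09]))
print(detect_spike([0.02, 0.02, 0.03, 0.02, 0.03]))
```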

---

### Problem 7: Out of Memory Error (OOM)

**Symptom:** Executor crashes with OutOfMemoryError

**Diagnosis:**
```python
# Check data skew
spark.table(silver_orders_table).groupBy("customer_id").count().orderBy(F.col("count").desc()).show()
```

**Solution:**

1. **Repartition data:**
```python
# Repartition before heavy operations
df = df.repartition(200, "customer_id")
```

2. **Increase executor memory** in the cluster configuration

3. **Use broadcast joins** for small dimension tables:
```python
from pyspark.sql.functions import broadcast
fact.join(broadcast(small_dim), "key")
```

4. **Process in batches:**
```python
# Process per country
countries = [row.country for row in df.select("country").distinct().collect()]
for country in countries:
    country_df = df.filter(F.col("country") == country)
    # Process...
```
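
The skew check from the diagnosis step can be summarized in a single number: the ratio between the largest key group and the median group size. The sketch below computes it on a plain list of keys; `skew_ratio` is a hypothetical helper, and what counts as "too skewed" is left to the reader.

```python
from collections import Counter
from statistics import median
from typing import Hashable, Iterable


def skew_ratio(keys: Iterable[Hashable]) -> float:
    """Largest group size divided by the median group size (1.0 means balanced)."""
    counts = list(Counter(keys).values())
    if not counts:
        return 0.0
    return max(counts) / median(counts)


# Illustrative data: one customer dominates the orders
customer_ids = ["a"] * 90 + ["b"] * 5 + ["c"] * 5
print(f"Skew ratio: {skew_ratio(customer_ids):.1f}")
```

A high ratio suggests that repartitioning by that key alone will not help, because all rows for the dominant key still land in one partition.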

---

## Summary

**In this notebook we built a complete Bronze → Silver → Gold pipeline:**

✅ **Bronze Layer:**
- Multi-format ingestion (JSON, CSV, Parquet)
- Audit metadata for lineage
- Immutable landing zone

✅ **Silver Layer:**
- Data quality validation
- Deduplication and standardization
- Business rules enforcement
- Quality metrics logging

✅ **Gold Layer:**
- Denormalized fact tables
- Pre-aggregated summaries (daily, monthly)
- Customer analytics and segmentation
- BI-ready tables

✅ **Monitoring:**
- Pipeline health dashboard
- Data quality metrics
- Rejection rate tracking

**Key takeaways:**
1. An end-to-end pipeline requires different transformations in each layer
2. Data quality gates in Silver protect Gold from bad data
3. Denormalization in Gold improves BI dashboard performance
4. Monitoring is key to production reliability

**Next steps:**
- **Next notebook**: 05_optimization_best_practices.ipynb
- **Hands-on workshop**: 03_end_to_end_bronze_silver_gold_workshop.ipynb
- **Delta Live Tables**: Declarative pipelines with automatic data quality

---

## Cleanup

Clean up the resources created during this notebook:

In [0]:
# Optional cleanup of test resources
# WARNING: Run only if you want to delete all the data created above

# Bronze
# spark.sql(f"DROP TABLE IF EXISTS {bronze_orders_table}")
# spark.sql(f"DROP TABLE IF EXISTS {bronze_customers_table}")
# spark.sql(f"DROP TABLE IF EXISTS {bronze_products_table}")

# Silver
# spark.sql(f"DROP TABLE IF EXISTS {silver_orders_table}")
# spark.sql(f"DROP TABLE IF EXISTS {silver_customers_table}")
# spark.sql(f"DROP TABLE IF EXISTS {silver_products_table}")

# Gold
# spark.sql(f"DROP TABLE IF EXISTS {gold_order_fact_table}")
# spark.sql(f"DROP TABLE IF EXISTS {gold_daily_summary_table}")
# spark.sql(f"DROP TABLE IF EXISTS {gold_monthly_summary_table}")
# spark.sql(f"DROP TABLE IF EXISTS {gold_customer_analytics_table}")

# spark.catalog.clearCache()
# print("Resources have been cleaned up")