# Workshop 1: Advanced PySpark Transformations

**Workshop goals:**
- Apply Window Functions in practice (lag, lead, rank, rolling aggregations)
- Process complex structures (JSON, arrays, structs)
- Perform advanced date and time operations
- Optimize transformations for performance

**Duration:** 90 minutes

---

## 📚 Environment initialization

In [None]:
%run ../../00_setup

## 🎯 Configuration

In [None]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import *
from datetime import datetime, timedelta

# Display the user context
print("=== User context ===")
print(f"Catalog: {CATALOG}")
print(f"Schema: {BRONZE_SCHEMA}")
print(f"User: {raw_user}")

# Set the catalog and schema as defaults
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

## 📊 Preparing data from a Databricks Volume

Load the workshop data from the Databricks Volume:

In [None]:
# Path to the Volume
volume_path = "/Volumes/main/default/kion_data"

# Load the customer data
customers_df = spark.read.csv(f"{volume_path}/customers/customers.csv", header=True, inferSchema=True)

# Load the order data (batch)
orders_df = spark.read.json(f"{volume_path}/orders/orders_batch.json")

# Load the product data
products_df = spark.read.parquet(f"{volume_path}/products/products.parquet")

# Build the joined view used in the exercises
test_orders = (
    orders_df
    .join(customers_df, "customer_id")
    .join(products_df, "product_id")
    .select(
        "order_id",
        "customer_id",
        F.col("order_date").cast("date").alias("order_date"),
        F.col("total_amount"),
        F.col("status")
    )
)

test_orders.createOrReplaceTempView("orders")
display(test_orders)

---

## 🪟 Part 1: Window Functions

### Task 1.1: Ranking - ROW_NUMBER, RANK, DENSE_RANK

**Instructions:**
1. For each customer, rank the orders by date (newest first)
2. Add the columns:
   - `row_num`: using `row_number()`
   - `rank`: using `rank()`
   - `dense_rank`: using `dense_rank()`
3. Window spec: `partitionBy("customer_id").orderBy(F.desc("order_date"))`

**Expected result:**
- Each customer's orders are numbered from 1 (newest)

In [None]:
# TODO: Task 1.1 - Ranking functions

from pyspark.sql.window import Window

# Define the window spec
window_spec = Window.____("____").orderBy(F.____("____"))  # partitionBy customer_id, orderBy desc order_date

# Add the ranking columns
orders_ranked = (
    test_orders
    .withColumn("row_num", F.____().____(window_spec))  # row_number, over
    .withColumn("rank", F.____().over(____))  # rank, window_spec
    .withColumn("dense_rank", F.____().over(window_spec))  # dense_rank
)

display(orders_ranked.orderBy("customer_id", "order_date"))

**How the functions differ:**

- **ROW_NUMBER**: unique sequential numbers (1, 2, 3...), even when values are tied
- **RANK**: tied values share a rank and leave gaps after it (1, 2, 2, 4...)
- **DENSE_RANK**: tied values share a rank and leave no gaps (1, 2, 2, 3...)
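
The three numbering rules can be checked without a Spark session. The following is a minimal plain-Python sketch, not the PySpark solution; the `amounts` list is made up for illustration and is assumed to be already sorted within one partition.

```python
# Plain-Python sketch of the three ranking rules (illustrative data only).
amounts = [300, 250, 250, 100]  # one partition, sorted descending, with one tie

# row_number(): position in the ordering, ties get distinct numbers
row_number = list(range(1, len(amounts) + 1))

rank = []
dense_rank = []
for i, value in enumerate(amounts):
    if i > 0 and value == amounts[i - 1]:
        # A tied row keeps the rank of the previous row.
        rank.append(rank[-1])
        dense_rank.append(dense_rank[-1])
    else:
        # rank() is position based, so a tie leaves a gap behind it.
        rank.append(i + 1)
        # dense_rank() just moves on to the next integer.
        dense_rank.append(dense_rank[-1] + 1 if dense_rank else 1)

print(row_number)  # [1, 2, 3, 4]
print(rank)        # [1, 2, 2, 4]
print(dense_rank)  # [1, 2, 2, 3]
```

Which of the two tied rows receives `row_number` 2 and which receives 3 depends on how the engine orders ties, so add a tie-breaking column to the ordering when the result must be deterministic.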

### Task 1.2: LAG and LEAD - Comparing with previous/next values

**Instructions:**
1. For each customer, compute:
   - `previous_order_amount`: the amount of the previous order (using `lag`)
   - `next_order_amount`: the amount of the next order (using `lead`)
   - `amount_diff_vs_previous`: the difference between the current and the previous amount
2. Window spec: `partitionBy("customer_id").orderBy("order_date")`
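
As a rough mental model, `lag` and `lead` shift a column by one row inside each partition and fill the missing edge with null. The plain-Python sketch below mimics that for a single customer; the amounts are invented, and `None` stands in for Spark's null.

```python
# One customer's order amounts in chronological order (illustrative data only).
amounts = [120.0, 80.0, 200.0]

# lag(total_amount, 1): value from the previous row, null for the first row
previous_amount = [None] + amounts[:-1]

# lead(total_amount, 1): value from the next row, null for the last row
next_amount = amounts[1:] + [None]

# Arithmetic with null yields null, so the first row has no difference.
diff_vs_previous = [
    None if prev is None else cur - prev
    for cur, prev in zip(amounts, previous_amount)
]

print(previous_amount)   # [None, 120.0, 80.0]
print(next_amount)       # [80.0, 200.0, None]
print(diff_vs_previous)  # [None, -40.0, 120.0]
```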

In [None]:
# TODO: Task 1.2 - LAG and LEAD

# Window spec - chronological order
window_chrono = Window.partitionBy("____").orderBy("____")  # customer_id, order_date

# Use LAG and LEAD
orders_lag_lead = (
    test_orders
    .withColumn("previous_order_amount", F.____(____, ____).over(____))  # lag, total_amount, 1, window_chrono
    .withColumn("next_order_amount", F.____(____, 1).over(window_chrono))  # lead, total_amount
    .withColumn(
        "amount_diff_vs_previous",
        F.col("____") - F.col("____")  # total_amount, previous_order_amount
    )
)

display(orders_lag_lead.select(
    "customer_id", "order_date", "total_amount", 
    "previous_order_amount", "next_order_amount", "amount_diff_vs_previous"
).orderBy("customer_id", "order_date"))

### Task 1.3: Rolling Aggregations - Moving averages

**Instructions:**
1. Compute a rolling average of the order amount:
   - Window: the last 3 orders (the current one plus the 2 preceding ones)
2. Use `.rowsBetween(-2, 0)` in the window spec
3. Add the column `rolling_avg_3_orders`
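
The frame `rowsBetween(-2, 0)` can be pictured as a slice that ends at the current row and reaches back at most two rows. A minimal plain-Python sketch of that frame, using made-up amounts for one customer:

```python
# One customer's order amounts in chronological order (illustrative data only).
amounts = [100.0, 200.0, 300.0, 400.0]

rolling_avg = []
for i in range(len(amounts)):
    # rowsBetween(-2, 0): up to two preceding rows plus the current row.
    # The first rows simply have a shorter frame; nothing is padded.
    frame = amounts[max(0, i - 2): i + 1]
    rolling_avg.append(sum(frame) / len(frame))

print(rolling_avg)  # [100.0, 150.0, 200.0, 300.0]
```

Note that the first two values are averages over one and two rows respectively, which is also how the Spark frame behaves at the start of a partition.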

In [None]:
# TODO: Task 1.3 - Rolling aggregations

# Window spec with rowsBetween
window_rolling = (
    Window
    .partitionBy("customer_id")
    .orderBy("order_date")
    .____(____, ____)  # rowsBetween, -2, 0 (the last 3 rows)
)

# Rolling average
orders_rolling = (
    test_orders
    .withColumn(
        "rolling_avg_3_orders",
        F.____("____").over(____)  # avg, total_amount, window_rolling
    )
    .withColumn(
        "rolling_sum_3_orders",
        F.sum("total_amount").over(window_rolling)
    )
)

display(orders_rolling.select(
    "customer_id", "order_date", "total_amount", 
    "rolling_avg_3_orders", "rolling_sum_3_orders"
).orderBy("customer_id", "order_date"))

### Task 1.4: Cumulative Sum - Running total

**Instructions:**
1. Compute the cumulative sum of order amounts per customer
2. Use `.rowsBetween(Window.unboundedPreceding, Window.currentRow)`
3. Add the column `cumulative_amount`
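
The explicit `rowsBetween` frame matters when several orders share the same `order_date`. With an `orderBy` and no explicit frame, Spark defaults to a range frame, which treats tied rows as peers and gives them the same running total. The plain-Python sketch below contrasts the two behaviours on invented data; it is an illustration of the semantics, not Spark code.

```python
from itertools import accumulate

# (order_date, total_amount) for one customer; two orders share a date.
orders = [("2024-01-01", 100.0), ("2024-01-02", 50.0), ("2024-01-02", 25.0)]

# rowsBetween(unboundedPreceding, currentRow): strictly row by row
rows_frame = list(accumulate(amount for _, amount in orders))

# Default range frame: every row with order_date <= current date is included,
# so both rows dated 2024-01-02 see the same total.
range_frame = [
    sum(amount for other_date, amount in orders if other_date <= current_date)
    for current_date, _ in orders
]

print(rows_frame)   # [100.0, 150.0, 175.0]
print(range_frame)  # [100.0, 175.0, 175.0]
```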

In [None]:
# TODO: Task 1.4 - Cumulative sum

# Window spec for the cumulative sum
window_cumulative = (
    Window
    .partitionBy("____")
    .orderBy("____")
    .rowsBetween(Window.____, Window.____)  # unboundedPreceding, currentRow
)

# Cumulative sum
orders_cumulative = (
    test_orders
    .withColumn(
        "cumulative_amount",
        F.____(____("____")).over(window_cumulative)  # sum, F.col, total_amount (to round, wrap the whole .over(...) expression in F.round)
    )
)

display(orders_cumulative.select(
    "customer_id", "order_date", "total_amount", "cumulative_amount"
).orderBy("customer_id", "order_date"))

---

## 🗂️ Part 2: Processing complex structures

### Task 2.1: JSON Processing - from_json() and explode()

**Instructions:**
1. Load the JSON data from the Volume (orders)
2. Use `from_json()` to parse JSON strings where needed
3. Use `explode()` to unpack an array into one row per element
4. Extract fields from the nested struct
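
Steps 2 to 4 amount to "parse the string, loop over the array, pick fields". The following plain-Python sketch shows that flow with the standard `json` module; it only mirrors the roles of `from_json()` and `explode()` and is not the PySpark solution. The two sample rows have the same shape as the test data used in this task.

```python
import json

# (order_id, order_json) pairs holding a nested JSON string
raw_orders = [
    (1, '{"items": [{"product": "laptop", "price": 1200}, '
        '{"product": "mouse", "price": 25}], "total": 1225}'),
    (2, '{"items": [{"product": "keyboard", "price": 80}], "total": 80}'),
]

exploded = []
for order_id, order_json in raw_orders:
    parsed = json.loads(order_json)   # role of from_json(): string -> structure
    for item in parsed["items"]:      # role of explode(): one row per array element
        exploded.append(
            (order_id, item["product"], item["price"], parsed["total"])
        )

for row in exploded:
    print(row)
# (1, 'laptop', 1200, 1225)
# (1, 'mouse', 25, 1225)
# (2, 'keyboard', 80, 80)
```

One difference worth keeping in mind: `json.loads` raises on malformed input, whereas Spark's `from_json()` needs an explicit schema and returns null for strings it cannot parse.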

In [None]:
# Load the JSON data from the Volume (orders may contain nested structures)
# Spark already parses the Volume's JSON on read, so we also build an example with a nested JSON string

# Option 1: Use the data from the Volume
json_orders = spark.read.json(f"{volume_path}/orders/orders_batch.json")

# Option 2: For the exercises, create test data with a nested JSON string
json_data = spark.createDataFrame([
    (1, '{"items": [{"product": "laptop", "price": 1200}, {"product": "mouse", "price": 25}], "total": 1225}'),
    (2, '{"items": [{"product": "keyboard", "price": 80}], "total": 80}'),
    (3, '{"items": [{"product": "monitor", "price": 350}, {"product": "cable", "price": 15}], "total": 365}')
], ["order_id", "order_json"])

display(json_data)

In [None]:
# TODO: Task 2.1 - JSON processing

# Define the JSON schema
json_schema = StructType([
    StructField("items", ArrayType(StructType([
        StructField("product", StringType()),
        StructField("price", IntegerType())
    ]))),
    StructField("total", IntegerType())
])

# Parse JSON
orders_parsed = (
    json_data
    .withColumn("parsed", F.____(____("____"), ____))  # from_json, order_json, json_schema
)

display(orders_parsed.select("order_id", "parsed"))

In [None]:
# TODO: Explode the array and extract the fields

orders_exploded = (
    orders_parsed
    .withColumn("item", F.____("____"))  # explode, parsed.items
    .select(
        "order_id",
        F.col("____").alias("product_name"),  # item.product
        F.col("____").alias("product_price"),  # item.price
        F.col("____").alias("order_total")  # parsed.total
    )
)

display(orders_exploded)

### Task 2.2: Array Functions - collect_list, array_contains

**Instructions:**
1. Group the orders per customer
2. Use `collect_list()` to gather all order amounts into an array
3. Use `array_contains()` to check whether the array holds a specific value; it tests exact equality, so a threshold check such as "any order > 500" needs `F.array_max()` or `F.exists()` instead
4. Use `size()` to count the number of orders
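
The difference between a membership test and a threshold test is easy to see in plain Python. The sketch below is an analogy only: the list append plays the role of `collect_list()`, `len` of `size()`, `in` of `array_contains()`, and `any(...)` of a threshold check. The order data is invented.

```python
from collections import defaultdict

# (customer_id, total_amount) pairs (illustrative data only)
orders = [("C1", 120.0), ("C1", 650.0), ("C2", 500.0)]

# collect_list(): gather every amount per customer into a list
order_amounts = defaultdict(list)
for customer_id, amount in orders:
    order_amounts[customer_id].append(amount)

summary = {
    customer_id: {
        "num_orders": len(amounts),                       # size()
        "contains_500": 500.0 in amounts,                 # exact match, like array_contains
        "any_over_500": any(a > 500 for a in amounts),    # threshold check
    }
    for customer_id, amounts in order_amounts.items()
}

print(summary["C1"])  # {'num_orders': 2, 'contains_500': False, 'any_over_500': True}
print(summary["C2"])  # {'num_orders': 1, 'contains_500': True, 'any_over_500': False}
```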

In [None]:
# TODO: Task 2.2 - Array functions

customer_arrays = (
    test_orders
    .groupBy("____")  # customer_id
    .agg(
        F.____(____("____")).alias("order_amounts"),  # collect_list, total_amount
        F.collect_list("order_date").alias("order_dates"),
        F.count("*").alias("total_orders")
    )
    .withColumn(
        "num_orders",
        F.____("____")  # size, order_amounts
    )
)

display(customer_arrays)

### Task 2.3: Struct - Combining columns into structures

**Instructions:**
1. Create a struct `customer_info` containing: customer_id, total_orders
2. Create a struct `order_summary` containing: min/max/avg amount
3. Extract fields from the structs using `.` notation

In [None]:
# TODO: Task 2.3 - Struct operations

customer_structs = (
    test_orders
    .groupBy("customer_id")
    .agg(
        F.count("*").alias("total_orders"),
        F.min("total_amount").alias("min_amount"),
        F.max("total_amount").alias("max_amount"),
        F.avg("total_amount").alias("avg_amount")
    )
    .withColumn(
        "customer_info",
        F.____("____", "____")  # struct, customer_id, total_orders
    )
    .withColumn(
        "order_summary",
        F.struct("min_amount", "____", "____")  # max_amount, avg_amount
    )
)

display(customer_structs.select("customer_info", "order_summary"))

In [None]:
# Extract fields from the structs
customer_flat = (
    customer_structs
    .select(
        F.col("customer_info.____").alias("customer_id"),  # customer_id
        F.col("order_summary.____").alias("avg_order_value")  # avg_amount
    )
)

display(customer_flat)

---

## 📅 Part 3: Advanced date operations

### Task 3.1: Date truncation and extraction

**Instructions:**
1. Use `date_trunc()` to truncate dates to: month, quarter, year
2. Use `year()`, `month()`, `dayofweek()` to extract parts of the date
3. Compute `days_since_order` (the difference between today and the order date)
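
The sketch below reproduces these operations with Python's `datetime` so the expected values can be checked by hand. It is an approximation of the Spark semantics under two stated assumptions: `today` is fixed to a made-up date so the output is reproducible, and the results are plain dates, whereas Spark's `date_trunc()` returns a timestamp. Spark's `dayofweek()` counts from 1 = Sunday to 7 = Saturday, which differs from Python's `isoweekday()`.

```python
from datetime import date

order_date = date(2024, 5, 17)  # illustrative order date (a Friday)
today = date(2024, 6, 1)        # fixed stand-in for current_date()

# Truncation: snap the date back to the start of the period.
month_start = order_date.replace(day=1)
quarter_start = date(order_date.year, 3 * ((order_date.month - 1) // 3) + 1, 1)
year_start = date(order_date.year, 1, 1)

# Extraction: isoweekday() is 1 = Monday .. 7 = Sunday,
# so shift it to Spark's 1 = Sunday .. 7 = Saturday convention.
spark_dayofweek = order_date.isoweekday() % 7 + 1

# datediff(end, start): number of days from start to end
days_since_order = (today - order_date).days

print(month_start, quarter_start, year_start)  # 2024-05-01 2024-04-01 2024-01-01
print(spark_dayofweek)                         # 6
print(days_since_order)                        # 15
```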

In [None]:
# TODO: Task 3.1 - Date functions

orders_dates = (
    test_orders
    .withColumn("order_month", F.____("____", F.____("____")))  # date_trunc, month, col, order_date (the format comes first)
    .withColumn("order_quarter", F.date_trunc("____", "order_date"))  # quarter
    .withColumn("order_year_num", F.____("____"))  # year, order_date
    .withColumn("order_month_num", F.____(____("____")))  # month, order_date
    .withColumn("day_of_week", F.____(____("order_date")))  # dayofweek
    .withColumn(
        "days_since_order",
        F.datediff(F.____(), "____")  # current_date, order_date
    )
)

display(orders_dates.select(
    "order_id", "order_date", "order_month", "order_quarter",
    "order_year_num", "order_month_num", "day_of_week", "days_since_order"
))

### Task 3.2: Date arithmetic - adding/subtracting periods

**Instructions:**
1. Use `date_add()` to add 30 days to the order date
2. Use `add_months()` to add 3 months
3. Use `last_day()` to get the last day of the month
4. Use `next_day()` to get the first Monday after the order date
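
The four operations can be approximated with the standard library to see what values to expect. The helper functions below are hypothetical stand-ins written for this illustration, not Spark APIs. The month-end clamping in `add_months` is an assumption that is believed to match Spark 3.x behaviour, and `next_day` is strictly after the input date, so a Monday maps to the following Monday.

```python
import calendar
from datetime import date, timedelta

def add_months(d: date, months: int) -> date:
    """Shift by whole months, clamping the day to the target month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last))

def last_day(d: date) -> date:
    """Last calendar day of d's month."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])

def next_day(d: date, weekday: int) -> date:
    """First date strictly after d that falls on weekday (0 = Monday)."""
    delta = (weekday - d.weekday() - 1) % 7 + 1
    return d + timedelta(days=delta)

order_date = date(2024, 11, 30)  # illustrative order date (a Saturday)

delivery_date_estimate = order_date + timedelta(days=30)  # like date_add(..., 30)
renewal_date = add_months(order_date, 3)
month_end = last_day(order_date)
next_monday = next_day(order_date, 0)

print(delivery_date_estimate)  # 2024-12-30
print(renewal_date)            # 2025-02-28
print(month_end)               # 2024-11-30
print(next_monday)             # 2024-12-02
```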

In [None]:
# TODO: Task 3.2 - Date arithmetic

orders_date_math = (
    test_orders
    .withColumn("delivery_date_estimate", F.____(____("____"), ____))  # date_add, order_date, 30
    .withColumn("renewal_date", F.____(____("order_date"), ____))  # add_months, 3
    .withColumn("month_end", F.____(____("____")))  # last_day, order_date
    .withColumn("next_monday", F.next_day("____", "____"))  # order_date, Monday
)

display(orders_date_math.select(
    "order_date", "delivery_date_estimate", "renewal_date", 
    "month_end", "next_monday"
))

### Task 3.3: Generating date sequences

**Instructions:**
1. Use `sequence()` to generate an array of dates between two dates
2. Use `explode()` to create one row per date
3. Build a calendar table with every day between the min and max order_date
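
A calendar table is just "every day from the first date to the last, both ends included", with a few derived columns. The plain-Python sketch below builds one for a short made-up range that crosses a leap day; the `min_date` and `max_date` values are assumptions standing in for the real order date range.

```python
from datetime import date, timedelta

# Illustrative bounds; in the exercise they come from min/max of order_date.
min_date = date(2024, 2, 27)
max_date = date(2024, 3, 2)

calendar_rows = []
current = min_date
while current <= max_date:  # inclusive on both ends, like sequence()
    calendar_rows.append({
        "date": current,
        "year": current.year,
        "month": current.month,
        # Spark convention: 1 = Sunday .. 7 = Saturday
        "day_of_week": current.isoweekday() % 7 + 1,
    })
    current += timedelta(days=1)

print(len(calendar_rows))                # 5
print(calendar_rows[0]["day_of_week"])   # 3 (Tuesday)
print(calendar_rows[-1]["date"])         # 2024-03-02
```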

In [None]:
# TODO: Task 3.3 - Date sequences

# Find the min and max dates
date_range = test_orders.select(
    F.min("order_date").alias("min_date"),
    F.max("order_date").alias("max_date")
).first()

# Generate the date sequence
calendar = (
    spark.range(1)
    .select(
        F.____(  # explode
            F.____(
                F.lit(date_range["____"]),  # min_date
                F.lit(date_range["max_date"]),
                F.expr("____")  # interval 1 day
            )
        ).alias("date")
    )
    .withColumn("year", F.year("date"))
    .withColumn("month", F.____(____("____")))  # month, date
    .withColumn("day_of_week", F.dayofweek("date"))
)

print(f"Calendar table: {calendar.count()} days")
display(calendar)

---

## ✅ Workshop summary

**Goals achieved:**
- ✅ Window Functions (ranking, lag/lead, rolling aggregations, cumulative sum)
- ✅ JSON processing (from_json, explode, struct)
- ✅ Array operations (collect_list, array_contains, size)
- ✅ Advanced date operations (truncation, arithmetic, sequences)

**Key takeaways:**
1. Window Functions enable per-group analysis without collapsing rows the way GROUP BY does
2. JSON and complex structures are natively supported in Spark
3. Date functions enable advanced temporal analysis
4. Optimization: use broadcast for small tables in a JOIN

**Best Practices:**
- Window Functions: always define an explicit window spec
- JSON: use schema inference only for exploration
- Dates: use native date types (not strings)
- Performance: use cache() for frequently reused DataFrames

---

## 🧹 Cleanup (optional)

In [None]:
# Drop temporary views
# spark.catalog.dropTempView("orders")
# spark.catalog.clearCache()