# Advanced Transformations (PySpark & SQL)

**Workshop Objective:**
- Practical application of Window Functions (lag, lead, rank, rolling aggregations)
- Processing complex structures (JSON, arrays, structs)
- Advanced date and time operations
- Transformation optimization for performance

**Note:** You can choose to solve the tasks using **PySpark** or **SQL**. Both approaches are provided.

**Time:** 30 minutes

---

## User Isolation
This notebook is designed to be run in a shared environment.
To avoid conflicts, we will use a unique `catalog` and `schema` for your user.
The `00_setup` script will automatically configure these for you.

## Environment Configuration
We will configure the environment variables and paths used in this workshop.

In [0]:
%run ../00_setup

## Configuration

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import *
from datetime import datetime, timedelta

# Display user context
print("=== User Context ===")
print(f"Catalog: {CATALOG}")
print(f"Schema: {BRONZE_SCHEMA}")
print(f"User: {raw_user}")

# Set catalog and schema as default
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

## Data Preparation from Databricks Volume

Load data from Databricks Volume for the workshop:

In [0]:
# Load customer data
customers_df = spark.read.csv(f"{volume_path}/customers/customers.csv", header=True, inferSchema=True)

In [0]:
# Generate sample orders data for the workshop
data = [
    (1, "2024-01-01", 100), (1, "2024-01-02", 150), (1, "2024-01-02", 150), (1, "2024-01-05", 200),
    (2, "2024-01-10", 50), (2, "2024-01-12", 80),
    (3, "2024-02-01", 300), (3, "2024-02-01", 300), (3, "2024-02-05", 350)
]
columns = ["customer_id", "order_date", "total_amount"]
test_orders = spark.createDataFrame(data, columns)
test_orders = test_orders.withColumn("order_date", F.to_date("order_date"))

# Register as temp view for SQL exercises
test_orders.createOrReplaceTempView("orders")
print("Created 'orders' temporary view for SQL exercises.")
display(test_orders)

# Part A: PySpark Implementation

In this section, you will implement the transformations using the PySpark DataFrame API.

In [0]:
# The sample orders above are flat, so for practice we build a small DataFrame
# whose order_json column holds a nested JSON string (an items array plus a total).
json_data = spark.createDataFrame([
 (1, '{"items": [{"product": "laptop", "price": 1200}, {"product": "mouse", "price": 25}], "total": 1225}'),
 (2, '{"items": [{"product": "keyboard", "price": 80}], "total": 80}'),
 (3, '{"items": [{"product": "monitor", "price": 350}, {"product": "cable", "price": 15}], "total": 365}')
], ["order_id", "order_json"])

# Register for SQL
json_data.createOrReplaceTempView("json_orders_raw")

display(json_data)

---

## Window Functions

### Ranking - ROW_NUMBER, RANK, DENSE_RANK

**Instructions:**
1. For each customer, rank orders by date (newest first)
2. Add columns:
 - `row_num`: using `row_number()`
 - `rank`: using `rank()`
 - `dense_rank`: using `dense_rank()`
3. Window spec: `partitionBy("customer_id").orderBy(F.desc("order_date"))`

**Expected Result:**
- Each customer has orders numbered starting from 1 (newest)

In [0]:
# TODO: Task 1.1 - Ranking functions

from pyspark.sql.window import Window

# Define window spec
window_spec = Window.____("____").orderBy(F.____("____")) # partitionBy customer_id, orderBy desc order_date

# Add ranking columns
orders_ranked = (
 test_orders
 .withColumn("row_num", F.____().____(window_spec)) # row_number, over
 .withColumn("rank", F.____().over(____)) # rank, window_spec
 .withColumn("dense_rank", F.____().over(window_spec)) # dense_rank
)

display(orders_ranked.orderBy("customer_id", "order_date"))

**Explanation of differences:**

- **ROW_NUMBER**: Unique sequential numbers (1, 2, 3...)
- **RANK**: Gaps in numbering for ties (1, 2, 2, 4...)
- **DENSE_RANK**: No gaps for ties (1, 2, 2, 3...)
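
The difference between the three functions is easiest to see on a partition that contains a tie. The following is a plain-Python sketch (no Spark required) that mimics the numbering for customer 1 in the sample data; the helper name `rank_rows` is hypothetical and exists only for this illustration.

```python
def rank_rows(sorted_keys):
    """Return (row_number, rank, dense_rank) for each key of a sorted partition."""
    results = []
    rank = dense = 0
    previous = object()  # sentinel that never equals a real key
    for position, key in enumerate(sorted_keys, start=1):
        if key != previous:
            rank = position  # RANK jumps to the row position after a tie
            dense += 1       # DENSE_RANK always advances by exactly one
            previous = key
        results.append((position, rank, dense))
    return results


# Customer 1 from the sample data, newest order first (two orders share a date).
dates_desc = ["2024-01-05", "2024-01-02", "2024-01-02", "2024-01-01"]
for order_date, numbers in zip(dates_desc, rank_rows(dates_desc)):
    print(order_date, numbers)
```

The two tied rows share rank 2; `rank` then skips to 4 while `dense_rank` continues with 3. Which of the tied rows receives `row_num` 2 and which 3 is not determined by the ordering column alone.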

### LAG and LEAD - Compare with previous/next values

**Instructions:**
1. For each customer, calculate:
 - `previous_order_amount`: value of the previous order (using `lag`)
 - `next_order_amount`: value of the next order (using `lead`)
 - `amount_diff_vs_previous`: difference between current and previous
2. Window spec: `partitionBy("customer_id").orderBy("order_date")`
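
As a rough mental model, `lag` and `lead` shift a column by a fixed number of rows inside each partition and return null where no such row exists. The plain-Python sketch below (no Spark required) mimics that behaviour for customer 1; the helper functions are illustrative stand-ins, not the Spark implementation.

```python
def lag(values, offset=1):
    """Value `offset` rows earlier in the partition, or None if there is none."""
    return [values[i - offset] if i - offset >= 0 else None
            for i in range(len(values))]


def lead(values, offset=1):
    """Value `offset` rows later in the partition, or None if there is none."""
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]


amounts = [100, 150, 150, 200]  # customer 1, chronological order
previous = lag(amounts)
following = lead(amounts)
# A missing previous value propagates, just as null arithmetic does in Spark.
diffs = [None if p is None else a - p for a, p in zip(amounts, previous)]

print(previous)   # [None, 100, 150, 150]
print(following)  # [150, 150, 200, None]
print(diffs)      # [None, 50, 0, 50]
```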

In [0]:
# TODO: Task 1.2 - LAG and LEAD

# Window spec - chronological order
window_chrono = Window.partitionBy("____").orderBy("____") # customer_id, order_date

# Use LAG and LEAD
orders_lag_lead = (
 test_orders
 .withColumn("previous_order_amount", F.____(____, ____).over(____)) # lag, total_amount, 1, window_chrono
 .withColumn("next_order_amount", F.____(____, 1).over(window_chrono)) # lead, total_amount
 .withColumn(
 "amount_diff_vs_previous",
 F.col("____") - F.col("____") # total_amount, previous_order_amount
 )
)

display(orders_lag_lead.select(
 "customer_id", "order_date", "total_amount", 
 "previous_order_amount", "next_order_amount", "amount_diff_vs_previous"
).orderBy("customer_id", "order_date"))

### Rolling Aggregations - Moving Averages

**Instructions:**
1. Calculate rolling average for order amount:
 - Window: the last 3 orders (current + 2 previous)
2. Use `.rowsBetween(-2, 0)` for window spec
3. Add column `rolling_avg_3_orders`
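
A frame of `rowsBetween(-2, 0)` covers the current row and up to two rows before it, so the first rows of each partition are averaged over fewer than three values. A plain-Python sketch of that frame logic (the function name `rolling_avg` is hypothetical):

```python
def rolling_avg(values, preceding=2):
    """Average over the current row and up to `preceding` earlier rows."""
    averages = []
    for i in range(len(values)):
        frame = values[max(0, i - preceding): i + 1]  # shorter at the partition start
        averages.append(sum(frame) / len(frame))
    return averages


amounts = [100, 150, 150, 200]  # customer 1, chronological order
for amount, avg in zip(amounts, rolling_avg(amounts)):
    print(amount, round(avg, 2))
```

For the first row the frame holds a single value (average 100.0), for the second row two values (125.0), and only from the third row on does it hold three.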

In [0]:
# TODO: Task 1.3 - Rolling aggregations

# Window spec with rowsBetween
window_rolling = (
 Window
 .partitionBy("customer_id")
 .orderBy("order_date")
 .____(____, ____) # rowsBetween, -2, 0 (3 last records)
)

# Rolling average
orders_rolling = (
 test_orders
 .withColumn(
 "rolling_avg_3_orders",
 F.____("____").over(____) # avg, total_amount, window_rolling
 )
 .withColumn(
 "rolling_sum_3_orders",
 F.sum("total_amount").over(window_rolling)
 )
)

display(orders_rolling.select(
 "customer_id", "order_date", "total_amount", 
 "rolling_avg_3_orders", "rolling_sum_3_orders"
).orderBy("customer_id", "order_date"))

### Cumulative Sum

**Instructions:**
1. Calculate cumulative sum of order amounts per customer
2. Use `.rowsBetween(Window.unboundedPreceding, Window.currentRow)`
3. Add column `cumulative_amount`
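
The explicit `rowsBetween` matters when the ordering column has ties. With an `ORDER BY` and no frame, Spark defaults to a RANGE frame, under which rows with the same `order_date` share one cumulative value. The plain-Python sketch below imitates both frames for customer 1; it illustrates the semantics and is not Spark code.

```python
from itertools import accumulate

dates = ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-05"]
amounts = [100, 150, 150, 200]

# ROWS frame: running total row by row.
rows_frame = list(accumulate(amounts))

# RANGE frame: every row whose ordering key is <= the current key is included,
# so the two orders on 2024-01-02 get the same value.
range_frame = [
    sum(a for d, a in zip(dates, amounts) if d <= current)
    for current in dates
]

print(rows_frame)   # [100, 250, 400, 600]
print(range_frame)  # [100, 400, 400, 600]
```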

In [0]:
# TODO: Task 1.4 - Cumulative sum

# Window spec for cumulative
window_cumulative = (
 Window
 .partitionBy("____")
 .orderBy("____")
 .rowsBetween(Window.____, Window.____) # unboundedPreceding, currentRow
)

# Cumulative sum
orders_cumulative = (
 test_orders
 .withColumn(
 "cumulative_amount",
 F.____("____").over(window_cumulative) # sum, total_amount
 )
)

display(orders_cumulative.select(
 "customer_id", "order_date", "total_amount", "cumulative_amount"
).orderBy("customer_id", "order_date"))

---

## Processing Complex Structures

### JSON Processing - from_json() and explode()

**Instructions:**
1. Start from the `json_data` DataFrame created above (`order_json` holds a JSON string)
2. Use `from_json()` with an explicit schema to parse the string into a struct
3. Use `explode()` to "unpack" array
4. Extract fields from nested struct
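
Conceptually, `from_json()` turns each string into a nested record and `explode()` emits one output row per array element while repeating the parent fields. A plain-Python sketch of that two-step flattening, using two of the sample rows (the variable names are chosen for this illustration only):

```python
import json

raw_orders = [
    (1, '{"items": [{"product": "laptop", "price": 1200}, '
        '{"product": "mouse", "price": 25}], "total": 1225}'),
    (2, '{"items": [{"product": "keyboard", "price": 80}], "total": 80}'),
]

exploded = []
for order_id, order_json in raw_orders:
    parsed = json.loads(order_json)   # plays the role of from_json()
    for item in parsed["items"]:      # plays the role of explode()
        exploded.append(
            (order_id, item["product"], item["price"], parsed["total"])
        )

for row in exploded:
    print(row)
```

Order 1 yields two rows (one per item), each carrying the same `order_total` of 1225.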

In [0]:
# TODO: Task 2.1 - JSON processing

# Define JSON schema
json_schema = StructType([
 StructField("items", ArrayType(StructType([
 StructField("product", StringType()),
 StructField("price", IntegerType())
 ]))),
 StructField("total", IntegerType())
])

# Parse JSON
orders_parsed = (
 json_data
 .withColumn("parsed", F.____(____("____"), ____)) # from_json, order_json, json_schema
)

display(orders_parsed.select("order_id", "parsed"))

In [0]:
# TODO: Explode array and extract fields

orders_exploded = (
 orders_parsed
 .withColumn("item", F.____("____")) # explode, parsed.items
 .select(
 "order_id",
 F.col("____").alias("product_name"), # item.product
 F.col("____").alias("product_price"), # item.price
 F.col("____").alias("order_total") # parsed.total
 )
)

display(orders_exploded)

### Array Functions - collect_list, array_contains

**Instructions:**
1. Group orders per customer
2. Use `collect_list()` to gather all order amounts into an array
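3. Use `array_contains()` to check whether the array holds a given amount (it tests exact membership, so a threshold such as > 500 needs a different function, e.g. `exists()`)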
4. Use `size()` to count number of orders
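
The sketch below uses plain Python to contrast an exact-membership test (what `array_contains()` does) with a threshold test (for which Spark offers the higher-order function `exists()`), alongside the element count that `size()` returns. The dictionary mirrors the arrays `collect_list()` would build from the sample data; the name `summary` is illustrative.

```python
# Arrays as collect_list(total_amount) would produce them per customer.
order_amounts = {
    1: [100, 150, 150, 200],
    2: [50, 80],
    3: [300, 300, 350],
}

summary = {}
for customer_id, amounts in order_amounts.items():
    contains_300 = 300 in amounts               # like array_contains(arr, 300)
    has_large = any(a > 500 for a in amounts)   # threshold test, like exists()
    num_orders = len(amounts)                   # like size(arr)
    summary[customer_id] = (contains_300, has_large, num_orders)

print(summary)
```

In the sample data no order exceeds 500, so the threshold test is false for every customer, while only customer 3 has an order of exactly 300.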

In [0]:
# TODO: Task 2.2 - Array functions

customer_arrays = (
 test_orders
 .groupBy("____") # customer_id
 .agg(
 F.____(____("____")).alias("order_amounts"), # collect_list, total_amount
 F.collect_list("order_date").alias("order_dates"),
 F.count("*").alias("total_orders")
 )
 .withColumn(
 "num_orders",
 F.____("____") # size, order_amounts
 )
)

display(customer_arrays)

### Struct - Combining columns into structures

**Instructions:**
1. Create struct `customer_info` containing: customer_id, total_orders
2. Create struct `order_summary` containing: min/max/avg amount
3. Extract fields from struct using `.` notation

In [0]:
# TODO: Task 2.3 - Struct operations

customer_structs = (
 test_orders
 .groupBy("customer_id")
 .agg(
 F.count("*").alias("total_orders"),
 F.min("total_amount").alias("min_amount"),
 F.max("total_amount").alias("max_amount"),
 F.avg("total_amount").alias("avg_amount")
 )
 .withColumn(
 "customer_info",
 F.____("____", "____") # struct, customer_id, total_orders
 )
 .withColumn(
 "order_summary",
 F.struct("min_amount", "____", "____") # max_amount, avg_amount
 )
)

display(customer_structs.select("customer_info", "order_summary"))

In [0]:
# Extract fields from struct
customer_flat = (
 customer_structs
 .select(
 F.col("customer_info.____").alias("customer_id"), # customer_id
 F.col("order_summary.____").alias("avg_order_value") # avg_amount
 )
)

display(customer_flat)

---

## Part 3: Advanced Date Operations

### Task 3.1: Date truncation and extraction

**Instructions:**
1. Use `date_trunc()` to truncate dates to the start of the month, quarter, or year
2. Use `year()`, `month()`, `dayofweek()` to extract date parts
3. Calculate `days_since_order` (difference between today and order date)
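
To make the expected values concrete, here is a plain-Python sketch of the same operations for one sample date. The helper names are hypothetical, and a fixed `today` is assumed so the output is reproducible. Note that Spark's `dayofweek()` numbers Sunday as 1 and Saturday as 7, which differs from Python's `isoweekday()`.

```python
from datetime import date


def trunc_month(d):
    """First day of the month containing d."""
    return d.replace(day=1)


def trunc_quarter(d):
    """First day of the quarter containing d."""
    first_month = 3 * ((d.month - 1) // 3) + 1
    return date(d.year, first_month, 1)


def spark_dayofweek(d):
    """Map Python's Monday=1..Sunday=7 onto Spark's Sunday=1..Saturday=7."""
    return d.isoweekday() % 7 + 1


order_date = date(2024, 2, 5)   # a Monday from the sample data
today = date(2024, 3, 1)        # assumed fixed "current date" for the example

print(trunc_month(order_date))      # 2024-02-01
print(trunc_quarter(order_date))    # 2024-01-01
print(spark_dayofweek(order_date))  # 2
print((today - order_date).days)    # 25
```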

In [0]:
# TODO: Task 3.1 - Date functions

orders_dates = (
 test_orders
 .withColumn("order_month", F.____("____", "____")) # date_trunc, month, order_date (format comes first)
 .withColumn("order_quarter", F.date_trunc("____", "order_date")) # quarter
 .withColumn("order_year_num", F.____("____")) # year, order_date
 .withColumn("order_month_num", F.____(____("____"))) # month, order_date
 .withColumn("day_of_week", F.____(____("order_date"))) # dayofweek
 .withColumn(
 "days_since_order",
 F.datediff(F.____(), "____") # current_date, order_date
 )
)

display(orders_dates.select(
 "customer_id", "order_date", "order_month", "order_quarter",
 "order_year_num", "order_month_num", "day_of_week", "days_since_order"
))

### Task 3.2: Date arithmetic - adding/subtracting periods

**Instructions:**
1. Use `date_add()` to add 30 days to order date
2. Use `add_months()` to add 3 months
3. Use `last_day()` to get the last day of the month
4. Use `next_day()` to get the next Monday
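
Two details are easy to get wrong here: `add_months()` clamps to the end of the target month when the day does not exist there, and `next_day()` returns a date strictly later than the input, even if the input already falls on the requested weekday. The plain-Python sketch below reproduces those semantics; the functions are illustrative re-implementations, not Spark's own code.

```python
import calendar
from datetime import date, timedelta


def add_months(d, months):
    """Shift by whole months, clamping the day to the target month's length."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    last = calendar.monthrange(year, month_index + 1)[1]
    return date(year, month_index + 1, min(d.day, last))


def last_day(d):
    """Last day of the month containing d."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


def next_day(d, iso_weekday):
    """First date strictly after d that falls on iso_weekday (Mon=1..Sun=7)."""
    delta = (iso_weekday - d.isoweekday() - 1) % 7 + 1
    return d + timedelta(days=delta)


order_date = date(2024, 1, 5)  # a Friday from the sample data

print(order_date + timedelta(days=30))  # 2024-02-04
print(add_months(order_date, 3))        # 2024-04-05
print(last_day(order_date))             # 2024-01-31
print(next_day(order_date, 1))          # 2024-01-08
```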

In [0]:
# TODO: Task 3.2 - Date arithmetic

orders_date_math = (
 test_orders
 .withColumn("delivery_date_estimate", F.____(____("____"), ____)) # date_add, order_date, 30
 .withColumn("renewal_date", F.____(____("order_date"), ____)) # add_months, 3
 .withColumn("month_end", F.____(____("____"))) # last_day, order_date
 .withColumn("next_monday", F.next_day("____", "____")) # order_date, Monday
)

display(orders_date_math.select(
 "order_date", "delivery_date_estimate", "renewal_date", 
 "month_end", "next_monday"
))

### Task 3.3: Generating date sequences

**Instructions:**
1. Use `sequence()` to generate an array of dates between two dates
2. Use `explode()` to create one row per date
3. Create a calendar table with all days between min and max order_date
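
`sequence()` includes both endpoints, so the calendar for the sample data should contain every day from 2024-01-01 through 2024-02-05. A plain-Python sketch of the same idea, useful for checking the expected row count (the function name `date_sequence` is hypothetical):

```python
from datetime import date, timedelta


def date_sequence(start, stop, step=timedelta(days=1)):
    """All dates from start to stop, both endpoints included."""
    days = []
    current = start
    while current <= stop:
        days.append(current)
        current += step
    return days


# Min and max order_date in the sample data.
calendar_days = date_sequence(date(2024, 1, 1), date(2024, 2, 5))

print(len(calendar_days))                  # 36
print(calendar_days[0], calendar_days[-1])
```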

In [0]:
# TODO: Task 3.3 - Date sequences

# Find min and max dates
date_range = test_orders.select(
 F.min("order_date").alias("min_date"),
 F.max("order_date").alias("max_date")
).first()

# Generate date sequence
calendar = (
 spark.range(1)
 .select(
 F.____( # explode
 F.____(
 F.lit(date_range["____"]), # min_date
 F.lit(date_range["max_date"]),
 F.expr("____") # interval 1 day
 )
 ).alias("date")
 )
 .withColumn("year", F.year("date"))
 .withColumn("month", F.____(____("____"))) # month, date
 .withColumn("day_of_week", F.dayofweek("date"))
)

print(f"Calendar table: {calendar.count()} days")
display(calendar)

# Part B: SQL Implementation

In this section, you will implement the same transformations using **Spark SQL**.
We have already registered the `orders` and `json_orders_raw` temporary views for you.

### Ranking - ROW_NUMBER, RANK, DENSE_RANK

**Instructions:**
1. For each customer, rank orders by date (newest first)
2. Add columns:
 - `row_num`: using `row_number()`
 - `rank`: using `rank()`
 - `dense_rank`: using `dense_rank()`
3. Window spec: `PARTITION BY customer_id ORDER BY order_date DESC`

**Expected Result:**
- Each customer has orders numbered starting from 1 (newest)

In [0]:
%sql
-- TODO: Task 1.1 - Ranking functions (SQL)
SELECT 
    *,
    ___ OVER (PARTITION BY ___ ORDER BY ___ DESC) as row_num,
    ___ OVER (PARTITION BY ___ ORDER BY ___ DESC) as rank,
    ___ OVER (PARTITION BY ___ ORDER BY ___ DESC) as dense_rank
FROM orders
ORDER BY customer_id, order_date

**Explanation of differences:**

- **ROW_NUMBER**: Unique sequential numbers (1, 2, 3...)
- **RANK**: Gaps in numbering for ties (1, 2, 2, 4...)
- **DENSE_RANK**: No gaps for ties (1, 2, 2, 3...)

### LAG and LEAD - Compare with previous/next values

**Instructions:**
1. For each customer, calculate:
 - `previous_order_amount`: value of the previous order (using `lag`)
 - `next_order_amount`: value of the next order (using `lead`)
 - `amount_diff_vs_previous`: difference between current and previous
2. Window spec: `PARTITION BY customer_id ORDER BY order_date`

In [0]:
%sql
-- TODO: Task 1.2 - LAG and LEAD (SQL)
SELECT 
    customer_id, order_date, total_amount,
    ___(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) as previous_order_amount,
    ___(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) as next_order_amount,
    total_amount - ___(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) as amount_diff_vs_previous
FROM orders
ORDER BY customer_id, order_date

### Rolling Aggregations - Moving Averages

**Instructions:**
1. Calculate rolling average for order amount:
 - Window: the last 3 orders (current + 2 previous)
2. Use `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` as the window frame
3. Add column `rolling_avg_3_orders`

In [0]:
%sql
-- TODO: Task 1.3 - Rolling aggregations (SQL)
SELECT 
    customer_id, order_date, total_amount,
    AVG(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as rolling_avg_3_orders,
    SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as rolling_sum_3_orders
FROM orders
ORDER BY customer_id, order_date

### Cumulative Sum

**Instructions:**
1. Calculate cumulative sum of order amounts per customer
2. Use `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` as the window frame
3. Add column `cumulative_amount`

In [0]:
%sql
-- TODO: Task 1.4 - Cumulative sum (SQL)
SELECT 
    customer_id, order_date, total_amount,
    SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_amount
FROM orders
ORDER BY customer_id, order_date

### JSON Processing - from_json() and explode()

**Instructions:**
1. Start from the `json_orders_raw` view registered earlier (`order_json` holds a JSON string)
2. Use `from_json()` with a DDL schema string to parse it into a struct
3. Use `explode()` to "unpack" array
4. Extract fields from nested struct

In [0]:
%sql
-- TODO: Task 2.1 - JSON processing (SQL)
SELECT 
    order_id,
    from_json(order_json, 'items ARRAY<STRUCT<product: STRING, price: INT>>, total INT') as parsed
FROM json_orders_raw

In [0]:
%sql
-- TODO: Explode array and extract fields (SQL)
-- Note: We need to parse first, then explode. In SQL we can do it in one query or use CTE.
WITH parsed_data AS (
  SELECT order_id, from_json(order_json, 'items ARRAY<STRUCT<product: STRING, price: INT>>, total INT') as parsed
  FROM json_orders_raw
)
SELECT 
    order_id,
    item.product as product_name,
    item.price as product_price,
    parsed.total as order_total
FROM parsed_data
LATERAL VIEW explode(parsed.items) AS item

### Array Functions - collect_list, array_contains

**Instructions:**
1. Group orders per customer
2. Use `collect_list()` to gather all order amounts into an array
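3. Use `array_contains()` to check whether the array holds a given amount (it tests exact membership, so a threshold such as > 500 needs a different function, e.g. `exists()`)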
4. Use `size()` to count number of orders

In [0]:
%sql
-- TODO: Task 2.2 - Array functions (SQL)
SELECT 
    customer_id,
    collect_list(total_amount) as order_amounts,
    collect_list(order_date) as order_dates,
    count(*) as total_orders,
    size(collect_list(total_amount)) as num_orders
FROM orders
GROUP BY customer_id

### Struct - Combining columns into structures

**Instructions:**
1. Create struct `customer_info` containing: customer_id, total_orders
2. Create struct `order_summary` containing: min/max/avg amount
3. Extract fields from struct using `.` notation

In [0]:
%sql
-- TODO: Task 2.3 - Struct operations (SQL)
SELECT 
    struct(customer_id, count(*) as total_orders) as customer_info,
    struct(min(total_amount) as min_amount, max(total_amount) as max_amount, avg(total_amount) as avg_amount) as order_summary
FROM orders
GROUP BY customer_id

In [0]:
%sql
-- Extract fields from struct (SQL)
-- Assuming we have the structs from previous query (using CTE for demo)
WITH struct_data AS (
    SELECT 
        struct(customer_id, count(*) as total_orders) as customer_info,
        struct(min(total_amount) as min_amount, max(total_amount) as max_amount, avg(total_amount) as avg_amount) as order_summary
    FROM orders
    GROUP BY customer_id
)
SELECT 
    customer_info.customer_id,
    order_summary.avg_amount as avg_order_value
FROM struct_data

### Task 3.1: Date truncation and extraction

**Instructions:**
1. Use `date_trunc()` to truncate dates to the start of the month, quarter, or year
2. Use `year()`, `month()`, `dayofweek()` to extract date parts
3. Calculate `days_since_order` (difference between today and order date)

In [0]:
%sql
-- TODO: Task 3.1 - Date functions (SQL)
SELECT 
    customer_id, order_date,
    date_trunc('month', order_date) as order_month,
    date_trunc('quarter', order_date) as order_quarter,
    year(order_date) as order_year_num,
    month(order_date) as order_month_num,
    dayofweek(order_date) as day_of_week,
    datediff(current_date(), order_date) as days_since_order
FROM orders

### Task 3.2: Date arithmetic - adding/subtracting periods

**Instructions:**
1. Use `date_add()` to add 30 days to order date
2. Use `add_months()` to add 3 months
3. Use `last_day()` to get the last day of the month
4. Use `next_day()` to get the next Monday

In [0]:
%sql
-- TODO: Task 3.2 - Date arithmetic (SQL)
SELECT 
    order_date,
    date_add(order_date, 30) as delivery_date_estimate,
    add_months(order_date, 3) as renewal_date,
    last_day(order_date) as month_end,
    next_day(order_date, 'Monday') as next_monday
FROM orders

### Task 3.3: Generating date sequences

**Instructions:**
1. Use `sequence()` to generate an array of dates between two dates
2. Use `explode()` to create one row per date
3. Create a calendar table with all days between min and max order_date

In [0]:
%sql
-- TODO: Task 3.3 - Date sequences (SQL)
WITH range AS (
  SELECT min(order_date) as min_date, max(order_date) as max_date FROM orders
)
SELECT 
    explode(sequence(min_date, max_date, interval 1 day)) as date
FROM range

---

## Workshop Summary

**Objectives Achieved:**
- Window Functions (ranking, lag/lead, rolling aggregations, cumulative sum)
- JSON Processing (from_json, explode, struct)
- Array operations (collect_list, array_contains, size)
- Advanced date operations (truncation, arithmetic, sequences)

**Key Takeaways:**
1. Window Functions compute per-group values while keeping every input row, unlike GROUP BY, which collapses each group into one row
2. JSON and complex structures are native in Spark
3. Date functions enable advanced temporal analysis
4. Optimization: use broadcast for small tables in JOIN

**Best Practices:**
- Window Functions: always define explicit window spec
- JSON: use schema inference only for exploration
- Dates: use native date types (not string)
- Performance: cache() for frequently used DataFrames

---

# Solution

The complete code is below. Try to solve it yourself first!

In [0]:
%sql
-- ============================================================
-- FULL SOLUTION - Workshop 2: SQL
-- ============================================================

-- Task 1.1: Ranking
SELECT *, 
  row_number() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as row_num,
  rank() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as rank,
  dense_rank() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as dense_rank
FROM orders;

-- Task 1.2: LAG/LEAD
SELECT *, 
  lag(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) as prev_amt,
  lead(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) as next_amt
FROM orders;

-- Task 1.3: Rolling Aggregations
SELECT *, 
  avg(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as rolling_avg 
FROM orders;

-- Task 2.1: JSON
SELECT from_json(order_json, 'items ARRAY<STRUCT<product: STRING, price: INT>>, total INT') as parsed FROM json_orders_raw;

-- Task 2.2: Arrays
SELECT customer_id, collect_list(total_amount) as amounts, size(collect_list(total_amount)) as count FROM orders GROUP BY customer_id;

-- Task 3.1: Dates
SELECT order_date, date_trunc('month', order_date) as mth, datediff(current_date(), order_date) as diff FROM orders;

In [0]:
# ============================================================
# FULL SOLUTION - Workshop 2: Advanced Transformations (PySpark)
# ============================================================

# --- Task 1.1: Ranking ---
window_spec = Window.partitionBy("customer_id").orderBy(F.desc("order_date"))
orders_ranked = (
    test_orders
    .withColumn("row_num", F.row_number().over(window_spec))
    .withColumn("rank", F.rank().over(window_spec))
    .withColumn("dense_rank", F.dense_rank().over(window_spec))
)

# --- Task 1.2: LAG/LEAD ---
window_chrono = Window.partitionBy("customer_id").orderBy("order_date")
orders_lag_lead = (
    test_orders
    .withColumn("previous_order_amount", F.lag("total_amount", 1).over(window_chrono))
    .withColumn("next_order_amount", F.lead("total_amount", 1).over(window_chrono))
    .withColumn("amount_diff_vs_previous", F.col("total_amount") - F.col("previous_order_amount"))
)

# --- Task 1.3: Rolling Aggregations ---
window_rolling = Window.partitionBy("customer_id").orderBy("order_date").rowsBetween(-2, 0)
orders_rolling = (
    test_orders
    .withColumn("rolling_avg_3_orders", F.avg("total_amount").over(window_rolling))
    .withColumn("rolling_sum_3_orders", F.sum("total_amount").over(window_rolling))
)

# --- Task 1.4: Cumulative Sum ---
window_cumulative = Window.partitionBy("customer_id").orderBy("order_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
orders_cumulative = (
    test_orders
    .withColumn("cumulative_amount", F.sum("total_amount").over(window_cumulative))
)

# --- Task 2.1: JSON Processing ---
json_schema = "items ARRAY<STRUCT<product: STRING, price: INT>>, total INT"
orders_parsed = json_data.withColumn("parsed", F.from_json("order_json", json_schema))

# --- Task 2.2: Array Functions ---
customer_arrays = (
    test_orders
    .groupBy("customer_id")
    .agg(
        F.collect_list("total_amount").alias("order_amounts"),
        F.collect_list("order_date").alias("order_dates"),
        F.count("*").alias("total_orders")
    )
    .withColumn("num_orders", F.size("order_amounts"))
)

# --- Task 2.3: Struct Operations ---
customer_structs = (
    test_orders
    .groupBy("customer_id")
    .agg(
        F.count("*").alias("total_orders"),
        F.min("total_amount").alias("min_amount"),
        F.max("total_amount").alias("max_amount"),
        F.avg("total_amount").alias("avg_amount")
    )
    .withColumn("customer_info", F.struct("customer_id", "total_orders"))
    .withColumn("order_summary", F.struct("min_amount", "max_amount", "avg_amount"))
)

# --- Task 3.1: Date Functions ---
orders_dates = (
    test_orders
    .withColumn("order_month", F.date_trunc("month", "order_date"))
    .withColumn("order_quarter", F.date_trunc("quarter", "order_date"))
    .withColumn("order_year_num", F.year("order_date"))
    .withColumn("order_month_num", F.month("order_date"))
    .withColumn("day_of_week", F.dayofweek("order_date"))
    .withColumn("days_since_order", F.datediff(F.current_date(), "order_date"))
)

# --- Task 3.2: Date Arithmetic ---
orders_date_math = (
    test_orders
    .withColumn("delivery_date_estimate", F.date_add("order_date", 30))
    .withColumn("renewal_date", F.add_months("order_date", 3))
    .withColumn("month_end", F.last_day("order_date"))
    .withColumn("next_monday", F.next_day("order_date", "Monday"))
)

# --- Task 3.3: Date Sequences ---
date_range = test_orders.select(F.min("order_date").alias("min_date"), F.max("order_date").alias("max_date")).first()
calendar = (
    spark.range(1)
    .select(F.explode(F.sequence(F.lit(date_range["min_date"]), F.lit(date_range["max_date"]), F.expr("interval 1 day"))).alias("date"))
)

print("PySpark Solutions executed successfully.")

In [0]:
# Clean up temporary views
# spark.catalog.dropTempView("orders")
# spark.catalog.clearCache()

## Clean up resources
Optionally remove the resources created during this workshop.

In [0]:
# Cleanup
# dbutils.fs.rm(CHECKPOINT_PATH, True)
# spark.sql(f"DROP SCHEMA IF EXISTS {CATALOG}.{SCHEMA} CASCADE")