## Flattening Nested JSON Orders in PySpark


### Scenario

An e-commerce platform receives customer order records from its mobile application in **nested JSON format** through a streaming pipeline. Each record contains:

- Order metadata (`order_id`, `order_timestamp`)
- Customer details (including nested `location`)
- Payment details
- An array of purchased `items`
- An array of `delivery_updates` status strings

To store and analyze this data efficiently in a data warehouse, **the nested structure must be flattened into a tabular format using PySpark**, ensuring all relevant attributes are readily accessible for reporting and analytics.

In [0]:
df = spark.read.format("json")\
    .option("multiLine", "true")\
    .load("/Volumes/pyspark_cata/source/db_volume/json_data/")

# .option("multiLine", "true") tells Spark that a single JSON record may span multiple lines
# (pretty-printed/indented JSON, or a file containing a top-level array).
# Without it, Spark treats each line as a separate JSON object, which can lead to:
# parse errors / corrupt records, or only partial data being read.
# It is needed whenever the JSON isn’t “one complete object per line” (JSON Lines / NDJSON).

display(df)
df.printSchema()

customer,delivery_updates,items,order_id,order_timestamp,payment
"List(CUST101, john.doe@example.com, List(Toronto, Canada), John Doe)","List(Order Placed, Packed, Shipped, Out for Delivery)","List(List(ITEM1001, 25.5, Wireless Mouse, 2), List(ITEM1002, 199.75, Mechanical Keyboard, 1))",ORD001,2025-08-15T10:45:30Z,"List(250.75, CAD, Credit Card)"
"List(CUST102, jane.smith@example.com, List(Vancouver, Canada), Jane Smith)","List(Order Placed, Packed, Shipped)","List(List(ITEM1003, 89.99, USB-C Hub, 1))",ORD002,2025-08-15T11:10:15Z,"List(89.99, CAD, PayPal)"


root
 |-- customer: struct (nullable = true)
 |    |-- customer_id: string (nullable = true)
 |    |-- email: string (nullable = true)
 |    |-- location: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- country: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- delivery_updates: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item_id: string (nullable = true)
 |    |    |-- price_per_unit: double (nullable = true)
 |    |    |-- product_name: string (nullable = true)
 |    |    |-- quantity: long (nullable = true)
 |-- order_id: string (nullable = true)
 |-- order_timestamp: string (nullable = true)
 |-- payment: struct (nullable = true)
 |    |-- amount: double (nullable = true)
 |    |-- currency: string (nullable = true)
 |    |-- method: string (nullable = true)
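
To see why `multiLine` matters, the sketch below uses only Python's standard `json` module (no Spark) to mimic a reader that treats every line as its own JSON document. The sample record is a made-up, trimmed version of the orders above, and `parse_line_by_line` is a hypothetical helper. It illustrates the idea only and is not Spark's actual parser.

```python
import json

# Hypothetical sample record, trimmed from the orders shown above.
record = {"order_id": "ORD001", "customer": {"name": "John Doe"}}

ndjson_text = json.dumps(record)            # one complete object on a single line
pretty_text = json.dumps(record, indent=2)  # the same object spread over several lines


def parse_line_by_line(text):
    """Mimic a line-oriented reader: parse every line as its own JSON document."""
    parsed, corrupt = [], []
    for line in text.splitlines():
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            corrupt.append(line)
    return parsed, corrupt


ok_rows, ok_bad = parse_line_by_line(ndjson_text)
pretty_rows, pretty_bad = parse_line_by_line(pretty_text)

print(len(ok_rows), len(ok_bad))          # JSON Lines input parses cleanly
print(len(pretty_rows), len(pretty_bad))  # pretty-printed input yields only corrupt lines

# Parsing the whole text at once (the multiLine idea) recovers the record.
whole = json.loads(pretty_text)
print(whole == record)
```

If the source files are already in JSON Lines form, the option is unnecessary. It is the pretty-printed or top-level-array layout that requires it.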



### Flattening approach

We will:
1. Select order-level + customer-level + payment-level fields
2. `explode` the `items` array so we get one row per item
3. Extract item fields into top-level columns
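
The three steps can be sketched in plain Python on a single record before involving Spark. The `order` dict below is a hypothetical, shortened record with the same shape as the schema above, and `flatten_order` is an illustrative helper, not part of any library.

```python
# Hypothetical order with the same shape as the schema above (values shortened).
order = {
    "order_id": "ORD001",
    "customer": {"customer_id": "CUST101", "location": {"city": "Toronto"}},
    "payment": {"method": "Credit Card", "amount": 250.75},
    "items": [
        {"item_id": "ITEM1001", "quantity": 2},
        {"item_id": "ITEM1002", "quantity": 1},
    ],
}


def flatten_order(order):
    """Return one flat dict per purchased item."""
    # Step 1: order-, customer- and payment-level fields, shared by every row.
    base = {
        "order_id": order["order_id"],
        "customer_id": order["customer"]["customer_id"],
        "customer_city": order["customer"]["location"]["city"],
        "payment_method": order["payment"]["method"],
        "payment_amount": order["payment"]["amount"],
    }
    rows = []
    # Step 2: one row per element of the items array (the "explode").
    for item in order["items"]:
        # Step 3: lift the item fields to top-level columns.
        rows.append({**base, "item_id": item["item_id"], "quantity": item["quantity"]})
    return rows


flat_rows = flatten_order(order)
for row in flat_rows:
    print(row)
```

Every output row repeats the order-level values and differs only in the item columns, which is the shape of the flattened table built later with PySpark.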


Before we do that, let's look at how nested data is accessed. We will walk through a simple example using the `customer` column and then lay out the full transformation.

In [0]:
df_cust = df.select("customer.customer_id",
                    "customer.name",
                    "customer.email",
                    "customer.location.city",
                    "customer.location.country",
                    "*").drop("customer")
display(df_cust)

customer_id,name,email,city,country,delivery_updates,items,order_id,order_timestamp,payment
CUST101,John Doe,john.doe@example.com,Toronto,Canada,"List(Order Placed, Packed, Shipped, Out for Delivery)","List(List(ITEM1001, 25.5, Wireless Mouse, 2), List(ITEM1002, 199.75, Mechanical Keyboard, 1))",ORD001,2025-08-15T10:45:30Z,"List(250.75, CAD, Credit Card)"
CUST102,Jane Smith,jane.smith@example.com,Vancouver,Canada,"List(Order Placed, Packed, Shipped)","List(List(ITEM1003, 89.99, USB-C Hub, 1))",ORD002,2025-08-15T11:10:15Z,"List(89.99, CAD, PayPal)"
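
The dotted paths used in `select` walk the struct one level at a time, and the resulting column takes the name of the last segment (`city`, `country`). A rough plain-Python analogy on a nested dict is shown below. `get_path` is a hypothetical helper, and returning `None` for an unknown key is a choice made for this sketch: Spark itself raises an analysis error when a path does not exist in the schema.

```python
from functools import reduce

# Hypothetical record mirroring the customer struct above.
record = {
    "customer": {
        "customer_id": "CUST101",
        "location": {"city": "Toronto", "country": "Canada"},
    }
}


def get_path(record, path):
    """Resolve a dotted path such as 'customer.location.city' against nested dicts."""

    def step(value, key):
        # Descend one level; give up with None if the current value is not a dict.
        return value.get(key) if isinstance(value, dict) else None

    return reduce(step, path.split("."), record)


city = get_path(record, "customer.location.city")
unknown = get_path(record, "customer.location.zip")
print(city, unknown)
```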


#### Transformation

In [0]:
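# Import only the functions used below; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import col, explode_outer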

In [0]:
# Explode items to create one row per purchased item (item-level fact table)
df_flat = df.withColumn("item", explode_outer(col("items")))

# Select + alias fields into a clean flattened schema
df_flattened = df_flat.select(
    col("customer.customer_id").alias("customer_id"),
    col("customer.name").alias("customer_name"),
    col("customer.email").alias("customer_email"),
    col("customer.location.city").alias("customer_city"),
    col("customer.location.country").alias("customer_country"),

    col("delivery_updates"),

    col("item.item_id").alias("item_id"),
    col("item.product_name").alias("product_name"),
    col("item.quantity").alias("quantity"),
    col("item.price_per_unit").alias("price_per_unit"),

    col("order_id"),
    col("order_timestamp"),
    
    col("payment.method").alias("payment_method"),
    col("payment.amount").alias("payment_amount"),
    col("payment.currency").alias("payment_currency")
)

display(df_flattened)

customer_id,customer_name,customer_email,customer_city,customer_country,delivery_updates,item_id,product_name,quantity,price_per_unit,order_id,order_timestamp,payment_method,payment_amount,payment_currency
CUST101,John Doe,john.doe@example.com,Toronto,Canada,"List(Order Placed, Packed, Shipped, Out for Delivery)",ITEM1001,Wireless Mouse,2,25.5,ORD001,2025-08-15T10:45:30Z,Credit Card,250.75,CAD
CUST101,John Doe,john.doe@example.com,Toronto,Canada,"List(Order Placed, Packed, Shipped, Out for Delivery)",ITEM1002,Mechanical Keyboard,1,199.75,ORD001,2025-08-15T10:45:30Z,Credit Card,250.75,CAD
CUST102,Jane Smith,jane.smith@example.com,Vancouver,Canada,"List(Order Placed, Packed, Shipped)",ITEM1003,USB-C Hub,1,89.99,ORD002,2025-08-15T11:10:15Z,PayPal,89.99,CAD


### Notes / common pitfalls

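- We use `explode_outer(items)` rather than `explode(items)` so that orders whose `items` array is null or empty are kept (with null item columns) instead of being silently dropped. Neither function raises an error on such rows.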
- Keeping `delivery_updates` as an array avoids multiplying rows unnecessarily.
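
The difference between the two explode variants can be modelled in a few lines of plain Python. The functions `explode_model` and `explode_outer_model` are illustrative stand-ins, not the PySpark functions, and orders `ORD003` and `ORD004` are hypothetical records added to show the empty and null cases.

```python
def explode_model(order_id, items):
    """Model of explode: one row per element, no row for a null or empty array."""
    return [(order_id, item) for item in (items or [])]


def explode_outer_model(order_id, items):
    """Model of explode_outer: like explode, but keeps the order with a null item."""
    return explode_model(order_id, items) or [(order_id, None)]


orders = [
    ("ORD001", ["ITEM1001", "ITEM1002"]),
    ("ORD003", []),    # hypothetical order with an empty items array
    ("ORD004", None),  # hypothetical order with a missing items array
]

inner_rows = [row for oid, items in orders for row in explode_model(oid, items)]
outer_rows = [row for oid, items in orders for row in explode_outer_model(oid, items)]

print(inner_rows)  # ORD003 and ORD004 disappear
print(outer_rows)  # ORD003 and ORD004 survive with a None item
```

Losing such orders would be a silent data-quality problem rather than a failure, which is why the outer variant is the safer default for a fact table.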

Let's explode `delivery_updates` as well, just to see what the result looks like and whether we would want it in our fact table.

In [0]:
# Explode both arrays: one row per (item, delivery update) combination
df_flat = (
    df.withColumn("item", explode_outer(col("items")))
      .withColumn("delivery_updates", explode_outer(col("delivery_updates")))
)

# Select + alias fields into a clean flattened schema
df_flattened = df_flat.select(
    col("customer.customer_id").alias("customer_id"),
    col("customer.name").alias("customer_name"),
    col("customer.email").alias("customer_email"),
    col("customer.location.city").alias("customer_city"),
    col("customer.location.country").alias("customer_country"),

    col("delivery_updates"),

    col("item.item_id").alias("item_id"),
    col("item.product_name").alias("product_name"),
    col("item.quantity").alias("quantity"),
    col("item.price_per_unit").alias("price_per_unit"),

    col("order_id"),
    col("order_timestamp"),
    
    col("payment.method").alias("payment_method"),
    col("payment.amount").alias("payment_amount"),
    col("payment.currency").alias("payment_currency")
)

display(df_flattened)

customer_id,customer_name,customer_email,customer_city,customer_country,delivery_updates,item_id,product_name,quantity,price_per_unit,order_id,order_timestamp,payment_method,payment_amount,payment_currency
CUST101,John Doe,john.doe@example.com,Toronto,Canada,Order Placed,ITEM1001,Wireless Mouse,2,25.5,ORD001,2025-08-15T10:45:30Z,Credit Card,250.75,CAD
CUST101,John Doe,john.doe@example.com,Toronto,Canada,Packed,ITEM1001,Wireless Mouse,2,25.5,ORD001,2025-08-15T10:45:30Z,Credit Card,250.75,CAD
CUST101,John Doe,john.doe@example.com,Toronto,Canada,Shipped,ITEM1001,Wireless Mouse,2,25.5,ORD001,2025-08-15T10:45:30Z,Credit Card,250.75,CAD
CUST101,John Doe,john.doe@example.com,Toronto,Canada,Out for Delivery,ITEM1001,Wireless Mouse,2,25.5,ORD001,2025-08-15T10:45:30Z,Credit Card,250.75,CAD
CUST101,John Doe,john.doe@example.com,Toronto,Canada,Order Placed,ITEM1002,Mechanical Keyboard,1,199.75,ORD001,2025-08-15T10:45:30Z,Credit Card,250.75,CAD
CUST101,John Doe,john.doe@example.com,Toronto,Canada,Packed,ITEM1002,Mechanical Keyboard,1,199.75,ORD001,2025-08-15T10:45:30Z,Credit Card,250.75,CAD
CUST101,John Doe,john.doe@example.com,Toronto,Canada,Shipped,ITEM1002,Mechanical Keyboard,1,199.75,ORD001,2025-08-15T10:45:30Z,Credit Card,250.75,CAD
CUST101,John Doe,john.doe@example.com,Toronto,Canada,Out for Delivery,ITEM1002,Mechanical Keyboard,1,199.75,ORD001,2025-08-15T10:45:30Z,Credit Card,250.75,CAD
CUST102,Jane Smith,jane.smith@example.com,Vancouver,Canada,Order Placed,ITEM1003,USB-C Hub,1,89.99,ORD002,2025-08-15T11:10:15Z,PayPal,89.99,CAD
CUST102,Jane Smith,jane.smith@example.com,Vancouver,Canada,Packed,ITEM1003,USB-C Hub,1,89.99,ORD002,2025-08-15T11:10:15Z,PayPal,89.99,CAD
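
Exploding two independent arrays yields their cross product: order `ORD001` has 2 items and 4 delivery updates, hence the 8 rows above. The plain-Python sketch below reproduces that count and shows how a naive aggregate over the doubly exploded rows overcounts. The values are copied from the sample data; the variable names are illustrative.

```python
from itertools import product

# Items (with quantities) and delivery updates of order ORD001 from the sample data.
items = [("ITEM1001", 2), ("ITEM1002", 1)]
delivery_updates = ["Order Placed", "Packed", "Shipped", "Out for Delivery"]

# Exploding both arrays pairs every item with every delivery update.
rows = [
    {"item_id": item_id, "quantity": qty, "delivery_update": update}
    for (item_id, qty), update in product(items, delivery_updates)
]

true_quantity = sum(qty for _, qty in items)
exploded_quantity = sum(row["quantity"] for row in rows)

print(len(rows))                          # rows after the double explode
print(true_quantity, exploded_quantity)   # real total vs. inflated total
```

Because each item row is repeated once per delivery update, sums over `quantity` or `payment_amount` are inflated by the number of updates. This is a reason to keep `delivery_updates` as an array in an item-level fact table, or to model the updates in a separate table.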
