
## Medallion Architecture (Multi-hop)


- Medallion Architecture is a **layered approach** for building reliable, scalable data lakes:

| Layer                 | Description                                                          | Data Quality           | Purpose                                                        |
| --------------------- | -------------------------------------------------------------------- | ---------------------- | -------------------------------------------------------------- |
| **Bronze** (Raw)      | Raw ingestion from source systems (e.g., logs, events, files)        | Raw data, may be dirty | Capture original data for audit and backup                     |
| **Silver** (Cleansed) | Cleaned and filtered data, duplicates removed, basic transformations | Cleaned, enriched      | Enable business logic and filtering of bad records             |
| **Gold** (Aggregated) | Aggregated, business-level data ready for analytics and BI           | Curated, optimized     | Power dashboards, ML, and reporting                            |

- **Multi-hop** refers to data pipelines where data flows through multiple stages or tables before reaching the final presentation layer.
- Data flows through these layers progressively, increasing in quality and usability.
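
To make the progression concrete, here is a minimal plain-Python sketch of the three hops on a few in-memory records. It is not Spark code and only illustrates the idea. The record fields (`id`, `amount`, `region`) and the helpers `to_silver` and `to_gold` are hypothetical names chosen for this example.

```python
# Hypothetical raw events as they might land in Bronze,
# including one exact duplicate and one record with a bad amount.
bronze = [
    {"id": 1, "amount": "10.5", "region": "EU"},
    {"id": 1, "amount": "10.5", "region": "EU"},  # duplicate
    {"id": 2, "amount": "oops", "region": "US"},  # cannot be cast to float
    {"id": 3, "amount": "4.5", "region": "EU"},
]


def to_silver(rows):
    """Deduplicate by id, cast amount to float, and drop rows that fail the cast."""
    seen, cleaned = set(), []
    for row in rows:
        if row["id"] in seen:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # bad record: keep it in Bronze only
        seen.add(row["id"])
        cleaned.append({"id": row["id"], "amount": amount, "region": row["region"]})
    return cleaned


def to_gold(rows):
    """Aggregate cleaned rows into a total amount per region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

Bronze keeps every record as received, so the duplicate and the bad row remain available for audit even though they never reach Silver or Gold.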



In [0]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

orders_bronze = spark.table("workspace.`2235-wk3`.orders_bronze")

orders_bronze.show()


#### Silver Layer (Parse and Flatten)

- Parse the JSON string column `order_details` with `from_json()` and a defined schema
- Explode the nested array `products` so each product is one row
- Flatten nested structs such as the delivery address and payment info
- Cast columns to proper types (int, float, date, etc.)
- Save as Delta table `orders_silver`
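
The parse, explode, and flatten steps above can be sketched on a single record in plain Python. This is an illustration only, not the Spark implementation. The sample values and the helper `flatten_order` are hypothetical, and the nested layout (a `products` array plus `delivery` and `payment` structs) is assumed to match the JSON stored in `order_details`.

```python
import json

# Hypothetical Bronze row: order_details is a JSON string, as in the Bronze table.
bronze_row = {
    "order_id": "o-1",
    "order_details": json.dumps({
        "products": [
            {"product_id": "p-1", "category": "books", "price": 12.0},
            {"product_id": "p-2", "category": "toys", "price": 8.0},
        ],
        "delivery": {
            "method": "courier",
            "address": {"street": "1 Main St", "city": "Springfield", "zip": "00000"},
        },
        "payment": {"method": "card", "installments": 1},
    }),
}


def flatten_order(row):
    """Parse the JSON string, emit one flat dict per product (the 'explode' step)."""
    details = json.loads(row["order_details"])      # parse
    address = details["delivery"]["address"]
    return [
        {
            "order_id": row["order_id"],
            "product_id": product["product_id"],    # flatten product struct
            "product_category": product["category"],
            "product_price": float(product["price"]),
            "delivery_method": details["delivery"]["method"],
            "delivery_city": address["city"],        # flatten nested address
            "payment_method": details["payment"]["method"],
            "payment_installments": int(details["payment"]["installments"]),
        }
        for product in details["products"]           # one output row per product
    ]


rows = flatten_order(bronze_row)
for r in rows:
    print(r)
```

One order with two products becomes two rows, and the order-level fields (delivery, payment) are repeated on each of them.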

In [0]:
# define the json schema

product_schema = ArrayType(StructType([
    StructField("product_id", StringType(), True),
    StructField("category", StringType(), True),
    StructField("price", DoubleType(), True)

]))

delivery_schema = StructType([
    StructField("method", StringType(), True),
    StructField("address", StructType([
        StructField("street", StringType(), True),
        StructField("city", StringType(), True),
        StructField("zip", StringType(), True)
                                        ]), True)
])

payment_schema = StructType([
    StructField("method", StringType(), True),
    StructField("installments", IntegerType(), True)
])

order_details_schema = StructType([
    StructField("products", product_schema, True),
    StructField("delivery", delivery_schema, True),
    StructField("payment", payment_schema, True)
])




In [0]:
orders_silver = orders_bronze.withColumn(
    "orders_details_json",
    from_json(col("order_details"), order_details_schema)
)
orders_silver.display()

In [0]:
# explode product array

exploded_df = orders_silver.select(
    "order_id",
    "customer_id",
    col("order_date").cast("timestamp"),
    col("total_amount").cast("double"),
    explode(col("orders_details_json.products")).alias("product"),
    col("orders_details_json.delivery.method").alias("delivery_method"),
    col("orders_details_json.delivery.address.street").alias("delivery_street"),
    col("orders_details_json.delivery.address.city").alias("delivery_city"),
    col("orders_details_json.delivery.address.zip").alias("delivery_zip"),
    col("orders_details_json.payment.method").alias("payment_method"),
    col("orders_details_json.payment.installments").alias("payment_installments")
)


# flatten product struct

orders_silver = exploded_df.select(
    "order_id",
    "customer_id",
    "order_date",
    "total_amount",
    col("product.product_id").alias("product_id"),
    col("product.category").alias("product_category"),
    col("product.price").alias("product_price"),
    "delivery_method",
    "delivery_street",
    "delivery_city",
    "delivery_zip",
    "payment_method",
    "payment_installments"
)

# orders_silver.show(4)

orders_silver.write.format("delta").mode("overwrite").saveAsTable("workspace.`2235-wk3`.orders_silver")

#### Gold Layer (Business Aggregations)

- Aggregate to get metrics like total sales and order count per product category & payment method
- Save as Gold Delta table
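
One thing worth checking when aggregating from Silver: because Silver holds one row per product, order-level columns such as `total_amount` repeat on every product row of the same order. The plain-Python sketch below, with hypothetical sample rows and a hypothetical `aggregate` helper, compares summing `product_price` and counting distinct `order_id` against naively summing the repeated `total_amount`.

```python
from collections import defaultdict

# Hypothetical Silver rows: order o-1 has two products,
# so its total_amount (20.0) appears on two rows.
silver_rows = [
    {"order_id": "o-1", "total_amount": 20.0, "product_category": "books",
     "product_price": 12.0, "payment_method": "card"},
    {"order_id": "o-1", "total_amount": 20.0, "product_category": "toys",
     "product_price": 8.0, "payment_method": "card"},
    {"order_id": "o-2", "total_amount": 5.0, "product_category": "toys",
     "product_price": 5.0, "payment_method": "card"},
]


def aggregate(rows):
    """Sum product_price and count distinct orders per (category, payment method)."""
    sales = defaultdict(float)
    orders = defaultdict(set)
    for r in rows:
        key = (r["product_category"], r["payment_method"])
        sales[key] += r["product_price"]
        orders[key].add(r["order_id"])
    return {k: {"total_sales": sales[k], "order_count": len(orders[k])} for k in sales}


gold = aggregate(silver_rows)

# Naive alternative: summing the repeated order-level total for the toys group.
naive_toys = sum(r["total_amount"] for r in silver_rows
                 if r["product_category"] == "toys")

print(gold)
print(naive_toys)
```

In this sample the toys products sold for 13.0 in total, while the naive sum reports 25.0 because it attributes the whole of order `o-1` to the toys group.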

In [0]:
silver_df = spark.table("workspace.`2235-wk3`.orders_silver")

# Silver has one row per product, so total_amount repeats for multi-product orders.
# Sum product_price and count distinct order_id to avoid double counting.
orders_gold = (
    silver_df.groupBy("product_category", "payment_method")
    .agg(
        sum("product_price").alias("total_sales"),
        countDistinct("order_id").alias("order_count"),
    )
    .orderBy(col("total_sales").desc())
)

orders_gold.display()

orders_gold.write.format("delta").mode("overwrite").saveAsTable("workspace.`2235-wk3`.orders_gold")