# Databricks Data Engineering 101: Medallion Architecture

**Build Production-Ready Data Pipelines with Bronze, Silver & Gold Layers**

Welcome to the core of data engineering on Databricks! In this notebook, you'll learn to:

- 🥉 **Bronze Layer**: Ingest raw data with complete history
- 🥈 **Silver Layer**: Clean, validate, and standardize data
- 🥇 **Gold Layer**: Create business-ready analytics tables

## What is Medallion Architecture?

The Medallion Architecture organizes data into three progressive layers:

```
Raw Data → [Bronze] → [Silver] → [Gold] → Business Insights
           Raw         Clean      Analytics
```
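
As a rough illustration of the flow above, the plain-Python sketch below pushes a few made-up records through three functions, one per layer. The record fields and the function names (`to_bronze`, `to_silver`, `to_gold`) are hypothetical and chosen only for this example; the notebook itself implements each layer with Spark and Delta tables.

```python
from collections import defaultdict

# Hypothetical raw input; the second row is malformed (missing amount)
raw_rows = [
    {"order_id": "1", "customer": " alice ", "amount": "20.50"},
    {"order_id": "2", "customer": "BOB", "amount": None},
    {"order_id": "3", "customer": "Alice", "amount": "4.50"},
]

def to_bronze(rows, source):
    """Bronze: keep every row as-is and only attach ingestion metadata."""
    return [{**row, "_source_file": source} for row in rows]

def to_silver(bronze_rows):
    """Silver: drop invalid rows, fix types, standardize text."""
    cleaned = []
    for row in bronze_rows:
        if row["amount"] is None:
            continue  # reject rows that fail validation
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().lower(),
            "amount": float(row["amount"]),
        })
    return cleaned

def to_gold(silver_rows):
    """Gold: aggregate into a business-ready metric (revenue per customer)."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["customer"]] += row["amount"]
    return dict(revenue)

bronze = to_bronze(raw_rows, "orders.csv")
silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), gold)
```

Bronze keeps all three rows, Silver rejects the malformed one, and Gold collapses what remains into a single metric per customer.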

### Benefits:
- **Incremental Refinement**: Each layer adds value
- **Data Quality**: Progressive validation and cleansing
- **Performance**: Optimized for different use cases
- **Flexibility**: Easy to add new sources or metrics

### Real-World Use Cases:
- **Finance**: Transaction processing and fraud detection
- **E-commerce**: Customer analytics and product performance
- **Healthcare**: Patient records and clinical analytics
- **IoT**: Sensor data processing and anomaly detection

---


## SETUP

Just run the next couple of cells for setup!


In [None]:
# Configuration Widgets - Customize your setup

dbutils.widgets.text("catalog", "demo", "Catalog")
dbutils.widgets.text("bronze_db", "bronze", "Bronze DB")
dbutils.widgets.text("silver_db", "silver", "Silver DB")
dbutils.widgets.text("gold_db", "gold", "Gold DB")

catalog = dbutils.widgets.get("catalog")
bronze_db = dbutils.widgets.get("bronze_db")
silver_db = dbutils.widgets.get("silver_db")
gold_db = dbutils.widgets.get("gold_db")

# Define path for data storage in Volume
path = f"/Volumes/{catalog}/{bronze_db}/ecommerce/files/"

print(f"Catalog: {catalog}")
print(f"Bronze DB: {bronze_db}")
print(f"Silver DB: {silver_db}")
print(f"Gold DB: {gold_db}")
print(f"Path: {path}")
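
Because the widget values above end up inside SQL statements via f-strings in the following cells, a value containing a space, hyphen, or backtick can break those statements. One possible guard, sketched below, is to backtick-quote each identifier before use. The helper names `quote_identifier` and `qualified_name` are hypothetical and not part of any Databricks API.

```python
def quote_identifier(name: str) -> str:
    """Wrap a catalog, schema, or table name in backticks, doubling any embedded backtick."""
    if not name or not name.strip():
        raise ValueError("identifier must not be empty")
    return "`" + name.replace("`", "``") + "`"

def qualified_name(*parts: str) -> str:
    """Join quoted parts into a catalog.schema.table style name."""
    return ".".join(quote_identifier(part) for part in parts)

print(qualified_name("demo", "bronze", "customers"))
print(qualified_name("my-catalog", "silver"))
```

The quoted form is meant for SQL text only; the `/Volumes/...` path is a file path and uses the unquoted names.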


In [None]:
# Create Unity Catalog, Schemas, and Volume

# Create catalog
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")

# Use catalog
spark.sql(f"USE CATALOG {catalog}")

# Create databases/schemas
spark.sql(f"CREATE DATABASE IF NOT EXISTS {bronze_db}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {silver_db}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {gold_db}")

# Create volume for data storage
spark.sql(f"CREATE VOLUME IF NOT EXISTS {bronze_db}.ecommerce")

print(f"✅ Created catalog: {catalog}")
print(f"✅ Created schemas: {bronze_db}, {silver_db}, {gold_db}")
print(f"✅ Created volume: {bronze_db}.ecommerce")


In [None]:
# Create volume folders for data organization

try:
    dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/ecommerce/files/customers")
    dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/ecommerce/files/products")
    dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/ecommerce/files/orders")
    dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/ecommerce/files/order_items")
    dbutils.fs.mkdirs(f"/Volumes/{catalog}/{bronze_db}/ecommerce/downloads")
    print("✅ Created volume folder structure")
except Exception as e:
    # mkdirs succeeds when a folder already exists, so an error here means creation failed
    print(f"⚠️  Could not create volume folders: {e}")


### 📥 Load Sample Data

Copy CSV files to Unity Catalog Volume:


In [None]:
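# Import only what this cell uses; a wildcard import would shadow built-ins such as sum, min, max, and round
from pyspark.sql.functions import current_timestamp, lit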
import shutil
import os

print("="*70)
print("BRONZE LAYER: DATA INGESTION")
print("="*70)

# Resolve the repo's data directory relative to the notebook location
notebook_dir = os.getcwd()
repo_data_dir = os.path.abspath(os.path.join(notebook_dir, "../data"))

if not os.path.isdir(repo_data_dir):
    raise FileNotFoundError(
        f"Could not find ../data folder relative to the notebook at {notebook_dir}. "
        "Please ensure the data directory exists."
    )
else:
    print(f"\n📦 Using relative data path: {repo_data_dir}")

# Target directory: Unity Catalog Volume 
volume_base = f"/Volumes/{catalog}/{bronze_db}/ecommerce/files"
downloads_dir = f"/Volumes/{catalog}/{bronze_db}/ecommerce/downloads"

print(f"\n📋 Creating Bronze tables with PySpark...")
print("="*70)

tables = {
    "customers": f"{catalog}.{bronze_db}.customers",
    "products": f"{catalog}.{bronze_db}.products",
    "orders": f"{catalog}.{bronze_db}.orders",
    "order_items": f"{catalog}.{bronze_db}.order_items"
}

for csv_name in tables:
    source_path = os.path.join(repo_data_dir, f"{csv_name}.csv")
    target_dir = f"{volume_base}/{csv_name}"
    target_path = f"{target_dir}/{csv_name}.csv"

    # Only attempt to copy if file doesn't already exist in volume
    if not os.path.exists(target_path):
        try:
            shutil.copyfile(source_path, target_path)
            print(f"   ✅ Copied {csv_name}.csv to {target_dir}")
        except Exception as e:
            print(f"   ⚠️ Could not copy {csv_name}.csv to {target_dir}: {e}")
    else:
        print(f"   ⏭️ {csv_name}.csv already exists in volume, skipping copy.")

# Now, read from the Unity Catalog volume paths and create Bronze tables
for csv_name, table_name in tables.items():
    print(f"\n⏳ Processing {csv_name}.csv...")

    volume_csv_path = f"{volume_base}/{csv_name}/{csv_name}.csv"
    # No "file:" URI prefix (use UC/DBFS path)
    df = spark.read.csv(volume_csv_path, header=True, inferSchema=True)
    df = df.withColumn("_ingestion_timestamp", current_timestamp()) \
           .withColumn("_source_file", lit(f"{csv_name}.csv"))
    df.write.mode("overwrite").format("delta").saveAsTable(table_name)
    # Use a distinct variable name so it cannot shadow pyspark's count() function
    record_count = spark.table(table_name).count()
    print(f"   ✅ {table_name}: {record_count:,} records")

print("\n" + "="*70)
print("🎉 BRONZE LAYER COMPLETE!")
print("="*70)
print(f"\nAll CSV data loaded into Unity Catalog: {catalog}")
print(f"Tables created in {catalog}.{bronze_db} schema")
print("You can now proceed to the Silver layer below.\n")
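
The copy step in the cell above skips files that are already in the volume, which is what makes reruns safe. The standalone sketch below shows the same copy-if-missing idea on temporary local directories. `copy_if_missing` is a hypothetical helper, and creating the target directory inside it is an added assumption rather than something the notebook does.

```python
import os
import shutil
import tempfile

def copy_if_missing(source_path: str, target_path: str) -> bool:
    """Copy source to target unless target already exists. Returns True if a copy happened."""
    if os.path.exists(target_path):
        return False
    # Assumption: create the parent directory on demand so the copy cannot fail on a missing folder
    os.makedirs(os.path.dirname(target_path), exist_ok=True)
    shutil.copyfile(source_path, target_path)
    return True

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "customers.csv")
    with open(source, "w") as f:
        f.write("customer_id,email\n1,a@example.com\n")
    target = os.path.join(tmp, "volume", "customers", "customers.csv")

    first_run = copy_if_missing(source, target)   # copies the file
    second_run = copy_if_missing(source, target)  # skips, the file is already there
    print(first_run, second_run)
```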


---

# 🥉 Bronze Layer: Verify & Explore

The Bronze tables have been created above! Let's verify and explore the data.


### Verify Bronze Tables


In [None]:
# Show all Bronze tables in Unity Catalog
display(spark.sql(f"SHOW TABLES IN {catalog}.{bronze_db}"))



In [None]:
# Check record counts (Unity Catalog 3-level naming)
from pyspark.sql.functions import lit, count

customers_df = spark.table(f"{catalog}.{bronze_db}.customers").select(lit('customers').alias('table_name'), count('*').alias('record_count'))
products_df = spark.table(f"{catalog}.{bronze_db}.products").select(lit('products').alias('table_name'), count('*').alias('record_count'))
orders_df = spark.table(f"{catalog}.{bronze_db}.orders").select(lit('orders').alias('table_name'), count('*').alias('record_count'))
order_items_df = spark.table(f"{catalog}.{bronze_db}.order_items").select(lit('order_items').alias('table_name'), count('*').alias('record_count'))

result_df = customers_df.union(products_df).union(orders_df).union(order_items_df)
display(result_df)


In [None]:
# Preview customers data
display(spark.table(f"{catalog}.{bronze_db}.customers").limit(5))


### Preview Other Tables


In [None]:
# Preview products
display(spark.table(f"{catalog}.{bronze_db}.products").limit(5))


### Data Quality Check


In [None]:
# Check for any null key columns (should be none)
from pyspark.sql.functions import lit, sum, when, col

customers_nulls = spark.table(f"{catalog}.{bronze_db}.customers").select(
    lit('customers').alias('table_name'),
    sum(when(col('customer_id').isNull(), 1).otherwise(0)).alias('null_ids')
)

products_nulls = spark.table(f"{catalog}.{bronze_db}.products").select(
    lit('products').alias('table_name'),
    sum(when(col('product_id').isNull(), 1).otherwise(0)).alias('null_ids')
)

orders_nulls = spark.table(f"{catalog}.{bronze_db}.orders").select(
    lit('orders').alias('table_name'),
    sum(when(col('order_id').isNull(), 1).otherwise(0)).alias('null_ids')
)

order_items_nulls = spark.table(f"{catalog}.{bronze_db}.order_items").select(
    lit('order_items').alias('table_name'),
    sum(when(col('line_item_id').isNull(), 1).otherwise(0)).alias('null_ids')
)

result_df = customers_nulls.union(products_nulls).union(orders_nulls).union(order_items_nulls)
display(result_df)


### ✅ Bronze Layer Summary


In [None]:
# Summary statistics
print("="*70)
print("BRONZE LAYER SUMMARY")
print("="*70)
print(f"\nCatalog: {catalog}")
# Use the configured Bronze schema rather than a hard-coded "bronze"
print(f"\nCustomers:    {spark.table(f'{catalog}.{bronze_db}.customers').count():>10,} records")
print(f"Products:     {spark.table(f'{catalog}.{bronze_db}.products').count():>10,} records")
print(f"Orders:       {spark.table(f'{catalog}.{bronze_db}.orders').count():>10,} records")
print(f"Order Items:  {spark.table(f'{catalog}.{bronze_db}.order_items').count():>10,} records")

print("\n" + "="*70)
print("✅ Bronze Layer Complete - Raw data in Unity Catalog")
print("="*70)
print("\nKey Features:")
print("  • Unity Catalog 3-level namespacing (catalog.schema.table)")
print("  • All data stored in Delta Lake format (ACID compliant)")
print("  • Metadata columns added (_ingestion_timestamp, _source_file)")
print("  • Ready for cleansing in Silver layer")
print("\n")

---

# 🥈 Silver Layer: Data Cleansing & Validation

## Goals:
- **Clean**: Remove nulls, fix data types, standardize formats
- **Validate**: Apply business rules and constraints
- **Deduplicate**: Keep only unique, valid records
- **Enrich**: Add derived columns for downstream use

## Key Patterns:
- Data quality checks
- Deduplication using window functions
- Type casting and formatting
- Business rule validation
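
The Silver cells that follow filter and reshape rows but do not include a deduplication step, so here is a small plain-Python sketch of the "keep the latest record per key" rule. In PySpark the same idea is commonly written with `row_number()` over a `Window.partitionBy(<key>).orderBy(<timestamp>.desc())`, followed by a filter on row number 1. The sample rows and the `updated_at` field used for ordering are assumptions for illustration.

```python
def deduplicate_latest(rows, key, order_by):
    """Keep one row per key: the one with the greatest order_by value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])

rows = [
    {"customer_id": 1, "email": "old@example.com", "updated_at": "2024-01-01"},
    {"customer_id": 2, "email": "bob@example.com", "updated_at": "2024-01-05"},
    {"customer_id": 1, "email": "new@example.com", "updated_at": "2024-03-01"},
]

# ISO-formatted date strings sort chronologically, so plain string comparison is enough here
unique_rows = deduplicate_latest(rows, key="customer_id", order_by="updated_at")
for row in unique_rows:
    print(row["customer_id"], row["email"])
```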


### Silver: Customers (Cleaned)


In [None]:
# Create Silver customers with data quality rules
from pyspark.sql.functions import col, initcap, trim, lower, upper, current_timestamp

# Drop table if exists
spark.sql(f"DROP TABLE IF EXISTS {catalog}.{silver_db}.customers")

# Read from Bronze and apply transformations
customers_df = spark.table(f"{catalog}.{bronze_db}.customers") \
    .select(
        col("customer_id"),
        initcap(trim(col("first_name"))).alias("first_name"),
        initcap(trim(col("last_name"))).alias("last_name"),
        lower(trim(col("email"))).alias("email"),
        col("phone"),
        col("address"),
        col("city"),
        upper(col("state")).alias("state"),
        col("zip_code"),
        col("country"),
        col("registration_date"),
        col("customer_segment"),
        current_timestamp().alias("updated_at")
    ) \
    .filter(
        (col("customer_id").isNotNull()) &
        (col("email").isNotNull()) &
        (col("email").like("%@%")) &
        (col("registration_date").isNotNull())
    )

# Write to Silver table
customers_df.write.mode("overwrite").format("delta").saveAsTable(f"{catalog}.{silver_db}.customers")

# Show count
cleaned_count = spark.table(f"{catalog}.{silver_db}.customers").count()
print(f"cleaned_count: {cleaned_count}")
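
To make the cleaning rules in the cell above easier to reason about, the sketch below applies roughly the same logic to a single record in plain Python. `clean_customer` and the local `initcap` helper are hypothetical; the helper only approximates Spark's `initcap`, and the sketch assumes the name and state fields are present.

```python
def initcap(text):
    """Approximate Spark's initcap: capitalize each space-separated word."""
    return " ".join(word.capitalize() for word in text.split(" "))

def clean_customer(row):
    """Return the cleaned row, or None when it fails the validation rules."""
    email = row.get("email")
    if row.get("customer_id") is None or email is None or row.get("registration_date") is None:
        return None
    email = email.strip().lower()
    if "@" not in email:  # same loose check as like("%@%")
        return None
    return {
        "customer_id": row["customer_id"],
        # Assumption: names and state are non-null in this sketch
        "first_name": initcap(row["first_name"].strip()),
        "last_name": initcap(row["last_name"].strip()),
        "email": email,
        "state": row["state"].upper(),
        "registration_date": row["registration_date"],
    }

good = {"customer_id": 7, "first_name": "  jANE", "last_name": "doe ",
        "email": " Jane.Doe@Example.COM ", "state": "ca", "registration_date": "2023-05-01"}
bad = {**good, "email": "not-an-email"}

cleaned = clean_customer(good)
rejected = clean_customer(bad)
print(cleaned)
print(rejected)
```

The `@` check only catches obviously malformed addresses; a stricter pattern may be worth adding if email quality matters downstream.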


In [None]:
# Compare Bronze vs Silver
from pyspark.sql.functions import lit, count

bronze_count = spark.table(f"{catalog}.{bronze_db}.customers").select(
    lit('Bronze').alias('layer'),
    count('*').alias('record_count')
)

silver_count = spark.table(f"{catalog}.{silver_db}.customers").select(
    lit('Silver').alias('layer'),
    count('*').alias('record_count')
)

result_df = bronze_count.union(silver_count)
display(result_df)


### Silver: Products (Cleaned)


In [None]:
# Create Silver products with data quality rules
from pyspark.sql.functions import col, trim, when, current_timestamp

# Drop table if exists
spark.sql(f"DROP TABLE IF EXISTS {catalog}.{silver_db}.products")

# Read from Bronze and apply transformations
products_df = spark.table(f"{catalog}.{bronze_db}.products") \
    .select(
        col("product_id"),
        trim(col("product_name")).alias("product_name"),
        col("category"),
        col("brand"),
        col("price"),
        col("stock_quantity"),
        col("is_active"),
        when(col("price") < 50, "Budget")
            .when(col("price") < 200, "Mid-Range")
            .otherwise("Premium").alias("price_tier"),
        current_timestamp().alias("updated_at")
    ) \
    .filter(
        (col("product_id").isNotNull()) &
        (col("product_name").isNotNull()) &
        (col("price") > 0) &
        (col("price") < 10000)
    )

# Write to Silver table
products_df.write.mode("overwrite").format("delta").saveAsTable(f"{catalog}.{silver_db}.products")

# Show count
cleaned_count = spark.table(f"{catalog}.{silver_db}.products").count()
print(f"cleaned_count: {cleaned_count}")
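
The tier boundaries in the `when` chain above are easy to get wrong by one, so the following plain-Python sketch spells out the same thresholds and checks the edge values. The function names `price_tier` and `is_valid_price` are hypothetical.

```python
def price_tier(price: float) -> str:
    """Mirror the when/otherwise chain: under 50 is Budget, under 200 is Mid-Range, else Premium."""
    if price < 50:
        return "Budget"
    if price < 200:
        return "Mid-Range"
    return "Premium"

def is_valid_price(price) -> bool:
    """Mirror the filter: the price must be strictly between 0 and 10000."""
    return price is not None and 0 < price < 10000

for price in (49.99, 50, 199.99, 200):
    print(price, price_tier(price))
```

A price of exactly 50 lands in Mid-Range and exactly 200 lands in Premium, because both comparisons are strict.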


### Silver: Orders (Cleaned & Enriched)


In [None]:
# Create Silver orders with data quality rules and enrichment
from pyspark.sql.functions import col, when, datediff, year, month, dayofweek, current_timestamp, to_date

# Drop table if exists
spark.sql(f"DROP TABLE IF EXISTS {catalog}.{silver_db}.orders")

# Read from Bronze and apply transformations
orders_df = spark.table(f"{catalog}.{bronze_db}.orders") \
    .select(
        col("order_id"),
        col("customer_id"),
        col("order_date"),
        col("status"),
        col("payment_method"),
        col("shipped_date"),
        col("delivered_date"),
        col("discount_percent"),
        when(col("delivered_date").isNotNull(), 
             datediff(col("delivered_date"), to_date(col("order_date"))))
            .otherwise(None).alias("days_to_deliver"),
        year(col("order_date")).alias("order_year"),
        month(col("order_date")).alias("order_month"),
        dayofweek(col("order_date")).alias("order_day_of_week"),
        current_timestamp().alias("updated_at")
    ) \
    .filter(
        (col("order_id").isNotNull()) &
        (col("customer_id").isNotNull()) &
        (col("order_date").isNotNull()) &
        (col("status").isin('Completed', 'Shipped', 'Processing', 'Cancelled'))
    )

# Write to Silver table
orders_df.write.mode("overwrite").format("delta").saveAsTable(f"{catalog}.{silver_db}.orders")

# Show count
cleaned_count = spark.table(f"{catalog}.{silver_db}.orders").count()
print(f"cleaned_count: {cleaned_count}")
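
Two of the derived columns above have conventions worth spelling out: Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), and `datediff(end, start)` returns `end - start` in whole days. The plain-Python sketch below mirrors both; the helper names are hypothetical.

```python
from datetime import date

def spark_dayofweek(d: date) -> int:
    """Mirror Spark's dayofweek numbering: 1 = Sunday, 2 = Monday, ..., 7 = Saturday."""
    return d.isoweekday() % 7 + 1

def days_to_deliver(order_date, delivered_date):
    """Days from order to delivery, or None while the order is undelivered."""
    if delivered_date is None:
        return None
    return (delivered_date - order_date).days

order = date(2024, 1, 7)  # a Sunday
print(spark_dayofweek(order))
print(days_to_deliver(order, date(2024, 1, 10)))
print(days_to_deliver(order, None))
```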


### Silver: Order Items (Cleaned with Calculations)


In [None]:
# Create Silver order_items with calculations
from pyspark.sql.functions import col, current_timestamp

# Drop table if exists
spark.sql(f"DROP TABLE IF EXISTS {catalog}.{silver_db}.order_items")

# Read from Bronze and apply transformations
order_items_df = spark.table(f"{catalog}.{bronze_db}.order_items") \
    .select(
        col("line_item_id"),
        col("order_id"),
        col("product_id"),
        col("quantity"),
        col("unit_price"),
        (col("quantity") * col("unit_price")).alias("line_total"),
        current_timestamp().alias("updated_at")
    ) \
    .filter(
        (col("line_item_id").isNotNull()) &
        (col("order_id").isNotNull()) &
        (col("product_id").isNotNull()) &
        (col("quantity") > 0) &
        (col("quantity") <= 100) &
        (col("unit_price") > 0) &
        (col("unit_price") < 10000)
    )

# Write to Silver table
order_items_df.write.mode("overwrite").format("delta").saveAsTable(f"{catalog}.{silver_db}.order_items")

# Show count
cleaned_count = spark.table(f"{catalog}.{silver_db}.order_items").count()
print(f"cleaned_count: {cleaned_count}")


### ✅ Silver Layer Complete!

Summary of our cleansed data:


In [None]:
# Compare Bronze vs Silver record counts
print("="*70)
print("BRONZE → SILVER DATA QUALITY REPORT")
print("="*70)

tables = ['customers', 'products', 'orders', 'order_items']
for table in tables:
    bronze_count = spark.table(f'{catalog}.{bronze_db}.{table}').count()
    silver_count = spark.table(f'{catalog}.{silver_db}.{table}').count()
    rejected = bronze_count - silver_count
    rejection_rate = (rejected / bronze_count * 100) if bronze_count > 0 else 0
    
    print(f"\n{table.upper()}:")
    print(f"  Bronze: {bronze_count:>10,}")
    print(f"  Silver: {silver_count:>10,}")
    print(f"  Rejected: {rejected:>8,} ({rejection_rate:.2f}%)")

print("="*70)


---

# 🥇 Gold Layer: Business Analytics

## Goals:
- Create **business-ready** tables optimized for reporting
- Pre-calculate **metrics and KPIs**
- Denormalize data for **fast queries**
- Support **dashboards and analytics**

## Patterns:
- Aggregations and rollups
- Star schema / dimensional modeling
- Pre-calculated metrics
- Optimized for BI tools


### Gold: Customer Analytics

Calculate customer lifetime value and segmentation metrics.


In [None]:
# Create Gold customer_analytics table
from pyspark.sql.functions import col, count, sum, avg, max, min, datediff, when, coalesce, lit, current_timestamp, countDistinct

# Drop table if exists
spark.sql(f"DROP TABLE IF EXISTS {catalog}.{gold_db}.customer_analytics")

# Per-order totals (the DataFrame equivalent of a SQL CTE)
order_totals = spark.table(f"{catalog}.{silver_db}.orders").alias("o") \
    .join(spark.table(f"{catalog}.{silver_db}.order_items").alias("oi"), 
          col("o.order_id") == col("oi.order_id"), "inner") \
    .filter(col("o.status") != "Cancelled") \
    .groupBy("o.customer_id", "o.order_id") \
    .agg(sum("oi.line_total").alias("order_total"))

# Main query
customers = spark.table(f"{catalog}.{silver_db}.customers").alias("c")
orders = spark.table(f"{catalog}.{silver_db}.orders").alias("o")

customer_analytics_df = customers \
    .join(orders, 
          (col("c.customer_id") == col("o.customer_id")) & (col("o.status") != "Cancelled"), 
          "left") \
    .join(order_totals.alias("ot"), 
          col("o.order_id") == col("ot.order_id"), 
          "left") \
    .groupBy(
        "c.customer_id", "c.first_name", "c.last_name", "c.email",
        "c.city", "c.state", "c.customer_segment", "c.registration_date"
    ) \
    .agg(
        countDistinct("o.order_id").alias("total_orders"),
        coalesce(sum("ot.order_total"), lit(0)).alias("lifetime_value"),
        coalesce(avg("ot.order_total"), lit(0)).alias("avg_order_value"),
        max("o.order_date").alias("last_order_date"),
        min("o.order_date").alias("first_order_date"),
        datediff(max("o.order_date"), min("o.order_date")).alias("customer_tenure_days")
    ) \
    .withColumn("orders_per_month",
        when(col("customer_tenure_days") > 0,
             col("total_orders") * 30.0 / col("customer_tenure_days"))
        .otherwise(0)
    ) \
    .withColumn("calculated_at", current_timestamp())

# Write to Gold table
customer_analytics_df.write.mode("overwrite").format("delta").saveAsTable(f"{catalog}.{gold_db}.customer_analytics")

# Show count
customer_count = spark.table(f"{catalog}.{gold_db}.customer_analytics").count()
print(f"customer_count: {customer_count}")
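
As a worked example of the metrics computed above, the sketch below derives the same figures for one customer in plain Python. `customer_metrics` is a hypothetical helper and the sample orders are made up. Note that `customer_tenure_days` here, as in the cell above, measures the span between the first and last order.

```python
from datetime import date

def customer_metrics(orders):
    """Summarize one customer's non-cancelled orders, given as (order_date, order_total) pairs."""
    if not orders:
        # No orders: totals fall back to 0 and tenure is unknown
        return {"total_orders": 0, "lifetime_value": 0.0, "avg_order_value": 0.0,
                "customer_tenure_days": None, "orders_per_month": 0.0}
    dates = [order_date for order_date, _ in orders]
    totals = [order_total for _, order_total in orders]
    tenure_days = (max(dates) - min(dates)).days  # days between first and last order
    return {
        "total_orders": len(orders),
        "lifetime_value": sum(totals),
        "avg_order_value": sum(totals) / len(totals),
        "customer_tenure_days": tenure_days,
        # Normalize to a 30-day month; guard against a zero-day tenure
        "orders_per_month": len(orders) * 30.0 / tenure_days if tenure_days > 0 else 0.0,
    }

sample_orders = [(date(2024, 1, 1), 100.0), (date(2024, 2, 15), 50.0), (date(2024, 3, 31), 150.0)]
metrics = customer_metrics(sample_orders)
print(metrics)
```

Three orders over a 90-day span give `3 * 30 / 90 = 1.0` orders per month.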


In [None]:
# Top 10 customers by lifetime value
from pyspark.sql.functions import col, concat_ws, round

top_customers_df = spark.table(f"{catalog}.{gold_db}.customer_analytics") \
    .select(
        col("customer_id"),
        concat_ws(" ", col("first_name"), col("last_name")).alias("customer_name"),
        col("email"),
        col("total_orders"),
        round(col("lifetime_value"), 2).alias("lifetime_value"),
        round(col("avg_order_value"), 2).alias("avg_order_value"),
        col("customer_segment")
    ) \
    .orderBy(col("lifetime_value").desc()) \
    .limit(10)

display(top_customers_df)


### Gold: Product Performance

Analyze product sales and revenue.


In [None]:
# Create Gold product_performance table
from pyspark.sql.functions import col, countDistinct, sum, avg, current_timestamp, rank
from pyspark.sql.window import Window

# Drop table if exists
spark.sql(f"DROP TABLE IF EXISTS {catalog}.{gold_db}.product_performance")

# Read tables
products = spark.table(f"{catalog}.{silver_db}.products").alias("p")
order_items = spark.table(f"{catalog}.{silver_db}.order_items").alias("oi")
orders = spark.table(f"{catalog}.{silver_db}.orders").alias("o")

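# Keep only line items from non-cancelled orders; a plain left join to orders
# would still count the quantity and revenue of cancelled line items
sold_items = order_items.join(
    orders.filter(col("o.status") != "Cancelled"),
    col("oi.order_id") == col("o.order_id"),
    "left_semi"
)

# Join and aggregate
product_metrics = products \
    .join(sold_items, col("p.product_id") == col("oi.product_id"), "left") \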
    .groupBy(
        "p.product_id", "p.product_name", "p.category", "p.brand", "p.price", "p.price_tier"
    ) \
    .agg(
        countDistinct("oi.order_id").alias("orders_containing_product"),
        sum("oi.quantity").alias("total_quantity_sold"),
        sum("oi.line_total").alias("total_revenue"),
        avg("oi.unit_price").alias("avg_selling_price")
    )

# Add window function for ranking
window_spec = Window.partitionBy("category").orderBy(col("total_revenue").desc())

product_performance_df = product_metrics \
    .withColumn("revenue_rank_in_category", rank().over(window_spec)) \
    .withColumn("calculated_at", current_timestamp())

# Write to Gold table
product_performance_df.write.mode("overwrite").format("delta").saveAsTable(f"{catalog}.{gold_db}.product_performance")

# Show count
product_count = spark.table(f"{catalog}.{gold_db}.product_performance").count()
print(f"product_count: {product_count}")
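
The `rank()` window function used above gives tied rows the same rank and then skips ahead, so two products tied for first are followed by rank 3. The plain-Python sketch below reproduces that behavior on a made-up revenue list; `rank_with_gaps` is a hypothetical helper. If gapless numbering is preferred, PySpark's `dense_rank()` is the usual alternative.

```python
def rank_with_gaps(values):
    """Rank values in descending order the way SQL RANK() does: ties share a rank and leave a gap."""
    ordered = sorted(values, reverse=True)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        # Remember only the first position at which each value appears
        first_position.setdefault(value, position)
    return [first_position[value] for value in values]

revenues = [500.0, 300.0, 500.0, 100.0]
ranks = rank_with_gaps(revenues)
print(ranks)
```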


In [None]:
# Top 10 products by revenue
from pyspark.sql.functions import col, round

top_products_df = spark.table(f"{catalog}.{gold_db}.product_performance") \
    .select(
        col("product_name"),
        col("category"),
        col("brand"),
        round(col("total_revenue"), 2).alias("total_revenue"),
        col("total_quantity_sold"),
        col("orders_containing_product"),
        col("revenue_rank_in_category")
    ) \
    .orderBy(col("total_revenue").desc()) \
    .limit(10)

display(top_products_df)


### Gold: Monthly Revenue Trends

Time-series analysis for business reporting.


In [None]:
# Create Gold monthly_revenue table
from pyspark.sql.functions import col, sum, count, avg, round, current_timestamp, countDistinct, date_trunc, lag
from pyspark.sql.window import Window

# Drop table if exists
spark.sql(f"DROP TABLE IF EXISTS {catalog}.{gold_db}.monthly_revenue")

# Per-order revenue and discount (the DataFrame equivalent of a SQL CTE)
orders = spark.table(f"{catalog}.{silver_db}.orders").alias("o")
order_items = spark.table(f"{catalog}.{silver_db}.order_items").alias("oi")

order_revenues = orders \
    .join(order_items, col("o.order_id") == col("oi.order_id"), "inner") \
    .filter(col("o.status") != "Cancelled") \
    .groupBy("o.order_id", "o.order_date", "o.order_year", "o.order_month", "o.status", "o.discount_percent") \
    .agg(
        sum("oi.line_total").alias("order_total"),
        sum(col("oi.line_total") * col("o.discount_percent") / 100).alias("discount_amount")
    )

# Aggregate by month
monthly_agg = order_revenues \
    .withColumn("month_start_date", date_trunc("month", col("order_date"))) \
    .groupBy("order_year", "order_month", "month_start_date") \
    .agg(
        countDistinct("order_id").alias("total_orders"),
        sum("order_total").alias("gross_revenue"),
        sum("discount_amount").alias("total_discounts"),
        sum(col("order_total") - col("discount_amount")).alias("net_revenue"),
        avg("order_total").alias("avg_order_value")
    )

# Add window functions for month-over-month growth
window_spec = Window.orderBy("order_year", "order_month")

monthly_revenue_df = monthly_agg \
    .withColumn("prev_month_revenue", lag("gross_revenue").over(window_spec)) \
    .withColumn("mom_growth_percent",
        round(
            (col("gross_revenue") - col("prev_month_revenue")) / col("prev_month_revenue") * 100,
            2
        )
    ) \
    .withColumn("calculated_at", current_timestamp()) \
    .orderBy("order_year", "order_month")

# Write to Gold table
monthly_revenue_df.write.mode("overwrite").format("delta").saveAsTable(f"{catalog}.{gold_db}.monthly_revenue")

# Show count
month_count = spark.table(f"{catalog}.{gold_db}.monthly_revenue").count()
print(f"month_count: {month_count}")
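
The month-over-month figure above compares each month with the previous one via `lag`, so the first month has no growth value. The plain-Python sketch below walks through the same arithmetic on made-up revenues. `month_over_month` is a hypothetical helper, and returning `None` when the previous month is zero is an added assumption; the cell above divides directly.

```python
def month_over_month(revenues):
    """Growth percent per month versus the previous month; None when no usable previous value exists."""
    growth = []
    previous = None
    for revenue in revenues:
        if previous is None or previous == 0:
            growth.append(None)  # first month, or division by zero
        else:
            growth.append(round((revenue - previous) / previous * 100, 2))
        previous = revenue
    return growth

monthly = [1000.0, 1200.0, 900.0]
growth = month_over_month(monthly)
print(growth)
```

Going from 1000 to 1200 is a 20% rise, and going from 1200 to 900 is a 25% drop, because each change is measured against the previous month rather than the first.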


In [None]:
# View monthly revenue trends
from pyspark.sql.functions import col, to_date, round

monthly_trends_df = spark.table(f"{catalog}.{gold_db}.monthly_revenue") \
    .select(
        to_date(col("month_start_date")).alias("month"),
        col("total_orders"),
        round(col("gross_revenue"), 2).alias("gross_revenue"),
        round(col("net_revenue"), 2).alias("net_revenue"),
        round(col("avg_order_value"), 2).alias("avg_order_value"),
        col("mom_growth_percent")
    ) \
    .orderBy(col("month").desc()) \
    .limit(12)

display(monthly_trends_df)


### Gold: Category Performance

Category-level analytics for merchandising decisions.


In [None]:
# Create Gold category_performance table
from pyspark.sql.functions import col, countDistinct, sum, avg, min, max, current_timestamp

# Drop table if exists
spark.sql(f"DROP TABLE IF EXISTS {catalog}.{gold_db}.category_performance")

# Read tables
products = spark.table(f"{catalog}.{silver_db}.products").alias("p")
order_items = spark.table(f"{catalog}.{silver_db}.order_items").alias("oi")
orders = spark.table(f"{catalog}.{silver_db}.orders").alias("o")

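# Keep only line items from non-cancelled orders; a plain left join to orders
# would still count the units and revenue of cancelled line items
sold_items = order_items.join(
    orders.filter(col("o.status") != "Cancelled"),
    col("oi.order_id") == col("o.order_id"),
    "left_semi"
)

# Join and aggregate
category_performance_df = products \
    .join(sold_items, col("p.product_id") == col("oi.product_id"), "left") \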
    .groupBy("p.category") \
    .agg(
        countDistinct("p.product_id").alias("total_products"),
        countDistinct("oi.order_id").alias("total_orders"),
        sum("oi.quantity").alias("total_units_sold"),
        sum("oi.line_total").alias("total_revenue"),
        avg("oi.line_total").alias("avg_transaction_value"),
        min("p.price").alias("min_price"),
        max("p.price").alias("max_price"),
        avg("p.price").alias("avg_price")
    ) \
    .withColumn("calculated_at", current_timestamp())

# Write to Gold table
category_performance_df.write.mode("overwrite").format("delta").saveAsTable(f"{catalog}.{gold_db}.category_performance")

# Display results ordered by revenue
display(spark.table(f"{catalog}.{gold_db}.category_performance").orderBy(col("total_revenue").desc()))


### ✅ Gold Layer Complete!

Summary of all Gold tables:


In [None]:
# Show all Gold tables
display(spark.sql(f"SHOW TABLES IN {catalog}.{gold_db}"))


In [None]:
# Gold layer summary
print("="*70)
print("GOLD LAYER SUMMARY - BUSINESS ANALYTICS")
print("="*70)
print(f"\nCustomer Analytics:  {spark.table(f'{catalog}.{gold_db}.customer_analytics').count():>10,} customers")
print(f"Product Performance: {spark.table(f'{catalog}.{gold_db}.product_performance').count():>10,} products")
print(f"Monthly Revenue:     {spark.table(f'{catalog}.{gold_db}.monthly_revenue').count():>10,} months")
print(f"Category Performance:{spark.table(f'{catalog}.{gold_db}.category_performance').count():>10,} categories")
print("="*70)


---

# 🎓 Advanced Concepts

## Delta Lake Features You've Used

Throughout this notebook, you've leveraged powerful Delta Lake capabilities:


### Time Travel

Query previous versions of your data:


In [None]:
# View table history
display(spark.sql(f"DESCRIBE HISTORY {catalog}.{silver_db}.customers LIMIT 5"))


In [None]:
# Query a previous version (if you've updated the table)
# display(spark.read.format("delta").option("versionAsOf", 0).table(f"{catalog}.{silver_db}.customers"))


### Table Statistics & Optimization


In [None]:
# View detailed table information
display(spark.sql(f"DESCRIBE EXTENDED {catalog}.{gold_db}.customer_analytics"))


In [None]:
# Optimize tables for better query performance
# spark.sql(f"OPTIMIZE {catalog}.{gold_db}.customer_analytics")
# spark.sql(f"OPTIMIZE {catalog}.{gold_db}.product_performance")
# spark.sql(f"OPTIMIZE {catalog}.{gold_db}.monthly_revenue")



# 🎉 Congratulations!

You've successfully built a complete **Medallion Architecture** on Databricks!

## What You Accomplished:

### ✅ Bronze Layer
- Ingested raw CSV data into Delta tables
- Preserved complete data history
- Kept the load rerunnable: existing files are skipped and Bronze tables are overwritten

### ✅ Silver Layer
- Cleaned and validated data
- Applied business rules
- Added derived columns
- Standardized formats

### ✅ Gold Layer
- Created business-ready analytics tables
- Pre-calculated KPIs and metrics
- Built customer lifetime value analysis
- Analyzed product and category performance
- Created time-series revenue trends

## Key Concepts Mastered:

- 📦 **Delta Lake**: ACID transactions, time travel, schema evolution
- 🏗️ **Medallion Architecture**: Progressive data refinement
- 🔄 **ETL Patterns**: Incremental loading, data quality, transformations
- 📊 **Analytics Engineering**: Business metrics, aggregations, rankings
- 🚀 **Performance**: Optimizations, partitioning strategies

## Next Steps:

1. **Explore Further**: Try modifying queries to answer your own business questions
2. **Add Complexity**: Implement slowly changing dimensions (SCD Type 2)
3. **Automation**: Learn about Databricks Workflows to schedule these pipelines
4. **Streaming**: Explore Structured Streaming for real-time data
5. **ML Integration**: Build machine learning models on your clean data
6. **Check Best Practices**: Review notebook 03 for advanced patterns

---

## Sample Business Questions You Can Answer:

```sql
-- Who are the most valuable customers?
SELECT * FROM {catalog}.{gold_db}.customer_analytics 
ORDER BY lifetime_value DESC LIMIT 10;

-- What products drive the most revenue?
SELECT * FROM {catalog}.{gold_db}.product_performance 
ORDER BY total_revenue DESC LIMIT 10;

-- How is revenue trending?
SELECT * FROM {catalog}.{gold_db}.monthly_revenue 
ORDER BY order_year DESC, order_month DESC;

-- Which categories perform best?
SELECT * FROM {catalog}.{gold_db}.category_performance 
ORDER BY total_revenue DESC;
```

---

**You're now ready to build production data pipelines on Databricks! 🚀**

Questions? Check out:
- [Databricks Documentation](https://docs.databricks.com/)
- [Delta Lake Guide](https://docs.delta.io/)
- [Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture)
