# Gold Layer: Product Dimension (dim_products)

This notebook builds the Product Dimension by combining:

- silver.crm_prd_info        → core product master
- silver.erp_px_cat_g1v2     → category attributes

Goals:
- enrich products with category + subcategory + maintenance
- remove historical (expired) rows
- ensure no duplicates
- create surrogate key for star schema joins
- produce analytics-ready dimension

Output:
datawarehouse.gold.dim_products

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

## Step 1. Load Silver tables

In [0]:
spark.sql("USE CATALOG datawarehouse")

prd_df = spark.table("silver.crm_prd_info")
cat_df = spark.table("silver.erp_px_cat_g1v2")

prd_df.display()

prd_id,cat_id,prd_key,prd_nm,prd_cost,prd_line,prd_start_dt,prd_end_dt,dwh_create_date
601,CO_BB,BB-7421,LL Bottom Bracket,24,,2013-07-01,,2026-02-08T05:05:21.684Z
602,CO_BB,BB-8107,ML Bottom Bracket,45,,2013-07-01,,2026-02-08T05:05:21.684Z
603,CO_BB,BB-9108,HL Bottom Bracket,54,,2013-07-01,,2026-02-08T05:05:21.684Z
478,AC_BC,BC-M005,Mountain Bottle Cage,4,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z
479,AC_BC,BC-R205,Road Bottle Cage,3,Road,2013-07-01,,2026-02-08T05:05:21.684Z
596,BI_MB,BK-M18B-40,Mountain-500 Black- 40,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z
597,BI_MB,BK-M18B-42,Mountain-500 Black- 42,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z
598,BI_MB,BK-M18B-44,Mountain-500 Black- 44,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z
599,BI_MB,BK-M18B-48,Mountain-500 Black- 48,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z
600,BI_MB,BK-M18B-52,Mountain-500 Black- 52,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z


## Step 2. Join product info with category lookup

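A left join keeps every product row. Products whose cat_id has no match in silver.erp_px_cat_g1v2 still appear, with NULL in the category columns (ID, CAT, SUBCAT, MAINTENANCE). Both inputs carry a dwh_create_date column, so the joined result contains that name twice (displayed as dwh_create_date and dwh_create_date.1); Step 5 selects neither.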

In [0]:
joined_df = (
    prd_df.alias("pi")
    .join(
        cat_df.alias("ep"),
        F.col("pi.cat_id") == F.col("ep.ID"),
        "left"
    )
)

joined_df.display()

prd_id,cat_id,prd_key,prd_nm,prd_cost,prd_line,prd_start_dt,prd_end_dt,dwh_create_date,ID,CAT,SUBCAT,MAINTENANCE,dwh_create_date.1
601,CO_BB,BB-7421,LL Bottom Bracket,24,,2013-07-01,,2026-02-08T05:05:21.684Z,CO_BB,Components,Bottom Brackets,Yes,2026-02-08T06:46:23.452Z
602,CO_BB,BB-8107,ML Bottom Bracket,45,,2013-07-01,,2026-02-08T05:05:21.684Z,CO_BB,Components,Bottom Brackets,Yes,2026-02-08T06:46:23.452Z
603,CO_BB,BB-9108,HL Bottom Bracket,54,,2013-07-01,,2026-02-08T05:05:21.684Z,CO_BB,Components,Bottom Brackets,Yes,2026-02-08T06:46:23.452Z
478,AC_BC,BC-M005,Mountain Bottle Cage,4,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z,AC_BC,Accessories,Bottles and Cages,No,2026-02-08T06:46:23.452Z
479,AC_BC,BC-R205,Road Bottle Cage,3,Road,2013-07-01,,2026-02-08T05:05:21.684Z,AC_BC,Accessories,Bottles and Cages,No,2026-02-08T06:46:23.452Z
596,BI_MB,BK-M18B-40,Mountain-500 Black- 40,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z,BI_MB,Bikes,Mountain Bikes,Yes,2026-02-08T06:46:23.452Z
597,BI_MB,BK-M18B-42,Mountain-500 Black- 42,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z,BI_MB,Bikes,Mountain Bikes,Yes,2026-02-08T06:46:23.452Z
598,BI_MB,BK-M18B-44,Mountain-500 Black- 44,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z,BI_MB,Bikes,Mountain Bikes,Yes,2026-02-08T06:46:23.452Z
599,BI_MB,BK-M18B-48,Mountain-500 Black- 48,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z,BI_MB,Bikes,Mountain Bikes,Yes,2026-02-08T06:46:23.452Z
600,BI_MB,BK-M18B-52,Mountain-500 Black- 52,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z,BI_MB,Bikes,Mountain Bikes,Yes,2026-02-08T06:46:23.452Z


## Step 3. Duplicate check

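Each prd_id should appear exactly once in the joined result.

If duplicates appear, the likely causes are:
- the category table contains the same ID more than once, so the join fans out product rows
- the product source itself contains repeated prd_id values

An empty result below means no prd_id is duplicated.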


In [0]:
(
    joined_df
    .groupBy("prd_id")
    .count()
    .filter("count > 1")
    .display()
)

prd_id,count


## Step 4. Filter out historical records

Business rule: keep only the current version of each product.

prd_end_dt IS NULL → current version
prd_end_dt IS NOT NULL → historical version, excluded from the dimension

In [0]:
active_df = joined_df.filter(F.col("prd_end_dt").isNull())

active_df.display()

prd_id,cat_id,prd_key,prd_nm,prd_cost,prd_line,prd_start_dt,prd_end_dt,dwh_create_date,ID,CAT,SUBCAT,MAINTENANCE,dwh_create_date.1
601,CO_BB,BB-7421,LL Bottom Bracket,24,,2013-07-01,,2026-02-08T05:05:21.684Z,CO_BB,Components,Bottom Brackets,Yes,2026-02-08T06:46:23.452Z
602,CO_BB,BB-8107,ML Bottom Bracket,45,,2013-07-01,,2026-02-08T05:05:21.684Z,CO_BB,Components,Bottom Brackets,Yes,2026-02-08T06:46:23.452Z
603,CO_BB,BB-9108,HL Bottom Bracket,54,,2013-07-01,,2026-02-08T05:05:21.684Z,CO_BB,Components,Bottom Brackets,Yes,2026-02-08T06:46:23.452Z
478,AC_BC,BC-M005,Mountain Bottle Cage,4,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z,AC_BC,Accessories,Bottles and Cages,No,2026-02-08T06:46:23.452Z
479,AC_BC,BC-R205,Road Bottle Cage,3,Road,2013-07-01,,2026-02-08T05:05:21.684Z,AC_BC,Accessories,Bottles and Cages,No,2026-02-08T06:46:23.452Z
596,BI_MB,BK-M18B-40,Mountain-500 Black- 40,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z,BI_MB,Bikes,Mountain Bikes,Yes,2026-02-08T06:46:23.452Z
597,BI_MB,BK-M18B-42,Mountain-500 Black- 42,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z,BI_MB,Bikes,Mountain Bikes,Yes,2026-02-08T06:46:23.452Z
598,BI_MB,BK-M18B-44,Mountain-500 Black- 44,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z,BI_MB,Bikes,Mountain Bikes,Yes,2026-02-08T06:46:23.452Z
599,BI_MB,BK-M18B-48,Mountain-500 Black- 48,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z,BI_MB,Bikes,Mountain Bikes,Yes,2026-02-08T06:46:23.452Z
600,BI_MB,BK-M18B-52,Mountain-500 Black- 52,295,Mountain,2013-07-01,,2026-02-08T05:05:21.684Z,BI_MB,Bikes,Mountain Bikes,Yes,2026-02-08T06:46:23.452Z


## Step 5. Select + rename columns

Reshape into business-friendly dimensional structure.

In [0]:
dim_products_df = (
    active_df.select(
        F.col("prd_id").alias("product_id"),
        F.col("prd_key").alias("product_number"),
        F.col("prd_nm").alias("product_name"),
        F.col("cat_id").alias("category_id"),
        F.col("CAT").alias("category"),
        F.col("SUBCAT").alias("sub_category"),
        F.col("MAINTENANCE").alias("maintenance"),
        F.col("prd_cost").alias("product_cost"),
        F.col("prd_line").alias("product_line"),
        F.col("prd_start_dt").alias("start_date")
    )
)

dim_products_df.display()

product_id,product_number,product_name,category_id,category,sub_category,maintenance,product_cost,product_line,start_date
601,BB-7421,LL Bottom Bracket,CO_BB,Components,Bottom Brackets,Yes,24,,2013-07-01
602,BB-8107,ML Bottom Bracket,CO_BB,Components,Bottom Brackets,Yes,45,,2013-07-01
603,BB-9108,HL Bottom Bracket,CO_BB,Components,Bottom Brackets,Yes,54,,2013-07-01
478,BC-M005,Mountain Bottle Cage,AC_BC,Accessories,Bottles and Cages,No,4,Mountain,2013-07-01
479,BC-R205,Road Bottle Cage,AC_BC,Accessories,Bottles and Cages,No,3,Road,2013-07-01
596,BK-M18B-40,Mountain-500 Black- 40,BI_MB,Bikes,Mountain Bikes,Yes,295,Mountain,2013-07-01
597,BK-M18B-42,Mountain-500 Black- 42,BI_MB,Bikes,Mountain Bikes,Yes,295,Mountain,2013-07-01
598,BK-M18B-44,Mountain-500 Black- 44,BI_MB,Bikes,Mountain Bikes,Yes,295,Mountain,2013-07-01
599,BK-M18B-48,Mountain-500 Black- 48,BI_MB,Bikes,Mountain Bikes,Yes,295,Mountain,2013-07-01
600,BK-M18B-52,Mountain-500 Black- 52,BI_MB,Bikes,Mountain Bikes,Yes,295,Mountain,2013-07-01


## Step 6. Create surrogate key

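Add product_key, a sequential integer surrogate key. Fact tables reference this key instead of product_id.

row_number() is assigned over an ordering by start_date, then product_number. Because the window has no partitionBy, Spark moves all rows into a single partition to number them, which is acceptable for small dimensions but can become a bottleneck on large tables. The keys are recomputed on every run, so a product's key can change when rows that sort earlier are added or removed.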

In [0]:
window_spec = Window.orderBy("start_date", "product_number")

dim_products_df = (
    dim_products_df
    .withColumn("product_key", F.row_number().over(window_spec))
)



## Step 7. Final column order

Standard layout:
key → identifiers → attributes

In [0]:
dim_products_df = dim_products_df.select(
    "product_key",
    "product_id",
    "product_number",
    "product_name",
    "category_id",
    "category",
    "sub_category",
    "maintenance",
    "product_cost",
    "product_line",
    "start_date"
)

dim_products_df.display()



product_key,product_id,product_number,product_name,category_id,category,sub_category,maintenance,product_cost,product_line,start_date
1,210,FR-R92B-58,HL Road Frame - Black- 58,CO_RF,Components,Road Frames,Yes,0,Road,2003-07-01
2,211,FR-R92R-58,HL Road Frame - Red- 58,CO_RF,Components,Road Frames,Yes,0,Road,2003-07-01
3,348,BK-M82B-38,Mountain-100 Black- 38,BI_MB,Bikes,Mountain Bikes,Yes,1898,Mountain,2011-07-01
4,349,BK-M82B-42,Mountain-100 Black- 42,BI_MB,Bikes,Mountain Bikes,Yes,1898,Mountain,2011-07-01
5,350,BK-M82B-44,Mountain-100 Black- 44,BI_MB,Bikes,Mountain Bikes,Yes,1898,Mountain,2011-07-01
6,351,BK-M82B-48,Mountain-100 Black- 48,BI_MB,Bikes,Mountain Bikes,Yes,1898,Mountain,2011-07-01
7,344,BK-M82S-38,Mountain-100 Silver- 38,BI_MB,Bikes,Mountain Bikes,Yes,1912,Mountain,2011-07-01
8,345,BK-M82S-42,Mountain-100 Silver- 42,BI_MB,Bikes,Mountain Bikes,Yes,1912,Mountain,2011-07-01
9,346,BK-M82S-44,Mountain-100 Silver- 44,BI_MB,Bikes,Mountain Bikes,Yes,1912,Mountain,2011-07-01
10,347,BK-M82S-48,Mountain-100 Silver- 48,BI_MB,Bikes,Mountain Bikes,Yes,1912,Mountain,2011-07-01


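## Step 8. Write Gold Delta table

mode("overwrite") replaces the table's contents on every run. The two-part name gold.dim_products resolves to datawarehouse.gold.dim_products through the USE CATALOG statement in Step 1.
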
In [0]:
(
    dim_products_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("gold.dim_products")
)

