# ERP Product Category Cleaning (Bronze → Silver)

This notebook prepares **`bronze.erp_px_cat_g1v2`** for the Silver layer.

Objectives:
- verify primary key uniqueness (`ID`)
- validate referential integrity with `silver.crm_prd_info.cat_id`
- inspect category fields for inconsistencies
- remove unwanted spaces
- produce a cleaned Silver dataset

The steps mirror the original SQL exploration but are implemented in PySpark.


In [0]:
from pyspark.sql import functions as F

spark.sql("USE CATALOG datawarehouse")

df_raw = spark.table("bronze.erp_px_cat_g1v2")

df_raw.display()

ID,CAT,SUBCAT,MAINTENANCE
AC_BR,Accessories,Bike Racks,Yes
AC_BS,Accessories,Bike Stands,No
AC_BC,Accessories,Bottles and Cages,No
AC_CL,Accessories,Cleaners,Yes
AC_FE,Accessories,Fenders,No
AC_HE,Accessories,Helmets,Yes
AC_HP,Accessories,Hydration Packs,No
AC_LI,Accessories,Lights,Yes
AC_LO,Accessories,Locks,Yes
AC_PA,Accessories,Panniers,No


## Step 1. Check for duplicate primary keys

We expect `ID` to uniquely identify each record.

If any rows appear below, duplicates exist and should be investigated
before writing to Silver.

In [0]:
(
    df_raw
    .groupBy("ID")
    .count()
    .filter(F.col("count") > 1)
    .display()
)

ID,count
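
To make the rule concrete, here is a minimal plain-Python sketch of what the `groupBy("ID").count()` filter looks for. It runs without Spark on a handful of IDs copied from the preview, plus one deliberate repeat added for illustration. The helper name `find_duplicate_keys` is hypothetical and not part of the notebook.

```python
from collections import Counter


def find_duplicate_keys(ids):
    """Return keys that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [key for key, n in counts.items() if n > 1]


# IDs taken from the preview; the second "AC_BR" is an injected duplicate.
sample_ids = ["AC_BR", "AC_BS", "AC_BC", "AC_BR"]

duplicates = find_duplicate_keys(sample_ids)
print(duplicates)  # ['AC_BR']
```

An empty result, like the empty output of the Spark cell, means every `ID` occurs exactly once.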


## Step 2. Validate referential integrity

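Each category `ID` should appear in `silver.crm_prd_info.cat_id`,
so that the category table aligns with the product master.

The left anti join below keeps only the category IDs that have no match.
Any rows returned are orphan records: categories that no product references.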

In [0]:
prd_df = spark.table("silver.crm_prd_info").select("cat_id")

invalid_refs = (
    df_raw
    .join(prd_df, df_raw.ID == prd_df.cat_id, "left_anti")
    .select("ID")
)

invalid_refs.display()

ID
CO_PD
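
As a rough illustration of the `left_anti` semantics, the sketch below reproduces the check in plain Python. The product-side values are hypothetical sample data, not the contents of `silver.crm_prd_info`, and `find_orphans` is an illustrative name.

```python
def find_orphans(category_ids, product_cat_ids):
    """Mimic a left anti join: keep category IDs with no match on the product side."""
    known = set(product_cat_ids)
    return [cid for cid in category_ids if cid not in known]


# A few category IDs, including the one reported by the check.
category_ids = ["AC_BR", "AC_HE", "CO_PD"]

# Hypothetical product-side values; repeats are expected because many
# products share one category.
product_cat_ids = ["AC_BR", "AC_HE", "AC_HE"]

orphans = find_orphans(category_ids, product_cat_ids)
print(orphans)  # ['CO_PD']
```

Repeated values on the product side do not affect the result, which is why the anti join needs no prior `distinct()`.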


## Step 3. Explore distinct values

We inspect unique values in:
- CAT
- SUBCAT
- MAINTENANCE

This helps detect spelling differences, casing issues, or unexpected labels.

In [0]:
df_raw.select("CAT").distinct().display()
df_raw.select("SUBCAT").distinct().display()
df_raw.select("MAINTENANCE").distinct().display()

CAT
Accessories
Bikes
Clothing
Components


SUBCAT
Bike Racks
Bike Stands
Bottles and Cages
Cleaners
Fenders
Helmets
Hydration Packs
Lights
Locks
Panniers


MAINTENANCE
Yes
No
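
Eyeballing distinct values works for short lists like these. For longer ones, one possible approach is to group labels by a normalized key and report any key with more than one spelling. The sketch below is plain Python; the dirty values are hypothetical additions for illustration, and `casing_variants` is not part of the notebook.

```python
from collections import defaultdict


def casing_variants(labels):
    """Group labels by a trimmed, case-folded key; keep keys with more than one variant."""
    groups = defaultdict(set)
    for label in labels:
        groups[label.strip().casefold()].add(label)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Distinct CAT values as shown in the output.
clean = ["Accessories", "Bikes", "Clothing", "Components"]

# Hypothetical dirty variants, added only to show what a finding looks like.
dirty = clean + ["bikes", "Bikes "]

print(casing_variants(clean))  # {}
print(casing_variants(dirty))  # {'bikes': ['Bikes', 'Bikes ', 'bikes']}
```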


## Step 4. Check for unwanted spaces

Leading or trailing spaces cause silent mismatches in joins and filters.

We check for rows where:
column ≠ TRIM(column)

Ideally, these checks return no rows.

In [0]:
df_raw.filter(F.col("CAT") != F.trim("CAT")).select("CAT").display()
df_raw.filter(F.col("SUBCAT") != F.trim("SUBCAT")).select("SUBCAT").display()
df_raw.filter(F.col("MAINTENANCE") != F.trim("MAINTENANCE")).select("MAINTENANCE").display()

CAT


SUBCAT


MAINTENANCE
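
The sketch below shows, in plain Python, why an untrimmed value is a silent problem: the lookup does not raise an error, it just finds nothing. The dirty value is hypothetical. `strip(" ")` is used because Spark's `trim` is assumed here to remove only space characters by default. `None` is skipped to mirror how the Spark filter drops rows where the comparison evaluates to NULL.

```python
def untrimmed(values):
    """Return values that differ from their space-trimmed form; None is skipped."""
    return [v for v in values if v is not None and v != v.strip(" ")]


# Hypothetical lookup keyed by a clean category name.
lookup = {"Bikes": "category found"}

raw_value = "Bikes "  # trailing space, hypothetical dirty value

print(lookup.get(raw_value))             # None: the match silently fails
print(lookup.get(raw_value.strip(" ")))  # category found

print(untrimmed(["Bikes", " Bikes", "Bikes ", None]))  # [' Bikes', 'Bikes ']
```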


## Step 5. Clean and standardize fields

Transformations applied:
- trim spaces from CAT, SUBCAT, MAINTENANCE
- keep ID unchanged

No business logic or remapping is applied here, only formatting cleanup.

In [0]:
df_clean = (
    df_raw
    .withColumn("CAT", F.trim("CAT"))
    .withColumn("SUBCAT", F.trim("SUBCAT"))
    .withColumn("MAINTENANCE", F.trim("MAINTENANCE"))
)

df_clean.display()

ID,CAT,SUBCAT,MAINTENANCE
AC_BR,Accessories,Bike Racks,Yes
AC_BS,Accessories,Bike Stands,No
AC_BC,Accessories,Bottles and Cages,No
AC_CL,Accessories,Cleaners,Yes
AC_FE,Accessories,Fenders,No
AC_HE,Accessories,Helmets,Yes
AC_HP,Accessories,Hydration Packs,No
AC_LI,Accessories,Lights,Yes
AC_LO,Accessories,Locks,Yes
AC_PA,Accessories,Panniers,No


## Step 6. Create final Silver dataset

The cleaned dataset is ready for the Silver layer.

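Columns:
- ID
- CAT
- SUBCAT
- MAINTENANCE
- dwh_create_date (load timestamp added in this step)

CAT, SUBCAT, and MAINTENANCE are trimmed and join-safe; ID is passed through unchanged.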

In [0]:
silver_df = (
    df_clean
    .select("ID", "CAT", "SUBCAT", "MAINTENANCE")
    # Audit column: timestamp of this load
    .withColumn("dwh_create_date", F.current_timestamp())
)

silver_df.display()

ID,CAT,SUBCAT,MAINTENANCE,dwh_create_date
AC_BR,Accessories,Bike Racks,Yes,2026-02-08T06:45:58.957Z
AC_BS,Accessories,Bike Stands,No,2026-02-08T06:45:58.957Z
AC_BC,Accessories,Bottles and Cages,No,2026-02-08T06:45:58.957Z
AC_CL,Accessories,Cleaners,Yes,2026-02-08T06:45:58.957Z
AC_FE,Accessories,Fenders,No,2026-02-08T06:45:58.957Z
AC_HE,Accessories,Helmets,Yes,2026-02-08T06:45:58.957Z
AC_HP,Accessories,Hydration Packs,No,2026-02-08T06:45:58.957Z
AC_LI,Accessories,Lights,Yes,2026-02-08T06:45:58.957Z
AC_LO,Accessories,Locks,Yes,2026-02-08T06:45:58.957Z
AC_PA,Accessories,Panniers,No,2026-02-08T06:45:58.957Z


## Step 7. Save as Silver Delta table

Persist the curated result to the Silver layer as a Delta table
for downstream analytics and reporting. Because the write uses
`mode("overwrite")`, each run replaces the table's existing contents.

In [0]:
(
    silver_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("silver.erp_px_cat_g1v2")
)
