# ERP Location Data Cleaning (Bronze → Silver)

This notebook prepares **`bronze.erp_loc_a101`** for the Silver layer.

Goal:
- clean customer IDs (`CID`)
- standardize country values (`CNTRY`)
- output a normalized dataset ready for analytics and joins

We follow a simple pipeline:
1. Load raw bronze table  
2. Inspect key fields  
3. Clean CID format  
4. Explore country inconsistencies  
5. Normalize country names  
6. Write final Silver table

In [0]:
# Load bronze source
from pyspark.sql import functions as F

spark.sql("USE CATALOG datawarehouse")

df_raw = spark.table("bronze.erp_loc_a101")

df_raw.display()

CID,CNTRY
AW-00011000,Australia
AW-00011001,Australia
AW-00011002,Australia
AW-00011003,Australia
AW-00011004,Australia
AW-00011005,Australia
AW-00011006,Australia
AW-00011007,Australia
AW-00011008,Australia
AW-00011009,Australia


## Step 1. Inspect join keys (CID vs CRM keys)

Before transforming anything, we quickly check:

- `CID` from ERP (this table)
- `cst_key` from CRM customers

This helps confirm whether formats align or require cleaning.

In [0]:
crm_df = spark.table("bronze.crm_cust_info").select("cst_key")
erp_df = df_raw.select("CID")

crm_df.display()
erp_df.display()

cst_key
AW00011000
AW00011001
AW00011002
AW00011003
AW00011004
AW00011005
AW00011006
AW00011007
AW00011008
AW00011009


CID
AW-00011000
AW-00011001
AW-00011002
AW-00011003
AW-00011004
AW-00011005
AW-00011006
AW-00011007
AW-00011008
AW-00011009
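
The eyeball comparison above can also be expressed as a number. The following is a minimal plain-Python sketch, not part of the notebook's Spark pipeline; `match_rate` is a hypothetical helper and the key lists are just the first few sample rows from the two outputs.

```python
def match_rate(left_keys, right_keys):
    """Share of distinct left_keys that also appear in right_keys."""
    left = set(left_keys)
    if not left:
        return 0.0
    return len(left & set(right_keys)) / len(left)


# Sample values copied from the cell outputs above (not the full tables).
crm_keys = ["AW00011000", "AW00011001", "AW00011002"]
erp_keys = ["AW-00011000", "AW-00011001", "AW-00011002"]

raw_rate = match_rate(erp_keys, crm_keys)
dehyphenated_rate = match_rate([k.replace("-", "") for k in erp_keys], crm_keys)

print(raw_rate, dehyphenated_rate)  # 0.0 1.0
```

On these sample rows, no ERP key matches a CRM key as-is, while every key matches once the hyphen is dropped.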


## Step 2. Clean CID values

Issue:
Some `CID` values contain hyphens (`-`), which can break joins or create mismatches.

Action:
Remove hyphens only.  
No other transformations are applied.

In [0]:
df_clean = df_raw.withColumn(
    "CID_clean",
    F.regexp_replace("CID", "-", "")
)

df_clean.select("CID", "CID_clean").display()

CID,CID_clean
AW-00011000,AW00011000
AW-00011001,AW00011001
AW-00011002,AW00011002
AW-00011003,AW00011003
AW-00011004,AW00011004
AW-00011005,AW00011005
AW-00011006,AW00011006
AW-00011007,AW00011007
AW-00011008,AW00011008
AW-00011009,AW00011009
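
For unit tests that run without a Spark session, the same rule can be mirrored in plain Python. This is a sketch under assumptions: `clean_cid` and `is_expected_shape` are hypothetical helpers, and the "two capital letters plus eight digits" pattern is inferred from the sample rows only, not from a documented key specification.

```python
import re

# Assumed target shape, inferred from the sample rows: e.g. "AW00011000".
CID_PATTERN = re.compile(r"[A-Z]{2}\d{8}")


def clean_cid(cid):
    """Remove hyphens only; leave everything else unchanged. None stays None."""
    if cid is None:
        return None
    return cid.replace("-", "")


def is_expected_shape(cid):
    """True when the value matches the assumed key pattern exactly."""
    return cid is not None and CID_PATTERN.fullmatch(cid) is not None


cleaned = clean_cid("AW-00011000")
print(cleaned, is_expected_shape(cleaned))  # AW00011000 True
```

A shape check like this is optional, but it can surface values that hyphen removal alone would not repair.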


## Step 3. Explore country values

We check distinct `CNTRY` values to identify:
- spelling differences
- abbreviations
- unexpected or invalid entries

This informs the normalization logic.

In [0]:
df_raw.select("CNTRY").distinct().display()

CNTRY
Australia
US
Canada
DE
United Kingdom
France
USA
Germany
United States
""

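To make the review of the distinct values less manual, the list can be sorted into rough buckets. The sketch below is plain Python over the values shown above; `classify` and its rule (blank, or at most three upper-case characters) are illustrative assumptions, not rules taken from the source system.

```python
# Distinct CNTRY values copied from the output above.
distinct_values = [
    "Australia", "US", "Canada", "DE", "United Kingdom",
    "France", "USA", "Germany", "United States", "",
]


def classify(value):
    """Rough heuristic bucket for a raw country value."""
    text = (value or "").strip()
    if not text:
        return "blank"
    if len(text) <= 3 and text.isupper():
        return "abbreviation"
    return "full name"


# Keep only the values that need attention in the normalization step.
suspects = {v: classify(v) for v in distinct_values if classify(v) != "full name"}
print(suspects)
```

With these values, `US`, `DE` and `USA` are flagged as abbreviations and the empty string as blank.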

## Step 4. Normalize country names

We standardize country labels to a consistent format:

Mappings:
- US / USA / UNITED STATES → United States
- DE / GERMANY → Germany
- UNITED KINGDOM / AUSTRALIA / CANADA / FRANCE → kept as-is
- everything else → `n/a`

This ensures consistent grouping and reporting.

In [0]:
df_normalized = (
    df_clean
    .withColumn("CNTRY_trim", F.upper(F.trim("CNTRY")))
    .withColumn(
        "CNTRY_clean",
        F.when(F.col("CNTRY_trim").isin("US","USA","UNITED STATES"), "United States")
         .when(F.col("CNTRY_trim").isin("GERMANY","DE"), "Germany")
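         # Rebuild the label from the trimmed value so stray spaces or odd casing cannot survive
         .when(F.col("CNTRY_trim").isin("UNITED KINGDOM","AUSTRALIA","CANADA","FRANCE"), F.initcap(F.col("CNTRY_trim")))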
         .otherwise("n/a")
    )
)

df_normalized.select("CNTRY", "CNTRY_clean").display()

CNTRY,CNTRY_clean
Australia,Australia
Australia,Australia
Australia,Australia
Australia,Australia
Australia,Australia
Australia,Australia
Australia,Australia
Australia,Australia
Australia,Australia
Australia,Australia
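
The displayed rows only cover Australia, so the mapping is easier to verify against a small lookup. This is a plain-Python sketch of the mapping list in this step, with a hypothetical `normalize_country` helper; it always returns the canonical spelling from the lookup. One assumption to keep in mind: `str.strip()` removes all surrounding whitespace, whereas Spark's `trim` removes spaces only.

```python
# Keys are trimmed, upper-cased raw values; values are the standardized labels.
COUNTRY_MAP = {
    "US": "United States",
    "USA": "United States",
    "UNITED STATES": "United States",
    "DE": "Germany",
    "GERMANY": "Germany",
    "UNITED KINGDOM": "United Kingdom",
    "AUSTRALIA": "Australia",
    "CANADA": "Canada",
    "FRANCE": "France",
}


def normalize_country(raw):
    """Map a raw country value to its standardized label, or 'n/a'."""
    key = (raw or "").strip().upper()
    return COUNTRY_MAP.get(key, "n/a")


for raw in ["US", " usa ", "DE", "Australia", "", None]:
    print(repr(raw), "->", normalize_country(raw))
```

Blank and missing values fall through to `n/a`, the same as any country outside the list.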


## Step 5. Create final Silver dataset

The Silver table keeps only:
- cleaned customer ID
- normalized country
- a load timestamp (`dwh_create_date`)

All temporary helper columns are removed.

This dataset is now:
- join-safe
- standardized
- analytics-ready

In [0]:
silver_df = (
    df_normalized
    .select(
        F.col("CID_clean").alias("CID"),
        F.col("CNTRY_clean").alias("CNTRY")
    )
)
# Stamp each row with the load time for lineage
silver_df = silver_df.withColumn("dwh_create_date", F.current_timestamp())
silver_df.display()

CID,CNTRY,dwh_create_date
AW00011000,Australia,2026-02-08T06:28:26.968Z
AW00011001,Australia,2026-02-08T06:28:26.968Z
AW00011002,Australia,2026-02-08T06:28:26.968Z
AW00011003,Australia,2026-02-08T06:28:26.968Z
AW00011004,Australia,2026-02-08T06:28:26.968Z
AW00011005,Australia,2026-02-08T06:28:26.968Z
AW00011006,Australia,2026-02-08T06:28:26.968Z
AW00011007,Australia,2026-02-08T06:28:26.968Z
AW00011008,Australia,2026-02-08T06:28:26.968Z
AW00011009,Australia,2026-02-08T06:28:26.968Z
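
Whether the result is "join-safe" can be checked rather than assumed. The sketch below is plain Python over `(CID, CNTRY)` tuples, for example rows collected from a small sample. `key_issues` is a hypothetical helper, and the sample rows are invented to exercise the check, not taken from the table.

```python
from collections import Counter


def key_issues(rows):
    """Report duplicated and blank keys in an iterable of (cid, country) tuples."""
    counts = Counter(cid for cid, _ in rows)
    duplicates = sorted(cid for cid, n in counts.items() if cid and n > 1)
    blank_keys = sum(n for cid, n in counts.items() if not cid)
    return {"duplicates": duplicates, "blank_keys": blank_keys}


# Invented rows for illustration only.
sample_rows = [
    ("AW00011000", "Australia"),
    ("AW00011001", "Australia"),
    ("AW00011001", "Australia"),
    (None, "n/a"),
    ("", "n/a"),
]

report = key_issues(sample_rows)
print(report)  # {'duplicates': ['AW00011001'], 'blank_keys': 2}
```

An empty `duplicates` list and a `blank_keys` count of zero are what a one-row-per-customer join key should produce.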


## Step 6. Save as Silver Delta table

Persist the cleaned dataset to the Silver layer as a Delta table.
This becomes the curated source for downstream transformations and reporting.

In [0]:
(
    silver_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("datawarehouse.silver.erp_loc_a101")
)