Data Quality Issues to Fix:

- Duplicates: the same customer_id appears multiple times
- Inconsistent formatting: customer_city and customer_state values have mixed case and leading/trailing spaces
- Data type mismatches: customer_zip_code_prefix is stored as a string when it should be an integer

Requirements:

- Remove duplicate customer_id records (keep first occurrence)
- Convert customer_zip_code_prefix to integer type
- Standardize customer_city: UPPERCASE and trim whitespace
- Standardize customer_state: UPPERCASE and trim whitespace
- Validate: No null values in customer_id
- Validate: No null values in customer_unique_id
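
To make the requirements above concrete, here is a minimal plain-Python sketch applied to a few made-up rows. The `clean_customers` helper and the sample values are hypothetical and not part of the pipeline, and dropping rows with a null identifier is only one assumed way of meeting the validation rules. Note that converting the zip code prefix to an integer removes any leading zero.

```python
def clean_customers(rows):
    """Apply the cleaning rules to a list of dict rows (illustrative only)."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        # Validation (assumed handling): skip rows missing either identifier.
        if row["customer_id"] is None or row["customer_unique_id"] is None:
            continue
        # Deduplicate on customer_id, keeping the first occurrence in input order.
        if row["customer_id"] in seen_ids:
            continue
        seen_ids.add(row["customer_id"])
        cleaned.append({
            "customer_id": row["customer_id"],
            "customer_unique_id": row["customer_unique_id"],
            # int() drops leading zeros: "01310" becomes 1310.
            "customer_zip_code_prefix": int(row["customer_zip_code_prefix"]),
            # Trim surrounding whitespace, then uppercase.
            "customer_city": row["customer_city"].strip().upper(),
            "customer_state": row["customer_state"].strip().upper(),
        })
    return cleaned


# Hypothetical sample rows: one duplicate customer_id and one null customer_id.
sample_rows = [
    {"customer_id": "c1", "customer_unique_id": "u1",
     "customer_zip_code_prefix": "01310", "customer_city": "  sao paulo ", "customer_state": "sp"},
    {"customer_id": "c1", "customer_unique_id": "u1",
     "customer_zip_code_prefix": "01310", "customer_city": "Sao Paulo", "customer_state": "SP"},
    {"customer_id": None, "customer_unique_id": "u9",
     "customer_zip_code_prefix": "30130", "customer_city": "belo horizonte", "customer_state": "mg"},
    {"customer_id": "c2", "customer_unique_id": "u2",
     "customer_zip_code_prefix": "20040", "customer_city": "Rio de Janeiro", "customer_state": " rj"},
]

cleaned_rows = clean_customers(sample_rows)
```

With this input, only the first `c1` row and the `c2` row remain in `cleaned_rows`.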

In [0]:
from pyspark.sql import functions as F

In [0]:
bronze_customer = spark.read.table("golden_360.bronze.customers")
bronze_customer.show()

In [0]:
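silver_customers = (
    bronze_customer
    # Standardize city and state: trim whitespace, then uppercase
    .withColumn("customer_city", F.upper(F.trim(F.col("customer_city"))))
    .withColumn("customer_state", F.upper(F.trim(F.col("customer_state"))))
    # Cast the zip code prefix from string to integer
    .withColumn("customer_zip_code_prefix", F.col("customer_zip_code_prefix").cast("int"))
    # Drop rows with a null customer_id or customer_unique_id
    .filter(F.col("customer_id").isNotNull() & F.col("customer_unique_id").isNotNull())
    # Keep one row per customer_id (dropDuplicates does not guarantee which row is kept)
    .dropDuplicates(["customer_id"])
)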

In [0]:
silver_customers.show()

In [0]:
silver_customers.write.format("delta").mode("overwrite").saveAsTable("golden_360.silver.customers")