Data Quality Issues to Fix:

- Duplicate seller_ids
- Inconsistent city/state formatting
- seller_zip_code_prefix not stored as an integer

Requirements:

- Deduplicate on seller_id
- Convert seller_zip_code_prefix to integer
- Standardize seller_city: UPPERCASE and trim whitespace
- Standardize seller_state: UPPERCASE and trim whitespace
- Validate: No null values in seller_id

Output Schema:
seller_id: string (primary key)
seller_zip_code_prefix: integer
seller_city: string (UPPERCASE, trimmed)
seller_state: string (UPPERCASE, trimmed)
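
The schema above can be checked mechanically before the table is written. The following is a minimal sketch, not part of the original notebook: `EXPECTED_DTYPES` and `schema_mismatches` are hypothetical names, and the only thing the helper assumes about its argument is that it exposes `dtypes` as a list of `(column, type)` pairs, which is what a PySpark `DataFrame` provides (an integer column is reported as `"int"`).

```python
# Expected silver schema, taken from the "Output Schema" list above.
# PySpark's DataFrame.dtypes reports IntegerType as the string "int".
EXPECTED_DTYPES = [
    ("seller_id", "string"),
    ("seller_zip_code_prefix", "int"),
    ("seller_city", "string"),
    ("seller_state", "string"),
]


def schema_mismatches(df, expected=EXPECTED_DTYPES):
    """Return a list of readable differences between df.dtypes and expected.

    Hypothetical helper: an empty list means the schema matches.
    """
    actual = dict(df.dtypes)
    problems = []
    # Columns that are missing or have the wrong type.
    for name, dtype in expected:
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(f"{name}: expected {dtype}, found {actual[name]}")
    # Columns that are present but not part of the expected schema.
    expected_names = {name for name, _ in expected}
    for name in actual:
        if name not in expected_names:
            problems.append(f"unexpected column: {name}")
    return problems
```

In this notebook it could be called as `schema_mismatches(silver_final)` just before the write, failing the run if the returned list is not empty.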

In [0]:
from pyspark.sql import functions as F

In [0]:
# Load the raw sellers table from the bronze layer.
bronze_sellers = spark.read.table("golden_360.bronze.sellers")

bronze_sellers.show()

In [0]:
# Keep one row per seller_id; which of the duplicate rows survives is not guaranteed.
silver_duplicates_drop = bronze_sellers.dropDuplicates(["seller_id"])
silver_duplicates_drop.show()
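
Before relying on the deduplicated result, it can help to know how many rows the step removes. The sketch below is an illustration rather than part of the original notebook: `duplicate_key_count` is a hypothetical helper, and it uses only the `count()`, `select()` and `distinct()` methods that a PySpark `DataFrame` provides.

```python
def duplicate_key_count(df, key="seller_id"):
    """Return how many rows exceed one-per-key: total rows minus distinct key values.

    Hypothetical helper; `df` is expected to behave like a PySpark DataFrame.
    """
    total_rows = df.count()
    distinct_keys = df.select(key).distinct().count()
    return total_rows - distinct_keys
```

Calling `duplicate_key_count(bronze_sellers)` gives the number of surplus rows in the bronze table, and the same call on `silver_duplicates_drop` should return 0.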

In [0]:
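silver_int = (
    silver_duplicates_drop
    # Standardize city and state: trim surrounding spaces, then uppercase.
    .withColumn("seller_city", F.upper(F.trim(F.col("seller_city"))))
    .withColumn("seller_state", F.upper(F.trim(F.col("seller_state"))))
    # Cast the zip code prefix to integer, as required by the output schema.
    .withColumn("seller_zip_code_prefix", F.col("seller_zip_code_prefix").cast("int"))
)
silver_int.show()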

In [0]:
# Drop rows with a null seller_id so the primary key is always populated.
silver_final = silver_int.where(F.col("seller_id").isNotNull())
silver_final.show()
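
The cell above removes rows with a null seller_id rather than reporting them. If the "Validate" requirement is read as a hard check, one possible approach is to fail the run instead. This is a minimal sketch under that assumption: `assert_no_nulls` is a hypothetical helper, and it relies only on `DataFrame.filter` accepting a SQL expression string and on `count()`.

```python
def assert_no_nulls(df, column="seller_id"):
    """Raise ValueError if any row has a null in `column`; otherwise return df.

    Hypothetical helper; `df` is expected to behave like a PySpark DataFrame.
    """
    null_rows = df.filter(f"{column} IS NULL").count()
    if null_rows:
        raise ValueError(f"{null_rows} row(s) have a null {column}")
    return df
```

Applied to `silver_int`, the check would surface null keys coming from the bronze layer; applied to `silver_final`, it acts as a final guard before the write.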

In [0]:
# Overwrite the silver sellers table with the cleaned data.
silver_final.write.format("delta").mode("overwrite").saveAsTable("golden_360.silver.sellers")