Data Quality Issues to Fix:

- Duplicate product_ids
- Nulls in dimension/weight fields
- Inconsistent measurement units

Requirements:

- Deduplicate on product_id
- Cast all dimension fields to decimal(10,2):
  - product_name_lenght → rename to product_name_length (fix typo)
  - product_description_lenght → rename to product_description_length (fix typo)
  - product_photos_qty
  - product_weight_g
  - product_length_cm
  - product_height_cm
  - product_width_cm
- Handle nulls: Replace dimension/weight nulls with 0.0 (indicates missing measurement)
- Keep product_category_name as-is (nulls are valid - uncategorized products)
- Validate: No null values in product_id
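
To make the decimal(10,2) requirement concrete, here is a minimal pure-Python sketch of what that type can hold: 10 significant digits in total, 2 of them after the decimal point, so the largest magnitude is 99999999.99. The helper name `to_decimal_10_2` is hypothetical, and half-up rounding and returning None on overflow are assumptions made for illustration; Spark's own behaviour on overflow depends on its ANSI mode setting.

```python
from decimal import Decimal, ROUND_HALF_UP

# decimal(10,2): 10 digits in total, 2 after the decimal point.
MAX_ABS = Decimal("99999999.99")
TWO_PLACES = Decimal("0.01")


def to_decimal_10_2(value):
    """Illustrative stand-in for a decimal(10,2) cast (hypothetical helper).

    Rounds to two decimal places (half-up is an assumption) and returns
    None when the input is None or the result does not fit in 10 digits.
    """
    if value is None:
        return None
    rounded = Decimal(str(value)).quantize(TWO_PLACES, rounding=ROUND_HALF_UP)
    return rounded if abs(rounded) <= MAX_ABS else None
```

Typical product weights and dimensions sit far below the 99999999.99 limit, so in this sketch only a null input comes back as None.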

Output Schema:
- product_id: string (primary key)
- product_category_name: string (nullable)
- product_name_length: decimal(10,2)
- product_description_length: decimal(10,2)
- product_photos_qty: decimal(10,2)
- product_weight_g: decimal(10,2)
- product_length_cm: decimal(10,2)
- product_height_cm: decimal(10,2)
- product_width_cm: decimal(10,2)
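
The schema above can be checked mechanically. The sketch below is one possible approach, not part of the original notebook: it compares a list of `(column, type)` pairs, in the shape PySpark's `DataFrame.dtypes` returns, against the expected schema. `EXPECTED_SCHEMA` and `schema_mismatches` are hypothetical names introduced here.

```python
DECIMAL_10_2 = "decimal(10,2)"

# Expected silver schema, copied from the Output Schema list.
EXPECTED_SCHEMA = {
    "product_id": "string",
    "product_category_name": "string",
    "product_name_length": DECIMAL_10_2,
    "product_description_length": DECIMAL_10_2,
    "product_photos_qty": DECIMAL_10_2,
    "product_weight_g": DECIMAL_10_2,
    "product_length_cm": DECIMAL_10_2,
    "product_height_cm": DECIMAL_10_2,
    "product_width_cm": DECIMAL_10_2,
}


def schema_mismatches(actual_dtypes, expected=EXPECTED_SCHEMA):
    """Return a list of differences; an empty list means the schema matches.

    actual_dtypes: iterable of (column_name, type_string) pairs,
    for example the value of DataFrame.dtypes.
    """
    actual = dict(actual_dtypes)
    problems = []
    for name, want in expected.items():
        got = actual.get(name)
        if got is None:
            problems.append(f"missing column: {name}")
        elif got != want:
            problems.append(f"{name}: expected {want}, got {got}")
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems
```

In the notebook this could be called as `schema_mismatches(silver_final.dtypes)` just before the write. A leftover `product_name_lenght` column would show up as both a missing and an unexpected column.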

In [0]:
from pyspark.sql import functions as F

In [0]:
bronze_df = spark.read.table("golden_360.bronze.products")

bronze_df.show()

In [0]:
silver_1 = bronze_df.dropDuplicates(["product_id"])
silver_1.show()

In [0]:
silver_2 = silver_1.withColumnsRenamed(
    {
        "product_description_lenght" : "product_description_length",
        "product_name_lenght" : "product_name_length"
    }
)

silver_2.show()

In [0]:
# All "dimension fields" from the requirements, cast to decimal(10,2).
dimension_cols = [
    "product_name_length",
    "product_description_length",
    "product_photos_qty",
    "product_weight_g",
    "product_length_cm",
    "product_height_cm",
    "product_width_cm",
]

silver_3 = silver_2.withColumns(
    {c: F.col(c).cast("decimal(10,2)") for c in dimension_cols}
)


In [0]:
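# Drop rows without a primary key, then replace nulls in every
# decimal(10,2) column with 0.00 (missing measurement), per the requirements.
decimal_cols = [c for c, t in silver_3.dtypes if t == "decimal(10,2)"]
zero = F.lit(0).cast("decimal(10,2)")
silver_final = silver_3.dropna(subset=["product_id"]).withColumns(
    {c: F.coalesce(F.col(c), zero) for c in decimal_cols}
)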

In [0]:
silver_final.write.format("delta").mode("overwrite").saveAsTable("golden_360.silver.products")