Data Quality Issues to Fix:

- Duplicate Portuguese category names
- Nulls in translations

Requirements:

- Deduplicate on product_category_name (Portuguese version)
- Trim whitespace from both columns
- Validate: No null values in either column
- Validate: Both columns have values (no empty strings)
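
The requirements above can be illustrated on a few in-memory rows. This is a minimal plain-Python sketch of the rules, not the Spark implementation. The helper name `clean_categories` and the sample rows are hypothetical. Keeping the first occurrence of a duplicate is an assumption, since the requirements do not say which row should win.

```python
from typing import List, Optional, Set, Tuple

def clean_categories(
    rows: List[Tuple[Optional[str], Optional[str]]]
) -> List[Tuple[str, str]]:
    """Trim both columns, drop null/empty rows, dedupe on the Portuguese name."""
    seen: Set[str] = set()
    cleaned: List[Tuple[str, str]] = []
    for name_pt, name_en in rows:
        # Validation: reject nulls in either column
        if name_pt is None or name_en is None:
            continue
        name_pt, name_en = name_pt.strip(), name_en.strip()
        # Validation: reject empty strings (checked after trimming)
        if not name_pt or not name_en:
            continue
        # Deduplicate on the Portuguese name; first occurrence wins (assumption)
        if name_pt in seen:
            continue
        seen.add(name_pt)
        cleaned.append((name_pt, name_en))
    return cleaned

# Hypothetical sample rows, for illustration only
sample = [
    (" beleza_saude ", "health_beauty"),
    ("beleza_saude", "health_beauty"),  # duplicate once trimmed
    ("pc_gamer", None),                 # null translation
    ("   ", "unknown"),                 # empty after trimming
    ("automotivo", " auto "),
]
print(clean_categories(sample))
```

Trimming before the empty-string check matters here: a value made only of spaces would otherwise pass validation.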

Output Schema:

- product_category_name: string (primary key, Portuguese)
- product_category_name_english: string (English translation)

In [0]:
from pyspark.sql import functions as F

In [0]:
# Load the bronze product category table and preview it
bronze_df = spark.read.table("golden_360.bronze.product_category")

bronze_df.show()

In [0]:
# Drop the existing English column; the next cell renames product_category into its place
silver_df = bronze_df.drop("product_category_name_english")

In [0]:
# Note: withColumnRenamed is a no-op if product_category is not present in the DataFrame
silver_df_1 = silver_df.withColumnRenamed("product_category", "product_category_name_english")

In [0]:
# Trim leading and trailing spaces from both columns
silver_df_2 = (
    silver_df_1
    .withColumn("product_category_name", F.trim(F.col("product_category_name")))
    .withColumn("product_category_name_english", F.trim(F.col("product_category_name_english")))
)

silver_df_2.show()

In [0]:
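silver_final = (
    silver_df_2.dropna(subset=["product_category_name", "product_category_name_english"])  # no nulls
    .filter((F.col("product_category_name") != "") & (F.col("product_category_name_english") != ""))  # no empty strings
    .dropDuplicates(["product_category_name"])  # one row per Portuguese name (which duplicate is kept is arbitrary)
)
silver_final.show()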

In [0]:
silver_final.write.format("delta").mode("overwrite").saveAsTable("golden_360.silver.product_category")