Data Quality Issues to Fix:

- Composite key duplicates (same order_id + order_item_id)
- Negative or zero prices
- Precision issues in decimal values
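
To make the precision issue concrete, here is a minimal pure-Python sketch of why monetary values are better held as fixed-precision decimals than as floats. The helper `to_decimal_10_2` is hypothetical and only approximates the effect of a `decimal(10,2)` cast. It assumes half-up rounding and returns `None` on overflow; a real engine may round differently or raise an error instead.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_decimal_10_2(value):
    """Hypothetical helper: round a value to 2 decimal places, as a decimal(10,2) column would hold it."""
    if value is None:
        return None
    # Going through str() avoids carrying the binary float error into the Decimal.
    rounded = Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    # decimal(10,2) leaves room for at most 8 digits before the decimal point.
    # Assumption: overflow is reported as None here; an engine may raise instead.
    if abs(rounded) >= Decimal("100000000"):
        return None
    return rounded

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum is slightly off.
print(0.1 + 0.2 == 0.3)                               # False
print(to_decimal_10_2(0.1 + 0.2) == Decimal("0.30"))  # True
```
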

Requirements:

- Deduplicate on composite key (order_id, order_item_id)
- Cast price to decimal(10,2)
- Cast freight_value to decimal(10,2)
- Validate: price > 0
- Validate: freight_value >= 0
- Convert shipping_limit_date to timestamp format
- Validate: No null values in order_id, order_item_id, product_id, seller_id, price

Output Schema:
- order_id: string (composite primary key part 1)
- order_item_id: integer (composite primary key part 2)
- product_id: string (foreign key)
- seller_id: string (foreign key)
- shipping_limit_date: timestamp
- price: decimal(10,2)
- freight_value: decimal(10,2)

In [0]:
from pyspark.sql import functions as F

In [0]:
bronze_df = spark.read.table("golden_360.bronze.order_item")

bronze_df.show()

In [0]:
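# Cast the monetary columns to fixed-precision decimals and the item number
# to an integer, matching the output schema.
silver_1 = (
    bronze_df
    .withColumn("order_item_id", F.col("order_item_id").cast("int"))
    .withColumn("price", F.col("price").cast("decimal(10,2)"))
    .withColumn("freight_value", F.col("freight_value").cast("decimal(10,2)"))
)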

print(silver_1.count())
silver_1.show()

In [0]:
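# Keep only rows with a positive price and a non-negative freight value.
# Rows where either column is null are dropped too, because the predicate is never true for them.
silver_2 = silver_1.where("price > 0 and freight_value >= 0")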

print(silver_2.count())

silver_2.show()

In [0]:
# Convert the shipping deadline to a proper timestamp.
silver_3 = silver_2.withColumn("shipping_limit_date", F.col("shipping_limit_date").cast("timestamp"))

In [0]:
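# Drop rows missing any required column, then keep one row per composite key.
required_cols = ["order_id", "order_item_id", "product_id", "seller_id", "price"]
silver_final = silver_3.dropna(subset=required_cols).dropDuplicates(["order_id", "order_item_id"])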

silver_final.count()

In [0]:
silver_final.write.format("delta").mode("overwrite").saveAsTable("golden_360.silver.order_items")
