Data Quality Issues to Fix:

- Multiple payment methods per order (this is valid, don't deduplicate)
- Decimal precision issues
- Invalid payment types
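
To illustrate the decimal precision issue, here is a minimal plain-Python sketch (no Spark needed) of why monetary values are better held as fixed-point decimals than as binary floats. The helper `to_money` is a hypothetical name introduced only for this example, and its half-up rounding is an assumption about how the value should be rounded to two places, not something the notebook specifies.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floats cannot represent most decimal fractions exactly,
# so sums of monetary amounts can drift from the expected value.
float_total = 0.1 + 0.2
float_matches = (float_total == 0.3)

# Decimals built from strings keep the exact decimal digits.
decimal_total = Decimal("0.10") + Decimal("0.20")
decimal_matches = (decimal_total == Decimal("0.30"))


def to_money(raw):
    """Hypothetical helper: round a raw value to two decimal places.

    Going through str() avoids carrying over float representation noise.
    ROUND_HALF_UP is an assumption made for this sketch.
    """
    return Decimal(str(raw)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(float_matches, decimal_matches)
print(to_money("99.335"), to_money(24.389999))
```

In the notebook itself the same effect comes from casting `payment_value` to `decimal(10,2)`, which fixes the scale at two decimal places.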

Requirements:

- Do NOT deduplicate (orders can have multiple payment methods)
- Cast payment_value to decimal(10,2)
- Cast payment_installments to integer
- Validate payment_type is one of: ['credit_card', 'boleto', 'voucher', 'debit_card']
- Validate: payment_value > 0
- Validate: payment_installments >= 1
- Validate: No null values in order_id, payment_type, payment_value, payment_installments
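
The validation requirements above can be sketched as a row-level check in plain Python, which can be handy for reasoning about edge cases before applying the rules in Spark. This is only an illustration under stated assumptions: `row_violations`, `VALID_PAYMENT_TYPES`, and `REQUIRED_FIELDS` are hypothetical names, and a row is assumed to be a plain dict keyed by column name.

```python
from decimal import Decimal

# Allowed payment types, taken from the requirements list.
VALID_PAYMENT_TYPES = {"credit_card", "boleto", "voucher", "debit_card"}

# Columns that must not be null.
REQUIRED_FIELDS = ("order_id", "payment_type", "payment_value", "payment_installments")


def row_violations(row):
    """Return the list of rules that a row (a dict) breaks; empty if it is valid."""
    violations = []

    # Rule: no nulls in the required columns.
    missing = [field for field in REQUIRED_FIELDS if row.get(field) is None]
    if missing:
        violations.append("null:" + ",".join(missing))

    # Rule: payment_type must be one of the allowed values.
    payment_type = row.get("payment_type")
    if payment_type is not None and payment_type not in VALID_PAYMENT_TYPES:
        violations.append("invalid_payment_type")

    # Rule: payment_value > 0.
    value = row.get("payment_value")
    if value is not None and Decimal(str(value)) <= 0:
        violations.append("non_positive_value")

    # Rule: payment_installments >= 1.
    installments = row.get("payment_installments")
    if installments is not None and int(installments) < 1:
        violations.append("installments_below_1")

    return violations


good_row = {"order_id": "o1", "payment_type": "boleto",
            "payment_value": "45.90", "payment_installments": 1}
bad_row = {"order_id": "o2", "payment_type": "not_defined",
           "payment_value": 0, "payment_installments": 0}
print(row_violations(good_row))
print(row_violations(bad_row))
```

Note that a row with zero installments is rejected here, matching the `>= 1` requirement rather than a `>= 0` check.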

Output Schema:

- order_id: string (foreign key, can repeat)
- payment_sequential: integer
- payment_type: string (validated enum)
- payment_installments: integer (>= 1)
- payment_value: decimal(10,2)

In [0]:
from pyspark.sql import functions as F

In [0]:
bronze_df = spark.read.table("golden_360.bronze.order_payment")


bronze_df.show()

In [0]:
silver_1 = (
    bronze_df
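    # Fixed-point decimal avoids float precision issues in monetary values
    .withColumn("payment_value", F.col("payment_value").cast("decimal(10,2)"))
    # Installment counts are whole numbers
    .withColumn("payment_installments", F.col("payment_installments").cast("int"))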
)
silver_1.show()

In [0]:
# Allowed payment types; any other value is treated as invalid
valid_payment_types = ['credit_card', 'boleto', 'voucher', 'debit_card']

silver_2 = silver_1.filter(F.col("payment_type").isin(valid_payment_types))

silver_2.show()

In [0]:
# Requirements: payment_value > 0 and payment_installments >= 1
silver_3 = silver_2.where("payment_value > 0").where("payment_installments >= 1")
silver_3.show()

In [0]:
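# Drop rows with nulls in any required column; no deduplication, since orders can have multiple payments
silver_final = silver_3.dropna(subset=["order_id", "payment_type", "payment_value", "payment_installments"])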


In [0]:
silver_final.write.format("delta").mode("overwrite").saveAsTable("golden_360.silver.order_payments")
