Data Quality Issues to Fix:

- Multiple reviews per order
- Null review comments
- Review scores out of range
- Invalid data types

Requirements:

- Deduplicate: One review per order_id (keep most recent by review_creation_date)
- Handle nulls in review_comment_message: Replace with "No comment provided"
- Handle nulls in review_comment_title: Replace with "No title provided"
- Cast review_score to integer type
- Validate review_score is between 1 and 5 (inclusive)
- Convert review_creation_date and review_answer_timestamp to timestamp format
- Validate: No null values in review_id, order_id, review_score
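
The requirements above can be sketched as a plain-Python reference that works on a list of dicts. This is only an illustration for checking the Spark pipeline against a small sample, not part of the pipeline itself. The helper names (to_int, to_ts, clean_reviews) and the timestamp format are assumptions, and Spark's own casting rules may differ on edge cases such as "4.0".

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed input format, adjust to the real data


def to_int(value):
    """Return value as an int, or None if it is missing or not an integer."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def to_ts(value):
    """Return value as a datetime, or None if it is missing or unparseable."""
    try:
        return datetime.strptime(value, TS_FORMAT)
    except (TypeError, ValueError):
        return None


def clean_reviews(rows):
    """Apply the listed rules and return one cleaned review per order_id."""
    newest = {}
    for row in rows:
        score = to_int(row.get("review_score"))
        # Required columns must be present; the score must be within 1..5.
        if row.get("review_id") is None or row.get("order_id") is None:
            continue
        if score is None or not 1 <= score <= 5:
            continue
        title = row.get("review_comment_title")
        message = row.get("review_comment_message")
        cleaned = {
            "review_id": row["review_id"],
            "order_id": row["order_id"],
            "review_score": score,
            # Only nulls are replaced; empty strings are kept as they are.
            "review_comment_title": "No title provided" if title is None else title,
            "review_comment_message": "No comment provided" if message is None else message,
            "review_creation_date": to_ts(row.get("review_creation_date")),
            "review_answer_timestamp": to_ts(row.get("review_answer_timestamp")),
        }
        # Keep the most recent review per order; a missing date counts as oldest.
        current = newest.get(cleaned["order_id"])
        new_key = cleaned["review_creation_date"] or datetime.min
        if current is None or new_key > (current["review_creation_date"] or datetime.min):
            newest[cleaned["order_id"]] = cleaned
    return list(newest.values())
```

Running clean_reviews on a handful of hand-written rows and comparing the result with the silver table for the same order_ids is one way to spot a rule that was applied in the wrong order.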

Output Schema:

- review_id: string (primary key)
- order_id: string (foreign key)
- review_score: integer (1-5)
- review_comment_title: string (no nulls)
- review_comment_message: string (no nulls)
- review_creation_date: timestamp
- review_answer_timestamp: timestamp (nullable)
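
As a rough sketch, the schema above can be written down as a dict and compared with the (name, type) pairs that a PySpark DataFrame exposes through its dtypes attribute. EXPECTED_DTYPES and schema_mismatches are hypothetical names introduced for illustration; the type strings assume Spark's short names ("int", "string", "timestamp").

```python
# Expected column types, using the short type names reported by DataFrame.dtypes.
EXPECTED_DTYPES = {
    "review_id": "string",
    "order_id": "string",
    "review_score": "int",
    "review_comment_title": "string",
    "review_comment_message": "string",
    "review_creation_date": "timestamp",
    "review_answer_timestamp": "timestamp",
}


def schema_mismatches(dtypes):
    """Compare a list of (column, type) pairs with EXPECTED_DTYPES.

    Returns a list of human-readable problems; an empty list means a match.
    """
    actual = dict(dtypes)
    problems = []
    for name, expected in EXPECTED_DTYPES.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != expected:
            problems.append(f"{name}: expected {expected}, got {actual[name]}")
    for name in actual:
        if name not in EXPECTED_DTYPES:
            problems.append(f"unexpected column: {name}")
    return problems


# Intended use in the notebook, before writing the silver table:
#   problems = schema_mismatches(silver_5.dtypes)
#   assert not problems, problems
```

This only checks column names and types. Nullability and the 1-5 score range still need separate checks on the data.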

In [0]:
from pyspark.sql import functions as F

In [0]:
bronze_df = spark.read.table("golden_360.bronze.order_review")

bronze_df.count()

In [0]:
# Cast the score to an integer and both date columns to timestamps.
silver_1 = (
    bronze_df
    .withColumn("review_score", F.col("review_score").cast("int"))
    .withColumn("review_creation_date", F.col("review_creation_date").cast("timestamp"))
    .withColumn("review_answer_timestamp", F.col("review_answer_timestamp").cast("timestamp"))
)
silver_1.count()

In [0]:
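# Keep only scores in the range 1-5 (inclusive).
# Rows with a null review_score are dropped here as well, because the comparison evaluates to null.
silver_2 = silver_1.where(F.col("review_score").between(1, 5))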

silver_2.count()

In [0]:
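# Replace null titles and messages with placeholder text (empty strings are left unchanged).
silver_3 = silver_2.fillna("No title provided", subset=["review_comment_title"])
silver_4 = silver_3.fillna("No comment provided", subset=["review_comment_message"])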

In [0]:
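from pyspark.sql import Window

# Drop rows with null required columns first, so an invalid row cannot win the
# ranking and then remove the order's only valid review.
valid_df = silver_4.dropna(subset=["order_id", "review_id", "review_score"])

# Rank reviews within each order, newest first. Ties on review_creation_date
# are broken arbitrarily by row_number().
window_spec = Window.partitionBy("order_id").orderBy(F.col("review_creation_date").desc())

deduplicated_df = (
    valid_df
    .withColumn("rank", F.row_number().over(window_spec))
    .filter(F.col("rank") == 1)
    .drop("rank")
)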

In [0]:
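# Final guard: drop any row that still has a null in a required column.
silver_5 = deduplicated_df.dropna(subset=["order_id", "review_id", "review_score"])

# Write the cleaned reviews to the silver layer, replacing any existing table.
silver_5.write.format("delta").mode("overwrite").saveAsTable("golden_360.silver.order_reviews")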