**Initialize the SparkSession**

In [0]:
from pyspark.sql import SparkSession

# Reuse the active session if one exists, otherwise create a new one
spark = SparkSession.builder \
    .appName("Customer_orders Analysis") \
    .getOrCreate()
spark

**Load cleaned order data**

In [0]:
# Load the data, reading the header row and inferring column types
df = spark.read.option("header", True).option("inferSchema", True) \
    .csv("file:/Workspace/Shared/customer_orders.csv")
df.show()
df.printSchema()

+-----------+--------+----------+-------------+----------------------+----------+----------+---------+--------------------+-------------+-------+
|customer_id|order_id|order_date|delivery_date|expected_delivery_date|delay_days|    status|     name|               email|       region|delayed|
+-----------+--------+----------+-------------+----------------------+----------+----------+---------+--------------------+-------------+-------+
|         33|    1001|2025-05-22|   2025-05-25|            2025-05-27|         0| Delivered|   Ishaan|    ishaan@gmail.com|   West-North|  false|
|         35|    1002|2025-05-18|   2025-05-27|            2025-05-23|         4|In Transit|    Kiara|     kiara@gmail.com|   North-East|   true|
|        100|    1003|2025-05-01|   2025-05-05|            2025-05-06|         0| Delivered| Reyanshi|  reyanshi@gmail.com|   North-East|   true|
|         21|    1004|2025-05-24|   2025-05-25|            2025-05-29|         0| Delivered|  Reyansh|   reyansh@gmail.com| 

**Pipeline to Update Latest Delivery Status**

In [0]:
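from pyspark.sql.functions import when, col, current_date, row_number
from pyspark.sql.window import Window

# Derive delivery_status: a recorded delivery_date wins; otherwise an order
# past its expected_delivery_date is Delayed, and anything else is Pending.
status = df.withColumn(
    "delivery_status",
    when(col("delivery_date").isNotNull(), "Delivered")
    .when(col("delivery_date").isNull() & (col("expected_delivery_date") < current_date()), "Delayed")
    .otherwise("Pending")
)

# Keep one row per order_id: the one with the most recent delivery_date
# (descending order places null delivery dates last).
w = Window.partitionBy("order_id").orderBy(col("delivery_date").desc())
latest = status.withColumn("row_num", row_number().over(w)) \
    .filter(col("row_num") == 1) \
    .drop("row_num")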

print("Latest delivery status per order:")
latest.show(5)

Latest delivery status per order:
+-----------+--------+----------+-------------+----------------------+----------+----------+--------+------------------+------------+-------+---------------+
|customer_id|order_id|order_date|delivery_date|expected_delivery_date|delay_days|    status|    name|             email|      region|delayed|delivery_status|
+-----------+--------+----------+-------------+----------------------+----------+----------+--------+------------------+------------+-------+---------------+
|         33|    1001|2025-05-22|   2025-05-25|            2025-05-27|         0| Delivered|  Ishaan|  ishaan@gmail.com|  West-North|  false|      Delivered|
|         35|    1002|2025-05-18|   2025-05-27|            2025-05-23|         4|In Transit|   Kiara|   kiara@gmail.com|  North-East|   true|      Delivered|
|        100|    1003|2025-05-01|   2025-05-05|            2025-05-06|         0| Delivered|Reyanshi|reyanshi@gmail.com|  North-East|   true|      Delivered|
|         21|    1
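
The branch order in the `when`/`otherwise` chain explains why order 1002 is labelled Delivered even though its `status` column says In Transit: a non-null `delivery_date` matches the first branch. The plain-Python sketch below mirrors that rule for illustration only. The function name `classify` and the fixed `today` value are hypothetical and not part of the notebook.

```python
from datetime import date

def classify(delivery_date, expected_delivery_date, today):
    """Mirror the when/otherwise chain: the first matching branch wins."""
    if delivery_date is not None:
        return "Delivered"
    # A missing expected date never matches, so the row falls through to Pending
    if expected_delivery_date is not None and expected_delivery_date < today:
        return "Delayed"
    return "Pending"

today = date(2025, 6, 1)  # hypothetical stand-in for current_date()

# Order 1002 from the output above: delivery_date is present
print(classify(date(2025, 5, 27), date(2025, 5, 23), today))  # Delivered

# Hypothetical rows without a delivery_date
print(classify(None, date(2025, 5, 23), today))  # Delayed
print(classify(None, date(2025, 6, 10), today))  # Pending
```

Under this rule, a row can only become Delayed when `delivery_date` is null.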

**Save the Result as CSV**

In [0]:
latest.write.mode("overwrite").option("header", True).csv("file:/Workspace/Shared/latest_delivery_status")

**SQL Query**

In [0]:
# Register the deduplicated result as a temp view
latest.createOrReplaceTempView("latest_orders")
# Show the top 5 customers by number of delayed orders
top_delayed_customers = spark.sql("""
    SELECT customer_id, COUNT(*) AS delayed_orders
    FROM latest_orders
    WHERE delivery_status = 'Delayed'
    GROUP BY customer_id
    ORDER BY delayed_orders DESC
    LIMIT 5
""")
print("Top 5 delayed customers:")
top_delayed_customers.show()

Top 5 delayed customers:
+-----------+--------------+
|customer_id|delayed_orders|
+-----------+--------------+
+-----------+--------------+
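
The empty result is consistent with the rows shown earlier: each visible row has a `delivery_date`, so it is labelled Delivered and none reaches `'Delayed'`. If the intent were instead to rank customers by the existing `delayed` flag, the counting logic might look like the plain-Python sketch below. This is an assumption about intent rather than the notebook's method, and it uses only the three complete rows visible in the output.

```python
from collections import Counter

# (customer_id, delayed) pairs copied from the three complete rows shown earlier
rows = [(33, False), (35, True), (100, True)]

# Count flagged orders per customer, then take the five largest counts
delayed_counts = Counter(cid for cid, delayed in rows if delayed)
top_delayed = delayed_counts.most_common(5)
print(top_delayed)  # [(35, 1), (100, 1)]
```

The equivalent change to the SQL above would be to filter on the `delayed` column instead of `delivery_status`.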

