# Capstone Project: Supply Chain Monitoring and Optimization Platform

# Week 3 – Intro to PySpark: Processing Big Data

Tools: PySpark

Capstone Tasks:

    1. Load order data from CSV using PySpark.
    2. Filter delayed shipments.
    3. Group by supplier and count delayed orders.
    4. Save processed data to CSV or Parquet.

In [25]:
# Import required PySpark modules
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col, datediff, current_date

In [26]:
# Create Spark session
spark = SparkSession.builder.appName("week-3").getOrCreate()
spark

In [27]:
# 1. Load order data from CSV using PySpark.

# Upload CSV files from local to Colab
from google.colab import files
uploaded = files.upload()

Saving inventory.csv to inventory (1).csv
Saving orders.csv to orders (1).csv
Saving suppliers.csv to suppliers (1).csv


In [28]:
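# Load CSV files into DataFrames.
# Note: the upload above saved the files as "orders (1).csv" etc. because files with
# those names already existed, so these reads use the earlier uploads.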
orders_df = spark.read.csv("orders.csv", header=True, inferSchema=True)
suppliers_df = spark.read.csv("suppliers.csv", header=True, inferSchema=True)
inventory_df = spark.read.csv("inventory.csv", header=True, inferSchema=True)

In [29]:
# Display Orders Schema
print("Orders Schema:")
orders_df.printSchema()

Orders Schema:
root
 |-- order_id: integer (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- supplier_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- order_Date: date (nullable = true)
 |-- delivery_date: date (nullable = true)
 |-- status: string (nullable = true)



In [30]:
# Display Suppliers Schema
print("Suppliers Schema:")
suppliers_df.printSchema()

Suppliers Schema:
root
 |-- supplier_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- contact_info: string (nullable = true)
 |-- location: string (nullable = true)



In [31]:
# Display Inventory Schema
print("Inventory Schema:")
inventory_df.printSchema()

Inventory Schema:
root
 |-- product_id: integer (nullable = true)
 |-- product_name: string (nullable = true)
 |-- quantity_in_stock: integer (nullable = true)
 |-- reorder_level: integer (nullable = true)



In [32]:
# delivery_date is already inferred as a date (see the Orders schema), so this cast is a no-op kept as a safeguard
df = orders_df.withColumn("delivery_date", col("delivery_date").cast("date"))

# delay_days = days elapsed from delivery_date to today (positive once the date has passed)
df = df.withColumn("delay_days", datediff(current_date(), col("delivery_date")))
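
The `delay_days` arithmetic can be checked by hand with plain Python dates. The sketch below does that and also shows one possible stricter rule that only counts orders still open. The run date, the function names, and the choice of "open" statuses are assumptions made for illustration; they are not part of the original notebook.

```python
from datetime import date


def delay_days(delivery_date, as_of):
    """Days elapsed from delivery_date to as_of (negative if still in the future)."""
    return (as_of - delivery_date).days


def is_delayed(delivery_date, status, as_of):
    """Assumed stricter rule: the date has passed and the order is still open."""
    open_statuses = {"Pending", "Shipped"}  # assumption: these count as not yet delivered
    return delay_days(delivery_date, as_of) > 0 and status in open_statuses


run_date = date(2025, 8, 1)  # assumed run date, for illustration only

print(delay_days(date(2025, 7, 7), run_date))                # 25
print(is_delayed(date(2025, 7, 7), "Shipped", run_date))     # True
print(is_delayed(date(2025, 7, 11), "Delivered", run_date))  # False
```

In PySpark, a comparable rule could be written as a filter that combines `col("delay_days") > 0` with `col("status").isin("Pending", "Shipped")`. Whether delivered or cancelled orders should count as delayed depends on how the project defines a delay.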

In [33]:
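# 2. Filter delayed shipments: delay_days > 0 means delivery_date has already passed.
# Note: status is not checked here, so Delivered and Cancelled orders are included as well.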
delayed_df = df.filter(col("delay_days") > 0)
delayed_df.show()

+--------+----------+-----------+--------+----------+-------------+---------+----------+
|order_id|product_id|supplier_id|quantity|order_Date|delivery_date|   status|delay_days|
+--------+----------+-----------+--------+----------+-------------+---------+----------+
|       1|         3|          1|      40|2025-07-03|   2025-07-07|  Shipped|        25|
|       2|         7|          4|      15|2025-07-06|   2025-07-11|Delivered|        21|
|       3|         2|          2|      25|2025-07-01|   2025-07-06|  Pending|        26|
|       4|        10|          5|      10|2025-07-10|   2025-07-14|Cancelled|        18|
|       5|         1|          3|      50|2025-07-02|   2025-07-07|Delivered|        25|
|       6|         8|          8|      20|2025-07-07|   2025-07-12|  Pending|        20|
|       7|         5|          6|      35|2025-07-05|   2025-07-09|Delivered|        23|
|       8|         9|         10|      60|2025-07-08|   2025-07-13|  Shipped|        19|
|       9|         6|

In [34]:
# 3. Group by supplier and count delayed orders
grouped_df = delayed_df.groupBy("supplier_id").count().withColumnRenamed("count", "delayed_orders_count")

# Show grouped result
grouped_df.show()

+-----------+--------------------+
|supplier_id|delayed_orders_count|
+-----------+--------------------+
|          1|                   1|
|          6|                   1|
|          3|                   1|
|          5|                   1|
|          9|                   1|
|          4|                   1|
|          8|                   1|
|          7|                   1|
|         10|                   1|
|          2|                   1|
+-----------+--------------------+



Deliverables:

    1. PySpark script to load, process, and save supply chain data

    2. Output file showing grouped results

In [35]:
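# 4. Save grouped result to CSV in output directory.
# coalesce(1) writes a single part file, so the header row appears only once;
# this is fine here because the grouped result is only a handful of rows.
grouped_df.coalesce(1).write.mode("overwrite").csv("output/delayed_orders_by_supplier_csv", header=True)

In [36]:
# Copy the single part file to one conventionally named CSV
!cat output/delayed_orders_by_supplier_csv/part-*.csv > delayed_orders_by_supplier.csv

# Download the final output file to the local system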
from google.colab import files
files.download("delayed_orders_by_supplier.csv")
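
When a write produces several part files with `header=True`, each part starts with its own header row, so concatenating them directly repeats the header in the merged file. Below is a minimal sketch of a merge in plain Python that keeps only the first header. `merge_part_files` is a hypothetical helper, not part of the original notebook, and it assumes every non-empty part file begins with the same header line.

```python
import glob
import os


def merge_part_files(part_dir, merged_path):
    """Concatenate Spark CSV part files, keeping only the first header row.

    Hypothetical helper; assumes each non-empty part file starts with the
    same header line. Returns the number of part files found.
    """
    part_paths = sorted(glob.glob(os.path.join(part_dir, "part-*.csv")))
    header_written = False
    with open(merged_path, "w", newline="") as merged:
        for part_path in part_paths:
            with open(part_path, newline="") as part:
                header = part.readline()
                if not header:
                    continue  # skip empty part files
                if not header_written:
                    merged.write(header)
                    header_written = True
                for line in part:  # remaining lines are data rows
                    merged.write(line)
    return len(part_paths)


# Example call, using the paths from this notebook:
# merge_part_files("output/delayed_orders_by_supplier_csv",
#                  "delayed_orders_by_supplier.csv")
```

For a small result like this one, a single part file makes the merge step unnecessary. The helper is useful mainly when the output is large enough to stay split across partitions.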
