**Week 3 - PySpark for Warehouse-Level Insights**

**Tools: PySpark**

**Capstone Tasks:**

- Load large stock movement data using PySpark
- Aggregate total stock per warehouse
- Identify warehouses with overstocked or understocked items

**Deliverables:**

- PySpark script with group and filter logic
- Output file with warehouse-level stock status

Load large stock movement data using PySpark

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("WarehouseInsights").getOrCreate()

In [2]:
from google.colab import files
uploaded = files.upload()

Saving products_Cleaned.csv to products_Cleaned.csv
Saving stock_movements_Cleaned.csv to stock_movements_Cleaned.csv
Saving warehouses_Cleaned.csv to warehouses_Cleaned.csv


In [3]:
products_df = spark.read.csv("products_Cleaned.csv", header=True, inferSchema=True)
warehouses_df = spark.read.csv("warehouses_Cleaned.csv", header=True, inferSchema=True)
stock_movements_df = spark.read.csv("stock_movements_Cleaned.csv", header=True, inferSchema=True)

Aggregate total stock per warehouse

In [4]:
stock_movement_per_warehouse_df = stock_movements_df.groupBy("warehouse_id").agg(F.sum("stock_effect").alias("total_stock"))
# Attach location and capacity to each warehouse total
stock_movement_per_warehouse_df = stock_movement_per_warehouse_df.join(
    warehouses_df.select("warehouse_id", "location", "capacity"), on="warehouse_id", how="left")
stock_movement_per_warehouse_df.show()

+------------+-----------+----------+--------+
|warehouse_id|total_stock|  location|capacity|
+------------+-----------+----------+--------+
|      2010.0|          8|    Indore|     800|
|      2001.0|         -4| Bangalore|    1500|
|      2009.0|         24|Coimbatore|     900|
|      2007.0|          5|     Patna|    1000|
|      2000.0|         90|   Chennai|    2000|
|      2008.0|         34|     Noida|    1100|
|      2011.0|          2|    Jaipur|     600|
|      2002.0|          5|     Delhi|    1800|
|      2006.0|         31| Ahmedabad|    1300|
|      2004.0|         32|    Mumbai|    1600|
|      2013.0|         18|Trivandrum|     350|
|      2012.0|         18|  Guwahati|     450|
|      2014.0|         12|      Pune|     650|
|      2005.0|         35|   Kolkata|    1700|
|      2003.0|         45| Hyderabad|    1200|
+------------+-----------+----------+--------+



Identify warehouses with overstocked or understocked items

In [5]:
stock_per_warehouse_product_df = stock_movements_df.groupBy("warehouse_id", "product_id") \
    .agg(F.sum("stock_effect").alias("current_stock"))

In [6]:
join_df = stock_per_warehouse_product_df.join(products_df, on="product_id", how="left") \
    .join(warehouses_df.select("warehouse_id", "location"), on="warehouse_id", how="left")

In [7]:
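# understocked: stock is below the reorder level
# overstocked: stock is more than twice the reorder level
stock_status_df = join_df.withColumn(
    "understocked", F.col("current_stock") < F.col("reorder_level")
).withColumn(
    "overstocked", F.col("current_stock") > (F.col("reorder_level") * 2)
)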
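The two flags encode a simple threshold rule, and it can help to check it by hand against a few rows from the outputs in this notebook. The sketch below restates the rule in plain Python. The function name classify_stock and the "normal" label for rows that trip neither flag are assumptions added for illustration.

```python
def classify_stock(current_stock, reorder_level):
    """Apply the notebook's thresholds to a single (stock, reorder level) pair."""
    if current_stock < reorder_level:
        return "understocked"
    if current_stock > reorder_level * 2:
        return "overstocked"
    return "normal"  # hypothetical label for rows where both flags are false

print(classify_stock(30, 10))  # LED TV, Chennai -> overstocked
print(classify_stock(15, 5))   # Wooden Bookshelf, Chennai -> overstocked
print(classify_stock(-5, 10))  # LED TV, Delhi -> understocked
print(classify_stock(20, 10))  # exactly twice the reorder level -> normal
```

Both comparisons are strict, so stock equal to the reorder level, or equal to exactly twice it, sets neither flag.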

Warehouses with overstocked items

In [8]:
stock_status_df.filter(stock_status_df.overstocked == True).show()

+------------+----------+-------------+---+----------------+--------------------+-----------+----------+-------------+------+-----------+-------------------+-------------------+--------+------------+-----------+
|warehouse_id|product_id|current_stock|_c0|            name|         description|   category|unit_price|reorder_level|status|supplier_id|         created_at|       last_updated|location|understocked|overstocked|
+------------+----------+-------------+---+----------------+--------------------+-----------+----------+-------------+------+-----------+-------------------+-------------------+--------+------------+-----------+
|      2000.0|       1.0|           30|  0|          LED TV|    42 inch Smart TV|Electronics|     35000|           10|active|     1000.0|2024-01-08 10:00:00|2024-05-08 09:30:00| Chennai|       false|       true|
|      2000.0|       3.0|           15|  2|Wooden Bookshelf|5-tier sturdy boo...|  Furniture|      8000|            5|active|     1002.0|2024-02-08 09:1

Warehouses with understocked items

In [9]:
stock_status_df.filter(stock_status_df.understocked == True).show()

+------------+----------+-------------+---+--------------------+--------------------+-----------+----------+-------------+------+-----------+-------------------+-------------------+----------+------------+-----------+
|warehouse_id|product_id|current_stock|_c0|                name|         description|   category|unit_price|reorder_level|status|supplier_id|         created_at|       last_updated|  location|understocked|overstocked|
+------------+----------+-------------+---+--------------------+--------------------+-----------+----------+-------------+------+-----------+-------------------+-------------------+----------+------------+-----------+
|      2002.0|       1.0|           -5|  0|              LED TV|    42 inch Smart TV|Electronics|     35000|           10|active|     1000.0|2024-01-08 10:00:00|2024-05-08 09:30:00|     Delhi|        true|      false|
|      2001.0|       2.0|           -8|  1|   Organic Green Tea|Imported organic ...|    Grocery|      1200|           25|active

Saving warehouse stock status

In [10]:
stock_status_df.toPandas().to_csv("warehouse_stock_status.csv", index=False)

Downloading the file

In [11]:
files.download("warehouse_stock_status.csv")
