PySpark Assignment – Product Sales Analysis

Part 1: Environment Setup

1. Install Spark + Java in Google Colab.

2. Initialize Spark with app name "ProductSalesAnalysis".

In [59]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Reuse the active session if one exists; otherwise start a new one.
spark = SparkSession.builder.appName("ProductSalesAnalysis").getOrCreate()

In [60]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Part 2: Load Sales Data from CSV

Read the file into a PySpark DataFrame with header and inferred schema.

Print schema and show top 5 rows.

In [61]:
csv_data="""OrderID,Product,Category,Quantity,UnitPrice,Region
1001,Mobile,Electronics,2,15000,North
1002,Laptop,Electronics,1,55000,South
1003,T-Shirt,Apparel,3,500,East
1004,Jeans,Apparel,2,1200,North
1005,TV,Electronics,1,40000,West
1006,Shoes,Footwear,4,2000,South
1007,Watch,Accessories,2,3000,East
1008,Headphones,Electronics,3,2500,North"""

# Use one path for both writing and reading so the two cannot drift apart.
sales_path = "/content/drive/MyDrive/Colab Notebooks/sales_data.csv"

with open(sales_path, "w") as f:
    f.write(csv_data)

df = spark.read.csv(sales_path, header=True, inferSchema=True)

df.show(5)

+-------+-------+-----------+--------+---------+------+
|OrderID|Product|   Category|Quantity|UnitPrice|Region|
+-------+-------+-----------+--------+---------+------+
|   1001| Mobile|Electronics|       2|    15000| North|
|   1002| Laptop|Electronics|       1|    55000| South|
|   1003|T-Shirt|    Apparel|       3|      500|  East|
|   1004|  Jeans|    Apparel|       2|     1200| North|
|   1005|     TV|Electronics|       1|    40000|  West|
+-------+-------+-----------+--------+---------+------+
only showing top 5 rows
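
The task also asks for the schema. In PySpark that is printed with df.printSchema(), which the cell above does not call. As a rough illustration of what inferSchema=True has to decide, the sketch below classifies each column as integer or string in plain Python. It is a simplified analogy: infer_type is a hypothetical helper, and Spark's real inference handles more types than these two.

```python
import csv
import io

# Same rows as the sales_data.csv written above.
CSV_TEXT = """OrderID,Product,Category,Quantity,UnitPrice,Region
1001,Mobile,Electronics,2,15000,North
1002,Laptop,Electronics,1,55000,South
1003,T-Shirt,Apparel,3,500,East
1004,Jeans,Apparel,2,1200,North
1005,TV,Electronics,1,40000,West
1006,Shoes,Footwear,4,2000,South
1007,Watch,Accessories,2,3000,East
1008,Headphones,Electronics,3,2500,North"""


def infer_type(values):
    """Hypothetical helper: 'integer' if every value parses as int, else 'string'."""
    try:
        for value in values:
            int(value)
        return "integer"
    except ValueError:
        return "string"


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
schema = {name: infer_type([row[name] for row in rows]) for name in rows[0]}

for name, kind in schema.items():
    print(f"{name}: {kind}")
```

Comparing this against the df.printSchema() output in Colab is a quick way to confirm that the numeric columns were not read as strings.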



Part 3: Business Questions

1. Add a new column TotalPrice = Quantity × UnitPrice


In [62]:
df = df.withColumn("TotalPrice", F.col("Quantity") * F.col("UnitPrice"))
df.show()

+-------+----------+-----------+--------+---------+------+----------+
|OrderID|   Product|   Category|Quantity|UnitPrice|Region|TotalPrice|
+-------+----------+-----------+--------+---------+------+----------+
|   1001|    Mobile|Electronics|       2|    15000| North|     30000|
|   1002|    Laptop|Electronics|       1|    55000| South|     55000|
|   1003|   T-Shirt|    Apparel|       3|      500|  East|      1500|
|   1004|     Jeans|    Apparel|       2|     1200| North|      2400|
|   1005|        TV|Electronics|       1|    40000|  West|     40000|
|   1006|     Shoes|   Footwear|       4|     2000| South|      8000|
|   1007|     Watch|Accessories|       2|     3000|  East|      6000|
|   1008|Headphones|Electronics|       3|     2500| North|      7500|
+-------+----------+-----------+--------+---------+------+----------+



2. Total revenue generated across all regions.


In [63]:
total_rev = df.agg(F.sum("TotalPrice")).collect()[0][0]
print("Total Revenue Generated",total_rev)

Total Revenue Generated 150400
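
A small plain-Python cross-check of this figure can be useful when validating the Spark result by hand. The sketch below recomputes the total from the eight sample rows; the orders list is copied from the sample data and is not read from the DataFrame.

```python
# (OrderID, Quantity, UnitPrice) copied from the sample data.
orders = [
    (1001, 2, 15000),
    (1002, 1, 55000),
    (1003, 3, 500),
    (1004, 2, 1200),
    (1005, 1, 40000),
    (1006, 4, 2000),
    (1007, 2, 3000),
    (1008, 3, 2500),
]

# TotalPrice per order is Quantity * UnitPrice; revenue is the sum of those.
total_revenue = sum(qty * price for _, qty, price in orders)
print("Total Revenue Generated", total_revenue)
```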


3. Category-wise revenue sorted in descending order.

In [77]:
(
    df.groupBy("Category")
    .agg(F.sum("TotalPrice").alias("TotalRevenue"))
    .orderBy(F.desc("TotalRevenue"))
    .show()
)

+-----------+------------+
|   Category|TotalRevenue|
+-----------+------------+
|Electronics|      132500|
|   Footwear|        8000|
|Accessories|        6000|
|    Apparel|        3900|
+-----------+------------+
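
The group-then-sum-then-sort pattern used here can be mirrored in plain Python to check the ranking by hand. This is only an illustrative sketch over the sample rows, not a replacement for the Spark query.

```python
from collections import defaultdict

# (Category, Quantity, UnitPrice) copied from the sample data.
orders = [
    ("Electronics", 2, 15000),
    ("Electronics", 1, 55000),
    ("Apparel", 3, 500),
    ("Apparel", 2, 1200),
    ("Electronics", 1, 40000),
    ("Footwear", 4, 2000),
    ("Accessories", 2, 3000),
    ("Electronics", 3, 2500),
]

# Accumulate revenue per category (the groupBy + sum step).
revenue = defaultdict(int)
for category, qty, price in orders:
    revenue[category] += qty * price

# Sort by revenue, highest first (the orderBy desc step).
ranked = sorted(revenue.items(), key=lambda item: item[1], reverse=True)
for category, total in ranked:
    print(category, total)
```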



4. Region with the highest number of orders

In [76]:
(
    df.groupBy("Region")
    .agg(F.count("OrderID").alias("OrderCount"))
    .orderBy(F.desc("OrderCount"))
    .limit(1)
    .show()
)

+------+----------+
|Region|OrderCount|
+------+----------+
| North|         3|
+------+----------+
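
One caveat with this approach: orderBy(...).limit(1) always returns a single row, so a tie for first place would be hidden. The sample data has no tie, but the plain-Python sketch below shows one way to report every region that shares the top count.

```python
from collections import Counter

# Region of each order, in OrderID order, copied from the sample data.
regions = ["North", "South", "East", "North", "West", "South", "East", "North"]

counts = Counter(regions)
top_count = max(counts.values())

# Keep every region whose count equals the maximum, not just the first one.
top_regions = sorted(region for region, n in counts.items() if n == top_count)
print(top_regions, top_count)
```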



5. Average Unit Price per Category

In [75]:
(
    df.groupBy("Category")
    .agg(F.avg("UnitPrice").alias("AverageUnitPriceByCategory"))
    .orderBy(F.desc("AverageUnitPriceByCategory"))
    .show()
)

+-----------+--------------------------+
|   Category|AverageUnitPriceByCategory|
+-----------+--------------------------+
|Electronics|                   28125.0|
|Accessories|                    3000.0|
|   Footwear|                    2000.0|
|    Apparel|                     850.0|
+-----------+--------------------------+
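
Note that avg("UnitPrice") averages over order rows and ignores Quantity. The sketch below recomputes that per-order average in plain Python and, for contrast, a quantity-weighted price per unit sold. The weighted figure is an extra illustration and is not something the assignment asks for.

```python
from collections import defaultdict

# (Category, Quantity, UnitPrice) copied from the sample data.
orders = [
    ("Electronics", 2, 15000),
    ("Electronics", 1, 55000),
    ("Apparel", 3, 500),
    ("Apparel", 2, 1200),
    ("Electronics", 1, 40000),
    ("Footwear", 4, 2000),
    ("Accessories", 2, 3000),
    ("Electronics", 3, 2500),
]

prices = defaultdict(list)
revenue = defaultdict(int)
units = defaultdict(int)
for category, qty, price in orders:
    prices[category].append(price)
    revenue[category] += qty * price
    units[category] += qty

# Unweighted mean of UnitPrice over order rows (what the Spark query computes).
per_order_avg = {c: sum(p) / len(p) for c, p in prices.items()}

# Revenue divided by units sold (weights each price by its Quantity).
per_unit_avg = {c: revenue[c] / units[c] for c in revenue}

print(per_order_avg)
print(per_unit_avg)
```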



6. All orders where TotalPrice is more than 30,000

In [67]:
df.filter(F.col("TotalPrice")>30000).show()

+-------+-------+-----------+--------+---------+------+----------+
|OrderID|Product|   Category|Quantity|UnitPrice|Region|TotalPrice|
+-------+-------+-----------+--------+---------+------+----------+
|   1002| Laptop|Electronics|       1|    55000| South|     55000|
|   1005|     TV|Electronics|       1|    40000|  West|     40000|
+-------+-------+-----------+--------+---------+------+----------+
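
Order 1001 has a TotalPrice of exactly 30,000, so the strict comparison excludes it. The short sketch below, using totals copied from the TotalPrice column, shows how the result would differ if the threshold were inclusive.

```python
# OrderID -> TotalPrice, copied from the TotalPrice column above.
totals = {
    1001: 30000,
    1002: 55000,
    1003: 1500,
    1004: 2400,
    1005: 40000,
    1006: 8000,
    1007: 6000,
    1008: 7500,
}

# "More than 30,000" is a strict comparison.
strictly_above = [order_id for order_id, total in totals.items() if total > 30000]

# An inclusive threshold would also pick up the boundary order.
at_or_above = [order_id for order_id, total in totals.items() if total >= 30000]

print(strictly_above)
print(at_or_above)
```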



Part 4: Data Transformations

1. Create a new column HighValueOrder, which is "Yes" if TotalPrice > 20,000 and "No" otherwise.


In [68]:
df = df.withColumn("HighValueOrder", F.when(F.col("TotalPrice") > 20000, "Yes").otherwise("No"))
df.show()

+-------+----------+-----------+--------+---------+------+----------+--------------+
|OrderID|   Product|   Category|Quantity|UnitPrice|Region|TotalPrice|HighValueOrder|
+-------+----------+-----------+--------+---------+------+----------+--------------+
|   1001|    Mobile|Electronics|       2|    15000| North|     30000|           Yes|
|   1002|    Laptop|Electronics|       1|    55000| South|     55000|           Yes|
|   1003|   T-Shirt|    Apparel|       3|      500|  East|      1500|            No|
|   1004|     Jeans|    Apparel|       2|     1200| North|      2400|            No|
|   1005|        TV|Electronics|       1|    40000|  West|     40000|           Yes|
|   1006|     Shoes|   Footwear|       4|     2000| South|      8000|            No|
|   1007|     Watch|Accessories|       2|     3000|  East|      6000|            No|
|   1008|Headphones|Electronics|       3|     2500| North|      7500|            No|
+-------+----------+-----------+--------+---------+------+----------+--------------+

2. Filter and display all high-value orders in the North region.

In [73]:
df.filter((F.col("HighValueOrder")=="Yes") & (F.col("Region")=="North")).show()

+-------+-------+-----------+--------+---------+------+----------+--------------+
|OrderID|Product|   Category|Quantity|UnitPrice|Region|TotalPrice|HighValueOrder|
+-------+-------+-----------+--------+---------+------+----------+--------------+
|   1001| Mobile|Electronics|       2|    15000| North|     30000|           Yes|
+-------+-------+-----------+--------+---------+------+----------+--------------+



3. Count how many high-value orders exist per region.

In [72]:
(
    df.filter(F.col("HighValueOrder") == "Yes")
    .groupBy("Region")
    .agg(F.count("OrderID").alias("HighValueOrderCount"))
    .orderBy(F.desc("HighValueOrderCount"))
    .show()
)

+------+-------------------+
|Region|HighValueOrderCount|
+------+-------------------+
| South|                  1|
|  West|                  1|
| North|                  1|
+------+-------------------+
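
Because the filter runs before the grouping, regions with no high-value orders (East in this data) do not appear in the output at all. If a zero row per region is wanted, one option is to count conditionally instead of filtering first. The plain-Python sketch below illustrates the idea on the sample rows.

```python
# (Region, TotalPrice) copied from the transformed DataFrame above.
orders = [
    ("North", 30000),
    ("South", 55000),
    ("East", 1500),
    ("North", 2400),
    ("West", 40000),
    ("South", 8000),
    ("East", 6000),
    ("North", 7500),
]

# Start every region at zero so regions without high-value orders are kept.
high_value_counts = {region: 0 for region, _ in orders}
for region, total in orders:
    if total > 20000:  # same threshold as the HighValueOrder column
        high_value_counts[region] += 1

print(high_value_counts)
```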



Part 5: Save Results

Save the transformed DataFrame as a CSV file named high_value_orders.csv with
headers.

In [71]:
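# mode="overwrite" lets this cell be re-run without a "path already exists" error.
df.write.csv("/content/drive/MyDrive/Colab Notebooks/high_value_orders.csv", header=True, mode="overwrite")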
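
Spark's CSV writer creates a directory with the given name, holding one or more part-*.csv files plus a _SUCCESS marker, rather than a single file. If a single plain file is needed, one option is to copy the part file out afterwards. The sketch below is a minimal illustration: collect_single_csv is a hypothetical helper, and the demo runs against a simulated output directory with made-up file contents rather than real Spark output.

```python
import glob
import os
import shutil
import tempfile


def collect_single_csv(spark_output_dir, target_file):
    """Hypothetical helper: copy the only part-*.csv file to target_file."""
    parts = sorted(glob.glob(os.path.join(spark_output_dir, "part-*.csv")))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    shutil.copyfile(parts[0], target_file)
    return target_file


# Simulate the directory layout Spark produces (file name and rows are made up).
demo_root = tempfile.mkdtemp()
out_dir = os.path.join(demo_root, "high_value_orders.csv")
os.makedirs(out_dir)
with open(os.path.join(out_dir, "part-00000-demo.csv"), "w") as f:
    f.write("OrderID,HighValueOrder\n1001,Yes\n")
open(os.path.join(out_dir, "_SUCCESS"), "w").close()

single = collect_single_csv(out_dir, os.path.join(demo_root, "high_value_orders_single.csv"))
with open(single) as f:
    content = f.read()
print(content)
```

The helper raises an error when there is more than one part file. For a DataFrame with several partitions, calling df.coalesce(1) before the write is a common way to end up with a single part file.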