PySpark Assignment – Product Sales Analysis (Intermediate)

Part 1: Environment Setup
1. Install Spark + Java in Google Colab.
2. Initialize Spark with app name "ProductSalesAnalysis".
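
Step 1 has no cell in this notebook. As a minimal sketch, the check below reports whether a Java runtime and the pyspark package can be found before the session is created. The helper name check_spark_prerequisites is hypothetical, and the sketch assumes PySpark is installed as a regular Python package and that Java is exposed on PATH.

```python
import importlib.util
import shutil


def check_spark_prerequisites():
    """Report whether a Java runtime and the pyspark package can be found."""
    return {
        # shutil.which returns the executable's path, or None if it is not on PATH
        "java_on_path": shutil.which("java") is not None,
        # find_spec returns None when the package cannot be imported
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
    }


status = check_spark_prerequisites()
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'missing'}")
```

If either entry is reported missing, install it before running the next cell.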

In [37]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("ProductSalesAnalysis") \
        .getOrCreate()

spark

Part 2: Load Sales Data from CSV

Create and load the following CSV as sales.csv:

Read the file into a PySpark DataFrame with header and inferred schema.

Print schema and show top 5 rows.
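
The CSV contents are not reproduced in this notebook. The sketch below rebuilds the file from the eight rows that appear in the df.show() output in Part 3, so the values are taken from that output rather than from the original file. The helper name write_sales_csv and the local path "sales.csv" are assumptions; the notebook itself reads the file from a Drive folder.

```python
import csv

# Rows reconstructed from the df.show() output in Part 3 of this notebook.
HEADER = ["OrderID", "Product", "Category", "Quantity", "UnitPrice", "Region"]
SALES_ROWS = [
    (1001, "Mobile", "Electronics", 2, 15000, "North"),
    (1002, "Laptop", "Electronics", 1, 55000, "South"),
    (1003, "T-Shirt", "Apparel", 3, 500, "East"),
    (1004, "Jeans", "Apparel", 2, 1200, "North"),
    (1005, "TV", "Electronics", 1, 40000, "West"),
    (1006, "Shoes", "Footwear", 4, 2000, "South"),
    (1007, "Watch", "Accessories", 2, 3000, "East"),
    (1008, "Headphones", "Electronics", 3, 2500, "North"),
]


def write_sales_csv(path):
    """Write the header and the eight sales rows to `path` and return the path."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(SALES_ROWS)
    return path


# Hypothetical local path; point it at the Drive folder to match the cells below.
csv_path = write_sales_csv("sales.csv")
```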

In [38]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [39]:
path = "/content/drive/MyDrive/Hexware_Training_DataEngineering/sales.csv"

df = spark.read.csv(path, header=True, inferSchema=True)
df.printSchema()
df.show(5)

root
 |-- OrderID: integer (nullable = true)
 |-- Product: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- UnitPrice: integer (nullable = true)
 |-- Region: string (nullable = true)

+-------+-------+-----------+--------+---------+------+
|OrderID|Product|   Category|Quantity|UnitPrice|Region|
+-------+-------+-----------+--------+---------+------+
|   1001| Mobile|Electronics|       2|    15000| North|
|   1002| Laptop|Electronics|       1|    55000| South|
|   1003|T-Shirt|    Apparel|       3|      500|  East|
|   1004|  Jeans|    Apparel|       2|     1200| North|
|   1005|     TV|Electronics|       1|    40000|  West|
+-------+-------+-----------+--------+---------+------+
only showing top 5 rows



Part 3: Business Questions
1. Add a new column TotalPrice = Quantity × UnitPrice
2. Total revenue generated across all regions.
3. Category-wise revenue sorted in descending order.
4. Region with the highest number of orders
5. Average Unit Price per Category
6. All orders where TotalPrice is more than 30,000

In [40]:
# 1
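# Explicit imports instead of *; note that this `sum` shadows Python's built-in sum().
from pyspark.sql.functions import avg, col, count, desc, sum, when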

df = df.withColumn("TotalPrice", col('Quantity') * col('UnitPrice'))
df.show()

+-------+----------+-----------+--------+---------+------+----------+
|OrderID|   Product|   Category|Quantity|UnitPrice|Region|TotalPrice|
+-------+----------+-----------+--------+---------+------+----------+
|   1001|    Mobile|Electronics|       2|    15000| North|     30000|
|   1002|    Laptop|Electronics|       1|    55000| South|     55000|
|   1003|   T-Shirt|    Apparel|       3|      500|  East|      1500|
|   1004|     Jeans|    Apparel|       2|     1200| North|      2400|
|   1005|        TV|Electronics|       1|    40000|  West|     40000|
|   1006|     Shoes|   Footwear|       4|     2000| South|      8000|
|   1007|     Watch|Accessories|       2|     3000|  East|      6000|
|   1008|Headphones|Electronics|       3|     2500| North|      7500|
+-------+----------+-----------+--------+---------+------+----------+



In [41]:
# 2 (revenue broken down by region)
df.groupBy("Region").agg(sum("TotalPrice").alias("TotalRevenue")).show()


+------+------------+
|Region|TotalRevenue|
+------+------------+
| South|       63000|
|  East|        7500|
|  West|       40000|
| North|       39900|
+------+------------+



In [42]:
# 3
df.groupBy("Category").agg(sum("TotalPrice").alias("TotalRevenue")).sort(desc("TotalRevenue")).show()

+-----------+------------+
|   Category|TotalRevenue|
+-----------+------------+
|Electronics|      132500|
|   Footwear|        8000|
|Accessories|        6000|
|    Apparel|        3900|
+-----------+------------+



In [43]:
# 4
df.groupBy('Region').agg(count('OrderID').alias('OrderCount')).sort(desc('OrderCount')).show(1)

+------+----------+
|Region|OrderCount|
+------+----------+
| North|         3|
+------+----------+
only showing top 1 row



In [44]:
# 5
df.groupBy('Category').agg(avg('UnitPrice').alias('AvgUnitPrice')).sort(desc('AvgUnitPrice')).show()

+-----------+------------+
|   Category|AvgUnitPrice|
+-----------+------------+
|Electronics|     28125.0|
|Accessories|      3000.0|
|   Footwear|      2000.0|
|    Apparel|       850.0|
+-----------+------------+



In [45]:
# 6
df.filter(col('TotalPrice') > 30000).show()

+-------+-------+-----------+--------+---------+------+----------+
|OrderID|Product|   Category|Quantity|UnitPrice|Region|TotalPrice|
+-------+-------+-----------+--------+---------+------+----------+
|   1002| Laptop|Electronics|       1|    55000| South|     55000|
|   1005|     TV|Electronics|       1|    40000|  West|     40000|
+-------+-------+-----------+--------+---------+------+----------+



Part 4: Data Transformations
1. Create a new column HighValueOrder which is "Yes" if TotalPrice > 20,000, else "No".
2. Filter and display all high-value orders in the North region.
3. Count how many high-value orders exist per region.

In [46]:
# 1

df = df.withColumn("HighValueOrder", when(col('TotalPrice') > 20000, "Yes").otherwise("No"))
df.show()

+-------+----------+-----------+--------+---------+------+----------+--------------+
|OrderID|   Product|   Category|Quantity|UnitPrice|Region|TotalPrice|HighValueOrder|
+-------+----------+-----------+--------+---------+------+----------+--------------+
|   1001|    Mobile|Electronics|       2|    15000| North|     30000|           Yes|
|   1002|    Laptop|Electronics|       1|    55000| South|     55000|           Yes|
|   1003|   T-Shirt|    Apparel|       3|      500|  East|      1500|            No|
|   1004|     Jeans|    Apparel|       2|     1200| North|      2400|            No|
|   1005|        TV|Electronics|       1|    40000|  West|     40000|           Yes|
|   1006|     Shoes|   Footwear|       4|     2000| South|      8000|            No|
|   1007|     Watch|Accessories|       2|     3000|  East|      6000|            No|
|   1008|Headphones|Electronics|       3|     2500| North|      7500|            No|
+-------+----------+-----------+--------+---------+------+----------+--------------+

In [49]:
# 2
df.filter((col('Region') == "North") & (col('HighValueOrder') == "Yes")).show()

+-------+-------+-----------+--------+---------+------+----------+--------------+
|OrderID|Product|   Category|Quantity|UnitPrice|Region|TotalPrice|HighValueOrder|
+-------+-------+-----------+--------+---------+------+----------+--------------+
|   1001| Mobile|Electronics|       2|    15000| North|     30000|           Yes|
+-------+-------+-----------+--------+---------+------+----------+--------------+



In [51]:
# 3
df.filter(col('HighValueOrder') == "Yes").groupBy('Region').agg(count('OrderID').alias('OrderCount')).show()

+------+----------+
|Region|OrderCount|
+------+----------+
| South|         1|
|  West|         1|
| North|         1|
+------+----------+



Part 5: Save Results

Save the transformed DataFrame as a CSV file named high_value_orders.csv with headers.

In [54]:
highValueDF = df.filter(col('HighValueOrder') == "Yes")
outpath = "/content/drive/MyDrive/Hexware_Training_DataEngineering/high_value_orders.csv"

# Spark writes a directory at outpath; coalesce(1) ensures it holds a single part file.
highValueDF.coalesce(1).write.csv(outpath, header=True, mode="overwrite")
print("File Saved Successfully")

File Saved Successfully
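
Spark's CSV writer creates a directory at the given path containing a part-*.csv file plus marker files, not a single flat file. If one plain CSV file is needed, a possible sketch is to copy the part file out afterwards. The helper promote_part_file and the target path in the usage comment are hypothetical.

```python
import glob
import os
import shutil


def promote_part_file(spark_output_dir, target_file):
    """Copy the single part-*.csv file out of a Spark output directory."""
    part_files = glob.glob(os.path.join(spark_output_dir, "part-*.csv"))
    # With coalesce(1) exactly one part file is expected; fail loudly otherwise.
    if len(part_files) != 1:
        raise ValueError(f"expected exactly one part file, found {len(part_files)}")
    shutil.copyfile(part_files[0], target_file)
    return target_file


# Hypothetical usage after the write above:
# promote_part_file(outpath, "/content/high_value_orders_single.csv")
```

For a result this small, converting to pandas and writing from there would be another option; the copy approach keeps the Spark-written file unchanged.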
