# PYSPARK DATA ANALYSIS

# Select • Filter • GroupBy • Aggregates • Window Functions

# DATASET
Domain: Multi-Store Retail Sales Analytics

# STEP 1 — CREATE THE DATASET

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Retail Sales Analysis").getOrCreate()

In [3]:
sales_data = [
("T001","North","Delhi","Store-01","Laptop","2024-01-01",75000),
("T002","North","Delhi","Store-01","Mobile","2024-01-02",32000),
("T003","North","Chandigarh","Store-02","Tablet","2024-01-03",26000),
("T004","South","Bangalore","Store-03","Laptop","2024-01-01",78000),
("T005","South","Chennai","Store-04","Mobile","2024-01-02",30000),
("T006","South","Bangalore","Store-03","Tablet","2024-01-03",24000),
("T007","East","Kolkata","Store-05","Laptop","2024-01-01",72000),
("T008","East","Kolkata","Store-05","Mobile","2024-01-02",28000),
("T009","East","Patna","Store-06","Tablet","2024-01-03",23000),
("T010","West","Mumbai","Store-07","Laptop","2024-01-01",80000),
("T011","West","Mumbai","Store-07","Mobile","2024-01-02",35000),
("T012","West","Pune","Store-08","Tablet","2024-01-03",27000),
("T013","North","Delhi","Store-01","Laptop","2024-01-04",76000),
("T014","South","Chennai","Store-04","Laptop","2024-01-04",79000),
("T015","East","Patna","Store-06","Mobile","2024-01-04",29000),
("T016","West","Pune","Store-08","Laptop","2024-01-04",77000),
("T017","North","Chandigarh","Store-02","Mobile","2024-01-05",31000),
("T018","South","Bangalore","Store-03","Mobile","2024-01-05",34000),
("T019","East","Kolkata","Store-05","Tablet","2024-01-05",25000),
("T020","West","Mumbai","Store-07","Tablet","2024-01-05",29000),
("T021","North","Delhi","Store-01","Tablet","2024-01-06",28000),
("T022","South","Chennai","Store-04","Tablet","2024-01-06",26000),
("T023","East","Patna","Store-06","Laptop","2024-01-06",74000),
("T024","West","Pune","Store-08","Mobile","2024-01-06",33000)
]
columns = [
"txn_id","region","city","store_id",
"product","sale_date","amount"
]
df_sales = spark.createDataFrame(sales_data, columns)
df_sales.show(5)
df_sales.printSchema()

+------+------+----------+--------+-------+----------+------+
|txn_id|region|      city|store_id|product| sale_date|amount|
+------+------+----------+--------+-------+----------+------+
|  T001| North|     Delhi|Store-01| Laptop|2024-01-01| 75000|
|  T002| North|     Delhi|Store-01| Mobile|2024-01-02| 32000|
|  T003| North|Chandigarh|Store-02| Tablet|2024-01-03| 26000|
|  T004| South| Bangalore|Store-03| Laptop|2024-01-01| 78000|
|  T005| South|   Chennai|Store-04| Mobile|2024-01-02| 30000|
+------+------+----------+--------+-------+----------+------+
only showing top 5 rows
root
 |-- txn_id: string (nullable = true)
 |-- region: string (nullable = true)
 |-- city: string (nullable = true)
 |-- store_id: string (nullable = true)
 |-- product: string (nullable = true)
 |-- sale_date: string (nullable = true)
 |-- amount: long (nullable = true)



# EXERCISE SET 1 — SELECT OPERATIONS

# 1. Select only txn_id, region, product, and amount

In [4]:
df_sales.select("txn_id","region","product","amount").show()

+------+------+-------+------+
|txn_id|region|product|amount|
+------+------+-------+------+
|  T001| North| Laptop| 75000|
|  T002| North| Mobile| 32000|
|  T003| North| Tablet| 26000|
|  T004| South| Laptop| 78000|
|  T005| South| Mobile| 30000|
|  T006| South| Tablet| 24000|
|  T007|  East| Laptop| 72000|
|  T008|  East| Mobile| 28000|
|  T009|  East| Tablet| 23000|
|  T010|  West| Laptop| 80000|
|  T011|  West| Mobile| 35000|
|  T012|  West| Tablet| 27000|
|  T013| North| Laptop| 76000|
|  T014| South| Laptop| 79000|
|  T015|  East| Mobile| 29000|
|  T016|  West| Laptop| 77000|
|  T017| North| Mobile| 31000|
|  T018| South| Mobile| 34000|
|  T019|  East| Tablet| 25000|
|  T020|  West| Tablet| 29000|
+------+------+-------+------+
only showing top 20 rows


# 2. Rename amount to revenue

In [5]:
df_sales.select("txn_id","region","product",df_sales["amount"].alias("revenue")).show()

+------+------+-------+-------+
|txn_id|region|product|revenue|
+------+------+-------+-------+
|  T001| North| Laptop|  75000|
|  T002| North| Mobile|  32000|
|  T003| North| Tablet|  26000|
|  T004| South| Laptop|  78000|
|  T005| South| Mobile|  30000|
|  T006| South| Tablet|  24000|
|  T007|  East| Laptop|  72000|
|  T008|  East| Mobile|  28000|
|  T009|  East| Tablet|  23000|
|  T010|  West| Laptop|  80000|
|  T011|  West| Mobile|  35000|
|  T012|  West| Tablet|  27000|
|  T013| North| Laptop|  76000|
|  T014| South| Laptop|  79000|
|  T015|  East| Mobile|  29000|
|  T016|  West| Laptop|  77000|
|  T017| North| Mobile|  31000|
|  T018| South| Mobile|  34000|
|  T019|  East| Tablet|  25000|
|  T020|  West| Tablet|  29000|
+------+------+-------+-------+
only showing top 20 rows


# 3. Create a derived column amount_in_thousands

In [6]:
from pyspark.sql.functions import col
df_sales.withColumn("amount_in_thousands", col("amount") / 1000).show()

+------+------+----------+--------+-------+----------+------+-------------------+
|txn_id|region|      city|store_id|product| sale_date|amount|amount_in_thousands|
+------+------+----------+--------+-------+----------+------+-------------------+
|  T001| North|     Delhi|Store-01| Laptop|2024-01-01| 75000|               75.0|
|  T002| North|     Delhi|Store-01| Mobile|2024-01-02| 32000|               32.0|
|  T003| North|Chandigarh|Store-02| Tablet|2024-01-03| 26000|               26.0|
|  T004| South| Bangalore|Store-03| Laptop|2024-01-01| 78000|               78.0|
|  T005| South|   Chennai|Store-04| Mobile|2024-01-02| 30000|               30.0|
|  T006| South| Bangalore|Store-03| Tablet|2024-01-03| 24000|               24.0|
|  T007|  East|   Kolkata|Store-05| Laptop|2024-01-01| 72000|               72.0|
|  T008|  East|   Kolkata|Store-05| Mobile|2024-01-02| 28000|               28.0|
|  T009|  East|     Patna|Store-06| Tablet|2024-01-03| 23000|               23.0|
|  T010|  West| 

# 4. Select distinct combinations of region and product

In [7]:
df_sales.select("region","product").distinct().show()

+------+-------+
|region|product|
+------+-------+
| North| Laptop|
| North| Tablet|
|  East| Tablet|
|  East| Laptop|
| South| Tablet|
| North| Mobile|
|  West| Tablet|
|  East| Mobile|
| South| Mobile|
| South| Laptop|
|  West| Mobile|
|  West| Laptop|
+------+-------+



# 5. Select all columns but exclude store_id

In [8]:
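# Use a loop variable other than `col` so it is not confused with pyspark.sql.functions.col
all_columns = df_sales.columns
columns_to_select = [c for c in all_columns if c != "store_id"]
df_sales.select(columns_to_select).show()
# Equivalent shortcut: df_sales.drop("store_id").show()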

+------+------+----------+-------+----------+------+
|txn_id|region|      city|product| sale_date|amount|
+------+------+----------+-------+----------+------+
|  T001| North|     Delhi| Laptop|2024-01-01| 75000|
|  T002| North|     Delhi| Mobile|2024-01-02| 32000|
|  T003| North|Chandigarh| Tablet|2024-01-03| 26000|
|  T004| South| Bangalore| Laptop|2024-01-01| 78000|
|  T005| South|   Chennai| Mobile|2024-01-02| 30000|
|  T006| South| Bangalore| Tablet|2024-01-03| 24000|
|  T007|  East|   Kolkata| Laptop|2024-01-01| 72000|
|  T008|  East|   Kolkata| Mobile|2024-01-02| 28000|
|  T009|  East|     Patna| Tablet|2024-01-03| 23000|
|  T010|  West|    Mumbai| Laptop|2024-01-01| 80000|
|  T011|  West|    Mumbai| Mobile|2024-01-02| 35000|
|  T012|  West|      Pune| Tablet|2024-01-03| 27000|
|  T013| North|     Delhi| Laptop|2024-01-04| 76000|
|  T014| South|   Chennai| Laptop|2024-01-04| 79000|
|  T015|  East|     Patna| Mobile|2024-01-04| 29000|
|  T016|  West|      Pune| Laptop|2024-01-04| 

# 6. Create a new column sale_year extracted from sale_date

In [9]:
df_sales.withColumn("sale_year", col("sale_date").substr(1, 4)).show()

+------+------+----------+--------+-------+----------+------+---------+
|txn_id|region|      city|store_id|product| sale_date|amount|sale_year|
+------+------+----------+--------+-------+----------+------+---------+
|  T001| North|     Delhi|Store-01| Laptop|2024-01-01| 75000|     2024|
|  T002| North|     Delhi|Store-01| Mobile|2024-01-02| 32000|     2024|
|  T003| North|Chandigarh|Store-02| Tablet|2024-01-03| 26000|     2024|
|  T004| South| Bangalore|Store-03| Laptop|2024-01-01| 78000|     2024|
|  T005| South|   Chennai|Store-04| Mobile|2024-01-02| 30000|     2024|
|  T006| South| Bangalore|Store-03| Tablet|2024-01-03| 24000|     2024|
|  T007|  East|   Kolkata|Store-05| Laptop|2024-01-01| 72000|     2024|
|  T008|  East|   Kolkata|Store-05| Mobile|2024-01-02| 28000|     2024|
|  T009|  East|     Patna|Store-06| Tablet|2024-01-03| 23000|     2024|
|  T010|  West|    Mumbai|Store-07| Laptop|2024-01-01| 80000|     2024|
|  T011|  West|    Mumbai|Store-07| Mobile|2024-01-02| 35000|   

# 7. Reorder columns in a business-friendly format

In [10]:
df_sales.select("txn_id", "sale_date", "region", "city", "store_id", "product", "amount").show()

+------+----------+------+----------+--------+-------+------+
|txn_id| sale_date|region|      city|store_id|product|amount|
+------+----------+------+----------+--------+-------+------+
|  T001|2024-01-01| North|     Delhi|Store-01| Laptop| 75000|
|  T002|2024-01-02| North|     Delhi|Store-01| Mobile| 32000|
|  T003|2024-01-03| North|Chandigarh|Store-02| Tablet| 26000|
|  T004|2024-01-01| South| Bangalore|Store-03| Laptop| 78000|
|  T005|2024-01-02| South|   Chennai|Store-04| Mobile| 30000|
|  T006|2024-01-03| South| Bangalore|Store-03| Tablet| 24000|
|  T007|2024-01-01|  East|   Kolkata|Store-05| Laptop| 72000|
|  T008|2024-01-02|  East|   Kolkata|Store-05| Mobile| 28000|
|  T009|2024-01-03|  East|     Patna|Store-06| Tablet| 23000|
|  T010|2024-01-01|  West|    Mumbai|Store-07| Laptop| 80000|
|  T011|2024-01-02|  West|    Mumbai|Store-07| Mobile| 35000|
|  T012|2024-01-03|  West|      Pune|Store-08| Tablet| 27000|
|  T013|2024-01-04| North|     Delhi|Store-01| Laptop| 76000|
|  T014|

# EXERCISE SET 2 — FILTER OPERATIONS

# 1. Filter transactions where amount > 50000

In [14]:
df_sales.filter(col("amount") > 50000).show()

+------+------+---------+--------+-------+----------+------+
|txn_id|region|     city|store_id|product| sale_date|amount|
+------+------+---------+--------+-------+----------+------+
|  T001| North|    Delhi|Store-01| Laptop|2024-01-01| 75000|
|  T004| South|Bangalore|Store-03| Laptop|2024-01-01| 78000|
|  T007|  East|  Kolkata|Store-05| Laptop|2024-01-01| 72000|
|  T010|  West|   Mumbai|Store-07| Laptop|2024-01-01| 80000|
|  T013| North|    Delhi|Store-01| Laptop|2024-01-04| 76000|
|  T014| South|  Chennai|Store-04| Laptop|2024-01-04| 79000|
|  T016|  West|     Pune|Store-08| Laptop|2024-01-04| 77000|
|  T023|  East|    Patna|Store-06| Laptop|2024-01-06| 74000|
+------+------+---------+--------+-------+----------+------+



# 2. Filter only Laptop sales

In [15]:
df_sales.filter(col("product") == "Laptop").show()

+------+------+---------+--------+-------+----------+------+
|txn_id|region|     city|store_id|product| sale_date|amount|
+------+------+---------+--------+-------+----------+------+
|  T001| North|    Delhi|Store-01| Laptop|2024-01-01| 75000|
|  T004| South|Bangalore|Store-03| Laptop|2024-01-01| 78000|
|  T007|  East|  Kolkata|Store-05| Laptop|2024-01-01| 72000|
|  T010|  West|   Mumbai|Store-07| Laptop|2024-01-01| 80000|
|  T013| North|    Delhi|Store-01| Laptop|2024-01-04| 76000|
|  T014| South|  Chennai|Store-04| Laptop|2024-01-04| 79000|
|  T016|  West|     Pune|Store-08| Laptop|2024-01-04| 77000|
|  T023|  East|    Patna|Store-06| Laptop|2024-01-06| 74000|
+------+------+---------+--------+-------+----------+------+



# 3. Filter sales from North and South regions

In [16]:
df_sales.filter(col("region").isin(["North", "South"])).show()

+------+------+----------+--------+-------+----------+------+
|txn_id|region|      city|store_id|product| sale_date|amount|
+------+------+----------+--------+-------+----------+------+
|  T001| North|     Delhi|Store-01| Laptop|2024-01-01| 75000|
|  T002| North|     Delhi|Store-01| Mobile|2024-01-02| 32000|
|  T003| North|Chandigarh|Store-02| Tablet|2024-01-03| 26000|
|  T004| South| Bangalore|Store-03| Laptop|2024-01-01| 78000|
|  T005| South|   Chennai|Store-04| Mobile|2024-01-02| 30000|
|  T006| South| Bangalore|Store-03| Tablet|2024-01-03| 24000|
|  T013| North|     Delhi|Store-01| Laptop|2024-01-04| 76000|
|  T014| South|   Chennai|Store-04| Laptop|2024-01-04| 79000|
|  T017| North|Chandigarh|Store-02| Mobile|2024-01-05| 31000|
|  T018| South| Bangalore|Store-03| Mobile|2024-01-05| 34000|
|  T021| North|     Delhi|Store-01| Tablet|2024-01-06| 28000|
|  T022| South|   Chennai|Store-04| Tablet|2024-01-06| 26000|
+------+------+----------+--------+-------+----------+------+



# 4. Filter sales between 25000 and 75000

In [17]:
df_sales.filter((col("amount") >= 25000) & (col("amount") <= 75000)).show()

+------+------+----------+--------+-------+----------+------+
|txn_id|region|      city|store_id|product| sale_date|amount|
+------+------+----------+--------+-------+----------+------+
|  T001| North|     Delhi|Store-01| Laptop|2024-01-01| 75000|
|  T002| North|     Delhi|Store-01| Mobile|2024-01-02| 32000|
|  T003| North|Chandigarh|Store-02| Tablet|2024-01-03| 26000|
|  T005| South|   Chennai|Store-04| Mobile|2024-01-02| 30000|
|  T007|  East|   Kolkata|Store-05| Laptop|2024-01-01| 72000|
|  T008|  East|   Kolkata|Store-05| Mobile|2024-01-02| 28000|
|  T011|  West|    Mumbai|Store-07| Mobile|2024-01-02| 35000|
|  T012|  West|      Pune|Store-08| Tablet|2024-01-03| 27000|
|  T015|  East|     Patna|Store-06| Mobile|2024-01-04| 29000|
|  T017| North|Chandigarh|Store-02| Mobile|2024-01-05| 31000|
|  T018| South| Bangalore|Store-03| Mobile|2024-01-05| 34000|
|  T019|  East|   Kolkata|Store-05| Tablet|2024-01-05| 25000|
|  T020|  West|    Mumbai|Store-07| Tablet|2024-01-05| 29000|
|  T021|

# 5. Filter transactions from Delhi stores only

In [20]:
df_sales.filter(col("city") == "Delhi").show()

+------+------+-----+--------+-------+----------+------+
|txn_id|region| city|store_id|product| sale_date|amount|
+------+------+-----+--------+-------+----------+------+
|  T001| North|Delhi|Store-01| Laptop|2024-01-01| 75000|
|  T002| North|Delhi|Store-01| Mobile|2024-01-02| 32000|
|  T013| North|Delhi|Store-01| Laptop|2024-01-04| 76000|
|  T021| North|Delhi|Store-01| Tablet|2024-01-06| 28000|
+------+------+-----+--------+-------+----------+------+



In [22]:
df_sales.filter(col("city") == "Delhi").select("city", "store_id", "amount").show()

+-----+--------+------+
| city|store_id|amount|
+-----+--------+------+
|Delhi|Store-01| 75000|
|Delhi|Store-01| 32000|
|Delhi|Store-01| 76000|
|Delhi|Store-01| 28000|
+-----+--------+------+



# 6. Apply multiple filters using both filter and where

In [25]:
df_sales.filter((col("region") == "North")).where(col("city") == "Chandigarh").show()

+------+------+----------+--------+-------+----------+------+
|txn_id|region|      city|store_id|product| sale_date|amount|
+------+------+----------+--------+-------+----------+------+
|  T003| North|Chandigarh|Store-02| Tablet|2024-01-03| 26000|
|  T017| North|Chandigarh|Store-02| Mobile|2024-01-05| 31000|
+------+------+----------+--------+-------+----------+------+



# 7. Change the order of filters and compare explain(True)

In [26]:
print("First order: filter by region then by city")
df_sales.filter(col("region") == "North").where(col("city") == "Chandigarh").explain(True)

print("Second order: filter by city then by region")
df_sales.where(col("city") == "Chandigarh").filter(col("region") == "North").explain(True)

First order: filter by region then by city
== Parsed Logical Plan ==
'Filter '`=`('city, Chandigarh)
+- Filter (region#1 = North)
   +- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Analyzed Logical Plan ==
txn_id: string, region: string, city: string, store_id: string, product: string, sale_date: string, amount: bigint
Filter (city#2 = Chandigarh)
+- Filter (region#1 = North)
   +- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Optimized Logical Plan ==
Filter ((isnotnull(region#1) AND isnotnull(city#2)) AND ((region#1 = North) AND (city#2 = Chandigarh)))
+- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Physical Plan ==
*(1) Filter ((isnotnull(region#1) AND isnotnull(city#2)) AND ((region#1 = North) AND (city#2 = Chandigarh)))
+- *(1) Scan ExistingRDD[txn_id#0,region#1,city#2,store_id#3,product#4,sale_date#5,amount#6L]

Second order

# 8. Identify which filters Spark pushes down

In [27]:
# From the `explain(True)` output for both filter orders:
# == Optimized Logical Plan ==
# Filter ((isnotnull(region#1) AND isnotnull(city#2)) AND ((region#1 = North) AND (city#2 = Chandigarh)))
# +- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

# == Physical Plan ==
# *(1) Filter ((isnotnull(city#2) AND isnotnull(region#1)) AND ((city#2 = Chandigarh) AND (region#1 = North)))
# +- *(1) Scan ExistingRDD[txn_id#0,region#1,city#2,store_id#3,product#4,sale_date#5,amount#6L]

# As seen in the 'Optimized Logical Plan' and 'Physical Plan', Spark's Catalyst optimizer merges the two
# chained conditions into a single Filter and adds isnotnull checks:
# `((isnotnull(region#1) AND isnotnull(city#2)) AND ((region#1 = North) AND (city#2 = Chandigarh)))`.
# The order of `filter` and `where` changes only the order of the conjuncts, not the shape of the plan.
# Neither predicate is actually pushed down here: the source is `Scan ExistingRDD`, an in-memory RDD that
# cannot evaluate predicates, so the Filter runs as a separate operator above the scan. Pushdown applies to
# sources that support it (e.g. Parquet, ORC, JDBC) and shows up as `PushedFilters: [...]` on the scan node.

# EXERCISE SET 3 — GROUPBY & AGGREGATE FUNCTIONS

# 1. Total sales amount per region

In [28]:
from pyspark.sql.functions import sum  # note: shadows Python's built-in sum from this cell onward
df_sales.groupBy("region").agg(sum("amount").alias("total_sales_amount")).show()

+------+------------------+
|region|total_sales_amount|
+------+------------------+
| South|            271000|
|  East|            251000|
|  West|            281000|
| North|            268000|
+------+------------------+



# 2. Average sales amount per product

In [31]:
from pyspark.sql.functions import avg
df_sales.groupBy("product").agg(avg("amount").alias("average_sales_amount")).show()

+-------+--------------------+
|product|average_sales_amount|
+-------+--------------------+
| Laptop|             76375.0|
| Mobile|             31500.0|
| Tablet|             26000.0|
+-------+--------------------+



# 3. Maximum sale per city

In [32]:
from pyspark.sql.functions import max
df_sales.groupBy("city").agg(max("amount").alias("maximum_sale")).show()

+----------+------------+
|      city|maximum_sale|
+----------+------------+
| Bangalore|       78000|
|     Patna|       74000|
|   Chennai|       79000|
|    Mumbai|       80000|
|   Kolkata|       72000|
|      Pune|       77000|
|     Delhi|       76000|
|Chandigarh|       31000|
+----------+------------+



# 4. Minimum sale per store

In [34]:
from pyspark.sql.functions import min
df_sales.groupBy("store_id").agg(min("amount").alias("minimum_sale")).orderBy("store_id").show()

+--------+------------+
|store_id|minimum_sale|
+--------+------------+
|Store-01|       28000|
|Store-02|       26000|
|Store-03|       24000|
|Store-04|       26000|
|Store-05|       25000|
|Store-06|       23000|
|Store-07|       29000|
|Store-08|       27000|
+--------+------------+



# 5. Count of transactions per region

In [35]:
from pyspark.sql.functions import count
df_sales.groupBy("region").agg(count("*").alias("transaction_count")).show()

+------+-----------------+
|region|transaction_count|
+------+-----------------+
| South|                6|
|  East|                6|
|  West|                6|
| North|                6|
+------+-----------------+



# 6. Total revenue per store

In [36]:
df_sales.groupBy("store_id").agg(sum("amount").alias("total_revenue")).show()

+--------+-------------+
|store_id|total_revenue|
+--------+-------------+
|Store-05|       125000|
|Store-06|       126000|
|Store-03|       136000|
|Store-01|       211000|
|Store-04|       135000|
|Store-07|       144000|
|Store-08|       137000|
|Store-02|        57000|
+--------+-------------+



# 7. Region-wise product sales count

In [37]:
df_sales.groupBy("region", "product").agg(count("*").alias("product_sales_count")).show()


+------+-------+-------------------+
|region|product|product_sales_count|
+------+-------+-------------------+
| North| Laptop|                  2|
| North| Tablet|                  2|
|  East| Tablet|                  2|
|  East| Laptop|                  2|
| South| Tablet|                  2|
| North| Mobile|                  2|
|  West| Tablet|                  2|
|  East| Mobile|                  2|
| South| Mobile|                  2|
| South| Laptop|                  2|
|  West| Mobile|                  2|
|  West| Laptop|                  2|
+------+-------+-------------------+



# 8. Average transaction value per city

In [38]:
df_sales.groupBy("city").agg(avg("amount").alias("average_transaction_value")).show()

+----------+-------------------------+
|      city|average_transaction_value|
+----------+-------------------------+
| Bangalore|       45333.333333333336|
|     Patna|                  42000.0|
|   Chennai|                  45000.0|
|    Mumbai|                  48000.0|
|   Kolkata|       41666.666666666664|
|      Pune|       45666.666666666664|
|     Delhi|                  52750.0|
|Chandigarh|                  28500.0|
+----------+-------------------------+



# 9. Identify regions with total sales above a threshold

In [41]:
df_sales.groupBy("region").agg(sum("amount").alias("total_sales")).filter(col("total_sales") > 250000).show()

+------+-----------+
|region|total_sales|
+------+-----------+
| South|     271000|
|  East|     251000|
|  West|     281000|
| North|     268000|
+------+-----------+



# 10. Use explain(True) and identify shuffle stages

In [42]:
df_sales.groupBy("region").agg(sum("amount").alias("total_sales")).filter(col("total_sales") > 250000).explain(True)

== Parsed Logical Plan ==
'Filter '`>`('total_sales, 250000)
+- Aggregate [region#1], [region#1, sum(amount#6L) AS total_sales#567L]
   +- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Analyzed Logical Plan ==
region: string, total_sales: bigint
Filter (total_sales#567L > cast(250000 as bigint))
+- Aggregate [region#1], [region#1, sum(amount#6L) AS total_sales#567L]
   +- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Optimized Logical Plan ==
Filter (isnotnull(total_sales#567L) AND (total_sales#567L > 250000))
+- Aggregate [region#1], [region#1, sum(amount#6L) AS total_sales#567L]
   +- Project [region#1, amount#6L]
      +- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Filter (isnotnull(total_sales#567L) AND (total_sales#567L > 250000))
   +- HashAggregate(keys=[region#1], fun

# EXERCISE SET 4 — MULTI-DIMENSIONAL AGGREGATION

# 1. Region + Product wise total sales


In [43]:
df_sales.groupBy("region", "product").agg(sum("amount").alias("total_sales")).show()

+------+-------+-----------+
|region|product|total_sales|
+------+-------+-----------+
| North| Laptop|     151000|
| North| Tablet|      54000|
|  East| Tablet|      48000|
|  East| Laptop|     146000|
| South| Tablet|      50000|
| North| Mobile|      63000|
|  West| Tablet|      56000|
|  East| Mobile|      57000|
| South| Mobile|      64000|
| South| Laptop|     157000|
|  West| Mobile|      68000|
|  West| Laptop|     157000|
+------+-------+-----------+



# 2. City + Store wise average sales

In [44]:
df_sales.groupBy("city", "store_id").agg(avg("amount").alias("average_sales")).show()

+----------+--------+------------------+
|      city|store_id|     average_sales|
+----------+--------+------------------+
| Bangalore|Store-03|45333.333333333336|
|     Patna|Store-06|           42000.0|
|   Chennai|Store-04|           45000.0|
|      Pune|Store-08|45666.666666666664|
|Chandigarh|Store-02|           28500.0|
|   Kolkata|Store-05|41666.666666666664|
|    Mumbai|Store-07|           48000.0|
|     Delhi|Store-01|           52750.0|
+----------+--------+------------------+



# 3. Region + City wise transaction count

In [45]:
df_sales.groupBy("region", "city").agg(count("*").alias("transaction_count")).show()

+------+----------+-----------------+
|region|      city|transaction_count|
+------+----------+-----------------+
|  West|    Mumbai|                3|
| South| Bangalore|                3|
| North|     Delhi|                4|
| North|Chandigarh|                2|
| South|   Chennai|                3|
|  West|      Pune|                3|
|  East|   Kolkata|                3|
|  East|     Patna|                3|
+------+----------+-----------------+



# 4. Product + Store wise max sale

In [46]:
df_sales.groupBy("product", "store_id").agg(max("amount").alias("maximum_sale")).show()

+-------+--------+------------+
|product|store_id|maximum_sale|
+-------+--------+------------+
| Tablet|Store-06|       23000|
| Laptop|Store-07|       80000|
| Laptop|Store-01|       76000|
| Tablet|Store-02|       26000|
| Mobile|Store-01|       32000|
| Laptop|Store-03|       78000|
| Tablet|Store-08|       27000|
| Tablet|Store-03|       24000|
| Mobile|Store-04|       30000|
| Mobile|Store-07|       35000|
| Mobile|Store-05|       28000|
| Laptop|Store-05|       72000|
| Tablet|Store-01|       28000|
| Tablet|Store-07|       29000|
| Laptop|Store-08|       77000|
| Mobile|Store-08|       33000|
| Laptop|Store-04|       79000|
| Tablet|Store-05|       25000|
| Tablet|Store-04|       26000|
| Laptop|Store-06|       74000|
+-------+--------+------------+
only showing top 20 rows


# 5. Identify top-selling product per region using aggregation only

In [48]:
from pyspark.sql.functions import sum

region_product_sales = df_sales.groupBy("region", "product").agg(sum("amount").alias("total_sales"))
region_product_sales.show()

+------+-------+-----------+
|region|product|total_sales|
+------+-------+-----------+
| North| Laptop|     151000|
| North| Tablet|      54000|
|  East| Tablet|      48000|
|  East| Laptop|     146000|
| South| Tablet|      50000|
| North| Mobile|      63000|
|  West| Tablet|      56000|
|  East| Mobile|      57000|
| South| Mobile|      64000|
| South| Laptop|     157000|
|  West| Mobile|      68000|
|  West| Laptop|     157000|
+------+-------+-----------+



In [49]:
from pyspark.sql.functions import max

max_sales_per_region = region_product_sales.groupBy("region").agg(max("total_sales").alias("max_sales_per_region"))
max_sales_per_region.show()

+------+--------------------+
|region|max_sales_per_region|
+------+--------------------+
| South|              157000|
|  East|              146000|
|  West|              157000|
| North|              151000|
+------+--------------------+



In [50]:
top_selling_product_per_region = (
    region_product_sales
    .join(max_sales_per_region, "region")
    .filter(region_product_sales["total_sales"] == max_sales_per_region["max_sales_per_region"])
)
top_selling_product_per_region.show()

+------+-------+-----------+--------------------+
|region|product|total_sales|max_sales_per_region|
+------+-------+-----------+--------------------+
| North| Laptop|     151000|              151000|
|  East| Laptop|     146000|              146000|
| South| Laptop|     157000|              157000|
|  West| Laptop|     157000|              157000|
+------+-------+-----------+--------------------+



# EXERCISE SET 5 — WINDOW FUNCTIONS (OVER)

In [51]:
from pyspark.sql.window import Window
from pyspark.sql.functions import sum, rank, row_number, dense_rank

# 1. Compute running total of sales per region ordered by date

In [52]:
df_sales.withColumn("running_total", sum("amount").over(Window.partitionBy("region").orderBy("sale_date"))).show()

+------+------+----------+--------+-------+----------+------+-------------+
|txn_id|region|      city|store_id|product| sale_date|amount|running_total|
+------+------+----------+--------+-------+----------+------+-------------+
|  T007|  East|   Kolkata|Store-05| Laptop|2024-01-01| 72000|        72000|
|  T008|  East|   Kolkata|Store-05| Mobile|2024-01-02| 28000|       100000|
|  T009|  East|     Patna|Store-06| Tablet|2024-01-03| 23000|       123000|
|  T015|  East|     Patna|Store-06| Mobile|2024-01-04| 29000|       152000|
|  T019|  East|   Kolkata|Store-05| Tablet|2024-01-05| 25000|       177000|
|  T023|  East|     Patna|Store-06| Laptop|2024-01-06| 74000|       251000|
|  T001| North|     Delhi|Store-01| Laptop|2024-01-01| 75000|        75000|
|  T002| North|     Delhi|Store-01| Mobile|2024-01-02| 32000|       107000|
|  T003| North|Chandigarh|Store-02| Tablet|2024-01-03| 26000|       133000|
|  T013| North|     Delhi|Store-01| Laptop|2024-01-04| 76000|       209000|
|  T017| Nor

# 2. Rank transactions by amount within each region

In [53]:
df_sales.withColumn("rank", rank().over(Window.partitionBy("region").orderBy(col("amount").desc()))).show()

+------+------+----------+--------+-------+----------+------+----+
|txn_id|region|      city|store_id|product| sale_date|amount|rank|
+------+------+----------+--------+-------+----------+------+----+
|  T023|  East|     Patna|Store-06| Laptop|2024-01-06| 74000|   1|
|  T007|  East|   Kolkata|Store-05| Laptop|2024-01-01| 72000|   2|
|  T015|  East|     Patna|Store-06| Mobile|2024-01-04| 29000|   3|
|  T008|  East|   Kolkata|Store-05| Mobile|2024-01-02| 28000|   4|
|  T019|  East|   Kolkata|Store-05| Tablet|2024-01-05| 25000|   5|
|  T009|  East|     Patna|Store-06| Tablet|2024-01-03| 23000|   6|
|  T013| North|     Delhi|Store-01| Laptop|2024-01-04| 76000|   1|
|  T001| North|     Delhi|Store-01| Laptop|2024-01-01| 75000|   2|
|  T002| North|     Delhi|Store-01| Mobile|2024-01-02| 32000|   3|
|  T017| North|Chandigarh|Store-02| Mobile|2024-01-05| 31000|   4|
|  T021| North|     Delhi|Store-01| Tablet|2024-01-06| 28000|   5|
|  T003| North|Chandigarh|Store-02| Tablet|2024-01-03| 26000| 

# 3. Assign row numbers per store ordered by sale amount

In [54]:
df_sales.withColumn("row_number", row_number().over(Window.partitionBy("store_id").orderBy(col("amount").desc()))).show()

+------+------+----------+--------+-------+----------+------+----------+
|txn_id|region|      city|store_id|product| sale_date|amount|row_number|
+------+------+----------+--------+-------+----------+------+----------+
|  T013| North|     Delhi|Store-01| Laptop|2024-01-04| 76000|         1|
|  T001| North|     Delhi|Store-01| Laptop|2024-01-01| 75000|         2|
|  T002| North|     Delhi|Store-01| Mobile|2024-01-02| 32000|         3|
|  T021| North|     Delhi|Store-01| Tablet|2024-01-06| 28000|         4|
|  T017| North|Chandigarh|Store-02| Mobile|2024-01-05| 31000|         1|
|  T003| North|Chandigarh|Store-02| Tablet|2024-01-03| 26000|         2|
|  T004| South| Bangalore|Store-03| Laptop|2024-01-01| 78000|         1|
|  T018| South| Bangalore|Store-03| Mobile|2024-01-05| 34000|         2|
|  T006| South| Bangalore|Store-03| Tablet|2024-01-03| 24000|         3|
|  T014| South|   Chennai|Store-04| Laptop|2024-01-04| 79000|         1|
|  T005| South|   Chennai|Store-04| Mobile|2024-01-

# 4. Use dense_rank to rank products per region

In [55]:
df_sales.withColumn("dense_rank", dense_rank().over(Window.partitionBy("region").orderBy(col("amount").desc()))).show()

+------+------+----------+--------+-------+----------+------+----------+
|txn_id|region|      city|store_id|product| sale_date|amount|dense_rank|
+------+------+----------+--------+-------+----------+------+----------+
|  T023|  East|     Patna|Store-06| Laptop|2024-01-06| 74000|         1|
|  T007|  East|   Kolkata|Store-05| Laptop|2024-01-01| 72000|         2|
|  T015|  East|     Patna|Store-06| Mobile|2024-01-04| 29000|         3|
|  T008|  East|   Kolkata|Store-05| Mobile|2024-01-02| 28000|         4|
|  T019|  East|   Kolkata|Store-05| Tablet|2024-01-05| 25000|         5|
|  T009|  East|     Patna|Store-06| Tablet|2024-01-03| 23000|         6|
|  T013| North|     Delhi|Store-01| Laptop|2024-01-04| 76000|         1|
|  T001| North|     Delhi|Store-01| Laptop|2024-01-01| 75000|         2|
|  T002| North|     Delhi|Store-01| Mobile|2024-01-02| 32000|         3|
|  T017| North|Chandigarh|Store-02| Mobile|2024-01-05| 31000|         4|
|  T021| North|     Delhi|Store-01| Tablet|2024-01-

# 5. Identify top 2 highest sales per region using window functions

In [56]:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col

window_spec = Window.partitionBy("region").orderBy(col("total_sales").desc())
ranked_sales_per_region = region_product_sales.withColumn("sales_rank", rank().over(window_spec))
ranked_sales_per_region.show()

+------+-------+-----------+----------+
|region|product|total_sales|sales_rank|
+------+-------+-----------+----------+
|  East| Laptop|     146000|         1|
|  East| Mobile|      57000|         2|
|  East| Tablet|      48000|         3|
| North| Laptop|     151000|         1|
| North| Mobile|      63000|         2|
| North| Tablet|      54000|         3|
| South| Laptop|     157000|         1|
| South| Mobile|      64000|         2|
| South| Tablet|      50000|         3|
|  West| Laptop|     157000|         1|
|  West| Mobile|      68000|         2|
|  West| Tablet|      56000|         3|
+------+-------+-----------+----------+



In [57]:
top_2_sales_per_region = ranked_sales_per_region.filter(col("sales_rank") <= 2)
top_2_sales_per_region.show()

+------+-------+-----------+----------+
|region|product|total_sales|sales_rank|
+------+-------+-----------+----------+
|  East| Laptop|     146000|         1|
|  East| Mobile|      57000|         2|
| North| Laptop|     151000|         1|
| North| Mobile|      63000|         2|
| South| Laptop|     157000|         1|
| South| Mobile|      64000|         2|
|  West| Laptop|     157000|         1|
|  West| Mobile|      68000|         2|
+------+-------+-----------+----------+



# 6. Compare rank vs dense_rank output

In [61]:
# Both rankings share one window: per region, highest amount first
region_window = Window.partitionBy("region").orderBy(col("amount").desc())
df_sales.withColumn("rank", rank().over(region_window)).withColumn("dense_rank", dense_rank().over(region_window)).show()

+------+------+----------+--------+-------+----------+------+----+----------+
|txn_id|region|      city|store_id|product| sale_date|amount|rank|dense_rank|
+------+------+----------+--------+-------+----------+------+----+----------+
|  T023|  East|     Patna|Store-06| Laptop|2024-01-06| 74000|   1|         1|
|  T007|  East|   Kolkata|Store-05| Laptop|2024-01-01| 72000|   2|         2|
|  T015|  East|     Patna|Store-06| Mobile|2024-01-04| 29000|   3|         3|
|  T008|  East|   Kolkata|Store-05| Mobile|2024-01-02| 28000|   4|         4|
|  T019|  East|   Kolkata|Store-05| Tablet|2024-01-05| 25000|   5|         5|
|  T009|  East|     Patna|Store-06| Tablet|2024-01-03| 23000|   6|         6|
|  T013| North|     Delhi|Store-01| Laptop|2024-01-04| 76000|   1|         1|
|  T001| North|     Delhi|Store-01| Laptop|2024-01-01| 75000|   2|         2|
|  T002| North|     Delhi|Store-01| Mobile|2024-01-02| 32000|   3|         3|
|  T017| North|Chandigarh|Store-02| Mobile|2024-01-05| 31000|   
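
In this dataset no two transactions in the same region share an amount, so rank and dense_rank agree on every row above. The two only differ when there are ties. The plain-Python sketch below mimics both rules on a hypothetical list of amounts that contains one tie. The list and the helper names sql_rank and sql_dense_rank are made up for illustration; they are not PySpark functions.

```python
def sql_rank(values):
    """Descending rank: ties share a rank and the next rank skips ahead."""
    ordered = sorted(values, reverse=True)
    # 1-based position of the first occurrence of each value.
    return [ordered.index(v) + 1 for v in ordered]


def sql_dense_rank(values):
    """Descending rank: ties share a rank and no rank is skipped."""
    ordered = sorted(values, reverse=True)
    distinct = sorted(set(values), reverse=True)
    return [distinct.index(v) + 1 for v in ordered]


amounts = [80000, 75000, 75000, 30000]  # hypothetical amounts with one tie
ranks = sql_rank(amounts)
dense_ranks = sql_dense_rank(amounts)
print(ranks)        # [1, 2, 2, 4]
print(dense_ranks)  # [1, 2, 2, 3]
```

After the tie, rank jumps to 4 while dense_rank continues with 3.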

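# 7. Calculate cumulative sales per store

In [62]:
# sum here must be pyspark.sql.functions.sum, not Python's built-in sum
store_date_window = Window.partitionBy("store_id").orderBy("sale_date")
df_sales.withColumn("cumulative_sales", sum("amount").over(store_date_window)).show()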

+------+------+----------+--------+-------+----------+------+----------------+
|txn_id|region|      city|store_id|product| sale_date|amount|cumulative_sales|
+------+------+----------+--------+-------+----------+------+----------------+
|  T001| North|     Delhi|Store-01| Laptop|2024-01-01| 75000|           75000|
|  T002| North|     Delhi|Store-01| Mobile|2024-01-02| 32000|          107000|
|  T013| North|     Delhi|Store-01| Laptop|2024-01-04| 76000|          183000|
|  T021| North|     Delhi|Store-01| Tablet|2024-01-06| 28000|          211000|
|  T003| North|Chandigarh|Store-02| Tablet|2024-01-03| 26000|           26000|
|  T017| North|Chandigarh|Store-02| Mobile|2024-01-05| 31000|           57000|
|  T004| South| Bangalore|Store-03| Laptop|2024-01-01| 78000|           78000|
|  T006| South| Bangalore|Store-03| Tablet|2024-01-03| 24000|          102000|
|  T018| South| Bangalore|Store-03| Mobile|2024-01-05| 34000|          136000|
|  T005| South|   Chennai|Store-04| Mobile|2024-01-0
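
When a window has an orderBy but no explicit frame, Spark uses a RANGE frame from unbounded preceding to the current row (the explain output for the window query later in this notebook shows RangeFrame). Under a RANGE frame, rows with the same sale_date are peers and all receive the same cumulative total. Every store here has at most one transaction per date, so the output above is not affected. The plain-Python sketch below models both frame types on hypothetical rows with a duplicated date; the helper names are illustrative only. In PySpark, a strictly row-by-row total would be requested by adding .rowsBetween(Window.unboundedPreceding, Window.currentRow) to the window spec.

```python
def running_total_rows(rows):
    """ROWS frame: each row adds only itself to the total so far."""
    total, out = 0, []
    for _, amount in sorted(rows, key=lambda r: r[0]):
        total += amount
        out.append(total)
    return out


def running_total_range(rows):
    """RANGE frame: rows with the same ordering value (peers) share one total."""
    ordered = sorted(rows, key=lambda r: r[0])
    totals_by_date = {}
    total = 0
    for date, amount in ordered:
        total += amount
        # The last write for a date holds the total including all of its peers.
        totals_by_date[date] = total
    return [totals_by_date[date] for date, _ in ordered]


# Hypothetical (sale_date, amount) rows with two sales on the same date.
rows = [("2024-01-01", 100), ("2024-01-02", 50), ("2024-01-02", 70)]
rows_frame = running_total_rows(rows)
range_frame = running_total_range(rows)
print(rows_frame)   # [100, 150, 220]
print(range_frame)  # [100, 220, 220]
```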

# 8. Identify first and last transaction per city using windows

In [63]:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

window_spec_first_txn = Window.partitionBy("city").orderBy(col("sale_date").asc())
df_sales_first_txn = df_sales.withColumn("first_txn_rn", row_number().over(window_spec_first_txn))
df_sales_first_txn.show()

+------+------+----------+--------+-------+----------+------+------------+
|txn_id|region|      city|store_id|product| sale_date|amount|first_txn_rn|
+------+------+----------+--------+-------+----------+------+------------+
|  T004| South| Bangalore|Store-03| Laptop|2024-01-01| 78000|           1|
|  T006| South| Bangalore|Store-03| Tablet|2024-01-03| 24000|           2|
|  T018| South| Bangalore|Store-03| Mobile|2024-01-05| 34000|           3|
|  T003| North|Chandigarh|Store-02| Tablet|2024-01-03| 26000|           1|
|  T017| North|Chandigarh|Store-02| Mobile|2024-01-05| 31000|           2|
|  T005| South|   Chennai|Store-04| Mobile|2024-01-02| 30000|           1|
|  T014| South|   Chennai|Store-04| Laptop|2024-01-04| 79000|           2|
|  T022| South|   Chennai|Store-04| Tablet|2024-01-06| 26000|           3|
|  T001| North|     Delhi|Store-01| Laptop|2024-01-01| 75000|           1|
|  T002| North|     Delhi|Store-01| Mobile|2024-01-02| 32000|           2|
|  T013| North|     Delhi

In [64]:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

window_spec_last_txn = Window.partitionBy("city").orderBy(col("sale_date").desc())
df_sales_last_txn = df_sales_first_txn.withColumn("last_txn_rn", row_number().over(window_spec_last_txn))
df_sales_last_txn.show()

+------+------+----------+--------+-------+----------+------+------------+-----------+
|txn_id|region|      city|store_id|product| sale_date|amount|first_txn_rn|last_txn_rn|
+------+------+----------+--------+-------+----------+------+------------+-----------+
|  T018| South| Bangalore|Store-03| Mobile|2024-01-05| 34000|           3|          1|
|  T006| South| Bangalore|Store-03| Tablet|2024-01-03| 24000|           2|          2|
|  T004| South| Bangalore|Store-03| Laptop|2024-01-01| 78000|           1|          3|
|  T017| North|Chandigarh|Store-02| Mobile|2024-01-05| 31000|           2|          1|
|  T003| North|Chandigarh|Store-02| Tablet|2024-01-03| 26000|           1|          2|
|  T022| South|   Chennai|Store-04| Tablet|2024-01-06| 26000|           3|          1|
|  T014| South|   Chennai|Store-04| Laptop|2024-01-04| 79000|           2|          2|
|  T005| South|   Chennai|Store-04| Mobile|2024-01-02| 30000|           1|          3|
|  T021| North|     Delhi|Store-01| Tablet|

In [65]:
df_sales_first_txn.filter(col("first_txn_rn") == 1).show()

+------+------+----------+--------+-------+----------+------+------------+
|txn_id|region|      city|store_id|product| sale_date|amount|first_txn_rn|
+------+------+----------+--------+-------+----------+------+------------+
|  T004| South| Bangalore|Store-03| Laptop|2024-01-01| 78000|           1|
|  T003| North|Chandigarh|Store-02| Tablet|2024-01-03| 26000|           1|
|  T005| South|   Chennai|Store-04| Mobile|2024-01-02| 30000|           1|
|  T001| North|     Delhi|Store-01| Laptop|2024-01-01| 75000|           1|
|  T007|  East|   Kolkata|Store-05| Laptop|2024-01-01| 72000|           1|
|  T010|  West|    Mumbai|Store-07| Laptop|2024-01-01| 80000|           1|
|  T009|  East|     Patna|Store-06| Tablet|2024-01-03| 23000|           1|
|  T012|  West|      Pune|Store-08| Tablet|2024-01-03| 27000|           1|
+------+------+----------+--------+-------+----------+------+------------+



In [66]:
df_sales_last_txn.filter(col("last_txn_rn") == 1).show()

+------+------+----------+--------+-------+----------+------+------------+-----------+
|txn_id|region|      city|store_id|product| sale_date|amount|first_txn_rn|last_txn_rn|
+------+------+----------+--------+-------+----------+------+------------+-----------+
|  T018| South| Bangalore|Store-03| Mobile|2024-01-05| 34000|           3|          1|
|  T017| North|Chandigarh|Store-02| Mobile|2024-01-05| 31000|           2|          1|
|  T022| South|   Chennai|Store-04| Tablet|2024-01-06| 26000|           3|          1|
|  T021| North|     Delhi|Store-01| Tablet|2024-01-06| 28000|           4|          1|
|  T019|  East|   Kolkata|Store-05| Tablet|2024-01-05| 25000|           3|          1|
|  T020|  West|    Mumbai|Store-07| Tablet|2024-01-05| 29000|           3|          1|
|  T023|  East|     Patna|Store-06| Laptop|2024-01-06| 74000|           3|          1|
|  T024|  West|      Pune|Store-08| Mobile|2024-01-06| 33000|           3|          1|
+------+------+----------+--------+-------+

# 1. Run explain(True) for:
Simple select
Filter
GroupBy
Window function

In [67]:
df_sales.select("txn_id", "sale_date", "region", "city", "store_id", "product", "amount").explain(True)

== Parsed Logical Plan ==
'Project ['txn_id, 'sale_date, 'region, 'city, 'store_id, 'product, 'amount]
+- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Analyzed Logical Plan ==
txn_id: string, sale_date: string, region: string, city: string, store_id: string, product: string, amount: bigint
Project [txn_id#0, sale_date#5, region#1, city#2, store_id#3, product#4, amount#6L]
+- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Optimized Logical Plan ==
Project [txn_id#0, sale_date#5, region#1, city#2, store_id#3, product#4, amount#6L]
+- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Physical Plan ==
*(1) Project [txn_id#0, sale_date#5, region#1, city#2, store_id#3, product#4, amount#6L]
+- *(1) Scan ExistingRDD[txn_id#0,region#1,city#2,store_id#3,product#4,sale_date#5,amount#6L]



In [68]:
df_sales.filter(col("product") == "Laptop").explain(True)

== Parsed Logical Plan ==
'Filter '`=`('product, Laptop)
+- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Analyzed Logical Plan ==
txn_id: string, region: string, city: string, store_id: string, product: string, sale_date: string, amount: bigint
Filter (product#4 = Laptop)
+- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Optimized Logical Plan ==
Filter (isnotnull(product#4) AND (product#4 = Laptop))
+- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Physical Plan ==
*(1) Filter (isnotnull(product#4) AND (product#4 = Laptop))
+- *(1) Scan ExistingRDD[txn_id#0,region#1,city#2,store_id#3,product#4,sale_date#5,amount#6L]



In [70]:
df_sales.groupBy("region", "product").agg(sum("amount").alias("total_sales")).explain(True)

== Parsed Logical Plan ==
'Aggregate ['region, 'product], ['region, 'product, 'sum('amount) AS total_sales#1172]
+- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Analyzed Logical Plan ==
region: string, product: string, total_sales: bigint
Aggregate [region#1, product#4], [region#1, product#4, sum(amount#6L) AS total_sales#1172L]
+- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Optimized Logical Plan ==
Aggregate [region#1, product#4], [region#1, product#4, sum(amount#6L) AS total_sales#1172L]
+- Project [region#1, product#4, amount#6L]
   +- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[region#1, product#4], functions=[sum(amount#6L)], output=[region#1, product#4, total_sales#1172L])
   +- Exchange hashpartitioning(region#1, product#4, 200), ENSURE_REQUIREM

In [71]:
df_sales.withColumn("running_total", sum("amount").over(Window.partitionBy("region").orderBy("sale_date"))).explain(True)

== Parsed Logical Plan ==
'Project [unresolvedstarwithcolumns(running_total, 'sum('amount) windowspecdefinition('region, 'sale_date ASC NULLS FIRST, unspecifiedframe$()), None)]
+- LogicalRDD [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L], false

== Analyzed Logical Plan ==
txn_id: string, region: string, city: string, store_id: string, product: string, sale_date: string, amount: bigint, running_total: bigint
Project [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L, running_total#1183L]
+- Project [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L, running_total#1183L, running_total#1183L]
   +- Window [sum(amount#6L) windowspecdefinition(region#1, sale_date#5 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS running_total#1183L], [region#1], [sale_date#5 ASC NULLS FIRST]
      +- Project [txn_id#0, region#1, city#2, store_id#3, product#4, sale_date#5, amount#6L]
   

# 2. Identify:
Shuffles
Exchanges
Sorts

1. Shuffles

What it means: A shuffle occurs when data needs to be redistributed across partitions, usually between different nodes in the cluster.
Why it happens: Operations that require grouping or joining data by key (e.g., groupBy, reduceByKey, join) trigger shuffles.
Impact: Shuffles are expensive because they involve disk I/O, network transfer, and serialization.

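Examples of operations causing shuffles:

groupByKey() and reduceByKey() (RDD API)
groupBy().agg() (DataFrame API)
join() (unless one side is small enough to be broadcast)
distinct()
repartition()
Window functions with partitionBy()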


2. Exchanges

What it means: An exchange is the physical execution step in Spark where data is moved between partitions during a shuffle.
Relation to shuffle: The exchange is the actual mechanism Spark uses to implement a shuffle in the execution plan.
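Seen in: The physical plan printed by explain(), where an Exchange node appears for each operation that requires data movement. The GroupBy plan above, for example, contains Exchange hashpartitioning(region#1, product#4, 200).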


3. Sorts

What it means: Sorting arranges data in a specific order within partitions or globally.
Types:

Local Sort: Sort within each partition (e.g., sortWithinPartitions()).
Global Sort: Requires shuffling to ensure total ordering (e.g., orderBy()).


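Impact: A global sort is expensive because Spark must first range-partition the data across the cluster (a shuffle) and then sort each partition. A local sort stays within the existing partitions and needs no shuffle.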

Examples of operations causing sorts:

orderBy()
sort()
sortWithinPartitions()


🔍 How to Identify Them in PySpark

Use df.explain(True) to print the logical and physical plans of a DataFrame, or rdd.toDebugString() to print the lineage of an RDD.
Look for:

Exchange → indicates shuffle.
Sort → indicates sorting stage.
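
One way to apply this check programmatically is to scan the plan text for these node names. The sketch below is a plain-Python helper (count_plan_nodes is a hypothetical name, not a Spark API) run against a hand-written, abbreviated plan string that stands in for real explain() output. In a live session the text can usually be captured by wrapping df.explain() in contextlib.redirect_stdout, since PySpark prints the plan from Python.

```python
import re


def count_plan_nodes(plan_text, node_names=("Exchange", "Sort")):
    """Count plan lines whose operator name is one of node_names."""
    counts = {name: 0 for name in node_names}
    for line in plan_text.splitlines():
        # Strip tree-drawing characters and codegen markers such as "+- *(2) ".
        operator = re.sub(r"^[\s:+\-|]*(\*\(\d+\)\s*)?", "", line)
        for name in node_names:
            # Matches "Sort [" or "Exchange hash..." but not "SortMergeJoin".
            if re.match(rf"{name}\b", operator):
                counts[name] += 1
    return counts


# Hand-written, abbreviated stand-in for a window-function physical plan.
sample_plan = """\
Window [sum(amount) AS running_total], [region], [sale_date ASC]
+- *(2) Sort [region ASC, sale_date ASC], false, 0
   +- Exchange hashpartitioning(region, 200)
      +- *(1) Scan ExistingRDD[txn_id,region,amount]
"""

counts = count_plan_nodes(sample_plan)
print(counts)  # {'Exchange': 1, 'Sort': 1}
```

Operators such as SortMergeJoin or BroadcastExchange are deliberately not counted, because the helper matches only operator names that start with exactly Exchange or Sort.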





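# 3. Explain why window functions introduce sorting

Window functions introduce sorting because Spark evaluates a window by scanning the rows of each group in order. Functions such as rank(), row_number(), lead(), and lag() are defined only relative to that order.

partitionBy in the window spec → Spark adds an Exchange (hash partitioning on the partition columns) so that all rows of a group land in the same partition, unless the data is already partitioned that way.
orderBy in the window spec → Spark adds a local Sort on the partition columns followed by the order columns before the Window operator runs.

The sort is local to each partition, not a global sort. It guarantees that each group's rows are contiguous and arrive in the sequence the window calculation expects.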
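
The same pipeline can be mimicked in plain Python to show why the sort cannot be skipped. The sketch below is only a model, not how Spark is implemented: it groups a few rows from the dataset by store_id (standing in for the Exchange), sorts each group by amount in descending order (the Sort), and only then numbers the rows (the Window step, here row_number). The helper name row_number_per_store is hypothetical.

```python
from collections import defaultdict

# (txn_id, store_id, amount) for Store-01 and Store-02, taken from sales_data.
rows = [
    ("T001", "Store-01", 75000),
    ("T002", "Store-01", 32000),
    ("T003", "Store-02", 26000),
    ("T013", "Store-01", 76000),
    ("T017", "Store-02", 31000),
    ("T021", "Store-01", 28000),
]


def row_number_per_store(rows):
    # Step 1 (stand-in for Exchange): bring rows with the same key together.
    groups = defaultdict(list)
    for txn_id, store_id, amount in rows:
        groups[store_id].append((txn_id, amount))

    numbered = []
    for store_id in sorted(groups):
        # Step 2 (Sort): order rows inside the group, highest amount first.
        ordered = sorted(groups[store_id], key=lambda r: r[1], reverse=True)
        # Step 3 (Window): a row's number is its position after sorting.
        for position, (txn_id, amount) in enumerate(ordered, start=1):
            numbered.append((txn_id, store_id, amount, position))
    return numbered


result = row_number_per_store(rows)
for row in result:
    print(row)
```

The printed rows match the first six rows of the row_number output in the window-function exercise above (T013, T001, T002, T021 for Store-01, then T017, T003 for Store-02).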