In [0]:
# Spark Session
from pyspark.sql import SparkSession
spark = (
        SparkSession
        .builder
        .appName("Optimizing Shuffles")
        .config("spark.executor.cores", 4)
        .config("spark.cores.max", 16)
        .config("spark.executor.memory", '512M')
        .getOrCreate()
)
spark


> IMP: On a Spark Standalone master, the default parallelism is expected to be 16, matching the 'spark.cores.max' value set in the session config above.

In [0]:
master_url = spark.sparkContext.master
if master_url.startswith("spark://"):
    print("Spark Master URL:", master_url)
else:
    print("Not connected to a Spark standalone master.")

Not connected to a Spark standalone master.


- We are not connected to a Spark standalone master; the session runs on a local master, local[8].
- When Spark runs locally (e.g., in local[*] mode), the default parallelism is typically the number of threads given in the master URL (all available cores for local[*]).
- With local[8], Spark sets the default parallelism to 8, as shown below.
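
The relationship between the local master URL and the default parallelism can be sketched in plain Python. This is only an illustration: the helper name expected_local_parallelism is hypothetical (it is not a Spark API), and it assumes 'spark.default.parallelism' has not been set explicitly.

```python
import os
import re


def expected_local_parallelism(master: str) -> int:
    """Hypothetical helper: thread count implied by a local master URL."""
    if master == "local":
        return 1  # plain "local" runs with a single worker thread
    # Accepts local[N], local[*] and the local[N, maxFailures] variants
    match = re.fullmatch(r"local\[(\*|\d+)(?:,\s*\d+)?\]", master)
    if match is None:
        raise ValueError(f"not a local master URL: {master!r}")
    threads = match.group(1)
    # "*" means one thread per logical core on this machine
    return (os.cpu_count() or 1) if threads == "*" else int(threads)


print(expected_local_parallelism("local[8]"))
```

For the local[8] master used in this notebook the helper returns 8, which is the value reported by spark.sparkContext.defaultParallelism in the cells that follow.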


In [0]:
print(spark.sparkContext.master)

local[8]


In [0]:
# Check default parallelism
spark.sparkContext.defaultParallelism

Out[14]: 8

In [0]:
# Disable Adaptive Query Execution (AQE) and Broadcast Join
spark.conf.set("spark.sql.adaptive.enabled", False)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", False)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

In [0]:
# Read EMP CSV file with 10 million records
_schema = "first_name string, last_name string, job_title string, dob date, email string, phone string, salary double, department string, department_id integer"
emp = spark.read.option("header",True).schema(_schema).csv("/data/input/datasets/employee_records.csv")

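IMPORTANT NOTE:
- Option values: Spark stores reader options as strings. PySpark's option() also accepts Python booleans and numbers and converts them, so option("header", True) and option("header", "true") behave the same.
- Passing the string form, option("header", "true"), keeps the code consistent with the Scala/Java API and with how the options appear in the documentation.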
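
As far as we understand PySpark's reader, a boolean option value and its string spelling end up as the same string option. The sketch below mimics that conversion in plain Python; option_value_to_str is a hypothetical stand-in written for illustration, not a PySpark function.

```python
def option_value_to_str(value):
    """Hypothetical stand-in for the string conversion applied to option values."""
    if isinstance(value, bool):
        return str(value).lower()  # True -> "true", False -> "false"
    if value is None:
        return None  # unset options stay unset
    return str(value)  # numbers and strings pass through str()


print(option_value_to_str(True))
print(option_value_to_str("true"))
```

Both calls print the same lowercase string, which is why either spelling reads the header row.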

In [0]:
# Find out AVG Salary as per Department
from pyspark.sql.functions import avg
emp_avg = emp.groupBy("department_id").agg(avg("salary").alias("avg_sal"))


In [0]:
# Write data for Performance Benchmarking
emp_avg.write.format("noop").mode("overwrite").save()

In [0]:
from pyspark.sql.functions import spark_partition_id

emp.withColumn("partition_id", spark_partition_id()).where("partition_id=0").show(truncate=False)

+----------+---------+-----------------+----------+------------------------------+---------------+------------------+------------------+-------------+------------+
|first_name|last_name|job_title        |dob       |email                         |phone          |salary            |department        |department_id|partition_id|
+----------+---------+-----------------+----------+------------------------------+---------------+------------------+------------------+-------------+------------+
|Jennifer  |Williams |HR Specialist    |1951-01-21|Jennifer.Williams.@example.com|+1-845-311-804 |42951.90537045701 |Finance           |6            |0           |
|James     |Miller   |Sales Executive  |1939-09-25|James.Miller.@example.com     |+1-274-633-7306|50933.8591162336  |Data and Analytics|6            |0           |
|Linda     |Jones    |Data Scientist   |2023-05-26|Linda.Jones.@example.com      |+1-149-733-8924|66274.49226944339 |Data and Analytics|2            |0           |
|Srishti   |Smit

In [0]:
# Check Spark Shuffle Partition setting
spark.conf.get("spark.sql.shuffle.partitions")

Out[21]: '200'

- Let's reduce the shuffle partitions to an appropriate number to optimize the job.
- With the default of 200 shuffle partitions, the job takes a long time.
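
One reason 200 partitions can be wasteful for this aggregation: after grouping, each department_id lands in exactly one shuffle partition, so most of the 200 partitions stay empty while Spark still schedules a task for each of them (AQE is disabled here). The sketch below is a plain-Python illustration under two assumptions: the dataset has 10 distinct departments (a hypothetical number), and Python's hash() stands in for Spark's own Murmur3-based partitioner.

```python
def non_empty_partitions(keys, num_partitions):
    """Count shuffle partitions that receive at least one grouping key."""
    # Stand-in partitioner: Spark uses a Murmur3-based hash, not Python's hash()
    return len({hash(key) % num_partitions for key in keys})


department_ids = range(1, 11)  # hypothetical: 10 distinct departments
for n in (200, 16):
    used = non_empty_partitions(department_ids, n)
    print(f"{n} shuffle partitions -> {used} non-empty, {n - used} empty")
```

Whatever the hash function, the number of non-empty partitions cannot exceed the number of distinct keys, so a low-cardinality grouping key leaves most of the 200 tasks with nothing to do.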

In [0]:
spark.conf.set("spark.sql.shuffle.partitions", 16)
spark.conf.get("spark.sql.shuffle.partitions")

Out[25]: '16'
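
A rough way to reason about the value 16 is to count how many scheduling rounds ("waves") the shuffle stage needs. The sketch below assumes one task slot per core (the default when 'spark.task.cpus' is left at 1); task_waves is a hypothetical helper, not a Spark API.

```python
import math


def task_waves(num_partitions: int, task_slots: int) -> int:
    """Scheduling rounds needed to run one task per shuffle partition."""
    return math.ceil(num_partitions / task_slots)


# 8 slots for the local[8] master, 16 for the standalone config (spark.cores.max)
for slots in (8, 16):
    for partitions in (200, 16):
        waves = task_waves(partitions, slots)
        print(f"{partitions} partitions on {slots} slots -> {waves} wave(s)")
```

With 16 partitions the stage finishes in one or two waves, while 200 partitions need 13 to 25 waves of mostly tiny tasks.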

In [0]:
# Re-Write data for Performance Benchmarking
emp_avg.write.format("noop").mode("overwrite").save()

> Now, let's see whether reading partitioned data improves performance.

In [0]:
# Write the DataFrame partitioned by department_id
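# Write a header row so that the read below (header = true) does not drop the first record of each file
emp.write.mode("overwrite").option("header", "true").partitionBy("department_id").csv("/data/input/emp_partitioned.csv")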

In [0]:
# Read the Partitioned Data
emp_partitioned = spark.read.schema(_schema).option("header",'true').csv("/data/input/emp_partitioned.csv")
emp_partitioned_avg = emp_partitioned.groupBy("department_id").agg(avg("salary").alias("avg_sal"))

# This is expected to take less time in the Spark UI, since the data is already partitioned by the grouping key

In [0]:
# Write data for Performance Benchmarking
emp_partitioned_avg.write.format("noop").mode("overwrite").save()

> Reading data that is already partitioned on the grouping key can reduce the cost of the shuffle; compare the stages in the Spark UI to confirm the benefit for your data.
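
To make the partitioned layout concrete: partitionBy("department_id") writes one directory per key value, named department_id=<value>, under the output path. The plain-Python sketch below builds those directory names and shows which ones a reader would still need to scan when a query filters on the partition column. The helpers partition_dir and prune are hypothetical, and the 10 department ids are an assumed range.

```python
from pathlib import PurePosixPath

BASE = PurePosixPath("/data/input/emp_partitioned.csv")


def partition_dir(department_id: int) -> PurePosixPath:
    """Directory that partitionBy("department_id") creates for one key value."""
    return BASE / f"department_id={department_id}"


def prune(department_ids, wanted):
    """Directories left to scan after a filter on the partition column."""
    return [partition_dir(d) for d in department_ids if d in wanted]


all_ids = range(1, 11)  # hypothetical: 10 departments
for path in prune(all_ids, wanted={6}):
    print(path)
```

A filter such as department_id = 6 touches a single directory instead of all of them, which is where this layout tends to pay off most clearly.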

**IMPORTANT NOTE for OPTIMIZATION**
- **Good Shuffle**: Shuffles are costly, so avoid them unless the operation genuinely requires one.
- **Repartitioning**: Ensure your data is partitioned appropriately for the workload.
- **Filter**: Apply filters as early as possible, before aggregations or other shuffle operations, so that unnecessary rows are never shuffled.
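
The effect of filtering early can be sketched without a cluster. The plain-Python simulation below models map-side partial aggregation for an average: each input partition emits one (sum, count) record per department it contains, and those records are what cross the shuffle. The toy data (4 input partitions, 10 departments) and the helper name shuffle_records are hypothetical.

```python
from collections import defaultdict


def shuffle_records(partitions, keep=None):
    """Count partial-aggregate records emitted to the shuffle by all input partitions."""
    emitted = 0
    for rows in partitions:
        partial = defaultdict(lambda: [0.0, 0])  # dept_id -> [salary_sum, row_count]
        for dept_id, salary in rows:
            if keep is not None and dept_id not in keep:
                continue  # early filter: the row never reaches the aggregation
            acc = partial[dept_id]
            acc[0] += salary
            acc[1] += 1
        emitted += len(partial)  # one shuffle record per key per input partition
    return emitted


# Hypothetical toy data: 4 input partitions, 3 rows for each of 10 departments
partitions = [
    [(d, 1000.0 * d) for d in range(1, 11) for _ in range(3)]
    for _ in range(4)
]

print(shuffle_records(partitions))               # no filter
print(shuffle_records(partitions, keep={2, 6}))  # filter before aggregating
```

In this toy setup the unfiltered run emits 40 shuffle records and the filtered run emits 8, so filtering before the aggregation shrinks the shuffle in proportion to the keys that survive.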
