In [0]:
# Spark Session
from pyspark.sql import SparkSession
spark = (
        SparkSession
        .builder
        .appName("AQE in Spark")
        .master("local[*]")
        .config("spark.executor.cores",4)
        .config("spark.cores.max",8)
        .config("spark.executor.memory", "512M")
        .config("spark.sql.adaptive.enabled", True) # Enable Adaptive Query Execution (AQE). NOTE: AQE was introduced in Spark 3.0 and is enabled by default since Spark 3.2.0, so setting True is optional on recent versions
        .config("spark.sql.adaptive.coalescePartitions.enabled", True)
        .getOrCreate()
)
spark


In [0]:
print(spark.conf.get("spark.sql.adaptive.enabled"))

True


In [0]:
# Coalescing post-shuffle partitions --> removes unnecessary shuffle partitions by merging small ones
# Skew join optimization (balances partition sizes) --> merges smaller partitions and/or splits bigger partitions

# spark.conf.set("spark.sql.adaptive.enabled", True)
# spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True)


In [0]:

# Fix partition sizes to avoid skew (deliberately small values for this demo)
# NOTE: spark.conf.set() is required here; spark.conf.get(key, default) only returns the fallback value and changes nothing
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "1MB") # Default value: 64MB
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "2MB") # Default value: 256MB. A partition is a skew candidate only if its size exceeds 2MB

> NOTE: In many cases we don't need to modify skewedPartitionThresholdInBytes.
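
The threshold is only half of the skew test. AQE treats a partition as skewed when it is larger than `skewedPartitionThresholdInBytes` and also larger than `spark.sql.adaptive.skewJoin.skewedPartitionFactor` (default 5) times the median partition size. The sketch below illustrates that rule in plain Python. The function name `find_skewed_partitions` and the sample sizes are made up for illustration, and the median calculation is simplified compared with Spark's internals.

```python
from statistics import median

MB = 1024 * 1024

def find_skewed_partitions(sizes_bytes, threshold_bytes=256 * MB, factor=5.0):
    """Return indices of partitions that pass both skew checks.

    Hypothetical helper: mirrors the documented rule, not Spark's actual code.
    """
    if not sizes_bytes:
        return []
    med = median(sizes_bytes)
    return [
        i for i, size in enumerate(sizes_bytes)
        if size > threshold_bytes and size > factor * med
    ]

# Made-up partition sizes: four small partitions and one large one.
sizes = [1 * MB, 1 * MB, 2 * MB, 1 * MB, 40 * MB]

print(find_skewed_partitions(sizes, threshold_bytes=2 * MB))  # [4]
print(find_skewed_partitions(sizes))                          # [] with the 256MB default
```

With the 2MB demo threshold the 40MB partition is flagged; with the 256MB default nothing is, which is why the threshold is lowered in this notebook.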

In [0]:
print(spark.conf.get("spark.sql.adaptive.advisoryPartitionSizeInBytes","1MB"))
print(spark.conf.get("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes","2MB"))

1MB
2MB


In [0]:
# Read EMP CSV file with 10 million records
emp_schema = "first_name string, last_name string, job_title string, dob date, email string, phone string, salary double, department string, department_id integer"
emp = spark.read.schema(emp_schema).option("header",True).csv("/data/input/datasets/employee_recs.csv")

In [0]:
# Read DEPT CSV file with 10 records
dept_schema = "department_id int, department_name string, description string, city string, state string, country string"
dept = spark.read.schema(dept_schema).option("header",True).csv("/data/input/datasets/department_recs.csv")

In [0]:
# JOINING datasets
dfjoined = emp.join(dept, on='department_id', how="left_outer")
dfjoined.write.format("noop").mode("overwrite").save()

In [0]:
dfjoined.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [department_id#132, first_name#124, last_name#125, job_title#126, dob#127, email#128, phone#129, salary#130, department#131, department_name#143, description#144, city#145, state#146, country#147]
   +- BroadcastHashJoin [department_id#132], [department_id#142], LeftOuter, BuildRight, false, true
      :- FileScan csv [first_name#124,last_name#125,job_title#126,dob#127,email#128,phone#129,salary#130,department#131,department_id#132] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[dbfs:/data/input/datasets/employee_recs.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<first_name:string,last_name:string,job_title:string,dob:date,email:string,phone:string,sal...
      +- Exchange SinglePartition, EXECUTOR_BROADCAST, [plan_id=260]
         +- Filter isnotnull(department_id#142)
            +- FileScan csv [department_id#142,department_name#143,description#144,city#145,stat

- In the Spark UI, the '*Total succeeded*' task count in the Jobs tab shows significantly fewer shuffle partitions, because Spark coalesces unnecessary shuffle partitions after the shuffle.
- AQE handles skew automatically when it creates the post-shuffle partitions, so no spill occurs in this run.
- From Spark 3.0 onward, if AQE is enabled and the partition sizes are set appropriately, Spark rebalances skewed partitions automatically, which avoids most spilling from memory to disk.
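
The coalescing described above can be pictured as a greedy merge of adjacent shuffle partitions up to the advisory size. The following is a minimal plain-Python sketch of that idea, not Spark's implementation: the name `coalesce_partitions` and the sample sizes are assumptions, and Spark applies extra rules (for example a minimum partition size) that are left out here.

```python
def coalesce_partitions(sizes_bytes, advisory_bytes):
    """Group adjacent partition indices so each group stays within advisory_bytes.

    Hypothetical helper: a partition that is already larger than the
    advisory size simply ends up in a group of its own.
    """
    groups, current, current_size = [], [], 0
    for idx, size in enumerate(sizes_bytes):
        # Close the current group if adding this partition would overshoot.
        if current and current_size + size > advisory_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups

KB = 1024
MB = 1024 * KB

# Made-up post-shuffle partition sizes.
sizes = [300 * KB, 200 * KB, 400 * KB, 900 * KB, 100 * KB, 50 * KB]

print(coalesce_partitions(sizes, 1 * MB))  # [[0, 1, 2], [3, 4], [5]]
```

Six small partitions collapse into three tasks here, which is the kind of drop in task count the Jobs tab shows.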


In [0]:
# Converting JOIN to Broadcast Join

spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "10MB")
print(spark.conf.get("spark.sql.adaptive.autoBroadcastJoinThreshold"))


10MB


> IMPORTANT: With AQE enabled you usually don't need to add an explicit broadcast hint on the smaller DataFrame. AQE can switch to a broadcast join at runtime when the smaller side turns out to be below the threshold.

In [0]:
# JOINING datasets
dfjoined = emp.join(dept, on='department_id', how="left_outer")
dfjoined.write.format("noop").mode("overwrite").save()

In [0]:
dfjoined.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(2) Project [department_id#132, first_name#124, last_name#125, job_title#126, dob#127, email#128, phone#129, salary#130, department#131, department_name#143, description#144, city#145, state#146, country#147]
   +- *(2) BroadcastHashJoin [department_id#132], [department_id#142], LeftOuter, BuildRight, false, true
      :- FileScan csv [first_name#124,last_name#125,job_title#126,dob#127,email#128,phone#129,salary#130,department#131,department_id#132] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[dbfs:/data/input/datasets/employee_recs.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<first_name:string,last_name:string,job_title:string,dob:date,email:string,phone:string,sal...
      +- ShuffleQueryStage 0, Statistics(sizeInBytes=1144.0 B, rowCount=10, isRuntime=true)
         +- Exchange SinglePartition, EXECUTOR_BROADCAST, [plan_id=505]
            +- *(1) 
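
Rather than reading the plan by eye, the join strategy and the `isFinalPlan` flag can be pulled out of the plan text with a small parser. This is a sketch under assumptions: the helper names are made up, `sample_plan` is an abbreviated copy of the output above, and the plan text is assumed to be available as a string (for instance by capturing the printed output of `dfjoined.explain()`).

```python
import re

# Physical join operators that can appear in a Spark plan.
JOIN_OPERATORS = (
    "BroadcastHashJoin",
    "SortMergeJoin",
    "ShuffledHashJoin",
    "BroadcastNestedLoopJoin",
    "CartesianProduct",
)

def join_strategies(plan_text):
    """Return every join operator name found in the plan text, in order."""
    pattern = r"\b(" + "|".join(JOIN_OPERATORS) + r")\b"
    return re.findall(pattern, plan_text)

def is_final_plan(plan_text):
    """True if the AdaptiveSparkPlan header reports isFinalPlan=true."""
    match = re.search(r"isFinalPlan=(true|false)", plan_text)
    return match is not None and match.group(1) == "true"

# Abbreviated excerpt of the plan shown above (column lists shortened).
sample_plan = """== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(2) Project [department_id#132, first_name#124]
   +- *(2) BroadcastHashJoin [department_id#132], [department_id#142], LeftOuter, BuildRight, false, true
"""

print(join_strategies(sample_plan))  # ['BroadcastHashJoin']
print(is_final_plan(sample_plan))    # True
```

If the plan text also contains an initial plan section, the same operator may be listed more than once.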

> IMPORTANT:
- **Dynamic Optimization**: AQE allows Spark to optimize query plans dynamically at runtime based on actual data statistics.
- **Improves Performance**: It helps in optimizing joins, skew handling, and partitioning, leading to faster query execution.
- **Automatic Adjustments**: Adjusts execution strategies such as join types and shuffle partitions without manual intervention.
- **Enables Better Resource Utilization**: Enhances resource efficiency by adapting to data characteristics during execution.
- Adaptive Query Execution (AQE) was introduced in Spark 3.0 and is enabled by default from Spark 3.2.0 onward.