<h1><center>Understanding Spark Query Plans</center></h1>
<hr><hr>
<center><img src="./images/000-spark-execution.png"></center>

Reference: <a href="https://www.youtube.com/watch?v=KnUXztKueMU&list=PLWAuYt0wgRcLCtWzUxNg4BjnYlCZNEVth" target="_blank">Master Reading Spark Query Plans - Afaque Ahmad</a>

## Imports and Configs:
------------------------

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *

spark = SparkSession.builder.master("local[*]").appName("001-reading spark query plans").getOrCreate()

In [3]:
sc = spark.sparkContext
sc.setLogLevel("ERROR")

## Reading Files:
---------------------------

In [4]:
transactions_file = "./data/data_skew/transactions.parquet"
transactions_df = spark.read.parquet( transactions_file )

In [5]:
transactions_df.show(10)

+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+------------+
|   cust_id|start_date|  end_date|         txn_id|      date|year|month|day| expense_type|   amt|        city|
+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+------------+
|C0YDPQWPBJ|2010-07-01|2018-12-01|TZ5SMKZY9S03OQJ|2018-10-07|2018|   10|  7|Entertainment| 10.42|      boston|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TYIAPPNU066CJ5R|2016-03-27|2016|    3| 27| Motor/Travel| 44.34|    portland|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TETSXIK4BLXHJ6W|2011-04-11|2011|    4| 11|Entertainment|  3.18|     chicago|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TQKL1QFJY3EM8LO|2018-02-22|2018|    2| 22|    Groceries|268.97| los_angeles|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TYL6DFP09PPXMVB|2010-10-16|2010|   10| 16|Entertainment|  2.66|     chicago|
|C0YDPQWPBJ|2010-07-01|2018-12-01|T1SMX9EUG21BBSE|2015-02-11|2015|    2| 11|    Education| 54.14|    portland|
|

In [6]:
transactions_df.count()

39790092

In [7]:
customers_file = "./data/data_skew/customers.parquet"
customers_df = spark.read.parquet( customers_file )

In [9]:
customers_df.show( 10, truncate=False )

+----------+-------------+---+------+----------+-----+------------+
|cust_id   |name         |age|gender|birthday  |zip  |city        |
+----------+-------------+---+------+----------+-----+------------+
|C007YEYTX9|Aaron Abbott |34 |Female|7/13/1991 |97823|boston      |
|C00B971T1J|Aaron Austin |37 |Female|12/16/2004|30332|chicago     |
|C00WRSJF1Q|Aaron Barnes |29 |Female|3/11/1977 |23451|denver      |
|C01AZWQMF3|Aaron Barrett|31 |Male  |7/9/1998  |46613|los_angeles |
|C01BKUFRHA|Aaron Becker |54 |Male  |11/24/1979|40284|san_diego   |
|C01RGUNJV9|Aaron Bell   |24 |Female|8/16/1968 |86331|denver      |
|C01USDV4EE|Aaron Blair  |35 |Female|9/9/1974  |80078|new_york    |
|C01WMZQ7PN|Aaron Brady  |51 |Female|8/20/1994 |52204|philadelphia|
|C021567NJZ|Aaron Briggs |57 |Male  |3/10/1990 |22008|philadelphia|
|C023M6MKR3|Aaron Bryan  |29 |Male  |4/10/1976 |05915|philadelphia|
+----------+-------------+---+------+----------+-----+------------+
only showing top 10 rows



In [10]:
customers_df.count()

5000

## Narrow Transformations:
----------------------------
- filter rows where `city='boston'`
- add new columns: `first_name` and `last_name`
- alter an existing column: add 5 to the `age` column
- `select` relevant columns

In [21]:
narrow_trans_df = ( 
    customers_df.filter( F.col("city") == "boston" )
                .withColumn( "first_name", F.split( "name", " " )[0] )
                .withColumn( "last_name", F.split( "name", " " )[1] )
                .withColumn( "age", F.col("age") + F.lit(5) )
                .select( ["cust_id", "first_name", "last_name", "age", "gender", "birthday"] ) 
)

narrow_trans_df.show(10)

+----------+----------+---------+----+------+---------+
|   cust_id|first_name|last_name| age|gender| birthday|
+----------+----------+---------+----+------+---------+
|C007YEYTX9|     Aaron|   Abbott|39.0|Female|7/13/1991|
|C08XAQUY73|     Aaron|  Lambert|59.0|Female|11/5/1966|
|C094P1VXF9|     Aaron|  Lindsey|29.0|  Male|9/21/1990|
|C097SHE1EF|     Aaron|    Lopez|27.0|Female|4/18/2001|
|C0DTC6436T|     Aaron| Schwartz|57.0|Female| 7/9/1962|
|C0R42FPHRH|     Abbie|    Reyes|68.0|  Male|10/8/1995|
|C0RZV4BH7T|     Abbie|Stevenson|41.0|  Male|2/10/1971|
|C0U9RV3VBE|       Ada|  Andrews|47.0|  Male|6/10/1961|
|C0XNANAD6L|       Ada|   Harper|57.0|  Male|4/16/1996|
|C1869HFVF8|      Adam|    Clark|35.0|Female|7/17/1972|
+----------+----------+---------+----+------+---------+
only showing top 10 rows



In [22]:
narrow_trans_df.explain(True)

== Parsed Logical Plan ==
'Project ['cust_id, 'first_name, 'last_name, 'age, 'gender, 'birthday]
+- Project [cust_id#84, name#85, (cast(age#86 as double) + cast(5 as double)) AS age#618, gender#87, birthday#88, zip#89, city#90, first_name#599, last_name#608]
   +- Project [cust_id#84, name#85, age#86, gender#87, birthday#88, zip#89, city#90, first_name#599, split(name#85,  , -1)[1] AS last_name#608]
      +- Project [cust_id#84, name#85, age#86, gender#87, birthday#88, zip#89, city#90, split(name#85,  , -1)[0] AS first_name#599]
         +- Filter (city#90 = boston)
            +- Relation [cust_id#84,name#85,age#86,gender#87,birthday#88,zip#89,city#90] parquet

== Analyzed Logical Plan ==
cust_id: string, first_name: string, last_name: string, age: double, gender: string, birthday: string
Project [cust_id#84, first_name#599, last_name#608, age#618, gender#87, birthday#88]
+- Project [cust_id#84, name#85, (cast(age#86 as double) + cast(5 as double)) AS age#618, gender#87, birthday#88, 

### Explanation:
--------------------
- The Parsed Logical Plan is the Unresolved Logical Plan (step 1): column names have not yet been checked against anything.
- The Analyzed Logical Plan is the Logical Plan after Spark has resolved those names and their data types using the Catalog.
- The Physical Plan, shown at the end of the `explain(True)` output, is the one that will be executed on the cluster:
-----------------------------------------------------------------------------------------------------
<img src="./images/001-narrow_transform_pjysical_plan.jpg">

- The Physical Plan's execution steps are read from bottom (most-indented line) to top (least-indented line); each step below the top one starts with the `+-` symbol.
- Based on the above plan:
    1. The parquet file is first scanned from the data source.
    2. Spark converts the data from columnar format (parquet is a columnar format) to row format, because the operators that follow (filter and projection) process the data row by row.
    3. It then filters the records, as we applied the `city = 'boston'` filter.
    4. The remaining transformations are then handled in a single projection.
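
The bottom-to-top reading rule can be sketched in plain Python. This is only a rough helper for linear plans (no joins); the function name `execution_order` and the abbreviated plan text are made up for illustration and are not copied from a real run.

```python
def execution_order(physical_plan):
    """Return operator names in the order they run (bottom line first).

    Assumes a linear plan: each child sits one line below its parent
    and is prefixed with '+- '.
    """
    operators = []
    for line in physical_plan.strip().splitlines():
        text = line.strip()
        # Drop the '+- ' tree marker, if present.
        if text.startswith("+- "):
            text = text[3:]
        # Drop the '*(n) ' whole-stage codegen prefix, if present.
        if text.startswith("*("):
            text = text.split(") ", 1)[1]
        # Keep only the operator name (the first word).
        operators.append(text.split(" ", 1)[0].split("(")[0])
    return list(reversed(operators))


# Illustrative, abbreviated plan text (hypothetical, not real explain() output).
sample_plan = """
*(1) Project [cust_id, first_name, last_name, age, gender, birthday]
+- *(1) Filter (city = boston)
   +- *(1) ColumnarToRow
      +- FileScan parquet [cust_id, name, age, gender, birthday, city]
"""

steps = execution_order(sample_plan)
print(steps)  # ['FileScan', 'ColumnarToRow', 'Filter', 'Project']
```

The printed order matches the four steps listed above: scan, columnar-to-row conversion, filter, projection.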

## Wide Transformations:
-----------------------------
- Repartition
- Coalesce
- Joins
- GroupBy
    - count
    - countDistinct
    - sum

### 1. Repartition:
-----------------------
- Increases or decreases the number of partitions of a DataFrame (or RDD) and redistributes the data across the executors.

In [24]:
transactions_df.rdd.getNumPartitions()

12

In [25]:
transactions_df.repartition( 24 ).explain(True)

== Parsed Logical Plan ==
Repartition 24, true
+- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet

== Analyzed Logical Plan ==
cust_id: string, start_date: string, end_date: string, txn_id: string, date: string, year: string, month: string, day: string, expense_type: string, amt: string, city: string
Repartition 24, true
+- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet

== Optimized Logical Plan ==
Repartition 24, true
+- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange RoundRobinPartitioning(24), REPARTITION_BY_NUM, [plan_id=340]
   +- FileScan parquet [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] Batched: true, DataFilters: [], Format: Parquet, Locati

#### Repartition Physical Plan Explanation:
--------------------------------------------------
<img src="./images/002-repartition_physical_plan.jpg">

- The word `Exchange` in the query plan from `.explain()` means that a shuffle occurs at that stage.
- `RoundRobinPartitioning(24)` is the scheme (logic) used to partition the data; its parameter is the number of resulting partitions. In this scheme, rows are distributed as follows (in this case):
    - 1st row/record is sent to 1st partition,
    - 2nd row/record is sent to 2nd partition,
    - 3rd row/record is sent to 3rd partition, ...
    - 24th row/record is sent to 24th partition,
    - and then the 25th row/record is sent to the 1st partition again, and so on.
- `AdaptiveSparkPlan isFinalPlan=false` means the query hasn't run yet, so the displayed Physical Plan is not necessarily the final one. Because AQE (Adaptive Query Execution) is on (it is enabled by default in recent Spark versions), Spark can switch to a different physical plan based on runtime statistics.
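
The round-robin idea can be sketched in plain Python as below. This is a conceptual model only, not Spark's code: the function `round_robin_partition` and its `start` parameter are made up for illustration, and the real implementation may begin at a partition other than the first.

```python
def round_robin_partition(records, num_partitions, start=0):
    """Distribute records to partitions in cyclic order.

    `start` is the index of the partition that receives the first record.
    """
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[(start + i) % num_partitions].append(record)
    return partitions


rows = list(range(1, 51))  # 50 hypothetical records, numbered 1..50
parts = round_robin_partition(rows, 24)

print(parts[0])   # [1, 25, 49] -> 1st partition gets the 1st, 25th and 49th record
print(parts[23])  # [24, 48]    -> 24th partition gets the 24th and 48th record
```

Because records are dealt out one at a time, partition sizes differ by at most one record, which is why this scheme gives evenly sized partitions.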

### 2. Coalesce:
-----------------
- **Coalesce is better than Repartition for reducing the number of partitions** because it tries to avoid shuffling and data transfer across executors (it shuffles only in edge cases).
- Where possible, it merges multiple partitions that live on the same executor until the requested number of partitions is reached.

In [26]:
transactions_df.rdd.getNumPartitions()

12

In [27]:
transactions_df.coalesce(1).explain(True)

== Parsed Logical Plan ==
Repartition 1, false
+- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet

== Analyzed Logical Plan ==
cust_id: string, start_date: string, end_date: string, txn_id: string, date: string, year: string, month: string, day: string, expense_type: string, amt: string, city: string
Repartition 1, false
+- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet

== Optimized Logical Plan ==
Repartition 1, false
+- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet

== Physical Plan ==
Coalesce 1
+- *(1) ColumnarToRow
   +- FileScan parquet [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/E:/Programs & Codes/apache_spark/my_pysp

#### Why doesn't .coalesce() explicitly show the partitioning scheme?
`.coalesce` doesn't show the partitioning scheme e.g. `RoundRobinPartitioning` because:

- The operation only merges existing partitions into fewer ones to minimize data movement; **it doesn't do any shuffling.**
- Because no shuffling is done, the partitioning scheme remains the same as that of the original DataFrame, so Spark doesn't show it explicitly in its plan.
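
A rough way to picture this in plain Python: coalesce glues whole partitions together instead of re-dealing individual records. The helper `coalesce_partitions` below is hypothetical and simply groups neighbouring partitions; how Spark actually picks which partitions to merge (for example, by locality) is not modelled here.

```python
def coalesce_partitions(partitions, target):
    """Merge whole partitions into `target` groups without splitting any of them."""
    # Coalesce cannot increase the number of partitions.
    target = min(target, len(partitions))
    merged = [[] for _ in range(target)]
    for index, partition in enumerate(partitions):
        # Each input partition is appended as a whole to one output partition.
        merged[index * target // len(partitions)].extend(partition)
    return merged


original = [[1, 2], [3], [4, 5, 6], [7], [8, 9], [10]]  # 6 hypothetical partitions

print(coalesce_partitions(original, 2))  # [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]
print(coalesce_partitions(original, 1))  # [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
```

Note that the resulting partitions can be uneven in size (6 and 4 records above), which is the trade-off for avoiding a shuffle.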

### 3. Joins:
------------------
- By default, Broadcast join is enabled. We can disable it with the following code: \
    `spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)`

- `spark.sql.autoBroadcastJoinThreshold` configures the maximum size in bytes of a table that will be broadcast to all worker nodes when performing a join. Its default value is 10MB. Setting this value to -1 disables broadcasting, and the joins then fall back to Sort-Merge Join.
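
The effect of the threshold can be sketched as a simple size check. This is a deliberately simplified model: the function `pick_join_strategy` is hypothetical, and Spark's real planner also considers join hints, the join type and other join strategies, none of which are modelled here.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # 10MB = 10485760 bytes


def pick_join_strategy(smaller_side_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Simplified rule: broadcast only if the smaller table fits under the threshold."""
    if 0 <= smaller_side_bytes <= threshold_bytes:
        return "BroadcastHashJoin"
    return "SortMergeJoin"


print(pick_join_strategy(2 * 1024 * 1024))                       # BroadcastHashJoin
print(pick_join_strategy(50 * 1024 * 1024))                      # SortMergeJoin
print(pick_join_strategy(2 * 1024 * 1024, threshold_bytes=-1))   # SortMergeJoin
```

With the threshold at -1 no table size can pass the check, which is why setting it to -1 turns broadcasting off.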

In [28]:
print( spark.conf.get("spark.sql.autoBroadcastJoinThreshold") )

10485760b


In [29]:
spark.conf.set( "spark.sql.autoBroadcastJoinThreshold", -1 )

In [30]:
print( spark.conf.get("spark.sql.autoBroadcastJoinThreshold") )

-1


In [31]:
joined_df = (
    transactions_df.join(
        customers_df,
        how="inner",
        on="cust_id"
    )
)

In [32]:
joined_df.explain(True)

== Parsed Logical Plan ==
'Join UsingJoin(Inner,Buffer(cust_id))
:- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet
+- Relation [cust_id#84,name#85,age#86,gender#87,birthday#88,zip#89,city#90] parquet

== Analyzed Logical Plan ==
cust_id: string, start_date: string, end_date: string, txn_id: string, date: string, year: string, month: string, day: string, expense_type: string, amt: string, city: string, name: string, age: string, gender: string, birthday: string, zip: string, city: string
Project [cust_id#0, start_date#1, end_date#2, txn_id#3, date#4, year#5, month#6, day#7, expense_type#8, amt#9, city#10, name#85, age#86, gender#87, birthday#88, zip#89, city#90]
+- Join Inner, (cust_id#0 = cust_id#84)
   :- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet
   +- Relation [cust_id#84,name#85,age#86,gender#87,birthday#88,zip#89,city#90] parquet

== O

#### SortMergeJoin Physical plan:
-------------------------------------
<img src="./images/003-sortmergejoin-physical-plan.jpg">


- `Filter isnotnull(cust_id#84)`: since `cust_id` is the join column, Spark filters out all rows where it is `null` as part of its own optimization (a `null` key can never match in an inner join).
- `+- Exchange hashpartitioning(cust_id#84, 200)` means the data is shuffled using hash partitioning on `cust_id` into 200 shuffle partitions.
- After the shuffle, each partition is sorted by the `cust_id` column.
- Spark does this for both DataFrames, then performs the SortMergeJoin operation, and finally the projection.
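
The filter, sort and merge steps can be sketched for a single partition in plain Python. This is a minimal illustration of the sort-merge technique, not Spark's implementation; the shuffle step is left out, and the function name, customer ids and names below are made up.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) pairs on key using sort + merge."""
    # Step 1: drop rows with a null key (mirrors `Filter isnotnull(cust_id)`).
    left = [row for row in left if row[0] is not None]
    right = [row for row in right if row[0] is not None]
    # Step 2: sort both sides by the join key.
    left.sort(key=lambda row: row[0])
    right.sort(key=lambda row: row[0])
    # Step 3: walk both sorted lists in step and emit matching pairs.
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row having this key with every right row in the run.
            while i < len(left) and left[i][0] == left_key:
                for k in range(j, j_end):
                    result.append((left_key, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return result


# Hypothetical sample rows: (cust_id, amt) and (cust_id, name).
transactions = [("C2", 10.42), ("C1", 44.34), (None, 1.00), ("C2", 3.18), ("C9", 5.00)]
customers = [("C1", "Aaron"), ("C2", "Abbie"), ("C3", "Ada")]

joined = sort_merge_join(transactions, customers)
print(joined)
# [('C1', 44.34, 'Aaron'), ('C2', 10.42, 'Abbie'), ('C2', 3.18, 'Abbie')]
```

Since both sides are sorted, each list is scanned only once during the merge, which is what makes the approach practical for two large inputs.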

HashPartition Scheme
----------------------
- A partitioning scheme is an algorithm that determines which row (record) goes to which partition (after a shuffle, these are also called shuffle partitions).
- In the HashPartition scheme, the partition a record goes to is decided by the following formula: \
    `(Target partition index of a record) = Hash(key column's value) modulo (number of partitions)`
- In our case, the key column is `cust_id` and the shuffle creates 200 partitions, as shown by `hashpartitioning(cust_id#84, 200)` (the 12 partitions reported earlier are the input partitions of the larger dataset before the shuffle). The shuffle partitions are indexed 0, 1, 2, ..., 199, and each record is sent to the partition whose index is: \
    `Target Partition Index = Hash(cust_id) % 200`
- For the same `cust_id` value, the Hash() value is the same, so matching records from both DataFrames end up in the same partition, which makes sorting and joining them easier.
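
The hash-modulo rule can be tried out in plain Python. In this sketch `zlib.crc32` is only a stand-in hash, chosen because it gives the same value on every run; it is not the hash function Spark uses internally. The partition count of 200 is taken from the `hashpartitioning(cust_id#84, 200)` node in the plan, and the function name `target_partition` is made up.

```python
import zlib

NUM_SHUFFLE_PARTITIONS = 200  # from `hashpartitioning(cust_id#84, 200)`


def target_partition(key, num_partitions):
    """Partition index = hash(key) mod num_partitions (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# A few cust_id values as they might appear on each side of the join.
transaction_keys = ["C0YDPQWPBJ", "C007YEYTX9", "C0YDPQWPBJ"]
customer_keys = ["C007YEYTX9", "C0YDPQWPBJ"]

txn_targets = {k: target_partition(k, NUM_SHUFFLE_PARTITIONS) for k in transaction_keys}
cust_targets = {k: target_partition(k, NUM_SHUFFLE_PARTITIONS) for k in customer_keys}

# The same key maps to the same partition index on both sides.
print(txn_targets == cust_targets)  # True
```

The partition index depends only on the key value and the partition count, so both DataFrames agree on where each `cust_id` lives without coordinating with each other.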

### 4. GroupBy
-----------------

In [33]:
transactions_df.printSchema()

root
 |-- cust_id: string (nullable = true)
 |-- start_date: string (nullable = true)
 |-- end_date: string (nullable = true)
 |-- txn_id: string (nullable = true)
 |-- date: string (nullable = true)
 |-- year: string (nullable = true)
 |-- month: string (nullable = true)
 |-- day: string (nullable = true)
 |-- expense_type: string (nullable = true)
 |-- amt: string (nullable = true)
 |-- city: string (nullable = true)



In [35]:
city_counts_df = (
    transactions_df.groupBy("city").count() 
)

In [36]:
city_counts_df.explain(True)

== Parsed Logical Plan ==
'Aggregate ['city], ['city, count(1) AS count#733L]
+- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet

== Analyzed Logical Plan ==
city: string, count: bigint
Aggregate [city#10], [city#10, count(1) AS count#733L]
+- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet

== Optimized Logical Plan ==
Aggregate [city#10], [city#10, count(1) AS count#733L]
+- Project [city#10]
   +- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[city#10], functions=[count(1)], output=[city#10, count#733L])
   +- Exchange hashpartitioning(city#10, 200), ENSURE_REQUIREMENTS, [plan_id=399]
      +- HashAggregate(keys=[city#10], functions=[partial_count(1)], output=[city#10, count#737L])
         +- 

#### GroupBy Count Physical Plan:
------------------------------------
<img src="./images/004-groupBy_count_physical_plan.jpg">


- `+- HashAggregate(keys=[city#10], functions=[partial_count(1)], output=[city#10, count#737L])` means that `HashAggregate` counts the rows for each key value locally within each partition. For example, say there are 3 cities (city is the groupBy column, and thus the key column): A, B and C. The occurrences of A, B and C are counted locally in each partition. These per-partition counts are known as partial counts and are computed by the `partial_count()` function.
- `+- Exchange hashpartitioning(city#10, 200)` means the keys and their partial counts are shuffled so that identical keys land in the same partition, where their totals can then be computed.
- After the hash-partitioned shuffle, the final count is computed by a second HashAggregate.
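
The three steps (partial count, shuffle by key, final count) can be modelled in plain Python with the cities A, B and C from the example. This is a conceptual sketch with made-up function names and data; `zlib.crc32` again stands in for the real hash function.

```python
import zlib
from collections import Counter


def partial_counts(partition):
    """Phase 1: count keys locally inside one partition (the partial count)."""
    return Counter(partition)


def shuffle_by_key(partials, num_partitions):
    """Phase 2: send each (key, partial count) pair to the partition chosen by its key hash."""
    shuffled = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, count in partial.items():
            index = zlib.crc32(key.encode("utf-8")) % num_partitions
            shuffled[index].append((key, count))
    return shuffled


def final_counts(shuffled):
    """Phase 3: add up the partial counts of each key (the final count)."""
    totals = Counter()
    for partition in shuffled:
        for key, count in partition:
            totals[key] += count
    return dict(totals)


# Three hypothetical input partitions holding only the `city` column.
input_partitions = [
    ["A", "B", "A", "C"],
    ["B", "B", "C"],
    ["A", "C", "C", "C"],
]

partials = [partial_counts(p) for p in input_partitions]
totals = final_counts(shuffle_by_key(partials, num_partitions=4))
print(totals)
```

Only one small (key, partial count) pair per key leaves each input partition, instead of every row, which is the reason the count is split into a partial and a final step.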

In [37]:
# Another example
amount_per_city_df = (
    transactions_df.groupBy("city").agg( F.sum("amt") )
)

In [38]:
amount_per_city_df.explain(True)

== Parsed Logical Plan ==
'Aggregate ['city], ['city, sum('amt) AS sum(amt)#750]
+- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet

== Analyzed Logical Plan ==
city: string, sum(amt): double
Aggregate [city#10], [city#10, sum(cast(amt#9 as double)) AS sum(amt)#750]
+- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet

== Optimized Logical Plan ==
Aggregate [city#10], [city#10, sum(cast(amt#9 as double)) AS sum(amt)#750]
+- Project [amt#9, city#10]
   +- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[city#10], functions=[sum(cast(amt#9 as double))], output=[city#10, sum(amt)#750])
   +- Exchange hashpartitioning(city#10, 200), ENSURE_REQUIREMENTS, [plan_id=412]
      +- HashAggregate(keys=[city#10],

- In this example, the plan has the same structure as before; the only difference is that the `partial_sum` function is used in place of `partial_count`.

- **Another example of `groupBy()` with distinct count**:

In [39]:
distinct_city_per_customer_df = (
    transactions_df.groupBy("cust_id")
                   .agg( F.countDistinct("city") )
)

In [41]:
distinct_city_per_customer_df.show(10)
distinct_city_per_customer_df.explain(True)

+----------+-----------+
|   cust_id|count(city)|
+----------+-----------+
|CPP8BY8U93|         10|
|CYB8BX9LU1|         10|
|CFRT841CCD|         10|
|CA0TSNMYDK|         10|
|COZ8NONEVZ|         10|
|C46OCVH3WG|         10|
|C1QF29WCA6|         10|
|CTJBQB0OJ1|         10|
|CD0DXL8XTM|         10|
|CADBQ5OL5C|         10|
+----------+-----------+
only showing top 10 rows

== Parsed Logical Plan ==
'Aggregate ['cust_id], ['cust_id, 'count(distinct 'city) AS count(city)#766]
+- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet

== Analyzed Logical Plan ==
cust_id: string, count(city): bigint
Aggregate [cust_id#0], [cust_id#0, count(distinct city#10) AS count(city)#766L]
+- Relation [cust_id#0,start_date#1,end_date#2,txn_id#3,date#4,year#5,month#6,day#7,expense_type#8,amt#9,city#10] parquet

== Optimized Logical Plan ==
Aggregate [cust_id#0], [cust_id#0, count(distinct city#10) AS count(city)#766L]
+- Project [cust_id#0