# Imports

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

In [2]:
spark = (SparkSession
         .builder
         .appName('How-Spark-Runs-On-Cluster')
         .getOrCreate())
spark

# Example

In [3]:
df1 = spark.range(2, 10000000, 2)
df2 = spark.range(2, 10000000, 4)

In [4]:
step1 = df1.repartition(5)
step12 = df2.repartition(6)

In [5]:
step2 = step1.selectExpr("id * 5 as id")
step3 = step2.join(step12, ["id"])
step4 = step3.selectExpr("sum(id)")

In [6]:
step4.collect() # 2500000000000

[Row(sum(id)=2500000000000)]
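
As a sanity check on the result above, the same sum can be reproduced in plain Python without Spark. This is only a sketch for verifying the arithmetic; the names `left`, `right`, and `total` are introduced here for illustration. After the multiplication by 5 the left side holds multiples of 10, and the right side holds numbers of the form 4k + 2, so the inner join keeps 10, 30, 50, ..., 9999990.

```python
# Plain-Python check of the join + sum (no Spark involved).
left = range(2, 10_000_000, 2)    # ids of df1 before the multiplication by 5
right = range(2, 10_000_000, 4)   # ids of df2

# An id x from df2 survives the inner join if x == 5 * i for some i in df1.
# Membership tests on a range object are O(1), so this stays cheap.
total = sum(x for x in right if x % 5 == 0 and x // 5 in left)
print(total)  # 2500000000000
```

The surviving ids form an arithmetic series of 500,000 terms with an average of 5,000,000, which gives the same 2,500,000,000,000.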

In [7]:
step4.explain()

== Physical Plan ==
*(7) HashAggregate(keys=[], functions=[sum(id#8L)])
+- Exchange SinglePartition, true, [id=#66]
   +- *(6) HashAggregate(keys=[], functions=[partial_sum(id#8L)])
      +- *(6) Project [id#8L]
         +- *(6) SortMergeJoin [id#8L], [id#2L], Inner
            :- *(3) Sort [id#8L ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(id#8L, 200), true, [id=#50]
            :     +- *(2) Project [(id#0L * 5) AS id#8L]
            :        +- Exchange RoundRobinPartitioning(5), false, [id=#46]
            :           +- *(1) Range (2, 10000000, step=2, splits=8)
            +- *(5) Sort [id#2L ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(id#2L, 200), true, [id=#57]
                  +- Exchange RoundRobinPartitioning(6), false, [id=#56]
                     +- *(4) Range (2, 10000000, step=4, splits=8)
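
Each `Exchange` node in the plan above marks a shuffle, and shuffles are where Spark cuts a job into stages. As a rough sketch, the snippet below counts the `Exchange` nodes in a plan string. The helper `count_exchanges` and the abbreviated `plan_text` are illustrative and not part of Spark, and the "exchanges + 1" estimate is a rule of thumb that assumes a simple tree-shaped plan with no reused or broadcast exchanges. Note that the `*(n)` prefixes in the plan are whole-stage code generation IDs, not scheduler stage numbers.

```python
import re

# Abbreviated copy of the physical plan printed above (operators only).
plan_text = """
HashAggregate
+- Exchange SinglePartition
   +- HashAggregate
      +- Project
         +- SortMergeJoin
            :- Sort
            :  +- Exchange hashpartitioning(id#8L, 200)
            :     +- Project
            :        +- Exchange RoundRobinPartitioning(5)
            :           +- Range
            +- Sort
               +- Exchange hashpartitioning(id#2L, 200)
                  +- Exchange RoundRobinPartitioning(6)
                     +- Range
"""

def count_exchanges(plan: str) -> int:
    """Count plan lines whose operator is an Exchange (a shuffle boundary)."""
    # Skip the tree-drawing characters at the start of each line,
    # then require the operator name to be exactly "Exchange".
    return len(re.findall(r"^[^\w\n]*Exchange\b", plan, flags=re.MULTILINE))

num_shuffles = count_exchanges(plan_text)
estimated_stages = num_shuffles + 1  # rule of thumb, see the assumptions above
print(num_shuffles, estimated_stages)  # 5 6
```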




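In the plan above, each partition first computes a partial sum (`partial_sum`). The partial results are then brought to a single partition (`Exchange SinglePartition`), where the final `sum` is computed before the result is sent to the driver.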
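
The two-phase aggregation can be modeled in plain Python. This is a sketch of the idea only, not Spark's implementation: the small `data` list, the `num_partitions` value, and the round-robin split are made-up stand-ins for rows spread across executors.

```python
# Toy model of partial aggregation followed by a final aggregation.
data = list(range(10, 200, 20))   # 10, 30, ..., 190 (a small stand-in dataset)
num_partitions = 4                # hypothetical partition count

# Spread the rows over partitions (round-robin, for illustration only).
partitions = [data[i::num_partitions] for i in range(num_partitions)]

# Phase 1: every partition computes its own partial sum independently.
partial_sums = [sum(part) for part in partitions]

# Phase 2: the partial sums are gathered in one place and combined.
final_sum = sum(partial_sums)
print(partial_sums, final_sum)
```

Only the short list of partial sums has to move to the single partition, not the raw rows, which keeps the final exchange small.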

- Spark runs jobs sequentially unless we submit them from separate threads, in which case multiple jobs can run in parallel.
- Each action generally kicks off one job.
    - The job is broken down into stages at shuffle boundaries, so the number of stages depends on how many times the data must be shuffled. Within a stage, Spark combines as many operations as possible and executes them in parallel without shuffling.
        - Each stage is broken down into tasks, which are the actual units of work that executors perform, one task per partition. A task is therefore the combination of a block of data and a set of transformations, and it runs on a single executor. The higher the number of partitions, the more tasks can run in parallel.
- The value of `spark.sql.shuffle.partitions` should be set according to the number of cores in the cluster. The default is 200.
- The number of partitions should be several times larger than the number of executors to achieve maximum parallelism.
- Spark tries to use __Pipelining__ whenever possible. This means it combines as many operations as possible into one stage and performs all the computations at once, without writing the intermediate results to memory/disk. For example, if we apply `select`, `filter`, then `select` to a DataFrame, Spark performs the three computations in one go without writing the intermediate results to memory/disk, which is very fast.
- __Shuffle Persistence__: During a shuffle, executors write __shuffle files__ to their local disks. If something fails later, Spark can reuse these files instead of rerunning the whole shuffle. Spark can also reuse them when a new job is executed on data that has already been shuffled.
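
As a rough illustration of the sizing advice in the list above, the helper below derives a partition count from the cluster's core count. The function name `suggested_partitions`, the example cluster size, and the default factor of 3 are assumptions made for illustration; the Spark tuning guide suggests 2-3 tasks per CPU core as a starting point, and the right value also depends on data size.

```python
def suggested_partitions(num_executors: int, cores_per_executor: int, factor: int = 3) -> int:
    """Return a partition count that is a multiple of the total core count."""
    total_cores = num_executors * cores_per_executor
    return total_cores * factor

# Hypothetical cluster: 4 executors with 5 cores each.
n = suggested_partitions(num_executors=4, cores_per_executor=5)
print(n)  # 60

# With an active SparkSession named `spark`, the value could be applied as:
# spark.conf.set("spark.sql.shuffle.partitions", str(n))
```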