<p>Like relational databases, the Spark DataFrame and Dataset APIs
and Spark SQL offer a series of join transformations: inner joins, outer joins, left
joins, right joins, and so on. These operations can trigger a large amount of data movement
across Spark executors; how much depends on the join strategy Spark chooses.</p>

<h3>Five join strategies</h3>
<ol>
    <li>Broadcast hash join (BHJ)</li>
    <li>Shuffle hash join (SHJ)</li>
    <li>Shuffle sort merge join (SMJ)</li>
    <li>Broadcast nested loop join (BNLJ)</li>
    <li>Shuffle and replicated nested loop join (Cartesian product)</li>
</ol>

## Broadcast Hash Join

In [1]:
# By default, Spark uses a broadcast join if the smaller data set is less than 10 MB.
# This threshold is controlled by spark.sql.autoBroadcastJoinThreshold.

In [2]:
! ls -l ../data

total 391556
drwxrwxr-x. 3 train train        96 Aug 21 06:19 churn-telecom
-rw-rw-r--. 1 train train  41002480 Aug 21 06:18 Fire_Incidents.csv.gz
drwxrwxr-x. 7 train train        67 Aug 21 09:09 flight-data
-rw-rw-r--. 1 train train  46401315 Aug 21 06:18 Hotel_Reviews.csv.gz
-rw-rw-r--. 1 train train  44525776 Aug 21 06:17 market1mil.csv.gz
drwxrwxr-x. 2 train train       198 Aug 21 12:13 market1mil_snappyparquet
-rw-rw-r--. 1 train train 269015852 Aug 21 06:18 market5mil.csv.gz
drwxrwxr-x. 2 train train         6 Aug 21 09:37 market5mil_lzoparquet
drwxrwxr-x. 2 train train       198 Aug 21 12:11 market5mil_snappyparquet
drwxrwxr-x. 2 train train       133 Aug 21 06:18 retail_db


In [3]:
import findspark
findspark.init("/opt/manual/spark")
from pyspark.sql import SparkSession, functions as F

In [4]:
spark = (
    SparkSession.builder.appName("Joins").master("local[2]")
    .config("spark.executor.memory", "3g")
    .config("spark.driver.memory", "512m")
    .config("spark.memory.fraction", "0.1")
    .config("spark.memory.storageFraction", "0.0")
    .getOrCreate()
)

In [5]:
categories = spark.read.format("csv") \
.option("header", True) \
.option("inferSchema", True) \
.option("sep", ",") \
.load("file:///home/train/datasets/retail_db/categories.csv")

In [6]:
categories.show(3)

+----------+--------------------+-------------------+
|categoryId|categoryDepartmentId|       categoryName|
+----------+--------------------+-------------------+
|         1|                   2|           Football|
|         2|                   2|             Soccer|
|         3|                   2|Baseball & Softball|
+----------+--------------------+-------------------+
only showing top 3 rows



In [7]:
departments = spark.read.format("csv") \
.option("header", True) \
.option("inferSchema", True) \
.option("sep", ",") \
.load("file:///home/train/datasets/retail_db/departments.csv")

In [8]:
departments.show()

+------------+--------------+
|departmentId|departmentName|
+------------+--------------+
|           2|       Fitness|
|           3|      Footwear|
|           4|       Apparel|
|           5|          Golf|
|           6|      Outdoors|
|           7|      Fan Shop|
+------------+--------------+



In [9]:
bcast_join_df = categories.join(F.broadcast(departments), 
                                categories.categoryDepartmentId == departments.departmentId)

In [10]:
bcast_join_df.show()

+----------+--------------------+-------------------+------------+--------------+
|categoryId|categoryDepartmentId|       categoryName|departmentId|departmentName|
+----------+--------------------+-------------------+------------+--------------+
|         1|                   2|           Football|           2|       Fitness|
|         2|                   2|             Soccer|           2|       Fitness|
|         3|                   2|Baseball & Softball|           2|       Fitness|
|         4|                   2|         Basketball|           2|       Fitness|
|         5|                   2|           Lacrosse|           2|       Fitness|
|         6|                   2|   Tennis & Racquet|           2|       Fitness|
|         7|                   2|             Hockey|           2|       Fitness|
|         8|                   2|        More Sports|           2|       Fitness|
|         9|                   3|   Cardio Equipment|           3|      Footwear|
|        10|    

<p>The BHJ is the easiest and fastest join Spark offers, since it does not shuffle
either data set; after the broadcast, every executor has the smaller data set available locally.
You just have to be sure that both the Spark driver and the executors have enough memory
to hold the smaller data set, because it is collected on the driver before being sent to the executors.</p>
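
The mechanics described above can be sketched in plain Python, without Spark. This is only an illustration of the build-and-probe idea, not Spark's implementation: the function name `broadcast_hash_join` and the list-of-dicts "partitions" are assumptions made for the example, and the sample rows are taken from the `categories` and `departments` output shown earlier.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, large_key, small_key):
    """Inner equi-join: build a hash table from the small side once,
    then probe it independently for each partition of the large side."""
    # Build phase: in Spark, this table would be built from the broadcast
    # data and made available to every executor.
    hash_table = defaultdict(list)
    for row in small_rows:
        key = row[small_key]
        if key is not None:  # null keys never match in an inner equi-join
            hash_table[key].append(row)

    # Probe phase: each partition is handled on its own, so no rows of the
    # large data set have to move between partitions (no shuffle).
    joined_partitions = []
    for partition in large_partitions:
        joined = []
        for row in partition:
            for match in hash_table.get(row[large_key], []):
                joined.append({**row, **match})
        joined_partitions.append(joined)
    return joined_partitions


# The small side (hypothetical in-memory copy of two departments rows).
departments = [
    {"departmentId": 2, "departmentName": "Fitness"},
    {"departmentId": 3, "departmentName": "Footwear"},
]
# Two "partitions" of the larger categories data set.
categories = [
    [{"categoryId": 1, "categoryDepartmentId": 2, "categoryName": "Football"}],
    [{"categoryId": 9, "categoryDepartmentId": 3, "categoryName": "Cardio Equipment"}],
]

result = broadcast_hash_join(
    categories, departments, "categoryDepartmentId", "departmentId"
)
for partition in result:
    print(partition)
```

The memory requirement mentioned above shows up here as `hash_table`: the whole smaller data set has to fit in memory wherever the probe runs.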

In [11]:
bcast_join_df.explain()

== Physical Plan ==
*(2) BroadcastHashJoin [categoryDepartmentId#17], [departmentId#54], Inner, BuildRight
:- *(2) Project [categoryId#16, categoryDepartmentId#17, categoryName#18]
:  +- *(2) Filter isnotnull(categoryDepartmentId#17)
:     +- FileScan csv [categoryId#16,categoryDepartmentId#17,categoryName#18] Batched: false, DataFilters: [isnotnull(categoryDepartmentId#17)], Format: CSV, Location: InMemoryFileIndex[file:/home/train/venvspark/dev/data/retail_db/categories.csv], PartitionFilters: [], PushedFilters: [IsNotNull(categoryDepartmentId)], ReadSchema: struct<categoryId:int,categoryDepartmentId:int,categoryName:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#133]
   +- *(1) Project [departmentId#54, departmentName#55]
      +- *(1) Filter isnotnull(departmentId#54)
         +- FileScan csv [departmentId#54,departmentName#55] Batched: false, DataFilters: [isnotnull(departmentId#54)], Format: CSV, Location: InMemoryFileInd

In [12]:
bcast_join_df.explain('cost')
# The modes include 'simple', 'extended', 'codegen', 'cost', and 'formatted'.

== Optimized Logical Plan ==
Join Inner, (categoryDepartmentId#17 = departmentId#54), rightHint=(strategy=broadcast), Statistics(sizeInBytes=105.1 KiB)
:- Filter isnotnull(categoryDepartmentId#17), Statistics(sizeInBytes=1133.0 B)
:  +- Relation[categoryId#16,categoryDepartmentId#17,categoryName#18] csv, Statistics(sizeInBytes=1133.0 B)
+- Filter isnotnull(departmentId#54), Statistics(sizeInBytes=95.0 B)
   +- Relation[departmentId#54,departmentName#55] csv, Statistics(sizeInBytes=95.0 B)

== Physical Plan ==
*(2) BroadcastHashJoin [categoryDepartmentId#17], [departmentId#54], Inner, BuildRight
:- *(2) Project [categoryId#16, categoryDepartmentId#17, categoryName#18]
:  +- *(2) Filter isnotnull(categoryDepartmentId#17)
:     +- FileScan csv [categoryId#16,categoryDepartmentId#17,categoryName#18] Batched: false, DataFilters: [isnotnull(categoryDepartmentId#17)], Format: CSV, Location: InMemoryFileIndex[file:/home/train/venvspark/dev/data/retail_db/categories.csv], PartitionFilters: [], 

<p>Use this type of join under the following conditions for maximum benefit:</p>
<ul>
    <li>When each key within the smaller and larger data sets is hashed to the same partition by Spark</li>
    <li>When one data set is much smaller than the other (and within the default threshold of 10 MB, or more if you have sufficient memory)</li>
    <li>When you only want to perform an equi-join, to combine two data sets based on matching unsorted keys</li>
    <li>When you are not worried about excessive network bandwidth usage or OOM errors, because the smaller data set will be broadcast to all Spark executors</li>
</ul>
<p>Setting spark.sql.autoBroadcastJoinThreshold to -1 disables automatic broadcasting; unless a join hint says otherwise, Spark then resorts to a shuffle sort merge join for equi-joins.</p>
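
The threshold rule can be made concrete with a small plain-Python model. This is a deliberately simplified sketch and not Spark's planner code: the function `pick_equi_join_strategy` and its `hinted` parameter are hypothetical names introduced here, and the model ignores everything except estimated size, the threshold, and an explicit broadcast hint. The sizes come from the `explain('cost')` output above (1133 B for categories, 95 B for departments).

```python
# 10 MB, the default value of spark.sql.autoBroadcastJoinThreshold
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def pick_equi_join_strategy(left_bytes, right_bytes,
                            threshold=DEFAULT_THRESHOLD_BYTES,
                            hinted=False):
    """Simplified model of the broadcast decision for an equi-join.

    Assumption: only the estimated sizes, the threshold, and an explicit
    broadcast hint are considered; the real planner weighs more factors.
    """
    if hinted:
        # An explicit hint such as F.broadcast(df) requests a broadcast
        # regardless of the threshold.
        return "broadcast hash join"
    if threshold >= 0 and min(left_bytes, right_bytes) <= threshold:
        # The smaller side is estimated to fit under the threshold.
        return "broadcast hash join"
    # A threshold of -1 (or two large inputs) falls through to here.
    return "shuffle sort merge join"


print(pick_equi_join_strategy(1133, 95))
print(pick_equi_join_strategy(1133, 95, threshold=-1))
print(pick_equi_join_strategy(1133, 95, threshold=-1, hinted=True))
```

In this model the 95-byte departments table is far below the 10 MB default, so it would be broadcast even without the `F.broadcast` hint used in the notebook.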

In [21]:
# spark.stop()