<p>Like relational databases, the Spark DataFrame and Dataset APIs
and Spark SQL offer a series of join transformations: inner joins, outer joins, left
joins, right joins, and so on. These operations can trigger a large amount of data movement
across Spark executors, depending on the join strategy Spark selects.</p>

<h3>Five join strategies</h3>
<ol>
    <li>Broadcast hash join (BHJ)</li>
    <li>Shuffle hash join (SHJ)</li>
    <li>Shuffle sort merge join (SMJ)</li>
    <li>Broadcast nested loop join (BNLJ)</li>
    <li>Shuffle-and-replicated nested loop join (Cartesian product join)</li>
</ol>

# 1. Broadcast Hash Join

In [1]:
# By default, Spark uses a broadcast join if the smaller data set is less than 10 MB.
# This limit is set by spark.sql.autoBroadcastJoinThreshold.
# You can increase it if you have enough memory, e.g. to 100 MB.
# Setting spark.sql.autoBroadcastJoinThreshold to -1 disables automatic
# broadcast joins, and Spark falls back to a shuffle sort merge join.
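
As a minimal sketch of the setting described above, the threshold can be changed at runtime on an existing session. The helper name `set_broadcast_threshold` is hypothetical (it is not part of Spark), and `spark` is assumed to be an active SparkSession such as the one created later in this notebook. The value is passed in bytes.

```python
# Hypothetical helper (not part of Spark): change the auto-broadcast threshold.
MB = 1024 * 1024

def set_broadcast_threshold(spark, megabytes):
    """Set spark.sql.autoBroadcastJoinThreshold on an active SparkSession.

    A negative value disables automatic broadcast joins.
    Returns the value that was written, in bytes (or -1).
    """
    value = -1 if megabytes < 0 else megabytes * MB
    # spark.conf.set changes a runtime SQL configuration of the session.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(value))
    return value
```

With the notebook's session, `set_broadcast_threshold(spark, 100)` would raise the limit to 100 MB and `set_broadcast_threshold(spark, -1)` would disable automatic broadcasting.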

In [2]:
! ls -l ~/datasets

total 180676
-rw-rw-r--. 1 train train 42658497 Dec 30 23:32 201508_trip_data.csv
-rw-rw-r--. 1 train train  7077973 Dec 12 11:46 AB_NYC_2019.csv
-rw-rw-r--. 1 train train     4556 Jul 21  2020 Advertising.csv
-rw-rw-r--. 1 train train   674857 Dec 19 12:14 Churn_Modelling.csv
drwxr-xr-x. 3 train train       96 Nov 19  2020 churn-telecom
-rw-rw-r--. 1 train train  2609524 Jan  6 21:21 dirty_store_transactions.csv
-rw-rw-r--. 1 train train      227 Dec 22 21:45 employee.txt
-rw-rw-r--. 1 train train 41002480 Jan  1 12:12 Fire_Incidents.csv.gz
-rw-rw-r--. 1 train train 46401315 Dec 30 23:34 Hotel_Reviews.csv.gz
-rw-rw-r--. 1 train train     4611 Dec 11 12:13 iris.csv
-rw-rw-r--. 1 train train 44525776 Jan  2 19:22 market1mil.csv.gz
drwxrwxr-x. 2 train train     4096 Jan  2 16:22 market5mil_parquet
drwxrwxr-x. 2 train train      133 Jul 23  2020 retail_db
-rw-rw-r--. 1 train train      592 Jan  2 11:50 simple_data.csv
-rw-rw-r--. 1 train train      913 Dec 25 12:34 tr_il_plaka_kod.csv
-rw

In [1]:
import findspark
findspark.init("/opt/manual/spark")
from pyspark.sql import SparkSession, functions as F

In [2]:
spark = (
SparkSession.builder
    .appName("Joins")
    .master("local[2]")
    .config("spark.driver.memory","3000m")
    .getOrCreate()
)

2022-09-24 21:50:53,837 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [3]:
order_items = spark.read.format("csv") \
.option("header", True) \
.option("inferSchema", True) \
.option("sep", ",") \
.load("file:///home/train/datasets/retail_db/order_items.csv")

                                                                                

In [4]:
order_items.count()

                                                                                

172198

In [5]:
order_items.limit(3).toPandas()

Unnamed: 0,orderItemName,orderItemOrderId,orderItemProductId,orderItemQuantity,orderItemSubTotal,orderItemProductPrice
0,1,1,957,1,299.98,299.98
1,2,2,1073,1,199.99,199.99
2,3,2,502,5,250.0,50.0


In [6]:
products = spark.read.format("csv") \
.option("header", True) \
.option("inferSchema", True) \
.option("sep", ",") \
.load("file:///home/train/datasets/retail_db/products.csv")

In [7]:
products.count()

1345

In [8]:
products.limit(3).toPandas()

Unnamed: 0,productId,productCategoryId,productName,productDescription,productPrice,productImage
0,1,2,Quest Q64 10 FT. x 10 FT. Slant Leg Instant U,,59.98,http://images.acmesports.sports/Quest+Q64+10+F...
1,2,2,Under Armour Men's Highlight MC Football Clea,,129.99,http://images.acmesports.sports/Under+Armour+M...
2,3,2,Under Armour Men's Renegade D Mid Football Cl,,89.99,http://images.acmesports.sports/Under+Armour+M...


In [9]:
bcast_join_df = order_items.join(products, 
                                order_items.orderItemProductId == products.productId)

In [10]:
bcast_join_df.limit(5).toPandas()

Unnamed: 0,orderItemName,orderItemOrderId,orderItemProductId,orderItemQuantity,orderItemSubTotal,orderItemProductPrice,productId,productCategoryId,productName,productDescription,productPrice,productImage
0,1,1,957,1,299.98,299.98,957,43,Diamondback Women's Serene Classic Comfort Bi,,299.98,http://images.acmesports.sports/Diamondback+Wo...
1,2,2,1073,1,199.99,199.99,1073,48,Pelican Sunstream 100 Kayak,,199.99,http://images.acmesports.sports/Pelican+Sunstr...
2,3,2,502,5,250.0,50.0,502,24,Nike Men's Dri-FIT Victory Golf Polo,,50.0,http://images.acmesports.sports/Nike+Men%27s+D...
3,4,2,403,1,129.99,129.99,403,18,Nike Men's CJ Elite 2 TD Football Cleat,,129.99,http://images.acmesports.sports/Nike+Men%27s+C...
4,5,4,897,2,49.98,24.99,897,40,Team Golf New England Patriots Putter Grip,,24.99,http://images.acmesports.sports/Team+Golf+New+...
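
The join above relies on Spark choosing the broadcast strategy from its size estimate. A broadcast can also be requested explicitly. The following is a sketch using `DataFrame.hint("broadcast")`; the function name `join_with_broadcast_hint` is hypothetical, and the arguments are assumed to be DataFrames like `order_items` and `products`.

```python
def join_with_broadcast_hint(large_df, small_df, condition):
    """Join two DataFrames, asking Spark to broadcast the smaller one."""
    # hint("broadcast") marks small_df as the side to broadcast.
    # Spark treats this as a request; a strategy that does not support
    # the join type may still be replaced by another one.
    return large_df.join(small_df.hint("broadcast"), condition)
```

For example, `join_with_broadcast_hint(order_items, products, order_items.orderItemProductId == products.productId).explain()` would print the plan chosen for the hinted join.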


<p>The BHJ is the simplest and fastest join Spark offers because it does not involve <strong>any shuffle of the data set</strong>; after the broadcast, all the data is available locally to each executor. You only need to make sure that both the Spark driver and the executors have enough memory to hold the smaller data set.</p>
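
To illustrate the idea, here is a plain-Python sketch of the build and probe phases. It is not Spark's implementation: the function name, the list-of-lists "partitions", and the toy rows are assumptions made for the example.

```python
from collections import defaultdict

def broadcast_hash_join(large_partitions, small_rows, large_key, small_key):
    """Inner equi-join: hash the small side once, then probe with the large side."""
    # Build phase: done once. In Spark, this table would be sent to every executor.
    hash_table = defaultdict(list)
    for row in small_rows:
        hash_table[row[small_key]].append(row)

    # Probe phase: each partition is scanned where it already is, so the
    # large side never has to be shuffled.
    joined = []
    for partition in large_partitions:
        for row in partition:
            for match in hash_table.get(row[large_key], []):
                joined.append({**row, **match})
    return joined

# Toy data modeled on the first rows shown above, split into two "partitions".
order_item_partitions = [
    [{"orderItemName": 1, "orderItemProductId": 957}],
    [{"orderItemName": 2, "orderItemProductId": 1073},
     {"orderItemName": 3, "orderItemProductId": 502}],
]
product_rows = [
    {"productId": 957, "productPrice": 299.98},
    {"productId": 1073, "productPrice": 199.99},
]

result = broadcast_hash_join(order_item_partitions, product_rows,
                             "orderItemProductId", "productId")
print(result)
```

Only the rows for products 957 and 1073 appear in `result`, because product 502 is missing from the toy product list and this is an inner join.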

In [11]:
bcast_join_df.explain()

== Physical Plan ==
*(2) BroadcastHashJoin [orderItemProductId#18], [productId#61], Inner, BuildRight, false
:- *(2) Filter isnotnull(orderItemProductId#18)
:  +- FileScan csv [orderItemName#16,orderItemOrderId#17,orderItemProductId#18,orderItemQuantity#19,orderItemSubTotal#20,orderItemProductPrice#21] Batched: false, DataFilters: [isnotnull(orderItemProductId#18)], Format: CSV, Location: InMemoryFileIndex[file:/home/train/datasets/retail_db/order_items.csv], PartitionFilters: [], PushedFilters: [IsNotNull(orderItemProductId)], ReadSchema: struct<orderItemName:int,orderItemOrderId:int,orderItemProductId:int,orderItemQuantity:int,orderI...
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#152]
   +- *(1) Filter isnotnull(productId#61)
      +- FileScan csv [productId#61,productCategoryId#62,productName#63,productDescription#64,productPrice#65,productImage#66] Batched: false, DataFilters: [isnotnull(productId#61)], Format: CSV, Locat

In [12]:
bcast_join_df.explain('cost')
# The modes include 'simple', 'extended', 'codegen', 'cost', and 'formatted'.

== Optimized Logical Plan ==
Join Inner, (orderItemProductId#18 = productId#61), Statistics(sizeInBytes=877.7 GiB)
:- Filter isnotnull(orderItemProductId#18), Statistics(sizeInBytes=5.2 MiB)
:  +- Relation[orderItemName#16,orderItemOrderId#17,orderItemProductId#18,orderItemQuantity#19,orderItemSubTotal#20,orderItemProductPrice#21] csv, Statistics(sizeInBytes=5.2 MiB)
+- Filter isnotnull(productId#61), Statistics(sizeInBytes=170.2 KiB)
   +- Relation[productId#61,productCategoryId#62,productName#63,productDescription#64,productPrice#65,productImage#66] csv, Statistics(sizeInBytes=170.2 KiB)

== Physical Plan ==
*(2) BroadcastHashJoin [orderItemProductId#18], [productId#61], Inner, BuildRight, false
:- *(2) Filter isnotnull(orderItemProductId#18)
:  +- FileScan csv [orderItemName#16,orderItemOrderId#17,orderItemProductId#18,orderItemQuantity#19,orderItemSubTotal#20,orderItemProductPrice#21] Batched: false, DataFilters: [isnotnull(orderItemProductId#18)], Format: CSV, Location: InMemoryFi
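
To compare the modes listed in the comment above side by side, a small loop is enough. This is a sketch that assumes a PySpark version whose `DataFrame.explain` accepts the `mode` keyword (Spark 3.0 or later); the helper name `explain_in_modes` is hypothetical.

```python
EXPLAIN_MODES = ("simple", "extended", "codegen", "cost", "formatted")

def explain_in_modes(df, modes=EXPLAIN_MODES):
    """Print the plan of `df` once per explain mode, with a header per mode."""
    for mode in modes:
        print(f"----- explain(mode={mode!r}) -----")
        df.explain(mode=mode)
```

Calling `explain_in_modes(bcast_join_df)` would print the plans for all five modes in turn.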

<p>Use this type of join under the following conditions for maximum benefit:</p>
<ul>
    <li>When each key within the smaller and larger data sets is hashed to the same partition by Spark</li>
    <li>When one data set is much smaller than the other (and within the default threshold of 10 MB, or more if you have sufficient memory)</li>
    <li>When you only want to perform an equi-join, to combine two data sets based on matching unsorted keys</li>
    <li>When you are not worried about excessive network bandwidth usage or OOM errors, because the smaller data set will be broadcast to all Spark executors</li>
</ul>
<p>Specifying a value of -1 in spark.sql.autoBroadcastJoinThreshold disables automatic broadcasting, so with default settings Spark resorts to a shuffle sort merge join.</p>
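
The size condition can be modeled with a one-line check. This is a simplified sketch, not Spark's planner code: the function name is hypothetical, and it only compares an estimated size against the threshold.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # Spark's default of 10 MB

def fits_broadcast_threshold(size_in_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Simplified model: a side qualifies if its estimated size is within the threshold."""
    # With a threshold of -1 no size can satisfy the check, which models
    # automatic broadcasting being switched off.
    return 0 <= size_in_bytes <= threshold_bytes

# 170.2 KiB is the estimated size of `products` in the 'cost' plan above.
products_size = int(170.2 * 1024)
print(fits_broadcast_threshold(products_size))       # True
print(fits_broadcast_threshold(products_size, -1))   # False
```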

## Setting spark.sql.autoBroadcastJoinThreshold=-1 disables automatic broadcast hash joins.
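
One way to check this is to turn the setting off and inspect the plan of a newly built join. The sketch below assumes an active SparkSession and two DataFrames such as `order_items` and `products`; the function name is hypothetical. With default settings the printed plan would be expected to show a sort merge join instead of `BroadcastHashJoin`, but the exact output depends on the Spark version and configuration.

```python
def explain_join_without_auto_broadcast(spark, left_df, right_df, condition):
    """Disable automatic broadcast joins, then print the plan of a fresh join."""
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    # Build the join after changing the setting so that planning sees the new value.
    joined = left_df.join(right_df, condition)
    joined.explain()
    return joined
```

A call could look like `explain_join_without_auto_broadcast(spark, order_items, products, order_items.orderItemProductId == products.productId)`.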

In [13]:
spark.stop()