## Broadcast Join in Spark DataFrame

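* A broadcast join is also known as a map-side join or replicated join.
* The smaller dataset is broadcast to every executor in the cluster, so the larger dataset can be joined without being shuffled.
* Spark broadcasts a dataset automatically when its estimated size is at or below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default).
* We can force a broadcast join even when the smaller dataset is larger than `spark.sql.autoBroadcastJoinThreshold` by wrapping it in the `broadcast` function from `pyspark.sql.functions`.
* We can disable automatic broadcast joins by setting `spark.sql.autoBroadcastJoinThreshold` to `-1`, the value documented by Spark. A threshold of `0`, as used in the cells below, has much the same effect in practice because no non-empty dataset fits under it.
* If broadcast join is disabled, Spark falls back to a shuffle-based join (a reduce-side join in MapReduce terms), typically a sort-merge join.
* Make sure to set up a multinode cluster with 28 GB memory and 4 cores per node. Configure scaling between 2 and 4 nodes. The driver can use the minimum configuration.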
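
To make the map-side idea concrete, here is a minimal sketch in plain Python, not Spark's actual implementation. The function name `broadcast_hash_join` and the sample rows are hypothetical. It assumes an equi-join on a single key: the small dataset is hashed once (the copy that Spark would broadcast), and each partition of the large dataset is joined locally against it.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Join each partition of the large dataset against a replicated small one."""
    # Build phase: hash the small dataset once. In Spark, this table is what
    # would be shipped (broadcast) to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    # Probe phase: every partition joins locally, so the large dataset
    # is never moved between partitions.
    joined_partitions = []
    for partition in large_partitions:
        joined = [
            {**large_row, **small_row}
            for large_row in partition
            for small_row in lookup.get(large_row[key], [])
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical data: order items spread over two partitions, plus a small orders table.
order_items = [
    [{"order_item_id": 1, "order_id": 1}, {"order_item_id": 2, "order_id": 2}],
    [{"order_item_id": 3, "order_id": 2}, {"order_item_id": 4, "order_id": 9}],
]
orders = [
    {"order_id": 1, "status": "CLOSED"},
    {"order_id": 2, "status": "COMPLETE"},
]

result = broadcast_hash_join(order_items, orders, "order_id")
```

The partition layout of `order_items` is unchanged in `result`; only the item with no matching order (`order_id` 9) is dropped, as expected for an inner join.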

In [1]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import *
import datetime

In [2]:
userName = 'CodeInDNA'
spark = SparkSession. \
        builder. \
        appName(f'{userName} - JoinSparkDF'). \
        getOrCreate()

In [3]:
# Default size is 10MB
spark.conf.get('spark.sql.autoBroadcastJoinThreshold')

'10485760b'
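
The returned value is a byte count with a `b` suffix. As a quick sanity check, the small helper below converts it to MiB. `threshold_to_mib` is a hypothetical name, and the helper assumes the value is a plain byte count with an optional trailing `b`; it does not handle other size suffixes.

```python
def threshold_to_mib(value: str) -> float:
    """Convert a byte-count string such as '10485760b' to MiB."""
    # Assumption: the string is an integer byte count, optionally ending in 'b'.
    digits = value.strip().lower().rstrip("b")
    return int(digits) / (1024 * 1024)


default_threshold = "10485760b"  # value shown in the output above
default_mib = threshold_to_mib(default_threshold)
print(default_mib)  # 10.0
```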

In [4]:
# We can disable automatic broadcast joins this way (Spark documents -1 as the disabling value)
spark.conf.set('spark.sql.autoBroadcastJoinThreshold', '0')

In [5]:
spark.conf.get('spark.sql.autoBroadcastJoinThreshold')

'0'
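
With automatic broadcasting turned off, both sides of a join have to be shuffled by the join key. The sketch below illustrates that idea in plain Python and is not Spark's implementation. The names `hash_partition` and `shuffle_join` and the sample rows are hypothetical. After the shuffle, Spark typically sorts and merges each partition (sort-merge join); this sketch uses a per-partition dictionary instead to stay short.

```python
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Shuffle step: send each row to the partition chosen by its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def shuffle_join(left_rows, right_rows, key, num_partitions=2):
    """Inner join after repartitioning BOTH sides by the join key."""
    left_parts = hash_partition(left_rows, key, num_partitions)
    right_parts = hash_partition(right_rows, key, num_partitions)

    joined = []
    # Rows with equal keys now sit in the same partition number on both sides.
    for left_part, right_part in zip(left_parts, right_parts):
        lookup = defaultdict(list)
        for right_row in right_part:
            lookup[right_row[key]].append(right_row)
        for left_row in left_part:
            for right_row in lookup.get(left_row[key], []):
                joined.append({**left_row, **right_row})
    return joined


# Hypothetical data with integer keys.
order_items = [
    {"order_item_id": 1, "order_id": 1},
    {"order_item_id": 2, "order_id": 2},
    {"order_item_id": 3, "order_id": 2},
    {"order_item_id": 4, "order_id": 9},
]
orders = [
    {"order_id": 1, "status": "CLOSED"},
    {"order_id": 2, "status": "COMPLETE"},
]

shuffled_result = shuffle_join(order_items, orders, "order_id")
```

Unlike the broadcast approach, every row of both datasets is moved during the partitioning step, which is why shuffle-based joins tend to cost more when one side is small.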

In [6]:
# Resetting to original value
spark.conf.set('spark.sql.autoBroadcastJoinThreshold', '10485760b')

In [8]:
# order_items.csv is about 5.2 MB
clickstream = spark.read.csv('../data/order_items.csv', header=True)

In [9]:
example = spark.read.csv('../data/orders.csv', header=False)

In [11]:
%%time
# No join condition is given, so this is a cross join
clickstream.join(example).explain()

== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, Inner
:- FileScan csv [order_item_id#40,order_item_order_id#41,order_item_product_id#42,order_item_quantity#43,order_item_subtotal#44,order_item_product_price#45] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/E:/Practice/PySpark/data/order_items.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<order_item_id:string,order_item_order_id:string,order_item_product_id:string,order_item_qu...
+- BroadcastExchange IdentityBroadcastMode, [id=#68]
   +- FileScan csv [_c0#68,_c1#69,_c2#70,_c3#71] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/E:/Practice/PySpark/data/orders.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string,_c1:string,_c2:string,_c3:string>


Wall time: 296 ms


In [12]:
# The broadcast hint forces the left side (clickstream) to be broadcast
broadcast(clickstream).join(example).explain()

== Physical Plan ==
BroadcastNestedLoopJoin BuildLeft, Inner
:- BroadcastExchange IdentityBroadcastMode, [id=#81]
:  +- FileScan csv [order_item_id#40,order_item_order_id#41,order_item_product_id#42,order_item_quantity#43,order_item_subtotal#44,order_item_product_price#45] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/E:/Practice/PySpark/data/order_items.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<order_item_id:string,order_item_order_id:string,order_item_product_id:string,order_item_qu...
+- FileScan csv [_c0#68,_c1#69,_c2#70,_c3#71] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/E:/Practice/PySpark/data/orders.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string,_c1:string,_c2:string,_c3:string>
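
Both plans above show `BroadcastNestedLoopJoin` rather than a hash join because `join` is called without a join condition, so every row on one side is paired with every row on the other. The plain-Python sketch below illustrates a nested loop join with one broadcast side. It is a conceptual illustration only; the function name and sample rows are hypothetical.

```python
from itertools import product


def broadcast_nested_loop_join(streamed_partitions, broadcast_rows, condition=None):
    """Pair each streamed row with each broadcast row, keeping pairs that pass `condition`."""
    output = []
    for partition in streamed_partitions:
        # The broadcast rows are available in full to every partition.
        for streamed_row, broadcast_row in product(partition, broadcast_rows):
            if condition is None or condition(streamed_row, broadcast_row):
                output.append((streamed_row, broadcast_row))
    return output


# Hypothetical data: four order items in two partitions, two orders.
order_items = [
    [{"order_item_id": 1, "order_id": 1}, {"order_item_id": 2, "order_id": 2}],
    [{"order_item_id": 3, "order_id": 2}, {"order_item_id": 4, "order_id": 9}],
]
orders = [{"order_id": 1}, {"order_id": 2}]

# Without a condition the result is the full cross product: 4 * 2 = 8 pairs.
cross = broadcast_nested_loop_join(order_items, orders)

# With an equality condition only the matching pairs remain.
matched = broadcast_nested_loop_join(
    order_items, orders, lambda item, order: item["order_id"] == order["order_id"]
)
```

In the plans, `BuildRight` and `BuildLeft` indicate which side plays the role of `broadcast_rows`; the joined rows are the same either way.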


