**Broadcast Join:**

In Apache Spark, a broadcast join is an optimization for joins in which one of the DataFrames (or RDDs) is significantly smaller than the other. Spark "broadcasts" the smaller dataset, sending a full copy to every executor, so that each executor can join its partitions of the larger dataset locally against that copy.

**Key Concepts:**
-   Broadcasting means sending a read-only copy of the smaller DataFrame (or RDD) to each executor, so the larger dataset does not have to be shuffled across the network.
-   A shuffle is normally required in a join because rows with the same join key must be brought together on the same executor. A broadcast join bypasses this step.
-   Broadcasting is most effective when one of the datasets is small enough to fit in memory on the driver and on each executor.
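
The idea behind these points can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the sample rows, the `join_partition` helper, and the two-partition layout are all hypothetical, and the nested lists merely stand in for partitions held by different executors.

```python
# Plain-Python sketch of a broadcast (map-side) join; no Spark required.
# All data below is made up for illustration.

# Small lookup table: customer_id -> pincode. In Spark this is what gets broadcast.
customers = {1: "00725", 2: "78521"}

# Large dataset, split into partitions as it would be across executors.
order_partitions = [
    [(1, "CLOSED"), (2, "PENDING")],
    [(2, "COMPLETE"), (99, "CLOSED")],  # customer 99 has no match
]

def join_partition(partition, lookup):
    """Join one partition against its local copy of the lookup table."""
    return [(lookup.get(cust_id, "-1"), status) for cust_id, status in partition]

# Each partition is joined independently; no rows move between partitions.
joined = [join_partition(p, customers) for p in order_partitions]
print(joined)
```

Because every partition only reads its own rows plus the shared lookup table, no order row has to be regrouped by key, which is the step a shuffle would otherwise perform.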

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySparkExample") \
    .getOrCreate()



In [2]:
orders_base = spark.sparkContext.textFile("/Users/sugumarsrinivasan/Documents/data/orders.csv")

In [3]:
orders_header = orders_base.first()  # the first line of the file is the CSV header

In [4]:
orders_data_without_header = orders_base.filter(lambda line: line != orders_header)

In [5]:
# Split each line once and keep (customer_id, field at index 3), both as strings.
orders_mapped = orders_data_without_header.map(lambda x: x.split(",")).map(lambda f: (f[2], f[3]))

In [6]:
customers_base = spark.sparkContext.textFile("/Users/sugumarsrinivasan/Documents/data/customers.csv")

In [7]:
customers_header = customers_base.first()

In [8]:
customers_data_without_header = customers_base.filter(lambda line: line != customers_header)

In [9]:
# Split each line once and keep (customer_id, pincode), both as strings.
customers_mapped = customers_data_without_header.map(lambda x: x.split(",")).map(lambda f: (f[0], f[8]))
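
One caveat about parsing with `split(",")`: it miscounts fields if any value contains a quoted comma. The sketch below shows the difference using Python's standard `csv` module. The `parse_line` helper and the sample row are hypothetical, and whether the actual files contain quoted commas is not known from this notebook.

```python
import csv

def parse_line(line):
    """Parse one CSV line into a list of fields, honoring quoted commas."""
    return next(csv.reader([line]))

# Hypothetical customer row whose street field contains a quoted comma.
sample = '1,Jane,Doe,XXXXXXXXX,XXXXXXXXX,"12 Main St, Apt 4",Brownsville,TX,78521'

naive = sample.split(",")    # splits inside the quoted street field as well
parsed = parse_line(sample)  # keeps the quoted street field intact

print(len(naive), naive[8])    # index 8 is no longer the last field
print(len(parsed), parsed[8])  # index 8 is the last field, the pincode
```

If the data did need this, a helper like `parse_line` could be passed to `map` in place of the `split(",")` lambda.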

In [10]:
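# Collect the small dataset to the driver as a dict keyed by integer customer ID,
# then broadcast it so that every executor holds a read-only copy.
customers_broadcast = spark.sparkContext.broadcast(
    {int(cust_id): pincode for cust_id, pincode in customers_mapped.collect()}
)

In [11]:
def get_pincode(customer_id):
    # Look up the pincode in the broadcast dict; "-1" marks an unknown customer.
    return customers_broadcast.value.get(customer_id, "-1")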


In [12]:
# Map-side join: replace each order's customer ID with that customer's pincode.
joined_rdd = orders_mapped.map(lambda x: (get_pincode(int(x[0])), x[1]))


In [13]:
joined_rdd.saveAsTextFile("/Users/sugumarsrinivasan/Documents/data/broadcastresult")

Details from the Spark UI

![Spark UI: broadcast job](./screenshots/spark-broadcast-job.png)
![Spark UI: broadcast stage](./screenshots/spark-broadcast-stage.png)