**Join:**

In Apache Spark, a join is a transformation that combines two datasets (typically two DataFrames or two key-value pair RDDs) on a common key or condition and produces a new dataset containing data from both inputs. With the default inner join, the result contains only the rows for which the join condition holds.

Spark provides several join types, including inner, left outer, right outer, and full outer joins, and they are available for both DataFrames and RDDs.
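To make the difference between the join types concrete, here is a minimal plain-Python model of how they behave on lists of `(key, value)` pairs. It is only a sketch of the semantics, not how Spark implements joins; the `join_pairs` helper and the sample data are hypothetical. The result shape `(key, (left_value, right_value))`, with `None` for a missing side, mirrors what the RDD methods `join`, `leftOuterJoin`, `rightOuterJoin` and `fullOuterJoin` return.

```python
from collections import defaultdict


def join_pairs(left, right, how="inner"):
    """Plain-Python model of key-based joins on lists of (key, value) pairs."""
    left_by_key, right_by_key = defaultdict(list), defaultdict(list)
    for k, v in left:
        left_by_key[k].append(v)
    for k, v in right:
        right_by_key[k].append(v)

    # Pick which keys survive, depending on the join type.
    if how == "inner":
        keys = left_by_key.keys() & right_by_key.keys()
    elif how == "left":
        keys = left_by_key.keys()
    elif how == "right":
        keys = right_by_key.keys()
    elif how == "full":
        keys = left_by_key.keys() | right_by_key.keys()
    else:
        raise ValueError(f"unknown join type: {how}")

    result = []
    for k in sorted(keys):
        # A side with no match contributes a single None.
        for lv in left_by_key.get(k, [None]):
            for rv in right_by_key.get(k, [None]):
                result.append((k, (lv, rv)))
    return result


# Hypothetical sample data: (customer_id, value) pairs.
customers = [("1", "TX"), ("2", "CA"), ("3", "NY")]
orders = [("1", "CLOSED"), ("1", "PENDING"), ("4", "COMPLETE")]

inner = join_pairs(customers, orders, "inner")
left = join_pairs(customers, orders, "left")
right = join_pairs(customers, orders, "right")
full = join_pairs(customers, orders, "full")
```

Only the inner join drops unmatched keys on both sides; the outer variants keep them and fill the missing side with `None`.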

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySparkExample") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/20 17:46:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
orders_base = spark.sparkContext.textFile("/Users/sugumarsrinivasan/Documents/data/orders.csv")

In [3]:
orders_mapped = orders_base.map(lambda x: (x.split(",")[2], x.split(",")[3]))

In [4]:
customers_base = spark.sparkContext.textFile("/Users/sugumarsrinivasan/Documents/data/customers.csv")

In [6]:
customers_mapped = customers_base.map(lambda x: (x.split(",")[0], x.split(",")[8]))
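
The two `map` calls above build `(key, value)` pairs by column position, and each lambda splits the line twice. A small helper that splits once may be easier to read. This is a sketch: `to_pair` is a hypothetical name, and the sample rows below are made up purely to show the column positions.

```python
def to_pair(line, key_idx, value_idx, sep=","):
    """Split a delimited line once and return (key, value) from two columns."""
    fields = line.split(sep)
    return fields[key_idx], fields[value_idx]


# Made-up rows; only the column positions (2, 3) and (0, 8) come from the notebook.
order_line = "101,2024-01-01,7,CLOSED"
customer_line = "7,a,b,c,d,e,f,g,94105"

order_pair = to_pair(order_line, 2, 3)
customer_pair = to_pair(customer_line, 0, 8)
```

In the notebook this could be used as `orders_base.map(lambda x: to_pair(x, 2, 3))`. Note that a plain `split(",")` assumes no field contains a quoted comma.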

In [7]:
joined_rdd = customers_mapped.join(orders_mapped)

In [8]:
joined_rdd.saveAsTextFile("/Users/sugumarsrinivasan/Documents/data/output")
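
Each element of `joined_rdd` is a nested tuple of the form `(key, (customer_value, order_value))`, and `saveAsTextFile()` writes the string form of each element. If a flat, delimited line is preferred, the record can be formatted first. The following is a minimal sketch; `to_csv_line` and the sample record are hypothetical.

```python
def to_csv_line(record, sep=","):
    """Flatten a joined (key, (left_value, right_value)) record into one line."""
    key, (left_value, right_value) = record
    return sep.join([key, left_value, right_value])


joined_record = ("7", ("94105", "CLOSED"))  # hypothetical joined element

raw_line = str(joined_record)        # what gets written without formatting
flat_line = to_csv_line(joined_record)
```

In the notebook this would be applied as `joined_rdd.map(to_csv_line)` before calling `saveAsTextFile()`.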


Application details in the Spark UI:

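-   `saveAsTextFile()` is the only action in this notebook, so Spark runs exactly 1 job.
-   `join()` is a wide transformation: it shuffles records so that matching keys end up in the same partition, which is why the job is split into 2 stages.
-   Both input files (customers.csv and orders.csv) are smaller than the block size, so each one is read with the minimum number of partitions (`spark.sparkContext.defaultMinPartitions`, which is 2 here), giving 2 tasks per dataset.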
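
The reason a join needs a shuffle can be sketched in plain Python: every record is routed to a partition chosen from its key, so records with the same key from both datasets meet in the same place. This is a simplified model only. `partition_for` is a hypothetical helper, `zlib.crc32` is a stand-in for Spark's own hash function, and the sample data is made up.

```python
import zlib


def partition_for(key, num_partitions):
    """Assign a key to a partition with a deterministic hash (stand-in, not Spark's)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Made-up (key, value) pairs from the two sides of the join.
customers = [("7", "94105"), ("8", "10001")]
orders = [("7", "CLOSED"), ("8", "PENDING"), ("7", "COMPLETE")]
num_partitions = 2

# Model of the shuffle: route every record by its key.
shuffled = {p: [] for p in range(num_partitions)}
for source, pairs in (("customers", customers), ("orders", orders)):
    for key, value in pairs:
        shuffled[partition_for(key, num_partitions)].append((source, key, value))
```

After this step each partition holds all records for the keys assigned to it, so the matching can be done locally within a partition. In Spark, the work before and after the shuffle runs in separate stages.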


![Local Image](/Users/sugumarsrinivasan/Developer/Coding/pyspark/Jupyter_Notebooks/screenshots/spark-trans-join-job.png)
![Local image](/Users/sugumarsrinivasan/Developer/Coding/pyspark/Jupyter_Notebooks/screenshots/spark-trans-join.png)
![Local Image](/Users/sugumarsrinivasan/Developer/Coding/pyspark/Jupyter_Notebooks/screenshots/spark-trans-join-stg.png)
![Local Image](/Users/sugumarsrinivasan/Developer/Coding/pyspark/Jupyter_Notebooks/screenshots/spark-trans-join-agg.png)