# Module 6 - Joins in PySpark

## Introduction

Joins are fundamental operations in data engineering - they combine data from multiple DataFrames based on common columns. This notebook covers all join types and best practices.

## What You'll Learn

- Inner joins
- Left/Right/Full outer joins
- Left semi and anti joins
- Join best practices
- Broadcast joins
- Handling join performance issues


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import os
os.environ["JAVA_TOOL_OPTIONS"] = (
    "--add-opens=java.base/java.lang=ALL-UNNAMED "
    "--add-opens=java.base/java.nio=ALL-UNNAMED "
    "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
)
os.environ["PYSPARK_PYTHON"] = "python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python"

# Create SparkSession
spark = SparkSession.builder \
    .appName("Joins") \
    .master("local[*]") \
    .getOrCreate()

# Create employees DataFrame
employees_data = [
    (1, "Alice", "Sales"),
    (2, "Bob", "IT"),
    (3, "Charlie", "Sales"),
    (4, "Diana", "HR"),
    (5, "Eve", "IT")
]

employees_schema = StructType([
    StructField("EmpID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Department", StringType(), True)
])

df_employees = spark.createDataFrame(employees_data, employees_schema)

# Create departments DataFrame
departments_data = [
    ("Sales", "New York"),
    ("IT", "San Francisco"),
    ("HR", "Chicago"),
    ("Finance", "Boston")  # Department not in employees
]

departments_schema = StructType([
    StructField("Department", StringType(), True),
    StructField("Location", StringType(), True)
])

df_departments = spark.createDataFrame(departments_data, departments_schema)

print("Employees DataFrame:")
df_employees.show()
print("\nDepartments DataFrame:")
df_departments.show()


Employees DataFrame:
+-----+-------+----------+
|EmpID|   Name|Department|
+-----+-------+----------+
|    1|  Alice|     Sales|
|    2|    Bob|        IT|
|    3|Charlie|     Sales|
|    4|  Diana|        HR|
|    5|    Eve|        IT|
+-----+-------+----------+


Departments DataFrame:
+----------+-------------+
|Department|     Location|
+----------+-------------+
|     Sales|     New York|
|        IT|San Francisco|
|        HR|      Chicago|
|   Finance|       Boston|
+----------+-------------+



## Inner Join

**Inner Join** returns only rows where the join key exists in both DataFrames. This is the most common join type.


In [2]:
# Inner join - only matching rows
df_inner = df_employees.join(df_departments, on="Department", how="inner")
print("Inner Join (only matching departments):")
df_inner.show()


Inner Join (only matching departments):
+----------+-----+-------+-------------+
|Department|EmpID|   Name|     Location|
+----------+-----+-------+-------------+
|        HR|    4|  Diana|      Chicago|
|        IT|    2|    Bob|San Francisco|
|        IT|    5|    Eve|San Francisco|
|     Sales|    1|  Alice|     New York|
|     Sales|    3|Charlie|     New York|
+----------+-----+-------+-------------+



## Left Join (Left Outer Join)

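**Left Join** returns all rows from the left DataFrame and matching rows from the right DataFrame. Left rows with no match get null values in the right-hand columns. In this example every employee's department exists in `df_departments`, so the result contains no nulls.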


In [3]:
# Left join - all rows from left, matching from right
df_left = df_employees.join(df_departments, on="Department", how="left")
print("Left Join (all employees, with department info if available):")
df_left.show()


Left Join (all employees, with department info if available):
+----------+-----+-------+-------------+
|Department|EmpID|   Name|     Location|
+----------+-----+-------+-------------+
|     Sales|    1|  Alice|     New York|
|        IT|    2|    Bob|San Francisco|
|     Sales|    3|Charlie|     New York|
|        HR|    4|  Diana|      Chicago|
|        IT|    5|    Eve|San Francisco|
+----------+-----+-------+-------------+



## Right Join (Right Outer Join)

**Right Join** returns all rows from the right DataFrame and matching rows from the left DataFrame. Right rows with no match (here, Finance) get null values in the left-hand columns.


In [13]:
# Right join - all rows from right, matching from left
df_right = df_employees.join(df_departments, on="Department", how="right")
print("Right Join (all departments, with employees if available):")
df_right.show()


Right Join (all departments, with employees if available):
+----------+-----+-------+-------------+
|Department|EmpID|   Name|     Location|
+----------+-----+-------+-------------+
|     Sales|    3|Charlie|     New York|
|     Sales|    1|  Alice|     New York|
|        IT|    5|    Eve|San Francisco|
|        IT|    2|    Bob|San Francisco|
|        HR|    4|  Diana|      Chicago|
|   Finance| NULL|   NULL|       Boston|
+----------+-----+-------+-------------+



## Full Outer Join

**Full Outer Join** returns all rows from both DataFrames, with nulls where there's no match.


In [14]:
# Full outer join - all rows from both DataFrames
df_full = df_employees.join(df_departments, on="Department", how="full")
print("Full Outer Join (all rows from both):")
df_full.show()


Full Outer Join (all rows from both):
+----------+-----+-------+-------------+
|Department|EmpID|   Name|     Location|
+----------+-----+-------+-------------+
|   Finance| NULL|   NULL|       Boston|
|        HR|    4|  Diana|      Chicago|
|        IT|    2|    Bob|San Francisco|
|        IT|    5|    Eve|San Francisco|
|     Sales|    1|  Alice|     New York|
|     Sales|    3|Charlie|     New York|
+----------+-----+-------+-------------+



## Left Semi Join

**Left Semi Join** returns only rows from the left DataFrame that have a match in the right DataFrame. Similar to `EXISTS` in SQL.


In [15]:
# Left semi join - only left DataFrame columns, only matching rows
df_semi = df_employees.join(df_departments, on="Department", how="left_semi")
print("Left Semi Join (employees in departments that exist):")
df_semi.show()


Left Semi Join (employees in departments that exist):
+----------+-----+-------+
|Department|EmpID|   Name|
+----------+-----+-------+
|        HR|    4|  Diana|
|        IT|    2|    Bob|
|        IT|    5|    Eve|
|     Sales|    1|  Alice|
|     Sales|    3|Charlie|
+----------+-----+-------+



## Left Anti Join

**Left Anti Join** returns only rows from the left DataFrame that do NOT have a match in the right DataFrame. Similar to `NOT EXISTS` in SQL.


In [16]:
# Left anti join - rows in left that don't match right
df_anti = df_employees.join(df_departments, on="Department", how="left_anti")
print("Left Anti Join (employees in departments that don't exist in departments table):")
df_anti.show()


Left Anti Join (employees in departments that don't exist in departments table):
+----------+-----+----+
|Department|EmpID|Name|
+----------+-----+----+
+----------+-----+----+
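
Because every employee above belongs to a known department, the anti join result is empty. The semantics of both joins can be modelled in plain Python, without Spark. This is only an illustrative sketch: the employee `Frank` in a `Marketing` department is hypothetical data added so that the anti join returns something.

```python
# Plain-Python model of left semi and left anti joins (not Spark code).
# "Frank" / "Marketing" is hypothetical data, not part of the notebook's DataFrames.
employees = [
    (1, "Alice", "Sales"),
    (2, "Bob", "IT"),
    (6, "Frank", "Marketing"),
]
departments = [
    ("Sales", "New York"),
    ("IT", "San Francisco"),
    ("HR", "Chicago"),
]

# Only the join keys of the right side matter; its other columns are never returned.
dept_keys = {dept for dept, _location in departments}

# Left semi: keep left rows whose key exists on the right (like EXISTS).
left_semi = [emp for emp in employees if emp[2] in dept_keys]

# Left anti: keep left rows whose key does NOT exist on the right (like NOT EXISTS).
left_anti = [emp for emp in employees if emp[2] not in dept_keys]

print("left_semi:", left_semi)
print("left_anti:", left_anti)
```

Both results contain only left-side columns, which is what distinguishes these joins from an inner join.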



## Joins with Different Column Names

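When join columns have different names, specify them explicitly with a join expression, for example `df_employees.Department == df_dept_alt.DeptName`. Unlike `on="Department"`, a join expression keeps both key columns in the result.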


### Why Normal Joins Are Costly

**Normal joins are expensive operations** because they are **wide transformations** that require data shuffling across the network:

1. **Shuffling Overhead**: In a normal join, Spark needs to redistribute data across all executors based on join keys. This involves:
   - Reading data from all partitions
   - Sorting and grouping by join keys
   - Transferring data over the network to the correct executors
   - Writing shuffled data to disk (if memory is insufficient)

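2. **Multiple Stages**: Normal joins typically run in at least **2 stages**, because each shuffle is a stage boundary:
   - **Stage 1**: Shuffle write: each input DataFrame is partitioned by join key and written out (the Spark UI may show one such stage per input)
   - **Stage 2**: Shuffle read, join operation, and result output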

3. **Network I/O**: Large amounts of data move across the network, which is slow and expensive.
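
The redistribution described above can be sketched in plain Python, without Spark. This is a simplified model rather than Spark's implementation: `partition_of` is a hypothetical stand-in for Spark's hash partitioner, the per-partition join uses a dictionary where Spark would often use a sort-merge join, and the partition layout is invented for illustration.

```python
from collections import defaultdict

def partition_of(key, num_partitions):
    # Hypothetical, deterministic stand-in for a hash partitioner.
    return sum(ord(ch) for ch in key) % num_partitions

def shuffle_join(left_parts, right_parts, num_partitions):
    """Model a shuffle join: redistribute BOTH sides by key, then join per partition."""
    left_shuffled = defaultdict(list)
    right_shuffled = defaultdict(list)
    records_shuffled = 0

    # Phase 1 (shuffle): every record of both inputs is routed by its join key.
    for rows in left_parts:
        for emp_id, dept in rows:
            left_shuffled[partition_of(dept, num_partitions)].append((emp_id, dept))
            records_shuffled += 1
    for rows in right_parts:
        for dept, location in rows:
            right_shuffled[partition_of(dept, num_partitions)].append((dept, location))
            records_shuffled += 1

    # Phase 2 (join): rows with equal keys now share a partition, so each
    # partition can be joined independently. Assumes unique keys on the right.
    joined = []
    for p in range(num_partitions):
        lookup = dict(right_shuffled[p])
        for emp_id, dept in left_shuffled[p]:
            if dept in lookup:
                joined.append((dept, emp_id, lookup[dept]))
    return joined, records_shuffled

employees_parts = [[(1, "Sales"), (2, "IT")], [(3, "Sales"), (4, "HR"), (5, "IT")]]
departments_parts = [
    [("Sales", "New York"), ("IT", "San Francisco")],
    [("HR", "Chicago"), ("Finance", "Boston")],
]

rows, records_shuffled = shuffle_join(employees_parts, departments_parts, num_partitions=3)
print(sorted(rows))
print("records routed through the shuffle:", records_shuffled)
```

The point of the model is the count: every record of both inputs passes through the shuffle, regardless of how small one side is.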

### How Broadcast Join Provides Optimized Solution

**Broadcast Join** is an optimization technique that eliminates shuffling by sending a small DataFrame to all executors:

1. **No Shuffling**: The small table is replicated to all executors once, eliminating network shuffling for the large table
2. **Single Stage**: The join itself executes in **1 stage** instead of 2, as the large table is not shuffled (Spark may show the collection and broadcast of the small table as a separate, small job)
3. **Local Join**: Each executor performs the join locally using the broadcasted small table and its portion of the large table
4. **Reduced Network Traffic**: Only the small table is sent over the network once, not the large table multiple times

### How Broadcast Join Works (Intuitive Example)

Imagine you have:
- **Large table**: 1 million employee records distributed across 10 executors (100K records each)
- **Small table**: 50 department records (lookup table)

**Normal Join Process:**
```
Executor 1: Has 100K employees → Needs to find matching departments
Executor 2: Has 100K employees → Needs to find matching departments
...
Executor 10: Has 100K employees → Needs to find matching departments

Problem: Each executor needs department data, but it's not available locally!
Solution: Spark shuffles ALL employee data across network to group by department,
         then joins with departments. This is EXPENSIVE!
```

**Broadcast Join Process:**
```
Step 1: Spark sends the 50 department records to ALL 10 executors (one-time broadcast)
Step 2: Each executor now has:
        - Its local 100K employee records
        - Complete copy of 50 department records (broadcasted)
Step 3: Each executor performs join LOCALLY (no network shuffling!)
Step 4: Results are collected

Result: Only 50 records x 10 executors = 500 small records sent over network
        (instead of 1 million shuffled records)
```

**Visual Representation:**
```
Normal Join:
[Large Table] → [Shuffle] → [Join] → [Results]
[Small Table] ↗              ↗
(2 stages, network shuffling)

Broadcast Join:
[Small Table] → [Broadcast to all executors]
[Large Table] → [Local Join on each executor] → [Results]
(1 stage, no shuffling)
```
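
The broadcast process above can also be modelled in plain Python. This is a sketch under stated assumptions, not Spark's implementation: `broadcast_join` is a hypothetical helper, each list in `employee_parts` stands for one executor's partition, and the data is generated (1,000 employees rather than 1 million, to keep it fast).

```python
def broadcast_join(large_parts, small_rows):
    """Model a broadcast hash join: copy the small side everywhere, probe locally."""
    # The hash table is built from the small side; assumes unique keys there.
    lookup = dict(small_rows)
    # Every partition receives its own full copy of the small side.
    small_copies_sent = len(small_rows) * len(large_parts)

    joined = []
    for rows in large_parts:
        # Each partition joins only its own rows; no large-side row moves.
        for emp_id, dept in rows:
            if dept in lookup:
                joined.append((dept, emp_id, lookup[dept]))
    return joined, small_copies_sent

NUM_PARTITIONS = 10
departments = [(f"dept_{i}", f"city_{i}") for i in range(50)]
employees = [(emp_id, f"dept_{emp_id % 50}") for emp_id in range(1000)]
# Spread employees round-robin across the partitions.
employee_parts = [employees[p::NUM_PARTITIONS] for p in range(NUM_PARTITIONS)]

rows, small_copies_sent = broadcast_join(employee_parts, departments)
print("joined rows:", len(rows))
print("small-side records copied:", small_copies_sent)  # 50 rows x 10 partitions
print("large-side records moved: 0")
```

The cost grows with the size of the small table times the number of executors, which is why broadcasting only pays off when one side is genuinely small.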


In [17]:
# Let's create a more realistic example to demonstrate broadcast join
# Create a larger employees DataFrame to better illustrate the concept
from pyspark.sql.functions import col, lit

# Create a larger dataset for demonstration
large_employees_data = []
for i in range(1, 1001):  # 1000 employees
    dept = ["Sales", "IT", "HR"][i % 3]
    large_employees_data.append((i, f"Employee_{i}", dept))

large_employees_schema = StructType([
    StructField("EmpID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Department", StringType(), True)
])

df_large_employees = spark.createDataFrame(large_employees_data, large_employees_schema)

print("Large Employees DataFrame (1000 records):")
print(f"Number of partitions: {df_large_employees.rdd.getNumPartitions()}")
df_large_employees.show(10)

print("\nSmall Departments DataFrame (4 records):")
df_departments.show()

# Normal join (without broadcast) - will cause shuffling
print("\n" + "="*60)
print("NORMAL JOIN (without broadcast):")
print("="*60)
print("This will create 2 stages due to shuffling")
df_normal_join = df_large_employees.join(
    df_departments,
    on="Department",
    how="inner"
)
# explain(True) prints the logical and physical plans; it does not run the job
df_normal_join.explain(True)
print("\nNote: Check the physical plan above - 'Exchange hashpartitioning' operations indicate shuffling")


Large Employees DataFrame (1000 records):
Number of partitions: 11
+-----+-----------+----------+
|EmpID|       Name|Department|
+-----+-----------+----------+
|    1| Employee_1|        IT|
|    2| Employee_2|        HR|
|    3| Employee_3|     Sales|
|    4| Employee_4|        IT|
|    5| Employee_5|        HR|
|    6| Employee_6|     Sales|
|    7| Employee_7|        IT|
|    8| Employee_8|        HR|
|    9| Employee_9|     Sales|
|   10|Employee_10|        IT|
+-----+-----------+----------+
only showing top 10 rows


Small Departments DataFrame (4 records):
+----------+-------------+
|Department|     Location|
+----------+-------------+
|     Sales|     New York|
|        IT|San Francisco|
|        HR|      Chicago|
|   Finance|       Boston|
+----------+-------------+


NORMAL JOIN (without broadcast):
This will create 2 stages due to shuffling
== Parsed Logical Plan ==
'Join UsingJoin(Inner, [Department])
:- LogicalRDD [EmpID#330, Name#331, Department#332], false
+- LogicalRDD [

In [None]:
# Broadcast join - optimized version
from pyspark.sql.functions import broadcast

print("\n" + "="*60)
print("BROADCAST JOIN (with broadcast):")
print("="*60)
print("This will create only 1 stage - no shuffling!")

df_broadcast_join_optimized = df_large_employees.join(
    broadcast(df_departments),
    on="Department",
    how="inner"
)

# explain(True) prints the logical and physical plans; it does not run the job
df_broadcast_join_optimized.explain(True)
print("\nNote: Check the physical plan above - you'll see 'BroadcastHashJoin' and 'BroadcastExchange', but no shuffle ('Exchange hashpartitioning').")
print("This means the large DataFrame is not shuffled - the join happens locally on each executor.")



BROADCAST JOIN (with broadcast):
This will create only 1 stage - no shuffling!



### How to Confirm Broadcast Join from Spark UI (Step by Step)

The Spark UI provides visual confirmation that a broadcast join is being used. Here's how to verify:

#### Step 1: Access Spark UI
1. After running your Spark job, open your web browser
2. Navigate to: `http://localhost:4040` (or the port shown in your Spark logs)
3. If multiple Spark sessions are running, check the port number in the logs (e.g., `Service 'SparkUI' could not bind on port 4040. Attempting port 4041.`)

#### Step 2: Navigate to the Completed Jobs/Stages
1. Click on the **"Jobs"** tab in the Spark UI
2. Find your job (usually the most recent one)
3. Click on the job to see its stages

#### Step 3: Identify Broadcast Join Characteristics

**Key Indicators of Broadcast Join:**

1. **Single Stage**: 
   - The broadcast join itself executes in **1 stage** (not 2)
   - Look for a single join stage in the job details
   - Normal joins typically show 2 or more stages (shuffle write for the inputs, then shuffle read and join)

2. **No Shuffle Operations**:
   - In the stage details, you should **NOT** see:
     - "Shuffle Read" metrics
     - "Shuffle Write" metrics
     - "Exchange" (shuffle) operations in the DAG visualization
   - A "BroadcastExchange" node is expected and is not a shuffle
   - If you see shuffle metrics or a plain "Exchange", it's a normal join, not a broadcast join

3. **BroadcastHashJoin in the Query Plan**:
   - In the **"SQL / DataFrame"** tab (named "SQL" in older versions) or the **"DAG Visualization"**:
     - Look for the **"BroadcastHashJoin"** operation
     - This confirms Spark is using the broadcast join strategy

4. **Broadcast of the Small Table**:
   - The **"Storage"** tab lists only cached or persisted RDDs and DataFrames, so the broadcast table does not appear there
   - Look instead for the **"BroadcastExchange"** node in the query plan, which represents the small table being sent to all executors

#### Step 4: Compare Normal Join vs Broadcast Join

**Normal Join in Spark UI:**
- **Stages**: 2 or more stages
- **Shuffle stage(s)**: Show "Shuffle Write" metrics
- **Join stage**: Shows "Shuffle Read" metrics and the join operation
- **DAG**: Shows "Exchange" nodes (indicating data shuffling)

**Broadcast Join in Spark UI:**
- **Stages**: 1 stage for the join
- **No Shuffle Metrics**: No "Shuffle Read" or "Shuffle Write"
- **DAG**: Shows "BroadcastHashJoin" and "BroadcastExchange" without shuffle "Exchange" nodes
- **Storage Tab**: Does not list the broadcast table; it shows only cached or persisted data

#### Visual Example in Spark UI:

```
Normal Join (2 Stages):
┌─────────────┐
│   Stage 1   │  ← Shuffle Read/Write (network I/O)
│  (Shuffle)  │
└──────┬──────┘
       │
┌──────▼──────┐
│   Stage 2   │  ← Join Operation
│   (Join)    │
└─────────────┘

Broadcast Join (1 Stage):
┌─────────────────────┐
│      Stage 1        │  ← BroadcastHashJoin
│  (No Shuffling!)    │     (Local join on each executor)
└─────────────────────┘
```

#### Step 5: Check Execution Plan in Code

You can also verify in your code using `.explain(True)`:

```python
# Normal join - will show Exchange operations
df_normal_join.explain(True)
# Look for: "Exchange hashpartitioning" (a shuffle) in the physical plan

# Broadcast join - will show BroadcastHashJoin
df_broadcast_join_optimized.explain(True)
# Look for: "BroadcastHashJoin" and "BroadcastExchange", with no "Exchange hashpartitioning"
```

**Summary**: Broadcast joins are confirmed by:
- ✅ **1 stage** (not 2)
- ✅ **No shuffle operations** (no shuffle Exchange nodes; a BroadcastExchange node is expected)
- ✅ **BroadcastHashJoin** in the execution plan
- ✅ **No network shuffling** of the large table


In [19]:
# Practical demonstration: Compare execution plans
print("="*70)
print("COMPARISON: Normal Join vs Broadcast Join Execution Plans")
print("="*70)

print("\n1. NORMAL JOIN EXECUTION PLAN:")
print("-" * 70)
df_normal_join.explain(True)

print("\n\n2. BROADCAST JOIN EXECUTION PLAN:")
print("-" * 70)
df_broadcast_join_optimized.explain(True)

print("\n" + "="*70)
print("KEY DIFFERENCES TO LOOK FOR:")
print("="*70)
print("Normal Join:")
print("  - Contains 'Exchange' or 'ShuffleExchange' operations")
print("  - Shows data being shuffled across network")
print("  - Multiple stages in Spark UI")
print("\nBroadcast Join:")
print("  - Contains 'BroadcastHashJoin' operation")
print("  - NO shuffle 'Exchange' operations ('BroadcastExchange' is expected)")
print("  - Single stage in Spark UI")
print("  - Small table is broadcasted to all executors")


COMPARISON: Normal Join vs Broadcast Join Execution Plans

1. NORMAL JOIN EXECUTION PLAN:
----------------------------------------------------------------------
== Parsed Logical Plan ==
'Join UsingJoin(Inner, [Department])
:- LogicalRDD [EmpID#330, Name#331, Department#332], false
+- LogicalRDD [Department#187, Location#188], false

== Analyzed Logical Plan ==
Department: string, EmpID: int, Name: string, Location: string
Project [Department#332, EmpID#330, Name#331, Location#188]
+- Join Inner, (Department#332 = Department#187)
   :- LogicalRDD [EmpID#330, Name#331, Department#332], false
   +- LogicalRDD [Department#187, Location#188], false

== Optimized Logical Plan ==
Project [Department#332, EmpID#330, Name#331, Location#188]
+- Join Inner, (Department#332 = Department#187)
   :- Filter isnotnull(Department#332)
   :  +- LogicalRDD [EmpID#330, Name#331, Department#332], false
   +- Filter isnotnull(Department#187)
      +- LogicalRDD [Department#187, Location#188], false

== Phy

In [20]:
# Create DataFrame with different column name
df_dept_alt = df_departments.withColumnRenamed("Department", "DeptName")

# Join with different column names
df_join_diff = df_employees.join(
    df_dept_alt,
    df_employees.Department == df_dept_alt.DeptName,
    how="inner"
)
print("Join with different column names:")
df_join_diff.show()


Join with different column names:
+-----+-------+----------+--------+-------------+
|EmpID|   Name|Department|DeptName|     Location|
+-----+-------+----------+--------+-------------+
|    4|  Diana|        HR|      HR|      Chicago|
|    2|    Bob|        IT|      IT|San Francisco|
|    5|    Eve|        IT|      IT|San Francisco|
|    1|  Alice|     Sales|   Sales|     New York|
|    3|Charlie|     Sales|   Sales|     New York|
+-----+-------+----------+--------+-------------+



## Broadcast Join

**Broadcast Join** is an optimization technique. When one DataFrame is small, Spark can broadcast it to all executors to avoid shuffling the large DataFrame.

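**When to Use:**
- One DataFrame is small enough to fit comfortably in each executor's memory (a common rule of thumb is under about 100MB; Spark broadcasts automatically only below `spark.sql.autoBroadcastJoinThreshold`, which defaults to 10MB)
- Join key distribution is skewed, since a broadcast join does not repartition the large DataFrame by key
- Shuffling the large DataFrame would otherwise dominate the job's runtime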


In [21]:
from pyspark.sql.functions import broadcast

# Broadcast the smaller DataFrame
df_broadcast_join = df_employees.join(
    broadcast(df_departments),
    on="Department",
    how="inner"
)

print("Broadcast join (small table broadcasted):")
df_broadcast_join.show()

print("\nNote: Spark may automatically broadcast small tables, but you can")
print("explicitly use broadcast() for optimization.")


Broadcast join (small table broadcasted):
+----------+-----+-------+-------------+
|Department|EmpID|   Name|     Location|
+----------+-----+-------+-------------+
|     Sales|    1|  Alice|     New York|
|        IT|    2|    Bob|San Francisco|
|     Sales|    3|Charlie|     New York|
|        HR|    4|  Diana|      Chicago|
|        IT|    5|    Eve|San Francisco|
+----------+-----+-------+-------------+


Note: Spark may automatically broadcast small tables, but you can
explicitly use broadcast() for optimization.


## Join Best Practices

1. **Use Appropriate Join Type**: Choose the right join based on your needs
   - Inner: When you only want matching rows
   - Left: When you want all rows from left table
   - Full: When you want all rows from both tables

2. **Broadcast Small Tables**: Use `broadcast()` for small lookup tables (< 100MB)

3. **Filter Before Joining**: Filter DataFrames before joining to reduce data size

4. **Avoid Cartesian Products**: Always specify join conditions

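5. **Handle Nulls**: Null join keys never match each other in an equality join, so rows with null keys are dropped by inner joins; use `eqNullSafe` in the join condition if nulls should match

6. **Monitor Skew**: If one join key has many more rows than others, consider broadcasting the other side, salting the key, or enabling adaptive skew-join handling (`spark.sql.adaptive.skewJoin.enabled`)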

7. **Use Appropriate Partitioning**: Partition large tables by join keys when possible
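
Skew (item 6 above) is often mitigated by key salting, a technique this notebook does not otherwise cover. The sketch below models the idea in plain Python under stated assumptions: `NUM_SALTS`, `salt_left`, and `explode_right` are hypothetical names, and the data is invented so that one key ("Sales") dominates.

```python
import random
from collections import Counter

NUM_SALTS = 4  # assumption: how many sub-keys each original key is split into

def salt_left(rows, num_salts, rng):
    """Attach a random salt to each large-side key so a hot key spreads over several groups."""
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]

def explode_right(rows, num_salts):
    """Replicate each small-side row once per salt so every salted key still finds its match."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Skewed large side: 900 of 1000 rows share the key "Sales".
left = [("Sales", emp_id) for emp_id in range(900)]
left += [("IT", emp_id) for emp_id in range(900, 1000)]
right = [("Sales", "New York"), ("IT", "San Francisco")]

salted_left = salt_left(left, NUM_SALTS, rng)
right_lookup = dict(explode_right(right, NUM_SALTS))

# Join on the (key, salt) pair, then drop the salt from the output.
joined = [
    (key[0], emp_id, right_lookup[key])
    for key, emp_id in salted_left
    if key in right_lookup
]

rows_per_key_before = Counter(key for key, _ in left)
rows_per_key_after = Counter(key for key, _ in salted_left)
print("largest group before salting:", max(rows_per_key_before.values()))
print("largest group after salting:", max(rows_per_key_after.values()))
print("joined rows:", len(joined))
```

The trade-off is that the small side grows by a factor of `NUM_SALTS`, so salting suits cases where the skewed side is much larger than the other.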


In [None]:
# Example: Filter before joining (best practice)
df_filtered_employees = df_employees.filter(df_employees.Department == "Sales")
df_filtered_join = df_filtered_employees.join(df_departments, on="Department", how="inner")

print("Filtered before joining (more efficient):")
df_filtered_join.show()


Filtered before joining (more efficient):
+----------+-----+-------+--------+
|Department|EmpID|   Name|Location|
+----------+-----+-------+--------+
|     Sales|    1|  Alice|New York|
|     Sales|    3|Charlie|New York|
+----------+-----+-------+--------+




## Summary

In this notebook, you learned:

1. **Inner Join**: Returns only matching rows from both DataFrames
2. **Left/Right/Full Outer Joins**: Return all rows from one or both DataFrames, with nulls where there is no match
3. **Left Semi Join**: Returns left DataFrame rows that have matches (like EXISTS)
4. **Left Anti Join**: Returns left DataFrame rows that don't have matches (like NOT EXISTS)
5. **Broadcast Join**: Optimization for joining with small tables
6. **Best Practices**: Filter before joining, broadcast small tables, handle nulls

**Key Takeaway**: Joins are expensive operations (cause shuffles). Use appropriate join types, filter data before joining, and use broadcast joins for small lookup tables to optimize performance.

**Next Steps**: In Module 7, we'll learn about advanced operations including window functions, UDFs, and complex data types.
