# 📡 Broadcast Join: Optimizing Small Table Joins

**Time to complete:** 30 minutes  
**Difficulty:** Advanced  
**Prerequisites:** Basic joins, DataFrame operations

---

## 🎯 Learning Objectives

By the end of this notebook, you will master:
- ✅ **Broadcast join mechanics** - How it works
- ✅ **When to use broadcast joins** - Optimal scenarios
- ✅ **Manual broadcast control** - Forcing broadcast behavior
- ✅ **Performance monitoring** - Measuring broadcast effectiveness
- ✅ **Broadcast join limitations** - When it doesn't work
- ✅ **Alternative strategies** - When broadcast isn't feasible

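**Broadcast joins can be 10-100x faster than shuffle joins when the small table fits comfortably in executor memory; the actual gain depends on data size and cluster.**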

---

## 🔍 Understanding Broadcast Joins

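**Broadcast join** is Spark's optimization for joining a large table with a small table. Instead of shuffling both tables across the network, Spark sends a full copy of the small table to every executor, and each executor joins its local partitions of the large table against that copy.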

### How Broadcast Join Works:
```
Traditional Join:
┌─────────────┐       ┌─────────────┐
│ Large Table │──────▶│   Shuffle   │
│ (1TB)       │       │   Both      │
└─────────────┘       │   Tables    │
        │             └─────────────┘
┌─────────────┐             │
│ Small Table │──────▶   Network Heavy
│ (100MB)     │
└─────────────┘

vs

Broadcast Join:
┌─────────────┐       ┌─────────────┐
│ Large Table │       │ Large Table │
│ (1TB)       │       │ (1TB)       │
└─────────────┘       └─────────────┘
        │                   │
        ▼                   ▼
┌─────────────┐       ┌─────────────┐
│ Small Table │───▶   │Small Table  │ (broadcast)
│ (100MB)     │       │  in Memory  │
└─────────────┘       └─────────────┘
                      Network Light ⚡
```
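
The two phases in the diagram can be imitated in plain Python. This is only a minimal sketch of the idea, not Spark's implementation: `broadcast_hash_join` is a hypothetical helper, the "partitions" are ordinary lists, and the sample rows are made up for illustration.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, large_key, small_key):
    """Inner-join every partition of the large table against one shared copy of the small table."""
    # Build phase: hash the small table once. In Spark, this copy is what gets shipped to executors.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[small_key]].append(row)

    # Probe phase: each partition is joined where it already lives, so no large rows move.
    joined_partitions = []
    for partition in large_partitions:
        joined = []
        for row in partition:
            for match in lookup.get(row[large_key], []):
                joined.append({**row, **match})
        joined_partitions.append(joined)
    return joined_partitions


# Two "partitions" of an orders table and a tiny products table (illustrative data).
orders = [
    [{"order_id": 1, "product_id": "p1"}, {"order_id": 2, "product_id": "p2"}],
    [{"order_id": 3, "product_id": "p1"}, {"order_id": 4, "product_id": "p9"}],
]
products = [
    {"product_id": "p1", "brand": "Brand_1"},
    {"product_id": "p2", "brand": "Brand_2"},
]

result = broadcast_hash_join(orders, products, "product_id", "product_id")
for index, partition in enumerate(result):
    print(index, partition)
```

Order 4 has no matching product, so it drops out of the inner join. The output keeps the same number of partitions as the input, which is the point: the large side is never redistributed.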

**Network traffic reduction: often 99%+ when the small table is tiny relative to the large one and the executor count is moderate.**
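
A back-of-the-envelope estimate shows where a figure like this comes from. The sketch below assumes a shuffle join moves roughly both tables once and a broadcast join moves one copy of the small table per executor. It ignores compression, data locality, and how Spark actually distributes broadcast blocks, and `estimate_network_bytes` is a hypothetical helper.

```python
def estimate_network_bytes(large_bytes, small_bytes, num_executors):
    """Rough upper bounds on bytes moved by each strategy (simplified model)."""
    shuffle_bytes = large_bytes + small_bytes      # a shuffle join repartitions both tables
    broadcast_bytes = small_bytes * num_executors  # one full copy of the small table per executor
    reduction = 1 - broadcast_bytes / shuffle_bytes
    return shuffle_bytes, broadcast_bytes, reduction


TB = 1024 ** 4
MB = 1024 ** 2

# Sizes taken from the diagram: a 1TB large table and a 100MB small table.
for executors in (10, 100, 1000):
    shuffle_b, broadcast_b, reduction = estimate_network_bytes(1 * TB, 100 * MB, executors)
    print(f"{executors:>5} executors: broadcast moves {broadcast_b / MB:,.0f} MB, "
          f"reduction {reduction:.1%}")
```

Under these assumptions the reduction stays above 99% up to about 100 executors and falls to roughly 90% at 1,000, so the saving shrinks as the cluster grows.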

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, broadcast
import pyspark.sql.functions as F
import time

spark = SparkSession.builder \
    .appName("Broadcast_Joins") \
    .master("local[*]") \
    .getOrCreate()

print(f"✅ Spark ready - Version: {spark.version}")

# Check current broadcast threshold
# (the value comes back as a string and may carry a unit suffix, e.g. "10485760b")
threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
print(f"Current broadcast threshold: {threshold}")

# Create test datasets
# Large table (simulating fact table)
large_orders = [
    (i, f"customer_{i%1000}", f"product_{(i%100)+1}", 10 + (i % 90), f"2023-{(i%12)+1:02d}-01")
    for i in range(50000)
]

# Medium table (simulating dimension table)
medium_products = [
    (f"product_{i}", f"Category_{(i%5)+1}", f"Brand_{(i%3)+1}", 20.0 + (i % 80))
    for i in range(1, 101)
]

# Small table (lookup/dimension table)
small_categories = [
    ("Category_1", "Electronics", "High-tech products"),
    ("Category_2", "Clothing", "Fashion and apparel"),
    ("Category_3", "Books", "Educational materials"),
    ("Category_4", "Home", "Household items"),
    ("Category_5", "Sports", "Athletic equipment")
]

# Create DataFrames
orders_df = spark.createDataFrame(large_orders, 
    ["order_id", "customer_id", "product_id", "quantity", "order_date"])
products_df = spark.createDataFrame(medium_products, 
    ["product_id", "category_id", "brand", "base_price"])
categories_df = spark.createDataFrame(small_categories, 
    ["category_id", "category_name", "description"])

print("📊 Test Datasets:")
print(f"Orders: {orders_df.count():,} rows")
print(f"Products: {products_df.count()} rows")
print(f"Categories: {categories_df.count()} rows")

## ⚡ Automatic Broadcast Detection

### Spark's Auto-Broadcast Logic
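
The decision Spark makes here can be approximated as a size comparison against `spark.sql.autoBroadcastJoinThreshold`, with `-1` turning automatic broadcasting off. The sketch below is a simplified model rather than Spark's code: `parse_size` and `would_auto_broadcast` are hypothetical helpers, the unit table assumes binary multiples, and the real optimizer works from its own plan statistics.

```python
import re

# Binary unit multipliers (assumption: 1 KB = 1024 bytes, so 10MB = 10485760 bytes).
_UNITS = {
    "": 1, "b": 1,
    "k": 1024, "kb": 1024,
    "m": 1024 ** 2, "mb": 1024 ** 2,
    "g": 1024 ** 3, "gb": 1024 ** 3,
}


def parse_size(value):
    """Convert strings such as '10MB', '1KB', '10485760b' or '-1' to a byte count."""
    match = re.fullmatch(r"\s*(-?\d+)\s*([a-z]*)\s*", str(value).lower())
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"Unrecognized size: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def would_auto_broadcast(estimated_size_bytes, threshold):
    """Simplified model: broadcast when the estimate fits within a non-negative threshold."""
    threshold_bytes = parse_size(threshold)
    if threshold_bytes < 0:  # -1 disables automatic broadcasting
        return False
    return estimated_size_bytes <= threshold_bytes


print(would_auto_broadcast(500, "1KB"))
print(would_auto_broadcast(5_000, "1KB"))
print(would_auto_broadcast(5_000, "10MB"))
print(would_auto_broadcast(5_000, "-1"))
```

The helper also accepts the string forms that `spark.conf.get` can return, which is convenient because those strings cannot be passed to `int()` directly.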

In [None]:
# Demonstrate automatic broadcast detection
print("⚡ AUTOMATIC BROADCAST DETECTION")
print("=" * 50)

# Rough size estimates (rows x columns x ~20 bytes); Spark's optimizer uses its own statistics
print("DataFrame size estimates:")
print(f"Orders: ~{orders_df.count() * len(orders_df.columns) * 20:,} bytes")
print(f"Products: ~{products_df.count() * len(products_df.columns) * 20:,} bytes")
print(f"Categories: ~{categories_df.count() * len(categories_df.columns) * 20:,} bytes")
# The threshold string may carry a unit suffix (e.g. "10485760b"), so print it without int()
print(f"Broadcast threshold: {threshold}")

# Join orders -> products -> categories (auto-broadcast depends on Spark's size estimates)
print("\nJoining orders (large) with products and categories (small):")
orders_with_categories = orders_df.join(
    products_df,
    "product_id",
    "inner"
).join(
    categories_df,
    "category_id",
    "inner"
)

print(f"Result: {orders_with_categories.count():,} rows")

# Check execution plan for broadcast hints
print("\nExecution plan (look for 'BroadcastHashJoin'):")
orders_with_categories.explain(mode="formatted")

# Show sample result
print("\nSample results:")
orders_with_categories.select(
    "order_id", "customer_id", "category_name", "quantity", "order_date"
).show(5)

## 🎯 Manual Broadcast Control

### Forcing Broadcast Behavior

In [None]:
# Manual broadcast control
print("🎯 MANUAL BROADCAST CONTROL")
print("=" * 50)

# Force broadcast even if above threshold
print("1. Forcing broadcast with broadcast() function:")
forced_broadcast = orders_df.join(
    broadcast(products_df),  # Force broadcast
    "product_id",
    "inner"
)

print(f"Forced broadcast result: {forced_broadcast.count()} rows")
print("Execution plan:")
forced_broadcast.explain(mode="formatted")

# Change broadcast threshold for testing
print("\n2. Testing different threshold settings:")

# Very restrictive threshold (only very small tables)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "1KB")
print(f"Restrictive threshold: {spark.conf.get('spark.sql.autoBroadcastJoinThreshold')}")

restrictive_join = orders_df.join(products_df, "product_id", "inner")
print("With restrictive threshold:")
restrictive_join.explain(mode="formatted")

# Very permissive threshold (broadcast larger tables)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100MB")
print(f"\nPermissive threshold: {spark.conf.get('spark.sql.autoBroadcastJoinThreshold')}")

permissive_join = orders_df.join(products_df, "product_id", "inner")
print("With permissive threshold:")
permissive_join.explain(mode="formatted")

# Reset to default
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")
print(f"\nReset to default: {spark.conf.get('spark.sql.autoBroadcastJoinThreshold')}")

## 🚀 Performance Comparison

### Broadcast vs Shuffle Join Performance
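
Single timed runs are noisy, and the first Spark action in a session usually pays warm-up costs. One way to make a comparison fairer is to discard a warm-up run and report the median of several repeats. The helper below is a sketch under that assumption: `time_action` is a hypothetical name and the workload is a plain-Python stand-in.

```python
import statistics
import time


def time_action(action, warmup=1, repeats=3):
    """Run `action` several times and return the median wall-clock seconds."""
    for _ in range(warmup):
        action()  # discard warm-up runs (JIT compilation, caches, lazy initialization)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload; with Spark this could be `lambda: some_join.count()`.
median_seconds = time_action(lambda: sum(range(100_000)))
print(f"median: {median_seconds:.6f}s")
```

Applied to this notebook, each join's `count()` would be wrapped in a lambda and timed with the same `warmup` and `repeats` values so both strategies are measured under the same conditions.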

In [None]:
# Performance comparison
print("🚀 PERFORMANCE COMPARISON")
print("=" * 50)

# Create larger test datasets
big_orders = [
    (i, f"customer_{i%2000}", f"product_{(i%200)+1}", 1 + (i % 10))
    for i in range(100000)
]

big_orders_df = spark.createDataFrame(big_orders, 
    ["order_id", "customer_id", "product_id", "quantity"])

# Small lookup table
small_lookup = [
    (f"product_{i}", f"Category_{(i%5)+1}", f"Supplier_{(i%3)+1}")
    for i in range(1, 201)
]

small_lookup_df = spark.createDataFrame(small_lookup, 
    ["product_id", "category", "supplier"])

print(f"Big orders: {big_orders_df.count():,} rows")
print(f"Small lookup: {small_lookup_df.count()} rows")

# Test 1: Regular join (may or may not broadcast)
print("\n=== Test 1: Regular Join ===")
start_time = time.time()
regular_join = big_orders_df.join(small_lookup_df, "product_id", "inner")
regular_count = regular_join.count()
regular_time = time.time() - start_time

print(f"Regular join: {regular_count:,} rows in {regular_time:.3f} seconds")

# Test 2: Forced broadcast join
print("\n=== Test 2: Forced Broadcast Join ===")
start_time = time.time()
broadcast_join = big_orders_df.join(broadcast(small_lookup_df), "product_id", "inner")
broadcast_count = broadcast_join.count()
broadcast_time = time.time() - start_time

print(f"Broadcast join: {broadcast_count:,} rows in {broadcast_time:.3f} seconds")

# Performance analysis
print("\n🎯 PERFORMANCE ANALYSIS:")
print(f"Regular join: {regular_time:.3f}s")
print(f"Broadcast join: {broadcast_time:.3f}s")

if regular_time > 0 and broadcast_time > 0:
    speedup = regular_time / broadcast_time
    if speedup > 1:
        print(f"Broadcast is {speedup:.1f}x faster!")
    else:
        # On small local data, warm-up and timing noise can outweigh any join-strategy difference
        print(f"Broadcast is {1/speedup:.1f}x slower (warm-up and timing noise dominate at this scale)")

# Sanity check: both joins should return the same number of rows
print(f"\nRow counts match: {regular_count == broadcast_count}")

# Show execution plans
print("\nRegular join plan:")
regular_join.explain(mode="simple")

print("\nBroadcast join plan:")
broadcast_join.explain(mode="simple")

## 🎛️ Advanced Broadcast Scenarios

### Complex Join Conditions with Broadcast

In [None]:
# Advanced broadcast scenarios
print("🎛️ ADVANCED BROADCAST SCENARIOS")
print("=" * 50)

# Create complex lookup table, keyed by category_id so it can match products_df.category_id
complex_lookup = [
    ("Category_1", 500, 1000, "premium"),   # Electronics
    ("Category_2", 50, 200, "standard"),    # Clothing
    ("Category_3", 10, 100, "standard"),    # Books
    ("Category_4", 100, 500, "standard"),   # Home
    ("Category_5", 75, 300, "premium")      # Sports
]

category_rules_df = spark.createDataFrame(complex_lookup, 
    ["category", "min_price", "max_price", "tier"])

print("Category rules lookup table:")
category_rules_df.show()

# Complex broadcast join with multiple conditions
complex_broadcast = orders_df.alias("o").join(
    broadcast(products_df.alias("p")),
    col("o.product_id") == col("p.product_id"),
    "inner"
).join(
    broadcast(category_rules_df.alias("r")),
    (col("p.category_id") == col("r.category")) &
    (col("p.base_price").between(col("r.min_price"), col("r.max_price"))),
    "inner"
)

print("\nComplex broadcast join result:")
complex_broadcast.select(
    "o.order_id", "o.customer_id", "p.category_id", 
    "p.base_price", "r.tier", "r.min_price", "r.max_price"
).show(10)

# Broadcast with aggregation
broadcast_agg = orders_df.join(
    broadcast(products_df),
    "product_id",
    "inner"
).groupBy("category_id").agg(
    F.sum("quantity").alias("total_quantity"),
    F.sum(F.col("quantity") * F.col("base_price")).alias("total_revenue"),
    F.avg("quantity").alias("avg_quantity"),
    F.count("*").alias("order_count")
)

print("\nBroadcast join with aggregation:")
broadcast_agg.show()

# Check execution plan
print("\nExecution plan:")
broadcast_agg.explain(mode="formatted")

## ⚠️ Broadcast Join Limitations

### When Broadcast Joins Don't Work

In [None]:
# Broadcast limitations
print("⚠️ BROADCAST JOIN LIMITATIONS")
print("=" * 50)

# Create a lookup table that stands in for one too large to broadcast
# (10,000 short rows is well under 10MB; a real case would exceed the threshold)
large_lookup = [
    (f"key_{i}", f"data_{i}", i * 10) 
    for i in range(10000)
]

large_lookup_df = spark.createDataFrame(large_lookup, 
    ["lookup_key", "lookup_data", "lookup_value"])

# Report its row count
print(f"Large lookup table size: {large_lookup_df.count()} rows")

# Try join without broadcast hint
no_broadcast_join = big_orders_df.join(
    large_lookup_df,  # No hint: Spark picks the strategy from its size statistics
    big_orders_df["customer_id"] == large_lookup_df["lookup_key"],  # Keys never match, so the result is empty
    "inner"
)

print("\nJoin with large table (no broadcast hint):")
no_broadcast_join.explain(mode="formatted")

# Force broadcast (may cause performance issues)
print("\n⚠️  Forcing broadcast on large table:")
try:
    forced_large_broadcast = big_orders_df.join(
        broadcast(large_lookup_df),  # Force broadcast large table
        big_orders_df["customer_id"] == large_lookup_df["lookup_key"],
        "inner"
    )
    # Joins are lazy: the broadcast only happens when an action such as count() runs
    forced_count = forced_large_broadcast.count()
    print("Forced broadcast succeeded (may be slow)")
    print(f"Result count: {forced_count}")
except Exception as e:
    print(f"Forced broadcast failed: {str(e)[:100]}...")

# Limitations summary
print("\n🚫 BROADCAST JOIN LIMITATIONS:")
print("1. Auto-broadcast only applies below the threshold (default: 10MB)")
print("2. Every executor holds a full copy, which adds memory pressure")
print("3. Network overhead for very large broadcasts")
print("4. Not suitable when both tables are large")
print("5. The driver collects the table first, so driver memory is a limit too")
print("6. May cause OOM for executors with limited memory")

## 🎯 Best Practices & Optimization

### Broadcast Join Strategies

In [None]:
# Best practices
print("🎯 BROADCAST JOIN BEST PRACTICES")
print("=" * 50)

# Strategy 1: Pre-filter small tables
print("1. Pre-filter small tables:")
filtered_categories = categories_df.filter(col("category_id").isin(["Category_1", "Category_2"]))
print(f"Filtered categories: {filtered_categories.count()} rows")

# Join with filtered broadcast table
# (orders have no category_id, so reach categories through products)
optimized_join = orders_df.join(
    broadcast(products_df), "product_id", "inner"
).join(
    broadcast(filtered_categories), "category_id", "inner"
)
print(f"Optimized join result: {optimized_join.count()} rows")

# Strategy 2: Cache broadcast tables
print("\n2. Cache frequently used broadcast tables:")
cached_categories = categories_df.cache()
print("Categories table cached")

# Multiple joins with same cached table
# (orders carry no category_id, so join1 goes through products to reach categories)
join1 = orders_df.join(broadcast(products_df), "product_id", "inner") \
    .join(broadcast(cached_categories), "category_id", "left")
join2 = products_df.join(broadcast(cached_categories), "category_id", "inner")

print(f"Multiple joins with cached broadcast table")
print(f"Join 1: {join1.count()} rows")
print(f"Join 2: {join2.count()} rows")

# Strategy 3: Monitor broadcast effectiveness
print("\n3. Monitor broadcast performance:")
print("Check Spark UI for:")
print("- BroadcastHashJoin in execution plan")
print("- Broadcast exchange size")
print("- Task execution times")
print("- Memory usage per executor")

# Strategy 4: Adaptive query execution
print("\n4. Enable adaptive query execution:")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
print("Adaptive query execution enabled")
print("Spark can now re-plan joins at runtime using actual data sizes")

# Strategy 5: Combine with other optimizations
print("\n5. Combine broadcast with partitioning:")
# Repartitioning adds a shuffle of its own, so use it only when later stages benefit from the layout
partitioned_orders = orders_df.repartition(8, "customer_id")
broadcast_partitioned = partitioned_orders.join(
    broadcast(products_df),
    "product_id",
    "inner"
)
print(f"Broadcast + partitioning: {broadcast_partitioned.count()} rows")

## 🔍 Monitoring & Debugging

### Broadcast Join Diagnostics
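
Reading plans by eye gets tedious, so it can help to scan the plan text for join operator names. The sketch below works on plain strings only: `find_join_strategies` and `uses_broadcast` are hypothetical helpers, and the two sample plan lines are hand-written, abbreviated fragments for illustration rather than real Spark output.

```python
def find_join_strategies(plan_text):
    """Return the physical join operator names that appear in a plan string."""
    known_operators = [
        "BroadcastHashJoin",
        "BroadcastNestedLoopJoin",
        "SortMergeJoin",
        "ShuffledHashJoin",
    ]
    return [name for name in known_operators if name in plan_text]


def uses_broadcast(plan_text):
    """True when at least one broadcast-based join operator appears in the plan."""
    return any(name.startswith("Broadcast") for name in find_join_strategies(plan_text))


# Hand-written plan fragments (illustrative only, not captured from Spark).
sample_broadcast_plan = "*(2) BroadcastHashJoin [product_id], [product_id], Inner, BuildRight"
sample_shuffle_plan = "*(5) SortMergeJoin [product_id], [product_id], Inner"

print(find_join_strategies(sample_broadcast_plan), uses_broadcast(sample_broadcast_plan))
print(find_join_strategies(sample_shuffle_plan), uses_broadcast(sample_shuffle_plan))
```

In PySpark, `explain()` prints the plan rather than returning it, so the text has to be captured first, for example by redirecting standard output with `contextlib.redirect_stdout`. Treat that capture step as an assumption to verify on your Spark version.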

In [None]:
# Monitoring and debugging
print("🔍 BROADCAST JOIN MONITORING")
print("=" * 50)

# Create a test scenario
test_orders = orders_df.limit(1000)
# Reach categories through products, since orders carry no category_id
test_join = test_orders.join(broadcast(products_df), "product_id", "inner") \
    .join(broadcast(categories_df), "category_id", "inner")

# 1. Check execution plan
print("1. Execution Plan Analysis:")
print("Look for 'BroadcastHashJoin' and broadcast size:")
test_join.explain(mode="formatted")

# 2. Performance metrics
print("\n2. Performance Metrics:")
start_time = time.time()
result_count = test_join.count()
execution_time = time.time() - start_time
print(f"Execution time: {execution_time:.3f} seconds")
print(f"Result count: {result_count}")

# 3. Memory usage estimation
print("\n3. Memory Usage Estimation:")
broadcast_size_bytes = categories_df.count() * len(categories_df.columns) * 50  # Rough estimate
broadcast_size_mb = broadcast_size_bytes / (1024 * 1024)
print(f"Estimated broadcast table size: {broadcast_size_mb:.2f} MB")
# conf.get returns the raw setting (e.g. "10MB" or "10485760b"), so print it without converting
print(f"Broadcast threshold: {spark.conf.get('spark.sql.autoBroadcastJoinThreshold')}")

# 4. Data distribution analysis
print("\n4. Data Distribution:")
result_distribution = test_join.groupBy("category_id").count().orderBy("count", ascending=False)
print("Result distribution by category:")
result_distribution.show()

# 5. Common issues detection
print("\n5. Common Issues Detection:")
if broadcast_size_mb > 100:
    print("⚠️  WARNING: Broadcast table is very large (>100MB)")
    print("   Consider: Increasing executor memory or using regular join")
elif execution_time > 30:
    print("⚠️  WARNING: Join took longer than 30 seconds")
    print("   Consider: Checking data skew or network issues")
else:
    print("✅ Broadcast join appears to be working efficiently")

# 6. Alternative join strategies for comparison
print("\n6. Alternative Strategy Comparison:")
# Same joins without broadcast() hints, so Spark chooses the strategy itself
regular_join = test_orders.join(products_df, "product_id", "inner") \
    .join(categories_df, "category_id", "inner")
print("Regular join plan:")
regular_join.explain(mode="simple")

## 🎯 Interview Questions & Key Takeaways

### Common Interview Questions:
1. **What is a broadcast join in Spark?**
2. **When should you use broadcast joins?**
3. **What's the broadcast join threshold?**
4. **How do you force a broadcast join?**
5. **What are the limitations of broadcast joins?**

### Answers:
- **Broadcast join**: Sends small table to all executors to avoid shuffling large table
- **When to use**: Joining large table with small lookup table (< 10MB default)
- **Threshold**: `spark.sql.autoBroadcastJoinThreshold` (default: 10MB)
- **Force broadcast**: Use `broadcast()` function: `df1.join(broadcast(df2), "key")`
- **Limitations**: Table size, memory pressure, network overhead for large broadcasts

## 🧹 Cleanup

In [None]:
# Cleanup
print("🧹 CLEANUP")
print("=" * 50)

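# Restore defaults by unsetting every option this notebook changed
for key in [
    "spark.sql.autoBroadcastJoinThreshold",
    "spark.sql.adaptive.enabled",
    "spark.sql.adaptive.coalescePartitions.enabled",
    "spark.sql.adaptive.skewJoin.enabled",
]:
    spark.conf.unset(key)

print("Configurations reset to defaults")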
print("Broadcast join demonstration complete!")

print("\n📚 KEY TAKEAWAYS:")
print("- Broadcast joins send small tables to all executors")
print("- Use for large table + small lookup table scenarios")
print("- Can be 10-100x faster than shuffle joins")
print("- Monitor execution plans and performance")
print("- Consider memory and network limitations")
print("- Spark auto-detects, but you can force with broadcast()")

# Note: Spark Session will be cleaned up automatically in Jupyter
# In production code, use: spark.stop()