# Module 12: Performance Optimization

**Difficulty**: ⭐⭐⭐  
**Estimated Time**: 75 minutes  
**Prerequisites**: 
- [Module 03: DataFrames and Datasets](03_dataframes_and_datasets.ipynb)
- [Module 05: DataFrame Operations](05_dataframe_operations.ipynb)
- Understanding of Spark execution model

## Learning Objectives

By the end of this notebook, you will be able to:

1. Apply partitioning strategies (repartition vs coalesce) to optimize data distribution
2. Use caching and persistence effectively with appropriate storage levels
3. Implement broadcast variables and broadcast joins for small-to-large table joins
4. Identify and avoid expensive shuffle operations in transformations
5. Monitor and tune Spark performance using configuration settings and execution plans

## 1. Setup and Introduction

**Performance Optimization in Spark**

Spark's performance depends on several factors:

1. **Data Partitioning**: How data is distributed across executors
2. **Shuffles**: Data movement between nodes (expensive!)
3. **Memory Management**: Caching, spilling, storage levels
4. **Parallelism**: Number of tasks running concurrently
5. **Query Optimization**: Catalyst optimizer decisions

**Key Concepts:**

**Partitions:**
- Logical chunks of data
- Each partition processed by one task
- Rule of thumb: 2-4 partitions per CPU core

**Shuffle:**
- Data redistribution across cluster
- Triggered by: groupBy, join, repartition, distinct, etc.
- Expensive: disk I/O, network I/O, serialization

**Caching:**
- Store DataFrames in memory/disk
- Avoid recomputation
- Trade-off: memory vs computation

**Broadcast:**
- Send small data to all executors
- Avoid shuffle for small-to-large joins
- Limited by driver and executor memory (the data is collected on the driver, then copied to every executor)

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, count, sum as spark_sum, max as spark_max, avg, broadcast,
    rand, round as spark_round, monotonically_increasing_id,
    when, expr, lit
)
from pyspark import StorageLevel
import time
import numpy as np

In [None]:
# Create Spark session with custom configurations
spark = SparkSession.builder \
    .appName("Performance Optimization") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "2g") \
    .config("spark.sql.shuffle.partitions", "8") \
    .config("spark.default.parallelism", "8") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")
print(f"Spark version: {spark.version}")
print(f"Default parallelism: {spark.sparkContext.defaultParallelism}")
print(f"SQL shuffle partitions: {spark.conf.get('spark.sql.shuffle.partitions')}")

## 2. Partitioning Strategies

**Why Partitioning Matters:**
- Controls parallelism
- Affects shuffle performance
- Impacts memory usage
- Determines task granularity

**repartition() vs coalesce():**

**repartition(n)**:
- Can increase or decrease partitions
- Triggers full shuffle
- Distributes data evenly
- Use when: increasing partitions or need even distribution

**coalesce(n)**:
- Can only decrease partitions
- Avoids full shuffle (more efficient)
- May create uneven partitions
- Use when: reducing partitions after filtering
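
The difference in partition balance can be illustrated without Spark. The sketch below is a simplified model that tracks row counts only. It assumes `coalesce` merges whole neighbouring partitions while `repartition` redistributes individual rows evenly. The helper names `coalesce_sizes` and `repartition_sizes` are hypothetical and not part of any Spark API.

```python
# Simplified, Spark-free model of partition sizes (row counts only).
# Assumption: coalesce merges whole existing partitions, while
# repartition spreads individual rows evenly via a full shuffle.

def coalesce_sizes(sizes, n):
    """Merge existing partitions into n contiguous groups without splitting any."""
    n = min(n, len(sizes))  # coalesce cannot increase the partition count
    merged = [0] * n
    for i, size in enumerate(sizes):
        merged[i * n // len(sizes)] += size
    return merged

def repartition_sizes(sizes, n):
    """Spread all rows evenly over n partitions."""
    base, extra = divmod(sum(sizes), n)
    return [base + (1 if i < extra else 0) for i in range(n)]

# Eight partitions left badly skewed by a selective filter
after_filter = [9000, 10, 20, 5, 4000, 15, 30, 20]

print(coalesce_sizes(after_filter, 2))     # [9035, 4065]
print(repartition_sizes(after_filter, 2))  # [6550, 6550]
```

In this model, `coalesce` keeps whatever skew the input already had, which is the price paid for skipping the shuffle.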

In [None]:
# Create sample DataFrame
df = spark.range(0, 10000000).toDF("id") \
    .withColumn("value", (col("id") * 3.14).cast("double")) \
    .withColumn("category", (col("id") % 100).cast("string"))

print(f"Original partitions: {df.rdd.getNumPartitions()}")
print(f"Total rows: {df.count():,}")

In [None]:
# Test repartition (increases partitions, triggers shuffle)
print("\n=== Testing repartition() ===")

start = time.time()
df_repartitioned = df.repartition(20)
df_repartitioned.write.mode("overwrite").format("noop").save()
repartition_time = time.time() - start

print(f"Partitions after repartition(20): {df_repartitioned.rdd.getNumPartitions()}")
print(f"Time: {repartition_time:.2f}s")

# Check partition sizes
partition_sizes = df_repartitioned.rdd.glom().map(len).collect()
print(f"Partition sizes (first 10): {partition_sizes[:10]}")
print(f"Min: {min(partition_sizes)}, Max: {max(partition_sizes)}, Avg: {np.mean(partition_sizes):.0f}")

In [None]:
# Test coalesce (decreases partitions, avoids full shuffle)
print("\n=== Testing coalesce() ===")

start = time.time()
df_coalesced = df_repartitioned.coalesce(5)
df_coalesced.write.mode("overwrite").format("noop").save()
coalesce_time = time.time() - start

print(f"Partitions after coalesce(5): {df_coalesced.rdd.getNumPartitions()}")
print(f"Time: {coalesce_time:.2f}s")
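# Caveat: df_repartitioned is not cached, so this job re-runs the repartition(20)
# shuffle before coalescing; the ratio compares whole jobs, not the two operations alone
print(f"Time ratio (repartition job / coalesce job): {repartition_time/coalesce_time:.2f}x")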

# Coalesce may create uneven partitions
coalesce_sizes = df_coalesced.rdd.glom().map(len).collect()
print(f"\nPartition sizes: {coalesce_sizes}")
print(f"Notice: Partitions may be uneven (that's okay for coalesce)")

In [None]:
# Repartition by column for better data locality
print("\n=== Repartitioning by Column ===")

# Repartition by category for grouped operations
df_by_category = df.repartition(10, "category")

print(f"Partitions: {df_by_category.rdd.getNumPartitions()}")
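print("\nBenefit: All rows with same category are in same partition")
print("A subsequent groupBy(category) can reuse this partitioning instead of adding another shuffle")

# Demonstrate the effect. Note: repartition() is itself a full shuffle and is lazy,
# so time2 below includes it. Pre-partitioning tends to pay off only when the
# partitioned DataFrame is cached and reused for several operations on the same key.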
start = time.time()
result1 = df.groupBy("category").count().count()
time1 = time.time() - start

start = time.time()
result2 = df_by_category.groupBy("category").count().count()
time2 = time.time() - start

print(f"\nGroupBy without pre-partitioning: {time1:.2f}s")
print(f"GroupBy with pre-partitioning: {time2:.2f}s")
print(f"Speedup: {time1/time2:.2f}x")

## 3. Caching and Persistence

**When to Cache:**
- DataFrame used multiple times
- Expensive computation that doesn't change
- Iterative algorithms (ML training)
- Interactive exploration

**When NOT to Cache:**
- DataFrame used only once
- Limited memory available
- Data too large to fit in memory
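
The "used multiple times" criterion can be made concrete with a back-of-the-envelope cost model. The sketch below is an assumption for illustration, not a formula Spark uses. It compares recomputing a DataFrame on every use against computing it once, paying a one-time cost to write the cache, and reading the cache afterwards. The function name `caching_break_even` and its parameters are hypothetical, and the timings are ones you would measure yourself.

```python
import math

def caching_break_even(compute_s, cache_write_s, cached_read_s):
    """Smallest number of uses at which caching is strictly faster.

    Model (an assumption, not a Spark formula):
      without cache: uses * compute_s
      with cache:    compute_s + cache_write_s + (uses - 1) * cached_read_s
    Returns None if reading the cache is no faster than recomputing.
    """
    saving_per_reuse = compute_s - cached_read_s
    if saving_per_reuse <= 0:
        return None
    # Caching wins once (uses - 1) * saving_per_reuse exceeds the write cost
    return math.floor(1 + cache_write_s / saving_per_reuse) + 1

print(caching_break_even(10.0, 2.0, 1.0))   # 2
print(caching_break_even(10.0, 27.0, 1.0))  # 5
print(caching_break_even(5.0, 1.0, 5.0))    # None
```

A cheap computation with an expensive cache write can need several reuses before caching pays off, which is why a DataFrame used only once should not be cached.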

**Storage Levels:**
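- `MEMORY_ONLY`: Fastest access; partitions that do not fit in memory are not cached and are recomputed when needed
- `MEMORY_AND_DISK`: Spills to disk if memory is full
- `DISK_ONLY`: Uses disk storage only; slower, but takes no cache memory
- `MEMORY_ONLY_SER`: Serialized (saves memory, slower access); a Scala/Java level that PySpark 3.x's `StorageLevel` does not define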
- Add `_2` for replication (fault tolerance)

In [None]:
# Create DataFrame with expensive computation
expensive_df = spark.range(0, 5000000).toDF("id") \
    .withColumn("value1", expr("sin(id / 1000.0)")) \
    .withColumn("value2", expr("cos(id / 1000.0)")) \
    .withColumn("value3", expr("sqrt(abs(id))")) \
    .withColumn("category", (col("id") % 50).cast("string"))

print("Created expensive DataFrame with trigonometric computations")

In [None]:
# Scenario: Use DataFrame 3 times WITHOUT caching
print("\n=== WITHOUT Caching ===")

start = time.time()

# First use: count
count1 = expensive_df.count()

# Second use: aggregation
agg1 = expensive_df.groupBy("category").agg(avg("value1")).count()

# Third use: filter and count
filtered1 = expensive_df.filter(col("value2") > 0).count()

no_cache_time = time.time() - start

print(f"Time without caching: {no_cache_time:.2f}s")
print("Note: DataFrame was recomputed 3 times (sin, cos, sqrt calculated 3x)")

In [None]:
# Same scenario WITH caching
print("\n=== WITH Caching ===")

# Cache the DataFrame
cached_df = expensive_df.cache()

start = time.time()

# First use: count (triggers caching)
count2 = cached_df.count()

# Second use: aggregation (uses cache)
agg2 = cached_df.groupBy("category").agg(avg("value1")).count()

# Third use: filter and count (uses cache)
filtered2 = cached_df.filter(col("value2") > 0).count()

cache_time = time.time() - start

print(f"Time with caching: {cache_time:.2f}s")
print(f"Speedup: {no_cache_time/cache_time:.2f}x faster")
print("Note: Expensive computation done only once, then reused from cache")

# Don't forget to unpersist when done
cached_df.unpersist()
print("\nCache cleared with unpersist()")

In [None]:
# Different storage levels
print("\n=== Storage Levels Comparison ===")

test_df = spark.range(0, 3000000).toDF("id") \
    .withColumn("data", expr("cast(rand() * 1000 as int)"))

# MEMORY_ONLY (note: DataFrame.cache() defaults to a memory-and-disk level, not MEMORY_ONLY)
df_mem_only = test_df.persist(StorageLevel.MEMORY_ONLY)
start = time.time()
df_mem_only.count()
mem_only_time = time.time() - start
print(f"MEMORY_ONLY: {mem_only_time:.2f}s")
df_mem_only.unpersist()

# MEMORY_AND_DISK
df_mem_disk = test_df.persist(StorageLevel.MEMORY_AND_DISK)
start = time.time()
df_mem_disk.count()
mem_disk_time = time.time() - start
print(f"MEMORY_AND_DISK: {mem_disk_time:.2f}s (more reliable, slight overhead)")
df_mem_disk.unpersist()

# DISK_ONLY (keeps the cache on local disk instead of in memory)
# Note: StorageLevel.MEMORY_ONLY_SER is not defined in PySpark 3.x, so DISK_ONLY is compared instead
df_disk = test_df.persist(StorageLevel.DISK_ONLY)
start = time.time()
df_disk.count()
disk_time = time.time() - start
print(f"DISK_ONLY: {disk_time:.2f}s (slower, but uses no cache memory)")
df_disk.unpersist()

print("\nRecommendation: Use MEMORY_AND_DISK for production (spills to disk instead of recomputing when memory is short)")

## 4. Broadcast Joins

**Join Strategies:**

1. **Shuffle Join (Sort-Merge Join)**:
   - Both datasets shuffled and sorted
   - Expensive: network I/O, disk I/O
   - Used for large-to-large joins

2. **Broadcast Join (Map-Side Join)**:
   - Small dataset sent to all executors
   - No shuffle needed!
   - Chosen automatically when one side is estimated to be below `spark.sql.autoBroadcastJoinThreshold` (default 10MB)
   - Much faster for small-to-large joins
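
To see why a broadcast join needs no shuffle, it helps to model what a single executor does. The following is a Spark-free sketch under simplifying assumptions: rows are plain dictionaries, the join is an inner join, and keys in the small table are unique (as in a typical dimension table). The function name `broadcast_hash_join` is hypothetical.

```python
# Spark-free sketch of a map-side (broadcast hash) join on one "executor".
# Assumptions: inner join, unique keys in the small table.

def broadcast_hash_join(large_partition, small_table, key):
    # Build phase: hash the small table once; every executor holds a full copy
    lookup = {row[key]: row for row in small_table}
    # Probe phase: stream the local rows of the large table; nothing leaves the partition
    joined = []
    for row in large_partition:
        match = lookup.get(row[key])
        if match is not None:  # inner join: rows without a match are dropped
            joined.append({**row, **match})
    return joined

small = [
    {"category_id": 0, "department": "Electronics"},
    {"category_id": 1, "department": "Clothing"},
]
partition = [
    {"id": 10, "category_id": 1},
    {"id": 11, "category_id": 0},
    {"id": 12, "category_id": 7},  # no matching category
]

for row in broadcast_hash_join(partition, small, "category_id"):
    print(row)
```

Each partition of the large table can be joined where it already lives, so only the small table travels over the network.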

**When to Broadcast:**
- Small dimension tables (well under 1GB, and comfortably within driver and executor memory)
- Lookup tables
- Reference data

**How to Broadcast:**
- Automatic: Spark broadcasts if < `spark.sql.autoBroadcastJoinThreshold`
- Manual: `broadcast(small_df)`
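
As a rough sanity check before relying on automatic broadcasting, you can estimate a table's size and compare it with the threshold. This is only a sketch: Spark decides from its own size statistics, and the helper `fits_auto_broadcast` and the per-row byte figures below are assumptions for illustration.

```python
DEFAULT_AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10485760 bytes (10MB)

def fits_auto_broadcast(num_rows, avg_row_bytes,
                        threshold_bytes=DEFAULT_AUTO_BROADCAST_THRESHOLD):
    """Rough estimate of whether a table falls under the auto-broadcast threshold.

    A negative threshold (for example -1) means automatic broadcasting is disabled.
    """
    if threshold_bytes < 0:
        return False
    return num_rows * avg_row_bytes <= threshold_bytes

# avg_row_bytes values are guesses for illustration, not measured sizes
print(fits_auto_broadcast(100, 64))                      # True
print(fits_auto_broadcast(2_000_000, 16))                # False
print(fits_auto_broadcast(100, 64, threshold_bytes=-1))  # False
```

When the estimate is close to the threshold, an explicit `broadcast(small_df)` hint removes the guesswork.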

In [None]:
# Create large fact table
large_df = spark.range(0, 2000000).toDF("id") \
    .withColumn("category_id", (col("id") % 100).cast("int")) \
    .withColumn("value", (rand() * 1000).cast("int"))

# Create small dimension table
small_df = spark.range(0, 100).toDF("category_id") \
    .withColumn("category_name", expr("concat('Category_', category_id)")) \
    .withColumn("department", 
                when(col("category_id") < 33, "Electronics")
                .when(col("category_id") < 66, "Clothing")
                .otherwise("Home"))

print(f"Large table: {large_df.count():,} rows")
print(f"Small table: {small_df.count():,} rows")
print(f"Size ratio: {large_df.count() / small_df.count():.0f}:1")

In [None]:
# Regular join (may use shuffle)
print("\n=== Regular Join (Potential Shuffle) ===")

# Disable auto broadcast to force shuffle join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

start = time.time()
regular_join = large_df.join(small_df, "category_id") \
    .groupBy("department") \
    .agg(spark_sum("value").alias("total_value"))
regular_result = regular_join.count()
regular_time = time.time() - start

print(f"Regular join time: {regular_time:.2f}s")
print("Note: This used shuffle join (both sides shuffled)")

# Check execution plan
print("\nExecution plan (look for 'SortMergeJoin'):")
regular_join.explain()

In [None]:
# Broadcast join (no shuffle!)
print("\n=== Broadcast Join (No Shuffle) ===")

start = time.time()
broadcast_join = large_df.join(broadcast(small_df), "category_id") \
    .groupBy("department") \
    .agg(spark_sum("value").alias("total_value"))
broadcast_result = broadcast_join.count()
broadcast_time = time.time() - start

print(f"Broadcast join time: {broadcast_time:.2f}s")
print(f"Speedup: {regular_time/broadcast_time:.2f}x faster")
print("Note: Small table broadcasted to all executors, no shuffle!")

# Check execution plan
print("\nExecution plan (look for 'BroadcastHashJoin'):")
broadcast_join.explain()

# Re-enable auto broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")  # 10MB

## 5. Avoiding Shuffle Operations

**Shuffle is Expensive Because:**
- Data written to disk
- Data sent over network
- Data serialized/deserialized
- Creates intermediate files

**Operations that Trigger Shuffle:**
- `groupBy`, `agg`
- `join` (except broadcast join)
- `distinct`, `dropDuplicates`
- `repartition`
- `orderBy`, `sort` (and `sortBy` on RDDs)
- Window functions

**How to Minimize Shuffles:**
1. Use broadcast joins for small tables
2. Pre-partition data by join/groupBy keys
3. Use `coalesce` instead of `repartition` when reducing partitions
4. Filter data early to reduce shuffle volume
5. Combine multiple aggregations into one
6. Use DataFrames instead of RDDs (better optimized)
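
Point 5 works because several aggregates over the same key can share one pass over the data. The plain-Python sketch below mimics a single combined aggregation: one loop maintains the running state needed for the average, maximum, and count together. It illustrates the idea and is not how Spark implements aggregation; the function name `combined_aggregates` is hypothetical.

```python
from collections import defaultdict

def combined_aggregates(rows):
    """One pass over (category, value) pairs producing avg, max, and count together."""
    state = defaultdict(lambda: {"sum": 0, "max": None, "count": 0})
    for category, value in rows:
        s = state[category]
        s["sum"] += value
        s["max"] = value if s["max"] is None else max(s["max"], value)
        s["count"] += 1
    return {
        category: {"avg": s["sum"] / s["count"], "max": s["max"], "count": s["count"]}
        for category, s in state.items()
    }

rows = [("a", 10), ("b", 4), ("a", 30), ("b", 6), ("a", 20)]
print(combined_aggregates(rows))
# {'a': {'avg': 20.0, 'max': 30, 'count': 3}, 'b': {'avg': 5.0, 'max': 6, 'count': 2}}
```

Running three separate aggregations would repeat the grouping step three times for the same answer.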

In [None]:
# Example: Inefficient - Multiple shuffles
print("=== INEFFICIENT: Multiple Shuffles ===")

test_df = spark.range(0, 1000000).toDF("id") \
    .withColumn("category", (col("id") % 10).cast("string")) \
    .withColumn("value", (rand() * 100).cast("int"))

start = time.time()

# Separate aggregations (each triggers shuffle)
avg_value = test_df.groupBy("category").agg(avg("value")).count()
max_value = test_df.groupBy("category").agg(spark_max("value")).count()
count_value = test_df.groupBy("category").count().count()

inefficient_time = time.time() - start
print(f"Time with multiple shuffles: {inefficient_time:.2f}s")
print("Note: Data was shuffled 3 times!")

In [None]:
# Example: Efficient - Single shuffle
print("\n=== EFFICIENT: Single Shuffle ===")

start = time.time()

# Combined aggregation (one shuffle)
combined = test_df.groupBy("category").agg(
    avg("value").alias("avg_value"),
    spark_max("value").alias("max_value"),
    count("*").alias("count_value")
).count()

efficient_time = time.time() - start
print(f"Time with single shuffle: {efficient_time:.2f}s")
print(f"Speedup: {inefficient_time/efficient_time:.2f}x faster")
print("Note: Data shuffled only once!")

In [None]:
# Filter early to reduce shuffle volume
print("\n=== Filter Early ===")

large_dataset = spark.range(0, 5000000).toDF("id") \
    .withColumn("category", (col("id") % 100).cast("string")) \
    .withColumn("value", (rand() * 1000).cast("int"))

# Inefficient: Filter after aggregation
start = time.time()
result1 = large_dataset.groupBy("category").agg(avg("value").alias("avg_value")) \
    .filter(col("avg_value") > 500) \
    .count()
time1 = time.time() - start
print(f"Filter after aggregation: {time1:.2f}s (shuffles all data)")

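# Efficient: Filter before aggregation so fewer rows reach the shuffle
# Note: this is a different query (row-level filter on value, not on avg_value),
# so result1 and result2 differ; filter early only when a row-level filter
# matches what you actually want to compute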
start = time.time()
result2 = large_dataset.filter(col("value") > 500) \
    .groupBy("category").agg(avg("value").alias("avg_value")) \
    .count()
time2 = time.time() - start
print(f"Filter before aggregation: {time2:.2f}s (shuffles less data)")
print(f"Improvement: {time1/time2:.2f}x faster")

## 6. Configuration Tuning

**Key Configuration Parameters:**

**Memory:**
- `spark.executor.memory`: Memory per executor
- `spark.driver.memory`: Memory for driver
- `spark.memory.fraction`: Fraction for execution/storage (default 0.6)

**Parallelism:**
- `spark.default.parallelism`: Partitions for RDD operations
- `spark.sql.shuffle.partitions`: Partitions for DataFrame shuffles (default 200)

**Shuffle:**
- `spark.sql.autoBroadcastJoinThreshold`: Max size for broadcast (default 10MB)
- `spark.sql.adaptive.enabled`: Enable adaptive query execution (on by default since Spark 3.2)

**Rules of Thumb:**
- Set `spark.sql.shuffle.partitions` to 2-4x number of cores
- Each partition should be 100-200MB
- Executor memory: 4-8GB per executor
- Cores: Leave 1-2 cores per node for the OS and cluster daemons
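
The first two rules of thumb can be combined into a starting value for `spark.sql.shuffle.partitions`. The helper below is a minimal sketch: the name `suggest_shuffle_partitions`, the 128MB target (picked from the 100-200MB range), and the 2x-cores floor are assumptions, and the result is a starting point for measurement rather than an optimum.

```python
import math

def suggest_shuffle_partitions(data_size_mb, total_cores,
                               target_partition_mb=128, min_per_core=2):
    """Starting point for spark.sql.shuffle.partitions from the rules of thumb."""
    by_size = math.ceil(data_size_mb / target_partition_mb)  # keep partitions near the target size
    by_cores = total_cores * min_per_core                    # keep every core busy
    return max(by_size, by_cores)

print(suggest_shuffle_partitions(50_000, 16))  # 391
print(suggest_shuffle_partitions(500, 8))      # 16
```

For small data the core count dominates, and for large data the partition size target does.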

In [None]:
# Check current configurations
print("=== Current Spark Configurations ===")
important_configs = [
    "spark.executor.memory",
    "spark.driver.memory",
    "spark.sql.shuffle.partitions",
    "spark.default.parallelism",
    "spark.sql.autoBroadcastJoinThreshold",
    "spark.sql.adaptive.enabled"
]

for config in important_configs:
    try:
        value = spark.conf.get(config)
        print(f"{config}: {value}")
    except Exception:
        print(f"{config}: (not set)")

In [None]:
# Test different shuffle partition settings
print("\n=== Impact of Shuffle Partitions ===")

test_df = spark.range(0, 2000000).toDF("id") \
    .withColumn("group", (col("id") % 1000).cast("string"))

partition_counts = [4, 8, 16, 32]
results = []

for n_partitions in partition_counts:
    spark.conf.set("spark.sql.shuffle.partitions", str(n_partitions))
    
    start = time.time()
    test_df.groupBy("group").count().count()
    elapsed = time.time() - start
    
    results.append((n_partitions, elapsed))
    print(f"Partitions={n_partitions:2d}: {elapsed:.2f}s")

# Find optimal
optimal = min(results, key=lambda x: x[1])
print(f"\nOptimal: {optimal[0]} partitions with time {optimal[1]:.2f}s")
print("Note: Optimal depends on data size and cluster resources")

# Reset to the value configured at session creation (Spark's own default is 200)
spark.conf.set("spark.sql.shuffle.partitions", "8")

## 7. Query Execution Plans

**Understanding Execution Plans:**

Spark's Catalyst optimizer creates execution plans:
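- **Logical Plan**: What to compute (parsed and analyzed)
- **Optimized Logical Plan**: The logical plan after Catalyst rewrites (e.g. filter pushdown, column pruning)
- **Physical Plan**: The operators actually executed (join strategy, exchanges)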

**Using explain():**
- `explain()`: Shows physical plan
- `explain(True)`: Shows all plans
- `explain("formatted")`: Pretty formatted

**Look For:**
- Shuffle operations (Exchange)
- Join strategies (BroadcastHashJoin vs SortMergeJoin)
- Filter pushdown
- Column pruning
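
When plans get long, a small text scan can summarize the operators listed above. The sketch below counts operator names in a plan string. The helper `count_plan_operators` is hypothetical, and `sample_plan` is a hand-written, abbreviated stand-in for plan text, not real `explain()` output.

```python
import re

def count_plan_operators(plan_text, operators=("Exchange", "BroadcastExchange",
                                                "BroadcastHashJoin", "SortMergeJoin")):
    """Count whole-word occurrences of each operator name in a plan string."""
    return {
        op: len(re.findall(rf"\b{re.escape(op)}\b", plan_text))
        for op in operators
    }

# Hand-written, abbreviated stand-in for a physical plan (not actual Spark output)
sample_plan = """
== Physical Plan ==
HashAggregate(keys=[name], ...)
+- Exchange hashpartitioning(name, 8)
   +- HashAggregate(keys=[name], ...)
      +- BroadcastHashJoin [category], [category], Inner, BuildRight
         :- Filter (value > 50)
         +- BroadcastExchange HashedRelationBroadcastMode(...)
"""

print(count_plan_operators(sample_plan))
# {'Exchange': 1, 'BroadcastExchange': 1, 'BroadcastHashJoin': 1, 'SortMergeJoin': 0}
```

The whole-word match keeps `BroadcastExchange` from being counted as a shuffle `Exchange`. To apply this to a real query, one option is to capture what `explain()` prints, for example with `contextlib.redirect_stdout`.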

In [None]:
# Create query to analyze
df1 = spark.range(0, 1000000).toDF("id") \
    .withColumn("value", (rand() * 100).cast("int")) \
    .withColumn("category", (col("id") % 10).cast("string"))

df2 = spark.range(0, 10).toDF("category_id") \
    .withColumn("category", col("category_id").cast("string")) \
    .withColumn("name", expr("concat('Cat_', category)"))

# Complex query
result = df1.filter(col("value") > 50) \
    .join(broadcast(df2), "category") \
    .groupBy("name") \
    .agg(
        count("*").alias("count"),
        avg("value").alias("avg_value")
    ) \
    .orderBy(col("count").desc())

print("=== Query Execution Plan ===")
result.explain()

In [None]:
# Detailed explanation
print("\n=== Extended Explanation ===")
result.explain(True)

print("\n=== What to Look For ===")
print("1. BroadcastHashJoin: Good! Small table broadcasted")
print("2. Exchange: Shuffle operation (minimize these)")
print("3. Filter pushed before Join: Good! Less data to join")
print("4. Project (column pruning): Good! Only needed columns")

## 8. Exercises

### Exercise 1: Partition Optimization

Optimize a query using appropriate partitioning.

**Tasks:**
1. Create a DataFrame with 10M rows, columns: id, user_id (1000 users), value
2. Calculate average value per user WITHOUT pre-partitioning
3. Repartition by user_id and calculate average again
4. Compare execution times
5. Determine the optimal number of partitions

In [None]:
# Your code here
# TODO: Create DataFrame
# TODO: Test without partitioning
# TODO: Test with partitioning
# TODO: Compare results

### Exercise 2: Caching Strategy

Determine when caching improves performance.

**Tasks:**
1. Create an expensive DataFrame (multiple transformations)
2. Use it 5 times in different queries WITHOUT caching
3. Repeat WITH caching
4. Calculate the break-even point (how many uses justify caching)
5. Test different storage levels and compare

In [None]:
# Your code here
# TODO: Create expensive computation
# TODO: Test without caching
# TODO: Test with caching
# TODO: Analyze break-even point

### Exercise 3: Join Optimization

Optimize a multi-table join scenario.

**Tasks:**
1. Create 3 tables: orders (large), customers (medium), products (small)
2. Join all three tables together
3. Optimize using broadcast joins where appropriate
4. Compare shuffle join vs broadcast join performance
5. Examine execution plans to verify optimizations

In [None]:
# Your code here
# TODO: Create tables
# TODO: Join without optimization
# TODO: Join with broadcast
# TODO: Compare and analyze

## 9. Exercise Solutions

### Solution 1: Partition Optimization

In [None]:
# Create DataFrame
user_df = spark.range(0, 10000000).toDF("id") \
    .withColumn("user_id", (col("id") % 1000).cast("string")) \
    .withColumn("value", (rand() * 1000).cast("int"))

print(f"Created DataFrame with {user_df.count():,} rows")
print(f"Default partitions: {user_df.rdd.getNumPartitions()}")

# Without pre-partitioning
print("\n=== Without Pre-Partitioning ===")
start = time.time()
result1 = user_df.groupBy("user_id").agg(avg("value").alias("avg_value")).count()
time_no_part = time.time() - start
print(f"Time: {time_no_part:.2f}s")

# With pre-partitioning
print("\n=== With Pre-Partitioning ===")
partitioned_df = user_df.repartition(20, "user_id")

start = time.time()
result2 = partitioned_df.groupBy("user_id").agg(avg("value").alias("avg_value")).count()
time_with_part = time.time() - start
print(f"Time: {time_with_part:.2f}s")
print(f"Speedup: {time_no_part/time_with_part:.2f}x")

# Test different partition counts
print("\n=== Testing Different Partition Counts ===")
for n_parts in [10, 20, 30, 40]:
    test_df = user_df.repartition(n_parts, "user_id")
    start = time.time()
    test_df.groupBy("user_id").agg(avg("value")).count()
    elapsed = time.time() - start
    print(f"Partitions={n_parts}: {elapsed:.2f}s")

### Solution 2: Caching Strategy

In [None]:
# Create expensive DataFrame
expensive = spark.range(0, 3000000).toDF("id") \
    .withColumn("x", expr("sin(id / 1000.0) * 100")) \
    .withColumn("y", expr("cos(id / 1000.0) * 100")) \
    .withColumn("z", expr("sqrt(abs(id)) / 10"))

# Test different usage counts
print("=== Caching Break-Even Analysis ===")

for num_uses in [1, 2, 3, 4, 5]:
    # Without caching
    start = time.time()
    for _ in range(num_uses):
        expensive.filter(col("x") > 0).count()
    time_no_cache = time.time() - start
    
    # With caching
    cached = expensive.cache()
    start = time.time()
    for _ in range(num_uses):
        cached.filter(col("x") > 0).count()
    time_with_cache = time.time() - start
    cached.unpersist()
    
    speedup = time_no_cache / time_with_cache
    print(f"Uses={num_uses}: No cache={time_no_cache:.2f}s, Cache={time_with_cache:.2f}s, Speedup={speedup:.2f}x")
    
print("\nConclusion: Caching usually pays off once the DataFrame is used 2+ times; check the numbers above for your environment")

### Solution 3: Join Optimization

In [None]:
# Create tables
orders = spark.range(0, 1000000).toDF("order_id") \
    .withColumn("customer_id", (col("order_id") % 10000).cast("int")) \
    .withColumn("product_id", (col("order_id") % 100).cast("int")) \
    .withColumn("amount", (rand() * 1000).cast("int"))

customers = spark.range(0, 10000).toDF("customer_id") \
    .withColumn("customer_name", expr("concat('Customer_', customer_id)"))

products = spark.range(0, 100).toDF("product_id") \
    .withColumn("product_name", expr("concat('Product_', product_id)")) \
    .withColumn("category", 
                when(col("product_id") < 33, "Electronics")
                .when(col("product_id") < 66, "Clothing")
                .otherwise("Home"))

print(f"Orders: {orders.count():,} rows")
print(f"Customers: {customers.count():,} rows")
print(f"Products: {products.count():,} rows")

# Disable auto broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

# Without optimization (shuffle joins)
print("\n=== Without Broadcast (Shuffle Joins) ===")
start = time.time()
result_shuffle = orders \
    .join(customers, "customer_id") \
    .join(products, "product_id") \
    .groupBy("category") \
    .agg(spark_sum("amount").alias("total")) \
    .count()
time_shuffle = time.time() - start
print(f"Time: {time_shuffle:.2f}s")

# With broadcast optimization
print("\n=== With Broadcast ===")
start = time.time()
result_broadcast = orders \
    .join(broadcast(customers), "customer_id") \
    .join(broadcast(products), "product_id") \
    .groupBy("category") \
    .agg(spark_sum("amount").alias("total")) \
    .count()
time_broadcast = time.time() - start
print(f"Time: {time_broadcast:.2f}s")
print(f"Speedup: {time_shuffle/time_broadcast:.2f}x faster")

# Reset
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")

## 10. Summary

Congratulations! You've learned advanced performance optimization techniques for Spark.

### Key Takeaways:

1. **Partitioning:**
   - Use `repartition()` to increase partitions or redistribute evenly
   - Use `coalesce()` to decrease partitions efficiently
   - Pre-partition by groupBy/join keys to avoid shuffles
   - Optimal: 2-4 partitions per CPU core, 100-200MB per partition

2. **Caching:**
   - Cache DataFrames used multiple times (2+)
   - Use `MEMORY_AND_DISK` for production (spills to disk instead of recomputing)
   - Always `unpersist()` when done to free memory
   - Cache after expensive computations, before multiple uses

3. **Broadcast Joins:**
   - Dramatically faster for small-to-large joins
   - Use for dimension tables < 1GB
   - Explicit `broadcast()` or tune `autoBroadcastJoinThreshold`
   - Eliminates shuffle for the small table

4. **Minimizing Shuffles:**
   - Combine aggregations into single operation
   - Filter data early to reduce shuffle volume
   - Use broadcast joins when possible
   - Pre-partition data by common keys

5. **Configuration:**
   - Tune `spark.sql.shuffle.partitions` based on data size
   - Adjust `autoBroadcastJoinThreshold` for your use case
   - Monitor and tune memory settings
   - Enable adaptive query execution for automatic optimization

6. **Execution Plans:**
   - Use `explain()` to understand query execution
   - Look for shuffle operations (Exchange)
   - Verify broadcast joins (BroadcastHashJoin)
   - Check for filter pushdown and column pruning

### Performance Checklist:

Before deploying to production:
- [ ] Appropriate number of partitions configured
- [ ] Expensive computations cached when reused
- [ ] Small tables broadcasted in joins
- [ ] Filters applied as early as possible
- [ ] Aggregations combined when possible
- [ ] Execution plans reviewed for inefficiencies
- [ ] Shuffle operations minimized
- [ ] Memory settings tuned for workload

### Common Pitfalls:

- Over-partitioning: Too many small partitions (overhead)
- Under-partitioning: Too few large partitions (poor parallelism)
- Caching everything: Wastes memory
- Not unpersisting: Cached data keeps occupying storage memory until it is evicted
- Shuffle joins for small tables: Use broadcast
- Multiple separate aggregations: Combine them
- Late filtering: Filter early to reduce data volume

### What's Next?

In [Module 13: Spark on Clusters](13_spark_on_clusters.ipynb), you'll learn:
- Cluster architecture and deployment modes
- Resource allocation strategies
- Submitting applications with spark-submit
- Monitoring and troubleshooting cluster jobs
- Production deployment best practices

### Additional Resources:

- [Tuning Spark](https://spark.apache.org/docs/latest/tuning.html)
- [Performance Tuning Guide](https://spark.apache.org/docs/latest/sql-performance-tuning.html)
- [Memory Management](https://spark.apache.org/docs/latest/tuning.html#memory-management-overview)

In [None]:
# Clean up
spark.stop()
print("Spark session stopped. Excellent work on performance optimization!")