In [1]:
import spark_env

spark = spark_env.create_spark_session('joins')


In [2]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

- In Spark, a join is not just a matter of matching keys: it is executed with one of several physical strategies, chosen according to data size, available memory, and join type.
- The most efficient strategy is the **Broadcast Hash Join**, which broadcasts the smaller DataFrame to every executor. The large side is never shuffled, and each executor performs the join locally. It is ideal when one side is small enough to fit in memory; Spark picks it automatically when that side is below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default). You can also request it explicitly with `broadcast(df_small)` when joining against a large DataFrame.
- When one side is too large to broadcast but still significantly smaller than the other, Spark may use a **Shuffle Hash Join**. Both datasets are shuffled on the join key, and within each partition a hash table is built from the smaller side for lookups. It can be faster than a Sort-Merge Join because it skips the sort, provided memory is sufficient and the key distribution is not skewed. It is more sensitive to memory pressure and skew, however, because the hash table for each partition has to be held in memory.
- For large-scale joins where both DataFrames are big and neither can be broadcast, Spark uses the **Sort-Merge Join**. Both sides are shuffled and sorted on the join key and then merged. Although this involves a full shuffle and sort, it is the most stable and scalable approach for large joins. It is the default choice for equi-joins when the data exceeds the broadcast threshold and there is no major skew.
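
To make the difference between the hash-based and sort-based strategies concrete, here is a minimal sketch in plain Python (no Spark involved). The function names `hash_join` and `sort_merge_join` and the `(key, value)` tuple layout are illustrative assumptions, not Spark APIs; Spark's real operators run per partition and handle many more details.

```python
from collections import defaultdict

def hash_join(small, large):
    """Inner join: build a hash table on the small side, probe it with the large side."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    # Each row of the large side does one hash lookup; no sorting is needed
    return [(key, lval, sval) for key, lval in large for sval in table.get(key, [])]

def sort_merge_join(left, right):
    """Inner join: sort both sides on the key, then walk them in lockstep."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left row that carries the same key
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            while i < len(left) and left[i][0] == lkey:
                out.extend((lkey, left[i][1], r[1]) for r in right[j:j_end])
                i += 1
            j = j_end
    return out

names = [(1, "Alice"), (2, "Bob"), (3, "Charlie"), (4, "David"), (5, "Eva")]
salaries = [(1, 50000), (2, 60000), (3, 70000), (6, 80000)]

hashed = hash_join(small=salaries, large=names)
merged = sort_merge_join(names, salaries)
print(sorted(hashed) == sorted(merged))
```

Both functions return the same matched rows; they differ only in how the matches are found, which is what drives the memory and shuffle trade-offs described above.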

In [3]:
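# Disable automatic broadcast joins and Adaptive Query Execution so the small
# demo DataFrames below are not silently joined with a Broadcast Hash Join
spark.conf.set('spark.sql.autoBroadcastJoinThreshold', -1)
spark.conf.set('spark.sql.adaptive.enabled', False)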

In [7]:
data1 = [
    (1,"Alice"),
    (2,"Bob"),
    (3,"Charlie"),
    (4,"David"),
    (5,"Eva")
]
df1 = spark.createDataFrame(data1, ["id","name"])

data2 = [
    (1,50000),
    (2,60000),
    (3,70000),
    (6,80000)
]
df2 = spark.createDataFrame(data2, ["id","salary"])

In [8]:
# Left join: keep every row of df1; rows without a match get NULL for df2's columns
df_left_join = df1.join(df2, df1['id'] == df2['id'], 'left')

In [9]:
df_left_join.show()

+---+-------+----+------+
| id|   name|  id|salary|
+---+-------+----+------+
|  5|    Eva|NULL|  NULL|
|  1|  Alice|   1| 50000|
|  3|Charlie|   3| 70000|
|  2|    Bob|   2| 60000|
|  4|  David|NULL|  NULL|
+---+-------+----+------+



## Driver Memory Management

- The Driver in Spark is the master process that coordinates all tasks.
- It holds metadata, task scheduling information, DAGs and more.
- Driver memory is the memory allocated to the Driver process when a Spark job runs.
- If the Driver runs out of memory, the job fails with an `OutOfMemoryError`.
- Driver memory is divided into two types:
1. **JVM Heap Memory**: The main memory allocated to the Java Virtual Machine, where Spark's driver-side data structures live (e.g., RDD and DataFrame definitions, metadata). It is used for task scheduling, query planning and optimization, cached metadata, broadcast variables, and so on. The JVM heap is where Spark runs its logic.
2. **Overhead Memory**: Extra memory reserved outside the JVM heap. It is used for native memory (for example PySpark or Pandas UDFs), thread stacks, internal buffers, and OS-level memory management. Overhead memory prevents out-of-memory errors during native or I/O-heavy operations, so increasing it is often necessary when you use PySpark or UDFs. By default it is max(10% of JVM heap memory, 384 MB).
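
As a quick illustration of the overhead rule above, the following sketch computes the default overhead for a few heap sizes. The function name `default_overhead_mb` is hypothetical, and the calculation uses whole megabytes for simplicity.

```python
def default_overhead_mb(heap_mb: int) -> int:
    """Default overhead from the rule above: max(10% of heap, 384 MB), in whole MB."""
    return max(heap_mb // 10, 384)

for heap_mb in (1024, 4096, 8192):
    overhead_mb = default_overhead_mb(heap_mb)
    total_mb = heap_mb + overhead_mb
    print(f"heap={heap_mb} MB  overhead={overhead_mb} MB  total={total_mb} MB")
```

Note that for small heaps the 384 MB floor dominates: a 1 GB driver still reserves 384 MB of overhead, not 102 MB.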

#### Driver Out of Memory
This error occurs when the results sent back by the executors exceed the memory available to the driver. Mitigate it by avoiding actions that return large amounts of data to the driver, such as `df.collect()`; prefer `take(n)` or `limit(n)`, or write the results to storage instead.

## Executor Memory Management

- Executor memory is broadly divided into four categories:
1. JVM Heap Memory: This memory is further divided into three regions  
   a. **Reserved Memory** [300 MB]: A fixed 300 MB set aside by default.  
   b. **User Memory** [0.4 * (Total memory - Reserved Memory)]: Stores user-defined data structures and the objects created by user code such as UDFs.  
   c. **Spark Memory Pool** [0.6 * (Total memory - Reserved Memory)]:
   - This pool holds cached data and the working memory that transformations need.
   - 50% is used for caching and persisted data (long-term memory). This is called **Storage Memory**.
   - 50% is used for transformations such as shuffles, joins, sorts, and aggregations (short-term memory). This is called **Execution Memory**.
   - The 50/50 split is not fixed: either side can borrow free memory from the other.
   - Execution memory can evict cached blocks from storage memory (in LRU order), but storage memory cannot evict execution memory.  
2. Off-Heap Memory: Managed by the user; the default size is zero
3. Overhead Memory: Extra memory outside the JVM heap -> max(10% of executor memory, 384 MB)
4. PySpark Memory: Rarely used; not set by default
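
The arithmetic behind the breakdown above can be sketched as follows. The function name `executor_memory_breakdown` and the constant names are hypothetical; the 0.6 and 0.5 fractions and the 300 MB reservation are the defaults quoted in the list, and real deployments may override them.

```python
RESERVED_MB = 300        # fixed reserved memory
MEMORY_FRACTION = 0.6    # share of (heap - reserved) given to the Spark memory pool
STORAGE_FRACTION = 0.5   # share of the pool initially assigned to storage

def executor_memory_breakdown(heap_mb: float) -> dict:
    """Split an executor heap (in MB) into the regions described above."""
    usable = heap_mb - RESERVED_MB
    spark_pool = usable * MEMORY_FRACTION
    storage = spark_pool * STORAGE_FRACTION
    return {
        "reserved": RESERVED_MB,
        "user": round(usable - spark_pool, 1),
        "spark_pool": round(spark_pool, 1),
        "storage": round(storage, 1),
        "execution": round(spark_pool - storage, 1),
        # Overhead sits outside the heap: max(10% of executor memory, 384 MB)
        "overhead": round(max(heap_mb * 0.10, 384), 1),
    }

breakdown = executor_memory_breakdown(4096)
for region, size_mb in breakdown.items():
    print(f"{region:>10}: {size_mb} MB")
```

For a 4 GB heap this leaves roughly 1.1 GB each for storage and execution before any borrowing, which is why a nominally "4 GB" executor can run out of memory on partitions far smaller than 4 GB.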


#### Executor Out of Memory
Common causes include:
- **Large data per task (partition too big):** A single task processes more data than the executor can hold.
- **Skewed data in joins or aggregations:** One key has too much data → some executors do all the work and crash.
- **Improper caching/persisting:** Caching a large DataFrame without enough memory (especially with MEMORY_ONLY) leads to eviction or OOM.
- **Use of wide transformations:** Operations like groupByKey, join, sort cause shuffle and large intermediate data in memory.
- **Calling collect() on large DataFrames:** Tries to bring all the data into the driver's memory → instant crash if it can't fit (this surfaces as a driver OOM rather than an executor OOM).
- **Heavy PySpark UDFs or Pandas UDFs:** Native code or Python logic consumes off-heap memory → hits memoryOverhead limit.
- **Too many tasks per executor:** Multiple tasks run concurrently, each using memory → total usage exceeds limit.
- **Improper configuration:** --executor-memory or spark.executor.memoryOverhead is set too low.
- **GC overhead without error:** JVM spends too much time in garbage collection due to memory pressure (near-OOM symptom).
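
For the first cause in the list (partitions that are too big), a rough sizing calculation helps when choosing a value for `repartition()`. The sketch below is a rule-of-thumb estimate only: the helper name `partitions_needed` is hypothetical, and the 128 MB target is an assumption you should tune for your workload.

```python
import math

def partitions_needed(total_bytes: int, target_partition_bytes: int = 128 * 1024**2) -> int:
    """Estimate how many partitions keep each one at or below the target size."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))

dataset_bytes = 100 * 1024**3  # a hypothetical 100 GiB dataset
print(partitions_needed(dataset_bytes))
```

This only accounts for on-disk or serialized size; data usually grows when deserialized in memory, so treat the result as a lower bound.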


#### Salting

Salting is a technique used in Spark to handle data skew, which occurs when one or a few keys in a join or aggregation have significantly more data than the others. This imbalance makes Spark assign a disproportionately large amount of data to a single task or executor, leading to performance bottlenecks or even out-of-memory errors. Salting mitigates this by artificially distributing the skewed keys across multiple partitions. It works by appending a random "salt" value (such as a number) to the skewed key, turning one heavy key into several lighter keys (e.g., "India" becomes "India_1", "India_2", etc.). In a join, the other dataset must be modified to match: each of its rows is replicated once per salt value so that every salted key still finds its match. After the join or aggregation, the results can be de-salted or recombined if needed. This approach gives better parallelism and load balancing during execution.
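
The mechanics can be illustrated with a small simulation in plain Python (no Spark involved). Everything here is an assumption made for the example: the helper names, the `NUM_SALTS` value, the `orders` and `countries` data, and the simplification that the small side has unique keys.

```python
import random
from collections import Counter

NUM_SALTS = 4  # hypothetical number of salt values

def salt_large_side(rows, num_salts, rng):
    """Append a random salt to every key on the skewed (large) side."""
    return [(f"{key}_{rng.randrange(num_salts)}", value) for key, value in rows]

def explode_small_side(rows, num_salts):
    """Replicate each small-side row once per salt so every salted key can match."""
    return [(f"{key}_{salt}", value) for key, value in rows for salt in range(num_salts)]

def join_and_desalt(large_salted, small_salted):
    """Inner join on the salted key, then strip the salt from the result."""
    lookup = dict(small_salted)  # assumes unique keys on the small side
    joined = []
    for salted_key, value in large_salted:
        if salted_key in lookup:
            original_key = salted_key.rsplit("_", 1)[0]
            joined.append((original_key, value, lookup[salted_key]))
    return joined

rng = random.Random(0)
orders = [("India", i) for i in range(1000)] + [("Peru", i) for i in range(10)]
countries = [("India", "Asia"), ("Peru", "South America")]

salted_orders = salt_large_side(orders, NUM_SALTS, rng)
salted_countries = explode_small_side(countries, NUM_SALTS)
joined = join_and_desalt(salted_orders, salted_countries)

rows_per_key_before = Counter(key for key, _ in orders)
rows_per_key_after = Counter(key for key, _ in salted_orders)
print("heaviest key before salting:", max(rows_per_key_before.values()))
print("heaviest key after salting: ", max(rows_per_key_after.values()))
```

The join result is unchanged, but the heaviest key now carries only a fraction of the rows. The cost is that the small side grows by a factor of `NUM_SALTS`, so the salt range should be kept as small as the skew allows.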