# Memory Management in Spark

Spark has had two memory managers:

- Static Memory Manager (the default before Spark 1.6, kept as a legacy mode afterwards and removed in Spark 3.0).
- Unified Memory Manager (*default* since Spark 1.6).

## Unified Memory Manager (UMM)

### Architecture

![image.png](./images/Cd_Unified_Memory_Manager_5GB.jpg)

[Source](https://community.cloudera.com/t5/Community-Articles/Spark-Memory-Management/ta-p/317794)

Advantages of UMM:
- The boundary between storage memory and execution memory is not fixed. Under memory pressure the boundary moves, i.e., one region grows by borrowing space from the other.
- When the application caches nothing, execution can use all of the memory, which avoids unnecessary spills to disk.
- When the application does cache data, a minimum amount of storage memory is reserved so that those cached blocks are not evicted.
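
The borrowing rule above can be sketched with a toy model. This is a simplification and not Spark's implementation: the function name and its arguments are hypothetical, and it only captures the idea that execution may reclaim everything except the cached data that lies inside the protected storage region.

```python
def max_execution_memory(spark_mem, storage_used, storage_fraction=0.5):
    """Toy model: upper bound on execution memory in the unified region.

    spark_mem        -- size of the unified (execution + storage) region
    storage_used     -- memory currently held by cached blocks
    storage_fraction -- share of the region that is protected from eviction
    """
    protected_region = spark_mem * storage_fraction
    # Cached data is only safe from eviction up to the protected region;
    # anything beyond it can be evicted to make room for execution.
    protected_storage = min(storage_used, protected_region)
    return spark_mem - protected_storage


# No cache: execution may grow to the whole region.
print(max_execution_memory(2000, storage_used=0))
# Small cache: execution may borrow the unused part of the storage region.
print(max_execution_memory(2000, storage_used=400))
# Large cache: execution can still claim everything above the protected region.
print(max_execution_memory(2000, storage_used=1500))
```

The three calls correspond to the cases in the list: no cache, a cache smaller than the protected region, and a cache that has grown past it.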

A Spark executor can use two types of memory:

- On-Heap Memory.
- Off-Heap Memory.

Spark also uses one more segment of memory: external process memory. It resides outside the JVM and is mainly used by PySpark and SparkR applications.

### On-Heap Memory (*default*)

The size of the on-heap memory is configured by the `--executor-memory` flag or the `spark.executor.memory` parameter when the Spark application starts.

Two main configurations to control executor memory allocation:

| Parameter | Description |
| -- | -- |
| spark.memory.fraction (default 0.6) | Fraction of the heap space used for execution and storage. The lower this is, the more frequently spills and cached data evictions occur. The purpose of this configuration is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records. |
| spark.memory.storageFraction (default 0.5) | The size of the storage region within the space set aside by spark.memory.fraction. Cached data may only be evicted if total storage exceeds this region. |
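
As a small illustration of how these two parameters could be passed at launch, the sketch below builds the `--conf key=value` arguments for a `spark-shell` command. The helper `to_conf_flags` is a hypothetical name, and the values shown are simply the defaults from the table.

```python
def to_conf_flags(settings):
    """Turn a {key: value} mapping into a flat list of --conf arguments."""
    flags = []
    for key, value in settings.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


memory_settings = {
    "spark.memory.fraction": 0.6,         # default value
    "spark.memory.storageFraction": 0.5,  # default value
}

command = ["spark-shell", "--executor-memory", "4g"] + to_conf_flags(memory_settings)
print(" ".join(command))
```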

Spark splits the on-heap memory into three regions:

- Reserved Memory (**hard-coded** 300MB):
    - Reserved for the system and is used to store Spark's internal objects.
    - If executor memory is less than 1.5 times the reserved memory (450MB), Spark will raise an error.

    ![image.png](./images/low_exe_memory_error.png)

- User Memory: `(Java Heap - Reserved Memory) * (1.0 - spark.memory.fraction)`
    - Used to store user-defined data structures, Spark internal metadata, any UDFs created by the user, and the data needed for RDD conversion operations, such as RDD dependency information, etc.
    - Is 40% of (Java Heap - Reserved Memory) by default.

- Spark Memory: `(Java Heap - Reserved Memory) * spark.memory.fraction`
    - Is 60% of (Java Heap - Reserved Memory) by default.
    - Execution: 
        - Used for shuffles, joins, sorts, and aggregations.
        - Spills to disk when there is not enough memory.
        - Can't be forcefully evicted by other threads.
        - Released immediately after each operation.
    - Storage: 
        - Used to cache partitions of data.
        - Can be evicted. Evicted blocks are then:
            - Written to disk if persistence level is MEMORY_AND_DISK.
            - Recomputed when needed if persistence level is MEMORY_ONLY.

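    ***Note**: Execution memory has priority over storage memory. Execution can evict cached blocks that lie outside the reserved storage region, but storage cannot evict execution memory.*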
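
The three regions can be computed for any heap size and any pair of fractions. The following is a minimal sketch under stated assumptions: `memory_regions` is a hypothetical helper, all sizes are in MiB, and the storage/execution split is only the initial boundary, since the two can borrow from each other at runtime.

```python
RESERVED_MIB = 300  # hard-coded reserved memory


def memory_regions(heap_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Split an executor heap (in MiB) into the regions described above."""
    minimum = 1.5 * RESERVED_MIB
    if heap_mib < minimum:
        # Mirrors the rule that the heap must be at least 1.5x the reserved memory.
        raise ValueError(
            f"executor memory {heap_mib} MiB is below the {minimum:.0f} MiB minimum"
        )
    usable = heap_mib - RESERVED_MIB
    spark = usable * memory_fraction
    storage = spark * storage_fraction
    return {
        "reserved": RESERVED_MIB,
        "user": round(usable - spark, 2),
        "spark": round(spark, 2),
        "storage": round(storage, 2),
        "execution": round(spark - storage, 2),
    }


# Example: a 1 GiB heap with a non-default spark.memory.fraction of 0.75.
print(memory_regions(1024, memory_fraction=0.75))
```

Calling `memory_regions(400)` raises a `ValueError`, which is the toy equivalent of the error Spark reports when the executor memory is below 450 MB.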


### Off-Heap Memory

- Objects are serialized to byte arrays and allocated in memory outside the heap of the Java virtual machine (JVM).
- Managed by the OS.
- Stored outside the process heap in native memory &rarr; not processed by the *garbage collector*.
- Slower than on-heap memory, faster than disk.
- The user has to manage the allocated memory manually.

Two main configurations to set Off-Heap Memory:

| Parameter | Description |
| -- | -- |
| spark.memory.offHeap.enabled (default false) | The option to use off-heap memory for certain operations |
| spark.memory.offHeap.size (default 0) | The total amount of memory in bytes for off-heap allocation. It has no impact on heap memory usage, so make sure not to exceed your executor’s total limits. |
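
Because the size is expressed in bytes, it can help to convert a human-readable value first and to check that a positive size accompanies `spark.memory.offHeap.enabled=true`. The sketch below is illustrative only: `to_bytes` and `check_off_heap` are hypothetical helpers, binary units are assumed (1g = 1024 MiB), and the positive-size check reflects my understanding of Spark's validation rather than a quoted API.

```python
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def to_bytes(size):
    """Convert a size such as '4g' or '512m' to bytes (binary units)."""
    size = size.strip().lower()
    if size[-1] in UNITS:
        return int(size[:-1]) * UNITS[size[-1]]
    return int(size)  # plain number: already in bytes


def check_off_heap(enabled, size):
    """Return the off-heap size in bytes, rejecting enabled-with-zero-size."""
    size_bytes = to_bytes(size)
    if enabled and size_bytes <= 0:
        raise ValueError(
            "spark.memory.offHeap.size must be > 0 when off-heap memory is enabled"
        )
    return size_bytes


print(check_off_heap(True, "4g"))
```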

Advantages:
- Reduces heap memory usage and the frequency of GC, which can improve program performance.
- When an executor is killed, all cached data for that executor would be gone but with off-heap memory, the data would persist.

Disadvantages: (?)
- Using OFF_HEAP does not back up data and cannot guarantee high data availability, so lost data has to be recomputed.

## Calculate the storage memory in Spark

### On-Heap Memory

```console
spark-shell \
    --executor-memory 4g \
    --driver-memory 4g
```

![image.png](./images/storage_mem_on_heap.png)

```python
java_heap_mem = 4 * 1024
reserved_mem = 300
usable_mem = java_heap_mem - reserved_mem
user_mem = round(0.4 * usable_mem, 2)
spark_mem = usable_mem - user_mem
spark_storage_mem = round(spark_mem / 2, 2)
spark_execution_mem = round(spark_mem / 2, 2)

print(f'Java Heap Memory: {java_heap_mem} MiB')
print(f'Reserved Memory: {reserved_mem} MiB')
print(f'Usable Memory: {usable_mem} MiB')
print(f'User Memory: {user_mem} MiB')
print(f'--> Spark Memory: {spark_mem} MiB = {round(spark_mem / 1024, 2)} GiB')
print(f'Spark Storage Memory: {spark_storage_mem} MiB')
print(f'Spark Execution Memory: {spark_execution_mem} MiB')
```

```console
Java Heap Memory: 4096 MiB
Reserved Memory: 300 MiB
Usable Memory: 3796 MiB
User Memory: 1518.4 MiB
--> Spark Memory: 2277.6 MiB = 2.22 GiB
Spark Storage Memory: 1138.8 MiB
Spark Execution Memory: 1138.8 MiB
```


For more accuracy, we can get the maximum heap memory available to the JVM from the terminal (inside `spark-shell`):

```scala
val maxMemory = Runtime.getRuntime.maxMemory()
```

![image.png](./images/scala_get_max_mem.png)

4294967296 B = 4 GiB

### Off-Heap Memory

```console
spark-shell \
    --driver-memory 4g \
    --executor-memory 4g \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=4g
```

![image.png](./images/storage_mem_off_heap.png)

```python
# spark_mem comes from the on-heap calculation above
off_heap_mem = 4 * 1024
total_spark_mem = spark_mem + off_heap_mem

print(f'Off Heap Memory: {off_heap_mem} MiB')
print(f'--> Total Spark Memory: {total_spark_mem} MiB = {round(total_spark_mem / 1024, 2)} GiB')
```

```console
Off Heap Memory: 4096 MiB
--> Total Spark Memory: 6373.6 MiB = 6.22 GiB
```


## Spark Memory Management with YARN

![image.png](./images/executor_memory_yarn.webp)

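On YARN the layout is basically the same, except that each executor container also includes overhead memory (`spark.executor.memoryOverhead`): 10% of `--executor-memory`, with a minimum of 384 MB.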
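
The overhead rule can be turned into a quick estimate of the memory requested per executor container. This is a rough sketch under stated assumptions: `yarn_container_memory` is a hypothetical helper, sizes are in MiB, the 10% value is truncated to a whole number, and off-heap memory as well as any rounding applied by the cluster manager are not modelled.

```python
MIN_OVERHEAD_MIB = 384  # minimum overhead
OVERHEAD_FACTOR = 0.10  # 10% of the executor memory


def yarn_container_memory(executor_memory_mib):
    """Estimate executor memory plus overhead for one container, in MiB."""
    # Assumption: the 10% figure is truncated to an integer number of MiB.
    overhead = max(int(executor_memory_mib * OVERHEAD_FACTOR), MIN_OVERHEAD_MIB)
    return executor_memory_mib + overhead


# 4 GiB executor: 10% (409 MiB) is above the 384 MiB floor.
print(yarn_container_memory(4 * 1024))
# 1 GiB executor: 10% (102 MiB) is below the floor, so 384 MiB is used.
print(yarn_container_memory(1024))
```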