## Unified Memory in MLX

Apple Silicon features a **unified memory architecture**, allowing the CPU and GPU to share the same memory pool. MLX leverages this architecture to simplify and optimize memory usage for computations.



### **1. What is Unified Memory?**

- In traditional architectures, the CPU and GPU use separate memory pools, requiring explicit data transfers between them.
- With unified memory, the CPU and GPU can directly access the same memory pool, eliminating the overhead of data transfers.



### **2. Array Location in MLX**

- When you create an array in MLX, its location (CPU or GPU) does not need to be specified.
- Example:

In [10]:
import mlx.core as mx
a = mx.random.normal((100,))
b = mx.random.normal((100,))


  - Both `a` and `b` are stored in unified memory and accessible by both the CPU and GPU.


### **3. Device-Specific Operations**

- Instead of moving arrays between devices, you specify the device when performing an operation.
- Example:

In [11]:
mx.add(a, b, stream=mx.cpu)  # Perform addition on the CPU
mx.add(a, b, stream=mx.gpu)  # Perform addition on the GPU


array([-0.586388, 0.780082, 0.683154, ..., 2.48238, 0.565931, -1.35311], dtype=float32)

  - Both operations directly access `a` and `b` in unified memory.


In [12]:


# Create arrays in unified memory
a = mx.random.normal((100,))
b = mx.random.normal((100,))

# Check the arrays
print(f"Array a: {a}")
print(f"Array b: {b}")


Array a: array([-0.035678, -1.62694, -0.229797, ..., 0.374411, -1.10128, -0.0897107], dtype=float32)
Array b: array([0.0952313, -0.0990817, 0.301924, ..., 0.0235381, 0.815937, -0.1246], dtype=float32)


### **4. Parallelism and Streams**

- Since the CPU and GPU can share memory, they can perform operations in parallel.
- Example:

In [13]:
# Perform addition on the CPU
result_cpu = mx.add(a, b, stream=mx.cpu)
print(f"Result on CPU: {result_cpu}")

# Perform addition on the GPU
result_gpu = mx.add(a, b, stream=mx.gpu)
print(f"Result on GPU: {result_gpu}")


Result on CPU: array([0.0595533, -1.72602, 0.072127, ..., 0.39795, -0.285342, -0.214311], dtype=float32)
Result on GPU: array([0.0595533, -1.72602, 0.072127, ..., 0.39795, -0.285342, -0.214311], dtype=float32)


 - These operations can run simultaneously since there are no dependencies.



In [14]:
# Perform parallel operations
mx.add(a, b, stream=mx.cpu)
mx.add(a, b, stream=mx.gpu)
# Both operations access the same unified memory and can execute in parallel


array([0.0595533, -1.72602, 0.072127, ..., 0.39795, -0.285342, -0.214311], dtype=float32)

### **5. Handling Dependencies**

- If operations have dependencies, MLX's scheduler automatically ensures they are handled correctly, preventing race conditions.


MLX will automatically insert a dependency between the two streams so that the second add only starts executing after the first is complete and `a` and `b` are available.

## A Simple Example: Using Unified Memory for Optimal Device Utilization

Unified memory in Apple Silicon allows seamless sharing of data between the CPU and GPU, enabling mixed execution to optimize performance. Here's an example illustrating this:



### **Computation**

We have a computation that involves:
1. A **matrix multiplication** (`mx.matmul`) between large matrices `a` and `b`.
2. **Repeated element-wise operations** (`mx.exp`) on matrix `b`.

---

### **Device Suitability**

1. **Matrix Multiplication (`mx.matmul`):**
   - This operation is compute-intensive and benefits significantly from the parallel processing power of the **GPU**.

2. **Repeated Element-Wise Operations (`mx.exp`):**
   - These operations are small and frequent, meaning:
     - Running them on the GPU could introduce overhead due to kernel launch times.
     - The **CPU** is better suited for these smaller tasks because it has lower overhead for such operations.

---

### **Unified Memory Advantage**

- Both `a` and `b` reside in **unified memory**, accessible by both the **CPU** and **GPU**.
- This allows seamless sharing of data between devices **without explicit data transfers**.
- Unified memory saves time and simplifies the code by removing the need to manually move data between devices.

---

### **Performance Comparison**

#### **Fully on GPU:**
- Running all operations on the GPU takes approximately **2.8 milliseconds**.
- The small, frequent operations (`mx.exp`) are **overhead-bound**, making the GPU less efficient for this part of the computation.

#### **Mixed Execution (GPU + CPU):**
- By splitting the workload:
  - The **GPU** handles the compute-intensive matrix multiplication.
  - The **CPU** efficiently processes the small, frequent element-wise operations.


In [17]:

import time
import platform

# Define the computation function
def fun(a, b, d1, d2):
    """
    Perform matrix multiplication on device d1 and repeated element-wise
    operations on device d2.
    """
    # Perform matrix multiplication on device d1
    x = mx.matmul(a, b, stream=d1)
    # Perform repeated element-wise operations on device d2
    for _ in range(500):
        b = mx.exp(b, stream=d2)
    return x, b

# Create large arrays in unified memory
a = mx.random.uniform(shape=(4096, 5120))  # Large matrix
b = mx.random.uniform(shape=(5120, 400))    # Smaller matrix

# Function to time and execute
def execute_and_time(func, *args, description="Execution"):
    start_time = time.time()
    result = func(*args)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"{description}: {elapsed_time:.4f} seconds")
    return result, elapsed_time

# Display the system information
chip_name = platform.processor()
print(f"Running on: {chip_name}")

# Fully on GPU
print("\nRunning fully on GPU...")
(x_gpu, b_gpu), time_gpu = execute_and_time(fun, a, b, mx.gpu, mx.gpu, description="Full GPU Execution")

# Mixed execution (GPU + CPU)
print("\nRunning mixed execution (GPU for matmul, CPU for exp)...")
(x_mixed, b_mixed), time_mixed = execute_and_time(fun, a, b, mx.gpu, mx.cpu, description="Mixed GPU + CPU Execution")

# Print summary
print("\n--- Performance Summary ---")
print(f"Chip: {chip_name}")
print(f"Full GPU Execution Time: {time_gpu:.4f} seconds")
print(f"Mixed Execution Time: {time_mixed:.4f} seconds")
print(f"Speedup (Mixed vs GPU-only): {time_gpu / time_mixed:.2f}x faster")


Running on: arm

Running fully on GPU...
Full GPU Execution: 0.0017 seconds

Running mixed execution (GPU for matmul, CPU for exp)...
Mixed GPU + CPU Execution: 0.0016 seconds

--- Performance Summary ---
Chip: arm
Full GPU Execution Time: 0.0017 seconds
Mixed Execution Time: 0.0016 seconds
Speedup (Mixed vs GPU-only): 1.10x faster
