# Unified Memory

Apple silicon has a unified memory architecture. The CPU and GPU have direct access to the same memory pool. MLX is designed to take advantage of that.

Concretely, when you make an array in MLX, you don't have to specify its location:



In [16]:
import time

In [17]:
import mlx.core as mx
a = mx.random.normal((10000000,))
b = mx.random.normal((10000000,))

Both `a` and `b` live in unified memory.

In MLX, rather than moving arrays to devices, you specify the device when you run the operation. Any device can perform any operation on `a` and `b` without needing to move them from one memory location to another.

In [18]:
tic = time.time()
mx.add(a,b,stream=mx.cpu)
toc = time.time()
print('CPU time:',toc-tic)

tic = time.time()
mx.add(a,b,stream=mx.gpu)
toc = time.time()
print('GPU time:',toc-tic)

CPU time: 8.869171142578125e-05
GPU time: 4.1961669921875e-05


In the above, both the CPU and the GPU perform the same `add` operation. The two operations can (and likely will) run in parallel, since there are no dependencies between them.

Because these two `add` operations do not depend on each other, race conditions are not possible. If there are dependencies, the MLX scheduler will automatically manage them. For example:

In [19]:
tic = time.time()
c = mx.add(a,b,stream=mx.cpu)
toc = time.time()
print('CPU time:',toc-tic)

tic = time.time()
d = mx.add(a,c,stream=mx.gpu)  # consumes c, so it depends on the CPU add
toc = time.time()
print('GPU time:',toc-tic)



CPU time: 0.00011491775512695312
GPU time: 4.38690185546875e-05


In the above case, the second `add` runs on the GPU, but it depends on the output of the first `add`, which runs on the CPU. MLX automatically inserts a dependency between the two streams so that the second `add` only starts executing after the first is complete and `c` is available.
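The ordering described above can be mimicked with a toy model in plain Python. This is only an analogy under stated assumptions, not how MLX is implemented: each "stream" is a single-worker executor, the lists `a` and `b` stand in for arrays in unified memory, and the names `cpu_stream`, `gpu_stream`, and `add` are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Toy stand-ins for two streams: each executor runs its tasks in order on its
# own worker thread. This is an analogy, not MLX's actual scheduler.
cpu_stream = ThreadPoolExecutor(max_workers=1, thread_name_prefix="cpu")
gpu_stream = ThreadPoolExecutor(max_workers=1, thread_name_prefix="gpu")

# Buffers shared by both workers without copying, standing in for unified memory.
a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]


def add(x, y):
    """Elementwise add; an argument may be a Future produced on another stream."""
    x = x.result() if isinstance(x, Future) else x  # block until the input exists
    y = y.result() if isinstance(y, Future) else y
    return [xi + yi for xi, yi in zip(x, y)]


c = cpu_stream.submit(add, a, b)  # first add on the "CPU" stream
d = gpu_stream.submit(add, a, c)  # second add on the "GPU" stream, waits for c

print(d.result())

cpu_stream.shutdown()
gpu_stream.shutdown()
```

The second task cannot produce its result until the first has finished, even though the two run on different workers, and neither task copies `a` or `b`.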

## A Simple Example

Here is a more interesting example of how unified memory can be useful:

In [20]:
def fun(a,b,d1,d2):
    x = mx.matmul(a,b,stream=d1)
    for _ in range(500):
        b = mx.exp(b,stream=d2)
    return x,b

which we want to run with the following arguments:

In [49]:
a = mx.random.uniform(shape=(1000000000,10000000))
b = mx.random.uniform(shape=(10000000,10))

The first `matmul` operation is a good fit for the GPU, since it's more compute dense. The second sequence of operations is a better fit for the CPU, since they are very small and would probably be overhead bound on the GPU. Let's see.
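As a rough back-of-the-envelope check of what "compute dense" means here, the sketch below estimates operations per array element touched for each kind of operation. The helper names `matmul_intensity` and `elementwise_intensity` are hypothetical, counting `exp` as a single operation per element is a simplification, and the estimate ignores caches and hardware details.

```python
def matmul_intensity(m: int, k: int, n: int) -> float:
    """Approximate FLOPs per element touched for an (m, k) @ (k, n) product."""
    flops = 2 * m * k * n             # one multiply and one add per inner-product term
    elements = m * k + k * n + m * n  # read both inputs, write the output
    return flops / elements


def elementwise_intensity(size: int, ops_per_element: int = 1) -> float:
    """Approximate operations per element touched for a unary op such as exp."""
    return (ops_per_element * size) / (2 * size)  # read one input, write one output


# Shapes taken from the arguments defined above.
matmul_ai = matmul_intensity(1_000_000_000, 10_000_000, 10)
exp_ai = elementwise_intensity(10_000_000 * 10)

print(f"matmul: {matmul_ai:.2f} ops per element")
print(f"exp:    {exp_ai:.2f} ops per element")
```

Under these assumptions the `matmul` does roughly 20 operations per element touched, while each `exp` does about 0.5, which is the sense in which the first operation is more compute dense.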

In [50]:
tic = time.time()
fun(a,b,mx.gpu,mx.gpu)
toc = time.time()
print('Full GPU time:',toc-tic)


Full GPU time: 0.0025260448455810547


In [51]:
tic = time.time()
fun(a,b,mx.cpu,mx.cpu)
toc = time.time()
print('Full CPU time:',toc-tic)


Full CPU time: 0.0014491081237792969


In [52]:
tic = time.time()
fun(a,b,mx.gpu,mx.cpu)
toc = time.time()
print('GPU + CPU time:',toc-tic)


GPU + CPU time: 0.0014119148254394531
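Single-shot timings like the ones above can be noisy. MLX also evaluates operations lazily, so a call that only builds the computation may return before the work has actually run. The sketch below is one possible timing helper, not part of MLX: `time_fn` and `workload` are hypothetical names, and the stand-in workload is plain Python so the block runs without MLX installed.

```python
import statistics
import time


def time_fn(fn, *args, warmup=2, repeats=10):
    """Return the median wall-clock time of fn(*args) in seconds."""
    for _ in range(warmup):        # warm-up runs are executed but not timed
        fn(*args)
    samples = []
    for _ in range(repeats):
        tic = time.perf_counter()  # monotonic clock, better suited to short intervals
        fn(*args)
        samples.append(time.perf_counter() - tic)
    return statistics.median(samples)


def workload(n):
    """Stand-in computation; replace with the operation being measured."""
    return sum(i * i for i in range(n))


elapsed = time_fn(workload, 100_000)
print(f"median time: {elapsed:.6f} s")
```

To time the MLX function above with this helper, the callable would need to force evaluation, for example `time_fn(lambda: mx.eval(fun(a, b, mx.gpu, mx.cpu)))`, assuming inputs small enough to fit in memory.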
