# Memory Spaces

## Content

* [Host and Device Memory Spaces](Host-and-Device-Memory-Spaces)
* [Exercise: Copy](01.06.02-Exercise-Copy.ipynb)

At the beginning of this section, we covered execution spaces but left one change without explanation.
We replaced `std::vector` with `thrust::universal_vector`.
By the end of this lab, you'll understand why this change was necessary.

But before we start, let's try to figure out why GPUs are so good at massive parallelism.
Many benefits of GPUs result from their focus on high throughput.
To support the massive compute that GPUs are capable of sustaining,
we have to provide memory speed that matches these capabilities.
This essentially means that memory also has to be throughput-oriented.
That's why GPUs often come with built-in high-bandwidth memory rather than relying on system memory.
Let's return to our code to see how it's affected by this fact.

In [None]:
import os

if os.getenv("COLAB_RELEASE_TAG"): # If running in Google Colab:
  !mkdir -p Sources
  !wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/gpu-cpp-tutorial/notebooks/01.06-Memory-Spaces/Sources/ach.h -nv -O Sources/ach.h

In [None]:
%%writefile Sources/heat-2D.cpp
#include "ach.h"

int main()
{
  int height = 4096;
  int width  = 4096;

  thrust::universal_vector<float> prev = ach::init(height, width);
  thrust::universal_vector<float> next(height * width);

  for (int write_step = 0; write_step < 3; write_step++) {
    std::printf("   write step %d\n", write_step);
    ach::store(write_step, height, width, prev);

    for (int compute_step = 0; compute_step < 3; compute_step++) {
      auto begin = std::chrono::high_resolution_clock::now();
      ach::simulate(height, width, prev, next);
      auto end = std::chrono::high_resolution_clock::now();
      auto seconds = std::chrono::duration<double>(end - begin).count();
      std::printf("computed step %d in %g s\n", compute_step, seconds);
      prev.swap(next);
    }
  }
}

In the code above, we allocate data in `thrust::universal_vector`.
Then, `ach::store` accesses the contents of this vector on the CPU to store results on disk.
After that, the data is repeatedly accessed by the GPU in the `ach::simulate` function.
This is a bit suspicious.
We just said that the CPU and GPU have distinct memory spaces,
but nothing in the code reflects this.
Maybe performance can reveal something?

In [None]:
!nvcc --extended-lambda -o /tmp/a.out Sources/heat-2D.cpp -x cu -arch=native # build executable
!/tmp/a.out # run executable

There's a strange pattern in the execution times.
Every time we write data, the next compute step takes roughly 100 times longer than the ones that follow it.
This happens because the data is being implicitly copied between the CPU and GPU memory spaces.

![Implicit Memory Transfers](Images/managed.png "Implicit Memory Transfers")

Let's say our data resides in GPU memory.
When `ach::store` accesses it, the data has to be copied to CPU memory.
Next, when we call `ach::simulate`, the GPU accesses the data, so it has to be copied back.
In other words, `thrust::universal_vector` behaves like a vector that is visible in both the CPU and GPU memory spaces and automatically migrates between them.
The problem is that `ach::store` doesn't modify the data, so the copy back to the GPU is unnecessary.
Fortunately, we can avoid this extra copy by using explicit memory spaces.

## Host and Device Memory Spaces

The presence of distinct host and device memory spaces is a fundamental concept in GPU programming.
For you, as a software engineer, this means that in addition to thinking about where code runs,
you also have to keep in mind where the bytes that this code accesses live.
At a high level, we have a **host memory space** and a **device memory space**.
Thrust provides container types that manage memory in the associated memory spaces.
Let's take a look at a program that allocates vectors in the corresponding memory spaces:

```c++
thrust::host_vector<int> h_vec{ 11, 12 };
thrust::device_vector<int> d_vec{ 21, 22 };
thrust::copy_n(h_vec.begin(), 1, d_vec.begin());
```

Let's go through this code step by step.
We started by allocating a vector with two elements in host memory.
We initialized these two elements with `11` and `12`:

```c++
thrust::host_vector<int> h_vec{ 11, 12 };
```

Functionally, there's little difference between `std::vector` and `thrust::host_vector`.
While you are learning, we suggest using `thrust::host_vector` to make the memory space explicit.
Besides the host vector, we also allocated a device vector:

```c++
thrust::device_vector<int> d_vec{ 21, 22 };
```

We then copied one element from the host memory space to the device memory space using the Thrust copy algorithm.
In general, copy is one of the few algorithms that accepts iterators from different memory spaces.

```c++
thrust::copy_n(h_vec.begin(), 1, d_vec.begin());
```

![Memory Spaces](Images/memory.png "Memory Spaces")

---
For now, it's safe to assume that:

- Device memory space is accessible from device execution space
- Host memory space is accessible from host execution space
- Thrust data movement algorithms can copy data between memory spaces

Let's try to internalize these points with practical examples.

Proceed to [the next exercise](01.06.02-Exercise-Copy.ipynb).