# Execution Spaces

## Content

* [Heterogeneous Programming Model](#Heterogeneous-Programming-Model)
* [Execution Policy](#Execution-Policy)
* [Exercise: Annotate Execution Spaces](01.02.02-Exercise-Annotate-Execution-Spaces.ipynb)
* [Exercise: Changing Execution Space](01.02.03-Exercise-Changing-Execution-Space.ipynb)
* [Exercise: Compute Median Temperature](01.02.04-Exercise-Compute-Median-Temperature.ipynb)

---

By the end of this lab, you’ll have your first code running on a GPU!
But what exactly does it mean to run code on a GPU?
For that matter, what does it mean to run code anywhere? 
Let's start by working our way through this question. 

To build intuition around such fundamental questions, we'll be simulating heat conduction.
We'll start with a very simple version that simulates how objects cool down to the environment temperature.
As we gain proficiency with the necessary tools, we'll extend this example.

In [None]:
import os

if os.getenv("COLAB_RELEASE_TAG"): # If running in Google Colab:
  !mkdir -p Sources
  !wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/tutorials/cuda-cpp/notebooks/01.02-Execution-Spaces/Sources/ach.h -nv -O Sources/ach.h

In [None]:
%%writefile Sources/cpu-cooling.cpp

#include <cstdio>
#include <vector>

int main() {
    float k = 0.5;
    float ambient_temp = 20;
    std::vector<float> temp{ 42, 24, 50 };

    std::printf("step  temp[0]  temp[1]  temp[2]\n");
    for (int step = 0; step < 3; step++) {
        for (int i = 0; i < temp.size(); i++) {
            float diff = ambient_temp - temp[i];
            temp[i] = temp[i] + k * diff;
        }

        std::printf("%d     %.2f    %.2f    %.2f\n", step, temp[0], temp[1], temp[2]);
    }
}

At the beginning of the `main` function, we construct a `std::vector` and store three elements in it:

```c++
std::vector<float> temp{ 42, 24, 50 };
```

After that, we transform each element of this vector:

```c++
for (int i = 0; i < temp.size(); i++) {
    float diff = ambient_temp - temp[i];
    temp[i] = temp[i] + k * diff;
}
```

Here, we are updating each element of the vector by a constant factor times the difference between the ambient temperature and the current temperature. The result of this computation overwrites each previous element:

```c++
diff    = 20 - 42;        // -22
temp[0] = 42 + 0.5 * -22; // 31.0
```
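
The same update can be written as a recurrence, which also gives a closed form for the temperature after several updates. The symbols $T_n$, $T_{\text{amb}}$ and $n$ are notation introduced here for illustration only; they do not appear in the code:

$$
T_{n+1} = T_n + k\,(T_{\text{amb}} - T_n)
\quad\Longrightarrow\quad
T_n = T_{\text{amb}} + (1 - k)^n\,(T_0 - T_{\text{amb}})
$$

With $k = 0.5$, $T_{\text{amb}} = 20$ and $T_0 = 42$, this gives $T_1 = 31$, $T_2 = 25.5$ and $T_3 = 22.75$, so each update halves the remaining gap to the ambient temperature. The program prints the temperature after each update, so the row labelled step 0 corresponds to $T_1$.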

Finally, we print the new contents of the vector with `std::printf` at the end of each step.

If everything goes well and your environment is set up correctly, the cell below should print:

| step | temp[0] | temp[1] | temp[2] |
| :--- | :------ | :------ | :------ |
| 0    | 31.00   | 22.00   | 35.00   |
| 1    | 25.50   | 21.00   | 27.50   |
| 2    | 22.75   | 20.50   | 23.75   |

In [None]:
!g++ Sources/cpu-cooling.cpp -o /tmp/a.out # compile the code
!/tmp/a.out # run the executable

Let's revisit the steps we've just taken.
We started by compiling our code using the `g++` compiler:
```bash
g++ Sources/cpu-cooling.cpp -o /tmp/a.out
```

The `g++` compiler consumed C++ code and produced an executable file, `a.out`, which contains a set of machine instructions. However, there’s a problem: different CPUs support different sets of instructions. For example, if you compile the program above for an x86 CPU, the `temp[i] + k * diff` expression may be compiled into the `vfmadd132ss` instruction. If you try running the resulting executable on an ARM CPU, it won’t work, because the ARM architecture does not support this instruction. To run this code on an ARM CPU, you would need to compile it specifically for the ARM architecture. In that case, the expression could be compiled into the `vmla.f32` instruction instead.

From this perspective, GPUs are no different.
GPUs have their own set of instructions, so we have to compile our code for GPUs somehow.

![Compilation process diagram shows how a given C++ expression is turned into architecture-specific instructions](Images/compilation.svg "Compilation")

The NVIDIA CUDA Compiler (NVCC) allows you to compile C++ code for GPUs.
Let's try using it on the same file without changing anything:

In [None]:
!nvcc -x cu -arch=native Sources/cpu-cooling.cpp -o /tmp/a.out # compile the code
!/tmp/a.out # run the executable

Congratulations! You just compiled your first CUDA program!
There's one issue, though: ***none of the code above runs on the GPU***.
That might be surprising, because when we compiled our code for the CPU, the entire program ran on the CPU.
But now that we've compiled our program with the CUDA compiler, nothing runs on the GPU.
This confusion indicates that we are missing an important piece of the CUDA programming model.

## Heterogeneous Programming Model

GPUs are accelerators rather than standalone processors. 
A lot of work, such as interacting with the network and the file system, is still done on the CPU.
So a CUDA program always *starts* on the CPU.
You, the programmer, are responsible for explicitly specifying which code has to run on the GPU.
In other words, you are responsible for specifying which code runs **where**.
The established terminology for **where** code is executed is **execution space**.

![Heterogeneous programming model](Images/heterogeneous.png "Heterogeneous programming model")

At a high level, execution spaces are partitioned into **host** (CPU) and **device** (GPU).
These terms are used to generalize the programming model.
Something other than a CPU could host a GPU, and something other than a GPU could accelerate a CPU.

By default, code runs on the **host** side.
You are responsible for specifying which code should run on the **device**. 
This explains why using `nvcc` alone was insufficient: we hadn't marked any code for execution on the GPU.

So, let's try fixing that. 
The CUDA compiler, NVCC, is accompanied by a set of core libraries.
These libraries allow you to explicitly specify the execution space where you want a given algorithm to run.
To prepare our code for these libraries, let's refactor the temperature update `for` loop first:

In [None]:
%%writefile Sources/gpu-cooling.cpp

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    float k = 0.5;
    float ambient_temp = 20;
    std::vector<float> temp{ 42, 24, 50 };
    auto transformation = [=] (float temp) { return temp + k * (ambient_temp - temp); };

    std::printf("step  temp[0]  temp[1]  temp[2]\n");
    for (int step = 0; step < 3; step++) {
        std::transform(temp.begin(), temp.end(), temp.begin(), transformation);
        std::printf("%d     %.2f    %.2f    %.2f\n", step, temp[0], temp[1], temp[2]);
    }
}

In [None]:
!nvcc Sources/gpu-cooling.cpp -x cu -arch=native -o /tmp/a.out # compile the code
!/tmp/a.out # run the executable

Instead of a `for` loop, we used the `std::transform` algorithm from the C++ standard library. 
One of the benefits of using algorithms instead of custom loops is reduced mental load.
Instead of "executing" the loop in your mind to see that it implements a transformation pattern,
you can quickly recognize it by the algorithm name.
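
As a minimal sketch of that equivalence, the two helpers below apply one cooling step with a hand-written loop and with `std::transform`. The names `cool_with_loop` and `cool_with_transform` are hypothetical and introduced only for this illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: one cooling step written as a hand-rolled loop.
std::vector<float> cool_with_loop(std::vector<float> temp, float k, float ambient_temp) {
    for (std::size_t i = 0; i < temp.size(); i++) {
        temp[i] = temp[i] + k * (ambient_temp - temp[i]);
    }
    return temp;
}

// Hypothetical helper: the same step expressed with std::transform.
std::vector<float> cool_with_transform(std::vector<float> temp, float k, float ambient_temp) {
    auto transformation = [=](float t) { return t + k * (ambient_temp - t); };
    // Input and output ranges coincide, so elements are updated in place.
    std::transform(temp.begin(), temp.end(), temp.begin(), transformation);
    return temp;
}
```

Both helpers take the vector by value and return the updated copy, so calling them on the same input should produce identical results.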

But above all else, using algorithms enables you to easily leverage GPUs!
For that, we'll be using one of the CUDA Core Libraries called Thrust.
Thrust provides standard algorithms and containers that run on the GPU. 
Let's try using those:

In [None]:
%%writefile Sources/thrust-cooling.cpp

#include <thrust/execution_policy.h>
#include <thrust/universal_vector.h>
#include <thrust/transform.h>
#include <cstdio>

int main() {
    float k = 0.5;
    float ambient_temp = 20;
    thrust::universal_vector<float> temp{ 42, 24, 50 };
    auto transformation = [=] __host__ __device__ (float temp) { return temp + k * (ambient_temp - temp); };

    std::printf("step  temp[0]  temp[1]  temp[2]\n");
    for (int step = 0; step < 3; step++) {
        thrust::transform(thrust::device, temp.begin(), temp.end(), temp.begin(), transformation);
        std::printf("%d     %.2f    %.2f    %.2f\n", step, temp[0], temp[1], temp[2]);
    }
}

In [None]:
!nvcc --extended-lambda Sources/thrust-cooling.cpp -x cu -arch=native -o /tmp/a.out # compile the code
!/tmp/a.out # run the executable


Let's take a look at the changes that we've just made.
We started by replacing `std::vector` with `thrust::universal_vector`.
We'll explain why this change was necessary later in this lab.
More importantly, we annotated the lambda with the `__host__ __device__` execution space specifiers.

As discussed earlier, we have to compile some of the code into GPU instructions.
Execution space specifiers tell NVCC which code can be executed on the GPU.
The `__host__` specifier denotes that a given function is executable by the CPU.
This specifier is applied by default to every C++ function.
For example, this means that `int main()` is the same as `__host__ int main()`.

The `__device__` specifier, on the other hand, denotes a function that's executable by the GPU.
That's how NVCC knows which functions to compile for the GPU and which ones for the CPU.
In the code above, we combined the `__host__` and `__device__` specifiers.
This indicates that the function can be executed by both the CPU and the GPU.
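
The specifiers are not limited to lambdas. As a minimal sketch, they can also be placed on the call operator of a named function object. The `cooling_step` type below is a hypothetical name that is not defined in this lab, and the `#ifndef __CUDACC__` fallback is an illustrative assumption that lets the snippet also build with a plain C++ compiler such as `g++`, which does not know these specifiers:

```cpp
// Assumption for illustration: when the file is not compiled as CUDA,
// define the specifiers as empty so a plain C++ compiler accepts them.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical function object, equivalent to the `transformation` lambda.
struct cooling_step {
    float k;
    float ambient_temp;

    // With nvcc, this operator is compiled for both the host and the device.
    __host__ __device__ float operator()(float temp) const {
        return temp + k * (ambient_temp - temp);
    }
};
```

When compiled with `nvcc`, an object such as `cooling_step{0.5f, 20.0f}` could be passed to `thrust::transform` in place of the lambda. Because it is an ordinary function object rather than a lambda, the `--extended-lambda` flag should not be needed for it.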

Finally, we replaced `std::transform` with `thrust::transform`.
Unlike the `std::transform` call above, `thrust::transform` here takes an execution policy as its first argument, which tells Thrust in which execution space to run the algorithm.
In the code above, we explicitly asked Thrust to perform the transformation on the device (GPU) by passing `thrust::device`.

## Execution Policy

![Execution Policy](Images/execution-policy.svg "Execution Policy")

---
Congratulations! You've learned the basics of execution spaces. Overall, the goal of this lab is to show you that there's no magic behind CUDA:

* Code that starts execution on the host stays on the host.
* Code that runs on the device stays on the device.

Proceed to your first [exercise](01.02.02-Exercise-Annotate-Execution-Spaces.ipynb).