# Modern C++

## Level 0: Hello World

We start with a serial CPU code printing a range of numbers:

```cpp
for (size_t i0 = 0; i0 < nx; ++i0) {
    printf("%zu\n", i0); // %zu is the matching format specifier for size_t
}
```

The full example is available in [print-numbers-base.cpp](../src/print-numbers/print-numbers-base.cpp), and can be compiled and executed using the following cells.

In [None]:
!g++ -O3 -march=native -std=c++17 -o ../build/print-numbers/print-numbers-base ../src/print-numbers/print-numbers-base.cpp

In [None]:
!../build/print-numbers/print-numbers-base

A plain loop like this cannot be offloaded through the parallel algorithms of modern C++, since they operate on STL algorithm calls rather than on raw loops.
The first step is therefore rewriting it using STL algorithms.

For this simple application, the `for_each` pattern is suitable.
It applies an operation to each *input element* (and does not return or store anything).
There are several ways to apply it here.

**1.** Setting up a container with the number of elements equal to the number of iterations in the original loop, and assigning each element its index.
This increases the memory footprint and requires memory management (see below).
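A minimal sketch of this first variant, assuming a `std::vector` as the container and `std::iota` to fill it. The helper names `make_indices` and `print_numbers` are hypothetical and not taken from the example sources.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

// Builds the index container: element i holds the value i.
// This is the extra allocation of nx elements mentioned above.
std::vector<std::size_t> make_indices(std::size_t nx) {
    std::vector<std::size_t> indices(nx);
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    return indices;
}

// Prints every index, one per line, by iterating over the container.
void print_numbers(std::size_t nx) {
    const auto indices = make_indices(nx);
    std::for_each(indices.begin(), indices.end(),
                  [](const auto &i0) {
                      std::printf("%zu\n", i0);
                  });
}
```

Calling `print_numbers(nx)` visits the same values as the original loop, at the cost of allocating and filling `nx` extra elements first.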

**2.** Using C++ [ranges](https://en.cppreference.com/w/cpp/ranges).
Compiler support is not yet ideal, especially on GPUs.
A possible implementation could look similar to this snippet:

```cpp
// requires C++20 and <ranges>; assumes nx is a size_t so both bounds share one type
auto indices = std::ranges::iota_view{size_t{0}, nx};
std::for_each(indices.begin(), indices.end(),
              [=](const auto &i0) {
                printf("%zu\n", i0);
              });
```

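**3.** Reconstructing the index using pointer arithmetic. The range is spanned by pointers that never refer to valid memory, which is formally outside what the standard guarantees. In practice, it is a robust approach that works for many applications.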

```cpp
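int* ptr = 0; // dummy base pointer; no memory is ever read through it
std::for_each(std::execution::par_unseq, ptr, ptr + nx,
              [=](const auto &p) {
                size_t i0 = &p - ptr; // element address minus base address = index
                printf("%zu\n", i0);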
              });
```
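The same address arithmetic is well defined when the range is a real allocation, which is the common case once data is involved. The following is a minimal sketch under that assumption; `fill_with_index` is a hypothetical helper name. The lambda has to take its argument by reference, since the address of a by-value copy would not be related to `data`. The serial overload is used here; passing an execution policy as the first argument gives the parallel form.

```cpp
#include <algorithm>
#include <cstddef>

// Stores each element's own index in it. The index is recovered from the
// element's address relative to the start of the allocation.
void fill_with_index(std::size_t* data, std::size_t nx) {
    std::for_each(data, data + nx,
                  [=](std::size_t &p) {
                      // pointer difference within one array = element index
                      const std::size_t i0 = static_cast<std::size_t>(&p - data);
                      p = i0;
                  });
}
```

After `fill_with_index(data, nx)`, element `i` of `data` holds the value `i`.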

**4.** Using a thrust `counting_iterator` (also available on AMD via *rocThrust*). Fully switching to [thrust](thrust.ipynb) might be a suitable alternative in this case.

```cpp
std::for_each(thrust::make_counting_iterator<size_t>(0), thrust::make_counting_iterator<size_t>(nx),
              [=](const auto &i0) {
                printf("%zu\n", i0);
              });
```
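To illustrate what a counting iterator does, here is a deliberately minimal, hypothetical stand-in written in plain C++. It is not thrust's implementation: `CountingIterator` and `sum_of_indices` are made-up names, and the class only models an input iterator, so it works with the serial overload of `std::for_each` but not with parallel execution policies, which require at least forward iterators.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Hypothetical minimal counting iterator: dereferencing yields the current
// count, incrementing advances it. No index is ever stored in memory.
class CountingIterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = std::size_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const std::size_t*;
    using reference = std::size_t;

    explicit CountingIterator(std::size_t value) : value_(value) {}

    std::size_t operator*() const { return value_; }
    CountingIterator& operator++() { ++value_; return *this; }
    CountingIterator operator++(int) {
        CountingIterator old = *this;
        ++value_;
        return old;
    }
    bool operator==(const CountingIterator& other) const { return value_ == other.value_; }
    bool operator!=(const CountingIterator& other) const { return value_ != other.value_; }

private:
    std::size_t value_;
};

// Visits 0, 1, ..., nx - 1 and accumulates them, without an index container.
std::size_t sum_of_indices(std::size_t nx) {
    std::size_t sum = 0;
    std::for_each(CountingIterator(0), CountingIterator(nx),
                  [&sum](std::size_t i0) { sum += i0; });
    return sum;
}
```

Compared to the container-based variant, the indices are generated on the fly, so the memory footprint does not grow with `nx`.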

**Offloading**

Regardless of the variant chosen, many STL algorithms can be parallelized by passing an *execution policy* such as `std::execution::par_unseq` as the first argument.
GPU offloading can then be enabled via compiler arguments.

```cpp
int* ptr = 0;
std::for_each(std::execution::par_unseq, ptr, ptr + nx,
              [=](const auto &p) {
                size_t i0 = &p - ptr;
                printf("%zu\n", i0);
              });
```

All algorithm executions are synchronous with respect to the CPU, i.e. no explicit GPU synchronization is necessary.

The complete example code is available in [print-numbers-std-par.cpp](../src/print-numbers/print-numbers-std-par.cpp).
Build and execute it using the following cells.

In [None]:
!nvc++ -O3 -std=c++17 -stdpar=gpu -target=gpu -gpu=cc86 -o ../build/print-numbers/print-numbers-std-par ../src/print-numbers/print-numbers-std-par.cpp

In [None]:
!../build/print-numbers/print-numbers-std-par

## Level 1: Adding Managed Memory

Our next application is increasing all elements of an array by one.

[increase-base.cpp](../src/increase/increase-base.cpp) shows a serial CPU-only implementation.
Its key part and our entry point is the `increase` function.

```cpp
void increase(double* data, size_t nx) {
    for (size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
}
```

The modern C++ approach uses managed memory by default; no additional compiler flags are necessary.
For this particular application, using `for_each` with index reconstruction works well enough.
The `transform` pattern might be a better fit, however.

```cpp
std::transform(std::execution::par_unseq, data, data + nx, data,
               [=](auto data_item) {
                   return data_item + 1;
               }); // implicit synchronization
```
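For comparison, the `for_each` variant with index reconstruction mentioned above could look roughly like the following sketch. It assumes the same `increase(double*, size_t)` signature as the serial version and uses the serial overload of `std::for_each`; passing `std::execution::par_unseq` as the first argument gives the parallel form.

```cpp
#include <algorithm>
#include <cstddef>

// Increases every element by one. The index is recovered from the element's
// address, which is useful when several arrays share the same index.
void increase(double* data, std::size_t nx) {
    std::for_each(data, data + nx,
                  [=](double &item) {
                      const std::size_t i0 = static_cast<std::size_t>(&item - data);
                      data[i0] += 1;
                  });
}
```

For a single array, the detour via `i0` is not needed, which is why `transform` reads more naturally here. The index becomes valuable once the lambda has to address further arrays at the same position.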

The complete example code is available in [increase-std-par.cpp](../src/increase/increase-std-par.cpp).
Build and execute it using the following cells.

In [None]:
!nvc++ -O3 -std=c++17 -stdpar=gpu -target=gpu -gpu=cc86 -o ../build/increase/increase-std-par ../src/increase/increase-std-par.cpp

In [None]:
!../build/increase/increase-std-par

## Level 2: Switching to Explicit Memory Management

Explicit memory management is not possible with this approach.
One alternative is to use functionality from [thrust](thrust.ipynb).