# OpenMP

## Level 0: Hello World

We start with a serial CPU program that prints a range of numbers:

```cpp
for (size_t i0 = 0; i0 < nx; ++i0) {
    printf("%ld\n", i0);
}
```

The full example is available in [print-numbers-base.cpp](../src/print-numbers/print-numbers-base.cpp), and can be compiled and executed using the following cells.

!g++ -O3 -march=native -std=c++17 -o ../build/print-numbers/print-numbers-base ../src/print-numbers/print-numbers-base.cpp

!../build/print-numbers/print-numbers-base

**1.** OpenMP enables code execution on GPUs through *target regions*: the structured block following `#pragma omp target` is offloaded to the device.

```cpp
#pragma omp target
for (size_t i0 = 0; i0 < nx; ++i0) {
    printf("%ld\n", i0);
}
```

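**2.** This code executes the loop on the GPU, but *serially*, i.e., with a single thread. To introduce parallelism, add `teams`, which creates a league of thread teams, and `parallel`, which spawns multiple threads within each team.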
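To observe the serial execution, a minimal sketch can query the OpenMP runtime from inside a bare target region. The helper `query_bare_target` and its variable names are hypothetical. `omp_get_num_teams` and `omp_get_num_threads` are standard OpenMP runtime routines, and both should return 1 here. The calls are guarded by `_OPENMP`, so the code also compiles and runs serially on the host without OpenMP support.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: reports how many teams and threads execute a bare
// target region (no teams/parallel constructs). Both are expected to be 1.
void query_bare_target(int& num_teams, int& num_threads) {
    int n_teams = 1;
    int n_threads = 1;
    // Scalars are firstprivate by default inside target regions, so they are
    // mapped back explicitly to make the device-side values visible on the host.
    #pragma omp target map(from : n_teams, n_threads)
    {
#ifdef _OPENMP
        n_teams   = omp_get_num_teams();    // 1: no teams construct
        n_threads = omp_get_num_threads();  // 1: no parallel construct
#endif
    }
    num_teams = n_teams;
    num_threads = n_threads;
}
```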

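**3.** Loop iterations are mapped to the spawned threads using `distribute`, which splits the iteration space across teams, and `for`, which further splits each team's share across its threads.
If not specified otherwise, the implementation chooses the number of teams and threads per team.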

```cpp
#pragma omp target teams distribute parallel for
for (size_t i0 = 0; i0 < nx; ++i0) {
    printf("%ld\n", i0);
}
```
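You can override these defaults with the standard `num_teams` and `thread_limit` clauses. Both are upper bounds, so the implementation may use fewer teams or threads. Because the concurrent `printf` calls no longer produce ordered output, the hypothetical `fill_indices` helper below records each iteration in an array instead. This makes it easy to check that every iteration runs exactly once. The values 8 and 128 are arbitrary. The `map(from : ...)` clause, which copies the results back to the host, belongs to the explicit memory management covered in Level 2.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: out[i0] = i0, computed on the device.
// num_teams(8) and thread_limit(128) are arbitrary upper bounds; the runtime
// may create fewer teams or threads.
void fill_indices(std::size_t* out, std::size_t nx) {
    assert(out != nullptr);
    // distribute: split iterations across teams; parallel for: across the
    // threads of each team. map(from : ...) allocates out[0:nx] on the device
    // and copies it back to the host when the region ends.
    #pragma omp target teams distribute parallel for num_teams(8) thread_limit(128) \
        map(from : out[0:nx])
    for (std::size_t i0 = 0; i0 < nx; ++i0) {
        out[i0] = i0;
    }
}
```

Because the target region only returns once the loop has completed, the host can read `out` directly after the call.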

**4.** Synchronization occurs implicitly at the end of the target region: the host thread waits until the device has finished the loop before it continues.

The complete example code can be found in [print-numbers-omp-target.cpp](../src/print-numbers/print-numbers-omp-target.cpp).
Build and execute it using the following cells.

!nvc++ -O3 -std=c++17 -mp=gpu -target=gpu -o ../build/print-numbers/print-numbers-omp-target ../src/print-numbers/print-numbers-omp-target.cpp

!../build/print-numbers/print-numbers-omp-target

## Level 1: Adding Managed Memory

Our next application increments all elements of an array by one.

[increase-base.cpp](../src/increase/increase-base.cpp) shows a serial CPU-only implementation.
Its key part, and our entry point, is the `increase` function.

```cpp
void increase(double* data, size_t nx) {
    for (size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
}
```

Offloading and parallelization work as before:

```cpp
#pragma omp target teams distribute parallel for
for (size_t i0 = 0; i0 < nx; ++i0) {
    data[i0] += 1;
}
```

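With managed memory, no further code changes are necessary, since heap allocations are then placed in CUDA managed memory, which both the host and the device can access.
The only remaining step is to enable this support by adding the `-gpu=mem:managed` compiler option.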

The complete example code can be found in [increase-omp-target-mm.cpp](../src/increase/increase-omp-target-mm.cpp), and can be built and executed with the following cells.

!nvc++ -O3 -std=c++17 -mp=gpu -target=gpu -gpu=mem:managed -o ../build/increase/increase-omp-target-mm ../src/increase/increase-omp-target-mm.cpp

!../build/increase/increase-omp-target-mm

## Level 2: Switching to Explicit Memory Management

Switching from managed memory to explicit memory management requires the following changes and additions:

OpenMP handles data staging, including (de-)allocation and transfers, via structured *target data* regions.
Note that the data pointer has been renamed to `field` to avoid confusion.

```cpp
auto field = new double[nx];   // host allocation

#pragma omp target data map(tofrom : field[0:nx])
{ // device allocation and H2D transfer

    /* ... */

} // D2H transfer and device de-allocation

delete[] field;                // host de-allocation
```
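The following sketch shows one way work inside such a region could look. The hypothetical driver `increase_n_times` calls a device-side `increase` several times while the data stays resident on the device, so the H2D and D2H transfers happen only once. Inside `increase`, the `map` clause finds `field[0:nx]` already present. It therefore only adjusts the reference count and transfers nothing. The function names and the `n_iter` parameter are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Device-side increment. When called inside the target data region below,
// field[0:nx] is already present on the device: the map clause then only
// bumps the reference count and triggers no transfer.
void increase(double* field, std::size_t nx) {
    #pragma omp target teams distribute parallel for map(tofrom : field[0:nx])
    for (std::size_t i0 = 0; i0 < nx; ++i0) {
        field[i0] += 1;
    }
}

// Hypothetical driver: one H2D transfer on entry, n_iter kernel launches,
// one D2H transfer on exit.
void increase_n_times(double* field, std::size_t nx, int n_iter) {
    assert(field != nullptr);
    #pragma omp target data map(tofrom : field[0:nx])
    {
        for (int it = 0; it < n_iter; ++it) {
            increase(field, nx);
        }
    }
}
```

Without the enclosing data region, each call to `increase` would copy `field` to the device and back again.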

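For more flexibility, OpenMP also provides the unstructured `target enter data` and `target exit data` directives, which do not have to appear in the same scope or even in the same function.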

```cpp
#pragma omp target enter data map(to   : field[0:nx])

/* ... */

#pragma omp target exit  data map(from : field[0:nx])
```
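Because the enter and exit directives are independent statements, the device copy can outlive the function that created it. The sketch below splits this lifetime across the hypothetical helpers `map_to_device` and `unmap_from_device`. In between, any number of target regions can operate on the resident data. As in the structured case, the `map` clause inside `increase` finds the data present and transfers nothing.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: device allocation + H2D transfer. The mapping
// persists after the function returns.
void map_to_device(double* field, std::size_t nx) {
    assert(field != nullptr);
    #pragma omp target enter data map(to : field[0:nx])
}

// Hypothetical helper: D2H transfer + device de-allocation.
void unmap_from_device(double* field, std::size_t nx) {
    #pragma omp target exit data map(from : field[0:nx])
}

// Device-side increment; field[0:nx] is expected to be present already,
// so this map clause only adjusts the reference count.
void increase(double* field, std::size_t nx) {
    #pragma omp target teams distribute parallel for map(tofrom : field[0:nx])
    for (std::size_t i0 = 0; i0 < nx; ++i0) {
        field[i0] += 1;
    }
}
```

A typical sequence is `map_to_device`, several calls to `increase`, `unmap_from_device`, and finally `delete[]` on the host array. The host array only holds up-to-date values again after `unmap_from_device`.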

The complete example code can be found in [increase-omp-target-expl.cpp](../src/increase/increase-omp-target-expl.cpp).
Build and execute it using the following cells.

!nvc++ -O3 -std=c++17 -mp=gpu -target=gpu -o ../build/increase/increase-omp-target-expl ../src/increase/increase-omp-target-expl.cpp

!../build/increase/increase-omp-target-expl