# Kokkos

## Level 0: Hello World

We start with serial CPU code that prints a range of numbers:

```cpp
for (size_t i0 = 0; i0 < nx; ++i0) {
    printf("%zu\n", i0);  // %zu is the portable format specifier for size_t
}
```

The full example is available in [print-numbers-base.cpp](../src/print-numbers/print-numbers-base.cpp), and can be compiled and executed using the following cells.

In [None]:
!g++ -O3 -march=native -std=c++17 -o ../build/print-numbers/print-numbers-base ../src/print-numbers/print-numbers-base.cpp

In [None]:
!../build/print-numbers/print-numbers-base

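Kokkos provides its own abstraction for parallel loops, `Kokkos::parallel_for`.
Depending on how Kokkos is compiled, the loop is dispatched to the default execution space, which is either a CPU backend (such as Serial or OpenMP) or a GPU backend (such as CUDA).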

```cpp
Kokkos::parallel_for(
    Kokkos::RangePolicy<>(0, nx),
        KOKKOS_LAMBDA(const size_t i0) {
            printf("%lu\n", static_cast<unsigned long>(i0));
        });
```

The thread hierarchy can be tuned by replacing the range policy with a *team policy* (`Kokkos::TeamPolicy`), which groups threads into teams.

A parallel loop may return control to the host before the kernel has finished, and synchronization with the GPU is *not* implicit.
It can be triggered by calling:

```cpp
Kokkos::fence();
```

The complete example code is available in [print-numbers-kokkos.cpp](../src/print-numbers/print-numbers-kokkos.cpp).
Build and execute it using the following cells.

In [None]:
!g++ -O3 -march=native -std=c++20 -I/root/kokkos/install-serial/include -L/root/kokkos/install-serial/lib -o ../build/print-numbers/print-numbers-kokkos-serial ../src/print-numbers/print-numbers-kokkos.cpp -lkokkoscore -ldl

In [None]:
!../build/print-numbers/print-numbers-kokkos-serial

In [None]:
!/root/kokkos/install-cuda/bin/nvcc_wrapper -O3 -march=native -std=c++20 -arch=sm_86 --expt-extended-lambda --expt-relaxed-constexpr -I/root/kokkos/install-cuda/include -L/root/kokkos/install-cuda/lib -o ../build/print-numbers/print-numbers-kokkos-cuda ../src/print-numbers/print-numbers-kokkos.cpp -lkokkoscore -ldl -lcuda

In [None]:
!../build/print-numbers/print-numbers-kokkos-cuda

## Level 1: Handling GPU Memory

Our next application increases every element of an array by one.

[increase-base.cpp](../src/increase/increase-base.cpp) shows a serial CPU-only implementation.
Its key part, and our entry point, is the `increase` function.

```cpp
void increase(double* data, size_t nx) {
    for (size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
}
```
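
As a quick sanity check, the function can be exercised on a `std::vector`. This is a sketch with a hypothetical helper `increase_works`; the linked file may set up and verify its data differently.

```cpp
#include <cstddef>
#include <vector>

// Serial reference implementation, as shown above.
void increase(double* data, std::size_t nx) {
    for (std::size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
}

// Hypothetical check: starts from an array of zeros, calls increase once,
// and returns true if every element equals 1 afterwards.
bool increase_works(std::size_t nx) {
    std::vector<double> data(nx, 0.0);
    increase(data.data(), nx);
    for (const double value : data) {
        if (value != 1.0) {
            return false;
        }
    }
    return true;
}
```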

Kokkos manages data in `View`s, which are allocated in memory spaces.
By default, a `View` is allocated in the memory space of the default execution space, which is device memory if GPU support was enabled when Kokkos was compiled.
Matching host allocations can then be created as mirror views.

```cpp
{
    // device allocation
    Kokkos::View<double *> data("data", nx);

    // host allocation
    auto h_data = Kokkos::create_mirror_view(data);
} // implicit de-allocation
```

`View` instances can then be used in kernels.
Note the `()` access instead of the previously used `[]` access.

```cpp
Kokkos::parallel_for(
    Kokkos::RangePolicy<>(0, nx),
        KOKKOS_LAMBDA(const size_t i0) {
            data(i0) += 1;
        });
```

The complete example code is available in [increase-kokkos.cpp](../src/increase/increase-kokkos.cpp).
Build and execute it using the following cells.

In [None]:
!g++ -O3 -march=native -std=c++20 -I/root/kokkos/install-serial/include -L/root/kokkos/install-serial/lib -o ../build/increase/increase-kokkos-serial ../src/increase/increase-kokkos.cpp -lkokkoscore -ldl

In [None]:
!../build/increase/increase-kokkos-serial

In [None]:
!g++ -O3 -march=native -std=c++20 -fopenmp -I/root/kokkos/install-omp/include -L/root/kokkos/install-omp/lib -o ../build/increase/increase-kokkos-omp-host ../src/increase/increase-kokkos.cpp -lkokkoscore -ldl

In [None]:
!../build/increase/increase-kokkos-omp-host

In [None]:
!/root/kokkos/install-cuda/bin/nvcc_wrapper -O3 -march=native -std=c++20 -arch=sm_86 --expt-extended-lambda --expt-relaxed-constexpr -I/root/kokkos/install-cuda/include -L/root/kokkos/install-cuda/lib -o ../build/increase/increase-kokkos-cuda ../src/increase/increase-kokkos.cpp -lkokkoscore -ldl -lcuda

In [None]:
!../build/increase/increase-kokkos-cuda