Accelerating portable HPC Applications with Standard C++
===

# Lab 2: 2D Unsteady Heat Equation

In this lab you'll parallelize a pre-existing MPI heat-equation mini-application with the C++ parallel algorithms, such that it runs on multi-node GPU and multi-node CPU HPC systems.

## Getting started

We'll visualize the solution with the `visualize()` function from [vis.py], and compile our code with the `mpicxx` compiler wrapper of the MPI implementation, using the following flags for all compilers:

[vis.py]: ./vis.py

In [None]:
%run vis.py
mpicxx="mpicxx -std=c++23 -Ofast -march=native -o heat"

We'll be selecting which compiler to use with the `OMPI_CXX` environment variable (`OMPI_CXX=g++` picks the GNU g++ compiler).

A working implementation parallelized using only MPI is provided in [starting_point.cpp].
Let's compile it, run it with 2 MPI ranks, and visualize the results.
The command-line interface of the `heat` mini-application binary is `./heat NX NY NITER` (it takes three extra arguments):
  * `NX` and `NY`: number of unknowns per MPI rank in the `x` and `y` dimensions,
  * `NITER`: number of time steps to simulate.

The mini-application performs weak scaling: `NX` and `NY` are per-rank sizes and are kept constant, so the total problem size grows with the number of ranks.
It outputs the error every few time steps, which can be used to verify the correctness of the implementation, as well as the per-rank and total throughput achieved in GB/s.

**PLEASE** be mindful to not use too many MPI ranks when running this notebook on a shared HPC system during a tutorial event. Please do run these examples with more MPI ranks on your own systems :)

[starting_point.cpp]: ./starting_point.cpp

In [None]:
!echo "[g++]:    " && rm -f output heat && OMPI_CXX=g++     {mpicxx} starting_point.cpp && mpirun -np 2 ./heat 256 256 16000
!echo "[clang++]:" && rm -f output heat && OMPI_CXX=clang++ {mpicxx} starting_point.cpp && mpirun -np 2 ./heat 256 256 16000
!echo "[nvc++]:  " && rm -f output heat && OMPI_CXX=nvc++   {mpicxx} starting_point.cpp && mpirun -np 2 ./heat 256 256 16000
visualize()

## Exercise 1: Parallelize with C++ parallel algorithms

The goal of this exercise is to parallelize the `apply_stencil` and `initial_condition` implementations using the C++ parallel algorithms.

### Parallelizing the Stencil application

The serial implementation of `apply_stencil` uses two raw loops for multi-dimensional iteration over all elements of the sub-grid `[g.x_begin, g.x_end)x[g.y_begin, g.y_end)`. It applies the `stencil` operation to each element of the grid, which returns the thermal energy of that element, and adds all the energies together:

```c++
double apply_stencil(double* u_new, double* u_old, grid g, parameters p) {
  // Reduction over the energy:
  double energy = 0.;
  // Multi-dimensional iteration: [g.x_begin, g.x_end)x[g.y_begin, g.y_end)
  for (long x = g.x_begin; x < g.x_end; ++x) {
    for (long y = g.y_begin; y < g.y_end; ++y) {
      // Apply stencil for element (x, y) and accumulate its energy:
      energy += stencil(u_new, u_old, x, y, p);
    }
  }
  // Return energy of the grid:
  return energy;
}
```

The only functions that need to be modified to achieve this are `apply_stencil` and `initial_condition` (see later).

To parallelize it with the C++ parallel algorithms, we need to solve the `TODO`s to:
* Construct a multi-dimensional range for `[g.x_begin, g.x_end)x[g.y_begin, g.y_end)` from two [std::views::iota] ranges, one for x and one for y, using [std::views::cartesian_product], and
* Use the [std::transform_reduce] algorithm to apply the stencil in parallel to each element and sum the energies:
  - Using the [std::execution::par] execution policy,
  - Iterating over the [std::views::cartesian_product] range,
  - Initializing the result to `0.`,
  - Using [std::plus] to sum all the energies,
  - Using a lambda that applies the stencil to one element and returns its energy.

as follows:

```c++
double apply_stencil(double* u_new, double* u_old, grid g, parameters p) {
  // TODO: Create one iota range per dimension for [g.x_begin,g.x_end) and [g.y_begin,g.y_end).
  auto xs = std::views::iota(g.x_begin, g.x_end);
  auto ys = std::views::iota(g.y_begin, g.y_end);
  // TODO: Construct a cartesian_product range from the two iota ranges: [g.x_begin,g.x_end)x[g.y_begin,g.y_end).
  auto ids = std::views::cartesian_product(xs, ys);
  // TODO: Use the std::transform_reduce algorithm to apply the stencil in parallel to each element and sum the energies:
  return std::transform_reduce(
    // TODO: Use the std::execution::par parallel execution policy
    std::execution::par, 
    // TODO: iterate over the cartesian_product range
    ids.begin(), ids.end(), 
    // TODO: initialize the energy to zero
    0., 
    // TODO: use std::plus to sum the energies
    std::plus{}, 
    // TODO: Use a lambda that applies the stencil to one element and returns its energy:
    [u_new, u_old, p](auto idx) {
      // TODO [within lambda]: Extract the 1D indices from the tuple of indices:
      auto [x, y] = idx;
      // TODO [within lambda]: Apply the stencil and return the energy.
      return stencil(u_new, u_old, x, y, p);
  });
}
```
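
To experiment with the transform-reduce pattern outside the lab, the following is a minimal serial sketch, not the lab's solution. It assumes a hypothetical `toy_energy` function in place of `stencil`, and it swaps `std::views::cartesian_product` for a flat index that is split into `(x, y)` by division, so that it builds with C++17 and without an execution policy. In the lab code, `std::execution::par` would be passed as the first argument.

```c++
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the lab's `stencil`: the "energy" of cell (x, y)
// is simply the value stored in a row-major array with `ny` values per row.
double toy_energy(const std::vector<double>& u, long ny, long x, long y) {
  return u[static_cast<std::size_t>(x * ny + y)];
}

// Sums the energy of all cells in [0, nx) x [0, ny) with std::transform_reduce.
double sum_energy(const std::vector<double>& u, long nx, long ny) {
  assert(static_cast<long>(u.size()) == nx * ny);
  // Flat indices 0, 1, ..., nx*ny - 1 replace the cartesian_product range.
  std::vector<long> ids(static_cast<std::size_t>(nx * ny));
  std::iota(ids.begin(), ids.end(), 0L);
  return std::transform_reduce(
      ids.begin(), ids.end(),
      0.,             // initial value of the reduction
      std::plus<>{},  // reduction: sum of the energies
      [&u, ny](long i) {
        // Recover the 2D index from the flat index:
        long x = i / ny;
        long y = i % ny;
        return toy_energy(u, ny, x, y);
      });
}
```

Calling `sum_energy(u, nx, ny)` on a vector of `nx * ny` values returns their sum; the transform step is where the lab applies the stencil instead.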

### Parallelize the initialization

To keep all memory on the device when offloading the C++ parallel algorithms, also parallelize the `initial_condition` function from its raw loop version using the [std::fill_n] parallel algorithm:

```c++
void initial_condition(grid_t u_new, grid_t u_old) {
  // TODO: parallelize using the std::fill_n parallel algorithm
  // BEFORE:
  // for (long i = 0; i < u_new.size(); ++i) {
  //   u_old.data_handle()[i] = 0.;
  //   u_new.data_handle()[i] = 0.;
  // }
}
```
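
As a hedged illustration of the `std::fill_n` call shape, the sketch below zeroes two buffers held in `std::vector`s. The helper name `zero_fields` is hypothetical, and plain vectors stand in for the lab's `grid_t` type. The execution policy is omitted so the example builds without a parallel backend; the parallel overload takes a policy such as `std::execution::par` as an additional first argument.

```c++
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: sets every element of both buffers to zero.
void zero_fields(std::vector<double>& u_new, std::vector<double>& u_old) {
  assert(u_new.size() == u_old.size());
  // std::fill_n(first, count, value) writes `value` to `count` elements.
  std::fill_n(u_new.data(), u_new.size(), 0.);
  std::fill_n(u_old.data(), u_old.size(), 0.);
}
```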

### Compilation and run commands

A template for working on the solution of this exercise is provided in [exercise1.cpp].
The following cell compiles and runs this template, but as provided it produces incorrect results due to the incomplete `apply_stencil` and `initial_condition` implementations.
Fix the `TODO`s in the file until it compiles and runs correctly:

[exercise1.cpp]: ./exercise1.cpp
[std::views::iota]: https://en.cppreference.com/w/cpp/ranges/iota_view
[std::views::cartesian_product]: https://en.cppreference.com/w/cpp/ranges/cartesian_product_view
[std::transform_reduce]: https://en.cppreference.com/w/cpp/algorithm/transform_reduce
[std::fill_n]: https://en.cppreference.com/w/cpp/algorithm/fill_n
[std::plus]: https://en.cppreference.com/w/cpp/utility/functional/plus
[std::execution::par]: https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag

In [None]:
!echo "[g++]:      " && rm -f output heat && OMPI_CXX=g++     {mpicxx} exercise1.cpp -ltbb             && mpirun -np 2 ./heat 256 256 16000
!echo "[clang++]:  " && rm -f output heat && OMPI_CXX=clang++ {mpicxx} exercise1.cpp -ltbb             && mpirun -np 2 ./heat 256 256 16000
!echo "[nvc++ CPU]:" && rm -f output heat && OMPI_CXX=nvc++   {mpicxx} exercise1.cpp -stdpar=multicore && mpirun -np 2 ./heat 256 256 16000
!echo "[nvc++ GPU]:" && rm -f output heat && OMPI_CXX=nvc++   {mpicxx} exercise1.cpp -stdpar=gpu       && mpirun -np 2 ./heat 256 256 16000
visualize()

### Solutions Exercise 1

The solution for this exercise is in [solutions/exercise1.cpp].
The following cell compiles and runs the solutions for Exercise 1 using different compilers:

[solutions/exercise1.cpp]: ./solutions/exercise1.cpp

In [None]:
!echo "[g++]:      " && rm -f output heat && OMPI_CXX=g++     {mpicxx} solutions/exercise1.cpp -ltbb             && mpirun -np 2 ./heat 256 256 16000
!echo "[clang++]:  " && rm -f output heat && OMPI_CXX=clang++ {mpicxx} solutions/exercise1.cpp -ltbb             && mpirun -np 2 ./heat 256 256 16000
!echo "[nvc++ CPU]:" && rm -f output heat && OMPI_CXX=nvc++   {mpicxx} solutions/exercise1.cpp -stdpar=multicore && mpirun -np 2 ./heat 256 256 16000
!echo "[nvc++ GPU]:" && rm -f output heat && OMPI_CXX=nvc++   {mpicxx} solutions/exercise1.cpp -stdpar=gpu       && mpirun -np 2 ./heat 256 256 16000
visualize()

We can run the GPU version with slightly larger inputs:

In [None]:
!echo "[nvc++ GPU]:" && rm -f output heat && OMPI_CXX=nvc++ {mpicxx} solutions/exercise1.cpp -stdpar=gpu && mpirun -np 2 ./heat 1024 65536 2000

## Exercise 2: Overlapping Communication and Computation

The goal of this exercise is to overlap communication with computation using `std::thread`, `std::atomic`, and `std::barrier`.

A template for the solution is provided in [exercise2.cpp].

[exercise2.cpp]: ./exercise2.cpp

First, notice that the computation involves a data exchange with neighbors and is split into three steps:

* `internal`: processes internal rows that do not depend on data from neighbors
* `prev_boundary`: exchanges data with neighbor at `rank - 1` and processes the rows that depend on the elements received
* `next_boundary`: exchanges data with neighbor at `rank + 1` and processes the rows that depend on the elements received


```c++
double internal(double* u_new, double* u_old, parameters p) {
    grid g { .x_begin = 2, .x_end = p.nx, .y_begin = 1, .y_end = p.ny - 1 };
    return apply_stencil(u_new, u_old, g, p);
}

double prev_boundary(double* u_new, double* u_old, parameters p) {
    // Send window cells, receive halo cells
    if (p.rank > 0) {
      // Send bottom boundary to bottom rank
      MPI_Send(u_old + p.ny, p.ny, MPI_DOUBLE, p.rank - 1, 0, MPI_COMM_WORLD);
      // Receive top boundary from bottom rank
      MPI_Recv(u_old + 0, p.ny,  MPI_DOUBLE, p.rank - 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    grid g { .x_begin = p.nx, .x_end = p.nx + 1, .y_begin = 1, .y_end = p.ny - 1 };
    return apply_stencil(u_new, u_old, g, p);
}

double next_boundary(double* u_new, double* u_old, parameters p) {
    if (p.rank < p.nranks - 1) {
        // Receive bottom boundary from top rank
        MPI_Recv(u_old + (p.nx + 1) * p.ny, p.ny, MPI_DOUBLE, p.rank + 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // Send top boundary to top rank
        MPI_Send(u_old + p.nx * p.ny, p.ny, MPI_DOUBLE, p.rank + 1, 1, MPI_COMM_WORLD);
    }
    grid g { .x_begin = 1, .x_end = 2, .y_begin = 1, .y_end = p.ny - 1 };
    return apply_stencil(u_new, u_old, g, p);
}
```
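
The MPI calls above imply a local layout of `nx + 2` rows of `ny` values, where rows `0` and `nx + 1` are halo rows filled by the neighbors. As a small sanity check, the sketch below counts how often each row is updated by the three sub-grids. The `row_range` type and the `count_row_updates` function are hypothetical names introduced for illustration, and `nx >= 2` is assumed.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical half-open row range [x_begin, x_end).
struct row_range { long x_begin, x_end; };

// Returns, for each of the nx + 2 local rows, how many of the three
// sub-grids (internal and the two boundaries) update that row.
std::vector<int> count_row_updates(long nx) {
  assert(nx >= 2);
  const row_range parts[] = {
      {2, nx},       // internal rows
      {1, 2},        // first owned row
      {nx, nx + 1},  // last owned row
  };
  std::vector<int> hits(static_cast<std::size_t>(nx + 2), 0);
  for (row_range r : parts) {
    for (long x = r.x_begin; x < r.x_end; ++x) {
      ++hits[static_cast<std::size_t>(x)];
    }
  }
  return hits;
}
```

For `nx = 4` the result is `{0, 1, 1, 1, 1, 0}`: every owned row is updated exactly once, and the two halo rows are only read.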

In the previous exercise, these steps are performed sequentially:

```c++
for (long it = 0; it < p.nit(); ++it) {
    double energy = 0.;
    // Exchange and compute domain boundaries:
    energy += prev_boundary(u_new.data(), u_old.data(), p);
    energy += next_boundary(u_new.data(), u_old.data(), p);
    energy += internal(u_new.data(), u_old.data(), p);
    // ...
}
```

In this exercise, we need to modify the application to perform these three steps concurrently and in parallel.

This will require:

* using one `std::thread` per computation in such a way that we do not launch one thread on every iteration
* using `std::atomic<double>` for the `energy`, to enable the separate threads to modify the energy concurrently
* using a `std::barrier` to synchronize the different threads

Furthermore, one of the threads will need to perform the following in a critical section:
  * `MPI_Reduce` of the `energy`: this operation requires all threads to have updated the `energy` for the current iteration, so it must happen after these updates have completed
  * reset the `energy` to `0.` before the next iteration: all threads must wait for this operation to complete before starting the next iteration
  
The template [exercise2.cpp] provides `TODO`s to guide you through this process: 

```c++
  // TODO: use an atomic variable for the energy
  double energy = 0.;
    
  // TODO: use a barrier for synchronization
  // ...bar = ...

  // TODO: use threads for the different computations
  auto thread_prev = std::thread([/*TODO: complete capture */]() {
      for (long it = 0; it < p.nit(); ++it) {
          // TODO: perform the prev exchange and computation
          // TODO: update the atomic energy
          // TODO: synchronize with the barrier
      }
  });
    
  auto thread_next = /* TODO: similar to prev */;
      
  auto thread_internal = /*
    TODO: same as for next and prev
    TODO: need to perform the reduction in one of the threads (for example this one)
    TODO: need to reset the atomic in one of the threads (for example this one)
  */;

  // TODO: join all threads

```

[exercise2.cpp]: ./exercise2.cpp

### Compilation and run commands


The following commands compile but produce incorrect results.
Your goal is to fix that by following the instructions above.

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -march=native -o heat exercise2.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=clang++ mpicxx -std=c++20 -Ofast -march=native -o heat exercise2.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -stdpar=gpu -o heat exercise2.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

### Solution Exercise 2

The solution for this exercise is available in `solutions/exercise2.cpp`.

The following cells compile and run the solution for Exercise 2 using different compilers.

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -o heat solutions/exercise2.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=clang++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -o heat solutions/exercise2.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore -o heat solutions/exercise2.cpp
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o heat solutions/exercise2.cpp
!UCX_RNDV_FRAG_MEM_TYPE=cuda mpirun -np 2 ./heat 256 256 16000
visualize()

## Exercise 3: Senders & Receivers

The goal of this exercise is to simplify the implementation of Exercise 2 - Overlap Communication and Computation - by using Senders & Receivers with a `static_thread_pool` to manage the host threads, while combining this with the C++ parallel algorithms.

The implementation of Exercise 2 is quite complex. It requires:

```c++
// A shared atomic variable to accumulate the energy:
std::atomic<double> energy = 0.;

// A shared barrier for synchronizing threads:
std::barrier bar(3);

// User must manually create and start threads:
std::thread thread_inner(..[&] {
      energy += computation(...);
      bar.arrive_and_wait();
      // User must manually create a critical section for MPI rank reduction: 
      MPI_Reduce(...);
      // User must manually reset the shared state on each iteration:
      energy = 0;
      bar.arrive_and_wait();
  });

std::thread thread_prev(...);
std::thread thread_next(...);

// User must manually join all threads before doing File I/O
thread_prev.join();
thread_next.join();
thread_inner.join();

// File I/O
```

In this exercise, we'll use Senders & Receivers instead to create a graph representing the computation:

```c++
stde::sender auto iteration_step(stde::scheduler auto sch, parameters p, long it,
                                 std::vector<double>& u_new, std::vector<double>& u_old) {
    // TODO: use Senders & Receivers to create a graph representing the computation of a single iteration   
}
```

and will then dispatch it to an execution context:

```c++
stde::static_thread_pool ctx{3}; // Thread Pool with 3 threads
stde::scheduler auto sch = ctx.get_scheduler();

for (long it = 0; it < p.nit(); ++it) {
    stde::sync_wait(iteration_step(sch, p, it, u_new, u_old));
}
```

### Compilation and run commands

[exercise3.cpp]: ./exercise3.cpp

The template [exercise3.cpp] compiles and runs as provided, but produces incorrect results due to the incomplete `iteration_step` implementation.

After completing it, the following cells should compile and run correctly:

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -march=native -o heat exercise3.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=clang++ mpicxx -std=c++20 -Ofast -march=native -o heat exercise3.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -stdpar=multicore -o heat exercise3.cpp
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -stdpar=gpu -o heat exercise3.cpp
!UCX_RNDV_FRAG_MEM_TYPE=cuda mpirun -np 2 ./heat 256 256 16000
visualize()

### Solutions Exercise 3

The solution for this exercise is available in [`solutions/exercise3.cpp`].

[`solutions/exercise3.cpp`]: ./solutions/exercise3.cpp

The following cells compile and run the solution for Exercise 3 using different compilers.
By default, the [`static_thread_pool`] scheduler is used.

[`static_thread_pool`]: https://github.com/NVIDIA/stdexec/blob/main/include/exec/static_thread_pool.hpp

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -o heat solutions/exercise3.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=clang++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -o heat solutions/exercise3.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore -o heat solutions/exercise3.cpp
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o heat solutions/exercise3.cpp
!UCX_RNDV_FRAG_MEM_TYPE=cuda mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o heat solutions/exercise1.cpp

In [None]:
!UCX_RNDV_FRAG_MEM_TYPE=cuda mpirun -np 1 ./heat 8192 8192 4000