Accelerating portable HPC Applications with Standard C++
===

# Lab 3: Parallel Tree Construction

In this tutorial we will learn how to implement starvation-free concurrent algorithms by looking at parallel tree construction (see slides).
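As a rough illustration of the kind of step concurrent tree construction relies on, the sketch below lets several threads create a child node without taking a lock. Each call performs one load, at most one `fetch_add`, and at most one compare-exchange, so no caller ever waits for another. The `Node` and `Arena` types and the `get_or_create_child` function are hypothetical names introduced for this example. They are not taken from the lab sources, whose tree layout may differ.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical node layout, for illustration only.
// Children are stored as indices into the arena; 0 means "no child yet".
struct Node {
    std::atomic<std::uint32_t> left{0};
    std::atomic<std::uint32_t> right{0};
};

// Hypothetical fixed-capacity arena with an atomic bump allocator.
struct Arena {
    std::vector<Node> nodes;
    std::atomic<std::uint32_t> next{1};  // slot 0 is reserved for the root

    explicit Arena(std::size_t capacity) : nodes(capacity) {}
};

// Returns the index of the child stored in `slot`, creating it if absent.
// The function contains no retry loop: it either finds a child, installs
// a fresh one, or adopts the one another thread installed first.
std::uint32_t get_or_create_child(Arena& arena,
                                  std::atomic<std::uint32_t>& slot) {
    std::uint32_t child = slot.load(std::memory_order_acquire);
    if (child != 0) return child;  // already present

    // Reserve a fresh node index from the arena.
    const std::uint32_t fresh =
        arena.next.fetch_add(1, std::memory_order_relaxed);
    assert(fresh < arena.nodes.size());

    // Try to publish the fresh node; exactly one thread can succeed.
    std::uint32_t expected = 0;
    if (slot.compare_exchange_strong(expected, fresh,
                                     std::memory_order_acq_rel,
                                     std::memory_order_acquire)) {
        return fresh;
    }
    // Another thread won the race: `expected` now holds its index,
    // and the node reserved above simply stays unused.
    return expected;
}
```

Calling `get_or_create_child(arena, arena.nodes[0].left)` twice returns the same index both times. The price of avoiding a lock is that a losing thread wastes one arena slot.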

A working implementation is provided in [starting_point.cpp].
Please take 5 minutes to skim through it.

Before starting we need to obtain a collection of books to run the example with:

[starting_point.cpp]: ./starting_point.cpp

In [None]:
!./books.sh

---

Now let's compile and run the starting point:

In [None]:
!g++ -std=c++20 -Ofast -march=native -o tree starting_point.cpp -ltbb
!./tree

The program should report an input size of 11451683 chars and a tree of 99743 nodes assembled from the sample books.

This implementation reads all books into a single string of characters and then processes that string as a single domain.

## Exercise 1: process the input in parallel

The goal of this exercise is to process the input in parallel using multiple domains.
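One way to picture a domain is as a contiguous slice of the input string. The sketch below splits an input into a given number of nearly equal slices. It is only a hint under that assumption, not the reference solution; `split_into_domains` is a hypothetical helper that does not appear in the lab sources.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: split `input` into `num_domains` contiguous,
// non-overlapping views that together cover the whole input.
std::vector<std::string_view> split_into_domains(std::string_view input,
                                                 std::size_t num_domains) {
    assert(num_domains > 0);
    std::vector<std::string_view> domains;
    domains.reserve(num_domains);

    const std::size_t base = input.size() / num_domains;   // minimum slice length
    const std::size_t extra = input.size() % num_domains;  // leftover characters

    std::size_t begin = 0;
    for (std::size_t d = 0; d < num_domains; ++d) {
        // The first `extra` slices each take one leftover character.
        const std::size_t len = base + (d < extra ? 1 : 0);
        domains.push_back(input.substr(begin, len));
        begin += len;
    }
    return domains;
}
```

For example, `split_into_domains("abcdefghij", 3)` yields `"abcd"`, `"efg"`, and `"hij"`. Each view could then be handed to one iteration of a parallel algorithm such as `std::for_each` with `std::execution::par`. Note that a slice boundary chosen this way can fall in the middle of a token; whether that matters depends on how the tree consumes the characters.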

A template for the solution is provided in [exercise1.cpp]. The `TODO`s indicate the parts of the template that must be completed.

[exercise1.cpp]: ./exercise1.cpp

The example compiles and runs serially as provided.
Once you parallelize it, the following blocks should compile and run correctly:

In [None]:
!g++ -std=c++20 -Ofast -march=native -o tree exercise1.cpp -ltbb
!./tree

In [None]:
!clang++ -std=c++20 -Ofast -march=native -o tree exercise1.cpp -ltbb
!./tree

In [None]:
!nvc++ -std=c++20 -Ofast -march=native -stdpar=multicore -o tree exercise1.cpp
!./tree

### Solutions Exercise 1

The solutions for each exercise are available in the `solutions/` sub-directory.

The following compiles and runs the solutions for Exercise 1 using different compilers.

In [None]:
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o tree solutions/exercise1.cpp -ltbb
!./tree

In [None]:
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o tree solutions/exercise1.cpp -ltbb
!./tree

In [None]:
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore -o tree solutions/exercise1.cpp
!./tree

Currently, not all `std::atomic` operations are supported on GPUs.
The HPC SDK ships the CUDA Toolkit, which includes [libcudacxx](https://github.com/NVIDIA/libcudacxx), the CUDA C++ standard library.
This library provides `cuda::atomic` and related types in the `<cuda/atomic>` header, and these types can be used on GPUs.
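A minimal sketch of how code could pick between the two atomic types is shown below. It assumes that selecting on the availability of the `<cuda/atomic>` header is acceptable; the alias `atomic_t` and the function `count_spaces` are hypothetical names introduced for this example, and the lab's GPU solution may arrange this differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Assumption: use cuda::atomic whenever libcudacxx is on the include path,
// and fall back to std::atomic otherwise.
#if __has_include(<cuda/atomic>)
#include <cuda/atomic>
template <class T>
using atomic_t = cuda::atomic<T, cuda::thread_scope_system>;
#else
#include <atomic>
template <class T>
using atomic_t = std::atomic<T>;
#endif

// Hypothetical example: count the spaces in `text` through the alias.
// The loop is serial here; the point is only that the same member
// functions (fetch_add, load) are available on either atomic type.
std::size_t count_spaces(std::string_view text) {
    atomic_t<std::size_t> counter{0};
    for (char c : text) {
        if (c == ' ') counter.fetch_add(1);
    }
    return counter.load();
}
```

With this arrangement the rest of the code refers only to `atomic_t`, so the same source can be built with or without libcudacxx.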

In [None]:
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o tree solutions/exercise1_gpu.cpp -ltbb
!./tree

In [None]:
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o tree solutions/exercise1_gpu.cpp -ltbb
!./tree

In [None]:
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o tree solutions/exercise1_gpu.cpp
!./tree