Accelerating portable HPC Applications with Standard C++
===

# [Optional] Lab 1: select (aka `copy_if`).

If [Lab 1: DAXPY](../lab1_daxpy/daxpy.ipynb) was quick to complete for you, this optional lab proposes a slightly more advanced example that requires decomposing a problem into multiple algorithm calls. You will use different approaches, sequential and parallel, to re-implement the [std::copy_if] algorithm as an API we'll call `select`, which selects some elements of an input vector `v` according to a user-provided criterion and copies the selected elements consecutively into a new vector `w`.

This problem is easy to solve sequentially but faces an issue in a concurrent run: the index of write operations into `w` depends on operations performed by other threads.
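To make the dependency concrete, here is a minimal sequential sketch. The name `select_loop` and its return-by-value signature are illustrative assumptions, not part of the lab's API. The running counter `out` is the crux: the slot each selected element lands in depends on how many earlier elements were selected, which is exactly the information a thread working on a later chunk of `v` does not have.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only (not the lab's `select` API).
template <class UnaryPredicate>
std::vector<int> select_loop(const std::vector<int>& v, UnaryPredicate pred)
{
    std::vector<int> w(v.size());  // worst case: every element is selected
    std::size_t out = 0;           // next free slot in `w`
    for (int e : v) {
        // The write position `out` depends on all previous iterations,
        // so these iterations cannot simply be run concurrently.
        if (pred(e)) w[out++] = e;
    }
    assert(out <= w.size());
    w.resize(out);                 // drop the unused tail
    return w;
}
```

Calling `select_loop(v, [](int x) { return x % 2 == 0; })` on `{3, 8, 1, 6, 4}` yields `{8, 6, 4}`.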

## Initial condition

For all the exercises, the vector `v` is initialized with pseudo-random numbers that are seeded with a constant value and are therefore identical from one execution to another.
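A reproducible initialization of this kind could look like the sketch below. The function name `make_input`, the `std::mt19937` engine, the seed `42`, and the value range `[0, 99]` are all assumptions chosen for illustration; the exercise files may use a different generator, seed, or range.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: fills a vector with pseudo-random values in [lo, hi].
// A fixed seed makes the engine produce the same sequence on every run.
std::vector<int> make_input(std::size_t n, unsigned seed = 42, int lo = 0, int hi = 99)
{
    assert(lo <= hi);  // required by std::uniform_int_distribution
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(lo, hi);
    std::vector<int> v(n);
    for (auto& e : v) e = dist(gen);
    return v;
}
```

Two calls with the same arguments return identical vectors within a given standard library implementation, which keeps results comparable across runs.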

## Select API

The API of `select` that we'll implement is the following:

```c++
template <class UnaryPredicate>
void select(const std::vector<int>& v, UnaryPredicate pred,
            std::vector<size_t>& index, std::vector<int>& w);
```
- `v` is the input data,
- `pred` is the predicate used to select which data is copied into the output,
- `w` is the output,
- and `index` is extra temporary storage that our implementation is allowed to use.

Both `w` and `index` may be assumed to contain 0 elements.
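As a point of reference, a sequential version with this signature can be sketched with `std::copy_if` and `std::back_inserter`. The name `select_seq` is a hypothetical label used here to keep it apart from the exercise code, and this is only one possible baseline, not necessarily the one shipped with the lab.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical sequential baseline with the same parameters as `select`.
template <class UnaryPredicate>
void select_seq(const std::vector<int>& v, UnaryPredicate pred,
                std::vector<std::size_t>& index, std::vector<int>& w)
{
    // The API lets us assume that both output arguments start empty.
    assert(index.empty() && w.empty());
    // `index` is optional scratch space; this version does not need it.
    // back_inserter appends each selected element, so `w` grows on demand.
    std::copy_if(v.begin(), v.end(), std::back_inserter(w), pred);
}
```

Appending through `std::back_inserter` works sequentially because elements arrive in order; it is not a fit for the parallel overloads, which need the output range to exist up front.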

## Problem size and compiler options

The following cell fixes the problem size and the compiler options:

[std::copy_if]: https://en.cppreference.com/w/cpp/algorithm/copy

In [None]:
N=100000000
flags="-std=c++23 -Ofast -march=native -DNDEBUG -o select"

## Exercise 1: implementation with parallel `std::count_if` and `std::copy_if`

This implementation performs two passes over the input elements of `v`:
1. First pass: count the number of elements to copy using the parallel [std::count_if] algorithm.
2. Resize `w`.
3. Second pass: copy the elements from `v` into `w` according to the predicate using the parallel [std::copy_if] algorithm.

A template for the solution is provided in [exercise1.cpp]. Fix the `TODO`s to complete this exercise:

```c++
template <class UnaryPredicate>
void select(const std::vector<int>& v, UnaryPredicate pred, 
            std::vector<size_t>& index, std::vector<int>& w)
{
    // TODO: parallelize "select" using parallel "count_if" & "copy_if" algorithms:
    // auto count = std::count_if(std::execution::par, v.begin(), v.end(), pred);
    // w.resize(count);
    // std::copy_if(std::execution::par, v.begin(), v.end(), w.begin(), pred);
}
```

The following cell compiles and runs the [exercise1.cpp] template:

[exercise1.cpp]: ./exercise1.cpp
[std::execution::par]: https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag
[std::count_if]: https://en.cppreference.com/w/cpp/algorithm/count
[std::back_inserter]: https://en.cppreference.com/w/cpp/iterator/back_inserter
[std::copy_if]: https://en.cppreference.com/w/cpp/algorithm/copy

In [None]:
!echo -n "[g++]       " && rm -f select && g++     {flags} exercise1.cpp -ltbb             && ./select {N}
!echo -n "[clang++]   " && rm -f select && clang++ {flags} exercise1.cpp -ltbb             && ./select {N}
!echo -n "[nvc++ CPU] " && rm -f select && nvc++   {flags} exercise1.cpp -stdpar=multicore && ./select {N}
!echo -n "[nvc++ GPU] " && rm -f select && nvc++   {flags} exercise1.cpp -stdpar=gpu       && ./select {N}

### Solutions Exercise 1

The solution for this exercise is available at [solutions/exercise1.cpp]. The following cell compiles and runs it using different compilers:

[solutions/exercise1.cpp]: ./solutions/exercise1.cpp

In [None]:
!echo -n "[g++]       " && rm -f select && g++     {flags} solutions/exercise1.cpp -ltbb             && ./select {N}
!echo -n "[clang++]   " && rm -f select && clang++ {flags} solutions/exercise1.cpp -ltbb             && ./select {N}
!echo -n "[nvc++ CPU] " && rm -f select && nvc++   {flags} solutions/exercise1.cpp -stdpar=multicore && ./select {N}
!echo -n "[nvc++ GPU] " && rm -f select && nvc++   {flags} solutions/exercise1.cpp -stdpar=gpu       && ./select {N}

## Exercise 2: parallel implementation with `std::transform_inclusive_scan` and `std::for_each`

Instead of using [std::copy_if], we'll now re-implement it from scratch in the following steps:
1. Resize `index` to the same size as `v`.
2. First pass: use parallel [std::transform_inclusive_scan] to write to `index` the indices at which each selected element is to be written.
    * `transform` operation should return `1` if `pred(e)` returns `true`, and `0` otherwise.
    * `scan` with [std::plus] computes a [prefix sum] of the result of the transform, i.e., the output index at which each element for which the predicate returns `true` should be written.
    * `exclusive` vs `inclusive` scan denotes whether the prefix sum starts at 0 (exclusive), or at the value of the first element (inclusive). We'll need the total number of elements to be written in the next step, so we'll use an `inclusive` scan, which writes that value for the last element of the sequence (so we can access it at `index.back()`; see [std::vector::back]).
3. Resize the output `w`; the total number of output elements is the last value of the `inclusive_scan` (i.e. `index.back()`).
4. Second pass: use a parallel `for_each` statement to copy values from `v` to `w`, depending on the outcome of the unary predicate. Keep in mind that the output index of each element is off by plus one (because we used `inclusive_scan`), so we need to subtract one from it.

```c++
template<class UnaryPredicate>
void select(const std::vector<int>& v, UnaryPredicate pred,
            std::vector<size_t>& index, std::vector<int>& w)
{
    // 1. Resize `index` to the same size as `v`. 
    index.resize(v.size());   
    // 2. Use parallel `transform_inclusive_scan` to write to `index` 
    // the indices at which each selected element is to be written.
    std::transform_inclusive_scan(std::execution::par, v.begin(), v.end(), index.begin(), std::plus<size_t>{},
                                  // Return size_t so the running sum is not accumulated in an int.
                                  [pred](int x) -> size_t { return pred(x) ? 1 : 0; });
    // 3. Resize the output `w`. The total number of output elements 
    // is the last value of the `inclusive_scan` (i.e. `index.back()`).
    w.resize(index.empty() ? 0 : index.back());
    // 4. Use a parallel `for_each_n` to copy values from `v` to `w`,
    // depending on the outcome of the unary predicate.
    // The output index of each element is off by plus one, so we need to subtract one from it.
    std::for_each_n(std::execution::par, std::views::iota(0).begin(), (int)v.size(),
        [pred, v = v.data(), w = w.data(), index = index.data()](int i) {
            if (pred(v[i])) w[index[i] - 1] = v[i];
    });  
}
```
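To see why the subtraction is needed, it can help to run the transform-and-scan step on a tiny input. The sketch below uses the sequential overload of `std::transform_inclusive_scan` (no execution policy) so that it builds without a parallel backend; `scan_flags` is a hypothetical helper name introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: inclusive prefix sum of the 0/1 predicate flags of `v`.
template <class UnaryPredicate>
std::vector<std::size_t> scan_flags(const std::vector<int>& v, UnaryPredicate pred)
{
    std::vector<std::size_t> index(v.size());
    std::transform_inclusive_scan(v.begin(), v.end(), index.begin(),
                                  std::plus<std::size_t>{},
                                  [pred](int x) -> std::size_t { return pred(x) ? 1 : 0; });
    // The last entry is the number of selected elements, at most v.size().
    assert(index.empty() || index.back() <= v.size());
    return index;
}
```

For `v = {3, 8, 1, 6, 4}` and an "is even" predicate, the flags are `0 1 0 1 1` and the inclusive scan is `0 1 1 2 3`. The selected elements `8`, `6`, and `4` carry the values `1`, `2`, and `3`, so they go to `w[0]`, `w[1]`, and `w[2]`, and `index.back() == 3` is the size of `w`.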

Fix the `TODO`s in the [exercise2.cpp] template for the following cell to compile and run correctly:

[exercise2.cpp]: ./exercise2.cpp
[std::plus]: https://en.cppreference.com/w/cpp/utility/functional/plus
[std::copy_if]: https://en.cppreference.com/w/cpp/algorithm/copy
[prefix sum]: https://en.wikipedia.org/wiki/Prefix_sum
[std::transform_inclusive_scan]: https://en.cppreference.com/w/cpp/algorithm/transform_inclusive_scan
[std::vector::back]: https://en.cppreference.com/w/cpp/container/vector/back

In [None]:
!echo -n "[g++]       " && rm -f select && g++     {flags} exercise2.cpp -ltbb             && ./select {N}
!echo -n "[clang++]   " && rm -f select && clang++ {flags} exercise2.cpp -ltbb             && ./select {N}
!echo -n "[nvc++ CPU] " && rm -f select && nvc++   {flags} exercise2.cpp -stdpar=multicore && ./select {N}
!echo -n "[nvc++ GPU] " && rm -f select && nvc++   {flags} exercise2.cpp -stdpar=gpu       && ./select {N}

### Solutions Exercise 2

The solution for this exercise is available at [solutions/exercise2.cpp].

[solutions/exercise2.cpp]: ./solutions/exercise2.cpp

The following cell compiles and runs the solution for Exercise 2 using different compilers.

In [None]:
!echo -n "[g++]       " && rm -f select && g++     {flags} solutions/exercise2.cpp -ltbb             && ./select {N}
!echo -n "[clang++]   " && rm -f select && clang++ {flags} solutions/exercise2.cpp -ltbb             && ./select {N}
!echo -n "[nvc++ CPU] " && rm -f select && nvc++   {flags} solutions/exercise2.cpp -stdpar=multicore && ./select {N}
!echo -n "[nvc++ GPU] " && rm -f select && nvc++   {flags} solutions/exercise2.cpp -stdpar=gpu       && ./select {N}

## How good is parallel `std::copy_if`?

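In practice, please just use the parallel [std::copy_if] rather than a hand-written replacement. It's much better. The following cell builds and runs a version based on it for comparison with the exercises above.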

[std::copy_if]: https://en.cppreference.com/w/cpp/algorithm/copy

In [None]:
!echo -n "[g++]       " && rm -f select && g++     {flags} solutions/copy_if.cpp -ltbb             && ./select {N}
!echo -n "[clang++]   " && rm -f select && clang++ {flags} solutions/copy_if.cpp -ltbb             && ./select {N}
!echo -n "[nvc++ CPU] " && rm -f select && nvc++   {flags} solutions/copy_if.cpp -stdpar=multicore && ./select {N}
!echo -n "[nvc++ GPU] " && rm -f select && nvc++   {flags} solutions/copy_if.cpp -stdpar=gpu       && ./select {N}