Lab 1: DAXPY - Accelerating portable HPC Applications with Standard C++
===

This tutorial familiarizes you with the C++ parallel algorithms. We'll parallelize Double-precision $Y = A \cdot X + Y$, also known as `DAXPY`, one of the main routines in the standard Basic Linear Algebra Subprograms (BLAS) library. It scales the double-precision elements of the vector $X$ by the scalar $A$ and adds the result to the vector $Y$.

The vectors are initialized with `x[i] = i` and `y[i] = 2`, and all exercises validate your implementation by checking that `y[i] = 2 + a * i`.
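
As a rough illustration of such a validation step, here is a minimal sketch. The helper name `check` and the tolerance are assumptions made for this example; the check in the provided source files may look different.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical validation helper (not taken from the lab sources):
// returns true if every element satisfies y[i] == 2 + a * i,
// up to a small relative tolerance.
bool check(double a, std::vector<double> const &y) {
  for (std::size_t i = 0; i < y.size(); ++i) {
    double expected = 2. + a * static_cast<double>(i);
    double tolerance = 1e-12 * std::max(1.0, std::abs(expected));
    if (std::abs(y[i] - expected) > tolerance) return false;
  }
  return true;
}
```

Calling `check(a, y)` after `daxpy(a, x, y)` would return `true` only if every element was updated exactly once.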

## Sequential implementation

A working sequential implementation is provided in [starting_point.cpp]. All exercises focus on the following two main functions:

```c++
/// Initialize vectors `x` and `y`: raw loop sequential version
void initialize(std::vector<double> &x, std::vector<double> &y) {
  for (std::size_t i = 0; i < x.size(); ++i) {
    x[i] = (double)i;
    y[i] = 2.;
  }
}

/// DAXPY: AX + Y: raw loop sequential version
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  for (std::size_t i = 0; i < y.size(); ++i)
    y[i] += a * x[i];
}
```

[starting_point.cpp]: ./starting_point.cpp

Let's start by checking the version of some of the compilers installed in the image:

In [None]:
!g++ --version
!clang++ --version
!nvc++ --version

---

Now let's define a size for all problems:

In [None]:
N=10000000

and compile and run the starting point with that size:

In [None]:
!g++ -std=c++23 -o daxpy starting_point.cpp && ./daxpy {N}

Here, `-std=c++23` selects the C++ language standard.

Let's enable optimizations with `-Ofast`, disable debug checks with `-DNDEBUG`, and compile for the current CPU using `-march=native`:

In [None]:
!g++ -std=c++23 -Ofast -march=native -DNDEBUG -o daxpy starting_point.cpp && ./daxpy {N}

The performance with optimizations is much higher.

We'll use these exact same flags for all compilers:

In [None]:
flags="-std=c++23 -Ofast -march=native -DNDEBUG -o daxpy"

## Exercise 1: From raw DAXPY loop to sequential C++ `std::for_each_n` algorithm

The goal of this first exercise is to re-write the raw DAXPY loop by combining:
- the C++ standard library [std::for_each_n] algorithm, and
- [std::views::iota] to create an iterator over a range of integers.

You can click on the C++ API name links (e.g. on [std::for_each_n]) to access their documentation. 

In this exercise, we'll use [std::views::iota] to get an iterator over a range of integers starting at zero as follows:

```c++
auto range = std::views::iota(0); // Range of integers: [0, int_max).
auto it    = range.begin();       // Iterator to first element of range (0).
```

and pass it to the [std::for_each_n] algorithm, whose API is:

```c++
std::for_each_n(
    Iterator begin,   // Iterator to the first element of the range.
    size_t length,    // Number of range elements to process.
    UnaryFunction op  // Operation applied to each element in the range: op(e).
);
```

to replace the sequential raw loop in the implementation of `daxpy` with [std::for_each_n] as follows:

```c++
// [Before] Raw-loop: 
for (std::size_t i = 0; i < y.size(); ++i) {
    ...
}
    
// [After] Algorithm:
std::for_each_n(std::views::iota(0).begin(), y.size(), [](int i) {
    ...
});

```

The top folder of this notebook provides templates for implementing the solution.
These contain `// TODO` comments to help you focus on the parts of the file that need to be modified.
There is no need to modify any other place in the program. 
Please use these templates when working on the exercises.

A template for the solution is provided in the [exercise1.cpp] file.
We need to fix the `TODO`s in the template:

```c++
#include <chrono>
// TODO: add C++ standard library includes as necessary
// #include <algorithm>
// #include <ranges>

/// DAXPY: AX + Y: sequential algorithm version
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  // TODO: replace this raw loop with an algorithm:
  // for (std::size_t i = 0; i < y.size(); ++i) {
  //   y[i] += a * x[i];
  // }
  // Using: 
  // - std::views::iota(0).begin() iterator
  // - std::for_each_n algorithm
  // std::for_each_n(std::views::iota(0).begin(), x.size(), [&](int i) {
  //  y[i] += a * x[i];
  // });
}
```

This next cell compiles and runs the template to test your solution. Without modifications, the template compiles, but produces incorrect results because the `daxpy` implementation is empty. Once you fix it, the following cells should compile and run correctly.

[exercise1.cpp]: ./exercise1.cpp
[std::for_each_n]: https://en.cppreference.com/w/cpp/algorithm/for_each_n
[std::views::iota]: https://en.cppreference.com/w/cpp/ranges/iota_view

In [None]:
!echo -n "[g++]     " && rm -f daxpy && g++     {flags} exercise1.cpp && ./daxpy {N}
!echo -n "[clang++] " && rm -f daxpy && clang++ {flags} exercise1.cpp && ./daxpy {N}
!echo -n "[nvc++]   " && rm -f daxpy && nvc++   {flags} exercise1.cpp && ./daxpy {N}

### Solutions Exercise 1

The solutions for each example are available in the [solutions/] sub-directory.

The following block compiles and runs the solutions at [solutions/exercise1.cpp] using different compilers:

[solutions/]: ./solutions
[solutions/exercise1.cpp]: ./solutions/exercise1.cpp

In [None]:
!echo -n "[g++]     " && rm -f daxpy && g++     {flags} solutions/exercise1.cpp && ./daxpy {N}
!echo -n "[clang++] " && rm -f daxpy && clang++ {flags} solutions/exercise1.cpp && ./daxpy {N}
!echo -n "[nvc++]   " && rm -f daxpy && nvc++   {flags} solutions/exercise1.cpp && ./daxpy {N}

## Exercise 2: Parallelizing DAXPY with execution policies

To run DAXPY in parallel, we need to:
- obtain access to the execution policies by including the `<execution>` header,
- pass the [std::execution::par] policy as the first argument of the [std::for_each_n] algorithm,
- enable the parallel algorithms via compiler options.

A template for the solution is provided in the [exercise2.cpp] file.
We need to fix the `TODO`s in the template:

```c++
#include <algorithm>
// TODO: add C++ standard library includes as necessary
// #include <execution>

/// DAXPY: AX + Y: parallel algorithm version
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  std::for_each_n(// TODO: pass std::execution::par, as first argument 
                  std::views::iota(0).begin(), x.size(), [&](int i) {
    y[i] += a * x[i];
  });
}
```

This next cell compiles and runs the template to test your solution.
It already contains the right compilation options to enable the parallel algorithms in the different compilers: 
- `clang` and `gcc`: need to link with the TBB library using the `-ltbb` flag, because their parallel algorithms implementation depends on it.
- `nvc++`: need to use the `-stdpar=multicore` or `-stdpar=gpu` flags, to enable the parallel algorithms and select where they'll run.

Once you make the changes, you should see the performance increase while the tests still pass.

[exercise2.cpp]: ./exercise2.cpp
[std::for_each_n]: https://en.cppreference.com/w/cpp/algorithm/for_each_n
[std::execution::par]: https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag

In [None]:
!echo -n "[g++]       " && rm -f daxpy && g++     {flags} exercise2.cpp -ltbb             && ./daxpy {N}
!echo -n "[clang++]   " && rm -f daxpy && clang++ {flags} exercise2.cpp -ltbb             && ./daxpy {N}
!echo -n "[nvc++ CPU] " && rm -f daxpy && nvc++   {flags} exercise2.cpp -stdpar=multicore && ./daxpy {N}
!echo -n "[nvc++ GPU] " && rm -f daxpy && nvc++   {flags} exercise2.cpp -stdpar=gpu       && ./daxpy {N}

### Solutions Exercise 2

The following block compiles and runs the solutions at [solutions/exercise2.cpp] using different compilers:

[solutions/exercise2.cpp]: ./solutions/exercise2.cpp

In [None]:
!echo -n "[g++]       " && rm -f daxpy && g++     {flags} solutions/exercise2.cpp -ltbb             && ./daxpy {N}
!echo -n "[clang++]   " && rm -f daxpy && clang++ {flags} solutions/exercise2.cpp -ltbb             && ./daxpy {N}
!echo -n "[nvc++ CPU] " && rm -f daxpy && nvc++   {flags} solutions/exercise2.cpp -stdpar=multicore && ./daxpy {N}
!echo -n "[nvc++ GPU] " && rm -f daxpy && nvc++   {flags} solutions/exercise2.cpp -stdpar=gpu       && ./daxpy {N}

## Exercise 3: Improving lambda captures for GPU performance

In the previous exercise, our parallel implementation captured everything in the lambda by reference, i.e., with a `[&](...) { ... }` capture clause. This only works on heterogeneous platforms that are coherent, like the one this notebook is running on. On hardware-coherent platforms, like Grace Hopper, it works really well, but on software-coherent platforms, it does not deliver the best performance.

In this exercise, we will learn how to write code that works on non-coherent platforms, and performs well on all platforms. 
The solution to both issues is the same: modify our lambda to capture all arguments by value. So instead of `[&]`, we need to use `[a, x = x.data(), y = y.data()]`, where:
- `[a]`: captures `a` by value, i.e., copies the scalar `a` into the lambda.
- `[x = x.data(), y = y.data()]`: captures pointers to the data of `x` and `y` by value (i.e., this does not make a copy of the vectors, like `[x, y]` would do).
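
A minimal sequential sketch of this capture style follows. The helper name `daxpy_by_value` is hypothetical, and a plain loop stands in for the parallel algorithm so that the focus stays on the capture clause.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration: the lambda copies the scalar `a`
// and two pointers, not the vectors themselves.
void daxpy_by_value(double a, std::vector<double> const &x, std::vector<double> &y) {
  auto op = [a, x = x.data(), y = y.data()](std::size_t i) {
    y[i] += a * x[i];  // Writes through the pointer into the caller's vector.
  };
  // Sequential stand-in for the parallel algorithm.
  for (std::size_t i = 0; i < y.size(); ++i) op(i);
}
```

Because only pointers are copied, the updates remain visible in the caller's `y`, and the closure object stays small regardless of the vector length.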

A template for the solution is provided in the [exercise3.cpp] file.
We need to fix the `TODO`s in the template:

```c++
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  std::for_each_n(std::execution::par,
                  std::views::iota(0).begin(), x.size(),  
    // TODO: instead of by reference [&], capture by value using:
    // [a, x = x.data(), y = y.data()]
    [&](int i) {
        y[i] += a * x[i];
  });
}
```

This next cell compiles and runs the template to test your solution.
Once you make the changes, you should see the performance increase while the tests still pass.

[exercise3.cpp]: ./exercise3.cpp

In [None]:
!echo -n "[g++]       " && rm -f daxpy && g++     {flags} exercise3.cpp -ltbb             && ./daxpy {N}
!echo -n "[clang++]   " && rm -f daxpy && clang++ {flags} exercise3.cpp -ltbb             && ./daxpy {N}
!echo -n "[nvc++ CPU] " && rm -f daxpy && nvc++   {flags} exercise3.cpp -stdpar=multicore && ./daxpy {N}
!echo -n "[nvc++ GPU] " && rm -f daxpy && nvc++   {flags} exercise3.cpp -stdpar=gpu       && ./daxpy {N}

### Solutions Exercise 3

The following block compiles and runs the solutions at [solutions/exercise3.cpp] using different compilers:

[solutions/exercise3.cpp]: ./solutions/exercise3.cpp

In [None]:
!echo -n "[g++]       " && rm -f daxpy && g++     {flags} solutions/exercise3.cpp -ltbb             && ./daxpy {N}
!echo -n "[clang++]   " && rm -f daxpy && clang++ {flags} solutions/exercise3.cpp -ltbb             && ./daxpy {N}
!echo -n "[nvc++ CPU] " && rm -f daxpy && nvc++   {flags} solutions/exercise3.cpp -stdpar=multicore && ./daxpy {N}
!echo -n "[nvc++ GPU] " && rm -f daxpy && nvc++   {flags} solutions/exercise3.cpp -stdpar=gpu       && ./daxpy {N}

## Exercise 4: Know your algorithms: `transform_reduce`

In this exercise, we'll parallelize a variant of `daxpy` called `daxpy_sum` that, on top of applying daxpy, also adds all elements of `y` up, i.e., also performs a reduction: 

```c++
/// DAXPY: AX + Y and returns sum(Y)
double daxpy_sum(double a, std::vector<double> const &x, std::vector<double> &y) {
  auto ints = std::views::iota(0, (int)x.size());
  double sum = 0.;
  for (auto i : ints) {
    y[i] += a * x[i];
    sum += y[i];
  }
  return sum;
}
```

Since C++ does not allow multiple threads to mutate a single shared value without extra synchronization (like the one provided by locks or atomic operations), we **cannot** easily solve this exercise by just using [std::for_each_n] like we did above, and directly updating `sum` concurrently from within the lambda:

```c++
double daxpy_sum(double a, std::vector<double> const &x, std::vector<double> &y) {
  double sum = 0.;
  std::for_each_n(std::execution::par,
                  std::views::iota(0).begin(), x.size(),
    [&sum, a, x = x.data(), y = y.data()](int i) {
        y[i] += a * x[i];
        sum += y[i]; // ERROR (undefined behavior): concurrent accesses to "sum".
  });
  return sum;
}
```

The [std::transform_reduce] algorithm from the `<numeric>` header provides a simple and efficient way to solve this problem. It iterates over all elements of the range, and:
- applies a function `map` that _transforms_ a range element `e` of type `T` into a different value of type `U`: `map(T e) -> U;`,
- returns the combination of multiple transformations into a single value via a binary _reduction_ operation `red(U a, U b) -> U`,
- guarantees that `map` is called _exactly once_ per range element.

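For a three-element range `[e0, e1, e2]` and an initial value `init`, the final result is equivalent to `U res = red(red(red(init, map(e0)), map(e1)), map(e2));`. The algorithm is free to regroup and reorder the applications of `red`, so `red` should be associative and commutative.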

The overload of [std::transform_reduce] we will be using takes an execution policy, a single range, an initial value, a binary reduction, and a unary transform:

```c++
template <typename Iter, typename U, typename BinaryReduction, typename UnaryFunction>
U transform_reduce(std::execution::par,       // Execution policy.
                   Iter begin, Iter end,      // [begin, end) range.
                   U init,                    // Initial value for reduction.
                   BinaryReduction red,       // Binary reduction: red(U a, U b) -> U above.
                   UnaryFunction   map);      // Unary function map(T e) -> U applied to every element in [begin, end).
```
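
To try the map and reduce roles in isolation, here is a small sketch that computes a sum of squares. The helper name `sum_of_squares` is hypothetical, and the sketch uses the sequential overload (no execution policy) so that it builds without any extra compiler or linker flags.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Illustrative helper (hypothetical name): sum of the squares of `v`.
double sum_of_squares(std::vector<double> const &v) {
  return std::transform_reduce(
      v.begin(), v.end(),               // [begin, end) range.
      0.,                               // Initial value of the reduction.
      std::plus{},                      // red(U a, U b) -> U
      [](double e) { return e * e; });  // map(T e) -> U
}
```

Adding an execution policy as the first argument would turn this into the parallel form shown above, with the remaining arguments unchanged.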

For `daxpy_sum` we need to:
- Perform the daxpy update `y[i] += a * x[i]` as part of the transform; this relies on the transform operation being called exactly once per range element.
- Return the updated value `y[i]` after the update from the transformation.
- Add all the updated values with the binary reduction operation [std::plus] from the `<functional>` header.

A template for the solution is provided in the [exercise4.cpp] file.
We need to fix the `TODO`s in the template:

```c++
// TODO: add C++ standard library includes as necessary
// #include <numeric> // std::transform_reduce is in the <numeric> header!
double daxpy_sum(double a, std::vector<double> const &x, std::vector<double> &y) {
  // TODO: create a range of integers [0, x.size()).
  // NOTE: iota(begin, end) takes integers of the same type!
  auto ints = std::views::iota(0, (int)x.size());
  // TODO: call transform_reduce on the integer range, using:
  // - "0." as the initial value, and
  // - `std::plus{}` as the binary reduction:
  return std::transform_reduce(std::execution::par, ints.begin(), ints.end(), /* 0. */, /* std::plus{} */, 
    [a, x = x.data(), y = y.data()](int i) {
        // TODO: perform daxpy update:
        // y[i] += a * x[i];
        // TODO: return the updated value:
        return /* y[i] */;
  });
}
```

This next cell compiles and runs the template to test your solution.
Once you make the changes, you should see the performance increase while the tests still pass.

[exercise4.cpp]: ./exercise4.cpp
[std::for_each_n]: https://en.cppreference.com/w/cpp/algorithm/for_each_n
[std::transform_reduce]: https://en.cppreference.com/w/cpp/algorithm/transform_reduce
[std::plus]: https://en.cppreference.com/w/cpp/utility/functional/plus

In [None]:
!echo -n "[g++]       " && rm -f daxpy && g++     {flags} exercise4.cpp -ltbb             && ./daxpy {N}
!echo -n "[clang++]   " && rm -f daxpy && clang++ {flags} exercise4.cpp -ltbb             && ./daxpy {N}
!echo -n "[nvc++ CPU] " && rm -f daxpy && nvc++   {flags} exercise4.cpp -stdpar=multicore && ./daxpy {N}
!echo -n "[nvc++ GPU] " && rm -f daxpy && nvc++   {flags} exercise4.cpp -stdpar=gpu       && ./daxpy {N}

### Solutions Exercise 4

The following block compiles and runs the [`solutions/exercise4.cpp`]:

[`solutions/exercise4.cpp`]: ./solutions/exercise4.cpp

In [None]:
!echo -n "[g++]       " && rm -f daxpy && g++     {flags} solutions/exercise4.cpp -ltbb             && ./daxpy {N}
!echo -n "[clang++]   " && rm -f daxpy && clang++ {flags} solutions/exercise4.cpp -ltbb             && ./daxpy {N}
!echo -n "[nvc++ CPU] " && rm -f daxpy && nvc++   {flags} solutions/exercise4.cpp -stdpar=multicore && ./daxpy {N}
!echo -n "[nvc++ GPU] " && rm -f daxpy && nvc++   {flags} solutions/exercise4.cpp -stdpar=gpu       && ./daxpy {N}

## [Optional] Exercise 5: Know your algorithms: `fill_n`

In this exercise, we are going to parallelize the `initialize` function as follows: 
- Initialize `x[i] = i;` using [std::for_each_n] with [std::views::iota], just like in the previous exercise.
- Initialize `y[i] = 2.;` using the [std::fill_n] algorithm, which writes the same value to all elements of a range.

```c++
/// Initialize vectors `x` and `y`: parallel algorithm version
void initialize(std::vector<double> &x, std::vector<double> &y) {
  // TODO: parallelize the initialization using
  //  - for_each_n + views::iota to initialize x
  //  - fill_n to initialize y
  // for (std::size_t i = 0; i < x.size(); ++i) {
  //   x[i] = (double)i;
  //   y[i] = 2.;
  // }
}
```

The API of [std::fill_n] is:

```c++
std::fill_n(std::execution::par, // Execution policy
            iterator,            // Iterator to the elements, e.g., a pointer
            number_of_elements,  // Number of elements
            value);              // Value to initialize all elements to
```
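
As a small illustration of the call shape, the sketch below uses the sequential overload of `std::fill_n` (without an execution policy). The helper name `filled` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative helper (hypothetical name): a vector holding `n` copies of `value`.
std::vector<double> filled(std::size_t n, double value) {
  std::vector<double> v(n);
  // Write `value` to the first `n` elements starting at v.begin().
  std::fill_n(v.begin(), n, value);
  return v;
}
```

The parallel version differs only by the execution policy passed as the first argument.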

A template for the solution is provided in [exercise5.cpp]; it compiles and runs as provided, but produces incorrect results due to the incomplete implementation of the `initialize` function. Once you fix it, the following block should compile and run correctly.

[std::fill_n]: https://en.cppreference.com/w/cpp/algorithm/fill_n 
[std::for_each_n]: https://en.cppreference.com/w/cpp/algorithm/for_each_n 
[std::views::iota]: https://en.cppreference.com/w/cpp/ranges/iota_view
[exercise5.cpp]: ./exercise5.cpp

In [None]:
!echo -n "[g++]       " && rm -f daxpy && g++     {flags} exercise5.cpp -ltbb             && ./daxpy {N}
!echo -n "[clang++]   " && rm -f daxpy && clang++ {flags} exercise5.cpp -ltbb             && ./daxpy {N}
!echo -n "[nvc++ CPU] " && rm -f daxpy && nvc++   {flags} exercise5.cpp -stdpar=multicore && ./daxpy {N}
!echo -n "[nvc++ GPU] " && rm -f daxpy && nvc++   {flags} exercise5.cpp -stdpar=gpu       && ./daxpy {N}

### Solutions Exercise 5

The following block compiles and runs the solutions at [solutions/exercise5.cpp] using different compilers:

[solutions/exercise5.cpp]: ./solutions/exercise5.cpp

In [None]:
!echo -n "[g++]       " && rm -f daxpy && g++     {flags} solutions/exercise5.cpp -ltbb             && ./daxpy {N}
!echo -n "[clang++]   " && rm -f daxpy && clang++ {flags} solutions/exercise5.cpp -ltbb             && ./daxpy {N}
!echo -n "[nvc++ CPU] " && rm -f daxpy && nvc++   {flags} solutions/exercise5.cpp -stdpar=multicore && ./daxpy {N}
!echo -n "[nvc++ GPU] " && rm -f daxpy && nvc++   {flags} solutions/exercise5.cpp -stdpar=gpu       && ./daxpy {N}

If you are done quickly, please continue with the optional [Lab 1: Select](../lab1_select/select.ipynb).

## [Optional] Exercise 6: Process multiple elements per iteration with multi-dimensional span

Until now, we've looked at the following 1D version of parallel `daxpy`:

```c++
/// 1D DAXPY: AX + Y: parallel algorithm version
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  std::for_each_n(std::execution::par,
                  std::views::iota(0).begin(), x.size(), 
                  [a, x = x.data(), y = y.data()](int i) { 
      y[i] += a * x[i]; 
  });
}
```

In this exercise, we'll improve the following 2D version of `daxpy`, which processes multiple elements per "task" by _tiling_ the 1D vectors as 2D matrices:

```c++
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y, size_t ncol = 2) {
  assert(x.size() == y.size());
  if (x.size() % ncol != 0) {
      std::cerr << "ERROR: size " << x.size() << " not divisible by " << ncol << std::endl;
      std::abort();
  }

  // Number of rows:
  size_t N = x.size() / ncol;

  // Parallel loop over rows:
  double const* xs = x.data();
  double* ys = y.data();
  std::for_each_n(std::execution::par,
                  std::views::iota(0).begin(), N, [=](int i) { 
      // Sequential loop over columns:
      for (size_t j = 0; j < ncol; ++j) {
          // TODO: update to use mdspan
          ys[j + i * ncol] += a * xs[j + i * ncol];
      }
  });
}
```

When performing this two-dimensional iteration over the 1D vectors, `xs` and `ys`, we need to manually map the two-dimensional indices, `i` and `j`, to one-dimensional indices, e.g., `j + i * ncol`. There are multiple choices of mappings we could pick, e.g., [row- vs col-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order). Since the choice significantly impacts application performance, we want to be able to quickly change it throughout our applications, without introducing programmer errors, e.g., due to picking one mapping in one place, and a different one in another.

[std::mdspan] provides a multi-dimensional view of 1D data-structures with a customizable mapping, ensuring the same mapping is used by all accesses, and allowing us to change the mapping safely in one single place of our application.
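
To make the "one single place" idea concrete, here is a hand-rolled sketch of a 2D view. The type `View2D` is a hypothetical stand-in written for this illustration; it is not `std::mdspan` and offers none of its features beyond a shared index mapping.

```cpp
#include <cstddef>
#include <vector>

// Hand-rolled stand-in for illustration only; this is NOT std::mdspan.
struct View2D {
  double *data;
  std::size_t nrows, ncols;

  // The only place that knows the 2D -> 1D mapping (row-major here).
  // Replacing the body with `i + j * nrows` would make every access
  // through this view column-major.
  std::size_t index(std::size_t i, std::size_t j) const { return j + i * ncols; }

  // All element accesses go through index(), so they always agree.
  double &operator()(std::size_t i, std::size_t j) const { return data[index(i, j)]; }
};
```

For example, `View2D ys{y.data(), nrows, ncols};` followed by `ys(i, j) += ...` removes the manual index arithmetic from the loop body.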

A template for the solution is provided in [exercise6.cpp]: 

```c++
/// 2D DAXPY: AX + Y: parallel algorithm version
void daxpy(double a, std::vector<double> &x, std::vector<double> &y, size_t ncols = 2) {
  assert(x.size() == y.size());
  if (x.size() % ncols != 0) { 
      std::cerr << "ERROR: size " << x.size() << " not divisible by " << ncols << std::endl; 
      std::abort(); 
  }
  size_t nrows = x.size() / ncols;

  // TODO: use mdspan instead of raw pointers and manual indexing:
  // std::mdspan xs { x.data(), nrows, ncols };
  // std::mdspan ys { y.data(), nrows, ncols };
  double* xs = x.data();
  double* ys = y.data();
  std::for_each_n(std::execution::par,
                  std::views::iota(0).begin(), nrows, [=](int row) {
        for (size_t col = 0; col < ncols; ++col) {
            // TODO: use mdspan instead of raw pointers and manual indexing: 
            // ys(row, col) += a * xs(row, col);
            size_t idx = row * ncols + col;
            ys[idx] += a * xs[idx];
        }
  });
}
```

[std::mdspan]: https://en.cppreference.com/w/cpp/container/mdspan
[exercise6.cpp]: ./exercise6.cpp

In [None]:
!echo -n "[g++]       " && rm -f daxpy && g++     {flags} exercise6.cpp -ltbb             && ./daxpy {N}
!echo -n "[clang++]   " && rm -f daxpy && clang++ {flags} exercise6.cpp -ltbb             && ./daxpy {N}
!echo -n "[nvc++ CPU] " && rm -f daxpy && nvc++   {flags} exercise6.cpp -stdpar=multicore && ./daxpy {N}
!echo -n "[nvc++ GPU] " && rm -f daxpy && nvc++   {flags} exercise6.cpp -stdpar=gpu       && ./daxpy {N}

### Solutions Exercise 6

The following block compiles and runs the solutions at [solutions/exercise6.cpp] using different compilers:

[solutions/exercise6.cpp]: ./solutions/exercise6.cpp

In [None]:
!echo -n "[g++]       " && rm -f daxpy && g++     {flags} solutions/exercise6.cpp -ltbb             && ./daxpy {N}
!echo -n "[clang++]   " && rm -f daxpy && clang++ {flags} solutions/exercise6.cpp -ltbb             && ./daxpy {N}
!echo -n "[nvc++ CPU] " && rm -f daxpy && nvc++   {flags} solutions/exercise6.cpp -stdpar=multicore && ./daxpy {N}
!echo -n "[nvc++ GPU] " && rm -f daxpy && nvc++   {flags} solutions/exercise6.cpp -stdpar=gpu       && ./daxpy {N}

## [Optional] Exercise 7: Configuring `std::mdspan` layout

Exercise 6 showed that the performance of the 2D DAXPY implementation is a bit lower than that of the 1D DAXPY implementations in the prior exercises, even though they all do the same amount of work.

In this exercise, we'll learn how to recover the performance by configuring the `std::mdspan` layout.

In the previous exercise, we constructed the `std::mdspan` from the lengths of each dimension, called _extents_, as follows:

```c++
auto xs = std::mdspan{x.data(), N, ncol};
auto ys = std::mdspan{y.data(), N, ncol};
```

This constructs the `std::mdspan` with the default layout mapping policy, [std::layout_right].
We can construct the `std::mdspan` from a different layout mapping policy, e.g., the one from [std::layout_left], as follows:

```c++
auto extents = ...;
auto xs = std::mdspan{x.data(), std::layout_left::mapping(extents)};
auto ys = std::mdspan{y.data(), std::layout_left::mapping(extents)};
```

To do so, we need to construct an extents object describing the array's extents. Since both of our extents are _dynamic_ runtime values, we use [std::dextents]:

```c++
auto extents = std::dextents<size_t, 2>(N, ncol);
auto xs = std::mdspan{x.data(), std::layout_left::mapping(extents)};
auto ys = std::mdspan{y.data(), std::layout_left::mapping(extents)};
```
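
For a 2D array, the two layout policies differ only in how a `(row, col)` pair maps to a 1D offset. The sketch below spells out both formulas by hand; the function names `offset_right` and `offset_left` are hypothetical and are not part of the standard library API.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration; not part of the std:: API.

// Offset produced by a layout_right-style (row-major) mapping:
// elements of the same row are adjacent in memory.
constexpr std::size_t offset_right(std::size_t row, std::size_t col,
                                   std::size_t /*nrows*/, std::size_t ncols) {
  return row * ncols + col;
}

// Offset produced by a layout_left-style (column-major) mapping:
// elements of the same column are adjacent in memory.
constexpr std::size_t offset_left(std::size_t row, std::size_t col,
                                  std::size_t nrows, std::size_t /*ncols*/) {
  return row + col * nrows;
}
```

With the left layout, consecutive values of `row` map to adjacent memory locations for a fixed `col`, whereas with the right layout they are `ncols` elements apart.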

A template for the solution is provided in [exercise7.cpp]:

```c++
/// 2D DAXPY: AX + Y: parallel algorithm version
void daxpy(double a, std::vector<double> &x, std::vector<double> &y, size_t ncols = 1) {
  assert(x.size() == y.size());
  if (x.size() % ncols != 0) { 
      std::cerr << "ERROR: size " << x.size() << " not divisible by " << ncols << std::endl; 
      std::abort(); 
  }
  size_t nrows = x.size() / ncols;

  // TODO: construct extents object.
  // std::dextents<size_t, 2> extents(nrows, ncols);
  // TODO: construct the layout mapping object:
  // auto mapping = std::layout_left::mapping(extents);
  // TODO: construct the mdspans from the mapping:
  // std::mdspan xs { x.data(), mapping };
  // std::mdspan ys { y.data(), mapping };
  std::mdspan xs { x.data(), nrows, ncols };
  std::mdspan ys { y.data(), nrows, ncols };
  std::for_each_n(std::execution::par,
                  std::views::iota(0).begin(), nrows, [=](int row) {
        for (size_t col = 0; col < ncols; ++col) {
            ys(row, col) += a * xs(row, col);
        }
  });
}
```

[exercise7.cpp]: ./exercise7.cpp
[std::layout_right]: https://en.cppreference.com/w/cpp/container/mdspan/layout_right
[std::dextents]: https://en.cppreference.com/w/cpp/container/mdspan/extents

In [None]:
!echo -n "[g++]       " && rm -f daxpy && g++     {flags} exercise7.cpp -ltbb             && ./daxpy {N}
!echo -n "[clang++]   " && rm -f daxpy && clang++ {flags} exercise7.cpp -ltbb             && ./daxpy {N}
!echo -n "[nvc++ CPU] " && rm -f daxpy && nvc++   {flags} exercise7.cpp -stdpar=multicore && ./daxpy {N}
!echo -n "[nvc++ GPU] " && rm -f daxpy && nvc++   {flags} exercise7.cpp -stdpar=gpu       && ./daxpy {N}

### Solutions Exercise 7

The following block compiles and runs the [`solutions/exercise7.cpp`]:

[`solutions/exercise7.cpp`]: ./solutions/exercise7.cpp

In [None]:
!echo -n "[g++]       " && rm -f daxpy && g++     {flags} solutions/exercise7.cpp -ltbb             && ./daxpy {N}
!echo -n "[clang++]   " && rm -f daxpy && clang++ {flags} solutions/exercise7.cpp -ltbb             && ./daxpy {N}
!echo -n "[nvc++ CPU] " && rm -f daxpy && nvc++   {flags} solutions/exercise7.cpp -stdpar=multicore && ./daxpy {N}
!echo -n "[nvc++ GPU] " && rm -f daxpy && nvc++   {flags} solutions/exercise7.cpp -stdpar=gpu       && ./daxpy {N}

## Exercise 8: Multi-dimensional iteration with `std::views::cartesian_product`

In this exercise, we'll learn how to use [std::views::cartesian_product] to iterate over multi-dimensional data such as the two-dimensional [std::mdspan] we've used in Exercises 6 and 7. We've been using the [std::for_each_n] algorithm with an iterator and a count, combined with a sequential loop as follows:

```c++
  std::for_each_n(std::execution::par,
                  std::views::iota(0).begin(), nrows, [=](int row) {
        for (size_t col = 0; col < ncols; ++col) {
            ys(row, col) += a * xs(row, col);
        }
  });
```

The goal of this exercise is to convert the above to use the [std::for_each] algorithm (without the `_n`) to iterate in parallel over a [std::views::cartesian_product] view and, within the loop, obtain the indices for each dimension.
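
To get a feel for what such a view contains, the sketch below emulates it with plain nested loops: it materializes the same sequence of `(row, col)` tuples, in the same order, that a Cartesian product of two integer ranges yields. The helper name `index_pairs` is hypothetical, and this is deliberately not the view itself, which produces the tuples lazily without storing them.

```cpp
#include <tuple>
#include <vector>

// Emulates, with plain nested loops, the sequence of elements of a
// Cartesian product of the ranges [0, nrows) and [0, ncols).
std::vector<std::tuple<int, int>> index_pairs(int nrows, int ncols) {
  std::vector<std::tuple<int, int>> out;
  for (int row = 0; row < nrows; ++row)
    for (int col = 0; col < ncols; ++col)
      out.emplace_back(row, col);  // The last index varies fastest.
  return out;
}
```

Each element can be unpacked with structured bindings, e.g. `auto [row, col] = pairs[k];`, which is the same pattern used inside the lambda in this exercise.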

A template for the solution is provided in [exercise8.cpp]:

```c++
/// 2D DAXPY: AX + Y: parallel algorithm version
void daxpy(double a, std::vector<double> &x, std::vector<double> &y, int ncols = 1) {
  assert(x.size() == y.size());
  if (x.size() % ncols != 0) { 
      std::cerr << "ERROR: size " << x.size() << " not divisible by " << ncols << std::endl; 
      std::abort(); 
  }
  int nrows = x.size() / ncols;

  std::mdspan xs { x.data(), nrows, ncols };
  std::mdspan ys { y.data(), nrows, ncols };
  // TODO: Create a std::views::cartesian_product range spanning [0, nrows) x [0, ncols):
  // auto is = std::views::cartesian_product(
  //  std::views::iota(0, nrows),
  //  std::views::iota(0, ncols)
  // );
  // TODO: Use the std::for_each (without _n) algorithm to iterate in parallel over the cartesian_product range:
  // std::for_each(std::execution::par, is.begin(), is.end(), [=](auto i) {
    // Each element of the cartesian_product range is a tuple containing one index per dimension.
    // TODO: Extract the individual indices using structured bindings:
    // auto [row, col] = i;
    // ys(row, col) += a * xs(row, col);
  // });
}
```

[exercise8.cpp]: ./exercise8.cpp
[std::views::cartesian_product]: https://en.cppreference.com/w/cpp/ranges/cartesian_product_view
[std::mdspan]: https://en.cppreference.com/w/cpp/container/mdspan
[std::for_each_n]: https://en.cppreference.com/w/cpp/algorithm/for_each_n
[std::for_each]: https://en.cppreference.com/w/cpp/algorithm/for_each

In [None]:
!echo -n "[g++]       " && rm -f daxpy && g++     {flags} exercise8.cpp -ltbb             && ./daxpy {N}
!echo -n "[clang++]   " && rm -f daxpy && clang++ {flags} exercise8.cpp -ltbb             && ./daxpy {N}
!echo -n "[nvc++ CPU] " && rm -f daxpy && nvc++   {flags} exercise8.cpp -stdpar=multicore && ./daxpy {N}
!echo -n "[nvc++ GPU] " && rm -f daxpy && nvc++   {flags} exercise8.cpp -stdpar=gpu       && ./daxpy {N}

### Solutions Exercise 8

The following block compiles and runs the [solutions/exercise8.cpp]:

[solutions/exercise8.cpp]: ./solutions/exercise8.cpp

In [None]:
!echo -n "[g++]       " && rm -f daxpy && g++     {flags} solutions/exercise8.cpp -ltbb             && ./daxpy {N}
!echo -n "[clang++]   " && rm -f daxpy && clang++ {flags} solutions/exercise8.cpp -ltbb             && ./daxpy {N}
!echo -n "[nvc++ CPU] " && rm -f daxpy && nvc++   {flags} solutions/exercise8.cpp -stdpar=multicore && ./daxpy {N}
!echo -n "[nvc++ GPU] " && rm -f daxpy && nvc++   {flags} solutions/exercise8.cpp -stdpar=gpu       && ./daxpy {N}

## More optional exercises

For more optional exercises, check out [Lab 1: Select](../lab1_select/select.ipynb).