![DLI Header](../images/DLI_Header.png)

# Parallelizing Daxpy and Initialization

In this notebook you will parallelize the `initialize` and `daxpy` functions so that they run on multicore CPUs or GPUs.

## Learning Objectives

By the time you complete this notebook you should:

- Be able to use execution policies to enable the parallel execution of standard library algorithms
- Be able to use lambda captures to appropriately capture scalars and data pointers by value, enabling support across all parallel environments
- Observe dramatic performance improvements by running your DAXPY program on the GPU

## Exercise 3: Refactor `daxpy` and `initialize` to Run in Parallel

For this exercise you will work with [exercise3.cpp](exercise3.cpp), which starts from the solution to exercise 2. Add execution policies where appropriate and, as discussed in the earlier presentation, use lambda captures to capture scalars and pointers to data by value.

The `TODO`s indicate the parts of the source code that need adding or refactoring. The parts of the file that contain `TODO`s are shown below.

```c++
#include <ranges>
// TODO: add C++ standard library includes as necessary

...

/// Initialize vectors `x` and `y`: parallel algorithm version
void initialize(std::vector<double> &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  // TODO: Parallelize initialization of `x`
  auto ints = std::views::iota(0);
  std::for_each_n(ints.begin(), x.size(), [&x](int i) { x[i] = (double)i; });
  // TODO: Parallelize initialization of `y`
  std::fill_n(y.begin(), y.size(), 2.);
}

/// DAXPY: AX + Y: sequential algorithm version
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  /// TODO: Parallelize DAXPY computation
  std::transform(x.begin(), x.end(), y.begin(), y.begin(),
                 [&](double x, double y) { return a * x + y; });
}
```
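
The capture guidance given earlier can be illustrated on a small, unrelated function. The following is a minimal sketch, not the exercise solution: `scale_in_place` and `make_indices` are hypothetical names, and an explicit index vector stands in for `std::views::iota` so that the snippet also builds as C++17. The algorithm is called without an execution policy here; the point is only that the lambda holds its own copies of the scalar and of the data pointer rather than references to host-side objects.

```c++
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: returns {0, 1, ..., n-1}.
std::vector<std::size_t> make_indices(std::size_t n) {
  std::vector<std::size_t> idx(n);
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  return idx;
}

// Hypothetical example: multiply every element of `v` by `s`.
void scale_in_place(double s, std::vector<double> &v) {
  auto idx = make_indices(v.size());
  // `s` is copied into the lambda; `p` is a copy of the pointer to the
  // vector's storage, so the lambda never refers to the vector object itself.
  std::for_each_n(idx.begin(), v.size(),
                  [s, p = v.data()](std::size_t i) { p[i] = s * p[i]; });
}
```

Calling `scale_in_place(2.0, v)` on a vector holding `1, 2, 3` leaves it holding `2, 4, 6`. A `[&]` capture would compile equally well on the CPU, which is why this kind of mistake tends to show up only when targeting the GPU.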

### Compile and Run

Compiling with support for the parallel algorithms requires:
* `g++` and `clang++`: link against Intel TBB with `-ltbb`
* `nvc++`: compile and link with `-stdpar` flag:
  * `-stdpar=multicore` runs parallel algorithms on CPUs
  * `-stdpar=gpu` runs parallel algorithms on GPUs; further `-gpu=` flags control the GPU target
  * See the [Parallel Algorithms Documentation](https://docs.nvidia.com/hpc-sdk/compilers/c++-parallel-algorithms/index.html).
    
The example compiles, runs, and produces correct results as provided. Parallelize it using the C++ standard library parallel algorithms and ensure that the results are still correct. You should see a drastic performance increase when running the program on the GPU (see the solution below if necessary).
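
As a reminder of the mechanics, a parallel algorithm call differs from its sequential form only in the execution policy passed as the first argument. The sketch below applies this to an unrelated computation; `squares` is a hypothetical function that is not part of `exercise3.cpp`, and `std::execution::par_unseq` is one reasonable policy choice rather than necessarily the one the solution uses.

```c++
#include <algorithm>
#include <execution>
#include <vector>

// Hypothetical example: returns a vector holding the square of each input.
std::vector<double> squares(std::vector<double> const &in) {
  std::vector<double> out(in.size());
  // The policy is the first argument; the remaining arguments are unchanged
  // from the sequential call. The lambda captures nothing.
  std::transform(std::execution::par_unseq, in.begin(), in.end(), out.begin(),
                 [](double v) { return v * v; });
  return out;
}
```

As with the exercise file, building this with `g++` or `clang++` requires linking against TBB with `-ltbb`.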

The first 3 of the following blocks compile and run the program using different compilers on the CPU.

The last block compiles and runs the program on the GPU. If you get an error, make sure that each lambda captures scalars by value and that, when a lambda needs to access a vector's data, it captures a pointer to that data by value, for example with `[x = x.data()]`.

In [None]:
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise3.cpp -ltbb
!./daxpy 1000000

In [None]:
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise3.cpp -ltbb
!./daxpy 1000000

In [None]:
!nvc++ -stdpar=multicore -std=c++20 -O4 -fast -march=native -Mllvm-fast -DNDEBUG -o daxpy exercise3.cpp
!./daxpy 1000000

In [None]:
!nvc++ -stdpar=gpu -std=c++20 -O4 -fast -march=native -Mllvm-fast -DNDEBUG -o daxpy exercise3.cpp
!./daxpy 1000000

### Solutions for Exercise 3

The solution for this exercise is [`solutions/exercise3.cpp`](solutions/exercise3.cpp), which you can view if you get stuck or want to check your work.

In [None]:
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise3.cpp -ltbb
!./daxpy 1000000

In [None]:
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise3.cpp -ltbb
!./daxpy 1000000

In [None]:
!nvc++ -stdpar=multicore -std=c++20 -O4 -fast -march=native -Mllvm-fast -DNDEBUG -o daxpy solutions/exercise3.cpp
!./daxpy 1000000

In [None]:
!nvc++ -stdpar=gpu -std=c++20 -O4 -fast -march=native -Mllvm-fast -DNDEBUG -o daxpy solutions/exercise3.cpp
!./daxpy 1000000

## Conclusion

Thank you so much for your participation in *GPU Acceleration with the C++ Standard Library*. We sincerely hope you learned a lot and are excited to accelerate your own C++ applications.

Please take 1 minute to [complete the course survey](https://learn.next.courses.nvidia.com/courses/course-v1:DLI+S-AC-08+V1/courseware/85f2a3ac16a0476685257996b84001ad/aac082d17f5f412dbd38cbe13d905dc7/?activate_block_id=block-v1%3ADLI%2BS-AC-08%2BV1%2Btype%40sequential%2Bblock%40aac082d17f5f412dbd38cbe13d905dc7). Your input is important to us and helps us improve this content.

## Extra Credit

If interested, please proceed to [the extra credit materials](../05-Select/Select.ipynb) where you will solidify your understanding of the materials presented in this course by implementing a parallel program that requires the thoughtful composition of several standard parallel algorithms.
