# Reductions

## Overview

When multiple threads update the same variable without synchronization, the result is a race condition.
\
This pattern frequently arises when threads _accumulate_ their local values into a single variable, e.g. to compute an overall sum or minimum.

Let's load our custom magic and consider an example.

In [None]:
%load_ext ice.magic

In [None]:
%%cpp_omp -o code/reductions/race-cond.cpp

int sum = 0;
#pragma omp parallel num_threads(1024)
    sum += 1;

std::cout << sum << std::endl;

The result is usually incorrect and not deterministic.
You can verify the latter by running the application multiple times.

Options to fix this issue include
* adding additional [synchronization](synchronization.ipynb),
* serialization of the summation, and
* using OpenMP reductions.
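
The first option can be sketched with the `atomic` construct, which is only one of several possible synchronization mechanisms. The function name `atomic_sum` is made up for this illustration, and a `parallel for` with a fixed iteration count is used instead of a fixed thread count so that the expected result does not depend on how many threads the runtime grants.

```cpp
// Hypothetical helper (not part of the course code): performs n increments
// of a shared counter from a parallel loop.
int atomic_sum(int n) {
    int sum = 0;

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        // The atomic construct makes the read-modify-write of sum indivisible,
        // so no increment is lost.
        #pragma omp atomic
        sum += 1;
    }

    return sum;
}
```

Calling `atomic_sum(1024)` returns 1024 on every run. Every update is synchronized individually, which is correct but can become a bottleneck when many threads hit the same variable.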

## Serialization

The core idea is to have a local contribution per thread which is stored in an array with as many elements as threads.
Then, after the parallel region, the global sum is computed serially.

<div class="alert alert-block alert-info"> <b>Note:</b> This works robustly, but may be slower than using OpenMP reductions. In some cases it is still preferable since the result of the operation is always <i>reproducible</i>. </div>

In [None]:
%%cpp_omp -o code/reductions/serialized.cpp

int sum = 0;
int localSums[1024] = {}; //# one zero-initialized slot per requested thread
#pragma omp parallel num_threads(1024)
    localSums[omp_get_thread_num()] = 1;

for (auto i = 0; i < 1024; ++i)
    sum += localSums[i];

std::cout << sum << std::endl;
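
The hard-coded array size can be avoided by sizing the buffer from the requested thread count. The following is a minimal sketch under that assumption; the function name `serialized_thread_count` is hypothetical, and the `_OPENMP` guards are only there so that the snippet also compiles when OpenMP is disabled.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not part of the course code): every thread writes 1
// into its own slot, then the slots are summed serially. The return value is
// the number of threads that actually took part in the parallel region.
int serialized_thread_count(int requested_threads) {
    // One zero-initialized slot per requested thread. Slots of threads that
    // the runtime does not grant simply stay 0.
    std::vector<int> localSums(requested_threads, 0);

    #pragma omp parallel num_threads(requested_threads)
    {
#ifdef _OPENMP
        const int tid = omp_get_thread_num();
#else
        const int tid = 0; // serial fallback when built without OpenMP
#endif
        localSums[tid] = 1;
    }

    // Serial summation in a fixed order.
    int sum = 0;
    for (int v : localSums)
        sum += v;

    return sum;
}
```

Because the runtime may provide fewer threads than requested, the result lies between 1 and `requested_threads`. The zero-initialization is what keeps the serial loop well-defined in that case.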

## Reductions

OpenMP reductions can be combined with different OpenMP constructs ([OpenMP 5.1 - 2.21.5](https://www.openmp.org/spec-html/5.1/openmpsu117.html)), e.g. with the parallel construct.
\
Specifying a reduction is done by adding `reduction( op : var )` where
* `op` is the operation to be performed
* `var` is the variable to be reduced

| Language  | Supported `op`                                                                         |
|-----------|----------------------------------------------------------------------------------------|
| C and C++ | `+`, `-`, `*`, `&`, `\|`, `^`, `&&`, `\|\|`, `max`, `min`                              |
| Fortran   | `+`, `-`, `*`, `.and.`, `.or.`, `.eqv.`, `.neqv.`, `max`, `min`, `iand`, `ior`, `ieor` |

In [None]:
%%cpp_omp -o code/reductions/reduction.cpp

int sum = 0;
#pragma omp parallel num_threads(1024) reduction( + : sum )
    sum += 1;

std::cout << sum << std::endl;

In the same fashion, a reduction can be added to `parallel for`.
Note that this time the number of threads is not fixed.

In [None]:
%%cpp_omp -o code/reductions/parallel-for.cpp

int sum = 0;
#pragma omp parallel for reduction( + : sum )
for (auto i = 0; i < 1024; ++i)
    sum += 1;

std::cout << sum << std::endl;
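
Reductions are not limited to sums. The overview mentions computing a minimum, and in C and C++ the identifiers `min` and `max` can be used for this (available since OpenMP 3.1). The sketch below assumes a hypothetical helper `min_max` and a small result struct, both introduced only for this illustration. It also shows that several reduction clauses can appear on the same directive.

```cpp
#include <limits>
#include <vector>

// Hypothetical result type for this sketch.
struct MinMax {
    int min;
    int max;
};

// Hypothetical helper (not part of the course code): computes the smallest
// and largest entry of values with two reductions on one parallel loop.
MinMax min_max(const std::vector<int>& values) {
    // Start from the neutral elements of min and max.
    int lo = std::numeric_limits<int>::max();
    int hi = std::numeric_limits<int>::lowest();
    const int n = static_cast<int>(values.size());

    #pragma omp parallel for reduction(min : lo) reduction(max : hi)
    for (int i = 0; i < n; ++i) {
        if (values[i] < lo) lo = values[i];
        if (values[i] > hi) hi = values[i];
    }

    return {lo, hi};
}
```

For the input `{3, -7, 42, 0}` the function returns a minimum of -7 and a maximum of 42, independent of the number of threads.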

## Reductions for Nested Constructs

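Reductions can also appear on enclosing regions that contain multiple nested OpenMP constructs.
In the following example, the contributions of both loops end up in the same `sum`, so the expected output is 1024 · 1 + 1024 · 2 = 3072.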

In [None]:
%%cpp_omp -o code/reductions/nested.cpp

int sum = 0;
#pragma omp parallel reduction( + : sum )
{
    #pragma omp for
    for (auto i = 0; i < 1024; ++i)
        sum += 1;

    #pragma omp for
    for (auto i = 0; i < 1024; ++i)
        sum += 2;
}

std::cout << sum << std::endl;
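
Conceptually, the reduction clause gives every thread a private copy of `sum`, initialized to the neutral element of the operation, and combines all copies into the original variable at the end of the region. The following hand-written version is a rough sketch of that idea, not a description of how any particular OpenMP implementation realizes it. The function name `nested_sum_by_hand` and the variable `local` are made up for this illustration.

```cpp
// Hypothetical helper (not part of the course code): mimics
// "parallel reduction(+ : sum)" around two work-sharing loops by hand.
int nested_sum_by_hand(int n) {
    int sum = 0;

    #pragma omp parallel
    {
        // Private copy per thread, initialized to 0, the neutral element of +.
        int local = 0;

        #pragma omp for
        for (int i = 0; i < n; ++i)
            local += 1;

        #pragma omp for
        for (int i = 0; i < n; ++i)
            local += 2;

        // Combine once per thread instead of once per iteration.
        #pragma omp atomic
        sum += local;
    }

    return sum;
}
```

With `n = 1024` this yields 3072, the same value as the version with the reduction clause. The clause is usually preferable since it is shorter and leaves the combination strategy to the runtime.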

## Scopes

Adding `scope` directives (introduced in OpenMP 5.1) allows for reductions inside of parallel regions.

In [None]:
%%cpp_omp -o code/reductions/scope.cpp

int sum = 0;
#pragma omp parallel
{
    //# sum must not be private here
    #pragma omp scope reduction( + : sum )
    {
        #pragma omp for
        for (auto i = 0; i < 1024; ++i)
            sum += 1;
    } //# implicit barrier - sum is available

    if (12 == omp_get_thread_num() || 24 == omp_get_thread_num())
        std::cout << sum << std::endl;
}

std::cout << sum << std::endl;

## Exercise: 2D Stencil Residual

<div class="alert alert-block alert-success"> <b>Exercise:</b> Fuse the loops and add a reduction. </div>

Check the code for the 2D stencil application at [code/examples/stencil-2d.cpp](code/examples/stencil-2d.cpp) and the documentation in the [examples notebook](examples.ipynb#2D-Stencil).
For convenience, the cells for building and executing are copied below.
\
Refactor the code such that
* The residual is computed every iteration. You can find the computation at the end of the main function.
* The residual is computed in parallel. Remember to add a suitable reduction clause.

How does the performance change compared to the original version?

In [None]:
!g++ -O3 -std=c++17 -Wall -o code/examples/stencil-2d code/examples/stencil-2d.cpp

In [None]:
!code/examples/stencil-2d 4096 4096 64

### Solution

You can find one possible solution at [code/solutions/reductions/stencil-2d.cpp](code/solutions/reductions/stencil-2d.cpp).

In [None]:
!g++ -O3 -std=c++17 -Wall -fopenmp -o code/solutions/reductions/stencil-2d code/solutions/reductions/stencil-2d.cpp
!code/solutions/reductions/stencil-2d 4096 4096 64