# Performance Considerations

## Overview

Here are some (very rough) guidelines when optimizing for performance of OpenMP programs:

* OpenMP synchronization can impose significant overhead.
    * Use explicit barriers sparingly.
    * Use the appropriate [synchronization](synchronization.ipynb) mechanism if synchronization is necessary.

* Setting up threads can be costly.
    * Fusing parallel regions can improve performance.

* Overheads can overshadow performance gains.    
    * Make sure that each thread has sufficient work.
    * In some cases, *not* parallelizing a loop or region is the better choice.

* Different schedules can have vastly different performance.
    * Choose the appropriate worksharing [schedule](loops.ipynb#Scheduling) for your application.

Beyond these implementation-centric considerations, there are also hardware and OS considerations.
In particular, using correct pinning and honoring the first-touch policy can be critical.
Both will be discussed below.

In [None]:
%load_ext ice.magic

## Topology

<img src="img/performance/cc-numa.png" alt="cc-numa" width="50%" style="background-color:white"/>


Many present-day systems are built on ccNUMA (cache-coherent non-uniform memory access).
Here, memory is distributed over *locality domains* at the granularity of pages.
Bandwidth and latency differ for accesses from one core into different locality domains.

Rule of thumb: accesses to data from a thread running on any given core are faster if the memory is located closer to that core.

### Tools to Examine Topology

Below are some tools that give various levels of detail.
Note that not all tools may be available on all systems.

In [None]:
!lscpu | grep NUMA

In [None]:
!lstopo

In [None]:
!numactl -H

In [None]:
!likwid-topology

For NVIDIA GPUs, as will be used in the [target offloading](target-offloading.ipynb) notebook, the following can be useful.

In [None]:
!nvidia-smi
!nvidia-smi topo -m

## Pinning

By default, threads may be migrated across all available cores.
Pinning prevents this and assigns threads to fixed *resources*, which OpenMP calls *places*.
If a place includes more than one core, threads can use any of those cores.
If there are more threads than places, then threads are distributed round-robin.

OpenMP supports this mechanism via the environment variables `OMP_PLACES` ([OpenMP 5.1 - 6.5](https://www.openmp.org/spec-html/5.1/openmpse62.html)) and `OMP_PROC_BIND` ([OpenMP 5.1 - 6.4](https://www.openmp.org/spec-html/5.1/openmpse61.html)).
\
Setting `OMP_DISPLAY_AFFINITY` to `true` prints debug information about the thread affinity ([OpenMP 5.1 - 6.13](https://www.openmp.org/spec-html/5.1/openmpse70.html)).

| `OMP_PLACES`   | Places                                |
|----------------|---------------------------------------|
| `threads`      | hardware (SMT) threads/ virtual cores |
| `cores`        | physical cores                        |
| `ll_caches`    | cores sharing a last-level cache      |
| `numa_domains` | cores in a single NUMA domain         |
| `sockets`      | cores in a single socket              |

`OMP_PLACES` can also be a list of hardware IDs of cores as
* list, e.g. `OMP_PLACES="0,2,4,6"`,
* list of lists, e.g. `OMP_PLACES="{0,2},{4,6}"`, or
* range with length and optional stride, e.g. `OMP_PLACES="{0}:4:2"`.

| `OMP_PROC_BIND` | Effect                                             |
|-----------------|----------------------------------------------------|
| `false`         | disable affinity                                   |
| `true`          | enable affinity (implementation defined)           |
| `close`         | pin threads to adjacent places                     |
| `spread`        | distribute threads equally over available places   |
| `primary`       | pin all threads to the place of the initial thread |

<div class="alert alert-block alert-info"> <b>Note:</b> The equivalent clause in the parallel construct is <code>proc_bind</code>. </div>

## Exercise: Pinning

<div class="alert alert-block alert-success"> <b>Exercise:</b> Investigate pinning effects. </div>

Consider the example below.
It spawns a number of threads and keeps them busy for a fixed period of time (in this case one second).
In contrast to using a sleep method, this code generates actual CPU load, which makes it suitable for our exercise.

In [None]:
%%cpp_omp -o code/performance/busy-wait.cpp -v -e OMP_NUM_THREADS=12 OMP_DISPLAY_AFFINITY=true

#pragma omp parallel
{
    auto start = omp_get_wtime();
    auto duration = 0.;
    while (duration < 1.000) //# seconds
       duration = omp_get_wtime() - start;
}

Your tasks are
* Examine the topology of the compute node you are currently running on
* Open a separate terminal (`file` > `new` > `terminal`) and start `htop`
    * htop visualizes the current load per core
* Vary the number of threads and apply different pinning strategies
    * observe the respective effects with htop
    * feel free to change the time each thread 'works'

## NUMA Considerations

### First Touch

Under the first-touch policy, memory is typically allocated in two stages:
* Memory is initially only *reserved* but not yet associated with pages in RAM.
* Writing to not yet associated pages triggers the actual allocation.
    * A memory page is placed into the locality domain of the core that touches it first.

```cpp
int *vec = new int[1024];       // memory is 'reserved'

for (auto i = 0; i < 1024; ++i)
    vec[i] = 0;                 // first touch allocates pages in corresponding NUMA domain
```


<div class="alert alert-block alert-info"> <b>Note:</b> Policies other than first touch can be selected with <code>numactl</code>. </div>

In [None]:
%%cpp_omp -o code/performance/first-touch.cpp -e OMP_PROC_BIND=close OMP_PLACES=cores

constexpr auto N = 256 * 1024 * 1024;

int *vec = new int[N];

// initialization loop: candidate for parallelization with the same schedule as the work loop
for (auto i = 0; i < N; ++i)
    vec[i] = i;

auto start = omp_get_wtime();

#pragma omp parallel for schedule(static)
for (auto i = 0; i < N; ++i)
    vec[i] *= 2;

auto end = omp_get_wtime();
std::cout << "Total time: " << 1e3 * (end - start) << " ms" << std::endl;

Compare the performance when the first loop is also parallelized with OpenMP, using the same schedule as the main work loop.

### NUMA Balancing

The operating system can automatically migrate pages between NUMA nodes to improve performance (even though this incurs some initial overhead).
Whether this feature is active, as well as its related settings, can be checked with the following commands (for `numa_balancing`, 0 means disabled and 1 means enabled):

In [None]:
!cat /proc/sys/kernel/numa_balancing

In [None]:
!grep .\* /proc/sys/kernel/numa_balancing*

## Exercise: Pinned Stream Benchmark

<div class="alert alert-block alert-success"> <b>Exercise:</b> Investigate pinning. </div>

This exercise assumes that the stream benchmark application at [code/examples/stream.cpp](code/examples/stream.cpp) is already parallelized with OpenMP.
If this is not the case, you can copy the contents of a previous solution ([code/solutions/loops/stream.cpp](code/solutions/loops/stream.cpp)).
Also feel free to (re-)check the documentation in the [examples notebook](examples.ipynb#Stream-Benchmark).
For convenience, the cells for building and executing are copied below.
\
Investigate performance for the already parallelized stream by following these steps:
* Start with 72 threads, compact (close) pinning to cores, and static scheduling
    * Can you observe performance deviations for the first two iterations?
    * Parallelize the initialization as well and repeat your experiments. Did the performance change?
* Investigate the effect of applying pinning for 72 threads
    * Start with a static schedule. How does performance vary for other schedules?
    * How does performance compare to a version without pinning?
* Investigate performance for fewer than 72 threads
    * Check 1, 18 and 36 threads
    * Which pinning configuration yields the highest performance for each thread count assuming static scheduling?

In [None]:
!g++ -O3 -std=c++17 -Wall -fopenmp -o code/examples/stream code/examples/stream.cpp

In [None]:
!code/examples/stream $((32 * 1024 * 1024)) 4

### Solution

You can find one possible solution at [code/solutions/performance/stream.cpp](code/solutions/performance/stream.cpp).
The following cells evaluate different scheduling and pinning strategies.

We first establish a baseline by using the solution from the loops exercise (i.e. the one without parallel initialization).

In [None]:
!g++ -O3 -std=c++17 -Wall -fopenmp -o code/solutions/loops/stream code/solutions/loops/stream.cpp
!OMP_NUM_THREADS=72 \
    OMP_PROC_BIND=close \
    OMP_PLACES=cores \
    OMP_DISPLAY_AFFINITY=true \
    OMP_SCHEDULE=static \
    code/solutions/loops/stream $((32 * 1024 * 1024)) 4

Now we can compare it to our updated solution.

In [None]:
!g++ -O3 -std=c++17 -Wall -fopenmp -o code/solutions/performance/stream code/solutions/performance/stream.cpp

In [None]:
!OMP_NUM_THREADS=72 \
    OMP_PROC_BIND=close \
    OMP_PLACES=cores \
    OMP_DISPLAY_AFFINITY=true \
    OMP_SCHEDULE=static \
    code/solutions/performance/stream $((32 * 1024 * 1024)) 4

In [None]:
!OMP_NUM_THREADS=36 \
    OMP_PROC_BIND=close \
    OMP_PLACES=cores \
    OMP_DISPLAY_AFFINITY=false \
    OMP_SCHEDULE=static \
    code/solutions/performance/stream $((32 * 1024 * 1024)) 4

In [None]:
!OMP_NUM_THREADS=36 \
    OMP_PROC_BIND=spread \
    OMP_PLACES=cores \
    OMP_DISPLAY_AFFINITY=false \
    OMP_SCHEDULE=static \
    code/solutions/performance/stream $((32 * 1024 * 1024)) 4

In [None]:
!OMP_NUM_THREADS=18 \
    OMP_PROC_BIND=close \
    OMP_PLACES=cores \
    OMP_DISPLAY_AFFINITY=false \
    OMP_SCHEDULE=static \
    code/solutions/performance/stream $((32 * 1024 * 1024)) 4

In [None]:
!OMP_NUM_THREADS=18 \
    OMP_PROC_BIND=spread \
    OMP_PLACES=cores \
    OMP_DISPLAY_AFFINITY=false \
    OMP_SCHEDULE=static \
    code/solutions/performance/stream $((32 * 1024 * 1024)) 4

In [None]:
!OMP_NUM_THREADS=1 \
    OMP_PROC_BIND=close \
    OMP_PLACES=cores \
    OMP_DISPLAY_AFFINITY=false \
    OMP_SCHEDULE=static \
    code/solutions/performance/stream $((32 * 1024 * 1024)) 4