# Performance Optimization and Analysis

In this tutorial, we will:

* learn to optimize the performance of an `Operator`,
* investigate the effects of optimization in two real-life seismic inversion `Operator`s,
* analyze and interpret the performance report displayed after a run.

We will rely on preset models and `Operator`s from a seismic inversion problem based on an **isotropic acoustic wave equation**. To run one such `Operator`, namely a forward modeling operator, we will use the `benchmark.py` module, which provides a number of options to configure the simulation and to try out different optimizations. `benchmark.py` is intended to let newcomers play with Devito, and with its performance optimizations, without having to know anything about its symbolic language, mechanisms and functioning.

In [None]:
import examples.seismic.benchmark
benchmark = examples.seismic.benchmark.__file__

In [None]:
# For a full list of options
%run $benchmark --help

OK, now we want Devito to run this `Operator`.

In [None]:
%run $benchmark run -P acoustic

That was simple. Of course, we may want to run the same simulation on a bigger grid, with different grid point spacing or space order, and so on. And yes, we'll do this later on in the tutorial. But before scaling up in problem size, we shall take a look at what sort of performance optimizations we'll be able to apply to speed it up.

In essence, there are four knobs we can play with to maximize the `Operator` performance (or to see how the performance varies when adding or removing specific transformations):

* parallelism,
* the Devito Symbolic Engine (DSE) optimization level,
* the Devito Loop Engine (DLE) optimization level,
* loop blocking auto-tuning.

## Shared-memory parallelism

We start with the most obvious -- parallelism. Devito implements shared-memory parallelism via OpenMP. To enable it, the following environment variable needs to be set:

In [None]:
%env DEVITO_OPENMP=1

Multiple threads will now be used when running an `Operator`. But how many? And how efficiently? We may be running on a multi-socket node: should we treat it as a "flat" system, or what?

Devito aims to use distributed-memory parallelism over multi-socket nodes; that is, it allocates one MPI process per socket, and each MPI process should spawn as many OpenMP threads as the number of cores on the socket. Users don't get all this for free, however; a minimal configuration effort is required. But don't worry: as you shall see, it's much simpler than it sounds!

For this tutorial, we forget about MPI and focus instead on enabling OpenMP on a single socket. So, first of all, we want to restrict execution to a single socket: we want threads to stay on that socket without ever migrating to other cores of the system due to OS scheduling. Are we really on a multi-socket node? And how many cores does a socket have? Let's find out. We shall use a standard tool, `lscpu`, available on Linux systems.

In [None]:
! lscpu

A line beginning with `'NUMA node...'` represents one specific socket. Its value (on the right-hand side, after the ':') indicates the ID ranges of its logical cores. For example, if our node consisted of two sockets, each socket having 8 physical cores and 2 hyperthreads per core, we would see something like

```
...
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
...
```
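The reading of the `NUMA node...` lines described above can be automated. The following is a minimal sketch, not part of Devito: `parse_numa_nodes` and `sample` are hypothetical names, and the sample text simply reuses the two-socket layout from the example above.

```python
import re


def parse_numa_nodes(lscpu_output):
    """Map each NUMA node ID to the list of logical core IDs it contains."""
    nodes = {}
    for line in lscpu_output.splitlines():
        # Matches e.g. "NUMA node0 CPU(s):     0-7,16-23"
        match = re.match(r"NUMA node(\d+) CPU\(s\):\s*(\S+)", line.strip())
        if not match:
            continue
        cpus = []
        for chunk in match.group(2).split(","):
            # A chunk is either a single ID ("5") or an inclusive range ("0-7")
            first, _, last = chunk.partition("-")
            cpus.extend(range(int(first), int(last or first) + 1))
        nodes[int(match.group(1))] = cpus
    return nodes


# Hypothetical lscpu excerpt for the two-socket node described above
sample = """\
NUMA node(s):          2
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
"""

nodes = parse_numa_nodes(sample)
for node_id, cpus in sorted(nodes.items()):
    print("socket %d: %d logical cores, IDs %s" % (node_id, len(cpus), cpus))
```

On a real system, the text passed to the function could come from `subprocess.run(["lscpu"], capture_output=True, text=True).stdout`. Which of the listed IDs are hyperthread siblings depends on the machine, so the split into "first half physical, second half hyperthreads" assumed in the example should be checked rather than taken for granted.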

Now, say we choose to run on the 8 cores of socket 0 (``node0``). We then simply have to set the following OpenMP environment variable:

In [None]:
%env OMP_NUM_THREADS=8

Thus, 8 threads will be spawned each time an `Operator` is run. They will be killed as soon as the `Operator` has done its job. 

We also want to **bind** them to the physical cores of socket 0; that is, we want to prevent OS-induced migration. This is known as *thread pinning*. One may use a program such as ``numactl`` or, alternatively, exploit other OpenMP environment variables. If the Intel compiler is at our disposal, we can also enforce pinning through the following two-step procedure:

* We ask Devito to use the Intel compiler through the special `DEVITO_ARCH` environment variable;
* We set the Intel-specific `KMP_HW_SUBSET` and `KMP_AFFINITY` environment variables.

Let's see how we can do this in practice, and what's the impact on performance.

We run the isotropic acoustic forward operator again, but this time with a much larger grid, a 256x256x256 cube, and a more realistic space order, 8. We also shorten the duration by deliberately choosing a very small simulation end time (50 ms).

In [None]:
# Thread pinning
%env KMP_HW_SUBSET=8c,1t
%env KMP_AFFINITY=compact
# Tell Devito to use the Intel compiler
%env DEVITO_ARCH=intel
# or, equivalently, programmatically
from devito import configuration
configuration['compiler'] = 'intel'

for i in range(3):
    print("Run %d" % i)
    %run $benchmark run -P acoustic -so 8 -d 256 256 256 --tn 50

Observation: the execution times are stable. This is a sign that thread pinning is working. In practice, don't forget to double-check by looking at OpenMP reports, by using profilers (e.g., Intel VTune), or through user-friendly tools such as `htop`.
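When the Intel compiler is not available, the standard OpenMP environment variables `OMP_PLACES` and `OMP_PROC_BIND` (part of the OpenMP specification since version 4.0) offer a portable way to request pinning. The sketch below is only an illustration: `omp_places` and `socket0_cores` are hypothetical names, and the core IDs 0-7 are an assumption taken from the `lscpu` example earlier in this tutorial.

```python
import os


def omp_places(core_ids):
    """Build an explicit OMP_PLACES list with one place per logical core ID."""
    return ",".join("{%d}" % core for core in core_ids)


# Assumption: the physical cores of socket 0 have IDs 0-7 (see the lscpu example)
socket0_cores = list(range(8))

os.environ["OMP_NUM_THREADS"] = str(len(socket0_cores))
os.environ["OMP_PLACES"] = omp_places(socket0_cores)
# "close" asks the runtime to keep threads on consecutive places
os.environ["OMP_PROC_BIND"] = "close"

print(os.environ["OMP_PLACES"])
```

These variables are typically read when the OpenMP runtime initializes, so they should be set before the first `Operator` is run. Setting `OMP_DISPLAY_ENV=true` as well makes a conforming runtime print the settings it actually picked up, which is a quick way to check that the request was honoured.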

## DSE - Devito Symbolic Engine

We know how to switch on parallelism. So, it's finally time to see what kind of optimizations can be applied to Devito-generated kernels. By default, Devito aggressively optimizes all `Operator`s. When running through `benchmark.py`, however, optimizations are left disabled until users explicitly request them. This, hopefully, simplifies initial experimentation and investigation.

Let's then dive into the Devito Symbolic Engine (or DSE) section of this tutorial. It is worth observing that the DSE is one of the distinguishing features of Devito w.r.t. many other stencil frameworks! Why is that? This is what the documentation says about the DSE:

> [The DSE performs] Flop-count optimizations - They aim to reduce the operation count of an Operator. These include, for example, code motion, factorization, and detection of cross-stencil redundancies. The flop-count optimizations are performed by routines built on top of SymPy.

So the DSE reduces the flop count of `Operator`s. This is particularly useful in the case of complicated PDEs, for example those making extensive use of differential operators, and it is even more important in high-order methods. In such cases, it's not unusual to end up with kernels requiring hundreds of arithmetic operations per grid point calculation. Since Devito doesn't make assumptions about the PDEs, the presence of an optimization system such as the DSE becomes of fundamental importance. In fact, we know that its impact has been remarkable in real-life seismic inversion operators that have been written on top of Devito (e.g., TTI operators).
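To make the idea of factorization concrete, here is a small pure-Python sketch. It does not use Devito or SymPy and is not how the DSE is implemented; the `Counted` class, the `naive` and `factored` functions, and the coefficient values are all hypothetical. It evaluates a symmetric 9-point stencil (the width one gets in one dimension with space order 8) in two ways and counts the arithmetic operations each one needs.

```python
class Counted:
    """A number that counts every addition and multiplication applied to it."""
    ops = 0  # operation counter shared by all instances

    def __init__(self, value):
        self.value = float(value)

    def __add__(self, other):
        Counted.ops += 1
        return Counted(self.value + other.value)

    def __mul__(self, other):
        Counted.ops += 1
        return Counted(self.value * other.value)


def naive(c, u, i):
    """Sum of c[|k|] * u[i + k] for k = -4..4: one multiplication per point."""
    acc = c[4] * u[i - 4]
    for k in range(-3, 5):
        acc = acc + c[abs(k)] * u[i + k]
    return acc


def factored(c, u, i):
    """Same sum, with the symmetric coefficients factored out."""
    acc = c[0] * u[i]
    for k in range(1, 5):
        acc = acc + c[k] * (u[i - k] + u[i + k])
    return acc


def run_counted(func, *args):
    """Evaluate func and return (numerical value, number of operations)."""
    Counted.ops = 0
    result = func(*args)
    return result.value, Counted.ops


# Arbitrary illustrative values, not real finite-difference coefficients
c = [Counted(v) for v in (-2.0, 1.0, 0.5, 0.25, 0.125)]
u = [Counted(x) for x in range(9)]

naive_value, naive_ops = run_counted(naive, c, u, 4)
factored_value, factored_ops = run_counted(factored, c, u, 4)
print("naive:    %d operations" % naive_ops)     # 17
print("factored: %d operations" % factored_ops)  # 13
```

Both versions compute the same value, but the factored one saves four multiplications per grid point in a single dimension. In a 3D kernel with several differential operators such savings add up, which is why this kind of rewriting matters most for complicated PDEs.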

Let's see what happens when we enable the DSE in our acoustic operator.

In [None]:
%run $benchmark run -P acoustic -so 8 -d 256 256 256 --tn 50 -dse advanced

Compared to the previous runs, do you notice any change ...

* in the Devito output reports?
* in execution times? why?

And why, from a performance analysis point of view, is the DSE still useful even when no change in execution time is observed?
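One possible way to reason about these questions is the roofline model, in which the attainable performance of a kernel is the minimum of the machine's peak performance and the product of the kernel's operational intensity and the memory bandwidth. The sketch below uses entirely hypothetical numbers for the machine and for the kernel (they are not measurements of the acoustic `Operator`), and the function names are made up for illustration.

```python
def attainable_gflops(flops_per_point, bytes_per_point, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(peak, operational intensity * memory bandwidth)."""
    operational_intensity = flops_per_point / bytes_per_point  # flops per byte
    return min(peak_gflops, operational_intensity * bandwidth_gbs)


def seconds_per_gpoint(flops_per_point, gflops):
    """Time needed to update 10^9 grid points at the given GFlops/s rate."""
    return flops_per_point / gflops


# Hypothetical machine: 500 GFlops/s peak, 60 GB/s memory bandwidth
PEAK, BANDWIDTH = 500.0, 60.0
# Hypothetical kernel: 40 bytes of memory traffic per grid point
BYTES = 40.0

# Hypothetical flop counts per grid point, before and after flop reduction
rate_before = attainable_gflops(100.0, BYTES, PEAK, BANDWIDTH)
rate_after = attainable_gflops(60.0, BYTES, PEAK, BANDWIDTH)

time_before = seconds_per_gpoint(100.0, rate_before)
time_after = seconds_per_gpoint(60.0, rate_after)

print("before: %.1f GFlops/s, %.3f s per 10^9 points" % (rate_before, time_before))
print("after:  %.1f GFlops/s, %.3f s per 10^9 points" % (rate_after, time_after))
```

Under these assumptions both variants sit on the bandwidth-limited part of the roofline, so the run time is set by memory traffic and does not change when flops are removed, while the reported GFlops/s drops. If a kernel were compute-bound instead, the same flop reduction would translate directly into a shorter run time.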

## DLE - Devito Loop Engine

TODO: run the acoustic operator with the DLE switched to advanced. How does the output change? Is there any difference in performance? Why?
What about 3D blocking?
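For readers unfamiliar with loop blocking (also known as tiling), the following pure-Python sketch shows the transformation on a 2D iteration space. It is only an illustration of the general technique, not the code that the DLE generates; `blocked_indices` and the block sizes are hypothetical.

```python
def blocked_indices(nx, ny, bx, by):
    """Yield all (i, j) points of an nx-by-ny grid, one bx-by-by block at a time."""
    for ii in range(0, nx, bx):          # loop over blocks along x
        for jj in range(0, ny, by):      # loop over blocks along y
            # Visit the points inside the current block; min() handles
            # blocks that stick out past the grid boundary
            for i in range(ii, min(ii + bx, nx)):
                for j in range(jj, min(jj + by, ny)):
                    yield i, j


# A 4x4 grid traversed in 2x2 blocks
order = list(blocked_indices(4, 4, 2, 2))
print(order[:4])  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The blocked traversal visits exactly the same points as the plain nested loops, only in a different order, so that points close to each other in space are processed close to each other in time. In compiled code this tends to improve cache reuse; the best block shape depends on the machine and the kernel, which is what auto-tuning tries to determine.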

## Loop blocking auto-tuning

Switch it on. Basic vs aggressive mode. Perhaps something in between. How does the acoustic performance vary? Why?

## Automated generation of roofline plots

Show this cute feature

## A sneak peek at the YASK backend

YASK -- Yet Another Stencil Kernel -- is

> a framework to facilitate exploration of the HPC stencil-performance design space.

It operates at a much lower level of abstraction than Devito (e.g., no symbolic language is available). We've been working on integrating YASK with Devito so that Devito users can exploit the YASK technology seamlessly, without having to change their code. After months of work, we can finally run some non-trivial `Operator`s while reusing all of the Devito infrastructure presented so far (including the symbolic optimizations provided by the DSE). In this section, we show how simple it is, from a user's point of view, to exploit YASK in Devito: it is no more difficult than setting an environment variable! Then, we also look at the performance of the acoustic `Operator` with and without YASK.

## A sneak peek at the TTI forward Operator

Quickly show unoptimized (but parallel) vs +loop_blocking vs +loop_blocking+DSEaggressive

<sup>This notebook is part of the tutorial "Optimised Symbolic Finite Difference Computation with Devito" presented at the Intel® HPC Developer Conference 2017.</sup>