# Performance analysis and tuning

In this tutorial, we will:

* learn to optimize the performance of an `Operator`,
* investigate the effects of optimization in two real-life seismic inversion `Operator`s,
* analyze and interpret the performance report displayed after a run.

We will rely on preset models and `Operator`s from the following problems:

* Acoustic isotropic wave equation
* TTI (TODO: expand)

To access them, we will use a convenience module available in Devito, namely ``benchmark.py``.

In [None]:
import examples.seismic.benchmark
benchmark = examples.seismic.benchmark.__file__

# For a full list of options
# %run $benchmark --help

Now we want Devito to run an `Operator` as quickly as it can.

In essence, there are four knobs we can play with to improve the execution time of an `Operator` (or, simply, to see how the performance varies when adding or removing a specific transformation):

* parallelism,
* the Devito Symbolic Engine (DSE) optimization level,
* the Devito Loop Engine (DLE) optimization level,
* loop blocking auto-tuning.

## Shared-memory parallelism

We start with the most obvious -- parallelism. Devito implements shared-memory parallelism via OpenMP. To enable it, the following environment variable needs to be set:

In [None]:
%env DEVITO_OPENMP=1

This suffices to use multiple threads when running an `Operator`. We should, however, also restrict execution to a single socket and enforce thread pinning. For this, we first need to understand a bit more about the underlying CPU architecture. Let's find out how many cores each of our CPU sockets has with a standard utility such as:

In [None]:
! lscpu
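The NUMA lines in this output list the logical CPUs attached to each socket, and they can also be read programmatically. The following is a minimal sketch, not part of Devito: `parse_cpu_list` is a hypothetical helper, the sample line is hard-coded rather than captured from `lscpu`, and two hardware threads per core are assumed.

```python
def parse_cpu_list(spec):
    """Expand an lscpu-style CPU list such as '0-7,16-23' into a list of ints."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as '0-7' is inclusive on both ends
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


# Sample line, as it might appear in the output of `lscpu`
line = "NUMA node0 CPU(s): 0-7,16-23"
logical_cpus = parse_cpu_list(line.split(":", 1)[1])

# Assumption: two hardware threads per core (hyper-threading enabled)
threads_per_core = 2
physical_cores = len(logical_cpus) // threads_per_core

print(logical_cpus)
print(physical_cores)
```

On a machine without hyper-threading, `threads_per_core` would be 1; the `Thread(s) per core` line of `lscpu` reports the actual value.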

Let's assume that our CPU architecture consists of multiple sockets, each socket having 8 physical cores. This may be inferred from lines such as ``NUMA node0 CPU(s): 0-7,16-23``: with two hardware threads per core, the 16 logical CPUs listed for ``node0`` correspond to 8 physical cores. If we choose to use the 8 physical cores on socket 0 (``node0``), we should set the OpenMP environment variable ``OMP_NUM_THREADS`` to:

In [None]:
%env OMP_NUM_THREADS=8

Thus, 8 threads will be spawned. Now we have to bind them to the physical cores of socket 0. One may use a program such as ``numactl`` or take a look at other OpenMP environment variables. More simply, if the Intel compiler is at our disposal, to achieve this we can:

* tell Devito to use the Intel compiler through ``DEVITO_ARCH``, and
* set the Intel-specific ``KMP_HW_SUBSET`` environment variable

as follows:

In [None]:
%env DEVITO_ARCH=intel
%env KMP_HW_SUBSET=8c,1t
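When the Intel compiler is not available, the standard OpenMP variables ``OMP_PLACES`` and ``OMP_PROC_BIND`` offer a portable way to pin threads. The sketch below is only an illustration: `omp_places` is a hypothetical helper, and it assumes that logical CPUs 0-7 are the first hardware thread of each physical core on socket 0, which should be checked against the `lscpu` output. The variables also need to be set before the OpenMP runtime is initialized in the current process, otherwise they may be ignored.

```python
import os


def omp_places(cpus):
    """Build an explicit OMP_PLACES value with one place per logical CPU."""
    return ",".join("{%d}" % cpu for cpu in cpus)


# Assumption: logical CPUs 0-7 map to the 8 physical cores of socket 0
socket0_cores = list(range(8))

# One thread per place, each place being a single logical CPU
os.environ["OMP_NUM_THREADS"] = str(len(socket0_cores))
os.environ["OMP_PLACES"] = omp_places(socket0_cores)
# Keep threads on consecutive places, close to the primary thread
os.environ["OMP_PROC_BIND"] = "close"

print(os.environ["OMP_PLACES"])
```

Setting `os.environ` from a cell has the same effect as the `%env` magic used above; it is used here only so that the place list can be computed rather than typed by hand.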

Now, Devito will run each `Operator` with 8 threads, each thread bound to a physical core of socket 0. Did it work? Let's try running the acoustic forward operator.

TODO: Fix a realistic space order as used in seismic inversion, such as 8

In [None]:
%run $benchmark run -P acoustic

TODO: need openmp enabled as a special dle mode, switched on automatically if DEVITO_OPENMP=1

TODO: show diff in runtime, perhaps point to generated code

## DSE - Devito Symbolic Engine

Let's now jump to the DSE -- probably one of the distinguishing features of Devito with respect to other stencil frameworks!

This is what the documentation says about the DSE:
```
[The DSE performs] Flop-count optimizations - They aim to reduce the operation count of an Operator. These include, for example, code motion, factorization, and detection of cross-stencil redundancies. The flop-count optimizations are performed by routines built on top of SymPy.
```

TODO: ask them to run acoustic with DSE advanced (skip basic?). What changes? Does the runtime significantly differ? Why?

## DLE - Devito Loop Engine

TODO: acoustic switching DLE to advanced. How does the output change? Is there any difference in performance? Why?
How about 3D blocking?

## Loop blocking auto-tuning

Switch it on. Basic vs aggressive mode. Perhaps something in between. How does the acoustic performance vary? Why?

## Automated generation of roofline plots

Show this cute feature

## TTI

Quickly show unoptimized (but parallel) vs +loop_blocking vs +loop_blocking+DSEaggressive

## A sneak peek at the YASK backend

YASK -- Yet Another Stencil Kernel -- is
```
a framework to facilitate exploration of the HPC stencil-performance design space.
```
It operates at a level of abstraction much lower than Devito's (e.g., no symbolic language is available). We've been working on integrating YASK with Devito so that Devito users can exploit the YASK technology seamlessly, without having to change their code. After months of work, we can finally run some non-trivial `Operator`s while reusing all of the Devito infrastructure presented so far (including the symbolic optimizations provided by the DSE). In this section, we show how simple it is, from a user's point of view, to exploit YASK in Devito -- no more difficult than setting an environment variable! Then, we also look at the performance of the acoustic `Operator` with and without YASK.

<sup>This notebook is part of the tutorial "Optimised Symbolic Finite Difference Computation with Devito" presented at the Intel® HPC Developer Conference 2017.</sup>