# Introducing the example: Diffusion 2D

## Overview

* **The brief physics intro**
  * 2D explicit diffusion using the finite-difference method

* **The code structure overview**
  * Compute and "main" functions
  * [`shared.jl`](./../shared.jl) (included) script

* **The output**
  * Visualisation
  * `Time` and `T_eff` - Performance reporting in the REPL

* **The serial loop version**
  * Macros, "race" conditions
  * Moving to a single compute function

## The brief physics intro

During this workshop, we will use a 2D diffusion solver to investigate how Julia allows us to tackle various HPC concepts in a concise fashion, without trading performance.

We will solve the 2D diffusion equation for a quantity $C$,
$$
\frac{∂C}{∂t} = -∇ ⋅ q~,
$$
where $q$ represents the diffusive flux:
$$
q = -D \; ∇C~,
$$
and where $D$ stands for the diffusion coefficient.


We will solve this partial differential equation (PDE) using the finite-difference method and an explicit forward Euler time integrator on a regular staggered Cartesian grid.
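
As a sketch of what this discretisation amounts to, the fluxes are evaluated on the cell faces and $C$ at the cell centres. The index notation ($i$, $j$ for cells, $n$ for the time level, $\Delta s$ for the grid spacing, $\Delta t$ for the time step) is introduced here for illustration and is not taken from the workshop scripts:
$$
q_{x,\,i+1/2,\,j} = -D \, \frac{C_{i+1,j} - C_{i,j}}{\Delta s}~, \qquad q_{y,\,i,\,j+1/2} = -D \, \frac{C_{i,j+1} - C_{i,j}}{\Delta s}~,
$$
$$
C_{i,j}^{n+1} = C_{i,j}^{n} - \Delta t \left( \frac{q_{x,\,i+1/2,\,j} - q_{x,\,i-1/2,\,j}}{\Delta s} + \frac{q_{y,\,i,\,j+1/2} - q_{y,\,i,\,j-1/2}}{\Delta s} \right)~.
$$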

**TODO**: add staggered grid figure.

The 2D domain has size $L=10$, and the scalar linear diffusion coefficient is $D=1$. We use a constant grid spacing `ds = L / ns`, where `ns` represents the number of finite-difference cells in both the $x$ and $y$ dimensions.

As the initial condition, we define a Gaussian perturbation centred in the middle of the domain, with amplitude and standard deviation both equal to 1.

<img src="imgs/initial.png" width=400px>
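
The set-up described in this section could be written roughly as follows. This is a minimal sketch and not the contents of `shared.jl`: the value of `ns`, the choice of centring the domain on the origin, the helper name `cs`, and the Gaussian form $\exp(-(x^2 + y^2) / (2σ^2))$ with $σ = 1$ are assumptions made for illustration.

```julia
# Values of L and D follow the text; ns is an arbitrary choice for this sketch.
L  = 10.0        # domain size
D  = 1.0         # diffusion coefficient
ns = 64          # number of cells in x and y (assumed value)
ds = L / ns      # constant grid spacing

# Cell-centre coordinates, assuming the domain is centred on the origin.
cs = LinRange(-L / 2 + ds / 2, L / 2 - ds / 2, ns)

# Gaussian perturbation with amplitude 1 and standard deviation 1.
C = [exp(-(x^2 + y^2) / 2) for x in cs, y in cs]
```

The array comprehension iterates over `x` along the first dimension and `y` along the second, so `C[ix, iy]` holds the value at `(cs[ix], cs[iy])`.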

## The code structure overview

Let's have a look at the code structure. We'll first look at the [`diffusion_2d.jl`](diffusion_2d.jl) script. It contains:
- 2 compute functions implementing the spatial and temporal discretisation of the PDE;

- a "main" function to run the code;

- an include statement for [`shared.jl`](./../shared.jl), which mostly contains parameter and array initialisation, and visualisation.
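
To make the idea of two compute functions more concrete, here is a minimal vectorized sketch of a flux computation followed by a forward Euler update. The function names `compute_flux!` and `update_C!`, their signatures, and the time step `dt = ds^2 / D / 4.1` are hypothetical choices for this illustration; the actual functions in `diffusion_2d.jl` may be organised differently.

```julia
# Diffusive fluxes q = -D ∇C, evaluated on the faces between cells (staggered grid).
# qx has size (ns-1, ns-2) and qy has size (ns-2, ns-1).
function compute_flux!(qx, qy, C, D, ds)
    qx .= -D .* diff(C[:, 2:end-1]; dims = 1) ./ ds
    qy .= -D .* diff(C[2:end-1, :]; dims = 2) ./ ds
    return nothing
end

# Forward Euler update of the inner cells: ∂C/∂t = -∇⋅q.
function update_C!(C, qx, qy, dt, ds)
    C[2:end-1, 2:end-1] .-= dt .* (diff(qx; dims = 1) ./ ds .+ diff(qy; dims = 2) ./ ds)
    return nothing
end

# Small usage example: a single spike spreading to its four neighbours.
ns = 16
ds = 10.0 / ns
D  = 1.0
dt = ds^2 / D / 4.1            # assumed explicit time step
C  = zeros(ns, ns); C[8, 8] = 1.0
qx = zeros(ns - 1, ns - 2)     # temporary flux arrays
qy = zeros(ns - 2, ns - 1)
compute_flux!(qx, qy, C, D, ds)
update_C!(C, qx, qy, dt, ds)
```

Note that the slices and `diff` calls allocate temporary arrays, which keeps the sketch short at the cost of extra memory traffic.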

## The output

### Graphics

The visualisation renders the evolution of the distribution of the diffusing quantity $C$ throughout the simulation, at intervals of `nout = nt / 5` time steps.

### Timing and performance

Besides plotting, the code also reports performance using two metrics, wall time and effective memory throughput, and prints them in the REPL.

#### Wall time

The first performance metric is wall time, or total runtime. It is computed using a simple custom tic-toc approach, making sure to skip the first 10 iterations to account for "warm-up". For any performance assessment, we should make sure to run the code for at least 1 second in order to ensure reliable results.
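
A tic-toc measurement with warm-up could be sketched as below. The names `timed_loop!` and `dummy_step!` and the default value of `nt` are hypothetical and only serve this illustration; the workshop code may implement the timing differently.

```julia
# Run `nt` iterations of `step!` and time all but the first `warmup` ones.
# Returns the measured wall time and the time per iteration, both in seconds.
function timed_loop!(step!, A; nt = 200, warmup = 10)
    @assert nt > warmup
    t_tic = 0.0
    for it in 1:nt
        # "tic": start the clock once the warm-up iterations are done
        it == warmup + 1 && (t_tic = time())
        step!(A)
    end
    t_toc = time() - t_tic          # "toc": elapsed wall time
    t_it  = t_toc / (nt - warmup)   # time per timed iteration
    return t_toc, t_it
end

# Hypothetical stand-in for one time step of the solver.
dummy_step!(A) = (A .= 0.5 .* A .+ 1.0; nothing)

A = rand(256, 256)
t_toc, t_it = timed_loop!(dummy_step!, A)
println("Time = ", t_toc, " s, t_it = ", t_it, " s")
```

Wrapping the loop in a function avoids timing code that accesses global variables, which would distort the measurement.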

#### Effective memory throughput

The second metric is the effective memory throughput $T_\mathrm{eff}$ (`T_eff` in the REPL). It is defined as the **non-redundant** memory access per iteration divided by the time per iteration $t_\mathrm{it}$ (in seconds):
$$
T_\mathrm{eff} = \frac{A_\mathrm{eff}}{t_\mathrm{it}}~,
$$
where $A_\mathrm{eff} = n_\mathrm{IO} ~ n_s^2 ~ s_\mathrm{DAT} ~ 10^{-9}$ is the effective memory access (in GB).

In our example, $n_\mathrm{IO} = 2$, as we only need to read the old values of $C$ and write the updated values back to solve the diffusion PDE. $s_\mathrm{DAT} = 8$ bytes, as we use double-precision floating-point arithmetic.
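
The formula above translates directly into a few lines of Julia. The function name `effective_throughput` and the numbers in the usage example (`ns = 4096`, `t_it = 1e-3`) are made up for illustration and are not measurements.

```julia
# Effective memory throughput in GB/s:
# T_eff = n_IO * ns^2 * s_DAT * 1e-9 / t_it, with s_DAT = sizeof(T) in bytes.
function effective_throughput(ns, t_it; n_IO = 2, T = Float64)
    A_eff = n_IO * ns^2 * sizeof(T) * 1e-9   # effective memory access in GB
    return A_eff / t_it
end

# Hypothetical example: a 4096 x 4096 grid at 1 ms per iteration.
T_eff = effective_throughput(4096, 1e-3)
println("T_eff = ", round(T_eff; digits = 1), " GB/s")   # prints 268.4 GB/s
```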

$T_\mathrm{eff}$ indicates, under various assumptions, how far a memory-bound code is from the performance of a plain memory copy. Refer to [Räss et al. (2022)](https://doi.org/10.5194/gmd-15-5757-2022) for details.

We will further use this metric in the GPU computing part.

## The serial loop version

The final step before we start our deep dive is to look at the serial loop version of the 2D diffusion code. If we open the [`diffusion_2d_loop.jl`](diffusion_2d_loop.jl) script alongside the vectorized one ([`diffusion_2d.jl`](diffusion_2d.jl)), we can diff them "by eye" and see that the major change is in the `diffusion_step!` function.

In a nutshell:
- we no longer explicitly assign the flux computation results to temporary variables in global memory (previously `qx` and `qy`);

- we introduce a nested loop over the spatial dimensions, respecting **column-major order**;

- we introduce a temporary second array `C2` so that we never read from and write to the same array, which avoids race conditions;

- we use `@inbounds` to skip bounds checking, once the correctness of the results has been verified.
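
The points above can be sketched as a single loop-based compute function. This is an illustration only: the signature of `diffusion_step!`, the array sizes, and the time step are assumptions, and the version in `diffusion_2d_loop.jl` may differ in its details.

```julia
# One explicit diffusion step: reads from C, writes the inner cells of C2.
function diffusion_step!(C2, C, D, dt, ds)
    nx, ny = size(C)
    # Column-major order: the first index (ix) is contiguous in memory,
    # so it belongs in the innermost loop.
    for iy in 2:ny-1
        for ix in 2:nx-1
            # Fluxes are computed on the fly instead of being stored in qx and qy.
            @inbounds C2[ix, iy] = C[ix, iy] - dt * (
                (-D * (C[ix+1, iy] - C[ix, iy]) / ds + D * (C[ix, iy] - C[ix-1, iy]) / ds) / ds +
                (-D * (C[ix, iy+1] - C[ix, iy]) / ds + D * (C[ix, iy] - C[ix, iy-1]) / ds) / ds
            )
        end
    end
    return nothing
end

# Small usage example: a single spike spreading to its four neighbours.
ds = 1.0
D  = 1.0
dt = ds^2 / D / 4.1        # assumed explicit time step
C  = zeros(8, 8); C[4, 4] = 1.0
C2 = copy(C)               # second array, so that boundary values match
diffusion_step!(C2, C, D, dt, ds)
C, C2 = C2, C              # swap: C now holds the updated values
```

Swapping the two bindings after each step avoids copying the array back, since only the names are exchanged.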

**→ Hands-on** (see [README.md](README.md))