
# Parallel computing and GPU programming with Julia
## Background
Alexis Montoison

In [2]:
import Pkg
Pkg.activate("colab2")
Pkg.add("Hwloc")

[32m[1m  Activating[22m[39m new project at `/content/colab2`
[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m XML2_jll ────────────── v2.13.8+0
[32m[1m   Installed[22m[39m Hwloc ───────────────── v3.3.0
[32m[1m   Installed[22m[39m Hwloc_jll ───────────── v2.12.2+0
[32m[1m   Installed[22m[39m Xorg_libpciaccess_jll ─ v0.18.1+0
[32m[1m    Updating[22m[39m `/content/colab2/Project.toml`
  [90m[0e44f5e4] [39m[92m+ Hwloc v3.3.0[39m
[32m[1m    Updating[22m[39m `/content/colab2/Manifest.toml`
  [90m[fa961155] [39m[92m+ CEnum v0.5.0[39m
  [90m[0e44f5e4] [39m[92m+ Hwloc v3.3.0[39m
  [90m[692b3bcd] [39m[92m+ JLLWrappers v1.7.1[39m
  [90m[21216c6a] [39m[92m+ Preferences v1.5.0[39m
  [90m[e33a78d0] [39m[92m+ Hwloc_jll v2.12.2+0[39m
  [90m[94ce4f54] [39m[92m+ Libiconv_jll v1.18.0+0[39m
[33m⌅[39m [90m[02c8fc9c] [39m[92m+ XML2_jll v2

<img src='https://github.com/amontoison/Workshop-GERAD/blob/main/Graphics/julia.png?raw=1' width='200'>

There are many types of parallelism:

* **Instruction level parallelism** (e.g. SIMD)
* **Multi-threading** (shared memory)
* **Multi-processing** (shared system memory)
* **Distributed processing** (typically no shared memory)

And then there are highly-parallel hardware accelerators like **GPUs**.

Important: **At the center of any efficient parallel code is a fast serial code!!!**

### When to go parallel?

* If parts of your (optimized!) serial code aren't fast enough.
  * note that parallelization typically increases the code complexity.
* If your system has multiple execution units (CPU cores, GPU streaming multiprocessors, ...).
  * particularly important on large supercomputers but also already on modern desktop computers and laptops.

<img src='https://github.com/amontoison/Workshop-GERAD/blob/main/Graphics/frontier.png?raw=1' width='600'>

<img src='https://github.com/amontoison/Workshop-GERAD/blob/main/Graphics/aurora.png?raw=1' width='600'>

### How many CPU threads / cores do I have?

In [1]:
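# Requires the Hwloc package installed by the `Pkg.add("Hwloc")` cell above;
# running this cell first raises the "Package Hwloc not found" error shown below.
using Hwloc
Hwloc.num_physical_cores()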

LoadError: ArgumentError: Package Hwloc not found in current path.
- Run `import Pkg; Pkg.add("Hwloc")` to install the Hwloc package.

Note that there may be more than one CPU thread per physical CPU core (e.g. hyperthreading).

In [None]:
Sys.CPU_THREADS

2
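
The number of hardware threads is not necessarily the number of threads Julia uses. Unless a session is started with `julia -t N` (or with the `JULIA_NUM_THREADS` environment variable set), it typically runs with a single compute thread. A minimal sketch to compare the two values; the variable names are illustrative only:

```julia
# Hardware threads visible to the operating system.
hw_threads = Sys.CPU_THREADS

# Threads this Julia session was started with
# (controlled by `julia -t N` or JULIA_NUM_THREADS).
julia_threads = Threads.nthreads()

println("hardware threads: ", hw_threads)
println("Julia threads:    ", julia_threads)
```

If the second number is 1, multi-threaded code will still run, but without any parallel speedup.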

### Amdahl's law

Naive strong scaling expectation: I have 4 cores, give me my 4x speedup!

> If $p$ is the fraction of a code that can be parallelized, then the maximal theoretical speedup by parallelization on $n$ cores is given by $$ F(n) = \frac{1}{1 - p + p / n} $$

In [3]:
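using Plots

# Amdahl's law: maximal speedup on n cores for a parallelizable fraction p.
F(p,n) = 1/(1-p + p/n)

# One curve per parallel fraction p, evaluated for 1 to 8 cores.
pl = plot()
for p in (0.5, 0.7, 0.9, 0.95, 0.99)
    plot!(pl, n -> F(p,n), 1:8, lab="p=$p", lw=2,
        legend=:topleft, xlab="number of cores", ylab="parallel speedup", frame=:box)
end
pl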

LoadError: InitError: could not load library "/root/.julia/artifacts/52d9b3e9e3507f7b2cf723af43d0e7f095e2edc7/lib/libGL.so"
/root/.julia/artifacts/52d9b3e9e3507f7b2cf723af43d0e7f095e2edc7/lib/libGL.so: undefined symbol: _glapi_tls_Current
during initialization of module Libglvnd_jll
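
A few numbers make the bound concrete. The sketch below evaluates the formula for 4 cores and also its limit for $n \to \infty$, which is $1 / (1 - p)$. The function names `amdahl_speedup` and `amdahl_limit` are illustrative and not part of any package:

```julia
# Amdahl's law: upper bound on the speedup on n cores when a fraction p
# of the runtime can be parallelized (illustrative helper name).
amdahl_speedup(p, n) = 1 / (1 - p + p / n)

# Limit for n → ∞: the serial fraction 1 - p caps the achievable speedup.
amdahl_limit(p) = 1 / (1 - p)

for p in (0.5, 0.9, 0.99)
    println("p = ", p,
            ":  4 cores → ", round(amdahl_speedup(p, 4), digits=2), "x",
            ",  limit → ", round(amdahl_limit(p), digits=2), "x")
end
```

For example, with $p = 0.9$ four cores give at most about a 3.08x speedup, and no number of cores can exceed 10x.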

### [Parallel computing](https://docs.julialang.org/en/v1/manual/parallel-computing/) in Julia

Julia provides support for all types of parallelism mentioned above:

| Type of parallelism                                     | Julia packages and tools                                                                                                                                                              |
|---------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Instruction level parallelism** (e.g. SIMD)           | → [`@simd`](https://docs.julialang.org/en/v1/base/base/#Base.SimdLoop.@simd), [SIMD.jl](https://github.com/eschnett/SIMD.jl), ...                                                     |
| **Multi-threading** (shared memory)                     | → [Base.Threads](https://docs.julialang.org/en/v1/base/multi-threading/), [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl), [FLoops.jl](https://github.com/JuliaFolds/FLoops.jl), ... |
| **Multi-processing** (shared system memory)             | → [Distributed.jl](https://docs.julialang.org/en/v1/stdlib/Distributed/), [MPI.jl](https://github.com/JuliaParallel/MPI.jl), ...                                                      |
| **Distributed processing** (typically no shared memory) | → [Distributed.jl](https://docs.julialang.org/en/v1/stdlib/Distributed/), [MPI.jl](https://github.com/JuliaParallel/MPI.jl), ...                                                      |
| **GPU programming**                                     | → [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl), [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl), [oneAPI.jl](https://github.com/JuliaGPU/oneAPI.jl), [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl), ... |
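
To give a flavor of the first two rows, here is a minimal sketch that uses only Base Julia: a serial sum annotated with `@simd`, and a multi-threaded sum built on top of it with `Threads.@spawn`. The function names `sum_simd` and `sum_threaded` and the chunking strategy are illustrative assumptions, not the approach of any package listed above. The threaded version assumes a non-empty array.

```julia
# Serial sum with a SIMD hint: @simd allows the compiler to reorder the
# floating-point additions so that the loop can be vectorized.
function sum_simd(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

# Multi-threaded sum: split the indices into contiguous chunks, sum each
# chunk in its own task, then combine the partial sums.
function sum_threaded(x; ntasks = Threads.nthreads())
    chunks = Iterators.partition(eachindex(x), cld(length(x), ntasks))
    tasks = map(chunks) do idx
        Threads.@spawn sum_simd(view(x, idx))
    end
    return sum(fetch.(tasks))
end

x = rand(10^6)
println(sum_simd(x) ≈ sum(x))
println(sum_threaded(x) ≈ sum(x))
```

The threaded version reuses the fast serial kernel on each chunk, in line with the point that efficient parallel code is built around fast serial code. It only runs in parallel if Julia was started with more than one thread.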

Reference: [JuliaUCL24](https://github.com/carstenbauer/JuliaUCL24)