# Multithreading (shared-memory parallelism)

## Overview

* **Running Julia with multiple threads**

* Where are the threads running?
  * ThreadPinning.jl

* **Task-based multithreading**
  * dynamic and static scheduling

* **"Data pinning"**
  * NUMA "first-touch" policy

## Running Julia with multiple threads

By default, Julia starts with a single *user thread*. We must tell it explicitly to start multiple user threads.

* Environment variable: `export JULIA_NUM_THREADS=8`

* Command line argument: `julia -t 8` or `julia --threads 8`

* **VS Code:** Add `"julia.NumThreads": 8` to workspace settings (`Preferences: Open Workspace Settings (JSON)`)

**It is currently not really possible to change the number of threads at runtime!**

In [None]:
Threads.nthreads()
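
To confirm that one of the settings above took effect, a minimal check along these lines can help. It uses only `Base.Threads`; the variable names are illustrative. Note that `JULIA_NUM_THREADS` may be unset even when threads were requested via `-t`.

```julia
using Base.Threads: nthreads

n = nthreads()  # number of threads in the default thread pool
requested = get(ENV, "JULIA_NUM_THREADS", "(not set)")
println("Running with ", n, " thread(s); JULIA_NUM_THREADS = ", requested)

if n == 1
    # The thread count is fixed at startup, so a restart is needed.
    @warn "Only one thread available. Restart Julia with e.g. `julia -t 8`."
end
```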

## Where are the threads running?

[ThreadPinning.jl](https://github.com/carstenbauer/ThreadPinning.jl) is the best tool for visualizing and controlling thread placement in Julia. (Disclaimer: I'm the main author 😉)

In [None]:
using ThreadPinning

threadinfo()

### Pinning threads (i.e. controlling where they are running)

#### Why?

* To avoid double occupancy of CPU cores.

* To reduce noise in benchmarks.

* To address the complexity of the system topology, e.g. to use specific/all memory domains (NUMA).

* ...

#### How?

`pinthreads(strategy)`
* `:cputhreads`: pin to CPU threads (incl. "hyperthreads") one after another
* `:cores`: pin to CPU cores one after another
* `:numa`: alternate between NUMA domains (round-robin)
* `:sockets`: alternate between sockets (round-robin)
* `:affinitymask`: pin according to an external affinity mask (e.g. set by SLURM)

(More? See my talk at JuliaCon2023 @ MIT: https://youtu.be/6Whc9XtlCC0)

In [None]:
pinthreads(:cores) # also try :cputhreads, :sockets, or :random
threadinfo()

In [None]:
pinthreads(:numa)
threadinfo(; groupby=:numa)

#### Memory domains (NUMA)

NUMA = **n**on-**u**niform **m**emory **a**ccess

One (of two) AMD Milan CPUs in a Perlmutter node:

<img src="imgs/amd_milan_cpu_die.svg" width=800px>

**Image source:** [AMD, High Performance Computing (HPC) Tuning Guide for AMD EPYC™ 7003 Series Processors](https://www.amd.com/system/files/documents/high-performance-computing-tuning-guide-amd-epyc7003-series-processors.pdf)

In [None]:
# Other useful options for querying system information

# using CpuId
# cpuinfo()

# using Hwloc
# topology_graphical()

## Task-based multithreading

<br>
<img src="./imgs/tasks_threads_cores.svg" width=750px>
<br>

The user doesn't control threads directly. Instead, the user creates tasks, which the scheduler then runs on threads.

**Advantages:** 👍
* high-level abstraction
* nestability / composability

**Disadvantages:** 👎
* scheduling overhead
* uncertain and potentially suboptimal task → thread assignment
  * scheduler has limited information (e.g. about the system topology)
  * task migration
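
To illustrate the nestability / composability point, here is a minimal sketch of a recursive parallel sum in which every call may spawn further tasks. The function name `psum` and the `cutoff` parameter are hypothetical and chosen only for this example; the scheduler decides which thread runs each task.

```julia
using Base.Threads: @spawn

# Recursively split the input; tasks spawn further tasks (nesting).
function psum(xs; cutoff = 1_000)
    # Small inputs are summed serially to limit scheduling overhead.
    length(xs) <= cutoff && return sum(xs)
    mid = length(xs) ÷ 2
    left = @spawn psum(view(xs, 1:mid); cutoff)       # runs as a separate task
    right = psum(view(xs, mid+1:length(xs)); cutoff)  # runs in the current task
    return fetch(left) + right
end

data = collect(1:10_000)
total = psum(data)
println("parallel sum = ", total)
```

The `cutoff` trades parallelism against the scheduling overhead listed above: a smaller value creates more (and smaller) tasks.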

### Dynamic scheduling: `@threads :dynamic for ... in ...`

* **Splits up the iteration space into `nthreads()` contiguous chunks**

* Creates a task for each of them and hands them off to the dynamic scheduler (essentially `@spawn`s each chunk).
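
The two steps above can be sketched by hand as follows. This is only an approximation of the idea, not the actual implementation of `@threads`; the helper name `chunked_foreach` and its `nchunks` keyword are hypothetical.

```julia
using Base.Threads: @spawn, nthreads, threadid

# Split the iteration space into contiguous chunks and spawn one task per chunk.
function chunked_foreach(f, itr; nchunks = nthreads())
    items = collect(itr)
    chunksize = max(cld(length(items), nchunks), 1)
    tasks = map(Iterators.partition(items, chunksize)) do chunk
        @spawn foreach(f, chunk)  # handed off to the dynamic scheduler
    end
    foreach(wait, tasks)          # block until all chunks are done
    return nothing
end

hits = zeros(Int, 2 * nthreads())
chunked_foreach(eachindex(hits)) do i
    hits[i] += 1  # each index is written by exactly one task
    println("Running iteration ", i, " on thread ", threadid())
end
```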

In [None]:
using Base.Threads: @threads, threadid, nthreads

In [None]:
# implicitly creates nthreads() many tasks, each of which handles 2 iterations
@threads :dynamic for i in 1:2*nthreads()
    println("Running iteration ", i, " on thread ", threadid())
end

### Static scheduling: `@threads :static for ... in ...`

* `:static` option to opt out of dynamic scheduling

* Statically **"pins" tasks to threads**
  * task 1 → thread 1, task 2 → thread 2, and so on.

Pro 👍
   * **fixed task-thread mapping** (no task migration)
   * very little overhead
   
Con 👎
   * not composable / nestable

In [None]:
@threads :static for i in 1:2*nthreads()
    println("Running iteration ", i, " on thread ", threadid())
end

(For `@threads :static`, every thread handles precisely two iterations!)
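
Instead of reading the printed output, the fixed mapping can also be checked programmatically. The following is a small sketch that assumes it runs at top level (not nested inside another threaded loop); the array name `ids` is illustrative.

```julia
using Base.Threads: @threads, threadid, nthreads

# Record which thread executed each iteration.
ids = zeros(Int, 2 * nthreads())
@threads :static for i in eachindex(ids)
    ids[i] = threadid()
end

# Each consecutive pair of iterations forms one chunk and stays on one thread.
for t in 1:nthreads()
    println("iterations ", 2t - 1, ":", 2t, " ran on thread(s) ", unique(ids[2t-1:2t]))
end
```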

## "Data pinning" (NUMA revisited)

Implicitly → **NUMA "first-touch" policy**

Explicitly → [NUMA.jl](https://github.com/JuliaPerf/NUMA.jl)

### NUMA "first-touch" policy

Data is (typically) placed in the **NUMA domain that is closest to the thread/CPU core** that first "touches" (i.e. writes to) the data.

```julia
using Random                     # provides rand!
x = Vector{Float64}(undef, 10)   # allocation, no "touch" yet
rand!(x)                         # first touch == first write
```

In [None]:
pinthreads(:numa)
threadinfo(; groupby=:numa)

### Array initialization: serial vs parallel

**Different parts of an array can be placed in different NUMA domains!**

Data is managed in terms of memory pages ("unit of data").
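
As a rough back-of-the-envelope sketch, assuming a page size of 4 KiB (a common default, but an assumption here; check your system) and ignoring alignment, one can estimate how many pages an array spans. The helper `pages_spanned` is hypothetical.

```julia
page_bytes = 4096                                # ASSUMPTION: 4 KiB pages
elems_per_page = page_bytes ÷ sizeof(Float64)    # 512 Float64 values per page

# Minimum number of pages needed for n Float64 values (alignment ignored).
pages_spanned(n) = cld(n * sizeof(Float64), page_bytes)

println("Float64 values per page: ", elems_per_page)
println("pages for 100 elements: ", pages_spanned(100))
println("pages for 10^6 elements: ", pages_spanned(1_000_000))
```

Under this assumption, a very small array (such as one with 100 elements) fits into a single page and thus ends up in a single NUMA domain. Distributing data across domains only becomes possible for arrays that span many pages.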

#### Serial

```julia
using Random                      # provides rand!
x = Vector{Float64}(undef, 100)   # allocation, no "touch" yet
rand!(x)                          # first touch == first write
```

The location of the "main" thread determines the NUMA domain of the entire array!

If we later access the data in parallel, all threads must read from the same NUMA domain → competition for the memory bus → potential bottleneck.

#### Parallel

```julia
pinthreads(:numa)                       # pin threads to different NUMA domains
x = Vector{Float64}(undef, 100)         # allocation, no "touch" yet
@threads :static for i in eachindex(x)  # parallel iteration
    x[i] = rand()                       # first touch == first write
end
```

Different threads, running in different NUMA domains, touch different parts of the array → those parts will (likely) be placed in different NUMA domains.

If we later access the data in parallel, all threads can read their part of the array from their local NUMA domain → no bottleneck.

Crucial point: **How you initialize your data influences the performance of your computational kernel!** (non-local effect)
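
A minimal sketch of this idea: initialize the data with the same static access pattern that the kernel uses later, so that each thread mostly works on pages it touched first. The function names `init_parallel!` and `scale_parallel!` are hypothetical, and thread pinning (e.g. `pinthreads(:numa)`) is deliberately left out to keep the block self-contained. Without pinning, the placement benefit is not guaranteed.

```julia
using Base.Threads: @threads

# Parallel first touch: thread t writes (and thereby places) its own chunk.
function init_parallel!(x)
    @threads :static for i in eachindex(x)
        x[i] = 1.0
    end
    return x
end

# Kernel with the same static chunking, so each thread revisits "its" chunk.
function scale_parallel!(y, x, a)
    @threads :static for i in eachindex(x, y)
        y[i] = a * x[i]
    end
    return y
end

n = 1_000_000
x = init_parallel!(Vector{Float64}(undef, n))
y = init_parallel!(Vector{Float64}(undef, n))
scale_parallel!(y, x, 2.0)
println("y[1] = ", y[1], ", y[end] = ", y[end])
```

Replacing `init_parallel!` with a serial `fill!` gives the same result, but places all pages near the main thread. Comparing the two variants in a benchmark is one way to observe the non-local effect.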

**→ Hands-on** (see [README.md](README.md))