# SIMD

SIMD stands for **"Single Instruction Multiple Data"** and falls into the category of instruction level parallelism (vector instructions). Consider this simple example where `A`, `B`, and `C` are vectors:

In [1]:
function vector_add(A, B, C)
    for i in eachindex(A, B, C)
        @inbounds A[i] = B[i] + C[i]
    end
end

vector_add (generic function with 1 method)


The idea behind SIMD is to perform the add instruction on multiple elements at the same time (instead of performing the additions separately, one after another). Splitting the simple loop addition into multiple vector additions is often called "loop vectorization". Since each vectorized addition happens at the instruction level, i.e. within a CPU core, the feature set of the CPU determines how many elements we can process in one go.

## Summary
<img src="../../static/simd_vaddpd.png" width=300px>
<img src="../../static/simd_register_width.png" width=400px>

(**Source:** Node-level performance engineering course by [NHR@FAU](https://hpc.fau.de/))

Let's check which "advanced vector extensions" (AVX) the system supports.

In [2]:
using CpuId
cpuinfo()

| Cpu Property       | Value                                                      |
|:------------------ |:---------------------------------------------------------- |
| Brand              | AMD Ryzen 7 2700X Eight-Core Processor                     |
| Vendor             | :AMD                                                       |
| Architecture       | :Zen                                                       |
| Model              | Family: 0x8f, Model: 0x08, Stepping: 0x02, Type: 0x00      |
| Cores              | 16 physical cores, 16 logical cores (on executing CPU)     |
|                    | No Hyperthreading hardware capability detected             |
| Clock Frequencies  | Not supported by CPU                                       |
| Data Cache         | Level 1:3 : (32, 512, 8192) kbytes                         |
|                    | 64 byte cache line size                                    |
| Address Size       | 48 bits virtual, 48 bits physical                          |
| SIMD               | 256 bit = 32 byte max. SIMD vector size                    |
| Time Stamp Counter | TSC is accessible via `rdtsc`                              |
|                    | TSC runs at constant rate (invariant from clock frequency) |
| Perf. Monitoring   | Performance Monitoring Counters (PMC) are not supported    |
| Hypervisor         | No                                                         |


In [3]:
filter(x -> contains(string(x), "AVX"), cpufeatures())

2-element Vector{Symbol}:
 :AVX
 :AVX2
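The register width reported above determines how many elements a single vector instruction can process. As a minimal sketch (the helper `simd_lanes` is a hypothetical name, not part of CpuId or Base), the number of lanes follows from the register width and the element size:

```julia
# Hypothetical helper: number of elements of type T that fit into one SIMD register
# of the given width (in bits).
simd_lanes(::Type{T}, register_bits::Integer) where {T} = register_bits ÷ (8 * sizeof(T))

simd_lanes(Float64, 256)  # 4 (256 bit AVX/AVX2 register, 64 bit elements)
simd_lanes(Float32, 256)  # 8
simd_lanes(Float64, 128)  # 2 (128 bit SSE register)
```

For the 25600-element `Float64` vectors used below, a 256 bit register would ideally allow 6400 packed additions instead of 25600 scalar ones.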

In [4]:
SIZE = 256 * 100

25600

In [5]:
A = rand(Float64, SIZE)
B = rand(Float64, SIZE)
C = rand(Float64, SIZE);

In [6]:
@code_native debuginfo=:none syntax = :intel vector_add(A, B, C)

	.text
	.file	"vector_add"
	.globl	japi1_vector_add_2251           # -- Begin function japi1_vector_add_2251
	.p2align	4, 0x90
	.type	japi1_vector_add_2251,@function
japi1_vector_add_2251:                  # @japi1_vector_add_2251
	.cfi_startproc
# %bb.0:                                # %top
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset rbp, -16
	mov	rbp, rsp
	.cfi_def_cfa_register rbp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	and	rsp, -32
	sub	rsp, 160
	.cfi_offset rbx, -56
	.cfi_offset r12, -48
	.cfi_offset r13, -40

In [9]:
function vector_add_no_simd(A, B, C)
    for i in eachindex(A, B, C)
        A[i] = B[i] + C[i]
    end
end

vector_add_no_simd (generic function with 1 method)

In [10]:
using BenchmarkTools
@btime vector_add_no_simd($A, $B, $C)

  6.795 μs (0 allocations: 0 bytes)


In [8]:
@btime vector_add($A, $B, $C)

  6.755 μs (0 allocations: 0 bytes)
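Besides reading the native assembly, one way to check whether a loop was vectorized is to look for LLVM vector types such as `<4 x double>` in the optimized IR. The following is a minimal sketch; `vadd!` and `uses_vector_doubles` are hypothetical names introduced here for illustration, and the result depends on the CPU and the Julia version.

```julia
using InteractiveUtils  # provides code_llvm (loaded automatically in the REPL and in notebooks)

# Same loop as vector_add, under a hypothetical name to keep the example self-contained
function vadd!(A, B, C)
    for i in eachindex(A, B, C)
        @inbounds A[i] = B[i] + C[i]
    end
    return A
end

# Hypothetical helper: true if the optimized LLVM IR of `f` contains vectors of doubles
function uses_vector_doubles(f, argtypes)
    io = IOBuffer()
    code_llvm(io, f, argtypes; debuginfo=:none)
    return occursin(r"<\d+ x double>", String(take!(io)))
end

uses_vector_doubles(vadd!, Tuple{Vector{Float64}, Vector{Float64}, Vector{Float64}})
```

Applying the same check to a variant without `@inbounds` may help explain why the two timings above are so close.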


## It's not always so simple: Reduction

In [12]:
function vector_dot(B, C)
    a = zero(eltype(B))
    for i in eachindex(B, C)
        @inbounds a += B[i] * C[i]
    end
    return a
end

vector_dot (generic function with 1 method)

In [13]:
@code_native debuginfo=:none syntax = :intel vector_dot(B, C)

	.text
	.file	"vector_dot"
	.globl	julia_vector_dot_2603           # -- Begin function julia_vector_dot_2603
	.p2align	4, 0x90
	.type	julia_vector_dot_2603,@function
julia_vector_dot_2603:                  # @julia_vector_dot_2603
	.cfi_startproc
# %bb.0:                                # %top
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset rbp, -16
	mov	rbp, rsp
	.cfi_def_cfa_register rbp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	and	rsp, -32
	sub	rsp, 96
	.cfi_offset rbx, -56
	.cfi_offset r12, -48
	.cfi_offset r13, -40

Note the `vaddsd` instruction and usage of `xmmi` registers (128 bit).

How could this loop be vectorized?

In [15]:
function vector_dot_unrolled4(B, C)
    a1 = zero(eltype(B))
    a2 = zero(eltype(B))
    a3 = zero(eltype(B))
    a4 = zero(eltype(B))
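    # Stop at length(B)-3 so that i+3 stays in bounds and the last block of four is included
    # (assumes length(B) is a multiple of 4, as is the case for SIZE = 25600).
    @inbounds for i in 1:4:length(B)-3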
        a1 += B[i] * C[i]
        a2 += B[i+1] * C[i+1]
        a3 += B[i+2] * C[i+2]
        a4 += B[i+3] * C[i+3]
    end
    return a1 + a2 + a3 + a4
end

vector_dot_unrolled4 (generic function with 1 method)
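The unrolled loop above only processes complete groups of four elements. As a sketch of how arbitrary lengths could be handled (the name `vector_dot_unrolled4_tail` is hypothetical, and 1-based `Vector` inputs are assumed), a scalar remainder loop can pick up the leftover elements:

```julia
function vector_dot_unrolled4_tail(B, C)
    length(B) == length(C) || throw(DimensionMismatch("B and C must have the same length"))
    a1 = zero(eltype(B))
    a2 = zero(eltype(B))
    a3 = zero(eltype(B))
    a4 = zero(eltype(B))
    n = length(B)
    nmain = n - n % 4               # largest multiple of 4 that is <= n
    # Main loop: four independent accumulators, one per "lane"
    @inbounds for i in 1:4:nmain
        a1 += B[i] * C[i]
        a2 += B[i+1] * C[i+1]
        a3 += B[i+2] * C[i+2]
        a4 += B[i+3] * C[i+3]
    end
    a = a1 + a2 + a3 + a4
    # Remainder loop: at most three leftover elements
    @inbounds for i in nmain+1:n
        a += B[i] * C[i]
    end
    return a
end

vector_dot_unrolled4_tail(collect(1.0:10.0), ones(10))  # 55.0
```

The four separate accumulators are what break the serial dependency on a single `a` and give the compiler room to use packed instructions.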

In [16]:
@code_native debuginfo=:none syntax = :intel vector_dot_unrolled4(B, C)

	.text
	.file	"vector_dot_unrolled4"
	.globl	julia_vector_dot_unrolled4_2752 # -- Begin function julia_vector_dot_unrolled4_2752
	.p2align	4, 0x90
	.type	julia_vector_dot_unrolled4_2752,@function
julia_vector_dot_unrolled4_2752:        # @julia_vector_dot_unrolled4_2752
	.cfi_startproc
# %bb.0:                                # %top
	push	r14
	.cfi_def_cfa_offset 16
	push	rbx
	.cfi_def_cfa_offset 24
	sub	rsp, 8
	.cfi_def_cfa_offset 32
	.cfi_offset rbx, -24
	.cfi_offset r14, -16
	mov	rdx, qword ptr [rdi + 8]
	mov	r14, rsi
	mov	rbx, rdi
	movabs	rax

In [17]:
using BenchmarkTools
@btime vector_dot($B, $C);
@btime vector_dot_unrolled4($B, $C);

  19.296 μs (0 allocations: 0 bytes)
  5.033 μs (0 allocations: 0 bytes)


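To "force" automatic SIMD vectorization in Julia, you can use the `@simd` macro. Strictly speaking, it does not force anything: it gives the compiler permission to reorder the loop iterations, which is what makes vectorizing a reduction like this one legal.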

In [18]:
function vector_dot_simd(B, C)
    a = zero(eltype(B))
    @simd for i in eachindex(B, C)
        @inbounds a += B[i] * C[i]
    end
    return a
end

vector_dot_simd (generic function with 1 method)

By using the `@simd` macro, we are asserting several properties of the loop:

* It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
* Floating-point operations on reduction variables can be reordered, possibly causing different results than without `@simd`.
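The second point can be observed directly. The sketch below (function names `dot_sequential` and `dot_simd` are hypothetical stand-ins for the two variants in this notebook) compares a strictly sequential reduction with an `@simd` reduction: the results agree up to rounding, but need not be bitwise identical.

```julia
function dot_sequential(x, y)
    a = zero(eltype(x))
    for i in eachindex(x, y)
        @inbounds a += x[i] * y[i]
    end
    return a
end

function dot_simd(x, y)
    a = zero(eltype(x))
    @simd for i in eachindex(x, y)
        @inbounds a += x[i] * y[i]
    end
    return a
end

x = rand(1000)
y = rand(1000)
s_seq = dot_sequential(x, y)
s_simd = dot_simd(x, y)

s_seq == s_simd   # may be false: a different summation order gives different rounding
s_seq ≈ s_simd    # true: the difference is at the level of floating-point rounding
```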

In [19]:
@btime vector_dot_simd($B, $C);

  3.693 μs (0 allocations: 0 bytes)


This is roughly a 5x speedup over the plain loop (19.3 μs vs. 3.7 μs) for just a little extra `@simd`!

In [20]:
@code_native debuginfo=:none syntax = :intel vector_dot_simd(B, C)

	.text
	.file	"vector_dot_simd"
	.globl	julia_vector_dot_simd_2836      # -- Begin function julia_vector_dot_simd_2836
	.p2align	4, 0x90
	.type	julia_vector_dot_simd_2836,@function
julia_vector_dot_simd_2836:             # @julia_vector_dot_simd_2836
	.cfi_startproc
# %bb.0:                                # %top
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset rbp, -16
	mov	rbp, rsp
	.cfi_def_cfa_register rbp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	and	rsp, -32
	sub	rsp, 96
	.cfi_offset rbx, -56
	.cfi_offset r12, -48
	.cfi_offset r13,

Note the `vfmadd231pd` instruction and usage of `ymmi` AVX registers (256 bit).

Data types matter:
* Floating-point addition is **non-associative** and the order of operations is important.
* Integer addition is **associative** and the order of operations has no impact.
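A small example illustrates the difference. Regrouping a floating-point sum changes the rounding, whereas (wrapping) `Int64` addition gives the same result for any grouping, even when an intermediate value overflows:

```julia
# Floating point: the grouping changes the result
lhs = (0.1 + 0.2) + 0.3
rhs = 0.1 + (0.2 + 0.3)
lhs == rhs                    # false

# Int64: the grouping does not matter, even with wraparound on overflow
x = typemax(Int64)
(x + 1) - 1 == x + (1 - 1)    # true
```

This is why the compiler may reorder an integer reduction on its own, but needs explicit permission (e.g. via `@simd`) to reorder a floating-point one.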

Let's check what happens for `Int64` input.

In [21]:
B_int = rand(Int64, SIZE)
C_int = rand(Int64, SIZE)
@btime vector_dot($B_int, $C_int);
@btime vector_dot_simd($B_int, $C_int);

  10.350 μs (0 allocations: 0 bytes)
  11.962 μs (0 allocations: 0 bytes)


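As expected, there is no significant difference between the two variants: since integer addition is associative, the compiler is already free to reorder the plain loop, so `@simd` grants no additional freedom here.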

### SIMD is hard...

* Autovectorization is a hard problem (it needs to prove a lot of things about the code!)
* Not every code / loop is readily vectorizable
  * Keep your loops simple, e.g. avoid conditionals, control flow, and function calls if possible!
  * Loop length should be countable up front
  * Contiguous data access
  * (Align data structures to SIMD width boundary)

**Keep it simple!**
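As an illustration of the advice to avoid conditionals, a branch inside a loop body can sometimes be replaced by the branch-free `ifelse` function, which evaluates both arguments and selects one. This is a minimal sketch with hypothetical function names; whether the compiler actually vectorizes either version depends on the CPU and Julia version.

```julia
# Version with a branch in the loop body
function sum_positive_branch(x)
    s = zero(eltype(x))
    for i in eachindex(x)
        if x[i] > 0
            s += x[i]
        end
    end
    return s
end

# Branch-free version: `ifelse` selects between two already computed values
function sum_positive_ifelse(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        @inbounds s += ifelse(x[i] > 0, x[i], zero(eltype(x)))
    end
    return s
end

v = [-1, 2, -3, 4]
sum_positive_branch(v)   # 6
sum_positive_ifelse(v)   # 6
```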

### [LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl)

Think of `@turbo` as a more sophisticated version of `@simd`. Hopefully, these features will at some point just be part of Julia's compiler.

In [22]:
using LoopVectorization

function vector_dot_turbo(B, C)
    a = zero(eltype(B))
    @turbo for i in eachindex(B, C)
        @inbounds a += B[i] * C[i]
    end
    return a
end

@btime vector_dot_simd($B, $C);
@btime vector_dot_turbo($B, $C);

  3.507 μs (0 allocations: 0 bytes)
  3.479 μs (0 allocations: 0 bytes)


In [None]:
@code_native debuginfo=:none syntax = :intel vector_dot_turbo(B, C)

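On CPUs with AVX-512 support, note the usage of the `zmmi` AVX512 registers (512 bit). The CPU shown above only supports AVX and AVX2, so there the widest registers available are the `ymmi` registers (256 bit).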

## Structure of Arrays vs Array of Structures

Data layout can matter!

In [23]:
# Array of structures
AoS = [complex(rand(), rand()) for i in 1:SIZE]

25600-element Vector{ComplexF64}:
  0.47615967336798193 + 0.7455265837044197im
   0.4999237219810103 + 0.6551526587274029im
  0.25845791195193235 + 0.9486621722228858im
   0.6048131887781468 + 0.9816045559526196im
  0.31826990577421077 + 0.8556270504839079im
   0.9608605608510309 + 0.43045871282491555im
   0.5316508069904498 + 0.9332815053700042im
   0.9359446412803937 + 0.2413369995777297im
   0.5445003161725982 + 0.5175829019511838im
   0.8516086373164797 + 0.2829406741258146im
   0.8225745171390131 + 0.013675602566029954im
   0.7437782141382747 + 0.03352366879165547im
   0.4256170224507019 + 0.318856163987701im
                      ⋮
   0.6564963114676212 + 0.755427941249618im
  0.47542479966788387 + 0.0539043254557281im
 0.028102281377028304 + 0.4512535597598467im
   0.5208421644519043 + 0.7744136909778965im
   0.5381940725978631 + 0.3825444822931944im
   0.9823478619763404 + 0.7252334590689509im
  0.41713931217686384 + 0.038913540976964645im
   0.8343206036478911 + 0.817228370637

In [24]:
@btime sum($AoS);

  12.253 μs (0 allocations: 0 bytes)


### [StructArrays.jl](https://github.com/JuliaArrays/StructArrays.jl)

See also: [AoS and SoA (Wikipedia)](https://en.wikipedia.org/wiki/AoS_and_SoA)

In [25]:
using StructArrays

In [26]:
SoA = StructArray{Complex}((rand(SIZE), rand(SIZE)))

25600-element StructArray(::Vector{Float64}, ::Vector{Float64}) with eltype Complex:
   0.0583671486160664 + 0.038973094500907135im
   0.7391978941712049 + 0.4635086111740244im
   0.8687771490717257 + 0.8096734476306267im
  0.31918171150387453 + 0.6095299031996888im
  0.47589887602079506 + 0.7144714616948497im
   0.3943948319977675 + 0.6997549500428829im
  0.18300503789866374 + 0.3394125580761389im
   0.6931214300977017 + 0.3484585661075027im
   0.4371431015485052 + 0.0374794777200147im
   0.8898722227383757 + 0.0036969751062636558im
   0.7004166247746507 + 0.18579015290769074im
   0.6365615938644166 + 0.21139339682777114im
   0.9011039041963108 + 0.11354854226901745im
                      ⋮
   0.4988382101498491 + 0.3289351521662768im
   0.9261243242881568 + 0.9866549809596703im
   0.6273822066485626 + 0.6744463540847465im
   0.9974714408638112 + 0.08915121248716917im
   0.6040326424984684 + 0.5987488553848405im
   0.9866928065694077 + 0.09979478208688874im
   0.2866859994507862 + 0.

In [29]:
typeof(AoS)

Vector{ComplexF64} (alias for Array{Complex{Float64}, 1})

In [28]:
typeof(SoA)

StructVector{Complex, NamedTuple{(:re, :im), Tuple{Vector{Float64}, Vector{Float64}}}, Int64} (alias for StructArray{Complex, 1, NamedTuple{(:re, :im), Tuple{Array{Float64, 1}, Array{Float64, 1}}}, Int64})

In [27]:
@btime sum($SoA);

  4.367 μs (0 allocations: 0 bytes)
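The difference between the two layouts can be made visible without any package. In the sketch below, the named tuple `soa` is a hand-rolled stand-in for what StructArrays.jl manages (an assumption for illustration, not its actual implementation): in the AoS layout real and imaginary parts are interleaved in memory, while in the SoA layout each component is one contiguous `Float64` vector.

```julia
# Array of structures: real and imaginary parts are interleaved in memory
aos = [1.0 + 2.0im, 3.0 + 4.0im]
interleaved = collect(reinterpret(Float64, aos))   # [1.0, 2.0, 3.0, 4.0]

# Hand-rolled structure of arrays: one contiguous vector per component
soa = (re = [1.0, 3.0], im = [2.0, 4.0])

# Summing the SoA layout reduces to two plain sums over Float64 vectors
sum_soa = complex(sum(soa.re), sum(soa.im))

sum_soa == sum(aos)   # true
```

Sums over contiguous `Float64` vectors are a pattern that vectorizes well, which is a plausible explanation for the faster `sum` on the `StructArray` above.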


**Resources:**

* [LoopVectorization.jl video on YouTube](https://www.youtube.com/watch?v=qz2kJdVDWi0)
* [SIMD and SIMD-intrinsics in Julia](http://kristofferc.github.io/post/intrinsics/)
* [Optimizing Serial Code](https://mitmath.github.io/18337/lecture2/optimizing)

## Inlining

In [30]:
@noinline fnoinline(x, y) = x + y

fnoinline (generic function with 1 method)

In [31]:
finline(x, y) = x + y

finline (generic function with 1 method)

In [32]:
function qinline(x,y)
  a = 4
  b = 2
  c = finline(x,a)
  d = finline(b,c)
  finline(d,y)
end

qinline (generic function with 1 method)

In [33]:
function qnoinline(x,y)
  a = 4
  b = 2
  c = fnoinline(x,a)
  d = fnoinline(b,c)
  fnoinline(d,y)
end

qnoinline (generic function with 1 method)
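One way to verify whether calls like these get inlined is to inspect the generated LLVM IR and look for references to the callee. The following is a minimal, self-contained sketch; all names (`add_inline`, `add_noinline`, `chain_inline`, `chain_noinline`, `llvm_ir`) are hypothetical and only mirror the functions defined above.

```julia
using InteractiveUtils  # provides code_llvm (loaded automatically in the REPL and in notebooks)

add_inline(x, y) = x + y
@noinline add_noinline(x, y) = x + y

chain_inline(x, y) = add_inline(add_inline(x, 4), y)
chain_noinline(x, y) = add_noinline(add_noinline(x, 4), y)

# Hypothetical helper: return the optimized LLVM IR of `f` as a string
function llvm_ir(f, argtypes)
    io = IOBuffer()
    code_llvm(io, f, argtypes; debuginfo=:none)
    return String(take!(io))
end

ir_inline = llvm_ir(chain_inline, Tuple{Float64, Float64})
ir_noinline = llvm_ir(chain_noinline, Tuple{Float64, Float64})

occursin("add_noinline", ir_noinline)  # expected true: the calls to the helper remain
occursin("add_inline", ir_inline)      # expected false: the helper was inlined away
```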

In [36]:
?@inline

```
@inline
```

Give a hint to the compiler that this function is worth inlining.

Small functions typically do not need the `@inline` annotation, as the compiler does it automatically. By using `@inline` on bigger functions, an extra nudge can be given to the compiler to inline it.

`@inline` can be applied immediately before the definition or in its function body.

```julia
# annotate long-form definition
@inline function longdef(x)
    ...
end

# annotate short-form definition
@inline shortdef(x) = ...

# annotate anonymous function that a `do` block creates
f() do
    @inline
    ...
end
```

!!! compat "Julia 1.8"
    The usage within a function body requires at least Julia 1.8.


---

```
@inline block
```

Give a hint to the compiler that calls within `block` are worth inlining.

```julia
# The compiler will try to inline `f`
@inline f(...)

# The compiler will try to inline `f`, `g` and `+`
@inline f(...) + g(...)
```

!!! note
    A callsite annotation always has the precedence over the annotation applied to the definition of the called function:

    ```julia
    @noinline function explicit_noinline(args...)
        # body
    end

    let
        @inline explicit_noinline(args...) # will be inlined
    end
    ```


!!! note
    When there are nested callsite annotations, the innermost annotation has the precedence:

    ```julia
    @noinline let a0, b0 = ...
        a = @inline f(a0)  # the compiler will try to inline this call
        b = f(b0)          # the compiler will NOT try to inline this call
        return a, b
    end
    ```


!!! warning
    Although a callsite annotation will try to force inlining in regardless of the cost model, there are still chances it can't succeed in it. Especially, recursive calls can not be inlined even if they are annotated as `@inline`d.


!!! compat "Julia 1.8"
    The callsite annotation requires at least Julia 1.8.



In [34]:
@code_llvm debuginfo=:none qinline(1.0, 2.0)

define double @julia_qinline_6552(double %0, double %1) #0 {
top:
  %2 = fadd double %0, 4.000000e+00
  %3 = fadd double %2, 2.000000e+00
  %4 = fadd double %3, %1
  ret double %4
}


In [35]:
@code_llvm debuginfo=:none qnoinline(1.0, 2.0)

define double @julia_qnoinline_6568(double %0, double %1) #0 {
top:
  %2 = call double @j_fnoinline_6570(double %0, i64 signext 4) #0
  %3 = call double @j_fnoinline_6571(i64 signext 2, double %2) #0
  %4 = call double @j_fnoinline_6572(double %3, double %1) #0
  ret double %4
}
