# Compilation

To be fast, Julia needs to **specialize** code, that is, **compile specific native versions of the code** that exploit the available type information. **The better the specialization, the faster the code!**

## "Just ahead of time" compilation

* Julia **specializes on the types of function arguments** and 
* compiles efficient machine code **when a function is called for the first time** (with these input argument types).

If the same function is called again with the same input argument types, the already existing machine code is reused.


In [2]:
func(x,y) = 2x + y

func (generic function with 1 method)

In [3]:
x = [1.2, 3.4, 5.6] # Vector{Float64}
y = [0.4, 0.7, 0.9] # Vector{Float64}

@time func(x,y);
@time func(x,y);

  0.041341 seconds (186.09 k allocations: 12.677 MiB, 99.97% compilation time)
  0.000002 seconds (2 allocations: 160 bytes)


**First call:** compilation + running the code

**Second call:** running the code


In [4]:
@time func(x,y);

  0.000004 seconds (2 allocations: 160 bytes)


If one of the input types changes, Julia compiles a new specialization of the function!


In [5]:
typeof(x)

Vector{Float64} (alias for Array{Float64, 1})

In [6]:
x = [1, 3, 5]

3-element Vector{Int64}:
 1
 3
 5

In [7]:
typeof(x)

Vector{Int64} (alias for Array{Int64, 1})

In [8]:
@time func(x,y); # Vector{Int64}, Vector{Float64}
@time func(x,y);

  0.043782 seconds (150.96 k allocations: 10.204 MiB, 99.96% compilation time)
  0.000003 seconds (2 allocations: 160 bytes)


The cache now holds two efficient native-code versions of `func`: one for two `Vector{Float64}` arguments, and another for a `Vector{Int64}` first argument combined with a `Vector{Float64}` second argument.

In [9]:
methods(func)

In [10]:
using MethodAnalysis
methodinstances(func)

3-element Vector{Core.MethodInstance}:
 MethodInstance for func(::Any, ::Any)
 MethodInstance for func(::Vector{Float64}, ::Vector{Float64})
 MethodInstance for func(::Vector{Int64}, ::Vector{Float64})

### Compilation pipeline

<p><br><img src="imgs/Julia_compilation_pipeline.svg" width="512"/></p>

* **AST**: abstract syntax tree
* **IR**: intermediate representation

For more on Julia compilation, see [Bezanson J et al (2018) Julia: dynamism and performance reconciled by design. Proc ACM Program Lang.](https://doi.org/10.1145/3276490)

### What makes Julia fast?

**Specialization** → (Successful) **Type inference** → **Compilation**

## Introspection tools
#### (*But I really want to see what happens!*)

We can inspect the code at all transformation stages with a bunch of macros:

<img src="./imgs/julia_introspection_macros.svg" width=300px>

In [11]:
@macroexpand @show 3+3

quote
    Base.println("3 + 3 = ", Base.repr(begin
                #= show.jl:1181 =#
                local var"#118#value" = 3 + 3
            end))
    var"#118#value"
end

In [12]:
f(x, y) = x^3 + y/2

f (generic function with 1 method)

In [13]:
@code_lowered f(1.0,2.0)

CodeInfo(
1 ─ %1 = Main.:^
│   %2 = Core.apply_type(Base.Val, 3)
│   %3 = (%2)()
│   %4 = Base.literal_pow(%1, x, %3)
│   %5 = y / 2
│   %6 = %4 + %5
└──      return %6
)

In [14]:
@code_typed f(1.0,2.0)

CodeInfo(
1 ─ %1 = Base.mul_float(x, x)::Float64
│   %2 = Base.mul_float(%1, x)::Float64
│   %3 = Base.div_float(y, 2.0)::Float64
│   %4 = Base.add_float(%2, %3)::Float64
└──      return %4
) => Float64

From the types of the input arguments, Julia has figured out the types of all intermediate values. This crucial process is known as **type inference**, and its success is the basis for good specialization (i.e. performant native code as a result). Moreover, the generic power function computing the cube of `x` has been replaced by specific floating-point multiplications (**static dispatch**).
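Inference results can also be queried programmatically, which is handy in scripts and tests. The following is a minimal sketch: it repeats the definition of `f` so that it runs on its own, and it relies on `Base.return_types`, a reflection helper that is not part of Julia's documented stable API, so its exact behaviour may vary between versions.

```julia
# Same definition as above, repeated so the snippet is self-contained.
f(x, y) = x^3 + y/2

# Ask the compiler which return type it infers for the given argument types.
rt_float = Base.return_types(f, (Float64, Float64))
rt_int   = Base.return_types(f, (Int, Int))

println(rt_float)  # inferred return type(s) for Float64 inputs
println(rt_int)    # inferred return type(s) for Int inputs

# A concrete inferred return type indicates that inference succeeded.
println(all(isconcretetype, rt_float))
```

If the printed type were abstract (for example `Any`), that would be a hint that inference lost track of the types somewhere inside the function.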

In [15]:
@code_llvm debuginfo=:none f(1.0,2.0)

define double @julia_f_1849(double %0, double %1) #0 {
top:
  %2 = fmul double %0, %0
  %3 = fmul double %2, %0
  %4 = fmul double %1, 5.000000e-01
  %5 = fadd double %3, %4
  ret double %5
}


The expensive division (`y/2`) has been replaced by a multiplication by 0.5. In the end, given two `Float64` arguments, this function performs 4 floating-point operations (3 multiplications and 1 addition) instead of a generic power call and a division.
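A plausible reason why this substitution is safe is that 0.5 is exactly representable in binary floating point, so both forms give bit-identical results. The small check below illustrates this; the sample values are arbitrary, and the snippet only demonstrates the arithmetic, not what the compiler does internally.

```julia
# Dividing by 2 and multiplying by 0.5 agree exactly for Float64 values,
# because 0.5 is a power of two and therefore exactly representable.
ys = [0.1, 1.0, 3.7, 1e-300, 2.5e200, -42.42]
same_for_two = all(y -> y / 2 === y * 0.5, ys)
println(same_for_two)

# For a divisor such as 3 the reciprocal is not exactly representable,
# so the two forms can differ in the last bit.
same_for_three = 10.0 / 3 == 10.0 * (1 / 3)
println(same_for_three)
```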

In [16]:
@code_native debuginfo=:none f(1.0,2.0)

	.text
	.file	"f"
	.section	.rodata.cst8,"aM",@progbits,8
	.p2align	3                               # -- Begin function julia_f_1884
.LCPI0_0:
	.quad	0x3fe0000000000000              # double 0.5
	.text
	.globl	julia_f_1884
	.p2align	4, 0x90
	.type	julia_f_1884,@function
julia_f_1884:                           # @julia_f_1884
# %bb.0:                                # %top
	push	rbp
	vmulsd	xmm2, xmm0, xmm0
	movabs	rax, offset .LCPI0_0
	mov	rbp, rsp
	vmulsd	xmm1, xmm1, qword ptr [rax]
	vmulsd	xmm0, xmm2, xmm0
	vaddsd	xmm0, 

Let's compare this to integer inputs.

In [17]:
@code_native debuginfo=:none f(1,2)

	.text
	.file	"f"
	.section	.rodata.cst8,"aM",@progbits,8
	.p2align	3                               # -- Begin function julia_f_1888
.LCPI0_0:
	.quad	0x3fe0000000000000              # double 0.5
	.text
	.globl	julia_f_1888
	.p2align	4, 0x90
	.type	julia_f_1888,@function
julia_f_1888:                           # @julia_f_1888
# %bb.0:                                # %top
	push	rbp
	mov	rcx, rdi
	vcvtsi2sd	xmm0, xmm0, rsi
	movabs	rax, offset .LCPI0_0
	mov	rbp, rsp
	imul	rcx, rdi
	vmulsd	xmm0, xmm0, qword ptr [rax]
	im

### Recommendation: [Cthulhu.jl](https://github.com/JuliaDebug/Cthulhu.jl)
While these introspection macros are great, we recommend using `@descend` from the package [Cthulhu.jl](https://github.com/JuliaDebug/Cthulhu.jl) for real-world code analysis.

Essentially, Cthulhu is an **interactive**, more powerful generalization of the macros above.

* Allows easy switching between code representations (syntax, typed, native, ...).
* **Recursive application possible**(!) (i.e. introspecting a function that is called within a function within a function ...).

However, due to its interactivity, it doesn't work in Jupyter but **only works in the REPL** (→ exercise).

<img src="./imgs/cthulhu.png" width=1000>

## How important is specialization?

Let's try to estimate the performance gain by specialization.

To prevent specialization, we deliberately throw away any useful type information and operate on a `Vector{Any}` that can literally store anything!

(This is qualitatively comparable to what Python does.)


In [18]:
func(v) = 2*v[1] + v[2] # version of func that takes in a vector

func (generic function with 2 methods)

In [19]:
rand(2)

2-element Vector{Float64}:
 0.927918719389245
 0.1934154611149782

In [20]:
Any[rand(), rand()]

2-element Vector{Any}:
 0.1757754666614454
 0.7986265068547347

For benchmarking we will use `@btime` (or `@benchmark`) from [BenchmarkTools.jl](https://github.com/JuliaCI/BenchmarkTools.jl). This will take care of a couple of things for us:
* Exclude first run.
* Run the code multiple times (→ statistics).
* Benchmark in a function (local scope).
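To make these three points concrete, here is a rough hand-rolled sketch of what such a tool does for us. The helper `crude_benchmark` and the function `double_plus` are hypothetical names made up for this illustration; they are not part of BenchmarkTools.jl, which is considerably more careful (for instance, it evaluates the code several times per sample to get below the timer resolution).

```julia
# Hypothetical helper, NOT part of BenchmarkTools.jl: a crude stand-in that
# mimics the three points listed above.
function crude_benchmark(f, args...; samples = 1_000)
    f(args...)                      # warm-up call: triggers compilation, not timed
    best = Inf
    for _ in 1:samples              # repeat the measurement and keep the minimum
        t = @elapsed f(args...)
        best = min(best, t)
    end
    return best                     # seconds; inputs are function arguments, not globals
end

double_plus(v) = 2 * v[1] + v[2]    # illustrative function to be timed
t_min = crude_benchmark(double_plus, rand(2))
println("minimum time: ", t_min, " s")
```

For a function this cheap the result is dominated by timer overhead, which is one reason to prefer `@btime` over hand-written timing loops.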

**General rule:** For proper benchmarking, use `@btime` rather than `@time`, and interpolate (`$`) global input arguments.

(Prefixing a variable with `$` means interpolation throughout Julia, e.g. in string interpolation.)

In [21]:
ucl = "UCL ARC"
welcome = "Welcome to $ucl"

"Welcome to UCL ARC"

In [22]:
using BenchmarkTools

v_typed = rand(2)
v_any = Any[rand(), rand()]

@btime func($v_typed);
@btime func($v_any);

  1.643 ns (0 allocations: 0 bytes)


  18.747 ns (2 allocations: 32 bytes)


In [23]:
@benchmark func($v_any)

BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  17.202 ns … 393.214 ns  ┊ GC (min … max): 0.00% … 92.16%
 Time  (median):     18.626 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.249 ns ±   9.027 ns  ┊ GC (mean ± σ):  1.18% ±  2.42%

In [24]:
@code_typed func(rand(2))

CodeInfo(
1 ─ %1 = Base.arrayref(true, v, 1)::Float64
│   %2 = Base.mul_float(2.0, %1)::Float64
│   %3 = Base.arrayref(true, v, 2)::Float64
│   %4 = Base.add_float(%2, %3)::Float64
└──      return %4
) => Float64

**Static dispatch**: the generic functions `*` and `+` are replaced by specific implementations.

In [25]:
@code_typed func(Any[rand(), rand()])

CodeInfo(
1 ─ %1 = Base.arrayref(true, v, 1)::Any
│   %2 = (2 * %1)::Any
│   %3 = Base.arrayref(true, v, 2)::Any
│   %4 = (%2 + %3)::Any
└──      return %4
) => Any

Note that here the generic functions `*` and `+` cannot be replaced by specific variants because the type information is missing. This leads to inefficient **runtime dispatch**.
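A failed inference like this can also be detected automatically. The sketch below uses `@inferred` from the `Test` standard library, which throws an error when the actual return type does not match the inferred one. The definition of `func` and the two vectors are repeated so the snippet runs on its own.

```julia
using Test  # standard library; provides @inferred

# Same definition as above, repeated so the snippet is self-contained.
func(v) = 2*v[1] + v[2]

v_typed = rand(2)                 # Vector{Float64}
v_any   = Any[rand(), rand()]     # Vector{Any}

# Succeeds: the inferred return type matches the actual one.
@inferred func(v_typed)

# Fails: inference only knows `Any`, while the actual result is a Float64.
inference_failed = try
    @inferred func(v_any)
    false
catch err
    println(err)
    true
end
println(inference_failed)
```

Placing `@inferred` calls in a test suite is one way to notice when a code change accidentally breaks type inference.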

## Dispatch and specialization

**Types drive both dispatch and specialization.**

First, the most specific method is selected (dispatch), then it gets compiled to efficient native code (specialization). Let's reconsider our earlier example:

In [26]:
myabs(x::Real) = sign(x) * x
myabs(z::Complex) = sqrt(real(z * conj(z)))

myabs (generic function with 2 methods)
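The two stages can be told apart with a bit of reflection. In this minimal sketch, `which` from Base reports the method that dispatch selects for a given tuple of argument types; the definitions of `myabs` are repeated so the snippet is self-contained, and the variable names are only illustrative.

```julia
# Same definitions as above, repeated so the snippet is self-contained.
myabs(x::Real) = sign(x) * x
myabs(z::Complex) = sqrt(real(z * conj(z)))

# Dispatch: `which` reports the method selected for given argument types.
m_real = which(myabs, (Float64,))
m_cf64 = which(myabs, (ComplexF64,))
m_cint = which(myabs, (Complex{Int},))

println(m_real == m_cf64)   # real and complex inputs select different methods
println(m_cf64 == m_cint)   # both complex element types select the same method
```

Although `ComplexF64` and `Complex{Int}` share a single method, each argument type still gets its own compiled specialization, which is what the native code comparison illustrates.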

In [27]:
@code_native myabs(3.2 + 4.5im) # complex input

	.text
	.file	"myabs"
	.globl	julia_myabs_2668                # -- Begin function julia_myabs_2668
	.p2align	4, 0x90
	.type	julia_myabs_2668,@function
julia_myabs_2668:                       # @julia_myabs_2668
; ┌ @ /home/javier/Julia Courses/JuliaUCL24/notebooks/Day1/2_compilation.ipynb:2 within `myabs`
# %bb.0:                                # %L13
	push	rbp
; │┌ @ complex.jl:296 within `*` @ float.jl:411
	vmovupd	xmm0, xmmword ptr [rdi]
	mov	rbp, rsp
	vmulpd	xmm0, xmm0, xmm0
; ││ @ complex.jl:296 within `*`
; ││┌ @ float.jl:410 within `-`
	vpermilpd	xmm1, xmm0, 1           # xmm1 = xmm0[1,0]
	vaddsd	xmm

In [28]:
@code_native myabs(3 + 4im) # also complex input but different native code (due to specialization)!

	.text
	.file	"myabs"
	.globl	julia_myabs_2675                # -- Begin function julia_myabs_2675
	.p2align	4, 0x90
	.type	julia_myabs_2675,@function
julia_myabs_2675:                       # @julia_myabs_2675
; ┌ @ /home/javier/Julia Courses/JuliaUCL24/notebooks/Day1/2_compilation.ipynb:2 within `myabs`
# %bb.0:                                # %top
	push	rbp
; │┌ @ complex.jl:296 within `*` @ int.jl:88
	mov	rax, qword ptr [rdi]
; │└
; │┌ @ complex.jl:282 within `conj`
; ││┌ @ int.jl:85 within `-`
	mov	rcx, qword ptr [rdi + 8]
	mov	rbp, rsp
; │└└
; │┌ @ complex.jl:296 within `*` @ int.jl:88
	imul

## Are explicit type annotations necessary? (think C or Fortran)

Note that Julia's type inference is powerful. Specifying types **is not** necessary for best performance!


In [29]:
function my_function(x)
    y = rand()
    z = rand()
    x+y+z
end

function my_function_typed(x::Int64)::Float64
    y::Float64 = rand()
    z::Float64 = rand()
    x+y+z
end

my_function_typed (generic function with 1 method)

In [30]:
@btime my_function(10);
@btime my_function_typed(10);

  2.251 ns (0 allocations: 0 bytes)


  2.254 ns (0 allocations: 0 bytes)


Annotating types explicitly can still serve a purpose:

* Enforce conversions
* Rather rarely: help the compiler infer types in tricky situations

However, more often than not it is an indication of suboptimal code design. (It also makes functions much less generic and reusable!)
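Both effects, the enforced conversion and the loss of genericity, can be seen in a minimal sketch. The function names `as_float` and `only_int` are hypothetical and exist only for this illustration.

```julia
# A return-type annotation enforces a conversion of the returned value.
as_float(x)::Float64 = x
println(as_float(1))          # the Int argument comes back converted to Float64

# An argument annotation restricts which inputs are accepted at all.
only_int(x::Int64) = x + 1
accepts_float = try
    only_int(1.5)
    true
catch err
    err isa MethodError ? false : rethrow()
end
println(accepts_float)
```

Without the `::Int64` annotation, `only_int` would work for any number type and would be specialized just as well.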

## Compilation on heterogeneous HPC clusters

By default, Julia produces native code for the CPU type it is running on. This means that it uses the [Instruction Set Architecture (ISA)](https://en.wikipedia.org/wiki/Instruction_set_architecture) of this CPU.

This can lead to issues on heterogeneous clusters where different nodes have different CPU types. For example, you might precompile Julia packages on a login node with an Intel CPU but then run the code on a compute node with AMD CPUs.

**Solution: Multiversioning**

```bash
# Noctua 1 & 2
export JULIA_CPU_TARGET="generic;znver3,clone_all;skylake,clone_all"

# Noctua 2 & DGX
export JULIA_CPU_TARGET="generic;znver3,clone_all;znver2,clone_all"
```

Each setting compiles a generic (but slow) variant alongside optimized ones: the first adds variants for AMD Zen 3 and Intel Skylake CPUs, the second for AMD Zen 3 and AMD Zen 2.
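To decide which targets are relevant, it can help to check what Julia detects on each node type. The sketch below prints `Sys.CPU_NAME` together with the current value of the environment variable. It assumes that the reported name corresponds to a target name usable in `JULIA_CPU_TARGET`, which should be verified on the cluster in question.

```julia
# Print the CPU name as detected by Julia on the current node.
println("CPU name:         ", Sys.CPU_NAME)

# Show the multiversioning setting visible to this Julia process, if any.
println("JULIA_CPU_TARGET: ", get(ENV, "JULIA_CPU_TARGET", "(not set)"))
```

Running this once on a login node and once inside a job on each compute partition shows whether the nodes actually differ.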

# Core messages of this Notebook

* **A function is compiled when called for the first time** with a given set of argument types.
* There are **multiple code transformation steps** which can be inspected through macros like `@code_typed` and `@code_native`, or `@descend` from Cthulhu.jl.
* What makes Julia fast? **Specialization** → (Successful) **Type inference** → **Compilation**.
* Functions should almost always be benchmarked with **BenchmarkTools.jl's `@btime` and `@benchmark`** instead of `@time`.
* In virtually all cases, **explicit type annotations are irrelevant for performance**.