# Code Specialization

To be fast, Julia needs to **specialize** code, that is, compile native versions of the code for specific argument types. **The better the specialization, the faster the code!** In the following we will investigate how Julia achieves good code specialization while retaining the power of generic programming.

## Just Ahead of Time (JAOT) Compilation

![](../../static/from_source_to_native.png)
 

**AST = Abstract Syntax Tree**

**IR = Intermediate Representation**

**SSA = Static Single Assignment**

**[LLVM](https://de.wikipedia.org/wiki/LLVM) = Low Level Virtual Machine**

## Specialization

**Julia specializes on the types of function arguments**, i.e. Julia compiles efficient machine code for the given input types, **when a function is called for the first time**.

If it is called again, the already existing machine code is reused, until we call the function with different input types.
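
Compilation is keyed on the tuple of argument types, not on one particular call. As a minimal sketch (the name `func_demo` is made up for this illustration), `precompile` from Base can be used to request the specialization for a given type signature before the first call:

```julia
# Hypothetical stand-in for a generic function, used only in this sketch.
func_demo(x, y) = 2x + y

# Ask Julia to compile the specialization for two Vector{Float64} arguments
# before the function is ever called. `precompile` returns a Bool.
ok = precompile(func_demo, (Vector{Float64}, Vector{Float64}))
println("precompiled: ", ok)

# This call can reuse the specialization requested above.
result = func_demo([1.2, 3.4], [0.4, 0.7])
println(result)
```

Calling `func_demo` with other argument types, for example two `Vector{Int64}`, would still trigger a fresh compilation.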


In [1]:
func(x, y) = 2x + y

func (generic function with 1 method)

In [2]:
x = [1.2, 3.4, 5.6] # Vector{Float64}
y = [0.4, 0.7, 0.9] # Vector{Float64}

@time func(x, y);
@time func(x, y);

  0.119239 seconds (405.27 k allocations: 20.623 MiB, 12.16% gc time, 99.98% compilation time)
  0.000009 seconds (2 allocations: 160 bytes)


**First call:** compilation + running the code

**Second call:** running the code


In [3]:
@time func(x, y);

  0.000012 seconds (2 allocations: 160 bytes)


If one of the input types changes, Julia compiles a new specialization of the function!


In [4]:
typeof(x)

Vector{Float64} (alias for Array{Float64, 1})

In [5]:
x = [1, 3, 5]

3-element Vector{Int64}:
 1
 3
 5

In [6]:
typeof(x)

Vector{Int64} (alias for Array{Int64, 1})

In [11]:
@time func(x, x); # Vector{Int64}, Vector{Int64}
@time func(x, x);

  0.087038 seconds (194.08 k allocations: 9.592 MiB, 14.88% gc time, 99.96% compilation time)
  0.000007 seconds (2 allocations: 160 bytes)


The cache now holds separate native code for each combination of argument types that `func` has been called with, for example one version for two `Vector{Float64}` inputs and another one for two `Vector{Int64}` inputs.

In [8]:
using MethodAnalysis

In [9]:
methods(func)

In [12]:
methodinstances(func)

3-element Vector{Core.MethodInstance}:
 MethodInstance for func(::Vector{Float64}, ::Vector{Float64})
 MethodInstance for func(::Vector{Int64}, ::Vector{Float64})
 MethodInstance for func(::Vector{Int64}, ::Vector{Int64})

## Introspection

(*But I really want to see what happens!*)

We can inspect the code at all transformation stages with a bunch of macros:

<img src="../../static/julia_introspection_macros.png" width=350px>

### Code Lowering

Compiler optimisation is a very hard task. To make the compiler's life easier, it should be given 'simple' code.

In Julia 'simple' means [static single-assignment form](https://en.wikipedia.org/wiki/Static_single-assignment_form):

> In compiler design, static single assignment form (often abbreviated as SSA form or simply SSA) is a property of an intermediate representation (IR) that requires each variable to be assigned exactly once and defined before it is used.
>
> ...
>
> One can expect to find SSA in a compiler for Fortran, C or C++ ...

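This transformation from source code towards SSA form is called **lowering**, and the result is called **lowered code**. Note that Julia's lowered code is not yet strict SSA: intermediate results become single-assignment `%N` values, but named variables may still be assigned more than once.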
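
To make the single-assignment idea concrete, here is a small hand-written sketch (both function names are made up for this illustration). It is not what the compiler emits, only the same rewriting done manually in ordinary Julia:

```julia
# Ordinary code: the variable `x` is assigned three times.
function reassigned(t)
    x = t + 1
    x = x * 2
    x = x - 3
    return x
end

# Hand-written single-assignment version: every name is assigned exactly once,
# so each use refers to one unambiguous definition.
function single_assignment(t)
    x1 = t + 1
    x2 = x1 * 2
    x3 = x2 - 3
    return x3
end

println(reassigned(5), " ", single_assignment(5))
```

Both functions compute the same result; the second form is simply easier for an optimiser to reason about.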

In [13]:
function basic_condition(bool::Bool)
    if bool
        return 0
    else
        return 1
    end
end

basic_condition (generic function with 1 method)

In [14]:
@code_lowered basic_condition(true)

CodeInfo(
1 ─     goto #3 if not bool
2 ─     return 0
3 ─     return 1
)

In [15]:
# Increase verbosity with `debuginfo=:source`
@code_lowered debuginfo=:source basic_condition(true)

CodeInfo(
    @ In[13]:2 within `basic_condition`
1 ─     goto #3 if not bool
    @ In[13]:3 within `basic_condition`
2 ─     return 0
    @ In[13]:5 within `basic_condition`
3 ─     return 1
)

In [16]:
function basic_loop()
    a = 0
    for i in [1, 2, 3]
        a += i
    end
    return a
end

basic_loop (generic function with 1 method)

In [17]:
@code_lowered debuginfo=:source basic_loop()

CodeInfo(
    @ In[16]:2 within `basic_loop`
1 ─       a = 0
│   @ In[16]:3 within `basic_loop`
│   %2  = Base.vect(1, 2, 3)
│         @_2 = Base.iterate(%2)
│   %4  = @_2 === nothing
│   %5  = Base.not_int(%4)
└──       goto #4 if not %5
2 ┄ %7  = @_2
│         i = Core.getfield(%7, 1)
│   %9  = Core.getfield(%7, 2)
│   @ In[16]:4 within `basic_loop`
│         a = a + i
│   @ In[16]:5 within `basic_loop`
│         @_2 = Base.iterate(%2, %9)
│   %12 = @_2 === nothing
│   %13 = Base.not_int(%12)
└──       goto #4 if not %13
3 ─       goto #2
    @ In[16]:6 within `basic_loop`
4 ┄       return a
)

- `#N` refers to [basic blocks](https://en.wikipedia.org/wiki/Basic_block) of code
  - Blocks are shown on the left with `|` characters outlining their span
- `%N` refers to static single assignment (SSA) values; when a previous SSA value is used, it is referenced by an `SSAValue` and printed as `%N`
- `@_N` refers to temporary variables
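
The printed form above is backed by ordinary Julia objects that can be inspected programmatically. The following is a minimal sketch, assuming Julia 1.6 or newer (where `Core.GotoIfNot` and `Core.ReturnNode` exist); `demo_sum` is a made-up example function:

```julia
function demo_sum()
    a = 0
    for i in (1, 2, 3)
        a += i
    end
    return a
end

# `code_lowered` is the function form of `@code_lowered`; it returns a vector
# with one `CodeInfo` per matching method.
ci = only(code_lowered(demo_sum, Tuple{}))

# Statement number n corresponds to the SSA value `%n` in the printed output.
for (n, stmt) in enumerate(ci.code)
    println("%", n, "  ", typeof(stmt), "  ", stmt)
end

# Conditional branches (`goto #N if not ...`) are stored as Core.GotoIfNot nodes.
n_branches = count(stmt -> stmt isa Core.GotoIfNot, ci.code)
println("conditional branches: ", n_branches)
```

The exact statements differ between Julia versions, so the listing is best read as a snapshot rather than a stable interface.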

In [18]:
function nextfib(n)
    a, b = one(n), one(n)
    while b < n
        a, b = b, a + b
    end
    return b
end

nextfib (generic function with 1 method)

In [19]:
@code_lowered debuginfo=:source nextfib(1)

CodeInfo(
    @ In[18]:2 within `nextfib`
1 ─ %1 = Main.one(n)
│   %2 = Main.one(n)
│        a = %1
└──      b = %2
    @ In[18]:3 within `nextfib`
2 ┄ %5 = b < n
└──      goto #4 if not %5
    @ In[18]:4 within `nextfib`
3 ─ %7 = b
│   %8 = a + b
│        a = %7
│        b = %8
│   @ In[18]:5 within `nextfib`
└──      goto #2
    @ In[18]:6 within `nextfib`
4 ─      return b
)

### Type Inference

The above lowered code now starts to get **specialised**: argument types and any explicit annotations are used to infer the types of all SSA variables (where/if possible), and that information is then used to specialise the function calls:

In [None]:
function nextfib(n)
    a, b = one(n), one(n)
    while b < n
        a, b = b, a + b
    end
    return b
end

In [20]:
@code_typed debuginfo=:source nextfib(1.0)

CodeInfo(
1 ─      nothing::Nothing
    @ In[18]:3 within `nextfib`
2 ┄ %2 = φ (#1 => 1.0, #3 => %6)::Float64
│   %3 = φ (#1 => 1.0, #3 => %2)::Float64
│   ┌ @ float.jl:412 within `<`
│   │ %4 = Base.lt_float(%2, n)::Bool
│   └
└──      goto #4 if not %4
    @ In[18]:4 within `nextfib`
    ┌ @ float.jl:383 within `+`
3 ─ │ %6 = Base.add_float(%3, %2)::Float64
│   └
│   @ In[18]:5 within `nextfib`
└──      goto #2
    @ In[18]:6 within `nextfib`
4 ─      return %2
) => Float64

In [21]:
@code_typed debuginfo=:source nextfib(1)

CodeInfo(
1 ─      nothing::Nothing
    @ In[18]:3 within `nextfib`
2 ┄ %2 = φ (#1 => 1, #3 => %6)::Int64
│   %3 = φ (#1 => 1, #3 => %2)::Int64
│   ┌ @ int.jl:83 within `<`
│   │ %4 = Base.slt_int(%2, n)::Bool
│   └
└──      goto #4 if not %4
    @ In[18]:4 within `nextfib`
    ┌ @ int.jl:87 within `+`
3 ─ │ %6 = Base.add_int(%3, %2)::Int64
│   └
│   @ In[18]:5 within `nextfib`
└──      goto #2
    @ In[18]:6 within `nextfib`
4 ─      return %2
) => Int64

Note the specialisation on the types: instead of the generic `+` and `<`, the type-specific `add_float`/`add_int` and `lt_float`/`slt_int` are now used!
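
The same information is available programmatically through the function form `code_typed`. The sketch below is only an illustration (`demo_nextfib` is a renamed copy of `nextfib`, so that it can run on its own), and the exact printed IR may differ between Julia versions:

```julia
function demo_nextfib(n)
    a, b = one(n), one(n)
    while b < n
        a, b = b, a + b
    end
    return b
end

# `code_typed` returns one `CodeInfo => return type` pair per matching method.
ci_int, rt_int = only(code_typed(demo_nextfib, (Int,)))
ci_float, rt_float = only(code_typed(demo_nextfib, (Float64,)))

println("inferred return types: ", rt_int, " and ", rt_float)

# The generic `+` has been replaced by a type-specific intrinsic in each case.
println(occursin("add_int", string(ci_int)))
println(occursin("add_float", string(ci_float)))
```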

Whereas in Python:

```ipython
In [1]: import dis

In [2]: def nextfib(n):
   ...:     a, b = 1, 1
   ...:     while b < n:
   ...:         a, b = b, a + b
   ...:     return b
   ...: 

In [3]: dis.dis(nextfib)
  2           0 LOAD_CONST               1 ((1, 1))
              2 UNPACK_SEQUENCE          2
              4 STORE_FAST               1 (a)
              6 STORE_FAST               2 (b)

  3           8 LOAD_FAST                2 (b)
             10 LOAD_FAST                0 (n)
             12 COMPARE_OP               0 (<)
             14 POP_JUMP_IF_FALSE       19 (to 38)

  4     >>   16 LOAD_FAST                2 (b)
             18 LOAD_FAST                1 (a)
             20 LOAD_FAST                2 (b)
             22 BINARY_ADD
             24 ROT_TWO
             26 STORE_FAST               1 (a)
             28 STORE_FAST               2 (b)

  3          30 LOAD_FAST                2 (b)
             32 LOAD_FAST                0 (n)
             34 COMPARE_OP               0 (<)
             36 POP_JUMP_IF_TRUE         8 (to 16)

  5     >>   38 LOAD_FAST                2 (b)
             40 RETURN_VALUE

In [4]: def add(a, b):
    ...:     return a.__add__(b)
    ...: 

In [5]: dis.dis(add)
  2           0 LOAD_FAST                0 (a)
              2 LOAD_METHOD              0 (__add__)
              4 LOAD_FAST                1 (b)
              6 CALL_METHOD              1
              8 RETURN_VALUE
```

Types are not known, so the correct functions have to be found **each time** an operation is done.

Whereas Julia can compile the code once for given input types and then directly call the required function.

This crucial process is known as **type inference** and its success is the basis for a good specialization (i.e. performant native code as a result). It will concern us in much more detail tomorrow.
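
As a small preview, here is a hedged sketch of what it looks like when inference succeeds only partially. The two functions are made up for this illustration; `Base.return_types` is an unexported but commonly used reflection helper:

```julia
# The return type depends on the *value* of x (Int for positive x, Float64 otherwise),
# so inference can only narrow it down to a Union.
unstable(x) = x > 0 ? x : 0.0

# The return type depends only on the *type* of x.
stable(x) = x > 0 ? x : zero(x)

rt_unstable = only(Base.return_types(unstable, (Int,)))
rt_stable = only(Base.return_types(stable, (Int,)))

println("unstable: ", rt_unstable)
println("stable:   ", rt_stable)
```

Code that consumes the result of `unstable` has to handle both possible types, whereas the result of `stable` is fully known at compile time.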

### LLVM IR

The next step is to go from lowered typed code to LLVM IR.

Julia uses the LLVM compiler framework, which is also used by Rust, Swift, Kotlin, and other languages.

'IR' means [Intermediate Representation](https://en.wikipedia.org/wiki/Intermediate_representation):

> An intermediate representation (IR) is the data structure or code used internally by a compiler or virtual machine to represent source code. An IR is designed to be conducive to further processing, such as optimization and translation

In [None]:
function basic_condition(bool::Bool)
    if bool
        return 0
    else
        return 1
    end
end

In [24]:
@code_lowered basic_condition(true) 

CodeInfo(
1 ─     goto #3 if not bool
2 ─     return 0
3 ─     return 1
)

In [28]:
?@code_llvm

```
@code_llvm
```

Evaluates the arguments to the function or macro call, determines their types, and calls [`code_llvm`](@ref) on the resulting expression. Set the optional keyword arguments `raw`, `dump_module`, `debuginfo`, `optimize` by putting them and their value before the function call, like this:

```
@code_llvm raw=true dump_module=true debuginfo=:default f(x)
@code_llvm optimize=false f(x)
```

`optimize` controls whether additional optimizations, such as inlining, are also applied. `raw` makes all metadata and dbg.* calls visible. `debuginfo` may be one of `:source` (default) or `:none`,  to specify the verbosity of code comments. `dump_module` prints the entire module that encapsulates the function.
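
The keyword arguments described above are also accepted by the function form `code_llvm`, which can write into any `IO`. The following sketch (with the made-up function `demo_condition`) captures the IR as a string, once with and once without the additional optimizations:

```julia
using InteractiveUtils  # provides `code_llvm`; loaded automatically in the REPL and in notebooks

demo_condition(flag::Bool) = flag ? 0 : 1

# Capture the IR as a string instead of printing it directly.
ir_opt = sprint(io -> code_llvm(io, demo_condition, (Bool,); debuginfo=:none))
ir_raw = sprint(io -> code_llvm(io, demo_condition, (Bool,); debuginfo=:none, optimize=false))

println(ir_opt)
println("optimized IR:   ", length(ir_opt), " characters")
println("unoptimized IR: ", length(ir_raw), " characters")
```

Having the IR as a string makes it easy to search or compare it for different argument types.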


In [22]:
@code_llvm basic_condition(true)  # dump_module=true shows the 'full' ir

;  @ In[13]:1 within `basic_condition`
define i64 @julia_basic_condition_2527(i8 zeroext %0) #0 {
top:
;  @ In[13]:2 within `basic_condition`
  %1 = and i8 %0, 1
  %2 = xor i8 %1, 1
  %3 = zext i8 %2 to i64
;  @ In[13] within `basic_condition`
  ret i64 %3
}


In [None]:
function nextfib(n)
    a, b = one(n), one(n)
    while b < n
        a, b = b, a + b
    end
    return b
end

In [31]:
@code_llvm debuginfo=:none nextfib(1)

define i64 @julia_nextfib_2874(i64 signext %0) #0 {
top:
  br label %L2

L2:                                               ; preds = %L2, %top
  %value_phi = phi i64 [ 1, %top ], [ %1, %L2 ]
  %value_phi1 = phi i64 [ 1, %top ], [ %value_phi, %L2 ]
  %.not = icmp slt i64 %value_phi, %0
  %1 = add i64 %value_phi1, %value_phi
  br i1 %.not, label %L2, label %L8

L8:

### Native Code

In [26]:
@code_native debuginfo=:source nextfib(1)  # binary=true to see the raw binary code

	.text
	.file	"nextfib"
	.globl	julia_nextfib_2652              # -- Begin function julia_nextfib_2652
	.p2align	4, 0x90
	.type	julia_nextfib_2652,@function
julia_nextfib_2652:                     # @julia_nextfib_2652
; ┌ @ In[18]:1 within `nextfib`
	.cfi_startproc
# %bb.0:                                # %top
	movl	$1, %ecx
	movl	$1, %eax
	.p2align	4, 0x90
.LBB0_1:                                # %L2
                                        # =>This Inner Loop Header: Depth=1
	movq	%rax, %rdx
	movq	%rcx, %rax
; │ @ In[18]:4 within `nextfib`
; │┌ @ int.jl:87 within `+`
	addq	%rcx, %rdx
	movq	%rdx,

Let's compare this to floating-point input.


In [27]:
@code_native debuginfo=:source nextfib(1.0)

	.text
	.file	"nextfib"
	.section	.rodata.cst8,"aM",@progbits,8
	.p2align	3                               # -- Begin function julia_nextfib_2656
.LCPI0_0:
	.quad	0x3ff0000000000000              # double 1
	.text
	.globl	julia_nextfib_2656
	.p2align	4, 0x90
	.type	julia_nextfib_2656,@function
julia_nextfib_2656:                     # @julia_nextfib_2656
; ┌ @ In[18]:1 within `nextfib`
	.cfi_startproc
# %bb.0:                                # %top
	movabsq	$.LCPI0_0, %rax
	vmovsd	(%rax), %xmm1                   # xmm1 = mem[0],zero
; │ @ In[18]:3 within `nextfib`
; │┌ @ float.jl:412 within `<`
	vucomisd	%xmm1, %xmm0
; │└

## How Important is Specialization?

Let's try to estimate the performance gain from specialization. We can do this by sabotaging the compilation process, namely by throwing away type information.

This way Julia will act, roughly, in the same way as Python: it has to work out what can be done to a variable every single time it encounters it. We can do this by storing the variables in a `Vector{Any}`.

First, let's write the same `nextfib` function in Python and benchmark it:

```ipython
In [1]: def nextfib(n):
   ...:     a, b = 1, 1
   ...:     while b < n:
   ...:         a, b = b, a + b
   ...:     return b
   ...: 

In [2]: %timeit nextfib(100_000)
951 ns ± 7.39 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```

Now, for reference, here is the standard implementation in Julia:

In [32]:
function nextfib(n)
    a, b = one(n), one(n)
    while b < n
        a, b = b, a + b
    end
    return b
end

nextfib (generic function with 1 method)

In [35]:
using BenchmarkTools
@benchmark nextfib(100_000)

BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.264 ns … 17.914 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.284 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.390 ns ±  0.329 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

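And the broken version, which throws away type information by storing all variables in a `Vector{Any}`. As a consequence:

- Every value read from `vars` has an unknown type, so `<` and `+` must be looked up dynamically on every iteration
- The bounds checks on `vars[...]` stay in place, even though the vector always has three elements
- Specializing on the type of the argument `n` no longer helps, because `n` is only used after being read back from the untyped vector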

In [36]:
function nextfib_bad(n)
    vars::Vector{Any} = [1, 1, n]
    while vars[2] < vars[3]
        vars[1], vars[2] = vars[2], vars[1] + vars[2]
    end
    return vars[2]
end

nextfib_bad (generic function with 1 method)

In [37]:
@benchmark nextfib_bad(100_000)

BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.211 μs …   4.438 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.365 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.413 μs ± 199.772 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

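This is (on my computer) *almost* the same as the Python version: a median of about 1.4 μs here versus about 0.95 μs in Python, compared to about 2.3 ns for the specialized Julia version!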
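
The loss of type information can also be seen without any benchmarking by asking the compiler for the inferred return type. This is a minimal sketch with renamed copies of the two functions (`fib_typed` and `fib_untyped` are names chosen for this illustration):

```julia
function fib_typed(n)
    a, b = one(n), one(n)
    while b < n
        a, b = b, a + b
    end
    return b
end

function fib_untyped(n)
    vars = Any[1, 1, n]   # element type Any: the compiler forgets what is stored
    while vars[2] < vars[3]
        vars[1], vars[2] = vars[2], vars[1] + vars[2]
    end
    return vars[2]
end

# Both versions compute the same number ...
println(fib_typed(100_000), " ", fib_untyped(100_000))

# ... but only the first one has a concrete inferred return type.
rt_typed = only(Base.return_types(fib_typed, (Int,)))
rt_untyped = only(Base.return_types(fib_untyped, (Int,)))
println(rt_typed, " vs ", rt_untyped)
```

An inferred return type of `Any` means that callers of the function cannot be specialized on its result either.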

In [39]:
# @code_typed nextfib_bad(100)

In [41]:
# @code_native nextfib_bad(100)

## Types vs Values

In high performance computing, compilation time (on the order of seconds or minutes) is typically negligible compared to the time it takes to perform the actual computation (easily on the order of hours, days, or weeks). Therefore, we generally want to optimize for runtime efficiency even if this means that compilation time goes up by a reasonable amount.

**Julia specializes on input types and not values!**

Primarily it is **type information** that is used by the compiler to specialize code. (There are special techniques like, e.g., constant propagation and others that we are neglecting here.)

(Very) roughly speaking, the more information there is in *type space* (e.g. in type parameters) the higher the likelihood that the compiler produces fast and efficient code.
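
A common way to move a value into type space is the `Val` wrapper from Base. The sketch below uses a made-up function `describe` to show that two values of the same type are indistinguishable for dispatch and specialization, while their `Val` wrappers are not:

```julia
# Two different values of the same type: one specialization serves both.
println(typeof(2) == typeof(3))

# Wrapping a value in `Val` lifts it into the type domain.
println(typeof(Val(2)) == typeof(Val(3)))

# Hypothetical example function: dispatch (and specialization) can now
# distinguish the two cases.
describe(::Val{2}) = "special case for 2"
describe(::Val{N}) where {N} = "generic case for $N"

println(describe(Val(2)))
println(describe(Val(5)))
```

The price is that every distinct value now triggers its own compilation, so this only pays off for a small set of values.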

As before, here is a Python benchmark:

```ipython
In [1]: import numpy as np

In [2]: A = np.random.rand(10, 10)

In [3]: B = np.random.rand(10, 10)

In [4]: %timeit A + B
507 ns ± 4.51 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```

In [None]:
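using PyCall, BenchmarkTools  # assumed here: PyCall provides `pyimport`, BenchmarkTools provides `@btime`
np = pyimport("numpy")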

A = np.random.rand(10, 10)
B = np.random.rand(10, 10);

In [None]:
@btime $A + $B;

In [53]:
A = rand(10, 10);
B = rand(10, 10);
@btime $A + $B;

  133.979 ns (1 allocation: 896 bytes)


In [44]:
typeof(A)

Matrix{Float64} (alias for Array{Float64, 2})

In [45]:
size(A)

(10, 10)

In [46]:
size(typeof(A)) # the size of A isn't type information

LoadError: MethodError: no method matching size(::Type{Matrix{Float64}})
Closest candidates are:
  size(::Union{LinearAlgebra.Adjoint{T, var"#s886"}, LinearAlgebra.Transpose{T, var"#s886"}} where {T, var"#s886"<:(AbstractVector)}) at /usr/share/julia/stdlib/v1.8/LinearAlgebra/src/adjtrans.jl:173
  size(::Union{LinearAlgebra.Adjoint{T, var"#s886"}, LinearAlgebra.Transpose{T, var"#s886"}} where {T, var"#s886"<:(AbstractMatrix)}) at /usr/share/julia/stdlib/v1.8/LinearAlgebra/src/adjtrans.jl:174
  size(::Union{LinearAlgebra.QR, LinearAlgebra.QRCompactWY, LinearAlgebra.QRPivoted}) at /usr/share/julia/stdlib/v1.8/LinearAlgebra/src/qr.jl:581
  ...

In [47]:
using StaticArrays

In [48]:
A = @SMatrix rand(10, 10);
B = @SMatrix rand(10, 10);

In [49]:
typeof(A)

SMatrix{10, 10, Float64, 100} (alias for SArray{Tuple{10, 10}, Float64, 2, 100})

In [50]:
size(typeof(A)) # the size of A is type information!

(10, 10)
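
How the size ends up in the type can be illustrated with a tiny hand-rolled container. This is only a sketch of the idea and not how StaticArrays.jl is implemented; `FixedVec` is a made-up type:

```julia
# Hypothetical minimal fixed-size vector: the length N is a type parameter.
struct FixedVec{N, T}
    data::NTuple{N, T}
end

# The length can be answered from the type alone, without an instance.
Base.length(::Type{FixedVec{N, T}}) where {N, T} = N
Base.length(v::FixedVec) = length(typeof(v))

# Both operands must share the same N; a mismatch is a MethodError at dispatch.
Base.:+(a::FixedVec{N}, b::FixedVec{N}) where {N} = FixedVec(map(+, a.data, b.data))

v = FixedVec((1.0, 2.0, 3.0))
println(typeof(v))
println(length(typeof(v)))
println((v + v).data)
```

Because `N` is known at compile time, the compiler can in principle unroll the addition and keep the data off the heap.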

In [51]:
@btime $A + $B;

  30.580 ns (0 allocations: 0 bytes)


**StaticArrays.jl**

```
============================================
    Benchmarks for 3×3 Float64 matrices
============================================
Matrix multiplication               -> 5.9x speedup
Matrix multiplication (mutating)    -> 1.8x speedup
Matrix addition                     -> 33.1x speedup
Matrix addition (mutating)          -> 2.5x speedup
Matrix determinant                  -> 112.9x speedup
Matrix inverse                      -> 67.8x speedup
Matrix symmetric eigendecomposition -> 25.0x speedup
Matrix Cholesky decomposition       -> 8.8x speedup
Matrix LU decomposition             -> 6.1x speedup
Matrix QR decomposition             -> 65.0x speedup
```

### Why not always use static arrays then?!

By putting more information in the type you are putting more stress on the compiler to optimize things.

Specifically, if static arrays are too big, compile time can explode or the compiler might just give up and fall back to an inefficient default version.

Generally speaking, static arrays are only useful as small fixed-size arrays.

In [None]:
# should take (much) longer to compile and the speedup should be gone as well
# if it isn't, increase N a little bit
N = 100
M = rand(N,N);
Mstatic = SMatrix{N,N}(M);

@btime $Mstatic + $Mstatic;
@btime $M + $M;

### Dispatch and Specialization

Having a reasonable amount of information encoded in the type domain isn't only useful to help the compiler (specialization) but also for dispatching to the most specific (and therefore hopefully most performant) method of a function.

**Types drive both specialization and multiple dispatch!**

In this sense, multiple dispatch is essentially the first step of the specialization process where Julia chooses between different implementations.
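
The selection step can be made visible with a made-up function `kind` that has three methods of increasing specificity; `which` reports the method that dispatch picks for a given argument type:

```julia
# Hypothetical function with three methods of increasing specificity.
kind(x::Number) = "some number"
kind(x::Integer) = "an integer"
kind(x::Int8) = "an 8-bit integer"

println(kind(1.5))
println(kind(1))
println(kind(Int8(1)))

# The method that dispatch selects for an Int8 argument.
println(which(kind, (Int8,)))
```

Once the method is chosen, it is compiled (specialized) for the concrete argument types of the call.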

#### Example: Determinant of a 2x2 matrix

Let's say your task would be to write a function computing the determinant of a 2x2 matrix. How would you implement it?

Probably you'd say, well I know the formula for computing the determinant of a 2x2 matrix! Let's just implement it.

In Python:

```ipython
In [1]: import numpy as np

In [2]: M = np.array([[1, 2], [3, 4]])

In [3]: def det_2x2(X):
   ...:     return X[0, 0] * X[1, 1] - X[0, 1] * X[1, 0]
   ...: 

In [4]: %timeit det_2x2(M)
502 ns ± 10.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```

And for Julia:

In [1]:
using BenchmarkTools

In [2]:
det_2x2(X) = X[1, 1] * X[2, 2] - X[1, 2] * X[2, 1]

det_2x2 (generic function with 1 method)

In [3]:
M = [1 2; 3 4]

2×2 Matrix{Int64}:
 1  2
 3  4

In [4]:
det_2x2(M)

-2

Let's see how the built-in `det` functions compare to our hand-written formula, first in NumPy:

```ipython
In [6]: %timeit np.linalg.det(M)
4.75 µs ± 59.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

Almost 10x slower than the hand-written version! And for Julia:

In [6]:
using LinearAlgebra

det(M)

-2.0

In [12]:
@btime det($M);

  197.053 ns (2 allocations: 176 bytes)


That is roughly 85x slower than our `det_2x2`, which takes only about 2.3 ns (benchmarked below)!

The reason isn't just that the compiler doesn't know the size of the matrix from its type, but also that [the code it considers](https://github.com/JuliaLang/julia/blob/release-1.8/stdlib/LinearAlgebra/src/generic.jl#L1544-L1550) (selected by the dispatch mechanism) is too general to compete with our implementation in `det_2x2`.

Let's now move the size information to the type domain and see how things change.

In [8]:
using StaticArrays
S = @SMatrix [1 2; 3 4]

2×2 SMatrix{2, 2, Int64, 4} with indices SOneTo(2)×SOneTo(2):
 1  2
 3  4

In [15]:
@btime det($S);

  2.765 ns (0 allocations: 0 bytes)


In [16]:
@btime det_2x2($M);

  2.274 ns (0 allocations: 0 bytes)


Note that `det(S)` is so much faster because StaticArrays.jl provides [a hand-coded version](https://github.com/JuliaArrays/StaticArrays.jl/blob/master/src/det.jl#L10-L12), similar to our `det_2x2` above, which gets selected because of the size information in the type.
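
The mechanism can be imitated in a few lines. The following is a simplified sketch and not the StaticArrays.jl implementation; `Mat2` is a made-up type whose size is implied by the type itself, so a specialized `det` method can be attached to it:

```julia
using LinearAlgebra

# Hypothetical 2x2 matrix type: the size is implied by the type itself.
struct Mat2{T} <: AbstractMatrix{T}
    data::NTuple{4, T}   # column-major storage
end
Base.size(::Mat2) = (2, 2)
Base.getindex(m::Mat2, i::Int, j::Int) = m.data[i + 2 * (j - 1)]

# Specialized method: selected by dispatch for every Mat2.
LinearAlgebra.det(m::Mat2) = m.data[1] * m.data[4] - m.data[3] * m.data[2]

m = Mat2((1, 3, 2, 4))     # represents [1 2; 3 4]
println(det(m))            # uses the hand-coded formula
println(det([1 2; 3 4]))   # uses the generic method for Matrix
```

The call site is identical in both cases; only the argument type decides which implementation runs.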

The (tiny) speed difference compared to our own `det_2x2` is only due to bounds checking and matrix vs linear indexing.

In [None]:
det_2x2_optimized(X) = X[1] * X[4] - X[3] * X[2]
@btime det_2x2_optimized($M);
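
To also remove the remaining bounds checks, one option is the `@inbounds` macro. This is a sketch under the assumption that the input really has at least four elements; otherwise the behaviour is undefined. The name `det_2x2_inbounds` is made up for this illustration:

```julia
# Sketch: `@inbounds` skips the bounds checks on the four reads.
# This is only safe if X really has at least four elements.
det_2x2_inbounds(X) = @inbounds X[1] * X[4] - X[3] * X[2]

M_demo = [1 2; 3 4]
println(det_2x2_inbounds(M_demo))
```

Linear indexing follows Julia's column-major layout, so `X[3]` is the element in row 1, column 2.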

## Are Explicit Type Annotations Necessary?

Fortran and C require them. Are they required in Julia?

In [17]:
function my_function(x)
    y = rand()
    z = rand()
    x + y + z
end

function my_function_typed(x::Int)::Float64
    y::Float64 = rand()
    z::Float64 = rand()
    x + y + z
end

my_function_typed (generic function with 1 method)

Nope! Julia's type inference is powerful. Specifying types **is not** necessary for best performance.

In [22]:
a = @code_llvm my_function(10)

;  @ In[17]:1 within `my_function`
define double @julia_my_function_2450(i64 signext %0) #0 {
top:
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #4
  %ppgcstack_i8 = getelementptr i8, i8* %thread_ptr, i64 -8
  %ppgcstack = bitcast i8* %ppgcstack_i8 to {}****
  %pgcstack = load {}***, {}**** %ppgcstack, align 8
;  @ In[17]:2 within `my_function`
; ┌ @ /cache/build/default-aws-shared0-3/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/Random/src/Random.jl:257 within `rand` @ /cache/build/default-aws-shared0-

In [23]:
b = @code_llvm my_function_typed(10);

;  @ In[17]:7 within `my_function_typed`
define double @julia_my_function_typed_2452(i64 signext %0) #0 {
top:
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #4
  %ppgcstack_i8 = getelementptr i8, i8* %thread_ptr, i64 -8
  %ppgcstack = bitcast i8* %ppgcstack_i8 to {}****
  %pgcstack = load {}***, {}**** %ppgcstack, align 8
;  @ In[17]:8 within `my_function_typed`
; ┌ @ /cache/build/default-aws-shared0-3/julialang/julia-release-1-dot-8/usr/share/julia/stdlib/v1.8/Random/src/Random.jl:257 within `rand` @ /cache/build/de

Annotating types explicitly can serve a purpose.

* Enforce conversions
* Very rarely: help the compiler infer types in tricky situations

However, more often than not it is an indication of suboptimal code design. (It also makes functions much less generic and reusable!)
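
The two effects of annotations mentioned above can be seen in a short sketch (all three function names are made up for this illustration):

```julia
plain(x) = x + 1
converted(x)::Float64 = x + 1     # return annotation: the result is converted to Float64
only_ints(x::Int) = x + 1         # argument annotation: restricts which calls are accepted

println(plain(1))
println(converted(1))
println(plain(1.5))               # the untyped version stays generic

# The annotated argument rejects anything that is not an Int.
rejected = try
    only_ints(1.5)
    false
catch err
    err isa MethodError
end
println(rejected)
```

Note that `plain` and `only_ints` are compiled to equally specialized code for an `Int` argument; the annotation only narrows the set of accepted inputs.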

# Core messages of this Notebook

* **A function is compiled when called for the first time** with a given set of argument types.
* There are **multiple compilation steps** which can be inspected through macros like `@code_lowered`, `@code_typed`, `@code_llvm`, and `@code_native`.
* **Code specialization** based on the types of all of the input arguments is important for speed.
* Critical information can be moved to the **type domain** for better dispatch and specialization.
* In virtually all cases, **explicit type annotations are irrelevant for performance**.

# References

- https://github.com/carstenbauer/JuliaHLRS22/blob/main/Day1/3_specialization.ipynb
- https://docs.julialang.org/en/v1/devdocs/ast/
- https://juliadebug.github.io/JuliaInterpreter.jl/stable/ast/
- https://stackoverflow.com/questions/43453944/what-is-the-difference-between-code-native-code-typed-and-code-llvm-in-julia
- https://tenthousandmeters.com/blog/python-behind-the-scenes-4-how-python-bytecode-is-executed/