# Julia Performance Tips

In this notebook, I will give performance tips for programming in Julia.
We will see specific examples in action.

For more examples and detailed explanations, read the fantastic advice in the links below:

- Julia Documentation: Performance Tips [link](https://docs.julialang.org/en/v1/manual/performance-tips/)
- Chris Rackauckas SciML Course: Optimizing Serial Code [link](https://book.sciml.ai/notes/02-Optimizing_Serial_Code/)
- Modern Julia Workflows: Optimizing Tutorial [link](https://modernjuliaworkflows.org/optimizing/)
- Guillaume Dalle: High Performance Julia [link](https://gdalle.github.io/JuliaPerf-CERMICS/)

Please follow along using your own machine!

## Learning to Write Efficient Julia Code

How should you learn to write efficient Julia code?
By talking with and learning from others!
The Julia community is very welcoming and helpful:

- Discourse [link](https://discourse.julialang.org/)
- StackOverflow [link](https://stackoverflow.com/questions/tagged/julia?tab=Newest)

After a bit of time using Julia, you may also want to answer questions on these sites.
By answering questions, you figure out what makes a good question, and you also learn more about the language.

## General Performance Tips

There are a few general performance tips when writing in Julia.

1. Do not use global variables
2. Put your code into functions
3. Facilitate type inference
4. Reduce memory allocations

If you're in the habit of writing modular code, then (1) and (2) are already taken care of.
Honestly, a lot of Julia documentation focuses on facilitating type inference, but I haven't run into an example where this is the issue in my research code; perhaps that's because I'm lucky enough to be writing type-stable functions (doubtful).

From my own experience using Julia, unnecessary memory allocations are always the culprit.
For this reason, reducing memory allocations will be the main focus of this tutorial.

## Timing Code

In order to show off these performance tips, we will need to measure the performance of our code. 
So, how do you time your code in Julia?

There are two options:

1. **Option 1**: `@time exp`
  - executes expression `exp`, reports execution time and memory allocation
  - built-in utility macro
2. **Option 2**: `@btime exp`
  - like `@time`, but runs the expression multiple times for higher reporting accuracy
  - (*requires the `BenchmarkTools.jl` package*)
  
Let's see how to use `@time` first.

In [1]:
A = randn(100, 100)
B = randn(100, 100)

@time A * B;

  0.014414 seconds (114 allocations: 89.297 KiB, 71.67% compilation time)


We can see the execution time and the number of allocations.
But let's see what happens when we run the function again.

In [2]:
@time A * B;

  0.000277 seconds (3 allocations: 116.078 KiB)


Whoa, the execution time dropped by a lot.
What's up with that?

Julia is a Just-In-Time (JIT) compiled language.
This means that the first time a function is called with a given combination of argument types within a Julia session, the source code is compiled down to optimized machine code.
Later calls with the same argument types reuse the compiled code, so they are faster for the rest of the session.

In fact, the `@time` macro warned us about this by reporting the percentage of time spent on compilation.

Let's just see this happen again.

In [3]:
function mat_mul(A,B)
    return A * B
end

n = 100
A = randn(n,n)
B = randn(n,n)

print("First run:  ")
@time mat_mul(A,B)

print("Second run: ")
@time mat_mul(A,B);

First run:    0.020761 seconds (1.45 k allocations: 146.406 KiB, 97.63% compilation time)
Second run:   0.000732 seconds (3 allocations: 80.078 KiB)
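Compilation happens once per combination of argument types, not once per function name. The following is a minimal sketch of that behaviour using the built-in `@elapsed` macro; the `Float32` inputs and the variable names are assumptions made for illustration, and the exact timings will differ on your machine.

```julia
mat_mul(A, B) = A * B

A64, B64 = randn(100, 100), randn(100, 100)
A32, B32 = randn(Float32, 100, 100), randn(Float32, 100, 100)

mat_mul(A64, B64)                          # first Float64 call: triggers compilation
t_warm64 = @elapsed mat_mul(A64, B64)      # reuses the compiled Float64 code

t_first32 = @elapsed mat_mul(A32, B32)     # new argument types: compiled again
t_warm32  = @elapsed mat_mul(A32, B32)     # reuses the compiled Float32 code

println("Float64, warm:  ", t_warm64, " s")
println("Float32, first: ", t_first32, " s")
println("Float32, warm:  ", t_warm32, " s")
```

On most machines the first `Float32` call is noticeably slower than the warm calls, even though `mat_mul` was already compiled for `Float64` inputs.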


The `@time` macro is great, but it has a few drawbacks.
The main drawback is that the expression is only run once, which sometimes leads to unstable reporting of execution time.
The reason is that your machine is doing many things, not just running Julia.
So, the measured execution time can differ between runs.

To overcome this issue (and the compilation issue), you might want to use `@btime`.
It evaluates the expression multiple times and reports the minimum time, along with the memory allocations.

In [4]:
using BenchmarkTools

@btime mat_mul(A,B);

  153.627 μs (3 allocations: 80.08 KiB)


In fact, `@btime` is a condensed version of `@benchmark`, another timing macro that gives you a lot more data.

In [5]:
@benchmark mat_mul(A,B)

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  155.578 μs … 24.148 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     234.258 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   280.622 μs ± 373.489 μs  ┊ GC (mean ± σ):  5.09% ± 7.62%

We can see more summary statistics about our execution time from 10,000 samples of the expression.
Pretty cool!

In fact, there are a lot of bells and whistles that `@benchmark` comes with.
For example, suppose you wanted to test the speed of `mat_mul` on many random matrices `A` and `B`.
You might write something like this.

In [6]:
n = 100
@benchmark mat_mul(randn(n,n), randn(n,n))

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  216.829 μs … 15.060 ms  ┊ GC (min … max):  0.00% … 97.40%
 Time  (median):     395.794 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   451.038 μs ± 426.610 μs  ┊ GC (mean ± σ):  10.97% ± 11.58%

But this has a downside: the generation of the two random matrices `A` and `B` is included in the run time and the memory estimates!
In fact, we can see this because the execution time is noticeably larger than in the example above.

How can we get around this?

Well, `@benchmark` has a functionality specifically for this.
We can use the `setup` argument.

In [7]:
@benchmark mat_mul(data...) setup=(data=(randn(n,n), randn(n,n)))

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  158.648 μs … 13.254 ms  ┊ GC (min … max): 0.00% … 97.56%
 Time  (median):     228.091 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   243.729 μs ± 226.185 μs  ┊ GC (mean ± σ):  5.81% ±  6.86%

Anyway, we could go down this rabbit hole of timing basically forever.
The reality is that, for the purpose of optimizing our code, we don't care about minor improvements in run time; we care about significant (say, 2x or 10x) speed-ups.
So the details of exactly how you time the code won't matter much.
It will typically be fine to just use `@btime` by itself.
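When all you want is a rough speed-up factor between two implementations, one option is `@belapsed` from `BenchmarkTools.jl`, which returns the minimum time in seconds as a number rather than printing it. The sketch below assumes two hypothetical functions, `sum_sq_alloc` and `sum_sq_direct`, written only for this illustration; the `$x` splices the value of `x` into the benchmark so that the global variable lookup is not timed.

```julia
using BenchmarkTools

x = randn(10^4)

# Hypothetical implementations of the same computation.
sum_sq_alloc(x)  = sum(x .^ 2)    # builds a temporary array, then sums it
sum_sq_direct(x) = sum(abs2, x)   # squares each element while summing

t_alloc  = @belapsed sum_sq_alloc($x)    # minimum time in seconds
t_direct = @belapsed sum_sq_direct($x)

println("speed-up: ", round(t_alloc / t_direct, digits = 1), "x")
```

A ratio well above 1 is the kind of improvement worth chasing; a ratio close to 1 is usually within the noise.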

*This exercise is to be completed in small groups of 3 or less*

**Exercise 1**: Benchmark a function for matrix conjugation, i.e. compute $A^\top X A$ 

## Reducing Memory Allocations

Now that we have a way to evaluate the performance of our code, we can move on to one of the most important lessons in Julia (and indeed in any high-level programming language): reducing memory allocations.

### In-Place Functions and Pre-Allocating Memory

The first way to reduce memory allocations is to not allocate memory at all.

- An **in-place function** is one that modifies (i.e. overwrites) its input arguments directly rather than creating and returning a new object.
- By convention, the names of in-place functions in Julia end with a `!`

Let's see a simple example based on sorting.

In [8]:
x = randn(5)

5-element Vector{Float64}:
 -0.6104295279403974
 -0.08816670911839933
 -0.2796911726064586
  0.7880772986789002
  0.613527674395732

In [9]:
y = sort(x) # this is not in-place, because it creates a new array y

println("This is y $y")
println("This is x $x")

This is y [-0.6104295279403974, -0.2796911726064586, -0.08816670911839933, 0.613527674395732, 0.7880772986789002]
This is x [-0.6104295279403974, -0.08816670911839933, -0.2796911726064586, 0.7880772986789002, 0.613527674395732]


In [10]:
sort!(x) # this is in place

println("This is x $x")

This is x [-0.6104295279403974, -0.2796911726064586, -0.08816670911839933, 0.613527674395732, 0.7880772986789002]


Let's see the difference in performance.

In [11]:
x = randn(10^5)

print("Sort (copy): ")
@btime sort(x)

print("Sort (in-place):")
@btime sort!(x);

Sort (copy):   2.940 ms (12 allocations: 1.17 MiB)
Sort (in-place):  205.397 μs (0 allocations: 0 bytes)
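One caveat about the comparison above: after the first call to `sort!(x)`, the array is already sorted, so later evaluations time the sorting of sorted data. A fairer comparison, sketched below, uses the `setup` argument to hand every timed call a fresh unsorted copy and `evals=1` so that each copy is sorted only once. The variable name `y` is an assumption made for this example.

```julia
using BenchmarkTools

x = randn(10^5)

# `setup` runs before each sample and is not included in the timing.
# `evals=1` makes sure each fresh copy `y` is sorted exactly once.
print("Sort (in-place, fresh copy): ")
@btime sort!(y) setup=(y = copy($x)) evals=1;

# The allocating version on the same unsorted input, for comparison.
print("Sort (copy): ")
@btime sort($x);
```

With this setup the in-place version should still avoid the allocation of a full copy, but the gap in run time is likely to be smaller than the one shown above.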


Here's a small example about adding new elements to an array.

- `vcat` will create a new array
- `append!` will update the existing array (its first argument) in-place

In [12]:
x = randn(1000)
a = 5

print("vcat:  ")
@btime vcat(x, [a]) # this allocates a new array 

print("append!: ")
@btime append!(x, a); # this performs the action in-place

vcat:    1.281 μs (5 allocations: 8.12 KiB)
append!:   64.722 ns (0 allocations: 0 bytes)


Note that this is a difference of roughly 20x, which adds up quickly if the call is made millions of times in your code.
This would be a huge improvement in, say, a Monte Carlo simulation.
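The same idea applies when the final size of the result is known ahead of time: allocate the output once and fill it in, instead of growing it piece by piece. The sketch below compares the two approaches on a running mean; both function names are hypothetical and exist only for this illustration.

```julia
# Grows the result with `vcat`: a new array is allocated on every iteration.
function running_mean_vcat(samples)
    out = Float64[]
    total = 0.0
    for (i, s) in enumerate(samples)
        total += s
        out = vcat(out, total / i)
    end
    return out
end

# Pre-allocates the output once and fills it in place.
function running_mean_prealloc(samples)
    out = Vector{Float64}(undef, length(samples))
    total = 0.0
    for (i, s) in enumerate(samples)
        total += s
        out[i] = total / i
    end
    return out
end

samples = randn(10^4)
running_mean_vcat(samples) ≈ running_mean_prealloc(samples)
```

Timing both functions with `@btime` should show the pre-allocated version performing a single allocation, while the `vcat` version allocates once per sample.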

We can also look at an in-place matrix-vector multiplication.

In [13]:
using LinearAlgebra

n = 100

y = zeros(n)
A = randn(n,n)
x = randn(n)

print("Standard ")
@btime y = A * x

print("In Place ")
@btime mul!(y,A,x);

Standard   1.498 μs (2 allocations: 928 bytes)
In Place   1.395 μs (0 allocations: 0 bytes)


The improvement here is not impressive.
That is probably because the dominating cost is the actual linear algebra, rather than the memory allocation.

**Note**: You can use existing in-place functions, or make them yourself

- To find existing in-place implementations of a function, just search it online
- You might want to make your own functions in-place *if allocation is the bottleneck*
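As a minimal sketch of writing your own in-place function, the example below pairs an allocating function with an in-place counterpart. The names `add_scaled` and `add_scaled!` are hypothetical and chosen only for this illustration.

```julia
# Allocating version: returns a new vector holding a*x + y.
add_scaled(a, x, y) = a .* x .+ y

# In-place version: overwrites `y` with a*x + y and returns it.
# The trailing `!` signals that the first argument is modified.
function add_scaled!(y, a, x)
    for i in eachindex(x, y)   # errors if x and y have different indices
        y[i] = a * x[i] + y[i]
    end
    return y
end

x = randn(1000)
y = randn(1000)

expected = add_scaled(2.0, x, y)   # y is untouched here
add_scaled!(y, 2.0, x)             # y now holds the result
y ≈ expected
```

Following the Julia convention, the argument that gets overwritten comes first in the in-place version.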

*This exercise is to be completed in small groups of 3 or less*

**Exercise 2**: Benchmark the difference between `filter` and `filter!`

### Using Views Instead of Slices

Another way that you can accidentally allocate memory is by using slices rather than views.

- Slicing an array `a[3:5]` allocates a new array and copies the data into it
- Using a view `@view a[3:5]` refers to the original data without copying it

Let's see an example of this.

In [14]:
a = randn(1000)

@btime sum(a[5:500])
@btime sum(@view a[5:500])

  553.531 ns (4 allocations: 4.08 KiB)
  163.395 ns (2 allocations: 64 bytes)


-7.5933451079071785

You can also put the macro `@views` in front of a whole expression, which turns every slice inside it into a view.

In [15]:
@views sum(a[5:500])

-7.5933451079071785
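Because a view shares memory with its parent array, writing to a view also changes the parent, while writing to a slice does not. Here is a small sketch of the difference; the variable names are assumptions made for this example.

```julia
a = collect(1.0:10.0)

s = a[3:5]          # slice: an independent copy of the data
v = @view a[3:5]    # view: a window onto the memory of `a`

s[1] = -1.0         # changes only the copy
println(a[3])       # still the original value

v[1] = 100.0        # writes through to `a`
println(a[3])       # now reflects the change
```

This is worth keeping in mind when replacing slices with views in existing code: any code that relied on the slice being a private copy will behave differently.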

### Fused Vectorizations

Julia has a special **dot syntax** which converts a scalar function into a function that acts element-wise on arrays.
This can help the compiler to optimize the code.

In [16]:
x = [1,2,3,4]

x.^2

4-element Vector{Int64}:
  1
  4
  9
 16

This is useful from a practical point of view: it may be easier to write in this dot syntax than to code the full loop.

But there is a second benefit here.

Dot calls are *fusing*, which means that nested dot calls are combined at the syntax level into a single loop, without allocating temporary arrays.

Let's see this in action.

In [17]:
x = randn(10^5);

@btime 3 * x.^2 + 4 * x + 7 * x.^3 # not fused: the non-dotted `*` and `+` allocate temporary arrays
@btime 3 .* x.^2 .+ 4 .* x .+ 7 .* x.^3; # fused: a single loop and a single output array

  537.496 μs (32 allocations: 4.59 MiB)
  74.506 μs (22 allocations: 784.78 KiB)
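To make the idea of fusion concrete, the sketch below writes out by hand roughly what the fused expression computes: a single pass over `x` that fills one output array. It also shows the `.=` assignment, which writes the fused result into an existing array instead of allocating a new one. The function names are hypothetical and used only for this illustration.

```julia
x = randn(10^5)

# Hand-written equivalent of the fused expression: one loop, one output array.
function poly_loop(x)
    out = similar(x)
    for i in eachindex(x)
        out[i] = 3 * x[i]^2 + 4 * x[i] + 7 * x[i]^3
    end
    return out
end

# The fused broadcast version.
poly_fused(x) = 3 .* x.^2 .+ 4 .* x .+ 7 .* x.^3

# Fused and in-place: `.=` stores the result in the pre-allocated `out`.
function poly_fused!(out, x)
    out .= 3 .* x.^2 .+ 4 .* x .+ 7 .* x.^3
    return out
end

out = similar(x)
poly_fused!(out, x)

poly_loop(x) ≈ poly_fused(x) && out ≈ poly_fused(x)
```

Combining fusion with a pre-allocated output brings together both ideas from this section: no temporaries and no new output array.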


Writing in dot syntax can get us great speed-ups, but it can also be ugly.
You can use the `@.` macro to convert every function call or operator into a "dot" call.

In [18]:
@btime @. 3*x^2 + 4*x + 7*x^3; # equivalent to above, but prettier

  74.755 μs (21 allocations: 784.72 KiB)


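Also, the number of allocations above is slightly misleading.
Here `x` is a non-constant global variable, and accessing it inside the benchmarked expression adds overhead of its own.
If we just wrap the expressions in functions, we see far fewer allocations.

Let this be a warning against taking benchmarking *too seriously*.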

In [19]:
f(x) = 3 * x.^2 + 4 * x + 7 * x.^3
fdot(x) = @. 3*x^2 + 4*x + 7*x^3

@btime f(x)
@btime fdot(x);

  524.861 μs (18 allocations: 4.59 MiB)
  68.474 μs (3 allocations: 784.06 KiB)
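`BenchmarkTools.jl` also lets you splice the value of a global variable into the benchmarked expression by prefixing it with `$`, so that the global lookup is not part of the measurement. Below is a minimal sketch; the function name `poly` is an assumption made for this example.

```julia
using BenchmarkTools

x = randn(10^5)
poly(x) = @. 3 * x^2 + 4 * x + 7 * x^3

@btime poly(x);     # `x` is looked up as a non-constant global on every call
@btime poly($x);    # `$x` splices the value in, as if it were a local variable
```

For a function that takes tens of microseconds, the difference in run time is small, but for very fast functions the overhead of the global lookup can dominate the measurement.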


Let's just see another example.

In [20]:
f1(x) = exp.(x) + exp.(-x) # the + is not fused!
f2(x) = @. exp(x) + exp(-x)

@btime f1(x)
@btime f2(x);

  1.347 ms (12 allocations: 3.06 MiB)
  1.254 ms (3 allocations: 784.06 KiB)


The performance gain here is much more modest, but you can see the reduction in allocation happening here.

*This exercise is to be completed in small groups of 3 or less*

**Exercise 3**: Create two equivalent functions that apply $x^2 + \sqrt{x} - 5$ to each entry of a vector, where one function uses fused vectorization and the other function does not. Benchmark the difference on a relevant test case.

### Use StaticArrays.jl for small vectors / matrices

If your function uses many small arrays (e.g. length `<= 100`) and the sizes of these arrays are known before execution, then you can use the `StaticArrays.jl` package. Because the size is part of the type, static arrays can avoid heap allocation and are often much faster than the `Array` types in base Julia.

In [21]:
using StaticArrays

A_s = @SMatrix [1 3 5 ; 2 3 6; 9 2 3]
B_s = @SMatrix [4 5 9 ; 5 4 4; 9 7 5]

A_n = [1 3 5 ; 2 3 6; 9 2 3]
B_n = [4 5 9 ; 5 4 4; 9 7 5]

println("Matrix Addition")
@btime A_n + B_n
@btime A_s + B_s

println("Matrix Multiplication")
@btime A_n * B_n
@btime A_s * B_s;

Matrix Addition
  107.163 ns (2 allocations: 144 bytes)
  66.188 ns (1 allocation: 80 bytes)
Matrix Multiplication
  146.122 ns (2 allocations: 144 bytes)
  77.499 ns (1 allocation: 80 bytes)


In [22]:
println("Determinant")
@btime det(A_n)
@btime det(A_s)

Determinant
  388.772 ns (5 allocations: 240 bytes)
  49.392 ns (1 allocation: 16 bytes)


26.0

Yeah, that speed-up is pretty nice: almost a factor of 8x.
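If your data already lives in a regular `Array`, you can convert it to a static array by stating the dimensions explicitly, since they become part of the type. The following is a small sketch, assuming `StaticArrays.jl` is installed; the vector `v_s` is an assumption made for this example.

```julia
using StaticArrays, LinearAlgebra

A_n = [1 3 5; 2 3 6; 9 2 3]

# The dimensions are given in the type parameters because the compiler
# needs to know them ahead of time.
A_s = SMatrix{3,3}(A_n)
v_s = SVector(1.0, 2.0, 3.0)

w_s = A_s * v_s                 # the result is again a static vector

println(typeof(A_s))
println(typeof(w_s))
println(det(A_s) ≈ det(A_n))
```

The conversion itself has a cost, so it pays off when the static array is reused many times, for example inside a hot loop.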

*This exercise is to be completed in small groups of 3 or less*

**Exercise 4**: Write a function to compute the 4th power of a matrix. Benchmark the performance difference between `Array` and `StaticArray` on a 4-by-4 example. 