# Understandable performance
*Going fast, nowhere*

In [36]:
using Pkg
Pkg.activate("envs/lecture2b")
Pkg.instantiate()

  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`

## A note on benchmarking
*Premature optimization is the root of all evil* & *If you don't measure you won't improve*

### Tools
1. BenchmarkTools.jl https://github.com/JuliaCI/BenchmarkTools.jl
2. Profiler https://docs.julialang.org/en/latest/manual/profile/
3. ProfileView.jl https://github.com/timholy/ProfileView.jl
4. VTune/Perf/OProfile https://docs.julialang.org/en/latest/manual/profile/#External-Profiling-1
5. PProf https://github.com/vchuravy/PProf.jl

## BenchmarkTools.jl
A solid package that tries to eliminate common pitfalls in performance measurement.
- `@benchmark` macro that will repeatedly evaluate your code to gather enough samples
- Caveat: You probably want to interpolate your input data with `$`, so that it is not benchmarked as a non-constant global variable
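
One reason the `$` matters: without it, the benchmarked expression reads `data` as a non-constant global, whose type the compiler cannot assume. The sketch below uses only `Base.return_types` to make that visible; the helper names `sum_global` and `sum_arg` are made up for illustration and are not part of BenchmarkTools.jl.

```julia
data = rand(2^10)

# Reads `data` as a non-constant global: its type could change at any time,
# so type inference cannot rely on it.
sum_global() = sum(data)

# Receives the array as an argument, which is roughly what `$data` gives you.
sum_arg(x) = sum(x)

inferred_global = Base.return_types(sum_global, Tuple{})[1]
inferred_arg = Base.return_types(sum_arg, Tuple{Vector{Float64}})[1]

println("global:   ", inferred_global)  # Any when `data` is a top-level global
println("argument: ", inferred_arg)     # Float64
```

Both functions compute the same value; only the amount of type information available to the compiler differs.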

In [3]:
data = rand(2^10);

In [38]:
using BenchmarkTools
@benchmark sum($data)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     76.940 ns (0.00% GC)
  median time:      81.142 ns (0.00% GC)
  mean time:        85.052 ns (0.00% GC)
  maximum time:     190.697 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     970

![Compiler](compiler.png)

![Compiler Stages](compiler-stages.png)

## Figuring out what is happening
The stages of the compiler
- `@code_lowered`
- `@code_typed` & `@code_warntype`
- `@code_llvm`
- `@code_native`

Where is a function defined?
`@which` & `@edit`
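
As a rough sketch of how these macros are used, the snippet below inspects a toy function (`add1` is a made-up name). It assumes the `InteractiveUtils` standard library, which the REPL and notebooks load automatically; `sprint` is used here only to capture the printed LLVM IR and assembly as strings.

```julia
using InteractiveUtils  # already loaded in the REPL and in notebooks

add1(x) = x + 1  # hypothetical toy function

lowered = @code_lowered add1(1)                  # lowered form, before type inference
typed = @code_typed add1(1)                      # Pair: typed CodeInfo => return type
llvm_ir = sprint(code_llvm, add1, Tuple{Int})    # LLVM IR captured as a String
native = sprint(code_native, add1, Tuple{Int})   # assembly captured as a String

println(last(typed))      # inferred return type
println(@which add1(1))   # the method that would be called
```

In interactive use you would normally just write `@code_llvm add1(1)` or `@code_native add1(1)` and read the printed output.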

In [5]:
##########################
# Low-level benchmarking #
##########################
using LLVM
using LLVM.Interop

"""
    clobber()

Force the compiler to flush pending writes to global memory.
Acts as an effective read/write barrier.
"""
@inline clobber() = @asmcall("", "~{memory}", true)

"""
    escape(val)

The `escape` function can be used to prevent a value or
expression from being optimized away by the compiler. This function is
intended to add little to no overhead.
See: https://youtu.be/nXaxk27zwlk?t=2441
"""
@inline escape(val::T) where T = @asmcall("", "X,~{memory}", true, Nothing, Tuple{T}, val)

escape

# A simple example: counting

In [6]:
function f(N)
    acc = 0
    for i in 1:N
        acc += 1
    end
    return acc
end

f (generic function with 1 method)

In [40]:
N = 100_000_000
result = @benchmark f($N)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.267 ns (0.00% GC)
  median time:      1.498 ns (0.00% GC)
  mean time:        1.469 ns (0.00% GC)
  maximum time:     18.606 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

In [41]:
using Unitful

┌ Info: Precompiling Unitful [1986cc42-f94f-5a68-af5c-568840ba703d]
└ @ Base loading.jl:1186


In [67]:
t = time(minimum(result)) * u"ns" # in ns
pFreq = N/t |> u"PHz"
t, pFreq

(1.267 ns, 78.92659826361485 PHz)

So we are doing 100 million additions in about 1.3 ns.
So our processor is operating at roughly 79 PHz...

We wish...

What is going on?

In [9]:
@benchmark f($(10*N))

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.408 ns (0.00% GC)
  median time:      1.496 ns (0.00% GC)
  mean time:        1.510 ns (0.00% GC)
  maximum time:     32.392 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

In [10]:
@code_lowered f(N)

CodeInfo(
1 ─       acc = 0
│   %2  = 1:N
│         #temp# = (Base.iterate)(%2)
│   %4  = #temp# === nothing
│   %5  = (Base.not_int)(%4)
└──       goto #4 if not %5
2 ┄ %7  = #temp#
│         i = (Core.getfield)(%7, 1)
│   %9  = (Core.getfield)(%7, 2)
│         acc = acc + 1
│         #temp# = (Base.iterate)(%2, %9)
│   %12 = #temp# === nothing
│   %13 = (Base.not_int)(%12)
└──       goto #4 if not %13
3 ─       goto #2
4 ┄       return acc
)

In [11]:
@code_typed optimize=false f(N)

CodeInfo(
1 ─       (acc = 0)::Const(0, false)
│   %2  = (1:N)::UnitRange{Int64}
│         (#temp# = (Base.iterate)(%2))::Union{Nothing, Tuple{Int64,Int64}}
│   %4  = (#temp# === nothing)::Bool
│   %5  = (Base.not_int)(%4)::Bool
└──       goto #4 if not %5
2 ┄ %7  = #temp#::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (i = (Core.getfield)(%7, 1))::Int64
│   %9  = (Core.getfield)(%7, 2)::Int64
│         (acc = acc + 1)::Int64
│         (#temp# = (Base.iterate)(%2, %9))::Union{Nothing, Tuple{Int64,Int64}}
│   %12 = (#temp# === nothing)::Bool
│   %13 = (Base.not_int)(%12)::Bool
└──       goto #4 if not %13
3 ─       goto #2
4 ┄       return acc
) => Int64

In [12]:
@code_typed optimize=true f(N)

CodeInfo(
1 ── %1  = (Base.sle_int)(1, N)::Bool
│    %2  = (Base.ifelse)(%1, N, 0)::Int64
│    %3  = (Base.slt_int)(%2, 1)::Bool
└───       goto #3 if not %3
2 ──       goto #4
3 ──       goto #4
4 ┄─ %7  = φ (#2 => true, #3 => false)::Bool
│    %8  = φ (#3 => 1)::Int64
│    %9  = (Base.not_int)(%7)::Bool
└───       goto #10 if not %9
5 ┄─ %11 = φ (#4 => 0, #9 => %13)::Int64
│    %12 = φ (#4 => %8, #9 => %19)::Int64
│    %13 = (Base.add_int)(%11, 1)::Int64
│    %14 = (%12 === %2)::Bool
└───       goto #7 if not %14
6 ──       goto #8
7 ── %17 = (Base.add_int)(%12, 1)::Int64
└───       goto #8
8 ┄─ %19 = φ (#7 => %17)::Int64
│    %20 = φ (#6 => true, #7 => false)::Boo

In [13]:
@code_llvm optimize=false f(10)


;  @ In[6]:2 within `f'
define i64 @julia_f_13439(i64) {
top:
  %1 = call %jl_value_t*** @julia.ptls_states()
  %2 = bitcast %jl_value_t*** %1 to %jl_value_t addrspace(10)**
  %3 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %2, i64 2
  %4 = bitcast %jl_value_t addrspace(10)** %3 to i64**
  %5 = load i64*, i64** %4
;  @ In[6]:3 within `f'
; ┌ @ range.jl:5 within `Colon'
; │┌ @ range.jl:274 within `Type'
; ││┌ @ range.jl:279 within `unitrange_last'
; │││┌ @ operators.jl:333 within `>='
; ││││┌ @ int.jl:428 within `<='
       %6 = icmp sle i64 1, %0
       %7 = zext i1 %6 to i8
; │││└└
     %8 = trunc i8 %7 to i1
     %9 = xor i1 %8, true
     %10 = select i1 %9, i64 0, i64 %0
; └└└
; ┌ @ range.jl:590 within `iterate'
; │┌ @ range.jl:474 within `isempty'
; ││┌ @ operators.jl:286 within `>'
; │││┌ @ int.jl:49 within `<'
      %11 = icmp slt i64 %10, 1
      %12 = zext i1 %11 to i8
; │└└└
   %13 = trunc i8 %12 to i1
   %14 = xor i1 %13, true
   br i1 %14

In [14]:
@code_llvm optimize=true f(10)


;  @ In[6]:2 within `f'
define i64 @julia_f_13440(i64) {
top:
;  @ In[6]:3 within `f'
; ┌ @ range.jl:5 within `Colon'
; │┌ @ range.jl:274 within `Type'
; ││┌ @ range.jl:279 within `unitrange_last'
; │││┌ @ operators.jl:333 within `>='
; ││││┌ @ int.jl:428 within `<='
       %1 = icmp sgt i64 %0, 0
; └└└└└
  %spec.select = select i1 %1, i64 %0, i64 0
;  @ In[6]:6 within `f'
  ret i64 %spec.select
}


In [15]:
@code_native f(10)

	.text
; ┌ @ In[6]:2 within `f'
	movq	%rdi, %rax
	sarq	$63, %rax
	andnq	%rdi, %rax, %rax
; │ @ In[6]:6 within `f'
	retq
	nopl	(%rax)
; └


# Conclusion

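LLVM realised that our loop

```julia
for i in 1:N
  acc += 1
end
```

just ends up computing $acc = 1 \cdot N$ (or $0$ when $N < 1$), so it replaced the loop with that closed form. The running time therefore no longer depends on $N$.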
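
A small sanity check of that reading of the optimized IR (`select` between `N` and `0`): the sketch below compares `f` against a closed form. The name `closed_form` is made up for illustration.

```julia
function f(N)
    acc = 0
    for i in 1:N
        acc += 1
    end
    return acc
end

# The closed form read off the optimized IR: N if N > 0, otherwise 0.
closed_form(N) = ifelse(N > 0, N, 0)

# Compare both on a range that includes empty loops (N <= 0).
agree = all(f(n) == closed_form(n) for n in -5:1_000)
println(agree)
```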

# Exercise

What happens with:

```julia
function h(N)
    acc = 0.0
    for i in 1:N
        acc += 1.0
    end
    acc
end
```

and

```julia
function g(N)
    acc = 0
    for i in 1:N
        acc += 1.0
    end
    acc
end
```
    

In [16]:
function h(N)
    acc = 0.0
    for i in 1:N
        acc += 1.0
    end
    acc
end

h (generic function with 1 method)

In [17]:
@code_native h(10)

	.text
; ┌ @ In[16]:2 within `h'
	vxorpd	%xmm0, %xmm0, %xmm0
; │ @ In[16]:3 within `h'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:274 within `Type'
; │││┌ @ range.jl:279 within `unitrange_last'
; ││││┌ @ operators.jl:333 within `>='
; │││││┌ @ int.jl:428 within `<='
	testq	%rdi, %rdi
; │└└└└└
	jle	L42
	movabsq	$140591353365352, %rax  # imm = 0x7FDDF9AD0368
	vmovsd	(%rax), %xmm1           # xmm1 = mem[0],zero
	nopw	(%rax,%rax)
; │ @ In[16]:4 within `h'
; │┌ @ float.jl:395 within `+'
L32:
	vaddsd	%xmm1, %xmm0, %xmm0
; │└
; │┌ @ range.jl:594 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
	addq	$-1, %rdi
; │└└
	jne	L32
; │ @ In[16]:6 within `h'
L42:
	retq
	nopl	(%rax,%rax)
; └


In [18]:
function g(N)
    acc = 0
    for i in 1:N
        acc += 1.0
    end
    acc
end

g (generic function with 1 method)
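
The only difference between `h` and `g` is the starting value of `acc`. A minimal sketch of the consequence, using `Base.return_types` as a compact stand-in for reading `@code_warntype` output:

```julia
function h(N)
    acc = 0.0        # Float64 from the start
    for i in 1:N
        acc += 1.0
    end
    acc
end

function g(N)
    acc = 0          # starts as an Int ...
    for i in 1:N
        acc += 1.0   # ... and becomes a Float64 after the first iteration
    end
    acc
end

# The return type of `g` depends on whether the loop body ever runs.
println(typeof(g(0)), " ", typeof(g(3)))

# What inference concludes for each function.
println(Base.return_types(h, Tuple{Int})[1])
println(Base.return_types(g, Tuple{Int})[1])
```

Because `acc` in `g` can hold either type, the compiler has to carry a `Union` through the loop and branch on the actual type in every iteration.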

In [19]:
@code_warntype g(10)

Body::Union{Float64, Int64}
1 ── %1  = (Base.sle_int)(1, N)::Bool
│    %2  = (Base.ifelse)(%1, N, 0)::Int64
│    %3  = (Base.slt_int)(%2, 1)::Bool
└───       goto #3 if not %3
2 ──       goto #4
3 ──       goto #4
4 ┄─ %7  = φ (#2 => true, #3 => false)::Bool
│    %8  = φ (#3 => 1)::Int64
│    %9  = (Base.not_int)(%7)::Bool
└───       goto #15 if not %9
5 ┄─ %11 = φ (#4 => 0, #14 => %26)::Union{Float64, Int64}
│    %12 = φ (#4 => %8, #14 => %32)::Int64
│    %13 = (isa)(%11, Float64)::Bool
└───       goto #7 if not %13
6 ── %15 = π (%11, Float64)
│    %16 = (Base.add_float)(%15, 1.0)::Float64
└───       goto #10
7 ── %18 = (isa)(%11, Int64)::Bool
└───

In [20]:
function k(::Type{T}, N) where T
    acc = zero(T)
    for i in 1:N
        acc += one(T)
        clobber()
    end
    return acc
end

k (generic function with 1 method)

In [21]:
@code_native k(Float64, 10)

	.text
; ┌ @ In[20]:2 within `k'
	vxorpd	%xmm0, %xmm0, %xmm0
; │ @ In[20]:3 within `k'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:274 within `Type'
; │││┌ @ range.jl:279 within `unitrange_last'
; ││││┌ @ operators.jl:333 within `>='
; │││││┌ @ int.jl:428 within `<='
	testq	%rsi, %rsi
; │└└└└└
	jle	L42
	movabsq	$140591353379984, %rax  # imm = 0x7FDDF9AD3C90
	vmovsd	(%rax), %xmm1           # xmm1 = mem[0],zero
	nopw	(%rax,%rax)
; │ @ In[20]:4 within `k'
; │┌ @ float.jl:395 within `+'
L32:
	vaddsd	%xmm1, %xmm0, %xmm0
; │└
; │ @ In[20]:5 within `k'
; │┌ @ range.jl:594 within `iterate'
; ││┌ @ base.jl:43 within `=='
	addq	$-1, %rsi
; │└└
	jne	L32
; │ @ In[20]:7 within `k'
L42:
	retq
	nopl	(%rax,%rax)
; └


In [22]:
@code_native k(Int64, 10)

	.text
; ┌ @ In[20]:3 within `k'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:274 within `Type'
; │││┌ @ range.jl:279 within `unitrange_last'
; ││││┌ @ operators.jl:333 within `>='
; │││││┌ @ In[20]:2 within `<='
	testq	%rsi, %rsi
; │└└└└└
	jle	L26
	movq	%rsi, %rax
	nopl	(%rax,%rax)
; │ @ In[20]:5 within `k'
; │┌ @ range.jl:594 within `iterate'
; ││┌ @ base.jl:43 within `=='
L16:
	addq	$-1, %rax
; │└└
	jne	L16
; │ @ In[20]:7 within `k'
	movq	%rsi, %rax
	retq
L26:
	xorl	%esi, %esi
; │ @ In[20]:7 within `k'
	movq	%rsi, %rax
	retq
; └


In [23]:
function m(::Type{T}, N) where T
    acc = zero(T)
    for i in 1:N
        acc += one(T)
        escape(acc)
    end
    return acc
end

m (generic function with 1 method)

In [24]:
@code_native m(Int64, 30)

	.text
; ┌ @ In[23]:3 within `m'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:274 within `Type'
; │││┌ @ range.jl:279 within `unitrange_last'
; ││││┌ @ operators.jl:333 within `>='
; │││││┌ @ In[23]:2 within `<='
	testq	%rsi, %rsi
; │└└└└└
	jle	L38
	movq	%rsi, %rax
	negq	%rax
	movl	$1, %ecx
; │ @ In[23]:5 within `m'
; │┌ @ range.jl:594 within `iterate'
; ││┌ @ base.jl:43 within `=='
L16:
	leaq	(%rax,%rcx), %rdx
	addq	$1, %rdx
; ││└
; ││ @ range.jl:595 within `iterate'
; ││┌ @ int.jl:53 within `+'
	addq	$1, %rcx
; ││└
; ││ @ range.jl:594 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
	cmpq	$1, %rdx
; │└└
	jne	L16
; │ @ In[23]:7 within `m'
	movq	%rsi, %rax
	retq
L38:
	xorl	%esi, %esi
; │ @ In[23]:7 within `m'
	movq	%rsi, %rax
	retq
	nopl	(%rax)
; └


In [25]:
result2 = @benchmark m($Int64, $N)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     30.238 ms (0.00% GC)
  median time:      32.846 ms (0.00% GC)
  mean time:        33.203 ms (0.00% GC)
  maximum time:     46.178 ms (0.00% GC)
  --------------
  samples:          151
  evals/sample:     1

In [26]:
@benchmark m($Int64, $(N*10))

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     577.269 ms (0.00% GC)
  median time:      590.251 ms (0.00% GC)
  mean time:        595.120 ms (0.00% GC)
  maximum time:     648.988 ms (0.00% GC)
  --------------
  samples:          9
  evals/sample:     1

In [27]:
t = time(minimum(result2)) # in ns
N / (t * 1e-9) # in Hz

3.307094405375166e9

Sanity restored: 3.3 GHz is much closer to the frequency of my actual processor.

Note: Benchmarking is hard; it takes careful evaluation of *what* you are trying to benchmark.

- If we were just interested in how fast `f(N)` was, we would have been fine with our first measurement.
- But we were interested in the speed of addition as a proxy for performance.
- Integer math on a computer is associative; floating-point math is not. That is why LLVM may collapse the integer loop into a closed form but has to keep the floating-point loop in `h`.
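
A short illustration of the last point, using plain `Float64` and `Int` arithmetic (the variable names are arbitrary):

```julia
# Floating-point addition: regrouping changes the rounding, and so the result.
a, b, c = 0.1, 0.2, 0.3
float_assoc = (a + b) + c == a + (b + c)

# Machine integers wrap around on overflow, yet regrouping gives the same result.
x, y, z = typemax(Int), 1, -1
int_assoc = (x + y) + z == x + (y + z)

println(float_assoc, " ", int_assoc)  # false true
```

Since reordering floating-point additions can change the answer, the compiler only does it when explicitly permitted, for example by `@simd` or `@fastmath`.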

In [28]:
@benchmark h($N)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     118.324 ms (0.00% GC)
  median time:      121.744 ms (0.00% GC)
  mean time:        121.335 ms (0.00% GC)
  maximum time:     123.944 ms (0.00% GC)
  --------------
  samples:          42
  evals/sample:     1

In [29]:
function l(N)
    acc = 0.0
    @simd for i in 1:N
        acc += 1.0
    end
    acc
end

l (generic function with 1 method)

In [30]:
@benchmark l($N)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.537 ms (0.00% GC)
  median time:      6.790 ms (0.00% GC)
  mean time:        6.829 ms (0.00% GC)
  maximum time:     7.717 ms (0.00% GC)
  --------------
  samples:          732
  evals/sample:     1

# Performance annotations in Julia

- https://docs.julialang.org/en/v1/manual/performance-tips/
- Julia does bounds checking by default: `ones(10)[11]` is an error
- `@inbounds` turns off bounds checking locally
- `@fastmath` turns off strict IEEE 754 semantics locally -- be very careful, this might not do what you want
- `@simd` and `@simd ivdep` give stronger guarantees to encourage LLVM to use SIMD operations
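
A minimal sketch of how these annotations combine in practice. The function name `sum_annotated` is made up, and the loop is only safe because `eachindex(data)` never produces an out-of-bounds index:

```julia
function sum_annotated(data::Vector{Float64})
    acc = 0.0
    # @inbounds: skip bounds checks; @simd: allow reordering of the reduction.
    @inbounds @simd for i in eachindex(data)
        acc += data[i]
    end
    return acc
end

data = ones(1_000)
println(sum_annotated(data))

# Without @inbounds, an out-of-range index raises a BoundsError.
caught = try
    ones(10)[11]
    false
catch err
    err isa BoundsError
end
println(caught)
```

With `@simd` the additions may be regrouped, so for general floating-point input the result can differ in the last bits from a strictly sequential sum.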

In [31]:
?@simd

```
@simd
```

Annotate a `for` loop to allow the compiler to take extra liberties to allow loop re-ordering

!!! warning
    This feature is experimental and could change or disappear in future versions of Julia. Incorrect use of the `@simd` macro may cause unexpected results.


The object iterated over in a `@simd for` loop should be a one-dimensional range. By using `@simd`, you are asserting several properties of the loop:

  * It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
  * Floating-point operations on reduction variables can be reordered, possibly causing different results than without `@simd`.

In many cases, Julia is able to automatically vectorize inner for loops without the use of `@simd`. Using `@simd` gives the compiler a little extra leeway to make it possible in more situations. In either case, your inner loop should have the following properties to allow vectorization:

  * The loop must be an innermost loop
  * The loop body must be straight-line code. Therefore, [`@inbounds`](@ref) is   currently needed for all array accesses. The compiler can sometimes turn   short `&&`, `||`, and `?:` expressions into straight-line code if it is safe   to evaluate all operands unconditionally. Consider using the [`ifelse`](@ref)   function instead of `?:` in the loop if it is safe to do so.
  * Accesses must have a stride pattern and cannot be "gathers" (random-index   reads) or "scatters" (random-index writes).
  * The stride should be unit stride.

!!! note
    The `@simd` does not assert by default that the loop is completely free of loop-carried memory dependencies, which is an assumption that can easily be violated in generic code. If you are writing non-generic code, you can use `@simd ivdep for ... end` to also assert that:


  * There exists no loop-carried memory dependencies
  * No iteration ever waits on a previous iteration to make forward progress.


In [32]:
@code_llvm l(10)


;  @ In[29]:2 within `l'
define double @julia_l_13882(i64) {
top:
;  @ In[29]:3 within `l'
; ┌ @ simdloop.jl:65 within `macro expansion'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:274 within `Type'
; │││┌ @ range.jl:279 within `unitrange_last'
; ││││┌ @ operators.jl:333 within `>='
; │││││┌ @ int.jl:428 within `<='
        %1 = icmp sgt i64 %0, 0
; ││││└└
      %2 = select i1 %1, i64 %0, i64 0
; │└└└
; │ @ simdloop.jl:67 within `macro expansion'
; │┌ @ simdloop.jl:47 within `simd_inner_length'
; ││┌ @ range.jl:540 within `length'
; │││┌ @ checked.jl:222 within `checked_sub'
; ││││┌ @ checked.jl:194 within `sub_with_overflow'
       %3 = add nsw i64 %2, -1
; │││└└
; │││┌ @ checked.jl:165 within `checked_add'
; ││││┌ @ checked.jl:132 within `add_with_overflow'
       %4 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %3, i64 1)
       %5 = extractvalue { i64, i1 } %4, 1
; ││││└
; ││││ @ checked.jl:166 within `checked_add'
      br i1 %5, label %L16, label %L21

L16:          

# Let's revisit our example from earlier!

Slightly more complicated function!

- What is wrong with `mysum3(ones(10_000))`?

In [33]:
function mysum3(data::Vector{T}) where T<:Number
  acc = zero(T)
  for x in data
      acc += x
  end
  return acc
end

mysum3 (generic function with 1 method)

In [34]:
@code_warntype mysum3(zeros(3))

Body::Float64
1 ── %1  = (Base.arraylen)(data)::Int64
│    %2  = (Base.sle_int)(0, %1)::Bool
│    %3  = (Base.bitcast)(UInt64, %1)::UInt64
│    %4  = (Base.ult_int)(0x0000000000000000, %3)::Bool
│    %5  = (Base.and_int)(%2, %4)::Bool
└───       goto #3 if not %5
2 ── %7  = (Base.arrayref)(false, data, 1)::Float64
└───       goto #4
3 ──       goto #4
4 ┄─ %10 = φ (#2 => false, #3 => true)::Bool
│    %11 = φ (#2 => %7)::Float64
│    %12 = φ (#2 => 2)::Int64
└───       goto #5
5 ──       %14 = (Base.not_int)(%10)::Bool
└───       goto #11 if not %14
6 ┄─ %16 = φ (#5 => 0.0, #10 => %19)::Float64
│    %17 = φ (#5 => %11, #10 => %32)::Float64
│    %18 = φ (#5 => %12, #10 => %33)::Int64

# Task

- Write a fast and generic `sum` implementation.

## Using the profiler

1. `using Profile`
2. `@profile mysum()`
3. `Profile.clear()` -- reset the profile
4. `Profile.print()` -- simple display of profile data
5. Use ProfileView.jl or PProf.jl to analyse your data better
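
The steps above can be sketched as follows, using only the `Profile` standard library. The `mysum` definition and the repetition count are placeholders chosen for illustration; they are assumptions, not the lecture's code.

```julia
using Profile

# Placeholder workload: a simple generic sum.
function mysum(data)
    acc = zero(eltype(data))
    for x in data
        acc += x
    end
    return acc
end

data = ones(10^6)
mysum(data)          # run once first so compilation is not profiled

Profile.clear()      # reset any previously collected samples
@profile for _ in 1:200
    mysum(data)      # repeat so the sampling profiler collects enough samples
end
Profile.print()      # simple text display of the collected samples
```

The profiler is sampling-based, so very short workloads may yield few or no samples; repeating the call as above is one way around that.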


# From performance to generic code
- Up until now I have been heavily focused on performance
- Mostly because I am a low-level person and this excites me!
- Performance was the reason why I came to Julia, but I stayed because of the features
- Tomorrow we will talk about using Julia for science and using GPUs