# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Julia:-Functions,-Type-System,-Multiple-Dispatch,-JIT,-and-Profiling" data-toc-modified-id="Julia:-Functions,-Type-System,-Multiple-Dispatch,-JIT,-and-Profiling-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Julia: Functions, Type System, Multiple Dispatch, JIT, and Profiling</a></div><div class="lev2 toc-item"><a href="#Control-flow-and-loops" data-toc-modified-id="Control-flow-and-loops-11"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Control flow and loops</a></div><div class="lev2 toc-item"><a href="#Functions" data-toc-modified-id="Functions-12"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Functions</a></div><div class="lev2 toc-item"><a href="#Type-system" data-toc-modified-id="Type-system-13"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Type system</a></div><div class="lev2 toc-item"><a href="#Multiple-dispatch" data-toc-modified-id="Multiple-dispatch-14"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Multiple dispatch</a></div><div class="lev2 toc-item"><a href="#Just-in-time-compilation-(JIT)" data-toc-modified-id="Just-in-time-compilation-(JIT)-15"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Just-in-time compilation (JIT)</a></div><div class="lev2 toc-item"><a href="#Profiling-Julia-code" data-toc-modified-id="Profiling-Julia-code-16"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Profiling Julia code</a></div><div class="lev2 toc-item"><a href="#Memory-profiling" data-toc-modified-id="Memory-profiling-17"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Memory profiling</a></div><div class="lev2 toc-item"><a href="#Type-stability" data-toc-modified-id="Type-stability-18"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Type stability</a></div>

# Julia: Functions, Type System, Multiple Dispatch, JIT, and Profiling

In this lecture, we try to understand why Julia is fast. 

Machine information

In [1]:
versioninfo()

Julia Version 0.6.4
Commit 9d11f62bcb (2018-07-09 19:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)


## Control flow and loops

Building blocks of a function:

* if-elseif-else-end
```julia
if condition1
    # do something
elseif condition2
    # do something
else
    # do something
end
```

* `for` loop
```julia
for i in 1:10
    println(i)
end
```

* Nested `for` loop:
```julia
for i in 1:10
    for j in 1:5
        println(i * j)
    end
end
```
Same as
```julia
for i in 1:10, j in 1:5
    println(i * j)
end
```

* Exit loop:
```julia
for i in 1:10
    # do something
    if condition1
        break # leave the loop; remaining iterations are skipped
    end
end
```

* Exit iteration:  
```julia
for i in 1:10
    # do something
    if condition1
        continue # skip to next iteration
    end
    # do something
end
```
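
The difference between leaving a loop and skipping one iteration can be seen side by side in a small sketch. The function name `odds_below` and its cutoff argument are made up for illustration.

```julia
# Collect the odd numbers in 1:10 that lie below a cutoff.
function odds_below(cutoff)
    out = Int[]
    for i in 1:10
        if i >= cutoff
            break      # leave the loop entirely; no further iterations run
        end
        if iseven(i)
            continue   # skip the rest of this iteration only
        end
        push!(out, i)
    end
    return out
end

odds_below(8)  # returns [1, 3, 5, 7]
```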

## Functions 

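* In Julia, all arguments to functions are **passed by reference** (more precisely, by sharing: no copy is made, so mutating an argument inside the function is visible to the caller), in contrast to R and Matlab.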

* A function name ending with `!` indicates that the function mutates at least one argument, typically the first.
```julia
sort!(x) # vs sort(x)
```
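
A minimal sketch of what this means in practice, using two hypothetical helpers (`double!` and `rebind` are not part of Julia): mutating the elements of an argument changes the caller's array, while assigning a new value to the argument name only changes the local binding.

```julia
# Mutating version: overwrites the caller's array (hence the `!` suffix).
function double!(v)
    for i in eachindex(v)
        v[i] *= 2
    end
    return v
end

# Rebinding version: `v = ...` only changes the local name `v`,
# so the caller's array is left untouched.
function rebind(v)
    v = [0, 0, 0]
    return v
end

a = [1, 2, 3]
double!(a)   # a is now [2, 4, 6]
rebind(a)    # returns [0, 0, 0], but a is still [2, 4, 6]
```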

* Function definition
```julia
function func(req1, req2; key1=dflt1, key2=dflt2)
    # do stuff
    return out1, out2, out3
end
```
**Required arguments** are separated by commas and use positional notation.  
**Optional arguments** (here, the keyword arguments after the semicolon) need a default value in the signature.  
The **semicolon** is not required in the function call; a comma works as well.  
The **return** statement is optional; without it, the value of the last expression is returned.  
Multiple outputs can be returned as a **tuple**, e.g., `return out1, out2, out3`.  
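
As a concrete illustration of this signature pattern, here is a small sketch. The function `affine` and its keyword names `scale` and `shift` are invented for the example.

```julia
# Two positional arguments, two keyword arguments with defaults.
function affine(x, y; scale=1.0, shift=0.0)
    u = scale * x + shift
    v = scale * y + shift
    return u, v          # two outputs, returned as a tuple
end

u, v = affine(1.0, 2.0)                        # defaults: (1.0, 2.0)
p, q = affine(1.0, 2.0, scale=3.0)             # comma before the keyword: (3.0, 6.0)
r, s = affine(1.0, 2.0; shift=1.0, scale=2.0)  # keywords in any order: (3.0, 5.0)
```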

* Anonymous functions, e.g., `x -> x^2`, are commonly used in collection functions or list comprehensions.
```julia
map(x -> x^2, y) # square each element in y
```

* Functions can be nested:
```julia
function outerfunction()
    # do some outer stuff
    function innerfunction()
        # do inner stuff
        # can access prior outer definitions
    end
    # do more outer stuff
end
```
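
One common use of nesting is a closure: the inner function keeps access to the outer function's local variables even after the outer function has returned. The sketch below uses made-up names (`make_counter`, `increment`).

```julia
function make_counter()
    n = 0                    # local to make_counter, captured by the inner function
    function increment()
        n += 1               # updates the captured outer variable
        return n
    end
    return increment         # hand the inner function back to the caller
end

counter = make_counter()
counter()   # returns 1
counter()   # returns 2
```

Each call to `make_counter()` creates a fresh `n`, so separate counters do not interfere with each other.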

* Functions can be vectorized using the **dot syntax**:

In [2]:
function myfunc(x)
    return sin(x^2)
end

x = randn(5, 3)
myfunc.(x)

5×3 Array{Float64,2}:
  0.698459    0.0332408   0.989591 
  0.0610353   0.104825   -0.521194 
 -0.972617    0.716213    0.0618489
  0.704725   -0.449746    0.896146 
  0.886254    0.162349    0.0109776

Dot operations are fused into a single loop:

In [3]:
myfunc.(x .+ 1)

5×3 Array{Float64,2}:
 -0.380497    0.619866   0.0377494
  0.99988     0.98342    0.779047 
  0.947605    0.0113398  0.534859 
  0.0133797   0.723658   0.179633 
  0.00190486  0.921095   0.939281 

In [4]:
using BenchmarkTools

# allocate new array for z
@benchmark z = myfunc.(x .+ 1) # sin((x + 1)^2)

BenchmarkTools.Trial: 
  memory estimate:  1.30 KiB
  allocs estimate:  25
  --------------
  minimum time:     4.248 μs (0.00% GC)
  median time:      4.471 μs (0.00% GC)
  mean time:        4.779 μs (2.91% GC)
  maximum time:     405.425 μs (95.57% GC)
  --------------
  samples:          10000
  evals/sample:     7

In [5]:
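# use same z
@benchmark z .= myfunc.(x .+ 1) # sin((x + 1)^2), written into z in place

LoadError: UndefVarError: z not defined

The in-place version fails here because no array `z` exists yet: the output must be allocated (e.g., `z = similar(x)`) before `.=` can write into it.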
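
One way to make the in-place version run is to allocate `z` once before the broadcast assignment. This is only a sketch: `myfunc` is redefined so the block is self-contained, and no timings are claimed.

```julia
myfunc(x) = sin(x^2)     # same function as in the cell above

x = randn(5, 3)
z = similar(x)           # allocate the output array once
z .= myfunc.(x .+ 1)     # fused loop writes sin((x + 1)^2) into z, no new array

z ≈ myfunc.(x .+ 1)      # same values as the allocating version
```

When timing this with BenchmarkTools, the package documentation recommends interpolating global variables, as in `@benchmark $z .= myfunc.($x .+ 1)`.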

* **Collection functions** (think of these as the `apply` family in R).

    Apply a function to each element of a collection:
```julia
map(f, coll) # or
map(coll) do elem
    # do stuff with elem
    # the value of the last expression (or an explicit return) is the result
end
```

In [6]:
map(x -> sin(x^2), x)

5×3 Array{Float64,2}:
  0.698459    0.0332408   0.989591 
  0.0610353   0.104825   -0.521194 
 -0.972617    0.716213    0.0618489
  0.704725   -0.449746    0.896146 
  0.886254    0.162349    0.0109776

In [7]:
map(x) do elem
    elem = elem^2
    return sin(elem)
end

5×3 Array{Float64,2}:
  0.698459    0.0332408   0.989591 
  0.0610353   0.104825   -0.521194 
 -0.972617    0.716213    0.0618489
  0.704725   -0.449746    0.896146 
  0.886254    0.162349    0.0109776

In [8]:
# Mapreduce
mapreduce(x -> sin(x^2), +, x)

3.382108546293333

In [9]:
# same as
sum(x -> sin(x^2), x)

3.382108546293333

* List **comprehension**

In [10]:
[sin(2i + j) for i in 1:5, j in 1:3] # similar to Python

5×3 Array{Float64,2}:
  0.14112   -0.756802  -0.958924
 -0.958924  -0.279415   0.656987
  0.656987   0.989358   0.412118
  0.412118  -0.544021  -0.99999 
 -0.99999   -0.536573   0.420167
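
Comprehensions also accept a filter, and the same syntax without brackets gives a generator that can be passed to a reduction without building an intermediate array. A short sketch:

```julia
# Two iteration variables give a matrix: rows follow i, columns follow j.
A = [i * j for i in 1:3, j in 1:2]        # 3×2 matrix [1 2; 2 4; 3 6]

# A trailing `if` keeps only the elements that pass the test.
odd_squares = [i^2 for i in 1:10 if isodd(i)]   # [1, 9, 25, 49, 81]

# Generator form: no temporary array is created before summing.
total = sum(i^2 for i in 1:10 if isodd(i))      # 165
```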

## Type system

* When thinking about types, think about sets.

* Everything is a subtype of the abstract type `Any`.

* An abstract type defines a set of types
    - Consider types in Julia that are a `Number`:

<img src="tree.png" width="600" align="center"/>

* You can explore type hierarchy with `typeof()`, `supertype()`, and `subtypes()`.

In [11]:
typeof(1.0), typeof(1)

(Float64, Int64)

In [12]:
supertype(Float64)

AbstractFloat

In [13]:
subtypes(AbstractFloat)

4-element Array{Union{DataType, UnionAll},1}:
 BigFloat
 Float16 
 Float32 
 Float64 

In [14]:
# Is Float64 a subtype of AbstractFloat?
Float64 <: AbstractFloat

true

In [15]:
# On a 64-bit machine, Int == Int64
Int == Int64

true

In [16]:
convert(Float64, 1) # same as Float64(1)

1.0

In [17]:
x = randn(Float32, 5)

5-element Array{Float32,1}:
 -1.64933  
 -0.305266 
  0.845376 
 -0.0670683
 -0.307917 

In [18]:
convert(Vector{Float64}, x) # same as Float64.(x)

5-element Array{Float64,1}:
 -1.64933  
 -0.305266 
  0.845376 
 -0.0670683
 -0.307917 

In [19]:
convert(Int, 1.0) # exact conversion

1

In [20]:
convert(Int, 1.5) # 1.5 is not an integer, so this throws; use round(Int, 1.5) instead

LoadError: InexactError()

In [21]:
round(Int, 1.5)

2
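
The hierarchy can also be explored programmatically. The sketch below defines a hypothetical helper, `type_chain`, that follows `supertype` from a given type up to `Any`.

```julia
# Follow `supertype` repeatedly until the root type `Any` is reached.
function type_chain(T)
    chain = Type[T]
    while T !== Any
        T = supertype(T)
        push!(chain, T)
    end
    return chain
end

type_chain(Float64)  # [Float64, AbstractFloat, Real, Number, Any]
type_chain(Int64)    # [Int64, Signed, Integer, Real, Number, Any]
```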

## Multiple dispatch

* Multiple dispatch lies at the core of Julia's design. It allows built-in and user-defined functions to be overloaded for different combinations of argument types.

* Let's consider a simple "doubling" function:

In [22]:
g(x) = x + x

g (generic function with 1 method)

In [23]:
g(1.5)

3.0

This definition is too broad, since some things can't be added:

In [24]:
g("hello world")

LoadError: MethodError: no method matching +(::String, ::String)
Closest candidates are:
  +(::Any, ::Any, ::Any, ::Any...) at operators.jl:424

* The following definition is correct but too restrictive, since any `Number` can be added, not only a `Float64`.

In [25]:
g(x::Float64) = x + x

g (generic function with 2 methods)

* The following definition automatically works on the entire `Number` type tree above!

In [26]:
g(x::Number) = x + x

g (generic function with 3 methods)

This is a lot nicer than 
```julia
function g(x)
    if isa(x, Number)
        return x + x
    else
        throw(ArgumentError("x should be a number"))
    end
end
```

* The `methods(func)` function displays all methods defined for `func`.

In [27]:
methods(g)

* The `@which func(x)` macro tells which method is used for the argument `x`.

In [28]:
x = 1
typeof(x)

Int64

In [29]:
g(x)

2

In [30]:
@which g(x)

In [31]:
x = randn(5)
@which g(x)

In [32]:
g(x)

5-element Array{Float64,1}:
 -0.670831
 -0.222685
  2.39433 
 -2.15139 
  0.452496
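
The examples above dispatch on a single argument. The word "multiple" refers to the fact that the method is chosen from the types of all arguments together. A small sketch with a hypothetical function `combine`:

```julia
# Three methods of one generic function; the most specific match wins.
combine(x::Number, y::Number) = x + y                          # numeric addition
combine(x::AbstractString, y::AbstractString) = string(x, y)   # concatenation
combine(x, y) = (x, y)                                         # fallback for anything else

combine(1, 2.5)      # returns 3.5
combine("a", "b")    # returns "ab"
combine(1, "b")      # returns (1, "b"), since only the fallback matches

length(methods(combine))  # 3
```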

## Just-in-time compilation (JIT)

The following figures and some examples are taken from Arch D. Robison's slides [Introduction to Writing High Performance Julia](https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxibG9uem9uaWNzfGd4OjMwZjI2YTYzNDNmY2UzMmE).

| <img src="./julia_toolchain.png" alt="Julia toolchain" style="width: 400px;"/> | <img src="./julia_introspect.png" alt="Julia toolchain" style="width: 500px;"/> |
|----------------------------------|------------------------------------|
|||

* `Julia`'s efficiency results from its ability to infer the types of **all** variables within a function and then call LLVM to generate optimized machine code at run time.

In [33]:
workspace() # clear previous definition of g
g(x::Number) = x + x

g (generic function with 1 method)

This function will work on **any** `Number` type that has a method for `+`.

In [34]:
@show g(2)
@show g(2.0);

g(2) = 4
g(2.0) = 4.0


This is the lowered form of the [abstract syntax tree (AST)](https://en.wikipedia.org/wiki/Abstract_syntax_tree).

In [35]:
@code_lowered g(2)

CodeInfo(:(begin 
        nothing
        return x + x
    end))

Type inference:

In [36]:
@code_warntype g(2)

Variables:
  #self# <optimized out>
  x::Int64

Body:
  begin 
      return (Base.add_int)(x::Int64, x::Int64)::Int64
  end::Int64


In [37]:
@code_warntype g(2.0)

Variables:
  #self# <optimized out>
  x::Float64

Body:
  begin 
      return (Base.add_float)(x::Float64, x::Float64)::Float64
  end::Float64


Peek at the compiled **LLVM bitcode** with `@code_llvm`

In [38]:
@code_llvm g(2)


define i64 @julia_g_63500(i64) #0 !dbg !5 {
top:
  %1 = shl i64 %0, 1
  ret i64 %1
}


In [39]:
@code_llvm g(2.0)


define double @julia_g_63557(double) #0 !dbg !5 {
top:
  %1 = fadd double %0, %0
  ret double %1
}


We annotated `x` only with the abstract type `Number`, not with a concrete type. Yet different LLVM code is generated according to the argument type!

* In R or Python, `g(2)` and `g(2.0)` would run the same code.
 
* In Julia, `g(2)` and `g(2.0)` dispatch to optimized code for `Int64` and `Float64`, respectively.

* For integer input `x`, the LLVM compiler is smart enough to know that `x + x` is simply a left shift of `x` by 1 bit, which is faster than addition.
 
Lowest level is the **assembly code**, which is machine dependent.

In [40]:
@code_native g(2)

	.section	__TEXT,__text,regular,pure_instructions
Filename: In[26]
	pushl	%ebp
	decl	%eax
	movl	%esp, %ebp
Source line: 1
	decl	%eax
	leal	(%edi,%edi), %eax
	popl	%ebp
	retl
	nop
	nop
	nop
	nop
	nop
	nop


In [41]:
@code_native g(2.0)

	.section	__TEXT,__text,regular,pure_instructions
Filename: In[25]
	pushl	%ebp
	decl	%eax
	movl	%esp, %ebp
Source line: 1
	addsd	%xmm0, %xmm0
	popl	%ebp
	retl
	nop
	nop
	nop
	nop
	nop
	nop
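
To see the same single definition serve several concrete types, the sketch below calls `g` on a few `Number` subtypes and prints the argument type next to the result type. The particular values are arbitrary. Julia compiles a separate specialization the first time each argument type is encountered.

```julia
g(x::Number) = x + x

# One generic definition, four different concrete argument types.
for v in (2, 2.0, 1//2, 2 + 3im)
    println(typeof(v), " -> ", g(v), " :: ", typeof(g(v)))
end
```

Running `@code_llvm g(1//2)` or `@code_llvm g(2 + 3im)` afterwards shows the code generated for those types.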


## Profiling Julia code

Julia has several built-in tools for profiling. The `@time` macro outputs run time and heap allocation.

In [42]:
function tally(x)
    s = 0
    for v in x
        s += v
    end
    s
end

srand(123)
a = rand(10000)
@time tally(a) # first run: include compile time

4997.572183650806

  0.012710 seconds (31.35 k allocations: 539.400 KiB)


In [43]:
@time tally(a)

4997.572183650806

  0.000287 seconds (30.00 k allocations: 468.906 KiB)


For more robust benchmarking, the [BenchmarkTools.jl](https://github.com/JuliaCI/BenchmarkTools.jl) package is highly recommended.

In [44]:
using BenchmarkTools

@benchmark tally(a)

BenchmarkTools.Trial: 
  memory estimate:  468.75 KiB
  allocs estimate:  30000
  --------------
  minimum time:     180.395 μs (0.00% GC)
  median time:      187.535 μs (0.00% GC)
  mean time:        209.668 μs (6.87% GC)
  maximum time:     1.508 ms (83.81% GC)
  --------------
  samples:          10000
  evals/sample:     1



We see that the memory allocation (468.75 KiB in 30,000 allocations, with GC taking 6.87% of the mean run time) is suspiciously high for summing 10,000 numbers.

The `Profile` module gives line-by-line profile results.

In [45]:
srand(123)
a = rand(10_000_000)
Profile.clear()
@profile tally(a)
Profile.print(format=:flat)

 Count File                        Line Function                               
   291 ./<missing>                   -1 anonymous                              
     4 ./In[42]                       3 tally(::Array{Float64,1})              
   287 ./In[42]                       4 tally(::Array{Float64,1})              
     1 ./abstractarray.jl           573 copy!(::Array{Any,1}, ::Core.Infere... 
     1 ./array.jl                   391 _collect(::Type{Any}, ::Core.Infere... 
     1 ./array.jl                   388 collect(::Type{Any}, ::Core.Inferen... 
     5 ./float.jl                   375 +(::Float64, ::Float64)                
     1 ./generator.jl                45 next(::Core.Inference.Generator{Arr... 
     1 ./inference.jl               291 Core.Inference.InferenceState(::Cor... 
     2 ./inference.jl              1897 abstract_call(::Any, ::Array{Any,1}... 
     2 ./inference.jl              1420 abstract_call_gf_by_type(::Any, ::A... 
     3 ./inference.jl              1950 

One can use the [`ProfileView`](https://github.com/timholy/ProfileView.jl) package for better visualization of profile data:

```julia
using ProfileView

ProfileView.view()
```

In [46]:
@code_warntype tally(a)

Variables:
  #self# <optimized out>
  x::Array{Float64,1}
  v::Float64
  #temp#@_4::Int64
  s::Union{Float64, Int64}
  #temp#@_6::Core.MethodInstance
  #temp#@_7::Float64

Body:
  begin 
      s::Union{Float64, Int64} = 0 # line 3:
      #temp#@_4::Int64 = 1
      4: 
      unless (Base.not_int)((#temp#@_4::Int64 === (Base.add_int)((Base.arraylen)(x::Array{Float64,1})::Int64, 1)::Int64)::Bool)::Bool goto 29
      SSAValue(2) = (Base.arrayref)(x::Array{Float64,1}, #temp#@_4::Int64)::Float64
      SSAValue(3) = (Base.add_int)(#temp#@_4::Int64, 1)::Int64
      v::Float64 = SSAValue(2)
      #temp#@_4::Int64 = SSAValue(3) # line 4:
      unless (s::Union{Float64, Int64} isa Int64)::Bool goto 14
      #temp#@_6::Core.MethodInstance = MethodInstance for +(::Int64, ::Float64)
      goto 23
      14: 
      unless (s::Union{Float64, Int64} isa Float64)::Bool goto 18
      #temp#@_6::Core.MethodInstance = MethodInstance

## Memory profiling

Detailed memory profiling requires a detour. First let's write a script [`bar.jl`](./bar.jl), which contains the workload function `tally` and a wrapper for profiling.

In [47]:
;cat bar.jl

function tally(x)
    s = 0
    for v in x
        s += v
    end
    s
end

# call workload from wrapper to avoid misattribution bug
function wrapper()
    y = rand(10000)
    # force compilation
    println(tally(y))
    # clear allocation counters
    Profile.clear_malloc_data()
    # run compiled workload
    println(tally(y))
end

wrapper()


Next, in a terminal, we run the script with the `--track-allocation=user` option.

In [48]:
;julia --track-allocation=user bar.jl

4975.67150640176
4975.67150640176


The profiler outputs a file `bar.jl.mem`.

In [49]:
;cat bar.jl.mem

        - function tally(x)
        0     s = 0
        0     for v in x
   479984         s += v
        -     end
        0     s
        - end
        - 
        - # call workload from wrapper to avoid misattribution bug
        - function wrapper()
        0     y = rand(10000)
        -     # force compilation
        0     println(tally(y))
        -     # clear allocation counters
        0     Profile.clear_malloc_data()
        -     # run compiled workload
      192     println(tally(y))
        - end
        - 
        - wrapper()
        - 


We see that line 4 (`s += v`) allocates a suspicious amount of heap memory: 479,984 bytes, or roughly 48 bytes per loop iteration.

## Type stability

The key to writing performant Julia code is to be [**type stable**](https://docs.julialang.org/en/stable/manual/performance-tips/#Write-"type-stable"-functions-1), such that `Julia` is able to infer types of all variables and output of a function from the types of input arguments. 

Is the `tally` function type stable? How do we diagnose and fix it?

In [50]:
@code_warntype tally(rand(100))

Variables:
  #self# <optimized out>
  x::Array{Float64,1}
  v::Float64
  #temp#@_4::Int64
  s::Union{Float64, Int64}
  #temp#@_6::Core.MethodInstance
  #temp#@_7::Float64

Body:
  begin 
      s::Union{Float64, Int64} = 0 # line 3:
      #temp#@_4::Int64 = 1
      4: 
      unless (Base.not_int)((#temp#@_4::Int64 === (Base.add_int)((Base.arraylen)(x::Array{Float64,1})::Int64, 1)::Int64)::Bool)::Bool goto 29
      SSAValue(2) = (Base.arrayref)(x::Array{Float64,1}, #temp#@_4::Int64)::Float64
      SSAValue(3) = (Base.add_int)(#temp#@_4::Int64, 1)::Int64
      v::Float64 = SSAValue(2)
      #temp#@_4::Int64 = SSAValue(3) # line 4:
      unless (s::Union{Float64, Int64} isa Int64)::Bool goto 14
      #temp#@_6::Core.MethodInstance = MethodInstance for +(::Int64, ::Float64)
      goto 23
      14: 
      unless (s::Union{Float64, Int64} isa Float64)::Bool goto 18
      #temp#@_6::Core.MethodInstance = MethodInstance

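In this case, Julia fails to infer a single concrete type for the reduction variable `s`: it starts as the `Int64` value `0` and becomes a `Float64` after the first addition, so it is typed `Union{Float64, Int64}` and has to be **boxed** in heap memory at run time.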
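
The instability can also be observed without reading the typed code: for the same argument type, the type of the result depends on the contents of the array. The sketch below repeats the definition of `tally` so that it is self-contained. `Base.return_types` is an unexported helper in `Base`, and its exact printed form may differ between Julia versions.

```julia
function tally(x)
    s = 0
    for v in x
        s += v
    end
    s
end

# Same argument type (Vector{Float64}), different result types:
typeof(tally(Float64[]))      # an Int: the loop never runs, so s stays the integer 0
typeof(tally([0.5, 1.5]))     # Float64: s was promoted on the first addition

# Ask the compiler which return type it infers for a Vector{Float64} argument.
Base.return_types(tally, (Vector{Float64},))
```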

<img src="https://www.codeproject.com/KB/dotnet/6importentStepsDotNet/14.jpg" width="400" align="center"/>

<img src="https://i-msdn.sec.s-msft.com/dynimg/IC97798.jpeg" width="300" align="center"/>

This is the generated LLVM bitcode, which is unusually long and contains lots of _box_ operations:

In [51]:
@code_llvm tally(rand(100))


define { i8**, i8 } @julia_tally_63670([8 x i8]* noalias nocapture, i8** dereferenceable(40)) #0 !dbg !5 {
top:
  %2 = call i8**** @jl_get_ptls_states() #5
  %3 = alloca [8 x i8**], align 8
  %.sub = getelementptr inbounds [8 x i8**], [8 x i8**]* %3, i64 0, i64 0
  %4 = getelementptr [8 x i8**], [8 x i8**]* %3, i64 0, i64 5
  %5 = getelementptr [8 x i8**], [8 x i8**]* %3, i64 0, i64 2
  %6 = getelementptr [8 x i8**], [8 x i8**]* %3, i64 0, i64 4
  %7 = bitcast i8*** %4 to i8*
  call void @llvm.memset.p0i8.i32(i8* %7, i8 0, i32 24, i32 8, i1 false)
  %8 = bitcast [8 x i8**]* %3 to i64*
  %9 = bitcast i8*** %5 to i8*
  call void @llvm.memset.p0i8.i64(i8* %9, i8 0, i64 16, i32 8, i1 false)
  store i64 12, i64* %8, align 8
  %10 = bitcast i8**** %2 to i64*
  %11 = load i64, i64* %10, align 8
  %12 = getelementptr [8 x i8**], [8 x i8**]* %3, i64 0, i64 1
  %13 = bitcast i8*** %12 to i64*
  store i64 %11, i64* %13, align 8
  store i8*** %.sub, i8**** %2, align 8
  store i8** null, i8*** %6,

What's the fix?

In [52]:
function tally2(x)
    s = zero(eltype(x))
    for v in x
        s += v
    end
    s
end

tally2 (generic function with 1 method)

In [53]:
@benchmark tally2(a)

BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     11.276 ms (0.00% GC)
  median time:      12.042 ms (0.00% GC)
  mean time:        12.284 ms (0.00% GC)
  maximum time:     20.968 ms (0.00% GC)
  --------------
  samples:          407
  evals/sample:     1
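
Because the accumulator is initialized from the element type rather than from a literal, the same definition stays type stable for other element types as well. A brief sketch, with `tally2` repeated so the block is self-contained (`Base.return_types` is an unexported helper in `Base`):

```julia
function tally2(x)
    s = zero(eltype(x))   # accumulator has the element type of x from the start
    for v in x
        s += v
    end
    s
end

tally2(Float64[])          # 0.0, a Float64 even for empty input
tally2(Float32[1, 2, 3])   # 6.0f0, stays Float32
tally2([1//2, 1//3])       # 5//6, stays Rational

# The inferred return type for a Vector{Float64} argument is now concrete.
Base.return_types(tally2, (Vector{Float64},))
```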

Much shorter LLVM bitcode:

In [54]:
@code_llvm tally2(a)


define double @julia_tally2_64016(i8** dereferenceable(40)) #0 !dbg !5 {
top:
  %1 = getelementptr inbounds i8*, i8** %0, i64 1
  %2 = bitcast i8** %1 to i64*
  %3 = load i64, i64* %2, align 8
  %4 = icmp eq i64 %3, 0
  br i1 %4, label %L14, label %if.lr.ph

if.lr.ph:                                         ; preds = %top
  %5 = getelementptr i8*, i8** %0, i64 3
  %6 = bitcast i8** %5 to i64*
  %7 = load i64, i64* %6, align 8
  %8 = bitcast i8** %0 to double**
  %9 = load double*, double** %8, align 8
  br label %if

if:                                               ; preds = %if.lr.ph, %idxend
  %s.06 = phi double [ 0.000000e+00, %if.lr.ph ], [ %16, %idxend ]
  %"#temp#.05" = phi i64 [ 1, %if.lr.ph ], [ %15, %idxend ]
  %10 = add i64 %"#temp#.05", -1
  %11 = icmp ult i64 %10, %7
  br i1 %11, label %idxend, label %oob

L14.loopexit:                                     ; preds = %idxend
  br label %L14

L14:                                              ; preds = %L14.loopexit, %top
  %

Let's get a further performance boost with `@simd`:

In [1]:
function tally3(x)
    s = zero(eltype(x))
    @simd for v in x
        s += v
    end
    s
end

tally3 (generic function with 1 method)

In [4]:
@benchmark tally3(a)

BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     3.683 ms (0.00% GC)
  median time:      3.755 ms (0.00% GC)
  mean time:        3.835 ms (0.00% GC)
  maximum time:     5.136 ms (0.00% GC)
  --------------
  samples:          1301
  evals/sample:     1
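
One caveat worth checking: `@simd` permits the compiler to reorder the floating-point additions, so the result may differ in the last few digits from a strict left-to-right sum. A minimal sketch of such a check, comparing against `sum` with a tolerance instead of exact equality (the test vector `x` is arbitrary):

```julia
function tally3(x)
    s = zero(eltype(x))
    @simd for v in x
        s += v
    end
    s
end

x = rand(10_000)
tally3(x) ≈ sum(x)      # agreement up to floating-point rounding
tally3([1, 2, 3])       # integer addition is exact: returns 6
```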