# Julia is fast

Very often, benchmarks are used to compare languages.  These benchmarks can lead to long discussions, first as to exactly what is being benchmarked and secondly what explains the differences.  These simple questions can sometimes get more complicated than you at first might imagine.

The purpose of this notebook is for you to see a simple benchmark for yourself.  One can read the notebook and see what happened on the author's Macbook Pro with a 4-core Intel Core I7, or run the notebook yourself.

(This material began life as a wonderful lecture by Steven Johnson at MIT: https://github.com/stevengj/18S096-iap17/blob/master/lecture1/Boxes-and-registers.ipynb.)

# `sum`: An easy enough function to understand

Consider the  **sum** function `sum(a)`, which computes
$$
\mathrm{sum}(a) = \sum_{i=1}^n a_i,
$$
where $n$ is the length of `a`.

In [73]:
a = rand(10^7) # 1D vector of random numbers, uniform on [0,1)

10000000-element Array{Float64,1}:
 0.378597 
 0.559676 
 0.84083  
 0.0206154
 0.482099 
 0.206748 
 0.537193 
 0.049093 
 0.858788 
 0.395606 
 0.541494 
 0.174265 
 0.562593 
 ⋮        
 0.141164 
 0.80015  
 0.4803   
 0.122947 
 0.998452 
 0.671899 
 0.983469 
 0.0728076
 0.595391 
 0.566238 
 0.282788 
 0.581366 

In [2]:
sum(a)   

5.000622679156734e6

The expected result is 0.5 * 10^7, since the mean of each entry is 0.5

# Benchmarking a few ways in a few languages

Julia has a `BenchmarkTools.jl` package for easy and accurate benchmarking:

In [3]:
Pkg.add("BenchmarkTools")

[1m[34mINFO: Nothing to be done
[0m[1m[34mINFO: METADATA is out-of-date — you may not have the latest version of BenchmarkTools
[0m[1m[34mINFO: Use `Pkg.update()` to get the latest versions of your packages
[0m

In [4]:
using BenchmarkTools  

#  1. The C language

C is often considered the gold standard: difficult on the human, nice for the machine. Getting within a factor of 2 of C is often satisfying. Nonetheless, even within C, there are many kinds of optimizations possible that a naive C writer may or may not get the advantage of.

The current author does not speak C, so he does not read the cell below, but is happy to know that you can put C code in a Julia session, compile it, and run it. Note that the `"""` wrap a multi-line string.

In [5]:
C_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += X[i];
    }
    return s;
}
"""

const Clib = tempname()   # make a temporary file


# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed):

open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code) 
end

# define a Julia function that calls the C function:
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)

c_sum (generic function with 1 method)

In [6]:
c_sum(a)

5.000622679156596e6

In [7]:
c_sum(a) ≈ sum(a) # type \approx and then <TAB> to get the ≈ symbolb

true

In [8]:
≈  # alias for the `isapprox` function

isapprox (generic function with 2 methods)

In [9]:
?isapprox

search: [1mi[22m[1ms[22m[1ma[22m[1mp[22m[1mp[22m[1mr[22m[1mo[22m[1mx[22m



```
isapprox(x, y; rtol::Real=sqrt(eps), atol::Real=0)
```

Inexact equality comparison: `true` if `norm(x-y) <= atol + rtol*max(norm(x), norm(y))`. The default `atol` is zero and the default `rtol` depends on the types of `x` and `y`.

For real or complex floating-point values, `rtol` defaults to `sqrt(eps(typeof(real(x-y))))`. This corresponds to requiring equality of about half of the significand digits. For other types, `rtol` defaults to zero.

`x` and `y` may also be arrays of numbers, in which case `norm` defaults to `vecnorm` but may be changed by passing a `norm::Function` keyword argument. (For numbers, `norm` is the same thing as `abs`.) When `x` and `y` are arrays, if `norm(x-y)` is not finite (i.e. `±Inf` or `NaN`), the comparison falls back to checking whether all elements of `x` and `y` are approximately equal component-wise.

The binary operator `≈` is equivalent to `isapprox` with the default arguments, and `x ≉ y` is equivalent to `!isapprox(x,y)`.


We can now benchmark the C code directly from Julia:

In [10]:
c_bench = @benchmark c_sum($a) 

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.144 ms (0.00% GC)
  median time:      9.043 ms (0.00% GC)
  mean time:        9.149 ms (0.00% GC)
  maximum time:     12.813 ms (0.00% GC)
  --------------
  samples:          545
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [11]:
println("C: Fastest time was $(minimum(c_bench.times) / 1e6) msec")

C: Fastest time was 8.143967 msec


In [12]:
d = Dict()  # a "dictionary", i.e. an associative array
d["C"] = minimum(c_bench.times) / 1e6  # in milliseconds
d

Dict{Any,Any} with 1 entry:
  "C" => 8.14397

In [13]:
using Plots
gr()

Plots.GRBackend()

In [14]:
t = c_bench.times / 1e6 # times in milliseconds
m, σ = minimum(t), std(t)

histogram(t, bins=500,
    xlim=(m - 0.01, m + σ),
    xlabel="milliseconds", ylabel="count", label="")

# 2. Python's built in `sum` 

The `PyCall` package provides a Julia interface to Python:

In [15]:
Pkg.add("PyCall")

[1m[34mINFO: Nothing to be done
[0m[1m[34mINFO: METADATA is out-of-date — you may not have the latest version of PyCall
[0m[1m[34mINFO: Use `Pkg.update()` to get the latest versions of your packages
[0m

In [16]:
using PyCall

In [17]:
# Call a low-level PyCall function to get a Python list, because
# by default PyCall will convert to a NumPy array instead (we benchmark NumPy below):

apy_list = PyCall.array2py(a, 1, 1)

# get the Python built-in "sum" function:
pysum = pybuiltin("sum")

PyObject <built-in function sum>

In [18]:
pysum(a)

5.000622679156596e6

In [19]:
pysum(a) ≈ sum(a)

true

In [20]:
py_list_bench = @benchmark $pysum($apy_list)

BenchmarkTools.Trial: 
  memory estimate:  672 bytes
  allocs estimate:  19
  --------------
  minimum time:     73.125 ms (0.00% GC)
  median time:      80.030 ms (0.00% GC)
  mean time:        83.116 ms (0.00% GC)
  maximum time:     99.707 ms (0.00% GC)
  --------------
  samples:          61
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [21]:
d["Python built-in"] = minimum(py_list_bench.times) / 1e6
d

Dict{Any,Any} with 2 entries:
  "C"               => 8.14397
  "Python built-in" => 73.1252

# 3. Python: `numpy` 

## Takes advantage of hardware "SIMD", but only works when it works.

`numpy` is an optimized C library, callable from Python.
It may be installed within Julia as follows:

In [22]:
using Conda 
Conda.add("numpy")

Fetching package metadata ...........
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /Users/dpsanders/.julia/v0.5/Conda/deps/usr:
#
numpy                     1.12.1                   py27_0  


In [23]:
numpy_sum = pyimport("numpy")["sum"]
apy_numpy = PyObject(a) # converts to a numpy array by default

py_numpy_bench = @benchmark $numpy_sum($apy_numpy)

BenchmarkTools.Trial: 
  memory estimate:  960 bytes
  allocs estimate:  25
  --------------
  minimum time:     3.920 ms (0.00% GC)
  median time:      4.163 ms (0.00% GC)
  mean time:        4.271 ms (0.00% GC)
  maximum time:     6.990 ms (0.00% GC)
  --------------
  samples:          1164
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [24]:
numpy_sum(apy_list) # python thing

5.000622679156729e6

In [25]:
numpy_sum(apy_list) ≈ sum(a)

true

In [26]:
d["Python numpy"] = minimum(py_numpy_bench.times) / 1e6
d

Dict{Any,Any} with 3 entries:
  "C"               => 8.14397
  "Python numpy"    => 3.91977
  "Python built-in" => 73.1252

# 4. Python, hand-written 

In [76]:
py"""
def py_sum(a):
    s = 0.0
    for x in a:
        s = s + x
    return s
"""

sum_py = py"py_sum"

PyObject <function py_sum at 0x32fbb16e0>

In [72]:
py_hand = @benchmark $sum_py($apy_list)

BenchmarkTools.Trial: 
  memory estimate:  672 bytes
  allocs estimate:  19
  --------------
  minimum time:     1.333 s (0.00% GC)
  median time:      1.380 s (0.00% GC)
  mean time:        1.378 s (0.00% GC)
  maximum time:     1.419 s (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [29]:
sum_py(apy_list)

5.000622679156596e6

In [30]:
sum_py(apy_list) ≈ sum(a)

true

In [31]:
d["Python hand-written"] = minimum(pyhand.times) / 1e6
d

Dict{Any,Any} with 4 entries:
  "C"                   => 8.14397
  "Python numpy"        => 3.91977
  "Python hand-written" => 1285.9
  "Python built-in"     => 73.1252

# 5. Julia (built-in) 

## Written directly in Julia, not in C!

In [32]:
@which sum(a)

In [33]:
j_bench = @benchmark sum($a)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.770 ms (0.00% GC)
  median time:      4.254 ms (0.00% GC)
  mean time:        4.358 ms (0.00% GC)
  maximum time:     6.861 ms (0.00% GC)
  --------------
  samples:          1141
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [34]:
d["Julia built-in"] = minimum(j_bench.times) / 1e6
d

Dict{Any,Any} with 5 entries:
  "C"                   => 8.14397
  "Python numpy"        => 3.91977
  "Python hand-written" => 1285.9
  "Python built-in"     => 73.1252
  "Julia built-in"      => 3.77031

# 6. Julia (hand-written) 

In [35]:
function mysum(A)   
    s = 0.0  # s = zero(eltype(A))
    for a in A
        s += a
    end
    s
end

mysum (generic function with 1 method)

In [36]:
j_bench_hand = @benchmark mysum($a)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.150 ms (0.00% GC)
  median time:      9.289 ms (0.00% GC)
  mean time:        9.551 ms (0.00% GC)
  maximum time:     12.956 ms (0.00% GC)
  --------------
  samples:          522
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [37]:
d["Julia hand-written"] = minimum(j_bench_hand.times) / 1e6
d

Dict{Any,Any} with 6 entries:
  "C"                   => 8.14397
  "Python numpy"        => 3.91977
  "Julia hand-written"  => 8.14976
  "Python hand-written" => 1285.9
  "Python built-in"     => 73.1252
  "Julia built-in"      => 3.77031

# Julia parallel: `DistributedArrays`

In [38]:
Pkg.add("DistributedArrays")

[1m[34mINFO: Nothing to be done
[0m[1m[34mINFO: METADATA is out-of-date — you may not have the latest version of DistributedArrays
[0m[1m[34mINFO: Use `Pkg.update()` to get the latest versions of your packages
[0m

In [39]:
addprocs(4)

4-element Array{Int64,1}:
 2
 3
 4
 5

In [40]:
@everywhere using DistributedArrays



In [41]:
â = distribute(a)  # write â as `a\hat<TAB>`

10000000-element DistributedArrays.DArray{Float64,1,Array{Float64,1}}:
 0.780425 
 0.831399 
 0.495164 
 0.507991 
 0.257236 
 0.134897 
 0.634852 
 0.272411 
 0.863697 
 0.743836 
 0.493572 
 0.989171 
 0.091806 
 ⋮        
 0.245745 
 0.0803498
 0.616252 
 0.209649 
 0.207877 
 0.662813 
 0.142127 
 0.968939 
 0.305472 
 0.0241002
 0.205357 
 0.0487156

In [42]:
â.indexes

4-element Array{Tuple{UnitRange{Int64}},1}:
 (1:2500000,)       
 (2500001:5000000,) 
 (5000001:7500000,) 
 (7500001:10000000,)

In [43]:
â.pids

4-element Array{Int64,1}:
 2
 3
 4
 5

In [44]:
j_parallel = @benchmark sum($â)

BenchmarkTools.Trial: 
  memory estimate:  43.39 KiB
  allocs estimate:  553
  --------------
  minimum time:     2.183 ms (0.00% GC)
  median time:      2.369 ms (0.00% GC)
  mean time:        2.441 ms (0.32% GC)
  maximum time:     6.055 ms (61.41% GC)
  --------------
  samples:          2026
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [45]:
d["Julia parallel"] = minimum(j_parallel.times) / 1e6
d

Dict{Any,Any} with 7 entries:
  "Julia parallel"      => 2.18308
  "C"                   => 8.14397
  "Python numpy"        => 3.91977
  "Julia hand-written"  => 8.14976
  "Python hand-written" => 1285.9
  "Python built-in"     => 73.1252
  "Julia built-in"      => 3.77031

# Call MATLAB from Julia

The `MATLAB.jl` package allows you to call MATLAB from Julia, **provided**, of course, that **you have a MATLAB license**. See [here](https://github.com/JuliaInterop/MATLAB.jl) for instructions on setting up MATLAB for use from Julia for different platforms.

In [46]:
Pkg.add("MATLAB")

[1m[34mINFO: Nothing to be done
[0m[1m[34mINFO: METADATA is out-of-date — you may not have the latest version of MATLAB
[0m[1m[34mINFO: Use `Pkg.update()` to get the latest versions of your packages
[0m

In [47]:
using MATLAB

In [48]:
a = rand(10^7);

In [49]:
@mput a   # send the object to the MATLAB workspace

In [50]:
using BenchmarkTools

In [51]:
matlab_builtin = @benchmark @matlab sum(a)  # use `@matlab` to run MATLAB code

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

   4.9999e+06

>> 
ans =

 

BenchmarkTools.Trial: 
  memory estimate:  416 bytes
  allocs estimate:  14
  --------------
  minimum time:     7.195 ms (0.00% GC)
  median time:      7.942 ms (0.00% GC)
  mean time:        8.218 ms (0.00% GC)
  maximum time:     34.227 ms (0.00% GC)
  --------------
  samples:          607
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [52]:
d["Matlab built-in"] = minimum(matlab_builtin.times) / 1e6
d

Dict{Any,Any} with 8 entries:
  "Julia parallel"      => 2.18308
  "C"                   => 8.14397
  "Python numpy"        => 3.91977
  "Julia hand-written"  => 8.14976
  "Matlab built-in"     => 7.19475
  "Python hand-written" => 1285.9
  "Python built-in"     => 73.1252
  "Julia built-in"      => 3.77031

In [53]:
a = rand(10^7)

10000000-element Array{Float64,1}:
 0.104046  
 0.348486  
 0.21648   
 0.586656  
 0.724166  
 0.90426   
 0.0846461 
 0.843455  
 0.106086  
 0.246553  
 0.00915259
 0.203915  
 0.744469  
 ⋮         
 0.817912  
 0.518353  
 0.974355  
 0.0843241 
 0.923075  
 0.712648  
 0.565731  
 0.111142  
 0.597492  
 0.654492  
 0.433877  
 0.861064  

In [54]:
function matlab_sum(a)
    @mput a
    
    mat"""
        s = 0;
        for i=1:length(a)
            s = s + a(i);
        end
    """
    
    return @mget s
end

matlab_sum (generic function with 1 method)

In [55]:
matlab_hand = @benchmark matlab_sum($a)

BenchmarkTools.Trial: 
  memory estimate:  1.22 KiB
  allocs estimate:  32
  --------------
  minimum time:     145.748 ms (0.00% GC)
  median time:      166.239 ms (0.00% GC)
  mean time:        167.055 ms (0.00% GC)
  maximum time:     211.482 ms (0.00% GC)
  --------------
  samples:          31
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [56]:
d["Matlab hand-written"] = minimum(matlab_hand.times) / 1e6
d

Dict{Any,Any} with 9 entries:
  "Julia parallel"      => 2.18308
  "Matlab hand-written" => 145.748
  "C"                   => 8.14397
  "Python numpy"        => 3.91977
  "Julia hand-written"  => 8.14976
  "Matlab built-in"     => 7.19475
  "Python hand-written" => 1285.9
  "Python built-in"     => 73.1252
  "Julia built-in"      => 3.77031

# Summary

In [60]:
for (key, value) in sort(collect(d))
    println(rpad(key, 20, "."), lpad(round(value, 1), 8, "."))
end

C........................8.1
Julia built-in...........3.8
Julia hand-written.......8.1
Julia parallel...........2.2
Matlab built-in..........7.2
Matlab hand-written....145.7
Python built-in.........73.1
Python hand-written...1285.9
Python numpy.............3.9


In [61]:
for (key, value) in sort(collect(d), by=x->x[2])
    println(rpad(key, 20, "."), lpad(round(value, 2), 10, "."))
end

Julia parallel............2.18
Julia built-in............3.77
Python numpy..............3.92
Matlab built-in...........7.19
C.........................8.14
Julia hand-written........8.15
Python built-in..........73.13
Matlab hand-written.....145.75
Python hand-written.....1285.9
