# Variations on sum

Benchmarks are often used to compare languages. They can lead to long discussions: first about exactly what is being benchmarked, and second about what explains the differences. These simple questions can get more complicated than you might at first imagine.

The purpose of this notebook is to let you see a simple benchmark for yourself. You can read the notebook and see what happened on the author's MacBook Pro with a 4-core Intel Core i7, or run the notebook yourself.

(This material began life as a wonderful lecture by Steven Johnson at MIT: https://github.com/stevengj/18S096-iap17/blob/master/lecture1/Boxes-and-registers.ipynb.)

# `sum`: An easy enough function to understand

Consider the **sum** function `sum(a)`, which computes
$$
\mathrm{sum}(a) = \sum_{i=1}^n a_i,
$$
where $n$ is the length of `a`.

In [1]:
a = rand(10^7) # 1D vector of random numbers, uniform on [0,1)

10000000-element Array{Float64,1}:
 0.8369669914723556  
 0.6159230095590622  
 0.700043929178316   
 0.8926700411245836  
 0.6621500969501344  
 0.15121378293851717 
 0.9119828437548392  
 0.04886263606637753 
 0.6063175205454701  
 0.7909039284012476  
 0.22482286992951517 
 0.4849086677732175  
 0.5957151940264158  
 ⋮                   
 0.21685101313453403 
 0.05219692397435316 
 0.5198168147866551  
 0.6960490853043606  
 0.908948430319036   
 0.8366302274194704  
 0.7021682717380182  
 0.2545997543326539  
 0.19118737839340882 
 0.8102983012622875  
 0.025189910336794075
 0.9840669209115014  

In [2]:
@time sum(a)
@time sum(a)
@time sum(a)

  0.035185 seconds (96.85 k allocations: 4.933 MiB)
  0.005230 seconds (5 allocations: 176 bytes)
  0.004509 seconds (5 allocations: 176 bytes)


5.000683119920582e6

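The first call is slower and allocates more memory because it includes just-in-time compilation of `sum` for this argument type; the second and third calls show the steady-state cost. The expected result is 0.5 * 10^7, since the mean of each entry is 0.5.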
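
As a rough check on that expectation, the sketch below (not part of the original notebook) uses the fact that a uniform variable on [0,1) has mean 1/2 and variance 1/12, so a sum of `n` independent entries has mean `n/2` and standard deviation `sqrt(n/12)`. The `observed` value is copied from the output above; the variable names are illustrative.

```julia
n = 10^7
expected = n / 2                 # mean of the sum: n * 0.5
sigma = sqrt(n / 12)             # standard deviation of the sum: sqrt(n * 1/12)
observed = 5.000683119920582e6   # sum(a), copied from the output above

# distance from the expected value, in units of the standard deviation
z = (observed - expected) / sigma
println("expected = ", expected, ", sigma = ", sigma, ", z = ", z)
```

The observed sum lies within one standard deviation of the expected value, so the result is unsurprising.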

# Benchmarking a few ways in a few languages

Julia has a `BenchmarkTools.jl` package for easy and accurate benchmarking:

In [3]:
using BenchmarkTools

In [5]:
@btime sin(1.5)

  0.031 ns (0 allocations: 0 bytes)


0.9974949866040544
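
A reported time of 0.031 ns is far less than one clock cycle on a GHz-class CPU, which suggests that the compiler evaluated `sin(1.5)` at compile time and the benchmark measured almost nothing. A minimal sketch of the workaround described in the BenchmarkTools manual is to pass the argument through an interpolated `Ref`, so that its value is not visible as a constant (the variable name `x` is illustrative):

```julia
using BenchmarkTools

x = 1.5

# Interpolate a Ref and dereference it inside the benchmarked expression,
# so the compiler cannot replace sin(1.5) with a precomputed constant.
@btime sin($(Ref(x))[])
```

The returned value is unchanged; only the reported time should become a realistic per-call cost.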

# 1. The C language

C is often considered the gold standard: difficult on the human, nice for the machine. Getting within a factor of 2 of C is often satisfying. Nonetheless, even within C, many kinds of optimization are possible, and a naive C programmer may or may not benefit from them.

The current author does not speak C, so he does not read the cell below, but is happy to know that you can put C code in a Julia session, compile it, and run it. Note that the `"""` delimiters wrap a multi-line string.

In [6]:
C_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += X[i];
    }
    return s;
}
"""

const Clib = tempname()   # generate a temporary file path (no file is created yet)

# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed):

using Libdl

open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code) 
end

# define a Julia function that calls the C function:
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)

c_sum (generic function with 1 method)

In [7]:
c_sum(a)

5.000683119921316e6

In [8]:
c_sum(a) ≈ sum(a) # type \approx and then <TAB> to get the ≈ symbol

true

In [10]:
c_sum(a) - sum(a)

7.348135113716125e-7

In [None]:
≈  # alias for the `isapprox` function

In [None]:
?isapprox
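
The small difference between `c_sum(a)` and `sum(a)` arises because floating-point addition is not associative and the two functions add the entries in a different order. The sketch below illustrates this and spells out the default test that `≈` performs for `Float64` arguments: a relative tolerance of `sqrt(eps(Float64))` and an absolute tolerance of zero. The values `x` and `y` are copied from the outputs above; the variable names are illustrative.

```julia
# Floating-point addition is not associative, so summation order matters:
println((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))     # false

x = 5.000683119921316e6   # c_sum(a), copied from the output above
y = 5.000683119920582e6   # sum(a), copied from the output above

# Default isapprox test for Float64: abs(x - y) <= rtol * max(abs(x), abs(y))
rtol = sqrt(eps(Float64))
println(abs(x - y) <= rtol * max(abs(x), abs(y)))   # true
println(x ≈ y)                                      # true
```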

We can now benchmark the C code directly from Julia:

In [11]:
c_bench = @benchmark c_sum($a)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.920 ms (0.00% GC)
  median time:      8.980 ms (0.00% GC)
  mean time:        9.154 ms (0.00% GC)
  maximum time:     16.750 ms (0.00% GC)
  --------------
  samples:          546
  evals/sample:     1

In [12]:
println("C: Fastest time was $(minimum(c_bench.times) / 1e6) msec")

C: Fastest time was 8.919854 msec


In [13]:
d = Dict()  # a "dictionary", i.e. an associative array
d["C"] = minimum(c_bench.times) / 1e6  # in milliseconds
d

Dict{Any,Any} with 1 entry:
  "C" => 8.91985

# 2. Python's built-in `sum`

The `PyCall` package provides a Julia interface to Python:

In [14]:
using PyCall

In [15]:
# Call a low-level PyCall function to get a Python list, because
# by default PyCall will convert to a NumPy array instead (we benchmark NumPy below):

apy_list = PyCall.array2py(a, 1, 1)

# get the Python built-in "sum" function:
pysum = pybuiltin("sum")

PyObject <built-in function sum>

In [16]:
pysum(a)

5.000683119921316e6

In [17]:
pysum(a) ≈ sum(a)

true

In [18]:
pysum(a) - sum(a)

7.348135113716125e-7

In [19]:
py_list_bench = @benchmark $pysum($apy_list)

BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  3
  --------------
  minimum time:     36.665 ms (0.00% GC)
  median time:      37.131 ms (0.00% GC)
  mean time:        38.087 ms (0.00% GC)
  maximum time:     52.233 ms (0.00% GC)
  --------------
  samples:          132
  evals/sample:     1

In [20]:
d["Python built-in"] = minimum(py_list_bench.times) / 1e6
d

Dict{Any,Any} with 2 entries:
  "C"               => 8.91985
  "Python built-in" => 36.6648

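One plausible contributor to the gap between the Python list and the C loop is that a Python list stores references to individually heap-allocated float objects, so every addition involves following a reference and checking a type. A rough Julia analogue, offered as an illustration rather than a measurement of Python itself, is to run the same loop over a `Vector{Any}`. The names `loop_sum`, `v`, and `v_any` are hypothetical and not part of the original notebook.

```julia
using BenchmarkTools

# The same sequential loop, applied to two different containers
function loop_sum(A)
    s = 0.0
    for x in A
        s += x
    end
    return s
end

v = rand(10^6)            # contiguous Float64 values
v_any = Vector{Any}(v)    # same values, stored as references to boxed objects

@btime loop_sum($v)
@btime loop_sum($v_any)
```

Both calls return the same sum; only the storage layout, and therefore the work done per element, differs.
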
# 3. Python: `numpy` 

## Takes advantage of hardware "SIMD", but only works when it works.

`numpy` is an optimized C library, callable from Python.
It may be installed within Julia as follows:

In [21]:
using Conda
# Conda.add("numpy")  # uncomment if NumPy is not already installed

In [22]:
numpy_sum = pyimport("numpy")["sum"]
apy_numpy = PyObject(a) # converts to a numpy array by default

py_numpy_bench = @benchmark $numpy_sum($apy_numpy)

BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  3
  --------------
  minimum time:     3.263 ms (0.00% GC)
  median time:      3.330 ms (0.00% GC)
  mean time:        3.381 ms (0.00% GC)
  maximum time:     7.150 ms (0.00% GC)
  --------------
  samples:          1477
  evals/sample:     1

In [23]:
numpy_sum(apy_list) # NumPy's sum also accepts a plain Python list

5.000683119920585e6

In [24]:
numpy_sum(apy_list) ≈ sum(a)

true

In [25]:
numpy_sum(apy_list) - sum(a)

3.725290298461914e-9

In [26]:
d["Python numpy"] = minimum(py_numpy_bench.times) / 1e6
d

Dict{Any,Any} with 3 entries:
  "C"               => 8.91985
  "Python numpy"    => 3.26261
  "Python built-in" => 36.6648

# 4. Python, hand-written 

In [27]:
py"""
def py_sum(a):
    s = 0.0
    for x in a:
        s = s + x
    return s
"""

sum_py = py"py_sum"

PyObject <function py_sum at 0x13966ad90>

In [28]:
py_hand = @benchmark $sum_py($apy_list)

BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  3
  --------------
  minimum time:     186.661 ms (0.00% GC)
  median time:      190.081 ms (0.00% GC)
  mean time:        190.892 ms (0.00% GC)
  maximum time:     204.720 ms (0.00% GC)
  --------------
  samples:          27
  evals/sample:     1

In [29]:
@which sum([1.5])

In [30]:
sum_py(apy_list)

5.000683119921316e6

In [31]:
sum_py(apy_list) ≈ sum(a)

true

In [32]:
sum_py(apy_list) - sum(a)

7.348135113716125e-7

In [33]:
d["Python hand-written"] = minimum(py_hand.times) / 1e6
d

Dict{Any,Any} with 4 entries:
  "C"                   => 8.91985
  "Python numpy"        => 3.26261
  "Python hand-written" => 186.661
  "Python built-in"     => 36.6648

# 5. Julia (built-in) 

## Written directly in Julia, not in C!

In [34]:
@which sum(a)

In [35]:
j_bench = @benchmark sum($a)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.997 ms (0.00% GC)
  median time:      4.051 ms (0.00% GC)
  mean time:        4.076 ms (0.00% GC)
  maximum time:     6.132 ms (0.00% GC)
  --------------
  samples:          1226
  evals/sample:     1

In [36]:
d["Julia built-in"] = minimum(j_bench.times) / 1e6
d

Dict{Any,Any} with 5 entries:
  "C"                   => 8.91985
  "Python numpy"        => 3.26261
  "Python hand-written" => 186.661
  "Python built-in"     => 36.6648
  "Julia built-in"      => 3.99686

# 6. Julia (hand-written) 

In [37]:
function mysum(A)   
    s = 0.0 # s = zero(eltype(A))
    for a in A
        s += a
    end
    s
end

mysum (generic function with 1 method)
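
The commented-out alternative `s = zero(eltype(A))` makes the accumulator match the element type of the input. With `s = 0.0`, integer or rational input is promoted to floating point. A minimal sketch of the generic variant follows; `mysum_generic` is a hypothetical name, not part of the original notebook.

```julia
function mysum_generic(A)
    s = zero(eltype(A))   # accumulator has the same type as the elements
    for a in A
        s += a
    end
    return s
end

println(mysum_generic([1, 2, 3]))        # 6, an Int rather than 6.0
println(mysum_generic([1//2, 1//3]))     # 5//6, an exact Rational
println(mysum_generic([1.5, 2.5]))       # 4.0
```

For a `Vector{Float64}` such as `a`, `zero(eltype(A))` is `0.0`, so both variants do the same work.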

In [38]:
j_bench_hand = @benchmark mysum($a)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.925 ms (0.00% GC)
  median time:      9.102 ms (0.00% GC)
  mean time:        9.810 ms (0.00% GC)
  maximum time:     19.440 ms (0.00% GC)
  --------------
  samples:          510
  evals/sample:     1

In [39]:
d["Julia hand-written"] = minimum(j_bench_hand.times) / 1e6
d

Dict{Any,Any} with 6 entries:
  "C"                   => 8.91985
  "Python numpy"        => 3.26261
  "Julia hand-written"  => 8.9251
  "Python hand-written" => 186.661
  "Python built-in"     => 36.6648
  "Julia built-in"      => 3.99686

# 7. Julia (hand-written + simd) 

In [40]:
function mysum_simd(A)   
    s = 0.0 # s = zero(eltype(A))
    @simd for a in A
        s += a
    end
    s
end

mysum_simd (generic function with 1 method)

In [41]:
j_bench_hand_simd = @benchmark mysum_simd($a)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.272 ms (0.00% GC)
  median time:      4.072 ms (0.00% GC)
  mean time:        4.025 ms (0.00% GC)
  maximum time:     5.843 ms (0.00% GC)
  --------------
  samples:          1241
  evals/sample:     1

In [42]:
mysum_simd(a)

5.000683119920576e6

In [43]:
d["Julia hand-written simd"] = minimum(j_bench_hand_simd.times) / 1e6
d

Dict{Any,Any} with 7 entries:
  "Julia hand-written simd" => 3.27176
  "C"                       => 8.91985
  "Python numpy"            => 3.26261
  "Julia hand-written"      => 8.9251
  "Python hand-written"     => 186.661
  "Python built-in"         => 36.6648
  "Julia built-in"          => 3.99686

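`@simd` gives the compiler permission to reorder the floating-point additions in the loop, which is what allows it to use SIMD instructions and also why `mysum_simd(a)` differs slightly from `c_sum(a)`. The C loop in section 1 was compiled without that permission. As an assumption that has not been timed here, recompiling the same C code with gcc's `-ffast-math` flag should allow gcc to vectorize the reduction as well. The names `C_src`, `Clib_fastmath`, and `c_sum_fastmath` are hypothetical, and the sketch works only if gcc is installed.

```julia
using Libdl

# Same C function as in section 1
C_src = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += X[i];
    }
    return s;
}
"""

const Clib_fastmath = tempname()   # temporary path for the new shared library

# Same compile command as before, with -ffast-math added
open(`gcc -fPIC -O3 -msse3 -xc -shared -ffast-math -o $(Clib_fastmath * "." * Libdl.dlext) -`, "w") do f
    print(f, C_src)
end

c_sum_fastmath(X::Array{Float64}) =
    ccall(("c_sum", Clib_fastmath), Float64, (Csize_t, Ptr{Float64}), length(X), X)

v = rand(10^6)
println(c_sum_fastmath(v) ≈ sum(v))   # sanity check against Julia's built-in sum
```

Benchmarking it with `@benchmark c_sum_fastmath($a)` alongside the entries above would show whether the gap to the `@simd` version closes on your machine.
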
# Summary

In [44]:
for (key, value) in sort(collect(d), by=last)
    println(rpad(key, 23, " "), lpad(round(value, digits=1), 6, " "))
end

Python numpy              3.3
Julia hand-written simd   3.3
Julia built-in            4.0
C                         8.9
Julia hand-written        8.9
Python built-in          36.7
Python hand-written     186.7
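
To read the table relative to C, the sketch below divides each minimum time by the C time. The numbers are copied from the dictionary printed above, so they describe the author's machine only; `times` and `c_time` are illustrative names.

```julia
# Minimum times in milliseconds, copied from the results above
times = Dict(
    "Python numpy"            => 3.26261,
    "Julia hand-written simd" => 3.27176,
    "Julia built-in"          => 3.99686,
    "C"                       => 8.91985,
    "Julia hand-written"      => 8.9251,
    "Python built-in"         => 36.6648,
    "Python hand-written"     => 186.661,
)

c_time = times["C"]

# Print each entry as a multiple of the C time, fastest first
for (name, t) in sort(collect(times), by=last)
    println(rpad(name, 25), lpad(round(t / c_time, digits=2), 6), " x C")
end
```

On these numbers the hand-written Julia loop matches the C loop almost exactly, while the hand-written Python loop takes more than 20 times as long.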
