In [None]:
import Pkg; Pkg.add(Pkg.PackageSpec(url="https://github.com/JuliaComputing/JuliaAcademyData.jl"))
using JuliaAcademyData; activate("Parallel_Computing")

# Julia is fast

Very often, benchmarks are used to compare languages.  These benchmarks can
lead to long discussions, first about exactly what is being benchmarked and
second about what explains the differences.  These simple questions can
sometimes get more complicated than you might at first imagine.

The purpose of this notebook is for you to see a simple benchmark for
yourself.

(This material began life as a wonderful lecture by Steven Johnson at MIT:
[Boxes and registers](https://github.com/stevengj/18S096/blob/master/lectures/lecture1/Boxes-and-registers.ipynb)).

# Outline of this notebook

- Define the sum function
- Implementations & benchmarking of sum in...
    - Julia (built-in)
    - Julia (hand-written)
    - C (hand-written)
    - Python (built-in)
    - Python (numpy)
    - Python (hand-written)
- Towards exploiting parallelism with Julia
    - Allowing for floating point associativity
    - Making use of four cores at once: built-in
    - Making use of four cores at once: hand-written
- Summary of benchmarks

# `sum`: An easy enough function to understand

Consider the **sum** function `sum(a)`, which computes
$$
\mathrm{sum}(a) = \sum_{i=1}^n a_i,
$$
where $n$ is the length of `a`.

In [1]:
a = rand(10^7) # 1D vector of random numbers, uniform on [0,1)

10000000-element Vector{Float64}:
 0.12170443858416524
 0.05406416678609949
 0.7647346886456676
 0.571280973158476
 0.3540463146060089
 0.6417664725906271
 0.26379448787023074
 0.42128705946223777
 0.1969495230024263
 0.6271605650103085
 0.16470746773721
 0.9104573391417858
 0.8996253349387446
 ⋮
 0.4376777357734525
 0.17382456667783686
 0.16310929075640845
 0.7580473398881458
 0.4954343342805785
 0.9888049683842577
 0.9469044697008451
 0.18591043819707975
 0.10009839359669659
 0.004296624562745821
 0.750145193234373
 0.03041370706436486

In [2]:
sum(a)

4.999722243263498e6

The expected result is about `0.5 * 10^7`, since each entry has mean 0.5.
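
As a rough sanity check on that expectation: a sum of $n$ independent uniform $[0,1)$ draws has mean $n/2$ and standard deviation $\sqrt{n/12}$, which is about 913 for $n = 10^7$. The value printed above is about 278 below $5 \times 10^6$, or roughly 0.3 standard deviations, so it is entirely unremarkable. The sketch below repeats the check on a fresh vector; the names `n`, `x`, `σ`, and `z` are illustrative and not part of the original notebook.

```julia
n = 10^7
x = rand(n)                  # fresh sample, separate from `a` above
expected = n / 2             # mean of a sum of n uniform [0,1) draws
σ = sqrt(n / 12)             # standard deviation of that sum
z = (sum(x) - expected) / σ  # deviation in units of σ
println("z-score of the sum: ", round(z; digits=2))
```

A `z` with magnitude of a few units or less is what one would expect from honest random data.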

# Benchmarking a few ways in a few languages

In [3]:
@time sum(a)

  0.003581 seconds (1 allocation: 16 bytes)


4.999722243263498e6

In [4]:
@time sum(a)

  0.003399 seconds (1 allocation: 16 bytes)


4.999722243263498e6

In [5]:
@time sum(a)

  0.003896 seconds (1 allocation: 16 bytes)


4.999722243263498e6

The `@time` macro can yield noisy results, so it's not our best choice for benchmarking!

Luckily, Julia has a `BenchmarkTools.jl` package to make benchmarking easy and accurate:

In [7]:
using Pkg
Pkg.add("BenchmarkTools")
using BenchmarkTools

[32m[1m    Updating[22m[39m registry at `C:\Users\kart-\.julia\registries\General`
[32m[1m    Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m    Updating[22m[39m `C:\Users\kart-\.julia\environments\v1.6\Project.toml`
 [90m [6e4b80f9] [39m[92m+ BenchmarkTools v1.3.1[39m
[32m[1m    Updating[22m[39m `C:\Users\kart-\.julia\environments\v1.6\Manifest.toml`
 [90m [6e4b80f9] [39m[92m+ BenchmarkTools v1.3.1[39m


In [8]:
@benchmark sum($a)

BenchmarkTools.Trial: 1585 samples with 1 evaluation.
 Range (min … max):  2.873 ms …   6.403 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.030 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.139 ms ± 377.164 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

# 1. Julia Built-in

So that's the performance of Julia's built-in `sum`, but it could be using any number of tricks to be fast, including not being written in Julia at all! It is in fact written in Julia, but how would a naive implementation of our own perform?

In [9]:
@which sum(a)

Let's save these benchmark results to a dictionary so we can start keeping track of them and comparing them down the line.

In [10]:
j_bench = @benchmark sum($a)

BenchmarkTools.Trial: 1632 samples with 1 evaluation.
 Range (min … max):  2.877 ms …   4.419 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.957 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.048 ms ± 202.206 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [11]:
d = Dict()
d["Julia built-in"] = minimum(j_bench.times) / 1e6
d

Dict{Any, Any} with 1 entry:
  "Julia built-in" => 2.8772
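
BenchmarkTools records each sample in nanoseconds, so dividing by `1e6` gives milliseconds. Taking the minimum is a common way to filter out noise from other activity on the machine. As a rough, standard-library-only sketch of that idea (not how BenchmarkTools is actually implemented), one could time repeated calls with `@elapsed` and keep the smallest. `min_time_ms` is a hypothetical helper introduced here for illustration.

```julia
# Hypothetical helper: minimum wall-clock time of `f(x)` over `samples` runs, in ms.
function min_time_ms(f, x; samples = 20)
    f(x)                                          # warm-up call so compilation is not timed
    times = [@elapsed(f(x)) for _ in 1:samples]   # seconds per run
    return minimum(times) * 1e3                   # seconds -> milliseconds
end

v = rand(10^6)
println("min time for sum: ", min_time_ms(sum, v), " ms")
```

For real measurements `@benchmark` remains the better tool, since it also chooses the number of samples and evaluations for you.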

# 2. Julia (hand-written)

In [12]:
function mysum(A)
    s = 0.0
    for a in A
        s += a
    end
    return s
end

mysum (generic function with 1 method)

In [13]:
j_bench_hand = @benchmark mysum($a)

BenchmarkTools.Trial: 643 samples with 1 evaluation.
 Range (min … max):  7.579 ms …   9.281 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.743 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.768 ms ± 148.200 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [14]:
d["Julia hand-written"] = minimum(j_bench_hand.times) / 1e6
d

Dict{Any, Any} with 2 entries:
  "Julia hand-written" => 7.5792
  "Julia built-in"     => 2.8772

So that's about 2.6x slower than the built-in definition. We'll see why later on.

But first: is this fast?  How would we know?  Let's compare it to some other languages...

# 3. The C language

C is often considered the gold standard: difficult on the human, nice for the machine. Getting within a factor of 2 of C is often satisfying. Nonetheless, even within C, there are many kinds of optimizations possible that a naive C writer may or may not take advantage of.

The current author does not speak C, so he does not read the cell below, but is happy to know that you can put C code in a Julia session, compile it, and run it. Note that the `"""` delimiters wrap a multi-line string.

In [15]:
using Libdl
C_code = """
    #include <stddef.h>
    double c_sum(size_t n, double *X) {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i) {
            s += X[i];
        }
        return s;
    }
"""

const Clib = tempname()   # make a temporary file


# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed):

# -std=c99 is needed by older gcc versions that default to C89, which rejects
# the `for (size_t i = 0; ...)` declaration in the loop header
open(`gcc -std=c99 -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code)
end

# define a Julia function that calls the C function:
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)

c_sum (generic function with 1 method)



In [None]:
c_sum(a)

In [None]:
c_sum(a) ≈ sum(a) # type \approx and then <TAB> to get the ≈ symbol

We can now benchmark the C code directly from Julia:

In [None]:
c_bench = @benchmark c_sum($a)

In [None]:
d["C"] = minimum(c_bench.times) / 1e6  # in milliseconds
d

# 4. Python's built-in `sum`

The `PyCall` package provides a Julia interface to Python:

In [16]:
using PyCall

In [17]:
# get the Python built-in "sum" function:
pysum = pybuiltin("sum")

PyObject <built-in function sum>

In [18]:
pysum(a)

4.999722243263187e6

In [19]:
pysum(a) ≈ sum(a)

true

In [20]:
py_list_bench = @benchmark $pysum($a)

BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  848.516 ms …    1.428 s  ┊ GC (min … max): 19.68% … 28.50%
 Time  (median):     920.220 ms               ┊ GC (median):    24.46%
 Time  (mean ± σ):      1.087 s ± 267.364 ms  ┊ GC (mean ± σ):  25.03% ±  3.78%

In [21]:
d["Python built-in"] = minimum(py_list_bench.times) / 1e6
d

Dict{Any, Any} with 3 entries:
  "Julia hand-written" => 7.5792
  "Julia built-in"     => 2.8772
  "Python built-in"    => 848.516

# 5. Python: `numpy`

`numpy` is an optimized C library, callable from Python.
It may be installed within Julia as follows:

In [23]:
Pkg.add("Conda")
using Conda

[32m[1m   Resolving[22m[39m package versions...
[32m[1m    Updating[22m[39m `C:\Users\kart-\.julia\environments\v1.6\Project.toml`
 [90m [8f4d0f93] [39m[92m+ Conda v1.7.0[39m
[32m[1m  No Changes[22m[39m to `C:\Users\kart-\.julia\environments\v1.6\Manifest.toml`


In [27]:
Conda.add("numpy")

┌ Info: Running `conda install -y numpy` in root environment
└ @ Conda C:\Users\kart-\.julia\packages\Conda\x2UxR\src\Conda.jl:127


Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [30]:
numpy_sum = pyimport("numpy")["sum"]

py_numpy_bench = @benchmark $numpy_sum($a)

BenchmarkTools.Trial: 489 samples with 1 evaluation.
 Range (min … max):   9.770 ms …  17.233 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.089 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.219 ms ± 542.326 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [31]:
numpy_sum(a)

4.999722243263509e6

In [32]:
numpy_sum(a) ≈ sum(a)

true

In [33]:
d["Python numpy"] = minimum(py_numpy_bench.times) / 1e6
d

Dict{Any, Any} with 4 entries:
  "Python numpy"       => 9.7702
  "Julia hand-written" => 7.5792
  "Julia built-in"     => 2.8772
  "Python built-in"    => 848.516

# 6. Python, hand-written

In [34]:
py"""
def py_sum(A):
    s = 0.0
    for a in A:
        s += a
    return s
"""

sum_py = py"py_sum"

PyObject <function py_sum at 0x0000000001578EE0>

In [35]:
py_hand = @benchmark $sum_py($a)

BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  969.226 ms … 990.921 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     975.653 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   976.821 ms ±   8.280 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [36]:
sum_py(a)

4.999722243263187e6

In [37]:
sum_py(a) ≈ sum(a)

true

In [38]:
d["Python hand-written"] = minimum(py_hand.times) / 1e6
d

Dict{Any, Any} with 5 entries:
  "Python numpy"        => 9.7702
  "Julia hand-written"  => 7.5792
  "Python hand-written" => 969.226
  "Julia built-in"      => 2.8772
  "Python built-in"     => 848.516

# Summary so far

In [39]:
for (key, value) in sort(collect(d), by=last)
    println(rpad(key, 25, "."), lpad(round(value; digits=1), 6, "."))
end

Julia built-in..............2.9
Julia hand-written..........7.6
Python numpy................9.8
Python built-in...........848.5
Python hand-written.......969.2


We seem to have three different performance classes here. In this run, Julia's
built-in `sum` leads the pack, followed by the hand-written Julia loop and
numpy at roughly 2.6x and 3.4x slower. (The C version has no entry because its
benchmark cells were not executed in this run.) And then we have the much,
much slower Python definitions, around 300x slower than the Julia built-in.

# Exploiting parallelism with Julia

The fact that our hand-written Julia solution was a small multiple (between 2x
and 3x) slower than the built-in solution is a big clue: perhaps there's some
sort of parallelism going on here?

(In fairness, there are ways to exploit parallelism in other languages, too,
but for brevity we won't cover them.)

# 7. Julia (allowing floating point associativity)

The `for` loop

```julia
for a in A
    s += a
end
```

defines a very strict _order_ to the summation: Julia follows exactly what you
wrote and adds the elements of `A` to the result `s` in the order it iterates.
Since floating-point addition isn't associative, a rearrangement here could
change the answer, and Julia is loath to give you a different answer than
the one you asked for.
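
A small illustration of why reordering matters: grouping the same three numbers differently can change the last bits of the result. The same effect is visible in this notebook's outputs, where the different implementations of the sum disagree in their final digits. This snippet is an added example and not part of the original lecture.

```julia
left  = (0.1 + 0.2) + 0.3   # add left to right
right = 0.1 + (0.2 + 0.3)   # same numbers, different grouping
println(left == right)      # false
println(left ≈ right)       # true: they agree to within rounding error
```

This is also why the notebook compares results with `≈` rather than `==`.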

You can, however, tell Julia to relax that rule and treat floating-point
addition as associative with the `@fastmath` macro. This might allow Julia to
rearrange the sum in an advantageous manner.

In [40]:
function mysum_fast(A)
    s = 0.0
    for a in A
        @fastmath s += a
    end
    s
end

mysum_fast (generic function with 1 method)

In [41]:
j_bench_hand_fast = @benchmark mysum_fast($a)

BenchmarkTools.Trial: 1718 samples with 1 evaluation.
 Range (min … max):  2.724 ms …   4.528 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.793 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.893 ms ± 224.545 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [42]:
mysum_fast(a)

4.999722243263524e6

In [43]:
d["Julia hand-written fast"] = minimum(j_bench_hand_fast.times) / 1e6
d

Dict{Any, Any} with 6 entries:
  "Python numpy"            => 9.7702
  "Julia hand-written"      => 7.5792
  "Python hand-written"     => 969.226
  "Julia built-in"          => 2.8772
  "Python built-in"         => 848.516
  "Julia hand-written fast" => 2.7239
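
The roughly 2.8x gain from `@fastmath` (7.6 ms down to 2.7 ms) is consistent with the compiler now being free to vectorize the loop with SIMD instructions. Base Julia also offers the narrower `@simd` macro, which grants reordering permission for the reduction in a single loop. The following is a sketch of that alternative; `mysum_simd` is a name introduced here, and it was not benchmarked in this notebook.

```julia
function mysum_simd(A)
    s = zero(eltype(A))
    # @simd tells the compiler it may reorder the floating-point reduction
    @simd for i in eachindex(A)
        @inbounds s += A[i]   # skip bounds checks; `i` comes from eachindex(A)
    end
    return s
end

v = rand(10^6)
println(mysum_simd(v) ≈ sum(v))
```

Compared with `@fastmath`, `@simd` only licenses reordering of the loop's reduction, so it is arguably the more conservative choice when that is the only freedom needed.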

# 8. Distributed Julia (built-in)

We can take this one step further: nearly every modern computer these days has
multiple cores. All the above solutions are working one core hard, but all the
others are just sitting by idly. Let's put them to work!

In [44]:
Pkg.add("DistributedArrays")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m IntegerMathUtils ── v0.1.0
[32m[1m   Installed[22m[39m Primes ──────────── v0.5.2
[32m[1m   Installed[22m[39m DistributedArrays ─ v0.6.6
[32m[1m    Updating[22m[39m `C:\Users\kart-\.julia\environments\v1.6\Project.toml`
 [90m [aaf54ef3] [39m[92m+ DistributedArrays v0.6.6[39m
[32m[1m    Updating[22m[39m `C:\Users\kart-\.julia\environments\v1.6\Manifest.toml`
 [90m [aaf54ef3] [39m[92m+ DistributedArrays v0.6.6[39m
 [90m [18e54dd8] [39m[92m+ IntegerMathUtils v0.1.0[39m
 [90m [27ebfcd6] [39m[92m+ Primes v0.5.2[39m
[32m[1mPrecompiling[22m[39m project...
[32m  ✓ [39m[90mIntegerMathUtils[39m
[32m  ✓ [39m[90mPrimes[39m
[32m  ✓ [39mDistributedArrays
  3 dependencies successfully precompiled in 2 seconds (89 already precompiled)


In [46]:
using Distributed
using DistributedArrays
addprocs(4)
@sync @everywhere workers() include("/opt/julia-1.0/etc/julia/startup.jl") # Needed just for JuliaBox
@everywhere using DistributedArrays

In [47]:
adist = distribute(a)
j_bench_dist = @benchmark sum($adist)

BenchmarkTools.Trial: 1809 samples with 1 evaluation.
 Range (min … max):  2.309 ms …  11.089 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.670 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.744 ms ± 360.196 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [48]:
d["Julia 4x built-in"] = minimum(j_bench_dist.times) / 1e6
d

Dict{Any, Any} with 7 entries:
  "Python numpy"            => 9.7702
  "Julia hand-written"      => 7.5792
  "Python hand-written"     => 969.226
  "Julia built-in"          => 2.8772
  "Python built-in"         => 848.516
  "Julia 4x built-in"       => 2.3087
  "Julia hand-written fast" => 2.7239

# 9. Distributed Julia (hand-written)

OK, that might be cheating, too: it's again just calling a library
function. Is it possible to write a distributed sum ourselves?

In [49]:
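function mysum_dist(a::DArray)
    # one Future per process that holds a piece of `a`
    r = Array{Future}(undef, length(procs(a)))
    for (i, id) in enumerate(procs(a))
        # sum the locally stored chunk on process `id`
        r[i] = @spawnat id sum(localpart(a))
    end
    # wait for all partial sums and combine them
    return sum(fetch.(r))
end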

mysum_dist (generic function with 1 method)

In [50]:
j_bench_hand_dist = @benchmark mysum_dist($adist)

BenchmarkTools.Trial: 1389 samples with 1 evaluation.
 Range (min … max):  3.015 ms …   4.497 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.550 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.585 ms ± 240.826 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [51]:
d["Julia 4x hand-written"] = minimum(j_bench_hand_dist.times) / 1e6
d

Dict{Any, Any} with 8 entries:
  "Python numpy"            => 9.7702
  "Julia hand-written"      => 7.5792
  "Python hand-written"     => 969.226
  "Julia built-in"          => 2.8772
  "Python built-in"         => 848.516
  "Julia 4x built-in"       => 2.3087
  "Julia 4x hand-written"   => 3.0155
  "Julia hand-written fast" => 2.7239
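
For comparison, the `Distributed` standard library also has a `@distributed` macro that performs a reduction over a range without needing DistributedArrays. The sketch below is an illustration only: `mysum_distributed` is a hypothetical name, it was not benchmarked here, and unlike the `DArray` version it sends the captured array `x` to the workers on every call, so it would likely be slower for this workload.

```julia
using Distributed

# Hypothetical variant: reduce with (+) across whatever workers are available.
# With no extra workers added, the loop simply runs on the current process.
function mysum_distributed(x)
    return @distributed (+) for i in 1:length(x)
        x[i]   # the value of the last expression is what gets reduced
    end
end

v = rand(10^5)
println(mysum_distributed(v) ≈ sum(v))
```

The `DArray` approach above avoids that data movement because each worker already holds its own chunk.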

# Overall Summary

In [52]:
for (key, value) in sort(collect(d), by=last)
    println(rpad(key, 25, "."), lpad(round(value; digits=1), 6, "."))
end

Julia 4x built-in...........2.3
Julia hand-written fast.....2.7
Julia built-in..............2.9
Julia 4x hand-written.......3.0
Julia hand-written..........7.6
Python numpy................9.8
Python built-in...........848.5
Python hand-written.......969.2


# Key take-aways

* Julia allows for serial C-like performance, even with hand-written functions
* Julia allows us to exploit many forms of parallelism to further improve performance. We demonstrated:
    * Single-processor parallelism with SIMD
    * Multi-process parallelism with DistributedArrays
* But there are many other ways to express parallelism, too!
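
As one example of those other ways, Julia's built-in multithreading can split the sum across threads within a single process. This is a minimal sketch and not part of the original notebook: `mysum_threads` is a hypothetical name, and it only runs in parallel when Julia is started with more than one thread (for example via the `-t` command-line option or the `JULIA_NUM_THREADS` environment variable).

```julia
# Hypothetical helper: sum chunks of `x` on separate tasks, then combine.
function mysum_threads(x; ntasks = Threads.nthreads())
    chunk_len = max(cld(length(x), ntasks), 1)      # elements per chunk, rounded up
    chunks = Iterators.partition(x, chunk_len)      # consecutive pieces of `x`
    tasks = map(chunks) do c
        Threads.@spawn sum(c)                       # one task per chunk
    end
    return sum(fetch.(tasks); init = zero(eltype(x)))  # combine partial sums
end

v = rand(10^6)
println(mysum_threads(v) ≈ sum(v))
```

Because threads share memory, no data has to be copied between processes, which is the main design difference from the `Distributed` approach above.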