# Julia is Fast
*Benchmarks* are often used to compare languages. They can lead to long discussions, first about what exactly is being measured and second about what explains the differences. These simple questions can sometimes get more complicated than you might at first imagine.

The purpose of this *notebook* is for you to see a simple *benchmark* for yourself.

Outline of this *notebook*

- Define the sum function
- Implementations and benchmarking of sum in...
    - Julia (*built-in*)
    - Julia (*hand-written*)
    - C (*hand-written*)
    - Python (*built-in*)
    - Python (*numpy*)
    - Python (*hand-written*)
- Toward exploiting parallelism with Julia
    - Allowing floating-point associativity
    - Using four cores at once: *built-in*
    - Using four cores at once: *hand-written*
- Summary of the *benchmarks*

# `sum`: An easy function to understand

Consider the sum function `sum(a)`, which computes
$$
\mathrm{sum}(a) = \sum_{i=1}^n a_i,
$$
where $n$ is the length of `a`.

In [1]:
a = rand(10^7); # 1D vector of random numbers, uniform on the interval [0,1)

In [2]:
sum(a)

5.000137565064207e6

The expected result is ~0.5 * 10^7, since the mean of each entry is 0.5.
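As a rough sanity check on that expectation, the sketch below estimates how far a sum of $10^7$ uniform draws would typically stray from $0.5 \times 10^7$. It assumes the entries are independent and uniform on [0,1), so each has variance 1/12. The variable names are illustrative, and `observed` is simply the value printed above.

```julia
n = 10^7
expected = 0.5 * n              # mean of one entry times the number of entries
sigma = sqrt(n / 12)            # std. dev. of the sum: each U[0,1) entry has variance 1/12
observed = 5.000137565064207e6  # value returned by sum(a) in the cell above
z = (observed - expected) / sigma

println("expected = ", expected, ", sigma ≈ ", round(sigma; digits = 1))
println("deviation in units of sigma ≈ ", round(z; digits = 2))
```

Under these assumptions the observed sum lies well within one standard deviation of the expected value, so the printed result is entirely plausible.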

# Benchmarking a few ways in a few languages

In [3]:
@time sum(a)

  0.004696 seconds (1 allocation: 16 bytes)


5.000137565064207e6

In [4]:
@time sum(a)

  0.004916 seconds (1 allocation: 16 bytes)


5.000137565064207e6

In [5]:
@time sum(a)

  0.005583 seconds (1 allocation: 16 bytes)


5.000137565064207e6

The `@time` macro can yield *noisy* results, so it is not our best choice for benchmarking.
Fortunately, Julia has `BenchmarkTools.jl`, a package that makes benchmarking easy and accurate:

In [6]:
using BenchmarkTools
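The basic idea of repeating a measurement and looking at the spread can be sketched with the standard library alone. The helper `sample_times_ms` below is hypothetical and only illustrates the principle; BenchmarkTools itself does considerably more than this.

```julia
# Hypothetical helper: call f(x) repeatedly and collect wall-clock times in milliseconds
function sample_times_ms(f, x; samples = 20)
    f(x)   # warm-up call, so that compilation is not part of the measurement
    return [1e3 * (@elapsed f(x)) for _ in 1:samples]
end

a = rand(10^6)
times = sample_times_ms(sum, a)
println("min = ", minimum(times), " ms, max = ", maximum(times), " ms")
```

The minimum is usually the most stable of these numbers, since noise from the rest of the system can only make a run slower. That is one reason the cells below record `minimum(...)` of the benchmark times.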

In [7]:
@benchmark sum($a)

BenchmarkTools.Trial: 1102 samples with 1 evaluation.
 Range (min … max):  4.496 ms …  5.766 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.522 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.526 ms ± 43.423 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

# 1. Julia *built-in*
So that's the performance of Julia's built-in sum, but it could be doing all sorts of tricks to be fast, perhaps not even using Julia at all! It is, in fact, written in Julia, but would it still work if we wrote a simple implementation ourselves?

In [8]:
@which sum(a)

Let's save these benchmark results in a dictionary so that we can start keeping track of them and compare them later.
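The cells that follow store `minimum(j_bench.times) / 1e6` in a plain `Dict()`. BenchmarkTools reports times in nanoseconds, so dividing by `1e6` gives milliseconds. A minimal sketch of the same bookkeeping with concrete key and value types is shown here; the name `d_typed` and the sample times are illustrative, and only the smallest time is taken from this notebook's output.

```julia
# Illustrative sample times in nanoseconds (only the smallest one comes from the notebook)
times_ns = [4.49192e6, 4.52e6, 4.61e6]

d_typed = Dict{String, Float64}()                     # concrete key and value types
d_typed["Julia built-in"] = minimum(times_ns) / 1e6   # nanoseconds to milliseconds
println(d_typed)
```

An untyped `Dict()` works just as well for a handful of entries. The typed version mainly documents what the dictionary is expected to hold.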

In [9]:
j_bench = @benchmark sum($a)

BenchmarkTools.Trial: 1102 samples with 1 evaluation.
 Range (min … max):  4.492 ms …  4.740 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.522 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.525 ms ± 21.930 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [10]:
d = Dict()
d["Julia built-in"] = minimum(j_bench.times) / 1e6
d

Dict{Any, Any} with 1 entry:
  "Julia built-in" => 4.49192

# 2. Julia *hand-written*

In [11]:
function mysum(A)
    s = 0.0
    for a in A
        s += a
    end
    return s
end

mysum (generic function with 1 method)
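`mysum` starts its accumulator at `0.0`, which suits the `Float64` vector used here but promotes the result to `Float64` whatever the element type. As a sketch, a hypothetical variant `mysum_generic` could start from the additive identity of the element type instead:

```julia
function mysum_generic(A)
    s = zero(eltype(A))   # additive identity of the element type, instead of 0.0
    for a in A
        s += a
    end
    return s
end

println(mysum_generic([1, 2, 3]))            # integer input, integer accumulator
println(mysum_generic(Float32[0.5, 0.25]))   # Float32 input, Float32 accumulator
```

For the benchmark in this notebook the two versions should behave the same, since `a` holds `Float64` values.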

In [12]:
j_bench_hand = @benchmark mysum($a)

BenchmarkTools.Trial: 590 samples with 1 evaluation.
 Range (min … max):  7.759 ms …   9.796 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.438 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.462 ms ± 191.470 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [13]:
d["Julia hand-written"] = minimum(j_bench_hand.times) / 1e6
d

Dict{Any, Any} with 2 entries:
  "Julia hand-written" => 7.75934
  "Julia built-in"     => 4.49192

That is roughly 1.7 times slower than the built-in definition (7.8 ms versus 4.5 ms). We'll see why later.

But first: is this fast? How would we know? Let's compare it with other languages...

# 3. The C language

C is often considered the gold standard: difficult on the human, nice for the machine. Getting within a factor of 2 of C is often satisfying. Nonetheless, even within C, there are many kinds of optimizations that a C programmer may or may not take advantage of.

The current author does not speak C, so he does not read the cell below, but is happy to know that you can put C code in a Julia session, compile it, and run it. Note that the `"""` wraps a multi-line string.

In [14]:
using Libdl
C_code = """
    #include <stddef.h>
    double c_sum(size_t n, double *X) {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i) {
            s += X[i];
        }
        return s;
    }
"""

const Clib = tempname()   # make a temporary file


# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed):

open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code)
end

# define a Julia function that calls the C function:
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)

c_sum (generic function with 1 method)

In [15]:
c_sum(a)

5.000137565064117e6

In [16]:
c_sum(a) ≈ sum(a) # type \approx and then <TAB> to get the ≈ symbol

true
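Note that `c_sum(a)` and `sum(a)` printed slightly different last digits, which is why the comparison uses `≈` rather than `==`. In Julia `≈` is `isapprox`, whose default relative tolerance for `Float64` is about 1.5e-8. The sketch below makes the same comparison between a strict left-to-right loop and the built-in `sum`; `naive_sum` is a hypothetical helper, not part of the notebook.

```julia
function naive_sum(A)      # hypothetical helper: strict left-to-right accumulation
    s = 0.0
    for x in A
        s += x
    end
    return s
end

a = rand(10^6)
s_loop = naive_sum(a)
s_builtin = sum(a)

println(s_loop == s_builtin)                       # exact equality may or may not hold
println(isapprox(s_loop, s_builtin))               # the check that ≈ performs
println(abs(s_loop - s_builtin) / abs(s_builtin))  # relative difference
```

The relative difference is typically many orders of magnitude below the tolerance, so the two results count as approximately equal even when the last digits disagree.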

We can now benchmark the C code directly from Julia:

In [17]:
c_bench = @benchmark c_sum($a)

BenchmarkTools.Trial: 598 samples with 1 evaluation.
 Range (min … max):  7.943 ms …  12.343 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.279 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.351 ms ± 349.738 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [18]:
d["C"] = minimum(c_bench.times) / 1e6  # in milliseconds
d

Dict{Any, Any} with 3 entries:
  "C"                  => 7.94345
  "Julia hand-written" => 7.75934
  "Julia built-in"     => 4.49192

# 4. Python's built-in `sum`

The PyCall package provides a Julia interface to Python:

In [20]:
] add PyCall

    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General.git`
   Resolving package versions...
    Updating `~/.julia/environments/v1.8/Project.toml`
  [438e738f] + PyCall v1.94.1
    Updating `~/.julia/environments/v1.8/Manifest.toml`
  [438e738f] + PyCall v1.94.1
Precompiling project...
  ✓ PyCall
  1 dependency successfully precompiled in 4 seconds. 274 already precompiled. 1 skipped during auto due to previous errors.


In [21]:
using PyCall

In [22]:
pysum = pybuiltin("sum")

PyObject <built-in function sum>

In [23]:
pysum(a)

5.000137565064117e6

In [24]:
pysum(a) ≈ sum(a)

true

In [25]:
py_list_bench = @benchmark $pysum($a)

BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  646.561 ms … 667.759 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     656.876 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   657.302 ms ±   8.373 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [26]:
d["Python built-in"] = minimum(py_list_bench.times) / 1e6
d

Dict{Any, Any} with 4 entries:
  "C"                  => 7.94345
  "Julia hand-written" => 7.75934
  "Julia built-in"     => 4.49192
  "Python built-in"    => 646.561

# 5. Python: numpy

numpy is an optimized C library, callable from Python. It may be installed within Julia as follows:

In [28]:
] add Conda

   Resolving package versions...
    Updating `~/.julia/environments/v1.8/Project.toml`
  [8f4d0f93] + Conda v1.7.0
  No Changes to `~/.julia/environments/v1.8/Manifest.toml`


In [29]:
using Conda

In [31]:
numpy_sum = pyimport("numpy")["sum"]

py_numpy_bench = @benchmark $numpy_sum($a)

BenchmarkTools.Trial: 1058 samples with 1 evaluation.
 Range (min … max):  4.482 ms …   7.111 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.587 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.711 ms ± 313.156 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [32]:
numpy_sum(a)

5.000137565064208e6

In [33]:
numpy_sum(a) ≈ sum(a)

true

In [35]:
d["Python numpy"] = minimum(py_numpy_bench.times) / 1e6
d

Dict{Any, Any} with 5 entries:
  "C"                  => 7.94345
  "Julia hand-written" => 7.75934
  "Python numpy"       => 4.48179
  "Julia built-in"     => 4.49192
  "Python built-in"    => 646.561

# 6. Python, hand-written

In [36]:
py"""
def py_sum(A):
    s = 0.0
    for a in A:
        s += a
    return s
"""

sum_py = py"py_sum"

PyObject <function py_sum at 0x7f48eb9186a8>

In [37]:
py_hand = @benchmark $sum_py($a)

BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  749.043 ms … 777.600 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     754.366 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   758.850 ms ±  10.350 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [38]:
sum_py(a)

5.000137565064117e6

In [39]:
sum_py(a) ≈ sum(a)

true

In [40]:
d["Python hand-written"] = minimum(py_hand.times) / 1e6
d

Dict{Any, Any} with 6 entries:
  "C"                   => 7.94345
  "Julia hand-written"  => 7.75934
  "Python numpy"        => 4.48179
  "Python hand-written" => 749.043
  "Julia built-in"      => 4.49192
  "Python built-in"     => 646.561

# Summary so far

In [49]:
for (key, value) in sort(collect(d), by=last)
    println(rpad(key, 25, "."), lpad(round(value; digits=1), 6, "."))
end

Python numpy................4.5
Julia built-in..............4.5
Julia hand-written..........7.8
C...........................7.9
Python built-in...........646.6
Python hand-written.......749.0


It looks like we have three distinct performance classes here: the numpy and Julia built-in functions lead the pack, followed by the hand-written Julia and C definitions, which are nearly 2 times slower. Then come the Python definitions, which are far slower still, by a factor of more than 100.
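To put numbers on those three classes, the following sketch divides each timing by the fastest one. The values are copied from the summary above; the names `timings_ms` and `ratios` are illustrative.

```julia
# Timings in milliseconds, copied from the summary above
timings_ms = Dict(
    "Python numpy"        => 4.48179,
    "Julia built-in"      => 4.49192,
    "Julia hand-written"  => 7.75934,
    "C"                   => 7.94345,
    "Python built-in"     => 646.561,
    "Python hand-written" => 749.043,
)

fastest = minimum(values(timings_ms))
ratios = Dict(k => v / fastest for (k, v) in timings_ms)   # slowdown relative to the fastest

for (key, ratio) in sort(collect(ratios), by = last)
    println(rpad(key, 25, "."), lpad(round(ratio; digits = 1), 6, "."), "x")
end
```

With these measurements the hand-written Julia and C loops come out roughly 1.7 to 1.8 times slower than the fastest entry, and the two Python loops roughly 140 to 170 times slower.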

# Exploiting parallelism with Julia

The fact that our hand-written Julia solution was close to 2x slower than the built-in solutions is a big clue: perhaps there's some sort of 2x parallelism going on here?

(In fairness, there are ways to exploit parallelism in other languages, too, but for brevity we won't cover them)

# 7. Julia (allowing floating-point associativity)

The `for` loop

```julia
for a in A
    s += a
end
```

defines a very strict order for the summation: Julia follows exactly what was written and adds the elements of `A` to the result `s` in iteration order. Since floating-point addition is not associative, a rearrangement here would change the answer, and Julia is loath to give you a different answer from the one you asked for.
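A three-term example is enough to see that the grouping of floating-point additions matters:

```julia
x = (0.1 + 0.2) + 0.3   # left-to-right, as the loop would do it
y = 0.1 + (0.2 + 0.3)   # same terms, different grouping

println(x == y)         # false: the two groupings round differently
println(x ≈ y)          # the results still agree to within rounding error
println(abs(x - y))     # size of the discrepancy
```

The discrepancy is on the order of the machine epsilon, but it is not zero, so a compiler that regrouped the additions on its own would be changing the result.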

However, you can tell Julia to relax that rule and allow associativity with the `@fastmath` macro. This may allow Julia to rearrange the sum in whatever way is convenient.

In [50]:
function mysum_fast(A)
    s = 0.0
    for a in A
        @fastmath s += a
    end
    s
end

mysum_fast (generic function with 1 method)

In [51]:
j_bench_hand_fast = @benchmark mysum_fast($a)

BenchmarkTools.Trial: 1073 samples with 1 evaluation.
 Range (min … max):  4.447 ms …   8.608 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.540 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.644 ms ± 342.232 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [52]:
mysum_fast(a)

5.000137565064223e6
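`@fastmath` relaxes a number of floating-point rules at once. A narrower alternative from Base is the `@simd` macro, which gives the compiler permission to reorder the reduction inside a single loop. The function `mysum_simd` below is a sketch and is not benchmarked in this notebook, so no timing is claimed for it.

```julia
function mysum_simd(A)
    s = zero(eltype(A))
    @simd for i in eachindex(A)   # allow the accumulation into s to be reordered
        @inbounds s += A[i]       # skip bounds checks; eachindex(A) is always in range
    end
    return s
end

a = rand(10^6)
println(mysum_simd(a) ≈ sum(a))
```

As with `mysum_fast`, the result may differ from the strictly ordered loop in the last few digits, which is why the check uses `≈`.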

In [53]:
d["Julia hand-written fast"] = minimum(j_bench_hand_fast.times) / 1e6
d

Dict{Any, Any} with 7 entries:
  "C"                       => 7.94345
  "Julia hand-written"      => 7.75934
  "Python numpy"            => 4.48179
  "Python hand-written"     => 749.043
  "Julia built-in"          => 4.49192
  "Python built-in"         => 646.561
  "Julia hand-written fast" => 4.44741

# 8. Distributed Julia (built-in)

We can take this one step further: nearly all modern computers have multiple cores. All of the solutions above work hard on one core while the others sit idle. Let's put them to work!
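Before adding workers it can help to check how many cores the machine actually offers. A minimal check using only the `Distributed` standard library might look like this:

```julia
using Distributed

println("logical CPU threads: ", Sys.CPU_THREADS)
println("worker processes:    ", nworkers())   # 1 in a fresh session, before addprocs is called
```

The cells below add four worker processes. On a machine with fewer than four cores the workers would have to share them, and any gain from distributing the sum would be expected to shrink accordingly.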

In [55]:
] add DistributedArrays

   Resolving package versions...
   Installed DistributedArrays ─ v0.6.6
   Installed IntegerMathUtils ── v0.1.0
   Installed Primes ──────────── v0.5.3
    Updating `~/.julia/environments/v1.8/Project.toml`
  [aaf54ef3] + DistributedArrays v0.6.6
    Updating `~/.julia/environments/v1.8/Manifest.toml`
  [aaf54ef3] + DistributedArrays v0.6.6
  [18e54dd8] + IntegerMathUtils v0.1.0
  [27ebfcd6] + Primes v0.5.3
Precompiling project...
  ✓ IntegerMathUtils
  ✓ Primes
  ✓ DistributedArrays
  3 dependencies successfully precompiled in 1 seconds. 275 already precompiled. 1 skipped during auto due to previous errors.


In [58]:
using Distributed
using DistributedArrays
addprocs(4)
#@sync @everywhere workers() include("/opt/julia-1.0/etc/julia/startup.jl") # Only needed for JuliaBox
@everywhere using DistributedArrays

In [59]:
adist = distribute(a)
j_bench_dist = @benchmark sum($adist)

BenchmarkTools.Trial: 1173 samples with 1 evaluation.
 Range (min … max):  4.069 ms …   7.635 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.207 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.249 ms ± 237.093 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [60]:
d["Julia 4x built-in"] = minimum(j_bench_dist.times) / 1e6
d

Dict{Any, Any} with 8 entries:
  "C"                       => 7.94345
  "Julia hand-written"      => 7.75934
  "Python numpy"            => 4.48179
  "Python hand-written"     => 749.043
  "Julia built-in"          => 4.49192
  "Python built-in"         => 646.561
  "Julia 4x built-in"       => 4.06858
  "Julia hand-written fast" => 4.44741

# 9. Distributed Julia (hand-written)
OK, that might also be cheating, since once again it just calls a library function. Is it possible to write a distributed sum ourselves?

In [61]:
function mysum_dist(a::DArray)
    r = Array{Future}(undef, length(procs(a)))
    for (i, id) in enumerate(procs(a))
        r[i] = @spawnat id sum(localpart(a))
    end
    return sum(fetch.(r))
end

mysum_dist (generic function with 1 method)

In [62]:
j_bench_hand_dist = @benchmark mysum_dist($adist)

BenchmarkTools.Trial: 898 samples with 1 evaluation.
 Range (min … max):  4.907 ms …   8.086 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.514 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.553 ms ± 308.847 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [63]:
d["Julia 4x hand-written"] = minimum(j_bench_hand_dist.times) / 1e6
d

Dict{Any, Any} with 9 entries:
  "C"                       => 7.94345
  "Julia hand-written"      => 7.75934
  "Python numpy"            => 4.48179
  "Python hand-written"     => 749.043
  "Julia built-in"          => 4.49192
  "Python built-in"         => 646.561
  "Julia 4x built-in"       => 4.06858
  "Julia 4x hand-written"   => 4.90732
  "Julia hand-written fast" => 4.44741

# Overall summary

In [64]:
for (key, value) in sort(collect(d), by=last)
    println(rpad(key, 25, "."), lpad(round(value; digits=1), 6, "."))
end

Julia 4x built-in...........4.1
Julia hand-written fast.....4.4
Python numpy................4.5
Julia built-in..............4.5
Julia 4x hand-written.......4.9
Julia hand-written..........7.8
C...........................7.9
Python built-in...........646.6
Python hand-written.......749.0


Key takeaways:

- Julia allows for serial C-like performance, even with hand-written functions
- Julia allows us to exploit many forms of parallelism to further improve performance. We demonstrated:
    - Single-processor parallelism with SIMD
    - Multi-process parallelism with DistributedArrays
- But there are also many other ways to express parallelism!
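As one example of those other ways, Julia also supports shared-memory multithreading within a single process. The sketch below splits the array into contiguous chunks and sums each chunk in its own task. The function `mysum_threads` and its `ntasks` keyword are hypothetical, and it was not benchmarked in this notebook, so no timing is claimed for it.

```julia
# Hypothetical helper: sum A using one task per chunk, scheduled on Julia's threads
function mysum_threads(A; ntasks = Threads.nthreads())
    isempty(A) && return zero(eltype(A))
    chunk_len = cld(length(A), ntasks)                     # ceil(length / ntasks)
    chunks = Iterators.partition(eachindex(A), chunk_len)  # contiguous index ranges
    tasks = [Threads.@spawn sum(view(A, chunk)) for chunk in chunks]
    return sum(fetch.(tasks))                              # combine the partial sums
end

a = rand(10^6)
println(mysum_threads(a) ≈ sum(a))
println("threads available: ", Threads.nthreads())
```

The number of threads is fixed when Julia starts, for example with `julia --threads 4` or through the `JULIA_NUM_THREADS` environment variable. With a single thread the function still returns the correct sum, just without any parallel speedup.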