# Parallel For Loops

## Setting

A computational problem ends up taking the form of iterating over dimensions a large multidimensional array. Some dimensions of this array are independent from each others. One may use this feature to parallelize calculations to obtain a speed-up compared to the serial outcome.

## How ?

This notebook illustrates how one may parallelize a for loop using :
* native parallel julia 
* `SharedArrays`
* `DistibutedArrays`

## A Toy Problem

The program takes a 3-dimensional matrix `A[:,:,:]` and sets its rows and columns to be equal to the third dimension. For instance `A[:,:,1] = ones(n1,n2)`, `A[:,:,2] = ones(n1,n2)*2`, `A[:,:,3] = ones(n1,n2)*3`, where $n1$ and $n2$ are the number of rows and columns respectively.

This is a toy example, because three lines of code would suffice
```julia
for k=1:size(A,3)
    A[:,:,k] = k
end
```

Yet, here we are going to act as if the first and the second dimensions of $A$ are dependent (calculations are to be serially done, respecting the order):
```julia
# This loop can be parallelized
for k=1:size(A,3)
    # We cannot parallelize this level
    for j=1:size(A,2)
        for i=1:size(A,1)
         A[i,j,k] = k
         end
      end
    end
end
```



## Preliminaries

Let's add workers to the current session, define some constants and load the required packages.

In [1]:
addprocs(3)

3-element Array{Int64,1}:
 2
 3
 4

In [2]:
nworkers()

3

In [3]:
@everywhere using DistributedArrays

In [4]:
@everywhere using Distributions

In [6]:
@everywhere const dim1 = 100
@everywhere const dim2 = 100
@everywhere const dim3 = nworkers()*1

# I. The serial solution

## The obvious candidate

The following solution does what it should, but does not respect the constraint that calculations at the level of the first and second dimensions should be done serially:

In [66]:
t = zeros(dim1,dim2,dim3)

@time for k=1:size(t,3)
    println(k)
    t[:,:,k] = k
end

1
2
3
  0.002657 seconds (183 allocations: 3.703 KiB)


## Taking into account the serial constraints

The following solution respects the constraint:

In [67]:
@time for k=1:size(t,3)
    # We cannot parallelize this level
    #----------------------------------
    for j=1:size(t,2)
        for i=1:size(t,1)
         t[i,j,k] = k
         end
    end
end

  0.024601 seconds (120.61 k allocations: 3.223 MiB)


Let's spice up the problem by making each execution slower.
Let's use `sleep(0.0)`, which (suprisingly) takes more than 0 second to run.

In [68]:
@everywhere function give_my_id_serial!(x::Array{Float64,3}, id::Int64)
    for indexDim2 = 1:dim2
        for indexDim1 = 1:dim1
            x[indexDim1, indexDim2, id] = id
            #sleep (0.0) still take some time:
            #0.000236 seconds (37 allocations: 800 bytes)
            sleep(0.0)
        end
    end
end

In [69]:
function wrapper(x::Array{Float64,3})
    give_my_id_serial!(x, 1)
    give_my_id_serial!(x, 2)
    give_my_id_serial!(x, 3)
end

wrapper (generic function with 1 method)

In [70]:
@time wrapper(t)

 36.178154 seconds (153.08 k allocations: 5.624 MiB)


As illustrated above, the task we want to accomplish takes approximately 35 seconds.
Yet, as the code makes it obvious, 3 workers could be used in parallel to divide the execution time by
approximately 3.

## II. Using native Julia

Let's split the data evenly among the 3 workers. In do that by initializing a 3-dimensional array on each worker.  

In [12]:
@time asyncmap(fetch, (@spawnat p eval(:(myLocalArray = zeros(dim1, dim2, 1)))) for p in workers())

 35.277891 seconds (153.14 k allocations: 6.626 MiB)


3-element Array{Array{Float64,3},1}:
 [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
 [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
 [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]

The following function is built so that each worker assigns a specific value equal to `id`, respecting the serial constraint

In [13]:
@everywhere function give_my_id_parallel!(x::Array{Float64,3}, id::Int64)
    for indexDim2 = 1:dim2
        for indexDim1 = 1:dim1
            x[indexDim1, indexDim2, 1] = id
            sleep(0.0)
        end
    end
end

  2.379786 seconds (466.32 k allocations: 25.863 MiB, 0.76% gc time)


In [14]:
@time asyncmap(fetch, (@spawnat p give_my_id_parallel!(myLocalArray, p-1)) for p in workers())

3-element Array{Void,1}:
 nothing
 nothing
 nothing

 12.011893 seconds (96.40 k allocations: 5.165 MiB)


Now, let's recombine the data. For this part, I get inspiration from this excellent post:
https://stackoverflow.com/questions/39058884/julia-use-of-pmap-with-arrays-vs-sharedarrays

In [15]:
#inspired by: https://stackoverflow.com/questions/39058884/julia-use-of-pmap-with-arrays-vs-sharedarrays
getfrom(p::Int, nm::Symbol; mod=Main) = fetch(@spawnat(p, getfield(mod, nm)))

function recombine_data(Data::Symbol)
    Results = zeros(dim1,dim2,dim3)
    for (idx, pid) in enumerate(workers())
        Results[:,:,idx] = getfrom(pid, Data)
    end
    return Results
end

recombine_data (generic function with 1 method)

In [16]:
@time k = recombine_data(:myLocalArray)

100×100×3 Array{Float64,3}:
[:, :, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0

  0.161408 seconds (53.63 k allocations: 3.667 MiB)


## III. Using DistributedArrays

The package `DistributedArrays` was created to distribute "chunks" of a mutlidimensional array (matrix) to workers.
Each worker owns only a part of the multidimensional array. In the code below, I create a 3-dimensional matrix `a`. Using the function `dzeros`, I specify that I want it to be spread among all the available workers. In the last part of the command, I tell ̀`DistributedArrays` that I want the matrix to be spread among workers along the third dimension only.

In [17]:
@sync begin 
    #a = dzeros(dim1,dim2,dim3);
    #Split along the 3rd dimension
    #-----------------------------
    a = dzeros((dim1,dim2,dim3), workers(), [1,1,nworkers()]);
end

100×100×3 DistributedArrays.DArray{Float64,3,Array{Float64,3}}:
[:, :, 1] =
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.

Using the function `procs`, one can see that:
* workers 2 owns the chunk `t[:, :, 1]` of `a`
* workers 3 owns the chunk `t[:, :, 2]` of `a`.
* workers 4 owns the chunk `t[:, :, 3]` o `a`.

In [18]:
procs(a)

1×1×3 Array{Int64,3}:
[:, :, 1] =
 2

[:, :, 2] =
 3

[:, :, 3] =
 4

The function `give_my_id_parallel!` can be sent to workers using `asyncmap` and the usual `fetch @spawnat` commands:

In [19]:
@time asyncmap(fetch, (@spawnat p give_my_id_parallel!(localpart(a), p-1)) for p in procs(a))

1×1×3 Array{Void,3}:
[:, :, 1] =
 nothing

[:, :, 2] =
 nothing

[:, :, 3] =
 nothing

 11.922004 seconds (211.75 k allocations: 11.768 MiB, 0.08% gc time)


As expected, execution time is divided by approximately 3.
By inspecting `a`, one sees that it has the correct dimension and values.

In [20]:
@time [@spawnat p println(size(localpart(a))) for p in procs(a)]

1×1×3 Array{Future,3}:
[:, :, 1] =
 Future(2, 1, 7259, #NULL)

[:, :, 2] =
 Future(3, 1, 7260, #NULL)

[:, :, 3] =
 Future(4, 1, 7261, #NULL)

In [21]:
@time [@spawnat p println(typeof(localpart(a))) for p in procs(a)]

  0.060911 seconds (30.02 k allocations: 1.587 MiB)


1×1×3 Array{Future,3}:
[:, :, 1] =
 Future(2, 1, 7262, #NULL)

[:, :, 2] =
 Future(3, 1, 7263, #NULL)

[:, :, 3] =
 Future(4, 1, 7264, #NULL)

	From worker 2:	(100, 100, 1)
	From worker 4:	(100, 100, 1)
	From worker 3:	(100, 100, 1)
  0.046706 seconds (25.30 k allocations: 1.317 MiB)


In [22]:
map(fetch, (@spawnat p size(localpart(a))) for p=procs(a) )

	From worker 3:	Array{Float64,3}
	From worker 4:	Array{Float64,3}
	From worker 2:	Array{Float64,3}


1×1×3 Array{Tuple{Int64,Int64,Int64},3}:
[:, :, 1] =
 (100, 100, 1)

[:, :, 2] =
 (100, 100, 1)

[:, :, 3] =
 (100, 100, 1)

In [23]:
a[:,:,1]

100×100 SubArray{Float64,2,DistributedArrays.DArray{Float64,3,Array{Float64,3}},Tuple{Base.Slice{Base.OneTo{Int64}},Base.Slice{Base.OneTo{Int64}},Int64},false}:
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  

In [24]:
a[:,:,2]

100×100 SubArray{Float64,2,DistributedArrays.DArray{Float64,3,Array{Float64,3}},Tuple{Base.Slice{Base.OneTo{Int64}},Base.Slice{Base.OneTo{Int64}},Int64},false}:
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …  2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …  2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …  2.0  2.0  2.0  

In [25]:
a[:,:,3]

100×100 SubArray{Float64,2,DistributedArrays.DArray{Float64,3,Array{Float64,3}},Tuple{Base.Slice{Base.OneTo{Int64}},Base.Slice{Base.OneTo{Int64}},Int64},false}:
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  …  3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  …  3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  …  3.0  3.0  3.0  

## IV. Using SharedArrays

`SharedArrays` are `DistributedArrays` for which the chunks are equal to the full array. That is, each worker receives the entire array. 

In [26]:
@time b = SharedArray(zeros(dim1,dim2,dim3));

In [27]:
@everywhere function give_my_id!(x::SharedArray{Float64,3}, indexDim3)
    for indexDim2 = 1:dim2
        for indexDim1 = 1:dim1
            x[indexDim1, indexDim2, indexDim3] = indexDim3
            sleep(0.0)
        end
    end
end

In [28]:
@time asyncmap(fetch, (@spawnat p give_my_id!(b, p)) for p=1:nworkers())

  2.232642 seconds (900.36 k allocations: 48.754 MiB, 4.52% gc time)


3-element Array{Void,1}:
 nothing
 nothing
 nothing

Once again, execution time is approximately divided by  3.
Inspecting the array `b`, we see that we have the correct result

In [29]:
b[:,:,1]

100×100 Array{Float64,2}:
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.

 12.323589 seconds (240.83 k allocations: 11.962 MiB)


In [30]:
b[:,:,2]

100×100 Array{Float64,2}:
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …  2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …  2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …  2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.

In [31]:
b[:,:,3]

100×100 Array{Float64,2}:
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  …  3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  …  3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  …  3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.

## V. Timing initialization and execution

In [48]:
function initialize_solve_serial()
    
    # initialization
    k = zeros(dim1,dim2,dim3)
    
    # calculation
    wrapper(k)
    
    return k
end


function initialize_solve_native()
    
    # initialization
    asyncmap(fetch, (@spawnat p eval(:(myLocalArray = zeros(dim1, dim2, 1)))) for p in workers())

    # calculation
    asyncmap(fetch, (@spawnat p give_my_id_parallel!(myLocalArray, p-1)) for p in workers())
    
    # recombine results
    k = recombine_data(:myLocalArray)
end

function initialize_solve_shared()
    
    # initialization
    k = SharedArray(zeros(dim1,dim2,dim3));
    
    # calculation
    asyncmap(fetch, (@spawnat p give_my_id!(k, p)) for p=1:nworkers())
    
    return k
end

function initialize_solve_distributed()
    
    @sync begin 
        k = dzeros((dim1,dim2,dim3), workers(), [1,1,nworkers()]);
    end
    
    asyncmap(fetch, (@spawnat p give_my_id_parallel!(localpart(a), p-1)) for p in procs(a))
    
    return k
    
end

initialize_solve_distributed (generic function with 1 method)

To make the benchmark more accurate, let's use `BenchmarkTools` rather than the usual `@time` macro

In [55]:
using BenchmarkTools

[1m[36mINFO: [39m[22m[36mRecompiling stale cache file /home/julien/.julia/lib/v0.6/BenchmarkTools.ji for module BenchmarkTools.
[39m

In [60]:
@btime initialize_solve_serial()

100×100×3 Array{Float64,3}:
[:, :, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0

  35.459 s (150712 allocations: 5.73 MiB)


In [61]:
@btime initialize_solve_native()

100×100×3 Array{Float64,3}:
[:, :, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0

  11.685 s (1852 allocations: 813.55 KiB)


In [62]:
@btime initialize_solve_distributed()

  11.688 s (1218 allocations: 88.33 KiB)


100×100×3 DistributedArrays.DArray{Float64,3,Array{Float64,3}}:
[:, :, 1] =
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.

In [63]:
@btime initialize_solve_shared()

100×100×3 SharedArray{Float64,3}:
[:, :, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.

  11.947 s (51226 allocations: 2.13 MiB)


## VI. Benchmark Results


### Serial
* 35.459 s (150712 allocations: 5.73 MiB)

### Native Julia
* 11.685 s (1852 allocations: 813.55 KiB)

### DistributedArrays
* 11.688 s (1218 allocations: 88.33 KiB)

### SharedArrays
* 11.947 s (51226 allocations: 2.13 MiB)


This benchmark suggests that `DistributedArrays` is as good as native Julia, if not better.
This benchmark also illustrates that there is a small penalty associated to using `SharedArrays`, associated with the cost of copying the entire data to each worker, but this penalty is likely to be marginal when each calculation takes more time. 

## VI. Safety checks

Let's make sure that all the results are equal.
For the tests to make sense, we need to convert DistributedArrays and SharedArrays to native Julia Arrays.

In [64]:
using Base.Test

In [65]:
# Conversion
aArray = convert(Array{Float64,3}, a)
bArray = convert(Array{Float64,3}, b)

# Tets
println(@test t==aArray)
println(@test t==bArray)

[1m[32mTest Passed[39m[22m
[1m[32mTest Passed[39m[22m
