# Parallel For Loops

## Setting

A computational problem often takes the form of iterating over the rows and columns of a large multidimensional matrix. Some dimensions of this matrix are independent of one another. One may use this feature to parallelize the calculations and obtain a speed-up compared to the serial alternative.

## How?

This notebook illustrates how one may parallelize a for loop using:
* native parallel Julia
* `SharedArrays`
* `DistributedArrays`

## A Toy Problem

The program takes a 3-dimensional matrix `A[:,:,:]` and sets every element to its index along the third dimension. For instance `A[:,:,1] = ones(n1,n2)`, `A[:,:,2] = ones(n1,n2)*2`, `A[:,:,3] = ones(n1,n2)*3` and so on, where `n1` and `n2` are the number of rows and columns respectively.

This is a toy example, because three lines of code would suffice:
```julia
for k=1:size(A,3)
    A[:,:,k] = k
end
```

Yet, here we are going to act as if the first and the second dimensions of $A$ are dependent (calculations are to be serially done, respecting the order):
```julia
# This loop can be parallelized
for k=1:size(A,3)
    # We cannot parallelize this level
    for j=1:size(A,2)
        for i=1:size(A,1)
            A[i,j,k] = k
        end
    end
end
```
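To make the equivalence concrete, here is a minimal sketch with hypothetical small sizes `n1`, `n2`, `n3`, chosen only for illustration. It fills one array slice by slice and another element by element, then checks that both agree. It uses the broadcast assignment `.=`, which recent Julia versions require when assigning a scalar to a slice.

```julia
# Hypothetical small sizes, for illustration only
n1, n2, n3 = 3, 4, 2

A_slices = zeros(n1, n2, n3)   # filled one slice at a time
A_nested = zeros(n1, n2, n3)   # filled one element at a time

# Slice version: each slice along the third dimension is set in one statement
for k in 1:n3
    A_slices[:, :, k] .= k
end

# Nested version: the two inner loops run serially, in order, for each k
for k in 1:n3
    for j in 1:n2
        for i in 1:n1
            A_nested[i, j, k] = k
        end
    end
end

println(A_slices == A_nested)   # prints true
```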



## Preliminaries

Let's add workers to the current session, define some constants and load the required packages.

In [3]:
addprocs(3)

3-element Array{Int64,1}:
 2
 3
 4

In [4]:
nworkers()

3

In [5]:
@everywhere using DistributedArrays

In [6]:
@everywhere using Distributions

In [7]:
using Base.Test

In [8]:
@everywhere const dim1 = 100
@everywhere const dim2 = 100
@everywhere const dim3 = nworkers()*1
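The cells above appear to target a pre-1.0 Julia release: they use `Base.Test` and call `addprocs` without any import. As a hedged sketch, assuming Julia 1.x, the same setup would need the `Distributed` and `Test` standard libraries to be loaded explicitly. The guard on `nprocs()` and the local variable `n` are additions for illustration and are not part of the original notebook.

```julia
using Distributed   # provides addprocs, nworkers, @everywhere
using Test          # standard-library replacement for Base.Test

# Add three workers, but only if the session has none yet
nprocs() == 1 && addprocs(3)

# Define the same constants on the master and on every worker
n = nworkers()
@everywhere const dim1 = 100
@everywhere const dim2 = 100
@everywhere const dim3 = $n   # the master's value of n is interpolated

println((dim1, dim2, dim3))
```

The `@everywhere using DistributedArrays` line would stay unchanged, provided the package is installed.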

# I. The serial solution

## The obvious candidate

The following solution does what it should, but does not respect the constraint that calculations at the level of the first and second dimensions should be done serially:

In [38]:
t = zeros(dim1,dim2,dim3)

@time for k=1:size(t,3)
    println(k)
    t[:,:,k] = k
end

1
2
3
  0.000575 seconds (183 allocations: 3.703 KiB)


## Taking the serial constraint into account

The following solution respects the constraint:

In [39]:
@time for k=1:size(t,3)
    # We cannot parallelize this level
    #----------------------------------
    for j=1:size(t,2)
        for i=1:size(t,1)
            t[i,j,k] = k
        end
    end
end

  0.026812 seconds (120.61 k allocations: 3.223 MiB)


Let's spice up the problem by making each iteration slower.
Let's use `sleep(0.0)`, which (surprisingly) takes more than 0 seconds to run.

In [155]:
@everywhere function give_my_id_serial!(x::Array{Float64,3}, id::Int64)
    for indexDim2 = 1:dim2
        for indexDim1 = 1:dim1
            x[indexDim1, indexDim2, id] = id
            # sleep(0.0) still takes some time:
            # 0.000236 seconds (37 allocations: 800 bytes)
            sleep(0.0)
        end
    end
end

In [42]:
function wrapper(x::Array{Float64,3})
    give_my_id_serial!(x, 1)
    give_my_id_serial!(x, 2)
    give_my_id_serial!(x, 3)
end

wrapper (generic function with 1 method)

In [43]:
@time wrapper(t)

 35.441115 seconds (153.07 k allocations: 5.624 MiB, 0.09% gc time)


As illustrated above, the task we want to accomplish takes approximately 35 seconds.
Yet the three calls in `wrapper` are independent of one another, so 3 workers could run them in parallel and divide the execution time by approximately 3.
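The arithmetic behind this expectation can be sketched as follows. The variable names are hypothetical; the numbers are the timing reported above and the array dimensions defined in the preliminaries.

```julia
total_time = 35.441115          # seconds, from the `@time wrapper(t)` cell above
n_calls    = 100 * 100 * 3      # dim1 * dim2 * dim3 assignments, each followed by sleep(0.0)

per_call   = total_time / n_calls   # average cost of one iteration, about 1.2 ms
ideal_time = total_time / 3         # best case with 3 workers, about 11.8 s

println("per iteration: ", per_call, " s")
println("ideal parallel time: ", ideal_time, " s")
```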

## II. Using native Julia

In [None]:
[WORKINPROGRESS]
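One possible shape for a native solution is sketched below, under the assumption of Julia 1.x with the `Distributed` standard library. It is only an illustration, not the notebook's own solution: the helper `fill_slice` and the small sizes are hypothetical, and `pmap` is just one of several native options. Each worker builds one slice serially, and the master stacks the returned slices.

```julia
using Distributed

# Add three workers, but only if the session has none yet
nprocs() == 1 && addprocs(3)

# Hypothetical helper: builds one slice serially, respecting the loop order
@everywhere function fill_slice(n1::Int, n2::Int, k::Int)
    slice = zeros(n1, n2)
    for j in 1:n2
        for i in 1:n1
            slice[i, j] = k
        end
    end
    return slice
end

# Hypothetical small sizes, for illustration only
n1, n2, n3 = 4, 5, 3

# One task per value of k; pmap hands the tasks to the available workers
slices = pmap(fill_slice, fill(n1, n3), fill(n2, n3), 1:n3)

# Stack the returned slices along the third dimension
A = cat(slices...; dims = 3)

println(size(A))
```

Unlike the in-place functions used elsewhere in this notebook, this sketch sends each slice back to the master, which costs a copy per slice.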

## III. Using DistributedArrays

The package `DistributedArrays` was created to distribute "chunks" of a multidimensional array (matrix) to workers.
Each worker owns only a part of the multidimensional array. In the code below, I create a 3-dimensional matrix `a`. Using the function `dzeros`, I specify that I want it to be spread among all the available workers. In the last part of the command, I tell `DistributedArrays` that I want the matrix to be split among workers along the third dimension only.

In [119]:
@sync begin 
    #a = dzeros(dim1,dim2,dim3);
    #Split along the 3rd dimension
    #-----------------------------
    a = dzeros((dim1,dim2,dim3), workers(), [1,1,nworkers()]);
end

100×100×3 DistributedArrays.DArray{Float64,3,Array{Float64,3}}:
[:, :, 1] =
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.

Using the function `procs`, one can see that:
* worker 2 owns the chunk `a[:, :, 1]`
* worker 3 owns the chunk `a[:, :, 2]`
* worker 4 owns the chunk `a[:, :, 3]`

In [60]:
procs(a)

1×1×3 Array{Int64,3}:
[:, :, 1] =
 2

[:, :, 2] =
 3

[:, :, 3] =
 4

The following function is built so that each worker, owning only a part of `a`, assigns a specific value equal to `id` to its chunk, respecting the serial constraint:

In [142]:
@everywhere function give_my_id_parallel!(x::Array{Float64,3}, id::Int64)
    for indexDim2 = 1:dim2
        for indexDim1 = 1:dim1
            x[indexDim1, indexDim2, 1] = id
            sleep(0.0)
        end
    end
end

The function `give_my_id_parallel!` can be sent to workers using `asyncmap` and the usual `fetch @spawnat` commands:

In [152]:
@time asyncmap(fetch, (@spawnat p give_my_id_parallel!(localpart(a), p-1)) for p in procs(a))

1×1×3 Array{Void,3}:
[:, :, 1] =
 nothing

[:, :, 2] =
 nothing

[:, :, 3] =
 nothing

 12.112261 seconds (78.43 k allocations: 4.217 MiB)
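The call above uses `p-1` as the slice index, which works here because the worker ids happen to be 2, 3 and 4. As a hedged sketch, assuming Julia 1.x with the `Distributed` standard library, the pairing between workers and slice indices can instead be taken from the position of each worker in `workers()`. The remote expression `(myid(), k)` is a stand-in for the real work, and the name `pairs_seen` is hypothetical.

```julia
using Distributed

# Add three workers, but only if the session has none yet
nprocs() == 1 && addprocs(3)

# Pair each worker id w with its position k in the list of workers
futures = [(@spawnat w (myid(), k)) for (k, w) in enumerate(workers())]

# Wait for all remote calls concurrently and collect their results
pairs_seen = asyncmap(fetch, futures)

println(pairs_seen)
```

With a `DArray` split along the third dimension only, the same idea would presumably apply to `procs(a)` in place of `workers()`.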


As expected, execution time is divided by approximately 3.
By inspecting `a`, one sees that it has the correct dimension and values.

In [101]:
@time [@spawnat p println(size(localpart(a))) for p in procs(a)]

1×1×3 Array{Future,3}:
[:, :, 1] =
 Future(2, 1, 28881, #NULL)

[:, :, 2] =
 Future(3, 1, 28882, #NULL)

[:, :, 3] =
 Future(4, 1, 28883, #NULL)

  0.078799 seconds (29.75 k allocations: 1.564 MiB)
	From worker 4:	(100, 100, 1)
	From worker 3:	(100, 100, 1)
	From worker 2:	(100, 100, 1)


In [105]:
@time [@spawnat p println(typeof(localpart(a))) for p in procs(a)]

1×1×3 Array{Future,3}:
[:, :, 1] =
 Future(2, 1, 28893, #NULL)

[:, :, 2] =
 Future(3, 1, 28894, #NULL)

[:, :, 3] =
 Future(4, 1, 28895, #NULL)

  0.115833 seconds (24.70 k allocations: 1.287 MiB)
	From worker 2:	Array{Float64,3}
	From worker 4:	Array{Float64,3}
	From worker 3:	Array{Float64,3}


In [106]:
map(fetch, (@spawnat p size(localpart(a))) for p=procs(a) )

1×1×3 Array{Tuple{Int64,Int64,Int64},3}:
[:, :, 1] =
 (100, 100, 1)

[:, :, 2] =
 (100, 100, 1)

[:, :, 3] =
 (100, 100, 1)

In [100]:
a[:,:,1]

100×100 SubArray{Float64,2,DistributedArrays.DArray{Float64,3,Array{Float64,3}},Tuple{Base.Slice{Base.OneTo{Int64}},Base.Slice{Base.OneTo{Int64}},Int64},false}:
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  

In [107]:
a[:,:,2]

100×100 SubArray{Float64,2,DistributedArrays.DArray{Float64,3,Array{Float64,3}},Tuple{Base.Slice{Base.OneTo{Int64}},Base.Slice{Base.OneTo{Int64}},Int64},false}:
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …  2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …  2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …  2.0  2.0  2.0  

In [108]:
a[:,:,3]

100×100 SubArray{Float64,2,DistributedArrays.DArray{Float64,3,Array{Float64,3}},Tuple{Base.Slice{Base.OneTo{Int64}},Base.Slice{Base.OneTo{Int64}},Int64},false}:
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  …  3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  …  3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  …  3.0  3.0  3.0  

## IV. Using SharedArrays

A `SharedArray` can be seen as a distributed array whose chunks are all equal to the full array. That is, each worker has access to the entire array; the data live in shared memory, so all workers must run on the same host.
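As a minimal sketch of this behaviour, assuming Julia 1.x (where `SharedArrays` is a standard library), the following example lets each worker write one slice of a small shared array and then reads the result on the master. The names `b_small` and `set_slice!` are hypothetical and used only for illustration.

```julia
using Distributed

# Add three workers, but only if the session has none yet
nprocs() == 1 && addprocs(3)

@everywhere using SharedArrays

# Hypothetical helper: writes the value k into slice k, in place
@everywhere function set_slice!(x, k)
    x[:, :, k] .= k
    return nothing
end

# Hypothetical small array: one slice per worker along the third dimension
b_small = SharedArray{Float64}((4, 5, nworkers()))
fill!(b_small, 0.0)

# Each worker writes its own slice into the shared memory segment
futures = [remotecall(set_slice!, w, b_small, k) for (k, w) in enumerate(workers())]
asyncmap(fetch, futures)

# The master sees the writes without fetching any array data back
println(b_small[1, 1, :])
```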

In [109]:
@time b = SharedArray(zeros(dim1,dim2,dim3));

  2.195976 seconds (741.15 k allocations: 39.987 MiB, 0.58% gc time)


In [156]:
@everywhere function give_my_id!(x::SharedArray{Float64,3}, indexDim3)
    for indexDim2 = 1:dim2
        for indexDim1 = 1:dim1
            x[indexDim1, indexDim2, indexDim3] = indexDim3
            sleep(0.0)
        end
    end
end

In [129]:
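# Pair each worker with its slice index, so that the work runs on the workers rather than on process 1
@time asyncmap(fetch, (@spawnat w give_my_id!(b, k)) for (k, w) in enumerate(workers()))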

3-element Array{Void,1}:
 nothing
 nothing
 nothing

 12.208940 seconds (129.51 k allocations: 6.098 MiB)


Once again, execution time is divided by approximately 3.
Inspecting the array `b`, we see that we have the correct result:

In [125]:
b[:,:,1]

100×100 Array{Float64,2}:
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.

 12.203567 seconds (133.59 k allocations: 6.298 MiB)


In [126]:
b[:,:,2]

100×100 Array{Float64,2}:
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …  2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …  2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …  2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.

In [127]:
b[:,:,3]

100×100 Array{Float64,2}:
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  …  3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  …  3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  …  3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0     3.0  3.0  3.0  3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.

## V. Summary


### Serial (2nd run)
* 34.513930 seconds (150.72 k allocations: 5.504 MiB)

### DistributedArrays (2nd run):
* 12.112261 seconds (78.43 k allocations: 4.217 MiB)

### SharedArrays (2nd run):
* 12.208940 seconds (129.51 k allocations: 6.098 MiB)


The solution using DistributedArrays slightly outperforms the one relying on SharedArrays. If initialization were to be taken into account, it is likely that the better performance of DistributedArrays would be more obvious, as only a part of the initial matrix is sent to each worker, while each worker is given the entire matrix when SharedArrays are used.
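The speed-ups implied by these timings can be worked out directly. The sketch below uses hypothetical variable names and the three timings listed in this summary.

```julia
t_serial      = 34.513930   # seconds, serial (2nd run)
t_distributed = 12.112261   # seconds, DistributedArrays (2nd run)
t_shared      = 12.208940   # seconds, SharedArrays (2nd run)
n_workers     = 3

speedup_distributed = t_serial / t_distributed   # roughly 2.85
speedup_shared      = t_serial / t_shared        # roughly 2.83

# Parallel efficiency: speed-up divided by the number of workers
efficiency_distributed = speedup_distributed / n_workers
efficiency_shared      = speedup_shared / n_workers

println((speedup_distributed, speedup_shared))
println((efficiency_distributed, efficiency_shared))
```

Both approaches reach about 94 to 95 percent of the ideal three-fold speed-up on this toy problem.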

## VI. Safety checks

Let's make sure that all the results are equal.
For the tests to make sense, we need to convert DistributedArrays and SharedArrays to native Julia `Array`s.

In [154]:
# Conversion
aArray = convert(Array{Float64,3}, a)
bArray = convert(Array{Float64,3}, b)

# Tests
println(@test t==aArray)
println(@test t==bArray)

[1m[32mTest Passed[39m[22m
[1m[32mTest Passed[39m[22m
