# Table of Contents
- 1 Parallel Computing in Julia
  - 1.1 Start Julia with multiple workers/processes
  - 1.2 `addprocs()`, `rmprocs()`, and `@everywhere`
  - 1.3 `remotecall()`, `@spawn`
  - 1.4 Running a function everywhere
  - 1.5 `@parallel` and `pmap()`
  - 1.6 Benchmark: find pi
  - 1.7 Using `pmap()` to run a serial program on multiple processors with different arguments to the function
  - 1.8 When to use `pmap()`
  - 1.9 Shared arrays
  - 1.10 Parallel reduction
  - 1.11 Distributed arrays

# Parallel Computing in Julia

Machine information:

In [1]:
versioninfo()

Julia Version 0.6.4
Commit 9d11f62bcb (2018-07-09 19:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)


## Start Julia with multiple workers/processes

The command
```bash
julia -p 4
```
will start Julia with 4 workers on a single machine.

On a cluster, we can create a file such as `$HOME/machinefile2servers` that lists the server names, one line per process. To run two processes per server, repeat each server name twice:
```
server1
server1
server2
server2
```
Then the command
```bash
julia --machinefile $HOME/machinefile2servers
```
will run a 4-process job, 2 on server1 and 2 on server2.

## `addprocs()`,  `rmprocs()`, and `@everywhere`

In [2]:
write(STDOUT, "Hello World")

11

Hello World

In [3]:
print(run(`hostname`))

BS-HUAZHOU-LAP.attlocal.net
nothing

In [4]:
workers()

1-element Array{Int64,1}:
 1

In [5]:
myid()

1

In [6]:
@everywhere println(myid())

1


In [7]:
addprocs(3)

3-element Array{Int64,1}:
 2
 3
 4

In [8]:
@everywhere println(myid())

1
	From worker 4:	4
	From worker 3:	3
	From worker 2:	2


In [9]:
@everywhere println(run(`hostname`))

BS-HUAZHOU-LAP.attlocal.net
nothing
	From worker 4:	BS-HUAZHOU-LAP.attlocal.net
	From worker 3:	BS-HUAZHOU-LAP.attlocal.net
	From worker 2:	BS-HUAZHOU-LAP.attlocal.net
	From worker 3:	nothing
	From worker 2:	nothing
	From worker 4:	nothing


In [10]:
rmprocs(2)

Task (queued) @0x000000012336cfd0

In [11]:
println(nworkers(), ':', nprocs(), ':', Sys.CPU_CORES)

2:3:8


## `remotecall()`, `@spawn`

`remotecall()` creates a remote reference: the first argument is the function, the second argument is the id of the process that runs it, and the remaining arguments are passed to the function.

In [12]:
r = remotecall(rand, 3, 4, 4) # returns a Future immediately; rand(4, 4) runs asynchronously on process 3

Future(3, 1, 11, Nullable{Any}())

Bring the value of `r` from the remote process to the master:

In [13]:
fetch(r)

4×4 Array{Float64,2}:
 0.787261  0.0779991  0.731475  0.120837
 0.542724  0.992773   0.774951  0.361337
 0.791827  0.156149   0.657864  0.233488
 0.924564  0.826326   0.969382  0.106976

Bring only element (1, 1) of `r` to the master:

In [14]:
remotecall_fetch(getindex, 3, r, 1, 1)

0.7872614454821361
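
The three calls above can be tried in isolation with the following minimal sketch. It assumes Julia 1.x, where these functions live in the `Distributed` standard library (the notebook itself runs on 0.6.4, where they are in `Base`); the worker is started inside the script, and the functions `+` and `sum` are only illustrative choices.

```julia
using Distributed          # assumption: Julia 1.x; not needed on 0.6

pid = first(addprocs(1))   # start one local worker and keep its id

# remotecall(f, pid, args...) returns a Future without waiting for the result
r = remotecall(+, pid, 1, 2)
println(fetch(r))          # fetch blocks until the value is available

# remotecall_fetch runs the call and returns the value in a single step
m = remotecall_fetch(sum, pid, [1, 2, 3])
println(m)
```

`remotecall_fetch` is usually preferable when only a small part of a remote result is needed, since only the returned value crosses the process boundary.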

`@spawn` is similar to `remotecall()`, except that Julia chooses the process itself.

In [15]:
r = @spawn rand(2,2)

Future(3, 1, 14, Nullable{Any}())

In [16]:
fetch(r)

2×2 Array{Float64,2}:
 0.686142  0.931471
 0.594098  0.521942

`@spawnat` lets us choose the process that executes the expression. Here we add 1 to the value of `r` on process 3.

In [17]:
s = @spawnat 3 1 .+ fetch(r)

Future(3, 1, 16, Nullable{Any}())
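
A small self-contained sketch of the same pattern, assuming Julia 1.x with the `Distributed` standard library. The matrix built with `fill` is a hypothetical stand-in for the random matrix used above, chosen so the result is predictable.

```julia
using Distributed                    # assumption: Julia 1.x

pid = first(addprocs(1))             # one local worker

r = @spawnat pid fill(2.0, 2, 2)     # evaluated on worker `pid`
s = @spawnat pid 1 .+ fetch(r)       # runs on the same worker, where r's value already lives
result = fetch(s)                    # bring the final matrix back to the master
println(result)
```

Running the second expression on the process that owns `r` means the intermediate matrix does not have to be sent anywhere before it is used.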

In [18]:
@everywhere p = 5  # forces the assignment of p = 5 on all processors

In [19]:
@everywhere println(@sprintf("ID %d: %f %d", myid(), rand(), p))

ID 1: 0.944437 5
	From worker 3:	ID 3: 0.775197 5
	From worker 4:	ID 4: 0.856055 5


In [20]:
@everywhere run(`whoami`)

	From worker 4:	huazhou
	From worker 3:	huazhou
huazhou


In [21]:
@everywhere run(`hostname`)

	From worker 4:	BS-HUAZHOU-LAP.attlocal.net
	From worker 3:	BS-HUAZHOU-LAP.attlocal.net
BS-HUAZHOU-LAP.attlocal.net


## Running a function everywhere

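Let's define a function on the master process only, without `@everywhere`, and see what happens when workers try to call it.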

In [22]:
# purposefully left out @everywhere
function count_heads(n)
    println("My process id is $(myid())")
    c::Int = 0
    for i=1:n
        c += rand(Bool)
    end
    c
end

count_heads (generic function with 1 method)

In [23]:
a = @spawn count_heads(100000000)

Future(4, 1, 25, Nullable{Any}())

In [24]:
b = @spawn count_heads(100000000)

Future(3, 1, 26, Nullable{Any}())

In [25]:
fetch(a) + fetch(b) 

LoadError: On worker 4:
UndefVarError: #count_heads not defined
deserialize_datatype at ./serialize.jl:973
handle_deserialize at ./serialize.jl:677
deserialize at ./serialize.jl:637
handle_deserialize at ./serialize.jl:684
deserialize_global_from_main at ./distributed/clusterserialize.jl:154
foreach at ./abstractarray.jl:1733
deserialize at ./distributed/clusterserialize.jl:56
handle_deserialize at ./serialize.jl:726
deserialize at ./serialize.jl:637
handle_deserialize at ./serialize.jl:681
deserialize at ./serialize.jl:637
handle_deserialize at ./serialize.jl:684
deserialize_msg at ./distributed/messages.jl:98
message_handler_loop at ./distributed/process_messages.jl:161
process_tcp_streams at ./distributed/process_messages.jl:118
#99 at ./event.jl:73
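
The error occurs because `count_heads` exists only on the master. A minimal sketch of the fix is to define the function with `@everywhere`. The version below assumes Julia 1.3 or later, where `@spawnat :any` takes the role of `@spawn`; the smaller `n` is only there to keep the run short.

```julia
using Distributed          # assumption: Julia 1.x
addprocs(2)

# Define the function on the master and on every worker
@everywhere function count_heads(n)
    c = 0
    for i in 1:n
        c += rand(Bool)
    end
    return c
end

a = @spawnat :any count_heads(1_000_000)   # Julia picks the worker
b = @spawnat :any count_heads(1_000_000)
total = fetch(a) + fetch(b)
println(total)
```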

## `@parallel` and `pmap()`

In [26]:
@everywhere begin
    function parallel_func(idx)
        workernum = myid() - 1 
        sleep(workernum)
        println("job $idx")
    end
end

In [27]:
# @parallel splits the range into equal chunks, one per worker, regardless of how fast each worker is
@parallel for idx = 1:12
    parallel_func(idx)
end

2-element Array{Future,1}:
 Future(4, 1, 30, #NULL)
 Future(3, 1, 31, #NULL)

In [28]:
# pmap hands out jobs one at a time, so faster workers end up running more of them
pmap(parallel_func, 1:12)

	From worker 3:	job 7
	From worker 4:	job 1
	From worker 3:	job 1
	From worker 3:	job 8
	From worker 4:	job 2
	From worker 3:	job 4
	From worker 3:	job 9
	From worker 4:	job 2
	From worker 3:	job 6
	From worker 4:	job 5
	From worker 3:	job 10
	From worker 4:	job 3
	From worker 3:	job 7
	From worker 3:	job 11
	From worker 4:	job 8
	From worker 3:	job 9
	From worker 3:	job 12
	From worker 4:	job 4
	From worker 3:	job 11
	From worker 4:	job 10
	From worker 4:	job 5
	From worker 3:	job 12


12-element Array{Void,1}:
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing

	From worker 4:	job 3
	From worker 4:	job 6
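
The difference in scheduling can be made visible by recording which process handles each index. This is a sketch under the assumption of Julia 1.x, where `@parallel` has been renamed `@distributed`; the names `static_ids` and `dynamic_ids` are hypothetical.

```julia
using Distributed          # assumption: Julia 1.x
addprocs(2)

# Static partitioning: the range is split into one contiguous chunk per worker
static_ids = @distributed (vcat) for i in 1:8
    [(i, myid())]
end

# Dynamic scheduling: each item goes to whichever worker is free next
dynamic_ids = pmap(i -> (i, myid()), 1:8)

println(static_ids)
println(dynamic_ids)
```

With jobs of uneven duration, the static split can leave one worker idle while another is still busy, which is the situation `pmap()` is designed for.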


## Benchmark: find pi

In [29]:
function findpi(n)
    inside = 0
    for i in 1:n
        x, y = rand(2)
        if (x^2 + y^2 <= 1)
            inside += 1
        end
    end
    4 * inside / n
end

findpi(10) # compile
@time println(findpi(1_000_000_000))

3.141585112
 49.572526 seconds (1.00 G allocations: 89.408 GiB, 12.02% gc time)


In [30]:
function parallel_findpi(n)
    inside = @parallel (+) for i in 1:n
        x, y = rand(2)
        x^2 + y^2 <= 1 ? 1 : 0
    end
    4 * inside / n
end

@time println(parallel_findpi(1000000000))

3.141672468
 22.326746 seconds (104.83 k allocations: 5.508 MiB)
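
The 1.00 G allocations in the serial timing most likely come from `rand(2)`, which creates a small array on every iteration. The sketch below draws two scalars instead. It assumes Julia 1.x (`@distributed` in place of `@parallel`) and uses a smaller `n` so that it finishes quickly; it is not a re-run of the benchmark above.

```julia
using Distributed          # assumption: Julia 1.x
addprocs(2)

function parallel_findpi(n)
    inside = @distributed (+) for i in 1:n
        x, y = rand(), rand()       # two scalars, no temporary array
        x^2 + y^2 <= 1 ? 1 : 0      # 1 if the point falls inside the quarter circle
    end
    return 4 * inside / n
end

estimate = parallel_findpi(10_000_000)
println(estimate)
```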


## Using `pmap()` to run a serial program on multiple processors with different arguments to the function

In [31]:
x_value = [3, 4, 5, 6]
y_value = [4, 5, 6, 7]
@everywhere function hypot(x,y)
    println("My process id is $(myid())")
    x, y = abs(x), abs(y)
    if x > y
        r = y / x
        return x * sqrt(1 + r * r)
    end
    if y == 0
        return zero(x)
    end
    r = x / y
    return y * sqrt(1 + r * r)
end
info("Serial")
Results = map(hypot, x_value, y_value)
println(Results)
info("Parallel")
Results = pmap(hypot, x_value, y_value)
println(Results)

My process id is 1
My process id is 1
My process id is 1
My process id is 1


[1m[36mINFO: [39m[22m[36mSerial
[39m

[5.0, 6.40312, 7.81025, 9.21954]


[1m[36mINFO: [39m[22m[36mParallel
[39m

	From worker 3:	My process id is 3
	From worker 3:	My process id is 3
	From worker 3:	My process id is 3
	From worker 4:	My process id is 4
[5.0, 6.40312, 7.81025, 9.21954]


## When to use `pmap()`

This example demonstrates when to use `pmap()`. If the function has little work to do per call, the serial `map()` will be faster than `pmap()`, because the cost of sending each call to a worker outweighs the computation itself.

In [32]:
@everywhere function NotMuchToDo(x::Int64)
    return x^2+x+1.0
end

@everywhere function LotToDo(x::Int64)
    a = 1.0
    for i in 1:1000
        for j in 1:5000
            a += asinh(i + j) + acosh(i + j)
        end
    end
    return a
end

In [33]:
info("Precompilation")
map(NotMuchToDo, 1:1000)
pmap(NotMuchToDo, 1:1000)
map(LotToDo,1:100)
pmap(LotToDo,1:100)
info("Timing LotToDo function")
@time map(LotToDo, 1:100)
@time pmap(LotToDo,1:100)
info("Timing NotMuchToDo function")
@time map(NotMuchToDo, 1:1000)
@time pmap(NotMuchToDo, 1:1000)

[1m[36mINFO: [39m[22m[36mPrecompilation
[39m[1m[36mINFO: [39m[22m[36mTiming LotToDo function
[39m

 23.884060 seconds (9 allocations: 1.156 KiB)
 12.029566 seconds (9.49 k allocations: 327.797 KiB)
  0.000014 seconds (9 allocations: 8.219 KiB)


[1m[36mINFO: [39m[22m[36mTiming NotMuchToDo function
[39m

1000-element Array{Float64,1}:
      3.0    
      7.0    
     13.0    
     21.0    
     31.0    
     43.0    
     57.0    
     73.0    
     91.0    
    111.0    
    133.0    
    157.0    
    183.0    
      ⋮      
 979111.0    
 981091.0    
 983073.0    
 985057.0    
 987043.0    
 989031.0    
 991021.0    
 993013.0    
 995007.0    
 997003.0    
 999001.0    
      1.001e6

  0.091042 seconds (92.17 k allocations: 2.765 MiB)
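
When the per-call work is small but a parallel run is still wanted, one option is the `batch_size` keyword of `pmap()`, which sends several elements to a worker in one remote call. The following is a rough sketch assuming Julia 1.x; the function `cheap` is a hypothetical stand-in for `NotMuchToDo`, and the batch size of 250 is an arbitrary choice. Whether batching actually helps depends on the workload, so timings should be measured rather than assumed.

```julia
using Distributed          # assumption: Julia 1.x
addprocs(2)

@everywhere cheap(x) = x^2 + x + 1.0

# One remote call per element
@time r1 = pmap(cheap, 1:1000)

# Elements grouped into batches of 250 per remote call
@time r2 = pmap(cheap, 1:1000; batch_size=250)
```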


## Shared arrays

The code below prints all 0.0 for `a`. That is because each worker operates on its own copy of `a`, so the assignments never reach the array on the master.

In [34]:
a = zeros(10)
@parallel for i = 1:10
    a[i] = i
end
a

10-element Array{Float64,1}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

In [35]:
a = SharedArray{Float64}(10)
@sync @parallel for i = 1:10  # @sync waits for all workers to finish before `a` is displayed
    a[i] = i
end
a

10-element SharedArray{Float64,1}:
  1.0
  2.0
  3.0
  4.0
  5.0
  6.0
  7.0
  8.0
  9.0
 10.0

In [36]:
println(nprocs())
@everywhere function parallel_func(idx)
    println("job $idx")
    a = idx^2+idx-1
    return a
end
result=SharedArray{Int64}(12)
for idx = 1:12
    result[idx] = 0
end
info("Calling @parallel")
@sync @parallel for idx = 1:12
    result[idx]=parallel_func(idx)
end
for idx = 1:12
    print(result[idx], ' ')
end
println(" ");
info("Calling pmap()")
result=pmap(parallel_func, 1:12)
println(result)

3


[1m[36mINFO: [39m[22m[36mCalling @parallel
[39m

	From worker 4:	job 1
	From worker 3:	job 7
	From worker 4:	job 2
	From worker 4:	job 3
	From worker 3:	job 8
	From worker 3:	job 9
	From worker 3:	job 10
	From worker 3:	job 11
	From worker 3:	job 12
	From worker 4:	job 4
	From worker 4:	job 5
	From worker 4:	job 6
1 5 11 19 29 41 55 71 89 109 131 155  
	From worker 3:	job 1
	From worker 4:	job 2
	From worker 3:	job 3
	From worker 4:	job 4
	From worker 3:	job 5
	From worker 4:	job 6
	From worker 3:	job 7
	From worker 4:	job 8
	From worker 3:	job 9
	From worker 4:	job 10
	From worker 3:	job 11


[1m[36mINFO: [39m[22m[36mCalling pmap()
[39m

[1, 5, 11, 19, 29, 41, 55, 71, 89, 109, 131, 155]
	From worker 4:	job 12


## Parallel reduction

In [37]:
@everywhere f(x) = x^2+1
a = randn(1000)
@parallel (+) for i = 1:100000
    f(a[rand(1:end)])
end

208925.21259916504

## Distributed arrays

In [38]:
@everywhere using DistributedArrays

dzeros(2, 2, 4)
dones(1, 100)
drand(2, 2, 4)
drandn(2, 2, 4)
dfill(2, 2, 4)

x = @DArray [@show x^2 for x = 1:10];
arr = rand(8,8)

dist_arr = distribute(arr)
remotecall_fetch(localpart, 1, dist_arr)
remotecall_fetch(localpart, 2, dist_arr)
remotecall_fetch(localpart, 3, dist_arr)
remotecall_fetch(localpart, 4, dist_arr)
remotecall_fetch(localindexes, 3, dist_arr)
remotecall_fetch(localindexes, 2, dist_arr)
remotecall_fetch(localindexes, 4, dist_arr)

remotecall_fetch(getindex, 2, arr, 1, 1)

	From worker 4:	x ^ 2 = 36
	From worker 4:	x ^ 2 = 49
	From worker 4:	x ^ 2 = 64
	From worker 4:	x ^ 2 = 81
	From worker 4:	x ^ 2 = 100
	From worker 3:	x ^ 2 = 1
	From worker 3:	x ^ 2 = 4
	From worker 3:	x ^ 2 = 9
	From worker 3:	x ^ 2 = 16
	From worker 3:	x ^ 2 = 25


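LoadError: ProcessExitedException()

The `ProcessExitedException` is most likely raised by the calls that target process 2, which was removed by `rmprocs(2)` earlier in this notebook.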