# Parallel Julia

To run multiple local processes in parallel, we first need to tell julia how many processes we want to add. This should be less than or equal to the effective number of processors our machine has. That number is fixed for a laptop (probably 4 or 8), but we can set it on Engaging with the "cpus-per-task" parameter. There are two ways to do this.

1. Launch julia with the "-p" flag: ```julia -p 3 my_script.jl```
2. Add processes while julia is already running: ```addprocs(3)``` (Note: do this before running any other commands to ensure your modules are loaded across all of the worker processes.)

Both of these will launch 4 total processes, the master and then 3 additional worker processes.
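
As a quick check, the sketch below adds workers from inside a session and prints the process counts. It assumes Julia 1.0 or later, where these functions live in the `Distributed` standard library and need an explicit `using Distributed`; on earlier versions they are available without it.

```julia
using Distributed   # assumption: Julia 1.0+; not needed on Julia 0.6 and earlier

# Only add workers if none were requested on the command line with -p
if nprocs() == 1
    addprocs(3)
end

println("total processes:  ", nprocs())     # master plus workers
println("worker processes: ", nworkers())
println("worker ids:       ", workers())    # the master always has id 1
```

The `nprocs() == 1` guard is only there so the snippet can be rerun without piling up extra workers.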

Julia provides low-level functions for parallel computing. These allow us to specify which process evaluates each expression. The documentation can be found here:
https://docs.julialang.org/en/stable/manual/parallel-computing/
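
To give a flavor of these low-level functions, here is a minimal sketch that runs a few expressions on one specific worker. It assumes Julia 1.0 or later (hence `using Distributed`); the variable names are only illustrative.

```julia
using Distributed   # assumption: Julia 1.0+

if nprocs() == 1
    addprocs(2)
end

w = first(workers())            # pick one worker id

# remotecall returns immediately with a Future; fetch waits for the result
fut = remotecall(myid, w)
println(fetch(fut) == w)        # the call really ran on worker w

# @spawnat evaluates an expression on the chosen process
fut2 = @spawnat w sum(1:10)
println(fetch(fut2))

# remotecall_fetch combines the call and the fetch in one step
println(remotecall_fetch(+, w, 1, 2))
```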

For much of what we want to do, this granular level of control is unnecessary. Instead, we can use some high-level functions and let julia figure out how to distribute the work amongst the various processes.

There are two high-level functions we will use:

1. ```@parallel for```
2. ```pmap```

Let's consider the following problem:

Over ten thousand time steps, a random walker moves according to a Bernoulli process with parameter 0.001: at each time step it takes a step with probability 0.001, and each step is +1 or -1 with equal probability. Any time the walker takes a step, we have the chance to bet for or against it. If we bet for it, we will get the difference between the walker's final position and its current position. If we bet against it, we will get the difference between the walker's current position and its final position.

We want to evaluate the expected profit of the following strategy:

1. If the walker is at or below -a, bet in favor of the walker
2. If the walker is at or above +a, bet against the walker
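
It can help to write the payoff of a single trial as a formula. The notation here is introduced only for illustration and is not part of the original problem statement: $X_t$ is the walker's position after time step $t$, $X_T$ is its final position, and $F$ and $A$ are the sets of times at which the strategy bets for and against the walker. This assumes the "current position" is the position just after the step that triggered the bet.

$$
\text{profit} \;=\; \sum_{t \in F} (X_T - X_t) \;+\; \sum_{t \in A} (X_t - X_T)
\;=\; \bigl(|F| - |A|\bigr)\, X_T \;-\; \sum_{t \in F} X_t \;+\; \sum_{t \in A} X_t
$$

The second form shows that a simulation only needs to track the number of bets of each kind and a running sum of positions, and can settle everything once the final position is known.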

In [None]:
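# Let's write a function to simulate a single trial of this game
# (@everywhere defines it on every worker; on Julia 1.0+ this needs `using Distributed`)
@everywhere function simulate_path(a)
    T = 10000    # number of time steps
    p = 0.001    # probability that the walker moves at a given time step
    profit = 0
    walker_position = 0
    n_bets_for = 0
    n_bets_against = 0
    just_stepped = false
    for i in 1:T
        step = rand()
        if step < p / 2
            walker_position += 1
            just_stepped = true
        elseif step < p
            walker_position -= 1
            just_stepped = true
        else
            just_stepped = false
        end
        if just_stepped && walker_position <= -a
            # Bet for: pays final - current, so subtract the current position now
            n_bets_for += 1
            profit -= walker_position
        elseif just_stepped && walker_position >= a
            # Bet against: pays current - final, so add the current position now
            n_bets_against += 1
            profit += walker_position
        end
    end
    # Settle all open bets against the final position
    profit += (n_bets_for - n_bets_against) * walker_position
    return profit
end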

In [None]:
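# Now let's write a function to run a million trials of this simulation
function serial_simulate(a)
    n_trials = 1000000
    total_profit = sum(simulate_path(a) for i in 1:n_trials)
    return total_profit / n_trials    # average profit per trial
end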

In [None]:
@time serial_simulate(1)

In [None]:
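# Now let's try it with a parallel for loop (called @distributed in Julia 1.0+)
function parallel_simulate(a)
    n_trials = 1000000
    total_profit = @parallel (+) for i in 1:n_trials
        simulate_path(a)
    end
    return total_profit / n_trials    # average profit per trial
end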

In [None]:
@time parallel_simulate(1)

Note: the + sign is called the reduction operator. It's not required for parallel for loops, but I recommend always including one: without it, the loop returns immediately without waiting for the workers to finish, unless it is prefixed with @sync. The reduction operator can be any binary operation that can be applied pairwise to combine results. Each worker reduces its own share of the iterations and the partial results are then combined, so the operation should be associative. If we need the entire vector of results, we can use the ```vcat``` reduction operator.
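
The two kinds of reduction can be sketched as follows. This assumes Julia 1.0 or later, where `@parallel` was renamed `@distributed` and moved to the `Distributed` standard library; on Julia 0.6 the same loops are written with `@parallel` and no `using` line.

```julia
using Distributed   # assumption: Julia 1.0+, where @parallel is called @distributed

if nprocs() == 1
    addprocs(2)
end

# Sum reduction: each worker adds up its share, then the partial sums are combined
total = @distributed (+) for i in 1:100
    i^2
end

# vcat reduction: collect every iteration's result into one vector
squares = @distributed (vcat) for i in 1:10
    i^2
end

println(total)
println(squares)
```

Because a reduction operator is present, both loops block until every worker has finished, so `total` and `squares` are ordinary values rather than futures.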

## Global variables

If the expressions executed within the parallel for loop reference global variables (i.e. variables defined outside of a function), then each worker process will be passed a copy of that variable. If these variables are only needed as read-only variables, then this is okay. However, unnecessarily passing large chunks of memory between processes can significantly slow down your script. Attempting to modify the contents of a global variable within a parallel for loop will not have the intended effect, because each worker modifies its own copy rather than the variable on the master process.

In [None]:
# Don't do this
a = zeros(3)
@parallel for i = 1:3
    a[i] = i
end

In [None]:
a

In [None]:
fetch(@spawnat 2 getfield(Main,:a))

It's best to avoid using global variables whenever possible. However, if you do need a shared variable that all processes can write to, you can use the ```SharedArray``` type.
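
Here is a minimal sketch of the "don't do this" loop rewritten with a `SharedArray`. It assumes Julia 1.0 or later, where the type lives in the `SharedArrays` standard library, and it assumes all workers run on the same machine, since the array is backed by shared memory.

```julia
using Distributed   # assumption: Julia 1.0+

if nprocs() == 1
    addprocs(2)
end

@everywhere using SharedArrays   # standard library in Julia 1.0+

# A length-3 array that every local process can read and write
a = SharedArray{Float64}(3)

# @sync makes the master wait until all workers have finished writing
@sync @distributed for i in 1:3
    a[i] = i
end

println(a)
```

Each iteration writes to a different index, so the workers never compete for the same element. Concurrent writes to the same element would need additional coordination.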

## ```pmap```

Parallel for loops are optimized for the setting in which each iteration of the loop only does a small amount of work. In contrast, ```pmap``` is optimized for the setting in which each function call does a large amount of work. Let's consider the same example as before, but we now wish to evaluate the expected profit of a variety of different thresholds.

Bonus! Why is that? For parallel for loops, the work is divided up immediately, so each worker process will evaluate the same number of iterations. With ```pmap``` a queue is formed and jobs are distributed to the workers as they complete their prior jobs. If there are only a few large jobs, there will likely be some variation in how long each one takes to execute, so pre-assigning them to worker processes is less efficient than dynamically assigning them. However, if there are many small jobs, the time to complete a subset of 333,333 of them will have little variation, so the overhead of dynamic job assignment outweighs its benefit.
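
A minimal sketch of the `pmap` pattern described above, assuming Julia 1.0 or later. The function `slow_square` is a hypothetical stand-in for an expensive job such as running a full simulation for one threshold; in the notebook you would pass your own simulation function instead, defined on all workers with `@everywhere`.

```julia
using Distributed   # assumption: Julia 1.0+

if nprocs() == 1
    addprocs(2)
end

# Hypothetical stand-in for an expensive job, defined on every process
@everywhere function slow_square(a)
    sleep(0.1)      # pretend to do a lot of work
    return a^2
end

thresholds = 1:8

# Each threshold is one job; idle workers pull the next job from the queue
results = pmap(slow_square, thresholds)

println(results)
```

`pmap` returns the results in the same order as the inputs, regardless of which worker handled each job.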

Final comment: if a function takes multiple arguments, pack each set of arguments into a tuple and pass an array of tuples as the second argument to ```pmap```.
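
The tuple approach can be sketched like this, assuming Julia 1.0 or later. The two-argument function `scaled_threshold` is hypothetical and only serves to show the unpacking.

```julia
using Distributed   # assumption: Julia 1.0+

if nprocs() == 1
    addprocs(2)
end

# Hypothetical two-argument function, defined on every process
@everywhere scaled_threshold(a, scale) = a * scale

# Pack each set of arguments into a tuple
arg_list = [(a, scale) for a in 1:3 for scale in (10, 100)]

# Unpack the tuple inside an anonymous function
results = pmap(args -> scaled_threshold(args...), arg_list)

println(results)
```

When the arguments already sit in separate collections of equal length, `pmap(f, xs, ys)` is an alternative that avoids building the tuples.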