Pierre Navaro - [Institut de Recherche Mathématique de Rennes](https://irmar.univ-rennes1.fr) - [CNRS](http://www.cnrs.fr/)

# Parallel Computation

## Parallel computers
- Multiprocessor/multicore: several processors work on data stored in shared memory
- Cluster: several processor/memory units work together by exchanging data over a network
- Co-processor: a general-purpose processor delegates specific tasks to a special-purpose processor (GPU, Xeon Phi,...)


## Parallel Programming
- Decomposition of the complete task into independent subtasks, and identification of the data flow between them.
- Distribution of the subtasks over the processors minimizing the total execution time.
- For clusters: distribution of the data over the nodes minimizing the communication time.
- For multiprocessors: optimization of the memory access patterns minimizing waiting times.
- Synchronization of the individual processes.

## MapReduce

In [1]:
from time import sleep
def f(x):
    sleep(1)
    return x*x
L = list(range(8))
L

[0, 1, 2, 3, 4, 5, 6, 7]

In [2]:
%time sum([f(x) for x in L])

CPU times: user 1.13 ms, sys: 1.21 ms, total: 2.34 ms
Wall time: 8.02 s


140

In [3]:
%time sum(map(f,L))

CPU times: user 1.2 ms, sys: 2.35 ms, total: 3.55 ms
Wall time: 8.02 s


140
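
Both cells above follow the same pattern: a map step applies `f` to each element independently, and a reduce step (`sum`) combines the mapped values into one number. The sketch below makes the reduce step explicit with `functools.reduce`. The name `square` is an illustrative stand-in for `f` without the one-second sleep.

```python
from functools import reduce
from operator import add

def square(x):
    # Same computation as f, minus the sleep (illustrative stand-in).
    return x * x

data = list(range(8))

# Map step: each element is transformed independently of the others.
mapped = map(square, data)

# Reduce step: combine the mapped values pairwise into a single result.
total = reduce(add, mapped, 0)
print(total)  # 140
```

Because the map step has no dependencies between elements, it is the part that can be handed to several workers; only the reduce step needs all the results.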

## Multiprocessing 

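The `multiprocessing` module allows the programmer to fully leverage multiple processors.
- The `Pool` object parallelizes the execution of a function across multiple input values.
- The `if __name__ == '__main__'` guard is necessary when worker processes are started by re-importing the main module (the `spawn` start method, the default on Windows and macOS): without it, each worker would try to create its own pool.
- The `Pool` class provides a `map` method. It automatically partitions the input and distributes it to a user-specified function running in a pool of worker processes.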

In [4]:
from multiprocessing import cpu_count

cpu_count()

8

In [5]:
%%time 
from multiprocessing import Pool

if __name__ == '__main__': # Executed only on main process.
    with Pool(4) as p:
        print(sum(p.map(f, L))) # Apply f on L sequence


140
CPU times: user 13.7 ms, sys: 19 ms, total: 32.7 ms
Wall time: 2.06 s


- `Pool()` called without an argument launches one worker process per CPU reported by `os.cpu_count()`, which usually counts logical cores; `Pool(4)` above launches four workers.
- `pool.map(...)` divides the input list into chunks and puts the tasks (function + chunk) on a queue.
- Each worker process takes a task (function + a chunk of data), runs `map(function, chunk)`, and puts the result on a result list.
- `pool.map` on the main process waits until all tasks are handled and returns the concatenation of the result lists.
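
The chunking steps listed above can be sketched sequentially, without any worker process. This is a minimal emulation for illustration only, not the actual implementation in `multiprocessing`; the names `split_into_chunks` and `emulated_pool_map` are hypothetical.

```python
from itertools import chain

def split_into_chunks(items, chunksize):
    """Cut a list into consecutive pieces of at most `chunksize` elements."""
    return [items[i:i + chunksize] for i in range(0, len(items), chunksize)]

def emulated_pool_map(function, items, chunksize):
    # Each (function, chunk) pair plays the role of one task on the queue.
    tasks = [(function, chunk) for chunk in split_into_chunks(items, chunksize)]
    # A worker would run map(function, chunk); here the tasks run in sequence.
    result_lists = [list(map(func, chunk)) for func, chunk in tasks]
    # The caller receives the concatenation, in the original input order.
    return list(chain.from_iterable(result_lists))

def square(x):
    return x * x

print(split_into_chunks(list(range(8)), 3))          # [[0, 1, 2], [3, 4, 5], [6, 7]]
print(emulated_pool_map(square, list(range(8)), 3))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The real `Pool.map` accepts an optional `chunksize` argument. Larger chunks mean fewer tasks and less communication, at the risk of a less even load across workers.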

### Exercise 4

- Use the `paragraph` function from the `lorem` module to create a text
- Create a list of words from it
- Use the `map` method of `multiprocessing.Pool` to compute the length of each word


## Thread and Process: Differences

- A process is an instance of a running program.
- A process may contain one or more threads, but a thread cannot contain a process.
- A process has a self-contained execution environment with its own memory space.
- An application running on your computer may be a set of cooperating processes.

- A thread exists within a process; every process has at least one thread.
- Multiple threads in a process share its resources, which makes communication between threads efficient.
- Threads can run concurrently on a multi-core system, with each core executing a separate thread simultaneously.
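
A small sketch of the shared-memory point: several threads started with the standard `threading` module append to one list and all report the same process ID. The names `shared_results` and `record` are chosen here for illustration.

```python
import os
import threading

shared_results = []       # one list, visible to every thread in the process
lock = threading.Lock()   # guards updates to the shared list

def record(worker_id):
    # Every thread runs in the same process and writes to the same list.
    with lock:
        shared_results.append((worker_id, os.getpid()))

threads = [threading.Thread(target=record, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

pids = {pid for _, pid in shared_results}
print(len(shared_results), "entries written from", len(pids), "process")
```

Separate processes would each get their own copy of `shared_results`, so the parent would not see their appends unless the data were sent back explicitly.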




## The Global Interpreter Lock (GIL)

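- The CPython interpreter is not fully thread safe.
- A few critical internal data structures may only be accessed by one thread at a time. Access to them is protected by the GIL.
- The GIL is released during blocking operations such as `sleep` or I/O, so threads still help with I/O-bound work.
- Attempts at removing the GIL from CPython failed for a long time, mainly because of the difficulty of maintaining the C API for extension modules. Since Python 3.13, an optional free-threaded build without the GIL is available as an experimental feature (PEP 703); the default build still uses the GIL.
- Multiprocessing avoids the GIL by having separate processes which each have an independent copy of the interpreter data structures.
- The price to pay: serialization of tasks, arguments, and results.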
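
The serialization cost can be made concrete with the standard `pickle` module, which `multiprocessing` relies on to move objects between processes. The sketch below only illustrates the round trip and one common limitation; it does not start any process.

```python
import pickle

data = list(range(8))

# Arguments and results cross the process boundary as bytes.
payload = pickle.dumps(data)
restored = pickle.loads(payload)
print(type(payload).__name__, restored == data)

# A lambda has no importable name, so the standard pickler rejects it.
try:
    pickle.dumps(lambda x: x * x)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_picklable = False
print("lambda picklable:", lambda_picklable)
```

This is why functions passed to a process pool are normally defined at module level, and why large arguments or results can dominate the run time of short tasks.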

## Futures
The `concurrent.futures` module provides a high-level interface for asynchronously executing callables.

The asynchronous execution can be performed with:
- **threads**, using `ThreadPoolExecutor`,
- separate **processes**, using `ProcessPoolExecutor`.

Both implement the same interface, which is defined by the abstract `Executor` class.
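
Besides `map`, the `Executor` interface offers `submit`, which schedules a single call and returns a `Future`. A minimal sketch with threads follows; `slow_square` is an illustrative stand-in for `f` with a much shorter sleep.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from time import sleep

def slow_square(x):
    sleep(0.01)  # stands in for a slow, I/O-like operation
    return x * x

with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() schedules one call and immediately returns a Future.
    futures = {executor.submit(slow_square, x): x for x in range(8)}
    # as_completed() yields futures as they finish, in no fixed order.
    results = {futures[fut]: fut.result() for fut in as_completed(futures)}

print(sum(results.values()))  # 140
```

Keying the results by input keeps them identifiable even though the completion order is not guaranteed.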

In [6]:
%%time
from concurrent.futures import ProcessPoolExecutor

# The with block shuts the worker processes down once the work is done.
with ProcessPoolExecutor() as e:
    results = sum(e.map(f, L))
print(results)

140
CPU times: user 12.8 ms, sys: 22.1 ms, total: 34.9 ms
Wall time: 1.03 s


In [7]:
%%time
from concurrent.futures import ThreadPoolExecutor

# The with block joins the worker threads once the work is done.
with ThreadPoolExecutor() as e:
    results = sum(e.map(f, L))
print(results)

140
CPU times: user 4.59 ms, sys: 4.24 ms, total: 8.83 ms
Wall time: 1.01 s


### Exercise 5

Use `ProcessPoolExecutor` to compute each word length.

### Exercise 6

Same as exercise 4 but use `ThreadPoolExecutor`.