## Factors against Parallelism

* Startup costs associated with initiating processes
  * Often overwhelm actual processing time (rendering parallelism useless)
  * Involve thread/process creation, data loading
* Interference: slowdown resulting from multiple processors accessing shared resources
  * Resources: memory, I/O, system bus, sub-processors
  * Data movement: redistribution of updates
  * Software synchronization: locks, latches, mutexes, barriers
  * Hardware synchronization: cache faults, interrupts
* Skew: when breaking a single task into many smaller tasks, not all tasks may be the same size
  * Not all tasks finish at the same time
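
A minimal sketch of why skew hurts, assuming a toy model in which tasks are assigned round-robin and the parallel finish time equals the load of the busiest worker. The `makespan` helper and the task costs are hypothetical, chosen only for illustration.

```python
# Toy model of skew: the parallel finish time is set by the slowest worker.
def makespan(task_costs, n_workers):
    """Assign tasks round-robin; return the total load of the busiest worker."""
    loads = [0.0] * n_workers
    for i, cost in enumerate(task_costs):
        loads[i % n_workers] += cost
    return max(loads)

balanced = [4, 4, 4, 4]   # hypothetical costs: four equal tasks
skewed = [10, 2, 2, 2]    # same total work (16), but one large task

balanced_time = makespan(balanced, 4)
skewed_time = makespan(skewed, 4)

# Speedup = serial time / parallel time
speedup_balanced = sum(balanced) / balanced_time
speedup_skewed = sum(skewed) / skewed_time
print(speedup_balanced, speedup_skewed)
```

Both workloads contain the same total work, yet in this model the skewed one finishes only when its single large task does, so three of the four workers sit idle most of the time.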
  
### Understanding Factors by Analogy


#### Startup

* Szkieletor in Krakow, Poland
  * Too expensive to complete or demolish
* Why is this like a parallel computer?
    * It is a parallel living environment
    * The parallel living throughput is 0 because of startup
    
<img src="./images/sz.png" width="300" title="http://en.wikipedia.org/wiki/File:Szkieletor2.jpg" />

#### Interference

* Congested intersections
  * multiple vehicles compete for the same resource (lanes in the roundabout)
  * others await resources
* This is a parallel driving environment
  * unused capacity (16 outbound lanes) because resource competition (the roundabout) prevents its use
  * this system exhibits a _throughput collapse_ in which more usage reduces flow
  
<img src="./images/traffic.png" width="512" title="http://crowdcentric.net/2011/05/can-you-help-us-solve-the-worlds-traffic-congestion/" />


#### Skew

* Completion of parts for assembly
  * throughput: the output of planes is stalled awaiting other parts
* The parallelism is implicit in the parallel construction of all parts
  * the entire system is stalled (as seen in the image) awaiting a nose section

<img src="./images/plane.png" width="512" title="http://www.ainonline.com/?q=aviation-news/dubai-air-show/2011-11-12/" />


### Factors against Parallelism in our Examples

* Interference: in the game of life, ghost cells need to be updated every iteration.
  * Updates are done in parallel, but it's not useful work.
  * Why not useful? Because a serial program doesn't need to do it.
* Startup: I/O to read data in the turbulence examples
  * I/O requests may be sent in parallel, but are performed serially.
  * On a computer with 32 cores, there are 2 memory busses, and one I/O system.
* Skew: no good example yet.  
  * We'll get one in dataframes.
  
### Factors Conclusions

* Factors against parallelism are the most important design consideration.
  * They make up the non-parallel part in Amdahl’s law
* Typical experience
  * Design a parallel code
  * Test on n=2 or 4 nodes (works great)
  * Deploy on >16 nodes (doesn't work great)
* Measure factors against parallelism
  * Redesign/reimplement
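
The "works great on 4, not on 16" experience can be sketched with Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction and n the number of workers. The function name and the 90% parallel fraction below are assumptions for illustration, not measurements from the course examples.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)

# Hypothetical code in which 90% of the runtime parallelizes
for n in (2, 4, 16, 64):
    speedup = amdahl_speedup(0.9, n)
    efficiency = speedup / n
    print(f"n={n:3d}  speedup={speedup:5.2f}  efficiency={efficiency:.0%}")
```

Under this assumption, efficiency is about 91% on 2 workers but falls to 40% on 16, and the speedup can never exceed 10 no matter how many workers are added.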


## Advanced Topic (maybe skip)

Overlapping computation with I/O is the most important optimization in parallel computing. However, Dask mostly does it for you through asynchronous execution of the task graph.

### Overlapping Computation and I/O

I/O (or messaging) and computation that occur in parallel are said to be overlapped.

<img src="./images/overlap.png" width="512" title="Unknown source" />

* _Concept_: When performing a slow operation
  * do the slow operation asynchronously
  * do useful work with processor while waiting
* Overlap is one of the simplest and most important forms of asynchronous execution
  * identify independent tasks and run them in parallel
  * reorder I/O to initiate it as early as possible and wait on it as late as possible
  * keep computing in the meantime
  
I've built a toy example to demonstrate.

In [None]:
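# toy workloads: one compute-bound function and one I/O-bound function
def factorial(number):
    """Compute number! with a plain loop (CPU-bound)."""
    f = 1
    for i in range(2, number+1):
        f *= i
    return f

def io_from_devnull(number):
    """Issue `number` one-byte reads against /dev/null (each returns no data)."""
    with open("/dev/null", "rb") as fh:
        for i in range(number):
            fh.read(1)
    return number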

In [None]:
%timeit -n 20 factorial(10000)
%timeit -n 20 io_from_devnull(30000)

In [None]:
%%timeit -n 20 

# synchronous: compute, then I/O, one after the other
factorial(10000)
io_from_devnull(30000)

In [None]:
%%timeit -n 20

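# overlapped: run compute and I/O in separate processes at the same time
# (the measured time includes the startup cost of creating both processes)
from multiprocessing import Process
p1 = Process(target=factorial, args=(10000,))
p2 = Process(target=io_from_devnull, args=(30000,))
p1.start()
p2.start()
p1.join()  # wait for both processes to finish
p2.join()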
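
The "initiate early, wait late" pattern can also be sketched with a thread instead of processes. This is an alternative illustration, not the approach used in the cells above: `slow_io` is a hypothetical stand-in that sleeps to simulate a slow read (sleeping, like blocking I/O, releases the GIL), and the sizes and delays are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_io(delay=0.2):
    """Hypothetical stand-in for a slow read; sleeping simulates the wait."""
    time.sleep(delay)
    return b"data"

def compute(n=2000):
    """CPU-bound work to do while the I/O is in flight."""
    f = 1
    for i in range(2, n + 1):
        f *= i
    return f

with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    future = pool.submit(slow_io)  # initiate the I/O as early as possible
    result = compute()             # do useful work while waiting
    data = future.result()         # wait on the I/O as late as possible
    overlapped = time.perf_counter() - start

print(f"overlapped time: {overlapped:.3f} s")
```

Because a thread is far cheaper to start than a process, this variant pays less startup cost, but it only helps when the slow operation actually releases the GIL, as blocking I/O and `time.sleep` do.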