import math
import logging
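# Log format: [logger name:level]  message
FORMAT = '[%(name)s:%(levelname)s]  %(message)s'
logging.basicConfig(level=logging.DEBUG, format=FORMAT)
logger = logging.getLogger('dbg')

def dprint(s):
    """Log s at DEBUG level."""
    logger.debug(s)

def iprint(s):
    """Log s at INFO level."""
    logger.info(s)

# Hide dprint output by default; call logger.setLevel(logging.DEBUG) to show it.
logger.setLevel(logging.INFO)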

## Task Parallel Compute

**Moore's Law** - 1965: the number of components per chip, at minimum component cost, doubles every year

**Moore's Law** - 1975: revised to doubling every two years

### Dennard Scaling

With each generation, transistor dimensions shrink by 30%:
- Double the # of devices
- Same power
- 40% faster
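
The three bullet points follow from simple arithmetic on the scaling factor. The sketch below assumes the classic Dennard model, in which supply voltage and capacitance shrink by the same factor as the dimensions (these assumptions are not stated in the notes above, and the variable names are illustrative only).

```python
# Linear scaling factor per generation: dimensions shrink by 30%.
k = 0.7

# Area per transistor scales with k^2, so density roughly doubles.
area = k ** 2
density_gain = 1 / area            # about 2.04x devices in the same area

# Gate delay scales with k, so clock frequency rises by 1/k.
freq_gain = 1 / k                  # about 1.43x, i.e. roughly 40% faster

# Assumed model: dynamic power per transistor ~ C * V^2 * f,
# with C and V both scaling by k and f scaling by 1/k.
power_per_transistor = k * k ** 2 * (1 / k)

# Power per unit area stays constant: twice the devices at the same power.
power_density = power_per_transistor / area

print(f"density x{density_gain:.2f}, frequency x{freq_gain:.2f}, "
      f"power density x{power_density:.2f}")
```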

Dennard scaling broke down around 2005: the model ignores leakage current, which grows as transistors shrink, so power could no longer stay constant.

### Amdahl's Law for Speed-up

Let $\alpha \in [0, 1]$ be the fraction of code that can be parallelized fully. Let $P$ be the number of processors and $T$ be time:

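$T_{new} = T \times \left ( (1 - \alpha) + \frac{\alpha }{P} \right)$

The speed-up is therefore $T / T_{new} = \frac{1}{(1 - \alpha) + \alpha / P}$, which can never exceed $\frac{1}{1 - \alpha}$ however large $P$ becomes.

Amdahl's law is a loose **upper bound** on the achievable speed-up, i.e. in practice it's usually worse.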
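
As a quick numerical check, here is a minimal sketch of Amdahl's speed-up $1 / ((1 - \alpha) + \alpha / P)$. The function name `amdahl_speedup` is made up for this example.

```python
def amdahl_speedup(alpha, p):
    """Speed-up T / T_new for parallel fraction alpha on p processors."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / ((1.0 - alpha) + alpha / p)


# With 90% parallel code the speed-up saturates below 1 / (1 - 0.9) = 10.
for p in (1, 10, 100, 1000):
    print(p, round(amdahl_speedup(0.9, p), 2))
```

Even with 1000 processors the speed-up stays just under 10, because the serial 10% dominates the runtime.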

### Gustafson's Law
Amdahl's law assumes a fixed problem size: the workload does not change as the available resources improve.

Gustafson's law instead proposes that programmers tend to increase the size of problems to fully exploit the computing power that becomes available as the resources improve.

Speed-Up = $1 + \alpha \cdot (P -1)$ - Linear Scaling with P!

The correct law depends on the system and the task at hand.
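
To see how differently the two laws behave, the sketch below tabulates both for the same $\alpha$. The function names are hypothetical. Note that $\alpha$ means slightly different things in the two laws: a fraction of the fixed workload for Amdahl, and a fraction of the scaled workload for Gustafson.

```python
def amdahl_speedup(alpha, p):
    """Fixed problem size: 1 / ((1 - alpha) + alpha / p)."""
    return 1.0 / ((1.0 - alpha) + alpha / p)


def gustafson_speedup(alpha, p):
    """Problem size grows with p: 1 + alpha * (p - 1)."""
    return 1.0 + alpha * (p - 1)


alpha = 0.9
for p in (1, 10, 100):
    print(f"P={p:>3}  Amdahl={amdahl_speedup(alpha, p):6.2f}  "
          f"Gustafson={gustafson_speedup(alpha, p):6.2f}")
```

Amdahl's speed-up flattens out below $1 / (1 - \alpha)$, while Gustafson's keeps growing linearly in $P$.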

### Memory Models For Parallel Computing

How to organise multi-core memory?

**Distributed Memory**
* Each core has its own private memory; to read data held by another core it sends a request over a network

**Shared Memory**
* Any core can access any memory in a shared address space
* **SMP** (symmetric multiprocessing): Uniform Memory Access (UMA)
    * uniform access time to all memory
* **DSM** (distributed shared memory): Non-Uniform Memory Access (NUMA)
    * access time depends on the location of the data and the network topology


### Parallelism

**Data-level**
* Processors perform the same task on different subsets of data in parallel

**Task-level**
* Distribute tasks across processors - different tasks may be run on the same data
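
The difference between the two kinds of parallelism can be sketched with Python's standard `concurrent.futures` module. The data set and the chunk size are arbitrary choices for illustration, and in CPython the GIL means threads show the structure rather than a real speed-up for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data-level: the same task (sum) runs on different subsets of the data.
    chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
    partial_sums = list(pool.map(sum, chunks))
    total = sum(partial_sums)

    # Task-level: different tasks (min, max, sum) run on the same data.
    tasks = {"min": min, "max": max, "sum": sum}
    futures = {name: pool.submit(fn, data) for name, fn in tasks.items()}
    stats = {name: fut.result() for name, fut in futures.items()}

print(partial_sums, total)
print(stats)
```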


### Task-level Parallelism

Task parallelism can be implemented with threads ("virtual processors") that share memory. However, this has proven difficult to program: scheduling and load balancing are challenging.

**Task-parallel platforms** add an abstraction layer on top of threads. The programmer specifies which tasks can run in parallel (but not where they run); the platform manages scheduling, load balancing, etc.

#### Fork Join

Most task-parallel platforms support fork-join:
* **Spawn: "forks"** - executes function while caller continues to run in parallel
* **Sync: "joins"** - waits for spawned threads to finish before proceeding

Key concept: the programmer only specifies which tasks may run in parallel, not which tasks must run in parallel. Parallel sections can fork recursively until they reach a given task granularity.
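
The spawn/sync pattern can be sketched with plain Python threads. This is a minimal illustration, not how a task-parallel platform is implemented: a real platform would schedule tasks onto a fixed pool of workers instead of creating one thread per fork. The function `parallel_sum` and the `grain` cutoff are hypothetical names for this example.

```python
import threading


def parallel_sum(data, lo, hi, grain=1024):
    """Fork-join sum of data[lo:hi], forking recursively down to `grain` items."""
    if hi - lo <= grain:
        return sum(data[lo:hi])          # small enough: run serially

    mid = (lo + hi) // 2
    left_result = []

    def left_task():
        left_result.append(parallel_sum(data, lo, mid, grain))

    child = threading.Thread(target=left_task)
    child.start()                                # spawn: left half may run in parallel
    right = parallel_sum(data, mid, hi, grain)   # caller continues with the right half
    child.join()                                 # sync: wait for the spawned task
    return left_result[0] + right


values = list(range(10_000))
total = parallel_sum(values, 0, len(values))
print(total)
```

The `grain` parameter plays the role of the task granularity: below it, forking would cost more than it saves.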

### Parallel Computation Analysis

Assume:
* We have an ideal parallel computer
    * Multiple Cores
    * Sequentially consistent memory model
    * Uniform computation power
* There is no scheduling overhead

#### Work-Span

Let $T_P$ denote the runtime on $P$ processors
* **Work** - $T_1$ - time to execute on 1 core
* **Span** - $T_\infty$ - time to execute on $\infty$ cores

The **span** is the sum of runtime of strands on the 'critical path' - the longest path in the computational DAG.
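
For a small DAG, both quantities can be computed directly: work is the sum of all strand costs and span is the cost of the longest path. The DAG below is a made-up example (it is not the one in the figure), and the helper names are illustrative.

```python
from functools import lru_cache

# Hypothetical computation DAG: cost of each strand and its successors.
cost = {"a": 1, "b": 2, "c": 3, "d": 1}
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}


def work(cost):
    """T_1: total cost of all strands."""
    return sum(cost.values())


def span(cost, succ):
    """T_inf: cost of the longest (critical) path through the DAG."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        tails = (longest_from(s) for s in succ[node])
        return cost[node] + max(tails, default=0)

    return max(longest_from(node) for node in cost)


t1 = work(cost)
t_inf = span(cost, succ)
print(t1, t_inf)
```

Here strands `b` and `c` can run in parallel, so the critical path is `a`, `c`, `d`.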

1. **Work Law** $\rightarrow T_P \geq T_1 / P$ 
1. **Span Law** $\rightarrow T_P \geq T_\infty$ 
1. **Speedup** $\rightarrow T_1 / T_P$ 
1. **Linear Speedup** $\rightarrow T_1 / T_P = \Theta(P)$
1. **Parallelism** $\rightarrow T_1 / T_\infty$ (Max possible speedup) 
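
Taken together, the work law and the span law give a lower bound on $T_P$ and hence an upper bound on speedup. The sketch below uses made-up values for $T_1$ and $T_\infty$; the function names are illustrative.

```python
def runtime_lower_bound(t1, t_inf, p):
    """Best possible T_P: the larger of the work-law and span-law bounds."""
    return max(t1 / p, t_inf)


def speedup_upper_bound(t1, t_inf, p):
    """Best possible T_1 / T_P, which equals min(P, parallelism)."""
    return t1 / runtime_lower_bound(t1, t_inf, p)


t1, t_inf = 100.0, 10.0          # assumed work and span
parallelism = t1 / t_inf         # 10: the most speedup any P can give

for p in (2, 4, 10, 32):
    print(p, runtime_lower_bound(t1, t_inf, p), speedup_upper_bound(t1, t_inf, p))
```

Once $P$ exceeds the parallelism $T_1 / T_\infty$, the span law becomes the binding constraint and extra processors cannot help.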

Parallel analysis can then be performed on an algorithm:

<img src="media/workspan.png" alt="drawing" width="800"/>


Find **Work** $\mathbb{W} = T_1(n)$ and **Span** $\mathbb{S} = T_\infty(n)$

Parallelism = $\mathbb{W}/\mathbb{S} = T_1(n) / T_\infty(n)$