# Single Machine Distributed Computation in Python

## Outcomes

### 1. Understand computer science things 

- concurrency, parallelism & asynchrony.

### 2. Understand hardware / software things 

- CPU cores, threads & processes.

### 3. How to distribute compute in Python on a single machine (standard library only)

- why to never use `threading`,
- why `multiprocessing` - (use for CPU bound tasks),
- single thread + `asyncio` - (use for IO bound tasks).


## Background

### Horizontal versus vertical scaling

Horizontal scaling = **scaling wide** - using many identical or similar workers to solve a problem.

Vertical scaling = **scaling tall** - using one large worker to solve a problem.

![](../assets/scaling.svg)

Similar terminology in business (vertical integration etc).


### Fixed versus variable costs

Variable costs are a function of scale - fixed costs are constant & independent of scale.

There is a trade-off between fixed costs and variable costs.

For distributed computation, we balance fixed costs (such as the time & memory cost to start up processes in `multiprocessing`, or the time to spin up a Dask cluster on EC2 / ECS) with variable costs (the marginal increase in cost with one additional unit of compute).

There can also be additional trade-offs - such as between different fixed costs - for example running a cluster on the cloud:

- starting it up each time = high fixed cost of execution time,
- running it all the time = high fixed cost of CPU time.
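
The balance between fixed and variable costs can be sketched with a toy cost model: total run time is a fixed startup cost plus a per-task cost shared across workers.  The function name `total_time` and all of the numbers are hypothetical, chosen only to show where going parallel starts to pay off.

```python
import math


def total_time(n_tasks, per_task, workers=1, startup=0.0):
    """Fixed startup cost plus the variable cost of running tasks in batches."""
    batches = math.ceil(n_tasks / workers)
    return startup + batches * per_task


# hypothetical numbers: 1s per task, 3s to start a pool of 4 workers
for n_tasks in (2, 4, 16):
    sequential = total_time(n_tasks, per_task=1.0)
    parallel = total_time(n_tasks, per_task=1.0, workers=4, startup=3.0)
    print(n_tasks, sequential, parallel)
```

With these made-up numbers the fixed cost dominates for 2 tasks (sequential is faster), breaks even at 4 tasks and is clearly worth paying at 16 tasks.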


## Why distribute compute?

### To go faster

There are diminishing returns on increasing the performance of a single CPU core - there is limited ability to vertically scale on a single CPU core.

Distributing work over many CPU cores lets us massively exceed what we can do with a single core.

Distributing compute over many CPU cores (parallelism), or using a single core more intelligently (asynchrony), gets around the vertical limit on a single core.  Parallelism does this by going **wide & horizontal with many CPU cores**.  The GPU is an extreme example of this, with thousands of cores per GPU card.

Solving two problems will be the focus of this notebook:

1) CPU bound problems,
2) IO bound problems.

The solutions to these two problems:

1. CPU bound = using more hardware/CPU/workers to go faster by running many in parallel,
2. IO bound = use the same hardware (single CPU core) to go faster by handing off execution & waiting.


### To use more memory 

Sometimes we want to do calculations on large datasets (larger than memory):

1. use many small machines,
2. batch data (Dask Dataframe does this) - intelligent use of the same hardware.

The third notebook (`3-many-machines.ipynb`) will show how to distribute compute across a Dask cluster on EC2.


## Why distribute compute on a single machine?

It's easier than using many machines - we don't need to:

- install anything third party,
- access/manage remote machines,
- sign up for any accounts.

The trade-off is a lower fixed overhead cost than multiple machines (incl. programmer time) in exchange for a limited capacity to scale (one machine).

# Hardware & Software

<center><img src="../assets/cores.png" alt="Drawing" style="width: 800px;"/></center>

## CPU cores

*Hardware* - also called a processor.  One core can do one thing at a time - execute a single thread.

Your machine will likely have 4-8 cores (16 if you are lucky).
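
To check the core count on your own machine, a minimal sketch using only the standard library is shown below.  Note that `os.cpu_count` reports logical cores, which can be more than the number of physical cores on machines with hyperthreading.

```python
import os

# number of logical CPU cores visible to the operating system
# (can be None if the count cannot be determined)
n_cores = os.cpu_count()
print(f"logical CPU cores: {n_cores}")
```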


## Threads

*Software* - atomic unit of compute in a process.

The software representation of a sequence of computation. 

Memory is shared between threads - making threads the lightest weight, fastest way to speed up compute.


## Processes

*Software* - a program in execution.

Each process has one or more threads - sequential steps of compute that are moved forward by a CPU.

Each process has a **dedicated memory space** - making **multiple processes memory intensive** relative to using multiple threads.

A process comes with memory overhead - this is traded-off with safety that comes from having separate & independent memory.
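
The difference between shared and dedicated memory can be seen in a small sketch.  The names `counter`, `increment`, `run_in_thread` and `run_in_process` are hypothetical, introduced only for this illustration.

```python
import multiprocessing
import threading

counter = {"value": 0}


def increment():
    counter["value"] += 1


def run_in_thread():
    # a thread shares memory with its parent, so the change is visible here
    thread = threading.Thread(target=increment)
    thread.start()
    thread.join()
    return counter["value"]


def run_in_process():
    # a child process works on its own copy, so the parent's value is unchanged
    process = multiprocessing.Process(target=increment)
    process.start()
    process.join()
    return counter["value"]


if __name__ == "__main__":
    print(run_in_thread())   # 1 - the thread mutated the shared dictionary
    print(run_in_process())  # still 1 - the child changed its own copy
```

The `if __name__ == "__main__":` guard matters on platforms where new processes are started by re-importing the script (Windows, and macOS by default).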


## Additional detail

#### Multi-threading 

Difference between:

- one core taking turns to execute multiple threads in one process,
- multiple cores running multiple threads in one process at the same time.

Single CPU core multi-threading is an opportunity for concurrency.

Multiple CPU core multi-threading is an opportunity for both concurrency and parallelism.


#### Hyperthreading

Allows one physical core to run multiple threads.  One core presents itself as two virtual cores, which can run two threads.

# Concurrency, Parallelism & Asynchrony

## Motivating Examples

<center><img src="../assets/examples.png" alt="Drawing" style="width: 800px;"/></center>

## Concurrency

**Concurrency = handling/managing many things at once**.

Opportunities for concurrency in programs are places where the order doesn't matter, such as reading two files from disk - either file can be read first.

- tasks can be run in a different order sequentially, in parallel or with overlap.

Exploiting concurrency can involve:

- using more hardware (ie additional CPU cores),
- communicating (ie waiting for an asynchronous callback/future/promise to return).

Common examples in computer programs for concurrency:

- IO bound problems - reading/writing to disk/network,
- CPU bound problems - parallelizable tasks.


## Parallelism

**Parallelism = doing many things at once** - it is a form of concurrency:

- requires many workers (to all work at the same time),
- requires independent work - work needs to be split into many tasks that can run at the same time.

Opportunities for parallelism in code are in sequential iteration of independent tasks, such as `for` loops or `map`.

We can parallelize compute across:

- threads (in one CPU core),
- CPU cores (in one machine),
- multiple machines,
- multiple data centres,
- multiple regions,
- multiple planets.


## Asynchrony

**Asynchronous programming = handing off the CPU to other tasks while we wait** - it is also a form of concurrency:

- ability to wait for outside/external process without blocking the main thread of execution.

This gives us a form of parallelism with a single CPU - our main thread can be working on something else while we are waiting for the network or disk.
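
A minimal sketch of this hand-off, using `asyncio.sleep` as a stand-in for waiting on the network or disk.  The names `events`, `task` and `main` are hypothetical, and the code is meant to be run as a script rather than inside a notebook.

```python
import asyncio

events = []


async def task(name, delay):
    events.append(f"{name} start")
    # hand control back to the event loop while we wait
    await asyncio.sleep(delay)
    events.append(f"{name} done")


async def main():
    # both tasks run on one thread - the event loop switches between them at each await
    await asyncio.gather(task("slow", 0.2), task("fast", 0.1))


asyncio.run(main())
print(events)
# ['slow start', 'fast start', 'fast done', 'slow done']
```

The fast task finishes while the slow task is still waiting, even though only one thread is doing the work.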


## Examples

### You can't grow a tree faster

Some jobs cannot be performed concurrently - you just need to wait for the tree to grow.

There is no way to split up, separate or independently run any task to speed up how long it takes to grow one tree.

<center><img src="../assets/tree.jpg" alt="Drawing" style="width: 500px;"/></center>


### You can mow one lawn faster

We can split up a lawn into independent sections - introducing additional hardware (mowers) to achieve more by running them in parallel.

<center><img src="../assets/lawn-full.png" alt="Drawing" style="width: 500px;"/></center>


### You can play many games of chess at once

One chess grandmaster can play many games at once:

- each game can be suspended while the opponent thinks, so it doesn't block the main thread of execution for the grandmaster,
- the grandmaster can later resume the game.

A synchronous version would mean the grandmaster sits & waits at the first game while the opponent decides on each move, finishing that game before starting the next.

<center><img src="../assets/chess.jpg" alt="Drawing" style="width: 500px;"/></center>

# CPU bound & IO bound problems

Computers do three things:

1. **compute** - transform data into other data,
2. **communicate** - talk to other computers over a network,
3. **store** - remember data for access later.

This functionality can be mapped to problems or bottlenecks we have when using computers:

1. CPU bound problems (compute),
2. IO bound problems (communication),
3. memory problems (storage).

This notebook focuses on CPU & IO bound problems:

1. solve CPU bound problems with `multiprocessing`,
2. solve IO bound problems with `asyncio`.

# `threading`

Threads are the basic unit of computation in a process.

Using multiple CPU cores to execute multiple threads gives parallelism across threads.

Multi-threading is an efficient way to speed up programs, as all threads exist in the same process with the same memory space.  This sharing of the same memory space is also multi-threading's weakness - multiple threads accessing the same memory at the same time can cause problems like race conditions or deadlocks.

These problems are managed by locking access to memory in the correct way - which is hard to program.
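
As a small illustration of locking, the sketch below has four threads incrementing one shared counter, with a `threading.Lock` guarding the update.  The names `counter`, `lock` and `add` are hypothetical.

```python
import threading

counter = 0
lock = threading.Lock()


def add(n):
    global counter
    for _ in range(n):
        # only one thread at a time may run the read-modify-write below
        with lock:
            counter += 1


threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter)  # 40000
```

Without the lock, two threads could read the same old value and both write back the same new value, losing an update.  This example is easy to get right - real programs with many locks and many shared objects are much harder.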


## Multi-threading in Python

**Python is not thread safe** - more precisely, the memory management of CPython (the standard interpreter) is not thread safe - simultaneous execution of multiple threads in a single process could lead to different threads mutating the same space in memory.

Memory safety in Python is achieved through a Global Interpreter Lock (GIL) - the GIL allows only one thread to execute Python bytecode at a time.

Writing multi-threaded code in Python is possible but hard - see [Raymond Hettinger, Keynote on Concurrency, PyBay 2017](https://www.youtube.com/watch?v=9zinZmE3Ogk).


## `threading` in Python - One Thread, No Parallelism

What the Python standard library does offer in the `threading` module is the ability to run multiple threads concurrently - but because of the GIL, only one thread executes Python code at any one time.

This makes `threading` suitable only for IO bound tasks - `threading` offers no parallelism across threads.

Using threads in Python is challenging - if you have an IO bound task, `asyncio` is a more modern choice - it's mostly programmer preference.
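
To see why threads can still help with IO bound work, here is a minimal sketch where `time.sleep` stands in for a blocking network or disk call.  The function name `fake_io` is hypothetical.

```python
import threading
import time


def fake_io(seconds):
    # time.sleep stands in for a blocking network or disk call -
    # the GIL is released while a thread is blocked like this
    time.sleep(seconds)


start = time.perf_counter()
threads = [threading.Thread(target=fake_io, args=(0.2,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
elapsed = time.perf_counter() - start

print(f"{elapsed:.2f}s")  # close to 0.2s rather than 4 x 0.2s
```

The four waits overlap because no thread needs the interpreter while it is blocked.  Replace the sleep with pure Python computation and the threads would take turns instead.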


## Why not to use `threading`

- limited to a single thread of execution at a time (can't do true multi-threading),
- `asyncio` offers the same as `threading` - giving `threading` no space in your toolkit.


## But you can multi-thread in Python through C/C++

In the same way that a tuple isn't really immutable, you can actually safely multi-thread in Python.

`numpy` is an example of easy access to the power of fast, multi-threaded computation in Python.

In [None]:
import numpy as np

# `multiprocessing`

`multiprocessing` is a powerful tool for Python programmers - it is a solution to CPU bound problems.


## `multiprocessing` is easy to program

Of the three Python standard library modules we look at here, `multiprocessing` is the easiest to implement.  Any `for` or `map` can easily be changed from sequential to parallel execution with 2-3 lines of code.


## `multiprocessing` has costs

Processes do not share memory - they have their own memory space.  This makes them safe but expensive.

This cost (increased memory consumption, fixed startup time of n processes) is often trivial, but can be catastrophic (especially in terms of memory).

Another hidden cost with `multiprocessing` is that output from your subprocesses (created to parallelize compute across) may not appear in the notebook's STDOUT - logging can be more complex.

In [None]:
from helpers import compute_intensive_task

required_words = 8
word_length = 1000000

In [None]:
%%timeit -n 1 -r 1
words = []
for _ in range(required_words):
    words.append(compute_intensive_task(word_length))

First step is to re-write this to use a `map`:

In [None]:
%%timeit -n 1 -r 1
words = list(map(compute_intensive_task, [word_length] * required_words))

And finally we can convert this to run in parallel with `multiprocessing`:

In [None]:
%%timeit -n 1 -r 1
import multiprocessing

with multiprocessing.Pool(2) as pool:
    words = pool.map(compute_intensive_task, [word_length] * required_words)
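
The same three steps can be written as a self-contained script.  The code for `compute_intensive_task` is not shown in this notebook, so the sketch below uses a hypothetical CPU bound stand-in called `sum_of_squares`.

```python
import multiprocessing


def sum_of_squares(n):
    # hypothetical CPU bound stand-in for compute_intensive_task
    return sum(i * i for i in range(n))


def sequential(inputs):
    return list(map(sum_of_squares, inputs))


def parallel(inputs, processes=2):
    # each worker process receives a share of the inputs
    with multiprocessing.Pool(processes) as pool:
        return pool.map(sum_of_squares, inputs)


if __name__ == "__main__":
    inputs = [200_000] * 8
    # same results, computed across two processes
    assert parallel(inputs) == sequential(inputs)
```

The function passed to `pool.map` needs to be defined at the top level of a module so it can be sent to the worker processes.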

# `asyncio`

`asyncio` was introduced in Python 3.4 - it's a library to write single threaded, concurrent programs.  The `async` & `await` syntax/keywords it is used with arrived in Python 3.5.

## `asyncio` is harder to program

At the core of using `asyncio` are asynchronous functions that can stop & start - known as co-routines.

`asyncio` allows us to switch tasks where we want using `await` (you may also see the older `yield from` syntax).

Making a Python program asynchronous using `asyncio` requires changing:

- how you structure the program (using non-blocking versions of normally blocking network/disk traffic),
- how you run it (`asyncio.run` etc).

A further challenge using `asyncio` is that Jupyter is already running an `asyncio` event loop - `asyncio.run` cannot be called from inside a running event loop, so we run our async examples outside this notebook.

In [None]:
import asyncio
asyncio.get_event_loop()

## `asyncio` is still single threaded

One limitation is that `asyncio` is still single threaded (the event loop runs all coroutines on one thread) - so it is not suitable for parallelizing CPU bound programs.

Another is that using `asyncio` requires using non-blocking versions of things you currently do.

This means instead of doing:

In [None]:
%%timeit -n 1 -r 1

import time
time.sleep(2)

We do something like:

```python
import asyncio
# `await` is only valid inside an `async def` function
await asyncio.sleep(2)
```

In [None]:
#  need to run in script
!python ../src/asyncio_python.py
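
The contents of `../src/asyncio_python.py` are not shown in this notebook.  A minimal sketch of what a script like it could look like is below - the names `wait` and `main` are hypothetical.

```python
import asyncio
import time


async def wait(seconds):
    # non-blocking replacement for time.sleep
    await asyncio.sleep(seconds)
    return seconds


async def main():
    start = time.perf_counter()
    # the three waits overlap, so the total is close to the longest one
    results = await asyncio.gather(wait(0.1), wait(0.2), wait(0.3))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(main())
    print(results, f"{elapsed:.2f}s")
```

`asyncio.gather` returns results in the order the coroutines were passed in, regardless of which one finished first.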

Similar story for any other blocking call we want to unblock:

Opening files:

`open('filename', 'r')` -> `aiofiles.open('filename', 'r')`,

Network requests:

`requests.get`

versus

```python
import httpx  # third party

async def fetch():
    async with httpx.AsyncClient() as client:
        return await client.get('https://www.example.com/')
```

Popular libraries for these calls:
- `httpx`, `aiofiles`, `aiohttp`.

# Exercises

Now it's your turn ^^

Hints:
- you can use `$ time python ./your_script.py` to measure execution time (note that the `time` command differs between operating systems & shells),
- answers are in `notebooks/answers.py`.


## Exercise

Convert the CPU bound code in `./src/cpu_bound.py` to use `multiprocessing`.

Hints:

- `multiprocessing.Pool().map()`,
- `functools.partial` & `pool.starmap` are often useful.
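
As a sketch of how these two hints fit together, the example below uses a hypothetical two-argument function `power` - it is not taken from `./src/cpu_bound.py`.

```python
import functools
import multiprocessing


def power(base, exponent):
    return base ** exponent


# partial fixes one argument, leaving a function of a single input for pool.map
square = functools.partial(power, exponent=2)

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        # map: one argument per task
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
        # starmap: each tuple is unpacked into the function's arguments
        print(pool.starmap(power, [(2, 3), (3, 2)]))  # [8, 9]
```

Use `functools.partial` when some arguments are the same for every task, and `pool.starmap` when every task has its own tuple of arguments.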

## Exercise

Convert the IO bound code in `./src/io_bound.py` to use `asyncio`.

Hints:

- write & run the code in a separate Python `.py` script,
- you need to make a choice RE how to introduce non-blocking disk/network IO - see `aiofiles`, `aiohttp` or `httpx`.

## Exercise

Convert this code to use `multiprocessing` and/or `asyncio` as appropriate.

Hints:

- where is this an IO bound versus a CPU bound problem?


In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import requests


Path('./data').mkdir(exist_ok=True)
months = [str(month).zfill(2) for month in range(1, 5)]
urls = [
    f"http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2021/MMSDM_2021_{month}/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_DISPATCH_UNIT_SCADA_2021{month}010000.zip"
    for month in months
]

def download(url):
    print(f"downloading {url}")
    response = requests.get(url)
    assert response.ok
    fname = url.split('/')[-1]
    Path(f'./data/{fname}').write_bytes(response.content)
    
    
def process(url):
    print(f"processing {url}")
    fname = url.split('/')[-1]
    data = pd.read_csv(f'./data/{fname}', skiprows=1)
    data = data.groupby('SETTLEMENTDATE').agg(['mean', 'std'])
   

def main(urls):
    for url in urls:
        download(url)
        process(url)

main(urls)