# Distributed Computation

## Quick Recap - Multiprocessing vs. Multithreading

Multiprocessing can utilize multiple CPU cores and therefore achieves parallel computation in the truer sense. However, processes do not share a memory space by default, so multiprocessing suffers whenever the workers need to share data.

On the other hand, threads within a process share a common memory space and achieve a looser sense of parallel computation. In CPython, multithreading is more like rapidly looping through multiple queues while each thread waits for its turn. When a thread's turn comes, it acquires the Global Interpreter Lock (GIL), which allows only one thread at a time to execute Python bytecode within the process. The thread holds the GIL until the interpreter forces a switch after a short time slice, or until it hits a blocking operation such as an I/O action, at which point it releases the GIL and execution context-switches to the next thread.
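
A minimal sketch of this behavior, assuming `time.sleep()` as a stand-in for a blocking I/O call (the `fake_io` helper and the 0.2-second delay are made up for illustration). Because a sleeping thread releases the GIL, the five waits can overlap instead of adding up:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(task_id: int, delay: float = 0.2) -> int:
    # hypothetical stand-in for a blocking I/O call; sleeping releases the GIL
    time.sleep(delay)
    return task_id


def run_threaded(n_tasks: int = 5) -> tuple:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        # map() returns the results in the order of the inputs
        results = list(pool.map(fake_io, range(n_tasks)))
    return results, time.perf_counter() - start


results, elapsed = run_threaded()
print(results, f'{elapsed:.2f} seconds')
```

The elapsed time should land close to a single delay rather than the sum of all five. Replacing the sleep with a pure-Python computation would not show the same benefit, since only one thread can hold the GIL at a time.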

### Asynchronous I/O Loops, and Multithreading's Little Brother - Coroutines

Both multiprocessing and multithreading rely on hardware and operating system coordination. As software technologies matured, engineers started to explore capabilities within the application layer itself. The same conceptual model of multithreading (again, the rapid looping) gets a new interpretation within the programming stack (such as Python, or more precisely the CPython runtime), giving more direct control to the program within the runtime instead of relying on the OS mechanism to switch context between threads. The _coroutine_, a term coined in 1958, is one materialization of this concept. Coroutines can be seen as lightweight threads and have been implemented in many languages, including Python. An implementation usually involves a dedicated asynchronous I/O loop (or event loop) that switches between coroutines, with more direct control than threading over when each task is suspended and resumed.
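
To make the suspend-and-resume idea concrete, here is a small sketch using only the standard library's `asyncio`. The `worker` coroutine, its names, and its delays are hypothetical; `asyncio.sleep()` stands in for an I/O wait, and each `await` marks a point where the coroutine hands control back to the event loop:

```python
import asyncio


async def worker(name: str, delay: float, log: list) -> str:
    log.append(f'{name} start')
    # suspension point: the event loop is free to run other coroutines
    await asyncio.sleep(delay)
    log.append(f'{name} done')
    return name


async def demo() -> tuple:
    log = []
    # gather() runs both coroutines concurrently, results keep the input order
    results = await asyncio.gather(
        worker('a', 0.2, log),
        worker('b', 0.1, log),
    )
    return results, log


# in a plain script; inside a notebook, use `await demo()` instead
results, log = asyncio.run(demo())
print(results)
print(log)
```

The log shows `b` finishing before `a` even though `a` started first, all within a single thread.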

Both multithreading and coroutine techniques are suitable for I/O-bound tasks, such as reading files from a disk or making HTTP requests. Coroutines tend to require less computing resource overhead than their more OS-native sibling.

In [13]:
import random
import requests


# 10 requests, each with a delay ranging from 1 to 3 seconds
reqs = [
    f'https://httpbin.org/delay/{random.randint(1, 3)}'
    for _ in range(10)
]

def get_sync():
    all_data = []
    for req in reqs:
        res = requests.get(req)
        all_data.append(res.json())

    return all_data

In [14]:
%time res = get_sync()
print(res[-1], len(res))

CPU times: user 163 ms, sys: 12.9 ms, total: 176 ms
Wall time: 23.6 s
{'args': {}, 'data': '', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.25.1', 'X-Amzn-Trace-Id': 'Root=1-60d10ac9-550eb0d6203262b369bd3576'}, 'origin': '107.179.188.69', 'url': 'https://httpbin.org/delay/2'} 10


In [15]:
import asyncio

import aiohttp


async def get(session, url):
    res = await session.request('GET', url=url)
    data = await res.json()
    return data


async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for req in reqs:
            tasks.append(get(session, url=req))

        all_data = await asyncio.gather(*tasks, return_exceptions=True)
        return all_data

In [16]:
'''
Two things about Jupyter Notebook environment
1. Cannot use %time magic for async functions
2. No explicit event loop initiation (it's already in one)
'''
import time

# loop = asyncio.get_event_loop()
# loop.run_until_complete(main())
start = time.time()
res = await main()
print(round(time.time() - start, 1), 'seconds')
print(res[-1], len(res))

3.2 seconds
{'args': {}, 'data': '', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'Python/3.8 aiohttp/3.7.4.post0', 'X-Amzn-Trace-Id': 'Root=1-60d10acc-6bdccf035163f7cc14d45527'}, 'origin': '107.179.188.69', 'url': 'https://httpbin.org/delay/2'} 10


Let's use the request URLs to verify the delays involved in the API calls, using the functional techniques `map()` and `reduce()` to extract the delays (in seconds) and sum them up.

In [17]:
delays = [*map(lambda url: int(url.split('/')[-1]), reqs)]
delays

[1, 1, 1, 2, 3, 3, 2, 3, 2, 2]

In [18]:
from functools import reduce

total_delay = reduce(lambda left, right: left + right, [*delays], 0)
total_delay

20

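The theoretical total delay (20 seconds) roughly matches the wall time of the synchronous process (23.6 seconds), while the _maximum_ of the individual delays (3 seconds) roughly matches the wall time of the asynchronous process (3.2 seconds). The remainder in both cases is presumably network and connection overhead.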
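
The two estimates can be reproduced with a short calculation. This is an idealized model that ignores network overhead and any connection limits; the `delays` list is copied from the output above:

```python
# delays (in seconds) copied from the output of the cell above
delays = [1, 1, 1, 2, 3, 3, 2, 3, 2, 2]

# synchronous: requests run one after another, so the waits add up
sequential_estimate = sum(delays)
# asynchronous: requests wait at the same time, so the slowest one dominates
concurrent_estimate = max(delays)

print(sequential_estimate, concurrent_estimate)  # 20 3
```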

## MapReduce - Distributing Computing Resources

Recall one of the most critical disadvantages of multiprocessing: the lack of a shared memory space. This constraint often makes parallel computation across multiple cores on a single machine less desirable. One workaround is to leverage a datastore (such as files or databases backed by hard drives) as an inter-process data pool that multiple processes can read from and write to simultaneously, such as a local SQLite database, as demonstrated in [a previous part](./11-work-with-sql.ipynb).
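
As a rough illustration of the datastore workaround (not the code from the previous part), the sketch below launches a few independent Python processes that all write to one SQLite file. The `results` table, the `CHILD` script, and the squared value are hypothetical placeholders, and `subprocess` is used only to keep the example self-contained; the same idea applies to `multiprocessing` workers:

```python
import os
import sqlite3
import subprocess
import sys
import tempfile

# code run by each child process: insert one row into the shared database
CHILD = '''
import sqlite3, sys
db_path, worker_id = sys.argv[1], int(sys.argv[2])
con = sqlite3.connect(db_path, timeout=30)  # wait if another writer holds the lock
with con:  # commits on success
    con.execute('INSERT INTO results VALUES (?, ?)', (worker_id, worker_id ** 2))
con.close()
'''


def run_workers(n_workers: int = 4) -> list:
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, 'pool.db')
        con = sqlite3.connect(db_path)
        with con:
            con.execute('CREATE TABLE results (worker_id INTEGER, value INTEGER)')
        # launch the workers as separate OS processes, then wait for all of them
        procs = [
            subprocess.Popen([sys.executable, '-c', CHILD, db_path, str(i)])
            for i in range(n_workers)
        ]
        for proc in procs:
            proc.wait()
        # the parent reads back what the children wrote
        rows = con.execute(
            'SELECT worker_id, value FROM results ORDER BY worker_id'
        ).fetchall()
        con.close()
        return rows


print(run_workers())
```

The processes never share memory; the database file is the only meeting point, and SQLite's file locking serializes the concurrent writes.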

If we extend this problem to a larger scale, where the dataset we want to work with not only exceeds the available memory but is also impossible to store efficiently (or at all) on the disk of a single machine, then we need to revisit the viable solutions.

The MapReduce model is a _divide and conquer_ strategy that applies and extends the functional programming concepts of `map()` and `reduce()`. A large dataset is dissected and distributed through a mapping procedure onto a multitude of "commodity" server nodes, which process the smaller portions in parallel; the resulting subsets are then reduced onto fewer and fewer nodes until the final outcome is complete.
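
The shape of the model can be sketched in plain Python with a word count. This runs in a single process purely for illustration; in a real cluster each `map_chunk` call would run on a different node. The helper names and the sample sentence are made up for this example:

```python
from collections import Counter
from functools import reduce


def split_into_chunks(items: list, n_chunks: int) -> list:
    # "distribute": deal the items out into roughly equal portions
    return [items[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk: list) -> Counter:
    # "map": each node counts the words in its own portion independently
    return Counter(chunk)


def reduce_counts(left: Counter, right: Counter) -> Counter:
    # "reduce": merge two partial results into one
    return left + right


words = 'to be or not to be that is the question'.split()
partials = list(map(map_chunk, split_into_chunks(words, n_chunks=3)))
totals = reduce(reduce_counts, partials, Counter())
print(totals['to'], totals['be'], totals['question'])  # 2 2 1
```

Because `reduce_counts` only ever needs two partial results at a time, the merging step can itself be spread over several nodes before converging on one.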

This model lets batch data processing scale out to a very large capacity by adding nodes, and offers a relatively cost-effective way to speed up the process.

It is worth knowing that the recent development of cheaper and faster hard drives, especially the widespread adoption of SSDs (solid-state drives), plays a vital role in enabling distributed computation.

The MapReduce model was pioneered [by Google](https://research.google/pubs/pub62/) in 2004 to solve the practical problem of the exponentially growing datasets behind their search indexes. Many others have since adopted the technology and contributed to its development and evolution through the open-source community.

### Apache Spark™

Apache Spark is one such open-source framework. Its 1.0 release came in 2014 (10 years after the initial MapReduce research paper), and it provides an elegant and unified abstraction for large-scale data processing that can efficiently utilize anything from multiple cores of a single machine to "multiple clouds".

For the sake of simplicity, we will demonstrate its usage on a single machine. First, obtain the number of CPU cores:

In [24]:
import multiprocessing

CORES = multiprocessing.cpu_count()
CORES

12

Borrowing from [another previous part](./12-generate-data.ipynb), where we attempted to generate random and hashed device IDs:

In [19]:
# mostly from Part 12 - Generate data
from uuid import uuid4
from hashlib import sha1


def gen_device_ids(count: int = 20_000) -> list:
    device_ids = []
    for _ in range(count):
        device_ids.append(str(uuid4()))
    # hash
    return [sha1(x.encode()).hexdigest() for x in device_ids]

Let's generate a relatively large batch of device IDs, say 1 million times the number of `CORES`:

In [25]:
import pandas as pd

%time device_ids = pd.DataFrame(gen_device_ids(count=1_000_000 * CORES), columns=['device_id'])

device_ids

CPU times: user 52.7 s, sys: 5.23 s, total: 57.9 s
Wall time: 58.1 s


| | device_id |
|---|---|
| 0 | b3bb3680a7f55bab654b55e9b863bb20319d2275 |
| 1 | 2cb7cfafe45326136c21b28f67d6fd37e5df09eb |
| 2 | 5c0152a869d09b97c71407197fe54b94d390ebd0 |
| 3 | 0aae3ef47af4250b120de7d9b4ce378148f5117f |
| 4 | d39eb8ce7181493ffabc2d4e85e37bc4a6fd1046 |
| ... | ... |
| 11999995 | 9b69945cd1b25563a86ac67cc0cdc8a4bbaa16bc |
| 11999996 | 855422cc57a941a31c5caaca4112e2e8089add93 |
| 11999997 | 7d231df831b8e304ecdc42965b659944f673a6d6 |
| 11999998 | d8f735c365c65e16849286d4add1d8673843411a |


In [26]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master(f'local[{CORES}]').getOrCreate()
spark

In [27]:
%%time

rdd = spark.sparkContext.parallelize(
    # a list of [1_000_000, ..., 1_000_000], where the length is the number of CORES
    [1_000_000] * CORES
).flatMap(
    lambda count: [(d,) for d in gen_device_ids(count)]
)

CPU times: user 1.77 ms, sys: 2.06 ms, total: 3.83 ms
Wall time: 240 ms
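
The near-instant wall time above is consistent with Spark transformations such as `flatMap()` being lazy: they only describe the computation, and the work happens when an action such as `take()` or `count()` asks for results. A rough plain-Python analogy (generators, not Spark's actual machinery; the `expensive` function is hypothetical) is:

```python
calls = []


def expensive(x: int) -> int:
    # hypothetical stand-in for a costly per-record computation
    calls.append(x)
    return x * x


# "transformation": builds a lazy pipeline, nothing is computed yet
pipeline = (expensive(x) for x in range(5))
calls_before = len(calls)

# "action": consuming the pipeline triggers the computation
total = sum(pipeline)
calls_after = len(calls)

print(calls_before, calls_after, total)  # 0 5 30
```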


In [28]:
rdd.take(10)

[('fce5ebc469e311a076f3728b5efcb763e981c4bb',),
 ('d88078c656aaea444d07d7171b4101c12897b06f',),
 ('bf82f64f7659ddbbbcdf97620240bc4c0693666b',),
 ('e51aa729693db2f7c4f30b223334fdbe10424176',),
 ('d14b3f7eaff47f6910b3c9fe109f264edee1892c',),
 ('0adf59ec0a68c223e48b157d6506a02faa9f35e6',),
 ('9506af476fca4e6e29f9478d3ee69a02298cb2c0',),
 ('42c879b781d666b0f991b5a7fb3d55eeaa9ea678',),
 ('ae67613f969adabf521578c6e02827f239be0036',),
 ('7df6d9f75204f440187a429c9788ab3f31402a7c',)]

In [29]:
%time df = rdd.toDF(['device_id'])

CPU times: user 7.28 ms, sys: 2.75 ms, total: 10 ms
Wall time: 6.08 s


In [30]:
%time df.show()

+--------------------+
|           device_id|
+--------------------+
|cbf00cb2384157944...|
|43b75d2d1670c8128...|
|0cc778e3fb3b9b445...|
|806402f4d57f036b2...|
|84b09b47c40e5fb41...|
|40af5c531146774b2...|
|fcefc7a5954fd2575...|
|b3ddc0a27c12c3202...|
|d03ea43b57b7d5eb0...|
|aba71b2421fe48a2c...|
|76bae0fa4ae9e7087...|
|bf0d964e22120f4dc...|
|81baaab08a7786e10...|
|aa0cac18fceb534d2...|
|22428e2dc78a1df0f...|
|c2b8562fb2a2e578c...|
|c800e2ecddcb76515...|
|a80b6ba01870d975e...|
|81becefd56f651b04...|
|184b81305b125fc90...|
+--------------------+
only showing top 20 rows

CPU times: user 991 µs, sys: 1.17 ms, total: 2.16 ms
Wall time: 5.44 s


In [31]:
%time df.count()

CPU times: user 2.16 ms, sys: 1.76 ms, total: 3.92 ms
Wall time: 16 s


12000000

In [32]:
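# one row per "cluster", so that each group below becomes an independent task
df = spark.createDataFrame([(i,) for i in range(CORES)], ['cluster'])

def _gen(cluster):
    # generate 1 million device IDs and tag every row with the cluster ID
    device_ids = gen_device_ids(count=1_000_000)
    pdf = pd.DataFrame(device_ids, columns=['device_id'])
    # assign the scalar value; assigning the Series would only fill index 0
    pdf['cluster'] = cluster
    return pdf.reset_index()

def gen_device_ids_udf(df):
    # df is the pandas DataFrame of one group (a single row here)
    output = []
    for _, row in df.iterrows():
        output.append(_gen(row['cluster']))

    return pd.concat(output)


# columns of the pandas DataFrame returned by the UDF
schema = 'index long, cluster long, device_id string'
%time df = df.groupby('cluster').applyInPandas(gen_device_ids_udf, schema=schema).drop('cluster', 'index')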

CPU times: user 47.9 ms, sys: 9.64 ms, total: 57.6 ms
Wall time: 286 ms


In [33]:
%time df.show()

+--------------------+
|           device_id|
+--------------------+
|4da9c0558916b527f...|
|562320ef2fd83ea9a...|
|02b94ee058a97ccfe...|
|9f631b8ad4971febb...|
|2029b2c38a6827026...|
|b0a1faaef6cbce175...|
|5e1cc1a654ab23995...|
|f2a4d5f5bd2e0c789...|
|46d469143a8f1b5f1...|
|e3ea69956fcc8d484...|
|7ba5ddc687032cb2f...|
|7e4d98b486e9db93d...|
|e526aea7cab06d7be...|
|61020ce8262da90b1...|
|bb39647008d9b5bd5...|
|e80eca555a2eda978...|
|a738f658d1c395188...|
|ea65abc7b7fb6e7e1...|
|56ad36bd1e56cc05d...|
|ad28cab848fa36c0c...|
+--------------------+
only showing top 20 rows

CPU times: user 3.13 ms, sys: 2.58 ms, total: 5.71 ms
Wall time: 6.37 s


In [34]:
%time df.count()

CPU times: user 14.6 ms, sys: 9.65 ms, total: 24.3 ms
Wall time: 13.4 s


12000000

In [35]:
%%time

device_ids['first'] = device_ids.device_id.astype(str).str[0]
gb_char = device_ids.groupby('first').agg({'device_id': 'count'})
gb_char

CPU times: user 6.03 s, sys: 335 ms, total: 6.36 s
Wall time: 6.38 s


| first | device_id |
|---|---|
| 0 | 750199 |
| 1 | 749324 |
| 2 | 749898 |
| 3 | 750227 |
| 4 | 750237 |
| 5 | 749944 |
| 6 | 749974 |
| 7 | 750928 |
| 8 | 750079 |
| 9 | 750166 |


In [36]:
%%time

# Spark's substr() positions are 1-based
df = df.withColumn('first', df.device_id.substr(1, 1))
gb_char = df.groupby('first').agg({'device_id': 'count'})
gb_char.show()

+-----+----------------+
|first|count(device_id)|
+-----+----------------+
|    7|          748909|
|    3|          748245|
|    8|          749328|
|    0|          750360|
|    f|          752284|
|    5|          750596|
|    6|          750622|
|    e|          748313|
|    d|          749367|
|    c|          750413|
|    9|          749946|
|    1|          750750|
|    b|          749675|
|    4|          750549|
|    a|          750601|
|    2|          750042|
+-----+----------------+

CPU times: user 107 ms, sys: 25.9 ms, total: 133 ms
Wall time: 14.5 s
