## Parallel Computing Concepts

+ We've not yet discussed Bodo's utility for simplifying parallel computing.
+ Let's quickly review some key ideas there.

---

+ Parallel computing is hard!
  + Developing, debugging programs
  + Load balancing, synchronization
  + Porting programs, etc.

+ Shared versus Distributed Memory

---

+ Parallel programming often takes years for programmers to master.
+ Debugging is hard, and so is balancing workloads across workers, synchronizing them, porting programs to different hardware, and so on.
---
+ Another important distinction parallel programmers need to understand is *shared* versus *distributed memory*.

---

#### Shared memory computing
<center>
    <img src='./img/shared-memory.svg' width='35%'></img>
</center>

+ Shared memory pool for several *threads*
+ Challenge: race conditions

+ The shared memory paradigm can be thought of as numerous workers sharing a common pool of memory.
+ Your laptop likely has *threads* running that all share common RAM and disk access.
+ Parallel programming with shared memory is complicated by *race conditions*...
 + ...when distinct threads try to read from or write to the same memory location at the same time.
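
As a rough illustration of the race-condition point, the sketch below uses Python's standard `threading` module (not Bodo or `ipyparallel`). Several threads increment one shared counter, and a lock serializes each read-modify-write. The function name `run_counter` and its parameters are made up for this example.

```python
import threading

def run_counter(n_threads=4, n_increments=10_000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(n_increments):
            # `counter += 1` is a read-modify-write, not an atomic step;
            # the lock ensures only one thread performs it at a time.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

total = run_counter()
print(total)  # 40000
```

Without the lock, two threads could read the same old value and one update could be lost. Whether that happens on a given run depends on the interpreter and on timing.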

---

#### Distributed memory computing
<center>
    <img src='./img/distributed-memory.svg' width='45%'></img>
</center>

+ Distinct memory for each worker
+ Commonly used with *clusters*
+ Challenge: communication, synchronization

+ The distributed memory paradigm can be thought of as numerous isolated workers.
+ Clusters & supercomputers use distributed processes with their own RAM and disk drives connected over a network.
+ Communication is a challenge with distributed computing because workers often need to send data to & fro at different parts of a computation.

---

### Distributed computing with Bodo

+ Distributed computing with `ipyparallel`
+  *SPMD* (Single Program Multiple Data) paradigm

+ Bodo uses `ipyparallel` for distributed computing—even on a single machine.
+ The principal approach in Bodo is called *SPMD* (Single Program Multiple Data).

---

#### SPMD Paradigm
<center>
    <img src='./img/SPMD-1.svg'  width='70%'></img>
</center>

+ In SPMD, a single program is written with instructions for all workers.
+ Conceptually, data can be thought of in *distinct* chunks.

---

#### SPMD Paradigm
<center>
    <img src='./img/SPMD-2.svg' width='70%'></img>
</center>

+ In a typical SPMD program, data have to be *distributed* to each worker.
+ The common program guides each worker's operations on its specific chunk of data.
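
The SPMD idea can be sketched in plain Python as one function that every worker runs, with the worker's rank selecting its chunk. This is only a sequential simulation: `spmd_program`, `data`, and `size` are hypothetical names, and a real SPMD run would launch the workers concurrently.

```python
def spmd_program(rank, size, data):
    """The single program: every worker runs this same function."""
    chunk = len(data) // size          # assumes len(data) is divisible by size
    my_chunk = data[rank * chunk:(rank + 1) * chunk]
    return sum(my_chunk)               # work done on this worker's chunk only

data = list(range(8))                  # hypothetical data set: 0, 1, ..., 7
size = 4                               # hypothetical number of workers

# Simulate the workers one after another; real workers run at the same time.
partials = [spmd_program(rank, size, data) for rank in range(size)]
print(partials)                        # [1, 5, 9, 13]
```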

---

#### SPMD Paradigm
<center>
    <img src='./img/SPMD-3.svg' width='70%'></img>
</center>

+ Again, with distributed memory systems, it's challenging to manage *communication* between workers.
+ Data may need to be shared at different checkpoints, which requires synchronization between workers.

---

#### Parallel computing with `ipyparallel`

In [1]:
import ipyparallel as ipp
n_proc = 4
rc = (ipp.Cluster(engines='mpi', n=n_proc)
         .start_and_connect_sync(activate=True))

Starting 4 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
100%|█████████████████████████████████████████| 4/4 [00:08<00:00,  2.17s/engine]


+ When running `bodo` in a Jupyter notebook with multiple engines—the relevant `ipyparallel` term for workers—some set up is needed.
+ This notebook is executed on a machine with 4 cores, which we'll use as engines.
  + In later notebooks executed on the Bodo cloud platform, there are far more than 4 engines used.
  + We use 4 here to keep the output manageable.

---

```python
import ipyparallel as ipp
import psutil
n_proc = max(psutil.cpu_count(logical=False), 2)
rc = (ipp.Cluster(engines='mpi', n=n_proc)
         .start_and_connect_sync(activate=True))
```

+ Done automatically on [`platform.bodo.ai`](https://platform.bodo.ai/notebooks)


---
+ The details of this boilerplate code are complicated; don't worry about it here.
---
+ This is executed behind the scenes on [`platform.bodo.ai`](https://platform.bodo.ai/notebooks)
  + On that platform, users don't need this boilerplate in their Jupyter notebooks.

---

#### Distinct *namespaces*

+ Jupyter kernel & `ipyparallel` *engines*
+ Execute commands on engines:
  + `%px` line magic
  + `%%px` cell magic

---

+ Bodo users *do* need to understand that, in this SPMD framework, engines have their own *namespace*...
+ ... and the Jupyter kernel has its own namespace as well.
+ Commands executed using IPython `px` magics run on distinct engines.
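
To make the idea of separate namespaces concrete, here is a small stand-in that uses plain dictionaries and `exec`. It is not how `ipyparallel` works internally, and the names `kernel_ns` and `engine_ns` are invented for this sketch.

```python
kernel_ns = {}                        # stands in for the Jupyter kernel namespace
engine_ns = [{} for _ in range(4)]    # stands in for 4 engine namespaces

# An ordinary cell: runs in the kernel namespace only.
exec("a = 4\nfrom math import sqrt", kernel_ns)

# A cell under a `%%px`-like magic: runs once in every engine namespace.
for ns in engine_ns:
    exec("b = [0, 1, 2]", ns)

print('a' in kernel_ns, 'b' in kernel_ns)   # True False
print(['a' in ns for ns in engine_ns])      # [False, False, False, False]
print(['b' in ns for ns in engine_ns])      # [True, True, True, True]
```

A name defined in one namespace is simply absent from the others until the same definition is executed there too.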

---

In [2]:
# This cell executes in the main Jupyter kernel
a = 4
from math import sqrt

In [3]:
# This cell executes in the main Jupyter kernel
def missing(s):
    return f'Identifier \'{s}\' not found in this namespace'

assert 'a'    in globals(), missing('a')
assert 'sqrt' in globals(), missing('sqrt')
print(f'a == {a}')

a == 4


---

+ To see this clearly, execute a standard Jupyter notebook cell.
+ This executes with a single Jupyter kernel with its own namespace, defining the objects `a` and `sqrt`.

---

+ We'll examine this namespace using the dictionary `globals()` in some `assert` statements.
+ These assertions are both `True`, so no `AssertionError`s are raised.
+ The last line of the cell executes without any problem.

---

```python
%%px
# This cell executes on every engine
def missing(s):
    return f'Identifier \'{s}\' not found in this namespace'
assert 'a'    in globals(), missing('a')
assert 'sqrt' in globals(), missing('sqrt')
print(f'a == {a}')
```

+ Expect `AssertionError` when executed...

+ We'll repeat the last cell with the `px` *cell* magic (with two percent symbols)
+ This means every line gets executed in the namespace of each `ipyparallel` engine.

+ Executing this cell yields messy assertion errors on every engine...

---

In [4]:
%%px
# executed in every engine's namespace
def missing(s):
    return f'Identifier \'{s}\' not found in this namespace'
assert 'a'    in globals(), missing('a')
assert 'sqrt' in globals(), missing('sqrt')
print(f'a == {a}')

[2:execute]
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [1], in <cell line: 4>()
      2 def missing(s):
      3     return f'Identifier \'{s}\' not found in this namespace'
----> 4 assert 'a'    in globals(), missing('a')
      5 assert 'sqrt' in globals(), missing('sqrt

AlreadyDisplayedError: 4 errors

+ ... because the individual engines do not have the identifiers `a` and `sqrt` in their respective namespaces.
+ Observe each engine displays the exception; the results appear in non-deterministic order.
  + This is a common feature of parallel programs.

---

In [5]:
%%px
# This *entire* cell executes in every engine's namespace
from numpy import sqrt
assert 'sqrt' in globals(), "Identifier 'sqrt' not found in this namespace"
b = sqrt(list(range(3)))
assert 'b' in globals(), "Identifier 'b' not found in this namespace"

+ The cell above executes (without output) on all engines.
+ It defines the identifiers `b` and `sqrt` on all 4 engines.
+ Again, these two `assert` statements execute without raising exceptions.

---

In [6]:
# This *line* executes in each engine's namespace
%px %who

[stdout:1] b	 missing	 sqrt	 


[stdout:0] b	 missing	 sqrt	 


[stdout:2] b	 missing	 sqrt	 


[stdout:3] b	 missing	 sqrt	 


---

+ Trying again, we'll use the `%px` line magic to execute `%who` on all engines; each prints to *standard output*.
+ The same names exist in each engine's namespace: the functions `missing` and `numpy.sqrt`, and the object labelled `b`.
+ Notice all the assertions pass.
+ Again, the *standard output* appears in non-deterministic order.

---

In [7]:
%px print(f'\tValue of b: {b}')   # <- All engines
main_str = '\n\nMain kernel namespace'     # <- Main kernel
print(f'{main_str}: sqrt(a) == {sqrt(a)}') # <- Main kernel

[stdout:0] 	Value of b: [0.         1.         1.41421356]


[stdout:1] 	Value of b: [0.         1.         1.41421356]


[stdout:2] 	Value of b: [0.         1.         1.41421356]


[stdout:3] 	Value of b: [0.         1.         1.41421356]




Main kernel namespace: sqrt(a) == 2.0


+ This cell executes a `print` statement on each engine and a different one on the main Jupyter kernel.
+ Notice the same data—a vector with entries 0, 1, & square root of 2—has been computed on each engine.
+ The last two lines of the cell—those that lack a `%px` magic—execute in the Jupyter kernel namespace.
+ They compute & print the square root of 4.

---

```python
# This produces an error in Jupyter kernel namespace
b = sqrt(list(range(5))) # math.sqrt != np.sqrt
```

+ Expect `TypeError` when executed...

+ If we attempt a command intended for the engines *without* the `px` magic
  + it executes in the Jupyter kernel.
  + We expect this line to yield an exception
  + (because, in this namespace, `sqrt` is `math.sqrt`)

---

In [8]:
# This produces an error in Jupyter kernel namespace
b = sqrt(list(range(5))) # math.sqrt != np.sqrt

TypeError: must be real number, not list

+ In this instance, the function `math.sqrt` cannot accept an iterable list as input (unlike `numpy.sqrt`).
+ The moral here is to make sure we know which namespace our code is meant to execute in.
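
The difference can be reproduced with the standard library alone. In this minimal sketch, `math.sqrt` rejects a list, while a list comprehension applies it one element at a time (`numpy.sqrt` does the elementwise step for you). The names `values`, `raised`, and `roots` are illustrative only.

```python
import math

values = list(range(5))

try:
    math.sqrt(values)                 # math.sqrt expects a single real number
    raised = False
except TypeError:
    raised = True

# Scalar-by-scalar alternative using only the standard library.
roots = [math.sqrt(v) for v in values]
print(raised, roots[4])               # True 2.0
```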

---

How is all this useful?

+ `bodo.get_rank` $\mapsto$ numeric identifier of engine
+ `bodo.get_size` $\mapsto$ total number of engines

---

+ Okay, so we can execute lines of code on different engines. So what?
+ Well, if we can *identify* distinct engines, we can *distribute* different data & have each engine do different work *concurrently*.
---
+ Borrowing from the *Message Passing Interface* (*MPI*), the functions `bodo.get_size` & `bodo.get_rank` permit engines to assess at run-time:
  + how many engines are available; and
  + which engine is executing this line of code.
+ By convention, Engine 0 is referred to as `root`.

---

In [9]:
%%px
import bodo, pandas as pd, numpy as np
# Illustrates querying ID number by each engine
print(f'\tThis is Engine #{bodo.get_rank():02d} of {bodo.get_size():2d} Engines')

[stdout:3] 	This is Engine #03 of  4 Engines


[stdout:2] 	This is Engine #02 of  4 Engines


[stdout:0] 	This is Engine #00 of  4 Engines


[stdout:1] 	This is Engine #01 of  4 Engines


+ As an example, let's run `bodo.get_rank` on each engine and print the result.
+ Again, this cell uses the `px` cell magic to execute all lines on all engines
  + The `bodo` module is imported on each engine (along with NumPy & Pandas).
  + Each engine prints its rank—from 0 to 3 here—and the total number of engines.
  + The sequence of outputs is nondeterministic & hence not ordered by engine rank.

---

In [10]:
import pandas as pd
DATA_ROOT = 'bodo-examples-data/bodo-training-fundamentals/DATA'
DATA_SRC = f's3://{DATA_ROOT}/CSV/samples_001.csv'
loading_opts = dict(storage_options=dict(anon=True))
df = pd.read_csv(DATA_SRC, **loading_opts, nrows=8)

+ Let's try to use this in a sensible example.
+ We'll lay out code to read a single CSV file from the S3 bucket.
+ Remember, without any `px` magics, this cell executes in the main Jupyter notebook kernel.

---

In [11]:
df[['Name']].transpose() # view 1st 8 Names in sequence

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Name | Tomas Talley | Paulene Greer | Barrett Mccray | Cammie Adkins | Breann Moses | Joey O'neal | Mellie Martin | Edwardo Fischer |


+ We look at the `Name` column (in a transposed view) from this DataFrame.
+ This shows names in sequence from all 8 rows of this DataFrame.

---

In [12]:
%%px
DATA_ROOT = 'bodo-examples-data/bodo-training-fundamentals/DATA'
DATA_SRC = f's3://{DATA_ROOT}/CSV/samples_001.csv'
loading_opts = dict(storage_options=dict(anon=True))

+ Let's repeat this idea on each engine.
+ We need to define the `DATA_SRC` & loading options again.

---

In [13]:
%%px
chunk_size = 2
my_rank = bodo.get_rank()
skiprows = range(1, my_rank*chunk_size + 1)
print(my_rank, list(skiprows))

[stdout:3] 3 [1, 2, 3, 4, 5, 6]


[stdout:0] 0 []


[stdout:1] 1 [1, 2]


[stdout:2] 2 [1, 2, 3, 4]


+ We'll define something called the `chunk_size`, the number of rows to read on each engine.
+ `my_rank` is the identifier of each engine...
+ ... and `skiprows` is an iterable range telling which rows each engine will skip when reading the CSV file.
+ We can print these out (as lists) to see what they are.
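
The rank-to-rows bookkeeping can be checked without reading any file. The sketch below reuses the `range(1, rank*chunk_size + 1)` formula from the cell above. The helper names `rows_to_skip` and `rows_read` are made up here, and the second helper assumes that file line 0 is the header and that each engine then reads `chunk_size` rows.

```python
def rows_to_skip(rank, chunk_size):
    """File lines to skip so that `rank` starts at its own chunk.

    Line 0 is the header, so it is never in the skipped range.
    """
    return range(1, rank * chunk_size + 1)

def rows_read(rank, chunk_size):
    """Zero-based data-row indices that `rank` ends up reading."""
    start = rank * chunk_size
    return list(range(start, start + chunk_size))

for rank in range(4):
    print(rank, list(rows_to_skip(rank, 2)), rows_read(rank, 2))
# 0 [] [0, 1]
# 1 [1, 2] [2, 3]
# 2 [1, 2, 3, 4] [4, 5]
# 3 [1, 2, 3, 4, 5, 6] [6, 7]
```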

---

In [14]:
%%px
df = pd.read_csv(DATA_SRC,
                 header=[0],
                 skiprows=skiprows,
                 nrows=chunk_size,
                 **loading_opts)

%px: 100%|█████████████████████████████████████| 4/4 [00:05<00:00,  1.35s/tasks]


+ To apply the preceding ideas in practice, we can, for instance, assign distinct chunks of a dataset to different engines.
+ We'll have each engine read a chunk of 2 lines from a CSV file.
+ Each engine needs to use its rank to determine the number of rows to skip (as given in `skiprows`).
+ When we execute `read_csv` with the `px` cell magic—and suitable parsing options—each engine reads its own portion of the file.

In [15]:
%%px
if my_rank==0:
    display(df)

[output:0]

| | Name | Age | Purchase_Date | Purchase_Amount | Product | Product_Review |
|---|---|---|---|---|---|---|
| 0 | Tomas Talley | 49 | 1994-12-09 13:50:37 | 16.3 | Health | |
| 1 | Paulene Greer | 31 | 2004-06-05 14:43:15 | 129.76 | Sporting-Goods | Good : Any element of a tuple can be accessed ... |


+ We can display the data stored on each engine.
+ So on engine 0, you can see the first two rows from the original DataFrame...

---

In [16]:
%%px
if my_rank==1:
    display(df)

[output:1]

| | Name | Age | Purchase_Date | Purchase_Amount | Product | Product_Review |
|---|---|---|---|---|---|---|
| 0 | Barrett Mccray | 69 | 1990-04-06 18:35:10 | 85.19 | Electronics | Fine : I don't even care. |
| 1 | Cammie Adkins | 57 | 2020-07-17 14:25:20 | 101.17 | Electronics | Good : He looked inquisitively at his keyboard... |


+ ...on engine 1, you see the 3rd & 4th rows...

---

In [17]:
%%px
if my_rank==2:
    display(df)

[output:2]

| | Name | Age | Purchase_Date | Purchase_Amount | Product | Product_Review |
|---|---|---|---|---|---|---|
| 0 | Breann Moses | 59.0 | 2003-03-16 17:18:02 | 7.49 | Food | Good : Where are my pants? |
| 1 | Joey O'neal | | 2021-11-21 14:54:29 | 40.5 | Toys | Terrible : He looked inquisitively at his keyb... |


+ ... on engine 2, you see the 5th & 6th rows...

---

In [18]:
%%px
if my_rank==3:
    display(df)

[output:3]

| | Name | Age | Purchase_Date | Purchase_Amount | Product | Product_Review |
|---|---|---|---|---|---|---|
| 0 | Mellie Martin | | 2008-08-20 14:06:02 | 939.31 | Computers | Terrible : Ports are created with the built-in... |
| 1 | Edwardo Fischer | 24.0 | 2002-12-27 14:42:21 | 14.44 | Music | Great : Erlang is known for its designs that a... |


+ ... and, finally, on engine 3, you see the 7th & 8th rows from the original DataFrame.
+ These match the data as viewed in the 8-row DataFrame in the main Jupyter kernel.

---

In [19]:
%%px
# df pre-computed on each engine...
partial_sum = df.Purchase_Amount.sum()
n_partial = len(df)
print(f'The {my_rank}th partial sum is ' +
      f'{partial_sum:6.2f} with ' +
      f'{n_partial} summands.')

[stdout:0] The 0th partial sum is  146.06 with 2 summands.


[stdout:2] The 2th partial sum is   47.99 with 2 summands.


[stdout:1] The 1th partial sum is  186.36 with 2 summands.


[stdout:3] The 3th partial sum is  953.75 with 2 summands.


+ Next, we compute a partial sum & count on each engine.
+ We can do this with standard Pandas approaches over small dataframes.
+ For convenience, print the partial results to standard output.

---

In [20]:
%%px
sums = bodo.gatherv(np.array([partial_sum], dtype=np.float64))
lens = bodo.gatherv(np.array([n_partial], dtype=np.int_))
if my_rank==0:
    print(f'\nAll partial sums:          {sums}')
    print(f'Lengths of all partitions: {lens}')

[stdout:0] 
All partial sums:          [146.06 186.36  47.99 953.75]
Lengths of all partitions: [2 2 2 2]


+ There's an additional step here; once the partial sums have been computed, the results need to be shared.
+ The function `bodo.gatherv`—based on an MPI function—communicates the partial results into a NumPy array.
+ By default, `gatherv` gathers the array on the root process (assigning `None` elsewhere).

---

In [21]:
%%px
if my_rank==0:
    mean = sum(sums) / sum(lens)
    print(f'\nThe mean purchase is {mean:.2f}.')

[stdout:0] 
The mean purchase is 166.77.


+ The last part of the program is to compute the general mean from the partial sums and counts.
+ This pattern—printing to standard output only on the root node—is common in SPMD programs.
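
The arithmetic in this last step can be checked by hand with the gathered values printed above. The sketch also shows why the sums and counts are gathered separately: with unequal chunks, averaging the per-engine means would give a different answer. The `uneven_*` values are hypothetical.

```python
sums = [146.06, 186.36, 47.99, 953.75]   # partial sums printed above
lens = [2, 2, 2, 2]                      # rows per engine

mean = sum(sums) / sum(lens)
print(f'{mean:.2f}')                     # 166.77

# Hypothetical uneven chunks: total-sum / total-count is the true mean,
# whereas the plain average of per-chunk means is not.
uneven_sums = [10.0, 20.0]
uneven_lens = [1, 4]
weighted = sum(uneven_sums) / sum(uneven_lens)
naive = sum(s / n for s, n in zip(uneven_sums, uneven_lens)) / len(uneven_sums)
print(weighted, naive)                   # 6.0 7.5
```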

---

In [22]:
%%px
# Generalize to compute on an entire file of 1,000,000 rows
DATA_SRC = f's3://{DATA_ROOT}/CSV/samples_001.csv'
chunk_size = 1_000_000 // bodo.get_size()
my_rank = bodo.get_rank()
skiprows = range(1, my_rank*chunk_size + 1)
df = pd.read_csv(DATA_SRC,
                 header=[0],
                 skiprows=skiprows,
                 nrows=chunk_size,
                 **loading_opts)
partial_sum = df.Purchase_Amount.sum()
n_partial = len(df)
sums = bodo.gatherv(np.array([partial_sum], dtype=np.float64))
lens = bodo.gatherv(np.array([n_partial], dtype=np.int_))

%px: 100%|█████████████████████████████████████| 4/4 [02:01<00:00, 30.29s/tasks]


+ The preceding cells can be generalised to read the full one million rows from the CSV file into chunks on each engine.
+ This is a more general SPMD program that uses the size of the data & the number of engines to work out how to distribute the data.
+ Once the chunks are loaded onto each engine, the relevant partial sums can be computed on each concurrently.
+ The partial sums and chunk lengths need to be gathered onto the root engine for the last step...

---

In [23]:
%%px
if my_rank==0:
    mean = sum(sums) / sum(lens)
    print(f'The mean purchase is {mean:.2f}.')

[stdout:0] The mean purchase is 223.45.


+ ...which is computing the actual mean and printing it out on the root engine.
+ The preceding logic generalises to allow the computation with a larger dataset (e.g., a million rows) to be split amongst our engines.
+ The size of the chunks to read can be tricky to balance among the engines.
---
+ This is a standard SPMD style of writing programs in MPI with ipyparallel.
+ The book-keeping is tedious but not difficult.
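
As an example of that book-keeping, the integer division `1_000_000 // bodo.get_size()` used earlier works out evenly for 4 engines, but with 3 engines it would leave one row unread. One common way to balance the chunks is sketched below in plain Python. The helper `block_bounds` is hypothetical and is not part of Bodo.

```python
def block_bounds(rank, size, n_rows):
    """Half-open [start, stop) row range for `rank`, balanced to within one row."""
    base, extra = divmod(n_rows, size)
    # The first `extra` ranks each take one additional row.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

bounds = [block_bounds(r, 3, 1_000_000) for r in range(3)]
print(bounds)  # [(0, 333334), (333334, 666667), (666667, 1000000)]
```

Each engine could then derive its own `skiprows` and `nrows` values from its `(start, stop)` pair.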

---

In [24]:
%%px
@bodo.jit
def compute_mean_bodo():
    DATA_ROOT = 'bodo-examples-data/bodo-training-fundamentals/DATA'
    DATA_SRC = f's3://{DATA_ROOT}/CSV/samples_001.csv'
    df = pd.read_csv(DATA_SRC)
    m = df.Purchase_Amount.mean()
    return m

+ Bodo provides a more elegant solution.
+ We start by writing a function—say, `compute_mean_bodo`—as we would write it for the single-processor case.
+ Then we apply the `bodo.jit` decorator.
+ Remember, we need the `%%px` cell magic to ensure this is defined on all engines.

---

In [25]:
%%px
mean = compute_mean_bodo()
if my_rank==0:
    print(f'The mean purchase is {mean:.2f}.')

[stdout:0] The mean purchase is 223.45.


%px: 100%|█████████████████████████████████████| 4/4 [01:32<00:00, 23.04s/tasks]


+ Sure enough, invoking `compute_mean_bodo` yields the same result.
+ But, it *didn't* require writing an MPI-style SPMD program, nor an explicit gather at the end.
+ And it still did the computation with work shared by distinct engines *concurrently*.

---

In [26]:
%px compute_mean_bodo.distributed_diagnostics()

[stdout:0] Distributed diagnostics for function compute_mean_bodo, /tmp/ipykernel_59390/794234824.py (1)

Data distributions:
   csv_table.7          1D_Block
   index_col.8          1D_Block
   arr_63               1D_Block

Parfor distributions:
   0                    1D_Block

Distributed listing for function compute_mean_bodo, /tmp/ipykernel_59390/794234824.py (1)
------------------------------------------------------------------------| parfor_id/variable: distribution
@bodo.jit                                                               | 
def compute_mean_bodo():                                                | 
    DATA_ROOT = 'bodo-examples-data/bodo-training-fundamentals/DATA'    | 
    DATA_SRC = f's3://{DATA_ROOT}/CSV/samples_001.csv'                  | 
    df = pd.read_csv(DATA_SRC)------------------------------------------| csv_table.7: 1D_Block, index_col.8: 1D_Block
    m = df.Purchase_Amount.mean()---------------------------------------| #0: 1D_Block, arr_63: 1D_Blo

+ `bodo.jit` produces a `CPUDispatcher` object.
+ That includes a method `distributed_diagnostics`.
+ When executed, it shows how Bodo distributes data.
+ In this case, the local variables `csv_table`, `index_col`, & `arr` are distributed as `1D_Block`.
+ This did *not* require the tricky book-keeping needed when distributing data by hand in an MPI style.

---

#### SPMD Paradigm
<center>
    <img src='./img/SPMD-2.svg' width='70%'></img>
</center>

+ The output from the `distributed_diagnostics` method is verbose.
+ The essence to remember is that, in SPMD programming, data has to be *distributed* to all workers.
+ Rather than the programmer doing so explicitly, Bodo can create & scatter 1-dimensional blocks
  + i.e., chunks of rows of a DataFrame
+ Alternatively, all the data can be *replicated* to all workers.
  + The former strategy is applied here.

---

#### MPI functions in Bodo

+ `bodo.get_rank`
+ `bodo.get_size`
+ `bodo.gatherv`

+ `bodo.allgatherv`: like `gatherv`
 + (replicates data to all engines)

+ `bodo.scatterv`: opposite of `gatherv`

+ `bodo.barrier`: synchronizes engines

---

+ We've seen three MPI functions embedded in Bodo: `get_rank`, `get_size`, & `gatherv`.
+ In many cases, the `bodo.jit` decorator replaces explicit calls to these.
+ However, in some cases (as we'll see soon), we may need these MPI tools.
---
+ Other MPI functions we may need include:
+ `allgatherv`: this is like `gatherv` but it reproduces gathered data on all engines.
---
+ `scatterv` is the opposite of `gatherv`; it distributes data to all engines.
---
+ `barrier` pauses engines until all reach the same synchronization barrier in the SPMD code.
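
The barrier idea can be illustrated with Python's standard `threading.Barrier`. This is an analogy using threads in shared memory, not Bodo's MPI-based `bodo.barrier`, and the names `N`, `slots`, and `seen` are invented for the sketch. No worker moves on to the second phase until every worker has finished the first.

```python
import threading

N = 4                                  # hypothetical number of workers
slots = [None] * N                     # one result slot per worker
seen = [None] * N                      # what each worker observes afterwards
barrier = threading.Barrier(N)

def worker(rank):
    slots[rank] = rank * rank          # phase 1: produce a local result
    barrier.wait()                     # block until all N workers arrive
    seen[rank] = sum(slots)            # phase 2: every slot is now filled

threads = [threading.Thread(target=worker, args=(r,)) for r in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(seen)                            # [14, 14, 14, 14]
```

Without the `barrier.wait()` call, a fast worker could try to sum the slots while some are still `None`.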

---

In [27]:
# To stop the cluster run the following command.
rc.cluster.stop_cluster_sync()

Stopping controller
Controller stopped: {'exit_code': 0, 'pid': 59352, 'identifier': 'ipcontroller-1650030798-mw6x-59335'}
Stopping engine(s): 1650030799
engine set stopped 1650030799: {'exit_code': 0, 'pid': 59387, 'identifier': 'ipengine-1650030798-mw6x-1650030799-59335'}


+ Finally, one important thing to remember is to cleanly shut down the `ipyparallel` session when finished.
+ Generically, this requires inserting a line like this at the end of a notebook.
+ Again, these mechanics happen behind the scenes on Bodo's cloud platform.

---

### Summary

+ Single Program Multiple Data (SPMD) paradigm
+ Distributed computing using SPMD
  + Using `ipyparallel` and `px` magics
+ Bodo JIT compiler as a simpler route

+ Writing SPMD code for distributed computing can be tricky.
+ The key point to understand is that `ipyparallel` handles the underlying work.
+ The `bodo.jit` decorator provides a simplified route to generating SPMD programs.

---