# Bodo Getting Started Tutorial

Bodo is the simplest and most efficient analytics engine. It accelerates and scales data science programs
automatically and enables instant deployment, eliminating the need to rewrite Python analytics code in Spark/Scala, SQL, or MPI/C++.
In this tutorial, we will cover the basics of using Bodo and explain how it works under the hood.

In a nutshell, Bodo provides a just-in-time (JIT) compilation workflow using the `@bodo.jit` decorator. It replaces decorated Python functions with an optimized and parallelized binary version using advanced compilation methods.

Let's get started!

## Environment Setup
Please follow the [Bodo installation](http://docs.bodo.ai/latest/source/install.html) and [Jupyter Notebook Setup](http://docs.bodo.ai/latest/source/jupyter.html) pages to set up the environment. Also, make sure the MPI engines are started in the `IPython Clusters` tab (or by running `ipcluster start -n 8 --profile=mpi` on the command line), then initialize the `ipyparallel` environment:

In [61]:
import ipyparallel as ipp
c = ipp.Client(profile='mpi')
view = c[:]
view.activate()

## Parallel pandas with Bodo
First, we demonstrate how Bodo automatically parallelizes and optimizes standard Python programs that use pandas and NumPy, without the need to rewrite your code. Bodo can scale your analytics code to thousands of cores, providing orders-of-magnitude speedups depending on program characteristics.

### Generate data
To begin, let's generate a simple dataset (the size of this dataframe in memory is approximately 305 MB, and the size of the written Parquet file is 77 MB):

In [62]:
import pandas as pd
import numpy as np

NUM_GROUPS = 30
NUM_ROWS = 20_000_000
df = pd.DataFrame({
    "A": np.random.randint(0, NUM_GROUPS, NUM_ROWS),
    "B": np.arange(NUM_ROWS)
})
df.to_parquet("example1.pq")
print(df)

           A         B
0         11         0
1         16         1
2         18         2
3          9         3
4         25         4
...       ..       ...
19999995  12  19999995
19999996  22  19999996
19999997   3  19999997
19999998   2  19999998
19999999  11  19999999

[20000000 rows x 2 columns]


### Data Analysis
Now let's read and process this dataframe. First using Python and pandas:

In [63]:
def test():
    df = pd.read_parquet("example1.pq")
    df2 = df.groupby("A").sum()
    m = df2.B.max()
    print(m)

test()

6684404478993


Now let's run it with Bodo in parallel. All we have to do is add the `@bodo.jit` decorator to the function and use the Jupyter `%%px --block` *magic* (to run on the MPI engines):

In [64]:
%%px --block
import pandas as pd
import bodo

@bodo.jit
def test():
    df = pd.read_parquet("example1.pq")
    df2 = df.groupby("A").sum()
    m = df2.B.max()
    print(m)

test()

[stdout:0] 6684404478993.0


Although the program appears to be a regular sequential Python program, Bodo compiles and *transforms* the decorated code (the `test` function in this example) under the hood, so that it can run in parallel on many cores. Each core operates on a different chunk of the data and communicates with other cores when necessary.

### Parallel Python Processes
Bodo manages parallelism inside `jit` functions to match sequential Python as much as possible. For example, even though the previous code runs on 8 processes, the output is printed only once. On the other hand, code outside `jit` functions runs as regular Python on all processes. For example, the code below produces 8 prints, one for each Python process:

In [65]:
%%px --block

@bodo.jit
def test():
    df = pd.read_parquet("example1.pq")
    df2 = df.groupby("A").sum()
    m = df2.B.max()
    return m

m = test()
print(m)

[stdout:0] 6684404478993.0
[stdout:1] 6684404478993.0
[stdout:2] 6684404478993.0
[stdout:3] 6684404478993.0
[stdout:4] 6684404478993.0
[stdout:5] 6684404478993.0
[stdout:6] 6684404478993.0
[stdout:7] 6684404478993.0


### Parallel Data Read

Bodo can read data from storage, such as Parquet files, in parallel. This means that each process reads only its own chunk of the data (which can be proportionally faster than a sequential read). The example below demonstrates a parallel read by printing the data chunks on different cores:

In [66]:
%%px --block

@bodo.jit
def test():
    df = pd.read_parquet("example1.pq")
    print(df)

test()

[stdout:0] 
          A        B
0        11        0
1        16        1
2        18        2
3         9        3
4        25        4
...      ..      ...
2499995   2  2499995
2499996  20  2499996
2499997   2  2499997
2499998   3  2499998
2499999  19  2499999

[2500000 rows x 2 columns]
[stdout:1] 
          A        B
2500000  25  2500000
2500001  23  2500001
2500002   7  2500002
2500003  25  2500003
2500004  24  2500004
...      ..      ...
4999995   1  4999995
4999996   2  4999996
4999997  12  4999997
4999998  12  4999998
4999999   3  4999999

[2500000 rows x 2 columns]
[stdout:2] 
          A        B
5000000   0  5000000
5000001  26  5000001
5000002  21  5000002
5000003   3  5000003
5000004  27  5000004
...      ..      ...
7499995  17  7499995
7499996  11  7499996
7499997  13  7499997
7499998  15  7499998
7499999  20  7499999

[2500000 rows x 2 columns]
[stdout:3] 
          A        B
7500000  22  7500000
7500001  10  7500001
7500002  24  7500002
7500003   8  7500003
7500004

Looking at column B, we can clearly see that each process has a separate chunk of the original dataframe. 
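
The chunk boundaries in the output above are consistent with an even, contiguous split of the 20,000,000 rows across 8 processes. The sketch below is illustrative only: `chunk_bounds` is a hypothetical helper, not a Bodo API, and the way it spreads leftover rows over the lowest ranks is an assumption (it does not matter here, since the rows divide evenly).

```python
def chunk_bounds(n_rows, n_procs, rank):
    """Return the half-open row range [start, stop) owned by `rank`.

    Assumption: rows are split into contiguous blocks, and any remainder
    is spread one row at a time over the lowest ranks.
    """
    base, extra = divmod(n_rows, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


NUM_ROWS = 20_000_000
NUM_PROCS = 8

bounds = [chunk_bounds(NUM_ROWS, NUM_PROCS, r) for r in range(NUM_PROCS)]
for rank, (start, stop) in enumerate(bounds):
    print(f"rank {rank}: rows {start}..{stop - 1} ({stop - start} rows)")
```

Under this assumption, rank 0 owns rows 0 to 2499999 and rank 1 owns rows 2500000 to 4999999, which matches the index ranges printed by the first two processes.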

<img style="float: right;" src="img/groupby.jpg">

### Parallelizing Computation

Bodo parallelizes computation automatically by dividing the work between cores and performing the necessary data communication. For example, the `groupby` operation in our example needs the data of each group to be on the same processor. This requires *shuffling* data across the cluster. Bodo uses [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) for efficient communication, which is usually much faster than alternative methods.
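
To make the shuffle step concrete, here is a minimal pure-Python simulation of a distributed `groupby` sum. It is not Bodo's implementation: processes are simulated as plain lists, no MPI is involved, the helper names are hypothetical, and assigning a key to process `key % n_procs` is an assumption standing in for whatever partitioning Bodo actually uses.

```python
from collections import defaultdict


def local_chunks(rows, n_procs):
    """Split rows into contiguous chunks, one per simulated process."""
    size = -(-len(rows) // n_procs)  # ceiling division
    return [rows[i * size:(i + 1) * size] for i in range(n_procs)]


def shuffle_by_key(chunks, n_procs):
    """Send every (key, value) row to the process that owns its key.

    Assumption: the owner of a key is key % n_procs.
    """
    inboxes = [[] for _ in range(n_procs)]
    for chunk in chunks:
        for key, value in chunk:
            inboxes[key % n_procs].append((key, value))
    return inboxes


def local_group_sum(rows):
    """Sum values per key using only the rows held by one process."""
    sums = defaultdict(int)
    for key, value in rows:
        sums[key] += value
    return dict(sums)


N_PROCS = 4
rows = [(i % 6, i) for i in range(24)]  # column A = i % 6, column B = i

chunks = local_chunks(rows, N_PROCS)
inboxes = shuffle_by_key(chunks, N_PROCS)
partial_results = [local_group_sum(inbox) for inbox in inboxes]

# After the shuffle each key lives on exactly one process, so the
# per-process results can be merged without any further arithmetic.
combined = {}
for result in partial_results:
    combined.update(result)
print(combined)
```

After the shuffle, every group is complete on a single simulated process, so each process can aggregate its groups independently. Because keys are assigned to processes without regard to balance, the per-process results can differ in size.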

### Parallel Write

Bodo can write data to storage in parallel as well:

In [67]:
%%px --block

@bodo.jit
def test():
    df = pd.read_parquet("example1.pq")
    df2 = df.groupby("A").sum()
    df2.to_parquet("example1-df2.pq")

test()

Now let's read and print the results with pandas:

In [68]:
import pandas as pd

df = pd.read_parquet("example1-df2.pq")
print(df)

                B
A                
28  6670905246730
20  6665595173997
24  6666669522380
6   6669013038677
4   6658128085633
18  6656980845035
8   6663814943031
15  6659418554747
13  6655276574590
19  6658728940063
21  6667581375427
22  6649556088061
9   6649522716000
27  6669589988049
26  6655304463161
3   6659645100063
2   6666548164072
16  6670285154224
0   6667636488372
11  6680488658607
12  6678417438832
7   6667061548203
1   6666431467246
29  6668872325952
23  6674709585510
5   6680301445070
25  6664124914413
14  6684404478993
10  6677703432545
17  6677274242317


The order of the `groupby` results generated by Bodo can differ from pandas since Bodo doesn't automatically sort the output distributed data (it is expensive and not necessary in many cases). Users can explicitly sort dataframes at any point if desired.

### Specifying Data Distribution

Bodo automatically distributes data and computation in Bodo functions by analyzing them for parallelization. However, Bodo does not know how input parameters of Bodo functions are distributed, and similarly how the user wants to handle return values. As such, Bodo assumes that input parameters and return values are *replicated* by default, meaning that every process receives the same input data and returns the same output, as opposed to different data chunks.

<div class="alert alert-block alert-danger">
<b>Important:</b> the distribution scheme of input parameters and return values determines the distribution scheme for variables inside the Bodo function that depend on them.
</div>

To illustrate this effect, let's return the `groupby` output from the Bodo function:

In [69]:
%%px --block
import pandas as pd
import bodo

@bodo.jit
def test():
    df = pd.read_parquet("example1.pq")
    df2 = df.groupby("A").sum()
    return df2

df2 = test()
print(df2)

[stdout:0] 
                B
A                
11  6680488658607
16  6670285154224
18  6656980845035
9   6649522716000
25  6664124914413
12  6678417438832
0   6667636488372
14  6684404478993
27  6669589988049
28  6670905246730
20  6665595173997
8   6663814943031
26  6655304463161
19  6658728940063
24  6666669522380
29  6668872325952
3   6659645100063
23  6674709585510
21  6667581375427
10  6677703432545
7   6667061548203
1   6666431467246
5   6680301445070
15  6659418554747
6   6669013038677
13  6655276574590
4   6658128085633
2   6666548164072
22  6649556088061
17  6677274242317
[stdout:1] 
                B
A                
11  6680488658607
16  6670285154224
18  6656980845035
9   6649522716000
25  6664124914413
12  6678417438832
0   6667636488372
14  6684404478993
27  6669589988049
28  6670905246730
20  6665595173997
8   6663814943031
26  6655304463161
19  6658728940063
24  6666669522380
29  6668872325952
3   6659645100063
23  6674709585510
21  6667581375427
10  6677703432545
7   

[stderr:0] 


As we can see, `df2` has the same data on every process. Furthermore, Bodo warns that it didn't find any parallelism inside the `test` function. In this example, every process reads the whole input Parquet file and executes the same sequential program. Because return values are replicated by default, `df2` is replicated, and Bodo makes sure that the variables `df2` is computed from (here, `df`) have the same distribution, creating an inverse cascading effect.

### `distributed` Flag

The user can tell Bodo which input/output variables should be distributed using the `distributed` flag:

In [70]:
%%px --block

@bodo.jit(distributed=["df2"])
def test():
    df = pd.read_parquet("example1.pq")
    df2 = df.groupby("A").sum()
    return df2

df2 = test()
print(df2)

[stdout:0] 
                B
A                
28  6670905246730
20  6665595173997
24  6666669522380
6   6669013038677
4   6658128085633
[stdout:1] 
                B
A                
18  6656980845035
8   6663814943031
15  6659418554747
13  6655276574590
[stdout:2] 
                B
A                
19  6658728940063
21  6667581375427
22  6649556088061
[stdout:3] 
                B
A                
9   6649522716000
27  6669589988049
26  6655304463161
3   6659645100063
2   6666548164072
[stdout:4] 
                B
A                
16  6670285154224
0   6667636488372
[stdout:5] 
                B
A                
11  6680488658607
12  6678417438832
7   6667061548203
1   6666431467246
[stdout:6] 
                B
A                
29  6668872325952
23  6674709585510
5   6680301445070
[stdout:7] 
                B
A                
25  6664124914413
14  6684404478993
10  6677703432545
17  6677274242317


In this case, the program is fully parallelized and chunks of data are returned to Python on different processes.

### Basic benchmarking of the pandas example
Now let's do some basic benchmarking to observe the effect of Bodo's automatic parallelization. Here we are only scaling up to a few cores, but Bodo can scale the same code to thousands of cores in a cluster.

Let's add timers and run the code again with pandas:

In [71]:
import pandas as pd
import time

def test():
    df = pd.read_parquet("example1.pq")
    t0 = time.time()
    df2 = df.groupby("A").sum()
    m = df2.B.max()
    print("Compute time:", time.time() - t0, "secs")
    return m

result = test()

Compute time: 0.48606395721435547 secs


Now let's measure Bodo's execution time:

In [72]:
%%px --block
import pandas as pd
import time
import bodo

@bodo.jit
def test():
    df = pd.read_parquet("example1.pq")
    t0 = time.time()
    df2 = df.groupby("A").sum()
    m = df2.B.max()
    print("Compute time:", time.time() - t0, "secs")
    return m

result = test()

[stdout:0] Compute time: 0.21879451999848243 secs


As we can see, Bodo computes results faster than pandas using parallel computation. The speedup depends on the data and program characteristics, as well as the number of cores used. Usually, we can continue scaling to many more cores as long as the data is large enough.

Note how we included timers inside the Bodo function. This avoids measuring compilation time, since Bodo compiles each `jit` function the first time it is called. Excluding compilation time from benchmarks is usually important because:

1. Compilation time is often not significant for large computations in real settings, but simple benchmarks are designed to run quickly, so compilation can dominate the measurement
2. Functions can potentially be compiled and cached ahead of execution time
3. Compilation happens only once, but the same function may be called multiple times, leading to inconsistent measurements
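
A different way to keep one-time costs out of a measurement is to time the first call separately from later calls. The sketch below uses only the standard library. `time_calls` and `fake_jit_function` are hypothetical names, and the `time.sleep` merely imitates a one-time compilation cost. It is not Bodo code, and whether this pattern suits a particular Bodo workload is an assumption to verify.

```python
import time


def time_calls(fn, repeats=3):
    """Time the first call separately from the following `repeats` calls."""
    t0 = time.perf_counter()
    fn()
    first = time.perf_counter() - t0

    later = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        later.append(time.perf_counter() - t0)
    return first, later


# Stand-in for a JIT-compiled function: it pays a setup cost on the
# first call only (imitated here with a sleep).
_compiled = False


def fake_jit_function():
    global _compiled
    if not _compiled:
        time.sleep(0.2)  # pretend compilation
        _compiled = True
    return sum(range(100_000))


first, later = time_calls(fake_jit_function)
print(f"first call:  {first:.3f} s (includes the one-time cost)")
print(f"later calls: {min(later):.3f} s (best of {len(later)})")
```

Reporting the first call and the steady-state calls separately makes the one-time cost visible instead of folding it into an average.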