# Bodo Extended Tutorial

This is a continuation of the "Getting Started" tutorial. You are encouraged to visit that tutorial first if you have not done so already. In this tutorial, we will explain core Bodo concepts in more detail, introduce more Bodo features, and discuss more advanced topics.

## Environment Setup

Please follow the [Bodo installation](http://docs.bodo.ai/latest/source/install.html) and [Jupyter Notebook Setup](http://docs.bodo.ai/latest/source/jupyter.html) pages to set up the environment. Also, make sure MPI engines are started in the `IPython Clusters` tab (or using `ipcluster start -n 4 --profile=mpi` on the command line).

# 1. Bodo Basics

## JIT (Just-in-time) Compilation Workflow

Bodo provides a just-in-time (JIT) compilation workflow using the `@bodo.jit` decorator, which replaces a Python function with a so-called `Dispatcher` object. Bodo compiles the function the first time a Dispatcher object is called and reuses the compiled version afterwards. The function is recompiled only if the same function is called with different argument types (not often in practice).

In [1]:
import numpy as np
import pandas as pd
import bodo

@bodo.jit
def f(n, a):
    df = pd.DataFrame({'A': np.arange(n) + a})
    return df.head(3)

print(f)
print(f(8, 1))  # compiles for (int, int) input types
print(f(8, 2))  # same input types, no need to compile
print(f(8, 2.2))  # compiles for (int, float) input types

CPUDispatcher(<function f at 0x130714940>)
   A
0  1
1  2
2  3
   A
0  2
1  3
2  4
     A
0  2.2
1  3.2
2  4.2


All of this is completely transparent to the caller, and does not affect any Python code calling the function.
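
The caching behavior described above can be pictured with a small pure-Python model. This is only an illustrative sketch: `ToyDispatcher`, its `cache` dictionary, and `compile_count` are hypothetical names, and no real compilation happens (an actual JIT dispatcher generates native code and keys on inferred argument types rather than on Python `type()` objects).

```python
class ToyDispatcher:
    """Toy model of a JIT dispatcher: one 'compiled' entry per argument-type signature."""

    def __init__(self, py_func):
        self.py_func = py_func
        self.cache = {}          # signature -> "compiled" callable
        self.compile_count = 0   # how many times we had to "compile"

    def __call__(self, *args):
        signature = tuple(type(a) for a in args)
        if signature not in self.cache:
            # A real JIT would generate machine code here;
            # this sketch simply reuses the Python function.
            self.compile_count += 1
            self.cache[signature] = self.py_func
        return self.cache[signature](*args)


@ToyDispatcher
def add(n, a):
    return n + a


add(8, 1)    # "compiles" for (int, int)
add(8, 2)    # same signature: cache hit, no compilation
add(8, 2.2)  # "compiles" for (int, float)
print(add.compile_count)  # 2
```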

## Parallel Execution Model

As we saw in the "Getting Started" tutorial, Bodo transforms functions for parallel execution. However, the dispatcher does not launch processes or threads on the fly. Instead, the Python application (including non-Bodo code) is intended to be executed under an MPI Single Program Multiple Data ([SPMD](https://en.wikipedia.org/wiki/SPMD)) paradigm, where MPI processes are launched in the beginning and all run the same code.


For example, we can save some example code to a file and use *mpiexec* to launch 4 processes:

In [2]:
import numpy as np
import pandas as pd
import bodo

@bodo.jit(distributed=["df"])
def f(n, a):
    df = pd.DataFrame({'A': np.arange(n) + a})
    return df

print(f(8, 1))

   A
0  1
1  2
2  3
3  4
4  5
5  6
6  7
7  8


In [3]:
%save -f test_bodo.py 2 # cell number of previous cell

The following commands were written to file `test_bodo.py`:
import numpy as np
import pandas as pd
import bodo

@bodo.jit(distributed=["df"])
def f(n, a):
    df = pd.DataFrame({'A': np.arange(n) + a})
    return df

print(f(8, 1))


In [4]:
!mpiexec -n 4 python test_bodo.py

   A
0  1
1  2
   A
2  3
3  4
   A
4  5
5  6
   A
6  7
7  8


In this example, `mpiexec` launches 4 Python processes, each of which executes the same `test_bodo.py` file.

<div class="alert alert-block alert-danger">
<b>Important:</b>

- Python code outside of Bodo functions executes sequentially on every process.
- Bodo functions run in parallel assuming that Bodo is able to parallelize them. Otherwise, they also run sequentially on every process. Bodo warns if it does not find parallelism (more details later).

</div>

Note how the prints, which are regular Python code executed outside of Bodo, run for each process.

In a Jupyter notebook, parallel execution happens in much the same way. We start a set of MPI engines through `ipyparallel` and activate a client.

In [6]:
import ipyparallel as ipp
c = ipp.Client(profile='mpi')
view = c[:]
view.activate()

After this initialization, any code that we run in the notebook with `%%px --block` is sent for execution on all MPI engines.

In [7]:
%%px --block

import numpy as np
import pandas as pd
import bodo

@bodo.jit(distributed=['df'])
def f(n):
    df = pd.DataFrame({'A': np.arange(n)})
    return df

print(f(8))

[stdout:0] 
   A
0  0
1  1
[stdout:1] 
   A
2  2
3  3
[stdout:2] 
   A
4  4
5  5
[stdout:3] 
   A
6  6
7  7


## Parallel APIs

Bodo provides a limited number of parallel APIs to support advanced cases that may need them. The example below demonstrates running some code on only one process and then synchronizing all processes:


In [8]:
%%px --block

# some work on only rank 0
if bodo.get_rank() == 0:
    print("rank 0 done")

# make sure all processes are synchronized
# (e.g. all processes need to see effect of rank 0's work)
bodo.barrier()  

[stdout:0] rank 0 done


Out[0:2]: 0
Out[1:2]: 0
Out[2:2]: 0
Out[3:2]: 0

<div class="alert alert-block alert-danger">
<b>Important:</b> As in this example, it is possible to have each process follow a different control flow, but all processes must always call the same Bodo functions in the same order.
</div>

# 2. Data Distribution

Bodo parallelizes computation by dividing data into separate chunks across processes. However, some data handled by a Bodo function may not be divided into chunks. There are two main data distribution schemes:

- Replicated (*REP*): the data associated with the variable is the same on every process.
- One-dimensional (*1D*): the data is divided into chunks, split along one dimension (rows of a dataframe or first dimension of an array).
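
To make the 1D scheme concrete, the sketch below computes which rows each process would own under an even block partition. It is an assumption for illustration, not Bodo's actual partitioning code: `block_range` is a hypothetical helper, and real chunk boundaries can differ (for example, depending on how an input file is laid out).

```python
def block_range(n_rows, n_ranks, rank):
    """Return the half-open [start, stop) row range owned by `rank`."""
    base, extra = divmod(n_rows, n_ranks)
    # The first `extra` ranks each take one additional row.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


for rank in range(4):
    print(rank, block_range(8, 4, rank))
# 0 (0, 2)
# 1 (2, 4)
# 2 (4, 6)
# 3 (6, 8)
```

A replicated (*REP*) variable, by contrast, would correspond to every rank holding the full range `(0, n_rows)`.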

Bodo finds distribution of variables automatically, using the nature of the computation that produces them. Let's see an example:

In [8]:
%%px --block

@bodo.jit
def mean_power_speed():
    df = pd.read_parquet('cycling_dataset.pq')
    m = df[["power", "speed"]].mean()
    return m

res = mean_power_speed()
print(res)

[stdout:0] 
power    102.078421
speed      5.656851
dtype: float64
[stdout:1] 
power    102.078421
speed      5.656851
dtype: float64
[stdout:2] 
power    102.078421
speed      5.656851
dtype: float64
[stdout:3] 
power    102.078421
speed      5.656851
dtype: float64


In this example, `df` is parallelized but `m` is replicated, even though it is a Series. Semantically, it makes sense for the output of the `mean` operation to be replicated on all processes, since it is a reduction and produces "small" data.
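
Why a reduction such as `mean` ends up replicated can be sketched in plain Python: each process reduces its own chunk to a tiny partial result, the partials are combined, and every process receives the same final value. The sketch simulates four processes with a list of chunks; `local_partial` and `combine` are hypothetical helpers, and the combine step stands in for the communication a real run would perform.

```python
def local_partial(chunk):
    """Per-process work: reduce a chunk to a (sum, count) pair."""
    return sum(chunk), len(chunk)


def combine(partials):
    """Stand-in for an all-reduce: merge partials into one global mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# A 1D-distributed column: one chunk per simulated process.
chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
partials = [local_partial(c) for c in chunks]

# After the combine step, every process holds the same (replicated) value.
replicated = [combine(partials) for _ in chunks]
print(replicated)  # [4.5, 4.5, 4.5, 4.5]
```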

### Distributed Diagnostics

The distributions found by Bodo can be printed either by setting `BODO_DISTRIBUTED_DIAGNOSTICS=1` or calling `distributed_diagnostics()` on the compiled function. Let's examine the previous example's distributions:

In [9]:
%%px --block
mean_power_speed.distributed_diagnostics()

[stdout:0] 
Distributed diagnostics for function mean_power_speed, <ipython-input-3-b6752a146201> (1)

Data distributions:
   power.362                1D_Block
   speed.363                1D_Block
   $A.599.945               1D_Block
   $A.662.955               1D_Block
   $data.571.966            REP
   $12call_method.5.929     REP
   $66call_method.31.589    REP
   $m.968                   REP
   $30return_value.12       REP

Parfor distributions:
   1                    1D_Block
   2                    1D_Block

Distributed listing for function mean_power_speed, <ipython-input-3-b6752a146201> (1)
--------------------------------------------------| parfor_id/variable: distribution
@bodo.jit                                         | 
def mean_power_speed():                           | 
    df = pd.read_parquet('cycling_dataset.pq')----| power.362: 1D_Block, speed.363: 1D_Block
    m = df[["power", "speed"]].mean()-------------| $A.599.945: 1D_Block, $A.662.955: 1D_Block
    return m--

Variables are renamed due to optimization. The output shows that `power` and `speed` columns of `df` are distributed (`1D_Block`) but `m` is replicated (`REP`). This is because `df` is output of `read_parquet` and input of `mean`, both of which can be distributed. `m` is output of `mean`, which is always replicated (available on every process).

### Function Arguments and Return Values

Now let's see what happens if we pass the data into the Bodo function as a function parameter but don't mark it as distributed:

In [10]:
%%px --block

@bodo.jit
def mean_power_speed(df):
    m = df[["power", "speed"]].mean()
    return m

df = pd.read_parquet('cycling_dataset.pq')
res = mean_power_speed(df)
print(res)

[stdout:0] 
power    102.078421
speed      5.656851
dtype: float64
[stdout:1] 
power    102.078421
speed      5.656851
dtype: float64
[stdout:2] 
power    102.078421
speed      5.656851
dtype: float64
[stdout:3] 
power    102.078421
speed      5.656851
dtype: float64


[stderr:0] 


The program runs and returns the same correct value as before, but everything is replicated on all processes and *there is no parallelism!* Bodo's warning indicates this explicitly. Therefore, each process reads the whole data file and calculates the mean of the dataframe independently.

This is because data is passed to Bodo as argument without setting the distributed flag, and Bodo assumes correctly that the data is replicated. Bodo then follows dependencies and replicates the whole program.

Similarly, return values will be replicated by default, since data is passed to regular Python:

In [11]:
%%px --block

@bodo.jit
def mean_power_speed():
    df = pd.read_parquet('cycling_dataset.pq')
    return df

df = mean_power_speed()
print(df)

[stdout:0] 
      Unnamed: 0    altitude  cadence      distance   hr   latitude  \
0              0  185.800003       51      3.460000   81  30.313309   
1              1  185.800003       68      7.170000   82  30.313277   
2              2  186.399994       38     11.040000   82  30.313243   
3              3  186.800003       38     15.180000   83  30.313212   
4              4  186.600006       38     19.430000   83  30.313172   
...          ...         ...      ...           ...  ...        ...   
3897        1127  178.199997        0  22014.929688  100  30.313483   
3898        1128  178.199997        0  22018.220703  100  30.313482   
3899        1129  178.199997        0  22021.179688  100  30.313485   
3900        1130  178.399994        0  22024.150391  100  30.313489   
3901        1131  178.399994        0  22027.009766  100  30.313492   

      longitude  power  speed                time  
0    -97.732711     45  3.459 2016-10-20 22:01:26  
1    -97.732715      0  3.710 2

[stderr:0] 


<div class="alert alert-block alert-danger">
<b>Important:</b> Bodo assumes that input parameters and return values are replicated unless specified otherwise using the `distributed` flag. This can lead to replication of the whole program due to dependencies.
</div>

### Passing Distributed Data to Bodo

Bodo functions may require distributed arguments and return values in some cases such as passing distributed data across Bodo functions. This can be achieved using the `distributed` flag:

In [12]:
%%px --block

@bodo.jit(distributed=['df'])
def read_data():
    df = pd.read_parquet('cycling_dataset.pq')
    print("total size", len(df))
    return df

@bodo.jit(distributed=['df'])
def mean_power(df):
    x = df.power.mean()
    return x

df = read_data()
# df is a chunk of data on each process
print("chunk size", len(df))
res = mean_power(df)
print(res)

[stdout:0] 
total size 3902
chunk size 976
102.07842132239877
[stdout:1] 
chunk size 976
102.07842132239877
[stdout:2] 
chunk size 976
102.07842132239877
[stdout:3] 
chunk size 974
102.07842132239877


### Scattering Data

One can distribute data manually by *scattering* data from one process to all processes. For example:

In [13]:
%%px --block

@bodo.jit(distributed=['df'])
def mean_power(df):
    x = df.power.mean()
    return x

df = None
# only rank 0 reads the data
if bodo.get_rank() == 0:
    df = pd.read_parquet('cycling_dataset.pq')

df = bodo.scatterv(df)
res = mean_power(df)
print(res)

[stdout:0] 102.07842132239877
[stdout:1] 102.07842132239877
[stdout:2] 102.07842132239877
[stdout:3] 102.07842132239877


### Gathering Data

One can *gather* distributed data into a single process manually. For example:

In [14]:
%%px --block

@bodo.jit
def mean_power():
    df = pd.read_parquet('cycling_dataset.pq')
    return bodo.gatherv(df)

df = mean_power()
print(df)

[stdout:0] 
      Unnamed: 0    altitude  cadence      distance   hr   latitude  \
0              0  185.800003       51      3.460000   81  30.313309   
1              1  185.800003       68      7.170000   82  30.313277   
2              2  186.399994       38     11.040000   82  30.313243   
3              3  186.800003       38     15.180000   83  30.313212   
4              4  186.600006       38     19.430000   83  30.313172   
...          ...         ...      ...           ...  ...        ...   
3897        1127  178.199997        0  22014.929688  100  30.313483   
3898        1128  178.199997        0  22018.220703  100  30.313482   
3899        1129  178.199997        0  22021.179688  100  30.313485   
3900        1130  178.399994        0  22024.150391  100  30.313489   
3901        1131  178.399994        0  22027.009766  100  30.313492   

      longitude  power  speed                time  
0    -97.732711     45  3.459 2016-10-20 22:01:26  
1    -97.732715      0  3.710 2

Alternatively, distributed data can be gathered and sent to all processes, effectively replicating the data:

In [15]:
%%px --block

@bodo.jit
def mean_power():
    df = pd.read_parquet('cycling_dataset.pq')
    return bodo.allgatherv(df)

df = mean_power()
print(df)

[stdout:0] 
      Unnamed: 0    altitude  cadence      distance   hr   latitude  \
0              0  185.800003       51      3.460000   81  30.313309   
1              1  185.800003       68      7.170000   82  30.313277   
2              2  186.399994       38     11.040000   82  30.313243   
3              3  186.800003       38     15.180000   83  30.313212   
4              4  186.600006       38     19.430000   83  30.313172   
...          ...         ...      ...           ...  ...        ...   
3897        1127  178.199997        0  22014.929688  100  30.313483   
3898        1128  178.199997        0  22018.220703  100  30.313482   
3899        1129  178.199997        0  22021.179688  100  30.313485   
3900        1130  178.399994        0  22024.150391  100  30.313489   
3901        1131  178.399994        0  22027.009766  100  30.313492   

      longitude  power  speed                time  
0    -97.732711     45  3.459 2016-10-20 22:01:26  
1    -97.732715      0  3.710 2
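
The semantics of the three collectives used in this section can be modeled with plain lists. This is a conceptual sketch only: `toy_scatterv`, `toy_gatherv`, and `toy_allgatherv` are hypothetical stand-ins that simulate all processes inside one Python process, and they do not call Bodo or MPI. What non-root processes hold after a gather is modeled as an empty list, which is an assumption of the sketch.

```python
def toy_scatterv(data, n_ranks):
    """Split data held by one process into contiguous chunks, one per rank."""
    base, extra = divmod(len(data), n_ranks)
    chunks, start = [], 0
    for rank in range(n_ranks):
        stop = start + base + (1 if rank < extra else 0)
        chunks.append(data[start:stop])
        start = stop
    return chunks


def toy_gatherv(chunks, root=0):
    """Concatenate all chunks on `root`; other ranks get an empty list (assumed)."""
    full = [x for chunk in chunks for x in chunk]
    return [full if rank == root else [] for rank in range(len(chunks))]


def toy_allgatherv(chunks):
    """Concatenate all chunks on every rank, i.e. replicate the data."""
    full = [x for chunk in chunks for x in chunk]
    return [list(full) for _ in chunks]


chunks = toy_scatterv(list(range(10)), 4)
print(chunks)                     # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
print(toy_gatherv(chunks)[0])     # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(toy_allgatherv(chunks)[3])  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```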

# 3. Parallel I/O
<img style="float: right;" src="img/file-read.jpg">

Efficient parallel data processing requires data I/O to be parallelized effectively as well. Bodo provides parallel file I/O for many different formats such as [Parquet](http://parquet.apache.org),
CSV, JSON, Numpy binaries, [HDF5](http://www.h5py.org) and SQL databases. This diagram demonstrates how chunks of data are partitioned among parallel execution engines by Bodo.

## Parquet

Parquet is a commonly used file format in analytics due to its efficient columnar storage. Bodo supports the standard pandas API for reading Parquet:

In [12]:
%%px --block
import pandas as pd
import bodo
from IPython.display import display

@bodo.jit(distributed=['df'])
def pq_read():
    df = pd.read_parquet('cycling_dataset.pq')
    return df

# on each process, this returns the data chunk read by that process
res = pq_read()
if bodo.get_rank() == 0:
    display(res.head())  # display results of first process only

[output:0]

Unnamed: 0.1,Unnamed: 0,altitude,cadence,distance,hr,latitude,longitude,power,speed,time
0,0,185.800003,51,3.46,81,30.313309,-97.732711,45,3.459,2016-10-20 22:01:26
1,1,185.800003,68,7.17,82,30.313277,-97.732715,0,3.71,2016-10-20 22:01:27
2,2,186.399994,38,11.04,82,30.313243,-97.732717,42,3.874,2016-10-20 22:01:28
3,3,186.800003,38,15.18,83,30.313212,-97.73272,5,4.135,2016-10-20 22:01:29
4,4,186.600006,38,19.43,83,30.313172,-97.732723,1,4.25,2016-10-20 22:01:30


Bodo also supports the pandas API for writing Parquet files:

In [8]:
%%px --block
import numpy as np
import pandas as pd
import bodo

@bodo.jit
def generate_data_and_write():
    df = pd.DataFrame({"A": np.arange(80)})
    df.to_parquet("pq_output.pq")

generate_data_and_write()

<div class="alert alert-block alert-info">
<b>Note:</b> Bodo writes a directory of parquet files (one file per process) when writing distributed data. Bodo writes a single file when the data is replicated.
</div>

In this example, `df` is distributed data, so it is written to a directory of Parquet files.

Bodo supports parallel reads of single Parquet files as well as directories of files:

In [9]:
%%px --block
import pandas as pd
import bodo

@bodo.jit(distributed=['df'])
def read_parquet_dir():
    df = pd.read_parquet("pq_output.pq")
    return df

df = read_parquet_dir()
print(df)

[stdout:0] 
     A
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11
12  12
13  13
14  14
15  15
16  16
17  17
18  18
19  19
20  20
21  21
22  22
23  23
24  24
25  25
26  26
27  27
28  28
29  29
30  30
31  31
32  32
33  33
34  34
35  35
36  36
37  37
38  38
39  39
[stdout:1] 
     A
40  40
41  41
42  42
43  43
44  44
45  45
46  46
47  47
48  48
49  49
50  50
51  51
52  52
53  53
54  54
55  55
56  56
57  57
58  58
59  59
60  60
61  61
62  62
63  63
64  64
65  65
66  66
67  67
68  68
69  69
70  70
71  71
72  72
73  73
74  74
75  75
76  76
77  77
78  78
79  79


## CSV
CSV is a common text format for data exchange. Bodo supports the standard pandas API to read CSV files:

In [10]:
%%px --block
import pandas as pd
import bodo

@bodo.jit(distributed=['df'])
def csv_example():
    df = pd.read_csv('cycling_dataset.csv')
    return df

res = csv_example()
if bodo.get_rank() == 0:
    display(res.head())

[output:0]

Unnamed: 0,0,0.1,185.8000030517578,51,3.4600000381469727,81,30.31330947764218,-97.73271068930626,45,3.4590001106262207,2016-10-20 22:01:26
0,1,1,185.800003,68,7.17,82,30.313277,-97.732715,0,3.71,2016-10-20 22:01:27
1,2,2,186.399994,38,11.04,82,30.313243,-97.732717,42,3.874,2016-10-20 22:01:28
2,3,3,186.800003,38,15.18,83,30.313212,-97.73272,5,4.135,2016-10-20 22:01:29
3,4,4,186.600006,38,19.43,83,30.313172,-97.732723,1,4.25,2016-10-20 22:01:30
4,5,5,186.600006,0,23.860001,84,30.31313,-97.732724,0,4.435,2016-10-20 22:01:31


In addition to the pandas `read_csv()` functionality, Bodo can also read a directory containing multiple CSV files (all part of the same dataframe).

<div class="alert alert-block alert-info">
<b>Note:</b>

When writing distributed data to CSV:
- To S3 or HDFS: Bodo writes to a directory of CSV files (one file per process)
- To POSIX filesystem (e.g. local filesystem on Linux): Bodo writes the distributed data in parallel to a single file.

If the data is replicated, Bodo always writes to a single file.

</div>

## HDF5
HDF5 is a common format in scientific computing, especially for multi-dimensional numerical data. HDF5 can be very efficient at scale, since it has native parallel I/O support. Bodo supports the standard h5py APIs:

In [11]:
%%px --block
import h5py

@bodo.jit
def example_h5():
    f = h5py.File("data.h5", "r")
    return f['A'][:].sum()

res = example_h5()
if bodo.get_rank() == 0: print(res)

[output:0]

66

## Numpy Binary Files
Bodo supports reading and writing binary files using Numpy APIs as well.

In [15]:
%%px --block

@bodo.jit
def example_np_io():
    A = np.fromfile("data.dat", np.int64)
    return A.sum()

res = example_np_io()
if bodo.get_rank() == 0: print(res)

[output:0]

45

## Type Annotation (when file name is unknown at compile time)

Bodo needs to know or infer the types of all data, but this is not always possible for input from files if the file name is not known at compilation time.

For example, suppose we have the following files:

In [2]:
import pandas as pd
import numpy as np

def generate_files(n):
    for i in range(n):
        df = pd.DataFrame({"A": np.arange(5, dtype=np.int64)})
        df.to_parquet("test" + str(i) + ".pq")

generate_files(5)

And we want to read them like this:

In [3]:
import pandas as pd
import numpy as np
import bodo

@bodo.jit
def read_data(n):
    x = 0
    for i in range(n):
        file_name = "test" + str(i) + ".pq"
        df = pd.read_parquet(file_name)
        print(df)
        x += df["A"].sum()
    return x

result = read_data(5)
# BodoError: Parquet schema not available. Either path argument should be
# constant for Bodo to look at the file at compile time or schema should be provided.

BodoError: Parquet schema not available. Either path argument should be constant for Bodo to look at the file at compile time or schema should be provided.

The file names are computed at runtime, which doesn't allow the compiler to find the files and extract the schemas. As shown below, the solution is to use *type annotation* to provide data types to the compiler.

### Type annotation for Parquet files

The example below uses the `locals` option of the decorator to provide the compiler with the schema of the local variable `df`:

In [5]:
%%px --block
import pandas as pd
import numpy as np
import bodo

@bodo.jit(locals={"df": {"A": bodo.int64[:]}})
def read_data(n):
    x = 0
    for i in range(n):
        file_name = "test" + str(i) + ".pq"
        df = pd.read_parquet(file_name)
        x += df["A"].sum()
    return x

result = read_data(5)
if bodo.get_rank() == 0:
    print(result)

[stdout:0] 50


### Type annotation for CSV files

For CSV files, we can annotate types in the same way as pandas:

In [6]:
%%px --block
import pandas as pd
import numpy as np
import bodo

def generate_files(n):
    for i in range(n):
        df = pd.DataFrame({"A": np.arange(5, dtype=np.int64)})
        df.to_csv("test" + str(i) + ".csv", index=False)

@bodo.jit
def read_data(n):
    coltypes = {'A': np.int64}
    x = 0
    for i in range(n):
        file_name = "test" + str(i) + ".csv"
        # header=0 skips the header row written by to_csv; names/dtype supply the schema
        df = pd.read_csv(file_name, names=coltypes.keys(), dtype=coltypes, header=0)
        x += df["A"].sum()
    return x

n = 5
if bodo.get_rank() == 0:
    generate_files(n)
bodo.barrier()
result = read_data(n)
if bodo.get_rank() == 0:
    print(result)

[stdout:0] 50


# 4. Advanced Features

## Explicit Parallel Loops
Sometimes explicit parallel loops are required since a program cannot be written in terms of data-parallel operators easily. In this case, one can use Bodo’s `prange` in place of `range` to specify that a loop can be parallelized. The user is required to make sure the loop does not have cross iteration dependencies except for supported reductions.

The example below demonstrates a parallel loop with a reduction:

In [None]:
%%px --block
import bodo
from bodo import prange
import numpy as np

@bodo.jit
def prange_test(n):
    A = np.random.ranf(n)
    s = 0
    for i in prange(len(A)):
        # A[i]: distributed data access with loop index
        # s: a supported sum reduction
        s += A[i]
    return s

res = prange_test(10)
print(res)

Currently, reductions using +=, *=, min, and max operators are supported. Iterations are simply divided between processes and executed in parallel, but reductions are handled using data exchange.
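
A minimal pure-Python sketch of how such a reduction could proceed: the iteration space is split into blocks, each simulated process folds its block into a local partial starting from the operator's identity, and the partials are then combined. `toy_prange_reduce` is a hypothetical helper; the block split and the final combine are assumptions standing in for the generated parallel code and its data exchange.

```python
import operator
from functools import reduce


def toy_prange_reduce(data, n_ranks, op, identity):
    """Simulate a parallel loop with a reduction over `data`."""
    base, extra = divmod(len(data), n_ranks)
    partials, start = [], 0
    for rank in range(n_ranks):
        stop = start + base + (1 if rank < extra else 0)
        acc = identity
        for i in range(start, stop):  # this rank's share of the iterations
            acc = op(acc, data[i])
        partials.append(acc)
        start = stop
    # "Data exchange": combine the per-rank partials into the final value.
    return reduce(op, partials, identity)


A = [3, 1, 4, 1, 5, 9, 2, 6]
print(toy_prange_reduce(A, 4, operator.add, 0))     # 31
print(toy_prange_reduce(A, 4, operator.mul, 1))     # 6480
print(toy_prange_reduce(A, 4, min, float("inf")))   # 1
print(toy_prange_reduce(A, 4, max, float("-inf")))  # 9
```

The split gives the same answer as a sequential loop here because each of these operators is associative and has an identity element; an arbitrary cross-iteration dependency offers no such guarantee.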

## Integration with non-Bodo APIs
There are multiple methods for integration with APIs that Bodo does not support natively:
1. Switch to Python object mode inside jit functions
2. Pass data in and out of jit functions

### Object mode
Object mode allows switching to a Python interpreted context in order to run non-jittable code. The main requirement is specifying the types of the returned values. For example, the following code calls a Scipy function on data elements of a distributed dataset:

In [None]:
%%px --block
import scipy.special as sc

@bodo.jit
def objmode_test(n):
    A = np.random.ranf(n)
    s = 0
    for i in prange(len(A)):
        x = A[i]
        with bodo.objmode(y="float64"):
            y = sc.entr(x)  # call entropy function on each data element
        s += y
    return s

res = objmode_test(10)
print(res)

See Numba's documentation for [objmode](http://numba.pydata.org/numba-doc/latest/user/withobjmode.html#the-objmode-context-manager) for more details.

### Passing Distributed Data
Bodo can receive or return chunks of distributed data to allow flexible integration with any non-Bodo Python code. The following example passes chunks of data to Scipy for interpolation, and passes the interpolation results back to a jit function.

In [None]:
%%px --block
import scipy.interpolate

@bodo.jit(distributed=["X", "Y", "X2"])
def dist_pass_test(n):
    X = np.arange(n)
    Y = np.exp(-X/3.0)
    X2 = np.arange(0, n, 0.5)
    return X, Y, X2

X, Y, X2 = dist_pass_test(100)
# clip potential out-of-range values
X2 = np.minimum(np.maximum(X2, X[0]), X[-1])
f = scipy.interpolate.interp1d(X, Y)
Y2 = f(X2)

@bodo.jit(distributed={'Y2'})
def dist_pass_res(Y2):
    return Y2.sum()

res = dist_pass_res(Y2)
print(res)

### Visualization
A simple approach for visualization is pulling data to the notebook process from execution engines and using Python visualization libraries. Distributed data can be gathered if there is enough memory on the local machine. Otherwise, a sample of data can be gathered. The example code below demonstrates gathering a portion of data for visualization:

In [None]:
%%px --block

@bodo.jit
def dist_gather_test(n):
    X = np.arange(n)
    Y = np.exp(-X/3.0)
    return bodo.gatherv(Y[::10])  # gather every 10th element


Y_sample = dist_gather_test(100)


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

Y_sample = view['Y_sample'][0]
plt.plot(Y_sample)

# 5. Troubleshooting

## Compilation Tips

The general recommendation is to **compile the code that is performance critical and/or requires scaling**.

1. Don’t use Bodo for scripts that set up infrastructure or do initializations.
2. Only use Bodo for data processing and analytics code.

This reduces the risk of hitting unsupported features and reduces compilation time. To do so, simply factor out the code that needs to be compiled by Bodo and pass data into Bodo compiled functions.

## Compilation Errors

The most common cause of compilation errors is code that relies on features Bodo does not currently support, so it’s important to understand Bodo's limitations. There are four main ones:

1. Unsupported Pandas APIs (see [here](http://docs.bodo.ai/latest/source/pandas.html#pandas))
2. Unsupported NumPy APIs (see [here](http://docs.bodo.ai/latest/source/numpy.html#numpy))
3. Unsupported Python features or data types (see [here](http://docs.bodo.ai/latest/source/not_supported.html#unsupported-python-constructs))
4. Python programs that cannot be compiled due to type instability

Solutions:

1. Make sure your code works in Python (using a small sample dataset): often, a Bodo-decorated function that fails to compile does not run in regular Python either.
2. Replace unsupported operations with supported operations if possible.
3. Refactor the code to partially use regular Python, explained in "Integration with non-Bodo APIs" section.

For example, the code below uses heterogeneous list values inside `a`, which cannot be typed:

In [None]:
@bodo.jit
def f(n):
    a = [[-1, "a"]]
    for i in range(n):
        a.append([i, "a"])
    return a

print(f(3))

However, this use case can be rewritten to use tuple values instead of lists since values don't change:

In [None]:
@bodo.jit
def f(n):
    a = [(-1, "a")]
    for i in range(n):
        a.append((i, "a"))
    return a

print(f(3))

### DataFrame Schema Stability

Deterministic dataframe schemas (column names and types), which are required in most data systems, are key for type stability. For example, variable `df` in example below could be either a single column dataframe or a two column one – Bodo cannot determine it at compilation time:

In [None]:
@bodo.jit
def f(a):
    df = pd.DataFrame({"A": [1, 2, 3]})
    df2 = pd.DataFrame({"A": [1, 3, 4], "C": [-1, -2, -3]})
    if len(a) > 3:
        df = df.merge(df2)

    return df.mean()

print(f([2, 3]))
# TypeError: Cannot unify dataframe((array(int64, 1d, C),), RangeIndexType(none), ('A',), False)
# and dataframe((array(int64, 1d, C), array(int64, 1d, C)), RangeIndexType(none), ('A', 'C'), False) for 'df'

The error message means that Bodo cannot find a type that can unify the two types into a single type. This code can be refactored so that the if control flow is executed in regular Python context, but the rest of computation is in Bodo functions. For example, one could use two versions of the function:

In [None]:
@bodo.jit
def f1():
    df = pd.DataFrame({"A": [1, 2, 3]})
    return df.mean()

@bodo.jit
def f2():
    df = pd.DataFrame({"A": [1, 2, 3]})
    df2 = pd.DataFrame({"A": [1, 3, 4], "C": [-1, -2, -3]})
    df = df.merge(df2)
    return df.mean()

a = [2, 3]
if len(a) > 3:
    print(f2())  # merged, two-column version
else:
    print(f1())  # single-column version

Another common place where schema stability may be compromised is when passing a non-constant list of key column names to dataframe operations such as `groupby`, `merge` and `sort_values`. In these operations, Bodo must be able to deduce the list of key column names at compile time in order to determine the output dataframe schema. For example, the program below is potentially type unstable since Bodo may not be able to infer `column_list` during compilation:

In [None]:
@bodo.jit
def f(a):
    column_list = a[0]  # some computation that cannot be inferred statically
    df = pd.DataFrame({"A": [1, 2, 1], "B": [4, 5, 6]})
    return df.groupby(column_list).sum()

f(["A"])
# BodoError: groupby(): 'by' parameter only supports a constant column label or column labels.

## Nullable Integers in Pandas

DataFrame and Series objects with integer data need special care due to [integer NA issues in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions). By default, Pandas dynamically converts integer columns to floating point when missing values (NAs) are needed, which can result in loss of precision as well as type instability.
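
The precision loss mentioned above comes from the float64 format itself, which can represent integers exactly only up to 2^53. The pure-Python check below illustrates this with Python's built-in `float` (a float64); it does not use Pandas or Bodo.

```python
big = 2**53 + 1              # 9007199254740993, fits easily in int64

as_float = float(big)        # what a promotion to float64 would store
print(int(as_float))         # 9007199254740992
print(int(as_float) == big)  # False: the last digit was lost

small = 2**53                # still exactly representable
print(int(float(small)) == small)  # True
```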

Pandas introduced [a new nullable integer data type](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html#integer-na) that can solve this issue, which is also supported by Bodo. For example, this code reads column A into a nullable integer array (the capital “I” denotes nullable integer type):

In [None]:
data = (
    "11,1.2\n"
    "-2,\n"
    ",3.1\n"
    "4,-0.1\n"
)

with open("data.csv", "w") as f:
    f.write(data)


@bodo.jit(distributed=["df"])
def f():
    dtype = {"A": "Int64", "B": "float64"}
    df = pd.read_csv("data.csv", dtype = dtype, names = dtype.keys())
    return df

f()

## Boxing/Unboxing Overheads

Bodo uses efficient native data structures, which can differ from Python's. When Python values are passed to Bodo, they are *unboxed* to a native representation. Conversely, returning Bodo values requires *boxing* them into Python objects. Boxing and unboxing can have significant overhead depending on the size and type of the data. For example, repeatedly passing string columns between Python and Bodo can be expensive:

In [9]:
@bodo.jit(distributed=["df"])
def gen_data():
    df = pd.read_parquet("cycling_dataset.pq")
    df["hr"] = df["hr"].astype(str)
    return df

@bodo.jit(distributed=["df", "x"])
def mean_power(df):
    x = df.hr.str[1:]
    return x

df = gen_data()
res = mean_power(df)
print(res)

0        1
1        2
2        2
3        3
4        3
        ..
3897    00
3898    00
3899    00
3900    00
3901    00
Name: hr, Length: 3902, dtype: object


One can try to keep data in Bodo functions as much as possible to avoid boxing/unboxing overheads:

In [11]:
@bodo.jit(distributed=["df"])
def gen_data():
    df = pd.read_parquet("cycling_dataset.pq")
    df["hr"] = df["hr"].astype(str)
    return df

@bodo.jit(distributed=["df", "x"])
def mean_power(df):
    x = df.hr.str[1:]
    return x

@bodo.jit
def f():
    df = gen_data()
    res = mean_power(df)
    print(res)

f()

0        1
1        2
2        2
3        3
4        3
        ..
3897    00
3898    00
3899    00
3900    00
3901    00
Name: hr, Length: 3902, dtype: object
