# Bodo Getting Started Tutorial

Bodo is the simplest and most efficient analytics engine. It accelerates and scales data science programs
automatically and enables instant deployment, eliminating the need to rewrite Python analytics code to Spark/Scala, SQL or MPI/C++.
In this tutorial, we will cover the basics of using Bodo and explain how it works under the hood.

In a nutshell, Bodo provides a just-in-time (JIT) compilation workflow using the `@bodo.jit` decorator. It replaces decorated Python functions with an optimized and parallelized binary version using advanced compilation methods.

Let's get started!

## 1. Environment Setup
Make sure MPI engines are started in the `IPython Clusters` tab (or using `ipcluster start -n 8 --profile=mpi`), then initialize the `ipyparallel` environment:

In [1]:
import ipyparallel as ipp
c = ipp.Client(profile='mpi')
view = c[:]
view.activate()

## 2. Parallel pandas with Bodo
First, we will show how Bodo automatically parallelizes, scales and optimizes standard Python programs that make use of pandas and NumPy, without the need to rewrite your code. Bodo can run your Python code on thousands of cores and, depending on the size of your data and the operations that the code performs, it can speed up analytics programs by several orders of magnitude.

### Generate data
To begin, let's generate a simple dataset (the size of this dataframe in memory is approximately 305 MB, and the size of the written Parquet file is 77 MB):

In [2]:
import pandas as pd
import numpy as np

NUM_GROUPS = 10
NUM_ROWS = 20000000
A_col = [i % NUM_GROUPS for i in range(NUM_ROWS)]
df = pd.DataFrame({"A": A_col, "B": np.arange(NUM_ROWS)})
df.to_parquet("example1.pq")
print(df)

          A         B
0         0         0
1         1         1
2         2         2
3         3         3
4         4         4
...      ..       ...
19999995  5  19999995
19999996  6  19999996
19999997  7  19999997
19999998  8  19999998
19999999  9  19999999

[20000000 rows x 2 columns]


### Data analysis
Now let's read and process this dataframe. First using Python and pandas:

In [3]:
import pandas as pd

def test():
    df = pd.read_parquet("example1.pq")
    result = df.groupby("A").sum()
    return result

result = test()
print(result)

                B
A                
0  19999990000000
1  19999992000000
2  19999994000000
3  19999996000000
4  19999998000000
5  20000000000000
6  20000002000000
7  20000004000000
8  20000006000000
9  20000008000000


Now let's run it with Bodo in parallel. To do this, all that we have to do is add the `bodo.jit` decorator to the function:

In [2]:
%%px --block
import pandas as pd
import bodo

@bodo.jit(distributed=['result'])
def test():
    df = pd.read_parquet("example1.pq")
    result = df.groupby("A").sum()
    return result

result = test()
print(result)

[stdout:0] 
                B
A                
0  19999990000000
4  19999998000000
5  20000000000000
6  20000002000000
[stdout:1] 
                B
A                
1  19999992000000
2  19999994000000
3  19999996000000
7  20000004000000
8  20000006000000
9  20000008000000


There are several things to explain here, and we will go over them one by one below. At the end, we will do some basic benchmarking of this example and compare its performance with pandas.

What is important to understand first is how this code is being executed in parallel. If you are running this in a notebook, the `%%px` magic sends the above code for execution on every process (or MPI engine).

The main thing to note is that although the program appears to be a regular sequential Python program, Bodo is compiling and *transforming* the decorated code (the `test` function in this example) under the hood, so that it can run in parallel on many cores. There will be many cores running the same code in parallel (which is a transformed version of the original code) and each core will be operating on a different chunk of the data.

Although you don't need to understand how Bodo parallelizes, it is important to understand this distinction between a regular Python program and a parallel Bodo program in order to achieve a correct distribution or partitioning of data across processes for effective parallelization. We will explain this below.

The result obtained is the same as pandas but presented differently because each process owns a different part of the data. We will explain this further in a moment.

### Specifying data distribution

Bodo automatically distributes data and computation in Bodo functions by analyzing them for parallelization. However, Bodo does not know how input parameters of Bodo functions are distributed, and similarly how the user wants to handle return values. As such, by default it makes the assumption that input parameters and return values are *replicated* (that is, every process receives the same input data and returns the same output, as opposed to different data chunks).

<div class="alert alert-block alert-danger">
<b>Important:</b> the distribution scheme of input parameters and return values determines the distribution scheme for variables inside the Bodo function that depend on them.
</div>

In our above example, if we don't tell Bodo that the return variable is `distributed`, Bodo will assume that the data associated with `result` has to be the same on every process, which has an inverse cascading effect: if that data is on every process, that means the data of `df` is also on every process, which means that the `read_parquet` should be a sequential read of the whole file on every process.

To illustrate, let's see what happens if we don't use the `distributed` flag in the decorator:

In [3]:
%%px --block
import pandas as pd
import bodo

@bodo.jit
def test():
    df = pd.read_parquet("example1.pq")
    result = df.groupby("A").sum()
    return result

result = test()
print(result)

[stdout:0] 
                B
A                
0  19999990000000
1  19999992000000
2  19999994000000
3  19999996000000
4  19999998000000
5  20000000000000
6  20000002000000
7  20000004000000
8  20000006000000
9  20000008000000
[stdout:1] 
                B
A                
0  19999990000000
1  19999992000000
2  19999994000000
3  19999996000000
4  19999998000000
5  20000000000000
6  20000002000000
7  20000004000000
8  20000006000000
9  20000008000000


[stderr:0] 


As we can see, `result` has the same data on every process. Furthermore, Bodo warns that it didn't find any parallelism inside the `test` function. Note that in this case every process reads the whole input Parquet file, and every process is executing the same sequential program, as opposed to doing a computation in parallel.

### Parallel read
When Bodo reads data from a Parquet file, it reads it in parallel and, assuming the data is determined to be distributed as we saw above, each process reads a different chunk of the dataset. This also means that reads can be proportionally faster compared to sequential reading with one process. To illustrate more clearly the effect of reading different chunks, let's print the data that is read by each process:

In [4]:
%%px --block
import pandas as pd
import time
import bodo

@bodo.jit(distributed=['result'])
def test():
    df = pd.read_parquet("example1.pq")
    bodo.parallel_print(df)

test()

[stdout:0] 
         A        B
0        0        0
1        1        1
2        2        2
3        3        3
4        4        4
...     ..      ...
9999995  5  9999995
9999996  6  9999996
9999997  7  9999997
9999998  8  9999998
9999999  9  9999999

[10000000 rows x 2 columns]
[stdout:1] 
          A         B
10000000  0  10000000
10000001  1  10000001
10000002  2  10000002
10000003  3  10000003
10000004  4  10000004
...      ..       ...
19999995  5  19999995
19999996  6  19999996
19999997  7  19999997
19999998  8  19999998
19999999  9  19999999

[10000000 rows x 2 columns]


Looking at column B, we can clearly see that each process has a separate chunk of the original dataframe. 
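To make the chunking concrete, here is a minimal sketch of how a contiguous block split of rows across processes could be computed. The helper `block_range` is hypothetical and only illustrates the idea of a 1D block distribution; it is not Bodo's actual implementation, and the rule for handling leftover rows is an assumption.

```python
def block_range(n_rows, n_procs, rank):
    """Return the half-open row range [start, stop) owned by `rank`.

    Hypothetical helper: rows are split into contiguous blocks whose
    sizes differ by at most one row.
    """
    base, extra = divmod(n_rows, n_procs)
    # The first `extra` ranks each take one additional row.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# 20,000,000 rows over 2 processes, as in the example above.
for rank in range(2):
    print(rank, block_range(20_000_000, 2, rank))
```

With two processes this yields the ranges `[0, 10000000)` and `[10000000, 20000000)`, which matches the index values printed by each process above.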

### Parallelizing computation
After reading the data, our example performs a groupby operation. Note that the dataset has 10 different groups, and we can see above that all processes have rows belonging to the ten groups. Therefore, computing the groupby in parallel requires data exchange and communication between processes. This is all done and handled automatically by Bodo using [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) for efficient communication. MPI is frequently used in large-scale parallel applications and high-performance computing. Bodo performs the groupby using its own parallel implementation of groupby. Likewise, Bodo has efficient parallel implementations of many other NumPy and pandas operators.
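The data exchange behind a parallel groupby can be pictured with a small single-process simulation. This is only a sketch under stated assumptions: the function names are made up, the "processes" are plain Python lists, and `key % num_procs` stands in for whatever hash partitioning a real implementation would use. It is not Bodo's groupby algorithm.

```python
from collections import defaultdict

NUM_PROCS = 2  # number of simulated processes (assumption for this sketch)


def local_partial_sums(rows):
    """Step 1: each process sums column B per key A over its own chunk."""
    partial = defaultdict(int)
    for a, b in rows:
        partial[a] += b
    return partial


def shuffle_and_combine(partials, num_procs):
    """Steps 2 and 3: route each key's partial sum to one owner, then add up."""
    owned = [defaultdict(int) for _ in range(num_procs)]
    for partial in partials:
        for key, value in partial.items():
            owner = key % num_procs  # stand-in for a hash of the key
            owned[owner][key] += value
    return [dict(d) for d in owned]


# Small version of the tutorial's dataset: A = i % 10, B = i.
rows = [(i % 10, i) for i in range(20)]
chunks = [rows[:10], rows[10:]]  # contiguous chunks, as after a parallel read

results = shuffle_and_combine([local_partial_sums(c) for c in chunks], NUM_PROCS)
for rank, result in enumerate(results):
    print(rank, result)
```

Each key ends up on exactly one simulated process, which is why the distributed result printed earlier shows a different subset of groups on each process.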

### Parallel write

Let's do a slightly modified version of the program where we now write the results to disk. Bodo writes the results in parallel:

In [5]:
%%px --block
import pandas as pd
import bodo

@bodo.jit
def test():
    df = pd.read_parquet("example1.pq")
    result = df.groupby("A").sum()
    result.to_parquet("example1-results")

test()

Now let's read and print the results with pandas:

In [6]:
import pandas as pd

df = pd.read_parquet("example1-results")
print(df)

                B
A                
0  19999990000000
4  19999998000000
5  20000000000000
6  20000002000000
1  19999992000000
2  19999994000000
3  19999996000000
7  20000004000000
8  20000006000000
9  20000008000000


One thing that we can notice here is that the order of results generated by Bodo can differ from pandas. This is due to the parallel nature of the computation, and because Bodo doesn't automatically sort distributed data (due to it being expensive and not necessary in many cases). Users can always explicitly sort dataframes at any point in their workflow (including inside Bodo functions) if desired.
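As a small illustration of explicit sorting, the sketch below rebuilds the unsorted result shown above as a stand-in dataframe (rather than reading `example1-results` from disk) and restores the pandas ordering with `sort_index()`. It uses plain pandas outside of any Bodo function.

```python
import pandas as pd

# Stand-in for the dataframe read back from "example1-results",
# with groups in the order printed above.
keys = [0, 4, 5, 6, 1, 2, 3, 7, 8, 9]
df = pd.DataFrame(
    {"B": [19_999_990_000_000 + 2_000_000 * k for k in keys]},
    index=pd.Index(keys, name="A"),
)

sorted_df = df.sort_index()
print(sorted_df)
```

Sorting after the fact like this is cheap here because the aggregated result has only ten rows.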

### Basic benchmarking of the pandas example
Now let's do some basic benchmarking to observe the effect of Bodo's automatic parallelization. Here we are only scaling up to a few cores, but Bodo can scale the same code to many nodes in a cluster and thousands of cores.

Let's run the code again with pandas, this time including timers:

In [10]:
import pandas as pd
import time

def test():
    t0 = time.time()
    df = pd.read_parquet("example1.pq")
    print("Read time:", time.time() - t0, "secs")
    t0 = time.time()
    result = df.groupby("A").sum()
    print("Groupby time:", time.time() - t0, "secs")
    return result

result = test()

Read time: 0.5932414531707764 secs
Groupby time: 0.6460061073303223 secs


To get accurate and consistent timings across runs for Bodo, we put a `bodo.barrier()` before the computation to make sure that all processes start computing at the same time (recall that the groupby requires processes to communicate and shuffle data to do the computation in parallel, so we want to make sure they begin at the same time).

In [12]:
%%px --block
import pandas as pd
import time
import bodo

@bodo.jit(distributed=['result'])
def test():
    t0 = time.time()
    df = pd.read_parquet("example1.pq")
    bodo.barrier()
    print("Read time:", time.time() - t0, "secs")
    t0 = time.time()
    result = df.groupby("A").sum()
    print("Groupby time:", time.time() - t0, "secs")
    return result

result = test()

[stdout:0] 
Read time: 0.6947760581970215 secs
Groupby time: 0.34023189544677734 secs


Note how we have included timers inside the Bodo function. We do this because a Bodo-decorated function must be compiled the first time it is called. This compilation time is non-negligible, but we don't want to include it in measurements because: (i) for large computations the compilation time is typically not significant, whereas the example we are running now executes very quickly; (ii) some programs call the same Bodo function many times, and we want consistent timings. Compilation only occurs the first time a function is called, so placing the timers inside the Bodo function keeps compilation time out of the measurements.
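An alternative way to keep one-time costs out of a measurement is to make an untimed warm-up call first and time only the later calls. The helper below is a hypothetical, generic sketch (it is not part of Bodo) and assumes the warm-up call uses the same argument types as the timed calls.

```python
import time


def time_after_warmup(func, *args, repeats=3):
    """Call `func` once untimed, then return the best of `repeats` timed calls.

    Hypothetical helper: the untimed first call absorbs one-time costs such
    as JIT compilation, so the timed calls measure steady-state execution.
    """
    func(*args)  # warm-up call, not timed
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - t0)
    return best


def work(n):
    # Placeholder workload standing in for a decorated function.
    return sum(range(n))


print("best time:", time_after_warmup(work, 100_000), "secs")
```

Unlike timers placed inside the function, this measures the whole call, so it cannot separate the read time from the groupby time.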

As we can see, the groupby runs about twice as fast with Bodo (0.34 s versus 0.65 s with pandas) thanks to parallel computation, while the read time is similar for a file this small. The speedup depends on the number of cores used, and we can continue scaling to many more cores as long as the data is large enough.

## 3. Data Distribution

In our first example, we saw that Bodo performs parallel computation by having separate chunks of data divided among processes. However, we also saw that there are situations where some data handled by a Bodo function is not divided into chunks. There are two main ways in which data is distributed:

- Replicated: the data associated with the variable is the same on every process.
- 1D: the data is divided into chunks (equal or variable size) split along one dimension. For dataframes, the dimension is rows.

Variables which are scalar values are always replicated (they cannot be divided among processes).

For arrays, dataframes and other collections, the distribution of a variable is determined mostly by the distribution of variables that it depends on, which often leads back to the distribution of the input parameters of the Bodo function and its return value:

<div class="alert alert-block alert-danger">
<b>Important:</b> Bodo cannot infer the distribution scheme of input parameters and return values (this must be indicated by the user using the `distributed` flag) and assumes they are replicated by default.
</div>

In addition, the distribution scheme of variables also depends on the nature of the parallel computation that produces the value.

Let's illustrate with some examples.

Consider the following simple example that computes the mean of a dataframe column:

In [34]:
%%px --block
import pandas as pd
import bodo

@bodo.jit
def mean_power():
    df = pd.read_parquet('cycling_dataset.pq')
    x = df.power.mean()
    return x

res = mean_power()
print(res)

[stdout:0] 102.07842132239877
[stdout:1] 102.07842132239877


The mean can be computed in parallel. Different processes can read different chunks of the dataset in parallel, compute the sum for their chunk independently, and then exchange that information with the others in order to compute the mean. The result `x` is a scalar value and thus replicated (it is available on every process).

Note that there is nothing that requires `df` to be replicated:
- The replicated value `x` can be obtained from a parallel computation on distributed data.
- The mean can be calculated in parallel from distributed chunks of `df`.
- `df` comes from a read operation, which can read chunks.

For this reason, Bodo automatically determines that the best distribution for `df` is 1D-distributed.
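The reasoning above can be sketched in a few lines of plain Python: each chunk contributes a local sum and a local count, and combining those partial results gives the global mean. The `power` values and function names here are made up for illustration, and the combine step only imitates what an all-reduce across processes would do.

```python
def local_sum_and_count(chunk):
    """Work done independently by each process on its own chunk."""
    return sum(chunk), len(chunk)


def combine(partials):
    """Imitates the reduction step: add up the sums and counts, then divide."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


power = [0, 120, 95, 110, 0, 130, 101, 99]  # made-up sample values
chunks = [power[:4], power[4:]]             # two simulated processes

mean = combine([local_sum_and_count(c) for c in chunks])
print(mean)
```

Only two numbers per process need to be exchanged, regardless of how many rows each chunk holds.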

We can be confident that this program runs with true parallelism because Bodo does not warn about the lack of parallelism.

Note that because `x` is a scalar value which can only be replicated, if you try marking it as distributed with `distributed=['x']` Bodo will throw an error.

Now let's see what happens if instead we pass the data into the Bodo function as a function parameter and we don't mark it as distributed:

In [41]:
%%px --block
import pandas as pd
import bodo

@bodo.jit
def mean_power(df):
    x = df.power.mean()
    return x

df = pd.read_parquet('cycling_dataset.pq')
res = mean_power(df)
print(res)

[stdout:0] 102.07842132239877
[stdout:1] 102.07842132239877


[stderr:0] 


The program runs and returns the same correct value as before, but there is an important difference: *there is no true parallelism!* Bodo shows an explicit warning about this. What is happening is that each process reads the whole data file and calculates the mean of the dataframe independently.

*It is important to understand why this is happening.*

Bodo acts in this way because we are passing data into a Bodo function but we haven't specified that the data is distributed, so Bodo assumes that it is replicated. Bodo makes this assumption because it is the safe assumption to make when data comes in from Python, in order to guarantee a correct result.

In this example, it turns out that it is indeed the correct assumption, because the data is actually replicated. Here we are reading the dataset outside of a Bodo function. Recall that every process (MPI engine) is executing the same code. The `read_parquet` is regular Python code that is being executed on every process to read the same file using pandas.

If we compute `var` instead of `mean` and specify `distributed=['df']`, the result is different. This is why it is important to tell Bodo the actual distribution of the input data.
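One way to see why a wrong annotation can change the result is to simulate it. The sketch below assumes that a parallel reduction simply combines whatever rows each process holds, so declaring fully replicated data as distributed is equivalent to computing over the data repeated once per process. This is an illustrative assumption, not a description of Bodo's internals, and the values are made up.

```python
import statistics

data = [1.0, 2.0, 3.0, 4.0]  # made-up values
n_procs = 2

# What the reduction would effectively see if each of the two processes held
# the full data but it was declared as a set of distributed chunks.
seen_as_chunks = data * n_procs

# The mean is unaffected by the duplication.
print(statistics.mean(data), statistics.mean(seen_as_chunks))

# The sample variance (divisor n - 1, the pandas default) is affected.
print(statistics.variance(data), statistics.variance(seen_as_chunks))
```

The mean survives the duplication because both the sum and the count double, while the sample variance changes because its divisor is `n - 1` rather than `n`.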

### Passing distributed data into Bodo functions

Now you might be wondering how we can pass distributed data from Python into Bodo functions.

One way is to read it with Bodo, return it to Python using the `distributed` flag, and pass it to a different Bodo function using the `distributed` flag. For example:

In [2]:
%%px --block
import pandas as pd
import bodo

@bodo.jit(distributed=['df'])
def read_data():
    df = pd.read_parquet('cycling_dataset.pq')    
    return df

@bodo.jit(distributed=['df'])
def mean_power(df):
    x = df.power.mean()
    return x

df = read_data()
# ... we could do some stuff in Python with df now ...
res = mean_power(df)
print(res)

[stdout:0] 102.07842132239877
[stdout:1] 102.07842132239877


In this example, we are reading the data by calling a Bodo function (`read_data`). We are telling Bodo that the return value (which would be replicated by default) is distributed. This means that Bodo will have each process read a separate chunk from the file, and each process will return only that chunk to the encompassing Python code. Again, recall that every process is running a copy of the same code.

Then, from Python, we call another Bodo function (`mean_power`) passing it the chunk of data. We need to tell Bodo that this is a chunk as opposed to replicated data, so we use the `distributed` flag.

Note how the code runs and Bodo does not warn about lack of parallelism.

Another way to pass distributed data into Bodo functions from Python is to read or generate it from one process, and then scatter it to every process using MPI:

In [3]:
%%px --block
import pandas as pd
import bodo

@bodo.jit(distributed=['df'])
def mean_power(df):
    x = df.power.mean()
    return x

if bodo.get_rank() == 0:
    # this executes only on process 0
    df = pd.read_parquet('cycling_dataset.pq')
    # scatter chunks of df across all processes
    df = bodo.scatterv(df)
else:
    # this executes on every process other than process 0
    # receive my chunk
    df = bodo.scatterv(None)
res = mean_power(df)
print(res)

[stdout:0] 102.07842132239877
[stdout:1] 102.07842132239877


This is a common pattern in MPI applications.

## 4. Pi Example

Let's look at a simple example, which computes the value of Pi using Monte Carlo integration, to get familiar with the execution environment. Here is the Python version:

In [2]:
import numpy as np
import time

def calc_pi(n):
    t1 = time.time()
    x = 2 * np.random.ranf(n) - 1
    y = 2 * np.random.ranf(n) - 1
    pi = 4 * np.sum(x**2 + y**2 < 1) / n
    print("Execution time:", time.time()-t1, "\nresult:", pi)
    return pi

calc_pi(2 * 10**8)

Execution time: 8.239847183227539 
result: 3.14149874


3.14149874

Now let's add the `@bodo.jit` decorator and run it on just one core (without parallel engines):

In [3]:
import bodo

@bodo.jit
def calc_pi(n):
    t1 = time.time()
    x = 2 * np.random.ranf(n) - 1
    y = 2 * np.random.ranf(n) - 1
    pi = 4 * np.sum(x**2 + y**2 < 1) / n
    print("Execution time:", time.time()-t1, "\nresult:", pi)
    return pi

calc_pi(2 * 10**8)

Execution time: 2.248674508999102 
result: 3.1414679


3.1414679

We see significant speedup due to compiler optimization. Now let's use the parallel engines using the `%%px` magic:

In [4]:
%%px --block
import bodo
import numpy as np
import time

@bodo.jit
def calc_pi(n):
    t1 = time.time()
    x = 2 * np.random.ranf(n) - 1
    y = 2 * np.random.ranf(n) - 1
    pi = 4 * np.sum(x**2 + y**2 < 1) / n
    print("Execution time:", time.time()-t1, "\nresult:", pi)
    return pi

p = calc_pi(2 * 10**8)

[stdout:0] 
Execution time: 0.8065992589981761 
result: 3.14135774


Bodo automatically parallelizes this code and distributes the work among parallel engines. Hence, we see additional speedup depending on the number of cores used by engines.
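The way such a computation splits across workers can be sketched without any parallel runtime: each simulated worker counts hits for its share of the samples, and the counts are summed at the end. The function name, the sample count, and the per-worker seeding scheme are assumptions made for this illustration, not details of how Bodo generates random numbers.

```python
import random


def count_hits(n_samples, seed):
    """Count random points in [-1, 1) x [-1, 1) that fall inside the unit circle."""
    rng = random.Random(seed)  # separate stream per worker
    hits = 0
    for _ in range(n_samples):
        x = 2 * rng.random() - 1
        y = 2 * rng.random() - 1
        if x * x + y * y < 1:
            hits += 1
    return hits


N = 40_000   # total samples, kept small so the sketch runs quickly
N_PROCS = 4  # simulated workers
per_proc = N // N_PROCS

# Each worker handles its own share; only the hit counts need to be combined.
hits = [count_hits(per_proc, seed=rank) for rank in range(N_PROCS)]
pi_estimate = 4 * sum(hits) / (per_proc * N_PROCS)
print(pi_estimate)
```

Giving each worker a different seed matters: with identical seeds every worker would draw the same points, and the extra samples would add no information.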

<img style="float: right;" src="img/data-parallel.jpg">

## Data-Parallel Operations
Many operations in NumPy and pandas are fully data-parallel, which lets Bodo parallelize them across data blocks without communication between processors.
Examples include many math operators, filtering, combining columns, normalization, dropping rows/columns, etc.
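The defining property of a data-parallel operation is that applying it chunk by chunk gives the same rows as applying it to the whole dataset. The sketch below checks this for a simple filter, using made-up rows and plain Python lists as stand-ins for per-process chunks.

```python
# Made-up rows; the column names mirror the cycling dataset used in this section.
rows = [
    {"power": p, "hr": h}
    for p, h in [(0, 81), (0, 82), (95, 83), (120, 83), (110, 80), (130, 84)]
]
chunks = [rows[:3], rows[3:]]  # two simulated processes


def keep_nonzero_power(chunk):
    """Row filter that needs no information from any other chunk."""
    return [r for r in chunk if r["power"] != 0]


per_chunk = [keep_nonzero_power(c) for c in chunks]
combined = [r for chunk in per_chunk for r in chunk]

# Filtering each chunk independently matches filtering everything at once.
assert combined == keep_nonzero_power(rows)
print([len(c) for c in per_chunk])
```

After the filter the chunks no longer have equal sizes, which is one way variable-size chunks arise.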

Let's drop some rows and columns, and create a new column by extracting month from the time column.

<span style='background :yellow' > This example is either buggy or we have to give a good explanation somewhere on why the result is a chunk and not replicated. I opened an issue about this. (-Juan) </span>

In [5]:
%%px --block

@bodo.jit
def data_par():
    df = pd.read_parquet('cycling_dataset.pq')
    df = df[df.power != 0]
    df['month'] = df.time.dt.month
    df = df.drop(['latitude', 'longitude', 'power', 'time'], axis=1)
    #return df.head()
    return df

res = data_par()
#if bodo.get_rank() == 0: display(res)
print(res)

[stdout:0] 
      Unnamed: 0    altitude  cadence      distance   hr  speed  month
0              0  185.800003       51      3.460000   81  3.459     10
2              2  186.399994       38     11.040000   82  3.874     10
3              3  186.800003       38     15.180000   83  4.135     10
4              4  186.600006       38     19.430000   83  4.250     10
12            12  186.199997        0     51.610001   80  3.029     10
...          ...         ...      ...           ...  ...    ...    ...
1946         551  146.600006       72  11345.559570  133  6.300     10
1947         552  147.000000       72  11351.919922  134  6.356     10
1948         553  147.399994       99  11358.370117  135  6.452     10
1949         554  147.600006       85  11364.919922  136  6.550     10
1950         555  147.600006       78  11371.500000  137  6.581     10

[1201 rows x 7 columns]
[stdout:1] 
      Unnamed: 0    altitude  cadence      distance   hr  speed  month
1951         556  147.600006

<img style="float: right;" src="img/reduction.jpg">

## Reduction operations

Some operators such as `sum` require a reduction operation across all the data, which implies communication across data blocks. Bodo handles these operations using efficient MPI communication, and makes the output available on all processors.

As an example let's compute the mean of the 'power' column.

In [25]:
%%px --block

@bodo.jit
def mean_power():
    df = pd.read_parquet('cycling_dataset.pq')
    x = df.power.mean()
    return x

res = mean_power()
#if bodo.get_rank() == 0: display(res)
print(res)

[stdout:0] 102.07842132239877
[stdout:1] 102.07842132239877


<img style="float: right;" src="img/groupby.jpg">

## GroupBy/Aggregation

Grouping operations, which are typically followed by aggregations/reductions, are
more challenging for parallel and distributed environments. Bodo uses efficient MPI communication primitives to provide fast and scalable groupby/aggregations.

Let's compute the average power output per hour:

In [9]:
%%px --block
import pandas as pd
import numpy as np
import bodo

@bodo.jit
def mean_power_pm():
    df = pd.read_parquet('cycling_dataset.pq')
    df['hour'] = df.time.dt.hour
    grp = df.groupby('hour')
    mean_df = grp['power'].mean()
    return mean_df.head()

res = mean_power_pm()
if bodo.get_rank() == 0: display(res)

[output:0]

22    110.625821
23     71.754079
Name: power, dtype: float64

<img style="float: right;" src="img/rolling.jpg">

## Sliding Windows

Some popular analytics operations, especially for time-series analysis, are based on sliding windows. Examples include moving averages and percentage change. In a distributed setup, these require communication beyond map-reduce (which is the basis of most systems such as Spark). Bodo handles these cases using efficient patterns known from HPC.
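To illustrate why sliding windows need more than independent per-chunk work, the sketch below computes a rolling mean chunk by chunk. Each chunk first receives the last `window - 1` values that precede it (often called a halo or boundary exchange). The function names and heart-rate values are made up, and this is a single-process simulation rather than Bodo's implementation.

```python
def rolling_mean(values, window):
    """Sequential reference: None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def rolling_mean_chunked(chunks, window):
    """Chunk-wise version: each chunk borrows the trailing values before it."""
    out = []
    halo = []  # trailing values received from the preceding data
    for chunk in chunks:
        padded = halo + chunk
        local = rolling_mean(padded, window)
        out.extend(local[len(halo):])  # drop results for the halo positions
        halo = padded[-(window - 1):] if window > 1 else []
    return out


hr = [81, 82, 83, 83, 80, 84, 88, 90]  # made-up heart-rate values
chunks = [hr[:4], hr[4:]]              # two simulated processes

print(rolling_mean_chunked(chunks, 4))
```

Without the borrowed values, the first three results of the second chunk would be missing, even though the sequential computation defines them.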

Let's compute the moving average of the heart-rate.

In [10]:
%%px --block

@bodo.jit
def mov_avg():
    df = pd.read_parquet('cycling_dataset.pq')
    mv_av = df.hr.rolling(4).mean()
    return mv_av.head()

res = mov_avg()
if bodo.get_rank() == 0: display(res)

[output:0]

0     NaN
1     NaN
2     NaN
3    82.0
4    82.5
Name: hr, dtype: float64

## Join

Bodo can also efficiently join dataframes, which uses a communication pattern similar to Groupby.

Let's read the data, split it into two dataframes, and re-join them on the `time` column.

In [11]:
%%px --block

@bodo.jit
def merge_dfs():
    df = pd.read_parquet('cycling_dataset.pq')
    df1 = df[['altitude', 'cadence', 'distance', 'hr', 'time']]
    df2 = df[['latitude', 'longitude', 'power', 'speed', 'time']]
    df3 = df1.merge(df2, on='time')
    return df3.head()

res = merge_dfs()
if bodo.get_rank() == 0: display(res)

[output:0]

Unnamed: 0,altitude,cadence,distance,hr,time,latitude,longitude,power,speed
0,186.600006,0,23.860001,84,2016-10-20 22:01:31,30.31313,-97.732724,0,4.435
1,186.600006,0,28.35,84,2016-10-20 22:01:32,30.313093,-97.732723,0,4.49
2,186.600006,0,32.970001,83,2016-10-20 22:01:33,30.31305,-97.732717,0,4.621
3,186.600006,0,37.560001,84,2016-10-20 22:01:34,30.313011,-97.732702,0,4.591
4,186.199997,70,86.339996,88,2016-10-20 22:01:46,30.31258,-97.732621,109,5.553


## Automatic Distribution

Bodo automatically distributes data and computation of the target function by analyzing it for parallelization. It chooses the best and *safest* possible distribution. For example, returning distributed data is not necessarily *safe*, since the code outside of the Bodo scope would need to handle chunks of data instead of the full data. Consider the example below:

In [12]:
%%px --block

@bodo.jit
def read_pq():
    df = pd.read_parquet('cycling_dataset.pq')
    return df

df = read_pq()
print(df.shape)
if bodo.get_rank() == 0: read_pq.distributed_diagnostics()

[stdout:0] 
(3902, 10)
Distributed diagnostics for function read_pq, <ipython-input-76-70a5aeb5d795> (1)

Data distributions:
   Unnamed: 0.77403           REP
   altitude.77404             REP
   cadence.77405              REP
   distance.77406             REP
   hr.77407                   REP
   latitude.77408             REP
   longitude.77409            REP
   power.77410                REP
   speed.77411                REP
   time.77412                 REP
   __index_level_0__.77413    REP
   $0.15.77429                [<Distribution.REP: 1>, <Distribution.REP: 1>, <Distribution.REP: 1>, <Distribution.REP: 1>, <Distribution.REP: 1>, <Distribution.REP: 1>, <Distribution.REP: 1>, <Distribution.REP: 1>, <Distribution.REP: 1>, <Distribution.REP: 1>]
   $0.23.77539                REP
   $df.77578                  REP
   $0.6                       REP

Parfor distributions:
No parfors to distribute.

Distributed listing for function read_pq, <ipython-input-76-70a5aeb5d795> (1)
---------

[stderr:0] 
  "information.".format(self.func_ir.func_id.func_name)


The `distributed_diagnostics` function prints diagnostic information about Bodo's distribution analysis. In this case, all variables are assigned the `REP` distribution, which means they are replicated and there is no distribution of data. The reason is the return of `df`, which also propagates `REP` to all other variables since they are involved in parallel computation with `df`.

We can change this behavior by a simple annotation for `df`:

In [13]:
%%px --block

@bodo.jit(distributed=['df'])
def read_pq():
    df = pd.read_parquet('cycling_dataset.pq')
    return df

df = read_pq()
print(df.shape)
if bodo.get_rank() == 0: read_pq.distributed_diagnostics()

[stdout:0] 
(488, 10)
Distributed diagnostics for function read_pq, <ipython-input-77-cdd5d7a9c9a3> (1)

Data distributions:
   Unnamed: 0.77776            1D_Block
   altitude.77777              1D_Block
   cadence.77778               1D_Block
   distance.77779              1D_Block
   hr.77780                    1D_Block
   latitude.77781              1D_Block
   longitude.77782             1D_Block
   power.77783                 1D_Block
   speed.77784                 1D_Block
   time.77785                  1D_Block
   __index_level_0__.77786     1D_Block
   $0.15.77802                 [<Distribution.OneD: 5>, <Distribution.OneD: 5>, <Distribution.OneD: 5>, <Distribution.OneD: 5>, <Distribution.OneD: 5>, <Distribution.OneD: 5>, <Distribution.OneD: 5>, <Distribution.OneD: 5>, <Distribution.OneD: 5>, <Distribution.OneD: 5>]
   $0.23.77922                 1D_Block
   $df.77964                   1D_Block
   distributed_return.77845    1D_Block
   $dist_return.77842.77965    1D_Block

Pa

In this case, all variables are assigned the `1D_Block` distribution, which means they are divided into equal chunks among processors. The returned dataframe on each processor is therefore a chunk of the full dataset. This is useful, for example, when computation on chunks is desired outside the scope of Bodo (e.g. mixing Bodo code with custom non-Bodo code and other packages like TensorFlow). In general, the `distributed` flag can be used both for passing distributed chunks as input arguments and for returning distributed chunks.