# Getting Started with Bodo


---------------

## Connect to a cluster 

This notebook runs code on a cluster. If you are in the Community Edition Workspace and your notebook is *detached*, make sure the Community Edition Cluster is running: in the sidebar, right-click **<img src="img/cluster_icon.png"/> Clusters** and open the tab. If the cluster is paused, click the play button. Once the state changes to running, the notebook should attach to the cluster automatically. If it doesn't, click the dropdown and attach to the Community Edition Cluster.


## Run the next code cell to check the number of cores on the cluster


If you are running your code on the Community Edition Cluster (which has 2 c5.2xlarge nodes with 8 physical cores in total), you should see 8 lines of output, in no particular order, of the form:

`[stdout:<n>] Hello World from rank <n>. Total ranks=8`
    

In [1]:
%%px
import bodo
import warnings
warnings.filterwarnings("ignore")
print(f"Hello World from rank {bodo.get_rank()}. Total ranks={bodo.get_size()}")

Starting 8 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>



[stdout:0] Hello World from rank 0. Total ranks=8


[stdout:6] Hello World from rank 6. Total ranks=8


[stdout:4] Hello World from rank 4. Total ranks=8


[stdout:2] Hello World from rank 2. Total ranks=8


[stdout:5] Hello World from rank 5. Total ranks=8


[stdout:7] Hello World from rank 7. Total ranks=8


[stdout:1] Hello World from rank 1. Total ranks=8


[stdout:3] Hello World from rank 3. Total ranks=8


## Generate some Parquet data
Run the next code cell to generate a simple dataset and write it to a local [Parquet](http://parquet.apache.org/) file:

In [2]:
import pandas as pd
import numpy as np

# 10m data points
df = pd.DataFrame(
    {
        "A": np.repeat(pd.date_range("2013-01-03", periods=1000), 10_000),
        "B": np.arange(10_000_000),
    }
)
# set every third value of column A in the first 3,000 rows to NA (1,000 values)
df.iloc[np.arange(1000) * 3, 0] = pd.NA
# an explicit row_group_size helps with efficient parallel reads of the data later
df.to_parquet("pd_example.pq", row_group_size=100000)
print(df)

                 A        B
0              NaT        0
1       2013-01-03        1
2       2013-01-03        2
3              NaT        3
4       2013-01-03        4
...            ...      ...
9999995 2015-09-29  9999995
9999996 2015-09-29  9999996
9999997 2015-09-29  9999997
9999998 2015-09-29  9999998
9999999 2015-09-29  9999999

[10000000 rows x 2 columns]
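
To make the effect of `row_group_size` concrete, here is a minimal sketch of the row ranges the file should contain. It assumes the Parquet writer places exactly `row_group_size` rows in every row group except possibly the last; the helper name `row_group_bounds` is hypothetical and not part of pandas or Bodo.

```python
from typing import List, Tuple


def row_group_bounds(total_rows: int, row_group_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, stop) row ranges, one per row group.

    Assumption: every group holds row_group_size rows except possibly the last.
    """
    if total_rows < 0 or row_group_size <= 0:
        raise ValueError("total_rows must be >= 0 and row_group_size must be > 0")
    # Integer ceiling division avoids floating-point rounding.
    n_groups = (total_rows + row_group_size - 1) // row_group_size
    return [
        (g * row_group_size, min((g + 1) * row_group_size, total_rows))
        for g in range(n_groups)
    ]


# Values taken from the cell above: 10,000,000 rows, 100,000 rows per group.
groups = row_group_bounds(10_000_000, 100_000)
print(len(groups), groups[0], groups[-1])
# prints: 100 (0, 100000) (9900000, 10000000)
```

Under that assumption the file has 100 row groups, so a reader can fetch many independent chunks at once. If `pyarrow` is installed, `pyarrow.parquet.ParquetFile("pd_example.pq").metadata.num_row_groups` should report the actual count for the file on disk.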


## Run a simple pandas data transformation function (on one core)

The next cell contains a simple data transformation function in pandas. It reads the Parquet file we just wrote into a dataframe, processes the datetime column `A`, and derives two columns from it: `B`, which replaces the original integer column with a period label, and `C`, which holds the month:

In [3]:
import time
import pandas as pd

def data_transform():
    t0 = time.time()
    df = pd.read_parquet("pd_example.pq")
    df["B"] = df.apply(lambda r: "NA" if pd.isna(r.A) else "P1" if r.A.month < 5 else "P2", axis=1)
    df["C"] = df.A.dt.month
    print("Total time: {:.2f}".format(time.time() - t0))
    return df

data_transform()

Total time: 141.60


Unnamed: 0,A,B,C
0,NaT,,
1,2013-01-03,P1,1.0
2,2013-01-03,P1,1.0
3,NaT,,
4,2013-01-03,P1,1.0
...,...,...,...
9999995,2015-09-29,P2,9.0
9999996,2015-09-29,P2,9.0
9999997,2015-09-29,P2,9.0
9999998,2015-09-29,P2,9.0


Plain pandas runs this function on a single CPU core; it cannot use the other cores in the cluster.
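
The lambda passed to `df.apply` packs two conditions into a single expression. Written out as a plain function, the per-row logic looks like the sketch below. The name `label_period` is hypothetical, and `None` stands in for a missing timestamp (`NaT`) so that the example needs only the standard library.

```python
from datetime import date
from typing import Optional


def label_period(a: Optional[date]) -> str:
    """Mirror the lambda: "NA" if missing, "P1" for months 1-4, else "P2"."""
    if a is None:  # stand-in for pd.isna(r.A)
        return "NA"
    return "P1" if a.month < 5 else "P2"


print(label_period(None))               # prints: NA
print(label_period(date(2013, 1, 3)))   # prints: P1
print(label_period(date(2015, 9, 29)))  # prints: P2
```

Because `apply(..., axis=1)` calls this logic once per row from the Python interpreter, 10 million rows mean 10 million interpreted calls, which is why the pandas version is slow.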

## Run the same code on a single core, now with Bodo

To run the code with Bodo, we need to add the `bodo.jit` decorator to the data transformation function.

In [4]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import bodo
import time

@bodo.jit
def data_transform():
    t0 = time.time()
    df = pd.read_parquet("pd_example.pq")
    df["B"] = df.apply(lambda r: "NA" if pd.isna(r.A) else "P1" if r.A.month < 5 else "P2", axis=1)
    df["C"] = df.A.dt.month
    print("Total time: {:.2f}".format(time.time()-t0))
    return df

data_transform()

Total time: 0.91


Unnamed: 0,A,B,C
0,NaT,,
1,2013-01-03,P1,1
2,2013-01-03,P1,1
3,NaT,,
4,2013-01-03,P1,1
...,...,...,...
9999995,2015-09-29,P2,9
9999996,2015-09-29,P2,9
9999997,2015-09-29,P2,9
9999998,2015-09-29,P2,9


Even though the code is still running on a single core, it is more than 150x faster (0.91 s versus 141.60 s).
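
As a quick check of that figure, the speedup is the ratio of the two measured run times. The helper name `speedup` is only illustrative, and the timings are the ones printed in the two cells above; your own runs will differ.

```python
def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    if optimized_seconds <= 0:
        raise ValueError("optimized_seconds must be positive")
    return baseline_seconds / optimized_seconds


# Timings printed above: pandas took 141.60 s, Bodo on one core took 0.91 s.
print(f"{speedup(141.60, 0.91):.1f}x")
# prints: 155.6x
```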

## Run the same code on all cores of your cluster with Bodo
Now let’s run the code on all the cores of your cluster (8 cores if you are using the Community Edition Cluster) using the `%%px` [*magic*](https://ipyparallel.readthedocs.io/en/latest/tutorial/magics.html):

In [5]:
%%px
import pandas as pd
import time

@bodo.jit
def data_transform():
    t0 = time.time()
    df = pd.read_parquet("pd_example.pq")
    t1 = time.time()
    df["B"] = df.apply(lambda r: "NA" if pd.isna(r.A) else "P1" if r.A.month < 5 else "P2", axis=1)
    df["C"] = df.A.dt.month
    t2 = time.time()
    print("IO time: {:.2f}".format(t1 - t0))
    print("Compute time: {:.2f}".format(t2 - t1))
    print("Total time: {:.2f}".format(t2 - t0))
    return df

data_transform()

[stdout:0] IO time: 0.35
Compute time: 0.08
Total time: 0.43


Unnamed: 0,A,B,C
0,NaT,,
1,2013-01-03,P1,1
2,2013-01-03,P1,1
3,NaT,,
4,2013-01-03,P1,1
...,...,...,...
1249995,2013-05-07,P2,5
1249996,2013-05-07,P2,5
1249997,2013-05-07,P2,5
1249998,2013-05-07,P2,5


Unnamed: 0,A,B,C
7500000,2015-01-23,P1,1
7500001,2015-01-23,P1,1
7500002,2015-01-23,P1,1
7500003,2015-01-23,P1,1
7500004,2015-01-23,P1,1
...,...,...,...
8749995,2015-05-27,P2,5
8749996,2015-05-27,P2,5
8749997,2015-05-27,P2,5
8749998,2015-05-27,P2,5


Unnamed: 0,A,B,C
3750000,2014-01-13,P1,1
3750001,2014-01-13,P1,1
3750002,2014-01-13,P1,1
3750003,2014-01-13,P1,1
3750004,2014-01-13,P1,1
...,...,...,...
4999995,2014-05-17,P2,5
4999996,2014-05-17,P2,5
4999997,2014-05-17,P2,5
4999998,2014-05-17,P2,5


Unnamed: 0,A,B,C
5000000,2014-05-18,P2,5
5000001,2014-05-18,P2,5
5000002,2014-05-18,P2,5
5000003,2014-05-18,P2,5
5000004,2014-05-18,P2,5
...,...,...,...
6249995,2014-09-19,P2,9
6249996,2014-09-19,P2,9
6249997,2014-09-19,P2,9
6249998,2014-09-19,P2,9


Unnamed: 0,A,B,C
1250000,2013-05-08,P2,5
1250001,2013-05-08,P2,5
1250002,2013-05-08,P2,5
1250003,2013-05-08,P2,5
1250004,2013-05-08,P2,5
...,...,...,...
2499995,2013-09-09,P2,9
2499996,2013-09-09,P2,9
2499997,2013-09-09,P2,9
2499998,2013-09-09,P2,9


Unnamed: 0,A,B,C
2500000,2013-09-10,P2,9
2500001,2013-09-10,P2,9
2500002,2013-09-10,P2,9
2500003,2013-09-10,P2,9
2500004,2013-09-10,P2,9
...,...,...,...
3749995,2014-01-12,P1,1
3749996,2014-01-12,P1,1
3749997,2014-01-12,P1,1
3749998,2014-01-12,P1,1


Unnamed: 0,A,B,C
8750000,2015-05-28,P2,5
8750001,2015-05-28,P2,5
8750002,2015-05-28,P2,5
8750003,2015-05-28,P2,5
8750004,2015-05-28,P2,5
...,...,...,...
9999995,2015-09-29,P2,9
9999996,2015-09-29,P2,9
9999997,2015-09-29,P2,9
9999998,2015-09-29,P2,9


Unnamed: 0,A,B,C
6250000,2014-09-20,P2,9
6250001,2014-09-20,P2,9
6250002,2014-09-20,P2,9
6250003,2014-09-20,P2,9
6250004,2014-09-20,P2,9
...,...,...,...
7499995,2015-01-22,P1,1
7499996,2015-01-22,P1,1
7499997,2015-01-22,P1,1
7499998,2015-01-22,P1,1
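
The index ranges in the per-rank outputs above are consistent with an even, contiguous split of the 10,000,000 rows, 1,250,000 rows per rank. A minimal sketch of such a block partition follows. The function name `block_bounds` and the way a remainder is spread across ranks are illustrative assumptions, not a description of Bodo's internal implementation.

```python
from typing import Tuple


def block_bounds(n_rows: int, n_ranks: int, rank: int) -> Tuple[int, int]:
    """Return the half-open row range [start, stop) owned by `rank`."""
    if n_ranks <= 0 or not 0 <= rank < n_ranks:
        raise ValueError("rank must satisfy 0 <= rank < n_ranks")
    # Integer arithmetic keeps the blocks contiguous and covers every row once.
    start = rank * n_rows // n_ranks
    stop = (rank + 1) * n_rows // n_ranks
    return start, stop


for rank in range(8):
    start, stop = block_bounds(10_000_000, 8, rank)
    print(f"rank {rank}: rows {start} to {stop - 1}")
```

For 8 ranks this gives starting rows 0, 1250000, 2500000, and so on up to 8750000, which matches the first index shown in each output block. The blocks are printed in arrival order, not rank order, so they appear shuffled.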


----

If you've made it this far, you have now run your first parallel computation with Bodo! Visit our [docs](https://docs.bodo.ai) to learn more about parallelizing your data apps with Bodo. Please consider joining our [community Slack](https://bodocommunity.slack.com/join/shared_invite/zt-qwdc8fad-6rZ8a1RmkkJ6eOX1X__knA#/shared-invite/email) to get in touch directly with our engineers and other Bodo users like yourself.