# Divvy Data Set

The Divvy data set is a collection of historical trip data made publicly available by Lyft. More information about this data set can be found [here](https://ride.divvybikes.com/system-data).

In this demo we will explore loading and manipulating this data set across a cluster using Bodo. 

## Let's load the Divvy bike data set using regular pandas.

In [1]:
%%px
# Load AWS credentials on each engine
from utils.creds import load_aws_creds
load_aws_creds()
import pandas as pd
import time

def load_data(filename):
    # Plain pandas: every engine reads the whole file on its own
    start = time.time()
    df = pd.read_csv(filename)
    print(f"time to run query {time.time() - start}")
    return df

df = load_data("s3://bodo-divvy-data/csv/202004-divvy-tripdata.csv")

Starting 8 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>


[stdout:4] time to run query 2.463881492614746


[stdout:2] time to run query 2.4441416263580322


[stdout:6] time to run query 2.45072603225708


[stdout:0] time to run query 2.469806671142578


[stdout:7] time to run query 2.4745378494262695


[stdout:3] time to run query 2.46004056930542


[stdout:1] time to run query 2.4607479572296143


[stdout:5] time to run query 2.4620072841644287


## Let's take a quick look at the resulting dataframe.
Since we used pandas and not Bodo to load the data, the full dataframe is duplicated on every core across the cluster.
Let's inspect the dataframe shape. Note that every rank reports the same number of rows.

In [2]:
%%px
df.shape

Out[4:2]: (84776, 13)

Out[0:2]: (84776, 13)

Out[1:2]: (84776, 13)

Out[7:2]: (84776, 13)

Out[5:2]: (84776, 13)

Out[3:2]: (84776, 13)

Out[2:2]: (84776, 13)

Out[6:2]: (84776, 13)

## Now let's bodoize this code and run it on a cluster

To convert this pandas function into a Bodo-optimized function, add the `@bodo.jit` decorator. Bodo then compiles and optimizes the code automatically. It also uses MPI to turn the single-threaded pandas code into parallel, distributed code.

Let's do the conversion below.

In [3]:
%%px
import pandas as pd
import time
import bodo

# Same function as before; the decorator is the only change.
# cache=True reuses the compiled code on later runs.
@bodo.jit(cache=True)
def load_data_bodo(filename):
    start = time.time()
    df = pd.read_csv(filename)
    print(f"time to run query {time.time() - start}")
    return df

df = load_data_bodo("s3://bodo-divvy-data/csv/202004-divvy-tripdata.csv")

[stdout:0] time to run query 0.704914


### Bodo Magic and Distribution
Let's inspect the dataframe shape. Note that the data is now distributed across the ranks/cores on the cluster, with each rank holding only part of the data.

In [4]:
%%px
df.shape

Out[2:4]: (10597, 13)

Out[6:4]: (10597, 13)

Out[3:4]: (10597, 13)

Out[4:4]: (10597, 13)

Out[1:4]: (10597, 13)

Out[7:4]: (10597, 13)

Out[5:4]: (10597, 13)

Out[0:4]: (10597, 13)

![image info](https://docs.bodo.ai/2022.6/img/file-read.svg)

## Scalable I/O
Efficient parallel data processing requires data I/O to be parallelized effectively as well. Bodo provides parallel file I/O for many different formats, such as Parquet, CSV, JSON, NumPy binaries, HDF5, and SQL databases. The diagram above shows how Bodo partitions chunks of data among the parallel execution engines.
Bodo automatically parallelizes I/O for any number of cores and any cluster size, without any additional API layers.
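
As a rough illustration of this partitioning, the sketch below splits the 84776 rows of the April 2020 file into contiguous blocks for 8 ranks. The helper `block_bounds` is hypothetical, and the even block split is an assumption made for illustration rather than something taken from Bodo's implementation, although it does reproduce the 10597 rows per rank reported above.

```python
def block_bounds(n_rows, n_ranks, rank):
    """Return the [start, stop) row range owned by `rank` in an even block split."""
    base, extra = divmod(n_rows, n_ranks)
    # The first `extra` ranks each take one additional row when the split is uneven.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

N_ROWS, N_RANKS = 84776, 8  # values taken from the df.shape outputs in this demo
bounds = [block_bounds(N_ROWS, N_RANKS, r) for r in range(N_RANKS)]
for r, (start, stop) in enumerate(bounds):
    print(f"rank {r}: rows [{start}, {stop}) -> {stop - start} rows")
```

Under this assumption rank 0 owns rows 0 to 10596 and rank 1 starts at row 10597, which is consistent with the index values shown when the per-rank dataframes are displayed later in this demo.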

### Let's check the length of the dataframe to ensure it matches pandas

In [5]:
%%px
print(len(df))

[stdout:4] 10597


[stdout:2] 10597


[stdout:0] 10597


[stdout:7] 10597


[stdout:1] 10597


[stdout:3] 10597


[stdout:5] 10597


[stdout:6] 10597


### Bodo for distributed processing
The length above does not look right. That is because `len` ran as regular pandas on each rank, so every rank reported only the length of its own shard. Computing the length of a dataframe that is spread across multiple cores is a distributed operation: the lengths of all the shards have to be combined to get the correct answer. A Bodo function does this for us. If you don't wrap the operation in `@bodo.jit`, it executes as standard Python.

In [6]:
%%px
@bodo.jit(cache=True)
def get_len(df):
    print(len(df))
get_len(df)

[stdout:0] 84776
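
Conceptually, the jitted `len` is a sum reduction over the per-rank lengths. The following is a minimal sketch of that arithmetic, with the 8 local lengths hard-coded from the un-jitted output above. In a real MPI program this sum would be a collective operation across processes rather than a local Python reduction, and how Bodo implements it internally is not shown here.

```python
from functools import reduce
import operator

# Local shard lengths reported by the 8 ranks in the un-jitted cell above.
local_lengths = [10597] * 8

# A sum reduction combines the per-rank values into one global value.
global_length = reduce(operator.add, local_lengths, 0)

print(global_length)  # 84776
```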


## Great, we have run parallel, distributed Bodo code on a cluster
Each process in a Bodo (MPI) cluster is called a rank. Bodo distributes code and data evenly across these ranks for faster, scalable, distributed data processing.

Let's quickly inspect the data on two of the ranks.


In [7]:
%%px --targets 0
df

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,A847FADBBC638E45,docked_bike,2020-04-26 17:45:14,2020-04-26 18:12:03,Eckhart Park,86,Lincoln Ave & Diversey Pkwy,152,41.8964,-87.6610,41.9322,-87.6586,member
1,5405B80E996FF60D,docked_bike,2020-04-17 17:08:54,2020-04-17 17:17:03,Drake Ave & Fullerton Ave,503,Kosciuszko Park,499,41.9244,-87.7154,41.9306,-87.7238,member
2,5DD24A79A4E006F4,docked_bike,2020-04-01 17:54:13,2020-04-01 18:08:36,McClurg Ct & Erie St,142,Indiana Ave & Roosevelt Rd,255,41.8945,-87.6179,41.8679,-87.6230,member
3,2A59BBDF5CDBA725,docked_bike,2020-04-07 12:50:19,2020-04-07 13:02:31,California Ave & Division St,216,Wood St & Augusta Blvd,657,41.9030,-87.6975,41.8992,-87.6722,member
4,27AD306C119C6158,docked_bike,2020-04-18 10:22:59,2020-04-18 11:15:54,Rush St & Hubbard St,125,Sheridan Rd & Lawrence Ave,323,41.8902,-87.6262,41.9695,-87.6547,casual
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10592,3C80C8C5E5A8CE7C,docked_bike,2020-04-19 12:55:40,2020-04-19 13:16:41,Damen Ave & Charleston St,310,Western Ave & Walton St,374,41.9201,-87.6779,41.8984,-87.6866,member
10593,6C36A85193EC130C,docked_bike,2020-04-19 14:14:25,2020-04-19 14:27:28,Western Ave & Walton St,374,Damen Ave & Charleston St,310,41.8984,-87.6866,41.9201,-87.6779,member
10594,A0DA8D82BE8FA614,docked_bike,2020-04-19 12:23:22,2020-04-19 12:34:01,Clark St & Chicago Ave,337,Kingsbury St & Kinzie St,133,41.8968,-87.6309,41.8892,-87.6385,casual
10595,B1876A43AAD8D8C4,docked_bike,2020-04-07 12:54:29,2020-04-07 13:12:35,Sheffield Ave & Webster Ave,327,Cannon Dr & Fullerton Ave,34,41.9215,-87.6538,41.9268,-87.6344,casual


In [8]:
%%px --targets 1
df

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
10597,21A370135335E189,docked_bike,2020-04-19 10:55:55,2020-04-19 11:24:20,Sedgwick St & Schiller St,236,Sedgwick St & Schiller St,236,41.9076,-87.6386,41.9076,-87.6386,casual
10598,D0CBA40D168549C6,docked_bike,2020-04-04 14:27:39,2020-04-04 14:56:30,Rush St & Hubbard St,125,Wabash Ave & Grand Ave,199,41.8902,-87.6262,41.8915,-87.6268,casual
10599,C9A860BBF400D30C,docked_bike,2020-04-19 13:51:38,2020-04-19 15:43:54,Millennium Park,90,Columbus Dr & Randolph St,195,41.8810,-87.6241,41.8847,-87.6195,casual
10600,4E8CD11CA5AE87CA,docked_bike,2020-04-04 15:53:54,2020-04-04 16:23:58,Richmond St & Diversey Ave,501,Winchester Ave & Elston Ave,505,41.9319,-87.7012,41.9241,-87.6765,casual
10601,CC0557DEF4F7B2AB,docked_bike,2020-04-23 07:57:07,2020-04-23 08:05:30,Franklin St & Chicago Ave,31,Milwaukee Ave & Grand Ave,84,41.8967,-87.6357,41.8916,-87.6484,casual
...,...,...,...,...,...,...,...,...,...,...,...,...,...
21189,3B121E191651B551,docked_bike,2020-04-26 19:09:23,2020-04-26 19:17:43,Halsted St & 18th St,202,Wabash Ave & 16th St,72,41.8575,-87.6463,41.8604,-87.6258,member
21190,8FED4B3A3526BC1A,docked_bike,2020-04-20 12:09:40,2020-04-20 12:17:17,Ashland Ave & Augusta Blvd,30,Ogden Ave & Race Ave,186,41.8996,-87.6677,41.8918,-87.6588,member
21191,2E060803F362E81F,docked_bike,2020-04-20 19:17:50,2020-04-20 19:26:27,Ashland Ave & Augusta Blvd,30,Ashland Ave & Division St,210,41.8996,-87.6677,41.9035,-87.6677,member
21192,A68FBAE9D2691584,docked_bike,2020-04-23 10:00:48,2020-04-23 10:30:44,Kedzie Ave & Harrison St,433,Ashland Ave & Blackhawk St,333,41.8736,-87.7049,41.9071,-87.6673,member


Notice how the Divvy CSV data has been distributed across the ranks.
The Bodo function has taken a single-threaded pandas function and converted it into distributed code, which produced a distributed dataframe.
Unlike Spark, Bodo dataframes remain distributed when they are returned to regular Python. Let's look at a quick example. Like `len`, a groupby must run as a distributed operation to give the correct result for the whole data set. The cell below first runs the groupby as a plain pandas function; simply adding `@bodo.jit` to the same function, as we do afterwards, makes it distributed.

In [10]:
%%px
import time
def groupbymember_casual(column_name,df):
    start=time.time()
    gdf=df.groupby(column_name,as_index=False)["ride_id"].count()
    print(f"time for group by pandas {time.time()-start}")
    return gdf
gdf=groupbymember_casual(["member_casual"],df)

[stdout:2] time for group by pandas 0.0034332275390625


[stdout:0] time for group by pandas 0.0033690929412841797


[stdout:5] time for group by pandas 0.0036652088165283203


[stdout:1] time for group by pandas 0.003698110580444336


[stdout:3] time for group by pandas 0.003718137741088867


[stdout:7] time for group by pandas 0.0039031505584716797


[stdout:4] time for group by pandas 0.0048770904541015625


[stdout:6] time for group by pandas 0.004438161849975586


In [11]:
%%px
gdf

Unnamed: 0,member_casual,ride_id
0,casual,2926
1,member,7671


Unnamed: 0,member_casual,ride_id
0,casual,3212
1,member,7385


Unnamed: 0,member_casual,ride_id
0,casual,2742
1,member,7855


Unnamed: 0,member_casual,ride_id
0,casual,2493
1,member,8104


Unnamed: 0,member_casual,ride_id
0,casual,2977
1,member,7620


Unnamed: 0,member_casual,ride_id
0,casual,3022
1,member,7575


Unnamed: 0,member_casual,ride_id
0,casual,3179
1,member,7418


Unnamed: 0,member_casual,ride_id
0,casual,3077
1,member,7520


## Now let's run a groupby operation with bodo.

Notice how the dataframe was grouped by the `member_casual` value and the rides were counted. The function above was not jitted, so it ran as a regular pandas function. Each rank returned the correct output for its own chunk of the data, but that is not the right answer to the problem.

Pandas cannot treat a chunked data set spread across multiple cores as a single dataframe. How do we do this calculation across the data held by all the ranks? If we use a `@bodo.jit` function, this is done automatically.

In [12]:
%%px
import time
@bodo.jit(cache=True)
def groupbymember_casual(column_name,df):
    start=time.time()
    gdf=df.groupby(column_name,as_index=False)["ride_id"].count()
    print(f"time for group by bodo {time.time()-start}")
    return gdf
gdf=groupbymember_casual(["member_casual"],df)

[stdout:0] time for group by bodo 0.080326


In [13]:
%%px
gdf

Unnamed: 0,member_casual,ride_id


Unnamed: 0,member_casual,ride_id


Unnamed: 0,member_casual,ride_id


Unnamed: 0,member_casual,ride_id


Unnamed: 0,member_casual,ride_id


Unnamed: 0,member_casual,ride_id
1,member,61148


Unnamed: 0,member_casual,ride_id
0,casual,23628


Unnamed: 0,member_casual,ride_id


Notice how simply adding the decorator made this function distributed: the data was shuffled across the ranks to produce a single answer for the entire data set. The same concept extends to more cores and to other use cases such as pivots and joins.
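
To see where the distributed totals come from, the per-rank pandas counts printed earlier can be combined by hand. The sketch below is only an illustration: `owner_rank` and its CRC32-based routing rule are hypothetical stand-ins for a shuffle and are not taken from Bodo's implementation, so the rank that ends up owning each group here need not match the ranks in the output above.

```python
import zlib
from collections import Counter

# Per-rank partial counts copied from the un-jitted pandas output above
# (listed in the order the outputs were printed, which is not rank order).
partials = [
    {"casual": 2926, "member": 7671},
    {"casual": 3212, "member": 7385},
    {"casual": 2742, "member": 7855},
    {"casual": 2493, "member": 8104},
    {"casual": 2977, "member": 7620},
    {"casual": 3022, "member": 7575},
    {"casual": 3179, "member": 7418},
    {"casual": 3077, "member": 7520},
]
N_RANKS = len(partials)

def owner_rank(key, n_ranks):
    # Hypothetical routing rule: a stable hash of the key picks the owning rank.
    return zlib.crc32(key.encode("utf-8")) % n_ranks

# "Shuffle": each partial count is sent to the key's owner, which sums what it receives.
per_rank_result = [Counter() for _ in range(N_RANKS)]
for partial in partials:
    for key, count in partial.items():
        per_rank_result[owner_rank(key, N_RANKS)][key] += count

# Gather the per-rank results only to check the global totals.
totals = Counter()
for result in per_rank_result:
    totals.update(result)

print(dict(totals))
```

The totals come out to 23628 casual and 61148 member rides, matching the jitted result, and because there are only two groups at most two ranks hold a non-empty result, which is why most ranks show an empty dataframe.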

For those not familiar with pandas, we also support SQL workloads through our BodoSQL package. BodoSQL is explained in more detail in another demo.

## Summary

Hopefully this exercise showed how simple and powerful Bodo can be. Bodo is also versatile and offers multiple development options, such as Python/pandas and SQL. Bodo also lets you work on large data sets. In this example we passed a single file to `read_csv`, but what if we have hundreds of files? The normal practice in pandas is to loop over the files and concatenate the resulting dataframes. Bodo makes this simple: just pass the folder where the CSV files are located, as in the example below. You can then run the cells above again to inspect the data, run the groupby, and so on.

In [14]:
%%px
df=load_data_bodo("s3://bodo-divvy-data/csv/")
get_len(df)

[stdout:0] time to run query 3.561756
3242939
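
For comparison, the loop-and-concatenate pattern mentioned in the summary looks roughly like this in plain pandas. This is a sketch under stated assumptions: `load_folder_pandas` is a hypothetical helper, and it reads two tiny stand-in files from a local temporary folder (not the Divvy bucket) so that it runs without S3 access.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_folder_pandas(folder):
    """Read every CSV in `folder` with plain pandas and stack the results."""
    files = sorted(Path(folder).glob("*.csv"))
    frames = [pd.read_csv(f) for f in files]
    return pd.concat(frames, ignore_index=True)

# Build two tiny stand-in files (made-up rows, not real Divvy data).
tmp_dir = Path(tempfile.mkdtemp())
pd.DataFrame(
    {"ride_id": ["A1", "B2"], "member_casual": ["member", "casual"]}
).to_csv(tmp_dir / "part-1.csv", index=False)
pd.DataFrame(
    {"ride_id": ["C3"], "member_casual": ["member"]}
).to_csv(tmp_dir / "part-2.csv", index=False)

combined = load_folder_pandas(tmp_dir)
print(len(combined))  # 3
```

This version reads the files one after another in a single process, whereas the Bodo cell above passes the folder path to the jitted function and lets the ranks share the work.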
