# Getting Started with Bodo


---------------

## Create a local cluster

The first step to running Bodo efficiently is to run it on a cluster.
- You are currently running this code in a container on a local machine.
- To create a cluster, run the following boilerplate code. It creates an ipyparallel cluster with a maximum of 8 cores.
- If you use the Bodo Platform at platform.bodo.ai, you can create large clusters with multiple nodes. Bodo creates the cluster with an optimized configuration, so you don't need to run the following code block. Just move on to the **"Run A SQL Query"** section.


In [1]:
import ipyparallel as ipp
import psutil

n = min(psutil.cpu_count(logical=False), 8)
rc = ipp.Cluster(engines='mpi', n=n).start_and_connect_sync(activate=True)

Starting 8 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
100%|██████████| 8/8 [00:05<00:00,  1.37engine/s]
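The engine count in the cell above is the number of physical cores, capped at 8. As a minimal sketch of that capping logic, the helper below (`engine_count` is a hypothetical name, not part of Bodo or ipyparallel) also handles the case where `psutil.cpu_count(logical=False)` returns `None`, which psutil does when the count cannot be determined. Falling back to a single engine is an assumption made for this sketch.

```python
def engine_count(physical_cores, cap=8):
    """Return how many engines to start: the physical core count, capped at `cap`.

    `physical_cores` may be None (psutil returns None when the count is
    undetermined); in that case this sketch falls back to one engine.
    """
    if physical_cores is None or physical_cores < 1:
        return 1
    return min(physical_cores, cap)


print(engine_count(16))    # more cores than the cap
print(engine_count(4))     # fewer cores than the cap
print(engine_count(None))  # core count could not be determined
```

In the notebook, the result of `engine_count(psutil.cpu_count(logical=False))` could be passed as `n` to `ipp.Cluster`.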


## Run A SQL Query

Let's run a simple SQL query to generate a quick summary of a dataset stored in Parquet format in a public S3 bucket hosted by Bodo. We use an NYC taxi dataset[<sup>1</sup>](#fn1) containing information about yellow and green taxi trips originating in New York City in 2019. The dataset is about 8 GB.


Run the next code cell to generate a summary table, grouped by passenger count, showing the rounded total and average fares. If you are running this query on the Community Edition cluster, you should see output from 8 cores, some of which may be empty.

In [2]:
%%px
import bodo
import bodosql
import warnings
warnings.filterwarnings("ignore")

# File stored in public S3 bucket hosted by Bodo
s3_file_path = "s3://bodo-example-data/nyc-taxi/yellow_tripdata_2019_half.pq" 

@bodo.jit
def simple_query():
    
    # reading file directly from S3
    bc = bodosql.BodoSQLContext( {"nyctaxi": bodosql.TablePath(s3_file_path, "parquet")})
    
    # executing SQL query 
    df1 = bc.sql('''
                SELECT DISTINCT passenger_count
                , ROUND (SUM (fare_amount),0) as TotalFares
                , ROUND (AVG (fare_amount),0) as AvgFares
                FROM nyctaxi
                GROUP BY passenger_count
                ''')
    
    return df1 

simple_query()


%px:   0%|          | 0/8 [00:39<?, ?tasks/s]

Unnamed: 0,passenger_count,TotalFares,AvgFares


Unnamed: 0,passenger_count,TotalFares,AvgFares


Unnamed: 0,passenger_count,TotalFares,AvgFares
8,5,11343053.0,13.0
9,6,6764732.0,12.0


Unnamed: 0,passenger_count,TotalFares,AvgFares
6,0,4862469.0,14.0
7,7,4017.0,51.0


Unnamed: 0,passenger_count,TotalFares,AvgFares
0,9,2250.0,49.0


Unnamed: 0,passenger_count,TotalFares,AvgFares
1,8,3854.0,54.0


Unnamed: 0,passenger_count,TotalFares,AvgFares
2,4,5030361.0,13.0


%px: 100%|██████████| 8/8 [00:39<00:00,  4.90s/tasks]


Unnamed: 0,passenger_count,TotalFares,AvgFares
3,1,190001767.0,13.0
4,3,11024521.0,13.0
5,2,40135971.0,13.0
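To see what the query computes without a cluster, here is a small sketch that runs the same aggregation on a few made-up rows. It uses Python's built-in `sqlite3` module in place of BodoSQL, and the toy rows are hypothetical, not taken from the real dataset. `DISTINCT` is left out because `GROUP BY passenger_count` already yields one row per passenger count.

```python
import sqlite3

# Hypothetical toy rows: (passenger_count, fare_amount)
rows = [(1, 10.0), (1, 14.0), (2, 7.5), (2, 8.5), (2, 20.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nyctaxi (passenger_count INTEGER, fare_amount REAL)")
conn.executemany("INSERT INTO nyctaxi VALUES (?, ?)", rows)

# Same aggregation as the notebook query, ordered for stable output
summary = conn.execute(
    """
    SELECT passenger_count,
           ROUND(SUM(fare_amount), 0) AS TotalFares,
           ROUND(AVG(fare_amount), 0) AS AvgFares
    FROM nyctaxi
    GROUP BY passenger_count
    ORDER BY passenger_count
    """
).fetchall()
conn.close()

for row in summary:
    print(row)
```

Each printed tuple is one group: the passenger count, the rounded total fare, and the rounded average fare. In the Bodo run above, those groups are spread across the engines, which is why some engines print only a header.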




---
<br>


If you've made it this far, you have now run your first data processing SQL query with Bodo! Please consider joining our [community Slack](https://bodocommunity.slack.com/join/shared_invite/zt-qwdc8fad-6rZ8a1RmkkJ6eOX1X__knA#/shared-invite/email) to get in touch directly with our engineers and other Bodo users like yourself. For more information and to learn how Bodo works, visit our [docs](https://docs.bodo.ai).





### Pandas Feature Engineering  


You can do some feature engineering using pandas on the same NYC taxi dataset, enabling further data science workloads.  

In [3]:
%%px 
import pandas as pd 

@bodo.jit()
def feat_eng():
    """
    Generate features from a raw taxi dataframe.
    """
    taxi_df = pd.read_parquet(
        "s3://bodo-example-data/nyc-taxi/yellow_tripdata_2019_half.pq",
        )
    # keep only trips with a positive fare to avoid divide-by-zero
    df = taxi_df[taxi_df.fare_amount > 0][
        ["tpep_pickup_datetime", "passenger_count", "tip_amount", "fare_amount"]
    ].copy()
    df["tip_fraction"] = df.tip_amount / df.fare_amount

    df["pickup_weekday"] = df.tpep_pickup_datetime.dt.weekday
    df["pickup_weekofyear"] = df.tpep_pickup_datetime.dt.weekofyear
    df["pickup_hour"] = df.tpep_pickup_datetime.dt.hour
    df["pickup_week_hour"] = (df.pickup_weekday * 24) + df.pickup_hour
    df["pickup_minute"] = df.tpep_pickup_datetime.dt.minute
    df = (
        df[
            [
                "pickup_weekday",
                "pickup_weekofyear",
                "pickup_hour",
                "pickup_week_hour",
                "pickup_minute",
                "passenger_count",
                "tip_fraction",
            ]
        ]
        .astype(float)
        .fillna(-1)
    )
    return df


taxi_feat = feat_eng()
if bodo.get_rank() == 0:
    display(taxi_feat.head())

%px:  38%|███▊      | 3/8 [00:54<00:00,  6.82tasks/s]

[output:0]

Unnamed: 0,pickup_weekday,pickup_weekofyear,pickup_hour,pickup_week_hour,pickup_minute,passenger_count,tip_fraction
0,1.0,1.0,0.0,24.0,46.0,1.0,0.235714
1,1.0,1.0,0.0,24.0,59.0,1.0,0.071429
2,4.0,51.0,13.0,109.0,48.0,3.0,0.0
3,2.0,48.0,15.0,63.0,52.0,5.0,0.0
4,2.0,48.0,15.0,63.0,56.0,5.0,0.0


%px: 100%|██████████| 8/8 [00:54<00:00,  6.87s/tasks]
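The features above are simple functions of the pickup timestamp and the fare columns. As a minimal sketch, the same arithmetic for a single trip can be written with the standard library alone. `pickup_features` is a hypothetical helper, and the timestamp, tip, and fare values are made up for illustration.

```python
from datetime import datetime


def pickup_features(pickup, tip_amount, fare_amount):
    """Compute the per-trip features for one record (fare_amount must be > 0)."""
    weekday = pickup.weekday()  # Monday=0 ... Sunday=6, as with pandas dt.weekday
    return {
        "pickup_weekday": weekday,
        "pickup_weekofyear": pickup.isocalendar()[1],  # ISO week number
        "pickup_hour": pickup.hour,
        "pickup_week_hour": weekday * 24 + pickup.hour,  # hour index within the week
        "pickup_minute": pickup.minute,
        "tip_fraction": tip_amount / fare_amount,
    }


# Hypothetical trip: Tuesday, 1 January 2019 at 00:46
features = pickup_features(datetime(2019, 1, 1, 0, 46), tip_amount=1.65, fare_amount=7.0)
print(features)
```

`pickup_week_hour` ranges from 0 (Monday, midnight) to 167 (Sunday, 23:00), so it encodes weekday and hour in one number.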


<div class="alert alert-info alert">
     
### Note

- The `%%px` cell [magic](https://ipyparallel.readthedocs.io/en/latest/tutorial/magics.html) indicates that the code cell should run on all cores of the cluster.
- The `@bodo.jit` decorator tells the Bodo engine to parallelize the decorated function.
     
</div>

---


### Footnotes 

\[1\] <span id="fn1"> The original example can be found [here](https://github.com/toddwschneider/nyc-taxi-data). </span>

<br>