# Getting Started with Bodo


---------------

## Connect to a cluster 

This notebook runs code on a cluster.
- If you are in the Community Edition Workspace and your notebook is *detached*, make sure the Community Edition Cluster is running.

- On the sidebar, right-click **<img src="img/cluster_icon.png"/> Clusters** and open the tab. If the cluster is paused, click the **<img src="img/play_button.png"/>** play button.

- Once the state changes to running, the notebook should attach to the cluster automatically. If it doesn't, click the dropdown and attach to the Community Edition Cluster.


## Run SQL Queries

Let's run a couple of SQL queries against a dataset stored in Parquet format in a public S3 bucket hosted by Bodo. We are using an NYC taxi dataset[<sup>1</sup>](#fn1) containing information about yellow and green taxi trips originating in New York City in 2019. The dataset is about 8 GB in size.



### Print a few Records
Run the next code cell to execute a simple SQL query that prints a few records from the table. If you are running on the Community Edition Cluster, this code runs on all 8 cores in the cluster, so you should see 8 outputs. However, since the query returns only a few records, they will most likely all appear in the output of a single core.

In [6]:
%%px
import bodo
import bodosql
import warnings
warnings.filterwarnings("ignore")

# File stored in public S3 bucket hosted by Bodo
s3_file_path = "s3://bodo-example-data/nyc-taxi/yellow_tripdata_2019_half.pq" 

@bodo.jit
def simple_query():
    
    # reading file directly from S3
    bc = bodosql.BodoSQLContext( {"nyctaxi": bodosql.TablePath(s3_file_path, "parquet")})
    
    # executing SQL query 
    df1 = bc.sql("SELECT * FROM nyctaxi LIMIT 8")
    
    return df1 

simple_query()



Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.5,1,N,151,239,1,7.0,0.5,0.5,1.65,0.0,0.3,9.95,
1,1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.6,1,N,239,246,1,14.0,0.5,0.5,1.0,0.0,0.3,16.3,
2,2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,0.0,1,N,236,236,1,4.5,0.5,0.5,0.0,0.0,0.3,5.8,
3,2,2018-11-28 15:52:25,2018-11-28 15:55:45,5,0.0,1,N,193,193,2,3.5,0.5,0.5,0.0,0.0,0.3,7.55,
4,2,2018-11-28 15:56:57,2018-11-28 15:58:33,5,0.0,2,N,193,193,2,52.0,0.0,0.5,0.0,0.0,0.3,55.55,
5,2,2018-11-28 16:25:49,2018-11-28 16:28:26,5,0.0,1,N,193,193,2,3.5,0.5,0.5,0.0,5.76,0.3,13.31,
6,2,2018-11-28 16:29:37,2018-11-28 16:33:43,5,0.0,2,N,193,193,2,52.0,0.0,0.5,0.0,0.0,0.3,55.55,
7,1,2019-01-01 00:21:28,2019-01-01 00:28:37,1,1.3,1,N,163,229,1,6.5,0.5,0.5,1.25,0.0,0.3,9.05,


### Generate a quick Summary

Let's run another simple query that generates a quick summary of the table, grouped by passenger count, showing the total and average fares rounded to the nearest whole number. Again, you should see 8 core outputs, some of which might be empty.

In [8]:
%%px 

@bodo.jit
def simple_query_2():
    # reading file directly from S3
    bc = bodosql.BodoSQLContext({ "nyctaxi": bodosql.TablePath(s3_file_path, "parquet")})
   
    # executing SQL query 
    df1 = bc.sql('''
                SELECT passenger_count
                , ROUND(SUM(fare_amount), 0) AS TotalFares
                , ROUND(AVG(fare_amount), 0) AS AvgFares
                FROM nyctaxi
                GROUP BY passenger_count
                ''')
    return df1

simple_query_2()


Unnamed: 0,passenger_count,TotalFares,AvgFares


Unnamed: 0,passenger_count,TotalFares,AvgFares


Unnamed: 0,passenger_count,TotalFares,AvgFares
1,8,3854.0,54.0


Unnamed: 0,passenger_count,TotalFares,AvgFares
0,9,2250.0,49.0


Unnamed: 0,passenger_count,TotalFares,AvgFares
2,4,5030361.0,13.0


Unnamed: 0,passenger_count,TotalFares,AvgFares
6,0,4862469.0,14.0
7,7,4017.0,51.0


Unnamed: 0,passenger_count,TotalFares,AvgFares
8,5,11343053.0,13.0
9,6,6764732.0,12.0


Unnamed: 0,passenger_count,TotalFares,AvgFares
3,1,190001767.0,13.0
4,3,11024521.0,13.0
5,2,40135971.0,13.0
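
To see what this aggregation computes without a cluster, here is a minimal sketch that runs an equivalent query with Python's built-in `sqlite3` module on a handful of hypothetical rows. SQLite is only a stand-in here: it is not BodoSQL, the sample values are made up for illustration, and the `ORDER BY` clause is an addition to make the printed order deterministic.

```python
import sqlite3

# Hypothetical sample rows: (passenger_count, fare_amount).
# These values are illustrative and are not taken from the real dataset.
rows = [(1, 7.0), (1, 14.0), (1, 6.0), (2, 10.0), (2, 12.6), (5, 52.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nyctaxi (passenger_count INTEGER, fare_amount REAL)")
conn.executemany("INSERT INTO nyctaxi VALUES (?, ?)", rows)

# Same shape as the notebook query: one output row per passenger_count,
# with the total and average fare rounded to zero decimal places.
summary = conn.execute(
    """
    SELECT passenger_count,
           ROUND(SUM(fare_amount), 0) AS TotalFares,
           ROUND(AVG(fare_amount), 0) AS AvgFares
    FROM nyctaxi
    GROUP BY passenger_count
    ORDER BY passenger_count
    """
).fetchall()
conn.close()

for row in summary:
    print(row)
```

Each tuple corresponds to one line of the per-core output above. On the cluster those lines are spread across the cores, whereas this sketch returns them in a single list.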




---
<br/>


If you've made it this far, you have now run your first data processing SQL query with Bodo! Please consider joining our [community Slack](https://bodocommunity.slack.com/join/shared_invite/zt-qwdc8fad-6rZ8a1RmkkJ6eOX1X__knA#/shared-invite/email) to get in touch directly with our engineers and other Bodo users like yourself. For more information and to learn how Bodo works, visit our [docs](https://docs.bodo.ai).



### Footnotes 

\[1\] <span id="fn1"> The original example can be found [here](https://github.com/toddwschneider/nyc-taxi-data). </span>

<br/>

### Extra: Pandas Feature Engineering  


You can also do some feature engineering with pandas on the same NYC taxi dataset, which prepares the data for further data science workloads.

In [None]:
%%px 
import pandas as pd 

@bodo.jit()
def feat_eng():
    """
    Generate features from a raw taxi dataframe.
    """
    taxi_df = pd.read_parquet(
        "s3://bodo-example-data/nyc-taxi/yellow_tripdata_2019_half.pq",
        )
    # Keep only trips with a positive fare to avoid divide-by-zero below
    df = taxi_df[taxi_df.fare_amount > 0][
        ["tpep_pickup_datetime", "passenger_count", "tip_amount", "fare_amount"]
    ].copy()
    df["tip_fraction"] = df.tip_amount / df.fare_amount

    # Derive time-based features from the pickup timestamp
    df["pickup_weekday"] = df.tpep_pickup_datetime.dt.weekday
    # Note: plain pandas 2.0+ no longer provides dt.weekofyear;
    # there, dt.isocalendar().week is the equivalent
    df["pickup_weekofyear"] = df.tpep_pickup_datetime.dt.weekofyear
    df["pickup_hour"] = df.tpep_pickup_datetime.dt.hour
    df["pickup_week_hour"] = (df.pickup_weekday * 24) + df.pickup_hour
    df["pickup_minute"] = df.tpep_pickup_datetime.dt.minute

    # Select the feature columns (a list of names needs double brackets)
    df = (
        df[
            [
                "pickup_weekday",
                "pickup_weekofyear",
                "pickup_hour",
                "pickup_week_hour",
                "pickup_minute",
                "passenger_count",
                "tip_fraction",
            ]
        ]
        .astype(float)
        .fillna(-1)
    )
    return df


taxi_feat = feat_eng()
if bodo.get_rank() == 0:
    display(taxi_feat.head())
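
The features computed by `feat_eng` can be checked by hand on a single record. The sketch below uses only the Python standard library and the pickup time, tip, and fare of the first record printed earlier in this notebook. The helper name `trip_features` is hypothetical, and the sketch assumes the pandas accessors follow the same conventions as `datetime` (Monday is weekday 0, and the week number is the ISO week). The `passenger_count` column is passed through unchanged in `feat_eng`, so it is left out here.

```python
from datetime import datetime

# Values from the first record shown in the "Print a few Records" output.
pickup = datetime(2019, 1, 1, 0, 46, 40)
tip_amount, fare_amount = 1.65, 7.0


def trip_features(pickup, tip_amount, fare_amount):
    """Single-record version of the features built in feat_eng() (illustrative helper)."""
    weekday = pickup.weekday()  # Monday = 0 ... Sunday = 6
    hour = pickup.hour
    return {
        "pickup_weekday": weekday,
        "pickup_weekofyear": pickup.isocalendar()[1],  # ISO week number
        "pickup_hour": hour,
        "pickup_week_hour": weekday * 24 + hour,  # hour index within the week
        "pickup_minute": pickup.minute,
        "tip_fraction": tip_amount / fare_amount,  # fare must be > 0
    }


features = trip_features(pickup, tip_amount, fare_amount)
print(features)
```

January 1, 2019 was a Tuesday, so `pickup_weekday` is 1 and `pickup_week_hour` is 24 for a pickup shortly after midnight.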


<div class="alert alert-info alert">

### Note

- The `%%px` cell [magic](https://ipyparallel.readthedocs.io/en/latest/tutorial/magics.html) indicates that the code cell should run on all the cores of the cluster.
- The `@bodo.jit` decorator tells the Bodo engine to compile and parallelize the decorated function.

</div>

---