# NYC Yellow Taxi Tips Prediction With Machine Learning in Python

This example uses regression models to predict the tip fraction of NYC yellow taxi trips.
The original example can be found [here](https://github.com/saturncloud/workshop-scaling-ml/blob/main/04-large-dataset.ipynb).

### Notes on running this example:

By default, the example runs with Bodo, so the data is distributed in chunks across processes.
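
To make "distributed in chunks" concrete, here is a minimal sketch of a contiguous, near-even block split of rows across processes. The helper `block_bounds` and the sizes are hypothetical, and the split rule is an assumption for illustration; the exact partitioning Bodo applies is not shown in this example.

```python
def block_bounds(n_rows, n_ranks, rank):
    """Return the [start, stop) row range owned by `rank` in an even block split."""
    base, extra = divmod(n_rows, n_ranks)
    # The first `extra` ranks each take one additional row.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Hypothetical sizes: 10 rows spread over 4 processes.
chunks = [block_bounds(10, 4, r) for r in range(4)]
print(chunks)
```

Each process then works only on its own row range, which is why per-process output later in the example can differ between ranks.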

The results shown here come from a local MacBook Pro with 8 cores. You can also run the example on our platform using, for example, one **m5.12xlarge** instance (24 cores, 192 GiB memory).

The dataset is available on S3 (`s3://bodo-examples-data/nyc-taxi/yellow_tripdata_2019.csv`).

To run the code:
1. Make sure you add your AWS account credentials to access the data. 
2. If you want to run the example using pandas only (without Bodo):
    1. Comment out the magic expression (`%%px`) and the Bodo decorator (`@bodo.jit`) in all the code cells.
    2. Then, re-run cells from the beginning.
3. Build the xgboost package from source with MPI enabled (this step is already done on the Bodo Platform).
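
As an alternative to commenting out every decorator for the pandas-only run in step 2, one option is a stand-in decorator that returns the function unchanged. This is only a sketch: `noop_jit` and the alias `jit` are hypothetical names that do not appear in the original notebook, and the `%%px` magics would still need to be removed by hand.

```python
def noop_jit(*args, **kwargs):
    """Stand-in for a JIT decorator: returns the function unchanged."""
    # Bare use: @noop_jit
    if len(args) == 1 and callable(args[0]) and not kwargs:
        return args[0]

    # Parameterized use: @noop_jit(distributed=[...], cache=True)
    def wrap(func):
        return func

    return wrap


try:
    import bodo
    jit = bodo.jit
except ImportError:
    jit = noop_jit  # pandas-only run


@noop_jit(distributed=["x"], cache=True)
def double(x):
    return 2 * x


print(double(21))
```

With this in place, cells could use `@jit(...)` instead of `@bodo.jit(...)` and run either way.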

### Start an IPyParallel cluster (skip if running on Bodo Platform)
Run the following code in a cell to start an IPyParallel cluster. Up to 4 cores are used in this example. Skip this step and the next one (Verifying your setup) if you are using the Bodo Platform.

In [1]:
import ipyparallel as ipp
import psutil

# Use at most 4 physical cores for the MPI engines.
n = min(psutil.cpu_count(logical=False), 4)
rc = ipp.Cluster(engines='mpi', n=n).start_and_connect_sync(activate=True)

Starting 4 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
100%|█████████████████████████████████████████| 4/4 [00:06<00:00,  1.68s/engine]


### Verifying your setup
Run the following code to verify that your IPyParallel cluster is set up correctly:

In [2]:
%%px
import bodo
print(f"Hello World from rank {bodo.get_rank()}. total ranks={bodo.get_size()}")

[stdout:1] Hello World from rank 1. total ranks=4


[stdout:0] Hello World from rank 0. total ranks=4


[stdout:2] Hello World from rank 2. total ranks=4


[stdout:3] Hello World from rank 3. total ranks=4


## Importing the Packages

These are the main packages we are going to work with:
 - Bodo to parallelize Python code automatically
 - Pandas to work with data
 - scikit-learn to build and evaluate regression models
 - XGBoost for the `XGBRegressor` model

In [3]:
%%px
import warnings
warnings.filterwarnings("ignore")

import bodo
import pandas as pd
import numpy as np
import time

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

%px: 100%|█████████████████████████████████████| 4/4 [00:00<00:00,  4.53tasks/s]


In [4]:
%%px

import os

os.environ["AWS_ACCESS_KEY_ID"] = "your_aws_access_key_id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_aws_secret_access_key"
os.environ["AWS_DEFAULT_REGION"] = "us-east-2"
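
Before reading from S3, it can help to confirm that all three variables set above are actually present. The sketch below is an illustration only; `missing_aws_vars` and `REQUIRED_AWS_VARS` are hypothetical names, not part of the original notebook or of any AWS library.

```python
import os

REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_vars(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"AWS_ACCESS_KEY_ID": "placeholder", "AWS_DEFAULT_REGION": "us-east-2"}
print(missing_aws_vars(example_env))
```

When running on an IPyParallel cluster, a check like this would need to run under `%%px` as well, since each engine has its own environment.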

## Load data

In [5]:
%%px

@bodo.jit(distributed=["taxi"], cache=True)
def get_taxi_trips():
    start = time.time()
    taxi = pd.read_csv(
        "s3://bodo-examples-data/nyc-taxi/yellow_tripdata_2019.csv",
        parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime']
    )
    print("Reading time: ", time.time() - start)
    print(taxi.head())
    print(taxi.shape)
    return taxi
    
taxi = get_taxi_trips()

%px:   0%|                                             | 0/4 [10:40<?, ?tasks/s]

[stdout:0] Reading time:  382.25558358499984
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         1  2019-01-01 00:46:40   2019-01-01 00:53:20                1   
1         1  2019-01-01 00:59:47   2019-01-01 01:18:59                1   
2         2  2018-12-21 13:48:30   2018-12-21 13:52:40                3   
3         2  2018-11-28 15:52:25   2018-11-28 15:55:45                5   
4         2  2018-11-28 15:56:57   2018-11-28 15:58:33                5   

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0            1.5           1                  N           151           239   
1            2.6           1                  N           239           246   
2            0.0           1                  N           236           236   
3            0.0           1                  N           193           193   
4            0.0           2                  N           193           193   

   payment_type  fare_amount 

%px: 100%|████████████████████████████████████| 4/4 [10:46<00:00, 161.66s/tasks]


## Exploratory analysis

## Feature engineering

1. Create features before performing any data splitting.
2. Split data into train/test sets.

In [7]:
%%px

@bodo.jit(distributed=['taxi_df', 'df'], cache=True)
def prep_df(taxi_df):
    '''
    Generate features from a raw taxi dataframe.
    '''
    start = time.time()    
    # Keep only trips with a positive fare to avoid divide-by-zero in tip_fraction.
    df = taxi_df[taxi_df.fare_amount > 0][['tpep_pickup_datetime', 'passenger_count', 'tip_amount', 'fare_amount']].copy()
    df['tip_fraction'] = df.tip_amount / df.fare_amount
     
    df['pickup_weekday'] = df.tpep_pickup_datetime.dt.weekday
    df['pickup_weekofyear'] = df.tpep_pickup_datetime.dt.weekofyear
    df['pickup_hour'] = df.tpep_pickup_datetime.dt.hour
    df['pickup_week_hour'] = (df.pickup_weekday * 24) + df.pickup_hour
    df['pickup_minute'] = df.tpep_pickup_datetime.dt.minute
    df = df[['pickup_weekday',
             'pickup_weekofyear',
             'pickup_hour',
             'pickup_week_hour',
             'pickup_minute',
             'passenger_count', 'tip_fraction']].astype(float).fillna(-1)
    print("Data preparation time: ", time.time() - start)
    print(df.head())
    return df

taxi_feat = prep_df(taxi)

%px:   0%|                                             | 0/4 [00:30<?, ?tasks/s]

[stdout:0] Data preparation time:  15.592501404000359
   pickup_weekday  pickup_weekofyear  pickup_hour  pickup_week_hour  \
0             1.0                1.0          0.0              24.0   
1             1.0                1.0          0.0              24.0   
2             4.0               51.0         13.0             109.0   
3             2.0               48.0         15.0              63.0   
4             2.0               48.0         15.0              63.0   

   pickup_minute  passenger_count  tip_fraction  
0           46.0              1.0      0.235714  
1           59.0              1.0      0.071429  
2           48.0              3.0      0.000000  
3           52.0              5.0      0.000000  
4           56.0              5.0      0.000000  


[stdout:1] Empty DataFrame
Columns: [pickup_weekday, pickup_weekofyear, pickup_hour, pickup_week_hour, pickup_minute, passenger_count, tip_fraction]
Index: []


[stdout:3] Empty DataFrame
Columns: [pickup_weekday, pickup_weekofyear, pickup_hour, pickup_week_hour, pickup_minute, passenger_count, tip_fraction]
Index: []


[stdout:2] Empty DataFrame
Columns: [pickup_weekday, pickup_weekofyear, pickup_hour, pickup_week_hour, pickup_minute, passenger_count, tip_fraction]
Index: []


%px: 100%|█████████████████████████████████████| 4/4 [00:31<00:00,  7.94s/tasks]
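
The time-based features can be checked by hand with the standard library. The sketch below recomputes them for the first and third pickup timestamps printed by `get_taxi_trips`; `pickup_features` is a hypothetical helper, and using the ISO week number for `pickup_weekofyear` is an assumption that happens to match the rank-0 output above.

```python
from datetime import datetime


def pickup_features(ts):
    """Compute the time-based features for one pickup timestamp."""
    weekday = ts.weekday()            # Monday=0 ... Sunday=6
    weekofyear = ts.isocalendar()[1]  # ISO week number
    hour = ts.hour
    return {
        "pickup_weekday": weekday,
        "pickup_weekofyear": weekofyear,
        "pickup_hour": hour,
        "pickup_week_hour": weekday * 24 + hour,  # hour index within the week, 0..167
        "pickup_minute": ts.minute,
    }


first = pickup_features(datetime(2019, 1, 1, 0, 46, 40))
third = pickup_features(datetime(2018, 12, 21, 13, 48, 30))
print(first)
print(third)
```

For example, the third trip starts on a Friday (weekday 4) at 13:00, so `pickup_week_hour` is 4 * 24 + 13 = 109, as in the printed DataFrame.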


In [8]:
%%px

@bodo.jit(distributed=["taxi_feat", "X_train", "X_test", "y_train", "y_test"])
def data_split(taxi_feat):
    X_train, X_test, y_train, y_test = train_test_split(
        taxi_feat[['pickup_weekday',
                   'pickup_weekofyear',
                   'pickup_hour',
                   'pickup_week_hour',
                   'pickup_minute',
                   'passenger_count']],
        taxi_feat['tip_fraction'], 
        test_size=0.3,
        train_size=0.7,        
        random_state=42
    )
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = data_split(taxi_feat)

%px: 100%|█████████████████████████████████████| 4/4 [01:05<00:00, 16.36s/tasks]
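
Conceptually, a seeded 70/30 split shuffles the row indices once and cuts the shuffled list. The sketch below shows that idea with the standard library only; `split_indices` is a hypothetical helper, and it does not reproduce the exact shuffle or rounding that `train_test_split` uses.

```python
import random


def split_indices(n_rows, test_size=0.3, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Fixing the seed (here `random_state=42` in the cell above) means repeated runs evaluate every model on the same held-out rows.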


## Train models on a large dataset

We'll train four linear models and an XGBoost model to predict `tip_fraction`, then evaluate each model against the test set using root mean squared error (RMSE).
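
The model cells in this section compute RMSE with `mean_squared_error(y_test, preds, squared=False)`. As a reminder of what that number means, here is a minimal standard-library sketch; the function `rmse` and the sample values are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration only.
print(rmse([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))
```

Because the errors are squared before averaging, a few extreme values of `tip_fraction` can dominate the score.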

#### 1. Linear Regression

In [9]:
%%px

@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def lr_model(X_train, y_train, X_test, y_test):
    start = time.time()    
    lr = LinearRegression()
    lr_fitted = lr.fit(X_train, y_train)
    print("Linear Regression fitting time: ", time.time() - start)

    start = time.time()    
    lr_preds = lr_fitted.predict(X_test)
    print("Linear Regression prediction time: ", time.time() - start)    
    print(mean_squared_error(y_test, lr_preds, squared=False))
    
lr_model(X_train, y_train, X_test, y_test)

%px:   0%|                                             | 0/4 [00:55<?, ?tasks/s]

[stdout:0] Linear Regression fitting time:  54.828501329000574
Linear Regression prediction time:  0.9286313830016297
15.822427514807156


%px: 100%|█████████████████████████████████████| 4/4 [00:56<00:00, 14.09s/tasks]


#### 2. Ridge

In [10]:
%%px

@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'])
def rr_model(X_train, y_train, X_test, y_test):
    start = time.time()    
    rr = Ridge()
    rr_fitted = rr.fit(X_train, y_train)
    print("Ridge fitting time: ", time.time() - start)

    start = time.time()    
    rr_preds = rr_fitted.predict(X_test)
    print("Ridge prediction time: ", time.time() - start)    
    print(mean_squared_error(y_test, rr_preds, squared=False))
    
rr_model(X_train, y_train, X_test, y_test)

%px:   0%|                                             | 0/4 [00:54<?, ?tasks/s]

[stdout:0] Ridge fitting time:  54.10153803299909
Ridge prediction time:  0.17183516999648418
15.822457947896114


%px: 100%|█████████████████████████████████████| 4/4 [00:55<00:00, 13.75s/tasks]


#### 3. Lasso

In [11]:
%%px

@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'])
def lsr_model(X_train, y_train, X_test, y_test):
    start = time.time()    
    lsr = Lasso()
    lsr_fitted = lsr.fit(X_train, y_train)
    print("Lasso fitting time: ", time.time() - start)

    start = time.time()    
    lsr_preds = lsr_fitted.predict(X_test)
    print("Lasso prediction time: ", time.time() - start)    
    print(mean_squared_error(y_test, lsr_preds, squared=False))
    
lsr_model(X_train, y_train, X_test, y_test)

%px:   0%|                                             | 0/4 [00:49<?, ?tasks/s]

[stdout:0] Lasso fitting time:  50.112302917001216
Lasso prediction time:  0.19195895200027735
15.822415059071433


%px: 100%|█████████████████████████████████████| 4/4 [00:49<00:00, 12.48s/tasks]
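
For reference, these are the objectives that scikit-learn documents for `Ridge` and `Lasso`, where $n$ is the number of training samples and $\alpha$ is the regularization strength (1.0 by default, which is what `Ridge()` and `Lasso()` above use). Whether the distributed implementation used under `@bodo.jit` minimizes exactly these forms is an assumption here.

```latex
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\qquad
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

The squared L2 penalty in Ridge shrinks coefficients toward zero, while the L1 penalty in Lasso can set some of them exactly to zero.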


#### 4. SGDRegressor

In [12]:
%%px

@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'])
def sgdr_model(X_train, y_train, X_test, y_test):
    start = time.time()    
    sgdr = SGDRegressor(max_iter=100, penalty='l2')
    sgdr_fitted = sgdr.fit(X_train, y_train)
    print("SGDRegressor fitting time: ", time.time() - start)

    start = time.time()    
    sgdr_preds = sgdr_fitted.predict(X_test)
    print("SGDRegressor prediction time: ", time.time() - start)    
    print(mean_squared_error(y_test, sgdr_preds, squared=False))
    
sgdr_model(X_train, y_train, X_test, y_test)

%px:   0%|                                             | 0/4 [00:52<?, ?tasks/s]

[stdout:0] SGDRegressor fitting time:  53.11146027599898
SGDRegressor prediction time:  0.1952197940008773
15.822458676129292


%px: 100%|█████████████████████████████████████| 4/4 [00:52<00:00, 13.19s/tasks]


#### 5. XGBoost Model

In [13]:
%%px
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def xgb_model(X_train, y_train, X_test, y_test):
    start = time.time()    
    xgb = XGBRegressor(
        objective="reg:squarederror",
        tree_method='approx',
        learning_rate=0.1,
        max_depth=5,
        n_estimators=100,
    )
    xgb_fitted = xgb.fit(X_train, y_train)
    print("XGBRegressor fitting time: ", time.time() - start)

    start = time.time()    
    xgb_preds = xgb_fitted.predict(X_test)
    print("XGBRegressor prediction time: ", time.time() - start)    
    print(mean_squared_error(y_test, xgb_preds, squared=False))
    
xgb_model(X_train, y_train, X_test, y_test)    

%px:   0%|                                             | 0/4 [45:22<?, ?tasks/s]

[stdout:0] XGBRegressor fitting time:  2722.4965671949976
XGBRegressor prediction time:  14.637384427001962
15.823090436429029


%px: 100%|████████████████████████████████████| 4/4 [46:10<00:00, 692.56s/tasks]
