# NYC Yellow Taxi Tips Prediction With Machine Learning in Python

This example uses regression models to predict the tip fraction (tip amount divided by fare amount) of NYC yellow taxi trips.
The original example can be found [here](https://github.com/saturncloud/workshop-scaling-ml/blob/main/04-large-dataset.ipynb).

### Start an IPyParallel cluster (skip if running on Bodo Platform)
Run the following code in a cell to start an IPyParallel cluster. The cluster uses one engine per physical core, capped at 8; the output below comes from an 8-engine run.

In [1]:
import ipyparallel as ipp
import psutil

# One engine per physical core, capped at 8
n = min(psutil.cpu_count(logical=False), 8)
rc = ipp.Cluster(engines='mpi', n=n).start_and_connect_sync(activate=True)

Starting 8 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
100%|███████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:05<00:00,  1.35engine/s]


### Verifying your setup
Run the following code to verify that your IPyParallel cluster is set up correctly:

In [2]:
%%px
import bodo
print(f"Hello World from rank {bodo.get_rank()}. Total ranks={bodo.get_size()}")

[stdout:3] Hello World from rank 3. Total ranks=8
[stdout:4] Hello World from rank 4. Total ranks=8
[stdout:1] Hello World from rank 1. Total ranks=8
[stdout:2] Hello World from rank 2. Total ranks=8
[stdout:7] Hello World from rank 7. Total ranks=8
[stdout:5] Hello World from rank 5. Total ranks=8
[stdout:6] Hello World from rank 6. Total ranks=8
[stdout:0] Hello World from rank 0. Total ranks=8


## Importing the Packages

These are the main packages used in this example:
 - Bodo to parallelize Python code automatically
 - pandas to work with data
 - scikit-learn to build and evaluate regression models
 - XGBoost for the gradient-boosted regressor (imported later, in the XGBoost cell)

In [3]:
%%px
import warnings
warnings.filterwarnings("ignore")

import bodo
import pandas as pd
import time

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

## Load data

In [4]:
%%px

@bodo.jit(cache=True)
def get_taxi_trips():
    start = time.time()
    taxi = pd.read_csv(
        "s3://bodo-example-data/nyc-taxi/yellow_tripdata_2019.csv",
        parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'], nrows=10000
    )
    print("Reading time: ", time.time() - start)
    print(taxi.head())
    print(taxi.shape)
    return taxi
    
taxi = get_taxi_trips()

[stdout:0] Reading time:  2.241074000000026
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         1  2019-01-01 00:46:40   2019-01-01 00:53:20                1   
1         1  2019-01-01 00:59:47   2019-01-01 01:18:59                1   
2         2  2018-12-21 13:48:30   2018-12-21 13:52:40                3   
3         2  2018-11-28 15:52:25   2018-11-28 15:55:45                5   
4         2  2018-11-28 15:56:57   2018-11-28 15:58:33                5   

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0            1.5           1                  N           151           239   
1            2.6           1                  N           239           246   
2            0.0           1                  N           236           236   
3            0.0           1                  N           193           193   
4            0.0           2                  N           193           193   

   payment_type  fare_amount  

%px: 100%|███████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:12<00:00,  1.61s/tasks]


## Exploratory analysis

## Feature engineering

1. Create features before performing any data splitting.
2. Split data into train/test sets.

In [5]:
%%px

@bodo.jit(cache=True)
def prep_df(taxi_df):
    '''
    Generate features from a raw taxi dataframe.
    '''
    start = time.time()    
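    # Keep only rows with a positive fare to avoid divide-by-zero in tip_fraction.
    # Selecting several columns requires a list, hence the double brackets.
    df = taxi_df[taxi_df.fare_amount > 0][['tpep_pickup_datetime', 'passenger_count', 'tip_amount', 'fare_amount']].copy()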
    df['tip_fraction'] = df.tip_amount / df.fare_amount
     
    df['pickup_weekday'] = df.tpep_pickup_datetime.dt.weekday
    df['pickup_weekofyear'] = df.tpep_pickup_datetime.dt.weekofyear
    df['pickup_hour'] = df.tpep_pickup_datetime.dt.hour
    df['pickup_week_hour'] = (df.pickup_weekday * 24) + df.pickup_hour
    df['pickup_minute'] = df.tpep_pickup_datetime.dt.minute
    # Keep the model features plus the target; cast to float and fill missing values with -1
    df = df[['pickup_weekday',
             'pickup_weekofyear',
             'pickup_hour',
             'pickup_week_hour',
             'pickup_minute',
             'passenger_count',
             'tip_fraction']].astype(float).fillna(-1)
    print("Data preparation time: ", time.time() - start)
    print(df.head())
    return df

taxi_feat = prep_df(taxi)

[stdout:0] Data preparation time:  0.0019850000001042645
   pickup_weekday  pickup_weekofyear  pickup_hour  pickup_week_hour  \
0             1.0                1.0          0.0              24.0   
1             1.0                1.0          0.0              24.0   
2             4.0               51.0         13.0             109.0   
3             2.0               48.0         15.0              63.0   
4             2.0               48.0         15.0              63.0   

   pickup_minute  passenger_count  tip_fraction  
0           46.0              1.0      0.235714  
1           59.0              1.0      0.071429  
2           48.0              3.0      0.000000  
3           52.0              5.0      0.000000  
4           56.0              5.0      0.000000  
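
To make the derived columns concrete, here is a minimal plain-pandas sketch of the same idea, without Bodo. The three toy rows and their tip and fare values are hypothetical; only the column names follow the taxi data. It shows the zero-fare filter, the `tip_fraction` division, and how `pickup_week_hour` combines weekday and hour.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the taxi data used above.
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2019-01-01 00:46:40", "2018-12-21 13:48:30", "2019-01-02 08:15:00"]
    ),
    "passenger_count": [1, 3, 2],
    "tip_amount": [1.65, 0.0, 2.0],
    "fare_amount": [7.0, 4.5, 0.0],  # the last row has a zero fare
})

# Drop non-positive fares first so the division below is always defined.
feat = toy[toy.fare_amount > 0].copy()
feat["tip_fraction"] = feat.tip_amount / feat.fare_amount

# Monday is weekday 0, so a Tuesday pickup in hour 0 maps to week hour 24.
feat["pickup_weekday"] = feat.tpep_pickup_datetime.dt.weekday
feat["pickup_hour"] = feat.tpep_pickup_datetime.dt.hour
feat["pickup_week_hour"] = feat.pickup_weekday * 24 + feat.pickup_hour

print(feat[["pickup_weekday", "pickup_hour", "pickup_week_hour", "tip_fraction"]])
```

The zero-fare row is dropped before the division, and the two remaining rows get week hours 24 (Tuesday, hour 0) and 109 (Friday, hour 13).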


%px: 100%|███████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:13<00:00,  1.73s/tasks]


In [6]:
%%px

@bodo.jit
def data_split(taxi_feat):
    X_train, X_test, y_train, y_test = train_test_split(
        # Feature columns (a list of names, hence the double brackets)
        taxi_feat[['pickup_weekday',
                   'pickup_weekofyear',
                   'pickup_hour',
                   'pickup_week_hour',
                   'pickup_minute',
                   'passenger_count']],
        taxi_feat['tip_fraction'], 
        test_size=0.3,
        train_size=0.7,        
        random_state=42
    )
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = data_split(taxi_feat)

%px: 100%|███████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:05<00:00,  1.56tasks/s]
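
As a quick check of what a 70/30 split produces, the following sketch calls scikit-learn's `train_test_split` directly, outside of Bodo, on a hypothetical 10-row frame (`toy_feat` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row feature frame standing in for taxi_feat.
toy_feat = pd.DataFrame({
    "pickup_hour": np.arange(10, dtype=float),
    "passenger_count": np.ones(10),
    "tip_fraction": np.linspace(0.0, 0.3, 10),
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_feat[["pickup_hour", "passenger_count"]],  # features
    toy_feat["tip_fraction"],                      # target
    test_size=0.3,
    random_state=42,
)

# 7 rows go to training and 3 to testing; no row appears in both.
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, so the models are all compared on the same test rows.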


## Train models over a large dataset

We'll train four linear models (linear regression, ridge, lasso, and SGD) and an XGBoost regressor to predict `tip_fraction`, and evaluate each model against the test set using root mean squared error (RMSE).
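
For reference, RMSE is the square root of the mean of the squared prediction errors. A minimal NumPy sketch follows; the `rmse` helper and the sample values are hypothetical and not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical tip fractions and a constant prediction of 0.1.
y_true = [0.2, 0.0, 0.1, 0.3]
y_pred = [0.1, 0.1, 0.1, 0.1]

# Squared errors are 0.01, 0.01, 0.0, 0.04; their mean is 0.015.
print(rmse(y_true, y_pred))
```

RMSE is expressed in the same units as the target, so here it is a tip fraction. Lower is better.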

#### 1. Linear Regression

In [7]:
%%px

@bodo.jit(cache=True)
def lr_model(X_train, y_train, X_test, y_test):
    start = time.time()    
    lr = LinearRegression()
    lr_fitted = lr.fit(X_train, y_train)
    print("Linear Regression fitting time: ", time.time() - start)

    start = time.time()    
    lr_preds = lr_fitted.predict(X_test)
    print("Linear Regression prediction time: ", time.time() - start)    
    print(mean_squared_error(y_test, lr_preds, squared=False))
    
lr_model(X_train, y_train, X_test, y_test)

[stdout:0] Linear Regression fitting time:  0.10938899999996465
Linear Regression prediction time:  0.002658999999994194
775437450.3689747


%px: 100%|███████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 28.73tasks/s]


#### 2. Ridge

In [8]:
%%px

@bodo.jit
def rr_model(X_train, y_train, X_test, y_test):
    start = time.time()    
    rr = Ridge()
    rr_fitted = rr.fit(X_train, y_train)
    print("Ridge fitting time: ", time.time() - start)

    start = time.time()    
    rr_preds = rr_fitted.predict(X_test)
    print("Ridge prediction time: ", time.time() - start)    
    print(mean_squared_error(y_test, rr_preds, squared=False))
    
rr_model(X_train, y_train, X_test, y_test)

[stdout:0] Ridge fitting time:  0.11805099999992308
Ridge prediction time:  0.0017510000000129367
246173655470.98502


#### 3. Lasso

In [9]:
%%px

@bodo.jit
def lsr_model(X_train, y_train, X_test, y_test):
    start = time.time()    
    lsr = Lasso()
    lsr_fitted = lsr.fit(X_train, y_train)
    print("Lasso fitting time: ", time.time() - start)

    start = time.time()    
    lsr_preds = lsr_fitted.predict(X_test)
    print("Lasso prediction time: ", time.time() - start)    
    print(mean_squared_error(y_test, lsr_preds, squared=False))
    
lsr_model(X_train, y_train, X_test, y_test)

[stdout:0] Lasso fitting time:  0.10019899999997506
Lasso prediction time:  0.0018880000000081054
59969897832.56497


#### 4. SGDRegressor

In [10]:
%%px

@bodo.jit
def sgdr_model(X_train, y_train, X_test, y_test):
    start = time.time()    
    sgdr = SGDRegressor(max_iter=100, penalty='l2')
    sgdr_fitted = sgdr.fit(X_train, y_train)
    print("SGDRegressor fitting time: ", time.time() - start)

    start = time.time()    
    sgdr_preds = sgdr_fitted.predict(X_test)
    print("SGDRegressor prediction time: ", time.time() - start)    
    print(mean_squared_error(y_test, sgdr_preds, squared=False))
    
sgdr_model(X_train, y_train, X_test, y_test)

[stdout:0] SGDRegressor fitting time:  0.20479599999998754
SGDRegressor prediction time:  0.003669000000172673
494260382.85129654
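
The four linear-model cells above share the same fit, predict, and score pattern. As a sketch of that pattern in plain scikit-learn, without Bodo, the loop below runs the same four estimators on a small synthetic problem. The data, the `models` and `results` names, and the added `random_state` are assumptions for illustration, and the numbers it prints say nothing about the taxi data.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge, SGDRegressor
from sklearn.model_selection import train_test_split

# Hypothetical synthetic regression problem with a known linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 0.01 * rng.normal(size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "SGDRegressor": SGDRegressor(max_iter=100, penalty="l2", random_state=0),
}

results = {}
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    # RMSE computed by hand so the sketch does not depend on a scikit-learn version.
    results[name] = float(np.sqrt(np.mean((y_te - preds) ** 2)))

for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Lasso's default `alpha=1.0` is a heavy penalty for a target on this scale, so its error on this toy problem is noticeably larger than that of plain linear regression.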


#### 5. XGBoost Model

In [11]:
%%px
from xgboost import XGBRegressor

@bodo.jit(cache=True)
def xgb_model(X_train, y_train, X_test, y_test):
    start = time.time()    
    xgb = XGBRegressor(
        objective="reg:squarederror",
        tree_method='approx',
        learning_rate=0.1,
        max_depth=5,
        n_estimators=100,
    )
    xgb_fitted = xgb.fit(X_train, y_train)
    print("XGBRegressor fitting time: ", time.time() - start)

    start = time.time()    
    xgb_preds = xgb_fitted.predict(X_test)
    print("XGBRegressor prediction time: ", time.time() - start)    
    print(mean_squared_error(y_test, xgb_preds, squared=False))
    
xgb_model(X_train, y_train, X_test, y_test)  

[stdout:0] XGBRegressor fitting time:  0.04732899999999063
XGBRegressor prediction time:  0.005711999999903128
0.22680806722334323


In [12]:
# To stop the cluster run the following command. 
rc.cluster.stop_cluster_sync()

Stopping controller
Controller stopped: {'exit_code': 0, 'pid': 21481, 'identifier': 'ipcontroller-1653095920-59jm-21470'}
Stopping engine(s): 1653095921
mpiexec error output:
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 21495 RUNNING AT nicholass-mbp.lan
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault: 11 (signal 11)

engine set stopped 1653095921: {'exit_code': 11, 'pid': 21493, 'identifier': 'ipengine-1653095920-59jm-1653095921-21470'}
