# NYC Taxi Trip Duration Prediction - Random Forest
***

<a id=path></a>
## Set Local Path
When running on Google Colab, we remove the default `sample_data` directory so that the working directory contains only this notebook's files.

In [1]:
if 'google.colab' in str(get_ipython()):
  print('Running on CoLab - Remove sample data')
  !rm -r sample_data
else:
  print('Not running on CoLab - Continue')

Running on CoLab - Remove sample data


# Import dataset from Amazon S3 storage

In [2]:
!wget https://seminar-ml-2020.s3.amazonaws.com/NYC_DS_After.zip -P ./datasets
!unzip ./datasets/NYC_DS_After.zip -d ./datasets
!rm ./datasets/NYC_DS_After.zip

--2021-01-13 18:52:11--  https://seminar-ml-2020.s3.amazonaws.com/NYC_DS_After.zip
Resolving seminar-ml-2020.s3.amazonaws.com (seminar-ml-2020.s3.amazonaws.com)... 52.217.111.28
Connecting to seminar-ml-2020.s3.amazonaws.com (seminar-ml-2020.s3.amazonaws.com)|52.217.111.28|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 177761819 (170M) [application/zip]
Saving to: ‘./datasets/NYC_DS_After.zip’


2021-01-13 18:52:13 (74.4 MB/s) - ‘./datasets/NYC_DS_After.zip’ saved [177761819/177761819]

Archive:  ./datasets/NYC_DS_After.zip
  inflating: ./datasets/train_ds.csv  
  inflating: ./datasets/test_ds.csv  


<a id=library></a>
# Import libraries
***

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from tqdm import tqdm
from time import perf_counter

# Import to Show image in Jupyter Notebook
from IPython.display import Image
%matplotlib inline

<a id=data></a>
# Import Dataset
***

In [None]:
!ls

anaconda3  datasets  rf_model.ipynb


In [31]:
train_df=pd.read_csv("./datasets/train_ds.csv")
test_df=pd.read_csv("./datasets/test_ds.csv")

In [32]:
DO_NOT_USE_FOR_TRAINING = ['id', 'pickup_datetime', 'dropoff_datetime','pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
       'dropoff_latitude', 'date',
       'month', 'weekday', 'hour', 'minute', 'second', 'passenger_count',
       'distance', 'best_travel_time', 'left',
       'right', 'merge', 'on ramp', 'off ramp', 'fork', 'end of road',
       'continue', 'roundabout', 'rotary', 'roundabout turn', 
       'average temperature','departure', 'HDD', 'CDD', 'snow fall', 'num_rides_by_pickup_group']

In [33]:
train_df = train_df.drop([col for col in DO_NOT_USE_FOR_TRAINING if col in train_df], axis=1)
new_test = test_df.drop([col for col in DO_NOT_USE_FOR_TRAINING if col in test_df], axis=1)

In [34]:
new_test.isnull().sum()

store_and_fwd_flag            0
is_weekend                    0
is_holiday                    0
is_near_holiday               0
is_businessday                0
minute_of_day                 0
haversine_distance            0
manhattan_distance            0
pickup_pca                    0
dropoff_pca                   0
maximum temerature            0
minimum temperature           0
precipitation                 0
snow depth                    0
kmeans_pickup                 0
kmeans_dropoff                0
num_rides_by_dropoff_group    0
dtype: int64

In [35]:
sample_train = train_df.sample(frac=0.4,random_state=1)

In [36]:
y = np.log(sample_train['trip_duration'].values)

In [37]:
# drop target
sample_train = sample_train.drop(columns='trip_duration')
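After the target is dropped, the training features and `new_test` should contain the same columns in the same order before the model is asked to predict. A minimal sketch of that check follows; the two small frames are hypothetical stand-ins for `sample_train` and `new_test`, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Tiny hypothetical frames standing in for sample_train and new_test.
train_features = pd.DataFrame(
    {"haversine_distance": [1.2, 3.4], "minute_of_day": [30, 600]}
)
test_features = pd.DataFrame(
    {"minute_of_day": [45], "haversine_distance": [2.0]}
)

# Reorder the test columns to match the training order.
# This raises a KeyError if a training column is missing from the test frame.
test_aligned = test_features[train_features.columns]

assert list(test_aligned.columns) == list(train_features.columns)
print(test_aligned)
```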

<a id=splitdata></a>
## Split data to train and validation
***
We split the sampled data into a training set (80%) and a validation set (20%).

In [38]:
train_x, val_x, train_y, val_y = train_test_split(sample_train, y, test_size=0.2)

<a id=rf></a>
# Random Forest Regressor
***
A Random Forest is an ensemble method built on bootstrap aggregating, commonly known as bagging. The basic idea is to combine the outputs of many decision trees rather than rely on a single tree; for regression, the forest's prediction is the average of the trees' predictions.
Bagging generates a new training set for each tree, of size n, by sampling from the data randomly with replacement.
The samples left out of a tree's bootstrap set are called its Out-of-Bag (OOB) samples and can be used for validation.
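The bootstrap step can be sketched with NumPy alone. This is an illustration of the sampling idea, not scikit-learn's internal implementation; the set size `n` and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n = 10                          # hypothetical training-set size
indices = np.arange(n)

# Draw a bootstrap sample: n row indices sampled with replacement.
bootstrap_idx = rng.choice(indices, size=n, replace=True)

# Rows that were never drawn form the out-of-bag (OOB) set for this tree.
oob_idx = np.setdiff1d(indices, bootstrap_idx)

print("bootstrap rows:", np.sort(bootstrap_idx))
print("out-of-bag rows:", oob_idx)
```

For large n, the probability that a given row is left out of a bootstrap sample of size n approaches 1/e, so roughly 37% of the rows end up out-of-bag for each tree.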

Decision Trees tend to overfit, and to avoid overfitting we need to tune the hyperparameters:

max_features - The number of features considered when looking for the best split. Using a subset of the features makes the trees less correlated with each other, which reduces the variance of the forest.

n_estimators - The number of trees in the forest.

min_samples_leaf - The minimum number of samples required at a leaf node; larger values help avoid overfitting. The grid below tunes the related min_samples_split instead.

In [11]:
# Number of trees in the random forest
n_estimators = [80, 100, 200, 500]
# Number of features to consider at every split:
# 'auto' and None use all features (for regressors; 'auto' was removed in scikit-learn 1.3),
# 'sqrt' uses sqrt(n_features), 'log2' uses log2(n_features)
max_features = ['auto', 'sqrt', 'log2', None]
# Maximum depth of each tree
max_depth = [4, 6, 10, 14]
# Minimum fraction of samples required to split an internal node
min_samples_split = [0.1, 0.01, 0.001]
# Fraction of the data drawn for each tree's bootstrap sample
max_samples = [0.6, 0.7, 0.8]
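Candidate lists like these are typically passed to `RandomizedSearchCV`, which samples a fixed number of combinations and scores each with cross-validation. The sketch below is not the search used in this notebook: it runs on synthetic data, and the reduced candidate values, `n_iter`, `cv`, and the scoring choice are all assumptions made to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; not the taxi dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

# Reduced candidate values so the demo finishes quickly.
param_distributions = {
    "n_estimators": [10, 20],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [4, 6],
    "min_samples_split": [0.1, 0.01],
    "max_samples": [0.6, 0.8],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,                               # number of sampled combinations
    cv=3,                                   # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

After fitting, `search.best_params_` holds the best sampled combination and `search.best_estimator_` is a model refit on all of the data passed to `fit`.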

In [40]:
params = {'n_estimators': 100,
               'max_features': 'sqrt',
               'max_depth': 4,
               'min_samples_split': 0.01,
               'max_samples': 0.6}

In [41]:
# Build the model with the fixed hyperparameters chosen above
# (RandomizedSearchCV is imported but not run in this notebook)
rf_model = RandomForestRegressor(**params, n_jobs=-1)

In [42]:
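start = perf_counter()
# Refit the model 100 times to measure the total training time.
# Note: this fits on all of sample_train, which includes the validation rows (val_x),
# so the validation score below is optimistic.
for _ in tqdm(range(100)):
    rf_model.fit(sample_train, y)
end = perf_counter()
rf_train_time = end - start  # total for 100 fits, in seconds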



In [43]:
rf_train_time

2727.7934546550005

In [44]:
y_pred = rf_model.predict(val_x)
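A validation score is only an honest estimate when the validation rows are excluded from fitting. The following minimal sketch shows that pattern on synthetic data; the data, the model settings, and all `demo`-style names are assumptions for illustration and do not reproduce this notebook's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the taxi dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_model = RandomForestRegressor(n_estimators=50, random_state=0, n_jobs=-1)
demo_model.fit(X_tr, y_tr)  # the validation rows are never seen during fitting

val_pred = demo_model.predict(X_val)
val_rmse = np.sqrt(np.mean((y_val - val_pred) ** 2))
print(f"held-out RMSE: {val_rmse:.3f}")
```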

In [45]:
import sklearn.metrics as metrics
# The target is log(trip_duration), so the RMSE of the log-scale predictions
# is effectively the RMSLE of the durations
# (mean_squared_log_error would apply log1p a second time).
rf_rmsle = np.sqrt(metrics.mean_squared_error(val_y, y_pred))
print('RMSLE score for the RF regressor is : {}'.format(rf_rmsle))

RMSLE score for the RF regressor is : 0.44660445701889384
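Because the model is trained on `log(trip_duration)`, an RMSE computed on the log-scale values is close to the RMSLE of the durations themselves. The small sketch below compares the two; the duration values are made up for illustration. Note that scikit-learn's `mean_squared_log_error` uses `log(1 + x)`, which is why the two numbers differ slightly.

```python
import numpy as np

# Hypothetical durations in seconds, for illustration only.
true_sec = np.array([600.0, 1200.0, 300.0])
pred_sec = np.array([660.0, 1000.0, 330.0])

# RMSE computed on log-transformed values, as done in this notebook.
rmse_on_log = np.sqrt(np.mean((np.log(true_sec) - np.log(pred_sec)) ** 2))

# RMSLE as commonly defined, using log(1 + x) on the original scale.
rmsle = np.sqrt(np.mean((np.log1p(true_sec) - np.log1p(pred_sec)) ** 2))

print(rmse_on_log, rmsle)
```

For durations of hundreds of seconds, the `+ 1` inside the logarithm changes the result only marginally, so the two metrics can be treated as interchangeable here.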


In [47]:
pred_rf = rf_model.predict(new_test)
pred_rf = np.exp(pred_rf)  # invert the log transform to return to the original scale
if new_test.shape[0] == pred_rf.shape[0]:
    print('Test shape OK.')
else:
    print('Oops')
pred_rf

Test shape OK.


array([ 693.33507192,  731.07243879,  452.28863076, ..., 1250.85513947,
       1437.55404708,  975.66705962])

In [48]:
test_df['trip_duration'] = pred_rf

In [49]:
submission_rf = test_df[['id', 'trip_duration']]

In [50]:
submission_rf

|  | id | trip_duration |
|---|---|---|
| 0 | id3004672 | 693.335072 |
| 1 | id3505355 | 731.072439 |
| 2 | id1217141 | 452.288631 |
| 3 | id2150126 | 1168.683146 |
| 4 | id1598245 | 401.244631 |
| ... | ... | ... |
| 625129 | id3008929 | 319.812385 |
| 625130 | id3700764 | 1070.388951 |
| 625131 | id2568735 | 1250.855139 |
| 625132 | id1384355 | 1437.554047 |


In [51]:
submission_rf.to_csv('submission-rf.csv',index=False)