# EDSA 2021: Sendy Logistics Challenge
Predict the estimated time of arrival (ETA) for motorbike deliveries in Nairobi

# Introduction

This notebook builds a machine learning model that predicts delivery time, from the moment a rider picks up a package to arrival at its final destination, using historic order data. Accurate arrival-time predictions help businesses improve their logistics and communicate reliable delivery times to their customers.

# Overview

### - Importing libraries and data
### - Exploratory Data Analysis
### - Data Cleaning and Formatting
### - Feature Engineering
### - Data Preprocessing for Model
### - Basic Model Building
### - Model Tuning
### - Ensemble Model Building
### - Results

## Importing Python libraries

In [154]:
!pip install lightgbm

Collecting lightgbm
  Downloading lightgbm-3.0.0-py2.py3-none-win_amd64.whl (737 kB)
Installing collected packages: lightgbm
Successfully installed lightgbm-3.0.0


In [155]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-1.2.0-py3-none-win_amd64.whl (86.5 MB)


ERROR: Exception:
Traceback (most recent call last):
  File "C:\Users\maakw\anaconda3\lib\site-packages\pip\_vendor\urllib3\response.py", line 425, in _error_catcher
    yield
  File "C:\Users\maakw\anaconda3\lib\site-packages\pip\_vendor\urllib3\response.py", line 507, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "C:\Users\maakw\anaconda3\lib\site-packages\pip\_vendor\cachecontrol\filewrapper.py", line 62, in read
    data = self.__fp.read(amt)
  File "C:\Users\maakw\anaconda3\lib\http\client.py", line 457, in read
    n = self.readinto(b)
  File "C:\Users\maakw\anaconda3\lib\http\client.py", line 501, in readinto
    n = self.fp.readinto(b)
  File "C:\Users\maakw\anaconda3\lib\socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "C:\Users\maakw\anaconda3\lib\ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "C:\Users\maakw\anaconda3\lib\ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
sock

In [156]:
#Linear algebra
import numpy as np

#Data processing
import pandas as pd

#Date library
import datetime as dt

#Data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

#Preprocessing, model selection and metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, KFold, GridSearchCV 
from sklearn.model_selection import cross_val_score, learning_curve
from sklearn.metrics import mean_squared_error

#Algorithms
from sklearn.svm import SVR
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

#import xgboost as xgb
import lightgbm as lgb

sns.set(style='white', context='notebook', palette='deep')

## Importing the dataset

In [125]:
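#Train has extra columns that Test lacks: Arrival at Destination (day of month, weekday, time) and the target
#Raw strings keep the backslashes in Windows paths from being read as escape sequences

train_df = pd.read_csv(r"D:\Temp\Train.csv")
test_df = pd.read_csv(r"D:\Temp\Test.csv")
riders_df = pd.read_csv(r"D:\Temp\Riders.csv")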

print(train_df.shape, test_df.shape, riders_df.shape)
train_df.head()

(21201, 29) (7068, 25) (960, 5)


Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),...,Arrival at Destination - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival
0,Order_No_4211,User_Id_633,Bike,3,Business,9,5,9:35:46 AM,9,5,...,10:39:55 AM,4,20.4,,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745
1,Order_No_25375,User_Id_2285,Bike,3,Personal,12,5,11:16:16 AM,12,5,...,12:17:22 PM,16,26.4,,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993
2,Order_No_1899,User_Id_265,Bike,3,Business,30,2,12:39:25 PM,30,2,...,1:00:38 PM,3,,,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455
3,Order_No_9336,User_Id_1402,Bike,3,Business,15,5,9:25:34 AM,15,5,...,10:05:27 AM,9,19.2,,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855,1341
4,Order_No_27883,User_Id_1737,Bike,1,Personal,13,1,9:55:18 AM,13,1,...,10:25:37 AM,9,15.4,,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770,1214
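`riders_df` is loaded above but not used in the rest of the notebook. If rider attributes were to be added as features, a left join on `Rider Id` is one way to do it. The sketch below uses toy frames; the `Rating` column and its values are hypothetical stand-ins, since the columns of `Riders.csv` are not shown here.

```python
import pandas as pd

# Toy stand-ins; the "Rating" column and all values are hypothetical
orders = pd.DataFrame({
    "Order No": ["Order_No_1", "Order_No_2", "Order_No_3"],
    "Rider Id": ["Rider_Id_432", "Rider_Id_856", "Rider_Id_999"],
})
riders = pd.DataFrame({
    "Rider Id": ["Rider_Id_432", "Rider_Id_856"],
    "Rating": [13.8, 14.1],
})

# A left join keeps every order, even when the rider is unknown;
# validate= raises if a rider id appears more than once in riders
merged = orders.merge(riders, on="Rider Id", how="left", validate="many_to_one")
print(merged)
```

A left join preserves the row count and order of the orders frame, which matters later when rows are split by position.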


# Exploratory Data Analysis

### Data Manipulation

In [126]:
#Drop data not available in test, Pickup Time + label = Arrival times

train_df = train_df.drop(['Arrival at Destination - Day of Month', 'Arrival at Destination - Weekday (Mo = 1)', 'Arrival at Destination - Time'], axis=1)


#### Creating Full_df

In [127]:
#Create full_df = train + test. Caution: don't shuffle, and avoid dropping or adding rows
#Explore the training rows, apply column changes to full_df, then use the border to separate the two later
#Be careful of information leakage

border = train_df.shape[0]
test_df['Time from Pickup to Arrival'] = [np.nan]* test_df.shape[0]
full_df = pd.concat([train_df, test_df], axis=0, ignore_index=True)

train_df.shape, test_df.shape, full_df.shape

((21201, 26), (7068, 26), (28269, 26))
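The combine-then-split pattern above can be checked on a toy example. The sketch below uses made-up values and `toy_` names that are not part of the notebook; it only illustrates that positional slicing on the border recovers the two parts as long as rows are never reordered, added, or dropped.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test frames (values are made up)
toy_train = pd.DataFrame({"Distance (KM)": [4, 16, 3],
                          "Time from Pickup to Arrival": [745, 1993, 455]})
toy_test = pd.DataFrame({"Distance (KM)": [8, 5]})

# Remember where the training rows end, then stack the frames
toy_border = toy_train.shape[0]
toy_test["Time from Pickup to Arrival"] = np.nan
toy_full = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)

# Column-wise changes are applied once to the combined frame
toy_full["Distance_m"] = toy_full["Distance (KM)"] * 1000

# Positional slicing on the border recovers both parts
toy_train_back = toy_full.iloc[:toy_border]
toy_test_back = toy_full.iloc[toy_border:]
print(toy_train_back.shape, toy_test_back.shape)
```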

#### Renaming columns

In [128]:
#Renaming columns (shorten, remove space, standardize)
new_names = {
    "Order No": "Order_No",
    "User Id": "User_Id",
    "Vehicle Type": "Vehicle_Type",
    "Platform Type": "Platform_Type",
    "Personal or Business": "Personal_Business",
    "Placement - Day of Month": "Pla_Mon",
    "Placement - Weekday (Mo = 1)": "Pla_Weekday",
    "Placement - Time": "Pla_Time",
    "Confirmation - Day of Month": "Con_Day_Mon",
    "Confirmation - Weekday (Mo = 1)": "Con_Weekday",
    "Confirmation - Time": "Con_Time",
    "Arrival at Pickup - Day of Month": "Arr_Pic_Mon",
    "Arrival at Pickup - Weekday (Mo = 1)": "Arr_Pic_Weekday",
    "Arrival at Pickup - Time": "Arr_Pic_Time",
    "Pickup - Day of Month": "Pickup_Mon",
    "Pickup - Weekday (Mo = 1)": "Pickup_Weekday",
    "Pickup - Time": "Pickup_Time",
    "Distance (KM)": "Distance(km)",
    "Precipitation in millimeters": "Precipitation(mm)",
    "Pickup Lat": "Pickup_Lat",
    "Pickup Long": "Pickup_Lon",
    "Destination Lat": "Destination_Lat",
    "Destination Long": "Destination_Lon",
    "Rider Id": "Rider_Id",
    "Time from Pickup to Arrival": "Time_Pic_Arr",
}

full_df = full_df.rename(columns=new_names)
full_df.columns

Index(['Order_No', 'User_Id', 'Vehicle_Type', 'Platform_Type',
       'Personal_Business', 'Pla_Mon', 'Pla_Weekday', 'Pla_Time',
       'Con_Day_Mon', 'Con_Weekday', 'Con_Time', 'Arr_Pic_Mon',
       'Arr_Pic_Weekday', 'Arr_Pic_Time', 'Pickup_Mon', 'Pickup_Weekday',
       'Pickup_Time', 'Distance(km)', 'Temperature', 'Precipitation(mm)',
       'Pickup_Lat', 'Pickup_Lon', 'Destination_Lat', 'Destination_Lon',
       'Rider_Id', 'Time_Pic_Arr'],
      dtype='object')

#### Convert Time

In [129]:
#Convert Time from 12H to 24H

def convert_to_24hrs(fulldf):
    for col in fulldf.columns:
        if col.endswith("Time"):
            fulldf[col] = pd.to_datetime(fulldf[col], format='%I:%M:%S %p').dt.strftime("%H:%M:%S")
    return fulldf

full_df = convert_to_24hrs(full_df)

full_df[['Pla_Time', 'Con_Time' , 'Arr_Pic_Time', 'Pickup_Time']][3:6]


Unnamed: 0,Pla_Time,Con_Time,Arr_Pic_Time,Pickup_Time
3,09:25:34,09:26:05,09:37:56,09:43:06
4,09:55:18,09:56:18,10:03:53,10:05:23
5,15:07:35,15:08:57,15:21:36,15:30:30
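The time columns stay as strings after this conversion and are not used as model features further down. One possible way to make them numeric is to convert them to seconds since midnight, as sketched below on two rows taken from the output above. The helper `to_seconds` and the `_Sec` column names are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Two rows copied from the converted output above
times = pd.DataFrame({"Pla_Time": ["09:25:34", "15:07:35"],
                      "Pickup_Time": ["09:43:06", "15:30:30"]})

def to_seconds(col):
    # pd.to_timedelta parses "HH:MM:SS" strings directly
    return pd.to_timedelta(col).dt.total_seconds().astype(int)

times["Pla_Sec"] = to_seconds(times["Pla_Time"])
times["Pickup_Sec"] = to_seconds(times["Pickup_Time"])

# Elapsed seconds between placing the order and pickup
times["Pla_to_Pickup_Sec"] = times["Pickup_Sec"] - times["Pla_Sec"]
print(times)
```

The difference would come out negative for an order that crosses midnight, so such rows would need separate handling.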


#### Filling Missing Values

In [130]:
#Fill missing temperature and precipitation values with the column mean

full_df['Temperature'] = full_df['Temperature'].fillna(full_df['Temperature'].mean())
full_df['Precipitation(mm)'] = full_df['Precipitation(mm)'].fillna(full_df['Precipitation(mm)'].mean())
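The means above are computed over the combined train and test rows, which lets test-set values influence the fill value. A variant that respects the earlier leakage warning computes the statistic from the training rows only. The sketch below shows this on a toy column; the `toy_` names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy_full = pd.DataFrame({"Temperature": [20.0, np.nan, 26.0, np.nan, 30.0]})
toy_border = 3  # the first three rows play the role of the training set

# Compute the fill value from the training rows only
train_mean = toy_full["Temperature"].iloc[:toy_border].mean()

# Apply it to every row, train and test alike
toy_full["Temperature"] = toy_full["Temperature"].fillna(train_mean)
print(train_mean, toy_full["Temperature"].tolist())
```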

In [131]:
full_df.head()

Unnamed: 0,Order_No,User_Id,Vehicle_Type,Platform_Type,Personal_Business,Pla_Mon,Pla_Weekday,Pla_Time,Con_Day_Mon,Con_Weekday,...,Pickup_Time,Distance(km),Temperature,Precipitation(mm),Pickup_Lat,Pickup_Lon,Destination_Lat,Destination_Lon,Rider_Id,Time_Pic_Arr
0,Order_No_4211,User_Id_633,Bike,3,Business,9,5,09:35:46,9,5,...,10:27:30,4,20.4,7.573502,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745.0
1,Order_No_25375,User_Id_2285,Bike,3,Personal,12,5,11:16:16,12,5,...,11:44:09,16,26.4,7.573502,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993.0
2,Order_No_1899,User_Id_265,Bike,3,Business,30,2,12:39:25,30,2,...,12:53:03,3,23.255689,7.573502,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455.0
3,Order_No_9336,User_Id_1402,Bike,3,Business,15,5,09:25:34,15,5,...,09:43:06,9,19.2,7.573502,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855,1341.0
4,Order_No_27883,User_Id_1737,Bike,1,Personal,13,1,09:55:18,13,1,...,10:05:23,9,15.4,7.573502,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770,1214.0


#### Checking Day of Month and Weekday Consistency

In [132]:
#Actual dates are not given. Deliveries are by bike, so each order should be completed within a single day.
#Check whether the placement, confirmation, arrival-at-pickup and pickup days ever differ.

month_cols = [col for col in full_df.columns if col.endswith("Mon")]
weekday_cols = [col for col in full_df.columns if col.endswith("Weekday")]

count = 0
instances_of_different_days = []
for i, row in full_df.iterrows():
    if len(set(row[month_cols].values)) > 1:
        count = count + 1
        print(count, end="\r")
        instances_of_different_days.append(list(row[month_cols].values))
instances_of_different_days

2

[[17, 18, 18, 18], [11, 13, 13, 13]]
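The same check can be done without a Python-level loop by counting distinct values per row. This is a sketch on a toy frame that reuses the notebook's column names; the first row is made up and the other two mirror the mismatches found above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pla_Mon": [9, 17, 11],
    "Con_Day_Mon": [9, 18, 13],
    "Arr_Pic_Mon": [9, 18, 13],
    "Pickup_Mon": [9, 18, 13],
})
day_cols = [c for c in toy.columns if c.endswith("Mon")]

# A row has a mismatch when its day columns hold more than one distinct value
mismatch = toy[day_cols].nunique(axis=1) > 1
different_days = toy.loc[mismatch, day_cols].values.tolist()
print(int(mismatch.sum()), different_days)
```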

In [133]:
month_cols

['Pla_Mon', 'Con_Day_Mon', 'Arr_Pic_Mon', 'Pickup_Mon']

In [134]:
weekday_cols

['Pla_Weekday', 'Con_Weekday', 'Arr_Pic_Weekday', 'Pickup_Weekday']

#### Creating Day_of_Month and Day_of_Week columns

In [135]:
full_df['Day_of_Month'] = full_df[month_cols[0]]
full_df['Day_of_Week'] = full_df[weekday_cols[0]]

#### Dropping redundant columns

In [136]:
#All Vehicle types are Bikes, Vehicle Type is not necessary.
#Day & Weekday values are repeated in all rows except 2, we retain only one
full_df.drop(month_cols+weekday_cols, axis=1, inplace=True)
full_df.drop('Vehicle_Type', axis=1, inplace=True)

full_df.head(3)

Unnamed: 0,Order_No,User_Id,Platform_Type,Personal_Business,Pla_Time,Con_Time,Arr_Pic_Time,Pickup_Time,Distance(km),Temperature,Precipitation(mm),Pickup_Lat,Pickup_Lon,Destination_Lat,Destination_Lon,Rider_Id,Time_Pic_Arr,Day_of_Month,Day_of_Week
0,Order_No_4211,User_Id_633,3,Business,09:35:46,09:40:10,10:04:47,10:27:30,4,20.4,7.573502,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745.0,9,5
1,Order_No_25375,User_Id_2285,3,Personal,11:16:16,11:23:21,11:40:22,11:44:09,16,26.4,7.573502,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993.0,12,5
2,Order_No_1899,User_Id_265,3,Business,12:39:25,12:42:44,12:49:34,12:53:03,3,23.255689,7.573502,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455.0,30,2


In [137]:
full_df.head()

Unnamed: 0,Order_No,User_Id,Platform_Type,Personal_Business,Pla_Time,Con_Time,Arr_Pic_Time,Pickup_Time,Distance(km),Temperature,Precipitation(mm),Pickup_Lat,Pickup_Lon,Destination_Lat,Destination_Lon,Rider_Id,Time_Pic_Arr,Day_of_Month,Day_of_Week
0,Order_No_4211,User_Id_633,3,Business,09:35:46,09:40:10,10:04:47,10:27:30,4,20.4,7.573502,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745.0,9,5
1,Order_No_25375,User_Id_2285,3,Personal,11:16:16,11:23:21,11:40:22,11:44:09,16,26.4,7.573502,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993.0,12,5
2,Order_No_1899,User_Id_265,3,Business,12:39:25,12:42:44,12:49:34,12:53:03,3,23.255689,7.573502,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455.0,30,2
3,Order_No_9336,User_Id_1402,3,Business,09:25:34,09:26:05,09:37:56,09:43:06,9,19.2,7.573502,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855,1341.0,15,5
4,Order_No_27883,User_Id_1737,1,Personal,09:55:18,09:56:18,10:03:53,10:05:23,9,15.4,7.573502,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770,1214.0,13,1


In [138]:
full_df.columns

Index(['Order_No', 'User_Id', 'Platform_Type', 'Personal_Business', 'Pla_Time',
       'Con_Time', 'Arr_Pic_Time', 'Pickup_Time', 'Distance(km)',
       'Temperature', 'Precipitation(mm)', 'Pickup_Lat', 'Pickup_Lon',
       'Destination_Lat', 'Destination_Lon', 'Rider_Id', 'Time_Pic_Arr',
       'Day_of_Month', 'Day_of_Week'],
      dtype='object')

#### Variable Datatypes

In [139]:
numeric_cols = []
object_cols = []
time_cols = []
for k, v in full_df.dtypes.items():
    if (v != object):
        if (k != "Time_Pic_Arr"):
            numeric_cols.append(k)
    elif k.endswith("Time"):
        time_cols.append(k)
    else:
        object_cols.append(k)
full_df[numeric_cols].head(3) 

Unnamed: 0,Platform_Type,Distance(km),Temperature,Precipitation(mm),Pickup_Lat,Pickup_Lon,Destination_Lat,Destination_Lon,Day_of_Month,Day_of_Week
0,3,4,20.4,7.573502,-1.317755,36.83037,-1.300406,36.829741,9,5
1,3,16,26.4,7.573502,-1.351453,36.899315,-1.295004,36.814358,12,5
2,3,3,23.255689,7.573502,-1.308284,36.843419,-1.300921,36.828195,30,2


In [140]:
full_df[time_cols].head(3)

Unnamed: 0,Pla_Time,Con_Time,Arr_Pic_Time,Pickup_Time
0,09:35:46,09:40:10,10:04:47,10:27:30
1,11:16:16,11:23:21,11:40:22,11:44:09
2,12:39:25,12:42:44,12:49:34,12:53:03


In [141]:
full_df[object_cols].head(3)

Unnamed: 0,Order_No,User_Id,Personal_Business,Rider_Id
0,Order_No_4211,User_Id_633,Business,Rider_Id_432
1,Order_No_25375,User_Id_2285,Personal,Rider_Id_856
2,Order_No_1899,User_Id_265,Business,Rider_Id_155


#### Convert an object to numeric

In [142]:
#Convert an object to numeric (encoding)

le = LabelEncoder()
le.fit(full_df['Personal_Business'])
full_df['Personal_Business'] = le.transform(full_df['Personal_Business'])
full_df['Personal_Business'][:2]


0    0
1    1
Name: Personal_Business, dtype: int32
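`LabelEncoder` assigns integer codes in sorted order of the labels, which is why `Business` becomes 0 and `Personal` becomes 1 in the output above. A small sketch for inspecting that mapping (the `mapping` dictionary is an illustrative addition):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Business", "Personal", "Business"])

# classes_ holds the labels in code order, so enumerate() gives the mapping
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping, codes.tolist())
```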

In [143]:
full_df.head()

Unnamed: 0,Order_No,User_Id,Platform_Type,Personal_Business,Pla_Time,Con_Time,Arr_Pic_Time,Pickup_Time,Distance(km),Temperature,Precipitation(mm),Pickup_Lat,Pickup_Lon,Destination_Lat,Destination_Lon,Rider_Id,Time_Pic_Arr,Day_of_Month,Day_of_Week
0,Order_No_4211,User_Id_633,3,0,09:35:46,09:40:10,10:04:47,10:27:30,4,20.4,7.573502,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745.0,9,5
1,Order_No_25375,User_Id_2285,3,1,11:16:16,11:23:21,11:40:22,11:44:09,16,26.4,7.573502,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993.0,12,5
2,Order_No_1899,User_Id_265,3,0,12:39:25,12:42:44,12:49:34,12:53:03,3,23.255689,7.573502,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455.0,30,2
3,Order_No_9336,User_Id_1402,3,0,09:25:34,09:26:05,09:37:56,09:43:06,9,19.2,7.573502,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855,1341.0,15,5
4,Order_No_27883,User_Id_1737,1,1,09:55:18,09:56:18,10:03:53,10:05:23,9,15.4,7.573502,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770,1214.0,13,1


#### Feature Selection

In [144]:
features = numeric_cols + ['Personal_Business']

data_df = full_df[features]

y = full_df[:border]['Time_Pic_Arr']
train = data_df[:border]
test = data_df[border:]

train.head()

Unnamed: 0,Platform_Type,Distance(km),Temperature,Precipitation(mm),Pickup_Lat,Pickup_Lon,Destination_Lat,Destination_Lon,Day_of_Month,Day_of_Week,Personal_Business
0,3,4,20.4,7.573502,-1.317755,36.83037,-1.300406,36.829741,9,5,0
1,3,16,26.4,7.573502,-1.351453,36.899315,-1.295004,36.814358,12,5,1
2,3,3,23.255689,7.573502,-1.308284,36.843419,-1.300921,36.828195,30,2,0
3,3,9,19.2,7.573502,-1.281301,36.832396,-1.257147,36.795063,15,5,0
4,1,9,15.4,7.573502,-1.266597,36.792118,-1.295041,36.809817,13,1,1


In [107]:
print(full_df.shape,data_df.shape,train.shape,test.shape,y.shape)

(28269, 19) (28269, 11) (21201, 11) (7068, 11) (21201,)
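The feature set includes raw pickup and destination coordinates. One additional feature that could be derived from them is the straight-line (great-circle) distance, sketched below with the haversine formula. The function name `haversine_km` and the Earth radius of 6371 km are assumptions of this sketch; the coordinates are those of the first training row.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km (assumed value)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Pickup and destination of the first training row
d = haversine_km(-1.317755, 36.83037, -1.300406, 36.829741)
print(round(d, 2))
```

For this row the straight-line distance is roughly half of the 4 km recorded in `Distance(km)`, so the two columns carry different information.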


In [145]:
train

Unnamed: 0,Platform_Type,Distance(km),Temperature,Precipitation(mm),Pickup_Lat,Pickup_Lon,Destination_Lat,Destination_Lon,Day_of_Month,Day_of_Week,Personal_Business
0,3,4,20.400000,7.573502,-1.317755,36.830370,-1.300406,36.829741,9,5,0
1,3,16,26.400000,7.573502,-1.351453,36.899315,-1.295004,36.814358,12,5,1
2,3,3,23.255689,7.573502,-1.308284,36.843419,-1.300921,36.828195,30,2,0
3,3,9,19.200000,7.573502,-1.281301,36.832396,-1.257147,36.795063,15,5,0
4,1,9,15.400000,7.573502,-1.266597,36.792118,-1.295041,36.809817,13,1,1
...,...,...,...,...,...,...,...,...,...,...,...
21196,3,3,28.600000,7.573502,-1.258414,36.804800,-1.275285,36.802702,20,3,1
21197,3,7,26.000000,7.573502,-1.307143,36.825009,-1.331619,36.847976,13,6,0
21198,3,20,29.200000,7.573502,-1.286018,36.897534,-1.258414,36.804800,7,4,0
21199,1,13,15.000000,7.573502,-1.250030,36.874167,-1.279209,36.794872,4,3,1


In [146]:
test

Unnamed: 0,Platform_Type,Distance(km),Temperature,Precipitation(mm),Pickup_Lat,Pickup_Lon,Destination_Lat,Destination_Lon,Day_of_Month,Day_of_Week,Personal_Business
21201,3,8,23.255689,7.573502,-1.333275,36.870815,-1.305249,36.822390,27,3,0
21202,3,5,23.255689,7.573502,-1.272639,36.794723,-1.277007,36.823907,17,5,0
21203,3,5,22.800000,7.573502,-1.290894,36.822971,-1.276574,36.851365,27,4,0
21204,3,5,24.500000,7.573502,-1.290503,36.809646,-1.303382,36.790658,17,1,0
21205,3,6,24.400000,7.573502,-1.281081,36.814423,-1.266467,36.792161,11,2,0
...,...,...,...,...,...,...,...,...,...,...,...
28264,3,5,24.800000,7.573502,-1.258414,36.804800,-1.288780,36.816831,7,1,0
28265,3,22,30.700000,7.573502,-1.276141,36.771084,-1.316098,36.913164,10,3,0
28266,3,10,25.100000,7.573502,-1.301446,36.766138,-1.264960,36.798178,5,3,0
28267,3,18,23.600000,7.573502,-1.248404,36.678276,-1.272027,36.817411,29,2,1


In [147]:
y

0         745.0
1        1993.0
2         455.0
3        1341.0
4        1214.0
          ...  
21196       9.0
21197     770.0
21198    2953.0
21199    1380.0
21200    2128.0
Name: Time_Pic_Arr, Length: 21201, dtype: float64

#### train_test_split

In [148]:
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size=0.2, shuffle=True)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(16960, 11) (4241, 11) (16960,) (4241,)


### Modeling

#### Cross validation

In [157]:
rs = 3
kfold = KFold(n_splits=10, random_state=rs, shuffle=True)

regressors = []
regressors.append(SVR())
regressors.append(GradientBoostingRegressor(random_state=rs))
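#Note: n_estimators=rs builds only 3 trees; random_state=rs was probably intended.
#The scores printed below were produced with this 3-tree setting.
regressors.append(ExtraTreesRegressor(n_estimators=rs))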
regressors.append(RandomForestRegressor(random_state=rs))
#regressors.append(xgb.XGBRegressor(random_state=rs, objective="reg:squarederror"))
regressors.append(lgb.LGBMRegressor(random_state=rs))

cv_results = []
for regressor in regressors:     #scores to be minimised are negated (neg)
    cv_results.append(np.sqrt(abs(cross_val_score(regressor, X_train, y=y_train, scoring='neg_mean_squared_error', cv=kfold))))

cv_means = []
cv_stds = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_stds.append(cv_result.std())
    
#cv_res = pd.DataFrame({ 
#    "Algorithm": ["SVR", "GBR", "EXR", "RFR", "XGBR", "LGBM"],
#    "CrossValMeans": cv_means, "CrossValErrors": cv_stds
#                       })
cv_res = pd.DataFrame({ 
    "Algorithm": ["SVR", "GBR", "EXR", "RFR", "LGBM"],
    "CrossValMeans": cv_means, "CrossValErrors": cv_stds
                       })

cv_res = cv_res.sort_values("CrossValMeans", ascending=True)
print(cv_res)

  Algorithm  CrossValMeans  CrossValErrors
4      LGBM     774.614768       18.715030
1       GBR     778.134629       17.888758
3       RFR     800.179967       16.876306
0       SVR     914.156364       19.688177
2       EXR     934.162660       18.913009


### Random Forest

In [152]:
RFC = RandomForestRegressor(random_state=rs)
rf_param = {"max_depth":[None], "max_features":[3], "min_samples_split":[10],
           "min_samples_leaf": [3], "n_estimators":[300]}
rsearch = GridSearchCV(RFC, cv=kfold, scoring='neg_mean_squared_error',param_grid=rf_param)
rfm = rsearch.fit(X_train, y_train)

r_score = np.sqrt(abs(rfm.best_score_))
r_params = rfm.best_params_
print(r_score, r_params)

773.5171026354723 {'max_depth': None, 'max_features': 3, 'min_samples_leaf': 3, 'min_samples_split': 10, 'n_estimators': 300}
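Each parameter in the grid above has a single value, so `GridSearchCV` evaluates exactly one candidate. A grid with several values per parameter is what makes it an actual search. The sketch below runs a small one on synthetic data; the grid values are hypothetical and not tuned for this problem.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data, used only to keep the example fast and self-contained
X_toy, y_toy = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=3)
cv = KFold(n_splits=3, shuffle=True, random_state=3)

# Hypothetical grid: 2 x 2 = 4 candidate settings
grid = {"max_features": [2, 4], "min_samples_leaf": [1, 3]}

search = GridSearchCV(RandomForestRegressor(n_estimators=20, random_state=3),
                      param_grid=grid, scoring="neg_mean_squared_error", cv=cv)
search.fit(X_toy, y_toy)

best_rmse = np.sqrt(-search.best_score_)
n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_, best_rmse)
```

The number of fits grows with the product of the value counts times the number of folds, so large grids on 10-fold cross-validation can take a long time.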


### LightGBM

#### Parameter Tuning

In [158]:
params = {
    'n_estimators':[75], # [75, 95],
    'num_leaves': [15], #[12,15, 17],
    'reg_alpha': [0.02], #[0.02, 0.05],
    'min_data_in_leaf': [300],  #[250, 280, 300]
    'learning_rate': [0.1], #[0.05, 0.1, 0.25],
    'objective': ['regression'] #['regression', None]
    }

lsearch = GridSearchCV(estimator = lgb.LGBMRegressor(random_state=rs), cv=kfold,scoring='neg_mean_squared_error', param_grid=params)
lgbm = lsearch.fit(X_train, y_train)

l_params = lgbm.best_params_
l_score = np.sqrt(abs(lgbm.best_score_))
print(lgbm.best_params_, np.sqrt(abs(lgbm.best_score_)))

{'learning_rate': 0.1, 'min_data_in_leaf': 300, 'n_estimators': 75, 'num_leaves': 15, 'objective': 'regression', 'reg_alpha': 0.02} 772.4197981914269


#### Training and making a prediction

In [160]:
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

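#Note: num_leaves=20 here differs from the tuned value of 15 found by the grid search.
#'n_estimators' is an alias of num_boost_round; LightGBM uses the value in params (75)
#instead of the num_boost_round argument, which is why the log runs past 20 rounds.
lparams = {
    'learning_rate': 0.1, 'min_data_in_leaf': 300,
    'n_estimators': 75, 'num_leaves': 20, 'random_state': rs,
    'objective': 'regression', 'reg_alpha': 0.02,
    'feature_fraction': 0.9, 'bagging_fraction': 0.9}

#early_stopping_rounds is accepted as an argument in lightgbm 3.x (3.0.0 was installed above);
#later LightGBM releases expect the lgb.early_stopping callback instead.
lgbm = lgb.train(lparams, lgb_train, valid_sets=lgb_eval, num_boost_round=20, early_stopping_rounds=20)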

lpred = lgbm.predict(X_test, num_iteration=lgbm.best_iteration)

print("The RMSE of prediction is ", mean_squared_error(y_test, lpred)**0.5)


You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1336
[LightGBM] [Info] Number of data points in the train set: 16960, number of used features: 11
[LightGBM] [Info] Start training from score 1555.150236
[1]	valid_0's l2: 918188
Training until validation scores don't improve for 20 rounds
[2]	valid_0's l2: 866539
[3]	valid_0's l2: 824476
[4]	valid_0's l2: 789734
[5]	valid_0's l2: 761691
[6]	valid_0's l2: 738958
[7]	valid_0's l2: 720292
[8]	valid_0's l2: 704931
[9]	valid_0's l2: 691675
[10]	valid_0's l2: 681557
[11]	valid_0's l2: 673080
[12]	valid_0's l2: 665960
[13]	valid_0's l2: 660426
[14]	valid_0's l2: 654948
[15]	valid_0's l2: 650506
[16]	valid_0's l2: 647193
[17]	valid_0's l2: 644222
[18]	valid_0's l2: 641821
[19]	valid_0's l2: 638942
[20]	valid_0's l2: 637037
[21]	valid_0's l2: 635290
[22]	valid_0's l2: 634152
[23]	valid_0's l2: 633216
[24]	valid_0's l2: 632033
[25]	valid_0's l2: 630848
[26]	valid_0's l2: 630171
[27]	valid_0's l2: 629561
[28]	
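The training log reports `l2`, which is the mean squared error on the validation set, while the rest of the notebook compares models by RMSE. Taking the square root puts the logged values on the same scale. The sketch below does this for a few iterations copied from the log above.

```python
import math

# A few (iteration, l2) pairs copied from the log above
l2_log = {1: 918188, 10: 681557, 20: 637037, 27: 629561}

# RMSE is the square root of the mean squared error
rmse_log = {it: math.sqrt(l2) for it, l2 in l2_log.items()}

for it, rmse in rmse_log.items():
    print(f"iteration {it}: RMSE = {rmse:.1f}")
```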

## Submission

In [162]:
lgbm_y = lgbm.predict(test, num_iteration=lgbm.best_iteration)
lgbm_output = pd.DataFrame({"Order No":test_df['Order No'], 
                           "Time from Pickup to Arrival": lgbm_y })
lgbm_output.to_csv(r"D:\Temp\submission.csv", index=False)
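Before uploading, it can help to run a few sanity checks on the submission frame. The sketch below is one possible set of checks, demonstrated on toy frames; `check_submission` is a hypothetical helper and the expected column names are taken from the cell above.

```python
import pandas as pd

EXPECTED_COLUMNS = ["Order No", "Time from Pickup to Arrival"]

def check_submission(submission, expected_order_ids):
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    if list(submission.columns) != EXPECTED_COLUMNS:
        problems.append("unexpected columns")
        return problems
    if len(submission) != len(expected_order_ids):
        problems.append("row count differs from the test set")
    if submission["Order No"].duplicated().any():
        problems.append("duplicate order ids")
    if submission["Time from Pickup to Arrival"].isna().any():
        problems.append("missing predictions")
    if (submission["Time from Pickup to Arrival"] < 0).any():
        problems.append("negative predictions")
    return problems

# Toy frames with made-up ids and predictions
toy_ids = ["Order_No_1", "Order_No_2", "Order_No_3"]
good = pd.DataFrame({"Order No": toy_ids,
                     "Time from Pickup to Arrival": [1200.5, 830.0, 2100.2]})
bad = pd.DataFrame({"Order No": toy_ids[:2],
                    "Time from Pickup to Arrival": [1200.5, None]})

print(check_submission(good, toy_ids))
print(check_submission(bad, toy_ids))
```

With the notebook's objects, the call would look like `check_submission(lgbm_output, test_df['Order No'])`.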