## Defining the Research Question

### Background

Logistics in Sub-Saharan Africa increases the cost of manufactured goods by up to 320%; while in Europe, it only accounts for up to 90% of the manufacturing cost. Sendy is a business-to-business platform established in 2014, to enable businesses of all types and sizes to transport goods more efficiently across East Africa. The company is headquartered in Kenya with a team of more than 100 staff, focused on building practical solutions for Africa’s dynamic transportation needs, from developing apps and web
solutions to providing dedicated support for goods on the move.



### Problem Statement

Sendy has hired you to help predict the estimated time of delivery of orders, from the point of driver pickup to the point of arrival at the final destination. Build a model that predicts an accurate delivery time, from picking up a package arriving at the final destination. An accurate arrival time prediction will help all business to improve their logistics and communicate the accurate time their time to their customers. You will be
required to perform various feature engineering techniques while preparing your data for further analysis.


### Metric of Success

A model that will accurately predict the estimated time of delivery of orders, with an RMSE that is less than 10% of the target mean.

### Dataset Information

The dataset provided by Sendy includes order details and rider metrics based on orders made on the Sendy platform. The challenge is to predict the estimated time of arrival for orders- from pick-up to drop-off. The dataset provided here is a subset of over 20,000
orders and only includes direct orders (i.e. Sendy “express” orders) with bikes in Nairobi.
All data in this subset have been fully anonymized while preserving the distribution.

Dataset URL = https://bit.ly/3deaKEM

Dataset Glossary = https://bit.ly/30O3xsr

Project Source: https://bit.ly/2Y6Hzz3

### Solution Steps

* Import Libraries
* Load, Explore & Clean Data
* Model base regressor and evaluate
* Model Improvement
* Summary of findings and Recommendations
* Challenge the Solution

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
pd.set_option('display.max_columns', None)

In [None]:
# We'll import and install the following packages: six, sys, mlrose and joblib
# to use `SequentialFeatureSelector` for feature selection from mlxtend.

# importing six and sys
import six
import sys
sys.modules['sklearn.externals.six'] = six

# installing mlrose
!pip install mlrose
import mlrose

# importing joblib
import joblib
sys.modules['sklearn.externals.joblib'] = joblib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

## Load, Explore and Clean Data

In [None]:
# load glossary
glossary = pd.read_csv('https://bit.ly/30O3xsr')
glossary

Unnamed: 0,Order No,Unique number identifying the order
0,User Id,Unique number identifying the customer on a pl...
1,Vehicle Type,"For this competition limited to bikes, however..."
2,Platform Type,"Platform used to place the order, there are 4 ..."
3,Personal or Business,Customer type
4,Placement - Day of Month,Placement - Day of Month i.e 1-31
5,Placement - Weekday (Mo = 1),Placement - Weekday (Monday = 1)
6,Placement - Time,Placement - Time - Time of day the order was p...
7,Confirmation - Day of Month,Confirmation - Day of Month i.e 1-31
8,Confirmation - Weekday (Mo = 1),Confirmation - Weekday (Monday = 1)
9,Confirmation - Time,Confirmation - Time - Time of day the order wa...


In [None]:
# load data
df = pd.read_csv("https://bit.ly/3deaKEM")

In [None]:
# preview data
df.head(3)

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Arrival at Pickup - Time,Pickup - Day of Month,Pickup - Weekday (Mo = 1),Pickup - Time,Arrival at Destination - Day of Month,Arrival at Destination - Weekday (Mo = 1),Arrival at Destination - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival
0,Order_No_4211,User_Id_633,Bike,3,Business,9,5,9:35:46 AM,9,5,9:40:10 AM,9,5,10:04:47 AM,9,5,10:27:30 AM,9,5,10:39:55 AM,4,20.4,,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745
1,Order_No_25375,User_Id_2285,Bike,3,Personal,12,5,11:16:16 AM,12,5,11:23:21 AM,12,5,11:40:22 AM,12,5,11:44:09 AM,12,5,12:17:22 PM,16,26.4,,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993
2,Order_No_1899,User_Id_265,Bike,3,Business,30,2,12:39:25 PM,30,2,12:42:44 PM,30,2,12:49:34 PM,30,2,12:53:03 PM,30,2,1:00:38 PM,3,,,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455


In [None]:
#check for duplicates
df.duplicated().sum()

0

In [None]:
# #check for duplicates in the Order No feature
df['Order No'].duplicated().sum()

0

In [None]:
# drop unnecessary columns
df = df.drop(['Order No',	'User Id', 'Rider Id'], axis = 1)

In [None]:
# check the dataset shape
df.shape

(21201, 26)

In [None]:
# check for null values
df.isna().sum()

Vehicle Type                                     0
Platform Type                                    0
Personal or Business                             0
Placement - Day of Month                         0
Placement - Weekday (Mo = 1)                     0
Placement - Time                                 0
Confirmation - Day of Month                      0
Confirmation - Weekday (Mo = 1)                  0
Confirmation - Time                              0
Arrival at Pickup - Day of Month                 0
Arrival at Pickup - Weekday (Mo = 1)             0
Arrival at Pickup - Time                         0
Pickup - Day of Month                            0
Pickup - Weekday (Mo = 1)                        0
Pickup - Time                                    0
Arrival at Destination - Day of Month            0
Arrival at Destination - Weekday (Mo = 1)        0
Arrival at Destination - Time                    0
Distance (KM)                                    0
Temperature                    

> Precipitation in millimeters and temperature features have missing values.

In [None]:
# drop Precipitation in millimeters
df = df.drop(['Precipitation in millimeters'], axis = 1)

In [None]:
# impute missing values for the Temperature feature with mean, only 20% of observations are missing
df['Temperature'] = df.Temperature.fillna(value = df.Temperature.mean())

In [None]:
df.isna().sum()

Vehicle Type                                 0
Platform Type                                0
Personal or Business                         0
Placement - Day of Month                     0
Placement - Weekday (Mo = 1)                 0
Placement - Time                             0
Confirmation - Day of Month                  0
Confirmation - Weekday (Mo = 1)              0
Confirmation - Time                          0
Arrival at Pickup - Day of Month             0
Arrival at Pickup - Weekday (Mo = 1)         0
Arrival at Pickup - Time                     0
Pickup - Day of Month                        0
Pickup - Weekday (Mo = 1)                    0
Pickup - Time                                0
Arrival at Destination - Day of Month        0
Arrival at Destination - Weekday (Mo = 1)    0
Arrival at Destination - Time                0
Distance (KM)                                0
Temperature                                  0
Pickup Lat                                   0
Pickup Long  

In [None]:
df['Vehicle Type'].value_counts()

Bike    21201
Name: Vehicle Type, dtype: int64

> Drop Vehicle Type feauture because there's only one vehicle type for all observations

In [None]:
df = df.drop(['Vehicle Type'], axis = 1)

In [None]:
# check unique Personal or Business feature values
df['Personal or Business'].value_counts()

Business    17384
Personal     3817
Name: Personal or Business, dtype: int64

In [None]:
# encode Personal or Business feature
df['Personal or Business'] = df['Personal or Business'].astype('category').cat.codes

In [None]:
df['Personal or Business'].value_counts()

0    17384
1     3817
Name: Personal or Business, dtype: int64

Our time feautures are in the form I:M:S, we can convert these to HMS. For example, 2:19:47 PM becomes 141947.

%H is the 24 hour clock, %I is the 12 hour clock and when using the 12 hour clock, %p qualifies if it is AM or PM

In [None]:
from datetime import datetime

def time_converter(time):
  in_time = datetime.strptime(time, "%I:%M:%S %p")
  out_time = datetime.strftime(in_time, "%H%M%S")
  return out_time

In [None]:
time_cols = ['Placement - Time', 'Confirmation - Time', 'Arrival at Pickup - Time', 'Pickup - Time', 'Arrival at Destination - Time']

In [None]:
df[time_cols].sample(5)

Unnamed: 0,Placement - Time,Confirmation - Time,Arrival at Pickup - Time,Pickup - Time,Arrival at Destination - Time
2814,10:08:32 AM,10:08:47 AM,10:16:00 AM,10:18:18 AM,10:41:36 AM
10039,3:43:45 PM,4:17:04 PM,4:26:58 PM,4:43:02 PM,4:53:33 PM
11385,9:49:42 AM,9:52:49 AM,10:15:36 AM,10:16:56 AM,10:30:33 AM
15659,4:06:24 PM,4:07:01 PM,4:07:44 PM,4:15:20 PM,4:31:08 PM
6913,9:18:56 AM,9:19:07 AM,9:19:39 AM,9:41:34 AM,10:17:02 AM


In [None]:
for col in time_cols:
  df[col] = df[col].apply(time_converter)
  print('Success converting', col)

Success converting Placement - Time
Success converting Confirmation - Time
Success converting Arrival at Pickup - Time
Success converting Pickup - Time
Success converting Arrival at Destination - Time


In [None]:
df[time_cols].sample(5)

Unnamed: 0,Placement - Time,Confirmation - Time,Arrival at Pickup - Time,Pickup - Time,Arrival at Destination - Time
4457,153443,153449,153505,154050,161915
1615,124146,124533,125000,130129,141109
11226,153549,153636,154539,155036,161354
8660,150747,150755,150814,151128,155025
18884,141912,141925,141934,150024,152146


Target RMSE

In [68]:
df['Time from Pickup to Arrival'].mean() * 0.1

155.69209471251358

## Base Model: ensemble regressor

Encode categorical features

In [None]:
#split data
X = df.drop(['Time from Pickup to Arrival'], axis=1)
Y = df['Time from Pickup to Arrival']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=0)

In [None]:
# the model 
base_regressor = RandomForestRegressor(n_estimators = 10, random_state = 12345)

In [None]:
# fit the model
base_regressor.fit(X_train, Y_train)

RandomForestRegressor(n_estimators=10, random_state=12345)

In [None]:
# predict
predictions = base_regressor.predict(X_test)

In [None]:
print('RMSE:', np.sqrt(mean_squared_error(Y_test, predictions)))

RMSE: 548.0882766104316


## Model Improvement

### Feature scaling

Modeling with only normalisation

In [None]:
norm = MinMaxScaler().fit(X_train) 
X_train_normed = norm.transform(X_train) 
X_test_normed = norm.transform(X_test)

In [None]:
# fit the model
regressor_with_norm = RandomForestRegressor(n_estimators = 10, random_state = 12345)
regressor_with_norm.fit(X_train_normed, Y_train)

RandomForestRegressor(n_estimators=10, random_state=12345)

In [None]:
# predict
predictions_with_norm = regressor_with_norm.predict(X_test_normed)

In [None]:
print('RMSE:', np.sqrt(mean_squared_error(Y_test, predictions_with_norm)))

RMSE: 546.0850642671926


Modeling with standardisation

In [None]:
scaler = StandardScaler() 
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

In [None]:
# fit the model
regressor_with_sc = RandomForestRegressor(n_estimators = 10, random_state = 12345)
regressor_with_sc.fit(X_train_scaled, Y_train)

RandomForestRegressor(n_estimators=10, random_state=12345)

In [None]:
# predict
predictions_with_sc = regressor_with_sc.predict(X_test_scaled)

In [None]:
print('RMSE:', np.sqrt(mean_squared_error(Y_test, predictions_with_sc)))

RMSE: 547.0538222752199


Observation: scaling improves the RMSE slighly

### Feature Selection

##### Wrapper Method: Step Forward Feature Selection

In [None]:
# split data
X = df.drop(['Time from Pickup to Arrival'], axis=1)
Y = df['Time from Pickup to Arrival']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=0)

# normalization
norm = MinMaxScaler().fit(X_train) 
X_train_normed = norm.transform(X_train) 
X_test_normed = norm.transform(X_test)

In [None]:
# modelling, we'll use the normalised data

sf_regressor = RandomForestRegressor(n_estimators = 10, random_state = 12345)

# We pass the regressor the estimator to the SequentialFeatureSelector function. 
# k_features specifies the number of features to select. 
# forward parameter, if set to True, performs step forward feature selection. 
# verbose parameter is used for logging the progress of the feature selector
# scoring parameter defines the performance evaluation criteria 
# cv refers to cross-validation folds.

sf_feature_selector = SequentialFeatureSelector(sf_regressor,
           k_features=20,
           forward=True,
           verbose=2,
           scoring='r2',
           cv=4)
 
# Perform step forward feature selection
sf_feature_selector = sf_feature_selector.fit(X_train_normed, Y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  23 out of  23 | elapsed:   12.4s finished

[2022-08-02 12:26:36] Features: 1/20 -- score: 0.3442037719100365[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:   12.7s finished

[2022-08-02 12:26:49] Features: 2/20 -- score: 0.34284050006761724[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:   11.3s finished

[2022-08-02 12:27:00] Features: 3/20 -- score: 0.3369269304430071[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 

In [None]:
# the selected features

sf_feat_cols = list(sf_feature_selector.k_feature_idx_)
print(sf_feat_cols)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]


In [None]:
# modelling with step forward feature selection
regressor_with_sffs = RandomForestRegressor(n_estimators = 10, random_state = 12345)
regressor_with_sffs.fit(X_train_normed[:, sf_feat_cols], Y_train)

# Making Predictions and determining the accuracies
sffs_predictions = regressor_with_sffs.predict(X_test_normed[:, sf_feat_cols])
print('RMSE With sffs:', np.sqrt(mean_squared_error(Y_test, sffs_predictions)))

RMSE With sffs: 425.9848291836782


Observation: our RMSE improves from the original 548 to 425

##### Wrapper Method: Step Backward Feature Selection

In [None]:
# split data
X = df.drop(['Time from Pickup to Arrival'], axis=1)
Y = df['Time from Pickup to Arrival']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=0)

# normalization
norm = MinMaxScaler().fit(X_train) 
X_train_normed = norm.transform(X_train) 
X_test_normed = norm.transform(X_test)

In [None]:
# Modelling, we'll use the normalised data

sb_regressor = RandomForestRegressor(n_estimators = 10, random_state = 12345)

# We pass the regressor the estimator to the SequentialFeatureSelector function. 
# k_features specifies the number of features to select. 
# forward parameter, if set to True, performs step forward feature selection. 
# verbose parameter is used for logging the progress of the feature selector
# scoring parameter defines the performance evaluation criteria 
# cv refers to cross-validation folds.

sb_feature_selector = SequentialFeatureSelector(sb_regressor,
           k_features=20,
           forward=False,
           verbose=2,
           scoring='r2',
           cv=4)
 
# Perform step backward feature selection
sb_feature_selector = sb_feature_selector.fit(X_train_normed, Y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  23 out of  23 | elapsed:  2.6min finished

[2022-08-02 12:43:59] Features: 22/20 -- score: 0.6813925934046008[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:  2.4min finished

[2022-08-02 12:46:21] Features: 21/20 -- score: 0.7215913114718059[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:  2.1min finished

[2022-08-02 12:48:28] Features: 20/20 -- score: 0.8539239273640011

In [None]:
# the selected features

sb_feat_cols = list(sb_feature_selector.k_feature_idx_)
print(sb_feat_cols)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20]


In [None]:
# modelling with step backward feature selection
regressor_with_sbfs = RandomForestRegressor(n_estimators = 10, random_state = 12345)
regressor_with_sbfs.fit(X_train_normed[:, sb_feat_cols], Y_train)

# Making Predictions and determining the accuracies
sbfs_predictions = regressor_with_sbfs.predict(X_test_normed[:, sb_feat_cols])
print('RMSE With sbfs:', np.sqrt(mean_squared_error(Y_test, sbfs_predictions)))

RMSE With sbfs: 338.9262916190953


Observation: our RMSE improves from the original 548 to 338

##### Wrapper Method: Recursive Feature Elimination

In [None]:
# split data
X = df.drop(['Time from Pickup to Arrival'], axis=1)
Y = df['Time from Pickup to Arrival']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=0)

# normalization
norm = MinMaxScaler().fit(X_train) 
X_train_normed = norm.transform(X_train) 
X_test_normed = norm.transform(X_test)

In [None]:
# We want to select the best 20 features for our model. 
# NB: n_features_to_select will include the response variable

rfe_regressor = RandomForestRegressor(n_estimators = 10, random_state = 12345)

rfe_regressor = RFE(rfe_regressor, n_features_to_select = 20, step=1)

rfe_regressor.fit(X_train_normed, Y_train) 

# Make Predictions  
rfe_y_pred = rfe_regressor.predict(X_test_normed)

# Finally, evaluate our model  
print('RMSE with RFE:', np.sqrt(mean_squared_error(Y_test, rfe_y_pred)))

# Displaying our best features
print('RFE Selected features: %s' % list(X.columns[rfe_regressor.support_]))

RMSE with RFE: 546.3869357498613
RFE Selected features: ['Placement - Day of Month', 'Placement - Time', 'Confirmation - Day of Month', 'Confirmation - Weekday (Mo = 1)', 'Confirmation - Time', 'Arrival at Pickup - Day of Month', 'Arrival at Pickup - Weekday (Mo = 1)', 'Arrival at Pickup - Time', 'Pickup - Day of Month', 'Pickup - Weekday (Mo = 1)', 'Pickup - Time', 'Arrival at Destination - Day of Month', 'Arrival at Destination - Weekday (Mo = 1)', 'Arrival at Destination - Time', 'Distance (KM)', 'Temperature', 'Pickup Lat', 'Pickup Long', 'Destination Lat', 'Destination Long']


Observation: our RMSE improves slightly from the original 548 to 546

##### Feature Transformation: Principal Component Analysis

In [None]:
# split data
X = df.drop(['Time from Pickup to Arrival'], axis=1)
Y = df['Time from Pickup to Arrival']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=0)

# normalization
norm = MinMaxScaler().fit(X_train) 
X_train_normed = norm.transform(X_train) 
X_test_normed = norm.transform(X_test)

In [None]:
# Apply PCA

pca = PCA()
X_train_ = pca.fit_transform(X_train_normed)
X_test_ = pca.transform(X_test_normed)

regressor_with_pca = RandomForestRegressor(n_estimators = 10, random_state = 12345)
regressor_with_pca.fit(X_train_, Y_train)

pca_y_pred = regressor_with_pca.predict(X_test_)

# Finally, evaluate our model  
print('RMSE with PCA:', np.sqrt(mean_squared_error(Y_test, pca_y_pred)))

RMSE with PCA: 410.51916925219047


Observation: our RMSE improves from the original 548 to 410

##### Feature Transformation: Linear Discriminant Analysis

In [None]:
# split data
X = df.drop(['Time from Pickup to Arrival'], axis=1)
Y = df['Time from Pickup to Arrival']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=0)

# normalization
norm = MinMaxScaler().fit(X_train) 
X_train_normed = norm.transform(X_train) 
X_test_normed = norm.transform(X_test)

In [None]:
clf = LinearDiscriminantAnalysis()
clf.fit(X_train_normed, Y_train)
clf_predictions = clf.predict(X_test_normed)

#evaluate our model  
print('RMSE with Linear Discriminant Analysis:', np.sqrt(mean_squared_error(Y_test, clf_predictions)))

RMSE with Linear Discriminant Analysis: 634.2949788975607


Our RMSE just got worse

Observation
> Of all the feature selection methods, step backward feauture selection produced the best RMSE of 338.

### Feature construction

In [None]:
# preview data
df.sample(3)

Unnamed: 0,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Arrival at Pickup - Time,Pickup - Day of Month,Pickup - Weekday (Mo = 1),Pickup - Time,Arrival at Destination - Day of Month,Arrival at Destination - Weekday (Mo = 1),Arrival at Destination - Time,Distance (KM),Temperature,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Time from Pickup to Arrival
6472,3,0,25,1,150919,25,1,151059,25,1,153855,25,1,164842,25,1,170047,4,27.3,-1.316711,36.830156,-1.300406,36.829741,725
5797,3,0,2,3,150125,2,3,150304,2,3,150732,2,3,154055,2,3,163726,7,26.1,-1.259956,36.799344,-1.301642,36.827168,3391
10383,3,0,13,4,93658,13,4,93714,13,4,93801,13,4,95151,13,4,101209,17,18.5,-1.307726,36.839117,-1.348394,36.907428,1218


In [None]:
# create a new feature:  speed = distance/time
# convert time to hours so that our speed is in KM/Hr
df['speed'] = df['Distance (KM)'] / (df['Time from Pickup to Arrival'] / 3600)

In [None]:
df.sample(3)

Unnamed: 0,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Arrival at Pickup - Time,Pickup - Day of Month,Pickup - Weekday (Mo = 1),Pickup - Time,Arrival at Destination - Day of Month,Arrival at Destination - Weekday (Mo = 1),Arrival at Destination - Time,Distance (KM),Temperature,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Time from Pickup to Arrival,speed
9227,3,0,6,1,123705,6,1,123817,6,1,125126,6,1,130203,6,1,132356,8,20.4,-1.255189,36.782203,-1.28462,36.795832,1313,21.934501
14457,3,0,23,6,105237,23,6,105413,23,6,110452,23,6,111044,23,6,113436,11,23.258889,-1.272828,36.816608,-1.331619,36.847976,1432,27.653631
18200,3,0,21,3,131653,21,3,151603,21,3,152534,21,3,153814,21,3,154632,5,23.2,-1.295034,36.78205,-1.298465,36.817189,498,36.144578


Without Normalisation

In [None]:
#split data
features = df.drop(['Time from Pickup to Arrival'], axis=1)
target = df['Time from Pickup to Arrival']

features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=.2, random_state=0)

In [None]:
new_base_regressor = RandomForestRegressor(n_estimators = 10, random_state = 12345)

In [None]:
new_base_regressor.fit(features_train, target_train)

RandomForestRegressor(n_estimators=10, random_state=12345)

In [None]:
new_predictions = new_base_regressor.predict(features_test)

In [None]:
print('RMSE:', np.sqrt(mean_squared_error(target_test, new_predictions)))

RMSE: 63.56973317692046


With Normalisation

In [None]:
norm = MinMaxScaler().fit(features_train) 
features_train_normed = norm.transform(features_train) 
features_test_normed = norm.transform(features_test)

In [None]:
new_base_regressor = RandomForestRegressor(n_estimators = 10, random_state = 12345)

In [None]:
new_base_regressor.fit(features_train_normed, target_train)

RandomForestRegressor(n_estimators=10, random_state=12345)

In [None]:
new_predictions = new_base_regressor.predict(features_test_normed)

In [None]:
print('RMSE:', np.sqrt(mean_squared_error(target_test, new_predictions)))

RMSE: 62.772266782474404


With Step Backward Selection

In [None]:
# Modelling, we'll use the normalised data

new_sb_regressor = RandomForestRegressor(n_estimators = 10, random_state = 12345)

# We pass the regressor the estimator to the SequentialFeatureSelector function. 
# k_features specifies the number of features to select. 
# forward parameter, if set to True, performs step forward feature selection. 
# verbose parameter is used for logging the progress of the feature selector
# scoring parameter defines the performance evaluation criteria 
# cv refers to cross-validation folds.

new_sb_feature_selector = SequentialFeatureSelector(new_sb_regressor,
           k_features=20,
           forward=False,
           verbose=2,
           scoring='r2',
           cv=4)
 
# Perform step backward feature selection
new_sb_feature_selector = new_sb_feature_selector.fit(features_train_normed, target_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    7.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  2.4min finished

[2022-08-02 13:04:06] Features: 23/20 -- score: 0.9944965343757916[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  23 out of  23 | elapsed:  2.1min finished

[2022-08-02 13:06:14] Features: 22/20 -- score: 0.9948821852363113[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:  1.9min finished

[2022-08-02 13:08:09] Features: 21/20 -- score: 0.9950208645462182[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Don

In [None]:
# the selected features

new_sb_feat_cols = list(new_sb_feature_selector.k_feature_idx_)
print(new_sb_feat_cols)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 23]


In [None]:
# modelling with step backward feature selection
new_regressor_with_sbfs = RandomForestRegressor(n_estimators = 10, random_state = 12345)
new_regressor_with_sbfs.fit(features_train_normed[:, new_sb_feat_cols], target_train)

# Making Predictions and determining the accuracies
new_sbfs_predictions = new_regressor_with_sbfs.predict(features_test_normed[:, new_sb_feat_cols])
print('RMSE With sbfs:', np.sqrt(mean_squared_error(target_test, new_sbfs_predictions)))

RMSE With sbfs: 60.40979404221297


Observation
> After adding the new feauture + normalization + step backward feature selection got the RMSE to drop from 548 to 60

## Summary of findings and Recommendations

> Our base model had an RMSE of 548.

> Feature normalisation and standardisation improved the RMSE to 546 and 547 respectively.

> Step forward feature selection improved the RMSE from 548 to 425.

> Step backward feature selection improved the RMSE from 548 to 338.

> Recursive feature elimination improved the RMSE slightly from 548 to 546.

> Principal Component Analysis improved the RMSE from the 548 to 410.

> Linear discriminant analysis had the worst RMSE of 634.

> Feature construction had the best RMSE of 63. Coupled with normalisation and step backward feature selection, the RMSE improved to 60

Sendy can rely on a Random Forest Regressor to predict delivery time.


## Challenge the Solution

a) Did we have the right question? Yes

b) Did we have the right data? Yes

c) What can be done to improve the solution?
- Hyperparameter tuning
- Construct more features
- Handle any outliers 