## Background
Logistics in Sub-Saharan Africa increases the cost of manufactured goods by up to
320%, while in Europe it accounts for only up to 90% of the manufacturing cost. Sendy
is a business-to-business platform established in 2014 to enable businesses of all types
and sizes to transport goods more efficiently across East Africa. The company is
headquartered in Kenya with a team of more than 100 staff, focused on building practical
solutions for Africa’s dynamic transportation needs, from developing apps and web
solutions to providing dedicated support for goods on the move.
## Problem Statement
Sendy has hired you to help predict the estimated time of delivery of orders, from the
point of driver pickup to the point of arrival at the final destination. Build a model that
predicts an accurate delivery time, from picking up a package to arriving at the final
destination. An accurate arrival time prediction will help all businesses improve their
logistics and communicate accurate delivery times to their customers. You will be
required to perform various feature engineering techniques while preparing your data for
further analysis.
You will be required to go through the following:
* Defining the Research Question
* Data Importation
* Data Exploration
* Data Cleaning
* Data Analysis (Univariate and Bivariate)
* Data Preparation
* Data Modeling
* Model Evaluation
* Challenging your Solution
* Recommendations / Conclusion

## Dataset Information
The dataset provided by Sendy includes order details and rider metrics based on orders
made on the Sendy platform. The challenge is to predict the estimated time of arrival for
orders, from pick-up to drop-off. The dataset provided here is a subset of over 20,000
orders and only includes direct orders (i.e. Sendy “express” orders) with bikes in Nairobi.
All data in this subset have been fully anonymized while preserving the distribution.

**Dataset URL** = https://bit.ly/3deaKEM

**Dataset Glossary** = https://bit.ly/30O3xsr

In [None]:
# Load the necessary libraries

import pandas as pd

In [36]:
pip install mlxtend

Collecting mlxtend
  Downloading mlxtend-0.21.0-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.21.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Load the Data set
df = pd.read_csv('https://bit.ly/3deaKEM')

In [3]:
# Read the first five records (the data was already loaded in the previous cell)
df.head()

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),...,Arrival at Destination - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival
0,Order_No_4211,User_Id_633,Bike,3,Business,9,5,9:35:46 AM,9,5,...,10:39:55 AM,4,20.4,,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745
1,Order_No_25375,User_Id_2285,Bike,3,Personal,12,5,11:16:16 AM,12,5,...,12:17:22 PM,16,26.4,,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993
2,Order_No_1899,User_Id_265,Bike,3,Business,30,2,12:39:25 PM,30,2,...,1:00:38 PM,3,,,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455
3,Order_No_9336,User_Id_1402,Bike,3,Business,15,5,9:25:34 AM,15,5,...,10:05:27 AM,9,19.2,,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855,1341
4,Order_No_27883,User_Id_1737,Bike,1,Personal,13,1,9:55:18 AM,13,1,...,10:25:37 AM,9,15.4,,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770,1214


In [4]:
# View the last five records
df.tail()

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),...,Arrival at Destination - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival
21196,Order_No_8834,User_Id_2001,Bike,3,Personal,20,3,3:54:38 PM,20,3,...,4:20:17 PM,3,28.6,,-1.258414,36.8048,-1.275285,36.802702,Rider_Id_953,9
21197,Order_No_22892,User_Id_1796,Bike,3,Business,13,6,10:13:34 AM,13,6,...,10:46:17 AM,7,26.0,,-1.307143,36.825009,-1.331619,36.847976,Rider_Id_155,770
21198,Order_No_2831,User_Id_2956,Bike,3,Business,7,4,5:06:16 PM,7,4,...,6:40:05 PM,20,29.2,,-1.286018,36.897534,-1.258414,36.8048,Rider_Id_697,2953
21199,Order_No_6174,User_Id_2524,Bike,1,Personal,4,3,9:31:39 AM,4,3,...,10:08:15 AM,13,15.0,,-1.25003,36.874167,-1.27921,36.794872,Rider_Id_347,1380
21200,Order_No_9836,User_Id_718,Bike,3,Business,26,2,2:19:47 PM,26,2,...,3:17:23 PM,12,30.9,,-1.255189,36.782203,-1.320157,36.830887,Rider_Id_177,2128


In [5]:
# View the data types
df.dtypes

Order No                                      object
User Id                                       object
Vehicle Type                                  object
Platform Type                                  int64
Personal or Business                          object
Placement - Day of Month                       int64
Placement - Weekday (Mo = 1)                   int64
Placement - Time                              object
Confirmation - Day of Month                    int64
Confirmation - Weekday (Mo = 1)                int64
Confirmation - Time                           object
Arrival at Pickup - Day of Month               int64
Arrival at Pickup - Weekday (Mo = 1)           int64
Arrival at Pickup - Time                      object
Pickup - Day of Month                          int64
Pickup - Weekday (Mo = 1)                      int64
Pickup - Time                                 object
Arrival at Destination - Day of Month          int64
Arrival at Destination - Weekday (Mo = 1)     

In [6]:
# View the descriptive statistics
df.describe()

Unnamed: 0,Platform Type,Placement - Day of Month,Placement - Weekday (Mo = 1),Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Pickup - Day of Month,Pickup - Weekday (Mo = 1),Arrival at Destination - Day of Month,Arrival at Destination - Weekday (Mo = 1),Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Time from Pickup to Arrival
count,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,16835.0,552.0,21201.0,21201.0,21201.0,21201.0,21201.0
mean,2.752182,15.653696,3.240083,15.653837,3.240225,15.653837,3.240225,15.653837,3.240225,15.653837,3.240225,9.506533,23.258889,7.905797,-1.28147,36.811264,-1.282581,36.81122,1556.920947
std,0.625178,8.798916,1.567295,8.798886,1.567228,8.798886,1.567228,8.798886,1.567228,8.798886,1.567228,5.668963,3.615768,17.089971,0.030507,0.037473,0.034824,0.044721,987.270788
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.2,0.1,-1.438302,36.653621,-1.430298,36.606594,1.0
25%,3.0,8.0,2.0,8.0,2.0,8.0,2.0,8.0,2.0,8.0,2.0,5.0,20.6,1.075,-1.300921,36.784605,-1.301201,36.785661,882.0
50%,3.0,15.0,3.0,15.0,3.0,15.0,3.0,15.0,3.0,15.0,3.0,8.0,23.5,2.9,-1.279395,36.80704,-1.284382,36.808002,1369.0
75%,3.0,23.0,5.0,23.0,5.0,23.0,5.0,23.0,5.0,23.0,5.0,13.0,26.0,4.9,-1.257147,36.829741,-1.261177,36.829477,2040.0
max,4.0,31.0,7.0,31.0,7.0,31.0,7.0,31.0,7.0,31.0,7.0,49.0,32.1,99.1,-1.14717,36.991046,-1.030225,37.016779,7883.0


In [8]:
# Check for Missing values
df.isna().sum()

Order No                                         0
User Id                                          0
Vehicle Type                                     0
Platform Type                                    0
Personal or Business                             0
Placement - Day of Month                         0
Placement - Weekday (Mo = 1)                     0
Placement - Time                                 0
Confirmation - Day of Month                      0
Confirmation - Weekday (Mo = 1)                  0
Confirmation - Time                              0
Arrival at Pickup - Day of Month                 0
Arrival at Pickup - Weekday (Mo = 1)             0
Arrival at Pickup - Time                         0
Pickup - Day of Month                            0
Pickup - Weekday (Mo = 1)                        0
Pickup - Time                                    0
Arrival at Destination - Day of Month            0
Arrival at Destination - Weekday (Mo = 1)        0
Arrival at Destination - Time  

In [9]:
df.duplicated().sum()

0

### Observations
* Two columns have missing values: Temperature and Precipitation in millimeters.
* The column names are long and should be renamed.
* There are no duplicate rows.
* Some columns are not useful for modelling, e.g. Order No and User Id.
* The time columns are stored as strings. These columns need to be converted to datetime objects before they can be manipulated.
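
The renaming suggested in the observations could be done with a small helper. This is a minimal sketch and not part of the original notebook: the function name `shorten_column_name` and the snake_case convention are assumptions chosen for illustration.

```python
import re

import pandas as pd


def shorten_column_name(name: str) -> str:
    """Turn a long column label into a short snake_case identifier (illustrative convention)."""
    name = re.sub(r"\(.*?\)", "", name)          # drop bracketed notes such as "(Mo = 1)"
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)   # collapse spaces and punctuation into "_"
    return name.strip("_").lower()


# Toy frame carrying a few of the dataset's column labels
example = pd.DataFrame(columns=[
    "Placement - Weekday (Mo = 1)",
    "Arrival at Pickup - Time",
    "Distance (KM)",
])
renamed = example.rename(columns=shorten_column_name)
print(list(renamed.columns))
```

The cells that follow refer to the original labels, so applying a rename like this would mean updating those references as well.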

In [10]:
# Dealing with the missing data.
# The precipitation column has only 552 non-null values out of 21,201, so it is not worth salvaging.
# Missing temperature values can be filled with the column mean.
# Drop the unnecessary identifier columns.

df.drop('Precipitation in millimeters', axis=1, inplace=True)
df['Temperature'] = df['Temperature'].fillna(df['Temperature'].mean())
df.drop(['Order No', 'User Id', 'Rider Id'], axis=1, inplace=True)
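
The decision to drop one column and impute the other rests on how much of each is missing. Going by the non-null counts reported by `df.describe()` (552 and 16,835 out of 21,201 rows), precipitation is about 97% missing and temperature about 21%. The sketch below shows one way to compute such shares; the helper name `missing_share` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)


# Toy data standing in for the two incomplete columns
toy = pd.DataFrame({
    "Temperature": [20.4, 26.4, np.nan, 19.2, 15.4],
    "Precipitation in millimeters": [np.nan, np.nan, np.nan, np.nan, 2.9],
})
shares = missing_share(toy)
print(shares)

# Shares implied by the non-null counts reported by df.describe()
total_rows = 21201
print(1 - 552 / total_rows)     # precipitation
print(1 - 16835 / total_rows)   # temperature
```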

In [11]:
# Check if missing values have been fixed

df.describe()

Unnamed: 0,Platform Type,Placement - Day of Month,Placement - Weekday (Mo = 1),Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Pickup - Day of Month,Pickup - Weekday (Mo = 1),Arrival at Destination - Day of Month,Arrival at Destination - Weekday (Mo = 1),Distance (KM),Temperature,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Time from Pickup to Arrival
count,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0,21201.0
mean,2.752182,15.653696,3.240083,15.653837,3.240225,15.653837,3.240225,15.653837,3.240225,15.653837,3.240225,9.506533,23.258889,-1.28147,36.811264,-1.282581,36.81122,1556.920947
std,0.625178,8.798916,1.567295,8.798886,1.567228,8.798886,1.567228,8.798886,1.567228,8.798886,1.567228,5.668963,3.222006,0.030507,0.037473,0.034824,0.044721,987.270788
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.2,-1.438302,36.653621,-1.430298,36.606594,1.0
25%,3.0,8.0,2.0,8.0,2.0,8.0,2.0,8.0,2.0,8.0,2.0,5.0,21.4,-1.300921,36.784605,-1.301201,36.785661,882.0
50%,3.0,15.0,3.0,15.0,3.0,15.0,3.0,15.0,3.0,15.0,3.0,8.0,23.258889,-1.279395,36.80704,-1.284382,36.808002,1369.0
75%,3.0,23.0,5.0,23.0,5.0,23.0,5.0,23.0,5.0,23.0,5.0,13.0,25.3,-1.257147,36.829741,-1.261177,36.829477,2040.0
max,4.0,31.0,7.0,31.0,7.0,31.0,7.0,31.0,7.0,31.0,7.0,49.0,32.1,-1.14717,36.991046,-1.030225,37.016779,7883.0


In [13]:
# View the data set on high level
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21201 entries, 0 to 21200
Data columns (total 25 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Vehicle Type                               21201 non-null  object 
 1   Platform Type                              21201 non-null  int64  
 2   Personal or Business                       21201 non-null  object 
 3   Placement - Day of Month                   21201 non-null  int64  
 4   Placement - Weekday (Mo = 1)               21201 non-null  int64  
 5   Placement - Time                           21201 non-null  object 
 6   Confirmation - Day of Month                21201 non-null  int64  
 7   Confirmation - Weekday (Mo = 1)            21201 non-null  int64  
 8   Confirmation - Time                        21201 non-null  object 
 9   Arrival at Pickup - Day of Month           21201 non-null  int64  
 10  Arrival at Pickup - We

In [14]:
# Check the unique values of vehicle type

df['Vehicle Type'].unique()

array(['Bike'], dtype=object)

In [15]:
# The vehicle type has only one unique value and hence it can be dropped
df.drop('Vehicle Type', axis =1 , inplace= True)

In [16]:
# Check on unique values Personal or Business

df['Personal or Business'].unique()

array(['Business', 'Personal'], dtype=object)

In [17]:
# Check the distribution of the two categories
df['Personal or Business'].value_counts()


Business    17384
Personal     3817
Name: Personal or Business, dtype: int64

In [18]:
# This column is a candidate for Label encoding

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df['Personal or Business'] = le.fit_transform(df['Personal or Business'])

In [19]:
# Check the unique categories
df['Personal or Business'].value_counts()

0    17384
1     3817
Name: Personal or Business, dtype: int64
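
`LabelEncoder` assigns codes in sorted order of the labels, which is why Business became 0 and Personal became 1. An explicit mapping makes that assignment visible in the code. This is a sketch of an alternative, not the approach the notebook uses; the dictionary name `category_codes` is an assumption.

```python
import pandas as pd

# Explicit mapping that mirrors the codes produced by the encoder
category_codes = {"Business": 0, "Personal": 1}

toy = pd.Series(["Business", "Personal", "Business"], name="Personal or Business")
encoded = toy.map(category_codes)
print(encoded.tolist())
```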

### Converting date and Time

In [20]:
# Dealing with the time features
from datetime import datetime

def time_converter(time):
    # Parse a 12-hour clock string such as '9:35:46 AM' and return it as a 24-hour 'HHMMSS' string
    in_time = datetime.strptime(time, "%I:%M:%S %p")
    out_time = datetime.strftime(in_time, "%H%M%S")
    return out_time

time_cols = ['Placement - Time', 'Confirmation - Time', 'Arrival at Pickup - Time', 'Pickup - Time', 'Arrival at Destination - Time']

for col in time_cols:
  df[col] = df[col].apply(time_converter)
  print('Success converting', col)

Success converting Placement - Time
Success converting Confirmation - Time
Success converting Arrival at Pickup - Time
Success converting Pickup - Time
Success converting Arrival at Destination - Time
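
One property of the `HHMMSS` encoding is that it is not evenly spaced: 09:59:59 becomes 95959 and 10:00:00 becomes 100000, a jump of 4041 for a single second. A possible alternative, sketched here under the assumption that an evenly spaced scale is preferable, is to store seconds since midnight. The helper name `seconds_since_midnight` is hypothetical and does not appear in the notebook.

```python
from datetime import datetime


def seconds_since_midnight(time_string: str) -> int:
    """Convert a 12-hour clock string such as '9:35:46 AM' to seconds after midnight."""
    parsed = datetime.strptime(time_string, "%I:%M:%S %p")
    return parsed.hour * 3600 + parsed.minute * 60 + parsed.second


print(seconds_since_midnight("9:35:46 AM"))
# Differences between two converted times are now real durations in seconds
print(seconds_since_midnight("10:39:55 AM") - seconds_since_midnight("9:35:46 AM"))
```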


In [21]:
# Check the resultant Data Frame
df.head()

Unnamed: 0,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),...,Arrival at Destination - Day of Month,Arrival at Destination - Weekday (Mo = 1),Arrival at Destination - Time,Distance (KM),Temperature,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Time from Pickup to Arrival
0,3,0,9,5,93546,9,5,94010,9,5,...,9,5,103955,4,20.4,-1.317755,36.83037,-1.300406,36.829741,745
1,3,1,12,5,111616,12,5,112321,12,5,...,12,5,121722,16,26.4,-1.351453,36.899315,-1.295004,36.814358,1993
2,3,0,30,2,123925,30,2,124244,30,2,...,30,2,130038,3,23.258889,-1.308284,36.843419,-1.300921,36.828195,455
3,3,0,15,5,92534,15,5,92605,15,5,...,15,5,100527,9,19.2,-1.281301,36.832396,-1.257147,36.795063,1341
4,1,1,13,1,95518,13,1,95618,13,1,...,13,1,102537,9,15.4,-1.266597,36.792118,-1.295041,36.809817,1214


In [22]:
# Check the high level DF
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21201 entries, 0 to 21200
Data columns (total 24 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Platform Type                              21201 non-null  int64  
 1   Personal or Business                       21201 non-null  int32  
 2   Placement - Day of Month                   21201 non-null  int64  
 3   Placement - Weekday (Mo = 1)               21201 non-null  int64  
 4   Placement - Time                           21201 non-null  object 
 5   Confirmation - Day of Month                21201 non-null  int64  
 6   Confirmation - Weekday (Mo = 1)            21201 non-null  int64  
 7   Confirmation - Time                        21201 non-null  object 
 8   Arrival at Pickup - Day of Month           21201 non-null  int64  
 9   Arrival at Pickup - Weekday (Mo = 1)       21201 non-null  int64  
 10  Arrival at Pickup - Ti

In [25]:
# Convert the Time columns to Integers

for val in time_cols:
    df[val] = df[val].astype(str).astype(int)

In [26]:
# Check the final Data Frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21201 entries, 0 to 21200
Data columns (total 24 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Platform Type                              21201 non-null  int64  
 1   Personal or Business                       21201 non-null  int32  
 2   Placement - Day of Month                   21201 non-null  int64  
 3   Placement - Weekday (Mo = 1)               21201 non-null  int64  
 4   Placement - Time                           21201 non-null  int32  
 5   Confirmation - Day of Month                21201 non-null  int64  
 6   Confirmation - Weekday (Mo = 1)            21201 non-null  int64  
 7   Confirmation - Time                        21201 non-null  int32  
 8   Arrival at Pickup - Day of Month           21201 non-null  int64  
 9   Arrival at Pickup - Weekday (Mo = 1)       21201 non-null  int64  
 10  Arrival at Pickup - Ti

## Data Modelling

In [27]:
# Separate the features from the target

X = df.drop('Time from Pickup to Arrival', axis= 1)
y = df['Time from Pickup to Arrival']
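
One point worth checking at this step: `X` still contains the three Arrival at Destination columns, which are only known once a delivery has finished. Since the target is the time from pickup to arrival, a model can combine the arrival time with Pickup - Time to recover it. The sketch below shows one way to exclude such columns. The helper name `drop_post_delivery_columns` and the prefix match are assumptions, and the notebook itself keeps these columns.

```python
import pandas as pd

TARGET = "Time from Pickup to Arrival"
LEAKY_PREFIX = "Arrival at Destination"  # assumed marker for columns recorded after delivery


def drop_post_delivery_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the features without the target and without post-delivery columns."""
    leaky = [col for col in frame.columns if col.startswith(LEAKY_PREFIX)]
    return frame.drop(columns=leaky + [TARGET])


# Toy row with illustrative values
toy = pd.DataFrame({
    "Pickup - Time": [102730],
    "Arrival at Destination - Time": [103955],
    "Distance (KM)": [4],
    TARGET: [745],
})
features = drop_post_delivery_columns(toy)
print(list(features.columns))
```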

In [28]:
# Split the Data Set
    
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2, random_state= 42)

In [29]:
# Import the Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Initialize the model

base_model = RandomForestRegressor(random_state=42)

# Fit the Data

base_model.fit(X_train,y_train)

# Predict
base_predictor = base_model.predict(X_test)

print('RMSE', np.sqrt(mean_squared_error(y_test, base_predictor)))

RMSE 500.4618540959815
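
The target is in seconds, so an RMSE of about 500 corresponds to roughly 8.3 minutes. A useful point of comparison is a naive baseline that always predicts the training mean; its RMSE is roughly the standard deviation of the target, which `df.describe()` reports as 987.27. The sketch below uses made-up numbers, and the helper name `rmse` is an assumption rather than something defined in the notebook.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the target (seconds here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up delivery times in seconds, for illustration only
y_train_toy = np.array([600, 900, 1500, 2100])
y_test_toy = np.array([800, 1600])

# Naive baseline: predict the training mean for every order
baseline_prediction = np.full(len(y_test_toy), y_train_toy.mean())
print(rmse(y_test_toy, baseline_prediction))
```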


In [30]:
# Model improvement with feature scaling

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler().fit(X_train)
X_train_normed = scaler.transform(X_train)
X_test_normed = scaler.transform(X_test)

In [31]:
# Fit the normalized model
normalised_model = RandomForestRegressor(random_state=42)
normalised_model.fit(X_train_normed,y_train)
normalised_predictor = normalised_model.predict(X_test_normed)

print('RMSE', np.sqrt(mean_squared_error(y_test, normalised_predictor)))

RMSE 500.96544990245377


In [32]:
# Normalization slightly worsened the RMSE. Let's try standardization instead.

from sklearn.preprocessing import StandardScaler

standard = StandardScaler().fit(X_train)
X_train_standardized = standard.transform(X_train)
X_test_standardized = standard.transform(X_test)

# Fit the model

standardized_model = RandomForestRegressor(random_state=42)
standardized_model.fit(X_train_standardized, y_train)
standardized_predictor = standardized_model.predict(X_test_standardized)

print('RMSE', np.sqrt(mean_squared_error(y_test, standardized_predictor)))

RMSE 500.9073711988233


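* Standardization (RMSE 500.91) is marginally better than min-max normalization (500.97), but both are slightly worse than the unscaled baseline (500.46). Random forests split on feature thresholds, so rescaling is expected to have little effect. Time to try the wrapper methods.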

### Step Forward Feature Selection

In [33]:
# Split the data
X = df.drop('Time from Pickup to Arrival', axis=1)
y = df['Time from Pickup to Arrival']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2, random_state= 42)

# Normalize the Data

scaler = MinMaxScaler().fit(X_train)
X_train_normed = scaler.transform(X_train)
X_test_normed = scaler.transform(X_test)

In [37]:
  
from mlxtend.feature_selection import SequentialFeatureSelector
    
# Instantiate the model and the feature selector

sequential_regressor = RandomForestRegressor(random_state=42)

sf_feature_selector = SequentialFeatureSelector(sequential_regressor,
                                               k_features = 4,
                                               forward = True,
                                               verbose = 1,
                                               scoring = 'r2',
                                               cv=4)
sf_feature_selector = sf_feature_selector.fit(X_train_normed,y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  23 out of  23 | elapsed:  1.8min finished
Features: 1/4[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:  1.7min finished
Features: 2/4[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:  1.9min finished
Features: 3/4[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  2.3min finished
Features: 4/4

In [52]:
sf_feat_cols = list(sf_feature_selector.k_feature_idx_)
print(sf_feat_cols)

[0, 1, 3, 17]
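
The scaler returns NumPy arrays, so the selector reports column positions rather than names. The positions can be mapped back to labels through the column index of the original feature frame. This is a minimal sketch; the helper name `selected_feature_names` and the toy column index are assumptions.

```python
import pandas as pd


def selected_feature_names(columns: pd.Index, selected_idx) -> list:
    """Look up the column labels for the positions returned by a feature selector."""
    return [columns[i] for i in selected_idx]


# Toy column index; in the notebook this would be X.columns
toy_columns = pd.Index(["Platform Type", "Personal or Business", "Distance (KM)"])
print(selected_feature_names(toy_columns, (0, 2)))
```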


In [39]:
# Modelling with the selected features

regressor_with_selected_forward_features  = RandomForestRegressor(random_state=42)
regressor_with_selected_forward_features.fit(X_train_normed[:,sf_feat_cols],y_train)

#Predictions

selected_forward_features_predictor = regressor_with_selected_forward_features.predict(X_test_normed[:,sf_feat_cols])
print('RMSE with Selective Forward Feature Selection', np.sqrt(mean_squared_error(y_test, selected_forward_features_predictor)))

RMSE with Selective Forward Feature Selection 809.288022341077


* Observations: With only 4 features the RMSE rose to 809.29, far worse than the baseline. Selecting more features would likely improve performance, but this was not done in this project.

### Recursive Feature Elimination

In [42]:
# split data
X = df.drop(['Time from Pickup to Arrival'], axis=1)
y = df['Time from Pickup to Arrival']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# normalization
norm = MinMaxScaler().fit(X_train) 
X_train_normed = norm.transform(X_train) 
X_test_normed = norm.transform(X_test)

In [43]:
# Instantiate the regressor and wrap it in the RFE object

from sklearn.feature_selection import RFE

Recursive_Regressor = RandomForestRegressor(random_state=42)
Recursive_Regressor = RFE(Recursive_Regressor, n_features_to_select=20, step=1)

Recursive_Regressor.fit(X_train_normed, y_train)

RFE_predictor = Recursive_Regressor.predict(X_test_normed)

print('RMSE with Recursive Feature Elimination', np.sqrt(mean_squared_error(y_test, RFE_predictor)))

RMSE with Recursive Feature Elimination 500.6372774459313
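
After fitting, an `RFE` object exposes which features survived through its `support_` and `ranking_` attributes. The sketch below illustrates this on synthetic data, with a small forest to keep it fast. The data, the variable names, and the choice of two features are assumptions for illustration and do not reproduce the notebook's run.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic data: only the first two of five features influence the target
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 5))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

selector = RFE(
    RandomForestRegressor(n_estimators=20, random_state=42),
    n_features_to_select=2,
    step=1,
)
selector.fit(X_toy, y_toy)

print(selector.support_)   # boolean mask of the features that were kept
print(selector.ranking_)   # 1 for kept features; larger numbers were eliminated earlier
```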


**Observations:** The RMSE of 500.64 is close to the baseline (500.46) and far better than forward selection with 4 features (809.29). Time to try out Principal Component Analysis.

In [45]:
#prepare the data
X = df.drop(['Time from Pickup to Arrival'], axis=1)
y = df['Time from Pickup to Arrival']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# normalization
norm = MinMaxScaler().fit(X_train) 
X_train_normed = norm.transform(X_train) 
X_test_normed = norm.transform(X_test)

from sklearn.decomposition import PCA
pca = PCA()
X_train_PCA = pca.fit_transform(X_train_normed)
X_test_PCA = pca.transform(X_test_normed)

# Fit the model

regressor_with_pca = RandomForestRegressor(random_state=42)
regressor_with_pca.fit(X_train_PCA, y_train)

pca_y_pred = regressor_with_pca.predict(X_test_PCA)

# Finally, evaluate our model  
print('RMSE with PCA:', np.sqrt(mean_squared_error(y_test, pca_y_pred)))

RMSE with PCA: 360.65155292646614


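* Massive improvement: the RMSE drops to 360.65. Note that PCA() keeps all components here, so the gain comes from rotating the feature space rather than from reducing its dimensionality.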
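
If fewer components were wanted, the cumulative explained variance ratio is a common guide for choosing how many to keep. The sketch below runs on synthetic data. The 95% threshold and all variable names are assumptions, and the notebook itself keeps every component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 4 features, the last one nearly duplicates the first
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
X_toy = np.column_stack([X_toy, X_toy[:, 0] + 0.01 * rng.normal(size=300)])

pca = PCA().fit(X_toy)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)

# Smallest number of components that explains at least 95% of the variance
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components_95)
```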

### Feature Transformation with LDA

In [46]:
#prepare the data
X = df.drop(['Time from Pickup to Arrival'], axis=1)
y = df['Time from Pickup to Arrival']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# normalization
norm = MinMaxScaler().fit(X_train) 
X_train_normed = norm.transform(X_train) 
X_test_normed = norm.transform(X_test)

In [47]:
# Instantiate the models
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA is a supervised classifier, so it treats each distinct delivery time as its own class
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(X_train_normed, y_train)
X_test_lda = lda.transform(X_test_normed)

regressor_with_lda = RandomForestRegressor(random_state=42)
regressor_with_lda.fit(X_train_lda, y_train)
y_pred = regressor_with_lda.predict(X_test_lda)

print('Random Forest:', np.sqrt(mean_squared_error(y_test, y_pred)))

Random Forest: 358.05387837233695


* Observations: Slight improvement to 358.05.

## Feature construction

In [48]:
# Create a new feature: speed in km/h = distance (km) / time (hours)
# Note: the time used here is the target, Time from Pickup to Arrival

df['speed'] = df['Distance (KM)'] / (df['Time from Pickup to Arrival'] / 3600)
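
Because `speed` is derived from the target, it cannot be computed for an order that has not been delivered yet. A constructed feature that relies only on information available at pickup is the straight-line distance between the pickup and destination coordinates. The sketch below uses the haversine formula; the function name `haversine_km` and the Earth radius constant are assumptions, and this feature is not part of the original notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Coordinates of the first order shown in df.head()
print(haversine_km(-1.317755, 36.83037, -1.300406, 36.829741))
```

The function also accepts pandas Series, so it could be applied to the four coordinate columns in a single call.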

In [49]:
# Review the df
df.head()

Unnamed: 0,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),...,Arrival at Destination - Weekday (Mo = 1),Arrival at Destination - Time,Distance (KM),Temperature,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Time from Pickup to Arrival,speed
0,3,0,9,5,93546,9,5,94010,9,5,...,5,103955,4,20.4,-1.317755,36.83037,-1.300406,36.829741,745,19.328859
1,3,1,12,5,111616,12,5,112321,12,5,...,5,121722,16,26.4,-1.351453,36.899315,-1.295004,36.814358,1993,28.901154
2,3,0,30,2,123925,30,2,124244,30,2,...,2,130038,3,23.258889,-1.308284,36.843419,-1.300921,36.828195,455,23.736264
3,3,0,15,5,92534,15,5,92605,15,5,...,5,100527,9,19.2,-1.281301,36.832396,-1.257147,36.795063,1341,24.161074
4,1,1,13,1,95518,13,1,95618,13,1,...,1,102537,9,15.4,-1.266597,36.792118,-1.295041,36.809817,1214,26.688633


In [50]:
# Splitting the data
X =  df.drop(['Time from Pickup to Arrival'], axis=1)
y =  df['Time from Pickup to Arrival']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# normalization
norm = MinMaxScaler().fit(X_train) 
X_train_normed = norm.transform(X_train) 
X_test_normed = norm.transform(X_test)

In [51]:
#Fitting the model
second_base_model = RandomForestRegressor(random_state=42)
second_base_model.fit(X_train_normed,y_train)
new_predictor = second_base_model.predict(X_test_normed)

print('RMSE:', np.sqrt(mean_squared_error(y_test, new_predictor)))

RMSE: 78.20192010411137


## Conclusion 
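* The lowest RMSE (78.20) was obtained after constructing the speed feature and normalizing the data. Because speed is computed from the target, Time from Pickup to Arrival, this score is unlikely to carry over to new orders whose delivery time is not yet known.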

**Challenge the Solution**

a) Did we have the right question? Yes

b) Did we have the right data? Yes

c) What can be done to improve the solution?

* Tune the hyperparameters.
* Use more features for the wrapper methods. This needs more computing power than was available.
* Handle any outliers.
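
As one concrete way to handle outliers, the last rows of the data include an order that covers 3 km in 9 seconds, an implied speed of 1,200 km/h. Such rows could be removed from the training data with a plausibility filter. The sketch below is an assumption-laden illustration: the helper name `drop_implausible_orders` and the 100 km/h threshold are hypothetical. Because the filter uses the target, it applies only to training rows.

```python
import pandas as pd

MAX_PLAUSIBLE_SPEED_KMH = 100.0  # assumed threshold, chosen only for illustration


def drop_implausible_orders(frame: pd.DataFrame) -> pd.DataFrame:
    """Remove training rows whose implied average speed exceeds the threshold."""
    hours = frame["Time from Pickup to Arrival"] / 3600
    implied_speed = frame["Distance (KM)"] / hours
    return frame[implied_speed <= MAX_PLAUSIBLE_SPEED_KMH]


# Two orders taken from the df.head() and df.tail() outputs
toy = pd.DataFrame({
    "Distance (KM)": [4, 3],
    "Time from Pickup to Arrival": [745, 9],
})
cleaned = drop_implausible_orders(toy)
print(cleaned)
```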