# Week4_Feature_Engineering Project

## Defining the Question

## a) Specifying the Data Analysis Question

Sendy has hired you to help predict the estimated time of delivery of orders, from the point of driver pickup to the point of arrival at the final destination. Build a model that predicts an accurate delivery time, from picking up a package to arriving at the final destination.

## b) Defining the Metric for Success

The metric we will use to evaluate our model is the root mean squared error (RMSE).
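As a quick illustration of the metric, the sketch below computes RMSE by hand with NumPy. The helper name `rmse` and the sample values are hypothetical and chosen only for illustration; the target in this dataset is measured in seconds, so the RMSE is in seconds as well.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# hypothetical delivery times in seconds, for illustration only
actual = [745, 1993, 455]
predicted = [800, 1900, 500]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few very bad predictions raise RMSE more than many small misses do.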

## c) Understanding the Context

Logistics in Sub-Saharan Africa increases the cost of manufactured goods by up to 320%, while in Europe it accounts for only up to 90% of the manufacturing cost. Sendy is a business-to-business platform established in 2014 to enable businesses of all types and sizes to transport goods more efficiently across East Africa. The company is headquartered in Kenya with a team of more than 100 staff, focused on building practical solutions for Africa’s dynamic transportation needs, from developing apps and web solutions to providing dedicated support for goods on the move.

Sendy has hired you to help predict the estimated time of delivery of orders, from the point of driver pickup to the point of arrival at the final destination. Build a model that predicts an accurate delivery time, from picking up a package to arriving at the final destination. An accurate arrival time prediction will help all businesses improve their logistics and communicate accurate delivery times to their customers. You will be required to perform various feature engineering techniques while preparing your data for further analysis.

## d) Recording the Experimental Design

Defining the Research Question

Data Importation

Data Exploration

Data Cleaning

Data Analysis

Data Preparation

Data Modeling

Model Evaluation

Challenging your Solution

Recommendations / Conclusion

## e) Data Relevance

The data provided was relevant to answering the research question.

## 1. Data Cleaning & Preparation

In [1]:
# loading libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
%matplotlib inline

In [2]:
# loading and previewing dataset
df = pd.read_csv('https://bit.ly/3deaKEM')
df.sample(3)

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Arrival at Pickup - Time,Pickup - Day of Month,Pickup - Weekday (Mo = 1),Pickup - Time,Arrival at Destination - Day of Month,Arrival at Destination - Weekday (Mo = 1),Arrival at Destination - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival
3185,Order_No_24597,User_Id_1133,Bike,3,Business,27,1,12:11:27 PM,27,1,12:11:40 PM,27,1,12:20:34 PM,27,1,12:27:50 PM,27,1,12:48:47 PM,11,23.8,,-1.333275,36.870815,-1.277847,36.818008,Rider_Id_648,1257
4287,Order_No_22975,User_Id_1363,Bike,3,Business,29,3,9:52:08 AM,29,3,9:52:25 AM,29,3,9:52:40 AM,29,3,10:19:05 AM,29,3,10:19:09 AM,10,18.0,,-1.300406,36.829741,-1.272929,36.767908,Rider_Id_515,4
9773,Order_No_10623,User_Id_1075,Bike,3,Business,31,1,3:53:09 PM,31,1,4:43:42 PM,31,1,4:54:31 PM,31,1,5:02:59 PM,31,1,5:17:01 PM,1,22.5,,-1.26496,36.798178,-1.252796,36.800313,Rider_Id_216,842


In [3]:
# loading glossary
glossary = pd.read_csv('https://bit.ly/30O3xsr', header = None)
glossary.head(5)

Unnamed: 0,0,1
0,Order No,Unique number identifying the order
1,User Id,Unique number identifying the customer on a platform
2,Vehicle Type,"For this competition limited to bikes, however in practice Sendy service extends to trucks and vans"
3,Platform Type,"Platform used to place the order, there are 4 types"
4,Personal or Business,Customer type


In [4]:
# checking dataset shape
df.shape

(21201, 29)

In [5]:
# checking data types
df.dtypes

Order No                                      object
User Id                                       object
Vehicle Type                                  object
Platform Type                                  int64
Personal or Business                          object
Placement - Day of Month                       int64
Placement - Weekday (Mo = 1)                   int64
Placement - Time                              object
Confirmation - Day of Month                    int64
Confirmation - Weekday (Mo = 1)                int64
Confirmation - Time                           object
Arrival at Pickup - Day of Month               int64
Arrival at Pickup - Weekday (Mo = 1)           int64
Arrival at Pickup - Time                      object
Pickup - Day of Month                          int64
Pickup - Weekday (Mo = 1)                      int64
Pickup - Time                                 object
Arrival at Destination - Day of Month          int64
Arrival at Destination - Weekday (Mo = 1)     

In [6]:
#convert 'Personal or Business' column into numerical

df['Personal or Business'] = df['Personal or Business'].replace({'Business' : 1, 'Personal' : 0})

In [7]:
#obtain numeric value from rider id
df[['Rider','id','Rider Id']]=df['Rider Id'].str.split('_', expand=True)
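The split above also creates two helper columns (`Rider` and `id`) that are never used again. A minimal alternative sketch, assuming every identifier follows the `Rider_Id_<number>` pattern seen in the preview, keeps only the numeric suffix. The `riders` frame is a hypothetical stand-in for `df`.

```python
import pandas as pd

# hypothetical sample of rider identifiers in the dataset's format
riders = pd.DataFrame({'Rider Id': ['Rider_Id_648', 'Rider_Id_515', 'Rider_Id_216']})

# keep only the text after the last underscore and cast it to int
riders['Rider Id'] = riders['Rider Id'].str.rsplit('_', n=1).str[-1].astype(int)
print(riders['Rider Id'].tolist())
```

Note that a numeric rider id is still an identifier, not a quantity, so models may read an ordering into it that does not exist.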

In [8]:
df.isnull().sum()

Order No                                         0
User Id                                          0
Vehicle Type                                     0
Platform Type                                    0
Personal or Business                             0
Placement - Day of Month                         0
Placement - Weekday (Mo = 1)                     0
Placement - Time                                 0
Confirmation - Day of Month                      0
Confirmation - Weekday (Mo = 1)                  0
Confirmation - Time                              0
Arrival at Pickup - Day of Month                 0
Arrival at Pickup - Weekday (Mo = 1)             0
Arrival at Pickup - Time                         0
Pickup - Day of Month                            0
Pickup - Weekday (Mo = 1)                        0
Pickup - Time                                    0
Arrival at Destination - Day of Month            0
Arrival at Destination - Weekday (Mo = 1)        0
Arrival at Destination - Time  

In [9]:
#replacing null values in temperature with mode

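# assigning the result back avoids the chained in-place fill, which newer pandas versions warn about
df['Temperature'] = df['Temperature'].fillna(df['Temperature'].mode()[0])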
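For reference, here is a small sketch of mode imputation on a toy series, with the median shown alongside for comparison. The values are hypothetical; on a continuous variable such as temperature, the mode and the median can differ, so it may be worth checking both before choosing one.

```python
import numpy as np
import pandas as pd

# hypothetical temperature readings with two missing values
temps = pd.Series([23.8, np.nan, 18.0, 23.8, np.nan, 22.5], name='Temperature')

mode_value = temps.mode()[0]      # most frequent non-null value
median_value = temps.median()     # middle of the non-null values

filled = temps.fillna(mode_value)
print(mode_value, median_value, filled.isna().sum())
```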

In [11]:
#selecting useful columns
df=df[['Platform Type', 'Personal or Business','Placement - Day of Month','Placement - Weekday (Mo = 1)','Confirmation - Day of Month', 'Confirmation - Weekday (Mo = 1)', 'Arrival at Pickup - Day of Month','Arrival at Pickup - Weekday (Mo = 1)', 'Pickup - Day of Month', 'Pickup - Weekday (Mo = 1)', 'Arrival at Destination - Day of Month','Arrival at Destination - Weekday (Mo = 1)','Distance (KM)','Temperature','Rider Id', 'Time from Pickup to Arrival']].copy()  # .copy() so later column assignments act on an independent frame

In [12]:
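# getting the records with outliers (IQR rule, applied to the numeric columns only)
numeric_df = df.select_dtypes(include='number')
q1 = numeric_df.quantile(0.25)
q3 = numeric_df.quantile(0.75)
iqr = q3 - q1

# a record is flagged if any numeric column falls outside [q1 - 1.5*iqr, q3 + 1.5*iqr]
outlier_mask = ((numeric_df < (q1 - 1.5 * iqr)) | (numeric_df > (q3 + 1.5 * iqr))).any(axis = 1)
outliers_df = df[outlier_mask]
print(outliers_df.shape)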

(5170, 16)
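The cell above only counts the flagged records; the dataframe itself keeps all 21201 rows. The sketch below shows the same IQR rule on a hypothetical toy frame, including how the flagged rows could be separated from the rest if removal were wanted. Whether to drop them is a modeling decision this notebook does not make.

```python
import pandas as pd

# hypothetical toy data; the last distance is deliberately extreme
toy = pd.DataFrame({
    'Distance (KM)': [4, 16, 3, 9, 9, 60],
    'Time from Pickup to Arrival': [745, 1993, 455, 1341, 1214, 1500],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# True for rows where at least one column is outside its fences
mask = ((toy < lower) | (toy > upper)).any(axis=1)
outliers = toy[mask]
cleaned = toy[~mask]
print(outliers.shape, cleaned.shape)
```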


In [13]:
df.shape

(21201, 16)

In [14]:
#changing object columns into int
df['Personal or Business']= df['Personal or Business'].astype(int)
df['Rider Id'] = df['Rider Id'].astype(int)

In [15]:
#previewing df

df.head()

Unnamed: 0,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Pickup - Day of Month,Pickup - Weekday (Mo = 1),Arrival at Destination - Day of Month,Arrival at Destination - Weekday (Mo = 1),Distance (KM),Temperature,Rider Id,Time from Pickup to Arrival
0,3,1,9,5,9,5,9,5,9,5,9,5,4,20.4,432,745
1,3,0,12,5,12,5,12,5,12,5,12,5,16,26.4,856,1993
2,3,1,30,2,30,2,30,2,30,2,30,2,3,24.7,155,455
3,3,1,15,5,15,5,15,5,15,5,15,5,9,19.2,855,1341
4,1,0,13,1,13,1,13,1,13,1,13,1,9,15.4,770,1214


## 2. Analysis

In [16]:
# average delivery time per Customer Type
df.groupby('Personal or Business')['Time from Pickup to Arrival'].mean().sort_values(ascending = False)

Personal or Business
0    1585.056327
1    1550.743270
Name: Time from Pickup to Arrival, dtype: float64

In [17]:
#Data Modeling

import warnings
warnings.filterwarnings("ignore")

# importing utility modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression



# getting target  from the dataframe
y = df["Time from Pickup to Arrival"]

# getting features from the dataframe
X = df.drop(columns=["Time from Pickup to Arrival"])

# splitting the data into training and validation sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20)

# initializing all the model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()

# training all the model on the training dataset
model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)
model_3.fit(X_train, y_train)

# predicting the output on the validation dataset
pred_1 = model_1.predict(X_test)
pred_2 = model_2.predict(X_test)
pred_3 = model_3.predict(X_test)

# final prediction: the average of the predictions of all 3 models
pred_final = (pred_1 + pred_2 + pred_3) / 3.0

# printing the root mean squared error between real value and predicted value
print(np.sqrt(mean_squared_error(y_test, pred_final)))

777.2410460036763
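A single ensemble score does not show whether averaging actually helps. The sketch below compares the RMSE of each model with the RMSE of their average. It runs on synthetic data from `make_regression`, not on the Sendy dataset, and it swaps XGBoost for scikit-learn's `GradientBoostingRegressor` so that the example depends on scikit-learn only; all names in it are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic regression data standing in for the real features and target
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

models = {
    'linear': LinearRegression(),
    'boosting': GradientBoostingRegressor(random_state=0),
    'forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

# fit each model and keep its validation predictions
preds = {name: model.fit(X_tr, y_tr).predict(X_val) for name, model in models.items()}

# RMSE per model, then RMSE of the simple average of all predictions
scores = {name: float(np.sqrt(mean_squared_error(y_val, p))) for name, p in preds.items()}
ensemble_pred = np.mean(list(preds.values()), axis=0)
scores['ensemble'] = float(np.sqrt(mean_squared_error(y_val, ensemble_pred)))

for name, score in scores.items():
    print(f'{name}: {score:.2f}')
```

Fixing `random_state` in the split and in the models makes the comparison repeatable from run to run.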


## 3. Feature Engineering

In [18]:
# dropping duplicates, if any
df.drop_duplicates(inplace = True)
df.shape

(21201, 16)

In [19]:
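# creating new feature, speed (km per second)
# note: speed is computed from the target, so it is only known after delivery
# and leaks the target if it is kept as a model input
df['speed'] = df['Distance (KM)'] / df['Time from Pickup to Arrival']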

In [20]:
df.head()

Unnamed: 0,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Pickup - Day of Month,Pickup - Weekday (Mo = 1),Arrival at Destination - Day of Month,Arrival at Destination - Weekday (Mo = 1),Distance (KM),Temperature,Rider Id,Time from Pickup to Arrival,speed
0,3,1,9,5,9,5,9,5,9,5,9,5,4,20.4,432,745,0.005369
1,3,0,12,5,12,5,12,5,12,5,12,5,16,26.4,856,1993,0.008028
2,3,1,30,2,30,2,30,2,30,2,30,2,3,24.7,155,455,0.006593
3,3,1,15,5,15,5,15,5,15,5,15,5,9,19.2,855,1341,0.006711
4,1,0,13,1,13,1,13,1,13,1,13,1,9,15.4,770,1214,0.007414
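One caution about this feature: since `speed` is defined as distance divided by delivery time, the target can be recovered exactly from `Distance (KM)` and `speed`. The sketch below demonstrates this with the first three rows from the preview above; the `demo` frame is only an illustration.

```python
import pandas as pd

# first three rows of the preview above
demo = pd.DataFrame({
    'Distance (KM)': [4, 16, 3],
    'Time from Pickup to Arrival': [745, 1993, 455],
})
demo['speed'] = demo['Distance (KM)'] / demo['Time from Pickup to Arrival']

# dividing distance by speed gives back the target
recovered = demo['Distance (KM)'] / demo['speed']
print(recovered.round(6).tolist())
```

A model given both columns can therefore score well without learning anything that would be available at prediction time.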


In [21]:
df.isnull().sum()

Platform Type                                0
Personal or Business                         0
Placement - Day of Month                     0
Placement - Weekday (Mo = 1)                 0
Confirmation - Day of Month                  0
Confirmation - Weekday (Mo = 1)              0
Arrival at Pickup - Day of Month             0
Arrival at Pickup - Weekday (Mo = 1)         0
Pickup - Day of Month                        0
Pickup - Weekday (Mo = 1)                    0
Arrival at Destination - Day of Month        0
Arrival at Destination - Weekday (Mo = 1)    0
Distance (KM)                                0
Temperature                                  0
Rider Id                                     0
Time from Pickup to Arrival                  0
speed                                        0
dtype: int64

In [22]:
# getting target  from the dataframe
y = df["Time from Pickup to Arrival"]

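# getting features from the dataframe; 'speed' is dropped as well because it is
# computed from the target and would leak it into the features
X = df.drop(columns=["Time from Pickup to Arrival", "speed"])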

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20)

In [23]:
# scaling our features
from sklearn.preprocessing import MinMaxScaler
norm = MinMaxScaler().fit(X_train)
X_train = norm.transform(X_train)
X_test = norm.transform(X_test)
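The scaler above is fitted on the training split only and then reused on the test split, which keeps test information out of the preprocessing. A minimal sketch with hypothetical values shows the consequence: training features land in [0, 1], while test values outside the training range can fall outside that interval.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical (distance, temperature) rows
train = np.array([[1.0, 15.4], [9.0, 20.4], [16.0, 26.4]])
test = np.array([[20.0, 18.0]])  # distance is larger than any training value

scaler = MinMaxScaler().fit(train)      # min and max are learned from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
print(test_scaled)
```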

In [24]:
#use PCA in feature transformation
from sklearn.decomposition import PCA
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
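With no arguments, `PCA()` keeps every component, so the data is rotated but not reduced. The sketch below, on synthetic data with one nearly redundant column, shows how `explained_variance_ratio_` can guide the choice of how many components to keep. The 0.95 threshold is an arbitrary assumption for illustration, not a value taken from this project.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
noise = 0.01 * rng.normal(size=200)

# the third column is almost a copy of the first, so it adds little information
X_demo = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + noise])

# cumulative share of variance explained by the first k components
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative)

# keep the smallest number of components that explains at least 95% of the variance
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_demo)
print(pca_95.n_components_)
```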

- Feature engineering steps such as cleaning the data, removing null values, and scaling and transforming features can improve the performance of a machine learning model.

- The average delivery time is almost the same for the two customer types: about 1,585 seconds for personal orders and about 1,551 seconds for business orders.

## 6. Challenging your Solution

**a) Did we have the right question?**

Yes

**b) Did we have the right data?**

Yes

**c) What can be done to improve the solution?**

- Get more data to train on
- Further hyperparameter tuning
- Perform more feature engineering
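As one possible starting point for the hyperparameter tuning mentioned above, the sketch below runs a small grid search over a random forest, scored by RMSE. It uses synthetic data and a deliberately tiny, hypothetical parameter grid; the grid values are assumptions and would need adapting to the real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# synthetic data standing in for the real features and target
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=0)

# hypothetical grid, kept small so the search runs quickly
param_grid = {'n_estimators': [25, 50], 'max_depth': [4, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring='neg_root_mean_squared_error',  # scikit-learn maximizes scores, hence the negation
    cv=3,
)
search.fit(X_demo, y_demo)

best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(search.best_params_, round(best_rmse, 2))
```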