# Introduction

## Context

Sendy, in partnership with insight2impact facility, is hosting a Zindi challenge to predict the estimated time of delivery of orders, from the point of driver pickup to the point of arrival at final destination.

The solution will help Sendy enhance customer communication and improve the reliability of its service, which will ultimately improve customer experience. In addition, it will enable Sendy to realise cost savings and reduce the cost of doing business through improved resource management and planning for order scheduling.

Sendy helps men and women behind every type of business to trade easily, deliver more competitively, and build extraordinary businesses.

## Problem Statement

Data is a critical component in helping Sendy to build more efficient, affordable and accessible solutions. Given the details of a Sendy order, can we use historic data to predict an accurate time for the arrival of the rider at the destination of a package? In this competition, we’re challenging you to build a model that predicts an accurate delivery time, from picking up a package to arriving at the final destination. An accurate arrival time prediction will help all businesses to improve their logistics and communicate an accurate time to their customers.

# Data Exploration

In [267]:
# import packages
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('ticks')
%matplotlib inline

In [199]:
# load data
train_data = pd.read_csv(r'C:\Users\tbndo\Google Drive\Projects\regression-predict-api-template\predict deliverable\data\Train.csv')
test_data = pd.read_csv(r'C:\Users\tbndo\Google Drive\Projects\regression-predict-api-template\predict deliverable\data\Test.csv')
riders_data = pd.read_csv(r'C:\Users\tbndo\Google Drive\Projects\regression-predict-api-template\predict deliverable\data\Riders.csv')

In [200]:
variable_defs = pd.read_csv(r'C:\Users\tbndo\Google Drive\Projects\regression-predict-api-template\predict deliverable\data\VariableDefinitions.csv')
variable_defs = variable_defs.drop_duplicates(subset = 'Order No').dropna(how = 'all')
variable_defs.set_index('Order No', inplace = True)
for feature in variable_defs.index:
    print('{}: '.format(feature), variable_defs.loc[feature, :].iloc[0])

User Id:  Unique number identifying the customer on a platform
Vehicle Type:  For this competition limited to bikes, however in practice Sendy service extends to trucks and vans
Platform Type:  Platform used to place the order, there are 4 types
Personal or Business:  Customer type
Placement - Day of Month:  Placement - Day of Month i.e 1-31
Placement - Weekday (Mo = 1):  Placement - Weekday (Monday = 1)
Placement - Time:  Placement - Time - Time of day the order was placed
Confirmation - Day of Month:  Confirmation - Day of Month i.e 1-31
Confirmation - Weekday (Mo = 1):  Confirmation - Weekday (Monday = 1)
Confirmation - Time:  Confirmation - Time - Time of day the order was confirmed by a rider
Arrival at Pickup - Day of Month:  Arrival at Pickup - Day of Month i.e 1-31
Arrival at Pickup - Weekday (Mo = 1):  Arrival at Pickup - Weekday (Monday = 1)
Arrival at Pickup - Time:  Time of day the the rider arrived at the location to pick up the order - as marked by the rider through the S

In [201]:
print('Train rows and columns :', train_data.shape)
print('Test rows and columns :', test_data.shape)

Train rows and columns : (21201, 29)
Test rows and columns : (7068, 25)


In [202]:
train_column_names = train_data.columns.values
test_column_names = test_data.columns.values
target_vars = set(train_column_names) - set(test_column_names)
print('Target columns')
print(target_vars)

Target columns
{'Arrival at Destination - Day of Month', 'Arrival at Destination - Weekday (Mo = 1)', 'Arrival at Destination - Time', 'Time from Pickup to Arrival'}


In [203]:
print(' Total Riders :', riders_data['Rider Id'].nunique(), 
      '\n Riders in train data :',train_data['Rider Id'].nunique(),
      '\n Riders in test data :',test_data['Rider Id'].nunique(),
      '\n Riders in train or test data (union) :', pd.Series(test_data['Rider Id'].to_list() + train_data['Rider Id'].to_list()).nunique()
     )

 Total Riders : 960 
 Riders in train data : 924 
 Riders in test data : 822 
 Riders in train or test data (union) : 952
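
The last count above concatenates the two `Rider Id` columns before calling `nunique()`, so it counts riders that appear in either dataset rather than in both. A minimal sketch of how the two quantities can be separated with set operations, using made-up rider IDs (the names `train_ids` and `test_ids` are illustrative only):

```python
import pandas as pd

# Toy rider IDs, made up for illustration only
train_ids = pd.Series(['Rider_Id_1', 'Rider_Id_2', 'Rider_Id_3', 'Rider_Id_3'])
test_ids = pd.Series(['Rider_Id_3', 'Rider_Id_4'])

train_set = set(train_ids)
test_set = set(test_ids)

riders_in_both = train_set & test_set        # intersection
riders_in_either = train_set | test_set      # union
riders_only_in_test = test_set - train_set   # never seen during training

print('In both   :', len(riders_in_both))
print('In either :', len(riders_in_either))
print('Test only :', len(riders_only_in_test))
```

By inclusion-exclusion, the counts printed by the notebook imply that 924 + 822 - 952 = 794 riders appear in both the train and test data.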


In [204]:
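# attach rider attributes to each order; 'Rider Id' is the shared key
train_data = pd.merge(
                left = train_data,
                right = riders_data,
                on = 'Rider Id',
                how = 'left'
            )

test_data = pd.merge(
                left = test_data,
                right = riders_data,
                on = 'Rider Id',
                how = 'left'
            )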
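
A left merge keeps every order, but it can silently duplicate rows if the rider table has repeated keys, and it leaves missing values for riders that are not found. The sketch below shows one way to guard against both using the `validate` and `indicator` arguments of `pd.merge`. The two small frames are invented stand-ins for the order and rider tables.

```python
import pandas as pd

# Invented stand-ins for the order and rider tables
orders = pd.DataFrame({'Order No': ['Order_1', 'Order_2', 'Order_3'],
                       'Rider Id': ['Rider_A', 'Rider_B', 'Rider_Z']})
riders = pd.DataFrame({'Rider Id': ['Rider_A', 'Rider_B'],
                       'No_Of_Orders': [100, 250]})

merged = pd.merge(
    orders,
    riders,
    on='Rider Id',
    how='left',
    validate='many_to_one',  # raises MergeError if a rider appears twice in `riders`
    indicator=True,          # adds a '_merge' column describing where each row matched
)

# Orders whose rider has no entry in the rider table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Rider Id'].tolist()
print(len(merged), unmatched)
```

If the row count after merging equals the row count before, no orders were duplicated by the join.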

In [205]:
train_data.head()

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),...,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival,No_Of_Orders,Age,Average_Rating,No_of_Ratings
0,Order_No_4211,User_Id_633,Bike,3,Business,9,5,9:35:46 AM,9,5,...,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745,1637,1309,13.8,549
1,Order_No_25375,User_Id_2285,Bike,3,Personal,12,5,11:16:16 AM,12,5,...,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993,396,339,13.6,69
2,Order_No_1899,User_Id_265,Bike,3,Business,30,2,12:39:25 PM,30,2,...,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455,1023,242,12.5,114
3,Order_No_9336,User_Id_1402,Bike,3,Business,15,5,9:25:34 AM,15,5,...,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855,1341,886,283,14.5,113
4,Order_No_27883,User_Id_1737,Bike,1,Personal,13,1,9:55:18 AM,13,1,...,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770,1214,2311,872,14.1,533


In [206]:
test_data.head()

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),...,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,No_Of_Orders,Age,Average_Rating,No_of_Ratings
0,Order_No_19248,User_Id_3355,Bike,3,Business,27,3,4:44:10 PM,27,3,...,,-1.333275,36.870815,-1.305249,36.82239,Rider_Id_192,439,1511,13.3,171
1,Order_No_12736,User_Id_3647,Bike,3,Business,17,5,12:57:35 PM,17,5,...,,-1.272639,36.794723,-1.277007,36.823907,Rider_Id_868,488,273,14.4,45
2,Order_No_768,User_Id_2154,Bike,3,Business,27,4,11:08:14 AM,27,4,...,,-1.290894,36.822971,-1.276574,36.851365,Rider_Id_26,833,460,13.1,67
3,Order_No_15332,User_Id_2910,Bike,3,Business,17,1,1:51:35 PM,17,1,...,,-1.290503,36.809646,-1.303382,36.790658,Rider_Id_685,487,560,13.7,44
4,Order_No_21373,User_Id_1205,Bike,3,Business,11,2,11:30:28 AM,11,2,...,,-1.281081,36.814423,-1.266467,36.792161,Rider_Id_858,4761,1120,14.1,1010


## Data Structures

In [207]:
def checkDataType(df, column_name):
    dtype = df[column_name].dtype
    return str(dtype)

def classifyColumnsByDataType(df):
    # identifier and target columns are excluded from both feature lists
    id_cols = ['Order No', 'User Id', 'Rider Id']
    excluded_cols = id_cols + list(target_vars)

    numeric_type = ['int16', 'int64', 'float16', 'float64']

    numeric_vars = list()
    factor_vars = list()

    for column_name in df.columns:
        if column_name in excluded_cols:
            continue
        data_type = checkDataType(df = df, column_name = column_name)
        # anything that is not one of the listed numeric dtypes is treated as a factor
        if data_type in numeric_type:
            numeric_vars.append(column_name)
        else:
            factor_vars.append(column_name)
    return numeric_vars, factor_vars

numeric_vars, factor_vars = classifyColumnsByDataType(df = train_data)

In [208]:
train_data.loc[:, factor_vars].head()

Unnamed: 0,Vehicle Type,Personal or Business,Placement - Time,Confirmation - Time,Arrival at Pickup - Time,Pickup - Time
0,Bike,Business,9:35:46 AM,9:40:10 AM,10:04:47 AM,10:27:30 AM
1,Bike,Personal,11:16:16 AM,11:23:21 AM,11:40:22 AM,11:44:09 AM
2,Bike,Business,12:39:25 PM,12:42:44 PM,12:49:34 PM,12:53:03 PM
3,Bike,Business,9:25:34 AM,9:26:05 AM,9:37:56 AM,9:43:06 AM
4,Bike,Personal,9:55:18 AM,9:56:18 AM,10:03:53 AM,10:05:23 AM


In [209]:
def convertToTime(df):
    # convert every factor column whose name contains "Time" to datetime.time
    # and move it from factor_vars into the returned time_vars list
    time_vars = list()
    for column_name in list(factor_vars):

        if column_name.find("Time") != -1:

            print('{} :'.format(column_name), 'Found')
            time_vars.append(column_name)
            factor_vars.remove(column_name)

            df[column_name] = df.loc[:, column_name].map(lambda timestring: pd.to_datetime(timestring).time())

    return df, time_vars

train_data, time_vars = convertToTime(df = train_data)
print(time_vars)
train_data.loc[:, time_vars].head()

Placement - Time : Found
Confirmation - Time : Found
Arrival at Pickup - Time : Found
Pickup - Time : Found
['Placement - Time', 'Confirmation - Time', 'Arrival at Pickup - Time', 'Pickup - Time']


Unnamed: 0,Placement - Time,Confirmation - Time,Arrival at Pickup - Time,Pickup - Time
0,09:35:46,09:40:10,10:04:47,10:27:30
1,11:16:16,11:23:21,11:40:22,11:44:09
2,12:39:25,12:42:44,12:49:34,12:53:03
3,09:25:34,09:26:05,09:37:56,09:43:06
4,09:55:18,09:56:18,10:03:53,10:05:23


In [210]:
train_data.loc[:, factor_vars].head()

Unnamed: 0,Vehicle Type,Personal or Business
0,Bike,Business
1,Bike,Personal
2,Bike,Business
3,Bike,Business
4,Bike,Personal


In [211]:
train_data.loc[:, numeric_vars].head()

Unnamed: 0,Platform Type,Placement - Day of Month,Placement - Weekday (Mo = 1),Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Pickup - Day of Month,Pickup - Weekday (Mo = 1),Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,No_Of_Orders,Age,Average_Rating,No_of_Ratings
0,3,9,5,9,5,9,5,9,5,4,20.4,,-1.317755,36.83037,-1.300406,36.829741,1637,1309,13.8,549
1,3,12,5,12,5,12,5,12,5,16,26.4,,-1.351453,36.899315,-1.295004,36.814358,396,339,13.6,69
2,3,30,2,30,2,30,2,30,2,3,,,-1.308284,36.843419,-1.300921,36.828195,1023,242,12.5,114
3,3,15,5,15,5,15,5,15,5,9,19.2,,-1.281301,36.832396,-1.257147,36.795063,886,283,14.5,113
4,1,13,1,13,1,13,1,13,1,9,15.4,,-1.266597,36.792118,-1.295041,36.809817,2311,872,14.1,533


In [212]:
train_data.loc[:, list(target_vars)].head()

Unnamed: 0,Arrival at Destination - Day of Month,Arrival at Destination - Weekday (Mo = 1),Arrival at Destination - Time,Time from Pickup to Arrival
0,9,5,10:39:55 AM,745
1,12,5,12:17:22 PM,1993
2,30,2,1:00:38 PM,455
3,15,5,10:05:27 AM,1341
4,13,1,10:25:37 AM,1214
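
The target `Time from Pickup to Arrival` appears to be the number of seconds between `Pickup - Time` and `Arrival at Destination - Time`. A minimal sketch that checks this on the first three rows shown above (the helper `to_seconds` is introduced here for illustration, and the check assumes both events fall on the same day):

```python
import pandas as pd

# Pickup and arrival times copied from the first three rows shown above
pickup = pd.Series(['10:27:30 AM', '11:44:09 AM', '12:53:03 PM'])
arrival = pd.Series(['10:39:55 AM', '12:17:22 PM', '1:00:38 PM'])

def to_seconds(time_strings):
    """Convert 12-hour clock strings to seconds since midnight."""
    parsed = pd.to_datetime(time_strings, format='%I:%M:%S %p')
    return parsed.dt.hour * 3600 + parsed.dt.minute * 60 + parsed.dt.second

elapsed = to_seconds(arrival) - to_seconds(pickup)
print(elapsed.tolist())  # [745, 1993, 455]
```

These values match the target column for the same rows, which suggests the target is measured in seconds. The arrival columns are absent from the test data, so they can only be used for checks like this one and not as model features.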


## Missing Values

In [373]:
# investigate missing values and treat them accordingly
def findMissing(df, name):
    # percentage of missing values for each column that has at least one
    missing = df.isnull().sum()
    mask = missing > 0
    perc_missing = missing[mask].div(len(df)).mul(100).round(1)
    perc_missing = pd.DataFrame({'Percentage': perc_missing})
    perc_missing['data'] = name
    return perc_missing

perc_missing = pd.concat(
    [findMissing(train_data, 'train_data'), findMissing(test_data, 'test_data')]
)
perc_missing

Unnamed: 0,Percentage,data
Temperature,20.6,train_data
Precipitation in millimeters,97.4,train_data
Temperature,20.3,test_data
Precipitation in millimeters,97.2,test_data


In [369]:
missings_vars = perc_missing.index.unique().to_list()
missings_vars

['Temperature', 'Precipitation in millimeters']
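
One possible treatment of these two columns is sketched below. It rests on two assumptions that the data does not confirm: that a missing precipitation reading means no rain was recorded, and that temperature can reasonably be filled with the median. The temperatures are the first five training rows; the single precipitation value and the `Temperature_missing` column are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Temperature': [20.4, 26.4, np.nan, 19.2, 15.4],
    # The 1.2 mm reading is made up so that the column is not entirely empty
    'Precipitation in millimeters': [np.nan, np.nan, np.nan, np.nan, 1.2],
})

# Assumption: a missing precipitation value means no rain was recorded
df['Precipitation in millimeters'] = df['Precipitation in millimeters'].fillna(0.0)

# Keep a flag for imputed rows, then fill temperature with the median
df['Temperature_missing'] = df['Temperature'].isna().astype(int)
temperature_median = df['Temperature'].median()
df['Temperature'] = df['Temperature'].fillna(temperature_median)

print(df)
```

To avoid leaking information, the median would be computed on the training data only and then reused when filling the test data.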

# Benchmark Model

In [None]:
# minimal data preprocessing (just enough to build base model)

In [None]:
# fit model and cross validate performance
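
As a rough idea of what a benchmark could look like, the sketch below compares two simple baselines with 5-fold cross-validation: always predicting the mean delivery time, and a straight-line fit on distance. It uses synthetic data and NumPy only so that it runs on its own; scikit-learn's cross-validation utilities would be the more usual choice. RMSE is assumed as the score, and `distance_km`, `time_seconds` and `cross_validate` are illustrative names rather than part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: time grows roughly linearly with distance
distance_km = rng.uniform(1, 20, size=200)
time_seconds = 300 + 120 * distance_km + rng.normal(0, 200, size=200)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def cross_validate(x, y, rng, n_folds=5):
    """Return the mean RMSE over folds for a mean-only and a linear baseline."""
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    mean_scores, linear_scores = [], []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])

        # Baseline 1: predict the training mean for every held-out order
        mean_pred = np.full(len(test_idx), y[train_idx].mean())
        mean_scores.append(rmse(y[test_idx], mean_pred))

        # Baseline 2: least-squares line fitted on the training folds only
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        linear_scores.append(rmse(y[test_idx], slope * x[test_idx] + intercept))
    return float(np.mean(mean_scores)), float(np.mean(linear_scores))

mean_rmse, linear_rmse = cross_validate(distance_km, time_seconds, rng)
print('Mean-only RMSE :', round(mean_rmse, 1))
print('Linear RMSE    :', round(linear_rmse, 1))
```

Any later model would need to beat both numbers on the real data to justify its extra complexity.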

# Exploratory Data Analysis

## Univariate Analysis

In [None]:
# plot histograms for numeric features - look out for outliers and leverage points

In [6]:
# plot histograms for categorical features - look out for too many levels that can be grouped together

In [None]:
# Further analysis of missing values (if required)

## Multivariate Analysis

In [7]:
# analyse relationships between features (correlation and factor analysis)

In [8]:
# check for collinearity and multi-collinearity
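
A first pass at the collinearity check could list feature pairs whose absolute correlation exceeds a threshold. The sketch below uses three columns from the first five training rows; the 0.9 cut-off is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Values taken from the first five rows of the training data
df = pd.DataFrame({
    'Placement - Day of Month': [9, 12, 30, 15, 13],
    'Confirmation - Day of Month': [9, 12, 30, 15, 13],
    'Distance (KM)': [4, 16, 3, 9, 9],
})

corr = df.corr().abs()

# Keep only the upper triangle so that each pair is reported once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

# Illustrative threshold; NaN entries never pass the comparison
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

In this sample the placement and confirmation days are identical, so only one of the two columns would carry information for a model. Pairwise correlation does not detect multi-collinearity among three or more features, for which variance inflation factors are a common follow-up.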

In [None]:
# analyse relationships between features and target

# Feature Engineering

In [None]:
# investigate non linearity

In [None]:
# investigate feature interactions and synergy

In [None]:
# other transformations (data mining)

# Feature Selection

In [None]:
# L1 and L2 Regularisation

In [9]:
# forward selection, backward selection and/or mixed selection

In [10]:
# wrapper methods such as genetic algorithms or Boruta

In [None]:
# Random forests and feature importance

# Model Selection 

In [None]:
# fit multiple models and do hyper-parameter tuning

In [None]:
# cross validate model performance

In [None]:
# select the best model and implement it as final solution

# Summary of Conclusions

In [None]:
# summarise findings and make recommendations