## Thejas Balenahalli Kiran
## December 06, 2021
# <center>Predicting eBay Package Delivery Date</center>
## <center>Final Project STAT 5000</center>
### Project Link: 

# ABSTRACT
### Keywords
Machine Learning, Multi-variate Linear Regression, MLP Regressor, XGBoost, Delivery Date Prediction

<div style="page-break: always;"></div>

# Project Introduction

### Acknowledgement
I would like to thank eBay for giving me the opportunity to work with a real-world dataset by hosting an ML Challenge on EvalAI.

### Problem Statement
To predict the delivery dates of packages sold on eBay by both customers and businesses.

### Recommended Hardware Specifications
1. Intel i5 Processor
2. 16GB RAM
3. 10GB Hard Disk Space

### Recommended Software Specifications
1. Python 3.7.1
2. Jupyter Notebook

### Data Source
The data for predicting eBay delivery dates was provided by eBay as part of their ML Challenge 2021, and I have used it for all of my models. Under their policy agreement I am not authorized to share the data with others, but I show snippets of it while working through the process. I will be happy to show the data in person if necessary.

# Literature Review

1. "Predicting Shipping Time with Machine Learning", a project presented at Massachusetts Institute of Technology in the year 2012 by Antonie Charles Jean Jonquais and Florian Krempl, focused on predicting delivery times of freight for Maersk (a shipping company in Denmark). Although Maersk had an in-house way of predicting delivery dates, they wanted to build a more accurate model using machine learning and predictive analytics. The authors implemented the random forest algorithm and found it to be surprisingly better than neural networks in terms of accuracy and prediction time.

2. An article written by Purdue University focuses on estimating delivery time for industrial pieces of equipment using different machine learning approaches. The approaches they implemented include Support Vector Machines, Random Forest, k-Nearest Neighbors, and XGBoost, among others. After analysis and further research, they found the Random Forest and SVM Ensemble models to be the best, with high accuracies and low Mean Squared Errors. They plan to enhance the model further by adding features that weren't collected for this phase of the research.

3. The paper written by Fan Wu et al. was presented at the Association for the Advancement of Artificial Intelligence under the title "DeepETA: A Spatial-Temporal Sequential Neural Network Model for Estimating Time of Arrival in Package Delivery System". The paper considers features that could not be included in traditional models, such as the frequency of delivery routes and multiple destinations on the same route. Using these and additional features, the authors developed a recurrent neural network model for predicting delivery dates in China.


# Understanding the dataset
The dataset given to us by eBay consists of 15,000,000 data points with each data point being explained by 19 attributes. The attributes that define each datapoint are,
1. <u>b2c_c2c</u> - It is of type string with values b2c or c2c. This indicates whether the transaction is between a business and a customer or between two customers.
2. <u>seller_id</u> - This is a unique ID given to each seller for identification. It is of type Long Int.
3. <u>declared_handling_days</u> - The number of days the seller declares it will take to hand the package to the carrier from the day of acceptance.
4. <u>acceptance_scan_timestamp</u> - The date and time when the carrier has accepted the package for the final shipment. The values in this are of type timestamp.
5. <u>shipment_method_id</u> - The integer type attribute defines the type of shipping service declared by the seller.
6. <u>shipping_fee</u> - Transportation and handling charges charged by the seller for shipping the items. All the values are in USD.
7. <u>carrier_min_estimate</u> - The minimum estimate of the number of required days by the carrier for the specified service.
8. <u>carrier_max_estimate</u> - The maximum estimate of the number of required days by the carrier for the specified service.
9. <u>item_zip</u> - The US Postal zip code of the package origin/source.
10. <u>buyer_zip</u> - The US Postal zip code of the package destination.
11. <u>category_id</u> - An integer type data attribute that categorizes the package by its type.
12. <u>item_price</u> - The price per item involved in the transaction.
13. <u>quantity</u> - Number of items involved in the transaction.
14. <u>payment_datetime</u> - A timestamp attribute that clocks in the time when the payment has been done for the particular transaction.
15. <u>delivery_date</u> - The actual delivery date of the package. This is the attribute that we need to predict using the other attributes.
16. <u>weight</u> - A scalar value that determines the sum of the weight of all quantities involved in the transaction.
17. <u>weight_units</u> - It defines the weight scalar value by telling if the measurements are in kilograms or pounds. Pounds are represented as 1 and Kilograms are denoted by 2.
18. <u>package_size</u> - A categorical value that groups packages by size.
19. <u>record_number</u> - Unique integer number given to the transaction to identify them.

After understanding the data and before implementing the models, I found some irregularities in the dataset and cleaned them.

# Importing the libraries and loading the data

In [1]:
import pandas as pd
import numpy as np
from uszipcode import SearchEngine
from datetime import datetime
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.neural_network import MLPRegressor
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Importing our data and checking the structure of it
data = pd.read_csv('/Users/ujas/Downloads/eBay_ML_Challenge_Dataset_2021/eBay_ML_Challenge_Dataset_2021_train.tsv', 
                   sep = '\t')
data.head(25)

Unnamed: 0,b2c_c2c,seller_id,declared_handling_days,acceptance_scan_timestamp,shipment_method_id,shipping_fee,carrier_min_estimate,carrier_max_estimate,item_zip,buyer_zip,category_id,item_price,quantity,payment_datetime,delivery_date,weight,weight_units,package_size,record_number
0,B2C,25454,3.0,2019-03-26 15:11:00.000-07:00,0,0.0,3,5,97219,49040,13,27.95,1,2019-03-24 03:56:49.000-07:00,2019-03-29,5,1,LETTER,1
1,C2C,6727381,2.0,2018-06-02 12:53:00.000-07:00,0,3.0,3,5,11415-3528,62521,0,20.5,1,2018-06-01 13:43:54.000-07:00,2018-06-05,0,1,PACKAGE_THICK_ENVELOPE,2
2,B2C,18507,1.0,2019-01-07 16:22:00.000-05:00,0,4.5,3,5,27292,53010,1,19.9,1,2019-01-06 00:02:00.000-05:00,2019-01-10,9,1,PACKAGE_THICK_ENVELOPE,3
3,B2C,4677,1.0,2018-12-17 16:56:00.000-08:00,0,0.0,3,5,90703,80022,1,35.5,1,2018-12-16 10:28:28.000-08:00,2018-12-21,8,1,PACKAGE_THICK_ENVELOPE,4
4,B2C,4677,1.0,2018-07-27 16:48:00.000-07:00,0,0.0,3,5,90703,55070,1,25.0,1,2018-07-26 18:20:02.000-07:00,2018-07-30,3,1,PACKAGE_THICK_ENVELOPE,5
5,B2C,10514,1.0,2019-04-19 19:42:00.000-04:00,0,0.0,3,5,43215,77063,3,10.39,1,2019-04-18 14:11:09.000-04:00,2019-04-22,1,1,PACKAGE_THICK_ENVELOPE,6
6,B2C,104,1.0,2019-02-08 17:35:00.000-08:00,0,0.0,3,5,91304,60565,11,5.7,1,2019-02-08 09:33:13.000-08:00,2019-02-11,0,1,PACKAGE_THICK_ENVELOPE,7
7,B2C,340356,1.0,2018-04-23 17:31:00.000-04:00,0,2.95,3,5,49735,29379,1,6.0,1,2018-04-22 18:32:04.000-04:00,2018-04-25,1,1,PACKAGE_THICK_ENVELOPE,8
8,B2C,113915,5.0,2019-10-12 09:22:00.000-04:00,3,0.0,2,8,43606,32958,18,5.55,1,2019-10-11 04:54:25.000-04:00,2019-10-15,0,1,NONE,9
9,B2C,130301,1.0,2019-08-09 11:24:00.000-05:00,1,0.0,2,5,35117,84776,13,59.98,1,2019-08-08 12:47:14.000-05:00,2019-08-12,112,1,PACKAGE_THICK_ENVELOPE,10


In [3]:
# The columns in our dataset are
data.columns

Index(['b2c_c2c', 'seller_id', 'declared_handling_days',
       'acceptance_scan_timestamp', 'shipment_method_id', 'shipping_fee',
       'carrier_min_estimate', 'carrier_max_estimate', 'item_zip', 'buyer_zip',
       'category_id', 'item_price', 'quantity', 'payment_datetime',
       'delivery_date', 'weight', 'weight_units', 'package_size',
       'record_number'],
      dtype='object')

In [4]:
# Checking the data types of different columns that can help us see if any columns need to be cleaned particularly
data.dtypes

b2c_c2c                       object
seller_id                      int64
declared_handling_days       float64
acceptance_scan_timestamp     object
shipment_method_id             int64
shipping_fee                 float64
carrier_min_estimate           int64
carrier_max_estimate           int64
item_zip                      object
buyer_zip                     object
category_id                    int64
item_price                   float64
quantity                       int64
payment_datetime              object
delivery_date                 object
weight                         int64
weight_units                   int64
package_size                  object
record_number                  int64
dtype: object

In [5]:
# Printing the number of unique values in each column 
data.nunique()

b2c_c2c                             2
seller_id                     1759305
declared_handling_days             11
acceptance_scan_timestamp     2245193
shipment_method_id                 25
shipping_fee                     7044
carrier_min_estimate                6
carrier_max_estimate                6
item_zip                        50939
buyer_zip                       57273
category_id                        33
item_price                      41571
quantity                          147
payment_datetime             14090416
delivery_date                     767
weight                           1298
weight_units                        2
package_size                        7
record_number                15000000
dtype: int64

# Data Cleaning
After understanding the dataset, I went through the data to make sure there are no faulty data points. I spent a lot of time on this, as I believe a good model depends on good data.

In [6]:
# Checking if there are any null values in the dataset
data.isna().sum()

b2c_c2c                           0
seller_id                         0
declared_handling_days       702886
acceptance_scan_timestamp         0
shipment_method_id                0
shipping_fee                      0
carrier_min_estimate              0
carrier_max_estimate              0
item_zip                          1
buyer_zip                         1
category_id                       0
item_price                        0
quantity                          0
payment_datetime                  0
delivery_date                     0
weight                            0
weight_units                      0
package_size                      0
record_number                     0
dtype: int64

In [7]:
# Since there are no null values in b2c_c2c, I checked the unique values in them
data['b2c_c2c'].unique()

array(['B2C', 'C2C'], dtype=object)

In [8]:
# Converting the column b2c_c2c to integer values to pass into the model later on
data['b2c_c2c'] = data['b2c_c2c'].replace({'B2C':0, 'C2C':1})

In [9]:
# Checking if there are any seller_id's in the negative range and since there aren't any, we'll move ahead with this
sum(data['seller_id'] < 0)

0

In [10]:
# Since, there are Null values in the declared_handling_days column, I will be handling them later. I am also checking if there 
# are any negative values.
sum(data['declared_handling_days'] < 0)

0

In [11]:
# Since declared_handling_days is of type float and has only 11 unique values, I will print them to see if I can have them as a 
# categorical variable and to check if there are any outliers
data['declared_handling_days'].unique()

array([ 3.,  2.,  1.,  5.,  0., 10., nan,  4., 30., 15., 20., 40.])

In [12]:
# acceptance_scan_timestamp has no null values and all of its entries are timestamps, so I will not be doing any cleaning on it

In [13]:
# Since shipment_method_id has no null values, I will be checking the number of times each shipping method is used. This is
# to make sure that there are no outliers in the column
data.shipment_method_id.value_counts()

0     9341444
1     2955149
2      856653
3      780997
4      297472
5      253508
6      176007
7      137404
8      125093
9       39639
10      27081
11       5906
13       1545
12       1077
14        447
15        431
16         48
19         21
17         20
18         17
21         16
20         13
22          6
24          4
26          2
Name: shipment_method_id, dtype: int64

In [14]:
# Since there are no null values, I will be checking for any manual entry faults i.e. if there are any negative or zero values
print('Number of 0\'s in shipping_fee:', sum(data['shipping_fee'] == 0))
print('Number of negative\'s in shipping_fee:', sum(data['shipping_fee'] < 0))

Number of 0's in shipping_fee: 9000100
Number of negative's in shipping_fee: 21


In [15]:
# There are a lot of zero values, so I am assuming they are not errors. There are only 21 negative values, which I set to
# NaN so that they are removed later. Using .loc avoids chained assignment and the SettingWithCopyWarning it triggers.
data.loc[data['shipping_fee'] < 0, 'shipping_fee'] = np.nan


In [16]:
# Since there are no null values in carrier_min_estimate, I am checking for any negative values
print(sum(data['carrier_min_estimate'] < 0))

# There are 1095 negative values, so I am looking at what they are and changing them to NaN for later processing
print(data['carrier_min_estimate'].unique())
data["carrier_min_estimate"] = data["carrier_min_estimate"].replace({-1: np.nan})

1095
[ 3  2  1  6 -1  0]


In [17]:
# Since there are no null values in carrier_max_estimate, I am checking for any negative values
print(sum(data['carrier_max_estimate'] < 0))

# There are 1095 negative values, so I am looking at what they are and changing them to NaN for later processing
print(data['carrier_max_estimate'].unique())
data["carrier_max_estimate"] = data["carrier_max_estimate"].replace({-1: np.nan})

1095
[ 5  8  9  1 25 -1]


In [18]:
# Since there is a null value in item_zip, I am removing that row now, as I will be calculating the distance between item_zip
# and buyer_zip before building the model. dropna removes every row with a missing item_zip in a single pass.
data.dropna(subset=['item_zip'], inplace=True)

In [19]:
# Resetting the index values of the dataframe
data.reset_index(drop=True, inplace=True)

In [20]:
# As some zip codes in the dataset use the more specific 9-digit format, I am slicing them to 5 digits for easier calculations,
# since most of the zip codes (item_zip and buyer_zip) have 5 digits
data['item_zip'] = data['item_zip'].str.slice(0, 5)

In [21]:
# Since there is a null value in buyer_zip, I am removing that row now, as I will be calculating the distance between item_zip
# and buyer_zip before building the model. dropna removes every row with a missing buyer_zip in a single pass.
data.dropna(subset=['buyer_zip'], inplace=True)

In [22]:
# Resetting the index values of the dataframe
data.reset_index(drop=True, inplace=True)

In [23]:
# As some zip codes in the dataset use the more specific 9-digit format, I am slicing them to 5 digits for easier calculations,
# since most of the zip codes (item_zip and buyer_zip) have 5 digits
data['buyer_zip'] = data['buyer_zip'].str.slice(0, 5)

In [24]:
# Since, there are no Null values, I check if there are any negative values and for values 0 assuming it is proxy. I find a 
# lot of 0 values and looking at the distribution I assume it is a part of the categorical variable 
print(sum(data['category_id'] < 0))
print(data.category_id.value_counts())

0
0     2544488
1     1293227
2     1281052
3     1233523
4      931637
5      787817
6      772039
7      720987
8      569544
10     543662
9      538951
11     445998
12     434892
13     429443
14     366025
15     320721
16     303283
17     257026
19     239080
18     215859
20     125426
22     102039
21      97935
23      94685
24      69860
25      65579
26      64628
27      39559
28      38358
29      32863
30      28701
31       9216
32       1895
Name: category_id, dtype: int64


In [25]:
# Since, there are no Null values in the item_price, we check for prices less than or equal to zero
print(sum(data['item_price'] == 0))
print(sum(data['item_price'] < 0))
# Since, there are no negative values and just 2 values equal to 0, I am not changing anything in the datapoint assuming these
# products were given for free on offer or discount

2
0


In [26]:
# As there are no Null values in quantity, I am checking if there are any negative values in the column
sum(data['quantity'] < 0)

0

In [27]:
# Since, weight is a value which cannot be negative, I am checking if there are any negative faulty data points
sum(data['weight'] < 0)

0

In [28]:
# As there are two weight units in the dataset (pounds = 1, kilograms = 2), I will be converting everything to kilograms.
# np.where converts the whole column at once instead of applying a function row by row.
data["weight_kg"] = np.where(data["weight_units"] == 2, data["weight"], data["weight"] * 0.45359237)

In [29]:
# Checking if there are weights with 0 values and setting them to NaN
print(sum(data['weight_kg'] == 0))
data["weight_kg"] = data["weight_kg"].replace({0: np.nan})

4792626


In [30]:
# To see the frequency of package_size that are being shipped
print(data['package_size'].value_counts())

PACKAGE_THICK_ENVELOPE    12652643
NONE                       1055227
LETTER                      912894
LARGE_ENVELOPE              209218
LARGE_PACKAGE               170014
VERY_LARGE_PACKAGE               1
EXTRA_LARGE_PACKAGE              1
Name: package_size, dtype: int64


In [31]:
# Assigning each string value in package_size a unique number; 'NONE' becomes NaN so that it is dropped later
data["package_size"] = data["package_size"].replace({'NONE': np.nan, 'PACKAGE_THICK_ENVELOPE': 0, 'LETTER': 1, 'LARGE_ENVELOPE': 2,
                                                     'LARGE_PACKAGE': 3, 'VERY_LARGE_PACKAGE': 4, 'EXTRA_LARGE_PACKAGE': 5})

In [32]:
# Creating a new column whose values are the average of carrier_min_estimate and carrier_max_estimate
data = data.assign(carrier_average_estimate = lambda x: (x.carrier_min_estimate + x.carrier_max_estimate) / 2)

In [33]:
# I am creating a new column called distance, which is an approximate measure of the distance between item_zip and buyer_zip.
# To create the distance column, I first collect the latitude and longitude of every zip code present in the dataset into a
# dictionary

search = SearchEngine(simple_zipcode=True)

# Union of all zip codes that appear as an origin or as a destination
zipcodes_needed = set(data['item_zip']) | set(data['buyer_zip'])
    
zipcodes_needed_list = list(zipcodes_needed)
zipcodes_dict = {}

for i in zipcodes_needed_list:
    zip1 = search.by_zipcode(i)
    zipcodes_dict[i]=[zip1.lat, zip1.lng]

In [None]:
# Assigning each data point new columns that contain the latitude and longitude of both the seller and the buyer from the
# previously created dictionary

# Series.map looks up each zip code in the dictionary and returns its [lat, lng] pair
data['item_coord'] = data['item_zip'].map(zipcodes_dict)
data['buyer_coord'] = data['buyer_zip'].map(zipcodes_dict)

data[['item_lat','item_long']] = pd.DataFrame(data.item_coord.tolist(), index= data.index)
data[['buyer_lat','buyer_long']] = pd.DataFrame(data.buyer_coord.tolist(), index= data.index)

In [None]:
# Using these latitudes and longitudes, I am calculating the great-circle ("as the crow flies") distance between the two
# locations using the haversine formula. The straight-line distance is used because I do not know the type of transportation
# used, i.e. freight, trucks, airplanes etc.

def haversine_np(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

data['distance'] = haversine_np(data['item_long'], data['item_lat'], data['buyer_long'], data['buyer_lat'])
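
As a quick sanity check of the haversine formula, the sketch below recomputes it on two hand-picked cases whose answer is known in advance: two points on the equator a quarter turn apart, and a point compared with itself. The function name `haversine_km` and the `radius_km` parameter are illustrative and not part of the notebook; the radius of 6367 km is the value used above.

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6367):
    """Great-circle distance in km between points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return radius_km * 2 * np.arcsin(np.sqrt(a))

# Two points on the equator, 90 degrees of longitude apart: a quarter of a great circle
quarter_turn = haversine_km(0.0, 0.0, 90.0, 0.0)

# Identical points must be zero kilometres apart
same_point = haversine_km(-122.7, 45.5, -122.7, 45.5)

print(quarter_turn, same_point)
```

The first value should equal a quarter of the circumference of a sphere with the chosen radius, which makes it easy to spot swapped longitude and latitude arguments or a missing degree-to-radian conversion.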

In [None]:
# These are the functions that I have written to convert the timestamps to dates

# This function splits and considers only the date from a timestamp variable
def date(string):
    return string.split()[0]

# This function is used to convert the string type date object into date type
def convert_data(string):
    return datetime.strptime(str(string), '%Y-%m-%d').date()

# This is used to calculate the total number of days taken to deliver the product to the customer since the date of payment.
# The argument is a dataframe row.
def number_of_days(row):
    return (row['delivery_date_modified'] - row['payment_date']).days

# This is used to calculate the total number of days taken by the seller to hand the product to the carrier since the date of
# payment. The argument is a dataframe row.
def number_of_handling_days(row):
    return (row['acceptance_date'] - row['payment_date']).days

In [None]:
# Using the above defined functions to generate date from the payment_datetime column and storing it in a new column

data['payment_date'] = data['payment_datetime'].apply(date)
data['payment_date'] = data['payment_date'].apply(convert_data)

In [None]:
# Generating date from the delivery_date column and storing it in a new column
data['delivery_date_modified'] = data['delivery_date'].apply(convert_data)

In [None]:
# Calculating the total number of days taken to deliver the product to the customer since the date of payment and
# storing it in a new column
data['no_of_days_after_payment'] = data.apply(number_of_days, axis = 1)

In [None]:
# Using the above defined functions to generate date from the acceptance_scan_timestamp column and storing it in a new column

data['acceptance_date'] = data['acceptance_scan_timestamp'].apply(date)
data['acceptance_date'] = data['acceptance_date'].apply(convert_data)

In [None]:
# Calculating the total number of days taken by the seller to hand the product to the carrier since the date of payment and
# storing it in a new column
data['handling_days'] = data.apply(number_of_handling_days, axis = 1)
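
The row-by-row `apply` calls above can be slow on millions of rows. A possible vectorized alternative is sketched below on a two-row toy frame (the values are copied from the first two rows of the sample output); the `toy` dataframe and the intermediate variable names are illustrative only. It assumes, as the `date` helper does, that every timestamp string starts with a `YYYY-MM-DD` date.

```python
import pandas as pd

# Toy frame with the same column names and string formats as the dataset
toy = pd.DataFrame({
    "payment_datetime": ["2019-03-24 03:56:49.000-07:00", "2018-06-01 13:43:54.000-07:00"],
    "acceptance_scan_timestamp": ["2019-03-26 15:11:00.000-07:00", "2018-06-02 12:53:00.000-07:00"],
    "delivery_date": ["2019-03-29", "2018-06-05"],
})

# Keep only the leading YYYY-MM-DD part of each timestamp and parse the whole column at once
payment_date = pd.to_datetime(toy["payment_datetime"].str.slice(0, 10), format="%Y-%m-%d")
acceptance_date = pd.to_datetime(toy["acceptance_scan_timestamp"].str.slice(0, 10), format="%Y-%m-%d")
delivery_date = pd.to_datetime(toy["delivery_date"], format="%Y-%m-%d")

# Subtracting datetime columns gives timedeltas; .dt.days extracts whole days
toy["handling_days"] = (acceptance_date - payment_date).dt.days
toy["no_of_days_after_payment"] = (delivery_date - payment_date).dt.days

print(toy[["handling_days", "no_of_days_after_payment"]])
```

Like the original helpers, this ignores the time of day and the time zone offset and compares calendar dates only.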

In [None]:
# Finding the indexes of data points with a negative number of days between payment and delivery. These are removed on the
# assumption that they are faulty inputs.
to_drop = data[data['no_of_days_after_payment'] < 0].index

In [None]:
# Dropping the negative values present in the no_of_days_after_payment column
data.drop(index = to_drop, inplace = True)

In [None]:
# After removing the Null values, I remove the columns that are of less importance in predicting the delivery dates
data.drop(['item_coord', 'buyer_coord', 'item_lat', 'item_long', 'buyer_lat', 'buyer_long','weight_units','weight','payment_datetime'], axis = 1, inplace = True)           

In [None]:
# Removing all the rows containing the Null values and storing it in a new dataframe
model_data = data.dropna()

After all the cleaning has been done, I have stored the cleaned dataframe in a new variable called model_data. I have done this as I have more cleaning ideas that I will be implementing before the final eBay ML Challenge submission. 

# Data Visualization

In [None]:
# Plotting the histogram of the number of days it takes for the delivery to arrive to the buyer
plt.hist(model_data['no_of_days_after_payment'], bins = 200)
plt.xlim(0, 25)

I have plotted the frequency of the number of days a shipment takes to reach the customer. From the above graph, we can see that most packages arrive in 3 to 4 days. These two values alone cover nearly 50% of all the transactions in our dataset.
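
A share like this can also be read directly from the data rather than from the plot. The sketch below shows one way to do it with `value_counts(normalize=True)`; the `days` series is made-up toy data, not the eBay dataset, so the printed number only illustrates the calculation.

```python
import pandas as pd

# Toy delivery durations in days (illustrative values only)
days = pd.Series([2, 3, 3, 4, 4, 4, 5, 7])

# Fraction of transactions for each delivery duration, ordered by duration
share_by_day = days.value_counts(normalize=True).sort_index()

# Combined share of the 3-day and 4-day deliveries
share_3_to_4 = share_by_day.loc[[3, 4]].sum()

print(share_by_day)
print(share_3_to_4)
```

On the real data, `days` would be replaced by `model_data['no_of_days_after_payment']`.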

In [None]:
# Plotting a scatter plot of the distance against shipment_method_id
sns.scatterplot(x = model_data['shipment_method_id'], y = model_data['distance'])

From the above graph, we can see that shipment method IDs 0 and 1 handle the bulk of the long-distance shipments. For this reason, it is reasonable to assume that these two shipment methods involve air transport. Although most shipments cover a distance of 0 to 4000 km, shipment method IDs above 15 account for few transactions in that range.

# Loss Function
As this dataset comes from the eBay ML Challenge, I will be using the loss function defined by eBay to evaluate my models. The function averages the prediction error in days over all packages, weighting each day by which the prediction falls before the actual delivery by 0.4 and each day by which it falls after the actual delivery by 0.6. The lower the loss, the better the model. eBay gives the baseline loss score of its existing model as 0.75.

In [None]:
def evaluate_loss(preds, actual):
    early_loss, late_loss = 0,0 
    for i in range(len(preds)):
        if preds[i] < actual[i]:
            #early shipment
            early_loss += actual[i] - preds[i]
        elif preds[i] > actual[i]:
            #late shipment
            late_loss += preds[i] - actual[i]
    loss = (1/len(preds)) * (0.4 * (early_loss) + 0.6 * (late_loss))
    return loss
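
The loop above visits every prediction in Python, which can be slow for millions of rows. The following is a minimal vectorized sketch of the same calculation with NumPy; the name `evaluate_loss_vectorized` and the `early_weight` and `late_weight` parameters are illustrative, with defaults set to the 0.4 and 0.6 weights used above.

```python
import numpy as np

def evaluate_loss_vectorized(preds, actual, early_weight=0.4, late_weight=0.6):
    """Weighted mean absolute error in days, mirroring evaluate_loss."""
    preds = np.asarray(preds, dtype=float)
    actual = np.asarray(actual, dtype=float)
    diff = actual - preds
    # Days by which the prediction falls before the actual delivery
    early_loss = np.clip(diff, 0, None).sum()
    # Days by which the prediction falls after the actual delivery
    late_loss = np.clip(-diff, 0, None).sum()
    return (early_weight * early_loss + late_weight * late_loss) / len(preds)

# One prediction a day too early, one two days too late, one exact
example_loss = evaluate_loss_vectorized([3, 5, 4], [4, 3, 4])
print(example_loss)
```

Because the inputs are converted with `np.asarray`, the function accepts lists, NumPy arrays, and pandas Series alike, and it compares values by position rather than by index label.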

# Train and Test data
Before running the models, I have split the data into train set and test set. I have done this using the sklearn library by giving 80% of the data to training and the rest 20% to testing. I have also shuffled the dataset before splitting to make sure that data points present in different parts of the file are used for training.

In [None]:
x = model_data[['b2c_c2c', 'declared_handling_days', 'carrier_average_estimate', 'distance', 'quantity', 'weight_kg',
          'handling_days']]
y = model_data['no_of_days_after_payment']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, shuffle = True)

# Models
I have built three models for predicting the delivery dates. All the models produce a slightly better loss value than the baseline of 0.75. The models that I have trained are,
1. Multivariate Linear Regression
2. MLP Regressor
3. XGBoost Ensemble Algorithm

### Multi-variate Linear Regression
As the name suggests, more than one variable is used to predict the response, i.e. the delivery date in our case. This is a linear regression model with multiple input variables. I have used it to check whether a combination of variables is related to the response even though the maximum correlation between any single variable and the response is just 0.46. Another reason for using this model is to find outliers/anomalies if there are any. I have used the LinearRegression class from the scikit-learn library.

In [None]:
# Creating a Linear Regression model using the sklearn library
model = LinearRegression()

# Training the model with the train dataset
model.fit(np.array(x_train), y_train)

# Predicting the target values for the test dataset
predictions = model.predict(np.array(x_test))

# Evaluating the loss of the model using the eBay defined function
print('Linear Regression loss:', evaluate_loss([np.floor(x) for x in predictions], np.array(y_test)))
print('r2 of the model:', r2_score(y_test, predictions))

Although the r2 of the model is quite low, I am keeping this model as it has a better loss than the baseline score provided by eBay.

### MLP Regressor
MLP stands for Multi-Layer Perceptron and it is a basic Artificial Neural Network model present in the scikit-learn library. Some of the salient features of the algorithm are that,
1. It does not have an activation function in the output layer.
2. Squared Error is used as a loss function in the regressor models, whereas, cross-entropy is used in classification models.
3. It does not support implementation on the GPU

The model is not good for real-time data applications as we can see in the below results. The accuracy of the model is almost the same as the accuracy of the multivariate linear regression model. I have used this model to learn about the basic deep learning models that can be implemented from the scikit-learn library without the support of any GPUs.

In [None]:
# Building the MLP Regressor with 5 hidden layers and an activation function
model = MLPRegressor(hidden_layer_sizes = (2,3,4,3,2), activation="relu", solver="lbfgs", max_iter=100)

# Training the model
model.fit(x_train, y_train)

# Generating predictions
predictions = model.predict(x_test)

# Calculating the loss of our model
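# y_test is converted to an array so that evaluate_loss indexes it by position rather than by the shuffled index labels
print('MLP Regressor Loss:', evaluate_loss([np.floor(i) for i in predictions], np.array(y_test)))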

I have built an MLP model with 5 hidden layers, each using the 'relu' activation function. As per the loss function defined by eBay, my model is doing better than the multi-variate linear regression model implemented above.

### XGBoost Ensemble Algorithm
It stands for Extreme Gradient Boosting and is an ensemble algorithm that is used to boost the decision trees for higher speed and performance/accuracy. There are three main forms of gradient boosting supported by the library, i.e.,
1. Gradient Boosting algorithm also called gradient boosting machine including the learning rate.
2. Stochastic Gradient Boosting with sub-sampling at the row, column, and column per split levels.
3. Regularized Gradient Boosting with both L1 and L2 regularization.

I have used the basic gradient boosting algorithm as it provided better results than the others.

The main algorithmic features that made me select this model are,
1. Sparse Aware implementation with automatic handling of missing data values.
2. Block Structure to support the parallelization of tree construction.
3. Continued Training so that I can further boost an already fitted model on new data.

The XGBoost implementation speed and the model performance have also made it easier for me to use the model for this case.

In [None]:
# Converting the variable of float datatype to int datatype
a = model_data['package_size'].copy()
a = a.astype(int)

# Since XGBoost needs data in a different format, I will be reassigning x and y as required
x = np.column_stack((model_data['b2c_c2c'],model_data['seller_id'], model_data['declared_handling_days'], 
                    model_data['shipment_method_id'],model_data['shipping_fee'], model_data['carrier_min_estimate'], 
                    model_data['carrier_max_estimate'], model_data['item_zip'], model_data['buyer_zip'], 
                    model_data['category_id'], model_data['item_price'], model_data['weight_kg'], model_data['quantity'], a, 
                    model_data['handling_days']))

y = np.array(model_data['no_of_days_after_payment'])

# Splitting the data into train and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, shuffle = True)

# Creating an XGBoost model
model = xgb.XGBRegressor(verbosity = 0)

# Training the model on the train dataset
model.fit(x_train, y_train)

# Predicting the values for the test dataset
predictions = model.predict(x_test)

# Calculating the loss of our model
print('XGBoost Loss:', evaluate_loss([np.floor(i) for i in predictions],np.array(y_test)))

As we can see from the outputs of all the developed models, XGBoost gives the best results. This is in line with the article written by Purdue University, which found the Random Forest ensemble method to be among the best, and XGBoost is supposed to be better in terms of both speed and performance. The results are quite fulfilling.

# Learning Curve
The project has been a great learning curve for me. I was able to practice the whole data cleaning process on a real-world data set, explain the dataset with visualizations, and build models on top of it. The models that I built were also something new that I learned while going through the project. I plan to enhance the model accuracy by understanding the DeepETA algorithm and working on it in the future.

# Future Work
As I have mentioned before, I plan to do more data cleaning, including excluding weekends from the handling_days and no_of_days_after_payment columns. I also have to consider the time of order or shipping, because if the order is placed after 5:00 P.M. I should take the next business day for further processing. As for the models, although the developed ones beat the baseline score defined by eBay, I would like to build a sequential neural network model as described in the paper "DeepETA: A Spatial-Temporal Sequential Neural Network Model for Estimating Time of Arrival in Package Delivery System" as a continuation of the project.
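
One possible way to exclude weekends, sketched here as an assumption rather than as the planned implementation, is NumPy's `np.busday_count`, which counts business days from a start date up to but not including an end date. The two date pairs are taken from the first two rows of the sample output; public holidays and the 5:00 P.M. cut-off are not handled in this sketch.

```python
import numpy as np

# Payment and delivery dates of two sample transactions
payment = np.array(["2019-03-24", "2018-06-01"], dtype="datetime64[D]")
delivery = np.array(["2019-03-29", "2018-06-05"], dtype="datetime64[D]")

# Calendar days, as currently stored in no_of_days_after_payment
calendar_days = (delivery - payment).astype(int)

# Business days (Monday to Friday by default) in the half-open interval [payment, delivery)
business_days = np.busday_count(payment, delivery)

print(calendar_days, business_days)
```

The first transaction was paid on a Sunday and the second on a Friday, so both lose the weekend days when counted in business days.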

# References
1. "Predicting Shipping Time with Machine Learning", Presented at Massachusetts Institute of Technology, 2012.
2. "A Machine Learning Approach to Delivery Time Estimation for Industrial Equipment", Purdue University.
3. "DeepETA: A Spatial-Temporal Sequential Neural Network Model for Estimating Time of Arrival in Package Delivery System", Association for the Advancement of Artificial Intelligence, 2019.
4. "Boosting Algorithms for Delivery Time Prediction in Transportation Logistics", International Conference on Data Mining Workshops, 2020, DOI: 10.1109/ICDMW51313.2020.00043
5. "Predicting the Last Mile: Route-Free Prediction of Parcel Delivery Time with Deep Learning for Smart-City Applications", Queen's University, Canada, 2020.
6. "Predicting Package Delivery Time For Motorcycles In Nairobi", Kenya College of Accountancy, 2020, DOI:10.13140/RG.2.2.27105.94567
7. Haversine formula - https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas/29546836#29546836
8. Similar work - https://milliemince.github.io/eBay-shipping-predictions/
