## Introduction <a name="introduction"></a>

**The goal is to build a model that predicts the estimated time of delivery of orders, from the point of driver pickup to the point of arrival at the final destination. The solution will help Sendy enhance customer communication and improve the reliability of its service, which will ultimately improve customer experience. In addition, the solution will enable Sendy to realise cost savings, and ultimately reduce the cost of doing business, through improved resource management and planning for order scheduling.**

# Table of contents
1. [Introduction](#introduction)
2. [Importing libraries](#Importing_libraries)
3. [Importing datasets](#Importing_datasets)
4. [A quick look at how our data is structured](#Data_structure)
5. [Data Visualization](#visuals)
    1. [Temperature distribution](#temperature)
    2. [Vehicle types](#vehicle_types)
    3. [Platform types](#platform_types)
    4. [Personal or Business](#personal_or_business)
    5. [Order placement](#order_placement)
        1. [Order placement day of the month](#order_placement_day_of_the_month)
        2. [Placement weekday](#placement_weekday)
    6. [Order confirmation](#order_confirmation)
        1. [Confirmation day of month](#confirmation_day_of_month)
        2. [Confirmation weekday](#confirmation_weekday)
6. [Data Preprocessing](#data_preprocessing)
7. [Modeling](#modeling)
    1. [Linear Regression Model](#linear_model)
    2. [XGBoost](#xgb)
    3. [Random Forest](#random_forest)
    4. [Decision Tree](#decision_tree)
8. [Conclusion](#conclusion)

# Importing libraries <a name="Importing_libraries"></a>

In [None]:
!pip3 install category_encoders
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge
import xgboost as xgb
import category_encoders as ce
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from math import sqrt
from datetime import timedelta
from scipy.stats import uniform, randint 
import pickle

# Importing datasets <a name="Importing_datasets"></a>

In [None]:
train = pd.read_csv("data/Train.csv")
test = pd.read_csv("data/Test.csv")
riders = pd.read_csv("data/Riders.csv")
sample_submission = pd.read_csv("data/SampleSubmission.csv")
variable_definitions = pd.read_csv("data/VariableDefinitions.csv")

# A quick look at how our data is structured <a name="Data_structure"></a>

### Variable definitions

In [None]:
display(variable_definitions)

### Train data

In [None]:
print(f"The training dataset has {train.shape[0]} rows and {train.shape[1]} columns.") # Getting the total number of rows & columns in the training data.
train.info() # info() prints its summary and returns None, so it does not need display().
display(train.head()) # The 1st 5 rows.
display(train.describe())

### Test data

In [None]:
print(f"The test dataset has {test.shape[0]} rows and {test.shape[1]} columns.") # Getting the total number of rows & columns in the test data.
test.info() # info() prints its summary and returns None, so it does not need display().
display(test.head()) # The 1st 5 rows.
display(test.describe())

# Data Visualization <a name="visuals"></a>

### Temperature distribution <a name="temperature"></a>

In [None]:
f, axes = plt.subplots(1, 2, figsize=(15, 7), sharex=True)
axes[0].set_title("Train Data")
axes[1].set_title("Test Data")
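# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True draws the same histogram plus density curve.
sns.histplot(train["Temperature"], color="blue", kde=True, stat="density", ax=axes[0])
sns.histplot(test["Temperature"], color="red", kde=True, stat="density", ax=axes[1])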

In [None]:
print("Train mean {}".format(train["Temperature"].mean()))
print("Train median {}".format(train["Temperature"].median()))
print("Test mean {}".format(test["Temperature"].mean()))
print("Test median {}".format(test["Temperature"].median()))

The mean and median temperature values are nearly identical, so we can safely use either one to replace the missing temperature values in our data.
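
The comparison behind that statement can be sketched on a small, made-up series. The values below are hypothetical and are not taken from the Sendy data; the idea is simply that when the gap between mean and median is small relative to the spread, either statistic is a reasonable fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with gaps (not from the Sendy data).
temps = pd.Series([21.5, 23.0, np.nan, 24.2, 22.8, np.nan, 23.5], name="Temperature")

mean_temp = temps.mean()      # NaN values are skipped by default
median_temp = temps.median()  # NaN values are skipped here as well
gap = abs(mean_temp - median_temp)

print(f"mean={mean_temp:.2f}, median={median_temp:.2f}, gap={gap:.2f}, std={temps.std():.2f}")

# A gap that is small compared with the standard deviation suggests a roughly
# symmetric distribution, in which case mean and median imputation behave alike.
if gap < 0.1 * temps.std():
    print("Mean and median are close; either can be used to fill missing values.")
```

If the two statistics differed noticeably, the distribution would likely be skewed, and the median would usually be the more robust choice.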

In [None]:
clean_train_df = train.copy(deep=True)
clean_test_df = test.copy(deep=True)
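# Assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas versions.
clean_train_df["Temperature"] = clean_train_df["Temperature"].fillna(clean_train_df["Temperature"].mean())
clean_test_df["Temperature"] = clean_test_df["Temperature"].fillna(clean_test_df["Temperature"].mean())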

Here we replace the missing temperature values with the mean temperature.
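
The cell above fills each dataset with its own mean. A common alternative, shown here only as a sketch and not as the approach used in this notebook, is to compute the fill value from the training data alone and reuse it for the test data, so that no statistic is derived from the test set. The frames `train_demo` and `test_demo` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the train and test frames.
train_demo = pd.DataFrame({"Temperature": [20.0, np.nan, 24.0, 22.0]})
test_demo = pd.DataFrame({"Temperature": [np.nan, 25.0, np.nan]})

# Compute the fill value once, from the training data only.
fill_value = train_demo["Temperature"].mean()

# Apply the same value to both frames, assigning the result back to the column.
train_demo["Temperature"] = train_demo["Temperature"].fillna(fill_value)
test_demo["Temperature"] = test_demo["Temperature"].fillna(fill_value)

print(f"Fill value: {fill_value}")
print(f"Missing values left in train: {train_demo['Temperature'].isna().sum()}")
print(f"Missing values left in test: {test_demo['Temperature'].isna().sum()}")
```

Because the temperature distributions of the two datasets look similar, the two approaches should give nearly the same result here.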

### Vehicle types <a name="vehicle_types"></a>

In [None]:
f, axes = plt.subplots(1, 2, figsize=(15, 7), sharex=True)
axes[0].set_title("Train data")
axes[1].set_title("Test data")
sns.countplot(x='Vehicle Type', data=train, ax=axes[0])
sns.countplot(x='Vehicle Type', data=test, ax=axes[1])

From the **count plot** above we can see that there is only **one** vehicle type, so we can safely discard this column.
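
The same observation can be checked numerically rather than by eye. The sketch below uses a small hypothetical frame (`demo`, with made-up values) to find columns that hold a single unique value and drop them; the same pattern could be applied to the cleaned train and test frames.

```python
import pandas as pd

# Hypothetical frame: one constant column and one varying column.
demo = pd.DataFrame({
    "Vehicle Type": ["Bike", "Bike", "Bike"],
    "Platform Type": [1, 3, 3],
})

# Count distinct values per column, treating NaN as a value of its own.
unique_counts = demo.nunique(dropna=False)

# Columns with at most one distinct value carry no information for a model.
constant_cols = unique_counts[unique_counts <= 1].index.tolist()
reduced = demo.drop(columns=constant_cols)

print(f"Constant columns: {constant_cols}")
print(f"Remaining columns: {reduced.columns.tolist()}")
```

Any column dropped from the training data should be dropped from the test data as well, so that both frames keep the same features.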