
# Sendy Logistics Challenge
Logistics is fundamental to the success of a business.

It is reported that in Africa, logistics adds an average of 320% to a manufactured good’s cost. Sendy is a logistics platform serving East Africa that aims to help businesses and enterprises grow through efficient and affordable logistics.

Sendy wants to predict accurate arrival times, which would help businesses improve their logistics operations and communicate reliable delivery times to waiting customers. Data is key in this endeavour, and this project uses the provided data to build a model that makes these arrival time predictions.

The training dataset provided here is a subset of over 20,000 orders and only includes direct orders (i.e. Sendy “express” orders) with bikes in Nairobi. All data in this subset have been fully anonymized while preserving the distribution.

# Import Libraries and datasets

In [0]:
# Importing libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
import scipy.stats as stats
%matplotlib inline

# avoid pd warnings 
import warnings
warnings.filterwarnings("ignore")

# Following PEP 8, all imports live at the top of the notebook; more are collected here as we go along
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

Three .csv data files were provided: "Train.csv", "Test.csv" & "Riders.csv", along with a sample submission file, "SampleSubmission.csv". Let's import these datasets.

In [0]:
# import each csv file
test_raw = pd.read_csv('Test.csv')
train_raw = pd.read_csv('Train.csv')
riders_raw = pd.read_csv('Riders.csv')
Sub = pd.read_csv('SampleSubmission.csv')

# Join riders to test & train data and initialise working dataframes
Train = train_raw.merge(riders_raw, how="left", on = "Rider Id") 
Test = test_raw.merge(riders_raw, how="left", on = "Rider Id")
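A left join like the one above can be sanity-checked before moving on. The sketch below uses small made-up tables; the order numbers, rider IDs, ratings and the `Average_Rating` column name are illustrative assumptions, not values taken from the Sendy files. It checks that the join keeps one row per order and counts orders whose rider has no match.

```python
import pandas as pd

# Toy stand-ins for the order and rider tables (all values are made up).
orders = pd.DataFrame({
    "Order No": ["Order_1", "Order_2", "Order_3"],
    "Rider Id": ["Rider_A", "Rider_B", "Rider_A"],
})
riders = pd.DataFrame({
    "Rider Id": ["Rider_A", "Rider_B"],
    "Average_Rating": [13.8, 14.1],
})

# validate="many_to_one" raises a MergeError if a Rider Id is duplicated in `riders`.
# indicator=True adds a `_merge` column recording whether each row found a match.
merged = orders.merge(riders, how="left", on="Rider Id",
                      validate="many_to_one", indicator=True)

# A left join on a unique key must keep exactly one row per order.
assert len(merged) == len(orders)

# Orders whose rider was not found in the rider table.
unmatched = (merged["_merge"] == "left_only").sum()
print(f"rows: {len(merged)}, unmatched riders: {unmatched}")
```

The same two keyword arguments could be passed to the `Train` and `Test` merges; the `_merge` column would then need dropping before modelling.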

Great! Let's build a model!

# Exploratory Data Analysis
Hold on! Exploratory Data Analysis (EDA) is vital for understanding the structure of our data, spotting potential patterns and relationships between the variables, and checking whether the dataset is in a suitable format for the model we'll eventually be building. We've divided the EDA into the following sections: Completeness, Data Types, Initial Variable Selection and Visualisation. Once we have 'the lay of the land' we'll move on to preprocessing the datasets.

## Completeness of the Data

The quality of a dataset depends on its completeness. Let's investigate which variables have those pesky null values and figure out how to handle them.

In [0]:
train_raw.head()

In [0]:
train_raw.info()

In [0]:
train_raw.isnull().sum()

In [0]:
test_raw.head()

In [0]:
test_raw.info()

In [0]:
test_raw.isnull().sum()

In [0]:
riders_raw.head()

In [0]:
riders_raw.info()

**Observations:**

*  Both the train & test data sets have missing values in the 'Temperature' & 'Precipitation' columns. Approximately 20% of the Temperature values are missing in both sets, so replacing the NaN values with the mean is a reasonable approach. Approximately 97% of the Precipitation values are missing; these could be imputed either by assuming zero precipitation for those orders or by using the mode. For the train data, the imputation should be done after the train-validation split.

*  As the data has already been split into training & test sets, we can go ahead and impute the values for each set. It's best practice to impute values after the split to ensure a fair test of the model eventually built.

*  The Riders.csv file has no missing values. Phew!

*  Side Note: The column names for the test and train data could do with some formatting to get rid of the spaces between words, but perhaps we'll leave them as they are to match the submission file format.

*  The train data set has four additional columns centred around the arrival time of the order: 'Arrival at Destination - Day of Month', 'Arrival at Destination - Weekday (Mo = 1)', 'Arrival at Destination - Time', 'Time from Pickup to Arrival'. Other than that the columns are identical. This is expected: if these columns were present in the Test set, we wouldn't have a response variable to predict!

*  The riders data can be joined with both the train and test data sets on the column 'Rider Id'. It is anticipated that some variables in this data set, such as 'AverageRating', could be influential in predicting the response variable. What's that you say...? "All this does is increase the number of variables in both sets!"... Yes, but there are definitely some variables that can be removed...

*  Summary: impute the missing temperature and precipitation values. Join riders to both test & train.
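The imputation options discussed above could look roughly like the following. This is a minimal sketch on toy data: the short column names `Temperature` and `Precipitation` are placeholders for the real ones, and reusing the training mean for the test set is an assumption made here to keep test data out of the fitting step.

```python
import numpy as np
import pandas as pd

# Toy data; values and short column names are placeholders.
train = pd.DataFrame({
    "Temperature": [20.0, np.nan, 26.0, 24.0],
    "Precipitation": [np.nan, np.nan, 1.5, np.nan],
})
test = pd.DataFrame({
    "Temperature": [np.nan, 22.0],
    "Precipitation": [np.nan, np.nan],
})

# Learn the fill value from the training rows only ...
train_mean_temp = train["Temperature"].mean()

# ... and reuse it for both sets so the test data never informs the imputation.
train["Temperature"] = train["Temperature"].fillna(train_mean_temp)
test["Temperature"] = test["Temperature"].fillna(train_mean_temp)

# Zero-precipitation option: treat a missing reading as "no rain".
train["Precipitation"] = train["Precipitation"].fillna(0.0)
test["Precipitation"] = test["Precipitation"].fillna(0.0)

print(train)
print(test)
```

The `SimpleImputer(strategy="mean")` class imported earlier, fitted on the training rows and then applied to the test rows, would produce the same fill value for Temperature.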

## Distribution of the feature and response variables
A histogram of delivery times, with the outlier threshold marked, is plotted to visualize the distribution of the delivery times (response/target variable).

In [0]:
# Calculating the outlier threshold 
mu = Train['Time from Pickup to Arrival'].mean()
sd = Train['Time from Pickup to Arrival'].std()
li = mu + 3*sd

# creating the histogram - distribution of delivery times in seconds
sns.set()
_ = plt.figure(figsize = (10,5))
_ = plt.hist(Train['Time from Pickup to Arrival'], bins = 70, color = 'blue')
_ = plt.title('Delivery time distribution')
_ = plt.xlabel('Delivery Time (seconds)')
_ = plt.ylabel('Number of orders')
_ = plt.axvline(li, color = 'gray', linestyle = 'dashed', linewidth = 1)
plt.show()

**Observations:**

*   From the plot above it’s clear that the delivery times are positively skewed, with the majority of orders being delivered in about 16 minutes.

*   There are orders with a delivery time of 1 second. These could be data points that weren't recorded properly. In practice these rows should be excluded when training the model.

*   The grey dashed line above marks the outlier threshold, set at 3 standard deviations above the mean. The small portion of delivery times beyond it (> 4500 seconds) can be considered outliers.
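The two checks described above, the 3-standard-deviation threshold and the implausibly short deliveries, can be expressed as boolean masks. The sketch below runs on made-up delivery times, and the 60-second lower bound is a hypothetical cut-off chosen only for illustration.

```python
import pandas as pd

# Toy delivery times in seconds: one 1-second record, 18 ordinary
# deliveries and one extremely long one (all values are made up).
times = pd.Series([1] + list(range(600, 2400, 100)) + [30000],
                  name="Time from Pickup to Arrival")

# Upper outlier threshold: mean plus three standard deviations.
upper = times.mean() + 3 * times.std()

# Hypothetical lower bound for a plausible delivery; 60 s is an assumption.
MIN_PLAUSIBLE_SECONDS = 60

is_outlier = times > upper
too_short = times < MIN_PLAUSIBLE_SECONDS

# Keep only the rows that pass both checks.
cleaned = times[~is_outlier & ~too_short]

print(f"threshold: {upper:.0f} s")
print(f"flagged as outliers: {is_outlier.sum()}, too short: {too_short.sum()}")
print(f"rows kept: {len(cleaned)} of {len(times)}")
```

Whether the upper outliers should be dropped as well, rather than only flagged, is a modelling decision; the sketch removes both groups simply to show the mechanics.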


## Distribution plots for all the features in the dataset

In [0]:
# creating a histogram for every variable in the dataset to visualize the distribution of each feature
sns.set()
Train.hist(bins = 60, figsize = (20,15))
plt.show()

**Observations:**


*   The majority of orders are placed using platform 3 and only a tiny portion using platform 4
*   Weekends are not as busy as weekdays
*   Most orders are delivered within a 10 km radius
*   Orders are spread out almost evenly across the days of the month, with no obvious popular day
*   Most temperatures fall between 20 and 27 degrees Celsius
*   The average rider rating is between 13 and 15 with very little variance
*   A large portion of riders have no or very few ratings
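Observations read off histograms, such as the platform and weekday points above, can be backed up with a few numbers. The sketch below computes category shares on a small made-up table; only the two column names come from the dataset, and weekday codes 6 and 7 are assumed to mean Saturday and Sunday, following the "Mo = 1" convention in the column name.

```python
import pandas as pd

# Toy orders table (values are made up).
orders = pd.DataFrame({
    "Platform Type": [3, 3, 3, 1, 2, 3, 4, 3],
    "Placement - Weekday (Mo = 1)": [1, 2, 3, 4, 5, 6, 1, 2],
})

# Share of orders placed through each platform.
platform_share = orders["Platform Type"].value_counts(normalize=True).sort_index()

# Share of orders placed on a weekend (codes 6 and 7, assuming Mo = 1).
is_weekend = orders["Placement - Weekday (Mo = 1)"] >= 6
weekend_share = is_weekend.mean()

print(platform_share)
print(f"weekend share: {weekend_share:.1%}")
```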

## Exploring the relationship between delivery time and a few features


### Day of the week

Next, the relationship between the day of the week on which orders are placed and the delivery times is visualized using a violin plot.

In [0]:
sns.set()
_ = plt.figure(figsize = (10,5))
_ = sns.violinplot(x = 'Placement - Weekday (Mo = 1)', y = 'Time from Pickup to Arrival', data = Train)
_ = plt.title('Delivery time vs Day of the week')
_ = plt.xlabel('Day of the week')
_ = plt.ylabel('Delivery Time (seconds)')

**Observations:**
 

*   Delivery times have a similar spread from Monday to Friday, with slight differences in how extreme the outliers are.

*   The delivery times during weekends have less variability and less extreme outliers. This could possibly be due to riders experiencing less traffic over weekends.
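The spread seen in a violin plot can also be summarised per weekday. This is a minimal sketch on made-up numbers: the short column names `weekday` and `seconds` and the helper `iqr` are hypothetical stand-ins, not part of the notebook.

```python
import pandas as pd

# Toy data: weekday 1 (Monday) and weekday 6 (Saturday); values are made up.
deliveries = pd.DataFrame({
    "weekday": [1, 1, 1, 1, 6, 6, 6, 6],
    "seconds": [600, 1200, 1800, 4800, 900, 1100, 1300, 1500],
})

def iqr(s):
    """Interquartile range: distance between the 75th and 25th percentiles."""
    return s.quantile(0.75) - s.quantile(0.25)

# One row per weekday with the median and two measures of spread.
spread = deliveries.groupby("weekday")["seconds"].agg(["median", "std", iqr])
print(spread)
```

A smaller `std` and `iqr` on weekend rows would support the observation about lower variability; the interquartile range is less sensitive to extreme outliers than the standard deviation.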

### Type of client serviced
Sendy handles deliveries for businesses and personal clients. Does the type of client serviced have an effect on the delivery time?

In [0]:
sns.set()
_ = plt.figure(figsize = (10,5))
_ = sns.boxplot(x = 'Personal or Business', y = 'Time from Pickup to Arrival', data = Train)
_ = plt.title('Delivery time vs type of client')
_ = plt.xlabel('Type of client')
_ = plt.ylabel('Delivery Time in seconds')

**Observations:**

*  There is almost no difference in the distribution of delivery times for personal clients compared to businesses, with the exception of a few outliers.
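A visual comparison of two groups can be complemented with a rank-based test. The sketch below applies SciPy's Mann-Whitney U test to made-up delivery times; this test is a suggestion added here rather than something the notebook uses, and only the two column names come from the dataset.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Toy delivery times in seconds for the two client types (values are made up).
deliveries = pd.DataFrame({
    "Personal or Business": ["Business"] * 6 + ["Personal"] * 6,
    "Time from Pickup to Arrival": [900, 1200, 1500, 1800, 2100, 2400,
                                    950, 1250, 1450, 1850, 2050, 2450],
})

is_business = deliveries["Personal or Business"] == "Business"
business = deliveries.loc[is_business, "Time from Pickup to Arrival"]
personal = deliveries.loc[~is_business, "Time from Pickup to Arrival"]

# Two-sided test: a large p-value gives no evidence that one group
# tends to have longer delivery times than the other.
result = mannwhitneyu(business, personal, alternative="two-sided")
print(f"U = {result.statistic:.1f}, p = {result.pvalue:.3f}")
```

The test makes no normality assumption, which suits the skewed delivery times. On a data set of this size even small differences can come out as statistically significant, so the p-value is best read alongside the box plot.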

### Platform used to place orders
Users can place orders using four different platforms. Is there a relationship between the type of platform used to place an order and the delivery time?

In [0]:
sns.set()
_ = plt.figure(figsize = (10,5))
_ = sns.boxplot(x = 'Platform Type', y = 'Time from Pickup to Arrival', data = Train)
_ = plt.title('Delivery time vs Platform type')
_ = plt.xlabel('Platform')
_ = plt.ylabel('Delivery Time in seconds')

**Observations:**

*  The delivery times for platform types 1 to 3 are similar, although more outliers are observed when orders are placed using platform 3.

*  Platform 4 