# Team 19 JHB #TeamName
---
<img src="https://github.com/Lizette95/notebook_images/blob/master/banner.png?raw=true" align="left">  


---

### Table of Contents
---
1. [Introduction](#intro)
 * Project Description
 * Datasets and Variables
---
2. [Module Imports](#imports)
---
3. [Exploratory Data Analysis](#EDA)
 * Load Data
 * Merge Datasets
 * Data Summary Statistics
 * Missing Data
 * Numerical Features
 * Categorical Features
---
4. [Data Preprocessing](#preprocessing)
---
5. [Modelling](#modelling)
---
6. [Performance Evaluation](#evaluation)
---
7. [Model Analysis](#analysis)
---
8. [Conclusion](#conclusion)

<a id="intro"></a>
## Introduction  

### Project Description

Machine learning is a branch of artificial intelligence, with roots dating as far back as 1952, that teaches computer systems to make decisions based on preexisting data. Mathematical algorithms are applied to training data, which allows a machine to identify patterns and make predictions on unseen data.

Regression is a popular supervised learning method that aims to predict the value of a dependent variable (y) from one or more independent variables (x). The relationship between the variables can be linear or nonlinear.
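As a minimal sketch of the idea, the snippet below fits a straight line to synthetic data with ordinary least squares. The data, the variable names (`distance_km`, `time_s`) and the chosen slope and intercept are all made up for illustration; they are not taken from the Sendy dataset.

```python
import numpy as np

# Hypothetical toy data: travel time grows linearly with distance, plus noise.
rng = np.random.default_rng(0)
distance_km = rng.uniform(1, 20, size=200)
time_s = 300 + 120 * distance_km + rng.normal(0, 30, size=200)

# Design matrix with an intercept column, solved by ordinary least squares.
X = np.column_stack([np.ones_like(distance_km), distance_km])
coef, *_ = np.linalg.lstsq(X, time_s, rcond=None)
intercept, slope = coef

# Predict the travel time for an unseen 10 km trip.
predicted = intercept + slope * 10.0
print(f"intercept={intercept:.1f}, slope={slope:.1f}, predicted={predicted:.1f}")
```

The fitted slope and intercept should land close to the values used to generate the data, which is what "learning a pattern from training data" means in the linear case.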

The Zindi challenge, hosted by Sendy in partnership with the insight2impact facility, is to build a regression model that accurately predicts the time of arrival for motorbike deliveries from the pickup point to the destination of the package. An accurate arrival time will enhance customer communication and customer experience. In addition, the solution will help businesses reduce the cost of trade through better management of resources and planning.


The dataset provided by Sendy includes order details and rider metrics based on orders made on the Sendy platform. The challenge is to predict the estimated time of arrival for orders, from pick-up to drop-off. The training dataset is a subset of over 20,000 orders and only includes direct orders (i.e. Sendy “express” orders) with bikes in Nairobi. 

### Datasets and Variables  

**train_data:** The dataset used to train our model  
**test_data:** The dataset to which we will apply our model  
**riders:** Contains unique rider IDs, number of orders, age, rating and number of ratings

**Order details**  
* Order No: Unique number identifying the order  
* User Id: Unique number identifying the customer on a platform  
* Vehicle Type: Limited to bikes for this competition; in practice, Sendy's service extends to trucks and vans  
* Platform Type: Platform used to place the order (there are 4 types)  
* Personal or Business: Customer type  

**Placement times**  
* Placement: Day of Month (i.e. 1-31)  
* Placement: Weekday (Monday = 1)  
* Placement: Time - Time of day the order was placed  

**Confirmation times**  
* Confirmation: Day of Month (i.e. 1-31)  
* Confirmation: Weekday (Monday = 1)  
* Confirmation: Time (Time of day the order was confirmed by a rider)  

**Arrival at Pickup times**  
* Arrival at Pickup: Day of Month (i.e. 1-31)  
* Arrival at Pickup: Weekday (Monday = 1)  
* Arrival at Pickup: Time (Time of day the rider arrived at the location to pick up the order - as marked by the rider through the Sendy application)  

**Pickup times**  
* Pickup: Day of Month (i.e. 1-31)  
* Pickup: Weekday (Monday = 1)  
* Pickup: Time (Time of day the rider picked up the order - as marked by the rider through the Sendy application)  

**Arrival at Destination times** (not in Test set)  
* Arrival at Destination: Day of Month (i.e. 1-31)  
* Arrival at Destination: Weekday (Monday = 1)  
* Arrival at Destination: Time (Time of day the rider arrived at the destination to deliver the order - as marked by the rider through the Sendy application)  

**Other order details**  
* Distance covered (KM): The distance from Pickup to Destination  
* Temperature: Temperature at the time of order placement in Degrees Celsius (measured every three hours)  
* Precipitation in Millimeters: Precipitation at the time of order placement (measured every three hours)  
* Pickup Latitude and Longitude: Latitude and longitude of pick up location  
* Destination Latitude and Longitude: Latitude and longitude of delivery location  
* Rider ID: ID of the Rider who accepted the order  
* Time from Pickup to Arrival: Time in seconds between ‘Pickup’ and ‘Arrival at Destination’    
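To make the definition of the target concrete, here is a small sketch that derives the elapsed seconds from a pickup time and an arrival time. The two rows are invented, and the 12-hour `%I:%M:%S %p` time format and the same-day assumption are guesses about the data rather than documented facts.

```python
import pandas as pd

# Hypothetical rows; the 12-hour time format is an assumption.
orders = pd.DataFrame({
    "Pickup - Time": ["10:27:30 AM", "11:44:09 AM"],
    "Arrival at Destination - Time": ["10:39:55 AM", "12:17:22 PM"],
})

fmt = "%I:%M:%S %p"
pickup = pd.to_datetime(orders["Pickup - Time"], format=fmt)
arrival = pd.to_datetime(orders["Arrival at Destination - Time"], format=fmt)

# Elapsed seconds between pickup and arrival (assumes both fall on the same day).
orders["Time from Pickup to Arrival"] = (arrival - pickup).dt.total_seconds().astype(int)
print(orders)
```

Because the target is computed directly from the arrival time, the arrival columns cannot be used as model features.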

**Rider metrics**  
* Rider ID: Unique number identifying the rider (same as in order details)  
* No of Orders: Number of Orders the rider has delivered  
* Age: Number of days since the rider delivered the first order  
* Average Rating: Average rating of the rider  
* No of Ratings: Number of ratings the rider has received. Rating an order is optional for the customer

<a id="imports"></a>
## Module Imports

In [0]:
# Ignore warnings
import warnings
warnings.simplefilter(action='ignore')

# Import modules
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Display
%matplotlib inline
sns.set_style("white")

<a id="EDA"></a>
## Exploratory Data Analysis

### Load Data

In [0]:
train_data = pd.read_csv('https://raw.githubusercontent.com/Lizette95/regression-predict-api-template/master/utils/data/train_data.csv')
test_data = pd.read_csv('https://raw.githubusercontent.com/Lizette95/regression-predict-api-template/master/utils/data/test_data.csv')
riders = pd.read_csv('https://raw.githubusercontent.com/Lizette95/regression-predict-api-template/master/utils/data/riders.csv')

### Merge Datasets

In [0]:
riders.head()

In [0]:
train_data = pd.merge(train_data,riders,on='Rider Id',how='left')
train_data.head()

In [0]:
test_data = pd.merge(test_data,riders,on='Rider Id',how='left')
test_data.head()
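A left merge keeps every order even when its rider is missing from the riders table. The sketch below, on invented toy frames, shows one way to check that nothing was duplicated or left unmatched; the `validate` and `indicator` arguments are optional extras that the cells above do not use.

```python
import pandas as pd

# Hypothetical toy frames mirroring the order and rider tables.
orders = pd.DataFrame({"Order No": ["A", "B", "C"], "Rider Id": ["R1", "R2", "R9"]})
riders_toy = pd.DataFrame({"Rider Id": ["R1", "R2"], "No_Of_Orders": [120, 45]})

# validate raises if a rider appears more than once in riders_toy;
# indicator adds a "_merge" column showing where each row came from.
merged = pd.merge(
    orders, riders_toy, on="Rider Id", how="left",
    validate="many_to_one", indicator=True,
)

# Orders whose rider has no entry in the rider table.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

If `unmatched` is empty on the real data, every order picked up its rider metrics and the row count equals that of the original order table.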

### Data Summary Statistics

**Train Data**

In [0]:
train_data.info()

In [0]:
train_data.describe()

**Test Data**

In [0]:
test_data.info()

In [0]:
test_data.describe()

### Missing Data

**Train Data**

In [0]:
train_data.isnull().sum()

In [0]:
fig,axis = plt.subplots(figsize=(12,5))
sns.heatmap(train_data.isnull(),yticklabels=False,cbar=False,cmap='brg')
plt.xticks(fontsize=12)
plt.show()

In [0]:
missing_train = pd.DataFrame(round((train_data.isnull().sum()/train_data.isnull().count())*100,2),columns=['% missing data'])
missing_train.sort_values(by='% missing data',ascending=False).head(2)
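The percentage of missing values can also be written more compactly, since the mean of a boolean mask is the fraction of `True` values. This is an equivalent formulation shown on a made-up frame, not a change to the result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two columns.
df = pd.DataFrame({
    "Temperature": [20.1, np.nan, 22.3, np.nan],
    "Precipitation in millimeters": [np.nan, np.nan, np.nan, 1.2],
    "Distance (KM)": [4, 16, 3, 9],
})

# Mean of the null mask = fraction missing; scale to percent and sort.
missing_pct = (df.isnull().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```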

* Remove Precipitation in millimeters  
* Impute Temperature (average temperature according to time of order placement)

In [0]:
# Drop 'Precipitation in millimeters' column
train_data.drop('Precipitation in millimeters',axis=1,inplace=True)
# Create 24h time bins
train_data['Placement - Time(bins)'] = pd.to_datetime(train_data['Placement - Time']).dt.strftime('%H')
# Impute temperature for missing values
train_data['Temperature'] = train_data['Temperature'].fillna(round(train_data.groupby('Placement - Time(bins)')['Temperature'].transform('mean'),1))
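The imputation step can be followed on a toy frame: each missing temperature is replaced by the mean of the observed temperatures placed in the same hour. The rows and the 12-hour time format are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the time format is an assumption.
df = pd.DataFrame({
    "Placement - Time": ["9:35:46 AM", "9:50:10 AM", "9:05:00 AM", "2:15:00 PM", "2:45:30 PM"],
    "Temperature": [20.0, np.nan, 22.0, 27.5, np.nan],
})

# Hour-of-day bin as a zero-padded string, e.g. "09" or "14".
df["hour"] = pd.to_datetime(df["Placement - Time"], format="%I:%M:%S %p").dt.strftime("%H")

# transform("mean") returns one value per row: the mean of that row's hour bin.
hourly_mean = df.groupby("hour")["Temperature"].transform("mean").round(1)
df["Temperature"] = df["Temperature"].fillna(hourly_mean)
print(df)
```

Note that an hour bin with no observed temperature at all has a mean of `NaN`, so rows in such a bin would stay missing.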

In [0]:
fig,axis = plt.subplots(figsize=(12,5))
sns.boxplot(x='Placement - Time(bins)',y='Temperature',data=train_data,palette='CMRmap')
plt.title('Distribution of Temperature (°C) at Different Times During the Day',fontsize=16)
plt.xlabel('Time of Day (Hour)',fontsize=12)
plt.ylabel('Temperature (°C)',fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

**Test Data**

In [0]:
test_data.isnull().sum()

In [0]:
fig,axis = plt.subplots(figsize=(12,5))
sns.heatmap(test_data.isnull(),yticklabels=False,cbar=False,cmap='brg')
plt.xticks(fontsize=12)
plt.show()

In [0]:
missing_test = pd.DataFrame(round((test_data.isnull().sum()/test_data.isnull().count())*100,2),columns=['% missing data'])
missing_test.sort_values(by='% missing data',ascending=False).head(2)

* Remove Precipitation in millimeters  
* Impute Temperature (average temperature according to time of order placement)

In [0]:
# Drop 'Precipitation in millimeters' column
test_data.drop('Precipitation in millimeters',axis=1,inplace=True)
# Create 24h time bins
test_data['Placement - Time(bins)'] = pd.to_datetime(test_data['Placement - Time']).dt.strftime('%H')
# Impute temperature for missing values
test_data['Temperature'] = test_data['Temperature'].fillna(round(test_data.groupby('Placement - Time(bins)')['Temperature'].transform('mean'),1))
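Since the train and test sets go through identical cleaning steps, one option is to wrap them in a single function so the two cannot drift apart. The helper name `clean_weather`, the toy frames and the time format below are hypothetical.

```python
import numpy as np
import pandas as pd

def clean_weather(df):
    """Return a cleaned copy: drop precipitation, impute temperature by hour."""
    # drop returns a new frame, so the caller's frame is left untouched.
    out = df.drop(columns="Precipitation in millimeters", errors="ignore")
    out["Placement - Time(bins)"] = pd.to_datetime(
        out["Placement - Time"], format="%I:%M:%S %p"
    ).dt.strftime("%H")
    hourly_mean = out.groupby("Placement - Time(bins)")["Temperature"].transform("mean")
    out["Temperature"] = out["Temperature"].fillna(hourly_mean.round(1))
    return out

# Hypothetical toy frames standing in for the train and test sets.
train_toy = pd.DataFrame({
    "Placement - Time": ["8:05:00 AM", "8:40:00 AM", "1:10:00 PM"],
    "Temperature": [18.0, np.nan, 26.4],
    "Precipitation in millimeters": [np.nan, np.nan, 0.5],
})
test_toy = pd.DataFrame({
    "Placement - Time": ["1:30:00 PM", "1:45:00 PM"],
    "Temperature": [np.nan, 25.0],
    "Precipitation in millimeters": [np.nan, np.nan],
})

train_clean = clean_weather(train_toy)
test_clean = clean_weather(test_toy)
print(train_clean)
print(test_clean)
```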

In [0]:
fig,axis = plt.subplots(figsize=(12,5))
sns.boxplot(x='Placement - Time(bins)',y='Temperature',data=test_data,palette='CMRmap')
plt.title('Distribution of Temperature (°C) at Different Times During the Day',fontsize=16)
plt.xlabel('Time of Day (Hour)',fontsize=12)
plt.ylabel('Temperature (°C)',fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

### Numerical Features

* Placement - Time?  
* Confirmation - Time?  
* Arrival at Pickup - Time?  
* Pickup - Time?  
* Distance (KM)  
* Temperature  
* Pickup Lat  
* Pickup Long  
* Destination Lat  
* Destination Long  
* No_Of_Orders  
* Age  
* Average_Rating  
* No_of_Ratings  

**Target variable:**  

Time from Pickup to Arrival (ignore 'Arrival at Destination' columns for features)
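One way to keep the arrival columns out of the feature set is to select them by their shared name prefix. The sketch below uses an invented two-row frame, and the exact column names are assumed to start with `Arrival at Destination`.

```python
import pandas as pd

# Hypothetical toy frame; column names are assumed to follow this pattern.
train_toy = pd.DataFrame({
    "Distance (KM)": [4, 16],
    "Arrival at Destination - Day of Month": [9, 12],
    "Arrival at Destination - Weekday (Mo = 1)": [5, 5],
    "Arrival at Destination - Time": ["10:39:55 AM", "12:17:22 PM"],
    "Time from Pickup to Arrival": [745, 1993],
})

target = "Time from Pickup to Arrival"

# Columns only known after delivery would leak the answer into the features.
leak_cols = [c for c in train_toy.columns if c.startswith("Arrival at Destination")]

X = train_toy.drop(columns=leak_cols + [target])
y = train_toy[target]
print(X.columns.tolist())
```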

In [0]:
# Correlation of numerical variables & target variable
train_numerical = train_data[['Distance (KM)','Temperature','Pickup Lat','Pickup Long','Destination Lat','Destination Long','No_Of_Orders','Age','Average_Rating','No_of_Ratings','Time from Pickup to Arrival']]
cmatrix = train_numerical.corr()
sns.heatmap(cmatrix, square=True, cmap='BuPu')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

In [0]:
pd.DataFrame(cmatrix['Time from Pickup to Arrival'].abs().sort_values(ascending=False))
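The ranking step can be illustrated on synthetic data where the answer is known in advance. Here the target is generated from distance only, so distance should rank first and temperature should show a correlation near zero. All values are made up and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends on distance but not on temperature.
rng = np.random.default_rng(42)
n = 500
distance = rng.uniform(1, 30, n)
temperature = rng.normal(23, 3, n)
time_s = 200 + 100 * distance + rng.normal(0, 200, n)

toy = pd.DataFrame({
    "Distance (KM)": distance,
    "Temperature": temperature,
    "Time from Pickup to Arrival": time_s,
})

target = "Time from Pickup to Arrival"

# Absolute Pearson correlation of each feature with the target, strongest first.
ranking = toy.corr()[target].drop(target).abs().sort_values(ascending=False)
print(ranking)
```

Pearson correlation only captures linear association, so a low value does not rule out a nonlinear relationship.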

In [0]:
sns.pairplot(train_numerical)
plt.show()

In [0]:
fig,axis = plt.subplots(figsize=(15, 5))
sns.countplot(x='Arrival at Pickup - Day of Month',data=train_data,palette='plasma')
plt.xlabel('Arrival at Pickup - Day of Month',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

In [0]:
fig,axis = plt.subplots(ncols=2, figsize=(15, 5))
sns.countplot(x='Platform Type',data=train_data,palette='nipy_spectral',ax=axis[0])
sns.countplot(x='Personal or Business',data=train_data,palette='magma',ax=axis[1])
axis[0].set_title('Number of Deliveries Per Platform Type',fontsize=14)
axis[1].set_title('Number of Personal vs. Business Deliveries',fontsize=14)
axis[0].set_xlabel('Platform Type',fontsize=12)
axis[1].set_xlabel('Personal or Business',fontsize=12)
axis[0].set_ylabel('Number of Deliveries',fontsize=12)
axis[1].set_ylabel('Number of Deliveries',fontsize=12)
# plt.xticks/plt.yticks only affect the last axes, so set tick sizes per subplot
for ax in axis:
    ax.tick_params(labelsize=12)
plt.show()

In [0]:
fig,axis = plt.subplots(figsize=(12,5))
# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
sns.histplot(train_data['Time from Pickup to Arrival'],kde=True,color='purple')
plt.title('Distribution of Time From Pickup to Arrival',fontsize=14)
plt.xlabel('Time From Pickup to Arrival',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
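Numeric summaries can complement the histogram. The sketch below computes percentiles and skewness on a synthetic, right-skewed sample; the lognormal parameters are arbitrary and are not fitted to the real target.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for the target (in seconds).
rng = np.random.default_rng(1)
seconds = pd.Series(
    rng.lognormal(mean=7.2, sigma=0.6, size=1000),
    name="Time from Pickup to Arrival",
)

# Percentiles describe the bulk and the tails; skew > 0 means a long right tail.
summary = seconds.describe(percentiles=[0.05, 0.5, 0.95])
skewness = seconds.skew()

# Minutes are often easier to read than seconds.
minutes = seconds / 60
print(summary)
print(f"skewness: {skewness:.2f}, median minutes: {minutes.median():.1f}")
```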

### Categorical Features

* Platform Type  
* Personal or Business  
* Placement - Day of Month  
* Placement - Weekday (Mo = 1)  
* Confirmation - Day of Month  
* Confirmation - Weekday (Mo = 1)  
* Arrival at Pickup - Day of Month  
* Arrival at Pickup - Weekday (Mo = 1)  
* Pickup - Day of Month  
* Pickup - Weekday (Mo = 1)  
* Rider Id

In [0]:
sns.catplot(x='Platform Type', y='Time from Pickup to Arrival',data=train_data.sort_values('Platform Type'),kind='box',height=5,aspect=2,palette='nipy_spectral')
plt.title('Distribution of Time from Pickup to Arrival Per Platform Type',fontsize=14)
plt.xlabel('Platform Type',fontsize=12)
plt.ylabel('Time from Pickup to Arrival',fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
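The box plot can be backed by a per-category table of counts and central values, which helps when one category has very few orders. The rows below are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: delivery times in seconds for three platform types.
toy = pd.DataFrame({
    "Platform Type": [1, 1, 2, 3, 3, 3],
    "Time from Pickup to Arrival": [600, 900, 1500, 1200, 1800, 2400],
})

# Count, median and mean of the target within each platform type.
per_platform = (
    toy.groupby("Platform Type")["Time from Pickup to Arrival"]
    .agg(["count", "median", "mean"])
)
print(per_platform)
```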

<a id="preprocessing"></a>
## Data Preprocessing

<a id="modelling"></a>
## Modelling

<a id="evaluation"></a>
## Performance Evaluation

<a id="analysis"></a>
## Model Analysis

<a id="conclusion"></a>
## Conclusion

Thank you and goodnight...