<a href="https://colab.research.google.com/github/BileOara/REGRESSION/blob/master/Predict_Regression_LM_v02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Statement

#### Context
**Economies are better when logistics is efficient and affordable**

Sendy, in partnership with insight2impact facility, is hosting a Zindi challenge to predict the estimated time of delivery of orders, from the point of driver pickup to the point of arrival at final destination. 
Sendy helps men and women behind every type of business to trade easily, deliver more competitively, and build extraordinary businesses.

#### Why Solve this problem?
The solution will help Sendy enhance customer communication and improve the reliability of its service, which will ultimately improve customer experience. In addition, the solution will enable Sendy to realise cost savings, and ultimately reduce the cost of doing business, through improved resource management and planning for order scheduling.

An accurate arrival time prediction will help all businesses to improve their logistics and communicate an accurate time to their customers. 

#### What will be done?
Given the details of a Sendy order, historic data will be used to predict the time of arrival of a rider at the destination of a package as accurately as possible.

#### How will this be done?

A linear regression model will be built to predict the delivery time from package pickup to arrival at the final destination.
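As a rough illustration of the idea, the sketch below fits a one-variable least-squares line to the five orders shown in the `train_df.head()` preview in section 1.3, using `Distance (KM)` as the only feature. It uses `numpy.polyfit` as a stand-in; this is an assumption made for brevity and not necessarily the modelling approach used in the rest of the notebook, and five rows are far too few to draw conclusions from.

```python
import numpy as np

# Values copied from the train_df.head() preview (section 1.3).
# The target is in seconds: for the first order, 10:27:30 AM to
# 10:39:55 AM is 745 seconds.
distance_km = np.array([4, 16, 3, 9, 9], dtype=float)
time_s = np.array([745, 1993, 455, 1341, 1214], dtype=float)

# Fit time_s = slope * distance_km + intercept by ordinary least squares.
slope, intercept = np.polyfit(distance_km, time_s, deg=1)

# Predict the delivery time for a hypothetical 10 km order.
predicted_time_s = slope * 10 + intercept
print(f'slope={slope:.1f} s/km, intercept={intercept:.1f} s, '
      f'prediction for 10 km={predicted_time_s:.0f} s')
```

The slope can be read as the extra seconds per additional kilometre, and the intercept as the fixed time component that does not depend on distance.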


# 1. Data Pre-Processing

## 1.1 Load Libraries

In [0]:
# Import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Figures inline and set visualization style
%matplotlib inline
sns.set()

## 1.2 Check Datasets

In [5]:
# List all files in a directory using os.listdir
basepath = '/Zindi Data/'
for entry in os.listdir(basepath):
    if os.path.isfile(os.path.join(basepath, entry)):
        print(entry)

SampleSubmission.csv
Riders.csv
VariableDefinitions.csv
Test.csv
Train.csv


#### Check that all datasets are accounted for

The files for download according to the hackathon:

* `Train.csv` - the dataset used to train the model
* `Test.csv` - the dataset to which the model will be applied
* `Riders.csv` - contains unique rider Ids, number of orders, age, rating and number of ratings
* `VariableDefinitions.csv` - Definitions of variables in the Train, Test and Riders files

The above files are accounted for.

An additional file, `SampleSubmission.csv`, was also available for download.
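The visual check above could also be automated. The following is a minimal sketch: the helper name `find_missing_files` and the `EXPECTED_FILES` set are introduced here for illustration only, and the demonstration runs against a temporary directory so that it does not depend on the real data folder.

```python
import os
import tempfile

# File names listed by the hackathon (see the bullet list above).
EXPECTED_FILES = {'Train.csv', 'Test.csv', 'Riders.csv',
                  'VariableDefinitions.csv'}

def find_missing_files(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder`."""
    present = {entry for entry in os.listdir(folder)
               if os.path.isfile(os.path.join(folder, entry))}
    return sorted(set(expected) - present)

# Demonstration on a throwaway directory in which one file is
# deliberately left out; in the notebook, pass the data folder instead.
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ('Train.csv', 'Test.csv', 'Riders.csv'):
        open(os.path.join(demo_folder, name), 'w').close()
    missing = find_missing_files(demo_folder)

print(missing)
```

An empty list means every expected file is accounted for.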


## 1.3 Import the datasets

In [0]:
# import Datasets
data_folder = '/Zindi Data/'

train_df = pd.read_csv(data_folder + 'Train.csv')
test_df = pd.read_csv(data_folder + 'Test.csv')
riders_df = pd.read_csv(data_folder + 'Riders.csv')
variable_definitions_df = pd.read_csv(data_folder + 'VariableDefinitions.csv', header=None,
                                      names=['Variable', 'Definition'])
sample_submission_df = pd.read_csv(data_folder + 'SampleSubmission.csv')

In [7]:
# check training data
train_df.head()

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Arrival at Pickup - Time,Pickup - Day of Month,Pickup - Weekday (Mo = 1),Pickup - Time,Arrival at Destination - Day of Month,Arrival at Destination - Weekday (Mo = 1),Arrival at Destination - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival
0,Order_No_4211,User_Id_633,Bike,3,Business,9,5,9:35:46 AM,9,5,9:40:10 AM,9,5,10:04:47 AM,9,5,10:27:30 AM,9,5,10:39:55 AM,4,20.4,,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745
1,Order_No_25375,User_Id_2285,Bike,3,Personal,12,5,11:16:16 AM,12,5,11:23:21 AM,12,5,11:40:22 AM,12,5,11:44:09 AM,12,5,12:17:22 PM,16,26.4,,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993
2,Order_No_1899,User_Id_265,Bike,3,Business,30,2,12:39:25 PM,30,2,12:42:44 PM,30,2,12:49:34 PM,30,2,12:53:03 PM,30,2,1:00:38 PM,3,,,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455
3,Order_No_9336,User_Id_1402,Bike,3,Business,15,5,9:25:34 AM,15,5,9:26:05 AM,15,5,9:37:56 AM,15,5,9:43:06 AM,15,5,10:05:27 AM,9,19.2,,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855,1341
4,Order_No_27883,User_Id_1737,Bike,1,Personal,13,1,9:55:18 AM,13,1,9:56:18 AM,13,1,10:03:53 AM,13,1,10:05:23 AM,13,1,10:25:37 AM,9,15.4,,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770,1214


In [8]:
# check test data
test_df.head()

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Arrival at Pickup - Time,Pickup - Day of Month,Pickup - Weekday (Mo = 1),Pickup - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id
0,Order_No_19248,User_Id_3355,Bike,3,Business,27,3,4:44:10 PM,27,3,4:44:29 PM,27,3,4:53:04 PM,27,3,5:06:47 PM,8,,,-1.333275,36.870815,-1.305249,36.82239,Rider_Id_192
1,Order_No_12736,User_Id_3647,Bike,3,Business,17,5,12:57:35 PM,17,5,12:59:17 PM,17,5,1:20:27 PM,17,5,1:25:37 PM,5,,,-1.272639,36.794723,-1.277007,36.823907,Rider_Id_868
2,Order_No_768,User_Id_2154,Bike,3,Business,27,4,11:08:14 AM,27,4,11:25:05 AM,27,4,11:33:20 AM,27,4,11:57:54 AM,5,22.8,,-1.290894,36.822971,-1.276574,36.851365,Rider_Id_26
3,Order_No_15332,User_Id_2910,Bike,3,Business,17,1,1:51:35 PM,17,1,1:53:27 PM,17,1,2:02:41 PM,17,1,2:16:52 PM,5,24.5,,-1.290503,36.809646,-1.303382,36.790658,Rider_Id_685
4,Order_No_21373,User_Id_1205,Bike,3,Business,11,2,11:30:28 AM,11,2,11:34:45 AM,11,2,11:47:19 AM,11,2,11:56:04 AM,6,24.4,,-1.281081,36.814423,-1.266467,36.792161,Rider_Id_858


In [9]:
# check riders
riders_df.head()

Unnamed: 0,Rider Id,No_Of_Orders,Age,Average_Rating,No_of_Ratings
0,Rider_Id_396,2946,2298,14.0,1159
1,Rider_Id_479,360,951,13.5,176
2,Rider_Id_648,1746,821,14.3,466
3,Rider_Id_753,314,980,12.5,75
4,Rider_Id_335,536,1113,13.7,156


In [10]:
# Check variable definitions
variable_definitions_df.head()

Unnamed: 0,Variable,Definition
0,Order No,Unique number identifying the order
1,User Id,Unique number identifying the customer on a pl...
2,Vehicle Type,"For this competition limited to bikes, however..."
3,Platform Type,"Platform used to place the order, there are 4 ..."
4,Personal or Business,Customer type


In [11]:
# check sample submission
sample_submission_df.head()

Unnamed: 0,Order_No,Time from Pickup to Arrival
0,Order_No_19248,567.0
1,Order_No_12736,4903.0
2,Order_No_768,5649.0
3,Order_No_15332,
4,Order_No_21373,


#### Preliminary observation
Based on the DataFrame previews, it can be assumed that all data were imported successfully. The following data will form part of the regression analysis:
* `train_df`
* `test_df`
* `riders_df`

`variable_definitions_df` provides definitions of the variables

`sample_submission_df` is a template for the submission of model predictions for this project
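Since `riders_df` shares the `Rider Id` column with `train_df` and `test_df`, one plausible way to bring rider information into the analysis is a left merge on that key. The sketch below is an assumption about how the frames could be combined, not a step confirmed by this notebook; the toy rows and their values are made up purely for illustration.

```python
import pandas as pd

# Made-up toy frames that mimic the shape of the real data.
orders = pd.DataFrame({
    'Order No': ['Order_1', 'Order_2', 'Order_3'],
    'Rider Id': ['Rider_A', 'Rider_B', 'Rider_A'],
})
riders = pd.DataFrame({
    'Rider Id': ['Rider_A', 'Rider_B'],
    'No_Of_Orders': [120, 45],
})

# A left merge keeps every order; validate='many_to_one' raises an
# error if a rider appears more than once in the riders table.
merged = orders.merge(riders, on='Rider Id', how='left',
                      validate='many_to_one')

# The merge must not add or drop orders.
assert len(merged) == len(orders)
print(merged)
```

A left merge is used here so that an order is never lost when its rider is absent from the riders table; such rows would simply carry missing rider attributes.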

## 1.4 Assess Data

### 1.4.1 Assess variable definitions

In [0]:
# Get variable definitions
print(f'Number of variables: {len(variable_definitions_df)}\n',
     '======================================= \n')
for var, definition in variable_definitions_df.values:
    print(f'{var}:\nDef:{definition}\n')

Number of variables: 36

Order No:
Def:Unique number identifying the order

User Id:
Def:Unique number identifying the customer on a platform

Vehicle Type:
Def:For this competition limited to bikes, however in practice Sendy service extends to trucks and vans

Platform Type:
Def:Platform used to place the order, there are 4 types

Personal or Business:
Def:Customer type

Placement - Day of Month:
Def:Placement - Day of Month i.e 1-31

Placement - Weekday (Mo = 1):
Def:Placement - Weekday (Monday = 1)

Placement - Time:
Def:Placement - Time - Time of day the order was placed

Confirmation - Day of Month:
Def:Confirmation - Day of Month i.e 1-31

Confirmation - Weekday (Mo = 1):
Def:Confirmation - Weekday (Monday = 1)

Confirmation - Time:
Def:Confirmation - Time - Time of day the order was confirmed by a rider

Arrival at Pickup - Day of Month:
Def:Arrival at Pickup - Day of Month i.e 1-31

Arrival at Pickup - Weekday (Mo = 1):
Def:Arrival at Pickup - Weekday (Monday = 1)

Arrival at P

### 1.4.2 Assess dataset dimensions, variables, missing data and data types

In [0]:
# Define a function that summarises missing values per column
def missing_values_table(df):
    """
    Take a dataframe as input and return a dataframe with the number
    and percentage of missing values for each column that has any.
    """
    # Count missing values per column and express them as a share of all rows
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)

    # Combine both series into one table with descriptive column names
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Keep only columns with missing values, highest share first
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

In [13]:
# Assess Train dataset dimensions, variables and datatypes
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21201 entries, 0 to 21200
Data columns (total 29 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Order No                                   21201 non-null  object 
 1   User Id                                    21201 non-null  object 
 2   Vehicle Type                               21201 non-null  object 
 3   Platform Type                              21201 non-null  int64  
 4   Personal or Business                       21201 non-null  object 
 5   Placement - Day of Month                   21201 non-null  int64  
 6   Placement - Weekday (Mo = 1)               21201 non-null  int64  
 7   Placement - Time                           21201 non-null  object 
 8   Confirmation - Day of Month                21201 non-null  int64  
 9   Confirmation - Weekday (Mo = 1)            21201 non-null  int64  
 10  Confirmation - Time   

In [14]:
# Assess Train dataset missing values
missing_values_table(train_df)

Your selected dataframe has 29 columns.
There are 2 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
Precipitation in millimeters,20649,97.4
Temperature,4366,20.6


In [15]:
# Assess Test dataset dimensions, variables and datatypes
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7068 entries, 0 to 7067
Data columns (total 25 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Order No                              7068 non-null   object 
 1   User Id                               7068 non-null   object 
 2   Vehicle Type                          7068 non-null   object 
 3   Platform Type                         7068 non-null   int64  
 4   Personal or Business                  7068 non-null   object 
 5   Placement - Day of Month              7068 non-null   int64  
 6   Placement - Weekday (Mo = 1)          7068 non-null   int64  
 7   Placement - Time                      7068 non-null   object 
 8   Confirmation - Day of Month           7068 non-null   int64  
 9   Confirmation - Weekday (Mo = 1)       7068 non-null   int64  
 10  Confirmation - Time                   7068 non-null   object 
 11  Arrival at Pickup

In [16]:
# Assess Test dataset missing values
missing_values_table(test_df)

Your selected dataframe has 25 columns.
There are 2 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
Precipitation in millimeters,6869,97.2
Temperature,1437,20.3
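The train set has 29 columns and the test set has 25. Judging by the previews in section 1.3, the four extra train columns are the three `Arrival at Destination` fields and the target, `Time from Pickup to Arrival`. A small helper can confirm this; the name `columns_only_in` is hypothetical, and the demo frames below are empty stand-ins holding only a few of the real column names.

```python
import pandas as pd

def columns_only_in(left, right):
    """Return the columns of `left` that `right` lacks, in `left` order."""
    return [col for col in left.columns if col not in right.columns]

# Empty stand-in frames; in the notebook, pass train_df and test_df.
train_demo = pd.DataFrame(columns=['Order No', 'Pickup - Time',
                                   'Arrival at Destination - Time',
                                   'Time from Pickup to Arrival'])
test_demo = pd.DataFrame(columns=['Order No', 'Pickup - Time'])

only_in_train = columns_only_in(train_demo, test_demo)
print(only_in_train)
```

Columns that exist only in the train set cannot be used as model features, because they are unavailable when predicting on the test set.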


In [17]:
# Assess rider dataset dimensions, variables and datatypes
riders_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960 entries, 0 to 959
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Rider Id        960 non-null    object 
 1   No_Of_Orders    960 non-null    int64  
 2   Age             960 non-null    int64  
 3   Average_Rating  960 non-null    float64
 4   No_of_Ratings   960 non-null    int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 37.6+ KB


In [18]:
# Assess riders dataset missing values
missing_values_table(riders_df)

Your selected dataframe has 5 columns.
There are 0 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
