
# Problem Statement

#### Context
**Economies are better when logistics is efficient and affordable**

Sendy, in partnership with insight2impact facility, is hosting a Zindi challenge to predict the estimated time of delivery of orders, from the point of driver pickup to the point of arrival at final destination. 
Sendy helps men and women behind every type of business to trade easily, deliver more competitively, and build extraordinary businesses.

#### Why Solve this problem?
The solution will help Sendy enhance customer communication and improve the reliability of its service, which will ultimately improve customer experience. In addition, it will enable Sendy to realise cost savings and reduce the cost of doing business through improved resource management and planning for order scheduling.

An accurate arrival time prediction will help all businesses to improve their logistics and communicate an accurate time to their customers. 

#### What will be done?
Given the details of a Sendy order, historic data will be used to predict the time of arrival of a rider at the destination of a package as accurately as possible.

#### How will this be done?

A linear regression model will be built to predict the delivery time, from picking up a package to arriving at the final destination.
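As a rough illustration of the approach, the sketch below fits a linear regression on synthetic data in which delivery time grows with distance. The data, the coefficients (300 and 120) and the variable names are invented for the example and are not taken from the Sendy dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, illustrative data: time = 300 + 120 * distance + noise.
rng = np.random.default_rng(0)
distance_km = rng.uniform(1, 20, size=200).reshape(-1, 1)
time_seconds = 300 + 120 * distance_km.ravel() + rng.normal(0, 30, size=200)

# Fit an ordinary least squares model on the synthetic data.
model = LinearRegression()
model.fit(distance_km, time_seconds)

slope = model.coef_[0]
intercept = model.intercept_

# Predict the delivery time for a hypothetical 5 km order.
predicted = model.predict(np.array([[5.0]]))
print(slope, intercept, predicted)
```

The fitted slope and intercept should land close to the values used to generate the data, which is the behaviour the notebook relies on when it fits the real predictors.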


# 1. Data Pre-Processing

## 1.1 Load Libraries

In [0]:
# Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Figures inline and set visualization style
%matplotlib inline
sns.set()

## 1.2 Check Datasets

In [0]:
# List all files in a directory using os.listdir
basepath = 'Zindi Data/'
for entry in os.listdir(basepath):
    if os.path.isfile(os.path.join(basepath, entry)):
        print(entry)

Test.csv
SampleSubmission.csv
Train.csv
Riders.csv
VariableDefinitions.csv


#### Check that all datasets are accounted for

The files for download according to the hackathon:

* `Train.csv` - the dataset used to train the model
* `Test.csv` - the dataset to which the model will be applied
* `Riders.csv` - contains unique rider Ids, number of orders, age, rating and number of ratings
* `VariableDefinitions.csv` - Definitions of variables in the Train, Test and Riders files

The above files are accounted for.

An additional file, `SampleSubmission.csv`, was available for download as well.
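The manual check above could also be automated. The following is a minimal sketch, assuming the four file names from the hackathon listing; the helper `find_missing_files` is a hypothetical name, and the demonstration runs against a temporary folder rather than `Zindi Data/`.

```python
import os
import tempfile

# File names taken from the hackathon listing above.
EXPECTED_FILES = {"Train.csv", "Test.csv", "Riders.csv", "VariableDefinitions.csv"}

def find_missing_files(folder, expected=EXPECTED_FILES):
    """Return a sorted list of expected file names that are absent from folder."""
    present = {
        entry for entry in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, entry))
    }
    return sorted(set(expected) - present)

# Demonstration on a throwaway folder that holds only two of the four files.
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ("Train.csv", "Test.csv"):
        open(os.path.join(demo_folder, name), "w").close()
    missing = find_missing_files(demo_folder)

print(missing)
```

Calling `find_missing_files('Zindi Data/')` in this notebook would be expected to return an empty list when every file is in place.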


## 1.3 Import the datasets

In [0]:
# import Datasets
data_folder = 'Zindi Data/'

train_df = pd.read_csv(data_folder + 'Train.csv')
test_df = pd.read_csv(data_folder + 'Test.csv')
riders_df = pd.read_csv(data_folder + 'Riders.csv')
variable_definitions_df = pd.read_csv(data_folder + 'VariableDefinitions.csv', header=None,
                                      names=['Variable', 'Definition'])
sample_submission_df = pd.read_csv(data_folder + 'SampleSubmission.csv')

In [0]:
# check training data
train_df.head()

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),...,Arrival at Destination - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival
0,Order_No_4211,User_Id_633,Bike,3,Business,9,5,9:35:46 AM,9,5,...,10:39:55 AM,4,20.4,,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745
1,Order_No_25375,User_Id_2285,Bike,3,Personal,12,5,11:16:16 AM,12,5,...,12:17:22 PM,16,26.4,,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993
2,Order_No_1899,User_Id_265,Bike,3,Business,30,2,12:39:25 PM,30,2,...,1:00:38 PM,3,,,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455
3,Order_No_9336,User_Id_1402,Bike,3,Business,15,5,9:25:34 AM,15,5,...,10:05:27 AM,9,19.2,,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855,1341
4,Order_No_27883,User_Id_1737,Bike,1,Personal,13,1,9:55:18 AM,13,1,...,10:25:37 AM,9,15.4,,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770,1214


In [0]:
# check test data
test_df.head()

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),...,Pickup - Weekday (Mo = 1),Pickup - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id
0,Order_No_19248,User_Id_3355,Bike,3,Business,27,3,4:44:10 PM,27,3,...,3,5:06:47 PM,8,,,-1.333275,36.870815,-1.305249,36.82239,Rider_Id_192
1,Order_No_12736,User_Id_3647,Bike,3,Business,17,5,12:57:35 PM,17,5,...,5,1:25:37 PM,5,,,-1.272639,36.794723,-1.277007,36.823907,Rider_Id_868
2,Order_No_768,User_Id_2154,Bike,3,Business,27,4,11:08:14 AM,27,4,...,4,11:57:54 AM,5,22.8,,-1.290894,36.822971,-1.276574,36.851365,Rider_Id_26
3,Order_No_15332,User_Id_2910,Bike,3,Business,17,1,1:51:35 PM,17,1,...,1,2:16:52 PM,5,24.5,,-1.290503,36.809646,-1.303382,36.790658,Rider_Id_685
4,Order_No_21373,User_Id_1205,Bike,3,Business,11,2,11:30:28 AM,11,2,...,2,11:56:04 AM,6,24.4,,-1.281081,36.814423,-1.266467,36.792161,Rider_Id_858


In [0]:
# check riders
riders_df.head()

Unnamed: 0,Rider Id,No_Of_Orders,Age,Average_Rating,No_of_Ratings
0,Rider_Id_396,2946,2298,14.0,1159
1,Rider_Id_479,360,951,13.5,176
2,Rider_Id_648,1746,821,14.3,466
3,Rider_Id_753,314,980,12.5,75
4,Rider_Id_335,536,1113,13.7,156


In [0]:
# Check variable definitions
variable_definitions_df.head()

Unnamed: 0,Variable,Definition
0,Order No,Unique number identifying the order
1,User Id,Unique number identifying the customer on a pl...
2,Vehicle Type,"For this competition limited to bikes, however..."
3,Platform Type,"Platform used to place the order, there are 4 ..."
4,Personal or Business,Customer type


In [0]:
# check sample submission
sample_submission_df.head()

Unnamed: 0,Order_No,Time from Pickup to Arrival
0,Order_No_19248,567.0
1,Order_No_12736,4903.0
2,Order_No_768,5649.0
3,Order_No_15332,
4,Order_No_21373,


#### Preliminary observation
Based on the DataFrame previews, it can be assumed that all data were successfully imported. The following data will form part of the regression analysis:
* `train_df`
* `test_df`
* `riders_df`

`variable_definitions_df` provides definitions of the variables

`sample_submission_df` is a template for the submission of model predictions for this project
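Since `Riders.csv` is keyed on `Rider Id`, one way the rider attributes could be brought alongside the orders is a left join. This notebook does not show such a merge, so the sketch below is only an illustration on toy frames whose values are made up.

```python
import pandas as pd

# Toy frames that mimic a few of the columns previewed above (values are illustrative).
orders = pd.DataFrame({
    "Order No": ["Order_No_1", "Order_No_2", "Order_No_3"],
    "Rider Id": ["Rider_Id_396", "Rider_Id_479", "Rider_Id_396"],
    "Distance (KM)": [4, 16, 3],
})
riders = pd.DataFrame({
    "Rider Id": ["Rider_Id_396", "Rider_Id_479"],
    "No_Of_Orders": [2946, 360],
    "Average_Rating": [14.0, 13.5],
})

# A left join keeps every order; validate raises if a rider appears twice in riders.
orders_with_riders = orders.merge(
    riders, on="Rider Id", how="left", validate="many_to_one"
)
print(orders_with_riders)
```

The `validate="many_to_one"` argument is a guard: it makes pandas raise an error instead of silently duplicating orders if the rider table ever contains repeated ids.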

## 1.4 Assess Data

### 1.4.1 Assess variable definitions

In [0]:
# Get variable definitions
print(f'Number of variables: {len(variable_definitions_df)}\n',
     '======================================= \n')
# enumerate numbers the variables from 1 without a manual counter
for var_num, (var, definition) in enumerate(variable_definitions_df.values, start=1):
    print(f'Variable {var_num} \nName: {var}: \nDefinition: {definition} \n')

Number of variables: 36

Variable 1 
Name: Order No: 
Definition: Unique number identifying the order 

Variable 2 
Name: User Id: 
Definition: Unique number identifying the customer on a platform 

Variable 3 
Name: Vehicle Type: 
Definition: For this competition limited to bikes, however in practice Sendy service extends to trucks and vans 

Variable 4 
Name: Platform Type: 
Definition: Platform used to place the order, there are 4 types 

Variable 5 
Name: Personal or Business: 
Definition: Customer type 

Variable 6 
Name: Placement - Day of Month: 
Definition: Placement - Day of Month i.e 1-31 

Variable 7 
Name: Placement - Weekday (Mo = 1): 
Definition: Placement - Weekday (Monday = 1) 

Variable 8 
Name: Placement - Time: 
Definition: Placement - Time - Time of day the order was placed 

Variable 9 
Name: Confirmation - Day of Month: 
Definition: Confirmation - Day of Month i.e 1-31 

Variable 10 
Name: Confirmation - Weekday (Mo = 1): 
Definition: Confirmation - Weekday (Monda

### 1.4.2 Assess dataset dimensions, variables, missing data and data types

In [0]:
# define function to calculate missing values
def missing_values_table(df):
    """
    This function takes a dataframe as input and returns a dataframe
    of the number and percentage of missing values as its output.
    """
    # number of missing values per column
    mis_val = df.isnull().sum()

    # percentage of missing values per column
    mis_val_percent = 100 * mis_val / len(df)

    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    mis_val_table_ren_columns = mis_val_table.rename(
    columns = {0 : 'Missing Values', 1 : '% of Total Values'})

    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
    '% of Total Values', ascending=False).round(1)

    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
        "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

In [0]:
# Assess Train dataset dimensions, variables and datatypes
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21201 entries, 0 to 21200
Data columns (total 29 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Order No                                   21201 non-null  object 
 1   User Id                                    21201 non-null  object 
 2   Vehicle Type                               21201 non-null  object 
 3   Platform Type                              21201 non-null  int64  
 4   Personal or Business                       21201 non-null  object 
 5   Placement - Day of Month                   21201 non-null  int64  
 6   Placement - Weekday (Mo = 1)               21201 non-null  int64  
 7   Placement - Time                           21201 non-null  object 
 8   Confirmation - Day of Month                21201 non-null  int64  
 9   Confirmation - Weekday (Mo = 1)            21201 non-null  int64  
 10  Confirmation - Time   

In [0]:
# Assess Train dataset missing values
missing_values_table(train_df)

Your selected dataframe has 29 columns.
There are 2 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
Precipitation in millimeters,20649,97.4
Temperature,4366,20.6


In [0]:
# Assess Test dataset dimensions, variables and datatypes
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7068 entries, 0 to 7067
Data columns (total 25 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Order No                              7068 non-null   object 
 1   User Id                               7068 non-null   object 
 2   Vehicle Type                          7068 non-null   object 
 3   Platform Type                         7068 non-null   int64  
 4   Personal or Business                  7068 non-null   object 
 5   Placement - Day of Month              7068 non-null   int64  
 6   Placement - Weekday (Mo = 1)          7068 non-null   int64  
 7   Placement - Time                      7068 non-null   object 
 8   Confirmation - Day of Month           7068 non-null   int64  
 9   Confirmation - Weekday (Mo = 1)       7068 non-null   int64  
 10  Confirmation - Time                   7068 non-null   object 
 11  Arrival at Pickup

In [0]:
# Assess Test dataset missing values
missing_values_table(test_df)

Your selected dataframe has 25 columns.
There are 2 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
Precipitation in millimeters,6869,97.2
Temperature,1437,20.3


In [0]:
# Assess rider dataset dimensions, variables and datatypes
riders_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960 entries, 0 to 959
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Rider Id        960 non-null    object 
 1   No_Of_Orders    960 non-null    int64  
 2   Age             960 non-null    int64  
 3   Average_Rating  960 non-null    float64
 4   No_of_Ratings   960 non-null    int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 37.6+ KB


In [0]:
# Assess riders dataset missing values
missing_values_table(riders_df)

Your selected dataframe has 5 columns.
There are 0 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values


## Data Cleaning for API

In [0]:
import requests
import json

In [0]:
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [0]:
list_predictors = ['Distance (KM)', 'Temperature', 'Pickup - Day of Month',
'Confirmation - Weekday (Mo = 1)', 'Platform Type']

response = 'Time from Pickup to Arrival'

In [0]:
post = test_df.iloc[1].to_json()
post

'{"Order No":"Order_No_12736","User Id":"User_Id_3647","Vehicle Type":"Bike","Platform Type":3,"Personal or Business":"Business","Placement - Day of Month":17,"Placement - Weekday (Mo = 1)":5,"Placement - Time":"12:57:35 PM","Confirmation - Day of Month":17,"Confirmation - Weekday (Mo = 1)":5,"Confirmation - Time":"12:59:17 PM","Arrival at Pickup - Day of Month":17,"Arrival at Pickup - Weekday (Mo = 1)":5,"Arrival at Pickup - Time":"1:20:27 PM","Pickup - Day of Month":17,"Pickup - Weekday (Mo = 1)":5,"Pickup - Time":"1:25:37 PM","Distance (KM)":5,"Temperature":null,"Precipitation in millimeters":null,"Pickup Lat":-1.272639,"Pickup Long":36.794723,"Destination Lat":-1.2770066,"Destination Long":36.8239068,"Rider Id":"Rider_Id_868"}'

In [0]:
from flask import Flask, request, jsonify

In [0]:
json.loads(post)

{'Order No': 'Order_No_12736',
 'User Id': 'User_Id_3647',
 'Vehicle Type': 'Bike',
 'Platform Type': 3,
 'Personal or Business': 'Business',
 'Placement - Day of Month': 17,
 'Placement - Weekday (Mo = 1)': 5,
 'Placement - Time': '12:57:35 PM',
 'Confirmation - Day of Month': 17,
 'Confirmation - Weekday (Mo = 1)': 5,
 'Confirmation - Time': '12:59:17 PM',
 'Arrival at Pickup - Day of Month': 17,
 'Arrival at Pickup - Weekday (Mo = 1)': 5,
 'Arrival at Pickup - Time': '1:20:27 PM',
 'Pickup - Day of Month': 17,
 'Pickup - Weekday (Mo = 1)': 5,
 'Pickup - Time': '1:25:37 PM',
 'Distance (KM)': 5,
 'Temperature': None,
 'Precipitation in millimeters': None,
 'Pickup Lat': -1.272639,
 'Pickup Long': 36.794723,
 'Destination Lat': -1.2770066,
 'Destination Long': 36.8239068,
 'Rider Id': 'Rider_Id_868'}

In [0]:
# Convert the json string to a python dictionary object
feature_vector_dict = json.loads(post)

# Load the dictionary as a Pandas DataFrame.
feature_vector_df = pd.DataFrame.from_dict([feature_vector_dict])
feature_vector_df

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),...,Pickup - Weekday (Mo = 1),Pickup - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id
0,Order_No_12736,User_Id_3647,Bike,3,Business,17,5,12:57:35 PM,17,5,...,5,1:25:37 PM,5,,,-1.272639,36.794723,-1.277007,36.823907,Rider_Id_868
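One detail of this round trip is worth spelling out: a missing value leaves pandas as JSON `null`, comes back from `json.loads` as Python `None`, and only becomes `NaN` again once the column is cast to a float type. The sketch below walks through that on a three-field toy record; the field values are illustrative.

```python
import json
import pandas as pd

# A toy order row with one missing numeric value.
row = pd.Series({
    "Distance (KM)": 5,
    "Temperature": float("nan"),
    "Rider Id": "Rider_Id_868",
})

payload = row.to_json()            # NaN is serialised as null
record = json.loads(payload)       # null is parsed as None
frame = pd.DataFrame([record])     # one-row DataFrame built from the dict

# Casting to float turns None back into NaN.
temperature = frame["Temperature"].astype(float)
print(payload)
print(temperature)
```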


#### Functions for data cleaning

In [0]:
def get_train_predictors(X, y, tsize = 0.2, rstate=16):

    """
    This function takes the predictor, X, and response, y, variables and returns
    the predictors of the training split, X_train.
    """

    # train_test_split returns [X_train, X_test, y_train, y_test]; keep the first
    X_train = train_test_split(X, y, test_size = tsize, random_state = rstate)[0]

    return X_train
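For reference, `train_test_split` called with two arrays returns four pieces in the order `X_train, X_test, y_train, y_test`, which is why index `[0]` selects the training predictors. A small sketch on toy arrays, reusing the same `test_size` and `random_state` as the function above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Unpack all four pieces of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=16
)

# Indexing the result with [0] yields the same training predictors.
first_piece = train_test_split(X, y, test_size=0.2, random_state=16)[0]
print(X_train.shape, X_test.shape, len(y_train), len(y_test))
```

Because the seed is fixed, repeated calls give the same split, so the imputer and scaler are calibrated on the same rows each time.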

In [0]:
def impute_data_median(df, list_features, predict_vector_df):

    """
    This function takes a dataframe of training data, df, a list of features selected for the model, and a dataframe
    of variables to get a prediction from as inputs. It returns the dataframe of prediction variables
    with null values replaced by the median.
    """

    # instantiate an imputer object with a median filling strategy
    imputer = SimpleImputer(missing_values = np.nan, strategy='median')

    # split predictors and response
    X = df[list_features].values

    y = df['Time from Pickup to Arrival'].values

    # extract the training split to calibrate the imputer
    X_train = get_train_predictors(X, y)

    # Calibrate imputation on training data
    imputer.fit(X_train)

    # replace null values with median
    predict_vector = imputer.transform(predict_vector_df.values.reshape(1,-1))
    
    # convert to dataframe
    return pd.DataFrame(predict_vector, columns=list_features)

In [0]:
feature_vector_df = feature_vector_df[list_predictors]
feature_vector_df

Unnamed: 0,Distance (KM),Temperature,Pickup - Day of Month,Confirmation - Weekday (Mo = 1),Platform Type
0,5,,17,5,3


In [0]:
def impute_data_median(list_features, data):

    """
    This function takes a list of features selected for the model and a dataframe of predictor variables, data, as inputs.
    It returns a dataframe of predictor variables with null values replaced by the median.
    """

    # instantiate an imputer object with a median filling strategy
    imputer = SimpleImputer(missing_values = np.nan, strategy='median')

    # split predictors and response
    X = train_df[list_features].values

    y = train_df['Time from Pickup to Arrival'].values

    # extract training data to calibrate missing data
    X_train = get_train_predictors(X, y)

    # Calibrate imputation on training data
    imputer.fit(X_train)

    # replace null values with median
    data = imputer.transform(data[list_features].values)
    
    # convert to dataframe
    return pd.DataFrame(data, columns=list_features)

In [0]:
impute_data_median(list_predictors, train_df)

Unnamed: 0,Distance (KM),Temperature,Pickup - Day of Month,Confirmation - Weekday (Mo = 1),Platform Type
0,4.0,20.4,9.0,5.0,3.0
1,16.0,26.4,12.0,5.0,3.0
2,3.0,23.4,30.0,2.0,3.0
3,9.0,19.2,15.0,5.0,3.0
4,9.0,15.4,13.0,1.0,1.0
...,...,...,...,...,...
21196,3.0,28.6,20.0,3.0,3.0
21197,7.0,26.0,13.0,6.0,3.0
21198,20.0,29.2,7.0,4.0,3.0
21199,13.0,15.0,4.0,3.0,1.0


In [0]:
def scale_data(list_features, predict_vector_df):

    """
    This function takes a list of features selected for the model and a dataframe of variables
    to get a prediction from as inputs. It returns a dataframe of those variables with feature values scaled.
    """

    # instantiate scaler object
    scaler = StandardScaler()


    # split predictors and response
    X = train_df[list_features].values

    y = train_df['Time from Pickup to Arrival'].values

    # extract the training split to calibrate the scaler
    X_train = get_train_predictors(X, y)

    # calibrate scaler object to train dataset
    scaler.fit(X_train)

    # scale data for prediction
    predict_vector = scaler.transform(predict_vector_df.values.reshape(1,-1))
    
    # convert to dataframe
    return pd.DataFrame(predict_vector, columns=list_features)

In [0]:
scale_data(list_predictors, feature_vector_df)

Unnamed: 0,Distance (KM),Temperature,Pickup - Day of Month,Confirmation - Weekday (Mo = 1),Platform Type
0,-0.797054,,0.154348,1.121992,0.396638
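The empty `Temperature` value in the output above arises because the request row was scaled before its missing value was imputed. One way to keep the two steps in the right order is to chain them in a scikit-learn `Pipeline` that is fitted once on training data. The sketch below is an illustration on toy numbers and is not the approach used in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training rows (distance, temperature) with one missing temperature.
X_train = np.array([[4.0, 20.4], [16.0, 26.4], [3.0, np.nan], [9.0, 19.2]])

# A request row whose temperature is missing.
request_row = np.array([[5.0, np.nan]])

# Impute first, then scale; both steps are calibrated on the training rows only.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess.fit(X_train)

scaled_row = preprocess.transform(request_row)
print(scaled_row)
```

With this ordering the missing temperature is replaced by the training median before scaling, so the transformed row contains no `NaN`.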


In [0]:
pd.DataFrame(test_df.iloc[1])

Unnamed: 0,1
Order No,Order_No_12736
User Id,User_Id_3647
Vehicle Type,Bike
Platform Type,3
Personal or Business,Business
Placement - Day of Month,17
Placement - Weekday (Mo = 1),5
Placement - Time,12:57:35 PM
Confirmation - Day of Month,17
Confirmation - Weekday (Mo = 1),5


In [0]:
# `train` is not defined; pass the feature list and the training dataframe
impute_data_median(list_predictors, train_df)

In [0]:
list_predictors = ['Distance (KM)', 'Temperature', 'Pickup - Day of Month',
'Confirmation - Weekday (Mo = 1)', 'Platform Type']

response = 'Time from Pickup to Arrival'

In [0]:
test_df.columns

Index(['Order No', 'User Id', 'Vehicle Type', 'Platform Type',
       'Personal or Business', 'Placement - Day of Month',
       'Placement - Weekday (Mo = 1)', 'Placement - Time',
       'Confirmation - Day of Month', 'Confirmation - Weekday (Mo = 1)',
       'Confirmation - Time', 'Arrival at Pickup - Day of Month',
       'Arrival at Pickup - Weekday (Mo = 1)', 'Arrival at Pickup - Time',
       'Pickup - Day of Month', 'Pickup - Weekday (Mo = 1)', 'Pickup - Time',
       'Distance (KM)', 'Temperature', 'Precipitation in millimeters',
       'Pickup Lat', 'Pickup Long', 'Destination Lat', 'Destination Long',
       'Rider Id'],
      dtype='object')

In [0]:
'Distance (KM)' == 'Distance (KM)'

True

In [0]:
def select_variables(df, X_vars, Y_var):
    """
    This function takes a dataframe, df, a list of predictor variables, X_vars, and
    a response variable, Y_var. It returns a 2-D array of the predictor variables
    and a 1-D array of the response variable.
    """
    # predictor variables
    X = df[X_vars].values

    # response variable
    Y = df[Y_var].values

    return X, Y


In [0]:
select_variables(train_df, list_predictors, response)

(array([[ 4. , 20.4,  9. ,  5. ,  3. ],
        [16. , 26.4, 12. ,  5. ,  3. ],
        [ 3. ,  nan, 30. ,  2. ,  3. ],
        ...,
        [20. , 29.2,  7. ,  4. ,  3. ],
        [13. , 15. ,  4. ,  3. ,  1. ],
        [12. , 30.9, 26. ,  2. ,  3. ]]),
 array([ 745, 1993,  455, ..., 2953, 1380, 2128]))

In [0]:
test_df.iloc[1][list_predictors]

Distance (KM)                        5
Temperature                        NaN
Pickup - Day of Month               17
Confirmation - Weekday (Mo = 1)      5
Platform Type                        3
Name: 1, dtype: object

In [0]:
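# `dt` is not defined in this notebook and the request row has no response column,
# so select the predictor columns directly
feature_vector_df = feature_vector_df[list_predictors]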

In [0]:
def impute_request_data_median(list_features, str_response, data):

    """
    This function takes a list of features selected for the model, the name of the response variable,
    and a dataframe of predictor variables, data, as inputs.
    It returns a dataframe of predictor variables with null values replaced by the median.
    """

    # instantiate an imputer object with a median filling strategy
    imputer = SimpleImputer(missing_values = np.nan, strategy='median')

    # split predictors and response
    X = train_df[list_features].values

    y = train_df[str_response].values

    # extract training data to calibrate missing data
    X_train = get_train_predictors(X, y)

    # Calibrate imputation on training data
    imputer.fit(X_train)

    # replace null values with median
    data = imputer.transform(data[list_features].values)
    
    # convert to dataframe
    return pd.DataFrame(data, columns=list_features)

def scale_request_data(list_features, str_response, predict_vector_df):

    """
    This function takes a list of features selected for the model, the name of the response variable,
    and a dataframe of variables to get a prediction from as inputs.
    It returns a dataframe of those variables with feature values scaled.
    """

    # instantiate scaler object
    scaler = StandardScaler()


    # split predictors and response
    X = train_df[list_features].values

    y = train_df[str_response].values

    # extract the training split to calibrate the scaler
    X_train = get_train_predictors(X, y)

    # calibrate scaler object to train dataset
    scaler.fit(X_train)

    # scale data for prediction
    predict_vector = scaler.transform(predict_vector_df.values.reshape(1,-1))
    
    # convert to dataframe
    return pd.DataFrame(predict_vector, columns=list_features)

In [0]:
impute_request_data_median(list_predictors, response, train_df)

Unnamed: 0,Distance (KM),Temperature,Pickup - Day of Month,Confirmation - Weekday (Mo = 1),Platform Type
0,4.0,20.4,9.0,5.0,3.0
1,16.0,26.4,12.0,5.0,3.0
2,3.0,23.4,30.0,2.0,3.0
3,9.0,19.2,15.0,5.0,3.0
4,9.0,15.4,13.0,1.0,1.0
...,...,...,...,...,...
21196,3.0,28.6,20.0,3.0,3.0
21197,7.0,26.0,13.0,6.0,3.0
21198,20.0,29.2,7.0,4.0,3.0
21199,13.0,15.0,4.0,3.0,1.0


In [0]:
response

'Time from Pickup to Arrival'

In [0]:
def scale_data(data):

    """
    This function takes an array of predictor variables, data, as input.
    It returns the array with each feature scaled to zero mean and unit variance.
    """

    # instantiate scaler object
    scaler = StandardScaler()

    # calibrate scaler object on the data passed in
    scaler.fit(data)

    # scale data for prediction
    return scaler.transform(data)

def impute_data_median(data):

    """
    This function takes an array of predictor variables, data, as input.
    It returns the array with null values replaced by the column median.
    """

    # instantiate an imputer object with a median filling strategy
    imputer = SimpleImputer(missing_values = np.nan, strategy='median')

    # calibrate imputation on the data passed in
    imputer.fit(data)

    # replace null values with median
    return imputer.transform(data)

In [0]:
train_df[list_predictors].values

array([[ 4. , 20.4,  9. ,  5. ,  3. ],
       [16. , 26.4, 12. ,  5. ,  3. ],
       [ 3. ,  nan, 30. ,  2. ,  3. ],
       ...,
       [20. , 29.2,  7. ,  4. ,  3. ],
       [13. , 15. ,  4. ,  3. ,  1. ],
       [12. , 30.9, 26. ,  2. ,  3. ]])

In [0]:
impute_data_median(train_df[list_predictors].values)

array([[ 4. , 20.4,  9. ,  5. ,  3. ],
       [16. , 26.4, 12. ,  5. ,  3. ],
       [ 3. , 23.5, 30. ,  2. ,  3. ],
       ...,
       [20. , 29.2,  7. ,  4. ,  3. ],
       [13. , 15. ,  4. ,  3. ,  1. ],
       [12. , 30.9, 26. ,  2. ,  3. ]])

In [0]:
scale_data(train_df[list_predictors].values)
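A caveat with helpers that fit on whatever array they receive: they behave sensibly on the full training set, but if applied to a single request row the scaler sees only that row, so every feature is centred on itself and comes out as zero. The sketch below contrasts the two cases on toy numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training rows and a single request row (values are illustrative).
train_rows = np.array([[4.0, 9.0], [16.0, 12.0], [3.0, 30.0]])
request_row = np.array([[5.0, 20.0]])

# Fitting on the request itself: each column's mean equals its only value.
self_scaled = StandardScaler().fit_transform(request_row)

# Fitting on the training rows, then transforming the request.
train_scaled = StandardScaler().fit(train_rows).transform(request_row)

print(self_scaled)
print(train_scaled)
```

This is why request data should be transformed with statistics learned from the training data rather than from the request.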

In [0]:
test_df.iloc[1]

Order No                                Order_No_12736
User Id                                   User_Id_3647
Vehicle Type                                      Bike
Platform Type                                        3
Personal or Business                          Business
Placement - Day of Month                            17
Placement - Weekday (Mo = 1)                         5
Placement - Time                           12:57:35 PM
Confirmation - Day of Month                         17
Confirmation - Weekday (Mo = 1)                      5
Confirmation - Time                        12:59:17 PM
Arrival at Pickup - Day of Month                    17
Arrival at Pickup - Weekday (Mo = 1)                 5
Arrival at Pickup - Time                    1:20:27 PM
Pickup - Day of Month                               17
Pickup - Weekday (Mo = 1)                            5
Pickup - Time                               1:25:37 PM
Distance (KM)                                        5
Temperatur

In [0]:
def _preprocess_data(data):
    """Private helper function to preprocess data for model prediction.

    NB: If you have utilised feature engineering/selection in order to create
    your final model you will need to define the code here.


    Parameters
    ----------
    data : str
        The data payload received within POST requests sent to our API.

    Returns
    -------
    Pandas DataFrame : <class 'pandas.core.frame.DataFrame'>
        The preprocessed data, ready to be used by our model for prediction.

    """
    # Convert the json string to a python dictionary object
    feature_vector_dict = json.loads(data)
    
    # Load the dictionary as a Pandas DataFrame.
    predict_vector = pd.DataFrame.from_dict([feature_vector_dict])

    # ---------------------------------------------------------------
    # NOTE: You will need to swap the lines below for your own data
    # preprocessing methods.
    #
    # The code below is for demonstration purposes only. You will not
    # receive marks for submitting this code in an unchanged state.
    # ---------------------------------------------------------------

    # ----------- Replace this code with your own preprocessing steps --------
    
    # 1. Select variables for model
    predict_vector = predict_vector[list_predictors]

    # 2. Impute Missing Data
    predict_vector = impute_request_data_median(list_predictors, response, predict_vector)

    # 3. Data Scaling
    predict_vector = scale_request_data(list_predictors, response, predict_vector)

    # ------------------------------------------------------------------------

    return predict_vector
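Selecting `list_predictors` from the request raises a `KeyError` if the payload lacks any of those fields. A small guard could report which fields are absent before preprocessing starts. The helper `find_missing_keys` below is a hypothetical addition and not part of the notebook's API code; the key names are copied from `list_predictors`.

```python
import json

# Predictor names copied from list_predictors in this notebook.
REQUIRED_KEYS = [
    "Distance (KM)", "Temperature", "Pickup - Day of Month",
    "Confirmation - Weekday (Mo = 1)", "Platform Type",
]

def find_missing_keys(payload, required=REQUIRED_KEYS):
    """Return the required keys that are absent from a JSON payload string."""
    record = json.loads(payload)
    return [key for key in required if key not in record]

# A payload with every predictor (a null value still counts as present).
complete = json.dumps({
    "Distance (KM)": 5, "Temperature": None, "Pickup - Day of Month": 17,
    "Confirmation - Weekday (Mo = 1)": 5, "Platform Type": 3,
})
# A payload that omits three predictors.
incomplete = json.dumps({"Distance (KM)": 5, "Platform Type": 3})

missing_in_complete = find_missing_keys(complete)
missing_in_incomplete = find_missing_keys(incomplete)
print(missing_in_complete, missing_in_incomplete)
```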

In [0]:
predict_vector = pd.DataFrame.from_dict([json.loads(post)])
predict_vector.values

array([['Order_No_12736', 'User_Id_3647', 'Bike', 3, 'Business', 17, 5,
        '12:57:35 PM', 17, 5, '12:59:17 PM', 17, 5, '1:20:27 PM', 17, 5,
        '1:25:37 PM', 5, None, None, -1.272639, 36.794723, -1.2770066,
        36.8239068, 'Rider_Id_868']], dtype=object)

In [0]:
X_predict = _preprocess_data(post)
X_predict.shape

(1, 5)

In [0]:
def original_preprocess_data(data):
    """Private helper function to preprocess data for model prediction.

    NB: If you have utilised feature engineering/selection in order to create
    your final model you will need to define the code here.


    Parameters
    ----------
    data : str
        The data payload received within POST requests sent to our API.

    Returns
    -------
    Pandas DataFrame : <class 'pandas.core.frame.DataFrame'>
        The preprocessed data, ready to be used by our model for prediction.

    """
    # Convert the json string to a python dictionary object
    feature_vector_dict = json.loads(data)
    # Load the dictionary as a Pandas DataFrame.
    feature_vector_df = pd.DataFrame.from_dict([feature_vector_dict])

    # ---------------------------------------------------------------
    # NOTE: You will need to swap the lines below for your own data
    # preprocessing methods.
    #
    # The code below is for demonstration purposes only. You will not
    # receive marks for submitting this code in an unchanged state.
    # ---------------------------------------------------------------

    # ----------- Replace this code with your own preprocessing steps --------
    predict_vector = feature_vector_df[['Pickup Lat','Pickup Long',
                                        'Destination Lat','Destination Long']]
    # ------------------------------------------------------------------------

    return predict_vector

In [0]:
original_preprocess_data(post)

Unnamed: 0,Pickup Lat,Pickup Long,Destination Lat,Destination Long
0,-1.272639,36.794723,-1.277007,36.823907


### Training for API

In [0]:
import os

In [0]:
# get current directory name; __file__ is not defined inside a notebook,
# so fall back to the current working directory
current_dir = os.path.dirname(__file__) if '__file__' in globals() else os.getcwd()

# Restoring Model and Making Prediction

In [0]:
# import pickle library
import pickle

In [0]:
def load_model(path_to_model:str):
    """Adapter function to load our pretrained model into memory.

    Parameters
    ----------
    path_to_model : str
        The relative path to the model weights/schema to load.
        Note that unless another file format is used, this needs to be a
        .pkl file.

    Returns
    -------
    <class: sklearn.estimator>
        The pretrained model loaded into memory.

    """
    # use a context manager so the file handle is closed after loading
    with open(path_to_model, 'rb') as file:
        return pickle.load(file)

In [0]:
# store file path in variable
model_load_path = "base_model.pkl"

# open file to read binary
with open(model_load_path,'rb') as file:
    
    # load model from file in path
    unpickled_model = pickle.load(file)

In [0]:
# get predictions from unpickled model
y_pred = unpickled_model.predict(X_predict)

In [0]:
y_pred

array([1098.0531514])

In [0]:
static_model = load_model(path_to_model='base_model.pkl')

In [0]:
y_pred2 = static_model.predict(X_predict)
y_pred2

array([1098.0531514])
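The notebook loads `base_model.pkl` but does not show how the file was written. As a sketch of the save side, the example below pickles a small model fitted on toy data and restores it. The file name `example_model.pkl` and the data are placeholders, not the notebook's actual model.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
model = LinearRegression().fit(X, y)

# Write the model to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "example_model.pkl")
    with open(model_path, "wb") as file:
        pickle.dump(model, file)
    with open(model_path, "rb") as file:
        restored = pickle.load(file)

# The restored model gives the same predictions as the original.
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(same_predictions, restored.predict(np.array([[5.0]])))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.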

In [0]:
def make_prediction(data, model):
    """Prepare request data for model prediction.

    Parameters
    ----------
    data : str
        The data payload received within POST requests sent to our API.
    model : <class: sklearn.estimator>
        An sklearn model object.

    Returns
    -------
    list
        A 1-D python list containing the model prediction.

    """
    # Data preprocessing.
    prep_data = _preprocess_data(data)
    # Perform prediction with model and preprocessed data.
    prediction = model.predict(prep_data)
    # Format as list for output standardisation.
    return [prediction[0]]

In [0]:
make_prediction(post, unpickled_model)

[1098.0531514029267]
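`Flask`, `request` and `jsonify` are imported earlier but not used in this notebook. The sketch below shows how a prediction function of this shape might be exposed over HTTP. The route `/api/v1/predict` and the stand-in `fake_predict` are hypothetical, and the stand-in replaces `make_prediction` only so that the example runs without a trained model.

```python
import json

from flask import Flask, jsonify, request

app = Flask(__name__)

def fake_predict(payload):
    """Hypothetical stand-in for make_prediction(data, model)."""
    record = json.loads(payload)
    return [100.0 * record["Distance (KM)"]]

@app.route("/api/v1/predict", methods=["POST"])  # hypothetical route
def predict():
    # Read the raw JSON string from the request body.
    payload = request.get_data(as_text=True)
    return jsonify(fake_predict(payload))

# Exercise the endpoint with Flask's built-in test client (no server needed).
with app.test_client() as client:
    reply = client.post(
        "/api/v1/predict",
        data=json.dumps({"Distance (KM)": 5}),
        content_type="application/json",
    )
    status = reply.status_code
    result = reply.get_json()

print(status, result)
```

In a real service, `fake_predict` would be replaced by a call such as `make_prediction(payload, model)` with the model loaded once at start-up.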

In [0]:
post

'{"Order No":"Order_No_12736","User Id":"User_Id_3647","Vehicle Type":"Bike","Platform Type":3,"Personal or Business":"Business","Placement - Day of Month":17,"Placement - Weekday (Mo = 1)":5,"Placement - Time":"12:57:35 PM","Confirmation - Day of Month":17,"Confirmation - Weekday (Mo = 1)":5,"Confirmation - Time":"12:59:17 PM","Arrival at Pickup - Day of Month":17,"Arrival at Pickup - Weekday (Mo = 1)":5,"Arrival at Pickup - Time":"1:20:27 PM","Pickup - Day of Month":17,"Pickup - Weekday (Mo = 1)":5,"Pickup - Time":"1:25:37 PM","Distance (KM)":5,"Temperature":null,"Precipitation in millimeters":null,"Pickup Lat":-1.272639,"Pickup Long":36.794723,"Destination Lat":-1.2770066,"Destination Long":36.8239068,"Rider Id":"Rider_Id_868"}'