
# <center>Sendy Logistics</center>

Sendy is a business-to-business platform established in 2014, to enable businesses of all types and sizes to transport goods more efficiently across East Africa.

The company is headquartered in Kenya with a team of more than 100 staff, focused on building practical solutions for Africa’s dynamic transportation needs, from developing apps and web solutions, to providing dedicated support for goods on the move.

Currently operating in Kenya and Uganda, Sendy is expanding to Nigeria and Tanzania, to enable thousands more businesses to move volumes of goods easily, anywhere, at any time. Sendy aggregates a pool of delivery options from 28 ton, 14 ton, 5 ton trucks to pick up trucks, vans and motorcycles.

<p><img src="./image/Sendy-delivery-1200x500.jpg" alt="Sendy Logistics Logo"></p>


Sendy Logistics has realised that data is a critical component in building more efficient, affordable and accessible solutions. As such, they are interested in using data to predict the estimated time of delivery of orders, from the point of driver pickup to the point of arrival at the final destination.

The solution will help Sendy enhance customer communication and the reliability of its service, which will ultimately improve customer experience. In addition, the solution will enable Sendy to realise cost savings, and ultimately reduce the cost of doing business through improved resource management and planning for order scheduling.

To help Sendy achieve this goal, we will build a predictive model that estimates the delivery time of orders by looking at factors that could influence the time from pickup to arrival, such as distance, date, and the rider delivering the order.

## Importing the libraries ##
- We are going to use the numpy library for its arrays
- We are going to use pandas to load, merge and modify our dataset
- The matplotlib and seaborn libraries are going to be used to plot the data

In [1]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error



## Loading the dataset ##

- **Pandas** is used to load the data files into our workspace
- Four data files were loaded: Riders, Test, Train and VariableDefinitions

- Train.csv contains Sendy historic order data with 28 features; this is the data we will use to train our model
- Test.csv contains Sendy historic order data with 24 features; this is the data we will use to test our model
- Riders.csv contains information about the riders that make the deliveries


In [2]:
riders = pd.read_csv("./regression data/Riders.csv")
test = pd.read_csv("./regression data/Test.csv")
train = pd.read_csv("./regression data/Train.csv")
variableDefinitions= pd.read_csv("./regression data/VariableDefinitions.csv")

train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21201 entries, 0 to 21200
Data columns (total 29 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Order No                                   21201 non-null  object 
 1   User Id                                    21201 non-null  object 
 2   Vehicle Type                               21201 non-null  object 
 3   Platform Type                              21201 non-null  int64  
 4   Personal or Business                       21201 non-null  object 
 5   Placement - Day of Month                   21201 non-null  int64  
 6   Placement - Weekday (Mo = 1)               21201 non-null  int64  
 7   Placement - Time                           21201 non-null  object 
 8   Confirmation - Day of Month                21201 non-null  int64  
 9   Confirmation - Weekday (Mo = 1)            21201 non-null  int64  
 10  Confirmation - Time   

## Identifying and handling missing values in train data

In [3]:
# Checking missing values
train.isnull().sum()

Order No                                         0
User Id                                          0
Vehicle Type                                     0
Platform Type                                    0
Personal or Business                             0
Placement - Day of Month                         0
Placement - Weekday (Mo = 1)                     0
Placement - Time                                 0
Confirmation - Day of Month                      0
Confirmation - Weekday (Mo = 1)                  0
Confirmation - Time                              0
Arrival at Pickup - Day of Month                 0
Arrival at Pickup - Weekday (Mo = 1)             0
Arrival at Pickup - Time                         0
Pickup - Day of Month                            0
Pickup - Weekday (Mo = 1)                        0
Pickup - Time                                    0
Arrival at Destination - Day of Month            0
Arrival at Destination - Weekday (Mo = 1)        0
Arrival at Destination - Time  

We see that only two columns contain missing values.
<br> The Temperature column has 4366 missing values, while Precipitation in millimeters has 20649.
We can fill the missing values in the Temperature column with the mean of the observed temperatures.


In [4]:
# Filling the missing values in the Temperature column with the column mean, on both the train and test data
train['Temperature'] = train['Temperature'].fillna( train['Temperature'].mean())
test['Temperature'] = test['Temperature'].fillna( test['Temperature'].mean())
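
As a minimal sketch of the mean imputation above, here is the same idea on a small made-up frame (the values are hypothetical, not Sendy data). It also shows one common variant, which is an assumption and not what the cell above does: reusing the training mean for the test set, so both sets are filled with the same value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the notebook itself uses train['Temperature'] and test['Temperature'].
toy_train = pd.DataFrame({"Temperature": [20.0, np.nan, 26.0, 23.0]})
toy_test = pd.DataFrame({"Temperature": [np.nan, 24.0]})

# Mean of the observed training values (NaN is skipped by default).
train_mean = toy_train["Temperature"].mean()

# Fill both sets with the training mean so they share one imputation value.
toy_train["Temperature"] = toy_train["Temperature"].fillna(train_mean)
toy_test["Temperature"] = toy_test["Temperature"].fillna(train_mean)

print(train_mean)
print(toy_test["Temperature"].tolist())
```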

## Investigating the missing values in Precipitation in millimeters column

In [5]:
# Proportion of missing values in the Precipitation in millimeters column
missing_vals = train['Precipitation in millimeters'].isnull().sum()
round((missing_vals/len(train.index))*100,0)

97.0

### 97% of the records in the Precipitation in millimeters column have missing values
That is, only 552 records have values, so we analyse those 552 records.
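
The two figures quoted here follow directly from the counts reported earlier (21201 rows, 20649 missing). A short arithmetic check, using only those two numbers:

```python
# Counts reported above for the 'Precipitation in millimeters' column.
n_rows = 21201
n_missing = 20649

n_present = n_rows - n_missing                 # records that do have a value
pct_missing = round(n_missing / n_rows * 100)  # share of missing records, in percent

print(n_present, pct_missing)
```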

In [6]:
precipitation = train['Precipitation in millimeters'].copy()
precipitation.dropna(inplace = True)

# We want to check whether the available records contain zeros for when
# there was no rainfall/precipitation at the time of the delivery.
precipitation[precipitation==0].count()

0

We see that none of the 552 records we have contain zeros for when there was no precipitation. We therefore believe that the missing values in this column indicate that there was no precipitation at the time of the deliveries, so we fill the missing values with zeros.

In [7]:
# Filling missing values in the Precipitation column with 0 on both train and test data
train['Precipitation in millimeters'] = train['Precipitation in millimeters'].fillna(0)
test['Precipitation in millimeters'] = test['Precipitation in millimeters'].fillna(0)
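
The reasoning above can be sketched on a small hypothetical series: check that no explicit zeros were recorded, then treat "not recorded" as "no precipitation". The readings below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: NaN where nothing was recorded, positive values otherwise.
toy_precip = pd.Series([np.nan, 1.8, np.nan, 0.4, np.nan])

observed = toy_precip.dropna()
zeros_recorded = int((observed == 0).sum())  # explicit "no rain" entries among observed values

# Treat "not recorded" as "no precipitation", which is the assumption made above.
filled = toy_precip.fillna(0)

print(zeros_recorded, filled.tolist())
```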

In [8]:
# Checking if all missing values have been handled
train.isnull().sum()

Order No                                     0
User Id                                      0
Vehicle Type                                 0
Platform Type                                0
Personal or Business                         0
Placement - Day of Month                     0
Placement - Weekday (Mo = 1)                 0
Placement - Time                             0
Confirmation - Day of Month                  0
Confirmation - Weekday (Mo = 1)              0
Confirmation - Time                          0
Arrival at Pickup - Day of Month             0
Arrival at Pickup - Weekday (Mo = 1)         0
Arrival at Pickup - Time                     0
Pickup - Day of Month                        0
Pickup - Weekday (Mo = 1)                    0
Pickup - Time                                0
Arrival at Destination - Day of Month        0
Arrival at Destination - Weekday (Mo = 1)    0
Arrival at Destination - Time                0
Distance (KM)                                0
Temperature  

## Data Preprocessing ##
 - Some columns need to be dropped
 - Train, test and riders need to be merged
 - Null values need to be dealt with

In [9]:
# Cleaning the data


# Alignment of the dataset: drop the 'Arrival at Destination' columns and keep the rest

train = train[['Order No', 'User Id', 'Vehicle Type', 'Platform Type',
       'Personal or Business', 'Placement - Day of Month',
       'Placement - Weekday (Mo = 1)', 'Placement - Time',
       'Confirmation - Day of Month', 'Confirmation - Weekday (Mo = 1)',
       'Confirmation - Time', 'Arrival at Pickup - Day of Month',
       'Arrival at Pickup - Weekday (Mo = 1)', 'Arrival at Pickup - Time',
       'Pickup - Day of Month', 'Pickup - Weekday (Mo = 1)', 'Pickup - Time',
       'Distance (KM)', 'Temperature', 'Precipitation in millimeters',
       'Pickup Lat', 'Pickup Long', 'Destination Lat', 'Destination Long',
       'Rider Id','Time from Pickup to Arrival']]
       


# check which data type we are dealing with
train.dtypes 
test.dtypes



train.head()
    







Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),...,Pickup - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival
0,Order_No_4211,User_Id_633,Bike,3,Business,9,5,9:35:46 AM,9,5,...,10:27:30 AM,4,20.4,0.0,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745
1,Order_No_25375,User_Id_2285,Bike,3,Personal,12,5,11:16:16 AM,12,5,...,11:44:09 AM,16,26.4,0.0,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993
2,Order_No_1899,User_Id_265,Bike,3,Business,30,2,12:39:25 PM,30,2,...,12:53:03 PM,3,23.258889,0.0,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455
3,Order_No_9336,User_Id_1402,Bike,3,Business,15,5,9:25:34 AM,15,5,...,9:43:06 AM,9,19.2,0.0,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855,1341
4,Order_No_27883,User_Id_1737,Bike,1,Personal,13,1,9:55:18 AM,13,1,...,10:05:23 AM,9,15.4,0.0,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770,1214


## Variable Selection by correlation and significance ##

- We have many predictor variables to choose from, so we need a way to guide us in choosing the best ones. One way is to look at the correlation between Time from Pickup to Arrival and each variable in our DataFrame, and select those with the strongest correlations (both positive and negative).

- We also need to consider how significant those features are.

- The code below will create a new DataFrame and store the correlation coefficients and p-values in that DataFrame

In [10]:
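# Calculate correlations between predictor variables and the response variable
# numeric_only=True skips the text columns (pandas 2.x raises an error without it; needs pandas 1.5 or later)
corrs = train.corr(numeric_only=True)['Time from Pickup to Arrival'].sort_values(ascending=False)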
corrs

Time from Pickup to Arrival             1.000000
Distance (KM)                           0.580608
Destination Long                        0.070425
Pickup Long                             0.060285
Confirmation - Weekday (Mo = 1)         0.009744
Arrival at Pickup - Weekday (Mo = 1)    0.009744
Pickup - Weekday (Mo = 1)               0.009744
Placement - Weekday (Mo = 1)            0.009693
Temperature                             0.005772
Precipitation in millimeters            0.005495
Platform Type                          -0.003827
Pickup - Day of Month                  -0.014701
Arrival at Pickup - Day of Month       -0.014701
Confirmation - Day of Month            -0.014701
Placement - Day of Month               -0.014710
Pickup Lat                             -0.053823
Destination Lat                        -0.061872
Name: Time from Pickup to Arrival, dtype: float64

In [11]:

# Build a dictionary of correlation coefficients and p-values
dict_cp = {}

column_titles = [col for col in corrs.index if col!= 'Time from Pickup to Arrival']
for col in column_titles:
    p_val = round(pearsonr(train[col], train['Time from Pickup to Arrival'])[1],6)
    dict_cp[col] = {'Correlation_Coefficient':corrs[col],
                    'P_Value':p_val}

df_cp = pd.DataFrame(dict_cp).T
df_cp_sorted = df_cp.sort_values('P_Value')
df_cp_sorted[df_cp_sorted['P_Value']<0.1]

Unnamed: 0,Correlation_Coefficient,P_Value
Distance (KM),0.580608,0.0
Destination Long,0.070425,0.0
Pickup Long,0.060285,0.0
Pickup Lat,-0.053823,0.0
Destination Lat,-0.061872,0.0
Placement - Day of Month,-0.01471,0.032205
Pickup - Day of Month,-0.014701,0.032312
Arrival at Pickup - Day of Month,-0.014701,0.032312
Confirmation - Day of Month,-0.014701,0.032312
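
A very weak correlation can still have a small p-value when the sample is large. As a rough check, the sketch below recomputes the p-value for a correlation of -0.014701 with 21201 rows, using the standard t-test for a Pearson correlation. This is assumed to be equivalent to what `pearsonr` reports; it is a sketch, not a restatement of the library's internals.

```python
import numpy as np
from scipy import stats

r = -0.014701   # correlation reported above for 'Pickup - Day of Month'
n = 21201       # number of rows in the training data

# t statistic for testing "true correlation is zero", with n - 2 degrees of freedom.
t_stat = r * np.sqrt((n - 2) / (1 - r**2))

# Two-sided p-value from the t distribution.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(round(t_stat, 3), round(p_value, 4))
```

The result is close to the 0.032 shown in the table, so statistical significance here mostly reflects the sample size and not the strength of the relationship.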


In [12]:

# Dropping highly correlated predictors and the ones that were not selected above
redundant_cols = ['Placement - Weekday (Mo = 1)', 'Confirmation - Day of Month',
                  'Confirmation - Weekday (Mo = 1)', 'Arrival at Pickup - Day of Month',
                  'Arrival at Pickup - Weekday (Mo = 1)', 'Pickup - Day of Month',
                  'Pickup - Weekday (Mo = 1)']
train = train.drop(redundant_cols, axis=1)
test = test.drop(redundant_cols, axis=1)


# Dropping the irrelevant columns
irrelevant_cols = ['User Id', 'Vehicle Type', 'Rider Id', 'Confirmation - Time']
train = train.drop(irrelevant_cols, axis=1)
test = test.drop(irrelevant_cols, axis=1)


train.head()



Unnamed: 0,Order No,Platform Type,Personal or Business,Placement - Day of Month,Placement - Time,Arrival at Pickup - Time,Pickup - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Time from Pickup to Arrival
0,Order_No_4211,3,Business,9,9:35:46 AM,10:04:47 AM,10:27:30 AM,4,20.4,0.0,-1.317755,36.83037,-1.300406,36.829741,745
1,Order_No_25375,3,Personal,12,11:16:16 AM,11:40:22 AM,11:44:09 AM,16,26.4,0.0,-1.351453,36.899315,-1.295004,36.814358,1993
2,Order_No_1899,3,Business,30,12:39:25 PM,12:49:34 PM,12:53:03 PM,3,23.258889,0.0,-1.308284,36.843419,-1.300921,36.828195,455
3,Order_No_9336,3,Business,15,9:25:34 AM,9:37:56 AM,9:43:06 AM,9,19.2,0.0,-1.281301,36.832396,-1.257147,36.795063,1341
4,Order_No_27883,1,Personal,13,9:55:18 AM,10:03:53 AM,10:05:23 AM,9,15.4,0.0,-1.266597,36.792118,-1.295041,36.809817,1214


In [13]:
train.drop(['Placement - Time','Arrival at Pickup - Time','Pickup - Time'], axis = 1, inplace = True)
test.drop(['Placement - Time','Arrival at Pickup - Time','Pickup - Time'], axis = 1, inplace = True)

In [14]:
train.head()

Unnamed: 0,Order No,Platform Type,Personal or Business,Placement - Day of Month,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Time from Pickup to Arrival
0,Order_No_4211,3,Business,9,4,20.4,0.0,-1.317755,36.83037,-1.300406,36.829741,745
1,Order_No_25375,3,Personal,12,16,26.4,0.0,-1.351453,36.899315,-1.295004,36.814358,1993
2,Order_No_1899,3,Business,30,3,23.258889,0.0,-1.308284,36.843419,-1.300921,36.828195,455
3,Order_No_9336,3,Business,15,9,19.2,0.0,-1.281301,36.832396,-1.257147,36.795063,1341
4,Order_No_27883,1,Personal,13,9,15.4,0.0,-1.266597,36.792118,-1.295041,36.809817,1214



## Encoding the categorical data ##
 - The Personal or Business column needs to be encoded into dummy variables so it can be of type int


In [15]:
test_df = pd.get_dummies(test.iloc[:,1:], drop_first= True)
train_df = pd.get_dummies(train.iloc[:,1:], drop_first=True)

train_df.head()

Unnamed: 0,Platform Type,Placement - Day of Month,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Time from Pickup to Arrival,Personal or Business_Personal
0,3,9,4,20.4,0.0,-1.317755,36.83037,-1.300406,36.829741,745,0
1,3,12,16,26.4,0.0,-1.351453,36.899315,-1.295004,36.814358,1993,1
2,3,30,3,23.258889,0.0,-1.308284,36.843419,-1.300921,36.828195,455,0
3,3,15,9,19.2,0.0,-1.281301,36.832396,-1.257147,36.795063,1341,0
4,1,13,9,15.4,0.0,-1.266597,36.792118,-1.295041,36.809817,1214,1


In [16]:
test_df.head()

Unnamed: 0,Platform Type,Placement - Day of Month,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Personal or Business_Personal
0,3,27,8,23.24612,0.0,-1.333275,36.870815,-1.305249,36.82239,0
1,3,17,5,23.24612,0.0,-1.272639,36.794723,-1.277007,36.823907,0
2,3,27,5,22.8,0.0,-1.290894,36.822971,-1.276574,36.851365,0
3,3,17,5,24.5,0.0,-1.290503,36.809646,-1.303382,36.790658,0
4,3,11,6,24.4,0.0,-1.281081,36.814423,-1.266467,36.792161,0
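
One thing to watch with separate `get_dummies` calls is that train and test can end up with different columns if a category is absent from one of them. A minimal sketch on made-up frames follows; the column names mirror the notebook, while the values and the `reindex` step are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical toy frames; values are made up.
toy_train = pd.DataFrame({"Distance (KM)": [4, 16, 3],
                          "Personal or Business": ["Business", "Personal", "Business"]})
toy_test = pd.DataFrame({"Distance (KM)": [8, 5],
                         "Personal or Business": ["Business", "Business"]})

# drop_first=True keeps one indicator for a two-level category (Business is the baseline).
train_enc = pd.get_dummies(toy_train, drop_first=True, dtype=int)
test_enc = pd.get_dummies(toy_test, drop_first=True, dtype=int)

# The toy test set has only one category, so its indicator column is missing.
# Reindexing to the training columns adds it back, filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc["Personal or Business_Personal"].tolist())
```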


## Creating X and y ##

In [17]:
X = train_df.drop('Time from Pickup to Arrival', axis = 1)
y = np.array(train_df['Time from Pickup to Arrival']).reshape(-1,1)

In [18]:
# base model

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [20]:
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size = 0.30, random_state = 25)

In [21]:
lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train)

LinearRegression()

In [22]:
val_pred = lin_reg.predict(X_val)
val_pred

array([[2042.76865557],
       [1415.87781663],
       [2050.18185505],
       ...,
       [1093.6552803 ],
       [1023.72502374],
       [1561.67252544]])

In [23]:
# function that calculates the root mean squared error
def rmse(y_test,y_prediction):
    result = np.sqrt(mean_squared_error(y_test,y_prediction))
    return result

In [24]:
rmse(y_val, val_pred)

793.3028821523337
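
RMSE is expressed in the same units as the target, so a value of about 793 corresponds to a typical error of roughly 13 minutes if the target is in seconds. The sketch below computes RMSE by hand on a few hypothetical delivery times and compares it with a naive baseline that always predicts the mean. The numbers are invented and only illustrate the calculation.

```python
import numpy as np

# Hypothetical delivery times in seconds (not taken from the dataset).
y_true = np.array([700.0, 1200.0, 2000.0, 900.0])
y_hat = np.array([800.0, 1000.0, 1900.0, 1100.0])

# RMSE: square the errors, average them, then take the square root.
rmse_model = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Naive baseline: always predict the mean of the observed times.
baseline = np.full_like(y_true, y_true.mean())
rmse_baseline = np.sqrt(np.mean((y_true - baseline) ** 2))

print(rmse_model, rmse_baseline)
```

Comparing against such a baseline gives a sense of how much of the error the model actually removes.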

In [25]:
# Making actual y predictions
y_pred = lin_reg.predict(test_df)

In [26]:
submission = test[['Order No']].copy()
submission['Time from Pickup to Arrival'] = y_pred

In [27]:
submission

Unnamed: 0,Order No,Time from Pickup to Arrival
0,Order_No_19248,1307.746708
1,Order_No_12736,1109.207739
2,Order_No_768,1055.041384
3,Order_No_15332,1103.354981
4,Order_No_21373,1195.624103
...,...,...
7063,Order_No_3612,1130.460822
7064,Order_No_7657,2936.028198
7065,Order_No_1969,1654.064536
7066,Order_No_10591,2611.834259


In [628]:
submission.to_csv('submission_0.11.csv', index = False)

In [47]:
# Saving the model with the RMSE: 793.30288
import pickle

model_save_path = "submission_1_model.pkl"
with open(model_save_path,'wb') as file:
    # Pickle the fitted estimator itself, not the submission DataFrame
    pickle.dump(lin_reg,file)
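
A quick way to confirm that a pickled object is a usable estimator is to load it back and call `predict`. The sketch below does a round trip in memory with a hypothetical toy model standing in for the fitted regressor; writing to a file opened with `'wb'` and reading with `'rb'` works the same way.

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy model standing in for the fitted regressor.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0])
toy_model = LinearRegression().fit(X_toy, y_toy)

# Serialise to bytes and restore.
payload = pickle.dumps(toy_model)
restored = pickle.loads(payload)

# The restored model should give the same predictions as the original.
same = np.allclose(toy_model.predict(X_toy), restored.predict(X_toy))
print(same)
```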

## Feature Scaling ##

One important reason for feature scaling is to prevent features with large magnitudes from dominating the model. Feature scaling is therefore one of the most critical preprocessing steps before building a machine learning model. There are several scaling techniques; the one used in this work is normalization (min-max scaling). This technique is typically used when we want to bound the values between two numbers, often 0 and 1, or -1 and 1. The diagram below shows what the data looks like in the X-Y plane after it has been scaled.

![image.png](attachment:image.png)
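
Min-max normalization maps each value to `(x - min) / (max - min)`. The sketch below applies the formula by hand and compares it with `MinMaxScaler`. The distances are hypothetical, with 1 and 49 assumed as the smallest and largest values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical distances in km; 1 and 49 are assumed as the minimum and maximum.
distance = np.array([[1.0], [4.0], [16.0], [49.0]])

# Min-max normalisation by hand: (x - min) / (max - min).
manual = (distance - distance.min()) / (distance.max() - distance.min())

# MinMaxScaler applies the same formula column by column (default range 0 to 1).
scaled = MinMaxScaler().fit_transform(distance)

print(manual.ravel())
print(np.allclose(manual, scaled))
```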

In [28]:
# Import the min-max scaler
from sklearn.preprocessing import  MinMaxScaler
scaler = MinMaxScaler()

In [29]:
train_df.columns

Index(['Platform Type', 'Placement - Day of Month', 'Distance (KM)',
       'Temperature', 'Precipitation in millimeters', 'Pickup Lat',
       'Pickup Long', 'Destination Lat', 'Destination Long',
       'Time from Pickup to Arrival', 'Personal or Business_Personal'],
      dtype='object')

In [30]:
# The features Day of Month, Distance, Temperature and Precipitation (mm), plus the response
# Time from Pickup to Arrival, are selected from the train data.
# They are then scaled and stored in a new DataFrame: df_train
df_train = pd.DataFrame(scaler.fit_transform(train_df[['Placement - Day of Month',
                                               'Distance (KM)', 'Temperature', 
                                               'Precipitation in millimeters', 'Time from Pickup to Arrival' ]]),
                   columns = ['Day of Month', 'Distance (km)', 'Temp', 'Precipitation (mm)', 'Pickup to Arrival (s)' ])

In [31]:
df_train.head()

Unnamed: 0,Day of Month,Distance (km),Temp,Precipitation (mm),Pickup to Arrival (s)
0,0.266667,0.0625,0.440191,0.0,0.094392
1,0.366667,0.3125,0.727273,0.0,0.252728
2,0.966667,0.041667,0.57698,0.0,0.0576
3,0.466667,0.166667,0.382775,0.0,0.170008
4,0.4,0.166667,0.200957,0.0,0.153895


We should also note that even if we only scale the independent variables, we get a plot similar to the one with both the X and Y axes scaled.

In [32]:
# The features Day of Month, Distance, Temperature and Precipitation (mm) are selected
# from the train and test data, then scaled and stored in df2_train and df2_test.
# Note: fit_transform is called on each set separately, so train and test are scaled with different min/max values.
df2_train = pd.DataFrame(scaler.fit_transform(train[['Placement - Day of Month',
                                               'Distance (KM)', 'Temperature', 
                                               'Precipitation in millimeters']]),
                   columns = ['Day of Month', 'Distance (km)', 'Temp', 'Precipitation (mm)'])
df2_test = pd.DataFrame(scaler.fit_transform(test[['Placement - Day of Month',
                                               'Distance (KM)', 'Temperature', 
                                               'Precipitation in millimeters']]),
                   columns = ['Day of Month', 'Distance (km)', 'Temp', 'Precipitation (mm)'])
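
A common alternative, shown here as a hedged sketch and not as what the cell above does, is to fit the scaler on the training data only and reuse it on the test data, so both sets are mapped with the same minimum and maximum. The toy frames and the name `toy_scaler` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frames with one feature; the values are made up.
toy_train = pd.DataFrame({"Distance (KM)": [1.0, 5.0, 9.0]})
toy_test = pd.DataFrame({"Distance (KM)": [3.0, 11.0]})

toy_scaler = MinMaxScaler()
# Learn min and max from the training data only.
train_scaled = toy_scaler.fit_transform(toy_train)
# Reuse those training statistics on the test data (no refit).
test_scaled = toy_scaler.transform(toy_test)

print(train_scaled.ravel())  # within [0, 1]
print(test_scaled.ravel())   # can fall outside [0, 1] when test exceeds the training range
```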

In [33]:
# df2_train represents the independent variables that are scaled for this work. 
df2_train.head()

Unnamed: 0,Day of Month,Distance (km),Temp,Precipitation (mm)
0,0.266667,0.0625,0.440191,0.0
1,0.366667,0.3125,0.727273,0.0
2,0.966667,0.041667,0.57698,0.0
3,0.466667,0.166667,0.382775,0.0
4,0.4,0.166667,0.200957,0.0


In [34]:
df2_test.head()

Unnamed: 0,Day of Month,Distance (km),Temp,Precipitation (mm)
0,0.866667,0.152174,0.531541,0.0
1,0.533333,0.086957,0.531541,0.0
2,0.866667,0.086957,0.507937,0.0
3,0.533333,0.086957,0.597884,0.0
4,0.333333,0.108696,0.592593,0.0


## Creating the y and X matrices ##

In [35]:
# Plots with both the X and y variables scaled.
fig, axs = plt.subplots(2, 2, figsize=(9,7))

axs[0,0].scatter(df_train['Day of Month'], df_train['Pickup to Arrival (s)'], color = 'black')
axs[0,0].title.set_text('Day of Month vs. Pickup to Arrival (s)')

axs[0,1].scatter(df_train['Distance (km)'], df_train['Pickup to Arrival (s)'], color = 'black')
axs[0,1].title.set_text('Distance (km) vs. Pickup to Arrival (s)')

axs[1,0].scatter(df_train['Temp'],df_train['Pickup to Arrival (s)'], color = 'black')
axs[1,0].title.set_text('Temperature vs. Pickup to Arrival (s)')

axs[1,1].scatter(df_train['Precipitation (mm)'], df_train['Pickup to Arrival (s)'], color = 'black')
axs[1,1].title.set_text('Precipitation (mm) vs. Pickup to Arrival (s)')

fig.tight_layout(pad=3.0)

plt.show()

# Day of Month, Distance (km), Temp and Precipitation are the predictors, whereas Pickup
# to Arrival is the response.


In [36]:
# These are the plots where only the independent variables are scaled.
fig, axs = plt.subplots(2, 2, figsize=(9,7))

axs[0,0].scatter(df_train['Day of Month'], train['Time from Pickup to Arrival'])
axs[0,0].title.set_text('Day of Month vs. Pickup to Arrival (s)')

axs[0,1].scatter(df_train['Distance (km)'], train['Time from Pickup to Arrival'])
axs[0,1].title.set_text('Distance (km) vs. Pickup to Arrival (s)')

axs[1,0].scatter(df_train['Temp'], train['Time from Pickup to Arrival'])
axs[1,0].title.set_text('Temperature vs. Pickup to Arrival (s)')

axs[1,1].scatter(df_train['Precipitation (mm)'], train['Time from Pickup to Arrival'])
axs[1,1].title.set_text('Precipitation (mm) vs. Pickup to Arrival (s)')

fig.tight_layout(pad=3.0)

plt.show()

# Day of Month, Distance (km), Temp and Precipitation are the predictors, whereas Pickup
# to Arrival is the response (not scaled).


In [37]:
# split predictors and response

# Scaled predictors
x_month  = df_train['Day of Month']
x_distance = df_train['Distance (km)']
x_temp = df_train['Temp']
x_prec = df_train['Precipitation (mm)']

# The unscaled response:
y_pa = train['Time from Pickup to Arrival']
# The scaled response:
y_pa_scale = df_train['Pickup to Arrival (s)']

## Model with scaled data version 0.2 ##

In [38]:
# Scaled datasets
#df2_train
# df2_test

In [39]:
# Creating the X and y matrices
X = df2_train.copy()
y = np.array(train_df['Time from Pickup to Arrival']).reshape(-1,1)

In [40]:
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size = 0.3, random_state = 23)

In [41]:
# training the model
lin_reg.fit(X_train, y_train)

LinearRegression()

In [42]:
val_pred = lin_reg.predict(X_val)

In [43]:
# Checking the root mean square error
rmse(y_val, val_pred)

828.9786142267769
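
The two RMSE values are not a like-for-like comparison: this model uses four features instead of ten and a different `random_state` (23 instead of 25). Min-max scaling on its own should not change the predictions of ordinary least squares with an intercept. The sketch below illustrates that on synthetic data, which is made up and unrelated to the Sendy dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with features on different scales (not Sendy data).
rng = np.random.default_rng(0)
X_raw = np.column_stack([rng.uniform(1, 49, 200), rng.uniform(10, 32, 200)])
y_syn = 100 * X_raw[:, 0] + 5 * X_raw[:, 1] + rng.normal(0, 50, 200)

X_fit, X_new = X_raw[:150], X_raw[150:]
y_fit = y_syn[:150]

# Fit once on raw features and once on min-max scaled features.
pred_raw = LinearRegression().fit(X_fit, y_fit).predict(X_new)
mm = MinMaxScaler().fit(X_fit)
pred_scaled = LinearRegression().fit(mm.transform(X_fit), y_fit).predict(mm.transform(X_new))

# Ordinary least squares with an intercept gives the same predictions either way.
print(np.allclose(pred_raw, pred_scaled))
```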

In [44]:
# actual y_predictions
X_test = df2_test

In [45]:
y_pred = lin_reg.predict(X_test)

In [46]:
submission_scaled_data = test[['Order No']].copy()
submission_scaled_data['Time from Pickup to Arrival'] = y_pred

submission_scaled_data

Unnamed: 0,Order No,Time from Pickup to Arrival
0,Order_No_19248,1405.545337
1,Order_No_12736,1112.339636
2,Order_No_768,1088.639682
3,Order_No_15332,1114.426323
4,Order_No_21373,1233.422165
...,...,...
7063,Order_No_3612,1137.883107
7064,Order_No_7657,2932.406166
7065,Order_No_1969,1669.912584
7066,Order_No_10591,2455.420186


In [658]:
submission_scaled_data.to_csv('submission_scaled_data_0.2.csv', index = False)

## Splitting data into the training and the test set ##

## Training the model

## Fitting the multivariate Regression model ##

## Assessing model accuracy ##