## Problem Statement

**Food Delivery services like Zomato and Swiggy need to show the accurate time it will take to deliver your order to keep transparency with their customers. These companies use Machine Learning algorithms to predict the food delivery time based on how much time the delivery partners took for the same distance in the past.**

**To predict the food delivery time in real-time, we need to calculate the distance between the food preparation point and the point of food consumption. After finding the distance between the restaurant and the delivery locations, we need to find relationships between the time taken by delivery partners to deliver the food in the past for the same distance.**

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from math import radians,sin,cos,sqrt,atan2
import os

%matplotlib inline

#Ignore warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('C:\\Users\\Aditya\\OneDrive\\Documents\\DataScience\\SA\\ml-project\\ml-pipeline-project\\delivery-time-prediction\\delivery-time-prediction\\Data\\finalTrain.csv')

In [3]:
df.head(10)

Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,Time_Order_picked,Weather_conditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken (min)
0,0xcdcd,DEHRES17DEL01,36.0,4.2,30.327968,78.046106,30.397968,78.116106,12-02-2022,21:55,22:10,Fog,Jam,2,Snack,motorcycle,3.0,No,Metropolitian,46
1,0xd987,KOCRES16DEL01,21.0,4.7,10.003064,76.307589,10.043064,76.347589,13-02-2022,14:55,15:05,Stormy,High,1,Meal,motorcycle,1.0,No,Metropolitian,23
2,0x2784,PUNERES13DEL03,23.0,4.7,18.56245,73.916619,18.65245,74.006619,04-03-2022,17:30,17:40,Sandstorms,Medium,1,Drinks,scooter,1.0,No,Metropolitian,21
3,0xc8b6,LUDHRES15DEL02,34.0,4.3,30.899584,75.809346,30.919584,75.829346,13-02-2022,09:20,09:30,Sandstorms,Low,0,Buffet,motorcycle,0.0,No,Metropolitian,20
4,0xdb64,KNPRES14DEL02,24.0,4.7,26.463504,80.372929,26.593504,80.502929,14-02-2022,19:50,20:05,Fog,Jam,1,Snack,scooter,1.0,No,Metropolitian,41
5,0x3af3,MUMRES15DEL03,29.0,4.5,19.176269,72.836721,19.266269,72.926721,02-04-2022,20:25,20:35,Sandstorms,Jam,2,Buffet,electric_scooter,1.0,No,Metropolitian,20
6,0x3aab,MYSRES01DEL01,35.0,4.0,12.311072,76.654878,12.351072,76.694878,01-03-2022,14:55,15:10,Windy,High,1,Meal,scooter,1.0,No,Metropolitian,33
7,0x689b,PUNERES20DEL01,33.0,4.2,18.592718,73.773572,18.702718,73.883572,16-03-2022,20:30,20:40,Sandstorms,Jam,2,Snack,motorcycle,1.0,No,Metropolitian,40
8,0x6f67,HYDRES14DEL01,34.0,4.9,17.426228,78.407495,17.496228,78.477495,20-03-2022,20:40,20:50,Cloudy,Jam,0,Snack,motorcycle,,No,Metropolitian,41
9,0xc9cf,KOLRES15DEL03,21.0,4.7,22.552672,88.352885,22.582672,88.382885,15-02-2022,21:15,21:30,Windy,Jam,0,Meal,motorcycle,1.0,No,Urban,15


In [4]:
df.columns

Index(['ID', 'Delivery_person_ID', 'Delivery_person_Age',
       'Delivery_person_Ratings', 'Restaurant_latitude',
       'Restaurant_longitude', 'Delivery_location_latitude',
       'Delivery_location_longitude', 'Order_Date', 'Time_Orderd',
       'Time_Order_picked', 'Weather_conditions', 'Road_traffic_density',
       'Vehicle_condition', 'Type_of_order', 'Type_of_vehicle',
       'multiple_deliveries', 'Festival', 'City', 'Time_taken (min)'],
      dtype='object')

In [5]:
df.shape

(45584, 20)

In [6]:
df.describe()

Unnamed: 0,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Vehicle_condition,multiple_deliveries,Time_taken (min)
count,43730.0,43676.0,45584.0,45584.0,45584.0,45584.0,45584.0,44591.0,45584.0
mean,29.566911,4.633774,17.017948,70.229684,17.46548,70.844161,1.023385,0.744635,26.293963
std,5.815064,0.334744,8.185674,22.885575,7.335562,21.120578,0.839055,0.57251,9.384298
min,15.0,1.0,-30.905562,-88.366217,0.01,0.01,0.0,0.0,10.0
25%,25.0,4.5,12.933284,73.17,12.988453,73.28,0.0,0.0,19.0
50%,30.0,4.7,18.55144,75.897963,18.633934,76.002574,1.0,1.0,26.0
75%,35.0,4.9,22.728163,78.044095,22.785049,78.107044,2.0,1.0,32.0
max,50.0,6.0,30.914057,88.433452,31.054057,88.563452,3.0,3.0,54.0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45584 entries, 0 to 45583
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   ID                           45584 non-null  object 
 1   Delivery_person_ID           45584 non-null  object 
 2   Delivery_person_Age          43730 non-null  float64
 3   Delivery_person_Ratings      43676 non-null  float64
 4   Restaurant_latitude          45584 non-null  float64
 5   Restaurant_longitude         45584 non-null  float64
 6   Delivery_location_latitude   45584 non-null  float64
 7   Delivery_location_longitude  45584 non-null  float64
 8   Order_Date                   45584 non-null  object 
 9   Time_Orderd                  43853 non-null  object 
 10  Time_Order_picked            45584 non-null  object 
 11  Weather_conditions           44968 non-null  object 
 12  Road_traffic_density         44983 non-null  object 
 13  Vehicle_conditio

In [8]:
df.isnull().sum()

ID                                0
Delivery_person_ID                0
Delivery_person_Age            1854
Delivery_person_Ratings        1908
Restaurant_latitude               0
Restaurant_longitude              0
Delivery_location_latitude        0
Delivery_location_longitude       0
Order_Date                        0
Time_Orderd                    1731
Time_Order_picked                 0
Weather_conditions              616
Road_traffic_density            601
Vehicle_condition                 0
Type_of_order                     0
Type_of_vehicle                   0
multiple_deliveries             993
Festival                        228
City                           1200
Time_taken (min)                  0
dtype: int64

In [9]:
## to find the number of unique values in column(to know whether which to classify categorical)
for i in df.columns:
    print(f" no of unique values in {i} is : {df[i].nunique()}")
    print("#######################################################")
    if (df[i].nunique()<6):
        print(f" unique values are : {df[i].unique()}")
        print("*****************************************")
    else:
        pass

 no of unique values in ID is : 45584
#######################################################
 no of unique values in Delivery_person_ID is : 1320
#######################################################
 no of unique values in Delivery_person_Age is : 22
#######################################################
 no of unique values in Delivery_person_Ratings is : 28
#######################################################
 no of unique values in Restaurant_latitude is : 657
#######################################################
 no of unique values in Restaurant_longitude is : 518
#######################################################
 no of unique values in Delivery_location_latitude is : 4373
#######################################################
 no of unique values in Delivery_location_longitude is : 4373
#######################################################
 no of unique values in Order_Date is : 44
#######################################################
 no of unique values in 

In [10]:
# creating some features
features=[]
dtypes=[]
count=[]
unique=[]
missing=[]
missing_percentage=[]

for column in df.columns:
    features.append(column)
    dtypes.append(df[column].dtypes)
    count.append(len(df[column]))
    unique.append(df[column].nunique())
    missing.append(df[column].isnull().sum())
    missing_percentage.append((df[column].isnull().sum())/(df.shape[0])*100)
    
dataframe=pd.DataFrame({'feature':features,'dtype':dtypes,'count':count,'unique':unique,
                       'missing':missing,'missing_percentage':missing_percentage})

dataframe.set_index('feature')

Unnamed: 0_level_0,dtype,count,unique,missing,missing_percentage
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ID,object,45584,45584,0,0.0
Delivery_person_ID,object,45584,1320,0,0.0
Delivery_person_Age,float64,45584,22,1854,4.067217
Delivery_person_Ratings,float64,45584,28,1908,4.185679
Restaurant_latitude,float64,45584,657,0,0.0
Restaurant_longitude,float64,45584,518,0,0.0
Delivery_location_latitude,float64,45584,4373,0,0.0
Delivery_location_longitude,float64,45584,4373,0,0.0
Order_Date,object,45584,44,0,0.0
Time_Orderd,object,45584,176,1731,3.797385


In [11]:
# removing the ID column
df.drop('ID',axis=1,inplace=True)

In [12]:
df.dtypes

Delivery_person_ID              object
Delivery_person_Age            float64
Delivery_person_Ratings        float64
Restaurant_latitude            float64
Restaurant_longitude           float64
Delivery_location_latitude     float64
Delivery_location_longitude    float64
Order_Date                      object
Time_Orderd                     object
Time_Order_picked               object
Weather_conditions              object
Road_traffic_density            object
Vehicle_condition                int64
Type_of_order                   object
Type_of_vehicle                 object
multiple_deliveries            float64
Festival                        object
City                            object
Time_taken (min)                 int64
dtype: object

In [13]:
# changing order_date columns to datetime
df['Order_Date']=pd.to_datetime(df['Order_Date'])

In [14]:
df['year']=df['Order_Date'].dt.year
df['month']=df['Order_Date'].dt.month
df['day']=df['Order_Date'].dt.day

In [15]:
df.drop(['Order_Date'],axis=1,inplace=True)

In [None]:
df['Time_Orderd'].isnull().sum()

In [None]:
df.dropna(subset=['Time_Orderd'],inplace=True)

In [None]:
df['Time_Orderd']=df['Time_Orderd'].str.replace('.',':')

In [None]:
df['Time_Orderd'].sample(10)

In [20]:
# Define lambda function with error handling
def extract_time(x):
    try:
        return x.split(':')[0] + ':' + x.split(':')[1][:2]
    except IndexError:
        return '00:00'
    
#Apply lambda function to order time column
df['Time_Orderd']=df['Time_Orderd'].apply(extract_time)

AttributeError: 'float' object has no attribute 'split'

In [16]:
df.dtypes

Delivery_person_ID              object
Delivery_person_Age            float64
Delivery_person_Ratings        float64
Restaurant_latitude            float64
Restaurant_longitude           float64
Delivery_location_latitude     float64
Delivery_location_longitude    float64
Time_Orderd                     object
Time_Order_picked               object
Weather_conditions              object
Road_traffic_density            object
Vehicle_condition                int64
Type_of_order                   object
Type_of_vehicle                 object
multiple_deliveries            float64
Festival                        object
City                            object
Time_taken (min)                 int64
year                             int64
month                            int64
day                              int64
dtype: object

In [21]:
df['Time_Orderd'] = pd.to_datetime(df['Time_Orderd'], format='%H:%M',errors='coerce')

In [39]:
df.head()

Unnamed: 0,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Time_Orderd,Time_Order_picked,Weather_conditions,...,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken (min),year,month,day
0,DEHRES17DEL01,36.0,4.2,30.327968,78.046106,30.397968,78.116106,1900-01-01 21:55:00,22:10,Fog,...,2,Snack,motorcycle,3.0,No,Metropolitian,46,2022,12,2
1,KOCRES16DEL01,21.0,4.7,10.003064,76.307589,10.043064,76.347589,1900-01-01 14:55:00,15:05,Stormy,...,1,Meal,motorcycle,1.0,No,Metropolitian,23,2022,2,13
2,PUNERES13DEL03,23.0,4.7,18.56245,73.916619,18.65245,74.006619,1900-01-01 17:30:00,17:40,Sandstorms,...,1,Drinks,scooter,1.0,No,Metropolitian,21,2022,4,3
3,LUDHRES15DEL02,34.0,4.3,30.899584,75.809346,30.919584,75.829346,1900-01-01 09:20:00,09:30,Sandstorms,...,0,Buffet,motorcycle,0.0,No,Metropolitian,20,2022,2,13
4,KNPRES14DEL02,24.0,4.7,26.463504,80.372929,26.593504,80.502929,1900-01-01 19:50:00,20:05,Fog,...,1,Snack,scooter,1.0,No,Metropolitian,41,2022,2,14


In [None]:
df['Time_Orderd']=df['Time_Orderd'].astype('datetime64[ns]',errors='ignore')

In [None]:
df['Time_Orderd']=pd.to_datetime(df['Time_Orderd'],errors='coerce')

In [None]:
df.dtypes

In [None]:
df['Date_im']=df['Time_Orderd'].str.split()[0]

AttributeError: Can only use .str accessor with string values!

In [None]:
df

In [None]:
# there are 1731 missing values in time ordered column(which is close to 4%),we can either remove the null values or we can impute
# with above data,we can see that for time order empty,we have time_order_picked value,also there are no empty values for time_order_picked.
# so now, we will calculate,for each meal type,we will find approx difference in time between time order and order pick up time,using this we can impute the time order to some extent

In [None]:
#df[df['Time_Orderd'].isna()]

#df1=df[df['Time_Orderd'].isna()]
#df1.shape

#final_df.shape , df1.shape , df.shape
#final_df is removed na value of df.shape
#not convertibe time format = 3702
#final_df when converted to time format,we will get 3702 NaT

#final_df=df.dropna(subset=['Time_Orderd'])

#final_df.dtypes

#final_df.head()

#final_df['Time_Orderd'].isnull().sum()

#time_format='%H:%M'
#final_df['Time_Orderd']=pd.to_datetime(final_df['Time_Orderd'],format=time_format,errors='coerce')

#final_df[final_df['Time_Orderd'].isna()].shape

#df2=final_df.dropna(subset=['Time_Orderd'])

#df2.dtypes

#df2['Time_Orderd']=df2['Time_Orderd'].dt.time

#df2['Time_Orderd']=pd.to_datetime(df2['Time_Orderd'],format=time_format,errors='coerce')

#df2.head()

#df2.dtypes

#df.dtypes

#def calculate_difference(dataframe):
#    return dataframe['Time_Order_picked'].sub(dataframe['Time_Orderd'])

#df['diff']=df.groupby('Type_of_order').apply(calculate_difference)