<h1>Final project</h1>

<h2>Food Delivery Dataset</h2>

Origin of the dataset: https://www.kaggle.com/datasets/gauravmalik26/food-delivery-dataset

<h2>About this dataset</h2>

Food delivery is a courier service in which a restaurant, store, or independent food-delivery company delivers food to a customer. An order is typically made either through a restaurant or grocer's website or mobile app, or through a food ordering company. The delivered items can include entrees, sides, drinks, desserts, or grocery items and are typically delivered in boxes or bags. The delivery person will normally drive a car, but in bigger cities where homes and restaurants are closer together, they may use bikes or motorized scooters.

<h2>Files</h2>

1) train.csv - the training set

2) test.csv - the test set

3) sample_submission.csv - a sample submission file in the correct format


<h2>Objective</h2>

Predict estimated time for food delivery.

Evaluation metric is r2 score.

<h2>Part 1. Data cleaning and preparation</h2>

In [1]:
# !source final_project_ironhack
!pip install geopy



In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopy 
from geopy import distance
import datetime

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

 Let's check out what do we have inside given datasets.

In [3]:
train = pd.read_csv("../Data/train.csv")
train.head()

Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,Time_Order_picked,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken(min)
0,0x4607,INDORES13DEL02,37,4.9,22.745049,75.892471,22.765049,75.912471,19-03-2022,11:30:00,11:45:00,conditions Sunny,High,2,Snack,motorcycle,0,No,Urban,(min) 24
1,0xb379,BANGRES18DEL02,34,4.5,12.913041,77.683237,13.043041,77.813237,25-03-2022,19:45:00,19:50:00,conditions Stormy,Jam,2,Snack,scooter,1,No,Metropolitian,(min) 33
2,0x5d6d,BANGRES19DEL01,23,4.4,12.914264,77.6784,12.924264,77.6884,19-03-2022,08:30:00,08:45:00,conditions Sandstorms,Low,0,Drinks,motorcycle,1,No,Urban,(min) 26
3,0x7a6a,COIMBRES13DEL02,38,4.7,11.003669,76.976494,11.053669,77.026494,05-04-2022,18:00:00,18:10:00,conditions Sunny,Medium,0,Buffet,motorcycle,1,No,Metropolitian,(min) 21
4,0x70a2,CHENRES12DEL01,32,4.6,12.972793,80.249982,13.012793,80.289982,26-03-2022,13:30:00,13:45:00,conditions Cloudy,High,1,Snack,scooter,1,No,Metropolitian,(min) 30


In [4]:
test = pd.read_csv("../Data/test.csv")
test.head()

Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,Time_Order_picked,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City
0,0x2318,COIMBRES13DEL01,,,11.003669,76.976494,11.043669,77.016494,30-03-2022,,15:05:00,conditions NaN,,3,Drinks,electric_scooter,1,No,Metropolitian
1,0x3474,BANGRES15DEL01,28.0,4.6,12.975377,77.696664,13.085377,77.806664,29-03-2022,20:30:00,20:35:00,conditions Windy,Jam,0,Snack,motorcycle,1,No,Metropolitian
2,0x9420,JAPRES09DEL03,23.0,4.5,26.911378,75.789034,27.001378,75.879034,10-03-2022,19:35:00,19:45:00,conditions Stormy,Jam,0,Drinks,motorcycle,1,No,Metropolitian
3,0x72ee,JAPRES07DEL03,21.0,4.8,26.766536,75.837333,26.856536,75.927333,02-04-2022,17:15:00,17:20:00,conditions Fog,Medium,1,Meal,scooter,1,No,Metropolitian
4,0xa759,CHENRES19DEL01,31.0,4.6,12.986047,80.218114,13.096047,80.328114,27-03-2022,18:25:00,18:40:00,conditions Sunny,Medium,2,Drinks,scooter,1,No,Metropolitian


In [5]:
sample_submission = pd.read_csv("../Data/sample_submission.csv")
sample_submission.head()


Unnamed: 0,ID,Time_taken (min)
0,0x2318,25.668333
1,0x3474,27.881667
2,0x9420,27.023333
3,0x72ee,28.153333
4,0xa759,21.018333


For now let's focus on the fist one (train.csv).

In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45593 entries, 0 to 45592
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   ID                           45593 non-null  object 
 1   Delivery_person_ID           45593 non-null  object 
 2   Delivery_person_Age          45593 non-null  object 
 3   Delivery_person_Ratings      45593 non-null  object 
 4   Restaurant_latitude          45593 non-null  float64
 5   Restaurant_longitude         45593 non-null  float64
 6   Delivery_location_latitude   45593 non-null  float64
 7   Delivery_location_longitude  45593 non-null  float64
 8   Order_Date                   45593 non-null  object 
 9   Time_Orderd                  45593 non-null  object 
 10  Time_Order_picked            45593 non-null  object 
 11  Weatherconditions            45593 non-null  object 
 12  Road_traffic_density         45593 non-null  object 
 13  Vehicle_conditio

In [7]:
train.isna().sum()

ID                             0
Delivery_person_ID             0
Delivery_person_Age            0
Delivery_person_Ratings        0
Restaurant_latitude            0
Restaurant_longitude           0
Delivery_location_latitude     0
Delivery_location_longitude    0
Order_Date                     0
Time_Orderd                    0
Time_Order_picked              0
Weatherconditions              0
Road_traffic_density           0
Vehicle_condition              0
Type_of_order                  0
Type_of_vehicle                0
multiple_deliveries            0
Festival                       0
City                           0
Time_taken(min)                0
dtype: int64

When we use isna().sum() we don't see any NaN values in the dataset, but if we look at the head of the dataset - we see that these values exists. 
Let's try to convert them into numpy datatype.

In [8]:
for column in train.columns:
    train.loc[train[column] == 'NaN', column] = np.nan
    train.loc[train[column] == 'NaN ', column] = np.nan

In [9]:
train.isna().sum()

ID                                0
Delivery_person_ID                0
Delivery_person_Age            1854
Delivery_person_Ratings        1908
Restaurant_latitude               0
Restaurant_longitude              0
Delivery_location_latitude        0
Delivery_location_longitude       0
Order_Date                        0
Time_Orderd                    1731
Time_Order_picked                 0
Weatherconditions                 0
Road_traffic_density            601
Vehicle_condition                 0
Type_of_order                     0
Type_of_vehicle                   0
multiple_deliveries             993
Festival                        228
City                           1200
Time_taken(min)                   0
dtype: int64

In [10]:
# Function for data cleaning

def clean_data(data):
    data_cleaned = data.copy()

# Standardizing header names: 
# 1) Converting column names to lowercase
# 2) Deleting spaces

    cols = []
    for col in data_cleaned.columns:
        cols.append(col.lower().replace(' ', ''))
    data_cleaned.columns = cols

# Deleting duplicates

    data_cleaned = data_cleaned.drop_duplicates()
    
# Dealing with NaN values
    
    for column in data_cleaned.columns:
        data_cleaned.loc[data_cleaned[column] == 'NaN', column] = np.nan
        data_cleaned.loc[data_cleaned[column] == 'NaN ', column] = np.nan
    
# Filling all NaN-values of numerical columns with their mean value
# Column 'delivery_person_age'

    data_cleaned['delivery_person_age'] = data_cleaned['delivery_person_age'].astype('float64')
    data_cleaned['delivery_person_age'] = data_cleaned['delivery_person_age'].fillna(round(np.mean(data_cleaned['delivery_person_age'])))
    data_cleaned['delivery_person_age'] = data_cleaned['delivery_person_age'].astype('int')
    
# Column 'multiple_deliveries'
    data_cleaned['multiple_deliveries'] = data_cleaned['multiple_deliveries'].astype('float64')
    data_cleaned['multiple_deliveries'] = data_cleaned['multiple_deliveries'].fillna(round(np.mean(data_cleaned['multiple_deliveries'])))
    data_cleaned['multiple_deliveries'] = data_cleaned['multiple_deliveries'].astype('int')
    
# Column 'delivery_person_ratings'

    data_cleaned['delivery_person_ratings'] = data_cleaned['delivery_person_ratings'].astype('float64')
    data_cleaned['delivery_person_ratings'] = data_cleaned['delivery_person_ratings'].fillna(round(np.mean(data_cleaned['delivery_person_ratings']), 1))

# Deleting missing values in the column 'time_orderd' and 'festival'

    data_cleaned.dropna(subset = ['time_orderd'], axis = 0, inplace = True)
    data_cleaned.dropna(subset = ['festival'], axis = 0, inplace = True)

# Column 'city'
    data_cleaned['city'] = data_cleaned['city'].fillna(data_cleaned['city'].mode()[0])

# Dealing with other columns data types
# Column 'vehicle_condition'

    data_cleaned['vehicle_condition'] = data_cleaned['vehicle_condition'].astype('int')

# Column 'time_orderd' and 'time_order_picked'
    data_cleaned['time_orderd'] = pd.to_datetime(data_cleaned['order_date'] + ' ' + data_cleaned['time_orderd'])
    data_cleaned['time_order_picked'] = pd.to_datetime(data_cleaned['order_date'] + ' ' + data_cleaned['time_order_picked'])

# Column 'order_date'

    data_cleaned['order_date'] = pd.to_datetime(data_cleaned['order_date'], format='%d-%m-%Y', errors='ignore')
    #data_cleaned['order_date'] = pd.to_datetime(data_cleaned['order_date'], dayfirst = True)
      
# Deleting unnecessary information from the column values
# By using str.split with expand = True we will split elements into separate columns (https://stackoverflow.com/questions/63796316/string-split-with-expand-true-can-anyone-explain-what-is-the-meaning) 

    data_cleaned['weatherconditions'] = data_cleaned['weatherconditions'].str.split(" ", expand = True)[1]

# Converting all the values of the columns (except ids) to lowercase
# and deleting spaces (there are lots of unnecessary spaces after words)

    categorical = data_cleaned.select_dtypes(object)
    categorical_new = categorical.drop(['id'], axis=1)
    categorical_new = categorical_new.drop(['delivery_person_id'], axis=1)

    for column_name in categorical_new.columns:
        data_cleaned[column_name] = data_cleaned[column_name].str.lower().replace(' ', '')
         
    return data_cleaned

**Let's apply this function to 'train' dataset**.

In [11]:
train_cleaned = clean_data(train)

Adding a few more adjustments (can't use them inside of the function, because there is no 'time_taken(min)' column in the 'test' set).

In [12]:
train_cleaned['time_taken(min)'] = train_cleaned['time_taken(min)'].str.split(" ", expand = True)[1]
train_cleaned['time_taken(min)'] = train_cleaned['time_taken(min)'].astype('float64')

Cheking that we don't have any NaN-values and that all data types are correctly identified:

In [13]:
train_cleaned.isna().sum()

id                             0
delivery_person_id             0
delivery_person_age            0
delivery_person_ratings        0
restaurant_latitude            0
restaurant_longitude           0
delivery_location_latitude     0
delivery_location_longitude    0
order_date                     0
time_orderd                    0
time_order_picked              0
weatherconditions              0
road_traffic_density           0
vehicle_condition              0
type_of_order                  0
type_of_vehicle                0
multiple_deliveries            0
festival                       0
city                           0
time_taken(min)                0
dtype: int64

In [14]:
train_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43643 entries, 0 to 45592
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   id                           43643 non-null  object        
 1   delivery_person_id           43643 non-null  object        
 2   delivery_person_age          43643 non-null  int64         
 3   delivery_person_ratings      43643 non-null  float64       
 4   restaurant_latitude          43643 non-null  float64       
 5   restaurant_longitude         43643 non-null  float64       
 6   delivery_location_latitude   43643 non-null  float64       
 7   delivery_location_longitude  43643 non-null  float64       
 8   order_date                   43643 non-null  datetime64[ns]
 9   time_orderd                  43643 non-null  datetime64[ns]
 10  time_order_picked            43643 non-null  datetime64[ns]
 11  weatherconditions            43643 non-nu

In [15]:
train_cleaned.head()

Unnamed: 0,id,delivery_person_id,delivery_person_age,delivery_person_ratings,restaurant_latitude,restaurant_longitude,delivery_location_latitude,delivery_location_longitude,order_date,time_orderd,time_order_picked,weatherconditions,road_traffic_density,vehicle_condition,type_of_order,type_of_vehicle,multiple_deliveries,festival,city,time_taken(min)
0,0x4607,INDORES13DEL02,37,4.9,22.745049,75.892471,22.765049,75.912471,2022-03-19,2022-03-19 11:30:00,2022-03-19 11:45:00,sunny,high,2,snack,motorcycle,0,no,urban,24.0
1,0xb379,BANGRES18DEL02,34,4.5,12.913041,77.683237,13.043041,77.813237,2022-03-25,2022-03-25 19:45:00,2022-03-25 19:50:00,stormy,jam,2,snack,scooter,1,no,metropolitian,33.0
2,0x5d6d,BANGRES19DEL01,23,4.4,12.914264,77.6784,12.924264,77.6884,2022-03-19,2022-03-19 08:30:00,2022-03-19 08:45:00,sandstorms,low,0,drinks,motorcycle,1,no,urban,26.0
3,0x7a6a,COIMBRES13DEL02,38,4.7,11.003669,76.976494,11.053669,77.026494,2022-04-05,2022-05-04 18:00:00,2022-05-04 18:10:00,sunny,medium,0,buffet,motorcycle,1,no,metropolitian,21.0
4,0x70a2,CHENRES12DEL01,32,4.6,12.972793,80.249982,13.012793,80.289982,2022-03-26,2022-03-26 13:30:00,2022-03-26 13:45:00,cloudy,high,1,snack,scooter,1,no,metropolitian,30.0


In [16]:
train_cleaned.shape

(43643, 20)

**Let's apply the same cleaning function to the 'test' set:**

In [17]:
test_cleaned = clean_data(test)

In [18]:
test_cleaned.isnull().sum()

id                             0
delivery_person_id             0
delivery_person_age            0
delivery_person_ratings        0
restaurant_latitude            0
restaurant_longitude           0
delivery_location_latitude     0
delivery_location_longitude    0
order_date                     0
time_orderd                    0
time_order_picked              0
weatherconditions              0
road_traffic_density           0
vehicle_condition              0
type_of_order                  0
type_of_vehicle                0
multiple_deliveries            0
festival                       0
city                           0
dtype: int64

In [19]:
test_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10892 entries, 1 to 11398
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   id                           10892 non-null  object        
 1   delivery_person_id           10892 non-null  object        
 2   delivery_person_age          10892 non-null  int64         
 3   delivery_person_ratings      10892 non-null  float64       
 4   restaurant_latitude          10892 non-null  float64       
 5   restaurant_longitude         10892 non-null  float64       
 6   delivery_location_latitude   10892 non-null  float64       
 7   delivery_location_longitude  10892 non-null  float64       
 8   order_date                   10892 non-null  datetime64[ns]
 9   time_orderd                  10892 non-null  datetime64[ns]
 10  time_order_picked            10892 non-null  datetime64[ns]
 11  weatherconditions            10892 non-nu

Now datasets are ready for further exploration.

<h2>Part 2. Data exploration</h2>

In [20]:
train_cleaned.head()

Unnamed: 0,id,delivery_person_id,delivery_person_age,delivery_person_ratings,restaurant_latitude,restaurant_longitude,delivery_location_latitude,delivery_location_longitude,order_date,time_orderd,time_order_picked,weatherconditions,road_traffic_density,vehicle_condition,type_of_order,type_of_vehicle,multiple_deliveries,festival,city,time_taken(min)
0,0x4607,INDORES13DEL02,37,4.9,22.745049,75.892471,22.765049,75.912471,2022-03-19,2022-03-19 11:30:00,2022-03-19 11:45:00,sunny,high,2,snack,motorcycle,0,no,urban,24.0
1,0xb379,BANGRES18DEL02,34,4.5,12.913041,77.683237,13.043041,77.813237,2022-03-25,2022-03-25 19:45:00,2022-03-25 19:50:00,stormy,jam,2,snack,scooter,1,no,metropolitian,33.0
2,0x5d6d,BANGRES19DEL01,23,4.4,12.914264,77.6784,12.924264,77.6884,2022-03-19,2022-03-19 08:30:00,2022-03-19 08:45:00,sandstorms,low,0,drinks,motorcycle,1,no,urban,26.0
3,0x7a6a,COIMBRES13DEL02,38,4.7,11.003669,76.976494,11.053669,77.026494,2022-04-05,2022-05-04 18:00:00,2022-05-04 18:10:00,sunny,medium,0,buffet,motorcycle,1,no,metropolitian,21.0
4,0x70a2,CHENRES12DEL01,32,4.6,12.972793,80.249982,13.012793,80.289982,2022-03-26,2022-03-26 13:30:00,2022-03-26 13:45:00,cloudy,high,1,snack,scooter,1,no,metropolitian,30.0


In [21]:
# Function for data exploration

def explore_data(data):
    
    # Checking numerical and categoracal columns of the dataframe
    numerical = data.select_dtypes(np.number)
    categorical = data.select_dtypes(object)
    display(numerical.head())
    display(categorical.head())

    # Using Matplotlib to construct histograms for all numerical columns
    plt.style.context('ggplot')    
    for column_name in numerical.columns:
        plt.figure()
        fig,ax = plt.subplots()
        ax.set_title(column_name)
        plt.hist(numerical[column_name], bins=20)
        plt.show()
        display(numerical[column_name].unique())
    
    return data

In [22]:
# explore_data(train_cleaned)

In [23]:
train_cleaned['time_taken(min)'].min()

10.0

In [24]:
train_cleaned['time_taken(min)'].max()

54.0

<h2>Part 3. Feature selection</h2>

In [25]:
train_cleaned.head()

Unnamed: 0,id,delivery_person_id,delivery_person_age,delivery_person_ratings,restaurant_latitude,restaurant_longitude,delivery_location_latitude,delivery_location_longitude,order_date,time_orderd,time_order_picked,weatherconditions,road_traffic_density,vehicle_condition,type_of_order,type_of_vehicle,multiple_deliveries,festival,city,time_taken(min)
0,0x4607,INDORES13DEL02,37,4.9,22.745049,75.892471,22.765049,75.912471,2022-03-19,2022-03-19 11:30:00,2022-03-19 11:45:00,sunny,high,2,snack,motorcycle,0,no,urban,24.0
1,0xb379,BANGRES18DEL02,34,4.5,12.913041,77.683237,13.043041,77.813237,2022-03-25,2022-03-25 19:45:00,2022-03-25 19:50:00,stormy,jam,2,snack,scooter,1,no,metropolitian,33.0
2,0x5d6d,BANGRES19DEL01,23,4.4,12.914264,77.6784,12.924264,77.6884,2022-03-19,2022-03-19 08:30:00,2022-03-19 08:45:00,sandstorms,low,0,drinks,motorcycle,1,no,urban,26.0
3,0x7a6a,COIMBRES13DEL02,38,4.7,11.003669,76.976494,11.053669,77.026494,2022-04-05,2022-05-04 18:00:00,2022-05-04 18:10:00,sunny,medium,0,buffet,motorcycle,1,no,metropolitian,21.0
4,0x70a2,CHENRES12DEL01,32,4.6,12.972793,80.249982,13.012793,80.289982,2022-03-26,2022-03-26 13:30:00,2022-03-26 13:45:00,cloudy,high,1,snack,scooter,1,no,metropolitian,30.0


We can create new features using latitude and longitude data (of the restaurant and of the delivery location).

We can use geopy library to calculate the distance between the restaurant and delivery location.
More info here: https://geopy.readthedocs.io/en/stable/#module-geopy.distance

In [26]:
# function for distance calculation

def calculate_distance(row):
    coordinates_restaurant = (row['restaurant_latitude'], row['restaurant_longitude'])
    coordinates_delivery_location = (row['delivery_location_latitude'], row['delivery_location_longitude'])
    return distance.distance(coordinates_restaurant, coordinates_delivery_location).km

In [27]:
# function for getting datetime column in seconds

def get_seconds(timedelta):
    return timedelta.seconds

In [39]:
# function for data preparation

def prepare_features(data):
    
# calculating distance
    data['distance'] = data.apply(calculate_distance, axis=1)
    
    data['year'] = data.order_date.dt.year
    data['month'] = data.order_date.dt.month
    data['day'] = data.order_date.dt.day
    data['day_of_week'] = data.order_date.dt.day_of_week.astype(int)
    data['is_weekend'] = data['day_of_week'].isin([5,6]).astype(int)
    data['hour_ordered'] = data.time_orderd.dt.hour
    
    
    data['preparation_time'] = (data['time_order_picked'] - data['time_orderd'])
    data['preparation_time'] = (data['preparation_time'].map(get_seconds))/60

    return data
    

In [40]:
train_prepared = prepare_for_model(train_cleaned)
train_prepared.head()

Unnamed: 0,id,delivery_person_id,delivery_person_age,delivery_person_ratings,restaurant_latitude,restaurant_longitude,delivery_location_latitude,delivery_location_longitude,order_date,time_orderd,time_order_picked,weatherconditions,road_traffic_density,vehicle_condition,type_of_order,type_of_vehicle,multiple_deliveries,festival,city,time_taken(min),distance,year,month,day,day_of_week,is_weekend,hour_ordered,preparation_time
0,0x4607,INDORES13DEL02,37,4.9,22.745049,75.892471,22.765049,75.912471,2022-03-19,2022-03-19 11:30:00,2022-03-19 11:45:00,sunny,high,2,snack,motorcycle,0,no,urban,24.0,3.020737,2022,3,19,5,1,11,15.0
1,0xb379,BANGRES18DEL02,34,4.5,12.913041,77.683237,13.043041,77.813237,2022-03-25,2022-03-25 19:45:00,2022-03-25 19:50:00,stormy,jam,2,snack,scooter,1,no,metropolitian,33.0,20.143737,2022,3,25,4,0,19,5.0
2,0x5d6d,BANGRES19DEL01,23,4.4,12.914264,77.6784,12.924264,77.6884,2022-03-19,2022-03-19 08:30:00,2022-03-19 08:45:00,sandstorms,low,0,drinks,motorcycle,1,no,urban,26.0,1.549693,2022,3,19,5,1,8,15.0
3,0x7a6a,COIMBRES13DEL02,38,4.7,11.003669,76.976494,11.053669,77.026494,2022-04-05,2022-05-04 18:00:00,2022-05-04 18:10:00,sunny,medium,0,buffet,motorcycle,1,no,metropolitian,21.0,7.774497,2022,4,5,1,0,18,10.0
4,0x70a2,CHENRES12DEL01,32,4.6,12.972793,80.249982,13.012793,80.289982,2022-03-26,2022-03-26 13:30:00,2022-03-26 13:45:00,cloudy,high,1,snack,scooter,1,no,metropolitian,30.0,6.197898,2022,3,26,5,1,13,15.0


In [30]:
# def calculate_distance(row):
#     coordinates_restaurant = (row['restaurant_latitude'], row['restaurant_longitude'])
#     coordinates_delivery_location = (row['delivery_location_latitude'], row['delivery_location_longitude'])
#     return distance.distance(coordinates_restaurant, coordinates_delivery_location).km

# train_cleaned['distance'] = train_cleaned.apply(calculate_distance, axis=1)
# test_cleaned['distance'] = test_cleaned.apply(calculate_distance, axis=1)

In [31]:
# train_cleaned.head()

In [32]:
# data = train_cleaned

In [33]:
# data['year'] = data.order_date.dt.year
# data['month'] = data.order_date.dt.month
# data['day'] = data.order_date.dt.day
# data['day_of_week'] = data.order_date.dt.day_of_week.astype(int)
# data['is_weekend'] = data['day_of_week'].isin([5,6]).astype(int)
# data['hour_ordered'] = data.time_orderd.dt.hour

In [34]:
# data.head()

In [35]:
# data['preparation_time'] = (data['time_order_picked'] - data['time_orderd'])

In [36]:
# def get_seconds(timedelta):
#     return timedelta.seconds

In [37]:
# data['preparation_time'] = (data['preparation_time'].map(get_seconds))/60

<h2>Part 4. Model building</h2>

In [None]:
# y = train_cleaned['time_taken(min)']
# X = train_cleaned.drop(columns=['id','delivery_person_id','restaurant_latitude', 'restaurant_longitude', 'delivery_location_latitude', 'delivery_location_longitude', 'order_date', 'time_orderd', 'time_order_picked', 'day_of_week', 'time_taken(min)'],axis=1)

In [None]:
# numerical = train_cleaned.select_dtypes(np.number)
# plt.figure(figsize=(12, 5))
# heatmap = sns.heatmap(numerical.corr(), annot=True, cmap='Greens' )
# heatmap.set_title('Correlation matrix')

<h2>Part 5. Model comparison and conclusion</h2>