# **Delivery Duration Prediction**

## Assignment
When a consumer places an order on DoorDash, we show the expected time of delivery. It is very important for DoorDash to get this right, as it has a big impact on consumer experience. In this exercise, you will build a model to predict the estimated time taken for a delivery.

Concretely, for a given delivery you must predict the total delivery duration in seconds, i.e., the time elapsed from

- Start: the time the consumer submits the order (created_at), to
- End: the time the order is delivered to the consumer (actual_delivery_time)


## Data Description

The attached file historical_data.csv contains a subset of deliveries received at DoorDash in early 2015 in a subset of the cities. Each row in this file corresponds to one unique delivery. We have added noise to the dataset to obfuscate certain business details. Each column corresponds to a feature as explained below. Note that all money (dollar) values in the data are given in cents and all time duration values are given in seconds.

The target value to predict here is the total seconds value between created_at and actual_delivery_time.

Columns in historical_data.csv:

**Time features**

- market_id: A city/region in which DoorDash operates, e.g., Los Angeles, given in the data as an id

- created_at: Timestamp in UTC when the order was submitted by the consumer to DoorDash. (Note this timestamp is in UTC, but in case you need it, the actual timezone of the region was US/Pacific)
- actual_delivery_time: Timestamp in UTC when the order was delivered to the consumer
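
Because created_at is stored in UTC while the region's local timezone was US/Pacific, hour-of-day features only reflect local ordering patterns after a timezone conversion. A minimal sketch, assuming the timestamps parse as naive UTC datetimes; the helper name `to_local_hour` is hypothetical and not part of the assignment:

```python
import pandas as pd

def to_local_hour(utc_timestamps: pd.Series, tz: str = "US/Pacific") -> pd.Series:
    """Return the local hour of day for naive timestamps recorded in UTC."""
    parsed = pd.to_datetime(utc_timestamps)
    # Mark the naive values as UTC, then convert to the regional timezone.
    local = parsed.dt.tz_localize("UTC").dt.tz_convert(tz)
    return local.dt.hour

# Two created_at values in the same format as the dataset
sample = pd.Series(["2015-02-06 22:24:17", "2015-02-15 02:40:36"])
local_hours = to_local_hour(sample)
print(local_hours.tolist())
```

In this sample, 22:24 UTC corresponds to 14:24 local time, so the order falls in the afternoon rather than late evening.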


**Store features**

- store_id: an id representing the restaurant the order was submitted for
- store_primary_category: cuisine category of the restaurant, e.g., italian, asian
- order_protocol: a store can receive orders from DoorDash through many modes. This field represents an id denoting the protocol

**Order features**

- total_items: total number of items in the order
- subtotal: total value of the order submitted (in cents)
- num_distinct_items: number of distinct items included in the order
- min_item_price: price of the item with the least cost in the order (in cents)
- max_item_price: price of the item with the highest cost in the order (in cents)

**Market features**

Because DoorDash is a marketplace, we have information on the state of the marketplace when the order is placed, which can be used to estimate delivery time. The following features are values at the time of created_at (order submission time):

- total_onshift_dashers: Number of available dashers who are within 10 miles of the store at the time of order creation
- total_busy_dashers: Subset of above total_onshift_dashers who are currently working on an order
- total_outstanding_orders: Number of orders within 10 miles of this order that are currently being processed.
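
These three counts can be combined into ratios that describe how stretched the local dasher supply is. The sketch below is one possible way to do that; the helper name, the derived column names, and the example values are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def add_market_load_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add hypothetical supply/demand ratios derived from the market columns."""
    out = df.copy()
    # Treat zero on-shift dashers as missing to avoid division by zero.
    onshift = out["total_onshift_dashers"].replace(0, np.nan)
    out["busy_dasher_ratio"] = out["total_busy_dashers"] / onshift
    out["orders_per_onshift_dasher"] = out["total_outstanding_orders"] / onshift
    return out

# Made-up rows; the second one has no dashers on shift
example = pd.DataFrame({
    "total_onshift_dashers": [33.0, 0.0],
    "total_busy_dashers": [14.0, 0.0],
    "total_outstanding_orders": [21.0, 2.0],
})
example = add_market_load_features(example)
print(example)
```

Because noise was added to the dataset, these ratios may not stay within the range one would expect from the column definitions, so they are worth inspecting before use.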

**Predictions from other models**

We have predictions from other models for various stages of the delivery process that we can use:

- estimated_order_place_duration: Estimated time for the restaurant to receive the order from DoorDash (in seconds)
- estimated_store_to_consumer_driving_duration: Estimated travel time between store and consumer (in seconds)
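
Since these two estimates cover only part of the delivery, one way to look at them is to check how much of the total duration they leave unaccounted for. A rough sketch, assuming a `delivery_duration` column that holds the target in seconds; the helper name and the estimate values in the example are made up:

```python
import pandas as pd

def unexplained_duration(df: pd.DataFrame) -> pd.Series:
    """Seconds of the total duration not covered by the two upstream estimates."""
    estimated = (
        df["estimated_order_place_duration"]
        + df["estimated_store_to_consumer_driving_duration"]
    )
    return df["delivery_duration"] - estimated

# Illustrative row: a 3779-second delivery with made-up estimate values
example = pd.DataFrame({
    "delivery_duration": [3779.0],
    "estimated_order_place_duration": [400],
    "estimated_store_to_consumer_driving_duration": [900.0],
})
remaining = unexplained_duration(example)
print(remaining.iloc[0])
```

The remainder could serve as an extra diagnostic or as an alternative modeling target; that is a design choice, not something the assignment prescribes.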

**Practicalities**
Build a model to predict the total delivery duration in seconds (as defined above). Feel free to generate additional features from the given data to improve model performance. Explain:

- model(s) used,
- how you evaluated your model performance on the historical data,
- any data processing you performed on the data,
- feature engineering choices you made,
- other information you would like to share about your modeling approach.

We expect the project to take 3-5 hours in total, but feel free to spend as much time as you like on it. Feel free to use any open source packages for the task.
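
For the evaluation question, one option besides a random hold-out is to train on earlier orders and test on later ones, which is closer to how a delivery-time model would be used. A minimal sketch under that assumption; `chronological_split` is a hypothetical helper and the rows are made up:

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, time_col: str = "created_at",
                        test_fraction: float = 0.2):
    """Return (train, test) where test holds the most recent rows."""
    ordered = df.sort_values(time_col)
    n_test = int(round(len(ordered) * test_fraction))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Ten made-up orders, one per day, deliberately shuffled
demo = pd.DataFrame({
    "created_at": pd.date_range("2015-01-21", periods=10, freq="D"),
    "delivery_duration": range(10),
}).sample(frac=1.0, random_state=0)

train_part, test_part = chronological_split(demo)
print(len(train_part), len(test_part))
```

Every row in the test part is later than every row in the training part, so the score reflects performance on orders the model has not "seen the future" of.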


#### To download the dataset <a href="https://drive.google.com/drive/folders/1H4wDwJhfElUd4OfkaoH_VIZhvwDuOMxf?usp=sharing"> Click here </a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Load the data
data = pd.read_csv("C:\\Users\\manoj\\Downloads\\historical_data.csv")
# Check the loaded DataFrame
print(data.head())

   market_id           created_at actual_delivery_time  store_id  \
0        1.0  2015-02-06 22:24:17  2015-02-06 23:27:16      1845   
1        2.0  2015-02-10 21:49:25  2015-02-10 22:56:29      5477   
2        3.0  2015-01-22 20:39:28  2015-01-22 21:09:09      5477   
3        3.0  2015-02-03 21:21:45  2015-02-03 22:13:00      5477   
4        3.0  2015-02-15 02:40:36  2015-02-15 03:20:26      5477   

  store_primary_category  order_protocol  total_items  subtotal  \
0               american             1.0            4      3441   
1                mexican             2.0            1      1900   
2                    NaN             1.0            1      1900   
3                    NaN             1.0            6      6900   
4                    NaN             1.0            3      3900   

   num_distinct_items  min_item_price  max_item_price  total_onshift_dashers  \
0                   4             557            1239                   33.0   
1                   1       

In [2]:
# Inspect column dtypes and non-null counts (info() prints directly and returns None)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197428 entries, 0 to 197427
Data columns (total 16 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   market_id                                     196441 non-null  float64
 1   created_at                                    197428 non-null  object 
 2   actual_delivery_time                          197421 non-null  object 
 3   store_id                                      197428 non-null  int64  
 4   store_primary_category                        192668 non-null  object 
 5   order_protocol                                196433 non-null  float64
 6   total_items                                   197428 non-null  int64  
 7   subtotal                                      197428 non-null  int64  
 8   num_distinct_items                            197428 non-null  int64  
 9   min_item_price                                19

In [3]:
# Convert timestamps
data['created_at'] = pd.to_datetime(data['created_at'])
data['actual_delivery_time'] = pd.to_datetime(data['actual_delivery_time'])

# Calculate target variable: delivery duration
data['delivery_duration'] = (data['actual_delivery_time'] - data['created_at']).dt.total_seconds()

# Extract day of the week and hour of the day from created_at timestamp
data['day_of_week'] = data['created_at'].dt.dayofweek
data['hour_of_day'] = data['created_at'].dt.hour



In [4]:
# Handle missing values: mode for categorical columns, median for numeric columns.
# Assigning the result back avoids chained in-place fillna, which recent pandas versions warn about.
for col in ['market_id', 'store_primary_category', 'order_protocol']:
    data[col] = data[col].fillna(data[col].mode()[0])
for col in ['total_onshift_dashers', 'total_busy_dashers', 'total_outstanding_orders',
            'estimated_store_to_consumer_driving_duration']:
    data[col] = data[col].fillna(data[col].median())

# Drop rows where the target variable is NaN
data.dropna(subset=['delivery_duration'], inplace=True)

# Define features and target
features = ['market_id', 'store_primary_category', 'order_protocol', 'total_items',
            'subtotal', 'num_distinct_items', 'min_item_price', 'max_item_price',
            'total_onshift_dashers', 'total_busy_dashers', 'total_outstanding_orders',
            'estimated_order_place_duration', 'estimated_store_to_consumer_driving_duration',
            'day_of_week', 'hour_of_day']
target = 'delivery_duration'



In [5]:
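# Split the features into categorical (one-hot encoded below) and numerical groups
categorical_features = ['market_id', 'store_primary_category', 'order_protocol', 'day_of_week', 'hour_of_day']
# A list comprehension keeps the column order deterministic, unlike a set difference
numerical_features = [f for f in features if f not in categorical_features]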

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='median')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])



In [6]:
# Define the model
model = LinearRegression()

# Bundle preprocessing and modeling code in a pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', model)])

# Split data into training and test sets
X = data[features]
y = data[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocess and train the model
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f'Mean Absolute Error: {mae}')
print(f'Root Mean Squared Error: {rmse}')


Mean Absolute Error: 748.3181648388801
Root Mean Squared Error: 2161.0494226593582
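
The RMSE above is almost three times the MAE, which is consistent with a small number of very large errors, since RMSE squares each error before averaging. The synthetic numbers below are made up purely to illustrate that effect and are not drawn from the dataset:

```python
import numpy as np

def mae_rmse(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[float, float]:
    """Compute MAE and RMSE for two equally sized arrays."""
    errors = y_true - y_pred
    mae = float(np.mean(np.abs(errors)))
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    return mae, rmse

y_true = np.full(10, 2000.0)
y_pred = y_true - 100.0            # every prediction is off by 100 seconds
mae_clean, rmse_clean = mae_rmse(y_true, y_pred)

y_true_outlier = y_true.copy()
y_true_outlier[-1] = 12000.0       # one delivery that took far longer than predicted
mae_out, rmse_out = mae_rmse(y_true_outlier, y_pred)

print(f"no outlier:   MAE={mae_clean:.1f}  RMSE={rmse_clean:.1f}")
print(f"with outlier: MAE={mae_out:.1f}  RMSE={rmse_out:.1f}")
```

With uniform errors the two metrics coincide; a single extreme delivery moves RMSE far more than MAE. Inspecting the upper quantiles of delivery_duration would show whether something similar is happening here.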


In [4]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Define the model
model = RandomForestRegressor(random_state=42)

# Define the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', model)])

# Define the parameter distributions sampled by Randomized Search
param_dist = {
    'model__n_estimators': randint(50, 200),
    'model__max_depth': [None, 10, 20, 30],
    'model__min_samples_split': randint(2, 10),
    'model__min_samples_leaf': randint(1, 4)
}

# Set up Randomized Search
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_dist,
    n_iter=20,
    cv=3,
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
    random_state=42,
)

# Fit the model
random_search.fit(X_train, y_train)

# Get the best model
best_model = random_search.best_estimator_

# Predict on the test set
y_pred = best_model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f'Best Model Parameters: {random_search.best_params_}')
print(f'Mean Absolute Error: {mae}')
print(f'Root Mean Squared Error: {rmse}')


Best Model Parameters: {'model__max_depth': 20, 'model__min_samples_leaf': 3, 'model__min_samples_split': 7, 'model__n_estimators': 130}
Mean Absolute Error: 748.1048206011676
Root Mean Squared Error: 1419.8857283346824
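
Both models land at an MAE of roughly 748 seconds, so a trivial baseline helps put that figure in context. The sketch below uses scikit-learn's DummyRegressor on made-up numbers; with the notebook's data it would be fit on X_train, y_train and scored on X_test, y_test instead:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Made-up durations in seconds; the features are ignored by the dummy model
y_train_demo = np.array([1800.0, 2400.0, 2700.0, 3000.0, 9000.0])
y_test_demo = np.array([2000.0, 2600.0, 3500.0])
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the median training duration
baseline = DummyRegressor(strategy="median")
baseline.fit(X_train_demo, y_train_demo)
baseline_mae = mean_absolute_error(y_test_demo, baseline.predict(X_test_demo))
print(f"Median-baseline MAE: {baseline_mae:.1f}")
```

If the baseline MAE on the real split turns out close to the model MAE, the features are contributing little to the typical prediction, even where RMSE improves.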
