# DoorDash ETA Prediction

## Overview
The dataset used for this project is sourced from DoorDash deliveries in early 2015 and captures order details in a subset of cities. The goal of this analysis is to predict the total delivery duration: the time from when an order is placed (`created_at`) to when it is delivered (`actual_delivery_time`). The dataset description defines this target in seconds; the code below converts it to minutes. Accurate predictions matter for customer experience on the DoorDash platform.

## Data Description
The dataset is provided in a CSV file called `historical_data.csv` and includes several columns representing different aspects of each delivery.

### Columns in the Dataset:
1. **Time Features:**
   - `market_id`: Identifier for the city/region (e.g., Los Angeles).
   - `created_at`: Timestamp in UTC when the order was submitted.
   - `actual_delivery_time`: Timestamp in UTC when the order was delivered.

2. **Store Features:**
   - `store_id`: Identifier for the restaurant where the order was placed.
   - `store_primary_category`: Cuisine category of the restaurant (e.g., Italian, Asian).
   - `order_protocol`: An identifier denoting the order mode used by the store.

3. **Order Features:**
   - `total_items`: Total number of items in the order.
   - `subtotal`: Total value of the order (in cents).
   - `num_distinct_items`: Number of unique items in the order.
   - `min_item_price`: Price of the cheapest item (in cents).
   - `max_item_price`: Price of the most expensive item (in cents).

4. **Market Features:**
   - `total_onshift_dashers`: Number of available delivery drivers within 10 miles of the store at the time of order.
   - `total_busy_dashers`: Number of those drivers who are currently occupied with an order.
   - `total_outstanding_orders`: Number of orders being processed within 10 miles of the current order.

5. **Predictions from Other Models:**
   - `estimated_order_place_duration`: Estimated time for the restaurant to receive the order (in seconds).
   - `estimated_store_to_consumer_driving_duration`: Estimated travel time between the store and the consumer (in seconds).

## Data Characteristics
- **Unit of Time**: All time values are recorded in seconds.
- **Monetary Values**: Dollar amounts are expressed in cents.
- **Time Zones**: The timestamps are given in UTC, with the relevant timezone being US/Pacific.

## Objective
The aim of this analysis is to develop a model that accurately predicts the delivery duration, helping DoorDash enhance service quality and meet consumer expectations more effectively.


### Import Necessary Libraries

In [30]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import holidays
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.impute import KNNImputer
import lightgbm as lgb

In [31]:
# Load the dataset (adjust the path to wherever historical_data.csv is stored)
data = pd.read_csv('historical_data.csv')

### Convert and Extract Datetime Features

In [None]:
# Convert datetime columns
data['created_at'] = pd.to_datetime(data['created_at'])
data['actual_delivery_time'] = pd.to_datetime(data['actual_delivery_time'])
data['delivery_duration_minutes'] = (
    (data['actual_delivery_time'] - data['created_at']).dt.total_seconds() / 60
)

# Time-Based Features
data['hour'] = data['created_at'].dt.hour
data['day_of_week_num'] = data['created_at'].dt.dayofweek
data['is_weekend'] = data['day_of_week_num'].isin([5, 6]).astype(int)

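# Holiday Indicator
# holidays.US() is populated lazily, so request the years present in the data explicitly
years = data['created_at'].dt.year.dropna().astype(int).unique().tolist()
us_holidays = holidays.US(years=years)
# Compare date objects with date objects (the holiday keys), not strings
data['is_holiday'] = data['created_at'].dt.date.isin(list(us_holidays.keys())).astype(int)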

### Datetime Conversion
- **`created_at` and `actual_delivery_time`**: Converts string datetime columns into `datetime` type, enabling easy extraction of date-related features.

### Target Variable
- **`delivery_duration_minutes`**: Calculates the target variable, representing the delivery time in minutes. This is essential for model training to predict delivery times.

### Time-Based Features
- **`hour`**: Helps the model learn hourly patterns in delivery times.
- **`day_of_week_num`**: Allows the model to capture trends based on the day of the week.
- **`is_weekend`**: Indicates whether the delivery was on a weekend, helping the model understand delivery time variations during weekends.
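
The `hour` and `day_of_week_num` features above are taken directly from the UTC timestamp, while the data description names US/Pacific as the relevant timezone. A minimal sketch of a local-time variant follows, assuming every market can be treated as US/Pacific. The helper name, the `local_hour` and `local_day_of_week` columns, and the two sample timestamps are hypothetical.

```python
import pandas as pd

def add_local_time_features(df, ts_col="created_at", tz="US/Pacific"):
    """Return a copy of df with hour and weekday taken from local time.

    The output column names are illustrative, not part of the dataset.
    """
    out = df.copy()
    # Parse as UTC (the dataset's stated timezone), then shift to local time
    local_ts = pd.to_datetime(out[ts_col], utc=True).dt.tz_convert(tz)
    out["local_hour"] = local_ts.dt.hour
    out["local_day_of_week"] = local_ts.dt.dayofweek
    return out

# Two made-up UTC timestamps for illustration
demo = pd.DataFrame({"created_at": ["2015-02-06 22:24:17", "2015-01-22 03:10:00"]})
demo = add_local_time_features(demo)
print(demo[["created_at", "local_hour", "local_day_of_week"]])
```

An order created at 03:10 UTC falls on the previous evening in Pacific time, so both the hour and the weekday can differ from their UTC counterparts.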

### Holiday Indicator
- **`is_holiday`**: Marks whether the delivery occurred on a holiday, which can impact delivery times due to reduced availability or increased demand.

## Dasher Features

In [None]:
data['total_busy_dashers'] = abs(data['total_busy_dashers'])  # Handle negative values
data['total_onshift_dashers'] = abs(data['total_onshift_dashers'])
data['dashers_per_order'] = data['total_onshift_dashers'] / (data['total_outstanding_orders'] + 1e-5)
data['%_dashers_avail'] = data['total_busy_dashers'] / (
    data['total_busy_dashers'] + data['total_onshift_dashers'] + 1e-5
)

- **`total_busy_dashers`**: Takes the absolute value so that negative counts in the raw data become non-negative. The column counts dashers who are currently working on an order.
- **`total_onshift_dashers`**: Receives the same treatment. The column counts all dashers who are on shift near the store.
- **`dashers_per_order`**: The ratio of on-shift dashers to outstanding orders, which tells the model how much delivery capacity exists per open order. The `1e-5` term avoids division by zero.
- **`%_dashers_avail`**: Despite its name, this is the share of busy dashers, computed as `total_busy_dashers / (total_busy_dashers + total_onshift_dashers)`. Higher values mean less spare capacity.
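
Adding `1e-5` to a denominator prevents division by zero, but a zero denominator then produces a very large value: 5 on-shift dashers with 0 outstanding orders gives 5 / 1e-5 = 500,000. One alternative, sketched below, returns a chosen fill value for zero denominators instead. The `safe_ratio` helper and the sample numbers are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator, fill_value=0.0):
    """Element-wise numerator / denominator, with fill_value where denominator == 0.

    Missing inputs stay missing (NaN) so they can be imputed later.
    """
    den = denominator.astype(float)
    ratio = numerator.astype(float) / den.replace(0, np.nan)
    # Keep the ratio where the denominator is non-zero, else use fill_value
    return ratio.where(den != 0, fill_value)

# Made-up counts for illustration
demo = pd.DataFrame({
    "total_onshift_dashers": [10, 5, 0],
    "total_busy_dashers": [8, 5, 0],
    "total_outstanding_orders": [20, 0, 0],
})
demo["dashers_per_order"] = safe_ratio(
    demo["total_onshift_dashers"], demo["total_outstanding_orders"]
)
demo["busy_ratio"] = safe_ratio(
    demo["total_busy_dashers"], demo["total_onshift_dashers"]
)
print(demo)
```

Whether 0 is the right fill value depends on the feature; for `dashers_per_order` a large sentinel or a separate "no open orders" flag may be more informative.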

## Price-Based Features

In [None]:
data['price_range'] = data['max_item_price'] - data['min_item_price']
data['avg_item_price'] = data['subtotal'] / (data['total_items'] + 1e-5)
data['price_volatility'] = data['price_range'] / (data['avg_item_price'] + 1e-5)

- **`price_range`**: Measures the difference between the highest and lowest item prices, giving insight into price volatility.
- **`avg_item_price`**: Calculates the average item price per order, contributing to understanding how item pricing may affect delivery.
- **`price_volatility`**: Shows the variability in price within an order, indicating potential impacts on order handling and delivery times.

## Interaction Features

In [None]:
# Interaction Features
data['order_intensity'] = data['total_outstanding_orders'] / (data['total_busy_dashers'] + 1e-5)
data['delivery_difficulty'] = data['order_intensity'] * data['delivery_duration_minutes']

- **`order_intensity`**: Represents the ratio of total outstanding orders to busy dashers, providing an indication of workload and potential delivery time impacts.
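- **`delivery_difficulty`**: The product of order intensity and delivery duration. Because it is computed from `delivery_duration_minutes`, which is the prediction target, it is not available at prediction time and leaks the target into the feature set.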

## Delivery Speed

In [None]:
data['delivery_speed'] = data['delivery_duration_minutes'] / (
    data['estimated_store_to_consumer_driving_duration'] / 60 + 1e-5)

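- **`delivery_speed`**: The ratio of the actual delivery duration to the estimated store-to-consumer driving duration, both in minutes. It is derived from the target `delivery_duration_minutes`, so it is not available at prediction time and leaks the target into the feature set.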

## Log Transformations

In [None]:
data['log_subtotal'] = np.log1p(data['subtotal'])
data['log_outstanding_orders'] = np.log1p(data['total_outstanding_orders'].clip(lower=1e-5))

- **`log_subtotal`**: Applies a log transformation to the subtotal for better handling of skewed data and reducing the impact of outliers.
- **`log_outstanding_orders`**: Applies a log transformation to the number of outstanding orders, ensuring a better scale and reducing data skew.
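
As a small illustration of the transformation above, the sketch below applies `np.log1p` to a few made-up subtotals (in cents) and inverts it with `np.expm1`. The values are hypothetical and chosen only to show how a wide range is compressed.

```python
import numpy as np

# Made-up subtotals in cents, spanning several orders of magnitude
subtotal_cents = np.array([0, 995, 2500, 26800], dtype=float)

log_subtotal = np.log1p(subtotal_cents)  # log(1 + x), defined at x = 0
recovered = np.expm1(log_subtotal)       # exp(x) - 1, the inverse of log1p

print(log_subtotal.round(3))
print(recovered)
```

`log1p` maps 0 to 0, which is why it is preferred over a plain logarithm for count and price columns that can be zero.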


In [32]:
data = data.drop(columns=['created_at', 'actual_delivery_time'])

# Outlier Removal Using IQR Method

In [33]:
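def remove_outliers_iqr(df, variables, threshold=1.5):
    """Drop rows outside [Q1 - threshold*IQR, Q3 + threshold*IQR] for each column in turn.

    Rows with a missing value in a filtered column are dropped too, because
    comparisons against NaN evaluate to False.
    """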
    for variable in variables:
        if variable in df.columns:
            Q1 = df[variable].quantile(0.25)
            Q3 = df[variable].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - (threshold * IQR)
            upper_bound = Q3 + (threshold * IQR)
            df = df[(df[variable] >= lower_bound) & (df[variable] <= upper_bound)]
    return df

# Define numerical columns with potential outliers
outlier_columns = [
    'subtotal', 'delivery_duration_minutes', 'max_item_price', 'price_range',
    'avg_item_price', 'price_volatility', 'delivery_speed'
]

# Remove outliers
data = remove_outliers_iqr(data, outlier_columns)



## Columns Processed
The following numerical columns are targeted for outlier removal:
- `subtotal`: The total value of items in an order.
- `delivery_duration_minutes`: The total delivery time in minutes.
- `max_item_price`: The highest item price in an order.
- `price_range`: The difference between the maximum and minimum item prices.
- `avg_item_price`: The average price of items in an order.
- `price_volatility`: The variability of item prices.
- `delivery_speed`: The ratio of delivery duration to estimated driving duration.
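
To make the IQR rule concrete, here is a worked example on six made-up delivery durations. The `iqr_bounds` helper is hypothetical; it mirrors the bound calculation inside `remove_outliers_iqr` for a single column.

```python
import pandas as pd

def iqr_bounds(series, threshold=1.5):
    """Return the (lower, upper) cut-offs used by the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr

# Made-up durations in minutes; 300 is an obvious outlier
durations = pd.Series([30, 35, 40, 45, 50, 300])

lower, upper = iqr_bounds(durations)
kept = durations[(durations >= lower) & (durations <= upper)]

print(lower, upper)   # Q1 = 36.25, Q3 = 48.75, IQR = 12.5
print(kept.tolist())
```

Here the bounds are 17.5 and 67.5, so only the 300-minute delivery is removed.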
          

# Handling Missing Values in the Dataset
### Using KNN Imputer


In [34]:
def handle_missing_values(df, n_neighbors=5):
    
    # Handle numerical columns using KNN Imputer
    numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
    imputer = KNNImputer(n_neighbors=n_neighbors)
    df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
    
    # Handle categorical columns using mode imputation
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    for col in categorical_cols:
        df[col] = df[col].fillna(df[col].mode()[0])
    
    return df

# Apply missing value handling
data = handle_missing_values(data)
print("Missing values per column after imputation:")
print(data.isnull().sum())


Missing values per column after imputation:
market_id                                       0
store_id                                        0
store_primary_category                          0
order_protocol                                  0
total_items                                     0
subtotal                                        0
num_distinct_items                              0
min_item_price                                  0
max_item_price                                  0
total_onshift_dashers                           0
total_busy_dashers                              0
total_outstanding_orders                        0
estimated_order_place_duration                  0
estimated_store_to_consumer_driving_duration    0
delivery_duration_minutes                       0
hour                                            0
day_of_week_num                                 0
is_weekend                                      0
is_holiday                                      0
dasher

### 1. Numerical Columns:
- **KNN Imputer**: Uses the k-nearest neighbors algorithm to estimate missing values based on the similarity to other rows. The number of neighbors (`n_neighbors`) can be adjusted to control the influence of nearby data points.

### 2. Categorical Columns:
- **Mode Imputation**: Fills missing values with the most frequent value (mode) of the column. This ensures that categorical columns retain the most common category, maintaining the overall distribution.
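
The notebook fits the imputer on the full dataset before the train/test split. A stricter variant, sketched below on made-up arrays, fits `KNNImputer` on the training rows only and reuses it for the test rows, so that no information from the test set influences the imputed values. The array contents are hypothetical.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up feature matrices with one missing value each
X_train_demo = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [4.0, 40.0],
])
X_test_demo = np.array([[2.5, np.nan]])

imputer = KNNImputer(n_neighbors=2)
X_train_imp = imputer.fit_transform(X_train_demo)  # learn neighbours from training rows
X_test_imp = imputer.transform(X_test_demo)        # reuse them for unseen rows

print(X_train_imp)
print(X_test_imp)
```

Each missing entry is replaced by the mean of that column over the nearest training rows that have the value present.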

In [None]:
# Cheaper alternative: median/mode imputation. It needs far less computation than KNN
# imputation but ignores relationships between columns, so it is a fallback for when
# imputation has to be fast.

'''def handle_missing_values(df):
    # Handle numerical columns using median imputation for efficiency
    numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
    for col in numeric_cols:
        df[col] = df[col].fillna(df[col].median())
    
    # Handle categorical columns using mode imputation
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    for col in categorical_cols:
        df[col] = df[col].fillna(df[col].mode()[0])
    
    return df'''

# Optimized Label Encoding for Categorical Columns

In [None]:
def optimized_label_encoding(df, cat_cols):
    le_dict = {} 
    
    for col in cat_cols:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col].astype(str))
        le_dict[col] = le 
    
    return df, le_dict
    
categorical_columns = ['store_primary_category']
data, encoders = optimized_label_encoding(data, categorical_columns)

data['store_primary_category'].unique()

Label encoding converts categorical values into integers so that machine learning algorithms can process them.
It imposes an arbitrary order on the categories, which is not ideal for a nominal feature such as `store_primary_category`. It is used here as a compromise: one-hot encoding would add a large number of columns, and target encoding risks leaking the target.

- The function iterates over the specified columns (`cat_cols`), fits the encoder on each column, and transforms the column values accordingly.
- The encoder for each column is stored in a dictionary (`le_dict`) for potential future use, such as decoding the integers back into category names.
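
As an illustration of why the fitted encoders are worth keeping, the sketch below encodes a few made-up category names and then maps the integers back with `inverse_transform`. The category values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up cuisine categories for illustration
categories = pd.Series(["italian", "thai", "italian", "burger"])

le = LabelEncoder()
codes = le.fit_transform(categories)   # classes are sorted alphabetically
decoded = le.inverse_transform(codes)  # map integers back to the names

print(list(le.classes_))
print(list(codes))
print(list(decoded))
```

The integer assigned to each category depends only on alphabetical order, which is the arbitrary ordering mentioned above.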


In [None]:
# Target and feature variables
X = data.drop(columns=['delivery_duration_minutes'])
y = data['delivery_duration_minutes']
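
Two of the engineered columns, `delivery_difficulty` and `delivery_speed`, are computed from `delivery_duration_minutes`, so keeping them in `X` lets a model reconstruct the target. A minimal sketch of a split that excludes them follows. The helper name and the tiny demo frame are hypothetical, and the metrics reported later in this notebook were produced with these columns included.

```python
import pandas as pd

TARGET = "delivery_duration_minutes"
# Features derived from the target; unavailable when an order is placed
LEAKY_FEATURES = ["delivery_difficulty", "delivery_speed"]

def split_features_target(df, target=TARGET, leaky=LEAKY_FEATURES):
    """Return (X, y) with the target and any target-derived columns removed from X."""
    drop_cols = [target] + [col for col in leaky if col in df.columns]
    return df.drop(columns=drop_cols), df[target]

# Made-up rows for illustration
demo = pd.DataFrame({
    "subtotal": [3441, 1900],
    "delivery_speed": [4.4, 5.8],
    "delivery_difficulty": [94.5, 134.2],
    "delivery_duration_minutes": [63.0, 67.1],
})
X_demo, y_demo = split_features_target(demo)
print(X_demo.columns.tolist())
```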

In [49]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Linear Regression Model for Predicting Delivery Duration

In [52]:
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

#Evaluate the model using MAE and RMSE
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

Mean Absolute Error (MAE): 4.82
Root Mean Squared Error (RMSE): 6.55
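
To judge whether an MAE of a few minutes is good, it helps to compare against a naive baseline that always predicts the mean training duration. The sketch below computes the same two metrics for such a baseline on made-up numbers. The `baseline_metrics` helper and the sample values are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def baseline_metrics(y_train, y_test):
    """MAE and RMSE of a model that always predicts the training mean."""
    constant_pred = np.full(len(y_test), np.mean(y_train))
    mae = mean_absolute_error(y_test, constant_pred)
    rmse = np.sqrt(mean_squared_error(y_test, constant_pred))
    return mae, rmse

# Made-up durations in minutes
y_train_demo = np.array([40.0, 50.0])        # mean is 45
y_test_demo = np.array([42.0, 48.0, 45.0])   # absolute errors: 3, 3, 0

base_mae, base_rmse = baseline_metrics(y_train_demo, y_test_demo)
print(f"Baseline MAE: {base_mae:.2f}")
print(f"Baseline RMSE: {base_rmse:.2f}")
```

Running the same helper on the real `y_train` and `y_test` gives a reference point for all three models.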


# LightGBM Model for Predicting Delivery Duration

In [58]:
lgb_model = lgb.LGBMRegressor(
    objective='regression',
    metric='rmse',
    num_leaves=31,
    learning_rate=0.1,
    n_estimators=200,
    random_state=42
)

lgb_model.fit(X_train, y_train)
y_pred = lgb_model.predict(X_test)

# Evaluate the model using MAE and RMSE
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.023128 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 4801
[LightGBM] [Info] Number of data points in the train set: 126109, number of used features: 27
[LightGBM] [Info] Start training from score 44.448571
Mean Absolute Error (MAE): 0.58
Root Mean Squared Error (RMSE): 1.00


# Neural Network Model for Predicting Delivery Duration

In [55]:
neural_model = Sequential([
    Dense(128, input_dim=X_train.shape[1], activation='relu'),
    Dropout(0.2),  
    Dense(64, activation='relu'),
    Dropout(0.2), 
    Dense(32, activation='relu'),  
    Dense(1, activation='linear')
])

# Compile the model
neural_model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',  
    patience=10,  
    restore_best_weights=True  
)

# Train the model
history = neural_model.fit(
    X_train, y_train,
    validation_split=0.2,  
    epochs=50,  
    batch_size=32, 
    callbacks=[early_stopping]  
)

# Evaluate the model on the test set
y_pred = neural_model.predict(X_test).ravel()  # flatten (n, 1) predictions to (n,)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")

Epoch 1/50
3153/3153 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - loss: 230.9714 - mae: 9.7173 - val_loss: 11.6799 - val_mae: 2.5707
Epoch 2/50
3153/3153 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - loss: 30.7224 - mae: 3.9855 - val_loss: 6.6142 - val_mae: 1.8162
Epoch 3/50
3153/3153 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - loss: 23.0373 - mae: 3.3683 - val_loss: 5.2353 - val_mae: 1.6206
Epoch 4/50
3153/3153 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - loss: 18.9243 - mae: 2.9653 - val_loss: 2.7135 - val_mae: 1.1528
Epoch 5/50
3153/3153 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - loss: 11.4643 - mae: 2.3472 - val_loss: 5.5476 - val_mae: 1.8334
Epoch 6/50
3153/3153 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - loss: 7.3584 - mae: 1.9115 - val_loss: 3.8692 - val_mae: 1.4157
Epoch 7/50
3153/3153 ━━━━━━━━━━━━━━━━━━━━

# Conclusion

## Summary of Analysis
The DoorDash ETA prediction dataset offers valuable insights into factors influencing delivery times, capturing essential details from order placement to final delivery. Key variables impacting delivery durations include time-based features, store and market conditions, and order characteristics. Notably, the number of available dashers and their status, the number of items in the order, and store type are significant contributors to delivery time.

## Key Findings
- **Temporal Patterns**: Delivery times are influenced by the time of order creation and market congestion. Higher numbers of outstanding orders and active dashers correlate with longer delivery times.
- **Order Complexity**: Larger orders with more items and higher total subtotals generally lead to longer delivery times.
- **Store Influence**: The type of store and dasher availability in its vicinity affect delivery speed, suggesting that optimizing dasher distribution relative to store location and order volume can improve efficiency.
- **Model Performance**: Predictive features such as `estimated_order_place_duration` and `estimated_store_to_consumer_driving_duration` can be used effectively for accurate ETA predictions.

## Model Performance Summary

| **Model**             | **MAE (minutes)** | **RMSE (minutes)** |
|-----------------------|-------------------|--------------------|
| **LightGBM**          | 0.58              | 1.00               |
| **Neural Network**    | 1.16              | 1.67               |
| **Linear Regression** | 4.82              | 6.55               |

## Analysis and Recommendations

### 1. LightGBM Model
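- **Best Performance**: LightGBM had the lowest errors, with an MAE of **0.58** minutes and an RMSE of **1.00** minutes, which shows that it handles the non-linear relationships in the data well.
- **Implications**: LightGBM is the strongest candidate for ETA prediction because it is accurate and efficient on large datasets, and hyperparameter tuning and further feature engineering may improve it. These scores should be read with caution: the feature set includes `delivery_difficulty` and `delivery_speed`, both computed from the target, so the errors of all three models are likely lower than they would be with features available at order time.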

### 2. Neural Network Model
- **Moderate Performance**: The neural network had an MAE of **1.16** and RMSE of **1.67**, which is better than linear regression but not as effective as LightGBM.
- **Implications**: While neural networks can model complex data, their performance here suggests that they need more tuning and training adjustments to match or exceed LightGBM.

### 3. Linear Regression Model
- **Weakest Performance**: The linear regression model had the highest errors with an MAE of **4.82** and RMSE of **6.55**, indicating it cannot capture the data's complexities adequately.
- **Implications**: Linear regression may be used for initial benchmarks or when interpretability is prioritized over predictive accuracy.

## Recommendations for Improving Delivery Times

### Key Factors Affecting Delivery Time
- **Time-Based Features**: Delivery times vary with the hour of day and the day of week on which the order is placed.
- **Geographical and Traffic Conditions**: Locations and traffic influence delivery speed.
- **Order Size and Complexity**: Larger or more complex orders take longer to handle.
- **External Conditions**: Weather and unforeseen events can also impact delivery times.

### Suggested Improvements for Business Optimization and Enhanced Model Performance

1. **Integrate Real-Time Data**: Utilize APIs for live traffic, weather, and event updates to dynamically adjust ETAs.
2. **Peak Time Analysis**: Identify and manage peak hours by adjusting scheduling to minimize delays.
3. **Advanced Routing**: Implement route optimization algorithms considering real-time conditions.
4. **Data Enrichment**: Include more detailed location and external factors like traffic congestion.
5. **Feature Engineering**: Develop new features representing traffic congestion patterns and seasonal variations.
6. **Automated Adjustments**: Train models on historical data with traffic and weather as features to adapt to real-time conditions.
7. **Feedback Loops**: Regularly compare actual delivery times with predictions to refine the model.

### Data Enhancements
- **Expand Feature Set**: Add variables such as driver experience and order priority.
- **Ensure Data Quality**: Validate data accuracy for consistency in time, distance, and delivery details.
- **Reduce Noise**: Apply data cleaning techniques to minimize anomalies and ensure a realistic dataset.

### Future Model Enhancements
- **Hyperparameter Tuning**: Optimize LightGBM parameters using grid search or Bayesian methods.
- **Ensemble Methods**: Combine strengths of LightGBM, neural networks, and linear regression for a robust solution.
- **Continuous Updates**: Regularly refresh training data to adapt to changes in delivery patterns, traffic, and weather.

## Final Thoughts
Understanding the factors impacting delivery times enables DoorDash to take proactive measures that enhance delivery efficiency and customer experience. Implementing the strategies discussed can lead to more accurate ETAs and optimized resource allocation, ultimately improving service quality on the platform.
