<a href="https://colab.research.google.com/github/SriRamK345/Predicting-DoorDash-ETA-A-Machine-Learning-Approach/blob/main/Predicting_DoorDash_ETA_A_Machine_Learning_Approach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Summary: a complete data science workflow for predicting DoorDash delivery times**

### **Data Preprocessing**
1. **Handling Missing Data**:
   - Rows with missing values and duplicates are removed, ensuring data integrity.
2. **Feature Engineering**:
   - Creation of new features like `delivery_duration_sec` and `free_dashers` to improve model insights and performance.
3. **Data Transformation**:
   - Conversion of date columns to datetime objects for temporal analysis.
   - Removal of unnecessary columns and outliers in the target variable to refine the dataset.

### **Exploratory Data Analysis (EDA)**
- Visual tools such as:
  - **Count plots**: For categorical data distribution.
  - **Scatter plots**: To examine relationships between variables.
  - **Heatmaps**: For correlation analysis.
- These provide insights into variable interactions and distributions.

### **Model Building and Evaluation**
1. **Data Preparation**:
   - Splitting data into training and testing sets.
   - Feature scaling with `MinMaxScaler` to rescale each feature to the [0, 1] range.
2. **Model Training**:
   - Models used: Linear Regression, Random Forest, XGBoost, and a Neural Network.
3. **Model Evaluation**:
   - Performance metrics include:
     - **R-squared**: Proportion of variance explained.
     - **MAE**, **MSE**, and **RMSE**: Indicators of prediction accuracy.

### **Model Selection and Deployment**
1. **Best Model Selection**:
   - The Neural Network outperformed others based on evaluation metrics.
2. **Model Saving and Deployment**:
   - Saved using `model.save()` for reuse.
   - A function enables loading the saved model and predicting new data seamlessly.

### **Conclusion**
This project effectively follows the data science pipeline:
- Data cleaning and transformation ensure robust input.
- EDA highlights key insights.
- Model building explores various algorithms.
- Deployment enables practical use of the Neural Network for accurate delivery time predictions.

This structured approach showcases expertise in handling end-to-end machine learning projects and can be applied to similar prediction tasks.

# Steps to get the dataset from Kaggle

In [None]:
! mkdir -p ~/.kaggle

In [None]:
! cp "/content/drive/MyDrive/kaggle.json" ~/.kaggle/kaggle.json

In [None]:
! kaggle datasets download dharun4772/doordash-eta-prediction

In [None]:
! unzip /content/doordash-eta-prediction.zip

# Import libraries

In [None]:
# Data cleaning
import pandas as pd
import numpy as np
# Visualization / EDA
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# remove warnings
import warnings
warnings.filterwarnings("ignore")
# Split data for training and testing & Optimizing model parameters
from sklearn.model_selection import train_test_split
# Model evaluation
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Model selection
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
# Feature scaling
from sklearn.preprocessing import MinMaxScaler
# TensorFlow
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping

# Import dataset

In [None]:
df= pd.read_csv("/content/historical_data.csv")
df.head()

## `Feature descriptions`

1. **market_id** - A city/region in which DoorDash operates (e.g., Los Angeles), given in the data as an id.
2. **created_at** - When the order was submitted by the consumer to DoorDash.
3. **actual_delivery_time** - When the order was delivered.
4. **store_id** - The restaurant id.
5. **store_primary_category** - Cuisine category of the restaurant.
6. **order_protocol** - A store can receive orders from DoorDash through many modes. This field is an id denoting the protocol.
7. **total_items** - Total number of items in the order.
8. **subtotal** - Total value of the order submitted (in cents).
9. **num_distinct_items** - Number of distinct items included in the order.
10. **min_item_price** - Price of the item with the least cost in the order (in cents).
11. **max_item_price** - Price of the item with the highest cost in the order.
12. **total_onshift_dashers** - The total number of delivery drivers who are currently available and actively working.
13. **total_busy_dashers** - The total number of delivery drivers who are currently occupied with delivering orders.
14. **total_outstanding_orders** - The total number of orders that have been placed but have not yet been delivered.
15. **estimated_order_place_duration** - The estimated time it takes for a customer to place an order.
16. **estimated_store_to_consumer_driving_duration** - The estimated time it takes for a delivery driver to travel from the store to the customer's location.

The target to predict is the number of seconds between `created_at` and `actual_delivery_time`.


# Exploring Data

In [None]:
df.columns

## Check for data types

In [None]:
df.info()

In [None]:
df.shape

In [None]:
# Checking Null values

df.isnull().sum()

In [None]:
# Checking Null values
print(f"There are {df.isnull().sum().sum()} null values in this dataset")

#  Data Preprocessing

## Ratio of missing values

In [None]:
total_rows = df.shape[0]
total_missing_values = df.isnull().sum().sum()

if total_missing_values == 0:
    print("There are no missing values in the DataFrame.")
else:
    missing_values_ratio = total_rows / total_missing_values
    print(f"The ratio of total rows to missing values is: {missing_values_ratio:.2f}")
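
The single ratio above does not show which columns the missing values come from. A minimal sketch, on a small made-up frame (`toy` and its values are hypothetical, not the DoorDash data), of the per-column share of missing values and the share of rows that `dropna()` would remove:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "market_id": [1.0, np.nan, 3.0, 2.0],
    "total_items": [4, 1, 2, 3],
    "total_busy_dashers": [np.nan, np.nan, 5.0, 7.0],
})

# Fraction of missing values in each column
missing_share = toy.isnull().mean()

# Fraction of rows that contain at least one missing value
rows_lost = toy.isnull().any(axis=1).mean()

print(missing_share)
print(f"Rows removed by dropna(): {rows_lost:.0%}")
```

If `rows_lost` is small on the real data, dropping the affected rows costs little; if it is large, imputation may be worth considering instead.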

**`The number of missing values is small, so we remove those rows from the dataset and also drop duplicate rows.`**

In [None]:
df.drop_duplicates(inplace=True) # to drop duplicate values
df.dropna(inplace=True) # to drop null values

In [None]:
# Get descriptive statistics to understand the distribution of numerical features
df.describe().T

In [None]:
print(df.describe(include='object'))  # Categorical columns

###  Convert Date Columns

In [None]:
df['created_at'] = pd.to_datetime(df['created_at'])
df['actual_delivery_time'] = pd.to_datetime(df['actual_delivery_time'])

# Feature Engineering

In [None]:
# Calculate the delivery duration in seconds by subtracting the order creation time
# from the actual delivery time and extracting the total seconds.
df['delivery_duration_sec'] = (df['actual_delivery_time'] - df['created_at']).dt.total_seconds()
df.head()

In [None]:
# Calculate the number of free dashers
df["free_dashers"] = df["total_onshift_dashers"] - df["total_busy_dashers"]
df.head()

### Inference

`Negative values appear in the free_dashers column. They indicate that no dashers were available to pick up and deliver the order, so the delivery was likely delayed.`
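
To see how often this happens, the negative values can be counted and, if desired, clipped at zero. A sketch on a hypothetical toy frame; the `free_dashers_clipped` column is an illustrative assumption and is not used elsewhere in this notebook:

```python
import pandas as pd

# Hypothetical dasher counts for three orders
toy = pd.DataFrame({
    "total_onshift_dashers": [10, 5, 8],
    "total_busy_dashers": [7, 9, 8],
})
toy["free_dashers"] = toy["total_onshift_dashers"] - toy["total_busy_dashers"]

# Share of orders where more dashers are busy than on shift
negative_share = (toy["free_dashers"] < 0).mean()

# Optional: treat "fewer than zero free dashers" as zero
toy["free_dashers_clipped"] = toy["free_dashers"].clip(lower=0)

print(toy)
print(f"Share of negative values: {negative_share:.2f}")
```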

## Column Unique Values

In [None]:
# Number of unique (non-null) values in each column
df.nunique().to_frame("Total Unique Values")

# EDA

In [None]:
# Market count
sns.countplot(x=df["market_id"], palette="Set2")
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x=df["store_primary_category"], y=df["total_items"],palette="Set2")
plt.xticks(rotation=90)
plt.show()

`All restaurant cuisine categories sold well.`

In [None]:
# No hue variable is set, so a palette would be ignored and is omitted
sns.scatterplot(x=df["total_items"], y=df["delivery_duration_sec"])
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
# "railbow" is not a valid palette name, and without a hue variable a palette has no effect
sns.scatterplot(x=df["store_id"], y=df["delivery_duration_sec"])
plt.show()

In [None]:
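# distplot is deprecated in seaborn; histplot with a KDE overlay gives the equivalent view
sns.histplot(np.sqrt(df["delivery_duration_sec"]), kde=True, stat="density")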
plt.show()

In [None]:
# heatmap

plt.figure(figsize=(10, 6))
sns.heatmap(df.select_dtypes(include=np.number).corr(), annot=True, cmap='BrBG',linewidths=0.5)
plt.show()

## Dropping Outliers in Target column

In [None]:
drop_out= df[df["delivery_duration_sec"] > 10000]
print(len(drop_out))

df.drop(drop_out.index, inplace=True)
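
The 10,000-second cutoff above is a fixed threshold. As an alternative, here is a sketch of the common 1.5 × IQR rule on made-up durations (the values and the `upper_fence` name are illustrative assumptions, not results from this dataset):

```python
import pandas as pd

# Hypothetical delivery durations in seconds, with one extreme value
durations = pd.Series(
    [1800, 2400, 2700, 3000, 3300, 3600, 50000],
    name="delivery_duration_sec",
)

# Interquartile range and the upper fence at Q3 + 1.5 * IQR
q1, q3 = durations.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Keep only the rows at or below the fence
kept = durations[durations <= upper_fence]

print(f"Upper fence: {upper_fence:.0f} sec, rows kept: {len(kept)} of {len(durations)}")
```

A data-driven fence adapts to the distribution, while a fixed cutoff is easier to explain; which one fits better depends on the data.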

In [None]:
px.box(df, y="delivery_duration_sec")

## Removing Unwanted columns

In [None]:
df_copy = df.copy()

In [None]:
df_copy.drop(columns=["created_at","market_id", "store_id", "store_primary_category", "order_protocol","subtotal","num_distinct_items", "actual_delivery_time","min_item_price","max_item_price"], inplace=True)

In [None]:
corr = df_copy.corr()

sns.heatmap(corr, annot=True, cmap='coolwarm',linewidths=0.5)
plt.show()

# Train Test Split

In [None]:
X = df_copy.drop("delivery_duration_sec", axis=1)
y = df_copy["delivery_duration_sec"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=25)

# Feature Scaling

In [None]:
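# Use separate scalers for features and target so that fitting one does not overwrite the other
x_scale = MinMaxScaler()

X_train_s = x_scale.fit_transform(X_train)
X_test_s = x_scale.transform(X_test)

# `scale` holds the target scaler; later cells call scale.inverse_transform on predictions
scale = MinMaxScaler()

y_train_s = scale.fit_transform(y_train.values.reshape(-1, 1))
y_test_s = scale.transform(y_test.values.reshape(-1, 1))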
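
Because the target is scaled as well, a model trained on `y_train_s` predicts values in roughly the [0, 1] range, and these must be mapped back to seconds with the scaler that was fitted on the target. A minimal sketch on made-up numbers (`y_scaler` and the values are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training targets in seconds
y_train_toy = np.array([1200.0, 2400.0, 3600.0]).reshape(-1, 1)

y_scaler = MinMaxScaler()
y_scaled = y_scaler.fit_transform(y_train_toy)  # min maps to 0, max maps to 1

# A scaled prediction of 0.25 corresponds to 1200 + 0.25 * (3600 - 1200) seconds
pred_scaled = np.array([[0.25]])
pred_seconds = y_scaler.inverse_transform(pred_scaled)

print(y_scaled.ravel(), pred_seconds.ravel())
```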

# Pipeline

In [None]:
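from sklearn.pipeline import Pipeline  # not included in the import cell above

# Chain scaling and regression so the scaler is always fitted on the training data only
pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', LinearRegression())
])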

# Evaluation metrics

In [None]:
def evaluation_matrix(actual, pred):
  # Print R-squared, MAE, MSE, and RMSE for a set of predictions
  MAE = mean_absolute_error(actual, pred)
  MSE = mean_squared_error(actual, pred)
  RMSE = np.sqrt(MSE)
  SCORE = r2_score(actual, pred)
  print("\n","r2_score:",SCORE , "\n","MAE:", MAE, "\n","MSE:",MSE, "\n","RMSE:", RMSE)
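
As a sanity check on what these four numbers mean, here is a small worked example on made-up values where every prediction is off by exactly 10 (the arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([100.0, 200.0, 300.0, 400.0])
pred = np.array([110.0, 190.0, 310.0, 390.0])

mae = mean_absolute_error(actual, pred)  # mean of |error|
mse = mean_squared_error(actual, pred)   # mean of error squared
rmse = np.sqrt(mse)                      # back in the units of the target
r2 = r2_score(actual, pred)              # 1 - SS_res / SS_tot

print(mae, mse, rmse, r2)
```

Here SS_res is 4 × 10² = 400 and SS_tot is 50,000, so R-squared is 1 − 400 / 50,000 = 0.992.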

## 1. Linear Regression

In [None]:
pipe.fit(X_train, y_train)
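
The fitted pipeline can be scored directly on unscaled data, since it applies the scaler internally. A sketch on synthetic data with an exact linear relation (all names and values below are hypothetical, chosen only to show the mechanics):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features and a noiseless linear target
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(200, 3))
y_toy = 5.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 30.0

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=25)

toy_pipe = Pipeline([("scaler", MinMaxScaler()), ("model", LinearRegression())])
toy_pipe.fit(X_tr, y_tr)              # scaler statistics come from the training split only
r2_test = toy_pipe.score(X_te, y_te)  # R-squared on the held-out split

print(f"Test R-squared: {r2_test:.4f}")
```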

In [None]:
lr_model_g = LinearRegression()
lr_model_g.fit(X_train_s, y_train_s)
y_pred_l = lr_model_g.predict(X_test_s)

train_score_LRg= lr_model_g.score(X_train_s,y_train_s)
test_score_LRg= lr_model_g.score(X_test_s,y_test_s)
print("Train Score LR", train_score_LRg)
print("Test Score LR", test_score_LRg)
evaluation_matrix(y_test_s, y_pred_l)
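
Note that the MAE, MSE, and RMSE printed here are in scaled units, because `y_test_s` is min-max scaled, while the Random Forest and XGBoost cells below report errors in seconds. R-squared is unaffected by this rescaling, but the error metrics are only comparable after mapping predictions back to seconds. A sketch with made-up numbers (`y_scaler` and the values are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler

# Hypothetical true durations in seconds (range = 4000)
y_true_sec = np.array([[1000.0], [2000.0], [3000.0], [5000.0]])
y_scaler = MinMaxScaler().fit(y_true_sec)

y_true_scaled = y_scaler.transform(y_true_sec)
y_pred_scaled = y_true_scaled + 0.05  # pretend every prediction is off by 0.05 scaled units

# The same errors, expressed in scaled units and in seconds
mae_scaled = mean_absolute_error(y_true_scaled, y_pred_scaled)
mae_seconds = mean_absolute_error(y_true_sec, y_scaler.inverse_transform(y_pred_scaled))

print(mae_scaled, mae_seconds)
```

An error of 0.05 in scaled units corresponds to 0.05 × 4000 = 200 seconds in this toy case.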

## 2. RandomForestRegressor

In [None]:
RF_model = RandomForestRegressor()
RF_model.fit(X_train, y_train)
y_pred_r = RF_model.predict(X_test)

train_score_RF= RF_model.score(X_train,y_train)
test_score_RF= RF_model.score(X_test,y_test)
print("Train Score RF", train_score_RF)
print("Test Score RF", test_score_RF)

In [None]:
evaluation_matrix(y_test,y_pred_r)

## 3. XGBoost

In [None]:
xg_boost = XGBRegressor()
xg_boost.fit(X_train, y_train)
y_pred_x = xg_boost.predict(X_test)

train_score_xg= xg_boost.score(X_train,y_train)
test_score_xg= xg_boost.score(X_test,y_test)
print("Train Score XGB", train_score_xg)
print("Test Score XGB", test_score_xg)

In [None]:
evaluation_matrix(y_test,y_pred_x)

## Deep learning Model

In [None]:
def build_model(input_shape):
    model = keras.Sequential([
        layers.Dense(30, activation="relu", input_shape=input_shape),
        layers.Dropout(0.5),
        layers.Dense(15, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1)  # Output layer with 1 neuron for regression
    ])
    optimizer = tf.keras.optimizers.Adam(0.001) # optimizer

    model.compile(loss="mean_squared_error",
                  optimizer=optimizer,
                  metrics=["mae", "mse"])
    return model

In [None]:
input_shape = X_train_s.shape[1:]  # Input shape for the model

nn_model = build_model(input_shape) # Call function

# Define the EarlyStopping callback
early_stopping = EarlyStopping(
    monitor="val_loss",  # Metric to monitor
    patience=4,          # Number of epochs to wait before stopping
    restore_best_weights=True  # Restore the best model weights
)

# Train the model
history = nn_model.fit(X_train_s,
                           y_train_s,
                           epochs=1000,
                           validation_split=0.2, verbose=1,
                           callbacks=[early_stopping],
                           batch_size = 15)
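
The effect of `patience` and `restore_best_weights` can be illustrated without Keras. The function below is a simplified sketch of the idea, not the Keras implementation; `early_stopping_epoch` and the loss values are made up for illustration:

```python
def early_stopping_epoch(val_losses, patience=4):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the stop
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: the best value is at epoch 2
val_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.63, 0.5]
stopped_at, best_at = early_stopping_epoch(val_losses, patience=4)
print(stopped_at, best_at)
```

Training stops after four epochs without improvement, and restoring the best weights would roll the model back to epoch 2. The later value of 0.5 is never reached, which is the trade-off of a small patience.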

## Evaluate the NN model

In [None]:
loss, mae, mse= nn_model.evaluate(X_test_s, y_test_s, verbose=0)
print("Testing set Mean Abs Error: {:.3f} ".format(mae))
print("Testing set Mean Squared Error: {:.3f}".format(mse))
print("Testing set Root Mean Squared Error: {:.3f}".format(np.sqrt(mse)))

In [None]:
y_pred = nn_model.predict(X_test_s)
evaluation_matrix(y_test_s, y_pred)

In [None]:
# Plot training & validation MAE values
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['mae'])
plt.plot(history.history['val_mae'])
plt.title('Model MAE')
plt.ylabel('MAE')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')

# Plot training & validation loss values
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')

plt.show()

# Saving the Neural network model

In [None]:
nn_model.save('model.keras')
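
The saved Keras file contains only the network. Predictions in a fresh session also need the fitted scalers, so those have to be persisted separately. A sketch using `joblib`, which is installed alongside scikit-learn; the file name and the toy scaler are hypothetical:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical scaler fitted on values between 0 and 100
toy_scaler = MinMaxScaler().fit(np.array([[0.0], [50.0], [100.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.joblib")  # hypothetical file name
    joblib.dump(toy_scaler, path)   # save the fitted scaler
    restored = joblib.load(path)    # load it back, as a new session would

# The restored scaler reproduces the original transformation
scaled = restored.transform(np.array([[25.0]]))
print(scaled)
```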

# Loading the saved model and prediction

In [None]:
Model = tf.keras.models.load_model('model.keras')

In [None]:
total_items = int(input("Enter total_items: "))
total_onshift_dashers = int(input("Enter total_onshift_dashers: "))
total_busy_dashers = int(input("Enter total_busy_dashers: "))
total_outstanding_orders = int(input("Enter total_outstanding_orders: "))
estimated_order_place_duration = int(input("Enter estimated_order_place_duration: "))
estimated_store_to_consumer_driving_duration = int(input("Enter estimated_store_to_consumer_driving_duration: "))
free_dashers = int(input("Enter free_dashers: "))

In [None]:
input_data = np.array([[total_items, total_onshift_dashers, total_busy_dashers, total_outstanding_orders, estimated_order_place_duration, estimated_store_to_consumer_driving_duration, free_dashers]])

In [None]:
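# The network was trained on min-max scaled features, so new inputs need the same scaling.
# `scale` was last fitted on the target, so a feature scaler is fitted on X_train here.
feature_scale = MinMaxScaler().fit(X_train.values)

def prediction(*input_data):
  features = feature_scale.transform(np.array([input_data]))
  pred_scaled = Model.predict(features)
  # Map the scaled prediction back to seconds with the target scaler
  delivery_duration_sec = scale.inverse_transform(pred_scaled)
  return abs(delivery_duration_sec[0][0])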

print("Total seconds to deliver",prediction(total_items, total_onshift_dashers, total_busy_dashers, total_outstanding_orders, estimated_order_place_duration, estimated_store_to_consumer_driving_duration, free_dashers),"sec")