<a href="https://colab.research.google.com/github/AaryanPriyadarshi/Amazon-Delivery-Time-Prediction/blob/main/Amazon_Delivery_Time_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Amazon Delivery Time Prediction


##### **Project Type**    - EDA / Regression / Forecasting

##### **Contribution**    - Aaryan Priyadarshi

##### **Domain**          - E-Commerce & Logistics

##### **Goal**            - Predict delivery times based on traffic, weather, distance, and delivery agent data.

# **Project Summary -**

This project focuses on predicting Amazon delivery times using machine learning to improve logistics efficiency and customer satisfaction. The dataset contains key operational parameters such as traffic conditions, weather, distance, vehicle type, and agent performance metrics.
After cleaning and validating the data, several engineered features were added — including distance between store and customer (via the Haversine formula) and time-based variables like month, weekday, and order hour.
Exploratory Data Analysis (EDA) revealed strong relationships between distance, traffic intensity, and delivery time, which were further validated using hypothesis testing.
Machine learning models — including Random Forest Regressor and XGBoost Regressor — were implemented and evaluated based on RMSE, MAE, and R² scores.
Among these, XGBoost delivered the best performance, demonstrating its ability to capture non-linear patterns effectively.
Finally, feature importance and SHAP analysis provided valuable interpretability, identifying distance, traffic conditions, and agent ratings as the most influential factors.
This end-to-end solution not only predicts delivery duration accurately but also provides data-driven insights for route optimization and workforce planning in e-commerce logistics.

# **GitHub Link -**

https://github.com/AaryanPriyadarshi/Amazon-Delivery-Time-Prediction

# **Problem Statement**


Accurately predicting delivery time is a critical challenge in e-commerce logistics. Factors such as traffic congestion, weather variations, route distance, and delivery agent performance contribute to unpredictable delivery durations.
The problem this project addresses is:

>How can we use data-driven modeling to accurately predict delivery times and identify the operational factors that most affect them?

By building predictive models and analyzing the underlying drivers of delivery delays, this project aims to help Amazon and similar companies optimize delivery schedules, allocate resources efficiently, and enhance the customer experience through reliable delivery time estimates.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact?
    Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-validation & hyperparameter tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.


# ***Let's Begin !***

### **1. Import Libraries**

Imported necessary libraries for:

- **Data handling:** `pandas`, `numpy`  
- **Data visualization:** `matplotlib`, `seaborn`, `plotly`  
- **Machine learning:** `sklearn`, `xgboost`  
- **Statistical testing:** `scipy`  
- **Explainability:** `shap`  
- **Logging, warnings, and reproducibility**


In [None]:
# Core Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings, logging, random, time, os

# Machine Learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

# Others
from math import radians, sin, cos, sqrt, atan2
from scipy import stats

# Configuration
warnings.filterwarnings("ignore", category=FutureWarning)
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Reproducibility
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

logging.info("Libraries imported and configurations set successfully.")


### **2. Dataset Loading**

Mounted Google Drive and loaded the **Amazon Delivery dataset** stored in Drive.  

Verified that the dataset is accessible, correctly formatted, and loaded into a pandas DataFrame.  
Previewed dataset shape and the first few rows for confirmation.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

import os

# Define the dataset path inside Google Drive
DATA_PATH = "/content/drive/MyDrive/Amazon Delivery Time Prediction/amazon_delivery.csv"

def load_dataset(path):
    """Load dataset from Drive and verify existence."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"File not found at: {path}")
    df = pd.read_csv(path)
    logging.info(f"Dataset loaded successfully. Shape: {df.shape}")
    return df

# Load dataset
df = load_dataset(DATA_PATH)

# Display basic information
print("Dataset Shape:", df.shape)
display(df.head())
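
The loading cell above only runs inside Colab, because `google.colab` is not importable elsewhere. A minimal sketch of a fallback is shown below. `resolve_data_path` and the local file name are hypothetical, introduced only for illustration.

```python
def resolve_data_path(drive_path, local_path):
    """Return the Drive path inside Colab, else a local fallback (hypothetical helper)."""
    try:
        # Only importable inside a Colab runtime
        from google.colab import drive
    except ImportError:
        return local_path
    drive.mount("/content/drive")
    return drive_path

# Assumed local file name; adjust to wherever the CSV actually lives
DATA_PATH_RESOLVED = resolve_data_path(
    "/content/drive/MyDrive/Amazon Delivery Time Prediction/amazon_delivery.csv",
    "amazon_delivery.csv",
)
```

The resolved path can then be passed to `load_dataset` unchanged, so the existence check still applies in both environments.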


### 3. Dataset First View

- Log the dataset shape
- Count missing values per column with `.isnull().sum()`
- Count duplicate rows with `.duplicated().sum()`


In [None]:
def explore_data(df):
    """Perform initial data checks."""
    logging.info(f"Shape: {df.shape}")
    logging.info(f"Missing values:\n{df.isnull().sum()}")
    logging.info(f"Duplicate rows: {df.duplicated().sum()}")
    return df.isnull().sum()

explore_data(df)


### 4. Input Validation

Checked if the required columns exist in the dataset:  

- Order, delivery, and location fields  
- Verified presence of key identifiers for feature engineering  

Ensures the dataset is complete and properly structured before further analysis.


In [None]:
required_columns = [
    "Order_ID", "Delivery_Time",
    "Store_Latitude", "Store_Longitude",
    "Drop_Latitude", "Drop_Longitude",
    "Order_Date"
]

missing = [col for col in required_columns if col not in df.columns]
if missing:
    logging.warning(f"Missing essential columns: {missing}")
else:
    logging.info("All required columns found.")
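
The check above only logs a warning, so later cells would still fail with a `KeyError` if a column were absent. A stricter variant could fail fast and also sanity-check coordinate ranges. The sketch below runs on a toy frame. `validate_columns` and `count_invalid_coordinates` are hypothetical helpers, not part of the original notebook.

```python
import pandas as pd

def validate_columns(df, required):
    """Raise immediately if any required column is missing."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing essential columns: {missing}")

def count_invalid_coordinates(df, lat_col, lon_col):
    """Count rows whose latitude/longitude are outside valid ranges.

    NaN coordinates are counted as invalid because between() is False for NaN.
    """
    bad_lat = ~df[lat_col].between(-90, 90)
    bad_lon = ~df[lon_col].between(-180, 180)
    return int((bad_lat | bad_lon).sum())

# Toy frame standing in for the real dataset
demo = pd.DataFrame({
    "Store_Latitude": [12.9, 95.0],    # 95.0 is not a valid latitude
    "Store_Longitude": [77.6, 80.0],
})
validate_columns(demo, ["Store_Latitude", "Store_Longitude"])
n_bad = count_invalid_coordinates(demo, "Store_Latitude", "Store_Longitude")
print(f"Rows with invalid coordinates: {n_bad}")
```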


### 5. Feature Engineering

Created new informative features to enhance model performance:

- **Distance (km):** Calculated using the Haversine formula between store and customer locations  
- **Temporal features:** Extracted month, weekday, weekend flag, and hour from order date/time  
- **Interaction features:** Combined distance with traffic intensity for added predictive insight  

These transformations allow the model to better understand geographical and time-based trends.
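
As a point of reference for the distance feature, the Haversine formula can also be evaluated on whole columns at once with NumPy. This is only a sketch: `haversine_km_vectorized` is a hypothetical name, and 6371 km is the usual mean Earth radius approximation.

```python
import numpy as np

def haversine_km_vectorized(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km for arrays of coordinates given in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    a = np.clip(a, 0.0, 1.0)  # guard against tiny floating-point overshoot
    return radius_km * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Two toy pairs: one degree of longitude at the equator, and two identical points
distances = haversine_km_vectorized(
    [0.0, 12.97], [0.0, 77.59],
    [0.0, 12.97], [1.0, 77.59],
)
print(distances)
```

On a DataFrame, the four coordinate columns could be passed directly in place of the lists, which avoids a Python-level loop over rows.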


In [None]:
from math import radians, sin, cos, sqrt, atan2

def haversine_km(lat1, lon1, lat2, lon2):
    """Calculate distance between two coordinates in kilometers."""
    R = 6371.0  # Radius of Earth in km
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    return R * 2 * atan2(sqrt(a), sqrt(1 - a))

def add_features(df):
    """Add distance, temporal, and interaction-based features."""

    try:
        # ✅ Distance Feature
        df['distance_km'] = df.apply(lambda r: haversine_km(
            r['Store_Latitude'], r['Store_Longitude'],
            r['Drop_Latitude'], r['Drop_Longitude']
        ), axis=1)
        logging.info("Distance feature created successfully.")
    except KeyError as e:
        logging.warning(f"Missing column for distance calculation: {e}")

    # ✅ Convert and extract date/time features
    if 'Order_Date' in df.columns:
        df['Order_Date'] = pd.to_datetime(df['Order_Date'], errors='coerce')
        df['Order_Month'] = df['Order_Date'].dt.month
        df['Order_Weekday'] = df['Order_Date'].dt.weekday
        df['Is_Weekend'] = df['Order_Weekday'].isin([5, 6]).astype(int)

    if 'Order_Time' in df.columns:
        df['Order_Hour'] = pd.to_datetime(df['Order_Time'], format='%H:%M:%S', errors='coerce').dt.hour

    # ✅ Interaction Feature (distance × traffic)
    if 'Traffic' in df.columns:
        df['Traffic_code'] = LabelEncoder().fit_transform(df['Traffic'].astype(str))
        if 'distance_km' in df.columns:
            df['distance_x_traffic'] = df['distance_km'] * df['Traffic_code']

    logging.info("Feature engineering completed successfully.")
    return df

# Apply the feature engineering
df = add_features(df)

# Display first few rows
display(df.head())
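
One caveat on the interaction feature above: `LabelEncoder` assigns integer codes in sorted label order, so `Traffic_code` does not necessarily increase with traffic intensity. An explicit ordinal mapping is one way around this. The label set and ordering below are assumptions and should be checked against `df['Traffic'].unique()`. `encode_traffic_ordinal` is a hypothetical helper.

```python
import pandas as pd

# Assumed labels and ordering; confirm against the actual column values first
TRAFFIC_ORDER = {"Low": 0, "Medium": 1, "High": 2, "Jam": 3}

def encode_traffic_ordinal(traffic, order=TRAFFIC_ORDER):
    """Map traffic labels to intensity ranks; unknown labels become NaN."""
    cleaned = traffic.astype(str).str.strip()
    return cleaned.map(order)

# Toy data, including trailing whitespace and an unexpected label
demo = pd.DataFrame({
    "Traffic": ["Low ", "Jam", "High", "unknown"],
    "distance_km": [2.0, 5.0, 3.0, 4.0],
})
demo["Traffic_rank"] = encode_traffic_ordinal(demo["Traffic"])
demo["distance_x_traffic"] = demo["distance_km"] * demo["Traffic_rank"]
print(demo)
```

Unknown labels surface as NaN instead of receiving an arbitrary code, so they can be inspected or imputed deliberately.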


### 6. Feature Verification

Verified that new engineered features were successfully created and contain valid data.  
This step ensures all transformations — distance, temporal, and interaction features — exist, have correct data types, and minimal missing values.


In [None]:
# List new engineered features
engineered_features = [
    'distance_km',
    'Order_Month',
    'Order_Weekday',
    'Is_Weekend',
    'Order_Hour',
    'Traffic_code',
    'distance_x_traffic'
]

print("Checking engineered features...\n")

# Check which features exist
existing = [col for col in engineered_features if col in df.columns]
missing = [col for col in engineered_features if col not in df.columns]

print(f"Existing features: {existing}")
if missing:
    print(f"Missing features: {missing}\n")
else:
    print("All engineered features are present.\n")

# Show summary statistics for numeric engineered features
display(df[existing].describe())

# Check for nulls and data types
print("\nNull Values:")
display(df[existing].isnull().sum())

print("\nData Types:")
display(df[existing].dtypes)


### 7. Data Visualization

Explored and visualized data patterns using different plots:

- **Delivery Time Distribution:** To view time variability and skewness  
- **Distance vs Delivery Time:** To check the correlation between distance and delay  
- **Delivery Time by Traffic Condition:** To see how traffic affects delivery duration  
- **Correlation Heatmap:** To understand numeric relationships between variables  

Helps identify key features influencing delivery performance.


In [None]:
# Delivery Time Distribution
plt.figure(figsize=(8,4))
sns.histplot(df['Delivery_Time'].dropna(), kde=True)
plt.title("Delivery Time Distribution")
plt.xlabel("Hours")
plt.show()

# Distance vs Delivery Time
plt.figure(figsize=(8,5))
sns.scatterplot(x='distance_km', y='Delivery_Time', data=df, alpha=0.5)
plt.title("Distance vs Delivery Time")
plt.show()

# Delivery Time by Traffic
if 'Traffic' in df.columns:
    plt.figure(figsize=(10,5))
    sns.boxplot(x='Traffic', y='Delivery_Time', data=df)
    plt.title("Delivery Time by Traffic Condition")
    plt.show()

# Correlation Heatmap
plt.figure(figsize=(10,6))
sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()


### 7. Hypothesis Testing

Formulated and tested the hypothesis:  

> “Weekend deliveries take longer than weekday deliveries.”  

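Used **Welch's independent two-sample t-test** (`equal_var=False`, so equal variances are not assumed) to compare mean delivery times between weekend and weekday groups.  
The test as run is two-sided: a p-value below 0.05 indicates that the group means differ, and the sign of the t-statistic shows which group is slower.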


In [None]:
weekend = df[df['Is_Weekend'] == 1]['Delivery_Time'].dropna()
weekday = df[df['Is_Weekend'] == 0]['Delivery_Time'].dropna()

t_stat, p_val = stats.ttest_ind(weekend, weekday, equal_var=False)
print(f"T-statistic: {t_stat:.4f}, p-value: {p_val:.4f}")

if p_val < 0.05:
    print("Reject Null Hypothesis: Weekend and weekday delivery times differ significantly.")
else:
    print("Fail to Reject Null Hypothesis: No significant difference.")
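
The hypothesis as stated is directional ("weekend deliveries take longer"), while `ttest_ind` is two-sided by default. A one-sided Welch test could look like the sketch below. It uses the `alternative` argument (available in SciPy 1.6 and later) on synthetic data, not the notebook's results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-ins for the two groups; the notebook would use Delivery_Time
weekend_demo = rng.normal(loc=130, scale=20, size=500)
weekday_demo = rng.normal(loc=120, scale=20, size=500)

# H1: mean(weekend) > mean(weekday), without assuming equal variances
t_stat, p_one_sided = stats.ttest_ind(
    weekend_demo, weekday_demo, equal_var=False, alternative="greater"
)
# Two-sided version for comparison
_, p_two_sided = stats.ttest_ind(weekend_demo, weekday_demo, equal_var=False)

print(f"t = {t_stat:.3f}, one-sided p = {p_one_sided:.3g}, two-sided p = {p_two_sided:.3g}")
```

When the t-statistic is positive, the one-sided p-value is half the two-sided one, so the directional test is the more powerful choice if the direction was fixed before looking at the data.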


### 8. Data Preprocessing

Prepared the dataset for machine learning by:

- **Encoding categorical variables** using Label Encoding  
- **Removing irrelevant identifiers** (e.g., Order ID)  
- **Handling missing values** using median imputation  
- **Splitting data** into 80% training and 20% testing sets  
- **Scaling features** using StandardScaler for model compatibility  

Ensures all features are clean, numerical, and properly scaled.


In [None]:
def preprocess_and_split(df, target='Delivery_Time', test_size=0.2, seed=42):
    df_copy = df.copy()

    # Drop irrelevant ID-like columns
    drop_cols = ['Order_ID', 'Order_Date', 'Pickup_Time']  # Drop datetime & identifiers
    df_copy.drop(columns=drop_cols, errors='ignore', inplace=True)

    # Encode categorical columns
    cat_cols = df_copy.select_dtypes(include='object').columns
    le = LabelEncoder()
    for c in cat_cols:
        df_copy[c] = le.fit_transform(df_copy[c].astype(str))

    # Handle missing values for numeric features
    # (assign the result back; chained inplace fillna is deprecated in recent pandas)
    for col in df_copy.select_dtypes(include=np.number).columns:
        df_copy[col] = df_copy[col].fillna(df_copy[col].median())

    # Define X and y
    X = df_copy.drop(columns=[target], errors='ignore')
    y = df_copy[target]

    # Keep only numeric columns for training
    X = X.select_dtypes(include=[np.number])

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    logging.info("Preprocessing completed successfully.")
    return X_train_scaled, X_test_scaled, y_train, y_test, X.columns

# Run preprocessing
X_train_scaled, X_test_scaled, y_train, y_test, feature_names = preprocess_and_split(df)
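
Note that the function above computes medians on the full dataset before splitting and also imputes the target column. One common way to avoid leaking test-set statistics is to drop rows with a missing target and fit the imputer and scaler on the training split only. The sketch below uses a scikit-learn `Pipeline` on toy data. The column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric frame standing in for the encoded dataset (column names assumed)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "distance_km": rng.uniform(1, 20, 200),
    "Agent_Rating": rng.uniform(3, 5, 200),
})
toy["Delivery_Time"] = 10 * toy["distance_km"] + rng.normal(0, 5, 200)
toy.loc[::10, "Agent_Rating"] = np.nan      # simulate missing feature values
toy.loc[5, "Delivery_Time"] = np.nan        # simulate a missing target

toy = toy.dropna(subset=["Delivery_Time"])  # drop, rather than impute, the target
X = toy.drop(columns="Delivery_Time")
y = toy["Delivery_Time"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_train_prepared = prep.fit_transform(X_train)  # statistics learned from train only
X_test_prepared = prep.transform(X_test)        # reused, not re-fitted, on test
```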


### 9. Model Implementation

Implemented and trained multiple regression models:

- **Random Forest Regressor**  
- **XGBoost Regressor**  

Each model was evaluated using the following metrics:  
- Root Mean Squared Error (**RMSE**)  
- Mean Absolute Error (**MAE**)  
- Coefficient of Determination (**R²**)  

This step determines the best-performing model for delivery time prediction.
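
For reference, the three metrics can be computed by hand on a tiny example. The numbers below are made up purely to show the arithmetic.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

residuals = y_true - y_pred                      # [1, 0, -2]
rmse = np.sqrt(np.mean(residuals ** 2))          # sqrt(5/3)
mae = np.mean(np.abs(residuals))                 # 3/3
ss_res = np.sum(residuals ** 2)                  # 5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 8
r2 = 1 - ss_res / ss_tot                         # 1 - 5/8

print(f"RMSE={rmse:.3f}, MAE={mae:.3f}, R2={r2:.3f}")
```

RMSE and MAE are in the same units as the target, with RMSE penalizing large errors more heavily, while R² is the unitless share of variance explained.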


In [None]:
def evaluate_model(name, model, X_train, X_test, y_train, y_test):
    start = time.time()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    duration = time.time() - start
    logging.info(f"{name} — RMSE: {rmse:.3f}, MAE: {mae:.3f}, R²: {r2:.3f}, Time: {duration:.2f}s")
    return {"Model": name, "RMSE": rmse, "MAE": mae, "R²": r2}

models = [
    ("Random Forest", RandomForestRegressor(n_estimators=200, random_state=SEED)),
    ("XGBoost", XGBRegressor(n_estimators=300, learning_rate=0.1, random_state=SEED))
]

results = [evaluate_model(n, m, X_train_scaled, X_test_scaled, y_train, y_test) for n, m in models]
results_df = pd.DataFrame(results).sort_values("RMSE")
display(results_df)
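
The comparison above fits each model once on a single train/test split. A cross-validated hyperparameter search is one way to make the comparison more robust. The sketch below uses `GridSearchCV` on synthetic data. The grid values are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data standing in for the scaled training set
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}  # illustrative grid
cv = KFold(n_splits=3, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the sign
    cv=cv,
)
search.fit(X_demo, y_demo)

best_rmse = -search.best_score_
print(f"Best params: {search.best_params_}, CV RMSE: {best_rmse:.3f}")
```

In the notebook, `X_train_scaled` and `y_train` would take the place of the synthetic arrays, leaving the test split untouched until the final evaluation.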


### 10. Model Comparison

Compared model performance visually using bar plots to highlight differences in RMSE, MAE, and R².  

Helps quickly identify the model that performs best across accuracy and error metrics.


In [None]:
results_df.set_index("Model")[["RMSE", "MAE", "R²"]].plot(kind='bar', figsize=(8,5), colormap='viridis')
plt.title("Model Performance Comparison")
plt.ylabel("Score")
plt.show()


### 11. Feature Importance

Assessed which features most strongly influence delivery time prediction:  

- Used **feature importance scores** from tree-based models  
- Applied **SHAP explainability** (if available) for interpretability  

These insights help understand the relative impact of distance, traffic, and timing on delivery duration.


In [None]:
try:
    import shap
    SHAP_AVAILABLE = True
except Exception:
    SHAP_AVAILABLE = False

best_model_name = results_df.iloc[0]["Model"]
# Re-create the best model by name, with the same hyperparameters used in the comparison
if best_model_name == "Random Forest":
    best_model = RandomForestRegressor(n_estimators=200, random_state=SEED)
else:
    best_model = XGBRegressor(n_estimators=300, learning_rate=0.1, random_state=SEED)
best_model.fit(X_train_scaled, y_train)

fi = pd.DataFrame({"Feature": feature_names, "Importance": best_model.feature_importances_}).sort_values("Importance", ascending=False).head(20)
plt.figure(figsize=(8,6))
sns.barplot(x="Importance", y="Feature", data=fi, palette="coolwarm")
plt.title(f"{best_model_name} - Top 20 Important Features")
plt.show()

if SHAP_AVAILABLE:
    explainer = shap.Explainer(best_model, X_train_scaled)
    shap_values = explainer(X_test_scaled)
    shap.summary_plot(shap_values, X_test_scaled, feature_names=feature_names, show=True)
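
Impurity-based `feature_importances_` are computed from the training data and can favor high-cardinality features, so a complementary check on held-out data can be useful. The sketch below applies scikit-learn's `permutation_importance` to synthetic data in which only the first feature drives the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
# Only feature 0 carries signal; features 1 and 2 are pure noise
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the held-out split and measure the drop in score
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=42)
ranking = np.argsort(result.importances_mean)[::-1]
print("Features ranked by permutation importance:", ranking)
```

Applied to the notebook, `best_model`, `X_test_scaled`, and `y_test` would replace the synthetic objects, and `feature_names` could label the ranking.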


# **12. Conclusion**

- Delivery time is mainly affected by **distance**, **traffic conditions**, and **order timing**.  
- **XGBoost** achieved the highest predictive accuracy among models tested.  
- Feature importance confirms that geographical and temporal factors dominate prediction performance.  

### **Future Enhancements**
- Integrate **real-time traffic and weather data**  
- Apply **Bayesian hyperparameter tuning** for optimization  
- Deploy the model via **Streamlit or Flask** for real-world usability

