### <div align="center">***MODEL BUILDING AND TRAINING***</div>

### ***Import libraries and modules***

In [8]:
# Import required libraries

import os
import joblib
import mlflow
import dagshub
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split, RandomizedSearchCV, KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from category_encoders import TargetEncoder
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import xgboost as xgb

import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="_distutils_hack")
warnings.filterwarnings("ignore", category=FutureWarning, module="mlflow.data")

### ***Preprocessing dataset for Modeling***

In [2]:
# Load Clean Dataset
fe_dataset = pd.read_csv(r'C:\Users\spand\Projects\LABMENTIX_PROJECTS\Amazon_DeliveryTime_Prediction\Data\Processed\EDA_dataset.csv')

In [3]:
df = fe_dataset.copy()

# Interactive features
df['Traffic_Area'] = df['Traffic'].astype(str) + "_" + df['Area'].astype(str)
df['Area_Vehicle'] = df['Area'].astype(str) + "_" + df['Vehicle'].astype(str)
df['Weather_Area'] = df['Weather'].astype(str) + "_" + df['Area'].astype(str)

df['Is_Peak_Hours'] = df['Order_Hour'].apply(lambda x: 1 if (17 <= x <= 23) else 0) # Create peak hours from order hours
df['Is_Urban'] = df['Area'].apply(lambda x: 1 if x in ['Urban', 'Metropolitan'] else 0) # Collapse to Binary (Urban vs Non-Urban)

df = df.drop(["Delay_Time_M", "Order_Month", "Order_DayOfWeek", "Order_Day", "Is_Weekend"], axis=1) # Drop features with no predictive power
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43648 entries, 0 to 43647
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Agent_Age      43648 non-null  int64  
 1   Agent_Rating   43648 non-null  float64
 2   Weather        43648 non-null  object 
 3   Traffic        43648 non-null  object 
 4   Vehicle        43648 non-null  object 
 5   Area           43648 non-null  object 
 6   Delivery_Time  43648 non-null  int64  
 7   Category       43648 non-null  object 
 8   Distance_km    43648 non-null  float64
 9   Order_Hour     43648 non-null  int64  
 10  Traffic_Area   43648 non-null  object 
 11  Area_Vehicle   43648 non-null  object 
 12  Weather_Area   43648 non-null  object 
 13  Is_Peak_Hours  43648 non-null  int64  
 14  Is_Urban       43648 non-null  int64  
dtypes: float64(2), int64(5), object(8)
memory usage: 5.0+ MB


#### ***Train-Val-Test split*** 

In [4]:
# Define target
TARGET = "Delivery_Time"
train_df, val_test_df = train_test_split(df, test_size=0.3, random_state=42) # First split Train (70%) & Val+Test (30%)
val_df, test_df = train_test_split(val_test_df, test_size=0.5, random_state=42) # Second split Validation (15%) & Test (15%)

X_train, y_train = train_df.drop(columns=TARGET), train_df[TARGET]
X_val, y_val = val_df.drop(columns=TARGET), val_df[TARGET]
X_test, y_test = test_df.drop(columns=TARGET), test_df[TARGET]

print(f"Train: {X_train.shape}, Val: {X_val.shape}, Test: {X_test.shape}")

Train: (30553, 14), Val: (6547, 14), Test: (6548, 14)
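
The printed shapes can be reproduced by hand. The sketch below assumes that `train_test_split` rounds the held-out share up to the next whole row and gives the remainder to the other part, which is consistent with the output above; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a float test_size, rounding the held-out part up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 43648                                  # rows reported by df.info()
n_train, n_holdout = split_sizes(n_total, 0.3)   # first split: ~70% / ~30%
n_val, n_test = split_sizes(n_holdout, 0.5)      # second split: ~15% / ~15%

print(n_train, n_val, n_test)  # 30553 6547 6548
```

Because 30% of 43648 is not a whole number, the validation and test sets differ by one row.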


### ***Preprocessing Pipeline***

#### ***Outlier Handling & Skewness Correction***

***As observed from univariate analysis, Delivery_Time, Distance_km and Order_Hour are skewed with outliers.***
- ***Delivery_Time (target):*** left untouched 
- ***Order_Hour:*** bimodal, temporal pattern → keep raw
- ***Agent_Rating:*** bounded [1–5], skew due to natural bias → keep raw 
- ***Distance_km:*** extreme skew & heavy-tailed → cap + log transform (if needed) ✅
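
A minimal sketch of the "cap + log transform" idea for a heavy-tailed feature. The data is synthetic (log-normal), the 0.99 quantile matches the default used later in the notebook, and `skewness` is a hypothetical helper that uses the simple population formula rather than pandas' bias-corrected one.

```python
import numpy as np

def skewness(values):
    """Population skewness: mean of cubed z-scores (no bias correction)."""
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(42)
distance = rng.lognormal(mean=1.0, sigma=1.0, size=5000)  # synthetic, heavy right tail

upper = np.quantile(distance, 0.99)    # cap learned from the data
capped = np.minimum(distance, upper)   # values above the cap are clipped to it
logged = np.log1p(capped)              # log(1 + x) compresses the remaining tail

for name, arr in [("raw", distance), ("capped", capped), ("capped+log1p", logged)]:
    print(f"{name:>13}: skew = {skewness(arr):.2f}")
```

In a real pipeline the cap should be learned on the training split only and then reused for validation and test data.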

#### ***Encode categorical features***
- ***Define low & high cardinality feature groups***
- Apply One-hot Encoding for low cardinality features
- Apply Target Encoding for high cardinality features (Interactive features)

#### ***Scaling numerical features + target-encoded features***
- Apply ***StandardScaler()*** on numerical + target-encoded features
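
To illustrate the idea behind target encoding, here is a simplified sketch using plain per-category means. It is not the `category_encoders.TargetEncoder` algorithm, which additionally smooths the category means toward the global mean. The category labels and delivery times are made up.

```python
import pandas as pd

# Hypothetical toy data: one interaction feature and the target
train = pd.DataFrame({
    "Traffic_Area": ["High_Urban", "High_Urban", "Low_Rural", "Low_Rural", "Low_Rural"],
    "Delivery_Time": [150, 170, 90, 100, 110],
})
val = pd.DataFrame({"Traffic_Area": ["High_Urban", "Low_Rural", "Jam_Metro"]})

# Statistics are learned on the training split only
global_mean = train["Delivery_Time"].mean()
category_means = train.groupby("Traffic_Area")["Delivery_Time"].mean()

# Unseen categories fall back to the global training mean
val_encoded = val["Traffic_Area"].map(category_means).fillna(global_mean)
print(val_encoded.tolist())  # [160.0, 100.0, 124.0]
```

Each category is replaced by a single number, so the encoded columns can be scaled together with the numeric features.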

In [5]:
# Outlier & Skew transformer
class OutlierSkewHandler(BaseEstimator, TransformerMixin):
    """Cap features at an upper quantile, then log1p those that remain skewed."""

    def __init__(self, features, skew_threshold=1.0, quantile=0.99):
        self.features = features
        self.skew_threshold = skew_threshold
        self.quantile = quantile

    def fit(self, X, y=None):
        # Reset learned state on every fit so refitting does not keep stale entries
        self.upper_limits_ = {}
        self.log_features_ = []
        for col in self.features:
            upper = X[col].quantile(self.quantile)
            self.upper_limits_[col] = upper
            # Decide on the log transform from the skew of the capped values
            if X[col].clip(upper=upper).skew() > self.skew_threshold:
                self.log_features_.append(col)
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.features:
            X[col] = X[col].clip(upper=self.upper_limits_[col])  # cap outliers
            if col in self.log_features_:
                X[col] = np.log1p(X[col])  # compress the remaining right tail
        return X

In [6]:
# Define Preprocessing pipeline

# Feature groups
low_card_feats  = ["Weather", "Traffic", "Vehicle", "Area"]
high_card_feats = ["Category", "Traffic_Area", "Area_Vehicle", "Weather_Area"]
num_feats = ["Agent_Age", "Agent_Rating", "Distance_km", "Order_Hour"]

# Custom transformers
outlier_skew = OutlierSkewHandler(features=["Distance_km"], skew_threshold=1.0)
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
target_enc = TargetEncoder(cols=high_card_feats)

# Column transformer (numerics + OHE)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_feats + high_card_feats),
        ("ohe", ohe, low_card_feats)
    ],
    remainder="drop"  # drops every column not listed above (here: Is_Peak_Hours, Is_Urban)
)

# Full preprocessing pipeline
full_preprocessing_pipeline = Pipeline(steps=[
    ("outlier_skew", outlier_skew),
    ("encode_target", target_enc),
    ("preproc", preprocessor)
])

In [7]:
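# Wrapper: preprocessing + model
from sklearn.base import clone

def make_pipeline(model):
    # Clone the preprocessing steps so each model pipeline gets its own unfitted copy
    return Pipeline(steps=[
        ("preprocessing", clone(full_preprocessing_pipeline)),
        ("model", model)
    ])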

### ***Setup DagsHub MLflow***

In [None]:
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Set MLflow tracking to your project root "mlruns" folder
mlflow.set_tracking_uri("file:///C:/Users/spand/Projects/LABMENTIX_PROJECTS/Amazon_DeliveryTime_Prediction/mlruns")
mlflow.set_experiment("Amazon_delivery_time_prediction")

### ***Model Building, Training & Evaluation***

In [57]:
# Metrics helper functions
def metrics_report(y_true, y_pred):
    return {
        "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
        "mae": mean_absolute_error(y_true, y_pred),
        "r2": r2_score(y_true, y_pred)
    }

def print_eval(name, y_true, y_pred):
    m = metrics_report(y_true, y_pred)
    print(f"{name} -> RMSE: {m['rmse']:.4f} | MAE: {m['mae']:.4f} | R2: {m['r2']:.4f}")

#### ***Baseline Models***

In [None]:
# Define baseline models
models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=200, random_state=42),
    "XGBoost": xgb.XGBRegressor(n_estimators=200, random_state=42, n_jobs=-1)
}

2025/10/02 00:16:31 INFO mlflow.tracking.fluent: Experiment with name 'Amazon_delivery_time_prediction' does not exist. Creating a new experiment.
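
A minimal sketch of a baseline comparison loop, assuming each candidate is fit on the training split and scored on both train and validation data. Synthetic data stands in for the delivery dataset, MLflow logging is left out, and the variable names are illustrative rather than the notebook's own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical), not the delivery dataset
X, y = make_regression(n_samples=600, n_features=8, noise=15.0, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=42)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=42),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        "train_rmse": float(np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))),
        "val_rmse": float(np.sqrt(mean_squared_error(y_va, model.predict(X_va)))),
    }
    print(f"{name}: {results[name]}")
```

Recording train and validation scores side by side is what makes the overfitting comparison possible.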


***Validation Performance (Main Criteria)***
- The goal is to minimize Val RMSE (lower is better).
- RandomForest (23.23) and XGBoost (23.23) are nearly identical, both much better than GradientBoosting (24.29) and far better than Ridge/Linear (32+).
- So RandomForest & XGBoost are the top contenders.

***Overfitting Check***
- RandomForest: Train RMSE = 8.42 vs Val RMSE = 23.23 → Big gap → model fits training data extremely well but generalizes less strongly.
- XGBoost: Train RMSE = 17.24 vs Val RMSE = 23.23 → Smaller gap → slightly better generalization than RandomForest.
- Both still generalize well, but RandomForest shows stronger overfitting.
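
The train-validation gap can be made explicit from the RMSE values quoted above. This is plain arithmetic on the reported numbers; no model is refit.

```python
# RMSE values as reported in the bullets above
reported = {
    "RandomForest": {"train_rmse": 8.42, "val_rmse": 23.23},
    "XGBoost": {"train_rmse": 17.24, "val_rmse": 23.23},
}

gaps = {}
for name, m in reported.items():
    gaps[name] = m["val_rmse"] - m["train_rmse"]   # absolute gap
    ratio = m["val_rmse"] / m["train_rmse"]        # relative gap
    print(f"{name}: gap = {gaps[name]:.2f}, val/train ratio = {ratio:.2f}")
```

RandomForest's gap (about 14.8) is roughly 2.5 times XGBoost's (about 6.0), even though both reach the same validation RMSE.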

***R² Scores (Explained Variance)***
- RandomForest: 0.97 (train) → ~ 0.80 (val)
- XGBoost: 0.88 (train) → ~ 0.80 (val)
- Both capture ~80% of variance on validation set.

***XGBoost is very close, but RandomForest edges it out slightly on validation RMSE (23.2360 vs 23.2388), though it shows more overfitting (a larger train-val gap). So let's:***
- Get feature importances for both RandomForest and XGBoost.
- Drop weak features (low contribution).
- Hyperparameter tune both models with CV.
- Track everything in MLflow (params, metrics, datasets, models).

#### ***Extract Feature Importances***
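
A minimal sketch of how impurity-based importances could be extracted from a fitted RandomForest and used to drop weak features. The data is synthetic, the feature names are placeholders, and the `0.01` cutoff is a hypothetical choice, not a value taken from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: only 3 of the 6 features carry signal
X, y = make_regression(n_samples=500, n_features=6, n_informative=3,
                       noise=10.0, random_state=42)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]  # placeholder names

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1; sort so the strongest features come first
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

threshold = 0.01  # hypothetical cutoff for "weak" features
kept = importances[importances >= threshold].index.tolist()

print(importances.round(3))
print("kept:", kept)
```

XGBoost's scikit-learn wrapper exposes a `feature_importances_` attribute as well, so the same ranking step applies to both models.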

#### ***Hyperparameter Tune RF & XGB on reduced Feature set***
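
A minimal sketch of randomized search with K-fold cross-validation, using the `RandomizedSearchCV` and `KFold` classes imported at the top of the notebook. The search space, number of iterations and fold count are hypothetical, and the data is synthetic.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)

param_distributions = {  # hypothetical search space
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                                   # number of sampled combinations
    scoring="neg_root_mean_squared_error",      # higher is better, so RMSE is negated
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)

best_cv_rmse = -search.best_score_              # flip the sign back to a plain RMSE
print(search.best_params_, round(best_cv_rmse, 2))
```

Passing a full preprocessing + model pipeline as `estimator` (with parameter names prefixed by the step name, e.g. `model__max_depth`) would refit the preprocessing inside each fold.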

***Validation Performance (what really matters):***
- RF has slightly lower RMSE/MAE → better generalization.
- RF Val_RMSE = 22.15 vs XGB = 22.41 (small edge to RF).
- RF Val_MAE = 17.14 vs XGB = 17.43.
- Val_R² is basically the same (~0.818 vs 0.814).

***Training Performance:***
- XGB fits training data better (lower Train_RMSE/MAE, higher Train_R²).
- But that’s not necessarily good → it might be slightly overfitting compared to RF.

***✅ Best Choice -> RandomForest***
- Lower validation RMSE/MAE (the main selection metrics).
- Slightly more balanced (less risk of overfitting).

#### ***Retrain on Train + Val with best model parameters***
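
A minimal sketch of the retraining step, assuming the train and validation splits are stacked and the selected model is refit once before a single evaluation on the untouched test split. The data is synthetic and `best_params` is a hypothetical placeholder for the tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=6, noise=10.0, random_state=42)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.3, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Combine train and validation rows; the test split stays untouched
X_trval = np.vstack([X_tr, X_va])
y_trval = np.concatenate([y_tr, y_va])

best_params = {"n_estimators": 100, "min_samples_leaf": 2}  # hypothetical tuned values
final_model = RandomForestRegressor(random_state=42, **best_params).fit(X_trval, y_trval)

test_rmse = float(np.sqrt(mean_squared_error(y_te, final_model.predict(X_te))))
print(f"rows used for fitting: {len(y_trval)}, test RMSE: {test_rmse:.2f}")
```

With DataFrames, as in this notebook, `pd.concat([X_train, X_val])` and `pd.concat([y_train, y_val])` would play the role of the NumPy stacking.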

#### ***Interpretation***

***Consistency (Good Generalization)***
- Train+Val RMSE = 20.9 vs Test RMSE = 22.0 → only a small gap.
- Train+Val R² = 0.838 vs Test R² = 0.823 → very close.
- ✅ Model is not overfitting; generalizes well to unseen data.

***Error Magnitude***
- RMSE ~22 hours → typical error magnitude, with large misses weighted more heavily than small ones.
- MAE ~17 hours → predictions deviate by ~17 hours on average; MAE is never larger than RMSE and is less sensitive to outliers.
- Delivery times range widely (short vs long trips), so an error of this size seems reasonable given that long-distance orders exist.
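
RMSE and MAE summarize the same errors differently. A small worked example with made-up errors shows how one large miss pulls RMSE above MAE, and that neither equals the median absolute error.

```python
import numpy as np

errors = np.array([1.0, 1.0, 1.0, 9.0])     # made-up prediction errors, one large miss

mae = np.mean(np.abs(errors))               # (1 + 1 + 1 + 9) / 4 = 3.0
rmse = np.sqrt(np.mean(errors ** 2))        # sqrt((1 + 1 + 1 + 81) / 4) = sqrt(21)
median_ae = np.median(np.abs(errors))       # 1.0

print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}, median AE = {median_ae:.2f}")
```

The gap between RMSE and MAE widens as errors become more uneven, which is one way to read the ~22 vs ~17 figures above.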

***Residual Plot***
- Residuals are mostly centered around 0 (good).
- Spread increases with higher predicted values (heteroscedasticity).
- Suggests model handles short/medium deliveries better, but variance grows for long delivery times (common in real logistics data).
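
One way to quantify what the residual plot suggests is to compare the residual spread across bins of the predicted value. The sketch below uses synthetic predictions whose noise grows with the prediction by construction; it does not use this model's residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
y_pred = rng.uniform(30, 250, size=3000)      # synthetic predictions
residuals = rng.normal(0.0, 0.1 * y_pred)     # noise scale grows with the prediction

# Split predictions into quartile bins and measure the residual spread in each
inner_edges = np.quantile(y_pred, [0.25, 0.5, 0.75])
bin_ids = np.digitize(y_pred, inner_edges)    # bin index 0..3
spread = [float(residuals[bin_ids == b].std()) for b in range(4)]

print([round(s, 1) for s in spread])
```

A spread that rises steadily from the first to the last bin is the numeric counterpart of the fan shape in the plot.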

***Conclusion***
- The hyperparameter-tuned RandomForest model generalizes well (Test RMSE ≈ 22.0, Test R² ≈ 0.823).
- Balanced fit (Train+Val ≈ Test metrics).
- Captures ~82% of variance in delivery times.
- Residuals are centered around 0, though their spread grows for longer deliveries.