# Delivery Time Prediction ML Workflow

This notebook covers the full machine learning pipeline for predicting delivery time:


1. Import Required Libraries
2. Load and Explore Dataset
3. Data Preprocessing
4. Feature Selection
5. Model Training
6. Model Evaluation
7. Hyperparameter Tuning
8. Save Trained Model


In [1]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import LabelEncoder, StandardScaler
import xgboost as xgb
import lightgbm as lgb

In [6]:
# Load and Explore Dataset
import sys
import os
# Make the modules in ../src importable from the notebook
sys.path.append(os.path.abspath("../src"))
from data_loader import load_delivery_data
df = load_delivery_data()
# The format has no year directive, so parsed timestamps default to the year 1900
df['receipt_time'] = pd.to_datetime(df['receipt_time'], format='%m-%d %H:%M:%S')
df['sign_time'] = pd.to_datetime(df['sign_time'], format='%m-%d %H:%M:%S')
# Target: elapsed time between receipt and sign-off, in minutes
df['delivery_minutes'] = (df['sign_time'] - df['receipt_time']).dt.total_seconds() / 60
df.head()

Unnamed: 0,order_id,from_dipan_id,from_city_name,delivery_user_id,poi_lng,poi_lat,aoi_id,typecode,receipt_time,receipt_lng,receipt_lat,sign_time,sign_lng,sign_lat,ds,delivery_minutes
0,687227b4d0c733049b16ccd566db6e01,08331170e24742ba7a3938f5b34ff24d,Mbeya,18ff78d2069125937a847fb701a9db6c,33.501712,-8.86739,e0581ca18e7ca371a9869e041cb09075,4602b38053ece07a9ca5153f1df2e404,1900-03-18 13:35:00,35.738886,-6.1752,1900-03-18 14:51:00,35.772387,-6.191757,318,76.0
1,55be8cdf1270526231c9ba3387f51b54,c5ac5ba99801aa6b85ba473d9260512b,Dar es Salaam,df0b594618d1ba6f619e4e7dd034447c,39.202811,-6.758018,9c0f96ff01a71477334ef563001abc72,203ac3454d75e02ebb0a3c6f51d735e4,1900-03-18 08:32:00,36.683317,-3.403086,1900-03-18 14:33:00,36.693977,-3.377285,318,361.0
2,ee46cae9ba2c002451af3c6fbcb49410,2129bfb99a2f6c11000c0ecbf1a5f3f6,Mwanza,05cceaaa5db96756294dd6d573fd865d,32.959725,-2.557876,4de9bf7f155046e7d0fd400672ab9cf3,203ac3454d75e02ebb0a3c6f51d735e4,1900-03-18 13:02:00,36.649081,-3.363579,1900-03-18 15:34:00,36.660932,-3.37145,318,152.0
3,38912be86c83138901b5e26398832be7,08331170e24742ba7a3938f5b34ff24d,Dar es Salaam,f29e97ef8398477abb72b852b16c91c0,39.19865,-6.825873,fe48cde9b33e2308641d985f8a701c7e,203ac3454d75e02ebb0a3c6f51d735e4,1900-03-18 12:11:00,35.778454,-6.210589,1900-03-18 14:08:00,35.777235,-6.204619,318,117.0
4,2b83e2ba16714fee357694964d0e7e41,4fe96250270c2e17a28016a5fba4bc4a,Arusha,1d00e6f2308aad233f0179aac63aa23d,36.714718,-3.370972,a7d4de5484ca867fe453976ba9fee424,4602b38053ece07a9ca5153f1df2e404,1900-03-18 07:28:00,35.759836,-6.159013,1900-03-20 12:40:00,35.748038,-6.176871,318,3192.0
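
The 1900 dates in the preview come from parsing timestamps that carry no year: without a `%Y` directive in the format, the year falls back to 1900. The sketch below reproduces this with the standard library's `datetime.strptime`, using the timestamps of row 0. The helper name `minutes_between` is hypothetical and not part of the notebook.

```python
from datetime import datetime

# Same format string as in the notebook; note that it has no year directive
FMT = "%m-%d %H:%M:%S"


def minutes_between(receipt: str, sign: str) -> float:
    """Elapsed minutes between two year-less timestamps (hypothetical helper)."""
    start = datetime.strptime(receipt, FMT)
    end = datetime.strptime(sign, FMT)
    return (end - start).total_seconds() / 60


parsed = datetime.strptime("03-18 13:35:00", FMT)
print(parsed.year)  # 1900
print(minutes_between("03-18 13:35:00", "03-18 14:51:00"))  # 76.0
```

Because both timestamps receive the same placeholder year, the difference is still correct for deliveries within one calendar year. A delivery that spans a year boundary (received on 12-31, signed on 01-01) would come out negative and would need separate handling.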


In [7]:
# Data Preprocessing
# Drop rows where the target could not be computed
df = df.dropna(subset=['delivery_minutes'])
# Replace each categorical value with an integer code
cat_cols = ['from_city_name', 'delivery_user_id']
for col in cat_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
features = ['from_city_name', 'delivery_user_id', 'poi_lng', 'poi_lat', 'receipt_lng', 'receipt_lat', 'sign_lng', 'sign_lat']
X = df[features]
y = df['delivery_minutes']
# Note: the scaler is fitted on all rows, i.e. before the train/test split in the next cell
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
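
`LabelEncoder` assigns integer codes in the sorted order of the distinct values, so the resulting numbers carry no meaningful ranking. A minimal pure-Python sketch of the same mapping, using the city names from the preview rows (the function name `label_encode` is hypothetical):

```python
def label_encode(values):
    """Map each distinct value to its index in sorted order (hypothetical helper)."""
    classes = sorted(set(values))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[v] for v in values], classes


cities = ["Mbeya", "Dar es Salaam", "Mwanza", "Dar es Salaam", "Arusha"]
codes, classes = label_encode(cities)
print(classes)  # ['Arusha', 'Dar es Salaam', 'Mbeya', 'Mwanza']
print(codes)    # [2, 1, 3, 1, 0]
```

Tree-based models can still split on such codes, but the numeric order (Arusha < Dar es Salaam < Mbeya) reflects alphabetical sorting only, not distance or delivery difficulty.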

In [8]:
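# Feature Selection using XGBoost feature importance
# (the importances are only inspected here; all eight features are kept for training)
# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)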
xgb_model = xgb.XGBRegressor()
xgb_model.fit(X_train, y_train)
importances = xgb_model.feature_importances_
for feat, imp in zip(features, importances):
    print(f'{feat}: {imp:.4f}')

from_city_name: 0.2679
delivery_user_id: 0.1463
poi_lng: 0.0954
poi_lat: 0.1259
receipt_lng: 0.1103
receipt_lat: 0.0927
sign_lng: 0.0764
sign_lat: 0.0852
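
To make the printed importances easier to compare, they can be ranked and accumulated. The sketch below hard-codes the values printed above purely for illustration; inside the notebook one would build the dictionary from `zip(features, importances)` instead.

```python
# Importances copied from the output above (illustrative, not recomputed)
importances = {
    "from_city_name": 0.2679,
    "delivery_user_id": 0.1463,
    "poi_lng": 0.0954,
    "poi_lat": 0.1259,
    "receipt_lng": 0.1103,
    "receipt_lat": 0.0927,
    "sign_lng": 0.0764,
    "sign_lat": 0.0852,
}

# Sort from most to least important and track the running total
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
cumulative = 0.0
for name, value in ranked:
    cumulative += value
    print(f"{name:<18}{value:.4f}  cumulative {cumulative:.4f}")
```

In this ranking the two encoded categorical columns come first and together account for roughly 41% of the total importance, while no single coordinate feature falls below about 7%.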


In [9]:
# Model Training with LightGBM
lgb_model = lgb.LGBMRegressor()
lgb_model.fit(X_train, y_train)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.037758 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1791
[LightGBM] [Info] Number of data points in the train set: 377935, number of used features: 8
[LightGBM] [Info] Start training from score 175.746978


LGBMRegressor parameters:
parameter,value
boosting_type,'gbdt'
num_leaves,31
max_depth,-1
learning_rate,0.1
n_estimators,100
subsample_for_bin,200000
objective,None
class_weight,None
min_split_gain,0.0
min_child_weight,0.001


In [11]:
# Model Evaluation
y_pred = lgb_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'MAE: {mae:.2f}')
print(f'RMSE: {rmse:.2f}')



MAE: 149.31
RMSE: 505.61
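
The large gap between MAE and RMSE suggests a heavy right tail in the target: RMSE squares each error, so a few multi-day deliveries such as the 3192-minute row in the preview dominate it. A small stdlib-only sketch of this effect, using the five preview targets and a constant median prediction as an assumed baseline (this is not the notebook's model):

```python
import math
from statistics import median

# delivery_minutes of the five preview rows
y_true = [76.0, 361.0, 152.0, 117.0, 3192.0]

# Assumed baseline: always predict the median of the targets
baseline = median(y_true)
errors = [y - baseline for y in y_true]

mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(f"MAE: {mae:.2f}")    # MAE: 672.00
print(f"RMSE: {rmse:.2f}")  # RMSE: 1363.25
```

A single outlier roughly doubles RMSE relative to MAE here. Comparing the model's scores against the same constant baseline on the real test set would show how much signal the features actually add.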


In [14]:
# Hyperparameter Tuning with GridSearchCV
param_grid = {
    'num_leaves': [31, 50],
    'learning_rate': [0.1, 0.01],
    'n_estimators': [100, 200]
}
grid = GridSearchCV(lgb.LGBMRegressor(), param_grid, cv=3, scoring='neg_mean_absolute_error')
grid.fit(X_train, y_train)
print('Best parameters:', grid.best_params_)
print('Best MAE:', -grid.best_score_)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.068361 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1791
[LightGBM] [Info] Number of data points in the train set: 251956, number of used features: 8
[LightGBM] [Info] Start training from score 176.954869




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.082248 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1791
[LightGBM] [Info] Number of data points in the train set: 377935, number of used features: 8
[LightGBM] [Info] Start training from score 175.746978
Best parameters: {'learning_rate': 0.01, 'n_estimators': 200, 'num_leaves': 50}
Best MAE: 149.81585636727365
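
`GridSearchCV` fits every parameter combination once per fold and then refits the best combination on the full training set (`refit=True` is its default), which is why the log shows fold-sized training sets of about 252k rows followed by a final fit on all 377,935 rows. A quick count for the grid used above:

```python
from itertools import product

# Same grid and fold count as in the tuning cell
param_grid = {
    "num_leaves": [31, 50],
    "learning_rate": [0.1, 0.01],
    "n_estimators": [100, 200],
}
cv_folds = 3

combinations = list(product(*param_grid.values()))
cv_fits = len(combinations) * cv_folds
total_fits = cv_fits + 1  # plus the final refit on the full training set

print(len(combinations), cv_fits, total_fits)  # 8 24 25
```

Each value added to the grid multiplies the number of fits, so it helps to estimate this count before widening the search.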


In [15]:
# Save the trained model for backend integration
import joblib
joblib.dump(grid.best_estimator_, '../src/delivery_time_model.pkl')
print('Model saved to ../src/delivery_time_model.pkl')

Model saved to ../src/delivery_time_model.pkl
