<a href="https://colab.research.google.com/github/Lake-Commander/logistics_optimization/blob/main/logistics_optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

🔹 1. Business Impact of Delivery Delays
Delivery delays impact:

Operations: inefficiencies in routing, warehousing, and resource allocation.

Customer Satisfaction: increased complaints, loss of trust, and potential churn.

Revenue: higher cost per order due to expedited reshipments and potential refunds.

Brand: poor delivery performance damages reputation in competitive markets.

With predictive modeling:

SwiftChain can proactively flag high-risk deliveries.

Stakeholders can re-route, prioritize, or adjust delivery expectations.

Logistics platforms can automate decisions such as dynamic ETAs and personalized shipping offers.
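As a rough illustration of what flagging high-risk deliveries could look like, the sketch below thresholds predicted delay probabilities. The order IDs, the probabilities, the `flag_high_risk` helper, and the 0.7 cutoff are all hypothetical; in practice the probabilities would come from a trained classifier and the cutoff from a cost analysis.

```python
def flag_high_risk(delay_probabilities, threshold=0.7):
    """Return the order IDs whose predicted delay probability meets the threshold.

    `delay_probabilities` maps order ID -> P(delayed); both the mapping and the
    default threshold are illustrative assumptions, not values from the dataset.
    """
    return [
        order_id
        for order_id, probability in delay_probabilities.items()
        if probability >= threshold
    ]


# Hypothetical model output for three orders
predicted = {"A100": 0.91, "A101": 0.35, "A102": 0.72}
flagged = flag_high_risk(predicted)
print(flagged)
```

Lowering the threshold flags more orders at the cost of more false alarms, so the cutoff is a business decision rather than a modeling one.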

🔹 2. Dataset Overview (41 Variables)
The variables are grouped by theme below:

📦 Order & Shipping Info
| Column | Type | Description |
| --- | --- | --- |
| order_id, order_date, order_status | Numerical/Datetime/Categorical | Core order details |
| shipping_date, shipping_mode | Datetime/Categorical | Critical for calculating delivery windows |
| order_item_discount, order_item_discount_rate | Numerical | Discount amounts (may affect fulfillment priority) |
| order_item_quantity, order_item_total_amount, sales | Numerical | Order value/volume (affects logistics load) |
| label | Categorical (Target) | Delivery outcome: -1 (early), 0 (on-time), 1 (delayed) |

👤 Customer & Market Info
| Column | Type | Description |
| --- | --- | --- |
| customer_id, customer_segment, customer_city, customer_country, customer_state, customer_zipcode | Mixed | Buyer profile |
| market, order_country, order_city, order_region, order_state | Categorical | Destination geo-coding |
| latitude, longitude | Numerical | Geo-coordinates (can be used for mapping delays or regional clustering) |

🛍️ Product Info
| Column | Type | Description |
| --- | --- | --- |
| product_name, product_price, product_category_id, category_name | Mixed | Product-level features |
| order_item_product_price, order_item_profit_ratio | Numerical | Product-level sales/profit data |
| profit_per_order, order_profit_per_order, sales_per_customer | Numerical | Profit and revenue KPIs |

🏪 Store & Department Info
| Column | Type | Description |
| --- | --- | --- |
| department_id, department_name, category_id | Mixed | Internal classification for store analytics |

🔍 3. Key Predictive Variables to Focus On
Based on a feature-importance hypothesis, these variables are the most likely drivers of delay:

| Likely Impact | Candidate Variables |
| --- | --- |
| 📦 Shipping & Timing | shipping_mode, order_date, shipping_date, order_status |
| 🌎 Location | order_region, customer_country, order_state, market |
| 🛍️ Product Profile | product_category_id, order_item_quantity, product_price |
| 💵 Financial Signals | order_profit_per_order, order_item_discount, profit_per_order |
| 👤 Customer Behavior | customer_segment, sales_per_customer |

## Imports

In [19]:
# Data handling
import pandas as pd
import numpy as np
import os
import joblib

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Model selection & evaluation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier


In [20]:
# Set styles for plots
sns.set(style='whitegrid')

# === 1. Load Data ===
df = pd.read_csv("logistics.csv")

# === 2. Create output folder for plots ===
os.makedirs("eda/plots", exist_ok=True)

# === 3. Check Target Distribution ===
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x='label')
plt.title("Delivery Status Distribution")
plt.xlabel("Label (-1 = Early, 0 = On-time, 1 = Late)")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig("eda/plots/delivery_status_distribution.png")
plt.close()

# === 4. Shipping Mode vs Delivery Status ===
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='shipping_mode', hue='label')
plt.title("Shipping Mode vs Delivery Status")
plt.xlabel("Shipping Mode")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("eda/plots/shipping_mode_vs_label.png")
plt.close()

# === 5. Top 10 Order Regions vs Delivery Status ===
top_regions = df['order_region'].value_counts().nlargest(10).index
subset = df[df['order_region'].isin(top_regions)]

plt.figure(figsize=(10, 6))
sns.countplot(data=subset, x='order_region', hue='label')
plt.title("Top 10 Order Regions vs Delivery Status")
plt.xlabel("Order Region")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("eda/plots/top10_regions_vs_label.png")
plt.close()

# === 6. Discount Rate vs Delivery Outcome ===
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='label', y='order_item_discount_rate')
plt.title("Discount Rate vs Delivery Outcome")
plt.xlabel("Delivery Status")
plt.ylabel("Discount Rate")
plt.tight_layout()
plt.savefig("eda/plots/discount_vs_label.png")
plt.close()

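# === 7. Missing Values Matrix ===
# tight_layout() can emit a warning on the figure that msno.matrix creates,
# so let savefig trim the margins instead.
msno.matrix(df)
plt.title("Missing Values Matrix")
plt.savefig("eda/plots/missing_values_matrix.png", bbox_inches="tight")
plt.close()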

print("✅ EDA completed. All plots saved to 'eda/plots'")



✅ EDA completed. All plots saved to 'eda/plots'


In [21]:
# ================================
# Load Dataset
# ================================
df = pd.read_csv('logistics.csv')

# ================================
# Handle Missing Values
# ================================
missing = df.isnull().sum()
print("Missing values:\n", missing[missing > 0])

if 'shipping_date' in df.columns:
    df['shipping_date'] = pd.to_datetime(df['shipping_date'], errors='coerce', utc=True)
    df['shipping_date'] = df['shipping_date'].ffill()

if 'order_date' in df.columns:
    df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce', utc=True)
    df['order_date'] = df['order_date'].ffill()

threshold = 0.4
to_drop = [col for col in df.columns if df[col].isnull().mean() > threshold]
df.drop(columns=to_drop, inplace=True)
print(f"\nDropped columns with >{int(threshold*100)}% missing values: {to_drop}")

# ================================
# Encode Categorical Variables
# ================================
cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
le = LabelEncoder()
for col in cat_cols:
    if df[col].nunique() <= 20:
        df[col] = le.fit_transform(df[col].astype(str))
    else:
        df[col] = df[col].astype(str)

# ================================
# Handle Outliers
# ================================
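num_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
# Skip the target and the label-encoded categoricals: clipping category codes
# would silently merge rare categories into their neighbors.
num_cols = [col for col in num_cols if col != 'label' and col not in cat_cols]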
for col in num_cols:
    q_low = df[col].quantile(0.01)
    q_hi = df[col].quantile(0.99)
    df[col] = df[col].clip(lower=q_low, upper=q_hi)

# ================================
# Scale Numerical Features
# ================================
scaler = StandardScaler()
scale_cols = ['profit_per_order', 'sales_per_customer', 'order_item_discount',
              'order_item_discount_rate', 'order_item_product_price',
              'order_item_quantity', 'sales', 'order_item_total_amount',
              'order_profit_per_order', 'product_price']
for col in scale_cols:
    if col in df.columns:
        df[[col]] = scaler.fit_transform(df[[col]])

# ================================
# Save Cleaned Dataset
# ================================
df.to_csv("logistics_cleaned.csv", index=False)
print("\n✅ Preprocessing complete. Cleaned dataset saved as 'logistics_cleaned.csv'")

Missing values:
 Series([], dtype: int64)

Dropped columns with >40% missing values: []

✅ Preprocessing complete. Cleaned dataset saved as 'logistics_cleaned.csv'
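The outlier step in the cell above clips each numeric column to its 1st and 99th percentiles. A minimal sketch of that behavior on a made-up series (the `values` data is hypothetical and chosen so the quantiles land on round numbers):

```python
import pandas as pd

# 100 ordinary values (0..99) plus one extreme outlier; illustrative data only
values = pd.Series(list(range(100)) + [10_000], dtype="float64")

# Same rule as the preprocessing cell: clip to the 1st and 99th percentiles
q_low = values.quantile(0.01)
q_hi = values.quantile(0.99)
clipped = values.clip(lower=q_low, upper=q_hi)

print("bounds:", q_low, q_hi)
print("max before:", values.max(), "max after:", clipped.max())
```

Clipping keeps every row and only caps the extremes, which is why the row count of the cleaned dataset matches the original.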


In [22]:
# === Load original and cleaned datasets ===
original_df = pd.read_csv('logistics.csv')
cleaned_df = pd.read_csv('logistics_cleaned.csv')

# === Normalize column names for consistency ===
def clean_column_names(df):
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_").str.replace("-", "_")
    return df

original_df = clean_column_names(original_df)
cleaned_df = clean_column_names(cleaned_df)

# === Compare original vs cleaned dataset columns ===
def compare_columns(df1, df2):
    cols1 = set(df1.columns)
    cols2 = set(df2.columns)
    removed = cols1 - cols2
    added = cols2 - cols1
    print("Removed columns:", removed)
    print("Added columns:", added)

compare_columns(original_df, cleaned_df)

# === Proceed with feature engineering ===
df = cleaned_df.copy()

# Parse the date columns (they are read back from CSV as strings)
date_cols = ['order_date', 'shipping_date']
for col in date_cols:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors='coerce')
    else:
        print(f"Warning: Column '{col}' not found in dataset.")

# Feature: shipping_delay_days (whole days between order and shipment)
if 'order_date' in df.columns and 'shipping_date' in df.columns:
    df['shipping_delay_days'] = (df['shipping_date'] - df['order_date']).dt.days
else:
    df['shipping_delay_days'] = np.nan

# Feature: is_late_shipping (1 if shipping took more than 2 days; missing delays count as 0)
df['is_late_shipping'] = (df['shipping_delay_days'] > 2).astype(int)

# Feature: order_weekday
if 'order_date' in df.columns:
    df['order_weekday'] = df['order_date'].dt.day_name()

# Feature: is_weekend_shipping
if 'shipping_date' in df.columns:
    df['is_weekend_shipping'] = df['shipping_date'].dt.weekday >= 5

# === Save engineered features ===
df.to_csv('logistics_featurized.csv', index=False)
print("✅ Feature engineering complete. Output saved as 'logistics_featurized.csv'")


Removed columns: set()
Added columns: set()
✅ Feature engineering complete. Output saved as 'logistics_featurized.csv'
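The date-derived features can be sanity-checked on a toy frame. This is a minimal sketch with made-up dates; the `orders` frame is hypothetical, and a missing shipping date is treated as not late and not a weekend, matching the notebook's rule.

```python
import pandas as pd

# Hypothetical orders; the third one has no shipping date
orders = pd.DataFrame({
    "order_date": ["2015-08-10", "2015-08-12", "2015-08-14"],
    "shipping_date": ["2015-08-11", "2015-08-16", None],
})
for col in ["order_date", "shipping_date"]:
    orders[col] = pd.to_datetime(orders[col], errors="coerce")

# Whole days between order and shipment (NaN when a date is missing)
orders["shipping_delay_days"] = (orders["shipping_date"] - orders["order_date"]).dt.days

# Late means more than 2 days; comparisons with NaN evaluate to False
orders["is_late_shipping"] = (orders["shipping_delay_days"] > 2).astype(int)

# Saturday is weekday 5 and Sunday is weekday 6
orders["is_weekend_shipping"] = orders["shipping_date"].dt.weekday >= 5

print(orders)
```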


In [23]:
# Load the dataset
df = pd.read_csv('logistics_featurized.csv')

# Drop unnamed column if exists
if 'Unnamed: 0' in df.columns:
    df.drop(columns=['Unnamed: 0'], inplace=True)

# Check if 'label' column exists
if 'label' not in df.columns:
    raise ValueError("Missing 'label' column in dataset")

X = df.drop(columns=['label'])
y = df['label']

# Print full dataset label distribution
print("Full dataset label distribution:\n", y.value_counts())

# Stratified train-test split to preserve class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Print training label distribution
print("Training label distribution:\n", y_train.value_counts())

# Ensure we have at least two classes in y_train
if len(y_train.unique()) < 2:
    raise ValueError(f"y_train has only one class: {y_train.unique()[0]}. Cannot train classifiers.")

# Identify column types
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

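# Drop numerical columns that are entirely missing (e.g., 'shipping_delay_days'
# when the date columns are unavailable). Iterate over a copy of the list so
# that removing items does not skip any column.
for col in list(numerical_cols):
    if X[col].isna().all():
        print(f"Dropping column with all missing values: {col}")
        X_train = X_train.drop(columns=[col])
        X_test = X_test.drop(columns=[col])
        numerical_cols.remove(col)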

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numerical_cols),

        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))
        ]), categorical_cols)
    ]
)

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(eval_metric='logloss', random_state=42)
}

# Train and evaluate each model
for name, model in models.items():
    print(f"\nTraining {name}...")
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])

    # Remap labels for XGBoost
    if name == 'XGBoost':
        y_train_xgb = y_train.replace({-1: 0, 0: 1, 1: 2})
        y_test_xgb = y_test.replace({-1: 0, 0: 1, 1: 2})
        pipeline.fit(X_train, y_train_xgb)
        y_pred = pipeline.predict(X_test)
        # Map predictions back for reporting
        y_pred_report = pd.Series(y_pred).replace({0: -1, 1: 0, 2: 1})
        print(f"\n{name} Evaluation:")
        print("Accuracy:", accuracy_score(y_test, y_pred_report))
        print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_report))
        print("Classification Report:\n", classification_report(y_test, y_pred_report))
    else:
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        print(f"\n{name} Evaluation:")
        print("Accuracy:", accuracy_score(y_test, y_pred))
        print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
        print("Classification Report:\n", classification_report(y_test, y_pred))

# Optional: hyperparameter tuning (Random Forest shown; XGBoost had the highest accuracy above)
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [None, 10, 20]
}

print("\nTuning Random Forest with GridSearchCV...")

rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

grid_search = GridSearchCV(rf_pipeline, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("\nBest Random Forest Parameters:", grid_search.best_params_)
best_model = grid_search.best_estimator_

# Final evaluation
final_pred = best_model.predict(X_test)
print("\nTuned Random Forest Evaluation:")
print("Accuracy:", accuracy_score(y_test, final_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, final_pred))
print("Classification Report:\n", classification_report(y_test, final_pred))

# Save final model
os.makedirs("models", exist_ok=True)
joblib.dump(best_model, "models/best_model.pkl")
print("\nModel saved to models/best_model.pkl")


Full dataset label distribution:
 label
 1    8976
-1    3545
 0    3028
Name: count, dtype: int64
Training label distribution:
 label
 1    7181
-1    2836
 0    2422
Name: count, dtype: int64

Training Logistic Regression...

Logistic Regression Evaluation:
Accuracy: 0.5463022508038585
Confusion Matrix:
 [[ 259  127  323]
 [ 159   85  362]
 [ 258  182 1355]]
Classification Report:
               precision    recall  f1-score   support

          -1       0.38      0.37      0.37       709
           0       0.22      0.14      0.17       606
           1       0.66      0.75      0.71      1795

    accuracy                           0.55      3110
   macro avg       0.42      0.42      0.42      3110
weighted avg       0.51      0.55      0.53      3110


Training Random Forest...

Random Forest Evaluation:
Accuracy: 0.5906752411575563
Confusion Matrix:
 [[ 156    2  551]
 [  60    0  546]
 [ 112    2 1681]]
Classification Report:
               precision    recall  f1-score   suppo



XGBoost Evaluation:
Accuracy: 0.6067524115755627
Confusion Matrix:
 [[ 433   23  253]
 [ 212   50  344]
 [ 353   38 1404]]
Classification Report:
               precision    recall  f1-score   support

          -1       0.43      0.61      0.51       709
           0       0.45      0.08      0.14       606
           1       0.70      0.78      0.74      1795

    accuracy                           0.61      3110
   macro avg       0.53      0.49      0.46      3110
weighted avg       0.59      0.61      0.57      3110


Tuning Random Forest with GridSearchCV...

Best Random Forest Parameters: {'classifier__max_depth': None, 'classifier__n_estimators': 100}

Tuned Random Forest Evaluation:
Accuracy: 0.5906752411575563
Confusion Matrix:
 [[ 156    2  551]
 [  60    0  546]
 [ 112    2 1681]]
Classification Report:
               precision    recall  f1-score   support

          -1       0.48      0.22      0.30       709
           0       0.00      0.00      0.00       606
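Because the classes are imbalanced, accuracy alone hides how weak the on-time class (0) is. As a cross-check, per-class precision and recall can be recomputed directly from a confusion matrix. The sketch below uses the XGBoost matrix printed above; `per_class_metrics` is a hypothetical helper, not part of scikit-learn.

```python
# Confusion matrix copied from the XGBoost output above
# (rows = true class, columns = predicted class, class order: -1, 0, 1).
labels = [-1, 0, 1]
cm = [
    [433, 23, 253],
    [212, 50, 344],
    [353, 38, 1404],
]


def per_class_metrics(matrix):
    """Return {class_index: (precision, recall)} for a square confusion matrix."""
    n = len(matrix)
    metrics = {}
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column total
        actual_k = sum(matrix[k])                          # row total
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        metrics[k] = (precision, recall)
    return metrics


metrics = per_class_metrics(cm)
for k, (precision, recall) in metrics.items():
    print(f"class {labels[k]:>2}: precision={precision:.2f} recall={recall:.2f}")

# Macro recall weights every class equally, regardless of its size
macro_recall = sum(recall for _, recall in metrics.values()) / len(metrics)
print(f"macro recall={macro_recall:.2f}")
```

The values agree with the XGBoost classification report: recall for class 0 is only 0.08, so most on-time deliveries are predicted as early or late even though overall accuracy is 0.61.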
        

In [24]:
# Load cleaned and engineered data
df = pd.read_csv("logistics_featurized.csv")

# Load trained model (e.g., Random Forest)
model = joblib.load("models/best_model.pkl")

# Ensure output directory
os.makedirs("output/phase5", exist_ok=True)

# ========================
# 1. Feature Importance
# ========================
def plot_feature_importance(model, feature_names):
    importances = model.named_steps['classifier'].feature_importances_
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importances
    }).sort_values(by='Importance', ascending=False)

    plt.figure(figsize=(10, 6))
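    # Assign hue explicitly; passing palette without hue is deprecated in seaborn
    sns.barplot(data=importance_df.head(15), x='Importance', y='Feature',
                hue='Feature', palette='viridis', legend=False)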
    plt.title("Top 15 Feature Importances")
    plt.tight_layout()
    plt.savefig("output/phase5/feature_importance.png")
    plt.close()
    return importance_df

# Plot and save
preprocessor = model.named_steps['preprocessor']
# Get feature names after preprocessing
cat_features = preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(preprocessor.transformers_[1][2])
num_features = preprocessor.transformers_[0][2]
feature_names = np.concatenate([num_features, cat_features])
importance_df = plot_feature_importance(model, feature_names)

# ========================
# 2. Key Insights
# ========================
insights = []

top_features = importance_df.head(5)['Feature'].tolist()

insights.append(f"Top predictive features include: {', '.join(top_features)}.")
# The engineered feature is named 'shipping_delay_days' (see the feature engineering cell)
if 'shipping_delay_days' in top_features:
    insights.append("Shipping duration plays a significant role in predicting delays.")
if 'product_category_id' in top_features:
    insights.append("Product category is influential, suggesting some categories are more delay-prone.")
if 'customer_country' in top_features:
    insights.append("Customer location also contributes heavily, indicating geographic logistics challenges.")

# ========================
# 3. Recommendations
# ========================
recommendations = [
    "✅ Optimize operations for product categories that are frequently delayed.",
    "✅ Investigate and improve shipping processes in high-delay regions.",
    "✅ Use predicted delay probabilities to notify customers ahead of time.",
    "✅ Prioritize orders with short delivery windows to improve reliability ratings.",
    "✅ Consider reinforcing customer support in delay-prone areas to manage expectations."
]

# Save to file
with open("output/phase5/insights_and_recommendations.txt", "w", encoding="utf-8") as f:
    f.write("=== KEY INSIGHTS ===\n")
    for insight in insights:
        f.write(f"- {insight}\n")
    f.write("\n=== RECOMMENDATIONS ===\n")
    for rec in recommendations:
        f.write(f"- {rec}\n")

print("✅ Phase 5 complete: Feature importances, insights and recommendations saved.")




✅ Phase 5 complete: Feature importances, insights and recommendations saved.


In [25]:
import os
import zipfile
from google.colab import files

# Define source and destination
source_dir = "/content"
output_zip = "colab_session_backup.zip"

def zipdir_excluding_sample_data(zip_name, path):
    """Zip everything under `path`, skipping sample_data and the archive itself."""
    zip_path = os.path.abspath(zip_name)
    with zipfile.ZipFile(zip_name, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, filenames in os.walk(path):
            # Prune sample_data in place so os.walk never descends into it
            dirs[:] = [d for d in dirs if d != 'sample_data']
            for filename in filenames:
                filepath = os.path.join(root, filename)
                if os.path.abspath(filepath) == zip_path:
                    continue  # Do not add the archive to itself
                arcname = os.path.relpath(filepath, path)
                zipf.write(filepath, arcname)

# Create the zip
zipdir_excluding_sample_data(output_zip, source_dir)

# Download the zip file
files.download(output_zip)

