# <p style="background-color:#9C27B0; font-family:calibri; color:white; font-size:120%; text-align:center; border-radius:15px 50px;">Capstone Project | E-commerce Product Delivery Prediction</p>

<img src="img.JPG" width="100%" height="60%">

<div style="border-radius:10px; padding: 15px; background-color:rgb(241, 191, 250); font-size:120%; text-align:left">

<h3 align="left"><font color=purple><b>Problem:</b></font></h3>

<font color=black>
In this project, we analyze a dataset containing shipment, customer, and product details from an <b>international e-commerce</b> company.

The primary goal is to build a predictive model capable of accurately determining whether a product will be delivered on time or delayed. Since late deliveries negatively impact customer satisfaction and business reputation, our focus is on minimizing false negatives — i.e., ensuring the model correctly identifies potential late deliveries. Therefore, recall for the late delivery class is a critical evaluation metric for this project.</font>

<div style="border-radius:10px; padding: 15px; background-color:rgb(241, 191, 250); font-size:115%; text-align:left">
  <h3 style="color:purple;"><b>Objectives:</b></h3>
  <ul style="color:black;">
    <li><b>Explore the Dataset:</b> Uncover patterns, distributions, and relationships within the data.</li>
    <li><b>Conduct Extensive Exploratory Data Analysis (EDA):</b> Dive deep into bivariate relationships against the target.</li>
    <li><b>Preprocessing Steps:</b>
      <ul>
        <li>Remove irrelevant features</li>
        <li>Address missing values</li>
        <li>Treat outliers</li>
        <li>Encode categorical variables</li>
        <li>Transform skewed features to achieve normal-like distributions</li>
      </ul>
    </li>
    <li><b>Model Building:</b>
      <ul>
        <li>Establish pipelines for models that require scaling</li>
        <li>Implement and tune classification models: Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest</li>
        <li>Emphasize high recall for the late-delivery class so that potential delays are identified as comprehensively as possible</li>
      </ul>
    </li>
    <li><b>Evaluate and Compare Model Performance:</b> Utilize precision, recall, and F1-score to gauge models' effectiveness.</li>
  </ul>
</div>
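
To make the recall objective concrete, the sketch below computes recall and precision for the late-delivery class from raw labels. The label convention (1 = late) and the toy label lists are assumptions made purely for illustration; they are not values from the dataset.

```python
def recall_precision(y_true, y_pred, positive=1):
    """Return (recall, precision) for the class treated as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Toy labels (hypothetical): 1 = late delivery, 0 = on time
demo_true = [1, 1, 1, 1, 0, 0, 0, 0]
demo_pred = [1, 1, 1, 0, 1, 0, 0, 0]

demo_recall, demo_precision = recall_precision(demo_true, demo_pred)
print(f"Recall: {demo_recall:.2f} | Precision: {demo_precision:.2f}")
```

Each missed late delivery is a false negative and lowers recall, which is why recall is tracked separately from accuracy in this project.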


<style>
a {
  color: #00004d !important;  /* Or any color you like */
  text-decoration:none;   /* Optional: removes underline */
  font-family:cambria;
}
</style>

<a id="contents_table"></a>

<div style="border-radius:10px; padding: 15px; background-color: rgb(241, 191, 250); font-size:100%; text-align:left;">

<h3 align="left"><font color="purple"><b>Table of Contents:</b></font></h3>

* [Step 1 | Import Libraries](#import)
* [Step 2 | Load the Dataset](#read)
* [Step 3 | Dataset Overview](#overview)
    - [3.1 | Basic Information](#basic_info)
    - [3.2 | Summary Statistics](#summary_stats)
* [Step 4 | Exploratory Data Analysis](#eda)
    - [4.1 | Univariate Analysis](#univariate)
        - [4.1.1 | Categorical Features](#cat_uni)
        - [4.1.2 | Numerical Features](#num_uni)
    - [4.2 | Bivariate Analysis](#bivariate)
        - [4.2.1 | Target vs Categorical Features](#target_cat)
        - [4.2.2 | Target vs Numerical Features](#target_num)
        - [4.2.3 | Correlation Matrix](#correlation)
* [Step 5 | Data Pre-processing](#preprocessing)
    - [5.1 | Split Features & Target](#split)
    - [5.2 | Column Transformer](#transformer)
    - [5.3 | Train-Test Split](#train_test)
* [Step 6 | Model Building](#model_building)
    - [6.1 | Build Model](#model_build)
    - [6.2 | Parameter Tuning](#param_tuning)
    - [6.3 | GridSearchCV Setup](#gridsearch)
* [Step 7 | Model Evaluation](#evaluation)
    - [7.1 | Accuracy, Precision, Recall, F1-Score](#metrics)
    - [7.2 | ROC Curve](#roc)
    - [7.3 | Classification Report](#classification_report)
* [Step 8 | Model Comparison](#comparison)
    - [8.1 | Bar Plot of Metrics](#bar_plot)
    - [8.2 | Extract from Best Model Pipeline](#best_pipeline)
    - [8.3 | Plot Top Features](#top_features)
* [Step 9 | Conclusion](#conclusion)

⬆️ **[Back to Top](#contents_table)**
</div>





<a id="import"></a>
## <p style="background-color:#9C27B0; font-family:calibri; color:white; font-size:100%; text-align:center; border-radius:15px 50px;">Step 1 | Import Libraries</p>

⬆️ [Contents](#contents_table)

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, RobustScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

import ipywidgets as widgets
from IPython.display import display, HTML

from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings('ignore')

In [4]:
# pip install ipywidgets





<a id="read"></a>
## <p style="background-color:#9C27B0; font-family:calibri; color:white; font-size:100%; text-align:center; border-radius:15px 50px;">Step 2 | Load the dataset</p>

⬆️ [Contents](#contents_table)

In [5]:

df = pd.read_csv("E_Commerce.csv")

df.head()


In [None]:
df.shape

<a id="overview"></a>
## <p style="background-color:#9C27B0; font-family:calibri; color:white; font-size:100%; text-align:center; border-radius:15px 50px;">Step 3 | Dataset Overview</p>

⬆️ [Contents](#contents_table)


<a id="basic_info"></a>
## <b><span style='color:#ff00ff'>Step 3.1 |</span><span style='color:#ff00ff'> Dataset Basic Information</span></b>

In [None]:
df.info()

In [None]:
df.dtypes       #Checking data types of the columns

In [None]:
df.isnull().sum()       #Checking for null/missing values

In [None]:
df.duplicated().sum()     #Checking for duplicate values

<a id="summary_stats"></a>
## <b><span style='color:#ff00ff'>Step 3.2 |</span><span style='color:#ff00ff'> Summary Statistics for Numerical Variables</span></b>


In [None]:
df.describe().T

<a id="summary_stats_cat"></a>
## <b><span style='color:#ff00ff'>Step 3.3 |</span><span style='color:#ff00ff'> Summary Statistics for Categorical Variables</span></b>

In [None]:
df.describe(include=object)

<a id="eda"></a>
## <p style="background-color:#9C27B0; font-family:calibri; color:white; font-size:100%; text-align:center; border-radius:15px 50px;">Step 4 | Exploratory Data Analysis</p>

⬆️ [Contents](#contents_table)



<div style="border-radius:10px; padding: 15px; background-color: #E1BEE7; font-size:115%; text-align:left">

<font color=Black>For <b>Exploratory Data Analysis (EDA)</b>, we'll follow two main steps:

<b>• Univariate Analysis</b> – We'll look at each feature by itself to understand its values and distribution.

<b>• Bivariate Analysis</b> – We'll check how each feature relates to the target variable to see which ones might be important for prediction.

This will help us understand the data better and find useful patterns for our model.

Throughout the EDA, the focus is on <b>the relationship between the target variable and the other variables</b>, along with the distribution of each variable across the dataset.</font>

</div>

<a id="univariate"></a>
## <b><span style='color:#ff00ff'>Step 4.1 |</span><span style='color:#ff00ff'> Univariate Analysis</span></b>
<a id="cat_uni"></a>
### <b><span style='color:#ff00ff'>Step 4.1.1 |</span><span style='color:#ff00ff'> Categorical Features</span></b>
⬆️ [Contents](#contents_table)

In [None]:

categorical_cols = ['Warehouse_block', 'Mode_of_Shipment', 'Product_importance', 'Gender', 'Reached.on.Time_Y.N']

for col in categorical_cols:
    
    fig = plt.figure(figsize=(6, 4))
    fig.patch.set_facecolor("#27272AFE")  # Dark outer background

    sns.countplot(data=df, x=col, palette='Set2')  # Soft bar colors

    ax = plt.gca()
    ax.set_facecolor("#27272AFE")  # Inner plot background

    plt.title(f'Distribution of {col}', fontsize=12, color="white")
    
    plt.xticks(rotation=45, fontsize=9, color="#EEEEF7")  # X-axis tick color
    plt.yticks(fontsize=8, color="#EEEEF7")               # Y-axis tick color

    plt.xlabel(col, color='white')
    plt.ylabel("Count", color='white')

    # Remove all spines (borders)
    for spine in ax.spines.values():
        spine.set_visible(False)

    plt.tight_layout(pad=2.0)
    plt.show()


<a id="num_uni"></a>
### <b><span style='color:#ff00ff'>Step 4.1.2 |</span><span style='color:#ff00ff'> Numerical Features</span></b>
⬆️ [Contents](#contents_table)

In [None]:
numerical_cols = ['Customer_care_calls', 'Customer_rating', 'Cost_of_the_Product',
                  'Prior_purchases', 'Discount_offered', 'Weight_in_gms']

fig, axes = plt.subplots(3, 2, figsize=(16, 12))

# Set background color for the full figure
fig.patch.set_facecolor('#e6f2ff')  # Light blue

# Loop through each subplot and plot the histogram
for ax, col in zip(axes.flatten(), numerical_cols):
    sns.histplot(data=df, x=col, bins=20, kde=True, ax=ax, color="#3871C1", edgecolor='black', linewidth=0.5)
    ax.set_title(col, color='#003366', fontsize=12)                   # Dark blue title
    ax.set_facecolor('#ffffff')  # White background for subplot
    ax.tick_params(colors='#333333')  # Dark grey tick labels

    # Light dashed grid lines on both axes
    ax.grid(axis='both', color="#b9b5b5", linestyle='--', linewidth=0.5)

    # Clean up borders
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

# Adjust spacing and add super title
plt.tight_layout(pad=3.0)
plt.suptitle("Histograms of Numerical Features", fontsize=18, y=1.02, color='#003366')
plt.show()



<a id="bivariate"></a>
## <b><span style='color:#ff00ff'>Step 4.2 |</span><span style='color:#ff00ff'> Bivariate Analysis (How Features Relate to the Target)</span></b>
<a id="target_cat"></a>
### <b><span style='color:#ff00ff'></span><span style='color:#ff00ff'>4.2.1. Target vs Categorical Features</span></b>
⬆️ [Contents](#contents_table)

In [None]:
for col in categorical_cols[:-1]:
    
    sns.countplot(data=df, x=col, hue='Reached.on.Time_Y.N')
    plt.title(f'{col} vs Delivery Status')
    plt.xticks(rotation=45)
    plt.show()


<a id="target_num"></a>
### <b><span style='color:#ff00ff'></span><span style='color:#ff00ff'>4.2.2. Target vs Numerical Features</span></b>
⬆️ [Contents](#contents_table)


In [None]:
for col in numerical_cols:
    sns.boxplot(data=df, x='Reached.on.Time_Y.N', y=col)
    plt.title(f'{col} by Delivery Status')
    plt.show()


<a id="correlation"></a>
### <b><span style='color:#ff00ff'></span><span style='color:#ff00ff'>4.2.3. Correlation Matrix</span></b>

⬆️ [Contents](#contents_table)


In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(df[numerical_cols].corr(), annot=True, cmap='YlGnBu')
plt.title("Correlation Matrix")
plt.xticks(rotation=45)
plt.gcf().patch.set_facecolor("#c4c4c9")  # Figure background
plt.tight_layout()
plt.show()


<a id="out"></a>
### <b><span style='color:#ff00ff'></span><span style='color:#ff00ff'>4.2.4. Outlier Detection</span></b>

⬆️ [Contents](#contents_table)


In [None]:
# List of numerical columns to check for outliers
check_cols = ['Cost_of_the_Product', 'Discount_offered', 'Weight_in_gms', 'Prior_purchases']

print("Outlier Count by Column:\n")

for col in check_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outlier_count = df[(df[col] < lower_bound) | (df[col] > upper_bound)].shape[0]
    print(f"{col}: {outlier_count} outliers")
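
The cell above only counts outliers. If they should also be treated, one common option is to cap values at the IQR fences. The following is a dependency-free sketch of that idea, not the method used in this notebook; the helper names `iqr_bounds` and `cap_outliers` and the sample values are hypothetical.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_outliers(values, k=1.5):
    """Clip every value to the [lower, upper] fences."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]

# Hypothetical sample with one extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lower, upper = iqr_bounds(sample)
capped = cap_outliers(sample)
print("Fences:", lower, upper)
print("Capped:", capped)
```

Capping keeps every row, which matters when the dataset is small, whereas dropping rows would shrink the training set.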

In [None]:
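import numpy as np

# Keep only numerical feature columns; exclude the identifier and the target
# so that neither can be flagged (and dropped) as a redundant feature
feature_df = df.select_dtypes(include=[np.number]).drop(
    columns=['ID', 'Reached.on.Time_Y.N'], errors='ignore'
)
corr_matrix = feature_df.corr().abs()

# Take upper triangle of correlation matrix (diagonal excluded)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features whose correlation with an earlier feature exceeds the threshold (0.9 here)
to_drop = [column for column in upper.columns if (upper[column] > 0.9).any()]

print("Highly correlated columns to drop:", to_drop)

# Drop from dataframe
df.drop(columns=to_drop, inplace=True)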
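
The upper-triangle filter compares each column only with the columns that come before it, so exactly one member of a highly correlated pair is dropped. Below is a small pure-Python sketch of the same logic with a `protected` argument that keeps the target out of the comparison. The column names and values are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_to_drop(columns, threshold=0.9, protected=()):
    """Return columns that correlate above `threshold` with an earlier column."""
    names = [n for n in columns if n not in protected]
    to_drop = []
    for i, later in enumerate(names):
        for earlier in names[:i]:
            if abs(pearson(columns[earlier], columns[later])) > threshold:
                to_drop.append(later)
                break
    return to_drop

# Hypothetical columns: "weight_copy" is almost a multiple of "weight"
demo_columns = {
    "weight": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weight_copy": [2.0, 4.0, 6.0, 8.0, 10.1],
    "rating": [5.0, 1.0, 4.0, 2.0, 3.0],
    "target": [0, 0, 1, 1, 1],
}
demo_drop = correlated_to_drop(demo_columns, protected=("target",))
print(demo_drop)
```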

In [None]:
# Drop columns that are not used as model features
df.drop(['ID', 'Gender'], axis=1, inplace=True)

In [None]:
df.shape

In [None]:
df.head()

<a id="preprocessing"></a>
## <p style="background-color:#9C27B0; font-family:calibri; color:white; font-size:100%; text-align:center; border-radius:15px 50px;">Step 5 | Data Pre-processing</p>

⬆️ [Contents](#contents_table)

<a id="split"></a>
## <b><span style='color:#ff00ff'>Step 5.1 | Split Features and Target Variables</span></b>

⬆️ [Contents](#contents_table)

In [None]:
X=df.drop('Reached.on.Time_Y.N', axis=1)
y=df['Reached.on.Time_Y.N']

In [None]:
# # Assuming df is already cleaned
# numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
# ordinal_cols =['Product_importance']
# categorical_cols = set(X.select_dtypes(include=['object', 'category']).columns.tolist()) - set(ordinal_cols)


# print("Numerical:", numerical_cols)
# print("Categorical:", categorical_cols)
# print("ordinal_cols:",ordinal_cols)


In [None]:
# # Numerical Columns:
# numerical_cols = [
#     'Customer_rating',
#     'Cost_of_the_Product',
#     'Discount_offered',
#     'Weight_in_gms',
#     'Prior_purchases'
# ]

In [None]:
# #Categorical (Nominal) Columns:
# categorical_cols = [
#     'Warehouse_block',
#     'Mode_of_Shipment'
# ]

In [None]:
# # Ordinal Columns:
# #These have meaningful order: low < medium < high  -> Use OrdinalEncoding

# ordinal_cols = ['Product_importance']

# # Defining order explicitly
# ordinal_order = [['low', 'medium', 'high']]





<a id="transformer"></a>
## <b><span style='color:#ff00ff'>Step 5.2 | Column Transformer</span></b>

⬆️ [Contents](#contents_table)

In [None]:
# numeric, categorical, ordinal columns
num_features = ['Customer_care_calls', 'Customer_rating', 'Cost_of_the_Product',
                'Prior_purchases', 'Discount_offered', 'Weight_in_gms']

cat_features = ['Mode_of_Shipment', 'Warehouse_block']
ord_features = ['Product_importance']

# preprocessing
preprocessor = ColumnTransformer([
    ('num', RobustScaler(), num_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features),
    # Explicit order so that low < medium < high (the default sorts alphabetically)
    ('ord', OrdinalEncoder(categories=[['low', 'medium', 'high']]), ord_features)
])

# pipeline with SMOTE AFTER preprocessing
# SMOTE is a sampler, so it needs imblearn's Pipeline rather than sklearn's
from imblearn.pipeline import Pipeline as ImbPipeline

pipe = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE()),   # SMOTE sees only numerical output
    ('model', RandomForestClassifier())
])
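
A note on the ordinal column: when no explicit `categories` list is given, `OrdinalEncoder` infers the categories in sorted order, which for text labels means alphabetical. The plain-Python sketch below shows why that matters for `Product_importance`. The `levels` list is a hypothetical sample of values.

```python
# Hypothetical sample of Product_importance values
levels = ["low", "medium", "high", "medium", "low"]

# Alphabetical ordering: what you get when categories are inferred by sorting
alphabetical = {cat: i for i, cat in enumerate(sorted(set(levels)))}

# Explicit ordering that reflects the intended meaning: low < medium < high
explicit = {cat: i for i, cat in enumerate(["low", "medium", "high"])}

alphabetical_codes = [alphabetical[v] for v in levels]
explicit_codes = [explicit[v] for v in levels]
print("alphabetical:", alphabetical_codes)
print("explicit:    ", explicit_codes)
```

With alphabetical ordering, "high" receives the smallest code, so the encoded numbers no longer increase with importance.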

In [None]:

# # Pipelines for each type
# num_pipeline = Pipeline([
#     ('scaler', StandardScaler())
# ])

# cat_pipeline = Pipeline([
#     ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
# ])

# ord_pipeline = Pipeline([
#     ('ordinal', OrdinalEncoder(categories=ordinal_order))
# ])

# # Final column transformer
# preprocessor = ColumnTransformer([
#     ('num', num_pipeline, numerical_cols),
#     ('cat', cat_pipeline, categorical_cols),
#     ('ord', ord_pipeline, ordinal_cols)
# ])

# # Full pipeline with RandomForest
# pipeline = Pipeline([
#     ("preprocess", preprocessor),
#      ('smote', SMOTE()),   # SMOTE sees only numerical output
#     ("model", RandomForestClassifier(random_state=42))
# ])


<a id="train_test"></a>

### <b><span style='color:#ff00ff'></span><span style='color:#ff00ff'>5.3. Train-Test Split</span></b>
⬆️ [Contents](#contents_table)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
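
Because recall on one class is the key metric, it can help to keep the class ratio identical in the training and test sets. In scikit-learn this is done by passing `stratify=y` to `train_test_split`. The sketch below shows the underlying idea in plain Python; the helper `stratified_split` and the 60/40 label mix are assumptions for illustration only.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # same fraction per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 60 late (1) and 40 on-time (0) shipments
demo_labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(demo_labels)

train_counts = Counter(demo_labels[i] for i in train_idx)
test_counts = Counter(demo_labels[i] for i in test_idx)
print("Train:", dict(train_counts))
print("Test: ", dict(test_counts))
```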


In [None]:
# Training and Testing Data
print(f'Training Data : {X_train.shape} | Testing Data : {X_test.shape}')


In [None]:
df.shape

<a id="model_building"></a>
## <p style="background-color:#9C27B0; font-family:calibri; color:white; font-size:100%; text-align:center; border-radius:15px 50px;">Step 6 | Model Building</p>




<a id="model_build"></a>
### <b><span style='color:#ff00ff'></span><span style='color:#ff00ff'>6.1. Build Model</span></b>
⬆️ [Contents](#contents_table)


In [None]:

models = {
    'RandomForest': RandomForestClassifier(random_state=42),
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'KNN': KNeighborsClassifier()
}

# Store results
results = {}

# Loop through and train each model
for name, model in models.items():
    pipe = Pipeline([
        ('preprocessing', preprocessor),
        ('model', model)
    ])
    
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    results[name] = acc

    print(f"\n{name} Accuracy: {acc * 100:.2f}%")

# Find the best model
best_model_name = max(results, key=results.get)
best_accuracy = results[best_model_name]

print("\n💠 Best Model:")
print(f"Model is: {best_model_name} | Accuracy is: {best_accuracy * 100:.2f}%")
# print("Classification Report:\n", classification_report(y_test, y_pred))
# print("-" * 50)

  


<a id="param_tuning"></a>

### <b><span style='color:#ff00ff'></span><span style='color:#ff00ff'>6.2. Parameter Tuning</span></b>
⬆️ [Contents](#contents_table)

In [None]:
#  1. Consistent names everywhere
param_grid = {
    'RandomForest': {
        'clf__n_estimators': [100, 200],
        'clf__max_depth': [None, 10, 20]
    },
    'DecisionTree': {
        'clf__max_depth': [None, 10, 20],
        'clf__min_samples_split': [2, 5]
    },
    'LogisticRegression': {
        'clf__C': [0.01, 0.1, 1, 10]
    },
    'KNN': {
        'clf__n_neighbors': [3, 5, 7]
    }
}

models = {
    'RandomForest': RandomForestClassifier(class_weight='balanced', random_state=42),
    'DecisionTree': DecisionTreeClassifier(class_weight='balanced', random_state=42),
    'LogisticRegression': LogisticRegression(max_iter=1000, class_weight='balanced'),
    'KNN': KNeighborsClassifier()
}



<a id="gridsearch"></a>

### <b><span style='color:#ff00ff'></span><span style='color:#ff00ff'>6.3. GridSearchCV Setup</span></b>

⬆️ [Contents](#contents_table)

In [None]:
#  2. Loop through models
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

metrics_summary = {}
best_pipelines={}

for name, model in models.items():
    pipe = ImbPipeline([
        ('preprocessor', preprocessor),  # Scale numericals, encode categoricals
        ('smote', SMOTE(random_state=42)),
        ('clf', model)
    ])

    grid = GridSearchCV(pipe, param_grid[name], cv=5, scoring='accuracy', n_jobs=-1)
    grid.fit(X_train, y_train)

    y_pred = grid.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    best_pipelines[name] = grid  #  store the GridSearchCV

    metrics_summary[name] = {
        'Accuracy': acc, 'Precision': prec, 'Recall': rec, 'F1-Score': f1
    }

    print(f"\n💠{name} Results:")
    print("===================================")
    print("Best Parameters:", grid.best_params_)
    print(f"Accuracy: {acc:.2%}")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
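
The grid search above selects hyperparameters by accuracy. Since the stated goal is high recall for the late-delivery class, a recall-oriented search may be worth comparing. The sketch below uses `scoring="recall"` on synthetic stand-in data from `make_classification`; it is not the e-commerce dataset, and the variable names with a `_demo` or `demo_` prefix are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): roughly 60% of samples in class 1
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.4, 0.6], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

grid_recall = GridSearchCV(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    {"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="recall",  # recall of the positive class (label 1)
)
grid_recall.fit(X_tr, y_tr)

demo_recall = recall_score(y_te, grid_recall.predict(X_te))
print("Best params:", grid_recall.best_params_)
print(f"Test recall: {demo_recall:.2%}")
```

Optimizing recall alone can favor models that predict the positive class too often, so precision and F1 are still worth reporting alongside it.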

In [None]:
#  3. After loop: Find best model
best_model_name = max(metrics_summary, key=lambda k: metrics_summary[k]['Accuracy'])
best_metrics = metrics_summary[best_model_name]

print("💠 Best Model Overall:")
print("-----------------------------------")
print(f"Model: {best_model_name}")
print(f"Accuracy: {best_metrics['Accuracy']:.2%}")
print(f"Precision: {best_metrics['Precision']:.2%}")
print(f"Recall: {best_metrics['Recall']:.2%}")
print(f"F1-Score: {best_metrics['F1-Score']:.2%}")
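
The selection above ranks models by accuracy. If recall is the priority, the same dictionary structure can be ranked by recall with F1 as a tie-breaker, as sketched below. The model names and scores in `example_summary` are invented for illustration and are not results from this notebook.

```python
# Hypothetical scores for illustration only
example_summary = {
    "ModelA": {"Accuracy": 0.68, "Recall": 0.62, "F1-Score": 0.70},
    "ModelB": {"Accuracy": 0.66, "Recall": 0.71, "F1-Score": 0.72},
    "ModelC": {"Accuracy": 0.64, "Recall": 0.71, "F1-Score": 0.69},
}

def pick_best(summary, primary="Recall", tie_breaker="F1-Score"):
    """Return the model with the highest primary metric; ties go to the tie-breaker."""
    return max(summary, key=lambda name: (summary[name][primary], summary[name][tie_breaker]))

best_by_accuracy = pick_best(example_summary, primary="Accuracy")
best_by_recall = pick_best(example_summary)
print("Best by accuracy:", best_by_accuracy)
print("Best by recall:  ", best_by_recall)
```

In this toy example the two criteria disagree, which shows why the ranking metric should match the project objective.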

In [None]:
print("Best model:", best_model_name)
print("Best params:", best_pipelines[best_model_name].best_params_)

In [None]:
# #  4. Define param grid for the best model
# if best_model_name == 'RandomForest':
#     param_grid = {
#         'model__n_estimators': [50, 100, 200],
#         'model__max_depth': [5, 10, 15, None],
#         'model__min_samples_split': [2, 5, 10]
#     }
# elif best_model_name == 'LogisticRegression':
#     param_grid = {
#         'model__C': [0.01, 0.1, 1, 10],
#         'model__solver': ['liblinear', 'lbfgs']
#     }
# elif best_model_name == 'DecisionTree':
#     param_grid = {
#         'model__max_depth': [None, 5, 10, 20],
#         'model__min_samples_split': [2, 5, 10]
#     }
# elif best_model_name == 'KNN':
#     param_grid = {
#         'model__n_neighbors': [3, 5, 7, 9],
#         'model__weights': ['uniform', 'distance']
#     }
# else:
#     raise ValueError("No param grid defined for this model.")


In [None]:
# pipe = Pipeline([
#     ('preprocessor', preprocessor),
#     ('smote', SMOTE()),   # SMOTE sees only numerical output
#     ('model', RandomForestClassifier())
# ])

In [None]:
# # 5. Setup pipeline again for the best model
# best_model = models[best_model_name]

# pipe = Pipeline([
#     ('preprocessing', preprocessor),
#     ('model', best_model)
# ])

In [None]:
# pipe.fit(X_train, y_train)

In [None]:
# # ✅ 6. GridSearchCV for hyperparameter tuning
# grid_search = GridSearchCV(
#     estimator=pipe,
#     param_grid=param_grid,
#     cv=5,
#     scoring='accuracy',
#     n_jobs=-1
# )

# grid_search.fit(X_train, y_train)


In [None]:
# # RandomizedSearchCV
# grid = RandomizedSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1, n_iter=20, verbose=2)
# grid.fit(X_train, y_train)

In [None]:
# # Best Parameters
# best_params = grid.best_params_
# best_params

In [None]:
# # Best Score

# # best_score = grid.best_score_
# # best_score
# print(f"✅ Best CV Accuracy: {grid.best_score_:.4f}")

In [None]:
# # Best estimator
# best_model = grid.best_estimator_


In [None]:

# #  7. Use this tuned model to predict test set
# y_pred = best_model.predict(X_test)
# test_acc = accuracy_score(y_test, y_pred)
# print(f"✅ Test Accuracy with best hyperparameters: {test_acc:.4f}")


<a id="evaluation"></a>
## <p style="background-color:#9C27B0; font-family:calibri; color:white; font-size:100%; text-align:center; border-radius:15px 50px;">Step 7 | Model Evaluation</p>

⬆️ [Contents](#contents_table)


In [None]:
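# Accuracy of the best tuned model (y_pred is recomputed so it does not
# depend on whichever model was fitted last in the loop above)
y_pred = best_pipelines[best_model_name].predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))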


<a id="metrics"></a>

### <b><span style='color:#ff00ff'></span><span style='color:#ff00ff'>7.1. Accuracy, Precision, Recall, F1-Score</span></b>
⬆️ [Contents](#contents_table)

In [None]:
# Classification report and confusion matrix for the best tuned model
y_pred = best_pipelines[best_model_name].predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)
# Plot: Confusion Matrix
plt.figure(figsize=(4, 3))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", cbar=False)
plt.title(f"{best_model_name} - Confusion Matrix", fontsize=14)
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()


<a id="roc"></a>

### <b><span style='color:#ff00ff'></span><span style='color:#ff00ff'>7.2. ROC Curve</span></b>
⬆️ [Contents](#contents_table)

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

#  Get best estimator from GridSearchCV results
final_model = best_pipelines[best_model_name].best_estimator_

#  Get predicted probabilities for class 1
y_proba = final_model.predict_proba(X_test)[:, 1]

#  Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

#  Plot it
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')  # Diagonal line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve - {best_model_name}')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
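
Each point on the ROC curve corresponds to a decision threshold, so the curve can also be used to pick a threshold that meets a recall target instead of the default 0.5. The following plain-Python sketch illustrates the idea; the helper `threshold_for_recall`, the labels, and the probabilities are all hypothetical.

```python
def threshold_for_recall(y_true, y_score, target_recall=0.9):
    """Return the highest score threshold whose recall meets the target, or None."""
    positives = sum(1 for t in y_true if t == 1)
    if positives == 0:
        return None
    # Scan thresholds from high to low; recall can only grow as the threshold drops
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= threshold)
        if tp / positives >= target_recall:
            return threshold
    return None

# Hypothetical labels and predicted probabilities of class 1
demo_y = [1, 1, 1, 1, 0, 0, 1, 0]
demo_scores = [0.95, 0.80, 0.65, 0.40, 0.55, 0.30, 0.35, 0.10]

chosen_threshold = threshold_for_recall(demo_y, demo_scores, target_recall=0.8)
print("Threshold for recall >= 0.8:", chosen_threshold)
```

Lowering the threshold raises recall at the cost of more false positives, so the chosen value should be validated on held-out data rather than on the test set used for reporting.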



<a id="classification_report"></a>

### <b><span style='color:#ff00ff'></span><span style='color:#ff00ff'>7.3. Classification Report</span></b>
⬆️ [Contents](#contents_table)

In [None]:
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report
)
import matplotlib.pyplot as plt
import seaborn as sns

# 📌 Dictionary to store all confusion matrices
all_confusion_matrices = {}

for name, grid in best_pipelines.items():
    best_model = grid.best_estimator_
    y_pred = best_model.predict(X_test)

    # Metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    all_confusion_matrices[name] = cm

    print(f"\n===== {name} =====")
    print(f"Accuracy: {acc* 100:.4f}")
    print(f"Precision: {prec * 100:.4f}")
    print(f"Recall: {rec * 100:.4f}")
    print(f"F1-Score: {f1 * 100:.4f}")
    print("Classification Report:\n", classification_report(y_test, y_pred))

    # Plot confusion matrix
    plt.figure(figsize=(5, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f"Confusion Matrix - {name}")
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()


In [None]:
# Setup subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

# Loop through models and plot
for idx, (name, model) in enumerate(models.items()):
    pipe = Pipeline([
        ("preprocess", preprocessor),
        ("model", model)
    ])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)

    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Reds", ax=axes[idx])
    axes[idx].set_title(name)
    axes[idx].set_xlabel("Predicted")
    axes[idx].set_ylabel("Actual")

    # Print classification report
    print(f"\n======== {name} =========")
    print(classification_report(y_test, y_pred))

plt.tight_layout()
plt.show()


<a id="comparison"></a>
## <p style="background-color:#9C27B0; font-family:calibri; color:white; font-size:100%; text-align:center; border-radius:15px 50px;">Step 8 | Model Comparison: Accuracy</p>
⬆️ [Contents](#contents_table)



<a id="bar_plot"></a>

### <b><span style='color:#ff00ff'></span><span style='color:#ff00ff'>8.1. Bar Plot Comparison of Metrics</span></b>
⬆️ [Contents](#contents_table)

In [None]:
import matplotlib.pyplot as plt
import numpy as np


metric_names = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
model_names = list(metrics_summary.keys())

x = np.arange(len(model_names))
width = 0.2

# Custom colors for each metric
colors =  ["#840486", '#421a68', "#F6E19E", "#dc74c9"]

plt.figure(figsize=(12, 6))

for i, (metric, color) in enumerate(zip(metric_names, colors)):
    values = [metrics_summary[model][metric] for model in model_names]
    bars = plt.bar(x + i * width, values, width=width, label=metric, color=color)

    # Add percentage labels on top of each bar
    for bar, val in zip(bars, values):
        plt.text(
            bar.get_x() + bar.get_width() / 2,
            bar.get_height() + 0.01,
            f"{val * 100:.1f}%",  # show as %
            ha='center',
            va='bottom',
            fontsize=9
        )

plt.xticks(x + width * (len(metric_names)-1)/2, model_names, rotation=15)
plt.ylabel("Score")
plt.ylim(0, 1)
plt.title("Model Comparison: Accuracy, Precision, Recall, F1-Score")
plt.legend(title="Metrics")
plt.tight_layout()
plt.show()



<a id="best_pipeline"></a>

### <b><span style='color:#ff00ff'></span><span style='color:#ff00ff'>8.2. Extract from Best Model Pipeline</span></b>
⬆️ [Contents](#contents_table)

In [None]:

from imblearn.pipeline import Pipeline as ImbPipeline

final_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('clf', RandomForestClassifier(class_weight='balanced', n_estimators=200, random_state=42))
])



<a id="top_features"></a>

### <b><span style='color:#ff00ff'></span><span style='color:#ff00ff'>8.3. Plot Top Features</span></b>

⬆️ [Contents](#contents_table)

In [None]:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

#   Final pipeline with best parameters
final_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('clf', RandomForestClassifier(class_weight='balanced',
                                   n_estimators=200,
                                   max_depth=20,   # Example best param
                                   random_state=42))
])

#   Fit on FULL training set
final_pipeline.fit(X_train, y_train)




In [None]:
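#  Extract the fitted model and the fitted preprocessor from the pipeline
rf_model = final_pipeline.named_steps['clf']
fitted_preprocessor = final_pipeline.named_steps['preprocessor']

#  Get proper feature names (same order as the ColumnTransformer: num, cat, ord)
ohe_features = fitted_preprocessor.named_transformers_['cat'].get_feature_names_out(cat_features)
final_features = num_features + list(ohe_features) + ord_features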

#  Check size match
print(f"Features: {len(final_features)} | Importances: {len(rf_model.feature_importances_)}")

#  Make Series and plot top 10
importances = pd.Series(rf_model.feature_importances_, index=final_features)
top10 = importances.sort_values(ascending=False).head(10)

plt.figure(figsize=(8, 6))
top10.sort_values().plot(kind='barh', color='#9C27B0', edgecolor='black')
plt.title("Top 10 Feature Importances")
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.show()


<a id="conclusion"></a>
## <p style="background-color:#9C27B0; font-family:calibri; color:white; font-size:100%; text-align:center; border-radius:15px 50px;">Step 9 | Conclusion</p>

⬆️ [Contents](#contents_table)



<div style="border-radius:10px; padding: 15px; background-color:rgb(241, 191, 250); font-size:120%; text-align:left">

<h3 align="left"><font color=purple><b>Conclusion:</b></font></h3>

<font color=black>
In this project, we performed a complete analysis of the shipment dataset, conducted extensive EDA, handled data preprocessing, and developed multiple classification models to predict whether a product will be delivered on time or delayed.
<br><br>
After comparing performance metrics such as Accuracy, Precision, Recall, F1-Score, and ROC-AUC, we identified the <b>Random Forest Classifier</b> as the best-performing model for this problem.
<br><br>
The Random Forest model achieved the highest overall accuracy while providing a strong balance between precision and recall. Most importantly, it delivered the highest <b>recall</b> for the late delivery class, which aligns with our key objective of minimizing false negatives and improving proactive shipment management.
<br><br>
This predictive model can help the company take early action to address potential delays, improving customer satisfaction and protecting the company’s reputation.
</font>

</div>



In [None]:
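import joblib

# Save the tuned pipeline of the best-scoring model
# (best_model would be whichever model the evaluation loop visited last)
joblib.dump(best_pipelines[best_model_name].best_estimator_, 'final_delivery_model.joblib')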


<h2 align="Center"><font color='#ff00ff'><b>Thank you!</b></font></h2>