
In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [2]:
path = "/content/drive/MyDrive/Colab Notebooks/Shipping_data.csv"
df = pd.read_csv(path)
df.head()

Unnamed: 0,ID,Warehouse_block,Mode_of_Shipment,Customer_care_calls,Customer_rating,Cost_of_the_Product,Prior_purchases,Product_importance,Gender,Discount_offered,Weight_in_gms,Reached.on.Time_Y.N
0,1,D,Flight,4,2,177,3,low,F,44,1233,1
1,2,F,Flight,4,5,216,2,low,M,59,3088,1
2,3,A,Flight,2,2,183,4,low,M,48,3374,1
3,4,B,Flight,3,3,176,4,medium,M,10,1177,1
4,5,C,Flight,2,2,184,3,medium,F,46,2484,1


In [3]:
# Function to load and preprocess data
def load_and_preprocess_data(path):
    """
    Load the CSV at `path`, label-encode the categorical columns,
    and standardize the numerical columns.
    """
    # Read the file here so the function does not depend on a global DataFrame
    df = pd.read_csv(path)

    # Label encoding for categorical variables (a fresh encoder per column)
    categorical_vars = ['Warehouse_block', 'Mode_of_Shipment', 'Product_importance', 'Gender']
    for var in categorical_vars:
        df[var] = LabelEncoder().fit_transform(df[var])

    # Standardizing numerical variables
    sc = StandardScaler()
    numerical_vars = ['Customer_care_calls', 'Customer_rating', 'Cost_of_the_Product',
                      'Prior_purchases', 'Discount_offered', 'Weight_in_gms']
    df[numerical_vars] = sc.fit_transform(df[numerical_vars])

    return df

# Function to split data into training and testing sets
def split_data(df, target_var, test_size=0.2, random_state=42):
    """
    Split the data into training and testing sets.
    """
    X = df.drop(target_var, axis=1)
    y = df[target_var]
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

# Function to perform model training and evaluation
def train_and_evaluate_model(model, X_train, y_train, X_test, y_test, params=None, cv=10):
    """
    Train the model using GridSearchCV if parameters are provided and evaluate it.
    """
    if params:
        model = GridSearchCV(model, params, cv=cv)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Output model performance
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

    if params:
        print('Best hyperparameters:', model.best_params_)


In [4]:
df2 = load_and_preprocess_data(path)
df2.head()

Unnamed: 0,ID,Warehouse_block,Mode_of_Shipment,Customer_care_calls,Customer_rating,Cost_of_the_Product,Prior_purchases,Product_importance,Gender,Discount_offered,Weight_in_gms,Reached.on.Time_Y.N
0,1,3,0,-0.047711,-0.700755,-0.690722,-0.372735,1,0,1.889983,-1.46824,1
1,2,4,0,-0.047711,1.421578,0.120746,-1.029424,1,1,2.815636,-0.333893,1
2,3,0,0,-1.799887,-0.700755,-0.565881,0.283954,1,1,2.136824,-0.159002,1
3,4,1,0,-0.923799,0.006689,-0.711529,0.283954,2,1,-0.208162,-1.502484,1
4,5,2,0,-1.799887,-0.700755,-0.545074,-0.372735,2,0,2.013404,-0.703244,1


In [5]:
# Main execution

# Use the preprocessed frame returned by load_and_preprocess_data
X_train, X_test, y_train, y_test = split_data(df2, 'Reached.on.Time_Y.N')

# Decision Tree
dt_params = {'criterion': ['gini', 'entropy'], 'max_depth': range(1, 10)}
train_and_evaluate_model(DecisionTreeClassifier(), X_train, y_train, X_test, y_test, dt_params)

# Random Forest
rf_params = {'n_estimators': [10, 50, 100, 200], 'max_depth': range(1, 10)}
train_and_evaluate_model(RandomForestClassifier(), X_train, y_train, X_test, y_test, rf_params)

# Logistic Regression
train_and_evaluate_model(LogisticRegression(), X_train, y_train, X_test, y_test)

# SVM
train_and_evaluate_model(SVC(), X_train, y_train, X_test, y_test)

Accuracy: 0.6945454545454546
              precision    recall  f1-score   support

           0       0.57      1.00      0.73       895
           1       1.00      0.49      0.65      1305

    accuracy                           0.69      2200
   macro avg       0.79      0.74      0.69      2200
weighted avg       0.83      0.69      0.68      2200

[[895   0]
 [672 633]]
Best hyperparameters: {'criterion': 'gini', 'max_depth': 1}
Accuracy: 0.6904545454545454
              precision    recall  f1-score   support

           0       0.57      0.98      0.72       895
           1       0.98      0.49      0.65      1305

    accuracy                           0.69      2200
   macro avg       0.77      0.74      0.69      2200
weighted avg       0.81      0.69      0.68      2200

[[879  16]
 [665 640]]
Best hyperparameters: {'max_depth': 3, 'n_estimators': 200}
Accuracy: 0.649090909090909
              precision    recall  f1-score   support

           0       0.56      0.65      

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy: 0.6863636363636364
              precision    recall  f1-score   support

           0       0.57      0.91      0.70       895
           1       0.90      0.53      0.67      1305

    accuracy                           0.69      2200
   macro avg       0.73      0.72      0.69      2200
weighted avg       0.77      0.69      0.68      2200

[[816  79]
 [611 694]]




Four machine learning models are used: Decision Tree, Random Forest, Logistic Regression, and Support Vector Machine (SVM). A model is trained directly when no hyperparameter grid is supplied; when a grid is supplied, it is tuned with GridSearchCV using 10-fold cross-validation. Each model is then evaluated on a held-out 20% test split (2,200 shipments), and its accuracy, classification report, and confusion matrix are printed. Together these outputs show how well each model predicts on-time delivery of shipments.
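
The "train directly, or tune when a grid is given" pattern described above can be sketched on its own. This is a minimal illustration, not the notebook's exact code: the data is a synthetic stand-in generated with make_classification, the helper name fit_model is hypothetical, and the grid and cv=5 are reduced to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the shipping data (assumption: not the real dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def fit_model(model, X_train, y_train, params=None, cv=5):
    """Hypothetical helper: wrap the model in GridSearchCV only when a grid is given."""
    if params:
        model = GridSearchCV(model, params, cv=cv)
    return model.fit(X_train, y_train)

# No grid: the estimator is fitted as-is.
plain = fit_model(DecisionTreeClassifier(random_state=0), X_train, y_train)

# With a grid: every combination is cross-validated, then the best one is refitted.
grid = {"criterion": ["gini", "entropy"], "max_depth": [1, 2, 3]}
tuned = fit_model(DecisionTreeClassifier(random_state=0), X_train, y_train, params=grid)

print(type(plain).__name__, type(tuned).__name__)
print("Best hyperparameters:", tuned.best_params_)
print("Held-out accuracy:", tuned.score(X_test, y_test))
```

Because GridSearchCV refits the best configuration on the full training set by default, the tuned object can be used for prediction exactly like the plain estimator.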
Model Training and Evaluation:
Four models are trained and evaluated:

Decision Tree: A single tree that splits the data on feature thresholds; criterion and max_depth are tuned with GridSearchCV.
Random Forest: An ensemble of decision trees whose combined predictions usually improve accuracy and reduce overfitting relative to a single tree; n_estimators and max_depth are tuned with GridSearchCV.
Logistic Regression: A linear model for binary classification, trained here with scikit-learn's default settings.
SVM (Support Vector Machine): A margin-based classifier, trained here with the default SVC settings.
Each model is evaluated using accuracy, precision, recall, f1-score, and a confusion matrix:

Accuracy: The fraction of all predictions that are correct.
Precision: The ratio of correctly predicted positive observations to all predicted positives.
Recall (Sensitivity): The ratio of correctly predicted positive observations to all observations that are actually positive.
F1-Score: The harmonic mean of Precision and Recall.
Confusion Matrix: A table of counts of actual versus predicted classes; in the output above, rows are actual classes and columns are predicted classes.
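
As a worked check of these definitions, the metrics can be recomputed by hand from the Decision Tree confusion matrix printed in the output, [[895, 0], [672, 633]]. The sketch below assumes class 1 is treated as the positive class and that rows are actual classes and columns are predicted classes, which is scikit-learn's convention.

```python
# Counts from the Decision Tree confusion matrix [[895, 0], [672, 633]],
# read with class 1 as the positive class (rows = actual, columns = predicted).
tn, fp = 895, 0
fn, tp = 672, 633

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # correct positives / predicted positives
recall = tp / (tp + fn)                      # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.69 precision=1.00 recall=0.49 f1=0.65
```

These values match the class-1 row of the Decision Tree classification report.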
Results:
The code prints these metrics for each model on the 2,200-row test set. The four models land within about five percentage points of each other in accuracy.

Decision Tree and Random Forest perform similarly, with accuracies of about 69.5% and 69.0%.
Logistic Regression is lower, at about 64.9%; its run also triggered a convergence warning (iteration limit reached), and its classification report is cut off in the output above.
SVM is close to the tree-based models, at about 68.6%.
The confusion matrices show where each model's errors fall. For the Decision Tree, for example, all 895 class-0 shipments are classified correctly, but 672 of the 1,305 class-1 shipments are predicted as class 0, which is why class-1 recall is only 0.49 despite a precision of 1.00.
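
The convergence warning in the Logistic Regression output suggests either raising max_iter or scaling the data. One plausible contributor (an assumption, not verified here) is that the ID column is not in the list of standardized columns and so reaches the solver unscaled. A minimal sketch of the suggested remedy, using a synthetic stand-in with one large-valued, ID-like column rather than the real shipping data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in (assumption): an ID-like column with large values
# next to a small-scale feature that actually carries the signal.
id_like = np.arange(n, dtype=float) * 20.0
signal = rng.normal(size=n)
X = np.column_stack([id_like, signal])
y = (signal + 0.3 * rng.normal(size=n) > 0).astype(int)

# Scaling inside the pipeline standardizes every column before the solver sees it;
# max_iter is also raised above the default as the warning recommends.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

print("Training accuracy:", clf.score(X, y))
print("Iterations used:", clf.named_steps["logisticregression"].n_iter_)
```

Putting the scaler in a pipeline also means it is fitted only on the data passed to fit, so the same object can be handed to the notebook's train_and_evaluate_model in place of a bare LogisticRegression().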