# E-BOOK hands-on materials
# Building a Data Driven Infrastructure
## How Can Logistics Benefit From Data Science


# Product Shipment Tracking Use Case Data-set

This notebook focuses on **exploratory data analysis (EDA)** to prepare a given dataset for use with **machine learning (ML) models**.

A data scientist is provided with a data warehouse containing a static dataset of 10999 product shipment tracking observations.
<br> This data warehouse can be downloaded as the file Train.csv from [Kaggle](https://www.kaggle.com/prachi13/customer-analytics).

The data scientist’s goal is to put together a custom-made DSI-stack aimed at developing an ML-based model that forecasts whether ordered products are delivered on time or not.

# Step 1: Setting up a Colab Jupyter Notebook with scikit-learn
--------------------------------------------------------------------------------------------------------------
### Shows: “How to install and import the required data science Python packages”
--------------------------------------------------------------------------------------------------------------


In [None]:
# Install the required packages via Jupyter Notebook
# (xgboost and tqdm are included because later cells import them)
import sys
!{sys.executable} -m pip install numpy pandas matplotlib seaborn scikit-learn==1.2.2 xgboost tqdm



In [None]:
# Required Data Science Toolchain Imports

import warnings
# suppress warnings in the output
warnings.simplefilter(action='ignore')

# Data Science stuff
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
pal = sns.color_palette()

## Preprocessing & ML-model Assessment parameters
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix


# SKlearn ML-models that can be used to make numerical or binary predictions
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import Perceptron
from sklearn.ensemble import ExtraTreesClassifier, BaggingClassifier, VotingClassifier, StackingClassifier
from sklearn.semi_supervised import LabelSpreading, LabelPropagation
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import RidgeClassifierCV
from sklearn.svm import NuSVC, LinearSVC

# Step 2: Getting a basic Understanding of the dataset.
----------------------------------------------------------------------------------------------------
### Shows: “How to read a warehousing dataset into a DataFrame and how to explore it visually”
------------------------------------------------------------------------------------------------------


In [None]:
# Load the data
data = pd.read_csv('Train.csv')
data.sample(10)


In [None]:
# Drop/Remove the ID column (1st)
data = data.drop('ID', axis=1)

In [None]:
data.info()

Observation:
1. The original data contains 12 columns and 10999 rows; 11 columns remain after dropping *ID*.
2. The data type of each column is appropriate.
3. `info()` reports no missing values; the next cell verifies this explicitly.

In [None]:
# Verify the meta information as provided by .info()
# Determine whether any data is missing and count the features per dtype
print("Missing values:", data.isna().sum().sum())
print("Categorical features:", len(data.select_dtypes('object').columns))
print("Numerical features:", len(data.select_dtypes('number').columns))

## Numerical Features

### Statistical Summary

In [None]:
# determine the number of unique entries for each feature
data.nunique()

In [None]:
# set column names to lowercase
dats = data.copy()
dats.columns=dats.columns.str.lower()

# group column names based on type
# for numerical data
num = ['customer_care_calls', 'customer_rating', 'cost_of_the_product', 'prior_purchases',
       'discount_offered', 'weight_in_gms', 'reached.on.time_y.n']
data_num = dats[num]

# for categorical data
cat = ['warehouse_block', 'mode_of_shipment', 'product_importance', 'gender']
data_cat = dats[cat]

# Generate descriptive statistics on numerical data
data_num.describe().applymap('{:,.2f}'.format)

Observation:
1. Customers make 4 *customer_care_calls* on average.
2. The average *customer_rating* is 3, i.e. medium.
3. The *cost_of_the_product* range is 48 - 310 with an average of 210.
4. The maximum *prior_purchases* and *discount_offered* are 10 and 65, respectively. The average number of *prior_purchases* is 3, while the average *discount_offered* is 13.
5. The average *weight_in_gms* is 3634 with a maximum weight of 7846.
6. The columns *customer_care_calls, customer_rating, and prior_purchases* seem to be **symmetrically distributed** (the mean and median values are almost the same).
7. The columns *cost_of_the_product, discount_offered, and weight_in_gms* seem to have a **skewed distribution**.
8. *reached.on.time_y.n* is a boolean/binary column since its value is 0 or 1, so there is no need to assess its symmetry.
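
The symmetry judgement above rests on comparing the mean with the median. As a rough sketch, the same check can be made explicit with `DataFrame.skew()`. The toy frame below is hypothetical and only stands in for `data_num`; its column names and values are not taken from the shipment data.

```python
import pandas as pd

# Hypothetical toy data standing in for data_num (not the real shipment data)
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [1, 1, 1, 2, 2, 3, 40],
})

# Mean close to median and skewness near 0 suggest a symmetric distribution;
# a mean well above the median and a positive skewness suggest a right skew
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})
print(summary.round(2))
```

Applied to the notebook's data, `data_num.skew()` would give one number per column to back up the visual impression from `describe()`.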

### Correlation of Numerical Features

In [None]:
# create correlation matrix from numerical features
corr_matrix = data[data.select_dtypes('number').columns].corr()

# Create mask as to remove the upper half of the corr_matrix
mask = np.zeros_like(corr_matrix, dtype=bool)
mask[np.triu_indices_from(mask)]= True

In [None]:
# Display the correlation heatmap
f, ax = plt.subplots(figsize=(8, 11))

heatmap = sns.heatmap(corr_matrix,
                      mask = mask,
                      square = True,
                      linewidths = .5,
                      cmap = 'coolwarm',
                      cbar_kws = {'shrink': .6,
                                'ticks' : [-1, -.5, 0, 0.5, 1]},
                      vmin = -1,vmax = 1,
                      annot = True,
                      annot_kws = {"size": 14})

#add the column names as labels
ax.set_yticklabels(corr_matrix.columns, rotation = 0, fontsize = 10)
ax.set_xticklabels(corr_matrix.columns, rotation = 45, fontsize = 8)
sns.set_style({'xtick.bottom': True}, {'ytick.left': True})

Observation:
1. All correlations with the target *reached_on_time* are rather weak.
2. *discount_offered* has the highest positive correlation: +0.40
3. *cost_of_the_product* has the second highest correlation: +0.38
4. *weight_in_gms* has the highest negative correlation: -0.28
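
Reading individual cells off a heatmap is error-prone, so it can help to rank the features by their correlation with the target directly. The sketch below does this with `DataFrame.corrwith` on synthetic data; the columns `discount`, `noise` and `late` are hypothetical stand-ins, not the notebook's columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Hypothetical synthetic data: 'discount' drives the target, 'noise' does not
discount = rng.uniform(0, 65, n)
noise = rng.normal(size=n)
late = (discount + rng.normal(scale=10, size=n) > 30).astype(int)
toy = pd.DataFrame({"discount": discount, "noise": noise, "late": late})

# Pearson correlation of every feature with the target, sorted by magnitude
target_corr = toy.drop(columns="late").corrwith(toy["late"])
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```

Sorting by absolute value keeps strong negative correlations near the top while the printed sign still shows the direction.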

## Categorical Features

In [None]:
# Import required libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Set the size of the overall figure
plt.figure(figsize=(10, 8))

# Select the categorical features (those of type 'object') from the dataset
cat_features = data.select_dtypes('object').columns.values

# Loop over each categorical feature
for i, cat in enumerate(cat_features):

    # Add a subplot for each feature to the overall figure
    # The '2, 2' indicates that the subplots will be arranged in a 2x2 grid
    # 'i+1' is the index of the current subplot
    plt.subplot(2, 2, i+1)

    # Create a histogram for the current feature
    # 'shrink' reduces the size of the bars by a factor of 0.8 to leave space between them
    # 'color' assigns a color to the bars
    sns.histplot(data[cat], shrink=0.8, color=pal[i])

# Display the figure with all subplots
plt.show()


- **Warehouse:** Blocks *A, B, C, D* are balanced, while block *F* is predominant (each other block has about 1/2 as many observations).
- **Shipment:** *Flight* and *Road* have similar numbers of observations, while *Ship* is predominant (each of the others has about 1/4 as many observations).
- **Importance:** Most products have *low* or *medium* importance; *high* importance is a minority.
- **Genders:** Both classes are balanced.
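
The ratios quoted above can also be computed rather than estimated from the bars. A minimal sketch with `value_counts(normalize=True)` follows; the toy series is hypothetical, and in the notebook the same call would be applied to a column such as `data['Mode_of_Shipment']`.

```python
import pandas as pd

# Hypothetical toy column; the real values would come from data['Mode_of_Shipment']
shipment = pd.Series(["Ship"] * 4 + ["Flight"] + ["Road"], name="mode_of_shipment")

# Relative frequency of each category, largest first
shares = shipment.value_counts(normalize=True)
print(shares.round(3))
```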

## Numerical

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the size of the figure for the plots
plt.figure(figsize=(18, 10))

# Get the numeric features from the data, excluding 'Reached.on.Time_Y.N'
num_features = data.select_dtypes('number').drop('Reached.on.Time_Y.N', axis=1).columns.values

# Iterate over each numeric feature and plot a histogram
for i, num in enumerate(num_features):

    # Create a subplot for each histogram
    plt.subplot(2, 3, i+1)

    # Plot a histogram for the numeric feature
    # Use the 'bins' argument to specify the number of bins
    # This will make the histogram evenly spaced along the x-axis
    sns.histplot(data[num], bins=10, color=pal[i])

# Adjust the layout to ensure the subplots do not overlap
plt.tight_layout()

# Display the plots
plt.show()


- **Care Calls:** Slightly positively skewed, roughly normal distribution with mode at 4.
- **Customer Rating:** Uniform distribution.
- **Costs:** 2 peaks: the smaller around 150, the larger around 250.
- **Prior Purchases:** Positively skewed, roughly normal distribution with mode at 3.
- **Discount offered:** Separated into 2 roughly uniform ranges: 0 to 10 is predominant, followed by a small amount from 10 to 65.
- **Weight:** 3 zones: high counts from 1000 to 2000 and from 4000 to 6000, low counts from 2000 to 4000.

## Data Science Visualization of the whole dataset

In [None]:
# Import the required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the dataset
dataset = pd.read_csv('Train.csv')

# Clean up the column names: replace underscores with spaces and capitalize the first letter of each word.
# 'ID' stays first and the target is renamed 'Reached.On.Time Y.N'; the column order is unchanged.
new_cols = ['ID'] + [col.replace("_", " ").title() for col in dataset.columns[1:-1]] + ['Reached.On.Time Y.N']
dataset.columns = new_cols

# Create a subplot grid for visualization
fig, axes = plt.subplots(4, 2, figsize=(12, 14), facecolor='#F2F4F4')

# List of column names for countplot
columns = ["Warehouse Block", "Mode Of Shipment", "Customer Care Calls", "Customer Rating",
           "Prior Purchases", "Product Importance", "Gender", "Reached.On.Time Y.N"]

# List of palette colors for each countplot
palettes = ['CMRmap_r', ['#DC143C','#556b2f','#008b8b'], 'cubehelix', "rocket",
            'viridis', None, ['#800000','#191970'], 'tab20c_r']

# List of plot titles for each countplot
titles = ['Orders Handled By Each Warehouse Block', 'Number of Orders By Shipment Mode',
          'Number of Customer Care Calls Made by Customers', 'Customer Rating Received',
          'Number of Prior Purchases Made by Customers', 'Number of Orders Made by Product Importance',
          "Number of Orders Made by Customers' Gender", 'Number of Orders Based On Arrival Time']

# Generate countplots for each column
for i, ax in enumerate(axes.flatten()):
    if i < len(columns):
        column_counts = dataset[columns[i]].value_counts(ascending=False)
        bar_plot = sns.countplot(x=dataset[columns[i]], order=column_counts.index, ax=ax, palette=palettes[i])

        # Set plot title and labels
        ax.set_title(titles[i], fontsize=12)
        column_percentages = column_counts.values * 100 / column_counts.values.sum()
        labels = [f"{count} ({percentage:.2f}%)" for count, percentage in zip(column_counts, column_percentages)]

        # Add labels to each bar manually
        for j, p in enumerate(bar_plot.patches):
            height = p.get_height()
            ax.text(p.get_x()+p.get_width()/2., height + 0.1, labels[j], ha="center", fontsize = 8)

# Adjust the layout to ensure the subplots do not overlap
plt.tight_layout()

# Show the plot
plt.show()

## Target Analysis

In [None]:
# Ratio of delayed (1) and not delayed (0) orders
# (printed explicitly, because a notebook cell only displays its last expression)
print(data['Reached.on.Time_Y.N'].value_counts(normalize=True))
# Absolute number of orders per class
data['Reached.on.Time_Y.N'].value_counts()
#display(data['Reached.on.Time_Y.N'].sum())

In [None]:
df = data.copy()
display(df)

The target classes are slightly unbalanced.
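
With unbalanced classes, a purely random split can shift the class ratio between the training and test sets. One common option, which the cells in this notebook do not use, is the `stratify` argument of `train_test_split`. The sketch below uses hypothetical labels with a 60/40 ratio rather than the real target column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 % positives, 40 % negatives
y = np.array([1] * 60 + [0] * 40)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class ratio (almost) identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)
print(y_tr.mean(), y_te.mean())
```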

# Step 3: Preprocessing, feature analysis and  ML-model performance assessment
--------------------------------------------------------------------------------------------------------------------------------
### Shows: “How to perform data preprocessing to enable ML-learning & Feature analysis”
--------------------------------------------------------------------------------------------------------------------------------


### Preprocessing: One-HOT encoding + Scaling
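
As a small illustration of the two steps named in this heading, the sketch below one-hot encodes a categorical column and then standardizes all columns. The four toy rows are hypothetical; only the column names are borrowed from the dataset. Passing `columns=` and `dtype=int` to `pd.get_dummies` is one possible variant and not necessarily how the notebook calls it.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows using two of the dataset's column names
toy = pd.DataFrame({
    "Mode_of_Shipment": ["Ship", "Flight", "Road", "Ship"],
    "Weight_in_gms": [1000, 2000, 3000, 4000],
})

# One-hot encoding: one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=["Mode_of_Shipment"], dtype=int)

# Standardization: each column is rescaled to zero mean and unit variance
scaled = pd.DataFrame(
    StandardScaler().fit_transform(encoded), columns=encoded.columns
)
print(encoded)
print(scaled.round(2))
```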

In [None]:
from sklearn.preprocessing import StandardScaler
def preprocess_inputs(df):
    df = df.copy()

    # Drop/Remove the ID column (1st)
    # df = df.drop('ID', axis=1)

    # excluding categorical data
    #df= df.select_dtypes(exclude=['object'])

    # One-HOT encoding of the categorical data
    df = pd.concat([df, pd.get_dummies(df['Warehouse_block'])], axis=1).drop('Warehouse_block', axis=1)
    df = pd.concat([df, pd.get_dummies(df['Mode_of_Shipment'])], axis=1).drop('Mode_of_Shipment', axis=1)
    df = pd.concat([df, pd.get_dummies(df['Product_importance'])], axis=1).drop('Product_importance', axis=1)
    df = pd.concat([df, pd.get_dummies(df['Gender'], prefix='Sex')], axis=1).drop('Gender', axis=1)

    # Split dataset into INPUT: X  /  OUTPUT: target
    X = df.drop('Reached.on.Time_Y.N', axis=1)
    target = df['Reached.on.Time_Y.N']

    # Standardizing the features
    scaler = StandardScaler()
    scaled_features = pd.DataFrame(scaler.fit_transform(X))

    return scaled_features, target, X

In [None]:
# Show 10 samples of the one-hot encoded dataset
# hot is the one-hot encoded data (not yet scaled)
pdata, target, hot = preprocess_inputs(data)
display(hot.sample(10))

In [None]:
# Show 10 samples of the scaled, one-hot encoded dataset
# pdata is the one-hot encoded + scaled data
pdata, target, hot = preprocess_inputs(data)
display(pdata.sample(10))

## Feature analysis: Logistic Regression fitting

In [None]:
# pdata is One-HOT encoded + scaled data
pdata, target, hot = preprocess_inputs(data)

# Split the train data into train and test
X_train, X_test, y_train, y_test = \
train_test_split(pdata, target, shuffle=True, train_size=0.8, random_state=0)

In [None]:
# Model specific Imports
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Create a logistic regression model
logreg_model = LogisticRegression()

# Train the model
logreg_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = logreg_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Print feature coefficients
feature_importance = logreg_model.coef_[0]
#print("Feature Coefficients:", feature_importance)

labels = hot.columns.tolist()

# Assuming 'logreg_model' is your trained model
coefficients = logreg_model.coef_[0]

# Use 'labels' variable for features
feature_importance = pd.DataFrame({'Feature': labels, 'Importance': np.abs(coefficients)})
feature_importance = feature_importance.sort_values('Importance', ascending=True)

ax = feature_importance.plot(x='Feature', y='Importance', kind='barh', figsize=(10, 6))
ax.set_title('Logistic Regression Feature Importance Analysis')
plt.show()
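
The chart ranks features by the absolute value of their logistic regression coefficients, which is only comparable across features because the inputs were standardized. Taking the absolute value hides the direction of each effect, so a rough sketch that keeps the signed coefficient next to the ranking may be useful. The data and the feature names `strong`, `opposite` and `noise` below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: column 0 raises the odds, column 1 lowers them, column 2 is noise
X = rng.normal(size=(1000, 3))
logit = 3.0 * X[:, 0] - 1.0 * X[:, 1]
y = (logit + rng.logistic(size=1000) > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# The signed coefficient keeps the direction; its absolute value gives the ranking
coef_table = pd.DataFrame({
    "Feature": ["strong", "opposite", "noise"],
    "Coefficient": model.coef_[0],
})
coef_table["Importance"] = coef_table["Coefficient"].abs()
coef_table = coef_table.sort_values("Importance", ascending=False)
print(coef_table.round(2))
```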

## Feature analysis: Decision Tree fitting

In [None]:
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Create a decision tree model
tree_model = DecisionTreeClassifier(random_state=42)

# Train the model
tree_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_tree = tree_model.predict(X_test)

# Evaluate the model
accuracy_tree = accuracy_score(y_test, y_pred_tree)
#print(f"Decision Tree Accuracy: {accuracy_tree:.2f}")

# Plot feature importances
feature_importance_tree = tree_model.feature_importances_

# Assuming 'labels' variable for features
feature_importance_tree_df = \
pd.DataFrame({'Feature': labels, 'Importance': np.abs(feature_importance_tree)})

feature_importance_tree_df = feature_importance_tree_df.sort_values('Importance', ascending=True)

ax_tree = feature_importance_tree_df.plot(x='Feature', y='Importance', kind='barh', figsize=(8, 4))
ax_tree.set_title('Decision Tree Feature Importance Analysis')
plt.show()
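
`feature_importances_` is computed from impurity reductions on the training data. As a cross-check, and as a different technique from the one used above, scikit-learn's `permutation_importance` measures how much the test score drops when a single column is shuffled. The sketch below compares both on hypothetical synthetic data in which only the first column carries signal.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: only the first of three columns determines the label
X = rng.normal(size=(600, 3))
y = (X[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

# Impurity-based importances (normalized to sum to 1)
impurity_imp = tree.feature_importances_

# Permutation importances: mean drop in test accuracy over 10 shuffles per column
perm = permutation_importance(tree, X_te, y_te, n_repeats=10, random_state=0)
perm_imp = perm.importances_mean

print(impurity_imp.round(3), perm_imp.round(3))
```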

# Step 4: Selecting the best available supervised-learning ML-model
----------------------------------------------------------------------------------------------------
### Shows: “How to import, train, evaluate & select interpretable ML-models using Sklearn”
------------------------------------------------------------------------------------------------------


## Data Preprocessing

In [None]:
def preprocess_inputs(df):
    df = df.copy()

    # One-hot encoding of the categorical data
    df = pd.concat([df, pd.get_dummies(df['Warehouse_block'])], axis=1).drop('Warehouse_block', axis=1)
    df = pd.concat([df, pd.get_dummies(df['Mode_of_Shipment'])], axis=1).drop('Mode_of_Shipment', axis=1)
    df = pd.concat([df, pd.get_dummies(df['Product_importance'])], axis=1).drop('Product_importance', axis=1)
    df = pd.concat([df, pd.get_dummies(df['Gender'], prefix='Sex')], axis=1).drop('Gender', axis=1)

    # Split X and y
    X = df.drop('Reached.on.Time_Y.N', axis=1)
    y = df['Reached.on.Time_Y.N']

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, train_size=0.8, random_state=0)

    # Scale X
    scaler = StandardScaler()
    X_train = pd.DataFrame(scaler.fit_transform(X_train), index=X_train.index, columns=X_train.columns)
    X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)

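
    return X_train, X_test, y_train, y_test

# Apply the preprocessing so that the cells below train on these splits
X_train, X_test, y_train, y_test = preprocess_inputs(data)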
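
This version of `preprocess_inputs` splits first and fits the scaler on the training rows only, so no statistics from the test set leak into training. The effect is sketched below on a hypothetical random feature matrix: the scaled training columns have zero mean by construction, while the test columns are only approximately centered.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 2))   # hypothetical feature matrix

X_tr, X_te = train_test_split(X, train_size=0.8, random_state=0)

scaler = StandardScaler().fit(X_tr)   # mean and std come from the training rows only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)       # test rows reuse the training mean and std

print(X_tr_s.mean(axis=0).round(3))   # zero by construction
print(X_te_s.mean(axis=0).round(3))   # near zero, but generally not exactly zero
```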

## Importing & Training Multiple models

In [None]:
# Available predicting ML-models from Sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import Perceptron
from sklearn.ensemble import ExtraTreesClassifier, BaggingClassifier, VotingClassifier, StackingClassifier
from sklearn.semi_supervised import LabelSpreading, LabelPropagation
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import RidgeClassifierCV
from sklearn.svm import NuSVC, LinearSVC

In [None]:
# SKlearn ML-models
models = {
    "Logistic": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(),
    "SVC": SVC(probability=True),
    "GNB": GaussianNB(),
    "CART": DecisionTreeClassifier(),
    "LDA": LinearDiscriminantAnalysis(),
    "HGB": HistGradientBoostingClassifier(),
    "Ada": AdaBoostClassifier(n_estimators=100, random_state=0),
    "XGB": XGBClassifier(),
    ##"SGD": SGDClassifier(loss='squared_hinge'),
    "GBC": GradientBoostingClassifier(),
    "QDA": QuadraticDiscriminantAnalysis(),
    ##"RC": RidgeClassifier(),
    ##"PAC": PassiveAggressiveClassifier(),
    ##"P": Perceptron(),
    "ETC": ExtraTreesClassifier(),
    "BC": BaggingClassifier(),
    "LS": LabelSpreading(),
    "LP": LabelPropagation(),
    ##"GPC": GaussianProcessClassifier(), # takes a long time to train
    ##"RCCV": RidgeClassifierCV(),
    "NuSVC": NuSVC(probability=True),
    ##"LSVC": LinearSVC()
}


# Initialize progress bar
from tqdm import tqdm
import time

pbar = tqdm(total=len(models))

# Train each model and update progress bar
for name, model in models.items():
    start_time = time.time()  # Start time
    print(f"Training model: {name}")
    model.fit(X_train, y_train)
    end_time = time.time()  # End time
    duration = end_time - start_time  # Calculate duration
    # Print duration as a whole number
    print(f"Finished training model: {name} in {int(duration)} seconds")
    pbar.update(1)

# Close progress bar
pbar.close()

In [None]:
from sklearn.metrics import classification_report

# Assumes the trained models and X_test, y_test are defined

for name, model in models.items():
    y_pred = model.predict(X_test)
    print("-----------------------------")
    print(f"Classification report for {name}:")
    print(classification_report(y_test, y_pred))
    print("-----------------------------")
    print("-----------------------------")


In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Compute AUC for each model and store in a dictionary
auc_dict = {}
for name, model in models.items():
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    auc_dict[name] = (roc_auc, fpr, tpr)

# Sort the models by AUC from high to low
sorted_models = sorted(auc_dict.items(), key=lambda x: x[1][0], reverse=True)

# Plot size
plt.figure(figsize=(10, 10))

# Plot ROC curve for each model, in sorted order
for name, (roc_auc, fpr, tpr) in sorted_models:
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')

# Diagonal line
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')

# Set plot labels and legend
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")

# Show the plot
plt.show()
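
The AUC in the legend is the area under the curve returned by `roc_curve`. As a minimal check on hypothetical labels and scores, the sketch below shows that `auc(fpr, tpr)` and the shortcut `roc_auc_score` agree.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Area under the ROC curve, computed in two equivalent ways
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)
print(area, roc_auc_score(y_true, y_score))
```

Three of the four positive/negative pairs are ranked correctly here, which is why both values equal 0.75.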


In [None]:
model_list = list(models.values())
print(model_list[1])

In [None]:
y_pred = models['RF'].predict(X_test)
y_pred-y_test

##  Results

In [None]:
results = []
scores = {}

for name, model in models.items():

    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred) * 100
    f1 = f1_score(y_test, y_pred) * 100
    recall = recall_score(y_test, y_pred) * 100
    precision = precision_score(y_test, y_pred) * 100

    print(name + "    Accuracy: {:.2f} %".format(accuracy))
    print("            F1 Score: {:.2f} %".format(f1))
    print("              Recall: {:.2f} %".format(recall))
    print("           Precision: {:.2f} %".format(precision))
    print("-----------------------------")

    results.append(confusion_matrix(y_test, y_pred))
    # Calculate the score for this model
    scores[name] = accuracy + f1 + recall + precision

# Find the model with the highest score
best_model = max(scores, key=scores.get)
print(f"The best model is {best_model} with a score of {scores[best_model]}")
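
The selection score above is the plain sum of four metrics expressed in percent, so each metric carries equal weight and the maximum is 400. A small worked example on hypothetical predictions (3 true positives, 1 false positive, 1 false negative, 1 true negative) shows how the sum comes together.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels and predictions
y_true = [0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 1, 1]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # 4 of 6 correct
    "f1": f1_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),        # 3 of 4 positives found
    "precision": precision_score(y_true, y_pred),  # 3 of 4 predicted positives correct
}

# Same aggregation as in the loop above: sum of the metrics in percent
composite = 100 * sum(metrics.values())
print(metrics, round(composite, 2))
```

Because F1 is itself derived from precision and recall, this sum effectively weights those two more heavily than accuracy, which is worth keeping in mind when interpreting the "best" model.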

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# One confusion-matrix heatmap per model, arranged in a single 5x4 grid
fig, axs = plt.subplots(5, 4, figsize=(10, 16))
axs = axs.flatten()

for i, name in enumerate(list(models.keys())):
    ax = axs[i]
    sns.heatmap(results[i], annot=True, square=True, cbar=False,
                xticklabels=['No delay', 'Delay'], yticklabels=['No delay', 'Delay'], cmap='Reds', fmt='10.0f', ax=ax)
    ax.set_title(name)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

# Hide the grid cells that have no model assigned
for ax in axs[len(models):]:
    ax.axis('off')

# Adjust the spacing between the subplots
plt.subplots_adjust(wspace=0.4, hspace=0.0)

# Adjust the layout to en
# plt.tight_layout()

plt.show()

The scores are low. High scores are not always achievable when the features correlate only weakly with the target, which is what the heatmap showed.

This dataset also seems to be synthetic, since some classes are perfectly balanced. In any case, this was a good exercise.

In [None]:
# Define base models
base_models = [
    ('logistic_regression', LogisticRegression()),
    ('knn', KNeighborsClassifier()),
    ('random_forest', RandomForestClassifier()),
    ('svc', SVC(probability=True)),
    ('gnb', GaussianNB()),
    ('cart', DecisionTreeClassifier()),
    ('hgb', HistGradientBoostingClassifier()),
    ('ada', AdaBoostClassifier(n_estimators=100, random_state=0)),
    ('mlpc', MLPClassifier(activation='tanh', solver='lbfgs', alpha=0.001, hidden_layer_sizes=(8, 2), random_state=1,max_iter=20000, early_stopping=True)),
]

# Define ensemble models
models = {
    "VC": VotingClassifier(estimators=base_models, voting='soft'),  # 'soft' voting returns the class with the highest sum of predicted probabilities
    "SC": StackingClassifier(estimators=base_models, final_estimator=LogisticRegression()),  # final_estimator is used to combine the base models
}


# Initialize progress bar
from tqdm import tqdm
import time

pbar = tqdm(total=len(models))

# Train each model and update progress bar
for name, model in models.items():
    start_time = time.time()  # Start time
    print(f"Training model: {name}")
    model.fit(X_train, y_train)
    end_time = time.time()  # End time
    duration = end_time - start_time  # Calculate duration
    print(f"Finished training model: {name} in {int(duration)} seconds")  # Print duration as a whole number
    pbar.update(1)

# Close progress bar
pbar.close()
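
Soft voting, as used for `"VC"`, averages the class probabilities of the base models and picks the class with the highest average. The sketch below reproduces that by hand for a small hypothetical ensemble of two models on synthetic data and compares the result with `VotingClassifier` itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Hypothetical synthetic binary classification data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

vc = VotingClassifier(
    estimators=[("lr", LogisticRegression()), ("gnb", GaussianNB())],
    voting="soft",
).fit(X, y)

# Average the class probabilities of the fitted base models by hand
manual_proba = np.mean([est.predict_proba(X) for est in vc.estimators_], axis=0)
manual_pred = vc.classes_[manual_proba.argmax(axis=1)]

print(np.allclose(manual_proba, vc.predict_proba(X)))
print((manual_pred == vc.predict(X)).all())
```

`StackingClassifier` differs in that it does not average: it trains the `final_estimator` on the base models' predictions.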

In [None]:
results = []
scores = {}

for name, model in models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred) * 100
    f1 = f1_score(y_test, y_pred) * 100
    recall = recall_score(y_test, y_pred) * 100
    precision = precision_score(y_test, y_pred) * 100
    print(name + "    Accuracy: {:.2f} %".format(accuracy))
    print("            F1 Score: {:.2f} %".format(f1))
    print("              Recall: {:.2f} %".format(recall))
    print("           Precision: {:.2f} %".format(precision))
    print("-----------------------------")
    results.append(confusion_matrix(y_test, y_pred))
    # Calculate the score for this model
    scores[name] = accuracy + f1 + recall + precision

# Find the model with the highest score
best_model = max(scores, key=scores.get)

print(f"The best model is {best_model} with a score of {scores[best_model]}")

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Assuming X_test and y_test are your testing data and labels

# Plot size
plt.figure(figsize=(10, 10))

for name, model in models.items():
    # Predict probabilities for the positive class
    y_pred_proba = model.predict_proba(X_test)[:, 1]

    # Compute ROC curve and ROC area
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)

    # Plot the ROC curve
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')

# Diagonal line
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')

# Set plot labels and legend
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")

# Show the plot
plt.show()