## Part 3: Model Comparison

In this section, the three models from the previous stage (Logistic Regression, KNN, and ANN) are fit on the training split and evaluated on the held-out test data.  
The goal is to compare the models by accuracy, then select and save the best one.

In [1]:
# Data handling and model persistence
import pandas as pd
import joblib
import pickle

# Preprocessing, metrics, and the three classifiers compared in this part
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

In [2]:
with open("data_splits.pkl", "rb") as f:
    X_train, X_test, y_train, y_test = pickle.load(f)

print("Data loaded successfully from Part 2!")

Data loaded successfully from Part 2!


In [3]:
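# Encode every text column as integer codes, keeping one encoder per column.
label_encoders = {}
for column in X_train.columns:
    if X_train[column].dtype == 'object':
        le = LabelEncoder()
        # Fit on the training split only, then reuse the same mapping for the test split.
        X_train[column] = le.fit_transform(X_train[column].astype(str))
        # Note: transform raises ValueError if the test split holds a category unseen in training.
        X_test[column] = le.transform(X_test[column].astype(str))
        label_encoders[column] = le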

print("Categorical data converted to numeric successfully.")

Categorical data converted to numeric successfully.
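
The encoding step above relies on every test category also appearing in the training split. As a minimal sketch of one alternative, assuming scikit-learn 0.24 or newer, `OrdinalEncoder` can map unseen categories to a sentinel value instead of raising. The toy frames and the `city` column are hypothetical and not part of this project's data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; "city" is an illustrative column name only.
toy_train = pd.DataFrame({"city": ["Urban", "Rural", "Urban"]})
toy_test = pd.DataFrame({"city": ["Rural", "Suburban"]})  # "Suburban" never appears in training

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(toy_train)
test_codes = encoder.transform(toy_test)

print(test_codes.ravel())
```

Whether a sentinel value is appropriate depends on the model; this sketch only shows how to avoid the hard failure.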


In [4]:
print("Features and target variable are already separated (from Part 2).")
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

Features and target variable are already separated (from Part 2).
X_train shape: (399, 11)
y_train shape: (399,)


In [5]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

y_pred_log = log_reg.predict(X_test)

print("Logistic Regression Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))

Logistic Regression Performance:
Accuracy: 0.8602150537634409
              precision    recall  f1-score   support

           N       0.94      0.59      0.72        29
           Y       0.84      0.98      0.91        64

    accuracy                           0.86        93
   macro avg       0.89      0.79      0.81        93
weighted avg       0.87      0.86      0.85        93
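
Accuracy alone hides the gap between the two classes in the report above. The sketch below rebuilds the confusion matrix from that report and computes balanced accuracy. The per-class counts (17 of 29 for N, 63 of 64 for Y) are inferred from the rounded recalls and the overall accuracy, so treat them as a reconstruction rather than stored notebook output.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Reconstructed labels (an inference from the printed report, not the real y_test):
# 29 true "N" samples with 17 predicted correctly, 64 true "Y" samples with 63 predicted correctly.
y_true_demo = ["N"] * 29 + ["Y"] * 64
y_pred_demo = ["N"] * 17 + ["Y"] * 12 + ["Y"] * 63 + ["N"] * 1

# Rows are true classes, columns are predicted classes, in the order N, Y.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["N", "Y"])
acc = accuracy_score(y_true_demo, y_pred_demo)
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(f"Accuracy: {acc:.4f}")
print(f"Balanced accuracy: {bal_acc:.4f}")
```

Balanced accuracy is the mean of the per-class recalls, so it matches the macro-average recall in the report and is lower than plain accuracy when the minority class is predicted poorly.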



In [6]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred_knn = knn.predict(X_test)

print("KNN Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))


KNN Performance:
Accuracy: 0.8387096774193549
              precision    recall  f1-score   support

           N       0.82      0.62      0.71        29
           Y       0.85      0.94      0.89        64

    accuracy                           0.84        93
   macro avg       0.83      0.78      0.80        93
weighted avg       0.84      0.84      0.83        93
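
KNN classifies by distance, so features on large numeric scales can dominate the neighbor search. The notebook fits KNN on unscaled features; the sketch below shows one common alternative, a pipeline that standardizes features before KNN. It runs on synthetic data from `make_classification` (an assumption for illustration), so its score says nothing about this project's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 11 features; not the project's data.
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)
X_demo[:, 0] *= 1000  # exaggerate one feature's scale to mimic mixed units

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The scaler is fit on the training split only; the pipeline reuses it at predict time.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(Xd_train, yd_train)

demo_accuracy = scaled_knn.score(Xd_test, yd_test)
print(f"Scaled KNN accuracy on synthetic data: {demo_accuracy:.3f}")
```

Wrapping the scaler and the classifier in one pipeline keeps test data out of the scaling statistics and lets the pair be saved as a single object.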



In [7]:
ann = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)
ann.fit(X_train, y_train)

y_pred_ann = ann.predict(X_test)

print("ANN Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_ann))
print(classification_report(y_test, y_pred_ann))

ANN Performance:
Accuracy: 0.7634408602150538
              precision    recall  f1-score   support

           N       0.62      0.62      0.62        29
           Y       0.83      0.83      0.83        64

    accuracy                           0.76        93
   macro avg       0.72      0.72      0.72        93
weighted avg       0.76      0.76      0.76        93





In [8]:
results = {
    "Logistic Regression": accuracy_score(y_test, y_pred_log),
    "KNN": accuracy_score(y_test, y_pred_knn),
    "ANN": accuracy_score(y_test, y_pred_ann)
}

results_df = pd.DataFrame.from_dict(results, orient='index', columns=['Accuracy'])
print(results_df)

                     Accuracy
Logistic Regression  0.860215
KNN                  0.838710
ANN                  0.763441
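
With only 93 test samples, small accuracy differences correspond to very few predictions. The sketch below converts the accuracies in the table above into counts of correct predictions and adds a rough 95% margin using the normal approximation to the binomial. That approximation is an assumption chosen for simplicity, not something the notebook computes.

```python
import math

n_test = 93  # test-set size taken from the classification reports
accuracies = {"Logistic Regression": 0.860215, "KNN": 0.838710, "ANN": 0.763441}

summary = {}
for name, acc in accuracies.items():
    correct = round(acc * n_test)                   # number of correct test predictions
    std_err = math.sqrt(acc * (1 - acc) / n_test)   # normal-approximation standard error
    summary[name] = (correct, std_err)
    print(f"{name}: {correct}/{n_test} correct, accuracy {acc:.3f} +/- {1.96 * std_err:.3f}")
```

Logistic Regression and KNN differ by two test samples (80 versus 78 correct), which is well inside the approximate margin, so the ranking between those two should be read with caution.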


In [9]:
# Pick the model with the highest test accuracy (ties resolve to the first entry).
best_model_name = max(results, key=results.get)
print(f"Best model based on accuracy: {best_model_name}")

# Map each result name to its fitted estimator.
models = {"Logistic Regression": log_reg, "KNN": knn, "ANN": ann}
final_model = models[best_model_name]

joblib.dump(final_model, "final_model.pkl")
print("Final model saved as final_model.pkl")

Best model based on accuracy: Logistic Regression
Final model saved as final_model.pkl


## Conclusion of Part 3

Of the three models compared, Logistic Regression achieved the highest accuracy on the 93-sample test set (0.860), ahead of KNN (0.839) and ANN (0.763).  
It was therefore selected as the final model and saved to final_model.pkl for use in the subsequent stages.
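
For the later stages, the saved file can be read back with `joblib.load`. The sketch below checks the round trip on a stand-in model trained on synthetic data and written to a temporary directory, so it does not touch the real final_model.pkl. The data and variable names are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model on synthetic data; not the project's fitted model.
X_demo, y_demo = make_classification(n_samples=100, n_features=11, random_state=0)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "final_model.pkl")
    joblib.dump(demo_model, model_path)      # same call the notebook uses to save
    restored_model = joblib.load(model_path)  # how a later stage could reload it

# The restored estimator should reproduce the original predictions exactly.
same_predictions = bool((restored_model.predict(X_demo) == demo_model.predict(X_demo)).all())
print("Round trip preserved predictions:", same_predictions)
```

New data passed to the reloaded model would need the same categorical encoding used here, so a later stage would likely also need the fitted `label_encoders` saved alongside the model.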
