
In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [14]:
import pandas as pd
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import roc_auc_score
import plotly.graph_objects as go
from sklearn.metrics import roc_curve
from sklearn.metrics import matthews_corrcoef, balanced_accuracy_score, cohen_kappa_score

# Accuracy, Precision and Recall

### Evaluation Metrics on Breast Cancer Data [Important Metric: Recall]

In [None]:
# Load the breast cancer dataset
data = load_breast_cancer()

# Get the input features and target variable
X = data.data
y = data.target

# Scale the input features for better performance
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Convert the input features into a DataFrame
df_breast_cancer = pd.DataFrame(X, columns=data.feature_names)

# Add the target variable as a new column
df_breast_cancer['target'] = y

# Display the DataFrame
df_breast_cancer

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,2.255747,...,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015,0
1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,-0.868652,...,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.243890,0.281190,0
2,1.579888,0.456187,1.566503,1.558884,0.942210,1.052926,1.363478,2.037231,0.939685,-0.398008,...,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955000,1.152255,0.201391,0
3,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,4.910919,...,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.935010,0
4,1.750297,-1.151816,1.776573,1.826229,0.280372,0.539340,1.371011,1.428493,-0.009560,-0.562450,...,-1.466770,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.397100,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,2.110995,0.721473,2.060786,2.343856,1.041842,0.219060,1.947285,2.320965,-0.312589,-0.931027,...,0.117700,1.752563,2.015301,0.378365,-0.273318,0.664512,1.629151,-1.360158,-0.709091,0
565,1.704854,2.085134,1.615931,1.723842,0.102458,-0.017833,0.693043,1.263669,-0.217664,-1.058611,...,2.047399,1.421940,1.494959,-0.691230,-0.394820,0.236573,0.733827,-0.531855,-0.973978,0
566,0.702284,2.045574,0.672676,0.577953,-0.840484,-0.038680,0.046588,0.105777,-0.809117,-0.895587,...,1.374854,0.579001,0.427906,-0.809587,0.350735,0.326767,0.414069,-1.104549,-0.318409,0
567,1.838341,2.336457,1.982524,1.735218,1.525767,3.272144,3.296944,2.658866,2.137194,1.043695,...,2.237926,2.303601,1.653171,1.430427,3.904848,3.197605,2.289985,1.919083,2.219635,0


In [None]:
def print_metrics(model_name, y_test, y_pred):
    print("Model Used:", model_name)
    print("Accuracy Score:", accuracy_score(y_test, y_pred))
    print("Precision Score:", precision_score(y_test, y_pred))
    print("Recall Score:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
    print()

def run_model(model, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_test, y_pred

In [None]:
y_test, y_pred = run_model(LogisticRegression(), X, y)
print_metrics("Logistic Regression", y_test, y_pred)

Model Used: Logistic Regression
Accuracy Score: 0.9736842105263158
Precision Score: 0.9722222222222222
Recall Score: 0.9859154929577465
F1 Score: 0.979020979020979



In [None]:
y_test, y_pred = run_model(DecisionTreeClassifier(), X, y)
print_metrics("Decision Tree", y_test, y_pred)

Model Used: Decision Tree
Accuracy Score: 0.9385964912280702
Precision Score: 0.9444444444444444
Recall Score: 0.9577464788732394
F1 Score: 0.951048951048951



In [None]:
y_test, y_pred = run_model(SVC(), X, y)
print_metrics("SVM", y_test, y_pred)

Model Used: SVM
Accuracy Score: 0.9736842105263158
Precision Score: 0.9722222222222222
Recall Score: 0.9859154929577465
F1 Score: 0.979020979020979



In [None]:
y_test, y_pred = run_model(RandomForestClassifier(), X, y)
print_metrics("Random Forest", y_test, y_pred)

Model Used: Random Forest
Accuracy Score: 0.9649122807017544
Precision Score: 0.958904109589041
Recall Score: 0.9859154929577465
F1 Score: 0.9722222222222222



In [None]:
y_test, y_pred = run_model(KNeighborsClassifier(), X, y)
print_metrics("KNN", y_test, y_pred)

Model Used: KNN
Accuracy Score: 0.9473684210526315
Precision Score: 0.9577464788732394
Recall Score: 0.9577464788732394
F1 Score: 0.9577464788732394



Based on the results, Logistic Regression, SVM, and Random Forest share the highest Recall (0.9859), and Logistic Regression and SVM also have the highest F1 Score (0.9790). Given the importance of Recall in the breast cancer classification problem, Logistic Regression and SVM are the top candidates for further consideration.
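One detail is worth checking before relying on the recall values above: `recall_score` treats label 1 as the positive class by default, while `load_breast_cancer` encodes malignant as 0 and benign as 1. The sketch below is not part of the original notebook. It shows one way to report recall for the malignant class by passing `pos_label`. As an assumption of this sketch, the scaler is fitted inside a pipeline on the training split only, so its numbers may differ slightly from the ones printed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Look up the integer label for "malignant" instead of hard-coding it
malignant_label = list(data.target_names).index("malignant")

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Assumption of this sketch: scaling is fitted on the training split only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Default behaviour: label 1 (benign) is treated as the positive class
recall_benign = recall_score(y_test, y_pred)
# Recall for the class we care most about missing: malignant tumors
recall_malignant = recall_score(y_test, y_pred, pos_label=malignant_label)

print("Label used for malignant:", malignant_label)
print("Recall (benign as positive):", recall_benign)
print("Recall (malignant as positive):", recall_malignant)
```

If missed malignant cases are the costly error, the second value is probably the one to compare across models.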

### Evaluation Metrics on SPAM Classification Data [Important Metric: Precision]

In [None]:
# Load the dataset
data = pd.read_csv("gdrive/My Drive/datasets/SMSSpamCollection/SMSSpamCollection", sep="\t", header=None, names=["label", "text"])

# Preprocess the data
data["label"] = data["label"].map({"ham": 0, "spam": 1})

# Create feature vectors using the Bag of Words or TF-IDF approach
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data["text"])
y = data["label"]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Evaluate the models
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(),
    "KNN": KNeighborsClassifier()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Model Used: {name}")
    print("Accuracy Score:", accuracy_score(y_test, y_pred))
    print("Precision Score:", precision_score(y_test, y_pred))
    print("Recall Score:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
    print("\n")

Model Used: Logistic Regression
Accuracy Score: 0.9641255605381166
Precision Score: 1.0
Recall Score: 0.7315436241610739
F1 Score: 0.8449612403100776


Model Used: Decision Tree
Accuracy Score: 0.9739910313901345
Precision Score: 0.9225352112676056
Recall Score: 0.8791946308724832
F1 Score: 0.9003436426116839


Model Used: SVM
Accuracy Score: 0.9829596412556054
Precision Score: 1.0
Recall Score: 0.87248322147651
F1 Score: 0.931899641577061


Model Used: Random Forest
Accuracy Score: 0.9847533632286996
Precision Score: 1.0
Recall Score: 0.8859060402684564
F1 Score: 0.9395017793594307


Model Used: KNN
Accuracy Score: 0.9147982062780269
Precision Score: 1.0
Recall Score: 0.3624161073825503
F1 Score: 0.5320197044334976




In this spam classification problem, we want a model that rarely flags legitimate messages as spam (high precision) while still catching most spam messages (high recall). Four of the five models reach a precision of 1.0 on this test set, so recall is what separates them, and the Random Forest model has both the highest recall and the highest accuracy.

The Random Forest model has a perfect precision score of 1.0, meaning it doesn't misclassify any legitimate message in the test set as spam. The recall score of 0.8859 means that the model correctly identifies 88.59% of the spam messages. The F1 score, which is the harmonic mean of precision and recall, is also the highest among all models at 0.9395, indicating that the Random Forest model achieves the best balance between precision and recall for this problem.

Thus, the Random Forest model is the most suitable choice for this spam classification problem, considering the importance of precision and recall in this scenario.
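To make the relationship between these scores concrete, here is a small worked example that recomputes them from confusion-matrix counts. The counts are an assumption: they are not printed by the notebook but are inferred from the Random Forest scores above, and they reproduce those scores to four decimals.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Assumed counts, inferred from the reported Random Forest scores:
# 132 spam caught, 17 spam missed, 0 ham flagged as spam, 966 ham passed
tp, fp, fn, tn = 132, 0, 17, 966

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print("Precision:", round(precision, 4))  # 1.0
print("Recall:", round(recall, 4))        # 0.8859
print("F1:", round(f1, 4))                # 0.9395
print("Accuracy:", round(accuracy, 4))    # 0.9848
```

With no false positives, precision is 1.0 regardless of how many spam messages are missed, which is why recall and F1 are needed to tell the high-precision models apart.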

# F0.5, F1 and F2 Score

For this example, I will use credit card fraud classification, where the F-score is a very important metric because we need a good balance of Precision and Recall. I will use the following data from Kaggle:
https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
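For reference, the F-beta score combines precision P and recall R as F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). A beta below 1 weights precision more heavily, and a beta above 1 weights recall more heavily. The sketch below is a plain-Python illustration of that formula (the helper name `f_beta` is made up for this example). It uses the Random Forest precision and recall printed later in this section.

```python
def f_beta(precision, recall, beta):
    """F-beta: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall copied from the Random Forest output further down
precision, recall = 0.9518072289156626, 0.8061224489795918

for beta in (0.5, 1, 2):
    print(f"F{beta}: {f_beta(precision, recall, beta):.4f}")
# F0.5: 0.9186
# F1: 0.8729
# F2: 0.8316
```

Because this model's precision exceeds its recall, F0.5 is the largest of the three values and F2 the smallest.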


In [3]:
df = pd.read_csv('gdrive/My Drive/datasets/Mastering Classification Metrics Medium/Credit Card Fraud Classification/creditcard.csv')
df

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [4]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values:", missing_values)

Missing values: Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64


In [5]:
# Scale the features (excluding the 'Time' and 'Class' columns)
scaler = StandardScaler()
df.iloc[:, 1:-1] = scaler.fit_transform(df.iloc[:, 1:-1])

# Split the dataset into features (X) and target (y)
X = df.drop('Class', axis=1)
y = df['Class']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [6]:
# Evaluate the models
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "SVM": SVC(class_weight='balanced'),
    "Random Forest": RandomForestClassifier(),
    "KNN": KNeighborsClassifier()
}

for name, model in tqdm(models.items(), desc="Training Models"):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Model Used: {name}")
    print("Accuracy Score:", accuracy_score(y_test, y_pred))
    print("Precision Score:", precision_score(y_test, y_pred))
    print("Recall Score:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
    print("F0.5 Score:", fbeta_score(y_test, y_pred, beta=0.5))
    print("F2 Score:", fbeta_score(y_test, y_pred, beta=2))
    print("\n")

Training Models:  20%|██        | 1/5 [00:04<00:19,  4.77s/it]

Model Used: Logistic Regression
Accuracy Score: 0.9990168884519505
Precision Score: 0.7441860465116279
Recall Score: 0.6530612244897959
F1 Score: 0.6956521739130435
F0.5 Score: 0.7239819004524887
F2 Score: 0.6694560669456067




Training Models:  40%|████      | 2/5 [00:35<01:00, 20.15s/it]

Model Used: Decision Tree
Accuracy Score: 0.9990168884519505
Precision Score: 0.7058823529411765
Recall Score: 0.7346938775510204
F1 Score: 0.7200000000000001
F0.5 Score: 0.7114624505928854
F2 Score: 0.728744939271255




Training Models:  60%|██████    | 3/5 [4:03:32<3:42:14, 6667.13s/it]

Model Used: SVM
Accuracy Score: 0.4421544187352972
Precision Score: 0.002450210466796507
Recall Score: 0.7959183673469388
F1 Score: 0.004885381435550545
F0.5 Score: 0.0030604077404774235
F2 Score: 0.012102029417240739




Training Models:  80%|████████  | 4/5 [4:08:31<1:09:12, 4152.92s/it]

Model Used: Random Forest
Accuracy Score: 0.9995962220427653
Precision Score: 0.9518072289156626
Recall Score: 0.8061224489795918
F1 Score: 0.8729281767955801
F0.5 Score: 0.9186046511627907
F2 Score: 0.831578947368421




Training Models: 100%|██████████| 5/5 [4:09:56<00:00, 2999.37s/it]

Model Used: KNN
Accuracy Score: 0.9984902215512096
Precision Score: 1.0
Recall Score: 0.12244897959183673
F1 Score: 0.2181818181818182
F0.5 Score: 0.410958904109589
F2 Score: 0.14851485148514854







# ROC-AUC Curve

In [3]:
# Load the dataset
data = pd.read_csv('gdrive/My Drive/datasets/Mastering Classification Metrics Medium/Bank Marketing Dataset/bank-additional-full.csv', sep=';')
data

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes


In [4]:
# Inspect the dataset
print(data.head())

   age        job  marital    education  default housing loan    contact  \
0   56  housemaid  married     basic.4y       no      no   no  telephone   
1   57   services  married  high.school  unknown      no   no  telephone   
2   37   services  married  high.school       no     yes   no  telephone   
3   40     admin.  married     basic.6y       no      no   no  telephone   
4   56   services  married  high.school       no      no  yes  telephone   

  month day_of_week  ...  campaign  pdays  previous     poutcome emp.var.rate  \
0   may         mon  ...         1    999         0  nonexistent          1.1   
1   may         mon  ...         1    999         0  nonexistent          1.1   
2   may         mon  ...         1    999         0  nonexistent          1.1   
3   may         mon  ...         1    999         0  nonexistent          1.1   
4   may         mon  ...         1    999         0  nonexistent          1.1   

   cons.price.idx  cons.conf.idx  euribor3m  nr.employed

In [5]:
# One-hot encode categorical variables
data = pd.get_dummies(data, columns=['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome'], drop_first=True)

# Check for missing values
missing_values = data.isnull().sum()
print("Missing values:", missing_values)

Missing values: age                              0
duration                         0
campaign                         0
pdays                            0
previous                         0
emp.var.rate                     0
cons.price.idx                   0
cons.conf.idx                    0
euribor3m                        0
nr.employed                      0
y                                0
job_blue-collar                  0
job_entrepreneur                 0
job_housemaid                    0
job_management                   0
job_retired                      0
job_self-employed                0
job_services                     0
job_student                      0
job_technician                   0
job_unemployed                   0
job_unknown                      0
marital_married                  0
marital_single                   0
marital_unknown                  0
education_basic.6y               0
education_basic.9y               0
education_high.school            0
educ

In [6]:
# Scale numerical features
numerical_features = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
scaler = StandardScaler()
data[numerical_features] = scaler.fit_transform(data[numerical_features])

# Split the dataset into features (X) and target (y)
X = data.drop('y', axis=1)
y = data['y'].map({'yes': 1, 'no': 0})

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [12]:
# Define the models
models = {
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(random_state=42, probability=True),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier()
}

# Train and evaluate the models
for name, model in tqdm(models.items()):
    model.fit(X_train, y_train)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    auc_score = roc_auc_score(y_test, y_pred_proba)
    
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    
    fig = go.Figure()
    
    fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines', name=f'ROC curve (area = {auc_score:.2f})'))
    fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines', name='Random Classifier', line=dict(dash='dash')))
    
    fig.update_layout(
        title=f'ROC Curve for {name}',
        xaxis_title='False Positive Rate',
        yaxis_title='True Positive Rate',
        legend=dict(x=0, y=1, bgcolor='rgba(255, 255, 255, 0)', bordercolor='rgba(255, 255, 255, 0)'),
        margin=dict(l=0, r=0, t=30, b=0),
        width=800,
        height=600
    )
    
    fig.show()

    print(f"Model: {name}\nROC-AUC Score: {auc_score}\n")

  0%|          | 0/5 [00:00<?, ?it/s]

 20%|██        | 1/5 [00:01<00:04,  1.01s/it]

Model: Logistic Regression
ROC-AUC Score: 0.9424225494127081



 40%|████      | 2/5 [00:01<00:01,  1.73it/s]

Model: Decision Tree
ROC-AUC Score: 0.7412507665455919



 60%|██████    | 3/5 [02:38<02:24, 72.22s/it]

Model: SVM
ROC-AUC Score: 0.9147820946742772



 80%|████████  | 4/5 [02:42<00:45, 45.09s/it]

Model: Random Forest
ROC-AUC Score: 0.9473372122505779



100%|██████████| 5/5 [02:43<00:00, 32.80s/it]

Model: KNN
ROC-AUC Score: 0.8766933286947497






# Matthews Correlation Coefficient (MCC), Balanced Accuracy and Cohen's Kappa

In [16]:
# Train and evaluate the models
for name, model in tqdm(models.items()):
    model.fit(X_train, y_train)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    y_pred = model.predict(X_test)
    
    # Calculate the metrics
    auc_score = roc_auc_score(y_test, y_pred_proba)
    mcc_score = matthews_corrcoef(y_test, y_pred)
    b_acc_score = balanced_accuracy_score(y_test, y_pred)
    kappa_score = cohen_kappa_score(y_test, y_pred)

    print(f"Model: {name}\nROC-AUC Score: {auc_score}\nMatthews Correlation Coefficient: {mcc_score}\nBalanced Accuracy: {b_acc_score}\nCohen's Kappa: {kappa_score}\n")

 20%|██        | 1/5 [00:00<00:03,  1.32it/s]

Model: Logistic Regression
ROC-AUC Score: 0.9424225494127081
Matthews Correlation Coefficient: 0.511878341895685
Balanced Accuracy: 0.7047701247700363
Cohen's Kappa: 0.49358998731109216



 40%|████      | 2/5 [00:01<00:01,  2.11it/s]

Model: Decision Tree
ROC-AUC Score: 0.7412507665455919
Matthews Correlation Coefficient: 0.4782586521532672
Balanced Accuracy: 0.7412507665455919
Cohen's Kappa: 0.4782273857010254



 60%|██████    | 3/5 [02:34<02:20, 70.41s/it]

Model: SVM
ROC-AUC Score: 0.9147820946742772
Matthews Correlation Coefficient: 0.4999679705973072
Balanced Accuracy: 0.6973638202273692
Cohen's Kappa: 0.480092271028396



 80%|████████  | 4/5 [02:39<00:44, 44.50s/it]

Model: Random Forest
ROC-AUC Score: 0.9473372122505779
Matthews Correlation Coefficient: 0.5305007270035996
Balanced Accuracy: 0.7301029824520024
Cohen's Kappa: 0.522373412431687



100%|██████████| 5/5 [02:43<00:00, 32.64s/it]

Model: KNN
ROC-AUC Score: 0.8766933286947497
Matthews Correlation Coefficient: 0.4591059091639883
Balanced Accuracy: 0.6995381562809566
Cohen's Kappa: 0.45226155334258833






From the output, we can interpret the following:

Logistic Regression: The model has the second-highest ROC-AUC score (0.9424) among all models, indicating good performance in distinguishing between the positive and negative classes. It has a moderate Matthews Correlation Coefficient (0.5119) and Balanced Accuracy (0.7048), suggesting that there's still room for improvement in handling class imbalance. The Cohen's Kappa (0.4936) is moderate, indicating a reasonable agreement between the predicted and actual classes, but there's still potential for improvement.

Decision Tree: This model has the lowest ROC-AUC score (0.7413), suggesting it's not as good at ranking positive examples above negative ones as the other models. Its Matthews Correlation Coefficient (0.4783) and Cohen's Kappa (0.4782) are moderate, but its Balanced Accuracy (0.7413) is the highest of the five models. The ROC-AUC and Balanced Accuracy are identical here, which is what happens when a model's predicted probabilities are all exactly 0 or 1, as is typical for an unpruned decision tree.

SVM: The SVM model has a good ROC-AUC score (0.9148), a mid-range Matthews Correlation Coefficient (0.5000) and Cohen's Kappa (0.4801), and the lowest Balanced Accuracy (0.6974) of the five models. This suggests that, at its default decision threshold, the SVM misses a larger share of the minority class than its ROC-AUC alone would indicate.

Random Forest: This model has the highest ROC-AUC score (0.9473) among all models, indicating the best separation between the positive and negative classes. It also has the highest Matthews Correlation Coefficient (0.5305) and Cohen's Kappa (0.5224). Its Balanced Accuracy (0.7301) is second to the Decision Tree's (0.7413), so it handles the class imbalance better than most, but not all, of the other models.

KNN: The KNN model has a decent ROC-AUC score (0.8767) but the lowest Matthews Correlation Coefficient (0.4591) and Cohen's Kappa (0.4523) of the five models, along with a Balanced Accuracy of 0.6995. This indicates that the KNN model could be improved in terms of handling class imbalance and agreement between predicted and actual classes.
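To show how these three metrics are built, here is a minimal plain-Python sketch that computes Balanced Accuracy, MCC, and Cohen's Kappa from a binary confusion matrix. The counts are hypothetical, chosen only to mimic an imbalanced problem, and are not taken from the bank marketing results. The helper name `imbalance_metrics` is likewise made up for this example.

```python
import math


def imbalance_metrics(tp, fp, fn, tn):
    """Compute accuracy, balanced accuracy, MCC and Cohen's kappa.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    n = tp + fp + fn + tn
    tpr = tp / (tp + fn)  # recall on the positive class
    tnr = tn / (tn + fp)  # recall on the negative class
    balanced_accuracy = (tpr + tnr) / 2

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0

    observed = (tp + tn) / n  # plain accuracy
    # Agreement expected by chance, from the row and column totals
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0

    return {
        "accuracy": observed,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical counts: 100 positives and 900 negatives
metrics = imbalance_metrics(tp=40, fp=20, fn=60, tn=880)
for metric_name, value in metrics.items():
    print(f"{metric_name}: {value:.4f}")
```

With these counts, accuracy is 0.92 even though only 40% of the positives are found. The other three metrics come out much lower, which is the pattern seen in the model outputs above.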

Overall, the Random Forest model appears to be the best performing model among the ones tested: it has the highest ROC-AUC, Matthews Correlation Coefficient, and Cohen's Kappa, and the second-highest Balanced Accuracy after the Decision Tree. However, there's still room for improvement in all the models, especially when it comes to handling class imbalance: no model reaches a Balanced Accuracy of 0.75 or an MCC of 0.54.

The additional metrics therefore support the choice suggested by the ROC-AUC scores, but they also show that the ROC-AUC ranking does not carry over to every metric. The Decision Tree, for example, ranks last on ROC-AUC and first on Balanced Accuracy.
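In the output above, the Decision Tree's ROC-AUC and Balanced Accuracy are both 0.7412507665455919. A likely explanation, stated here as an assumption rather than something verified against the fitted tree, is that an unpruned tree has pure leaves, so `predict_proba` returns only 0 or 1. When scores are hard labels, the ROC curve has a single operating point and its area equals (TPR + TNR) / 2. The toy example below checks that identity in plain Python with made-up labels.

```python
def pairwise_auc(y_true, scores):
    """AUC as the chance a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    return (tp / n_pos + tn / n_neg) / 2


# Made-up labels: 4 positives, 8 negatives, hard 0/1 predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_hard = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

auc = pairwise_auc(y_true, y_hard)
bal_acc = balanced_accuracy(y_true, y_hard)
print(auc, bal_acc)  # 0.6875 0.6875
```

Limiting tree depth or requiring a minimum number of samples per leaf would usually produce graded probabilities, and with them a ROC-AUC that differs from Balanced Accuracy.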