In [36]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import itertools
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, cross_val_predict, cross_validate
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.metrics import (
    make_scorer, accuracy_score, f1_score, precision_score, recall_score,
    roc_curve, auc, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay,
    classification_report
)
from sklearn.svm import SVC
from pyswarm import pso
from scipy.stats import wilcoxon

## Evaluation <a id="evaluation"></a>
[go back to the top](#contents)

**Comparative Analysis of Machine Learning Models Without Feature Selection**

In this section, we statistically compare three machine learning models, **Random Forest (RF)**, **Logistic Regression (LR)**, and **Support Vector Machine (SVM)**, on three datasets. The goal is to determine whether the models' per-fold cross-validation scores differ significantly on each evaluation metric.

We analyze the models using three distinct datasets:
1. **Combined Dataset (Without Feature Selection)**
2. **Radiomic Dataset (Without Feature Selection)**
3. **Pylidc Dataset (Without Feature Selection)**

**Performance Metrics**

For each dataset, the models are assessed on the following metrics:
- **Accuracy:** The proportion of correct predictions among all predictions.
- **F1 Score:** The harmonic mean of precision and recall, balancing the two.
- **Precision:** The proportion of predicted positives that are truly positive.
- **Recall (Sensitivity):** The proportion of actual positives that the model correctly identifies.
- **ROC-AUC (Receiver Operating Characteristic - Area Under Curve):** The model's ability to separate the classes across all decision thresholds.
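
To make these definitions concrete, here is a minimal sketch that computes all five metrics from first principles on a toy example. The labels, predictions, and scores are hypothetical and are not taken from any dataset in this notebook; in the analysis itself, the corresponding `sklearn.metrics` functions (`accuracy_score`, `f1_score`, `precision_score`, `recall_score`, `roc_auc_score`) serve the same purpose.

```python
# Toy ground truth, hard predictions, and predicted scores (all hypothetical)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]

# Confusion-matrix counts
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# ROC-AUC as the probability that a randomly chosen positive is scored
# higher than a randomly chosen negative (ties count as one half)
pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]
roc_auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} roc_auc={roc_auc:.3f}")
# accuracy=0.800 precision=0.750 recall=0.750 f1=0.750 roc_auc=0.958
```

Note that accuracy, precision, recall, and F1 depend on the hard predictions, whereas ROC-AUC depends only on how the scores rank positives against negatives.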

**Statistical Comparison Method**

To determine whether the observed differences between models are statistically significant, we use the **Wilcoxon Signed-Rank Test**. This non-parametric test compares paired samples (here, the scores two models obtain on the same cross-validation folds) without assuming that the differences are normally distributed. The significance level is set at **0.05**: a p-value below this threshold indicates a statistically significant difference between the two models for the given metric.
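
To illustrate how the test works, the following sketch computes the signed-rank statistic and an exact two-sided p-value from first principles for two hypothetical sets of per-fold accuracies. The scores, the function name `wilcoxon_signed_rank`, and the rounding tolerance are illustrative assumptions and not values from this notebook. The analysis itself relies on `scipy.stats.wilcoxon`, which may switch to a normal approximation when ties or zero differences are present.

```python
from itertools import product

def wilcoxon_signed_rank(a, b, decimals=6):
    """Return (W, p): the smaller signed-rank sum and an exact two-sided p-value."""
    # Paired differences; rounding avoids spurious float noise, zeros are dropped
    diffs = [round(x - y, decimals) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)

    # Rank the absolute differences, giving tied values their average rank
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Exact null distribution: every sign pattern is equally likely
    total = sum(ranks)
    hits = 0
    for signs in product((0, 1), repeat=n):
        plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(plus, total - plus) <= w:
            hits += 1
    return w, hits / 2 ** n

# Hypothetical accuracies of two models on the same 10 folds
model_a = [0.82, 0.81, 0.84, 0.84, 0.79, 0.86, 0.85, 0.88, 0.88, 0.88]
model_b = [0.80, 0.82, 0.81, 0.79, 0.83, 0.80, 0.78, 0.80, 0.79, 0.78]

w, p = wilcoxon_signed_rank(model_a, model_b)
print(f"W = {w}, p = {p:.5f}")
# W = 5.0, p = 0.01953
```

With ten non-zero differences, the exact test enumerates 2^10 = 1024 sign patterns, so the smallest achievable two-sided p-value is 2/1024 (about 0.002).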

For each dataset, the following steps are performed:

1. **Data Preparation:**
   - **Loading Results:** Import the cross-validation results for each model from their respective CSV files.
   - **Filtering Data:** Exclude summary rows such as 'mean', 'std', and 'test_set' to focus solely on per-fold performance metrics.
   - **Alignment Check:** Ensure that the cross-validation folds are consistently aligned across all models to maintain the integrity of paired comparisons.

2. **Pairwise Model Comparison:**
   - **Metric Selection:** For each performance metric (Accuracy, F1 Score, Precision, Recall, ROC-AUC), perform comparisons.
   - **Statistical Testing:** Execute the Wilcoxon Signed-Rank Test for each pair of models:
     - **RF vs. LR**
     - **RF vs. SVM**
     - **LR vs. SVM**
   - **Result Interpretation:** Report the Wilcoxon test statistic and p-value for each comparison to assess statistical significance.
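
The steps above can be wrapped in one reusable function so that the same logic runs for every dataset. This is only a sketch under stated assumptions: the function name `pairwise_wilcoxon`, the tidy output layout, and the synthetic scores are hypothetical, and the function takes already-loaded DataFrames (with a `fold` column, as in the results CSVs) rather than file names.

```python
import itertools

import pandas as pd
from scipy.stats import wilcoxon

SUMMARY_ROWS = ['mean', 'std', 'test_set']

def pairwise_wilcoxon(results_by_model, metrics, alpha=0.05):
    """Run the Wilcoxon signed-rank test for every model pair and metric."""
    # Data preparation: keep only the per-fold rows
    per_fold = {
        name: df[~df['fold'].astype(str).isin(SUMMARY_ROWS)].reset_index(drop=True)
        for name, df in results_by_model.items()
    }

    # Alignment check: every model must report the same folds in the same order
    reference = list(next(iter(per_fold.values()))['fold'])
    for name, df in per_fold.items():
        if list(df['fold']) != reference:
            raise ValueError(f"Folds are misaligned in {name}")

    # Pairwise comparison for each metric
    rows = []
    for metric in metrics:
        for m1, m2 in itertools.combinations(per_fold, 2):
            stat, p_value = wilcoxon(per_fold[m1][metric], per_fold[m2][metric])
            rows.append({
                'metric': metric, 'model_1': m1, 'model_2': m2,
                'statistic': float(stat), 'p_value': float(p_value),
                'significant': bool(p_value < alpha),
            })
    return pd.DataFrame(rows)

# Synthetic per-fold accuracies (hypothetical), plus a 'mean' summary row
fold_labels = [str(i) for i in range(1, 11)] + ['mean']

def make_frame(scores):
    return pd.DataFrame({'fold': fold_labels,
                         'accuracy': scores + [sum(scores) / len(scores)]})

frames = {
    'A': make_frame([0.82, 0.81, 0.84, 0.84, 0.79, 0.86, 0.85, 0.88, 0.88, 0.88]),
    'B': make_frame([0.80, 0.82, 0.81, 0.79, 0.83, 0.80, 0.78, 0.80, 0.79, 0.78]),
    'C': make_frame([0.75, 0.70, 0.72, 0.71, 0.74, 0.73, 0.76, 0.69, 0.77, 0.68]),
}

result = pairwise_wilcoxon(frames, ['accuracy'])
print(result)
```

Returning a DataFrame instead of printing makes it straightforward to filter for significant pairs or to concatenate the results of several datasets.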

**Combined Dataset Without Feature Selection**

In [38]:
# Map each model to the CSV file with its cross-validation results
models = {
    'Random Forest': 'Combined Dataset (Without Feature Selection)_random_forest_results.csv',
    'Logistic Regression': 'Combined Dataset (Without Feature Selection)_logistic_regression_results.csv',
    'SVM': 'Combined Dataset (Without Feature Selection)_svm_results.csv'
}

# Define the metrics to compare
metrics = ['accuracy', 'f1', 'precision', 'recall', 'roc_auc']

# Load per-fold results for each model into a dictionary
model_results = {}

for model_name, filename in models.items():
    df = pd.read_csv(filename)
    # Exclude 'mean', 'std', and 'test_set' rows
    exclude_rows = ['mean', 'std', 'test_set']
    df = df[~df['fold'].isin(exclude_rows)].reset_index(drop=True)
    model_results[model_name] = df

# Ensure folds are aligned across models (comparing lists also catches length mismatches)
folds = model_results[next(iter(model_results))]['fold']
for model_name, df in model_results.items():
    assert df['fold'].tolist() == folds.tolist(), f"Folds are misaligned in {model_name}"

# Function to perform Wilcoxon Signed-Rank Test
def compare_models(metric, model1_name, model2_name):
    model1_scores = model_results[model1_name][metric]
    model2_scores = model_results[model2_name][metric]
    stat, p_value = wilcoxon(model1_scores, model2_scores)
    return stat, p_value

# Perform pairwise comparisons for each metric
for metric in metrics:
    print(f"\nComparing models for metric: {metric.capitalize()}")
    print('-' * 50)
    for (model1_name, model2_name) in itertools.combinations(models.keys(), 2):
        stat, p_value = compare_models(metric, model1_name, model2_name)
        print(f"{model1_name} vs. {model2_name}:")
        print(f"  Wilcoxon Statistic = {stat}, p-value = {p_value:.5f}")



Comparing models for metric: Accuracy
--------------------------------------------------
Random Forest vs. Logistic Regression:
  Wilcoxon Statistic = 13.0, p-value = 0.16016
Random Forest vs. SVM:
  Wilcoxon Statistic = 14.5, p-value = 0.19336
Logistic Regression vs. SVM:
  Wilcoxon Statistic = 13.5, p-value = 0.52709

Comparing models for metric: F1
--------------------------------------------------
Random Forest vs. Logistic Regression:
  Wilcoxon Statistic = 10.0, p-value = 0.08398
Random Forest vs. SVM:
  Wilcoxon Statistic = 11.0, p-value = 0.10547
Logistic Regression vs. SVM:
  Wilcoxon Statistic = 15.0, p-value = 0.23242

Comparing models for metric: Precision
--------------------------------------------------
Random Forest vs. Logistic Regression:
  Wilcoxon Statistic = 25.0, p-value = 0.84570
Random Forest vs. SVM:
  Wilcoxon Statistic = 22.0, p-value = 0.62500
Logistic Regression vs. SVM:
  Wilcoxon Statistic = 9.0, p-value = 0.06445

Comparing models for metric: Recall
---



**Radiomic Dataset Without Feature Selection**

In [42]:
# Map each model to the CSV file with its cross-validation results
models = {
    'Random Forest': 'Radiomic Dataset (Without Feature Selection)_random_forest_results.csv',
    'Logistic Regression': 'Radiomic Dataset (Without Feature Selection)_logistic_regression_results.csv',
    'SVM': 'Radiomic Dataset (Without Feature Selection)_svm_results.csv'
}

# Define the metrics to compare
metrics = ['accuracy', 'f1', 'precision', 'recall', 'roc_auc']

# Load per-fold results for each model into a dictionary
model_results = {}

for model_name, filename in models.items():
    df = pd.read_csv(filename)
    # Exclude 'mean', 'std', and 'test_set' rows
    exclude_rows = ['mean', 'std', 'test_set']
    df = df[~df['fold'].isin(exclude_rows)].reset_index(drop=True)
    model_results[model_name] = df

# Ensure folds are aligned across models (comparing lists also catches length mismatches)
folds = model_results[next(iter(model_results))]['fold']
for model_name, df in model_results.items():
    assert df['fold'].tolist() == folds.tolist(), f"Folds are misaligned in {model_name}"

# Function to perform Wilcoxon Signed-Rank Test
def compare_models(metric, model1_name, model2_name):
    model1_scores = model_results[model1_name][metric]
    model2_scores = model_results[model2_name][metric]
    stat, p_value = wilcoxon(model1_scores, model2_scores)
    return stat, p_value

# Perform pairwise comparisons for each metric
for metric in metrics:
    print(f"\nComparing models for metric: {metric.capitalize()}")
    print('-' * 50)
    for (model1_name, model2_name) in itertools.combinations(models.keys(), 2):
        stat, p_value = compare_models(metric, model1_name, model2_name)
        print(f"{model1_name} vs. {model2_name}:")
        print(f"  Wilcoxon Statistic = {stat}, p-value = {p_value:.5f}")


Comparing models for metric: Accuracy
--------------------------------------------------
Random Forest vs. Logistic Regression:
  Wilcoxon Statistic = 17.5, p-value = 0.37500
Random Forest vs. SVM:
  Wilcoxon Statistic = 9.0, p-value = 0.06445
Logistic Regression vs. SVM:
  Wilcoxon Statistic = 15.0, p-value = 0.23242

Comparing models for metric: F1
--------------------------------------------------
Random Forest vs. Logistic Regression:
  Wilcoxon Statistic = 19.0, p-value = 0.43164
Random Forest vs. SVM:
  Wilcoxon Statistic = 11.0, p-value = 0.10547
Logistic Regression vs. SVM:
  Wilcoxon Statistic = 16.0, p-value = 0.27539

Comparing models for metric: Precision
--------------------------------------------------
Random Forest vs. Logistic Regression:
  Wilcoxon Statistic = 19.0, p-value = 0.43164
Random Forest vs. SVM:
  Wilcoxon Statistic = 10.0, p-value = 0.08398
Logistic Regression vs. SVM:
  Wilcoxon Statistic = 14.0, p-value = 0.19336

Comparing models for metric: Recall
---



**Pylidc Dataset Without Feature Selection**

In [39]:
# Map each model to the CSV file with its cross-validation results
models = {
    'Random Forest': 'Pylidc Dataset (Without Feature Selection)_random_forest_results.csv',
    'Logistic Regression': 'Pylidc Dataset (Without Feature Selection)_logistic_regression_results.csv',
    'SVM': 'Pylidc Dataset (Without Feature Selection)_svm_results.csv'
}

# Define the metrics to compare
metrics = ['accuracy', 'f1', 'precision', 'recall', 'roc_auc']

# Load per-fold results for each model into a dictionary
model_results = {}

for model_name, filename in models.items():
    df = pd.read_csv(filename)
    # Exclude 'mean', 'std', and 'test_set' rows
    exclude_rows = ['mean', 'std', 'test_set']
    df = df[~df['fold'].isin(exclude_rows)].reset_index(drop=True)
    model_results[model_name] = df

# Ensure folds are aligned across models (comparing lists also catches length mismatches)
folds = model_results[next(iter(model_results))]['fold']
for model_name, df in model_results.items():
    assert df['fold'].tolist() == folds.tolist(), f"Folds are misaligned in {model_name}"

# Function to perform Wilcoxon Signed-Rank Test
def compare_models(metric, model1_name, model2_name):
    model1_scores = model_results[model1_name][metric]
    model2_scores = model_results[model2_name][metric]
    stat, p_value = wilcoxon(model1_scores, model2_scores)
    return stat, p_value

# Perform pairwise comparisons for each metric
for metric in metrics:
    print(f"\nComparing models for metric: {metric.capitalize()}")
    print('-' * 50)
    for (model1_name, model2_name) in itertools.combinations(models.keys(), 2):
        stat, p_value = compare_models(metric, model1_name, model2_name)
        print(f"{model1_name} vs. {model2_name}:")
        print(f"  Wilcoxon Statistic = {stat}, p-value = {p_value:.5f}")


Comparing models for metric: Accuracy
--------------------------------------------------
Random Forest vs. Logistic Regression:
  Wilcoxon Statistic = 4.0, p-value = 0.01367
Random Forest vs. SVM:
  Wilcoxon Statistic = 5.0, p-value = 0.03815
Logistic Regression vs. SVM:
  Wilcoxon Statistic = 4.0, p-value = 0.09039

Comparing models for metric: F1
--------------------------------------------------
Random Forest vs. Logistic Regression:
  Wilcoxon Statistic = 5.0, p-value = 0.01953
Random Forest vs. SVM:
  Wilcoxon Statistic = 8.0, p-value = 0.04883
Logistic Regression vs. SVM:
  Wilcoxon Statistic = 2.0, p-value = 0.02506

Comparing models for metric: Precision
--------------------------------------------------
Random Forest vs. Logistic Regression:
  Wilcoxon Statistic = 14.0, p-value = 0.19336
Random Forest vs. SVM:
  Wilcoxon Statistic = 12.0, p-value = 0.13086
Logistic Regression vs. SVM:
  Wilcoxon Statistic = 15.0, p-value = 0.67442

Comparing models for metric: Recall
--------



### Selecting the best model with a weighted score

To select the best model, we compute a weighted score from five performance metrics: Recall (0.35), ROC-AUC (0.25), F1-Score (0.25), Precision (0.10), and Accuracy (0.05). The weights sum to 1 and reflect each metric's importance in the medical context of cancer diagnosis: recall carries the largest weight because avoiding false negatives matters most, followed by ROC-AUC and F1-Score.
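
As a worked example of the scoring rule, the sketch below applies the weights to the rounded test-set metrics printed for RF on the Pylidc dataset (Without Feature Selection) in this section's output. The helper name `weighted_score` is hypothetical; the weights and metric values are copied from the notebook.

```python
# Weights as defined in the notebook
weights = {
    'recall': 0.35,
    'roc_auc': 0.25,
    'f1': 0.25,
    'precision': 0.10,
    'accuracy': 0.05,
}

# Rounded test-set metrics of RF on the Pylidc dataset, as printed in the output
rf_pylidc_test = {
    'recall': 0.879479,
    'roc_auc': 0.955268,
    'f1': 0.842434,
    'precision': 0.808383,
    'accuracy': 0.891631,
}

def weighted_score(metrics, weights):
    """Weighted sum of the metrics; the weights are expected to sum to 1."""
    return sum(weights[name] * metrics[name] for name in weights)

score = weighted_score(rf_pylidc_test, weights)
print(f"{score:.6f}")
# 0.882663
```

The result agrees with the `weighted_score` reported for that row. Because the weights sum to 1, the score stays on the same 0 to 1 scale as the individual metrics.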

In [99]:
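# NOTE: `df` must be the summary table with one row per (model, dataset) pair
# and `test_*` / `cv_*_mean` columns. The comparison cells above reuse the name
# `df` for per-fold results, so load the summary table before running this cell.
# Define the weights for each metric (they sum to 1)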
weights = {
    'recall': 0.35,
    'roc_auc': 0.25,
    'f1': 0.25,
    'precision': 0.1,
    'accuracy': 0.05
}
# Calculate the weighted score for test set metrics
df['weighted_score'] = (
    (weights['recall'] * df['test_recall']) +
    (weights['roc_auc'] * df['test_roc_auc']) +
    (weights['f1'] * df['test_f1']) +
    (weights['precision'] * df['test_precision']) +
    (weights['accuracy'] * df['test_accuracy'])
)

# Sort the DataFrame by the weighted score and get the top 5 models
top_5_models_test = df.sort_values(by='weighted_score', ascending=False).head(5)

print("Top 5 models based on Test weighted score:")
print(top_5_models_test[['model', 'dataset', 'weighted_score', 'test_recall', 'test_roc_auc', 'test_f1', 'test_precision', 'test_accuracy']])

# Calculate the weighted score for cross-validation (CV) metrics
df['cv_weighted_score'] = (
    (weights['recall'] * df['cv_recall_mean']) +
    (weights['roc_auc'] * df['cv_roc_auc_mean']) +
    (weights['f1'] * df['cv_f1_mean']) +
    (weights['precision'] * df['cv_precision_mean']) +
    (weights['accuracy'] * df['cv_accuracy_mean'])
)

# Sort the DataFrame by the CV weighted score and get the top 5 models
top_5_models_cv = df.sort_values(by='cv_weighted_score', ascending=False).head(5)

print("\nTop 5 models based on CV weighted score:")
print(top_5_models_cv[['model', 'dataset', 'cv_weighted_score', 'cv_recall_mean', 'cv_roc_auc_mean', 'cv_f1_mean', 'cv_precision_mean', 'cv_accuracy_mean']])

Top 5 models based on Test weighted score:
   model                                            dataset  weighted_score  \
12    RF         Pylidc Dataset (Without Feature Selection)        0.882663   
15    RF          Combined Dataset (With Feature Selection)        0.879666   
2    SVM  Combined Dataset (With Feature Selection by La...        0.852747   
6    SVM       Combined Dataset (Without Feature Selection)        0.850782   
16    RF       Combined Dataset (Without Feature Selection)        0.850563   

    test_recall  test_roc_auc   test_f1  test_precision  test_accuracy  
12     0.879479      0.955268  0.842434        0.808383       0.891631  
15     0.859935      0.956081  0.846154        0.832808       0.896996  
2      0.843648      0.940383  0.806854        0.773134       0.866953  
6      0.840391      0.939429  0.804992        0.772455       0.865880  
16     0.811075      0.945676  0.816393        0.821782       0.879828  

Top 5 models based on CV weighted score:
  

**OTHER IDEAS**:


- **Effect of feature selection:** Analyze how the feature selection method (Random Forest, Lasso, or none) affects performance by comparing each model's metrics with and without feature selection.
- **Effect of the dataset:** Examine how each model's performance changes across the Radiomic, Combined, and Pylidc datasets.
- **Stability:** Review the standard deviation (std) of each metric across cross-validation folds. A high std means performance varies considerably between folds, while a low std implies stable performance. Identify the models that are most consistent across metrics (low std in accuracy, F1, etc.).
- **Test set vs. cross-validation:** Compare test-set results against CV results to look for overfitting (high CV accuracy but lower test accuracy) or underfitting (low CV and test accuracy).
- **Visualizations:** Boxplots of each metric across models and feature selection techniques, and line or bar plots comparing test-set and CV performance.
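
A minimal sketch of the stability and CV-versus-test checks listed above. All scores and model names are hypothetical, and the function name `stability_report` is an assumption. The standard deviation is the sample standard deviation (the same convention as the pandas default).

```python
from statistics import mean, stdev

# Hypothetical per-fold CV accuracies and held-out test accuracies
cv_scores = {
    'model_a': [0.90, 0.91, 0.89, 0.90, 0.90],
    'model_b': [0.95, 0.80, 0.92, 0.78, 0.90],
}
test_scores = {'model_a': 0.89, 'model_b': 0.75}

def stability_report(cv_scores, test_scores):
    """Summarize the CV mean, the CV spread, and the CV-minus-test gap per model."""
    report = {}
    for name, folds in cv_scores.items():
        cv_mean = mean(folds)
        report[name] = {
            'cv_mean': cv_mean,
            'cv_std': stdev(folds),                       # spread across folds
            'cv_minus_test': cv_mean - test_scores[name], # large gap hints at overfitting
        }
    return report

report = stability_report(cv_scores, test_scores)
for name, row in report.items():
    print(f"{name}: cv_mean={row['cv_mean']:.3f} "
          f"cv_std={row['cv_std']:.3f} gap={row['cv_minus_test']:.3f}")
```

In this toy data, `model_b` shows both a larger spread across folds and a larger gap between CV and test accuracy, which is the pattern to look for when screening for unstable or overfitted models.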
