**Table of contents**<a id='toc0_'></a>    
- [The project](#toc1_)    
  - [1. Approach 1: Baseline Model (Full Data)](#toc1_1_)    
    - [2.1 Logistics regression](#toc1_1_1_)    
    - [1.2 Random Forest](#toc1_1_2_)    
    - [1.3 XGboost](#toc1_1_3_)    
  - [2. Approach 2: Correlation-Based Feature Selection & Outlier Removal](#toc1_2_)    
    - [2.1 Logistics regression](#toc1_2_1_)    
    - [2.2 Random Forest](#toc1_2_2_)    
    - [2.3 XGboost](#toc1_2_3_)    
  - [3. Approach 3: Statistical Distribution-Based Feature Selection](#toc1_3_)    
    - [3.1 Logistics regression](#toc1_3_1_)    
    - [3.2 Random Forest](#toc1_3_2_)    
    - [3.3 XGboost](#toc1_3_3_)    
  - [4. Approach 4: Non-Linear Dimensionality Reduction & Scaling](#toc1_4_)    
    - [4.1 Logistics regression](#toc1_4_1_)    
    - [4.2 Random Forest](#toc1_4_2_)    
    - [4.3 XGboost](#toc1_4_3_)    
  - [5. Conclusion](#toc1_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [4]:
import jupyter_black

jupyter_black.load()
import os

os.environ["PYTHONWARNINGS"] = "ignore"

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [5]:
import polars as pl
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import uniform, randint
from collections import Counter

from xgboost import XGBClassifier

from modeling import model_training, visualisation

In [6]:
###  declare global variable that stores metrics across all models
METRICS_DICT = {}
global METRICS_DICT

In [8]:
os.chdir("..")

# <a id='toc1_'></a>[The project](#toc0_)

Saha and colleagues collected a large set of breast MRI cases that
were used to detect breast cancer. The dataset also contains detailed
information about the characteristic of the patient and subtype of tumors
that were eventually diagnosed. This information can be used to determine
the best treatment.

Basic image processing was performed to extract features from the MRI
cases which were used to predict the molecular tumor subtypes from MRI.
In this assignment, we will try to predict the estrogen receptor (ER) status
from the MRI image features that were provided.

## <a id='toc1_1_'></a>[1. Approach 1: Baseline Model (Full Data)](#toc0_)

In [16]:
### 1. download train and test sets
train_set = pl.read_excel("dataset/processed/train_processed_full.xlsx")
test_set = pl.read_excel("dataset/processed/test_processed_full.xlsx")

### 2. Drop "Patient ID" and handle "ER" as the target and Separate features and target
X_train = train_set.drop("Patient ID", "ER").to_numpy()
y_train = train_set["ER"].to_numpy()
X_test = test_set.drop("Patient ID", "ER").to_numpy()
y_test = test_set["ER"].to_numpy()

### <a id='toc1_1_1_'></a>[2.1 Logistics regression](#toc0_)

In [20]:
### 1. Define hyperparameter distributions for each model
log_reg_params = {
    "C": uniform(0.01, 10),
    "penalty": ["l2"],
    "solver": ["sag"],
}

### 2. Train Logistic Regression
log_reg = LogisticRegression(max_iter=1000)
log_reg_best, log_reg_auc, log_reg_acc, log_reg_f1, report = (
    model_training.tune_and_evaluate_model(
        log_reg,
        log_reg_params,
        X_train,
        y_train,
        X_test,
        y_test,
        model_name="Logistic_Regression_with_ridge_penalty",
        log_dir="models/logs/approach_1",
        model_dir="models/saved_models/approach_1",
        approach="approach_1",
    )
)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Logistic_Regression_with_ridge_penalty Best Hyperparameters: {'C': 5.257564316322378, 'penalty': 'l2', 'solver': 'sag'}
Logistic_Regression_with_ridge_penalty Test Metrics:
AUC: 0.6524
Accuracy: 0.7475
F1-Score: 0.8550


In [21]:
### 1. create dataframe from metrics add to metrics dictionary
approach_1_log_reg_metrics = pl.DataFrame(
    {
        "metrics": ["accuracy, %", "f1_score", "auc"],
        "approach_1_log_reg": [
            np.round(log_reg_acc * 100),
            np.round(log_reg_f1, 3),
            np.round(log_reg_auc, 3),
        ],
    }
)
METRICS_DICT["approach_1_log_reg_metrics"] = approach_1_log_reg_metrics
print(approach_1_log_reg_metrics)

shape: (3, 2)
┌─────────────┬────────────────────┐
│ metrics     ┆ approach_1_log_reg │
│ ---         ┆ ---                │
│ str         ┆ f64                │
╞═════════════╪════════════════════╡
│ accuracy, % ┆ 75.0               │
│ f1_score    ┆ 0.855              │
│ auc         ┆ 0.652              │
└─────────────┴────────────────────┘


In [24]:
### 1. convert report to polars dataframe
approach_1_log_reg_report = model_training.classification_report_to_polars(report)

### 2. save report
approach_1_log_reg_report.write_excel(
    "reports/approach_1/approach_1_log_reg_report.xlsx"
)
print(approach_1_log_reg_report)

shape: (2, 5)
┌──────────┬───────────┬────────┬──────────┬───────┐
│ category ┆ precision ┆ recall ┆ f1-score ┆ count │
│ ---      ┆ ---       ┆ ---    ┆ ---      ┆ ---   │
│ str      ┆ f64       ┆ f64    ┆ f64      ┆ i64   │
╞══════════╪═══════════╪════════╪══════════╪═══════╡
│ 0        ┆ 1.0       ┆ 0.01   ┆ 0.03     ┆ 78    │
│ 1        ┆ 0.75      ┆ 1.0    ┆ 0.85     ┆ 227   │
└──────────┴───────────┴────────┴──────────┴───────┘


### <a id='toc1_1_2_'></a>[1.2 Random Forest](#toc0_)

In [25]:
### 1. Define hyperparameter distributions for each model
rf_params = {
    "n_estimators": randint(10, 150),
    "max_depth": [3, 5, 10, 15],
    "min_samples_split": randint(10, 30),
    "min_samples_leaf": randint(10, 30),
}

### 2. Train Random Forest
rf = RandomForestClassifier(random_state=42)
rf_best, rf_auc, rf_acc, rf_f1, report = model_training.tune_and_evaluate_model(
    rf,
    rf_params,
    X_train,
    y_train,
    X_test,
    y_test,
    model_name="Random_Forest",
    log_dir="models/logs/approach_1",
    model_dir="models/saved_models/approach_1",
    approach="approach_1",
)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Random_Forest Best Hyperparameters: {'max_depth': 10, 'min_samples_leaf': 25, 'min_samples_split': 24, 'n_estimators': 49}
Random_Forest Test Metrics:
AUC: 0.5951
Accuracy: 0.7475
F1-Score: 0.8550


In [26]:
### 1. create dataframe from metrics add to metrics dictionary
approach_1_rf_metrics = pl.DataFrame(
    {
        "metrics": ["accuracy, %", "f1_score", "auc"],
        "approach_1_rf_metrics": [
            np.round(rf_acc * 100),
            np.round(rf_f1, 3),
            np.round(rf_auc, 3),
        ],
    }
)
METRICS_DICT["approach_1_rf_metrics"] = approach_1_rf_metrics
print(approach_1_rf_metrics)

shape: (3, 2)
┌─────────────┬───────────────────────┐
│ metrics     ┆ approach_1_rf_metrics │
│ ---         ┆ ---                   │
│ str         ┆ f64                   │
╞═════════════╪═══════════════════════╡
│ accuracy, % ┆ 75.0                  │
│ f1_score    ┆ 0.855                 │
│ auc         ┆ 0.595                 │
└─────────────┴───────────────────────┘


In [27]:
### 1. convert report to polars dataframe
approach_1_rf_report = model_training.classification_report_to_polars(report)

### 2. save report
approach_1_rf_report.write_excel("reports/approach_1/approach_1_rf_report.xlsx")
print(approach_1_rf_report)

shape: (2, 5)
┌──────────┬───────────┬────────┬──────────┬───────┐
│ category ┆ precision ┆ recall ┆ f1-score ┆ count │
│ ---      ┆ ---       ┆ ---    ┆ ---      ┆ ---   │
│ str      ┆ f64       ┆ f64    ┆ f64      ┆ i64   │
╞══════════╪═══════════╪════════╪══════════╪═══════╡
│ 0        ┆ 1.0       ┆ 0.01   ┆ 0.03     ┆ 78    │
│ 1        ┆ 0.75      ┆ 1.0    ┆ 0.85     ┆ 227   │
└──────────┴───────────┴────────┴──────────┴───────┘


### <a id='toc1_1_3_'></a>[1.3 XGboost](#toc0_)

In [28]:
### 1. Define hyperparameter distributions for each model
xgb_params = {
    "n_estimators": randint(50, 150),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
    "colsample_bytree": uniform(0.6, 1.0),
}

# Compute class distribution
counter = Counter(y_train)
scale_pos_weight = counter[0] / counter[1]  # Majority / Minority class ratio

### 2.Train XGBoost
xgb = XGBClassifier(
    use_label_encoder=False,
    eval_metric="logloss",
    random_state=42,
    scale_pos_weight=scale_pos_weight,
)
xgb_best, xgb_auc, xgb_acc, xgb_f1, report = model_training.tune_and_evaluate_model(
    xgb,
    xgb_params,
    X_train,
    y_train,
    X_test,
    y_test,
    model_name="XGBoost",
    log_dir="models/logs/approach_1",
    model_dir="models/saved_models/approach_1",
    approach="approach_1",
)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
XGBoost Best Hyperparameters: {'colsample_bytree': 0.7652669390630025, 'learning_rate': 0.01469092202235818, 'max_depth': 3, 'n_estimators': 137, 'subsample': 0.7579526072702278}
XGBoost Test Metrics:
AUC: 0.6415
Accuracy: 0.6885
F1-Score: 0.7921


In [29]:
### 1. create dataframe from metrics add to metrics dictionary
approach_1_xgboost_metrics = pl.DataFrame(
    {
        "metrics": ["accuracy, %", "f1_score", "auc"],
        "approach_1_xgboost_metrics": [
            np.round(xgb_acc * 100),
            np.round(xgb_f1, 3),
            np.round(xgb_auc, 3),
        ],
    }
)
METRICS_DICT["approach_1_xgboost_metrics"] = approach_1_xgboost_metrics
print(approach_1_xgboost_metrics)

shape: (3, 2)
┌─────────────┬────────────────────────────┐
│ metrics     ┆ approach_1_xgboost_metrics │
│ ---         ┆ ---                        │
│ str         ┆ f64                        │
╞═════════════╪════════════════════════════╡
│ accuracy, % ┆ 69.0                       │
│ f1_score    ┆ 0.792                      │
│ auc         ┆ 0.642                      │
└─────────────┴────────────────────────────┘


In [30]:
### 1. convert report to polars dataframe
approach_1_xgboost_report = model_training.classification_report_to_polars(report)

### 2. save report
approach_1_xgboost_report.write_excel(
    "reports/approach_1/approach_1_xgboost_report.xlsx"
)
print(approach_1_xgboost_report)

shape: (2, 5)
┌──────────┬───────────┬────────┬──────────┬───────┐
│ category ┆ precision ┆ recall ┆ f1-score ┆ count │
│ ---      ┆ ---       ┆ ---    ┆ ---      ┆ ---   │
│ str      ┆ f64       ┆ f64    ┆ f64      ┆ i64   │
╞══════════╪═══════════╪════════╪══════════╪═══════╡
│ 0        ┆ 0.39      ┆ 0.37   ┆ 0.38     ┆ 78    │
│ 1        ┆ 0.79      ┆ 0.8    ┆ 0.79     ┆ 227   │
└──────────┴───────────┴────────┴──────────┴───────┘


## <a id='toc1_2_'></a>[2. Approach 2: Correlation-Based Feature Selection & Outlier Removal](#toc0_)

In [31]:
### 1. download train and test sets
train_set = pl.read_excel("dataset/processed/train_processed_approach_2.xlsx")
test_set = pl.read_excel("dataset/processed/test_processed_approach_2.xlsx")

### 2. Drop "Patient ID" and handle "ER" as the target and Separate features and target
X_train = train_set.drop("Patient ID", "ER").to_numpy()
y_train = train_set["ER"].to_numpy()
X_test = test_set.drop("Patient ID", "ER").to_numpy()
y_test = test_set["ER"].to_numpy()

### <a id='toc1_2_1_'></a>[2.1 Logistics regression](#toc0_)

In [32]:
### 1. Define hyperparameter distributions for each model
log_reg_params = {
    "C": uniform(0.01, 10),
    "penalty": ["l2"],
    "solver": ["sag"],
}

### 2. Train Logistic Regression
log_reg = LogisticRegression(max_iter=1000)
log_reg_best, log_reg_auc, log_reg_acc, log_reg_f1, report = (
    model_training.tune_and_evaluate_model(
        log_reg,
        log_reg_params,
        X_train,
        y_train,
        X_test,
        y_test,
        model_name="Logistic_Regression_with_ridge_penalty",
        log_dir="models/logs/approach_2",
        model_dir="models/saved_models/approach_2",
        approach="approach_2",
    )
)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Logistic_Regression_with_ridge_penalty Best Hyperparameters: {'C': 8.334426408004218, 'penalty': 'l2', 'solver': 'sag'}
Logistic_Regression_with_ridge_penalty Test Metrics:
AUC: 0.6508
Accuracy: 0.7475
F1-Score: 0.8550


In [33]:
### 1. create dataframe from metrics add to metrics dictionary
approach_2_log_reg_metrics = pl.DataFrame(
    {
        "metrics": ["accuracy, %", "f1_score", "auc"],
        "approach_2_log_reg_metrics": [
            np.round(log_reg_acc * 100),
            np.round(log_reg_f1, 3),
            np.round(log_reg_auc, 3),
        ],
    }
)
METRICS_DICT["approach_2_log_reg_metrics"] = approach_2_log_reg_metrics
print(approach_2_log_reg_metrics)

shape: (3, 2)
┌─────────────┬────────────────────────────┐
│ metrics     ┆ approach_2_log_reg_metrics │
│ ---         ┆ ---                        │
│ str         ┆ f64                        │
╞═════════════╪════════════════════════════╡
│ accuracy, % ┆ 75.0                       │
│ f1_score    ┆ 0.855                      │
│ auc         ┆ 0.651                      │
└─────────────┴────────────────────────────┘


In [34]:
### 1. convert report to polars dataframe
approach_2_log_reg_report = model_training.classification_report_to_polars(report)

### 2. save report
approach_2_log_reg_report.write_excel(
    "reports/approach_2/approach_2_log_reg_report.xlsx"
)
print(approach_2_log_reg_report)

shape: (2, 5)
┌──────────┬───────────┬────────┬──────────┬───────┐
│ category ┆ precision ┆ recall ┆ f1-score ┆ count │
│ ---      ┆ ---       ┆ ---    ┆ ---      ┆ ---   │
│ str      ┆ f64       ┆ f64    ┆ f64      ┆ i64   │
╞══════════╪═══════════╪════════╪══════════╪═══════╡
│ 0        ┆ 1.0       ┆ 0.01   ┆ 0.03     ┆ 78    │
│ 1        ┆ 0.75      ┆ 1.0    ┆ 0.85     ┆ 227   │
└──────────┴───────────┴────────┴──────────┴───────┘


### <a id='toc1_2_2_'></a>[2.2 Random Forest](#toc0_)

In [35]:
### 1. Define hyperparameter distributions for each model
rf_params = {
    "n_estimators": randint(10, 150),
    "max_depth": [3, 5, 10, 15],
    "min_samples_split": randint(10, 30),
    "min_samples_leaf": randint(10, 30),
}

### 2. Train Random Forest
rf = RandomForestClassifier(random_state=42)
rf_best, rf_auc, rf_acc, rf_f1, report = model_training.tune_and_evaluate_model(
    rf,
    rf_params,
    X_train,
    y_train,
    X_test,
    y_test,
    model_name="Random_Forest",
    log_dir="models/logs/approach_2",
    model_dir="models/saved_models/approach_2",
    approach="approach_2",
)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Random_Forest Best Hyperparameters: {'max_depth': 3, 'min_samples_leaf': 13, 'min_samples_split': 14, 'n_estimators': 38}
Random_Forest Test Metrics:
AUC: 0.6186
Accuracy: 0.7443
F1-Score: 0.8534


In [36]:
### 1. create dataframe from metrics add to metrics dictionary
approach_2_rf_metrics = pl.DataFrame(
    {
        "metrics": ["accuracy, %", "f1_score", "auc"],
        "approach_2_rf_metrics": [
            np.round(rf_acc * 100),
            np.round(rf_f1, 3),
            np.round(rf_auc, 3),
        ],
    }
)
METRICS_DICT["approach_2_rf_metrics"] = approach_2_rf_metrics
print(approach_2_rf_metrics)

shape: (3, 2)
┌─────────────┬───────────────────────┐
│ metrics     ┆ approach_2_rf_metrics │
│ ---         ┆ ---                   │
│ str         ┆ f64                   │
╞═════════════╪═══════════════════════╡
│ accuracy, % ┆ 74.0                  │
│ f1_score    ┆ 0.853                 │
│ auc         ┆ 0.619                 │
└─────────────┴───────────────────────┘


In [37]:
### 1. convert report to polars dataframe
approach_2_rf_report = model_training.classification_report_to_polars(report)

### 2. save report
approach_2_rf_report.write_excel("reports/approach_2/approach_2_rf_report.xlsx")
print(approach_2_rf_report)

shape: (2, 5)
┌──────────┬───────────┬────────┬──────────┬───────┐
│ category ┆ precision ┆ recall ┆ f1-score ┆ count │
│ ---      ┆ ---       ┆ ---    ┆ ---      ┆ ---   │
│ str      ┆ f64       ┆ f64    ┆ f64      ┆ i64   │
╞══════════╪═══════════╪════════╪══════════╪═══════╡
│ 0        ┆ 0.0       ┆ 0.0    ┆ 0.0      ┆ 78    │
│ 1        ┆ 0.74      ┆ 1.0    ┆ 0.85     ┆ 227   │
└──────────┴───────────┴────────┴──────────┴───────┘


### <a id='toc1_2_3_'></a>[2.3 XGboost](#toc0_)

In [38]:
### 1. Define hyperparameter distributions for each model
xgb_params = {
    "n_estimators": randint(50, 150),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
    "colsample_bytree": uniform(0.6, 1.0),
}

# Compute class distribution
counter = Counter(y_train)
scale_pos_weight = counter[0] / counter[1]  # Majority / Minority class ratio

### 2.Train XGBoost
xgb = XGBClassifier(
    use_label_encoder=False,
    eval_metric="logloss",
    random_state=42,
    scale_pos_weight=scale_pos_weight,
)
xgb_best, xgb_auc, xgb_acc, xgb_f1, report = model_training.tune_and_evaluate_model(
    xgb,
    xgb_params,
    X_train,
    y_train,
    X_test,
    y_test,
    model_name="XGBoost",
    log_dir="models/logs/approach_2",
    model_dir="models/saved_models/approach_2",
    approach="approach_2",
)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
XGBoost Best Hyperparameters: {'colsample_bytree': 0.7865185103998542, 'learning_rate': 0.022232542466429174, 'max_depth': 6, 'n_estimators': 108, 'subsample': 0.6880964190262193}
XGBoost Test Metrics:
AUC: 0.6119
Accuracy: 0.7475
F1-Score: 0.8451


In [39]:
### 1. create dataframe from metrics add to metrics dictionary
approach_2_xgboost_metrics = pl.DataFrame(
    {
        "metrics": ["accuracy, %", "f1_score", "auc"],
        "approach_2_xgboost_metrics": [
            np.round(xgb_acc * 100),
            np.round(xgb_f1, 3),
            np.round(xgb_auc, 3),
        ],
    }
)
METRICS_DICT["approach_2_xgboost_metrics"] = approach_2_xgboost_metrics
print(approach_2_xgboost_metrics)

shape: (3, 2)
┌─────────────┬────────────────────────────┐
│ metrics     ┆ approach_2_xgboost_metrics │
│ ---         ┆ ---                        │
│ str         ┆ f64                        │
╞═════════════╪════════════════════════════╡
│ accuracy, % ┆ 75.0                       │
│ f1_score    ┆ 0.845                      │
│ auc         ┆ 0.612                      │
└─────────────┴────────────────────────────┘


In [40]:
### 1. convert report to polars dataframe
approach_2_xgboost_report = model_training.classification_report_to_polars(report)

### 2. save report
approach_2_xgboost_report.write_excel(
    "reports/approach_2/approach_2_xgboost_report.xlsx"
)
print(approach_2_xgboost_report)

shape: (2, 5)
┌──────────┬───────────┬────────┬──────────┬───────┐
│ category ┆ precision ┆ recall ┆ f1-score ┆ count │
│ ---      ┆ ---       ┆ ---    ┆ ---      ┆ ---   │
│ str      ┆ f64       ┆ f64    ┆ f64      ┆ i64   │
╞══════════╪═══════════╪════════╪══════════╪═══════╡
│ 0        ┆ 0.51      ┆ 0.23   ┆ 0.32     ┆ 78    │
│ 1        ┆ 0.78      ┆ 0.93   ┆ 0.85     ┆ 227   │
└──────────┴───────────┴────────┴──────────┴───────┘


## <a id='toc1_3_'></a>[3. Approach 3: Statistical Distribution-Based Feature Selection](#toc0_)

In [41]:
### 1. download train and test sets
train_set = pl.read_excel("dataset/processed/train_processed_approach_3.xlsx")
test_set = pl.read_excel("dataset/processed/test_processed_approach_3.xlsx")

### 2. Drop "Patient ID" and handle "ER" as the target and Separate features and target
X_train = train_set.drop("Patient ID", "ER").to_numpy()
y_train = train_set["ER"].to_numpy()
X_test = test_set.drop("Patient ID", "ER").to_numpy()
y_test = test_set["ER"].to_numpy()

### <a id='toc1_3_1_'></a>[3.1 Logistics regression](#toc0_)

In [42]:
### 1. Define hyperparameter distributions for each model
log_reg_params = {
    "C": uniform(0.01, 10),
    "penalty": ["l2"],
    "solver": ["sag"],
}

### 2. Train Logistic Regression
log_reg = LogisticRegression(max_iter=1000)
log_reg_best, log_reg_auc, log_reg_acc, log_reg_f1, report = (
    model_training.tune_and_evaluate_model(
        log_reg,
        log_reg_params,
        X_train,
        y_train,
        X_test,
        y_test,
        model_name="Logistic_Regression_with_ridge_penalty",
        log_dir="models/logs/approach_3",
        model_dir="models/saved_models/approach_3",
        approach="approach_3",
    )
)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Logistic_Regression_with_ridge_penalty Best Hyperparameters: {'C': 3.7554011884736247, 'penalty': 'l2', 'solver': 'sag'}
Logistic_Regression_with_ridge_penalty Test Metrics:
AUC: 0.6605
Accuracy: 0.7475
F1-Score: 0.8550


In [43]:
### 1. create dataframe from metrics add to metrics dictionary
approach_3_log_reg_metrics = pl.DataFrame(
    {
        "metrics": ["accuracy, %", "f1_score", "auc"],
        "approach_3_log_reg_metrics": [
            np.round(log_reg_acc * 100),
            np.round(log_reg_f1, 3),
            np.round(log_reg_auc, 3),
        ],
    }
)
METRICS_DICT["approach_3_log_reg_metrics"] = approach_3_log_reg_metrics
print(approach_3_log_reg_metrics)

shape: (3, 2)
┌─────────────┬────────────────────────────┐
│ metrics     ┆ approach_3_log_reg_metrics │
│ ---         ┆ ---                        │
│ str         ┆ f64                        │
╞═════════════╪════════════════════════════╡
│ accuracy, % ┆ 75.0                       │
│ f1_score    ┆ 0.855                      │
│ auc         ┆ 0.66                       │
└─────────────┴────────────────────────────┘


In [44]:
### 1. convert report to polars dataframe
approach_3_log_reg_report = model_training.classification_report_to_polars(report)

### 2. save report
approach_3_log_reg_report.write_excel(
    "reports/approach_3/approach_3_log_reg_report.xlsx"
)
print(approach_3_log_reg_report)

shape: (2, 5)
┌──────────┬───────────┬────────┬──────────┬───────┐
│ category ┆ precision ┆ recall ┆ f1-score ┆ count │
│ ---      ┆ ---       ┆ ---    ┆ ---      ┆ ---   │
│ str      ┆ f64       ┆ f64    ┆ f64      ┆ i64   │
╞══════════╪═══════════╪════════╪══════════╪═══════╡
│ 0        ┆ 1.0       ┆ 0.01   ┆ 0.03     ┆ 78    │
│ 1        ┆ 0.75      ┆ 1.0    ┆ 0.85     ┆ 227   │
└──────────┴───────────┴────────┴──────────┴───────┘


### <a id='toc1_3_2_'></a>[3.2 Random Forest](#toc0_)

In [45]:
### 1. Define hyperparameter distributions for each model
rf_params = {
    "n_estimators": randint(10, 150),
    "max_depth": [3, 5, 10, 15],
    "min_samples_split": randint(10, 30),
    "min_samples_leaf": randint(10, 30),
}

### 2. Train Random Forest
rf = RandomForestClassifier(random_state=42)
rf_best, rf_auc, rf_acc, rf_f1, report = model_training.tune_and_evaluate_model(
    rf,
    rf_params,
    X_train,
    y_train,
    X_test,
    y_test,
    model_name="Random_Forest",
    log_dir="models/logs/approach_3",
    model_dir="models/saved_models/approach_3",
    approach="approach_3",
)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Random_Forest Best Hyperparameters: {'max_depth': 5, 'min_samples_leaf': 15, 'min_samples_split': 21, 'n_estimators': 53}
Random_Forest Test Metrics:
AUC: 0.5999
Accuracy: 0.7508
F1-Score: 0.8555


In [46]:
### 1. create dataframe from metrics add to metrics dictionary
approach_3_rf_metrics = pl.DataFrame(
    {
        "metrics": ["accuracy, %", "f1_score", "auc"],
        "approach_3_rf_metrics": [
            np.round(rf_acc * 100),
            np.round(rf_f1, 3),
            np.round(rf_auc, 3),
        ],
    }
)
METRICS_DICT["approach_3_rf_metrics"] = approach_3_rf_metrics
print(approach_3_rf_metrics)

shape: (3, 2)
┌─────────────┬───────────────────────┐
│ metrics     ┆ approach_3_rf_metrics │
│ ---         ┆ ---                   │
│ str         ┆ f64                   │
╞═════════════╪═══════════════════════╡
│ accuracy, % ┆ 75.0                  │
│ f1_score    ┆ 0.856                 │
│ auc         ┆ 0.6                   │
└─────────────┴───────────────────────┘


In [47]:
### 1. convert report to polars dataframe
approach_3_rf_report = model_training.classification_report_to_polars(report)

### 2. save report
approach_3_rf_report.write_excel("reports/approach_3/approach_3_rf_report.xlsx")
print(approach_3_rf_report)

shape: (2, 5)
┌──────────┬───────────┬────────┬──────────┬───────┐
│ category ┆ precision ┆ recall ┆ f1-score ┆ count │
│ ---      ┆ ---       ┆ ---    ┆ ---      ┆ ---   │
│ str      ┆ f64       ┆ f64    ┆ f64      ┆ i64   │
╞══════════╪═══════════╪════════╪══════════╪═══════╡
│ 0        ┆ 0.67      ┆ 0.05   ┆ 0.1      ┆ 78    │
│ 1        ┆ 0.75      ┆ 0.99   ┆ 0.86     ┆ 227   │
└──────────┴───────────┴────────┴──────────┴───────┘


### <a id='toc1_3_3_'></a>[3.3 XGboost](#toc0_)

In [48]:
### 1. Define hyperparameter distributions for each model
xgb_params = {
    "n_estimators": randint(50, 150),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
    "colsample_bytree": uniform(0.6, 1.0),
}
# Compute class distribution
counter = Counter(y_train)
scale_pos_weight = counter[0] / counter[1]  # Majority / Minority class ratio

### 2.Train XGBoost
xgb = XGBClassifier(
    use_label_encoder=False,
    eval_metric="aucpr",
    random_state=42,
    scale_pos_weight=scale_pos_weight,
)  # Balance classes)
xgb_best, xgb_auc, xgb_acc, xgb_f1, report = model_training.tune_and_evaluate_model(
    xgb,
    xgb_params,
    X_train,
    y_train,
    X_test,
    y_test,
    model_name="XGBoost",
    log_dir="models/logs/approach_3",
    model_dir="models/saved_models/approach_3",
    approach="approach_3",
)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
XGBoost Best Hyperparameters: {'colsample_bytree': 0.7652669390630025, 'learning_rate': 0.01469092202235818, 'max_depth': 3, 'n_estimators': 137, 'subsample': 0.7579526072702278}
XGBoost Test Metrics:
AUC: 0.6487
Accuracy: 0.6918
F1-Score: 0.7902


In [49]:
### 1. create dataframe from metrics add to metrics dictionary
approach_3_xgboost_metrics = pl.DataFrame(
    {
        "metrics": ["accuracy, %", "f1_score", "auc"],
        "approach_3_xgboost_metrics": [
            np.round(xgb_acc * 100),
            np.round(xgb_f1, 3),
            np.round(xgb_auc, 3),
        ],
    }
)
METRICS_DICT["approach_3_xgboost_metrics"] = approach_3_xgboost_metrics
print(approach_3_xgboost_metrics)

shape: (3, 2)
┌─────────────┬────────────────────────────┐
│ metrics     ┆ approach_3_xgboost_metrics │
│ ---         ┆ ---                        │
│ str         ┆ f64                        │
╞═════════════╪════════════════════════════╡
│ accuracy, % ┆ 69.0                       │
│ f1_score    ┆ 0.79                       │
│ auc         ┆ 0.649                      │
└─────────────┴────────────────────────────┘


In [50]:
### 1. convert report to polars dataframe
approach_3_xgboost_report = model_training.classification_report_to_polars(report)

### 2. save report
approach_3_xgboost_report.write_excel(
    "reports/approach_3/approach_3_xgboost_report.xlsx"
)
print(approach_3_xgboost_report)

shape: (2, 5)
┌──────────┬───────────┬────────┬──────────┬───────┐
│ category ┆ precision ┆ recall ┆ f1-score ┆ count │
│ ---      ┆ ---       ┆ ---    ┆ ---      ┆ ---   │
│ str      ┆ f64       ┆ f64    ┆ f64      ┆ i64   │
╞══════════╪═══════════╪════════╪══════════╪═══════╡
│ 0        ┆ 0.4       ┆ 0.44   ┆ 0.42     ┆ 78    │
│ 1        ┆ 0.8       ┆ 0.78   ┆ 0.79     ┆ 227   │
└──────────┴───────────┴────────┴──────────┴───────┘


## <a id='toc1_4_'></a>[4. Approach 4: Non-Linear Dimensionality Reduction & Scaling](#toc0_)

In [51]:
### 1. download train and test sets
train_set = pl.read_excel(
    "dataset/processed/approach_4/train_processed_approach_4.xlsx"
)
test_set = pl.read_excel("dataset/processed/approach_4/test_processed_approach_4.xlsx")

### 2. Drop "Patient ID" and handle "ER" as the target and Separate features and target
X_train = train_set.drop("Patient ID", "ER").to_numpy()
y_train = train_set["ER"].to_numpy()
X_test = test_set.drop("Patient ID", "ER").to_numpy()
y_test = test_set["ER"].to_numpy()

### <a id='toc1_4_1_'></a>[4.1 Logistics regression](#toc0_)

In [52]:
### 1. Define hyperparameter distributions for each model
log_reg_params = {
    "C": uniform(0.01, 10),
    "penalty": ["l2"],
    "solver": ["sag"],
}

### 2. Train Logistic Regression
log_reg = LogisticRegression(max_iter=1000)
log_reg_best, log_reg_auc, log_reg_acc, log_reg_f1, report = (
    model_training.tune_and_evaluate_model(
        log_reg,
        log_reg_params,
        X_train,
        y_train,
        X_test,
        y_test,
        model_name="Logistic_Regression_with_ridge_penalty",
        log_dir="models/logs/approach_4",
        model_dir="models/saved_models/approach_4",
        approach="approach_4",
    )
)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Logistic_Regression_with_ridge_penalty Best Hyperparameters: {'C': 9.666320330745593, 'penalty': 'l2', 'solver': 'sag'}
Logistic_Regression_with_ridge_penalty Test Metrics:
AUC: 0.5533
Accuracy: 0.4754
F1-Score: 0.5349


In [53]:
### 1. create dataframe from metrics add to metrics dictionary
approach_4_log_reg_metrics = pl.DataFrame(
    {
        "metrics": ["accuracy, %", "f1_score", "auc"],
        "approach_4_log_reg_metrics": [
            np.round(log_reg_acc * 100),
            np.round(log_reg_f1, 3),
            np.round(log_reg_auc, 3),
        ],
    }
)
METRICS_DICT["approach_4_log_reg_metrics"] = approach_4_log_reg_metrics
print(approach_4_log_reg_metrics)

shape: (3, 2)
┌─────────────┬────────────────────────────┐
│ metrics     ┆ approach_4_log_reg_metrics │
│ ---         ┆ ---                        │
│ str         ┆ f64                        │
╞═════════════╪════════════════════════════╡
│ accuracy, % ┆ 48.0                       │
│ f1_score    ┆ 0.535                      │
│ auc         ┆ 0.553                      │
└─────────────┴────────────────────────────┘


In [54]:
### 1. convert report to polars dataframe
approach_4_log_reg_report = model_training.classification_report_to_polars(report)

### 2. save report
approach_4_log_reg_report.write_excel(
    "reports/approach_4/approach_4_log_reg_report.xlsx"
)
print(approach_4_log_reg_report)

shape: (2, 5)
┌──────────┬───────────┬────────┬──────────┬───────┐
│ category ┆ precision ┆ recall ┆ f1-score ┆ count │
│ ---      ┆ ---       ┆ ---    ┆ ---      ┆ ---   │
│ str      ┆ f64       ┆ f64    ┆ f64      ┆ i64   │
╞══════════╪═══════════╪════════╪══════════╪═══════╡
│ 0        ┆ 0.28      ┆ 0.68   ┆ 0.4      ┆ 78    │
│ 1        ┆ 0.79      ┆ 0.41   ┆ 0.53     ┆ 227   │
└──────────┴───────────┴────────┴──────────┴───────┘


### <a id='toc1_4_2_'></a>[4.2 Random Forest](#toc0_)

In [55]:
### 1. Define hyperparameter distributions for each model
rf_params = {
    "n_estimators": randint(10, 150),
    "max_depth": [3, 5, 10, 15],
    "min_samples_split": randint(10, 30),
    "min_samples_leaf": randint(10, 30),
}

### 2. Train Random Forest
rf = RandomForestClassifier(random_state=42)
rf_best, rf_auc, rf_acc, rf_f1, report = model_training.tune_and_evaluate_model(
    rf,
    rf_params,
    X_train,
    y_train,
    X_test,
    y_test,
    model_name="Random_Forest",
    log_dir="models/logs/approach_4",
    model_dir="models/saved_models/approach_4",
    approach="approach_4",
)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Random_Forest Best Hyperparameters: {'max_depth': 15, 'min_samples_leaf': 10, 'min_samples_split': 21, 'n_estimators': 145}
Random_Forest Test Metrics:
AUC: 0.5285
Accuracy: 0.7443
F1-Score: 0.8534


In [56]:
### 1. create dataframe from metrics add to metrics dictionary
approach_4_rf_metrics = pl.DataFrame(
    {
        "metrics": ["accuracy, %", "f1_score", "auc"],
        "approach_4_rf_metrics": [
            np.round(rf_acc * 100),
            np.round(rf_f1, 3),
            np.round(rf_auc, 3),
        ],
    }
)
METRICS_DICT["approach_4_rf_metrics"] = approach_4_rf_metrics
print(approach_4_rf_metrics)

shape: (3, 2)
┌─────────────┬───────────────────────┐
│ metrics     ┆ approach_4_rf_metrics │
│ ---         ┆ ---                   │
│ str         ┆ f64                   │
╞═════════════╪═══════════════════════╡
│ accuracy, % ┆ 74.0                  │
│ f1_score    ┆ 0.853                 │
│ auc         ┆ 0.529                 │
└─────────────┴───────────────────────┘


In [57]:
### 1. convert report to polars dataframe
approach_4_rf_report = model_training.classification_report_to_polars(report)

### 2. save report
approach_4_rf_report.write_excel("reports/approach_4/approach_4_rf_report.xlsx")
print(approach_4_rf_report)

shape: (2, 5)
┌──────────┬───────────┬────────┬──────────┬───────┐
│ category ┆ precision ┆ recall ┆ f1-score ┆ count │
│ ---      ┆ ---       ┆ ---    ┆ ---      ┆ ---   │
│ str      ┆ f64       ┆ f64    ┆ f64      ┆ i64   │
╞══════════╪═══════════╪════════╪══════════╪═══════╡
│ 0        ┆ 0.0       ┆ 0.0    ┆ 0.0      ┆ 78    │
│ 1        ┆ 0.74      ┆ 1.0    ┆ 0.85     ┆ 227   │
└──────────┴───────────┴────────┴──────────┴───────┘


### <a id='toc1_4_3_'></a>[4.3 XGboost](#toc0_)

In [58]:
### 1. Define hyperparameter distributions for each model
xgb_params = {
    "n_estimators": randint(50, 150),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
    "colsample_bytree": uniform(0.6, 1.0),
}
# Compute class distribution
counter = Counter(y_train)
scale_pos_weight = counter[0] / counter[1]  # Majority / Minority class ratio

### 2.Train XGBoost
xgb = XGBClassifier(
    use_label_encoder=False,
    eval_metric="logloss",
    random_state=42,
    scale_pos_weight=scale_pos_weight,
)
xgb_best, xgb_auc, xgb_acc, xgb_f1, report = model_training.tune_and_evaluate_model(
    xgb,
    xgb_params,
    X_train,
    y_train,
    X_test,
    y_test,
    model_name="XGBoost",
    log_dir="models/logs/approach_4",
    model_dir="models/saved_models/approach_4",
    approach="approach_4",
)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
XGBoost Best Hyperparameters: {'colsample_bytree': 0.9745401188473625, 'learning_rate': 0.2952142919229748, 'max_depth': 5, 'n_estimators': 121, 'subsample': 0.8394633936788146}
XGBoost Test Metrics:
AUC: 0.5317
Accuracy: 0.7443
F1-Score: 0.8534


In [59]:
### 1. create dataframe from metrics add to metrics dictionary
approach_4_xgboost_metrics = pl.DataFrame(
    {
        "metrics": ["accuracy, %", "f1_score", "auc"],
        "approach_4_xgboost_metrics": [
            np.round(xgb_acc * 100),
            np.round(xgb_f1, 3),
            np.round(xgb_auc, 3),
        ],
    }
)
METRICS_DICT["approach_4_xgboost_metrics"] = approach_4_xgboost_metrics
print(approach_4_xgboost_metrics)

shape: (3, 2)
┌─────────────┬────────────────────────────┐
│ metrics     ┆ approach_4_xgboost_metrics │
│ ---         ┆ ---                        │
│ str         ┆ f64                        │
╞═════════════╪════════════════════════════╡
│ accuracy, % ┆ 74.0                       │
│ f1_score    ┆ 0.853                      │
│ auc         ┆ 0.532                      │
└─────────────┴────────────────────────────┘


In [60]:
### 1. convert report to polars dataframe
approach_4_xgboost_report = model_training.classification_report_to_polars(report)

### 2. save report
approach_4_xgboost_report.write_excel(
    "reports/approach_4/approach_4_xgboost_report.xlsx"
)
print(approach_4_xgboost_report)

shape: (2, 5)
┌──────────┬───────────┬────────┬──────────┬───────┐
│ category ┆ precision ┆ recall ┆ f1-score ┆ count │
│ ---      ┆ ---       ┆ ---    ┆ ---      ┆ ---   │
│ str      ┆ f64       ┆ f64    ┆ f64      ┆ i64   │
╞══════════╪═══════════╪════════╪══════════╪═══════╡
│ 0        ┆ 0.0       ┆ 0.0    ┆ 0.0      ┆ 78    │
│ 1        ┆ 0.74      ┆ 1.0    ┆ 0.85     ┆ 227   │
└──────────┴───────────┴────────┴──────────┴───────┘


## <a id='toc1_5_'></a>[5. Conclusion](#toc0_)

In [61]:
### save results across all the models as excel file
model_metrics = pl.DataFrame(METRICS_DICT["test_metrics"])
model_type = model_metrics.columns[1]
model_metrics = model_metrics.with_columns(pl.lit(model_type).alias("model_type"))
model_metrics = model_metrics.pivot(
    values=model_metrics.columns[1], index=["model_type"], columns="metrics"
)
for dataset in list(METRICS_DICT.keys())[1:]:
    new_data = pl.DataFrame(METRICS_DICT[dataset])
    model_type = new_data.columns[1]
    new_data = new_data.with_columns(pl.lit(model_type).alias("model_type"))
    pivoted_df = new_data.pivot(
        values=new_data.columns[1], index=["model_type"], columns="metrics"
    )
    model_metrics = model_metrics.vstack(pivoted_df)

model_metrics = model_metrics.sort("auc", descending=True)

model_metrics.write_excel("reports/model_metrics.xlsx")

<xlsxwriter.workbook.Workbook at 0x7483705a9a90>