## Task 3 – Repeated Nested Cross-Validation: Assess generalization performance and select the winner algorithm

This notebook implements Task 3 of the assignment: using the `RepeatedNestedCV` class (developed in `src/rnCV.py`) to compare different classification algorithms, select the best-performing one based on robust metrics and statistical analysis (Confidence Intervals), and save the results.

### Imports and Setup

Imports the necessary libraries: `pandas`, `numpy`, `scikit-learn` components (preprocessing, models, metrics), `LightGBM`, `SciPy` (for bootstrap CIs), and standard Python modules (`os`, `sys`, `importlib`, `warnings`, `time`). The `src` directory is added to the Python path so that the custom `RepeatedNestedCV` class can be imported from `rnCV.py`. The module is reloaded so that recent edits to it take effect without restarting the kernel, and the import is wrapped in error handling.

In [1]:
# --- Imports ---
import os
import sys
import importlib
import pandas as pd
import numpy as np
import warnings
import time
import joblib # Although not used here, often used in ML workflows

# Scikit-learn imports
from sklearn.preprocessing import LabelEncoder, RobustScaler, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score # Example metric import

# LightGBM
from lightgbm import LGBMClassifier

# Scipy for CIs
from scipy.stats import bootstrap

# --- Add src directory to path and import custom class ---
# Assuming the script/notebook is run from a location where '../src' is valid
# Adjust path as necessary depending on where you run this final script
try:
    sys.path.append(os.path.abspath("../src"))
    # Import rnCV class (adjust name if needed)
    import rnCV
    importlib.reload(rnCV) # Reload if changes were made
    from rnCV import RepeatedNestedCV
    print("Imported RepeatedNestedCV from src directory.")
except ImportError as e:
    # A missing module raises ModuleNotFoundError, which is a subclass of ImportError
    print("Error: Could not import RepeatedNestedCV from ../src/rnCV.py.")
    print(f"Details: {e}")
    print("Ensure the file exists and the path is correct relative to your execution location.")
    sys.exit(1)
except FileNotFoundError:
    print("Error: Could not find the src directory or the rnCV class file.")
    sys.exit(1)

Imported RepeatedNestedCV from src directory.
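
The reload step can be illustrated in isolation. The following is a minimal sketch, not part of the assignment code: it writes a throwaway module (the name `demo_module` is hypothetical) to a temporary directory, imports it, edits the file, and calls `importlib.reload` to pick up the change. Bytecode caching is switched off for the demo, an assumption made here so that the edited source is always re-read.

```python
import importlib
import sys
import tempfile
from pathlib import Path

previous_flag = sys.dont_write_bytecode
sys.dont_write_bytecode = True  # skip .pyc caching so the demo always re-reads the source

with tempfile.TemporaryDirectory() as tmp_dir:
    module_file = Path(tmp_dir) / "demo_module.py"  # hypothetical module name
    module_file.write_text("VALUE = 1\n")

    sys.path.append(tmp_dir)          # same idea as appending '../src' above
    import demo_module
    first = demo_module.VALUE         # value from the first version of the file

    module_file.write_text("VALUE = 22\n")
    importlib.reload(demo_module)     # re-executes the module's source in place
    second = demo_module.VALUE        # value after the edit

    # Clean up so the demo leaves no trace in the session
    sys.path.remove(tmp_dir)
    del sys.modules["demo_module"]

sys.dont_write_bytecode = previous_flag
print(first, second)
```

Note that `reload` updates the module object itself; names bound earlier with `from module import name` keep pointing at the old objects, which is why the notebook re-runs `from rnCV import RepeatedNestedCV` after reloading.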


### Configuration

Defines the key configuration variables for model selection: the dataset path; the target and ID column names; the Repeated Nested Cross-Validation parameters (10 repeats, 5 outer splits, 3 inner splits, giving 50 outer test folds in total); the random state for reproducibility; the scaler (`RobustScaler`, chosen based on the EDA); the metric optimized in the inner loop (`roc_auc`); the bootstrap CI settings (confidence level and number of resamples); and the output paths for the detailed rnCV results and the summary statistics. The results directory is created if it does not already exist.

In [2]:
# --- Configuration ---
DATASET_PATH = '../data/breast_cancer.csv' # Adjusted relative path
TARGET_VARIABLE = 'diagnosis'
ID_COLUMN = 'id'
N_REPEATS = 10
N_OUTER_SPLITS = 5
N_INNER_SPLITS = 3
RANDOM_STATE = 42
SCALER = RobustScaler # Using RobustScaler based on EDA
INNER_CV_METRIC = 'roc_auc' # Metric for hyperparameter tuning in rnCV

# Bootstrap settings
CONFIDENCE_LEVEL = 0.95
N_BOOTSTRAP_RESAMPLES = 5000

# Output file paths (relative to script execution location)
RESULTS_DIR = '../results' # Example: Save results in a dedicated folder
RESULTS_FILE = os.path.join(RESULTS_DIR, 'rnCV_results.csv')
STATS_FILE = os.path.join(RESULTS_DIR, 'rnCV_summary_stats_CIs.csv')

# Create results directory if it doesn't exist
os.makedirs(RESULTS_DIR, exist_ok=True)
print(f"Ensuring results directory exists: {RESULTS_DIR}")

Ensuring results directory exists: ../results
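
To make the scaler choice concrete, here is a small sketch on toy data (the values are made up for illustration and are not from the dataset). With its default settings, `RobustScaler` centers on the median and divides by the interquartile range, so a single outlier barely affects how the remaining values are scaled, whereas `StandardScaler` uses the mean and standard deviation, both of which the outlier inflates.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy feature with one extreme value (hypothetical data)
feature = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit_transform(feature).ravel()
standard = StandardScaler().fit_transform(feature).ravel()

# Manual equivalent of RobustScaler's defaults: (x - median) / (Q3 - Q1)
median = np.median(feature)
q25, q75 = np.percentile(feature, [25, 75])
manual = ((feature - median) / (q75 - q25)).ravel()

print("RobustScaler  :", robust)
print("Manual formula:", manual)
print("StandardScaler:", standard)
```

In the robust version the four ordinary values stay evenly spread around zero; in the standardized version they are squeezed together because the outlier dominates the standard deviation.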


### 1. Load and Prepare Data

Loads the dataset from the configured path. Performs initial data preparation steps required before the main rnCV analysis: drops the non-informative ID column, encodes the categorical target variable (`diagnosis`) into numerical format (0/1), and separates the data into features (`X`) and the target variable (`y`). Includes error handling for file loading and target variable validation.

In [3]:
# --- 1. Load and Prepare Data ---
print("\n--- 1. Loading and Preparing Data ---")
try:
    df = pd.read_csv(DATASET_PATH)
    print(f"Dataset loaded successfully from: {DATASET_PATH}")
except FileNotFoundError:
    print(f"Error: Dataset file not found at {DATASET_PATH}", file=sys.stderr)
    sys.exit(1)
except Exception as e:
    print(f"An error occurred while loading the dataset: {e}", file=sys.stderr)
    sys.exit(1)

# Drop the ID column
if ID_COLUMN in df.columns:
    df = df.drop(columns=[ID_COLUMN])
    print(f"Dropped ID column: '{ID_COLUMN}'")

# Encode the target variable
if TARGET_VARIABLE in df.columns:
    if not pd.api.types.is_numeric_dtype(df[TARGET_VARIABLE]):  # encode any non-numeric labels (object or string dtype)
        le = LabelEncoder()
        df[TARGET_VARIABLE] = le.fit_transform(df[TARGET_VARIABLE])
        print(f"Target variable '{TARGET_VARIABLE}' encoded.")
    X = df.drop(TARGET_VARIABLE, axis=1)
    y = df[TARGET_VARIABLE]
    print("Features (X) and target (y) separated.")
else:
    print(f"Error: Target variable '{TARGET_VARIABLE}' not found.", file=sys.stderr)
    sys.exit(1)


--- 1. Loading and Preparing Data ---
Dataset loaded successfully from: ../data/breast_cancer.csv
Dropped ID column: 'id'
Target variable 'diagnosis' encoded.
Features (X) and target (y) separated.
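
It helps to know which class ends up as 1, since metrics such as recall and precision are computed for the positive class. `LabelEncoder` assigns codes in sorted order of the labels. The sketch below assumes the `diagnosis` column holds the strings `'B'` and `'M'` (an assumption; the raw values are not shown in this notebook), in which case `'B'` maps to 0 and `'M'` to 1.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of raw labels; the real column values are an assumption
labels = ["M", "B", "B", "M"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ is sorted, and each class's position is its integer code
class_to_code = {str(cls): int(code) for code, cls in enumerate(le_demo.classes_)}
print(class_to_code)
print(encoded.tolist())
```

On the real data, inspecting `le.classes_` after fitting confirms the mapping actually used.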


### 2. Define Estimators and Grids

Defines the six classification algorithms (estimators) to be compared: Logistic Regression with an elastic-net penalty (LR_ElasticNet), Gaussian Naive Bayes (GNB), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), Random Forest (RF), and LightGBM (LGBM). Each estimator is instantiated with the settings it needs (e.g., `probability=True` so that SVM exposes `predict_proba`, `verbosity=-1` to silence LGBM logging). The `param_grids` dictionary holds, for each estimator, the hyperparameter search space used in the inner loop of the rnCV; the grids were deliberately kept small to limit runtime.

In [4]:
# --- 2. Define Estimators and Hyperparameter Grids ---
print("\n--- 2. Defining Estimators and Hyperparameter Grids ---")
estimators = {
    'LR_ElasticNet': LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000, random_state=RANDOM_STATE),
    'GNB': GaussianNB(),
    'LDA': LinearDiscriminantAnalysis(),
    'SVM': SVC(probability=True, random_state=RANDOM_STATE),
    'RF': RandomForestClassifier(random_state=RANDOM_STATE),
    'LGBM': LGBMClassifier(random_state=RANDOM_STATE, verbosity=-1)
}

param_grids = {
    'LR_ElasticNet': {'C': [0.01, 0.1, 1, 10], 'l1_ratio': [0, 0.25, 0.5, 0.75, 1]},
    'GNB': {'var_smoothing': np.logspace(-9, -2, 8)},
    'LDA': {'solver': ['svd', 'lsqr', 'eigen'], 'shrinkage': [None]},
    'SVM': {'C': [1, 10], 'gamma': ['scale', 0.01], 'kernel': ['rbf']},
    'RF': {'n_estimators': [75, 150], 'max_depth': [None, 20], 'min_samples_split': [5], 'min_samples_leaf': [3]},
    'LGBM': {'n_estimators': [75, 150], 'learning_rate': [0.05, 0.1], 'num_leaves': [31], 'max_depth': [-1]}
}


--- 2. Defining Estimators and Hyperparameter Grids ---
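
The grid sizes translate directly into runtime. The sketch below counts the inner-loop model fits implied by the grids above and the CV settings from the configuration cell, assuming the inner loop evaluates every grid combination exhaustively (an assumption, since the internals of `rnCV.py` are not shown here). Any final refit on each outer training fold would come on top of these counts.

```python
from math import prod

# Number of candidate values per hyperparameter, read off param_grids above
grid_sizes = {
    "LR_ElasticNet": {"C": 4, "l1_ratio": 5},
    "GNB": {"var_smoothing": 8},
    "LDA": {"solver": 3, "shrinkage": 1},
    "SVM": {"C": 2, "gamma": 2, "kernel": 1},
    "RF": {"n_estimators": 2, "max_depth": 2, "min_samples_split": 1, "min_samples_leaf": 1},
    "LGBM": {"n_estimators": 2, "learning_rate": 2, "num_leaves": 1, "max_depth": 1},
}
N_REPEATS, N_OUTER_SPLITS, N_INNER_SPLITS = 10, 5, 3

# Combinations per estimator = product of the per-parameter counts
combos = {name: prod(sizes.values()) for name, sizes in grid_sizes.items()}

outer_folds = N_REPEATS * N_OUTER_SPLITS
inner_fits = {name: n * N_INNER_SPLITS * outer_folds for name, n in combos.items()}
total_inner_fits = sum(inner_fits.values())

for name in combos:
    print(f"{name:14s} combos={combos[name]:3d} inner fits={inner_fits[name]:5d}")
print(f"Outer folds: {outer_folds}, total inner fits: {total_inner_fits}")
```

Under these assumptions LR_ElasticNet accounts for nearly half of all inner fits, so its grid is the first place to look if runtime needs to be reduced further.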


### 3. Instantiate and Run rnCV

Instantiates the `RepeatedNestedCV` class using the sequential version defined in `src/rnCV.py`, passing the prepared data, estimators, parameter grids and CV configurations. Includes logic to first check if a results file (`RESULTS_FILE`) from a previous run exists; if found, the results are loaded to avoid re-computation. Otherwise, the `rn_cv.run()` method is executed to perform the full Repeated Nested Cross-Validation. Execution time for this potentially long step is measured and printed. If the rnCV was run, the detailed results DataFrame (`results_df`) is saved to a CSV file.

In [5]:
# --- 3. Instantiate and Run RepeatedNestedCV ---
print("\n--- 3. Instantiating and Running RepeatedNestedCV ---")
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)

start_run_time = time.time()

# Check if results already exist to avoid re-running
try:
    results_df = pd.read_csv(RESULTS_FILE)
    print(f"\nLoaded existing results from {RESULTS_FILE}")
    run_needed = False
except FileNotFoundError:
    print("\nNo existing results file found. Running RepeatedNestedCV...")
    rn_cv = RepeatedNestedCV(
        X=X, y=y, estimators=estimators, param_grids=param_grids,
        n_outer_splits=N_OUTER_SPLITS, n_inner_splits=N_INNER_SPLITS, n_repeats=N_REPEATS,
        inner_cv_metric=INNER_CV_METRIC, scaler=SCALER, random_state=RANDOM_STATE
    )
    results_df = rn_cv.run()
    run_needed = True # Flag that run was executed

end_run_time = time.time()
print(f"\nTotal execution time for section 3: {(end_run_time - start_run_time) / 60:.2f} minutes")

# Save results if the run was just executed
if run_needed:
    try:
        results_df.to_csv(RESULTS_FILE, index=False)
        print(f"Full results saved to {RESULTS_FILE}")
    except Exception as e:
        print(f"Error saving results: {e}", file=sys.stderr)


--- 3. Instantiating and Running RepeatedNestedCV ---

No existing results file found. Running RepeatedNestedCV...
Starting Repeated Nested CV (10 repeats, 5 outer folds, 3 inner folds)

--- Repeat 1/10 ---
  Outer Fold 1/5 (Overall Fold 1)
    Processing Estimator: LR_ElasticNet
      Best params for LR_ElasticNet in fold 1: {'C': 0.1, 'l1_ratio': 0.25} (Inner roc_auc: 0.9917)
      Outer test metrics for LR_ElasticNet: {'accuracy': '0.9903', 'balanced_accuracy': '0.9872', 'f1': '0.9870', 'f2': '0.9794', 'precision': '1.0000', 'recall': '0.9744', 'sensitivity': '0.9744', 'specificity': '1.0000', 'npv': '0.9846', 'mcc': '0.9795', 'roc_auc': '1.0000', 'pr_auc': '1.0000', 'log_loss': '0.0910'}
    Processing Estimator: GNB
      Best params for GNB in fold 1: {'var_smoothing': np.float64(0.01)} (Inner roc_auc: 0.9814)
      Outer test metrics for GNB: {'accuracy': '0.9417', 'balanced_accuracy': '0.9431', 'f1': '0.9250', 'f2': '0.9391', 'precision': '0.9024', 'recall': '0.9487', 'sensiti
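
For readers without access to `src/rnCV.py`, the general technique can be sketched with scikit-learn building blocks. This is an illustrative sketch under stated assumptions, not the `RepeatedNestedCV` implementation: it uses synthetic data, a single estimator with a two-value grid, and a hypothetical helper named `repeated_nested_cv`. The key point it shows is that hyperparameters are tuned only on the outer training fold (inner loop), while the outer test fold is used once for scoring.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)


def repeated_nested_cv(X, y, n_repeats=2, n_outer=3, n_inner=2, seed=0):
    """Return one outer-fold ROC AUC per (repeat, outer fold) pair."""
    outer_scores = []
    for repeat in range(n_repeats):
        # A different shuffle per repeat gives different outer partitions
        outer_cv = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=seed + repeat)
        for train_idx, test_idx in outer_cv.split(X, y):
            inner_cv = StratifiedKFold(n_splits=n_inner, shuffle=True, random_state=seed + repeat)
            # Scaling lives inside the pipeline so it is fit on training data only
            pipe = Pipeline([
                ("scale", RobustScaler()),
                ("clf", LogisticRegression(max_iter=1000)),
            ])
            search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0]}, scoring="roc_auc", cv=inner_cv)
            search.fit(X[train_idx], y[train_idx])  # inner loop: tuning + refit
            proba = search.predict_proba(X[test_idx])[:, 1]
            outer_scores.append(roc_auc_score(y[test_idx], proba))  # outer loop: evaluation
    return np.array(outer_scores)


scores = repeated_nested_cv(X_demo, y_demo)
print(f"{len(scores)} outer-fold scores, median ROC AUC = {np.median(scores):.4f}")
```

With the notebook's settings (10 repeats, 5 outer splits) the same structure yields the 50 outer-fold scores per estimator that are analyzed in the next section.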

### 4. Analyze Results with CIs

Analyzes the performance results stored in `results_df` (either loaded or generated). Defines the list of performance metrics to analyze. Includes the helper function `median_ci` to calculate the median and 95% bootstrap confidence interval for a given metric's scores across the 50 outer folds. Checks if a pre-calculated statistics file (`STATS_FILE`) exists; if not, it computes the medians and CIs for all specified metrics and estimators using the `median_ci` function and saves this summary (`final_stats_df`) to a CSV file. Finally, it sorts the summary statistics DataFrame by median ROC AUC and prints it for review.

In [6]:
# --- 4. Analyze Results (with Confidence Intervals) ---
print("\n--- 4. Analyzing Results with Confidence Intervals ---")
if results_df.empty:
    print("Error: Results DataFrame is empty. Cannot analyze.", file=sys.stderr)
    sys.exit(1)

metrics_to_analyze = [
    'roc_auc', 'mcc', 'balanced_accuracy', 'f1', 'pr_auc',
    'recall', 'specificity', 'precision', 'npv'
]

def median_ci(data, metric, confidence_level=CONFIDENCE_LEVEL):
    data_array = data[metric].dropna().values
    if len(data_array) < 2:
        return pd.Series({'median': np.nan, 'ci_lower': np.nan, 'ci_upper': np.nan})
    res = bootstrap((data_array,), np.median, confidence_level=confidence_level,
                    n_resamples=N_BOOTSTRAP_RESAMPLES, method='percentile',
                    random_state=RANDOM_STATE)
    return pd.Series({
        'median': np.median(data_array),
        'ci_lower': res.confidence_interval.low,
        'ci_upper': res.confidence_interval.high
    })

# Check if stats file exists
try:
    final_stats_df = pd.read_csv(STATS_FILE, index_col='estimator')
    print(f"\nLoaded existing summary stats with CIs from {STATS_FILE}")
    calc_needed = False
except FileNotFoundError:
    print(f"\nNo existing stats file found. Calculating Median and {CONFIDENCE_LEVEL*100:.0f}% CIs...")
    analysis_results = {}
    grouped_results = results_df.groupby('estimator')
    for metric in metrics_to_analyze:
        # Pass include_groups=False to avoid the deprecation warning in recent pandas versions
        try:
            metric_stats = grouped_results.apply(median_ci, metric=metric, confidence_level=CONFIDENCE_LEVEL, include_groups=False)
        except TypeError:  # Older pandas versions do not accept include_groups
            metric_stats = grouped_results.apply(median_ci, metric=metric, confidence_level=CONFIDENCE_LEVEL)

        metric_stats = metric_stats.rename(columns={
            'median': f'{metric}_median', 'ci_lower': f'{metric}_ci_lower', 'ci_upper': f'{metric}_ci_upper'
        })
        analysis_results[metric] = metric_stats
    final_stats_df = pd.concat(analysis_results.values(), axis=1)
    calc_needed = True # Flag that calculation was done

# Save stats if they were just calculated
if calc_needed:
    try:
        final_stats_df.to_csv(STATS_FILE)
        print(f"Summary stats with CIs saved to {STATS_FILE}")
    except Exception as e:
        print(f"Error saving summary stats: {e}", file=sys.stderr)


# Sort and display
final_stats_df = final_stats_df.sort_values(by='roc_auc_median', ascending=False)
print(f"\nMedian Performance Metrics & {CONFIDENCE_LEVEL*100:.0f}% CIs per Estimator (across all outer folds):")
pd.set_option('display.float_format', '{:.4f}'.format)
print(final_stats_df)


--- 4. Analyzing Results with Confidence Intervals ---

No existing stats file found. Calculating Median and 95% CIs...
Summary stats with CIs saved to ../results/rnCV_summary_stats_CIs.csv

Median Performance Metrics & 95% CIs per Estimator (across all outer folds):
               roc_auc_median  roc_auc_ci_lower  roc_auc_ci_upper  mcc_median  \
estimator                                                                       
SVM                    0.9955            0.9937            0.9979      0.9380   
LR_ElasticNet          0.9947            0.9929            0.9966      0.9380   
LDA                    0.9928            0.9872            0.9951      0.8986   
LGBM                   0.9913            0.9877            0.9955      0.9174   
RF                     0.9901            0.9852            0.9930      0.8959   
GNB                    0.9875            0.9844            0.9904      0.8526   

               mcc_ci_lower  mcc_ci_upper  balanced_accuracy_median  \
estimator  
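
What the percentile method in `median_ci` computes can be written out by hand. The sketch below is a NumPy-only approximation under stated assumptions: the scores are synthetic, the helper name `percentile_bootstrap_median_ci` is hypothetical, and because the random draws differ from SciPy's internals it will not reproduce the exact interval bounds in the table above.

```python
import numpy as np


def percentile_bootstrap_median_ci(scores, confidence_level=0.95, n_resamples=5000, seed=42):
    """Median plus a percentile-bootstrap confidence interval for it."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)

    # Resample the scores with replacement: one row per bootstrap replicate
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    boot_medians = np.median(scores[idx], axis=1)

    # The CI bounds are the matching percentiles of the bootstrap medians
    alpha = 1.0 - confidence_level
    low, high = np.percentile(boot_medians, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.median(scores)), float(low), float(high)


# 50 synthetic "outer fold" scores standing in for one estimator's metric
demo_scores = np.random.default_rng(0).uniform(0.95, 1.0, size=50)
median, ci_low, ci_high = percentile_bootstrap_median_ci(demo_scores)
print(f"median={median:.4f}, 95% CI=[{ci_low:.4f}, {ci_high:.4f}]")
```

One caveat worth keeping in mind: outer folds from repeated CV share training data, so the scores are not fully independent and the resulting intervals are best read as approximate.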

### 5. Select Winner Algorithm

Selects the final "winner" algorithm based on the analysis in the previous step. The code takes the second-ranked algorithm (`LR_ElasticNet`, index 1 in `final_stats_df` sorted by median ROC AUC) rather than the top-ranked one (`SVM`, index 0): the 95% confidence intervals of the two overlap on the key metrics, so the results do not show a clear difference between them, and `LR_ElasticNet` is preferred for its simplicity and interpretability. Because the index is hard-coded, it yields `LR_ElasticNet` only as long as the ranking stays as shown in the previous step. Finally, the cell prints the selected winner with a brief justification and resets the warning filters and pandas display options.

In [7]:
# --- 5. Select Winner Algorithm ---
print("\n--- 5. Selecting Winner Algorithm ---")
if final_stats_df.empty:
    print("Error: Cannot select winner, stats DataFrame is empty.", file=sys.stderr)
    sys.exit(1)

# Select the winner based on analysis (LR_ElasticNet is index 1 in the sorted list)
try:
    winner_algorithm_name = final_stats_df.index[1]
    print(f"\nSelected Winner Algorithm: {winner_algorithm_name}")
    print("Justification: SVM (index 0) and LR_ElasticNet (index 1) showed statistically indistinguishable performance")
    print("based on overlapping 95% CIs for key metrics (ROC AUC, MCC, Balanced Accuracy, Recall).")
    print("LR_ElasticNet was chosen due to its relative simplicity and potentially better interpretability compared to SVM.")
except IndexError:
    print("Error: Could not select winner algorithm. Check final_stats_df content.", file=sys.stderr)
    sys.exit(1)

# Reset warnings and display options
warnings.filterwarnings('default')
pd.reset_option('display.float_format')

print("\n--- Task 3 Script Complete ---")


--- 5. Selecting Winner Algorithm ---

Selected Winner Algorithm: LR_ElasticNet
Justification: SVM (index 0) and LR_ElasticNet (index 1) showed statistically indistinguishable performance
based on overlapping 95% CIs for key metrics (ROC AUC, MCC, Balanced Accuracy, Recall).
LR_ElasticNet was chosen due to its relative simplicity and potentially better interpretability compared to SVM.

--- Task 3 Script Complete ---
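
The overlap argument can be checked mechanically. The sketch below uses the ROC AUC interval bounds printed in the summary table of section 4; the helper `intervals_overlap` is a hypothetical name introduced here for illustration.

```python
def intervals_overlap(a, b):
    """True if closed intervals a=(low, high) and b=(low, high) share any point."""
    (a_low, a_high), (b_low, b_high) = a, b
    return a_low <= b_high and b_low <= a_high


# 95% CI bounds for median ROC AUC, copied from the summary table above
roc_auc_ci = {
    "SVM": (0.9937, 0.9979),
    "LR_ElasticNet": (0.9929, 0.9966),
    "GNB": (0.9844, 0.9904),
}

svm_vs_lr = intervals_overlap(roc_auc_ci["SVM"], roc_auc_ci["LR_ElasticNet"])
svm_vs_gnb = intervals_overlap(roc_auc_ci["SVM"], roc_auc_ci["GNB"])

print(f"SVM vs LR_ElasticNet overlap: {svm_vs_lr}")
print(f"SVM vs GNB overlap:           {svm_vs_gnb}")
```

Overlapping intervals are a quick heuristic rather than a formal test: they indicate that the data do not clearly separate two models, not that the models are proven equivalent. A paired comparison on the per-fold scores would be a stricter check if one were needed.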
