# Neural Network Models

This notebook trains and evaluates deep learning models for credit risk prediction.

**Purpose**: Build advanced neural network models and create a stacked ensemble that combines multiple algorithms.

**Models Trained**:
1. **Pure TabNet**: Attention-based neural network designed for tabular data
2. **TabNet + Tokenizer**: TabNet trained on features first passed through a small dense tokenizer network
3. **Deep & Cross Network (DCN)**: Combines cross layers and deep layers for feature interactions
4. **Residual Neural Network**: Deep network with residual connections
5. **Multi-Scale Ensemble**: Combines all models plus gradient boosting models via meta-learner

**Key Techniques**:
- **Focal Loss**: Handles class imbalance by focusing on hard examples
- **Stacking**: Meta-learner combines predictions from multiple base models
- **Class Weighting**: Adjusts model focus on minority class (bad loans)

**Output**: Trained models and performance metrics saved to models/ and artifacts/03_2_neural_network_images/


## Setup and Imports


In [1]:
import os
import json
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import joblib

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.model_selection import StratifiedKFold
import torch

from sklearn.metrics import (
    roc_auc_score, average_precision_score, f1_score,
    precision_score, recall_score, precision_recall_curve,
    confusion_matrix, classification_report, roc_curve, auc
)
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

ROOT = os.path.abspath(os.getcwd())
PROJECT_ROOT = os.path.abspath(os.path.join(ROOT, '..'))

MODELS_DIR = os.path.join(PROJECT_ROOT, 'models')
DATASET_DIR = os.path.join(PROJECT_ROOT, 'dataset')
ARTIFACTS_DIR = os.path.join(PROJECT_ROOT, 'artifacts')

os.makedirs(MODELS_DIR, exist_ok=True)
os.makedirs(DATASET_DIR, exist_ok=True)
os.makedirs(ARTIFACTS_DIR, exist_ok=True)
os.makedirs(os.path.join(ARTIFACTS_DIR, '03_2_neural_network_images'), exist_ok=True)


In [2]:
# REPRODUCIBILITY SETUP: Set all random seeds for consistent results
# IMPORTANT: Seeding makes runs repeatable on the same hardware and library
# versions. Without it, neural networks produce different results each time
# because of random weight initialization, shuffling, and dropout.

import random

# Set seed for Python's built-in random module
random.seed(42)

# Set seed for NumPy (used for array operations)
np.random.seed(42)

# Set seed for TensorFlow/Keras (neural network training)
tf.random.set_seed(42)

# Set seed for PyTorch (used by TabNet)
torch.manual_seed(42)
torch.cuda.manual_seed(42)  # For GPU if available
torch.cuda.manual_seed_all(42)  # For multi-GPU if available

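# Additional TensorFlow settings for reproducibility
# Deterministic ops may be slower but give repeatable results
tf.config.experimental.enable_op_determinism()

# Environment variables related to determinism
# Note: PYTHONHASHSEED only affects hash randomization when it is set before the
# Python interpreter starts, so assigning it here does not change this session.
# TF_DETERMINISTIC_OPS and TF_CUDNN_DETERMINISTIC are flags from older TensorFlow
# releases; enable_op_determinism() above is the current switch.
os.environ['PYTHONHASHSEED'] = '0'
os.environ['TF_DETERMINISTIC_OPS'] = '1'
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'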

print('REPRODUCIBILITY CONFIGURED')
print('All random seeds set to 42')
print('TensorFlow deterministic operations enabled')
print('PyTorch seeds configured')
print('Results will be consistent across runs')

REPRODUCIBILITY CONFIGURED
All random seeds set to 42
TensorFlow deterministic operations enabled
PyTorch seeds configured
Results will be consistent across runs
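
A quick way to sanity-check seeding is to reseed and confirm that the same draws come back. The sketch below covers only Python's `random` module and NumPy; `seed_everything` and `draw_sample` are hypothetical helper names, not part of the notebook. GPU kernels in TensorFlow or PyTorch can still introduce nondeterminism that seeding alone does not remove.

```python
import random

import numpy as np


def seed_everything(seed: int) -> None:
    """Seed Python's and NumPy's global RNGs (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)


def draw_sample():
    """Draw one value from each RNG so two runs can be compared."""
    return random.random(), np.random.rand(3).tolist()


seed_everything(42)
first = draw_sample()

seed_everything(42)
second = draw_sample()

# True when reseeding reproduces the same draws
print(first == second)
```

The same pattern (reseed, rerun, compare) can be applied to a full training cell to check whether results actually repeat on a given machine.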


## Load Data


In [3]:
X_train_frame = pd.read_pickle(os.path.join(DATASET_DIR, 'X_train.pkl'))
y_train_series = pd.read_pickle(os.path.join(DATASET_DIR, 'y_train.pkl'))
X_test_frame = pd.read_pickle(os.path.join(DATASET_DIR, 'X_test.pkl'))
y_test_series = pd.read_pickle(os.path.join(DATASET_DIR, 'y_test.pkl'))

if isinstance(y_train_series, pd.DataFrame):
    y_train_series = y_train_series.iloc[:, 0]
if isinstance(y_test_series, pd.DataFrame):
    y_test_series = y_test_series.iloc[:, 0]

feature_names = list(X_train_frame.columns)

X_train = X_train_frame.values
X_test = X_test_frame.values
y_train = y_train_series.values
y_test = y_test_series.values

print(f'Training set: {X_train.shape}, Test set: {X_test.shape}')


Training set: (26064, 17), Test set: (6517, 17)


## Data Preparation


In [4]:
pos_ratio = (y_train == 1).mean()
neg_ratio = 1 - pos_ratio
scale_pos_weight = neg_ratio / (pos_ratio + 1e-9)

print(f'Positive class ratio: {pos_ratio:.4f}')
print(f'Class weight ratio: {scale_pos_weight:.2f}')

input_dim = X_train.shape[1]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


Positive class ratio: 0.2182
Class weight ratio: 3.58
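
The class weight ratio printed above is the negative share divided by the positive share. As a worked check using the rounded positive ratio from the output (so the last digits may differ slightly from the notebook's unrounded value):

```python
# Rounded positive-class share copied from the output above
pos_ratio = 0.2182
neg_ratio = 1.0 - pos_ratio

# Weight on the positive class that makes both classes contribute equally
weight = neg_ratio / pos_ratio
print(f"{weight:.2f}")

# With this weight, the weighted positive mass equals the negative mass
weighted_pos_mass = pos_ratio * weight
print(abs(weighted_pos_mass - neg_ratio) < 1e-12)
```

In other words, each bad loan counts roughly 3.58 times as much as a good loan, which balances the two classes in aggregate.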


## Focal Loss Definition

**Purpose**: Define a custom loss function that handles class imbalance better than standard cross-entropy.

**What is Focal Loss?**
- Standard cross-entropy treats all examples equally
- Focal loss down-weights easy examples and focuses on hard-to-classify examples
- This is especially useful for imbalanced datasets (we have ~78% good loans, ~22% bad loans)

**How it works**:
- **Alpha (α)**: Balances positive/negative class importance
- **Gamma (γ)**: Controls how much to focus on hard examples (higher = more focus)
- Hard examples (misclassified) get higher weight, easy examples get lower weight

**Why use it**: Helps the model learn better from the minority class (bad loans) without needing resampling.


In [5]:
# Configure Focal Loss Parameters
# alpha=0.25 and gamma=2.0 are the commonly used default values for focal loss
focal_alpha = 0.25  # Weight for positive class (bad loans); negatives get 1 - alpha = 0.75
focal_gamma = 2.0   # Focusing parameter - higher values focus more on hard examples

def binary_focal_loss(alpha=0.25, gamma=2.0):
    """
    Focal Loss for Binary Classification with Imbalanced Data.
    
    This loss function helps neural networks learn better from imbalanced datasets by:
    1. Balancing class importance (alpha parameter)
    2. Focusing on hard-to-classify examples (gamma parameter)
    
    Parameters
    ----------
    alpha : float
        Class balancing weight for positive examples (bad loans)
        - 0.25 means positive examples are weighted 0.25 and negatives 0.75
        - Values below 0.5 down-weight the positive (minority) class; values above 0.5 up-weight it
    
    gamma : float
        Focusing parameter that down-weights easy examples
        - Higher gamma (e.g., 2.0) = more focus on hard examples
        - Easy examples (high confidence, correct predictions) get lower weight
        - Hard examples (low confidence, wrong predictions) get higher weight
        
    Returns
    -------
    Callable
        Keras-compatible loss function that can be used in model.compile()
    
    How it works:
    - Computes standard cross-entropy loss
    - Applies alpha weight to balance classes
    - Applies focal weight (1 - probability)^gamma to focus on hard examples
    - Easy examples (high probability of correct class) get down-weighted
    - Hard examples (low probability) get up-weighted
    """
    alpha_tensor = tf.constant(alpha, dtype=tf.float32)
    gamma_tensor = tf.constant(gamma, dtype=tf.float32)

    def loss(y_true, y_pred):
        # Convert inputs to proper format
        y_true_cast = tf.cast(tf.reshape(y_true, (-1, 1)), tf.float32)
        y_pred_cast = tf.cast(tf.reshape(y_pred, (-1, 1)), tf.float32)
        # Clip predictions to avoid log(0) errors
        y_pred_clipped = tf.clip_by_value(y_pred_cast, 1e-7, 1.0 - 1e-7)

        # Calculate probability of true class
        true_class_prob = y_true_cast * y_pred_clipped + (1.0 - y_true_cast) * (1.0 - y_pred_clipped)
        
        # Standard cross-entropy loss
        cross_entropy = -(y_true_cast * tf.math.log(y_pred_clipped) +
                          (1.0 - y_true_cast) * tf.math.log(1.0 - y_pred_clipped))

        # Alpha weight: Balance positive vs negative class
        alpha_weight = y_true_cast * alpha_tensor + (1.0 - y_true_cast) * (1.0 - alpha_tensor)
        
        # Focal weight: Focus on hard examples (low probability of correct class)
        # (1 - probability)^gamma: Higher gamma = more focus on hard examples
        focal_weight = tf.pow(1.0 - true_class_prob, gamma_tensor)

        # Combine: alpha_weight balances classes, focal_weight focuses on hard examples
        focal_loss_value = alpha_weight * focal_weight * cross_entropy
        return tf.reduce_mean(focal_loss_value)

    return loss

# Create the focal loss function with our parameters
focal_loss = binary_focal_loss(alpha=focal_alpha, gamma=focal_gamma)

print('FOCAL LOSS CONFIGURED')
print(f'Alpha (positive-class weight): {focal_alpha} (negative class gets {1 - focal_alpha})')
print(f'Gamma (focus parameter): {focal_gamma} (focuses on hard examples)')
print('Focal loss ready to use in neural network training')


FOCAL LOSS CONFIGURED
Alpha (positive-class weight): 0.25 (negative class gets 0.75)
Gamma (focus parameter): 2.0 (focuses on hard examples)
Focal loss ready to use in neural network training
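
To see what the two weights do numerically, here is a NumPy sketch of the same per-example formula, applied to one easy and one hard positive example. `focal_loss_per_example` is a hypothetical name and the probabilities are made up for illustration.

```python
import numpy as np


def focal_loss_per_example(y_true, p, alpha=0.25, gamma=2.0):
    """Per-example binary focal loss: alpha_t * (1 - p_t)^gamma * (-log p_t)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-7, 1.0 - 1e-7)
    # Probability assigned to the true class
    p_t = y_true * p + (1.0 - y_true) * (1.0 - p)
    # alpha for positives, 1 - alpha for negatives
    alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)


# Two positive examples: one predicted confidently (easy), one badly (hard)
y_demo = [1, 1]
p_demo = [0.9, 0.1]

focal = focal_loss_per_example(y_demo, p_demo)
cross_entropy = -np.log(np.array(p_demo))

# Ratio focal / cross-entropy = alpha * (1 - p_t)^gamma
ratio = focal / cross_entropy
print(ratio)
```

The easy example keeps 0.25 × 0.1² = 0.0025 of its cross-entropy loss, while the hard example keeps 0.25 × 0.9² = 0.2025, so the hard example's relative contribution is 81 times larger.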


## Model 1: Pure TabNet


### Training


In [6]:
# Set class weights to handle imbalanced data
# Weight for class 1 (bad loans) is higher to make model focus on minority class
class_weights = {0: 1.0, 1: neg_ratio / pos_ratio}

# TabNet: Attention-based neural network designed specifically for tabular data
# n_d=64, n_a=64: Dimension of decision and attention embeddings (controls model capacity)
# n_steps=5: Number of sequential attention steps (more steps = more complex feature interactions)
# gamma=1.5: Coefficient for feature reusage (higher = encourages feature reuse across steps)
# lambda_sparse=1e-2: Sparsity regularization (encourages model to use fewer features per step)
# optimizer_fn: AdamW optimizer with weight decay for regularization
# mask_type='entmax': Attention mechanism type (entmax provides sparse attention)
# n_shared=2, n_independent=2: Number of shared/independent layers in feature transformer
# momentum=0.95: Batch normalization momentum
# clip_value=2.0: Gradient clipping to prevent exploding gradients
tabnet_pure = TabNetClassifier(
    n_d=64,
    n_a=64,
    n_steps=5,
    gamma=1.5,
    lambda_sparse=1e-2,
    optimizer_fn=torch.optim.AdamW,
    optimizer_params=dict(lr=2e-2, weight_decay=5e-5),
    scheduler_fn=None,
    scheduler_params=None,
    mask_type='entmax',
    n_shared=2,
    n_independent=2,
    momentum=0.95,
    clip_value=2.0,
    seed=42,
    verbose=0
)

# Train TabNet with early stopping
# eval_set: Monitor performance on training set
# patience=15: Stop if no improvement for 15 epochs (prevents overfitting)
# batch_size=1024: Process 1024 samples at once
# virtual_batch_size=256: Split batch into smaller virtual batches (helps with batch normalization)
# weights: Apply class weights to handle imbalanced data
tabnet_pure.fit(
    X_train,
    y_train,
    eval_set=[(X_train, y_train)],
    eval_name=['train'],
    eval_metric=['auc'],
    max_epochs=100,
    patience=15,
    batch_size=1024,
    virtual_batch_size=256,
    num_workers=0,
    drop_last=False,
    weights=class_weights
)

print('Pure TabNet training completed')



Early stopping occurred at epoch 42 with best_epoch = 27 and best_train_auc = 0.92699
Pure TabNet training completed
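
The log line above follows from the patience rule: the best epoch was 27, and training stopped 15 epochs later at epoch 42. A minimal pure-Python sketch of that rule is below. It is an illustration under assumed 0-indexed epochs and a made-up AUC trace, not pytorch-tabnet's implementation.

```python
def early_stopping_epoch(scores, patience):
    """Return (stop_epoch, best_epoch) for a metric that should be maximized."""
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(scores):
        if score > best_score:
            # New best: remember it and reset the waiting period
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs
            return epoch, best_epoch
    # Ran out of epochs without triggering early stopping
    return len(scores) - 1, best_epoch


# Hypothetical AUC trace: improves through epoch 27, then plateaus lower
scores = [0.80 + 0.004 * e for e in range(28)] + [0.90] * 30

stop_epoch, best_epoch = early_stopping_epoch(scores, patience=15)
print(stop_epoch, best_epoch)  # 42 27
```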


### Threshold Optimization


In [7]:
# Get probability scores from TabNet for threshold optimization
tabnet_pure_train_scores = tabnet_pure.predict_proba(X_train)[:, 1]

# Find the threshold that maximizes F1 on the training scores
# precision_recall_curve returns one more precision/recall value than thresholds:
# prec[i] and rec[i] pair with thresholds[i], and the final (1, 0) pair has no threshold
prec, rec, thresholds = precision_recall_curve(y_train, tabnet_pure_train_scores)
f1_scores = 2 * prec * rec / (prec + rec + 1e-9)
best_idx = np.nanargmax(f1_scores)
# Note: indexing with best_idx - 1 picks the threshold just below the F1-maximizing one
tabnet_pure_optimal_threshold = thresholds[max(0, best_idx - 1)] if len(thresholds) > 0 else 0.5

print(f'Optimal threshold: {tabnet_pure_optimal_threshold:.4f}')


Optimal threshold: 0.7670
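
Threshold selection by F1 can also be written so that each candidate threshold is scored directly, which avoids any index alignment between the precision/recall arrays and the thresholds array. The sketch below is a brute-force NumPy version on toy data; `best_f1_threshold` is a hypothetical helper, and the loop is meant for illustration rather than speed.

```python
import numpy as np


def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) maximizing F1 over the observed score values."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):
        pred = scores >= t  # same decision rule as the notebook
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


# Toy labels and scores (made up for illustration)
y_toy = [0, 0, 1, 0, 1, 1]
s_toy = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

threshold, f1 = best_f1_threshold(y_toy, s_toy)
print(threshold, round(f1, 3))
```

On the toy data, the cut at 0.35 captures all three positives with one false positive, giving F1 = 6/7. Selecting the threshold on a held-out split rather than the training scores would give a less optimistic estimate.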


### Evaluation


In [8]:
# Convert probabilities to binary predictions using optimal threshold
tabnet_pure_train_pred = (tabnet_pure_train_scores >= tabnet_pure_optimal_threshold).astype(int)
tabnet_pure_test_scores = tabnet_pure.predict_proba(X_test)[:, 1]
tabnet_pure_test_pred = (tabnet_pure_test_scores >= tabnet_pure_optimal_threshold).astype(int)

# Calculate performance metrics for training set
tabnet_pure_train_metrics = {
    'roc_auc': roc_auc_score(y_train, tabnet_pure_train_scores),
    'pr_auc': average_precision_score(y_train, tabnet_pure_train_scores),
    'f1': f1_score(y_train, tabnet_pure_train_pred),
    'precision': precision_score(y_train, tabnet_pure_train_pred),
    'recall': recall_score(y_train, tabnet_pure_train_pred)
}

# Calculate metrics for test set
tabnet_pure_test_metrics = {
    'roc_auc': roc_auc_score(y_test, tabnet_pure_test_scores),
    'pr_auc': average_precision_score(y_test, tabnet_pure_test_scores),
    'f1': f1_score(y_test, tabnet_pure_test_pred),
    'precision': precision_score(y_test, tabnet_pure_test_pred),
    'recall': recall_score(y_test, tabnet_pure_test_pred),
    'threshold': tabnet_pure_optimal_threshold
}

# Calculate specificity from confusion matrix
tabnet_pure_train_cm = confusion_matrix(y_train, tabnet_pure_train_pred)
tabnet_pure_test_cm = confusion_matrix(y_test, tabnet_pure_test_pred)
tabnet_pure_train_specificity = tabnet_pure_train_cm[0, 0] / (tabnet_pure_train_cm[0, 0] + tabnet_pure_train_cm[0, 1] + 1e-9)
tabnet_pure_test_specificity = tabnet_pure_test_cm[0, 0] / (tabnet_pure_test_cm[0, 0] + tabnet_pure_test_cm[0, 1] + 1e-9)

print('Training Metrics:')
print(f"  ROC-AUC: {tabnet_pure_train_metrics['roc_auc']:.4f}, PR-AUC: {tabnet_pure_train_metrics['pr_auc']:.4f}")
print(f"  F1: {tabnet_pure_train_metrics['f1']:.4f}, Precision: {tabnet_pure_train_metrics['precision']:.4f}, Recall: {tabnet_pure_train_metrics['recall']:.4f}")
print(f"  Specificity: {tabnet_pure_train_specificity:.4f}")

print('\nTest Metrics:')
print(f"  ROC-AUC: {tabnet_pure_test_metrics['roc_auc']:.4f}, PR-AUC: {tabnet_pure_test_metrics['pr_auc']:.4f}")
print(f"  F1: {tabnet_pure_test_metrics['f1']:.4f}, Precision: {tabnet_pure_test_metrics['precision']:.4f}, Recall: {tabnet_pure_test_metrics['recall']:.4f}")
print(f"  Specificity: {tabnet_pure_test_specificity:.4f}")


Training Metrics:
  ROC-AUC: 0.9270, PR-AUC: 0.8660
  F1: 0.7921, Precision: 0.9124, Recall: 0.6998
  Specificity: 0.9813

Test Metrics:
  ROC-AUC: 0.9218, PR-AUC: 0.8608
  F1: 0.7946, Precision: 0.9180, Recall: 0.7004
  Specificity: 0.9825
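
As a consistency check on the numbers above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values in the output, so agreement is only expected to within rounding.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the Pure TabNet output above
train_f1 = f1_from_precision_recall(0.9124, 0.6998)
test_f1 = f1_from_precision_recall(0.9180, 0.7004)

print(f"train F1 = {train_f1:.4f}, test F1 = {test_f1:.4f}")
```

High precision with recall around 0.70 means the chosen threshold is conservative: most flagged loans are truly bad, but roughly 30% of bad loans are missed.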


### Visualizations


In [9]:
images_dir = os.path.join(ARTIFACTS_DIR, '03_2_neural_network_images')
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 150

fpr_train, tpr_train, _ = roc_curve(y_train, tabnet_pure_train_scores)
fpr_test, tpr_test, _ = roc_curve(y_test, tabnet_pure_test_scores)
auc_train = auc(fpr_train, tpr_train)
auc_test = auc(fpr_test, tpr_test)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr_train, tpr_train, label=f'Train (AUC = {auc_train:.4f})', linewidth=2, linestyle='--')
ax.plot(fpr_test, tpr_test, label=f'Test (AUC = {auc_test:.4f})', linewidth=2)
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, alpha=0.5, label='Random')
ax.set_xlabel('False Positive Rate', fontsize=11)
ax.set_ylabel('True Positive Rate', fontsize=11)
ax.set_title('Pure TabNet - ROC Curve', fontsize=12, fontweight='bold')
ax.legend(loc='lower right')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'tabnet_pure_roc.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: tabnet_pure_roc.png')


Saved: tabnet_pure_roc.png


In [10]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

sns.heatmap(tabnet_pure_train_cm, annot=True, fmt='d', cmap='Blues', ax=ax1, cbar=False)
ax1.set_title('Training Set', fontsize=11, fontweight='bold')
ax1.set_xlabel('Predicted')
ax1.set_ylabel('Actual')

sns.heatmap(tabnet_pure_test_cm, annot=True, fmt='d', cmap='Blues', ax=ax2, cbar=False)
ax2.set_title('Test Set', fontsize=11, fontweight='bold')
ax2.set_xlabel('Predicted')
ax2.set_ylabel('Actual')

plt.suptitle('Pure TabNet - Confusion Matrix', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'tabnet_pure_cm.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: tabnet_pure_cm.png')


Saved: tabnet_pure_cm.png


In [11]:
prec_train, rec_train, _ = precision_recall_curve(y_train, tabnet_pure_train_scores)
prec_test, rec_test, _ = precision_recall_curve(y_test, tabnet_pure_test_scores)
pr_auc_train = average_precision_score(y_train, tabnet_pure_train_scores)
pr_auc_test = average_precision_score(y_test, tabnet_pure_test_scores)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(rec_train, prec_train, label=f'Train (AP = {pr_auc_train:.4f})', linewidth=2, linestyle='--')
ax.plot(rec_test, prec_test, label=f'Test (AP = {pr_auc_test:.4f})', linewidth=2)
ax.set_xlabel('Recall', fontsize=11)
ax.set_ylabel('Precision', fontsize=11)
ax.set_title('Pure TabNet - Precision-Recall Curve', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'tabnet_pure_pr.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: tabnet_pure_pr.png')


Saved: tabnet_pure_pr.png


## Model 2: TabNet + Tokenizer


### Feature Tokenization


In [12]:
# Feature Tokenization: Map raw features through a small dense network before TabNet
# Caveat: this Keras network is never trained. TabNet is a separate PyTorch model, so no
# gradients reach the tokenizer, and its output is a fixed non-linear projection
# produced by randomly initialized weights

# Build tokenizer network: 2-layer dense network that re-encodes the features
# Input: Original 17 features
# Layer 1: 64 units with swish activation (smooth, non-linear transformation)
# BatchNormalization: In predict() mode it applies its initial moving statistics
# Layer 2: 32 units (a wider representation than the 17 inputs)
# L2 regularization: Only acts during training, so it has no effect here
tokenizer_input = layers.Input(shape=(input_dim,))
tokenized_features = layers.Dense(64, activation='swish', kernel_regularizer=regularizers.l2(1e-4))(tokenizer_input)
tokenized_features = layers.BatchNormalization()(tokenized_features)
tokenized_features = layers.Dense(32, activation='swish', kernel_regularizer=regularizers.l2(1e-4))(tokenized_features)
tokenizer_model = keras.Model(inputs=tokenizer_input, outputs=tokenized_features)

# Transform original features into tokenized features
# The same untrained tokenizer is applied to both train and test data
# This creates a 32-dimensional representation instead of original 17 features
X_train_tokenized = tokenizer_model.predict(X_train, verbose=0)
X_test_tokenized = tokenizer_model.predict(X_test, verbose=0)

print(f'Original features: {X_train.shape[1]}, Tokenized features: {X_train_tokenized.shape[1]}')


Original features: 17, Tokenized features: 32
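
Because the tokenizer is only used through `predict`, its output is a deterministic function of whatever weights it was initialized with. The NumPy sketch below mimics that idea with two randomly initialized dense layers and swish activations. It is a simplification under stated assumptions: biases and batch normalization are omitted, and the initialization scale is arbitrary rather than Keras's default.

```python
import numpy as np


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


rng = np.random.default_rng(42)

# Randomly initialized weights that are never updated (17 -> 64 -> 32)
W1 = rng.normal(scale=0.1, size=(17, 64))
W2 = rng.normal(scale=0.1, size=(64, 32))


def tokenize(X):
    """Fixed non-linear projection from 17 to 32 dimensions."""
    return swish(swish(X @ W1) @ W2)


# Five hypothetical rows with 17 features each
X_demo = rng.normal(size=(5, 17))
Z = tokenize(X_demo)

print(Z.shape)  # (5, 32)
# Calling it again gives identical output, since nothing is learned
print(np.allclose(tokenize(X_demo), Z))
```

A trainable alternative would place the tokenizer inside the same framework and optimization loop as the downstream model so that gradients can update its weights.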


### Training


In [13]:
# TabNet with tokenized features: Same architecture as pure TabNet
# But now works on 32 tokenized features instead of 17 original features
# Whether this re-encoding helps is an empirical question, checked in the evaluation cells
tabnet_tokenizer = TabNetClassifier(
    n_d=64,
    n_a=64,
    n_steps=5,
    gamma=1.5,
    lambda_sparse=1e-2,
    optimizer_fn=torch.optim.AdamW,
    optimizer_params=dict(lr=2e-2, weight_decay=5e-5),
    scheduler_fn=None,
    scheduler_params=None,
    mask_type='entmax',
    n_shared=2,
    n_independent=2,
    momentum=0.95,
    clip_value=2.0,
    seed=42,
    verbose=0
)

# Train on tokenized features instead of raw features
tabnet_tokenizer.fit(
    X_train_tokenized,
    y_train,
    eval_set=[(X_train_tokenized, y_train)],
    eval_name=['train'],
    eval_metric=['auc'],
    max_epochs=100,
    patience=15,
    batch_size=1024,
    virtual_batch_size=256,
    num_workers=0,
    drop_last=False,
    weights=class_weights
)

print('TabNet + Tokenizer training completed')



Early stopping occurred at epoch 57 with best_epoch = 42 and best_train_auc = 0.92098
TabNet + Tokenizer training completed


### Threshold Optimization


In [14]:
# Get predictions and find optimal threshold for TabNet + Tokenizer
tabnet_tokenizer_train_scores = tabnet_tokenizer.predict_proba(X_train_tokenized)[:, 1]

prec, rec, thresholds = precision_recall_curve(y_train, tabnet_tokenizer_train_scores)
f1_scores = 2 * prec * rec / (prec + rec + 1e-9)
best_idx = np.nanargmax(f1_scores)
tabnet_tokenizer_optimal_threshold = thresholds[max(0, best_idx - 1)] if len(thresholds) > 0 else 0.5

print(f'Optimal threshold: {tabnet_tokenizer_optimal_threshold:.4f}')


Optimal threshold: 0.7065


### Evaluation


In [15]:
# Evaluate TabNet + Tokenizer model
tabnet_tokenizer_train_pred = (tabnet_tokenizer_train_scores >= tabnet_tokenizer_optimal_threshold).astype(int)
tabnet_tokenizer_test_scores = tabnet_tokenizer.predict_proba(X_test_tokenized)[:, 1]
tabnet_tokenizer_test_pred = (tabnet_tokenizer_test_scores >= tabnet_tokenizer_optimal_threshold).astype(int)

tabnet_tokenizer_train_metrics = {
    'roc_auc': roc_auc_score(y_train, tabnet_tokenizer_train_scores),
    'pr_auc': average_precision_score(y_train, tabnet_tokenizer_train_scores),
    'f1': f1_score(y_train, tabnet_tokenizer_train_pred),
    'precision': precision_score(y_train, tabnet_tokenizer_train_pred),
    'recall': recall_score(y_train, tabnet_tokenizer_train_pred)
}

tabnet_tokenizer_test_metrics = {
    'roc_auc': roc_auc_score(y_test, tabnet_tokenizer_test_scores),
    'pr_auc': average_precision_score(y_test, tabnet_tokenizer_test_scores),
    'f1': f1_score(y_test, tabnet_tokenizer_test_pred),
    'precision': precision_score(y_test, tabnet_tokenizer_test_pred),
    'recall': recall_score(y_test, tabnet_tokenizer_test_pred),
    'threshold': tabnet_tokenizer_optimal_threshold
}

tabnet_tokenizer_train_cm = confusion_matrix(y_train, tabnet_tokenizer_train_pred)
tabnet_tokenizer_test_cm = confusion_matrix(y_test, tabnet_tokenizer_test_pred)
tabnet_tokenizer_train_specificity = tabnet_tokenizer_train_cm[0, 0] / (tabnet_tokenizer_train_cm[0, 0] + tabnet_tokenizer_train_cm[0, 1] + 1e-9)
tabnet_tokenizer_test_specificity = tabnet_tokenizer_test_cm[0, 0] / (tabnet_tokenizer_test_cm[0, 0] + tabnet_tokenizer_test_cm[0, 1] + 1e-9)

print('Training Metrics:')
print(f"  ROC-AUC: {tabnet_tokenizer_train_metrics['roc_auc']:.4f}, PR-AUC: {tabnet_tokenizer_train_metrics['pr_auc']:.4f}")
print(f"  F1: {tabnet_tokenizer_train_metrics['f1']:.4f}, Precision: {tabnet_tokenizer_train_metrics['precision']:.4f}, Recall: {tabnet_tokenizer_train_metrics['recall']:.4f}")
print(f"  Specificity: {tabnet_tokenizer_train_specificity:.4f}")

print('\nTest Metrics:')
print(f"  ROC-AUC: {tabnet_tokenizer_test_metrics['roc_auc']:.4f}, PR-AUC: {tabnet_tokenizer_test_metrics['pr_auc']:.4f}")
print(f"  F1: {tabnet_tokenizer_test_metrics['f1']:.4f}, Precision: {tabnet_tokenizer_test_metrics['precision']:.4f}, Recall: {tabnet_tokenizer_test_metrics['recall']:.4f}")
print(f"  Specificity: {tabnet_tokenizer_test_specificity:.4f}")


Training Metrics:
  ROC-AUC: 0.9210, PR-AUC: 0.8547
  F1: 0.7765, Precision: 0.8845, Recall: 0.6921
  Specificity: 0.9748

Test Metrics:
  ROC-AUC: 0.9197, PR-AUC: 0.8462
  F1: 0.7747, Precision: 0.8787, Recall: 0.6927
  Specificity: 0.9733


### Visualizations


In [16]:
fpr_train, tpr_train, _ = roc_curve(y_train, tabnet_tokenizer_train_scores)
fpr_test, tpr_test, _ = roc_curve(y_test, tabnet_tokenizer_test_scores)
auc_train = auc(fpr_train, tpr_train)
auc_test = auc(fpr_test, tpr_test)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr_train, tpr_train, label=f'Train (AUC = {auc_train:.4f})', linewidth=2, linestyle='--')
ax.plot(fpr_test, tpr_test, label=f'Test (AUC = {auc_test:.4f})', linewidth=2)
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, alpha=0.5, label='Random')
ax.set_xlabel('False Positive Rate', fontsize=11)
ax.set_ylabel('True Positive Rate', fontsize=11)
ax.set_title('TabNet + Tokenizer - ROC Curve', fontsize=12, fontweight='bold')
ax.legend(loc='lower right')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'tabnet_tokenizer_roc.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: tabnet_tokenizer_roc.png')


Saved: tabnet_tokenizer_roc.png


In [17]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

sns.heatmap(tabnet_tokenizer_train_cm, annot=True, fmt='d', cmap='Blues', ax=ax1, cbar=False)
ax1.set_title('Training Set', fontsize=11, fontweight='bold')
ax1.set_xlabel('Predicted')
ax1.set_ylabel('Actual')

sns.heatmap(tabnet_tokenizer_test_cm, annot=True, fmt='d', cmap='Blues', ax=ax2, cbar=False)
ax2.set_title('Test Set', fontsize=11, fontweight='bold')
ax2.set_xlabel('Predicted')
ax2.set_ylabel('Actual')

plt.suptitle('TabNet + Tokenizer - Confusion Matrix', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'tabnet_tokenizer_cm.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: tabnet_tokenizer_cm.png')


Saved: tabnet_tokenizer_cm.png


In [18]:
prec_train, rec_train, _ = precision_recall_curve(y_train, tabnet_tokenizer_train_scores)
prec_test, rec_test, _ = precision_recall_curve(y_test, tabnet_tokenizer_test_scores)
pr_auc_train = average_precision_score(y_train, tabnet_tokenizer_train_scores)
pr_auc_test = average_precision_score(y_test, tabnet_tokenizer_test_scores)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(rec_train, prec_train, label=f'Train (AP = {pr_auc_train:.4f})', linewidth=2, linestyle='--')
ax.plot(rec_test, prec_test, label=f'Test (AP = {pr_auc_test:.4f})', linewidth=2)
ax.set_xlabel('Recall', fontsize=11)
ax.set_ylabel('Precision', fontsize=11)
ax.set_title('TabNet + Tokenizer - Precision-Recall Curve', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'tabnet_tokenizer_pr.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: tabnet_tokenizer_pr.png')


Saved: tabnet_tokenizer_pr.png


## Model 3: Deep & Cross Network (DCN)


### Architecture


In [19]:
# Deep & Cross Network (DCN): Combines explicit feature interactions with deep learning
# Architecture has two paths: Cross Network (feature interactions) + Deep Network (non-linear patterns)

dcn_input = layers.Input(shape=(input_dim,))

# CROSS NETWORK: Explicitly models feature interactions
# Standard matrix-form cross layer: x_{l+1} = x_0 * (W_l x_l + b_l) + x_l
# Note: the loop below omits the bias (use_bias=False) and adds x_0 rather than x_l,
# so it computes x_{l+1} = x_0 * (W_l x_l) + x_0, a variant of the standard layer
# Each layer raises the polynomial degree of the interactions (e.g., age * income, credit_score^2)
x0 = dcn_input
x_l = x0

for i in range(3):
    # Linear transformation (no bias term)
    x_l = layers.Dense(input_dim, use_bias=False, kernel_regularizer=regularizers.l2(1e-4))(x_l)
    # Element-wise multiplication with original input (creates interactions)
    x_l = layers.Multiply()([x0, x_l])
    # Skip connection: adds the original input x_0 (not the previous layer's x_l)
    x_l = layers.Add()([x_l, x0])

cross_output = x_l

# DEEP NETWORK: Standard feedforward network for non-linear patterns
# Processes features through multiple dense layers with non-linear activations
deep = layers.Dense(256, activation='swish', kernel_regularizer=regularizers.l2(1e-4))(dcn_input)
deep = layers.BatchNormalization()(deep)
deep = layers.Dropout(0.3)(deep)

deep = layers.Dense(128, activation='swish', kernel_regularizer=regularizers.l2(1e-4))(deep)
deep = layers.BatchNormalization()(deep)
deep = layers.Dropout(0.2)(deep)

deep = layers.Dense(64, activation='swish', kernel_regularizer=regularizers.l2(1e-4))(deep)
deep = layers.BatchNormalization()(deep)
deep = layers.Dropout(0.15)(deep)

# COMBINE: Concatenate cross network output (feature interactions) with deep network output (non-linear patterns)
# This allows model to use both explicit interactions and learned patterns
combined = layers.Concatenate()([cross_output, deep])
combined = layers.Dense(128, activation='swish', kernel_regularizer=regularizers.l2(1e-4))(combined)
combined = layers.BatchNormalization()(combined)
combined = layers.Dropout(0.2)(combined)
combined = layers.Dense(64, activation='swish', kernel_regularizer=regularizers.l2(1e-4))(combined)
combined = layers.BatchNormalization()(combined)

# Final output: Binary classification probability
dcn_output = layers.Dense(1, activation='sigmoid')(combined)

dcn_model = keras.Model(inputs=dcn_input, outputs=dcn_output)
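
For reference, the matrix-form cross layer `x_{l+1} = x_0 * (W_l x_l + b_l) + x_l` can be sketched in NumPy as follows. This is an illustration with made-up shapes and random weights, not the notebook's Keras model; `cross_layer` is a hypothetical name.

```python
import numpy as np


def cross_layer(x0, xl, W, b):
    """One cross layer: element-wise x0 * (xl @ W + b), plus the residual xl."""
    return x0 * (xl @ W + b) + xl


rng = np.random.default_rng(0)
d = 4  # hypothetical feature count

x0 = rng.normal(size=(2, d))  # batch of 2 rows
xl = x0

# Stack three cross layers, each with its own weights
for _ in range(3):
    W = rng.normal(scale=0.1, size=(d, d))
    b = np.zeros(d)
    xl = cross_layer(x0, xl, W, b)

print(xl.shape)  # (2, 4)

# With zero weights and bias, the layer reduces to the identity on xl
identity_out = cross_layer(x0, x0, np.zeros((d, d)), np.zeros(d))
print(np.allclose(identity_out, x0))
```

Because every layer multiplies by `x0` again, a stack of `L` layers can represent feature interactions up to degree `L + 1`, while the residual term keeps the lower-order terms available.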


### Training


In [20]:
# Compile DCN model with focal loss for imbalanced data
# Focal loss focuses on hard examples and handles class imbalance better than standard cross-entropy
dcn_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss=focal_loss,
    metrics=[keras.metrics.AUC(name='roc_auc', curve='ROC'), keras.metrics.AUC(name='pr_auc', curve='PR')]
)

# Early stopping: Stop training if validation ROC-AUC doesn't improve for 12 epochs
# restore_best_weights: Keep the best model weights (not the last epoch)
dcn_early_stop = keras.callbacks.EarlyStopping(monitor='val_roc_auc', mode='max', patience=12, restore_best_weights=True)

# Learning rate reduction: If validation performance plateaus, reduce learning rate by half
# This helps fine-tune the model when it gets stuck
dcn_lr_reduce = keras.callbacks.ReduceLROnPlateau(monitor='val_roc_auc', mode='max', factor=0.5, patience=5, min_lr=1e-6)

# Train model with 10% validation split for monitoring
# (Keras takes the last 10% of rows, before shuffling, so the split is not stratified)
dcn_history = dcn_model.fit(
    X_train, y_train,
    validation_split=0.1,
    epochs=80,
    batch_size=512,
    callbacks=[dcn_early_stop, dcn_lr_reduce],
    verbose=1
)

print(f'DCN trained in {len(dcn_history.history["roc_auc"])} epochs')


Epoch 1/80
46/46 ━━━━━━━━━━━━━━━━━━━━ 5s 24ms/step - loss: 0.1535 - pr_auc: 0.5476 - roc_auc: 0.7479 - val_loss: 0.1084 - val_pr_auc: 0.5718 - val_roc_auc: 0.7790 - learning_rate: 0.0010
Epoch 2/80
46/46 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - loss: 0.1007 - pr_auc: 0.6733 - roc_auc: 0.8038 - val_loss: 0.0976 - val_pr_auc: 0.6189 - val_roc_auc: 0.8116 - learning_rate: 0.0010
Epoch 3/80
46/46 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 0.0889 - pr_auc: 0.7101 - roc_auc: 0.8352 - val_loss: 0.0957 - val_pr_auc: 0.6157 - val_roc_auc: 0.8173 - learning_rate: 0.0010
Epoch 4/80
46/46 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 0.0832 - pr_auc: 0.7370 - roc_auc: 0.8562 - val_loss: 0.0943 - val_pr_auc: 0.6447 - val_roc_auc: 0.8425 - learning_rate: 0.0010
Epoch 5/80
46/46 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 0.0801 - pr_auc: 0.7521

### Threshold Optimization


In [21]:
# Get predictions and optimize threshold for DCN
dcn_train_scores = dcn_model.predict(X_train, verbose=0).ravel()

prec, rec, thresholds = precision_recall_curve(y_train, dcn_train_scores)
f1_scores = 2 * prec * rec / (prec + rec + 1e-9)
best_idx = np.nanargmax(f1_scores)
dcn_optimal_threshold = thresholds[max(0, best_idx - 1)] if len(thresholds) > 0 else 0.5

print(f'Optimal threshold: {dcn_optimal_threshold:.4f}')


Optimal threshold: 0.3076


### Evaluation


In [22]:
# Evaluate DCN model performance
dcn_train_pred = (dcn_train_scores >= dcn_optimal_threshold).astype(int)
dcn_test_scores = dcn_model.predict(X_test, verbose=0).ravel()
dcn_test_pred = (dcn_test_scores >= dcn_optimal_threshold).astype(int)

dcn_train_metrics = {
    'roc_auc': roc_auc_score(y_train, dcn_train_scores),
    'pr_auc': average_precision_score(y_train, dcn_train_scores),
    'f1': f1_score(y_train, dcn_train_pred),
    'precision': precision_score(y_train, dcn_train_pred),
    'recall': recall_score(y_train, dcn_train_pred)
}

dcn_test_metrics = {
    'roc_auc': roc_auc_score(y_test, dcn_test_scores),
    'pr_auc': average_precision_score(y_test, dcn_test_scores),
    'f1': f1_score(y_test, dcn_test_pred),
    'precision': precision_score(y_test, dcn_test_pred),
    'recall': recall_score(y_test, dcn_test_pred),
    'threshold': dcn_optimal_threshold
}

dcn_train_cm = confusion_matrix(y_train, dcn_train_pred)
dcn_test_cm = confusion_matrix(y_test, dcn_test_pred)
dcn_train_specificity = dcn_train_cm[0, 0] / (dcn_train_cm[0, 0] + dcn_train_cm[0, 1] + 1e-9)
dcn_test_specificity = dcn_test_cm[0, 0] / (dcn_test_cm[0, 0] + dcn_test_cm[0, 1] + 1e-9)

print('Training Metrics:')
print(f"  ROC-AUC: {dcn_train_metrics['roc_auc']:.4f}, PR-AUC: {dcn_train_metrics['pr_auc']:.4f}")
print(f"  F1: {dcn_train_metrics['f1']:.4f}, Precision: {dcn_train_metrics['precision']:.4f}, Recall: {dcn_train_metrics['recall']:.4f}")
print(f"  Specificity: {dcn_train_specificity:.4f}")

print('\nTest Metrics:')
print(f"  ROC-AUC: {dcn_test_metrics['roc_auc']:.4f}, PR-AUC: {dcn_test_metrics['pr_auc']:.4f}")
print(f"  F1: {dcn_test_metrics['f1']:.4f}, Precision: {dcn_test_metrics['precision']:.4f}, Recall: {dcn_test_metrics['recall']:.4f}")
print(f"  Specificity: {dcn_test_specificity:.4f}")


Training Metrics:
  ROC-AUC: 0.9225, PR-AUC: 0.8616
  F1: 0.7925, Precision: 0.9076, Recall: 0.7033
  Specificity: 0.9800

Test Metrics:
  ROC-AUC: 0.9251, PR-AUC: 0.8653
  F1: 0.7933, Precision: 0.9076, Recall: 0.7046
  Specificity: 0.9800
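
The specificity above is read straight off the confusion matrix. scikit-learn's `confusion_matrix` puts actual classes on rows and predicted classes on columns, so for binary labels the layout is `[[TN, FP], [FN, TP]]`. The following sketch makes that indexing explicit; `rates_from_confusion` and `cm_demo` are hypothetical names, and the function assumes every row and column total is non-zero.

```python
import numpy as np

def rates_from_confusion(cm):
    """Return specificity, recall, and precision from a 2x2 confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tn, fp = cm[0]  # row 0: actual negatives
    fn, tp = cm[1]  # row 1: actual positives
    return {
        "specificity": tn / (tn + fp),  # share of actual negatives predicted negative
        "recall": tp / (tp + fn),       # share of actual positives predicted positive
        "precision": tp / (tp + fp),    # share of predicted positives that are correct
    }

# Toy matrix (illustrative only): 100 actual negatives, 50 actual positives
cm_demo = np.array([[90, 10], [5, 45]])
rates_demo = rates_from_confusion(cm_demo)
```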


### Visualizations


In [23]:
fpr_train, tpr_train, _ = roc_curve(y_train, dcn_train_scores)
fpr_test, tpr_test, _ = roc_curve(y_test, dcn_test_scores)
auc_train = auc(fpr_train, tpr_train)
auc_test = auc(fpr_test, tpr_test)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr_train, tpr_train, label=f'Train (AUC = {auc_train:.4f})', linewidth=2, linestyle='--')
ax.plot(fpr_test, tpr_test, label=f'Test (AUC = {auc_test:.4f})', linewidth=2)
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, alpha=0.5, label='Random')
ax.set_xlabel('False Positive Rate', fontsize=11)
ax.set_ylabel('True Positive Rate', fontsize=11)
ax.set_title('Deep & Cross Network - ROC Curve', fontsize=12, fontweight='bold')
ax.legend(loc='lower right')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'dcn_roc.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: dcn_roc.png')


Saved: dcn_roc.png
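
The area under the ROC curve has a ranking interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below computes it that way with NumPy as an illustration; `roc_auc_pairwise` and the toy arrays are hypothetical names, and the pairwise matrix makes this suitable only for small samples.

```python
import numpy as np

def roc_auc_pairwise(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # diff[i, j] > 0 means positive i outranks negative j
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

# Toy data (illustrative only)
y_demo = np.array([0, 0, 1, 0, 1, 1])
s_demo = np.array([0.10, 0.20, 0.35, 0.40, 0.80, 0.90])
auc_demo = roc_auc_pairwise(y_demo, s_demo)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the value is 8/9.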


In [24]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

sns.heatmap(dcn_train_cm, annot=True, fmt='d', cmap='Blues', ax=ax1, cbar=False)
ax1.set_title('Training Set', fontsize=11, fontweight='bold')
ax1.set_xlabel('Predicted')
ax1.set_ylabel('Actual')

sns.heatmap(dcn_test_cm, annot=True, fmt='d', cmap='Blues', ax=ax2, cbar=False)
ax2.set_title('Test Set', fontsize=11, fontweight='bold')
ax2.set_xlabel('Predicted')
ax2.set_ylabel('Actual')

plt.suptitle('Deep & Cross Network - Confusion Matrix', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'dcn_cm.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: dcn_cm.png')


Saved: dcn_cm.png


In [25]:
prec_train, rec_train, _ = precision_recall_curve(y_train, dcn_train_scores)
prec_test, rec_test, _ = precision_recall_curve(y_test, dcn_test_scores)
pr_auc_train = average_precision_score(y_train, dcn_train_scores)
pr_auc_test = average_precision_score(y_test, dcn_test_scores)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(rec_train, prec_train, label=f'Train (AUC = {pr_auc_train:.4f})', linewidth=2, linestyle='--')
ax.plot(rec_test, prec_test, label=f'Test (AUC = {pr_auc_test:.4f})', linewidth=2)
ax.set_xlabel('Recall', fontsize=11)
ax.set_ylabel('Precision', fontsize=11)
ax.set_title('Deep & Cross Network - Precision-Recall Curve', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'dcn_pr.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: dcn_pr.png')


Saved: dcn_pr.png
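
The value shown in the legend comes from `average_precision_score`, which summarizes the precision-recall curve as a recall-weighted mean of precisions, AP = Σ (Rₙ − Rₙ₋₁) Pₙ, rather than a trapezoidal area. The sketch below reproduces that sum with NumPy under the assumption that no two scores are tied; `average_precision` and the toy arrays are hypothetical names.

```python
import numpy as np

def average_precision(y_true, scores):
    """Average precision for binary labels, assuming all scores are distinct."""
    y_true = np.asarray(y_true).astype(float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    y_sorted = y_true[order]
    tp = np.cumsum(y_sorted)                    # true positives among the top-k
    precision_at_k = tp / np.arange(1, len(y_sorted) + 1)
    # Recall only increases at positives, each by 1 / n_positives
    return float(np.sum(precision_at_k * y_sorted) / y_sorted.sum())

# Toy data (illustrative only)
y_demo = np.array([0, 0, 1, 0, 1, 1])
s_demo = np.array([0.10, 0.20, 0.35, 0.40, 0.80, 0.90])
ap_demo = average_precision(y_demo, s_demo)
```

The three positives are retrieved at precisions 1, 1, and 3/4, so the average is 11/12.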


## Model 4: Residual Neural Network


### Architecture


In [26]:
# Residual Neural Network: a deep feedforward network with a skip connection
# The skip connection lets gradients flow directly to earlier layers,
# which mitigates the vanishing-gradient problem when training deep networks

residual_input = layers.Input(shape=(input_dim,))

# First block: Expand to 512 dimensions
x = layers.Dense(512, kernel_regularizer=regularizers.l2(1e-4))(residual_input)
x = layers.BatchNormalization()(x)
x = layers.Activation('swish')(x)
x = layers.Dropout(0.3)(x)

# Save this for residual connection (skip connection)
residual_1 = x

# Second block: Compress to 256 dimensions
x = layers.Dense(256, kernel_regularizer=regularizers.l2(1e-4))(x)
x = layers.BatchNormalization()(x)
x = layers.Activation('swish')(x)
x = layers.Dropout(0.3)(x)

# Third block: Expand back to 512 and add residual connection
# The Add layer combines current output with saved residual_1
# This allows the network to learn identity mappings if needed (makes training easier)
x = layers.Dense(512, kernel_regularizer=regularizers.l2(1e-4))(x)
x = layers.BatchNormalization()(x)
x = layers.Add()([x, residual_1])  # Residual connection: adds original 512-dim output
x = layers.Activation('swish')(x)
x = layers.Dropout(0.25)(x)

# Continue with standard feedforward layers
x = layers.Dense(128, kernel_regularizer=regularizers.l2(1e-4))(x)
x = layers.BatchNormalization()(x)
x = layers.Activation('swish')(x)
x = layers.Dropout(0.2)(x)

x = layers.Dense(64, kernel_regularizer=regularizers.l2(1e-4))(x)
x = layers.BatchNormalization()(x)
x = layers.Activation('swish')(x)
x = layers.Dropout(0.15)(x)

# Final output: Binary classification probability
residual_output = layers.Dense(1, activation='sigmoid')(x)

residual_model = keras.Model(inputs=residual_input, outputs=residual_output)
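
The identity-mapping property mentioned in the comments can be shown in a few lines. The sketch below is a simplified NumPy stand-in for the bottleneck-plus-skip pattern, not the Keras model above: it omits batch normalization, dropout, regularization, and the activation applied after the addition, and `residual_block`, `swish`, and the demo arrays are hypothetical names.

```python
import numpy as np

def swish(z):
    """Swish activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def residual_block(x, w_down, w_up):
    """Bottleneck branch plus skip connection.

    x: (batch, d), w_down: (d, h), w_up: (h, d).
    """
    branch = swish(x @ w_down) @ w_up  # compress to h units, expand back to d
    return branch + x                  # skip connection: add the block input

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(4, 8))
w_down = rng.normal(size=(8, 3))
w_up_zero = np.zeros((3, 8))           # a branch that contributes nothing

out_demo = residual_block(x_demo, w_down, w_up_zero)
```

With the expansion weights at zero the block returns its input unchanged, which is why a residual block can fall back to the identity when the extra layers are not useful.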


### Training


In [27]:
# Compile residual network with focal loss
residual_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss=focal_loss,
    metrics=[keras.metrics.AUC(name='roc_auc', curve='ROC'), keras.metrics.AUC(name='pr_auc', curve='PR')]
)

# Early stopping on validation ROC-AUC with a patience of 15 epochs; restore_best_weights keeps the best epoch's weights
residual_early_stop = keras.callbacks.EarlyStopping(monitor='val_roc_auc', mode='max', patience=15, restore_best_weights=True)
residual_lr_reduce = keras.callbacks.ReduceLROnPlateau(monitor='val_roc_auc', mode='max', factor=0.5, patience=5, min_lr=1e-6)

# Allow up to 120 epochs; early stopping ends training once validation ROC-AUC stops improving
residual_history = residual_model.fit(
    X_train, y_train,
    validation_split=0.1,
    epochs=120,
    batch_size=512,
    callbacks=[residual_early_stop, residual_lr_reduce],
    verbose=1
)

print(f'Residual NN trained in {len(residual_history.history["roc_auc"])} epochs')


Epoch 1/120
46/46 ━━━━━━━━━━━━━━━━━━━━ 5s 37ms/step - loss: 0.1686 - pr_auc: 0.5709 - roc_auc: 0.7745 - val_loss: 0.1537 - val_pr_auc: 0.6789 - val_roc_auc: 0.8548 - learning_rate: 0.0010
Epoch 2/120
46/46 ━━━━━━━━━━━━━━━━━━━━ 2s 37ms/step - loss: 0.1343 - pr_auc: 0.7011 - roc_auc: 0.8444 - val_loss: 0.1353 - val_pr_auc: 0.6575 - val_roc_auc: 0.8500 - learning_rate: 0.0010
Epoch 3/120
46/46 ━━━━━━━━━━━━━━━━━━━━ 3s 36ms/step - loss: 0.1229 - pr_auc: 0.7427 - roc_auc: 0.8661 - val_loss: 0.1299 - val_pr_auc: 0.6470 - val_roc_auc: 0.8443 - learning_rate: 0.0010
Epoch 4/120
46/46 ━━━━━━━━━━━━━━━━━━━━ 1s 32ms/step - loss: 0.1140 - pr_auc: 0.7611 - roc_auc: 0.8757 - val_loss: 0.1232 - val_pr_auc: 0.6628 - val_roc_auc: 0.8539 - learning_rate: 0.0010
Epoch 5/120
46/46 ━━━━━━━━━━━━━━━━━━━━ 1s 32ms/step - loss: 0.1053 - pr_auc: 0

### Threshold Optimization


In [28]:
# Optimize threshold for residual network
residual_train_scores = residual_model.predict(X_train, verbose=0).ravel()

prec, rec, thresholds = precision_recall_curve(y_train, residual_train_scores)
f1_scores = 2 * prec * rec / (prec + rec + 1e-9)
best_idx = np.nanargmax(f1_scores)
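# prec[i] and rec[i] describe predictions with score >= thresholds[i]; the final
# (precision=1, recall=0) point has no threshold, so clamp the index to the last one
residual_optimal_threshold = thresholds[min(best_idx, len(thresholds) - 1)] if len(thresholds) > 0 else 0.5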

print(f'Optimal threshold: {residual_optimal_threshold:.4f}')


Optimal threshold: 0.3520


### Evaluation


In [29]:
# Evaluate residual network
residual_train_pred = (residual_train_scores >= residual_optimal_threshold).astype(int)
residual_test_scores = residual_model.predict(X_test, verbose=0).ravel()
residual_test_pred = (residual_test_scores >= residual_optimal_threshold).astype(int)

residual_train_metrics = {
    'roc_auc': roc_auc_score(y_train, residual_train_scores),
    'pr_auc': average_precision_score(y_train, residual_train_scores),
    'f1': f1_score(y_train, residual_train_pred),
    'precision': precision_score(y_train, residual_train_pred),
    'recall': recall_score(y_train, residual_train_pred)
}

residual_test_metrics = {
    'roc_auc': roc_auc_score(y_test, residual_test_scores),
    'pr_auc': average_precision_score(y_test, residual_test_scores),
    'f1': f1_score(y_test, residual_test_pred),
    'precision': precision_score(y_test, residual_test_pred),
    'recall': recall_score(y_test, residual_test_pred),
    'threshold': residual_optimal_threshold
}

residual_train_cm = confusion_matrix(y_train, residual_train_pred)
residual_test_cm = confusion_matrix(y_test, residual_test_pred)
residual_train_specificity = residual_train_cm[0, 0] / (residual_train_cm[0, 0] + residual_train_cm[0, 1] + 1e-9)
residual_test_specificity = residual_test_cm[0, 0] / (residual_test_cm[0, 0] + residual_test_cm[0, 1] + 1e-9)

print('Training Metrics:')
print(f"  ROC-AUC: {residual_train_metrics['roc_auc']:.4f}, PR-AUC: {residual_train_metrics['pr_auc']:.4f}")
print(f"  F1: {residual_train_metrics['f1']:.4f}, Precision: {residual_train_metrics['precision']:.4f}, Recall: {residual_train_metrics['recall']:.4f}")
print(f"  Specificity: {residual_train_specificity:.4f}")

print('\nTest Metrics:')
print(f"  ROC-AUC: {residual_test_metrics['roc_auc']:.4f}, PR-AUC: {residual_test_metrics['pr_auc']:.4f}")
print(f"  F1: {residual_test_metrics['f1']:.4f}, Precision: {residual_test_metrics['precision']:.4f}, Recall: {residual_test_metrics['recall']:.4f}")
print(f"  Specificity: {residual_test_specificity:.4f}")


Training Metrics:
  ROC-AUC: 0.9369, PR-AUC: 0.8846
  F1: 0.8180, Precision: 0.9474, Recall: 0.7197
  Specificity: 0.9889

Test Metrics:
  ROC-AUC: 0.9306, PR-AUC: 0.8764
  F1: 0.8094, Precision: 0.9273, Recall: 0.7180
  Specificity: 0.9843


### Visualizations


In [30]:
fpr_train, tpr_train, _ = roc_curve(y_train, residual_train_scores)
fpr_test, tpr_test, _ = roc_curve(y_test, residual_test_scores)
auc_train = auc(fpr_train, tpr_train)
auc_test = auc(fpr_test, tpr_test)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr_train, tpr_train, label=f'Train (AUC = {auc_train:.4f})', linewidth=2, linestyle='--')
ax.plot(fpr_test, tpr_test, label=f'Test (AUC = {auc_test:.4f})', linewidth=2)
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, alpha=0.5, label='Random')
ax.set_xlabel('False Positive Rate', fontsize=11)
ax.set_ylabel('True Positive Rate', fontsize=11)
ax.set_title('Residual Neural Network - ROC Curve', fontsize=12, fontweight='bold')
ax.legend(loc='lower right')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'residual_roc.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: residual_roc.png')


Saved: residual_roc.png


In [31]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

sns.heatmap(residual_train_cm, annot=True, fmt='d', cmap='Blues', ax=ax1, cbar=False)
ax1.set_title('Training Set', fontsize=11, fontweight='bold')
ax1.set_xlabel('Predicted')
ax1.set_ylabel('Actual')

sns.heatmap(residual_test_cm, annot=True, fmt='d', cmap='Blues', ax=ax2, cbar=False)
ax2.set_title('Test Set', fontsize=11, fontweight='bold')
ax2.set_xlabel('Predicted')
ax2.set_ylabel('Actual')

plt.suptitle('Residual Neural Network - Confusion Matrix', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'residual_cm.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: residual_cm.png')


Saved: residual_cm.png


In [32]:
prec_train, rec_train, _ = precision_recall_curve(y_train, residual_train_scores)
prec_test, rec_test, _ = precision_recall_curve(y_test, residual_test_scores)
pr_auc_train = average_precision_score(y_train, residual_train_scores)
pr_auc_test = average_precision_score(y_test, residual_test_scores)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(rec_train, prec_train, label=f'Train (AUC = {pr_auc_train:.4f})', linewidth=2, linestyle='--')
ax.plot(rec_test, prec_test, label=f'Test (AUC = {pr_auc_test:.4f})', linewidth=2)
ax.set_xlabel('Recall', fontsize=11)
ax.set_ylabel('Precision', fontsize=11)
ax.set_title('Residual Neural Network - Precision-Recall Curve', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'residual_pr.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: residual_pr.png')


Saved: residual_pr.png


## Model 5: Multi-Scale Ensemble


### Base Models Training


In [33]:
# Base Models for Ensemble: Train multiple gradient boosting models with different configurations
# Why multiple models? Different algorithms and hyperparameters capture different patterns
# Ensemble diversity improves final performance by combining complementary predictions

# XGBoost Deep: Deep trees (depth=8) for complex patterns, slower but more powerful
xgb_deep = XGBClassifier(
    max_depth=8, learning_rate=0.02, n_estimators=500,
    subsample=0.8, colsample_bytree=0.8, scale_pos_weight=scale_pos_weight,
    random_state=42, tree_method='hist', eval_metric='auc'
)

# XGBoost Shallow: Shallow trees (depth=3) for simpler patterns, faster training
xgb_shallow = XGBClassifier(
    max_depth=3, learning_rate=0.05, n_estimators=300,
    subsample=0.9, colsample_bytree=0.9, scale_pos_weight=scale_pos_weight,
    random_state=42, tree_method='hist', eval_metric='auc'
)

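# LightGBM: leaf-wise tree growth (vs. XGBoost's level-wise growth), often faster to train
# Note: LightGBM applies subsample only when subsample_freq > 0 (default 0),
# so row subsampling is inactive as configured here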
lgbm_fast = LGBMClassifier(
    max_depth=6, learning_rate=0.03, n_estimators=400,
    subsample=0.85, colsample_bytree=0.85, scale_pos_weight=scale_pos_weight,
    random_state=42, verbose=-1
)

# CatBoost: Handles categorical features well, robust to overfitting
catboost_robust = CatBoostClassifier(
    depth=7, learning_rate=0.025, iterations=450,
    subsample=0.8, colsample_bylevel=0.8, scale_pos_weight=scale_pos_weight,
    random_state=42, verbose=False
)

base_models = [
    ('xgb_deep', xgb_deep),
    ('xgb_shallow', xgb_shallow),
    ('lgbm_fast', lgbm_fast),
    ('catboost_robust', catboost_robust)
]

# Train all base models on the full training set
# The cross-validation loop below does not reuse these fitted models; it builds
# fresh copies from their hyperparameters to generate out-of-fold meta-features
for name, base_model in base_models:
    base_model.fit(X_train, y_train)
    print(f'{name} trained')


xgb_deep trained
xgb_shallow trained


Exception in thread Thread-4 (_readerthread):
Traceback (most recent call last):
  File "d:\Conda\envs\final_last\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "d:\Conda\envs\final_last\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "d:\Conda\envs\final_last\lib\subprocess.py", line 1515, in _readerthread
    buffer.append(fh.read())
  File "d:\Conda\envs\final_last\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 4: invalid continuation byte


lgbm_fast trained
catboost_robust trained


### Meta-Learner Training


In [34]:
# ENSEMBLE STACKING: train a meta-learner on predictions from the base models
# Each base model is refit on the training folds and predicts the held-out fold,
# so every row of meta_train is an out-of-fold prediction
# This limits data leakage: the meta-learner never sees a prediction made by a model that was fit on that same row

# Initialize meta-feature arrays
# len(base_models) + 1: 4 gradient boosting models + 1 neural network = 5 base models
n_base_models = len(base_models) + 1
meta_train = np.zeros((X_train.shape[0], n_base_models))  # One column per base model
meta_test = np.zeros((X_test.shape[0], n_base_models))

# Separate arrays for neural network predictions (will be added to meta_train/meta_test later)
nn_meta_train = np.zeros(X_train.shape[0])
nn_meta_test = np.zeros(X_test.shape[0])

# Cross-validation loop: For each fold, train base models on training fold, predict on validation fold
# This ensures meta-learner training data (meta_train) has unbiased predictions
for train_idx, val_idx in cv.split(X_train, y_train):
    X_train_fold, X_val_fold = X_train[train_idx], X_train[val_idx]
    y_train_fold, y_val_fold = y_train[train_idx], y_train[val_idx]

    # Train each gradient boosting model on current fold
    for model_idx, (name, base_model) in enumerate(base_models):
        # Create fresh copy of model with same hyperparameters
        model_copy = type(base_model)(**base_model.get_params())
        # Train on training fold only
        model_copy.fit(X_train_fold, y_train_fold)
        # Predict on validation fold (these become meta-features for meta-learner)
        meta_train[val_idx, model_idx] = model_copy.predict_proba(X_val_fold)[:, 1]
        # Predict on test set and average across folds (reduces variance)
        meta_test[:, model_idx] += model_copy.predict_proba(X_test)[:, 1] / cv.n_splits

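    # Train the neural network base model on the current fold
    # A fresh network is built per fold so that its validation-fold predictions are out-of-fold,
    # like those of the tree models. The validation fold also drives early stopping below,
    # so these predictions can be slightly optimistic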
    nn_model_fold = keras.Sequential([
        layers.Dense(512, kernel_regularizer=regularizers.l2(1e-4), input_shape=(X_train_fold.shape[1],)),
        layers.BatchNormalization(),
        layers.Activation('swish'),
        layers.Dropout(0.3),
        layers.Dense(256, kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.Activation('swish'),
        layers.Dropout(0.3),
        layers.Dense(128, kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.Activation('swish'),
        layers.Dropout(0.2),
        layers.Dense(64, kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.Activation('swish'),
        layers.Dropout(0.15),
        layers.Dense(1, activation='sigmoid')
    ])
    
    nn_model_fold.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss=focal_loss,
        metrics=[keras.metrics.AUC(name='roc_auc', curve='ROC'), keras.metrics.AUC(name='pr_auc', curve='PR')]
    )
    
    fold_callbacks = [
        keras.callbacks.EarlyStopping(monitor='val_roc_auc', mode='max', patience=8, restore_best_weights=True, verbose=0),
        keras.callbacks.ReduceLROnPlateau(monitor='val_roc_auc', mode='max', factor=0.5, patience=4, min_lr=1e-6, verbose=0)
    ]
    
    # Train neural network on training fold
    nn_model_fold.fit(
        X_train_fold, y_train_fold,
        validation_data=(X_val_fold, y_val_fold),
        epochs=80,
        batch_size=512,
        callbacks=fold_callbacks,
        verbose=0
    )
    
    # Get predictions from neural network for validation fold and test set
    nn_meta_train[val_idx] = nn_model_fold.predict(X_val_fold, verbose=0).ravel()
    nn_meta_test += nn_model_fold.predict(X_test, verbose=0).ravel() / cv.n_splits

# Add neural network predictions as the 5th base model
meta_train[:, len(base_models)] = nn_meta_train
meta_test[:, len(base_models)] = nn_meta_test

print(f'Meta-learner training data shape: {meta_train.shape}')


Meta-learner training data shape: (26064, 5)
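
The out-of-fold bookkeeping in the loop above is easier to see on a toy problem. The sketch below is a NumPy-only illustration under several assumptions: closed-form ridge regression stands in for the real base models, the folds are a plain shuffled split rather than the notebook's `cv` object, and `fit_ridge`, `out_of_fold_features`, and the demo arrays are hypothetical names.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge regression, used here as a stand-in base model."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def out_of_fold_features(X, y, X_test, lams, n_splits=5, seed=0):
    """Build meta-features: one column per base model (one per ridge penalty)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    meta_train = np.zeros((len(X), len(lams)))
    meta_test = np.zeros((len(X_test), len(lams)))
    for k, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        for m, lam in enumerate(lams):
            w = fit_ridge(X[train_idx], y[train_idx], lam)
            # Rows in the held-out fold get predictions from a model that never saw them
            meta_train[val_idx, m] = X[val_idx] @ w
            # Test predictions are averaged over the per-fold models
            meta_test[:, m] += X_test @ w / n_splits
    return meta_train, meta_test

# Toy data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(float)
X_test_demo = rng.normal(size=(20, 4))
meta_train_demo, meta_test_demo = out_of_fold_features(
    X_demo, y_demo, X_test_demo, lams=[0.1, 10.0]
)
```

Each training row is written exactly once, by the model fitted without it, while each test row receives the mean of all fold models' predictions.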


In [35]:
# META-LEARNER ARCHITECTURE: a small neural network that learns to combine base model predictions
# Input: 5 meta-features, one out-of-fold prediction per base model
# Output: final ensemble probability
# Why a small network? The inputs are already predictions from trained models,
# so a low-capacity combiner is enough and is less prone to overfitting

meta_input = layers.Input(shape=(n_base_models,))
meta_x = layers.Dense(32, activation='swish', kernel_regularizer=regularizers.l2(1e-4))(meta_input)
meta_x = layers.BatchNormalization()(meta_x)
meta_x = layers.Dropout(0.2)(meta_x)
meta_x = layers.Dense(16, activation='swish', kernel_regularizer=regularizers.l2(1e-4))(meta_x)
meta_x = layers.BatchNormalization()(meta_x)
meta_output = layers.Dense(1, activation='sigmoid')(meta_x)

meta_model = keras.Model(inputs=meta_input, outputs=meta_output)

# Compile meta-learner with focal loss
meta_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss=focal_loss,
    metrics=[keras.metrics.AUC(name='roc_auc', curve='ROC'), keras.metrics.AUC(name='pr_auc', curve='PR')]
)

meta_early_stop = keras.callbacks.EarlyStopping(monitor='val_roc_auc', mode='max', patience=10, restore_best_weights=True)
meta_lr_reduce = keras.callbacks.ReduceLROnPlateau(monitor='val_roc_auc', mode='max', factor=0.5, patience=5, min_lr=1e-6)

# Train meta-learner on meta-features (base model predictions)
# meta_train contains predictions from base models trained on different folds
# This teaches meta-learner how to best combine the base model predictions
meta_history = meta_model.fit(
    meta_train, y_train,
    validation_split=0.1,
    epochs=80,
    batch_size=256,
    callbacks=[meta_early_stop, meta_lr_reduce],
    verbose=0
)

print(f'Meta-learner trained in {len(meta_history.history["roc_auc"])} epochs')


Meta-learner trained in 38 epochs


In [36]:
# Optimize threshold for ensemble meta-learner
ensemble_train_scores = meta_model.predict(meta_train, verbose=0).ravel()

prec, rec, thresholds = precision_recall_curve(y_train, ensemble_train_scores)
f1_scores = 2 * prec * rec / (prec + rec + 1e-9)
best_idx = np.nanargmax(f1_scores)
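# prec[i] and rec[i] describe predictions with score >= thresholds[i]; the final
# (precision=1, recall=0) point has no threshold, so clamp the index to the last one
ensemble_optimal_threshold = thresholds[min(best_idx, len(thresholds) - 1)] if len(thresholds) > 0 else 0.5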

print(f'Optimal threshold: {ensemble_optimal_threshold:.4f}')


Optimal threshold: 0.3620


### Evaluation


In [37]:
# Evaluate ensemble meta-learner performance
ensemble_train_pred = (ensemble_train_scores >= ensemble_optimal_threshold).astype(int)
ensemble_test_scores = meta_model.predict(meta_test, verbose=0).ravel()
ensemble_test_pred = (ensemble_test_scores >= ensemble_optimal_threshold).astype(int)

ensemble_train_metrics = {
    'roc_auc': roc_auc_score(y_train, ensemble_train_scores),
    'pr_auc': average_precision_score(y_train, ensemble_train_scores),
    'f1': f1_score(y_train, ensemble_train_pred),
    'precision': precision_score(y_train, ensemble_train_pred),
    'recall': recall_score(y_train, ensemble_train_pred)
}

ensemble_test_metrics = {
    'roc_auc': roc_auc_score(y_test, ensemble_test_scores),
    'pr_auc': average_precision_score(y_test, ensemble_test_scores),
    'f1': f1_score(y_test, ensemble_test_pred),
    'precision': precision_score(y_test, ensemble_test_pred),
    'recall': recall_score(y_test, ensemble_test_pred),
    'threshold': ensemble_optimal_threshold
}

ensemble_train_cm = confusion_matrix(y_train, ensemble_train_pred)
ensemble_test_cm = confusion_matrix(y_test, ensemble_test_pred)
ensemble_train_specificity = ensemble_train_cm[0, 0] / (ensemble_train_cm[0, 0] + ensemble_train_cm[0, 1] + 1e-9)
ensemble_test_specificity = ensemble_test_cm[0, 0] / (ensemble_test_cm[0, 0] + ensemble_test_cm[0, 1] + 1e-9)

print('Training Metrics:')
print(f"  ROC-AUC: {ensemble_train_metrics['roc_auc']:.4f}, PR-AUC: {ensemble_train_metrics['pr_auc']:.4f}")
print(f"  F1: {ensemble_train_metrics['f1']:.4f}, Precision: {ensemble_train_metrics['precision']:.4f}, Recall: {ensemble_train_metrics['recall']:.4f}")
print(f"  Specificity: {ensemble_train_specificity:.4f}")

print('\nTest Metrics:')
print(f"  ROC-AUC: {ensemble_test_metrics['roc_auc']:.4f}, PR-AUC: {ensemble_test_metrics['pr_auc']:.4f}")
print(f"  F1: {ensemble_test_metrics['f1']:.4f}, Precision: {ensemble_test_metrics['precision']:.4f}, Recall: {ensemble_test_metrics['recall']:.4f}")
print(f"  Specificity: {ensemble_test_specificity:.4f}")


Training Metrics:
  ROC-AUC: 0.9484, PR-AUC: 0.9040
  F1: 0.8353, Precision: 0.9508, Recall: 0.7448
  Specificity: 0.9893

Test Metrics:
  ROC-AUC: 0.9539, PR-AUC: 0.9137
  F1: 0.8397, Precision: 0.9636, Recall: 0.7440
  Specificity: 0.9921


### Visualizations


In [38]:
fpr_train, tpr_train, _ = roc_curve(y_train, ensemble_train_scores)
fpr_test, tpr_test, _ = roc_curve(y_test, ensemble_test_scores)
auc_train = auc(fpr_train, tpr_train)
auc_test = auc(fpr_test, tpr_test)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr_train, tpr_train, label=f'Train (AUC = {auc_train:.4f})', linewidth=2, linestyle='--')
ax.plot(fpr_test, tpr_test, label=f'Test (AUC = {auc_test:.4f})', linewidth=2)
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, alpha=0.5, label='Random')
ax.set_xlabel('False Positive Rate', fontsize=11)
ax.set_ylabel('True Positive Rate', fontsize=11)
ax.set_title('Multi-Scale Ensemble - ROC Curve', fontsize=12, fontweight='bold')
ax.legend(loc='lower right')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'ensemble_roc.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: ensemble_roc.png')


Saved: ensemble_roc.png


In [39]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

sns.heatmap(ensemble_train_cm, annot=True, fmt='d', cmap='Blues', ax=ax1, cbar=False)
ax1.set_title('Training Set', fontsize=11, fontweight='bold')
ax1.set_xlabel('Predicted')
ax1.set_ylabel('Actual')

sns.heatmap(ensemble_test_cm, annot=True, fmt='d', cmap='Blues', ax=ax2, cbar=False)
ax2.set_title('Test Set', fontsize=11, fontweight='bold')
ax2.set_xlabel('Predicted')
ax2.set_ylabel('Actual')

plt.suptitle('Multi-Scale Ensemble - Confusion Matrix', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'ensemble_cm.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: ensemble_cm.png')


Saved: ensemble_cm.png


In [40]:
prec_train, rec_train, _ = precision_recall_curve(y_train, ensemble_train_scores)
prec_test, rec_test, _ = precision_recall_curve(y_test, ensemble_test_scores)
pr_auc_train = average_precision_score(y_train, ensemble_train_scores)
pr_auc_test = average_precision_score(y_test, ensemble_test_scores)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(rec_train, prec_train, label=f'Train (AUC = {pr_auc_train:.4f})', linewidth=2, linestyle='--')
ax.plot(rec_test, prec_test, label=f'Test (AUC = {pr_auc_test:.4f})', linewidth=2)
ax.set_xlabel('Recall', fontsize=11)
ax.set_ylabel('Precision', fontsize=11)
ax.set_title('Multi-Scale Ensemble - Precision-Recall Curve', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'ensemble_pr.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: ensemble_pr.png')


Saved: ensemble_pr.png


## Model Comparison


### Performance Ranking


In [41]:
# Compare all neural network models across multiple metrics
# Ensemble typically performs best by combining strengths of individual models
comparison_data = {
    'Model': ['Pure TabNet', 'TabNet + Tokenizer', 'Deep & Cross Network', 'Residual Neural Network', 'Multi-Scale Ensemble'],
    'Test ROC-AUC': [
        tabnet_pure_test_metrics['roc_auc'],
        tabnet_tokenizer_test_metrics['roc_auc'],
        dcn_test_metrics['roc_auc'],
        residual_test_metrics['roc_auc'],
        ensemble_test_metrics['roc_auc']
    ],
    'Test PR-AUC': [
        tabnet_pure_test_metrics['pr_auc'],
        tabnet_tokenizer_test_metrics['pr_auc'],
        dcn_test_metrics['pr_auc'],
        residual_test_metrics['pr_auc'],
        ensemble_test_metrics['pr_auc']
    ],
    'Test F1': [
        tabnet_pure_test_metrics['f1'],
        tabnet_tokenizer_test_metrics['f1'],
        dcn_test_metrics['f1'],
        residual_test_metrics['f1'],
        ensemble_test_metrics['f1']
    ],
    'Test Precision': [
        tabnet_pure_test_metrics['precision'],
        tabnet_tokenizer_test_metrics['precision'],
        dcn_test_metrics['precision'],
        residual_test_metrics['precision'],
        ensemble_test_metrics['precision']
    ],
    'Test Recall': [
        tabnet_pure_test_metrics['recall'],
        tabnet_tokenizer_test_metrics['recall'],
        dcn_test_metrics['recall'],
        residual_test_metrics['recall'],
        ensemble_test_metrics['recall']
    ],
    'Optimal Threshold': [
        tabnet_pure_optimal_threshold,
        tabnet_tokenizer_optimal_threshold,
        dcn_optimal_threshold,
        residual_optimal_threshold,
        ensemble_optimal_threshold
    ]
}

comparison_df = pd.DataFrame(comparison_data)
# Sort by Test ROC-AUC to rank models
comparison_df = comparison_df.sort_values('Test ROC-AUC', ascending=False)
comparison_df.reset_index(drop=True, inplace=True)

print(comparison_df.to_string(index=False))


                  Model  Test ROC-AUC  Test PR-AUC  Test F1  Test Precision  Test Recall  Optimal Threshold
   Multi-Scale Ensemble      0.953922     0.913679 0.839683        0.963570     0.744023           0.362014
Residual Neural Network      0.930603     0.876362 0.809354        0.927339     0.718003           0.351998
   Deep & Cross Network      0.925147     0.865276 0.793349        0.907609     0.704641           0.307599
            Pure TabNet      0.921846     0.860787 0.794575        0.917972     0.700422           0.767038
     TabNet + Tokenizer      0.919736     0.846213 0.774676        0.878680     0.692686           0.706486


### ROC Curves Comparison


In [42]:
tabnet_pure_fpr, tabnet_pure_tpr, _ = roc_curve(y_test, tabnet_pure_test_scores)
tabnet_tokenizer_fpr, tabnet_tokenizer_tpr, _ = roc_curve(y_test, tabnet_tokenizer_test_scores)
dcn_fpr, dcn_tpr, _ = roc_curve(y_test, dcn_test_scores)
residual_fpr, residual_tpr, _ = roc_curve(y_test, residual_test_scores)
ensemble_fpr, ensemble_tpr, _ = roc_curve(y_test, ensemble_test_scores)

fig, ax = plt.subplots(figsize=(10, 7))
ax.plot(tabnet_pure_fpr, tabnet_pure_tpr, label=f'Pure TabNet (AUC = {tabnet_pure_test_metrics["roc_auc"]:.4f})', linewidth=2.5, color='#2E86AB')
ax.plot(tabnet_tokenizer_fpr, tabnet_tokenizer_tpr, label=f'TabNet + Tokenizer (AUC = {tabnet_tokenizer_test_metrics["roc_auc"]:.4f})', linewidth=2.5, color='#A23B72')
ax.plot(dcn_fpr, dcn_tpr, label=f'DCN (AUC = {dcn_test_metrics["roc_auc"]:.4f})', linewidth=2.5, color='#F18F01')
ax.plot(residual_fpr, residual_tpr, label=f'Residual NN (AUC = {residual_test_metrics["roc_auc"]:.4f})', linewidth=2.5, color='#6A994E')
ax.plot(ensemble_fpr, ensemble_tpr, label=f'Multi-Scale Ensemble (AUC = {ensemble_test_metrics["roc_auc"]:.4f})', linewidth=2.5, color='#C77DFF')
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, alpha=0.5, label='Random Classifier')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves Comparison - Test Set', fontsize=14, fontweight='bold')
ax.legend(loc='lower right', fontsize=10)
ax.grid(alpha=0.3)
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
plt.tight_layout()
plt.savefig(os.path.join(images_dir, 'roc_curves_comparison.png'), dpi=150, bbox_inches='tight')
plt.close()
print('Saved: roc_curves_comparison.png')


Saved: roc_curves_comparison.png


## Save Results


In [43]:
# Prepare results dictionary with all model metrics and performance data
all_results = {
    'tabnet_pure': {
        'train_metrics': {k: float(v) for k, v in tabnet_pure_train_metrics.items()},
        'test_metrics': {k: float(v) for k, v in tabnet_pure_test_metrics.items()},
        'train_confusion': tabnet_pure_train_cm.tolist(),
        'test_confusion': tabnet_pure_test_cm.tolist(),
        'train_specificity': float(tabnet_pure_train_specificity),
        'test_specificity': float(tabnet_pure_test_specificity),
        'optimal_threshold': float(tabnet_pure_optimal_threshold)
    },
    'tabnet_tokenizer': {
        'train_metrics': {k: float(v) for k, v in tabnet_tokenizer_train_metrics.items()},
        'test_metrics': {k: float(v) for k, v in tabnet_tokenizer_test_metrics.items()},
        'train_confusion': tabnet_tokenizer_train_cm.tolist(),
        'test_confusion': tabnet_tokenizer_test_cm.tolist(),
        'train_specificity': float(tabnet_tokenizer_train_specificity),
        'test_specificity': float(tabnet_tokenizer_test_specificity),
        'optimal_threshold': float(tabnet_tokenizer_optimal_threshold)
    },
    'dcn': {
        'train_metrics': {k: float(v) for k, v in dcn_train_metrics.items()},
        'test_metrics': {k: float(v) for k, v in dcn_test_metrics.items()},
        'train_confusion': dcn_train_cm.tolist(),
        'test_confusion': dcn_test_cm.tolist(),
        'train_specificity': float(dcn_train_specificity),
        'test_specificity': float(dcn_test_specificity),
        'optimal_threshold': float(dcn_optimal_threshold)
    },
    'residual': {
        'train_metrics': {k: float(v) for k, v in residual_train_metrics.items()},
        'test_metrics': {k: float(v) for k, v in residual_test_metrics.items()},
        'train_confusion': residual_train_cm.tolist(),
        'test_confusion': residual_test_cm.tolist(),
        'train_specificity': float(residual_train_specificity),
        'test_specificity': float(residual_test_specificity),
        'optimal_threshold': float(residual_optimal_threshold)
    },
    'ensemble': {
        'train_metrics': {k: float(v) for k, v in ensemble_train_metrics.items()},
        'test_metrics': {k: float(v) for k, v in ensemble_test_metrics.items()},
        'train_confusion': ensemble_train_cm.tolist(),
        'test_confusion': ensemble_test_cm.tolist(),
        'train_specificity': float(ensemble_train_specificity),
        'test_specificity': float(ensemble_test_specificity),
        'optimal_threshold': float(ensemble_optimal_threshold)
    },
    'comparison': json.loads(comparison_df.to_json(orient='records'))
}

# Save metrics to a JSON file for later analysis (create the directory if it is missing)
os.makedirs(MODELS_DIR, exist_ok=True)
with open(os.path.join(MODELS_DIR, 'neural_network_results.json'), 'w') as f:
    json.dump(all_results, f, indent=2)

# Save trained models in their native formats
# TabNet (PyTorch-based): save_model() appends '.zip' to the path it is given,
# so the extension is omitted here to avoid writing '*.zip.zip' files
tabnet_pure.save_model(os.path.join(MODELS_DIR, 'tabnet_pure_neural'))
tabnet_tokenizer.save_model(os.path.join(MODELS_DIR, 'tabnet_tokenizer_neural'))
# Keras models are saved in the legacy HDF5 (.h5) format
dcn_model.save(os.path.join(MODELS_DIR, 'dcn_neural.h5'))
residual_model.save(os.path.join(MODELS_DIR, 'residual_neural.h5'))
meta_model.save(os.path.join(MODELS_DIR, 'meta_learner_neural.h5'))

# Save gradient boosting base models using joblib (scikit-learn compatible format)
joblib.dump({
    'xgb_deep': xgb_deep,
    'xgb_shallow': xgb_shallow,
    'lgbm_fast': lgbm_fast,
    'catboost_robust': catboost_robust
}, os.path.join(MODELS_DIR, 'ensemble_base_models_neural.joblib'))

print('Saved all model results and artifacts')


Successfully saved model at d:\FINAL PROJECT\models\tabnet_pure_neural.zip.zip




Successfully saved model at d:\FINAL PROJECT\models\tabnet_tokenizer_neural.zip.zip




Saved all model results and artifacts
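The five per-model entries in the save cell above all share one shape, so they could be built by a small helper. The sketch below is one possible way to do that using only the standard library. `summarize_model` and `to_builtin` are hypothetical names that do not appear in this notebook, and the numbers are placeholders rather than results from any model here. It assumes confusion matrices are NumPy arrays (which expose `.tolist()`) or plain nested lists.

```python
import json


def to_builtin(value):
    """Return a JSON-friendly value: NumPy arrays and scalars expose .tolist()."""
    return value.tolist() if hasattr(value, "tolist") else value


def summarize_model(train_metrics, test_metrics, train_cm, test_cm,
                    train_specificity, test_specificity, optimal_threshold):
    """Build one per-model entry with the same keys used in the save cell."""
    return {
        "train_metrics": {k: float(v) for k, v in train_metrics.items()},
        "test_metrics": {k: float(v) for k, v in test_metrics.items()},
        "train_confusion": to_builtin(train_cm),
        "test_confusion": to_builtin(test_cm),
        "train_specificity": float(train_specificity),
        "test_specificity": float(test_specificity),
        "optimal_threshold": float(optimal_threshold),
    }


# Placeholder values only; plain lists stand in for NumPy confusion matrices.
example = summarize_model(
    {"roc_auc": 0.5}, {"roc_auc": 0.5},
    [[1, 0], [0, 1]], [[1, 0], [0, 1]],
    1.0, 1.0, 0.5,
)

# Round-trip through JSON to confirm the entry contains only serializable types.
restored = json.loads(json.dumps({"example_model": example}, indent=2))
```

With such a helper, each entry in `all_results` would reduce to a single call per model, and the round-trip check gives an early signal if a non-serializable value slips into the dictionary.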
