Note: If the code does not run properly, connect to a T4 GPU runtime. **Models will take a while to run so *only* rerun code if necessary.**

## 1. Install and Unpack Data

In [None]:
! pip install -q kaggle
#Assuming kaggle.json already in directory

In [None]:
! mkdir ~/.kaggle

! cp kaggle.json ~/.kaggle/

! chmod 600 ~/.kaggle/kaggle.json

! kaggle datasets list

mkdir: cannot create directory ‘/root/.kaggle’: File exists
cp: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 4, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.12/dist-packages/kaggle/__init__.py", line 6, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.12/dist-packages/kaggle/api/kaggle_api_extended.py", line 434, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method. See setup instructions at https://github.com/Kaggle/kaggle-api/
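
The traceback above comes from the Kaggle client failing to find `kaggle.json`. A small pre-flight check can surface this before any download is attempted. The sketch below is only an illustration and is not part of the Kaggle API: the helper name `find_kaggle_credentials` and the directories passed to it are hypothetical. It also accepts the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, which is the "environment method" the error message refers to.

```python
import os
import tempfile
from pathlib import Path


def find_kaggle_credentials(search_dirs, env=None):
    """Return where Kaggle credentials were found, or None if they are missing.

    Hypothetical helper: checks the environment variables first, then looks
    for a kaggle.json file in each directory listed in search_dirs.
    """
    env = os.environ if env is None else env
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment"
    for directory in search_dirs:
        candidate = Path(directory) / "kaggle.json"
        if candidate.is_file():
            return str(candidate)
    return None


# Demonstration in a throwaway directory so no real credentials are touched
with tempfile.TemporaryDirectory() as tmp:
    missing = find_kaggle_credentials([tmp], env={})
    (Path(tmp) / "kaggle.json").write_text('{"username": "demo", "key": "demo"}')
    found = find_kaggle_credentials([tmp], env={})
    from_env = find_kaggle_credentials(
        [tmp], env={"KAGGLE_USERNAME": "u", "KAGGLE_KEY": "k"}
    )

print(missing, found is not None, from_env)
```

In this notebook, the check could be called with something like `[Path.home() / ".kaggle", Path.cwd()]` before the `!kaggle` commands, stopping early with a clear message when it returns `None`.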


In [None]:
!kaggle competitions download -c playground-series-s5e11
!unzip -q /content/playground-series-s5e11.zip

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 4, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.12/dist-packages/kaggle/__init__.py", line 6, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.12/dist-packages/kaggle/api/kaggle_api_extended.py", line 434, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method. See setup instructions at https://github.com/Kaggle/kaggle-api/
unzip:  cannot find or open /content/playground-series-s5e11.zip, /content/playground-series-s5e11.zip.zip or /content/playground-series-s5e11.zip.ZIP.


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
import os
import ipywidgets as widgets
from IPython.display import display
import warnings
import json
from tqdm.notebook import tqdm
from sklearn.metrics import f1_score
from pathlib import Path
from sklearn.metrics import roc_auc_score

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Define the base input path
BASE_PATH = Path('/content')

In [None]:
train_df = pd.read_csv('train.csv')
print('-'*50)
print('Train CSV:')
display(train_df.head())
test_df = pd.read_csv('test.csv')
print('-'*50)
print('Test CSV:')
display(test_df.head())
sample_submission_df = pd.read_csv('sample_submission.csv')
print('-'*50)
print('Sample Submission CSV:')
display(sample_submission_df.head())

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 2. EDA

First we'll do some basic EDA on the data to gain a better understanding of what we're working with.

In [None]:
#summary statistics
print(train_df.describe())

In [None]:
# Visualize the distribution of each column except 'id' using histograms
train_df.drop('id', axis=1).hist(figsize=(15, 10))
plt.tight_layout()
plt.show()

In [None]:
#heatmap with correlation matrix
corr = train_df.drop(['id', 'gender', 'marital_status', 'education_level', 'employment_status', 'loan_purpose', 'grade_subgrade'], axis=1).corr()
plt.figure(figsize=(10, 10))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

Most individual variables show minimal correlation with whether a loan is paid back, but `credit_score` has a moderate positive correlation, whereas `debt_to_income_ratio` and `interest_rate` have notable negative correlations.
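
To read the target row of a heatmap more directly, the correlations with the target can be pulled out of the matrix and sorted. The following is a minimal sketch on synthetic data; the column names `pos_feature`, `neg_feature`, and `noise_feature` are made up for illustration. In this notebook the same pattern would presumably be applied to `corr['loan_paid_back']`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 500
target = rng.integers(0, 2, n_rows)  # synthetic binary target

# Synthetic features: one moves with the target, one against it, one is noise
demo_corr_df = pd.DataFrame({
    "target": target,
    "pos_feature": target + rng.normal(0, 0.5, n_rows),
    "neg_feature": -target + rng.normal(0, 0.5, n_rows),
    "noise_feature": rng.normal(0, 1, n_rows),
})

# Keep only the target column of the correlation matrix, sorted ascending
target_corr = demo_corr_df.corr()["target"].drop("target").sort_values()
print(target_corr)
```

Sorting makes the strongest negative and positive relationships appear at the two ends of the list, which is easier to scan than an annotated heatmap when there are many columns.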

In [None]:
# Box plots to visualize the relationship between key numerical features and loan_paid_back
features_to_plot = ['debt_to_income_ratio', 'credit_score', 'interest_rate']

for feature in features_to_plot:
    plt.figure(figsize=(8, 6))
    sns.boxplot(x='loan_paid_back', y=feature, data=train_df)
    plt.title(f'Distribution of {feature} by Loan Paid Back Status')
    plt.xlabel('Loan Paid Back')
    plt.ylabel(feature)
    plt.show()

In [None]:
categorical_features = ['gender', 'marital_status', 'education_level', 'employment_status', 'loan_purpose', 'grade_subgrade']

for feature in categorical_features:
    plt.figure(figsize=(10, 6))
    sns.countplot(x=feature, hue='loan_paid_back', data=train_df)
    plt.title(f'Distribution of {feature} by Loan Paid Back Status')
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

In general, employment status and education level seem to have a noticeable impact on the likelihood that a loan will be paid back. Most loan purposes show similar payback likelihoods, but debt consolidation has a noticeably higher likelihood of the loan *not* being paid back.
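
Count plots make it hard to compare payback likelihoods across categories of different sizes, so it can help to compute the rate per category directly. The sketch below uses a tiny hand-made table; the column names `purpose` and `paid_back` and all values are invented for illustration and are not taken from the competition data.

```python
import pandas as pd

# Hand-made toy data: each row is one loan
demo_loans = pd.DataFrame({
    "purpose": ["car", "car", "home", "home", "home",
                "debt", "debt", "debt", "debt"],
    "paid_back": [1, 1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column is the payback rate; count shows the group size
payback_rates = (
    demo_loans.groupby("purpose")["paid_back"]
    .agg(["mean", "count"])
    .sort_values("mean")
)
print(payback_rates)
```

Applied to `train_df`, the same `groupby` over each entry in `categorical_features` with `loan_paid_back` as the value column would give one small table per feature.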

Now that we've done some simple exploratory statistics we can begin constructing the baseline models.

## 3. Baseline Models

In [None]:

# Generate random predictions between 0 and 1 for the training data
random_predictions = np.random.rand(len(train_df))

# Calculate the AUC ROC score
random_auc = roc_auc_score(train_df['loan_paid_back'], random_predictions)

print(f"AUC ROC for All-Random Baseline Model: {random_auc}")

In [None]:
# Create a baseline model that always predicts the majority class (loan_paid_back = 1)
majority_class_predictions = np.ones(len(train_df))

# Calculate the AUC ROC score for the majority class baseline model
majority_class_auc = roc_auc_score(train_df['loan_paid_back'], majority_class_predictions)

print(f"AUC ROC for Majority Class Baseline Model: {majority_class_auc}")

In [None]:
# Create a baseline model that always predicts the minority class (loan_paid_back = 0)
minority_class_predictions = np.zeros(len(train_df))

# Calculate the AUC ROC score for the minority class baseline model
minority_class_auc = roc_auc_score(train_df['loan_paid_back'], minority_class_predictions)

print(f"AUC ROC for Minority Class Baseline Model: {minority_class_auc}")

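As would be expected, each baseline model has an AUC ROC of ~0.5 (exactly 0.5 for the two constant-prediction baselines, since every prediction is tied). An AUC ROC of 0.5 means the model ranks a randomly chosen repaid loan above a randomly chosen unpaid loan only half the time, which is no better than chance. For any of the actual models to be valuable, we will have to see an AUC ROC greater than 0.5; ideally quite a bit higher.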
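
To make the meaning of the metric concrete, AUC ROC can be computed as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below implements that definition in NumPy on a toy label vector; the function name `pairwise_auc` and the toy scores are illustrative choices, not part of the notebook.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties = 0.5)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos_scores = scores[y_true == 1]
    neg_scores = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos_scores[:, None] - neg_scores[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos_scores) * len(neg_scores))


toy_labels = np.array([1, 1, 1, 0, 1, 0])

# Constant predictions: every pair is a tie, so the AUC is exactly 0.5
constant_auc = pairwise_auc(toy_labels, np.ones(len(toy_labels)))

# Scores equal to the labels: every positive outranks every negative
perfect_auc = pairwise_auc(toy_labels, toy_labels.astype(float))

# Imperfect scores: 7 of the 8 positive/negative pairs are ordered correctly
partial_auc = pairwise_auc(toy_labels, [0.9, 0.8, 0.3, 0.4, 0.7, 0.1])

print(constant_auc, perfect_auc, partial_auc)
```

This is why the all-ones and all-zeros baselines land on 0.5 regardless of class balance: a constant score carries no ranking information.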

## 4. Logistic Regression with Cross Validation (Model A)

First, we'll format the data for modelling.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Separate target variable from features
X = train_df.drop(['id', 'loan_paid_back'], axis=1)
y = train_df['loan_paid_back']
X_test = test_df.drop('id', axis=1)

# Identify categorical and numerical features
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

# Create preprocessing pipelines for numerical and categorical features
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Create a column transformer to apply different transformations to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)])

# Apply preprocessing to the training and testing data
X_processed = preprocessor.fit_transform(X)
X_test_processed = preprocessor.transform(X_test)

# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_processed, y, test_size=0.2, random_state=42, stratify=y)

print("Data preprocessing complete.")
print("Shape of X_train:", X_train.shape)
print("Shape of X_val:", X_val.shape)
print("Shape of X_test_processed:", X_test_processed.shape)

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize and train the Logistic Regression model
logistic_model = LogisticRegression(solver='liblinear', random_state=42)
logistic_model.fit(X_train, y_train)

# Predict probabilities on the validation set
y_val_pred_proba = logistic_model.predict_proba(X_val)[:, 1]

# Calculate and print the AUC ROC score on the validation set
logistic_auc = roc_auc_score(y_val, y_val_pred_proba)

print(f"AUC ROC for Logistic Regression Model on Validation Set: {logistic_auc}")

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_val, y_val_pred_proba)

# Calculate the AUC
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Logistic Regression')
plt.legend(loc='lower right')
plt.show()

The first simple logistic regression achieves an AUC ROC of ~0.9103, which is already quite promising. However, we should hope to see higher values still as we try more complex models.

In [None]:
from sklearn.model_selection import cross_val_score

# Initialize the Logistic Regression model
logistic_model = LogisticRegression(solver='liblinear', random_state=42)

# Perform 5-fold cross-validation and calculate AUC ROC scores
cv_scores = cross_val_score(logistic_model, X_processed, y, cv=5, scoring='roc_auc')

print(f"AUC ROC scores for each fold: {cv_scores}")
print(f"Average AUC ROC score from 5-fold cross-validation: {cv_scores.mean()}")
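
The cells above fit the scaler and encoder on all training rows before splitting, so each validation fold has already influenced the preprocessing. One way to keep every fold independent is to wrap the preprocessing and the model in a single `Pipeline`, so that `cross_val_score` refits both on each training fold. The following is a sketch under stated assumptions: the data is synthetic and the column names `score`, `ratio`, and `status` are hypothetical stand-ins for the notebook's features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with two numeric columns and one categorical column
rng = np.random.default_rng(42)
n_rows = 400
demo_X = pd.DataFrame({
    "score": rng.normal(650, 50, n_rows),
    "ratio": rng.uniform(0, 1, n_rows),
    "status": rng.choice(["employed", "student", "retired"], n_rows),
})
signal = (demo_X["score"] - 650) / 50 - 2 * (demo_X["ratio"] - 0.5)
demo_y = (signal + rng.normal(0, 1, n_rows) > 0).astype(int)

demo_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["score", "ratio"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["status"]),
])

# Preprocessing lives inside the pipeline, so it is refit on each training fold
demo_pipeline = Pipeline([
    ("prep", demo_preprocessor),
    ("clf", LogisticRegression(solver="liblinear", random_state=42)),
])

demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = cross_val_score(
    demo_pipeline, demo_X, demo_y, cv=demo_cv, scoring="roc_auc"
)
print(demo_scores, demo_scores.mean())
```

With standardization and one-hot encoding the leakage is usually small, so the scores reported in this notebook are unlikely to change much; the pipeline form mainly removes the need to reason about it.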

In [None]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
import numpy as np

# Initialize the Logistic Regression model
logistic_model = LogisticRegression(solver='liblinear', random_state=42)

# Initialize KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)

plt.figure(figsize=(10, 8))

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

# Explicitly get the target variable from the training DataFrame to avoid interference
y_train_cv = train_df['loan_paid_back']

i = 0
for train_index, test_index in kf.split(X_processed):
    X_train_fold, X_test_fold = X_processed[train_index], X_processed[test_index]
    y_train_fold, y_test_fold = y_train_cv.iloc[train_index], y_train_cv.iloc[test_index]


    # Train the model on the current fold's training data
    logistic_model.fit(X_train_fold, y_train_fold)

    # Predict probabilities on the current fold's test data
    y_pred_proba_fold = logistic_model.predict_proba(X_test_fold)[:, 1]

    # Calculate ROC curve and AUC for the current fold
    fpr, tpr, thresholds = roc_curve(y_test_fold, y_pred_proba_fold)
    aucs.append(auc(fpr, tpr))

    # Interpolate the TPR for the mean FPR
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0

    # Plot the ROC curve for the current fold
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
             label=f'ROC fold {i+1} (AUC = {aucs[-1]:.2f})')

    i += 1

# Plot the random guessing line
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
         label='Random Guessing', alpha=.8)

# Plot the mean ROC curve
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=f'Mean ROC (AUC = {mean_auc:.2f} $\\pm$ {std_auc:.2f})',
         lw=2, alpha=.8)

# Plot the standard deviation band
std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label='$\\pm$ 1 Std. Dev.')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curves for 5-fold Cross-Validation')
plt.legend(loc="lower right")
plt.show()

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression model on the entire preprocessed training data
logistic_model_full_data = LogisticRegression(solver='liblinear', random_state=42)
logistic_model_full_data.fit(X_processed, y)

# Get the coefficients from the trained model
coefficients = logistic_model_full_data.coef_[0]

# Get the feature names after preprocessing
numerical_feature_names = numerical_features.tolist()
categorical_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features).tolist()
all_feature_names = numerical_feature_names + categorical_feature_names

# Create a DataFrame to display feature names and their coefficients
feature_importance_df = pd.DataFrame({'Feature': all_feature_names, 'Coefficient': coefficients})

# Sort by absolute coefficient value to show most important features
feature_importance_df['Abs_Coefficient'] = abs(feature_importance_df['Coefficient'])
feature_importance_df = feature_importance_df.sort_values(by='Abs_Coefficient', ascending=False)

print("Feature Importance (Coefficients) for Logistic Regression Model:")
display(feature_importance_df[['Feature', 'Coefficient']].head(20)) # Display top 20 features

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Select the top N features to display
top_n = 20
top_features = feature_importance_df.head(top_n)

# Create the bar chart
plt.figure(figsize=(12, 8))
sns.barplot(x='Abs_Coefficient', y='Feature', data=top_features, palette='viridis')
plt.title(f'Top {top_n} Feature Importance (Absolute Coefficients) for Logistic Regression')
plt.xlabel('Absolute Coefficient Value')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

With 5-fold cross-validation, the average AUC ROC is ~0.9106, essentially the same as the ~0.9103 from the single validation split. Cross-validation does not change the model itself; it averages performance over five different train/validation splits, which gives a more reliable estimate of how well the model generalizes and makes overfitting easier to detect, so the extra compute is generally worthwhile. In any case, both the cross-validation and the simple logistic regression take less than a few minutes to run and produce moderately accurate results.

We can also see from the logistic model's coefficients that employment status carries the most weight: retired and employed have strong positive effects on the likelihood of loan payback, whereas unemployed and student have strong negative effects. As we saw in the correlation analysis, credit score and debt-to-income ratio also have high absolute coefficients. Combined with employment status, these are our primary predictors; however, many other variables have a smaller but still meaningful impact.
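
Logistic regression coefficients are changes in log-odds, so exponentiating a coefficient gives the factor by which the odds of payback are multiplied for a one-unit increase in that feature. For the standardized numerical features, one unit is one standard deviation; for one-hot columns, it is the category being present. The sketch below uses made-up coefficient values and feature names purely to illustrate the conversion.

```python
import math

# Hypothetical coefficients, not taken from the fitted model in this notebook
demo_coefficients = {
    "standardized_numeric_feature": 0.8,
    "one_hot_category_flag": -1.2,
    "weak_feature": 0.05,
}

# exp(coefficient) is the multiplicative change in the odds of the positive class
odds_ratios = {name: math.exp(beta) for name, beta in demo_coefficients.items()}

# List features from the strongest to the weakest effect on the odds
for name, ratio in sorted(
    odds_ratios.items(), key=lambda item: abs(math.log(item[1])), reverse=True
):
    print(f"{name}: odds multiplied by {ratio:.2f}")
```

Applied to `feature_importance_df`, the equivalent would be `np.exp(feature_importance_df['Coefficient'])`; ratios above 1 raise the odds of payback and ratios below 1 lower them.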

## 5. Random Forest Classifier (Model B)

Now let's train a random forest to see if we can improve the accuracy of our predictions and reduce overfitting risks.

In [None]:
# Requires RAPIDS cuML; if this import fails, run the cuML installation cell further down first
from cuml.ensemble import RandomForestClassifier as cumlRandomForestClassifier
import cupy as cp
import numpy as np

# Convert data to CuPy arrays for cuml
# Ensure data is dense and float32
X_train_gpu = cp.asarray(X_train.toarray().astype(np.float32) if hasattr(X_train, 'toarray') else X_train.astype(np.float32))
y_train_gpu = cp.asarray(y_train.values.astype(np.float32) if hasattr(y_train, 'values') else y_train.astype(np.float32))
X_val_gpu = cp.asarray(X_val.toarray().astype(np.float32) if hasattr(X_val, 'toarray') else X_val.astype(np.float32))
y_val_gpu = cp.asarray(y_val.values.astype(np.float32) if hasattr(y_val, 'values') else y_val.astype(np.float32))

# Initialize and train the cuml Random Forest Classifier model on GPU
# Setting n_jobs=-1 might not be relevant for cuml as it uses GPU
rf_model = cumlRandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_gpu, y_train_gpu)

# Predict probabilities on the validation set (on GPU)
y_val_pred_proba_rf_gpu = rf_model.predict_proba(X_val_gpu)[:, 1]

# Convert predictions and true labels back to NumPy for AUC ROC calculation
y_val_pred_proba_rf = cp.asnumpy(y_val_pred_proba_rf_gpu)
y_val_cpu = cp.asnumpy(y_val_gpu)


# Calculate and print the AUC ROC score on the validation set
rf_auc = roc_auc_score(y_val_cpu, y_val_pred_proba_rf)

print(f"AUC ROC for cuml Random Forest Classifier Model on Validation Set (GPU): {rf_auc}")

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Calculate the ROC curve for the Random Forest model
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_val, y_val_pred_proba_rf)

# Calculate the AUC for the Random Forest model
roc_auc_rf = auc(fpr_rf, tpr_rf)

# Plot the ROC curve for the Random Forest model
plt.figure(figsize=(8, 6))
plt.plot(fpr_rf, tpr_rf, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc_rf:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Random Forest Classifier')
plt.legend(loc='lower right')
plt.show()

At ~0.9110, the simple Random Forest model's AUC ROC is slightly higher than that of both the basic logistic regression (~0.9103) and its 5-fold cross-validated average (~0.9106). Because a random forest averages many decorrelated trees, it may also be more robust against overfitting.

In [None]:
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py
!bash rapidsai-csp-utils/colab/update_proxy.sh
!conda install -c conda-forge cudatoolkit=11.8 --no-default-packages -y python=3.10 cudnn=8.7.0
!conda install -c rapidsai -c conda-forge -c nvidia cuml=24.6 python=3.10 cudatoolkit=11.8 -y --allow-downgrades

In [None]:
import cuml
print("cuml installed and imported.")

In [None]:
from cuml.ensemble import RandomForestClassifier as cumlRandomForestClassifier
from sklearn.model_selection import GridSearchCV
import cupy as cp
import numpy as np
from sklearn.ensemble import RandomForestClassifier # Import for potential CPU fallback

# Convert preprocessed data to a dense NumPy array with a supported dtype before converting to CuPy
# The preprocessor output is likely a sparse matrix, so convert to dense
X_processed_dense = X_processed.toarray().astype(np.float32) if hasattr(X_processed, 'toarray') else X_processed.astype(np.float32)

# Ensure the correct target variable is used before converting to CuPy
y_train_data = train_df['loan_paid_back']

# Convert data to CuPy arrays for cuml
X_processed_gpu = cp.asarray(X_processed_dense)
y_gpu = cp.asarray(y_train_data)


# Initialize a cuml RandomForestClassifier
# Use a smaller subset of the data for tuning due to potential memory constraints
# and to speed up the tuning process if needed.
# For demonstration, let's use a smaller parameter grid and potentially fewer data points if memory becomes an issue.
cuml_rf_model = cumlRandomForestClassifier(random_state=42, n_estimators=100)

# Define a smaller parameter grid for quicker demonstration
param_grid_cuml = {
    'n_estimators': [50, 100],
    'max_depth': [10, 20],
    'min_samples_split': [2, 10]
}


# Initialize GridSearchCV
# Note: sklearn's GridSearchCV is used here with the cuml estimator.
# The data is passed as host NumPy arrays (see .get() below); cuml moves it to the GPU when fitting.
# Set n_jobs=1 to avoid multi-processing issues with GPU and GridSearchCV
grid_search_cuml = GridSearchCV(estimator=cuml_rf_model, param_grid=param_grid_cuml, cv=3, scoring='roc_auc', n_jobs=1)

# Fit GridSearchCV - .get() copies the CuPy arrays back to host NumPy arrays
print("Starting GridSearchCV with cuml RandomForestClassifier...")
grid_search_cuml.fit(X_processed_gpu.get(), y_gpu.get())

# Print the best parameters and best score
print("Best parameters found: ", grid_search_cuml.best_params_)
print("Best cross-validation AUC ROC score: ", grid_search_cuml.best_score_)

# Optional: Convert best model back to CPU if needed later
# best_cuml_model = grid_search_cuml.best_estimator_
# best_cpu_model = RandomForestClassifier(n_estimators=best_cuml_model.n_estimators,
#                                         max_depth=best_cuml_model.max_depth,
#                                         min_samples_split=best_cuml_model.min_samples_split,
#                                         random_state=42)
# best_cpu_model.fit(X_processed, y) # Retrain on CPU data if needed for consistency or other operations

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
import cupy as cp

# Get the best model from the GridSearchCV results
best_cuml_model = grid_search_cuml.best_estimator_

# Convert validation data to CuPy array
X_val_gpu = cp.asarray(X_val.toarray().astype(np.float32) if hasattr(X_val, 'toarray') else X_val.astype(np.float32))

# Predict probabilities on the validation set using the best cuml model
# Ensure the model is in evaluation mode if applicable (though not standard for cuml RF)
y_val_pred_proba_cuml_rf = best_cuml_model.predict_proba(X_val_gpu)[:, 1]

# Convert predictions back to NumPy for plotting
y_val_pred_proba_cuml_rf_cpu = cp.asnumpy(y_val_pred_proba_cuml_rf)

# Calculate the ROC curve
fpr_cuml_rf, tpr_cuml_rf, thresholds_cuml_rf = roc_curve(y_val, y_val_pred_proba_cuml_rf_cpu)

# Calculate the AUC
roc_auc_cuml_rf = auc(fpr_cuml_rf, tpr_cuml_rf)

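# Plot the ROC curve
# Note: the curve is drawn from the validation split, but the legend reports the best
# cross-validated score from the grid search. Because GridSearchCV was fit on all of
# X_processed (which includes these validation rows), roc_auc_cuml_rf is optimistic.
plt.figure(figsize=(8, 6))
plt.plot(fpr_cuml_rf, tpr_cuml_rf, color='darkorange', lw=2, label=f'ROC curve (best CV AUC = {grid_search_cuml.best_score_:.2f})')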
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Hyperparameter Tuned Random Forest (cuml)')
plt.legend(loc='lower right')
plt.show()

Using hyperparameter tuning on the random forest, we get the highest AUC ROC yet at ~0.9120. This is more computationally expensive than the prior models, but runs relatively quickly on GPU.
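
If no GPU is available, or to get a score on rows the search never saw, the same tuning pattern can be run on CPU with scikit-learn's own `RandomForestClassifier`, fitting the search only on the training portion and scoring once on a separate hold-out. The sketch below is a small-scale illustration on synthetic data from `make_classification`; the grid values and variable names are arbitrary choices, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
rf_X, rf_y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=42
)

# Hold out 20% that the grid search never sees
rf_X_tune, rf_X_holdout, rf_y_tune, rf_y_holdout = train_test_split(
    rf_X, rf_y, test_size=0.2, random_state=42, stratify=rf_y
)

rf_demo_grid = {"n_estimators": [20, 40], "max_depth": [3, 6]}
rf_demo_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    rf_demo_grid,
    cv=3,
    scoring="roc_auc",
    n_jobs=1,
)
rf_demo_search.fit(rf_X_tune, rf_y_tune)

# Score the refit best model once on the untouched hold-out rows
rf_holdout_proba = rf_demo_search.best_estimator_.predict_proba(rf_X_holdout)[:, 1]
rf_holdout_auc = roc_auc_score(rf_y_holdout, rf_holdout_proba)

print(rf_demo_search.best_params_, rf_demo_search.best_score_, rf_holdout_auc)
```

Comparing `best_score_` with the hold-out AUC gives a rough check that the tuned settings generalize rather than fit the cross-validation folds.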

## 6. Neural Network Classifier (Model C)

We will build a neural network using PyTorch to try to achieve even higher performance.

In [None]:
import torch

# Define the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

In [None]:
# Convert sparse arrays to dense arrays before converting to PyTorch tensors
# .toarray() returns a plain ndarray, whereas .todense() returns the discouraged np.matrix type
X_train_dense = X_train.toarray() if hasattr(X_train, 'toarray') else X_train
X_val_dense = X_val.toarray() if hasattr(X_val, 'toarray') else X_val
X_test_dense = X_test_processed.toarray() if hasattr(X_test_processed, 'toarray') else X_test_processed

# Convert dense arrays and Pandas Series to PyTorch tensors and move to GPU
X_train_tensor = torch.tensor(X_train_dense, dtype=torch.float32).to(device)
X_val_tensor = torch.tensor(X_val_dense, dtype=torch.float32).to(device)
X_test_tensor = torch.tensor(X_test_dense, dtype=torch.float32).to(device)

y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).to(device)
y_val_tensor = torch.tensor(y_val.values, dtype=torch.float32).to(device)

print("Data converted to dense arrays, then to PyTorch tensors and moved to GPU.")
print("Shape of X_train_tensor:", X_train_tensor.shape)
print("Shape of X_val_tensor:", X_val_tensor.shape)
print("Shape of X_test_tensor:", X_test_tensor.shape)
print("Shape of y_train_tensor:", y_train_tensor.shape)
print("Shape of y_val_tensor:", y_val_tensor.shape)

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Define the neural network class
class LoanPredictor(nn.Module):
    def __init__(self, input_size):
        super(LoanPredictor, self).__init__()
        self.fc1 = nn.Linear(input_size, 64)  # First fully connected layer
        self.fc2 = nn.Linear(64, 32)          # Second fully connected layer
        self.fc3 = nn.Linear(32, 1)           # Output layer

    def forward(self, x):
        x = F.relu(self.fc1(x))  # Apply ReLU activation after the first layer
        x = F.relu(self.fc2(x))  # Apply ReLU activation after the second layer
        x = torch.sigmoid(self.fc3(x)) # Apply sigmoid activation for binary classification
        return x

# Instantiate the model and move it to the device
input_size = X_train_tensor.shape[1]
model = LoanPredictor(input_size).to(device)

# Print the model architecture
print(model)

In [None]:
import torch.optim as optim

# Define the loss function (Binary Cross-Entropy)
criterion = nn.BCELoss()

# Define the optimizer (Adam)
optimizer = optim.Adam(model.parameters(), lr=0.001)

print("Loss function and optimizer defined.")
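
The model above applies `torch.sigmoid` in `forward` and then uses `nn.BCELoss`. PyTorch also offers `nn.BCEWithLogitsLoss`, which takes raw logits and combines the two steps in a numerically more stable form. The NumPy sketch below illustrates the underlying arithmetic only; it does not reproduce PyTorch's exact behavior (for instance, `nn.BCELoss` clamps its log outputs, so it would not return infinity), and the function names are illustrative.

```python
import numpy as np


def bce_from_probabilities(logits, targets):
    """Sigmoid first, then cross-entropy: can overflow to inf for extreme logits."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    with np.errstate(divide="ignore"):  # silence the log(0) warning in this demo
        return -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))


def bce_from_logits(logits, targets):
    """Stable form: max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return (
        np.maximum(logits, 0.0)
        - logits * targets
        + np.log1p(np.exp(-np.abs(logits)))
    )


demo_logits = np.array([0.5, -1.0, 40.0])
demo_targets = np.array([1.0, 0.0, 0.0])

naive_loss = bce_from_probabilities(demo_logits, demo_targets)
stable_loss = bce_from_logits(demo_logits, demo_targets)

print(naive_loss)
print(stable_loss)
```

For moderate logits the two forms agree; for the confidently wrong third example the sigmoid saturates to exactly 1.0 in floating point and the naive loss becomes infinite, while the logits form returns about 40. Switching the notebook to `nn.BCEWithLogitsLoss` would also mean removing the sigmoid from `forward` and applying it only when probabilities are needed for AUC.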

In [None]:
# Set the model to training mode
model.train()

# Define the number of training epochs
num_epochs = 100

# Start the training loop
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X_train_tensor)

    # Calculate the loss
    loss = criterion(outputs.squeeze(), y_train_tensor)

    # Backward pass and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print the loss at regular intervals
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

print("Training complete.")

In [None]:
# Set the model to evaluation mode
model.eval()

# Disable gradient calculations
with torch.no_grad():
    # Get predictions on the validation set
    y_val_pred_proba_nn = model(X_val_tensor)

    # Move predictions and true labels to CPU and convert to NumPy arrays
    y_val_pred_proba_nn_cpu = y_val_pred_proba_nn.squeeze().cpu().numpy()
    y_val_cpu = y_val_tensor.cpu().numpy()

    # Calculate the AUC ROC score
    nn_auc = roc_auc_score(y_val_cpu, y_val_pred_proba_nn_cpu)

    # Print the AUC ROC score
    print(f"AUC ROC for Neural Network Model on Validation Set: {nn_auc}")

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Calculate the ROC curve for the Neural Network model
fpr_nn, tpr_nn, thresholds_nn = roc_curve(y_val_cpu, y_val_pred_proba_nn_cpu)

# Calculate the AUC for the Neural Network model
roc_auc_nn = auc(fpr_nn, tpr_nn)

# Plot the ROC curve for the Neural Network model
plt.figure(figsize=(8, 6))
plt.plot(fpr_nn, tpr_nn, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc_nn:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Neural Network')
plt.legend(loc='lower right')
plt.show()

The basic neural network classifier reaches an AUC ROC of ~0.9005, a bit lower than the logistic regression and random forest models above. We will perform hyperparameter tuning to see whether a better-performing configuration exists.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

# Define the neural network class with dropout
class LoanPredictor(nn.Module):
    def __init__(self, input_size, dropout_rate=0.5):
        super(LoanPredictor, self).__init__()
        self.fc1 = nn.Linear(input_size, 64)  # First fully connected layer
        self.fc2 = nn.Linear(64, 32)          # Second fully connected layer
        self.fc3 = nn.Linear(32, 1)           # Output layer
        self.dropout = nn.Dropout(dropout_rate) # Dropout layer

    def forward(self, x):
        x = F.relu(self.fc1(x))  # Apply ReLU activation after the first layer
        x = self.dropout(x)      # Apply dropout
        x = F.relu(self.fc2(x))  # Apply ReLU activation after the second layer
        x = self.dropout(x)      # Apply dropout
        x = torch.sigmoid(self.fc3(x)) # Apply sigmoid activation for binary classification
        return x

# Instantiate the model and move it to the device
input_size = X_train_tensor.shape[1]
model = LoanPredictor(input_size, dropout_rate=0.5).to(device)

# Print the model architecture
print(model)

In [None]:
# Define the hyperparameter search space for the neural network
param_grid_nn = {
    'dropout_rate': [0.1, 0.3, 0.5],
    'lr': [0.001, 0.005, 0.01]
    # We could also tune layer sizes, but that would make the search space much larger
}

print("Hyperparameter search space defined:")
print(param_grid_nn)

In [None]:
from sklearn.metrics import roc_auc_score
import torch.optim as optim

def train_and_evaluate_nn(dropout_rate, lr):
    """
    Trains and evaluates a neural network with given hyperparameters.

    Args:
        dropout_rate (float): The dropout rate for the neural network.
        lr (float): The learning rate for the optimizer.

    Returns:
        float: The AUC ROC score on the validation set.
    """
    # Instantiate the model with the given dropout rate
    model = LoanPredictor(input_size, dropout_rate=dropout_rate).to(device)

    # Define the loss function and optimizer with the given learning rate
    criterion = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    # Set the model to training mode
    model.train()

    # Define the number of training epochs
    num_epochs = 100

    # Start the training loop
    for epoch in range(num_epochs):
        # Forward pass
        outputs = model(X_train_tensor)

        # Calculate the loss
        loss = criterion(outputs.squeeze(), y_train_tensor)

        # Backward pass and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Set the model to evaluation mode
    model.eval()

    # Disable gradient calculations
    with torch.no_grad():
        # Get predictions on the validation set
        y_val_pred_proba_nn = model(X_val_tensor)

        # Move predictions and true labels to CPU and convert to NumPy arrays
        y_val_pred_proba_nn_cpu = y_val_pred_proba_nn.squeeze().cpu().numpy()
        y_val_cpu = y_val_tensor.cpu().numpy()

        # Calculate the AUC ROC score
        nn_auc = roc_auc_score(y_val_cpu, y_val_pred_proba_nn_cpu)

    return nn_auc

print("Training and evaluation function defined.")

In [None]:
# Initialize variables to store the best AUC ROC score and hyperparameters
best_auc_nn = 0
best_params_nn = {}

# Iterate through each combination of hyperparameters
for dropout_rate in param_grid_nn['dropout_rate']:
    for lr in param_grid_nn['lr']:
        print(f"Training with dropout_rate={dropout_rate}, lr={lr}")
        # Train and evaluate the model
        current_auc = train_and_evaluate_nn(dropout_rate, lr)

        # Print the AUC ROC score for the current combination
        print(f"AUC ROC for dropout_rate={dropout_rate}, lr={lr}: {current_auc:.4f}")

        # Compare and update best parameters if current score is better
        if current_auc > best_auc_nn:
            best_auc_nn = current_auc
            best_params_nn = {'dropout_rate': dropout_rate, 'lr': lr}
            print("New best AUC ROC found!")

# Print the best AUC ROC score and corresponding hyperparameters
print("\n--- Hyperparameter Tuning Complete ---")
print(f"Best AUC ROC: {best_auc_nn:.4f}")
print(f"Best Hyperparameters: {best_params_nn}")
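
The nested loops above need one extra level for every hyperparameter added to the grid. A more general pattern is to iterate over all combinations with `itertools.product`. The sketch below shows that pattern with a stand-in scoring function; `grid_search_loop` and `toy_score` are hypothetical names, and `toy_score` simply replaces the much slower `train_and_evaluate_nn` so the example runs instantly.

```python
from itertools import product


def grid_search_loop(param_grid, evaluate):
    """Evaluate every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    history = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        history.append((params, score))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, history


def toy_score(dropout_rate, lr):
    """Stand-in for a real training run; peaks at dropout_rate=0.3, lr=0.005."""
    return 1.0 - abs(dropout_rate - 0.3) - abs(lr - 0.005)


demo_nn_grid = {"dropout_rate": [0.1, 0.3, 0.5], "lr": [0.001, 0.005, 0.01]}
demo_best_params, demo_best_score, demo_history = grid_search_loop(
    demo_nn_grid, toy_score
)

print(demo_best_params, demo_best_score, len(demo_history))
```

In the notebook, passing `param_grid_nn` and `train_and_evaluate_nn` should give the same search as the nested loops, and the returned history keeps every score for later inspection.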

In [None]:
# Instantiate the final model with the best hyperparameters
final_model = LoanPredictor(input_size, dropout_rate=best_params_nn['dropout_rate']).to(device)

# Define the loss function and optimizer with the best learning rate
criterion = nn.BCELoss()
optimizer = optim.Adam(final_model.parameters(), lr=best_params_nn['lr'])

# Set the model to training mode
final_model.train()

# Define the number of training epochs
num_epochs = 100 # Can increase if needed

# Convert CuPy arrays to PyTorch tensors and move to the correct device
X_processed_tensor = torch.tensor(X_processed_gpu.get(), dtype=torch.float32).to(device)
y_tensor = torch.tensor(y_gpu.get(), dtype=torch.float32).to(device)


# Start the training loop on the entire training data
print(f"Training the final model with best hyperparameters: {best_params_nn}")
for epoch in range(num_epochs):
    # Forward pass
    outputs = final_model(X_processed_tensor)

    # Calculate the loss
    loss = criterion(outputs.squeeze(), y_tensor)

    # Backward pass and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print the loss at regular intervals
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

print("Final model training complete.")

# Set the model to evaluation mode
final_model.eval()
print("Final model set to evaluation mode.")

In [None]:
# Set the final model to evaluation mode (already done in the previous step, but good practice to ensure)
final_model.eval()

# Disable gradient calculations
with torch.no_grad():
    # Get predictions on the validation set using the final model
    y_val_pred_proba_final_nn = final_model(X_val_tensor)

    # Move predictions and true labels to CPU and convert to NumPy arrays
    y_val_pred_proba_final_nn_cpu = y_val_pred_proba_final_nn.squeeze().cpu().numpy()
    y_val_cpu = y_val_tensor.cpu().numpy()

    # Calculate the AUC ROC score for the final model on the validation set
    final_nn_auc = roc_auc_score(y_val_cpu, y_val_pred_proba_final_nn_cpu)

# Print the AUC ROC score for the final model
print(f"AUC ROC for Final Neural Network Model on Validation Set: {final_nn_auc}")

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
import torch

# Set the final model to evaluation mode (already done in the previous step, but good practice to ensure)
final_model.eval()

# Disable gradient calculations
with torch.no_grad():
    # Get predictions on the validation set using the final model
    y_val_pred_proba_final_nn = final_model(X_val_tensor)

    # Move predictions and true labels to CPU and convert to NumPy arrays
    y_val_pred_proba_final_nn_cpu = y_val_pred_proba_final_nn.squeeze().cpu().numpy()
    y_val_cpu = y_val_tensor.cpu().numpy() # Ensure y_val_cpu is the same as used for calculating nn_auc

# Calculate the ROC curve for the final Neural Network model
fpr_final_nn, tpr_final_nn, thresholds_final_nn = roc_curve(y_val_cpu, y_val_pred_proba_final_nn_cpu)

# Calculate the AUC for the final Neural Network model
roc_auc_final_nn = auc(fpr_final_nn, tpr_final_nn)

# Plot the ROC curve for the final Neural Network model
plt.figure(figsize=(8, 6))
plt.plot(fpr_final_nn, tpr_final_nn, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc_final_nn:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Hyperparameter Tuned Neural Network')
plt.legend(loc='lower right')
plt.show()

This neural network reaches an AUC ROC of ~0.9116, just below the random forest after hyperparameter tuning. It is also slightly more resource-intensive than the random forest. The gap between the two models is small, but because the neural network scores marginally lower and costs more to train, the random forest remains the best model so far.
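When two models differ by only a few thousandths of AUC, it can help to check whether the gap is larger than the sampling noise of the validation set. The following is a minimal sketch of a paired bootstrap, assuming both models are scored on the same validation rows. The function `bootstrap_auc_diff` and the toy arrays are illustrative and not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y_true, proba_a, proba_b, n_boot=300, seed=0):
    """Bootstrap AUC(a) - AUC(b) for two models scored on the same rows."""
    rng = np.random.default_rng(seed)
    y_true, proba_a, proba_b = map(np.asarray, (y_true, proba_a, proba_b))
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        if np.unique(y_true[idx]).size < 2:
            continue  # AUC is undefined when a resample has a single class
        diffs.append(roc_auc_score(y_true[idx], proba_a[idx])
                     - roc_auc_score(y_true[idx], proba_b[idx]))
    diffs = np.asarray(diffs)
    low, high = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), low, high

# Toy labels and scores standing in for y_val_cpu and two models' predictions
demo_rng = np.random.default_rng(42)
y_demo = demo_rng.integers(0, 2, size=2000)
scores_a = y_demo + demo_rng.normal(0.0, 1.0, size=2000)
scores_b = y_demo + demo_rng.normal(0.0, 1.1, size=2000)

mean_diff, ci_low, ci_high = bootstrap_auc_diff(y_demo, scores_a, scores_b)
print(f"AUC difference: {mean_diff:.4f} (95% interval {ci_low:.4f} to {ci_high:.4f})")
```

If the interval comfortably includes zero, the two models are hard to separate on this validation set alone, and training cost becomes a reasonable tie-breaker.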

## 7. XGBoost (Model D)

We'll use a more advanced XGBoost model and start with some basic feature engineering to improve its performance.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd
import numpy as np

# Build degree-2 polynomial features (squares such as x^2 and pairwise products
# such as x*y) from the original numerical columns.
X_numerical = train_df[numerical_features]
X_test_numerical = test_df[numerical_features]

poly = PolynomialFeatures(degree=2, include_bias=False)

X_numerical_poly = poly.fit_transform(X_numerical)
X_test_numerical_poly = poly.transform(X_test_numerical)

# Column names for the polynomial terms can be read from poly.get_feature_names_out().
# The interaction terms used by the models in this section are instead added by hand
# below, on top of the already preprocessed features.

# Recover the column names of the preprocessed matrix: numerical features first,
# then the one-hot encoded categorical features from the fitted OneHotEncoder.

numerical_feature_names = numerical_features.tolist()
categorical_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features).tolist()
all_feature_names = numerical_feature_names + categorical_feature_names

# Convert processed data to dense NumPy arrays before converting to DataFrame
X_processed_dense = X_processed.toarray() if hasattr(X_processed, 'toarray') else X_processed
X_test_processed_dense = X_test_processed.toarray() if hasattr(X_test_processed, 'toarray') else X_test_processed

# Convert dense arrays back to DataFrame to add new features
X_processed_df = pd.DataFrame(X_processed_dense, columns=all_feature_names)
X_test_processed_df = pd.DataFrame(X_test_processed_dense, columns=all_feature_names)


# Add interaction terms between selected numerical features
X_processed_df['income_credit_interaction'] = X_processed_df['annual_income'] * X_processed_df['credit_score']
X_processed_df['debt_credit_interaction'] = X_processed_df['debt_to_income_ratio'] * X_processed_df['credit_score']
X_processed_df['income_debt_interaction'] = X_processed_df['annual_income'] * X_processed_df['debt_to_income_ratio']
X_processed_df['credit_interest_interaction'] = X_processed_df['credit_score'] * X_processed_df['interest_rate']


X_test_processed_df['income_credit_interaction'] = X_test_processed_df['annual_income'] * X_test_processed_df['credit_score']
X_test_processed_df['debt_credit_interaction'] = X_test_processed_df['debt_to_income_ratio'] * X_test_processed_df['credit_score']
X_test_processed_df['income_debt_interaction'] = X_test_processed_df['annual_income'] * X_test_processed_df['debt_to_income_ratio']
X_test_processed_df['credit_interest_interaction'] = X_test_processed_df['credit_score'] * X_test_processed_df['interest_rate']


print("Feature engineering complete. Added interaction terms.")
print("Shape of X_processed_df:", X_processed_df.shape)
print("Shape of X_test_processed_df:", X_test_processed_df.shape)

In [None]:
import xgboost as xgb
from sklearn.metrics import roc_auc_score
import pandas as pd
import numpy as np

# Apply the same feature engineering steps to the train, validation, and test sets
# This is needed because the splits were created before feature engineering

# Convert sparse arrays to dense NumPy arrays before creating DataFrames
X_train_dense = X_train.toarray() if hasattr(X_train, 'toarray') else X_train
X_val_dense = X_val.toarray() if hasattr(X_val, 'toarray') else X_val
X_test_processed_dense = X_test_processed.toarray() if hasattr(X_test_processed, 'toarray') else X_test_processed


X_train_df = pd.DataFrame(X_train_dense, columns=all_feature_names)
X_val_df = pd.DataFrame(X_val_dense, columns=all_feature_names)
X_test_processed_df_recreated = pd.DataFrame(X_test_processed_dense, columns=all_feature_names)


X_train_df['income_credit_interaction'] = X_train_df['annual_income'] * X_train_df['credit_score']
X_train_df['debt_credit_interaction'] = X_train_df['debt_to_income_ratio'] * X_train_df['credit_score']
X_train_df['income_debt_interaction'] = X_train_df['annual_income'] * X_train_df['debt_to_income_ratio']
X_train_df['credit_interest_interaction'] = X_train_df['credit_score'] * X_train_df['interest_rate']

X_val_df['income_credit_interaction'] = X_val_df['annual_income'] * X_val_df['credit_score']
X_val_df['debt_credit_interaction'] = X_val_df['debt_to_income_ratio'] * X_val_df['credit_score']
X_val_df['income_debt_interaction'] = X_val_df['annual_income'] * X_val_df['debt_to_income_ratio']
X_val_df['credit_interest_interaction'] = X_val_df['credit_score'] * X_val_df['interest_rate']

X_test_processed_df_recreated['income_credit_interaction'] = X_test_processed_df_recreated['annual_income'] * X_test_processed_df_recreated['credit_score']
X_test_processed_df_recreated['debt_credit_interaction'] = X_test_processed_df_recreated['debt_to_income_ratio'] * X_test_processed_df_recreated['credit_score']
X_test_processed_df_recreated['income_debt_interaction'] = X_test_processed_df_recreated['annual_income'] * X_test_processed_df_recreated['debt_to_income_ratio']
X_test_processed_df_recreated['credit_interest_interaction'] = X_test_processed_df_recreated['credit_score'] * X_test_processed_df_recreated['interest_rate']


# Initialize and train the XGBoost model
# Using default hyperparameters for a baseline XGBoost model
xgb_model = xgb.XGBClassifier(objective='binary:logistic', random_state=42, eval_metric='logloss')

print("Starting XGBoost model training...")
xgb_model.fit(X_train_df, y_train)
print("XGBoost model training complete.")

# Predict probabilities on the validation set
y_val_pred_proba_xgb = xgb_model.predict_proba(X_val_df)[:, 1]

# Calculate and print the AUC ROC score on the validation set
xgb_auc = roc_auc_score(y_val, y_val_pred_proba_xgb)

print(f"AUC ROC for XGBoost Model on Validation Set: {xgb_auc}")

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Calculate the ROC curve for the simple XGBoost model
fpr_xgb, tpr_xgb, thresholds_xgb = roc_curve(y_val, y_val_pred_proba_xgb)

# Calculate the AUC for the simple XGBoost model
roc_auc_xgb = auc(fpr_xgb, tpr_xgb)

# Plot the ROC curve for the simple XGBoost model
plt.figure(figsize=(8, 6))
plt.plot(fpr_xgb, tpr_xgb, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc_xgb:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Simple XGBoost Model')
plt.legend(loc='lower right')
plt.show()

The basic XGBoost model continues to improve upon our efforts with an AUC ROC of ~0.9194, the best score yet. We will attempt to improve this further with hyperparameter tuning.
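Before launching a grid search, it can be useful to estimate how many model fits it implies. The sketch below counts them with scikit-learn's `ParameterGrid`, using the grid values and the `cv=3` setting from the tuning cell that follows. The names `param_grid_demo`, `n_candidates`, and `total_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the XGBoost tuning grid in this section
param_grid_demo = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}

n_candidates = len(ParameterGrid(param_grid_demo))  # one entry per combination
cv_folds = 3

# GridSearchCV fits every candidate once per fold, then refits the best
# candidate on the full training data (refit=True is the default)
total_fits = n_candidates * cv_folds + 1
print(f"{n_candidates} candidates x {cv_folds} folds + 1 refit = {total_fits} fits")
```

Because the count grows multiplicatively with every added hyperparameter, keeping each list to two values is what keeps this search tractable on a single GPU.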

In [None]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

# Initialize XGBoost Classifier with GPU support
# tree_method='hist' selects the histogram-based algorithm, which is the one that runs on the GPU
# device='cuda' explicitly tells XGBoost to use the GPU (requires XGBoost >= 2.0)
xgb_model_gpu = xgb.XGBClassifier(objective='binary:logistic', random_state=42,
                                  eval_metric='logloss',
                                  tree_method='hist', device='cuda')

# Define a smaller hyperparameter grid for tuning with GPU
param_grid_xgb = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Initialize GridSearchCV
# Set n_jobs=1 as the GPU handles parallelism within XGBoost
grid_search_xgb = GridSearchCV(estimator=xgb_model_gpu, param_grid=param_grid_xgb,
                             cv=3, scoring='roc_auc', n_jobs=1, verbose=2)

print("Starting GridSearchCV for XGBoost with GPU acceleration...")

# Fit GridSearchCV to the training data with engineered features
grid_search_xgb.fit(X_train_df, y_train)

print("\n--- Hyperparameter Tuning Complete ---")
# Print the best parameters and best score
print("Best parameters found: ", grid_search_xgb.best_params_)
print("Best cross-validation AUC ROC score: ", grid_search_xgb.best_score_)

# Evaluate the best model on the validation set
best_xgb_model = grid_search_xgb.best_estimator_
y_val_pred_proba_best_xgb = best_xgb_model.predict_proba(X_val_df)[:, 1]
best_xgb_auc = roc_auc_score(y_val, y_val_pred_proba_best_xgb)

print(f"AUC ROC for Best XGBoost Model on Validation Set: {best_xgb_auc}")

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Get the best model from the GridSearchCV results
best_xgb_model = grid_search_xgb.best_estimator_

# Predict probabilities on the validation set using the best XGBoost model
y_val_pred_proba_best_xgb = best_xgb_model.predict_proba(X_val_df)[:, 1]

# Calculate the ROC curve
fpr_best_xgb, tpr_best_xgb, thresholds_best_xgb = roc_curve(y_val, y_val_pred_proba_best_xgb)

# Calculate the AUC
roc_auc_best_xgb = auc(fpr_best_xgb, tpr_best_xgb)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr_best_xgb, tpr_best_xgb, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc_best_xgb:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Hyperparameter Tuned XGBoost Model')
plt.legend(loc='lower right')
plt.show()

The tuned XGBoost model actually has a slightly lower validation AUC ROC (~0.9187) than the untuned baseline, but it may be less overfitted and could generalize better to unseen data. However, the grid search is one of the most computationally expensive steps so far, so any gain in robustness comes at a price.
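One rough way to check whether a model is overfitted is to compare its AUC on the training split with its AUC on the validation split; a larger gap suggests more memorization. The sketch below illustrates the idea on synthetic data, using scikit-learn's `GradientBoostingClassifier` as a CPU stand-in for XGBoost. The data, the two depths, and all variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the loan dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

gaps = {}
for depth in (2, 6):
    model = GradientBoostingClassifier(max_depth=depth, n_estimators=100, random_state=42)
    model.fit(X_tr, y_tr)
    train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    val_auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    gaps[depth] = train_auc - val_auc  # train-validation gap for this depth
    print(f"max_depth={depth}: train={train_auc:.4f}, val={val_auc:.4f}, gap={gaps[depth]:.4f}")
```

The same comparison can be run on `xgb_model` and `best_xgb_model` with `X_train_df` and `X_val_df` to see which of the two has the smaller gap.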

## 8. LightGBM (Model E)

In [None]:
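!pip install lightgbm==3.3.2
import numpy as np
import lightgbm as lgb
print(lgb.__version__)

try:
    # Constructing LGBMClassifier(device='gpu') alone does not touch the GPU,
    # so fit a tiny random dataset to check whether GPU training actually works
    probe_rng = np.random.default_rng(0)
    X_probe = probe_rng.random((200, 4))
    y_probe = probe_rng.integers(0, 2, size=200)
    lgb.LGBMClassifier(device='gpu', n_estimators=1).fit(X_probe, y_probe)
    print("LightGBM GPU support is available.")
except Exception as e:
    print(f"LightGBM GPU support is not available. Error: {e}")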

In [None]:
# Convert DataFrames to NumPy arrays
X_train_np = X_train_df.values.astype(np.float32)
X_val_np = X_val_df.values.astype(np.float32)
X_test_processed_np = X_test_processed_df.values.astype(np.float32)

# Convert target variable Series to NumPy arrays
y_train_np = y_train.values.astype(np.float32)
y_val_np = y_val.values.astype(np.float32)

# Confirm the shapes
print("Shape of X_train_np:", X_train_np.shape)
print("Shape of X_val_np:", X_val_np.shape)
print("Shape of X_test_processed_np:", X_test_processed_np.shape)
print("Shape of y_train_np:", y_train_np.shape)
print("Shape of y_val_np:", y_val_np.shape)

# Confirm the data types
print("\nDtype of X_train_np:", X_train_np.dtype)
print("Dtype of X_val_np:", X_val_np.dtype)
print("Dtype of X_test_processed_np:", X_test_processed_np.dtype)
print("Dtype of y_train_np:", y_train_np.dtype)
print("Dtype of y_val_np:", y_val_np.dtype)

In [None]:
from lightgbm import LGBMClassifier

# Initialize an LGBMClassifier object without explicit GPU acceleration
lgbm_model = LGBMClassifier(objective='binary', metric='auc', random_state=42)

# Train the LightGBM model on the training data
print("Starting LightGBM model training without GPU...")
lgbm_model.fit(X_train_np, y_train_np)
print("LightGBM model training complete.")

In [None]:
from sklearn.metrics import roc_auc_score

# Predict probabilities on the validation set
y_val_pred_proba_lgbm = lgbm_model.predict_proba(X_val_np)[:, 1]

# Calculate and print the AUC ROC score on the validation set
lgbm_auc = roc_auc_score(y_val_np, y_val_pred_proba_lgbm)

print(f"AUC ROC for LightGBM Model on Validation Set: {lgbm_auc}")

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Calculate the ROC curve
fpr_lgbm, tpr_lgbm, thresholds_lgbm = roc_curve(y_val_np, y_val_pred_proba_lgbm)

# Calculate the AUC
roc_auc_lgbm = auc(fpr_lgbm, tpr_lgbm)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr_lgbm, tpr_lgbm, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc_lgbm:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for LightGBM Model')
plt.legend(loc='lower right')
plt.show()

The LightGBM model reaches an AUC ROC of ~0.9183, marginally below the tuned XGBoost model. However, it trains relatively quickly compared to the other models.

## 9. Gaussian Naive Bayes Classification (Model F)

In [None]:
import cupy as cp

# Convert sparse X_train and X_val to dense NumPy arrays (if they are sparse) and ensure float32 dtype
X_train_dense_np = X_train.toarray().astype(np.float32) if hasattr(X_train, 'toarray') else X_train.astype(np.float32)
X_val_dense_np = X_val.toarray().astype(np.float32) if hasattr(X_val, 'toarray') else X_val.astype(np.float32)

# Convert y_train and y_val to NumPy arrays and ensure float32 dtype
y_train_np = y_train.values.astype(np.float32)
y_val_np = y_val.values.astype(np.float32)

# Convert NumPy arrays to CuPy arrays
X_train_nb_gpu = cp.asarray(X_train_dense_np)
y_train_nb_gpu = cp.asarray(y_train_np)
X_val_nb_gpu = cp.asarray(X_val_dense_np)
y_val_nb_gpu = cp.asarray(y_val_np)

print("Data converted to dense CuPy arrays with float32 dtype.")
print(f"Shape of X_train_nb_gpu: {X_train_nb_gpu.shape}, dtype: {X_train_nb_gpu.dtype}")
print(f"Shape of y_train_nb_gpu: {y_train_nb_gpu.shape}, dtype: {y_train_nb_gpu.dtype}")
print(f"Shape of X_val_nb_gpu: {X_val_nb_gpu.shape}, dtype: {X_val_nb_gpu.dtype}")
print(f"Shape of y_val_nb_gpu: {y_val_nb_gpu.shape}, dtype: {y_val_nb_gpu.dtype}")

In [None]:
from cuml.naive_bayes import GaussianNB

# Instantiate the Gaussian Naive Bayes model
nb_model = GaussianNB()

print("Starting cuML Gaussian Naive Bayes model training...")
# Train the model using the GPU-accelerated training data
nb_model.fit(X_train_nb_gpu, y_train_nb_gpu)
print("cuML Gaussian Naive Bayes model training complete.")

In [None]:
from sklearn.metrics import roc_auc_score
import cupy as cp

# Predict probabilities on the validation set using the trained model
y_val_pred_proba_nb_gpu = nb_model.predict_proba(X_val_nb_gpu)[:, 1]

# Convert CuPy predictions and true labels back to NumPy for AUC ROC calculation
y_val_pred_proba_nb_cpu = cp.asnumpy(y_val_pred_proba_nb_gpu)
y_val_cpu = cp.asnumpy(y_val_nb_gpu)

# Calculate and print the AUC ROC score on the validation set
nb_auc = roc_auc_score(y_val_cpu, y_val_pred_proba_nb_cpu)

print(f"AUC ROC for cuML Gaussian Naive Bayes Model on Validation Set: {nb_auc}")

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Calculate the ROC curve for the Naive Bayes model
fpr_nb, tpr_nb, thresholds_nb = roc_curve(y_val_cpu, y_val_pred_proba_nb_cpu)

# Calculate the AUC for the Naive Bayes model
roc_auc_nb = auc(fpr_nb, tpr_nb)

# Plot the ROC curve for the Naive Bayes model
plt.figure(figsize=(8, 6))
plt.plot(fpr_nb, tpr_nb, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc_nb:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for cuML Gaussian Naive Bayes Model')
plt.legend(loc='lower right')
plt.show()

The Naive Bayes model produces the lowest AUC ROC so far, so we will attempt to improve it with hyperparameter tuning next.
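The main hyperparameter of Gaussian Naive Bayes is `var_smoothing`. In scikit-learn, a fraction `var_smoothing` of the largest feature variance is added to every per-class variance for numerical stability. The sketch below shows this with scikit-learn's CPU `GaussianNB` on made-up data; it assumes the cuML implementation follows the same definition, which is worth confirming in the cuML documentation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

demo_rng = np.random.default_rng(0)
# Two hypothetical features on very different scales
X_demo = np.column_stack([
    demo_rng.normal(0.0, 1000.0, size=500),
    demo_rng.normal(0.0, 0.01, size=500),
])
y_demo = (X_demo[:, 1] > 0).astype(int)

for vs in (1e-9, 1e-2):
    nb_demo = GaussianNB(var_smoothing=vs).fit(X_demo, y_demo)
    # epsilon_ is the absolute amount added to every variance
    print(f"var_smoothing={vs:g} -> epsilon_={nb_demo.epsilon_:.6g}")
```

Because the added amount is tied to the largest variance, larger `var_smoothing` values flatten the contribution of low-variance features, which is why the value is worth searching over several orders of magnitude.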

In [None]:
import numpy as np

# Define the hyperparameter search space for var_smoothing
param_grid_nb = {'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]}

print("Hyperparameter search space for var_smoothing defined:")
print(param_grid_nb)

In [None]:
from cuml.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score
import cupy as cp

# Initialize variables to store the best AUC ROC score and hyperparameters
best_auc_nb = 0
best_params_nb = {}

# Iterate through each var_smoothing value
print("Starting hyperparameter tuning for Gaussian Naive Bayes...")
for var_smoothing_val in param_grid_nb['var_smoothing']:
    print(f"\nTraining with var_smoothing={var_smoothing_val}")
    # Instantiate the Gaussian Naive Bayes model with the current var_smoothing
    current_nb_model = GaussianNB(var_smoothing=var_smoothing_val)

    # Train the model using the GPU-accelerated training data
    current_nb_model.fit(X_train_nb_gpu, y_train_nb_gpu)

    # Predict probabilities on the validation set (on GPU)
    y_val_pred_proba_current_nb_gpu = current_nb_model.predict_proba(X_val_nb_gpu)[:, 1]

    # Convert predictions and true labels back to NumPy for AUC ROC calculation
    y_val_pred_proba_current_nb_cpu = cp.asnumpy(y_val_pred_proba_current_nb_gpu)
    # y_val_cpu is already available from previous cells

    # Calculate the AUC ROC score for the current combination
    current_auc_nb = roc_auc_score(y_val_cpu, y_val_pred_proba_current_nb_cpu)

    print(f"AUC ROC for var_smoothing={var_smoothing_val}: {current_auc_nb:.4f}")

    # Compare and update best parameters if current score is better
    if current_auc_nb > best_auc_nb:
        best_auc_nb = current_auc_nb
        best_params_nb = {'var_smoothing': var_smoothing_val}
        print("New best AUC ROC found!")

print("\n--- Hyperparameter Tuning Complete ---")
print(f"Best AUC ROC: {best_auc_nb:.4f}")
print(f"Best Hyperparameters: {best_params_nb}")


In [None]:
from cuml.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score, roc_curve, auc
import cupy as cp
import matplotlib.pyplot as plt

# 1. Instantiate a cuml.naive_bayes.GaussianNB model using the var_smoothing value obtained from the hyperparameter tuning
final_nb_model = GaussianNB(var_smoothing=best_params_nb['var_smoothing'])

# 2. Train this model on the GPU-accelerated training data
print(f"\nTraining final cuML Gaussian Naive Bayes model with var_smoothing={best_params_nb['var_smoothing']}...")
final_nb_model.fit(X_train_nb_gpu, y_train_nb_gpu)
print("Final cuML Gaussian Naive Bayes model training complete.")

# 3. Predict probabilities on the GPU-accelerated validation set
y_val_pred_proba_final_nb_gpu = final_nb_model.predict_proba(X_val_nb_gpu)[:, 1]

# 4. Convert the predicted probabilities and true validation labels to NumPy arrays
y_val_pred_proba_final_nb_cpu = cp.asnumpy(y_val_pred_proba_final_nb_gpu)
y_val_cpu = cp.asnumpy(y_val_nb_gpu) # Ensure y_val_cpu is available and correctly represents the true labels

# 5. Calculate the AUC ROC score for the tuned model
final_nb_auc = roc_auc_score(y_val_cpu, y_val_pred_proba_final_nb_cpu)

# 6. Print the AUC ROC score
print(f"AUC ROC for Hyperparameter Tuned cuML Gaussian Naive Bayes Model on Validation Set: {final_nb_auc}")

# 7. Calculate the False Positive Rate (FPR), True Positive Rate (TPR), and thresholds
fpr_final_nb, tpr_final_nb, thresholds_final_nb = roc_curve(y_val_cpu, y_val_pred_proba_final_nb_cpu)

# Calculate the AUC for the final Naive Bayes model
roc_auc_final_nb = auc(fpr_final_nb, tpr_final_nb)

# 8. Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr_final_nb, tpr_final_nb, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc_final_nb:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Hyperparameter Tuned cuML Gaussian Naive Bayes Model')
plt.legend(loc='lower right')
plt.show()


After tuning, the Naive Bayes model still does not perform particularly well, scoring even worse than the simple logistic regression at ~0.8981. However, it may still be useful when we build hybrid models, because its errors may offset some of the other models' weaknesses. This applies particularly to XGBoost, which is the focus of the hybrids in the next section.

## 10. Hybrid Models

In [None]:
import numpy as np
from sklearn.metrics import roc_auc_score
import cupy as cp

# Get the best trained models from previous steps
# Assuming best_xgb_model and best_cuml_model are available from previous cells
# best_xgb_model = grid_search_xgb.best_estimator_
# best_cuml_model = grid_search_cuml.best_estimator_

# Make predictions on the validation set using the best XGBoost model
# Ensure X_val_df is available from the XGBoost feature engineering step
y_val_pred_proba_xgb_tuned = best_xgb_model.predict_proba(X_val_df)[:, 1]

# Make predictions on the validation set using the best cuml Random Forest model
# X_val_gpu should already exist from the cuml Random Forest setup; rebuild it from X_val if not
if 'X_val_gpu' not in locals():
    X_val_gpu = cp.asarray(X_val.toarray().astype(np.float32) if hasattr(X_val, 'toarray') else X_val.astype(np.float32))

y_val_pred_proba_rf_tuned_gpu = best_cuml_model.predict_proba(X_val_gpu)[:, 1]
y_val_pred_proba_rf_tuned = cp.asnumpy(y_val_pred_proba_rf_tuned_gpu)


# Combine the predictions (simple averaging)
hybrid_predictions_proba = (y_val_pred_proba_xgb_tuned + y_val_pred_proba_rf_tuned) / 2

# Calculate and print the AUC ROC score for the hybrid model
hybrid_auc = roc_auc_score(y_val, hybrid_predictions_proba)

print(f"AUC ROC for Hybrid XGBoost + Random Forest Model on Validation Set: {hybrid_auc}")

In [None]:
import numpy as np
from sklearn.metrics import roc_auc_score
import cupy as cp

# Make predictions on the training set using the best XGBoost model
# Ensure X_processed_df is available from the XGBoost feature engineering step
y_train_pred_proba_xgb_tuned = best_xgb_model.predict_proba(X_processed_df)[:, 1]

# Make predictions on the training set using the best cuml Random Forest model
# The Random Forest was trained on the preprocessed features without the interaction
# terms, so build its input from X_processed_dense rather than X_processed_df
X_processed_gpu = cp.asarray(X_processed_dense.astype(np.float32))

y_train_pred_proba_rf_tuned_gpu = best_cuml_model.predict_proba(X_processed_gpu)[:, 1]
y_train_pred_proba_rf_tuned = cp.asnumpy(y_train_pred_proba_rf_tuned_gpu)

# Combine the predictions (simple averaging)
hybrid_train_predictions_proba = (y_train_pred_proba_xgb_tuned + y_train_pred_proba_rf_tuned) / 2

# Calculate and print the AUC ROC score for the hybrid model on the training set
hybrid_train_auc = roc_auc_score(y, hybrid_train_predictions_proba)

print(f"AUC ROC for Hybrid XGBoost + Random Forest Model on Training Set: {hybrid_train_auc}")

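The RF-XGBoost hybrid model has a high AUC ROC on the validation set but a much lower one on the training set. This is the reverse of the usual pattern, in which training scores are at least as high as validation scores, so it is worth checking that each model receives the feature columns it was trained on. While the validation score probably gives a better idea of actual performance, the result is still questionable, so we will try another combination of models.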
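Simple averaging is only one way to blend two sets of predicted probabilities. The sketch below shows a weighted average and a rank-based average, which can help when the two models produce scores on different scales. The helper names `rank_normalize` and `blend` and the toy probabilities are illustrative and are not taken from the notebook.

```python
import numpy as np

def rank_normalize(scores):
    """Map scores to (0, 1] by rank; ties are broken by position."""
    scores = np.asarray(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[np.argsort(scores, kind="stable")] = np.arange(1, len(scores) + 1)
    return ranks / len(scores)

def blend(proba_a, proba_b, weight_a=0.5, use_ranks=False):
    """Weighted average of two score vectors, optionally on rank scale."""
    a = np.asarray(proba_a, dtype=float)
    b = np.asarray(proba_b, dtype=float)
    if use_ranks:
        a, b = rank_normalize(a), rank_normalize(b)
    return weight_a * a + (1.0 - weight_a) * b

# Toy probabilities standing in for two models' validation predictions
p_a = np.array([0.10, 0.40, 0.35, 0.80])
p_b = np.array([0.55, 0.60, 0.52, 0.90])

print(blend(p_a, p_b))                  # simple average
print(blend(p_a, p_b, weight_a=0.7))    # weight the first model more heavily
print(blend(p_a, p_b, use_ranks=True))  # average of ranks instead of raw scores
```

Since AUC depends only on the ordering of predictions, rank averaging keeps one model's wider score range from dominating the blend.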

In [None]:
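# Stack the validation-set probabilities from the logistic regression (y_val_pred_proba,
# computed earlier) and the tuned XGBoost model as two meta-feature columns
X_meta_val = np.concatenate((y_val_pred_proba.reshape(-1, 1), y_val_pred_proba_best_xgb.reshape(-1, 1)), axis=1)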

print("Shape of X_meta_val:", X_meta_val.shape)

In [None]:
from sklearn.linear_model import LogisticRegression

# Instantiate a Logistic Regression meta-model
meta_model = LogisticRegression(solver='liblinear', random_state=42)

# Train the meta-model using the combined validation predictions and actual labels
meta_model.fit(X_meta_val, y_val)

print("Meta-model (Logistic Regression) training complete.")

In [None]:
from sklearn.metrics import roc_curve, auc, roc_auc_score
import matplotlib.pyplot as plt

# Predict probabilities on the validation meta-features using the trained meta-model
meta_model_pred_proba = meta_model.predict_proba(X_meta_val)[:, 1]

# Calculate the AUC ROC score for the hybrid model
hybrid_stacked_auc = roc_auc_score(y_val, meta_model_pred_proba)

print(f"AUC ROC for Hybrid Stacked Model (Logistic Regression + XGBoost): {hybrid_stacked_auc}")

# Calculate the ROC curve for the hybrid stacked model
fpr_hybrid_stacked, tpr_hybrid_stacked, thresholds_hybrid_stacked = roc_curve(y_val, meta_model_pred_proba)

# Plot the ROC curve for the hybrid stacked model
plt.figure(figsize=(8, 6))
plt.plot(fpr_hybrid_stacked, tpr_hybrid_stacked, color='darkorange', lw=2, label=f'ROC curve (AUC = {hybrid_stacked_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Hybrid Stacked Model')
plt.legend(loc='lower right')
plt.show()

In [None]:
logistic_train_pred_proba = logistic_model.predict_proba(X_processed)[:, 1]
xgb_train_pred_proba = best_xgb_model.predict_proba(X_processed_df)[:, 1]

print("Logistic Regression training predictions generated.")
print("XGBoost training predictions generated.")

In [None]:
X_meta_train = np.concatenate((logistic_train_pred_proba.reshape(-1, 1), xgb_train_pred_proba.reshape(-1, 1)), axis=1)

print("Shape of X_meta_train:", X_meta_train.shape)

In [None]:
from sklearn.metrics import roc_curve, auc, roc_auc_score
import matplotlib.pyplot as plt

# Predict probabilities for the positive class on X_meta_train using the trained meta-model
hybrid_stacked_train_pred_proba = meta_model.predict_proba(X_meta_train)[:, 1]

# Calculate the AUC ROC score for the hybrid stacked model on the training set
hybrid_stacked_train_auc = roc_auc_score(y, hybrid_stacked_train_pred_proba)

print(f"AUC ROC for Hybrid Stacked Model on Training Set: {hybrid_stacked_train_auc}")

# Calculate the ROC curve for the hybrid stacked model on the training set
fpr_hybrid_stacked_train, tpr_hybrid_stacked_train, thresholds_hybrid_stacked_train = roc_curve(y, hybrid_stacked_train_pred_proba)

# Plot the ROC curve for the hybrid stacked model
plt.figure(figsize=(8, 6))
plt.plot(fpr_hybrid_stacked_train, tpr_hybrid_stacked_train, color='darkorange', lw=2, label=f'ROC curve (AUC = {hybrid_stacked_train_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Hybrid Stacked Model on Training Set')
plt.legend(loc='lower right')
plt.show()

The LR-XGBoost stacked model shows the highest training-set AUC ROC so far, at ~0.9229. On the validation set, however, the first hybrid model scored exceptionally high. Keep in mind that the stacked model's validation score is optimistic, because the meta-model was fit on those same validation predictions. Let's try one more hybrid model to see if we can strike a balance.
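A common way to get a less biased estimate for a stacked model is to train the meta-model on out-of-fold predictions and score it on rows that neither the base models nor the meta-model have seen. The sketch below illustrates this on synthetic data, with scikit-learn's `GradientBoostingClassifier` standing in for XGBoost. It is an alternative setup shown for comparison, not the procedure used in the cells above, and all names in it are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in data, not the loan dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=42
)
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

base_models = [
    LogisticRegression(max_iter=1000),
    GradientBoostingClassifier(random_state=42),
]

# Out-of-fold predictions: every training row is scored by a model that did not see it
oof = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
meta_demo = LogisticRegression().fit(oof, y_tr)

# Refit the base models on all training rows, then score the untouched hold-out set
holdout_meta = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_ho)[:, 1] for m in base_models
])
stacked_auc = roc_auc_score(y_ho, meta_demo.predict_proba(holdout_meta)[:, 1])
print(f"Hold-out AUC ROC for the stacked model: {stacked_auc:.4f}")
```

scikit-learn's `StackingClassifier` wraps the same out-of-fold logic in a single estimator, which may be more convenient when the base models are all scikit-learn compatible.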

In [None]:
import cupy as cp

# 1. Generate probability predictions from the best_xgb_model on X_val_df
y_val_pred_proba_xgb_hybrid = best_xgb_model.predict_proba(X_val_df)[:, 1]
print("XGBoost predictions generated on validation set.")

# 2. Generate probability predictions from the final_nb_model on X_val_nb_gpu
y_val_pred_proba_nb_hybrid_gpu = final_nb_model.predict_proba(X_val_nb_gpu)[:, 1]
print("cuML Gaussian Naive Bayes (GPU) predictions generated on validation set.")

# 3. Convert y_val_pred_proba_nb_hybrid_gpu (CuPy array) to a NumPy array
y_val_pred_proba_nb_hybrid_cpu = cp.asnumpy(y_val_pred_proba_nb_hybrid_gpu)
print("cuML Gaussian Naive Bayes predictions converted to CPU NumPy array.")

print(f"Shape of y_val_pred_proba_xgb_hybrid: {y_val_pred_proba_xgb_hybrid.shape}")
print(f"Shape of y_val_pred_proba_nb_hybrid_cpu: {y_val_pred_proba_nb_hybrid_cpu.shape}")

In [None]:
import numpy as np

# Combine the predictions to form meta-features for the meta-model
X_meta_val_hybrid = np.column_stack((y_val_pred_proba_xgb_hybrid, y_val_pred_proba_nb_hybrid_cpu))

print(f"Shape of meta-features for validation set: {X_meta_val_hybrid.shape}")

In [None]:
from sklearn.linear_model import LogisticRegression

# Instantiate a Logistic Regression meta-model
meta_model_hybrid = LogisticRegression(solver='liblinear', random_state=42)

# Train the meta-model using the combined validation predictions and actual labels
meta_model_hybrid.fit(X_meta_val_hybrid, y_val) # y_val is the true labels for the validation set

print("Meta-model (Logistic Regression) training complete using XGBoost and Naive Bayes predictions.")

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve, auc
import matplotlib.pyplot as plt

# Predict probabilities using the trained meta-model
y_val_pred_proba_hybrid_stacked = meta_model_hybrid.predict_proba(X_meta_val_hybrid)[:, 1]

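# Calculate and print the AUC ROC score
# Note: the meta-model was fit on these same validation predictions, so this is an
# in-sample score and is likely optimistic compared to performance on unseen data.
hybrid_stacked_auc_xgb_nb = roc_auc_score(y_val, y_val_pred_proba_hybrid_stacked)
print(f"\nAUC ROC for Hybrid Stacked Model (XGBoost + Naive Bayes) on Validation Set: {hybrid_stacked_auc_xgb_nb}")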

# Plot ROC curve
fpr_hybrid_stacked_xgb_nb, tpr_hybrid_stacked_xgb_nb, _ = roc_curve(y_val, y_val_pred_proba_hybrid_stacked)
roc_auc_hybrid_stacked_xgb_nb = auc(fpr_hybrid_stacked_xgb_nb, tpr_hybrid_stacked_xgb_nb)

plt.figure(figsize=(8, 6))
plt.plot(fpr_hybrid_stacked_xgb_nb, tpr_hybrid_stacked_xgb_nb, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc_hybrid_stacked_xgb_nb:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Hybrid Stacked Model (XGBoost + Naive Bayes)')
plt.legend(loc='lower right')
plt.show()
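
The meta-model above is fit on the validation-set predictions and then scored on those same predictions, so its AUC is an in-sample number. As a rough sketch of one alternative (not what this notebook does), `cross_val_predict` can score the meta-model only on folds it has not seen. The helper name `out_of_fold_meta_auc` and the stand-in data are assumptions made for illustration; in the notebook the inputs would be `X_meta_val_hybrid` and `y_val`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def out_of_fold_meta_auc(X_meta, y_true, n_splits=5, seed=42):
    """AUC of a logistic-regression meta-model, scored only on held-out folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    meta = LogisticRegression(solver="liblinear", random_state=seed)
    # Each row's probability comes from a model that never saw that row during fitting
    oof_proba = cross_val_predict(
        meta, X_meta, y_true, cv=cv, method="predict_proba"
    )[:, 1]
    return roc_auc_score(y_true, oof_proba)


# Stand-in meta-features (hypothetical): two noisy "base model" probability columns
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
X_meta_demo = np.column_stack([
    np.clip(0.5 + 0.3 * (y_demo - 0.5) + rng.normal(0, 0.2, 400), 0, 1),
    np.clip(0.5 + 0.2 * (y_demo - 0.5) + rng.normal(0, 0.25, 400), 0, 1),
])

demo_auc = out_of_fold_meta_auc(X_meta_demo, y_demo)
print(f"Out-of-fold meta-model AUC on stand-in data: {demo_auc:.4f}")
```

If the out-of-fold AUC comes out noticeably lower than the in-sample value printed above, that gap is a rough indication of how optimistic the in-sample score is.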

## 11. Artificial Dataset Testing and Final Model Selection

With only a single AUC ROC score per model, it is difficult to tell which one will generalize to unseen data. Each XGB hybrid performs well, so we'll use artificial datasets to see which hybrid scores best on average.

First we'll create the synthetic data:

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

print("Imported make_classification and train_test_split.")

In [None]:
synthetic_dataset_configs = [
    {
        'n_samples': 1000,
        'n_features': 20,
        'n_informative': 10,
        'n_redundant': 5,
        'n_clusters_per_class': 2,
        'random_state': 42
    },
    {
        'n_samples': 2000,
        'n_features': 50,
        'n_informative': 15,
        'n_redundant': 10,
        'n_clusters_per_class': 3,
        'random_state': 123
    },
    {
        'n_samples': 500,
        'n_features': 10,
        'n_informative': 5,
        'n_redundant': 2,
        'n_clusters_per_class': 1,
        'random_state': 789
    },
    {
        'n_samples': 1500,
        'n_features': 30,
        'n_informative': 20,
        'n_redundant': 5,
        'n_clusters_per_class': 2,
        'random_state': 987
    }
]

synthetic_datasets = []

for i, config in enumerate(synthetic_dataset_configs):
    print(f"Generating dataset {i+1} with config: {config}")
    X_synthetic, y_synthetic = make_classification(**config)

    X_train_synthetic, X_test_synthetic, y_train_synthetic, y_test_synthetic = train_test_split(
        X_synthetic, y_synthetic, test_size=0.2, random_state=42, stratify=y_synthetic
    )

    synthetic_datasets.append({
        'config': config,
        'X_train': X_train_synthetic,
        'X_test': X_test_synthetic,
        'y_train': y_train_synthetic,
        'y_test': y_test_synthetic
    })

    print(f"  Dataset {i+1} shapes: ")
    print(f"    X_train: {X_train_synthetic.shape}, y_train: {y_train_synthetic.shape}")
    print(f"    X_test: {X_test_synthetic.shape}, y_test: {y_test_synthetic.shape}")
    print("\n")

print(f"Successfully created {len(synthetic_datasets)} synthetic datasets.")

In [None]:
from sklearn.preprocessing import StandardScaler

for i, dataset in enumerate(synthetic_datasets):
    scaler = StandardScaler()

    # Fit and transform X_train
    dataset['X_train'] = scaler.fit_transform(dataset['X_train'])

    # Transform X_test
    dataset['X_test'] = scaler.transform(dataset['X_test'])

    print(f"Dataset {i+1} scaled. New shapes: ")
    print(f"  X_train: {dataset['X_train'].shape}, X_test: {dataset['X_test'].shape}")

print("Preprocessing with StandardScaler complete for all synthetic datasets.")

Now, we'll train the models on this artificial data:

In [None]:
import xgboost as xgb
from cuml.ensemble import RandomForestClassifier as cumlRandomForestClassifier
from cuml.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import cupy as cp
import numpy as np
import torch  # needed for the device check below

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

retrained_hybrid_results = []
print("Initialized retrained_hybrid_results list.")

In [None]:
for i, dataset in enumerate(synthetic_datasets):
    print(f"\nProcessing synthetic dataset {i+1}...")
    X_train_synth, X_test_synth, y_train_synth, y_test_synth = (
        dataset['X_train'], dataset['X_test'], dataset['y_train'], dataset['y_test']
    )

    # b. Retrain Base XGBoost Model
    xgb_tuned_params = {
        'objective': 'binary:logistic',
        'random_state': 42,
        'eval_metric': 'logloss',
        'tree_method': 'hist',
        'device': 'cuda',
        'n_estimators': grid_search_xgb.best_params_['n_estimators'],
        'learning_rate': grid_search_xgb.best_params_['learning_rate'],
        'max_depth': grid_search_xgb.best_params_['max_depth'],
        'subsample': grid_search_xgb.best_params_['subsample'],
        'colsample_bytree': grid_search_xgb.best_params_['colsample_bytree']
    }
    xgb_model_synth = xgb.XGBClassifier(**xgb_tuned_params)
    xgb_model_synth.fit(X_train_synth, y_train_synth)
    xgb_train_preds = xgb_model_synth.predict_proba(X_train_synth)[:, 1]
    xgb_test_preds = xgb_model_synth.predict_proba(X_test_synth)[:, 1]
    print(f"  XGBoost trained. AUC on test: {roc_auc_score(y_test_synth, xgb_test_preds):.4f}")

    # c. Retrain Base cuml Random Forest Model
    rf_tuned_params = {
        'n_estimators': grid_search_cuml.best_params_['n_estimators'],
        'max_depth': grid_search_cuml.best_params_['max_depth'],
        'min_samples_split': grid_search_cuml.best_params_['min_samples_split'],
        'random_state': 42
    }
    rf_model_synth = cumlRandomForestClassifier(**rf_tuned_params)

    X_train_rf_gpu = cp.asarray(X_train_synth.astype(np.float32))
    y_train_rf_gpu = cp.asarray(y_train_synth.astype(np.float32))
    X_test_rf_gpu = cp.asarray(X_test_synth.astype(np.float32))

    rf_model_synth.fit(X_train_rf_gpu, y_train_rf_gpu)
    rf_train_preds_gpu = rf_model_synth.predict_proba(X_train_rf_gpu)[:, 1]
    rf_train_preds_cpu = cp.asnumpy(rf_train_preds_gpu)
    rf_test_preds_gpu = rf_model_synth.predict_proba(X_test_rf_gpu)[:, 1]
    rf_test_preds_cpu = cp.asnumpy(rf_test_preds_gpu)
    print(f"  cuML Random Forest trained. AUC on test: {roc_auc_score(y_test_synth, rf_test_preds_cpu):.4f}")

    # d. Retrain Base cuml Gaussian Naive Bayes Model
    nb_model_synth = GaussianNB(var_smoothing=best_params_nb['var_smoothing'])

    X_train_nb_gpu = cp.asarray(X_train_synth.astype(np.float32))
    y_train_nb_gpu = cp.asarray(y_train_synth.astype(np.float32))
    X_test_nb_gpu = cp.asarray(X_test_synth.astype(np.float32))

    nb_model_synth.fit(X_train_nb_gpu, y_train_nb_gpu)
    nb_train_preds_gpu = nb_model_synth.predict_proba(X_train_nb_gpu)[:, 1]
    nb_train_preds_cpu = cp.asnumpy(nb_train_preds_gpu)
    nb_test_preds_gpu = nb_model_synth.predict_proba(X_test_nb_gpu)[:, 1]
    nb_test_preds_cpu = cp.asnumpy(nb_test_preds_gpu)
    print(f"  cuML Naive Bayes trained. AUC on test: {roc_auc_score(y_test_synth, nb_test_preds_cpu):.4f}")

    # e. Retrain Base Logistic Regression Model
    lr_model_synth = LogisticRegression(solver='liblinear', random_state=42)
    lr_model_synth.fit(X_train_synth, y_train_synth)
    lr_train_preds = lr_model_synth.predict_proba(X_train_synth)[:, 1]
    lr_test_preds = lr_model_synth.predict_proba(X_test_synth)[:, 1]
    print(f"  Logistic Regression trained. AUC on test: {roc_auc_score(y_test_synth, lr_test_preds):.4f}")

    # f. Hybrid Model 1: XGBoost + Random Forest (Simple Averaging)
    hybrid1_test_preds = (xgb_test_preds + rf_test_preds_cpu) / 2
    hybrid1_auc = roc_auc_score(y_test_synth, hybrid1_test_preds)
    print(f"  Hybrid 1 (XGB+RF Avg) AUC on test: {hybrid1_auc:.4f}")

    # g. Hybrid Model 2: Logistic Regression + XGBoost (Stacked)
    # Note: the meta-model is fit on in-sample base-model predictions for the training rows
    X_meta_train_h2 = np.column_stack((lr_train_preds, xgb_train_preds))
    meta_model_h2 = LogisticRegression(solver='liblinear', random_state=42)
    meta_model_h2.fit(X_meta_train_h2, y_train_synth)
    X_meta_test_h2 = np.column_stack((lr_test_preds, xgb_test_preds))
    hybrid2_test_preds = meta_model_h2.predict_proba(X_meta_test_h2)[:, 1]
    hybrid2_auc = roc_auc_score(y_test_synth, hybrid2_test_preds)
    print(f"  Hybrid 2 (LR+XGB Stacked) AUC on test: {hybrid2_auc:.4f}")

    # h. Hybrid Model 3: XGBoost + Naive Bayes (Stacked)
    X_meta_train_h3 = np.column_stack((xgb_train_preds, nb_train_preds_cpu))
    meta_model_h3 = LogisticRegression(solver='liblinear', random_state=42)
    meta_model_h3.fit(X_meta_train_h3, y_train_synth)
    X_meta_test_h3 = np.column_stack((xgb_test_preds, nb_test_preds_cpu))
    hybrid3_test_preds = meta_model_h3.predict_proba(X_meta_test_h3)[:, 1]
    hybrid3_auc = roc_auc_score(y_test_synth, hybrid3_test_preds)
    print(f"  Hybrid 3 (XGB+NB Stacked) AUC on test: {hybrid3_auc:.4f}")

    retrained_hybrid_results.append({
        'config': dataset['config'],
        'XGB_RF_Avg_Test_Preds': hybrid1_test_preds,
        'LR_XGB_Stacked_Test_Preds': hybrid2_test_preds,
        'XGB_NB_Stacked_Test_Preds': hybrid3_test_preds,
        'XGB_RF_Avg_AUC': hybrid1_auc,
        'LR_XGB_Stacked_AUC': hybrid2_auc,
        'XGB_NB_Stacked_AUC': hybrid3_auc
    })

print("\nFinished retraining and evaluating hybrid models on all synthetic datasets.")

In [None]:
print("\n--- Hybrid Model Evaluation on Artificial Datasets ---")
for i, result in enumerate(retrained_hybrid_results):
    print(f"\nDataset {i+1} Configuration:")
    for key, value in result['config'].items():
        print(f"  {key}: {value}")

    print(f"  AUC ROC - XGBoost + Random Forest (Simple Averaging): {result['XGB_RF_Avg_AUC']:.4f}")
    print(f"  AUC ROC - Logistic Regression + XGBoost (Stacked): {result['LR_XGB_Stacked_AUC']:.4f}")
    print(f"  AUC ROC - XGBoost + Naive Bayes (Stacked): {result['XGB_NB_Stacked_AUC']:.4f}")

And finally the summary of results:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# --- 1. Calculate Average AUC ROC for Hybrid Models on Artificial Datasets ---
hybrid1_aucs = [result['XGB_RF_Avg_AUC'] for result in retrained_hybrid_results]
hybrid2_aucs = [result['LR_XGB_Stacked_AUC'] for result in retrained_hybrid_results]
hybrid3_aucs = [result['XGB_NB_Stacked_AUC'] for result in retrained_hybrid_results]

mean_hybrid1_auc = np.mean(hybrid1_aucs)
mean_hybrid2_auc = np.mean(hybrid2_aucs)
mean_hybrid3_auc = np.mean(hybrid3_aucs)

# --- 2. Gather Individual Model AUC ROC scores on Original Validation Set ---
# From previous cells, these variables should be available
# Logistic Regression (from 5-fold CV on original training data for robustness) - using mean_auc
# Random Forest (tuned with cuml and GridSearchCV) - using grid_search_cuml.best_score_
# Neural Network (tuned) - using best_auc_nn
# XGBoost (tuned with GridSearchCV) - using best_xgb_auc
# LightGBM (single model) - using lgbm_auc
# TabNet (single model) - manually entered 0.9119 from previous output
# Gaussian Naive Bayes (tuned with cuml) - using final_nb_auc


# Create a DataFrame for comparison
comparison_data = {
    'Model': [
        'Hybrid: XGB+RF (Avg) - Artificial Data Avg',
        'Hybrid: LR+XGB (Stacked) - Artificial Data Avg',
        'Hybrid: XGB+NB (Stacked) - Artificial Data Avg',
        'XGBoost (Tuned) - Original Val Set',
        'Random Forest (Tuned) - Original Val Set',
        'Logistic Regression (CV) - Original Val Set',
        'Neural Network (Tuned) - Original Val Set',
        'LightGBM - Original Val Set',
        'TabNet - Original Val Set',
        'Gaussian Naive Bayes (Tuned) - Original Val Set'
    ],
    'AUC_ROC': [
        mean_hybrid1_auc,
        mean_hybrid2_auc,
        mean_hybrid3_auc,
        best_xgb_auc,
        grid_search_cuml.best_score_,
        mean_auc, # Using mean_auc from LR CV
        best_auc_nn,
        lgbm_auc,
        0.9119, # TabNet AUC from text output
        final_nb_auc
    ]
}

comparison_df = pd.DataFrame(comparison_data)
comparison_df_sorted = comparison_df.sort_values(by='AUC_ROC', ascending=False).reset_index(drop=True)

# --- 3. Print the average scores and individual scores ---
print("\n--- Average AUC ROC Scores for Hybrid Models on Artificial Datasets ---")
print(f"XGBoost + Random Forest (Simple Averaging): {mean_hybrid1_auc:.4f}")
print(f"Logistic Regression + XGBoost (Stacked):     {mean_hybrid2_auc:.4f}")
print(f"XGBoost + Naive Bayes (Stacked):             {mean_hybrid3_auc:.4f}")

print("\n--- Individual Model AUC ROC Scores on Original Validation Set ---")
print(f"XGBoost (Tuned):                       {best_xgb_auc:.4f}")
print(f"Random Forest (Tuned):                 {grid_search_cuml.best_score_:.4f}")
print(f"Logistic Regression (CV):              {mean_auc:.4f}")
print(f"Neural Network (Tuned):                {best_auc_nn:.4f}")
print(f"LightGBM:                              {lgbm_auc:.4f}")
print(f"TabNet:                                {0.9119:.4f}")
print(f"Gaussian Naive Bayes (Tuned):          {final_nb_auc:.4f}")

print("\n--- All Models Ranked by AUC ROC ---")
print(comparison_df_sorted.to_string(index=False))

# --- 4. Create a Bar Chart for Comparison ---
plt.figure(figsize=(14, 8))
sns.barplot(x='AUC_ROC', y='Model', data=comparison_df_sorted, palette='viridis')
plt.title('Comparison of Hybrid and Individual Model AUC ROC Scores')
plt.xlabel('AUC ROC Score')
plt.ylabel('Model Type')
plt.xlim(0.85, 1.0) # Adjust x-axis to better show differences
plt.tight_layout()
plt.show()


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import cupy as cp

# 1. Generate Training Predictions for XGB+NB Stacked Model
# Base predictions from best_xgb_model on X_processed_df
xgb_train_preds_for_nb_stack = best_xgb_model.predict_proba(X_processed_df)[:, 1]

# Base predictions from final_nb_model on X_processed_gpu
# The original X_processed (from preprocessor.fit_transform(X)) has 60 features.
# X_processed_df has 64 features due to feature engineering for XGBoost.
# The final_nb_model was trained on 60 features.
# Therefore, we need to convert X_processed to CuPy, not X_processed_df.

# Ensure X_processed is dense for CuPy conversion if it's sparse
X_processed_dense_np = X_processed.toarray().astype(np.float32) if hasattr(X_processed, 'toarray') else X_processed.astype(np.float32)
X_processed_nb_gpu = cp.asarray(X_processed_dense_np)

nb_train_preds_for_nb_stack_gpu = final_nb_model.predict_proba(X_processed_nb_gpu)[:, 1]
nb_train_preds_for_nb_stack_cpu = cp.asnumpy(nb_train_preds_for_nb_stack_gpu)

# Combine these predictions to form meta-features for the meta-model
X_meta_train_xgb_nb = np.column_stack((xgb_train_preds_for_nb_stack, nb_train_preds_for_nb_stack_cpu))

# Use meta_model_hybrid (meta-learner for XGB+NB) to predict on these meta-features
xgb_nb_stacked_train_pred_proba = meta_model_hybrid.predict_proba(X_meta_train_xgb_nb)[:, 1]

# 2. Calculate AUC ROC for XGB+NB Stacked Model on Training Set
xgb_nb_stacked_train_auc = roc_auc_score(y, xgb_nb_stacked_train_pred_proba)
print(f"AUC ROC for XGB+NB Stacked Model on Training Set: {xgb_nb_stacked_train_auc:.4f}")

# 3. Gather all three hybrid models' training AUC ROC scores
# hybrid_stacked_train_auc (for LR+XGB Stacked) - from cell 59fc3276
# hybrid_train_auc (for XGB+RF Avg) - from cell LGqcdsiLW_5z
# xgb_nb_stacked_train_auc (for XGB+NB Stacked) - calculated above

comparison_data_train = {
    'Hybrid Model': [
        'LR+XGB (Stacked) - Training Set',
        'XGB+RF (Avg) - Training Set',
        'XGB+NB (Stacked) - Training Set'
    ],
    'AUC_ROC (Training Set)': [
        hybrid_stacked_train_auc,
        hybrid_train_auc,
        xgb_nb_stacked_train_auc
    ]
}

comparison_df_train = pd.DataFrame(comparison_data_train)
comparison_df_train_sorted = comparison_df_train.sort_values(by='AUC_ROC (Training Set)', ascending=False).reset_index(drop=True)

print("\n--- Hybrid Models Ranked by AUC ROC on Training Data ---")
print(comparison_df_train_sorted.to_string(index=False))

# Create a Bar Chart for Comparison
plt.figure(figsize=(12, 7))
sns.barplot(x='AUC_ROC (Training Set)', y='Hybrid Model', data=comparison_df_train_sorted, palette='viridis')
plt.title('AUC ROC of Hybrid Models on Training Data')
plt.xlabel('AUC ROC Score')
plt.ylabel('Hybrid Model Type')
plt.xlim(0.8, 1.0) # Adjust x-axis to better show differences
plt.tight_layout()
plt.show()

The stacked Logistic Regression + XGBoost model has the highest average AUC ROC on the artificial data, though only by a narrow margin. The XGB/RF and XGB/NB hybrids were also less consistent, most notably in the gap between the XGB/RF model's validation and training scores, which makes their performance on the actual test data riskier to predict.
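
To put a number on "consistency", one option is to look at the spread of each hybrid's AUC across the artificial datasets alongside its mean. The sketch below is a minimal example of that idea; the helper name `summarize_auc` and the toy values are assumptions for illustration only and are not results from this notebook.

```python
import numpy as np


def summarize_auc(results, keys):
    """Return {key: (mean, sample std, min)} of AUC values across result dicts."""
    summary = {}
    for key in keys:
        scores = np.array([r[key] for r in results], dtype=float)
        summary[key] = (scores.mean(), scores.std(ddof=1), scores.min())
    return summary


# Placeholder values for illustration only; NOT results from this notebook
toy_results = [
    {"model_a_auc": 0.90, "model_b_auc": 0.95},
    {"model_a_auc": 0.92, "model_b_auc": 0.85},
    {"model_a_auc": 0.91, "model_b_auc": 0.93},
]

toy_summary = summarize_auc(toy_results, ["model_a_auc", "model_b_auc"])
for name, (mean, std, worst) in toy_summary.items():
    print(f"{name}: mean={mean:.4f} std={std:.4f} min={worst:.4f}")
```

In this notebook the same helper could be called as `summarize_auc(retrained_hybrid_results, ['XGB_RF_Avg_AUC', 'LR_XGB_Stacked_AUC', 'XGB_NB_Stacked_AUC'])`. With only four datasets the standard deviation is a coarse estimate, so it is best read as a tie-breaker rather than a firm ranking.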

So, the final model will be the **Stacked Logistic Regression + XGBoost model.** Now we can create a submission.

In [None]:
submission_df = pd.DataFrame({'id': test_df['id'], 'loan_paid_back': stacked_lr_xgb_test_preds})

# Save the submission_df to a CSV file named 'submission.csv', excluding the index
submission_df.to_csv('submission.csv', index=False)

print("Submission file 'submission.csv' created successfully using the LR/XGB stacked model.")
print(submission_df.head())
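
Before uploading, it can be worth running a quick sanity check on the submission frame. The sketch below assumes the expected layout is exactly the two columns used above (`id` and `loan_paid_back`) with probabilities in [0, 1]; the helper name `check_submission` is hypothetical, and the competition's own sample submission remains the authority on the required format.

```python
import numpy as np
import pandas as pd


def check_submission(df, id_col="id", target_col="loan_paid_back"):
    """Return a list of problems found in a submission frame (empty if none found)."""
    problems = []
    if list(df.columns) != [id_col, target_col]:
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems
    if df[id_col].duplicated().any():
        problems.append("duplicate ids")
    if df[target_col].isna().any():
        problems.append("missing predictions")
    elif not df[target_col].between(0.0, 1.0).all():
        problems.append("predictions outside [0, 1]")
    return problems


# Small hand-made frames to show the helper's behavior
good_demo = pd.DataFrame({"id": [1, 2, 3], "loan_paid_back": [0.91, 0.12, 0.55]})
bad_demo = pd.DataFrame({"id": [1, 1], "loan_paid_back": [0.2, np.nan]})

print(check_submission(good_demo))
print(check_submission(bad_demo))
```

In the notebook this would be called as `check_submission(submission_df)`, and an empty list means none of these particular checks failed.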