CS 559: Machine Learning - Subgroup 4 Bankruptcy Prediction

Subgroup: 4

Introduction
This notebook implements a stacking model to predict company bankruptcies in Subgroup 4. My goal was to develop a robust model that effectively identifies bankrupt companies despite the dataset’s severe imbalance (2.22% bankrupt). I applied rigorous preprocessing, ensemble modeling, and hyperparameter tuning to balance performance and generalization.

Data Loading and Verification

Objective: Load and verify the Subgroup 4 dataset to ensure it matches the expected size (1350 companies, 30 bankrupt) as required.

Approach: I loaded subgroup4.csv and confirmed the dataset’s integrity by checking the number of companies and bankruptcy proportion. This step ensures the data aligns with the clustering performed in the team’s training data preparation.

Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from imblearn.over_sampling import SMOTE
import joblib
import warnings
import sklearn

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

Load and Verify Subgroup 4 Data

In [2]:
# Load Subgroup 4 data
data = pd.read_csv("subgroup4.csv")

# Verify dataset size and bankruptcy count
print(f"Subgroup 4 has {len(data)} companies, with {data['Bankrupt?'].sum()} bankrupt.")
print(f"Proportion of bankrupt companies: {data['Bankrupt?'].mean():.4f}")

# Ensure counts match expected values
if len(data) != 1350 or data['Bankrupt?'].sum() != 30:
    raise ValueError("Subgroup 4 counts do not match expected values (1350 companies, 30 bankrupt).")

Subgroup 4 has 1350 companies, with 30 bankrupt.
Proportion of bankrupt companies: 0.0222


Data Preprocessing

Objective: Prepare the data by reducing features to ≤ 50, limiting multicollinearity, and bringing feature distributions closer to Gaussian.

Approach: I dropped the non-feature columns (Index, cluster), replaced infinities and NaNs with column medians, and log-transformed each feature to reduce skew. To limit multicollinearity, I dropped any feature whose absolute correlation with an earlier feature exceeded 0.7 (41 of 95 features). I then replaced IQR outliers with the column median, removed 20 features whose variance fell below 1e-6, and standardized the remaining 34. Finally, PCA reduced these to 30 components, which retain 98.62% of the variance and are mutually uncorrelated by construction. This balances model efficiency and information retention.

In [3]:
# Get features and target
X = data.drop(columns=['Index', 'Bankrupt?', 'cluster'])
y = data['Bankrupt?']

# Check for missing stuff
print(f"Missing values: {X.isna().sum().sum()}, Infinities: {np.isinf(X).sum().sum()}")

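# Replace infinities with NaN, then fill NaNs with the column median (0 if the median is undefined)
X = X.replace([np.inf, -np.inf], np.nan)
for col in X.columns:
    if X[col].isna().any():
        median = X[col].median()
        X[col] = X[col].fillna(0 if pd.isna(median) else median)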

# Log-transform to compress large values and reduce skew
for col in X.columns:
    if X[col].var() < 1e-6:  # Skip near-constant columns
        continue
    if (X[col] <= 0).any():
        # Shift so the smallest value becomes 1 before taking log1p
        X[col] = np.log1p(X[col] - X[col].min() + 1)
    else:
        X[col] = np.log1p(X[col])

# Make sure no NaNs left
if X.isna().any().any():
    print("Error: Still got NaNs!")
    raise ValueError("NaNs after log-transform")

# Drop correlated features (a feature is dropped if it correlates > 0.7 with any earlier feature)
corr = X.corr().abs()
upper_tri = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper_tri.columns if (upper_tri[col] > 0.7).any()]
X_clean = X.drop(columns=to_drop)
print(f"Dropped {len(to_drop)} correlated features. Now have: {len(X_clean.columns)}")

# Handle outliers with IQR
Q1 = X_clean.quantile(0.25)
Q3 = X_clean.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
medians = X_clean.median()
for col in X_clean.columns:
    if X_clean[col].var() < 1e-6:
        X_clean[col] = medians[col]
    else:
        X_clean[col] = X_clean[col].where(
            (X_clean[col] >= lower[col]) & (X_clean[col] <= upper[col]),
            medians[col]
        )

# Check for NaNs again
if X_clean.isna().any().any():
    print("Error: NaNs after outlier fix!")
    raise ValueError("NaNs after IQR")

# Drop low-variance features
variances = X_clean.var()  # avoid shadowing the built-in vars()
X_clean = X_clean.loc[:, variances > 1e-6]
print(f"Dropped {(variances <= 1e-6).sum()} low-variance features. Left: {len(X_clean.columns)}")
# Note: min is over the kept features; max and mean are over all features before the drop
print(f"Variances (min, max, mean): {variances[variances > 1e-6].min():.6f}, {variances.max():.6f}, {variances.mean():.6f}")
print(f"Feature count check: {len(X_clean.columns)}")  # Extra check

# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_clean)
X_scaled = pd.DataFrame(X_scaled, columns=X_clean.columns, index=X_clean.index)

# Check for NaNs after scaling
if np.isnan(X_scaled).any().any():
    print("Error: NaNs in scaled data!")
    raise ValueError("NaNs in X_scaled")

# PCA to 30 components
pca = PCA(n_components=30, random_state=42)
X_pca = pca.fit_transform(X_scaled)
var_explained = np.cumsum(pca.explained_variance_ratio_)
print(f"PCA: {pca.n_components_} components explain {var_explained[-1]:.4f} of variance.")

# Make PCA a DataFrame
X_pca = pd.DataFrame(X_pca, columns=[f'PC{i+1}' for i in range(X_pca.shape[1])], index=X_scaled.index)

Missing values: 0, Infinities: 0
Dropped 41 correlated features. Now have: 54
Dropped 20 low-variance features. Left: 34
Variances (min, max, mean): 0.000001, 124.099495, 10.037478
Feature count check: 34
PCA: 30 components explain 0.9862 of variance.
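
The correlation filter in the cell above can be illustrated on a toy frame. This is a minimal sketch, not the notebook's pipeline: the helper name drop_correlated and the synthetic columns a, a_noisy, and b are made up for illustration; only the 0.7 threshold and the upper-triangle logic come from the cell above.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.7):
    """Drop every column whose |correlation| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper_tri = corr.where(mask)
    to_drop = [c for c in upper_tri.columns if (upper_tri[c] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic data (assumption): "a_noisy" is almost a copy of "a", "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_noisy": a + rng.normal(scale=0.1, size=200),
    "b": rng.normal(size=200),
})

reduced, dropped = drop_correlated(toy)
print("Dropped:", dropped)
print("Kept:", list(reduced.columns))
```

Because only the upper triangle is inspected, the later column of a correlated pair is the one removed, so the result depends on column order.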


Train-Test Split and SMOTE

In [8]:
# Suppress specific joblib warnings
warnings.filterwarnings("ignore", category=UserWarning, module="joblib")

# Set joblib to loky backend to avoid Windows CPU detection warning
joblib.parallel_backend('loky')

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_pca, y, test_size=0.2, stratify=y, random_state=42)

# Apply SMOTE to training data
smote = SMOTE(sampling_strategy=0.4, random_state=42, k_neighbors=5)
X_train, y_train = smote.fit_resample(X_train, y_train)
print(f"After SMOTE: Training samples = {len(y_train)}, Bankrupt = {sum(y_train)}")

After SMOTE: Training samples = 1478, Bankrupt = 422
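
The counts in the SMOTE output can be reproduced by hand. The sketch below is plain arithmetic on the numbers reported in this notebook and does not call imbalanced-learn; the variable names are illustrative, and it assumes sampling_strategy is read as the target minority-to-majority ratio, which is how imbalanced-learn documents a float value.

```python
# Counts reported earlier in the notebook
n_total, n_bankrupt = 1350, 30

# 20% stratified hold-out: integer division is exact for these counts
n_val = n_total // 5                # 270 validation companies
n_bankrupt_val = n_bankrupt // 5    # 6 bankrupt companies in validation

n_train = n_total - n_val                        # 1080
n_bankrupt_train = n_bankrupt - n_bankrupt_val   # 24
n_healthy_train = n_train - n_bankrupt_train     # 1056

# Target minority/majority ratio after resampling
sampling_strategy = 0.4
n_bankrupt_after = int(sampling_strategy * n_healthy_train)  # 422
n_synthetic = n_bankrupt_after - n_bankrupt_train            # 398
n_train_after = n_healthy_train + n_bankrupt_after           # 1478

print(n_bankrupt_after, n_synthetic, n_train_after)
```

Under these assumptions, 398 of the 422 bankrupt training samples are synthetic, interpolated from 24 real ones, which is worth keeping in mind when reading the training scores.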


Model Training

Objective: Build a stacking model with three non-parametric base models and a meta-model, using cross-validation to predict bankruptcies.

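Approach: I selected Random Forest, Gradient Boosting, and XGBoost as base models for their robustness on imbalanced data and their diverse decision boundaries. Random Forest captures feature interactions by averaging many decorrelated trees, while Gradient Boosting and XGBoost add trees sequentially, each correcting the errors of the previous ones. SMOTE (sampling_strategy=0.4) oversampled the minority class to 422 bankruptcies against 1,056 non-bankrupt companies in the training set. A Logistic Regression meta-model with strong regularization (C=0.05) combines the base predictions, with class weights (1:4.0) prioritizing bankruptcies; Gradient Boosting and the baseline use a 1:3.5 weighting. The stack is fit with 5-fold cross-validation (cv=5), so the meta-model is trained on out-of-fold base predictions, and passthrough=True also gives it the 30 PCA features. Throughout this notebook, "accuracy" means the fraction of bankrupt companies correctly identified, TP / (TP + FN), i.e. recall on the bankrupt class. Hyperparameters were tuned to balance performance and generalization, reaching a validation accuracy of 0.83 and a train-validation gap of 0.17.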
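
To make the roles of cv=5 and passthrough concrete, here is a rough hand-rolled sketch of how a stacking classifier assembles the meta-model's training matrix. It is an illustration under assumptions, not the notebook's model: the data come from make_classification, only two small base models are used (XGBoost is left out to keep the example light), and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic, mildly imbalanced data (assumption, stands in for the PCA features)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

base = [
    RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0),
    GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=0),
]

# Out-of-fold P(class 1) from each base model: every row is predicted by a
# model that never saw that row during fitting
oof = np.column_stack([
    cross_val_predict(m, X_toy, y_toy, cv=5, method="predict_proba")[:, 1]
    for m in base
])

# passthrough=True corresponds to appending the original features
meta_features = np.hstack([oof, X_toy])

meta = LogisticRegression(
    C=0.05, solver="liblinear", class_weight={0: 1.0, 1: 4.0}, max_iter=1000
)
meta.fit(meta_features, y_toy)

print(meta_features.shape)  # 2 base-model columns + 8 original features
```

Training the meta-model on out-of-fold predictions keeps it from simply trusting base models that have memorized the training rows. For prediction on new data, the base models would then be refit on the full training set.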

In [5]:
warnings.filterwarnings("ignore", category=UserWarning)  # Suppress UserWarnings (e.g., from joblib)

# Set joblib to sequential backend to avoid Windows CPU detection issue
joblib.parallel_backend('sequential')

# Class weights (1:3.5) for Gradient Boosting and the baseline;
# Random Forest, XGBoost, and the meta-model use 1:4.0, set inline below
class_weights = {0: 1.0, 1: 3.5}

# Define base models with reduced complexity to lower overfitting
base_models = [
    ('rf', RandomForestClassifier(n_estimators=150, max_depth=5, min_samples_split=8, min_samples_leaf=4, 
                                  random_state=42, class_weight={0: 1.0, 1: 4.0})),
    ('gb', GradientBoostingClassifier(n_estimators=100, max_depth=2, learning_rate=0.01, random_state=42, 
                                      subsample=0.8)),
    ('xgb', XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.01, random_state=42, 
                          scale_pos_weight=4.0, gamma=0.1, colsample_bytree=0.8))
]

# Train and evaluate base models
base_results = {}
for name, model in base_models:
    if name == 'gb':  # GradientBoostingClassifier has no class_weight parameter, so pass sample weights
        sample_weights = np.where(y_train == 1, class_weights[1], class_weights[0])
        model.fit(X_train, y_train, sample_weight=sample_weights)
    else:
        model.fit(X_train, y_train)
    
    # Training evaluation ("acc" is recall on the bankrupt class: TP / (TP + FN))
    y_pred_train = model.predict(X_train)
    cm_train = confusion_matrix(y_train, y_pred_train)
    tn_train, fp_train, fn_train, tp_train = cm_train.ravel()
    acc_train = tp_train / (tp_train + fn_train) if (tp_train + fn_train) > 0 else 0.0
    
    # Validation evaluation
    y_pred_val = model.predict(X_val)
    cm_val = confusion_matrix(y_val, y_pred_val)
    tn_val, fp_val, fn_val, tp_val = cm_val.ravel()
    acc_val = tp_val / (tp_val + fn_val) if (tp_val + fn_val) > 0 else 0.0
    
    base_results[name] = {
        'acc_train': acc_train, 'acc_val': acc_val,
        'cm_train': cm_train, 'cm_val': cm_val,
        'tp_train': tp_train, 'tn_train': tn_train,
        'tp_val': tp_val, 'tn_val': tn_val,
        'fn_val': fn_val
    }

# Build stacking model with stronger regularization
meta_model = LogisticRegression(random_state=42, class_weight={0: 1.0, 1: 4.0}, C=0.05, max_iter=1000,
                                solver='liblinear')
stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5, n_jobs=1,
                                    passthrough=True)
stacking_model.fit(X_train, y_train)

# Evaluate stacking model
y_pred_stack_train = stacking_model.predict(X_train)
cm_stack_train = confusion_matrix(y_train, y_pred_stack_train)
tn_stack_train, fp_stack_train, fn_stack_train, tp_stack_train = cm_stack_train.ravel()
acc_stack_train = tp_stack_train / (tp_stack_train + fn_stack_train) if (tp_stack_train + fn_stack_train) > 0 else 0.0

y_pred_stack_val = stacking_model.predict(X_val)
cm_stack_val = confusion_matrix(y_val, y_pred_stack_val)
tn_stack_val, fp_stack_val, fn_stack_val, tp_stack_val = cm_stack_val.ravel()
acc_stack_val = tp_stack_val / (tp_stack_val + fn_stack_val) if (tp_stack_val + fn_stack_val) > 0 else 0.0

# Baseline Logistic Regression
baseline = LogisticRegression(random_state=42, class_weight={0: 1.0, 1: 3.5}, max_iter=1000)
baseline.fit(X_train, y_train)
y_pred_base_train = baseline.predict(X_train)
cm_base_train = confusion_matrix(y_train, y_pred_base_train)
tn_base_train, fp_base_train, fn_base_train, tp_base_train = cm_base_train.ravel()
acc_base_train = tp_base_train / (tp_base_train + fn_base_train) if (tp_base_train + fn_base_train) > 0 else 0.0
y_pred_base_val = baseline.predict(X_val)
cm_base_val = confusion_matrix(y_val, y_pred_base_val)
tn_base_val, fp_base_val, fn_base_val, tp_base_val = cm_base_val.ravel()
acc_base_val = tp_base_val / (tp_base_val + fn_base_val) if (tp_base_val + fn_base_val) > 0 else 0.0
print("\nBaseline Logistic Regression Results:")
print(f"Train Acc = {acc_base_train:.3f}, TP = {tp_base_train}, TN = {tn_base_train}")
print(f"Confusion Matrix:\n{cm_base_train}")
print(f"Val Acc = {acc_base_val:.3f}, TP = {tp_base_val}, TN = {tn_base_val}")
print(f"Confusion Matrix:\n{cm_base_val}")


Baseline Logistic Regression Results:
Train Acc = 0.995, TP = 420, TN = 951
Confusion Matrix:
[[951 105]
 [  2 420]]
Val Acc = 0.333, TP = 2, TN = 236
Confusion Matrix:
[[236  28]
 [  4   2]]


Results and Analysis

Objective: Report results including base model accuracies, meta-model accuracy, confusion matrices, and feature count.

Results: Accuracy here is recall on the bankrupt class, TP / (TP + FN), the competition metric the model was optimized for. The stacking model reached a training accuracy of 1.00 (422/422 bankruptcies) using 30 features. Validation accuracy was 0.83 (5/6 bankruptcies), up from 0.33, the score of the baseline Logistic Regression (2/6 bankruptcies). Base models averaged 0.50 validation accuracy (9/18 bankruptcies): Random Forest 0.33 (TP=2), Gradient Boosting 0.67 (TP=4), and XGBoost 0.50 (TP=3). High recall is critical given the dataset’s imbalance (2.22% bankrupt), but it comes with 35 false positives among the 264 non-bankrupt validation companies, and with only 6 bankrupt companies in the validation set each one shifts the score by about 0.17. Overfitting (train-validation gap: 0.17) remains a challenge but is much improved from 0.66, which matches the baseline's gap (0.995 vs. 0.333).

In [6]:
# Show training results
print("\nTraining Results:")
for name, result in base_results.items():
    print(f"{name.upper()}: Train Acc = {result['acc_train']:.2f}, TP = {result['tp_train']}, TN = {result['tn_train']}")
    print(f"Confusion Matrix:\n{result['cm_train']}\n")

# Show validation results
print("Validation Results:")
for name, result in base_results.items():
    print(f"{name.upper()}: Val Acc = {result['acc_val']:.2f}, TP = {result['tp_val']}, TN = {result['tn_val']}")
    print(f"Confusion Matrix:\n{result['cm_val']}\n")

# Stacking results
print("Stacking Results:")
print(f"Train Acc = {acc_stack_train:.2f}, TP = {tp_stack_train}, TN = {tn_stack_train}")
print(f"Confusion Matrix:\n{cm_stack_train}")
print(f"Val Acc = {acc_stack_val:.2f}, TP = {tp_stack_val}, TN = {tn_stack_val}")
print(f"Confusion Matrix:\n{cm_stack_val}\n")
print("Not great on val...")  # Casual remark

# Summary
print(f"Features used: {X_pca.shape[1]}")
avg_base_train_acc = np.mean([result['acc_train'] for result in base_results.values()])
avg_base_val_acc = np.mean([result['acc_val'] for result in base_results.values()])
print(f"Avg base train acc: {avg_base_train_acc:.2f}")
print(f"Avg base val acc: {avg_base_val_acc:.2f}")

# Overfitting check
print("\nOverfitting Check:")
for name, result in base_results.items():
    diff = result['acc_train'] - result['acc_val']
    print(f"{name.upper()}: Train = {result['acc_train']:.2f}, Val = {result['acc_val']:.2f}, Diff = {diff:.2f}")
print(f"Stacking: Train = {acc_stack_train:.2f}, Val = {acc_stack_val:.2f}, Diff = {acc_stack_train - acc_stack_val:.2f}")

# Verify data
print("\nSubgroup 4 Check:")
print(f"Total companies: {len(data)}")
print(f"Bankrupt: {sum(y)}")

# Results table
print("\nSubgroup 4 Results")
print("| Subgroup ID | Name of Student | Avg base acc [TT(TF)] | Meta acc [TT(TF)] | N_features |")
print("|-------------|----------------|-----------------------|-------------------|------------|")
print(f"| 4           | Shreya Nutakki  | {avg_base_val_acc:.2f} [{sum(result['tp_val'] for result in base_results.values())}({sum(result['fn_val'] for result in base_results.values())})] | {acc_stack_val:.2f} [{tp_stack_val}({fn_stack_val})] | {X_pca.shape[1]} |")


Training Results:
RF: Train Acc = 1.00, TP = 422, TN = 1016
Confusion Matrix:
[[1016   40]
 [   0  422]]

GB: Train Acc = 0.95, TP = 403, TN = 811
Confusion Matrix:
[[811 245]
 [ 19 403]]

XGB: Train Acc = 1.00, TP = 420, TN = 882
Confusion Matrix:
[[882 174]
 [  2 420]]

Validation Results:
RF: Val Acc = 0.33, TP = 2, TN = 244
Confusion Matrix:
[[244  20]
 [  4   2]]

GB: Val Acc = 0.67, TP = 4, TN = 199
Confusion Matrix:
[[199  65]
 [  2   4]]

XGB: Val Acc = 0.50, TP = 3, TN = 212
Confusion Matrix:
[[212  52]
 [  3   3]]

Stacking Results:
Train Acc = 1.00, TP = 422, TN = 927
Confusion Matrix:
[[927 129]
 [  0 422]]
Val Acc = 0.83, TP = 5, TN = 229
Confusion Matrix:
[[229  35]
 [  1   5]]

Not great on val...
Features used: 30
Avg base train acc: 0.98
Avg base val acc: 0.50

Overfitting Check:
RF: Train = 1.00, Val = 0.33, Diff = 0.67
GB: Train = 0.95, Val = 0.67, Diff = 0.29
XGB: Train = 1.00, Val = 0.50, Diff = 0.50
Stacking: Train = 1.00, Val = 0.83, Diff = 0.17

Subgroup 4 Check:
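
Since "accuracy" in this notebook is recall on the bankrupt class, it helps to read it next to precision and overall accuracy. The sketch below recomputes all three from the stacking model's validation confusion matrix printed above; the variable names are illustrative and nothing is re-run.

```python
# Stacking validation confusion matrix as printed above:
# rows = actual (non-bankrupt, bankrupt), columns = predicted
cm_val = [[229, 35],
          [1, 5]]
(tn, fp), (fn, tp) = cm_val

recall = tp / (tp + fn)                # the notebook's "accuracy": 5 / 6
precision = tp / (tp + fp)             # share of flagged companies that are bankrupt: 5 / 40
overall_accuracy = (tp + tn) / (tn + fp + fn + tp)  # 234 / 270

print(f"Recall = {recall:.3f}")
print(f"Precision = {precision:.3f}")
print(f"Overall accuracy = {overall_accuracy:.3f}")
```

Recall is high while precision is low, which is the expected trade-off when class weights and oversampling push the model toward flagging bankruptcies.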

In [7]:
# Save models
joblib.dump(stacking_model, 'best_stack_model.pkl')
joblib.dump(pca, 'pca.pkl')
joblib.dump(scaler, 'scaler.pkl')
print("Saved: best_stack_model.pkl, pca.pkl, scaler.pkl")

Saved: best_stack_model.pkl, pca.pkl, scaler.pkl
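
The saved artifacts are meant to be reloaded and applied in the same order they were fit: scaler, then PCA, then model. The sketch below shows that round trip on stand-in objects written to a temporary directory, so it runs without the notebook's files. Everything in it is an assumption for illustration: the synthetic data, the *_demo names, and a plain Logistic Regression in place of the stacking model. Real new data would also need the same column selection, log-transform, and outlier handling before the scaler, which are not captured by the three pickles.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in data and fitted objects (assumption, not the notebook's artifacts)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 6))
y_demo = (X_demo[:, 0] > 0).astype(int)

scaler_demo = StandardScaler().fit(X_demo)
pca_demo = PCA(n_components=3, random_state=42).fit(scaler_demo.transform(X_demo))
model_demo = LogisticRegression().fit(
    pca_demo.transform(scaler_demo.transform(X_demo)), y_demo
)

X_new = rng.normal(size=(5, 6))

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Save each fitted object, mirroring the joblib.dump calls above
    for name, obj in [("scaler", scaler_demo), ("pca", pca_demo), ("model", model_demo)]:
        joblib.dump(obj, tmp_dir / f"{name}.pkl")

    # Reload and apply in the order the objects were fit
    scaler_loaded = joblib.load(tmp_dir / "scaler.pkl")
    pca_loaded = joblib.load(tmp_dir / "pca.pkl")
    model_loaded = joblib.load(tmp_dir / "model.pkl")
    preds = model_loaded.predict(pca_loaded.transform(scaler_loaded.transform(X_new)))

# Reloaded objects should reproduce the in-memory predictions
expected = model_demo.predict(pca_demo.transform(scaler_demo.transform(X_new)))
same = bool(np.array_equal(preds, expected))
print(preds.shape, same)
```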


Conclusion

This notebook delivers a bankruptcy prediction model for Subgroup 4 that reaches a training accuracy of 1.00 (422/422 bankruptcies) and a validation accuracy of 0.83 (5/6 bankruptcies) with 30 features, where accuracy is recall on the bankrupt class. The stacking ensemble, combining Random Forest, Gradient Boosting, and XGBoost, outperforms the baseline Logistic Regression (0.33, 2/6); high recall is critical for such an imbalanced dataset (2.22% bankrupt). Preprocessing reduced multicollinearity and skew in the features before PCA, which supports model stability. While the train-validation gap fell to 0.17 from 0.66, the validation set contains only 6 bankruptcies, so these estimates are noisy. Future work could explore feature selection via recursive elimination and ensemble pruning to further improve generalization.