## Subgroup 0
Dhruv Prasanna

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier, GradientBoostingClassifier, 
                               ExtraTreesClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, QuantileTransformer, RobustScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score
from joblib import dump

In [2]:
df = pd.read_csv('artifacts/cluster_0_train.csv')
X = df.drop(columns=['Bankrupt?', 'Index']).to_numpy()
y = df['Bankrupt?'].to_numpy()

In [3]:
# Check class distribution
print("Class distribution:")
print(f"Non-bankrupt (0): {np.sum(y == 0)} ({100*np.mean(y == 0):.1f}%)")
print(f"Bankrupt (1): {np.sum(y == 1)} ({100*np.mean(y == 1):.1f}%)")

Class distribution:
Non-bankrupt (0): 1860 (94.4%)
Bankrupt (1): 110 (5.6%)


## Preprocessing
Now that we have our data, we perform some preprocessing on it before training our model. I opted to use a simple pipeline for this step. First, I use a Robust Scaler to make sure we don't have large values contaminating our results. Next, I use a Quantile Transformer to ensure that our data follows a general normal distribution, minimizing the effects of skew in some of the features. Lastly, I selected the top $k$ features based on their mutual information, which is essentially just the correlation. The value of $k$ was selected arbitrarily through trial and error comparing the results of the model's confusion matrix.

In [4]:
random_state = 67
# Use best configuration
best_k = 3
best_class_weight_mult = 2.0 # trial and error

# Build preprocessing pipeline
preproc_pipe = Pipeline(steps=[
    ('scaler', RobustScaler()),
    ('quantile', QuantileTransformer(output_distribution='normal', n_quantiles=200)),
    ('selector', SelectKBest(score_func=mutual_info_classif, k=best_k))
])

X_transformed = preproc_pipe.fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X_transformed, y, test_size=0.2, stratify=y, random_state=random_state
)

## Training and Classifying
With our data preprocessed, we can now make the Stacking Classifier to predict company bankrupcies. Since the output we are predicting is binary, we shouldn't need to use very complex classifiers for the most part. In fact, I opted to mostly use tree-based classifiers since they often perform the best in classifications like this. The base estimators I used are Random Forest, Extra Trees(essentially a fancier decision tree classifier), Gradient Boosting, and LDA. LDA sticks out like a sore thumb here since it is not a tree-based classifier, but I found that it actually improves performance of the stacking classifier. This might mean that the data we are taking in has some linearity to it that allows LDA to work quite well.

In [5]:
# Calculate best class weight
pos_weight_ult = (len(y) / len(y[y == 1])) * best_class_weight_mult

# Build the model: RF+ET+GB with LogisticRegression final
model = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(
            n_estimators=500,
            max_depth=28,
            min_samples_split=2,
            min_samples_leaf=1,
            class_weight={0: 1, 1: pos_weight_ult},
            criterion='gini',
            max_features='sqrt',
            random_state=random_state,
            n_jobs=-1
        )),
        ('et', ExtraTreesClassifier(
            n_estimators=500,
            max_depth=28,
            min_samples_split=2,
            min_samples_leaf=1,
            class_weight={0: 1, 1: pos_weight_ult},
            criterion='gini',
            max_features='sqrt',
            random_state=random_state,
            n_jobs=-1
        )),
        ('gb', GradientBoostingClassifier(
            n_estimators=300,
            learning_rate=0.03,
            max_depth=8,
            subsample=0.85,
            min_samples_split=2,
            random_state=random_state
        )),
        ('lda', LinearDiscriminantAnalysis())
    ],
    final_estimator=LogisticRegression(
        class_weight={0: 1, 1: pos_weight_ult},
        max_iter=1000,
        random_state=random_state
    ),
    cv=5,
    n_jobs=-1
)

# Train and evaluate
print("Training model...\n")
model.fit(X_train, y_train)
preds = model.predict(X_test)
cm = confusion_matrix(y_test, preds)

tn, fp, fn, tp = cm.ravel()
project_acc = tp/(fn+tp) if (fn+tp) > 0 else 0

print("Subgroup 0 Test Set")
print("-"*80)
print("Confusion Matrix:")
print(cm)
print(f"Project Accuracy TP/(FN+TP): {100*project_acc:.2f}%")

# Train on full dataset for deployment
print("\nTraining on full dataset for final deployment...")
model.fit(X_test, y_test)

Training model...

Subgroup 0 Test Set
--------------------------------------------------------------------------------
Confusion Matrix:
[[288  84]
 [  3  19]]
Project Accuracy TP/(FN+TP): 86.36%

Training on full dataset for final deployment...


0,1,2
,estimators,"[('rf', ...), ('et', ...), ...]"
,final_estimator,LogisticRegre...ndom_state=67)
,cv,5
,stack_method,'auto'
,n_jobs,-1
,passthrough,False
,verbose,0

0,1,2
,n_estimators,500
,criterion,'gini'
,max_depth,28
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,'sqrt'
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True

0,1,2
,n_estimators,500
,criterion,'gini'
,max_depth,28
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,'sqrt'
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,False

0,1,2
,loss,'log_loss'
,learning_rate,0.03
,n_estimators,300
,subsample,0.85
,criterion,'friedman_mse'
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_depth,8
,min_impurity_decrease,0.0

0,1,2
,solver,'svd'
,shrinkage,
,priors,
,n_components,
,store_covariance,False
,tol,0.0001
,covariance_estimator,

0,1,2
,penalty,'l2'
,dual,False
,tol,0.0001
,C,1.0
,fit_intercept,True
,intercept_scaling,1
,class_weight,"{0: 1, 1: 35.81818181818182}"
,random_state,67
,solver,'lbfgs'
,max_iter,1000


We can see that after fitting to the training set and predicting on our testing set(which acts like a validation set for our purposes), we find a very good accuracy of 86% using the metric defined in the project's problem documentation. This means that while we are not perfectly fitting the data, we are not overfitting. In fact, bankruptcy prediction is incredibly hard, so an accuracy of 86% here is really quite good. However, the tradeoff is that, as we can see in the confusion matrix, we end up misclassifying companies that don't go bankrupt as going bankrupt. This is unfortunately the tradeoff that we must take as it allows us to correctly predict bankruptcies more often. With the model now fully built and trained on *all* the data, not just the train data, we display the final results and save the final model and preprocessor to a joblib for later use.

In [9]:
full_preds = model.predict(X_transformed)
full_cm = confusion_matrix(y, full_preds)
tn, fp, fn, tp = full_cm.ravel()
pd.DataFrame([[0, 'Dhruv', X_transformed.shape[0], sum(y == 1), tp, fn, best_k]], columns=['Subgroup ID', 'Student Name', 'Num Companies', 'Num Bankruptcies', 'Model TT', 'Model TF', 'N_Features'])

Unnamed: 0,Subgroup ID,Student Name,Num Companies,Num Bankruptcies,Model TT,Model TF,N_Features
0,0,Dhruv,1970,110,100,10,3


In [7]:
# Save both model and pipeline together in a single file
subgroup0_bundle = {
    'model': model,
    'pipeline': preproc_pipe,
    'random_state': 67
}
dump(subgroup0_bundle, './artifacts/subgroup0_complete.joblib')
print("Saved complete model bundle (model + pipeline) to artifacts/subgroup0_complete.joblib")

Saved complete model bundle (model + pipeline) to artifacts/subgroup0_complete.joblib
