Data science and machine learning now play a critical role in almost every industry that extracts insights and predictions from data. In the finance sector, one of the key tasks is bankruptcy prediction, which is used to assess risk and take appropriate action. With machine learning algorithms, we can train a model that estimates whether a company or legal person will go bankrupt in the future.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

data = pd.read_csv("bankruptcy.csv")
data.head()

Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
0,1,0.370594,0.424389,0.40575,0.601457,0.601457,0.998969,0.796887,0.808809,0.302646,...,0.716845,0.009219,0.622879,0.601453,0.82789,0.290202,0.026601,0.56405,1,0.016469
1,1,0.464291,0.538214,0.51673,0.610235,0.610235,0.998946,0.79738,0.809301,0.303556,...,0.795297,0.008323,0.623652,0.610237,0.839969,0.283846,0.264577,0.570175,1,0.020794
2,1,0.426071,0.499019,0.472295,0.60145,0.601364,0.998857,0.796403,0.808388,0.302035,...,0.77467,0.040003,0.623841,0.601449,0.836774,0.290189,0.026555,0.563706,1,0.016474
3,1,0.399844,0.451265,0.457733,0.583541,0.583541,0.9987,0.796967,0.808966,0.30335,...,0.739555,0.003252,0.622929,0.583538,0.834697,0.281721,0.026697,0.564663,1,0.023982
4,1,0.465022,0.538432,0.522298,0.598783,0.598783,0.998973,0.797366,0.809304,0.303475,...,0.795016,0.003878,0.623521,0.598782,0.839973,0.278514,0.024752,0.575617,1,0.03549


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6819 entries, 0 to 6818
Data columns (total 96 columns):
Bankrupt?                                                   6819 non-null int64
 ROA(C) before interest and depreciation before interest    6819 non-null float64
 ROA(A) before interest and % after tax                     6819 non-null float64
 ROA(B) before interest and depreciation after tax          6819 non-null float64
 Operating Gross Margin                                     6819 non-null float64
 Realized Sales Gross Margin                                6819 non-null float64
 Operating Profit Rate                                      6819 non-null float64
 Pre-tax net Interest Rate                                  6819 non-null float64
 After-tax net Interest Rate                                6819 non-null float64
 Non-industry income and expenditure/revenue                6819 non-null float64
 Continuous interest rate (after tax)                       6819 non-null f

In [3]:
data.isnull().any()

Bankrupt?                                                   False
 ROA(C) before interest and depreciation before interest    False
 ROA(A) before interest and % after tax                     False
 ROA(B) before interest and depreciation after tax          False
 Operating Gross Margin                                     False
                                                            ...  
 Liability to Equity                                        False
 Degree of Financial Leverage (DFL)                         False
 Interest Coverage Ratio (Interest expense to EBIT)         False
 Net Income Flag                                            False
 Equity to Liability                                        False
Length: 96, dtype: bool

In [4]:
data.describe(include='all')

Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
count,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,...,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0
mean,0.032263,0.50518,0.558625,0.553589,0.607948,0.607929,0.998755,0.79719,0.809084,0.303623,...,0.80776,18629420.0,0.623915,0.607946,0.840402,0.280365,0.027541,0.565358,1.0,0.047578
std,0.17671,0.060686,0.06562,0.061595,0.016934,0.016916,0.01301,0.012869,0.013601,0.011163,...,0.040332,376450100.0,0.01229,0.016934,0.014523,0.014463,0.015668,0.013214,0.0,0.050014
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,0.0,0.476527,0.535543,0.527277,0.600445,0.600434,0.998969,0.797386,0.809312,0.303466,...,0.79675,0.0009036205,0.623636,0.600443,0.840115,0.276944,0.026791,0.565158,1.0,0.024477
50%,0.0,0.502706,0.559802,0.552278,0.605997,0.605976,0.999022,0.797464,0.809375,0.303525,...,0.810619,0.002085213,0.623879,0.605998,0.841179,0.278778,0.026808,0.565252,1.0,0.033798
75%,0.0,0.535563,0.589157,0.584105,0.613914,0.613842,0.999095,0.797579,0.809469,0.303585,...,0.826455,0.005269777,0.624168,0.613913,0.842357,0.281449,0.026913,0.565725,1.0,0.052838
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,9820000000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [5]:
data['Bankrupt?'].value_counts(normalize = True)

0    0.967737
1    0.032263
Name: Bankrupt?, dtype: float64

The dataset has 95 feature columns and one target column. By checking which columns are most strongly correlated with the target variable, we can build two models, one using all variables and one using only the top 10 variables, and compare how well each predicts bankruptcy.

In [6]:
corr_matrix = data.corr()

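# Find the top N columns with the largest correlation with the target variable.
# Note: the target itself (correlation 1.0) is part of the result, and nlargest
# ranks signed values, so strongly negative correlations are not picked up.
N = 15
target_col = 'Bankrupt?'
top_cols = corr_matrix.nlargest(N, target_col)[target_col].index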

# Print the top N columns with the largest correlation with the target variable
print(f'Top {N} columns with largest correlation with {target_col}:')
print(top_cols)

Top 15 columns with largest correlation with Bankrupt?:
Index(['Bankrupt?', ' Debt ratio %', ' Current Liability to Assets',
       ' Borrowing dependency', ' Current Liability to Current Assets',
       ' Liability to Equity', ' Current Liabilities/Equity',
       ' Current Liability to Equity', ' Liability-Assets Flag',
       ' Total expense/Assets', ' Equity to Long-term Liability',
       ' Cash/Current Liability',
       ' Inventory and accounts receivable/Net value',
       ' Fixed Assets Turnover Frequency',
       ' Contingent liabilities/Net worth'],
      dtype='object')
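
The ranking above uses signed correlations and lists the target itself first. A minimal sketch of an alternative, assuming we want to rank features by absolute correlation and leave the target out: the helper name `top_correlated_features` and the small synthetic frame are hypothetical and only illustrate the idea, they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_features(df, target_col, n=10):
    """Return the n features with the largest absolute correlation to target_col."""
    corr = df.corr()[target_col].drop(target_col)  # exclude the target itself
    # Constant columns give NaN correlations; drop them before ranking.
    return corr.dropna().abs().sort_values(ascending=False).head(n)

# Tiny synthetic frame (hypothetical data, not the bankruptcy dataset).
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)
demo = pd.DataFrame({
    "target": target,
    "pos_signal": target + rng.normal(0, 0.1, size=200),
    "neg_signal": -target + rng.normal(0, 0.1, size=200),
    "noise": rng.normal(0, 1, size=200),
    "constant": 1.0,
})

ranked = top_correlated_features(demo, "target", n=2)
print(ranked)
```

On the notebook's data this would be called as `top_correlated_features(data, 'Bankrupt?', n=10)`. Negatively correlated features can be just as informative as positively correlated ones, which is why the sketch ranks by absolute value.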


In [7]:
X = data.drop(["Bankrupt?"], axis="columns")
y = data["Bankrupt?"]

Logistic Regression: Logistic regression is a common method for binary classification problems, such as predicting bankruptcy.

Accuracy alone can be misleading: a model can reach an accuracy score of 96% and still not tell the full story. While the second model is not the best possible, it takes false negatives into account, whereas the first model does not. We can certainly improve the model through feature selection, hyperparameter tuning, collecting more data, studying interpretability with SHAP values, or trying different ensemble methods, but in practice it comes down to experimenting until you find a model that works.
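
To see why a high accuracy score says little on imbalanced data, here is a small sketch using a baseline that always predicts the majority class. The labels are synthetic and only mimic the roughly 3% share of bankrupt companies shown by `value_counts` above; `X_demo` and `y_demo` are placeholder names, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with about 3% positives (assumption: mirrors the
# class balance reported above); the single feature is a placeholder.
n_samples = 1000
y_demo = np.zeros(n_samples, dtype=int)
y_demo[:32] = 1                      # 3.2% "bankrupt"
X_demo = np.zeros((n_samples, 1))

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_hat = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_hat)
baseline_recall = recall_score(y_demo, y_hat)
print(f"accuracy: {baseline_accuracy:.3f}")
print(f"recall on the positive class: {baseline_recall:.3f}")
```

The baseline is right almost every time yet never flags a single bankruptcy, which is why the classification reports below look at precision and recall per class rather than accuracy alone.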

To tackle the imbalance in our target variable, we will oversample the minority class with SMOTE and then check how our model performs.

In [8]:
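# Hold out 25% of the data for testing (no random_state is set, so the split varies between runs)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Oversample the minority class on the training split only, so the test set keeps its original class balance
smote = SMOTE(random_state = 10)
x_smote, y_smote = smote.fit_resample(x_train, y_train)

print("printing shape")
print(x_smote.shape, y_smote.shape)

# Check the class balance after resampling
print(y_smote.value_counts(normalize = True))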

printing shape
(9916, 95) (9916,)
1    0.5
0    0.5
Name: Bankrupt?, dtype: float64


In [9]:
lr = LogisticRegression()

lr.fit(x_smote, y_smote)

y_pred = lr.predict(x_test)


print("Printing Accuracy Score ", lr.score(x_test, y_test))

print("[Test Classification Report]")
print(classification_report(y_test, y_pred))

Printing Accuracy Score  0.743108504398827
[Test Classification Report]
              precision    recall  f1-score   support

           0       0.97      0.76      0.85      1641
           1       0.07      0.44      0.11        64

    accuracy                           0.74      1705
   macro avg       0.52      0.60      0.48      1705
weighted avg       0.94      0.74      0.82      1705



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
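
The warning above suggests two remedies: scaling the data and raising `max_iter`. A minimal sketch of both, assuming a `StandardScaler` in front of the logistic regression; the data here is synthetic, with one column inflated to mimic a feature on a very different scale (such as `Total assets to GNP price` in the `describe` output), and `max_iter=1000` is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); column 0 is blown up to mimic
# features that live on very different scales.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo[:, 0] *= 1e9

# Standardize every feature, then fit logistic regression with a higher
# iteration limit than the default.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

# n_iter_ holds the number of solver iterations actually used.
n_iter = int(scaled_lr.named_steps["logisticregression"].n_iter_[0])
print("lbfgs iterations used:", n_iter)
```

In the notebook, the same pipeline could be fitted with `scaled_lr.fit(x_smote, y_smote)` and evaluated on `x_test`; whether it changes the scores reported here would need to be checked by running it.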


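Our accuracy score is 0.74. Compared with a regular model trained on the imbalanced data, resampling shows an upside: the model now catches some of the bankrupt companies (recall of 0.44 for class 1), although precision for that class is only 0.07. Let us apply `SelectKBest` to pick the 10 best variables for our model to train on, and then check our scores again.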

In [10]:
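# Feature selection: keep the 10 features with the highest ANOVA F-scores
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=10)
# Fit the selector on the resampled training data only, then apply the same selection to the test set
X_train_selected = selector.fit_transform(x_smote, y_smote)
X_test_selected = selector.transform(x_test)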

  f = msb / msw
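
The `f = msb / msw` line is the tail of a warning from `f_classif`; it is likely triggered by a constant column (`Net Income Flag` has a standard deviation of 0 in the `describe` output), for which the F-score is undefined. To see which columns `SelectKBest` actually kept, one option is `get_support()`. The sketch below uses synthetic data with hypothetical column names `f0` to `f7`, and `selector_demo` is a placeholder name.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in frame (hypothetical column names f0..f7).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selector_demo = SelectKBest(f_classif, k=3).fit(X_demo, y_demo)

# get_support() returns a boolean mask aligned with the input columns.
selected_cols = X_demo.columns[selector_demo.get_support()]

# scores_ holds the F-score of every input column, selected or not.
scores = pd.Series(selector_demo.scores_, index=X_demo.columns)
print(scores[selected_cols].sort_values(ascending=False))
```

Applied to the notebook, `x_smote.columns[selector.get_support()]` would list the 10 selected feature names, assuming `x_smote` is still a DataFrame after resampling.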


In [11]:
# Retrain the model on the 10 selected features
lr.fit(X_train_selected, y_smote)

# Evaluate the model on the same selected features of the test set
y_pred = lr.predict(X_test_selected)
engineered_accuracy = accuracy_score(y_test, y_pred)

print("[Test Classification Report]")
print(classification_report(y_test, y_pred))

print('')
print("Engineered accuracy:", engineered_accuracy)

[Test Classification Report]
              precision    recall  f1-score   support

           0       0.99      0.87      0.93      1641
           1       0.20      0.86      0.33        64

    accuracy                           0.87      1705
   macro avg       0.60      0.86      0.63      1705
weighted avg       0.96      0.87      0.90      1705


Engineered accuracy: 0.8686217008797654


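Selecting the 10 best features and retraining the model does improve our results substantially: accuracy rises from 0.74 to 0.87, and recall on the bankrupt class rises from 0.44 to 0.86, although precision on that class is still only 0.20. We can use this model to predict bankruptcy, or tweak it further to make it better.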
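
One possible tweak, sketched here as an assumption rather than something the notebook does, is to move the decision threshold instead of using the default 0.5 that `predict` applies. Lowering it trades precision for recall on the bankrupt class. The data is synthetic and the thresholds 0.3, 0.5 and 0.7 are arbitrary illustrative values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (hypothetical stand-in for the real features).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # predicted probability of class 1

# Compare precision and recall on the positive class at several thresholds.
results = {}
for threshold in (0.3, 0.5, 0.7):
    y_hat = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat),
    )
    print(threshold, results[threshold])
```

Which threshold is appropriate depends on how costly a missed bankruptcy is compared with a false alarm, so it should be chosen on validation data rather than on the test set.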