
# **Problem Statement**

Bankruptcy prediction is of growing interest to firms that
stand to lose money on unpaid debts. Because large datasets
pertaining to bankruptcy can now be stored and processed, making accurate predictions
from them ahead of time is becoming increasingly important.

The data were collected from the Taiwan Economic Journal for the years 1999 to 2009. Company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange.

In this project you will use various classification algorithms on the bankruptcy
dataset to predict bankruptcies with satisfactory accuracy well before the actual event.

# **Attribute Information**

The column names and descriptions below were updated to make the data easier to understand (Y = output feature, X = input features).

Y - Bankrupt?: Class label (1: Yes, 0: No)

X1 - ROA(C) before interest and depreciation before interest: Return On Total Assets(C)

X2 - ROA(A) before interest and % after tax: Return On Total Assets(A)

X3 - ROA(B) before interest and depreciation after tax: Return On Total Assets(B)

X4 - Operating Gross Margin: Gross Profit/Net Sales

X5 - Realized Sales Gross Margin: Realized Gross Profit/Net Sales

X6 - Operating Profit Rate: Operating Income/Net Sales

X7 - Pre-tax net Interest Rate: Pre-Tax Income/Net Sales

X8 - After-tax net Interest Rate: Net Income/Net Sales

X9 - Non-industry income and expenditure/revenue: Net Non-operating Income Ratio

X10 - Continuous interest rate (after tax): Net Income-Exclude Disposal Gain or Loss/Net Sales

X11 - Operating Expense Rate: Operating Expenses/Net Sales

X12 - Research and development expense rate: (Research and Development Expenses)/Net Sales

X13 - Cash flow rate: Cash Flow from Operating/Current Liabilities

X14 - Interest-bearing debt interest rate: Interest-bearing Debt/Equity

X15 - Tax rate (A): Effective Tax Rate

X16 - Net Value Per Share (B): Book Value Per Share(B)

X17 - Net Value Per Share (A): Book Value Per Share(A)

X18 - Net Value Per Share (C): Book Value Per Share(C)

X19 - Persistent EPS in the Last Four Seasons: EPS-Net Income

X20 - Cash Flow Per Share

X21 - Revenue Per Share (Yuan ¥): Sales Per Share

X22 - Operating Profit Per Share (Yuan ¥): Operating Income Per Share

X23 - Per Share Net profit before tax (Yuan ¥): Pretax Income Per Share

X24 - Realized Sales Gross Profit Growth Rate

X25 - Operating Profit Growth Rate: Operating Income Growth

X26 - After-tax Net Profit Growth Rate: Net Income Growth

X27 - Regular Net Profit Growth Rate: Continuing Operating Income after Tax Growth

X28 - Continuous Net Profit Growth Rate: Net Income-Excluding Disposal Gain or Loss Growth

X29 - Total Asset Growth Rate: Total Asset Growth

X30 - Net Value Growth Rate: Total Equity Growth

X31 - Total Asset Return Growth Rate Ratio: Return on Total Asset Growth

X32 - Cash Reinvestment %: Cash Reinvestment Ratio

X33 - Current Ratio

X34 - Quick Ratio: Acid Test

X35 - Interest Expense Ratio: Interest Expenses/Total Revenue

X36 - Total debt/Total net worth: Total Liability/Equity Ratio

X37 - Debt ratio %: Liability/Total Assets

X38 - Net worth/Assets: Equity/Total Assets

X39 - Long-term fund suitability ratio (A): (Long-term Liability+Equity)/Fixed Assets

X40 - Borrowing dependency: Cost of Interest-bearing Debt

X41 - Contingent liabilities/Net worth: Contingent Liability/Equity

X42 - Operating profit/Paid-in capital: Operating Income/Capital

X43 - Net profit before tax/Paid-in capital: Pretax Income/Capital

X44 - Inventory and accounts receivable/Net value: (Inventory+Accounts Receivables)/Equity

X45 - Total Asset Turnover

X46 - Accounts Receivable Turnover

X47 - Average Collection Days: Days Receivable Outstanding

X48 - Inventory Turnover Rate (times)

X49 - Fixed Assets Turnover Frequency

X50 - Net Worth Turnover Rate (times): Equity Turnover

X51 - Revenue per person: Sales Per Employee

X52 - Operating profit per person: Operation Income Per Employee

X53 - Allocation rate per person: Fixed Assets Per Employee

X54 - Working Capital to Total Assets

X55 - Quick Assets/Total Assets

X56 - Current Assets/Total Assets

X57 - Cash/Total Assets

X58 - Quick Assets/Current Liability

X59 - Cash/Current Liability

X60 - Current Liability to Assets

X61 - Operating Funds to Liability

X62 - Inventory/Working Capital

X63 - Inventory/Current Liability

X64 - Current Liabilities/Liability

X65 - Working Capital/Equity

X66 - Current Liabilities/Equity

X67 - Long-term Liability to Current Assets

X68 - Retained Earnings to Total Assets

X69 - Total income/Total expense

X70 - Total expense/Assets

X71 - Current Asset Turnover Rate: Current Assets to Sales

X72 - Quick Asset Turnover Rate: Quick Assets to Sales

X73 - Working capitcal Turnover Rate: Working Capital to Sales

X74 - Cash Turnover Rate: Cash to Sales

X75 - Cash Flow to Sales

X76 - Fixed Assets to Assets

X77 - Current Liability to Liability

X78 - Current Liability to Equity

X79 - Equity to Long-term Liability

X80 - Cash Flow to Total Assets

X81 - Cash Flow to Liability

X82 - CFO to Assets

X83 - Cash Flow to Equity

X84 - Current Liability to Current Assets

X85 - Liability-Assets Flag: 1 if Total Liability exceeds Total Assets, 0 otherwise

X86 - Net Income to Total Assets

X87 - Total assets to GNP price

X88 - No-credit Interval

X89 - Gross Profit to Sales

X90 - Net Income to Stockholder's Equity

X91 - Liability to Equity

X92 - Degree of Financial Leverage (DFL)

X93 - Interest Coverage Ratio (Interest expense to EBIT)

X94 - Net Income Flag: 1 if Net Income is Negative for the last two years, 0 otherwise

X95 - Equity to Liability

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import imblearn
from imblearn.under_sampling import NearMiss
from collections import Counter
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, recall_score, precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import PowerTransformer
pd.options.display.float_format = '{:,.6f}'.format
from imblearn.ensemble import BalancedBaggingClassifier
import warnings
warnings.filterwarnings("ignore")
from sklearn.naive_bayes import GaussianNB
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm



In [3]:
df_1=pd.read_csv('/content/drive/MyDrive/Alma Better/Capstone Project/Copy of COMPANY BANKRUPTCY PREDICTION.csv')

In [4]:
pd.set_option('display.max_columns', 96)
pd.set_option('display.max_rows', 96)
df_1.head(2)

Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,Continuous interest rate (after tax),Operating Expense Rate,Research and development expense rate,Cash flow rate,Interest-bearing debt interest rate,Tax rate (A),Net Value Per Share (B),Net Value Per Share (A),Net Value Per Share (C),Persistent EPS in the Last Four Seasons,Cash Flow Per Share,Revenue Per Share (Yuan ¥),Operating Profit Per Share (Yuan ¥),Per Share Net profit before tax (Yuan ¥),Realized Sales Gross Profit Growth Rate,Operating Profit Growth Rate,After-tax Net Profit Growth Rate,Regular Net Profit Growth Rate,Continuous Net Profit Growth Rate,Total Asset Growth Rate,Net Value Growth Rate,Total Asset Return Growth Rate Ratio,Cash Reinvestment %,Current Ratio,Quick Ratio,Interest Expense Ratio,Total debt/Total net worth,Debt ratio %,Net worth/Assets,Long-term fund suitability ratio (A),Borrowing dependency,Contingent liabilities/Net worth,Operating profit/Paid-in capital,Net profit before tax/Paid-in capital,Inventory and accounts receivable/Net value,Total Asset Turnover,Accounts Receivable Turnover,Average Collection Days,Inventory Turnover Rate (times),Fixed Assets Turnover Frequency,Net Worth Turnover Rate (times),Revenue per person,Operating profit per person,Allocation rate per person,Working Capital to Total Assets,Quick Assets/Total Assets,Current Assets/Total Assets,Cash/Total Assets,Quick Assets/Current Liability,Cash/Current Liability,Current Liability to Assets,Operating Funds to Liability,Inventory/Working Capital,Inventory/Current Liability,Current Liabilities/Liability,Working Capital/Equity,Current Liabilities/Equity,Long-term Liability to Current Assets,Retained Earnings to Total Assets,Total income/Total expense,Total expense/Assets,Current Asset Turnover Rate,Quick Asset Turnover Rate,Working capitcal Turnover Rate,Cash Turnover Rate,Cash Flow to Sales,Fixed Assets to Assets,Current Liability to Liability,Current Liability to Equity,Equity to Long-term Liability,Cash Flow to Total Assets,Cash Flow to Liability,CFO to Assets,Cash Flow to Equity,Current Liability to Current Assets,Liability-Assets Flag,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
0,1,0.370594,0.424389,0.40575,0.601457,0.601457,0.998969,0.796887,0.808809,0.302646,0.780985,0.000126,0.0,0.458143,0.000725,0.0,0.14795,0.14795,0.14795,0.169141,0.311664,0.01756,0.095921,0.138736,0.022102,0.848195,0.688979,0.688979,0.217535,4980000000.0,0.000327,0.2631,0.363725,0.002259,0.001208,0.629951,0.021266,0.207576,0.792424,0.005024,0.390284,0.006479,0.095885,0.137757,0.398036,0.086957,0.001814,0.003487,0.000182,0.000117,0.032903,0.034164,0.392913,0.037135,0.672775,0.166673,0.190643,0.004094,0.001997,0.000147,0.147308,0.334015,0.27692,0.001036,0.676269,0.721275,0.339077,0.025592,0.903225,0.002022,0.064856,701000000.0,6550000000.0,0.593831,458000000.0,0.671568,0.424206,0.676269,0.339077,0.126549,0.637555,0.458609,0.520382,0.312905,0.11825,0,0.716845,0.009219,0.622879,0.601453,0.82789,0.290202,0.026601,0.56405,1,0.016469
1,1,0.464291,0.538214,0.51673,0.610235,0.610235,0.998946,0.79738,0.809301,0.303556,0.781506,0.00029,0.0,0.461867,0.000647,0.0,0.182251,0.182251,0.182251,0.208944,0.318137,0.021144,0.093722,0.169918,0.02208,0.848088,0.689693,0.689702,0.21762,6110000000.0,0.000443,0.264516,0.376709,0.006016,0.004039,0.635172,0.012502,0.171176,0.828824,0.005059,0.37676,0.005835,0.093743,0.168962,0.397725,0.064468,0.001286,0.004917,9360000000.0,719000000.0,0.025484,0.006889,0.39159,0.012335,0.751111,0.127236,0.182419,0.014948,0.004136,0.001384,0.056963,0.341106,0.289642,0.00521,0.308589,0.731975,0.32974,0.023947,0.931065,0.002226,0.025516,0.000107,7700000000.0,0.593916,2490000000.0,0.67157,0.468828,0.308589,0.32974,0.120916,0.6411,0.459001,0.567101,0.314163,0.047775,0,0.795297,0.008323,0.623652,0.610237,0.839969,0.283846,0.264577,0.570175,1,0.020794


In [5]:
print(df_1.isna().sum().sum())
print(np.isnan(df_1).sum().sum())
print(df_1.isnull().sum().sum())

0
0
0


In [6]:
# Collect columns that take at most two distinct values (binary or constant).
list_1 = [col for col in df_1.columns if df_1[col].nunique() <= 2]

In [7]:
list_1

['Bankrupt?', ' Liability-Assets Flag', ' Net Income Flag']

### Liability-Assets Flag

In [8]:
df_1[' Liability-Assets Flag'].unique()
df_1[' Liability-Assets Flag'].value_counts()

0    6811
1       8
Name:  Liability-Assets Flag, dtype: int64

This feature is extremely imbalanced (only 8 of 6,819 rows are 1), so we drop it.

### Net Income Flag

In [10]:
df_1[' Net Income Flag'].unique()

array([1])

This feature has only one unique value, so it carries no information and is dropped as well.

In [11]:
df_1.drop(columns=[' Net Income Flag', ' Liability-Assets Flag'],inplace=True)

In [12]:
df_1.shape

(6819, 94)
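The two manual checks above (a constant column and a flag with only 8 positive rows) can be generalised. The following is a minimal sketch, not the notebook's own method: the helper name `low_information_columns`, the `max_minority_share` threshold of 0.5%, and the demo frame are all assumptions made for illustration.

```python
import pandas as pd

def low_information_columns(df, max_minority_share=0.005):
    """Return columns that are constant, or whose non-dominant values
    together make up less than `max_minority_share` of the rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the dominant value
        shares = df[col].value_counts(normalize=True)
        if len(shares) <= 1 or 1.0 - shares.iloc[0] < max_minority_share:
            flagged.append(col)
    return flagged

# Hypothetical demo data: one constant flag, one very rare flag, one balanced flag.
demo = pd.DataFrame({
    "constant_flag": [1] * 1000,
    "rare_flag": [0] * 998 + [1] * 2,
    "balanced": [0, 1] * 500,
})
print(low_information_columns(demo))
```

On this dataset the target itself should be excluded before calling such a helper, since the threshold is arbitrary and the label is also imbalanced.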

In [13]:
df_1.describe()

Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,Continuous interest rate (after tax),Operating Expense Rate,Research and development expense rate,Cash flow rate,Interest-bearing debt interest rate,Tax rate (A),Net Value Per Share (B),Net Value Per Share (A),Net Value Per Share (C),Persistent EPS in the Last Four Seasons,Cash Flow Per Share,Revenue Per Share (Yuan ¥),Operating Profit Per Share (Yuan ¥),Per Share Net profit before tax (Yuan ¥),Realized Sales Gross Profit Growth Rate,Operating Profit Growth Rate,After-tax Net Profit Growth Rate,Regular Net Profit Growth Rate,Continuous Net Profit Growth Rate,Total Asset Growth Rate,Net Value Growth Rate,Total Asset Return Growth Rate Ratio,Cash Reinvestment %,Current Ratio,Quick Ratio,Interest Expense Ratio,Total debt/Total net worth,Debt ratio %,Net worth/Assets,Long-term fund suitability ratio (A),Borrowing dependency,Contingent liabilities/Net worth,Operating profit/Paid-in capital,Net profit before tax/Paid-in capital,Inventory and accounts receivable/Net value,Total Asset Turnover,Accounts Receivable Turnover,Average Collection Days,Inventory Turnover Rate (times),Fixed Assets Turnover Frequency,Net Worth Turnover Rate (times),Revenue per person,Operating profit per person,Allocation rate per person,Working Capital to Total Assets,Quick Assets/Total Assets,Current Assets/Total Assets,Cash/Total Assets,Quick Assets/Current Liability,Cash/Current Liability,Current Liability to Assets,Operating Funds to Liability,Inventory/Working Capital,Inventory/Current Liability,Current Liabilities/Liability,Working Capital/Equity,Current Liabilities/Equity,Long-term Liability to Current Assets,Retained Earnings to Total Assets,Total income/Total expense,Total expense/Assets,Current Asset Turnover Rate,Quick Asset Turnover Rate,Working capitcal Turnover Rate,Cash Turnover Rate,Cash Flow to Sales,Fixed Assets to Assets,Current Liability to Liability,Current Liability to Equity,Equity to Long-term Liability,Cash Flow to Total Assets,Cash Flow to Liability,CFO to Assets,Cash Flow to Equity,Current Liability to Current Assets,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Equity to Liability
count,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0
mean,0.032263,0.50518,0.558625,0.553589,0.607948,0.607929,0.998755,0.79719,0.809084,0.303623,0.781381,1995347312.802792,1950427306.056799,0.467431,16448012.905942,0.115001,0.190661,0.190633,0.190672,0.228813,0.323482,1328640.602096,0.109091,0.184361,0.022408,0.84798,0.689146,0.68915,0.217639,5508096595.248731,1566212.055241,0.264248,0.379677,403284.954245,8376594.819685,0.630991,4416336.714259,0.113177,0.886823,0.008783,0.374654,0.005968,0.108977,0.182715,0.402459,0.141606,12789705.237554,9826220.861192,2149106056.60753,1008595981.817477,0.038595,2325854.266358,0.400671,11255785.321742,0.814125,0.400132,0.522273,0.124095,3592902.19683,37159994.147133,0.090673,0.353828,0.277395,55806804.52578,0.761599,0.735817,0.33141,54160038.135894,0.934733,0.002549,0.029184,1195855763.308841,2163735272.034319,0.594006,2471976967.444247,0.671531,1220120.50159,0.761599,0.33141,0.115645,0.649731,0.461849,0.593415,0.315582,0.031506,0.80776,18629417.811836,0.623915,0.607946,0.840402,0.280365,0.027541,0.565358,0.047578
std,0.17671,0.060686,0.06562,0.061595,0.016934,0.016916,0.01301,0.012869,0.013601,0.011163,0.012679,3237683890.522487,2598291553.998342,0.017036,108275033.532823,0.138667,0.03339,0.033474,0.03348,0.033263,0.017611,51707089.767907,0.027942,0.03318,0.012079,0.010752,0.013853,0.01391,0.010063,2897717771.169734,114159389.518336,0.009634,0.020737,33302155.82548,244684748.446872,0.011238,168406905.281511,0.05392,0.05392,0.028153,0.016286,0.012188,0.027782,0.030785,0.013324,0.101145,278259836.984053,256358895.705332,3247967014.047904,2477557316.920172,0.03668,136632654.389936,0.03272,294506294.116772,0.059054,0.201998,0.218112,0.139251,171620908.606822,510350903.162733,0.05029,0.035147,0.010469,582051554.61942,0.206677,0.011678,0.013488,570270621.959227,0.025564,0.012093,0.027149,2821161238.262457,3374944402.166119,0.008959,2938623226.67881,0.009341,100754158.713168,0.206677,0.013488,0.019529,0.047372,0.029943,0.058561,0.012961,0.030845,0.040332,376450059.745829,0.01229,0.016934,0.014523,0.014463,0.015668,0.013214,0.050014
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.476527,0.535543,0.527277,0.600445,0.600434,0.998969,0.797386,0.809312,0.303466,0.781567,0.000157,0.000128,0.461558,0.000203,0.0,0.173613,0.173613,0.173676,0.214711,0.317748,0.015631,0.096083,0.17037,0.022065,0.847984,0.68927,0.68927,0.21758,4860000000.0,0.000441,0.263759,0.374749,0.007555,0.004726,0.630612,0.003007,0.072891,0.851196,0.005244,0.370168,0.005366,0.096105,0.169376,0.397403,0.076462,0.00071,0.004387,0.000173,0.000233,0.021774,0.010433,0.392438,0.004121,0.774309,0.241973,0.352845,0.033543,0.00524,0.001973,0.053301,0.341023,0.277034,0.003163,0.626981,0.733612,0.328096,0.0,0.931097,0.002236,0.014567,0.000146,0.000142,0.593934,0.000274,0.671565,0.08536,0.626981,0.328096,0.110933,0.633265,0.457116,0.565987,0.312995,0.018034,0.79675,0.000904,0.623636,0.600443,0.840115,0.276944,0.026791,0.565158,0.024477
50%,0.0,0.502706,0.559802,0.552278,0.605997,0.605976,0.999022,0.797464,0.809375,0.303525,0.781635,0.000278,509000000.0,0.46508,0.000321,0.073489,0.1844,0.1844,0.1844,0.224544,0.322487,0.027376,0.104226,0.179709,0.022102,0.848044,0.689439,0.689439,0.217598,6400000000.0,0.000462,0.26405,0.380425,0.010587,0.007412,0.630698,0.005546,0.111407,0.888593,0.005665,0.372624,0.005366,0.104133,0.178456,0.400131,0.118441,0.000968,0.006573,0.000765,0.000593,0.029516,0.018616,0.395898,0.007844,0.810275,0.386451,0.51483,0.074887,0.007909,0.004904,0.082705,0.348597,0.277178,0.006497,0.806881,0.736013,0.329685,0.001975,0.937672,0.002336,0.022674,0.000199,0.000225,0.593963,1080000000.0,0.671574,0.196881,0.806881,0.329685,0.11234,0.645366,0.45975,0.593266,0.314953,0.027597,0.810619,0.002085,0.623879,0.605998,0.841179,0.278778,0.026808,0.565252,0.033798
75%,0.0,0.535563,0.589157,0.584105,0.613914,0.613842,0.999095,0.797579,0.809469,0.303585,0.781735,4145000000.0,3450000000.0,0.471004,0.000533,0.205841,0.19957,0.19957,0.199612,0.23882,0.328623,0.046357,0.116155,0.193493,0.022153,0.848123,0.689647,0.689647,0.217622,7390000000.0,0.000499,0.264388,0.386731,0.01627,0.012249,0.631125,0.009273,0.148804,0.927109,0.006847,0.376271,0.005764,0.115927,0.191607,0.404551,0.176912,0.001455,0.008973,4620000000.0,0.003652,0.042903,0.035855,0.401851,0.01502,0.850383,0.540594,0.689051,0.161073,0.012951,0.012806,0.119523,0.360915,0.277429,0.011147,0.942027,0.73856,0.332322,0.009006,0.944811,0.002492,0.03593,0.000453,4900000000.0,0.594002,4510000000.0,0.671587,0.3722,0.942027,0.332322,0.117106,0.663062,0.464236,0.624769,0.317707,0.038375,0.826455,0.00527,0.624168,0.613913,0.842357,0.281449,0.026913,0.565725,0.052838
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9990000000.0,9980000000.0,1.0,990000000.0,1.0,1.0,1.0,1.0,1.0,1.0,3020000000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9990000000.0,9330000000.0,1.0,1.0,2750000000.0,9230000000.0,1.0,9940000000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9740000000.0,9730000000.0,9990000000.0,9990000000.0,1.0,8810000000.0,1.0,9570000000.0,1.0,1.0,1.0,1.0,8820000000.0,9650000000.0,1.0,1.0,1.0,9910000000.0,1.0,1.0,1.0,9540000000.0,1.0,1.0,1.0,10000000000.0,10000000000.0,1.0,10000000000.0,1.0,8320000000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9820000000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
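The summary above shows that most columns lie in [0, 1], while a few mix tiny values with values near 1e10 (for example, Operating Expense Rate has a 25% quantile of 0.000157 and a maximum of 9,990,000,000). A small sketch for listing such columns follows; the helper name `columns_outside_unit_interval` and the demo frame are assumptions for illustration only.

```python
import pandas as pd

def columns_outside_unit_interval(df):
    """Return the maximum of every numeric column whose values leave [0, 1],
    largest first."""
    numeric = df.select_dtypes("number")
    maxima = numeric.max()
    minima = numeric.min()
    outside = (maxima > 1) | (minima < 0)
    return maxima[outside].sort_values(ascending=False)

# Hypothetical demo data mimicking one ratio column and one raw-amount column.
demo = pd.DataFrame({
    "ratio": [0.1, 0.5, 1.0],
    "raw_amount": [0.0002, 4.98e9, 6.11e9],
})
print(columns_outside_unit_interval(demo))
```

Applied to `df_1.drop(columns='Bankrupt?')`, a list like this would indicate which columns most need the power transform and scaling used later.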


# **Feature Selection**

## **1. Using VIF.**

In [14]:
y=df_1['Bankrupt?']
X_1=df_1.drop(columns='Bankrupt?')

In [16]:
# # VIF dataframe
# vif_data = pd.DataFrame()
# vif_data["feature"] = vif_df.columns

In [17]:
X_1.shape

(6819, 93)

In [18]:
# skewness=PowerTransformer()
# X_1_transformed=pd.DataFrame(skewness.fit_transform(X_1), columns=X_1.columns)
X_1_transformed=X_1.copy()

In [19]:

X_1_transformed.shape

(6819, 93)

In [20]:
#cols_removed_vif=[]
#=[x for x in X_1_transformed.columns if x not in cols_removed_vif]

In [21]:
# columns_after_vif=[x for x in X_1_transformed.columns if x not in cols_removed_vif]

In [22]:
cols_removed_vif = []
columns_after_vif = list(X_1_transformed.columns)
while True:
  # VIF of every remaining feature
  vif_data = pd.DataFrame()
  vif_data["feature"] = columns_after_vif
  vif_data["VIF"] = [variance_inflation_factor(X_1_transformed[columns_after_vif].values, i)
                     for i in range(len(columns_after_vif))]
  # NOTE: the VIF column is not sorted here, so this tests only the last listed
  # feature rather than the largest VIF; the loop stops as soon as that one
  # feature is below 15.
  if vif_data["VIF"].iloc[-1] >= 15:
    # drop the feature with the largest VIF, then recompute
    worst = vif_data.sort_values("VIF")["feature"].iloc[-1]
    cols_removed_vif.append(worst)
    columns_after_vif = [x for x in X_1_transformed.columns if x not in cols_removed_vif]
  else:
    break


In [23]:
len(columns_after_vif)

93

In [24]:
len(cols_removed_vif)

0

In [25]:
# cols_removed_vif.extend([x for x in vif_data.sort_values('VIF')['feature'][-1:] if x not in cols_removed_vif])
# cols_removed_vif

In [26]:
vif_final=X_1[[x for x in X_1.columns if x not in cols_removed_vif]]

In [27]:
vif_final.shape

(6819, 93)
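The loop in In [22] compares only the last entry of the unsorted VIF column with the threshold, and the recorded lengths show 93 columns kept and none removed. A variant that tests the largest VIF on each pass could look like the sketch below. It is an illustration under assumptions, not the notebook's code: VIF is computed here with plain NumPy least squares including an intercept (instead of the statsmodels `variance_inflation_factor` call), and the function names and demo data are hypothetical. The threshold of 15 is taken from the cell above.

```python
import numpy as np
import pandas as pd

def vif_scores(frame):
    """VIF per column: 1 / (1 - R^2), where R^2 comes from regressing the
    column on all other columns plus an intercept."""
    values = frame.to_numpy(dtype=float)
    scores = {}
    for idx, name in enumerate(frame.columns):
        target = values[:, idx]
        others = np.delete(values, idx, axis=1)
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        if ss_tot == 0.0:
            scores[name] = np.inf  # constant column: treat as fully redundant
            continue
        r2 = 1.0 - ss_res / ss_tot
        scores[name] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(scores)

def drop_high_vif(frame, threshold=15.0):
    """Repeatedly remove the column with the largest VIF until all are below threshold."""
    kept = list(frame.columns)
    removed = []
    while len(kept) > 1:
        scores = vif_scores(frame[kept])
        worst = scores.idxmax()
        if scores[worst] < threshold:
            break
        kept.remove(worst)
        removed.append(worst)
    return kept, removed

# Hypothetical demo: "a_plus_noise" is almost a copy of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
demo = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.01 * rng.normal(size=500)})
kept, removed = drop_high_vif(demo)
print(kept, removed)
```

Running such a variant on the real data would change which columns reach the later steps, so the outputs recorded in this notebook would not carry over.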

## **2. Using p_values**

In [28]:
skewness=PowerTransformer(standardize=True)
vif_final_transformed=pd.DataFrame(skewness.fit_transform(vif_final), columns=vif_final.columns)

In [29]:
vif_final_GLM=vif_final_transformed.copy()
# Rename columns to x0, x1, ... so they are valid names in the formula
# (set_axis(..., inplace=True) is no longer available in pandas 2.x).
vif_final_GLM.columns=['x'+str(i) for i in range(len(vif_final.columns))]
formula = "Bankrupt ~ " + ' + '.join(vif_final_GLM.columns)
vif_final_GLM['Bankrupt']=y

In [30]:
model = sm.GLM.from_formula(formula, family=sm.families.Binomial(), data=vif_final_GLM)
result = model.fit()
#result.summary() 

In [31]:
p_val=pd.DataFrame(zip(vif_final_transformed.columns,result.pvalues[1:]),columns=['column','p_values'])
columns_based_on_p_and_vif=[x for x in p_val[p_val.p_values<=0.05]['column']]
len(columns_based_on_p_and_vif)

20

## **3. Using L1 Regularization.**

In [32]:
skewness=PowerTransformer(standardize=False)
vif_final=pd.DataFrame(skewness.fit_transform(vif_final), columns=vif_final.columns)

In [33]:
X_train, X_test, y_train, y_test=train_test_split(vif_final,y,test_size=0.25,stratify=y, random_state=43)

In [34]:
X_train.shape

(5114, 93)

In [35]:
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
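In [32] fits the `PowerTransformer` on every row before the split in In [33], so the test rows influence the transform. One common way to keep test data out of preprocessing is to put the transform and the model in a single `Pipeline` fitted on the training split only. The sketch below uses synthetic data from `make_classification` as a stand-in for the bankruptcy table; the variable names and the `max_iter` value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer

# Synthetic, imbalanced stand-in data (not the bankruptcy dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=43
)

pipe = Pipeline([
    # standardize=True makes the transformer output zero-mean, unit-variance features
    ("power", PowerTransformer(standardize=True)),
    ("clf", LogisticRegression(penalty="l1", solver="saga",
                               class_weight="balanced", C=0.01, max_iter=5000)),
])
pipe.fit(X_tr, y_tr)          # transform parameters are learned from X_tr only
test_pred = pipe.predict(X_te)  # X_te is transformed with those parameters
print(test_pred.shape)
```

The same pipeline object can be passed to `GridSearchCV`, in which case the transform is refitted inside each fold.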

In [36]:
lr=LogisticRegression(penalty='l1',solver='saga',class_weight='balanced',C=0.01)
lr.fit(X_train,y_train)
y_train_pred=lr.predict(X_train)
y_test_pred=lr.predict(X_test)
print(confusion_matrix(y_train,y_train_pred))
print(classification_report(y_train,y_train_pred))
print(confusion_matrix(y_test,y_test_pred))
print(classification_report(y_test,y_test_pred))

[[4226  723]
 [  18  147]]
              precision    recall  f1-score   support

           0       1.00      0.85      0.92      4949
           1       0.17      0.89      0.28       165

    accuracy                           0.86      5114
   macro avg       0.58      0.87      0.60      5114
weighted avg       0.97      0.86      0.90      5114

[[1406  244]
 [   3   52]]
              precision    recall  f1-score   support

           0       1.00      0.85      0.92      1650
           1       0.18      0.95      0.30        55

    accuracy                           0.86      1705
   macro avg       0.59      0.90      0.61      1705
weighted avg       0.97      0.86      0.90      1705



In [37]:
lr.coef_

array([[-0.02479326, -0.01104881, -0.18707185,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        , -0.01453703,
         0.        ,  0.        ,  0.        , -0.38580146,  0.        ,
        -0.01651382,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        , -0.01389611,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.13551777,  0.22841157, -0.24995479,  0.        ,  0.        ,
         0.        ,  0.        , -0.08982527,  0.        , -0.1610681 ,
        -0.02900015,  0.        ,  0.        ,  0.0011994 ,  0.        ,
         0.        ,  0.        ,  0.21738796,  0.        ,  0.        ,
         0.        , -0.18611409,  0.        ,  0.02480708,  0.13770406,
         0.        , -0.03098199,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        , -0.1337936 ,  0. 

In [38]:
coef_df=pd.DataFrame(zip(vif_final[columns_based_on_p_and_vif].columns,lr.coef_.reshape(-1,1)),columns=['column','coeff'])
columns_after_vif_l1=[x for x in coef_df[coef_df.coeff!=0]['column']]
len(columns_after_vif_l1)

5
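Note that `lr` in In [36] was fitted on all 93 columns, while In [38] zips its coefficients with the 20 names in `columns_based_on_p_and_vif`; `zip` stops at the shorter sequence, so names and coefficients are paired by position rather than by feature. A sketch of pairing coefficients with the columns the model was actually fitted on is shown below, using synthetic data; the feature names and the value `C=0.05` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: only f0 carries signal.
rng = np.random.default_rng(0)
feature_names = ["f0", "f1", "f2", "f3"]
X_demo = rng.normal(size=(400, 4))
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=400) > 0).astype(int)

model = LogisticRegression(penalty="l1", solver="saga", C=0.05,
                           max_iter=5000, random_state=0)
model.fit(X_demo, y_demo)

# Index the coefficients by the same column order that was passed to fit().
coef = pd.Series(model.coef_.ravel(), index=feature_names)
selected = coef[coef != 0].index.tolist()
print(coef)
print(selected)
```

In the notebook this would mean saving `vif_final.columns` before `StandardScaler` turns `X_train` into a NumPy array, and using that list as the index.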

In [39]:
roc_auc_score(y_test,y_test_pred)

0.8987878787878787
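Because In [39] passes hard 0/1 predictions to `roc_auc_score`, the result is the area under a ROC curve with a single operating point, which equals the mean of sensitivity and specificity. The sketch below checks this on synthetic data and reproduces the reported value from the test confusion matrix in In [36]; the synthetic scores and the 0.65 cut-off are assumptions.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Synthetic labels and scores (positives score higher on average).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
scores = 0.3 * y_true + rng.random(1000)
hard = (scores > 0.65).astype(int)

auc_from_labels = roc_auc_score(y_true, hard)    # single threshold
auc_from_scores = roc_auc_score(y_true, scores)  # all thresholds
balanced_acc = balanced_accuracy_score(y_true, hard)
print(auc_from_labels, balanced_acc, auc_from_scores)

# Test confusion matrix reported in In [36]: [[1406, 244], [3, 52]]
tn, fp, fn, tp = 1406, 244, 3, 52
reported = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
print(reported)
```

To score the ranking quality of the model rather than one threshold, `lr.predict_proba(X_test)[:, 1]` or `lr.decision_function(X_test)` could be passed instead of `y_test_pred`.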

# **Logistic Regression**

In [45]:
#[columns_based_on_p]
#[coef_columns]

In [46]:
skewness=PowerTransformer(standardize=True)
X_1=pd.DataFrame(skewness.fit_transform(X_1), columns=X_1.columns)

In [47]:
X_train, X_test, y_train, y_test=train_test_split(X_1[zq],y,test_size=0.25, stratify=y, random_state=99)


In [48]:
X_train.shape

(5114, 31)

In [49]:
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

In [50]:
# #importing and fitting the data to the PCA
# from sklearn.decomposition import PCA
# pca = PCA(n_components=20)
# pca.fit(X_train)
# # transforming the principal components to the original values
# X_train=pca.transform(X_train)
# X_test=pca.transform(X_test)

In [51]:
lr=LogisticRegression(class_weight='balanced',C=1)
lr.fit(X_train,y_train)
y_train_pred=lr.predict(X_train)
y_test_pred=lr.predict(X_test)
print(confusion_matrix(y_train,y_train_pred))
print(classification_report(y_train,y_train_pred))
print(confusion_matrix(y_test,y_test_pred))
print(classification_report(y_test,y_test_pred))

[[4187  762]
 [  20  145]]
              precision    recall  f1-score   support

           0       1.00      0.85      0.91      4949
           1       0.16      0.88      0.27       165

    accuracy                           0.85      5114
   macro avg       0.58      0.86      0.59      5114
weighted avg       0.97      0.85      0.89      5114

[[1423  227]
 [   9   46]]
              precision    recall  f1-score   support

           0       0.99      0.86      0.92      1650
           1       0.17      0.84      0.28        55

    accuracy                           0.86      1705
   macro avg       0.58      0.85      0.60      1705
weighted avg       0.97      0.86      0.90      1705



In [52]:
lr1=LogisticRegression(class_weight='balanced')
lr_cv=GridSearchCV(estimator=lr1, param_grid={'C':[1,0.5,0.1,0.05,0.01,0.08]},scoring='f1')
lr_cv.fit(X_train,y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight='balanced',
                                          dual=False, fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1, 0.5, 0.1, 0.05, 0.01, 0.08]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1', verbose=0)

In [53]:
lr_cv.best_params_

{'C': 0.1}
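`best_params_` reports only the winner. Looking at the mean and spread of the F1 score for every candidate shows how close the alternatives are. The sketch below does this on synthetic data with the same grid of `C` values; the explicit `StratifiedKFold` with a fixed seed and the `max_iter` setting are assumptions added for reproducibility.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (not the bankruptcy dataset).
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=12, weights=[0.9, 0.1], random_state=0
)

grid = GridSearchCV(
    estimator=LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [1, 0.5, 0.1, 0.08, 0.05, 0.01]},
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X_demo, y_demo)

# One row per candidate, best first.
summary = (
    pd.DataFrame(grid.cv_results_)[
        ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(summary)
```

If the standard deviations overlap heavily, the choice of `C` within this grid matters little for F1.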

In [54]:
# coef_df=pd.DataFrame(zip(vif_final_transformed.columns,lr.coef_.reshape(-1,1)),columns=['column','coeff'])
# coef_columns=[x for x in coef_df[coef_df.coeff>0]['column']]

In [55]:
# coef_columns

## 2. SVM

In [56]:
X_train.shape

(5114, 31)

In [57]:
svm_mod=SVC(kernel='poly',C=0.1, class_weight='balanced',gamma=0.08)
svm_mod.fit(X_train,y_train)
y_train_pred_svm=svm_mod.predict(X_train)
y_test_pred_svm=svm_mod.predict(X_test)

In [58]:
print(confusion_matrix(y_train,y_train_pred_svm))
print(classification_report(y_train,y_train_pred_svm))
print(confusion_matrix(y_test,y_test_pred_svm))
print(classification_report(y_test,y_test_pred_svm))

[[4665  284]
 [   6  159]]
              precision    recall  f1-score   support

           0       1.00      0.94      0.97      4949
           1       0.36      0.96      0.52       165

    accuracy                           0.94      5114
   macro avg       0.68      0.95      0.75      5114
weighted avg       0.98      0.94      0.96      5114

[[1559   91]
 [  30   25]]
              precision    recall  f1-score   support

           0       0.98      0.94      0.96      1650
           1       0.22      0.45      0.29        55

    accuracy                           0.93      1705
   macro avg       0.60      0.70      0.63      1705
weighted avg       0.96      0.93      0.94      1705



In [59]:
# oversample = SMOTE()
# X_train, y_train = oversample.fit_resample(X_train, y_train)

In [60]:
estimator = []

estimator.append(('LR', LogisticRegression(C=0.1,class_weight='balanced')))
estimator.append(('Naive', GaussianNB()))
estimator.append(('SVC1', SVC(kernel='linear',C=0.11, class_weight='balanced',gamma=0.01, probability=True)))
estimator.append(('RFC1', RandomForestClassifier(n_estimators=500, criterion = 'gini', class_weight='balanced')))

  
# Voting Classifier with hard voting
vot_hard = VotingClassifier(estimators = estimator, voting ='hard')
vot_hard.fit(X_train, y_train)
y_pred = vot_hard.predict(X_test)
  
# accuracy is stored in `score` but not printed; the classification report is what is shown
score = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
  
# Voting Classifier with soft voting
vot_soft = VotingClassifier(estimators = estimator, voting ='soft')
vot_soft.fit(X_train, y_train)
y_pred = vot_soft.predict(X_test)
  
# accuracy is stored in `score` but not printed; the classification report is what is shown
score = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.92      0.95      1650
           1       0.19      0.53      0.28        55

    accuracy                           0.91      1705
   macro avg       0.59      0.73      0.61      1705
weighted avg       0.96      0.91      0.93      1705

              precision    recall  f1-score   support

           0       0.98      0.97      0.98      1650
           1       0.37      0.47      0.42        55

    accuracy                           0.96      1705
   macro avg       0.68      0.72      0.70      1705
weighted avg       0.96      0.96      0.96      1705
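The two reports differ because hard voting counts predicted labels while soft voting averages predicted probabilities. The toy example below illustrates the mechanics with hypothetical positive-class probabilities from three models; it assumes each model's label is its probability thresholded at 0.5, which is a simplification (an `SVC`, for instance, predicts from its decision function).

```python
import numpy as np

# Hypothetical positive-class probabilities: rows are samples, columns are models.
proba = np.array([
    [0.90, 0.40, 0.45],
    [0.55, 0.60, 0.10],
    [0.20, 0.30, 0.45],
    [0.51, 0.52, 0.05],
])

# Hard voting: each model casts a 0/1 vote, the majority wins.
votes = (proba >= 0.5).astype(int)
hard = (votes.sum(axis=1) > votes.shape[1] / 2).astype(int)

# Soft voting: average the probabilities, then pick the more likely class.
soft = (proba.mean(axis=1) > 0.5).astype(int)

print(hard)  # [0 1 0 1]
print(soft)  # [1 0 0 0]
```

With four estimators, as in the cell above, hard voting can tie at two votes each; scikit-learn's `VotingClassifier` resolves ties in favour of the class that comes first in ascending order, here class 0.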



In [61]:
forest = RandomForestClassifier(class_weight={0:1,1:18})
balbag = BalancedBaggingClassifier(base_estimator = forest, n_estimators = 100, bootstrap = False,  bootstrap_features= True,
                                  replacement = False, random_state = 5)
model_full_sample = balbag.fit(X_train, y_train)

In [62]:
y_pred=model_full_sample.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[1447  203]
 [  13   42]]
              precision    recall  f1-score   support

           0       0.99      0.88      0.93      1650
           1       0.17      0.76      0.28        55

    accuracy                           0.87      1705
   macro avg       0.58      0.82      0.61      1705
weighted avg       0.96      0.87      0.91      1705
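The class-1 figures in the report follow directly from the confusion matrix above. The short calculation below reproduces them, which makes the trade-off concrete: 245 firms are flagged as bankrupt, of which 42 actually are, while 13 of the 55 true bankruptcies are missed.

```python
# Confusion matrix from In [62]: [[1447, 203], [13, 42]]
tn, fp, fn, tp = 1447, 203, 13, 42

precision = tp / (tp + fp)                           # 42 / 245
recall = tp / (tp + fn)                              # 42 / 55
f1 = 2 * precision * recall / (precision + recall)   # equals 2*tp / (2*tp + fp + fn) = 84 / 300
accuracy = (tp + tn) / (tp + tn + fp + fn)           # 1489 / 1705

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# 0.17 0.76 0.28 0.87
```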

