# Building a Model to Predict Company Bankruptcy
Using the bankruptcy data from the Taiwan Economic Journal for the years 1999–2009, I'll create a model to predict whether or not a company will go bankrupt.

### Outline:
1. Import libraries and data
2. Check data for categorical data and null values
3. Check if classes are balanced (i.e., how many records do we have for each class?)
4. Build and compare algorithms with and without the SMOTE sampling technique

In [None]:
import pandas as pd
from sklearn.metrics import confusion_matrix, f1_score, recall_score, precision_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Import dataset
file_path = '/kaggle/input/company-bankruptcy-prediction/data.csv'
df = pd.read_csv(file_path)
df.head(10)

In [None]:
# Check datatypes of dataframe columns
df.dtypes.value_counts()

All features are numeric. Since there are 96 columns, I'll do a quick check to see if there are any null values in the entire dataset and then drill down by column if needed.

In [None]:
print('Total Null Values in Dataset: ',df.isna().sum().sum())
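
If the total were non-zero, the drill-down by column could look something like the sketch below. The `demo` frame and its column names are made up for illustration so the snippet runs on its own; they are not part of the bankruptcy data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values (illustrative only)
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                     'b': [np.nan, np.nan, 6.0],
                     'c': [7.0, 8.0, 9.0]})

# Count nulls in each column
nulls_per_column = demo.isna().sum()
# Keep only the columns that actually contain nulls, largest count first
nulls_per_column = nulls_per_column[nulls_per_column > 0].sort_values(ascending=False)
print(nulls_per_column)
```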

Since there are no null values in the dataset, I can move on to check whether the two classes (bankrupt and not bankrupt) are balanced:

In [None]:
# Investigate count of classes
df['Bankrupt?'].value_counts()

The classes are imbalanced. I'll account for this when choosing algorithms and sampling techniques.
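
To put a number on the imbalance, the share of each class can be computed with `value_counts(normalize=True)`. This is a minimal sketch on a made-up label series (`labels` is hypothetical and its 95/5 split is not the real class ratio):

```python
import pandas as pd

# Hypothetical labels for illustration only: 1 = bankrupt, 0 = solvent
labels = pd.Series([0] * 95 + [1] * 5, name='Bankrupt?')

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction of records per class
imbalance_ratio = counts.max() / counts.min()  # majority records per minority record

print(shares)
print('Imbalance ratio:', imbalance_ratio)
```

On the real data, the same two calls would run on `df['Bankrupt?']`.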

I'll define a function to output the metrics I want to see when comparing models:

In [None]:
def ModelPerformanceMetrics(Y_test,Y_pred):
    """Print the confusion matrix, precision, recall, accuracy and F1 score."""
    # Rows of the confusion matrix are actual classes, columns are predicted classes
    cf = confusion_matrix(Y_test,Y_pred)
    precision = precision_score(Y_test,Y_pred)
    recall = recall_score(Y_test,Y_pred)
    accuracy = accuracy_score(Y_test,Y_pred)
    fscore = f1_score(Y_test,Y_pred)
    print(cf)
    print('Precision: {} \nRecall: {} \nAccuracy: {} \nFScore: {}'
          .format(round(precision,2),round(recall,2),round(accuracy,2),round(fscore,2)))
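
Accuracy alone can look good on imbalanced data even when a model never finds the minority class. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical, not model output from this notebook) and a "model" that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical test labels: 95 solvent (0) and 5 bankrupt (1) companies
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 instead of warning when there are no positive predictions
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)

print('Accuracy:', accuracy, 'Precision:', precision, 'Recall:', recall)
```

Accuracy is high because most records are solvent, while precision and recall show that no bankrupt company was identified.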

As a baseline, I'll fit Logistic Regression on the data as-is. The metrics make it clear that Logistic Regression does not perform well on this dataset.

In [None]:
# Build a logistic regression model
X = df.drop(['Bankrupt?'], axis = 1)
Y = df['Bankrupt?']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

Classifier = LogisticRegression(max_iter = 1000)
Classifier = Classifier.fit(X_train,Y_train)
Y_pred = Classifier.predict(X_test)

ModelPerformanceMetrics(Y_test,Y_pred)

This also illustrates why it is necessary to consider a variety of metrics when evaluating a model. Next, I'll try oversampling: duplicating the records of the less prevalent class (bankrupt companies) until there are nearly as many of them as, but no more than, records of the more prevalent class (solvent companies):

In [None]:
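# Oversample within the training data only, so the test set stays untouched
training_set = pd.concat([X_train,Y_train], axis = 1)
# Filter on the training set's own label column rather than on df
df_bankrupt = training_set.loc[training_set['Bankrupt?'] == 1]
df_solvent = training_set.loc[training_set['Bankrupt?'] == 0]
# Number of whole copies of the bankrupt records that fit into the solvent count
multiplier = len(df_solvent)//len(df_bankrupt)
df_bankrupt_boosted = pd.concat([df_bankrupt]*multiplier, ignore_index = True)
df_oversampled = pd.concat([df_bankrupt_boosted,df_solvent], ignore_index = True)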

X_train = df_oversampled.drop(['Bankrupt?'], axis = 1)
Y_train = df_oversampled['Bankrupt?']
Classifier = LogisticRegression(max_iter = 1000)
Classifier = Classifier.fit(X_train,Y_train)
Y_pred = Classifier.predict(X_test)

ModelPerformanceMetrics(Y_test,Y_pred)

Oversampling did better than the baseline logistic regression, but it is still not a useful model. I'll try undersampling, by removing a large number of records from the more prevalent class:

In [None]:
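# Keep every bankrupt record and only the first 220 solvent records (not a random sample)
# Note: the split below happens after undersampling, so the test set is balanced too
df_bankrupt = df.loc[df['Bankrupt?'] == 1]
df_solvent = df.loc[df['Bankrupt?'] == 0].iloc[0:220,:]
df_undersampled = pd.concat([df_bankrupt,df_solvent], ignore_index = True)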

X = df_undersampled.drop(['Bankrupt?'], axis = 1)
Y = df_undersampled['Bankrupt?']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
Classifier = LogisticRegression(max_iter = 1000)
Classifier = Classifier.fit(X_train,Y_train)
Y_pred = Classifier.predict(X_test)

ModelPerformanceMetrics(Y_test,Y_pred)
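
The cell above keeps the first 220 solvent rows and builds its test set from the undersampled data. A possible variation is to split first and then randomly undersample only the training portion, so that the test set keeps the original class ratio. The sketch below assumes synthetic data from `make_classification` as a stand-in for the bankruptcy data; the column name `target` and the variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real dataset (illustrative only)
X_arr, y_arr = make_classification(n_samples=1000, n_features=10,
                                   weights=[0.95, 0.05], random_state=0)
X = pd.DataFrame(X_arr)
Y = pd.Series(y_arr, name='target')

# Split first; stratify keeps the class ratio the same in both parts
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=0)

# Randomly undersample the majority class in the training data only
train = pd.concat([X_train, Y_train], axis=1)
minority = train[train['target'] == 1]
majority = train[train['target'] == 0].sample(n=len(minority), random_state=0)
# Shuffle so the two classes are not stored in separate blocks
train_balanced = pd.concat([minority, majority]).sample(frac=1, random_state=0)

X_train_bal = train_balanced.drop(columns='target')
Y_train_bal = train_balanced['target']

print(Y_train_bal.value_counts())   # balanced training labels
print(Y_test.value_counts())        # test labels keep the original imbalance
```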

Undersampling improved performance, but still not enough to be useful. I'll see if Random Forest can handle the imbalanced classes:

In [None]:
X = df.drop(['Bankrupt?'], axis = 1)
Y = df['Bankrupt?']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

Classifier = RandomForestClassifier(n_estimators = 1000)
Classifier = Classifier.fit(X_train,Y_train)
Y_pred = Classifier.predict(X_test)

ModelPerformanceMetrics(Y_test,Y_pred)

Random Forest isn't useful either.

I'll try the XGBoost algorithm:

In [None]:
X = df.drop(['Bankrupt?'], axis = 1)
Y = df['Bankrupt?']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
Classifier = XGBClassifier(use_label_encoder=False)
Classifier = Classifier.fit(X_train, Y_train)
Y_pred = Classifier.predict(X_test)

ModelPerformanceMetrics(Y_test,Y_pred)

XGBoost didn't perform very well here. I'll try the SMOTE sampling technique and Logistic Regression next:

In [None]:
X = df.drop(['Bankrupt?'], axis = 1)
Y = df['Bankrupt?']
sm = SMOTE(sampling_strategy = 'auto', k_neighbors = 5, random_state = 0)
X_smote, Y_smote = sm.fit_resample(X, Y)
X_train, X_test, Y_train, Y_test = train_test_split(X_smote, Y_smote, test_size = 0.2, random_state = 0)

Classifier = LogisticRegression(max_iter = 1000)
Classifier = Classifier.fit(X_train,Y_train)
Y_pred = Classifier.predict(X_test)

ModelPerformanceMetrics(Y_test,Y_pred)

SMOTE with Logistic Regression performed better than undersampling, which had been the best option so far. I'll combine SMOTE and XGBoost to see if that helps:

In [None]:
X = df.drop(['Bankrupt?'], axis = 1)
Y = df['Bankrupt?']
sm = SMOTE(sampling_strategy = 'auto', k_neighbors = 5, random_state = 0)
X_smote, Y_smote = sm.fit_resample(X, Y)
X_train, X_test, Y_train, Y_test = train_test_split(X_smote, Y_smote, test_size = 0.2, random_state = 0)

Classifier = XGBClassifier(use_label_encoder=False)
Classifier = Classifier.fit(X_train, Y_train)
Y_pred = Classifier.predict(X_test)

ModelPerformanceMetrics(Y_test,Y_pred)

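The SMOTE technique used with the XGBoost algorithm performed exceptionally well, with precision of 98%, recall close to 100%, and an F-Score of 99%. These scores are measured on a test set drawn from the SMOTE-resampled data, so they describe performance on a balanced mix of real and synthetic records rather than on the original class distribution.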