 <center>
    <h1> Bankruptcy Prediction</h1>
</center>

# Introduction
Predicting a company's bankruptcy well in advance is of great importance to its stakeholders and helps in economic decision-making. The purpose of our research is to study the suitability of major bankruptcy prediction models by applying them to the dataset provided to us. Our attempts at bankruptcy prediction are based on various machine learning models such as Decision Trees, Bagging, Random Forest, AdaBoost, and XGBoost. This report is an empirical study of bankruptcy prediction on that dataset, focusing mainly on tackling class imbalance and comparing the different methods.



# Dataset

The dataset is about bankruptcy prediction of Polish companies. The data was collected from Emerging Markets Information Service (EMIS, <a href="https://www.emis.com/">Web_Link</a>), which is a database containing information on emerging markets around the world. The bankrupt companies were analyzed in the period 2000-2012, while the still operating companies were evaluated from 2007 to 2013.

# Source

https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data



Creator: Sebastian Tomczak
-- Department of Operations Research, Wrocław University of Science and Technology, wybrzeże Wyspiańskiego 27, 50-370, Wrocław, Poland

Donor: Sebastian Tomczak (sebastian.tomczak '@' pwr.edu.pl), Maciej Zieba (maciej.zieba '@' pwr.edu.pl), Jakub M. Tomczak (jakub.tomczak '@' pwr.edu.pl), Tel. (+48) 71 320 44 53


### Citation Request:

Zieba, M., Tomczak, S. K., & Tomczak, J. M. (2016). Ensemble Boosted Trees with Synthetic Features Generation in Application to Bankruptcy Prediction. Expert Systems with Applications.

### Feature Description

Attribute Information:

- X1 net profit / total assets
- X2 total liabilities / total assets
- X3 working capital / total assets
- X4 current assets / short-term liabilities
- X5 [(cash + short-term securities + receivables - short-term liabilities) / (operating expenses - depreciation)] * 365
- X6 retained earnings / total assets
- X7 EBIT / total assets
- X8 book value of equity / total liabilities
- X9 sales / total assets
- X10 equity / total assets
- X11 (gross profit + extraordinary items + financial expenses) / total assets
- X12 gross profit / short-term liabilities
- X13 (gross profit + depreciation) / sales
- X14 (gross profit + interest) / total assets
- X15 (total liabilities * 365) / (gross profit + depreciation)
- X16 (gross profit + depreciation) / total liabilities
- X17 total assets / total liabilities
- X18 gross profit / total assets
- X19 gross profit / sales
- X20 (inventory * 365) / sales
- X21 sales (n) / sales (n-1)
- X22 profit on operating activities / total assets
- X23 net profit / sales
- X24 gross profit (in 3 years) / total assets
- X25 (equity - share capital) / total assets
- X26 (net profit + depreciation) / total liabilities
- X27 profit on operating activities / financial expenses
- X28 working capital / fixed assets
- X29 logarithm of total assets
- X30 (total liabilities - cash) / sales
- X31 (gross profit + interest) / sales
- X32 (current liabilities * 365) / cost of products sold
- X33 operating expenses / short-term liabilities
- X34 operating expenses / total liabilities
- X35 profit on sales / total assets
- X36 total sales / total assets
- X37 (current assets - inventories) / long-term liabilities
- X38 constant capital / total assets
- X39 profit on sales / sales
- X40 (current assets - inventory - receivables) / short-term liabilities
- X41 total liabilities / ((profit on operating activities + depreciation) * (12/365))
- X42 profit on operating activities / sales
- X43 rotation receivables + inventory turnover in days
- X44 (receivables * 365) / sales
- X45 net profit / inventory
- X46 (current assets - inventory) / short-term liabilities
- X47 (inventory * 365) / cost of products sold
- X48 EBITDA (profit on operating activities - depreciation) / total assets
- X49 EBITDA (profit on operating activities - depreciation) / sales
- X50 current assets / total liabilities
- X51 short-term liabilities / total assets
- X52 (short-term liabilities * 365) / cost of products sold
- X53 equity / fixed assets
- X54 constant capital / fixed assets
- X55 working capital
- X56 (sales - cost of products sold) / sales
- X57 (current assets - inventory - short-term liabilities) / (sales - gross profit - depreciation)
- X58 total costs / total sales
- X59 long-term liabilities / equity
- X60 sales / inventory
- X61 sales / receivables
- X62 (short-term liabilities *365) / sales
- X63 sales / short-term liabilities
- X64 sales / fixed assets

#### Learning outcome from this assignment

The expectation here is that you will read all the details of the problem statement, understand the concept, and also read further about this sector or domain. Researching the domain may give you interesting ideas for feature processing or engineering, which can help improve the overall performance of the model.

We also expect you to be able to apply the ML techniques that you have learned in this module, such as:
- KNN
- Decision Tree

In [24]:
import warnings
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import recall_score, f1_score

warnings.filterwarnings("ignore")

### Task 1

We have provided you with the dataset; please change your working directory and read the data:

* Use the paths below to access the Train and Test datasets:
    * Train set - '/home/datasets/lab/ML_HOT/ML_HOT_bankruptcy.csv'
    * Test set - '/home/datasets/lab/ML_HOT/ML_HOT_test_dataset.csv'

Please do not change the name of the dataframe in the section below; you need to use the exact name **datadf**.

In [13]:
### Please write your code here for Task - 1. Please remember that you need to write all your code in the cell/section only.
### Do not use multiple section to write this code.
# Remember to give the name of the dataframe as datadf
### BEGIN SOLUTION
datadf = pd.read_csv("/home/datasets/lab/ML_HOT/ML_HOT_bankruptcy.csv")
### END SOLUTION

In [14]:
""" Please do not delete this cell/section. It is there to validate your work for Task - 1.
    Please finish Task - 1 before this cell/section, else you will miss the grade point for Task - 1.
    Please remember to name the dataframe as datadf, else this test case will not work."""
### BEGIN HIDDEN TESTS
assert(datadf.shape == (43379, 65))
### END HIDDEN TESTS

### Task 2

- In this task we want to come up with a solution to replace or impute the NA values present in the data.
- You are free to choose any strategy you think is appropriate for the domain, or one based on your analysis of the problem.
- After you have completed this task, there should not be any NA values in the dataset.

In [15]:
""" Please write your code here for Task 2"""
### BEGIN SOLUTION
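# Count missing values per row before imputing, so that signal is kept as a feature
na_count = datadf.isnull().sum(axis=1)
# Drop Attr37 because more than 40% of its values are NA
datadf = datadf.drop("Attr37", axis=1)
datadf["na_count"] = na_count
# Fill the remaining missing values with 0
datadf = datadf.fillna(0)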
### END SOLUTION

In [16]:
""" Please do not delete this cell/section. It is there to validate your work for Task - 2.
    Please finish Task - 2 before this cell/section, else you will miss the grade point for Task - 2.
    Please do not change the name of the dataframe from datadf to anything else."""
### BEGIN HIDDEN TESTS
assert datadf.isnull().sum().sum() == 0
### END HIDDEN TESTS
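
The solution above fills every missing value with 0 after recording a per-row missing count. As one possible alternative, the sketch below uses median imputation with `SimpleImputer`, which is imported at the top of the notebook. The toy frame and its values are made up for illustration, and this is not the graded solution.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for datadf; the values are made up for illustration.
toy = pd.DataFrame({
    "Attr1": [0.1, np.nan, 0.3, 0.5],
    "Attr2": [1.0, 2.0, np.nan, 4.0],
})

# Record how many values were missing per row before imputing,
# mirroring the na_count feature used in the solution.
toy_na_count = toy.isnull().sum(axis=1)

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
toy_imputed["na_count"] = toy_na_count

print(toy_imputed)
```

When a train/test split is involved, the imputer would normally be fitted on the training rows only and then applied to the test rows, so that test statistics do not leak into training.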

### Task 3

### Feature Engineering and Data Preprocessing

The given data are various ratios used for analysing the financial health of a company. Apart from the given variables, some derived variables might be useful in predicting bankruptcy, while some of the features might not be as useful. So in this section we will first try to add some new features which we think might be useful in prediction. We will then build models on the new dataset with the added features.

- **Here we will show example of one such feature that you need to include to the already existing features**
<br>
<br> total_assets = 10^(X29)

After Task 3 we would expect a column called <b> total_assets </b> to be added to the dataframe <b> datadf </b>

In [17]:
""" Please write your code here for Task 3"""
### BEGIN SOLUTION
datadf["total_assets"] = 10**datadf.Attr29
### END SOLUTION

In [18]:
""" Please do not delete this cell/section. It is there to validate your work for Task - 3.
    Please finish Task - 3 before this cell/section, else you will miss the grade point for Task - 3.
    Please do not change the name of the dataframe from datadf to anything else."""
### BEGIN HIDDEN TESTS
assert ('total_assets' in datadf.columns)
### END HIDDEN TESTS

### Feature Engineering and Data Preprocessing


Apart from Task 3, you are also free to generate any new features which you think will help increase the predictive capability of your model.

In Task 3 we just show you one way to do that, and expect you to come up with some new features.
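
As a small worked example of deriving features from the ratios, the sketch below inverts the logarithm in X29 and then uses X2 (total liabilities / total assets) to recover an absolute amount. It assumes the logarithm is base 10, as the Task 3 formula implies; the `total_liabilities` column is a hypothetical extra feature, and the toy values are made up.

```python
import pandas as pd

# Toy rows; Attr29 is assumed to be log10(total assets), as Task 3 implies.
toy = pd.DataFrame({
    "Attr2": [0.5, 0.8],    # total liabilities / total assets
    "Attr29": [3.0, 4.0],   # logarithm of total assets
})

# Invert the logarithm to recover total assets (the Task 3 feature).
toy["total_assets"] = 10 ** toy["Attr29"]

# Hypothetical extra feature: absolute total liabilities, from the X2 ratio.
toy["total_liabilities"] = toy["Attr2"] * toy["total_assets"]

print(toy)
```

The same pattern applies to any ratio whose denominator is total assets, such as X1 or X3.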

### Task 4

Make a train/test split with test size 0.3 and store the result in X_train, X_test, y_train, y_test.

Please use the train and test dataset names given in the description of this task; if you use any other names, you will not get the grade points assigned to this task.

In [19]:
"""Write your code here for Task 4"""
### BEGIN SOLUTION
X = datadf.drop("class", axis=1)
y = datadf["class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 123, stratify = y)
### END SOLUTION

In [20]:
""" Please do not delete this cell/section. It is there to validate your work for Task - 4.
    Please finish Task - 4 before this cell/section, else you will miss the grade point for Task - 4.
    Please use the name of the variables as given in the description of the Task 4."""
### BEGIN HIDDEN TESTS
assert(X_train.shape[0] > 1)
assert(y_train.shape[0] > 1)
assert(X_test.shape[0] > 1)
assert(y_test.shape[0] > 1)

### END HIDDEN TESTS
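
The `stratify=y` argument in the solution matters because the classes are imbalanced. The sketch below, on made-up toy data with 10% positives, illustrates that a stratified split keeps the same positive rate in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 10% positives, made up for illustration.
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([1] * 10 + [0] * 90)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123, stratify=y_toy
)

# Both parts keep the 10% positive rate of the full data.
print("positive rate train:", y_tr.mean())
print("positive rate test :", y_te.mean())
```

Without stratification, a small test set could by chance contain very few bankrupt companies, which makes recall and F1 estimates unstable.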


### Task 5

Standardize the X_train and X_test dataframes and store the results in X_train_std and X_test_std.

Please use the train and test dataset names given in the description of this task; if you use any other names, you will not get the grade points assigned to this task.

In [21]:
"""Write your code here for Task 5"""
### BEGIN SOLUTION
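scaler = StandardScaler()
scaler.fit(X_train)  # learn mean and std from the training data only

# Rebuild DataFrames so column names and row indices are preserved;
# this also avoids writing floats into the integer na_count column in place
X_train_std = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_std = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)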

### END SOLUTION

In [22]:
""" Please do not delete this cell/section. It is there to validate your work for Task - 5.
    Please finish Task - 5 before this cell/section, else you will miss the grade point for Task - 5.
    Please use the name of the variables as given in the description of the Task 5."""

### BEGIN HIDDEN TESTS
assert(np.all(round(X_train_std.apply(np.mean)) == 0))
assert(np.all(round(X_train_std.apply(np.std)) == 1))
### END HIDDEN TESTS
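
To illustrate why the scaler is fitted on the training data only, the sketch below standardizes two made-up toy frames. The training columns end up with mean 0 and standard deviation 1, while the test frame is transformed with the training statistics and so generally does not.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/test frames, made up for illustration.
train_toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})
test_toy = pd.DataFrame({"a": [2.0, 4.0], "b": [30.0, 0.0]})

# Learn mean and standard deviation from the training data only.
toy_scaler = StandardScaler().fit(train_toy)

# Apply the same statistics to both sets, keeping column names and index.
train_toy_std = pd.DataFrame(
    toy_scaler.transform(train_toy), columns=train_toy.columns, index=train_toy.index
)
test_toy_std = pd.DataFrame(
    toy_scaler.transform(test_toy), columns=test_toy.columns, index=test_toy.index
)

print(train_toy_std.mean().round(6))
print(train_toy_std.std(ddof=0).round(6))
print(test_toy_std)
```

`StandardScaler` uses the population standard deviation, which is why `ddof=0` is passed to the pandas `std` call when checking the result.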


### Model Building

- We want you to build different models for this classification task.


### Task 6

- As part of this task please build a ML model using KNN.
- Please use the name of the model object as **knn_model**.
- You are free to choose any of the hyperparameters you want while building this model.
- You are free to use any hyperparameter tuning technique you want.
- Please use the name of the model as  **knn_model**  else you will not get the grade points for this task.

In [28]:
""" Please write your code here for Task 6"""
### BEGIN SOLUTION

knn_model= KNeighborsClassifier()
knn_model.fit(X_train_std,y_train)

## Predictions on train and test sets
pred_train = knn_model.predict(X=X_train_std)
pred_test = knn_model.predict(X=X_test_std)

## Evaluate function
def evaluate(y_train,y_test,pred_train,pred_test):
    print("F1 Score train :",f1_score(y_train,pred_train))
    print("F1 Score test  :",f1_score(y_test,pred_test))

    print("Recall train   :",recall_score(y_train,pred_train))
    print("Recall test    :",recall_score(y_test,pred_test))


## Evaluation
evaluate(y_train,y_test,pred_train,pred_test)

### END SOLUTION

F1 train: 0.2883569096844396
F1 test: 0.13178294573643412
Recall train 0.181631254283756
Recall test 0.08146964856230032


In [43]:
""" Please do not delete this cell/section. It is there to validate your work for Task - 6.
    This test case will check for the existence and validity of the KNN model which you have created.
    It expects the name of the model to be knn_model, else this test case will fail, even if you have a valid model
    with another name."""

## model
print(knn_model)

### BEGIN HIDDEN TESTS
assert (knn_model.predict(X_test_std).shape == y_test.shape)
### END HIDDEN TESTS

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
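
The task allows any hyperparameter tuning technique. One option, sketched below, is a cross-validated grid search scored with F1. The synthetic data stands in for `X_train_std` and `y_train`, and the grid values and the name `knn_tuned` are illustrative assumptions rather than settings tuned for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data standing in for X_train_std / y_train.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Hypothetical grid; the ranges are illustrative, not tuned for this dataset.
param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}

# Score each candidate with F1, the metric used to grade Task 8.
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="f1", cv=3)
search.fit(X_syn, y_syn)

print(search.best_params_)
knn_tuned = search.best_estimator_  # refitted on all of X_syn with the best settings
```

In the notebook, the same search could be run on `X_train_std` and `y_train`, with the best estimator assigned to `knn_model`.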


### Task 7

- As part of this task please build a ML model using Decision Trees.
- Please use the name of the model as **dectree_model**.
- You are free to choose any of the hyperparameters you want while building this model. 
- You are free to use any hyperparameter tuning technique you want.
- Please use the name of the model as  **dectree_model**  else you will not get the grade points for this task.

In [44]:
""" Please write your code here for Task 7"""
### BEGIN SOLUTION

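dectree_model = DecisionTreeClassifier()
# Fit on the unscaled features: trees do not need scaling, and the predictions
# below, the hidden test, and Task 8 all pass unscaled data to this model
dectree_model.fit(X_train,y_train)

## Predictions on train and test sets
pred_train = dectree_model.predict(X=X_train)
pred_test = dectree_model.predict(X=X_test)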

## Evaluate function
def evaluate(y_train,y_test,pred_train,pred_test):
    print("F1 Score train :",f1_score(y_train,pred_train))
    print("F1 Score test  :",f1_score(y_test,pred_test))

    print("Recall train   :",recall_score(y_train,pred_train))
    print("Recall test    :",recall_score(y_test,pred_test))


## Evaluation
evaluate(y_train,y_test,pred_train,pred_test)

### END SOLUTION

F1 Score train : 1.0
F1 Score test  : 0.4674329501915709
Recall train   : 1.0
Recall test    : 0.48722044728434505


In [45]:
""" Please do not delete this cell/section. It is there to validate your work for Task - 7.
    This test case will check for the existence and validity of the Decision Tree model which you have created.
    It expects the name of the model to be dectree_model, else this test case will fail, even if you have a valid model
    with another name."""

## model
print(dectree_model)

### BEGIN HIDDEN TESTS
assert (dectree_model.predict(X_test).shape == y_test.shape)
### END HIDDEN TESTS

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
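
The scores above show a train F1 of 1.0 against a much lower test F1, which suggests the fully grown tree overfits. A minimal sketch of one way to counter this follows: cap the tree depth, require a minimum leaf size, and weight the classes to account for imbalance. The synthetic data and the specific parameter values are assumptions for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the bankruptcy features.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=123, stratify=y_syn
)

# Hypothetical settings: a depth cap and a minimum leaf size limit overfitting,
# and class_weight="balanced" upweights the rare positive class.
tree = DecisionTreeClassifier(
    max_depth=5, min_samples_leaf=10, class_weight="balanced", random_state=0
)
tree.fit(X_tr, y_tr)

print("F1 train:", f1_score(y_tr, tree.predict(X_tr)))
print("F1 test :", f1_score(y_te, tree.predict(X_te)))
```

A smaller gap between the two printed scores indicates less overfitting; the values of `max_depth` and `min_samples_leaf` would normally be chosen by cross-validation.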


### Task - 8

We have another dataset called `test_dataset` where we have only the features, there will be no target variable in the `test_dataset`. 

We have the original target values with us, which we will compare with your `prediction`, to see the accuracies of your models.

**_Hint: Make sure that the test data is in the same format as your train._**

**Note: - Here we are using `f1-score` as the evaluation metric.**
      
There will be a cell with a test case to evaluate Task - 8; please make sure that you finish all of your coding for the prediction before that cell/section.

<p style="color:red">Please ensure that you use your <b>best model</b> to do the <b>prediction</b>, and create a <b>list</b> with your <b>prediction result</b>.
For example, if your <b>test_dataset</b> has <b>3</b> records, then your <b>prediction result</b> should be a <b>list</b> with <b>3</b> values, such as <b>[0,0,1]</b>.</p>

Make sure you name that list as **`pred_result_list`**.

In [51]:
""" Write your code here for Task 8"""

### BEGIN SOLUTION

## Reading the data
testdf = pd.read_csv("/home/datasets/lab/ML_HOT/ML_HOT_test_dataset.csv")

## Pre-processing
na_count = testdf.isnull().sum(axis=1)
testdf.drop("Attr37", axis=1, inplace = True)# Because more than 40% na value
testdf["na_count"] = na_count
testdf = testdf.fillna(0)
tot_assets = 10**testdf.Attr29
testdf["total_assets"] = tot_assets

## Predict
test_pred = dectree_model.predict(testdf)
pred_result_list = list(test_pred)

### END SOLUTION

Please make sure that you store the predictions obtained from the `test_dataset` in a list called **`pred_result_list`**; otherwise you will have problems in the auto-evaluation test case.

Please make sure that you do not sort or modify the order of the `test_dataset`, you should consider it as it is, and generate your prediction sequentially for each of the test record.
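
The hint about keeping the test data in the same format as the training data can be made concrete by putting the preprocessing steps into one function and applying it to both frames. The sketch below mirrors the Task 2 and Task 3 steps; the function name `preprocess` and the three-column toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def preprocess(df):
    """Apply the Task 2 and Task 3 steps to a copy of df."""
    out = df.copy()
    na_count = out.isnull().sum(axis=1)   # count missing values per row first
    out = out.drop(columns="Attr37")      # dropped in Task 2
    out["na_count"] = na_count
    out = out.fillna(0)
    out["total_assets"] = 10 ** out["Attr29"]
    return out

# Toy frames with only three attributes, made up for illustration.
train_toy = pd.DataFrame(
    {"Attr29": [3.0, np.nan], "Attr37": [np.nan, 1.0], "Attr40": [0.2, 0.4]}
)
test_toy = pd.DataFrame({"Attr29": [2.0], "Attr37": [np.nan], "Attr40": [np.nan]})

train_ready = preprocess(train_toy)
test_ready = preprocess(test_toy)

# The model expects the same columns in the same order at prediction time.
assert list(train_ready.columns) == list(test_ready.columns)
print(test_ready)
```

Because the function never sorts or drops rows, the predictions stay in the original order of the test records.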

In [52]:
""" Please do not delete this cell/section. It is there to validate your work for Task - 8.
    Please finish Task - 8 before this cell/section, else you will miss the grade point for this task.
    This test case is going to see if your f1 score is more than .5 or not"""
### BEGIN HIDDEN TESTS
hidden_list = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
score = f1_score(hidden_list, pred_result_list)
assert score >= 0.5, "f1 score less than 0.5"
### END HIDDEN TESTS

In [52]:
""" Please do not delete this cell/section. It is there to validate your work for Task - 8.
    Please finish Task - 8 before this cell/section, else you will miss the grade point for this task.
    This test case is going to see if your f1 score is more than .3 or not"""
### BEGIN HIDDEN TESTS
hidden_list = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
score = f1_score(hidden_list, pred_result_list)
assert score < 0.5, "f1 score greater than 0.5" 
assert score >= 0.3, "f1 score less than 0.3"
### END HIDDEN TESTS
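
For reference, the F1 score used in these checks is the harmonic mean of precision and recall on the positive (bankrupt) class. The sketch below computes it by hand on made-up labels and compares the result with `f1_score`; the labels and predictions are illustrative and unrelated to the hidden test data.

```python
from sklearn.metrics import f1_score

# Made-up labels and predictions: 1 marks a bankrupt company.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print("precision:", precision, "recall:", recall, "F1:", f1_manual)
print("sklearn F1:", f1_score(y_true, y_pred))
```

Note that true negatives do not appear in the formula, which is why F1 is more informative than accuracy when bankrupt companies are rare.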