# Predicting Company Bankruptcy using Machine Learning
By CS3244 Project Team 11  
Cao Han  |  Chin Sek Yi  |  Lim Kai Sin  |  Luo Xinming  

## Table of Contents
* [Chapter 1: Project Overview](#chapter1)
    * [1.1 Project Motivation](#section1_1)
    * [1.2 Dataset Description](#section1_2)
    * [1.3 Methodologies](#section1_3)
* [Chapter 2: Data Preparation](#chapter2)
    * [2.1 Data Collection](#section2_1)
    * [2.2 Data Exploration (EDA)](#section2_2)
    * [2.3 Data Pre-Processing](#section2_3)
* [Chapter 3: Modelling](#chapter3)
    * [3.1 Evaluation Metrics](#section3_1)
    * [3.2 Machine Learning Models](#section3_2)
    * [3.3 Our Modelling Approach](#section3_3)

## Chapter 1: Project Overview <a id="chapter1"></a>

### 1.1 Project Motivation <a id="section1_1"></a>
In today's dynamic business landscape, exemplified by recent events such as the collapse of Silicon Valley Bank, the ability to anticipate and mitigate financial risk is crucial for sustainable growth and stability. This project aims to develop a robust model for predicting company bankruptcy, using machine learning algorithms and financial data analysis, so that stakeholders can assess financial risk with greater confidence.

### 1.2 Dataset Description <a class="anchor" id="section1_2"></a>
The dataset used in this project concerns bankruptcy prediction for Polish companies. It contains financial rates from one year and a corresponding class label that indicates bankruptcy status 3 years after that year. The data contains 10503 instances (financial statements): 495 represent bankrupt companies and 10008 represent companies still operating at the end of the 3-year forecasting period.

The data was collected from [Emerging Markets Information Service](http://www.securities.com), which is a database containing information on emerging markets around the world. The bankrupt companies were analyzed in the period 2000-2012, while the still operating companies were evaluated from 2007 to 2013.  

Source: [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/365/polish+companies+bankruptcy+data)

### 1.3 Methodologies <a class="anchor" id="section1_3"></a>
This project is structured into five distinct steps:

1. **Data Collection**: This initial phase involves setting up a web scraper to automate the acquisition and transformation of the desired dataset. The process is designed to systematically download and convert data, streamlining subsequent analysis.

2. **Exploratory Data Analysis (EDA)**: In this step, various visualization tools such as histograms and heatmaps are employed to explore the data distribution and examine the relationships between features. This analysis helps identify patterns and insights that inform further data handling strategies.

3. **Data Pre-processing**: During this stage, missing values are addressed through mean imputation, ensuring no data is unnecessarily discarded. Additionally, numerical values are standardized to neutralize disparities in scale among features, which benefits scale-sensitive models. To mitigate the effects of multicollinearity, features that show high correlation with others are selectively removed. As the dataset is highly imbalanced, the Synthetic Minority Over-sampling Technique (SMOTE) is used to generate instances of the minority class for training.

4. **Modeling**: Several well-established machine learning algorithms, including Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Decision Trees, are used to build models that predict company bankruptcy. Logistic regression serves as a baseline for comparison. Ensemble techniques such as bagging and boosting are applied to improve robustness. Furthermore, neural networks are explored to assess whether they improve predictive performance.

5. **Evaluation and Comparison**: The final step evaluates and compares the performance of each model. The results are analyzed to identify the most effective model and to highlight the relative advantages of the approaches used.

This structured approach ensures a comprehensive analysis and robust development of predictive models, aiming to deliver reliable and actionable insights.
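
The pre-processing step (step 3) can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `drop_highly_correlated` and `preprocess`, the 0.9 correlation threshold, and the toy values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def drop_highly_correlated(df, threshold=0.9):
    """Drop one feature from every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair of features is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


def preprocess(features, threshold=0.9):
    """Mean-impute, standardize, then remove highly correlated features."""
    imputed = SimpleImputer(strategy="mean").fit_transform(features)
    scaled = StandardScaler().fit_transform(imputed)
    scaled_df = pd.DataFrame(scaled, columns=features.columns, index=features.index)
    return drop_highly_correlated(scaled_df, threshold)


# Toy data (hypothetical values): after imputation, Attr2 is exactly 2 * Attr1.
toy = pd.DataFrame({
    "Attr1": [1.0, 2.0, np.nan, 4.0, 5.0],
    "Attr2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "Attr3": [5.0, 1.0, 4.0, 2.0, 3.0],
})
clean = preprocess(toy)
print(clean.columns.tolist())
```

In a real experiment the imputer and scaler would normally be fitted on the training split only and then applied to the test split. SMOTE (for example `imblearn.over_sampling.SMOTE().fit_resample(X_train, y_train)`) would likewise be applied to the training split alone, so that the test set keeps the original class distribution.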

## Chapter 2: Data Preparation <a class="anchor" id="chapter2"></a>

### 2.1 Data Collection <a class="anchor" id="section2_1"></a>

In [13]:
import requests
import zipfile
import os
from scipy.io import arff
import pandas as pd

In [None]:
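download_url = "https://archive.ics.uci.edu/static/public/365/polish+companies+bankruptcy+data.zip"
response = requests.get(download_url, timeout=60)
response.raise_for_status()  # stop early if the download failed
zip_file_path = "downloaded_file.zip"

# Save the archive to disk so the ARFF files can be extracted from it.
with open(zip_file_path, "wb") as file:
    file.write(response.content)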

def extract_arff_from_zip(zip_path, arff_filename, extraction_path='.'):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extract(arff_filename, path=extraction_path)
        return os.path.join(extraction_path, arff_filename)

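def convert_arff_to_csv(arff_path, csv_path):
    """Load an ARFF file and save it as CSV, decoding byte-string columns."""
    data, meta = arff.loadarff(arff_path)
    df = pd.DataFrame(data)
    # Nominal ARFF attributes (such as the class label) are loaded as bytes.
    for col in df.select_dtypes([object]):
        if isinstance(df[col].iloc[0], bytes):
            df[col] = df[col].apply(lambda x: x.decode('utf-8'))
    df.to_csv(csv_path, index=False)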

for i in range(1, 6):
    arff_filename = f'{i}year.arff'
    csv_path = f'{i}year.csv'
    # extracted_arff_path = extract_arff_from_zip(zip_file_path, arff_filename)
    # convert_arff_to_csv(extracted_arff_path, csv_path)

### 2.2 Data Exploration (EDA) <a class="anchor" id="section2_2"></a>

### 2.3 Data Pre-Processing <a class="anchor" id="section2_3"></a>

## Chapter 3: Modelling <a class="anchor" id="chapter3"></a>

### 3.1 Evaluation Metrics  <a class="anchor" id="section3_1"></a>

Our primary task is to predict bankruptcy (a classification problem). Below are the key classification metrics that we will use to evaluate our model performance:

#### 3.1.1 Accuracy
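Accuracy is the ratio of correctly predicted observations to the total number of observations. It is most informative when the classes are balanced, as they are in the training data after SMOTE; on imbalanced data, a high accuracy can simply reflect the majority class.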

**Formula**:
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$
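
As a small worked example of why accuracy needs care on this dataset, the snippet below computes the accuracy of a trivial classifier that always predicts "not bankrupt", using the class counts quoted in Section 1.2. The variable names are illustrative only.

```python
# Class counts quoted in Section 1.2.
n_bankrupt = 495
n_operating = 10008
n_total = n_bankrupt + n_operating

# A trivial classifier that always predicts "not bankrupt" is right on every
# operating company and wrong on every bankrupt one.
baseline_accuracy = n_operating / n_total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any model evaluated on the original class distribution has to beat this baseline before its accuracy says anything about its ability to detect bankruptcies.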

#### 3.1.2 Confusion Matrix and Related Terms
The confusion matrix is a table layout that allows visualization of the performance of the algorithm, where each number in the matrix represents:
- **TP (True Positives)**: Correctly predicted positive observations.
- **TN (True Negatives)**: Correctly predicted negative observations.
- **FP (False Positives)**: Incorrectly predicted as positive.
- **FN (False Negatives)**: Incorrectly predicted as negative.
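
The four counts can be computed directly from the true and predicted labels. The sketch below uses a hypothetical helper, `confusion_counts`, and made-up labels in which 1 is assumed to mean "bankrupt" and 0 "still operating".

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


# Hypothetical labels: 1 = bankrupt, 0 = still operating.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

In practice `sklearn.metrics.confusion_matrix` gives the same counts arranged as a 2x2 array, with rows for the true class and columns for the predicted class.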

#### 3.1.3 Precision, Recall, and F1-Score
These metrics offer a deeper understanding of the model's performance by taking into account data imbalances, thereby providing a more nuanced view of the accuracy across different class labels.

- **Precision**:
$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
Precision measures the accuracy of positive predictions.

- **Recall** (or Sensitivity or TPR):
$$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
Recall measures the ability of a model to find all the relevant cases (all positive samples).

- **F1-Score**:
$$ \text{F1-Score} = 2 \times \left( \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \right) $$
The F1-Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. A high F1-score shows that a model can identify the positive class correctly without misclassifying many negative samples as positive.
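
The three formulas can be checked on a small set of counts. The helper `precision_recall_f1` and the counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
arithmetic_mean = (precision + recall) / 2
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"arithmetic mean={arithmetic_mean:.4f}")
```

Here precision is 0.75 and recall is 0.60, giving an F1 of about 0.667, slightly below the arithmetic mean of 0.675. The harmonic mean used by F1 is pulled toward the lower of the two values, so a model cannot reach a high F1 by doing well on only one of them.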

#### 3.1.4 Macro Average vs Micro Average

- **Macro Average**: Macro average calculates the metric independently for each class and then takes the unweighted average (hence treating all classes equally). It is useful when equal weight should be given to the performance of each class, regardless of its frequency.

- **Micro Average**: Micro average aggregates the contributions of all classes to compute the average metric. In other words, it computes the total TP, FP, and FN across all classes and then uses these totals to calculate the performance scores. Each sample counts equally, so every class contributes in proportion to its size; on imbalanced data this means the score is dominated by the majority class.
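
The difference between the two averages shows up clearly on a small imbalanced example. The sketch below uses scikit-learn's `f1_score` with its `average` parameter; the labels are made up for illustration.

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: eight negatives (0) and two positives (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# One negative is wrongly flagged and one positive is missed.
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Per-class F1: class 0 scores 0.875, class 1 scores 0.5.
per_class = f1_score(y_true, y_pred, average=None)
macro = f1_score(y_true, y_pred, average="macro")
micro = f1_score(y_true, y_pred, average="micro")

print("per-class F1:", per_class)
print(f"macro F1: {macro:.4f}")
print(f"micro F1: {micro:.4f}")
```

The macro average (0.6875) is pulled down by the weak minority-class score, while the micro average (0.8) equals the overall accuracy and mostly reflects the majority class.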


In [None]:
# import evaluation metrics
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)

### 3.2 Machine Learning Models <a class="anchor" id="section3_2"></a>

#### Scikit-learn (sklearn)
- **KNeighborsClassifier**: This model implements the k-nearest neighbors voting algorithm.
- **LogisticRegression**: A model that applies logistic regression for binary classification tasks.
- **SVC**: Support Vector Machine classifier known for its effectiveness in high-dimensional spaces.
- **DecisionTreeClassifier**: A model that uses a decision tree for classification, useful for interpretability.
- **GradientBoostingClassifier**: An ensemble model that builds on weak prediction models to create a strong classifier.
- **RandomForestClassifier**: A meta estimator that fits a number of decision tree classifiers to various sub-samples of the dataset and uses averaging to improve predictive accuracy and control overfitting.

#### Imbalanced-learn (imblearn)
Dealing with imbalanced data:

- **BalancedRandomForestClassifier**: A variation of the random forest that handles class imbalance by randomly under-sampling each bootstrap sample so that every tree is trained on balanced data.

#### XGBoost (xgboost)

- **XGBoost**: An implementation of gradient boosted decision trees designed for speed and performance.

In [15]:
# sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# imblearn
from imblearn.ensemble import BalancedRandomForestClassifier

# XGBoost
import xgboost as xgb

### 3.3 Our Modelling Approach <a class="anchor" id="section3_3"></a>

#### 3.3.1 Simple logistic regression
We begin our modeling with a simple logistic regression to establish a baseline for benchmark performance.
This approach provides an initial overview of how well simple models can predict outcomes based on our data.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# X_train, y_train, X_test and y_test are assumed to be produced by the
# pre-processing step (Section 2.3).
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Summarize baseline performance on the held-out test set.
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

#### 3.3.2 K-Nearest Neighbors (KNN)
We continue our modeling with the K-Nearest Neighbors (KNN) algorithm to compare its performance against the baseline established by logistic regression. KNN is a technique that predicts the class of a given point based on the majority vote of its nearest neighbors. KNN is useful for gaining insights into the dataset’s structure due to its reliance on feature similarity. 

In [None]:
# knn code here

Given the high dimensionality of our dataset, even after data cleaning, the KNN model might not perform optimally due to its known limitations with high-dimensional data. To address this issue, we apply Principal Component Analysis (PCA) to reduce the dimensionality. This step aims to enhance the performance of our KNN model by focusing on the most significant features and reducing the noise associated with less important variables.
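
A minimal sketch of KNN with PCA is shown below. It is not the project's code: it runs on synthetic data from `make_classification` as a stand-in for the bankruptcy features, and the choices of 5 neighbors and 95% retained variance are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 64 features, roughly 5% positive samples.
X, y = make_classification(n_samples=1000, n_features=64, n_informative=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Scale, project onto the components explaining 95% of the variance,
# then classify by majority vote among the 5 nearest neighbors.
knn_pca = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=5),
)
knn_pca.fit(X_train, y_train)
y_pred = knn_pca.predict(X_test)

n_components = knn_pca.named_steps["pca"].n_components_
print(f"PCA kept {n_components} of {X.shape[1]} components")
print(classification_report(y_test, y_pred, zero_division=0))
```

Wrapping the steps in a pipeline ensures that the scaler and PCA are fitted on the training split only, which avoids leaking information from the test set into the projection.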

In [None]:
# knn with pca code here

#### 3.3.3 Support Vector Machine (SVM)
We further our analysis by incorporating a Support Vector Machine (SVM) model to assess its effectiveness compared to the simpler models previously tested. SVM is a classification technique that finds the optimal hyperplane separating the data into different classes. It addresses the shortcomings of KNN, as it performs well in high-dimensional spaces and, with a suitable kernel, works well with non-linearly separable data.
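
To illustrate how the kernel lets an SVM handle non-linearly separable data, the sketch below compares a linear and an RBF kernel on scikit-learn's synthetic `make_moons` data. The dataset and settings are assumptions chosen for the illustration and are unrelated to the bankruptcy data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-circles: not separable by a straight line.
X, y = make_moons(n_samples=500, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):
    # Standardize first, since SVMs are sensitive to feature scale.
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(f"{kernel:>6} kernel test accuracy: {scores[kernel]:.3f}")
```

On data like this the RBF kernel would typically be expected to score higher, because it can bend the decision boundary around the two half-circles where a linear kernel cannot.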


In [None]:
# SVM code here

The Support Vector Machine (SVM) model has not performed as expected in our analysis for several reasons:

1. **Sensitivity to Outliers**: SVM is particularly sensitive to outliers. In our dataset, we retained outliers to capture distinctive characteristics of different companies with respect to their bankruptcy status. This sensitivity can lead to skewed decision boundaries, adversely affecting model performance. 

2. **Overfitting Issues**: In this analysis, the SVM model has shown a tendency to overfit. This issue stems from insufficient regularization. In scikit-learn's `SVC`, regularization strength is controlled by the parameter `C`, with larger values giving weaker regularization. Without proper regularization, SVM models can conform to the noise in the training data rather than capturing the general pattern, leading to poor generalization on new, unseen data.
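
The effect of regularization on overfitting can be explored by varying `C` in scikit-learn's `SVC`, where a larger `C` means weaker regularization. The sketch below is only an illustration on synthetic, deliberately noisy data; the three `C` values are arbitrary choices and not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with 10% of labels flipped to simulate noise.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

results = {}
for C in (0.01, 1.0, 100.0):
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C))
    svm.fit(X_train, y_train)
    # Compare training and test accuracy to see how far they diverge.
    results[C] = (svm.score(X_train, y_train), svm.score(X_test, y_test))
    print(f"C={C:<6} train={results[C][0]:.3f} test={results[C][1]:.3f}")
```

A large gap between training and test accuracy at high `C` is a sign of overfitting. A cross-validated search such as `GridSearchCV` over `C` (and `gamma` for the RBF kernel) is a common way to choose the regularization strength.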

## Chapter 5: Result Evaluation