# Optimizing Bankruptcy Prediction with K-NN, SMOTE, and Logistic Regression

## Introduction:

### Project Background:

Corporate bankruptcy is a major concern in economics and finance. Predicting it early can reduce risk and give stakeholders the information they need to make informed decisions. This project aims to use machine learning to predict the probability of bankruptcy from financial metrics. We pay particular attention to class imbalance, which is common in bankruptcy prediction because solvent companies far outnumber insolvent ones.

### Main Question:

Can we accurately predict the likelihood of a company facing bankruptcy by applying SMOTE for data balancing, KNN for classification, and logistic regression for identifying key financial indicators?

### Dataset Description:

The dataset for this project comes from Kaggle (https://www.kaggle.com/datasets/utkarshx27/american-companies-bankruptcy-prediction-dataset/data). It contains financial indicators for American public companies listed on the New York Stock Exchange and NASDAQ over a period of time. Its features include current assets, market value, inventory, and depreciation and amortization, among others. The target variable is binary and indicates whether a company went bankrupt or remained solvent within the time frame studied.

## Preliminary Exploratory Data Analysis:

### Read the dataset from the internet:

First, we install and load the ***tidyverse*** and ***tidymodels*** libraries, which we use to read and split the dataset:

In [None]:
install.packages("tidyverse")
install.packages("tidymodels")
library(tidyverse)
library(tidymodels)

Then we download the dataset from our GitHub repository (https://raw.githubusercontent.com/4ugenstern/DSCI-100-GroupProject/main/american_bankruptcy.csv) and split it into training and testing sets:

In [None]:
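# Download the CSV from the project repository and read it in.
url <- "https://raw.githubusercontent.com/4ugenstern/DSCI-100-GroupProject/main/american_bankruptcy.csv"
download.file(url, "data.csv")

raw_data <- read_csv("data.csv")

head(raw_data)

# Fix the random seed so the split is reproducible (the seed value is arbitrary).
set.seed(1234)

# 75/25 split, stratified so both sets keep the same class proportions.
bank_split <- initial_split(raw_data, prop = 0.75, strata = status_label)
bank_train <- training(bank_split)
bank_test <- testing(bank_split)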

### Clean and wrangle the dataset:

As the preview above shows, each row holds a single observation, each column holds a single variable, and each cell holds a single value. The dataset is therefore already in tidy format, and no further wrangling is needed.
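
A quick structural check can support the tidy claim. The sketch below uses a small stand-in data frame rather than the real dataset: apart from `status_label`, the column names (`company_name`, `year`, `X1`) and the helper `column_summary()` are assumptions made for illustration. It reports each column's type and missing-value count, then looks for repeated company-year pairs, which would suggest more than one observation per row key.

```r
# Toy stand-in for raw_data; all column names except status_label are hypothetical.
toy <- data.frame(
  company_name = c("C_1", "C_1", "C_2", "C_3"),
  status_label = c("alive", "alive", "failed", "alive"),
  year         = c(1999, 2000, 1999, 1999),
  X1           = c(511.3, 485.9, NA, 42.2)
)

# Hypothetical helper: one row per column, giving its type and number of missing values.
column_summary <- function(df) {
  data.frame(
    column    = names(df),
    type      = vapply(df, function(col) class(col)[1], character(1)),
    n_missing = vapply(df, function(col) sum(is.na(col)), integer(1)),
    row.names = NULL
  )
}

toy_summary <- column_summary(toy)
print(toy_summary)

# Count rows whose company-year pair already appeared earlier in the data.
n_duplicate_keys <- sum(duplicated(toy[, c("company_name", "year")]))
print(n_duplicate_keys)
```

The same helper could be applied to `raw_data` in place of `toy`, provided the key columns are adjusted to whatever the real dataset uses.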

### Target Column Distribution (in training set):

We use the *pie()* function to draw a pie chart of the target column, which shows that the dataset is heavily imbalanced. We may therefore need a technique such as *SMOTE*, which up-samples the minority class (*Bankrupt*) by synthesizing new examples rather than duplicating existing ones, to reduce the risk of overfitting.

In [None]:
# Count the companies in each class of the training set.
total_number <- nrow(bank_train)
alive_number <- nrow(filter(bank_train, status_label == "alive"))
failed_number <- total_number - alive_number

pie_data <- c(alive_number, failed_number)
pie_labels <- c("Alive", "Bankrupt")
slice_colors <- c("red", "blue")

# Label each slice with its share of the training set.
pie(pie_data,
    labels = scales::percent(pie_data / sum(pie_data), accuracy = 0.1),
    col = slice_colors,
    main = "Distribution of Company Statuses in Training Set")

legend("topright",
    legend = pie_labels,
    fill = slice_colors,
    title = "Company Status")
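
To make the idea behind SMOTE concrete, here is a minimal base-R sketch of its core step: each synthetic point is placed at a random position on the line segment between a minority-class point and one of its nearest minority-class neighbours. The function `smote_sketch()`, the toy matrix `failed_toy`, and its columns `ratio_a` and `ratio_b` are all hypothetical and are not part of this project's code. In a tidymodels workflow, the `themis` package's `step_smote()` would likely be the more practical choice.

```r
# Minimal SMOTE sketch for numeric features (illustrative only, not a library API).
smote_sketch <- function(minority, n_new, k = 5) {
  minority <- as.matrix(minority)
  n <- nrow(minority)
  stopifnot(n >= 2)
  k <- min(k, n - 1)

  # Pairwise Euclidean distances among minority rows.
  d <- as.matrix(dist(minority))
  diag(d) <- Inf  # a point is never its own neighbour

  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority),
                      dimnames = list(NULL, colnames(minority)))
  for (i in seq_len(n_new)) {
    base_idx <- sample.int(n, 1)                    # random minority point
    neighbours <- order(d[base_idx, ])[seq_len(k)]  # its k nearest neighbours
    nb_idx <- neighbours[sample.int(k, 1)]          # pick one neighbour at random
    gap <- runif(1)                                 # position along the segment
    synthetic[i, ] <- minority[base_idx, ] +
      gap * (minority[nb_idx, ] - minority[base_idx, ])
  }
  synthetic
}

set.seed(2024)
# Hypothetical toy data: six "failed" companies described by two made-up ratios.
failed_toy <- cbind(ratio_a = c(0.2, 0.3, 0.25, 0.4, 0.35, 0.1),
                    ratio_b = c(1.1, 0.9, 1.0, 0.7, 0.8, 1.3))
new_rows <- smote_sketch(failed_toy, n_new = 10, k = 3)
print(new_rows)
```

Because every synthetic row is an interpolation between two real minority rows, it stays within the range of the observed minority data. Up-sampling of this kind is normally applied to the training set only, so that the test set still reflects the original class balance.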

## Expected Outcomes and Significance 

### What we expect to find:

We expect to estimate the probability that a given company goes bankrupt from its financial indicators, and to identify which indicators are most strongly associated with bankruptcy.
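
As a rough illustration of what such an outcome could look like, the sketch below fits a logistic regression with base R's `glm()` on simulated data. The single predictor `leverage`, the simulated outcome `failed`, and the two example companies are made up for this example and do not come from the project dataset.

```r
set.seed(100)

# Hypothetical toy data: one made-up leverage ratio per company.
n <- 200
leverage <- runif(n, 0, 2)
# Simulated outcome: higher leverage makes failure (1) more likely.
failed <- rbinom(n, size = 1, prob = plogis(-3 + 2.5 * leverage))
toy <- data.frame(leverage, failed)

# Logistic regression: models the log-odds of failure as a linear function of leverage.
fit <- glm(failed ~ leverage, data = toy, family = binomial)

# The sign and size of each coefficient show how an indicator relates to the odds of failure.
print(coef(summary(fit)))

# Predicted bankruptcy probability for two hypothetical companies.
new_companies <- data.frame(leverage = c(0.5, 1.8))
predicted_prob <- predict(fit, newdata = new_companies, type = "response")
print(predicted_prob)
```

With `type = "response"`, `predict()` returns probabilities between 0 and 1 rather than log-odds, which is the form of output described above.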

### Impact of the findings: 

These findings can benefit prospective business owners who want to know how successful other businesses in their field are. They can also be useful to current business owners, since the model can help predict whether their business is heading towards bankruptcy.