# Bankruptcy Prevention 

#### This is a binary classification project: the target variable takes one of two values (bankruptcy or non-bankruptcy). The goal is to model the probability that a business goes bankrupt, given six risk-related features.

# Data Cleaning and Pre-Processing

### Notebook Walkthrough
- Importing Libraries
- Loading the Dataset
- Pre-Processing the Dataset
- Saving the Data in a CSV File

In [25]:
## importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
%matplotlib inline
print('Imported')

Imported


In [26]:
df = pd.read_excel('Datasets/bankruptcy-prevention.xlsx')
df.head()

Unnamed: 0,industrial_risk; management_risk; financial_flexibility; credibility; competitiveness; operating_risk; class
0,0.5;1;0;0;0;0.5;bankruptcy
1,0;1;0;0;0;1;bankruptcy
2,1;0;0;0;0;1;bankruptcy
3,0.5;0;0;0.5;0;1;bankruptcy
4,1;1;0;0;0;1;bankruptcy


In [27]:
rows, columns = df.shape
print(f'Number of Rows {rows}')
print(f'Number of Columns {columns}')

Number of Rows 250
Number of Columns 1


In [28]:
## checking which separator the values use (here ';')
df.iloc[2,0]

'1;0;0;0;0;1;bankruptcy'

In [29]:
## columns of the dataset
df.columns

Index(['industrial_risk; management_risk; financial_flexibility; credibility; competitiveness; operating_risk; class'], dtype='object')

In [30]:
import re
## build a separate list of clean column names from the single header string
columns = list(df.columns)                                       ## list holding the one combined header
columns = [re.sub(';','',col) for col in columns]                ## remove the ';' separators (the spaces remain)
columns = [x.split(' ') for x in columns]                        ## split on spaces to get the individual column names
columns = [x for sublist in columns for x in sublist]            ## flatten the nested list into a flat list
columns

['industrial_risk',
 'management_risk',
 'financial_flexibility',
 'credibility',
 'competitiveness',
 'operating_risk',
 'class']
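The same list can be obtained without `re` by splitting on the delimiter and trimming whitespace. The following is a minimal sketch, assuming the header is a single string with names separated by `;` and optional spaces; `header` and `names` are illustrative names, not variables from the notebook.

```python
# Header string copied from the df.columns output above
header = ('industrial_risk; management_risk; financial_flexibility; '
          'credibility; competitiveness; operating_risk; class')

# Split on the delimiter and trim the whitespace around each name
names = [name.strip() for name in header.split(';')]
print(names)
```

Splitting on `;` first keeps the approach valid even if a name were to contain a space.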

In [31]:
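## NOTE: iloc[1:] starts at position 1, so row 0 (a data record, see df.head()) is left out;
## df.iloc[:,0] would keep all 250 rows
raw_data = df.iloc[1:,0]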
raw_data

1              0;1;0;0;0;1;bankruptcy
2              1;0;0;0;0;1;bankruptcy
3          0.5;0;0;0.5;0;1;bankruptcy
4              1;1;0;0;0;1;bankruptcy
5          1;1;0;0.5;0;0.5;bankruptcy
                    ...              
245        0;1;1;1;1;1;non-bankruptcy
246      1;1;0.5;1;1;0;non-bankruptcy
247    0;1;1;0.5;0.5;0;non-bankruptcy
248    1;0;0.5;1;0.5;0;non-bankruptcy
249    1;0;0.5;0.5;1;1;non-bankruptcy
Name: industrial_risk; management_risk; financial_flexibility; credibility; competitiveness; operating_risk; class, Length: 249, dtype: object
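Since the header line already became the column label when the sheet was loaded, row 0 of `df` is a data record, and the `1:` slice leaves it out (249 of 250 rows remain). The sketch below illustrates the difference on a toy frame; `toy`, `skipped` and `kept` are hypothetical names and the values are illustrative only.

```python
import pandas as pd

# Toy single-column frame in the same layout as the loaded sheet
toy = pd.DataFrame({'a; class': ['0.5;bankruptcy',
                                 '0;bankruptcy',
                                 '1;non-bankruptcy']})

skipped = toy.iloc[1:, 0]   # starts at position 1, so row 0 is left out
kept = toy.iloc[:, 0]       # selects every row of the first column

print(len(skipped), len(kept))
```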

In [32]:
## create a new dataframe by splitting each row on ';' into separate columns
df_new = raw_data.str.split(';', expand=True)
df_new.head()

Unnamed: 0,0,1,2,3,4,5,6
1,0.0,1,0,0.0,0,1.0,bankruptcy
2,1.0,0,0,0.0,0,1.0,bankruptcy
3,0.5,0,0,0.5,0,1.0,bankruptcy
4,1.0,1,0,0.0,0,1.0,bankruptcy
5,1.0,1,0,0.5,0,0.5,bankruptcy


In [33]:
## assigning the proper column names to the new dataframe
df_new.columns = columns
df_new.head()

Unnamed: 0,industrial_risk,management_risk,financial_flexibility,credibility,competitiveness,operating_risk,class
1,0.0,1,0,0.0,0,1.0,bankruptcy
2,1.0,0,0,0.0,0,1.0,bankruptcy
3,0.5,0,0,0.5,0,1.0,bankruptcy
4,1.0,1,0,0.0,0,1.0,bankruptcy
5,1.0,1,0,0.5,0,0.5,bankruptcy
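The split and rename steps can be combined into one small helper. This is only a sketch under stated assumptions: the input is a one-column frame whose column label holds the delimited names, every row has the same number of fields, and `split_single_column` is a hypothetical function name, not part of pandas or of this notebook.

```python
import pandas as pd

def split_single_column(frame, sep=';'):
    """Split a one-column DataFrame of delimited strings into separate columns."""
    # Column names come from the single combined header label
    names = [name.strip() for name in str(frame.columns[0]).split(sep)]
    # Split every row (including row 0) on the same delimiter
    parts = frame.iloc[:, 0].str.split(sep, expand=True)
    parts.columns = names
    return parts

# Toy data in the same layout as the loaded sheet (values are illustrative)
toy = pd.DataFrame({'a; b; class': ['0.5;1;bankruptcy', '0;1;non-bankruptcy']})
result = split_single_column(toy)
print(result)
```

The values are still strings at this point, so a numeric conversion is needed afterwards.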


In [34]:
## summary of the new dataframe (dtypes and non-null counts)
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 1 to 249
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   industrial_risk        249 non-null    object
 1   management_risk        249 non-null    object
 2   financial_flexibility  249 non-null    object
 3   credibility            249 non-null    object
 4   competitiveness        249 non-null    object
 5   operating_risk         249 non-null    object
 6   class                  249 non-null    object
dtypes: object(7)
memory usage: 13.7+ KB


In [35]:
## the numerical values are stored as strings (object dtype); convert them to a numeric dtype
for col in df_new.columns[:-1]:
    df_new[col] = pd.to_numeric(df_new[col], errors='coerce')   ## values that cannot be parsed become NaN

In [36]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 1 to 249
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   industrial_risk        249 non-null    float64
 1   management_risk        249 non-null    float64
 2   financial_flexibility  249 non-null    float64
 3   credibility            249 non-null    float64
 4   competitiveness        249 non-null    float64
 5   operating_risk         249 non-null    float64
 6   class                  249 non-null    object 
dtypes: float64(6), object(1)
memory usage: 13.7+ KB
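With `errors='coerce'`, `pd.to_numeric` replaces any value it cannot parse with `NaN` instead of raising an error, so a null check after the conversion also reveals malformed entries. A minimal sketch on toy values (`values` and `converted` are illustrative names, and the data is made up):

```python
import pandas as pd

# Toy column: two valid numbers and one string that cannot be parsed
values = pd.Series(['0.5', '1', 'n/a'])

# errors='coerce' turns unparseable entries into NaN instead of raising
converted = pd.to_numeric(values, errors='coerce')
print(converted)
print(converted.isnull().sum())  # number of entries that failed to parse
```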


In [38]:
## checking for null values
df_new.isnull().sum()

industrial_risk          0
management_risk          0
financial_flexibility    0
credibility              0
competitiveness          0
operating_risk           0
class                    0
dtype: int64

So there are no null values in the data, which also means every value was converted to a number successfully.

In [39]:
rows, columns = df_new.shape
print(f'Number of Rows {rows}')
print(f'Number of Columns {columns}')

Number of Rows 249
Number of Columns 7


In [40]:
## saving the new data in a csv file
## (the index is written as an unnamed first column by default; pass index=False to omit it)
df_new.to_csv('Datasets/pre_processed_data.csv')
print('Saved to Datasets')

Saved to Datasets
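By default, `to_csv` writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below compares the default with `index=False`, using an in-memory buffer and a toy frame in place of `df_new`; the variable names and values are illustrative assumptions.

```python
import io
import pandas as pd

# Toy frame standing in for df_new (values are illustrative)
toy = pd.DataFrame({'industrial_risk': [0.0, 1.0],
                    'class': ['bankruptcy', 'non-bankruptcy']})

# Default: the index is written as an extra, unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False writes only the data columns
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to keep the index depends on whether later notebooks rely on it as a row identifier.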
