A novel dataset for bankruptcy prediction related to American public companies listed on the New York Stock Exchange and NASDAQ is provided. The dataset comprises accounting data from 8,262 distinct companies recorded during the period spanning from 1999 to 2018.

According to the Security Exchange Commission (SEC), a company in the American market is deemed bankrupt under two circumstances. Firstly, if the firm's management files for Chapter 11 of the Bankruptcy Code, indicating an intention to "reorganize" its business. In this case, the company's management continues to oversee day-to-day operations, but significant business decisions necessitate approval from a bankruptcy court. Secondly, if the firm's management files for Chapter 7 of the Bankruptcy Code, indicating a complete cessation of operations and the company going out of business entirely.

In this dataset, the fiscal year prior to the filing of bankruptcy under either Chapter 11 or Chapter 7 is labeled as "Bankruptcy" (1) for the subsequent year. Conversely, if the company does not experience these bankruptcy events, it is considered to be operating normally (0). The dataset is complete, without any missing values, synthetic entries, or imputed added values.

The resulting dataset comprises a total of 78,682 observations of firm-year combinations. To facilitate model training and evaluation, the dataset is divided into three subsets based on time periods. The training set consists of data from 1999 to 2011, the validation set comprises data from 2012 to 2014, and the test set encompasses the years 2015 to 2018. The test set serves as a means to assess the predictive capability of models in real-world scenarios involving unseen cases.


In [5]:
import pandas as pd
import pymongo

df = pd.read_csv('./american_bankruptcy.csv')
df.head()

Unnamed: 0,company_name,status_label,year,X1,X2,X3,X4,X5,X6,X7,...,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18
0,C_1,alive,1999,511.267,833.107,18.373,89.031,336.018,35.163,128.348,...,1024.333,740.998,180.447,70.658,191.226,163.816,201.026,1024.333,401.483,935.302
1,C_1,alive,2000,485.856,713.811,18.577,64.367,320.59,18.531,115.187,...,874.255,701.854,179.987,45.79,160.444,125.392,204.065,874.255,361.642,809.888
2,C_1,alive,2001,436.656,526.477,22.496,27.207,286.588,-58.939,77.528,...,638.721,710.199,217.699,4.711,112.244,150.464,139.603,638.721,399.964,611.514
3,C_1,alive,2002,396.412,496.747,27.172,30.745,259.954,-12.41,66.322,...,606.337,686.621,164.658,3.573,109.59,203.575,124.106,606.337,391.633,575.592
4,C_1,alive,2003,432.204,523.302,26.68,47.491,247.245,3.504,104.661,...,651.958,709.292,248.666,20.811,128.656,131.261,131.884,651.958,407.608,604.467


1. Company Name (This column can be dropped)
2. Status: Company Status (Target Column)
3. Year: (1999 - 2018)
4. Current assets: All the assets of a company that are expected to be sold or used as a result of standard business
5. Cost of goods sold: The total amount a company paid as a cost directly related to the sale of products
6. Depreciation and amortization: Depreciation refers to the loss of value of a tangible fixed asset over
7. EBITDA: Earnings before interest, taxes, depreciation, and amortization. A measure of a company’s overall
8. Inventory: The accounting of items and raw materials that a company either uses in production or sells
9. Net Income: The overall profitability of a company after all expenses and costs have been deducted from total
10. Total Receivables: The balance of money due to a firm for goods or services delivered or used but not yet paid for
11. Market Value: The price of an asset in a marketplace. In our dataset, it refers to the market capitalization
12. Net Sales: The sum of a company’s gross sales minus its returns, allowances, and discounts
13. Total Assets: All the assets, or items of value, a business owns
14. Total Long-term Debt: A company’s loans and other liabilities that will not become due within one year of the balance
15. EBIT: Earnings before interest and taxes
16. Gross Profit: The profit a business makes after subtracting all the costs that are related to manufacturing and
17. Total Current Liabilities: The sum of accounts payable, accrued liabilities, and taxes such as bonds payable at the
18. Retained Earnings: The amount of profit a company has left over after paying all its direct costs, indirect costs
19. Total Revenue: The amount of income that a business has made from all sales before subtracting expenses
20. Total Liabilities: The combined debts and obligations that the company owes to outside parties
21. Total Operating Expenses: The expense a business incurs through its normal business operations


In [6]:
df.rename(columns={"status_label" : "Company Status","X1" : "Current assets","X2": "Cost of goods sold",
                   "X3" : "Depreciation and amortization","X4": "EBDITDA","X5" : "Inventory","X6" : "Net Income",
                   "X7" : "Total Receivable","X8" : "Market Value","X9" : "Net Sales","X10" : "Total Assets",
                   "X11" : "Total Long-term Debt","X12" : "EBIT" ,"X13" : "Gross Profit",
                   "X14" : "Total Current Liabilitie","X15" : "Retained Earnings" , "X16" :"Total Revenue",
                   "X17" : "Total Liabilities", "X18" : "Total Operating Expenses"},inplace= True)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78682 entries, 0 to 78681
Data columns (total 21 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   company_name                   78682 non-null  object 
 1   Company Status                 78682 non-null  object 
 2   year                           78682 non-null  int64  
 3   Current assets                 78682 non-null  float64
 4   Cost of goods sold             78682 non-null  float64
 5   Depreciation and amortization  78682 non-null  float64
 6   EBDITDA                        78682 non-null  float64
 7   Inventory                      78682 non-null  float64
 8   Net Income                     78682 non-null  float64
 9   Total Receivable               78682 non-null  float64
 10  Market Value                   78682 non-null  float64
 11  Net Sales                      78682 non-null  float64
 12  Total Assets                   78682 non-null 

In [9]:
# df should be converted into dict before we push it to mongodb

data = df.to_dict(orient='records')
# data

In [18]:
DB_NAME = "US-Company-Bankruptcy"
COLLECTION_NAME = "US-Company-Bankruptcy-Data"
CONNECTION_URL  = "mongodb+srv://husainshahnawaz15:gVaBkDHz7z9QfDB4@cluster0.ijaahc2.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"
#above, either remove your credentials or delete the mongoDB resource bofore pushing it to github.

In [19]:
client = pymongo.MongoClient(CONNECTION_URL)
data_base = client[DB_NAME]
collection = data_base[COLLECTION_NAME]

In [15]:
# Uploading data to MongoDB
rec = collection.insert_many(data)

In [17]:
print(type(rec))

<class 'pymongo.results.InsertManyResult'>


In [20]:
# Load back data from mongodb

df = pd.DataFrame(list(collection.find()))
df.head(2)

Unnamed: 0,_id,company_name,Company Status,year,Current assets,Cost of goods sold,Depreciation and amortization,EBDITDA,Inventory,Net Income,...,Net Sales,Total Assets,Total Long-term Debt,EBIT,Gross Profit,Total Current Liabilitie,Retained Earnings,Total Revenue,Total Liabilities,Total Operating Expenses
0,68a9cd4637bac44f68cabd5f,C_1,alive,1999,511.267,833.107,18.373,89.031,336.018,35.163,...,1024.333,740.998,180.447,70.658,191.226,163.816,201.026,1024.333,401.483,935.302
1,68a9cd4637bac44f68cabd60,C_1,alive,2000,485.856,713.811,18.577,64.367,320.59,18.531,...,874.255,701.854,179.987,45.79,160.444,125.392,204.065,874.255,361.642,809.888


In [None]:
## If you are getting timeout issue

# import certifi
# client = pymongo.MongoClient(CONNECTION_URL, tlsCAFile=certifi.where())
# # TO CREATE THE DATABASE
# data_base = client[DB_NAME]
# collection = data_base[COLLECTION_NAME]
# # TO INSERT DATA INTO THE COLLECTION
# rec = collection.insert_many(data)