## Combining Financial Data

#### Initial Set Up

In this notebook, we will combine the financial data from the previous notebooks into a single dataset. 

The data we will combine is:

- Enterprise Value (EV)
- Financial Ratios
- Financial KPI Growth Metrics
- Financial KPI Metrics

The quantitative data will then be compiled with company descriptions and industry classifications from a previous notebook.

#### Data Clean Up

Once a single dataset is created, we will clean up the data by:

- Removing any duplicate rows or columns
- Identifying missing data and deciding how to handle it
- Removing any unnecessary columns

#### Data Saving

The final dataset will be saved as a CSV file for use in future notebooks.

Let's get started by loading the necessary libraries and the data we will be using.



In [1]:
# loading libraries

import pandas as pd




#### Loading 4 datasets


In [2]:
# Load the Enterprise Value (EV) dataset

ev = pd.read_csv('../Data/Datasets/financial_data_EnterpriseValue.csv')

# Load the Financial Ratios dataset

ratios = pd.read_csv('../Data/Datasets/financial_data_Ratios.csv')

# Load the Financial KPI Growth Metrics dataset

kpi_growth = pd.read_csv('../Data/Datasets/financial_data_Growth.csv')

# Load the Financial KPI Metrics dataset

kpi = pd.read_csv('../Data/Datasets/financial_data_KeyMetrics.csv')

# load the company descriptions and industry classifications dataset

company_info = pd.read_csv('data/Datasets/company_profile_cleaned_50B.csv')

#### Checking the first 100 rows of each dataset to understand the data


In [3]:
# Enterprise Value (EV) dataset

# sort the data by symbol ascending order and then by date descending order

ev = ev.sort_values(by=['symbol', 'date'], ascending=[True, False])

ev.head(100)


Unnamed: 0,symbol,date,stockPrice,numberOfShares,marketCapitalization,minusCashAndCashEquivalents,addTotalDebt,enterpriseValue
40,AAPL,2024-09-28,227.79,15171990000,3.456028e+12,29943000000,106629000000,3532713602100
41,AAPL,2024-06-29,210.62,15320000000,3.226698e+12,25565000000,101304000000,3302437400000
42,AAPL,2024-03-30,171.48,15405856000,2.641796e+12,32695000000,104590000000,2713691186880
43,AAPL,2023-12-30,192.53,15509763000,2.986095e+12,40760000000,108040000000,3053374670390
44,AAPL,2023-09-30,171.21,15599434000,2.670779e+12,29965000000,123930000000,2764744095140
...,...,...,...,...,...,...,...,...
6562,ABNB,2020-12-31,146.80,345755000,5.075683e+10,5480557000,2329808000,47606085000
6563,ABNB,2020-09-30,144.71,587199007,8.497357e+10,2664390000,2254092000,84563270302
6564,ABNB,2020-06-30,144.71,530945000,7.683305e+10,5480557000,487491000,71839984950
6565,ABNB,2020-03-31,144.71,530945000,7.683305e+10,0,0,76833050950
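The sort above operates on the `date` column as it was read from the CSV, which is most likely a string column. ISO-style `YYYY-MM-DD` strings happen to sort chronologically, but parsing them first makes the ordering explicit. The following is a minimal sketch on a toy frame (`toy_ev` and its values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Toy stand-in for the EV dataset (illustrative values only)
toy_ev = pd.DataFrame({
    "symbol": ["ABNB", "AAPL", "AAPL", "ABNB"],
    "date": ["2020-12-31", "2024-06-29", "2024-09-28", "2020-09-30"],
})

# Parse the date strings so the sort is chronological rather than lexical
toy_ev["date"] = pd.to_datetime(toy_ev["date"])

# Symbol ascending, then most recent date first within each symbol
toy_ev_sorted = toy_ev.sort_values(
    by=["symbol", "date"], ascending=[True, False]
).reset_index(drop=True)

print(toy_ev_sorted)
```

If the dates are parsed in one dataset, the same conversion would need to be applied to the others, since merge keys must share a dtype.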


In [4]:
# Financial Ratios dataset

# sort the data by symbol ascending order and then by date descending order

ratios = ratios.sort_values(by=['symbol', 'date'], ascending=[True, False])


ratios.head(100)


Unnamed: 0,symbol,date,calendarYear,period,currentRatio,quickRatio,cashRatio,daysOfSalesOutstanding,daysOfInventoryOutstanding,operatingCycle,...,priceToSalesRatio,priceEarningsRatio,priceToFreeCashFlowsRatio,priceToOperatingCashFlowsRatio,priceCashFlowRatio,priceEarningsToGrowthRatio,priceSalesRatio,dividendYield,enterpriseValueMultiple,priceFairValue
0,AAPL,2024-09-28,2024,Q4,0.867313,0.826007,0.169753,62.802802,12.844802,75.647604,...,36.406063,58.632390,144.585517,128.903346,128.903346,-1.908962,36.406063,0.001101,108.692191,60.685296
1,AAPL,2024-06-29,2024,Q3,0.952980,0.906142,0.194227,45.297457,12.036053,57.333510,...,37.617291,37.610714,120.818452,111.812960,111.812960,-4.426492,37.617291,0.001207,117.099404,48.370486
2,AAPL,2024-03-30,2024,Q2,1.037102,0.986771,0.264048,40.808568,11.568830,52.377398,...,29.109739,27.942505,127.660007,116.429977,116.429977,-0.927183,29.109739,0.001404,88.290317,35.606601
3,AAPL,2023-12-30,2024,Q1,1.072544,1.023945,0.304240,37.710056,9.054234,46.764290,...,24.972567,22.010958,79.622821,74.848845,74.848845,0.449390,24.972567,0.001281,70.645628,40.298174
4,AAPL,2023-09-30,2023,Q4,0.988012,0.944442,0.206217,61.327069,11.611542,72.938611,...,29.841774,29.085850,137.421101,123.658630,123.658630,1.846951,29.841774,0.001407,93.334147,42.975881
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,ABNB,2020-12-31,2020,Q4,1.734780,1.734780,1.066302,228.474148,0.000000,228.474148,...,59.070127,-3.263810,-345.303377,-364.868334,-364.868334,0.001040,59.070127,0.000000,-12.219459,17.491602
96,ABNB,2020-09-30,2020,Q3,1.218680,1.218680,0.450214,12.392934,0.000000,12.392934,...,63.302992,96.856726,259.038905,253.146150,253.146150,-0.721416,63.302992,0.000000,179.500598,264.530572
97,ABNB,2020-06-30,2020,Q2,1.734780,1.734780,1.066302,0.000000,0.000000,0.000000,...,229.507223,-33.371548,-292.532404,-299.588442,-299.588442,-0.485404,229.507223,0.000000,-128.221805,26.477876
98,ABNB,2020-03-31,2020,Q1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,91.269082,-56.394541,-131.227062,-134.835040,-134.835040,18.610199,91.269082,0.000000,-237.712785,31.699196


In [5]:
# Financial KPI Growth Metrics dataset

# sort the data by symbol ascending order and then by date descending order

kpi_growth = kpi_growth.sort_values(by=['symbol', 'date'], ascending=[True, False])

kpi_growth.head(100)


Unnamed: 0,symbol,date,calendarYear,period,revenueGrowth,grossProfitGrowth,ebitgrowth,operatingIncomeGrowth,netIncomeGrowth,epsgrowth,...,tenYDividendperShareGrowthPerShare,fiveYDividendperShareGrowthPerShare,threeYDividendperShareGrowthPerShare,receivablesGrowth,inventoryGrowth,assetGrowth,bookValueperShareGrowth,debtGrowth,rdexpenseGrowth,sgaexpensesGrowth
0,AAPL,2024-09-28,2024,Q4,0.106707,0.105877,0.167206,0.167206,-0.312943,-0.307143,...,1.103591,0.294579,0.135642,0.534397,0.181833,0.100624,-0.137951,0.052565,-0.030102,0.032120
1,AAPL,2024-06-29,2024,Q3,-0.054830,-0.061342,-0.091326,-0.091326,-0.092571,-0.084967,...,1.132779,0.280849,0.122352,0.049137,-0.010751,-0.017187,-0.095859,-0.031418,0.013033,-0.022882
2,AAPL,2024-03-30,2024,Q2,-0.241037,-0.229405,-0.308944,-0.308944,-0.303102,-0.301370,...,1.216607,0.307695,0.170447,-0.178676,-0.042851,-0.045551,0.008022,-0.031933,0.026897,-0.046861
3,AAPL,2023-12-30,2024,Q1,0.336063,0.356890,0.497015,0.497015,0.477435,0.489796,...,1.234623,0.309353,0.155970,-0.178454,0.028432,0.002641,0.199247,-0.128218,0.053237,0.103235
4,AAPL,2023-09-30,2023,Q4,0.094148,0.110235,0.172667,0.172667,0.154670,0.157480,...,1.202570,0.310745,0.170403,0.556296,-0.138757,0.052367,0.037547,0.134059,-0.018140,0.029801
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,ABNB,2020-12-31,2020,Q4,-0.359872,-0.417477,-8.403645,-8.403645,-18.726195,-31.378378,...,0.000000,0.000000,0.000000,10.801302,0.000000,0.201985,14.341711,0.033590,8.640029,3.577565
96,ABNB,2020-09-30,2020,Q3,3.009663,5.423734,1.717981,1.717981,1.381050,1.342593,...,0.000000,0.000000,0.000000,0.000000,0.000000,-0.168043,-0.899906,3.623864,-0.018436,0.111227
97,ABNB,2020-06-30,2020,Q2,-0.602326,-0.692273,-0.791812,-0.791812,-0.689899,-0.687500,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,-0.157952,-0.354098
98,ABNB,2020-03-31,2020,Q1,-0.239399,-0.306354,0.007471,0.007471,0.031100,0.030303,...,0.000000,0.000000,0.000000,0.000000,0.000000,-1.000000,1.000000,-1.000000,-0.085119,-0.364931


In [6]:
# Financial KPI Metrics dataset

# sort the data by symbol ascending order and then by date descending order

kpi = kpi.sort_values(by=['symbol', 'date'], ascending=[True, False])

kpi.head(100)


Unnamed: 0,symbol,date,calendarYear,period,revenuePerShare,netIncomePerShare,operatingCashFlowPerShare,freeCashFlowPerShare,cashPerShare,bookValuePerShare,...,averagePayables,averageInventory,daysSalesOutstanding,daysPayablesOutstanding,daysOfInventoryOnHand,receivablesTurnover,payablesTurnover,inventoryTurnover,roe,capexPerShare
0,AAPL,2024-09-28,2024,Q4,6.256925,0.971263,1.767138,1.575469,4.295481,3.753628,...,5.826700e+10,6.725500e+09,62.802802,121.572545,12.844802,1.433057,0.740299,7.006725,0.258753,0.191669
1,AAPL,2024-06-29,2024,Q3,5.599021,1.400000,1.883681,1.743277,4.034008,4.354308,...,4.666350e+10,6.198500e+09,45.297457,92.879672,12.036053,1.986866,0.968996,7.477534,0.321521,0.140405
2,AAPL,2024-03-30,2024,Q2,5.890812,1.534222,1.472817,1.343255,4.358732,4.815961,...,5.194950e+10,6.371500e+09,40.808568,84.933996,11.568830,2.205419,1.059646,7.779525,0.318570,0.129561
3,AAPL,2023-12-30,2024,Q1,7.709660,2.186752,2.572251,2.418025,4.713160,4.777636,...,6.037850e+10,6.421000e+09,37.710056,80.858158,9.054234,2.386631,1.113060,9.940101,0.457706,0.154225
4,AAPL,2023-09-30,2023,Q4,5.737259,1.471592,1.384537,1.245879,3.945977,3.983862,...,5.465500e+10,6.841000e+09,61.327069,114.833405,11.611542,1.467541,0.783744,7.750908,0.369388,0.138659
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,ABNB,2020-12-31,2020,Q4,2.485182,-11.244526,-0.402337,-0.425133,18.484930,8.392599,...,6.293500e+07,0.000000e+00,228.474148,34.283303,0.000000,0.393918,2.625185,0.000000,-1.339815,0.022796
96,ABNB,2020-09-30,2020,Q3,2.285990,0.373516,0.571646,0.558642,7.655345,0.547045,...,6.293500e+07,0.000000e+00,12.392934,18.200726,0.000000,7.262203,4.944858,0.000000,0.682788,0.013004
97,ABNB,2020-06-30,2020,Q2,0.630525,-1.084082,-0.483029,-0.494680,12.037512,5.465317,...,3.994900e+07,0.000000e+00,0.000000,44.608618,0.000000,0.000000,2.017547,0.000000,-0.198357,0.011651
98,ABNB,2020-03-31,2020,Q1,1.585531,-0.641507,-1.073237,-1.102745,5.790191,0.000000,...,7.570850e+07,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,-0.140524,0.029508


In [7]:
# Company descriptions and industry classifications dataset

company_info.head(100)

Unnamed: 0,symbol,price,beta,mktCap,companyName,currency,cik,isin,cusip,exchange,...,sector,country,city,state,zip,isEtf,isActivelyTrading,isAdr,isFund,date
0,NVDA,141.98,1.657,3.482769e-06,NVIDIA Corporation,USD,1045810.0,US67066G1040,67066G104,NASDAQ Global Select,...,Technology,US,Santa Clara,CA,95051,False,True,False,False,2024-12-02
1,AAPL,225.00,1.240,3.401055e-06,Apple Inc.,USD,320193.0,US0378331005,037833100,NASDAQ Global Select,...,Technology,US,Cupertino,CA,95014,False,True,False,False,2024-12-02
2,MSFT,415.00,0.904,3.085475e-06,Microsoft Corporation,USD,789019.0,US5949181045,594918104,NASDAQ Global Select,...,Technology,US,Redmond,WA,98052-6399,False,True,False,False,2024-12-02
3,AMZN,202.61,1.146,2.130444e-06,"Amazon.com, Inc.",USD,1018724.0,US0231351067,023135106,NASDAQ Global Select,...,Consumer Cyclical,US,Seattle,WA,98109-5210,False,True,False,False,2024-12-02
4,GOOGL,172.49,1.034,2.120042e-06,Alphabet Inc.,USD,1652044.0,US02079K3059,02079K305,NASDAQ Global Select,...,Communication Services,US,Mountain View,CA,94043,False,True,False,False,2024-12-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,TJX,119.97,0.893,1.353106e-07,"The TJX Companies, Inc.",USD,109198.0,US8725401090,872540109,New York Stock Exchange,...,Consumer Cyclical,US,Framingham,MA,01701,False,True,False,False,2024-12-02
96,ARM,128.73,3.920,1.352952e-07,Arm Holdings plc American Depositary Shares,USD,1973239.0,US0420682058,,NASDAQ,...,Technology,GB,Cambridge,,CB1 9NJ,False,True,True,False,2024-12-02
97,KKR,150.02,1.606,1.332521e-07,KKR & Co. Inc.,USD,1404912.0,US48251W1045,48251W104,New York Stock Exchange,...,Financial Services,US,New York,NY,10001,False,True,False,False,2024-12-02
98,BHP,51.84,0.859,1.314829e-07,BHP Group Limited,USD,811809.0,US0886061086,088606108,New York Stock Exchange,...,Basic Materials,AU,Melbourne,VIC,3000,False,True,True,False,2024-12-02


#### Merging the datasets

In [8]:
# Over the next several cells, I will merge the datasets on the 'symbol' and 'date' columns to create a single dataset

# I will merge the datasets in the following order: KPI -> KPI Growth -> Ratios -> EV -> Company Info


In [9]:
# Before merging the KPI and KPI Growth datasets, check which (symbol, date) keys appear in only one of them

# Find unmatched rows in KPI dataset
unmatched_kpi = kpi[~kpi.set_index(['symbol', 'date']).index.isin(kpi_growth.set_index(['symbol', 'date']).index)]

# Find unmatched rows in KPI Growth dataset
unmatched_kpi_growth = kpi_growth[~kpi_growth.set_index(['symbol', 'date']).index.isin(kpi.set_index(['symbol', 'date']).index)]

# Create DataFrames for unmatched rows
unmatched_kpi_df = unmatched_kpi[['symbol', 'date']]
unmatched_kpi_growth_df = unmatched_kpi_growth[['symbol', 'date']]

# The merge itself is performed in a later cell


In [10]:
# viewing the rows of the KPI Growth dataset that have no match in the KPI dataset

unmatched_kpi_growth_df

Unnamed: 0,symbol,date
970,ASML,2024-09-29
1406,BHP,2008-12-31
1407,BHP,2008-07-23
1408,BHP,2008-01-23
1409,BHP,2007-07-24
...,...,...
10306,UL,2009-03-31
10307,UL,2008-12-31
10308,UL,2008-09-30
10309,UL,2008-06-30
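An alternative way to run the same unmatched-row check is an outer merge with `indicator=True`, which labels each key as present in the left frame, the right frame, or both. This is a sketch on toy frames (`toy_kpi`, `toy_growth` and their values are assumptions for illustration), not the method used in the cells above:

```python
import pandas as pd

# Toy stand-ins for the KPI and KPI Growth datasets (illustrative values only)
toy_kpi = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "BHP"],
    "date": ["2024-09-28", "2024-06-29", "2008-12-31"],
    "roe": [0.26, 0.32, 0.10],
})
toy_growth = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "ASML"],
    "date": ["2024-09-28", "2024-06-29", "2024-09-29"],
    "revenueGrowth": [0.11, -0.05, 0.02],
})

# Outer join on the key columns only; `_merge` records where each key was found
key_check = pd.merge(
    toy_kpi[["symbol", "date"]],
    toy_growth[["symbol", "date"]],
    on=["symbol", "date"],
    how="outer",
    indicator=True,
)

only_in_kpi = key_check[key_check["_merge"] == "left_only"]
only_in_growth = key_check[key_check["_merge"] == "right_only"]

print(key_check["_merge"].value_counts())
```

Both directions of the check come out of a single merge, so the two `isin` expressions do not need to be kept in sync.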


In [11]:
# merging the KPI and KPI Growth datasets using inner join

kpi_combined = pd.merge(kpi, kpi_growth, on=['symbol', 'date'], how='inner')
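When both frames share non-key columns (such as `calendarYear` and `period` here), pandas keeps both copies and appends its default `_x` and `_y` suffixes. The sketch below, on toy frames with assumed values, shows two optional `pd.merge` parameters that can make this easier to follow: `suffixes` to name the source of each copy, and `validate` to raise an error if the keys are not unique on both sides.

```python
import pandas as pd

# Toy frames that share the key columns plus one overlapping column
left = pd.DataFrame({
    "symbol": ["AAPL", "AAPL"],
    "date": ["2024-09-28", "2024-06-29"],
    "calendarYear": [2024, 2024],
    "roe": [0.26, 0.32],
})
right = pd.DataFrame({
    "symbol": ["AAPL", "AAPL"],
    "date": ["2024-09-28", "2024-06-29"],
    "calendarYear": [2024, 2024],
    "revenueGrowth": [0.11, -0.05],
})

merged = pd.merge(
    left,
    right,
    on=["symbol", "date"],
    how="inner",
    validate="one_to_one",            # raises MergeError on duplicate keys
    suffixes=("_kpi", "_growth"),     # hypothetical suffix names
)

print(merged.columns.tolist())
```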

In [12]:
# merging the Ratios dataset with the combined KPI and KPI Growth dataset (without the check for unmatched rows)

ratios_combined = pd.merge(ratios, kpi_combined, on=['symbol', 'date'], how='inner')


In [13]:
# merging the EV dataset with the combined Ratios, KPI and KPI Growth dataset (without the check for unmatched rows)

ev_combined = pd.merge(ev, ratios_combined, on=['symbol', 'date'], how='inner')


In [14]:
# merging the Company Info dataset with the combined EV, Ratios, KPI and KPI Growth dataset (without the check for unmatched rows)

# this final merge is on the 'symbol' column only

final_dataset = pd.merge(ev_combined, company_info, on=['symbol'], how='inner')
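Because the company profile also carries a `date` column, merging on `symbol` alone produces `date_x` and `date_y`. One possible way to avoid the collision is to rename the profile's date before merging, as in this sketch on toy frames (the frame names, values, and the `profileDate` name are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for the combined financial data and the company profile
toy_financials = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "ABNB"],
    "date": ["2024-09-28", "2024-06-29", "2020-12-31"],
    "enterpriseValue": [3.53e12, 3.30e12, 4.76e10],
})
toy_info = pd.DataFrame({
    "symbol": ["AAPL", "ABNB"],
    "sector": ["Technology", "Consumer Cyclical"],
    "date": ["2024-12-02", "2024-12-02"],
})

# Rename the profile's date so it cannot collide with the reporting date
toy_info = toy_info.rename(columns={"date": "profileDate"})

toy_final = pd.merge(
    toy_financials,
    toy_info,
    on="symbol",
    how="inner",
    validate="many_to_one",  # many quarterly rows per symbol, one profile row
)

print(toy_final)
```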


In [15]:
# Checking the first 100 rows of the final dataset

final_dataset.head(100)

Unnamed: 0,symbol,date_x,stockPrice,numberOfShares,marketCapitalization,minusCashAndCashEquivalents,addTotalDebt,enterpriseValue_x,calendarYear,period,...,sector,country,city,state,zip,isEtf,isActivelyTrading,isAdr,isFund,date_y
0,AAPL,2024-09-28,227.79,15171990000,3.456028e+12,29943000000,106629000000,3532713602100,2024,Q4,...,Technology,US,Cupertino,CA,95014,False,True,False,False,2024-12-02
1,AAPL,2024-06-29,210.62,15320000000,3.226698e+12,25565000000,101304000000,3302437400000,2024,Q3,...,Technology,US,Cupertino,CA,95014,False,True,False,False,2024-12-02
2,AAPL,2024-03-30,171.48,15405856000,2.641796e+12,32695000000,104590000000,2713691186880,2024,Q2,...,Technology,US,Cupertino,CA,95014,False,True,False,False,2024-12-02
3,AAPL,2023-12-30,192.53,15509763000,2.986095e+12,40760000000,108040000000,3053374670390,2024,Q1,...,Technology,US,Cupertino,CA,95014,False,True,False,False,2024-12-02
4,AAPL,2023-09-30,171.21,15599434000,2.670779e+12,29965000000,123930000000,2764744095140,2023,Q4,...,Technology,US,Cupertino,CA,95014,False,True,False,False,2024-12-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,ABNB,2020-12-31,146.80,345755000,5.075683e+10,5480557000,2329808000,47606085000,2020,Q4,...,Consumer Cyclical,US,San Francisco,CA,94103,False,True,False,False,2024-12-02
96,ABNB,2020-09-30,144.71,587199007,8.497357e+10,2664390000,2254092000,84563270302,2020,Q3,...,Consumer Cyclical,US,San Francisco,CA,94103,False,True,False,False,2024-12-02
97,ABNB,2020-06-30,144.71,530945000,7.683305e+10,5480557000,487491000,71839984950,2020,Q2,...,Consumer Cyclical,US,San Francisco,CA,94103,False,True,False,False,2024-12-02
98,ABNB,2020-03-31,144.71,530945000,7.683305e+10,0,0,76833050950,2020,Q1,...,Consumer Cyclical,US,San Francisco,CA,94103,False,True,False,False,2024-12-02


In [16]:
# creating a list of all columns in the dataset to check whether there are any duplicate columns or parameters under different names

col_names = final_dataset.columns.tolist()

# sorting the list of column names in ascending order

col_names.sort()
col_names

['addTotalDebt',
 'assetGrowth',
 'assetTurnover',
 'averageInventory',
 'averagePayables',
 'averageReceivables',
 'beta',
 'bookValuePerShare',
 'bookValueperShareGrowth',
 'calendarYear',
 'calendarYear_x',
 'calendarYear_y',
 'capexPerShare',
 'capexToDepreciation',
 'capexToOperatingCashFlow',
 'capexToRevenue',
 'capitalExpenditureCoverageRatio',
 'cashConversionCycle',
 'cashFlowCoverageRatios',
 'cashFlowToDebtRatio',
 'cashPerShare_x',
 'cashPerShare_y',
 'cashRatio',
 'ceo',
 'cik',
 'city',
 'companyEquityMultiplier',
 'companyName',
 'country',
 'currency',
 'currentRatio_x',
 'currentRatio_y',
 'cusip',
 'date_x',
 'date_y',
 'daysOfInventoryOnHand',
 'daysOfInventoryOutstanding',
 'daysOfPayablesOutstanding',
 'daysOfSalesOutstanding',
 'daysPayablesOutstanding',
 'daysSalesOutstanding',
 'debtEquityRatio',
 'debtGrowth',
 'debtRatio',
 'debtToAssets',
 'debtToEquity',
 'description',
 'dividendPaidAndCapexCoverageRatio',
 'dividendPayoutRatio',
 'dividendYield_x',
 'divi
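Scanning the sorted list by eye works, but the `_x`/`_y` pairs can also be found programmatically and compared. The sketch below uses a toy frame with assumed values; it only reports whether each pair is identical and does not decide which copy to keep.

```python
import pandas as pd

# Toy frame with two suffix pairs: one identical, one that differs
toy = pd.DataFrame({
    "cashPerShare_x": [4.30, 4.03],
    "cashPerShare_y": [4.30, 4.03],
    "currentRatio_x": [0.87, 0.95],
    "currentRatio_y": [0.87, 0.99],
    "beta": [1.24, 1.24],
})

# Base names that appear with both an _x and a _y suffix
bases = sorted(
    c[:-2] for c in toy.columns
    if c.endswith("_x") and f"{c[:-2]}_y" in toy.columns
)

# For each pair, report whether the two columns hold identical values
pair_report = {
    base: toy[f"{base}_x"].equals(toy[f"{base}_y"]) for base in bases
}

print(pair_report)
```

Pairs reported as identical are safe candidates for dropping one copy; pairs that differ may deserve a closer look before either is removed.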

In [17]:
# sorting the columns in the dataset in ascending order

final_dataset = final_dataset.reindex(sorted(final_dataset.columns), axis=1)

# checking the first 100 rows of the dataset to see if there are any duplicate columns

final_dataset.head(100)

Unnamed: 0,addTotalDebt,assetGrowth,assetTurnover,averageInventory,averagePayables,averageReceivables,beta,bookValuePerShare,bookValueperShareGrowth,calendarYear,...,threeYDividendperShareGrowthPerShare,threeYNetIncomeGrowthPerShare,threeYOperatingCFGrowthPerShare,threeYRevenueGrowthPerShare,threeYShareholdersEquityGrowthPerShare,totalDebtToCapitalization,weightedAverageSharesDilutedGrowth,weightedAverageSharesGrowth,workingCapital,zip
0,106629000000,0.100624,0.260096,6.725500e+09,5.826700e+10,5.470750e+10,1.240,3.753628,-0.137951,2024,...,0.135642,-0.220800,0.442328,0.237508,-0.019076,0.651850,-0.006862,-0.009661,-2.340500e+10,95014
1,101304000000,-0.017187,0.258667,6.198500e+09,4.666350e+10,4.216100e+10,1.240,4.354308,-0.095859,2024,...,0.122352,0.070692,0.484993,0.143358,0.126469,0.602957,-0.007535,-0.005573,-6.189000e+09,95014
2,104590000000,-0.045551,0.268969,6.371500e+09,5.194950e+10,4.562600e+10,1.240,4.815961,0.008022,2024,...,0.170447,0.087751,0.028931,0.101665,0.166326,0.585008,-0.007186,-0.006699,4.594000e+09,95014
3,108040000000,0.002641,0.338247,6.421000e+09,6.037850e+10,5.554350e+10,1.240,4.777636,0.199247,2024,...,0.155970,0.287877,0.123787,0.171619,0.221760,0.593170,-0.006110,-0.005748,9.719000e+09,95014
4,123930000000,0.052367,0.253835,6.841000e+09,5.465500e+10,5.008550e+10,1.240,3.983862,0.037547,2023,...,0.170403,0.980735,0.147789,0.512628,0.040041,0.641260,-0.006505,-0.006254,-1.742000e+09,95014
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2329808000,0.201985,0.081901,0.000000e+00,6.293500e+07,1.183084e+09,1.156,8.392599,14.341711,2020,...,0.000000,-15.983156,-0.087816,0.192175,6.517013,0.388339,-0.411179,-0.411179,3.776607e+09,94103
96,2254092000,-0.168043,0.153788,0.000000e+00,6.293500e+07,9.241900e+07,1.156,0.547045,-0.899906,2020,...,0.000000,-0.635875,0.000000,-0.638930,0.000000,0.851311,0.105951,0.105951,1.294158e+09,94103
97,487491000,0.000000,0.031909,0.000000e+00,3.994900e+07,0.000000e+00,1.156,5.465317,0.000000,2020,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.143833,0.000000,0.000000,3.776607e+09,94103
98,0,-1.000000,0.000000,0.000000e+00,7.570850e+07,0.000000e+00,1.156,0.000000,1.000000,2020,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,3.074273e+09,94103


### Cleaning the dataset

After reviewing the sorted columns, I identified the following duplicate or unnecessary columns to delete:

1. 'calendarYear_x' and 'calendarYear_y'
2. 'cashPerShare_y'
3. 'currentRatio_y'
4. 'date_y'
5. 'daysOfInventoryOnHand'
6. 'daysOfPayablesOutstanding'
7. 'daysOfSalesOutstanding'
8. 'debtRatio'
9. 'debtEquityRatio'
10. 'dividendYield_y'
11. 'enterpriseValue_y'
12. 'freeCashFlowPerShare_y'
13. 'interestCoverage_y'
14. 'inventoryTurnover_y'
15. 'isActivelyTrading'
16. 'isAdr'
17. 'isEtf'
18. 'isFund'
19. 'marketCapitalization'
20. 'mktCap'
21. 'operatingCashFlowPerShare_y'
22. 'payablesTurnover_y'
23. 'payoutRatio_y'
24. 'period_x'
25. 'period_y'
26. 'priceBookValueRatio'
27. 'priceCashFlowRatio'
28. 'priceToSalesRatio_y'
29. 'ptbRatio'
30. 'receivablesTurnover_y'
31. 'roe'
32. 'price'

In [18]:
# dropping the columns that are not needed

final_dataset = final_dataset.drop(
    columns=[
        'calendarYear_x', 'calendarYear_y', 'cashPerShare_y', 'currentRatio_y', 'date_y',
        'daysOfInventoryOnHand', 'daysOfPayablesOutstanding', 'daysOfSalesOutstanding', 'debtRatio', 'debtEquityRatio',
        'dividendYield_y', 'enterpriseValue_y', 'freeCashFlowPerShare_y', 'interestCoverage_y', 'inventoryTurnover_y',
        'isActivelyTrading', 'isAdr', 'isEtf', 'isFund', 'marketCapitalization', 'mktCap', 'operatingCashFlowPerShare_y',
        'payablesTurnover_y', 'payoutRatio_y', 'period_x', 'period_y', 'priceBookValueRatio', 'priceCashFlowRatio',
        'priceToSalesRatio_y', 'ptbRatio', 'receivablesTurnover_y', 'roe', 'price',
    ]
)
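With a long hand-typed list, a single misspelled name makes `drop` raise a `KeyError`. A small guard, sketched here on a toy frame with assumed column values, reports names that are not present and drops only those that are:

```python
import pandas as pd

# Toy frame standing in for the merged dataset
toy = pd.DataFrame({
    "symbol": ["AAPL"],
    "date_x": ["2024-09-28"],
    "date_y": ["2024-12-02"],
    "isEtf": [False],
})

# 'ptbRatio' is deliberately absent from the toy frame
cols_to_drop = ["date_y", "isEtf", "ptbRatio"]

# Report names that are not present instead of letting drop() raise
missing = [c for c in cols_to_drop if c not in toy.columns]
print("Not found:", missing)

# Drop only the columns that exist
toy_clean = toy.drop(columns=[c for c in cols_to_drop if c in toy.columns])

print(toy_clean.columns.tolist())
```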




In [19]:
# Checking the first 100 rows of the dataset after dropping the columns

final_dataset.head(100)

Unnamed: 0,addTotalDebt,assetGrowth,assetTurnover,averageInventory,averagePayables,averageReceivables,beta,bookValuePerShare,bookValueperShareGrowth,calendarYear,...,threeYDividendperShareGrowthPerShare,threeYNetIncomeGrowthPerShare,threeYOperatingCFGrowthPerShare,threeYRevenueGrowthPerShare,threeYShareholdersEquityGrowthPerShare,totalDebtToCapitalization,weightedAverageSharesDilutedGrowth,weightedAverageSharesGrowth,workingCapital,zip
0,106629000000,0.100624,0.260096,6.725500e+09,5.826700e+10,5.470750e+10,1.240,3.753628,-0.137951,2024,...,0.135642,-0.220800,0.442328,0.237508,-0.019076,0.651850,-0.006862,-0.009661,-2.340500e+10,95014
1,101304000000,-0.017187,0.258667,6.198500e+09,4.666350e+10,4.216100e+10,1.240,4.354308,-0.095859,2024,...,0.122352,0.070692,0.484993,0.143358,0.126469,0.602957,-0.007535,-0.005573,-6.189000e+09,95014
2,104590000000,-0.045551,0.268969,6.371500e+09,5.194950e+10,4.562600e+10,1.240,4.815961,0.008022,2024,...,0.170447,0.087751,0.028931,0.101665,0.166326,0.585008,-0.007186,-0.006699,4.594000e+09,95014
3,108040000000,0.002641,0.338247,6.421000e+09,6.037850e+10,5.554350e+10,1.240,4.777636,0.199247,2024,...,0.155970,0.287877,0.123787,0.171619,0.221760,0.593170,-0.006110,-0.005748,9.719000e+09,95014
4,123930000000,0.052367,0.253835,6.841000e+09,5.465500e+10,5.008550e+10,1.240,3.983862,0.037547,2023,...,0.170403,0.980735,0.147789,0.512628,0.040041,0.641260,-0.006505,-0.006254,-1.742000e+09,95014
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2329808000,0.201985,0.081901,0.000000e+00,6.293500e+07,1.183084e+09,1.156,8.392599,14.341711,2020,...,0.000000,-15.983156,-0.087816,0.192175,6.517013,0.388339,-0.411179,-0.411179,3.776607e+09,94103
96,2254092000,-0.168043,0.153788,0.000000e+00,6.293500e+07,9.241900e+07,1.156,0.547045,-0.899906,2020,...,0.000000,-0.635875,0.000000,-0.638930,0.000000,0.851311,0.105951,0.105951,1.294158e+09,94103
97,487491000,0.000000,0.031909,0.000000e+00,3.994900e+07,0.000000e+00,1.156,5.465317,0.000000,2020,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.143833,0.000000,0.000000,3.776607e+09,94103
98,0,-1.000000,0.000000,0.000000e+00,7.570850e+07,0.000000e+00,1.156,0.000000,1.000000,2020,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,3.074273e+09,94103


In [20]:
# getting the shape of the dataset

final_dataset.shape


(10952, 149)
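The cleanup plan at the top of the notebook also lists removing duplicate rows. A minimal sketch of that check, assuming a row is identified by its `symbol` and `date_x` pair (the toy frame and its values are illustrative):

```python
import pandas as pd

# Toy frame with one repeated (symbol, date_x) pair
toy = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "AAPL", "ABNB"],
    "date_x": ["2024-09-28", "2024-09-28", "2024-06-29", "2020-12-31"],
    "stockPrice": [227.79, 227.79, 210.62, 146.80],
})

# Rows that repeat an earlier (symbol, date_x) pair
dup_mask = toy.duplicated(subset=["symbol", "date_x"], keep="first")
print("Duplicate rows:", int(dup_mask.sum()))

# Keep the first occurrence of each pair
toy_dedup = toy.drop_duplicates(
    subset=["symbol", "date_x"], keep="first"
).reset_index(drop=True)

print(toy_dedup.shape)
```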

In [21]:
# getting information about the dataset

final_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10952 entries, 0 to 10951
Columns: 149 entries, addTotalDebt to zip
dtypes: float64(127), int64(5), object(17)
memory usage: 12.5+ MB


In [22]:
# describe the dataset

final_dataset.describe()

Unnamed: 0,addTotalDebt,assetGrowth,assetTurnover,averageInventory,averagePayables,averageReceivables,beta,bookValuePerShare,bookValueperShareGrowth,calendarYear,...,tenYShareholdersEquityGrowthPerShare,threeYDividendperShareGrowthPerShare,threeYNetIncomeGrowthPerShare,threeYOperatingCFGrowthPerShare,threeYRevenueGrowthPerShare,threeYShareholdersEquityGrowthPerShare,totalDebtToCapitalization,weightedAverageSharesDilutedGrowth,weightedAverageSharesGrowth,workingCapital
count,10952.0,10952.0,10939.0,10711.0,10711.0,10711.0,10952.0,10908.0,10952.0,10952.0,...,10952.0,10952.0,10952.0,10952.0,10952.0,10952.0,10952.0,10952.0,10952.0,10892.0
mean,464589000000.0,0.032221,0.142551,-98489820000.0,29324570000.0,53483370000.0,0.989851,93.124498,279.0834,2019.342038,...,109.9521,692.9349,152.376202,65.732776,232.0106,182.6097,0.488538,186.146,183.9849,1456978000000.0
std,3610263000000.0,0.389687,0.133977,3586716000000.0,223605100000.0,607999700000.0,0.512182,577.086465,16864.42,2.961672,...,11123.6,37597.02,11127.602769,6769.16464,14269.88,13455.15,0.339665,13776.34,13620.51,14827370000000.0
min,0.0,-1.0,-0.867728,-128654400000000.0,-39016000000.0,0.0,0.0,-745.016336,-22.39351,2000.0,...,-178.0765,-1.0,-739.225065,-1190.907592,-2.877882,-236.7995,-6.421793,-1.0,-1.0,-6888583000000.0
25%,5498440000.0,-0.005952,0.05867,0.0,371912200.0,942091000.0,0.615,8.232665,-0.01263626,2017.0,...,0.0,0.0,-0.084246,-0.088423,0.04709619,-0.009825891,0.301768,-0.006794129,-0.006036531,-549241000.0
50%,14982500000.0,0.010477,0.11722,978000000.0,1583000000.0,2867500000.0,0.992,21.481232,0.01393778,2019.0,...,0.6296454,0.1586258,0.303281,0.240013,0.2258442,0.1732742,0.471439,-0.0005332701,-6.820482e-07,1957500000.0
75%,43220250000.0,0.035741,0.18137,3531750000.0,6303000000.0,7947750000.0,1.286,43.322532,0.04554534,2022.0,...,1.562238,0.3990981,0.94631,0.850525,0.4574796,0.4293866,0.633097,0.001622793,0.001467575,8525720000.0
max,95619150000000.0,20.960384,1.293004,38979600000000.0,5121950000000.0,15552640000000.0,3.92,11135.951461,1064756.0,2025.0,...,1164082.0,3050608.0,882351.941125,708399.839402,1035709.0,1009785.0,8.453252,1038577.0,1038577.0,253131200000000.0
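The `count` row above is below 10952 for a few columns, which indicates missing values. Counting them per column is one way to start on the "identifying missing data" step from the cleanup plan. The sketch below uses a toy frame with assumed values; how to handle the gaps (drop or impute) is left open.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two of its three columns
toy = pd.DataFrame({
    "assetTurnover": [0.26, np.nan, 0.27, 0.34],
    "averageInventory": [6.7e9, 6.2e9, np.nan, np.nan],
    "beta": [1.24, 1.24, 1.24, 1.24],
})

# Count missing values per column and keep only the columns that have any
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

# Share of rows affected, to help decide between dropping and imputing
missing_share = missing_counts / len(toy)

print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```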


### Saving the final dataset

In [23]:
# saving the final dataset as a CSV file in the Data/Datasets folder

final_dataset.to_csv('Data/Datasets/starting_financial_data_combined.csv', index=False)