### Creating Financial Datasets [Company Profile Data]

In this notebook, we will create financial datasets that will be used in the subsequent notebooks to build predictive data models. 

The notebook will start with company profile data created previously. 

The company profile data will be cleaned to include only stocks and to keep the qualitative information required for future model development. 

For the remaining stocks, we will write code that connects to the Financial Modeling Prep API and downloads financial data for each stock.

The financial data will be: 

1. Historical stock prices
2. Financial statements key metrics
3. Financial statement ratios
4. Financial growth

Each of these datasets will be saved as a separate CSV file.

The final dataset will be a combination of all the above datasets.
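
The combination step can be sketched with pandas merges. This is a minimal illustration, not the notebook's actual implementation: the frames below are toy stand-ins, every column name other than `symbol` is hypothetical, and it assumes each dataset has already been reduced to one row per symbol.

```python
import pandas as pd

# Toy stand-ins for the per-stock datasets. Column names other than 'symbol'
# are hypothetical and the numbers are illustrative only.
profile = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC"],
    "sector": ["Technology", "Energy", "Utilities"],
})
ratios = pd.DataFrame({
    "symbol": ["AAA", "BBB"],
    "currentRatio": [1.5, 0.9],
})
growth = pd.DataFrame({
    "symbol": ["AAA", "CCC"],
    "revenueGrowth": [0.12, 0.03],
})

# Left-join each dataset onto the profile table so every stock is kept,
# with NaN where a dataset has no row for that symbol.
# validate="one_to_one" raises if a symbol appears more than once on either side.
combined = profile
for extra in (ratios, growth):
    combined = combined.merge(extra, on="symbol", how="left", validate="one_to_one")

print(combined)
```

A left join keeps the company profile table as the master list of stocks, so a stock with a missing dataset shows up as a gap rather than silently disappearing.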

In [1]:
# importing libraries

import pandas as pd

# importing the company profile data

company_profile_data = pd.read_csv('data/Datasets/company_profile_cleaned_50B.csv')

# checking the data

company_profile_data.head()



,symbol,price,beta,mktCap,companyName,currency,cik,isin,cusip,exchange,...,ceo,sector,country,city,state,zip,isEtf,isActivelyTrading,isAdr,isFund
0,NVDA,141.98,1.657,3482.7694,NVIDIA Corporation,USD,1045810.0,US67066G1040,67066G104,NASDAQ Global Select,...,Mr. Jen-Hsun Huang,Technology,US,Santa Clara,CA,95051,False,True,False,False
1,AAPL,225.0,1.24,3401.055,Apple Inc.,USD,320193.0,US0378331005,037833100,NASDAQ Global Select,...,Mr. Timothy D. Cook,Technology,US,Cupertino,CA,95014,False,True,False,False
2,MSFT,415.0,0.904,3085.4752,Microsoft Corporation,USD,789019.0,US5949181045,594918104,NASDAQ Global Select,...,Mr. Satya Nadella,Technology,US,Redmond,WA,98052-6399,False,True,False,False
3,AMZN,202.61,1.146,2130.44415,"Amazon.com, Inc.",USD,1018724.0,US0231351067,023135106,NASDAQ Global Select,...,Mr. Andrew R. Jassy,Consumer Cyclical,US,Seattle,WA,98109-5210,False,True,False,False
4,GOOGL,172.49,1.034,2120.041615,Alphabet Inc.,USD,1652044.0,US02079K3059,02079K305,NASDAQ Global Select,...,Mr. Sundar Pichai,Communication Services,US,Mountain View,CA,94043,False,True,False,False


In [2]:
# creating a list of all the columns in the company profile data

company_profile_data.columns

Index(['symbol', 'price', 'beta', 'mktCap', 'companyName', 'currency', 'cik',
       'isin', 'cusip', 'exchange', 'exchangeShortName', 'industry',
       'description', 'ceo', 'sector', 'country', 'city', 'state', 'zip',
       'isEtf', 'isActivelyTrading', 'isAdr', 'isFund'],
      dtype='object')

In [3]:
# defining the columns that will be kept in the final dataset

columns_to_keep = ['symbol', 'price', 'beta', 'mktCap', 'companyName', 'currency', 'cik', 'isin', 'cusip',
                   'exchange', 'exchangeShortName', 'industry', 'description',
                   'ceo', 'sector', 'country',
                   'city', 'state', 'zip', 'isEtf', 'isActivelyTrading', 'isAdr', 'isFund']

# keeping only the columns that are required

company_profile_data = company_profile_data[columns_to_keep]

# checking the data

company_profile_data.head()

,symbol,price,beta,mktCap,companyName,currency,cik,isin,cusip,exchange,...,ceo,sector,country,city,state,zip,isEtf,isActivelyTrading,isAdr,isFund
0,NVDA,141.98,1.657,3482.7694,NVIDIA Corporation,USD,1045810.0,US67066G1040,67066G104,NASDAQ Global Select,...,Mr. Jen-Hsun Huang,Technology,US,Santa Clara,CA,95051,False,True,False,False
1,AAPL,225.0,1.24,3401.055,Apple Inc.,USD,320193.0,US0378331005,037833100,NASDAQ Global Select,...,Mr. Timothy D. Cook,Technology,US,Cupertino,CA,95014,False,True,False,False
2,MSFT,415.0,0.904,3085.4752,Microsoft Corporation,USD,789019.0,US5949181045,594918104,NASDAQ Global Select,...,Mr. Satya Nadella,Technology,US,Redmond,WA,98052-6399,False,True,False,False
3,AMZN,202.61,1.146,2130.44415,"Amazon.com, Inc.",USD,1018724.0,US0231351067,023135106,NASDAQ Global Select,...,Mr. Andrew R. Jassy,Consumer Cyclical,US,Seattle,WA,98109-5210,False,True,False,False
4,GOOGL,172.49,1.034,2120.041615,Alphabet Inc.,USD,1652044.0,US02079K3059,02079K305,NASDAQ Global Select,...,Mr. Sundar Pichai,Communication Services,US,Mountain View,CA,94043,False,True,False,False


In [4]:
# cleaning up the data: selecting only stocks that are actively trading and are not ETFs or funds
# .copy() makes the filtered result an independent DataFrame, so the column assignments below do not trigger a SettingWithCopyWarning

company_profile_data = company_profile_data[(company_profile_data['isActivelyTrading'] == True) &
                                            (company_profile_data['isEtf'] == False) &
                                            (company_profile_data['isFund'] == False)].copy()

# adjust the mktCap column to be in billions
# the input file already stores mktCap in billions (NVDA is 3482.7694 above), so the division
# is applied only when the column still holds raw dollar values; this keeps the cell safe to re-run

if company_profile_data['mktCap'].max() > 1e6:
    company_profile_data['mktCap'] = company_profile_data['mktCap'] / 1_000_000_000

# cleaning up the data so that each 'companyName' appears only once. For companies listed under multiple symbols, keep the symbol with the largest market cap

# sorting the data by market cap

company_profile_data = company_profile_data.sort_values(by = 'mktCap', ascending = False)

# selecting only the first row for each company name

company_profile_data = company_profile_data.drop_duplicates(subset = 'companyName', keep = 'first')

# checking the data

company_profile_data

,symbol,price,beta,mktCap,companyName,currency,cik,isin,cusip,exchange,...,ceo,sector,country,city,state,zip,isEtf,isActivelyTrading,isAdr,isFund
0,NVDA,141.98,1.657,3482.7694,NVIDIA Corporation,USD,1045810.0,US67066G1040,67066G104,NASDAQ Global Select,...,Mr. Jen-Hsun Huang,Technology,US,Santa Clara,CA,95051,False,True,False,False
1,AAPL,225.00,1.240,3401.055,Apple Inc.,USD,320193.0,US0378331005,037833100,NASDAQ Global Select,...,Mr. Timothy D. Cook,Technology,US,Cupertino,CA,95014,False,True,False,False
2,MSFT,415.00,0.904,3085.4752,Microsoft Corporation,USD,789019.0,US5949181045,594918104,NASDAQ Global Select,...,Mr. Satya Nadella,Technology,US,Redmond,WA,98052-6399,False,True,False,False
3,AMZN,202.61,1.146,2130.44415,"Amazon.com, Inc.",USD,1018724.0,US0231351067,023135106,NASDAQ Global Select,...,Mr. Andrew R. Jassy,Consumer Cyclical,US,Seattle,WA,98109-5210,False,True,False,False
4,GOOGL,172.49,1.034,2120.041615,Alphabet Inc.,USD,1652044.0,US02079K3059,02079K305,NASDAQ Global Select,...,Mr. Sundar Pichai,Communication Services,US,Mountain View,CA,94043,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
277,AEP,96.31,0.539,51.29134,"American Electric Power Company, Inc.",USD,4904.0,US0255371017,025537101,NASDAQ Global Select,...,Mr. William J. Fehrman,Utilities,US,Columbus,OH,43215-2373,False,True,False,False
278,TRP,48.99,0.816,50.84427,TC Energy Corporation,USD,1232384.0,CA87807B1076,87807B107,New York Stock Exchange,...,Mr. Francois Lionel Poirier,Energy,CA,Calgary,AB,T2P 5H1,False,True,False,False
279,MPC,157.52,1.371,50.6252,Marathon Petroleum Corporation,USD,1510295.0,US56585A1025,56585A102,New York Stock Exchange,...,Ms. Maryann T. Mannen,Energy,US,Findlay,OH,45840-3229,False,True,False,False
280,MNST,52.00,0.742,50.57104,Monster Beverage Corporation,USD,865752.0,US61174X1090,61174X109,NASDAQ Global Select,...,"Mr. Rodney Cyril Sacks H.Dip.Law, H.Dip.Tax",Consumer Defensive,US,Corona,CA,92879,False,True,False,False
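
To see what the filter and de-duplication steps do, here is a small self-contained sketch on toy rows. The symbols, company names and market caps are invented for illustration, and the flag columns are assumed to have boolean dtype (which is what allows `&` and `~` to be used directly).

```python
import pandas as pd

# Toy rows (invented values, not real market data): two share classes of one
# company, one ETF, and one stock that is no longer actively trading.
toy = pd.DataFrame({
    "symbol": ["ALPHA", "ALPHB", "XETF", "OLDC"],
    "companyName": ["Alpha Holdings", "Alpha Holdings", "Example ETF", "Old Corp"],
    "mktCap": [210.0, 205.0, 300.0, 80.0],
    "isEtf": [False, False, True, False],
    "isFund": [False, False, False, False],
    "isActivelyTrading": [True, True, True, False],
})

# Keep actively trading rows that are neither ETFs nor funds.
mask = toy["isActivelyTrading"] & ~toy["isEtf"] & ~toy["isFund"]
stocks = toy[mask].copy()

# Sort by market cap so that keep="first" retains the largest listing
# for each company name.
stocks = (
    stocks.sort_values("mktCap", ascending=False)
          .drop_duplicates(subset="companyName", keep="first")
          .reset_index(drop=True)
)

print(stocks[["symbol", "companyName", "mktCap"]])
```

Only the `ALPHA` row survives: the ETF and the inactive stock are filtered out, and the smaller share class is dropped as a duplicate company name.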


In [5]:
# checking the data by looking at the breakdown of the data by different features

feature_comp = 'sector' # feature to compare

company_profile_data[feature_comp].value_counts()


sector
Financial Services        57
Technology                51
Industrials               35
Healthcare                31
Consumer Cyclical         26
Energy                    22
Communication Services    15
Consumer Defensive        15
Utilities                 12
Basic Materials           10
Real Estate                7
Name: count, dtype: int64
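
The counts above are easier to compare as shares of the total. The sketch below rebuilds the counts from the output shown and converts them to percentages; on the full DataFrame, `value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Sector counts copied from the value_counts() output above.
sector_counts = pd.Series({
    "Financial Services": 57,
    "Technology": 51,
    "Industrials": 35,
    "Healthcare": 31,
    "Consumer Cyclical": 26,
    "Energy": 22,
    "Communication Services": 15,
    "Consumer Defensive": 15,
    "Utilities": 12,
    "Basic Materials": 10,
    "Real Estate": 7,
}, name="count")

total = sector_counts.sum()  # 281 stocks in total

# Percentage share of each sector, rounded to one decimal place.
sector_share = (sector_counts / total * 100).round(1)

print(total)
print(sector_share)
```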

In [6]:
# adding a date column to mark when the dataset was created

company_profile_data['date'] = '2024-12-02' # date when the dataset was created


In [7]:
# saving the dataset with the same name as originally imported

company_profile_data.to_csv('data/Datasets/company_profile_cleaned_50B.csv', index = False)
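
One thing to watch when reloading a saved CSV is that identifier columns such as `cusip` can lose leading zeros if pandas infers them as integers. In the full file the column also contains letters, so it is read as text, but a subset with all-digit values would not be. The sketch below uses an in-memory buffer instead of a file and shows how passing `dtype` to `read_csv` keeps the values as strings.

```python
import io
import pandas as pd

# Two rows whose cusip values (taken from the table above) start with a zero.
toy = pd.DataFrame({
    "symbol": ["AAPL", "AMZN"],
    "cusip": ["037833100", "023135106"],
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without a dtype hint, the all-digit column is parsed as integers.
buffer.seek(0)
naive = pd.read_csv(buffer)

# With dtype=str for the column, the leading zeros survive the round trip.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"cusip": str})

print(naive["cusip"].tolist())
print(typed["cusip"].tolist())
```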
