# Beginning state:
- For preliminaries, see the "Initial Data Gathering" notebook, with the caveat that I did NOT upload the associated data, so you won't be able to re-run it. As downloaded from the multiple Kaggle datasets, the raw data came to 4+ GB across 200K+ individual files.
- For this notebook, I have combed through those initial datasets to collect data on as many of the 2020 Fortune 100 companies as I could match by ticker/symbol, and produced CSV files of those reduced datasets, which I was able to upload. You should be able to re-run this notebook if you have access to my "RawData" folder.

## Goals and notes:
- Concentrating, for now, on balance-sheet data vs price data.
- Goal is to look for correlations between balance-sheet information, price data, and possibly some calculated fields like P/E ratio, market capitalization, etc.
    - I intend to limit price data to the area around the date of each filing, assuming that price movements between the day before/day of/day after the filing are the most likely to reflect the influence of the new information.
    - This means I may need to be pretty careful about getting the "right" release date.
- I have two different datasets each for market data and for financial data. I need to decide whether to collate them or pick one dataset of each kind and stick with it.
    - I am suspicious of one of the market datasets; price data seemed artificially low and there is a comment on Kaggle that the creator has "adjusted" all of the price data to account for dividends and splits. Not sure if this will suit my current purpose.
- I suspect most of the work is going to be in the Finnhub data (the bigger financials dataset). Its files are currently organized so that each row is a "topic" within a filing, so it's hard to see which topics appear repeatedly and which will turn out to be sparse across many companies and many filings.
    - Beware the sunk costs... I really don't want to toss out this data, since I put a lot of work into getting it into its current state. But the other financials dataset is already columnar and will be simpler to work with.
    - On the other hand, that means I don't know how the creator of the simpler dataset did the simplifying.
- My ideal is to produce a dataframe that contains something like:
    - Filing date
    - Filing metadata (quarter/FY, anything else that seems important)
    - Company ticker
    - <balance sheet categories, possibly the least-sparse of them>
    - Close price date - 1
    - Close price date + 0
    - Close price date + 1
    - Stock data like earnings, dividends, and shares outstanding
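
As a rough illustration of the calculated fields mentioned in the goals, here is a minimal sketch on made-up numbers. It assumes a merged frame that already has a `close` price next to `commonstocksharesissued` and `earningspersharebasic`, and it uses shares issued as a stand-in for shares outstanding. The `toy` frame and the `market_cap` and `pe_ratio` column names are hypothetical. The P/E here uses EPS as reported in a single filing, which is not the same as a trailing-twelve-month figure.

```python
import pandas as pd

# Toy rows with made-up numbers; column names follow the financials dataset,
# plus a 'close' column assumed to come from the price data after merging.
toy = pd.DataFrame({
    'ticker': ['AAA', 'BBB'],
    'close': [50.0, 20.0],
    'commonstocksharesissued': [1_000_000.0, 2_000_000.0],
    'earningspersharebasic': [2.5, -1.0],
})

# Market capitalization: price times share count (shares issued is used here
# as a stand-in for shares outstanding, which is an assumption).
toy['market_cap'] = toy['close'] * toy['commonstocksharesissued']

# Price/earnings on the EPS as reported; left as NaN when EPS is not positive.
eps = toy['earningspersharebasic']
toy['pe_ratio'] = (toy['close'] / eps).where(eps > 0)

print(toy[['ticker', 'market_cap', 'pe_ratio']])
```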
    

## 1. Start with the simple
- I'm going to pick the "historical financials" dataset (the simpler SEC dataset) and the "historical prices" market dataset (the one whose price data I'm less suspicious of).

In [1]:
import pandas as pd
import os

data_dir = os.path.join(os.getcwd(), 'RawData')

In [9]:
hf_df = pd.read_csv(os.path.join(data_dir, 'HF-financials-tickers.csv'), parse_dates=['filing_date'], index_col=0)
print(hf_df.head())

    commonstocksharesissued  assetscurrent  accountspayablecurrent  \
11                   1000.0   1.789100e+10            1.625000e+09   
12                   1000.0   1.838000e+10            1.580000e+09   
13                   1000.0   1.705600e+10            1.546000e+09   
14                   1000.0   1.705600e+10            1.546000e+09   
15                   1000.0   1.705600e+10            1.546000e+09   

    commonstockvalue  liabilities  liabilitiesandstockholdersequity  \
11               NaN          NaN                      6.001200e+10   
12               NaN          NaN                      6.079300e+10   
13               NaN          NaN                      5.809200e+10   
14               NaN          NaN                      5.809200e+10   
15               NaN          NaN                      5.809200e+10   

    stockholdersequity  earningspersharebasic  netincomeloss  profitloss  ...  \
11       -7.800000e+08                   3.92   1.282000e+09         Na

In [10]:
print(hf_df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2743 entries, 11 to 100690
Data columns (total 44 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   commonstocksharesissued           2471 non-null   float64       
 1   assetscurrent                     2375 non-null   float64       
 2   accountspayablecurrent            2083 non-null   float64       
 3   commonstockvalue                  2160 non-null   float64       
 4   liabilities                       1751 non-null   float64       
 5   liabilitiesandstockholdersequity  2743 non-null   float64       
 6   stockholdersequity                2503 non-null   float64       
 7   earningspersharebasic             2739 non-null   float64       
 8   netincomeloss                     2639 non-null   float64       
 9   profitloss                        1906 non-null   float64       
 10  costofgoodssold                   1168 non-nu

Yup, this data is pretty sparse. Right off the bat I see a few columns I should drop because they are entirely null.

In [30]:
hf_df2 = hf_df.dropna(axis='columns', how='all').reset_index()
hf_df2.rename({'stock':'ticker', 'filing_date':'date'}, axis=1, inplace=True) # need to normalize column names
print(hf_df2.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2743 entries, 0 to 2742
Data columns (total 41 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   index                             2743 non-null   int64         
 1   commonstocksharesissued           2471 non-null   float64       
 2   assetscurrent                     2375 non-null   float64       
 3   accountspayablecurrent            2083 non-null   float64       
 4   commonstockvalue                  2160 non-null   float64       
 5   liabilities                       1751 non-null   float64       
 6   liabilitiesandstockholdersequity  2743 non-null   float64       
 7   stockholdersequity                2503 non-null   float64       
 8   earningspersharebasic             2739 non-null   float64       
 9   netincomeloss                     2639 non-null   float64       
 10  profitloss                        1906 non-null 

While I think about "how sparse is too sparse" for some of these columns, I'll try pulling in share price and earnings data.
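
One way to put a number on "how sparse is too sparse" is to compute the fraction of non-null values per column and compare it against a cutoff. The sketch below uses a made-up `toy` frame rather than the real data, and the 50% cutoff (`min_coverage`) is an arbitrary placeholder, not a recommendation.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the financials data; values are made up.
toy = pd.DataFrame({
    'dense': [1.0, 2.0, 3.0, 4.0],
    'half': [1.0, np.nan, 3.0, np.nan],
    'sparse': [np.nan, np.nan, np.nan, 4.0],
})

# Fraction of non-null values per column, sorted from sparsest to densest.
coverage = toy.notna().mean().sort_values()
print(coverage)

# Hypothetical cutoff: keep columns that are at least 50% populated.
min_coverage = 0.5
keep_mask = toy.notna().mean() >= min_coverage
kept = toy.loc[:, keep_mask]
print(list(kept.columns))
```

The same `notna().mean()` call on the real frame would give a per-column coverage table to eyeball before choosing a cutoff.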

In [17]:
pr_df = pd.read_csv(os.path.join(data_dir, 'HP-market-ticker-prices.csv'), index_col=0, parse_dates=['date'])
print(pr_df.head())

  symbol       date   open   high    low  close  close_adjusted     volume  \
0   MSFT 2016-05-16  50.80  51.96  50.75  51.83         49.7013   20032017   
1   MSFT 2002-01-16  68.85  69.84  67.85  67.87         22.5902   30977700   
2   MSFT 2001-09-18  53.41  55.00  53.17  54.32         18.0802   41591300   
3   MSFT 2007-10-26  36.01  36.03  34.56  35.03         27.2232  288121200   
4   MSFT 2014-06-27  41.61  42.29  41.51  42.25         38.6773   74640000   

   split_coefficient  
0                1.0  
1                1.0  
2                1.0  
3                1.0  
4                1.0  


In [18]:
print(pr_df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 437565 entries, 0 to 23526281
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   symbol             437565 non-null  object        
 1   date               437565 non-null  datetime64[ns]
 2   open               437565 non-null  float64       
 3   high               437565 non-null  float64       
 4   low                437565 non-null  float64       
 5   close              437565 non-null  float64       
 6   close_adjusted     437565 non-null  float64       
 7   volume             437565 non-null  int64         
 8   split_coefficient  437565 non-null  float64       
dtypes: datetime64[ns](1), float64(6), int64(1), object(1)
memory usage: 33.4+ MB
None


If I multi-index these by ticker and date, will that make it easier to combine them?
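
For comparison, here is a small sketch on made-up data showing two routes to the same result: joining on a shared `(ticker, date)` MultiIndex, or merging on the two key columns directly. The `filings` and `prices` frames are toy stand-ins, not the real tables.

```python
import pandas as pd

# Toy stand-ins for the filings and price tables; values are made up.
filings = pd.DataFrame({
    'ticker': ['AAA', 'AAA'],
    'date': pd.to_datetime(['2020-01-02', '2020-04-01']),
    'assetscurrent': [10.0, 12.0],
})
prices = pd.DataFrame({
    'ticker': ['AAA', 'AAA', 'AAA'],
    'date': pd.to_datetime(['2020-01-02', '2020-01-03', '2020-04-01']),
    'close': [5.0, 5.5, 6.0],
})

# Option 1: share a (ticker, date) MultiIndex and join on the index.
by_index = filings.set_index(['ticker', 'date']).join(
    prices.set_index(['ticker', 'date']), how='inner')

# Option 2: leave both as plain columns and merge on the two keys.
by_columns = filings.merge(prices, on=['ticker', 'date'], how='inner')

print(by_index)
print(by_columns)
```

Either way, the key names have to match in both frames, which is why the ticker/symbol and date columns need normalizing first.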

In [21]:
pr_df.rename({'symbol':'ticker'}, axis=1, inplace=True) # need to normalize what the ticker/symbol column name is
pr_df2 = pr_df.set_index(['ticker', 'date']).sort_index()
print(pr_df2.info())

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 437565 entries, ('AAL', Timestamp('2005-09-27 00:00:00')) to ('XOM', Timestamp('2020-11-11 00:00:00'))
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   open               437565 non-null  float64
 1   high               437565 non-null  float64
 2   low                437565 non-null  float64
 3   close              437565 non-null  float64
 4   close_adjusted     437565 non-null  float64
 5   volume             437565 non-null  int64  
 6   split_coefficient  437565 non-null  float64
dtypes: float64(6), int64(1)
memory usage: 24.8+ MB
None


In [22]:
print(pr_df2.head())

                    open   high    low  close  close_adjusted    volume  \
ticker date                                                               
AAL    2005-09-27  21.05  21.40  19.10  19.30         18.6645   2576944   
       2005-09-28  19.30  20.53  19.20  20.50         19.8250  15409920   
       2005-09-29  20.40  20.58  20.10  20.21         19.5445   2890617   
       2005-09-30  20.26  21.05  20.18  21.01         20.3182   8373458   
       2005-10-03  20.90  21.75  20.90  21.50         20.7920   2836193   

                   split_coefficient  
ticker date                           
AAL    2005-09-27                1.0  
       2005-09-28                1.0  
       2005-09-29                1.0  
       2005-09-30                1.0  
       2005-10-03                1.0  


In [31]:
hf_df3 = hf_df2.set_index(['ticker', 'date']).sort_index()
print(hf_df3.info())

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2743 entries, ('AAL', Timestamp('2010-07-21 00:00:00')) to ('XOM', Timestamp('2019-05-02 00:00:00'))
Data columns (total 39 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   index                             2743 non-null   int64  
 1   commonstocksharesissued           2471 non-null   float64
 2   assetscurrent                     2375 non-null   float64
 3   accountspayablecurrent            2083 non-null   float64
 4   commonstockvalue                  2160 non-null   float64
 5   liabilities                       1751 non-null   float64
 6   liabilitiesandstockholdersequity  2743 non-null   float64
 7   stockholdersequity                2503 non-null   float64
 8   earningspersharebasic             2739 non-null   float64
 9   netincomeloss                     2639 non-null   float64
 10  profitloss                        1906 non-null   float

In [32]:
hf_df3.drop(labels='index', axis=1, inplace=True)
print(hf_df3.info())

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2743 entries, ('AAL', Timestamp('2010-07-21 00:00:00')) to ('XOM', Timestamp('2019-05-02 00:00:00'))
Data columns (total 38 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   commonstocksharesissued           2471 non-null   float64
 1   assetscurrent                     2375 non-null   float64
 2   accountspayablecurrent            2083 non-null   float64
 3   commonstockvalue                  2160 non-null   float64
 4   liabilities                       1751 non-null   float64
 5   liabilitiesandstockholdersequity  2743 non-null   float64
 6   stockholdersequity                2503 non-null   float64
 7   earningspersharebasic             2739 non-null   float64
 8   netincomeloss                     2639 non-null   float64
 9   profitloss                        1906 non-null   float64
 10  costofgoodssold                   1168 non-null   float

In [27]:
print(hf_df3.head())

                    commonstocksharesissued  assetscurrent  \
ticker filing_date                                           
AAL    2010-07-21               339389724.0   7.344000e+09   
       2010-10-20               339389724.0   6.837000e+09   
       2011-02-16               339389724.0   6.838000e+09   
       2011-04-20               341207797.0   8.825000e+09   
       2011-07-20               341207797.0   7.997000e+09   

                    accountspayablecurrent  commonstockvalue  liabilities  \
ticker filing_date                                                          
AAL    2010-07-21             1.305000e+09       339000000.0          NaN   
       2010-10-20             1.220000e+09       339000000.0          NaN   
       2011-02-16             1.156000e+09       339000000.0          NaN   
       2011-04-20             1.267000e+09       339000000.0          NaN   
       2011-07-20             1.291000e+09       341000000.0          NaN   

                    liabi

### Pull out 3 close prices per filing date

In [35]:
# in pseudocode, I want to:
# for each ticker and filing date in hf_df (currently hf_df3):
#     retrieve the closing price for the same ticker and date from pr_df (currently pr_df2), stored as a new column in hf_df
#     retrieve the closing price for the same ticker and date-1, stored as another new column in hf_df
#     retrieve the closing price for the same ticker and date+1, stored as a third new column in hf_df
# optionally, store the 3 columns in their own dataframe on the way to merging into hf_df

# research:
# https://stackoverflow.com/questions/50655370/filtering-the-dataframe-based-on-the-column-value-of-another-dataframe
# https://datascience.stackexchange.com/questions/47562/multiple-filtering-pandas-columns-based-on-values-in-another-column
# https://pandas.pydata.org/docs/user_guide/indexing.html#selection-by-callable
# also looking at merge (inner join could work for same-date, but how to only get the "close" column?)
# indexing with query or where methods?

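# I suspected the "date" index levels need the same name in both frames for the indexes to match; done (see the renames above)

# merge defaults to how='inner': only (ticker, date) pairs present in both frames are kept
test_df = hf_df3.merge(pr_df2, left_index=True, right_index=True)

# suggestion to use SQL instead, since merge only joins on exact key matches, not range conditions: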
# https://stackoverflow.com/questions/30627968/merge-pandas-dataframes-where-one-value-is-between-two-others

# "interval index" idea, although it seems to be panned as the slow way in above question:
# https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range/46526249#46526249

In [36]:
print(test_df.info())

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2735 entries, ('AAL', Timestamp('2010-07-21 00:00:00')) to ('XOM', Timestamp('2019-05-02 00:00:00'))
Data columns (total 45 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   commonstocksharesissued           2464 non-null   float64
 1   assetscurrent                     2367 non-null   float64
 2   accountspayablecurrent            2076 non-null   float64
 3   commonstockvalue                  2153 non-null   float64
 4   liabilities                       1746 non-null   float64
 5   liabilitiesandstockholdersequity  2735 non-null   float64
 6   stockholdersequity                2495 non-null   float64
 7   earningspersharebasic             2733 non-null   float64
 8   netincomeloss                     2631 non-null   float64
 9   profitloss                        1902 non-null   float64
 10  costofgoodssold                   1164 non-null   float

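This works for same-date prices. Note that the result has 2735 rows versus 2743 filings, so a few filings apparently have no price row on the filing date. Do I want to also keep volume info for all 3 dates? I'd like to try...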
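
The three-date idea from the pseudocode, including volume, could be sketched as follows. This is one possible approach, not necessarily the one the notebook ends up using: it shifts `close` and `volume` by one row within each ticker, so "date - 1" and "date + 1" mean the previous and next trading day in the price table rather than calendar days (an assumption). The data is made up, and the `_prev`/`_next` column names are hypothetical.

```python
import pandas as pd

# Toy price table (made-up values), one row per ticker and trading day.
prices = pd.DataFrame({
    'ticker': ['AAA'] * 4,
    'date': pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06', '2020-01-07']),
    'close': [10.0, 11.0, 12.0, 13.0],
    'volume': [100, 200, 300, 400],
}).sort_values(['ticker', 'date'])

# Within each ticker, shift by one row to line up the previous and next
# trading day's values with the current row.
for col in ['close', 'volume']:
    prices[f'{col}_prev'] = prices.groupby('ticker')[col].shift(1)
    prices[f'{col}_next'] = prices.groupby('ticker')[col].shift(-1)

# Toy filings table; a left merge keeps filings even if no price row matches.
filings = pd.DataFrame({
    'ticker': ['AAA'],
    'date': pd.to_datetime(['2020-01-03']),
    'assetscurrent': [1.0e9],
})
window = filings.merge(prices, on=['ticker', 'date'], how='left')
print(window)
```

With a left merge, a filing dated on a non-trading day would still appear in the result, with NaN in the price columns, which makes such cases easy to spot.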