#### Thinking out loud, at the beginning:
- Wow, these files are big. I'm probably going to need to pare down the data a lot.
- First, I'm going to look for the span of time for which I have the most complete data.
- What data do I actually want to focus on? (record some research on how, broadly, one comes up with a company valuation that could be comparable to market capitalization)
- I probably should focus my market data gathering on the day after each SEC filing is released, assuming that will be the day that stock price movements are most influenced by the filing information rather than other news or factors.
- My core question so far is, are market prices correlated with valuation information in some fashion?
    - If so, which valuation information is most strongly correlated?
    - What, if any, is the relationship between stock price/market capitalization and the business's balance sheet?
    - Do any of these data include attempts to put a value on intangibles such as goodwill/reputation?
    - The fundamental analysis tradition focuses on ratios of price to earnings, book value, earnings growth; do the ratios that investors find acceptable change over time? Do they tend to change back (or ahead) when SEC filings are released?
    - (Do I have time for this?) What happens in the periods between filings? Do prices maintain a range, like the moving average support/resistance bands favored by technical analysis?
- Therefore, what am I looking for in the data I've found so far?
    - As much history as I can assemble from both market data and SEC filings; will need to drop years where I have one but not the other.
    - Focus on the Fortune 100, probably choose the 100 as of the most recent year I have data for.
    - Market data at daily granularity, hopefully with open, high, low, and close prices for each day. Probably focus on dates related to the company's SEC filings.
    - Still deciding what SEC filing information to use, partly because I don't know what I have to work with yet. Hoping for a moderately quick thumbnail of value related to assets, liabilities, and maybe intangibles/goodwill. Assuming earnings-per-share numbers will also be found here.
    - If market data has already calculated P/E or other ratios, great, I can use those to sanity-check what I'm pulling in from SEC.
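
One possible way to pick "the first trading day after each filing" is `pandas.merge_asof` with `direction='forward'` and `allow_exact_matches=False`. This is a minimal sketch, not a settled method: the column names `filing_date`, `Date`, and `Close` mirror the datasets used in these notes, but every value below is invented for illustration.

```python
import pandas as pd

# Hypothetical filing dates (invented for illustration).
filings = pd.DataFrame({
    "symbol": ["AAPL", "AAPL"],
    "filing_date": pd.to_datetime(["2020-01-29", "2020-05-01"]),
})

# Hypothetical daily closes; weekends are absent, as in real market data.
prices = pd.DataFrame({
    "symbol": ["AAPL"] * 4,
    "Date": pd.to_datetime(["2020-01-29", "2020-01-30", "2020-05-01", "2020-05-04"]),
    "Close": [81.0, 80.9, 72.3, 73.3],
})

# merge_asof requires both frames to be sorted by their date keys.
filings = filings.sort_values("filing_date")
prices = prices.sort_values("Date")

# direction="forward" looks for the next price row at or after the filing date;
# allow_exact_matches=False excludes the filing date itself.
next_day = pd.merge_asof(
    filings, prices,
    left_on="filing_date", right_on="Date",
    by="symbol",
    direction="forward",
    allow_exact_matches=False,
)
print(next_day[["symbol", "filing_date", "Date", "Close"]])
```

Because the match is on the next available row rather than on "date + 1", a filing released on a Friday lines up with Monday's prices.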

# First crack at a plan:

## 1. Open each dataset (I have 4 right now) individually, probably not in code. 
- Need to see file structure and years covered. Make decision on how much history to collect/collate.
    - Market datasets from:
        - https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
            - description does not offer a begin date but last updated in 2017-Nov
            - data appears to be in txt files by ticker: file for AAPL dates to 1984-Sep and columns are Date,Open,High,Low,Close,Volume,OpenInt
            - there is a comment that prices in this data are adjusted for dividends AND splits (as opposed to Yahoo Finance data, which is not, or not always, adjusted; I'm not sure yet what that means)
        - https://www.kaggle.com/tsaustin/us-historical-stock-prices-with-earnings-data
            - 20 years according to description; last updated 2020-Dec
            - Data in 4 files: stock price, earnings (estimated and reported), dividends, and a summary
    - SEC datasets from:
        - https://www.kaggle.com/miguelaenlle/parsed-sec-10q-filings-since-2006
            - Since 2006 according to description; Last updated 2020-Jul
            - Notes that ~50% of data is null, which might be 'not found in filing' or 'not found by parser'.
            - Does have lots of promising columns like commonstocksharesissued, assetscurrent, accountspayablecurrent, commonstockvalue, liabilities, liabilitiesandstockholdersequity, stockholdersequity, earningspersharebasic, netincomeloss, profitloss, costofgoodssold, filing_date, costsandexpenses, cash 
        - https://www.kaggle.com/finnhub/reported-financials
            - 2010-2020, according to description. Last updated 2021-Mar
            - 4 GB data!
            - Data is in folders by year-qtr; JSON files. Will take some work, I think, to find the tickers I want.
            
So, I can probably get 2010-2020 reasonably complete. Maybe 2006-2020.

In [74]:
import pandas as pd
#import numpy as np
import os
import json

# lots of code will depend on this ticker list
ticker_list = ['AAPL']

## 2. Open each dataset in code and examine columns.
See if I can choose one market and one SEC dataset; do they contain the same information? Is there just a column or two from one I can add to the other?

### 2a: the Finnhub SEC filings dataset
- Hopefully the most complicated one!

In [12]:
# Note some of this code may not work as written except on my own computer, or if you download the datasets like I did.
# Not planning to upload all this data, it's waaay too much.
# CAUTION: walking all this data will take a long time! Over 200K files, expect 20+ min of "busy" before you get an answer.
# I am grateful that Jupyter changes the tab icon to hourglass when a notebook is busy!

# For Finnhub reported-financials dataset:
# Recurse through many YYYY.Q# subdirectories using os.walk
# Acquire list of files containing the ticker(s) I want (define a list somewhere early!)
# Copy those files to a "data I'm keeping" folder
# Import reduced file set into a dataframe for analysis
# I thought about defining a function, but it's pretty specific to how this one dataset is laid out, is it worth it?


start_dir = os.path.join(os.getcwd(), 'SEC-FinancialsAsReported') 
files_list = []
problem_files = []

for root, subdirs, files in os.walk(start_dir):
    for filename in files:
        if os.path.splitext(filename)[1] == ('.json'):
            file_path = os.path.join(root, filename)
            try:
                df = pd.read_json(file_path) # is there a better way to read json if all I want is to check one column?
            except ValueError as val_err:
                problem_files.append((file_path, "Value Error: {0}".format(val_err)))
            except Exception as err:
                # anything else is unexpected: report the error type and re-raise
                print("Unexpected error:", type(err).__name__)
                raise
            else:
                set_tickers = set(ticker_list)
                set_symbols = set(df['symbol'].unique())
                if set_tickers.intersection(set_symbols):
                    # print('Found one!')
                    files_list.append(file_path)

print('Files found: ' + str(len(files_list)))
print('Problem files found: ' + str(len(problem_files)))

Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Value Error: Value is too big
Value Error1: Expected object or value
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Found one!
Files found: 44
Problem files found: 2
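
On the question in the code comment about a cheaper way to check one field: only the top-level `symbol` value is needed, so the standard-library `json` module can read it without building a DataFrame. A minimal sketch, assuming each file is a single JSON object with a top-level `symbol` key (the `json_normalize` calls later in this notebook rely on the same layout). `find_ticker_files` is a hypothetical helper name, and I have not timed it against the 200K files.

```python
import json
import os

def find_ticker_files(start_dir, tickers):
    """Return (matching_paths, problem_files) for JSON files under start_dir."""
    wanted = set(tickers)
    matches, problems = [], []
    for root, _subdirs, files in os.walk(start_dir):
        for filename in files:
            if not filename.endswith(".json"):
                continue
            file_path = os.path.join(root, filename)
            try:
                with open(file_path, encoding="utf-8") as f:
                    record = json.load(f)
            except (ValueError, OSError) as err:
                # json.JSONDecodeError is a subclass of ValueError
                problems.append((file_path, str(err)))
                continue
            # Assumption: one JSON object per file, ticker stored under "symbol".
            if isinstance(record, dict) and record.get("symbol") in wanted:
                matches.append(file_path)
    return matches, problems
```

Usage would look like `files_list, problem_files = find_ticker_files(start_dir, ticker_list)`.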


In [82]:
# experimenting with the one problem file. Set lines? orient? skip because it's one file out of 200K or so?
'''
print(os.getcwd())
file_problem = os.path.join(os.getcwd(),'SEC-FinancialsAsReported\\2016.QTR4\\0001098009-16-000023.json')
print(file_problem)
'''
# Tried a lot of variations to get os.path.join to work by hand (as opposed to using os.walk).
# The two most important things seem to be: start with os.getcwd(), and escape any backslashes in the path.

'''
df = pd.read_json(file_problem) #as done in filewalk above, "value is too big"
df = pd.read_json(file_problem, lines=True) #did not work, "expected object or value"
df = pd.read_json(file_problem, orient='records') #did not work, "value is too big"
df = pd.read_json(file_problem, orient='split') #did not work, "value is too big"
df = pd.read_json(file_problem, orient='index') #did not work, "value is too big"
df = pd.read_json(file_problem, orient='columns') #did not work, "value is too big"
df = pd.read_json(file_problem, orient='values') #did not work, "value is too big"
print(df.head())
'''
# nothing is working so far, skip this damn file. Opened in Notepad++ and its ticker is 'CWNM' 
# which I don't think is on my list anyway.
;

''
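
A possible explanation, offered as an assumption rather than a diagnosis: as far as I know, `pd.read_json` uses a C-based parser that raises "Value is too big" for integers that don't fit in 64 bits, while the standard-library `json` module accepts integers of any size. A minimal sketch of reading such a file with the stdlib parser instead. `load_json_stdlib` is a hypothetical helper name and the sample payload is invented; I have not confirmed that the problem file actually contains an oversized integer.

```python
import json

def load_json_stdlib(path):
    """Parse a JSON file with the stdlib parser, which accepts arbitrarily large ints."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Invented payload: the value is far larger than a 64-bit integer can hold.
sample = (
    '{"symbol": "XYZ", "data": {"bs": [{"concept": "Assets", '
    '"value": 123456789012345678901234567890}]}}'
)
parsed = json.loads(sample)
big_value = parsed["data"]["bs"][0]["value"]
print(type(big_value).__name__, big_value > 2**64)
```

If the stdlib parser does read the file, the resulting dict can go straight into `pd.json_normalize`, although a column holding such large values would probably end up as `object` dtype.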

#### Now, to see what I have for my ticker(s) in finnhub files:

In [116]:
# concat/map solution found at https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe
#finnhub_df = pd.concat(map(pd.read_json, files_list))
# not working well due to nested JSON, I think...

# json_normalize found at https://stackoverflow.com/questions/20638006/convert-list-of-dictionaries-to-a-pandas-dataframe
# convert to dict clue found at https://stackoverflow.com/questions/55710154/python-json-normalize-string-indices-must-be-integers-error
# nested JSON clue 1: https://stackoverflow.com/questions/47341519/json-normalize-for-dicts-within-dicts
# nested JSON clue 2: https://stackoverflow.com/questions/46091362/how-to-normalize-json-correctly-by-python-pandas (needs json.load)
# nested JSON clue 3: https://www.kaggle.com/jboysen/quick-tutorial-flatten-nested-json-in-pandas/notebook
bs_df_list = []
cf_df_list = []
ic_df_list = []

meta_cols = ['startDate','endDate','year','quarter','symbol']

for file_path in files_list:
    with open(file_path) as f:
        d = json.load(f)
    # one long-format frame per statement: balance sheet, cash flow, income
    balsheet_df = pd.json_normalize(d, record_path=['data','bs'], meta=meta_cols)
    cashflow_df = pd.json_normalize(d, record_path=['data','cf'], meta=meta_cols)
    income_df = pd.json_normalize(d, record_path=['data','ic'], meta=meta_cols)
    bs_df_list.append(balsheet_df)
    cf_df_list.append(cashflow_df)
    ic_df_list.append(income_df)
    
#print(bs_df_list)
bs_df = pd.concat(bs_df_list, ignore_index=True)
cf_df = pd.concat(cf_df_list, ignore_index=True)
ic_df = pd.concat(ic_df_list, ignore_index=True)

print(bs_df.head())
print(bs_df.info())
print(bs_df.describe())
print(bs_df.endDate.unique())

                                               label  \
0                    Long-term marketable securities   
1  Tangible assets that are held by an entity for...   
2                                           Goodwill   
3                    Acquired intangible assets, net   
4                                       Other assets   

                                             concept unit        value  \
0               AvailableForSaleSecuritiesNoncurrent  usd  18549000000   
1  aapl:PropertyPlantAndEquipmentAndCapitalizedSo...  usd   3504000000   
2                                           Goodwill  usd    480000000   
3               IntangibleAssetsNetExcludingGoodwill  usd    263000000   
4                              OtherAssetsNoncurrent  usd   1925000000   

    startDate     endDate  year quarter symbol  
0  2009-09-27  2010-03-27  2010      Q2   AAPL  
1  2009-09-27  2010-03-27  2010      Q2   AAPL  
2  2009-09-27  2010-03-27  2010      Q2   AAPL  
3  2009-09-27  2010-03

In [117]:
print(cf_df.head())
print(cf_df.info())
print(cf_df.describe())
print(cf_df.endDate.unique())

                                               label  \
0           Depreciation, amortization and accretion   
1                   Stock-based compensation expense   
2                        Deferred income tax expense   
3  Loss on disposition of property, plant and equ...   
4                           Accounts receivable, net   

                                   concept unit       value   startDate  \
0  DepreciationAmortizationAndAccretionNet  usd   425000000  2009-09-27   
1                   ShareBasedCompensation  usd   436000000  2009-09-27   
2          DeferredIncomeTaxExpenseBenefit  usd   893000000  2009-09-27   
3   GainLossOnSaleOfPropertyPlantEquipment  usd    -9000000  2009-09-27   
4     IncreaseDecreaseInAccountsReceivable  usd  -482000000  2009-09-27   

      endDate  year quarter symbol  
0  2010-03-27  2010      Q2   AAPL  
1  2010-03-27  2010      Q2   AAPL  
2  2010-03-27  2010      Q2   AAPL  
3  2010-03-27  2010      Q2   AAPL  
4  2010-03-27  2010      Q2

In [118]:
print(ic_df.head())
print(ic_df.info())
print(ic_df.describe())
print(ic_df.endDate.unique())

                                 label  \
0            Earnings Per Share, Basic   
1          Earnings Per Share, Diluted   
2             Research and development   
3  Selling, general and administrative   
4             Total operating expenses   

                                  concept        unit         value  \
0                   EarningsPerShareBasic  usd/shares  7.120000e+00   
1                 EarningsPerShareDiluted  usd/shares  7.000000e+00   
2           ResearchAndDevelopmentExpense         usd  8.240000e+08   
3  SellingGeneralAndAdministrativeExpense         usd  2.508000e+09   
4                       OperatingExpenses         usd  3.332000e+09   

    startDate     endDate  year quarter symbol  
0  2009-09-27  2010-03-27  2010      Q2   AAPL  
1  2009-09-27  2010-03-27  2010      Q2   AAPL  
2  2009-09-27  2010-03-27  2010      Q2   AAPL  
3  2009-09-27  2010-03-27  2010      Q2   AAPL  
4  2009-09-27  2010-03-27  2010      Q2   AAPL  
<class 'pandas.core.frame.

##### Some things I needed to find out:
- Why are all these reports split into records called "bs/cf/ic"? What are those abbreviations for?
- how about... **B**alance **S**heet, **C**ash **F**low, **I**n**C**ome statement? https://www.sec.gov/oiea/reportspubs/investor-publications/beginners-guide-to-financial-statements.html
- I want to see more of what's in those "data" fields: learning how to unpack nested JSON
- `describe()` doesn't do much when every number is a different thing!

***TODO:*** code to save the files to a location I'm willing to upload, code to save the dataframes for later, melt these dataframes and see where the commonalities are (group/sort by label and/or concept)
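
For the "melt these dataframes" item: the statement frames are already in long form (one row per concept per period), so a pivot to wide form may be the more useful direction. A minimal sketch using a few rows shaped like the `bs_df` output above. The 2010-06-26 values are invented, and the output filename is hypothetical.

```python
import pandas as pd

# Rows shaped like bs_df; the 2010-06-26 values are invented for illustration.
bs_sample = pd.DataFrame({
    "concept": ["Goodwill", "OtherAssetsNoncurrent", "Goodwill", "OtherAssetsNoncurrent"],
    "value": [480000000, 1925000000, 500000000, 2000000000],
    "endDate": ["2010-03-27", "2010-03-27", "2010-06-26", "2010-06-26"],
    "symbol": ["AAPL"] * 4,
})

# One row per (symbol, endDate), one column per concept.
# aggfunc="first" keeps the first value if a concept repeats within a filing.
bs_wide = bs_sample.pivot_table(
    index=["symbol", "endDate"], columns="concept", values="value", aggfunc="first"
)
print(bs_wide)

# Count how many periods report each concept: a quick view of the commonalities.
coverage = bs_wide.notna().sum().sort_values(ascending=False)
print(coverage)

# Saving for later (hypothetical filename):
# bs_wide.to_csv("aapl_balance_sheet_wide.csv")
```

Pivoting on `concept` rather than `label` is a judgment call: the labels in the output above are free text, while the concept names look more stable from filing to filing.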

### 2b. The miguelaenlle SEC dataset "Historical Financials"
- Should be far less complicated! Everything in one CSV file

In [105]:
hf_path = os.path.join(os.getcwd(), 'SEC-HistoricalFinancials', 'quarterly_financials.csv')
hf_df = pd.read_csv(hf_path, index_col=0, low_memory=False)
# dtype warning for mixed types in column 43?
#print(hf_df.iloc[:, 43])
#print(hf_df['stock'].unique())
#print(hf_df.info())
# maybe this is about the first ticker being 'AA$'?
hf_tickers_df = hf_df[hf_df['stock'].isin(ticker_list)]
print(hf_tickers_df.info())
print(hf_tickers_df.describe())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39 entries, 203 to 241
Data columns (total 44 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   commonstocksharesissued           39 non-null     float64
 1   assetscurrent                     39 non-null     float64
 2   accountspayablecurrent            39 non-null     float64
 3   commonstockvalue                  24 non-null     float64
 4   liabilities                       39 non-null     float64
 5   liabilitiesandstockholdersequity  39 non-null     float64
 6   stockholdersequity                39 non-null     float64
 7   earningspersharebasic             39 non-null     float64
 8   netincomeloss                     39 non-null     float64
 9   profitloss                        0 non-null      float64
 10  costofgoodssold                   0 non-null      float64
 11  filing_date                       39 non-null     object 
 12  costsan

### 2c. The borismarjanovic 'Huge Dataset' of market prices

In [108]:
hd_path = os.path.join(os.getcwd(), 'Market-HugeDataset', 'Stocks')
hd_filelist = []
for ticker in ticker_list:
    filename = ticker.lower() + '.us.txt'
    hd_filelist.append(os.path.join(hd_path, filename))
    
# concat/map solution found at https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe
hd_df = pd.concat(map(pd.read_csv, hd_filelist))
print(hd_df.head())
print(hd_df.info())
print(hd_df.describe())

         Date     Open     High      Low    Close    Volume  OpenInt
0  1984-09-07  0.42388  0.42902  0.41874  0.42388  23220030        0
1  1984-09-10  0.42388  0.42516  0.41366  0.42134  18022532        0
2  1984-09-11  0.42516  0.43668  0.42516  0.42902  42498199        0
3  1984-09-12  0.42902  0.43157  0.41618  0.41618  37125801        0
4  1984-09-13  0.43927  0.44052  0.43927  0.43927  57822062        0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8364 entries, 0 to 8363
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Date     8364 non-null   object 
 1   Open     8364 non-null   float64
 2   High     8364 non-null   float64
 3   Low      8364 non-null   float64
 4   Close    8364 non-null   float64
 5   Volume   8364 non-null   int64  
 6   OpenInt  8364 non-null   int64  
dtypes: float64(4), int64(2), object(1)
memory usage: 457.5+ KB
None
              Open         High          Low        Close        Volu

Do I believe these prices? OK, a quick search says AAPL's price was \$134.16 at the last close. Wonder if I'll see the effect of adjustment by comparing with the next dataset? I'm pretty sure I've noticed AAPL's price being higher than \$175 in my random encounters with stock market reports.

### 2d. The tsaustin 'Historical Prices' market dataset

In [111]:
hp_filepath = os.path.join(os.getcwd(), 'Market-HistoricalPrices')

# summary: one row per ticker with date ranges and record counts
hp_df_s = pd.read_csv(os.path.join(hp_filepath, 'dataset_summary.csv'))
hp_df_tckr_s = hp_df_s[hp_df_s['symbol'].isin(ticker_list)]
print(hp_df_tckr_s.head())
print(hp_df_tckr_s.info())

# earnings: estimated and reported EPS
hp_df_e = pd.read_csv(os.path.join(hp_filepath, 'stocks_latest', 'earnings_latest.csv'))
hp_df_tckr_e = hp_df_e[hp_df_e['symbol'].isin(ticker_list)]
print(hp_df_tckr_e.head())
print(hp_df_tckr_e.info())

# dividends
hp_df_d = pd.read_csv(os.path.join(hp_filepath, 'stocks_latest', 'dividends_latest.csv'))
hp_df_tckr_d = hp_df_d[hp_df_d['symbol'].isin(ticker_list)]
print(hp_df_tckr_d.head())
print(hp_df_tckr_d.info())

# daily stock prices
hp_df_p = pd.read_csv(os.path.join(hp_filepath, 'stocks_latest', 'stock_prices_latest.csv'))
hp_df_tckr_p = hp_df_p[hp_df_p['symbol'].isin(ticker_list)]
print(hp_df_tckr_p.head())
print(hp_df_tckr_p.info())
print(hp_df_tckr_p.describe())

   symbol  total_prices stock_from_date stock_to_date  total_earnings  \
16   AAPL          5756      1998-01-02    2020-11-13              46   

   earnings_from_date earnings_to_date  
16         2009-07-21       2020-10-29  
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 16 to 16
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   symbol              1 non-null      object
 1   total_prices        1 non-null      int64 
 2   stock_from_date     1 non-null      object
 3   stock_to_date       1 non-null      object
 4   total_earnings      1 non-null      int64 
 5   earnings_from_date  1 non-null      object
 6   earnings_to_date    1 non-null      object
dtypes: int64(2), object(5)
memory usage: 64.0+ bytes
None
    symbol        date      qtr  eps_est   eps release_time
357   AAPL  2009-07-21  06/2009      NaN   NaN         post
358   AAPL  2009-10-19  09/2009      NaN  0.45         pos

See? This one suggests much higher max prices. What's going on here?

## 3. Choose 100 companies and pull their data into new files to reduce data sizes. 
Let's try the 2020 Fortune 100.
Data retrieved from https://fortune.com/fortune500/2020/search/?rank=asc
Spent a bunch of time massaging the copied text back into tabular format. Spent a bunch more time going through each company's profile page to pull out ticker symbols. Hopefully an employer would have a better way to access the data, but it's good enough for this project. Fortune wants \$500 or more for a data file and doesn't mention an API that I can see.

- Some ticker notes:
    - There are a bunch of companies (especially mutual insurance companies) without tickers, so I'm going to end up with fewer than 100 companies. Expand search or leave it?
    - Will I need to look out for name changes (esp. ticker changes)?
    - Albertsons has a ticker, but its IPO was sometime last year, so it may not have much data

## 4. Begin EDA, look for cleaning needs. 
- Figure out how best to pull only the SEC-related dates of market data, assuming that dates will generally be different for each company.
- Calculate (can I sanity-check somehow?) fundamental analysis ratios: P/E, PEG (P/E to growth), price/book.
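
The ratios themselves are simple arithmetic once price, EPS, growth, and book value per share are lined up. A minimal sketch with invented inputs; the function names are hypothetical. One caveat I'm fairly sure of: P/E conventionally uses a full year of earnings (often trailing twelve months), so per-quarter EPS would need to be summed over four quarters first. The Finnhub income rows printed earlier run from 2009-09-27 to 2010-03-27, so that EPS figure appears to cover about six months rather than one quarter, which would need untangling too.

```python
def pe_ratio(price, eps_ttm):
    """Price divided by trailing-twelve-month earnings per share."""
    return price / eps_ttm

def peg_ratio(pe, eps_growth_pct):
    """P/E divided by expected annual EPS growth, expressed in percent."""
    return pe / eps_growth_pct

def price_to_book(price, stockholders_equity, shares_outstanding):
    """Price divided by book value (stockholders' equity) per share."""
    return price / (stockholders_equity / shares_outstanding)

# Invented inputs for illustration only.
price = 100.0
quarterly_eps = [1.0, 1.5, 1.25, 1.25]  # four most recent single-quarter EPS values
eps_ttm = sum(quarterly_eps)

pe = pe_ratio(price, eps_ttm)
peg = peg_ratio(pe, 10.0)  # assumes 10% expected annual EPS growth
pb = price_to_book(price, 2_000_000.0, 100_000.0)
print(pe, peg, pb)
```

If a market dataset already ships P/E, comparing it against `pe_ratio` on the same date would be one way to do the sanity check.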

## Other tasks as I think of them while exploring!
- Do I have data that shows market capitalization? Or shares outstanding, so I can calculate?
- How does one adjust prices for dividends, splits? What price do I need to use to calculate ratios, market cap, etc?
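
On split adjustment: a common approach is to divide every price before a split's effective date by the split ratio, compounding across any later splits, so the whole series is expressed in today's share units. A minimal sketch with invented prices and a hypothetical 2-for-1 split; `split_adjust` is a hypothetical helper name. Dividend adjustment follows a similar back-adjustment idea but is not shown here.

```python
import pandas as pd

# Invented raw closes around a hypothetical 2-for-1 split effective 2020-03-04.
raw = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-02", "2020-03-03", "2020-03-04", "2020-03-05"]),
    "Close": [200.0, 204.0, 102.0, 103.0],
})
splits = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-04"]),
    "ratio": [2.0],  # new shares per old share
})

def split_adjust(prices, splits):
    """Return a copy of prices with Close_adj expressed in post-split share units."""
    out = prices.sort_values("Date").copy()
    factor = pd.Series(1.0, index=out.index)
    for split in splits.itertuples(index=False):
        # Every row strictly before the split date is scaled by that split's ratio.
        factor.loc[out["Date"] < split.Date] *= split.ratio
    out["Close_adj"] = out["Close"] / factor
    return out

adjusted = split_adjust(raw, splits)
print(adjusted)
```

For market capitalization on a historical date, the price and the share count need to be in the same units: raw close with the shares outstanding reported at that time, or adjusted close with a share count adjusted the same way.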

## Questions for later:
- Once I produce something with this, should I share it on Kaggle? If I've combined datasets, can I link my code to both of them?
- Can or should I try to confirm that Yahoo Finance is a bad place to get stock data from? (see notes on first market dataset above)
- How do I pull out the data I want from these JSON files and save for later? As dataframe, csv, something else?
- Can I pull additional company data from the Fortune site? Should I? https://fortune.com/company/stonex-group/fortune500/

### Research:
- https://www.investopedia.com/terms/a/adjusted_closing_price.asp
- https://www.investopedia.com/articles/fundamental-analysis/08/sec-forms.asp