### Thinking out loud, at the beginning:
- Wow, these files are big. I'm probably going to need to pare the data down a lot.
- First, I'm going to look for the span of time for which I have the most complete data.
- What data do I actually want to focus on? (record some research on how, broadly, one comes up with a company valuation that could be comparable to market capitalization)
- I probably should focus my market data gathering on the day after each SEC filing is released, assuming that will be the day that stock price movements are most influenced by the filing information rather than other news or factors.
- My core question so far is, are market prices correlated with valuation information in some fashion?
    - If so, which valuation information is most strongly correlated?
    - What, if any, is the relationship between stock price/market capitalization and the business's balance sheet?
    - Do any of these data include attempts to put a value on intangibles such as goodwill/reputation?
    - The fundamental analysis tradition focuses on ratios of price to earnings, book value, earnings growth; do the ratios that investors find acceptable change over time? Do they tend to change back (or ahead) when SEC filings are released?
    - (Do I have time for this?) What happens in the periods between filings? Do prices maintain a range, like the moving average support/resistance bands favored by technical analysis?
- Therefore, what am I looking for in the data I've found so far?
    - As much history as I can assemble from both market data and SEC filings; will need to drop years where I have one but not the other.
    - Focus on the Fortune 100, probably choose the 100 as of the most recent year I have data for.
    - Market data at daily granularity, hopefully with open, high, low, and close prices for each day. Probably focus on dates related to the company's SEC filings.
    - Still deciding which SEC filing information to use, partly because I don't know what I have to work with yet. Hoping for a moderately quick thumbnail of value based on assets, liabilities, and maybe intangibles/goodwill. I'm assuming earnings-per-share numbers will also be found here.
    - If market data has already calculated P/E or other ratios, great, I can use those to sanity-check what I'm pulling in from SEC.

## First crack at a plan:

### 1. Open each dataset (I have 4 right now) individually, probably not in code. 
- Need to see file structure and years covered. Make decision on how much history to collect/collate.
    - Market datasets from:
        - https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
            - description does not offer a begin date but last updated in 2017-Nov
            - data appears to be in txt files by ticker: file for AAPL dates to 1984-Sep and columns are Date,Open,High,Low,Close,Volume,OpenInt
            - there is a comment that prices in this data are adjusted for dividends AND splits, as opposed to Yahoo Finance data, which the commenter says is not (or not always) adjusted. I'm not sure yet what that adjustment means.
        - https://www.kaggle.com/tsaustin/us-historical-stock-prices-with-earnings-data
            - 20 years according to description; last updated 2020-Dec
            - Data in 4 files: stock price, earnings (estimated and reported), dividends, and a summary
    - SEC datasets from:
        - https://www.kaggle.com/miguelaenlle/parsed-sec-10q-filings-since-2006
            - Since 2006 according to description; Last updated 2020-Jul
            - Notes that ~50% of data is null, which might be 'not found in filing' or 'not found by parser'.
            - Does have lots of promising columns like commonstocksharesissued, assetscurrent, accountspayablecurrent, commonstockvalue, liabilities, liabilitiesandstockholdersequity, stockholdersequity, earningspersharebasic, netincomeloss, profitloss, costofgoodssold, filing_date, costsandexpenses, cash 
        - https://www.kaggle.com/finnhub/reported-financials
            - 2010-2020, according to description. Last updated 2021-Mar
            - 4 GB data!
            - Data is in folders by year-qtr; JSON files. Will take some work, I think, to find the tickers I want.
            
So, I can probably get 2010-2020 reasonably complete. Maybe 2006-2020.
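
To confirm the span of dates a single price file covers, a minimal sketch along these lines should work. It assumes the `Date,Open,High,Low,Close,Volume,OpenInt` layout noted above; the rows are invented sample values, not real prices, and `sample_txt` stands in for a real per-ticker file.

```python
import io

import pandas as pd

# Invented sample rows in the column layout noted above; not real prices.
sample_txt = """Date,Open,High,Low,Close,Volume,OpenInt
2009-12-31,10.0,10.5,9.8,10.2,1000,0
2010-01-04,10.2,10.8,10.1,10.7,1200,0
2017-11-10,50.0,51.0,49.5,50.5,3000,0
"""

# With a real file, pass its path instead of the StringIO object.
prices = pd.read_csv(io.StringIO(sample_txt), parse_dates=["Date"], index_col="Date")

# Span of dates covered by this file.
first_day, last_day = prices.index.min(), prices.index.max()
print(first_day.date(), "to", last_day.date())

# Keep only the years that overlap the SEC data (2010 onward).
prices_2010_on = prices.loc["2010-01-01":]
print(len(prices_2010_on), "rows from 2010 onward")
```

Running the same check on each dataset would show which years can actually be lined up.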

In [1]:
import pandas as pd
import os

### 2. Open each dataset in code and examine columns.
See if I can choose one market and one SEC dataset; do they contain the same information? Is there just a column or two from one I can add to the other?

In [12]:
# Note some of this code may not work as written except on my own computer, or if you download the datasets like I did.
# Not planning to upload all this data, it's waaay too much.

# For Finnhub reported-financials dataset:
# Recurse through many YYYY.Q# subdirectories using os.walk
# Acquire list of files containing the ticker(s) I want (define a list somewhere early!)
# Copy those files to a "data I'm keeping" folder
# Import reduced file set into a dataframe for analysis
# I thought about defining a function, but it's pretty specific to how this one dataset is laid out, is it worth it?

ticker_list = ['AAPL']
start_dir = os.getcwd() 

# def get_files_with_ticker(tickers, startdir):
files_list = []
problem_files = []
    #for root, subdirs, files in os.walk(startdir):
for root, subdirs, files in os.walk(start_dir):
    for filename in files:
        if os.path.splitext(filename)[1] == ('.json'):
            file_path = os.path.join(root, filename)
            #print(file_path)
            try:
                df = pd.read_json(file_path) # is there a better way to read json if all I want is to check one column?
                #print(df.head(1))
            except ValueError as val_err:
                #print("Value Error: {0}".format(val_err))
                problem_files.append((file_path, "Value Error: {0}".format(val_err)))
            except Exception as err:
                # sys was never imported above, so report the exception type directly
                print("Unexpected error:", type(err))
                raise
            else:
                #set_tickers = set(tickers)
                set_tickers = set(ticker_list)
                set_symbols = set(df['symbol'].unique())
                if set_tickers.intersection(set_symbols):
                    # print('Found one!')
                    files_list.append(file_path)
    #return files_list
print('Files found: ' + str(len(files_list)))
print('Problem files found: ' + str(len(problem_files)))
#copy_files = get_files_with_ticker(ticker_list, start_dir)

# first try at running this took 35+ minutes and produced "value too big" somewhere in a read_json call
# I am grateful that Jupyter changes the tab icon to hourglass when a notebook is busy!

Found one!
[... 25 more identical "Found one!" lines ...]
Value Error: Value is too big
Value Error1: Expected object or value
Found one!
[... 17 more identical "Found one!" lines ...]
Files found: 44
Problem files found: 2
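
On the question of a lighter way to check one field: a possible alternative is to skip pandas for the scan and read each file with the standard-library `json` module. This is only a sketch. It assumes each file is a single JSON object with a top-level `symbol` key (which is what the dataframe output further down suggests), and `find_files_for_tickers` is a hypothetical helper name. The demo runs on throwaway files rather than the real dataset.

```python
import json
import os
import tempfile


def find_files_for_tickers(start_dir, tickers):
    """Return (matches, problems) for .json files found under start_dir."""
    wanted = set(tickers)
    matches, problems = [], []
    for root, _subdirs, files in os.walk(start_dir):
        for filename in files:
            if os.path.splitext(filename)[1] != ".json":
                continue
            file_path = os.path.join(root, filename)
            try:
                with open(file_path, encoding="utf-8") as handle:
                    record = json.load(handle)
            except (ValueError, OSError) as err:
                # json.JSONDecodeError is a subclass of ValueError
                problems.append((file_path, str(err)))
                continue
            # Assumption: one JSON object per file with a top-level "symbol".
            if isinstance(record, dict) and record.get("symbol") in wanted:
                matches.append(file_path)
    return matches, problems


# Tiny demo on throwaway files (invented contents).
with tempfile.TemporaryDirectory() as demo_dir:
    quarter_dir = os.path.join(demo_dir, "2010.QTR2")
    os.makedirs(quarter_dir)
    demo_files = [
        ("a.json", '{"symbol": "AAPL", "year": 2010}'),
        ("b.json", '{"symbol": "CWNM", "year": 2010}'),
        ("c.json", ""),  # empty file, should land in the problem list
    ]
    for name, text in demo_files:
        with open(os.path.join(quarter_dir, name), "w", encoding="utf-8") as handle:
            handle.write(text)

    matches, problems = find_files_for_tickers(demo_dir, ["AAPL"])
    found_names = [os.path.basename(path) for path in matches]
    print("Files found:", len(matches))
    print("Problem files found:", len(problems))
```

Whether this is faster than `pd.read_json` on the full dataset is untested; it avoids building a dataframe per file, which is the part that seemed wasteful.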


In [22]:
# experimenting with the one problem file. Set lines? orient? skip because it's one file out of 200K or so?
'''
print(os.getcwd())
file_problem = os.path.join(os.getcwd(),'SEC-FinancialsAsReported\\2016.QTR4\\0001098009-16-000023.json')
print(file_problem)
'''
# tried a lot of variations to get path join to work by hand (as opposed to using walk).
# The two most important seem to be: start with os.getcwd(), and escape any backslashes in the path.

'''
df = pd.read_json(file_problem) #as done in filewalk above, "value is too big"
df = pd.read_json(file_problem, lines=True) #did not work, "expected object or value"
df = pd.read_json(file_problem, orient='records') #did not work, "value is too big"
df = pd.read_json(file_problem, orient='split') #did not work, "value is too big"
df = pd.read_json(file_problem, orient='index') #did not work, "value is too big"
df = pd.read_json(file_problem, orient='columns') #did not work, "value is too big"
df = pd.read_json(file_problem, orient='values') #did not work, "value is too big"
print(df.head())
'''
# nothing is working so far, skip this damn file. Opened in Notepad++ and its ticker is 'CWNM' 
# which I don't think is on my list anyway.

D:\Documents\GitHub\Springboard\Capstone2-Stock_price_vs_sec_filings\SEC-FinancialsAsReported\2016.QTR4\0001098009-16-000023.json


ValueError: Value is too big
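
One possible explanation, which I have not checked against this file: pandas' JSON reader can raise "Value is too big" when a number does not fit in a 64-bit integer, while the standard-library `json` module reads integers of any size. The sketch below only demonstrates that difference on an invented record. The nesting under `data` and the `value` key are assumptions, not something confirmed from the problem file.

```python
import json

# Invented record: "value" is deliberately larger than a 64-bit integer.
text = (
    '{"symbol": "CWNM", "data": {"bs": [{"label": "Example", '
    '"value": 123456789012345678901234567890}]}}'
)

record = json.loads(text)
big_value = record["data"]["bs"][0]["value"]

# Python integers are arbitrary precision, so this parses without error.
print(record["symbol"], big_value > 2**64)
```

If that turns out to be the cause, reading the problem file with `json.load` and inspecting it by hand would be the next thing to try.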

#### Now, to see what I have for my ticker(s) in finnhub files:

In [30]:
# concat/map solution found at https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe
finnhub_df = pd.concat(map(pd.read_json, files_list))
print(finnhub_df.head(10))

     startDate     endDate  year quarter symbol  \
bs  2009-09-27  2010-03-27  2010      Q2   AAPL   
cf  2009-09-27  2010-03-27  2010      Q2   AAPL   
ic  2009-09-27  2010-03-27  2010      Q2   AAPL   
bs  2009-09-27  2010-06-26  2010      Q3   AAPL   
cf  2009-09-27  2010-06-26  2010      Q3   AAPL   
ic  2009-09-27  2010-06-26  2010      Q3   AAPL   
bs  2009-09-27  2010-09-25  2010      FY   AAPL   
cf  2009-09-27  2010-09-25  2010      FY   AAPL   
ic  2009-09-27  2010-09-25  2010      FY   AAPL   
bs  2010-09-26  2010-12-25  2011      Q1   AAPL   

                                                 data  
bs  [{'label': 'Long-term marketable securities', ...  
cf  [{'label': 'Depreciation, amortization and acc...  
ic  [{'label': 'Earnings Per Share, Basic', 'conce...  
bs  [{'label': 'Available-for-sale Securities, Deb...  
cf  [{'label': 'Depreciation, amortization and acc...  
ic  [{'label': 'Basic earnings per common share', ...  
bs  [{'label': 'Available-for-sale Securities,

#### Some things I need to find out:
- Why are all these reports split into records called "bs/cf/ic"? What are those abbreviations for?
- how about... **B**alance **S**heet, **C**ash **F**low, **I**n**C**ome statement? https://www.sec.gov/oiea/reportspubs/investor-publications/beginners-guide-to-financial-statements.html
- I want to see more of what's in those "data" fields

In [None]:
# figure out how to pull this data apart into each category and get "data" out of being sub-dictionaries

### 3. Choose 100 companies and pull their data into new files to reduce data sizes. 
Let's try the 2020 Fortune 100.
Data retrieved from https://fortune.com/fortune500/2020/search/?rank=asc
I spent a lot of time massaging the copied text back into tabular format, and more time going through each company's profile page to pull out ticker symbols. Hopefully an employer would have a better way to access the data, but this is good enough for the project. Fortune wants \$500 or more for a data file and doesn't mention an API that I can see.

- Some ticker notes:
    - There are a bunch of companies (especially mutual insurance companies) without tickers, so I'm going to end up with fewer than 100 companies. Expand search or leave it?
    - Will I need to look out for name changes (esp ticker changes)?
    - Albertsons has a ticker, but its IPO was sometime last year, so it may not have much data.

### 4. Begin EDA, look for cleaning needs. 
- Figure out how best to pull only the SEC-related dates of market data, assuming that dates will generally be different for each company.
- Calculate (can I sanity-check somehow?) fundamental analysis ratios: P/E, PEG (P/E to growth), price/book.
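
For picking the first trading day after each filing, one option is `pd.merge_asof` with a forward search. A minimal sketch on invented dates and prices; the column names `filing_date`, `Date`, and `Close` follow the datasets described above, but the two frames here are made up.

```python
import pandas as pd

# Invented trading days and prices (note the weekend gap after 2020-01-03).
prices = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]
        ),
        "Close": [10.0, 10.5, 11.0, 10.8],
    }
)

# Invented filing dates.
filings = pd.DataFrame(
    {
        "symbol": ["AAPL", "AAPL"],
        "filing_date": pd.to_datetime(["2020-01-03", "2020-01-06"]),
    }
)

# For each filing, take the nearest price row strictly after the filing date.
# Both frames must be sorted on their date keys.
day_after = pd.merge_asof(
    filings.sort_values("filing_date"),
    prices.sort_values("Date"),
    left_on="filing_date",
    right_on="Date",
    direction="forward",
    allow_exact_matches=False,
)
print(day_after)
```

With several companies in one frame, both inputs would need a `symbol` column and the call would need `by="symbol"` so that each filing is matched only against its own ticker's prices.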

#### Other tasks as I think of them while exploring!
- Do I have data that shows market capitalization? Or shares outstanding, so I can calculate?
- How does one adjust prices for dividends, splits? What price do I need to use to calculate ratios, market cap, etc?
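
As a reference for the arithmetic, here is a worked example of the ratios with invented numbers. The variable names are placeholders, not columns from any of the datasets.

```python
# Invented inputs for one company on one date.
close_price = 50.0               # share price on the chosen day
shares_outstanding = 1_000_000   # from the filing, if available
eps_trailing_12m = 2.5           # earnings per share over the last 12 months
stockholders_equity = 20_000_000
eps_growth_pct = 10.0            # expected annual EPS growth, in percent

# Market capitalization: price times shares outstanding.
market_cap = close_price * shares_outstanding

# Price/earnings: price divided by earnings per share.
pe_ratio = close_price / eps_trailing_12m

# Price/book: price divided by book value (equity) per share.
book_value_per_share = stockholders_equity / shares_outstanding
price_to_book = close_price / book_value_per_share

# PEG: P/E divided by the EPS growth rate in percent.
peg_ratio = pe_ratio / eps_growth_pct

print(market_cap, pe_ratio, price_to_book, peg_ratio)
```

One thing to watch: if the price series is adjusted for splits but the share count comes as reported in the filing, multiplying the two will not give the market cap for that date. The price and the share count need to be on the same basis.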

#### Questions for later:
- Once I produce something with this, should I share it on Kaggle? If I've combined datasets, can I link my code to both of them?
- Can or should I try to confirm that Yahoo Finance is a bad place to get stock data from? (see notes on first market dataset above)
- Suddenly, after importing a couple hundred thousand files, GitHub desktop can't find my repository... it can find sub-repositories like the ones I cloned from other projects but not my main Springboard folder.
- How do I make Pandas read from somewhere other than the working directory?
- How do I pull out the data I want from these JSON files and save for later? As dataframe, csv, something else?
- Any ideas how I figure out what the 'bs/cf/ic' categories (row labels) are in the JSON files? (Working guess, noted above: balance sheet, cash flow, income statement.)
- Can I pull additional company data from the Fortune site? Should I? https://fortune.com/company/stonex-group/fortune500/
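
On reading from outside the working directory and saving for later: pandas accepts a full path, so a sketch like the following should cover both. The folder and file names are hypothetical, and a temporary directory stands in for a real location on disk.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical "data I'm keeping" folder, outside the working directory.
    data_dir = Path(tmp) / "data_im_keeping"
    data_dir.mkdir()
    out_path = data_dir / "aapl_subset.csv"

    # Invented rows standing in for the reduced dataset.
    df = pd.DataFrame({"symbol": ["AAPL"], "year": [2010], "quarter": ["Q2"]})

    # Save for later, then read back using the full path.
    df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded)
```

Building paths with `pathlib.Path` and `/` also sidesteps the backslash-escaping trouble on Windows. CSV is the simplest format for flat tables; nested data would need flattening first.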

### Research:
- https://www.investopedia.com/terms/a/adjusted_closing_price.asp
- https://www.investopedia.com/articles/fundamental-analysis/08/sec-forms.asp