### Thinking out loud, at the beginning:
    - Wow, these files are big; I'm probably going to need to pare the data down a lot.
    - First, I'm going to look for the most complete data I have, and for what span of time.
    - What data do I actually want to focus on? (record some research on how, broadly, one comes up with a company valuation that could be comparable to market capitalization)
    - I probably should focus my market data gathering on the day after each SEC filing is released, assuming that will be the day that stock price movements are most influenced by the filing information rather than other news or factors.
    - My core question so far is, are market prices correlated with valuation information in some fashion?
        - If so, which valuation information is most strongly correlated?
        - What, if any, is the relationship between stock price/market capitalization and the business's balance sheet?
        - Do any of these data include attempts to put a value on intangibles such as goodwill/reputation?
        - The fundamental analysis tradition focuses on ratios of price to earnings, book value, earnings growth; do the ratios that investors find acceptable change over time? Do they tend to change back (or ahead) when SEC filings are released?
        - (Do I have time for this?) What happens in the periods between filings? Do prices maintain a range, like the moving average support/resistance bands favored by technical analysis?
    - Therefore, what am I looking for in the data I've found so far?
        - As much history as I can assemble from both market data and SEC filings; will need to drop years where I have one but not the other.
        - Focus on the Fortune 100, probably choose the 100 as of the most recent year I have data for.
        - Market data at daily granularity, hopefully with open, high, low, and close prices for each day. Probably focus on dates related to the company's SEC filings.
        - Still deciding which SEC filing information to use, partly because I don't know what I have to work with yet. Hoping for a moderately quick thumbnail of value related to assets, liabilities, and maybe intangibles/goodwill. Assuming earnings-per-share numbers will also be found here.
        - If market data has already calculated P/E or other ratios, great, I can use those to sanity-check what I'm pulling in from SEC.

## First crack at a plan:

### Open each dataset (I have 4 right now) individually, probably not in code. 
Need to see file structure and years covered. Make decision on how much history to collect/collate.
    - Market datasets from:
        - https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
            - description does not offer a begin date but last updated in 2017-Nov
            - data appears to be in txt files by ticker: file for AAPL dates to 1984-Sep and columns are Date,Open,High,Low,Close,Volume,OpenInt
            - there is a comment that prices in this data are adjusted for dividends AND splits (as opposed to Yahoo Finance data, which apparently is not, or not always, adjusted; I'm not sure yet what that means in practice)
        - https://www.kaggle.com/tsaustin/us-historical-stock-prices-with-earnings-data
            - 20 years according to description; last updated 2020-Dec
            - Data in 4 files: stock price, earnings (estimated and reported), dividends, and a summary
    - SEC datasets from:
        - https://www.kaggle.com/miguelaenlle/parsed-sec-10q-filings-since-2006
            - Since 2006 according to description; Last updated 2020-Jul
            - Notes that ~50% of data is null, which might be 'not found in filing' or 'not found by parser'.
            - Does have lots of promising columns like commonstocksharesissued, assetscurrent, accountspayablecurrent, commonstockvalue, liabilities, liabilitiesandstockholdersequity, stockholdersequity, earningspersharebasic, netincomeloss, profitloss, costofgoodssold, filing_date, costsandexpenses, cash 
        - https://www.kaggle.com/finnhub/reported-financials
            - 2010-2020, according to description. Last updated 2021-Mar
            - 4 GB data!
            - Data is in folders by year-qtr; JSON files. Will take some work, I think, to find the tickers I want.
            
So, I can probably get 2010-2020 reasonably complete. Maybe 2006-2020.
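
To check coverage per ticker rather than trusting the dataset descriptions, something like the sketch below could load one of the per-ticker text files from the first market dataset and report its first and last dates. It assumes each file has a header row matching the columns noted above (Date,Open,High,Low,Close,Volume,OpenInt); `load_price_file` is a name I made up, and the sample rows are invented just to make the cell runnable.

```python
import io
import pandas as pd

def load_price_file(source):
    """Read one per-ticker price file into a DataFrame indexed by date."""
    prices = pd.read_csv(source, parse_dates=["Date"])
    return prices.set_index("Date").sort_index()

# Made-up rows for illustration only; these are not real quotes.
sample = io.StringIO(
    "Date,Open,High,Low,Close,Volume,OpenInt\n"
    "2017-11-09,10.0,10.5,9.8,10.2,1000,0\n"
    "2017-11-10,10.2,10.4,10.0,10.1,1200,0\n"
)
prices = load_price_file(sample)
print(prices.index.min(), prices.index.max())
```

With a real file, `source` would just be the path to that ticker's text file.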

In [9]:
import pandas as pd
import os

### Open each dataset in code and examine columns.
See if I can choose one market and one SEC dataset; do they contain the same information? Is there just a column or two from one I can add to the other?

In [21]:
# Note some of this code may not work as written except on my own computer, or if you download the datasets like I did.
# Not planning to upload all this data, it's waaay too much.

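finnhub_df = pd.read_json('0000950123-10-025907.json')
print(finnhub_df.head(10))
print(finnhub_df['symbol'].unique())
# Each row label (bs, cf, ic) holds a list of {label, concept, unit, value} dicts in the 'data' column.
print(finnhub_df.loc['bs', 'data'])
print(finnhub_df.loc['cf', 'data'])
print(finnhub_df.loc['ic', 'data'])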

     startDate     endDate  year quarter symbol  \
bs  2009-08-30  2010-02-13  2010      Q2    AZO   
cf  2009-08-30  2010-02-13  2010      Q2    AZO   
ic  2009-08-30  2010-02-13  2010      Q2    AZO   

                                                 data  
bs  [{'label': '', 'concept': 'CashAndCashEquivale...  
cf  [{'label': '', 'concept': 'NetIncomeLoss', 'un...  
ic  [{'label': '', 'concept': 'SalesRevenueNet', '...  
['AZO']
[{'label': '', 'concept': 'CashAndCashEquivalentsAtCarryingValue', 'unit': 'usd', 'value': 105161000}, {'label': '', 'concept': 'AccountsReceivableNetCurrent', 'unit': 'usd', 'value': 147466000}, {'label': '', 'concept': 'InventoryFinishedGoods', 'unit': 'usd', 'value': 2261528000}, {'label': '', 'concept': 'OtherAssetsCurrent', 'unit': 'usd', 'value': 134558000}, {'label': '', 'concept': 'AssetsCurrent', 'unit': 'usd', 'value': 2648713000}, {'label': '', 'concept': 'PropertyPlantAndEquipmentGross', 'unit': 'usd', 'value': 3901843000}, {'label': '', 'concep

In [22]:
finnhub_df2 = pd.read_json('0000950123-10-027758.json')
print(finnhub_df2.head(10))
print(finnhub_df2['symbol'].unique())
# Same layout as the first file: one list of concept dicts per row label.
print(finnhub_df2.loc['bs', 'data'])
print(finnhub_df2.loc['cf', 'data'])
print(finnhub_df2.loc['ic', 'data'])

     startDate     endDate  year quarter symbol  \
bs  2009-06-01  2010-02-28  2010      Q3   PAYX   
cf  2009-06-01  2010-02-28  2010      Q3   PAYX   
ic  2009-06-01  2010-02-28  2010      Q3   PAYX   

                                                 data  
bs  [{'label': '', 'concept': 'CashAndCashEquivale...  
cf  [{'label': '', 'concept': 'NetIncomeLoss', 'un...  
ic  [{'label': '', 'concept': 'SalesRevenueService...  
['PAYX']
[{'label': '', 'concept': 'CashAndCashEquivalentsAtCarryingValue', 'unit': 'usd', 'value': 279261000}, {'label': '', 'concept': 'MarketableSecuritiesCurrent', 'unit': 'usd', 'value': 115765000}, {'label': '', 'concept': 'InterestReceivableCurrent', 'unit': 'usd', 'value': 22591000}, {'label': '', 'concept': 'AccountsReceivableNetCurrent', 'unit': 'usd', 'value': 159131000}, {'label': '', 'concept': 'DeferredTaxAssetsNetCurrent', 'unit': 'usd', 'value': 23221000}, {'label': '', 'concept': 'PrepaidTaxes', 'unit': 'usd', 'value': 0}, {'label': '', 'concept': 

In [24]:
print(os.getcwd())

D:\Documents\GitHub\Springboard\Capstone2-Stock_price_vs_sec_filings


In [26]:
finnhub_df3 = pd.read_json('0000950123-10-029845.json')
print(finnhub_df3.head(10))

     startDate     endDate  year quarter symbol  \
bs  2009-02-01  2010-01-30  2010      FY    TJX   
cf  2009-02-01  2010-01-30  2010      FY    TJX   
ic  2009-02-01  2010-01-30  2010      FY    TJX   

                                                 data  
bs  [{'label': '', 'concept': 'CashAndCashEquivale...  
cf  [{'label': '', 'concept': 'ProfitLoss', 'unit'...  
ic  [{'label': '', 'concept': 'Revenues', 'unit': ...  


In [27]:
finnhub_df4 = pd.read_json('0000950123-10-030079.json')
print(finnhub_df4.head(10))

     startDate     endDate  year quarter symbol  \
bs  2009-01-01  2009-12-31  2009      FY    MRK   
cf  2009-01-01  2009-12-31  2009      FY    MRK   
ic  2009-01-01  2009-12-31  2009      FY    MRK   

                                                 data  
bs  [{'label': '', 'concept': 'CashAndCashEquivale...  
cf  [{'label': '', 'concept': 'ProfitLoss', 'unit'...  
ic  [{'label': '', 'concept': 'Revenues', 'unit': ...  
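
The printed frames above hide most of the detail inside the nested `data` column. Judging from the concepts under each label, bs, cf, and ic look like balance sheet, cash flow, and income statement, but that is my inference and not something the dataset description confirms. A minimal sketch of flattening one filing into one row per reported concept follows; `flatten_filing` is a hypothetical helper, and the raw JSON layout (top-level `symbol`, `year`, `quarter`, `endDate`, and a `data` dict keyed by bs/cf/ic) is assumed from what `pd.read_json` showed.

```python
import pandas as pd

def flatten_filing(filing):
    """Turn one filing dict into long-format rows: one row per reported concept."""
    rows = []
    for statement, items in filing["data"].items():
        for item in items:
            rows.append({
                "symbol": filing["symbol"],
                "year": filing["year"],
                "quarter": filing["quarter"],
                "endDate": filing["endDate"],
                "statement": statement,
                "concept": item["concept"],
                "unit": item["unit"],
                "value": item["value"],
            })
    return pd.DataFrame(rows)

# Stand-in shaped like the AZO file; the two bs values are copied from the output above.
# cf and ic are left empty here rather than invented.
example = {
    "symbol": "AZO", "year": 2010, "quarter": "Q2",
    "startDate": "2009-08-30", "endDate": "2010-02-13",
    "data": {
        "bs": [
            {"label": "", "concept": "CashAndCashEquivalentsAtCarryingValue",
             "unit": "usd", "value": 105161000},
            {"label": "", "concept": "AssetsCurrent",
             "unit": "usd", "value": 2648713000},
        ],
        "cf": [],
        "ic": [],
    },
}
flat = flatten_filing(example)
print(flat[["statement", "concept", "value"]])
```

For a real file, the dict would come from `json.load` on the open file instead of the hand-built `example`.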


### Choose 100 companies and pull their data into new files to reduce data sizes. 
Let's try the 2020 Fortune 100.
Data retrieved from https://fortune.com/fortune500/2020/search/?rank=asc
Spent a bunch of time massaging the copied text back into tabular format. Spent a bunch more time going through each company's profile page to pull out ticker symbols. Hopefully an employer would have a better way to access the data, but it's good enough for this project. Fortune wants \$500 or more for a data file and doesn't mention an API that I can see.

Some ticker notes:
    - There are a bunch of companies (especially mutual insurance companies) without tickers, so I'm going to end up with fewer than 100 companies. Expand search or leave it?
    - Will I need to look out for name changes (esp ticker changes)?
    - Albertsons has a ticker, but its IPO was sometime last year, so it may not have much data.
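
Once the ticker list exists, one way to cut the Finnhub folders down is to walk every JSON file and keep only those whose `symbol` is on the list. This is a sketch under assumptions: `find_filings` is a hypothetical helper, the top-level `symbol` key is inferred from the `pd.read_json` output earlier, and the folder name in the demo is made up. It opens every file, so on the full 4 GB it would be slow but simple.

```python
import json
import tempfile
from pathlib import Path

def find_filings(root, tickers):
    """Return paths of JSON filings under root whose 'symbol' is in tickers."""
    wanted = set(tickers)
    matches = []
    for path in sorted(Path(root).rglob("*.json")):
        with open(path, encoding="utf-8") as handle:
            filing = json.load(handle)
        if filing.get("symbol") in wanted:
            matches.append(path)
    return matches

# Demo on a throwaway folder; "2010.QTR1" is a made-up folder name.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "2010.QTR1"
    folder.mkdir()
    (folder / "a.json").write_text(json.dumps({"symbol": "AZO", "data": {}}))
    (folder / "b.json").write_text(json.dumps({"symbol": "PAYX", "data": {}}))
    found = [p.name for p in find_filings(tmp, ["AZO", "MRK"])]
print(found)
```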

Begin EDA, look for cleaning needs. 

Figure out how best to pull only the SEC-related dates of market data, assuming that dates will generally be different for each company.
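
One possible approach, sketched with `pd.merge_asof`: for each filing, take the first price row dated strictly after the filing date, matched per ticker. The `filing_date` column name comes from the 10-Q dataset notes above; the `symbol`, `Date`, and `Close` column names and all the sample values are assumptions for illustration, and `first_trading_day_after` is a made-up name.

```python
import pandas as pd

def first_trading_day_after(filings, prices):
    """For each filing, attach the first price row dated strictly after filing_date."""
    filings = filings.sort_values("filing_date")
    prices = prices.sort_values("Date")
    return pd.merge_asof(
        filings, prices,
        left_on="filing_date", right_on="Date",
        by="symbol",                 # match within the same ticker only
        direction="forward",         # look ahead in time
        allow_exact_matches=False,   # skip the filing day itself
    )

# Made-up dates and prices, not real filings or quotes.
filings = pd.DataFrame({
    "symbol": ["AZO", "AZO"],
    "filing_date": pd.to_datetime(["2010-03-19", "2010-06-11"]),
})
prices = pd.DataFrame({
    "symbol": ["AZO"] * 4,
    "Date": pd.to_datetime(["2010-03-18", "2010-03-19", "2010-03-22", "2010-06-14"]),
    "Close": [1.0, 2.0, 3.0, 4.0],
})
result = first_trading_day_after(filings, prices)
print(result)
```

Because the price table only contains trading days, weekends and holidays are skipped automatically: a Friday filing picks up the following Monday's row.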

Calculate (can I sanity-check somehow?) fundamental analysis ratios: P/E, PEG (P/E divided by earnings growth), price/book.
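
A minimal sketch of the three ratios, using the textbook definitions: P/E is price over annual earnings per share, PEG is P/E over the annual EPS growth rate in percent, and price/book is price over book value per share (stockholders' equity divided by shares outstanding). `valuation_ratios` is a hypothetical helper and the inputs are made-up numbers. Quarterly EPS from a 10-Q would presumably need to be summed over four quarters first.

```python
def valuation_ratios(price, eps, book_value_per_share, eps_growth_pct):
    """Return P/E, PEG, and price/book; None where the denominator is not positive."""
    pe = price / eps if eps > 0 else None
    peg = pe / eps_growth_pct if pe is not None and eps_growth_pct > 0 else None
    pb = price / book_value_per_share if book_value_per_share > 0 else None
    return {"pe": pe, "peg": peg, "pb": pb}

# Made-up inputs for illustration only.
ratios = valuation_ratios(price=50.0, eps=2.5, book_value_per_share=20.0, eps_growth_pct=10.0)
print(ratios)
```

Returning None for non-positive denominators is one design choice among several; it keeps loss-making quarters from producing negative or infinite ratios that would distort a correlation later.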

Other tasks as I think of them while exploring!
    - Do I have data that shows market capitalization? Or shares outstanding, so I can calculate?
    - How does one adjust prices for dividends, splits? What price do I need to use to calculate ratios, market cap, etc?
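
On the split half of that question, a rough sketch of the usual idea: divide every price before a split date by the split ratio, so the whole series is on the post-split share basis. This covers splits only, not dividends, and it is not necessarily how either Kaggle dataset did its adjustment. `split_adjust` is a made-up name and the prices and split are invented.

```python
import pandas as pd

def split_adjust(close, splits):
    """Scale closes before each split date so the series is on a post-split basis.

    close: Series of unadjusted closes indexed by date.
    splits: dict mapping split date -> ratio (2.0 means 2-for-1).
    """
    factor = pd.Series(1.0, index=close.index)
    for split_date, ratio in splits.items():
        # Every row before the split date gets divided by this split's ratio.
        factor.loc[close.index < pd.Timestamp(split_date)] *= ratio
    return close / factor

# Made-up closes around an invented 2-for-1 split on 2020-01-03.
close = pd.Series(
    [100.0, 50.0, 52.0],
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)
adjusted = split_adjust(close, {"2020-01-03": 2.0})
print(adjusted.tolist())
```

For market capitalization, the price and the share count need to be on the same basis: unadjusted close times shares outstanding as of that date, or adjusted close times split-adjusted shares. The same goes for per-share figures like EPS when computing ratios.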

Questions for later:
    - Once I produce something with this, should I share it on Kaggle? If I've combined datasets, can I link my code to both of them?
    - Can or should I try to confirm that Yahoo Finance is a bad place to get stock data from? (see notes on first market dataset above)
    - Suddenly, after importing a couple hundred thousand files, GitHub desktop can't find my repository... it can find sub-repositories like the ones I cloned from other projects but not my main Springboard folder.
    - How do I make Pandas read from somewhere other than the working directory?
    - How do I pull out the data I want from these JSON files and save for later? As dataframe, csv, something else?
    - Any ideas how I figure out what the 'bs/cf/ic' categories (row labels) are in the JSON files?
    - Can I pull additional company data from the Fortune site? Should I? https://fortune.com/company/stonex-group/fortune500/

Research:
https://www.investopedia.com/terms/a/adjusted_closing_price.asp
https://www.investopedia.com/articles/fundamental-analysis/08/sec-forms.asp