## How to work with fundamental data

The Securities and Exchange Commission (SEC) requires US issuers, that is, listed companies and securities, including mutual funds, to file three quarterly financial statements (Form 10-Q) and one annual report (Form 10-K), in addition to various other regulatory filings.

Since the early 1990s, the SEC has made these filings available through its Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system. They constitute the primary data source for the fundamental analysis of equity and other securities, such as corporate credit, where the value depends on the business prospects and financial health of the issuer.

#### Automated processing using XBRL markup

Automated analysis of regulatory filings has become much easier since the SEC introduced XBRL, a free, open, and global standard for the electronic representation and exchange of business reports. XBRL is based on XML; it relies on [taxonomies](https://www.sec.gov/dera/data/edgar-log-file-data-set.html) that define the meaning of the elements of a report and map to tags that highlight the corresponding information in the electronic version of the report. One such taxonomy represents the US Generally Accepted Accounting Principles (GAAP).
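To make the idea of taxonomy tags concrete, the following minimal sketch parses a tiny, hypothetical XBRL-style fragment with Python's standard library. The sample document, the namespace URI, and the values are illustrative assumptions, not taken from a real filing; actual instance documents also define contexts and units and contain far more facts.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified XBRL instance fragment (made-up values).
# The us-gaap namespace URI below is an assumption used for illustration.
US_GAAP = 'http://fasb.org/us-gaap/2019-01-31'
XBRL_SAMPLE = f"""
<xbrl xmlns="http://www.xbrl.org/2003/instance" xmlns:us-gaap="{US_GAAP}">
  <us-gaap:Revenues contextRef="FY2019" unitRef="USD">1000000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2019" unitRef="USD">150000</us-gaap:NetIncomeLoss>
</xbrl>
"""


def extract_facts(xml_text, namespace):
    """Return {tag name: (context, value)} for all elements in `namespace`."""
    root = ET.fromstring(xml_text.strip())
    prefix = '{' + namespace + '}'  # ElementTree expands prefixes to {uri}name
    facts = {}
    for element in root.iter():
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            facts[name] = (element.get('contextRef'), float(element.text))
    return facts


facts = extract_facts(XBRL_SAMPLE, US_GAAP)
print(facts)
```

Because the taxonomy fixes the tag names, the same lookup (for example, `facts['Revenues']`) would work across filers that report under that taxonomy.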

The SEC introduced voluntary XBRL filings in 2005 in response to accounting scandals, began requiring this format for all filers in 2009, and continues to expand the mandatory coverage to other regulatory filings. The SEC maintains a website that lists the current taxonomies that shape the content of different filings and can be used to extract specific items.

There are several avenues to track and access fundamental data reported to the SEC:
- As part of the [EDGAR Public Dissemination Service](https://www.sec.gov/oit/announcement/public-dissemination-service-system-contact.html) (PDS), electronic feeds of accepted filings are available for a fee.
- The SEC publishes [RSS feeds](https://www.sec.gov/structureddata/rss-feeds-submitted-filings) that list structured disclosure submissions and updates them every 10 minutes.
- There are public [index files](https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm) for the retrieval of all filings through FTP for automated processing.
- The financial statement (and notes) datasets contain parsed XBRL data from all financial statements and the accompanying notes.
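As an illustration of working with the index files, the sketch below parses a small, made-up sample and filters it for periodic reports. It assumes a pipe-delimited layout with a descriptive preamble, a header row, and a dashed separator line; the company names, CIKs, and file names are hypothetical, so check a current index file before relying on this layout.

```python
from io import StringIO

import pandas as pd

# Made-up sample that mimics the assumed layout of a quarterly index file.
SAMPLE_INDEX = """Description:           Master Index of EDGAR Dissemination Feed

CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1111111|EXAMPLE CORP|10-Q|2020-05-01|edgar/data/1111111/0000000000-20-000001.txt
2222222|SAMPLE HOLDINGS INC|8-K|2020-05-04|edgar/data/2222222/0000000000-20-000002.txt
3333333|DEMO INDUSTRIES|10-K|2020-05-06|edgar/data/3333333/0000000000-20-000003.txt
"""


def parse_index(text):
    """Parse index text into a DataFrame, skipping the preamble."""
    lines = text.splitlines()
    header = next(i for i, line in enumerate(lines) if line.startswith('CIK|'))
    # keep the header row, drop the dashed separator line that follows it
    body = [lines[header]] + lines[header + 2:]
    return pd.read_csv(StringIO('\n'.join(body)), sep='|',
                       parse_dates=['Date Filed'])


index = parse_index(SAMPLE_INDEX)
periodic = index[index['Form Type'].isin(['10-Q', '10-K'])]
print(periodic[['Company Name', 'Form Type', 'Date Filed']])
```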

The SEC also publishes log files containing the [internet search traffic](https://www.sec.gov/dera/data/edgar-log-file-data-set.html) for EDGAR filings through SEC.gov, albeit with a six-month delay.


#### Building a fundamental data time series

The [Financial Statement and Notes](https://www.sec.gov/dera/data/financial-statement-and-notes-data-set.html) datasets contain numeric data extracted from the primary financial statements (balance sheet, income statement, cash flows, changes in equity, and comprehensive income) and the footnotes on those statements. The data is available starting in 2009.


The folder [03_sec_edgar](03_sec_edgar) contains the notebook [edgar_xbrl](03_sec_edgar/edgar_xbrl.ipynb) to download and parse EDGAR data in XBRL format, and create fundamental metrics like the P/E ratio by combining financial statement and price data.
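The combination of financial statement and price data can be sketched as follows. This is a minimal illustration with made-up numbers, not the notebook's implementation: it assumes quarterly earnings per share (EPS) indexed by filing date, sums the trailing four quarters, and aligns the result with prices so that each price only sees earnings that had already been filed.

```python
import pandas as pd

# Hypothetical quarterly EPS, indexed by the date each figure was filed
eps = pd.Series([1.0, 1.2, 0.9, 1.4, 1.5],
                index=pd.to_datetime(['2019-02-01', '2019-05-01', '2019-08-01',
                                      '2019-11-01', '2020-02-01']),
                name='eps')

# Hypothetical closing prices on a few trading days
prices = pd.Series([90.0, 100.0, 110.0],
                   index=pd.to_datetime(['2019-11-15', '2020-01-15', '2020-02-14']),
                   name='close')

# Trailing twelve months (TTM) EPS: sum of the last four quarterly values
ttm_eps = eps.rolling(4).sum().dropna()

# For each price date, use the most recent TTM EPS filed on or before that date
aligned_eps = ttm_eps.reindex(prices.index, method='ffill')

pe_ratio = (prices / aligned_eps).rename('pe')
print(pe_ratio.round(2))
```

Indexing by filing date rather than by fiscal period end is a deliberate choice here: it keeps the ratio from using earnings before they were publicly available.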

### Other fundamental data sources

- [Compilation of macro resources by the Yale Law School](https://library.law.yale.edu/news/75-sources-economic-data-statistics-reports-and-commentary)
- [Capital IQ](http://www.capitaliq.com)
- [Compustat](http://www.compustat.com)
- [MSCI Barra](http://www.mscibarra.com)
- [Northfield Information Services](http://www.northinfo.com)
- [Quantitative Services Group](http://www.qsg.com)



# Working with filing data from the SEC's EDGAR service

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
%matplotlib inline

from pathlib import Path
from datetime import date
import json
from io import BytesIO
from zipfile import ZipFile, BadZipFile
import requests

import pandas_datareader.data as web
import pandas as pd

from pprint import pprint

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from concurrent.futures import ThreadPoolExecutor

In [4]:
sns.set_style('whitegrid') # figure aesthetics

In [5]:
data_path = Path('data')  # point this to an external hard drive if needed to accommodate the large amount of data

In [6]:
SEC_URL = 'https://www.sec.gov/'
FSN_PATH = 'files/dera/data/financial-statement-and-notes-data-sets/'

In [7]:
today = pd.Timestamp(date.today())
this_year = today.year
this_quarter = today.quarter
past_years = range(2014, this_year)

In [9]:
# Add previous quarters
filing_periods = [(y, q) for y in past_years for q in range(1, 5)]
# Add this year's quarters
filing_periods.extend([(this_year, q) for q in range(1, this_quarter + 1)])
filing_periods[-7:]


[(2019, 3), (2019, 4), (2020, 1), (2020, 2), (2020, 3), (2020, 4), (2021, 1)]

In [11]:
def fetch_filing_data(filing_periods):
    """Download and extract the quarterly Financial Statement and Notes archives."""
    for yr, qtr in filing_periods:
        print(f'{yr}-Q{qtr}', end=' ', flush=True)
        filing = f'{yr}q{qtr}_notes.zip'
        path = data_path / f'{yr}_{qtr}' / 'source'
        path.mkdir(exist_ok=True, parents=True)
        url = SEC_URL + FSN_PATH + filing
        print('url = ' + url)
        # 2020q1 was (as of Oct 2020) in a different location; this may change at some point
        # if yr == 2020 and qtr == 1:
        #     url = SEC_URL + 'files/node/add/data_distribution/' + filing

        response = requests.get(url).content
        try:
            with ZipFile(BytesIO(response)) as zip_file:
                for file in zip_file.namelist():
                    local_file = path / file
                    if local_file.exists():
                        continue  # skip files extracted on an earlier run
                    # copy each member to disk in chunks rather than reading all lines at once
                    with zip_file.open(file) as source, local_file.open('wb') as output:
                        for chunk in iter(lambda: source.read(1024 * 1024), b''):
                            output.write(chunk)
        except BadZipFile as e:
            # the response was not a zip archive, e.g., an error page
            print(e)
            print('got bad zip file')

In [12]:
fetch_filing_data(filing_periods[-7:])

2019-Q3 url = https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2019q3_notes.zip
2019-Q4 url = https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2019q4_notes.zip
2020-Q1 url = https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2020q1_notes.zip
2020-Q2 url = https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2020q2_notes.zip
2020-Q3 url = https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2020q3_notes.zip
2020-Q4 url = https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2020q4_notes.zip
2021-Q1 url = https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2021q1_notes.zip
File is not a zip file
got bad zip file
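A failed download typically returns something other than an archive, which is what triggers the `File is not a zip file` message. One way to check a payload before attempting extraction is `zipfile.is_zipfile`, which accepts file-like objects. The sketch below uses in-memory stand-ins for both cases; the helper name `looks_like_zip` and the sample payloads are hypothetical.

```python
from io import BytesIO
from zipfile import ZipFile, is_zipfile


def looks_like_zip(content: bytes) -> bool:
    """Return True if the downloaded bytes form a readable zip archive."""
    return is_zipfile(BytesIO(content))


# Stand-in for a successful download: a tiny archive built in memory
buffer = BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('sample.tsv', 'col_a\tcol_b\n')
valid_payload = buffer.getvalue()

# Stand-in for a failed download: an HTML error page
error_payload = b'<html><body>Not Found</body></html>'

for label, payload in [('archive', valid_payload), ('error page', error_payload)]:
    print(f'{label}: {looks_like_zip(payload)}')
```

Checking the payload first makes it possible to skip or retry a period without relying on the exception path.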


In [11]:
import logging
import boto3
import os
from botocore.exceptions import ClientError

AWS_REGION = 'us-west-2'
FUNDAMENTAL_DATA_PATH = 'fundamentals-data'
AWS_PROFILE_NAME = 'gold-miner'


def create_bucket(bucket_name, region=None):
    """Create an S3 bucket in a specified region

    If a region is not specified, the bucket is created in the S3 default
    region (us-east-1).

    :param bucket_name: Bucket to create
    :param region: String region to create bucket in, e.g., 'us-west-2'
    :return: True if bucket created, else False
    """

    # Create bucket using the configured profile and the requested region
    try:
        session = boto3.Session(profile_name=AWS_PROFILE_NAME, region_name=region)
        s3_client = session.client('s3')
        if region is None:
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            location = {'LocationConstraint': region}
            s3_client.create_bucket(Bucket=bucket_name,
                                    CreateBucketConfiguration=location)
    except ClientError as e:
        logging.error(e)
        return False
    return True


# create_bucket(FUNDAMENTAL_DATA_PATH, AWS_REGION)

In [12]:
# List all the existing buckets for the AWS account.
session = boto3.Session(profile_name=AWS_PROFILE_NAME, region_name=AWS_REGION)
s3_client = session.client('s3')
response = s3_client.list_buckets()

# Output the bucket names
print('Existing buckets:')
for bucket in response['Buckets']:
    print(f'  {bucket["Name"]}')


Existing buckets:
  fundamentals-data


In [123]:
import os

# traverse the directory tree from the current directory and print every file path
for root, dirs, files in os.walk("."):
    for file in files:
        print(os.path.join(root, file))

./fundamental_data_edgar_xbrl.ipynb
./README.md
./.ipynb_checkpoints/fundamental_data_edgar_xbrl-checkpoint.ipynb
./data/2015_1/source/ren.tsv
./data/2015_1/source/pre.tsv
./data/2015_1/source/sub.tsv
./data/2015_1/source/txt.tsv
./data/2015_1/source/cal.tsv
./data/2015_1/source/tag.tsv
./data/2015_1/source/readme.htm
./data/2015_1/source/2015q1_notes-metadata.json
./data/2015_1/source/dim.tsv
./data/2015_1/source/num.tsv
./data/2017_2/source/ren.tsv
./data/2017_2/source/pre.tsv
./data/2017_2/source/sub.tsv
./data/2017_2/source/txt.tsv
./data/2017_2/source/cal.tsv
./data/2017_2/source/tag.tsv
./data/2017_2/source/readme.htm
./data/2017_2/source/dim.tsv
./data/2017_2/source/2017q2_notes-metadata.json
./data/2017_2/source/num.tsv
./data/2019_3/source/ren.tsv
./data/2019_3/source/pre.tsv
./data/2019_3/source/sub.tsv
./data/2019_3/source/cal.tsv
./data/2019_3/source/tag.tsv
./data/2019_3/source/dim.tsv
./data/2017_3/source/ren.tsv
./data/2017_3/source/pre.tsv
./data/2017_3/source/sub.tsv
.

In [3]:
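def upload_folder_to_s3(folder, bucket, s3_client):
    """Upload every file below `folder`, using its relative POSIX path as the S3 key."""
    for root, dirs, files in os.walk(folder):
        for file in files:
            file_path = os.path.join(root, file)
            print("uploading file: ", file_path)
            # Path(...).as_posix() drops a leading "./" and uses "/" separators
            upload_file_to_s3(file_path, bucket, Path(file_path).as_posix(), s3_client)


def upload_file_to_s3(file_name, bucket, key, s3_client):
    try:
        s3_client.upload_file(file_name, bucket, key)
    except Exception:
        # report the failure and continue with the remaining files
        print("Error when uploading file: " + file_name)

# upload_folder_to_s3('./data', FUNDAMENTAL_DATA_PATH, s3_client)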

In [4]:
def parallel_upload_folder_to_s3(folder, bucket, s3_client):
    """Upload every file below `folder` using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=10) as executor:
        for root, dirs, files in os.walk(folder):
            for file in files:
                file_path = os.path.join(root, file)
                print("uploading file: ", file_path)
                # submit the upload to the pool so the loop does not wait for each transfer
                executor.submit(upload_file_to_s3, file_path, bucket,
                                Path(file_path).as_posix(), s3_client)
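When tasks run in a thread pool, any exception raised by a task is stored on its future and only surfaces when the result is requested. The sketch below shows one way to collect successes and failures with `as_completed`. It uses a hypothetical `fake_upload` function in place of a real S3 call, and it assumes the upload function raises on failure rather than printing a message.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_upload(file_path):
    """Hypothetical stand-in for an S3 upload; fails for one file on purpose."""
    if file_path.endswith('txt.tsv'):
        raise IOError(f'simulated failure for {file_path}')
    return file_path


def upload_all(file_paths, upload, max_workers=4):
    """Run `upload` for each path in a thread pool; return (succeeded, failed)."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map each future back to its path so failures can be reported
        futures = {executor.submit(upload, path): path for path in file_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(path)
            except Exception as exc:
                failed.append((path, str(exc)))
    return sorted(succeeded), failed


paths = ['data/2015_1/source/sub.tsv',
         'data/2015_1/source/txt.tsv',
         'data/2015_1/source/num.tsv']
succeeded, failed = upload_all(paths, fake_upload)
print('succeeded:', succeeded)
print('failed:', failed)
```

Returning the failed paths makes it straightforward to retry only those files afterwards.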
                


In [None]:
parallel_upload_folder_to_s3('./data', FUNDAMENTAL_DATA_PATH, s3_client)

uploading file:  ./data/2015_1/source/ren.tsv
uploading file:  ./data/2015_1/source/pre.tsv
uploading file:  ./data/2015_1/source/sub.tsv
uploading file:  ./data/2015_1/source/txt.tsv
Error when uploading file: ./data/2015_1/source/txt.tsv
uploading file:  ./data/2015_1/source/cal.tsv
