# Full EDGAR walkthrough

**Author:** Ties de Kok ([Personal Website](http://www.tiesdekok.com))  
**Last updated:** 12 June 2018  
**Python version:** Python 3.6  
**License:** MIT License  

## *Introduction*

This notebook contains a full walkthrough of how I would go about:

1. Getting the URLs of EDGAR filings
2. Downloading an EDGAR filing
3. Preparing an EDGAR filing
4. Extracting sections of an EDGAR filing
5. Calculating common metrics

More specifically, in this example I will download a 10-K filing, extract the MD&A section from it, and calculate tone.

We will parse the 10-K for fiscal year 2017 of Tesla, Inc. (CIK: 0001318605):

https://www.sec.gov/Archives/edgar/data/1318605/000156459018002956/0001564590-18-002956-index.htm

## *Relevant notebooks*

1. [`0_python_basics.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb)
1. [`1_opening_files.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb)
1. [`2_handling data.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb)
1. [`3_web_scraping.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb)
1. [`4_NLP_notebook.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb)

## Imports

In [2]:
!pip install requests_html

Collecting requests_html
  Downloading https://files.pythonhosted.org/packages/d6/bb/54d8db5ac95f34b035b68747d765aadcbbd78687b029b41b39d2a3728f35/requests_html-0.9.0-py2.py3-none-any.whl
Collecting w3lib (from requests_html)
  Downloading https://files.pythonhosted.org/packages/37/94/40c93ad0cadac0f8cb729e1668823c71532fd4a7361b141aec535acb68e3/w3lib-1.19.0-py2.py3-none-any.whl
Collecting fake-useragent (from requests_html)
  Downloading https://files.pythonhosted.org/packages/19/78/942c4be64409dcb3ebdd5741e1b6cdc4d6153b16e9765bcecfb81547c7a1/fake-useragent-0.1.10.tar.gz
Collecting parse (from requests_html)
  Downloading https://files.pythonhosted.org/packages/79/e1/522401e2cb06d09497f2f56baa3b902116c97dec6f448d02b730e63b44a8/parse-1.8.4.tar.gz
Collecting pyppeteer>=0.0.14 (from requests_html)
  Downloading https://files.pythonhosted.org/packages/46/0d/a136a3c2d04574051673582967348a60a9532547eca45e108c0f42313d1f/pyppeteer-0.0.17.tar.gz (1.2MB)

In [0]:
import os, re, sys
import pandas as pd
import numpy as np
import requests
import requests_html

In [0]:
import lxml.html
import lxml.html.clean
from lxml.html import HTMLParser as LXMLParser

## Step 1: get URL to EDGAR filings <br> -----------------------------------------------

Before we can download our EDGAR filings we first need to:

1. Identify the filings that we are interested in  
2. Collect the URLs that we can use to download these filings

### Option 1: use WRDS SEC Analytics Suite



The easiest way to get a list of URLs for all the filings that you are interested in is to use the `WRDS SEC Analytics Suite`. 

This proprietary database is offered through the WRDS platform.

If your institution subscribes to it, this is definitely the easiest option.

### Option 2: download index files

EDGAR also publishes index files that list every filing for a given period. This works well for large samples but is inconvenient for small samples.
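As a rough sketch of the index-file route, the snippet below filters the rows of an index file by form type. It assumes the pipe-delimited layout (`CIK|Company Name|Form Type|Date Filed|Filename`) used by EDGAR's `master.idx` files. `parse_master_index` is a hypothetical helper, and the sample text is a hand-written illustration (the 8-K row is made up), not a real download.

```python
def parse_master_index(index_text, form_type='10-K'):
    """Return one dict per data row of a master.idx-style text that matches form_type."""
    rows = []
    for line in index_text.splitlines():
        parts = line.strip().split('|')
        # Data rows have exactly five fields and start with a numeric CIK;
        # header and separator lines fail this check and are skipped.
        if len(parts) != 5 or not parts[0].isdigit():
            continue
        cik, company, form, date_filed, filename = parts
        if form == form_type:
            rows.append({'cik': cik,
                         'company': company,
                         'filing_type': form,
                         'filing_date': date_filed,
                         'url': 'https://www.sec.gov/Archives/' + filename})
    return rows

# Hand-written sample in the assumed layout (the 8-K row is invented for illustration)
sample_index = """Description: Master Index of EDGAR Dissemination Feed

CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1318605|Tesla, Inc.|10-K|2018-02-23|edgar/data/1318605/0001564590-18-002956.txt
1318605|Tesla, Inc.|8-K|2018-02-07|edgar/data/1318605/0000000000-18-000000.txt
"""

index_rows = parse_master_index(sample_index, form_type='10-K')
print(index_rows)
```

The resulting list of dicts can be passed to `pd.DataFrame` in the same way as the scraped results further down.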

### Option 3: scrape EDGAR directly

Scraping EDGAR directly is relatively straightforward.  

The search query is passed almost entirely through URL parameters, for example:  
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001318605&type=10-K&dateb=&owner=exclude&count=20

So all we need to do is make these parameters dynamic:

In [0]:
CIK = '0001318605'
FILING_TYPE = '10-K'

In [0]:
edgar_search_endpoint = 'https://www.sec.gov/cgi-bin/browse-edgar'
edgar_search_payload = {'action' : 'getcompany', 
                        'CIK' : CIK,
                        'type' : FILING_TYPE,
                        'start' : 0,
                        'count' : 100}
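To see how such a payload ends up in the URL, the sketch below builds the query string by hand with the standard library. `build_search_url` is a hypothetical helper used only for illustration; `requests` performs the equivalent encoding itself when you pass `params=`.

```python
from urllib.parse import urlencode

def build_search_url(cik, filing_type, start=0, count=100):
    """Combine the EDGAR search endpoint with URL-encoded query parameters."""
    endpoint = 'https://www.sec.gov/cgi-bin/browse-edgar'
    params = {'action': 'getcompany',
              'CIK': cik,
              'type': filing_type,
              'start': start,
              'count': count}
    return endpoint + '?' + urlencode(params)

search_url = build_search_url('0001318605', '10-K')
print(search_url)
```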

In [0]:
session = requests_html.HTMLSession()
raw_res = session.get(edgar_search_endpoint, params=edgar_search_payload)

In [0]:
html_table = raw_res.html.find('.tableFile2', first=True)

In [0]:
html_table.find('tr')

[<Element 'tr' >]

In [0]:
filing_list = []
for row in html_table.find('tr'):
    url = row.find('a#documentsbutton', first=True)
    if url:
        for col in row.find('td'):
            if re.match(r'\d\d\d\d-\d\d-\d\d', col.text):
                filing_list.append({'cik' : CIK, 
                                    'filing_type' : FILING_TYPE, 
                                    'filing_date' : col.text,
                                    'url' : url.attrs['href']
                                   })
filing_df = pd.DataFrame(filing_list)

In [0]:
filing_df

Unnamed: 0,cik,filing_date,filing_type,url
0,1318605,2018-02-23,10-K,/Archives/edgar/data/1318605/00015645901800295...
1,1318605,2017-03-01,10-K,/Archives/edgar/data/1318605/00015645901700311...
2,1318605,2016-02-24,10-K,/Archives/edgar/data/1318605/00015645901601319...
3,1318605,2015-02-26,10-K,/Archives/edgar/data/1318605/00015645901500103...
4,1318605,2014-02-26,10-K,/Archives/edgar/data/1318605/00011931251406968...
5,1318605,2013-03-07,10-K,/Archives/edgar/data/1318605/00011931251309624...
6,1318605,2012-03-28,10-K,/Archives/edgar/data/1318605/00011931251213756...
7,1318605,2012-02-27,10-K,/Archives/edgar/data/1318605/00011931251208199...
8,1318605,2011-03-03,10-K,/Archives/edgar/data/1318605/00011931251105484...


### Advanced

EDGAR returns at most 100 results per page, so for companies with more filings we have to deal with pagination:

In [0]:
def get_filing_list(cik, filing_type):
    session = requests_html.HTMLSession()
    edgar_search_endpoint = 'https://www.sec.gov/cgi-bin/browse-edgar'
    done_searching = False
    page = 0 
    filing_list = []
    
    while not done_searching:
        edgar_search_payload = {'action' : 'getcompany', 
                               'CIK' : cik,
                               'type' : filing_type,
                               'start' : page*100,
                               'count' : 100}
        
        page_res = session.get(edgar_search_endpoint, params=edgar_search_payload)
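        html_table = page_res.html.find('.tableFile2', first=True)
        
        ## Stop when the results table is missing or only contains the header row
        if html_table is not None and len(html_table.find('tr')) > 1: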
            page += 1
            for row in html_table.find('tr'):
                url = row.find('a#documentsbutton', first=True)
                if url:
                    for col in row.find('td'):
                        if re.match(r'\d\d\d\d-\d\d-\d\d', col.text):
                            filing_list.append({'cik' : cik, 
                                                'filing_type' : filing_type, 
                                                'filing_date' : col.text,
                                                'url' : url.attrs['href']
                                               })
        else:
            done_searching = True
            
    filing_df = pd.DataFrame(filing_list)
    return filing_df
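The paging logic in `get_filing_list` can also be looked at in isolation from the scraping code. The sketch below is not part of the original workflow: `paginate` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for the HTTP request so the stopping rule (keep going until a page comes back empty) can be tried without touching EDGAR.

```python
def paginate(fetch_page, page_size=100):
    """Yield items from successive pages until fetch_page returns an empty page."""
    start = 0
    while True:
        rows = fetch_page(start, page_size)
        if not rows:
            break
        yield from rows
        start += page_size

# Stand-in for a remote result set with 250 entries
fake_rows = list(range(250))
requested_offsets = []

def fake_fetch(start, count):
    """Return one 'page' of the fake result set and record the requested offset."""
    requested_offsets.append(start)
    return fake_rows[start:start + count]

collected = list(paginate(fake_fetch))
print(len(collected), requested_offsets)
```

Note that one extra request is always needed to discover that there are no further results.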

In [6]:
filing_df = get_filing_list('0001318605', '10-K')
filing_df.head()

Unnamed: 0,cik,filing_date,filing_type,url
0,1318605,2018-02-23,10-K,/Archives/edgar/data/1318605/00015645901800295...
1,1318605,2017-03-01,10-K,/Archives/edgar/data/1318605/00015645901700311...
2,1318605,2016-02-24,10-K,/Archives/edgar/data/1318605/00015645901601319...
3,1318605,2015-02-26,10-K,/Archives/edgar/data/1318605/00015645901500103...
4,1318605,2014-02-26,10-K,/Archives/edgar/data/1318605/00011931251406968...


In [8]:
form_cik = "0001318605" #@param {type:"string"}
form_filing_type = "10-K" #@param {type:"string"}

filing_df = get_filing_list(form_cik, form_filing_type)
filing_df.head()

Unnamed: 0,cik,filing_date,filing_type,url
0,1318605,2018-02-23,10-K,/Archives/edgar/data/1318605/00015645901800295...
1,1318605,2017-03-01,10-K,/Archives/edgar/data/1318605/00015645901700311...
2,1318605,2016-02-24,10-K,/Archives/edgar/data/1318605/00015645901601319...
3,1318605,2015-02-26,10-K,/Archives/edgar/data/1318605/00015645901500103...
4,1318605,2014-02-26,10-K,/Archives/edgar/data/1318605/00011931251406968...


## Step 2: download EDGAR filing <br> ---------------------------------------------

Each EDGAR filing usually consists of multiple separate documents. To avoid having to download them all separately, I recommend downloading the aggregate TXT file called "Complete submission text file".

![image.png](attachment:image.png)

For that we need to slightly alter our URL:

In [0]:
filing_df['url_text_submission'] = filing_df.url.apply(lambda url: 'https://www.sec.gov' + url.replace('-index.htm', '.txt'))

In [0]:
print(filing_df.iloc[0].url_text_submission)

https://www.sec.gov/Archives/edgar/data/1318605/000156459018002956/0001564590-18-002956.txt
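The same URL rewrite can be wrapped in a small function that fails loudly when the input does not look like an index page. This is only a sketch: `submission_text_url` is a hypothetical helper, and accepting both `-index.htm` and `-index.html` is my assumption about the link variants you might encounter.

```python
import re

def submission_text_url(index_path, base='https://www.sec.gov'):
    """Turn the relative path of a filing index page into the URL of the full submission text file."""
    new_path, n_subs = re.subn(r'-index\.html?$', '.txt', index_path)
    if n_subs == 0:
        raise ValueError('Not a filing index path: {}'.format(index_path))
    return base + new_path

index_path = '/Archives/edgar/data/1318605/000156459018002956/0001564590-18-002956-index.htm'
txt_url = submission_text_url(index_path)
print(txt_url)
```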


A basic download function that can deal with failure:

In [0]:
import time

def download_file(url, max_tries=4, sleep_time=2):
    ## Retry until the request succeeds or max_tries retries have been used
    done = False
    success = False
    tries = 0
    
    while not done:
        res = requests.get(url)
        if res.status_code != 200:
            if tries < max_tries:
                tries += 1
                time.sleep(sleep_time)
            else:
                done = True
        else:
            success, done = True, True
            
    ## On failure, data holds the body of the last (unsuccessful) response
    data = res.text
    return (url, success, data)

In [0]:
raw_filing_data = download_file(filing_df.iloc[0].url_text_submission)

In [0]:
print(raw_filing_data[0], raw_filing_data[1])
print(raw_filing_data[2][:500])

https://www.sec.gov/Archives/edgar/data/1318605/000156459018002956/0001564590-18-002956.txt True
<SEC-DOCUMENT>0001564590-18-002956.txt : 20180223
<SEC-HEADER>0001564590-18-002956.hdr.sgml : 20180223
<ACCEPTANCE-DATETIME>20180223060743
ACCESSION NUMBER:		0001564590-18-002956
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		155
CONFORMED PERIOD OF REPORT:	20171231
FILED AS OF DATE:		20180223
DATE AS OF CHANGE:		20180223

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			Tesla, Inc.
		CENTRAL INDEX KEY:			0001318605
		STANDARD INDUSTRIAL CLASSIFICATION:	MOTOR VEHICLES & PASSENGER CAR


## Step 3: prepare EDGAR filing <br> -----------------------------------------

The text document we downloaded in step 2 contains many different documents. We need to do three things:  

1. Split the documents up  
2. Keep and label documents we need  
3. Parse HTML documents

In [0]:
pattern_dict = {
    'documents' : re.compile(r"<document>(.*?)</document>", re.IGNORECASE | re.DOTALL),
    'metadata' : {
        'type' : re.compile(r"<type>(.*?)\n", re.IGNORECASE | re.DOTALL),
        'sequence' : re.compile(r"<sequence>(.*?)\n", re.IGNORECASE | re.DOTALL),
        'filename' : re.compile(r"<filename>(.*?)\n", re.IGNORECASE | re.DOTALL),
        'description' : re.compile(r"<description>(.*?)\n", re.IGNORECASE | re.DOTALL)
    },
    'text' : re.compile(r"<text>(.*?)</text>", re.IGNORECASE | re.DOTALL)
}

In [0]:
def extract_metadata_from_doc(doc, pattern_dict=pattern_dict):
    data_dict = {'metadata' : {}}
    for key, pattern in pattern_dict['metadata'].items():
        data_dict['metadata'][key] = pattern.findall(doc)[0]
    data_dict['text'] = pattern_dict['text'].findall(doc)[0]
    return data_dict

In [0]:
## Split into separate documents
doc_elements = pattern_dict['documents'].findall(raw_filing_data[2])

In [0]:
len(doc_elements)

155
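To make the splitting step easier to follow, here is the same idea applied to a tiny hand-written submission. The sample string only mimics the tag layout of a real "Complete submission text file" and is not taken from the Tesla filing; the patterns are redefined so the snippet runs on its own.

```python
import re

# Hand-written miniature submission that mimics the SGML-style layout
toy_submission = """<SEC-DOCUMENT>toy.txt
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>main.htm
<DESCRIPTION>10-K
<TEXT>
<html><body>Annual report</body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-21.1
<SEQUENCE>2
<FILENAME>ex21.htm
<DESCRIPTION>EX-21.1
<TEXT>
Subsidiaries
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>"""

doc_pattern = re.compile(r"<document>(.*?)</document>", re.IGNORECASE | re.DOTALL)
type_pattern = re.compile(r"<type>(.*?)\n", re.IGNORECASE)
text_pattern = re.compile(r"<text>(.*?)</text>", re.IGNORECASE | re.DOTALL)

# Map each document type to the raw content of its <TEXT> block
toy_docs = {}
for doc in doc_pattern.findall(toy_submission):
    doc_type = type_pattern.search(doc).group(1)
    toy_docs[doc_type] = text_pattern.search(doc).group(1)

print(list(toy_docs.keys()))
```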

In [0]:
## Try to capture meta-data for each document
data_dict_list = []
for i, doc in enumerate(doc_elements):
    try:
        doc_data = extract_metadata_from_doc(doc)
        
        ## Only save non-XBRL documents
        if 'xbrl' not in doc_data['metadata']['description'].lower():
            data_dict_list.append(doc_data)
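    except Exception:
        ## Skip documents that lack one of the metadata fields or a <TEXT> block
        pass
## Note: documents that share the same type overwrite each other, only the last one is kept
document_dict = {x['metadata']['type'] : x['text'] for x in data_dict_list}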

In [0]:
document_dict.keys()

dict_keys(['10-K', 'EX-10.4', 'EX-10.44', 'EX-10.46', 'EX-12.1', 'EX-21.1', 'EX-23.1', 'EX-23.2', 'EX-31.1', 'EX-31.2', 'EX-32.1', 'GRAPHIC'])

In [0]:
len(document_dict['10-K'])

6747782

#### Parse HTML

Parse as HTML, and then remove all HTML to only keep the text

In [0]:
filing_text = lxml.html.fromstring(document_dict['10-K'], parser=LXMLParser()).text_content()

Drop all non-ASCII characters to get rid of encoding artifacts (this is lossy and does not always work):

In [0]:
filing_text = filing_text.encode('ascii', 'ignore').decode('utf-8')

Replace multiple line breaks with just one line break

In [0]:
filing_text = re.sub(r'\n+', '\n', filing_text)
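The effect of these two cleaning steps is easiest to see on a short made-up string. Note how the curly apostrophe and the accented character simply disappear, which is why this is a blunt fix:

```python
import re

# Made-up sample with a curly apostrophe, an accented letter and repeated line breaks
raw_sample = 'Tesla\u2019s caf\u00e9\n\n\nRevenue'

# Step 1: drop every character that cannot be represented in ASCII
ascii_only = raw_sample.encode('ascii', 'ignore').decode('utf-8')

# Step 2: collapse runs of line breaks into a single line break
collapsed = re.sub(r'\n+', '\n', ascii_only)

print(repr(collapsed))
```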

## Step 4: extract section <br> --------------------------------

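In a 10-K the MD&A is Item 7 ("Management's Discussion and Analysis of Financial Condition and Results of Operations") and it is followed by Item 7A ("Quantitative and Qualitative Disclosures About Market Risk"). The function below extracts the text between the Item 7 heading and the Item 7A heading.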


In [0]:
def retrieve_MDA_10K_HTML(text):
    ## Split the document based on the word "item"
    t = re.split('item|Item|ITEM', text)

    ## Loop over each element and check whether it starts with a 7 that is not followed by an "a"
    idx = []
    for i in range(len(t)):
        if '7' in t[i].strip()[:10] and not '7a' in t[i].strip().lower()[:10]:
            idx.append(i)

    results = []
    reference = 0

    ## Loop over all elements that start with 7 (excluding those starting with 7a)
    for i in idx:
        res = t[i]
        endPoint = i+1

        isValid = 0
        reference = 0

        ## Loop over the elements that follow the current element and check whether it matches item 7a
        while endPoint < len(t):
            if '7a' in t[endPoint].strip().lower()[:10]:
                
                ## Check for keywords (quan and qual) that indicate start section 7a (and not just 7a mentioned in the text)
                if t[endPoint].lower().find('quan') >= 0 and t[endPoint].lower().find('qual') >= 0:
                    t_text = t[endPoint]
                    
                    ## Check whether keywords are within 200 characters of the 7a mention
                    if t_text[0:200].lower().find('quan') >= 0 and t_text[0:200].lower().find('qual') >= 0:
                        
                        ## Final check to make sure it is not a reference, if not then consider valid and break loop
                        if ' see ' not in res[-10:].lower() and ' in ' not in res[-10:].lower():
                            isValid = 1
                            break
                            
            res = ' '.join((res, t[endPoint]))
            endPoint += 1
        
        ## Short hits are either invalid or imply incorporated by reference
        if (len(res) < 1000):
            if 'incorporated' in res.lower() and 'reference' in res.lower():
                reference = 1
            else:
                isValid = 0

        ## Check whether it contains the title, otherwise consider it an invalid hit
        if res[0:100].lower().find('discussion') < 0 and res[0:100].lower().find('analysis') < 0:
            isValid = 0

        ## Append valid hits that do not indicate incorporated by reference
        if isValid:
            if not reference:
                results.append(res)

    ## Return the shortest possible hit:
    if len(results) > 0:
        return sorted(results, key = len)[0]
    else:
        if reference:
            return "MD&A incorporated by reference"
        else:
            return "No MD&A found"

In [0]:
MDA_text = retrieve_MDA_10K_HTML(filing_text)
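The core idea of `retrieve_MDA_10K_HTML` is to slice the text between an "Item 7" heading and the next "Item 7A" heading. The sketch below is a much simpler variant that I include for intuition only: it skips all the validity checks and picks the longest candidate rather than the shortest validated one, so it is not a replacement for the function above. `slice_item_7` and the sample text are hypothetical.

```python
import re

# "Item 7" that is not part of "Item 7A" (the word boundary fails between "7" and "A")
ITEM_7 = re.compile(r"\bitem\s+7\b", re.IGNORECASE)
ITEM_7A = re.compile(r"\bitem\s+7a\b", re.IGNORECASE)

def slice_item_7(text):
    """Return the longest span from an 'Item 7' mention up to the next 'Item 7A' mention."""
    candidates = []
    for start in ITEM_7.finditer(text):
        end = ITEM_7A.search(text, start.end())
        if end:
            candidates.append(text[start.start():end.start()])
    # The table of contents yields a short candidate, the actual section a long one
    return max(candidates, key=len) if candidates else None

# Hand-written sample: two table-of-contents lines followed by the actual sections
sample_10k = (
    "Item 7. Management's Discussion and Analysis 40\n"
    "Item 7A. Quantitative and Qualitative Disclosures 60\n"
    "Item 7. Management's Discussion and Analysis\n"
    "Revenue grew during the year.\n"
    "Item 7A. Quantitative and Qualitative Disclosures About Market Risk\n"
    "Interest rate risk.\n"
)

mda_sample = slice_item_7(sample_10k)
print(mda_sample)
```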

## Step 5: calculate metrics <br> ------------------------------------

In [0]:
lm_df = pd.read_excel(os.path.join('data', 'LoughranMcDonald_MasterDictionary_2014.xlsx'))

In [0]:
lm_df.head()

Unnamed: 0,Word,Sequence Number,Word Count,Word Proportion,Average Proportion,Std Dev,Doc Count,Negative,Positive,Uncertainty,Litigious,Constraining,Superfluous,Interesting,Modal,Irr_Verb,Harvard_IV,Syllables,Source
0,AARDVARK,1,81,5.690194e-09,3.06874e-09,5.779943e-07,45,0,0,0,0,0,0,0,0,0,0,2,12of12inf
1,AARDVARKS,2,2,1.404986e-10,8.217606e-12,7.84187e-09,1,0,0,0,0,0,0,0,0,0,0,2,12of12inf
2,ABACI,3,8,5.619945e-10,1.686149e-10,7.09624e-08,7,0,0,0,0,0,0,0,0,0,0,3,12of12inf
3,ABACK,4,5,3.512466e-10,1.727985e-10,7.532677e-08,5,0,0,0,0,0,0,0,0,0,0,2,12of12inf
4,ABACUS,5,1752,1.230768e-07,1.198634e-07,1.110293e-05,465,0,0,0,0,0,0,0,0,0,0,3,12of12inf


In [0]:
negative_words = [str(x).lower() for x in lm_df[lm_df.Negative != 0].Word.values]
positive_words = [str(x).lower() for x in lm_df[lm_df.Positive != 0].Word.values]

In [0]:
## Note: str.count matches substrings, so a dictionary word is also counted inside longer words
mda_lower = MDA_text.lower()
pos_count = sum(mda_lower.count(word) for word in positive_words)
neg_count = sum(mda_lower.count(word) for word in negative_words)
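Because `str.count` counts substring matches, a word such as "loss" is also counted inside "glossary". As a hedged alternative (not what produced the numbers in this notebook), the sketch below splits the text into tokens first and only counts whole-word matches. `count_dictionary_words` is a hypothetical helper, and the simple `[a-z]+` tokenizer is an assumption that ignores digits and hyphenated words.

```python
import re
from collections import Counter

def count_dictionary_words(text, dictionary_words):
    """Count whole-word occurrences of the dictionary words in text (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    token_counts = Counter(tokens)
    # set() guards against double counting if the word list contains duplicates
    return sum(token_counts[word] for word in set(dictionary_words))

sample_text = "The glossary lists losses. A loss is a loss."
sample_negative = ['loss', 'losses']

substring_count = sum(sample_text.lower().count(word) for word in sample_negative)
token_count = count_dictionary_words(sample_text, sample_negative)
print(substring_count, token_count)
```

On this sample the substring approach reports 5 hits and the token approach 3, so the two methods can give noticeably different tone measures.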

In [0]:
print('Number of positive words: {} \nNumber of negative words: {}'.format(pos_count, neg_count))

Number of positive words: 248 
Number of negative words: 482
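The two counts can be combined into a single tone score. One common choice, which I use here purely as an illustration and not as the only valid definition, is the difference between positive and negative counts scaled by their sum. `net_tone` is a hypothetical helper.

```python
def net_tone(pos_count, neg_count):
    """Return (pos - neg) / (pos + neg), or 0.0 when no tone words were found."""
    total = pos_count + neg_count
    if total == 0:
        return 0.0
    return (pos_count - neg_count) / total

# Counts taken from the output above
tone = net_tone(248, 482)
print('Net tone: {:.4f}'.format(tone))
```

The score ranges from -1 (only negative words) to 1 (only positive words); for the counts above it comes out at roughly -0.32.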
