# Chapter 11: Sample Notebook

This notebook contains all code from Chapter 11: _Identifying Specific Information in Text_.

## 11.2 Example: Extracting Management Discussion \& Analysis Section from a plain-text 10-K filing

In [1]:
import re

# a regex to identify the location of the Item 7 (MD&A) heading
# the re.DOTALL flag allows the . character to match newline
# characters; this is needed when a heading title spans
# multiple lines
item7_regex = re.compile(r"(item.{1,5})?\b7\b.{1,5}management.{1,5}discussion.{1,5}analysis", 
                         re.IGNORECASE | re.DOTALL)

In [2]:
# MD&A is typically followed by Item 8, 
# "Financial Statements and Supplementary Data"
# However, sometimes it is followed by "Summary of Selected 
# Financial Data" section
item8_regex = re.compile(r"(item.{1,5})?\b8\b.{1,5}(financial.{1,5}statements.{1,5}supplement.{1,5}data|summary.{1,5}selected.{1,5}financial.{1,5}data)", re.IGNORECASE | re.DOTALL)
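
The two heading patterns can be sanity-checked on a few short strings before they are run on a full filing. The sketch below recompiles the patterns from cells In [1] and In [2]; the `sample_headings` strings and the `check_heading` helper are made up for illustration and are not taken from any filing.

```python
import re

# same patterns as in cells In [1] and In [2]
item7_regex = re.compile(
    r"(item.{1,5})?\b7\b.{1,5}management.{1,5}discussion.{1,5}analysis",
    re.IGNORECASE | re.DOTALL)
item8_regex = re.compile(
    r"(item.{1,5})?\b8\b.{1,5}(financial.{1,5}statements.{1,5}supplement.{1,5}data"
    r"|summary.{1,5}selected.{1,5}financial.{1,5}data)",
    re.IGNORECASE | re.DOTALL)

def check_heading(text: str) -> str:
    """Hypothetical helper: labels a string as an Item 7 or Item 8 heading."""
    if item7_regex.search(text):
        return "item7"
    if item8_regex.search(text):
        return "item8"
    return "other"

# illustrative strings; the second one spans several lines,
# which is the case the re.DOTALL flag is meant to handle
sample_headings = [
    "Item 7. Management's Discussion and Analysis",
    "ITEM 7.\nMANAGEMENT'S DISCUSSION\nAND ANALYSIS",
    "Item 8. Financial Statements and Supplementary Data",
    "Item 1. Business",
]
labels = [check_heading(h) for h in sample_headings]
print(labels)
```

Because each gap is limited to five characters (`.{1,5}`), a heading with longer filler between the keywords would not match; the limit trades recall for fewer false positives in running text.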

In [3]:
def extract_mdna(plain_text:str):
    """Extracts MD&A section from a plain-text 10-K filing"""
    # tries to find position of Item 7 heading
    section_start_match = item7_regex.search(plain_text)
    # if the attempt was successful, tries to identify 
    # location of the subsequent section heading
    if section_start_match:
        # saves the text position of Item 7 heading to 
        # a variable. Method start() returns the position
        # of the regex match in the text
        section_start_pos = section_start_match.start()
        # finds position of Item 8 heading; starts the search
        # at the Item 7 heading position.
        section_end_match = item8_regex.search(plain_text,section_start_pos)
        # if Item 8 heading was identified, saves its 
        # position to a variable
        if section_end_match:
            section_end_pos = section_end_match.start()
            # finally, extracts all text in-between
            # plain_text[a:b] returns the substring that starts at
            # position a and ends right before position b
            item7_text = plain_text[section_start_pos:section_end_pos]
            # returns the content of the MD&A section
            return item7_text
    # if either the Item 7 or the Item 8 heading was not
    # identified, returns None
    # note that the function will only reach the following
    # line of code if the heading search failed
    return None
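
The position arithmetic in `extract_mdna` is easier to follow on a short string. This is a minimal sketch, assuming two simplified patterns (`start_regex`, `end_regex`) and a made-up `toy_text` that stand in for the real regexes and filing.

```python
import re

# simplified stand-ins for item7_regex and item8_regex (assumptions)
start_regex = re.compile(r"item\s+7\b", re.IGNORECASE)
end_regex = re.compile(r"item\s+8\b", re.IGNORECASE)

# made-up text that mimics the layout of a filing
toy_text = ("Item 1. Business\n"
            "Item 7. MD&A text here.\n"
            "Item 8. Financial statements\n")

start_match = start_regex.search(toy_text)
# the second argument tells search() where in the string to begin
end_match = end_regex.search(toy_text, start_match.start())

# the slice includes the Item 7 heading and stops right before
# the first character of the Item 8 heading
section = toy_text[start_match.start():end_match.start()]
print(repr(section))
```

The extracted section therefore begins with its own heading, which matches the PepsiCo output printed by cell In [5].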

In [4]:
# requests is a third-party Python library for
# HTTP (web) requests; it is not part of the standard
# library and can be installed with pip
import requests

# The files of PepsiCo's 1997 10-K filing are accessible
# through the URL below:
# https://www.sec.gov/Archives/edgar/data/77476/0000077476-98-000014-index.html
# generates an HTTP request to download 
# PepsiCo's 1997 10-K filing
response = requests.get('https://www.sec.gov/Archives/edgar/data/77476/0000077476-98-000014.txt')

# saves the response (filing) text to a variable
text_complete_10k = response.text

# checks if the 10-K file was downloaded correctly
# prints 300 characters of the text, starting at
# character position 10000
print(text_complete_10k[10000:10300])

 foods have been introduced to international  markets.
Principal  international  markets include Brazil,  France,  Mexico,  Poland, the
Netherlands, South Africa, Spain and the United Kingdom.

COMPETITION

      Both of PepsiCo's  businesses are highly competitive.  PepsiCo's beverages
and snack fo
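
SEC EDGAR's access guidelines ask automated clients to identify themselves with a descriptive `User-Agent` header, and a request without one may be rejected. The sketch below shows one way to send such a header and to fail loudly on an HTTP error. The helper names `build_sec_headers` and `download_filing` and the contact string are hypothetical and should be replaced with your own details.

```python
def build_sec_headers(contact: str) -> dict:
    """Hypothetical helper: returns headers that identify the requester."""
    return {"User-Agent": contact, "Accept-Encoding": "gzip, deflate"}

def download_filing(url: str, contact: str) -> str:
    """Hypothetical helper: downloads a filing and returns its text."""
    # third-party library; imported here so that build_sec_headers
    # can be used on its own
    import requests
    response = requests.get(url, headers=build_sec_headers(contact),
                            timeout=30)
    # raises requests.HTTPError for 4xx/5xx responses instead of
    # silently treating an error page as the filing text
    response.raise_for_status()
    return response.text

# usage (the contact string is a placeholder):
# text_complete_10k = download_filing(
#     'https://www.sec.gov/Archives/edgar/data/77476/0000077476-98-000014.txt',
#     'Sample Company Name admin@example.com')
```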


In [5]:
# extracts the MD&A section from PepsiCo's 10-K
# filing text
text_mdna_only = extract_mdna(text_complete_10k)

# checks if the MD&A section extraction was successful
# prints the first 200 characters of the MD&A section
print(text_mdna_only[:200])
print('\n[...]\n')
# prints the last 200 characters of the MD&A section
print(text_mdna_only[-200:])

Item 7. Management's Discussion and Analysis of Results of Operations, Cash
Flows and Liquidity and Capital Resources

Management's Discussion and Analysis

All per share information is computed using

[...]

pital markets throughout the world.

ITEM 7A. QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK.

Included in Item 7, Management's Discussion and Analysis - Market Risk beginning
on page 9.




## 11.3 Example: Extracting Management Discussion \& Analysis Section From an HTML 10-K filing

### 11.3.2 Writing Code to Identify Section Titles in HTML Documents

In [6]:
import re

# list of regex patterns that identify HTML text styles 
# commonly used to display section headings
html_styles = [
          # b tag; bold text
          r"<b>(?P<value>.+?)</b>",
          # u tag; underlined text
          r"<u>(?P<value>.+?)</u>",
          # strong tag; important text
          r"<strong[^>]*>(?P<value>.+?)</strong>",
          # center tag; centered text
          r"<center[^>]*>(?P<value>.+?)</center>",
          # any tag that has an attribute ("style") with 
          # 'font-weight: bold' value
          r"<(?P<tag>[\w-]+)\b[^>]*font-weight:\s*bold[^>]*>(?P<value>.+?)</(?P=tag)>",
          # any tag that has an attribute ("style") with 
          # 'text-decoration: underline' value
          r"<(?P<tag>[\w-]+)\b[^>]*text-decoration:\s*underline[^>]*>(?P<value>.+?)</(?P=tag)>",
          # em tag; emphasized text
          r"<em>(?P<value>.+?)</em>"]

# function that for a given regex HTML style pattern and 
# HTML source (document) returns all the (text) values of
# HTML elements that match that HTML style along with their
# positions (indexes) in the document's HTML source code
def get_html_style_values(html_style:str, html_source:str):
    # creates a regular expression from the input 
    # HTML style pattern
    html_style_regex = re.compile(html_style, re.IGNORECASE | re.DOTALL)
    # finds all the matches for the above regular 
    # expression in the HTML document
    style_matches = html_style_regex.finditer(html_source)
    # creates a list of dictionaries that stores the value
    # (text) of each regex match and its position
    results = [{'text':m['value'],'position':m.start()} for m in style_matches]
    # outputs results
    return results
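
To see what `get_html_style_values` returns, it helps to run it on a few lines of HTML. The sketch below repeats the function in compact form and applies the 'font-weight: bold' pattern to a made-up `toy_html` string (an assumption for illustration, not an excerpt from a filing).

```python
import re

def get_html_style_values(html_style: str, html_source: str):
    """Compact copy of the function defined in cell In [6]."""
    html_style_regex = re.compile(html_style, re.IGNORECASE | re.DOTALL)
    return [{'text': m['value'], 'position': m.start()}
            for m in html_style_regex.finditer(html_source)]

# the 'font-weight: bold' pattern from html_styles; (?P=tag) is a
# backreference that requires the closing tag to repeat the name
# captured by the (?P<tag>...) group
bold_style = (r"<(?P<tag>[\w-]+)\b[^>]*font-weight:\s*bold[^>]*>"
              r"(?P<value>.+?)</(?P=tag)>")

# made-up HTML snippet
toy_html = ('<div><font style="font-weight:bold;">Item 7. MD&amp;A</font></div>'
            '<p>Body text</p>'
            '<span style="font-weight: bold">Item 8. Financial Statements</span>')

bold_values = get_html_style_values(bold_style, toy_html)
for value in bold_values:
    print(value)
```

Each `position` is the index of the opening `<` of the matched element rather than of its text, which is why slicing the HTML source at that position keeps the heading's tag.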

In [7]:
import requests

# Boeing's 2015 10-K files are accessible through the 
# URL link below:
# https://www.sec.gov/Archives/edgar/data/12927/000001292716000099/0000012927-16-000099-index.htm
# generates an HTTP request to download Boeing's
# 2015 HTML 10-K filing:
response = requests.get('https://www.sec.gov/Archives/edgar/data/12927/000001292716000099/a201512dec3110k.htm')

# saves the response (filing) source HTML to a variable
html_complete_10k = response.text

# checks if the 10-K file was downloaded correctly
# prints 300 characters of the HTML source code,
# starting at character position 10000
# the output should look like HTML source code
print(html_complete_10k[10000:10300])

dding-bottom:2px;padding-right:2px;"><div style="text-align:center;font-size:7pt;"><font style="font-family:Arial;font-size:7pt;font-weight:bold;">(Zip Code)</font></div></td></tr></table></div></div><div style="line-height:120%;padding-top:2px;text-align:center;font-size:9pt;"><font style="font-fam


In [8]:
# gets all text from the HTML 10-K filing displayed in
# elements styled with 'font-weight: bold' (html_styles[4])
style_values = get_html_style_values(html_styles[4], html_complete_10k)

# displays the first three instances of such text
for i in range(3):
    print(style_values[i])

{'text': 'UNITED STATES', 'position': 753}
{'text': 'SECURITIES AND EXCHANGE COMMISSION', 'position': 906}
{'text': 'Washington, D.C. 20549', 'position': 1080}


In [9]:
# a regex to identify the location of Item 7 (MD&A) heading
item7_regex = re.compile(r"management.{1,20}discussion.{1,20}analysis", re.IGNORECASE | re.DOTALL)

# MD&A is typically followed by Item 8 heading
item8_regex = re.compile(r"financial.{1,20}statements.{1,20}supplement.{1,20}data|summary.{1,20}selected.{1,20}financial.{1,20}data", re.IGNORECASE | re.DOTALL)

def extract_mdna(html_source:str):
    """Extracts the MD&A section from an HTML filing"""
    # iterates over all possible HTML styles for headings 
    # until we can identify MD&A section or until we 
    # exhaust all the styles and fail to identify the 
    # MD&A section
    for style in html_styles:
        # gets all text in the input HTML document that 
        # matches the current style
        style_values = get_html_style_values(style, 
                                             html_source)
        # attempts to identify the heading of Item 7 
        # (MD&A) section
        section_start = next((v for v in style_values
                              if item7_regex.search(v['text'])),None)
        # if Item 7 heading position is identified, 
        # proceeds with identifying Item 8 section
        if section_start:
            # Item 8 heading location should be after Item 7
            section_end = next((v for v in style_values 
                                if item8_regex.search(v['text']) 
                                and v['position'] > section_start['position']), None)
            # if Item 8 heading position is identified, extracts the 
            # HTML code of the MD&A section
            if section_end:
                item7_html = html_source[section_start['position']:section_end['position']]
                # outputs the HTML code of the MD&A section
                return item7_html
    # note that the function will reach the following line 
    # of code only if the heading search failed
    return None
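
`extract_mdna` relies on `next()` with a default value to pick the first heading that satisfies a condition. A minimal sketch of that idiom, using a made-up `headings` list in the same format that `get_html_style_values` returns:

```python
# made-up headings; texts and positions are illustrative only
headings = [
    {'text': 'Item 1. Business', 'position': 120},
    {'text': 'Item 7. Discussion and Analysis', 'position': 5400},
    {'text': 'Item 8. Financial Statements', 'position': 9800},
]

# next() pulls the first item from the generator expression;
# the second argument (None) is returned when nothing matches,
# so no StopIteration exception is raised
first_match = next((h for h in headings if 'Discussion' in h['text']), None)
no_match = next((h for h in headings if 'Item 99' in h['text']), None)

print(first_match)
print(no_match)
```

Because the generator is lazy, the scan stops at the first hit instead of filtering the whole list.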

In [10]:
html_mdna_only = extract_mdna(html_complete_10k)

# checks if the MD&A section extraction was successful
# prints the first 200 characters of the MD&A section
print(html_mdna_only[:200])

<font style="font-family:Arial;font-size:10pt;font-weight:bold;">Item&#160;7. Management&#8217;s Discussion and Analysis of Financial Condition and Results of Operations</font></div><a name="sAE854BB7
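
The heading printed above contains the character reference `&#8217;` in place of the apostrophe, which helps explain why the HTML version of the pattern allows gaps of up to 20 characters. The sketch below compares the two gap widths on that heading text; the variable names `narrow_regex` and `wide_regex` are introduced here for illustration.

```python
import re

# heading text as it appears in the HTML source shown above
html_heading = ("Item&#160;7. Management&#8217;s Discussion and Analysis "
                "of Financial Condition and Results of Operations")

# gap width used for plain-text filings (at most 5 characters)
narrow_regex = re.compile(r"management.{1,5}discussion.{1,5}analysis",
                          re.IGNORECASE | re.DOTALL)
# gap width used for HTML filings (at most 20 characters)
wide_regex = re.compile(r"management.{1,20}discussion.{1,20}analysis",
                        re.IGNORECASE | re.DOTALL)

# "&#8217;s " is 9 characters long, so it does not fit in a
# 5-character gap
narrow_found = narrow_regex.search(html_heading) is not None
wide_found = wide_regex.search(html_heading) is not None
print(narrow_found, wide_found)
```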


In [11]:
# lxml is a library that parses XML and HTML source code
import lxml.html

# converts HTML code to plain text
def get_text_from_html(html:str):
    # creates an lxml document object
    doc = lxml.html.fromstring(html)
    # optional: removes tables from the HTML source code
    for table in doc.xpath('.//table'):
        table.getparent().remove(table)
    # preserves line breaks
    # HTML tags in the list below should be followed by 
    # new line character
    for tag in ["a", "p", "div", "br", "h1", "h2", "h3", "h4", "h5"]:
        # finds elements with the given tag; note that
        # findall() with a bare tag name matches only
        # direct children of the root element
        for element in doc.findall(tag):
            # if the text value is non-empty adds a 
            # new line character (line break)
            if element.text:
                element.text = element.text + "\n"
            # else creates a text value with a 
            # new line character
            else:
                element.text = "\n"
    # extracts and outputs text from the HTML source code
    return doc.text_content()
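
`findall()` with a bare tag name follows the ElementTree path rules, under which a plain tag name selects only direct children of the element it is called on. The sketch below illustrates the difference between direct children and all descendants. It uses the standard library's `xml.etree.ElementTree` as a stand-in, an assumption made so that the example runs without lxml, whose element API is modeled on ElementTree. The `toy_markup` string is made up.

```python
import xml.etree.ElementTree as ET

# made-up markup with one top-level and one nested <p> element
toy_markup = "<div><p>top-level</p><section><p>nested</p></section></div>"
root = ET.fromstring(toy_markup)

# bare tag name: direct children of root only
direct_children = [p.text for p in root.findall("p")]
# ".//" prefix: all descendants at any depth
all_descendants = [p.text for p in root.findall(".//p")]
# iter() also walks the whole subtree
via_iter = [p.text for p in root.iter("p")]

print(direct_children)
print(all_descendants)
print(via_iter)
```

If line breaks are also wanted after nested elements, iterating with `doc.iter(tag)` is one option; whether that improves the extracted text depends on the markup of the filing.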

In [12]:
# extracts text from the HTML MD&A
mdna_text = get_text_from_html(html_mdna_only)
# prints the first 300 characters of the MD&A section
print(mdna_text[:300])

Item 7. Management’s Discussion and Analysis of Financial Condition and Results of Operations

Consolidated Results of Operations and Financial Condition
Overview
We are a global market leader in design, development, manufacture, sale, service and support of commercial jetliners, military aircraft, 


## 11.4 Extracting text from XBRL financial reports

In [13]:
import requests

# Home Depot's 2013 10-K files are accessible through 
# the URL link below:
# https://www.sec.gov/Archives/edgar/data/354950/000035495014000008/0000354950-14-000008-index.htm
# generates an HTTP request to download Home Depot's 
# 2013 XBRL 10-K instance file
response = requests.get('https://www.sec.gov/Archives/edgar/data/354950/000035495014000008/hd-20140202.xml')

# saves the response (filing content) to a variable
xbrl_10k = response.text

# checks if the 10-K file was downloaded correctly
# prints the first 400 characters of the XBRL 10-K 
# instance document
print(xbrl_10k[:400])

<?xml version="1.0" encoding="US-ASCII"?>
<!--XBRL Document Created with WebFilings-->
<!--p:ee88abc9bc7d42cca8267c157db83ce8,x:1ee976a4cce447438165efc656d0aac8-->
<xbrli:xbrl xmlns:country="http://xbrl.sec.gov/country/2013-01-31" xmlns:dei="http://xbrl.sec.gov/dei/2013-01-31" xmlns:hd="http://www.homedepot.com/20140202" xmlns:invest="http://xbrl.sec.gov/invest/2013-01-31" xmlns:iso4217="http://ww


In [14]:
import re

# html.unescape is a Python standard-library function that decodes
# HTML character references (for example, &lt; becomes <)
from html import unescape

# regular expression that captures content of the income tax footnote
tax_footnote_regex = re.compile(r"<us-gaap:IncomeTaxDisclosureTextBlock[^>]*>(?P<value>.+?)</us-gaap:IncomeTaxDisclosureTextBlock>", re.IGNORECASE | re.DOTALL)

# XBRL instance documents store text blocks as escaped HTML, so the
# character references have to be decoded before the HTML is parsed.
tax_footnote_html = unescape(tax_footnote_regex.search(xbrl_10k)['value'])

# converts HTML income tax footnote to the plain-text format
tax_footnote_text = get_text_from_html(tax_footnote_html)

# outputs the first 300 characters
print(tax_footnote_text[:300])


INCOME TAXES
The components of Earnings before Provision for Income Taxes for fiscal 2013, 2012 and 2011 were as follows (amounts in millions):
 


The Provision for Income Taxes consisted of the following (amounts in millions):
 

The Company’s combined federal, state and foreign effective tax rat
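
The decoding step is easier to see on a small example. This is a minimal sketch, assuming a made-up `toy_xbrl` fragment with a placeholder `contextRef` value; it is not an excerpt from the Home Depot filing.

```python
import re
from html import unescape

# same pattern as in cell In [14]
tax_footnote_regex = re.compile(
    r"<us-gaap:IncomeTaxDisclosureTextBlock[^>]*>(?P<value>.+?)"
    r"</us-gaap:IncomeTaxDisclosureTextBlock>",
    re.IGNORECASE | re.DOTALL)

# made-up XBRL fragment; the HTML inside the element is escaped
toy_xbrl = ('<us-gaap:IncomeTaxDisclosureTextBlock contextRef="toy_context">'
            '&lt;div&gt;INCOME TAXES&lt;/div&gt;'
            '&lt;p&gt;Toy footnote text&lt;/p&gt;'
            '</us-gaap:IncomeTaxDisclosureTextBlock>')

match = tax_footnote_regex.search(toy_xbrl)
# search() returns None when the element is absent, so the
# result is checked before the named group is read
escaped_html = match['value'] if match else ''
decoded_html = unescape(escaped_html)

print(escaped_html)
print(decoded_html)
```

Only after this step does the value contain real tags that an HTML parser such as `get_text_from_html` can process.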
