# DOCX Homework Solution
#### Author: Alex Sherman (alsherman@deloitte.com)

In this homework, we will use python-docx to extract and structure the text from a Microsoft Word document.

The overall objective (part 3) is to structure the document into sections. As there is no perfect method to define or identify document sections, we will create our own simple hueristics. These include looking if a paragraph contains 'HEADING' styles, uses ALL CAPS, or applies other styles like bold or underlined text.

In part one and two, we will create and test two helper functions (doesnt_have_text and is_section_header) to help us identify which paragraphs contain useful text and/or are section headers in part 3.

In [4]:
# read the raw data paths from the config.ini file
# confirm that the printed DOCX_PATH is the correct location of the data
import os
import docx
from configparser import ConfigParser, ExtendedInterpolation

config = ConfigParser(interpolation=ExtendedInterpolation())
config.read('../../config.ini')
DOCX_PATH = config['TEXT_EXTRACTION']['DOCX_PATH']

print(DOCX_PATH)

C:\Users\alsherman\Desktop\NLP\nlp_practicum_health\raw_data\southwest-airlines-co\in_progress\southwest-airlines-co_annual_report_2016.docx


In [5]:
# read the document into docx
doc = docx.Document(DOCX_PATH)

In [6]:
# create an object with all paragraphs
paragraphs = doc.paragraphs

### PART 1: Check if a paragraph contains text

Many of the paragraphs in a document are empty strings, not containing useful text. We will want to skip (continue) these paragraphs when structing the document into sections.

We will look for two conditions:
1. paragraph text is empty
2. paragraph text does not contain any letters (e.g. a phone number or ' ____ ')

In [7]:
def doesnt_have_text(text, alpha_only=True):
    """ ignore paragraphs that do not contain any text

    :param text: text to check for characters
    :param alpha_only: if True, keep  paragraph that have letters 
    :return: bool (True) if the paragraph is empty
    """
    
    # condition one - check for empty strings
    # create a boolean (True/False) called empty_string
    # with the result of if the text has no characters
    empty_string = text.strip() == ''

    # condition two - only keep string with at least one letter ([a-zA-Z])
    # This will ignore phone numbers and strings of non-text characters (e.g. '___')
    # consider using the string method .isalpha()
    # name the variable has_required_characters
    # make this optional by adding a check for the parameter alpha_only (e.g. if alpha_only:)
    has_required_characters = True
    if alpha_only:
        has_required_characters = any(char.isalpha() for char in text)

    # if the paragraph is empty (e.g. is an empty_string or not has_required_characters)
    # return True (i.e. doesn't have text) else return False
    if empty_string or not has_required_characters:
        return True
    return False

### Run this code to check your answers

In [8]:
def check_doesnt_have_text(section_nums, answer):
    for section_num in section_nums:
        print('YOUR ANSWER: {}'.format(doesnt_have_text(paragraphs[section_num].text)))
        print('CORRECT ANSWER: {}'.format(answer))
        print('SECTION: {}\n'.format(paragraphs[section_num].text))
        
check_doesnt_have_text(section_nums=[8, 70, 178], answer='True')
check_doesnt_have_text(section_nums=[7, 92, 104], answer='False')

YOUR ANSWER: True
CORRECT ANSWER: True
SECTION: 

YOUR ANSWER: True
CORRECT ANSWER: True
SECTION: 

YOUR ANSWER: True
CORRECT ANSWER: True
SECTION: 

YOUR ANSWER: False
CORRECT ANSWER: False
SECTION: $45-$55 a barrel range for Brent crude oil. The result was another year of record traffic, record load factors, record revenues, record profits, and a record year-end stock price (LUV). For the second year in a row, and for only the second time in our history, our annual pre-tax return on invested capital (ROIC)1 was 30 percent or better. It was our 44th consecutive year of profitability, a record unmatched in the domestic airline industry, and a continued display of our leadership in corporate America.

YOUR ANSWER: False
CORRECT ANSWER: False
SECTION: TEXAS	74-1563240

YOUR ANSWER: False
CORRECT ANSWER: False
SECTION: Indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act.    Yes  ‘ No Í



### PART 2: Check if a paragraph is a section header

Each paragraph that contains text will either start a new section (is a section header) or will be added to the text of a section. To make this determination, we will use the function is_section_header to test if the paragraph text meets any of the criteria we set to define a section header.

There are many criteria we could create to define a section header. Moreso, it is not worth the effort to perfectly define each section header as that is difficult if not impossible due to inconsistencies in the ways that different documents are structured. As an example, some documents use a Table of Contents, giving sections a 'HEADING' style, other times the document owner just CAPITALIZES or bolds the text of each section.

We will create functions to check for the three folowing criteria:
1. Heading style applied to the paragraph text
2. All text is capitalized
3. All text is bold

In [9]:
def heading(p):
    """ check if Header formatting is applied to the paragraph style (common in a table of contents) """
    # make sure to use upper case as the 'HEADING' style is always uppercase
    if 'HEADING' in p.style.name.upper():
        return True
    return False

In [10]:
def capitalization(p):
    """ has capitalization of every letter """
    if p.text.isupper():
        return True
    return False

In [11]:
def bold(p):
    """ determine if all text in a paragraph is bold """
    bold_runs = []

    # iterate through all the runs in a paragraph to check for bold/underlined style
    for run in p.runs:
        # ignore runs (style) for blank space at end of sentence
        # NOTE: the last run is often an empty string
        if run.text.strip() == '':
            continue

        # add if the text is bold to the bold_runs list
        bold_runs.append(run.bold)
    
    # all() tests if every item in a list is True
    # if all items are True the all() returns True
    # thus if all items are bold then section_header is set to True
    bold_cond = all(bold_runs) and bold_runs != list()
    if bold_cond:
        return True
    return False

In [12]:
def is_section_header(p, use_headings=True, use_capitalization=True, use_bold=True):
    """ determine if a paragraph is a section header

    :param p: paragraph
    :param use_headings: uses a header formatting (e.g. table of contents)
    :param use_capitalization: capitalization of every letter often indicates section header
    :param use_bold: all words in a sentence are bold
    :return section_header: boolean, True if paragraph is section header

    NOTE: there is not an exact method to determine a section header
    due to inconsistencies in the way documents are created.
    This function uses heuristics (e.g. all CAPS) to determine sections
    """

    # section_header starts as False and is switched to True
    # if it meets any section condition
    section_header = False

    # check for heading style
    if use_headings:
        section_header = heading(p)

    # check if every letter in a paragraph is capitalized
    if use_capitalization:
        section_header = capitalization(p)

    # check for bold text
    if use_bold:
        section_header = bold(p)

    return section_header

### Run this code to check your answers

In [13]:
def check_section_header_answers(section_nums, answer):
    for section_num in section_nums:
        print('YOUR ANSWER: {}'.format(is_section_header(paragraphs[section_num])))
        print('CORRECT ANSWER: {}'.format(answer))
        print('SECTION: {}\n'.format(paragraphs[section_num].text))
        
check_section_header_answers(section_nums=[2,674, 774], answer='True')
check_section_header_answers(section_nums=[46,500,546], answer='False')

YOUR ANSWER: True
CORRECT ANSWER: True
SECTION: SOUTHWEST AIRLINES CO.

YOUR ANSWER: True
CORRECT ANSWER: True
SECTION: Item 4.        Mine Safety Disclosures

YOUR ANSWER: True
CORRECT ANSWER: True
SECTION: Item 6.      Selected Financial Data

YOUR ANSWER: False
CORRECT ANSWER: False
SECTION: 

YOUR ANSWER: False
CORRECT ANSWER: False
SECTION: The airline business  is  labor  intensive.  Salaries,  wages,  and  benefits  represented  approximately  41 percent of the Company’s operating expenses for the year ended December 31, 2016. In addition, as of December 31, 2016, approximately 83 percent of the Company’s Employees were represented for collective bargaining purposes by labor unions, making the Company particularly exposed in the event of labor-related job actions. Employment-related issues that have, and continue to, impact the Company’s results of operations, some of which are negotiated items, include hiring/retention rates, pay rates, outsourcing costs, work rules, health car

### PART 3: Extract and structure sections

In the below exercise, we will create a dictionary named sections that will store all of the extracted sections from the document text.

To populate the dict, we will stream through the text using the doesnt_have_text function to retain only useful text and the is_section_header function to identify the start of each new section.

In [14]:
def set_sections(paragraphs):
    """ Iterate through every paragraph in a document and group them into sections.
    
    Sections are determined by Microsoft Word style formatting and other
    optional heuristic parameters (listed below).
    """

    # store all sections in a dict with the section name as
    # the key and the section text as the value
    sections = {}

    # set a name for the first section  to group all the text 
    # in case the documents has no sections and create an empty
    # list to start collecting section text
    section_name = 'FIRST SECTION'
    section_text = []

    # iterate through every paragraph 
    # if there is text, determine if the text should start a new
    # section_name or if the text should be added to section_text
    for p in paragraphs:

        # docx stores many paragraph which are not informative
        # (e.g. line breaks) ignore these - continue to next paragraph
        if doesnt_have_text(p.text):
            continue

        if is_section_header(p, use_headings=True, use_capitalization=True, use_bold=True):
            # a new section_name has been found, so combine
            # all the section text and add it to the sections dict
            text = ' '.join(section_text).strip()

            # if section_text is empty that means >1 section name 
            # was found in a row (e.g. section name spans multiple lines)
            # these section_names should be combined
            if doesnt_have_text(text):
                section_name = ' '.join([section_name, p.text.upper().strip()])
            else:                    
                # add completed section to sections dict
                sections[section_name] = text
                # reset name and text for next section
                section_name = p.text.upper().strip()
                section_text = []
        else:
            # keep adding text to section_text until
            # a new section starts
            section_text.append(p.text)


    # add the text from the final section to sections dict
    text = ' '.join(section_text).strip()
    if not doesnt_have_text(text):
        sections[section_name] = text

    return sections

In [15]:
# Run set sections to store the identified sections
sections = set_sections(paragraphs)

In [16]:
# Optional - View the identified sections
sections.keys()

dict_keys(['FAIR VALUE MEASUREMENTS AT REPORTING DATE USING: QUOTED PRICES IN ACTIVE MARKETS SIGNIFICANT OTHER OBSERVABLE SIGNIFICANT UNOBSERVABLE DESCRIPTION DECEMBER 31, FOR IDENTICAL ASSETS (LEVEL 1) INPUTS (LEVEL 2) INPUTS (LEVEL 3)', 'DOCUMENTS INCORPORATED BY REFERENCE', 'SOUTHWEST AIRLINES CO. CONSOLIDATED STATEMENT OF INCOME', 'FAIR VALUE MEASUREMENTS AT REPORTING DATE USING: DESCRIPTION\tDECEMBER 31, 2016 QUOTED PRICES IN ACTIVE MARKETS FOR IDENTICAL ASSETS (LEVEL 1) SIGNIFICANT OTHER OBSERVABLE INPUTS (LEVEL 2) SIGNIFICANT UNOBSERVABLE INPUTS (LEVEL 3)', 'FIRST SECTION SOUTHWEST AIRLINES CO. 2016 ANNUAL REPORT TO SHAREHOLDERS', 'SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT: NONE', 'STOCK EXCHANGE LISTING', 'DAVID W. BIEGLER', 'FAIR VALUE MEASUREMENTS USING SIGNIFICANT UNOBSERVABLE INPUTS (LEVEL 3)', 'DECEMBER 31, 2016\tDECEMBER 31, 2015 DESCRIPTION', 'THOMAS W. GILLIGAN', 'INDEPENDENT REGISTERED PUBLIC ACCOUNTING FIRM', 'ITEM 11.\tEXECUTIVE COMPENSATION', 'GROSS

### Run this code to check your answers

2 of the below 8 sections are correct in the section_headers list are correct. If the set_sections function identifies all five of the correct answers, then the below code will print out each of the five correct sections and print 'Correct Answer'

Good Luck!

In [17]:
import hashlib

section_headers = [
      'TECHNOLOGY INITIATIVES'
    , 'REGULATION'
    , 'FINANCIAL INFORMATION'
    , 'SEASONALITY AND CYCLICALITY'
    , 'COMPETITORS'
    , 'REVENUE RECOGNITION FOR MULTIPLE-ELEMENT ARRANGEMENTS—CLOUD SAAS, PAAS AND IAAS OFFERINGS'
    , 'STOCK EXCHANGE LISTING'    
    , 'AVIATION TAXES AND FEES'
]

# identify which of the above 10 sections were identified in the set_sections function
# 2 of the 9 should be identified in the correct answer
section_answers = []
for section in section_headers:
    if section in sections.keys():
        print('SECTION IDENTIFIED: {}'.format(section))
        section_answers.append(section)

SECTION IDENTIFIED: FINANCIAL INFORMATION
SECTION IDENTIFIED: STOCK EXCHANGE LISTING


In [18]:
def check_answer(section_answers):
    """ check if all of the identified sections are correct """
    # hash of correct answers
    answers = [
          '7c3eccefc570963b93e8a11e723f73d1b35e48b68268883169945686a577edaf'
        , '03c2ff1e7ca7cd3d007c1996bdb4a855163cf5f2e452999293b614368104eab4'
    ]
        
    # check if each provided section is a correct section_header
    checked_answer = []
    for section in section_answers:
        print(section)
        print(hashlib.sha256(section.encode()).hexdigest())
        print()
        
        if hashlib.sha256(section.encode()).hexdigest() not in answers:
            checked_answer.append(False)
        else:
            checked_answer.append(True)
    
    # check that 2 distinct sections were provided 
    num_distinct_sections = len(set(section_answers))
    if num_distinct_sections != 2:
        print('INCORRECT ANSWER - ONLY {} OF 2 SECTIONS WERE IDENTIFIED'.format(num_distinct_sections))
    
    # check if all answers were correct
    if all(checked_answer) and len(checked_answer) > 0:
        return 'Correct Answer'
    return 'Incorrect Answer - {} of the 2 identified sections were incorrect'.format(sum(checked_answer))

# check if all identified sections are correct
check_answer(section_answers)

FINANCIAL INFORMATION
7c3eccefc570963b93e8a11e723f73d1b35e48b68268883169945686a577edaf

STOCK EXCHANGE LISTING
03c2ff1e7ca7cd3d007c1996bdb4a855163cf5f2e452999293b614368104eab4



'Correct Answer'