# DOCX Homework Solution
#### Author: Alex Sherman (alsherman@deloitte.com)

In this homework, we will use python-docx to extract and structure the text from a Microsoft Word document.

The overall objective (part 3) is to structure the document into sections. As there is no perfect method to define or identify document sections, we will create our own simple heuristics. These include checking whether a paragraph uses a 'HEADING' style, is written in ALL CAPS, or applies another style such as bold or underlined text.

In parts 1 and 2, we will create and test two helper functions (doesnt_have_text and is_section_header) that identify which paragraphs contain useful text and which are section headers. Part 3 uses both.

In [33]:
# read the raw data paths from the config.ini file
# confirm that the printed DOCX_PATH is the correct location of the data
import os
import configparser

config = configparser.ConfigParser()
config.read('../../config.ini')
RAW_DATA = config['USER']['RAW_DATA']
DOC_PATH = config['DOCX']['DOC_PATH']

DOCX_PATH = os.path.join(RAW_DATA, DOC_PATH)

print(DOCX_PATH)

C:\Users\alsherman\Desktop\PycharmProjects\firm_initiatives\ml_guild\raw_data\southwest-airlines-co\southwest-airlines-co_annual_report_2016.docx
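
The cell above expects config.ini to contain a USER section with a RAW_DATA key and a DOCX section with a DOC_PATH key. The sketch below shows that layout, parsed from a string so it runs without a file on disk. Only the section and key names come from the cell above; the directory and file name are hypothetical placeholders.

```python
import configparser
import os

# Hypothetical config contents: only the section and key names
# (USER/RAW_DATA and DOCX/DOC_PATH) are taken from the notebook.
EXAMPLE_CONFIG = """
[USER]
RAW_DATA = /path/to/raw_data

[DOCX]
DOC_PATH = example-company/example_annual_report.docx
"""

example = configparser.ConfigParser()
example.read_string(EXAMPLE_CONFIG)

# join the base directory and the relative document path
example_docx_path = os.path.join(example['USER']['RAW_DATA'], example['DOCX']['DOC_PATH'])
print(example_docx_path)
```

Note that ConfigParser.read() silently skips files it cannot find, so a wrong relative path to config.ini typically surfaces as a KeyError on the first lookup rather than as a file error.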


In [34]:
# read the document into docx
import docx
doc = docx.Document(DOCX_PATH)

# create an object with all paragraphs
paragraphs = doc.paragraphs

### PART 1: Check if a paragraph contains text

Many of the paragraphs in a document are empty strings that contain no useful text. We will want to skip (continue) these paragraphs when structuring the document into sections.

We will look for two conditions:
1. paragraph text is empty
2. paragraph text does not contain any letters (e.g. a phone number or ' ____ ')

In [35]:
def doesnt_have_text(text, contains_alpha=True):
    """ ignore paragraphs that do not contain any text

    :param text: text to check for characters
    :param contains_alpha: if True, also treat text without any letters as empty
    :return: bool, True if the paragraph has no usable text
    """
    
    # condition one - check for empty strings
    empty_string = text.strip() == ''

    # condition two (optional, controlled by contains_alpha)
    # ignores phone numbers and strings of non-letter characters (e.g. '___')
    # str.isalpha() tests whether a character is a letter
    has_letters = True
    if contains_alpha:
        has_letters = any(char.isalpha() for char in text)

    # if either condition is met, return True (i.e. doesn't have text)
    # else return False
    if empty_string or not has_letters:
        return True
    return False
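
The two conditions can be tried on plain strings without opening the document. The sketch below uses a hypothetical stand-in named has_no_text that mirrors the default behavior of doesnt_have_text. Apart from the company name, the sample strings are made up for illustration.

```python
def has_no_text(text):
    """Hypothetical stand-in for doesnt_have_text with contains_alpha=True."""
    empty_string = text.strip() == ''
    has_letters = any(char.isalpha() for char in text)
    return empty_string or not has_letters

# made-up samples: empty, whitespace, underscores, digits only, real text
samples = ['', '   ', ' ____ ', '555-0100', 'SOUTHWEST AIRLINES CO.', 'Form 10-K']

results = {sample: has_no_text(sample) for sample in samples}
for sample, result in results.items():
    print(repr(sample), result)
```

The first four samples are treated as having no text, while the last two are kept because each contains at least one letter.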

In [47]:
# run this function to check your answers

def check_doesnt_have_text(section_nums, answer):
    for section_num in section_nums:
        print('YOUR ANSWER: {}'.format(doesnt_have_text(paragraphs[section_num].text)))
        print('CORRECT ANSWER: {}'.format(answer))
        print('SECTION TEXT: {}'.format(paragraphs[section_num].text))
        print()    
        
check_doesnt_have_text(section_nums=[2, 29, 33], answer='False')
check_doesnt_have_text(section_nums=[8, 17, 28], answer='True')

YOUR ANSWER: False
CORRECT ANSWER: False
SECTION TEXT: SOUTHWEST AIRLINES CO.

YOUR ANSWER: False
CORRECT ANSWER: False
SECTION TEXT: Another significant long-term effort scheduled for 2017 is the completion of a new five-gate international terminal at the Ft Lauderdale-Hollywood International Airport (FLL), slated for June 2017. Coincident with the opening is the launch of new international service at FLL by Southwest Airlines. Currently, we serve just The Bahamas and Cuba from FLL, but plan to add service to Belize, Jamaica, and Mexico along with our newest destination, Grand Cayman. This is a much-anticipated and much-needed enhancement to our FLL franchise.

YOUR ANSWER: False
CORRECT ANSWER: False
SECTION TEXT: This is now the 23rd consecutive year that Southwest has been named to Fortune’s list of World’s Most Admired Companies, coming in at #8. We’re proud of that.

YOUR ANSWER: True
CORRECT ANSWER: True
SECTION TEXT: 

YOUR ANSWER: True
CORRECT ANSWER: True
SECTION TEXT: 

YOUR

### PART 2: Check if a paragraph is a section header

Each paragraph that contains text will either start a new section (is a section header) or will be added to the text of a section. To make this determination, we will use the function is_section_header to test if the paragraph text meets any of the criteria we set to define a section header.

There are many criteria we could use to define a section header. Moreover, it is not worth the effort to define each section header perfectly, as that is difficult if not impossible given the inconsistent ways different documents are structured. For example, some documents use a table of contents and give sections a 'HEADING' style, while in others the author just CAPITALIZES or bolds the text of each section.

We will check for the following four criteria:
1. Heading style applied to the paragraph text
2. All text is capitalized
3. All text is bold
4. All text is underlined

In [50]:
def is_section_header(p, use_headings=True, use_capitalization=True, use_bold=True, use_underline=True):
    """ determine if a paragraph is a section header

    :param p: paragraph
    :param use_headings: uses a header formatting (e.g. table of contents)
    :param use_capitalization: capitalization of every letter often indicates section header
    :param use_bold: all words in a sentence are bold
    :param use_underline: all words in a sentence are underlined
    :return section_header: boolean, True if paragraph is section header

    NOTE: there is not an exact method to determine a section header
    due to inconsistencies in the way documents are created.
    This function uses heuristics (e.g. all CAPS) to determine sections
    """

    # section_header starts as False and is switched to True
    # if it meets a section condition
    section_header = False

    # check if a heading style is applied to the paragraph
    # upper-case the style name so the match on 'HEADING' is case-insensitive
    # (python-docx reports built-in heading styles as e.g. 'Heading 1')
    # NOTE: this is common for sections listed in a table of contents
    if use_headings:
        if 'HEADING' in p.style.name.upper():
            section_header = True

    # check if every letter in a paragraph is capitalized
    if use_capitalization:
        if p.text.isupper():
            section_header = True

    # check for bold and/or underlines conditions
    bold_runs = []
    underline_runs = []

    # iterate through all the runs in a paragraph to check for bold/underlined style
    for run in p.runs:
        # ignore (use continue) runs that do not contain text
        # NOTE: the last run is often an empty string
        if run.text.strip() == '':
            continue
        if use_bold:
            # add if the text is bold to the bold_runs list
            bold_runs.append(run.bold)
        if use_underline:
            # add if the text is underlined to the underline_runs list
            underline_runs.append(run.underline)

    # all() returns True when every item in a list is truthy,
    # but it also returns True for an empty list, so require at least one run;
    # if every text run is bold, or every text run is underlined,
    # then section_header is set to True
    bold_cond = bool(bold_runs) and all(bold_runs)
    underline_cond = bool(underline_runs) and all(underline_runs)
    if bold_cond or underline_cond:
        section_header = True

    return section_header
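
The bold and underline checks depend on two details: python-docx reports run.bold and run.underline as True, False, or None (None meaning the value is inherited from the style), and all() returns True for an empty list. The sketch below isolates the bold check using a hypothetical StubRun class in place of real python-docx runs, so it can be run without a document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubRun:
    """Hypothetical stand-in for a python-docx run: text plus tri-state bold."""
    text: str
    bold: Optional[bool] = None  # True, False, or None (inherited from the style)


def all_text_runs_bold(runs):
    # skip whitespace-only runs, as is_section_header does
    flags = [run.bold for run in runs if run.text.strip() != '']
    # all([]) is True, so an empty list must be rejected explicitly
    return bool(flags) and all(flags)


fully_bold = [StubRun('Company ', bold=True), StubRun('Overview', bold=True), StubRun('')]
partly_bold = [StubRun('Note: ', bold=True), StubRun('see Item 7.', bold=None)]
no_text = [StubRun('   ', bold=True)]

print(all_text_runs_bold(fully_bold))   # True
print(all_text_runs_bold(partly_bold))  # False
print(all_text_runs_bold(no_text))      # False
```

Because None is falsy, a paragraph whose bold formatting comes only from its style rather than from its individual runs would not be detected by this check. That is a limitation of the heuristic.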

In [55]:
# run this function to check your answers

def check_section_header_answers(section_nums, answer):
    for section_num in section_nums:
        print('YOUR ANSWER: {}'.format(is_section_header(paragraphs[section_num])))
        print('CORRECT ANSWER: {}'.format(answer))
        print('SECTION: {}'.format(paragraphs[section_num].text))
        print()    
        
check_section_header_answers(section_nums=[124,351,441], answer='True')
check_section_header_answers(section_nums=[29,101,145], answer='False')

YOUR ANSWER: True
CORRECT ANSWER: True
SECTION: Company Overview

YOUR ANSWER: True
CORRECT ANSWER: True
SECTION: Environmental Regulation

YOUR ANSWER: True
CORRECT ANSWER: True
SECTION: Employees

YOUR ANSWER: False
CORRECT ANSWER: False
SECTION: Another significant long-term effort scheduled for 2017 is the completion of a new five-gate international terminal at the Ft Lauderdale-Hollywood International Airport (FLL), slated for June 2017. Coincident with the opening is the launch of new international service at FLL by Southwest Airlines. Currently, we serve just The Bahamas and Cuba from FLL, but plan to add service to Belize, Jamaica, and Mexico along with our newest destination, Grand Cayman. This is a much-anticipated and much-needed enhancement to our FLL franchise.

YOUR ANSWER: False
CORRECT ANSWER: False
SECTION: Common Stock ($1.00 par value)	New York Stock Exchange

YOUR ANSWER: False
CORRECT ANSWER: False
SECTION: operations at a limited number of central hub cities and s

### PART 3: Extract and structure sections

In the exercise below, we will create a dictionary named sections that stores all of the sections extracted from the document text.

To populate the dict, we will stream through the paragraphs, using the doesnt_have_text function to retain only useful text and the is_section_header function to identify the start of each new section.

In [56]:
def set_sections(paragraphs):
    """ Iterate through every paragraph in a document and group them into sections.
    
    Sections are determined by Microsoft Word style formatting and other
    optional heuristic parameters (listed below).
    """

    # store all sections in a dict with the section name as
    # the key and the section text as the value
    sections = {}

    # set a name for the first section to group all the text
    # in case the document has no sections, and create an empty
    # list to start collecting section text
    section_name = 'FIRST SECTION'
    section_text = []

    # iterate through every paragraph 
    # if there is text, determine if the text should start a new
    # section_name or if the text should be added to section_text
    for p in paragraphs:

        # docx stores many paragraphs that are not informative
        # (e.g. line breaks); ignore these and go to the next paragraph
        if doesnt_have_text(p.text):
            continue

        if is_section_header(p, use_headings=True, use_capitalization=True, use_bold=True, use_underline=True):
            # a new section_name has been found, so combine
            # all the section text and add it to the sections dict
            text = ' '.join(section_text).strip()

            # if section_text is empty that means >1 section name 
            # was found in a row (e.g. section name spans multiple lines)
            # these section_names should be combined
            if doesnt_have_text(text):
                section_name = ' '.join([section_name, p.text.upper().strip()])
            else:                    
                # add completed section to sections dict
                sections[section_name] = text
                # reset name and text for next section
                section_name = p.text.upper().strip()
                section_text = []
        else:
            # keep adding text to section_text until
            # a new section starts
            section_text.append(p.text)


    # add the text from the final section to sections dict
    text = ' '.join(section_text).strip()
    if not doesnt_have_text(text):
        sections[section_name] = text

    return sections
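
The grouping logic can be followed more easily on a small input. The sketch below is a hypothetical simplification of set_sections that takes (is_header, text) pairs instead of python-docx paragraphs and treats an empty text list as the sign of consecutive headers. The function name and sample data are made up for illustration.

```python
def group_sections(items, first_name='FIRST SECTION'):
    """Group (is_header, text) pairs into {section name: section text}.

    Hypothetical simplification of set_sections for plain tuples.
    """
    sections = {}
    name, body = first_name, []
    for is_header, text in items:
        if is_header:
            if not body:
                # consecutive headers: extend the current section name
                name = ' '.join([name, text.upper().strip()])
            else:
                # close the finished section and start a new one
                sections[name] = ' '.join(body).strip()
                name, body = text.upper().strip(), []
        else:
            body.append(text)
    # store the final section if it collected any text
    if body:
        sections[name] = ' '.join(body).strip()
    return sections


items = [
    (True, 'Part I'),
    (True, 'Company Overview'),
    (False, 'First paragraph.'),
    (False, 'Second paragraph.'),
    (True, 'Industry'),
    (False, 'Third paragraph.'),
]
grouped = group_sections(items)
print(list(grouped))
# ['FIRST SECTION PART I COMPANY OVERVIEW', 'INDUSTRY']
```

The placeholder name is merged with the headers that follow it whenever the document opens with headers and no body text, which explains a first key that begins with 'FIRST SECTION'.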

In [57]:
# Run set sections to store the identified sections
sections = set_sections(paragraphs)

In [58]:
# Optional - View the identified sections
sections.keys()

dict_keys(['FIRST SECTION SOUTHWEST AIRLINES CO. 2016 ANNUAL REPORT TO SHAREHOLDERS', 'UNITED STATES SECURITIES AND EXCHANGE COMMISSION WASHINGTON, D.C. 20549 FORM 10-K', 'SOUTHWEST AIRLINES CO.', 'TEXAS\t74-1563240', 'SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT: TITLE OF EACH CLASS\tNAME OF EACH EXCHANGE ON WHICH REGISTERED', 'SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT: NONE', 'DOCUMENTS INCORPORATED BY REFERENCE', 'TABLE OF CONTENTS PART I ITEM 1.\tBUSINESS COMPANY OVERVIEW', 'INDUSTRY', 'COMPANY OPERATIONS ROUTE STRUCTURE GENERAL', 'INTERNATIONAL SERVICE', 'COST STRUCTURE', 'COST AVERAGE COST PER PERCENTAGE OF OPERATING', 'YEAR ENDED DECEMBER 31,', 'FARE STRUCTURE GENERAL', 'ANCILLARY SERVICES', 'RAPID REWARDS FREQUENT FLYER PROGRAM', 'SOUTHWEST.COM', 'MARKETING', 'TECHNOLOGY INITIATIVES', 'REGULATION', 'ECONOMIC AND OPERATIONAL REGULATION CONSUMER PROTECTION REGULATION BY THE U.S. DEPARTMENT OF TRANSPORTATION', 'AVIATION TAXES AND FEES', 'OPERATIONAL,

### Run this code to check your answers

Five of the ten sections in the section_headers list below are correct.

If the set_sections function identifies all five of the correct answers, then the code below prints each of the five correct sections, and the check_answer function in the next cell returns 'Correct Answer'.

Good Luck!

In [84]:
import hashlib

section_headers = [
      'TECHNOLOGY INITIATIVES'
    , 'LITIGATION'
    , 'REGULATION'
    , 'FINANCIAL INFORMATION'
    , 'SEASONALITY AND CYCLICALITY'
    , 'COMPETITORS'
    , 'REVENUE RECOGNITION FOR MULTIPLE-ELEMENT ARRANGEMENTS—CLOUD SAAS, PAAS AND IAAS OFFERINGS'
    , 'STOCK EXCHANGE LISTING'    
    , 'AVIATION TAXES AND FEES'
    , 'LEGAL CONTINGENCIES'
]

# identify which of the above 10 sections were identified in the set_sections function
# 5 of the 10 should be identified in the correct answer
section_answers = []
for section in section_headers:
    if section in sections.keys():
        print('SECTION IDENTIFIED: {}'.format(section))
        section_answers.append(section)

SECTION IDENTIFIED: TECHNOLOGY INITIATIVES
SECTION IDENTIFIED: REGULATION
SECTION IDENTIFIED: FINANCIAL INFORMATION
SECTION IDENTIFIED: STOCK EXCHANGE LISTING
SECTION IDENTIFIED: AVIATION TAXES AND FEES


In [85]:
def check_answer(section_answers):
    """ check if all of the identified sections are correct """
    # hash of all 5 correct answers
    answers = [
          '48084e86a5eb365bd3c15ace77221466fa14eb37a8da46e31c42eb520d7b9182'
        , 'c18ba027322b631ffc9dfeedac967f907f7cf9aba45e57c32a0aa1f798973558'
        , '03c2ff1e7ca7cd3d007c1996bdb4a855163cf5f2e452999293b614368104eab4'
        , '9bfd9c221f652597c1fa3f58735e93cd5c40cee8020bffdc07443d6c742db105'
        , '7c3eccefc570963b93e8a11e723f73d1b35e48b68268883169945686a577edaf'
    ]
        
    # check if each provided section is a correct section_header
    checked_answer = []
    for section in section_answers:
        if hashlib.sha256(section.encode()).hexdigest() not in answers:
            checked_answer.append(False)
        else:
            checked_answer.append(True)
    
    # check that 5 distinct sections were provided
    num_distinct_sections = len(set(section_answers))
    if num_distinct_sections != 5:
        return 'Incorrect Answer - {} distinct sections were identified, expected 5'.format(num_distinct_sections)

    # check if all answers were correct
    if all(checked_answer):
        return 'Correct Answer'
    num_incorrect = checked_answer.count(False)
    return 'Incorrect Answer - {} of the 5 identified sections were incorrect'.format(num_incorrect)

# check if all identified sections are correct
check_answer(section_answers)

'Correct Answer'
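
check_answer compares SHA-256 digests rather than the section names themselves, so the answer key can sit in the notebook without revealing the answers. The sketch below shows how such a key could be built and checked. The section names are hypothetical examples, not the real answers, and hash_answer is an illustrative helper.

```python
import hashlib


def hash_answer(section_name):
    """Return the SHA-256 hex digest of a section name, as check_answer computes it."""
    return hashlib.sha256(section_name.encode()).hexdigest()


# hypothetical answer key: illustrative names, not the real answers
answer_key = {hash_answer(name) for name in ['EXAMPLE SECTION A', 'EXAMPLE SECTION B']}

print(hash_answer('EXAMPLE SECTION A') in answer_key)  # True
print(hash_answer('example section a') in answer_key)  # False: hashing is case-sensitive
```

Because the digest changes with any difference in case or whitespace, the section names produced by set_sections must match the expected strings exactly, which is one reason the function upper-cases and strips each name.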