# DOCX Homework
#### Author: Alex Sherman (alsherman@deloitte.com)

In this homework, we will use python-docx to extract and structure the text from a Microsoft Word document.

The overall objective (part 3) is to structure the document into sections. As there is no perfect method to define or identify document sections, we will create our own simple heuristics. These include checking whether a paragraph uses a 'HEADING' style, is written in ALL CAPS, or has other styling applied, such as bold or underlined text.

In parts 1 and 2, we will create and test two helper functions (doesnt_have_text and is_section_header) that identify which paragraphs contain useful text and which are section headers. Part 3 uses both.

In [3]:
# read the raw data paths from the config.ini file
# confirm that the printed DOCX_PATH is the correct location of the data
import os
import configparser

config = configparser.ConfigParser()
config.read('../../config.ini')
RAW_DATA = config['USER']['RAW_DATA']
DOC_PATH = config['DOCX']['DOC_PATH']

DOCX_PATH = os.path.join(RAW_DATA, DOC_PATH)

print(DOCX_PATH)

C:\Users\[INSERT_YOUR_USERNAME]\Desktop\PycharmProjects\firm_initiatives\ml_guild\raw_data\oracle-corporation\oracle-corporation_annual_report_2015.docx


In [4]:
# read the document into docx
import docx
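
The later cells refer to a `paragraphs` list that this notebook never defines. A minimal sketch of one way to create it, assuming the file at DOCX_PATH is a valid .docx document. `load_paragraphs` is a hypothetical helper name, not part of the original homework:

```python
def load_paragraphs(path):
    """Return the list of paragraph objects in the Word document at `path`."""
    # imported inside the function so the helper can be defined
    # even where python-docx is not installed yet
    import docx

    document = docx.Document(path)
    return document.paragraphs


# usage in the notebook (DOCX_PATH comes from the config cell above):
# paragraphs = load_paragraphs(DOCX_PATH)
# print(len(paragraphs))
```

Each item in the returned list exposes `.text`, `.style` and `.runs`, which are the attributes the helper functions in parts 1 and 2 rely on.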


### PART 1: Check if a paragraph contains text

Many of the paragraphs in a document are empty strings or otherwise contain no useful text. We will skip (continue past) these paragraphs when structuring the document into sections.

We will look for two conditions:
1. paragraph text is empty
2. paragraph text does not contain any letters (e.g. a phone number or ' ____ ')
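
The two conditions above can be sketched as follows. This is one possible approach rather than the official solution, and the name `doesnt_have_text_sketch` is hypothetical so that it does not clash with the function you write yourself. Note that calling `.isalpha()` on a whole string returns False as soon as the string contains a space or digit, so this sketch tests each character instead:

```python
def doesnt_have_text_sketch(text, contains_alpha=True):
    """Return True if `text` has no useful content."""
    # condition one: empty or whitespace-only string
    if not text.strip():
        return True

    # condition two (optional): no letters at all,
    # e.g. a phone number or a line of underscores
    if contains_alpha and not any(ch.isalpha() for ch in text):
        return True

    return False


print(doesnt_have_text_sketch(' ____ '))            # True
print(doesnt_have_text_sketch('Item 1. Business'))  # False
```

Treating whitespace-only strings as empty is an assumption of this sketch. The exercise itself only asks about empty strings.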

In [4]:
def doesnt_have_text(text, contains_alpha=True):
    """ ignore paragraphs that do not contain any text

    :param text: text to check for characters
    :param contains_alpha: if True, only keep paragraphs that contain letters
    :return: bool, True if the paragraph does not have text
    """
    
    # condition one - check for empty strings

    # condition two (make this optional by checking contains_alpha)
    # ignores phone numbers and string of non-text characters (e.g. '___')
    # consider using the string method .isalpha()

    # if either condition is met, return True (i.e. doesn't have text)
    # else return False


In [6]:
# run this function to check your answers

def check_doesnt_have_text(section_nums, answer):
    for section_num in section_nums:
        print('YOUR ANSWER: {}'.format(doesnt_have_text(paragraphs[section_num].text)))
        print('CORRECT ANSWER: {}'.format(answer))
        print('SECTION: {}'.format(paragraphs[section_num].text))
        print()    
        
check_doesnt_have_text(section_nums=[7, 70, 178], answer='True')
check_doesnt_have_text(section_nums=[8, 92, 104], answer='False')

### PART 2: Check if a paragraph is a section header

Each paragraph that contains text will either start a new section (is a section header) or will be added to the text of a section. To make this determination, we will use the function is_section_header to test if the paragraph text meets any of the criteria we set to define a section header.

There are many criteria we could use to define a section header. However, it is not worth the effort to define section headers perfectly, as that is difficult, if not impossible, because different documents are structured inconsistently. For example, some documents use a table of contents and give each section a 'HEADING' style, while in others the author just CAPITALIZES or bolds the text of each section heading.

We will check for the following four criteria:
1. Heading style applied to the paragraph text
2. All text is capitalized
3. All text is bold
4. All text is underlined
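
The four criteria above can be sketched as follows. This is one possible approach, not the official solution. It assumes each paragraph exposes `.text`, `.style.name` and `.runs`, and each run exposes `.text`, `.bold` and `.underline`, as python-docx paragraphs do. The names `is_section_header_sketch` and `fake_paragraph` are hypothetical, and the fake paragraphs only stand in for real ones so the example runs without a document:

```python
from types import SimpleNamespace


def is_section_header_sketch(p, use_headings=True, use_capitalization=True,
                             use_bold=True, use_underline=True):
    """Return True if paragraph `p` meets any enabled header heuristic."""
    # 1. heading style; style names may be mixed case, so compare upper-cased
    if use_headings and 'HEADING' in p.style.name.upper():
        return True

    # 2. every cased letter is upper case
    if use_capitalization and p.text.isupper():
        return True

    # 3 and 4. look only at runs that contain text
    runs = [run for run in p.runs if run.text.strip()]
    if not runs:
        # all() of an empty list is True, so guard against it explicitly
        return False
    if use_bold and all(run.bold for run in runs):
        return True
    # assumption: any truthy underline value counts as underlined
    if use_underline and all(run.underline for run in runs):
        return True
    return False


def fake_paragraph(text, style='Normal', bold=None, underline=None):
    """Build a stand-in paragraph with one text run and one empty run."""
    run = SimpleNamespace(text=text, bold=bold, underline=underline)
    empty = SimpleNamespace(text='', bold=None, underline=None)
    return SimpleNamespace(text=text, style=SimpleNamespace(name=style),
                           runs=[run, empty])


print(is_section_header_sketch(fake_paragraph('RISK FACTORS')))             # True
print(is_section_header_sketch(fake_paragraph('Our Business', bold=True)))  # True
print(is_section_header_sketch(fake_paragraph('Our Business')))             # False
```

Bold or underline inherited from a style is reported as None on the run, so this sketch only detects formatting applied directly to the runs.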

In [40]:
def is_section_header(p, use_headings=True, use_capitalization=True, use_bold=True, use_underline=True):
    """ determine if a paragraph is a section header

    :param p: paragraph
    :param use_headings: a heading style is applied to the paragraph (e.g. sections listed in a table of contents)
    :param use_capitalization: capitalization of every letter often indicates section header
    :param use_bold: all words in a sentence are bold
    :param use_underline: all words in a sentence are underlined
    :return section_header: boolean, True if paragraph is section header

    NOTE: there is not an exact method to determine a section header
    due to inconsistencies in the way documents are created.
    This function uses heuristics (e.g. all CAPS) to determine sections
    """

    # section_header starts as False and is switched to True
    # if it meets a section condition

    # check if a heading style is applied to the paragraph
    # compare the upper-cased style name against 'HEADING'
    # (style names may be mixed case, e.g. 'Heading 1')
    # NOTE: this is common for sections listed in a table of contents

    # check if every letter in a paragraph is capitalized

    # check for bold and/or underlines conditions

    # iterate through all the runs in a paragraph to check for bold/underlined style

        # ignore (use continue) runs that do not contain text
        # NOTE: the last run is often an empty string

            # add if the text is bold to the bold_runs list
            
            # add if the text is underlined to the underline_runs list

    # all() tests if every item in a list is True
    # if all items are True the all() returns True
    # thus if all items are bold or all are underlined, 
    # then section_header is set to True


In [7]:
# run this function to check your answers

def check_section_header_answers(section_nums, answer):
    for section_num in section_nums:
        print('YOUR ANSWER: {}'.format(is_section_header(paragraphs[section_num])))
        print('CORRECT ANSWER: {}'.format(answer))
        print('SECTION: {}'.format(paragraphs[section_num].text))
        print()    
        
check_section_header_answers(section_nums=[45, 388, 565], answer='True')
check_section_header_answers(section_nums=[46, 500, 545], answer='False')

### PART 3: Extract and structure sections

In the exercise below, we will create a dictionary named sections that stores all of the sections extracted from the document text.

To populate the dict, we will stream through the paragraphs, using the doesnt_have_text function to retain only useful text and the is_section_header function to identify the start of each new section.
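
The grouping logic can be sketched as follows. This is one possible approach under stated assumptions, not the official solution: the two helper functions are passed in as arguments so the example is self-contained, section names are upper-cased (an assumption), and the simple stand-in predicates and fake paragraphs exist only for the demonstration. The name `set_sections_sketch` is hypothetical:

```python
from types import SimpleNamespace


def set_sections_sketch(paragraphs, doesnt_have_text, is_section_header):
    """Group paragraph text into a dict of {section name: section text}."""
    sections = {}
    section_name = 'FIRST SECTION'  # catches text that precedes any header
    section_text = []

    for p in paragraphs:
        if doesnt_have_text(p.text):
            continue  # skip empty or uninformative paragraphs

        if is_section_header(p):
            name = p.text.strip().upper()  # assumption: names are upper-cased
            if not section_text:
                # consecutive headers: treat them as one multi-line name
                section_name = '{} {}'.format(section_name, name)
            else:
                # store the completed section, then start a new one
                sections[section_name] = ' '.join(section_text)
                section_name = name
                section_text = []
        else:
            section_text.append(p.text.strip())

    # store the final section
    sections[section_name] = ' '.join(section_text)
    return sections


# stand-in data and predicates, for illustration only
demo_paragraphs = [SimpleNamespace(text=t) for t in [
    'ORACLE CORP', 'FORM 10-K', 'Annual report text.', '',
    'RISK FACTORS', 'We face competition.', 'More risk text.',
]]
demo_sections = set_sections_sketch(
    demo_paragraphs,
    doesnt_have_text=lambda text: not text.strip(),
    is_section_header=lambda p: p.text.isupper(),
)
print(list(demo_sections))
# ['FIRST SECTION ORACLE CORP FORM 10-K', 'RISK FACTORS']
```

Because the section name is the dict key, a later section with the same name overwrites an earlier one in this sketch.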

In [126]:
def set_sections(paragraphs):
    """ Iterate through every paragraph in a document and group them into sections.
    
    Sections are determined by Microsoft Word style formatting and other
    optional heuristic parameters (listed below).
    """

    # store all sections in a dict with the section name as
    # the key and the section text as the value

    # set a name for the first section to group all the text
    # in case the document has no sections and create an empty
    # list to start collecting section text

    # iterate through every paragraph 
    # if there is text, determine if the text should start a new
    # section_name or if the text should be added to section_text

        # docx stores many paragraphs that are not informative
        # (e.g. line breaks); ignore these and go to the next paragraph

            # a new section_name has been found, so combine
            # all the section text and add it to the sections dict

            # if section_text is empty that means >1 section name 
            # was found in a row (e.g. section name spans multiple lines)
            # these section_names should be combined

                # add completed section to sections dict
                
                # reset name and text for next section
                
            # keep adding text to section_text until
            # a new section starts


    # add the text from the final section to sections dict


In [115]:
# Run set_sections to store the identified sections
sections = set_sections(paragraphs)

In [128]:
# Optional - View the identified sections
sections.keys()

dict_keys(['FIRST SECTION ORACLE CORP FORM 10-K', 'MAIL STOP 5 OP 7 REDWOOD CITY, CA 94065', 'CIK\t0001341439', 'TABLE OF CONTENTS UNITED STATES SECURITIES AND EXCHANGE COMMISSION WASHINGTON, D.C. 20549 FORM 10-K \uf8f3\tANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934 FOR THE FISCAL YEAR ENDED MAY 31, 2015 OR TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934 FOR THE TRANSITION PERIOD FROM\tTO  \t COMMISSION FILE NUMBER: 001-35992 ORACLE CORPORATION (EXACT NAME OF REGISTRANT AS SPECIFIED IN ITS CHARTER) DELAWARE\t54-2185193 (STATE OR OTHER JURISDICTION OF INCORPORATION OR ORGANIZATION) 500 ORACLE PARKWAY (I.R.S. EMPLOYER IDENTIFICATION NO.) REDWOOD CITY, CALIFORNIA\t94065 (ADDRESS OF PRINCIPAL EXECUTIVE OFFICES)\t(ZIP CODE) (REGISTRANT’S TELEPHONE NUMBER, INCLUDING AREA CODE) SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT: TITLE OF EACH CLASS\tNAME OF EACH EXCHANGE ON WHICH REGISTERED', 'SECURITIES R

### Run this code to check your answers

Five of the 10 sections in the section_headers list below are correct. If the set_sections function identifies all five of the correct sections, the code below will print each of them and display 'Correct Answer'.
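
The check compares SHA-256 hashes rather than plain strings, so the notebook can verify an answer without revealing it. A minimal sketch of that idea, in which the helper names and the example answer key are made up for illustration and are not the homework's real answers:

```python
import hashlib


def sha256_hex(text):
    """Return the SHA-256 hex digest of a string."""
    return hashlib.sha256(text.encode()).hexdigest()


def is_known_answer(candidate, answer_hashes):
    """Return True if the candidate's hash appears in the answer key."""
    return sha256_hex(candidate) in answer_hashes


# made-up answer key; only the hash is stored, not the text
example_hashes = {sha256_hex('EXAMPLE SECTION')}

print(is_known_answer('EXAMPLE SECTION', example_hashes))  # True
print(is_known_answer('example section', example_hashes))  # False
```

The comparison is exact, so a section name must match character for character, including case and spacing.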

Good Luck!

In [124]:
import hashlib

section_headers = [
      '2. FINANCIAL STATEMENT SCHEDULES'
    , 'FIRST SECTION ORACLE CORP FORM 10-K'
    , 'ORACLE APPLICATION'
    , 'PLATFORM TECHNOLOGY'
    , 'MARKETING AND SALES'
    , 'SEASONALITY AND CYCLICALITY'
    , 'COMPETITORS'
    , 'REVENUE RECOGNITION FOR MULTIPLE-ELEMENT ARRANGEMENTS—CLOUD SAAS, PAAS AND IAAS OFFERINGS'
    , 'LEGAL CONTINGENCIES'
    , 'LITIGATION'
]

 
def check_answer(section_answers):
    """ check if all of the identified sections are correct """
    # SHA-256 hashes of the 5 correct answers
    answers = [
        'b09a40f412075a5172c2dddb8fcfe6e3adbf1c15e95a3545edb4f9a7deb8a99c'
        , '42d983f056c3258496bd56710be63acc94d8f3b34cddb40a00bfb73fb6569046'
        , '87e2d8aa4ef694c90cd75bd70d2522a8c89db87913a10732b6a78b6a454b59e0'
        , '29e15326a55259af5246041e58609bd4a77d05260b0b4be4878567f1bd21a7cc'
        , 'bbae872247b4d219b7e3c4591860a323af12898bbd107a0348c73f52dbb5ef72'
    ]
    
    # check if each provided section is a correct section_header
    # (True for a correct section, False for an incorrect one)
    checked_answer = [
        hashlib.sha256(section.encode()).hexdigest() in answers
        for section in section_answers
    ]

    # check that 5 distinct sections were provided
    num_distinct_sections = len(set(section_answers))
    if num_distinct_sections != 5:
        return 'Incorrect Answer - {} sections were identified, expected 5'.format(num_distinct_sections)

    # check if all answers were correct
    if all(checked_answer):
        return 'Correct Answer'
    num_incorrect = checked_answer.count(False)
    return 'Incorrect Answer - {} of the 5 identified sections were incorrect'.format(num_incorrect)


# identify which of the above 10 sections were identified in the set_sections function
# 5 of the 10 should be identified in the correct answer
section_answers = []
for section in section_headers:
    if section in sections.keys():
        print('SECTION IDENTIFIED: {}'.format(section))
        section_answers.append(section)

# check if all identified sections are correct
check_answer(section_answers)

SECTION IDENTIFIED: 2. FINANCIAL STATEMENT SCHEDULES
SECTION IDENTIFIED: FIRST SECTION ORACLE CORP FORM 10-K
SECTION IDENTIFIED: MARKETING AND SALES
SECTION IDENTIFIED: SEASONALITY AND CYCLICALITY
SECTION IDENTIFIED: LEGAL CONTINGENCIES


'Correct Answer'