## Text Extraction Solution

##### Author: Alex Sherman | alsherman@deloitte.com

### Exercise
 1. Count how many paragraphs have a heading style
 2. Store the text of all the paragraphs with a heading style

In [16]:
# store all heading paragraphs
headings = [p.text.strip() for p in paragraphs if 'heading' in p.style.name.lower()]

print('# heading paragraphs: {}\n'.format(len(headings)))
headings[0:10]

# heading paragraphs: 145



['UNITED STATES',
 'PART I',
 'Company Overview',
 'Industry',
 'Company Operations Route Structure',
 'General',
 'International Service',
 'Cost Structure',
 'Fare Structure',
 'General']
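The same filter can also report how many paragraphs use each heading style. The following is a minimal sketch, not part of the original solution: `count_heading_styles` and `make_paragraph` are hypothetical helpers, and the sample objects are stand-ins that mimic only the `.text` and `.style.name` attributes used above, so the example runs without a document on disk.

```python
from collections import Counter
from types import SimpleNamespace


def count_heading_styles(paragraphs):
    """Count paragraphs per heading style name (e.g. 'Heading 1')."""
    return Counter(
        p.style.name
        for p in paragraphs
        if 'heading' in p.style.name.lower()
    )


# Hypothetical stand-in that mimics only .text and .style.name
def make_paragraph(text, style_name):
    return SimpleNamespace(text=text, style=SimpleNamespace(name=style_name))


sample_paragraphs = [
    make_paragraph('PART I', 'Heading 1'),
    make_paragraph('Company Overview', 'Heading 2'),
    make_paragraph('Southwest Airlines Co. operates ...', 'Normal'),
    make_paragraph('Industry', 'Heading 2'),
]

heading_counts = count_heading_styles(sample_paragraphs)
print(heading_counts.most_common())
```

Passing the real `paragraphs` list instead of `sample_paragraphs` should give a per-style breakdown of the heading paragraphs counted above.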

### Exercise

#### Find all the bold runs

- Iterate through all the runs in all the paragraphs to identify any run with a bold style applied.
- Store all the bold text in a list named bold_text
- Do not include empty strings (e.g. '')
- Print the first 10 items in bold_text

In [26]:
bold_text = []
for paragraph in paragraphs:
    for run in paragraph.runs:
        # keep only bold runs that contain visible (non-whitespace) text
        if run.bold and run.text.strip():
            bold_text.append(run.text)

bold_text[0:10]

['SOUTHWEST AIRLINES CO.',
 '2016 ANNUAL REPORT TO SHAREHOLDERS',
 'SECURITIES AND EXCHANGE COMMISSION',
 'Washington, D.C. 20549',
 'FORM 10-K',
 'ANNUAL',
 'REPORT',
 'PURSUANT',
 'TO',
 'SECTION']
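The output above shows one phrase split across several runs ('ANNUAL', 'REPORT', 'PURSUANT', ...). One possible way to recover whole phrases is to join neighbouring bold runs within a paragraph. This is a sketch under assumptions: `bold_spans` is a hypothetical helper, and the sample paragraph is a stand-in that mimics only `.runs`, `.bold`, and `.text`. If the whitespace between words sits in runs that are not bold, the spans will still be split.

```python
from itertools import groupby
from types import SimpleNamespace


def bold_spans(paragraph):
    """Join consecutive bold runs of one paragraph into single strings."""
    spans = []
    # groupby clusters neighbouring runs that share the same bold flag
    for is_bold_group, runs in groupby(paragraph.runs, key=lambda r: bool(r.bold)):
        if not is_bold_group:
            continue
        text = ''.join(run.text for run in runs).strip()
        if text:  # skip groups that contain only whitespace
            spans.append(text)
    return spans


# Hypothetical stand-in mimicking only .runs, .bold and .text
sample_paragraph = SimpleNamespace(runs=[
    SimpleNamespace(text='ANNUAL ', bold=True),
    SimpleNamespace(text='REPORT ', bold=True),
    SimpleNamespace(text='for the fiscal year ', bold=None),
    SimpleNamespace(text='PURSUANT TO', bold=True),
])

print(bold_spans(sample_paragraph))
```

With real data, `bold_text = [span for p in paragraphs for span in bold_spans(p)]` would build the flat list the exercise asks for.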

### Create a function to determine if all runs in a paragraph are bold

- Name the function is_bold
- Return True if all runs (with text) in a paragraph are bold
- Test the function by adding all the bold paragraphs to a list named bold_paragraphs
- Print the first 10 paragraphs in bold_paragraphs

In [27]:
# create the function is_bold
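def is_bold(paragraph):
    """Return True if every run with text in the paragraph is bold."""
    runs_are_bold = [run.bold for run in paragraph.runs if run.text != '']

    # an empty list means the paragraph has no runs with text, so it is not bold;
    # all([]) on its own would return True
    return bool(runs_are_bold) and all(runs_are_bold)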
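One caveat: in python-docx, `run.bold` is tri-state. It is `True` or `False` when bold is set directly on the run, and `None` when the run inherits the setting from its style, so `is_bold` treats inherited bold as not bold. The sketch below shows one way to resolve the inherited case. It is a simplification: `effective_bold`, `is_bold_resolved`, and `make_style` are hypothetical names, only the paragraph style and its base styles are consulted (character styles and document defaults are ignored), and the sample objects are stand-ins rather than real document objects.

```python
from types import SimpleNamespace


def effective_bold(run, paragraph):
    """Resolve a run's tri-state bold flag against its paragraph style."""
    if run.bold is not None:
        return run.bold  # bold set directly on the run wins
    style = paragraph.style
    while style is not None:  # walk up the base-style chain
        if style.font.bold is not None:
            return style.font.bold
        style = style.base_style
    return False  # nothing in the chain sets bold


def is_bold_resolved(paragraph):
    """Like is_bold, but counts bold inherited from the paragraph style."""
    flags = [effective_bold(r, paragraph) for r in paragraph.runs if r.text != '']
    return bool(flags) and all(flags)


# Hypothetical stand-ins mimicking .style, .font.bold, .base_style and .runs
def make_style(bold, base_style=None):
    return SimpleNamespace(font=SimpleNamespace(bold=bold), base_style=base_style)


normal = make_style(None)
title = make_style(True, base_style=normal)
heading_like = make_style(None, base_style=title)  # inherits bold from title

inherited = SimpleNamespace(style=heading_like,
                            runs=[SimpleNamespace(text='PART I', bold=None)])
overridden = SimpleNamespace(style=heading_like,
                             runs=[SimpleNamespace(text='PART I', bold=False)])
plain = SimpleNamespace(style=normal,
                        runs=[SimpleNamespace(text='body text', bold=None)])

print(is_bold_resolved(inherited), is_bold_resolved(overridden), is_bold_resolved(plain))
```

Whether inherited bold should count depends on the goal: for spotting visually bold headings it probably should, while for finding text the author bolded by hand the original `is_bold` is the better fit.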

In [28]:
# test the is_bold function
bold_paragraphs = []
for paragraph in paragraphs:
    if is_bold(paragraph):
        bold_paragraphs.append(paragraph.text)

bold_paragraphs[0:10]

['SOUTHWEST AIRLINES CO.',
 '2016 ANNUAL REPORT TO SHAREHOLDERS',
 'SECURITIES AND EXCHANGE COMMISSION',
 'Washington, D.C. 20549',
 'FORM 10-K',
 'Southwest Airlines Co.',
 'Securities registered pursuant to Section 12(b) of the Act:',
 'Title of Each Class\tName of Each Exchange on Which Registered',
 'Securities registered pursuant to Section 12(g) of the Act: None',
 'DOCUMENTS INCORPORATED BY REFERENCE']

### Exercise

In this exercise, we will search several Oracle annual reports for selected text, reading the documents directly from the zip archive with code rather than unzipping the files by hand.

In [43]:
EXAMPLE_ZIP

'C:\\Users\\alsherman\\Desktop\\NLP\\nlp_practicum_health\\raw_data\\oracle-corporation.zip'

In [44]:
# use zipfile to read the EXAMPLE_ZIP
zipf = zipfile.ZipFile(EXAMPLE_ZIP, 'r')

In [45]:
# How many documents are in the provided zip?
len(zipf.filelist)

3

In [46]:
# view the filenames
# use the .filename attribute on each file in zipf.filelist

[f.filename for f in zipf.filelist]

['oracle-corporation_annual_report_1994.docx',
 'oracle-corporation_annual_report_1995.docx',
 'oracle-corporation_annual_report_1996.docx']
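The archive members listed above can also be searched without writing any extracted files to disk. The sketch below is one possible approach that uses only the standard library, not the method this notebook uses: a .docx file is itself a zip archive whose main text lives in `word/document.xml`, so each member can be read into memory and parsed directly. `docx_paragraph_texts`, `search_zip`, and `make_minimal_docx` are hypothetical helpers, and the toy archive is built in memory purely for demonstration. Unlike `doc.paragraphs`, this version also visits paragraphs nested inside tables, so results on real files may differ.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def docx_paragraph_texts(docx_bytes):
    """Yield the plain text of each paragraph in an in-memory .docx file."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as docx_zip:
        root = ET.fromstring(docx_zip.read('word/document.xml'))
    for p in root.iter(W_NS + 'p'):
        yield ''.join(t.text or '' for t in p.iter(W_NS + 't'))


def search_zip(zip_source, phrase):
    """Return (filename, paragraph) pairs whose paragraph contains phrase."""
    matches = []
    with zipfile.ZipFile(zip_source) as outer:
        for info in outer.infolist():
            if not info.filename.lower().endswith('.docx'):
                continue
            # read the member into memory instead of extracting it
            for text in docx_paragraph_texts(outer.read(info)):
                if phrase in text:
                    matches.append((info.filename, text))
    return matches


def make_minimal_docx(paragraphs):
    """Build a toy .docx holding only word/document.xml (demo data only)."""
    ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    body = ''.join('<w:p><w:r><w:t>{}</w:t></w:r></w:p>'.format(p) for p in paragraphs)
    xml = '<w:document xmlns:w="{}"><w:body>{}</w:body></w:document>'.format(ns, body)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        z.writestr('word/document.xml', xml)
    return buf.getvalue()


# Toy archive with two fake reports (illustrative content only)
archive = io.BytesIO()
with zipfile.ZipFile(archive, 'w') as z:
    z.writestr('report_a.docx', make_minimal_docx(['Intro', 'Adopted Standards No. 109 in 1992']))
    z.writestr('report_b.docx', make_minimal_docx(['Nothing relevant here']))

results = search_zip(archive, 'Standards No. 109')
print(results)
```

`search_zip` accepts a path as well as a file-like object, so `search_zip(EXAMPLE_ZIP, 'Financial Accounting Standards No. 109')` would be the equivalent call for this exercise. python-docx remains the better choice whenever styles or run formatting matter.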

In [54]:
# Find the five paragraphs, spread across the documents in the zip,
# that mention 'Financial Accounting Standards No. 109'

# iterate through the filelist
for f in zipf.filelist:
    # use zipf.extract to extract the file to the current working directory
    doc_file = zipf.extract(f)
    # open the document with docx
    doc = docx.Document(doc_file)
    # iterate through the paragraphs in the document
    for p in doc.paragraphs:
        # check which paragraphs contain 'Financial Accounting Standards No. 109'
        if 'Financial Accounting Standards No. 109' in p.text:
            # print the paragraphs that meet the condition
            print(p.text)
            print()

Effective June 1, 1992, the Company adopted Statement of Financial Accounting Standards No. 109, "Accounting for Income Taxes," which requires recognition of deferred tax liabilities and assets for the expected future tax consequences of events that have been included in the financial statements or tax returns. Under this statement, deferred tax liabilities and assets are determined based on the difference between the financial statement and tax bases of assets and liabilities, using enacted tax rates in effect for the year in which the differences are expected to reverse.

Effective June 1, 1992, the Company adopted Statement of Financial Accounting Standards No. 109, "Accounting for Income Taxes, " (SFAS

Effective June 1, 1992, the Company adopted Statement of Financial Accounting Standards No. 109, "Accounting for Income Taxes." The comparative income tax data provided in this footnote for the year ended May 31, 1992 is presented under the provisions of APB 11.

Effective June 1, 1