## Text Extraction Solution

##### Author: Alex Sherman | alsherman@deloitte.com

In [1]:
import os
from IPython.display import display, HTML
from configparser import ConfigParser, ExtendedInterpolation

config = ConfigParser(interpolation=ExtendedInterpolation())
config.read('../../config.ini')

DOCX_PATH = config['DOCX']['DOCX_PATH']
XML_PATH = config['DOCX']['XML_PATH']
EXAMPLE_ZIP = config['DOCX']['EXAMPLE_ZIP']

### python-docx

python-docx is a Python library for creating, updating, and extracting text from Microsoft Word (.docx) files.

In [2]:
docs_url = 'https://python-docx.readthedocs.io/en/latest/'
iframe = '<iframe src={} width=1100 height=300></iframe>'.format(docs_url)
HTML(iframe)

In [3]:
# docx.Document opens a Word .docx file and exposes
# its text, styles, and formatting

import docx
doc = docx.Document(DOCX_PATH)

### Paragraphs

Word paragraphs contain the body text of the document. Text in tables, headers, and footers is not included in doc.paragraphs.
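A minimal sketch of how the text outside doc.paragraphs could be collected as well. It relies on the python-docx attributes tables, rows, cells, sections, header, and footer; the function name iter_all_text and the stand-in object are assumptions made for illustration so the snippet runs without a .docx file.

```python
from types import SimpleNamespace


def iter_all_text(doc):
    """Yield (source, text) pairs from body paragraphs, tables, headers, and footers."""
    for p in doc.paragraphs:
        yield 'body', p.text
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                yield 'table', cell.text
    for section in doc.sections:
        for p in section.header.paragraphs:
            yield 'header', p.text
        for p in section.footer.paragraphs:
            yield 'footer', p.text


def _text(value):
    # hypothetical stand-in for any object with a .text attribute
    return SimpleNamespace(text=value)


# Stand-in document (hypothetical) that mimics the attributes used above.
# With a real file, pass docx.Document(DOCX_PATH) instead.
fake_doc = SimpleNamespace(
    paragraphs=[_text('Body text')],
    tables=[SimpleNamespace(rows=[SimpleNamespace(cells=[_text('Cell A'), _text('Cell B')])])],
    sections=[SimpleNamespace(
        header=SimpleNamespace(paragraphs=[_text('Page header')]),
        footer=SimpleNamespace(paragraphs=[_text('Page footer')]),
    )],
)

all_text = list(iter_all_text(fake_doc))
```

This sketch does not descend into nested tables, and merged cells may be reported more than once when iterating row.cells.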

In [4]:
# get all paragraphs 
paragraphs = doc.paragraphs

In [5]:
# count all paragraphs in the document
len(paragraphs)

2579

In [6]:
# keep only paragraphs that contain text (drop empty or whitespace-only ones)
paragraphs = [p for p in paragraphs if p.text.strip() != '']

### Style

In [7]:
# view the text in the first paragraph
paragraphs[0].text

'SOUTHWEST AIRLINES CO.'

In [8]:
# get the paragraph style
paragraphs[0].style.name

'Normal'

In [9]:
# Check whether the paragraph uses a heading style.
# Compare case-insensitively, since style names can vary in case (e.g. 'Heading 1')

'heading' in paragraphs[0].style.name.lower()

False

### Runs

Each paragraph may contain one or more runs. A run is a stretch of text within a paragraph that shares the same character formatting. Every time the formatting changes (e.g. from bold to normal text), a new run is added.
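In practice Word can also split text into separate runs even when the formatting looks identical, so a phrase may arrive in several pieces. The sketch below shows one way to join consecutive runs that share the same bold setting. The function name merge_runs_by_bold and the stand-in runs are assumptions for illustration; with a real document, pass paragraph.runs.

```python
from itertools import groupby
from types import SimpleNamespace


def merge_runs_by_bold(runs):
    """Join the text of consecutive runs that share the same bold setting."""
    merged = []
    for bold, group in groupby(runs, key=lambda run: run.bold):
        text = ''.join(run.text for run in group)
        if text:  # skip groups that contain only empty runs
            merged.append((bold, text))
    return merged


# Hypothetical stand-in runs that mimic run.text and run.bold
example_runs = [
    SimpleNamespace(text='ANNUAL', bold=True),
    SimpleNamespace(text=' ', bold=True),
    SimpleNamespace(text='REPORT', bold=True),
    SimpleNamespace(text=' for 2016', bold=None),
]

merged = merge_runs_by_bold(example_runs)
```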

In [10]:
runs = paragraphs[0].runs
runs

[<docx.text.run.Run at 0x2ce67908588>,
 <docx.text.run.Run at 0x2ce679086a0>,
 <docx.text.run.Run at 0x2ce679084a8>]

In [11]:
# View all the runs in the paragraph
[run.text for run in runs]

['', '', 'SOUTHWEST AIRLINES CO.']

In [12]:
# each run contains a portion of text from the paragraph
run = runs[2]
run.text

'SOUTHWEST AIRLINES CO.'

### Run style

- Each run contains style information such as bold, italic, or underline. 
- The style information will be True, False, or None
- A value of None indicates the run has no directly-applied style value and so will inherit the value of its containing paragraph.
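Because the value has three states, a plain truth test such as `if run.bold:` treats None and False alike. A small sketch that keeps the three cases apart; bold_state and the stand-in runs are hypothetical names used only for illustration.

```python
from types import SimpleNamespace


def bold_state(run):
    """Describe a run's tri-state bold setting."""
    if run.bold is True:
        return 'bold'
    if run.bold is False:
        return 'not bold'
    return 'inherited'  # None: nothing is set directly on the run


# Hypothetical stand-in runs; with a real document use paragraph.runs
example_runs = [
    SimpleNamespace(bold=True),
    SimpleNamespace(bold=False),
    SimpleNamespace(bold=None),
]

states = [bold_state(run) for run in example_runs]
```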

In [13]:
# font size
run.font.size.pt

12.0

In [14]:
print(run.italic)

None


In [15]:
print(run.underline)

None


In [16]:
print(run.bold)

True


In [17]:
# view the bold setting of every run
[run.bold for run in runs]

[None, None, True]

## Exercise

#### Find all the bold runs

- Iterate through all the runs in all the paragraphs to identify any run with a bold style applied.
- Store all the bold text in a list named bold_text
- Do not include empty strings (e.g. '') 
- print the first 10 items in bold_text

In [20]:
bold_text = []
for paragraph in paragraphs:
    for run in paragraph.runs:
        if run.bold and run.text.strip() != '':
            text = run.text
            bold_text.append(text)

bold_text[0:10]

['SOUTHWEST AIRLINES CO.',
 '2016 ANNUAL REPORT TO SHAREHOLDERS',
 'SECURITIES AND EXCHANGE COMMISSION',
 'Washington, D.C. 20549',
 'FORM 10-K',
 'ANNUAL',
 'REPORT',
 'PURSUANT',
 'TO',
 'SECTION']

### Create a function to determine if all runs in a paragraph are bold

- Name the function is_bold
- Return True if all runs (with text) in a paragraph are bold
- Test the function by adding all the bold paragraphs to a list named bold_paragraphs
- Print the first 10 paragraphs in bold_paragraphs

In [21]:
# create the function is_bold
def is_bold(paragraph):   
    runs_are_bold = [run.bold for run in paragraph.runs if run.text != '']

    if runs_are_bold != [] and all(runs_are_bold):
        return True
    return False

In [22]:
# test the is_bold function
bold_paragraphs = []
for paragraph in paragraphs:
    if is_bold(paragraph):
        bold_paragraphs.append(paragraph.text)

bold_paragraphs[0:10]

['SOUTHWEST AIRLINES CO.',
 '2016 ANNUAL REPORT TO SHAREHOLDERS',
 'SECURITIES AND EXCHANGE COMMISSION',
 'Washington, D.C. 20549',
 'FORM 10-K',
 'Southwest Airlines Co.',
 'Securities registered pursuant to Section 12(b) of the Act:',
 'Title of Each Class\tName of Each Exchange on Which Registered',
 'Securities registered pursuant to Section 12(g) of the Act: None',
 'DOCUMENTS INCORPORATED BY REFERENCE']

### Tables

In [25]:
# identify all document tables
tables = doc.tables

In [26]:
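# collect the text of every non-empty table cell
# (_cells is a private python-docx attribute and may change between versions)
table_cells = [cell.text for table in tables for cell in table._cells if cell.text != '']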

table_cells[0:10]

['PART I',
 'Item 1.',
 'Business',
 '1',
 'Item 1A.',
 'Risk Factors',
 '22',
 'Item 1B.',
 'Unresolved Staff Comments',
 '30']

### Core Properties

In [27]:
doc.core_properties.title

'Southwest Airlines Co. 2016 Annual Report'

In [28]:
doc.core_properties.subject

''

In [29]:
doc.core_properties.author

''

In [30]:
doc.core_properties.created

datetime.datetime(2018, 1, 3, 22, 53, 10)

In [31]:
doc.core_properties.revision

0

## Explore docx xml
Every Word document is a zip archive of XML files. To test this, change the extension of any Word file from .docx to .zip and open it.

Inside each zip, a directory named word contains document.xml. This file contains the XML for the main body of the Word document.

To open the zip, we use the zipfile module from the Python standard library.
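As a quick check that a file really is a zip archive, zipfile.is_zipfile can be used before opening it. The sketch below builds a tiny stand-in archive in memory (hypothetical content, not a valid Word file) so it runs without any file on disk; with a real document, pass the path to the .docx instead.

```python
import io
import zipfile

# Build a small stand-in archive in memory (assumption: illustrative content only)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', '<w:document/>')

# is_zipfile accepts a path or a file-like object
buffer.seek(0)
looks_like_zip = zipfile.is_zipfile(buffer)

# namelist returns the names of the files stored in the archive
with zipfile.ZipFile(buffer) as archive:
    member_names = archive.namelist()
```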

In [34]:
XML_PATH

'C:\\Users\\alsherman\\Desktop\\PycharmProjects\\firm_initiatives\\ml_guild\\raw_data\\docx_example.zip'

In [32]:
import zipfile

zip = zipfile.ZipFile(XML_PATH, 'r')

In [33]:
for f in zip.filelist:
    print(f.filename)

[Content_Types].xml
_rels/.rels
word/_rels/document.xml.rels
word/document.xml
word/theme/theme1.xml
word/settings.xml
word/fontTable.xml
word/webSettings.xml
docProps/app.xml
docProps/core.xml
word/styles.xml


In [35]:
xml_content = zip.read('word/document.xml')

### zipfile

- ZipFile - the class for reading and writing ZIP files
- read - returns the content of a file in the archive as bytes
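A short sketch of both pieces together, assuming an in-memory archive with made-up content. Opening the ZipFile in a with block closes it automatically, and the bytes returned by read are decoded to get a string.

```python
import io
import zipfile

# Stand-in archive (hypothetical content) so the example needs no file on disk
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', '<doc>Section Header</doc>')

# read returns the raw bytes of one member of the archive
with zipfile.ZipFile(buffer, 'r') as archive:
    raw = archive.read('word/document.xml')

text = raw.decode('utf-8')
```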

In [36]:
docs_url = 'https://docs.python.org/3/library/zipfile.html#zipfile-objects'
iframe = '<iframe src={} width=1100 height=300></iframe>'.format(docs_url)
HTML(iframe)

In [37]:
from bs4 import BeautifulSoup

# note: the 'lxml' (HTML) parser lowercases tag names, e.g. w:pPr becomes w:ppr
b = BeautifulSoup(xml_content, 'lxml')

In [38]:
# view the xml from a short document with one heading and one sentence
for element in b.find('w:body'):
    print(element)
    print()

<w:p w:rsidp="00A96863" w:rsidr="007F6AD8" w:rsidrdefault="00A96863"><w:ppr><w:pstyle w:val="Heading1"></w:pstyle></w:ppr><w:r><w:t>Section Header</w:t></w:r></w:p>

<w:p w:rsidr="00A96863" w:rsidrdefault="00A96863"><w:r><w:t>Text in the section</w:t></w:r><w:bookmarkstart w:id="0" w:name="_GoBack"></w:bookmarkstart><w:bookmarkend w:id="0"></w:bookmarkend></w:p>

<w:sectpr w:rsidr="00A96863"><w:pgsz w:h="15840" w:w="12240"></w:pgsz><w:pgmar w:bottom="1440" w:footer="720" w:gutter="0" w:header="720" w:left="1440" w:right="1440" w:top="1440"></w:pgmar><w:cols w:space="720"></w:cols><w:docgrid w:linepitch="360"></w:docgrid></w:sectpr>



### docx XML tag definitions
- < w:body > - contains the document paragraphs
- < w:p > - a document paragraph
- < w:pstyle > - the paragraph style (e.g. Heading 1)
- < w:t > - the text in a run
- < w:bookmarkstart > - marks the start of a bookmark, such as the target of a link in a table of contents
- < w:r > - a document run. Every time the formatting in a paragraph changes, for instance for a bold or underlined term, a new run is added. Each paragraph may contain multiple runs.
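The same tags can also be read with the standard library alone. The sketch below parses a hand-written sample (an assumption modelled on the output above) with xml.etree.ElementTree. Unlike the lowercased names printed by the lxml parser, the tags in the file itself are camelCase (w:pPr, w:pStyle) and live in the WordprocessingML namespace.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W_NS}

# Hand-written sample for illustration, not taken from a real file
sample_xml = f'''<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Section Header</w:t></w:r></w:p>
    <w:p><w:r><w:t>Text in </w:t></w:r><w:r><w:t>the section</w:t></w:r></w:p>
  </w:body>
</w:document>'''

root = ET.fromstring(sample_xml)

paragraphs_found = []
for p in root.iterfind('.//w:p', NS):
    # the style, if any, sits in the paragraph properties
    style = p.find('w:pPr/w:pStyle', NS)
    style_name = style.get(f'{{{W_NS}}}val') if style is not None else None
    # the paragraph text is the concatenation of the text in all of its runs
    text = ''.join(t.text or '' for t in p.iterfind('.//w:t', NS))
    paragraphs_found.append((style_name, text))
```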


### Exercise

In this exercise, we will search through several Oracle annual reports to find selected text throughout all the documents without needing to extract the files from the zip manually. 
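One possible way to search without writing anything to disk, sketched with the standard library only. Each .docx inside the outer zip is itself a zip, so it can be opened from the bytes held in memory. The names find_paragraphs and make_stand_in_docx, the file names, and the sample text are all assumptions for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W_NS}


def find_paragraphs(outer_zip, phrase):
    """Yield (filename, paragraph_text) for every paragraph containing phrase."""
    for name in outer_zip.namelist():
        if not name.endswith('.docx'):
            continue
        # open the nested .docx directly from its bytes
        with zipfile.ZipFile(io.BytesIO(outer_zip.read(name))) as docx_zip:
            root = ET.fromstring(docx_zip.read('word/document.xml'))
        for p in root.iterfind('.//w:p', NS):
            text = ''.join(t.text or '' for t in p.iterfind('.//w:t', NS))
            if phrase in text:
                yield name, text


def make_stand_in_docx(paragraph_texts):
    """Build a minimal stand-in .docx (hypothetical helper, document.xml only)."""
    body = ''.join(f'<w:p><w:r><w:t>{text}</w:t></w:r></w:p>' for text in paragraph_texts)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('word/document.xml', xml)
    return buffer.getvalue()


# Stand-in outer zip with two made-up reports
outer_buffer = io.BytesIO()
with zipfile.ZipFile(outer_buffer, 'w') as outer:
    outer.writestr('report_1994.docx',
                   make_stand_in_docx(['Revenue grew.', 'Adopted Standards No. 109.']))
    outer.writestr('report_1995.docx',
                   make_stand_in_docx(['Standards No. 109 still applies.']))

with zipfile.ZipFile(outer_buffer) as outer:
    matches = list(find_paragraphs(outer, 'Standards No. 109'))
```

Note that './/w:p' also matches paragraphs inside tables, so the results can differ from a search over doc.paragraphs. python-docx can also open a file-like object, so passing io.BytesIO(outer_zip.read(name)) to docx.Document is another option.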

In [46]:
EXAMPLE_ZIP

'C:\\Users\\alsherman\\Desktop\\PycharmProjects\\firm_initiatives\\ml_guild\\raw_data\\oracle-corporation.zip'

In [39]:
# use zipfile to read the example_zip
zip = zipfile.ZipFile(EXAMPLE_ZIP, 'r')

In [41]:
# How many documents are in the provided zip?
len(zip.filelist)

3

In [47]:
# view the filenames
[f.filename for f in zip.filelist]

['oracle-corporation_annual_report_1994.docx',
 'oracle-corporation_annual_report_1995.docx',
 'oracle-corporation_annual_report_1996.docx']

In [48]:
# Find the five paragraphs scattered in all the documents in the zip
# that speak about 'Financial Accounting Standards No. 109'

# iterate through the filelist
for f in zip.filelist:
    # zip.extract writes the file to the current working directory and returns its path
    doc_file = zip.extract(f)
    # open the document with docx
    doc = docx.Document(doc_file)
    # iterate through the paragraphs in the document
    for p in doc.paragraphs:
        # check which paragraphs contain 'Financial Accounting Standards No. 109'
        if 'Financial Accounting Standards No. 109' in p.text:
            # print the paragraphs that meet the condition
            print(p.text)
            print()

Effective June 1, 1992, the Company adopted Statement of Financial Accounting Standards No. 109, "Accounting for Income Taxes," which requires recognition of deferred tax liabilities and assets for the expected future tax consequences of events that have been included in the financial statements or tax returns. Under this statement, deferred tax liabilities and assets are determined based on the difference between the financial statement and tax bases of assets and liabilities, using enacted tax rates in effect for the year in which the differences are expected to reverse.

Effective June 1, 1992, the Company adopted Statement of Financial Accounting Standards No. 109, "Accounting for Income Taxes, " (SFAS

Effective June 1, 1992, the Company adopted Statement of Financial Accounting Standards No. 109, "Accounting for Income Taxes." The comparative income tax data provided in this footnote for the year ended May 31, 1992 is presented under the provisions of APB 11.

Effective June 1, 1