## Text Extraction

##### Author: Alex Sherman | alsherman@deloitte.com


Agenda:
- Extract Text from Word Documents
- Identify style (e.g. Bold, Font) and metadata (e.g. author) associated with document text
- Understand docx XML tag definitions
- Learn how to interact with Zip Files
- Identify content surrounding a key piece of text
- Extract text from a PDF with pdfminer.six

In [None]:
from IPython.display import Image
Image("../../raw_data/images/lesson4_text_extraction.png", width=800, height=700)

In [None]:
import os
from IPython.display import display, HTML
from configparser import ConfigParser, ExtendedInterpolation

config = ConfigParser(interpolation=ExtendedInterpolation())
config.read('../../config.ini')

DOCX_PATH = config['DOCX']['DOCX_PATH']
XML_PATH = config['DOCX']['XML_PATH']
EXAMPLE_ZIP = config['DOCX']['EXAMPLE_ZIP']

### python-docx

python-docx is a Python library for creating, updating, and extracting text from Microsoft Word (.docx) files.

In [None]:
docx_url = 'https://python-docx.readthedocs.io/en/latest/'
iframe = '<iframe src={} width=1100 height=300></iframe>'.format(docx_url)
HTML(iframe)

In [None]:
# docx.Document() reads the text, style, and formatting of a Word .docx file
import docx
doc = docx.Document(DOCX_PATH)

In [None]:
# view the methods and attributes of a doc
print(dir(doc))

### Paragraphs

Word paragraphs contain the body text of the document. However, text in tables, headers, and footers is not included in doc.paragraphs.
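
Text that sits outside the body paragraphs has to be collected separately. The following is a minimal sketch, assuming the python-docx attributes `doc.tables`, `table.rows`, `row.cells`, `doc.sections`, and `section.header` / `section.footer`. The helper name `collect_all_text` is hypothetical, and nested tables are ignored.

```python
def collect_all_text(doc):
    """Gather non-empty text from body paragraphs, tables, headers, and footers.

    `doc` is expected to behave like a python-docx Document.
    """
    texts = [p.text for p in doc.paragraphs]

    # table text lives in cells, not in doc.paragraphs
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                texts.append(cell.text)

    # headers and footers belong to the document sections
    for section in doc.sections:
        for part in (section.header, section.footer):
            texts.extend(p.text for p in part.paragraphs)

    # drop empty strings and surrounding whitespace
    return [t.strip() for t in texts if t.strip()]
```

With the lesson document loaded, `collect_all_text(doc)[0:10]` would preview the combined text.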

In [None]:
# get all paragraphs 
paragraphs = doc.paragraphs

In [None]:
# view the docx paragraph objects
paragraphs[0:5]

In [None]:
# count all paragraphs in the document
len(paragraphs)

In [None]:
# only include paragraphs with text (ignore empty strings)
paragraphs = [p for p in paragraphs if p.text.strip() != '']

In [None]:
# view the text of the first paragraph
paragraphs[0].text

### Style

In [None]:
# view the methods and attributes of a paragraph
print(dir(paragraphs[0]))

In [None]:
# get the paragraph style
paragraphs[0].style.name

In [None]:
# Identify if paragraph text has 'Heading' style
'heading' in paragraphs[0].style.name.lower()

In [None]:
# view all the heading styles in the doc
set(p.style.name for p in paragraphs if 'heading' in p.style.name.lower())

### Exercise
 1. count how many paragraphs have a heading style
 2. store the text of all the paragraphs with a heading style

In [None]:
# store all heading paragraphs
headings = [p.text.strip() for p in paragraphs if 'heading' in p.style.name.lower()]

print('# heading paragraphs: {}\n'.format(len(headings)))
headings[0:10]

### Runs

Each paragraph may contain one or more runs. A run is a span of text within a paragraph that shares the same style. Every time the style changes (e.g. from bold to normal text), a new run is added.

In [None]:
runs = paragraphs[0].runs
runs

In [None]:
# View all the runs in the paragraph
[run.text for run in runs]

In [None]:
# each run contains a portion of text from the paragraph
run = runs[2]
run.text

### Run style

- Each run contains style information such as bold, italic, or underline. 
- The style information will be True, False, or None
- A value of None indicates the run has no directly-applied style value and so will inherit the value of its containing paragraph.
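
Because of the three possible values, a plain truth test such as `if run.bold:` treats None the same as False. The sketch below resolves None against the paragraph style only. It assumes `paragraph.style.font.bold` is available, as in python-docx. The name `resolve_bold` is hypothetical, and the run's own character style and any base styles are deliberately not consulted.

```python
def resolve_bold(run, paragraph):
    """Return True or False for a run's bold state.

    Falls back to the paragraph style's font when run.bold is None.
    Simplification: the run's character style and base styles are skipped.
    """
    if run.bold is not None:
        # bold was set directly on the run (True or False)
        return run.bold
    # None means "not set directly on the run": look at the paragraph style
    return bool(paragraph.style.font.bold)
```

For example, `[resolve_bold(r, paragraphs[0]) for r in paragraphs[0].runs]` would list a True or False value per run.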

In [None]:
# view the methods and attributes of a run
print(dir(run))

In [None]:
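# font size in points (font.size is None when the size is inherited from a style)
run.font.size.pt if run.font.size is not None else None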

In [None]:
print(run.bold)

In [None]:
print(run.italic)

In [None]:
print(run.underline)

In [None]:
# View the bold style of every run
[run.bold for run in runs]

## Exercise

#### Find all the bold runs

- Iterate through all the runs in all the paragraphs to identify any run with a bold style applied.
- Store all the bold text in a list named bold_text
- Do not include empty strings (e.g. '') 
- print the first 10 items in bold_text

In [None]:
# TODO


### Create a function to determine if all runs in a paragraph are bold

- Name the function is_bold
- Return True if all runs (with text) in a paragraph are bold
- Test the function by adding all the bold paragraphs to a list named bold_paragraphs
- Print the first 10 paragraphs in bold_paragraphs

In [None]:
# create the function is_bold
def is_bold(paragraph):
    # TODO: return True if every run with text in the paragraph is bold
    pass

In [None]:
# test the is_bold function
bold_paragraphs = []
for paragraph in paragraphs:
    if is_bold(paragraph):
        bold_paragraphs.append(paragraph.text)

bold_paragraphs[0:10]

### Tables

In [None]:
# identify all document tables
tables = doc.tables

In [None]:
# view a few table objects
tables[0:5]

In [None]:
# count the document tables
len(tables)

In [None]:
# view the methods and attributes of a table
print(dir(tables[0]))

In [None]:
# view the non-empty cell text of a table (_cells is a private python-docx attribute)
table_cells = [cell.text.strip() for cell in tables[0]._cells if cell.text.strip() != '']
table_cells[0:10]

### Core Properties

In [None]:
print(dir(doc.core_properties))

In [None]:
doc.core_properties.title

In [None]:
doc.core_properties.subject

In [None]:
doc.core_properties.author

In [None]:
doc.core_properties.created

In [None]:
doc.core_properties.revision

## Explore docx xml
Every Word document is a zip archive of XML files. To test this, change the extension of any Word file from .docx to .zip and open it.

Inside each zip, a directory named word contains document.xml. This file holds the XML for the main body of the Word document.

To open the zip, we use the standard-library module zipfile.

In [None]:
XML_PATH

### zipfile

- ZipFile - the class for reading and writing ZIP files
- read - returns the content of an archive member as bytes
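
As a small illustration of both, the sketch below builds a tiny archive in memory and reads one member back. The member names mimic a .docx layout but the contents are made-up placeholders, not a valid Word document.

```python
import io
import zipfile

# build a tiny archive in memory so the example needs no files on disk
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', '<doc>hello</doc>')
    archive.writestr('docProps/core.xml', '<props/>')

# reopen the same bytes for reading
with zipfile.ZipFile(buffer, 'r') as archive:
    names = archive.namelist()                    # member names, in archive order
    content = archive.read('word/document.xml')   # raw bytes of one member

print(names)
print(content.decode('utf-8'))
```

Note that `read` returns bytes, so the content is decoded before it is printed as text.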

In [None]:
zipfile_url = 'https://docs.python.org/3/library/zipfile.html#zipfile-objects'
iframe = '<iframe src={} width=1100 height=300></iframe>'.format(zipfile_url)
HTML(iframe)

In [None]:
import zipfile

zipf = zipfile.ZipFile(XML_PATH, 'r')

In [None]:
for f in zipf.filelist:
    print(f.filename)

In [None]:
xml_content = zipf.read('word/document.xml')

In [None]:
from bs4 import BeautifulSoup

b = BeautifulSoup(xml_content, 'lxml')

In [None]:
# view the xml from a short document with one heading and one sentence
for word in b.find('w:body'):
    print(word)
    print()

### docx XML tag definitions
- < w:body > - contains the document paragraphs
- < w:p > - a document paragraph
- < w:pstyle > - the paragraph style (e.g. Heading 1)
- < w:t > - text in a paragraph or run
- < w:bookmarkstart > - defines a bookmark, such as a link target in a table of contents
- < w:r > - a document run. Every time the style in a paragraph changes, for instance for a bold or underlined term, a new run is added. Each paragraph may contain multiple runs.
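
The tag names above appear in lowercase because BeautifulSoup's lxml parser lowercases them. In the raw XML they are case-sensitive (w:pStyle, w:bookmarkStart). The sketch below reads the same tags with the standard-library ElementTree parser instead. The XML fragment is hand-written for illustration and is not taken from the lesson document.

```python
import xml.etree.ElementTree as ET

# namespace URI that WordprocessingML binds to the 'w' prefix
W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W}

# hand-written fragment: one Heading1 paragraph with a plain run and a bold run
fragment = (
    '<w:body xmlns:w="{}">'
    '<w:p>'
    '<w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    '<w:r><w:t xml:space="preserve">Plain and </w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>'
    '</w:p>'
    '</w:body>'
).format(W)

parsed = []
body = ET.fromstring(fragment)
for p in body.findall('w:p', NS):
    style = p.find('w:pPr/w:pStyle', NS)
    style_name = style.get('{%s}val' % W) if style is not None else None
    # simplification: any <w:b> element counts as bold, its w:val is not checked
    runs = [
        (r.findtext('w:t', default='', namespaces=NS), r.find('w:rPr/w:b', NS) is not None)
        for r in p.findall('w:r', NS)
    ]
    parsed.append((style_name, runs))

print(parsed)
```

Each entry pairs the paragraph style with a list of (run text, is bold) tuples, which mirrors what python-docx exposes through paragraph.style and paragraph.runs.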


In [None]:
# view the lesson directory - notice there is no 'word' directory
%ls

In [None]:
# Extract a member from the archive to the current working directory
zipf.extract('word/document.xml')

In [None]:
# view the lesson directory again - now it contains a 'word' directory
%ls

### Exercise

In this exercise, we will search through several Oracle annual reports to find selected text throughout all the documents without needing to extract the files from the zip manually. 

In [None]:
EXAMPLE_ZIP

In [None]:
# use zipfile to read the EXAMPLE_ZIP


In [None]:
# How many documents are in the provided zip?


In [None]:
# view the filenames
# use the .filename attribute on each file in zip.filelist


In [None]:
# Find the five paragraphs scattered in all the documents in the zip
# that speak about 'Financial Accounting Standards No. 109'

# iterate through the filelist

    # use zip.extract to extract the file to the current working directory

    # open the document with docx
    
    # iterate through the paragraphs in the document
    
        # check which paragraphs contain 'Financial Accounting Standards No. 109'
        
            # print the paragraphs that meet the condition
            

# PDF Text Extraction

##### subprocess - use python to interact with the command line

"The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes"

**subprocess.check_output()**
- Run command with arguments and return its output.
- If the return code was non-zero it raises a CalledProcessError. The CalledProcessError object will have the return code in the returncode attribute and any output in the output attribute.

**subprocess.call**
- Run the command described by args. Wait for command to complete, then return the returncode attribute.

SOURCE: https://docs.python.org/3/library/subprocess.html
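
The difference between the two functions can be seen in a small, platform-independent sketch. It runs short Python one-liners through `sys.executable` rather than a shell command, which is a choice made for this example only.

```python
import subprocess
import sys

# check_output returns whatever the child process wrote to stdout (as bytes)
output = subprocess.check_output([sys.executable, '-c', 'print("hello")'])
print(output.decode().strip())

# with check_output, a non-zero exit status raises CalledProcessError
try:
    subprocess.check_output([sys.executable, '-c', 'import sys; sys.exit(3)'])
except subprocess.CalledProcessError as err:
    failed_code = err.returncode
    print('command failed with return code', failed_code)

# call does not raise on failure; it simply returns the exit status
return_code = subprocess.call([sys.executable, '-c', 'import sys; sys.exit(3)'])
print(return_code)
```

Use check_output when the command's output is needed, and call when only success or failure matters.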

In [None]:
# list the current directory from python (dir is the Windows counterpart of ls)
import subprocess
output = subprocess.check_output('dir', shell=True)
output.split()

In [None]:
# list the raw_data dir to find the example pdf for pdfminer.six
output = subprocess.check_output(['dir','raw_data'], shell=True)
output.split()

### pdfminer.six

##### Installation
- conda install -c conda-forge pdfminer.six

"PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis."

"The PDFMiner library excels at extracting data and coordinates from a PDF. In most cases, you can use the included command-line scripts to extract text and images (pdf2txt.py) or find objects and their coordinates (dumppdf.py). If you're dealing with a particularly nasty PDF and you need to get more detailed, you can import the package and use it as a library."

The pdf2txt.py command: 
- The package includes the pdf2txt.py command-line command, which you can use to extract text and images. The command supports many options and is very flexible. Some popular options are shown below. See the usage information for complete details.

**pdf2txt.py [options] filename.pdf**

Options:
- -o: output file name
- -p: comma-separated list of page numbers to extract
- -t: output format (text/html/xml/tag[for Tagged PDFs])
- -O: dirname (triggers extraction of images from PDF into directory)
- -P: password

Source: https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167
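
The options listed above can be assembled into an argument list for subprocess. This is a minimal sketch: the helper name `build_pdf2txt_cmd` and the file names in the usage line are hypothetical, and only the -o, -t, and -p flags described above are used.

```python
def build_pdf2txt_cmd(script_path, pdf_path, output_path, pages=None, fmt='text'):
    """Assemble a pdf2txt.py argument list from the options described above."""
    cmd = ['python', script_path, pdf_path, '-o', output_path, '-t', fmt]
    if pages:
        # -p expects a comma-separated list of page numbers
        cmd += ['-p', ','.join(str(p) for p in pages)]
    return cmd


# placeholder paths for illustration only
example_cmd = build_pdf2txt_cmd('pdf2txt.py', 'report.pdf', 'report.txt', pages=[1, 2, 3])
print(example_cmd)
```

Building the list in a function keeps each flag next to its value, which is easier to check than one long command string.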

In [None]:
# add your username to read the local pdf
username = 'ADD YOUR USERNAME'

In [None]:
# extract the first three pages of the pdf, output to a .txt 
cmd = [
    'python'
  , r'C:\Users\{}\AppData\Local\Continuum\anaconda3\Scripts\pdf2txt.py'.format(username)  # pdfminer
  , r'raw_data\southwest-airlines-co_annual_report_2016.pdf'  # input pdf
  , '-o'  # output file name
  , r'raw_data\southwest_2016.txt'
  , '-t'  # output format
  , 'text'
  , '-p'  # pages to extract, default is to extract all pages
  , '1,2,3'
]

subprocess.call(cmd, shell=True) 

In [None]:
# check the raw_data dir for the extracted text from the pdf
output = subprocess.check_output(['dir','raw_data'], shell=True)
output.split()

# Additional Resources for text extraction
- PDF: https://github.com/jsvine/pdfplumber
- OCR: https://pypi.org/project/pytesseract
- Excel: https://openpyxl.readthedocs.io/en/stable/