# Useful Links
Extract images: http://theautomatic.net/2019/10/14/how-to-read-word-documents-with-python/

Step 1: Read through all `a href` links on the page and download the linked documents<br>
Step 2: Read the downloaded documents from their folders and extract the text
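
Step 1 can be sketched with the standard library alone. The snippet below is only an illustration: the sample HTML, the base URL `http://example.com/docs/`, the `DOC_EXTENSIONS` tuple and the helper names are all assumptions, not part of the original notebook. It collects the links that point at documents; the download itself is left as a comment.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed file types of interest; adjust to the documents you need.
DOC_EXTENSIONS = (".docx", ".pdf")


class DocLinkParser(HTMLParser):
    """Collect absolute URLs from <a href=...> tags that point at documents."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(DOC_EXTENSIONS):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def find_document_links(html, base_url):
    parser = DocLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, used here instead of a real HTTP request.
sample_html = (
    '<a href="files/a.docx">A</a> '
    '<a href="/b.pdf">B</a> '
    '<a href="page.html">C</a>'
)
links = find_document_links(sample_html, "http://example.com/docs/")
print(links)

# Each link could then be saved locally, for example with
# urllib.request.urlretrieve(url, local_filename).
```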

# 1. Working with Documents

In [1]:
import numpy as np
import pandas as pd

from docx2python import docx2python

In [2]:
# extract docx content
doc_result = docx2python('zen_of_python.docx', extract_image=False)

In [3]:
# get separate components of the document
doc_result_body = doc_result.body
print(f"This is the Body:\n{doc_result_body}\n")

# get the table text
print(f"This is the First Element:\n{doc_result_body[0]}\n")

# get the text from Zen of Python
print(f"This is the Second Element:\n{doc_result_body[1]}\n")

This is the Body:
[[[['Field1'], ['Field2']], [['Info1'], ['Info2']], [['Info3'], ['Info4']]], [[['Header Title', '', 'Paragraph written over here.', '--\tList 1', '--\tList 2', '', 'Sub Heading', '1)\tPoint One', '2)\tPoint Two', '3)\tPoint Three']]]]

This is the First Element:
[[['Field1'], ['Field2']], [['Info1'], ['Info2']], [['Info3'], ['Info4']]]

This is the Second Element:
[[['Header Title', '', 'Paragraph written over here.', '--\tList 1', '--\tList 2', '', 'Sub Heading', '1)\tPoint One', '2)\tPoint Two', '3)\tPoint Three']]]
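
The printed body is nested four levels deep: tables, then rows, then cells, then paragraphs. As a rough sketch of how to walk that structure, the snippet below uses an abbreviated copy of the output above rather than a real `.docx` file; `iter_paragraphs` is a hypothetical helper name.

```python
# Abbreviated copy of the body printed above (not read from a file).
body = [
    [[['Field1'], ['Field2']], [['Info1'], ['Info2']], [['Info3'], ['Info4']]],
    [[['Header Title', '', 'Paragraph written over here.', '--\tList 1']]],
]


def iter_paragraphs(body):
    """Yield ((table, row, cell) indices, text) for every non-empty paragraph."""
    for table_idx, table in enumerate(body):
        for row_idx, row in enumerate(table):
            for cell_idx, cell in enumerate(row):
                for paragraph in cell:
                    if paragraph:  # skip empty strings
                        yield (table_idx, row_idx, cell_idx), paragraph


paragraphs = list(iter_paragraphs(body))
for position, text in paragraphs:
    print(position, repr(text))
```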



In [4]:
# convert this result into a tabular format using pandas
df = pd.DataFrame(doc_result_body[0][1:])

df

         0        1
0  [Info1]  [Info2]
1  [Info3]  [Info4]


In [5]:
# apply the lambda below to every cell: take the single string out of each
# cell's list and strip leading/trailing tab characters ("\t")
# note: in pandas >= 2.1, applymap is deprecated in favor of DataFrame.map
df = df.applymap(lambda val: val[0].strip("\t"))

df

       0      1
0  Info1  Info2
1  Info3  Info4


In [6]:
# change the column headers to what we see in the Word file
df.columns = [val[0].strip("\t") for val in doc_result.body[0][0]]

df

  Field1  Field2
0  Info1   Info2
1  Info3   Info4
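
The three cells above (build the frame, unwrap each cell, set the headers) could also be folded into one plain-Python helper. This is a sketch under assumptions: `table_to_records` is a hypothetical name, the table literal is copied from the output shown earlier, and the first row is assumed to hold the headers.

```python
def table_to_records(table):
    """Turn one table (rows of cells of paragraphs) into a list of dicts
    keyed by the text of the first row."""

    def cell_text(cell):
        # Join a cell's paragraphs and strip leading/trailing tabs from each.
        return "\n".join(paragraph.strip("\t") for paragraph in cell)

    header, *rows = [[cell_text(cell) for cell in row] for row in table]
    return [dict(zip(header, row)) for row in rows]


# Table copied from the body output above.
table = [[['Field1'], ['Field2']], [['Info1'], ['Info2']], [['Info3'], ['Info4']]]
records = table_to_records(table)
print(records)
```

Passing `records` to `pd.DataFrame(records)` should give a frame with the `Field1` and `Field2` columns.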


In [7]:
# get all text in a single string
doc_result.text

'Field1\n\nField2\n\nInfo1\n\nInfo2\n\nInfo3\n\nInfo4\n\nHeader Title\n\n\n\nParagraph written over here.\n\n--\tList 1\n\n--\tList 2\n\n\n\nSub Heading\n\n1)\tPoint One\n\n2)\tPoint Two\n\n3)\tPoint Three'
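
In the string above, paragraphs are separated by blank lines, and empty paragraphs show up as extra newlines. A minimal sketch for turning that string into a clean list of paragraphs follows; the sample text is a shortened copy of the output, and replacing tabs with spaces is just one possible choice.

```python
# Shortened copy of the doc_result.text output above.
text = (
    'Field1\n\nField2\n\nHeader Title\n\n\n\n'
    'Paragraph written over here.\n\n--\tList 1'
)

# Split on blank lines, replace tabs with spaces, and drop empty entries.
paragraphs = [part.replace("\t", " ").strip() for part in text.split("\n\n")]
paragraphs = [paragraph for paragraph in paragraphs if paragraph]
print(paragraphs)
```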

In [8]:
# get metadata about the file using the properties attribute
doc_result.properties

{'title': None,
 'subject': None,
 'creator': 'Andy Ang',
 'keywords': None,
 'description': None,
 'lastModifiedBy': 'Andy Ang',
 'revision': '7',
 'created': '2020-09-14T02:36:00Z',
 'modified': '2020-09-14T03:02:00Z'}
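
The `created` and `modified` values are plain strings in ISO 8601 form with a trailing `Z` (UTC). If date arithmetic is needed, they could be parsed as sketched below; the dictionary is a partial copy of the output above, and `parse_docx_timestamp` is a hypothetical helper.

```python
from datetime import datetime, timezone

# Partial copy of the properties output above.
properties = {
    'creator': 'Andy Ang',
    'revision': '7',
    'created': '2020-09-14T02:36:00Z',
    'modified': '2020-09-14T03:02:00Z',
}


def parse_docx_timestamp(value):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string into a timezone-aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


created = parse_docx_timestamp(properties['created'])
modified = parse_docx_timestamp(properties['modified'])

# Time between creation and the last modification.
elapsed = modified - created
print(elapsed)  # 0:26:00
```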

In [9]:
# get the headers
doc_result.header
 
# get the footers
doc_result.footer

# get footnotes (only this last expression is shown in the cell output)
doc_result.footnotes

[]

In [10]:
# extract text with HTML-style formatting tags; table-related tags are not currently supported
doc_html_result = docx2python('zen_of_python.docx', html=True)

doc_html_result.body

[[[['Field1'], ['Field2']], [['Info1'], ['Info2']], [['Info3'], ['Info4']]],
 [[['Header Title',
    '',
    'Paragraph written over here.',
    '--\tList 1',
    '--\tList 2',
    '',
    'Sub Heading',
    '1)\tPoint One',
    '2)\tPoint Two',
    '3)\tPoint Three']]]]

# 2. Working with PDF

In [11]:
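import PyPDF2

# PdfReader, .pages and extract_text() replace the older PdfFileReader,
# numPages, getPage() and extractText(), which were removed in PyPDF2 3.0
with open('US_Declaration.pdf', 'rb') as f:
    pdf_reader = PyPDF2.PdfReader(f)

    # number of pages in the document
    num_pages = len(pdf_reader.pages)

    # extract the text of the first page
    page_one = pdf_reader.pages[0]
    page_one_text = page_one.extract_text()

print(page_one_text)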

Declaration of IndependenceIN CONGRESS, July 4, 1776. The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the
political bands which have connected them with another, and to assume among the powers of the
earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle

them, a decent respect to the opinions of mankind requires that they should declare the causes

which impel them to the separation. 
We hold these truths to be self-evident, that all men are created equal, that they are endowed by

their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit
of Happiness.ŠThat to secure these rights, Governments are instituted among Men, deriving

their just powers from the consent of the governed,ŠThat whenever any Form of Government
becomes destructive of these ends, it is the Right of the People to alter or to abolish it,