# Chapter 15: WORKING WITH PDF AND WORD DOCUMENTS

## PDF Documents

`$ pip install PyPDF2==2.11.1`

In [34]:
import PyPDF2
PyPDF2.__version__

'2.11.1'

### Extracting Text from PDFs

In [1]:
import PyPDF2

pdfFileObj = open("automate-online-materials/meetingminutes.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages

19

In [2]:
pageObj = pdfReader.getPage(0)
print(pageObj.extract_text(0))
pdfFileObj.close()

OOFFFFIICCIIAALL  BBOOAARRDD  MMIINNUUTTEESS  
 
Meeting of March 7 , 2014  
 
 
 
  
 
  
 
   
The Board of Elementary and Secondary Education shall provide leadership and 
create policies for education that expand opportunities for children, empower 
families and communities, and advance Louisiana in an increasingly 
competitive glob al market.  BOARD  
of 
ELEMENTARY  
and  
SECONDARY  
EDUCATION  
 


### Decrypting PDFs

In [3]:
import PyPDF2

pdfReader = PyPDF2.PdfFileReader(open("automate-online-materials/encrypted.pdf", 'rb'))
pdfReader.is_encrypted

True

In [4]:
pdfReader.getPage(0)

FileNotDecryptedError: File has not been decrypted

In [5]:
pdfReader = PyPDF2.PdfFileReader(open("automate-online-materials/encrypted.pdf", 'rb'))
pdfReader.decrypt('rosebud')

<PasswordType.OWNER_PASSWORD: 2>

In [8]:
print(pdfReader.getPage(0).extract_text(0))

OOFFFFIICCIIAALL  BBOOAARRDD  MMIINNUUTTEESS  
 
Meeting of March 7 , 2014  
 
 
 
  
 
  
 
   
The Board of Elementary and Secondary Education shall provide leadership and 
create policies for education that expand opportunities for children, empower 
families and communities, and advance Louisiana in an increasingly 
competitive glob al market.  BOARD  
of 
ELEMENTARY  
and  
SECONDARY  
EDUCATION  
 


### Copying Pages

In [9]:
# Combine two PDF files into a new single PDF.

import PyPDF2

pdf1File = open('automate-online-materials/meetingminutes.pdf', 'rb')
pdf2File = open('automate-online-materials/meetingminutes2.pdf', 'rb')

pdf1Reader = PyPDF2.PdfFileReader(pdf1File)
pdf2Reader = PyPDF2.PdfFileReader(pdf2File)
pdfWriter = PyPDF2.PdfFileWriter()

for pageNum in range(pdf1Reader.numPages):
    pageObj = pdf1Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)
    
for pageNum in range(pdf2Reader.numPages):
    pageObj = pdf2Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

pdfOutputFile = open('combinedminutes.pdf', 'wb')
pdfWriter.write(pdfOutputFile)
pdfOutputFile.close()
pdf1File.close()
pdf2File.close()

### Rotating Pages

In [1]:
import PyPDF2

pdf_reader = PyPDF2.PdfFileReader(open("automate-online-materials/meetingminutes.pdf", 'rb'))
pdf_writer = PyPDF2.PdfFileWriter()

page = pdf_reader.pages[0]
page.rotate_clockwise(90)
pdf_writer.add_page(page)
result_pdf = open("rotated_page.pdf", 'wb')
pdf_writer.write(result_pdf)

result_pdf.close()

![rotated-page.jpg](https://automatetheboringstuff.com/2e/images/000098.jpg)

### Overlaying Pages

In [9]:
import PyPDF2

minutesFile = open("automate-online-materials/meetingminutes.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(minutesFile)
minutesFirstPage = pdfReader.pages[0]
pdfWatermarkReader = PyPDF2.PdfReader(open("automate-online-materials/watermark.pdf", 'rb'))
minutesFirstPage.merge_page(pdfWatermarkReader.pages[0])
pdfWriter = PyPDF2.PdfFileWriter()
pdfWriter.add_page(minutesFirstPage)

for page_num in range(1, len(pdfReader.pages)):
    pageObj = pdfReader.pages[page_num]
    pdfWriter.add_page(pageObj)

resultPdfFile = open("watermarkedCover.pdf", 'wb')
pdfWriter.write(resultPdfFile)
minutesFile.close()
resultPdfFile.close()

![watermark-pdf.jpg](https://automatetheboringstuff.com/2e/images/000044.jpg)

### Encrypting PDFs

In [10]:
import PyPDF2

pdfFile = open("automate-online-materials/meetingminutes.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFile)
pdfWriter = PyPDF2.PdfFileWriter()

for page_num in range(len(pdfReader.pages)):
    pdfWriter.add_page(pdfReader.pages[page_num])
    
pdfWriter.encrypt('swordfish')
resultPdf = open('encryptedminutes.pdf', 'wb')
pdfWriter.write(resultPdf)
resultPdf.close()

### Project: Combining Select Pages from Many PDFs

In [31]:
"""Combines all the PDFs in the current working directory into a single PDF."""

import os
import PyPDF2

# Get all the PDF filenames.
pdfFiles = [file for file in os.listdir('.') if file.endswith('.pdf')]
pdfFiles.sort()

pdfWriter = PyPDF2.PdfFileWriter()

# Loop through all the PDF files.
for filename in pdfFiles:
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # Loop through all the pages (except the first) and add them.
    for pageNum in range(1, pdfReader.numPages):
        pageObj = pdfReader.pages[pageNum]
        pdfWriter.add_page(pageObj)

# Save the resulting PDF to a file.
pdfOutput = open('allminutes.pdf', 'wb')
pdfWriter.write(pdfOutput)
pdfOutput.close()

#### Ideas for Similar Programs

- Cut out specific pages from PDFs.
- Reorder pages in a PDF.
- Create a PDF from only those pages that have some specific text, identified by `extract_text()`.
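
The first two ideas (cutting out and reordering pages) can be sketched with one helper. This is a minimal sketch under stated assumptions, not the book's solution: `parse_page_ranges()` and `copy_pages()` are hypothetical names, and the `'1-3,7'` page-spec format is invented here for illustration. It uses the `PdfReader` and `PdfWriter` class names that PyPDF2 2.x provides.

```python
def parse_page_ranges(spec, num_pages):
    """Turn a 1-based spec such as '1-3,7' into a list of 0-based page indexes.

    The spec format is an assumption made for this sketch.
    """
    indexes = []
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue
        if '-' in part:
            start, end = part.split('-', 1)
            first, last = int(start), int(end)
        else:
            first = last = int(part)
        if first < 1 or last > num_pages or first > last:
            raise ValueError(f'page range {part!r} is outside 1-{num_pages}')
        indexes.extend(range(first - 1, last))
    return indexes


def copy_pages(src_path, dst_path, spec):
    """Copy the pages named by spec, in the order given, into a new PDF."""
    # Imported here so parse_page_ranges() can be used without PyPDF2 installed.
    import PyPDF2

    with open(src_path, 'rb') as src:
        reader = PyPDF2.PdfReader(src)
        writer = PyPDF2.PdfWriter()
        for index in parse_page_ranges(spec, len(reader.pages)):
            writer.add_page(reader.pages[index])
        # Write while the source file is still open, as in the cells above.
        with open(dst_path, 'wb') as dst:
            writer.write(dst)
```

Because the indexes are kept in the order they appear in the spec, a spec such as `'3,1-2'` reorders pages, while `'2-5'` simply cuts out a range.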

## Word Documents

`$ pip install python-docx`

---

![word-doc.jpg](https://automatetheboringstuff.com/2e/images/000138.jpg)
*The Run objects identified in a Paragraph object*

In [35]:
import docx
docx.__version__

'0.8.11'

### Reading Word Documents

In [41]:
import docx

doc = docx.Document("automate-online-materials/demo.docx")
len(doc.paragraphs)

7

In [48]:
doc.paragraphs[1].text

'A plain paragraph with some bold and some italic'

In [57]:
len(doc.paragraphs[1].runs)

5

In [66]:
doc.paragraphs[1].runs[0].text

'A plain paragraph with'

In [67]:
doc.paragraphs[1].runs[1].text

' some '

In [68]:
doc.paragraphs[1].runs[2].text

'bold'

In [69]:
doc.paragraphs[1].runs[3].text

' and some '

In [70]:
doc.paragraphs[1].runs[4].text

'italic'

### Getting the Full Text from a .docx File

If you care only about the text in a Word document, not its styling information, you can write a `getText()` function like the one below. It accepts the filename of a *.docx* file and returns its text as a single string, with one line per paragraph.

In [79]:
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    
    return "\n".join(fullText)

print(getText("automate-online-materials/demo.docx"))

Document Title
A plain paragraph with some bold and some italic
Heading, level 1
Intense quote
first item in unordered list
first item in ordered list




### Run Attributes

`Run` Object `text` Attributes:

| Attribute | Description |
| :-: | :- |
| **`bold`** | The text appears in bold. |
| **`italic`** | The text appears in italic. |
| **`underline`** | The text is underlined. |
| **`strike`** | The text appears with strikethrough. |
| **`double_strike`** | The text appears with double strikethrough. |
| **`all_caps`** | The text appears in capital letters. |
| **`small_caps`** | The text appears in capital letters, with lowercase letters two points smaller. |
| **`shadow`** | The text appears with a shadow. |
| **`outline`** | The text appears outlined rather than solid. |
| **`rtl`** | The text is written right-to-left. |
| **`imprint`** | The text appears pressed into the page. |
| **`emboss`** | The text appears raised off the page in relief. |

In [5]:
import docx

doc = docx.Document("automate-online-materials/demo.docx")
doc.paragraphs[0].text

'Document Title'

In [14]:
doc.paragraphs[0].style  # The exact id may be different

_ParagraphStyle('Title') id: 140293766293104

In [15]:
doc.paragraphs[0].style = 'Normal'

In [16]:
doc.paragraphs[1].text

'A plain paragraph with some bold and some italic'

In [18]:
(doc.paragraphs[1].runs[0].text, doc.paragraphs[1].runs[1].text, doc.paragraphs[1].runs[2].text, doc.paragraphs[1].runs[3].text, doc.paragraphs[1].runs[4].text)

('A plain paragraph with', ' some ', 'bold', ' and some ', 'italic')

In [23]:
doc.paragraphs[1].runs[0].style = 'Quote Char'  # Use the style name; looking it up by the id 'QuoteChar' is deprecated.
doc.paragraphs[1].runs[1].underline = True
doc.paragraphs[1].runs[3].underline = True
doc.save('restyled.docx')


### Writing Word Documents

In [3]:
import docx

doc = docx.Document()
doc.add_paragraph("Hello, world!")

<docx.text.paragraph.Paragraph at 0x7f0cba498c40>

In [4]:
doc.save('helloworld.docx')

You can add paragraphs by calling the `add_paragraph()` method again with the new paragraph’s text. Or to add text to the end of an existing paragraph, you can call the paragraph’s `add_run()` method and pass it a string.

In [10]:
doc = docx.Document()
doc.add_paragraph('Hello, world!')

<docx.text.paragraph.Paragraph at 0x7f0cba441970>

In [11]:
paraObj1 = doc.add_paragraph('This is a second paragraph.')
paraObj2 = doc.add_paragraph('This is yet another paragraph.')

paraObj1.add_run(' This text is being added to the second paragraph.')

<docx.text.run.Run at 0x7f0ce8ae9190>

In [12]:
doc.save('multipleParagraphs.docx')

### Adding Headings

In [15]:
doc = docx.Document()

doc.add_heading('Header 0', 0)
doc.add_heading('Header 1', 1)
doc.add_heading('Header 2', 2)
doc.add_heading('Header 3', 3)
doc.add_heading('Header 4', 4)

doc.save('headings.docx')

### Adding Line and Page Breaks

In [16]:
doc = docx.Document()

doc.add_paragraph('This is on the first page!')
doc.paragraphs[0].runs[0].add_break(docx.enum.text.WD_BREAK.PAGE)
doc.add_paragraph('This is on the second page!')

doc.save('twoPage.docx')

### Adding Pictures

In [20]:
doc.add_picture('Astronaut.jpg', width=docx.shared.Inches(1), height=docx.shared.Cm(4))
doc.save('image.docx')

## Creating PDFs from Word Documents

The `PyPDF2` module doesn’t allow you to create PDF documents directly, but there’s a way to generate PDF files with Python if you’re on Windows and have Microsoft Word installed. You’ll need to install the `pywin32` package by running `pip install pywin32`. With this and the `docx` module, you can create Word documents and then convert them to PDFs with the following script.

In [5]:
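# This script runs on Windows only, and you must have Word installed.
import os
import win32com.client # install with "pip install pywin32==224"
import docx
# Word resolves relative paths against its own working directory, so use absolute paths.
wordFilename = os.path.abspath('your_word_document.docx')
pdfFilename = os.path.abspath('your_pdf_filename.pdf')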

doc = docx.Document()
# Code to create Word document goes here.
doc.save(wordFilename)

wdFormatPDF = 17 # Word's numeric code for PDFs.
wordObj = win32com.client.Dispatch('Word.Application')

docObj = wordObj.Documents.Open(wordFilename)
docObj.SaveAs(pdfFilename, FileFormat=wdFormatPDF)
docObj.Close()
wordObj.Quit()

## Practice Projects

### PDF Paranoia

Using the `os.walk()` function from Chapter 10, write a script that will go through every PDF in a folder (and its subfolders) and encrypt the PDFs using a password provided on the command line. Save each encrypted PDF with an *_encrypted.pdf* suffix added to the original filename. Before deleting the original file, have the program attempt to read and decrypt the file to ensure that it was encrypted correctly.

Then, write a program that finds all encrypted PDFs in a folder (and its subfolders) and creates a decrypted copy of the PDF using a provided password. If the password is incorrect, the program should print a message to the user and continue to the next PDF.
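
As a starting point for the first program, the folder walk, the output filename, and the verify-before-delete step might look like the sketch below. It is a partial sketch under stated assumptions rather than a full solution: `find_pdfs()`, `encrypted_name()`, and `encrypt_and_verify()` are hypothetical names, and the code assumes the source PDFs are not already encrypted.

```python
import os


def find_pdfs(root):
    """Yield the path of every .pdf file under root, including subfolders."""
    for folder, _subfolders, filenames in os.walk(root):
        for filename in sorted(filenames):
            if filename.lower().endswith('.pdf'):
                yield os.path.join(folder, filename)


def encrypted_name(path):
    """Return path with an _encrypted.pdf suffix: a.pdf -> a_encrypted.pdf."""
    base, _ext = os.path.splitext(path)
    return base + '_encrypted.pdf'


def encrypt_and_verify(path, password):
    """Write an encrypted copy of path; delete the original only if it verifies."""
    # Imported here so the path helpers above work without PyPDF2 installed.
    import PyPDF2

    out_path = encrypted_name(path)
    with open(path, 'rb') as src, open(out_path, 'wb') as dst:
        reader = PyPDF2.PdfReader(src)
        writer = PyPDF2.PdfWriter()
        for page in reader.pages:
            writer.add_page(page)
        writer.encrypt(password)
        writer.write(dst)

    # Re-open the copy and confirm the password unlocks it before deleting.
    with open(out_path, 'rb') as check:
        check_reader = PyPDF2.PdfReader(check)
        verified = check_reader.is_encrypted and bool(check_reader.decrypt(password))
    if verified:
        os.remove(path)
    return verified
```

A driver script would read the password from `sys.argv` and call `encrypt_and_verify()` on each path that `find_pdfs()` yields. The second program can reuse `find_pdfs()` and check `is_encrypted` on each reader.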

### Custom Invitations as Word Documents

Say you have a text file of guest names. This *guests.txt* file has one name per line, as follows:

---
`Prof. Plum` \
`Miss Scarlet` \
`Col. Mustard` \
`Al Sweigart` \
`RoboCop`

Write a program that generates a Word document with custom invitations that look like the figure below: ![robocop-invitation.jpg](https://automatetheboringstuff.com/2e/images/000069.jpg)

Since Python-Docx can use only those styles that already exist in the Word document, you will have to first add these styles to a blank Word file and then open that file with Python-Docx. There should be one invitation per page in the resulting Word document, so call `add_break()` to add a page break after the last paragraph of each invitation. This way, you will need to open only one Word document to print all of the invitations at once.

You can download a sample guests.txt file from https://nostarch.com/automatestuff2/.
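
One possible shape for this program is sketched below. Treat it as a rough outline, not the intended answer: the function names, the invitation wording, and the style names `InviteText` and `GuestName` are all hypothetical. The styles would have to exist in the template *.docx* file you prepare, or Python-Docx will raise an error.

```python
def read_guests(path):
    """Return the guest names in path, one per non-blank line."""
    with open(path, encoding='utf-8') as guest_file:
        return [line.strip() for line in guest_file if line.strip()]


def build_invitations(guests, template_path, output_path):
    """Write one invitation per page, using styles defined in template_path."""
    # Imported here so read_guests() can be used without python-docx installed.
    import docx
    from docx.enum.text import WD_BREAK

    doc = docx.Document(template_path)
    for index, guest in enumerate(guests):
        # Placeholder wording and hypothetical style names; the template
        # must already define any style used here.
        doc.add_paragraph('You are invited,', style='InviteText')
        last_para = doc.add_paragraph(guest, style='GuestName')
        if index < len(guests) - 1:
            # A page break after the last paragraph starts the next invitation
            # on a new page. It is skipped after the final guest to avoid a
            # trailing blank page.
            last_para.add_run().add_break(WD_BREAK.PAGE)
    doc.save(output_path)
```

Calling `build_invitations(read_guests('guests.txt'), 'template.docx', 'invitations.docx')` would then produce a single document to print, where `template.docx` is an assumed name for the styled blank file.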

### Brute-Force PDF Password Breaker

Say you have an encrypted PDF that you have forgotten the password to, but you remember it was a single English word. Trying to guess your forgotten password is quite a boring task. Instead you can write a program that will decrypt the PDF by trying every possible English word until it finds one that works. This is called a *brute-force password attack.* Download the text file *dictionary.txt* from https://nostarch.com/automatestuff2/. This *dictionary file* contains over 44,000 English words with one word per line.

Using the file-reading skills you learned in Chapter 9, create a list of word strings by reading this file. Then loop over each word in this list, passing it to the `decrypt()` method. If this method returns 0 (`PasswordType.NOT_DECRYPTED` in PyPDF2 2.x), the password was wrong and your program should continue to the next password. If `decrypt()` returns a nonzero value (1 for a matching user password, 2 for a matching owner password, as with `'rosebud'` earlier in this chapter), your program should break out of the loop and print the hacked password. You should try both the uppercase and lowercase forms of each word. (On my laptop, going through all 88,000 uppercase and lowercase words from the dictionary file takes a couple of minutes. This is why you shouldn’t use a simple English word for your passwords.)
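
The loop described above can be sketched as follows. The function names are hypothetical. The search is kept separate from PyPDF2 so that it can be tried with any callable that reports success, and it relies on `decrypt()` returning a falsy value (0) for a wrong password.

```python
def candidate_passwords(words):
    """Yield the lowercase and uppercase form of each non-blank word."""
    for word in words:
        word = word.strip()
        if word:
            yield word.lower()
            yield word.upper()


def find_password(candidates, try_password):
    """Return the first candidate for which try_password() is truthy, else None."""
    for candidate in candidates:
        if try_password(candidate):
            return candidate
    return None


def break_pdf_password(pdf_path, dictionary_path):
    """Try every dictionary word against the encrypted PDF at pdf_path."""
    # Imported here so the two helpers above work without PyPDF2 installed.
    import PyPDF2

    with open(dictionary_path, encoding='utf-8') as dictionary_file:
        words = dictionary_file.read().splitlines()
    with open(pdf_path, 'rb') as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        # decrypt() returns 0 for a wrong password and 1 or 2 for a match.
        return find_password(candidate_passwords(words), reader.decrypt)
```

For example, `break_pdf_password('encrypted.pdf', 'dictionary.txt')` would return the matching word, or `None` if no word in the file unlocks the PDF.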