# PyPDF2

Basic Example with Reading, Writing, and Copying PDFs

In [3]:

import glob
from PyPDF2 import PdfFileReader, PdfFileWriter

paths = glob.glob('./Python- OCR for PDF or Compare textract, pytesseract, and pyocr.pdf')
paths.sort()
pdf_writer = PdfFileWriter()
for path in paths:
    # Append every page of each matching PDF to the writer.
    pdf_reader = PdfFileReader(stream=path)
    for page_number in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(page=pdf_reader.getPage(pageNumber=page_number))

# Write the combined document once, after all pages have been added.
with open(file='./new_pdf', mode='wb') as out:
    pdf_writer.write(stream=out)


## PdfFileReader class

The PdfFileReader Class <br><br>

class PyPDF2.PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)
<br><br>
Initializes a PdfFileReader object. This operation can take some time, as the PDF stream’s cross-reference tables are read into memory.<br><br>

Parameters:<br><br>
stream – A File object or an object that supports the standard read and seek methods similar to a File object. Could also be a string representing a path to a PDF file.<br><br>

strict (bool) – Determines whether user should be warned of all problems and also causes some correctable problems to be fatal. Defaults to True.<br><br>

warndest – Destination for logging warnings (defaults to sys.stderr).<br><br>

overwriteWarnings (bool) – Determines whether to override Python’s warnings.py module with a custom implementation (defaults to True).

In [3]:

from PyPDF2 import PdfFileReader
import sys

path = open(file='./Python- OCR for PDF or Compare textract, pytesseract, and pyocr.pdf', mode='rb')
pdf_reader = PdfFileReader(stream=path, strict=True, warndest=sys.stderr, overwriteWarnings=True)


decrypt(password)<br><br>

When using an encrypted / secured PDF file with the PDF Standard encryption handler, this function will allow the file to be decrypted. It checks the given password against the document’s user password and owner password, and then stores the resulting decryption key if either password is correct.

It does not matter which password was matched. Both passwords provide the correct decryption key that will allow the document to be used with this library.

Parameters:<br><br>

password (str) – The password to match.<br><br>

Returns:	0 if the password failed, 1 if the password matched the user password, and 2 if the password matched the owner password.<br><br>
Return type:	int
Raises NotImplementedError: if the document uses an unsupported encryption method.

isEncrypted<br><br>
Read-only boolean property showing whether this PDF file is encrypted. Note that this property, if true, will remain true even after the decrypt() method is called.
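
The integer returned by decrypt() is easier to act on once each value has a name. The following is a minimal sketch in plain Python: `DecryptResult` and `describe_decrypt_result` are hypothetical helpers, not part of PyPDF2, and they only encode the three documented return values (0, 1, and 2).

```python
from enum import IntEnum


class DecryptResult(IntEnum):
    """Hypothetical names for the integer codes documented for decrypt()."""
    FAILED = 0          # the password matched neither stored password
    USER_PASSWORD = 1   # the password matched the user password
    OWNER_PASSWORD = 2  # the password matched the owner password


def describe_decrypt_result(code):
    """Translate a decrypt() return code into a short message."""
    result = DecryptResult(code)  # raises ValueError for any other integer
    if result is DecryptResult.FAILED:
        return 'password rejected'
    return 'decrypted with the ' + result.name.replace('_', ' ').lower()


for code in (0, 1, 2):
    print(code, '->', describe_decrypt_result(code))
```

In real use the argument would be the value returned by `pdf_reader.decrypt(password)`, so a caller can stop early when the password is rejected.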

In [39]:

from PyPDF2 import PdfFileReader
import sys

path = open(file='./Python- OCR for PDF or Compare textract, pytesseract, and pyocr.pdf', mode='rb')
pdf_reader = PdfFileReader(stream=path, strict=True, warndest=sys.stderr, overwriteWarnings=True)
try:
    # This file is not encrypted, so decrypt() fails with KeyError('/Encrypt').
    pdf_reader.decrypt(password='None')
except KeyError as error:
    print(str(error))

print(pdf_reader.isEncrypted)
assert pdf_reader.isEncrypted is not True


'/Encrypt'
False


documentInfo<br><br>
Read-only property that accesses the getDocumentInfo() function.

getDocumentInfo()<br><br>

Retrieves the PDF file’s document information dictionary, if it exists. Note that some PDF files use metadata streams instead of docinfo dictionaries, and these metadata streams will not be accessed by this function.

Returns:	the document information of this PDF file
Return type:	DocumentInformation, or None if the document has no information dictionary.

class PyPDF2.pdf.DocumentInformation<br><br>

A class representing the basic document metadata provided in a PDF File. This class is accessible through getDocumentInfo()<br><br>

All text properties of the document metadata come in two variants, e.g. author and author_raw. The non-raw property always returns a TextStringObject, which makes it the better choice when the metadata is being displayed. The raw property can return a ByteStringObject if PyPDF2 was unable to decode the string’s text encoding; callers must handle that case, so the raw properties are accessed less often.<br><br>

author<br>
Read-only property accessing the document’s author. Returns a unicode string (TextStringObject) or None if the author is not specified.<br><br>

author_raw<br>
The “raw” version of author; can return a ByteStringObject.<br><br>

creator<br>
Read-only property accessing the document’s creator. If the document was converted to PDF from another format, this is the name of the application (e.g. OpenOffice) that created the original document from which it was converted. Returns a unicode string (TextStringObject) or None if the creator is not specified.<br><br>

creator_raw<br>
The “raw” version of creator; can return a ByteStringObject.<br><br>

producer<br>
Read-only property accessing the document’s producer. If the document was converted to PDF from another format, this is the name of the application (for example, OSX Quartz) that converted it to PDF. Returns a unicode string (TextStringObject) or None if the producer is not specified.<br><br>

producer_raw<br>
The “raw” version of producer; can return a ByteStringObject.<br><br>

subject<br>
Read-only property accessing the document’s subject. Returns a unicode string (TextStringObject) or None if the subject is not specified.<br><br>

subject_raw<br>
The “raw” version of subject; can return a ByteStringObject.<br><br>

title<br>
Read-only property accessing the document’s title. Returns a unicode string (TextStringObject) or None if the title is not specified.<br><br>

title_raw<br>
The “raw” version of title; can return a ByteStringObject.<br><br>

In [21]:

from PyPDF2 import PdfFileReader
import sys

path = open(file='./Python- OCR for PDF or Compare textract, pytesseract, and pyocr.pdf', mode='rb')
pdf_reader = PdfFileReader(stream=path, strict=True, warndest=sys.stderr, overwriteWarnings=True)
pdf_info = pdf_reader.documentInfo
print(type(pdf_info))
print(pdf_info, end='\n\n')
# for key, value in pdf_info.__dict__.items():
#     """Print each attribute KV pair in pdf_info"""
#     print(key, ': ', value)
assert pdf_reader.documentInfo == pdf_reader.getDocumentInfo()
print('author: ', pdf_info.author)
print('author_raw: ', pdf_info.author_raw)
print('creator: ', pdf_info.creator)
print('creator_raw: ', pdf_info.creator_raw)
print('producer: ', pdf_info.producer)
print('producer_raw: ', pdf_info.producer_raw)
print('subject: ', pdf_info.subject)
print('subject_raw: ', pdf_info.subject_raw)
print('title: ', pdf_info.title)
print('title_raw: ', pdf_info.title_raw)


<class 'PyPDF2.pdf.DocumentInformation'>
{'/Creator': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36', '/Producer': 'Skia/PDF m83', '/CreationDate': "D:20200608035829+00'00'", '/ModDate': "D:20200608035829+00'00'"}

author:  None
author_raw:  None
creator:  Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36
creator_raw:  Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36
producer:  Skia/PDF m83
producer_raw:  Skia/PDF m83
subject:  None
subject_raw:  None
title:  None
title_raw:  None
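
The /CreationDate and /ModDate values in the dictionary printed above are PDF date strings of the form `D:YYYYMMDDHHmmSS` followed by a UTC offset. DocumentInformation exposes no property for them in the list above, so the sketch below converts such a string with the standard library only. `parse_pdf_date` is a hypothetical helper, not a PyPDF2 function, and it assumes that missing trailing components default to the first month, first day, and midnight.

```python
import re
from datetime import datetime, timedelta, timezone

# Every component after the year is optional in a PDF date string.
PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[-+Z])?(?:(?P<tz_hour>\d{2})'?)?(?:(?P<tz_minute>\d{2})'?)?$"
)


def parse_pdf_date(value):
    """Parse a PDF date string into a datetime (hypothetical helper)."""
    match = PDF_DATE.match(value)
    if match is None:
        raise ValueError('not a PDF date string: ' + repr(value))
    parts = match.groupdict()

    # Build the timezone; a string without an offset yields a naive datetime.
    tz = None
    if parts['sign'] == 'Z':
        tz = timezone.utc
    elif parts['sign'] in ('+', '-'):
        offset = timedelta(hours=int(parts['tz_hour'] or 0),
                           minutes=int(parts['tz_minute'] or 0))
        tz = timezone(offset if parts['sign'] == '+' else -offset)

    return datetime(int(parts['year']), int(parts['month'] or 1),
                    int(parts['day'] or 1), int(parts['hour'] or 0),
                    int(parts['minute'] or 0), int(parts['second'] or 0),
                    tzinfo=tz)


created = parse_pdf_date("D:20200608035829+00'00'")
print(created.isoformat())
```

The raw string itself can be read from the information dictionary with `pdf_info['/CreationDate']`.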


getDestinationPageNumber(destination)<br><br>
Retrieve page number of a given Destination object

Parameters:<br><br>

destination (Destination) – The destination whose page number should be retrieved. Must be an instance of Destination.<br><br>

Returns:
the page number or -1 if page not found<br><br>
Return type: int

namedDestinations<br><br>
Read-only property that accesses the getNamedDestinations() function.

getNamedDestinations(tree=None, retval=None)<br><br>
Retrieves the named destinations present in the document.

Returns:	a dictionary which maps names to Destinations.<br>
Return type:	dict

The Destination Class<br><br>
class PyPDF2.generic.Destination(title, page, typ, *args)<br><br>
A class representing a destination within a PDF file. See section 8.2.1 of the PDF 1.6 reference.<br><br>

Parameters:	
title (str) – Title of this destination.<br>
page (int) – Page number of this destination.<br>
typ (str) – How the destination is displayed.<br>
args – Additional arguments may be necessary depending on the type.<br>
Raises PdfReadError: if the destination type is invalid.<br><br>

Valid typ arguments (see PDF spec for details):<br>
/Fit	No additional arguments<br>
/XYZ	[left] [top] [zoomFactor]<br>
/FitH	[top]<br>
/FitV	[left]<br>
/FitR	[left] [bottom] [right] [top]<br>
/FitB	No additional arguments<br>
/FitBH	[top]<br>
/FitBV	[left]<br>

bottom<br>
Read-only property accessing the bottom vertical coordinate.<br>
Return type:	int, or None if not available.<br><br>

left<br>
Read-only property accessing the left horizontal coordinate.<br>
Return type:	int, or None if not available.<br><br>

page<br>
Read-only property accessing the destination page number.<br>
Return type:	int<br><br>

right<br>
Read-only property accessing the right horizontal coordinate.<br>
Return type:	int, or None if not available.<br><br>

title<br>
Read-only property accessing the destination title.<br>
Return type:	str<br><br>

top<br>
Read-only property accessing the top vertical coordinate.<br>
Return type:	int, or None if not available.<br><br>

typ<br>
Read-only property accessing the destination type.<br>
Return type:	str<br><br>

zoom<br>
Read-only property accessing the zoom factor.<br>
Return type:	int, or None if not available.
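
The table of valid typ arguments above determines how many extra positional arguments a destination carries and what each one means. The sketch below restates that table as a lookup and labels the arguments by name. `DESTINATION_ARGS` and `label_destination_args` are hypothetical names for illustration, not part of PyPDF2, and the sketch raises ValueError where the Destination class is documented to raise PdfReadError.

```python
# Extra arguments expected for each destination type, in order.
DESTINATION_ARGS = {
    '/Fit': (),
    '/XYZ': ('left', 'top', 'zoom'),
    '/FitH': ('top',),
    '/FitV': ('left',),
    '/FitR': ('left', 'bottom', 'right', 'top'),
    '/FitB': (),
    '/FitBH': ('top',),
    '/FitBV': ('left',),
}


def label_destination_args(typ, *args):
    """Pair each extra argument with its name (hypothetical helper)."""
    try:
        names = DESTINATION_ARGS[typ]
    except KeyError:
        raise ValueError('invalid destination type: ' + repr(typ)) from None
    if len(args) != len(names):
        raise ValueError('%s expects %d argument(s), got %d'
                         % (typ, len(names), len(args)))
    return dict(zip(names, args))


print(label_destination_args('/XYZ', 0, 792, 1))
print(label_destination_args('/FitR', 0, 0, 612, 792))
print(label_destination_args('/Fit'))
```

The names in the lookup match the read-only properties listed above (left, top, right, bottom, zoom), which is why a property returns None when its coordinate does not apply to the destination type.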

In [73]:

from PyPDF2 import PdfFileReader
from PyPDF2.generic import Destination
import sys

path = open(file='./Python- OCR for PDF or Compare textract, pytesseract, and pyocr.pdf', mode='rb')
pdf_reader = PdfFileReader(stream=path, strict=True, warndest=sys.stderr, overwriteWarnings=True)
named_dests = pdf_reader.namedDestinations
print(named_dests)
assert pdf_reader.namedDestinations == pdf_reader.getNamedDestinations()
# dest = Destination(title=path.name, page=1, typ=None)
# dest_pg_num = pdf_reader.getDestinationPageNumber(destination=dest)


{}


getFields(tree=None, retval=None, fileobj=None)<br><br>
Extracts field data if this PDF contains interactive form fields. The tree and retval parameters are for recursive use.

Parameters:<br><br>

fileobj – A file object (usually a text file) to write a report to on all interactive form fields found.<br><br>
Returns:	A dictionary where each key is a field name, and each value is a Field object. By default, the mapping name is used for keys.
Return type:	dict, or None if form data could not be located.

getFormTextFields()<br><br>
Retrieves form fields from the document with textual data (inputs, dropdowns)

In [74]:

from PyPDF2 import PdfFileReader
import io
import sys

path = open(file='./f1040_Agatha_Christie.pdf', mode='rb')
pdf_reader = PdfFileReader(stream=path, strict=True, warndest=sys.stderr, overwriteWarnings=True)
# fileobj must be a writable, file-like text object; it receives the report on the fields found.
form_output = io.StringIO()
fields = pdf_reader.getFields(fileobj=form_output)
print(type(fields))
print(fields, end='\n' + 125 * '-' + '\n\n')

# getFormTextFields() takes no arguments, so no output buffer is needed.
text_fields = pdf_reader.getFormTextFields()
print(type(text_fields))
print(text_fields)


<class 'dict'>
{'topmostSubform[0].Page1[0].FilingStatus[0].c1_01[0]': {'/FT': '/Btn', '/T': 'topmostSubform[0].Page1[0].FilingStatus[0].c1_01[0]', '/Ff': 0, '/V': '/1'}, 'topmostSubform[0].Page1[0].f1_02[0]': {'/FT': '/Tx', '/T': 'topmostSubform[0].Page1[0].f1_02[0]', '/Ff': 8388608, '/V': 'U.N.'}, 'topmostSubform[0].Page1[0].f1_03[0]': {'/FT': '/Tx', '/T': 'topmostSubform[0].Page1[0].f1_03[0]', '/Ff': 8388608, '/V': 'Owen'}, 'topmostSubform[0].Page1[0].YourSocial_ReadOrderControl[0].f1_04[0]': {'/FT': '/Tx', '/T': 'topmostSubform[0].Page1[0].YourSocial_ReadOrderControl[0].f1_04[0]', '/Ff': 25165824, '/V': '123456789'}, 'topmostSubform[0].Page1[0].ReadOrderControl[0].Address[0].f1_08[0]': {'/FT': '/Tx', '/T': 'topmostSubform[0].Page1[0].ReadOrderControl[0].Address[0].f1_08[0]', '/Ff': 8388608, '/V': '666 Enigma St.'}, 'topmostSubform[0].Page1[0].ReadOrderControl[0].Address[0].f1_10[0]': {'/FT': '/Tx', '/T': 'topmostSubform[0].Page1[0].ReadOrderControl[0].Address[0].f1_10[0]', '/Ff': 8
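
The dictionary printed above maps each field name to a record containing keys such as /FT (field type) and /V (value). A minimal sketch of pulling out only the text values follows. `text_field_values` is a hypothetical helper, and plain dicts copied from the first three entries of the output stand in for the Field objects that getFields() returns.

```python
# Stand-in data copied from the first three entries of the printed output.
sample_fields = {
    'topmostSubform[0].Page1[0].FilingStatus[0].c1_01[0]': {
        '/FT': '/Btn',
        '/T': 'topmostSubform[0].Page1[0].FilingStatus[0].c1_01[0]',
        '/Ff': 0,
        '/V': '/1',
    },
    'topmostSubform[0].Page1[0].f1_02[0]': {
        '/FT': '/Tx',
        '/T': 'topmostSubform[0].Page1[0].f1_02[0]',
        '/Ff': 8388608,
        '/V': 'U.N.',
    },
    'topmostSubform[0].Page1[0].f1_03[0]': {
        '/FT': '/Tx',
        '/T': 'topmostSubform[0].Page1[0].f1_03[0]',
        '/Ff': 8388608,
        '/V': 'Owen',
    },
}


def text_field_values(fields):
    """Map each text field (/FT == '/Tx') to its value, or None if unset."""
    return {
        name: field.get('/V')
        for name, field in fields.items()
        if field.get('/FT') == '/Tx'
    }


for name, value in text_field_values(sample_fields).items():
    print(name, '=', value)
```

The checkbox entry (/Btn) is skipped, which leaves only the two text inputs.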

getNumPages()<br>
Calculates the number of pages in this PDF file.<br>

Returns:	number of pages<br>
Return type:	int<br>
Raises PdfReadError: if the file is encrypted and restrictions prevent this action.

numPages<br><br>
Read-only property that accesses the getNumPages() function.

In [75]:

from PyPDF2 import PdfFileReader
import sys

path = open(file='./Python- OCR for PDF or Compare textract, pytesseract, and pyocr.pdf', mode='rb')
pdf_reader = PdfFileReader(stream=path, strict=True, warndest=sys.stderr, overwriteWarnings=True)
num_pages = pdf_reader.numPages
print(num_pages)
assert pdf_reader.numPages == pdf_reader.getNumPages()


7


getOutlines(node=None, outlines=None)<br><br>
Retrieves the document outline present in the document.

Returns:	a nested list of Destinations.

outlines<br><br>
Read-only property that accesses the getOutlines() function.

In [79]:

from PyPDF2 import PdfFileReader
import sys

path = open(file='./Python- OCR for PDF or Compare textract, pytesseract, and pyocr.pdf', mode='rb')
pdf_reader = PdfFileReader(stream=path, strict=True, warndest=sys.stderr, overwriteWarnings=True)
out = pdf_reader.outlines
print(type(out))
print(out)
assert pdf_reader.outlines == pdf_reader.getOutlines()


<class 'list'>
[]
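
The sample file has no outline, so the result above is an empty list. For a document that does have one, getOutlines() is documented to return a nested list of Destinations. The sketch below shows one way to flatten such a structure into (depth, entry) pairs. `walk_outline` is a hypothetical helper, and plain strings stand in for Destination objects so the example runs without a PDF.

```python
def walk_outline(outline, depth=0):
    """Yield (depth, entry) pairs from a nested outline list."""
    for entry in outline:
        if isinstance(entry, list):
            # A nested list holds the children of the preceding entry.
            yield from walk_outline(entry, depth + 1)
        else:
            yield depth, entry


# Made-up outline: strings stand in for Destination objects.
sample_outline = [
    'Introduction',
    ['Background', 'Scope'],
    'Methods',
    ['Setup', ['Hardware']],
]

for depth, entry in walk_outline(sample_outline):
    print('  ' * depth + entry)
```

With real Destination objects, `entry.title` would take the place of the string when printing.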


getPage(pageNumber)<br><br>
Retrieves a page by number from this PDF file.

Parameters:	pageNumber (int) – The page number to retrieve (pages begin at zero)<br>
Returns:	a PageObject instance.<br>
Return type:	PageObject

pages<br><br>
Read-only property that emulates a list based upon the getNumPages() and getPage() methods.
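
Because page numbers begin at zero, the page a reader calls "page 1" is `getPage(0)`. The sketch below makes that conversion explicit and adds a bounds check. `to_page_index` is a hypothetical helper, not part of PyPDF2, and the page count of 7 is the value reported by numPages for the sample file.

```python
def to_page_index(page_label, num_pages):
    """Convert a 1-based page number into the 0-based index getPage() expects."""
    if not 1 <= page_label <= num_pages:
        raise IndexError('page %d is outside 1..%d' % (page_label, num_pages))
    return page_label - 1


NUM_PAGES = 7  # value reported by numPages for the sample file

print(to_page_index(1, NUM_PAGES))  # index of the first page
print(to_page_index(7, NUM_PAGES))  # index of the last page
```

The result would then be passed on as `pdf_reader.getPage(pageNumber=to_page_index(1, pdf_reader.numPages))`.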

In [84]:

from PyPDF2 import PdfFileReader
import sys

path = open(file='./Python- OCR for PDF or Compare textract, pytesseract, and pyocr.pdf', mode='rb')
pdf_reader = PdfFileReader(stream=path, strict=True, warndest=sys.stderr, overwriteWarnings=True)
# Page numbers are zero-based, so pageNumber=1 retrieves the second page.
second_page = pdf_reader.getPage(pageNumber=1)
print(type(second_page))
print(second_page, end='\n' + 125 * '-' + '\n\n')
page_list = pdf_reader.pages
print(type(page_list))
print(page_list)


<class 'PyPDF2.pdf.PageObject'>
{'/Type': '/Page', '/Resources': {'/ProcSet': ['/PDF', '/Text', '/ImageB', '/ImageC', '/ImageI'], '/ExtGState': {'/G3': IndirectObject(3, 0), '/G5': IndirectObject(5, 0), '/G10': IndirectObject(10, 0)}, '/XObject': {'/X29': IndirectObject(29, 0), '/X30': IndirectObject(30, 0), '/X31': IndirectObject(31, 0), '/X32': IndirectObject(32, 0), '/X33': IndirectObject(33, 0)}, '/Font': {'/F4': IndirectObject(4, 0), '/F13': IndirectObject(13, 0), '/F14': IndirectObject(14, 0), '/F26': IndirectObject(26, 0), '/F27': IndirectObject(27, 0), '/F28': IndirectObject(28, 0)}}, '/MediaBox': [0, 0, 612, 792], '/Contents': IndirectObject(34, 0), '/StructParents': 1, '/Parent': IndirectObject(77, 0)}
-----------------------------------------------------------------------------------------------------------------------------

<class 'PyPDF2.utils.ConvertFunctionsToVirtualList'>
<PyPDF2.utils.ConvertFunctionsToVirtualList object at 0x1168e9650>


getPageLayout()<br><br>
Get the page layout. See setPageLayout() for a description of valid layouts.

Returns:	Page layout currently being used.
Return type:	str, None if not specified

getPageMode()<br><br>
Get the page mode. See setPageMode() for a description of valid modes.

Returns:	Page mode currently being used.
Return type:	str, None if not specified

In [93]:

from PyPDF2 import PdfFileReader, PdfFileWriter
import sys

path = open(file='./Python- OCR for PDF or Compare textract, pytesseract, and pyocr.pdf', mode='rb')
pdf_reader = PdfFileReader(stream=path, strict=True, warndest=sys.stderr, overwriteWarnings=True)
pdf_writer = PdfFileWriter()
pdf_writer.addPage(pdf_reader.getPage(pageNumber=1))
pdf_writer.addBlankPage()

layouts = ['/NoLayout', '/SinglePage', '/OneColumn', '/TwoColumnLeft', '/TwoColumnRight', '/TwoPageLeft', '/TwoPageRight']
print('Page Layouts', end='\n' + 12 * '-' + '\n')
for layout in layouts:
    # Set each layout, write the file, then read the layout back.
    pdf_writer.setPageLayout(layout=layout)
    with open(file='./new_pdf', mode='wb') as out:
        pdf_writer.write(stream=out)
    pdf_reader = PdfFileReader(stream='./new_pdf', strict=True, warndest=sys.stderr, overwriteWarnings=True)
    print(pdf_reader.getPageLayout())

modes = ['/UseNone', '/UseOutlines', '/UseThumbs', '/FullScreen', '/UseOC', '/UseAttachments']
print('\nPage Modes', end='\n' + 10 * '-' + '\n')
for mode in modes:
    # Set each mode, write the file, then read the mode back.
    pdf_writer.setPageMode(mode=mode)
    with open(file='./new_pdf', mode='wb') as out:
        pdf_writer.write(stream=out)
    pdf_reader = PdfFileReader(stream='./new_pdf', strict=True, warndest=sys.stderr, overwriteWarnings=True)
    print(pdf_reader.getPageMode())

Page Layouts
------------
/NoLayout
/SinglePage
/OneColumn
/TwoColumnLeft
/TwoColumnRight
/TwoPageLeft
/TwoPageRight

Page Modes
----------
/UseNone
/UseOutlines
/UseThumbs
/FullScreen
/UseOC
/UseAttachments
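
Page mode names are case-sensitive, and the PDF specification spells the full-screen mode /FullScreen, so a small typo produces a value that viewers may not recognise. The sketch below normalises a mode name before it is passed on. `VALID_PAGE_MODES` and `normalize_page_mode` are hypothetical names for illustration, not part of PyPDF2.

```python
# Page mode names as spelled in the PDF specification.
VALID_PAGE_MODES = (
    '/UseNone',
    '/UseOutlines',
    '/UseThumbs',
    '/FullScreen',
    '/UseOC',
    '/UseAttachments',
)


def normalize_page_mode(mode):
    """Return the canonical spelling of a page mode, ignoring letter case."""
    by_lower = {valid.lower(): valid for valid in VALID_PAGE_MODES}
    try:
        return by_lower[mode.lower()]
    except KeyError:
        raise ValueError('unknown page mode: ' + repr(mode)) from None


print(normalize_page_mode('/Fullscreen'))
print(normalize_page_mode('/useoc'))
```

The normalised value could then be handed to `pdf_writer.setPageMode(mode=...)`.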




getPageNumber(page)<br><br>
Retrieve page number of a given PageObject

Parameters:<br><br>
page (PageObject) – The page whose number should be retrieved. Must be an instance of PageObject.<br>
Returns:	the page number or -1 if page not found<br>
Return type:	int

In [102]:

from PyPDF2 import PdfFileReader
import random
import sys

path = open(file='./Python- OCR for PDF or Compare textract, pytesseract, and pyocr.pdf', mode='rb')
pdf_reader = PdfFileReader(stream=path, strict=True, warndest=sys.stderr, overwriteWarnings=True)

for trial in range(10):
    """Get random pages 10 times"""
    random_page_num = random.randrange(pdf_reader.getNumPages())
    random_page = pdf_reader.getPage(pageNumber=random_page_num)
    print(pdf_reader.getPageNumber(page=random_page))


3
2
4
6
1
6
1
1
1
5


getXmpMetadata()<br><br>
Retrieves XMP (Extensible Metadata Platform) data from the PDF document root.<br>

Returns:	an XmpInformation instance that can be used to access XMP metadata from the document.<br>
Return type:	XmpInformation or None if no metadata was found on the document root.

In [104]:

from PyPDF2 import PdfFileReader
import sys

path = open(file='./Python- OCR for PDF or Compare textract, pytesseract, and pyocr.pdf', mode='rb')
pdf_reader = PdfFileReader(stream=path, strict=True, warndest=sys.stderr, overwriteWarnings=True)
xmp = pdf_reader.xmpMetadata
print(type(xmp))
print(xmp)
assert pdf_reader.xmpMetadata == pdf_reader.getXmpMetadata()


<class 'NoneType'>
None
