# PDF FILE MANIPULATION
# Credit goes to Dr. Ryan @STEMplicity





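- Please install PyPDF2, a common Python library for working with PDF files.
- The library can read text from PDF files, and it can also copy, rotate, and merge pages.
- The code below uses the PdfFileReader and PdfFileWriter classes, which stopped working in PyPDF2 3.0 (they were renamed to PdfReader and PdfWriter), so install an earlier release:
pip install "PyPDF2<3.0"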
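
To confirm which release is installed before running the cells below, a check along these lines may help. It is only a minimal sketch: `uses_legacy_api` is a hypothetical helper (not part of PyPDF2), and it assumes the `PdfFileReader`/`PdfFileWriter` names used in this notebook are usable only in releases before 3.0.

```python
from importlib.metadata import PackageNotFoundError, version


def uses_legacy_api(version_string):
    """Return True if the major version number is below 3 (hypothetical helper)."""
    major = int(version_string.split(".")[0])
    return major < 3


try:
    installed = version("PyPDF2")
    print(installed, "legacy API available:", uses_legacy_api(installed))
except PackageNotFoundError:
    print("PyPDF2 is not installed")
```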


# Overview:

# 1.) How to read a PDF
# 2.) Copy a single page into a newly created PDF
# 3.) Rotate PDFs and write them to a new PDF
# 4.) Read multiple pages
# 5.) Merge two PDFs

---------------------------------------------------

# 1.) How to read a PDF

In [2]:
from PyPDF2 import PdfFileWriter, PdfFileReader

In [3]:
f = open('Harvard_Business_School.pdf', 'rb') # 'rb' opens the file for reading in binary mode

In [7]:
# Create a PDF reader object
read_pdf = PdfFileReader(f)

In [12]:
read_pdf.documentInfo

{'/Author': 'Barlow, Andrew Jonathan',
 '/Company': 'Harvard University',
 '/CreationDate': "D:20180817171357-04'00'",
 '/Creator': 'Acrobat PDFMaker 18 for Word',
 '/ModDate': "D:20180817171437-04'00'",
 '/Producer': 'Adobe PDF Library 15.0',
 '/SourceModified': 'D:20180817211351',
 '/Title': 'I'}

In [10]:
read_pdf.getDocumentInfo()

{'/Author': 'Barlow, Andrew Jonathan',
 '/Company': 'Harvard University',
 '/CreationDate': "D:20180817171357-04'00'",
 '/Creator': 'Acrobat PDFMaker 18 for Word',
 '/ModDate': "D:20180817171437-04'00'",
 '/Producer': 'Adobe PDF Library 15.0',
 '/SourceModified': 'D:20180817211351',
 '/Title': 'I'}
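
The `/CreationDate` and `/ModDate` values above are PDF date strings, not Python dates. As a rough sketch, they could be converted with the standard library as follows. `parse_pdf_date` is a hypothetical helper written for this example, and it assumes the full `D:YYYYMMDDHHmmSS` form with an optional `+HH'mm'` or `-HH'mm'` offset, as seen in the output above; shorter forms and the `Z` suffix are not handled.

```python
import re
from datetime import datetime, timedelta, timezone

PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"  # YYYYMMDDHHmmSS
    r"(?:([+-])(\d{2})'(\d{2})'?)?"                  # optional offset such as -04'00'
)


def parse_pdf_date(value):
    """Convert a PDF date string to a datetime (timezone-aware if an offset is present)."""
    match = PDF_DATE.match(value)
    if match is None:
        raise ValueError(f"unrecognized PDF date: {value!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    sign, off_hours, off_minutes = match.groups()[6:]
    tz = None
    if sign is not None:
        offset = timedelta(hours=int(off_hours), minutes=int(off_minutes))
        tz = timezone(offset if sign == "+" else -offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


created = parse_pdf_date("D:20180817171357-04'00'")
print(created.isoformat())  # 2018-08-17T17:13:57-04:00
```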

In [13]:
read_pdf.numPages

8

In [17]:
# Call getNumPages() to get the number of pages (same value as the numPages attribute)
read_pdf.getNumPages()

8

In [18]:
read_pdf.getIsEncrypted()

False

In [24]:
read_pdf.getPage(0)

{'/ArtBox': [0, 0, 612, 792],
 '/BleedBox': [0, 0, 612, 792],
 '/Contents': [IndirectObject(387, 0),
  IndirectObject(388, 0),
  IndirectObject(389, 0),
  IndirectObject(390, 0),
  IndirectObject(391, 0),
  IndirectObject(392, 0),
  IndirectObject(393, 0),
  IndirectObject(394, 0)],
 '/CropBox': [0, 0, 612, 792],
 '/MediaBox': [0, 0, 612, 792],
 '/Parent': {'/Count': 8,
  '/Kids': [IndirectObject(374, 0),
   IndirectObject(350, 0),
   IndirectObject(301, 0),
   IndirectObject(1, 0),
   IndirectObject(6, 0),
   IndirectObject(9, 0),
   IndirectObject(12, 0),
   IndirectObject(15, 0)],
  '/Type': '/Pages'},
 '/Resources': {'/ExtGState': {'/GS0': {'/AIS': <PyPDF2.generic.BooleanObject at 0x29c800cf7b8>,
    '/BM': '/Normal',
    '/CA': 1,
    '/OP': <PyPDF2.generic.BooleanObject at 0x29c81dce630>,
    '/OPM': 1,
    '/SA': <PyPDF2.generic.BooleanObject at 0x29c81da5160>,
    '/SMask': '/None',
    '/Type': '/ExtGState',
    '/ca': 1,
    '/op': <PyPDF2.generic.BooleanObject at 0x29c81da59
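
The `/MediaBox` entry in the page dictionary above gives the page size in PDF points. A small worked conversion, assuming the default unit of 1/72 inch (that is, no `/UserUnit` override on the page), shows that this page is US Letter size. `box_size_inches` is a hypothetical helper for illustration.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def box_size_inches(box):
    """Return (width, height) in inches for an [x0, y0, x1, y1] box given in points."""
    x0, y0, x1, y1 = (float(v) for v in box)
    return ((x1 - x0) / POINTS_PER_INCH, (y1 - y0) / POINTS_PER_INCH)


print(box_size_inches([0, 0, 612, 792]))  # (8.5, 11.0)
```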

In [25]:
# Extract the text of a page (index 2 is the third page)
sample_page_text = read_pdf.getPage(2).extractText()

In [26]:
sample_page_text

'IS \nBUSINESS \nSCHOOL \nRIGHT FOR \nYOU?\n Graduates of \nMBA \nprograms can be found in almost any type of organization. Business school will \nprepare you to create or lead an organization, manage resources, develop effective operational \nstrategies, and more. \nOnce \nadmitted, r\nequired coursework typically include\ns: Organizational \nBehavior, Marketing, Accounting, Finance, Strategy\n, \nand Operations Management. This is followed \nby elective coursework that allows the student to customize their experience. Some students \n\ncon\nsider an MBA as essential for advancement to a management role while others will use it as a \n\nmeans to change careers. As an undergraduate student, it is unlikely that you will be admitted to \n\nenter directly into an MBA program without first working for a fe\nw years. This period of \n\nemployment will give you time to think about your long term goals and help you determine if a \n\ngraduate degree is appropriate.\n Informational \nmeetings\
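
The extracted text is full of line breaks that do not correspond to paragraphs. One simple way to tidy it for reading or searching is to collapse every run of whitespace into a single space, sketched below with a hypothetical helper named `collapse_whitespace`. Note the limitation: words that were split in the middle (such as `r\nequired` above) end up with a stray space rather than being rejoined.

```python
def collapse_whitespace(text):
    """Replace every run of spaces and newlines with a single space."""
    return " ".join(text.split())


# A short excerpt of the output above
sample = "IS \nBUSINESS \nSCHOOL \nRIGHT FOR \nYOU?\n Graduates of \nMBA \nprograms"
print(collapse_whitespace(sample))
# IS BUSINESS SCHOOL RIGHT FOR YOU? Graduates of MBA programs
```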

In [4]:
f.close()

# 2.) Copy a single page into a newly created PDF

## 2.1) Read a page

In [5]:
# Read a sample page
f = open('Harvard_Business_School.pdf', 'rb')
read_pdf = PdfFileReader(f)
sample_page = read_pdf.getPage(4)

## 2.2) Add the page to a writer object

In [6]:
# Create a writer object
write_pdf = PdfFileWriter()
write_pdf.addPage(sample_page)

## 2.3) Create a new PDF and write the page to it

In [7]:
# Open a new file for writing in binary mode and write the PDF to it
pdf_output = open('Test_01.pdf', 'wb')
write_pdf.write(pdf_output)


pdf_output.close() # Close the new file
f.close() # close the source file
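
The three steps above can also be wrapped in a single function that uses `with` blocks, so both files are closed even if an error occurs part-way through. This is a sketch, not the notebook's original code: `copy_page` is a hypothetical helper, and it assumes a PyPDF2 release before 3.0 so that the `PdfFileReader` and `PdfFileWriter` names exist. The import sits inside the function only so that defining the function does not require PyPDF2.

```python
def copy_page(src_path, dst_path, page_index):
    """Copy one page (zero-based index) from src_path into a new PDF at dst_path."""
    # Legacy (pre-3.0) class names, matching the rest of this notebook
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        reader = PdfFileReader(src)
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(page_index))
        # The source file is still open here, as it was in the cells above
        writer.write(dst)


# Example call, equivalent to sections 2.1 to 2.3:
# copy_page('Harvard_Business_School.pdf', 'Test_01.pdf', 4)
```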

# 3.) Rotate PDFs and write them to a new PDF

In [8]:
# Open the Source file
f = open('Harvard_Business_School.pdf', 'rb')
read_pdf = PdfFileReader(f)


In [9]:
# Create a writer object
write_pdf = PdfFileWriter()

In [10]:

# Read a page and rotate it
rotated_page = read_pdf.getPage(2).rotateClockwise(90)

In [11]:

# Add the rotated page to the writer object
write_pdf.addPage(rotated_page)

In [12]:

# Write the contents of the writer object to a new file
pdf_output = open('Harvard_New_rotated.pdf', 'wb')
write_pdf.write(pdf_output)

In [13]:
# Close the new file and the source file
pdf_output.close()
f.close() 
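
Page rotation in a PDF works in quarter turns, so the angle passed to `rotateClockwise` should be a multiple of 90. The following sketch validates an angle before it is used; `normalize_rotation` is a hypothetical helper, not a PyPDF2 function.

```python
def normalize_rotation(angle):
    """Map a clockwise angle in degrees onto 0, 90, 180 or 270."""
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return angle % 360


print(normalize_rotation(450))  # 90
print(normalize_rotation(-90))  # 270
```

The result could then be passed on, for example `read_pdf.getPage(2).rotateClockwise(normalize_rotation(450))`.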


# 4.) Read multiple pages

In [23]:
f = open('Harvard_Business_School.pdf', 'rb')
read_pdf = PdfFileReader(f)




In [24]:
num_pages = read_pdf.numPages # Count the pages and store the result
num_pages

8

In [25]:
pdf_text_all = [] # Create an empty list to hold the data

In [26]:
# Extract the text of each page and append it to the list
for page in range(num_pages): # same as range(8)
    one_page_text = read_pdf.getPage(page).extractText()
    pdf_text_all.append(one_page_text)

In [27]:
pdf_text_all # The text of every page, collected in a list

['Undergraduate Resource Series\nO˜ce of Career Services | 54 Dunster Street     \nHarvard University | Faculty of Arts and Sciences | 617.495.2595\nwww.ocs.fas.harvard.edu\nOCSAPPLYING  TO \nBUSINESS  SCHOOLPhoto: Harvard University News O˜ce\n',
 '© 201 President and Fellows of Harvard CollegeAll rights reserved.\nNo part \nof this publication may be reproduc\ned in any wa\ny without the express written \npermission of the Harvard University Faculty of Arts & Sciences Office of Career Services.08/1O˜ce of Career Services\nHarvard University\n\nFaculty of Arts & Sciences\n\nCambridge, MA 02138\n\nPhone: (617) 495-2595\n\nwww.ocs.fas.harvard.edu\n',
 'IS \nBUSINESS \nSCHOOL \nRIGHT FOR \nYOU?\n Graduates of \nMBA \nprograms can be found in almost any type of organization. Business school will \nprepare you to create or lead an organization, manage resources, develop effective operational \nstrategies, and more. \nOnce \nadmitted, r\nequired coursework typically include\ns: Organization
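
With one string per page, the list can be searched to find which pages mention a term. The sketch below uses a hypothetical helper, `find_pages`, and a shortened stand-in for `pdf_text_all`. Terms that the extraction split across a line break will not match.

```python
def find_pages(pages, keyword):
    """Return 1-based page numbers whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [number for number, text in enumerate(pages, start=1)
            if needle in text.lower()]


# Shortened stand-in for pdf_text_all; real entries would come from extractText()
sample_pages = [
    "Undergraduate Resource Series\nHarvard University",
    "All rights reserved.\nHarvard University",
    "IS \nBUSINESS \nSCHOOL \nRIGHT FOR \nYOU?",
]
print(find_pages(sample_pages, "harvard"))   # [1, 2]
print(find_pages(sample_pages, "business"))  # [3]
```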

# 5.) Merge two PDFs

In [28]:
from PyPDF2 import PdfFileWriter, PdfFileReader

In [29]:
# Open watermark
f = open('watermark_conf_2.pdf', 'rb')
read_watermark = PdfFileReader(f)
watermark_page = read_watermark.getPage(0)

In [30]:
# Open file to be watermarked
f = open('Harvard_Business_School.pdf', 'rb')
read_pdf = PdfFileReader(f)

In [31]:
# Create a writer object
write_pdf = PdfFileWriter()

In [32]:
# Watermark every page: mergePage() draws the watermark onto the page in place
num_pages = read_pdf.getNumPages()
for page in range(num_pages):
    single_page = read_pdf.getPage(page)
    single_page.mergePage(watermark_page)
    write_pdf.addPage(single_page)

In [33]:
# Write the contents of the writer object to a new file
pdf_output = open('Harvard_watermarked.pdf', 'wb')
write_pdf.write(pdf_output)

In [34]:
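pdf_output.close() # Close the new file
f.close() # Close the source file
read_watermark.stream.close() # Close the watermark file; its handle was overwritten when f was reused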