# Working with PDF Files


You will often need to work with PDF files. Python has many libraries for this, each with its own pros and cons; the most common one is PyPDF2. You can install it with the command below (the package name is case-sensitive, so make sure your capitalization matches):

pip install PyPDF2

Keep in mind that not every PDF file can be read with this library. PDFs that are too blurry, use a special encoding, are encrypted, or were created with a program that doesn't work well with PyPDF2 may not be readable. If you find yourself in this situation, try one of the other Python PDF libraries, but keep in mind that these may not work either. The reason is that the PDF format has many parameters and its settings can be highly non-standard; for example, text may be stored as an image rather than as encoded characters.

As far as PyPDF2 is concerned, it can only read the text from a PDF document; it won't be able to grab images or other media files from a PDF.

# Working with PyPDF2
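Let's begin with the basics of the PyPDF2 library. The examples below use the method names from PyPDF2 releases before 3.0 (PdfFileReader, getPage, extractText, and so on); PyPDF2 3.0 removed these names in favor of PdfReader, reader.pages, and extract_text.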

In [1]:
# note the capitalization
import PyPDF2

# Reading PDFs

In [2]:
# Note that we open the file in binary mode with 'rb'
f = open('The_Constitution_of_India.pdf','rb')

In [3]:
pdf_reader = PyPDF2.PdfFileReader(f)

In [4]:
pdf_reader.getNumPages()


256

In [5]:
# Page indices start at 0, so getPage(3) returns the fourth page of the file
page1 = pdf_reader.getPage(3)

In [6]:
page1

{'/Type': '/Page',
 '/Parent': {'/Type': '/Pages',
  '/Count': 256,
  '/Kids': [IndirectObject(3, 0),
   IndirectObject(16, 0),
   IndirectObject(25, 0),
   IndirectObject(28, 0),
   IndirectObject(30, 0),
   IndirectObject(32, 0),
   IndirectObject(44, 0),
   IndirectObject(48, 0),
   IndirectObject(52, 0),
   IndirectObject(54, 0),
   IndirectObject(56, 0),
   IndirectObject(58, 0),
   IndirectObject(60, 0),
   IndirectObject(62, 0),
   IndirectObject(64, 0),
   IndirectObject(66, 0),
   IndirectObject(68, 0),
   IndirectObject(70, 0),
   IndirectObject(72, 0),
   IndirectObject(74, 0),
   IndirectObject(81, 0),
   IndirectObject(83, 0),
   IndirectObject(85, 0),
   IndirectObject(87, 0),
   IndirectObject(89, 0),
   IndirectObject(91, 0),
   IndirectObject(93, 0),
   IndirectObject(95, 0),
   IndirectObject(97, 0),
   IndirectObject(99, 0),
   IndirectObject(101, 0),
   IndirectObject(103, 0),
   IndirectObject(105, 0),
   IndirectObject(107, 0),
   IndirectObject(109, 0),
   Indire

In [7]:
page_one_text = page1.extractText()

In [8]:
page_one_text

'4\n \nTHE CONSTITUTION OF INDIA\n \n____________\n \nCONTENTS\n \n__________\n__\n \n \n \nPREAMBLE\n \nPART I\n \nTHE UNION AND ITS TERRITORY\n \nARTICLES\n \n1.\n \nName and territory of the Union.\n \n2.\n \nAdmission or establishment of new States.\n \n2A.   [\nOmitted\n.\n]\n \n3.\n \nFormation of new\n \nStates and alteration of areas, boundaries or names of existing  States.\n \n4.\n \nLaws made under articles 2 and 3 to provide for the amendment of the First and the Fourth \nSchedules and supplemental, incidental and consequential matters.\n \nPART II\n \nCITIZENSHIP\n \n5.\n \nCitizenship at the commencement of the Constitution.\n \n6.\n \nRights of citizenship of certain persons who have migrated to India from Pakistan.\n \n7.\n \nRights of citizenship of certain migrants to Pakistan.\n \n8.\n \nRights of citizenship of certain persons of Indian origi\nn residing outside India.\n \n9.\n \nPersons voluntarily acquiring citizenship of a foreign State not to be citizens.\n \n10

In [9]:
print(page_one_text)

4
 
THE CONSTITUTION OF INDIA
 
____________
 
CONTENTS
 
__________
__
 
 
 
PREAMBLE
 
PART I
 
THE UNION AND ITS TERRITORY
 
ARTICLES
 
1.
 
Name and territory of the Union.
 
2.
 
Admission or establishment of new States.
 
2A.   [
Omitted
.
]
 
3.
 
Formation of new
 
States and alteration of areas, boundaries or names of existing  States.
 
4.
 
Laws made under articles 2 and 3 to provide for the amendment of the First and the Fourth 
Schedules and supplemental, incidental and consequential matters.
 
PART II
 
CITIZENSHIP
 
5.
 
Citizenship at the commencement of the Constitution.
 
6.
 
Rights of citizenship of certain persons who have migrated to India from Pakistan.
 
7.
 
Rights of citizenship of certain migrants to Pakistan.
 
8.
 
Rights of citizenship of certain persons of Indian origi
n residing outside India.
 
9.
 
Persons voluntarily acquiring citizenship of a foreign State not to be citizens.
 
10.
 
Continuance of the rights of citizenship.
 
11.
 
Parliament to reg

In [10]:
f.close()
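
The extracted text above is littered with stray line breaks. As a rough sketch (not part of PyPDF2), the hypothetical helper `tidy_extracted_text` below treats a line containing only a space as a real line break and joins everything else back together. That rule is an assumption based on the pattern this particular file seems to follow; other PDFs may need different rules.

```python
import re


def tidy_extracted_text(raw):
    """Heuristically rejoin text that was split across spurious newlines."""
    # Assumption: in this sample, "\n \n" separates real lines,
    # while a bare "\n" is an artifact inside a line or word.
    chunks = re.split(r"\n \n", raw)
    lines = []
    for chunk in chunks:
        joined = chunk.replace("\n", "")                 # drop spurious breaks
        joined = re.sub(r" {2,}", " ", joined).strip()   # collapse runs of spaces
        if joined:                                       # skip empty chunks
            lines.append(joined)
    return "\n".join(lines)


# A short excerpt in the same shape as page_one_text
sample = "8.\n \nRights of citizenship of certain persons of Indian origi\nn residing outside India.\n \n"
print(tidy_extracted_text(sample))
```

Because the rule simply deletes bare newlines, it can also glue together two words that were separated only by a line break, so the result is worth checking by eye.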

# Adding to PDFs

PyPDF2 cannot write arbitrary text into a PDF, because a Python string carries none of the fonts, placements, and other layout parameters that PDF content requires.

What we can do is copy pages and append pages to the end.

In [11]:
f = open('The_Constitution_of_India.pdf','rb')

In [12]:
pdf_reader = PyPDF2.PdfFileReader(f)

In [13]:
wanted_page = pdf_reader.getPage(3)

In [14]:
pdf_writer = PyPDF2.PdfFileWriter()

In [15]:
pdf_writer.addPage(wanted_page)

In [16]:
pdf_output = open("Some_New_Doc.pdf","wb")

In [17]:
pdf_writer.write(pdf_output)

In [18]:
pdf_output.close()
f.close()

Now we have copied a page and added it to a new document!
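
The same copy can also be written with `with` blocks, so that both files are closed even if an error occurs partway through. This is a minimal sketch that assumes a PyPDF2 release older than 3.0, where `PdfFileReader`, `PdfFileWriter`, `getPage`, and `addPage` are available. The function name `copy_pages` and its parameters are made up for illustration.

```python
def copy_pages(src_path, dst_path, page_indices):
    """Copy the given zero-based pages from src_path into a new PDF at dst_path."""
    # Imported inside the function so this definition loads even if
    # PyPDF2 is not installed; assumes the pre-3.0 method names.
    import PyPDF2

    with open(src_path, "rb") as src:
        reader = PyPDF2.PdfFileReader(src)
        writer = PyPDF2.PdfFileWriter()
        for index in page_indices:
            writer.addPage(reader.getPage(index))
        # Write while the source is still open: the writer still needs
        # to read page data from the source file at this point.
        with open(dst_path, "wb") as dst:
            writer.write(dst)
```

Calling `copy_pages('The_Constitution_of_India.pdf', 'Some_New_Doc.pdf', [3])` should reproduce the cells above.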

# Simple Example

Let's try to grab all the text from this PDF file:

In [19]:
f = open('The_Constitution_of_India.pdf','rb')

In [20]:
# List of every page's text.
# The index will correspond to the page number.
pdf_text = [0]  # zero is a placeholder so that page 1 is at index 1

In [21]:
pdf_reader = PyPDF2.PdfFileReader(f)

for p in range(pdf_reader.numPages):
    page = pdf_reader.getPage(p)
    pdf_text.append(page.extractText())

f.close()
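
An alternative to the placeholder trick is a dictionary keyed by page number. The sketch below works on any list of per-page strings, such as the values returned by `extractText()`. The names `number_pages` and `pages_without_text` are illustrative, not PyPDF2 functions. The second helper flags pages that produced no text, which can happen when a page stores its text as an image.

```python
def number_pages(page_texts):
    """Map 1-based page numbers to each page's extracted text."""
    return {number: text for number, text in enumerate(page_texts, start=1)}


def pages_without_text(pages):
    """Return the page numbers whose text is empty or only whitespace."""
    return [number for number, text in pages.items() if not text.strip()]


# Stand-in strings; in practice these would come from page.extractText()
pages = number_pages(["THE CONSTITUTION OF INDIA", " \n \n", "PREFACE"])
print(pages[1])
print(pages_without_text(pages))
```

With a dictionary there is no dummy entry at index 0, so `len(pages)` matches the real page count.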

In [22]:
pdf_text

[0,
 ' \n \n \n \n \nTHE CONSTITUTION OF INDIA\n \n[\nAs on \n9\nth\n \nDecem\nber\n, \n2020\n]\n \n \n \n \n \n \n2020\n \n \n \nGOVERNMENT OF INDIA\n \nMINISTRY OF LAW AND JUSTICE\n \nLEGISLATIVE DEPARTMENT\n',
 ' \nLIST OF ABBREVIATIONS USED\n \n \n \n \nArt., arts. \n \n \n \n \n \n \nfor\n \nArticle, articles.\n \nCl., cls.    \n \n \n \n \n\n   \nClause, clauses.\n \nC.O.         \n \n \n \n \n\n   \nConstitution Order.\n \nIns.           \n \n \n \n \n\n    \nInserted.\n \nP., pp.      \n \n \n \n \n\n    \nPage, pages.\n \nPt.            \n \n \n \n \n\n    \nPart.\n \nRep.         \n \n \n \n \n\n    \nRepealed.\n \nS., ss.    \n \n \n \n \n \n\n    \nSection, sections.\n \nSch.         \n \n \n \n \n \n\n    \nSchedule.\n \nSubs.        \n \n \n \n \n\n    \nSubstituted.\n \nw.e.f.       \n \n \n \n \n \n\n    \nwith effect from.\n \n \n \n',
 '3\n \n \nPREFACE\n \n \n \nTh\nis edition of the\n \nConstitution of India \nreproduces the text of the Constitution of India as \nam

In [23]:
print(pdf_text[3])

3
 
 
PREFACE
 
 
 
Th
is edition of the
 
Constitution of India 
reproduces the text of the Constitution of India as 
amended 
by Parliament 
from time to time.  All amendments made by Parliament up to and 
including the Constitution (One Hundred and 
Fourth 
Amendment) Act,
 
2019 are incor
porat
ed in 
this edition.  The foo
t
 
notes below the text indicate the 
C
onstitution 
A
mendment Acts by which 
such amendments have been made.
 
The Constitution (Application to Jammu and Kashmir) Order, 
2019 
has been 
provided
 
in 
A
PPENDIX 
-
 
I for r
eference.  
 
The text of the constitutional amendments relating to the Constitution (Forty
-
fourth 
Amendment) Act, 1978 and the Constitution (Eighty
-
eighth Amendment) Act, 2003, which have 
not yet come into force, have been provided in the text at the appropriate places
 
or otherwise in 
   
the footnote.
  
The text of these amendments have been provided in 
A
PPENDIX
-
I
I and 
  
A
PPENDIX 
-
 
I
II for reference.
 
The Constitu