### Working with Portable Document Format (.pdf) files

#### Preamble
1. Not all PDF files contain text that can be extracted.
2. We'll use the `PyPDF2` library to read text data from a PDF file.
3. Some PDF files are scanned documents rather than exports from a text editor (like MS Word).
4. Scanned documents are essentially image files and often require specialized software (OCR) for text extraction.
5. `PyPDF2` does not perform OCR: it can extract text from a scanned PDF only if the file already contains a text layer, so extraction from scans is NOT GUARANTEED!

In [1]:
# Install the PyPDF2 library in your environment (uncomment the next line to run it)
# !pip install PyPDF2

In [2]:
# import the library
import PyPDF2 as PDF

In [3]:
# Open the PDF in binary read mode ('rb')
myfile = open('odyssey.pdf', mode='rb')

In [4]:
# Create a PdfReader object to read the PDF file
pdf_reader = PDF.PdfReader(myfile)

In [5]:
# Show the number of pages
len(pdf_reader.pages)

215
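
Since not every PDF contains extractable text (see the preamble), it can help to check which pages yield little or no text before relying on the output. The following is a minimal sketch, not part of the original notebook: the helper name `pages_without_text` and the `min_chars` threshold are assumptions chosen for illustration, and the function only relies on the `reader.pages` and `page.extract_text()` calls used in this notebook.

```python
def pages_without_text(reader, min_chars=20):
    """Return the 1-based numbers of pages with fewer than min_chars extracted characters."""
    flagged = []
    for number, page in enumerate(reader.pages, start=1):
        # Image-only (scanned) pages typically yield an empty string;
        # `or ""` also guards against a None result.
        text = page.extract_text() or ""
        if len(text.strip()) < min_chars:
            flagged.append(number)
    return flagged

# Hypothetical usage with the reader opened above:
# pages_without_text(pdf_reader)
```

An empty result suggests every page has a text layer; a long list of flagged pages hints that the file is a scan and needs OCR software instead. The threshold is arbitrary and may need tuning for documents with nearly blank pages.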

In [6]:
# Get page 50 from the PDF (pages are zero-indexed, so page 50 is at index 49)
page50 = pdf_reader.pages[49]

In [7]:
# Extract the text from the page
page50.extract_text()

'file:///C/ ...20-%20NLP%20with%20Deep%20Learning/ASSIGNMENTS/ASSIGNMENT_01/assignment1/data/books/mythology/odyssey.txt [1/18/2026 3:54:32 PM]covered with thick leaves to hide his nakedness. He looked like some\nlion of the wilderness that stalks about exulting in his strength and\ndefying both wind and rain; his eyes glare as he prowls in quest ofoxen, sheep, or deer, for he is famished, and will dare break even intoa well fenced homestead, trying to get at the sheep—even such didUlysses seem to the young women, as he drew near to them all naked ashe was, for he was in great want. On seeing one so unkempt and sobegrimed with salt water, the others scampered off along the spits thatjutted out into the sea, but the daughter of Alcinous stood firm, forMinerva put courage into her heart and took away all fear from her. Shestood right in front of Ulysses, and he doubted whether he should go upto her, throw himself at her feet, and embrace her knees as asuppliant, or stay where he was and 

In [8]:
# Print the text so that newline characters are rendered as line breaks
print(page50.extract_text())

file:///C/ ...20-%20NLP%20with%20Deep%20Learning/ASSIGNMENTS/ASSIGNMENT_01/assignment1/data/books/mythology/odyssey.txt [1/18/2026 3:54:32 PM]covered with thick leaves to hide his nakedness. He looked like some
lion of the wilderness that stalks about exulting in his strength and
defying both wind and rain; his eyes glare as he prowls in quest ofoxen, sheep, or deer, for he is famished, and will dare break even intoa well fenced homestead, trying to get at the sheep—even such didUlysses seem to the young women, as he drew near to them all naked ashe was, for he was in great want. On seeing one so unkempt and sobegrimed with salt water, the others scampered off along the spits thatjutted out into the sea, but the daughter of Alcinous stood firm, forMinerva put courage into her heart and took away all fear from her. Shestood right in front of Ulysses, and he doubted whether he should go upto her, throw himself at her feet, and embrace her knees as asuppliant, or stay where he was and ent
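
The extracted text above starts with a print header (a `file://` path and a timestamp in square brackets) fused directly onto the first word of the page. One way to drop it is a regular expression, sketched below. The pattern is an assumption based only on the single header visible in this output, and `strip_print_header` is a hypothetical helper name; other pages or other PDFs may need a different pattern. It does not repair the missing spaces between joined lines (such as `ofoxen`), which are a separate extraction issue.

```python
import re

# Matches a leading "file://... [M/D/YYYY H:MM:SS AM|PM]" header on the first line only.
HEADER_RE = re.compile(
    r"^file://.*?\[\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} (?:AM|PM)\]\s*"
)

def strip_print_header(text):
    """Remove one leading print header, if present; otherwise return text unchanged."""
    return HEADER_RE.sub("", text, count=1)

# Hypothetical usage with the page extracted above:
# print(strip_print_header(page50.extract_text()))
```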

In [9]:
# Remember to close the file when you are done with it
myfile.close()
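
Forgetting `close()` is easy, so a `with` block is a common alternative: Python closes the file automatically when the block ends, even if an error occurs. The sketch below wraps the open, read, and extract steps from the cells above in one function. The name `read_page_text` and the `reader_cls` parameter are assumptions added for illustration; `reader_cls` defaults to `PyPDF2.PdfReader` and exists only so the function can be tried out with a stand-in reader.

```python
def read_page_text(path, page_index, reader_cls=None):
    """Open path, extract the text of one page, and close the file automatically."""
    if reader_cls is None:
        # Imported lazily so PyPDF2 is only needed when the default reader is used
        from PyPDF2 import PdfReader as reader_cls
    with open(path, "rb") as fh:
        reader = reader_cls(fh)
        # Extract inside the `with` block, while the file is still open
        return reader.pages[page_index].extract_text() or ""

# Hypothetical usage, mirroring the cells above:
# text = read_page_text("odyssey.pdf", 49)
```

Extracting the text inside the block is deliberate: it is the safe choice, because the reader is handed an open file object and may still need it when a page is accessed.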

### Edit the PDF
#### - CREATE A NEW PDF


In [10]:
# import the library
import PyPDF2 as PDF

In [11]:
# Open the source PDF and print the text of its first page
f = open('MyPDF.pdf', 'rb')
pdf_reader = PDF.PdfReader(f)
first_page = pdf_reader.pages[0]
print(first_page.extract_text())

This is a Portable Document File, or as known as , PDF.  
PDF files are  versatile digital documents created by Adobe that preserve formatting 
(text, images, layout) across different devices and software, ensuring consistent 
viewing, and can contain interactive elements like links, forms, audio, and video, making 
them  ideal for sharing reports, invoices, eBooks, and forms.  
It is standardized as  ISO 32000  file format in 1993 by Adobe.  
The development of PDF began in 1991 when John Warnock wrote a paper for a project 
then code -named Camelot, in which he proposed the creation of a simplified version of 
PostScript called Interchange PostScript (IPS).   
Unlike traditional PostScript, which was tightly focused on rendering print jobs to output 
devices, IPS would be optimized for displaying pages to any screen and any platform.  
A PDF file is often a combination of  vector graphics , text, and  bitmap graphics . The 
basic types of content in a PDF are:  
• Typeset text stored

In [12]:
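# Create a PdfWriter object to assemble the new PDF
pdf_writer = PDF.PdfWriter()

In [13]:
# Add the first page to the writer (the cell output below is the page's dictionary)
pdf_writer.add_page(first_page)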

{'/Contents': [IndirectObject(5, 0, 1930999924912),
  IndirectObject(6, 0, 1930999924912),
  IndirectObject(7, 0, 1930999924912),
  IndirectObject(8, 0, 1930999924912),
  IndirectObject(9, 0, 1930999924912),
  IndirectObject(10, 0, 1930999924912),
  IndirectObject(11, 0, 1930999924912),
  IndirectObject(12, 0, 1930999924912)],
 '/CropBox': [0, 0, 612, 792],
 '/MediaBox': [0, 0, 612, 792],
 '/Resources': {'/Font': {'/C2_0': {'/BaseFont': '/GATERB+SymbolMT',
    '/DescendantFonts': [IndirectObject(15, 0, 1930999924912)],
    '/Encoding': '/Identity-H',
    '/Subtype': '/Type0',
    '/ToUnicode': {'/Filter': '/FlateDecode'},
    '/Type': '/Font'},
   '/TT0': {'/BaseFont': '/CHXLNJ+ArialMT',
    '/Encoding': '/WinAnsiEncoding',
    '/FirstChar': 32,
    '/FontDescriptor': {'/Ascent': 1040,
     '/CapHeight': 716,
     '/Descent': -325,
     '/Flags': 32,
     '/FontBBox': [-665, -325, 2000, 1040],
     '/FontFamily': 'Arial',
     '/FontFile2': {'/Filter': '/FlateDecode', '/Length1': 81127

In [14]:
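# Open the output file in binary write mode ('wb')
pdf_output = open('MY_BRAND_NEW_PDF.pdf', 'wb')

In [15]:
# Write the pages held by the writer to the output file
pdf_writer.write(pdf_output)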

(False, <_io.BufferedWriter name='MY_BRAND_NEW_PDF.pdf'>)

In [16]:
# Close both the output file and the source file
pdf_output.close()
f.close()

In [17]:
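# Re-open the new file to confirm that it contains exactly one page
brand_new = open('MY_BRAND_NEW_PDF.pdf', 'rb')
pdf_reader = PDF.PdfReader(brand_new)
n_pages = len(pdf_reader.pages)
brand_new.close()
n_pages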

1
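
The same `add_page` pattern extends to several pages at once. The sketch below copies a chosen list of pages from a reader into a writer; `copy_pages` is a hypothetical helper name, and it relies only on `reader.pages` and `writer.add_page`, the two calls already used above. The file name `excerpt.pdf` in the usage comment is likewise made up for illustration.

```python
def copy_pages(reader, writer, page_indices):
    """Add the pages at the given zero-based indices to writer, in order.

    Returns the number of pages added. All indices are validated first,
    so nothing is added if any index is out of range.
    """
    page_indices = list(page_indices)
    total = len(reader.pages)
    for index in page_indices:
        if not 0 <= index < total:
            raise IndexError(f"page index {index} out of range for {total} pages")
    for index in page_indices:
        writer.add_page(reader.pages[index])
    return len(page_indices)

# Hypothetical usage:
# with open('odyssey.pdf', 'rb') as src, open('excerpt.pdf', 'wb') as dst:
#     writer = PDF.PdfWriter()
#     copy_pages(PDF.PdfReader(src), writer, [0, 49])
#     writer.write(dst)
```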

#### - GRAB ALL TEXT FROM A PDF

In [18]:
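# Open the PDF and collect the text of every page in a list
f = open('MyPDF.pdf', 'rb')

pdf_reader = PDF.PdfReader(f)
pdf_txt = []

# Iterate over the pages directly; extract the text while the file is still open
for page in pdf_reader.pages:
    pdf_txt.append(page.extract_text())
f.close()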

In [19]:
len(pdf_txt)

1

In [20]:
# Print the text of each page, separated by blank lines
for page in pdf_txt:
    print(f"\n{page}\n\n")


This is a Portable Document File, or as known as , PDF.  
PDF files are  versatile digital documents created by Adobe that preserve formatting 
(text, images, layout) across different devices and software, ensuring consistent 
viewing, and can contain interactive elements like links, forms, audio, and video, making 
them  ideal for sharing reports, invoices, eBooks, and forms.  
It is standardized as  ISO 32000  file format in 1993 by Adobe.  
The development of PDF began in 1991 when John Warnock wrote a paper for a project 
then code -named Camelot, in which he proposed the creation of a simplified version of 
PostScript called Interchange PostScript (IPS).   
Unlike traditional PostScript, which was tightly focused on rendering print jobs to output 
devices, IPS would be optimized for displaying pages to any screen and any platform.  
A PDF file is often a combination of  vector graphics , text, and  bitmap graphics . The 
basic types of content in a PDF are:  
• Typeset text store
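
Once the text of every page is in a list such as `pdf_txt`, a natural next step is to save it as a plain-text file. The sketch below writes one labeled block per page; the helper name `save_pages_as_text`, the separator format, and the output file name in the usage comment are all assumptions for illustration.

```python
def save_pages_as_text(page_texts, out_path):
    """Write each page's text to out_path (UTF-8), preceded by a page separator."""
    with open(out_path, "w", encoding="utf-8") as out:
        for number, text in enumerate(page_texts, start=1):
            out.write(f"--- page {number} ---\n")
            # Treat a missing (None) page text as empty and trim trailing whitespace
            out.write((text or "").rstrip() + "\n\n")
    return out_path

# Hypothetical usage with the list built above:
# save_pages_as_text(pdf_txt, "MyPDF.txt")
```

Writing with an explicit `encoding="utf-8"` keeps characters such as the bullet (`•`) in the output above intact regardless of the platform's default encoding.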