# PyPDF PdfReader Class

Alejandro Ricciardi (Omegapy)  
created date: 01/10/2024   
[GitHub](https://github.com/Omegapy)  

Credit: 
[Control PDF with Python & PyPDF2](https://www.udemy.com/course/control-pdf-with-python-pypdf2) Udemy - Conny Soderholm
The original code was substantially modified to meet my requirements and to add functionality to the program.

Project Description:  
This project uses the pypdf [PdfReader](https://pypdf.readthedocs.io/en/stable/modules/PdfReader.html?highlight=PdfReader) class to read from a PDF file.  

Class:
```
class pypdf.PdfReader(stream: Union[str, IO[Any], Path], strict: bool = False, password: Union[None, str, bytes] = None)
```

Parameters
- stream – A File object or an object that supports the standard read and seek methods similar to a File object. Could also be a string representing a path to a PDF file.

- strict – Determines whether the user should be warned of all problems and also causes some correctable problems to be fatal. Defaults to False.

- password – Decrypt PDF file at initialization. If the password is None, the file will not be decrypted. Defaults to None.

Project map:
- Load and Read a PDF 
    - First method -``` with open(file_name, "rb") as pdf_file: ```-
    - Second method -```pdf_reader = PdfReader("docs/WorkStation.pdf")```-
- PDF files Metadata -```doc_info = pdf_reader.metadata```
    - Metadata (unencrypted)
    - Encrypted Metadata
- Fix for encrypted Metadata -```writer.append_pages_from_reader(pdf_reader)```-
- PDF Fields -```fields = pdf_reader.get_fields()```-
- Page Document Layout -```pdf_reader.page_layout```
- Page Mode (Document outline) -```pdf_reader.page_mode```-
- XMP Metadata -```xmp = pdf_reader.xmp_metadata```-
- Get Document Pages -```page1 = pdf_reader.pages[0]```-
- Get Text From Page - full page and lines -```full_page_text = page.extract_text()``` and -```for number, line in enumerate(full_page_text.splitlines()):```-


In [29]:
from pypdf import PdfReader 

### Load and Read a PDF

First method:

In [30]:
file_name = "docs/WorkStation.pdf"

# Load the PDF into a PdfReader object
with open(file_name, "rb") as pdf_file:  # the with statement closes the file automatically; "rb" opens it in binary read mode
    pdf_reader = PdfReader(pdf_file)  # PdfReader object
    print(f"The number of pages in the PDF file is: {len(pdf_reader.pages)}\n")  # Number of pages
    print(pdf_reader)
# The file is closed here

The number of pages in the PDF file is: 1

<pypdf._reader.PdfReader object at 0x000002E6D609FCE0>


Second method (recommended when using pypdf):

In [31]:
pdf_reader = PdfReader("docs/WorkStation.pdf")  # PdfReader object
print(f"The number of pages in the PDF file is: {len(pdf_reader.pages)}\n")  # Number of pages

The number of pages in the PDF file is: 1


### PDF files Metadata

##### Metadata (unencrypted)

In [32]:
pdf_reader = PdfReader("docs/WorkStation.pdf")

doc_info = pdf_reader.metadata
print(type(doc_info))

for info in doc_info:
    print(info, doc_info[info])

<class 'pypdf._reader.DocumentInformation'>
/Author Alex Ricciardi
/Creator Microsoft® Word for Microsoft 365
/CreationDate D:20240110104403-07'00'
/ModDate D:20240110104403-07'00'
/Producer Microsoft® Word for Microsoft 365

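The `/CreationDate` and `/ModDate` values above are PDF date strings (`D:YYYYMMDDHHmmSS` followed by a UTC offset). The sketch below shows one way to turn such a string into a Python `datetime`. `parse_pdf_date` is a hypothetical helper written for this example, not a pypdf function, and it only handles the full form shown in the output above; treating a missing offset as UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone
import re

# Hypothetical helper (not part of pypdf): matches strings such as
# "D:20240110104403-07'00'" (14-digit timestamp plus optional UTC offset).
_PDF_DATE = re.compile(
    r"^D:(?P<stamp>\d{14})"
    r"(?:(?P<sign>[+\-Z])(?:(?P<hh>\d{2})'?(?:(?P<mm>\d{2})'?)?)?)?$"
)


def parse_pdf_date(raw: str) -> datetime:
    """Convert a PDF date string into a timezone-aware datetime."""
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        raise ValueError(f"Unsupported PDF date string: {raw!r}")
    naive = datetime.strptime(match["stamp"], "%Y%m%d%H%M%S")
    sign = match["sign"]
    if sign in (None, "Z"):
        # Assumption: a missing offset (or "Z") is treated as UTC
        return naive.replace(tzinfo=timezone.utc)
    offset = timedelta(hours=int(match["hh"] or 0), minutes=int(match["mm"] or 0))
    if sign == "-":
        offset = -offset
    return naive.replace(tzinfo=timezone(offset))


created = parse_pdf_date("D:20240110104403-07'00'")
print(created.isoformat())  # 2024-01-10T10:44:03-07:00
```

With a `PdfReader`, the raw string would come from `doc_info["/CreationDate"]`, as printed in the cell above.
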

##### Encrypted Metadata

More complex PDF files may require additional dependencies to read the metadata.

For example, the following Kyocera User Guide requires [cryptography>=3.1](https://pypi.org/project/cryptography/3.1/) because its data is encrypted.

In [33]:
!pipenv install cryptography

Installing cryptography...
Resolving cryptography...
Installation Succeeded

Installing dependencies from Pipfile.lock (5657a6)...
To activate this project's virtualenv, run pipenv shell.
Alternatively, run a command inside the virtualenv with pipenv run.




In [34]:
pdf_reader = PdfReader("docs/2554ci Operation Guide.pdf")

doc_info = pdf_reader.metadata

for info in doc_info:
    print(info, doc_info[info])

/CreationDate D:20231116104630+09'00'
/ModDate D:20231116104630+09'00'


### Fix for encrypted Metadata
This is not a real fix: encrypted metadata cannot be retrieved without proper authorization or decryption keys.
Nonetheless, a PDF file can be rewritten without the original encrypted metadata, and new metadata can be assigned to it.

In [35]:
from pypdf import  PdfWriter

pdf_reader = PdfReader("docs/2554ci Operation Guide.pdf")

# Option 1: add the pages one by one
# pdf_reader.pages is a list-like sequence of page objects
writer = PdfWriter()  # writer object
for page in pdf_reader.pages:
    writer.add_page(page)

# Option 2 (more concise): start from a fresh writer and copy every page in one call
writer = PdfWriter()
writer.append_pages_from_reader(pdf_reader)

from datetime import datetime
# Format the current date and time as a PDF date string for the metadata
utc_offset = "-05'00'"  # UTC offset of the local time zone (optional)
mod_date = datetime.now().strftime(f"D:%Y%m%d%H%M%S{utc_offset}")

# Add the new metadata
writer.add_metadata(
    {
        "/Author": "Kyocera",
        "/Title": "2554ci User Guide",
        "/ModDate": mod_date
    }
)

# Save the new PDF to disk
with open("Mod_Meta Docs/Mod_Meta-2554ci Operation Guide.pdf", "wb") as f:
    writer.write(f)

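The cell above hard-codes the UTC offset in the `/ModDate` string. As a minimal sketch, the offset can instead be derived from a timezone-aware `datetime`. `format_pdf_date` is a hypothetical helper introduced for this example; it is not part of pypdf.

```python
from datetime import datetime, timedelta, timezone


def format_pdf_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as a PDF date string (hypothetical helper)."""
    offset = moment.utcoffset()
    if offset is None:
        raise ValueError("Expected a timezone-aware datetime")
    total_minutes = int(offset.total_seconds() // 60)
    sign = "+" if total_minutes >= 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"D:{moment:%Y%m%d%H%M%S}{sign}{hours:02d}'{minutes:02d}'"


# Fixed sample value so the result is reproducible
sample = datetime(2024, 1, 10, 10, 44, 3, tzinfo=timezone(timedelta(hours=-7)))
print(format_pdf_date(sample))  # D:20240110104403-07'00'

# Current local time, using the machine's own UTC offset
print(format_pdf_date(datetime.now().astimezone()))
```

The returned string could then be passed as the `"/ModDate"` value in `writer.add_metadata(...)`.
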
### PDF Fields
PDF form fields are interactive fields that the user can fill out, as in an application form.



In [36]:
pdf_reader = PdfReader("docs/docs_pdf/Section 3 Reading PDFs/APPLICATION FOR TAX CARD.pdf") # Reader object
# Fields
# get_fields() returns a dictionary mapping field names to field objects (or None if no form data is found)
fields = pdf_reader.get_fields()

for field in fields:
    
    field_type = fields[field].field_type
    
    name = fields[field].name
    value = fields[field].value
    
    print(field_type, name,  value)

# /Btn is a button field; clicking it shows an 'X'
# /Tx is a text field; 'None' means it is not prefilled


/Tx 020 Conny Söderholm
/Tx 477 None
/Btn 476 None
/Btn 479 None
/Btn 478 None
/Tx 488 None
/Tx s488 None
/Tx 489 None
/Tx s489 None
/Tx 562 None
/Tx s562 None
/Tx 563 None
/Tx s563 None
/Tx 564 None
/Tx s564 None
/Tx 565 None
/Tx s565 None
/Tx 566 None
/Tx s566 None
/Tx 567 None
/Tx s567 None
/Tx 568 None
/Tx s568 None
/Tx 569 None
/Tx s569 None
/Btn 481 None
/Btn 572 None
/Btn 573 None
/Tx 482 None
/Tx s482 None
/Tx 483 None
/Tx s483 None
/Tx 484 None
/Tx s484 None
/Tx 574 None
/Tx 575 None
/Tx s575 None
/Tx 576 None
/Tx s576 None
/Tx s574 None
/Tx 010 None
/Tx 053 None
/Tx 580;1 None
/Tx 581;1 None
/Tx 582;1 None
/Tx 583;1 None
/Tx 584;1 None
/Tx s584;1 None
/Tx 585;1 None
/Tx s585;1 None
/Tx 586;1 None
/Tx s586;1 None
/Tx 580;2 None
/Tx 581;2 None
/Tx 582;2 None
/Tx 583;2 None
/Tx 584;2 None
/Tx s584;2 None
/Tx 585;2 None
/Tx s585;2 None
/Tx 586;2 None
/Tx s586;2 None
/Tx 587;1 None
/Tx 588;1 None
/Tx 589;1 None
/Tx 590;1 None
/Tx 591;1 None
/Tx s591;1 None
/Tx 592;1 None
/Tx s592;

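With this many fields, a short summary can be easier to read than the full listing. The sketch below counts fields per type and collects the prefilled ones. `summarize_fields` is a hypothetical helper; it only assumes that each value exposes `field_type` and `value` attributes, as the objects returned by `get_fields()` do in the cell above. The demo uses stand-in objects so that it runs without a PDF file.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_fields(fields):
    """Count fields per type and collect the prefilled fields (hypothetical helper).

    `fields` is expected to look like the result of get_fields(): a mapping of
    field name -> object with `field_type` and `value` attributes, or None.
    """
    fields = fields or {}  # assumption: treat "no form fields" as an empty mapping
    type_counts = Counter(field.field_type for field in fields.values())
    filled = {
        name: field.value for name, field in fields.items() if field.value is not None
    }
    return type_counts, filled


# Stand-in objects mirroring the first lines of the output above
demo_fields = {
    "020": SimpleNamespace(field_type="/Tx", value="Conny Söderholm"),
    "477": SimpleNamespace(field_type="/Tx", value=None),
    "476": SimpleNamespace(field_type="/Btn", value=None),
}

counts, filled = summarize_fields(demo_fields)
print(dict(counts))  # {'/Tx': 2, '/Btn': 1}
print(filled)        # {'020': 'Conny Söderholm'}
```

In the notebook, the call would be `summarize_fields(pdf_reader.get_fields())`.
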
### Page Document Layout

The page layout currently in use.
Type: str, or None if not specified

Valid layouts are:
- /NoLayout 
$\;\;\;$ Layout explicitly not specified
- /SinglePage 
$\;\;\;$ Show one page at a time
- /OneColumn 
$\;\;\;$ Show one column at a time
- /TwoColumnLeft
$\;\;\;$ Show pages in two columns, odd-numbered pages on the left
- /TwoColumnRight 
$\;\;\;$ Show pages in two columns, odd-numbered pages on the right
- /TwoPageLeft
$\;\;\;$ Show two pages at a time, odd-numbered pages on the left
- /TwoPageRight
$\;\;\;$ Show two pages at a time, odd-numbered pages on the right


In [37]:
pdf_reader = PdfReader("docs/docs_pdf/Section 3 Reading PDFs/p17 UseThumbs.pdf") # Reader 

page_layout = pdf_reader.page_layout

print(page_layout)

None


### Page Mode (Document outline)

The page mode currently in use.
Type: str, or None if not specified

Valid modes are:
- /UseNone          
$\;\;\;$ Do not show outlines or thumbnails panels
- /UseOutlines      
$\;\;\;$ Show outlines (aka bookmarks) panel
- /UseThumbs        
$\;\;\;$ Show page thumbnails panel
- /FullScreen       
$\;\;\;$ Fullscreen view
- /UseOC            
$\;\;\;$ Show Optional Content Group (OCG) panel
- /UseAttachments   
$\;\;\;$ Show attachments panel

In [38]:
pdf_reader = PdfReader("docs/docs_pdf/Section 3 Reading PDFs/p17.pdf") 

page_mode = pdf_reader.page_mode

print(page_mode)

/UseOutlines


### XMP Metadata

[Adobe’s Extensible Metadata Platform (XMP)](https://www.adobe.com/products/xmp.html) is a file labeling technology that lets you embed metadata into files themselves during the content creation process. With an XMP enabled application, your workgroup can capture meaningful information about a project (such as titles and descriptions, searchable keywords, and up-to-date author and copyright information) in a format that is easily understood by your team as well as by software applications, hardware devices, and even file formats. Best of all, as team members modify files and assets, they can edit and update the metadata in real time during the workflow.
 

In [39]:
pdf_reader = PdfReader("docs/docs_pdf/Section 3 Reading PDFs/APPLICATION FOR TAX CARD.pdf")
xmp = pdf_reader.xmp_metadata  # None when the document has no XMP metadata
if xmp is None:
    print("The document has no XMP metadata")
else:
    print("Creator", xmp.dc_creator)
    print("Creator tool", xmp.xmp_creator_tool)
    print("Title", xmp.dc_title)
    print("Producer", xmp.pdf_producer)

Creator ['Tax Administration']
Creator tool PScript5.dll Version 5.2.2
Title {'x-default': 'Application for tax card and/or tax prepayment'}
Producer Acrobat Distiller 10.1.15 (Windows)


### Get Document Pages
Retrieves a page from a PDF file by its index.

Pages are zero-indexed: the first page is index ```[0]``` and the last page is index ```[len(pdf_reader.pages) - 1]```.

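Because readers usually think in 1-based page numbers, a small conversion step can prevent off-by-one errors. The following is a minimal sketch; `page_index` is a hypothetical helper, and a plain list stands in for `pdf_reader.pages`.

```python
def page_index(page_number: int, page_count: int) -> int:
    """Convert a 1-based page number to a 0-based index, with bounds checking."""
    if not 1 <= page_number <= page_count:
        raise IndexError(f"Page {page_number} is outside 1..{page_count}")
    return page_number - 1


pages = ["page 1", "page 2", "page 3"]  # stand-in for pdf_reader.pages

print(pages[page_index(1, len(pages))])           # page 1
print(pages[page_index(len(pages), len(pages))])  # page 3
```
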
In [40]:
# PrettyPrinter 
import pprint # https://docs.python.org/3/library/pprint.html

pdf_reader = PdfReader("docs/docs_pdf/Section 3 Reading PDFs/p17.pdf")
# Get the first page
page1 = pdf_reader.pages[0]
# Get the second page
page2 = pdf_reader.pages[1]
# Get the last page
last_page = pdf_reader.pages[len(pdf_reader.pages) - 1]
# Print the first page's dictionary using pprint
pprint.pprint(page1)

{'/Annots': [IndirectObject(15827, 0, 3190219505200),
             IndirectObject(15835, 0, 3190219505200),
             IndirectObject(15846, 0, 3190219505200),
             IndirectObject(15854, 0, 3190219505200),
             IndirectObject(15865, 0, 3190219505200),
             IndirectObject(15873, 0, 3190219505200)],
 '/BleedBox': [0, 0, 612, 1008],
 '/Contents': [IndirectObject(114474, 0, 3190219505200)],
 '/CropBox': [0, 0, 612, 792],
 '/Group': {'/CS': '/DeviceRGB', '/S': '/Transparency', '/Type': '/Group'},
 '/MediaBox': [0, 0, 612, 792],
 '/Parent': IndirectObject(8, 0, 3190219505200),
 '/Resources': {'/Font': {'/F0': IndirectObject(4, 0, 3190219505200),
                          '/F1': IndirectObject(5, 0, 3190219505200),
                          '/F2': IndirectObject(6, 0, 3190219505200),
                          '/F3': IndirectObject(18, 0, 3190219505200),
                          '/F4': IndirectObject(114475, 0, 3190219505200),
                          '/F5': Indirec

### Get Text From Page - full page and lines

Retrieving the text of a single PDF page.

Using ```enumerate(full_page_text.splitlines())``` to extract the individual lines.


In [41]:
pdf_reader = PdfReader("docs/2554ci Operation Guide.pdf")

# Get the sixth page (index 5)
page = pdf_reader.pages[5]
full_page_text = page.extract_text()
print(type(full_page_text))

for number, line in enumerate(full_page_text.splitlines()):
    print("Line Num.: ", number, line)

<class 'str'>
Line Num.:  0 vLoading Originals in the Docu ment Processor ..........................................................  5-3
Line Num.:  1 Loading Paper in the Mu ltipurpose Tray .........................................................................  5-6
Line Num.:  2 Favorites ..................................................................................................................... .....  5-11
Line Num.:  3 Registering Favorites ..............................................................................................  5-12
Line Num.:  4 Recalling Favorites .................. ................................................................................  5-12
Line Num.:  5 Editing Favorites .....................................................................................................  5-13Deleting Favorites ............... ....................................................................................  5-13
Line Num.:  6 Application .......
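
The extracted lines above come from a table of contents, with dot leaders between each title and its page label, and line 5 shows two entries merged into one line. As a rough sketch, a regular expression can split such lines into `(title, page)` pairs. `parse_toc_entries` is a hypothetical helper, and the `chapter-page` label pattern (for example `5-13`) is an assumption based on this particular output.

```python
import re

# A title without dots, a run of at least three leader dots (possibly broken
# by spaces), then a page label such as "5-13".
_TOC_ENTRY = re.compile(r"(?P<title>[^.]+?)\s*\.{3,}[\s.]*(?P<page>\d+-\d+)")


def parse_toc_entries(line: str) -> list[tuple[str, str]]:
    """Return every (title, page label) pair found in one extracted line."""
    return [
        (" ".join(match["title"].split()), match["page"])
        for match in _TOC_ENTRY.finditer(line)
    ]


# Shortened versions of lines from the output above
print(parse_toc_entries("Registering Favorites ..........  5-12"))
# [('Registering Favorites', '5-12')]
print(parse_toc_entries("Editing Favorites .....  5-13Deleting Favorites ..... ....  5-13"))
# [('Editing Favorites', '5-13'), ('Deleting Favorites', '5-13')]
```

Lines without a page label return an empty list, so they can be skipped or handled separately.
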