# Create and Modify PDF Files in Python

# About this notebook:
* [1. Read text from a PDF with pypdf](#1)
* [2. Split a PDF file into multiple files](#2)
* [3. Concatenate and merge PDF files together](#3)
* [4. Rotate and crop pages in PDF files](#4)
* [5. Encrypt and decrypt PDF files](#5)


You’ll use the pypdf library to manipulate existing PDF files and the ReportLab library to create new PDF files from scratch. Along the way, you’ll have several opportunities to deepen your understanding with exercises and examples.

* Extracting Text From PDF Files With pypdf

In [None]:
%pip install pypdf

In [None]:
%pip show pypdf

<a id="1"> <h1> Read text from a PDF with pypdf</h1> </a>
    
Import the PdfReader class from pypdf.

You’ll also need to provide the path to the PDF file that you want to open. You can do that using the pathlib module.
    

In [None]:
from pypdf import PdfReader

In [None]:
from pathlib import Path

In [None]:
pdf_path = Path("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/Pride_and_Prejudice.pdf")

In [None]:
pdf_reader = PdfReader(pdf_path)

In [None]:
len(pdf_reader.pages)

In [None]:
pdf_reader.metadata

The object stored in .metadata looks like a dictionary but isn’t the same thing. You can access each item in .metadata as an attribute. For example, to get the title, use the .title attribute:

In [None]:
pdf_reader.metadata.title

## Extracting Text From a Page

In pypdf, the PageObject class represents the pages of a PDF file. You use PageObject instances to interact with pages in a PDF file. You don’t need to create your own PageObject instances directly. Instead, you can access them through the PdfReader object’s .pages attribute as you saw before.

If you need to extract text from a PDF page, then you need to run the following steps:

* Get a PageObject with PdfReader.pages[page_index].

* Extract the text as a string with the PageObject instance’s .extract_text() method.


In [None]:
pdf_reader.pages[0].extract_text()

In [None]:
print("Print three pages")
# Slicing .pages replaces the manual counter
for page in pdf_reader.pages[:3]:
    print(page.extract_text(), end="-----------------------\n")

In [None]:
pdf_write=Path("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/Pride_and_Prejudice.txt")

In [None]:
content=[
    f"{pdf_reader.metadata.title}",
    f"Number of pages: {len(pdf_reader.pages)}"
]

In [None]:
content

In [None]:
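for page in pdf_reader.pages:
    content.append(page.extract_text())
# Write as UTF-8 so non-ASCII characters don't depend on the platform's default encoding
pdf_write.write_text("\n".join(content), encoding="utf-8")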

<a id="2"> <h1> Retrieving Pages From a PDF File With pypdf </h1> </a>

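* The width and height arguments of .add_blank_page() are required when the writer has no existing page to take a size from.
* They determine the dimensions of the page in user space units.
* One of these units is equal to 1/72 of an inch, so the code below, which passes 8.27 * 72 and 11.7 * 72, adds an A4 blank page to pdf_writer.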
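
The unit arithmetic can be sketched in plain Python. The helper names below are hypothetical and only illustrate the 1/72 inch conversion; they are not part of pypdf.

```python
POINTS_PER_INCH = 72  # one user space unit is 1/72 of an inch
MM_PER_INCH = 25.4


def inches_to_points(inches: float) -> float:
    """Convert inches to user space units (hypothetical helper)."""
    return inches * POINTS_PER_INCH


def mm_to_points(mm: float) -> float:
    """Convert millimeters to user space units (hypothetical helper)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


# A4 is 210 mm x 297 mm, which is roughly 8.27 in x 11.7 in
a4_width = mm_to_points(210)
a4_height = mm_to_points(297)

print(round(a4_width, 1), round(a4_height, 1))  # 595.3 841.9
```

The values 8.27 * 72 and 11.7 * 72 differ from these by less than one point because the inch figures are rounded.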

## Three steps to create a new PDF file using pypdf:

* Create a PdfWriter instance.
* Add one or more pages to the PdfWriter instance, using either .add_blank_page() or .add_page().
* Write to a file using PdfWriter.write().

In [None]:
from pypdf import PdfWriter

In [None]:
pdf_writer=PdfWriter()

In [None]:
page = pdf_writer.add_blank_page(width=8.27 * 72, height=11.7 * 72)

.add_blank_page() returns the new page as a PageObject:

In [None]:
page

In [None]:
type(page)

In [None]:
pdf_writer.write("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/blank.pdf")

In [None]:
#Extracting a Single Page From a PDF

from pypdf import PdfWriter, PdfReader
from pathlib import Path

In [None]:
pdf_path=Path("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/Pride_and_Prejudice.pdf")
input_pdf=PdfReader(pdf_path)

In [None]:
page1=input_pdf.pages[0]

In [None]:
output_page=PdfWriter()

In [None]:
output_page.add_page(page1)

In [None]:
output_page.write("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/first_page.pdf")

In [None]:
for i in input_pdf.pages[1:5]:
    output_page.add_page(i)


In [None]:
len(output_page.pages)

In [None]:
output_page.write("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/chapter1.pdf")

<a id="3"> <h1> Concatenating and Merging PDF Files With pypdf </h1> </a>

* When you concatenate two or more PDF files, you join the files one after another into a single document.
* Merging two PDF files also joins them into a single file, but instead of attaching the second PDF to the end of the first, merging inserts the file after a specific page in the first PDF. Then it pushes all of the first PDF’s pages after the insertion point to the end of the second PDF.
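
As a rough analogy, the difference can be shown with plain Python lists standing in for pages. The page labels are made up for illustration, and no PDF library is involved.

```python
first = ["A1", "A2", "A3"]  # pages of the first PDF
second = ["B1", "B2"]       # pages of the second PDF

# Concatenation: the second document follows the first
concatenated = first + second

# Merging at position 1: the second document goes in after the first page,
# and the rest of the first document follows it
position = 1
merged = first[:position] + second + first[position:]

print(concatenated)  # ['A1', 'A2', 'A3', 'B1', 'B2']
print(merged)        # ['A1', 'B1', 'B2', 'A2', 'A3']
```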



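There are a couple of ways to add pages to a PdfMerger object, and you’ll choose one based on what you need to accomplish. Note that recent pypdf releases deprecate PdfMerger in favor of PdfWriter, which provides the same two methods: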

* append() concatenates every page in an existing PDF document to the end of the pages currently in PdfMerger.
* merge() inserts all of the pages in an existing PDF document after a specific page in PdfMerger.

In [None]:
from pypdf import PdfMerger
from pathlib import Path

In [None]:
pdf_merger=PdfMerger()

In [None]:
#Concatenating PDFs With .append()
reports_dir=Path("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/expense_reports")

Once you have the path to the expense_reports/ directory assigned to the reports_dir variable, you can use .glob() to get an iterable of paths to PDF files in the directory.

In [None]:
for file in reports_dir.glob("*.pdf"):
    print(file.name)

In general, the order of paths that .glob() returns isn’t guaranteed, so you’ll need to order them yourself. You can do this by creating a list using the built-in sorted() function:
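
One thing to watch for is that sorted() compares paths as text, so numbered files may not land in numeric order. The sketch below uses a temporary directory and made-up file names to show this, along with one possible key function; it assumes every name follows the pattern report<number>.pdf.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    reports_dir = Path(tmp)
    # Hypothetical file names, created empty for illustration
    for name in ["report2.pdf", "report10.pdf", "report1.pdf"]:
        (reports_dir / name).touch()

    # Plain sorted() orders the names as text, so "report10" precedes "report2"
    by_name = [p.name for p in sorted(reports_dir.glob("*.pdf"))]

    # A key function can sort by the number embedded in the file name
    by_number = [
        p.name
        for p in sorted(
            reports_dir.glob("*.pdf"),
            key=lambda p: int(p.stem[len("report"):]),
        )
    ]

print(by_name)    # ['report1.pdf', 'report10.pdf', 'report2.pdf']
print(by_number)  # ['report1.pdf', 'report2.pdf', 'report10.pdf']
```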

In [None]:
expense_reports=sorted(reports_dir.glob("*.pdf"))

In [None]:
pdf_merged=PdfMerger()

In [None]:
for file in expense_reports:
    pdf_merged.append(file)
    
print(pdf_merged)

In [None]:
pdf_merged.write("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/expense_reports_merged.pdf")

## Merging PDFs With .merge()

In [None]:
path_report=Path("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/quarterly_report/report.pdf")
path_toc=Path("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/quarterly_report/toc.pdf")

In [None]:
pdf_merger=PdfMerger()

In [None]:
pdf_merger.append(path_report)

In [None]:
pdf_merger.merge(1,path_toc)

In [None]:
pdf_merger.write("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/quarterly_report/merged_report.pdf")

<a id="4"> <h1> Rotating and Cropping PDF Pages With pypdf </h1> </a>

* A rotation value of 0 means that the page is normally oriented.
* The .rotation attribute therefore lets you check the current rotation of each page in ugly.pdf and rotate back any page whose rotation isn’t 0.

In [None]:
pdf_path=Path("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/ugly.pdf")

In [None]:
pdf_reader=PdfReader(pdf_path)
pdf_writer=PdfWriter()

In [None]:
for page in pdf_reader.pages:
    if page.rotation !=0:
        page.rotate(-page.rotation)
    pdf_writer.add_page(page)

In [None]:
pdf_writer.write("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/rotated.pdf")

## Cropping Pages With RectangleObject

To crop the page, you first need to know a little bit more about how pages are structured.

PageObject instances have a .mediabox attribute, a RectangleObject that represents the rectangular area defining the boundaries of the page.

<a id="5"> <h1> Encrypting and Decrypting PDF Files With pypdf </h1> </a>

You can add password protection to a PDF file using the .encrypt() method of a PdfWriter instance. It has two main parameters:

* user_password sets the user password, which allows opening and reading the encrypted PDF file.
* owner_password sets the owner password, which allows opening and editing the PDF without any restrictions.


In [None]:
user_pwd = "SuperSecret"
owner_pwd = "ReallySuperSecret"
pdf_writer.encrypt(user_password=user_pwd, owner_password=owner_pwd)

In [None]:
output_path = Path("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/newsletter_protected.pdf")
pdf_writer.write(output_path)

In [None]:
from pathlib import Path
from pypdf import PdfReader, PdfWriter
pdf_path = Path("C:/Users/hp/anaconda3/materials-creating-and-modifying-pdfs/practice_files/newsletter_protected.pdf")
pdf_reader = PdfReader(pdf_path)

In [None]:
# Accessing a page before decrypting is expected to raise an error
pdf_reader.pages[0]

In [None]:
pdf_reader.decrypt(password="SuperSecret")


In [None]:
pdf_reader.pages[0]
