<b>Portable Document Format (PDF)</b> files are widely used for document exchange because they are platform independent and preserve formatting. In Python, several libraries let developers manipulate PDF files, extract information from them, and create them programmatically.

Before diving into the code, it helps to know what a PDF can contain: text, images, hyperlinks, forms, and more. The sections below show how to extract text from a PDF, merge several PDFs, copy a page into a new PDF, and rotate pages.

To begin, install a PDF manipulation library. Two commonly used libraries are <b>PyPDF2</b> and <b>PyMuPDF</b>; the examples below use PyPDF2. You can install both with the following commands:

In [23]:
!pip install PyPDF2
!pip install pymupdf



### Extracting Text from a PDF:

##### Using PyPDF2:

In [24]:
# importing required modules
import PyPDF2

In [25]:
# Open "a83f8926-4718-4806-8eb3-670fffa171c7.pdf" in binary read mode ('rb')
# and assign the resulting file object to pdfFileObj.
pdfFileObj = open('a83f8926-4718-4806-8eb3-670fffa171c7.pdf', 'rb')

In [26]:
# Create a PdfReader object from the PyPDF2 module, passing the open file object as its argument.
pdfReader = PyPDF2.PdfReader(pdfFileObj)

In [27]:
# Print the number of pages in the PDF file.
# `pages` is a list-like sequence of page objects, so its length is the page count.
print(len(pdfReader.pages))

1


In [28]:
# Get a page object (an instance of PyPDF2's PageObject class).
# `pages` is a sequence, not a function: index it with a 0-based page number.
pageObj = pdfReader.pages[0]

In [29]:
# The page object's extract_text() method returns the text content of the page.
print(pageObj.extract_text())

Online Payment Receipt
Dear Customer,
You have successfully completed your recharge
The details of your transaction are below. Kindly quote your transaction ID for
future communications.
Mobile Number:
Recharge Amount:
Payment Date & Time:
Order Id:
Recharge Description:
Recharge Validity:
8334907556
Rs. 719
01-Feb-2024 | 14:24PM
PRE7508425834877556
Unlimited -719
84 Days
If you have any questions concerning this receipt, please email
customercare@vodafoneidea.com
THANK YOU FOR YOUR BUSINESS!
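
The output above lists the field labels first and their values afterwards, because `extract_text()` returns text in the order it appears in the page's content stream, which need not match the visual layout. As a rough sketch, the labels and values can be paired by position. The function name `pair_labels_with_values` is hypothetical, and the approach assumes exactly this layout (a run of lines ending in a colon, followed by the same number of value lines); it is not a general-purpose parser.

```python
def pair_labels_with_values(text):
    """Pair a run of 'Label:' lines with the value lines that follow it."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]

    # Collect the first contiguous run of lines that end with a colon.
    labels = []
    start_of_values = 0
    for index, line in enumerate(lines):
        if line.endswith(":"):
            labels.append(line[:-1])
            start_of_values = index + 1
        elif labels:
            break  # the run of labels has ended

    # Assumption: the values follow in the same order as the labels.
    values = lines[start_of_values:start_of_values + len(labels)]
    return dict(zip(labels, values))


# An abridged excerpt of the text printed above (three of the six fields).
sample = """Dear Customer,
Recharge Amount:
Order Id:
Recharge Validity:
Rs. 719
PRE7508425834877556
84 Days
THANK YOU FOR YOUR BUSINESS!"""

fields = pair_labels_with_values(sample)
print(fields["Order Id"])  # PRE7508425834877556
```

If a PDF stores its text in a different order, this positional pairing will produce wrong results, so the returned dictionary should be checked against the original document.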


In [30]:
# Finally, close the PDF file object.
pdfFileObj.close()
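
The cells above open and close the file by hand and read only the first page. A `with` block closes the file even if an error occurs, and looping over `pages` covers the whole document. The sketch below is one way to combine the two; `join_page_text` and `read_pdf_text` are hypothetical helper names, not part of PyPDF2.

```python
def join_page_text(pages, separator="\n"):
    """Concatenate the text of every page object in `pages`."""
    # `or ""` guards against pages that yield no text, such as scanned images.
    return separator.join(page.extract_text() or "" for page in pages)


def read_pdf_text(path):
    """Return the text of all pages in the PDF at `path`."""
    import PyPDF2  # imported here so join_page_text works without PyPDF2 installed

    # The file must stay open while the pages are read, so the
    # extraction happens inside the with block.
    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return join_page_text(reader.pages)
```

For the one-page receipt used here, `print(read_pdf_text('a83f8926-4718-4806-8eb3-670fffa171c7.pdf'))` should print the same text as the cell above.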

### Merging PDFs:

##### Using PyPDF2

In [31]:
import PyPDF2

In [32]:
# Merge the PDF files listed in pdf_list into a single file at output_path.
def merge_pdfs(pdf_list, output_path):
    # PdfMerger collects the pages of each input file.
    pdf_merger = PyPDF2.PdfMerger()
    # Append each PDF in the order given.
    for pdf in pdf_list:
        pdf_merger.append(pdf)
    # Open output_path in write-binary mode ('wb') and write the merged content to it.
    with open(output_path, 'wb') as merged_pdf:
        pdf_merger.write(merged_pdf)
    # Release the input files held by the merger.
    pdf_merger.close()

# Input PDF files to merge and the output path for the merged PDF.
pdf_list = ['T2-NLP_Course outline 2023 .docx.pdf', 'T3-NLU_Course outline 2023 .docx.pdf']
output_path = 'merged.pdf'

# Call merge_pdfs with these arguments.
merge_pdfs(pdf_list, output_path)

##### In summary
This code defines a function that merges multiple PDF files into a single PDF. The function uses the PdfMerger class from PyPDF2, iterates through the provided list of PDFs, appends each one to the merger object, and writes the merged content to the specified output file. Finally, the function is called with a sample list of input PDFs and an output path.
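
Appending a path that does not exist raises an error in the middle of the merge, so it can help to check the list first. The following is a minimal sketch, assuming the inputs are plain file paths; `missing_inputs` and `merge_existing_pdfs` are hypothetical names introduced for illustration.

```python
from pathlib import Path


def missing_inputs(pdf_list):
    """Return the entries of `pdf_list` that are not existing files."""
    return [pdf for pdf in pdf_list if not Path(pdf).is_file()]


def merge_existing_pdfs(pdf_list, output_path):
    """Merge `pdf_list` into `output_path` after checking every input exists."""
    missing = missing_inputs(pdf_list)
    if missing:
        raise FileNotFoundError(f"Missing input PDFs: {missing}")

    import PyPDF2  # imported here so missing_inputs works without PyPDF2 installed

    merger = PyPDF2.PdfMerger()
    try:
        for pdf in pdf_list:
            merger.append(pdf)
        with open(output_path, "wb") as merged_pdf:
            merger.write(merged_pdf)
    finally:
        # Release the input files that the merger opened.
        merger.close()
```

Checking up front means the error message names every missing file at once, and no output file is created when an input is absent.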

### Creating a PDF

##### Using PyPDF2:

In [51]:
import PyPDF2

# Create a new PDF at output_path from the first page of an existing PDF (content).
def create_pdf(output_path, content):
    # Initialize a PdfWriter object to build the new PDF.
    pdf_writer = PyPDF2.PdfWriter()

    # Open the existing PDF for reading; the with block closes it afterwards.
    with open(content, 'rb') as source_pdf:
        pdf_reader = PyPDF2.PdfReader(source_pdf)

        # Add the first page (index 0) of the existing PDF to the writer.
        pdf_writer.add_page(pdf_reader.pages[0])

        # Open output_path in write-binary mode ('wb') and write the new PDF
        # while the source file is still open.
        with open(output_path, 'wb') as new_pdf:
            pdf_writer.write(new_pdf)

# Specify the path to the existing PDF file and the output path for the new PDF file.
content = 'a83f8926-4718-4806-8eb3-670fffa171c7.pdf'  # Replace with the path to an existing PDF file
output_path = 'new_pdf_2.pdf'

# Invoking the create_pdf function with the specified input parameters.
create_pdf(output_path, content)


##### In summary
This code defines a function that creates a new PDF by copying the first page of an existing PDF. It uses the PdfWriter class from PyPDF2, adds that page from the source PDF, and writes the result to the new file. The function is then called with a sample input PDF and an output path.
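
The same pattern extends from the first page to any selection of pages. The sketch below is one possible generalization; `check_page_indices` and `copy_pages` are hypothetical names, and page numbers are assumed to be 0-based, matching the `pages` sequence.

```python
def check_page_indices(indices, page_count):
    """Return `indices` as a list, raising IndexError for any out-of-range entry."""
    checked = list(indices)
    for index in checked:
        if not 0 <= index < page_count:
            raise IndexError(f"page index {index} is not valid for {page_count} page(s)")
    return checked


def copy_pages(source_path, output_path, indices):
    """Write the pages at the 0-based `indices` of `source_path` to `output_path`."""
    import PyPDF2  # imported here so check_page_indices works without PyPDF2 installed

    with open(source_path, "rb") as source_file:
        reader = PyPDF2.PdfReader(source_file)
        writer = PyPDF2.PdfWriter()
        for index in check_page_indices(indices, len(reader.pages)):
            writer.add_page(reader.pages[index])
        # Keep the source open until the write finishes, since PyPDF2
        # may read page data on demand.
        with open(output_path, "wb") as new_pdf:
            writer.write(new_pdf)
```

For example, `copy_pages('input.pdf', 'first_two.pdf', [0, 1])` would copy the first two pages, where both file names are placeholders.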

### Rotating PDF pages:

In [52]:
# importing the required module
import PyPDF2


def PDFrotate(origFileName, newFileName, rotation):
    # opening the original pdf; the with block closes it when done
    with open(origFileName, 'rb') as pdfFileObj:
        # creating a pdf reader object
        pdfReader = PyPDF2.PdfReader(pdfFileObj)

        # creating a pdf writer object for the new pdf
        pdfWriter = PyPDF2.PdfWriter()

        # rotating each page clockwise by `rotation` degrees (a multiple of 90)
        for pageObj in pdfReader.pages:
            pageObj.rotate(rotation)

            # adding the rotated page object to the pdf writer
            pdfWriter.add_page(pageObj)

        # writing all rotated pages to the new file once, after the loop
        with open(newFileName, 'wb') as newFile:
            pdfWriter.write(newFile)


def main():
    # original pdf file name
    origFileName = 'a83f8926-4718-4806-8eb3-670fffa171c7.pdf'

    # new pdf file name
    newFileName = 'rotated_new.pdf'

    # rotation angle in degrees
    rotation = 270

    # calling the PDFrotate function
    PDFrotate(origFileName, newFileName, rotation)


if __name__ == "__main__":
    # calling the main function
    main()
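
The `rotate` method accepts only multiples of 90 degrees, so a value such as 45 raises an error. The sketch below checks the angle up front and reduces it to the range 0 to 270; `normalize_rotation` is a hypothetical helper, not a PyPDF2 function.

```python
def normalize_rotation(angle):
    """Reduce a clockwise rotation in degrees to one of 0, 90, 180, or 270."""
    if angle % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90 degrees, got {angle}")
    # Python's % returns a non-negative result for a positive modulus,
    # so -90 (a quarter turn counter-clockwise) becomes 270.
    return angle % 360


print(normalize_rotation(270))  # 270
print(normalize_rotation(-90))  # 270
print(normalize_rotation(450))  # 90
```

Passing the normalized value to the rotation function keeps invalid angles from failing partway through a document.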