This repository contains a set of Python scripts for working with PDF files. Each script handles a single task, such as merging, splitting, comparing, or extracting content. The scripts use the PyPDF2 library; the text-to-PDF converter additionally requires reportlab.
Install the required libraries:
pip install PyPDF2
# For extracting images (the quotes keep shells such as zsh from expanding the brackets)
pip install "PyPDF2[image]"
# Or just install all optional dependencies
pip install "PyPDF2[full]"
Follow the usage instructions provided below for each script.
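This script merges multiple PDF files into a single PDF file. The first argument is the output file, followed by the input files.
# Merge three input files into output.pdf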
python merge.py output.pdf input1.pdf input2.pdf input3.pdf
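For readers who want to see roughly what such a merge involves, here is a minimal sketch. It is not the code in merge.py; it assumes PyPDF2 2.0 or later (for `PdfReader`, `PdfWriter`, and `add_page`), and the function names `merge_pdfs` and `split_cli_args` are hypothetical.

```python
from typing import Iterable, List, Tuple


def split_cli_args(argv: List[str]) -> Tuple[str, List[str]]:
    """Split ['output.pdf', 'a.pdf', 'b.pdf'] into (output, [inputs])."""
    if len(argv) < 2:
        raise ValueError("expected an output path followed by at least one input PDF")
    return argv[0], argv[1:]


def merge_pdfs(output_path: str, input_paths: Iterable[str]) -> int:
    """Append every page of each input PDF to one output file.

    Returns the number of pages written.
    """
    # Imported here so the argument helper above works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    writer = PdfWriter()
    total = 0
    for path in input_paths:
        reader = PdfReader(path)
        for page in reader.pages:
            writer.add_page(page)
            total += 1
    with open(output_path, "wb") as fh:
        writer.write(fh)
    return total
```

With these helpers, `merge_pdfs(*split_cli_args(sys.argv[1:]))` would mirror the command line shown above, with pages appended in the order the input files are given.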
This script extracts text from PDF files, either from all pages or a specific page.
# Extract text from all pages
python extract_text.py file.pdf --output text_file.txt
# Extract text from page 3
python extract_text.py file.pdf --output text_file.txt --page 3
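The page selection can be sketched as follows. This is an illustration rather than the script's actual code: it assumes that `--page` is 1-based, and the names `collect_text` and `extract_text_to_file` are hypothetical. `collect_text` accepts any sequence of objects with an `extract_text()` method, which is what `PdfReader(...).pages` provides.

```python
from typing import Optional, Sequence


def collect_text(pages: Sequence, page_number: Optional[int] = None) -> str:
    """Return text from all pages, or from one 1-based page if page_number is given."""
    if page_number is not None:
        if not 1 <= page_number <= len(pages):
            raise ValueError(f"page {page_number} is out of range (1-{len(pages)})")
        selected = [pages[page_number - 1]]
    else:
        selected = list(pages)
    # extract_text() can yield nothing for scanned or image-only pages.
    return "\n".join((page.extract_text() or "") for page in selected)


def extract_text_to_file(pdf_path: str, output_path: str,
                         page_number: Optional[int] = None) -> None:
    """Read a PDF with PyPDF2 and write the selected text to a UTF-8 file."""
    # Imported here so collect_text can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    with open(output_path, "w", encoding="utf-8") as fh:
        fh.write(collect_text(reader.pages, page_number))
```

Text extraction depends on how the PDF was produced, so the output for scanned documents may be empty.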
This script extracts images from PDF files, either from all pages or a specific page.
# Extract images from all pages to the default folder "extracted_images"
python extract_images.py example.pdf
# Extract images from page 2 to a custom folder "my_images"
python extract_images.py example.pdf --output my_images --page 2
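A rough sketch of the image extraction step is shown below. It relies on the `images` attribute of PyPDF2 page objects, whose items expose `name` and `data`; this is available in recent PyPDF2 releases and needs the optional image dependency. The function name `save_page_images`, the 1-based page numbering, and the output file naming are assumptions made for this example, not taken from extract_images.py.

```python
from pathlib import Path
from typing import List, Optional, Sequence


def save_page_images(pages: Sequence, output_dir: str = "extracted_images",
                     page_number: Optional[int] = None) -> List[Path]:
    """Write every image found on the selected pages into output_dir.

    `pages` is expected to behave like PdfReader(...).pages: each page has an
    `images` attribute whose items carry `name` and `data`.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    if page_number is None:
        numbered = list(enumerate(pages, start=1))
    else:
        if not 1 <= page_number <= len(pages):
            raise ValueError(f"page {page_number} is out of range (1-{len(pages)})")
        numbered = [(page_number, pages[page_number - 1])]

    saved = []
    for number, page in numbered:
        for index, image in enumerate(page.images, start=1):
            # Prefix with page and index so names from different pages cannot collide.
            target = out / f"page{number}_{index}_{image.name}"
            target.write_bytes(image.data)
            saved.append(target)
    return saved
```

A call such as `save_page_images(PdfReader("example.pdf").pages, "my_images", page_number=2)` would correspond to the second command above.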
This script creates a new PDF file from specific pages of an existing PDF.
# Create a new PDF file containing pages 3, 4, and 7
python extract_pages.py input.pdf output.pdf 3 4 7
# Create a new PDF file containing pages from 3 to 7 and 11
python extract_pages.py input.pdf output.pdf 3-7 11
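The page arguments mix single numbers and ranges, so they have to be expanded before pages can be copied. One possible way to do that is sketched below; it assumes 1-based page numbers and inclusive ranges, and the names `parse_page_specs` and `extract_pages` are hypothetical rather than taken from the script.

```python
from typing import Iterable, List


def parse_page_specs(specs: Iterable[str]) -> List[int]:
    """Expand arguments such as ['3-7', '11'] into a flat list of page numbers."""
    pages: List[int] = []
    for spec in specs:
        if "-" in spec:
            start_text, end_text = spec.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range {spec!r}: start is after end")
            pages.extend(range(start, end + 1))  # ranges are inclusive
        else:
            pages.append(int(spec))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return pages


def extract_pages(input_path: str, output_path: str, page_numbers: Iterable[int]) -> None:
    """Copy the given 1-based pages from input_path into a new PDF."""
    # Imported here so parse_page_specs can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    writer = PdfWriter()
    for number in page_numbers:
        if number > len(reader.pages):
            raise ValueError(f"page {number} does not exist in {input_path}")
        writer.add_page(reader.pages[number - 1])
    with open(output_path, "wb") as fh:
        writer.write(fh)
```

Here `parse_page_specs(["3-7", "11"])` returns `[3, 4, 5, 6, 7, 11]`, matching the second command above.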
This script splits a PDF file into individual pages and saves each page as a separate PDF in a new folder.
# Split input.pdf into individual pages in the default folder "split_pages"
python split.py input.pdf
# Split input.pdf into individual pages in a custom folder "output_folder"
python split.py input.pdf --output output_folder
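A per-page split could look roughly like the sketch below. The output file naming (`<source name>_page_<number>.pdf`, zero-padded) is an assumption for this example; split.py may name its files differently. The function names `page_output_path` and `split_pdf` are hypothetical as well.

```python
from pathlib import Path
from typing import List


def page_output_path(output_dir: str, source_pdf: str,
                     page_number: int, total_pages: int) -> Path:
    """Build a zero-padded file name so the pages sort in document order."""
    width = len(str(total_pages))
    stem = Path(source_pdf).stem
    return Path(output_dir) / f"{stem}_page_{page_number:0{width}d}.pdf"


def split_pdf(input_path: str, output_dir: str = "split_pages") -> List[Path]:
    """Write each page of input_path to its own PDF inside output_dir."""
    # Imported here so page_output_path can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    total = len(reader.pages)
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    written = []
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()  # a fresh writer per page keeps each file to one page
        writer.add_page(page)
        target = page_output_path(output_dir, input_path, number, total)
        with open(target, "wb") as fh:
            writer.write(fh)
        written.append(target)
    return written
```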
This script compares the text content of corresponding pages in two PDF files. If differences are found, it reports the differing text for each page; the --output option saves the report to a text file.
# Compare two PDF files and save the differences to a text file
python compare.py file1.pdf file2.pdf --output differences.txt
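The page-by-page comparison can be approximated with the standard library's `difflib`. The sketch below is one way to do it and not necessarily how compare.py works: the unified-diff output format, the handling of PDFs with different page counts, and the names `diff_pages` and `load_page_texts` are all assumptions.

```python
import difflib
from itertools import zip_longest
from typing import Dict, List, Sequence


def diff_pages(texts_a: Sequence[str], texts_b: Sequence[str]) -> Dict[int, List[str]]:
    """Return {page_number: diff lines} for every page whose text differs.

    A page missing from the shorter document is compared as empty text.
    """
    differences: Dict[int, List[str]] = {}
    paired = zip_longest(texts_a, texts_b, fillvalue="")
    for number, (text_a, text_b) in enumerate(paired, start=1):
        if text_a == text_b:
            continue
        diff = difflib.unified_diff(
            text_a.splitlines(),
            text_b.splitlines(),
            fromfile=f"file1 page {number}",
            tofile=f"file2 page {number}",
            lineterm="",
        )
        differences[number] = list(diff)
    return differences


def load_page_texts(pdf_path: str) -> List[str]:
    """Extract the text of every page with PyPDF2."""
    # Imported here so diff_pages can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    return [(page.extract_text() or "") for page in PdfReader(pdf_path).pages]
```

Calling `diff_pages(load_page_texts("file1.pdf"), load_page_texts("file2.pdf"))` yields a dictionary that can be printed or written to a file such as differences.txt. Because only extracted text is compared, differences in layout, fonts, or images go unnoticed.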
This script converts a .txt file to a PDF. It requires the reportlab library.
# Install reportlab
pip install reportlab
# Convert input.txt to output.pdf
python txt_to_pdf.py input.txt output.pdf