# Chapter 15: Working with PDF Files using PyPDF2

## 📋 Chapter Overview

This chapter shows how to manipulate PDF files with Python's PyPDF2 library. You'll learn to extract content, modify documents, and write new PDFs assembled from existing pages.

### 🎯 Learning Objectives
- Extract text and metadata from PDFs
- Split and merge PDF documents
- Rotate and crop pages
- Encrypt and decrypt PDFs
- Understand PyPDF2's limits for creating PDFs from scratch

### 📦 Prerequisites
- Basic Python knowledge
- PyPDF2 library installed (`pip install PyPDF2`)
- Sample PDF files to work with

In [None]:
# Install PyPDF2 if not already installed
!pip install PyPDF2

## 1. 📖 Extracting Text from PDFs

Let's start by learning how to extract text content from PDF files.

In [None]:
from PyPDF2 import PdfReader
from pathlib import Path

# Get the home directory
home_dir = Path.home()
print(f"Home directory: {home_dir}")

In [None]:
# Path to the PDF to work with (change this to your own file)
pdf_path = home_dir / "example.pdf"

# Check if file exists
if pdf_path.exists():
    # Create a PdfReader object
    pdf_reader = PdfReader(str(pdf_path))
    
    # Get document information (metadata can be missing entirely)
    metadata = pdf_reader.metadata
    print("📄 Document Info:")
    print(f"Number of pages: {len(pdf_reader.pages)}")
    print(f"Title: {(metadata.title if metadata else None) or 'Not available'}")
    print(f"Author: {(metadata.author if metadata else None) or 'Not available'}")
    
    # Extract text from a specific page
    print("\n📖 Text from first page:")
    first_page = pdf_reader.pages[0]
    print(first_page.extract_text()[:500] + "...")  # Show first 500 characters
else:
    print(f"❌ File not found: {pdf_path}")
    print("Please update the pdf_path variable with a valid PDF file path.")
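
`extract_text()` can return an empty string, for example on scanned pages that contain only images, and the cell above appends `...` even when the text is shorter than 500 characters. A minimal sketch of a safer preview helper follows; `preview_text` and its placeholder message are hypothetical names chosen for illustration, not part of PyPDF2.

```python
def preview_text(text, limit=500):
    """Return at most `limit` characters of `text`, marking truncation."""
    # Treat None and whitespace-only results the same way
    text = (text or "").strip()
    if not text:
        return "[no extractable text on this page]"
    if len(text) <= limit:
        return text
    # Only add the ellipsis when something was actually cut off
    return text[:limit] + "..."
```

It could replace the slice in the cell above, for example `print(preview_text(first_page.extract_text()))`.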

## 2. ✂️ Extracting Pages from PDFs

Now let's learn how to extract specific pages from a PDF document.

In [None]:
from PyPDF2 import PdfWriter

def extract_pages(input_path, output_path, page_numbers):
    """Extract specific pages from a PDF and save them as a new PDF."""
    
    # Create reader and writer objects
    reader = PdfReader(input_path)
    writer = PdfWriter()
    
    # Add specified pages to writer (zero-based indices)
    extracted = []
    for page_num in page_numbers:
        # Reject negative numbers too: they would silently index from the end
        if 0 <= page_num < len(reader.pages):
            writer.add_page(reader.pages[page_num])
            extracted.append(page_num)
        else:
            print(f"⚠️ Page {page_num} doesn't exist in the document")
    
    # Save the extracted pages
    with open(output_path, 'wb') as output_file:
        writer.write(output_file)
    
    print(f"✅ Successfully extracted pages {extracted} to {output_path}")

# Example usage
if pdf_path.exists():
    output_pdf = home_dir / "extracted_pages.pdf"
    extract_pages(str(pdf_path), str(output_pdf), [0, 2, 4])  # Zero-based indices: the 1st, 3rd, and 5th pages
else:
    print("❌ PDF file not available for extraction")
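
The example above passes zero-based indices, while readers usually think in one-based page numbers and ranges. As a sketch, a hypothetical `parse_page_spec` helper (not part of PyPDF2) could translate a string such as `"1-3,5"` into the index list that `extract_pages` expects:

```python
def parse_page_spec(spec, total_pages):
    """Convert a one-based page spec like '1-3,5' to zero-based indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if not 1 <= start <= end <= total_pages:
            raise ValueError(f"Page range {part!r} is outside 1-{total_pages}")
        # range() excludes its stop value, so `end` maps to index end - 1
        indices.extend(range(start - 1, end))
    return indices


print(parse_page_spec("1-3,5", total_pages=6))  # [0, 1, 2, 4]
```

The result can be passed straight to `extract_pages` as its `page_numbers` argument.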

## 🧩 Challenge: PDF File Splitter Class

Create a Python class that can split PDFs in different ways:
- By page ranges
- Into single-page PDFs
- Every n pages
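
The "every n pages" case comes down to a little index arithmetic. A minimal sketch, using a hypothetical `chunk_ranges` helper that returns inclusive zero-based `(start, end)` pairs:

```python
def chunk_ranges(total_pages, n):
    """Return inclusive (start, end) index pairs covering pages in groups of n."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # The last group may be shorter when total_pages is not a multiple of n
    return [(start, min(start + n, total_pages) - 1)
            for start in range(0, total_pages, n)]


print(chunk_ranges(5, 2))  # [(0, 1), (2, 3), (4, 4)]
```

Each pair describes one output file, so a five-page document split every two pages yields three parts, the last holding a single page.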

In [None]:
class PDFSplitter:
    def __init__(self, input_pdf):
        self.reader = PdfReader(input_pdf)
        self.total_pages = len(self.reader.pages)
    
    def split_by_range(self, start_page, end_page, output_path):
        """Split PDF by page range (inclusive)."""
        writer = PdfWriter()
        
        # Validate page range
        if start_page < 0 or end_page >= self.total_pages or start_page > end_page:
            print("❌ Invalid page range")
            return
        
        # Add pages to writer
        for page_num in range(start_page, end_page + 1):
            writer.add_page(self.reader.pages[page_num])
        
        # Save output
        with open(output_path, 'wb') as output_file:
            writer.write(output_file)
        
        print(f"✅ Pages {start_page}-{end_page} saved to {output_path}")
    
    def split_single_pages(self, output_prefix):
        """Split PDF into single-page PDFs."""
        for page_num in range(self.total_pages):
            writer = PdfWriter()
            writer.add_page(self.reader.pages[page_num])
            
            output_path = f"{output_prefix}_page_{page_num + 1}.pdf"
            with open(output_path, 'wb') as output_file:
                writer.write(output_file)
        
        print(f"✅ Split into {self.total_pages} single-page PDFs with prefix '{output_prefix}'")
    
    def split_every_n_pages(self, n, output_prefix):
        """Split PDF every n pages."""
        if n <= 0:
            print("❌ n must be a positive integer")
            return
        
        for i in range(0, self.total_pages, n):
            writer = PdfWriter()
            # Add up to n pages
            for j in range(i, min(i + n, self.total_pages)):
                writer.add_page(self.reader.pages[j])
            
            part_num = i // n + 1
            output_path = f"{output_prefix}_part_{part_num}.pdf"
            with open(output_path, 'wb') as output_file:
                writer.write(output_file)
        
        print(f"✅ Split into {(self.total_pages + n - 1) // n} parts every {n} pages")

# Example usage
if pdf_path.exists():
    splitter = PDFSplitter(str(pdf_path))
    
    # Save pages 0-2 (the first three pages)
    splitter.split_by_range(0, 2, str(home_dir / "first_three_pages.pdf"))
    
    # Split into single pages
    splitter.split_single_pages(str(home_dir / "single_page"))
    
    # Split every 2 pages
    splitter.split_every_n_pages(2, str(home_dir / "every_two_pages"))
else:
    print("❌ PDF file not available for splitting")

## 3. 🔗 Concatenating and Merging PDFs

Learn how to combine multiple PDF documents into a single file.

In [None]:
def merge_pdfs(output_path, *input_paths):
    """Merge multiple PDFs into a single PDF."""
    writer = PdfWriter()
    
    for path in input_paths:
        reader = PdfReader(path)
        for page in reader.pages:
            writer.add_page(page)
    
    with open(output_path, 'wb') as output_file:
        writer.write(output_file)
    
    print(f"✅ Merged {len(input_paths)} PDFs into {output_path}")

# Example usage
if pdf_path.exists():
    # For demonstration, merge the same PDF with itself
    output_merged = home_dir / "merged.pdf"
    merge_pdfs(str(output_merged), str(pdf_path), str(pdf_path))
else:
    print("❌ PDF file not available for merging")
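
When merging many files, the order of the inputs decides the order of the pages, and plain alphabetical sorting puts `part10.pdf` before `part2.pdf`. The sketch below collects the PDFs in a folder in natural filename order using only the standard library; `natural_key` and `list_pdfs` are hypothetical helper names.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key that orders 'part2.pdf' before 'part10.pdf'."""
    # Splitting on digit runs alternates text and number chunks,
    # so chunks at the same position always have the same type
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", path.name)]


def list_pdfs(directory):
    """Return the PDF files in `directory`, in natural filename order."""
    pdfs = [p for p in Path(directory).iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"]
    return sorted(pdfs, key=natural_key)
```

The result plugs into the function above, for example `merge_pdfs(str(output_merged), *map(str, list_pdfs(home_dir)))`.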

## 4. 🔄 Rotating and Cropping PDF Pages

Learn how to manipulate page orientation and dimensions.

In [None]:
def rotate_pages(input_path, output_path, rotation_angle, page_numbers=None):
    """Rotate specific pages in a PDF."""
    reader = PdfReader(input_path)
    writer = PdfWriter()
    
    # Validate rotation angle
    if rotation_angle % 90 != 0:
        print("❌ Rotation angle must be a multiple of 90")
        return
    
    # If no specific pages provided, rotate all pages
    if page_numbers is None:
        page_numbers = list(range(len(reader.pages)))
    
    for i, page in enumerate(reader.pages):
        if i in page_numbers:
            page.rotate(rotation_angle)
        writer.add_page(page)
    
    with open(output_path, 'wb') as output_file:
        writer.write(output_file)
    
    print(f"✅ Rotated pages {page_numbers} by {rotation_angle} degrees")

# Example usage
if pdf_path.exists():
    output_rotated = home_dir / "rotated.pdf"
    rotate_pages(str(pdf_path), str(output_rotated), 90, [0])  # Rotate first page 90 degrees clockwise
else:
    print("❌ PDF file not available for rotation")
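
The section title mentions cropping, but the cell above only rotates. As a hedged sketch, one way to crop with PyPDF2 is to shrink a page's `mediabox`, whose coordinates are measured in points (72 points per inch). The helper name `crop_page_margins` and the uniform-margin design are assumptions made for illustration.

```python
def crop_page_margins(page, margin):
    """Shrink a page's visible area by `margin` points on every side.

    `page` is expected to be a PyPDF2 page object (an assumption of
    this sketch); its mediabox is modified in place.
    """
    box = page.mediabox
    left, bottom = float(box.left), float(box.bottom)
    right, top = float(box.right), float(box.top)
    if margin < 0 or 2 * margin >= min(right - left, top - bottom):
        raise ValueError("margin must be non-negative and leave a visible area")
    # Assign both corners only after all four edges have been read
    box.lower_left = (left + margin, bottom + margin)
    box.upper_right = (right - margin, top - margin)
    return page
```

Inside a loop like the one in `rotate_pages`, calling `crop_page_margins(page, 36)` before `writer.add_page(page)` would trim half an inch from each edge.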

## 5. 🔐 Encrypting and Decrypting PDFs

Learn how to add security to your PDF documents.

In [None]:
def encrypt_pdf(input_path, output_path, password):
    """Encrypt a PDF with a password."""
    reader = PdfReader(input_path)
    writer = PdfWriter()
    
    # Add all pages to the writer
    for page in reader.pages:
        writer.add_page(page)
    
    # Encrypt the PDF
    writer.encrypt(password)
    
    with open(output_path, 'wb') as output_file:
        writer.write(output_file)
    
    # Avoid echoing the password itself into the notebook output
    print(f"✅ Encrypted PDF saved to {output_path}")

def decrypt_pdf(input_path, output_path, password):
    """Decrypt a password-protected PDF."""
    try:
        reader = PdfReader(input_path)
        
        if reader.is_encrypted:
            # decrypt() returns 0 (falsy) when the password is wrong
            if not reader.decrypt(password):
                print("❌ Incorrect password")
                return False
        
        writer = PdfWriter()
        for page in reader.pages:
            writer.add_page(page)
        
        with open(output_path, 'wb') as output_file:
            writer.write(output_file)
        
        print("✅ PDF successfully decrypted")
        return True
    except Exception as e:
        print(f"❌ Failed to decrypt PDF: {e}")
        return False

# Example usage
if pdf_path.exists():
    # Encrypt the PDF
    encrypted_pdf = home_dir / "encrypted.pdf"
    encrypt_pdf(str(pdf_path), str(encrypted_pdf), "secure_password")
    
    # Decrypt the PDF
    decrypted_pdf = home_dir / "decrypted.pdf"
    decrypt_pdf(str(encrypted_pdf), str(decrypted_pdf), "secure_password")
else:
    print("❌ PDF file not available for encryption/decryption")
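
The example above hardcodes the password in the notebook. A small sketch of one alternative reads it from an environment variable and falls back to an interactive prompt; the function name `get_pdf_password` and the `PDF_PASSWORD` variable name are hypothetical.

```python
import os
from getpass import getpass


def get_pdf_password(env_var="PDF_PASSWORD"):
    """Return the password from the environment, or prompt for it."""
    password = os.environ.get(env_var)
    if password:
        return password
    # getpass hides the typed characters instead of echoing them
    return getpass("PDF password: ")
```

The returned value can then be passed as the `password` argument of `encrypt_pdf` and `decrypt_pdf`.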

## 🧩 Challenge: Unscramble a PDF

Create a program that can unscramble a PDF with pages in the wrong order.
You'll be given a PDF with scrambled pages and need to restore the correct order.

In [None]:
def unscramble_pdf(input_path, output_path, correct_order):
    """Unscramble PDF pages by putting them in the correct order."""
    reader = PdfReader(input_path)
    writer = PdfWriter()
    
    # Validate the correct_order list
    if len(correct_order) != len(reader.pages):
        print("❌ The correct order list must have the same length as the number of pages")
        return
    
    if sorted(correct_order) != list(range(len(reader.pages))):
        print("❌ The correct order list must contain all page indices from 0 to n-1")
        return
    
    # Add pages in correct order
    for page_index in correct_order:
        writer.add_page(reader.pages[page_index])
    
    with open(output_path, 'wb') as output_file:
        writer.write(output_file)
    
    print("✅ PDF pages reordered successfully")

# Example usage
if pdf_path.exists():
    # Create a scrambled version first (reverse order)
    scrambled_pdf = home_dir / "scrambled.pdf"
    reader = PdfReader(str(pdf_path))
    writer = PdfWriter()
    
    # Add pages in reverse order to create a scrambled PDF
    for i in range(len(reader.pages)-1, -1, -1):
        writer.add_page(reader.pages[i])
    
    with open(str(scrambled_pdf), 'wb') as output_file:
        writer.write(output_file)
    
    # Now unscramble it (back to original order)
    unscrambled_pdf = home_dir / "unscrambled.pdf"
    
    # The scrambled file is reversed, so reading its pages from last to
    # first restores the original order
    correct_order = list(range(len(reader.pages) - 1, -1, -1))
    unscramble_pdf(str(scrambled_pdf), str(unscrambled_pdf), correct_order)
else:
    print("❌ PDF file not available for unscrambling")
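
In practice the correct order is rarely known in advance. If each page prints its page number at the end of its extracted text (an assumption that will not hold for every document), the order can be recovered from the text. A sketch with a hypothetical `order_from_page_numbers` helper:

```python
import re


def order_from_page_numbers(page_texts):
    """Return page indices sorted by the number that ends each page's text."""
    numbered = []
    for index, text in enumerate(page_texts):
        match = re.search(r"(\d+)\s*$", text or "")
        if match is None:
            raise ValueError(f"No trailing page number found on page index {index}")
        numbered.append((int(match.group(1)), index))
    # Sorting by printed number gives the indices in reading order
    return [index for _, index in sorted(numbered)]


print(order_from_page_numbers(["third\n3", "first\n1", "second\n2"]))  # [1, 2, 0]
```

With `texts = [page.extract_text() for page in PdfReader(path).pages]`, the returned list can be passed as `correct_order` to `unscramble_pdf`.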

## 6. 🆕 Creating a PDF from Scratch

While PyPDF2 is great for manipulating existing PDFs, it cannot draw text or graphics onto a page; the most it can do for a new document is add blank pages. Creating PDFs with real content from scratch requires an additional library such as ReportLab.

In [None]:
# Note: PyPDF2 is not designed for creating PDFs from scratch
# For creating PDFs, consider using libraries like ReportLab, FPDF, or WeasyPrint

print("📝 PyPDF2 is primarily for manipulating existing PDFs.")
print("For creating PDFs from scratch, consider these alternatives:")
print("1. ReportLab - Powerful PDF generation")
print("2. FPDF - Simple PDF creation")
print("3. WeasyPrint - HTML to PDF conversion")
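
PyPDF2 sizes pages in points (1/72 inch). The sketch below converts millimetres to points and appends an empty A4 page through `PdfWriter.add_blank_page`; the helper names `mm_to_points` and `add_blank_a4_page` are hypothetical, and the resulting page has no content.

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def mm_to_points(mm):
    """Convert millimetres to PDF points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def add_blank_a4_page(writer):
    """Append an empty A4 page (210 mm x 297 mm) to a PyPDF2 PdfWriter."""
    return writer.add_blank_page(width=mm_to_points(210),
                                 height=mm_to_points(297))
```

Calling `add_blank_a4_page(PdfWriter())` and then writing the writer to a file produces a one-page, empty document.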

## 📋 Practice Exercises

1. Create a function that extracts all text from a PDF and saves it to a .txt file
2. Write a script that merges all PDFs in a directory into a single PDF
3. Create a PDF watermarking function that adds a text watermark to each page (hint: build the watermark page with another library, then overlay it with `merge_page`)
4. Build a PDF metadata editor that can modify title, author, and other metadata

In [None]:
# Exercise 1: Extract all text from PDF and save to .txt file
def extract_all_text_to_file(input_path, output_path):
    """Extract all text from a PDF and save it to a text file."""
    reader = PdfReader(input_path)
    
    with open(output_path, 'w', encoding='utf-8') as text_file:
        for page_num, page in enumerate(reader.pages):
            text = page.extract_text()
            text_file.write(f"--- Page {page_num + 1} ---\n")
            text_file.write(text)
            text_file.write("\n\n")
    
    print(f"✅ All text extracted to {output_path}")

# Example usage
if pdf_path.exists():
    text_output = home_dir / "extracted_text.txt"
    extract_all_text_to_file(str(pdf_path), str(text_output))
else:
    print("❌ PDF file not available for text extraction")
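
For exercise 4, PyPDF2's `PdfWriter.add_metadata` accepts a dictionary whose keys are PDF info names such as `/Title` and `/Author`. A minimal starting point follows, with hypothetical helper names and limited to single-word fields like `title`, `author`, and `subject`:

```python
def to_info_dict(**fields):
    """Map keyword arguments such as title='Report' to keys like '/Title'."""
    # Fields set to None are skipped rather than written as the text "None"
    return {f"/{name.capitalize()}": str(value)
            for name, value in fields.items() if value is not None}


def set_metadata(writer, **fields):
    """Write the given fields into a PyPDF2 PdfWriter's document info."""
    writer.add_metadata(to_info_dict(**fields))
```

A full solution would copy the pages from a `PdfReader` into the writer first, call `set_metadata(writer, title="My Title", author="Me")`, and then write the output file.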

## 🏁 Summary

In this chapter, you've learned how to:

✅ Extract text and metadata from PDF documents  
✅ Split PDFs into smaller files or individual pages  
✅ Merge multiple PDFs into a single document  
✅ Rotate and manipulate PDF pages  
✅ Add password protection to PDFs  
✅ Create a PDF splitter utility class  
✅ Unscramble PDFs with pages in the wrong order  

### 📚 Additional Resources
- [PyPDF2 Documentation](https://pypdf2.readthedocs.io/)
- [ReportLab PDF Library](https://www.reportlab.com/opensource/)
- [Python PDF Processing Guide](https://realpython.com/pdf-python/)

### 🚀 Next Steps
1. Practice with different types of PDF documents
2. Explore more advanced PDF manipulation techniques
3. Learn about other PDF libraries like PDFMiner, PyMuPDF, or pdfrw
4. Consider learning about OCR (Optical Character Recognition) for scanned PDFs