# Working with PDF files in Python
## Author: Gustavo Amarante

Several libraries can handle PDF files, but the main one we are going to use is *`PyPDF4`*. **It is the successor to `PyPDF2`, which was no longer maintained when this notebook was written. Most (but not all) `PyPDF2` code also works with `PyPDF4`, which means you can usually swap one for the other on the imports without changing any other lines.**

The example document used in this notebook is the piano score for "The Fools Who Dream" by Justin Hurwitz (from the audition scene of the movie "La La Land").

### What is a PDF file?
* It stands for Portable Document Format
* Initially invented by Adobe, it is now a standard document format maintained by the International Organization for Standardization (ISO)
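
As a small illustration of the format, a PDF file normally begins with the marker `%PDF-` followed by a version number. The sketch below uses only the standard library to check for that marker. The function name `looks_like_pdf` is made up for this example, and the check is only a quick sanity test, not a full validation of the file.

```python
import io

def looks_like_pdf(stream):
    """Return True if a binary stream starts with the PDF header marker."""
    # The header is expected in the first bytes of the file, e.g. b'%PDF-1.4'
    return stream.read(5) == b'%PDF-'

# In-memory examples, so no file on disk is needed
print(looks_like_pdf(io.BytesIO(b'%PDF-1.4\n...')))
print(looks_like_pdf(io.BytesIO(b'PK\x03\x04')))
```

For a file on disk, open it in binary mode (`'rb'`) and pass the file object to the function.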

---
## Extracting Metadata
Some of the metadata in a PDF file is very useful for automation. We can extract:
* Author
* Creator
* Producer
* Subject
* Title
* Number of pages

In [1]:
from PyPDF4 import PdfFileReader

def extract_pdf_metadata(pdf_path):
    # Read everything we need while the file is still open
    with open(pdf_path, 'rb') as file:
        pdf = PdfFileReader(file)
        info = pdf.getDocumentInfo()
        n_pages = pdf.getNumPages()

    print('Author:', info.author)
    print('Creator:', info.creator)
    print('Producer:', info.producer)
    print('Subject:', info.subject)
    print('Title:', info.title)
    print('Pages:', n_pages)

In [2]:
extract_pdf_metadata(r'data/Audition_The_Fools_Who_Dream.pdf')

Author: None
Creator: MuseScore Version: 2.1.0
Producer: Qt 5.4.2
Subject: None
Title: Audition (The Fools Who Dream
Pages: 12
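
For automation it is often more convenient to get the metadata back as a dictionary instead of printing it. The following is a minimal sketch of that idea. The names `metadata_to_dict` and `read_pdf_metadata` are hypothetical helpers introduced here, and the `PyPDF4` import is done inside the function so that the formatting helper can be tried without the library installed.

```python
FIELDS = ('author', 'creator', 'producer', 'subject', 'title')

def metadata_to_dict(info, n_pages):
    """Collect the metadata fields listed above into a plain dictionary."""
    # getattr with a default keeps this safe when a field (or info itself) is missing
    record = {name: getattr(info, name, None) for name in FIELDS}
    record['pages'] = n_pages
    return record

def read_pdf_metadata(pdf_path):
    """Hypothetical wrapper: read a PDF and return its metadata as a dict."""
    from PyPDF4 import PdfFileReader  # imported here so the helper above has no dependency
    with open(pdf_path, 'rb') as file:
        pdf = PdfFileReader(file)
        return metadata_to_dict(pdf.getDocumentInfo(), pdf.getNumPages())
```

Calling `read_pdf_metadata(r'data/Audition_The_Fools_Who_Dream.pdf')` should return the same values that were printed above, keyed by field name.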




---
## Rotating Pages
Let's turn every page of this score upside down.

In [3]:
from PyPDF4 import PdfFileReader, PdfFileWriter

def rotate_pages(pdf_path):
    pdf_writer = PdfFileWriter()

    pdf_reader = PdfFileReader(pdf_path)
    n_pages = pdf_reader.getNumPages()

    for page_i in range(n_pages):
        # A 180-degree rotation turns the page upside down
        page = pdf_reader.getPage(page_i).rotateClockwise(180)
        pdf_writer.addPage(page)

    # Save the rotated copy; the original file is left untouched
    with open('rotated_pages.pdf', 'wb') as rp:
        pdf_writer.write(rp)

In [4]:
rotate_pages(r'data/Audition_The_Fools_Who_Dream.pdf')
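
The same loop can be adapted to rotate only some of the pages. The sketch below is one possible way to do it, not part of the library: `rotate_selected` and its `page_indices` parameter are names invented for this example. It relies only on the reader and writer methods already used above (`getNumPages`, `getPage`, `rotateClockwise`, `addPage`) and assumes the angle must be a multiple of 90 degrees.

```python
def rotate_selected(reader, writer, angle, page_indices=None):
    """Copy all pages from reader to writer, rotating only the selected ones."""
    if angle % 90 != 0:
        raise ValueError('angle must be a multiple of 90 degrees')

    n_pages = reader.getNumPages()
    # With no selection, every page is rotated
    selected = set(range(n_pages)) if page_indices is None else set(page_indices)

    for i in range(n_pages):
        page = reader.getPage(i)
        if i in selected:
            page.rotateClockwise(angle)  # changes the page object in place
        writer.addPage(page)
    return writer
```

For example, `rotate_selected(PdfFileReader(path), PdfFileWriter(), 180, page_indices=[0])` would flip only the first page; the returned writer can then be saved with `write`, as in `rotate_pages`.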



---
## Merging PDF Files
Let's create a new file where the first page is the FinanceHub logo and the second page is the first page of the music score.

For that, we can write a function that merges PDFs.

In [47]:
from PyPDF4 import PdfFileReader, PdfFileWriter

def merge_pdf(path, writer=None, pages=None, save_path=None):
    """
    This function merges pdf files.

    If a pdf writer is passed, it appends the selected file to its end,
    otherwise it creates a new pdf writer.

    If 'pages' (a zero-based page index) is passed, only that page will be
    appended to the pdf writer. Otherwise, the full document is appended.

    If 'save_path' is passed, it saves the file to that path. Otherwise,
    it returns the pdf writer object.
    """
    
    if writer is None:
        writer = PdfFileWriter()
    
    reader = PdfFileReader(path)
    
    if pages is None:
        # Full document
        n_page = reader.getNumPages()
        for page in range(n_page):
            writer.addPage(reader.getPage(page))
    else:
        # Single selected page
        writer.addPage(reader.getPage(pages))
    
    if save_path is None:
        return writer
    else:
        with open(save_path, 'wb') as out:
            writer.write(out)

In [49]:
pdf_writer = merge_pdf('data/FH Logo.pdf')

merge_pdf('data/Audition_The_Fools_Who_Dream.pdf', 
          writer=pdf_writer,
          pages=0,
          save_path='logo+score.pdf')
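
As written, `merge_pdf` accepts either the whole document or a single page index. If several pages are needed, one option is to normalize the `pages` argument into a list of indices first. The helper below is a sketch of that idea; `resolve_pages` is a hypothetical name and is not part of `PyPDF4`.

```python
def resolve_pages(pages, n_pages):
    """Turn None, a single index, or an iterable of indices into a list of page indices."""
    if pages is None:
        # No selection means the full document
        return list(range(n_pages))
    if isinstance(pages, int):
        pages = [pages]

    indices = list(pages)
    for i in indices:
        if not 0 <= i < n_pages:
            raise IndexError(f'page index {i} is out of range for a {n_pages}-page document')
    return indices

print(resolve_pages(None, 3))         # [0, 1, 2]
print(resolve_pages(0, 12))           # [0]
print(resolve_pages(range(2, 5), 12)) # [2, 3, 4]
```

Inside `merge_pdf`, the two branches could then be replaced by a single loop over `resolve_pages(pages, reader.getNumPages())`, calling `writer.addPage(reader.getPage(i))` for each index.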



---
## Watermarking a PDF File
In this example, we want to put the FinanceHub logo as a watermark on every page of the music score.

In [52]:
from PyPDF4 import PdfFileReader, PdfFileWriter

def pdf_watermart(input_pdf, save_path, watermark_path):
    watermark_obj = PdfFileReader(watermark_path)
    watermark_page = watermark_obj.getPage(0)
    
    pdf_reader = PdfFileReader(input_pdf)
    n_pages = pdf_reader.getNumPages()
    
    pdf_writer = PdfFileWriter()
    
    for page in range(n_pages):
        page_read = pdf_reader.getPage(page)
        # mergePage modifies page_read in place, adding the watermark content to it
        page_read.mergePage(watermark_page)
        pdf_writer.addPage(page_read)
    
    with open(save_path, 'wb') as out:
        pdf_writer.write(out)

In [54]:
pdf_watermart('data/Audition_The_Fools_Who_Dream.pdf', 'watermarked score.pdf', 'data/FH watermark.pdf')

