# Import dependencies

In [1]:
import os
from PyPDF2 import PdfFileReader, PdfFileWriter

# load data
%store -r labels

import nbimporter
from pypdf2_split_part1 import split_pdf

Importing Jupyter notebook from pypdf2_split_part1.ipynb


<hr style="border-top: 2px solid black;">

# <font color="blue">Function Explanation</font>

<strong><font color="red">Line 1</font></strong>
<strong>Code:</strong><code>def split_pdf(path, labels):</code><br><br>
<strong>What it does:</strong>
Defines a function that takes two parameters:
 - <code>path</code>: the path of the PDF file to split
 - <code>labels</code>: a list of labels, one per page

<strong><font color="red">Line 2</font></strong>
<strong>Code:</strong><code> fname = os.path.splitext(os.path.basename(path))[0]</code><br><br>
<strong>What it does:</strong> Extracts the name of the PDF file that will be split, without its directory or extension<br>
<strong>How?</strong>
 - The <code>os</code> module provides a portable way of using operating system dependent functionality.
<strong>Bottom line:</strong><br>
<code>fname</code> holds the file name without the extension.

In [2]:
path = 'File.pdf'

<strong>First: Get the base name</strong>
 - <code>os.path.basename(path)</code> returns the final component of the path, that is, the file name with any leading directories removed.
 - The extension is still attached at this point; it is removed in the next two steps.<br>

In [3]:
os.path.basename(path)

'File.pdf'

<strong>Second: Split the file name</strong>
 - <code>os.path</code> is a submodule used for pathname manipulation.
 - <code>os.path.splitext()</code> splits a path name into a pair: root and extension.
 - It returns a tuple: (file name, extension)

In [4]:
fname = os.path.splitext(os.path.basename(path))
fname

('File', '.pdf')

<strong>Third: Extract only the filename</strong>
 - <code>os.path.splitext()</code> returns a tuple: (file name, extension)
 - <code>os.path.splitext()[0]</code> returns the first element of the tuple, the file name.<br>

In [5]:
fname = os.path.splitext(os.path.basename(path))[0]
fname

'File'
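The three steps above also apply when the path includes directories. A minimal sketch, assuming a hypothetical path <code>reports/2020/File.pdf</code> that is not part of this notebook:

```python
import os

# Hypothetical path, used only for illustration
demo_path = 'reports/2020/File.pdf'

base = os.path.basename(demo_path)   # drops the leading directories
root, ext = os.path.splitext(base)   # separates the name from the extension

print(base)  # File.pdf
print(root)  # File
print(ext)   # .pdf
```

Unpacking the tuple into <code>root, ext</code> is an alternative to indexing with <code>[0]</code> when the extension is also needed.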

<strong><font color="red">Line 3</font></strong>
<strong>Code:</strong><code> pdf = PdfFileReader(path)</code><br><br>
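<strong>What it does:</strong> Creates a reader object for the PDF file we will split<br>
 - Documentation: https://pythonhosted.org/PyPDF2/PdfFileReader.html 
 - The reader object exposes methods such as <code>getNumPages()</code> and <code>getPage()</code>, which are used in the lines below.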

In [6]:
pdf = PdfFileReader(path)
pdf

<PyPDF2.pdf.PdfFileReader at 0x112a1ec50>

<strong><font color="red">Line 4</font></strong>: Inside the for loop
<br>
<strong>Code:</strong><code> for page, label in zip(range(pdf.getNumPages()), labels):</code><br>

 -  <code>range(pdf.getNumPages())</code> returns a range of page indexes, from 0 up to (but not including) the number of pages in the PDF

In [7]:
range(pdf.getNumPages())

range(0, 25)

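 -  <code>zip(range(pdf.getNumPages()), labels)</code> The zip() function takes iterables (the range of pages in the PDF and the list of labels) and returns an iterator of tuples, pairing the first page index with the first label, the second with the second, and so on. It stops when the shorter iterable runs out.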

In [8]:
zip(range(pdf.getNumPages()), labels)

<zip at 0x111c93af0>
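The zip object does not show its contents until it is consumed. A small sketch with stand-in values (<code>demo_pages</code> and <code>demo_labels</code> are made up for illustration and are not the notebook's data) shows the pairs it produces and what happens when the two iterables differ in length:

```python
demo_pages = range(3)       # stand-in for range(pdf.getNumPages())
demo_labels = ['a', 'b']    # deliberately one label short

# list() consumes the iterator so the pairs become visible
pairs = list(zip(demo_pages, demo_labels))
print(pairs)  # [(0, 'a'), (1, 'b')]
```

Page index 2 has no matching label, so it is dropped without an error. In the splitting loop this would mean that pages beyond the last label are not written.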

 -  <code>for page, label in zip(range(pdf.getNumPages()), labels):</code><br>
 - Since <code>zip()</code> yields pairs, the for loop unpacks each pair into two variables:
  - 1) <code>page</code> is the current page index in the PDF file
  - 2) <code>label</code> is the corresponding item from the labels list

In [9]:
for page, label in zip(range(pdf.getNumPages()), labels):
    # Printing this just for illustration
    print(page, label)

0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
10 k
11 l
12 m
13 n
14 o
15 p
16 q
17 r
18 s
19 t
20 u
21 v
22 w
23 x
24 y


<strong><font color="red">Line 5</font></strong>: Inside the for loop
<strong>Code:</strong><code>pdf_writer = PdfFileWriter()</code><br><br>
<strong>What it does:</strong> Creates an instance of <code>PdfFileWriter</code><br>
 - Documentation: https://pythonhosted.org/PyPDF2/PdfFileWriter.html
 - This class supports writing PDF files out. Here it is used to save each page we split off; a new writer is created on every pass through the loop, so each output file holds a single page.

In [10]:
pdf_writer = PdfFileWriter()
pdf_writer

<PyPDF2.pdf.PdfFileWriter at 0x112cd0b10>

<strong><font color="red">Line 6</font></strong>: Inside the for loop
<strong>Code:</strong><code>pdf_writer.addPage(pdf.getPage(page))</code><br><br>
<strong>What it does:</strong> Gets the current page from the PDF file being split and adds it to the writer object, which will become the new PDF file<br>

<strong><font color="red">Line 7</font></strong>: Inside the for loop
<strong>Code:</strong><code>output_filename = '{}_{}.pdf'.format(fname, label)</code><br>
 - <strong>What it does:</strong> Creates a unique file name that will be used as the new file name when the file is saved.<br>
 - <code>fname</code>: value saved from <font color="red">Line 2</font>
 - <code>label</code>: is the unique label from the list of labels

In [11]:
# Example: fname as 'File' and 'a' as the unique label
# the {}_{} will be replaced with the values in the .format() arguments
output_filename = '{}_{}.pdf'.format('File', 'a')
output_filename

'File_a.pdf'
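For comparison, the same file name can be built with an f-string. This is only an alternative spelling, assuming Python 3.6 or later; the function described here uses <code>.format()</code>:

```python
# Example values taken from the cell above
fname = 'File'
label = 'a'

with_format = '{}_{}.pdf'.format(fname, label)
with_fstring = f'{fname}_{label}.pdf'

print(with_format)                  # File_a.pdf
print(with_format == with_fstring)  # True
```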

<strong><font color="red">Line 8-9</font></strong>: Inside the for loop<br>

<strong>Code:</strong><br>
<code>with open(output_filename, 'wb') as out:
    pdf_writer.write(out)</code><br>
 - <strong>What it does:</strong> Opens the new file in write-binary mode and uses the PDF writer object’s write method to write the object’s contents to disk. The <code>with</code> statement closes the file automatically when the block ends.<br>
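The write-binary pattern can be tried without a PDF. The sketch below writes placeholder bytes to a temporary directory; the <code>payload</code> value is made up and is not a valid PDF, and <code>out.write(payload)</code> stands in for <code>pdf_writer.write(out)</code>:

```python
import os
import tempfile

payload = b'placeholder bytes, not a real PDF'

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'File_a.pdf')

    # 'wb' opens the file for writing raw bytes
    with open(out_path, 'wb') as out:
        out.write(payload)  # pdf_writer.write(out) would go here

    # The file is already closed once the with block has ended
    closed_after_block = out.closed

    # Read the bytes back to confirm what was written
    with open(out_path, 'rb') as check:
        read_back = check.read()

print(closed_after_block)
print(read_back == payload)
```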

<strong><font color="red">Line 10</font></strong>: Inside the for loop
<strong>Code:</strong><code>print('Created: {}'.format(output_filename))</code><br>
 - <strong>What it does:</strong> Prints the name of the file that has just been saved, so you can track the progress of the split.<br>
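
Putting the ten lines together, the function would look roughly like the sketch below. It is reconstructed from the line-by-line explanation above, not copied from <code>pypdf2_split_part1.ipynb</code>, so the original may differ in details. It assumes a PyPDF2 release older than 3.0, where the <code>PdfFileReader</code> and <code>PdfFileWriter</code> names used in this notebook are still available.

```python
import os


def split_pdf(path, labels):
    # Assumption: imported here so the sketch can be defined without PyPDF2
    # installed; the notebook imports these names at the top instead.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    fname = os.path.splitext(os.path.basename(path))[0]
    pdf = PdfFileReader(path)
    for page, label in zip(range(pdf.getNumPages()), labels):
        pdf_writer = PdfFileWriter()  # fresh writer, so one page per file
        pdf_writer.addPage(pdf.getPage(page))
        output_filename = '{}_{}.pdf'.format(fname, label)
        with open(output_filename, 'wb') as out:
            pdf_writer.write(out)
        print('Created: {}'.format(output_filename))
```

With the 25-page <code>File.pdf</code> and the labels shown earlier, calling <code>split_pdf(path, labels)</code> would be expected to write <code>File_a.pdf</code> through <code>File_y.pdf</code> to the current working directory.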