# PDFs with Python 3

This document describes **how to extract information from PDFs using Python 3**.

Date: 13 Dec (2017)

## PDF Samples

To explore each tool's functionality, I downloaded three PDFs.

The first, <font color="red">sample.pdf</font>, was obtained from this tutorial: http://programacion.net/articulo/como_trabajar_con_documentos_pdf_utilizando_python_1398.

The second, <font color="red">pdf-sample.pdf</font>, was obtained from: http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf.

The third, <font color="red">paper.pdf</font>, is a paper.

I created a directory named **pdfs** to hold the documents. My working directory looks like this:


    PDF_with_Python/
    ├── LICENSE
    ├── pdfs
    │   ├── paper.pdf
    │   ├── pdf-sample.pdf
    │   └── sample.pdf
    ├── PDF_with_Python.ipynb
    └── README.md

## Tools information

There are several Python tools for manipulating PDFs. This notebook focuses on those that work with **Python 3**. The following tools are reported to work with Python 3.

### [PyPDF2](http://mstamy2.github.io/PyPDF2/ "PyPDF2")
- Latest commit on 6 Aug (2017).  
- Split, merge, crop, etc. of PDF files. Pure Python.  
- Python 2 and **Python 3**.  
- BSD License.  
- Project's **GitHub**: https://github.com/mstamy2/PyPDF2  
- **Documentation**: https://pythonhosted.org/PyPDF2/  
- Includes sample code and **command line interface**.  
- Current version (up to date): **1.26.0**  
- To install [this package](https://anaconda.org/conda-forge/pypdf2 "PyPDF2 conda") with [conda](https://anaconda.org/ "anaconda cloud") run one of the following:

        conda install -c conda-forge pypdf2 
        conda install -c conda-forge/label/broken pypdf2 

You can find examples of what it can be used for (merging two PDF documents, deleting pages, splitting a PDF document into one-page documents, operating on PDF slices, setting PDF metadata, etc.) here: https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167.

### [PDFMiner3k](https://github.com/jaepil/pdfminer3k)
- Latest commit on 5 Oct (2016).  
- Requires **PDFMiner**. 
- **Python 3**.  
- Unspecified License.          
- Project's **GitHub**: https://github.com/jaepil/pdfminer3k  
- **Documentation**: https://pypi.python.org/pypi/pdfminer3k/  
- Current version (up to date): **1.3.1** 
- To install [this package](https://anaconda.org/conda-forge/pdfminer3k "PDFMiner3k conda") with [conda](https://anaconda.org/ "anaconda cloud") run:

        conda install -c conda-forge pdfminer3k
        
pdfminer3k is a Python 3 port of PDFMiner, a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner makes it possible to obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It also has an extensible PDF parser that can be used for purposes other than text analysis.

### [pdfrw](https://github.com/pmaupin/pdfrw)
- Latest commit on 18 Sept (2017).  
- **Python 3**.  
- MIT License.         
- Project's **GitHub**: https://github.com/pmaupin/pdfrw   
- Current version (up to date): **0.4**
- To install [this package](https://anaconda.org/luceda/pdfrw "pdfrw conda") with [conda](https://anaconda.org/ "anaconda cloud") run:

        conda install -c bjornfjohansson pdfrw
        
You can find a brief tutorial here: https://www.binpress.com/tutorial/manipulating-pdfs-with-python/171.

### [pdftotext](https://github.com/jalan/pdftotext)
- Latest commit on 12 Nov (2017).  
- **Python 2** and **Python 3**.  
- MIT License.         
- Project's **GitHub**: https://github.com/jalan/pdftotext
- PyPi: https://pypi.python.org/pypi/pdftotext/2.0.1
- Current version (up to date): **2.0.1**
        
Simple PDF text extraction. **No further information was found**.


# PyPDF2

First, we import the module:

In [1]:
import PyPDF2

***
Next, we need to open the PDF file, which is done with the built-in function [open()](https://docs.python.org/3/library/functions.html#open) as:

    File = open(filename, mode)

Where "mode" (which is optional) can be, among others:

- **'r'** : read-only mode (this is the default mode)
- **'w'** : write-only mode (an existing file is truncated first)
- **'a'** : append mode (writes go to the end of the file)
- **'r+'** : read and write mode

By default, files are opened in text mode: you read and write strings, which are decoded and encoded using a specific encoding. If no encoding is specified, the default is platform dependent. If **'b'** is added to any of the previous modes, the file is opened in **binary mode** and you read and write bytes instead; hence **'rb'** opens the file in **read-only binary mode**. This mode should be used for all files that don’t contain text, and that includes PDFs.  
When you finish working with the file, you must close the file object **File** using:

    File.close()
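As a minimal sketch of the modes described above, the following example writes, appends to, and reads back a small temporary file. The file name `example.txt` and the variable names are only illustrative. It uses a `with` block, which closes the file automatically, as an alternative to calling `close()` by hand.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Illustrative file name; any writable path would do.
    path = os.path.join(folder, "example.txt")

    # 'w' creates the file (or truncates an existing one) and writes text.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("first line\n")

    # 'a' keeps the existing content and writes at the end of the file.
    with open(path, "a", encoding="utf-8") as handle:
        handle.write("second line\n")

    # 'r' (text mode) returns str objects.
    with open(path, "r", encoding="utf-8") as handle:
        as_text = handle.read()

    # 'rb' (binary mode) returns bytes objects.
    with open(path, "rb") as handle:
        as_bytes = handle.read()

print(type(as_text).__name__, type(as_bytes).__name__)
print("closed after the with block:", handle.closed)
```

The last line shows that the file object is already closed once the `with` block ends, so no explicit `close()` call is needed.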

Reading a PDF file is not as simple as reading a text file: PDF is a complex format, originally developed by Adobe, and it comes with its own quirks when it comes to automating the extraction of information from each file. So, after opening the file, we need to parse the PDF format. For this purpose we must first initialize a **PdfFileReader object** with the class:

- class [PyPDF2.PdfFileReader(File)](https://pythonhosted.org/PyPDF2/PdfFileReader.html): Initializes a PdfFileReader object. Obviously **File** is a file object.  


***
Let's try to read <font color="red">sample.pdf</font>:

In [2]:
File = open('./pdfs/sample.pdf')
read_pdf = PyPDF2.PdfFileReader(File)



UnsupportedOperation: can't do nonzero end-relative seeks

***
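According to this error, the problem is that <font color="red">sample.pdf</font> was opened in text mode: text-mode file objects do not support seeking relative to the end of the file, which PyPDF2 needs in order to read a PDF. So we need to close **File** and re-open it in binary mode: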

In [3]:
File.close()

File = open('./pdfs/sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(File)
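The error seen in text mode can be reproduced without PyPDF2. The sketch below writes a few bytes to a temporary file (the content is made up and is not a valid PDF) and then tries to seek one byte back from the end of the file, first in text mode and then in binary mode.

```python
import io
import os
import tempfile

# Write a few bytes to a temporary file so the example needs no real PDF.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"%PDF-1.4 example bytes")
    path = tmp.name

# Text mode: seeking relative to the end of the file is not supported.
with open(path, "r") as text_file:
    try:
        text_file.seek(-1, os.SEEK_END)
        text_mode_error = None
    except io.UnsupportedOperation as error:
        text_mode_error = str(error)

# Binary mode: the same seek works and returns the new absolute position.
with open(path, "rb") as binary_file:
    position = binary_file.seek(-1, os.SEEK_END)
    last_byte = binary_file.read(1)

os.remove(path)

print("text mode:", text_mode_error)
print("binary mode:", position, last_byte)
```

Only the binary-mode file object allows the end-relative seek, which is why the reader has to be given a file opened with `'rb'`.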

***
Now we can work with the document.

Let's look at some of the things that can be done.

***
### 1. Number of pages

    getNumPages(): Calculates the number of pages in this PDF file.

In [4]:
number_of_pages = read_pdf.getNumPages()
print('\n', 'Number of pages = ', number_of_pages)


 Number of pages =  1


***
### 2. Get a specific page

Now that we know the number of pages of our document, we can go to any page and create a **page object** using:

    getPage(pageNumber): Retrieves a page object by number from this PDF file.    

In [5]:
page = read_pdf.getPage(0)
print('\n', page)


 {'/Type': '/Page', '/Parent': IndirectObject(3, 0), '/Resources': IndirectObject(6, 0), '/Contents': IndirectObject(4, 0), '/MediaBox': [0, 0, 612, 792]}


We know that <font color="red">sample.pdf</font> has only 1 page and that page numbers start at 0. What if we try to get a page whose number is equal to or greater than the PDF's number of pages?

Let's see:

In [6]:
page = read_pdf.getPage(1)
print('\n', page)

IndexError: list index out of range
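One way to avoid this `IndexError` is to use `getNumPages()` as the upper bound before calling `getPage()`. The sketch below assumes the PyPDF2 1.26.0 method names used in this notebook. `FakeReader`, `get_page_or_none` and `iter_pages` are hypothetical names introduced here; `FakeReader` only stands in for a real `PdfFileReader` so that the example runs without a PDF file.

```python
class FakeReader:
    """Hypothetical stand-in for PyPDF2.PdfFileReader (not part of PyPDF2)."""

    def __init__(self, pages):
        self._pages = list(pages)

    def getNumPages(self):
        return len(self._pages)

    def getPage(self, page_number):
        return self._pages[page_number]


def get_page_or_none(reader, page_number):
    """Return the requested page, or None if the number is out of range."""
    if 0 <= page_number < reader.getNumPages():
        return reader.getPage(page_number)
    return None


def iter_pages(reader):
    """Yield every page in order, using only valid page numbers."""
    for page_number in range(reader.getNumPages()):
        yield reader.getPage(page_number)


reader = FakeReader(["first page"])
print(get_page_or_none(reader, 0))
print(get_page_or_none(reader, 1))
print(list(iter_pages(reader)))
```

With a real document, `reader` would be the `read_pdf` object created earlier. Note that this helper also rejects negative page numbers, which is a design choice of the sketch.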

***
### 3. Get the page content

What about getting the page content? The **page object** has a method which returns it:

    extractText(): Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future.

Let's see:

In [7]:
page_content = page.extractText()
print('\n', page_content)

File.close()


 !"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%


So, as the documentation says: **_This works well for some PDF files, but poorly for others, depending on the generator used_**.
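Output like the one above contains almost no letters, so a rough check can flag it automatically. The following is only a heuristic sketch: the function names and the `0.5` threshold are assumptions made for this example, not part of PyPDF2.

```python
def letter_ratio(text):
    """Share of non-whitespace characters that are letters (0.0 if empty)."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalpha() for ch in visible) / len(visible)


def looks_garbled(text, threshold=0.5):
    """Guess whether extracted text is unreadable (threshold is arbitrary)."""
    return letter_ratio(text) < threshold


# First line of the output obtained from sample.pdf.
garbled = '!"#$%#$%&%$&\'()*%+,-%./01\'*23%4'
# A line of ordinary English text for comparison.
readable = "Adobe Acrobat PDF Files"

print(letter_ratio(garbled), looks_garbled(garbled))
print(letter_ratio(readable), looks_garbled(readable))
```

A check like this only says that something went wrong. It cannot recover the text, and it would need a different threshold for documents that are mostly numbers or symbols.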

***
Let's try with the file <font color="red">pdf-sample.pdf</font>:

In [8]:
File = open('./pdfs/pdf-sample.pdf', 'rb')

read_pdf = PyPDF2.PdfFileReader(File)

page = read_pdf.getPage(0)
print('\n', page)

number_of_pages = read_pdf.getNumPages()
print('\n', 'Number of pages = ', number_of_pages)

page_content = page.extractText()
print('\n', page_content)

File.close()


 {'/Contents': IndirectObject(11, 0), '/CropBox': [0, 0, 595, 842], '/MediaBox': [0, 0, 595, 842], '/Parent': IndirectObject(5, 0), '/Resources': IndirectObject(14, 0), '/Rotate': 0, '/Type': '/Page'}

 Number of pages =  1

 Adobe Acrobat PDF Files
Adobe® Portable Document Format (PDF) is a universal file format that preserves all
of the fonts, formatting, colours and graphics of any source document, regardless of

the application and platform used to create it.
Adobe PDF is an ideal format for electronic document distribution as it overcomes the
problems commonly encountered with electronic file sharing.
 Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat
Reader. Recipients of other file formats sometimes can't open files because they
don't have the applications used to create the documents.
 PDF files 
always print correctly
 on any printing device.
 PDF files always display 
exactly
 as created, regardless of fonts, software, and
operating systems. Fonts,

This result is much better, although the line breaks do not always match the original layout.
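If only the running text matters, the stray line breaks can be removed after extraction. The sketch below collapses every run of whitespace into a single space. `normalize_whitespace` is a name chosen for this example, and the approach also discards real paragraph breaks, so it is a trade-off rather than a fix.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (including line breaks) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# A fragment of the text extracted from pdf-sample.pdf.
raw = " PDF files \nalways print correctly\n on any printing device.\n"

print(normalize_whitespace(raw))
# PDF files always print correctly on any printing device.
```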

***
What happens when we try to read more complex PDFs, like the **paper** we downloaded?

Let's try with the file <font color="red">paper.pdf</font>:

In [9]:
File = open('./pdfs/paper.pdf', 'rb')

read_pdf = PyPDF2.PdfFileReader(File)

page = read_pdf.getPage(0)
print('\n', page)

number_of_pages = read_pdf.getNumPages()
print('\n', 'Number of pages = ', number_of_pages)

page_content = page.extractText()
print('\n', page_content)

File.close()




 {'/CropBox': [41.89999, 46.5, 637.2, 840.2], '/Annots': [IndirectObject(116, 0), IndirectObject(117, 0), IndirectObject(118, 0), IndirectObject(119, 0), IndirectObject(120, 0), IndirectObject(121, 0), IndirectObject(122, 0), IndirectObject(123, 0), IndirectObject(124, 0), IndirectObject(125, 0), IndirectObject(126, 0), IndirectObject(127, 0), IndirectObject(128, 0), IndirectObject(129, 0), IndirectObject(130, 0), IndirectObject(131, 0), IndirectObject(132, 0), IndirectObject(133, 0), IndirectObject(134, 0), IndirectObject(135, 0)], '/Parent': IndirectObject(98, 0), '/B': [IndirectObject(136, 0), IndirectObject(137, 0), IndirectObject(144, 0)], '/Contents': [IndirectObject(182, 0), IndirectObject(184, 0), IndirectObject(186, 0), IndirectObject(188, 0), IndirectObject(195, 0), IndirectObject(197, 0), IndirectObject(199, 0), IndirectObject(202, 0)], '/Rotate': 0, '/MediaBox': [41.89999, 46.5, 637.2, 840.2], '/Thumb': IndirectObject(46, 0), '/TrimBox': [41.89999, 46.5, 637.2, 840.2], '/R

If you look carefully, **this is wrong**.

We expected the full text of page number 0, but instead we obtained only an **incomplete part of the text**.

This suggests that PyPDF2's text extraction still needs more development.

## Conclusion

PyPDF2 has an **extractText()** function that is very easy to use. However, as other people have noted, getting text out of a PDF is harder than it seems, and the extractText() function **is known not to work on every PDF**, especially if the document is formatted in a _complicated way (for example pictures with captions, columns)_.