# PDFs with Python 2.7

This document describes **how to extract information from PDFs using Python 2.7**.

Date: 13 Dec (2017)


I recorded this process because it was difficult to find a package that could extract text from any PDF file. Maybe I was expecting too much from them.

First, I needed to understand how the PDF file format works.

On Quora, I found the following [comment](https://www.quora.com/Is-there-an-easy-to-use-Python-library-to-read-a-PDF-file-and-extract-its-text):

    We all expect to be able to use a library to parse some file format for text and be able to iterate through the text line by line, but what if the text has no line characters?

    How would the library know what constitutes a line? Most libraries won’t try to guess at that, and honestly we wouldn’t want them to, because if the line isn’t represented by a line character, then the concept of line isn’t really part of the text (is it?) and we are using the library to extract *text*.

    In pdf, text is laid out, meaning that a particular text object get displayed at a particular x,y position on the page. So what you might think of as 3 lines would actually be 3 text objects, displayed at (x,y), (x, y-20), (x, y-40) respectively, so a text extraction library would just pull out the text, but you’d have no line data. (IRRC pdfminer hands you String as output, just a big String, not a (line) iterable, it was because PDFMiner didn’t work for me that I had to study up and learn a bit about pdf to get what I wanted out of the files).

    The upside is this — You finally get a chance to ‘roll your own.’

    Fortunately, extracting the text out of a pdf is very well defined and simple goal. And fortuanately, PDF is a very well documented and very well understood file format, so google is going to be very helpful. If push comes to shove, the text rendering part of the spec is less than 200 pages, but you won’t need to go there.

    Start here:

    https://web.archive.org/web/20141010035745/http://gnupdf.org/Introduction_to_PDF


    Then read the wikipedia article which is super well written. Then you will have to open the file in text editor and study it, which won’t be hard if you are interested only in text. Use this as a tool to understand the stream writing operators: 

    http://www.verypdf.com/document/pdf-format-reference/pg_0985.htm

    The accepted answer to the following SO tells you what you need to investigate to understand how text is encoded within the pdf:

    https://stackoverflow.com/questions/4047953/programatically-rip-text-from-a-pdf-file-by-hand-missing-some-text

    Google anything you wish to understand, and you will be brought to cool sites like planetpdf, where they have great articles.

    It should take you a day or two to hand write your parser and you will learn a lot in the process about something pretty common. The libraries have to be general, so they are going to be limited.

    (Perhaps irrelevant, the pdfs I was working with are linearized—see the linked references—which made studying the text in the pdf and mapping to the layout on the screen super simple, I didn’t study an non-linearized files because i didn’t have to, but if it makes things harder there’s a ton of code out there to linearize a pdf but not a lot out there that can go the otherway)
    
   **Naftali Michalowsky**
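
The idea in the comment, that a "line" is only a set of text objects sharing roughly the same y position, can be sketched in a few lines of Python. This is a minimal illustration, not a PDF parser: the `(x, y, text)` tuples, the function name `group_into_lines`, and the `tolerance` value are assumptions made for the example.

```python
def group_into_lines(text_objects, tolerance=2.0):
    """Group (x, y, text) tuples into lines, top of the page first.

    PDF coordinates start at the bottom-left corner, so a larger y
    means higher on the page. Objects whose y values differ by at most
    `tolerance` from the first object of a line join that line.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(text_objects, key=lambda obj: -obj[1]):
        if lines and abs(lines[-1][0] - y) <= tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within a line, order the fragments from left to right.
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]


# Hypothetical text objects, listed in the arbitrary order a file may store them.
objects = [
    (72, 700, "first line"),
    (72, 660, "third line"),
    (72, 680, "second"),
    (150, 680, "line"),
]
lines = group_into_lines(objects)
```

Here `lines` holds the three strings in reading order, even though the objects were not stored that way.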
    
***
Another comment I found on the same page said:

    This may come to you as a surprise, but PDF was never actually intended as a format for easy text extraction. Indeed, its primary purpose is to make sure that whatever is in the document would be displayed in a consistent manner across multiple platforms, as well as print identically (as much as that is possible) everywhere. This is achieved in many different ways, and sometimes goes as far as literally saying "this letter goes here, this letter goes here, and this letter goes here" in its internal language.

    So yes, the libraries to extract text from PDF are insanely complicated by necessity, and they sometimes do not even function properly, because PDF is not meant for that.
    
   **Paulina Jonušaitė**

***

So I decided to pick a few packages and look at how well they "read" PDF files.

## PDF Samples

To explore what the tools can do, I downloaded three PDFs.  

The first, <font color="red">sample.pdf</font>, comes from this tutorial: http://programacion.net/articulo/como_trabajar_con_documentos_pdf_utilizando_python_1398.  

The second, <font color="red">pdf-sample.pdf</font>, comes from: http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf.  

The third is a paper, saved as <font color="red">paper.pdf</font>.

I created a directory named **pdfs** to hold the documents. My working directory looks like this:


    PDF_with_Python/
    ├── LICENSE
    ├── pdfs
    │   ├── paper.pdf
    │   ├── pdf-sample.pdf
    │   └── sample.pdf
    ├── PDF_with_Python.ipynb
    └── README.md

## Tools information

Several Python tools exist for manipulating PDFs. This notebook focuses on **Python 2.7**, and the following tools are reported to work with Python 2.

**But remember that _Python 2.7_ will not be maintained past 2020**.

### [PDFMiner](http://euske.github.io/pdfminer/index.html "PDFMiner")
- Last Modified: Fri Mar 28 09:17:06 UTC 2014.  
- **Python 2.x** only.  
- MIT License.  
- To install [this package](https://anaconda.org/conda-forge/pdfminer "pdfminer conda") with [conda](https://anaconda.org/ "anaconda cloud") run:
    
        conda install -c conda-forge pdfminer
        
- When I installed it with conda, the scripts were not recognized globally, so I had to install it with the following steps:
    + Install Python 2.4 or newer. (Python 3 is not supported.)
    + Download the PDFMiner source: https://pypi.python.org/pypi/pdfminer/
    + Unpack it.
    + Run setup.py to install:
         
            sudo python setup.py install

    + Do the following test:

            pdf2txt.py samples/simple1.pdf
            
            Hello

            World

            Hello
            
            World

            H e l l o

            W o r l d

            H e l l o

            W o r l d
    + Done!

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data.

PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
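
As a rough sketch of what "exact location of text" looks like in practice, the function below walks the layout objects PDFMiner produces and yields each text box with its bounding box. The function name `text_boxes_with_positions` is made up for this example, and the code assumes the PDFMiner release described above (or a compatible fork); I have not checked it against every version.

```python
def text_boxes_with_positions(path):
    """Yield (page_number, bbox, text) for every text box in a PDF.

    bbox is (x0, y0, x1, y1) in PDF points, origin at the bottom-left.
    """
    # Imported inside the function so that defining it does not
    # require PDFMiner to be installed yet.
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    # The aggregator collects the laid-out objects of each page.
    device = PDFPageAggregator(resource_manager, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)

    with open(path, "rb") as pdf_file:
        for page_number, page in enumerate(PDFPage.get_pages(pdf_file), 1):
            interpreter.process_page(page)
            for element in device.get_result():
                if isinstance(element, LTTextBox):
                    yield page_number, element.bbox, element.get_text()
```

Iterating over `text_boxes_with_positions("./pdfs/sample.pdf")` should give one tuple per text box, which can then be sorted or filtered by position.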

#### Tutorials

- [PDFMiner API Doc](http://unixuser.org/~euske/python/pdfminer/programming.html)
- [PDFMiner tutorial 1](http://denis.papathanasiou.org/posts/2010.08.04.post.html)
- [PDFMiner tutorial 2](https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167)
- [PDFMiner tutorial 3](https://wiki.carleton.edu/display/itskb/PDFMiner%3A+Extracting+Text+from+a+PDF+File)

#### Features

- Written entirely in Python. (for version 2.4 or newer)
- Parse, analyze, and convert PDF documents.
- PDF-1.7 specification support. (well, almost)
- CJK languages and vertical writing scripts support.
- Various font types (Type1, TrueType, Type3, and CID) support.
- Basic encryption (RC4) support.
- PDF to HTML conversion (with a sample converter web app).
- Outline (TOC) extraction.
- Tagged contents extraction.
- Reconstruct the original layout by grouping text chunks.
        
PDFMiner has its own drawbacks, though:

   + Getting simple things done, like extracting the text, is quite complex. The program is not designed to return Python objects, which makes interfacing with it irritating.
   + It’s an extremely complete set of tools, with multiple and moderately steep learning curves.
   + It’s not written with hackability in mind.

### [Textract](http://textract.readthedocs.io/en/latest/)
- Latest commit on 21 Jul (2017).  
- Requires **PDFMiner**, pdftotext and other external tools.  
- **Python 2**.  
- It is supposed to work with **Python 3** too, using **pdfminer.six**, but when I tried to install it with conda (for Python 3) it produced dependency conflicts.
- Unspecified License.  
- Includes **command line interface**:
        
        textract path/to/file.extension
        
- Project's **GitHub**: https://github.com/deanmalmgren/textract 
- **Documentation**: http://textract.readthedocs.io/en/latest/   
- **Documentation for Python Package**: http://textract.readthedocs.io/en/latest/python_package.html
- Current version (at the time of writing): **1.6.1** (manual install) 
- To install [this package](https://anaconda.org/conda-forge/textract "Textract conda") with [conda](https://anaconda.org/ "anaconda cloud"), run the following (installs version 1.5.0):

        conda install -c conda-forge textract 

### [PdfQuery](https://github.com/jcushman/pdfquery "PdfQuery")
- Latest commit on 29 Jul (2017).  
- Requires **PDFMiner**, pyquery and lxml libraries.  
- Includes sample code, documentation.  
- Appears to target **Python 2.x**; the project does not clearly state whether it supports Python 2.x or Python 3.x.  
- MIT License.  
- To install [this package](https://anaconda.org/jacksongs/pdfquery "PdfQuery conda") with [conda](https://anaconda.org/ "anaconda cloud") run:
    
        conda install -c jacksongs pdfquery

# Textract

First, we have to import the module:

In [1]:

    import textract

Now let's try to read a document. 

***
Let's try to read <font color="red">sample.pdf</font>:

In [2]:

    text = textract.process("./pdfs/sample.pdf")

What happened? Let's print the content of **text**:

In [3]:

    print(text)

    This is a sample PDF document I’m using to follow along with the tutorial




It seems pretty good.
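
One thing to keep in mind: under Python 2.7 the value returned here is a byte string (a later cell shows that `type(text)` is `str`), and the escape sequences in the paper output further down suggest the bytes are UTF-8. The sketch below uses a made-up byte string, not real textract output, to show why decoding with the right codec matters for characters such as the typographic apostrophe.

```python
# UTF-8 bytes for "I'm using" with a typographic apostrophe,
# written with escapes so the source stays plain ASCII.
raw = b"I\xe2\x80\x99m using"

decoded = raw.decode("utf-8")    # right codec: one apostrophe character
garbled = raw.decode("cp1252")   # wrong codec: three unrelated characters

print(repr(decoded))
print(repr(garbled))
```

Decoding once, right after extraction, and working with the decoded text from then on avoids the three-character sequences that appear when the same bytes are read as Windows-1252.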

***
Let's try with the file <font color="red">pdf-sample.pdf</font>:

In [4]:

    text = textract.process("./pdfs/pdf-sample.pdf")
    print(text)

    Adobe Acrobat PDF Files
    Adobe® Portable Document Format (PDF) is a universal file format that preserves all
    of the fonts, formatting, colours and graphics of any source document, regardless of
    the application and platform used to create it.
    Adobe PDF is an ideal format for electronic document distribution as it overcomes the
    problems commonly encountered with electronic file sharing.
    •

    Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat
    Reader. Recipients of other file formats sometimes can't open files because they
    don't have the applications used to create the documents.

    •

    PDF files always print correctly on any printing device.

    •

    PDF files always display exactly as created, regardless of fonts, software, and
    operating systems. Fonts, and graphics are not lost due to platform, software, and
    version incompatibilities.

    •

    The free Acrobat Reader is easy to download and can be freely distributed by
    anyone.

    •

    Compact PDF files are smaller than their sourc

This result is also good, although each bullet character ends up on its own line, separated from its text.
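
If the lone bullet lines get in the way, a small post-processing step can reattach them. The sketch below is one possible approach, not something textract provides: `join_bullets` is a hypothetical helper, and it assumes the extracted text has already been decoded to a unicode string.

```python
def join_bullets(text, bullet=u"\u2022"):
    """Attach each lone bullet line to the next non-empty line."""
    out = []
    pending = False  # True while a bullet is waiting for its text
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == bullet:
            pending = True
            continue
        if pending:
            if not stripped:
                continue  # skip blank lines between bullet and text
            out.append(bullet + u" " + stripped)
            pending = False
        else:
            out.append(line)
    if pending:
        out.append(bullet)  # keep a trailing bullet that has no text
    return u"\n".join(out)


# Made-up sample that mimics the layout of the output above.
sample = (u"intro\n\u2022\n\nAnyone can open a PDF file.\n"
          u"Second line.\n\n\u2022\n\nPDF files print correctly.")
cleaned = join_bullets(sample)
```

Only the first line after a bullet is merged; continuation lines are left as they are.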

***
What happens when we try to read more complex PDFs, like the **paper** we downloaded?

Let's try with the file <font color="red">paper.pdf</font>:

In [5]:

    text = textract.process("./pdfs/paper.pdf")
    print(text)

    Leukemia Research 34 (2010) 677–681

    Contents lists available at ScienceDirect

    Leukemia Research
    journal homepage: www.elsevier.com/locate/leukres

    Brief communication

    Micro-RNA-15a and micro-RNA-16 expression and chromosome 13 deletions in
    multiple myeloma
    Sophie L. Corthals a , Mojca Jongen-Lavrencic a , Yvonne de Knegt a , Justine K. Peeters a ,
    H. Berna Beverloo b , Henk M. Lokhorst c , Pieter Sonneveld a,∗
    a
    b
    c

    Department of Hematology, Erasmus Medical Centre Rotterdam, Rotterdam, The Netherlands
    Department of Clinical Genetics, Erasmus Medical Centre Rotterdam, Rotterdam, The Netherlands
    Department of Hematology, University Medical Center Utrecht, Utrecht, The Netherlands

    a r t i c l e

    i n f o

    Article history:
    Received 15 October 2009
    Received in revised form 28 October 2009
    Accepted 28 October 2009
    Available online 23 December 2009
    Keywords:
    Multiple myeloma
    Micro-RNA
    Chromosome 13 deletion

    a b s t r a c t
    We have used copy number variation (CNV) analysis with SNP mappin

In [6]:

    text

    'Leukemia Research 34 (2010) 677\xe2\x80\x93681\n\nContents lists available at ScienceDirect\n\nLeukemia Research\njournal homepage: www.elsevier.com/locate/leukres\n\nBrief communication\n\nMicro-RNA-15a and micro-RNA-16 expression and chromosome 13 deletions in\nmultiple myeloma\nSophie L. Corthals a , Mojca Jongen-Lavrencic a , Yvonne de Knegt a , Justine K. Peeters a ,\nH. Berna Beverloo b , Henk M. Lokhorst c , Pieter Sonneveld a,\xe2\x88\x97\na\nb\nc\n\nDepartment of Hematology, Erasmus Medical Centre Rotterdam, Rotterdam, The Netherlands\nDepartment of Clinical Genetics, Erasmus Medical Centre Rotterdam, Rotterdam, The Netherlands\nDepartment of Hematology, University Medical Center Utrecht, Utrecht, The Netherlands\n\na r t i c l e\n\ni n f o\n\nArticle history:\nReceived 15 October 2009\nReceived in revised form 28 October 2009\nAccepted 28 October 2009\nAvailable online 23 December 2009\nKeywords:\nMultiple myeloma\nMicro-RNA\nChromosome 13 deletion\n\na b s t r a c t\nWe hav
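
The letter-spaced headings in this output (`a r t i c l e`, `i n f o`, `a b s t r a c t`) will not match a plain search for "abstract". A small heuristic can collapse them. The helper below, `collapse_letter_spacing`, is a hypothetical sketch: it only rewrites lines made up entirely of single characters separated by single spaces (at least three of them), so it can also squash a short line such as `x = y`.

```python
import re

# A whole line of single non-space characters separated by single spaces,
# with at least three characters in total (for example "i n f o").
_SPACED_LINE = re.compile(r"^(?:\S ){2,}\S$")


def collapse_letter_spacing(text):
    """Rewrite lines such as 'a b s t r a c t' as 'abstract'."""
    fixed = []
    for line in text.splitlines():
        stripped = line.strip()
        if _SPACED_LINE.match(stripped):
            line = stripped.replace(" ", "")
        fixed.append(line)
    return "\n".join(fixed)


# Made-up sample that mimics the layout of the extracted paper.
sample = "a r t i c l e\n\ni n f o\n\nArticle history:\na b s t r a c t\na\nb"
collapsed = collapse_letter_spacing(sample)
```

Ordinary lines and isolated single letters (the affiliation markers `a`, `b`, `c`) are left untouched.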

In [7]:

    type(text)

    str

In [8]:

    text.find('miR-15')

    1312

In [10]:

    text.find('bioanal')

    -1

In [11]:

    text.find('Bioanal')

    3467

In [9]:

    text.find('physics')

    -1
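
The cells above show that `find` is case-sensitive (`'bioanal'` is not found, `'Bioanal'` is) and that it reports only the first match. A possible way around both limits is a small wrapper over the standard `re` module; `find_all` and the sample sentence are made up for this sketch.

```python
import re


def find_all(text, term, ignore_case=True):
    """Return the start offset of every occurrence of `term` in `text`."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape makes the term match literally, even if it contains
    # characters that are special in regular expressions.
    return [m.start() for m in re.finditer(re.escape(term), text, flags)]


# Made-up sentence, not taken from the paper.
sample = "Bioanalyzer results; the bioanalyzer was used."

everywhere = find_all(sample, "bioanal")
exact_case = find_all(sample, "bioanal", ignore_case=False)
missing = find_all(sample, "physics")
```

An empty list plays the role of the `-1` that `find` returns when there is no match.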

In [12]:

    str1 = "this is string example....wow!!!"
    str2 = "is"

    # find returns the index of the first match, or -1 if there is none
    print(str1.find(str2))

    2
