Reading data from PDFs 
====================

<div class="overview">
   <p class="overview-title">Overview</p>
    <p>Questions</p>
      <ul>
        <li>How can I tell if I can extract data from a pdf?</li>
        <li>How can I run optical character recognition on a pdf?</li>
        <li>How can I extract information from a pdf which has character information?</li>
    </ul>
    <p>Objectives:</p>
        <ul>
            <li>Use <code>ocrmypdf</code> to make sure our PDF has recognizable characters.</li>
            <li>Use <code>tabula-py</code> to extract data from a table in a PDF.</li>
        </ul>
    <p>Keypoints:</p>
        <ul>
            <li>PDFs usually have text associated with them. If they don't, you can use <code>ocrmypdf</code> to perform optical character recognition.</li>
            <li>You can use the library <code>tabula-py</code> to extract data from tables in PDFs.</li>
        </ul>
    </div>
    
You should have the paper we are going to work with in your `pdfs` folder. The name of the file is `Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf`. We will be reading the tables on page 3.

Start by checking that you have the `pdfs` folder and the PDF. We will use the shell command `ls` for this. We put an exclamation mark at the beginning of this command because it is not Python. In a Jupyter notebook, commands that start with `!` are passed to the system shell, so they are the same commands you could type in a terminal.

In [1]:
! ls pdfs

186.full.pdf	   Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf
Delaney_paper.pdf  cyclodextrin.pdf


We're going to use a Python library called `tabula-py` to read the data in Table 1. However, this PDF doesn't have any text information in it yet; its pages are only images of text. One way you can tell is by clicking and dragging your cursor over the text in a PDF viewer like Adobe Acrobat. If the text cannot be selected and highlighted, the PDF does not contain text information. If we tried to extract the data in the table at this point, we would get an empty table.
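
The click-and-drag check can also be approximated in code. The following is a minimal sketch, not part of the original lesson: it assumes the third-party package `pypdf` is installed, and the helper names `has_text_layer` and `extract_page_texts` and the `min_chars` threshold are hypothetical choices made for illustration.

```python
from pathlib import Path


def has_text_layer(page_texts, min_chars=20):
    """Return True if any page yielded at least `min_chars` non-whitespace characters.

    `min_chars` is an arbitrary threshold chosen for this sketch.
    """
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def extract_page_texts(pdf_path):
    """Extract the text of every page with pypdf (assumed installed: pip install pypdf)."""
    from pypdf import PdfReader  # imported lazily so has_text_layer works without pypdf

    reader = PdfReader(str(pdf_path))
    # Pages that are only scanned images typically yield an empty string
    return [page.extract_text() or "" for page in reader.pages]


if __name__ == "__main__":
    pdf_path = Path("pdfs") / "Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf"
    if pdf_path.exists():
        print("Has text layer:", has_text_layer(extract_page_texts(pdf_path)))
```

If this prints `False`, the PDF most likely needs OCR before any table can be extracted from it.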

You can add text information to a PDF by performing optical character recognition, or OCR. If you have Adobe Acrobat Pro, it has a built-in OCR tool that you can use. There are also free Python tools for OCR. We'll be using one called [OCRmyPDF](https://ocrmypdf.readthedocs.io/en/latest/).

Again, this command is not Python. We can tell because it starts with an exclamation mark `!`. To use this software, we type the command `ocrmypdf` followed by the path to the PDF we would like to convert, and then the name we would like the new output file to have.

In [2]:
! ocrmypdf "pdfs/Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf"  "pdfs/pottsguyocr.pdf"

Scanning contents: 100%|███████████████████████| 6/6 [00:00<00:00, 202.22page/s]
Start processing 6 pages concurrently
OCR: 100%|██████████████████████████████████| 6.0/6.0 [00:18<00:00,  3.01s/page]
Postprocessing...
PDF/A conversion: 100%|█████████████████████████| 6/6 [00:01<00:00,  3.29page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: -0.2%
Image optimization did not improve the file - discarded
Output file is a PDF/A-2B (as expected)


In [3]:
! ls pdfs

186.full.pdf						  cyclodextrin.pdf
Delaney_paper.pdf					  pottsguyocr.pdf
Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf
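
The same `ocrmypdf INPUT OUTPUT` command can also be launched from Python code instead of a `!` cell, which is convenient when several files need processing. This is a minimal sketch using only the standard library; the helper names `build_ocr_command` and `run_ocr` are hypothetical and not part of OCRmyPDF.

```python
import shutil
import subprocess


def build_ocr_command(input_pdf, output_pdf):
    """Build the command line used in the cell above: ocrmypdf INPUT OUTPUT."""
    return ["ocrmypdf", str(input_pdf), str(output_pdf)]


def run_ocr(input_pdf, output_pdf):
    """Run ocrmypdf if it is on the PATH; return False if it is not installed."""
    if shutil.which("ocrmypdf") is None:
        print("ocrmypdf not found on PATH; skipping OCR")
        return False
    # check=True raises CalledProcessError if ocrmypdf exits with a non-zero status
    subprocess.run(build_ocr_command(input_pdf, output_pdf), check=True)
    return True
```

Calling `run_ocr("pdfs/Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf", "pdfs/pottsguyocr.pdf")` would repeat the conversion shown above. Note that `ocrmypdf` normally refuses to process an input that already contains text unless an option such as `--skip-text` is given, so the output file should not be fed back in as input.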


## Reading Tables with `tabula-py`

The `pdfs` folder now contains two versions of the paper. The new one, `pottsguyocr.pdf`, has text information. We can use the library `tabula-py` to get information from Table 1. The function we will be using is called `tabula.read_pdf`. We pass it the file path of the PDF we would like to read. We should also specify the page number of the table; otherwise, the function reads only page 1 by default.

In [4]:
import os
import tabula

In [5]:
pdf_path = os.path.join("pdfs", "pottsguyocr.pdf")

To read from a page other than page 1, we pass another argument, `pages`, to the function to specify which page contains the table we want to parse.

In [6]:
tables = tabula.read_pdf(pdf_path, pages=[3])

This will return a list of pandas dataframes. Tabula will convert each table it finds on the page into a pandas dataframe. Let's examine each of these.

In [7]:
tables[0].head()

|   | Unnamed: 0 | Compound | log P | Unnamed: 1 | II | Hy | H, | MV | R, | log Kou | log Kyex | Unnamed: 2 | log Kpep |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 |   | water | — 6.85 |   | 0.45 | 0.82 | 0.35 | 10.6 | 0.0 | — 1.38 | — 4,38 |   |   |
| 1 | ' | methanol | — 6.68 |   | 0.44 | 0.43 | 0.47 | 21.7 | 0.28 | —0.73 | — 2.42 |   | — 2.80 |
| 2 |   | methanoic acid | — 7.08 |   | 0.6 | 0.75 | 0.38 | 22.3 | 0.3 | —0.54 | — 3.93 |   | — 3.63 |
| 3 |   | ethanol | — 6.66 |   | 0.42 | 0.37 | 0.48 | 31.9 | 0.25 | —0.32 | —2.24 |   | —2.10 |
| 4 |   | ethanoic acid | —7.01 |   | 0.65 | 0.61 | 0.45 | 33.4 | 0.27 | —0.31 | — 3.28 |   | —2.90 |


In [8]:
tables[1].head()

|   | Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | coefficients show that | solutes with hydrogen-bond donating | Unnamed: 7 | Unnamed: 8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Octanol | 3.80 | 1.19 | — 5.06 | 0.84 | 88.0 | 37.0 | ability partition least well | into alkanes. | This expected | result |
| 1 |   | (0.73) | (0.13) | (0.29) |   |   |   |   |   |   |   |
| 2 |   |   |   |   |   |   |   | is, of course, completely consistent with |   | the relative hydro- |   |
| 3 | Heptane | nsd? | 0.43 | — 5.53 | 0.79 | 113.0 | 33.0 |   |   |   |   |
| 4 |   |   |   |   |   |   |   | gen bond acceptor activity | of the solvent | phases | involved |


In [9]:
tables[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  1 non-null      object 
 1   Compound    37 non-null     object 
 2   log P       37 non-null     object 
 3   Unnamed: 1  0 non-null      float64
 4   II          37 non-null     float64
 5   Hy          37 non-null     float64
 6   H,          37 non-null     float64
 7   MV          37 non-null     object 
 8   R,          37 non-null     float64
 9   log Kou     37 non-null     object 
 10  log Kyex    31 non-null     object 
 11  Unnamed: 2  0 non-null      float64
 12  log Kpep    25 non-null     object 
dtypes: float64(6), object(7)
memory usage: 3.9+ KB
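
The `object` dtype reported for columns such as `log P` suggests that pandas could not read those cells as numbers. In the preview above, the minus signs were recognized as long dashes, sometimes followed by a space. The following is a minimal sketch of how one such cell could be converted; the helper name `parse_ocr_number` is hypothetical, and returning `None` for anything that still fails to parse is an assumption made for this illustration.

```python
def parse_ocr_number(raw):
    """Convert an OCR-extracted cell such as '— 6.85' to a float, or None on failure."""
    if raw is None:
        return None
    text = str(raw).strip()
    # Replace dash-like characters (em dash, en dash, minus sign) with an ASCII hyphen
    for dash in ("\u2014", "\u2013", "\u2212"):
        text = text.replace(dash, "-")
    # Remove the stray spaces OCR left between the sign and the digits
    text = text.replace(" ", "")
    try:
        return float(text)
    except ValueError:
        # Cells such as '— 4,38' or 'nsd?' need manual inspection
        return None


for cell in ["— 6.85", "—0.73", "0.45", "— 4,38", "nsd?"]:
    print(repr(cell), "->", parse_ocr_number(cell))
```

Cells that come back as `None` are the ones worth checking by eye against the original paper.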



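Neither of these tables is usable yet. In the first, the OCR step misread several column headers and produced dash characters instead of minus signs, so many numeric columns were read as text. In the second, body text from the page was mixed into the table. We'll save both as CSV files and work on cleaning them in the next section.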

In [11]:
output_1 = os.path.join("data", "potts_table1.csv")
output_2 = os.path.join("data", "potts_table2.csv")

tables[0].to_csv(output_1)
tables[1].to_csv(output_2)
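
One practical detail: `to_csv` does not create missing folders, so the calls above fail if the `data` folder does not exist. A minimal sketch of a guard, using a hypothetical helper named `ensure_parent_dir`:

```python
import os


def ensure_parent_dir(file_path):
    """Create the folder that will contain file_path if it does not exist yet."""
    parent = os.path.dirname(file_path)
    if parent:  # an empty string means the file goes in the current directory
        os.makedirs(parent, exist_ok=True)
    return file_path


output_1 = ensure_parent_dir(os.path.join("data", "potts_table1.csv"))
output_2 = ensure_parent_dir(os.path.join("data", "potts_table2.csv"))
```

With the folder in place, the two `to_csv` calls can be run unchanged.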