# Reading PDF files into Python

<div class="questions">

### Questions

- How do I read PDF files in Python?

</div>

<div class="objectives">

### Objectives

- Use `tabula-py` to work with PDF files in Python

</div>

***Note:*** in the time it took me to figure out this code, I could have manually transcribed about 50 of these tables I reckon! Just because you *can* does not mean you *should*.

There are a few approaches to reading PDFs with Python. If the PDF is already searchable (its text can be selected) and you just want to transcribe the tables in it, then the `tabula-py` library used in this notebook is a good method.

If your PDF is just a plain image, a more versatile approach is to run OCR (optical character recognition) on the document, or to convert it to an image first. Adjust to your needs, but these workflows and libraries may be helpful:
- https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052, or 
- https://pypi.org/project/ocrmypdf/

Note that `tabula-py` requires Java! To install it on a Mac via Homebrew, follow the instructions [here](https://stackoverflow.com/questions/65601196/how-to-brew-install-java).
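
Before installing `tabula-py`, it can save some confusion to confirm that Java is actually visible from Python. The snippet below is a minimal sketch using only the standard library; the helper name `find_java` is made up for this illustration and is not part of `tabula-py`.

```python
import shutil
import subprocess


def find_java():
    """Return the path to the `java` executable, or None if it is not on PATH."""
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("Java was not found on PATH; install it before calling tabula.read_pdf")
else:
    # `java -version` writes its report to stderr rather than stdout
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    print(f"Found Java at {java_path}")
    print(result.stderr.strip())
```

If the first message is printed, install Java and restart the notebook kernel so the updated PATH is picked up.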

In [1]:
!pip install tabula-py

Collecting tabula-py
  Downloading tabula_py-2.5.1-py3-none-any.whl (12.0 MB)
Collecting distro
  Downloading distro-1.7.0-py3-none-any.whl (20 kB)
Installing collected packages: distro, tabula-py
Successfully installed distro-1.7.0 tabula-py-2.5.1


In [5]:
# https://pypi.org/project/tabula-py/

import tabula

# Read every table in the PDF into a list of DataFrames.
# header=None stops pandas from using the first row of each table as column names.
dfs = tabula.read_pdf("../userdata/G32716A3.pdf", pages="all", pandas_options={"header": None})

# Read a remote PDF into a list of DataFrames
# dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# Convert a PDF into a CSV file
# tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages="all")

# Convert all PDFs in a directory
# tabula.convert_into_by_batch("input_directory", output_format="csv", pages="all")

Got stderr: Aug 30, 2022 3:04:07 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
Aug 30, 2022 3:04:07 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
Aug 30, 2022 3:04:08 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>



In [6]:
# What type of object did read_pdf return?
type(dfs)

list
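
Because the result is a plain Python list with one DataFrame per detected table, it can help to look at how many tables were found and how big each one is before picking one by index. The sketch below uses a made-up helper, `summarise_tables`, and a small stand-in list instead of the real `dfs`, so it runs without the PDF.

```python
import pandas as pd


def summarise_tables(tables):
    """Return one (index, n_rows, n_columns) tuple per DataFrame in the list."""
    return [(i, len(t), t.shape[1]) for i, t in enumerate(tables)]


# Stand-in for the list returned by tabula.read_pdf (hypothetical data)
example_tables = [
    pd.DataFrame([[214830, "Markwitz"], [214831, "Markwitz"]]),
    pd.DataFrame([[1, 2, 3]]),
]

for i, n_rows, n_cols in summarise_tables(example_tables):
    print(f"table {i}: {n_rows} rows x {n_cols} columns")
```

Calling `summarise_tables(dfs)` on the real list would show whether the table of interest has the number of columns you expect, which matters because assigning a list of column names of the wrong length raises an error.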

In [7]:
#Import pandas to do some table manipulation
import pandas as pd

In [8]:
colnames = ["SampleID", "Project", "Season", "OrigGeo", "Lithology", "CoreName", "CoreDepth", "Geochronology"]
df = dfs[0]
df.columns=colnames
df

Unnamed: 0,SampleID,Project,Season,OrigGeo,Lithology,CoreName,CoreDepth,Geochronology
0,214830.0,53.0,,Markwitz,Quartz-garnet gneiss (PJO),,1246.75-,
1,,,,,,,,Yes
2,,,,,,,1246.6,
3,214831.0,53.0,,Markwitz,Cordierite-sillimanite-garnet,,1244.5-,
4,,,,,,,,No
5,,,,,gneiss (PJO),,1244.3,
6,214832.0,53.0,,Markwitz,Cordierite-sillimanite-garnet,,1241.4-,
7,,,,,,,,No
8,,,,,gneiss (PJO),,1241.3,
9,214833.0,53.0,,Markwitz,Sandstone – Tumblagooda,,1209.72-,


In [9]:
# Fill "forward" the columns that are only populated on the first row of each sample
df[["SampleID", "Project", "OrigGeo"]] = df[["SampleID", "Project", "OrigGeo"]].ffill()

# Group by the unique sample ID, then give every row in a group
# the first non-null value found in that group
df["Season"] = df.groupby("SampleID")["Season"].transform("first")
df["CoreName"] = df.groupby("SampleID")["CoreName"].transform("first")
df["Geochronology"] = df.groupby("SampleID")["Geochronology"].transform("first")

# Combine the text fragments where a value wraps over several lines.
# Note the separator: a space between Lithology fragments, nothing between CoreDepth fragments.
df["Lithology"] = df.groupby("SampleID")["Lithology"].transform(lambda x: " ".join(x.dropna()))
df["CoreDepth"] = df.groupby("SampleID")["CoreDepth"].transform(lambda x: "".join(x.dropna()))
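
The forward-fill and group-wise steps are easier to follow on a tiny table. The sketch below applies the same pattern to a hypothetical five-row frame (the values are invented for illustration, not taken from the PDF) in which two samples each wrap over several rows.

```python
import pandas as pd

# Hypothetical five-row fragment mimicking two wrapped PDF table entries
toy = pd.DataFrame(
    {
        "SampleID": [1.0, None, None, 2.0, None],
        "Lithology": ["Cordierite-garnet", None, "gneiss", "Sandstone", None],
        "CoreDepth": ["10.5-", None, "10.2", "8.0-", "7.5"],
        "Geochronology": [None, "Yes", None, None, "No"],
    }
)

# Step 1: copy each sample ID down onto its continuation rows
toy["SampleID"] = toy["SampleID"].ffill()

# Step 2: give every row in a group the first non-null value of that group
toy["Geochronology"] = toy.groupby("SampleID")["Geochronology"].transform("first")

# Step 3: join the text fragments within each group
toy["Lithology"] = toy.groupby("SampleID")["Lithology"].transform(lambda x: " ".join(x.dropna()))
toy["CoreDepth"] = toy.groupby("SampleID")["CoreDepth"].transform(lambda x: "".join(x.dropna()))

# Every row in a group is now identical, so dropping duplicates leaves one row per sample
tidy = toy.drop_duplicates(keep="first").reset_index(drop=True)
print(tidy)
```

After these steps each sample occupies a single row, with the wrapped lithology text joined by a space and the two halves of the depth range joined directly.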

In [12]:
#Drop all the repeated lines to get the final table
df = df.drop_duplicates(keep='first')
df

Unnamed: 0,SampleID,Project,Season,OrigGeo,Lithology,CoreName,CoreDepth,Geochronology
0,214830.0,53.0,,Markwitz,Quartz-garnet gneiss (PJO),,1246.75-1246.6,Yes
3,214831.0,53.0,,Markwitz,Cordierite-sillimanite-garnet gneiss (PJO),,1244.5-1244.3,No
6,214832.0,53.0,,Markwitz,Cordierite-sillimanite-garnet gneiss (PJO),,1241.4-1241.3,No
9,214833.0,53.0,2014.0,Markwitz,Sandstone – Tumblagooda Sandstone,Wendy-1,1209.72-1209.0,Yes
13,214834.0,53.0,,Markwitz,Sandstone – Tumblagooda Sandstone,,1134.4-1133.9,Yes
16,214835.0,53.0,,Markwitz,Sandstone – Tumblagooda Sandstone,,1055.2-1054.9,No
19,214836.0,53.0,,Markwitz,Sandstone – Tumblagooda Sandstone,,932.15-931.75,Yes
22,214837.0,53.0,,Markwitz,Sandstone – Tumblagooda Sandstone,,915.2-915,Yes
25,214839.0,53.0,,Markwitz,Sandstone – Tumblagooda Sandstone,,1082.0-1082.3,Yes
28,214840.0,53.0,,Markwitz,Sandstone – Tumblagooda Sandstone,,1072-1071.7,Yes


In [11]:
#df.to_csv()
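
To save the cleaned table, `to_csv` needs a file path; called with no arguments it only returns the CSV text as a string. The sketch below is one way this might look. It uses a two-row stand-in copied from the output above, and the file name `samples_clean.csv` and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for the cleaned table (values copied from the output above)
clean = pd.DataFrame(
    {
        "SampleID": [214830.0, 214831.0],
        "Lithology": [
            "Quartz-garnet gneiss (PJO)",
            "Cordierite-sillimanite-garnet gneiss (PJO)",
        ],
        "CoreDepth": ["1246.75-1246.6", "1244.5-1244.3"],
    }
)

# The IDs were held as floats while empty continuation rows were present;
# with no missing values left they can be stored as integers
clean["SampleID"] = clean["SampleID"].astype(int)

# Hypothetical output location; a temporary directory keeps the sketch tidy
out_path = os.path.join(tempfile.mkdtemp(), "samples_clean.csv")

# index=False leaves out the leftover row labels (0, 3, 6, ...)
clean.to_csv(out_path, index=False)

# Read the file back to confirm it round-trips
check = pd.read_csv(out_path)
print(check)
```

In the notebook itself, the equivalent call would be `df.to_csv("some_file.csv", index=False)` with a path of your choosing.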