# Importing data from PDF Files

Tables in PDFs can be extracted with the [tabula-py](https://tabula-py.readthedocs.io/en/latest/tabula.html) package (imported as `tabula`), a Python wrapper around tabula-java that requires a Java runtime. The examples below use PDF files from the Railroad Commission of Texas [archive](https://www.rrc.state.tx.us/oil-gas/research-and-statistics/well-information/monthly-drilling-completion-and-plugging-summaries/archive-monthly-drilling-completion-and-plugging-summaries-archive/) of monthly drilling, completion, and plugging summaries.
***
The PDF file for February 2019 contains the following formatted table:

![image](Images/RRC.png)

In [19]:
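import tabula

# Read the table on page 1 into a pandas DataFrame.
# Note: tabula-py 2.0+ returns a list of DataFrames by default; there, take element [0].
PDFdf = tabula.read_pdf("ogdc0219.pdf", pages='1')
PDFdf.head()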

Unnamed: 0.1,Unnamed: 0,State Totals\rFebruaryJan-FebJan-Feb\r201920192018,Activity by District For February\r0102030405066E7B7C088A0910,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,New field Discoveries*,1.0,2.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Oil,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Gas,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Drilling Overview**,,,,,,,,,,,,,,,,
4,New Drill Dry/Completions,709.0,1654.0,1521.0,107.0,90.0,44.0,8.0,6.0,14.0,0.0,26.0,68.0,247.0,37.0,34.0,28.0


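The imported DataFrame needs further cleaning before it is usable: the multi-line PDF header collapses into two garbled column names (note the `\r` characters), the remaining columns come through as `Unnamed: N`, and section headings such as `Drilling Overview**` become rows of missing values.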
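
One possible cleanup is sketched below. It is not part of the original notebook: the function name `tidy_summary` and the column names are hypothetical, and the district order is an assumption read off the collapsed header string (`01 02 03 04 05 06 6E 7B 7C 08 8A 09 10`). The small `sample` frame reuses three rows from the output above so the block runs without the PDF.

```python
import pandas as pd

# Hypothetical column names; the district order is inferred from the garbled header.
TOTAL_COLUMNS = ["month_2019", "ytd_2019", "ytd_2018"]
DISTRICT_COLUMNS = ["01", "02", "03", "04", "05", "06", "6E",
                    "7B", "7C", "08", "8A", "09", "10"]
NUMERIC_COLUMNS = TOTAL_COLUMNS + DISTRICT_COLUMNS


def tidy_summary(raw: pd.DataFrame) -> pd.DataFrame:
    """Rename the columns, drop section-heading rows, and index by activity."""
    expected = 1 + len(NUMERIC_COLUMNS)
    if raw.shape[1] != expected:
        raise ValueError(f"expected {expected} columns, got {raw.shape[1]}")
    tidy = raw.copy()
    tidy.columns = ["activity"] + NUMERIC_COLUMNS
    # Section headings such as "Drilling Overview**" are empty in every numeric column.
    tidy = tidy.dropna(how="all", subset=NUMERIC_COLUMNS)
    # Remove the footnote asterisks from the row labels.
    tidy["activity"] = tidy["activity"].str.rstrip("*").str.strip()
    return tidy.set_index("activity")


# Three rows copied from the February 2019 output shown above.
sample = pd.DataFrame(
    [
        ["New field Discoveries*", 1.0, 2.0, 3.0, 1.0] + [0.0] * 12,
        ["Drilling Overview**"] + [float("nan")] * 16,
        ["New Drill Dry/Completions", 709.0, 1654.0, 1521.0, 107.0, 90.0, 44.0,
         8.0, 6.0, 14.0, 0.0, 26.0, 68.0, 247.0, 37.0, 34.0, 28.0],
    ]
)

print(tidy_summary(sample)[TOTAL_COLUMNS])
```

With the real data, the call would be `tidy_summary(PDFdf)`, provided the extracted frame has the same 17 columns as the output shown here.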

***
PDFs can also be read directly from a URL:

In [18]:
PDFdf2 = tabula.read_pdf("https://www.rrc.state.tx.us/media/50312/ogdc0119.pdf")
PDFdf2.head()

Unnamed: 0.1,Unnamed: 0,State Totals\rJanuaryJan-JanJan-Jan\r201920192018,Activity by District For January\r0102030405066E7B7C088A0910,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,New field Discoveries*,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Oil,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Gas,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Drilling Overview**,,,,,,,,,,,,,,,,
4,New Drill Dry/Completions,945.0,945.0,776.0,141.0,108.0,40.0,15.0,12.0,25.0,0.0,21.0,35.0,437.0,46.0,39.0,26.0
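
The outputs above show `read_pdf` returning a single DataFrame, which matches older tabula-py releases. From version 2.0 onward the default return value is a list of DataFrames, one per detected table. A small helper along the following lines lets the same notebook code work either way; `first_table` is a hypothetical name, not part of tabula-py.

```python
import pandas as pd


def first_table(result):
    """Return one DataFrame whether read_pdf gave a DataFrame or a list of them."""
    if isinstance(result, pd.DataFrame):
        return result
    if not result:
        raise ValueError("no tables were detected in the PDF")
    return result[0]


# Usage (requires tabula-py and a Java runtime):
# import tabula
# PDFdf2 = first_table(tabula.read_pdf("https://www.rrc.state.tx.us/media/50312/ogdc0119.pdf"))
# PDFdf2.head()
```

Taking only the first table is a simplification; a page with several tables would need the full list.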


***
The tabula-py package also provides functions for converting PDFs into CSV files:

In [ ]:
# convert PDF into CSV
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')

# convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')
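
After a batch conversion, the resulting CSV files can be loaded back into pandas. The sketch below assumes the CSV files are written alongside the PDFs in the same directory and that each file parses as a regular CSV; `load_converted_tables` is a hypothetical helper, not a tabula-py function.

```python
from pathlib import Path

import pandas as pd


def load_converted_tables(directory):
    """Read every CSV file in `directory` into a dict keyed by file name stem."""
    tables = {}
    for csv_path in sorted(Path(directory).glob("*.csv")):
        tables[csv_path.stem] = pd.read_csv(csv_path)
    return tables


# Usage, after running convert_into_by_batch on "input_directory":
# tables = load_converted_tables("input_directory")
# tables["ogdc0219"].head()
```

Files with ragged rows, which can happen with multi-table pages, may need extra `read_csv` options or manual cleanup first.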