# Demonstration of Different Tools to Extract Tabular Information from PDFs
---
In this Jupyter Notebook, we will look at different approaches to extracting tabular information from <code>.pdf</code> files.
The main tool we will look at is ***Camelot***.

In [1]:
# import useful libraries
import numpy as np
import pandas as pd
import camelot

In [2]:
# Read the PDF file; Table 4.5.1 on p. 11 serves as the demonstration
p11_table = camelot.read_pdf('Bromu EPA-HQ-OPP-2015-0535-0010-p11.pdf')

In [3]:
# Convert the extracted table into Pandas DataFrame format
p11_table_df = p11_table[0].df
p11_table_df

Unnamed: 0,0,1,2,3,4
0,Table 4.5.1: Summary of Toxicological Doses a...,,,,
1,Exposure Scenario,Point of Departure \n(POD),Uncertainty / \nFQPA Safety \nFactors,Level of Concern \nfor Risk \nAssessment,Study and Toxicological Effects
2,I\nDermal \nShort-Term \n(1-30 Days) \nnterme...,NOAEL = \n40 mg/kg/day,UFA = 10x \nUFH = 10x,Occupational LOC \nfor MOE = 100,Dermal Developmental Toxicity Study \nin Rats ...
3,I\nInhalation \nShort-Term \n(1-30 Days) \nnte...,N\nNOAEL = \n10 mg/kg/day \note: Inhalation an...,UFA = 10x \nUFH = 10x,Occupational LOC \nfor MOE = 100,Oral Developmental Study in Rats \nLOAEL = 70 ...
4,Cancer (dermal and \ninhalation),Classification: “Not Likely to be Carcinogenic...,,,


**Compared to the original table**
<img src="./tb-4-5-1.png" alt="Drawing" style="width: 600px;"/>

In [4]:
# Look at one particular cell
print(p11_table_df.iloc[3,1])

N
NOAEL = 
10 mg/kg/day 
ote: Inhalation and 
oral absorption are 
assumed to be 
equivalent.


**The original cell in the document**
<img src="./cell-3-1.png" alt="Drawing" style="width: 150px;"/>

**Note:**
1. The algorithm picks up newline characters that we do not necessarily want to keep.
    - e.g. <code>NOAEL = 10 mg/kg/day</code> is interpreted as <br  />
    <code>NOAEL = <br  />10 mg/kg/day</code>
2. The ***N*** of '*Note*' in '*Note: Inhalation and ...*' is misplaced to the top of the cell.

This problem has been reported [here](https://github.com/socialcopsdev/camelot/issues/170).<br />
The letter is misplaced because the x-positions are mishandled when *Camelot* builds the text in each cell.
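
One rough way to spot this symptom across a whole table is to flag cells in which a single letter sits alone on its own line. The sketch below is only a heuristic: `has_stray_letter` is a hypothetical helper name, not part of Camelot, and it may give false positives for cells that legitimately contain a one-letter line.

```python
def has_stray_letter(cell_text):
    """Return True if any line of the cell is a single alphabetic character."""
    lines = [line.strip() for line in cell_text.split('\n')]
    return any(len(line) == 1 and line.isalpha() for line in lines)


# The cell as extracted above, and the same cell with the letter in place
broken = ('N\nNOAEL = \n10 mg/kg/day \note: Inhalation and \n'
          'oral absorption are \nassumed to be \nequivalent.')
intact = ('NOAEL = \n10 mg/kg/day \n \nNote: Inhalation and \n'
          'oral absorption are \nassumed to be \nequivalent.')

print(has_stray_letter(broken))
print(has_stray_letter(intact))
```

Applied to each column with `Series.apply`, the same function gives a boolean mask of cells worth checking by eye.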

In [5]:
# Read the same table again, this time passing layout_kwargs through to
# PDFMiner's layout analysis with vertical-text detection turned off
p11_table_fixed = camelot.read_pdf(
    'Bromu EPA-HQ-OPP-2015-0535-0010-p11.pdf',
    layout_kwargs={'detect_vertical': False},
)

# Convert the extracted table into Pandas DataFrame format
p11_table_fixed_df = p11_table_fixed[0].df
p11_table_fixed_df

Unnamed: 0,0,1,2,3,4
0,Table 4.5.1: Summary of Toxicological Doses a...,,,,
1,Exposure Scenario,Point of Departure \n(POD),Uncertainty / \nFQPA Safety \nFactors,Level of Concern \nfor Risk \nAssessment,Study and Toxicological Effects
2,Dermal \nShort-Term \n(1-30 Days) \n \nInterm...,NOAEL = \n40 mg/kg/day,UFA = 10x \nUFH = 10x,Occupational LOC \nfor MOE = 100,Dermal Developmental Toxicity Study \nin Rats ...
3,Inhalation \nShort-Term \n(1-30 Days) \n \nInt...,NOAEL = \n10 mg/kg/day \n \nNote: Inhalation a...,UFA = 10x \nUFH = 10x,Occupational LOC \nfor MOE = 100,Oral Developmental Study in Rats \nLOAEL = 70 ...
4,Cancer (dermal and \ninhalation),Classification: “Not Likely to be Carcinogenic...,,,


In [6]:
# Look at the same cell again
print(p11_table_fixed_df.iloc[3,1])

NOAEL = 
10 mg/kg/day 
 
Note: Inhalation and 
oral absorption are 
assumed to be 
equivalent.


## Moving on (Cleaning up with pandas)
With the table in this format, we can pull out some useful information:

In [7]:
# Present the table once again
p11_table_fixed_df

Unnamed: 0,0,1,2,3,4
0,Table 4.5.1: Summary of Toxicological Doses a...,,,,
1,Exposure Scenario,Point of Departure \n(POD),Uncertainty / \nFQPA Safety \nFactors,Level of Concern \nfor Risk \nAssessment,Study and Toxicological Effects
2,Dermal \nShort-Term \n(1-30 Days) \n \nInterm...,NOAEL = \n40 mg/kg/day,UFA = 10x \nUFH = 10x,Occupational LOC \nfor MOE = 100,Dermal Developmental Toxicity Study \nin Rats ...
3,Inhalation \nShort-Term \n(1-30 Days) \n \nInt...,NOAEL = \n10 mg/kg/day \n \nNote: Inhalation a...,UFA = 10x \nUFH = 10x,Occupational LOC \nfor MOE = 100,Oral Developmental Study in Rats \nLOAEL = 70 ...
4,Cancer (dermal and \ninhalation),Classification: “Not Likely to be Carcinogenic...,,,


In [8]:
# Extracting the name of the table
table_name = p11_table_fixed_df.iloc[0,0]
print(table_name)

Table 4.5.1:  Summary of Toxicological Doses and Endpoints for use in Occupational Human Health Risk Assessments 
for Bromuconazole.
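
The extracted title still carries a line break and a doubled space, and it mixes the table number with the caption. A minimal sketch of splitting the two is shown below; `split_table_title` is a hypothetical helper, and the regular expression assumes titles of the form `Table <number>:` or `Table <number>.`, as seen in this document.

```python
import re


def split_table_title(title):
    """Split 'Table 4.5.1: Caption' into ('4.5.1', 'Caption')."""
    # Collapse newlines and repeated spaces left over from extraction
    flat = ' '.join(title.split())
    match = re.match(r'Table\s+(\d+(?:\.\d+)*)[.:]?\s*(.*)', flat)
    if match is None:
        # Not a recognisable table title: return the flattened text only
        return None, flat
    return match.group(1), match.group(2)


# Assumed raw value of the title cell (line break where the output wraps)
title = ('Table 4.5.1:  Summary of Toxicological Doses and Endpoints for use in '
         'Occupational Human Health Risk Assessments \nfor Bromuconazole.')
number, caption = split_table_title(title)
print(number)
print(caption)
```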


After extracting the table name, we can move the second row to the top as the new header.
However, the header cells contain many unnecessary newline characters:

In [9]:
p11_table_fixed_df.iloc[1]

0                           Exposure Scenario
1                  Point of Departure \n(POD)
2       Uncertainty / \nFQPA Safety \nFactors
3    Level of Concern \nfor Risk \nAssessment
4             Study and Toxicological Effects
Name: 1, dtype: object

In Python, we can define a simple function and use <code>apply()</code> from *pandas* to remove the excess newline characters.

In [10]:
def remove_header_nl(cell_text):
    return cell_text.replace('\n', '')

In [11]:
# Removing excess newline characters
p11_table_fixed_df.iloc[1].apply(remove_header_nl)

0                       Exposure Scenario
1                Point of Departure (POD)
2       Uncertainty / FQPA Safety Factors
3    Level of Concern for Risk Assessment
4         Study and Toxicological Effects
Name: 1, dtype: object

In [12]:
# Replace the header with the new one
p11_table_fixed_df.iloc[1] = p11_table_fixed_df.iloc[1].apply(remove_header_nl)

# Drop the table-title row, then promote the cleaned row to column labels
p11_table_fixed_df = p11_table_fixed_df[1:]
p11_table_fixed_df = p11_table_fixed_df.rename(columns=p11_table_fixed_df.iloc[0]).drop(p11_table_fixed_df.index[0])

# Re-index each row
p11_table_fixed_df = p11_table_fixed_df.reset_index(drop=True)
p11_table_fixed_df

Unnamed: 0,Exposure Scenario,Point of Departure (POD),Uncertainty / FQPA Safety Factors,Level of Concern for Risk Assessment,Study and Toxicological Effects
0,Dermal \nShort-Term \n(1-30 Days) \n \nInterm...,NOAEL = \n40 mg/kg/day,UFA = 10x \nUFH = 10x,Occupational LOC \nfor MOE = 100,Dermal Developmental Toxicity Study \nin Rats ...
1,Inhalation \nShort-Term \n(1-30 Days) \n \nInt...,NOAEL = \n10 mg/kg/day \n \nNote: Inhalation a...,UFA = 10x \nUFH = 10x,Occupational LOC \nfor MOE = 100,Oral Developmental Study in Rats \nLOAEL = 70 ...
2,Cancer (dermal and \ninhalation),Classification: “Not Likely to be Carcinogenic...,,,
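
The header steps above can be wrapped into one reusable function, which helps when several tables share the same layout (a title row followed by a header row). This is a sketch under that assumption; `promote_header` and the miniature `raw` frame are made up for illustration.

```python
import pandas as pd


def promote_header(df, header_row=1):
    """Use row `header_row` as column labels and keep only the rows below it."""
    # Strip the newline characters from the header cells
    header = df.iloc[header_row].str.replace('\n', '', regex=False)
    # Keep the data rows and renumber them from zero
    body = df.iloc[header_row + 1:].reset_index(drop=True)
    body.columns = header.tolist()
    return body


# Made-up miniature of an extracted table: title row, header row, one data row
raw = pd.DataFrame([
    ['Table title', ''],
    ['Exposure Scenario', 'Point of Departure \n(POD)'],
    ['Dermal', 'NOAEL = \n40 mg/kg/day'],
])
clean = promote_header(raw)
print(clean.columns.tolist())
```

The function returns a new frame and leaves its input untouched, so the raw extraction stays available for comparison.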


Similarly, we can remove the excess newline characters from all cells.
However, in some situations we might want to keep certain newline characters.
<br  />
<br  />
Here is a demonstration of how the table would look if we removed the newline characters from all cells:

In [13]:
p11_table_fixed_df = p11_table_fixed_df.replace('\n','', regex=True)
p11_table_fixed_df

Unnamed: 0,Exposure Scenario,Point of Departure (POD),Uncertainty / FQPA Safety Factors,Level of Concern for Risk Assessment,Study and Toxicological Effects
0,Dermal Short-Term (1-30 Days) Intermediate-T...,NOAEL = 40 mg/kg/day,UFA = 10x UFH = 10x,Occupational LOC for MOE = 100,Dermal Developmental Toxicity Study in Rats LO...
1,Inhalation Short-Term (1-30 Days) Intermediat...,NOAEL = 10 mg/kg/day Note: Inhalation and ora...,UFA = 10x UFH = 10x,Occupational LOC for MOE = 100,Oral Developmental Study in Rats LOAEL = 70 mg...
2,Cancer (dermal and inhalation),Classification: “Not Likely to be Carcinogenic...,,,


In [14]:
p11_table_fixed_df.iloc[1, 0]

'Inhalation Short-Term (1-30 Days)  Intermediate-Term (1-6 Months)'
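
The output above shows doubled spaces left behind by the blanket removal, and, as noted earlier, some newline characters may be worth keeping. One possible compromise, sketched below, joins wrapped lines with a single space but keeps a line break wherever the extracted cell contains a blank line. `unwrap_cell` is a hypothetical helper, and treating a blank line as a paragraph break is an assumption about this document's layout.

```python
import re


def unwrap_cell(cell_text):
    """Join wrapped lines with a space, but keep blank-line paragraph breaks."""
    # A whitespace-only line between two newlines marks a paragraph break
    paragraphs = re.split(r'\n\s*\n', cell_text)
    # Inside a paragraph, collapse every whitespace run (including '\n') to one space
    return '\n'.join(' '.join(p.split()) for p in paragraphs)


# The cell printed earlier, before any newline removal
cell = ('NOAEL = \n10 mg/kg/day \n \nNote: Inhalation and \n'
        'oral absorption are \nassumed to be \nequivalent.')
print(unwrap_cell(cell))
```

This has to run on the table before the blanket replacement, since that step removes the blank lines the function relies on.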

---
## Applying Camelot to other tables

In [15]:
# Read another PDF file; Table 10.1.1 on p. 17 serves as the demonstration
p17_table = camelot.read_pdf('Bromu EPA-HQ-OPP-2015-0535-0010-p17.pdf')
# Convert the extracted table into Pandas DataFrame format
p17_table_df = p17_table[0].df
p17_table_df

Unnamed: 0,0,1,2,3,4,5,6,7
0,Table 10.1.1. Occupational Handler Exposure a...,,,,,,,
1,Scenario,Application Rate1 \n(lb ai/cu ft),Route of \nExposure,Unit \nExposure2 \n(mg/lb ai),Units \nTreated3,Average \nDaily Dose4 \n(mg/kg/day),MOE5,Combined \nMOE6
2,Mixer/Loader – Liquids for \nGroundboom,0.01957 lb ai/A,Dermal,0.220,60 Acres,0.00374,11000,11000
3,,,Inhalation,0.000219,,0.00000372,2700000,
4,Applicator Liquids for \nGroundboom,0.01957 lb ai/A,Dermal,0.0786,60 Acres,0.00134,30000,29000
5,,,Inhalation,0.00034,,0.00000578,1700000,
6,Mixer/Loader/Applicator–\nLiquids Sprays with ...,0.0001957 lb ai/gal,Dermal,13.2,40 gals,0.00149,27000,26000
7,,,Inhalation,0.14,,0.0000159,630000,
8,Mixer/Loader/Applicator– \nLiquids Sprays with...,0.0001957 lb ai/gal,Dermal,100,40 gals,0.0113,3500,3500
9,,,Inhalation,0.03,,0.00000341,2900000,


**Here is the original table in the document**
<img src="./tb-10-1-1.png" alt="Drawing" style="width: 700px;"/>
<br  />
As we can see from the table, multiple rows share a single entry on the far left of the table.
<br />
In this table, every two rows share one entry in the left-most column.
<br />
To make it easier to extract information later, it is better to autofill the empty entries.

In [16]:
# Read the p17 PDF file again (Table 10.1.1), copying text in spanning cells vertically
p17_table_fill = camelot.read_pdf('Bromu EPA-HQ-OPP-2015-0535-0010-p17.pdf', copy_text=['v'])
# Convert the extracted table into Pandas DataFrame format
p17_table_fill_df = p17_table_fill[0].df
p17_table_fill_df

Unnamed: 0,0,1,2,3,4,5,6,7
0,Table 10.1.1. Occupational Handler Exposure a...,,,,,,,
1,Scenario,Application Rate1 \n(lb ai/cu ft),Route of \nExposure,Unit \nExposure2 \n(mg/lb ai),Units \nTreated3,Average \nDaily Dose4 \n(mg/kg/day),MOE5,Combined \nMOE6
2,Mixer/Loader – Liquids for \nGroundboom,0.01957 lb ai/A,Dermal,0.220,60 Acres,0.00374,11000,11000
3,Mixer/Loader – Liquids for \nGroundboom,0.01957 lb ai/A,Inhalation,0.000219,60 Acres,0.00000372,2700000,11000
4,Applicator Liquids for \nGroundboom,0.01957 lb ai/A,Dermal,0.0786,60 Acres,0.00134,30000,29000
5,Applicator Liquids for \nGroundboom,0.01957 lb ai/A,Inhalation,0.00034,60 Acres,0.00000578,1700000,29000
6,Mixer/Loader/Applicator–\nLiquids Sprays with ...,0.0001957 lb ai/gal,Dermal,13.2,40 gals,0.00149,27000,26000
7,Mixer/Loader/Applicator–\nLiquids Sprays with ...,0.0001957 lb ai/gal,Inhalation,0.14,40 gals,0.0000159,630000,26000
8,Mixer/Loader/Applicator– \nLiquids Sprays with...,0.0001957 lb ai/gal,Dermal,100,40 gals,0.0113,3500,3500
9,Mixer/Loader/Applicator– \nLiquids Sprays with...,0.0001957 lb ai/gal,Inhalation,0.03,40 gals,0.00000341,2900000,3500
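
If a table has already been extracted without `copy_text`, a similar result can be approximated in pandas by treating empty strings as missing values and forward-filling them. The sketch below uses a made-up miniature of the unfilled table. Unlike `copy_text`, which only applies to spanning cells, a forward fill copies values into every empty cell, so it is only safe when no cell is meant to stay blank.

```python
import pandas as pd

# Made-up miniature of the unfilled extraction: '' marks a merged cell
unfilled = pd.DataFrame({
    'Scenario': ['Mixer/Loader', '', 'Applicator', ''],
    'Route of Exposure': ['Dermal', 'Inhalation', 'Dermal', 'Inhalation'],
    'Combined MOE': ['11000', '', '29000', ''],
})

# Treat empty strings as missing, then copy the last seen value downwards
filled = unfilled.mask(unfilled == '').ffill()
print(filled)
```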
