# Python Library Complex Table PDF Parsing Feasibility Study

A comparison of several Python libraries and one Java library for parsing tables in PDF files, arranged roughly from worst to best at handling complex tables. tabula-java can compensate for merged cells by finding the area of the table with the most columns. Only Unstructured, Camelot, and LlamaParse handle complex tables with merged cells completely successfully, and Camelot is the only one of the three that does so without resorting to ML models.

Each library must be installed before this notebook can be run.

## pdfplumber

In [1]:
pdf_file = "OnePageOneTable.pdf"

In [2]:
from collections import namedtuple
import re

import pdfplumber
import pandas as pd

with pdfplumber.open(pdf_file) as f:
    for page in f.pages:
        text = page.extract_text()
        if "Consolidated Statements of Income" in text:
            print(text)
            print(page.extract_tables())

Consolidated Statements of Income
Barrick Gold Corporation Three months ended Six months ended
(in millions of United States dollars, except per share data) (Unaudited) June 30, June 30,
2015 2014 2015 2014
Revenue (notes 5 and 6) $ 2,231 $ 2,458 $ 4,476 $ 5,105
Costs and expenses (income)
Cost of sales (notes 5 and 7) 1,689 1,631 3,397 3,350
General and administrative expenses 70 82 137 185
Exploration, evaluation and project expenses (note 8) 9 7 1 05 183 205
Impairment charges (note 10B) 35 512 40 524
Loss on currency translation 33 31 31 110
Closed mine rehabilitation (19) 27 (11) 49
Loss (gain) on non-hedge derivatives (note 18D) 8 ( 44) 11 (65)
Other expense (note 10A) 32 17 14 36
Income before finance items and income taxes $ 286 $ 97 $ 674 $ 711
Finance items
Finance income 2 3 4 6
Finance costs (note 11) (194) (200) (390) ( 401)
Income (loss) before income taxes $ 94 $ (100) $ 288 $ 316
Income tax expense (note 12) (103) ( 123) (208) (412)
Net income (loss) $ (9) $ (223) $ 80 
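The text lines above can be split into a label and its amounts with a regular expression. This is a minimal sketch, assuming every data row ends with exactly four amounts; `Row`, `LINE`, and `parse_line` are names introduced here for illustration and are not part of pdfplumber. Lines where extraction inserted stray spaces inside a number (such as `9 7 1 05 183 205`) do not match and return `None`, which makes them easy to flag for manual review.

```python
import re
from collections import namedtuple

Row = namedtuple("Row", ["label", "amounts"])

# One number, e.g. 1,689 or 0.04
NUM = r"\d[\d,]*(?:\.\d+)?"
# One amount: optional "$", then a number or a parenthesised (negative) number
AMOUNT = rf"(?:\$\s*)?(?:\(\s*{NUM}\s*\)|{NUM})"
# A label that does not end in a digit, followed by exactly four amounts
LINE = re.compile(
    rf"^(?P<label>.*?[^\d\s])\s+(?P<amounts>{AMOUNT}(?:\s+{AMOUNT}){{3}})$"
)

def parse_line(line):
    """Return Row(label, amounts) for a data row, or None if it does not fit."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    amounts = tuple(re.findall(AMOUNT, match.group("amounts")))
    return Row(match.group("label"), amounts)

print(parse_line("Cost of sales (notes 5 and 7) 1,689 1,631 3,397 3,350"))
print(parse_line("Exploration, evaluation and project expenses (note 8) 9 7 1 05 183 205"))
```

The amounts are kept as raw strings so that nothing is lost before a separate cleaning step.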

## PyMuPDF

In [2]:
import pathlib

import pymupdf

with pymupdf.open(pdf_file) as doc:  # open document
    # join the text of all pages, separated by form feeds
    text = chr(12).join([page.get_text() for page in doc])
    # get_text("blocks") returns one tuple per text block:
    # (x0, y0, x1, y1, text, block_no, block_type)
    blocks = doc[0].get_text("blocks")

# write as a binary file to support non-ASCII characters
pathlib.Path(pdf_file + ".txt").write_bytes(text.encode())
for block in blocks:
    print(block)

(49.5, 47.94342041015625, 429.6875915527344, 74.75592041015625, 'Consolidated Statements of Income\n', 0, 0)
(52.130001068115234, 81.82781982421875, 517.07666015625, 90.955322265625, ' Barrick Gold Corporation\n  \n  \n', 1, 0)
(52.130001068115234, 93.6378173828125, 551.6292114257812, 102.76531982421875, ' For the years ended December 31 (in millions of United States dollars, except per share data)\n2023\n2022\n', 2, 0)
(52.130001068115234, 106.767822265625, 553.0606689453125, 117.01531982421875, 'Revenue (notes 5 and 6)\n \n$11,397  \n$11,013 \n', 3, 0)
(52.130001068115234, 121.017822265625, 164.90603637695312, 129.955322265625, 'Costs and expenses (income)\n', 4, 0)
(52.130001068115234, 135.267822265625, 553.0606689453125, 145.51531982421875, 'Cost of sales (notes 5 and 7)\n \n7,932  \n7,497 \n', 5, 0)
(52.130001068115234, 149.517822265625, 553.0606689453125, 159.76531982421875, 'General and administrative expenses (note 11)\n \n126  \n159 \n', 6, 0)
(52.130001068115234, 163.76782226
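Each tuple printed above has the form `(x0, y0, x1, y1, text, block_no, block_type)`, and the cells of a row are separated by newlines inside the text field. The sketch below splits one block into its non-empty cells; `block_to_cells` is a helper name introduced here for illustration, and the sample tuple is the "Revenue" block from the output with rounded coordinates.

```python
def block_to_cells(block):
    """Split the text of one PyMuPDF block into its non-empty, stripped lines."""
    text = block[4]  # index 4 holds the block text
    return [line.strip() for line in text.split("\n") if line.strip()]

block = (52.13, 106.77, 553.06, 117.02,
         "Revenue (notes 5 and 6)\n \n$11,397  \n$11,013 \n", 3, 0)
print(block_to_cells(block))
# ['Revenue (notes 5 and 6)', '$11,397', '$11,013']
```

This only recovers rows that PyMuPDF already grouped into a single block; it does not reconstruct merged header cells.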

In [3]:
import fitz  # PyMuPDF

# Function to save the first page of a PDF
def save_first_page(input_pdf_path, output_pdf_path):
    # Open the input PDF file
    pdf_document = fitz.open(input_pdf_path)
    
    # Create a new, empty PDF document
    new_pdf = fitz.open()
    
    # Copy only the first page (index 0) into the new PDF
    new_pdf.insert_pdf(pdf_document, from_page=0, to_page=0)
    
    # Save the new PDF
    new_pdf.save(output_pdf_path)
    new_pdf.close()
    pdf_document.close()

# Example usage
input_pdf_path = pdf_file  # Replace with your input PDF path
output_pdf_path = "OnePageOneTable.pdf" # Replace with your output PDF path
save_first_page(input_pdf_path, output_pdf_path)


## pdfminer

In [13]:
from pdfminer.high_level import extract_text

text = extract_text(pdf_file, page_numbers=[0])
print(text[0:1000])

Consolidated Statements of Income

Barrick Gold Corporation

(in millions of United States dollars, except per share data) (Unaudited)

Revenue (notes 5 and 6)
Costs and expenses (income)
Cost of sales (notes 5 and 7)
General and administrative expenses 
Exploration, evaluation and project expenses (note 8)
Impairment charges (note 10B)
Loss on currency translation
Closed mine rehabilitation
Loss (gain) on non-hedge derivatives (note 18D)
Other expense (note 10A)
Income before finance items and income taxes
Finance items
Finance income
Finance costs (note 11)
Income (loss) before income taxes
Income tax expense (note 12)
Net income (loss)
Attributable to:
Equity holders of Barrick Gold Corporation
Non-controlling interests (note 21)

Three months ended
June 30,
2014

2015

Six months ended
June 30,
2014

2015

$     

2,231

$      

2,458

$   

4,476

$        

5,105

1,689
70
97
35
33
(19)
8
32
286

$       

1,631
82
105
512
31
27
(44)
17
97

$           

3,397
137
183
40
31
(11)

In [17]:
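from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextLineHorizontal

for page_layout in extract_pages(pdf_file, page_numbers=[0]):
    for element in page_layout:
        # print every layout element that is not a single horizontal text line
        if not isinstance(element, LTTextLineHorizontal):
            print(element)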

<LTTextBoxHorizontal(0) 57.240,691.499,337.303,713.579 'Consolidated Statements of Income\n'>
<LTTextBoxHorizontal(1) 55.560,666.562,128.507,674.842 'Barrick Gold Corporation\n'>
<LTTextBoxHorizontal(2) 55.560,653.842,265.624,662.122 '(in millions of United States dollars, except per share data) (Unaudited)\n'>
<LTTextBoxHorizontal(3) 55.678,376.339,231.840,623.536 'Revenue (notes 5 and 6)\nCosts and expenses (income)\nCost of sales (notes 5 and 7)\nGeneral and administrative expenses \nExploration, evaluation and project expenses (note 8)\nImpairment charges (note 10B)\nLoss on currency translation\nClosed mine rehabilitation\nLoss (gain) on non-hedge derivatives (note 18D)\nOther expense (note 10A)\nIncome before finance items and income taxes\nFinance items\nFinance income\nFinance costs (note 11)\nIncome (loss) before income taxes\nIncome tax expense (note 12)\nNet income (loss)\nAttributable to:\nEquity holders of Barrick Gold Corporation\nNon-controlling interests (note 21)\n'>
<

## Marker
Marker was run from the command line; its Markdown output follows.

`marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 1`

### Consolidated Statements Of Income

| Barrick Gold Corporation                                                                                                                 | Three months ended   | Six months ended   |      |         |         |        |       |        |
|------------------------------------------------------------------------------------------------------------------------------------------|----------------------|--------------------|------|---------|---------|--------|-------|--------|
| (in millions of United States dollars, except per share data) (Unaudited)                                                                | June 30,             | June 30,           |      |         |         |        |       |        |
| 2015                                                                                                                                     |                      | 2014               | 2015 | 2014    |         |        |       |        |
| Revenue (notes 5 and 6)                                                                                                                  | $                    | 2,231              | $    | 2,458   | $ 4,476 | $      | 5,105 |        |
| Costs and expenses (income) Cost of sales (notes 5 and 7)                                                                                |                      | 1,689              |      | 1,631   |         | 3,397  |       | 3,350  |
| General and administrative expenses                                                                                                      | 70                   |                    | 82   |         | 137     |        | 185   |        |
| Exploration, evaluation and project expenses (note 8)                                                                                    |                      | 97                 |      | 105     |         | 183    |       | 205    |
| Impairment charges (note 10B)                                                                                                            |                      | 35                 |      | 512     |         | 40     |       | 524    |
| Loss on currency translation                                                                                                             |                      | 33                 |      | 31      |         | 31     |       | 110    |
| Closed mine rehabilitation                                                                                                               |                      | (19)               |      | 27      |         | (11)   |       | 49     |
| Loss (gain) on non-hedge derivatives (note 18D)                                                                                          |                      | 8                  |      | (44)    | 11      |        | (65)  |        |
| Other expense (note 10A)                                                                                                                 |                      | 32                 |      | 17      |         | 14     |       | 36     |
| Income before finance items and income taxes                                                                                             | 286                  |                    |      |         |         |        |       |        |
| $                                                                                                                                        | $                    | 97                 | $    | 674     | $       | 711    |       |        |
| Finance items Finance income                                                                                                             |                      | 2                  |      | 3       |         | 4      |       | 6      |
| Finance costs (note 11)                                                                                                                  |                      | (194)              |      | (200)   | (390)   |        | (401) |        |
| Income (loss) before income taxes                                                                                                        | 94                   |                    |      |         |         |        |       |        |
| $                                                                                                                                        | $                    | (100) $            | 288  | $       | 316     |        |       |        |
| Income tax expense (note 12)                                                                                                             |                      | (103)              |      | (123)   | (208)   |        | (412) |        |
| Net income (loss)                                                                                                                        | (9)                  |                    |      |         |         |        |       |        |
| $                                                                                                                                        | $                    | (223) $            | 80   | $       | (96)    |        |       |        |
| Attributable to: Equity holders of Barrick Gold Corporation                                                                              | $                    | (9)                | $    | (269) $ | 48      | $      | (181) |        |
| Non-controlling interests (note 21)                                                                                                      | $                    | -                  | $    | 46      | $       | 32     | $     | 85     |
| Earnings (loss) per share data attributable to the equity holders of Barrick Gold Corporation (note 9) Net income (loss)  Basic $ (0.01) | $                    | (0.23)             | $    | 0.04    | $       | (0.16) |       |        |
|                                                                                                                                          | Diluted              | $ (0.01)           | $    | (0.23)  | $       | 0.04   | $     | (0.16) |


## tabula-java


With the guess option (`-g`), tabula-java can find the area of the table with the most columns and return only that area.

```bash
java -jar ~/bin/tabula.jar -g ~/tabula-java/src/test/resources/technology/tabula/table_report.pdf
```
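The idea of keeping the area with the most columns can be illustrated on rows that have already been split into cells. The sketch below only illustrates that idea and is not tabula-java's detection algorithm; the function name `widest_block` and the sample rows are hypothetical.

```python
def widest_block(rows):
    """Return the longest contiguous run of rows that have the maximum cell count."""
    if not rows:
        return []
    width = max(len(row) for row in rows)
    best, current = [], []
    for row in rows:
        if len(row) == width:
            current.append(row)
            if len(current) > len(best):
                best = list(current)
        else:
            current = []  # a narrower row (e.g. a merged header) ends the run
    return best

rows = [
    ["Barrick Gold Corporation", "Three months ended", "Six months ended"],
    ["", "2015", "2014", "2015", "2014"],
    ["Revenue (notes 5 and 6)", "2,231", "2,458", "4,476", "5,105"],
]
print(widest_block(rows))
```

Rows with merged header cells have fewer cells than the body of the table, so they fall outside the selected block.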

In [2]:
from io import StringIO
import pandas as pd

from_tabula_java = """,,2015,,2014,,2015,,2014
Revenue (notes 5 and 6),$,"2,231",$,"2,458",$,"4,476",$,"5,105"
Costs and expenses (income),,,,,,,,
Cost of sales (notes 5 and 7),,"1,689",,"1,631",,"3,397",,"3,350"
General and administrative expenses,,70,,82,,137,,185
"Exploration, evaluation and project expenses (note 8)",,9 7,,105,,183,,205
Impairment charges (note 10B),,35,,512,,40,,524
Loss on currency translation,,33,,31,,31,,110
Closed mine rehabilitation,,(19),,27,,(11),,49
Loss (gain) on non-hedge derivatives (note 18D),,8,,( 44),,11,,(65)
Other expense (note 10A),,32,,17,,14,,36
Income before finance items and income taxes,$,286,$,97,$,674,$,711
Finance items,,,,,,,,
Finance income,,2,,3,,4,,6
Finance costs (note 11),,(194),,(200),,(390),,(401)
Income (loss) before income taxes,$,94,$,(100),$,288,$,316
Income tax expense (note 12),,(103),,(123),,(208),,(412)
Net income (loss),$,(9),$,(223),$,80,$,(96)
Attributable to:,,,,,,,,
Equity holders of Barrick Gold Corporation,$,(9),$,( 269),$,48,$,(181)
Non-controlling interests (note 21),$,-,$,46,$,32,$,85
Earnings (loss) per share data attributable to the equity holders of Barrick Gold Corporation (note 9),,,,,,,,
Net income (loss),,,,,,,,
Basic,$,(0.01),$,(0.23),$,0.04,$,(0.16)
Diluted,$,(0.01),$,(0.23),$,0.04,$,(0.16)"""

df = pd.read_csv(StringIO(from_tabula_java))
df

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,2015,Unnamed: 3,2014,Unnamed: 5,2015.1,Unnamed: 7,2014.1
0,Revenue (notes 5 and 6),$,2231,$,2458,$,4476,$,5105
1,Costs and expenses (income),,,,,,,,
2,Cost of sales (notes 5 and 7),,1689,,1631,,3397,,3350
3,General and administrative expenses,,70,,82,,137,,185
4,"Exploration, evaluation and project expenses (...",,9 7,,105,,183,,205
5,Impairment charges (note 10B),,35,,512,,40,,524
6,Loss on currency translation,,33,,31,,31,,110
7,Closed mine rehabilitation,,(19),,27,,(11),,49
8,Loss (gain) on non-hedge derivatives (note 18D),,8,,( 44),,11,,(65)
9,Other expense (note 10A),,32,,17,,14,,36
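Cells such as `9 7` and `( 44)` in the output above are still strings in accounting notation. The sketch below converts them to numbers; `to_number` is a helper name introduced here, and two of its rules are assumptions rather than facts about the source PDF: whitespace inside a number is treated as an extraction artifact, and a lone `-` is read as zero.

```python
import re

def to_number(cell):
    """Convert an accounting-style cell to float; return None for blank cells."""
    if cell is None:
        return None
    # Drop whitespace, "$" signs, and thousands separators (assumption: all noise)
    text = re.sub(r"[\s$,]", "", str(cell))
    if text == "":
        return None
    if text == "-":
        return 0.0  # assumption: "-" marks a nil amount
    # Parentheses denote a negative amount
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    value = float(text)
    return -value if negative else value

for raw in ["2,231", "9 7", "( 44)", "(0.01)", "-", ""]:
    print(repr(raw), "->", to_number(raw))
```

The same helper also accepts cells that keep the currency sign and a line break, such as the `"$ \n2,231"` form that some extractors produce.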


## Unstructured

In [10]:
from unstructured.partition.auto import partition
from IPython.display import HTML

elements = partition(pdf_file, 
    skip_infer_table_types=[],
    pdf_infer_table_structure=True,
    strategy="hi_res")
tables = [el for el in elements if el.category == "Table"]
print(f"Length of tables: {len(tables)}")
table_html = tables[0].metadata.text_as_html
HTML(table_html)
print(table_html)

The pdf_infer_table_structure kwarg is deprecated. Please use skip_infer_table_types instead.


Length of tables: 1
None


## Camelot

In [5]:
import camelot

tables = camelot.read_pdf(pdf_file, pages='1-end', flavor='lattice')
tables

<TableList n=0>

In [8]:
tables = camelot.read_pdf(pdf_file, flavor='stream')
print(tables[0])
print(tables[0].parsing_report)

<Table shape=(15, 6)>
{'accuracy': 95.02, 'whitespace': 28.89, 'order': 1, 'page': 1}


In [9]:
tables[0].df

Unnamed: 0,0,1,2,3,4,5
0,Barrick Gold Corporation,,Three months ended,,,Six months ended
1,"(in millions of United States dollars, except ...",,,"June 30,",,"June 30,"
2,,2015,,2014,2015,2014
3,Revenue (notes 5 and 6),"$ \n2,231",$,2458,"$ \n4,476","$ \n5,105"
4,Costs and expenses (income),,,,,
5,Cost of sales (notes 5 and 7),1689,,1631,3397,3350
6,General and administrative expenses,70,,82,137,185
7,"Exploration, evaluation and project expenses (...",97,,105,183,205
8,Impairment charges (note 10B),35,,512,40,524
9,Loss on currency translation,33,,31,31,110


## LlamaParse

### Consolidated Statements of Income

### Barrick Gold Corporation

(in millions of United States dollars, except per share data) (Unaudited)

|Description|Three months ended June 30, 2015|Three months ended June 30, 2014|Six months ended June 30, 2015|Six months ended June 30, 2014|
|---|---|---|---|---|
|Revenue (notes 5 and 6)|$ 2,231|$ 2,458|$ 4,476|$ 5,105|
|Costs and expenses (income)| | | | |
|Cost of sales (notes 5 and 7)|$ 1,689|$ 1,631|$ 3,397|$ 3,350|
|General and administrative expenses|$ 70|$ 82|$ 137|$ 185|
|Exploration, evaluation and project expenses (note 8)|$ 97|$ 105|$ 183|$ 205|
|Impairment charges (note 10B)|$ 35|$ 512|$ 40|$ 524|
|Loss on currency translation|$ 33|$ 31|$ 31|$ 110|
|Closed mine rehabilitation|($ 19)|$ 27|($ 11)|$ 49|
|Loss (gain) on non-hedge derivatives (note 18D)|$ 8|($ 44)|$ 11|($ 65)|
|Other expense (note 10A)|$ 32|$ 17|$ 14|$ 36|
|Income before finance items and income taxes|$ 286|$ 97|$ 674|$ 711|
|Finance items| | | | |
|Finance income|$ 2|$ 3|$ 4|$ 6|
|Finance costs (note 11)|($ 194)|($ 200)|($ 390)|($ 401)|
|Income (loss) before income taxes|$ 94|($ 100)|$ 288|$ 316|
|Income tax expense (note 12)|($ 103)|($ 123)|($ 208)|($ 412)|
|Net income (loss)|($ 9)|($ 223)|$ 80|($ 96)|
|Attributable to:| | | | |
|Equity holders of Barrick Gold Corporation|($ 9)|($ 269)|$ 48|($ 181)|
|Non-controlling interests (note 21)|$ 0|$ 46|$ 32|$ 85|
|Earnings (loss) per share data attributable to the equity holders of Barrick Gold Corporation (note 9)| | | | |
|Net income (loss)| | | | |
|Basic|($ 0.01)|($ 0.23)|$ 0.04|($ 0.16)|
|Diluted|($ 0.01)|($ 0.23)|$ 0.04|($ 0.16)|

The accompanying notes are an integral part of these consolidated financial statements.
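Markdown output like the table above can be loaded into records with the standard library alone. This is a minimal sketch, assuming a single pipe-delimited table whose cells contain no `|` characters; `parse_markdown_table` is a name introduced here for illustration and is not part of LlamaParse.

```python
def parse_markdown_table(markdown):
    """Parse one pipe-delimited Markdown table into a list of dicts keyed by header."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip headings and prose around the table
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the separator row (|---|---|...)
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

sample = """|Description|Three months ended June 30, 2015|Three months ended June 30, 2014|
|---|---|---|
|Closed mine rehabilitation|($ 19)|$ 27|
|Finance items| | |"""

for record in parse_markdown_table(sample):
    print(record)
```

Section rows such as "Finance items" come back with empty strings for their amounts, so they can be filtered out or kept as group labels.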