# Python Library Complex Table PDF Parsing Feasibility Study

A comparison of several Python libraries and one Java library for parsing tables in PDFs, arranged roughly from worst to best at handling complex tables. Tabula-java can compensate for merged cells by finding the area of the table with the most columns. Only Unstructured, Camelot, and LlamaParse fully handle complex tables with merged cells, and Camelot is the only one of the three that does so without resorting to ML models.

The libraries used below must be installed before running this notebook.

## pdfplumber

In [1]:
pdf_file = "OnePageOneTable2023.pdf"

In [2]:
import pdfplumber

with pdfplumber.open(pdf_file) as pdf:
    for page in pdf.pages:
        # extract_text() can return nothing for pages without a text layer
        text = page.extract_text() or ""
        if "Consolidated Statements of Income" in text:
            print(text)
            # extract_tables() returns one list of rows per detected table
            print(page.extract_tables())

OTHER INFORMATION MINERAL RESERVES
OPERATING GROWTH PROJECTS & REVIEW OF FINANCIAL FINANCIAL
OVERVIEW & NON-GAAP AND MINERAL
PERFORMANCE EXPLORATION RESULTS STATEMENTS
RECONCILIATIONS RESOURCES
Consolidated Statements of Income
Barrick Gold Corporation
For the years ended December 31 (in millions of United States dollars, except per share data) 2023 2022
Revenue (notes 5 and 6) $11,397 $11,013
Costs and expenses (income)
Cost of sales (notes 5 and 7) 7,932 7,497
General and administrative expenses (note 11) 126 159
Exploration, evaluation and project expenses (notes 5 and 8) 361 350
Impairment charges (notes 10 and 21) 312 1,671
Loss on currency translation 93 16
Closed mine rehabilitation (note 27b) 16 (136)
Income from equity investees (note 16) (232) (258)
Other (income) expense (note 9) (195) (268)
Income before finance items and income taxes 2,984 1,982
Finance costs, net (note 14) (170) (301)
Income before income taxes 2,814 1,681
Income tax expense (note 12) (861) (664)
Net inco
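The extracted text above keeps each table row on a single line, so a regular expression can separate the label from the trailing figures. The following is a minimal sketch, assuming every data row ends in exactly two figures; `Row`, `ROW_RE`, `to_number`, and `parse_row` are illustrative names and not part of pdfplumber.

```python
import re
from collections import namedtuple

# Hypothetical container for one parsed row of the income statement
Row = namedtuple("Row", ["label", "y2023", "y2022"])

# One figure: optional "$", digits with thousands separators, optional decimals,
# optionally wrapped in parentheses (accounting notation for negatives)
FIGURE = r"\(?\$?[\d,]+(?:\.\d+)?\)?"
ROW_RE = re.compile(rf"^(?P<label>.+?)\s+(?P<a>{FIGURE})\s+(?P<b>{FIGURE})$")


def to_number(figure):
    """Convert '7,932', '$0.72' or '(136)' to a float; parentheses mean negative."""
    negative = figure.startswith("(") and figure.endswith(")")
    value = float(figure.strip("()").replace("$", "").replace(",", ""))
    return -value if negative else value


def parse_row(line):
    """Return a Row for a data line, or None for headings without figures."""
    match = ROW_RE.match(line.strip())
    if match is None:
        return None
    return Row(match["label"], to_number(match["a"]), to_number(match["b"]))


# Lines copied from the extracted text above
sample = [
    "Revenue (notes 5 and 6) $11,397 $11,013",
    "Costs and expenses (income)",
    "Closed mine rehabilitation (note 27b) 16 (136)",
]
rows = [parse_row(line) for line in sample]
```

Section headings such as "Costs and expenses (income)" come back as `None`. Note that the column header line ending in "2023 2022" would also match the pattern, so it needs to be filtered out separately.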

## PyMuPDF

In [3]:
import pathlib

import pymupdf

with pymupdf.open(pdf_file) as doc:  # open document
    # Join the text of all pages, separated by form feeds (chr(12))
    text = chr(12).join(page.get_text() for page in doc)
    # Text blocks of the first page: (x0, y0, x1, y1, text, block_no, block_type)
    blocks = doc[0].get_text("blocks")

# Write as a binary file to support non-ASCII characters
pathlib.Path(pdf_file + ".txt").write_bytes(text.encode())

for block in blocks:
    print(block)

(49.5, 47.94342041015625, 429.6875915527344, 74.75592041015625, 'Consolidated Statements of Income\n', 0, 0)
(52.130001068115234, 81.82781982421875, 517.07666015625, 90.955322265625, ' Barrick Gold Corporation\n  \n  \n', 1, 0)
(52.130001068115234, 93.6378173828125, 551.6292114257812, 102.76531982421875, ' For the years ended December 31 (in millions of United States dollars, except per share data)\n2023\n2022\n', 2, 0)
(52.130001068115234, 106.767822265625, 553.0606689453125, 117.01531982421875, 'Revenue (notes 5 and 6)\n \n$11,397  \n$11,013 \n', 3, 0)
(52.130001068115234, 121.017822265625, 164.90603637695312, 129.955322265625, 'Costs and expenses (income)\n', 4, 0)
(52.130001068115234, 135.267822265625, 553.0606689453125, 145.51531982421875, 'Cost of sales (notes 5 and 7)\n \n7,932  \n7,497 \n', 5, 0)
(52.130001068115234, 149.517822265625, 553.0606689453125, 159.76531982421875, 'General and administrative expenses (note 11)\n \n126  \n159 \n', 6, 0)
(52.130001068115234, 163.76782226
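In the block output above, each table row arrives as one block whose text holds the cells separated by newlines. A minimal sketch of turning such blocks into rows, assuming one block per table row as seen here; `block_to_cells` is a hypothetical helper, not a PyMuPDF function.

```python
# Two block tuples copied from the output above:
# (x0, y0, x1, y1, text, block_no, block_type)
blocks = [
    (52.130001068115234, 106.767822265625, 553.0606689453125, 117.01531982421875,
     'Revenue (notes 5 and 6)\n \n$11,397  \n$11,013 \n', 3, 0),
    (52.130001068115234, 121.017822265625, 164.90603637695312, 129.955322265625,
     'Costs and expenses (income)\n', 4, 0),
]


def block_to_cells(block):
    """Split a block's text into its non-empty, whitespace-trimmed lines."""
    text = block[4]
    return [part.strip() for part in text.split("\n") if part.strip()]


# Block type 0 marks a text block (1 would be an image block)
table_rows = [block_to_cells(b) for b in blocks if b[6] == 0]
print(table_rows)
```

This only works while the PDF keeps every row in its own block. Multi-line labels or merged cells would need the coordinates in the first four tuple fields as well.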

In [4]:
import pymupdf


def save_first_page(input_pdf_path, output_pdf_path):
    """Copy the first page of a PDF into a new single-page PDF."""
    with pymupdf.open(input_pdf_path) as source, pymupdf.open() as target:
        # Page numbers are zero-based, so page 0 is the first page
        target.insert_pdf(source, from_page=0, to_page=0)
        target.save(output_pdf_path)


# Example usage
input_pdf_path = pdf_file  # Replace with your input PDF path
output_pdf_path = "OnePageOneTable.pdf"  # Replace with your output PDF path
save_first_page(input_pdf_path, output_pdf_path)


## pdfminer

In [2]:
from pdfminer.high_level import extract_text

text = extract_text(pdf_file, page_numbers=[0])
print(text.strip())

OVERVIEW

OPERATING 
PERFORMANCE

GROWTH PROJECTS & 
EXPLORATION

REVIEW OF FINANCIAL 
RESULTS

OTHER INFORMATION 
& NON-GAAP 
RECONCILIATIONS

MINERAL RESERVES 
AND MINERAL 
RESOURCES

FINANCIAL 
STATEMENTS

Consolidated Statements of Income

 Barrick Gold Corporation
 For the years ended December 31 (in millions of United States dollars, except per share data)

Revenue (notes 5 and 6)

Costs and expenses (income)

Cost of sales (notes 5 and 7)

General and administrative expenses (note 11)

Exploration, evaluation and project expenses (notes 5 and 8)

Impairment charges (notes 10 and 21)

Loss on currency translation

Closed mine rehabilitation (note 27b)

Income from equity investees (note 16)

Other (income) expense (note 9)

Income before finance items and income taxes

Finance costs, net (note 14)

Income before income taxes

Income tax expense (note 12)

Net income

Attributable to:

Equity holders of Barrick Gold Corporation 

Non-controlling interests (note 32)

Earnings (loss

In [6]:
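from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBoxHorizontal

for page_layout in extract_pages(pdf_file, page_numbers=[0]):
    for element in page_layout:
        # Keep only horizontal text boxes; skip figures, lines, and rectangles
        if isinstance(element, LTTextBoxHorizontal):
            print(element)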

<LTTextBoxHorizontal(0) 26.160,773.517,63.862,780.517 'OVERVIEW\n'>
<LTTextBoxHorizontal(1) 104.610,770.017,159.427,784.017 'OPERATING \nPERFORMANCE\n'>
<LTTextBoxHorizontal(2) 179.770,770.017,260.249,784.017 'GROWTH PROJECTS & \nEXPLORATION\n'>
<LTTextBoxHorizontal(3) 266.580,770.017,347.444,784.017 'REVIEW OF FINANCIAL \nRESULTS\n'>
<LTTextBoxHorizontal(4) 355.130,766.517,432.900,787.517 'OTHER INFORMATION \n& NON-GAAP \nRECONCILIATIONS\n'>
<LTTextBoxHorizontal(5) 444.070,766.517,517.920,787.517 'MINERAL RESERVES \nAND MINERAL \nRESOURCES\n'>
<LTTextBoxHorizontal(6) 543.310,770.017,590.735,784.017 'FINANCIAL \nSTATEMENTS\n'>
<LTTextBoxHorizontal(7) 49.500,717.244,429.684,741.244 'Consolidated Statements of Income\n'>
<LTTextBoxHorizontal(8) 52.130,689.425,385.282,709.045 ' Barrick Gold Corporation\n For the years ended December 31 (in millions of United States dollars, except per share data)\n'>
<LTTextBoxHorizontal(9) 52.130,676.295,140.074,684.295 'Revenue (notes 5 and 6)\n'>
<LTTe
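pdfminer returns labels and figures as separate text boxes, so rebuilding table rows means grouping boxes by their vertical position. The sketch below shows one way to do this with plain tuples taken from the output above (text condensed onto one line); the `group_into_rows` helper and its 2-point tolerance are assumptions, not part of pdfminer.

```python
# (x0, y0, x1, y1, text) for a few boxes copied from the output above.
# pdfminer's origin is the bottom-left corner, so larger y means higher on the page.
boxes = [
    (26.160, 773.517, 63.862, 780.517, "OVERVIEW"),
    (104.610, 770.017, 159.427, 784.017, "OPERATING PERFORMANCE"),
    (543.310, 770.017, 590.735, 784.017, "FINANCIAL STATEMENTS"),
    (49.500, 717.244, 429.684, 741.244, "Consolidated Statements of Income"),
    (52.130, 676.295, 140.074, 684.295, "Revenue (notes 5 and 6)"),
]


def group_into_rows(boxes, tolerance=2.0):
    """Group boxes whose vertical centres lie within `tolerance` points."""
    rows = []  # list of (centre_y, [boxes]) from top to bottom
    for box in sorted(boxes, key=lambda b: -(b[1] + b[3]) / 2):
        centre = (box[1] + box[3]) / 2
        if rows and abs(rows[-1][0] - centre) <= tolerance:
            rows[-1][1].append(box)
        else:
            rows.append((centre, [box]))
    # Within a row, order the cells from left to right
    return [[b[4] for b in sorted(cells, key=lambda b: b[0])] for _, cells in rows]


rows = group_into_rows(boxes)
for row in rows:
    print(row)
```

The three navigation labels share the same vertical centre even though their box heights differ, which is why the grouping compares centres and not top or bottom edges.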

## Marker
`marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 1` 

### Consolidated Statements Of Income

| Barrick Gold Corporation                                                                                                  |        |         |         |
|---------------------------------------------------------------------------------------------------------------------------|--------|---------|---------|
| For the years ended December 31 (in millions of United States dollars, except per share data)                             | 2023   | 2022    |         |
| Revenue (notes 5 and 6)                                                                                                   |        | $11,397 | $11,013 |
| Costs and expenses (income) Cost of sales (notes 5 and 7)                                                                 |        | 7,932   | 7,497   |
| General and administrative expenses (note 11)                                                                             |        | 126     | 159     |
| Exploration, evaluation and project expenses (notes 5 and 8)                                                              |        | 361     | 350     |
| Impairment charges (notes 10 and 21)                                                                                      |        | 312     | 1,671   |
| Loss on currency translation                                                                                              |        | 93      | 16      |
| Closed mine rehabilitation (note 27b)                                                                                     |        | 16      | (136)   |
| Income from equity investees (note 16)                                                                                    |        | (232)   | (258)   |
| Other (income) expense (note 9)                                                                                           |        | (195)   | (268)   |
| Income before finance items and income taxes                                                                              |        | 2,984   | 1,982   |
| Finance costs, net (note 14)                                                                                              |        | (170)   | (301)   |
| Income before income taxes                                                                                                |        | 2,814   | 1,681   |
| Income tax expense (note 12)                                                                                              |        | (861)   | (664)   |
| Net income                                                                                                                |        | $1,953  | $1,017  |
| Attributable to: Equity holders of Barrick Gold Corporation                                                               | $1,272 | $432    |         |
| Non-controlling interests (note 32)                                                                                       |        | $681    | $585    |
| Earnings (loss) per share data attributable to the equity holders of Barrick Gold Corporation (note 13)  Net income Basic |        | $0.72   | $0.24   |
| Diluted                                                                                                                   |        | $0.72   | $0.24   |
| The accompanying notes are an integral part of these consolidated financial statements.                                   |        |         |         |
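In the Marker table above, most rows place their figures in the last two columns, but a few (such as the "Attributable to" row) are shifted one column to the left. A rough sketch of a sanity check that flags such rows, assuming the most common column pattern is the correct one; `filled_columns` and `find_shifted_rows` are hypothetical helpers.

```python
from collections import Counter

# Cells after the label column for a few rows of the Marker table above
rows = {
    "Revenue (notes 5 and 6)": ["", "$11,397", "$11,013"],
    "General and administrative expenses (note 11)": ["", "126", "159"],
    "Net income": ["", "$1,953", "$1,017"],
    "Attributable to: Equity holders of Barrick Gold Corporation": ["$1,272", "$432", ""],
}


def filled_columns(cells):
    """Indexes of the non-empty cells in a row."""
    return tuple(i for i, cell in enumerate(cells) if cell.strip())


def find_shifted_rows(rows):
    """Return labels of rows whose figures sit in unusual columns."""
    patterns = {label: filled_columns(cells) for label, cells in rows.items()}
    counts = Counter(p for p in patterns.values() if p)
    if not counts:
        return []  # no row contains any figures
    # Treat the most common pattern as the expected layout
    expected, _ = counts.most_common(1)[0]
    return [label for label, p in patterns.items() if p and p != expected]


shifted = find_shifted_rows(rows)
print(shifted)
```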

## tabula-java

With the guess option (`-g`), tabula-java can find the area of the table with the most columns and return that.

```bash
java -jar ~/bin/tabula.jar -g ~/dev/python/pdfparser_comparison/OnePageOneTable2023.pdf
```

tabula-java can output CSV, TSV, or JSON. The results below have been converted to Markdown to make their quality easier to judge.


| Consolidated Statements of Income                                      |                              |                              |               |       |       |        |
|------------------------------------------------------------------------|------------------------------|------------------------------|---------------|-------|-------|--------|
| Barrick Gold Corporation                                               |                              |                              |               |       |       |        |
| "For the years ended December 31 (in millions of United States dollars, except per share data)" |                              |                              |               | 2023  |       | 2022   |
| Revenue (notes 5 and 6)                                                |                              |                              |               | $11,397 |       | $11,013 |
| Costs and expenses (income)                                            |                              |                              |               |       |       |        |
| Cost of sales (notes 5 and 7)                                          |                              |                              |               | 7,932 |       | 7,497  |
| General and administrative expenses (note 11)                          |                              |                              |               | 126   |       | 159    |
| Exploration, evaluation and project expenses (notes 5 and 8)           |                              |                              |               | 361   |       | 350    |
| Impairment charges (notes 10 and 21)                                   |                              |                              |               | 312   |       | 1,671  |
| Loss on currency translation                                           |                              |                              |               | 93    |       | 16     |
| Closed mine rehabilitation (note 27b)                                  |                              |                              |               | 16    |       | (136)  |
| Income from equity investees (note 16)                                 |                              |                              |               | (232) |       | (258)  |
| Other (income) expense (note 9)                                        |                              |                              |               | (195) |       | (268)  |
| Income before finance items and income taxes                           |                              |                              |               | 2,984 |       | 1,982  |
| Finance costs, net (note 14)                                           |                              |                              |               | (170) |       | (301)  |
| Income before income taxes                                             |                              |                              |               | 2,814 |       | 1,681  |
| Income tax expense (note 12)                                           |                              |                              |               | (861) |       | (664)  |
| Net income                                                             |                              |                              |               | $1,953 |       | $1,017 |
| Attributable to:                                                       |                              |                              |               |       |       |        |
| Equity holders of Barrick Gold Corporation                             |                              |                              |               | $1,272 |       | $432   |
| Non-controlling interests (note 32)                                    |                              |                              |               | $681  |       | $585   |
| Earnings (loss) per share data attributable to the equity holders of Barrick Gold Corporation (note 13) |                              |                              |               |       |       |        |
| Net income                                                             |                              |                              |               |       |       |        |
| Basic                                                                  |                              |                              |               | $0.72 |       | $0.24  |
| Diluted                                                                |                              |                              |               | $0.72 |       | $0.24  |
| The accompanying notes are an integral part of these consolidated financial statements. |                              |                              |               |       |       |        |
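The table above carries four columns that are empty in every row. A minimal sketch of the CSV-to-Markdown conversion that also drops such columns, using only the standard library; the CSV excerpt is a made-up sample modelled on the table above, and `drop_empty_columns` and `to_markdown` are illustrative helpers.

```python
import csv
import io

# Made-up excerpt shaped like the table above (empty cells between the
# label and the figures); the real tabula-java output may differ in detail.
csv_text = '''Revenue (notes 5 and 6),"","","","$11,397","","$11,013"
Costs and expenses (income),"","","","","",""
Cost of sales (notes 5 and 7),"","","","7,932","","7,497"
'''


def drop_empty_columns(rows):
    """Remove columns that are empty in every row (rows must have equal length)."""
    keep = [i for i in range(len(rows[0])) if any(row[i].strip() for row in rows)]
    return [[row[i] for i in keep] for row in rows]


def to_markdown(rows, header):
    """Render rows as a Markdown pipe table with the given header."""
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


rows = drop_empty_columns(list(csv.reader(io.StringIO(csv_text))))
print(to_markdown(rows, ["Item", "2023", "2022"]))
```

Cells that contain a literal `|` would need escaping before being written to Markdown; the sample has none.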


## Unstructured

In [2]:
from unstructured.partition.auto import partition
from IPython.display import HTML

elements = partition(
    pdf_file,
    skip_infer_table_types=[],  # do not skip table inference for any file type
    pdf_infer_table_structure=True,  # deprecated; see the warning in the output
    strategy="hi_res",
)
tables = [el for el in elements if el.category == "Table"]
print(f"Length of tables: {len(tables)}")
table_html = tables[0].metadata.text_as_html
HTML(table_html)

The pdf_infer_table_structure kwarg is deprecated. Please use skip_infer_table_types instead.
Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Length of tables: 1


"For the years ended December 31 (in millions of United States dollars, except per share data)","| 2023 | $11,397","2022 $11,013"
Revenue (notes 5 and 6),"| 2023 | $11,397","2022 $11,013"
Costs and expenses (income),Costs and expenses (income),Costs and expenses (income)
Cost of sales (notes 5 and 7),7932,7497
General and administrative expenses (note 11),126,159
"=xploration, evaluation and project expenses (notes 5 and 8)",361,350
impairment charges (notes 10 and 21),312,1671
OSs on currency translation,93,16
Closed mine rehabilitation (note 27b),16,(136)
Income from equity investees (note 16),(232),(258)
Other (income) expense (note 9),(195),(268)
Income before finance items and income taxes,2984,1982
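Unstructured exposes the table as an HTML string in `metadata.text_as_html`. To work with the cells as data, the HTML can be parsed with the standard library; the sketch below assumes a plain table without `rowspan` or `colspan`, and the `TableRows` class and `sample_html` string are hypothetical stand-ins for the real output.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


# Hypothetical stand-in for tables[0].metadata.text_as_html
sample_html = (
    "<table><tr><td>Cost of sales (notes 5 and 7)</td><td>7,932</td><td>7,497</td></tr>"
    "<tr><td>Closed mine rehabilitation (note 27b)</td><td>16</td><td>(136)</td></tr></table>"
)
parser = TableRows()
parser.feed(sample_html)
print(parser.rows)
```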


## Camelot

In [3]:
import camelot

tables = camelot.read_pdf(pdf_file, pages='1-end', flavor='lattice')
tables

<TableList n=1>

In [4]:
tables = camelot.read_pdf(pdf_file, flavor='stream')
print(tables[0])
print(tables[0].parsing_report)

<Table shape=(24, 3)>
{'accuracy': 100.0, 'whitespace': 16.67, 'order': 1, 'page': 1}
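The `parsing_report` dictionary printed above can serve as a simple quality gate when many tables are extracted. A minimal sketch, assuming thresholds of 90% accuracy and 50% whitespace; both values and the `looks_reliable` helper are arbitrary illustrations, not Camelot recommendations.

```python
def looks_reliable(report, min_accuracy=90.0, max_whitespace=50.0):
    """Accept a table only if its parsing report passes both thresholds."""
    return report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace


# The report printed above for the stream-flavor table
report = {"accuracy": 100.0, "whitespace": 16.67, "order": 1, "page": 1}
print(looks_reliable(report))
```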


In [5]:
tables[0].df

Unnamed: 0,0,1,2
0,Consolidated Statements of Income,,
1,Barrick Gold Corporation,,
2,For the years ended December 31 (in millions o...,2023,2022
3,Revenue (notes 5 and 6),"$11,397","$11,013"
4,Costs and expenses (income),,
5,Cost of sales (notes 5 and 7),7932,7497
6,General and administrative expenses (note 11),126,159
7,"Exploration, evaluation and project expenses (...",361,350
8,Impairment charges (notes 10 and 21),312,1671
9,Loss on currency translation,93,16


## LlamaParse

### Consolidated Statements of Income

### Barrick Gold Corporation

### For the years ended December 31 (in millions of United States dollars, except per share data)

| |2023|2022|
|---|---|---|
|Revenue (notes 5 and 6)|$11,397|$11,013|
|Costs and expenses (income)| | |
|Cost of sales (notes 5 and 7)|7,932|7,497|
|General and administrative expenses (note 11)|126|159|
|Exploration, evaluation and project expenses (notes 5 and 8)|361|350|
|Impairment charges (notes 10 and 21)|312|1,671|
|Loss on currency translation|93|16|
|Closed mine rehabilitation (note 27b)|16|(136)|
|Income from equity investees (note 16)|(232)|(258)|
|Other (income) expense (note 9)|(195)|(268)|
|Income before finance items and income taxes|2,984|1,982|
|Finance costs, net (note 14)|(170)|(301)|
|Income before income taxes|2,814|1,681|
|Income tax expense (note 12)|(861)|(664)|
|Net income|$1,953|$1,017|
|Attributable to:| | |
|Equity holders of Barrick Gold Corporation|$1,272|$432|
|Non-controlling interests (note 32)|$681|$585|
|Earnings (loss) per share data attributable to the equity holders of Barrick Gold Corporation (note 13)| | |
|Net income| | |
|Basic|$0.72|$0.24|
|Diluted|$0.72|$0.24|

The accompanying notes are an integral part of these consolidated financial statements.

### BARRICK YEAR-END 2023

117

### FINANCIAL STATEMENTS
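
LlamaParse returns Markdown, so the table still has to be converted back into rows before it can be used as data. The sketch below parses the first pipe table in a Markdown string with plain string handling; `parse_markdown_table` is a hypothetical helper, and the sample is an excerpt of the output above.

```python
def parse_markdown_table(markdown):
    """Return the rows of the first pipe table as lists of cell strings."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            if rows:
                break  # the table has ended
            continue
        # Drop the outer pipes, then split the remaining text into cells
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # separator row such as |---|---|---|
        rows.append(cells)
    return rows


sample = """| |2023|2022|
|---|---|---|
|Revenue (notes 5 and 6)|$11,397|$11,013|
|Costs and expenses (income)| | |

The accompanying notes are an integral part of these consolidated financial statements."""

header, *body = parse_markdown_table(sample)
print(header)
print(body)
```

Cells containing an escaped pipe (`\|`) are not handled; none occur in this table.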