## <h1>Table Extraction Using Python</h1>

Have you ever wanted to read your bank statements, sales data, or a company's financial results into Python for analysis? Perhaps you want to understand your spending habits better, or you have several bank accounts and want to analyse all the statements in one place.

Tables are ubiquitous: bank statements, timetables, sales data, financial results, and so on. These tables are sometimes embedded in a PDF file or in an image. Converting a small amount of data to an Excel sheet, an SQL database, or another format for analysis may be easy, but it becomes impractical when there is a lot of data or when the data is scattered across many files. In such cases Python comes to the rescue with several packages in its quiver, which are compared below.

### <h2> Let's discuss the process flow. </h2>
1. Load the PDF or image.
2. Extract the table.
3. Convert the table into a pandas DataFrame.
4. Measure the time taken for the process to complete.
5. Compare the accuracy of all the Python packages.
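The timing step can be wrapped in a small helper so that every package is measured the same way. This is a minimal sketch: `timed` and `fake_extract` are hypothetical names introduced here for illustration, and `fake_extract` merely stands in for a real call such as `doc.extract_tables(...)`.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_extract(path: str) -> list:
    """Hypothetical stand-in for a real extractor; returns one tiny table."""
    return [[["name", "votes"], ["Jalsa", "775"]]]


tables, seconds = timed(fake_extract, "example.pdf")
print(f"{len(tables)} table(s) extracted in {seconds:.4f} s")
```

Using `time.perf_counter()` rather than `time.time()` avoids jumps caused by system clock adjustments, which matters for measurements of only a few seconds.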




Comparison metrics:

1. Speed of conversion
2. Accuracy
3. Versatility: support for images, image-based PDFs, and text-based PDFs
4. Resource utilization and support for CPU and GPU
5. Ease of post-processing, i.e. conversion to CSV, Excel, or DataFrames
6. Structural fidelity, i.e. preservation of the table layout for merged cells (this is for future work)
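The notebook does not define how accuracy is scored, so the following is only one possible interpretation: the fraction of cells in a hand-checked reference table that the extractor reproduces exactly. The function name `cell_accuracy` and the sample data are assumptions made for illustration. A DataFrame can be passed in as `df.values.tolist()`.

```python
from typing import Sequence


def cell_accuracy(expected: Sequence[Sequence[str]],
                  extracted: Sequence[Sequence[str]]) -> float:
    """Fraction of expected cells reproduced exactly (ignoring outer whitespace)."""
    total = sum(len(row) for row in expected)
    if total == 0:
        return 1.0  # nothing to get wrong
    correct = 0
    for r, exp_row in enumerate(expected):
        # A missing row or cell in the extraction counts as a miss.
        ext_row = extracted[r] if r < len(extracted) else []
        for c, exp_cell in enumerate(exp_row):
            ext_cell = ext_row[c] if c < len(ext_row) else None
            if ext_cell is not None and str(ext_cell).strip() == str(exp_cell).strip():
                correct += 1
    return correct / total


truth = [["name", "rate"], ["Jalsa", "4.1/5"]]
ocr = [["hname", "rate"], ["Jalsa", "4.1/5"]]  # one misread header cell
score = cell_accuracy(truth, ocr)
print(f"cell accuracy: {score:.2f}")
```

Exact matching is strict: a single stray character makes the whole cell count as wrong. A character-level similarity would be more forgiving but also harder to interpret.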

### We are going to compare the Python packages below.
1. img2table
2. pdf2table
3. pdfplumber
4. pymupdf
5. tabula-py    -- slower than pdfplumber and pymupdf
6. marker-pdf   -- <font color='orange'> Not so impressive </font>
7. camelot      -- <font color='orange'> Not so impressive </font>

## img2table

In [None]:
!pip install -qU img2table
# !pip install -qU img2table[surya]


In [None]:
from img2table.document import Image, PDF
from img2table.ocr import TesseractOCR, EasyOCR, SuryaOCR

from IPython.display import display, HTML, IFrame
import IPython.display
from ipywidgets import HBox, VBox, Output
import pandas as pd
import time

#### Testing with Table-2 jpg and its pdf

In [None]:
doc = Image("/content/Table-2.jpg")
ocr = TesseractOCR(n_threads=1, lang="eng")
p = time.time()
extracted_tables = doc.extract_tables(ocr=ocr, min_confidence=50)
print("time taken for conversion from the Image:", time.time()-p)

doc_pdf = PDF("/content/table-2.pdf")
p = time.time()
ocr = TesseractOCR(n_threads=1, lang="eng")
tables_pdf = doc_pdf.extract_tables(ocr=ocr)
print("time taken for conversion from the Image inscribed in a PDF:", time.time()-p)

# tables_pdf[0][0].df

time taken for conversion from the Image: 3.486722230911255
time taken for conversion from the Image inscribed in a PDF: 3.7605626583099365


In [None]:
img_out = Output()
img_df_out = Output()
pdf_df_out = Output()
with img_out:
    print("Image")
    display(IPython.display.Image('/content/Table-2.jpg'))
with img_df_out:
    print("Table extracted from image")
    display(extracted_tables[0].df)
with pdf_df_out:
    print("Table extracted from Image inscribed in a pdf")
    display(tables_pdf[0][0].df)
display(HBox([img_out, img_df_out, pdf_df_out]))


#### Testing with file-2 or "Population history table"

In [None]:
## This is a text based PDF table.
doc = PDF("/content/Population history of India - Sheet.pdf")
ocr = TesseractOCR(n_threads=1, lang="eng")
tables = doc.extract_tables(ocr=ocr)


In [None]:
# display(Image('/content/Most popular 1000 Youtube videos short.jpg'))
doc = PDF("/content/Population history of India - Sheet text_based.pdf")
ocr = TesseractOCR(n_threads=1, lang="eng")
tables = doc.extract_tables(ocr=ocr)


img_out = Output()
pdf_df_out = Output()
with img_out:
    print("Image")
    display(IPython.display.Image('/content/Population history of India - Sheet.jpg'))
with pdf_df_out:
    print("Table extracted from text inscribed in a pdf")
    display(tables[0][0].df)
display(HBox([img_out, pdf_df_out]))


In [None]:
p = time.time()
tables = doc.extract_tables()
print("time taken for conversion from the Image inscribed in a PDF:", time.time()-p)

time taken for conversion from the Image inscribed in a PDF: 7.154954195022583


## pdf2table

It is supposed to work right out of the box, as it did when I tested it previously, but as of today (22.05.2025) the package is not working. (Tested on Colab (Ubuntu) as well as on Mac.) \
<font color='#FF8C00'>Error : found duplicate columns </font>


Someone has already opened an issue on the project's GitHub page: https://github.com/li-rongzhi/pdf2table/issues/4

In [None]:
!apt install poppler-utils
!pip install pdf2table==0.1.3

In [None]:
import cv2
import pdf2table
from pdf2table import Driver
from pdf2table.document import Image, PDF
import pandas as pd
import time
from glob import glob

driver = Driver()


In [None]:

# Extract tables from every PDF in the working directory.
# extract_tables returns a list of DataFrames, but this is a computationally expensive operation.

p = time.perf_counter()
tables_from_text_pdf = {}
for pdf in glob("*.pdf"):
    print(pdf)
    try:
        # Store the result in the dict defined above so it can be inspected later.
        tables_from_text_pdf[pdf] = driver.extract_tables(pdf)
        display(tables_from_text_pdf[pdf][1][0].T)
    except Exception as e:  # report the failure and move on to the next file
        print(f"Error processing {pdf}: {e}")
print("time taken for pdf2table to extract table from text pdf with GPU:", time.perf_counter()-p)

p = time.perf_counter()
tables_from_image = {}
for idx, image in enumerate(glob("*.jpg")):
    print(image)
    try:
        tables_from_image[image] = driver.extract_tables(image)
        display(tables_from_image[image][0].T)
    except Exception as e: # Catch the exception
        print(f"Error processing {image}: {e}") # Print the error

print("time taken for pdf2table to extract table from image with GPU:", time.perf_counter()-p)

In [None]:
tables_from_text_pdf, tables_from_image_pdf, tables_from_image

({}, {}, {'Zomato-data-short.jpg': []})
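Empty dictionaries like these are easy to overlook when the loop only prints exceptions. As a sketch, failures can be collected alongside results so that they can be inspected afterwards. `extract_all` and `flaky_extract` are hypothetical names; `flaky_extract` stands in for a call like `driver.extract_tables` and is not part of pdf2table.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def extract_all(extract: Callable[[str], Any],
                paths: Iterable[str]) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Apply extract to each path; return (results, errors), both keyed by path."""
    results: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception as exc:
            # Keep the exception type as well: a NameError (a bug in our own code)
            # reads very differently from an error raised inside the library.
            errors[path] = f"{type(exc).__name__}: {exc}"
    return results, errors


def flaky_extract(path: str) -> list:
    """Hypothetical extractor that fails on images, for demonstration only."""
    if path.endswith(".jpg"):
        raise ValueError("found duplicate columns")
    return [["name"], ["Jalsa"]]


results, errors = extract_all(flaky_extract, ["a.pdf", "b.jpg"])
print("succeeded:", list(results))
print("failed:", errors)
```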

In [None]:
display(tables_from_text_pdf['Zomato-data-short text_based.pdf'][1][0].T)
display(tables_from_image_pdf['Zomato-data-short image_based.pdf'][1][0].T)
display(tables_from_image['Zomato-data-short.jpg'][0].T)

Unnamed: 0,0,1,2,3,4,5,6
0,hname,online_order,book_table,rate,votes,approx_cost(for two people),listed_in(type)
1,Jalsa,Yes,Yes,4.1/5,775,800,Buffet
2,Spice Elephant,Yes,No,4.1/5,787,800,Buffet
3,San Churro Cafe,Yes,No,3.8/5,918,800,Buffet
4,Addhuri Udupi Bhojana,No,No,3.7/5,88,300,Buffet
5,Grand Village,No,No,3.8/5,166,600,Buffet
6,Timepass Dinner,Yes,No,3.8/5,286,600,Buffet
7,Rosewood International Hotel Bar & Restaurant,No,No,3.6/5,8,800,Buffet
8,Onesta,Yes,Yes,4.6/5,2556,600,Cafes
9,Penthouse Cafe,Yes,No,4.0/5,324,700,other


Unnamed: 0,0,1,2,3,4,5,6
0,name,online_ order,book_table,rate,votes,approx_cost(for two people),listed_in(type)
1,Jalsa,Yes,Yes,4.1/5,775,800,Buffet
2,Spice Elephant,Yes,No,4.1/5,787,800,Buffet
3,San Churro Cafe,Yes,No,3.8/5,918,800,Buffet
4,Addhuri Udupi Bhojana,No,No,3.7/5,88,300,Buffet
5,Grand Village,No,No,3.8/5,166,600,Buffet
6,Timepass Dinner,Yes,No,3.8/5,286,600,Buffet
7,Rosewood International Hotel E Bar & Restaurant,No,No,3.6/5,8,800,Buffet
8,Onesta,Yes,Yes,4.6/5,2556,600,Cafes
9,Penthouse Cafe,Yes,No,4.0/5,324,700,other


Unnamed: 0,0,1,2,3,4,5,6
0,hname,online_order,book_table,rate,votes,approx_cost(for two people),listed_in(type)
1,Jalsa,Yes,Yes,4.1/5,775,800,Buffet
2,Spice Elephant,Yes,No,4.1/5,787,800,Buffet
3,San Churro Cafe,Yes,No,3.8/5,918,800,Buffet
4,Addhuri Udupi Bhojana,No,No,3.7/5,88,300,Buffet
5,Grand Village,No,No,3.8/5,166,600,Buffet
6,Timepass Dinner,Yes,No,3.8/5,286,600,Buffet
7,Rosewood International Hotel Bar & Restaurant,No,No,3.6/5,8,800,Buffet
8,Onesta,Yes,Yes,4.6/5,2556,600,Cafes
9,Penthouse Cafe,Yes,No,4.0/5,324,700,other
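The three outputs above look identical at a glance but differ in a few cells. A small helper can list the differing positions. This is a sketch that assumes both tables have the same shape; `diff_cells` is a hypothetical name, and the sample rows are copied from the first two outputs above (header row and the Rosewood row, first three columns only).

```python
from typing import List, Sequence, Tuple


def diff_cells(a: Sequence[Sequence[str]],
               b: Sequence[Sequence[str]]) -> List[Tuple[int, int, str, str]]:
    """Return (row, col, cell_a, cell_b) for every position where a and b differ.

    Assumes equal shapes; zip() silently ignores any extra rows or cells.
    """
    diffs = []
    for r, (row_a, row_b) in enumerate(zip(a, b)):
        for c, (cell_a, cell_b) in enumerate(zip(row_a, row_b)):
            if cell_a != cell_b:
                diffs.append((r, c, cell_a, cell_b))
    return diffs


first = [
    ["hname", "online_order", "book_table"],
    ["Rosewood International Hotel Bar & Restaurant", "No", "No"],
]
second = [
    ["name", "online_ order", "book_table"],
    ["Rosewood International Hotel E Bar & Restaurant", "No", "No"],
]

differences = diff_cells(first, second)
for row, col, left, right in differences:
    print(f"row {row}, col {col}: {left!r} != {right!r}")
```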


#### Conclusion:



pdf2table works well for extracting tables from PDFs when only the Driver() call is used. However, it fails when we try to extract tables using PDF() or Image().

## pdfplumber

### Somewhat usable, similar to img2table.

### This works only with PDF files containing text-based tables, not image-based ones. It cannot extract tables from scanned images.

In [None]:
!apt install poppler-utils

!pip install pdfplumber
!pip install pdf2image



In [None]:
import pdfplumber
import pandas as pd
import time
from glob import glob
from IPython.display import IFrame, HTML, display
import pdf2image
import PIL
from ipywidgets import HBox, VBox, Output



In [None]:
# prompt: write a function to display pdf

def disp_pdf(file_path):
    """
    Displays a PDF file in the output of a Colab notebook.
    Args:
        file_path (str): The path to the PDF file.
    """
    images = pdf2image.convert_from_path(file_path)
    # display(IFrame(file_path, width=900, height=800))
    for image in images:
        display(image.resize((1500, 1000)))


In [None]:
# file_name = "Population history of India - Sheet1.pdf"
tables_from_text_pdf = {}
p = time.perf_counter()
for file_name in glob("*.pdf"):
    with pdfplumber.open(file_name) as doc:
        print(file_name)
        for page in doc.pages:
            table = page.extract_tables()
            if len(table) == 0:
                continue
            # Keep only the first table on the page; a later page overwrites an earlier one.
            tables_from_text_pdf[file_name] = pd.DataFrame(table[0])
            # display(pd.DataFrame(table[0]))
            print(" ")
            # temp = [pd.DataFrame(table) for table in  page.extract_tables()]

print("total time for reading all the docs once is ", time.perf_counter()-p)

netflix_titles_short image_based.pdf
Most popular 1000 Youtube videos short image_based.pdf
Most popular 1000 Youtube videos_short text_based.pdf
 
Zomato-data-short text_based.pdf
 
Zomato-data-short image_based.pdf
netflix_titles_short text_based.pdf
 
Population history of India - Sheet.pdf
 
table-2.pdf
total time for reading all the docs once is  1.5340221410001504
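pdfplumber's `extract_tables()` returns each table as a list of rows, with `None` for empty cells, so `pd.DataFrame(table[0])` leaves the header as an ordinary data row. The sketch below separates the header from the body and makes duplicate or empty column names unique. `split_header` is a hypothetical helper, and the naming scheme (`rate_1`, `col_3`) is an arbitrary choice made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def split_header(
    table: Sequence[Sequence[Optional[str]]],
) -> Tuple[List[str], List[List[str]]]:
    """Split a raw table into (unique column names, data rows)."""
    if not table:
        return [], []
    seen: Dict[str, int] = {}
    header: List[str] = []
    for i, cell in enumerate(table[0]):
        name = (cell or "").strip() or f"col_{i}"  # empty header cell -> placeholder
        count = seen.get(name, 0)
        seen[name] = count + 1
        # Second and later occurrences get a numeric suffix.
        header.append(name if count == 0 else f"{name}_{count}")
    # Replace None cells with empty strings in the body.
    rows = [["" if cell is None else cell for cell in row] for row in table[1:]]
    return header, rows


raw = [["name", "rate", "rate", None], ["Jalsa", "4.1/5", None, "Buffet"]]
header, rows = split_header(raw)
print(header)
print(rows)
# With pandas installed: pd.DataFrame(rows, columns=header)
```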


In [None]:
pdf_out = Output()
pdf_df_out = Output()
boxes = []
for key in tables_from_text_pdf.keys():
    pdf_out = Output()
    pdf_df_out = Output()
    with pdf_out:
        print("/content/"+key)
        # display(IPython.display.Image('/content/Table-2.jpg'))
        # display(disp_pdf("/content/"+key))
        disp_pdf("/content/"+key)
    with pdf_df_out:
        print("Table extracted from Image inscribed in a pdf")
        display(tables_from_text_pdf[key])
    display(HBox([pdf_out, pdf_df_out]))
#     boxes.append(HBox([pdf_out, pdf_df_out]))

# display(VBox(boxes))



## pymupdf

### This works only with PDF files containing text-based tables, not image-based ones. It cannot extract tables from scanned images.
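Because of this limitation, it helps to decide up front which packages are worth trying on a given file. The sketch below is an assumption about how such routing could look, not something this notebook runs. `candidate_packages` and `pdf_has_text_layer` are hypothetical helpers, and treating any non-empty `page.get_text()` result as "text-based" is a simplification.

```python
from pathlib import Path
from typing import List

TEXT_ONLY = ["pdfplumber", "pymupdf"]  # need a text layer
OCR_CAPABLE = ["img2table"]            # can fall back to OCR


def candidate_packages(path: str, has_text_layer: bool) -> List[str]:
    """Return the packages worth trying for this file."""
    is_pdf = Path(path).suffix.lower() == ".pdf"
    if is_pdf and has_text_layer:
        return TEXT_ONLY + OCR_CAPABLE
    # Images and scanned (image-based) PDFs need OCR.
    return list(OCR_CAPABLE)


def pdf_has_text_layer(path: str) -> bool:
    """True if any page of the PDF yields extractable text."""
    import pymupdf  # imported lazily so candidate_packages works without it

    with pymupdf.open(path) as doc:
        return any(page.get_text().strip() for page in doc)


print(candidate_packages("Zomato-data-short text_based.pdf", has_text_layer=True))
print(candidate_packages("Zomato-data-short image_based.pdf", has_text_layer=False))
print(candidate_packages("Table-2.jpg", has_text_layer=False))
```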

In [None]:
# !apt install tesseract-ocr
!apt install poppler-utils

!pip install -qU pymupdf
!pip install pdf2image


In [None]:
import pymupdf
import time
import pandas as pd
import pdf2image
from glob import glob
from IPython.display import IFrame, HTML, display
import PIL
from ipywidgets import HBox, VBox, Output


In [None]:
# prompt: write a function to display pdf

def disp_pdf(file_path):
    """
    Displays a PDF file in the output of a Colab notebook.
    Args:
        file_path (str): The path to the PDF file.
    """
    images = pdf2image.convert_from_path(file_path)
    # display(IFrame(file_path, width=900, height=800))
    for image in images:
        display(image.resize((1500, 1000)))


In [None]:
# file_name = "/content/Most popular 1000 Youtube videos_short text_based.pdf"
# with pymupdf.open(file_name) as doc:
#     for page in doc:
#         # page = doc[0]  # this is the first page
#         # Look for tables on this page and display the table count
#         print(file_name)
#         tabs = page.find_tables()
#         # print(f"{len(tabs.tables)} table(s) on {page}")
#         display(tabs[0].to_pandas())
#         print("")


In [None]:
tables_from_text_pdf = {}
p = time.time()
for file_name in glob("*.pdf"):
    with pymupdf.open(file_name) as doc:
        print(file_name)
        for page in doc:
            table = page.find_tables()
            if len(table.tables) == 0:
                continue
            # Keep only the first table on the page; a later page overwrites an earlier one.
            tables_from_text_pdf[file_name] = table[0].to_pandas()
            # display(pd.DataFrame(table[0]))
            print(" ")
            # temp = [pd.DataFrame(table) for table in  page.extract_tables()]

print("total time for reading all the docs once is ", time.time()-p)

netflix_titles_short image_based.pdf
Most popular 1000 Youtube videos short image_based.pdf
Most popular 1000 Youtube videos_short text_based.pdf
 
Zomato-data-short text_based.pdf
 
Zomato-data-short image_based.pdf
netflix_titles_short text_based.pdf
 
Population history of India - Sheet.pdf
 
table-2.pdf
total time for reading all the docs once is  1.0602185726165771


In [None]:
pdf_out = Output()
pdf_df_out = Output()
boxes = []
for key in tables_from_text_pdf.keys():
    pdf_out = Output()
    pdf_df_out = Output()
    with pdf_out:
        print("/content/"+key)
        # display(IPython.display.Image('/content/Table-2.jpg'))
        # display(disp_pdf("/content/"+key))
        disp_pdf("/content/"+key)
    with pdf_df_out:
        print("Table extracted from Image inscribed in a pdf")
        display(tables_from_text_pdf[key])
    display(HBox([pdf_out, pdf_df_out]))
#     boxes.append(HBox([pdf_out, pdf_df_out]))

# display(VBox(boxes))



## tabula-py

In [None]:
# !pip install tabula
!pip install tabula-py



In [None]:
!pip install jpype1




In [None]:
import tabula
import pandas as pd
import time
from glob import glob

In [None]:
# pdf_file_path = "/content/Population history of India - Sheet1.pdf"
pdf_file_path = "/content/Population history of India - Sheet.pdf"
tables = tabula.read_pdf(pdf_file_path, pages='all')

# Write each table to a separate sheet in the Excel file
# with pd.ExcelWriter(excel_file_path) as writer:
#     for i, table in enumerate(tables):
#         table.to_excel(writer, sheet_name=f'Sheet{i+1}')

In [None]:
tables_from_text_pdf = {}
p = time.time()
for file_name in glob("*text_based.pdf"):
    print(f"processing: {file_name}")
    tables = tabula.read_pdf(file_name, pages="all")
    if len(tables) == 0:
        continue
    tables_from_text_pdf[file_name] = tables
print("total time for reading all the docs once is ", time.time()-p)

processing: Most popular 1000 Youtube videos_short text_based.pdf
processing: Zomato-data-short text_based.pdf
processing: netflix_titles_short text_based.pdf
total time for reading all the docs once is  6.872096061706543
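The totals recorded in this notebook are not directly comparable, because the tabula-py run covered three files while the pdfplumber and pymupdf runs covered eight. Dividing by the number of files gives a rough per-file figure. The file counts below are taken from the file names printed by each loop. Treat the result as indicative only: the eight-file runs include image-based PDFs in which no table was found, and each figure comes from a single run.

```python
# (total seconds, number of files) as recorded in the outputs above
runs = {
    "pdfplumber": (1.5340221410001504, 8),
    "pymupdf": (1.0602185726165771, 8),
    "tabula-py": (6.872096061706543, 3),
}

# Average seconds per file for each package
per_file = {name: total / count for name, (total, count) in runs.items()}

# Fastest first
ranking = sorted(per_file, key=per_file.get)
for name in ranking:
    print(f"{name:<12} {per_file[name]:.3f} s/file")
```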


In [None]:
tables = tabula.read_pdf(file_name, pages="all")


In [None]:
tables

In [None]:
tables_from_text_pdf.keys()

dict_keys(['Most popular 1000 Youtube videos_short text_based.pdf', 'Zomato-data-short text_based.pdf', 'netflix_titles_short text_based.pdf'])

In [None]:
tables_from_text_pdf['Zomato-data-short text_based.pdf'][0]

Unnamed: 0.1,nameonline_orderbook_tableratevotesapprox_cost(for two people)listed_in(type)\rJalsa\rYesYes4.1/5775800Buffet\rSpice Elephant\rYesNo4.1/5787800Buffet\rSan Churro Cafe\rYesNo3.8/5918800Buffet\rAddhuri Udupi Bhojana\rNoNo3.7/588300Buffet\rGrand Village\rNoNo3.8/5166600Buffet\rTimepass Dinner\rYesNo3.8/5286600Buffet\rRosewood International Hotel - Bar & Restaurant\rNoNo3.6/58800Buffet\rOnesta\rYesYes4.6/52556600Cafes\rPenthouse Cafe\rYesNo4.0/5324700other\rSmacznego\rYesNo4.2/5504550Cafes\rVillage Café\rYesNo4.1/5402500Cafes\rCafe Shuffle\rYesYes4.2/5150600Cafes\rThe Coffee Shack\rYesYes4.2/5164500Cafes\rCaf-Eleven\rNoNo4.0/5424450Cafes\rSan Churro Cafe\rYesNo3.8/5918800Cafes\rCafe Vivacity\rYesNo3.8/590650Cafes\rCatch-up-ino\rYesNo3.9/5133800Cafes\rKirthi's Biryani\rYesNo3.8/5144700Cafes\rT3H Cafe\rNoNo3.9/593300Cafes\r360 Atoms Restaurant And Cafe\rYesNo3.1/513400Cafes\rThe Vintage Cafe\rYesNo3.0/562400Cafes\rWoodee Pizza\rYesNo3.7/5180500Cafes\rCafe Coffee Day\rNoNo3.6/528900Cafes\rMy Tea House\rYesNo3.6/562600Cafes\rHide Out Cafe\rNoNo3.7/531300Cafes\rCAFE NOVA\rNoNo3.2/511600Cafes\rCoffee Tindi\rYesNo3.8/575200Cafes\rSea Green Cafe\rNoNo3.3/54500Cafes\rCuppa\rNoNo3.3/523550Cafes\r1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5
0,name,online_order,book_table,rate,votes,approx_cost(for two people),listed_in(type)
1,Jalsa,Yes,Yes,4.1/5,775,800,Buffet
2,Spice Elephant,Yes,No,4.1/5,787,800,Buffet
3,San Churro Cafe,Yes,No,3.8/5,918,800,Buffet
4,Addhuri Udupi Bhojana,No,No,3.7/5,88,300,Buffet
5,Grand Village,No,No,3.8/5,166,600,Buffet
6,Timepass Dinner,Yes,No,3.8/5,286,600,Buffet
7,Rosewood International Hotel - Bar & Restaurant,No,No,3.6/5,8,800,Buffet
8,Onesta,Yes,Yes,4.6/5,2556,600,Cafes
9,Penthouse Cafe,Yes,No,4.0/5,324,700,other


## Conclusion