# **Extracting Tabular Data From PDF Files Using**

# **How It Works**
You can choose between two table parsing methods, Stream and Lattice.
# 1.   **Stream**

Stream can be used to parse tables that use whitespace between cells to simulate a table structure. It is built on top of PDFMiner’s functionality of grouping characters on a page into words and sentences, using margins.

Words on the PDF page are grouped into text rows based on their y axis overlaps.
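
The row-grouping step can be illustrated with a small, dependency-free sketch. The `Word` tuple and the `group_into_rows` helper are hypothetical names made up for this example, not part of any library; the sketch only assumes that each word comes with a bounding box.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """Hypothetical word box; y grows upwards as in PDF coordinates."""
    text: str
    x0: float
    x1: float
    y0: float  # bottom edge
    y1: float  # top edge


def group_into_rows(words: List[Word]) -> List[List[Word]]:
    """Group words into text rows: a word joins the current row
    when its y range overlaps the row's y range."""
    rows: List[List[Word]] = []
    row_y0 = row_y1 = 0.0
    # Visit words from the top of the page downwards.
    for word in sorted(words, key=lambda w: -w.y1):
        if rows and word.y1 > row_y0 and word.y0 < row_y1:
            rows[-1].append(word)
            # Widen the row's y range so it covers the new word.
            row_y0 = min(row_y0, word.y0)
            row_y1 = max(row_y1, word.y1)
        else:
            rows.append([word])
            row_y0, row_y1 = word.y0, word.y1
    # Order the words left to right inside each row.
    return [sorted(row, key=lambda w: w.x0) for row in rows]


words = [
    Word("Name", 10, 40, 100, 110),
    Word("Age", 60, 80, 101, 111),
    Word("Alice", 10, 45, 80, 90),
    Word("30", 60, 72, 81, 91),
]
rows = group_into_rows(words)
print([[w.text for w in row] for row in rows])
# [['Name', 'Age'], ['Alice', '30']]
```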

Textedges are calculated and then used to guess interesting table areas on the PDF page. You can read Anssi Nurminen’s master’s thesis to learn more about this table detection technique.

The number of columns inside each table area is then guessed. This is done by calculating the mode of the number of words in each text row. Based on this mode, words in each text row are chosen to calculate a list of column x ranges.
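
A minimal sketch of this guess, assuming each word is a plain `(x0, x1, text)` tuple and each row is already sorted left to right. The helper names `guess_column_count` and `initial_column_ranges` are hypothetical and only illustrate the idea.

```python
from collections import Counter


def guess_column_count(rows):
    """Return the most common number of words per row (the mode)."""
    counts = Counter(len(row) for row in rows)
    return counts.most_common(1)[0][0]


def initial_column_ranges(rows):
    """Build column x ranges from the rows whose word count equals the mode."""
    n_cols = guess_column_count(rows)
    chosen = [row for row in rows if len(row) == n_cols]
    ranges = []
    for col in range(n_cols):
        # A column spans from the leftmost start to the rightmost end
        # of the words found at this position in the chosen rows.
        x0 = min(row[col][0] for row in chosen)
        x1 = max(row[col][1] for row in chosen)
        ranges.append((x0, x1))
    return ranges


rows = [
    [(10, 60, "Report")],                      # title row with a single word
    [(10, 40, "Name"), (60, 80, "Age")],
    [(10, 45, "Alice"), (60, 72, "30")],
    [(10, 38, "Bob"), (58, 70, "41")],
]
print(guess_column_count(rows))     # 2
print(initial_column_ranges(rows))  # [(10, 45), (58, 80)]
```

Rows that do not match the mode (such as the title row here) are ignored when the first column ranges are built.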

Words that lie inside/outside the current column x ranges are then used to extend the current list of columns.

Finally, a table is formed using the text rows’ y ranges and column x ranges and words found on the page are assigned to the table’s cells based on their x and y coordinates.
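
The final assignment step might look like the sketch below. It assumes, purely for illustration, that a word belongs to the cell containing the center of its bounding box; the `build_table` helper and the tuple layout are hypothetical.

```python
def build_table(words, row_ranges, col_ranges):
    """Place each word in the cell whose y and x ranges contain its center.

    words:      list of (x0, x1, y0, y1, text) tuples
    row_ranges: list of (y0, y1) tuples, one per table row
    col_ranges: list of (x0, x1) tuples, one per table column
    """
    table = [["" for _ in col_ranges] for _ in row_ranges]
    for x0, x1, y0, y1, text in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for r, (ry0, ry1) in enumerate(row_ranges):
            if not ry0 <= cy <= ry1:
                continue
            for c, (cx0, cx1) in enumerate(col_ranges):
                if cx0 <= cx <= cx1:
                    # Join multiple words that land in the same cell.
                    table[r][c] = (table[r][c] + " " + text).strip()
    return table


words = [
    (10, 40, 100, 110, "Name"),
    (60, 80, 101, 111, "Age"),
    (10, 45, 80, 90, "Alice"),
    (60, 72, 81, 91, "30"),
]
row_ranges = [(100, 111), (80, 91)]   # top row first
col_ranges = [(10, 45), (58, 80)]
table = build_table(words, row_ranges, col_ranges)
print(table)
# [['Name', 'Age'], ['Alice', '30']]
```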


# 2.  **Lattice**


Lattice is more deterministic in nature, and it does not rely on guesses. It can be used to parse tables that have demarcated lines between cells, and it can automatically parse multiple tables present on a page.

It starts by converting the PDF page to an image using Ghostscript, and then processes it to get horizontal and vertical line segments by applying a set of morphological transformations (erosion and dilation) using OpenCV.
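
To show why erosion followed by dilation isolates line segments, here is a dependency-free sketch on a tiny binary image. It is not the OpenCV code the library runs; `erode_h` and `dilate_h` are hypothetical helpers that apply a horizontal structuring element of width `k` to a list-of-lists image.

```python
def erode_h(img, k):
    """Keep a pixel only if it starts a run of k set pixels to its right."""
    h, w = len(img), len(img[0])
    return [
        [int(x + k <= w and all(img[y][x:x + k])) for x in range(w)]
        for y in range(h)
    ]


def dilate_h(img, k):
    """Grow every surviving pixel back into a run of k set pixels."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for dx in range(k):
                    if x + dx < w:
                        out[y][x + dx] = 1
    return out


image = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],   # a horizontal ruling line
    [0, 1, 0, 0, 1, 1, 0, 0],   # short, text-like specks
    [0, 1, 0, 0, 0, 0, 0, 0],
]

# Erosion removes everything narrower than 4 pixels;
# dilation restores the surviving line to its full length.
lines = dilate_h(erode_h(image, 4), 4)
for row in lines:
    print(row)
```

Only the ruling line in the second row survives. Swapping the roles of x and y would pick out vertical lines in the same way.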



*   **tabula-py**

In [None]:
# importing library
import tabula

In [None]:
# Before trying tabula-py, check your environment with the environment_info() function,
# which shows the Python version, Java version, and your OS environment.
import tabula
tabula.environment_info()

Python version:
    3.7.13 (default, Apr 24 2022, 01:04:09) 
[GCC 7.5.0]
Java version:
    openjdk version "11.0.15" 2022-04-19
OpenJDK Runtime Environment (build 11.0.15+10-Ubuntu-0ubuntu0.18.04.1)
OpenJDK 64-Bit Server VM (build 11.0.15+10-Ubuntu-0ubuntu0.18.04.1, mixed mode, sharing)
tabula-py version: 2.4.0
platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
uname:
    uname_result(system='Linux', node='76b883a89026', release='5.4.188+', version='#1 SMP Sun Apr 24 10:03:06 PDT 2022', machine='x86_64', processor='x86_64')
linux_distribution: ('Ubuntu', '18.04', 'bionic')
mac_ver: ('', ('', '', ''), '')


In [None]:
# path to pdf file.
pdf_path1 = "/content/sample_data/pdf_sample1.pdf"
pdf_path2 = "/content/sample_data/pdf_with_table.pdf"

In [None]:
# reading pdf file with help of tabula
dfs = tabula.read_pdf(pdf_path2, stream=True)

# read_pdf returns list of DataFrames
print("Total table Found: ", len(dfs))
print("Type: ", type(dfs))
print("\n Table:\n", dfs[0])

'pages' argument isn't specified.Will extract only from page 1 by default.


Total table Found:  1
Type:  <class 'list'>

 Table:
     Unnamed: 0    Unnamed: 1 Unnamed: 2   Unnamed: 3    Unnamed: 4  \
0          NaN           NaN        NaN      Ballots           NaN   
1   Disability           NaN    Ballots          NaN           NaN   
2          NaN  Participants        NaN  Incomplete/           NaN   
3     Category           NaN  Completed          NaN      Accuracy   
4          NaN           NaN        NaN   Terminated           NaN   
5          NaN           NaN        NaN          NaN           NaN   
6        Blind             5          1            4    34.5%, n=1   
7   Low Vision             5          2            3     98.3% n=2   
8          NaN           NaN        NaN          NaN  (97.7%, n=3)   
9    Dexterity             5          4            1    98.3%, n=4   
10    Mobility             3          3            0    95.4%, n=3   

            Results  
0               NaN  
1               NaN  
2               NaN  
3           Time 

In [None]:
# Options available in read_pdf()
print(help(tabula.read_pdf))

Help on function read_pdf in module tabula.io:

read_pdf(input_path, output_format=None, encoding='utf-8', java_options=None, pandas_options=None, multiple_tables=True, user_agent=None, **kwargs)
    Read tables in PDF.
    
    Args:
        input_path (str, path object or file-like object):
            File like object of tareget PDF file.
            It can be URL, which is downloaded by tabula-py automatically.
        output_format (str, optional):
            Output format for returned object (``dataframe`` or ``json``)
        encoding (str, optional):
            Encoding type for pandas. Default: ``utf-8``
        java_options (list, optional):
            Set java options.
    
            Example:
                ``["-Xmx256m"]``
        pandas_options (dict, optional):
            Set pandas options.
    
            Example:
                ``{'header': None}``
    
            Note:
                With ``multiple_tables=True`` (default), pandas_options is passed
        

In [None]:
# # So we got.
# read_pdf(input_path, output_format=None, encoding='utf-8',
#          java_options=None, pandas_options=None,
#          multiple_tables=True, user_agent=None, **kwargs)
#     Read tables in PDF.
    
#     Args:
#         input_path (str, path object or file-like object):
#             File like object of tareget PDF file.
#             It can be URL, which is downloaded by tabula-py automatically.
#         output_format (str, optional):
#             Output format for returned object (``dataframe`` or ``json``)
#         encoding (str, optional):
#             Encoding type for pandas. Default: ``utf-8``
#         java_options (list, optional):
#             Set java options.
    
#             Example:
#                 ``["-Xmx256m"]``
#         pandas_options (dict, optional):
#             Set pandas options.
    
#             Example:
#                 ``{'header': None}``
    
#             Note:
#                 With ``multiple_tables=True`` (default), pandas_options is passed
#                 to pandas.DataFrame, otherwise it is passed to pandas.read_csv.
#                 Those two functions are different for accept options like ``dtype``.
#         multiple_tables (bool):
#             It enables to handle multiple tables within a page. Default: ``True``
    
#             Note:
#                 If `multiple_tables` option is enabled, tabula-py uses not
#                 :func:`pd.read_csv()`, but :func:`pd.DataFrame()`. Make
#                 sure to pass appropriate `pandas_options`.
#         user_agent (str, optional):
#             Set a custom user-agent when download a pdf from a url. Otherwise
#             it uses the default ``urllib.request`` user-agent.
#         kwargs:
#             Dictionary of option for tabula-java. Details are shown in
#             :func:`build_options()`
    
#     Returns:
#         list of DataFrames or dict.

In [None]:
# By default, read_pdf extracts tables only from page 1,
# so we need to specify which pages to process.
dfs1 = tabula.read_pdf(pdf_path1, encoding='utf-8',
         java_options=None, pandas_options=None,
         multiple_tables=True, user_agent=None, pages="all", stream=True)
dfs1



[   S. No.                       GIT                    GITHUB
 0     1.0        Git is a software.      GitHub is a service.
 1     2.0     Git is a command line     GitHub is a graphical
 2     NaN                     tool.           user interface.
 3     3.0      Git is maintained by   GitHub is maintained by
 4     NaN                    linux.               Mircrosoft.
 5     4.0    Git focused on version         GitHub focused on
 6     NaN  control and code sharing  centralized code hosting]

In [None]:
# This time we got the tables from all the pages.

In [None]:
# read pdf as JSON
tabula.read_pdf(pdf_path1, output_format="json", pages="all")
# read_pdf returns DataFrames or JSON; to write CSV, TSV, or JSON files,
# use tabula.convert_into() instead.



[{'bottom': 817.75366,
  'data': [[{'height': 793.4736938476562,
     'left': 24.279991,
     'text': 'GIT keeps a record of all the commits done by each of the collaborators on the local copy of the\rdeveloper. A log file is maintained and is pushed to the central repository each time the push\roperation is performed. So, if a problem arises then it can be easily tracked and handled by the\rdeveloper. GIT uses SHA1 to store all the records in the form of objects in the Hash. Each object\rcollaborates with each other with the use of these Hash keys.\rSHA1 is a cryptographic algorithm that converts the commit object into a 14-diGIT Hex code. It\rhelps to store the record of all the commits done by each of the developers. Hence, easily\rdiagnosable that which commit has resulted in the failure of the work.\r\r\rh.\rReliable\rProviding a central repository that is being cloned each time a User performs the Pull operation,\rthe data of the central repository is always being backed up in ev

### **Use lattice mode for more accurate extraction of spreadsheet-style tables**
If your tables have lines separating cells, you can use the lattice option. If your tables don't have separating lines, you can try the stream option.
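
When you do not know in advance which kind of table a PDF contains, one possible heuristic is to try lattice mode first and fall back to stream mode if nothing is found. The `read_tables` helper below is a hypothetical sketch of that idea; the `reader` parameter is an assumption added so the function can be exercised without Java installed.

```python
def read_tables(pdf_path, reader=None, **kwargs):
    """Try lattice mode first; fall back to stream mode if no table is found.

    reader: callable with the same interface as tabula.read_pdf
            (defaults to tabula.read_pdf itself).
    """
    if reader is None:
        import tabula  # imported lazily; tabula-py needs Java at call time
        reader = tabula.read_pdf

    # Lattice mode relies on ruling lines between cells.
    tables = reader(pdf_path, lattice=True, **kwargs)
    if len(tables) == 0:
        # No ruled table detected, so guess the structure from whitespace.
        tables = reader(pdf_path, stream=True, **kwargs)
    return tables
```

A call such as `read_tables(pdf_path2, pages="all")` would then return a list of DataFrames from whichever mode found tables first. An empty lattice result is only a rough signal, so it is still worth inspecting the output.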

In [None]:
pdf_path3 = "https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/spanning_cells.pdf"
dfs = tabula.read_pdf(
    pdf_path3,
    pages="1",
    lattice=True,
    pandas_options={"header": [0, 1]},
    area=[0, 0, 50, 100],
    relative_area=True,
    multiple_tables=False,
)
dfs[0]

Unnamed: 0_level_0,Improved operation scenario,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0
Unnamed: 0_level_1,Volume servers in:,2007,2008,2009,2010,2011
0,Server closets,1505.0,1580.0,1643.0,1673.0,1689.0
1,Server rooms,1512.0,1586.0,1646.0,1677.0,1693.0
2,Localized data centers,1512.0,1586.0,1646.0,1677.0,1693.0
3,Mid-tier data centers,1512.0,1586.0,1646.0,1677.0,1693.0
4,Enterprise-class data centers,1512.0,1586.0,1646.0,1677.0,1693.0
5,Best practice scenario,,,,,
6,Volume servers in:,2007.0,2008.0,2009.0,2010.0,2011.0
7,Server closets,1456.0,1439.0,1386.0,1296.0,1326.0
8,Server rooms,1465.0,1472.0,1427.0,1334.0,1371.0
9,Localized data centers,1465.0,1471.0,1426.0,1334.0,1371.0


In [None]:
########################################################################################


# 2.   **Camelot**

https://pypi.org/project/camelot-py/

In [1]:
# importing library
import camelot as cml

In [None]:
# Syntax of the camelot.read_pdf function 
# cml.read_pdf(
#     filepath,
#     pages='1',
#     password=None,
#     flavor='lattice',
#     suppress_stdout=False,
#     layout_kwargs={},
#     **kwargs,
# )

In [2]:
# extract the tables from all pages of the pdf file
# (by default, camelot only reads page 1).
tables = cml.read_pdf(pdf_path1, pages="all")

# checking type
print("Table Type: ", type(tables))

print("Tables: ", tables)

In [None]:
# checking the parsing report of the first table
# (index 0 is used because the PDF may contain fewer than three tables).
tables[0].parsing_report
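
The parsing report is a dictionary with the keys `accuracy`, `whitespace`, `order`, and `page`, so it can be used to discard poorly parsed tables. The `filter_tables` helper and its threshold values below are hypothetical choices for illustration, not recommendations from the Camelot documentation.

```python
def filter_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Keep only tables whose parsing report passes both thresholds.

    tables: any iterable of objects exposing a `parsing_report` dict,
            such as the TableList returned by camelot.read_pdf.
    """
    kept = []
    for table in tables:
        report = table.parsing_report
        if (report["accuracy"] >= min_accuracy
                and report["whitespace"] <= max_whitespace):
            kept.append(table)
    return kept
```

For example, `filter_tables(tables, min_accuracy=95)` would keep only the tables Camelot reports as at least 95% accurate.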

In [None]:
# number of tables extracted
print("Total tables extracted:", tables.n)

In [2]:
# print the first table as Pandas DataFrame
print(tables[0].df)

In [None]:
# export individually as CSV
tables[0].to_csv("table_camelot.csv")

In [None]:
# export individually as Excel (.xlsx extension)
tables[0].to_excel("table_camelot.xlsx")

In [None]:
# or export all in a zip
tables.export("table_camelot.csv", f="csv", compress=True)

In [None]:
# export to HTML
tables.export("foo.html", f="html")

To explore all other features, you can go through the docs.
https://camelot-py.readthedocs.io/en/master/

In [3]:
##############################################################################################


# Convert PDF to Image using Python
https://pdf2image.readthedocs.io/en/latest/index.html

In [6]:

# import library
import pdf2image as pmg

# Make sure poppler is installed on your system. If not, please install it,
# for example with the following command in a terminal:
# !apt-get install poppler-utils

In [7]:
# There are following options available with pdf2image
#
# pmg.convert_from_path(
#     pdf_path,
#     dpi=200,
#     output_folder=None,
#     first_page=None,
#     last_page=None,
#     fmt="ppm",
#     jpegopt=None,
#     thread_count=1,
#     userpw=None,
#     use_cropbox=False,
#     strict=False,
#     transparent=False,
#     single_file=False,
#     output_file=uuid_generator(),
#     poppler_path=None,
#     grayscale=False,
#     size=None,
#     paths_only=False,
#     hide_annotations=False,
# )

# convert_from_bytes(
#     pdf_bytes,
#     dpi=200,
#     output_folder=None,
#     first_page=None,
#     last_page=None,
#     fmt="ppm",
#     jpegopt=None,
#     thread_count=1,
#     userpw=None,
#     use_cropbox=False,
#     strict=False,
#     transparent=False,
#     single_file=False,
#     output_file=uuid_generator(),
#     poppler_path=None,
#     grayscale=False,
#     size=None,
#     paths_only=False,
#     hide_annotations=False,
# )


In [3]:
# path of pdf files to convert
pdf_path_img = "/content/sample_data/pdf_sample1.pdf"

# convert_from_path takes the path of a PDF file and returns
# a list of PIL images, one per page.
images = pmg.convert_from_path(pdf_path_img)

# Save each page of the PDF as a JPEG image.
for indx, image in enumerate(images):
    image.save('page' + str(indx) + '.jpg', 'JPEG')

print("all images saved")

all images saved


In [8]:
# Please also explore the other option, convert_from_bytes().
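
`convert_from_bytes()` is useful when the PDF is already in memory, for instance after a download. The sketch below reads a file into bytes and converts it; the `load_pdf_pages` helper and its `converter` parameter are hypothetical, added so the function can be exercised without Poppler installed.

```python
def load_pdf_pages(pdf_path, dpi=200, converter=None):
    """Read a PDF into memory and convert its pages to images.

    converter: callable with the same interface as
               pdf2image.convert_from_bytes (used by default).
    """
    if converter is None:
        # Imported lazily; pdf2image needs poppler at call time.
        from pdf2image import convert_from_bytes
        converter = convert_from_bytes

    # Read the whole file as raw bytes.
    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()

    # Returns a list of PIL images, one per page.
    return converter(pdf_bytes, dpi=dpi)
```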

In [9]:
##############################################################

# **Checking if PDF is Image Based PDF or Searchable PDF**

Identifying whether a PDF is text-based or image-based is an essential first step when you want to extract text from it.

If the text in the PDF is selectable, it can be extracted with various packages.
If the text is not selectable, these text extraction packages will fail, and you need to convert the pages into images and use OCR to extract the text from them.
Thus, it is essential to separate text-based from image-based PDFs in the dataset.
For a text-based PDF, there are many Python packages, such as pdftotext, PyPDF2, and PyMuPDF, that provide methods to extract text. For an image-based PDF, an OCR module such as pytesseract has to be used to extract the text after converting each PDF page to an image.

In [12]:
# PyMuPDF is a powerful library for PDF processing.
# It is imported under the module name fitz,
# which we are going to use for classification.
# Please install the PyMuPDF library before using it.

In [14]:
import fitz

In [17]:
def classify_pdf(pdf_file):
  """Input: pdf file path
     Output: list with 1 for each text-only page and
             0 for each image-only page
  """
  # Opening pdf file from path
  pdf = fitz.open(pdf_file)

  # initializing empty list for response.
  res = []

  # Now iterate through each page and
  # check if images based page or not.
  for page in pdf:
    image_area = 0.0
    text_area = 0.0
    # identify text-based or image-based PDF page
    # using text_area and image_area
    for b in page.get_text("blocks"):
      if '<image:' in b[4]:
        r = fitz.Rect(b[:4])
        image_area = image_area + abs(r)
      else:
        r = fitz.Rect(b[:4])
        text_area = text_area + abs(r)
    
    if image_area == 0.0 and text_area != 0.0:
      res.append(1)
    if text_area == 0.0 and image_area != 0.0:
      res.append(0) 
  return res

In [18]:
# Function call.
pdf_path = "/content/sample_data/image-based-pdf-sample.pdf"
classifier_result = classify_pdf(pdf_path)

# if list contains 0 then it is image based else searchable page.
if 0 in classifier_result:
    print("PDF is image-based!")
else:
    print("PDF is text-based!")

PDF is image-based!


In [19]:
# In practice, most PDFs are a combination of both,
# containing images as well as text.
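
For such mixed pages, one possible approach is to compare the two areas instead of requiring one of them to be zero. The `classify_page` helper, its labels, and the 0.5 ratio below are hypothetical choices for illustration; the areas are assumed to be computed as in `classify_pdf` above.

```python
def classify_page(text_area, image_area, image_ratio=0.5):
    """Label a page from the total area of its text and image blocks."""
    total = text_area + image_area
    if total == 0:
        return "empty"
    if image_area == 0:
        return "text"
    if text_area == 0:
        return "image"
    # Mixed page: decide by the share of the area covered by images.
    if image_area / total >= image_ratio:
        return "mostly-image"
    return "mostly-text"


print(classify_page(5000.0, 0.0))      # text
print(classify_page(0.0, 8000.0))      # image
print(classify_page(1000.0, 9000.0))   # mostly-image
print(classify_page(9000.0, 1000.0))   # mostly-text
```

A page labeled `mostly-image` may still need OCR, so the threshold should be tuned on your own documents.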

In [None]:
########################################################################

# **Image Based To Searchable PDF**
**Using Pytesseract**

In [20]:
# Please install the following packages before executing the code below:
# Pillow (PIL), pytesseract, pdf2image, tesseract-ocr

In [3]:
# !pip install Pillow
# !pip install pytesseract
# !pip install pdf2image
# !apt-get install tesseract-ocr


In [4]:
# import library
from PIL import Image
import pytesseract
from pdf2image import convert_from_path

In [8]:
def scanned_pdf_to_text(pdf_path):
  """Input: path to a scanned (image-based) PDF.
     Output: list with the OCR text of each page.
  """
  # First we need to convert the PDF into image files.
  # Convert every page of the PDF into an image at 500 DPI.
  pages = convert_from_path(pdf_path, 500)

  # Save the image of each page next to the PDF
  # as <name>_image<N>.jpg and remember the filenames.
  filenames = []
  for image_counter, page in enumerate(pages, start=1):
    filename = pdf_path.replace(".pdf", "_image") + str(image_counter) + ".jpg"
    page.save(filename, 'JPEG')
    filenames.append(filename)

  # Text file that collects the OCR output of all pages.
  outfile = "out_text.txt"

  # empty list for output response
  response = []

  # Open the file in append mode so that
  # the text of all images is added to the same file.
  with open(outfile, "a") as f:
    for filename in filenames:
      # Recognize the text in the image as a string using pytesseract.
      text = pytesseract.image_to_string(Image.open(filename))
      # Rejoin words that were hyphenated across line breaks.
      text = text.replace('-\n', '')

      # Write the text to the file and append it to the response.
      f.write(text)
      response.append(text)

  # returning response containing texts.
  return response

In [11]:
if __name__ == "__main__":
  print("Getting Texts from Image Based PDF file.")
  
  # pdf path
  pdf_path = "/content/sample_data/image-based-pdf-sample.pdf"
  
  # Function call to extract text from the pdf file.
  texts = scanned_pdf_to_text(pdf_path)

  print(texts)

Getting Texts from Image Based PDF file.
[' \n\nThis is an example of an “Image-based PDF” (also known as image-only PDFs).\n\nImage-based PDFs are typically created through scanning paper in a copier, taking photographs\nor taking screenshots. To a computer, they are images. Though we humans can see text in the\nimage, the file only consists of the image layer but not the searchable text layer that True PDFs\ncontain. As a result, we cannot use a computer to search the text we see in the image as that text\nlayer is missing. There are times when discovery is produced, it will be in an image-based PDF\nformat. When you come across image-based PDFs, ask the U.S. Attorney’s Office in what\nformat was that file originally. Second, ask if they have it in a searchable format and specifically\nif they have it in a digitally created, True, Text-based PDF format. They may not, as they often\nreceive PDFs from other sources before they provide them to you, but you will want to know\nwhat is the

In [12]:
# Please explore other options for converting image-based PDFs to text-based PDFs.
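
One such option is `pytesseract.image_to_pdf_or_hocr()`, which returns the bytes of a PDF that contains the page image together with a searchable text layer. The sketch below writes one searchable PDF per page image; the `images_to_searchable_pdfs` helper, the `_searchable.pdf` suffix, and the `ocr` parameter are hypothetical and only illustrate the idea.

```python
import os


def images_to_searchable_pdfs(image_paths, ocr=None):
    """Write one searchable PDF per page image and return the output paths.

    ocr: callable mapping an image path to PDF bytes
         (defaults to pytesseract.image_to_pdf_or_hocr).
    """
    if ocr is None:
        import pytesseract  # imported lazily; needs the tesseract-ocr binary

        def ocr(path):
            return pytesseract.image_to_pdf_or_hocr(path, extension="pdf")

    out_paths = []
    for path in image_paths:
        pdf_bytes = ocr(path)
        # page1.jpg -> page1_searchable.pdf
        out_path = os.path.splitext(path)[0] + "_searchable.pdf"
        with open(out_path, "wb") as f:
            f.write(pdf_bytes)
        out_paths.append(out_path)
    return out_paths
```

Each output file covers a single page. Merging them into one document would need an additional PDF library.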

In [13]:
#####################################################################################
