# Algorithm Extension: Table Recognition

One downside of the current algorithm is that it has trouble with tables, as tables resemble the border class that the algorithm detects. In this notebook we first measure the extent of this problem and then propose a fix to the model that avoids these errors. We decided against changing the detection of the border class, as this would sacrifice recall. Instead, we detect whether a page contains a table and, if so, skip that page in the further steps of the algorithm, which avoids these false positives.

In [21]:
import img2table

In [25]:
# img2table exposes no `version` attribute; read the installed version from the package metadata
from importlib.metadata import version
version('img2table')

In [1]:
import os
import subprocess
import pandas as pd
from glob import glob
from tqdm import tqdm

In [2]:
#!pip install img2table

## Examining the problem

The first step is to see the extent of the problem. For this we take a set of 'inventarislijsten' (inventory lists) from the corpus, which we know contain tables, collect statistics on the number of redacted regions and the percentage of redacted text, and examine a few documents ourselves. Because running on all these PDF files would take a very long time, we run on a sample. In principle the number of redacted blocks should be 0, as these tables should not be detected. Some information in tables is occasionally redacted, but this is relatively rare, so we should still be able to see at page level that something is wrong.

In [18]:
subprocess.run(['python', '../scripts/run_redaction_detector.py', '--pdf_path',
                   '../debug_data/maik_example.pdf', '--output_path', '../debug_data/maik_out.pdf',
               '--exclude_tables', 'True'])

CompletedProcess(args=['python', '../scripts/run_redaction_detector.py', '--pdf_path', '../debug_data/maik_example.pdf', '--output_path', '../debug_data/maik_out.pdf', '--exclude_tables', 'True'], returncode=0)

In [3]:
# The sample PDFs are in '../debug_data/table_pdfs'.
# Run the detection script on each of them.
for image in tqdm(glob('../debug_data/table_pdfs/*')):
    output_path = '../debug_data/table_pdfs_output/' + image.split('/')[-1]
    subprocess.run(['python', '../scripts/run_redaction_detector.py', '--pdf_path',
                   image, '--output_path', output_path])

100%|███████████████████████████████████████████| 13/13 [05:35<00:00, 25.78s/it]


Now that we have run the code over a few input PDFs, we read the statistics of each one, combine them into a single dataframe, and summarise them for all PDFs together.

In [4]:
all_csv_files = glob('../debug_data/table_pdfs_output/*.csv')
redaction_output_dataframe = pd.concat([pd.read_csv(path) for path in all_csv_files])
redaction_output_dataframe.describe()

| | Number of redacted regions | Percentage of redacted text |
|---|---|---|
| count | 169.0 | 169.0 |
| mean | 4.426036 | 13.899408 |
| std | 12.535028 | 28.651777 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 |
| 75% | 1.0 | 0.0 |
| max | 80.0 | 100.0 |


Although not extremely high, these numbers are still too high: there should not be any redacted text on these pages at all. In the next part of the notebook we attempt to fix this problem. What stood out is that the problem mostly occurs with tables that do not fill the entire page. This makes sense, as we only consider boxes whose width is larger than their height, which is usually not the case for full-page tables.
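Because the median is zero, the mean alone hides how many pages are actually affected. The following is a minimal sketch of counting those pages directly. The column names are taken from the output above; the helper name `summarize_false_positives` and the small `example` frame are made up for illustration and are not results from the corpus.

```python
import pandas as pd

REGIONS = "Number of redacted regions"
TEXT = "Percentage of redacted text"


def summarize_false_positives(df: pd.DataFrame) -> dict:
    """Count pages with at least one detected region.

    On a table-only sample every such page is a suspected false positive.
    """
    flagged = df[REGIONS] > 0
    return {
        "pages": len(df),
        "flagged_pages": int(flagged.sum()),
        "flagged_share": float(flagged.mean()),
    }


# Made-up example data, one row per page.
example = pd.DataFrame({REGIONS: [0, 0, 3, 12], TEXT: [0, 0, 5, 40]})
summary = summarize_false_positives(example)
print(summary)
```

Applied to `redaction_output_dataframe`, this would report the share of pages with at least one detection instead of an average that is dominated by a few pages with many regions.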

In [5]:
import io
from pdf2image import convert_from_path
from img2table.document import Image as TableImage

Let's experiment with a specific document that we know contains tables that we should discard but that are currently not discarded, and see whether we can detect them with the `img2table` package. In the following example all pages contain a table, but about half of them are seen as redacted text.

## Solving the problem

We detect tables using the `img2table` package and skip every page on which a table is detected. Because the package is recall-oriented, we only count a table if it is wider than half of the page width, as in the loop below. We set the statistics for the skipped pages to zero so that we get an idea of the improvement over the original method. The code has been implemented in the algorithm itself, and we test it below.
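Setting the statistics of skipped pages to zero could look roughly like the sketch below. This is an assumption about the bookkeeping, not the code in `run_redaction_detector.py`, which is not shown in this notebook. The function name `zero_out_table_pages` and the example values are hypothetical; only the column names come from the script's CSV output.

```python
import pandas as pd

STAT_COLUMNS = ["Number of redacted regions", "Percentage of redacted text"]


def zero_out_table_pages(stats: pd.DataFrame, is_table_page: list) -> pd.DataFrame:
    """Return a copy of `stats` with the statistics of table pages set to zero."""
    if len(stats) != len(is_table_page):
        raise ValueError("need exactly one flag per page")
    result = stats.copy()
    # Align the boolean flags with the dataframe index before masking.
    mask = pd.Series(is_table_page, index=result.index, dtype=bool)
    result.loc[mask, STAT_COLUMNS] = 0
    return result


# Made-up per-page statistics for a three-page document; page 1 holds a table.
stats = pd.DataFrame({STAT_COLUMNS[0]: [4, 0, 7], STAT_COLUMNS[1]: [20, 0, 35]})
cleaned = zero_out_table_pages(stats, [True, False, False])
print(cleaned)
```

Returning a copy keeps the original statistics available, which makes it easy to compare the two versions afterwards.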

In [6]:
images = convert_from_path('../debug_data/table_pdfs_output/Schiphol deel 1.pdf')

In [7]:
# Go through all pages and see if they contain a table according to our rules.
for i, image in enumerate(images):
    byte_image = io.BytesIO()
    # image.save expects a file-like object as an argument
    image.save(byte_image, format=image.format)
    # Turn the BytesIO object back into a bytes object
    byte_image = byte_image.getvalue()
    table_image = TableImage(src=byte_image)
    # Table identification
    image_tables = table_image.extract_tables()
    # This module is recall oriented so we should require the tables to be of a larger size
    page_width, page_height = image.size
    detected_tables = []
    for table in image_tables:
        table_width = table.bbox.x2 - table.bbox.x1
        table_height = table.bbox.y2 - table.bbox.y1
        # is the size of the table in pixels big enough compared to the complete page?
        if (table_width / page_width) > 0.50 and table_height > 50:
            # now check the number of cells
            if ((table.df.shape[0] >= 2) and (table.df.shape[1] >=3)):
                detected_tables.append(True)
        
    print('page-%d' % (i+1), any(detected_tables))

page-1 True
page-2 True
page-3 True
page-4 True
page-5 True
page-6 True
page-7 True
page-8 True
page-9 False
page-10 True
page-11 True
page-12 True
page-13 True
page-14 True
page-15 True
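The size and cell-count rule from the loop above can be factored into a small helper, so that the same thresholds are applied wherever the check is needed. The sketch below is independent of `img2table`: `TableCandidate` is a hypothetical container for the values the loop reads from `table.bbox` and `table.df.shape`, and the parameter names are illustrative. The default thresholds are the ones used in the loop.

```python
from typing import Iterable, NamedTuple


class TableCandidate(NamedTuple):
    """Hypothetical container for the fields the loop reads from a detected table."""
    x1: int
    y1: int
    x2: int
    y2: int
    n_rows: int
    n_cols: int


def page_has_table(
    candidates: Iterable[TableCandidate],
    page_width: int,
    min_width_ratio: float = 0.50,
    min_height: int = 50,
    min_rows: int = 2,
    min_cols: int = 3,
) -> bool:
    """Return True if any candidate passes the width, height and cell-count checks."""
    for c in candidates:
        wide_enough = (c.x2 - c.x1) / page_width > min_width_ratio
        tall_enough = (c.y2 - c.y1) > min_height
        enough_cells = c.n_rows >= min_rows and c.n_cols >= min_cols
        if wide_enough and tall_enough and enough_cells:
            return True
    return False


# Made-up bounding boxes on a page that is 1654 pixels wide.
wide_table = TableCandidate(100, 100, 1400, 600, n_rows=5, n_cols=4)
narrow_box = TableCandidate(100, 100, 500, 600, n_rows=5, n_cols=4)
print(page_has_table([wide_table], page_width=1654))
print(page_has_table([narrow_box], page_width=1654))
```

Keeping the thresholds as keyword arguments makes it straightforward to try stricter values when too many non-table pages are flagged.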


In [8]:
# Now run the detection script on the table PDFs again, this time excluding table pages
for image in tqdm(glob('../debug_data/table_pdfs/*')):
    output_path = '../debug_data/table_pdfs_output/' + image.split('/')[-1]
    subprocess.run(['python', '../scripts/run_redaction_detector.py', '--pdf_path',
                   image, '--output_path', output_path, '--exclude_tables', 'True'])

100%|███████████████████████████████████████████| 13/13 [02:01<00:00,  9.35s/it]


In [9]:
all_csv_files = glob('../debug_data/table_pdfs_output/*.csv')
redaction_output_dataframe = pd.concat([pd.read_csv(path) for path in all_csv_files])
redaction_output_dataframe.describe()

| | Number of redacted regions | Percentage of redacted text |
|---|---|---|
| count | 169.0 | 169.0 |
| mean | 2.650888 | 6.230769 |
| std | 10.256938 | 20.777735 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 |
| max | 80.0 | 100.0 |


This seems to work well for the tables: the mean number of redacted regions drops from 4.43 to 2.65 and the mean percentage of redacted text from 13.9 to 6.2. We should also make sure that it does not compromise the ability of our algorithm to detect other boxes, i.e. that it does not skip too many pages and lower our recall. For this we take several PDF files without any tables, stored in the `clean_data` folder, and compare the original and the updated algorithm on both statistics. We again look at a single example first.

## Testing Method Recall

To test the recall of the method and check that we are not losing pages that do not contain tables, we run it on a set of PDF files that we know contain no tables, so the statistics on this dataset should not change.

In [10]:
images = convert_from_path('../debug_data/clean_data/Binder OneLove.pdf')

In [11]:
# Go through all pages and see if they contain a table according to our rules.
for i, image in enumerate(images):
    byte_image = io.BytesIO()
    # image.save expects a file-like object as an argument
    image.save(byte_image, format=image.format)
    # Turn the BytesIO object back into a bytes object
    byte_image = byte_image.getvalue()
    table_image = TableImage(src=byte_image)
    # Table identification
    image_tables = table_image.extract_tables()
    # This module is recall oriented so we should require the tables to be of a larger size
    page_width, page_height = image.size
    detected_tables = []
    for table in image_tables:
        table_width = table.bbox.x2 - table.bbox.x1
        table_height = table.bbox.y2 - table.bbox.y1
        # is the size of the table in pixels big enough compared to the complete page?
        if (table_width / page_width) > 0.50 and table_height > 50:
            # now check the number of cells
            if ((table.df.shape[0] >= 2) and (table.df.shape[1] >=3)):
                detected_tables.append(True)
        
    print('page-%d' % (i+1), any(detected_tables))

page-1 False
page-2 False
page-3 False
page-4 False
page-5 False
page-6 False
page-7 False
page-8 False
page-9 False
page-10 True
page-11 False
page-12 False
page-13 False
page-14 False
page-15 False
page-16 False
page-17 False
page-18 False
page-19 False
page-20 True
page-21 False
page-22 True
page-23 True
page-24 True
page-25 False


We get a few false positives on pages that we should actually keep. However, this particular document only contains the more difficult border class, which is the one most often confused with a table, so the result is actually quite good as it is. Let's run the full experiment on the clean pages and compare discarding table pages with not discarding them.
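To put numbers on the two example documents, the sketch below counts the flags printed in the outputs above: page 9 is the only page of `Schiphol deel 1.pdf` that was not flagged, and pages 10, 20, 22, 23 and 24 of `Binder OneLove.pdf` were flagged although the document contains no tables. The variable names are illustrative.

```python
# Page numbers read off the two printed outputs above.
table_doc_pages = 15
table_doc_missed = [9]                      # table pages that were not flagged

clean_doc_pages = 25
clean_doc_flagged = [10, 20, 22, 23, 24]    # table-free pages that were flagged

# Share of table pages that the rule catches in the table document.
table_detection_rate = (table_doc_pages - len(table_doc_missed)) / table_doc_pages
# Share of table-free pages that the rule would wrongly skip in the clean document.
clean_skip_rate = len(clean_doc_flagged) / clean_doc_pages

print(f"table pages detected: {table_detection_rate:.1%}")
print(f"clean pages skipped:  {clean_skip_rate:.1%}")
```

On these two documents the rule catches 14 of 15 table pages and wrongly skips 5 of 25 clean pages.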

In [12]:
for image in tqdm(glob('../debug_data/clean_data/*')):
    output_path = '../debug_data/clean_data_output/' + image.split('/')[-1]
    subprocess.run(['python', '../scripts/run_redaction_detector.py', '--pdf_path',
                   image, '--output_path', output_path])

100%|███████████████████████████████████████████| 15/15 [11:23<00:00, 45.56s/it]


In [13]:
all_csv_files = glob('../debug_data/clean_data_output/*.csv')
redaction_output_dataframe = pd.concat([pd.read_csv(path) for path in all_csv_files])
redaction_output_dataframe.describe()

| | Number of redacted regions | Percentage of redacted text |
|---|---|---|
| count | 279.0 | 279.0 |
| mean | 9.179211 | 23.892473 |
| std | 13.559228 | 27.598403 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 |
| 50% | 4.0 | 10.0 |
| 75% | 13.0 | 46.0 |
| max | 103.0 | 99.0 |


Now we compare this to the version where we exclude tables, and see the difference.

In [14]:
for image in tqdm(glob('../debug_data/clean_data/*')):
    output_path = '../debug_data/clean_data_output/' + image.split('/')[-1]
    subprocess.run(['python', '../scripts/run_redaction_detector.py', '--pdf_path',
                   image, '--output_path', output_path, '--exclude_tables', 'True'])

 80%|██████████████████████████████████▍        | 12/15 [08:17<02:05, 41.79s/it]
Traceback (most recent call last):
  File "/Users/rubenvanheusden/Desktop/AronTextRedaction/TPDLTextRedaction/notebooks/../scripts/run_redaction_detector.py", line 90, in <module>
    main(args)
  File "/Users/rubenvanheusden/Desktop/AronTextRedaction/TPDLTextRedaction/notebooks/../scripts/run_redaction_detector.py", line 50, in main
    image_tables = table_image.extract_tables()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rubenvanheusden/anaconda3/envs/text_redaction_env/lib/python3.11/site-packages/img2table/document/image.py", line 41, in extract_tables
    extracted_tables = super(Image, self).extract_tables(ocr=ocr,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rubenvanheusden/anaconda3/envs/text_redaction_env/lib/python3.11/site-packages/img2table/document/base
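The traceback above is printed by the child process. `subprocess.run` only raises in the parent when `check=True` is passed, so the loop carries on and a failed document is easy to overlook. A minimal sketch of collecting failures explicitly is shown below; the helper name `run_and_collect_failures` is hypothetical, and the two stand-in commands replace the calls to `run_redaction_detector.py` so that the example runs anywhere.

```python
import subprocess
import sys


def run_and_collect_failures(commands):
    """Run each command and return (command, last stderr line) for non-zero exits."""
    failures = []
    for command in commands:
        completed = subprocess.run(command, capture_output=True, text=True)
        if completed.returncode != 0:
            lines = completed.stderr.strip().splitlines()
            failures.append((command, lines[-1] if lines else ""))
    return failures


# Stand-in commands: one succeeds, one fails with an uncaught exception.
commands = [
    [sys.executable, "-c", "print('ok')"],
    [sys.executable, "-c", "raise ValueError('no table found')"],
]
failures = run_and_collect_failures(commands)
for command, reason in failures:
    print("failed:", command[-1], "->", reason)
```

With the real script, the list of failed PDFs would indicate which CSV files in the output folder cannot be trusted for this run.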

In [15]:
all_csv_files = glob('../debug_data/clean_data_output/*.csv')
redaction_output_dataframe = pd.concat([pd.read_csv(path) for path in all_csv_files])
redaction_output_dataframe.describe()

| | Number of redacted regions | Percentage of redacted text |
|---|---|---|
| count | 279.0 | 279.0 |
| mean | 8.286738 | 21.623656 |
| std | 12.973002 | 26.866007 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 |
| 50% | 3.0 | 6.0 |
| 75% | 11.0 | 42.0 |
| max | 103.0 | 99.0 |


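As we can see from the above dataframe, the number of detected redacted blocks and the percentage of redacted text are fairly consistent with the version where we don't exclude tables (a mean of 8.29 versus 9.18 regions, and 21.6 versus 23.9 percent redacted text), meaning that the table detector does not produce many false positives. If anything we detect tables slightly too often, but the result is quite close to the original version. Note that one document raised an exception inside `img2table` during this run, and both runs write to the same output folder, so the statistics for that document may still come from the run without table exclusion.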
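The summary statistics do not show which pages changed between the two runs. A sketch of a page-level comparison follows, under two assumptions that go beyond this notebook: the results of the two runs are kept in separate dataframes (the cells above write both runs to the same folder), and each row can be identified by hypothetical `document` and `page` columns. The example values are made up.

```python
import pandas as pd

REGIONS = "Number of redacted regions"


def pages_changed(baseline, with_exclusion, keys=("document", "page")):
    """Return the pages whose number of detected regions differs between two runs."""
    merged = baseline.merge(
        with_exclusion, on=list(keys), suffixes=("_base", "_excl")
    )
    differs = merged[f"{REGIONS}_base"] != merged[f"{REGIONS}_excl"]
    columns = list(keys) + [f"{REGIONS}_base", f"{REGIONS}_excl"]
    return merged.loc[differs, columns].reset_index(drop=True)


# Made-up results for one three-page document.
baseline = pd.DataFrame(
    {"document": ["a.pdf"] * 3, "page": [1, 2, 3], REGIONS: [0, 5, 2]}
)
with_exclusion = pd.DataFrame(
    {"document": ["a.pdf"] * 3, "page": [1, 2, 3], REGIONS: [0, 0, 2]}
)
changed = pages_changed(baseline, with_exclusion)
print(changed)
```

On the clean data, every row returned this way would be a page that was skipped as a table page although it contains no table, which gives a direct list of pages to inspect by hand.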