# Analyzer configuration for running Table Transformer


In this notebook, we demonstrate how the [Table Transformer](https://github.com/microsoft/table-transformer) models can be utilized for table detection and table segmentation by adjusting the analyzer's default configuration. 

Additionally, we illustrate that modifying downstream parameters might be beneficial as well. We start from the default configuration and improve the quality of page parsing only by changing some processing parameters. The chosen configurations in this notebook may not be optimal, and we recommend continuing experimentation with the parameters, especially if fine-tuning models is not an option.

## General configuration

In [12]:
pip install 'deepdoctection[pt]'

In [13]:
import deepdoctection as dd
print(dd.__version__)

In [14]:
import notebook
print(notebook.__version__)

In [15]:
import os
from pathlib import Path

# Using os.path
home_dir = os.path.expanduser("~")

# Or using pathlib (Python 3.5+)
home_dir = str(Path.home())
print(home_dir)

/Users/chenhsihu


In [16]:

os.environ["USE_DD_PILLOW"]="True"
os.environ["USE_DD_OPENCV"]="False"

import deepdoctection as dd
from matplotlib import pyplot as plt
from IPython.display import HTML

The first configuration replaces the default layout and segmentation models with the registered Table Transformer models. The values must match the model names in the `ModelCatalog`. You can list all registered models with `ModelCatalog.get_profile_list()`.

The table recognition model detects tables again within the cropped table regions. These detections are irrelevant for further processing and actually lead to errors. For this reason, the category `table` must be filtered out.

```yaml
PT:
   LAYOUT:
      WEIGHTS: microsoft/table-transformer-detection/pytorch_model.bin
   ITEM:
      WEIGHTS: microsoft/table-transformer-structure-recognition/pytorch_model.bin
      FILTER:
         - table
```
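
The cells in this notebook pass the same settings to `dd.get_dd_analyzer` as a list of `KEY.SUBKEY=value` strings via `config_overwrite`. As a minimal sketch of how the nested YAML above corresponds to that flat form, the following flattens a plain dictionary into such strings. The helper `to_overwrite_list` is hypothetical and not part of deepdoctection.

```python
def to_overwrite_list(config, prefix=""):
    """Flatten a nested dict into 'A.B.C=value' strings (hypothetical helper)."""
    overrides = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested sections such as PT -> ITEM
            overrides.extend(to_overwrite_list(value, dotted))
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


# Same content as the YAML fragment, written as a Python dict
yaml_like = {
    "PT": {
        "LAYOUT": {"WEIGHTS": "microsoft/table-transformer-detection/pytorch_model.bin"},
        "ITEM": {
            "WEIGHTS": "microsoft/table-transformer-structure-recognition/pytorch_model.bin",
            "FILTER": ["table"],
        },
    }
}

overrides = to_overwrite_list(yaml_like)
for line in overrides:
    print(line)
```

The resulting strings have the same shape as the entries used in the `config_overwrite` lists in this notebook.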

In [18]:
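# Build the analyzer with the overrides from the YAML above; adjust the path to your copy of the sample PDF
path = "/home/janis/Documents/Repos/notebooks/sample/2312.13560.pdf"
analyzer = dd.get_dd_analyzer(config_overwrite=[
    "PT.LAYOUT.WEIGHTS=microsoft/table-transformer-detection/pytorch_model.bin",
    "PT.ITEM.WEIGHTS=microsoft/table-transformer-structure-recognition/pytorch_model.bin",
    "PT.ITEM.FILTER=['table']",
])

df = analyzer.analyze(path=path)
df.reset_state()
dp = next(iter(df))

np_image = dp.viz()

plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(np_image)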

In [None]:
dp.tables[0].csv

In [None]:
dp.tables[0].score

In [None]:
dp.text

Okay, table detection doesn't work at all on this page. Besides that, no text is recognized outside of the table. To suppress the poor table region prediction, we increase the confidence threshold of the layout detector to 0.4. This cannot be changed through the `analyzer` configuration; it has to be set directly on the predictor of the first pipeline component.
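
To illustrate why a higher threshold removes weak predictions, here is a small sketch of score-based filtering. The `Detection` class, the `filter_by_score` function and the scores are made up for illustration; whether the real predictor compares with `>=` or `>` is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    category: str
    score: float


def filter_by_score(detections, threshold):
    """Keep only detections whose confidence reaches the threshold."""
    return [d for d in detections if d.score >= threshold]


# Made-up scores for illustration only; they are not taken from the sample PDF
detections = [Detection("table", 0.15), Detection("table", 0.35), Detection("table", 0.92)]

kept_default = filter_by_score(detections, threshold=0.1)  # keeps all three
kept_strict = filter_by_score(detections, threshold=0.4)   # keeps only the confident one
print(len(kept_default), len(kept_strict))
```

A stricter threshold trades recall for precision: weak false positives disappear, but genuine tables with low scores would be dropped as well.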

The surrounding text is not displayed because the configuration only outputs text that lies within a layout segment, and in this case the only layout segments are tables. If we set `TEXT_ORDERING.INCLUDE_RESIDUAL_TEXT_CONTAINER=True`, line layout segments will be generated for all words, and all line segments will be taken into account when generating narrative text.
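
The idea of residual words can be sketched as follows: words covered by a layout segment are handled by that segment, while the remaining words are the ones that would otherwise be lost. This is a simplified illustration with hypothetical boxes and a centre-point rule; the actual matching rule in deepdoctection may differ.

```python
def center_inside(word_box, segment_box):
    """True if the centre of word_box lies within segment_box (boxes are x0, y0, x1, y1)."""
    cx = (word_box[0] + word_box[2]) / 2
    cy = (word_box[1] + word_box[3]) / 2
    return segment_box[0] <= cx <= segment_box[2] and segment_box[1] <= cy <= segment_box[3]


def split_words(words, segments):
    """Separate words covered by a layout segment from residual words."""
    covered, residual = [], []
    for text, box in words:
        target = covered if any(center_inside(box, s) for s in segments) else residual
        target.append(text)
    return covered, residual


# Hypothetical page: one table segment and three words in pixel coordinates
segments = [(100, 100, 400, 300)]
words = [
    ("Revenue", (120, 120, 200, 140)),  # inside the table
    ("2023", (300, 120, 340, 140)),     # inside the table
    ("Abstract", (100, 40, 180, 60)),   # above the table
]

covered, residual = split_words(words, segments)
print(covered)
print(residual)
```

In this picture, the residual words are the ones that need their own line segments before they can appear in the narrative text.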

In [None]:
#path="/path/to/dir/sample/2312.13560.pdf" # Use the PDF in the sample folder
path = "/home/janis/Documents/Repos/notebooks/sample/2312.13560.pdf"
    
analyzer = dd.get_dd_analyzer(config_overwrite=[
    "PT.LAYOUT.WEIGHTS=microsoft/table-transformer-detection/pytorch_model.bin",
    "PT.ITEM.WEIGHTS=microsoft/table-transformer-structure-recognition/pytorch_model.bin",
    "PT.ITEM.FILTER=['table']",
    "OCR.USE_DOCTR=True",
    "OCR.USE_TESSERACT=False",
    "TEXT_ORDERING.INCLUDE_RESIDUAL_TEXT_CONTAINER=True",
])

analyzer.pipe_component_list[0].predictor.config.threshold = 0.4  # default threshold is at 0.1

df = analyzer.analyze(path=path)
df.reset_state()
dp = next(iter(df))

np_image = dp.viz()

In [None]:
plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(np_image)

In [None]:
print(dp.text)

The result is not much better. Although we are able to retrieve text outside tables, we observe that the text lines span across multiple columns. This leads to a disastrous outcome in terms of reading order. 

Text lines are constructed heuristically. In particular, the heuristic decides when adjacent words belong to the same text line and when text lines must be separated even though they are on the same horizontal level.

By reducing the value of `TEXT_ORDERING.PARAGRAPH_BREAK`, text lines are split as soon as the distance between adjacent word boxes exceeds this smaller minimum distance.
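
The effect of a smaller break value can be sketched with a toy version of such a heuristic. The function `split_line` and the word coordinates are hypothetical, and measuring the gap relative to the page width is an assumption; this is not deepdoctection's actual implementation.

```python
def split_line(words, page_width, paragraph_break):
    """Split words on one horizontal level into lines at large horizontal gaps.

    words: list of (text, x0, x1) in pixels; gaps are compared relative to page_width.
    """
    lines, current = [], []
    previous_x1 = None
    for text, x0, x1 in sorted(words, key=lambda w: w[1]):
        gap = 0.0 if previous_x1 is None else (x0 - previous_x1) / page_width
        if current and gap > paragraph_break:
            # Gap is too wide: close the current line and start a new one
            lines.append(current)
            current = []
        current.append(text)
        previous_x1 = x1
    if current:
        lines.append(current)
    return lines


# Hypothetical two-column page, 1000 px wide, with a 20 px gutter between the columns
words = [("left", 100, 160), ("column", 165, 480), ("right", 500, 560), ("column", 565, 640)]

merged = split_line(words, page_width=1000, paragraph_break=0.035)
separated = split_line(words, page_width=1000, paragraph_break=0.01)
print(merged)
print(separated)
```

With the larger value the gutter (relative gap 0.02) is not wide enough to trigger a break, so both columns end up in one line; with the smaller value the line is split at the gutter.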

In [None]:
#path="/path/to/dir/sample/2312.13560.pdf" # Use the PDF in the sample folder
path = "/home/janis/Documents/Repos/notebooks/sample/2312.13560.pdf"
    
analyzer = dd.get_dd_analyzer(config_overwrite=[
    "PT.LAYOUT.WEIGHTS=microsoft/table-transformer-detection/pytorch_model.bin",
    "PT.ITEM.WEIGHTS=microsoft/table-transformer-structure-recognition/pytorch_model.bin",
    "PT.ITEM.FILTER=['table']",
    "OCR.USE_DOCTR=True",
    "OCR.USE_TESSERACT=False",
    "TEXT_ORDERING.INCLUDE_RESIDUAL_TEXT_CONTAINER=True",
    "TEXT_ORDERING.PARAGRAPH_BREAK=0.01",  # default value is at 0.035 which might be too large
])

analyzer.pipe_component_list[0].predictor.config.threshold = 0.4

df = analyzer.analyze(path=path)
df.reset_state()
df_iter = iter(df)

In [None]:
dp = next(df_iter)
np_image = dp.viz()

plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(np_image)

In [None]:
print(dp.text)

Okay, this page looks good now. Let's continue scrolling through the document.

In [None]:
dp = next(df_iter)
np_image = dp.viz()

plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(np_image)

Once again, we observe a false positive, this time with an even higher confidence score. We are not going to increase the threshold any further, though.

In [None]:
dp.tables[0].score

In [None]:
print(dp.text)

In [None]:
dp = next(df_iter)
np_image = dp.viz()

plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(np_image)

In [None]:
dp = next(df_iter)
np_image = dp.viz()

plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(np_image)

The results now look quite decent, and the segmentation also yields usable output. However, keep in mind that the models may produce much weaker results on other types of documents.

## Table segmentation

We will now take a look at another example, focusing on optimizations in table segmentation.

In [None]:
#path="/path/to/dir/sample/finance" # Use the PDF in the sample folder
path = "/home/janis/Documents/Repos/notebooks/sample/finance"
    
analyzer = dd.get_dd_analyzer(config_overwrite=[
    "PT.LAYOUT.WEIGHTS=microsoft/table-transformer-detection/pytorch_model.bin",
    "PT.ITEM.WEIGHTS=microsoft/table-transformer-structure-recognition/pytorch_model.bin",
    "PT.ITEM.FILTER=['table']",
    "OCR.USE_DOCTR=True",
    "OCR.USE_TESSERACT=False",
    "TEXT_ORDERING.INCLUDE_RESIDUAL_TEXT_CONTAINER=True",
    "TEXT_ORDERING.PARAGRAPH_BREAK=0.01",
])

analyzer.pipe_component_list[0].predictor.config.threshold = 0.4  # default threshold is at 0.1

df = analyzer.analyze(path=path)
df.reset_state()
df_iter = iter(df)

In [None]:
dp = next(df_iter)
np_image = dp.viz()

plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(np_image)

In [None]:
HTML(dp.tables[0].html)

In [None]:
HTML(dp.tables[1].html)

The table segmentation takes the various cell types identified by the segmentation model into account. Unfortunately, the detection of spanning cells, for example, does not work particularly well. This can be seen in the last sample, where the model predicts that the first column contains a spanning cell. We want to deactivate this feature. To do this, we filter out all cell types (`column_header`, `projected_row_header` and `spanning`) in addition to `table`.

In [None]:
#path="/path/to/dir/sample/finance" # Use the PDF in the sample folder
path = "/home/janis/Documents/Repos/notebooks/sample/finance"
    
analyzer = dd.get_dd_analyzer(config_overwrite=[
    "PT.LAYOUT.WEIGHTS=microsoft/table-transformer-detection/pytorch_model.bin",
    "PT.ITEM.WEIGHTS=microsoft/table-transformer-structure-recognition/pytorch_model.bin",
    "PT.ITEM.FILTER=['table','column_header','projected_row_header','spanning']",
    "OCR.USE_DOCTR=True",
    "OCR.USE_TESSERACT=False",
    "TEXT_ORDERING.INCLUDE_RESIDUAL_TEXT_CONTAINER=True",
    "TEXT_ORDERING.PARAGRAPH_BREAK=0.01",
])

analyzer.pipe_component_list[0].predictor.config.threshold = 0.4

df = analyzer.analyze(path=path)
df.reset_state()
df_iter = iter(df)

In [None]:
dp = next(df_iter)
np_image = dp.viz()

plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(np_image)

In [None]:
HTML(dp.tables[0].html)

In [None]:
HTML(dp.tables[1].html)

In [None]:
HTML(dp.tables[2].html)

As already mentioned, all text that is not part of a table cell is pushed into the narrative text.

In [None]:
print(dp.text)

There are additional configuration parameters that can improve segmentation. These include, for example, `SEGMENTATION.THRESHOLD_ROWS`, `SEGMENTATION.THRESHOLD_COLS`, `SEGMENTATION.REMOVE_IOU_THRESHOLD_ROWS`, and `SEGMENTATION.REMOVE_IOU_THRESHOLD_COLS`. To observe the effects, we recommend experimenting with these parameters.
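
As background for the two `REMOVE_IOU_THRESHOLD_*` parameters, the sketch below computes the intersection over union (IoU) of two boxes. It assumes that these parameters compare the overlap of row or column boxes against an IoU value, which should be checked against the deepdoctection documentation; the boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero for boxes that do not overlap
    intersection = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


# Hypothetical row boxes: the second overlaps the first heavily, the third does not
row_1 = (0, 0, 100, 20)
row_2 = (0, 5, 100, 25)
row_3 = (0, 40, 100, 60)

print(iou(row_1, row_2))
print(iou(row_1, row_3))
```

Under that assumption, a lower IoU threshold would treat more pairs of rows or columns as duplicates, while a higher one would keep more overlapping candidates.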