How to import ABBYY-generated ALTO

Elisabeth Engl edited this page Dec 4, 2020 · 2 revisions

Converting ALTO to PAGE

To re-process or post-process (or just evaluate) results from ABBYY in OCR-D, you first need to convert its ALTO output to PAGE. (You can also get AbbyyXml if your license permits it, but this is not covered here.)

You can use ocrd-fileformat-transform for this. It wraps ocr-transform.sh, which in turn includes the prima-page-converter.

Problem

However, the output of ABBYY has two problems:

  1. It does not set /alto/Description/sourceImageInformation/fileName, so the converted PAGE-XML will not contain any /PcGts/Page/@imageFilename, which makes it impossible to process with OCR-D.
  2. It has a bug that writes wrong coordinates into the Shape elements of blocks and lines. When ABBYY detects a skew, it internally rotates the image, which enlarges it (its pixel dimensions grow). When it exports the detected segments, it needs to transform the coordinates back to the original image. It does so for the angle, but not for the extra offset. This affects only the coordinates described by Shape elements; the bounding box attributes (@HPOS, @VPOS, @WIDTH, @HEIGHT) are not affected.
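
To check whether a given ALTO file shows either problem before converting it, a small script can help. The following is a minimal sketch using only the Python standard library. The function name `inspect_alto` and the sample document are hypothetical and only for illustration; they are not taken from real ABBYY output. Element names are matched regardless of namespace, on the assumption that the ALTO version (and therefore the namespace URI) may differ between exports.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO fragment (not real ABBYY output):
# no sourceImageInformation/fileName, and one block carrying a Shape.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Description><MeasurementUnit>pixel</MeasurementUnit></Description>
  <Layout><Page ID="p1"><PrintSpace>
    <TextBlock ID="b1" HPOS="10" VPOS="20" WIDTH="100" HEIGHT="50">
      <Shape><Polygon POINTS="10,20 110,20 110,70 10,70"/></Shape>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag name."""
    return tag.rsplit('}', 1)[-1]


def inspect_alto(xml_text):
    """Report whether a non-empty fileName exists and count Shape elements."""
    root = ET.fromstring(xml_text)
    file_names = [el.text for el in root.iter()
                  if local_name(el.tag) == 'fileName']
    has_file_name = any(name and name.strip() for name in file_names)
    shape_count = sum(1 for el in root.iter()
                      if local_name(el.tag) == 'Shape')
    return {'has_file_name': has_file_name, 'shape_count': shape_count}


if __name__ == '__main__':
    print(inspect_alto(SAMPLE))
```

A file reporting `has_file_name` as false is affected by problem 1. A non-zero `shape_count` only indicates that Shape elements are present; whether their coordinates are actually shifted depends on whether ABBYY deskewed that page.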

Solution

One can compensate for these by the following post-processing steps:

  1. Align both fileGrps, original images and imported PAGE annotations, by their physical page IDs. (The PAGE files will have empty @imageFilename.) For each pair, create an (empty) PAGE for the image and add the segments from the existing PAGE. This can be done via ocrd-segment-replace-page.
  2. Remove all the Shape elements – they are redundant anyway (as ABBYY does not yield polygons, only bounding boxes).
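
If xmlstarlet is not available, step 2 can also be approximated in Python. This is a minimal sketch under the same assumptions as the steps above (Shape elements are redundant and can simply be dropped); the function name `strip_shapes` and the sample document are hypothetical. It operates on a string rather than on files, so adapting it to a fileGrp directory is left to the reader.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO fragment (not real ABBYY output).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="p1"><PrintSpace>
    <TextBlock ID="b1" HPOS="10" VPOS="20" WIDTH="100" HEIGHT="50">
      <Shape><Polygon POINTS="10,20 110,20 110,70 10,70"/></Shape>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def strip_shapes(xml_text):
    """Remove every Shape element; return (new XML text, number removed)."""
    root = ET.fromstring(xml_text)
    # Keep the document's namespace as the default one on output,
    # so that elements are not rewritten with an 'ns0:' prefix.
    if root.tag.startswith('{'):
        ET.register_namespace('', root.tag[1:].split('}', 1)[0])
    removed = 0
    # Take a snapshot of all elements first, because the tree is
    # modified while walking it.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag.rsplit('}', 1)[-1] == 'Shape':
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding='unicode'), removed


if __name__ == '__main__':
    cleaned, count = strip_shapes(SAMPLE)
    print(count)
    print(cleaned)
```

The bounding box attributes (@HPOS, @VPOS, @WIDTH, @HEIGHT) are left untouched, so the converted PAGE can still derive its coordinates from them.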

For example, when using the makefileization, the workflow could look like this:

# import from DFGViewer
FULLTEXT:
        ocrd workspace find -G FULLTEXT --download
        xmlstarlet ed --inplace -d //_:Shape FULLTEXT/* # fix 2
        ocrd workspace find -G ORIGINAL --download
        ocrd workspace prune-files # delete all other files (not downloaded)

# created by side effect above
ORIGINAL: ;

# convert
PAGE: FULLTEXT
PAGE: TOOL = ocrd-fileformat-transform
PAGE: PARAMS = "from-to": "alto page"

# fix 1
PAGE2: ORIGINAL PAGE
PAGE2: TOOL = ocrd-segment-replace-page
PAGE2: OPTIONS = -P transform_coordinates false
