# Fix Transkribus Reading Order

By default, Transkribus apparently puts regions in a column-wise reading order. In our test file, we have four text regions ordered like this:

| (1) lo | (3) ro |
|:------:|:------:|
| (2) lu | (4) ru |

The number denotes the reading order assigned by Transkribus; the text helps us identify the cells.

Instead of this reading order, we want a row-wise reading order. Assigning this by hand is painful in Transkribus, so we ignore the reading order we get from Transkribus and instead fix it during processing.

In [1]:
from pathlib import Path
from collections import namedtuple
from functools import cmp_to_key

from lxml import etree

In [2]:
PAGE_NS = 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15'
PAGE = '{' + PAGE_NS + '}'
NSMAP = {'pc': PAGE_NS}

In [3]:
PAGE_DIR = Path('.') / 'test_data'

A small helper function to get the text content of a region.

In [4]:
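def gather_lines(region_element):
    unicode_path = f'{PAGE}TextLine/{PAGE}TextEquiv/{PAGE}Unicode'
    # An empty Unicode element has text None; treat it as an empty line.
    return '\n'.join(el.text or '' for el in region_element.findall(unicode_path))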

The region coordinates are stored in the XML as a single string of points. We convert this string to a list of points:

`[[p1x, p1y], [p2x, p2y], ...]`

In [5]:
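def region_coords(region):
    coords = region.find(f'{PAGE}Coords').get('points')
    # Convert to integers so that min()/max() compare numerically, not as strings.
    coords = [[int(value) for value in point.split(',')] for point in coords.split()]
    return coords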
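
As a standalone illustration of this conversion, the sketch below parses a made-up `points` string. The helper name `parse_points` and the sample values are hypothetical and not taken from the test file. Each value is converted to `int` so that later comparisons are numeric.

```python
def parse_points(points):
    """Turn a PAGE 'points' string into a list of [x, y] integer pairs."""
    return [[int(value) for value in point.split(',')] for point in points.split()]


# Hypothetical rectangle, written the way a Coords element stores it.
sample_points = '100,200 400,200 400,350 100,350'
parsed = parse_points(sample_points)
print(parsed)  # [[100, 200], [400, 200], [400, 350], [100, 350]]
```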

For each region, we calculate a bounding box, which is the smallest rectangle that fits the entire shape.

In [6]:
BoundingBox = namedtuple('BoundingBox', ['minx', 'miny', 'maxx', 'maxy'])

def bounding_box(coords):
    x = [point[0] for point in coords]
    y = [point[1] for point in coords]
    return BoundingBox(min(x), min(y), max(x), max(y))
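
To make the bounding box concrete, here is a small sketch with a slightly skewed quadrilateral. The names `Box` and `box_of` and all coordinates are invented for illustration. The last line shows why the coordinates should be numbers: strings are compared character by character, so `'405'` sorts before `'98'`.

```python
from collections import namedtuple

Box = namedtuple('Box', ['minx', 'miny', 'maxx', 'maxy'])


def box_of(points):
    """Smallest axis-aligned rectangle that contains all points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return Box(min(xs), min(ys), max(xs), max(ys))


# Hypothetical region outline whose corners are not perfectly aligned.
skewed = [[98, 205], [402, 199], [405, 348], [101, 352]]
box = box_of(skewed)
print(box)  # Box(minx=98, miny=199, maxx=405, maxy=352)

# Strings compare lexicographically, numbers compare by value.
print(min(['98', '405']), min([98, 405]))  # 405 98
```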

Our re-ordering of the regions is based on a simple heuristic:

* If a region is below another region, it always comes after in reading order.
* If a region is at the same height, the region that is further to the right comes after.

Because regions might vary slightly in their coordinates, we don’t just compare one point (e.g., $X_{min},Y_{min}$), but take the full shape into account. This will not work for overlapping shapes, but we don’t have them in our case.

In [7]:
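def region_cmp(region1, region2):
    bb1 = bounding_box(region_coords(region1))
    bb2 = bounding_box(region_coords(region2))
    if bb1 == bb2:
        return 0
    # Vertical position decides first: a region below another always comes after.
    if bb2.miny >= bb1.maxy:
        return -1
    if bb1.miny >= bb2.maxy:
        return 1
    # Same height: the region further to the right comes after.
    if bb2.minx >= bb1.maxx:
        return -1
    return 1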
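
The heuristic can be tried out without any XML. The sketch below applies it to four hypothetical bounding boxes laid out like the cells in the table at the top. The names `Box` and `box_cmp` and all coordinates are assumptions made for this example. It tests the vertical relation in both directions before looking at the horizontal one, so the result does not depend on the order in which two regions are passed in.

```python
from collections import namedtuple
from functools import cmp_to_key

Box = namedtuple('Box', ['name', 'minx', 'miny', 'maxx', 'maxy'])


def box_cmp(a, b):
    """Row-wise order: higher regions first, then left before right."""
    if b.miny >= a.maxy:  # b lies entirely below a
        return -1
    if a.miny >= b.maxy:  # a lies entirely below b
        return 1
    if b.minx >= a.maxx:  # same height, b lies to the right of a
        return -1
    if a.minx >= b.maxx:  # same height, a lies to the right of b
        return 1
    return 0              # overlapping boxes: leave their relative order alone


# Made-up coordinates with small misalignments, listed column by column.
column_wise = [
    Box('lo', 100, 100, 400, 300),
    Box('lu', 102, 310, 398, 500),
    Box('ro', 420, 98, 700, 302),
    Box('ru', 418, 312, 702, 498),
]

row_wise = sorted(column_wise, key=cmp_to_key(box_cmp))
names = [box.name for box in row_wise]
print(names)  # ['lo', 'ro', 'lu', 'ru']
```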

Now we can use this comparison function to re-order the text regions.

In [8]:
page_file = PAGE_DIR / 'reading_order.xml'
tree = etree.parse(str(page_file))
page = tree.find(f'{PAGE}Page')

regions = list(page.iter(f'{PAGE}GraphicRegion', f'{PAGE}TextRegion'))

print('original order:')

for region in regions:
    print(region.get('custom'))
    print(gather_lines(region))
    
print()
print('fixed order:')

for region in sorted(regions, key=cmp_to_key(region_cmp)):
    print(region.get('custom'))
    print(gather_lines(region))

original order:
readingOrder {index:0;}
lo
readingOrder {index:1;}
lu
readingOrder {index:2;}
ro
readingOrder {index:3;}
ru

fixed order:
readingOrder {index:0;}
lo
readingOrder {index:2;}
ro
readingOrder {index:1;}
lu
readingOrder {index:3;}
ru
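
In the fixed order, the `custom` attributes still carry the old indices. If they should match the new order, one option is to rewrite the index after sorting. The following is only a sketch: the helper `with_reading_index` is hypothetical, and it assumes the attribute always has the `readingOrder {index:N;}` form seen in the output above. Whether other tools read this attribute is not covered here.

```python
import re

# Matches the index inside "readingOrder {index:N;}" and keeps the surrounding text.
INDEX_RE = re.compile(r'(readingOrder\s*\{\s*index:)\d+(;)')


def with_reading_index(custom, index):
    """Return the custom attribute value with its reading order index replaced."""
    return INDEX_RE.sub(lambda m: f'{m.group(1)}{index}{m.group(2)}', custom)


# Attribute values in the order produced by the sort above.
customs_in_fixed_order = [
    'readingOrder {index:0;}',
    'readingOrder {index:2;}',
    'readingOrder {index:1;}',
    'readingOrder {index:3;}',
]

renumbered = [
    with_reading_index(custom, index)
    for index, custom in enumerate(customs_in_fixed_order)
]
print(renumbered)
```

With real elements, the new value would be written back with `region.set('custom', ...)` before saving the tree.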
