
How to create searchable Fulltext Data for DFG Viewer

Stefan Weil edited this page Jun 21, 2023 · 5 revisions

The OCR-D pilot phase at ULB Saxony-Anhalt (2020) aimed at producing searchable fulltext data for use in the DFG Viewer as well as in IIIF viewers. Already digitized materials from the library, specifically historical documents from ULB's in-house collections, formed the basis for the workflow, as they are readily available via the OAI-PMH API. The goal was a workflow that operates under real use-case conditions rather than a theoretical construction based on assumptions.

Workflow Configuration

The ALTO XML data was produced by a Makefile workflow configuration chaining the following processors (with their respective parameters):

  • ocrd-im6convert
  • ocrd-olena-binarize ("impl": "sauvola-ms-split")
  • ocrd-anybaseocr-crop
  • ocrd-cis-ocropy-deskew ("level-of-operation": "page", "maxskew": 5)
  • ocrd-tesserocr-segment-region ("padding": 5, "find_tables": false)
  • ocrd-segment-repair ("plausibilize": true, "plausibilize_merge_min_overlap": 0.7)
  • ocrd-cis-ocropy-clip
  • ocrd-cis-ocropy-segment ("spread": 2.4)
  • ocrd-cis-ocropy-dewarp
  • ocrd-tesserocr-recognize ("overwrite_segments": true, "model" : "gt4hist_5000k+Fraktur+frk+deu")
  • ocrd-fileformat-transform
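
The same chain could, for instance, also be driven directly through the individual processor CLIs. The following sketch only builds the command lines; the fileGrp names (MAX, OCR-D-01, ...) and the plain -I/-O/-P invocation style are assumptions for illustration, not the original Makefile setup:

```python
# sketch: per-step command lines for the processor chain above;
# fileGrp names are illustrative, parameters are taken from the list above
STEPS = [
    ('ocrd-im6convert', []),
    ('ocrd-olena-binarize', ['-P', 'impl', 'sauvola-ms-split']),
    ('ocrd-anybaseocr-crop', []),
    ('ocrd-cis-ocropy-deskew',
     ['-P', 'level-of-operation', 'page', '-P', 'maxskew', '5']),
    ('ocrd-tesserocr-segment-region',
     ['-P', 'padding', '5', '-P', 'find_tables', 'false']),
    ('ocrd-segment-repair',
     ['-P', 'plausibilize', 'true',
      '-P', 'plausibilize_merge_min_overlap', '0.7']),
    ('ocrd-cis-ocropy-clip', []),
    ('ocrd-cis-ocropy-segment', ['-P', 'spread', '2.4']),
    ('ocrd-cis-ocropy-dewarp', []),
    ('ocrd-tesserocr-recognize',
     ['-P', 'overwrite_segments', 'true',
      '-P', 'model', 'gt4hist_5000k+Fraktur+frk+deu']),
    ('ocrd-fileformat-transform', []),
]

def build_commands(steps, first_grp='MAX'):
    """Chain each step's output fileGrp into the next step's input."""
    commands, in_grp = [], first_grp
    for i, (processor, params) in enumerate(steps, start=1):
        out_grp = f'OCR-D-{i:02d}'
        commands.append([processor, '-I', in_grp, '-O', out_grp] + params)
        in_grp = out_grp
    return commands

for cmd in build_commands(STEPS):
    print(' '.join(cmd))
```

Each command would then be executed inside the workspace directory, e.g. via subprocess.run(cmd, check=True).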

Setup

To enhance runtime performance and stability, the pages of each digitized book were processed in up to 12 separate Docker containers from the Docker image ocrd/all:2020-08-04. Of roughly 50,000 pages in total, about 500 (around 1%) were dropped due to errors. These errors were almost exclusively related to rather difficult input data (large illustrations, maps, tables, handwritten notes and the like). During tests, parallelization of the entire workflow proved to be more error-prone and was therefore dropped in favour of the page-wise separation.

Because the pages are processed separately, all resulting files need to be integrated into the METS/MODS afterwards. This step is not required if the entire work is processed at once, but becomes necessary due to the separation. It starts by creating an OCR-D workspace from an OAI record (the following snippets were implemented with standard Python 3.6, using additional functionality from the ocrd core package):

import os

from ocrd.resolver import Resolver

def ocrd_workspace_clone(oai_identifier):
    """Wrap ocrd workspace clone in curdir"""

    # clone from OAI-PMH; OAI_URL and OAI_PARAMS are repository-specific constants
    mets_url = f"{OAI_URL}{OAI_PARAMS}&identifier={oai_identifier}"
    resolver = Resolver()
    workspace = resolver.workspace_from_url(
        mets_url=mets_url,
        dst_dir='.',
        download=False
    )
    workspace.save_mets()


def ocrd_workspace_setup(root_dir, sub_dir, file_id):
    """Wrap ocrd workspace init and add single file"""

    # init workspace
    the_dir = os.path.abspath(sub_dir)
    resolver = Resolver()
    workspace = resolver.workspace_from_nothing(
        directory=the_dir
    )
    workspace.save_mets()

    # add image
    image_src = f"{root_dir}/MAX/{file_id}.jpg"
    resolver.download_to_directory(
        the_dir,
        image_src,
        subdir='MAX')
    kwargs = {
        'fileGrp': 'MAX',
        'ID': 'IMG_' + file_id,
        'mimetype': 'image/jpeg',
        'url': f"MAX/{file_id}.jpg"}
    workspace.mets.add_file(**kwargs)
    workspace.save_mets()

# proceed with workflow ...
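
After recognition, each page's ALTO result has to be registered in the METS as well; with the ocrd package this is the same workspace.mets.add_file call as for the image above (using, say, a FULLTEXT fileGrp, a name assumed here). Structurally, each such entry boils down to a mets:file with an xlink'ed FLocat child; a standard-library sketch of that shape (attribute values are illustrative):

```python
import xml.etree.ElementTree as ET

METS_NS = 'http://www.loc.gov/METS/'
XLINK_NS = 'http://www.w3.org/1999/xlink'

def make_file_entry(file_id, mimetype, href):
    """Build the mets:file/mets:FLocat pair that a fileGrp entry amounts to."""
    mets_file = ET.Element(f'{{{METS_NS}}}file',
                           {'ID': file_id, 'MIMETYPE': mimetype})
    ET.SubElement(mets_file, f'{{{METS_NS}}}FLocat',
                  {f'{{{XLINK_NS}}}href': href, 'LOCTYPE': 'URL'})
    return mets_file

# hypothetical ALTO result for page 0001
entry = make_file_entry('FULLTEXT_0001', 'application/alto+xml',
                        'FULLTEXT/0001.xml')
print(ET.tostring(entry, encoding='unicode'))
```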

The actual execution of the OCR-D container is wrapped in Python to allow scaling:

import os
import subprocess

# case a: run n workspaces in parallel
def run_ocr_workspaces(*args):
    # expects a single (ocr_dir, part_by) tuple, e.g. when called via executor.map
    ocr_dir = args[0][0]
    part_by = args[0][1]
    os.chdir(ocr_dir)
    user_id = os.getuid()
    # remove a possibly lingering container; do not fail if none exists
    cmd_clean = f'docker rm --force {CNT_NAME}'
    subprocess.run(cmd_clean, shell=True, check=False)
    cmd = f'docker run --rm --name {CNT_NAME} -u "{user_id}" -w /data -v "{ocr_dir}":/data -v {TESSDIR_HOST}:{TESSDIR_CNT} {IMAGE} ocrd-make all -j"{part_by}" -f {MAKEFILE} .'

    try:
        result = subprocess.run(cmd, shell=True, check=True, timeout=7200)
        # ... analyze result
    except subprocess.CalledProcessError as exc:
        # ... handle subprocess failure
        pass


# case b: run n containers with a single page
def run_ocr_page(ocr_dir):
    # takes the page workspace directory directly
    os.chdir(ocr_dir)
    user_id = os.getuid()
    cmd = f'docker run --rm -u "{user_id}" -w /data -v "{ocr_dir}":/data -v {TESSDIR_HOST}:{TESSDIR_CNT} {IMAGE} ocrd-make -f {MAKEFILE_PP} .'

    try:
        result = subprocess.run(cmd, shell=True, check=True, timeout=1800)
        # ... analyze result
    except subprocess.CalledProcessError as exc:
        # ... handle subprocess failure
        pass

To increase throughput, the calls are executed via standard Python process pooling in the main script:

import argparse
import concurrent.futures
import os
import sys

def create_ocr(image_path):
    # ... additional setup
    file_id = os.path.basename(image_path).split('.')[0]
    workdir_sub = os.path.join(migration.migration_workdir, file_id)
    try:
        run_ocr_page(workdir_sub)
    except Exception as exc:
        # ... handle further exceptions and results
        pass



if __name__ == "__main__":
    APP_ARGUMENTS = argparse.ArgumentParser()
    APP_ARGUMENTS.add_argument(
        "-p",
        "--part_by",
        required=False,
        help="partition size for workflow")
    ARGS = vars(APP_ARGUMENTS.parse_args())
    if ARGS['part_by']:
        PART_BY = int(ARGS['part_by'])
    else:
        PART_BY = 4

    # additional initialization ...

    try:
        with concurrent.futures.ProcessPoolExecutor(max_workers=PART_BY) as executor:
            outcomes = list(executor.map(create_ocr, image_paths))
            # ... analyze outcomes

    except concurrent.futures.TimeoutError:
        (exc_type, value, traceback) = sys.exc_info()
        migration.the_logger.error(
            "Run into timeout: '%s'(%s): %s",
            value,
            exc_type,
            traceback)

    # further processing ...

ALTO Postprocessing

Since indexing is one of the most important preconditions for fulltext search, some adjustments were required. The produced ALTO files did not include a spacing element (SP) between the String elements of an ALTO TextLine element. This led to undesirable effects in the depiction of the fulltext, e.g. spaceless text being rendered in the viewers. To remedy this, a postprocessing step was implemented that automatically adds the SP elements.

Adding footers (displaying the URN and the digitizing library) changes the dimensions of the image, so these dimensions also need to be adjusted in the ALTO XML to ensure that scaling of the text coordinates still works as expected.
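Since the footer is appended below the page image, existing text coordinates keep their origin; only the total page height grows by the footer's height. A trivial sketch (the 100 px footer height is a made-up value):

```python
def adjusted_page_height(original_height, footer_height):
    """Footer is appended below the page image, so only HEIGHT changes."""
    return original_height + footer_height

print(adjusted_page_height(2395, 100))  # → 2495
```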

The DFG Viewer enables fulltext display at line level and, in addition, fulltext search. IIIF viewers are capable of fulltext search if the IIIF manifest includes information on a search API.

The Share_it repository of ULB Saxony-Anhalt uses Solr-based indexing of fulltext data via solr-ocrhighlighting. If the administrative METS section contains the additional DFG-Viewer information specifying a fulltext SRU endpoint to enable fulltext search, the search results must be connected to the corresponding images (and even to positions within those images). Therefore each ALTO file needs to be linked with its corresponding image, i.e. the ID attribute of the Page element needs to be set to the image's physical name:

<Layout>
    <Page ID="OCR-D-BINPAGE_0001" HEIGHT="2395" .... >

<!--
    exchange pre-set ID from ocrd-workflow with physical filename
    take care of new image height
-->
<Layout>
    <Page ID="1056985" HEIGHT="2495" .... >

Postprocessing can be implemented with the help of the lxml package:

import lxml.etree as ET

# ALTO default namespace; the exact version depends on the produced data
XMLNS_POST = {'alto': 'http://www.loc.gov/standards/alto/ns-v3#'}

def clear_alto(file_path, n_file, new_height):
    """
    Ensure:
    - each ALTO Page-Element references proper image
    - drop geometrical information not necessary for presentation
    - add SP-element between each word
    """

    xml_tree = ET.parse(file_path)
    xml_root = xml_tree.getroot()

    # update page ID
    page = xml_root.find('.//alto:Page', XMLNS_POST)
    page.attrib['ID'] = f'p{n_file}'
    page.attrib['HEIGHT'] = str(new_height)  # attribute values must be strings

    # remove geometrical elements not used for presentation
    _remove_elements(
        xml_root, ['alto:Shape', 'alto:Illustration', 'alto:GraphicalElement'])

    # add element "Space"
    lines = xml_root.findall('.//alto:TextLine', XMLNS_POST)
    # create SP in the ALTO namespace (a plain '<SP/>' would be namespace-less)
    sp_tag = ET.QName(XMLNS_POST['alto'], 'SP')
    for line in lines:
        # walk backwards so insertions do not shift the remaining indices
        for i in range(len(line) - 2, -1, -1):
            line[i].addnext(ET.Element(sp_tag))

    xml_tree.write(file_path, encoding='UTF-8', xml_declaration=True)


def _remove_elements(xml_root, tags):
    for tag in tags:
        removals = xml_root.findall(f'.//{tag}', XMLNS_POST)
        for rem in removals:
            parent = rem.getparent()
            parent.remove(rem)
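
The SP insertion idea can be checked in isolation on a minimal ALTO fragment. A self-contained sketch using the standard library's xml.etree instead of lxml (ALTO v3 namespace assumed, as above):

```python
import xml.etree.ElementTree as ET

ALTO_NS = 'http://www.loc.gov/standards/alto/ns-v3#'  # version assumed

line = ET.fromstring(
    f'<TextLine xmlns="{ALTO_NS}">'
    '<String CONTENT="Hello"/><String CONTENT="World"/>'
    '</TextLine>')

# insert an SP element between every pair of adjacent children,
# walking backwards so insertions do not shift the remaining indices
for i in range(len(line) - 1, 0, -1):
    line.insert(i, ET.Element(f'{{{ALTO_NS}}}SP'))

tags = [child.tag.split('}')[-1] for child in line]
print(tags)  # → ['String', 'SP', 'String']
```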
