AI-Riksarkivet/htrflow

htrflow

HTRFlow is an open source tool for handwritten text recognition. It is developed by the AI lab at the Swedish National Archives (Riksarkivet).

Installation

Package

From source

Requirements:

Clone this repository and run

poetry install --extras "huggingface ultralytics cli"

This installs the HTRFlow CLI in a virtual environment and enables huggingface and ultralytics models. To use openmmlab models such as RTMDet and Satrn, also run:

python3 scripts/openmmlab_install.py

Now activate the virtual environment with

poetry shell

The HTRFlow CLI is now available within the poetry shell. Try it by running:

htrflow pipeline examples/pipelines/demo.yaml examples/images/pages

This command runs HTRFlow on the three example pages in examples/images/pages and writes the output as Page XML and Alto XML.

Usage

Once HTRFlow is installed, run it with:

htrflow pipeline <pipeline file> <input image(s)>
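
If you would rather start a run from a Python script, one option is to call the CLI through the standard library. The following is a minimal sketch: build_command and run_htrflow are hypothetical helper names, not part of HTRFlow, and the only thing taken from this README is the command form shown above.

```python
import subprocess
from typing import List, Sequence


def build_command(pipeline_file: str, inputs: Sequence[str]) -> List[str]:
    """Build the argument list for `htrflow pipeline <pipeline file> <input image(s)>`."""
    if not inputs:
        raise ValueError("at least one input image or directory is required")
    return ["htrflow", "pipeline", pipeline_file, *inputs]


def run_htrflow(pipeline_file: str, inputs: Sequence[str]) -> None:
    """Run the CLI; raises subprocess.CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(pipeline_file, inputs), check=True)


# Example call (requires HTRFlow to be installed and on the PATH):
# run_htrflow("examples/pipelines/demo.yaml", ["examples/images/pages"])
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.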

Pipelines

HTRFlow is configured with a pipeline file which describes what steps it should perform and which models it should use. Here is an example of a simple pipeline:

steps:
- step: Segmentation
  settings:
    model: RTMDet
    model_settings:
      model: Riksarkivet/rtmdet_lines
- step: TextRecognition
  settings:
    model: TrOCR
    model_settings:
      model: Riksarkivet/trocr-base-handwritten-swe
    generation_settings:
      num_beams: 1
- step: RemoveLowTextConfidenceLines
  settings:
    threshold: 0.9
- step: Export
  settings:
    dest: outputs/alto
    format: alto

This pipeline uses Riksarkivet/rtmdet_lines to detect the text lines on each page, runs Riksarkivet/trocr-base-handwritten-swe to transcribe them, removes text lines whose confidence score falls below the threshold, and exports the result as Alto XML.
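
To make the structure of the pipeline file concrete, here is the same example written as the nested Python dicts and lists that a YAML parser would produce. The helpers step_names and models_used are hypothetical and only inspect the data; they are not part of HTRFlow.

```python
# The example pipeline as the data structure a YAML parser would produce.
pipeline = {
    "steps": [
        {
            "step": "Segmentation",
            "settings": {
                "model": "RTMDet",
                "model_settings": {"model": "Riksarkivet/rtmdet_lines"},
            },
        },
        {
            "step": "TextRecognition",
            "settings": {
                "model": "TrOCR",
                "model_settings": {"model": "Riksarkivet/trocr-base-handwritten-swe"},
                "generation_settings": {"num_beams": 1},
            },
        },
        {"step": "RemoveLowTextConfidenceLines", "settings": {"threshold": 0.9}},
        {"step": "Export", "settings": {"dest": "outputs/alto", "format": "alto"}},
    ]
}


def step_names(config: dict) -> list:
    """Return the step names in the order the pipeline file lists them."""
    return [entry["step"] for entry in config["steps"]]


def models_used(config: dict) -> list:
    """Return the model identifiers of all steps that define model_settings."""
    return [
        entry["settings"]["model_settings"]["model"]
        for entry in config["steps"]
        if "model_settings" in entry.get("settings", {})
    ]


print(step_names(pipeline))
print(models_used(pipeline))
```

The top-level steps key holds an ordered list, so the order in the file is the order in which the steps are listed here.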

See the demo pipeline examples/pipelines/demo.yaml for a more complex pipeline.

Built-in pipeline steps

HTRFlow comes with several predefined pipeline steps. These include:

  • Inference, including text recognition and segmentation
  • Image preprocessing
  • Reading order detection
  • Filtering
  • Export

Custom pipeline steps

You can define your own custom pipeline step by subclassing PipelineStep and defining the run() method. It takes a Collection and returns a Collection:

class MyPipelineStep(PipelineStep):
    """A custom pipeline step"""
    def run(self, collection: Collection) -> Collection:
        for page in collection:
            ...  # Do something with each page
        return collection

You can add parameters to your pipeline step by also defining the __init__() method. It can take any number of arguments. Here, we add one argument, which can be accessed when the step is run:

class MyPipelineStep(PipelineStep):
    """A custom pipeline step"""
    def __init__(self, arg):
        self.arg = arg

    def run(self, collection: Collection) -> Collection:
        for page in collection:
            # Do something
            if self.arg:
                ...
        return collection

To use the pipeline step in a pipeline, add the following to your pipeline file:

steps:
  - step: MyPipelineStep
    settings: 
      arg: value

All key-value pairs listed under settings will be passed to the step's __init__() method. If the pipeline step doesn't need any arguments, you can omit settings.
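
The way settings reach __init__() can be illustrated with keyword-argument unpacking. The sketch below is not HTRFlow's implementation: PipelineStep is a stand-in base class, and NoArgStep, STEP_CLASSES and build_step are hypothetical names introduced only for this example.

```python
class PipelineStep:
    """Stand-in for HTRFlow's PipelineStep base class (illustration only)."""

    def run(self, collection):
        raise NotImplementedError


class MyPipelineStep(PipelineStep):
    """A custom step that takes one argument."""

    def __init__(self, arg):
        self.arg = arg

    def run(self, collection):
        return collection


class NoArgStep(PipelineStep):
    """A custom step without arguments, so `settings` can be omitted."""

    def run(self, collection):
        return collection


# Hypothetical lookup table from the step name in the pipeline file to a class.
STEP_CLASSES = {cls.__name__: cls for cls in (MyPipelineStep, NoArgStep)}


def build_step(config: dict) -> PipelineStep:
    """Instantiate a step, passing every key under `settings` as a keyword argument."""
    step_class = STEP_CLASSES[config["step"]]
    settings = config.get("settings") or {}  # missing or empty settings -> no arguments
    return step_class(**settings)


steps = [
    build_step({"step": "MyPipelineStep", "settings": {"arg": "value"}}),
    build_step({"step": "NoArgStep"}),
]
print(steps[0].arg)
```

With this kind of unpacking, a key under settings that does not match a parameter of __init__() raises a TypeError when the step is constructed.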

For filtering and image processing operations, you can base your custom step on the base classes Prune and ProcessImages. Examples of this, and other pipeline steps, can be found in htrflow_core/pipeline/steps.py.

Models

The following model architectures are currently supported by HTRFlow:

Model    Type                  Fine-tuned by the AI lab
TrOCR    Text recognition      Riksarkivet/trocr-base-handwritten-swe
Satrn    Text recognition      Riksarkivet/satrn_htr
RTMDet   Segmentation          Riksarkivet/rtmdet_lines, Riksarkivet/rtmdet_regions
Yolo     Segmentation          -
DiT      Image classification  -