Konstantin Baierer edited this page Sep 19, 2022 · 4 revisions

Adding files to workspaces in bulk

Basic recipe

Let's assume you have directories path/to/files/PAGE containing PAGE files and path/to/files/IMG containing images. The files have basenames such as page_0001.xml, page_0001.tif, etc.

ocrd workspace bulk-add \
    --regex '^.*/(?P<fileGrp>[^/]+)/page_(?P<pageid>.*)\.(?P<ext>[^\.]*)$' \
    --file-id 'FILE_{{ fileGrp }}_{{ pageid }}' \
    --page-id 'PHYS_{{ pageid }}' \
    --file-grp "{{ fileGrp }}" \
    --url '{{ fileGrp }}/FILE_{{ pageid }}.{{ ext }}' \
    'path/to/files/*/*.*'

This will first expand the glob to get filenames and resolve them to absolute paths.

Every path is then matched against --regex with re.match, yielding template variables derived from the path. These template variables can be used in all file-specific options. The expanded --url is interpreted as the file's path relative to the workspace directory; the file is copied there if it is not already present. After all template variables have been expanded, the file is added with Workspace.add_file.

In this case:

  • path/to/files/PAGE/page_0001.xml ->
    • url: PAGE/FILE_0001.xml (will be copied because file name is different)
    • fileGrp: PAGE
    • ID: FILE_PAGE_0001
    • pageId: PHYS_0001
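
The matching and expansion steps can be sketched in plain Python. This is an illustrative approximation, not the code that ocrd itself runs: the helpers `expand` and `resolve`, the way `{{ name }}` placeholders are substituted, and the absolute path prefix `/abs` are assumptions made for this example. Only `re.match`, the regex and the option values are taken from the recipe above.

```python
import re

# Regex and templates copied from the basic recipe above.
REGEX = r'^.*/(?P<fileGrp>[^/]+)/page_(?P<pageid>.*)\.(?P<ext>[^\.]*)$'
TEMPLATES = {
    'file_id': 'FILE_{{ fileGrp }}_{{ pageid }}',
    'page_id': 'PHYS_{{ pageid }}',
    'file_grp': '{{ fileGrp }}',
    'url': '{{ fileGrp }}/FILE_{{ pageid }}.{{ ext }}',
}


def expand(template, variables):
    """Hypothetical helper: replace each '{{ name }}' with its captured value."""
    return re.sub(r'\{\{\s*(\w+)\s*\}\}',
                  lambda m: variables[m.group(1)],
                  template)


def resolve(path):
    """Match one path against REGEX and expand every template for it."""
    match = re.match(REGEX, path)
    if match is None:
        raise ValueError(f'path does not match --regex: {path}')
    variables = match.groupdict()
    return {name: expand(tpl, variables) for name, tpl in TEMPLATES.items()}


if __name__ == '__main__':
    # '/abs' stands in for whatever absolute prefix the glob resolves to.
    for name, value in resolve('/abs/path/to/files/PAGE/page_0001.xml').items():
        print(f'{name}: {value}')
```

Running the sketch on a few representative paths before the real import is a cheap way to check that the named groups capture what the templates expect.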

--mimetype, if not provided, is mapped from the file extension.

--ignore disables the check for existing files with the same @ID, which speeds up large imports considerably.

Reading from STDIN

If the FILE_GLOB is a single dash -, the file path list is read from STDIN, so you can pass in data about the files to be added in a simple space-separated list of values:

{ echo PHYS_0001 BIN FILE_0001_BIN.IMG-wolf BIN/FILE_0001_BIN.IMG-wolf.png; \
  echo PHYS_0001 BIN FILE_0001_BIN BIN/FILE_0001_BIN.xml; \
  echo PHYS_0002 BIN FILE_0002_BIN.IMG-wolf BIN/FILE_0002_BIN.IMG-wolf.png; \
  echo PHYS_0002 BIN FILE_0002_BIN BIN/FILE_0002_BIN.xml; \
} | ocrd workspace bulk-add -r '(?P<pageid>.*) (?P<filegrp>.*) (?P<fileid>.*) (?P<url>.*)' \
-G '{{ filegrp }}' -g '{{ pageid }}' -i '{{ fileid }}' -S '{{ url }}' -

This allows users to prepare the data to be added semi-manually as a delimited text file (space-separated in the example above). That works particularly well when the naming convention of the files to be added is not consistent or informative enough to rely on filenames alone for pattern matching.
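
Because the regex in this example uses greedy `.*` groups separated by single spaces, a value that itself contains a space shifts the field boundaries without failing the match. The following sketch, using only Python's `re` module, shows how a line is split and how such lines could be detected beforehand. The helper names `parse_line` and `has_unambiguous_fields` and the second sample line are hypothetical and not part of ocrd.

```python
import re

# Regex copied from the STDIN example above.
LINE_REGEX = r'(?P<pageid>.*) (?P<filegrp>.*) (?P<fileid>.*) (?P<url>.*)'


def parse_line(line):
    """Split one input line into the four named fields, or return None."""
    match = re.match(LINE_REGEX, line.rstrip('\n'))
    return match.groupdict() if match else None


def has_unambiguous_fields(line):
    """True if the line holds exactly four non-empty, space-separated values."""
    fields = line.rstrip('\n').split(' ')
    return len(fields) == 4 and all(fields)


if __name__ == '__main__':
    good = 'PHYS_0001 BIN FILE_0001_BIN BIN/FILE_0001_BIN.xml'
    print(parse_line(good))

    # Hypothetical line whose last value contains a space: the regex still
    # matches, but the greedy groups assign the values to the wrong fields.
    bad = 'PHYS_0001 BIN FILE_0001_BIN BIN/my file.xml'
    print(parse_line(bad))
    print(has_unambiguous_fields(good), has_unambiguous_fields(bad))
```

If values may contain spaces, a different separator in both the input lines and the regex (a tab, for instance) avoids the ambiguity.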

Adding legacy OCR-D GT data in bulk

For example, to import the old (first-generation zip-file) OCR-D GT directories, one could then do:

# in a directory where all zip-files have been extracted already:
for book in */; do

pushd "$book"
ocrd workspace init
ocrd workspace set-id "$book"

# only images, no copying
ocrd workspace bulk-add \
  --skip \
  --regex '^(?P<dispname>[^/]*)/(?P=dispname)_(?P<pageid>[0-9]*)\.tif$' \
  --file-id 'FILE_ORIG_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-IMG \
  --url '{{ dispname }}_{{ pageid }}.tif' \
  $(find . -name "*.tif")

# only PAGE, no copying
ocrd workspace bulk-add \
  --skip \
  --regex '^(?P<dispname>[^/]*)/page/(?P=dispname)_(?P<pageid>[0-9]*)\.xml$' \
  --file-id 'FILE_GT_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-GT-SEG-PAGE \
  --url 'page/{{ dispname }}_{{ pageid }}.xml' \
  $(find . -name "*.xml")

# only ALTO, no copying
ocrd workspace bulk-add \
  --skip \
  --regex '^(?P<dispname>[^/]*)/alto/(?P=dispname)_(?P<pageid>[0-9]*)\.xml$' \
  --file-id 'FILE_GT-ALTO_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-GT-ALTO-SEG-PAGE \
  --mimetype application/alto+xml \
  --url 'alto/{{ dispname }}_{{ pageid }}.xml' \
  $(find . -name "*.xml")

popd
done

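(The import is split into three calls because this convention has no image subdirectory that could be matched as fileGrp directly. Splitting it up also allows a basic form of string transformation, for example from the page directory to the file group OCR-D-GT-SEG-PAGE.)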
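
The three regexes rely on the named backreference `(?P=dispname)`, which only matches when the file name repeats the directory name captured just before it. The sketch below checks which regex, if any, accepts a given path. The book name `abc_1800`, the relative path layout and the helper `classify` are hypothetical and chosen only for illustration.

```python
import re

# Regexes copied from the three bulk-add calls above, keyed by file group.
REGEXES = {
    'OCR-D-IMG':
        r'^(?P<dispname>[^/]*)/(?P=dispname)_(?P<pageid>[0-9]*)\.tif$',
    'OCR-D-GT-SEG-PAGE':
        r'^(?P<dispname>[^/]*)/page/(?P=dispname)_(?P<pageid>[0-9]*)\.xml$',
    'OCR-D-GT-ALTO-SEG-PAGE':
        r'^(?P<dispname>[^/]*)/alto/(?P=dispname)_(?P<pageid>[0-9]*)\.xml$',
}


def classify(path):
    """Return (file group, page number) for the first matching regex, else None."""
    for file_grp, regex in REGEXES.items():
        match = re.match(regex, path)
        if match:
            return file_grp, match.group('pageid')
    return None


if __name__ == '__main__':
    # Hypothetical paths; the last one fails because 'other' != 'abc_1800'.
    for path in ('abc_1800/abc_1800_0001.tif',
                 'abc_1800/page/abc_1800_0001.xml',
                 'abc_1800/alto/abc_1800_0001.xml',
                 'abc_1800/page/other_0001.xml'):
        print(path, '->', classify(path))
```

Since both the PAGE and the ALTO call receive every *.xml file found, it is the subdirectory in each regex that keeps the two file groups apart.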

Adding flat directory hierarchies

A common case is a set of per-document directories in which image files sit alongside PAGE-XML files of the same basename (as in the old LAREX bookpath convention, or in various GT collections). The following imports such books into OCR-D conforming METS without copying the files into new, OCR-D conforming paths:

# in the bookpath/library directory:
for book in */; do

pushd "$book"
ocrd workspace init
ocrd workspace set-id "$book"

ocrd workspace bulk-add \
  --regex '^(?P<pageid>.*)\.xml$' \
  --file-id 'OCR-D-GT-SEG-LINE_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-GT-SEG-LINE \
  --url '{{ pageid }}.xml' \
  $(find . -name "*.xml" -not -name mets.xml)

ocrd workspace bulk-add \
  --regex '^(?P<pageid>.*)\.(?P<ext>[^.]*)$' \
  --file-id 'OCR-D-IMG_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-IMG \
  --url '{{ pageid }}.{{ ext }}' \
  $(find . -type f -not -name "*.xml")

popd
done
