OCR-D/ocrd_all

This controls installation of all OCR-D modules from source (as git submodules).

It includes a Makefile for their installation into a virtual environment (venv) or Docker container.

(A venv is a local user directory with shell scripts to load/unload itself in the current shell environment via PATH and PYTHONHOME.)

Note: If you are going to install ocrd_all, you may want to first consult the OCR-D setup guide on the OCR-D website. If you are a non-IT user, it is especially recommended that you follow the guide.

Prerequisites
Usage
Challenges
Contributing

Prerequisites

Space

Make sure that there is enough free disk space. For a full installation including executables from all modules, around 22 GiB will be needed (mostly on the same filesystem as the ocrd_all checkout). The same applies to the maximum-cuda variant of the prebuilt Docker images (in that case on the filesystem holding Docker's data, typically /var/lib/docker).

Also, during the build, an additional 5 GiB may be needed for temporary files, typically in the /tmp directory. To use a different location with more free space, set the TMPDIR variable when calling make:

TMPDIR=/path/to/my/tempdir make all
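
The space figures above can be checked before starting a build. The following is a minimal sketch, assuming a POSIX `df` and `awk`; the helper names `free_gib` and `require_gib` are hypothetical and not part of ocrd_all, and the thresholds are simply the 22 GiB and 5 GiB estimates mentioned above.

```shell
# Print the free space (in whole GiB) on the filesystem containing the given path.
free_gib() {
    # -P: one line per filesystem (POSIX format), -k: sizes in 1024-byte blocks.
    # Column 4 is the available space; this assumes no spaces in the device name.
    df -Pk "${1:-.}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Return non-zero (with a message on stderr) if less than the given
# number of GiB is free at the given path.
require_gib() {
    path=$1
    need=$2
    have=$(free_gib "$path")
    if [ "${have:-0}" -lt "$need" ]; then
        echo "only ${have:-0} GiB free at $path, about $need GiB needed" >&2
        return 1
    fi
}

# Checkout directory (current directory assumed) and temporary directory.
require_gib . 22 || echo "consider freeing space before running 'make all'"
require_gib "${TMPDIR:-/tmp}" 5 || echo "consider pointing TMPDIR at a larger filesystem"
```

The check is advisory only: actual requirements depend on which modules are selected.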

Locale

The (shell) environment must have a Unicode-based localization. (Otherwise Python code based on click will not work, i.e. most OCR-D CLIs.) This is true for most installations today, and can be verified by:

locale | fgrep .UTF-8

This should show several LC_* variables. Otherwise, either select another localization globally...

sudo dpkg-reconfigure locales

... or use the Unicode-based POSIX locale temporarily:

export LC_ALL=C.UTF-8
export LANG=C.UTF-8
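
The check and the temporary fallback can also be combined in a shell profile or wrapper script. This is a sketch only, assuming the usual POSIX precedence for the character encoding (LC_ALL, then LC_CTYPE, then LANG); the function name `is_utf8` is hypothetical.

```shell
# Succeed if the given locale name denotes a UTF-8 encoding.
is_utf8() {
    case "$1" in
        *.UTF-8*|*.utf8*|*.UTF8*|*.utf-8*) return 0 ;;
        *) return 1 ;;
    esac
}

# Effective setting for character encoding: LC_ALL wins over LC_CTYPE,
# which wins over LANG (empty values count as unset).
effective=${LC_ALL:-${LC_CTYPE:-${LANG:-}}}

# Fall back to the Unicode-based POSIX locale only when needed.
if ! is_utf8 "$effective"; then
    export LC_ALL=C.UTF-8
    export LANG=C.UTF-8
fi
```

This leaves an existing UTF-8 locale untouched and only overrides a non-Unicode one for the current shell session.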

System packages

Install git, GNU make and GNU parallel.

  # on Debian / Ubuntu:
  sudo apt install make git parallel
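
On other distributions, it may help to verify that these tools are on the PATH before running any target. A minimal sketch follows; `missing_tools` is a hypothetical helper, and it only checks that a command exists, not that it is the GNU flavour.

```shell
# Print each of the given command names that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools git make parallel)
if [ -n "$missing" ]; then
    # Unquoted on purpose, so that the names are joined on one line.
    echo "please install:" $missing
fi
```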

Install wget or curl if you want to download Tesseract models.
```
  # on Debian / Ubuntu:
  sudo apt install wget
```
Install the packages for Python3 development and Python3 virtual environments for your operating system / distribution.
```
  # on Debian / Ubuntu:
  sudo apt install python3-dev python3-venv
```
Some modules require Tesseract. If your operating system / distribution already provides Tesseract 4.1 or newer, then just install its development package:
```
  # on Debian / Ubuntu:
  sudo apt install libtesseract-dev
```
Otherwise, recent Tesseract packages for Ubuntu are available via PPA alex-p.

If no Tesseract is installed, a recent version will be downloaded and built as part of the ocrd_tesserocr module rules.
Other modules will have additional system dependencies.

Note: System dependencies for all modules on Ubuntu 20.04 (or similar) can also be installed automatically by running:
    # on Debian / Ubuntu:
    sudo apt install make
    make modules
    sudo make deps-ubuntu
(And you can define the scope of all modules by setting the OCRD_MODULES variable as described below. If unsure, consider doing a dry-run first, by using make -n.)

GPU support

Many executables can optionally use an Nvidia GPU for much faster computation, if one is available.

For that, as a further prerequisite you need an installation of the CUDA Toolkit and additional optimised libraries like cuDNN for your system.

The CUDA version currently supported is 11.8 (but others may work as well).

Note: CUDA toolkit and libraries (in a development version with CUDA compiler) can also be installed automatically by running:
    make ocrd
    sudo make deps-cuda
This will deploy Micromamba non-intrusively (without system packages or Conda environments), but also share some of the CUDA libraries installed as Python packages system-wide via ld.so.conf rules. (If unsure, consider doing a dry-run first, by using make -n.)
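
To see which CUDA release an existing installation provides, the version banner of the CUDA compiler can be inspected. This sketch assumes that `nvcc` is on the PATH (which may not hold for every installation method) and that its `--version` output contains a "release X.Y" string; `parse_cuda_release` is a hypothetical helper.

```shell
# Extract the "release X.Y" number from `nvcc --version` output read on stdin.
parse_cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

if command -v nvcc >/dev/null 2>&1; then
    found=$(nvcc --version | parse_cuda_release)
    # 11.8 is the version named above; others may work as well.
    [ "$found" = "11.8" ] || echo "found CUDA release '$found' instead of 11.8"
else
    echo "nvcc not found on PATH; the CUDA toolkit may be missing or installed elsewhere"
fi
```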

Usage

Run make with optional parameters for variables and targets like so:

make [PYTHON=python3] [VIRTUAL_ENV=./venv] [OCRD_MODULES="..."] [TARGET...]

Targets

deps-ubuntu

Install system packages for all modules. (Depends on modules.)

See system package prerequisites above.

deps-cuda

Install CUDA toolkit and libraries. (Depends on ocrd.)

See (optional) GPU support prerequisites above.

modules

Checkout/update all modules, but do not install anything.

all

Install executables from all modules into the venv. (Depends on modules and ocrd.)

ocrd

Install only the core module and its CLI ocrd into the venv.

docker

(Re-)build a Docker image for all modules/executables. (Depends on modules.)

dockers

(Re-)build Docker images for some pre-selected subsets of modules/executables. (Depends on modules.)

(These are the very same variants published as prebuilt images on Docker Hub, cf. CI configuration.)

Note: The image will contain all refs and branches of all checked out modules, which may not actually be needed. If you are planning on building and distributing Docker images with minimal size, consider setting GIT_DEPTH=--single-branch before running the modules target, or running make tidy later on.

clean

Remove the venv and the modules' build directories.

show

Print the venv directory, the module directories, and the executable names – as configured by the current variables.

check

Verify that all executables are runnable and the venv is consistent.

help (default goal)

Print available targets and variables.

Further targets:

[any module name]

Download/update that module, but do not install anything.

[any executable name]

Install that CLI into the venv. (Depends on that module and on ocrd.)

Variables

OCRD_MODULES

Override the list of git submodules to include. Targets affected by this include:

deps-ubuntu (reducing the list of system packages to install)
modules (reducing the list of modules to checkout/update)
all (reducing the list of executables to install)
docker (reducing the list of executables and modules to install)
show (reducing the list of OCRD_MODULES and of OCRD_EXECUTABLES to print)

NO_UPDATE

If set to 1, then installing executables will not attempt to git submodule update any currently checked out modules. (Useful for development when testing different module versions prior to a commit.)

PYTHON

Name of the Python binary to use (at least python3 required).

If set to just python, then for the target deps-ubuntu it is assumed that Python is already installed.

VIRTUAL_ENV

Directory prefix to use for local installation.

(This is set automatically when activating a virtual environment on the shell. The build system will re-use the venv if one already exists here, or create one otherwise.)

TMPDIR

Override the default path (/tmp on Unix) where temporary files during build are stored.

PIP_OPTIONS

Add extra options to the pip install command like -q or -v or -e.

Note: The latter option will install Python modules in editable mode, i.e. any update to the source would directly affect the executables.

GIT_RECURSIVE

Set to --recursive to checkout/update all modules recursively. (This usually installs additional tests and models.)

Examples

To build the latest Tesseract locally, run these commands first:

# Get code, build and install Tesseract with the default English model.
make install-tesseract
make ocrd-tesserocr-recognize

Optionally install additional Tesseract models.

# Download models from tessdata_fast into the venv's tessdata directory.
ocrd resmgr download ocrd-tesserocr-recognize deu_latf.traineddata
ocrd resmgr download ocrd-tesserocr-recognize Latin.traineddata
ocrd resmgr download ocrd-tesserocr-recognize Fraktur.traineddata

Optionally install Tesseract training tools.

make install-tesseract-training

Running make ocrd or just make downloads/updates and installs the core module, including the ocrd CLI in a virtual Python 3 environment under ./venv.

Running make ocrd-tesserocr-recognize downloads/updates the ocrd_tesserocr module and installs its CLIs, including ocrd-tesserocr-recognize in the venv.

Running make modules downloads/updates all modules.

Running make all additionally installs the executables from all modules.

Running make all OCRD_MODULES="core ocrd_tesserocr ocrd_cis" installs only the executables from these modules.

Results

To use the built executables, simply activate the virtual environment:

. ${VIRTUAL_ENV:-venv}/bin/activate
ocrd --help
ocrd-...

For the Docker image, run it as your own user with your data path mounted, and with the processor resources as a named volume (for model persistency):

docker run -it -u $(id -u):$(id -g) -v $PWD:/data -v ocrd-models:/models ocrd/all
ocrd --help
ocrd-...

Persistent configuration

In order to make choices permanent, you can put your variable preferences (or any custom rules) into local.mk. This file is always included if it exists. So you don't have to type (and memorise) them on the command line or shell environment.

For example, its content could be:

# restrict everything to a subset of modules
OCRD_MODULES = core ocrd_im6convert ocrd_cis ocrd_tesserocr

# use a non-default path for the virtual environment
VIRTUAL_ENV = $(CURDIR)/.venv

# install in editable mode (i.e. referencing the git sources)
PIP_OPTIONS = -e

# use non-default temporary storage
TMPDIR = $(CURDIR)/.tmp

# avoid automatic submodule updates
NO_UPDATE = 1

Note: When local.mk exists, variables can still be overridden on the command line (e.g. make all OCRD_MODULES= will build all executables for all modules again), but not from the shell environment (e.g. OCRD_MODULES= make all will still use the value from local.mk).
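
This precedence is standard GNU make behaviour and can be tried out in isolation. The sketch below does not touch ocrd_all at all: it writes a tiny, hypothetical stand-in Makefile (one assignment, as local.mk would contain, plus a rule that prints the value) into a temporary directory and runs it three ways. `demo_override` is a made-up name.

```shell
demo_override() {
    dir=$(mktemp -d) || return 1
    # Stand-in for Makefile + local.mk: one assignment and one printing rule.
    printf 'OCRD_MODULES = core ocrd_cis\nshow:\n\t@echo "[$(OCRD_MODULES)]"\n' > "$dir/Makefile"
    (
        cd "$dir" || exit 1
        make -s show                      # value assigned in the file
        OCRD_MODULES=core make -s show    # environment: the file still wins
        make -s show OCRD_MODULES=core    # command line: overrides the file
    )
    rm -rf "$dir"
}

# Run the demonstration only if make is available.
command -v make >/dev/null 2>&1 && demo_override
```

With GNU make this prints `[core ocrd_cis]` twice and then `[core]`, matching the behaviour described in the note.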

Docker Hub

Besides native installation, ocrd_all is also available as prebuilt Docker images from Docker Hub as ocrd/all, backed by CI/CD. You can choose from three tags, minimum, medium and maximum. These differ w.r.t. which modules are included, with maximum being the equivalent of doing make all with the default (unset) value for OCRD_MODULES.

To download the images on the command line:

docker pull ocrd/all:minimum
# or
docker pull ocrd/all:medium
# or
docker pull ocrd/all:maximum

In addition to these base variants, there are minimum-cuda, medium-cuda and maximum-cuda with GPU support. (These also need the nvidia-docker runtime, which will add the docker --gpus option.)

The maximum-cuda variant will be aliased to latest as well.

These tags will be overwritten with every new release of ocrd_all (i.e. rolling release). (You can still differentiate and reference them by their sha256 digest if necessary.)

However, the maximum-cuda variant of each release will also be aliased to a permanent tag by ISO date, e.g. 2023-04-02.

Usage of the prebuilt Docker image is the same as if you had built the image yourself.
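
As an illustration of the tag naming scheme, the image reference can be composed from the variant and the GPU choice. This is only a sketch: `ocrd_image` is a hypothetical helper, and the final line merely prints a candidate `docker run` command (mirroring the one in the Results section, with `--gpus all` added) instead of executing it.

```shell
# Print the image reference for a variant (minimum, medium, maximum)
# and a GPU choice (yes or no).
ocrd_image() {
    variant=${1:-maximum}
    gpu=${2:-no}
    case "$variant" in
        minimum|medium|maximum) ;;
        *) echo "unknown variant: $variant" >&2; return 1 ;;
    esac
    if [ "$gpu" = yes ]; then
        echo "ocrd/all:$variant-cuda"
    else
        echo "ocrd/all:$variant"
    fi
}

# Only prints the command; remove the leading 'echo' to actually run it.
echo docker run --gpus all -it -u "$(id -u):$(id -g)" \
    -v "$PWD:/data" -v ocrd-models:/models "$(ocrd_image medium yes)"
```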

This table lists which tag contains which module:

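| Module | `minimum` | `medium` | `maximum` |
| --- | --- | --- | --- |
| core | ☑ | ☑ | ☑ |
| ocrd_cis | ☑ | ☑ | ☑ |
| ocrd_fileformat | ☑ | ☑ | ☑ |
| ocrd_olahd_client | ☑ | ☑ | ☑ |
| ocrd_im6convert | ☑ | ☑ | ☑ |
| ocrd_pagetopdf | ☑ | ☑ | ☑ |
| ocrd_repair_inconsistencies | ☑ | ☑ | ☑ |
| ocrd_tesserocr | ☑ | ☑ | ☑ |
| ocrd_wrap | ☑ | ☑ | ☑ |
| workflow-configuration | ☑ | ☑ | ☑ |
| cor-asv-ann | - | ☑ | ☑ |
| dinglehopper | - | ☑ | ☑ |
| docstruct | - | ☑ | ☑ |
| format-converters | - | ☑ | ☑ |
| nmalign | - | ☑ | ☑ |
| ocrd_calamari | - | ☑ | ☑ |
| ocrd_keraslm | - | ☑ | ☑ |
| ocrd_neat | - | ☑ | ☑ |
| ocrd_olena | - | ☑ | ☑ |
| ocrd_segment | - | ☑ | ☑ |
| ocrd_anybaseocr | - | - | ☑ |
| ocrd_detectron2 | - | - | ☑ |
| ocrd_doxa | - | - | ☑ |
| ocrd_kraken | - | - | ☑ |
| ocrd_froc | - | - | ☑ |
| sbb_binarization | - | - | ☑ |
| cor-asv-fst | - | - | - |
| ocrd_ocropy | - | - | - |
| ocrd_pc_segmentation | - | - | - |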

Note: The following modules have been disabled by default and can only be enabled by explicitly setting OCRD_MODULES or DISABLED_MODULES:

cor-asv-fst (runtime issues)

ocrd_ocropy (better implementation in ocrd_cis available)

ocrd_pc_segmentation (dependency and quality issues)

Uninstall

If you have installed ocrd_all natively and wish to uninstall, first deactivate the virtual environment and remove the ocrd_all directory:

rm -rf ocrd_all

Next, remove all contents under ~/.parallel/semaphores:

rm -rf ~/.parallel/semaphores

Challenges

This repo offers solutions to the following problems with OCR-D integration.

No published/recent version on PyPI

Python modules which are not available in PyPI:

(Solved by installation from source.)

Conflicting requirements

Merging all packages into one venv does not always work. Modules may require mutually exclusive sets of dependent packages.

pip does not even stop or resolve conflicts – it merely warns!

Tensorflow:
- version 2 (required by ocrd_calamari, ocrd_anybaseocr and eynollah)
- version 1 (required by cor-asv-ann, ocrd_segment and ocrd_keraslm)
The temporary solution is to require different package names:
- tensorflow>=2
- tensorflow-gpu==1.15.*
Both cannot be installed in parallel in different versions, and usually also depend on different versions of CUDA toolkit.
OpenCV:
- opencv-python-headless (required by core and others, avoids pulling in X11 libraries)
- opencv-python (probably dragged in by third party packages)
As long as we keep reinstalling the headless variant and no such package attempts GUI, we should be fine. Custom build (as needed for ARM) under the module opencv-python already creates the headless variant.
PyTorch:
- torch<1.0
- torch>=1.0
...

(Solved by managing and delegating to different subsets of venvs.)
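
Since pip may only warn about such conflicts at install time, the state of an existing venv can be inspected afterwards with `pip check`, which reports installed packages whose requirements are missing or incompatible. A minimal sketch, assuming the venv layout used elsewhere in this README (`bin/python` under the venv directory); `check_venv_conflicts` is a hypothetical helper and not an ocrd_all target.

```shell
# Run `pip check` inside the given venv (default: $VIRTUAL_ENV, else ./venv).
# Returns 2 if no interpreter is found, otherwise the exit status of pip check.
check_venv_conflicts() {
    venv=${1:-${VIRTUAL_ENV:-venv}}
    if [ ! -x "$venv/bin/python" ]; then
        echo "no Python interpreter found under $venv" >&2
        return 2
    fi
    # Exits non-zero if any installed package has unmet or conflicting requirements.
    "$venv/bin/python" -m pip check
}

check_venv_conflicts || echo "venv is missing or has conflicting requirements"
```

The same function can be pointed at any other venv directory by passing its path as the first argument.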

System requirements

Modules which do not advertise their system package requirements via make deps-ubuntu:

(Solved by maintaining these requirements under deps-ubuntu here.)

Contributing

Please see our contributing guide to learn how you can support the project.

Acknowledgments

This software uses GNU parallel. GNU Parallel is a general parallelizer to run multiple serial command line programs in parallel without changing them.

Reference

Tange, Ole. (2020). GNU Parallel 20200722 ('Privacy Shield'). Zenodo. https://doi.org/10.5281/zenodo.3956817
