Update instructions about custom language packages
Closes #1218
jbarlow83 committed Dec 28, 2023
1 parent 0b6fb62 commit a5ff3d2
Showing 1 changed file with 31 additions and 13 deletions.
44 changes: 31 additions & 13 deletions docs/languages.rst
@@ -18,16 +18,26 @@ Tesseract's documentation also lists the three-letter code for your language.
Some are anglicized, e.g. Spanish is ``spa`` rather than ``esp``, while others
are not, e.g. German is ``deu`` and French is ``fra``.

+Language packs (strictly speaking, Tesseract "traineddata" files) generally correspond
+to the language in question, but different language packs are used in certain
+situations. For German, the "Fraktur" language pack can assist with reading older
+materials in the Fraktur typeface family (``deu_frak``). Some communities have changed
+their script from Cyrillic to Latin; the Cyrillic version of Uzbek is available
+as ``uzb_cyrl`` and the Latin version is ``uzb``.
+
After you have installed a language pack, you can use it with ``ocrmypdf -l <language>``,
for example ``ocrmypdf -l spa``. For multilingual documents, you can specify
all languages to be expected, e.g. ``ocrmypdf -l eng+fra`` for English and French.
English is assumed by default unless other language(s) are specified.
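
As a minimal sketch of the ``-l`` usage described above, the helper below joins language codes with ``+`` to form the value passed to ``ocrmypdf -l``. The function name ``join_langs`` and the file names ``input.pdf`` and ``output.pdf`` are illustrative assumptions, not part of OCRmyPDF.

```shell
# Hypothetical helper (not part of OCRmyPDF): join language codes with "+"
# to build the value for "ocrmypdf -l".
# The body runs in a subshell so the IFS change does not leak to the caller.
join_langs() (
    IFS='+'
    printf '%s\n' "$*"
)

# Example usage (input.pdf and output.pdf are placeholder names):
#   ocrmypdf -l "$(join_langs eng fra)" input.pdf output.pdf
#   ocrmypdf -l eng -l fra input.pdf output.pdf   # repeated -l form
```

Both invocations in the comments request English and French. Each language pack named must already be installed.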

For Linux users, you can often find packages that provide language
-packs:
+packs.

+Platform install steps
+======================
+
-Debian and Ubuntu users
-=======================
+Debian and Ubuntu (apt)
+-----------------------

.. code-block:: bash
@@ -42,8 +52,8 @@ to what languages it should search for. Multiple languages can be
requested using either ``-l eng+fra`` (English and French) or
``-l eng -l fra``.

-Fedora users
-============
+Fedora
+------

.. code-block:: bash
@@ -58,8 +68,8 @@ to what languages it should search for. Multiple languages can be
requested using either ``-l eng+fra`` (English and French) or
``-l eng -l fra``.

-Gentoo users
-============
+Gentoo
+------

On Gentoo, the package ``app-text/tessdata_fast``, which ``app-text/tesseract`` depends on, provides the Tesseract language packs.
It accepts USE flags that select which languages are installed; these can be set in ``/etc/portage/package.use``.
@@ -85,23 +95,31 @@ to what languages it should search for. Multiple languages can be
requested using either ``-l eng+fra`` (English and French) or
``-l eng -l fra``.

-macOS users
-===========
+macOS
+-----

You can install additional language packs by
:ref:`installing Tesseract using Homebrew with all language packs <macos-all-languages>`.

-Docker users
-============
+Docker
+------

Users of the OCRmyPDF Docker image should install language packs into a
derived Docker image as
:ref:`described in that section <docker-lang-packs>`.

-Windows users
-=============
+Windows
+-------

The Tesseract installer provided by Chocolatey currently includes only the English language pack.
To install other languages, download the corresponding language pack (``.traineddata`` file)
from https://github.com/tesseract-ocr/tessdata/ and place it in
``C:\Program Files\Tesseract-OCR\tessdata`` (or wherever Tesseract OCR is installed).

+Custom language packs
+=====================
+
+If you have fine-tuned or trained Tesseract and generated custom trained data, you can
+copy your ``customlang.traineddata`` file into your Tesseract "tessdata" folder, and
+then use the ``-l customlang`` argument to tell OCRmyPDF to pass that language on to
+Tesseract.
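
The custom language pack steps described above can be sketched in shell as follows. This is an illustration under stated assumptions, not a supported installer: ``install_traineddata`` is a hypothetical helper, and the location of the "tessdata" folder varies by platform, so it must be supplied by the caller.

```shell
# Hypothetical helper: copy a custom .traineddata file into a tessdata
# folder and print the language name to pass to "ocrmypdf -l".
install_traineddata() {
    src=$1        # path to the trained data, e.g. ./customlang.traineddata
    tessdata=$2   # your Tesseract "tessdata" folder (location is an assumption)

    # Refuse anything that is not a .traineddata file.
    case $src in
        *.traineddata) ;;
        *) echo "expected a .traineddata file: $src" >&2; return 1 ;;
    esac

    cp "$src" "$tessdata/" || return 1

    # The language name is the file name without directory or extension.
    name=${src##*/}
    printf '%s\n' "${name%.traineddata}"
}

# Example usage (paths and PDF names are placeholders):
#   lang=$(install_traineddata ./customlang.traineddata /path/to/tessdata)
#   ocrmypdf -l "$lang" input.pdf output.pdf
```

Printing the derived language name keeps the file name and the ``-l`` argument in sync, since the language name passed to ``-l`` is the file name without ``.traineddata``.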
