Update instructions about custom language packages

Closes #1218
ocrmypdf · Dec 28, 2023 · a5ff3d2
1 parent 0b6fb62
Showing 1 changed file with 31 additions and 13 deletions.
diff --git a/docs/languages.rst b/docs/languages.rst
@@ -18,16 +18,26 @@ Tesseract's documentation also lists the three-letter code for your language.
 Some are anglicized, e.g. Spanish is ``spa`` rather than ``esp``, while others
 are not, e.g. German is ``deu`` and French is ``fra``.
 
+Language packs (strictly speaking, Tesseract "traineddata" files) generally correspond
+to the language in question, but different language packs are used in certain
+situations. For German, the "Fraktur" language pack can assist with reading older
+materials in the Fraktur typeface family (``deu_frak``). Some communities have changed
+their script from Cyrillic to Latin; the Cyrillic version of Uzbek is available
+as ``uzb_cyrl`` and the Latin version is ``uzb``.
+
 After you have installed a language pack, you can use it with ``ocrmypdf -l <language>``,
 for example ``ocrmypdf -l spa``. For multilingual documents, you can specify
 all languages to be expected, e.g. ``ocrmypdf -l eng+fra`` for English and French.
 English is assumed by default unless other language(s) are specified.
 
 For Linux users, you can often find packages that provide language
-packs:
+packs.
+
+Platform install steps
+======================
 
-Debian and Ubuntu users
-=======================
+Debian and Ubuntu (apt)
+-----------------------
 
 .. code-block:: bash
 
@@ -42,8 +52,8 @@ to what languages it should search for. Multiple languages can be
 requested using either ``-l eng+fra`` (English and French) or
 ``-l eng -l fra``.
 
-Fedora users
-============
+Fedora
+------
 
 .. code-block:: bash
 
@@ -58,8 +68,8 @@ to what languages it should search for. Multiple languages can be
 requested using either ``-l eng+fra`` (English and French) or
 ``-l eng -l fra``.
 
-Gentoo users
-============
+Gentoo
+------
 
 On Gentoo the package ``app-text/tessdata_fast``, which ``app-text/tesseract`` depends on, handles Tesseract languages.
 It accepts USE flags to select what languages should be installed, these can be set in ``/etc/portage/package.use``.
@@ -85,23 +95,31 @@ to what languages it should search for. Multiple languages can be
 requested using either ``-l eng+fra`` (English and French) or
 ``-l eng -l fra``.
 
-macOS users
-===========
+macOS
+-----
 
 You can install additional language packs by
 :ref:`installing Tesseract using Homebrew with all language packs <macos-all-languages>`.
 
-Docker users
-============
+Docker
+------
 
 Users of the OCRmyPDF Docker image should install language packs into a
 derived Docker image as
 :ref:`described in that section <docker-lang-packs>`.
 
-Windows users
-=============
+Windows
+-------
 
 The Tesseract installer provided by Chocolatey currently includes only English language.
 To install other languages, download the respective language pack (``.traineddata`` file)
 from https://github.com/tesseract-ocr/tessdata/ and place it in
 ``C:\\Program Files\\Tesseract-OCR\\tessdata`` (or wherever Tesseract OCR is installed).
+
+Custom language packs
+=====================
+
+If you have fine-tuned or trained Tesseract and generated custom trained data, you can
+copy your ``customlang.traineddata`` file into your Tesseract "tessdata" folder, and
+then use the ``-l customlang`` argument to tell OCRmyPDF to pass that language on to
+Tesseract.
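
The "Custom language packs" steps added in this commit (copy the ``.traineddata`` file into the tessdata folder, then pass its name with ``-l``) could be scripted roughly as follows. This is a minimal sketch, not part of the commit: the helper names ``install_traineddata`` and ``ocrmypdf_command`` are hypothetical, and the tessdata location is left as a parameter because it varies by platform and install method. The sketch only builds the command line; it does not invoke OCRmyPDF.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def install_traineddata(src: Path, tessdata_dir: Path) -> str:
    """Copy a .traineddata file into tessdata_dir and return its language code.

    Hypothetical helper: the language code is taken to be the file name
    without the ".traineddata" suffix, e.g. "customlang".
    """
    src = Path(src)
    tessdata_dir = Path(tessdata_dir)
    if src.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {src.name}")
    if not tessdata_dir.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {tessdata_dir}")
    shutil.copy2(src, tessdata_dir / src.name)
    return src.stem


def ocrmypdf_command(languages: list[str], input_pdf: str, output_pdf: str) -> list[str]:
    """Build an ocrmypdf argument list, joining languages as in ``-l eng+fra``."""
    if not languages:
        raise ValueError("at least one language code is required")
    return ["ocrmypdf", "-l", "+".join(languages), str(input_pdf), str(output_pdf)]
```

For example, after ``code = install_traineddata(Path("customlang.traineddata"), tessdata_dir)``, the list returned by ``ocrmypdf_command(["eng", code], "in.pdf", "out.pdf")`` can be passed to ``subprocess.run``. Writing into a system tessdata folder may require elevated permissions.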