Make it clearer which Tesseract engine is being used #168

Witiko · 2021-01-21T21:23:00Z

Since Tesseract 4, two OCR engines are available: rule-based (--oem 0) and LSTM (--oem 1). The command line also exposes an ensemble of the two engines (--oem 2). The documentation for ocrd-tesserocr-recognize does not make it clear which engine is used, and setting any of the following parameters seems to have no effect on the recognition results:

-P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "0" }'
-P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "1" }'
-P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "2" }'

Which one of the OCR engines are we currently using?


bertsky · 2021-01-21T21:58:05Z

Yes, but that's the same behaviour as in the (non-OCR-D) Tesseract CLI: it defaults to OEM_DEFAULT, which auto-detects the OEM from the language/model choices and falls back to OEM_LSTM_ONLY if those give no clue either.

But if you explicitly set tessedit_ocr_engine_mode via variable assignment (as could be done in the Tesseract CLI via -c tessedit_ocr_engine_mode=2), then IIUC this will only have an effect for models which contain both legacy and LSTM data. All the other models you loaded will still be running their "natural" OEM.

So e.g. if you use eng.traineddata or Fraktur.traineddata from Github, or one of the tesstrain models, you'll get LSTMs. However, if you use osd.traineddata or deu-frak.traineddata, or older models from Github, you'll get legacy recognition. (In fact, the latter is what you need to do for the fontshape processor, which needs OEM_TESSERACT_ONLY.)

Witiko · 2021-01-22T11:57:46Z

@bertsky I am using the ces.traineddata, deu.traineddata, and lat.traineddata from tesseract-ocr/tessdata, which contain both legacy and LSTM data. Nevertheless, running the following two commands gives the same results:

$ export TESSDATA_PREFIX=tessdata
$ ocrd process "tesserocr-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR3 \
>     -P tesseract_parameters \{\\\"tessedit_ocr_engine_mode\\\":\\\"0\\\"\} \
>     -P sparse_text false -P model ces+deu+lat"
$ ocrd process "tesserocr-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR4 \
>     -P tesseract_parameters \{\\\"tessedit_ocr_engine_mode\\\":\\\"1\\\"\} \
>     -P sparse_text false -P model ces+deu+lat"

Except for the metadata, the output of the above commands is equivalent. This is evidenced by the following command, which produces empty output:

$ for i in /var/tmp/ocrd-workspace/302/OCR-D-OCR3/*.xml
> do
>     diff $i `sed s/OCR3/OCR4/g <<< $i`
> done | 
> grep -vE 'OCR[34]|tessedit_ocr_engine_mode|^---$|^[0-9]*c[0-9]*$'

bertsky · 2021-01-22T12:32:25Z

So you are saying that the Tesseract CLI behaves differently, i.e. produces different results depending on the mode here?

Witiko · 2021-01-22T13:08:19Z

Yes, when I use Tesseract with OEM 0 and 1 from the command line, I get different results for the same input:

$ tesseract /var/tmp/ocrd-workspace/302/420.jpg output-tesseract3 \
>     --oem 0 --psm 3 -l ces+deu+lat --tessdata-dir tessdata txt
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
[... output snipped]
$ tesseract /var/tmp/ocrd-workspace/302/420.jpg output-tesseract4 \
>     --oem 1 --psm 3 -l ces+deu+lat --tessdata-dir tessdata txt
[... output omitted]

$ diff output-tesseract3.txt output-tesseract4.txt
3c3
< 'četmi dobrých lidí urozených, tak“ jakž jest
---
> četmi dobrých lidí urozených, tak. jakž jest
5,9c5,9
< soudu, nežli kměstskému příleží, protož to při
< tom tak jsme zuostavíli, však k tomu toto
< z pravapřidavajíce: Poněvadž on nebožtík
< pan Šimon Smolíkovic jemu Jiříkovi“ Fialovi,
< Jiříkovi Hoštálkovi a paní Anně, dceři své,
---
> soudu, nežli k městskému příleží, protož to při
> tom tak jsme zuostavili, však k tomu toto
> Z práva: přidávajíce: .Poněvadž on nebožtík
> -pan Šimon Smolíkovic jemu Jiříkovi  Fialovi,
> Jiříkovi Hošťálkovi a paní Anně, dceři své,
12,14c12,14
< mezi kterýmžto summy peněz na lištu též
< 'v kšeftu se dotýče, a podle toho Jiřík Hošťálek
< smanželkú svá paní Annú a s sirotky i dětmi
---
> mezi kterýmžto summy peněz na listu též
> 'v k&eftu se dotýče, a podle toho Jiřík Hošťálek
> smanželkú svú paní Annú a s sirotky i dětmi
[... output snipped]
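
The two comparisons in this thread (the filtered diff of the PAGE XML files and the plain diff of the text outputs) can also be sketched in Python. This is only an illustration under assumptions: `relevant_diff` and `IGNORED` are hypothetical names, and instead of filtering the diff output afterwards (as the `grep -vE` pipeline does), it drops the lines that are expected to differ before diffing.

```python
import difflib
import re

# Lines expected to differ between the two runs: the output file group names
# and the engine-mode parameter recorded in the metadata.
IGNORED = re.compile(r"OCR[34]|tessedit_ocr_engine_mode")


def relevant_diff(text_a, text_b, ignored=IGNORED):
    """Return unified-diff lines for two documents, skipping ignored lines."""

    def keep(text):
        # Keep only the lines that do not match the ignore pattern.
        return [line for line in text.splitlines() if not ignored.search(line)]

    return list(difflib.unified_diff(keep(text_a), keep(text_b), lineterm=""))
```

The function returns an empty list when the two inputs differ only in ignored lines, which corresponds to the empty output of the shell loop. For the plain-text CLI outputs, a pattern that matches nothing relevant (for example `re.compile(r"$^")`) can be passed as `ignored`.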

bertsky · 2021-01-22T13:35:22Z

Well, the --oem parameter is another mechanism than -c – the latter is what ocrd-tesserocr-recognize's tesseract_parameters exposes. (--oem / enginemode gets interpreted before model initialization, -c / SetVariable afterwards.) As I said, our wrapper only offers OEM_DEFAULT.
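
To illustrate the distinction, here is a minimal sketch of the two mechanisms as seen from tesserocr. It assumes that the tesserocr package and suitable traineddata files are installed; the function names and the `OEM_NAMES` table are made up for this example and are not part of ocrd-tesserocr.

```python
# Sketch only: assumes tesserocr and suitable traineddata files are installed.

# CLI --oem numbers and the matching attribute names of tesserocr.OEM.
OEM_NAMES = {
    0: "TESSERACT_ONLY",           # legacy engine only
    1: "LSTM_ONLY",                # LSTM engine only
    2: "TESSERACT_LSTM_COMBINED",  # both engines combined
    3: "DEFAULT",                  # decided from what the loaded models contain
}


def init_with_oem(lang, oem_number, tessdata_path=None):
    """Fix the engine mode during initialisation, like `tesseract --oem N`."""
    import tesserocr  # lazy import, so OEM_NAMES is usable without tesserocr

    kwargs = {"lang": lang, "oem": getattr(tesserocr.OEM, OEM_NAMES[oem_number])}
    if tessdata_path is not None:
        kwargs["path"] = tessdata_path
    return tesserocr.PyTessBaseAPI(**kwargs)


def init_then_set_variable(lang, oem_number, tessdata_path=None):
    """Initialise with the default mode, then assign the variable, like `-c`."""
    import tesserocr

    kwargs = {"lang": lang}
    if tessdata_path is not None:
        kwargs["path"] = tessdata_path
    api = tesserocr.PyTessBaseAPI(**kwargs)
    # As described above, this assignment happens after model initialisation,
    # so it does not select the engine the way the init-time `oem` argument does.
    api.SetVariable("tessedit_ocr_engine_mode", str(oem_number))
    return api
```

Under these assumptions, `init_with_oem("ces+deu+lat", 0, "tessdata")` would correspond to the `--oem 0` call on the command line, while `init_then_set_variable` mirrors what `tesseract_parameters` does.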

So is your issue merely one of documentation, or do you actually want to be able to control this, say via -P oem legacy?

Witiko · 2021-01-22T13:47:51Z

> Well, the --oem parameter is another mechanism than -c – the latter is what ocrd-tesserocr-recognize's tesseract_parameters exposes. (--oem / enginemode gets interpreted before model initialization, -c / SetVariable afterwards.) As I said, our wrapper only offers OEM_DEFAULT.

Thanks for the explanation, I did not catch that distinction.

> So is your issue merely one of documentation, or do you actually want to be able to control this, say via -P oem legacy?

Both, ideally. According to our experiments, the legacy engine is more than twice as fast as the LSTM engine on a single CPU and offers competitive performance on recognition and superior performance on language detection (although detected language does not seem to be captured by OCR-D in the PAGE XML output). We'd like to test the recognition performance of OCR-D with Tesseract 3 versus OCR-D with Tesseract 4.

bertsky · 2021-01-22T14:02:51Z

> So is your issue merely one of documentation, or do you actually want to be able to control this, say via -P oem legacy?

> Both, ideally. According to our experiments, the legacy engine is more than twice as fast as the LSTM engine on a single CPU and offers competitive performance on recognition

That depends a lot on the languages/models/training. I can see above that in Czech the LSTM model struggles to be consistently better than the legacy one. The German LSTM models also have some systematic errors (e.g. i vs ı, I vs [ etc.). But as soon as you use tesstrain for finetuning, LSTMs usually outperform legacy, especially on historic texts.

> and superior performance on language detection (although detected language does not seem to be captured by OCR-D in the PAGE XML output).

Language detection is wrapped via ocrd-tesserocr-deskew (which also uses legacy/combined mode). Tesseract/ISO language codes get mapped to PAGE codes and annotated under @primaryScript.

But you are right: OCR-D currently lacks an option to make use of that latter result in the workflow automatically; see #69.

> We'd like to test the recognition performance of OCR-D with Tesseract 3 versus OCR-D with Tesseract 4.

Good idea. It's a small change, I'll expose oem as parameter.

BTW, until then you can still download the pre-LSTM models from Github. (But since they have the same file name, make sure you don't mix them up – better rename first.)

Witiko · 2021-01-22T16:14:52Z

> Language detection is wrapped via ocrd-tesserocr-deskew (which also uses legacy/combined mode).
> Tesseract/ISO language codes get mapped to PAGE codes and annotated under @primaryScript.

An interesting piece of information, which I have not spotted in the documentation.

> BTW, until then you can still download the pre-LSTM models from Github.

Thanks, that's an elegant solution for --oem 0, although it does not enable the use of --oem 2.

bertsky · 2021-02-12T21:06:04Z

> We'd like to test the recognition performance of OCR-D with Tesseract 3 versus OCR-D with Tesseract 4.

> Good idea. It's a small change, I'll expose oem as parameter.

@Witiko I am working on this, but found that you cannot freely choose OEM across the API. It seems that for 0 or 2 you need at least one model with legacy weights in the chain.

Here's the error message you'll get:

    oem=getattr(OEM, self.parameter['oem'])) as tessapi:
  File "tesserocr.pyx", line 1189, in tesserocr.PyTessBaseAPI.__cinit__
  File "tesserocr.pyx", line 1202, in tesserocr.PyTessBaseAPI._init_api
RuntimeError: Failed to init API, possibly an invalid tessdata path: /usr/local/share/tessdata/

Anyway, this is to be expected – just the error message could be better.
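
As a rough idea of what a better message could look like on the wrapper side, here is a minimal sketch. It is not what ocrd-tesserocr does: `init_with_clearer_error` and `LEGACY_MODES` are hypothetical names, and `make_api` stands for any zero-argument callable that performs the actual initialisation (for example one that builds `PyTessBaseAPI` with the requested `oem`).

```python
# Engine modes which, as observed above, need at least one model with legacy weights.
LEGACY_MODES = {"TESSERACT_ONLY", "TESSERACT_LSTM_COMBINED"}


def init_with_clearer_error(make_api, oem_name, models):
    """Call make_api() and add a hint about legacy weights if init fails.

    make_api is a hypothetical zero-argument callable doing the real init;
    oem_name is the name of the requested engine mode; models is the model
    string (e.g. "ces+deu+lat").
    """
    try:
        return make_api()
    except RuntimeError as err:
        if oem_name in LEGACY_MODES:
            # The generic failure is most likely caused by missing legacy data,
            # so say that instead of only blaming the tessdata path.
            raise RuntimeError(
                f"Could not initialise Tesseract with OEM {oem_name} for "
                f"model(s) {models!r}: this mode needs at least one model with "
                f"legacy weights (original error: {err})"
            ) from err
        raise
```

For other engine modes the original exception is re-raised unchanged, so the hint only appears where the legacy-weights explanation is plausible.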

stweil · 2021-03-05T12:56:16Z

> you can still download the pre-LSTM models from Github

As those older models are included in the latest ones, too (some of them with smaller fixes), I see no reason for that. The only advantage would be a smaller traineddata file, so the initial loading would require less memory and be slightly faster.

bertsky · 2021-03-05T13:37:00Z

> you can still download the pre-LSTM models from Github

> As those older models are included in the latest ones, too (some of them with smaller fixes), I see no reason for that. The only advantage would be a smaller traineddata file, so the initial loading would require less memory and be slightly faster.

@stweil, thanks for clarifying. (However, there is another reason: The context here was to explicitly compare LSTM with pre-LSTM results, before ocrd-tesserocr-recognize exposed the oem parameter. You could not get that in a mixed model.)

BTW do you think there's anything we can do to improve the above error message when OEM 0 or 2 is requested but none of the loaded models has legacy weights?

stweil · 2021-03-05T13:40:09Z

Sure, that's a nagging issue which has needed a fix for a long time. It's simply a question of priorities and available resources.

bertsky linked a pull request Feb 12, 2021 that will close this issue

Skip recognition when text exists and not overwrite_text #170

Merged

kba closed this as completed in #170 Mar 5, 2021
