Make it clearer which Tesseract engine is being used #168

Closed
Witiko opened this issue Jan 21, 2021 · 12 comments · Fixed by #170

Comments

@Witiko

Witiko commented Jan 21, 2021

Since Tesseract 4, two OCR engines are available: rule-based (i.e. --oem 0) and LSTM (--oem 1). The command line also exposes an ensemble of the two OCR engines (--oem 2). The documentation for ocrd-tesserocr-recognize does not make it clear which engine is used, and setting any of the following parameters seems to have no effect on the recognition results:

  • -P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "0" }'
  • -P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "1" }'
  • -P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "2" }'

Which one of the OCR engines are we currently using?
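
Editorial aside: the JSON value passed to -P tesseract_parameters can be generated rather than escaped by hand. The following is a minimal Python sketch using only the standard library; the helper name tesseract_parameters_arg is hypothetical and not part of OCR-D. It only addresses shell quoting and does not change which engine ends up being used.

```python
import json
import shlex


def tesseract_parameters_arg(variables):
    """Return a shell-safe '-P tesseract_parameters <json>' fragment.

    `variables` maps Tesseract variable names to string values.
    (Hypothetical helper for illustration; not part of OCR-D.)
    """
    payload = json.dumps(variables)
    # shlex.quote wraps the JSON in single quotes so the shell passes it through unchanged.
    return "-P tesseract_parameters " + shlex.quote(payload)


if __name__ == "__main__":
    for mode in ("0", "1", "2"):
        print(tesseract_parameters_arg({"tessedit_ocr_engine_mode": mode}))
```

Splitting the fragment again with shlex.split and parsing the last element with json.loads should give back the original dictionary, which is a quick way to check the quoting.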

@bertsky
Collaborator

bertsky commented Jan 21, 2021

Yes, but that's the same behaviour as in the (non-OCRD) Tesseract CLI: It defaults to OEM_DEFAULT, which auto-detects oem based on the language/model choices, backing off to OEM_LSTM_ONLY if that's not a clue either.

But if you explicitly set tessedit_ocr_engine_mode via variable assignment (as could be done in the Tesseract CLI via -c tessedit_ocr_engine_mode=2), then IIUC this will only have an effect for models which contain both legacy and LSTM data. All the other models you loaded will still be running their "natural" OEM.

So e.g. if you use eng.traineddata or Fraktur.traineddata from Github, or one of the tesstrain models, you'll get LSTMs. However, if you use osd.traineddata or deu-frak.traineddata, or older models from Github, you'll get legacy recognition. (In fact, the latter is what you need to do for the fontshape processor, which needs OEM_TESSERACT_ONLY.)

@Witiko
Author

Witiko commented Jan 22, 2021

@bertsky I am using the ces.traineddata, deu.traineddata, and lat.traineddata from tesseract-ocr/tessdata, which contain both legacy and LSTM data. Nevertheless, running the following two commands gives the same results:

$ export TESSDATA_PREFIX=tessdata
$ ocrd process "tesserocr-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR3 \
>     -P tesseract_parameters \{\\\"tessedit_ocr_engine_mode\\\":\\\"0\\\"\} \
>     -P sparse_text false -P model ces+deu+lat"
$ ocrd process "tesserocr-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR4 \
>     -P tesseract_parameters \{\\\"tessedit_ocr_engine_mode\\\":\\\"1\\\"\} \
>     -P sparse_text false -P model ces+deu+lat"

Except for the metadata, the output of the above commands is equivalent. This is evidenced by the following command, which produces empty output:

$ for i in /var/tmp/ocrd-workspace/302/OCR-D-OCR3/*.xml
> do
>     diff $i `sed s/OCR3/OCR4/g <<< $i`
> done | 
> grep -vE 'OCR[34]|tessedit_ocr_engine_mode|^---$|^[0-9]*c[0-9]*$'
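
Editorial aside: the same comparison can be sketched in Python, which avoids the unquoted shell variables. This assumes that each file in the first directory has a counterpart in the second whose name differs only by the directory name (e.g. OCR-D-OCR3 vs OCR-D-OCR4); the function name differing_lines is hypothetical.

```python
import difflib
import re
from pathlib import Path

# Lines expected to differ between the two runs (file group names and the
# recorded parameter); same intent as the grep -v filter in the shell loop.
IGNORED = re.compile(r"OCR[34]|tessedit_ocr_engine_mode")


def differing_lines(dir_a, dir_b, pattern="*.xml"):
    """Yield (filename, diff_line) for every changed line that is not ignored."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    for path_a in sorted(dir_a.glob(pattern)):
        # Assumption: the counterpart's name differs only by the directory name.
        path_b = dir_b / path_a.name.replace(dir_a.name, dir_b.name)
        lines_a = path_a.read_text(encoding="utf-8").splitlines()
        lines_b = path_b.read_text(encoding="utf-8").splitlines()
        for line in difflib.ndiff(lines_a, lines_b):
            # Keep only removed/added lines; skip context and '? ' hint lines.
            if line[:2] in ("- ", "+ ") and not IGNORED.search(line):
                yield path_a.name, line
```

An empty result from list(differing_lines(...)) corresponds to the empty output reported for the shell loop. A missing counterpart file raises FileNotFoundError rather than being skipped silently.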

@bertsky
Collaborator

bertsky commented Jan 22, 2021

So you are saying that the Tesseract CLI behaves differently, i.e. produces different results depending on the mode here?

@Witiko
Author

Witiko commented Jan 22, 2021

Yes, when I use Tesseract with OEM 0 and 1 from the command line, I get different results for the same input:

$ tesseract /var/tmp/ocrd-workspace/302/420.jpg output-tesseract3 \
>     --oem 0 --psm 3 -l ces+deu+lat --tessdata-dir tessdata txt
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
[... output snipped]
$ tesseract /var/tmp/ocrd-workspace/302/420.jpg output-tesseract4 \
>     --oem 1 --psm 3 -l ces+deu+lat --tessdata-dir tessdata txt
[... output omitted]
$ diff output-tesseract3.txt output-tesseract4.txt
3c3
< 'četmi dobrých lidí urozených, tak“ jakž jest
---
> četmi dobrých lidí urozených, tak. jakž jest
5,9c5,9
< soudu, nežli kměstskému příleží, protož to při
< tom tak jsme zuostavíli, však k tomu toto
< z pravapřidavajíce: Poněvadž on nebožtík
< pan Šimon Smolíkovic jemu Jiříkovi“ Fialovi,
< Jiříkovi Hoštálkovi a paní Anně, dceři své,
---
> soudu, nežli k městskému příleží, protož to při
> tom tak jsme zuostavili, však k tomu toto
> Z práva: přidávajíce: .Poněvadž on nebožtík
> -pan Šimon Smolíkovic jemu Jiříkovi  Fialovi,
> Jiříkovi Hošťálkovi a paní Anně, dceři své,
12,14c12,14
< mezi kterýmžto summy peněz na lištu též
< 'v kšeftu se dotýče, a podle toho Jiřík Hošťálek
< smanželkú svá paní Annú a s sirotky i dětmi
---
> mezi kterýmžto summy peněz na listu též
> 'v k&eftu se dotýče, a podle toho Jiřík Hošťálek
> smanželkú svú paní Annú a s sirotky i dětmi
[... output snipped]

@bertsky
Collaborator

bertsky commented Jan 22, 2021

Well, the --oem parameter is another mechanism than -c – the latter is what ocrd-tesserocr-recognize's tesseract_parameters exposes. (--oem / enginemode gets interpreted before model initialization, -c / SetVariable afterwards.) As I said, our wrapper only offers OEM_DEFAULT.

So is your issue merely one of documentation, or do you actually want to be able to control this, say via -P oem legacy?
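
Editorial aside: the two mechanisms distinguished here map to different command-line arguments of the Tesseract CLI. A minimal Python sketch that only assembles the argument list (it does not run Tesseract); the function name tesseract_argv is hypothetical.

```python
def tesseract_argv(image, outbase, langs, oem=None, variables=None):
    """Build an argv list for the tesseract CLI.

    `oem` becomes `--oem N`, i.e. the engine mode that is interpreted
    before model initialization. `variables` become `-c name=value`,
    i.e. variable assignments that are applied afterwards.
    """
    argv = ["tesseract", str(image), str(outbase), "-l", "+".join(langs)]
    if oem is not None:
        argv += ["--oem", str(oem)]
    for name, value in (variables or {}).items():
        argv += ["-c", f"{name}={value}"]
    return argv


if __name__ == "__main__":
    langs = ["ces", "deu", "lat"]
    # Engine mode selected up front:
    print(tesseract_argv("420.jpg", "out-oem", langs, oem=0))
    # Variable assignment only, which is what tesseract_parameters corresponds to:
    print(tesseract_argv("420.jpg", "out-var", langs,
                         variables={"tessedit_ocr_engine_mode": 0}))
```

The resulting list can be passed to subprocess.run if Tesseract is installed; keeping the two cases as separate arguments makes it explicit which mechanism a given run used.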

@Witiko
Author

Witiko commented Jan 22, 2021

Well, the --oem parameter is another mechanism than -c – the latter is what ocrd-tesserocr-recognize's tesseract_parameters exposes. (--oem / enginemode gets interpreted before model initialization, -c / SetVariable afterwards.) As I said, our wrapper only offers OEM_DEFAULT.

Thanks for the explanation, I did not catch that distinction.

So is your issue merely one of documentation, or do you actually want to be able to control this, say via -P oem legacy?

Both, ideally. According to our experiments, the legacy engine is more than twice as fast as the LSTM engine on a single CPU and offers competitive performance on recognition and superior performance on language detection (although detected language does not seem to be captured by OCR-D in the PAGE XML output). We'd like to test the recognition performance of OCR-D with Tesseract 3 versus OCR-D with Tesseract 4.

@bertsky
Collaborator

bertsky commented Jan 22, 2021

So is your issue merely one of documentation, or do you actually want to be able to control this, say via -P oem legacy?

Both, ideally. According to our experiments, the legacy engine is more than twice as fast as the LSTM engine on a single CPU and offers competitive performance on recognition

That depends a lot on the languages/models/training. I can see above that in Czech the LSTM model struggles to be consistently better than the legacy one. The German LSTM models also have some systematic errors (e.g. i vs ı, I vs [ etc). But as soon as you use tesstrain for finetuning, LSTMs usually outperform legacy – especially on historic texts.

and superior performance on language detection (although detected language does not seem to be captured by OCR-D in the PAGE XML output).

Language detection is wrapped via ocrd-tesserocr-deskew (which also uses legacy/combined mode). Tesseract/ISO language codes get mapped to PAGE codes and annotated under @primaryScript.

But you are right: OCR-D currently lacks an option to use that latter result in the workflow automatically – see #69.

We'd like to test the recognition performance of OCR-D with Tesseract 3 versus OCR-D with Tesseract 4.

Good idea. It's a small change, I'll expose oem as parameter.

BTW, until then you can still download the pre-LSTM models from Github. (But since they have the same file name, make sure you don't mix them up – better rename first.)

@Witiko
Author

Witiko commented Jan 22, 2021

Language detection is wrapped via ocrd-tesserocr-deskew (which also uses legacy/combined mode).
Tesseract/ISO language codes get mapped to PAGE codes and annotated under @primaryScript.

An interesting piece of information, which I have not spotted in the documentation.

BTW, until then you can still download the pre-LSTM models from Github.

Thanks, that's an elegant solution for --oem 0, although it does not enable the use of --oem 2.

@bertsky
Collaborator

bertsky commented Feb 12, 2021

We'd like to test the recognition performance of OCR-D with Tesseract 3 versus OCR-D with Tesseract 4.

Good idea. It's a small change, I'll expose oem as parameter.

@Witiko I am working on this, but found that you cannot freely choose OEM across the API. It seems that for 0 or 2 you need at least one model with legacy weights in the chain.

Here's the error message you'll get:

    oem=getattr(OEM, self.parameter['oem'])) as tessapi:
  File "tesserocr.pyx", line 1189, in tesserocr.PyTessBaseAPI.__cinit__
  File "tesserocr.pyx", line 1202, in tesserocr.PyTessBaseAPI._init_api
RuntimeError: Failed to init API, possibly an invalid tessdata path: /usr/local/share/tessdata/

Anyway, this is to be expected – just the error message could be better.
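
Editorial aside: a wrapper could check the constraint described here before initializing the API and report it more clearly. The following Python sketch is not the change made in the linked pull request; the function check_oem and the has_legacy mapping are hypothetical, and how to determine whether a model contains legacy weights is left to the caller.

```python
# OEM member names as in tesserocr's OEM enum; these two need legacy weights.
NEEDS_LEGACY = {"TESSERACT_ONLY", "TESSERACT_LSTM_COMBINED"}


def check_oem(oem_name, model_chain, has_legacy):
    """Raise a descriptive error if `oem_name` needs legacy weights that no model has.

    `model_chain` is a string such as "ces+deu+lat". `has_legacy` maps each
    model name to True/False (assumption: supplied by the caller).
    """
    models = model_chain.split("+")
    if oem_name in NEEDS_LEGACY and not any(has_legacy.get(m, False) for m in models):
        raise ValueError(
            f"OEM {oem_name} requires at least one model with legacy weights, "
            f"but none of {models} provides them"
        )
```

Called before the API is constructed, this would replace the generic "Failed to init API, possibly an invalid tessdata path" message with one that names the actual cause.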

@bertsky bertsky linked a pull request Feb 12, 2021 that will close this issue
@kba kba closed this as completed in #170 Mar 5, 2021
@stweil
Contributor

stweil commented Mar 5, 2021

you can still download the pre-LSTM models from Github

As those older models are included in the latest ones, too (some of them with smaller fixes), I see no reason for that. The only advantage would be a smaller traineddata file, so the initial loading would require less memory and be slightly faster.

@bertsky
Collaborator

bertsky commented Mar 5, 2021

you can still download the pre-LSTM models from Github

As those older models are included in the latest ones, too (some of them with smaller fixes), I see no reason for that. The only advantage would be a smaller traineddata file, so the initial loading would require less memory and be slightly faster.

@stweil, thanks for clarifying. (However, there is another reason: The context here was to explicitly compare LSTM with pre-LSTM results, before ocrd-tesserocr-recognize exposed the oem parameter. You could not get that in a mixed model.)

BTW do you think there's anything we can do to improve the above error message when OEM 0 or 2 is requested but none of the loaded models has legacy weights?

@stweil
Contributor

stweil commented Mar 5, 2021

Sure, that's a nagging issue which has needed a fix for a long time. It's simply a question of priorities and available resources.
