Make it clearer which Tesseract engine is being used #168
Yes, but that's the same behaviour as in the (non-OCRD) Tesseract CLI: It defaults to But if you explicitly set So e.g. if you use |
@bertsky I am using the $ export TESSDATA_PREFIX=tessdata
$ ocrd process "tesserocr-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR3 \
> -P tesseract_parameters \{\\\"tessedit_ocr_engine_mode\\\":\\\"0\\\"\} \
> -P sparse_text false -P model ces+deu+lat"
$ ocrd process "tesserocr-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR4 \
> -P tesseract_parameters \{\\\"tessedit_ocr_engine_mode\\\":\\\"1\\\"\} \
> -P sparse_text false -P model ces+deu+lat"
Except for the metadata, the output of the above commands is equivalent. This is evidenced by the following command, which produces empty output:
$ for i in /var/tmp/ocrd-workspace/302/OCR-D-OCR3/*.xml
> do
> diff $i `sed s/OCR3/OCR4/g <<< $i`
> done |
> grep -vE 'OCR[34]|tessedit_ocr_engine_mode|^---$|^[0-9]*c[0-9]*$' |
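The comparison above can also be wrapped in a small helper that filters the metadata lines before comparing, so that only files with real differences are reported. This is a minimal sketch, not part of OCR-D: the function name `compare_file_groups` is hypothetical, and it assumes that the files in both groups differ in name only by the file group prefix (as the `sed s/OCR3/OCR4/g` substitution above suggests).

```shell
# Hypothetical helper (not part of OCR-D): report files that differ between
# two file groups of a workspace once lines matching an "ignore" pattern
# (e.g. metadata) have been dropped.
# Usage: compare_file_groups WORKSPACE GROUP_A GROUP_B IGNORE_REGEX
compare_file_groups() {
  workspace=$1; grp_a=$2; grp_b=$3; ignore=$4
  status=0
  tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
  for file in "$workspace/$grp_a"/*.xml; do
    [ -e "$file" ] || continue
    name=$(basename "$file")
    # Assumption: the counterpart file name is obtained by swapping the group name.
    other="$workspace/$grp_b/$(printf '%s\n' "$name" | sed "s/$grp_a/$grp_b/g")"
    if [ ! -f "$other" ]; then
      echo "missing: $name"
      status=1
      continue
    fi
    # grep -v exits non-zero when every line is filtered out; that is fine here.
    grep -vE "$ignore" "$file" > "$tmp_a" || true
    grep -vE "$ignore" "$other" > "$tmp_b" || true
    if ! cmp -s "$tmp_a" "$tmp_b"; then
      echo "differs: $name"
      status=1
    fi
  done
  rm -f "$tmp_a" "$tmp_b"
  return "$status"
}
```

For the workspace above, the call would be `compare_file_groups /var/tmp/ocrd-workspace/302 OCR-D-OCR3 OCR-D-OCR4 'OCR[34]|tessedit_ocr_engine_mode'`. Empty output and exit status 0 mean the two groups are equivalent apart from the ignored lines.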
So you are saying that the Tesseract CLI behaves differently, i.e. produces different results depending on the mode here? |
Yes, when I use Tesseract with OEM 0 and 1 from the command line, I get different results for the same input:
$ tesseract /var/tmp/ocrd-workspace/302/420.jpg output-tesseract3 \
> --oem 0 --psm 3 -l ces+deu+lat --tessdata-dir tessdata txt
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
[... output snipped]
$ tesseract /var/tmp/ocrd-workspace/302/420.jpg output-tesseract4 \
> --oem 1 --psm 3 -l ces+deu+lat --tessdata-dir tessdata txt
[... output omitted]
$ diff output-tesseract3.txt output-tesseract4.txt
3c3
< 'četmi dobrých lidí urozených, tak“ jakž jest
---
> četmi dobrých lidí urozených, tak. jakž jest
5,9c5,9
< soudu, nežli kměstskému příleží, protož to při
< tom tak jsme zuostavíli, však k tomu toto
< z pravapřidavajíce: Poněvadž on nebožtík
< pan Šimon Smolíkovic jemu Jiříkovi“ Fialovi,
< Jiříkovi Hoštálkovi a paní Anně, dceři své,
---
> soudu, nežli k městskému příleží, protož to při
> tom tak jsme zuostavili, však k tomu toto
> Z práva: přidávajíce: .Poněvadž on nebožtík
> -pan Šimon Smolíkovic jemu Jiříkovi Fialovi,
> Jiříkovi Hošťálkovi a paní Anně, dceři své,
12,14c12,14
< mezi kterýmžto summy peněz na lištu též
< 'v kšeftu se dotýče, a podle toho Jiřík Hošťálek
< smanželkú svá paní Annú a s sirotky i dětmi
---
> mezi kterýmžto summy peněz na listu též
> 'v k&eftu se dotýče, a podle toho Jiřík Hošťálek
> smanželkú svú paní Annú a s sirotky i dětmi
[... output snipped] |
Well, the So is your issue merely one of documentation, or do you actually want to be able to control this, say via |
Thanks for the explanation, I did not catch that distinction.
Both, ideally. According to our experiments, the legacy engine is more than twice as fast as the LSTM engine on a single CPU and offers competitive performance on recognition and superior performance on language detection (although detected language does not seem to be captured by OCR-D in the PAGE XML output). We'd like to test the recognition performance of OCR-D with Tesseract 3 versus OCR-D with Tesseract 4. |
That depends a lot on the languages/models/training. I can see above that in Czech the LSTM model struggles to be consistently better than the legacy one. The German LSTM models also have some systematic errors (e.g.
Language detection is wrapped via But you are right OCR-D currently lacks an option to facilitate that latter result in the workflow automatically – see #69
Good idea. It's a small change, I'll expose BTW, until then you can still download the pre-LSTM models from GitHub. (But since they have the same file name, make sure you don't mix them up – better rename first.) |
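The renaming advice can be sketched as follows. This is only an illustration under assumptions: the function name `install_renamed_model` and the `deu_legacy` model name are made up, and the pre-LSTM file is assumed to have been downloaded already. Tesseract resolves a model name to `<name>.traineddata` in the tessdata directory, so a renamed copy can be selected by its new name.

```shell
# Hypothetical helper: copy a downloaded pre-LSTM model into the tessdata
# directory under a distinct name, refusing to overwrite an existing model.
# Usage: install_renamed_model SOURCE_FILE TESSDATA_DIR NEW_NAME
install_renamed_model() {
  src=$1; tessdata_dir=$2; new_name=$3
  dest="$tessdata_dir/$new_name.traineddata"
  if [ -e "$dest" ]; then
    echo "refusing to overwrite $dest" >&2
    return 1
  fi
  mkdir -p "$tessdata_dir" && cp "$src" "$dest"
}
```

For example, `install_renamed_model ~/Downloads/deu.traineddata tessdata deu_legacy` (paths are placeholders) would allow the legacy model to be requested as `deu_legacy` alongside the LSTM `deu` model.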
An interesting piece of information, which I have not spotted in the documentation.
Thanks, that's an elegant solution for |
@Witiko I am working on this, but found that you cannot freely choose OEM across the API. It seems that for 0 or 2 you need at least one model with legacy weights in the chain. Here's the error message you'll get:
Anyway, this is to be expected – just the error message could be better. |
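Until the error message improves, one way to see in advance whether a model carries legacy weights is to look at its component listing. The sketch below is a heuristic under assumptions: it relies on `combine_tessdata -d` printing one `index:component:...` line per component, and it treats the presence of the `inttemp` component (the legacy classifier templates) as a sign of legacy weights. The function names are hypothetical.

```shell
# Reads a component listing (as printed by `combine_tessdata -d`) on stdin and
# succeeds if it contains the legacy classifier templates ("inttemp").
# Assumption: one "index:component:..." line per component.
listing_has_legacy() {
  grep -qE '^[0-9]+:inttemp:'
}

# Hypothetical wrapper: check a .traineddata file directly. The listing may be
# written to stderr, so both streams are piped into the check.
model_has_legacy() {
  combine_tessdata -d "$1" 2>&1 | listing_has_legacy
}
```

A call such as `model_has_legacy tessdata/ces.traineddata && echo "OEM 0 or 2 should be possible"` could then be run for each model in the chain before requesting the legacy or combined engine.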
As those older models are included in the latest ones, too (some of them with smaller fixes), I see no reason for that. The only advantage would be a smaller traineddata file, so the initial loading would require less memory and be slightly faster. |
@stweil, thanks for clarifying. (However, there is another reason: The context here was to explicitly compare LSTM with pre-LSTM results, before ocrd-tesserocr-recognize exposed the BTW do you think there's anything we can do to improve the above error message when OEM 0 or 2 is requested but none of the loaded models has legacy weights? |
Sure, that's a nagging issue which has needed a fix for a long time. It's simply a question of priorities and available resources. |
Since Tesseract 4, two OCR engines are available: rule-based (i.e. --oem 0) and LSTM (--oem 1). The command line also exposes an ensemble of the two OCR engines (--oem 2). The documentation for ocrd-tesserocr-recognize does not make it clear which engine is used, and using any of the following parameters seems to have no effect on the recognition results:
-P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "0" }'
-P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "1" }'
-P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "2" }'
Which one of the OCR engines are we currently using?