Don't disable built-in TESSDATA_PREFIX #261

Merged: 2 commits, Nov 30, 2021

stweil (Collaborator) commented Jun 3, 2021

This allows using the tesseract CLI again without setting an explicit TESSDATA_PREFIX.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil (Collaborator, Author) commented Jun 3, 2021

See related discussion for #240.

bertsky (Collaborator) left a comment

Ok, but then we should also follow up on that:

The main problem was where to install the minimal models (eng+osd) during install-tesseract. The default under $prefix/share/tessdata would only serve the standalone CLI, but we could not make Tesseract take our new XDG path at compile time, because that path does not end in /tessdata. It's a dilemma. We decided to place the models under XDG conventions, so that at least our wrapper would work without extra download steps (which is important for the segmentation functionality, where users do not expect a download). That rendered the standalone CLI unusable without --tessdata-dir anyway. We figured it would be more consistent that way, but I guess it is not: users could still place models into the conventional tessdata path themselves, and then rightly expect the CLI to work as usual. What's more, we could even place a symlink there ourselves at install time (if the FS supports that).

Could you please add the symlink (or copy) of TESSDATA to the configure/hardcoded-default path in TESSERACT_TRAINEDDATA and see if that makes the standalone CLI work out of the box again?
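
The symlink-or-copy step requested here could be sketched roughly as follows. This is only an illustration: the helper name `link_or_copy` and the paths in the usage line are hypothetical, not the actual variables or rules of the ocrd_all Makefile.

```sh
#!/bin/sh
# Sketch: expose a model directory at a second location.
# Tries a symlink first and falls back to a recursive copy
# if the symlink cannot be created (e.g. unsupported by the FS).
# link_or_copy is a hypothetical helper, not part of ocrd_all.
link_or_copy() {
    src=$1   # existing model directory (use an absolute path)
    dst=$2   # location where the standalone CLI looks by default
    # Refuse to clobber an existing destination.
    if [ -e "$dst" ] || [ -L "$dst" ]; then
        echo "destination already exists: $dst" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dst")"
    ln -s "$src" "$dst" 2>/dev/null || cp -R "$src" "$dst"
}
```

It would be called as `link_or_copy /path/to/xdg/models /path/to/prefix/share/tessdata`, where both paths are placeholders. The source should be absolute, because a relative symlink target is resolved relative to the directory containing the link.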

@paulpestov added this to Ideas in coordinate_all Aug 30, 2021
@paulpestov removed this from Ideas in coordinate_all Aug 30, 2021
@paulpestov added this to Open in coordinate_all_pr Sep 15, 2021

stweil (Collaborator, Author) commented Nov 24, 2021

> Could you please add the symlink (or copy) of TESSDATA to the configure/hardcoded-default path

That is now done in a 2nd commit 067955f.

bertsky (Collaborator) commented Nov 25, 2021

> That is now done in a 2nd commit 067955f.

@stweil, so does it work now? (That is, do the minimal models get installed both under the XDG/venv prefix where ocrd_tesserocr can find them, and under the builtin default prefix where the standalone CLI can find them?)
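
The question of whether the minimal models are present under both prefixes can be checked with a small helper like the one below. It is a sketch under assumptions: `has_models` is a hypothetical function, and the two directories in the usage line stand in for whatever the XDG/venv location and the built-in default actually are on a given system.

```sh
#!/bin/sh
# has_models DIR LANG...
# Succeeds only if DIR contains LANG.traineddata for every LANG given.
# Hypothetical helper for illustration, not part of ocrd_all.
has_models() {
    dir=$1
    shift
    for lang in "$@"; do
        if [ ! -f "$dir/$lang.traineddata" ]; then
            echo "missing: $dir/$lang.traineddata" >&2
            return 1
        fi
    done
}
```

For example, `has_models "$xdg_dir" eng osd && has_models "$builtin_dir" eng osd` succeeds only if both locations hold both models (`xdg_dir` and `builtin_dir` are placeholder variables). This only checks that the files exist, so it does not replace running the programs themselves.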

stweil (Collaborator, Author) commented Nov 25, 2021

> so does it work now?

Sure, that was the purpose of the 2nd commit.

$ ls -l venv/share/tessdata/
total 16540
drwxr-xr-x 1 stweil stweil      360 Nov 24 23:06 configs
-rw-r--r-- 1 stweil stweil  4113088 Nov 24 23:06 eng.traineddata
-rw-r--r-- 1 stweil stweil  2251950 Nov 24 23:06 equ.traineddata
-rw-r--r-- 1 stweil stweil 10562727 Nov 24 23:06 osd.traineddata
-rw-r--r-- 1 stweil stweil      572 Nov 24 23:06 pdf.ttf
drwxr-xr-x 1 stweil stweil       88 Nov 24 23:06 tessconfigs

bertsky (Collaborator) commented Nov 25, 2021

That's not a functional test though. Can you please try the built programs themselves?

stweil (Collaborator, Author) commented Nov 25, 2021

tesseract works as expected:

$ tesseract --list-langs
List of available languages (3):
eng
equ
osd
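
A functional check like this can also be scripted. The sketch below assumes a hypothetical helper `langs_listed` that reads the output of `tesseract --list-langs` on standard input and verifies that the required languages appear. The helper itself does not call tesseract.

```sh
#!/bin/sh
# langs_listed LANG...
# Reads a `tesseract --list-langs` listing on stdin and succeeds
# only if every LANG appears as a line of its own.
# Hypothetical helper for illustration.
langs_listed() {
    listing=$(cat)
    for lang in "$@"; do
        # -x: match the whole line, -F: treat LANG as a fixed string
        printf '%s\n' "$listing" | grep -qxF -- "$lang" || return 1
    done
}
```

To exercise the built-in default path, the variable can be cleared in a subshell first: `(unset TESSDATA_PREFIX; tesseract --list-langs) | langs_listed eng osd`.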

bertsky (Collaborator) commented Nov 25, 2021

...and ocrd_tesserocr?

stweil (Collaborator, Author) commented Nov 25, 2021

The pull request does not change anything for ocrd_tesserocr, so it still works as expected, too:

$ make -C ocrd_tesserocr test
make: Entering directory '/home/stweil/src/github/OCR-D/ocrd_all/ocrd_tesserocr'
pip3 install -U pip
Requirement already satisfied: pip in /home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.7/site-packages (21.3.1)
pip3 install -r requirements_test.txt
Requirement already satisfied: pytest>=4.4.0 in /home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.7/site-packages (from -r requirements_test.txt (line 1)) (6.2.5)
Requirement already satisfied: coverage>=4.5.2 in /home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.7/site-packages (from -r requirements_test.txt (line 2)) (6.1.2)
Requirement already satisfied: iniconfig in /home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.7/site-packages (from pytest>=4.4.0->-r requirements_test.txt (line 1)) (1.1.1)
Requirement already satisfied: packaging in /home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.7/site-packages (from pytest>=4.4.0->-r requirements_test.txt (line 1)) (21.3)
Requirement already satisfied: pluggy<2.0,>=0.12 in /home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.7/site-packages (from pytest>=4.4.0->-r requirements_test.txt (line 1)) (1.0.0)
Requirement already satisfied: importlib-metadata>=0.12 in /home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.7/site-packages (from pytest>=4.4.0->-r requirements_test.txt (line 1)) (4.8.2)
Requirement already satisfied: toml in /home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.7/site-packages (from pytest>=4.4.0->-r requirements_test.txt (line 1)) (0.10.2)
Requirement already satisfied: attrs>=19.2.0 in /home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.7/site-packages (from pytest>=4.4.0->-r requirements_test.txt (line 1)) (21.2.0)
Requirement already satisfied: py>=1.8.2 in /home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.7/site-packages (from pytest>=4.4.0->-r requirements_test.txt (line 1)) (1.11.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.7/site-packages (from importlib-metadata>=0.12->pytest>=4.4.0->-r requirements_test.txt (line 1)) (4.0.0)
Requirement already satisfied: zipp>=0.5 in /home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.7/site-packages (from importlib-metadata>=0.12->pytest>=4.4.0->-r requirements_test.txt (line 1)) (3.6.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.7/site-packages (from packaging->pytest>=4.4.0->-r requirements_test.txt (line 1)) (3.0.6)
# declare -p HTTP_PROXY
python3 -m pytest --continue-on-collection-errors test 
================================================================================ test session starts ================================================================================
platform linux -- Python 3.7.10, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/stweil/src/github/OCR-D/ocrd_all/ocrd_tesserocr
collected 4 items                                                                                                                                                                   

test/test_recognize.py .                                                                                                                                                      [ 25%]
test/test_segment_line.py .                                                                                                                                                   [ 50%]
test/test_segment_region.py .                                                                                                                                                 [ 75%]
test/test_segment_word.py .                                                                                                                                                   [100%]

================================================================================ 4 passed in 25.30s =================================================================================
make: Leaving directory '/home/stweil/src/github/OCR-D/ocrd_all/ocrd_tesserocr'

stweil (Collaborator, Author) commented Nov 25, 2021

make -C ocrd_tesserocr test-cli also works, but the output is a little bit lengthy, so I don't paste it here.

@bertsky self-requested a review November 25, 2021 10:46

bertsky (Collaborator) left a comment

Thanks.

coordinate_all_pr automation moved this from Open to Ready to Merge Nov 25, 2021

stweil (Collaborator, Author) commented Nov 25, 2021

Unrelated side note: I just ran make -C ocrd_tesserocr test to compare the run time of the current Tesseract release in ocrd_all with that of the latest Tesseract. Each run ends with a line of the form 4 passed in ####s.

current Tesseract: 25.39s, 25.35s, 26.31s, 25.48s
latest Tesseract: 24.41s, 23.58s, 23.62s, 24.11s, 23.58s

This is not a representative test, but it indicates that the test time decreased by more than a second, so it looks like the latest Tesseract gives a performance gain of at least 4 percent.
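
The arithmetic behind that estimate can be reproduced from the timings listed above. The following is a minimal sketch with two hypothetical helpers, `mean` and `reduction_pct`. It compares the mean run times, which is one reasonable way to summarize such a small sample but not the only one.

```sh
#!/bin/sh
# mean NUM...: print the arithmetic mean of the arguments (2 decimals).
mean() {
    printf '%s\n' "$@" | LC_ALL=C awk '{ s += $1 } END { printf "%.2f\n", s / NR }'
}

# reduction_pct OLD NEW: relative decrease from OLD to NEW in percent.
reduction_pct() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

current=$(mean 25.39 25.35 26.31 25.48)
latest=$(mean 24.41 23.58 23.62 24.11 23.58)
echo "current ${current}s, latest ${latest}s, reduction $(reduction_pct "$current" "$latest")%"
```

With these numbers the means come out at about 25.63 s and 23.86 s, a reduction of roughly 6.9 percent, which is consistent with the cautious "at least 4 percent" estimate.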

stweil (Collaborator, Author) commented Nov 26, 2021

@kba, can we merge the pull request?

@kba mentioned this pull request Nov 30, 2021
@kba merged commit a2ff799 into master Nov 30, 2021
coordinate_all_pr automation moved this from Ready to Merge to Merged Nov 30, 2021
@stweil deleted the tess branch November 30, 2021 18:56