Recognition claims it is finished but does nothing #106

Closed
b2m opened this issue Feb 3, 2021 · 16 comments

@b2m

b2m commented Feb 3, 2021

I am trying to use OCR4all 0.5.0 via Docker on two different workstations.

On one workstation everything is working fine, on the other the Recognition step finishes after several seconds but generates no results. This is reproducible with the example projects from the getting started repository using the default settings.

  • The console output tab shows:
Found 109 files in the dataset
Checkpoint version 2 is up-to-date.
  • The console error tab stays empty.
  • The browser console log shows nothing suspicious.
  • Tomcat's catalina.log shows nothing suspicious.

Any ideas why this is happening or hints on other log files with more information?


One theory is that the second workstation does not have enough CPUs (2 available to Docker) to support Calamari.
RAM (12 GB available to Docker) should not be an issue.

@maxnth maxnth added the bug label Feb 3, 2021
@FergusJPWalsh

I am also having a similar issue. Using OCR4all on Windows 10 via VirtualBox.
When I run the Recognition step, it says it has recognised the text, but when I go to LAREX there is nothing there.
I have manually provided a page of ground truth and run the Training step. I get the following console error:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.6/dist-packages/calamari_ocr-1.0.5-py3.6.egg/calamari_ocr/ocr/cross_fold_trainer.py", line 27, in train_individual_model
    ], args.get("run", None), {"threads": args.get('num_threads', -1)}), verbose=args.get("verbose", False)):
  File "/usr/local/lib/python3.6/dist-packages/calamari_ocr-1.0.5-py3.6.egg/calamari_ocr/utils/multiprocessing.py", line 87, in run
    raise Exception("Error: Process finished with code {}".format(process.returncode))
Exception: Error: Process finished with code -4
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/calamari-cross-fold-train", line 33, in 
    sys.exit(load_entry_point('calamari-ocr==1.0.5', 'console_scripts', 'calamari-cross-fold-train')())
  File "/usr/local/lib/python3.6/dist-packages/calamari_ocr-1.0.5-py3.6.egg/calamari_ocr/scripts/cross_fold_train.py", line 80, in main
    temporary_dir=args.temporary_dir, keep_temporary_files=args.keep_temporary_files,
  File "/usr/local/lib/python3.6/dist-packages/calamari_ocr-1.0.5-py3.6.egg/calamari_ocr/ocr/cross_fold_trainer.py", line 151, in run
    pool.map_async(train_individual_model, run_args).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
Exception: Error: Process finished with code -4
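
A negative return code from Python's multiprocessing means the worker process was killed by a signal; a quick way to see which signal -4 corresponds to:

import signal

# Python reports a child killed by a signal as the negative signal number,
# so "code -4" means the worker process died from signal 4.
print(signal.Signals(4).name)  # -> SIGILL (illegal instruction)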

@maxnth
Member

maxnth commented May 5, 2021

When I run the Recognition step, it says it has recognised the text, but when I go to LAREX there is nothing there.

I sadly couldn't reproduce this yet, but we have had several reports of this behavior when using VirtualBox. Switching to Docker seemed to help several users.
This obviously isn't optimal, and we're currently working on a new release which hopefully fixes – among other things – these problems for both Docker and VirtualBox.

@alexander-winkler

Same here on a fresh Ubuntu 20.04.2 LTS installation on a machine with 2 CPUs.

@rmast

rmast commented Jul 28, 2021

After trying a lot of configurations, as recognition still gives this empty result, I finally ended up with a downloaded Docker image inside VirtualBox. Still the same. I was not able to download the Docker image within WSL2 Docker; I guess that's too slow.

The vbox_install Git scripts make your image hang after the first script, as no prompt appears after the reboot.
That first script activates a bashrc wait routine for a Docker image that doesn't show up, as it hasn't been started by the third script yet. Ctrl+C will finish that script when trying to log in.

@rmast

rmast commented Jul 29, 2021

As some suggest that a bigger configuration should make recognition work, I rented a t2.2xlarge instance on AWS with the image "Docker on Ubuntu 20" for less than half a dollar an hour.

https://aws.amazon.com/marketplace/pp/prodview-2jrv4ti3v2r3e?qid=1627578233532&sr=0-8&ref_=srh_res_product_title#pdp-pricing

This one indeed does recognize some characters on the first page of the Cirurgia example, and then crashes with an error.

I only configured 300 dpi, one column separator, and an extra check mark for Word output in the OCR step.

To get started with the right choices I only took run.sh from the https://github.com/OCR4all/vbox_install script, changed BASE_DIR to /home/ubuntu/ocr4all, and started Docker with it on the AWS server.

I also git cloned the getting_started repo and copied the ocr4all directory it contains, with its content, to ~/.

After that, this was the output on the details screen (on a seemingly too small configuration that output already ends after the first two lines):

CONSOLE OUTPUT.txt

CONSOLE ERROR.txt

@rmast

rmast commented Jul 29, 2021

I scaled the AWS server down to t2.large, with only 4 processors and 16 GB. For some reason recognition now comes to an end, with no error. All Cirurgia pages have text now.
I could reach these specs on my bare-metal quad-core computer without VirtualBox, booting straight into Ubuntu.

I also scanned the Dutch 600 dpi text I wanted to scan in the first place. A surprisingly good result. However, straight lines above or below the letters make the recognition completely garbled, so the line segmenter still has some work to do.

Also, as expected, ë and € were not in the modern English model, so they will have to be trained into some Dutch model to come.

@rmast

rmast commented Jul 29, 2021

I saw a document stating that someone is using OCRopus with a quad-core and 8 GB on a laptop. I guess that will be the minimum system requirement. On my 3-core VirtualBox with 8 GB RAM the recognition wouldn't run.

@rmast

rmast commented Jul 30, 2021

Without having seen any error message yet, I could imagine that the lack of AVX on my old (2009) AMD processor could hinder TensorFlow:
https://stackoverflow.com/questions/53723217/is-there-a-version-of-tensorflow-not-compiled-for-avx-instructions/55165620
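
A quick way to check that hypothesis on Linux is to look at the CPU flags; a minimal sketch reading /proc/cpuinfo:

# Linux-only: check whether the CPU advertises the AVX flag that stock
# TensorFlow wheels are compiled to require.
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
print("AVX available" if "avx" in flags else "no AVX flag reported")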

@rmast

rmast commented Jul 30, 2021

I think the AVX hypothesis is right. I downgraded the Amazon instance to t2.medium (4 GB, 2 CPUs) and the recognition of Cirurgia runs fine in 7 minutes, with plenty of free RAM and using not much more than 10 GB of disk:
free -m:
               total        used        free      shared  buff/cache   available
Mem:            3932        1471        1275           1        1185        2222
Swap:              0           0           0

df
Filesystem     1K-blocks     Used  Available Use% Mounted on
/dev/root       30428560 10131736   20280440  34% /
devtmpfs         2007656        0    2007656   0% /dev
tmpfs            2013624        0    2013624   0% /dev/shm
tmpfs             402728      920     401808   1% /run
tmpfs               5120        0       5120   0% /run/lock
tmpfs            2013624        0    2013624   0% /sys/fs/cgroup
/dev/loop2         56832    56832          0 100% /snap/core18/1997
/dev/loop1         25600    25600          0 100% /snap/amazon-ssm-agent/4046
/dev/loop4         56832    56832          0 100% /snap/core18/2074
/dev/loop5        101888   101888          0 100% /snap/core/11420
/dev/loop3         34176    34176          0 100% /snap/amazon-ssm-agent/3552
/dev/loop6         63360    63360          0 100% /snap/core20/1081
/dev/loop0        101632   101632          0 100% /snap/core/10908
/dev/loop7         70400    70400          0 100% /snap/lxd/19823
/dev/loop8         69888    69888          0 100% /snap/lxd/21039
tmpfs             402724        0     402724   0% /run/user/1000

So I think the main system requirement worth mentioning should be AVX, so that TensorFlow can function.

@rmast

rmast commented Jul 31, 2021

I just ran the recognition on a somewhat newer AMD processor, one that does show AVX in the output of gcc:
AMD FX(tm)-6300 Six-Core Processor 3.50 GHz
It does give some recognition result on Cirurgia.

This run was done on the downloaded 0.5 VDI image, which mentions TensorFlow 2.1.2 in a requirements_py3.txt dated December 4.

@rmast

rmast commented Aug 1, 2021

I am trying to build TensorFlow on my old AMD without AVX. That takes days of compiling, so in the meantime I downloaded a precompiled TensorFlow wheel without AVX.

wget https://tf.novaal.de/westmere/tensorflow-2.1.2-cp36-cp36m-linux_x86_64.whl
pip install --ignore-installed --upgrade tensorflow-2.1.2-cp36-cp36m-linux_x86_64.whl
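
A quick smoke test for such a wheel (assuming TensorFlow 2.1.2 as above): if the import itself doesn't abort with an illegal-instruction error, the build matches the CPU.

# If this import crashes with "Illegal instruction (core dumped)", the wheel
# still assumes instruction sets this CPU does not have.
import tensorflow as tf
print(tf.__version__)                          # expect 2.1.2
print(tf.reduce_sum(tf.ones((2, 2))).numpy())  # small op to exercise the runtime; expect 4.0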

For the first time I tried to reproduce the error from the command line.
If I copy the complete calamari-predict line from ps -ef and try to run it, it misses a unique /tmp/tomcat8-tomcat8-tmp/calamari-*.json file that I can copy while a prediction is running. After capturing that file and re-running the line I found the missing error on the command line:

Found 630 files in the dataset
Checkpoint version 2 is up-to-date.
2021-08-01 11:56:44.277277: F tensorflow/core/platform/cpu_feature_guard.cc:37] The TensorFlow library was compiled to use SSE4.1 instructions, but these aren't available on your machine.
Aborted (core dumped)

So I guess it would be nice if such an error could end up in the console error screen.
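
A hypothetical sketch of what surfacing this could look like: run the recognition command with stderr captured, and translate a negative return code (death by a signal) into a readable message for the error tab. The demo command below just simulates a crashing worker.

import subprocess
import sys

# Stand-in for the real calamari-predict invocation; this demo command simply
# kills itself with SIGILL, like TensorFlow does on an unsupported CPU.
cmd = [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGILL)"]

proc = subprocess.run(cmd, capture_output=True, text=True)
if proc.returncode != 0:
    reason = (f"killed by signal {-proc.returncode}" if proc.returncode < 0
              else f"exit code {proc.returncode}")
    print(f"recognition worker failed ({reason}):\n{proc.stderr}")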

@rmast

rmast commented Aug 1, 2021

And reproducing some of these errors on a modern machine will be a matter of installing a TensorFlow wheel built with very fancy compile options, like CUDA on a machine that doesn't have CUDA available.

@rmast

rmast commented Aug 1, 2021

The 2.1.2 build of TensorFlow finally came to an end. It does recognize text now on my old AMD; however, as expected without any hardware acceleration, it takes 26 minutes to recognize Cirurgia on only two virtual cores.

As I didn't find any other wheel that would fit, I published my own:
https://github.com/rmast/tensorflow-community-wheels/releases/tag/2.1.2oldAMD

@maxnth
Member

maxnth commented Aug 2, 2021

Hi @rmast,
excuse the late reply; I was caught up in finishing the latest LAREX release and only just found the time to catch up on the recent comments in this issue. First of all, thank you for the time you put into tracking down the sources of the above-mentioned problems.

so the line segmenter still has some work to do

I'll hopefully soon find the time to improve the current implementation of the line segmentation (I already mentioned this in some other issues but sadly haven't gotten to it yet). But more importantly: we're currently rebuilding OCR4all more or less from scratch to – among other things – add support for OCR-D processors. This will allow users to choose between a wider array of line segmentation processors.

So I guess it would be nice if such an error could end up in the console error screen.

I fully agree (as would a general improvement of logging, see #104). We're adding a warning about this to the documentation as well. Checking support for SSE4, AVX, etc. with something like cpuinfo before even trying to run e.g. Calamari might also make sense, roughly along the lines of the sketch below.
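
A rough sketch of such a pre-flight check, assuming the py-cpuinfo package; where exactly this would hook into OCR4all is of course open.

import cpuinfo  # py-cpuinfo

# Refuse to start Calamari/TensorFlow if the CPU lacks the instruction sets
# the bundled TensorFlow build was compiled for.
required = {"sse4_1", "avx"}
flags = set(cpuinfo.get_cpu_info().get("flags", []))
missing = required - flags
if missing:
    raise SystemExit("CPU lacks required instruction sets: " + ", ".join(sorted(missing)))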

I'm not sure whether we're going to offer Docker versions with TensorFlow without AVX etc. at any time in the future, but we would happily link to such custom Docker images.

@rmast

rmast commented Oct 3, 2021

@maxnth

I wrote

However, straight lines above or below the letters make the recognition completely garbled, so the line segmenter still has some work to do.

I have now found that Tesseract removes straight horizontal lines as part of its internal preparation of the scan.
So that could probably be reused.

@maxnth
Member

maxnth commented Oct 12, 2021

We just reproduced this error with a virtual machine where we had forgotten to activate AVX support. It worked fine after activating it.
A warning regarding the necessary AVX support was also added to the FAQ of the new documentation that's currently being written.

We'll also add checks for this and similar hardware-related quirks to the new UI we're currently building, to make it clearer why things might not work (or just deactivate features if the hardware requirements aren't met).

I have now found that Tesseract removes straight horizontal lines as part of its internal preparation of the scan.
So that could probably be reused.

Sounds very useful indeed, though I'd guess we should address this in the calamari repo directly.

I'm closing this issue, as the cause of the problem seems to have been found and warnings were added to the upcoming documentation. Feel free to reopen in case I missed something or the same problem appears despite complete hardware support for the TF version used.
