Recognition claims it is finished but does nothing #106
Comments
I am also having a similar issue. Using OCR4all on Windows 10 via VirtualBox.
Sadly, I couldn't reproduce this yet, but we have had several reports of this behavior when using VirtualBox. Switching to Docker seemed to help several users.
Same here on a fresh Ubuntu 20.04.2 LTS installation on a machine with 2 CPUs.
After trying a lot of configurations, recognition still gives this empty result. I finally ended up with a downloaded Docker image within VirtualBox. Still the same. I was not able to download the Docker image within the WSL2 Docker; I guess that's too slow. The vbox_install git script makes the image hang after the first script, as no prompt appears after the reboot.
Since some suggest that a bigger configuration should make recognition work, I rented a t2.2xlarge instance on AWS, using an image called "Docker on Ubuntu 20", for less than half a dollar an hour. This one does indeed recognize some characters on the first page of the Cirurgia example, and then crashes with an error. I only configured 300 dpi, one column separator, and an extra check for Word output at OCR. To get started with the right choices, I took only the run.sh from the https://github.com/OCR4all/vbox_install script, changed BASE_DIR to /home/ubuntu/ocr4all, and started Docker with it on the AWS server. I also git cloned the getting_started repo and copied the contained ocr4all directory with its content to ~/. After that, this was what was output on the details screen:
I scaled the AWS server down to t2.large, with only 4 processors and 16 GB. For some reason recognition now runs to completion without an error. All Cirurgia pages have text now. I also scanned the Dutch 600 dpi text I wanted to scan in the first place. A surprisingly good result. However, straight lines above or below the letters make the recognition completely garbled, so the line segmenter needs to do more there. Also, as expected, ë and € were not in the modern English model, so they will have to be trained in some future Dutch model.
I saw a document stating that someone is using Ocropus on a laptop with a quad-core CPU and 8 GB. I guess that will be the minimum system requirement. On my 3-core VirtualBox VM with 8 GB RAM the recognition wouldn't run.
Without having seen any message yet I could imagine the lack of AVX on my old (2009) AMD processor could hinder Tensorflow: |
I think the AVX hypothesis is right. I downgraded the Amazon instance to t2.medium (4 GB, 2 CPUs) and the recognition of Cirurgia runs fine in 7 minutes with plenty of free RAM, using not much more than 10 GB of disk according to df. So I think the main system requirement worth mentioning should be AVX, which Tensorflow needs in order to function.
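For anyone who wants to test the same hypothesis on their own machine, here is a minimal sketch of an AVX check. It assumes a Linux host or VM that exposes `/proc/cpuinfo`; the function names `parse_cpu_flags` and `has_avx` are made up for this illustration and are not part of OCR4all or Calamari.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 kernels label the line "flags"; ARM kernels use "Features".
        if sep and key.strip() in ("flags", "Features"):
            flags.update(value.split())
    return flags


def has_avx(cpuinfo_path="/proc/cpuinfo"):
    """Return True/False for AVX support, or None if it cannot be determined."""
    path = Path(cpuinfo_path)
    if not path.exists():
        # Not Linux, or /proc is not mounted: report "unknown" rather than guess.
        return None
    return "avx" in parse_cpu_flags(path.read_text())


if __name__ == "__main__":
    print("AVX available:", has_avx())
```

Inside a VM or container the result reflects what the hypervisor passes through, which can differ from what the physical CPU supports.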
I just ran the recognition on a somewhat newer AMD processor that gave me AVX in the output of gcc: This run was done on the downloaded 0.5 vdi image, which mentions Tensorflow 2.1.2 in a requirements_py3.txt dated December 4.
I am trying to build Tensorflow on my old AMD without AVX. That takes days of compiling, so in the meantime I downloaded a precompiled Tensorflow wheel without AVX. wget https://tf.novaal.de/westmere/tensorflow-2.1.2-cp36-cp36m-linux_x86_64.whl I then tried, for the first time, to reproduce the error from the command line. Found 630 files in the dataset So I guess it would be nice if such an error could end up in the console error screen.
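One way such an error could be surfaced is sketched below. This is only an illustration, not how OCR4all launches recognition: it assumes the step runs as a child process, and that the failure shows up either on stderr or as death by a signal (an illegal-instruction crash is a common symptom of running an AVX build on a CPU without AVX, but the thread does not show the actual error). `describe_exit` and `run_and_report` are hypothetical helper names.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable summary."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as return code -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def run_and_report(command):
    """Run a command and return (returncode, text suitable for an error console)."""
    result = subprocess.run(command, capture_output=True, text=True)
    summary = describe_exit(result.returncode)
    return result.returncode, f"{summary}\n{result.stderr}".strip()


if __name__ == "__main__":
    # Placeholder command; substitute the real recognition call.
    code, report = run_and_report(
        [sys.executable, "-c", "import sys; sys.exit('example failure')"]
    )
    print(report)
```

The point of reporting the signal name is that a process killed this way usually writes nothing useful to stderr, so a UI that only forwards stderr would show an empty console.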
Reproducing some of those errors on a modern machine should then be a matter of installing a Tensorflow wheel built with very fancy compile options, such as CUDA, on a machine that doesn't have CUDA available.
The 2.1.2 build of Tensorflow finally finished. It does recognize text now on my old AMD; however, as expected without any hardware acceleration, Cirurgia takes 26 minutes to recognize on only two virtual cores. As I didn't find any other wheel that would fit, I published my own:
Hi @rmast,
I'll hopefully soon find the time to improve the current implementation of the line segmentation (I already mentioned this in some other issues but sadly didn't get to it yet). But more importantly: We're currently rebuilding OCR4all more or less from scratch to – among other things – add support for OCR-D processors. This will allow users to choose between a wider array of line segmentation processors.
I fully agree (the same goes for a general improvement of logging, see #104). We're adding a warning about this to the documentation as well. I guess using something like cpuinfo to check support for SSE4 etc. before even trying to run e.g. Calamari might make sense too. I'm not sure whether we're going to offer Docker versions with TF without AVX etc. any time in the future, but we would happily link to such custom Docker images etc.
I wrote
I now found that Tesseract does this removal of straight horizontal lines in internal preparation of the scan.
We just reproduced this error with a virtual machine where AVX support had not been activated. It worked fine after activating it. We'll also add checks for this and similar hardware-related quirks to the new UI we're currently building, to make it clearer why stuff might not work (or to just deactivate it if hardware requirements aren't met).
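A rough sketch of what such a check could look like, purely as an illustration: the step names and the `REQUIRED_FLAGS` table are hypothetical, and the flags a step really needs depend on how the installed Tensorflow wheel was built.

```python
# Hypothetical requirement table: which CPU flags each step needs.
REQUIRED_FLAGS = {
    "recognition": {"avx"},
    "training": {"avx"},
}


def unavailable_steps(cpu_flags, requirements=REQUIRED_FLAGS):
    """Map each step that cannot run to the sorted list of CPU flags it is missing."""
    problems = {}
    for step, needed in requirements.items():
        missing = sorted(set(needed) - set(cpu_flags))
        if missing:
            problems[step] = missing
    return problems


if __name__ == "__main__":
    # Example: a CPU (or VM) that exposes SSE4.2 but not AVX.
    for step, missing in unavailable_steps({"sse4_2"}).items():
        print(f"{step} disabled: CPU lacks {', '.join(missing)}")
```

Returning the missing flags per step, rather than a plain yes/no, lets a UI both disable the step and say why.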
Sounds very useful indeed, though I'd guess we should address this in the calamari repo directly. I'm closing this issue, as the cause of the problem seems to have been found and warnings were added to the upcoming documentation. Feel free to reopen in case I missed something or the same problem appears despite complete hardware support for the TF version used.
I am trying to use OCR4all 0.5.0 via Docker on two different workstations.
On one workstation everything is working fine, on the other the Recognition step finishes after several seconds but generates no results. This is reproducible with the example projects from the getting started repository using the default settings. Any ideas why this is happening or hints on other log files with more information?
One theory is that the second workstation does not have enough CPUs (2 available to Docker) to support Calamari.
RAM (12 GB available to Docker) should not be an issue.