workflow_ocr fails but ocrmypdf not #42
Comments
Hi @Dexxes, thanks for reporting this. If you decrease your
Currently, the implementation checks whether the return code is nonzero or whether anything was written to stderr. As you can see, in your case the error message is written to
Maybe @bahnwaerter can add his expertise regarding the mentioned PDF and what might be wrong with it?
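For illustration, the check described above (fail on a nonzero return code or on any stderr output) can be sketched as a small shell function. This is only a sketch, not the app's actual PHP code; the names `run_strict` and `warn_but_succeed` are hypothetical, and `mktemp` is assumed to be available.

```shell
# Hypothetical helper, not taken from the app: run a command and treat it as
# failed if it exits with a nonzero code OR writes anything to stderr.
run_strict() {
    err_file=$(mktemp) || return 1
    # stdout passes through unchanged; stderr is captured in a temp file
    "$@" 2>"$err_file"
    rc=$?
    err=$(cat "$err_file")
    rm -f "$err_file"
    if [ "$rc" -ne 0 ] || [ -n "$err" ]; then
        printf 'failed (exit code %s): %s\n' "$rc" "$err" >&2
        return 1
    fi
    return 0
}

# Stand-in for a tool that exits with 0 but still prints a warning to stderr,
# which is the situation reported in this issue.
warn_but_succeed() {
    echo "Error: stream operator isn't terminated by valid EOL." >&2
    return 0
}
```

With this kind of check, `run_strict warn_but_succeed` is reported as a failure even though the exit code is 0, which would explain why a manual run can appear to work while the workflow rejects the file.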
Hey @R0Wi, thanks for the quick response!
if ($success && $errorOutput === '' && $stdErr === '') {
    return $this->command->getOutput();
} else {
    if (strpos($stdErr, "Error: stream operator isn't terminated by valid EOL.") !== false) {
        return $this->command->getOutput();
    } else {
        throw new OcrNotPossibleException('OCRmyPDF exited abnormally with exit-code ' . $exitCode . '. Message: ' . $errorOutput . ' ' . $stdErr);
    }
}
Hey @Dexxes, thanks for sharing your original PDF file. I looked at your PDF file and validated it with the PDF validation tool here, where I obtained the following output:
Since your scanned PDF does not conform to any of the available PDF standards, your assumption that the HP scan tool produces a faulty PDF is confirmed. Feel free to report this issue to HP. We would appreciate it.
Your proposed patch for the OCR error handling is one way to work around a faulty input PDF file. However, since OCR processing relies on a valid PDF input file, we are not interested in adding special cases to our code base to deal with faulty PDF input files. Otherwise, symptoms are tackled, but the cause remains.
Nevertheless, you can repair the faulty PDF file produced by the HP scan tool before uploading it to your Nextcloud server. In this case it's sufficient to use the program
mutool clean DOK.pdf DOK-cleaned.pdf
Then, an OCR processing of the repaired
I will close this issue as it is related to the HP scan tool. But feel free to reopen it if we can help somehow.
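The repair step suggested above could be chained with a manual OCR run roughly as follows. This is a minimal sketch under assumptions: the function name `repair_then_ocr` is hypothetical, `mutool` and `ocrmypdf` are assumed to be on the PATH, and only the plain `mutool clean IN OUT` and `ocrmypdf IN OUT` invocations are used.

```shell
# Hypothetical wrapper: repair a PDF with "mutool clean" first, then hand the
# repaired copy to ocrmypdf. The original input file is left untouched.
repair_then_ocr() {
    input=$1
    output=$2
    tmpdir=$(mktemp -d) || return 1
    cleaned="$tmpdir/cleaned.pdf"
    # Step 1: rewrite the PDF; stop here if the repair itself fails
    if ! mutool clean "$input" "$cleaned"; then
        rm -rf "$tmpdir"
        return 1
    fi
    # Step 2: OCR the repaired copy and remember ocrmypdf's exit code
    ocrmypdf "$cleaned" "$output"
    rc=$?
    rm -rf "$tmpdir"
    return "$rc"
}

# Example call (file names taken from this thread):
#   repair_then_ocr DOK.pdf DOK-ocr.pdf
```

Working on a temporary copy keeps the scanned original available in case the repair or the OCR step does not produce the expected result.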
Just wondering: since ocrmypdf only overwrites the file if it was able to produce OCR output, and doesn't alter the content of images or the like, only the OCR text, why not simply let it?
Well, creating workflows based on different conditions (like user/path/filename) is a core functionality of Nextcloud's workflow engine and not part of this app. So we have to be satisfied with the options NC gives us at this point. If you have good ideas, don't hesitate to open an issue at the Nextcloud server repo.
Regarding the question "Should PDF files which produce warnings while processing be saved?", I think the default should clearly be no. As @bahnwaerter explained, the output of
@github-cli you might also be interested in having a look at https://github.com/nextcloud/workflow_script where you can create workflows based on command-line instructions.
Sorry, I didn't know the output can be undefined. From reading the documentation and experimenting, I thought that it does not modify the file if it could not OCR it, and that in the worst case it would replace an existing good OCR with an incorrect or inferior one. I commented on the new discussion for the options. About the workflow script, I took a look at it...
To be honest, I don't really know what
Regarding the workflow script app: unfortunately I cannot help you here because I'm not using it. I was just aware that it exists :-) If you have trouble setting up the workflow for PDF, this could be related to #41
Thanks anyhow. The workflow script app works in matching the event/file with the operator "is", but from the documentation it's not clear to me how to use this to modify the file in place or even create a new one next to it.
That would be the advantage of using the OCR workflow instead. The app handles all the tricky "create a new file version" stuff for you. So yes, @bahnwaerter and I are discussing how these options could be implemented, and we'll keep you up to date on this in discussion #55. We also plan to offer an alternative Docker-based backend for OCR processing, so that installation becomes easier and less error-prone (#51). So stay tuned 😎
ok, staying tuned then :)
Cool thanks 👍
Hey, thank you for your great work with this app!
I just encountered some strange behaviour. I scanned something with my scanner and uploaded it to my Nextcloud via the web browser, but the document never got OCRed. I also ran cron.php manually, but that didn't change a thing.
So I ran ocrmypdf manually:
ocrmypdf DOK.pdf test.pdf --redo-ocr
and it spit out this:
Error: stream operator isn't terminated by valid EOL.
Output may be incorrect.
But nevertheless it gives me a good OCRed PDF file.
Could you check if you can replicate this? Here is a sample scan I worked with:
DOK.pdf
tested with Nextcloud 20.0.4