workflow_ocr fails but ocrmypdf not #42

Dexxes · 2020-12-19T13:26:45Z

Hey, thank you for your great work with this app!

I just encountered some strange behaviour. I scanned a document with my scanner and uploaded it to my Nextcloud via the web browser, but the document never got OCRed. I also ran cron.php manually, but that didn't change anything.
So I ran ocrmypdf manually (ocrmypdf DOK.pdf test.pdf --redo-ocr) and it printed this:

Error: stream operator isn't terminated by valid EOL.
Output may be incorrect.

But nevertheless it gives me a good OCRed PDF file.

Could you check if you can replicate this? Here is a sample scan I worked with:
DOK.pdf

tested with Nextcloud 20.0.4

R0Wi · 2020-12-19T14:27:41Z

Hi @Dexxes, thanks for reporting this. If you lower Nextcloud's loglevel to at least WARN (2), you should see the error you mentioned in the logs, in a message like

OCRmyPDF exited abnormally ...

Currently the implementation treats a run as failed if either the return code is non-zero or anything was written to stderr.

As you can see, in your case ocrmypdf writes the error message to stderr, which leads the app to conclude that processing your PDF file failed and therefore not to create a new file version.
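The check described above can be sketched roughly as follows. This is a Python illustration only, not the app's actual PHP code; the function name `ocr_run_succeeded` and its parameters are hypothetical, and the exit code of 0 for the affected file is an assumption.

```python
def ocr_run_succeeded(exit_code: int, stderr: str) -> bool:
    """Return True only if the process exited with 0 and wrote nothing to stderr."""
    return exit_code == 0 and stderr == ""


# The message from this issue arrives on stderr, so the run counts as failed
# even if the exit code is 0 (assumed here for illustration).
message = "Error: stream operator isn't terminated by valid EOL.\nOutput may be incorrect.\n"

print(ocr_run_succeeded(0, ""))       # clean run
print(ocr_run_succeeded(0, message))  # message on stderr
print(ocr_run_succeeded(2, ""))       # non-zero exit code
```

Under this rule any stderr output blocks the new file version, regardless of whether ocrmypdf still produced a usable PDF.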

Maybe @bahnwaerter can add his expertise regarding the mentioned PDF and what might be wrong with it?

Dexxes · 2020-12-19T17:23:32Z

Hey @R0Wi, thanks for the quick response!
Since I used an HP app for the scan, I think they simply messed up their PDF creation. As a result, that EOL error could occur for other users too. Maybe you can find a better solution to filter out this error. My programming knowledge is rather limited, but I came up with this:

// Known message emitted for the malformed PDF from the HP scan app
$knownWarning = "Error: stream operator isn't terminated by valid EOL.";

if ($success && $errorOutput === '' && $stdErr === '') {
	return $this->command->getOutput();
}

// Tolerate the known message, fail on everything else
if (strpos($stdErr, $knownWarning) !== false) {
	return $this->command->getOutput();
}

throw new OcrNotPossibleException('OCRmyPDF exited abnormally with exit-code ' . $exitCode . '. Message: ' . $errorOutput . ' ' . $stdErr);

bahnwaerter · 2020-12-21T12:55:34Z

Hey @Dexxes, thanks for sharing your original PDF file.

I looked at your PDF file and validated it with the PDF validation tool here, where I obtained the following output:

Validating file "DOK.pdf" for conformance level pdf1.3
  The file trailer dictionary is missing or invalid.
  The object's identity 4 doesn't match with the object's reference identity 5.
  The object's identity 3 doesn't match with the object's reference identity 4.
  The object's identity 5 doesn't match with the object's reference identity 6.
  The object's identity 2 doesn't match with the object's reference identity 3.
  The object's identity 1 doesn't match with the object's reference identity 2.
  The document does not conform to the requested standard.
The file format (header, trailer, objects, xref, streams) is corrupted.
The document does not conform to the PDF 1.3 standard.

Since your scanned PDF does not conform to any of the available PDF standards, your assumption of faulty PDF output from the HP scan tool is confirmed. Feel free to report this issue to HP. We would appreciate it.

Your proposed patch to the error handling is one way to work around this issue, which is caused by a faulty input PDF file. Since OCR processing relies on a valid PDF input file, we are not interested in adding special cases to our code base to deal with faulty PDF input files. Otherwise, we would be treating symptoms while the cause remains.

Nevertheless, you can repair the faulty PDF file produced by the HP scan tool before uploading it to your Nextcloud server. In this case it's sufficient to use the program mutool (from the mupdf-tools package) with the following command line call to obtain the repaired PDF file DOK-cleaned.pdf:

mutool clean DOK.pdf DOK-cleaned.pdf

Then, OCR processing of the repaired DOK-cleaned.pdf file (triggered by the Nextcloud workflow engine and the OCR workflow app) should succeed.
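For anyone who wants to script this repair step before uploading, a minimal Python sketch could look like this. The helper names (`cleaned_path`, `build_clean_command`, `repair_pdf`) are hypothetical; only the `mutool clean <input> <output>` call is taken from the command above.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def cleaned_path(src: Path) -> Path:
    """Map DOK.pdf to DOK-cleaned.pdf, next to the original file."""
    return src.with_name(f"{src.stem}-cleaned{src.suffix}")


def build_clean_command(src: Path, dst: Path) -> list[str]:
    """Build the argument list for `mutool clean <input> <output>`."""
    return ["mutool", "clean", str(src), str(dst)]


def repair_pdf(src: Path) -> Path:
    """Run mutool clean on src and return the path of the repaired copy."""
    if shutil.which("mutool") is None:
        raise RuntimeError("mutool not found; it ships with the mupdf-tools package")
    dst = cleaned_path(src)
    # check=True raises CalledProcessError if mutool exits with a non-zero code
    subprocess.run(build_clean_command(src, dst), check=True)
    return dst
```

Calling `repair_pdf(Path("DOK.pdf"))` would write DOK-cleaned.pdf next to the scan, and that file is the one to upload to Nextcloud.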

I will close this issue as it is related to the HP scan tool. But feel free to reopen it, if we can help somehow.

github-cli · 2021-03-16T07:42:15Z

Just wondering: since ocrmypdf only overwrites the file if it was able to produce OCR output, and doesn't alter the images or the like, only the OCR text... why not simply let it?
Or at least give the option to...
Actually, being able to build workflows with different options depending on user/path/filename would be even cooler though ;)

R0Wi · 2021-03-16T15:51:06Z

Well, creating workflows based on different conditions (like user/path/filename) is a core functionality of Nextcloud's workflow engine and not part of this app. So we have to be satisfied with the options NC gives us at this point. If you have good ideas, don't hesitate to open an issue at the Nextcloud server repo.

Regarding the question "Should PDF files which produce warnings while processing be saved?", I think the default should clearly be no. As @bahnwaerter explained, the output of OcrMyPdf is undefined in this case, so it might produce corrupted files or PDF files not matching the spec. But of course we could provide a general configuration page where parameters like this can be configured by the admin (for example, a config page containing a checkbox saying "Ignore OcrMyPdf warnings"). @bahnwaerter what do you think about this? We could even add more parameters like language and so on, but all these settings would affect all OCR workflows. I opened a new discussion for this here, so please feel free to share your opinion.

@github-cli you might also be interested in having a look at https://github.com/nextcloud/workflow_script where you can create workflows based on commandline instructions.

github-cli · 2021-03-16T19:48:37Z

Sorry, I didn't know the output can be undefined. From reading the documentation and experimenting, I thought that it does not modify the file if it could not OCR it, and that in the worst case it would replace an existing good OCR with an incorrect/inferior one. I commented on the new discussion for the options.

About the workflow script, I took a look at it...
Would I just need to pass ocrmypdf --redo-ocr %n %n? Or which argument passes the file location?
Thanks for your help :)

R0Wi · 2021-03-16T21:19:10Z

To be honest, I don't really know what OcrMyPdf does in such edge cases, but I saw error messages which unsettled me, so I think it's best to keep the default as it is and rather add some config extension points for advanced users.

Regarding the workflow script app: unfortunately I cannot help you here because I'm not using it. I was just aware that it exists :-) If you have trouble setting up the workflow for PDF, this could be related to #41

github-cli · 2021-03-17T07:57:35Z

Thanks anyhow. The workflow script app works in matching the event/file with the operator "is", but from the documentation it's not clear to me how to use this to modify the file in place or even create a new one next to it.
I will just hope for an option in the OCR Workflow ;-)
Thanks for your time and work.

R0Wi · 2021-03-17T08:16:30Z

That would be the advantage of using the OCR workflow instead. The app handles all the tricky "create a new file version" stuff for you. So yes, @bahnwaerter and I are discussing how these options could be implemented, and we'll keep you up to date on this in discussion #55. We also plan to offer an alternative Docker-based backend for OCR processing, so that installation effort and error-proneness would be reduced (#51).

So stay tuned 😎

github-cli · 2021-03-17T12:20:58Z

OK, staying tuned then :)
Let me know if I can help somehow, e.g. testing

R0Wi · 2021-03-17T17:40:27Z

Cool thanks 👍

R0Wi self-assigned this Dec 19, 2020

bahnwaerter closed this as completed Dec 21, 2020

bahnwaerter mentioned this issue Mar 15, 2021

workflow not finishing #53

Closed