workflow_ocr fails but ocrmypdf not #42

Closed
Dexxes opened this issue Dec 19, 2020 · 11 comments

@Dexxes

Dexxes commented Dec 19, 2020

Hey, thank you for your great work with this app!

I just encountered some strange behaviour. I scanned a document with my scanner and uploaded it to my Nextcloud via the web browser, but the document never got OCRed. I also ran cron.php manually, but that didn't change anything.
So I ran ocrmypdf manually (ocrmypdf DOK.pdf test.pdf --redo-ocr) and it printed this:

Error: stream operator isn't terminated by valid EOL.
Output may be incorrect.

But nevertheless it gives me a good OCRed PDF file.

Could you check if you can replicate this? Here is a sample scan I worked with:
DOK.pdf


tested with Nextcloud 20.0.4

@R0Wi R0Wi self-assigned this Dec 19, 2020
@R0Wi
Contributor

R0Wi commented Dec 19, 2020

Hi @Dexxes, thanks for reporting this. If you set Nextcloud's log level to WARN (2) or lower, you should see the error message you mentioned in the logs, in an entry like

OCRmyPDF exited abnormally ...

Currently the implementation treats a run as failed if either the return code is nonzero or anything was written to stderr.

In your case, ocrmypdf writes the error message to stderr, so the app assumes that processing your PDF file failed and therefore does not create a new file version.
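A minimal sketch of that rule, for illustration only: the function name `ocrRunSucceeded` is hypothetical and this is not the app's actual code. It also assumes that ocrmypdf exited with code 0 for the sample file, which the thread does not state.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not the app's real implementation): a run counts as
 * successful only if the exit code is zero AND nothing was written to stderr.
 */
function ocrRunSucceeded(int $exitCode, string $stdErr): bool
{
    return $exitCode === 0 && $stdErr === '';
}

// The message from this issue, assuming (not confirmed) an exit code of 0.
$stdErr = "Error: stream operator isn't terminated by valid EOL.\nOutput may be incorrect.\n";

var_dump(ocrRunSucceeded(0, $stdErr)); // stderr is not empty, so this is a failure
var_dump(ocrRunSucceeded(0, ''));      // clean run
var_dump(ocrRunSucceeded(2, ''));      // nonzero exit code, failure
```

Under this rule, a message on stderr is enough to discard the result, even when ocrmypdf still wrote an output file.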

Maybe @bahnwaerter can add his expertise regarding the mentioned PDF and what might be wrong with it?

@Dexxes
Author

Dexxes commented Dec 19, 2020

Hey @R0Wi, thanks for the quick response!
Since I used an HP app for the scan, I think they simply messed up their PDF creation. As a result, that EOL error could occur for other users too. Maybe you can find a better way to filter out this error. My programming knowledge is rather limited, but I came up with this:

if ($success && $errorOutput === '' && $stdErr === '') {
	return $this->command->getOutput();
}

// Tolerate the known EOL message on stderr and still return the OCR output.
if (strpos($stdErr, "Error: stream operator isn't terminated by valid EOL.") !== false) {
	return $this->command->getOutput();
}

throw new OcrNotPossibleException('OCRmyPDF exited abnormally with exit-code ' . $exitCode . '. Message: ' . $errorOutput . ' ' . $stdErr);

@bahnwaerter
Collaborator

Hey @Dexxes, thanks for sharing your original PDF file.

I looked at your PDF file and validated it with the PDF validation tool here, where I obtained the following output:

Validating file "DOK.pdf" for conformance level pdf1.3
  The file trailer dictionary is missing or invalid.
  The object's identity 4 doesn't match with the object's reference identity 5.
  The object's identity 3 doesn't match with the object's reference identity 4.
  The object's identity 5 doesn't match with the object's reference identity 6.
  The object's identity 2 doesn't match with the object's reference identity 3.
  The object's identity 1 doesn't match with the object's reference identity 2.
  The document does not conform to the requested standard.
The file format (header, trailer, objects, xref, streams) is corrupted.
The document does not conform to the PDF 1.3 standard.

Since your scanned PDF does not conform to any of the available PDF standards, your assumption that the HP scan tool produces faulty PDF output is confirmed. Feel free to report this issue to HP. We would appreciate it.

Your proposed patch to the error handling of the OCR processing is one way to work around this issue, which is caused by a faulty input PDF file. Since OCR processing relies on a valid PDF input file, we are not interested in adding special cases to our code base to deal with faulty PDF input files. Otherwise, the symptoms are tackled, but the cause remains.

Nevertheless, you can repair the faulty PDF file produced by the HP scan tool before uploading it to your Nextcloud server. In this case it's sufficient to use the program mutool (from the mupdf-tools package) with the following command line call to obtain the repaired PDF file DOK-cleaned.pdf:

mutool clean DOK.pdf DOK-cleaned.pdf

Then, OCR processing of the repaired DOK-cleaned.pdf file (triggered by the Nextcloud workflow engine and the OCR workflow app) should succeed.

I will close this issue as it is related to the HP scan tool. But feel free to reopen it, if we can help somehow.

@github-cli

Just wondering: since ocrmypdf only overwrites the file if it was able to produce OCR output, and doesn't alter the images or the like, only the OCR text... why not simply let it?
Or at least give the option to...
Actually, being able to create workflows with different options depending on user/path/filename would be even cooler though ;)

@R0Wi
Contributor

R0Wi commented Mar 16, 2021

Well, creating workflows based on different conditions (like user/path/filename) is a core functionality of Nextcloud's workflow engine and not part of this app. So we have to be satisfied with the options NC gives us at this point. If you have good ideas, don't hesitate to open an issue at the Nextcloud server repo.

Regarding the question "Should PDF files which produce warnings while processing be saved?", I think the default should clearly be no. As @bahnwaerter explained, the output of OCRmyPDF is undefined in this case, so it might produce corrupted files or PDF files that don't match the spec. But of course we could provide a general configuration page where parameters like this can be configured by the admin (for example, a config page containing a checkbox saying "Ignore OCRmyPDF warnings"). @bahnwaerter, what do you think about this? We could even add more parameters like language and so on, but all these settings would affect all OCR workflows. I opened a new discussion for this here, so please feel free to contribute your opinion.
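Just to illustrate the idea, a rough sketch of how such a checkbox could feed into the success check. Everything here is an assumption: the function `shouldSaveOcrResult` and the `$ignoreWarnings` flag are hypothetical names and do not exist in the app.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: $ignoreWarnings stands in for an admin setting such as
 * an "Ignore OCRmyPDF warnings" checkbox. It is not an existing option.
 */
function shouldSaveOcrResult(int $exitCode, string $stdErr, bool $ignoreWarnings = false): bool
{
    if ($exitCode !== 0) {
        // A nonzero exit code is always treated as a failure.
        return false;
    }

    // Default (false): any stderr output still blocks the new file version.
    return $ignoreWarnings || $stdErr === '';
}

$warning = "Error: stream operator isn't terminated by valid EOL.\n";

var_dump(shouldSaveOcrResult(0, $warning));       // default: result is discarded
var_dump(shouldSaveOcrResult(0, $warning, true)); // admin opted in: result is kept
```

The default stays strict, so existing installations would behave exactly as before unless the admin opts in.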

@github-cli, you might also be interested in having a look at https://github.com/nextcloud/workflow_script where you can create workflows based on command-line instructions.

@github-cli

github-cli commented Mar 16, 2021

Sorry, I didn't know the output can be undefined. From reading the documentation and experimenting, I thought that it does not modify the file if it could not OCR it; worst case, it would replace an existing good OCR with an incorrect/inferior one. I commented on the new discussion for the options.

About the workflow script, I took a look at it...
Would I just need to pass ocrmypdf --redo-ocr %n %n? Or which argument passes the file location?
Thanks for your help :)

@R0Wi
Contributor

R0Wi commented Mar 16, 2021

To be honest, I don't really know what OCRmyPDF does in such edge cases, but I saw error messages which unsettled me, so I think it's best to keep the default as it is and rather add some config extension points for advanced users.

Regarding the workflow script app: unfortunately I cannot help you here because I'm not using it. I was just aware that it exists :-) If you have trouble setting up the workflow for PDF, this could be related to #41

@github-cli

Thanks anyhow. The workflow script app works in matching the event/file with the operator "is", but from the documentation it's not clear to me how to use it to modify the file in place or even create a new one next to it.
I will just hope for an option in the OCR Workflow ;-)
Thanks for your time and work.

@R0Wi
Contributor

R0Wi commented Mar 17, 2021

That would be the advantage of using the OCR workflow instead. The app handles all the tricky "create a new file version" stuff for you. So yes, @bahnwaerter and I are discussing how these options could be implemented, and we'll keep you up to date on this in discussion #55. We also plan to offer an alternative Docker-based backend for OCR processing, which should simplify installation and make it less error-prone (#51).

So stay tuned 😎

@github-cli

ok, staying tuned then :)
let me know if i can help somehow, e.g. testing

@R0Wi
Contributor

R0Wi commented Mar 17, 2021

Cool thanks 👍
