workflow not finishing #53

github-cli · 2021-03-14T20:46:22Z

It seems this workflow does not finish on larger, higher-resolution files. If I manually run ocrmypdf --redo-ocr input.pdf output.pdf, it works fine, but running "sudo -u www-data php cron.php" only updates smaller files. It does seem to start, since it takes quite some time when new large scans have been added, but the files are never updated.
Is there any way to debug this? Isn't this the exact same command being used by the workflow?

R0Wi · 2021-03-14T22:07:40Z

Hi @github-cli, yes, basically this is the command that is issued. To be precise, it's ocrmypdf --redo-ocr -q - - | cat, and the output stream is then captured.

The first thing I'd do is set a more verbose log level in your NC config and then paste the results here, if possible.
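For readers who want to reproduce the piped invocation outside Nextcloud, here is a minimal Python sketch of the same idea: feed the PDF to the process on stdin and capture stdout and stderr. It is not the app's actual code. The names `run_piped`, `OcrResult`, and `ECHO_COMMAND` are made up for illustration, and the `| cat` part of the real command is left out.

```python
import subprocess
import sys
from typing import NamedTuple, Sequence


class OcrResult(NamedTuple):
    exit_code: int
    stdout: bytes   # the PDF written by the child process, if any
    stderr: str     # warnings and errors reported by the child process


# The command quoted in this thread; "-" "-" means read stdin, write stdout.
OCRMYPDF_COMMAND = ["ocrmypdf", "--redo-ocr", "-q", "-", "-"]

# Hypothetical stand-in that just copies stdin to stdout, so the sketch
# can be tried on a machine without OCRmyPDF installed.
ECHO_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]


def run_piped(command: Sequence[str], pdf_bytes: bytes) -> OcrResult:
    """Send pdf_bytes to the command's stdin and capture both output streams."""
    proc = subprocess.run(
        list(command),
        input=pdf_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,  # inspect the exit code ourselves instead of raising
    )
    return OcrResult(
        proc.returncode,
        proc.stdout,
        proc.stderr.decode("utf-8", errors="replace"),
    )

# Example (assumes ocrmypdf is on PATH and scan.pdf exists):
#   with open("scan.pdf", "rb") as fh:
#       result = run_piped(OCRMYPDF_COMMAND, fh.read())
#   print(result.exit_code, result.stderr)
```

Comparing `result.stderr` from such a piped run with the output of the manual file-to-file run may show whether both paths report the same warnings.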

github-cli · 2021-03-15T07:41:59Z

This is the relevant output. As for the error messages in lines 9 and 10, the same ones appear if I run ocrmypdf manually, but it still finishes and creates the file correctly.

R0Wi · 2021-03-15T08:43:33Z

Thanks for this. Well, the relevant line is:

{"reqId":"R5jlrbUhHQAQ4prqTDLm","level":1,"time":"2021-03-15T07:24:48+01:00","remoteAddr":"","user":"user","app":"workflow_ocr","method":"","url":"--","message":"OCR for file /user/files/+NextCloud/+Scans/user/2021-03-06_150232_somepdf_HQ_.pdf not possible. Message: OCRmyPDF exited abnormally with exit-code 0. Message: 2 **** Error: stream operator isn't terminated by valid EOL.\n Output may be incorrect.\n [...]

As you can see, OCRmyPDF is complaining about the file and the OCR process itself. So, as you said, the default behaviour of the command-line tool is to raise a warning but write the file nevertheless. In our case we decided not to store any PDF file that might have been corrupted by the OCR process, but rather to keep the original file.

If possible, please attach the PDF file in question here. Maybe @bahnwaerter could have a look at it and tell us what's wrong?
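To illustrate the policy described here, a rough Python sketch follows. It is only an assumption about how such a check could look, not the app's real implementation, and the function names `ocr_output_is_trustworthy` and `select_content` are hypothetical. The idea: a run counts as failed if the exit code is non-zero or anything was reported on the error stream, and in that case the original file is kept.

```python
def ocr_output_is_trustworthy(exit_code: int, stderr: str, stdout: bytes) -> bool:
    """Return True only if the OCR run finished cleanly and produced output."""
    if exit_code != 0:
        return False   # the tool itself reported a failure
    if stderr.strip():
        return False   # exit code 0, but warnings or errors were printed
    return len(stdout) > 0   # an empty result is never stored


def select_content(original: bytes, exit_code: int, stderr: str, stdout: bytes) -> bytes:
    """Pick the OCR result if it is trustworthy, otherwise keep the original."""
    if ocr_output_is_trustworthy(exit_code, stderr, stdout):
        return stdout
    return original
```

Under this rule, the case from the log above (exit code 0 together with a "stream operator isn't terminated by valid EOL" message) keeps the original file, which matches what was observed: the job runs for a while, but the file is never updated.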

github-cli · 2021-03-15T12:21:15Z

I have an example I can send. It's not exactly confidential, but can I still share it privately?
I scanned the exact same document at 100 ppi, which works fine, and again at 200 ppi, which doesn't work...

R0Wi · 2021-03-15T12:32:20Z

OK, then it would be nice if you could send me an email with both files attached.

@bahnwaerter FYI

bahnwaerter · 2021-03-15T19:18:28Z

Hey @github-cli, thanks for sharing your original PDF files with @R0Wi and me.

I've taken a look at those PDF files and analyzed them. The PDF file of the low quality scan is compliant with the PDF 1.7 standard, whereas the PDF file of the high quality scan is not syntactically well-formed. Therefore, the PDF file of the high quality scan does not conform to any of the available PDF standards. Furthermore, I noticed that both PDF files were created by the HP scan tool. This scan tool seems to create faulty PDF files as the analysis of issue #42 shows.

To solve this issue, you can repair your faulty PDF files before uploading them to your Nextcloud server. Please follow the solution described in #42 (comment).

I will close this issue as it is related to the HP scan tool. But feel free to reopen it, if we can help somehow.

bahnwaerter · 2021-03-15T19:18:50Z

Duplicate of #42

bahnwaerter marked this as a duplicate of #42 Mar 15, 2021

bahnwaerter closed this as completed Mar 15, 2021
