OCR Overwrites digitally signed files #220

Open
farhills opened this issue Aug 11, 2023 · 9 comments

@farhills

Describe the bug

Files with a digital signature are being overwritten, deleting the digital seal (leaving just the image of the signature)

System

  • App version: 1.27.1
  • Nextcloud version: 27.0.1
  • PHP version: 8.2.8
  • Environment: Linuxserver docker container on unraid
  • ocrmypdf version: 14.1.0

How to reproduce

Steps to reproduce the behavior:

  1. create a pdf and apply a digital signature
  2. allow cron to run
  3. signature is deleted from document

Screenshots

image

Additional context

I've deleted the OCR rule for 'file modified', but in my typical workflow I print to PDF and immediately sign, so the files are captured in the queue and often don't get processed until after they've been signed.

It would be great if we could detect if a file is signed and skip it.

I've also commented on ocrmypdf #1040 as I recognize this issue may be more appropriately directed toward that project.

ocrmypdf/OCRmyPDF#1040

@farhills farhills added the bug Something isn't working label Aug 11, 2023
@R0Wi
Contributor

R0Wi commented Aug 11, 2023

Hi @farhills, and thanks for reporting this. I'm afraid you're right: this issue seems to be related to ocrmypdf rather than to this app. The app itself doesn't handle the contents of the converted files; it only creates a new file version in NC from the result of the ocrmypdf conversion.

As far as I understand, the tool technically cannot preserve a valid digital PDF signature, since it changes the document's content, which invalidates any signature.

One way would be to tell ocrmypdf to again sign the document after the process (which is currently not possible AFAIK). If it's possible to check if a pdf is signed or not, we could also add an option "Skip signed pdf" to the app itself.

If you're able to sign your documents via CLI, you could also try chaining the OCR workflow with the external command workflow.
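As a rough illustration of the "check if a PDF is signed" idea, here is a minimal Python sketch. It is not how the app or ocrmypdf detects signatures; it is a byte-level heuristic that assumes a signed PDF contains an uncompressed signature dictionary with a `/ByteRange` entry and a `/Sig` type marker. The function name `looks_signed` is hypothetical, and a real implementation would likely use a PDF library instead.

```python
import re

# Heuristic markers (assumption): a signature dictionary carries a /ByteRange
# array and is tagged either as /Type /Sig or via a form field with /FT /Sig.
_BYTE_RANGE = re.compile(rb"/ByteRange\s*\[")
_SIG_MARKER = re.compile(rb"/(?:FT|Type)\s*/Sig\b")


def looks_signed(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF bytes appear to contain a digital signature.

    This can miss signatures stored in compressed object streams and can
    produce false positives, so treat the result as a hint only.
    """
    return bool(_BYTE_RANGE.search(pdf_bytes)) and bool(_SIG_MARKER.search(pdf_bytes))
```

A caller could read the file in binary mode and skip OCR when `looks_signed` returns `True`.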

@farhills
Author

Thanks, as I wrote the issue I realized it would be the underlying library that has to deal with this. My professional organization has teamed up with a very closed-source certificate authority, there's no CLI option for signing. The process is heavily locked down.

I'll mark the issue as closed. If ocrmypdf adds a new switch '--skip-signed' or similar I'll open a new feature request here to tap into that functionality. Thanks!

@farhills
Author

And just like that it's been fixed! OCRmyPDF v14.4.0 and later refuse to modify digitally signed PDFs by default, so the signature is preserved. Earlier versions clobber the signature without warning.

OCRmyPDF cannot preserve digital signatures in PDFs and also add OCR to them.
By default, it will refuse to modify a signed PDF regardless of other settings. You can
override this behavior with ``--invalidate-digital-signatures``; as the name suggests,
any digital signatures will be invalidated.

OCRmyPDF cannot open documents that are encrypted with a digital certificate.

Versions of OCRmyPDF prior to 14.4.0 would invalidate existing digital signatures
without warning.

ocrmypdf/OCRmyPDF@a371655

@R0Wi
Contributor

R0Wi commented Aug 14, 2023

Thanks for letting us know! Sounds like we might want to introduce an additional switch for the digital signature behaviour.

@R0Wi R0Wi reopened this Aug 14, 2023
@farhills
Author

In my use case, digitally signed documents should never be changed, even if the document OCR is imperfect or incomplete. These files represent final outputs, and need to be retained unmodified.

When OCR is complete, a new file is saved, so the digital signature is lost (as opposed to editing a signed file, where the signature is retained but made invalid by the edit).

I would, at most, add the --invalidate-digital-signatures flag only for the 'Force OCR' option. Safer for the user, but a bit more work for you, would be an opt-in UI checkbox 'include digitally signed files'. Either way, there needs to be a warning to the user that the signed file will be replaced by the OCR output, and the signature will be permanently lost.

@farhills
Author

Some additional feedback - the app notifications need to be updated to catch and handle the no-output condition when processing a digitally signed file. IMO this can be done silently. Currently it throws an error in the browser and desktop client.

image

CLI output for the same file:

root@5ea6340167e7:/data/xxxxxxxxxxxxxx/files/Misc-JD/OCR-Testing# ocrmypdf 'Digital Signature Sample.pdf' sigoutput.pdf
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
DigitalSignatureError: Input PDF has a digital signature. OCR would alter the document,         _sync.py:432
invalidating the signature.

@R0Wi
Contributor

R0Wi commented Aug 16, 2023

Good catch, thanks for the hint. I think we need to properly recognize this situation and, instead of throwing an error, log an informational message, for example.
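One way to recognize this situation could look like the Python sketch below. It assumes the caller has captured the stderr text of the ocrmypdf process and simply looks for the `DigitalSignatureError` name shown in the CLI output above. The function name `classify_ocr_failure` and the returned labels are hypothetical, and matching on output text is fragile compared to a dedicated exit code, if one is available.

```python
def classify_ocr_failure(stderr_text: str) -> str:
    """Map captured ocrmypdf stderr output to a coarse outcome label.

    Returns "skipped_signed" when the output mentions DigitalSignatureError
    (the input was signed and left untouched), otherwise "error".
    """
    if "DigitalSignatureError" in stderr_text:
        # Expected condition: log as information, do not notify the user.
        return "skipped_signed"
    return "error"
```

With this in place, the caller could log `skipped_signed` results at info level and reserve user-facing notifications for `error`.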

@yeupou

yeupou commented Oct 28, 2023

Hello,

In my use case, most of the time I do not care about the original digital signature but do care about proper OCR. I understand that an altered file cannot retain its original signature, and I want OCR nonetheless.

But I would not use force OCR, because I do not want to destroy the original (probably best) OCR.

It would be great if this were an option like the Remove background option, because it makes perfect sense to accept the possible deletion of a digital signature in modes like skip text.

image

@R0Wi
Contributor

R0Wi commented Oct 29, 2023

The current implementation plan is as follows:

  • Add a new switch "Invalidate Digital Signatures" to the per-workflow settings with appropriate help text
  • If ocrmypdf version is < 14.4.0
    • and the switch is not set: do not add any CLI argument, but add a warning (as currently implemented) if a signed file gets overwritten
    • and the switch is set: same as above, but try not to log a warning (if possible... we need to check whether there is a way to determine this error properly)
  • If the version is >= 14.4.0: add the CLI argument if the switch is set, otherwise don't add it. Log any errors/warnings.
  • We'd need a parser for the output of ocrmypdf --version
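The version check in the plan above could be sketched roughly as follows in Python. This is only an illustration, not the app's implementation: it assumes the output of `ocrmypdf --version` contains a dotted `major.minor.patch` number somewhere in its text, and the names `parse_version` and `signature_args` are hypothetical. Only the `--invalidate-digital-signatures` flag and the 14.4.0 threshold come from the thread.

```python
import re
from typing import List, Optional, Tuple

# First OCRmyPDF release that refuses signed PDFs by default (per the thread).
MIN_VERSION = (14, 4, 0)


def parse_version(output: str) -> Optional[Tuple[int, ...]]:
    """Extract the first major.minor.patch triple from version output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def signature_args(version_output: str, invalidate: bool) -> List[str]:
    """Return the extra CLI arguments for the signature switch.

    The flag is only added when the switch is set and the detected version
    is at least 14.4.0; older or unparseable versions get no extra argument.
    """
    version = parse_version(version_output)
    if invalidate and version is not None and version >= MIN_VERSION:
        return ["--invalidate-digital-signatures"]
    return []
```

Comparing integer tuples avoids the pitfalls of string comparison, where "14.10.0" would sort before "14.4.0".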
