pyHanko detects field widgets that have been previously deleted, but other toolkits do not #430

ag-gaphp · 2024-05-20T16:27:22Z

I'm trying to hunt down an issue, and I have no idea whether it lies with pyHanko, PyMuPDF, or the way I'm creating the files in LibreOffice in the first place. I did my best to put together a test scenario that shows the issue. I've posted a bug on the PyMuPDF page, but they are unwilling to help and don't see it as an issue in their toolkit. I still need to put together a test that uses pypdf to remove the fields and see whether the results are different, but I'm posting this in the meantime. Hopefully someone can tell me what I'm doing incorrectly here!

I guess my main purpose in posting here is: Does pyHanko read from some sort of historical data in the PDF? I saw some mention in the docs about being able to get some type of change history from the PDF, and wondering if that's what's going on (i.e., the fields are gone, but references exist in the change history, so pyHanko complains).

First, here is my PDF file as exported from LibreOffice.

Second, here is the altered PDF after having the placeholder widgets removed.

Here is my test script. It attempts to:

1. Find any fields whose names start with sig or init using PyMuPDF
2. Store the rect and name, then delete the fields and save to a new PDF
3. Check with PyMuPDF that the fields are gone
4. Try to add signatures using pyHanko
5. Check with pypdf that the fields are gone
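
For the "store the rect" step, the script below flips the y coordinates because PyMuPDF measures y from the top of the page while PDF user space measures it from the bottom. A minimal sketch of that conversion follows. The helper name `to_pdf_box` is hypothetical, and it assumes an unrotated page whose origin sits at (0, 0). It also returns the lower-left corner first; whether pyHanko requires that ordering is not confirmed anywhere in this thread.

```python
def to_pdf_box(x0, y0, x1, y1, page_height):
    """Flip a top-left-origin rectangle into bottom-left-origin coordinates.

    Hypothetical helper: assumes an unrotated page with its origin at (0, 0).
    Returns (llx, lly, urx, ury), i.e. lower-left corner first.
    """
    left, right = min(x0, x1), max(x0, x1)
    top, bottom = min(y0, y1), max(y0, y1)  # in top-down coordinates
    return (left, page_height - bottom, right, page_height - top)


if __name__ == "__main__":
    # A 150 x 30 box located 700 units from the top of a 792-unit-tall page.
    print(to_pdf_box(100, 700, 250, 730, 792))
```

With PyMuPDF, the arguments would presumably come from `field.rect` and the page height, as in the script.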

requirements.txt

pymupdf
pyhanko
pypdf

test.py (fitz == PyMuPDF)

import fitz, os
from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter
from pyhanko.sign.fields import SigFieldSpec, append_signature_field
from pypdf import PdfReader
from pypdf.constants import AnnotationDictionaryAttributes, FieldDictionaryAttributes

OLD_FILE="exported_from_libreoffice.pdf"
NEW_FILE="deleted_signatures.pdf"

if os.path.exists(NEW_FILE):
    os.remove(NEW_FILE)

boxes = {}

doc = fitz.open(OLD_FILE)

# iterate the pages
print("Removing fields with PyMuPDF")
for page in doc:
    # store the page's height for placement
    _page_rect = page.bound()
    _page_height = _page_rect.y1

    # iterate the fields on this page
    field = page.first_widget
    while field:
        n = field.field_name

        # if it's a signature, remove it
        if n.startswith("sig") or n.startswith("init"):
            # PyMuPDF y coords go top-to-bottom, but pyHanko goes bottom-to-top
            # Subtract the y coords from the current page height for pyHanko
            boxes[n] = {
                "page": page.number,
                "box": (
                    field.rect.x0,
                    _page_height-field.rect.y0,
                    field.rect.x1,
                    _page_height-field.rect.y1
                )
            }
            print("Removing field: ", n)
            field = page.delete_widget(field)

        else:
            field = field.next

# save the document updates
doc.ez_save(NEW_FILE)
doc.close()

# now re-opening the document to check if the fields I removed are still there or not
check_doc = fitz.open(NEW_FILE)

# iterate the pages again
print("Checking PDF for removed fields with PyMuPDF")
found = 0
for page in check_doc:
    # iterate the fields on this page
    field = page.first_widget
    while field:
        n = field.field_name
        # if it's a signature, print to console
        if n.startswith("sig") or n.startswith("init"):
            found += 1
            print(f"...'{n}' is still present")
        field = field.next

print(f"PyMuPDF found {found} fields")

check_doc.close()

# now let's try to use pyHanko to add new signatures
print("Adding signatures to new PDF with pyHanko")
found = 0
with open(NEW_FILE, 'rb+') as sig_doc:
    writer = IncrementalPdfFileWriter(sig_doc, strict=False)
    for name in boxes.keys():
        _dict = boxes[name]
        try:
            append_signature_field(writer, SigFieldSpec(
                                        sig_field_name=name,
                                        on_page=_dict["page"],
                                        box=_dict["box"]
                                    ))

        except Exception as e:
            found += 1
            print("ERROR: ", e)

    writer.write_in_place()

print(f"pyHanko found {found} fields")

# now we're going to check if pypdf can detect the fields
print("Checking for the deleted fields using pypdf")
reader = PdfReader(NEW_FILE)
found = 0
for page in reader.pages:
    for annot in page.annotations:
        annot = annot.get_object()
        if annot[AnnotationDictionaryAttributes.Subtype] == "/Widget":
            try:
                n = annot[FieldDictionaryAttributes.T]
                if n.startswith('sig') or n.startswith('init'):
                    found += 1
                    print(f"Found field: {n}")
            except KeyError as e:
                pass

print(f"PyPDF found {found} fields")

If you run this against the file I uploaded, you can see that both PyMuPDF and PyPDF are unable to detect the regular text fields that I removed using PyMuPDF. Somehow, though, pyHanko does see them and is not able to add new signature fields because the names are not unique. This is the output from the script:

Removing fields with PyMuPDF
Checking PDF for removed fields with PyMuPDF
PyMuPDF found 0 fields
Adding signatures to new PDF with pyHanko
ERROR:  Field with name init0 exists but is not a signature field
ERROR:  Field with name init1 exists but is not a signature field
ERROR:  Field with name init2 exists but is not a signature field
ERROR:  Field with name init3 exists but is not a signature field
ERROR:  Field with name init4 exists but is not a signature field
ERROR:  Field with name init5 exists but is not a signature field
ERROR:  Field with name init6 exists but is not a signature field
ERROR:  Field with name init7 exists but is not a signature field
ERROR:  Field with name init8 exists but is not a signature field
ERROR:  Field with name init9 exists but is not a signature field
ERROR:  Field with name init10 exists but is not a signature field
ERROR:  Field with name init11 exists but is not a signature field
ERROR:  Field with name init12 exists but is not a signature field
ERROR:  Field with name init13 exists but is not a signature field
ERROR:  Field with name init14 exists but is not a signature field
ERROR:  Field with name init15 exists but is not a signature field
ERROR:  Field with name init16 exists but is not a signature field
ERROR:  Field with name init17 exists but is not a signature field
ERROR:  Field with name init18 exists but is not a signature field
ERROR:  Field with name init19 exists but is not a signature field
ERROR:  Field with name init20 exists but is not a signature field
ERROR:  Field with name init21 exists but is not a signature field
ERROR:  Field with name init22 exists but is not a signature field
ERROR:  Field with name init23 exists but is not a signature field
ERROR:  Field with name init24 exists but is not a signature field
ERROR:  Field with name init25 exists but is not a signature field
ERROR:  Field with name init26 exists but is not a signature field
ERROR:  Field with name init27 exists but is not a signature field
ERROR:  Field with name init28 exists but is not a signature field
ERROR:  Field with name sigPrimary1 exists but is not a signature field
ERROR:  Field with name init29 exists but is not a signature field
ERROR:  Field with name sigSecondary1 exists but is not a signature field
pyHanko found 32 fields
Checking for the deleted fields using pypdf
PyPDF found 0 fields

Does anyone have any idea what could be causing this, or could someone point me in a direction to keep hunting?

I should note that it is not only pyHanko that is able to detect these supposedly deleted fields, but 6 different eSign platforms are able to detect them when I import the fields, including the field's rect/position on the page.

My current workaround is to rename the fields using PyMuPDF, which allows pyHanko to add the signatures, but then I have an unnecessary text field that always appears underneath my signature fields when I import to eSign, and then have to manually delete them in the eSign platform by moving the signature out of the way, deleting the field, then putting the signature back in place. The end result form is how I want it, but obviously this is a lot of extra manual steps when you have hundreds of forms with an average of 30+ signatures in each one.

Lastly, I know there are yellow and red boxes under each of these fields. I want those to remain in the PDF and do not expect them to be deleted. From what I understand, they should be separate from the field widgets anyway, and unaffected by anything I do to the fields.

EDIT: clarified some details, uploaded another example PDF, added my script's output

ag-gaphp · 2024-05-21T17:37:53Z

ag-gaphp
May 21, 2024
Author

I've put together another test where I remove the placeholder widgets using pypdf instead of PyMuPDF.

I get the exact same results, where pyHanko and the eSign platforms still detect those fields that should be removed.

New test script:

import os
from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter
from pyhanko.sign.fields import SigFieldSpec, append_signature_field
from pypdf import PdfReader, PdfWriter
from pypdf.constants import AnnotationDictionaryAttributes, FieldDictionaryAttributes, PageAttributes

OLD_FILE="exported_from_libreoffice.pdf"
NEW_FILE="test.pdf"

if os.path.exists(NEW_FILE):
    os.remove(NEW_FILE)

boxes = {}

# try using pypdf to remove annotations and see if pyHanko still complains
pypdf_writer = PdfWriter(clone_from=OLD_FILE)
for page in pypdf_writer.pages:
    placeholders = []
    for i, annot in enumerate(page[PageAttributes.ANNOTS]):
        annot = annot.get_object()
        if annot[AnnotationDictionaryAttributes.Subtype] == "/Widget":
            try:
                n = annot[FieldDictionaryAttributes.T]
                if n.startswith('sig') or n.startswith('init'):
                    rect = annot['/Rect']
                    boxes[n] = {
                        "page": page.page_number,
                        "box": (rect[0], rect[1], rect[2], rect[3])
                    }
                    placeholders.append(i)
            except KeyError as e:
                pass

    for i in placeholders[::-1]:
        del page[PageAttributes.ANNOTS][i]

pypdf_writer.write(NEW_FILE)

# now we're going to check in pypdf if the fields are there or not
print("Checking for the deleted fields using pypdf")
reader = PdfReader(NEW_FILE)
found = 0
for page in reader.pages:
    for annot in page.annotations:
        annot = annot.get_object()
        if annot[AnnotationDictionaryAttributes.Subtype] == "/Widget":
            try:
                n = annot[FieldDictionaryAttributes.T]
                if n.startswith('sig') or n.startswith('init'):
                    found += 1
                    print(f"Found field: {n}")
            except KeyError as e:
                pass

print(f"PyPDF found {found} fields")

# now let's try to use pyHanko to add new signatures
print("Adding signatures to new PDF with pyHanko")
found = 0
with open(NEW_FILE, 'rb+') as sig_doc:
    writer = IncrementalPdfFileWriter(sig_doc, strict=False)
    for name in boxes.keys():
        _dict = boxes[name]
        try:
            append_signature_field(writer, SigFieldSpec(
                                        sig_field_name=name,
                                        on_page=_dict["page"],
                                        box=_dict["box"]
                                    ))

        except Exception as e:
            found += 1
            print("ERROR: ", e)

    writer.write_in_place()

print(f"pyHanko found {found} fields")

The output:

Checking for the deleted fields using pypdf
PyPDF found 0 fields
Adding signatures to new PDF with pyHanko
ERROR:  Field with name init0 exists but is not a signature field
ERROR:  Field with name init1 exists but is not a signature field
ERROR:  Field with name init2 exists but is not a signature field
ERROR:  Field with name init3 exists but is not a signature field
ERROR:  Field with name init4 exists but is not a signature field
ERROR:  Field with name init5 exists but is not a signature field
ERROR:  Field with name init6 exists but is not a signature field
ERROR:  Field with name init7 exists but is not a signature field
ERROR:  Field with name init8 exists but is not a signature field
ERROR:  Field with name init9 exists but is not a signature field
ERROR:  Field with name init10 exists but is not a signature field
ERROR:  Field with name init11 exists but is not a signature field
ERROR:  Field with name init12 exists but is not a signature field
ERROR:  Field with name init13 exists but is not a signature field
ERROR:  Field with name init14 exists but is not a signature field
ERROR:  Field with name init15 exists but is not a signature field
ERROR:  Field with name init16 exists but is not a signature field
ERROR:  Field with name init17 exists but is not a signature field
ERROR:  Field with name init18 exists but is not a signature field
ERROR:  Field with name init19 exists but is not a signature field
ERROR:  Field with name init20 exists but is not a signature field
ERROR:  Field with name init21 exists but is not a signature field
ERROR:  Field with name init22 exists but is not a signature field
ERROR:  Field with name init23 exists but is not a signature field
ERROR:  Field with name init24 exists but is not a signature field
ERROR:  Field with name init25 exists but is not a signature field
ERROR:  Field with name init26 exists but is not a signature field
ERROR:  Field with name init27 exists but is not a signature field
ERROR:  Field with name init28 exists but is not a signature field
ERROR:  Field with name sigPrimary1 exists but is not a signature field
ERROR:  Field with name init29 exists but is not a signature field
ERROR:  Field with name sigSecondary1 exists but is not a signature field
pyHanko found 32 fields

1 reply

ag-gaphp May 21, 2024
Author

It appears that pypdf and PyMuPDF remove the widget from the page's annotations list, but not from the /Fields list, which is where pyHanko seems to be finding these supposedly deleted fields. I'm trying to find a way to remove them from /Fields to see if pyHanko has a better time with it, but so far I'm not having any luck.

It's looking like possibly incomplete operations on pypdf and pymupdf's end, but I'm not 100% on that yet.
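
The mismatch described here can be checked directly by comparing the names listed under /AcroForm /Fields with the names of the widgets in each page's /Annots. The sketch below does that over dict-like objects. The function names are hypothetical, and it is assumed (not verified against the files in this thread) that the same functions would accept pypdf's `reader.trailer["/Root"]` and `reader.pages`, since those objects behave like dicts and lists and expose `get_object()`. It only looks at top-level fields that carry their own /T.

```python
def resolve(obj):
    """Follow an indirect reference if the object offers get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def acroform_field_names(catalog):
    """Names (/T) of the top-level entries in /AcroForm -> /Fields."""
    catalog = resolve(catalog)
    acroform = resolve(catalog.get("/AcroForm")) or {}
    names = set()
    for ref in resolve(acroform.get("/Fields")) or []:
        field = resolve(ref)
        if "/T" in field:
            names.add(str(field["/T"]))
    return names


def widget_names(pages):
    """Names (/T) of the widget annotations found in each page's /Annots."""
    names = set()
    for page in pages:
        page = resolve(page)
        for ref in resolve(page.get("/Annots")) or []:
            annot = resolve(ref)
            if resolve(annot.get("/Subtype")) == "/Widget" and "/T" in annot:
                names.add(str(annot["/T"]))
    return names


def orphaned_fields(catalog, pages):
    """Fields still listed in /AcroForm that no page widget carries by name."""
    return acroform_field_names(catalog) - widget_names(pages)


if __name__ == "__main__":
    # Toy stand-in for a PDF: init0's widget was removed from the page only.
    init0 = {"/T": "init0", "/Subtype": "/Widget"}
    other = {"/T": "customer_name", "/Subtype": "/Widget"}
    catalog = {"/AcroForm": {"/Fields": [init0, other]}}
    pages = [{"/Annots": [other]}]
    print(orphaned_fields(catalog, pages))
```

A non-empty result would be consistent with the widget having been dropped from the page while the field itself is still registered in the form.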

ag-gaphp · 2024-05-21T18:58:53Z

ag-gaphp
May 21, 2024
Author

The best I can gather is that this is an unfortunate side effect of viewers and importers not parsing the PDF in a uniform way.

pyHanko and the eSign platforms read the main field list referenced from the PDF catalog (the /Fields list mentioned above), while a lot of viewers use only each page's /Annots list and ignore anything else.

The PyMuPDF maintainer has an enhancement request open to also remove the object from the main list in the PDF catalog, and pypdf doesn't have a built-in function to remove a single widget at the moment, so I'm still sorting out how to remove things from the main list with it.

Anyway, I think my problem is solved, and it ultimately points back to the toolkits that remove the fields leaving behind some references to the objects.

2 replies

MatthiasValvekens May 21, 2024
Maintainer

Hi @ag-gaphp,

I didn't fully analyse the file you provided, but the conclusion you arrived at is consistent with my experience, and was also the first thing I suspected when I read your initial post. Interactive PDF viewers generally navigate forms via the Annots list on the page object, where other PDF processors might prefer going through AcroForm.

By the way, it's actually not hard to implement a "field remover" using pyHanko's low-level PDF manipulation API. Essentially, the algorithm is this (intentionally simplified to disregard fields with more than one widget):

1. Find the field you want to remove in the document's AcroForm.
2. The field dictionary will have an object ID. Take note of that.
3. Remove the object with that ID from the AcroForm's Fields array.
4. Navigate to the P entry in the field dictionary to find a reference to the page object on which the field's widget annotation appears. Remove the field's object ID from the Annots array of that page.
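
The steps above can be sketched over plain dict-like objects as follows. This is not pyHanko's API: `remove_field` and `resolve` are hypothetical names, and the sketch matches the field by Python object identity, whereas with real library objects one would compare the object IDs of the references as described in the steps. Like the algorithm itself, it ignores fields with more than one widget.

```python
def resolve(obj):
    """Follow an indirect reference if the object offers get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def remove_field(catalog, name):
    """Unlink the top-level field called `name` from /AcroForm and its page.

    Returns True if a field was unlinked, False if no such field exists.
    """
    catalog = resolve(catalog)
    acroform = resolve(catalog.get("/AcroForm")) or {}
    fields = resolve(acroform.get("/Fields")) or []

    # Steps 1-2: find the field in /AcroForm -> /Fields and remember it.
    for index, ref in enumerate(fields):
        field = resolve(ref)
        title = resolve(field.get("/T"))
        if title is not None and str(title) == name:
            break
    else:
        return False

    # Step 3: drop the entry from the /Fields array.
    del fields[index]

    # Step 4: follow /P to the page and drop the widget from its /Annots.
    # Identity comparison stands in for comparing object IDs (assumption).
    page = resolve(field.get("/P"))
    if page is not None:
        annots = resolve(page.get("/Annots")) or []
        annots[:] = [a for a in annots if resolve(a) is not field]
    return True


if __name__ == "__main__":
    page = {"/Annots": []}
    init0 = {"/T": "init0", "/P": page}
    other = {"/T": "customer_name", "/P": page}
    page["/Annots"].extend([init0, other])
    catalog = {"/AcroForm": {"/Fields": [init0, other]}}

    print(remove_field(catalog, "init0"))
    print([f["/T"] for f in catalog["/AcroForm"]["/Fields"]])
    print([a["/T"] for a in page["/Annots"]])
```

Starting from /AcroForm rather than from each page's /Annots means the field list and the page are updated in the same pass, so neither is left holding a stale reference.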

If you're doing this in an incremental update, don't forget to call update_container whenever you update an object. Also note that this will not actually delete the form field dictionary from the PDF, although it'll make it inaccessible so it's not considered part of any form/annotation/... structure.

As a workaround, provided you're OK with re-saving the entire PDF, you could use pyHanko's copy_into_new_writer() function to import only those objects that can be "reached" from the document catalog into a new file.

Hope this helps.

ag-gaphp May 22, 2024
Author

I'm learning so much more than I thought I ever would about PDF these past weeks. Right now I'm running through each page and iterating through the /Annots list on each, and then going back and removing it from /AcroForm. After all this running around I've been doing, it seems I should definitely start with /AcroForm instead. This process you propose here makes perfect sense and seems more sane than my current approach.

I appreciate your insight!
