-
I'm trying to hunt down an issue, and I have no idea whether it lies with pyHanko, PyMuPDF, or the way I'm creating the files in LibreOffice in the first place. I did my best to put together a test scenario that shows it. I've posted a bug on the PyMuPDF page, but they are unwilling to help and don't see it as an issue in their toolkit. I still need to put together a test that uses pypdf to remove the fields and see whether the results are different, but I'm posting this in the meantime. Hopefully someone can tell me what I'm doing incorrectly here!

I guess my main purpose in posting here is: does pyHanko read from some sort of historical data in the PDF? I saw some mention in the docs about being able to get some type of change history from the PDF, and I'm wondering if that's what's going on (i.e., the fields are gone, but references exist in the change history, so pyHanko complains).

First, here is my PDF file as exported from LibreOffice. Second, here is the altered PDF after having the placeholder widgets removed.

Here is my test script. It attempts to:

1. remove the placeholder fields whose names start with "sig" or "init" using PyMuPDF, recording each one's page and position
2. re-open the saved file and check with PyMuPDF whether any of those fields are still present
3. add new signature fields with the same names and positions using pyHanko
4. check with pypdf whether it can detect the deleted fields
requirements.txt
test.py

```python
import os

import fitz  # PyMuPDF
from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter
from pyhanko.sign.fields import SigFieldSpec, append_signature_field
from pypdf import PdfReader
from pypdf.constants import AnnotationDictionaryAttributes, FieldDictionaryAttributes

OLD_FILE = "exported_from_libreoffice.pdf"
NEW_FILE = "deleted_signatures.pdf"

if os.path.exists(NEW_FILE):
    os.remove(NEW_FILE)

boxes = {}
doc = fitz.open(OLD_FILE)

# iterate the pages
print("Removing fields with PyMuPDF")
for page in doc:
    # store the page's height for placement
    _page_rect = page.bound()
    _page_height = _page_rect.y1

    # iterate the fields on this page
    field = page.first_widget
    while field:
        n = field.field_name
        # if it's a signature, remove it
        if n.startswith("sig") or n.startswith("init"):
            # PyMuPDF y coords go top-to-bottom, but pyHanko goes bottom-to-top
            # Subtract the y coords from the current page height for pyHanko
            boxes[n] = {
                "page": page.number,
                "box": (
                    field.rect.x0,
                    _page_height - field.rect.y0,
                    field.rect.x1,
                    _page_height - field.rect.y1,
                ),
            }
            print("Removing field: ", n)
            # delete_widget returns the widget that follows the deleted one
            field = page.delete_widget(field)
        else:
            field = field.next

# save the document updates
doc.ez_save(NEW_FILE)
doc.close()

# now re-opening the document to check if the fields I removed are still there or not
check_doc = fitz.open(NEW_FILE)

# iterate the pages again
print("Checking PDF for removed fields with PyMuPDF")
found = 0
for page in check_doc:
    # iterate the fields on this page
    field = page.first_widget
    while field:
        n = field.field_name
        # if it's a signature, print to console
        if n.startswith("sig") or n.startswith("init"):
            found += 1
            print(f"...'{n}' is still present")
        field = field.next
print(f"PyMuPDF found {found} fields")
check_doc.close()

# now let's try to use pyHanko to add new signatures
print("Adding signatures to new PDF with pyHanko")
found = 0
with open(NEW_FILE, "rb+") as sig_doc:
    writer = IncrementalPdfFileWriter(sig_doc, strict=False)
    for name in boxes.keys():
        _dict = boxes[name]
        try:
            append_signature_field(writer, SigFieldSpec(
                sig_field_name=name,
                on_page=_dict["page"],
                box=_dict["box"],
            ))
        except Exception as e:
            # pyHanko raises here when a field with this name already exists
            found += 1
            print("ERROR: ", e)
    writer.write_in_place()
print(f"pyHanko found {found} fields")

# now we're going to check if pypdf can detect the fields
print("Checking for the deleted fields using pypdf")
reader = PdfReader(NEW_FILE)
found = 0
for page in reader.pages:
    # page.annotations is None when the page has no /Annots entry
    for annot in page.annotations or []:
        annot = annot.get_object()
        if annot[AnnotationDictionaryAttributes.Subtype] == "/Widget":
            try:
                n = annot[FieldDictionaryAttributes.T]
                if n.startswith("sig") or n.startswith("init"):
                    found += 1
                    print(f"Found field: {n}")
            except KeyError:
                # widget without its own /T entry; skip it
                pass
print(f"PyPDF found {found} fields")
```

If you run this against the file I uploaded, you can see that both PyMuPDF and pypdf are unable to detect the regular text fields that I removed using PyMuPDF. Somehow, though, pyHanko does see them and is not able to add new signature fields because the names are not unique. This is the output from the script:
Does anyone have any idea what could be causing this, or could someone point me in a direction to keep hunting? I should note that it is not only pyHanko that detects these supposedly deleted fields: 6 different eSign platforms also detect them when I import the fields, including each field's rect/position on the page.

My current workaround is to rename the fields using PyMuPDF, which allows pyHanko to add the signatures. But then an unnecessary text field always appears underneath each signature field when I import to eSign, and I have to delete it manually in the eSign platform by moving the signature out of the way, deleting the field, then putting the signature back in place. The end result is the form I want, but obviously this is a lot of extra manual steps when you have hundreds of forms with an average of 30+ signatures in each one.

Lastly, I know there are yellow and red boxes under each of these fields. I want those to remain in the PDF and I do not expect them to be deleted. From what I understand, those should be separate from the field widgets anyway, and unaffected by anything I do to the fields.

EDIT: clarified some details, uploaded another example PDF, added my script's output
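For reference, the coordinate flip described in the first script's comments can be written as a small helper. This is only a minimal sketch: `to_pdf_box` is a hypothetical name (it is not part of PyMuPDF or pyHanko), and it assumes an unrotated page whose origin sits at (0, 0). Unlike the tuple built in the script, it returns the lower-left corner first.

```python
def to_pdf_box(rect, page_height):
    """Convert a top-down (x0, y0, x1, y1) rect to a bottom-up PDF box.

    `rect` uses a top-left origin with y growing downwards; the result
    uses a bottom-left origin and is ordered (llx, lly, urx, ury).
    """
    x0, y0, x1, y1 = rect
    # The top edge (y0) becomes the upper y, the bottom edge (y1) the lower y.
    return (x0, page_height - y1, x1, page_height - y0)


# Example: a 100 x 30 field placed 20 units below the top of a 792-unit-high page.
print(to_pdf_box((10, 20, 110, 50), 792))
```

With PyMuPDF the input would presumably come from `field.rect` and the page height from the page rectangle, as in the script.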
-
I've put together another test where I remove the placeholder widgets using pypdf instead of PyMuPDF. I get the exact same results: pyHanko and the eSign platforms still detect the fields that should have been removed. New test script:

```python
import os

from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter
from pyhanko.sign.fields import SigFieldSpec, append_signature_field
from pypdf import PdfReader, PdfWriter
from pypdf.constants import AnnotationDictionaryAttributes, FieldDictionaryAttributes, PageAttributes

OLD_FILE = "exported_from_libreoffice.pdf"
NEW_FILE = "test.pdf"

if os.path.exists(NEW_FILE):
    os.remove(NEW_FILE)

boxes = {}

# try using pypdf to remove annotations and see if pyHanko still complains
pypdf_writer = PdfWriter(clone_from=OLD_FILE)
for page in pypdf_writer.pages:
    placeholders = []
    for i, annot in enumerate(page[PageAttributes.ANNOTS]):
        annot = annot.get_object()
        if annot[AnnotationDictionaryAttributes.Subtype] == "/Widget":
            try:
                n = annot[FieldDictionaryAttributes.T]
                if n.startswith("sig") or n.startswith("init"):
                    rect = annot["/Rect"]
                    boxes[n] = {
                        "page": page.page_number,
                        "box": (rect[0], rect[1], rect[2], rect[3]),
                    }
                    placeholders.append(i)
            except KeyError:
                # widget without its own /T entry; skip it
                pass
    # delete from the end so the remaining indexes stay valid
    for i in placeholders[::-1]:
        del page[PageAttributes.ANNOTS][i]
pypdf_writer.write(NEW_FILE)

# now we're going to check in pypdf if the fields are there or not
print("Checking for the deleted fields using pypdf")
reader = PdfReader(NEW_FILE)
found = 0
for page in reader.pages:
    # page.annotations is None when the page has no /Annots entry
    for annot in page.annotations or []:
        annot = annot.get_object()
        if annot[AnnotationDictionaryAttributes.Subtype] == "/Widget":
            try:
                n = annot[FieldDictionaryAttributes.T]
                if n.startswith("sig") or n.startswith("init"):
                    found += 1
                    print(f"Found field: {n}")
            except KeyError:
                pass
print(f"PyPDF found {found} fields")

# now let's try to use pyHanko to add new signatures
print("Adding signatures to new PDF with pyHanko")
found = 0
with open(NEW_FILE, "rb+") as sig_doc:
    writer = IncrementalPdfFileWriter(sig_doc, strict=False)
    for name in boxes.keys():
        _dict = boxes[name]
        try:
            append_signature_field(writer, SigFieldSpec(
                sig_field_name=name,
                on_page=_dict["page"],
                box=_dict["box"],
            ))
        except Exception as e:
            # pyHanko raises here when a field with this name already exists
            found += 1
            print("ERROR: ", e)
    writer.write_in_place()
print(f"pyHanko found {found} fields")
```

The output:
-
The best I can gather is that this is an unfortunate side effect of viewers and importers not being uniform in how they parse the PDF. pyHanko and the eSign platforms use the main widget list in the PDF catalog (the `/Fields` array of the `/AcroForm` dictionary), while a lot of viewers use only each page's `/Annots` list and ignore anything else.

The PyMuPDF maintainer has an enhancement request open to also remove the object from the main list in the PDF catalog, and pypdf doesn't have a built-in function to remove a single widget at the moment, so I'm still sorting out how to remove things from the main list with it.

Anyway, I think my problem is solved, and it ultimately points back to the toolkits that remove the fields leaving behind some references to the objects.
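To make the mismatch concrete, here is a minimal sketch of the two lists disagreeing. It is a simplified model, not a real PDF: plain Python dicts stand in for PDF objects, the function names are hypothetical, and it assumes each field and its widget are one merged dictionary carrying its own `/T` name (which is how the test scripts above read the names). In a real file the catalog list would be the `/Fields` array under `/AcroForm`, with entries stored as indirect references.

```python
def widget_names(pages):
    """Collect the /T names of widget annotations listed in each page's /Annots."""
    names = set()
    for page in pages:
        for annot in page.get("/Annots", []):
            if annot.get("/Subtype") == "/Widget" and "/T" in annot:
                names.add(annot["/T"])
    return names


def orphaned_fields(catalog, pages):
    """Return names listed in /AcroForm /Fields that no page's /Annots refers to."""
    fields = catalog.get("/AcroForm", {}).get("/Fields", [])
    on_pages = widget_names(pages)
    return [f["/T"] for f in fields if "/T" in f and f["/T"] not in on_pages]


def prune_orphans(catalog, pages):
    """Drop orphaned entries from /AcroForm /Fields in place; return their names."""
    removed = orphaned_fields(catalog, pages)
    acro_form = catalog.get("/AcroForm", {})
    acro_form["/Fields"] = [
        f for f in acro_form.get("/Fields", []) if f.get("/T") not in removed
    ]
    return removed


# Toy document: "sig1" was deleted from the page but is still in the catalog list.
sig1 = {"/Subtype": "/Widget", "/FT": "/Tx", "/T": "sig1"}
name = {"/Subtype": "/Widget", "/FT": "/Tx", "/T": "name"}
catalog = {"/AcroForm": {"/Fields": [sig1, name]}}
pages = [{"/Annots": [name]}]

print(orphaned_fields(catalog, pages))  # names a catalog-based reader still sees
```

A tool that walks only `/Annots` reports nothing for `sig1` here, while one that reads the catalog list still finds the name, which matches the behaviour described above. The sketch does not handle fields whose widgets are separate `/Kids` entries.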