ENH: _writer: Implement flattening #3312

PJBrs · 2025-06-15T20:00:45Z

This is a vibe-coded attempt at implementing flattening of PDF forms. It can currently flatten text fields and button fields.
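
For readers unfamiliar with the term, flattening usually means painting each widget's appearance stream into the page content and then dropping the interactive field. The sketch below is not the code from this PR; it only shows the kind of content-stream operators involved. The function name `flatten_operators` is hypothetical, and it assumes the appearance stream's /BBox starts at the origin and already matches the widget rectangle, so a plain translation is enough.

```python
from typing import Tuple


def flatten_operators(xobject_name: str, rect: Tuple[float, float, float, float]) -> bytes:
    """Build content-stream operators that paint a form XObject at `rect`.

    Assumption: the XObject's /BBox starts at (0, 0) and has the same size as
    the widget rectangle, so no scaling is needed, only a translation.
    """
    lower_left_x, lower_left_y = rect[0], rect[1]
    # q/Q save and restore the graphics state, cm translates, Do paints the XObject.
    operators = f"q 1 0 0 1 {lower_left_x:.2f} {lower_left_y:.2f} cm /{xobject_name} Do Q\n"
    return operators.encode("ascii")
```

A real implementation would also have to register the XObject in the page resources and handle a non-trivial /BBox or /Matrix.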

It is not at all integrated with the rest of the pypdf code. It introduces a dependency on pdfminer.six to calculate font widths. This should also be possible with pypdf code alone, but I don't think that pypdf has correct font metrics. (I tried with STANDARD_WIDTHS from pypdf/_text_extraction/_layout_mode/_font.py, but text started overflowing certain fields.)
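
To make the overflow problem concrete: PDF glyph widths are given in thousandths of a text-space unit, so the width of a string is the sum of its glyph widths multiplied by the font size and divided by 1000. The following is a minimal sketch, not pypdf's or pdfminer.six's API. `SAMPLE_WIDTHS`, `DEFAULT_WIDTH`, `text_width`, and `fits` are hypothetical names, and the few widths listed are illustrative values for a Helvetica-like font.

```python
from typing import Dict

# Illustrative glyph widths in 1/1000 text-space units (assumed values for a
# Helvetica-like font; a real implementation needs the complete metrics table).
SAMPLE_WIDTHS: Dict[str, int] = {" ": 278, "i": 222, "a": 556, "W": 944}
DEFAULT_WIDTH = 500  # hypothetical fallback for glyphs missing from the table


def text_width(text: str, font_size: float, widths: Dict[str, int] = SAMPLE_WIDTHS) -> float:
    """Return the width of `text` in points for the given font size."""
    total_units = sum(widths.get(char, DEFAULT_WIDTH) for char in text)
    return total_units * font_size / 1000.0


def fits(text: str, font_size: float, field_width: float) -> bool:
    """Check whether `text` fits on one line of a field `field_width` points wide."""
    return text_width(text, font_size) <= field_width
```

If the table underestimates wide glyphs, `fits` reports that text fits when it does not, which is the overflow described above.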

So, basically a proof of principle.

Why this PR?

I think first and foremost to get some comments:

I think that I can use lots of stuff from pypdf/constants.py.
I also think that pypdf should have sufficient font metrics info and associated functions to be able to calculate font width, but I don't think that's the case now? I might be wrong though.
I have some questions as well, such as: should flatten be a separate function, or should it be an argument to update_page_form_field_values? That code, at least, has lots of functionality that could be reused.
And I wonder what should go where. Some functions seem to be duplicating existing functionality.
Others, I think the wrap_text helper, are really new, but I wouldn't know what to call them.
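
For context, a helper like wrap_text typically does greedy word wrapping against a measured width. The sketch below is an assumption about the general technique, not the implementation in this PR. The `measure` callback is a hypothetical parameter that returns the rendered width of a string.

```python
from typing import Callable, List


def wrap_text(text: str, max_width: float, measure: Callable[[str], float]) -> List[str]:
    """Greedily break `text` into lines no wider than `max_width`.

    `measure` returns the rendered width of a string. A single word wider than
    `max_width` is kept on its own line rather than being split.
    """
    lines: List[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else f"{current} {word}"
        if current and measure(candidate) > max_width:
            # The next word does not fit: close the current line, start a new one.
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines
```

Passing `len` as `measure` gives a crude monospaced approximation; a font-aware width function would be needed for real form fields.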

I don't have a lot of Python coding experience. Just for reference, I tested this code in a different project called dungeonsheets; see here for the correct branch: https://github.com/PJBrs/dungeon-sheets/tree/flatten_vibe, and here for the blank forms: https://github.com/canismarko/dungeon-sheets/tree/master/dungeonsheets/forms.

This PR is in response to #232 .

PJBrs · 2025-06-15T20:10:25Z

Oh, and I didn't run the test suite, sorry about that! I see that:

I cannot use "print" anywhere
I didn't add pdfminer.six as a dependency.
I need to add return types at every function definition
Line length must not exceed 120 characters
Where possible I need to use double quotes, not single ones
All function arguments need to be properly typed
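
As a small illustration of those rules, a helper written in that style might look roughly like this. The function name `collect_field_names` is hypothetical, and the standard `logging` module is just one possible replacement for `print`.

```python
import logging
from typing import List

logger = logging.getLogger(__name__)


def collect_field_names(field_names: List[str], prefix: str = "") -> List[str]:
    """Return the names that start with `prefix`, logging instead of printing."""
    matches = [name for name in field_names if name.startswith(prefix)]
    # Typed arguments, an explicit return type, double quotes, and no print().
    logger.debug("Matched %d of %d field names", len(matches), len(field_names))
    return matches
```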

This is stuff that I can work on while waiting for feedback. I'll try to remember to run the tests locally before the next push to this PR.

stefan6419846 · 2025-06-16T06:45:40Z

Thanks for the PR. Some notes:

Parts of the formatting changes might be automated by running ruff check --fix locally.
If possible, it is appreciated to re-use existing functionality of pypdf.
If possible, please consider moving generic functionality out of the PdfWriter class.
pypdf should not include another PDF library (pdfminer.six) for doing calculations. Instead, we should try to include the necessary code in pypdf directly. To avoid too large PRs, this might need to be part of a separate PR.

PJBrs · 2025-06-16T10:38:55Z

@stefan6419846, thanks for your comments! I think what I need right now is a little bit more guidance. You wrote:

Parts of the formatting changes might be automated by running ruff check --fix locally.

I only have python-3.9 and I don't know if ruff runs on that. But I can fix some of this manually as well.

If possible, it is appreciated to re-use existing functionality of pypdf.

If you see any functionality that is duplicated, please tell me! From what I can tell:

generate_appearance_stream does a lot of what I try to do with add_text_value
I think it should be possible to integrate my flatten function with the existing update_page_form_field_values method
The font width thing ought to be there already but it needs some improvement...

If possible, please consider moving generic functionality out of the PdfWriter class.

If you can, please indicate what you consider generic functionality, and, more importantly, where it ought to go. I think my font_name_map function qualifies?

pypdf should not include another PDF library (pdfminer.six) for doing calculations. Instead, we should try to include the necessary code in pypdf directly. To avoid too large PRs, this might need to be part of a separate PR.

I could try to correct STANDARD_WIDTHS from pypdf/_text_extraction/_layout_mode/_font.py. I don't know how to deal with encoding (yet).

Thanks for your comments!

stefan6419846 · 2025-06-17T08:20:11Z

Please note that while I try to help you with getting these changes integrated, I do not know every aspect of the spec or implementation. My main goal is to ensure that any code fulfills our requirements, especially regarding maintenance.

Python 3.9 should not be an issue, as the latest published release of ruff (which is also the version we use) is still compatible with it.

Regarding re-use: You mentioned in your initial comment that integrating this into update_page_form_field_values would allow for less new code. This is what I have been referring to.

Some candidates for moving them out of the writer module:

The font name map could go into the codecs module. It currently misses a source as well.
extract_formatting_from_annotation and add_*_field_value could go into the annotations module.
base14_font_object could go into the codecs or a font module, similar to calculate_text_width and wrap_text.
merge_content_streams could in theory be part of the ContentStream-related classes.

I could try to correct STANDARD_WIDTHS from pypdf/_text_extraction/_layout_mode/_font.py. I don't know how to deal with encoding (yet).

What do you mean by "correcting them"? Ideally, we would not use the functionality of the text extraction code here, but a dedicated implementation that the text extraction code could use later on. This might be based upon the implementation from pdfminer.six, but we would have to analyze which parts of it we would actually have to port.

PJBrs marked this pull request as draft June 15, 2025 20:00

PJBrs changed the title ~~_writer: Implement flattening~~ ENH: _writer: Implement flattening Jun 15, 2025

PJBrs mentioned this pull request Jun 16, 2025

ENH: Flatten PDF forms #232

Open
