ENH: _writer: Implement flattening #3312

Draft: wants to merge 1 commit into base: main

@PJBrs PJBrs commented Jun 15, 2025

This is a vibe-coded attempt at implementing flattening of pdf forms. It is currently able to flatten text fields and button fields.

It is not at all integrated with the rest of the pypdf code. It introduces a dependency on pdfminer.six to calculate font widths. This should also be possible with pypdf code alone, but I don't think that pypdf has correct font metrics. (I tried STANDARD_WIDTHS from pypdf/_text_extraction/_layout_mode/_font.py, but text started overflowing certain fields.)

So, basically a proof of principle.

Why this PR?

I think first and foremost to get some comments:

  • I think that I can use lots of stuff from pypdf/constants.py.
  • I also think that pypdf should have sufficient font metrics info and associated functions to be able to calculate font width, but I don't think that's the case now? I might be wrong though.
  • I also have some questions, such as whether flatten should be a separate function or an argument to update_page_form_field_values. That method, at least, has a lot of functionality that could be reused.
  • And I wonder what should go where. Some functions seem to be duplicating existing functionality.
  • Others, I think the wrap_text helper, are really new, but I wouldn't know what to call them.
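
For discussion, the wrap_text idea mentioned above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `measure` callable and the `fixed_width_measure` helper are hypothetical stand-ins for a real font-width calculation, and the greedy word-by-word strategy is just one possible approach.

```python
from typing import Callable, List


def wrap_text(text: str, max_width: float, measure: Callable[[str], float]) -> List[str]:
    """Greedily break ``text`` into lines whose measured width fits ``max_width``.

    ``measure`` returns the rendered width of a string in points. A single
    word that is wider than ``max_width`` is kept on its own line and will
    still overflow; this sketch does not split inside words.
    """
    lines: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and measure(candidate) > max_width:
            # Adding the word would overflow, so close the current line.
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


def fixed_width_measure(text: str) -> float:
    """Stand-in measure for illustration: every character is 5 points wide."""
    return 5.0 * len(text)
```

Keeping the width calculation behind a callable means the wrapping logic would not need to change if the font metrics source is swapped later.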

I don't have a lot of python coding experience. Just for reference, I tested this code in a different project called dungeonsheets, see here for the correct branch: https://github.com/PJBrs/dungeon-sheets/tree/flatten_vibe; and here for the blank forms: https://github.com/canismarko/dungeon-sheets/tree/master/dungeonsheets/forms.

This PR is in response to #232.

@PJBrs PJBrs marked this pull request as draft June 15, 2025 20:00
PJBrs commented Jun 15, 2025

Oh, and I didn't run the test suite, sorry about that! I see that:

  • I cannot use "print" anywhere
  • I didn't add pdfminer.six as a dependency.
  • I need to add return types at every function definition
  • Line length must not exceed 120 characters
  • Where possible I need to use double quotes, not single ones
  • All function arguments need to be properly typed

This is stuff that I can work on while waiting for feedback. I'll try to remember to run the tests locally before the next push to this PR.
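
As a small illustration of the checks listed above, a helper written to satisfy them might look like the sketch below. The function name `describe_fields` is hypothetical and not part of this PR, and using the standard `logging` module in place of `print` is an assumption about what the project accepts.

```python
import logging
from typing import List

logger = logging.getLogger(__name__)


def describe_fields(field_names: List[str], prefix: str = "field") -> List[str]:
    """Return one label per field name, logging instead of printing."""
    # Double quotes, typed arguments, and an explicit return type throughout.
    labels = [f"{prefix}: {name}" for name in field_names]
    logger.debug("Described %d form fields", len(labels))
    return labels
```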

@PJBrs PJBrs changed the title _writer: Implement flattening ENH: _writer: Implement flattening Jun 15, 2025
@stefan6419846
Collaborator

Thanks for the PR. Some notes:

  • Parts of the formatting changes might be automated by running ruff check --fix locally.
  • If possible, it is appreciated to re-use existing functionality of pypdf.
  • If possible, please consider moving generic functionality out of the PdfWriter class.
  • pypdf should not include another PDF library (pdfminer.six) for doing calculations. Instead, we should try to include the necessary code in pypdf directly. To avoid too large PRs, this might need to be part of a separate PR.

@PJBrs PJBrs mentioned this pull request Jun 16, 2025
PJBrs commented Jun 16, 2025

@stefan6419846, thanks for your comments! I think what I need right now is a little bit more guidance. You wrote:

> Parts of the formatting changes might be automated by running ruff check --fix locally.

I only have python-3.9 and I don't know if ruff runs on that. But I can fix some of this manually as well.

> If possible, it is appreciated to re-use existing functionality of pypdf.

If you see any functionality that is duplicated, please tell me! From what I can tell:

  • generate_appearance_stream does a lot of what I try to do with add_text_value
  • I think it should be possible to integrate my flatten function with the existing update_page_form_field_values method
  • The font width thing ought to be there already but it needs some improvement...

> If possible, please consider moving generic functionality out of the PdfWriter class.

If you can, please indicate what you consider generic functionality, and, more importantly, where it ought to go. I think my font_name_map function qualifies?

> pypdf should not include another PDF library (pdfminer.six) for doing calculations. Instead, we should try to include the necessary code in pypdf directly. To avoid too large PRs, this might need to be part of a separate PR.

I could try to correct STANDARD_WIDTHS from pypdf/_text_extraction/_layout_mode/_font.py. I don't know how to deal with encoding (yet).

Thanks for your comments!

@stefan6419846
Collaborator

Please note that while I try to help you with getting these changes integrated, I do not know every aspect of the spec or implementation. My main goal is to ensure that any code fulfills our requirements, especially regarding maintenance.

Python 3.9 should not be an issue, as the latest published ruff release (which is also the version we use) is still compatible with it.

Regarding re-use: You mentioned in your initial comment that integrating this into update_page_form_field_values would allow for less new code. This is what I have been referring to.

Some candidates for moving them out of the writer module:

  • The font name map could go into the codecs module. It currently lacks a source reference as well.
  • extract_formatting_from_annotation and add_*_field_value could go into the annotations module.
  • base14_font_object could go into the codecs or a font module, similar to calculate_text_width and wrap_text.
  • merge_content_streams could in theory be part of the ContentStream-related classes.

> I could try to correct STANDARD_WIDTHS from pypdf/_text_extraction/_layout_mode/_font.py . I don't know how to deal with encoding (yet)

What do you mean by "correcting them"? Ideally, we would not use the functionality of the text extraction code here, but a dedicated implementation the text extraction code could use later on. This might be based upon the implementation from pdfminer.six, but we would have to analyze which parts of it we would actually have to port.
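
To make the idea of a dedicated implementation more concrete, here is a rough sketch of what a small, reusable font-metrics container could look like. Everything in it is an assumption for illustration: the `FontMetrics` class, its field names, and the sample widths are hypothetical and are not taken from pypdf or pdfminer.six. It only relies on the PDF convention that glyph widths are given in thousandths of a unit of text space, so a string's width in points is the sum of its glyph widths times the font size, divided by 1000.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FontMetrics:
    """Hypothetical container for per-character advance widths (1/1000 units)."""

    name: str
    default_width: int = 500  # assumed fallback for characters without an entry
    widths: Dict[str, int] = field(default_factory=dict)

    def text_width(self, text: str, font_size: float) -> float:
        """Return the width of ``text`` in points at ``font_size``."""
        units = sum(self.widths.get(char, self.default_width) for char in text)
        return units * font_size / 1000.0

    def fits(self, text: str, font_size: float, box_width: float) -> bool:
        """Tell whether ``text`` fits into a field that is ``box_width`` points wide."""
        return self.text_width(text, font_size) <= box_width


# Illustrative values only, not authoritative metrics for any real font.
sample = FontMetrics(name="Sample", widths={"i": 200, "m": 800})
```

If the width table underestimates wide glyphs, `fits` would report success for text that overflows when rendered, which matches the symptom described earlier in this thread.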
