ROB: Flate decoding for streams with faulty tail bytes #3332

henningkoertelgmg · 2025-06-25T10:12:46Z

Some FLATE-encoded streams written by early Adobe Distiller / Pitstop versions have extra CR bytes appended, and these faulty tail bytes are counted in the Length value of the stream dictionary. Decoding then fails. Solved by removing tail bytes one at a time until decoding succeeds.
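For illustration, the idea can be sketched with the standard `zlib` module. This is a minimal sketch, not the code merged into `pypdf/filters.py`: the function name `decompress_with_tail_trimming` and the limit `max_trim` are hypothetical, and the sample fault in the usage note (a corrupted checksum byte) is only an assumed stand-in for the real faulty files.

```python
import zlib


def decompress_with_tail_trimming(data: bytes, max_trim: int = 8) -> bytes:
    """Hypothetical helper: decode normally, then retry with tail bytes removed."""
    try:
        # Normal path: well-formed streams never reach the fallback.
        return zlib.decompress(data)
    except zlib.error:
        # Fallback: cut off one more tail byte per attempt, up to max_trim
        # bytes, always leaving at least one byte of input.
        for cut in range(1, min(max_trim + 1, len(data))):
            try:
                return zlib.decompressobj().decompress(data[:-cut])
            except zlib.error:
                continue
        # No attempt succeeded: re-raise the original decoding error.
        raise
```

A possible way to try it: compress some bytes with `zlib.compress`, flip the last byte so that the normal path fails, and pass the result to the helper. Capping the number of trimmed bytes keeps the fallback cheap even for large streams.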

henningkoertelgmg · 2025-06-25T10:14:44Z

faulty_stream_tail_example 1.pdf
decoded.dat.txt

I added one sample file and will create a test from it later. I have another sample PDF, but it is too large to attach here.

codecov · 2025-06-25T10:21:17Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.73%. Comparing base (7c3db03) to head (8b0c38a).
Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3332   +/-   ##
=======================================
  Coverage   96.73%   96.73%           
=======================================
  Files          53       53           
  Lines        9054     9060    +6     
  Branches     1675     1676    +1     
=======================================
+ Hits         8758     8764    +6     
  Misses        177      177           
  Partials      119      119

The first paste of the new code was wrong, so I corrected it and moved it from the "try" block into the "except" block: it is a fallback in case of an error, not a new general approach to decoding.

henningkoertelgmg · 2025-06-25T11:29:06Z

Sorry for the second commit; I made a mistake while copying the code. It is correct now.

pypdf/filters.py

simplified code + created test

…ertelgmg/pypdf into patch-FLATEperformance

henningkoertelgmg · 2025-06-27T08:47:39Z

I have simplified the code as suggested above.
A test has been added; it now covers two aspects: readability of the faulty stream and good performance.

tests/test_filters.py

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

tests/test_filters.py

To make the asset download more robust, the test PDF and the expected data have been moved into the test function body. The drawback is that a higher timeout is now needed, which makes this test less precise as a performance check.
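As a rough illustration of checking both aspects at once, the following sketch times only the decoding step, so that asset download time would not count against the budget. It is not the test from `tests/test_filters.py`: the helper `check_decoder` and the five-second default budget are assumptions made for this example.

```python
import time
import zlib
from typing import Callable


def check_decoder(
    decode: Callable[[bytes], bytes],
    faulty: bytes,
    expected: bytes,
    time_budget_seconds: float = 5.0,
) -> float:
    """Hypothetical check: the decoder must be both correct and fast enough."""
    start = time.perf_counter()
    decoded = decode(faulty)
    elapsed = time.perf_counter() - start  # measures decoding only
    assert decoded == expected, "decoded data differs from the expected data"
    assert elapsed < time_budget_seconds, f"decoding took {elapsed:.2f} s"
    return elapsed
```

Any callable that maps the faulty bytes to decoded bytes can be passed as `decode`. A generous budget mainly catches pathological slowdowns rather than small regressions.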

tests/test_filters.py

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

stefan6419846

Thanks for the PR and your patience.

henningkoertelgmg · 2025-06-27T11:23:06Z

Thanks for your support, Stefan. It was not easy for me because of my lack of experience with your project and the strict style rules.

@henningkoertelgmg

## What's new

### Performance Improvements (PI)
- Performance optimization for LZW decoding (#3329) by @henningkoertelgmg

### Robustness (ROB)
- Flate decoding for streams with faulty tail bytes (#3332) by @henningkoertelgmg
- dc_creator could be a Bag as well (#3333) by @stefan6419846
- Handle tree being NullObject when retrieving named destinations (#3331) by @stefan6419846

### Maintenance (MAINT)
- Move inline-image mappings to constants (#3328) by @stefan6419846

[Full Changelog](5.6.1...5.7.0)

stefan6419846 changed the title ~~PI: FLATE decoding for streams with faulty tail bytes~~ ROB: Flate decoding for streams with faulty tail bytes Jun 25, 2025

Moved code into correct place in "except" block

2261b38

stefan6419846 reviewed Jun 26, 2025

pypdf/filters.py (outdated, resolved)

Merge branch 'py-pdf:main' into patch-FLATEperformance

768ee9c

stefan6419846 added the needs-test label Jun 26, 2025

henningkoertelgmg added 5 commits June 27, 2025 09:05

Merge branch 'main' into patch-FLATEperformance

6c5afb4

ROB, TST: flate decoding faulty stream tail, code cleanup + test

c38b0e4

simplified code + created test

Merge branch 'patch-FLATEperformance' of https://github.com/henningko…

206e56b

…ertelgmg/pypdf into patch-FLATEperformance

retrigger checks

5e51136

TST: remove spaces from yaml file

3442f81

henningkoertelgmg requested a review from stefan6419846 June 27, 2025 08:47