Malformed (?) PDF when an image is included in the HTML #1431

robsco-git · 2021-08-27T13:47:52Z

When creating a PDF with the following code:

from weasyprint import HTML

sun_img_url = "https://weasyprint.org/css/img/sun.png"

html = f"""
<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>Example</title>
        <meta name="description" content="Example">
    </head>
    <body>
        <img src="{sun_img_url}" alt="" />
    </body>
</html>"""

if __name__ == "__main__":
    HTML(string=html).write_pdf('./test.pdf')

Here is the resultant test.pdf.

Attempting to open test.pdf in Adobe Acrobat Reader DC (Version 2021.005.20060) results in this error message:

There was an error processing a page. There was a problem reading this document (135).

Attempting to validate test.pdf with QPDF results in:

$ qpdf --check test.pdf
checking test.pdf
PDF Version: 1.7
File is not encrypted
File is not linearized
WARNING: test.pdf (object 13 0, offset 4232): unknown token while reading object; treating as string
WARNING: test.pdf (object 13 0, offset 4243): unknown token while reading object; treating as string
WARNING: test.pdf (object 13 0, offset 4249): unknown token while reading object; treating as string
WARNING: test.pdf (object 13 0, offset 4254): unknown token while reading object; treating as string
WARNING: test.pdf (object 13 0, offset 4267): unknown token while reading object; treating as string
WARNING: test.pdf (object 13 0, offset 4274): unknown token while reading object; treating as string
WARNING: test.pdf (object 13 0, offset 4274): too many errors; giving up on reading object
WARNING: test.pdf (object 13 0, offset 4282): expected endobj
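
For anyone triaging a similar file, the warnings above can be grouped by object number to see whether the damage is confined to a single object (here, object 13 0). The following is a minimal sketch, not part of the original report: the helper names `parse_qpdf_warnings`, `warnings_per_object` and `check_pdf` are hypothetical, the regular expression only assumes the warning layout shown in the output above, and `check_pdf` assumes `qpdf` is available on `PATH`.

```python
import re
import subprocess
from collections import Counter

# Matches lines shaped like the ones in the report, for example:
# WARNING: test.pdf (object 13 0, offset 4232): unknown token while reading object; treating as string
WARNING_RE = re.compile(
    r"^WARNING: (?P<file>.+?) \(object (?P<obj>\d+) (?P<gen>\d+), "
    r"offset (?P<offset>\d+)\): (?P<msg>.*)$"
)


def parse_qpdf_warnings(output):
    """Return a list of (object_number, generation, offset, message) tuples."""
    warnings = []
    for line in output.splitlines():
        match = WARNING_RE.match(line.strip())
        if match:
            warnings.append(
                (int(match["obj"]), int(match["gen"]), int(match["offset"]), match["msg"])
            )
    return warnings


def warnings_per_object(output):
    """Count warnings per (object number, generation) pair."""
    return Counter((obj, gen) for obj, gen, _, _ in parse_qpdf_warnings(output))


def check_pdf(path):
    """Run `qpdf --check` on path and return the parsed warnings."""
    result = subprocess.run(["qpdf", "--check", path], capture_output=True, text=True)
    # Both streams are scanned, since warnings may not be written to stdout.
    return parse_qpdf_warnings(result.stdout + "\n" + result.stderr)
```

Calling `warnings_per_object` on the captured output gives one counter entry per damaged object, which makes it easier to compare a broken file against a working one.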

Using an online validator (https://www.pdf-online.com/osa/validate.aspx) results in:

Compliance | pdf1.7
Result | Document does not conform to PDF/A.
Details | Validating file "test.pdf" for conformance level pdf1.7
The "Length" key of the stream object is wrong.
The "endobj" keyword is missing.
The value of the key SMask must not be of type dictionary.
The image's sample stream's computed length 43200 is different to the actual length 14400.
The document does not conform to the requested standard.
The file format (header, trailer, objects, xref, streams) is corrupted.
The document doesn't conform to the PDF reference (missing required entries, wrong value types, etc.).
The document does not conform to the PDF 1.7 standard.
Done.
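
The two lengths in the validator output differ by a factor of exactly 3 (43200 = 3 × 14400). One possible reading, offered only as an interpretation of the numbers and not as a confirmed diagnosis, is that the image dictionary declares three colour components per pixel while the stream holds one. The sketch below shows the arithmetic; the 120×120 size is an assumption chosen because it fits the reported values, and `expected_sample_length` is a hypothetical helper name.

```python
def expected_sample_length(width, height, components, bits_per_component=8):
    """Bytes of uncompressed image samples, each row padded to a whole byte."""
    bits_per_row = width * components * bits_per_component
    bytes_per_row = (bits_per_row + 7) // 8
    return bytes_per_row * height


# ASSUMPTION: 120x120 is only one image size consistent with the reported numbers.
width, height = 120, 120
rgb_length = expected_sample_length(width, height, components=3)   # three components per pixel
gray_length = expected_sample_length(width, height, components=1)  # one component per pixel
print(rgb_length, gray_length)  # 43200 14400
```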

Google Chrome (Version 92.0.4515.159 (Official Build) (64-bit)) opens the PDF and displays the image as expected.
When opening and saving the PDF with pikepdf (https://github.com/pikepdf/pikepdf), the image is not present in the newly saved version (this is the point I worked backwards from). My understanding is that pikepdf uses QPDF under the hood, so it makes sense (to me) that both qpdf --check and pikepdf have trouble with this file.

Software/library versions:

$ python --version
Python 3.9.5

$ pip freeze
Brotli==1.0.9
cffi==1.14.6
cssselect2==0.4.1
fonttools==4.26.2
html5lib==1.1
lxml==4.6.3
Pillow==8.3.1
pycparser==2.20
pydyf==0.1.1
pyphen==0.11.0
six==1.16.0
tinycss2==1.1.0
weasyprint==53.1
webencodings==0.5.1
zopfli==0.1.8

$ qpdf --version
qpdf version 9.1.1
Run qpdf --copyright to see copyright and license information.

$ pango-view --version
pango-view (pango) 1.44.7

$ weasyprint --info
System: Linux
Machine: x86_64
Version: #1 SMP Wed Feb 19 06:37:35 UTC 2020
Release: 4.19.104-microsoft-standard

WeasyPrint version: 53.1
Python version: 3.9.5
Pydyf version: 0.1.1
Pango version: 14407

All of the above is on Windows 10, running Python in WSL2 (Ubuntu 20.04). I have also reproduced observations 1 and 5 above (the Acrobat Reader error and the image missing after a pikepdf round trip) in a Docker container running an ubuntu:21.04 image, albeit with different HTML markup (but still with images involved).

Please let me know if any additional information/testing would help in this regard.


robsco-git · 2021-08-27T14:02:39Z

Just saw 53.2 is out. It looks like the same issue is still present:

$ weasyprint --info
System: Linux
Machine: x86_64
Version: #1 SMP Wed Feb 19 06:37:35 UTC 2020
Release: 4.19.104-microsoft-standard

WeasyPrint version: 53.2
Python version: 3.9.5
Pydyf version: 0.1.1
Pango version: 14407

fornwall · 2021-08-27T14:16:46Z

Created a repo with minimal reproducible test case at https://github.com/fornwall/weasyprint-issue

As noted there, the issue seems to have been introduced in 53.1 (and persists in 53.2), while 53.0 produces a working file.

Might be related to alpha channel in images or some other PNG construct? This image: https://github.com/fornwall/weasyprint-issue/blob/main/pngtest8rgba.png causes the issue, while this one does not: https://github.com/fornwall/weasyprint-issue/blob/main/rock-out.png
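
To test the alpha-channel hypothesis on other images, the colour type can be read directly from the PNG header without any third-party library. This is a minimal sketch under stated assumptions: the names `read_png_header` and `describe_png` are hypothetical, only the IHDR chunk is inspected, and transparency supplied through a separate tRNS chunk (for example on palette images) is not detected.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Colour types defined by the PNG specification.
COLOR_TYPES = {
    0: "grayscale",
    2: "truecolor (RGB)",
    3: "indexed (palette)",
    4: "grayscale with alpha",
    6: "truecolor with alpha (RGBA)",
}


def read_png_header(data):
    """Return width, height, bit depth and colour type from the IHDR chunk."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # The first chunk must be IHDR, with a 13-byte payload.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("IHDR chunk missing or malformed")
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": color_type,
        "description": COLOR_TYPES.get(color_type, "unknown"),
        "has_alpha_channel": color_type in (4, 6),
    }


def describe_png(path):
    """Read only the first 26 bytes of the file, which cover the signature and IHDR fields used here."""
    with open(path, "rb") as png_file:
        return read_png_header(png_file.read(26))
```

Running `describe_png` on local copies of pngtest8rgba.png and rock-out.png from the test repository would show whether the two files differ in `has_alpha_channel`.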

liZe · 2021-08-27T14:47:31Z

Thanks a lot for the report, and thanks a lot ❤️ for the test case! It’s fixed (but not released) in master and in the 53.x branch.

liZe added the bug (Existing features not working as expected) label Aug 27, 2021

liZe added this to the 53.3 milestone Aug 27, 2021

liZe closed this as completed in 40537a6 Aug 27, 2021

This was referenced Sep 3, 2021

Getting PDF error 135 when opening with Adobe. WeasyPrint 53.2 #1437

Closed

Again: weasyprint 53.2 / Adobe Acrobat Reader DC in Windows Getting PDF error 135 when opening #1440

Closed
