Malformed (?) PDF when an image is included in the HTML #1431

Closed
robsco-git opened this issue Aug 27, 2021 · 3 comments

robsco-git commented Aug 27, 2021

When creating a PDF with the following code:

from weasyprint import HTML

sun_img_url = "https://weasyprint.org/css/img/sun.png"

html = f"""
<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>Example</title>
        <meta name="description" content="Example">
    </head>
    <body>
        <img src="{sun_img_url}" alt="" />
    </body>
</html>"""

if __name__ == "__main__":
    HTML(string=html).write_pdf('./test.pdf')

Here is the resultant test.pdf.

  1. Attempting to open test.pdf in Adobe Acrobat Reader DC (Version 2021.005.20060) results in this error message:

There was an error processing a page. There was a problem reading this document (135).

  2. Attempting to validate test.pdf with QPDF results in:
$ qpdf --check test.pdf
checking test.pdf
PDF Version: 1.7
File is not encrypted
File is not linearized
WARNING: test.pdf (object 13 0, offset 4232): unknown token while reading object; treating as string
WARNING: test.pdf (object 13 0, offset 4243): unknown token while reading object; treating as string
WARNING: test.pdf (object 13 0, offset 4249): unknown token while reading object; treating as string
WARNING: test.pdf (object 13 0, offset 4254): unknown token while reading object; treating as string
WARNING: test.pdf (object 13 0, offset 4267): unknown token while reading object; treating as string
WARNING: test.pdf (object 13 0, offset 4274): unknown token while reading object; treating as string
WARNING: test.pdf (object 13 0, offset 4274): too many errors; giving up on reading object
WARNING: test.pdf (object 13 0, offset 4282): expected endobj

  3. Using an online validator (https://www.pdf-online.com/osa/validate.aspx) results in:
Compliance | pdf1.7
Result | Document does not conform to PDF/A.
Details | Validating file "test.pdf" for conformance level pdf1.7
The "Length" key of the stream object is wrong.
The "endobj" keyword is missing.
The value of the key SMask must not be of type dictionary.
The image's sample stream's computed length 43200 is different to the actual length 14400.
The document does not conform to the requested standard.
The file format (header, trailer, objects, xref, streams) is corrupted.
The document doesn't conform to the PDF reference (missing required entries, wrong value types, etc.).
The document does not conform to the PDF 1.7 standard.
Done.

  4. Google Chrome (Version 92.0.4515.159 (Official Build) (64-bit)) opens the PDF and displays the image as expected.

  5. When opening and saving the PDF with pikepdf (https://github.com/pikepdf/pikepdf), the image is not present in the newly saved version (this is the point I worked backwards from). My understanding is that pikepdf uses QPDF under the hood, so it makes sense (to me) that both qpdf --check and pikepdf have a problem with this file.
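
The two lengths reported by the online validator (43200 and 14400) differ by exactly a factor of three. A minimal sketch of the arithmetic, assuming 8 bits per component and a hypothetical 120 x 120 image (chosen only because 120 * 120 = 14400; the real dimensions of sun.png are not stated here), shows that this is what one would expect if the image dictionary describes three color components while the stream only holds one byte per pixel. This is one possible reading of the validator output, not a confirmed diagnosis.

```python
def expected_sample_bytes(width, height, components, bits_per_component=8):
    """Uncompressed size in bytes of image sample data.

    Each row is padded to a whole number of bytes, which is how PDF
    image streams lay out their samples.
    """
    row_bits = width * components * bits_per_component
    row_bytes = (row_bits + 7) // 8
    return row_bytes * height


# Hypothetical dimensions, picked only to match the reported byte counts.
WIDTH, HEIGHT = 120, 120

rgb_length = expected_sample_bytes(WIDTH, HEIGHT, components=3)
single_channel_length = expected_sample_bytes(WIDTH, HEIGHT, components=1)

print("three components:", rgb_length)
print("one component:   ", single_channel_length)
print("ratio:           ", rgb_length // single_channel_length)
```

A single-channel stream of that size would fit a grayscale image or a soft mask (alpha) stream, which may tie in with the validator's separate complaint about the SMask entry.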

Software/library versions:

$ python --version
Python 3.9.5
$ pip freeze
Brotli==1.0.9
cffi==1.14.6
cssselect2==0.4.1
fonttools==4.26.2
html5lib==1.1
lxml==4.6.3
Pillow==8.3.1
pycparser==2.20
pydyf==0.1.1
pyphen==0.11.0
six==1.16.0
tinycss2==1.1.0
weasyprint==53.1
webencodings==0.5.1
zopfli==0.1.8
$ qpdf --version
qpdf version 9.1.1
Run qpdf --copyright to see copyright and license information.
$ pango-view --version
pango-view (pango) 1.44.7
$ weasyprint --info
System: Linux
Machine: x86_64
Version: #1 SMP Wed Feb 19 06:37:35 UTC 2020
Release: 4.19.104-microsoft-standard

WeasyPrint version: 53.1
Python version: 3.9.5
Pydyf version: 0.1.1
Pango version: 14407

All of the above was run on Windows 10 with Python in WSL2 (Ubuntu 20.04). I have also reproduced items 1 and 5 above in a Docker container running an ubuntu:21.04 image, albeit with different HTML markup (but still with images involved).

Please let me know if any additional information/testing would help in this regard.

robsco-git (Author) commented

Just saw 53.2 is out. It looks like the same issue is still present:

$ weasyprint --info
System: Linux
Machine: x86_64
Version: #1 SMP Wed Feb 19 06:37:35 UTC 2020
Release: 4.19.104-microsoft-standard

WeasyPrint version: 53.2
Python version: 3.9.5
Pydyf version: 0.1.1
Pango version: 14407


fornwall commented Aug 27, 2021

Created a repo with minimal reproducible test case at https://github.com/fornwall/weasyprint-issue

As noted there, the issue seems to have been introduced in 53.1 (and persists in 53.2), while 53.0 produces a working file.

It might be related to the alpha channel in images or some other PNG construct. This image: https://github.com/fornwall/weasyprint-issue/blob/main/pngtest8rgba.png causes the issue, while this one does not: https://github.com/fornwall/weasyprint-issue/blob/main/rock-out.png
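
To test the alpha-channel hypothesis on other images, one option is to read the color type from the PNG's IHDR chunk. The sketch below uses only the standard library; `read_png_header` is a hypothetical helper name, and the sample header at the end uses made-up dimensions purely for illustration.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# PNG color types that carry a full alpha channel:
# 4 = grayscale with alpha, 6 = truecolor (RGB) with alpha.
COLOR_TYPES_WITH_ALPHA = {4, 6}


def read_png_header(data: bytes) -> dict:
    """Parse the signature and IHDR chunk from the first 26 bytes of a PNG."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a valid IHDR")
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": color_type,
        # Palette images (color type 3) can still be transparent through a
        # tRNS chunk; that case is not detected here.
        "has_alpha_channel": color_type in COLOR_TYPES_WITH_ALPHA,
    }


# Synthetic header with made-up dimensions: 8-bit RGBA (color type 6).
sample = PNG_SIGNATURE + struct.pack(">I4sIIBB", 13, b"IHDR", 64, 32, 8, 6)
print(read_png_header(sample))
```

For a real file, passing the first 26 bytes is enough, for example `read_png_header(open("pngtest8rgba.png", "rb").read(26))`.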

liZe added the bug (Existing features not working as expected) label Aug 27, 2021
liZe added this to the 53.3 milestone Aug 27, 2021
liZe closed this as completed in 40537a6 Aug 27, 2021

liZe (Member) commented Aug 27, 2021

Thanks a lot for the report, and thanks a lot ❤️ for the test case! It’s fixed (but not released) in master and in the 53.x branch.
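
Until a release containing the fix is installed, a script that depends on valid PDF output could guard against the versions reported as affected in this thread (53.1 and 53.2). This is only a sketch: `KNOWN_AFFECTED` and `is_known_affected` are hypothetical names, and the set reflects nothing beyond what was reported above.

```python
import re
from importlib import metadata

# Versions reported in this thread as producing the malformed PDF.
KNOWN_AFFECTED = {(53, 1), (53, 2)}


def is_known_affected(version: str) -> bool:
    """Return True if the major.minor part of `version` is in KNOWN_AFFECTED."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) in KNOWN_AFFECTED


if __name__ == "__main__":
    try:
        installed = metadata.version("weasyprint")
    except metadata.PackageNotFoundError:
        print("weasyprint is not installed")
    else:
        if is_known_affected(installed):
            print(f"weasyprint {installed} was reported as affected by #1431")
        else:
            print(f"weasyprint {installed} is not in the reported-affected set")
```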
