Add MIME sniffing for images. #124

Closed
SimonSapin opened this Issue Sep 25, 2013 · 5 comments

SimonSapin commented Sep 25, 2013

The web has lots of broken legacy content served with incorrect MIME types in the HTTP Content-Type header. Web browsers "sniff" the actual type of images by looking at the first few bytes:

http://mimesniff.spec.whatwg.org/

WeasyPrint currently trusts the header for PNG and SVG, and gives everything else to GDK-PixBuf, which does some form of sniffing that might not be the same as in the WHATWG spec.

Re-implementing sniffing in WeasyPrint is probably overkill, but we could at least give GDK-PixBuf a try when explicit PNG or SVG decoding fails.
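
To illustrate what "looking at the first few bytes" means, here is a minimal Python sketch. It is not WeasyPrint code and does not implement the full WHATWG algorithm: the function name `sniff_image_type` and the short signature table are assumptions made for this example, covering only a few well-known image formats.

```python
# Leading-byte signatures for a few common image formats.
# This is a small illustrative subset, not the complete WHATWG table.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"BM", "image/bmp"),
]


def sniff_image_type(data, declared_type=None):
    """Return the MIME type suggested by the leading bytes of `data`.

    Falls back to `declared_type` (for example the Content-Type header)
    when no known signature matches.
    """
    for magic, mime_type in SIGNATURES:
        if data.startswith(magic):
            return mime_type
    return declared_type


# A JPEG payload served with a PNG Content-Type is detected as JPEG.
jpeg_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 16
print(sniff_image_type(jpeg_bytes, declared_type="image/png"))
```

Text-based formats such as SVG have no fixed leading bytes, which is one reason the declared type still matters for them.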

alejoar commented Aug 25, 2016

Any chances of this being solved at some point?

I'm running into the #103 issue and I have A TON of images with the wrong Content-Type on my server. This is obviously a problem with how we save images and I'll work on solving that asap, but fixing what's already on our server is going to be a major hassle.

Edit: As a suggestion, and ignoring how WeasyPrint works internally, maybe a 'quick' fix could be to raise an exception when this happens so I can handle it. As of today, no exception is raised at all and the PDF is still generated, but without the problematic images.

Member

liZe commented Aug 26, 2016

Current implementation:

  1. If it's said to be a SVG, render the image with CairoSVG or abort.
  2. If it's said to be a PNG, render the image with Cairo or abort.
  3. If it's something else, render the image with GDK-Pixbuf or abort.

According to the sniffing "spec":

  1. OK, as we have to rely on the given mimetype for XML-based images.
  2. Real problem. A JPEG/SVG/… image with an image/png mimetype is not rendered, but it actually should be.
  3. Probably OK. It's the fallback, we let GDK-Pixbuf do the sniffing. It's close to what browsers do.

Without implementing the "spec", here's what I've done:

  1. If it's said to be a SVG, render the image with CairoSVG or abort.
  2. If it's said to be a PNG, render the image with Cairo, or with GDK-Pixbuf if it failed with Cairo, or abort.
  3. If it's something else, render the image with GDK-Pixbuf or abort.
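
The three steps above can be sketched roughly as follows. This is an illustration only, not the code from the commit: `render_image` and the three `decode_*` callables are hypothetical names standing in for the CairoSVG, Cairo and GDK-Pixbuf code paths, and "abort" is modelled as letting the decoder's exception propagate.

```python
def render_image(data, mime_type, decode_svg, decode_png, decode_pixbuf):
    """Pick a decoder from the declared MIME type, with a fallback for PNG.

    The decode_* arguments are hypothetical callables that return a decoded
    image or raise an exception on failure.
    """
    if mime_type == "image/svg+xml":
        # Step 1: XML-based images rely on the declared type, no fallback.
        return decode_svg(data)
    if mime_type == "image/png":
        try:
            # Step 2: try the PNG decoder first.
            return decode_png(data)
        except Exception:
            # The payload is not a valid PNG: fall through to the
            # generic decoder, which does its own sniffing.
            pass
    # Step 3, and the fallback for step 2.
    return decode_pixbuf(data)
```

With this shape, a JPEG served as image/png fails in `decode_png` and is then handed to `decode_pixbuf`, which is the case the fix targets.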

It's closer to what's done by browsers and should be OK for "the real life cases". Problems left:

  • We rely on the GDK-Pixbuf's sniffing algorithm that may be different from the "spec".
  • An SVG with the wrong Content-Type and extension is rendered by GDK-Pixbuf.
  • A PNG with the wrong Content-Type and extension is rendered by GDK-Pixbuf.

@alejoar I suppose that your problem is non-PNG files served with a PNG Content-Type, because the other cases should already be rendered according to the sniffing "spec". Could you try with the fix to see if it fixes your problem?

> As of today, no exception is raised at all and the PDF is still generated, but without the problematic images.

Yes, that's what browsers do too. Adding an option to raise an error when an image is not found is possible, but relying on the logs is probably easier and lets users filter the warnings that are important to them.
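
For anyone who wants failures to be fatal today, relying on the logs could look something like the sketch below. It uses only the standard `logging` module. The helper names `CollectWarnings` and `run_strict` are made up for this example, and it assumes that image loading failures are reported on the `weasyprint` logger at WARNING level or above.

```python
import logging


class CollectWarnings(logging.Handler):
    """Logging handler that keeps every record at WARNING level or above."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.records = []

    def emit(self, record):
        self.records.append(record)


def run_strict(render, logger_name="weasyprint"):
    """Call `render()` and raise if anything was logged as a warning.

    `render` is any zero-argument callable that produces the document.
    """
    logger = logging.getLogger(logger_name)
    handler = CollectWarnings()
    logger.addHandler(handler)
    try:
        result = render()
    finally:
        # Always detach the handler, even if rendering itself fails.
        logger.removeHandler(handler)
    if handler.records:
        messages = "; ".join(r.getMessage() for r in handler.records)
        raise RuntimeError("rendering logged warnings: " + messages)
    return result
```

A call such as `run_strict(lambda: HTML("weasytest.html").write_pdf("weasytest.pdf"))` would then fail loudly instead of silently dropping an image. The handler's `emit` method is also the place to filter out warnings that do not matter for a given use case.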

liZe added a commit that referenced this issue Aug 26, 2016

liZe added this to the v0.31 milestone Aug 26, 2016

Member

SimonSapin commented Aug 26, 2016

I’ve thought before of adding a "strict mode" flag that makes any error fatal (including fetching and decoding images but also stylesheets) but never got around to it.

With web browsers there is typically a human looking at the screen, and if something goes wrong it’s better to show them as much as possible rather than just this:

[Image: screenshot from 2016-08-26 15-31-16]

They can hit the "Refresh" button to try again.

WeasyPrint, however, is more typically used in automated systems that may be unattended, where it’s better for errors not to pass silently, so that we notice more easily when they happen.

alejoar commented Aug 26, 2016

> @alejoar I suppose that your problem is non-PNG files served with a PNG Content-Type, because the other cases should already be rendered according to the sniffing "spec". Could you try with the fix to see if it fixes your problem?

@liZe this was exactly our situation: JPEG files served with a PNG Content-Type. I ended up finding the error in our code and fixed it, and for the images already on our server I wrote a simple script to go over all of them and fix the Content-Type. What I actually did was download each image, convert everything to JPEG (there was a mix of JPEG and PNG, although everything was served as PNG) and set the proper Content-Type. We use Azure storage, which let me easily mess up the Content-Type months ago without noticing until now, when I needed to generate PDFs automatically. The script took about 16 hours to finish, but it wasn't as bad as I first thought.

Even though this fixed my problem, I really appreciate your effort so I just tried installing from your commit and tested your fix: Works flawlessly!

Here's the detail of my test just so you know:
This is the image: https://wt3002.blob.core.windows.net/static/uploads_dev/wrong_content_type (a JPEG with PNG Content-Type)

Test HTML:

<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title>Testing</title>
</head>
<body>
<img src="https://wt3002.blob.core.windows.net/static/uploads_dev/wrong_content_type">
</body>
</html>

Finally:
weasyprint ./weasytest.html ./weasytest.pdf

Results in the following PDF: https://wt3002.blob.core.windows.net/static/uploads_dev/weasytest.pdf

Just what was expected.

@SimonSapin the strict mode you propose is exactly what I had in mind when I was thinking about potential solutions to the issue.

Thank you both for the help.

Member

liZe commented Aug 26, 2016

@alejoar 😄.

liZe closed this Aug 26, 2016

jsonn pushed a commit to jsonn/pkgsrc that referenced this issue Jan 15, 2017

kleink
Update py-weasyprint to 0.34.
Version 0.34
------------

Released on 2016-12-21.

Bug fixes:

* `#398 <Kozea/WeasyPrint#398>`_:
  Honor the presentational_hints option for PDFs.
* `#399 <Kozea/WeasyPrint#399>`_:
  Avoid CairoSVG-2.0.0rc* on Python 2.
* `#396 <Kozea/WeasyPrint#396>`_:
  Correctly close files opened by mkstemp.
* `#403 <Kozea/WeasyPrint#403>`_:
  Cast the number of columns into int.
* Fix multi-page multi-columns and add related tests.


Version 0.33
------------

Released on 2016-11-28.

New features:

* `#393 <Kozea/WeasyPrint#393>`_:
  Add tests on MacOS.
* `#370 <Kozea/WeasyPrint#370>`_:
  Enable @font-face on MacOS.

Bug fixes:

* `#389 <Kozea/WeasyPrint#389>`_:
  Always update resume_at when splitting lines.
* `#394 <Kozea/WeasyPrint#394>`_:
  Don't build universal wheels.
* `#388 <Kozea/WeasyPrint#388>`_:
  Fix logic when finishing block formatting context.


Version 0.32
------------

Released on 2016-11-17.

New features:

* `#28 <Kozea/WeasyPrint#28>`_:
  Support @font-face on Linux.
* Support CSS fonts level 3 almost entirely, including OpenType features.
* `#253 <Kozea/WeasyPrint#253>`_:
  Support presentational hints (optional).
* Support break-after, break-before and break-inside for pages and columns.
* `#384 <Kozea/WeasyPrint#384>`_:
  Major performance boost.

Bug fixes:

* `#368 <Kozea/WeasyPrint#368>`_:
  Respect white-space for shrink-to-fit.
* `#382 <Kozea/WeasyPrint#382>`_:
  Fix the preferred width for column groups.
* Handle relative boxes in column-layout boxes.

Documentation:

* Add more and more documentation about Windows installation.
* `#355 <Kozea/WeasyPrint#355>`_:
  Add fonts requirements for tests.


Version 0.31
------------

Released on 2016-08-28.

New features:

* `#124 <Kozea/WeasyPrint#124>`_:
  Add MIME sniffing for images.
* `#60 <Kozea/WeasyPrint#60>`_:
  CSS Multi-column Layout.
* `#197 <Kozea/WeasyPrint#197>`_:
  Add hyphens at line breaks activated by a soft hyphen.

Bug fixes:

* `#132 <Kozea/WeasyPrint#132>`_:
  Fix Python 3 compatibility on Windows.

Documentation:

* `#329 <Kozea/WeasyPrint#329>`_:
  Add documentation about installation on Windows.


Version 0.30
------------

Released on 2016-07-18.

WeasyPrint now depends on html5lib-0.999999999.

Bug fixes:

* Fix Acid2
* `#325 <Kozea/WeasyPrint#325>`_:
  Cutting lines is broken in page margin boxes.
* `#334 <Kozea/WeasyPrint#334>`_:
  Newest html5lib 0.999999999 breaks rendering.


Version 0.29
------------

Released on 2016-06-17.

Bug fixes:

* `#263 <Kozea/WeasyPrint#263>`_:
  Don't crash with floats with percents in positions.
* `#323 <Kozea/WeasyPrint#323>`_:
  Fix CairoSVG 2.0 pre-release dependency in Python 2.x.


Version 0.28
------------

Released on 2016-05-16.

Bug fixes:

* `#189 <Kozea/WeasyPrint#189>`_:
  ``white-space: nowrap`` still wraps on hyphens
* `#305 <Kozea/WeasyPrint#305>`_:
  Fix crashes on some tables
* Don't crash when transform matrix isn't invertible
* Don't crash when rendering ratio-only SVG images
* Fix margins and borders on some tables


Version 0.27
------------

Released on 2016-04-08.

New features:

* `#295 <Kozea/WeasyPrint#295>`_:
  Support the 'rem' unit.
* `#299 <Kozea/WeasyPrint#299>`_:
  Enhance the support of SVG images.

Bug fixes:

* `#307 <Kozea/WeasyPrint#307>`_:
  Fix the layout of cells larger than their tables.

Documentation:

* The website is now on GitHub Pages, the documentation is on Read the Docs.
* `#297 <Kozea/WeasyPrint#297>`_:
  Rewrite the CSS chapter of the documentation.