pdf2image outputs 1x1 blank image #34

dougsouza · 2018-11-20T15:02:47Z

Describe the bug
For some PDF files, convert_from_path and convert_from_bytes output a blank 1x1 PIL image. Interestingly, it works fine for very similar PDFs. The documents are mostly PDFs with a single very long page. Any ideas?

To Reproduce
Steps to reproduce the behavior:

Unfortunately, the PDFs I'm working on are confidential and I am not allowed to share them.

Expected behavior
I would expect to see a normal PIL image, as happens with other similar PDFs.

Screenshots
Output: [<PIL.PpmImagePlugin.PpmImageFile image mode=RGB size=1x1 at 0x7FCC6C3FA4A8>]. The number of pages is correct, the pdf is just one very long page.

Desktop (please complete the following information):

OS: Ubuntu 16.04

Belval · 2018-11-20T17:09:13Z

As pdf2image is only a thin wrapper around pdftoppm, I would try directly from the CLI.

Something like pdftoppm -r 200 -jpeg your_pdf.pdf out.jpg

If you still get a 1x1 pixel then the problem is on their side and I can't help much.

It would also help if you could provide the exact call you do on convert_from_path/convert_from_bytes.
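
The CLI check suggested above can also be scripted, which makes it easier to capture whatever pdftoppm prints to stderr. This is a minimal sketch, assuming pdftoppm is on the PATH. The helper names build_pdftoppm_cmd and run_pdftoppm are made up for illustration and are not part of pdf2image. Note that pdftoppm treats its last argument as an output prefix, not a full file name.

```python
import subprocess


def build_pdftoppm_cmd(pdf_path, out_prefix, dpi=200):
    """Build the argument list for a JPEG render at the given resolution."""
    # pdftoppm appends the page number and extension to the prefix itself.
    return ["pdftoppm", "-r", str(dpi), "-jpeg", pdf_path, out_prefix]


def run_pdftoppm(cmd):
    """Run pdftoppm and return (exit code, stderr text) for inspection."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stderr.strip()
```

Calling run_pdftoppm(build_pdftoppm_cmd("your_pdf.pdf", "out")) and printing the second element of the result should surface any message pdftoppm emits while rendering.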

dougsouza · 2018-11-20T17:22:07Z

@Belval, I figured this is on the pdftoppm side, I am already investigating.

When I run pdftoppm I get: Bogus memory allocation size

The output is a 1x1 blank image.

Thanks for the help

dougsouza · 2018-11-20T18:17:22Z

@Belval,

Just to share the solution: the mediaBox of my pdf was huge and it was eating up all the memory available rendering a lot of blank space. Turns out that the cropBox of the pdf was correct, so converting using the cropBox (pdftoppm -cropbox ...) works fine.

It would be nice to have an option to use the crop box instead, like:

convert_from_path(..., use_crop_box=False)

Cheers
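
To see why an oversized mediaBox can exhaust memory: PDF boxes are measured in points (1/72 inch by default), so the rendered bitmap grows with both the box size and the resolution. This is a rough sketch only. The box dimensions below are invented for illustration, since the original file could not be shared, and the byte estimate assumes an uncompressed 8-bit RGB bitmap.

```python
POINTS_PER_INCH = 72  # default PDF user-space units per inch


def rendered_size(box_width_pt, box_height_pt, dpi=200):
    """Return the (width, height) in pixels of a box rendered at `dpi`."""
    width_px = round(box_width_pt * dpi / POINTS_PER_INCH)
    height_px = round(box_height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


def raw_rgb_bytes(width_px, height_px):
    """Estimate memory for an uncompressed 8-bit RGB bitmap."""
    return width_px * height_px * 3


# Hypothetical boxes: a Letter-sized cropBox vs. a mediaBox 50x larger per side.
crop = rendered_size(612, 792)
media = rendered_size(612 * 50, 792 * 50)
print(crop, raw_rgb_bytes(*crop))    # (1700, 2200) 11220000
print(media, raw_rgb_bytes(*media))  # (85000, 110000) 28050000000
```

With these made-up numbers the crop box needs about 11 MB while the media box would need about 28 GB, which is consistent with an allocation failure in the renderer.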

Belval · 2018-11-20T20:33:59Z

Can you try with support-cropbox and see if it fixes your issue?

I will upload the new package to PyPI tonight if it does.

dougsouza · 2018-11-20T20:45:48Z

I just tested. It works fine.

Thanks!!

Belval · 2018-11-20T21:14:40Z

Pull request merged: #35

Package uploaded: https://pypi.org/project/pdf2image/

Be aware that PyPI caches packages, so it can take a few minutes until it is available.

Belval · 2018-12-26T17:35:55Z

As this was fixed in a previous version, I am closing it.

alvercau · 2020-06-17T13:21:34Z

Hi

I've run into the same issue. Running pdftoppm directly, without the wrapper, does not have the same issue; the JPEG is generated correctly.
Example file is attached.
5ccad715074ce2b850f646d5-e70d52df-5671-4c97-bfcf-f7e20f5a61d6-1-MN5PADLAME2TMP2HM3JPETQQ33P7NDSQ.pdf

When opening the image afterwards with PIL, I get a DecompressionBombError. So it might be the case that the image is simply too big to be processed by pdf2image.

Belval · 2020-06-17T13:43:26Z

I was unable to reproduce the issue described with the linked file. Using pdftoppm version 0.62.0 the output image seems to correspond to the PDF.

Here is my code snippet:

from pdf2image import convert_from_path
convert_from_path("test.pdf", size=(3000,))[0].save("out.png")

I renamed your PDF for readability and set a size to avoid the Pillow decompression bomb check.

Could you provide your pdftoppm version with pdftoppm -v and your function call?

alvercau · 2020-06-17T13:48:26Z

Hi, I didn't set any size, so the issue is indeed that the image is too big. Would it be possible to have convert_from_path raise an error, explaining that the size is too big and that it can be avoided by setting the size parameter, instead of simply silently returning a white pixel?
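
A caller-side guard is one way to make this failure loud in the meantime. This is a minimal sketch. BlankRenderError and check_rendered_pages are hypothetical names, not part of pdf2image, and the guard relies only on each image exposing a PIL-style size tuple.

```python
class BlankRenderError(ValueError):
    """Raised when a converted page looks like a failed 1x1 render."""


def check_rendered_pages(images):
    """Return `images` unchanged, or raise if any page is 1x1 pixels."""
    for page_number, image in enumerate(images, start=1):
        # PIL images expose their dimensions as image.size == (width, height).
        if tuple(image.size) == (1, 1):
            raise BlankRenderError(
                f"page {page_number} rendered as 1x1; the page may be too "
                "large, try passing size=... or use_cropbox=True"
            )
    return images
```

It would be used by wrapping the conversion call, for example check_rendered_pages(convert_from_path("test.pdf")).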

Belval · 2020-06-17T13:54:51Z

That's the thing, the underlying library that parses the image file should and does raise an exception on the PDF you linked:

>>> from pdf2image import convert_from_path
>>> convert_from_path("test.pdf")[0].save("out.png")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pdf2image/pdf2image.py", line 202, in convert_from_path
    images += parse_buffer_func(data)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pdf2image/parsers.py", line 21, in parse_buffer_to_ppm
    images.append(Image.open(BytesIO(data[index : index + file_size])))
  File "/home/ubuntu/.local/lib/python3.6/site-packages/PIL/Image.py", line 2881, in open
    im = _open_core(fp, filename, prefix)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/PIL/Image.py", line 2868, in _open_core
    _decompression_bomb_check(im.size)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/PIL/Image.py", line 2793, in _decompression_bomb_check
    "could be decompression bomb DOS attack." % (pixels, 2 * MAX_IMAGE_PIXELS)
PIL.Image.DecompressionBombError: Image size (382369975 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.

Do you override the Pillow limit somewhere else in your code?
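
For reference, the limit in the traceback can be reproduced with plain arithmetic. The sketch below mirrors Pillow's decompression bomb check as I understand it: a warning above Image.MAX_IMAGE_PIXELS and an error above twice that value. It deliberately does not import Pillow. The function name bomb_check is made up, and the default constant is an assumption chosen to match the 178956970 figure shown above.

```python
# Assumed Pillow default for Image.MAX_IMAGE_PIXELS (89478485); twice this
# value matches the 178956970 limit reported in the traceback.
DEFAULT_MAX_IMAGE_PIXELS = 1024 * 1024 * 1024 // 4 // 3


def bomb_check(pixels, max_pixels=DEFAULT_MAX_IMAGE_PIXELS):
    """Classify a pixel count as 'ok', 'warning' or 'error'."""
    if pixels > 2 * max_pixels:
        return "error"    # Pillow raises DecompressionBombError here
    if pixels > max_pixels:
        return "warning"  # Pillow emits DecompressionBombWarning here
    return "ok"


print(2 * DEFAULT_MAX_IMAGE_PIXELS)  # 178956970
print(bomb_check(382369975))         # error
```

Setting Image.MAX_IMAGE_PIXELS to a larger value (or to None) in the calling code changes where these thresholds fall, which is why the question about overriding the limit matters.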

alvercau · 2020-06-17T14:03:32Z

The issue is not that pdf2image returns an image that is too big, but that it returns one single white pixel, without any warnings.
assert convert_from_path("test.pdf")[0].size == (1, 1)
does not raise any error, as the size of the image is 1x1.
If I convert the pdf to jpeg by means of the command line tool pdftoppm, the image gets correctly generated, and I do get the DecompressionBomb error when trying to open it with PIL. So the issue is in pdf2image, not in pdftoppm or PIL.

Belval · 2020-06-17T14:30:23Z

I understand that, but I am simply unable to reproduce the issue. If I disable Pillow's warning, the saved image is correctly rendered.

In other words, I am unable to get the 1x1 white pixel output you describe.

mananshah1403 · 2022-02-09T20:42:23Z

@Belval I am still encountering this issue with the following PDF. please help!
fail3.pdf

asanaa8 · 2023-04-27T23:23:59Z

same issue here, Help.

hedes1992 · 2024-10-30T07:02:54Z

Regarding @dougsouza's solution above (the mediaBox was huge, the cropBox was correct, and converting with pdftoppm -cropbox ... works fine) and the suggested option convert_from_path(..., use_crop_box=False):

The parameter seems to be named use_cropbox (default False), so pass use_cropbox=True to render the crop box. Refer to https://pdf2image.readthedocs.io/en/latest/reference.html
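
One way to apply this in practice is to retry with the crop box only when the default render comes back as a 1x1 image. This is a hedged sketch. The helper convert_with_cropbox_fallback and its convert argument are hypothetical and exist only so the logic can be exercised without pdf2image installed. By default it falls back to pdf2image's convert_from_path and its use_cropbox parameter.

```python
def convert_with_cropbox_fallback(pdf_path, convert=None, **kwargs):
    """Convert a PDF, retrying with the crop box if any page comes back 1x1."""
    if convert is None:
        # Imported lazily so the helper can be tested without pdf2image.
        from pdf2image import convert_from_path as convert

    images = convert(pdf_path, **kwargs)
    if any(tuple(img.size) == (1, 1) for img in images):
        # Render the cropBox instead of the (possibly oversized) mediaBox.
        images = convert(pdf_path, **{**kwargs, "use_cropbox": True})
    return images
```

If the crop box is also oversized, the retry will not help, so a size argument may still be needed.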

dougsouza closed this as completed Nov 20, 2018

dougsouza reopened this Nov 20, 2018

Belval mentioned this issue Nov 20, 2018

Support cropbox #35

Merged

Belval closed this as completed Dec 26, 2018
