pdf2image outputs 1x1 blank image #34

Closed
dougsouza opened this issue Nov 20, 2018 · 16 comments

@dougsouza

Describe the bug
For some PDF files, convert_from_path and convert_from_bytes output a blank 1x1 PIL image. Interestingly, for very similar PDFs it works fine. The documents are mostly PDFs with a single very long page. Any ideas?

To Reproduce
Steps to reproduce the behavior:

  1. Unfortunately the pdfs I'm working on are confidential and I am not allowed to share

Expected behavior
I would expect to see a normal PIL image, as happened to other similar pdfs.

Screenshots
Output: [<PIL.PpmImagePlugin.PpmImageFile image mode=RGB size=1x1 at 0x7FCC6C3FA4A8>]. The number of pages is correct, the pdf is just one very long page.

Desktop (please complete the following information):

  • OS: Ubuntu 16.04
@Belval
Owner

Belval commented Nov 20, 2018

As pdf2image is only a thin wrapper around pdftoppm, I would try directly from the CLI.

Something like pdftoppm -r 200 -jpeg your_pdf.pdf out (the last argument is an output prefix; pdftoppm appends the page number and the .jpg extension)

If you still get a 1x1 pixel then the problem is on their side and I can't help much.

It would also help if you could provide the exact call you do on convert_from_path/convert_from_bytes.
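A minimal sketch of running that check from Python, assuming pdftoppm is on the PATH. The helper names build_pdftoppm_cmd and run_pdftoppm are hypothetical and not part of pdf2image; the flags are the ones named in this thread. It returns whatever pdftoppm prints on stderr, so an error from the renderer is not lost.

```python
import subprocess


def build_pdftoppm_cmd(pdf_path, out_prefix, dpi=200, use_cropbox=False):
    """Build a pdftoppm command line (hypothetical helper, not pdf2image API)."""
    cmd = ["pdftoppm", "-r", str(dpi), "-jpeg"]
    if use_cropbox:
        # Render the crop box instead of the media box.
        cmd.append("-cropbox")
    cmd += [pdf_path, out_prefix]
    return cmd


def run_pdftoppm(pdf_path, out_prefix, **kwargs):
    """Run pdftoppm and return (exit code, stderr text)."""
    proc = subprocess.run(
        build_pdftoppm_cmd(pdf_path, out_prefix, **kwargs),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return proc.returncode, proc.stderr.decode(errors="replace")
```

Calling run_pdftoppm("your_pdf.pdf", "out") and printing the second element of the result shows any message pdftoppm emitted while rendering.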

@dougsouza
Author

@Belval, I figured this is on the pdftoppm side, I am already investigating.

When I run pdftoppm I get: Bogus memory allocation size

The output is a 1x1 blank image.

Thanks for the help

@dougsouza
Author

@Belval,

Just to share the solution: the mediaBox of my pdf was huge and it was eating up all the memory available rendering a lot of blank space. Turns out that the cropBox of the pdf was correct, so converting using the cropBox (pdftoppm -cropbox ...) works fine.

It would be nice to have an option to use the crop box instead, like:

convert_from_path(..., use_crop_box=False)

Cheers
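To illustrate why an oversized media box is a problem, here is a rough arithmetic sketch. The box dimensions below are hypothetical (the actual PDF was not shared); the only facts used are that a PDF point is 1/72 inch and that an RGB raster needs 3 bytes per pixel.

```python
def rendered_pixels(box_width_pt, box_height_pt, dpi=200):
    """Pixel dimensions of a page box given in PDF points (1 pt = 1/72 inch)."""
    width_px = round(box_width_pt / 72 * dpi)
    height_px = round(box_height_pt / 72 * dpi)
    return width_px, height_px


def rgb_bytes(width_px, height_px):
    """Approximate size of an uncompressed 8-bit RGB raster."""
    return width_px * height_px * 3


# Hypothetical boxes: a 200 x 200 inch media box vs. a long, narrow crop box.
media = rendered_pixels(14400, 14400)
crop = rendered_pixels(612, 7920)

print(media, rgb_bytes(*media))  # several gigabytes of mostly blank space
print(crop, rgb_bytes(*crop))    # roughly a hundred megabytes
```

With these made-up numbers the media box needs about 4.8 GB at 200 DPI while the crop box needs about 112 MB, which is consistent with the renderer giving up on the former.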

@dougsouza dougsouza reopened this Nov 20, 2018
@Belval
Owner

Belval commented Nov 20, 2018

Can you try with support-cropbox and see if it fixes your issue?

I will upload the new package to PyPi tonight if it does.

@dougsouza
Author

I just tested. It works fine.

Thanks!!

@Belval Belval mentioned this issue Nov 20, 2018
@Belval
Owner

Belval commented Nov 20, 2018

Pull request merged: #35

Package uploaded: https://pypi.org/project/pdf2image/

Be aware that PyPi caches packages so it can take a few minutes until it is available.

@Belval
Owner

Belval commented Dec 26, 2018

As this was fixed in a previous version, I am closing it.

@Belval Belval closed this as completed Dec 26, 2018
@alvercau

alvercau commented Jun 17, 2020

Hi

I've run into the same issue. Running pdftoppm directly, without the wrapper, does not show the problem: the JPEG is generated correctly.
Example file is attached.
5ccad715074ce2b850f646d5-e70d52df-5671-4c97-bfcf-f7e20f5a61d6-1-MN5PADLAME2TMP2HM3JPETQQ33P7NDSQ.pdf

When opening the image afterwards with PIL, I get a DecompressionBombError. So it might be the case that the image is simply too big to be processed by pdf2image.

@Belval
Owner

Belval commented Jun 17, 2020

I was unable to reproduce the issue described with the linked file. Using pdftoppm version 0.62.0 the output image seems to correspond to the PDF.

Here is my code snippet:

from pdf2image import convert_from_path
convert_from_path("test.pdf", size=(3000,))[0].save("out.png")

I renamed your PDF for readability and set a size to avoid the Pillow decompression bomb check.

Could you provide your pdftoppm version (pdftoppm -v) and your function call?

@alvercau

Hi, I didn't set any size, so the issue is indeed that the image is too big. Would it be possible to have convert_from_path raise an error, explaining that the size is too big and that it can be avoided by setting the size parameter, instead of simply silently returning a white pixel?
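Until something like that exists in the library, a caller-side guard is one option. The sketch below is hypothetical (convert_checked is not a pdf2image function); it assumes only that the converter returns objects with a PIL-style .size attribute, and it takes the converter as a parameter so the check can be tried without poppler installed.

```python
def convert_checked(pdf_path, convert=None, **kwargs):
    """Call a converter and raise instead of returning 1x1 placeholder pages.

    `convert` defaults to pdf2image.convert_from_path; it is injectable so the
    check can be exercised with a stand-in converter.
    """
    if convert is None:
        # Imported lazily so the guard itself has no hard dependency.
        from pdf2image import convert_from_path as convert

    images = convert(pdf_path, **kwargs)
    for page_number, image in enumerate(images, start=1):
        if image.size == (1, 1):
            raise ValueError(
                f"page {page_number} of {pdf_path!r} rendered as 1x1; "
                "the page may be too large, try passing size=... "
                "or use_cropbox=True"
            )
    return images
```

For example, convert_checked("test.pdf", size=(3000,)) forwards the size argument to convert_from_path and returns the pages unchanged when none of them is 1x1.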

@Belval
Owner

Belval commented Jun 17, 2020

That's the thing, the underlying library that parses the image file should and does raise an exception on the PDF you linked:

>>> from pdf2image import convert_from_path
>>> convert_from_path("test.pdf")[0].save("out.png")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pdf2image/pdf2image.py", line 202, in convert_from_path
    images += parse_buffer_func(data)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pdf2image/parsers.py", line 21, in parse_buffer_to_ppm
    images.append(Image.open(BytesIO(data[index : index + file_size])))
  File "/home/ubuntu/.local/lib/python3.6/site-packages/PIL/Image.py", line 2881, in open
    im = _open_core(fp, filename, prefix)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/PIL/Image.py", line 2868, in _open_core
    _decompression_bomb_check(im.size)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/PIL/Image.py", line 2793, in _decompression_bomb_check
    "could be decompression bomb DOS attack." % (pixels, 2 * MAX_IMAGE_PIXELS)
PIL.Image.DecompressionBombError: Image size (382369975 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.

Do you override the Pillow limit somewhere else in your code?
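For reference, the numbers in that traceback can be reproduced with a few lines of arithmetic. This is a sketch assuming Pillow's default Image.MAX_IMAGE_PIXELS; the constant is recomputed here so the snippet runs without Pillow, and bomb_check is a hypothetical helper that mirrors the warning and error thresholds rather than Pillow's own function.

```python
# Pillow's default Image.MAX_IMAGE_PIXELS, recomputed so Pillow is not required.
DEFAULT_MAX_IMAGE_PIXELS = 1024 * 1024 * 1024 // 4 // 3


def bomb_check(pixels, max_pixels=DEFAULT_MAX_IMAGE_PIXELS):
    """Classify a pixel count against the warning and error thresholds."""
    if pixels > 2 * max_pixels:
        return "error"    # Pillow raises DecompressionBombError
    if pixels > max_pixels:
        return "warning"  # Pillow emits DecompressionBombWarning
    return "ok"


print(2 * DEFAULT_MAX_IMAGE_PIXELS)  # the limit quoted in the traceback
print(bomb_check(382369975))         # the page from the traceback
print(bomb_check(3000 * 3000))       # an image capped at 3000 px per side
```

If the limit has been raised, or Image.MAX_IMAGE_PIXELS has been set to None, this check is skipped, which is why the question about overriding it matters.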

@alvercau

The issue is not that pdf2image returns an image that is too big, but that it returns one single white pixel, without any warnings.
assert convert_from_path("test.pdf")[0].size == (1, 1)
does not give any errors, as the size of the image is 1x1.
If I convert the pdf to jpeg by means of the command line tool pdftoppm, the image gets correctly generated, and I do get the DecompressionBomb error when trying to open it with PIL. So the issue is in pdf2image, not in pdftoppm or PIL.

@Belval
Owner

Belval commented Jun 17, 2020

I understand that, but I am simply unable to reproduce the issue. If I disable Pillow's warning, the saved image is correctly rendered.

In other words, I am unable to get the 1x1 white pixel output you describe.

@mananshah1403

@Belval I am still encountering this issue with the following PDF. Please help!
fail3.pdf

@asanaa8

asanaa8 commented Apr 27, 2023

same issue here, Help.

@hedes1992

@Belval,

Just to share the solution: the mediaBox of my pdf was huge and it was eating up all the memory available rendering a lot of blank space. Turns out that the cropBox of the pdf was correct, so converting using the cropBox (pdftoppm -cropbox ...) works fine.

It would be nice to have an option to use the crop box instead, like:

convert_from_path(..., use_crop_box=False)

Cheers

The option seems to be named use_cropbox, with a default of False, so pass use_cropbox=True to render from the crop box. Refer to https://pdf2image.readthedocs.io/en/latest/reference.html
