A Python 2.7 and 3.3+ module that wraps the pdftoppm utility to convert a PDF into a list of PIL Image objects.
How to install
First you need pdftoppm
pdftoppm is the piece of software that does the actual magic. It is distributed as part of a larger package called poppler.
Mac users will have to install poppler for Mac.
Linux users will usually have pdftoppm pre-installed with the distro (tested on Ubuntu and Arch Linux). If it is not, on Debian/Ubuntu-based systems run:
sudo apt install poppler-utils
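Before going further, it can help to confirm that pdftoppm is actually reachable from Python. The following is a minimal sketch using only the standard library (`shutil.which` needs Python 3.3+); the helper name `pdftoppm_available` is made up for this example and is not part of pdf2image.

```python
import shutil


def pdftoppm_available(executable="pdftoppm"):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None


if pdftoppm_available():
    print("pdftoppm found at", shutil.which("pdftoppm"))
else:
    print("pdftoppm not found; install poppler first")
```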
Then you can install the pip package!
pip install pdf2image
Install Pillow as well if you don't already have it:
pip install pillow
How does it work?
from pdf2image import convert_from_path, convert_from_bytes
Then simply do:
images = convert_from_path('/home/kankroc/example.pdf')
images = convert_from_bytes(open('/home/kankroc/example.pdf', 'rb').read())
Or better yet, use a temporary output folder:
import tempfile

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/kankroc/example.pdf', output_folder=path)
    # Do something here
images will be a list of PIL Image objects, one for each page of the PDF document.
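Since each item in the list is a PIL Image, pages can be written to disk with the image's own `save` method. The sketch below relies on that method only; `save_pages` and its `prefix` and `extension` parameters are hypothetical names chosen for illustration, not part of pdf2image.

```python
import os


def save_pages(images, out_dir, prefix="page", extension="png"):
    """Save each page image as <prefix>_<number>.<extension> in out_dir.

    `images` is expected to be the list returned by convert_from_path or
    convert_from_bytes; only the `save(path)` method of each item is used.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for number, image in enumerate(images, start=1):
        filename = "{}_{:03d}.{}".format(prefix, number, extension)
        path = os.path.join(out_dir, filename)
        image.save(path)  # Pillow picks the file format from the extension
        paths.append(path)
    return paths
```

With a two-page document, `save_pages(images, 'out')` would write `out/page_001.png` and `out/page_002.png` and return both paths.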
Here are the definitions:
convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm')
convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm')
The userpw parameter sets the password used to unlock the PDF being converted (-upw in the pdftoppm CLI).
The thread_count parameter sets how many threads will be used for the conversion.
The first_page parameter sets the first page to be processed by pdftoppm (-f in the pdftoppm CLI).
The last_page parameter sets the last page to be processed by pdftoppm (-l in the pdftoppm CLI).
The fmt parameter specifies the output format. Currently supported formats are
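To make the mapping between these parameters and the pdftoppm command line more concrete, here is a rough sketch of how such an argument list could be assembled. It is not pdf2image's actual implementation: `build_pdftoppm_args` and `output_root` are hypothetical names, and the `-r`, `-jpeg` and `-png` flags come from pdftoppm's own command-line help rather than from this README. Only the formats mentioned elsewhere in this document are handled.

```python
def build_pdftoppm_args(pdf_path, output_root, dpi=200, first_page=None,
                        last_page=None, userpw=None, fmt="ppm"):
    """Build an argument list for pdftoppm (illustrative sketch only)."""
    args = ["pdftoppm", "-r", str(dpi)]
    if first_page is not None:
        args += ["-f", str(first_page)]
    if last_page is not None:
        args += ["-l", str(last_page)]
    if userpw is not None:
        args += ["-upw", userpw]
    if fmt in ("jpeg", "png"):
        args.append("-" + fmt)  # PPM is pdftoppm's default, so it needs no flag
    elif fmt != "ppm":
        raise ValueError("unsupported format in this sketch: {!r}".format(fmt))
    args += [pdf_path, output_root]
    return args


print(build_pdftoppm_args("example.pdf", "page",
                          first_page=2, last_page=5, fmt="jpeg"))
```

The resulting list could be handed to `subprocess.run`, which avoids shell quoting problems with paths and passwords.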
- Using an output folder is significantly faster if you are using an SSD. Otherwise I/O usually becomes the bottleneck.
- Using multiple threads can give you some gains, but avoid more than 4, as this will cause an I/O bottleneck (even on my NVMe SSD!).
- If I/O is your bottleneck, using the JPEG format can lead to significant gains.
- The PNG format is pretty slow; I am investigating the issue.
- If you want to know the best settings (most settings will be fine anyway), you can clone the project and run python tests.py to get timings.
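For a quick comparison of settings on your own files, a small timing wrapper may be enough. This is only a sketch and not what tests.py does; `time_conversion` is a hypothetical helper, and the commented usage assumes that `'jpeg'` is an accepted value for `fmt`.

```python
import time


def time_conversion(convert, *args, **kwargs):
    """Call convert(*args, **kwargs) and return (seconds_elapsed, page_count)."""
    start = time.perf_counter()
    images = convert(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return elapsed, len(images)


# Example usage (requires pdf2image and pdftoppm to be installed):
# from pdf2image import convert_from_path
# for fmt in ("ppm", "jpeg"):
#     seconds, pages = time_conversion(convert_from_path, "example.pdf", fmt=fmt)
#     print(fmt, pages, "pages in", round(seconds, 2), "s")
```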
pdftoppm does not throw any exceptions, so any file that could not be converted or processed will return an empty Image list. The philosophy behind this choice is simple: if the file was corrupted or not found, no image could be extracted, and returning an empty list makes sense. (This is up for discussion.)
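Callers that would rather fail loudly can check for the empty list themselves. A minimal sketch, where `require_pages` is a hypothetical helper and not part of pdf2image:

```python
def require_pages(images, source):
    """Return images unchanged, or raise ValueError if the list is empty."""
    if not images:
        raise ValueError("no pages could be extracted from {!r}".format(source))
    return images


# Example usage (requires pdf2image and pdftoppm to be installed):
# from pdf2image import convert_from_path
# pdf_path = "/home/kankroc/example.pdf"
# images = require_pages(convert_from_path(pdf_path), pdf_path)
```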
Limitations / known issues
- A relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder)
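Besides using an output folder, one possible way to keep memory in check is to convert the document a few pages at a time with first_page and last_page. This is an alternative sketch, not something the project prescribes: `page_ranges` and `convert_in_chunks` are hypothetical helpers, and the caller is assumed to already know the total page count.

```python
def page_ranges(total_pages, chunk_size):
    """Yield (first_page, last_page) pairs, 1-indexed and inclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def convert_in_chunks(pdf_path, total_pages, chunk_size=10, dpi=200):
    """Yield one list of page images per chunk instead of the whole document."""
    # Imported lazily so page_ranges can be used without pdf2image installed.
    from pdf2image import convert_from_path

    for first, last in page_ranges(total_pages, chunk_size):
        yield convert_from_path(pdf_path, dpi=dpi,
                                first_page=first, last_page=last)


print(list(page_ranges(25, 10)))  # [(1, 10), (11, 20), (21, 25)]
```

Memory only stays bounded if each chunk is processed (for example, saved to disk) and released before the next one is requested.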