pdf2image seems not to be thread safe #125

Closed
stavrakidis opened this issue Feb 10, 2020 · 6 comments

Comments

@stavrakidis stavrakidis commented Feb 10, 2020

Using multiprocessing.dummy.Pool, I sometimes get the following error. It is rare and only shows up after hundreds of conversions, but it does happen.

I think this is because generators (functions that use yield) are not thread safe, and pdf2image uses two generators. So pdf2image is not thread safe.

Traceback (most recent call last):
  File "/srv/shared/conda/***/envs/**************/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/srv/shared/conda/***/envs/**************/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/srv/shared/conda/***/envs/**************/lib/python3.8/site-packages/*****/__main__.py", line 7, in <module>
    ocr.convert_**********()
  File "/srv/shared/conda/***/envs/**************/lib/python3.8/site-packages/*****/**********/ocr.py", line 63, in convert_**********
    convert_dir(path_pdf, path_text)
  File "/srv/shared/conda/***/envs/**************/lib/python3.8/site-packages/*****/**********/ocr.py", line 34, in convert_dir
    pool.starmap(pdf2text, file_list)
  File "/srv/shared/conda/***/envs/**************/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/srv/shared/conda/***/envs/**************/lib/python3.8/multiprocessing/pool.py", line 768, in get
    raise self._value
  File "/srv/shared/conda/***/envs/**************/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/srv/shared/conda/***/envs/**************/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/srv/shared/conda/***/envs/**************/lib/python3.8/site-packages/*****/**********/ocr.py", line 40, in pdf2text
    pages = pdf2image.convert_from_path(pdf_file, poppler_path=configuration.get_value('ocr', 'poppler'))
  File "/srv/shared/conda/***/envs/**************/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 144, in convert_from_path
    thread_output_file = next(output_file)
ValueError: generator already executing
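
The same ValueError can be reproduced without pdf2image. The following is a minimal sketch, not pdf2image's code: `slow_counter` and `demo_shared_generator` are hypothetical names, and the two events exist only to make the race deterministic. One thread is parked inside the generator body while a second thread calls next() on the same generator object.

```python
import threading


def slow_counter(started, release):
    """Hypothetical generator that stays 'executing' until released."""
    n = 0
    while True:
        started.set()    # signal that the generator body is running
        release.wait()   # block inside the body, before reaching yield
        yield n
        n += 1


def demo_shared_generator():
    started = threading.Event()
    release = threading.Event()
    gen = slow_counter(started, release)
    results = []

    # The worker thread enters the generator body and blocks there.
    worker = threading.Thread(target=lambda: results.append(next(gen)))
    worker.start()
    started.wait()  # the generator is now mid-execution in the worker

    # A second next() on the same generator, from another thread, fails.
    try:
        next(gen)
        error = None
    except ValueError as exc:
        error = str(exc)

    release.set()   # let the worker finish its next() call
    worker.join()
    return error, results


if __name__ == "__main__":
    print(demo_shared_generator())
```

In real code the overlap depends on thread scheduling, which would be consistent with the error appearing only occasionally.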

Owner

@Belval Belval commented Feb 10, 2020

That's possible. I have never used it with multiple threads, only with multiple processes.

I'll get a reproducible test case and issue a fix.

Owner

@Belval Belval commented Feb 10, 2020

Test case written:

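import time
from multiprocessing.dummy import Pool  # thread-backed pool (assumed, matching the report)
from pdf2image import convert_from_path

def test_multithread_conversion(self):  # method of a unittest.TestCase
    start_time = time.time()
    files = ["./tests/test.pdf"] * 50
    # Convert the same PDF 50 times across 10 worker threads.
    with Pool(10) as p:
        res = p.map(convert_from_path, files)
    self.assertEqual(len(res), 50)
    print("test_multithread_conversion: {} sec".format(time.time() - start_time))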

Owner

@Belval Belval commented Feb 10, 2020

I created a pull request that addresses this issue. I will probably merge it and build the new package in the coming hours.
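
The thread does not show what the pull request changed. For readers facing the same error in their own code, one common way to make a shared generator safe across threads is to serialize every next() call with a lock. This is a minimal sketch under that assumption: `LockedIterator`, `name_generator`, and `draw_names` are hypothetical names for illustration, not pdf2image's API.

```python
import threading
from multiprocessing.dummy import Pool  # thread-backed pool


class LockedIterator:
    """Wrap an iterator so only one thread at a time can advance it."""

    def __init__(self, iterator):
        self._iterator = iterator
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Holding the lock prevents two threads from entering the
        # underlying generator at the same time.
        with self._lock:
            return next(self._iterator)


def name_generator(prefix="page"):
    """Hypothetical stand-in for a generator that hands out unique names."""
    i = 0
    while True:
        yield "{}-{:04d}".format(prefix, i)
        i += 1


def draw_names(total=200, workers=10):
    shared = LockedIterator(name_generator())
    with Pool(workers) as pool:
        # Every worker thread pulls from the same wrapped generator.
        return pool.map(lambda _: next(shared), range(total))


if __name__ == "__main__":
    names = draw_names()
    print(len(names), len(set(names)))
```

The lock costs little here because each next() call is short. An alternative design is to give each thread its own generator so nothing is shared.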

Owner

@Belval Belval commented Feb 10, 2020

Version 1.12.0 is now live, so you should no longer get this error message. Please let me know whether it fixes the bug you were experiencing.

Author

@stavrakidis stavrakidis commented Feb 11, 2020

Seems to work. Great!

Owner

@Belval Belval commented Feb 11, 2020

Excellent, feel free to open a new issue if you encounter another problem.

@Belval Belval closed this Feb 11, 2020