Thread safety #91

I have several threads sending requests with the `requests` library and streaming the `response.content` into `ijson.items`.
Should I thread-lock every call to `ijson.items(response.content, prefix)`?
I'm having issues with an access violation (0xC0000005) error and I'm trying to find out the cause.
Thanks!
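For reference, the locking idea from the question would look roughly like the sketch below; note that `ijson.items` returns a lazy generator, so the parsing happens during iteration, and a lock held only around the call itself would not serialise the actual parsing (names here are placeholders):

```python
import threading

import ijson

parse_lock = threading.Lock()

def parse_response(response, prefix):
    # hold the lock until the generator is fully consumed,
    # not just while it is created
    with parse_lock:
        return list(ijson.items(response.content, prefix))
```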
@FilippoBoido could you please provide more context for your issue? Code example, environment, Python/ijson versions, backend, stack traces, etc... With the limited information you've provided I can only guess at this point. So my guess is that you are seeing a segmentation fault, and that you are using the yajl2_c backend.

I don't remember off the top of my head, but I'd guess ijson itself is thread-safe: we don't release the GIL in our C extension, and I think we don't keep any global state, but I might be wrong. Again, more information would be greatly helpful. It could as well be that the problem lies somewhere else entirely.

Note however that there are other (probably better) alternatives for your use case. We don't release the GIL, even in our C extension, so you can't expect multiple threads to run much faster than a single one (it depends on exactly how you wrote your code, so again we'd need to see some sample code). If you are issuing multiple requests concurrently you might be better off using an async HTTP client library together with the async support that ijson provides, so that you run all your requests from a single thread. If you really are CPU bound instead of network or I/O bound, you can consider using multiprocessing to handle multiple requests in different processes.

Edit: reinforced the fact that we don't release the GIL, so not much speedup can be expected from multiple threads.
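To illustrate the async route: a minimal sketch using aiohttp as the client (aiohttp, the URLs and the prefix are assumptions for the example, not part of this thread). ijson accepts file-like objects with an async `read()`, such as aiohttp's `response.content`, and then yields items asynchronously:

```python
import asyncio

import aiohttp
import ijson

async def stream_items(session, url, prefix):
    async with session.get(url) as response:
        # response.content exposes an async read(), which ijson's
        # async support consumes, yielding items as they are parsed
        return [item async for item in ijson.items(response.content, prefix)]

async def main(urls, prefix):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(stream_items(session, url, prefix) for url in urls))

# results = asyncio.run(main(["https://example.org/data.json"], "items.item"))
```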
@rtobar thanks for the fast reply and your willingness to discuss my problem.
This method gets called by many threads, and the threads are managed by the ThreadPoolExecutor. Here you can see a Windows stack trace, since the application crashes without a Python exception stack trace:

After I wrote you yesterday I tried to use the `python` backend, which doesn't crash but consumes a huge amount of memory.
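A hypothetical reconstruction of the kind of method being described (placeholder names, URLs and prefix; not the author's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

import ijson
import requests

def stream_items_from_url(url, prefix="items.item"):
    response = requests.get(url)
    # response.content is the fully downloaded body as bytes;
    # ijson.items then lazily parses items out of it
    return ijson.items(response.content, prefix)

urls = ["https://example.org/a.json", "https://example.org/b.json"]
with ThreadPoolExecutor(max_workers=8) as pool:
    generator_list = list(pool.map(stream_items_from_url, urls))
```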
Thanks @FilippoBoido for the further details. The error window indicates that it's indeed the yajl2_c backend that produces the crash. That, together with your example code and explanation, confirms that the initial guess I ventured was pretty much spot on. Does the "Details" tab of that error window give any further information?

I tried reproducing your error, see this small script. It reads data from a local file rather than from the network, but it's otherwise a similar situation:

```python
import ijson
import sys
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def do_stuff(_):
    with open(sys.argv[1]) as f:
        for item in ijson.items(f, sys.argv[2]):
            pass

print(f"Using {ijson.backend} backend")
executor = ProcessPoolExecutor() if sys.argv[3] == 'proc' else ThreadPoolExecutor()
results = list(executor.map(do_stuff, range(100)))
executor.shutdown()
```

I then ran it with these configurations (Linux, AMD Ryzen 7 5825U with 8 cores, 2 threads per core):

```
$> du -hs ~/downloads/large-file.json
25M     /home/rtobar/downloads/large-file.json
$> time python3.11 ~/parallel.py ~/downloads/large-file.json item thread
Using yajl2_c backend
38.59user 2.41system 0:38.37elapsed 106%CPU (0avgtext+0avgdata 161128maxresident)k
8inputs+0outputs (1major+375481minor)pagefaults 0swaps
$> time python3.11 ~/parallel.py ~/downloads/large-file.json item proc
Using yajl2_c backend
40.79user 1.23system 0:02.83elapsed 1484%CPU (0avgtext+0avgdata 19176maxresident)k
0inputs+0outputs (0major+472417minor)pagefaults 0swaps
```

Note that:

- with threads the run takes ~38 seconds at ~106% CPU, so the GIL effectively serialises the work;
- with processes the same work takes ~2.8 seconds at ~1484% CPU, scaling across all cores.

I also just thought that until now you assumed it was the multi-threading aspect that was causing the crashes. Have you tried testing with a single thread? Also, just double-checking: you are using the latest version of ijson, right?
Thanks @rtobar for your insightful testing. I will report back if I find a solution or a way around this problem.

P.S.:
Interesting... the whole idea of using ijson is to avoid exhausting memory in the first place. How are you using the results from your `ijson.items` calls?

You mean the

I'll try more experiments around the idea of exhausting memory and see if I can reproduce the crash, but unfortunately until we get a proper stacktrace we can only make (educated) guesses.
This is an intermediate recap of what I've found out:

Yes, I'm using the yajl2_c backend.

I have a list of objects called generator_list, and each of these objects is the return value of ijson.items.
@FilippoBoido thanks for another piece of the puzzle, I feel like all of this is making sense now. There are two main issues I'm seeing here. The first one I should have seen earlier, but I'm not a big `requests` user: `response.content` reads the entire response body into memory before ijson sees a single byte, so parsing it defeats the point of streaming; you want `stream=True` on the request and then feed the content to ijson incrementally (e.g. via `response.iter_content` or `response.raw`).

The second one is that you mention there are two thread pool executors involved. In the first, say:

```python
all_generators = pool1.map(stream_items_from_tracker_single_page_with_query, urls)
...
pool2.map(iterate_over_generators, all_generators)
```

The issue with this is the following: by the end of the first map all the generators (and the HTTP responses behind them) already exist, but nothing consumes them until the second pool runs, so everything accumulates in memory in the meantime.

I think in an ideal scenario you'd want to fuse the work you're doing in both pools into one, so that you can fully stream the HTTP responses into ijson:

```python
def do_everything():
    result = []
    detail = {}
    response = self.retry_request_with_auth_obj(...., stream=True)
    f = file_like_object_from_generator(response.iter_content)  # because #44 hasn't been implemented
    for item in ijson.items(f, 'items.item'):
        try:
            if item['typeName'] in item_types:
                result.append(item)
                detail[item['typeName']] += 1
        except (KeyError, TypeError):
            continue
    return result, detail

results_and_details = pool.map(do_everything, ...)
# now merge all results and details
```

You can then even use a process pool executor instead of a threaded one for improved parallelisation.
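`file_like_object_from_generator` above is a placeholder for the adapter that #44 would add to ijson. One possible shape for such a helper, assuming it receives an already-started chunk generator such as `response.iter_content(chunk_size=65536)`:

```python
import io

class _GeneratorStream(io.RawIOBase):
    """Adapts an iterable of byte chunks into a readable file-like object."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # pull chunks until we can fill the buffer (or the generator ends)
        while len(self._leftover) < len(buffer):
            try:
                self._leftover += next(self._chunks)
            except StopIteration:
                break
        data, self._leftover = self._leftover[:len(buffer)], self._leftover[len(buffer):]
        buffer[:len(data)] = data
        return len(data)  # returning 0 signals EOF

def file_like_object_from_generator(chunks):
    # BufferedReader keeps ijson's many small reads cheap
    return io.BufferedReader(_GeneratorStream(chunks))
```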
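The process-pool variant of that last step could then look like the sketch below; note that `ProcessPoolExecutor` needs the mapped function and its arguments to be picklable, so `do_everything` would have to become a module-level function taking, say, a URL (names here are placeholders):

```python
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as pool:
    # each request is fetched and parsed in its own process,
    # sidestepping the GIL entirely
    results_and_details = list(pool.map(do_everything, urls))
```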
@rtobar Great insight, thank you!
That makes sense now. I've been busy implementing async methods replicating the current operations done with the ThreadPoolExecutor, so that we can see whether the garbage collector is able to release the memory, unlike with the current ThreadPool implementation. A couple of lines after the snippet I shared, I clear the generator_list containing the ijson.items generators and call the gc, which in theory should free up the memory, but it doesn't.

P.S.: You have a typo in the snippet you wrote ->
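For concreteness, the cleanup being described is roughly the following (assuming `generator_list` is the list mentioned above):

```python
import gc

generator_list.clear()  # drop the references to the ijson.items generators
gc.collect()            # force a full garbage collection pass
# ...and yet the process's memory usage reportedly stays high
```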
I rewrote the routines to make use of the async framework instead of the ThreadPoolExecutor, and yield the JSON payload from files with an
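One way to yield a JSON payload from files under asyncio is with aiofiles, whose handles expose the async `read()` that ijson's async support consumes; a sketch under that assumption (aiofiles and the prefix are illustrative, not from this thread):

```python
import aiofiles
import ijson

async def items_from_file(path, prefix="items.item"):
    # ijson detects the async read() and yields parsed items
    # asynchronously as the file is consumed
    async with aiofiles.open(path, "rb") as f:
        async for item in ijson.items(f, prefix):
            yield item
```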