Julia slows down when reading thousands of JLD files #17554
This is essentially the same problem as discussed in #7893. As the number of live objects increases, each gc pause takes longer and longer.
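A minimal sketch of that effect (not taken from the issue itself, and assuming Julia 1.x where `GC.gc()` and `@elapsed` are available): as the live set grows, each full collection takes noticeably longer.

```julia
# Hedged sketch: GC pause time grows with the number of live objects.
live = Vector{Vector{Float64}}()
for batch in 1:20
    append!(live, [randn(100) for _ in 1:10_000])   # grow the live set
    t = @elapsed GC.gc()                            # time one full collection
    println("batch $batch: ~$(length(live)) live arrays, gc pause $(round(t, digits=4)) s")
end
```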
Why would the live set increase here? Does JLD keep all the loaded files in memory somehow?
Here's a trace from using
Also on one run I got this interesting error:
You might try to close the REPL and see how long it takes to get out. If you hit CTRL+C while it is stuck there, it will segfault badly; maybe there is info to be taken from the crash.
Ah, here we go. From JLD.jl:146:
Looks like closing each file requires looping over all other files.
Closing as this appears to be a JLD issue.
Thanks for looking into it!
I don't think that's true for the julia side of this: if one applies the following patch,

```diff
diff --git a/src/JLD.jl b/src/JLD.jl
index d2334dc..78382c3 100644
--- a/src/JLD.jl
+++ b/src/JLD.jl
@@ -143,7 +143,9 @@ function close(f::JldFile)
     isdefined(f, :gref) && close(f.gref)
     # Ensure that all other datasets, groups, and datatypes are closed (ref #176)
-    for obj_id in HDF5.h5f_get_obj_ids(f.plain.id, HDF5.H5F_OBJ_DATASET | HDF5.H5F_OBJ_GROUP | HDF5.H5F_OBJ_DATATYPE)
+    ids = HDF5.h5f_get_obj_ids(f.plain.id, HDF5.H5F_OBJ_DATASET | HDF5.H5F_OBJ_GROUP | HDF5.H5F_OBJ_DATATYPE)
+    @show f.plain length(ids)
+    for obj_id in ids
         HDF5.h5o_close(obj_id)
     end
```

I get the `@show` output for each file as it is closed. In HDF5 parlance, datasets, datatypes, and groups are all objects in a single file, so julia is just looping over the items in that particular file.

That said, profiling this reveals that basically all the time is in the C library libhdf5. So this is definitely not a julia problem. @gasagna, maybe you could file this over at JLD and someone might see if we can work around this problem?
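A small sketch of that check, reusing the same low-level call as the patch above (the file name and dataset name here are hypothetical): it lists the HDF5 objects still open in one particular file before it is closed.

```julia
using JLD, HDF5

f = jldopen("data_1.jld", "r")      # hypothetical file name
x = read(f, "x")                    # hypothetical dataset name
# Same call as in close(): ids covers only objects open in *this* file.
ids = HDF5.h5f_get_obj_ids(f.plain.id,
    HDF5.H5F_OBJ_DATASET | HDF5.H5F_OBJ_GROUP | HDF5.H5F_OBJ_DATATYPE)
@show length(ids)
close(f)
```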
@timholy You're right, all the time is in libhdf5.
Of course, it's quite possible that your original comment is true in a sense: maybe libhdf5 has a broken way of storing these objects, so that the C library is looping over the entire history of objects that have ever been opened. I haven't looked at (and don't intend to debug) their C code.
The new HDF5 release (version 1.10) is supposed to be much more efficient here. We should update the HDF5 package to use this version. But maybe before that we should ask the original reporter what HDF5 version he/she is using, and with what flags the file is opened?
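A quick way to answer the version question from Julia, assuming HDF5.jl exposes `libversion` (it does in the versions I'm aware of):

```julia
using HDF5
@show HDF5.libversion   # VersionNumber of the libhdf5 the package is linked against
```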
libhdf5 makes me sad 😦
@timholy Will open an issue at JLD.
I have opened an issue at JuliaIO/JLD.jl#86. Maybe we can continue there. |
I have a complex piece of code that needs to read thousands of JLD files, each containing some data I need to process. I have noticed that as files are read, Julia becomes increasingly slow at reading them.
Here is a minimal working example:
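The original snippet is not preserved in this copy of the thread; the following is a hedged reconstruction of the kind of workload described, with file counts, sizes, and names chosen purely for illustration.

```julia
using JLD

nfiles, batchsize = 2_000, 100
dir = mktempdir()

# Write many small JLD files.
for i in 1:nfiles
    save(joinpath(dir, "data_$i.jld"), "x", randn(100, 100))
end

# Read them back in batches, timing each batch.
times = Float64[]
for b in 1:div(nfiles, batchsize)
    t = @elapsed for i in ((b - 1) * batchsize + 1):(b * batchsize)
        load(joinpath(dir, "data_$i.jld"), "x")
    end
    push!(times, t)
    println("batch $b: $(round(t, digits=3)) s")
end
```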
Let's plot the `times` vector. As you can see, the time per batch increases, seemingly linearly with the batch number. If you run the above code in a REPL, then it also takes ages to close it with CTRL+D. Maybe I am just missing something, but this is some weird behaviour I have not seen before.
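The plot referenced above is not preserved here; something along these lines would reproduce it (the plotting package is an assumption, not the reporter's choice):

```julia
using PyPlot            # assumed package; any plotting library works
plot(times)             # per-batch read time, growing with batch number
xlabel("batch"); ylabel("seconds")
```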
I am pinging the JLD/HDF maintainers @timholy, @simonster as it might be related to those packages.