Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Julia slows down when reading thousands of JLD files #17554

Closed
gasagna opened this issue Jul 22, 2016 · 14 comments
Closed

Julia slows down when reading thousands of JLD files #17554

gasagna opened this issue Jul 22, 2016 · 14 comments

Comments

@gasagna
Copy link
Contributor

gasagna commented Jul 22, 2016

I have some complex piece of code where I need to read thousands of JLD files, each containing some data that I need to process. I have noticed that as files are read, Julia becomes increasingly slower at reading these files.

Here is minimal working example:

using JLD

# create `Nfiles`  JLD files in `dir`, each containing some small amount of data
function bar(Nfiles, dir)
    !ispath(dir) && mkdir(dir)
    for i = 1:Nfiles
        f = @sprintf "%05d.jld" i
        JLD.save(joinpath(dir, f), "X", randn(5, 5))
        print("\r ", i)
    end
end

# Read `Nfiles` in `Nbatch` batches and evaluate the time required to load each batch. 
# Return a vector of times
function foo(Nfiles, Nbatch)
   M = div(Nfiles, Nbatch)
   s = 0.0
   times = Float64[]
   for j = 1:Nbatch
       I = (j-1)*M + 1
       t = @elapsed for i = I:(I+M-1)
           f = @sprintf "tmp/%05d.jld" i
           X = JLD.load(f, "X")
           s += sum(X)
       end
       push!(times, t)
       print("\r $j $t")
   end
   times
end

# make 50000 files in `tmp`
bar(50000, "tmp")

# now read files in batches of 1000
times = foo(50000, 50);

Let's plot the times vector.

figure_1

As you can see, the time per batch increases, seemingly linearly with the batch number. If you run the above code in a REPL, then it also takes ages to close it with CTRL+D. Maybe I am just missing something, but this is some weird behaviour I have not seen before.

I am pinging the JLD/HDF maintainers @timholy, @simonster as it might be related to those packages.

EDIT:

julia> versioninfo()
Julia Version 0.4.7-pre+1
Commit 57d0834* (2016-06-19 17:17 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.5.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libgfortblas
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
@jakebolewski
Copy link
Member

This is esstentially the same problem as discussed in #7893. As the number of live objects increases, each gc pause takes longer and longer.

@JeffBezanson
Copy link
Member

Why would the live set increase here? Does JLD keep all the loaded files in memory somehow?

@JeffBezanson
Copy link
Member

Here's a trace from using @time in place of @elapsed. The GC time fraction is decreasing, which tells me this is not a GC issue:

  0.923663 seconds (276.00 k allocations: 22.736 MB, 1.16% gc time)
  0.986022 seconds (276.00 k allocations: 22.736 MB, 0.99% gc time)
  1.052426 seconds (276.00 k allocations: 22.736 MB, 0.87% gc time)
  1.116729 seconds (276.00 k allocations: 22.736 MB, 0.76% gc time)
  1.189045 seconds (276.00 k allocations: 22.736 MB, 0.72% gc time)
  1.259395 seconds (276.00 k allocations: 22.736 MB, 0.67% gc time)
  1.331177 seconds (276.00 k allocations: 22.736 MB, 0.66% gc time)
  1.415141 seconds (276.00 k allocations: 22.736 MB, 0.59% gc time)
  1.494682 seconds (276.00 k allocations: 22.736 MB, 0.58% gc time)
  1.581615 seconds (276.00 k allocations: 22.736 MB, 0.54% gc time)
  1.675156 seconds (276.00 k allocations: 22.736 MB, 0.51% gc time)
  1.791942 seconds (276.00 k allocations: 22.736 MB, 0.47% gc time)
  1.905326 seconds (276.00 k allocations: 22.736 MB, 0.45% gc time)
  2.038601 seconds (276.00 k allocations: 22.736 MB, 0.42% gc time)
  2.192792 seconds (276.00 k allocations: 22.736 MB, 0.39% gc time)
  2.395357 seconds (276.00 k allocations: 22.736 MB, 0.36% gc time)
  2.605264 seconds (276.00 k allocations: 22.736 MB, 0.65% gc time)
  2.842317 seconds (276.00 k allocations: 22.736 MB, 0.30% gc time)
  3.078720 seconds (276.00 k allocations: 22.736 MB, 0.28% gc time)
  3.402714 seconds (276.00 k allocations: 22.736 MB, 0.25% gc time)
  3.805483 seconds (276.00 k allocations: 22.736 MB, 0.22% gc time)
  4.197594 seconds (276.00 k allocations: 22.736 MB, 0.20% gc time)
  4.616121 seconds (276.00 k allocations: 22.736 MB, 0.19% gc time)
  5.098908 seconds (276.00 k allocations: 22.736 MB, 0.17% gc time)
  5.835276 seconds (276.00 k allocations: 22.736 MB, 0.14% gc time)
  6.199410 seconds (276.00 k allocations: 22.736 MB, 0.14% gc time)
  6.887267 seconds (276.00 k allocations: 22.736 MB, 0.12% gc time)
  7.357703 seconds (276.00 k allocations: 22.736 MB, 0.12% gc time)
  7.808020 seconds (276.00 k allocations: 22.736 MB, 0.11% gc time)
  8.212789 seconds (276.00 k allocations: 22.736 MB, 0.10% gc time)
  8.744519 seconds (276.00 k allocations: 22.736 MB, 0.10% gc time)
  9.183458 seconds (276.00 k allocations: 22.736 MB, 0.09% gc time)

Also on one run I got this interesting error:

  2.562964 seconds (276.00 k allocations: 22.736 MB, 0.65% gc time)
  2.784770 seconds (276.00 k allocations: 22.736 MB, 0.30% gc time)
HDF5-DIAG: Error detected in HDF5 (1.8.4-patch1) thread 140403051035648:
  #000: ../../../src/H5O.c line 1025 in H5Oclose(): not a valid file object ID (dataset, group, or datatype)
    major: Invalid arguments to routine
    minor: Unable to release object
ERROR: LoadError: Error closing object
 in h5o_close at /home/bezanson/.julia/v0.5/HDF5/src/plain.jl:1991 [inlined]
 in close(::JLD.JldFile) at /home/bezanson/.julia/v0.5/JLD/src/JLD.jl:147
 in #jldopen#9(::Array{Any,1}, ::Function, ::JLD.##39#40{String}, ::String, ::Vararg{String,N}) at /home/bezanson/.julia/v0.5/JLD/src/JLD.jl:267
 in load(::FileIO.File{FileIO.DataFormat{:JLD}}, ::String) at /home/bezanson/.julia/v0.5/JLD/src/JLD.jl:1226
 in #load#13(::Array{Any,1}, ::Function, ::String, ::String, ::Vararg{String,N}) at /home/bezanson/.julia/v0.5/FileIO/src/loadsave.jl:42

@gasagna
Copy link
Contributor Author

gasagna commented Jul 22, 2016

You might try to close the REPL, and see how long its takes to get out. If you hit CTRL+C while it is there it will segfault badly, maybe there is info to be taken from the crash.

@JeffBezanson
Copy link
Member

Ah, here we go. From JLD.jl:146:

        # Ensure that all other datasets, groups, and datatypes are closed (ref #176)
        for obj_id in HDF5.h5f_get_obj_ids(f.plain.id, HDF5.H5F_OBJ_DATASET | HDF5.H5F_OBJ_GROUP | HDF5.H5F_OBJ_DATATYPE)
            HDF5.h5o_close(obj_id)
        end

Looks like closing each file requires looping over all other files.

@JeffBezanson
Copy link
Member

JeffBezanson commented Jul 22, 2016

Closing as this appears to be a JLD issue.

@gasagna
Copy link
Contributor Author

gasagna commented Jul 22, 2016

Thanks for looking into it!

@timholy
Copy link
Member

timholy commented Jul 22, 2016

Looks like closing each file requires looping over all other files.

I don't think that's true for the julia side of this: if one applies the following patch,

diff --git a/src/JLD.jl b/src/JLD.jl
index d2334dc..78382c3 100644
--- a/src/JLD.jl
+++ b/src/JLD.jl
@@ -143,7 +143,9 @@ function close(f::JldFile)
         isdefined(f, :gref) && close(f.gref)

         # Ensure that all other datasets, groups, and datatypes are closed (ref #176)
-        for obj_id in HDF5.h5f_get_obj_ids(f.plain.id, HDF5.H5F_OBJ_DATASET | HDF5.H5F_OBJ_GROUP | HDF5.H5F_OBJ_DATATYPE)
+        ids = HDF5.h5f_get_obj_ids(f.plain.id, HDF5.H5F_OBJ_DATASET | HDF5.H5F_OBJ_GROUP | HDF5.H5F_OBJ_DATATYPE)
+        @show f.plain length(ids)
+        for obj_id in ids
             HDF5.h5o_close(obj_id)
         end

I get length(ids) = 1 for all files.

In HDF5 parlance, datasets, datatypes, and groups are all objects in a single file---so julia is just looping over the items in that particular file.

That said, profiling this reveals that basically all the time is in the C library libhdf5. So this is definitely not a julia problem. @gasagna, maybe you could file this over at JLD and someone might see if we can work around this problem?

@JeffBezanson
Copy link
Member

@timholy You're right, all the time is in h5f_get_obj_ids.

@timholy
Copy link
Member

timholy commented Jul 22, 2016

Of course, it's quite possible that your original comment is true in a sense: maybe libhdf5 has a broken way of storing these objects, and so that the C library is looping over the entire history of objects that have ever been opened. I haven't look at (and don't intend to debug) their C code.

@eschnett
Copy link
Contributor

The new HDF5 release (version 1.10) is supposed to be much more efficient here. We should update the HDF5 package to use this version.

But maybe before that we should ask the original reporter what HDF5 version he/she is using, and what flags when the file is opened?

@simonster
Copy link
Member

libhdf5 makes me sad 😦

@gasagna
Copy link
Contributor Author

gasagna commented Jul 23, 2016

@eschnett

davide@vega - ~$: brew info hdf5
homebrew/science/hdf5: stable 1.8.16 (bottled)
File format designed to store large amounts of data
http://www.hdfgroup.org/HDF5
/usr/local/Cellar/hdf5/1.8.16_1 (180 files, 10.5M) *
  Poured from bottle on 2016-07-06 at 13:51:05
From: https://github.com/Homebrew/homebrew-science/blob/master/hdf5.rb

@timholy Will open an issue at JLD.

@gasagna
Copy link
Contributor Author

gasagna commented Jul 23, 2016

I have opened an issue at JuliaIO/JLD.jl#86. Maybe we can continue there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants