Julia slows down when reading thousands of JLD files #17554

gasagna · 2016-07-22T10:15:11Z

I have some complex piece of code where I need to read thousands of JLD files, each containing some data that I need to process. I have noticed that as files are read, Julia becomes increasingly slower at reading these files.

Here is minimal working example:

using JLD

# create `Nfiles`  JLD files in `dir`, each containing some small amount of data
function bar(Nfiles, dir)
    !ispath(dir) && mkdir(dir)
    for i = 1:Nfiles
        f = @sprintf "%05d.jld" i
        JLD.save(joinpath(dir, f), "X", randn(5, 5))
        print("\r ", i)
    end
end

# Read `Nfiles` in `Nbatch` batches and evaluate the time required to load each batch. 
# Return a vector of times
function foo(Nfiles, Nbatch)
   M = div(Nfiles, Nbatch)
   s = 0.0
   times = Float64[]
   for j = 1:Nbatch
       I = (j-1)*M + 1
       t = @elapsed for i = I:(I+M-1)
           f = @sprintf "tmp/%05d.jld" i
           X = JLD.load(f, "X")
           s += sum(X)
       end
       push!(times, t)
       print("\r $j $t")
   end
   times
end

# make 50000 files in `tmp`
bar(50000, "tmp")

# now read files in batches of 1000
times = foo(50000, 50);

Let's plot the times vector.

As you can see, the time per batch increases, seemingly linearly with the batch number. If you run the above code in a REPL, then it also takes ages to close it with CTRL+D. Maybe I am just missing something, but this is some weird behaviour I have not seen before.

I am pinging the JLD/HDF maintainers @timholy, @simonster as it might be related to those packages.

EDIT:

julia> versioninfo()
Julia Version 0.4.7-pre+1
Commit 57d0834* (2016-06-19 17:17 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.5.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libgfortblas
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

The text was updated successfully, but these errors were encountered:

jakebolewski · 2016-07-22T15:11:48Z

This is esstentially the same problem as discussed in #7893. As the number of live objects increases, each gc pause takes longer and longer.

JeffBezanson · 2016-07-22T15:34:53Z

Why would the live set increase here? Does JLD keep all the loaded files in memory somehow?

JeffBezanson · 2016-07-22T15:40:49Z

Here's a trace from using @time in place of @elapsed. The GC time fraction is decreasing, which tells me this is not a GC issue:

  0.923663 seconds (276.00 k allocations: 22.736 MB, 1.16% gc time)
  0.986022 seconds (276.00 k allocations: 22.736 MB, 0.99% gc time)
  1.052426 seconds (276.00 k allocations: 22.736 MB, 0.87% gc time)
  1.116729 seconds (276.00 k allocations: 22.736 MB, 0.76% gc time)
  1.189045 seconds (276.00 k allocations: 22.736 MB, 0.72% gc time)
  1.259395 seconds (276.00 k allocations: 22.736 MB, 0.67% gc time)
  1.331177 seconds (276.00 k allocations: 22.736 MB, 0.66% gc time)
  1.415141 seconds (276.00 k allocations: 22.736 MB, 0.59% gc time)
  1.494682 seconds (276.00 k allocations: 22.736 MB, 0.58% gc time)
  1.581615 seconds (276.00 k allocations: 22.736 MB, 0.54% gc time)
  1.675156 seconds (276.00 k allocations: 22.736 MB, 0.51% gc time)
  1.791942 seconds (276.00 k allocations: 22.736 MB, 0.47% gc time)
  1.905326 seconds (276.00 k allocations: 22.736 MB, 0.45% gc time)
  2.038601 seconds (276.00 k allocations: 22.736 MB, 0.42% gc time)
  2.192792 seconds (276.00 k allocations: 22.736 MB, 0.39% gc time)
  2.395357 seconds (276.00 k allocations: 22.736 MB, 0.36% gc time)
  2.605264 seconds (276.00 k allocations: 22.736 MB, 0.65% gc time)
  2.842317 seconds (276.00 k allocations: 22.736 MB, 0.30% gc time)
  3.078720 seconds (276.00 k allocations: 22.736 MB, 0.28% gc time)
  3.402714 seconds (276.00 k allocations: 22.736 MB, 0.25% gc time)
  3.805483 seconds (276.00 k allocations: 22.736 MB, 0.22% gc time)
  4.197594 seconds (276.00 k allocations: 22.736 MB, 0.20% gc time)
  4.616121 seconds (276.00 k allocations: 22.736 MB, 0.19% gc time)
  5.098908 seconds (276.00 k allocations: 22.736 MB, 0.17% gc time)
  5.835276 seconds (276.00 k allocations: 22.736 MB, 0.14% gc time)
  6.199410 seconds (276.00 k allocations: 22.736 MB, 0.14% gc time)
  6.887267 seconds (276.00 k allocations: 22.736 MB, 0.12% gc time)
  7.357703 seconds (276.00 k allocations: 22.736 MB, 0.12% gc time)
  7.808020 seconds (276.00 k allocations: 22.736 MB, 0.11% gc time)
  8.212789 seconds (276.00 k allocations: 22.736 MB, 0.10% gc time)
  8.744519 seconds (276.00 k allocations: 22.736 MB, 0.10% gc time)
  9.183458 seconds (276.00 k allocations: 22.736 MB, 0.09% gc time)

Also on one run I got this interesting error:

  2.562964 seconds (276.00 k allocations: 22.736 MB, 0.65% gc time)
  2.784770 seconds (276.00 k allocations: 22.736 MB, 0.30% gc time)
HDF5-DIAG: Error detected in HDF5 (1.8.4-patch1) thread 140403051035648:
  #000: ../../../src/H5O.c line 1025 in H5Oclose(): not a valid file object ID (dataset, group, or datatype)
    major: Invalid arguments to routine
    minor: Unable to release object
ERROR: LoadError: Error closing object
 in h5o_close at /home/bezanson/.julia/v0.5/HDF5/src/plain.jl:1991 [inlined]
 in close(::JLD.JldFile) at /home/bezanson/.julia/v0.5/JLD/src/JLD.jl:147
 in #jldopen#9(::Array{Any,1}, ::Function, ::JLD.##39#40{String}, ::String, ::Vararg{String,N}) at /home/bezanson/.julia/v0.5/JLD/src/JLD.jl:267
 in load(::FileIO.File{FileIO.DataFormat{:JLD}}, ::String) at /home/bezanson/.julia/v0.5/JLD/src/JLD.jl:1226
 in #load#13(::Array{Any,1}, ::Function, ::String, ::String, ::Vararg{String,N}) at /home/bezanson/.julia/v0.5/FileIO/src/loadsave.jl:42

gasagna · 2016-07-22T15:43:28Z

You might try to close the REPL, and see how long its takes to get out. If you hit CTRL+C while it is there it will segfault badly, maybe there is info to be taken from the crash.

JeffBezanson · 2016-07-22T15:45:52Z

Ah, here we go. From JLD.jl:146:

        # Ensure that all other datasets, groups, and datatypes are closed (ref #176)
        for obj_id in HDF5.h5f_get_obj_ids(f.plain.id, HDF5.H5F_OBJ_DATASET | HDF5.H5F_OBJ_GROUP | HDF5.H5F_OBJ_DATATYPE)
            HDF5.h5o_close(obj_id)
        end

Looks like closing each file requires looping over all other files.

JeffBezanson · 2016-07-22T16:49:36Z

Closing as this appears to be a JLD issue.

gasagna · 2016-07-22T16:50:28Z

Thanks for looking into it!

timholy · 2016-07-22T20:56:40Z

Looks like closing each file requires looping over all other files.

I don't think that's true for the julia side of this: if one applies the following patch,

diff --git a/src/JLD.jl b/src/JLD.jl
index d2334dc..78382c3 100644
--- a/src/JLD.jl
+++ b/src/JLD.jl
@@ -143,7 +143,9 @@ function close(f::JldFile)
         isdefined(f, :gref) && close(f.gref)

         # Ensure that all other datasets, groups, and datatypes are closed (ref #176)
-        for obj_id in HDF5.h5f_get_obj_ids(f.plain.id, HDF5.H5F_OBJ_DATASET | HDF5.H5F_OBJ_GROUP | HDF5.H5F_OBJ_DATATYPE)
+        ids = HDF5.h5f_get_obj_ids(f.plain.id, HDF5.H5F_OBJ_DATASET | HDF5.H5F_OBJ_GROUP | HDF5.H5F_OBJ_DATATYPE)
+        @show f.plain length(ids)
+        for obj_id in ids
             HDF5.h5o_close(obj_id)
         end

I get length(ids) = 1 for all files.

In HDF5 parlance, datasets, datatypes, and groups are all objects in a single file---so julia is just looping over the items in that particular file.

That said, profiling this reveals that basically all the time is in the C library libhdf5. So this is definitely not a julia problem. @gasagna, maybe you could file this over at JLD and someone might see if we can work around this problem?

JeffBezanson · 2016-07-22T21:20:51Z

@timholy You're right, all the time is in h5f_get_obj_ids.

timholy · 2016-07-22T22:20:21Z

Of course, it's quite possible that your original comment is true in a sense: maybe libhdf5 has a broken way of storing these objects, and so that the C library is looping over the entire history of objects that have ever been opened. I haven't look at (and don't intend to debug) their C code.

eschnett · 2016-07-22T23:06:40Z

The new HDF5 release (version 1.10) is supposed to be much more efficient here. We should update the HDF5 package to use this version.

But maybe before that we should ask the original reporter what HDF5 version he/she is using, and what flags when the file is opened?

simonster · 2016-07-23T03:02:28Z

libhdf5 makes me sad 😦

gasagna · 2016-07-23T10:36:54Z

@eschnett

davide@vega - ~$: brew info hdf5
homebrew/science/hdf5: stable 1.8.16 (bottled)
File format designed to store large amounts of data
http://www.hdfgroup.org/HDF5
/usr/local/Cellar/hdf5/1.8.16_1 (180 files, 10.5M) *
  Poured from bottle on 2016-07-06 at 13:51:05
From: https://github.com/Homebrew/homebrew-science/blob/master/hdf5.rb

@timholy Will open an issue at JLD.

gasagna · 2016-07-23T10:41:02Z

I have opened an issue at JuliaIO/JLD.jl#86. Maybe we can continue there.

JeffBezanson mentioned this issue Jul 22, 2016

load of Array{DateTime,1} getting slower inside loop JuliaIO/JLD.jl#64

Open

JeffBezanson closed this as completed Jul 22, 2016

gasagna mentioned this issue Jul 23, 2016

Julia slows down when reading thousands of JLD files JuliaIO/JLD.jl#86

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Julia slows down when reading thousands of JLD files #17554

Julia slows down when reading thousands of JLD files #17554

gasagna commented Jul 22, 2016 •

edited

Loading

jakebolewski commented Jul 22, 2016

JeffBezanson commented Jul 22, 2016

JeffBezanson commented Jul 22, 2016

gasagna commented Jul 22, 2016

JeffBezanson commented Jul 22, 2016

JeffBezanson commented Jul 22, 2016 •

edited

Loading

gasagna commented Jul 22, 2016

timholy commented Jul 22, 2016

JeffBezanson commented Jul 22, 2016

timholy commented Jul 22, 2016

eschnett commented Jul 22, 2016

simonster commented Jul 23, 2016

gasagna commented Jul 23, 2016

gasagna commented Jul 23, 2016

Julia slows down when reading thousands of JLD files #17554

Julia slows down when reading thousands of JLD files #17554

Comments

gasagna commented Jul 22, 2016 • edited Loading

jakebolewski commented Jul 22, 2016

JeffBezanson commented Jul 22, 2016

JeffBezanson commented Jul 22, 2016

gasagna commented Jul 22, 2016

JeffBezanson commented Jul 22, 2016

JeffBezanson commented Jul 22, 2016 • edited Loading

gasagna commented Jul 22, 2016

timholy commented Jul 22, 2016

JeffBezanson commented Jul 22, 2016

timholy commented Jul 22, 2016

eschnett commented Jul 22, 2016

simonster commented Jul 23, 2016

gasagna commented Jul 23, 2016

gasagna commented Jul 23, 2016

gasagna commented Jul 22, 2016 •

edited

Loading

JeffBezanson commented Jul 22, 2016 •

edited

Loading