Intermittent failure in loading test #35217

Keno · 2020-03-22T06:56:20Z

As previously discussed in #35183, we have an intermittent failure in loading.
Over the past few days I've run this test continuously on around 20 cores and
observe about a 0.5% failure rate. I tried reproducing it with just the random seed,
but that was unsuccessful. Finally, I simply serialized out the inputs to the test_find
function, which yielded a successful reproducer of the issue. The reproducer is as follows:

julia> roots
Dict{String,PkgId} with 3 entries:
  "B" => B [top-level]
  "A" => A [75d55ab9-8b58-4b6e-8ad9-7fc54a3c81d3]
  "C" => C [fa81875a-839c-476c-9a9a-63af890e68cb]

julia> show(IOContext(stdout, :verbose=>true, :compact=>false), graph)
Dict{PkgId,Dict{String,PkgId}}(A [75d55ab9-8b58-4b6e-8ad9-7fc54a3c81d3] => Dict("A" => A [75d55ab9-8b58-4b6e-8ad9-7fc54a3c81d3],"C" => C [fa81875a-839c-476c-9a9a-63af890e68cb]),C [fa81875a-839c-476c-9a9a-63af890e68cb] => Dict("C" => C [fa81875a-839c-476c-9a9a-63af890e68cb]))

julia> paths
Dict{PkgId,Union{Nothing, String}} with 3 entries:
  B [top-level]                            => "/tmp/jl_FyNXyg/B/src/B.jl"
  A [75d55ab9-8b58-4b6e-8ad9-7fc54a3c81d3] => "/tmp/jl_FyNXyg/A/src/A.jl"
  C [fa81875a-839c-476c-9a9a-63af890e68cb] => "/tmp/jl_FyNXyg/C/src/C.jl"

julia> push!(empty!(Base.LOAD_PATH), "/tmp/jl_FyNXyg"))

julia> test_find(roots, graph, paths)
Test Failed at REPL[16]:10
  Expression: id == get(roots, name, nothing)
   Evaluated: A [f2499fa5-f4f5-5ad4-85f6-2621ce3bb437] == A [75d55ab9-8b58-4b6e-8ad9-7fc54a3c81d3]
ERROR: There was an error during testing

julia> Base.dummy_uuid("/tmp/jl_FyNXyg/A/Project.toml")
UUID("f2499fa5-f4f5-5ad4-85f6-2621ce3bb437")

Both A and C have a project file, both are empty, B does not have a project file.

The text was updated successfully, but these errors were encountered:

Keno · 2020-03-22T07:03:18Z

/tmp/jl_FyNXyg/
├── [         80]  A
│   ├── [          0]  Project.toml
│   └── [         60]  src
│       └── [          0]  A.jl
├── [         60]  B
│   └── [         60]  src
│       └── [          0]  B.jl
└── [         80]  C
    ├── [          0]  Project.toml
    └── [         60]  src
        └── [          0]  C.jl

In helpful tree form, I guess.

Keno · 2020-03-22T07:06:18Z

What I'm failing to understand is how the directory got into this form. My impression was that the test would at least write something into those .jl files.

Keno · 2020-03-22T08:22:59Z

Looks like this particular situation may have been caused by this test filling up all of /tmp. Apparently if we run out of space, we just silently drop file contents. That's probably not a great idea.

StefanKarpinski · 2020-03-22T19:18:14Z

Wow, this test and only this test failing is a really unexpected symptom of that situation.

Keno · 2020-03-22T19:20:23Z

I think there may still be a real issue here also. The out of space issue may just have higher frequency. This test really spams a lot of directories on /tmp

StefanKarpinski · 2020-03-22T19:41:16Z

Yeah, sorry, it does. It's a very thorough test 😬

In #35217, I noticed that if we call `close` on an IOStream, but the device that has the underlying file is full, we silently truncate the file without error. To remidy that, introduce an explicit flush call before closing the file. This should either succeed and reflect the changes on disk, or fail and throw an appropriate error.

Keno · 2020-11-25T02:01:45Z

This one looks like it's back:
https://build.julialang.org/#/builders/61/builds/5894/steps/5/logs/stdio

Test Failed at C:\buildbot\worker-tabularasa\tester_win32\build\share\julia\test\loading.jl:552
  Expression: path == get(paths, id, nothing)
   Evaluated: "C:\\Windows\\TEMP\\jl_VfUINM\\packages\\C\\IB6LF\\src\\C.jl" == "C:\\Windows\\TEMP\\jl_pumCto\\packages\\C\\IB6LF\\src\\C.jl"
Error in testset loading:
Test Failed at C:\buildbot\worker-tabularasa\tester_win32\build\share\julia\test\loading.jl:562
  Expression: path == get(paths, id, nothing)
   Evaluated: "C:\\Windows\\TEMP\\jl_VfUINM\\packages\\C\\IB6LF\\src\\C.jl" == "C:\\Windows\\TEMP\\jl_pumCto\\packages\\C\\IB6LF\\src\\C.jl"

Keno · 2021-02-11T17:51:41Z

We have an rr trace for this now. From https://build.julialang.org/#/builders/71/builds/621:

Error in testset loading:
Test Failed at /buildworker/worker/tester_linux64/build/share/julia/test/loading.jl:183
  Expression: root == if something(n, N + 1) ��� something(d, N + 1)
        nothing
    else
        if something(u, N + 1) < something(d, N + 1)
            root_uuid
        else
            proj_uuid
        end
    end
   Evaluated: UUID("800647f9-cfeb-4944-b1e5-3e1efd939c93") == nothing
Error in testset loading:
Test Failed at /buildworker/worker/tester_linux64/build/share/julia/test/loading.jl:185
  Expression: this == if something(d, N + 1) < something(t, N + 1) ��� N
        this_uuid
    else
        nothing
    end
   Evaluated: nothing == UUID("8dc43e05-7c3c-4a41-90b5-ced2ccd00485")
ERROR: LoadError: Test run finished with errors

https://julialang-dumps.s3.amazonaws.com/linux64/621/rr-run_621-gitsha_4f854b4e3f-2021-02-11_15_09_03.tar.zst

vtjnash · 2021-02-12T00:08:15Z

Ah, this was a strange one to watch in the debugger, since it seemed like it would just jump over the bad test. This is a timing attack. We have a cache (specifically get_updated_dict) which won't be invalided if the Project file was being changed quickly, and now more than 0.1 seconds have elapsed since the last update. However, in this case we got very unlucky, and just after writing the new file, we ran GC, which needed to cleanup a few tens of thousands of IOStream finalizers (all fairly quick, since they are already closed, there's just a lot of them, since this test is somewhat exhaustive), before we get to that test.

(rr) p req
$142 = {data = 0x7efd0670f6c0, loop = 0xbad10, type = UV_FS, reserved = {0x7efcf352e4a8, 0x7efd13120c90, 0x7efd0a925f10, 0x7efce3844350, 0x7efcf352e4a0, 0x7efd0a926450}, fs_type = UV_FS_STAT, 
  cb = 0x0, result = 0, ptr = 0x7efd13120be0, path = 0x7efcf352e4a8 "/tmp/jl_MMOg18/Project.toml", statbuf = {st_dev = 55, st_mode = 33188, st_nlink = 1, st_uid = 1337, st_gid = 1337, 
    st_rdev = 0, st_ino = 51121257, st_size = 114, st_blksize = 4096, st_blocks = 8, st_flags = 0, st_gen = 0, st_atim = {tv_sec = 1613053462, tv_nsec = 44915194}, st_mtim = {
      tv_sec = 1613053462, tv_nsec = 44915194}, st_ctim = {tv_sec = 1613053462, tv_nsec = 44915194}, st_birthtim = {tv_sec = 1613053461, tv_nsec = 792916622}}, new_path = 0x0, 
  file = -212013952, flags = 32508, mode = 319950448, nbufs = 32509, bufs = 0x0, off = 1, uid = 1, gid = 0, btime = 6.8984119244809235e-310, atime = 6.8984493493705491e-310, 
  mtime = 1.1857575500189917e-322, work_req = {work = 0xa0, done = 0x7efd13120d50, loop = 0x7efd4034de4d <jl_alloc_array_1d+509>, wq = {0x7efd0a925f10, 0x7efce3844300}}, bufsml = {{
      base = 0x7efd0a926690 "\270\062/\f\375~", len = 161}, {base = 0x7efd000000f0 "\001\fH", len = 139625091825664}, {base = 0x1 <error: Cannot access memory at address 0x1>, len = 1}, {
      base = 0x7efd0a925f10 "PC\204\343\374~", len = 139626178934501}}}
(rr) c
Continuing.

Thread 1 hit Breakpoint 13, ios_file (s=0x7efce3844550, fname=0x7efcf352e4a8 "/tmp/jl_MMOg18/Project.toml", rd=1, wr=0, create=0, trunc=0)
    at /buildworker/worker/package_linux64/build/src/support/ios.c:935
935     {
(rr) 
Continuing.

Thread 1 hit Breakpoint 13, ios_file (s=0x7efce3844650, fname=0x7efcf352e4a8 "/tmp/jl_MMOg18/Project.toml", rd=0, wr=1, create=1, trunc=1)
    at /buildworker/worker/package_linux64/build/src/support/ios.c:935
935     {
(rr) 
Continuing.

Thread 1 hit Breakpoint 19, uv_fs_stat (loop=0xbad10, req=0x7efd13120b70, path=0x7efcf352e4a8 "/tmp/jl_MMOg18/Project.toml", cb=0x0) at src/unix/fs.c:2071
2071    src/unix/fs.c: No such file or directory.
(rr) fin
Run till exit from #0  uv_fs_stat (loop=0xbad10, req=0x7efd13120b70, path=0x7efcf352e4a8 "/tmp/jl_MMOg18/Project.toml", cb=0x0) at src/unix/fs.c:2071
jl_stat (path=<optimized out>, statbuf=0x7efce370c050 "") at /buildworker/worker/package_linux64/build/src/sys.c:129
129         if (ret == 0)
Value returned is $143 = 0
(rr) p req
$144 = {data = 0x7efd0670f6c0, loop = 0xbad10, type = UV_FS, reserved = {0x7efcf352e4a8, 0x7efd41883700, 0x7efd00000000, 0x7efd407cd820 <jl_all_tls_states>, 0x7efcf352e4a0, 0x8}, 
  fs_type = UV_FS_STAT, cb = 0x0, result = 0, ptr = 0x7efd13120be0, path = 0x7efcf352e4a8 "/tmp/jl_MMOg18/Project.toml", statbuf = {st_dev = 55, st_mode = 33188, st_nlink = 1, st_uid = 1337, 
    st_gid = 1337, st_rdev = 0, st_ino = 51121257, st_size = 114, st_blksize = 4096, st_blocks = 8, st_flags = 0, st_gen = 0, st_atim = {tv_sec = 1613053462, tv_nsec = 44915194}, st_mtim = {
      tv_sec = 1613053462, tv_nsec = 44915194}, st_ctim = {tv_sec = 1613053462, tv_nsec = 44915194}, st_birthtim = {tv_sec = 1613053461, tv_nsec = 792916622}}, new_path = 0x0, 
  file = 1077436229, flags = 32509, mode = 319950448, nbufs = 32509, bufs = 0x0, off = 1, uid = 1, gid = 0, btime = 6.8984119244809235e-310, atime = 6.8984493493705491e-310, 
  mtime = 1.1857575500189917e-322, work_req = {work = 0xa0, done = 0x7efd13120d50, loop = 0x7efd4034de4d <jl_alloc_array_1d+509>, wq = {0x7efd0a926850, 0x7efce3844500}}, bufsml = {{
      base = 0x7efd0a926e10 "H3/\f\375~", len = 161}, {base = 0x7efd000000f0 "\001\fH", len = 139625091825664}, {base = 0x1 <error: Cannot access memory at address 0x1>, len = 1}, {
      base = 0x7efd0a926850 "PE\204\343\374~", len = 139626178934501}}}

(rr) p now
$156 = {sec = 1613053462, usec = 189462}

We can generate tons of failures by forcing it to slow down:

diff --git a/test/loading.jl b/test/loading.jl
index 377087a63f..6aecfce8bd 100644
--- a/test/loading.jl
+++ b/test/loading.jl
@@ -170,6 +171,7 @@ end
                         println(io, line)
                     end
                 end
+                i % 5 == 0 && sleep(0.11)
                 # look at lines and their order
                 n = findfirst(line -> startswith(line, "name"), p)
                 u = findfirst(line -> startswith(line, "uuid"), p)

KristofferC · 2021-02-12T07:49:41Z

We have a cache (specifically get_updated_dict) which won't be invalided if the Project file was being changed quickly, and now more than 0.1 seconds have elapsed since the last update.

So to be more explicit, what happens is that:

we read and cache a project file.
something re-writes this project file very quickly (same mtime) after it got cached and the inode, and size are identical.
time passes
now we want to get the dictionary for the project file, the question now is if we can use the cached project file or if we need to invalidate and re-parse

I can't really see any way we can differ between the cached data and the data for the new file here... If we controlled the process writing the Project file, we could "dirty" the cache.. Maybe this method of caching is flawed then?

PR #42328 (21ebabf) already fixed issue #35217 by removing the bad test. Now also remove the nonsensical comment condition from loading.jl.

Keno added packages Package management and loading ci Continuous integration labels Mar 22, 2020

Keno assigned StefanKarpinski Mar 22, 2020

Keno unassigned StefanKarpinski Mar 22, 2020

Keno mentioned this issue Mar 30, 2020

In open(f, ...), call flush before calling close #35303

Closed

KristofferC self-assigned this Feb 12, 2021

vtjnash mentioned this issue Mar 2, 2021

atomic file writing #39892

Open

vtjnash added a commit that referenced this issue Sep 28, 2022

remove bad test from get_updated_dict

7aff712

PR #42328 (21ebabf) already fixed issue #35217 by removing the bad test. Now also remove the nonsensical comment condition from loading.jl.

vtjnash mentioned this issue Sep 28, 2022

remove bad test from get_updated_dict #46944

Merged

vtjnash closed this as completed in #46944 Sep 29, 2022

vtjnash added a commit that referenced this issue Sep 29, 2022

remove bad test from get_updated_dict (#46944)

5588623

PR #42328 (21ebabf) already fixed issue #35217 by removing the bad test. Now also remove the nonsensical comment condition from loading.jl.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intermittent failure in loading test #35217

Intermittent failure in loading test #35217

Keno commented Mar 22, 2020 •

edited

Keno commented Mar 22, 2020

Keno commented Mar 22, 2020

Keno commented Mar 22, 2020

StefanKarpinski commented Mar 22, 2020

Keno commented Mar 22, 2020

StefanKarpinski commented Mar 22, 2020

Keno commented Nov 25, 2020

Keno commented Feb 11, 2021

vtjnash commented Feb 12, 2021

KristofferC commented Feb 12, 2021

Intermittent failure in loading test #35217

Intermittent failure in loading test #35217

Comments

Keno commented Mar 22, 2020 • edited

Keno commented Mar 22, 2020

Keno commented Mar 22, 2020

Keno commented Mar 22, 2020

StefanKarpinski commented Mar 22, 2020

Keno commented Mar 22, 2020

StefanKarpinski commented Mar 22, 2020

Keno commented Nov 25, 2020

Keno commented Feb 11, 2021

vtjnash commented Feb 12, 2021

KristofferC commented Feb 12, 2021

Keno commented Mar 22, 2020 •

edited