Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent failure in loading test #35217

Closed
Keno opened this issue Mar 22, 2020 · 10 comments · Fixed by #46944
Closed

Intermittent failure in loading test #35217

Keno opened this issue Mar 22, 2020 · 10 comments · Fixed by #46944
Assignees
Labels
ci Continuous integration packages Package management and loading

Comments

@Keno
Copy link
Member

Keno commented Mar 22, 2020

As previously discussed in #35183, we have an intermittent failure in loading.
Over the past few days I've run this test continuously on around 20 cores and
observe about a 0.5% failure rate. I tried reproducing it with just the random seed,
but that was unsuccessful. Finally, I simply serialized out the inputs to the test_find
function, which yielded a successful reproducer of the issue. The reproducer is as follows:

julia> roots
Dict{String,PkgId} with 3 entries:
  "B" => B [top-level]
  "A" => A [75d55ab9-8b58-4b6e-8ad9-7fc54a3c81d3]
  "C" => C [fa81875a-839c-476c-9a9a-63af890e68cb]

julia> show(IOContext(stdout, :verbose=>true, :compact=>false), graph)
Dict{PkgId,Dict{String,PkgId}}(A [75d55ab9-8b58-4b6e-8ad9-7fc54a3c81d3] => Dict("A" => A [75d55ab9-8b58-4b6e-8ad9-7fc54a3c81d3],"C" => C [fa81875a-839c-476c-9a9a-63af890e68cb]),C [fa81875a-839c-476c-9a9a-63af890e68cb] => Dict("C" => C [fa81875a-839c-476c-9a9a-63af890e68cb]))

julia> paths
Dict{PkgId,Union{Nothing, String}} with 3 entries:
  B [top-level]                            => "/tmp/jl_FyNXyg/B/src/B.jl"
  A [75d55ab9-8b58-4b6e-8ad9-7fc54a3c81d3] => "/tmp/jl_FyNXyg/A/src/A.jl"
  C [fa81875a-839c-476c-9a9a-63af890e68cb] => "/tmp/jl_FyNXyg/C/src/C.jl"

julia> push!(empty!(Base.LOAD_PATH), "/tmp/jl_FyNXyg"))

julia> test_find(roots, graph, paths)
Test Failed at REPL[16]:10
  Expression: id == get(roots, name, nothing)
   Evaluated: A [f2499fa5-f4f5-5ad4-85f6-2621ce3bb437] == A [75d55ab9-8b58-4b6e-8ad9-7fc54a3c81d3]
ERROR: There was an error during testing

julia> Base.dummy_uuid("/tmp/jl_FyNXyg/A/Project.toml")
UUID("f2499fa5-f4f5-5ad4-85f6-2621ce3bb437")

Both A and C have a project file, both are empty, B does not have a project file.

@Keno Keno added packages Package management and loading ci Continuous integration labels Mar 22, 2020
@Keno
Copy link
Member Author

Keno commented Mar 22, 2020

/tmp/jl_FyNXyg/
├── [         80]  A
│   ├── [          0]  Project.toml
│   └── [         60]  src
│       └── [          0]  A.jl
├── [         60]  B
│   └── [         60]  src
│       └── [          0]  B.jl
└── [         80]  C
    ├── [          0]  Project.toml
    └── [         60]  src
        └── [          0]  C.jl

In helpful tree form, I guess.

@Keno
Copy link
Member Author

Keno commented Mar 22, 2020

What I'm failing to understand is how the directory got into this form. My impression was that the test would at least write something into those .jl files.

@Keno
Copy link
Member Author

Keno commented Mar 22, 2020

Looks like this particular situation may have been caused by this test filling up all of /tmp. Apparently if we run out of space, we just silently drop file contents. That's probably not a great idea.

@StefanKarpinski
Copy link
Sponsor Member

Wow, this test and only this test failing is a really unexpected symptom of that situation.

@Keno
Copy link
Member Author

Keno commented Mar 22, 2020

I think there may still be a real issue here also. The out of space issue may just have higher frequency. This test really spams a lot of directories on /tmp

@StefanKarpinski
Copy link
Sponsor Member

Yeah, sorry, it does. It's a very thorough test 😬

Keno added a commit that referenced this issue Mar 30, 2020
In #35217, I noticed that if we call `close` on an IOStream, but the device
that has the underlying file is full, we silently truncate the file without
error. To remidy that, introduce an explicit flush call before closing the
file. This should either succeed and reflect the changes on disk, or fail
and throw an appropriate error.
@Keno
Copy link
Member Author

Keno commented Nov 25, 2020

This one looks like it's back:
https://build.julialang.org/#/builders/61/builds/5894/steps/5/logs/stdio

Test Failed at C:\buildbot\worker-tabularasa\tester_win32\build\share\julia\test\loading.jl:552
  Expression: path == get(paths, id, nothing)
   Evaluated: "C:\\Windows\\TEMP\\jl_VfUINM\\packages\\C\\IB6LF\\src\\C.jl" == "C:\\Windows\\TEMP\\jl_pumCto\\packages\\C\\IB6LF\\src\\C.jl"
Error in testset loading:
Test Failed at C:\buildbot\worker-tabularasa\tester_win32\build\share\julia\test\loading.jl:562
  Expression: path == get(paths, id, nothing)
   Evaluated: "C:\\Windows\\TEMP\\jl_VfUINM\\packages\\C\\IB6LF\\src\\C.jl" == "C:\\Windows\\TEMP\\jl_pumCto\\packages\\C\\IB6LF\\src\\C.jl"

@Keno
Copy link
Member Author

Keno commented Feb 11, 2021

We have an rr trace for this now. From https://build.julialang.org/#/builders/71/builds/621:

Error in testset loading:
Test Failed at /buildworker/worker/tester_linux64/build/share/julia/test/loading.jl:183
  Expression: root == if something(n, N + 1) ��� something(d, N + 1)
        nothing
    else
        if something(u, N + 1) < something(d, N + 1)
            root_uuid
        else
            proj_uuid
        end
    end
   Evaluated: UUID("800647f9-cfeb-4944-b1e5-3e1efd939c93") == nothing
Error in testset loading:
Test Failed at /buildworker/worker/tester_linux64/build/share/julia/test/loading.jl:185
  Expression: this == if something(d, N + 1) < something(t, N + 1) ��� N
        this_uuid
    else
        nothing
    end
   Evaluated: nothing == UUID("8dc43e05-7c3c-4a41-90b5-ced2ccd00485")
ERROR: LoadError: Test run finished with errors

https://julialang-dumps.s3.amazonaws.com/linux64/621/rr-run_621-gitsha_4f854b4e3f-2021-02-11_15_09_03.tar.zst

@vtjnash
Copy link
Sponsor Member

vtjnash commented Feb 12, 2021

Ah, this was a strange one to watch in the debugger, since it seemed like it would just jump over the bad test. This is a timing attack. We have a cache (specifically get_updated_dict) which won't be invalided if the Project file was being changed quickly, and now more than 0.1 seconds have elapsed since the last update. However, in this case we got very unlucky, and just after writing the new file, we ran GC, which needed to cleanup a few tens of thousands of IOStream finalizers (all fairly quick, since they are already closed, there's just a lot of them, since this test is somewhat exhaustive), before we get to that test.

(rr) p req
$142 = {data = 0x7efd0670f6c0, loop = 0xbad10, type = UV_FS, reserved = {0x7efcf352e4a8, 0x7efd13120c90, 0x7efd0a925f10, 0x7efce3844350, 0x7efcf352e4a0, 0x7efd0a926450}, fs_type = UV_FS_STAT, 
  cb = 0x0, result = 0, ptr = 0x7efd13120be0, path = 0x7efcf352e4a8 "/tmp/jl_MMOg18/Project.toml", statbuf = {st_dev = 55, st_mode = 33188, st_nlink = 1, st_uid = 1337, st_gid = 1337, 
    st_rdev = 0, st_ino = 51121257, st_size = 114, st_blksize = 4096, st_blocks = 8, st_flags = 0, st_gen = 0, st_atim = {tv_sec = 1613053462, tv_nsec = 44915194}, st_mtim = {
      tv_sec = 1613053462, tv_nsec = 44915194}, st_ctim = {tv_sec = 1613053462, tv_nsec = 44915194}, st_birthtim = {tv_sec = 1613053461, tv_nsec = 792916622}}, new_path = 0x0, 
  file = -212013952, flags = 32508, mode = 319950448, nbufs = 32509, bufs = 0x0, off = 1, uid = 1, gid = 0, btime = 6.8984119244809235e-310, atime = 6.8984493493705491e-310, 
  mtime = 1.1857575500189917e-322, work_req = {work = 0xa0, done = 0x7efd13120d50, loop = 0x7efd4034de4d <jl_alloc_array_1d+509>, wq = {0x7efd0a925f10, 0x7efce3844300}}, bufsml = {{
      base = 0x7efd0a926690 "\270\062/\f\375~", len = 161}, {base = 0x7efd000000f0 "\001\fH", len = 139625091825664}, {base = 0x1 <error: Cannot access memory at address 0x1>, len = 1}, {
      base = 0x7efd0a925f10 "PC\204\343\374~", len = 139626178934501}}}
(rr) c
Continuing.

Thread 1 hit Breakpoint 13, ios_file (s=0x7efce3844550, fname=0x7efcf352e4a8 "/tmp/jl_MMOg18/Project.toml", rd=1, wr=0, create=0, trunc=0)
    at /buildworker/worker/package_linux64/build/src/support/ios.c:935
935     {
(rr) 
Continuing.

Thread 1 hit Breakpoint 13, ios_file (s=0x7efce3844650, fname=0x7efcf352e4a8 "/tmp/jl_MMOg18/Project.toml", rd=0, wr=1, create=1, trunc=1)
    at /buildworker/worker/package_linux64/build/src/support/ios.c:935
935     {
(rr) 
Continuing.

Thread 1 hit Breakpoint 19, uv_fs_stat (loop=0xbad10, req=0x7efd13120b70, path=0x7efcf352e4a8 "/tmp/jl_MMOg18/Project.toml", cb=0x0) at src/unix/fs.c:2071
2071    src/unix/fs.c: No such file or directory.
(rr) fin
Run till exit from #0  uv_fs_stat (loop=0xbad10, req=0x7efd13120b70, path=0x7efcf352e4a8 "/tmp/jl_MMOg18/Project.toml", cb=0x0) at src/unix/fs.c:2071
jl_stat (path=<optimized out>, statbuf=0x7efce370c050 "") at /buildworker/worker/package_linux64/build/src/sys.c:129
129         if (ret == 0)
Value returned is $143 = 0
(rr) p req
$144 = {data = 0x7efd0670f6c0, loop = 0xbad10, type = UV_FS, reserved = {0x7efcf352e4a8, 0x7efd41883700, 0x7efd00000000, 0x7efd407cd820 <jl_all_tls_states>, 0x7efcf352e4a0, 0x8}, 
  fs_type = UV_FS_STAT, cb = 0x0, result = 0, ptr = 0x7efd13120be0, path = 0x7efcf352e4a8 "/tmp/jl_MMOg18/Project.toml", statbuf = {st_dev = 55, st_mode = 33188, st_nlink = 1, st_uid = 1337, 
    st_gid = 1337, st_rdev = 0, st_ino = 51121257, st_size = 114, st_blksize = 4096, st_blocks = 8, st_flags = 0, st_gen = 0, st_atim = {tv_sec = 1613053462, tv_nsec = 44915194}, st_mtim = {
      tv_sec = 1613053462, tv_nsec = 44915194}, st_ctim = {tv_sec = 1613053462, tv_nsec = 44915194}, st_birthtim = {tv_sec = 1613053461, tv_nsec = 792916622}}, new_path = 0x0, 
  file = 1077436229, flags = 32509, mode = 319950448, nbufs = 32509, bufs = 0x0, off = 1, uid = 1, gid = 0, btime = 6.8984119244809235e-310, atime = 6.8984493493705491e-310, 
  mtime = 1.1857575500189917e-322, work_req = {work = 0xa0, done = 0x7efd13120d50, loop = 0x7efd4034de4d <jl_alloc_array_1d+509>, wq = {0x7efd0a926850, 0x7efce3844500}}, bufsml = {{
      base = 0x7efd0a926e10 "H3/\f\375~", len = 161}, {base = 0x7efd000000f0 "\001\fH", len = 139625091825664}, {base = 0x1 <error: Cannot access memory at address 0x1>, len = 1}, {
      base = 0x7efd0a926850 "PE\204\343\374~", len = 139626178934501}}}

(rr) p now
$156 = {sec = 1613053462, usec = 189462}

We can generate tons of failures by forcing it to slow down:

diff --git a/test/loading.jl b/test/loading.jl
index 377087a63f..6aecfce8bd 100644
--- a/test/loading.jl
+++ b/test/loading.jl
@@ -170,6 +171,7 @@ end
                         println(io, line)
                     end
                 end
+                i % 5 == 0 && sleep(0.11)
                 # look at lines and their order
                 n = findfirst(line -> startswith(line, "name"), p)
                 u = findfirst(line -> startswith(line, "uuid"), p)

@KristofferC KristofferC self-assigned this Feb 12, 2021
@KristofferC
Copy link
Sponsor Member

We have a cache (specifically get_updated_dict) which won't be invalided if the Project file was being changed quickly, and now more than 0.1 seconds have elapsed since the last update.

So to be more explicit, what happens is that:

  • we read and cache a project file.
  • something re-writes this project file very quickly (same mtime) after it got cached and the inode, and size are identical.
  • time passes
  • now we want to get the dictionary for the project file, the question now is if we can use the cached project file or if we need to invalidate and re-parse

I can't really see any way we can differ between the cached data and the data for the new file here... If we controlled the process writing the Project file, we could "dirty" the cache.. Maybe this method of caching is flawed then?

vtjnash added a commit that referenced this issue Sep 28, 2022
PR #42328 (21ebabf) already fixed
issue #35217 by removing the bad test. Now also remove the nonsensical
comment condition from loading.jl.
vtjnash added a commit that referenced this issue Sep 29, 2022
PR #42328 (21ebabf) already fixed
issue #35217 by removing the bad test. Now also remove the nonsensical
comment condition from loading.jl.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci Continuous integration packages Package management and loading
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants