On Julia 1.7.3, I've found that downloads sometimes fail with the following error:
UNHANDLED TASK ERROR: IOError: FDWatcher: bad file descriptor (EBADF)
Stacktrace:
[1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
@ Base ./task.jl:812
[2] wait()
@ Base ./task.jl:872
[3] wait(c::Base.GenericCondition{Base.Threads.SpinLock})
@ Base ./condition.jl:123
[4] wait(fdw::FileWatching._FDWatcher; readable::Bool, writable::Bool)
@ FileWatching /usr/local/julia/share/julia/stdlib/v1.7/FileWatching/src/FileWatching.jl:533
[5] wait
@ /usr/local/julia/share/julia/stdlib/v1.7/FileWatching/src/FileWatching.jl:504 [inlined]
[6] macro expansion
@ /usr/local/julia/share/julia/stdlib/v1.7/Downloads/src/Curl/Multi.jl:166 [inlined]
The line this points to in the Downloads.jl source is Downloads.jl/src/Curl/Multi.jl, line 166 at commit b0f23d0:
    events = try wait(watcher)
"Sometimes" here means "after millions of S3 requests in the span of multiple days of runtime with retry around the actual request-making code". (retry using the default settings, so with the default ExponentialBackOff schedule with a single retry). When this error occurred, it occurred multiple times, on multiple different pods (which by design are accessing different s3 URIs but still in the same region), so I'm wondering if it is somehow related to the "connection pool corruption" issue w/ AWS.jl. Another possibly relevant bit of context is that the code that actually is making the requests is actually doing an asyncmap over dozens (<100) of small s3 GET requests.
I'm afraid this happened in a long-running job that I can't interact with directly, and I don't have a reprex I can share, but I wanted to open an issue in case someone else has seen this, or has advice on how to debug it or what other information would be useful!