DTables.groupby causes issues when multiple processes available #450
I just came across basically the same error, but with a slightly different stacktrace:

julia> using Distributed; addprocs(10); @everywhere using Dagger, DTables, DataFrames
julia> dt = DTable(DataFrame(a = 1:100, b = rand(1:5, 100)))
DTable with 1 partitions
Tabletype: DataFrame
julia> gdt = groupby(dt, :b)
ERROR: ThunkFailedException:
Root Exception Type: RemoteException
Root Exception:
On worker 5:
ConcurrencyViolationError("lock must be held")
Stacktrace:
[1] concurrency_violation
@ ./condition.jl:8
[2] assert_havelock
@ ./condition.jl:25 [inlined]
[3] assert_havelock
@ ./condition.jl:48 [inlined]
[4] assert_havelock
@ ./condition.jl:72 [inlined]
[5] _wait2
@ ./condition.jl:83
[6] #wait#621
@ ./condition.jl:127
[7] wait
@ ./condition.jl:125 [inlined]
[8] wait_for_conn
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:195
[9] check_worker_state
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:170
[10] send_msg_
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/messages.jl:172
[11] send_msg
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/messages.jl:122 [inlined]
[12] #remotecall_fetch#159
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:460
[13] remotecall_fetch
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:454
[14] #remotecall_fetch#162
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
[15] remotecall_fetch
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
[16] #171
@ ~/.julia/packages/MemPool/l9nLj/src/datastore.jl:424 [inlined]
[17] forwardkeyerror
@ ~/.julia/packages/MemPool/l9nLj/src/datastore.jl:409
[18] poolget
@ ~/.julia/packages/MemPool/l9nLj/src/datastore.jl:423
[19] move
@ ~/.julia/packages/Dagger/M13n0/src/chunks.jl:98
[20] move
@ ~/.julia/packages/Dagger/M13n0/src/chunks.jl:96 [inlined]
[21] move
@ ~/.julia/packages/Dagger/M13n0/src/chunks.jl:102
[22] #fetch#70
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:21
[23] fetch
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:11 [inlined]
[24] #fetch#75
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:58 [inlined]
[25] fetch
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:54 [inlined]
[26] build_groupby_index
@ ~/.julia/packages/DTables/BjdY2/src/operations/groupby.jl:176
[27] #invokelatest#2
@ ./essentials.jl:819 [inlined]
[28] invokelatest
@ ./essentials.jl:816 [inlined]
[29] #43
@ ~/.julia/packages/Dagger/M13n0/src/processor.jl:162
Stacktrace:
[1] wait
@ ./task.jl:349 [inlined]
[2] fetch
@ ./task.jl:369 [inlined]
[3] #execute!#42
@ ~/.julia/packages/Dagger/M13n0/src/processor.jl:172
[4] execute!
@ ~/.julia/packages/Dagger/M13n0/src/processor.jl:157
[5] #158
@ ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:1551 [inlined]
[6] #21
@ ~/.julia/packages/Dagger/M13n0/src/options.jl:17 [inlined]
[7] #1
@ ~/.julia/packages/ScopedValues/92HJZ/src/ScopedValues.jl:163
[8] with_logstate
@ ./logging.jl:514
[9] with_logger
@ ./logging.jl:626 [inlined]
[10] enter_scope
@ ~/.julia/packages/ScopedValues/92HJZ/src/payloadlogger.jl:17 [inlined]
[11] with
@ ~/.julia/packages/ScopedValues/92HJZ/src/ScopedValues.jl:162
[12] with_options
@ ~/.julia/packages/Dagger/M13n0/src/options.jl:16
[13] do_task
@ ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:1549
[14] macro expansion
@ ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:1243 [inlined]
[15] #132
@ ./task.jl:134
This Thunk: Thunk(id=3, build_groupby_index(true, 0, DataFrame, Thunk[2](#134, Any[Union{Dagger.EagerThunk, Dagger.Chunk}[Dagger.Chunk{DataFrame, MemPool.DRef, OSProc, AnyScope}(DataFrame, UnitDomain(), MemPool.DRef(1, 0, 0x0000000000000908), OSProc(1), AnyScope(), false)], DTables.var"#123#125"{Symbol, DTables.var"#122#124"}(:b, DTables.var"#122#124"())])))
Stacktrace:
[1] fetch(t::Dagger.ThunkFuture; proc::OSProc, raw::Bool)
@ Dagger ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:16
[2] fetch
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:11 [inlined]
[3] #fetch#75
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:58 [inlined]
[4] fetch
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:54 [inlined]
[5] _groupby(d::DTable, row_function::Function, cols::Vector{Symbol}, merge::Bool, chunksize::Int64)
@ DTables ~/.julia/packages/DTables/BjdY2/src/operations/groupby.jl:132
[6] #groupby#121
@ ~/.julia/packages/DTables/BjdY2/src/operations/groupby.jl:38 [inlined]
[7] groupby(d::DTable, col::Symbol)
@ DTables ~/.julia/packages/DTables/BjdY2/src/operations/groupby.jl:35
[8] top-level scope
@ REPL[3]:1

The above error appears to go away when I wrap the code with

@StevenWhitaker can you try this with Julia

Your PR does seem to fix the issue!

Awesome! Considering this isn't our bug, I'm going to close this, with the understanding that I plan to get that PR merged.
I have some code that involves several operations on DTables. I ran my code with nprocs() equal to 1, and everything worked fine. I then added some processes so that nprocs() equaled 5 and ran my code again on worker 1 (so I didn't explicitly use any of the added workers). In this case, my code would hang when calling reduce on a GDTable (i.e., after calling groupby).

I tried to create a MWE, but I haven't yet been able to find one that hangs. Fortunately, I did find a MWE that gives a different error (ConcurrencyViolationError("lock must be held")); hopefully this error and the hanging I'm experiencing are different manifestations of the same issue.

EDIT: The next comment contains a simpler MWE that produces the same error (slightly different stacktrace, though).
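For context, here is a minimal sketch of the kind of pipeline described above, assuming a toy table and a grouping column :b; the reduce(+, ...) call only illustrates "calling reduce on a GDTable" and is not the reporter's actual code.

```julia
using Distributed
addprocs(4)                                   # nprocs() == 5, as in the report
@everywhere using Dagger, DTables, DataFrames

dt  = DTable(DataFrame(a = 1:100, b = rand(1:5, 100)))  # single-partition DTable
gdt = groupby(dt, :b)                         # group by column :b
r   = reduce(+, gdt)                          # returns a Dagger task; this is the step reported to hang
fetch(r)                                      # wait for the per-group reduction result on worker 1
```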
Contents of mwe.jl:

Error:
Some notes:
- f is called via a remotecall_fetch (a hypothetical sketch matching these notes follows below).
- "file.csv" is a 157 MB table with 233930 rows and 102 columns of String and Float64 values. I tried to generate data to keep the MWE self-contained, but wasn't successful.
- Setting nworkers to 10 in this MWE seems to make the error happen more frequently. I'm guessing the previous MWE also would have exhibited this error if nworkers was larger.
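The actual contents of mwe.jl were not preserved above, so the following is only a hypothetical reconstruction consistent with the notes. The function name f and the remotecall_fetch come from the notes; the CSV.read step, the grouping column :b, the reduced column :x1, and the reduce call itself are assumptions.

```julia
using Distributed
addprocs(10)
@everywhere using CSV, DataFrames, Dagger, DTables

@everywhere function f()
    df = CSV.read("file.csv", DataFrame)       # ~157 MB, 233930 rows, String and Float64 columns
    dt = DTable(df)                            # wrap the in-memory table in a DTable
    gd = groupby(dt, :b)                       # :b is a placeholder grouping column
    return fetch(reduce(+, gd; cols = [:x1]))  # :x1 is a placeholder Float64 column
end

remotecall_fetch(f, 1)                         # the notes say f is invoked via a remotecall_fetch
```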