Thread Safety #1905

@clintonTE

Not sure if this is a feature request or a bug report. It is unclear to me which DataFrames operations, if any, are thread-safe. For example, I would have expected the code below to be thread-safe, since each thread operates on a different part of memory, yet it explodes in a cloud of corruption:

using DataFrames
#WARNING: DO NOT RUN THIS

function tsmwecorrupt(N=100_000)
  df = DataFrame(rand(N,100))
  df.grpcol = (i->i%50).(1:N)

  Threads.@threads for sdf in groupby(df, :grpcol)
    sdf.x3 .= -1.
  end

  println(sum(df.x3))
end

tsmwecorrupt()

OUTPUT: 
(many pages of garbage)

On the other hand, this code seems fine:

function tsmwe(N=100_000)
  df = DataFrame(rand(N,100))
  df.grpcol = (i->i%50).(1:N)

  Threads.@threads for r in eachrow(df)
    r.x3 = -1.
  end

  println(sum(df.x3))
end

tsmwe()

OUTPUT: 
-100000.0

and so does this:

function tsmwe2(N=100_000)
  df = DataFrame(rand(N,100))
  df.grpcol = (i->i%50).(1:N)

  Threads.@threads for sdf in collect(groupby(df, :grpcol))
    sdf.x3 .= -1.
  end

  println(sum(df.x3))
end


tsmwe2()

OUTPUT: 
-100000.0
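
For comparison, here is a minimal sketch that keeps every DataFrames object out of the threaded loop and writes only to a plain vector. It is not taken from the DataFrames API: `fill_by_group!` is a hypothetical helper, and the group row indices are built serially by hand instead of through `groupby`. The only assumption about threading is that tasks writing to disjoint indices of an ordinary `Vector` do not interfere with each other.

```julia
# Hypothetical helper (not part of DataFrames): set `col[i] = value` for every row,
# handling one group of rows per loop iteration.
function fill_by_group!(col::Vector{Float64}, grp::Vector{Int}, value::Float64)
  # Build the row indices of each group serially, before any threads start.
  groups = Dict{Int,Vector{Int}}()
  for (i, g) in enumerate(grp)
    push!(get!(() -> Int[], groups, g), i)
  end
  idxs = collect(values(groups))  # an indexable Vector, as Threads.@threads expects

  # Each iteration writes to a disjoint set of rows of a plain Vector.
  Threads.@threads for j in eachindex(idxs)
    col[idxs[j]] .= value
  end
  return col
end

N = 100_000
x3 = rand(N)
grpcol = (i -> i % 50).(1:N)
fill_by_group!(x3, grpcol, -1.0)
println(sum(x3))
```

With a DataFrame, the equivalent call would presumably be `fill_by_group!(df.x3, df.grpcol, -1.0)`, assuming `df.x3` returns the stored `Vector{Float64}` column rather than a copy.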

If this is not a bug, then perhaps this issue can serve as a feature request for thread safety under a wider variety of use cases.

EDIT: Also see 1896
