Skipping missing values more easily #2314

@nalimilan

Dealing with missing values seems to be one of the most painful issues we have, which goes against the otherwise powerful and convenient DataFrames API. Having to write things like filter(:col => x -> coalesce(x > 1, false), df) or combine(gd, :col => (x -> sum(skipmissing(x)))) isn't ideal. One proposal to alleviate this is #2258: add a skipmissing argument to functions like filter, select, transform and combine, to unify the way one skips missing values instead of relying on different syntaxes that are hard for newcomers to grasp and make the code harder to read.
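To make the pain point concrete, here is a minimal sketch using only Base Julia on a plain vector (the vector `v` and the variable names are made up for illustration). It shows the two behaviors that force the workarounds quoted above: comparisons propagate missing, so a predicate has to be wrapped in coalesce, and reductions propagate missing, so the input has to be wrapped in skipmissing.

```julia
# Illustrative data: one missing entry in an integer column.
v = [1, missing, 3]

# A comparison with missing yields missing, not false ...
cmp = missing > 1

# ... so a filter predicate must turn missing into false explicitly.
kept = filter(x -> coalesce(x > 1, false), v)

# A reduction over data containing missing returns missing ...
propagated = sum(v)

# ... unless the missing entries are skipped explicitly.
total = sum(skipmissing(v))

println(ismissing(cmp))         # true
println(kept)
println(ismissing(propagated))  # true
println(total)                  # 4
```

The DataFrames calls in the paragraph above apply the same two functions, just inside the column => function pairs passed to filter and combine.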

That would be one step towards being more user-friendly, but one would still have to repeat skipmissing=true all the time when dealing with missing values. I figured two solutions could be considered to improve this:

  • Have DataFramesMeta simplify the handling of missing values. This could be done via a dedicated macro (e.g. @linqskipmissing) or via a statement like skipmissing within a @linq block, which would automatically pass skipmissing=true to all subsequent operations in the chain or block. This wouldn't help with operations outside such blocks, though.
  • Have a field in DataFrame objects that would store the default value to use for the skipmissing argument. By default it would be false, so that you get the current behavior, which ensures safety via propagation of missing values. But when you know you are working with a data set with missing values, you would be able to call skipmissing!(df, true) once and then avoid repeating it.
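The second option could look roughly like the sketch below. Everything in it is hypothetical: `SkipTable`, `set_skipmissing!` and `reduce_column` are illustrative names invented here, not part of DataFrames.jl, and the toy container only stands in for a DataFrame. The point is the lookup order: an explicit skipmissing keyword wins, otherwise the per-table default applies, and that default starts out as false.

```julia
# Hypothetical toy container standing in for a DataFrame (not DataFrames.jl API).
mutable struct SkipTable
    columns::Dict{Symbol,Vector}
    skipmissing::Bool  # per-table default; false keeps missing propagation
end

# New tables default to the safe, propagating behavior.
SkipTable(columns) = SkipTable(columns, false)

# Hypothetical analogue of the proposed skipmissing!(df, true).
function set_skipmissing!(t::SkipTable, flag::Bool)
    t.skipmissing = flag
    return t
end

# Apply a reduction f to one column. An explicit keyword overrides the
# table default; `nothing` means "use whatever the table says".
function reduce_column(f, t::SkipTable, col::Symbol;
                       skipmissing::Union{Bool,Nothing}=nothing)
    skip = skipmissing === nothing ? t.skipmissing : skipmissing
    v = t.columns[col]
    # The keyword shadows the function name here, hence Base.skipmissing.
    return skip ? f(Base.skipmissing(v)) : f(v)
end

t = SkipTable(Dict{Symbol,Vector}(:col => [1, missing, 3]))

before = reduce_column(sum, t, :col)                       # default: propagate
set_skipmissing!(t, true)                                  # opt in once
after = reduce_column(sum, t, :col)                        # now skips missing
forced = reduce_column(sum, t, :col; skipmissing=false)    # per-call override

println(ismissing(before))  # true
println(after)              # 4
println(ismissing(forced))  # true
```

Keeping the keyword as a three-state value (true, false, or unset) is one possible design choice; it lets a single call opt back into propagation even after the table-level default has been switched on.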

Somewhat similar discussions happened a long time ago (but at the array rather than the data frame level) at JuliaStats/DataArrays.jl#39. I think it's fair to say that we now have enough experience to make a decision. One argument against implementing this at the DataFrame level is that it will have no effect on operations applied directly to column vectors, like sum(df.col). But that's better than nothing.

Cc: @bkamins, @matthieugomez, @pdeffebach, @mkborregaard
