Description
It seems that dealing with missing values is one of the most painful issues we have, which goes against the very powerful and convenient DataFrames API. Having to write things like filter(:col => x -> coalesce(x > 1, false), df) or combine(gd, :col => (x -> sum(skipmissing(x)))) isn't ideal. One proposal to alleviate this is #2258: add a skipmissing argument to functions like filter, select, transform and combine to unify the way one can skip missing values, instead of having to use different syntaxes which are hard for newcomers to grasp and make the code harder to read.
That would be one step towards being more user-friendly, but one would still have to repeat skipmissing=true all the time when dealing with missing values. I figured two solutions could be considered to improve this:
- Have DataFramesMeta simplify the handling of missing values. This could be via a dedicated e.g. @linqskipmissing macro or a statement like skipmissing within a @linq block that would automatically pass skipmissing=true to all subsequent operations in a chain or block. This wouldn't really help with operations outside such blocks though.
- Have a field in DataFrame objects that would store the default value to use for the skipmissing argument. By default it would be false, so that you get the current behavior, which ensures safety via propagation of missing values. But when you know you are working with a data set with missing values, you would be able to call skipmissing!(df, true) once and then avoid repeating it.
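To make the second option more concrete, here is a minimal sketch of the idea using only Base Julia. Everything in it is hypothetical: MissingAwareTable, its skipmissing_default field, reduce_col and this skipmissing! method are illustration names, not part of DataFrames.jl, and a real implementation would presumably live inside DataFrame itself.

```julia
# Hypothetical stand-in for a DataFrame that remembers a default for skipmissing.
mutable struct MissingAwareTable
    columns::Dict{Symbol, Vector}
    skipmissing_default::Bool
end

# Default to false so that missing values propagate, as they do today.
MissingAwareTable(columns) = MissingAwareTable(columns, false)

# Hypothetical setter mirroring the proposed skipmissing!(df, true).
function skipmissing!(t::MissingAwareTable, flag::Bool)
    t.skipmissing_default = flag
    return t
end

# A reduction that uses the stored default unless the caller overrides it.
# The keyword shadows Base.skipmissing here, hence the qualified call.
function reduce_col(f, t::MissingAwareTable, col::Symbol;
                    skipmissing::Bool = t.skipmissing_default)
    v = t.columns[col]
    return skipmissing ? f(Base.skipmissing(v)) : f(v)
end

t = MissingAwareTable(Dict{Symbol, Vector}(:col => [1, missing, 3]))
before = reduce_col(sum, t, :col)                       # missing (safe default)
skipmissing!(t, true)                                   # opt in once
after = reduce_col(sum, t, :col)                        # 4
forced = reduce_col(sum, t, :col; skipmissing = false)  # missing (per-call override)
```

The per-call keyword still wins over the stored default, so code that relies on propagation can ask for it explicitly.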
Somewhat similar discussions happened a long time ago (but at the array rather than the data frame level) at JuliaStats/DataArrays.jl#39. I think it's fair to say that we now have enough experience to make a decision. One argument against implementing this at the DataFrame level is that it will have no effect on operations applied directly to column vectors, like sum(df.col). But that's better than nothing.
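For reference, this is how a plain column vector behaves in Base Julia, which a data-frame-level default would not change. The vector col below is made-up example data standing in for df.col.

```julia
col = [1, missing, 3]   # stand-in for df.col

# Reductions propagate missing unless it is skipped explicitly.
total_propagated = sum(col)               # missing
total_skipped = sum(skipmissing(col))     # 4

# Comparisons return missing too, so a predicate needs coalesce
# to turn missing into false before filter can use it.
kept = filter(x -> coalesce(x > 1, false), col)   # [3]
```

Without the coalesce, the predicate would return missing for the second element and filter would throw, since missing cannot be used in a boolean context.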