Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add lift function for working with missing values #26661

Closed
wants to merge 1 commit into from

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Mar 30, 2018

Folows discussion in #26631 to add generic lifting method for any function.

The implementation lifts only over positional arguments. The three implemented methods follow results of benchmarking. In particular for a single positional argument it is faster not to call any function (at least currently). Also in lift(f, x; kw...) it is faster not to call lift(f) but directly perform the test. Some limitation is that in lift(f) approach we are not able to differentiate between single or multiple positional argument cases. Therefore lift(f, x; kw...) style will be faster for single positional argument case.

CC @nalimilan

@martinholters
Copy link
Member

I find the term lift a bit too generic to be limited to missing. Not that I have good ideas for alternatives, though...

@StefanKarpinski
Copy link
Sponsor Member

How about calling it mift, short for "missing lift"? :trollface:

@Sacha0
Copy link
Member

Sacha0 commented Mar 30, 2018

mift, the little-known alternative spelling of miffed, commonly used when irked by values missing from your dataset.

@mbauman
Copy link
Sponsor Member

mbauman commented Mar 30, 2018

I'm a little surprised that there's a perf difference there — typically those tuple functions are highly optimized and inline like crazy. You could work around the perf difference between lift(f, x) and lift(f)(x) with a function object:

struct Lifted{F} <: Function end
lift(f) = Lifted{f}()
(::Lifted{f})(x; kws...) where {f} = ismissing(x) ? missing : f(x; kws...)
(::Lifted{f})(xs...; kws...) where {f} = any(ismissing, xs) ? missing : f(xs...; kws...)

We could then always use the lift(f)(xs...) call syntax, and define lift(f, sentinel=missing) to store a sentinel value in the struct, making this extensible.

Just a spitball.

@bkamins
Copy link
Member Author

bkamins commented Mar 30, 2018

@mbauman Nice, This is almost as fast as lift(f, x...; kw...) (around 20ns overhead) and other options are significantly slower. Also given your approach I agree that lift(f, x...; kw...) can be removed (although I liked its convenience).

Additionally after some analysis I have second thoughts about kw keyword arguments, as we cannot control if they are valid if x is missing, eg. lift(uppercase)(missing, foo="bar"). So maybe it is safer to remove this support (user can always use anonymous function to fix keyword argument).

So the approach could be something like (if I understand your idea correctly - I am extending the sentinel concept to also specify return value separately):

struct Lifted{F,S,V} <: Function end
lift(f, sentinel=missing, value=missing) = Lifted{f, sentinel, value}()
(::Lifted{F,S,V})(x) where {F,S,V} = isequal(x, S) ? V : F(x)
(::Lifted{F,S,V})(xs...) where {F,S,V} = any(x->isequal(x, S), xs) ? V : F(xs...)

which would give maximum flexibility.

@nalimilan
Copy link
Member

Additionally after some analysis I have second thoughts about kw keyword arguments, as we cannot control if they are valid if x is missing, eg. lift(uppercase)(missing, foo="bar"). So maybe it is safer to remove this support (user can always use anonymous function to fix keyword argument).

In the same vein, it would be weird that lift(f)(missing, missing) would work when there is no two-argument method for f. It's probably not a serious problem in practice, since people will get an error when non-missing arguments are passed, but it's not totally satisfying either. Maybe all that is needed is to call Core.Compiler.return_type with Tuple{Any,Any} (for two arguments) and throw an error if it returns Union{}?

Regarding the extension to other sentinels than missing, it would make sense to consider coalesce at the same time. coalesce gets rid of both missing and nothing, so maybe we should do the same here by default, i.e. return either missing or nothing if one of the inputs happens to be equal to one of these?

@nalimilan nalimilan added the domain:missing data Base.missing and related functionality label Mar 31, 2018
@bkamins
Copy link
Member Author

bkamins commented Mar 31, 2018

Using Core.Compiler.return_type should be OK (it has low overhead).
But keyword arguments will not be handled this way - right?

As for coalesce - then if we wanted to be consistent with it we should unwrap Some.

In general in the design of lift (or some other better name) I would prefer speed and simplicity over flexibility. The reason is that we need this function to provide Missing a proper lifting as this is an expected functionality and those will be 99% of use-cases. Not sure (but the difference is that coalesce can be efficient thanks to dispatch and here we will have to do testing inside a function).

@nalimilan
Copy link
Member

Using Core.Compiler.return_type should be OK (it has low overhead).
But keyword arguments will not be handled this way - right?
Right. I was just thinking about positional arguments. Support for keyword arguments can always be added later (or not) without breaking anything anyway.

As for coalesce - then if we wanted to be consistent with it we should unwrap Some.
Hmm, that's indeed something which would be interesting to support to make it easier to work with Union{Some{T},Nothing}. I guess a lifted function would unwrap Some arguments before passing them?

In general in the design of lift (or some other better name) I would prefer speed and simplicity over flexibility. The reason is that we need this function to provide Missing a proper lifting as this is an expected functionality and those will be 99% of use-cases. Not sure (but the difference is that coalesce can be efficient thanks to dispatch and here we will have to do testing inside a function).

I don't think there's a tradeoff between flexibility and speed, as it would be handled at compile time anyway (that's the case for return_type when the types are known). There is definitely a tradeoff between simplicity and flexibility, though, which is why we provide skipmissing even if that operation can also be achieved with Iterators.filter. Here the issue is that if we claim the term lift, it would be problematic to reserve it for missing when it would be equally useful for nothing. coalesce handles both and it doesn't seem to be too complex, so maybe we can also find a good solution here.

@bkamins
Copy link
Member Author

bkamins commented Mar 31, 2018

OK - let us try this path and start with the specification:

  1. should lift handle Some like coalesce (actually this could be nice as then we can pass missing or nothing by wrapping it in Some if there is a need to actually pass this value, but actually handling Some was my biggest performance issue with making lift similar to coalesce, but maybe there is none)?
  2. I am not clear what should happen if missing and nothing is mixed in lift - when should missing and when nothing be returned?

@quinnj
Copy link
Member

quinnj commented Apr 1, 2018

Do we have examples of why we need lifting? Is it really that useful? Anybody have a real use-case where they needed lifting? (honestly curious here, not just trying to bat this PR down)

@bkamins
Copy link
Member Author

bkamins commented Apr 1, 2018

This was the original reason:

using CSV
s = "A\nx\ny\n\nv"
df = CSV.read(IOBuffer(s))
uppercase.(df[:A])

and the last line fails, because column A of df contains missing. Of course we could make uppercase function handle missing but then there will be always functions that are not designed to handle missing.

@nalimilan
Copy link
Member

Yeah, it's needed for all cases where we don't provide lifted methods (see also #26631). Actually lift could be even more useful for nothing than for missing, especially when using Some, since in that case no methods are defined at all.

@bkamins
Copy link
Member Author

bkamins commented Apr 1, 2018

@nalimilan This is an implementation working like coalesce. Additionally it supports lift(f, x) form. If this is the direction we agree on I will update the PR:

struct Lifted{F} <: Function end
lift(f) = Lifted{f}()
lift(f, x...) = lift(f)(x...)

function (::Lifted{F})(x::Missing) where {F}
    if Core.Compiler.return_type(F, Tuple{Any}) == Union{}
        throw(ArgumentError("wrong number of function arguments"))
    end
    missing
end

function (::Lifted{F})(x::Nothing) where {F}
    if Core.Compiler.return_type(F, Tuple{Any}) == Union{}
        throw(ArgumentError("wrong number of function arguments"))
    end
    nothing
end

(::Lifted{F})(x::Some) where {F} = F(coalesce(x))
(::Lifted{F})(x) where {F} = F(x)

function (::Lifted{F})(xs...) where {F}
    for x in xs
        if ismissing(x)
            if Core.Compiler.return_type(F, NTuple{length(xs), Any}) == Union{}
                throw(ArgumentError("wrong number of function arguments"))
            end
            return missing
        end
        if isa(x, Nothing)
            if Core.Compiler.return_type(F, NTuple{length(xs), Any}) == Union{}
                throw(ArgumentError("wrong number of function arguments"))
            end
            return nothing
        end
    end
    F(map(coalesce, xs)...)
end

@quinnj
Copy link
Member

quinnj commented Apr 1, 2018

I kind of lean towards thinking we don't really need this. I think it's an easy thing to think we need, but if you really break down use-cases, I really wonder how much convenience this would actually provide. I mean, in @bkamins's example, you could do

uppercase.(collect(skipmissing(df[:A])))

or the map equivalent

map(x->ismissing(x) ? missing : uppercase(x), df[:A])

if you ultimately can just drop the missing entries anyway. Compared the suggested lifting functionality here of

lift(uppercase).(df[:A]))

(assuming that would work correctly w/ broadcasting and everything.

It just seems like the tone of some of the comments around this are like: "oh no, missing is un-useable because every function isn't defined to handle it!!" whereas that's really not the case. Personally, I've been using missing (or null) for close to a year now and haven't ever thought I needed something like this. (and it's not like I never work w/ string data or anything).

Not saying we shouldn't keep proving this out and brainstorming, but just wanted to throw out some of my thoughts.

@nalimilan
Copy link
Member

@bkamins In principle that looks good, though I'm not sure about performance. I've just realized that hasmethod could be used instead of return_type, but I'm not sure whether it makes a difference. Unfortunately I think it currently involves a check on every call (#16422).

@quinnj I see your point. I guess the advantage would be more compelling if we supported the f?(x) syntax instead of lift(f)(x). In the end that's really just a matter of conciseness. We could add this to Missings.jl for a start and see how useful it turns out to be.

@JeffBezanson
Copy link
Sponsor Member

Using return_type for this would not be correct.

@pdeffebach
Copy link
Contributor

pdeffebach commented Aug 16, 2018

uppercase.(collect(skipmissing(df[:A])))
map(x->ismissing(x) ? missing : uppercase(x), df[:A])

These two are not equivalent, since the first one gets rid of missing values.

One thing that would be nice in Julia is a view for missing values. Obviously now you can do

x = [1, 2, 3, missing]
t = view(x, .!ismissing.(x))

Now you can work with `t' as though it is an array while

  1. Modifying x directly
  2. Preserving the place of the missing values of x

The problem is that this view is expensive and the syntax is a bit clumsy for what I see as a pretty common operation. If I understand correctly, lift seems like a lazier way of doing this, right? Like the map solution above. That's fine but given R's easy propogation of missings more concise helper functions are appreciated

@bkamins
Copy link
Member Author

bkamins commented Aug 28, 2018

I have been thinking about going back to defining lift again. My current idea is the following implementation:

struct Lifted{P,V1,V2} <: Function end
lift(predicate, v1, v2) = Lifted{predicate, v1, v2}()

(::Lifted{F,P,V})(xs...;kw...) where {F,P,V} = P(xs...; kw...) ? V(xs...; kw...) : F(xs...; kw...)

_liftmissing_predicate(xs...;kw...) = any(ismissing.(xs)) || any(ismissing.(values(values(kw))))
_liftmissing_value(xs...;kw...) = missing
liftmissing(f) = lift(_liftmissing_pred, _liftmissing_value, f)

The idea is that it is fully flexible lift with liftmissing defined for the most common case.
All three parameters of lift are functions for generality. Also in this case I think we could skip checking number of parameters as I would document lift as something like:

returns a function that applies v1 to its arguments if pred is true on them and otherwise it applies v2.

Maybe function names lift and liftmissing could be changed to something else I am reusing the names to keep consistency with this thread.

@nalimilan
Copy link
Member

We now have both coalesce and something to replace missing and nothing (respectively) with another value. So it could make sense to also have two different functions for lifting (rather than a single general function).

I'm not sure whether "lift" is the best term, even if it's used in C#, Scala and Haskell. "Propagate" would be more evocative, but it's a bit long. "Carry" could be nice (carrymissing). Or "pass" (passmissing). Of course there's also the eternal question of whether operators like f?(x) could be added.

@bkamins
Copy link
Member Author

bkamins commented Aug 29, 2018

I am also not fully convinced with lift that is why I have said that I leave function names only for consistency with the earlier thread.

Regarding nothing and missing - this is true. What I wanted to say however, is that actually this part:

struct Lifted{P,V1,V2} <: Function end
lift(predicate, v1, v2) = Lifted{predicate, v1, v2}()

(::Lifted{F,P,V})(xs...;kw...) where {F,P,V} = P(xs...; kw...) ? V(xs...; kw...) : F(xs...; kw...)

is much more general and can have different applications - as it transforms ternary operator into a higher order function. Then handling missing or nothing is only one of possible usages.

Note e.g. that the default implementation for liftmissing (and actually I like passmissing name) is very conservative: if missing is present anywhere a missing is returned. However, there are cases when you want to be sensitive only to some specific argument or arguments and ignore other (e.g. in some function missing could have some meaning that is used in its logic), then you would need this "base" function.

And of course it is easy enough to add default liftnothing (or passnothing) implementation similar to the missing case.

So in summary there are the following questions:

  • do we want it at all (as some people voiced we do not need it) - I would say we do - when working with tabular data you often need it for column transformations;
  • do we want to expose this underlying infrastructure transforming ternary operator into a higher order function to users or only use it internally; I would say it would be nice to have it exported; then the question is what would be a proper name (probably native speakers could comment)
  • do we want to have liftmissing exported - I would say yes and passmissing is a nice name for it;
  • do we want to have passnothing exported - I am not sure, as one of the key conceptual differences of nothing and missing is that missing should propagate and nothing should not (that is why we disallow printing nothing I guess and also many functions currently use nothing as sentinel in their arguments to indicate absence of some parameter) - but I am also not against it if we feel it would be useful to have it defined by default

Finally I would leave out discussion of f? construct aside as this is a separate topic in my opinion - first we should decide on the functionality.

@davidanthoff
Copy link
Contributor

Could this start out in Missings.jl for a while to see how useful it is and then be moved into base/stdlib if it is a success?

@bkamins
Copy link
Member Author

bkamins commented Aug 30, 2018

OK. I have proposed it here JuliaData/Missings.jl#89. Let us see how it works out.

@bkamins
Copy link
Member Author

bkamins commented Oct 24, 2020

We have now passmissing for some time in Missings.jl and it passed the test of time I believe. So the question is - do we move the functionality from Missings.jl to Base, or close this issue and just keep it in Missings.jl (I think it is OK to keep it in Missings.jl).

@bkamins
Copy link
Member Author

bkamins commented Nov 13, 2021

OK - closing this as having passmissing in Missings.jl (what we have now) is enough I think.

@bkamins bkamins closed this Nov 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:missing data Base.missing and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants