Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 2-arg versions of findmax/min, argmax/min #35316

Merged
merged 29 commits into from
Dec 18, 2020

Conversation

cmcaine
Copy link
Contributor

@cmcaine cmcaine commented Mar 30, 2020

Fixes #27613. Related: #27639, #27612, #34674.

Thanks to @tkf, @StefanKarpinski and @drewrobson for their assistance with
this PR.

I (and Stefan) wanted this to use partial order comparisons, but that would not be backwards compatible without a bit more faffing around. If there's interest, I might introduce another PR to explore that. That would introduce two partial orderings that bend the rules for these special cases: infinitesimals, unorderables (NaN) and missing.

I think reviewers should pay particular attention to the isgreater function I introduce. There are a lot of corner cases when comparing or ordering values and I'm not sure I've thought of them all. I'm fairly sure that it won't introduce any regressions in findmin or argmin (and it actually makes them work with missing where before they did not), but it may not be quite reliable enough to export.

Despite adding the isgreater function, I have not updated min, minimum, or the one-arg versions of findmin or argmin to use it. I can do that, but I wanted to get some review of what I've done first and possibly get this smaller change merged.

@cmcaine
Copy link
Contributor Author

cmcaine commented Mar 30, 2020

I rebased this a few times and github is showing the commits in the wrong order for me on this page. On the repo's page and in the git log they are in the right order. The right order is this:

commit 423dc4b3f5ddcab71f54ab8e2d83239eb007b9e2 (HEAD -> argmax-2-arg-harder, cmcaine/argmax-2-arg-harder)
Author: Colin Caine <cmcaine@gmail.com>
Date:   Mon Mar 30 18:14:26 2020 +0100

    Add 2-arg versions of findmax/min, argmax/min
    
    Fixes #27613. Related: #27639, #27612, #34674.
    
    Thanks to @tkf, @StefanKarpinski and @drewrobson for their assistance
    with this PR.

commit 7bbb84f2fa7a53545dc2cfdfd3df8feba2aac519
Author: Colin Caine <cmcaine@gmail.com>
Date:   Mon Mar 30 18:10:03 2020 +0100

    Add isgreater
    
    Defines a descending total order where unorderable values and missing are last.
    This makes defining min, argmin, etc, simpler.

commit f6a2e73d77f3bd7669126041104763ff7e12d651
Author: Colin Caine <cmcaine@gmail.com>
Date:   Mon Mar 30 21:45:10 2020 +0100

    Rename unexported isgreater in reduceddim

commit b6c06dddb199a673a9e62ae14e9eeee6939560a0 (origin/master, origin/HEAD)
Author: Steven G. Johnson <stevenj@alum.mit.edu>
Date:   Mon Mar 30 12:33:28 2020 -0400

    bump utf8proc version, support Unicode 13 (#35282)

base/operators.jl Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
base/reduce.jl Outdated
while true
y = iterate(p, s)
y === nothing && break
m != m && break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation of _findmax(a, ::Colon) does not handle missing. I understand this is not part of the original plan for this PR. But it may be nice to just replace the function body with mapfoldl((k, v) -> (v, k), _rf_findmax, pairs(a)), now that we have all the machinery.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had planned to leave that and the more involved definition in reducedim.jl alone for this PR in the interest of getting it through quickly and with the least chance of introducing breaking changes.

I can just carefully read the code and reimplement both, though :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewriting _findmax/min(a, ::Colon) using _rf_findmax/min also means that we can test _rf_findmax/min much more heavily via existing tests on/using findmin/max.

Though I guess doing this later is fine too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used keys rather than pairs here, but maybe I should have used pairs. I'll benchmark in a little bit and maybe change it.

base/reduce.jl Outdated
Comment on lines 684 to 794
julia> findmax(identity, 5:9)
(9, 9)

julia> findmax(-, 1:10)
(-1, 1)

julia> findmax(cos, 0:π/2:2π)
(1.0, 0.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be nice to do demonstrate "first match wins" behavior. Maybe something like:

julia> findmax(first, [(1, :a), (2, :b), (2, :c)])
(2, (2, :b))

Maybe this was the intention of cos example but it is not very obvious to realize this when floating point arithmetic is involved.

# Implementation
Types should not usually implement this function. Instead, implement `isless`.
"""
isgreater(x, y) = _is_reflexive(x) && _is_reflexive(y) ? isless(y, x) : isless(x, y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should that be !isless(x, y) in the final branch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. That's the branch for NaN and missing and I just want those sorted the same as for isless.

Copy link
Member

@tkf tkf May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definition is the trickiest part of this PR and I think it deserves at least some comments about it in the source code.

Having said that, I still have a very uneasy feeling about this definition. In particular, proving that isgreater satisfies transitivity requires that "_is_reflexive(x) && !_is_reflexive(y) implies isless(x, y)" holds. This is the case for NaN and missing but this is not in the API. It'd be much easier to understand and enforce that isgreater is a total order [*] if we just use

function isgreater2(x, y)
    xisnan = x != x
    xisnan isa Bool || return false  # x is missing
    yisnan = y != y
    yisnan isa Bool || return true  # y is missing
    xisnan && return false
    yisnan && return true
    return isless(y, x)
end

In other words, using _is_reflexive may look like it makes isgreater very generic but it crucially depends on the implementation details.

Anyway, I think this PR has to do at least one of the following:

  • Use isgreater2 (or something equivalent) instead of current isgreater.
  • Add the aforementioned property of =-reflexivity and isless to the part of implementation requirement of = and isless.

(Note: Transitivity is very important as otherwise "min" is not going to be associative. If "min" is not associative, we can't have parallel argmin/findmin/minimum/extrema.)

[*] OK this is not quite enough for user-defined missing-like types but we don't have such facility even for isless yet.

@ericphanson
Copy link
Contributor

Maybe it's too late, but it would be awesome if this made it into 1.5. Very handy function. (Of course, it's easy enough to copy the PR code via

module FindMin
    import Base.findmin, Base.findmax
    isgreater(x, y) = _is_reflexive(x) && _is_reflexive(y) ? isless(y, x) : isless(x, y)
    _is_reflexive(x) = let eq = x == x
        isa(eq, Bool) && eq
    end
    findmax(f, domain) = mapfoldl(x -> (f(x), x), _rf_findmax, domain)
    _rf_findmax((fm, m), (fx, x)) = isless(fm, fx) ? (fx, x) : (fm, m)
    findmin(f, domain) = mapfoldl(x -> (f(x), x), _rf_findmin, domain)
    _rf_findmin((fm, m), (fx, x)) = isgreater(fm, fx) ? (fx, x) : (fm, m)
end
using .FindMin

to get the functionality already).

@StefanKarpinski
Copy link
Member

Yes, this would be nice to have. What remains to be done here? Is this ready to go?

@cmcaine
Copy link
Contributor Author

cmcaine commented May 2, 2020

I paused on this because I found that the existing methods use an optimisation to avoid calling isless for values that are normally incomparable (so the first incomparable element will be returned, regardless of what isless wants).

The new definitions don't have that optimisation and are a bit slower, even when it is guaranteed that there are zero incomparables in the element type of the iterable. I wanted to try and work out why that was, but didn't get around to diving into the isless code to find out.

I can just not do that and then I think I only need to change the docstrings a bit to make sure we're giving a loose enough guarantee for the weird values.

@tkf
Copy link
Member

tkf commented May 5, 2020

What remains to be done here?

I think #35316 (comment) is still a blocker.

@JeffBezanson JeffBezanson added the forget me not PRs that one wants to make sure aren't forgotten label Jun 10, 2020
@tkf tkf added the fold sum, maximum, reduce, foldl, etc. label Jun 14, 2020
@tkf
Copy link
Member

tkf commented Jun 14, 2020

I summarized the current issues around reducers in the "findmax family" in #36287. It'd be nice if we can use a more mathematically rigorous and hence optimizable definition of them.

@AzamatB
Copy link
Contributor

AzamatB commented Oct 20, 2020

Bump. Can we get this merged? I hit the case today, where I needed it.

@cmcaine
Copy link
Contributor Author

cmcaine commented Oct 20, 2020

The only remaining issues are:

  1. This is a minor change in semantics.
  2. The current implementations are (or were, last I checked) faster due to the thing about skipping isless I mentioned.
  3. Should probably replace my isgreater() with something like tkf's which is simpler.

Need to think about if the ::Colon and reducedim methods need to be done now or not. If they're not done they may be inconsistent.

Similarly, min, and maximum will be a bit inconsistent with findmax if not updated here, but they're not consistent now, so it probably doesn't really matter. Though would be good to fix.

@cmcaine
Copy link
Contributor Author

cmcaine commented Oct 20, 2020

I can rebase and stuff later and have a look, but might be good to get a steer on whether it is okay to make the existing funcs a bit slower or not.

@mbauman mbauman added the triage This should be discussed on a triage call label Oct 22, 2020
@StefanKarpinski
Copy link
Member

While I can see the motivation for isgreater for helping to implement these kinds of functions, but I'm hesitant about exporting it. My worry is that people will be tempted to modify or use isgreater independent of isless, which might be problematic. The legitimate use of isgreater is to impelment the same kind of logic as this implements for findmax/min and argmax/min, so maybe that's ok. If that's all people use it for, ok. But if people define isless and then think they also need to define isgreater to match or that they can define isgreater not to match... that's what worries me.

@cmcaine
Copy link
Contributor Author

cmcaine commented Oct 23, 2020

Did triage have any opinion about whether it is okay to make findmax and friends about ~20% slower for Float64 in the name of consistency (implementations and benchmarking code below)? At the moment they return the first value that is not == to itself, if any value is like that, and that speeds up the implementation of isless even when the vector contains only values that are == to themselves. The new definitions based on isgreater and isless will always return whichever value the total ordering says is right.

The most common values that are not == to themselves are NaN, missing and tuples containing those values. The current methods don't support missing, so only users using weird types or tuples that contain these values will notice a difference in output, but all users will experience the slowdown.

Here's how the function outputs would change:

# These won't change
isless((NaN, 10), (NaN, 20)) # true
isless((NaN, 20), (NaN, 10)) # false

# Current implementation
findmax_orig([(NaN, 10), (NaN, 20)]) # (NaN, 10)
findmax_orig([(NaN, 20), (NaN, 10)]) # (NaN, 20)

# Proposed consistent implementation
findmax_new( [(NaN, 10), (NaN, 20)]) # (NaN, 20)
findmax_new( [(NaN, 20), (NaN, 10)]) # (NaN, 20)
Implementations and benchmarking code
findmax_orig(a) = _findmax_orig(a, :)

function _findmax_orig(a, ::Colon)
    p = pairs(a)
    y = iterate(p)
    if y === nothing
        throw(ArgumentError("collection must be non-empty"))
    end
    (mi, m), s = y
    i = mi
    while true
        y = iterate(p, s)
        y === nothing && break
        m != m && break
        (i, ai), s = y
        if ai != ai || isless(m, ai)
            m = ai
            mi = i
        end
    end
    return (m, mi)
end


# Smallest possible change (2 lines)
findmax_new(a) = _findmax_new(a, :)

function _findmax_new(a, ::Colon)
    p = pairs(a)
    y = iterate(p)
    if y === nothing
        throw(ArgumentError("collection must be non-empty"))
    end
    (mi, m), s = y
    i = mi
    while true
        y = iterate(p, s)
        y === nothing && break
        # m != m && break
        (i, ai), s = y
        if #= ai != ai || =# isless(m, ai)
            m = ai
            mi = i
        end
    end
    return (m, mi)
end


# This PR
_rf_findmax((fm, m), (fx, x)) = isless(fm, fx) ? (fx, x) : (fm, m)
findmax_2arg(f, domain) = mapfoldl(x -> (f(x), x), _rf_findmax, domain)

# How the existing functions might be implemented with this PR
findmax_fold(a) = findmax_2arg(idx -> a[idx], keys(a))
# Same speed
# findmax_fold(a) = mapfoldl(((k, v),) -> (v, k), _rf_findmax, pairs(a))


# Differences in behaviour
isless((NaN, 10), (NaN, 20)) # true
isless((NaN, 20), (NaN, 10)) # false

# Inconsistant with folding `isless`
findmax_orig([(NaN, 10), (NaN, 20)]) # (NaN, 10)
findmax_orig([(NaN, 20), (NaN, 10)]) # (NaN, 20)

# Consistant
findmax_new( [(NaN, 10), (NaN, 20)]) # (NaN, 20)
findmax_new( [(NaN, 20), (NaN, 10)]) # (NaN, 20)
findmax_fold([(NaN, 20), (NaN, 10)]) # (NaN, 20)
findmax_fold([(NaN, 10), (NaN, 20)]) # (NaN, 20)


using BenchmarkTools

input = rand(Int, 10^6);

@btime findmax_orig($input) # 920 μs
@btime findmax_new($input)  # 920 μs
@btime findmax_fold($input) # 1.1 ms

inputf = rand(Float64, 10^6);

@btime findmax_orig($inputf) # 2.3 ms
@btime findmax_new($inputf)  # 2.8 ms
@btime findmax_fold($inputf) # 2.8 ms

inputfbig = rand(Float64, 10^9);

@btime findmax_orig($inputfbig) # 2.5 s
@btime findmax_new($inputfbig)  # 3.0 s
@btime findmax_fold($inputfbig) # 2.9 s

I did put a note in the docs for isgreater saying that people should only define isless for their types, I don't know how much we can do to prevent people shooting themselves in the feet. Anyway, I'm happy for us to move forward without exporting isgreater. If someone wants it, that could be a separate PR.

@vtjnash vtjnash removed the forget me not PRs that one wants to make sure aren't forgotten label Oct 28, 2020
base/operators.jl Outdated Show resolved Hide resolved
@StefanKarpinski
Copy link
Member

Bump. Would be great to get this rebased with Jameson's suggsted change applied (and without isgreater export).

@cmcaine
Copy link
Contributor Author

cmcaine commented Nov 21, 2020

Done. Sorry for the delay :)

Jameson said on slack that making the existing findmax/min slower in the name of consistency is okay, so I've done that too and _findmax(a, ::Colon) is now implemented with the new 2-arg form.

The work never ends, though, because now findmax(A; dims) has the old behaviour and findmax(A) has the new behaviour. Differences will only be observed in arrays containing NaN at positions other than the end and in arrays containing missing (which will error with the old behaviour). I'll take a look at that later today and see if I can give findmax(A; dims) the new behaviour too.

Copy link
Member

@tkf tkf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this implementation of isgreater is incorrect? I get:

julia> isgreater(x, y) = _is_reflexive_eq(x) && _is_reflexive_eq(y) ? isless(y, x) : isless(x, y)
isgreater (generic function with 1 method)

julia> _is_reflexive_eq(x) = (x == x) === true
_is_reflexive_eq (generic function with 1 method)

julia> min1(x, y) = ifelse(isgreater(x, y), y, x)
min1 (generic function with 1 method)

julia> function isgreater2(x, y)
           xisnan = x != x
           xisnan isa Bool || return false  # x is missing
           yisnan = y != y
           yisnan isa Bool || return true  # y is missing
           xisnan && return false
           yisnan && return true
           return isless(y, x)
       end
isgreater2 (generic function with 1 method)

julia> min2(x, y) = ifelse(isgreater2(x, y), y, x)
min2 (generic function with 1 method)

julia> min((NaN, missing), (missing, NaN))
(NaN, missing)

julia> min1((NaN, missing), (missing, NaN))
(missing, NaN)

julia> min2((NaN, missing), (missing, NaN))
(NaN, missing)

Sorry to be persistent, but, as I mentioned in #35316 (comment), I think this pattern ... ? isless(y, x) : isless(x, y) is hard to reason about and it'd be better to have a verbose and more "honest" implementation like isgreater2 as I've been suggesting. You can probably fix _is_reflexive_eq so that min1 and min behave identically. So I don't think the linked comment is the source of the bug. But I think the attempt to "minimize" the code size may be a contributing factor.

test/operators.jl Outdated Show resolved Hide resolved
@cmcaine
Copy link
Contributor Author

cmcaine commented Nov 25, 2020

I'm sorry that I didn't address that earlier. I thought you were probably right, but after looking into it some more I think my initial approach is better and that isgreater2 is wrong for tuples and other containers. Using the definitions from your comment:

julia> isgreater2((missing, NaN), (NaN, missing))                                                                                                              
false                                                                                                                                                          
                                                                                                                                                               
julia> isgreater2((NaN, missing), (missing, NaN))                                                                                                              
false

One of the pair has to evaluate to true or the function is not a total order. And then we obviously get bad stuff like this:

julia> min2((NaN, missing), (missing, NaN))                                                                                                                    
(NaN, missing)                                                                                                                                                 
                                                                                                                                                               
julia> min2((missing, NaN), (NaN, missing))                                                                                                                    
(missing, NaN)

Comparing with min is probably not the best either because min exhibits exactly the kind of inconsistency that prompted you to suggest isgreater in the first place :)

min uses isless in the general case. So to get the right result for reals, min uses partial order (<), so NaN gets sorted right. To get the right result for missing, there are special definitions of min(::Any, ::Missing) = missing and min(::Missing, ::Any) = missing.

That makes the ordering of min weird if you wrap things in tuples, vectors or other containers:

min(1, NaN) === NaN
min( (1,), (NaN,) ) === (1,)
min( [1], [NaN] ) == [1]

If min used isgreater then simply wrapping a value in a container would not change the order:

min1(x, y) = ifelse(isgreater(x, y), y, x)

min1(1, NaN) === NaN
min1( (1,), (NaN,) ) === (NaN,)
min1( [1], [NaN] ) # [NaN]

That's a change in behaviour, but I think it's better behaviour. We also wouldn't need special sauce for floats or missing and all the other functions would come out consistent and easy.

In general, I don't think we can satisfy all of:

  1. min does not change it's current ordering
  2. min(x...) and findmin(x) should be consistent for all x
  3. findmin returns missing or NaN type in preference to any "normal" value
  4. NaN and missing values are returned from both findmin and findmax in the same order

There's two ways out of it that I see:

  1. min uses isgreater instead of isless in at least the fallback implementation.
  2. min keeps it's current definition and we're satisfied with the functions being consistent almost all the time (not when used with tuples of weird values)

Currently, this PR goes with scenario 2 and I think that's fine. A follow up PR could maybe change min if that is not considered too breaking.

Here's some reproducible code that explores the consequences of those two scenarios for three inputs:

x = (-1.0, NaN, 1.0)
y = ((NaN, missing), (missing, NaN))
z = ((1, 2), (missing, NaN), (NaN, missing))

# isgreater from this PR
_is_reflexive_eq(x) = (x == x) === true
isgreater(x, y) = _is_reflexive_eq(x) && _is_reflexive_eq(y) ? isless(y, x) : isless(x, y)

# min and minimum compatible with "Scenario 1"
min1(x, y) = ifelse(isgreater(x, y), y, x)
minimum1(itr) = foldl(min1, itr)

# findmin compatible with this PR
findmin_(itr) = reverse(foldl(
                    ((k1, v1), (k2, v2)) -> ifelse(isgreater(v1, v2), (k2, v2), (k1, v1)),
                    pairs(itr)
                ))

# Scenario 1
# x, y, z all consistent:

@assert minimum1(x) === NaN === findmin_(x)[1]
@assert minimum1(y) === (missing, NaN) === findmin_(y)[1]
@assert minimum1(z) === (missing, NaN) === findmin_(z)[1]

# Scenario 2
# x is fine, y and z are weird.

@assert minimum(x) === NaN === findmin(x)[1]

@assert minimum(y) === (NaN, missing)
@assert findmin_(y)[1] === (missing, NaN)

@assert minimum(z) === (1, 2)
@assert findmin_(z)[1] === (missing, NaN)

@assert minimum(x) === NaN
@assert minimum(y) === (NaN, missing)
@assert minimum(z) === (1, 2)

I agree with your earlier comment that isgreater and is_relexive_eq are confusing. I can add a comment to explain further.

@cmcaine
Copy link
Contributor Author

cmcaine commented Dec 17, 2020

Oh, one thing that came up on the triage call was that the definition is_poisoning(x) = (x == x) !== true seems to be better and if we recall correctly was used earlier in this PR (or maybe elsewhere?). Is there a reason not to use that definition? We may want to call it isunordered instead which I personally would be ok with exporting even.

We did previously use that :) It gives the wrong result for containers like tuples (previously I thought it was the right result, but it's not). The new behaviour matches isless and min.

The old behaviour would imply min should give this result min( (2, NaN), (1, 0) ) === (2, NaN), which isn't consistent with how min or isless work.

Concretely:

isunordered_old(x) = (x == x) !== true

isunordered_new(x) = false
isunordered_new(x::AbstractFloat) = isnan(x)
isunordered_new(x::Missing) = true

isunordered_old( (NaN, 1) ) == true

# This result is required to match the behaviour of `min`.
isunordered_new( (NaN, 1) ) == false

I've renamed, added docs, and exported.

Edit: The new isunordered avoids potentially expensive structural comparison, too, which is nice.

Once @nalimilan's issues have been addressed, this seems good to go.

Great! I think I've done that and I've rebased, added a NEWS entry for isunordered and added compat admonitions to all the new methods and functions.

If CI passes then I think this is good to go!

…accept Inf

This used to work, so it would be a regression to break it.
@cmcaine
Copy link
Contributor Author

cmcaine commented Dec 18, 2020

CI passes. I think this is ready to merge.

@StefanKarpinski StefanKarpinski merged commit 76952a8 into JuliaLang:master Dec 18, 2020
@StefanKarpinski
Copy link
Member

giphy

@cmcaine
Copy link
Contributor Author

cmcaine commented Dec 18, 2020

At times I doubted this would ever be done ;)

I started work on this on the 11th March 2020 and the issue was originally opened on the 16th June 2018.

julia> using Dates

julia> today() - Date(2020, 3, 11)
282 days

julia> today() - Date(2018, 6, 16)
916 days

Big thanks to @tkf and to the various reviewers!

@mykelk
Copy link

mykelk commented Dec 18, 2020

Yay! We're using this syntax in our new book. Thanks for all the hard work to get this right!

@musm musm removed the triage This should be discussed on a triage call label Jan 7, 2021
timholy added a commit to JuliaLang/Compat.jl that referenced this pull request Mar 29, 2021
timholy added a commit to JuliaLang/Compat.jl that referenced this pull request Mar 31, 2021
ElOceanografo pushed a commit to ElOceanografo/julia that referenced this pull request May 4, 2021
Defines a descending total order, `isgreater` (not exported), where unordered values like NaNs and missing are last. This makes defining min, argmin, etc, simpler and more consistent. Also adds 2-arg versions of findmax/min, argmax/min. Defines and exports the `isunordered` predicate for testing whether a value is unordered like NaN and missing. Fixes JuliaLang#27613. Related: JuliaLang#27639, JuliaLang#27612, JuliaLang#34674.

Thanks to @tkf, @StefanKarpinski and @drewrobson for their assistance with this PR.

Co-authored-by: Jameson Nash <jameson@juliacomputing.com>
Co-authored-by: Takafumi Arakaki <takafumi.a@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fold sum, maximum, reduce, foldl, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a 2-arg flavor, argmax(f, domain)