Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support sorting iterators #46104

Merged
merged 27 commits into from
May 30, 2023
Merged

Conversation

LilithHafner
Copy link
Member

Closes #38328

@LilithHafner LilithHafner added domain:iteration Involves iteration or the iteration protocol domain:sorting Put things in order labels Jul 19, 2022
@gbaraldi
Copy link
Member

Is the specific AbstractVector method necessary since it dispatches to the same code? Or would removing that method be breaking?

@LilithHafner
Copy link
Member Author

Without it, sort([1,2,3]) would dispatch to sort(A::AbstractArray; dims::Integer, ...), a more specific method than sort(v; kw...), and throw an UndefKeywordError.

test/sorting.jl Outdated Show resolved Hide resolved
@gbaraldi
Copy link
Member

Outside of the .1 vs 0.1 LGTM

@ararslan
Copy link
Member

See also #16853 for some historical context

@LilithHafner
Copy link
Member Author

LilithHafner commented Jul 20, 2022

Thanks for the link @ararslan!

I hadn't noticed the problematic error messages for sort(4) and sort('c') before.

My thoughts on the past conversation:

But would it not be strange to get an array from sort when passed a general iterable?

What else could you get?

There is a precedent of returning the most specific type that can be reasonably sorted, which may not be the input type. For example, ranges sorted in forward and reverse orders produce ranges; StaticVectors produce MutableVectors.

I would favor this possibility to sort general iterables, but without special cases: this would return an array even for tuples and strings as input.

I agree. I feel called to implement a different sorting algorithm for every tuple length that returns a tuple, but for general tuples sorting would not be type stable, and sorting NTuples seems to be the domain of StaticArrays, not base.

Most people want Strings not Vector{Char}s, but most people don't sort strings. For those that do sort strings, I think it's fine to give them a Vector{Char}. They just stored it, so they probably think of characters as re-arrangeable.

Just sorting the Unicode characters also doesn't make too much sense, you'd really want to find all of the graphemes and sort those, I'd imagine.

This seems pretty sensible to me. The umlaut stays with the a, for example. @ScottPJones or others, do you have an example of a grapheme that is not represented as a single Char?

julia> sort("亀 mäke ☃🙂 bc. she 💞🐢s")
21-element Vector{Char}:
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 '.': ASCII/Unicode U+002E (category Po: Punctuation, other)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 'h': ASCII/Unicode U+0068 (category Ll: Letter, lowercase)
 'k': ASCII/Unicode U+006B (category Ll: Letter, lowercase)
 'm': ASCII/Unicode U+006D (category Ll: Letter, lowercase)
 's': ASCII/Unicode U+0073 (category Ll: Letter, lowercase)
 's': ASCII/Unicode U+0073 (category Ll: Letter, lowercase)
 'ä': Unicode U+00E4 (category Ll: Letter, lowercase)
 '': Unicode U+2603 (category So: Symbol, other)
 '': Unicode U+4E80 (category Lo: Letter, other)
 '🐢': Unicode U+1F422 (category So: Symbol, other)
 '💞': Unicode U+1F49E (category So: Symbol, other)
 '🙂': Unicode U+1F642 (category So: Symbol, other)

In summary, I stand by this PR as is (with better error messages) but can see an argument for making strings throw a MethodError instead.

@LilithHafner
Copy link
Member Author

@gbaraldi, in light of the historical context from @ararslan's PR, do you still think this PR looks good as is?

@gbaraldi
Copy link
Member

That's a tough one, I can see someone calling sort on a string as a way to make it alphabetical order so getting a string back makes sense. The first tutorial for sorting a string (https://www.geeksforgeeks.org/sorting-of-strings-in-julia/) collects it, sorts and then joins, which sounds a bit roundabout.
In the other hand almost all things will return arrays as a result of sort.

@Moelf
Copy link
Sponsor Contributor

Moelf commented Jul 30, 2022

Can we just keep this PR but add an error on

sort(::AbstractString)

with a message like ambiguity or something?

@LilithHafner
Copy link
Member Author

Seems reasonable. For almost all iterables this PR is clean; no reason to sweat on the String case unless someone really wants to sort strings:

julia> sort("hello world")
ERROR: ArgumentError: sort(x::AbstractString) is ambiguous. Use sort!(collect(x)) or String(sort!(collect(x))) instead.

@LilithHafner
Copy link
Member Author

How do folks feel about this PR now that we throw an error on AbstractStrings and a better error on Numbers and other zero-dimensional iterators?

@StefanKarpinski StefanKarpinski added the status:triage This should be discussed on a triage call label Sep 28, 2022
base/sort.jl Outdated
@@ -992,7 +994,13 @@ julia> v
2
```
"""
sort(v::AbstractVector; kws...) = sort!(copymutable(v); kws...)
function sort(v; kws...)
IteratorSize(v) == HasShape{0}() && throw(ArgumentError("$v cannot be sorted"))
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't infinite (and maybe unknown) also be errors? I see that people probably want an error for sort(1), so ok, but kind of strange that that of all things would be disallowed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unknown sizes should be supported

sort(w for w in ["karma", "infracostalis", "postencephalon", "Elia"] if all(islowercase, w))

For almost all infinite iterators, we already throw inside copymutable, but perhaps someone could define an infinite iterator that can be copymutableed, returning a sparse vector of infinite length. If they also define a sorting method for that sparse representation, then it would be a mistake to throw on infinite iterators.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But sort is an eager operation - surely if someone wants to sort their infinite iterator lazily, they have to implement it either way and not just rely on the generic fallback that's supposed to collect all elements of the iterator? Wouldn't it be better UX to throw early and make them aware where the actual problem lies, instead of having them chase down an (incidental) error from copymutable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For generic iterables, copymutable collects, but it only needs to make a mutable copy, not necessarily instantiate every element. For example, I believe this is allowed:

struct WithTrailingZerosIterable
    head::Int
    tail::Union{Nothing, WithTrailingZerosIterable}
end

Base.iterate(x::WithTrailingZerosIterable) = iterate(x, x)
Base.iterate(::WithTrailingZerosIterable, x::WithTrailingZerosIterable) = x.head, x.tail
Base.iterate(::WithTrailingZerosIterable, ::Nothing) = 0, Nothing

Base.IteratorSize(::Type{<:WithTrailingZerosIterable}) = Base.SizeInfinite()


struct WithTrailingZerosVector <: AbstractVector{Int}
    data::Vector{Int}
end
Base.size(x::WithTrailingZerosVector) = (typemax(Int),)
Base.getindex(x::WithTrailingZerosVector, i::Int) = i <= length(x.data) ? x.data[i] : 0
function Base.show(io::IO, ::MIME"text/plain", x::WithTrailingZerosVector)
    println(io, "infinite-element WithTrailingZerosVector:")
    for i in 1:5
        x[i] >= 0 && print(io, ' ')
        println(io, x[i])
    end
    println("")
end


function Base.collect(x::WithTrailingZerosIterable)
    data = Int[]
    while x !== nothing
        push!(data, x.head)
        x = x.tail
    end
    WithTrailingZerosVector(data)
end

function Base.sort!(x::WithTrailingZerosVector)
    filter!(x -> x < 0, x.data)
    sort!(x.data)
    return x
end


const X = WithTrailingZerosIterable(1, WithTrailingZerosIterable(-2, WithTrailingZerosIterable(3, nothing)))
display(collect(X))
#=
infinite-element WithTrailingZerosVector:
 1
-2
 3
 0
 0

=#

display(sort!(Base.copymutable(X)))
#=
infinite-element WithTrailingZerosVector:
-2
 0
 0
 0
 0

=#

Nevertheless, we could put this in the same camp as sort(::AbstractString): perhaps it might make sense in some way but for now just throw because it is rarely a good idea to sort an infinite iterable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throwing on an infinite iterator seems like the sensible choice. I've made the change.

@JeffBezanson
Copy link
Sponsor Member

Triage approves.

@JeffBezanson JeffBezanson removed the status:triage This should be discussed on a triage call label Sep 29, 2022
@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Sep 29, 2022

Sorry, I believe this doesn't work, because:

or others, do you have an example of a grapheme that is not represented as a single Char?

yes, flag emojis can be long, at least 5 Chars (codeunits, here 11 bytes):

julia> collect("🏳️‍⚧️")
5-element Vector{Char}:
 '🏳': Unicode U+1F3F3 (category So: Symbol, other)
 '️': Unicode U+FE0F (category Mn: Mark, nonspacing)
 '\u200d': Unicode U+200D (category Cf: Other, format)
 '⚧': Unicode U+26A7 (category So: Symbol, other)
 '️': Unicode U+FE0F (category Mn: Mark, nonspacing)

I chose a flag at random, from here:
https://emojipedia.org/flags/

i.e. the transgender flag, and we do not want to piss off people, screwing it up.

julia> sort2("🏳️‍⚧️")
5-element Vector{Char}:
 '\u200d': Unicode U+200D (category Cf: Other, format)
 '⚧': Unicode U+26A7 (category So: Symbol, other)
 '️': Unicode U+FE0F (category Mn: Mark, nonspacing)
 '️': Unicode U+FE0F (category Mn: Mark, nonspacing)
 '🏳': Unicode U+1F3F3 (category So: Symbol, other)

I believe you want to get the same order back. I'm not completely sure it's sensitive to the order, though pretty confident. This doesn't look right when pasting into the REPL (a minor unrelated issue), so I'm not sure how to easily check after sorting.

I didn't see this sooner, hope I was of help, and this can be fixed before the feature freeze. I knew knowing obscure Uncode (flag) emoji trivia would some day come in handy. Ok, not really.

Such stuff was in the lead of the Wikipeda Unicode article, where I put it, at some point, before someone shortened it to a short summary:
https://en.wikipedia.org/w/index.php?title=Unicode&oldid=1086712843

Examples include: the Devanagari kshi, which is encoded by 4 code points

I put such stuff in the lead, not (just) because it's trivia, but also to show people Unicode is in practice variable-length, even UTF-16 (and to be avoided, some think it's better since fixed-length, but that hasn't been true since UCS-2).

@LilithHafner
Copy link
Member Author

I've implemented triage's requested changes.

@LilithHafner
Copy link
Member Author

Hmmm, I'm looking back at @PallHaraldsson's objections, and I think implementing reverse(::AbstractString) might have been a mistake. And even though we have reverse(::AbstractString), I'm not convinced that sort(::AbstractString) is a good idea to provide. It seems like too much of a foot gun. I'm picturing folks submitting issues like the one @PallHaraldsson brought up here, and the position "yes, those mangled results are expected behavior" seems pretty weak. I'd rather say "that's a strange thing to want to do to a string, but if you really want sort the characters in a string, you can do String(sort(collect(s))) or join(sort(SomePackage.collect_graphemes(s)))".

Is anyone opposed to reverting d42c234?

@LilithHafner
Copy link
Member Author

Everything but whether to support sorting strings is settled and ready to go. Unless folks object, I'm going to merge this in a couple days without support for sorting strings (listening to the first triage, not the second triage on that point only). If we decide that it is best to support sorting strings later, that will be non-breaking.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Nov 8, 2023

Could I ask why there is a discrepancy here, so that I still can't get a sorted list of keys, but can take the keys from a sorted dictionary:

julia> sort(keys(Dict(1=>2)))
ERROR: MethodError: no method matching sort!(::Set{Int64})

Closest candidates are:
  sort!(::AbstractUnitRange)
   @ Base range.jl:1392
  sort!(::AbstractVector, ::Base.Sort.Algorithm, ::Base.Order.Ordering)
   @ Base sort.jl:2245
  sort!(::AbstractVector{T}, ::Integer, ::Integer, ::Base.Sort.MergeSortAlg, ::Base.Order.Ordering, ::Vector{T}) where T
   @ Base sort.jl:2172
  ...

Stacktrace:
 [1] sort(v::Base.KeySet{Int64, Dict{Int64, Int64}}; kws::@Kwargs{})
   @ Base.Sort ./sort.jl:1503
 [2] sort(v::Base.KeySet{Int64, Dict{Int64, Int64}})
   @ Base.Sort ./sort.jl:1499
 [3] top-level scope
   @ REPL[7]:1

julia> first.(sort(Dict(1=>2)))
1-element Vector{Int64}:
 1

vtjnash added a commit that referenced this pull request Nov 8, 2023
`copymutable` is only defined to return an array for abstract arrays,
but that is only what this method is never called with. For other types,
it has a default of `collect`, but can be changed by other types (such
as AbstractSet) to do something different.

Refs #46104
@LilithHafner
Copy link
Member Author

LilithHafner commented Nov 8, 2023

For folks following along, the answer is because of #52086 (comment)

jishnub pushed a commit that referenced this pull request Nov 10, 2023
Two chagnes wrapped into one `Base.copymutable` => `Base.copymutable` &
`collect` and `Base.copymutable` => `similar` & words.

Followup for #52086 and #46104; also fixes #51932 (though we still may
want to make `copymutable` public at some point)

---------

Co-authored-by: Jameson Nash <vtjnash@gmail.com>
KristofferC pushed a commit that referenced this pull request Nov 13, 2023
Two chagnes wrapped into one `Base.copymutable` => `Base.copymutable` &
`collect` and `Base.copymutable` => `similar` & words.

Followup for #52086 and #46104; also fixes #51932 (though we still may
want to make `copymutable` public at some point)

---------

Co-authored-by: Jameson Nash <vtjnash@gmail.com>
(cherry picked from commit 42c088b)
KristofferC added a commit that referenced this pull request Nov 16, 2023
LilithHafner added a commit that referenced this pull request Nov 17, 2023
Co-authored-by: Lilith Orion Hafner <lilithhafner@gmail.com>
KristofferC added a commit that referenced this pull request Nov 27, 2023
Co-authored-by: Lilith Orion Hafner <lilithhafner@gmail.com>
(cherry picked from commit 1cb85ad)
KristofferC added a commit that referenced this pull request Dec 2, 2023
Backported PRs:
- [x] #51213 <!-- Wait for other threads to finish compiling before
exiting -->
- [x] #51520 <!-- Make allocopt respect the GC verifier rules with non
usual address spaces -->
- [x] #51598 <!-- Use a simple error when reporting sysimg load
failures. -->
- [x] #51757 <!-- fix parallel peakflop usage -->
- [x] #51781 <!-- Don't make pkgimages global editable -->
- [x] #51848 <!-- allow finalizers to take any locks and yield during
exit -->
- [x] #51847 <!-- add missing wait during Timer and AsyncCondition close
-->
- [x] #50824 <!-- Add some aliasing warnings to docstrings for mutating
functions in Base -->
- [x] #51885 <!-- remove chmodding the pkgimages -->
- [x] #50207 <!-- [devdocs] Improve documentation about building
external forks of LLVM -->
- [x] #51967 <!-- further fix to the new promoting method for
AbstractDateTime subtraction -->
- [x] #51980 <!-- macroexpand: handle const/atomic struct fields
correctly -->
- [x] #51995 <!-- [Artifacts] Pass artifacts dictionary to
`ensure_artifact_installed` dispatch -->
- [x] #52098 <!-- Fix errors in `sort` docstring -->
- [x] #52136 <!-- Bump JuliaSyntax to 0.4.7 -->
- [x] #52140 <!-- Make c func `abspath` consistent on Windows. Fix
tracking path conversion. -->
- [x] #52009 <!-- fix completion that resulted in startpos of 0 for `\\
-->
- [x] #52192 <!-- cap the number of GC threads to number of cpu cores
-->
- [x] #52206 <!-- Make have_fma consistent between interpreter and
compiled -->
- [x] #52027 <!-- fix Unicode.julia_chartransform for Julia 1.10 -->
- [x] #52217 <!-- More helpful error message for empty `cpu_target` in
`Base.julia_cmd` -->
- [x] #51371 <!-- Memoize `cwstring` when used for env lookup /
modification on Windows -->
- [x] #52214 <!-- Turn Method Overwritten Error into a PrecompileError
-- turning off caching -->
- [x] #51895 <!-- Devdocs on fixing precompile hangs, take 2 -->
- [x] #51596 <!-- Reland "Don't mark nonlocal symbols as hidden"" -->
- [x] #51834 <!-- [REPLCompletions] allow symbol completions within
incomplete macrocall expression -->
- [x] #52010 <!-- Revert "Support sorting iterators (#46104)" -->
- [x] #51430 <!-- add support for async backtraces of Tasks on any
thread -->
- [x] #51471 <!-- Fix segfault if root task is NULL -->
- [x] #52194 <!-- Fix multiversioning issues caused by the parallel llvm
work -->
- [x] #51035 <!-- refactor GC scanning code to reflect jl_binding_t are
now first class -->
- [x] #52030 <!-- Bump Statistics -->
- [x] #52189 <!-- codegen: ensure i1 bool is widened to i8 before
storing -->
- [x] #52228 <!-- Widen diagonal var during `Type` unwrapping in
`instanceof_tfunc` -->
- [x] #52182 <!-- jitlayers: replace sharedbytes intern pool with one
that respects alignment -->

Contains multiple commits, manual intervention needed:
- [ ] #51092 <!-- inference: fix bad effects for recursion -->

Non-merged PRs with backport label:
- [ ] #52196 <!-- Fix creating custom log level macros -->
- [ ] #52170 <!-- fix invalidations related to `ismutable` -->
- [ ] #51479 <!-- prevent code loading from lookin in the versioned
environment when building Julia -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:iteration Involves iteration or the iteration protocol domain:sorting Put things in order
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sort for generators