RFC: Added timsort, plus misc. sort updates #1691

kmsquire · 2012-12-06T10:31:46Z

EDIT: (Original message is below)

This patch includes an implementation of timsort for julia, and an update of the sort functions. See also a comparison of the sort routines available in julia at https://github.com/kmsquire/SortPerf.jl (the pdf file contains relevant plots).

Patch summary:

Modularized sort functions (Base.Sort)
Moved each_row, each_col, and each_vec to AbstractArray.jl
Added timsort implementation (timsort.jl)
Renamed and exported insertionsort, quicksort, mergesort, and timsort (mutating and non-mutating variants).
For each of these sorts, added (but did not export) *sort_r (reverse) and *sort_by (sort by function).
For each function except quicksort, added/renamed (but did not export) versions which return the sort permutation: *sort_perm, *sort_perm_r, and *sort_perm_by
Updated search_sorted, search_sorted_first and search_sorted_last as in RFC: Clean up search_sorted* functions #1620
Added and exported additional variants for sort-related functions: issorted_r, issorted_by, search_sorted*_r, search_sorted*_by, select_r, and select_by.
Added tests for various sort algorithms/functions
Added documentation for (almost?) all sort and related functions.

Unexported functions can be accessed with Base.Sort.<function> or with using Base.Sort

This patch might be a little big, so I can probably break it up into multiple patches if needed. It does all kind of go together, though.

No decision was made regarding replacing mergesort with timsort. See the pdf file referenced at the top.

ORIGINAL MESSAGE:

See https://groups.google.com/forum/?fromgroups=#!searchin/julia-dev/timsort/julia-dev/bgFzFVT403s/tm1oe7vIVWAJ and https://gist.github.com/4168004

timsort! is almost complete, and is only missing the function versions which permute indices. Right now, it's in a separate file which is included in sort, simply because adding 500 lines to sort.jl didn't seem like a good idea.

In the original gist, there was some code generated in a loop, which I unrolled in this version because I didn't trust myself to get it right inside of a macro.

I'll retest performance once everything is ready.

Additional Miscellaneous updates:

removed _jl_ prefix from front of *sort function calls
removed _lt suffix from function variants

As a result, the *sort functions: quicksort!, mergesort!, insertionsort!, and timsort! are available in Base ~~(not exported)~~, along with *_r! variants for reverse sorting, and _by! variants for sorting by a function, ~~(EDIT: and _lt variants for sorting with a custom comparison function)~~ so they can be accessed with quicksort!(), insertionsort_r!(), etc.

StefanKarpinski · 2012-12-06T16:21:04Z

This is great work. I really like having all of these available by name like that so easily. Of course we should generally choose good default sorts, but sometimes the programmer just knows better and want to use a specific sort. I wouldn't mind exporting these names either since they're unlikely to clash and the using mechanism means that people can still use the names if they really want to.

kmsquire · 2012-12-07T06:31:03Z

Thanks Stephan. I'll export those names and try to round out everything else soon.

ViralBShah · 2012-12-07T16:53:14Z

I think it is ok to add this to sort.jl.

kmsquire · 2012-12-08T15:29:54Z

Timsort should be feature complete now.

The permutation-based functions introduced a lot of potentially unnecessary code duplication. In particular, merge_collapse() and merge() simply pass the parameter through. Is there a good way to eliminate this, without affecting performance?

I want to run performance tests to make sure I didn't screw anything up and to figure out which sorts are actually fastest in different situations (e.g., sortperm{}), then move timsort.jl into sort.jl if that makes sense.

Any other suggestions to DRY this out are welcome. timsort.jl is currently at 800 lines. In particular, I'm unclear on how to generate multiple versions of a function from within a loop (e.g., with the {r}gallop_* functions). Or should I bother?

Kevin

kmsquire · 2012-12-08T15:34:09Z

The build failure is caused by a missing extras/arpack.jl.

ViralBShah · 2012-12-09T04:07:46Z

It is the recent updates to the paths for loading arpack.jl and suitesparse.jl. make testall on my setup used to work fine, but somehow travis-ci is unable to load.

@staticfloat Do we do something different on the travis-ci setup for tests? I suspect that the paths are different due to the tests running out of make install.

staticfloat · 2012-12-09T04:33:18Z

It's because the relative path from the julia binary to extras and test etc. changes when you install. You shouldn't use require("$JULIA_HOME/../extras/arpack.jl"), you should use require("extras/arpack"). I've submitted a pull request fixing the paths for all the tests (It looks like suitesparse would have also failed, but its disabled).

* removed _jl_ prefix from front of function calls * removed _lt suffix from function variants * provided sort functions: quicksort!, mergesort!, insertionsort!, and timsort!, along with *_r! variants for reverse sorting, and _by! variants for sorting with a function. These functions are not exported, but can easily be accessed with Base.quicksort!() Base.mergesort!() Base.insertionsort!() Base.timsort!()

so that their first parameter can be a lt comparison function.

Added _r, _by, function forms of select, search_sorted

Only exporting basic *sort functions for now; the rest may be accessed with Base.*sort and Base.*sort_perm

kmsquire · 2012-12-14T07:06:13Z

Almost there. I've filled out reverse (*sort_r), component (*sort_by), and permutation (*sort_perm) for almost all functions. I think that all are potentially useful, but at this point, again, I'm only exporting the original sort functions in mutating and nonmutating forms (i.e., timsort(), timsort!(), quicksort(), quicksort!(), etc.) The rest are available via Base.sort

I have some comparison plots among the different sort methods that I'll post tomorrow, along with code for comparing them.

johnmyleswhite · 2012-12-14T12:56:24Z

Sorry to hijack this thread, but since people are working on better sorting functionality, I'm wondering whether the following is a bug or just shows that I don't understand order:

julia> order([2, 4, 3, 3])
4-element Int64 Array:
 1
 3
 4
 2

I, for one, would put 4 at the end of a sorted version of that vector.

pao · 2012-12-14T13:48:06Z

@johnmyleswhite I think those are indexes.

johnmyleswhite · 2012-12-14T13:50:01Z

The indices of what? I thought that order would return the indices in the sorted array, but the indices in the sorted array should be [1, 4, 2, 3] or [1, 4, 3, 2], no?

pao · 2012-12-14T13:51:52Z

The indices of the original array. Wouldn't the ordered indices of the sorted array just be [1,2,3,4] regardless?

johnmyleswhite · 2012-12-14T13:57:57Z

Ah, I see. Right now, order produces indices from the original array such that original[order(original)] == sort(original). Based on R, I would assume that order produces sort(original)[order(original)] == original.

johnmyleswhite · 2012-12-14T13:59:14Z

Put another way: the R approach transforms each element in the original array into its ascending rank order.

kmsquire · 2012-12-15T18:41:51Z

Love the sortperf graphs. What would you recommend for the default sort?
Would I be correct in saying that timsort is fast for non-numeric arrays.
In the case of numeric arrays, except for totally random input, it seems
that timsort performs well, but is slow for small sized inputs.

All done in Julia! Regarding the default sort: Hmmm. It certainly wasn't
clear cut from the graphs. Easy stuff:

quicksort should continue to be used by default for BitTypes
quicksort might be best for ASCIIStrings
timsort should probably be used in DataFrames, where there is more likely
to be ordered inputs, and stable sorting is required. But that's a package
decision.

After that, I don't have a good feeling as to how often sorting is applied
to data which is partially sorted, and it really might depend on the
field/problem/situation. If we're being conservative, we leave things the
way they are. If we want to assume people often work with partially sorted
data, or resort the same data frequently after adding, removing, or making
small changes, then timsort is a good default for non-BitType data.

Questions

for unicode strings, are we sorting properly? (c.f.
http://en.wikipedia.org/wiki/Unicode_collation_algorithm) A quick google
search suggests that libicu has some functionality related to this. I'm
depending too much on Wikipedia, but
http://en.wikipedia.org/wiki/Unicode_equivalence suggests that a stable
sort is required, although I don't understand the subtlties.
How are arrays of strings stored internally? Since insertion sort was
slow even for short string arrays, I was wondering if the strings might be
copied

Comments:

Quicksort overflows the stack for some inputs at large size (though I
forget which tests). Introsort is a quicksort variant commonly used in STL
library in which quicksort switches to a heapsort when the recursion depth
becomes too deep. We might consider adding this variant.
While implementing timsort, I read about quicksort variants which also
check for sorted inputs. This might be a win for BitTypes, but it would
have to be tested.

Perhaps it would be good to include a function in Base.Sort that runs all

sort algorithms on a given input, and gives the fastest one.

That would probably be worthwhile.

For now, my request is to have this patch reviewed as is, without changing
current defaults yet, and then open up another issue (or rather, leave the
original one open, but bump it), and make the decision about changing
default sorts separately. After it's been committed, we can ask people on
julia-users or julia-dev to run tests on their favorite data sets and
report (for which a function which runs all sort algorithms on an input
would be useful).

Kevin

ViralBShah · 2012-12-16T06:01:16Z

I am going to leave this open for a couple more days, before merging it, so that others can review it. I wish that we could have had a better way to deal with the _r and _by versions, but I do not yet have good suggestions.

kmsquire · 2012-12-16T07:00:18Z

I agree--they're less than desirable.

As with the current sort functions, all of the "regular" *sort functions accept a "lessthan" function as an optional first argument, and can use that instead of the default (isless). So the *sort_r versions, at least, could be specified by passing a "greaterthan" (>) function in. I'm assuming that because those functions are not known at compile time, they are less efficient, but I haven't actually tested this yet.

johnmyleswhite · 2012-12-16T15:21:04Z

Mulling over the rank-ordering issue, how about introducing a new function called ranks or ranked that returns an integral rank ordering? We could also have an additional argument that specifies alternative tie-breaking rules. R implements all of the following rules:

first: The rank values increase sequentially for each item in a set of ties
random: The ranks values are distinct, but arbitrary, within a set of ties
average: All items in a set of ties receive the mean rank across the set
max: All items in a set of ties receive the max rank across the set
min: All items in a set of ties receive the min rank across the set. This is how sports rankings behave

Looking into R's behavior more, I now realize that I was wrong about R's order definition because I had used examples in which the rank-orders coincided with the sortperm results. R's order is just like Julia's order. The only thing missing is some way of expressing rank-order.

StefanKarpinski · 2012-12-16T18:16:52Z

All done in Julia!

So cool that the graphs are all done in Julia.

quicksort should continue to be used by default for BitTypes

Yes – for sorting random floating-point values, our default quicksort/insertionsort hybrid kills it. However, since timsort starts to really kill it on some data patterns, we should make sure people are aware they can explicitly call timsort instead, if they happen to expect one of those.

quicksort might be best for ASCIIStrings

I dunno – timsort look like a winner to me for strings. When it wins, it wins big, but when it loses, it loses by only a bit.

for unicode strings, are we sorting properly?

Unicode sorting is complicated and the required order is application-and-language/locale-specific, so we should rely on something like @nolta's ICU package when people have special needs for Unicode string sorting. Most string sorting, however, just needs some sane lexicographical ordering, which for UTF-8, is happily the same thing as lexicographical ordering of byte sequences (thank you, Ken Thompson and Rob Pike). That also means that the common pure-ASCII case doesn't have to pay a performance or complexity penalty at all.

How are arrays of strings stored internally? Since insertion sort was slow even for short string arrays, I was wondering if the strings might be copied

It's an array of pointers to heap-allocated String objects. I'm fairly certain there's no copying happening – although you can't mutate a string through any "official" interface, you can get at the underlying .data field and mutate it, allowing you to see which strings share the same data.

Quicksort overflows the stack for some inputs at large size (though I forget which tests). Introsort is a quicksort variant commonly used in STL library in which quicksort switches to a heapsort when the recursion depth becomes too deep. We might consider adding this variant.
While implementing timsort, I read about quicksort variants which also check for sorted inputs. This might be a win for BitTypes, but it would have to be tested.

We may even want to do introsort by default, and the quicksort variant you mention certainly sounds interesting. Do you have a link? I would be interested in looking at smoothsort too – although the implementation is kind of complex, I've had very good experience using it on data that's already partially sorted.

StefanKarpinski · 2012-12-16T18:32:58Z

I would favor merging this sooner rather than later. On looking through the code, it looks good to me and the changes introduced don't break anything. The only questionable functions that are introduced are the *sort_r ones, which are kind of ugly. Those can all go away once specialization of higher-order functions on their function arguments is possible, since then that can be written only slightly less conveniently with (a,b)->isless(b,a) as a function argument. Currently, that introduces a lot of function call overhead since no inlining can happen.

StefanKarpinski · 2012-12-16T18:36:39Z

@johnmyleswhite: regarding the ranking choices, one issue is that the average strategy doesn't necessarily preserve the array type, so that should possibly be a different function. This could also be done with a higher order function argument, although I'm not sure how happy type inference will be about that.

johnmyleswhite · 2012-12-16T18:52:52Z

Buf. That's a bummer.

kmsquire · 2012-12-16T18:53:23Z

quicksort should continue to be used by default for BitTypes

Yes – for sorting random floating-point values, our default quicksort/insertionsort hybrid kills it. However, since timsort starts to really kill it on some data patterns, we should make sure people are aware they can explicitly call timsort instead, if they happen to expect one of those.

quicksort might be best for ASCIIStrings

I dunno – timsort look like a winner to me for strings. When it wins, it wins big, but when it loses, it loses by only a bit.

The log plots are a little misleading here. For randomly ordered arrays of strings, quicksort is always about 2x as fast as timsort, although timsort wins big if there is any order to the strings.

for unicode strings, are we sorting properly?

Unicode sorting is complicated and the required order is application-and-language/locale-specific, so we should rely on something like @nolta's ICU package when people have special needs for Unicode string sorting. Most string sorting, however, just needs some sane lexicographical ordering, which for UTF-8, is happily the same thing as lexicographical ordering of byte sequences (thank you, Ken Thompson and Rob Pike). That also means that the common pure-ASCII case doesn't have to pay a performance or complexity penalty at all.

Cool.

How are arrays of strings stored internally? Since insertion sort was slow even for short string arrays, I was wondering if the strings might be copied

It's an array of pointers to heap-allocated String objects. I'm fairly certain there's no copying happening – although you can't mutate a string through any "official" interface, you can get at the underlying .data field and mutate it, allowing you to see which strings share the same data.

Okay. Insertionsort is sensitive to the cost of comparsions, so it's just the string comparison which is making it slower.

Quicksort overflows the stack for some inputs at large size (though I forget which tests). Introsort is a quicksort variant commonly used in STL library in which quicksort switches to a heapsort when the recursion depth becomes too deep. We might consider adding this variant.
While implementing timsort, I read about quicksort variants which also check for sorted inputs. This might be a win for BitTypes, but it would have to be tested.

We may even want to do introsort by default, and the quicksort variant you mention certainly sounds interesting. Do you have a link? I would be interested in looking at smoothsort too – although the implementation is kind of complex, I've had very good experience using it on data that's already partially sorted.

Introsort as default sounds good. It depends on heapsort, but it would take only a handful of changes to test and switch to mergesort.

I don't have a link to source for a quicksort which checks for sorted inputs, but the link I found is a discussion on stackoverflow:

http://stackoverflow.com/questions/6567326/does-stdsort-check-if-a-vector-is-already-sorted

Regarding smoothsort: it's quite a cool algorithm. Python's previous sort was based on it. I think it's worth exploring in the next round.

StefanKarpinski · 2012-12-16T19:00:12Z

I wonder if it makes sense to split all this sorting code out into a package... If we're going to have implementations of lots of different sort algorithms, it starts to seem a bit crowded for Base. Although if our default sort ends up using insertion, quick, and merge/heap, then those all end up in Base anyway.

StefanKarpinski · 2012-12-16T19:04:10Z

@johnmyleswhite, it may not be so bad. I'm thinking that there's two very different purposes here that should be different functions. In some cases, you always want indices into the original array, in which case Array{Int} should be returned. That's what the current order does. In other cases, you want to map elements to their ranks, in which case, you may want different strategies for breaking ties, but you can return Array{Float64} consistently, even in cases where the rank values will always be integral. (And since we allow indexing with integer-valued floats, those cases will still work even if you use one of the ranks to do indexing.)

johnmyleswhite · 2012-12-16T19:07:51Z

That's a good point. I was thinking that you wouldn't be able to use the ranks as indices. Using Array{Float64} everywhere works for me.

kmsquire · 2012-12-17T18:55:20Z

Just noticed that @JeffBezanson changed the sort macro to an eval call in c6a3c85

Minor speed improvement for sorted arrays.

kmsquire · 2012-12-18T08:31:51Z

Fixed up sort/timsort to match @JeffBezanson's recent macro removal.

Regarding @StefanKarpinski's last suggestion (on timsort),

I could easily move timsort.jl to a package, as it's the only sort not used by default right now (and it's still in its own file). Of course, this doesn't leave any default sort which handles sortperm has poor performance #939.
I think that exposing the other sorts by name and the addition of search_sorted* and issorted* variants are still worthwhile
I think that implementing a quicksort/heapsort (aka introsort) variant, as mentioned previously, would be worthwhile. There's a heap implementation by @timholy in examples that could be adapted for heapsort, though it might take some work to match the style of the current sort functions.

And for anyone interested:

It's not always true that quicksort is faster than heapsort:
- http://www.azillionmonkeys.com/qed/sort.html
- http://users.aims.ac.za/~mackay/sorting/sorting.html
There's a dual pivot quicksort that is claimed to be faster than regular quicksort:
- http://www.iaroslavski.narod.ru/quicksort/DualPivotQuicksort.pdf

I don't think I'll be implementing any other sorts anytime soon.

After this Friday, I also likely won't be available to do much to this patch for a few weeks, so if you want anything changed/fixed, please ask soon.

Cheers! :-)

Added timsort, plus misc. sort updates

StefanKarpinski · 2012-12-18T12:42:37Z

Lets keep the sorts in Base for now. Man, you've done a lot of research on sorting! @ViralBShah coauthored a paper with what is to the best of my knowledge the best distributed sort. We should implement that for DArrays (allowing different sequential sorts, of course) and we'll have a pretty sweet sorting story!

StefanKarpinski · 2012-12-18T12:54:50Z

It's quite disappointing that the "dual pivot quicksort" paper doesn't provide any actual performance comparison, just a theoretical proof that it swaps fewer elements on average. That makes sense, but seems like it would continue to hold for 3, 4, 5, etc. pivots. So the real question is whether the additional complexity costs more in instruction cache misses, etc. than it saves in element swapping. And the paper leaves that question completely unanswered. (Also, who typesets a paper with that much math in Word? Not even using Microsoft's Equation Editor – just text.)

kmsquire · 2012-12-18T19:09:31Z

It's interesting that "a lot of research on sorting" = "a lot of google searches". ;-)

Yeah, the "dual pivot quicksort" paper I pointed to is a little sketchy. But... for your reading pleasure, here's a paper with detailed analysis justifying the approach: Average Case Analysis of Java 7’s Dual Pivot Quicksort

There are also slides and audio of a talk by the same authors:

Seems worthwhile (and it's nice to have an algorithmic description which isn't subject to the GPL).

Other than that, @ViralBShah's distributed sort would be quite nice to have.

Cheers!

StefanKarpinski · 2012-12-18T19:25:22Z

OMG, LaTeX!

kmsquire · 2012-12-18T19:32:07Z

OMG LOL! ;-)

I could be misunderstanding the purpose of these functions, but I think this was an oversight in JuliaLang#1691

ViralBShah · 2012-12-20T13:37:59Z

This is the paper on distributed sorting that I co-authored:

http://dspace.mit.edu/bitstream/handle/1721.1/7418/CS008.pdf?sequence=1

Just like our julia paper, this one was too useful (it shipped as the default parallel sort on Matlab Star-P), and hence never got accepted. ;-)

kmsquire · 2012-12-20T22:15:36Z

Looks interesting... I'm a little confused why it didn't get accepted,
though.

On Thu, Dec 20, 2012 at 5:38 AM, Viral B. Shah notifications@github.comwrote:

This is the paper on distributed sorting that I co-authored:

http://dspace.mit.edu/bitstream/handle/1721.1/7418/CS008.pdf?sequence=1

Just like our julia paper, this one was too useful (it shipped as the
default parallel sort on Matlab Star-P), and hence never got accepted. ;-)

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/1691#issuecomment-11573588.

staticfloat mentioned this pull request Dec 9, 2012

Fix require() paths in tests #1711

Merged

kmsquire added 15 commits December 13, 2012 19:51

Fixes to make tests pass.

2cfe1c8

Remove *sort_lt functions; the *sort functions are overloaded

f96546d

so that their first parameter can be a lt comparison function.

Partially implement timsort with index permutation (incomplete).

951aba1

Finished implementing timsort with index permutations.

6ac2e2f

timsort: actually return the permutation

47eeff7

Be less clever about functions working with a permutation.

672641e

Small changes to merge_collapse().

091a00e

Minor updates to timsort.

cc1b777

Export *sort! functions; parameterize merge_compute_minrun()

7153958

Add missing commas to export.

928d82f

Modularize Sort

4226bd5

Import required functions from Base into Sort

edb873c

Added nonmutating forms of *sort

efd852e

Added _r, _by, function forms of select, search_sorted

Filled out *sort_perm functions.

6ae4f4c

Only exporting basic *sort functions for now; the rest may be accessed with Base.*sort and Base.*sort_perm

kmsquire added 3 commits December 17, 2012 23:34

Define sorts with @eval instead of macro, to match c6a3c85.

141a2b1

Minor speed improvement for sorted arrays.

Merge branch 'master' into timsort

644b0fb

(Pushed too quickly... removed extraneous <<<<<< HEAD from merge)

fca6052

StefanKarpinski added a commit that referenced this pull request Dec 18, 2012

Merge pull request #1691 from kmsquire/timsort

5bd7ab0

Added timsort, plus misc. sort updates

StefanKarpinski merged commit 5bd7ab0 into JuliaLang:master Dec 18, 2012

simonster added a commit to simonster/julia that referenced this pull request Dec 19, 2012

Make search_sorted_*r and search_sorted_*by behave appropriately

e47ed38

I could be misunderstanding the purpose of these functions, but I think this was an oversight in JuliaLang#1691

kmsquire mentioned this pull request Dec 19, 2012

RFC: Clean up search_sorted* functions #1620

Closed

simonster mentioned this pull request Dec 19, 2012

Fix search_sorted_*r and search_sorted_*by #1786

Closed

kmsquire mentioned this pull request Feb 10, 2014

New Quicksort #5081

Merged

RFC: Added timsort, plus misc. sort updates #1691

RFC: Added timsort, plus misc. sort updates #1691

Conversation

kmsquire commented Dec 6, 2012

StefanKarpinski commented Dec 6, 2012

kmsquire commented Dec 7, 2012

ViralBShah commented Dec 7, 2012

kmsquire commented Dec 8, 2012

kmsquire commented Dec 8, 2012

ViralBShah commented Dec 9, 2012

staticfloat commented Dec 9, 2012

kmsquire commented Dec 14, 2012

johnmyleswhite commented Dec 14, 2012

pao commented Dec 14, 2012

johnmyleswhite commented Dec 14, 2012

pao commented Dec 14, 2012

johnmyleswhite commented Dec 14, 2012

johnmyleswhite commented Dec 14, 2012

kmsquire commented Dec 15, 2012

ViralBShah commented Dec 16, 2012

kmsquire commented Dec 16, 2012

johnmyleswhite commented Dec 16, 2012

StefanKarpinski commented Dec 16, 2012

StefanKarpinski commented Dec 16, 2012

StefanKarpinski commented Dec 16, 2012

johnmyleswhite commented Dec 16, 2012

kmsquire commented Dec 16, 2012

StefanKarpinski commented Dec 16, 2012

StefanKarpinski commented Dec 16, 2012

johnmyleswhite commented Dec 16, 2012

kmsquire commented Dec 17, 2012

kmsquire commented Dec 18, 2012

StefanKarpinski commented Dec 18, 2012

StefanKarpinski commented Dec 18, 2012

kmsquire commented Dec 18, 2012

StefanKarpinski commented Dec 18, 2012

kmsquire commented Dec 18, 2012

ViralBShah commented Dec 20, 2012

kmsquire commented Dec 20, 2012