Skip to content

Conversation

DmitryOlshansky
Copy link
Member

Use Xinok's public domain Tim Sort for stable sort.

We had broken stable sort for 2 years now but I really need a working one for upcoming std.uni ;)

@@ -7295,7 +7298,7 @@ sort(alias less = "a < b", SwapStrategy ss = SwapStrategy.unstable,
{
static assert(false, "Invalid predicate passed to sort: "~less);
}
return assumeSorted!less(r);
return assumeSorted!(less)(r);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change from good style to bad style?

@andralex
Copy link
Member

There are quite a few stylistic issues and unrealized opportunities for adding more useful algorithms. For now, however, let's focus on fixing the more obvious ones and getting this in. Thanks!

@DmitryOlshansky
Copy link
Member Author

Okay, I knew I'm up for some work :)
I'll do it step by step starting with style.

@DmitryOlshansky
Copy link
Member Author

Fixed all of raised issues.

@Xinok
Copy link
Contributor

Xinok commented Sep 17, 2012

I'm happy to see this coming into fruition. It looks bad for D when simple things like a stable sort are broken.

I suggest removing the "Adopted From" comment and replace it with something more proper. If you really want to give attribution, you could simply provide a link to the original code.

You could replace the "optimisticInsertionSort" with binary insertion sort, whether mine or your own, and call that instead. While it's a bit slower when sorting primitive types, it does far fewer comparisons which would benefit the unstable sort as well.

As for moveAll, I would avoid that altogether. The optimizer isn't really to blame; The implementation uses input range primitives which are just slower. Also, destructively moving each element for insertion will add much unneeded overhead, especially for larger types. Insertion sort is where the majority of time is spent and is integral for performance.

My email is in my profile if you need help with anything. I'm not sure how difficult it is to understand as I don't thoroughly comment my code.

@DmitryOlshansky
Copy link
Member Author

@Xinok About "Adopted From..." I conjured this note in a couple of minutes and not particularly fond of it myself.

Still IRC public domain license requires to explicilty distinguish altered versions and this one is getting some changes. Anyway I'm more then willing to put in any note that you as an author feel would be more appropriate.

optimisticInsertionSort vs binaryInsertionSort - looks like I'd need to play some more with them to have an idea of their performance characteristics on small inputs.

Now speaking of input range primitives and moveAll I completely disagree. If range primitives are to have any success they must be (almost) as fast as naked arrays. If not then we are doing something wrong and that leaves us with 2 options: ether redesign ranges or hack compiler to optimize them better.

And I'd argue that ranges as defined are good enough ground for optimal (or near optimal) performance as they are just compile-time hardwired thin wrappers for arrays unlike say GoF Iterator that has dynamic dispatch.

See this sample that includes 3 versions of shift right array to the right:
https://gist.github.com/3745627

And the disassembly of 3 relevant pieces here:
https://gist.github.com/3745686
(cmd line: dmd -g -O -inline -release -noboundscheck test.d)
Note: inclusion of symbolic info does affect timings, meaning that '-g' doesn't disable any optimizations.

Obviously compiler doesn't do a prticularly good job and happily reloads plenty of stuff from stack even in simpliest case. 2nd version doesn't have move inlined, poor fellow.
3rd version is unrolled but compiler again uses stack a lot (and~2x as much as in 1st version), including some obviously awful cases like swapping 2 registers through the stack.

Having looked at its source code of DMD inliner I can confidently say: it's not good enough. For instance it won't ever optimize functions with multiple returns and in fact move currently has an early return and thus is not inlined.
All in all, it tries to rewrite everything into a kind of long comma-expressions and a lot of stuff can't be converted that easily.

@DmitryOlshansky
Copy link
Member Author

Bottom line:
In general I believe it IS a compiler issue but there could be a few minor faults in Phobos:
For instance after fixing move for builtins (removing branch) I finally see identical code for fn1 vs fn2 in test sample.
The good thing about move is that it incapsulates proper logic for postblit/destructors.

@Xinok
Copy link
Contributor

Xinok commented Sep 20, 2012

Considering that arrays are a pointer and length pair, input ranges have to increment the pointer and decrement the length for each iteration. Even when iterating in reverse (bidirectional ranges), although it only has to decrement the length, getting the back element is arr[$-1]. Optimizers aren't magic that can whisk away any inefficiency. The programmer must still put in effort to manually optimize their code.

The best thing to do is use foreach loops when possible. In this way, the compiler can choose the optimal method for iteration without the programmer writing specialized code for different types. In this case, the most efficient way to iterate an array is using pointers.

Personally, I optimize my code empirically. DMD is an odd compiler, things that should be faster in theory actually turn out to be slower. So in cases where a foreach loop should be optimal, I use a for loop instead because it proved to be faster in benchmarks.

I pushed out an update for Timsort today. Read the comment to understand what I changed and why.

I also wrote this. I tried to optimize insertion sort using Duff's Device, but again, it failed to be any faster in benchmarks.

As for attribution, just include a link to the original code. This is sort of what is done already; std.algorithm:sort links to STL's sort and stable sort.

@DmitryOlshansky
Copy link
Member Author

Considering that arrays are a pointer and length pair, input ranges have to increment the pointer and decrement the length for each iteration. Even when iterating in reverse (bidirectional ranges), although it only has to decrement the length, getting the back element is arr[$-1].

Okay, I got it. What could be improved is rewriting copy/moveAll for random access ranges so that it can use single index loop instead of doing 2 popFronts. Experimenting this way I'm able to get near identical performance. I suspect it could be applied to other similar algorithms benefiting even more code.

To make it clear - I'm not opposed to optimizing by hand, I'm just in favor of reducing that amount and allow greater exposure of optimization. It's all about reuse and impact - if we are to make moveAll as fast as technically possible for random access ranges it would benefit a much broader audience then if we just hack stable sort to be faster.

Either way I see that even straight loop could use more efficient code. There has to be an optimized copy backwards primitive like memcpy but typesafe (a-la array ops).

Optimizers aren't magic that can whisk away any inefficiency. The programmer must still put in effort to manually optimize their code.

I'm mostly aware of what they can and can't do. I recall being surprised by both how much they can do and how much they can't. The good read is Agner's Fog optimization tutorials http://www.agner.org/optimize/.

Personally, I optimize my code empirically. DMD is an odd compiler, things that should be faster in theory actually turn out to be slower.

If there is anything stable it is "look at disassembly" method. I found it to be quite revealing.

So in cases where a foreach loop should be optimal, I use a for loop instead because it proved to be faster in benchmarks.

BTW I've found it myself on numerous occasions. It sounds like a bug.

As for attribution, just include a link to the original code. This is sort of what is done already; std.algorithm:sort links to STL's sort and stable sort.

Will do. Though in this case it was a pointer to how analogous C++ function works not that implementation was taken from STL.

@DmitryOlshansky
Copy link
Member Author

I also wrote this. I tried to optimize insertion sort using Duff's Device, but again, it failed to be any faster in benchmarks.

I suspect it's because every step is dependent on the next one (i.e. all of these upper--). What helps a bit is partially unrolling the loop by factor of 2-4 and using Duff's Device to handle first length % factor.

Okay, I think I'd better leave fine-tuning to some time later e.g. when we have std.benchmark ;)
As it is now it works within couple of % of original that should be good enough.

UPDATE: looking over disassembly shows the move+retro code should be indeed slower, as it has more complex inner loop. So I revert to stright for-loop again :)

I added source link, see if it looks okay. The only thing left should be your latest enhancements to TimSort. I'll pull these in soon.

@DmitryOlshansky
Copy link
Member Author

Folded in. Interestingly I do observe slight speed up compared to a version of Tim Sort that I checked out from your repo previously.

immutable minTemp = max(range.length / 2, MIN_STORAGE);
size_t minGallop = MIN_GALLOP;
Slice[STACK_SIZE] stack = void;
size_t stackLen = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line up these declarations or remove the extra spaces.

@DmitryOlshansky
Copy link
Member Author

@Xinok Done.

@andralex Ready for another round of destruction.

{
put(target, source.front);
auto len = source.length;
for (size_t idx = 0; idx < len; idx++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

foreach

@Xinok
Copy link
Contributor

Xinok commented Oct 10, 2012

@blackwhale I edited my original post. You can include my more descriptive comment on that line. In fact, please do. :P

I don't thoroughly comment my code, so if there's anywhere that needs a more descriptive comment, leave a note on that line and I'll write a better explanation like I did above.

{
swapAt(r, lessI, greaterI);
size_t at = stack[run1].length <= stack[run3].length
? run1 : run2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size_t at -> immutable at

@andralex
Copy link
Member

Ping on this? Any work left prompted by Xinok's comments?

@DmitryOlshansky
Copy link
Member Author

Ping on this? Any work left prompted by Xinok's comments?

Yup. I need to add an exteneded unittest and do a few extra style changes.

@DmitryOlshansky
Copy link
Member Author

Ready to go.

for (; !source.empty; source.popFront())
{
put(target, source.front);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put natively supports a range as input, so you can replace this with put(target, source);.

I know the for loop wasn't yours, but if you touch it, might as well improve it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

@DmitryOlshansky
Copy link
Member Author

Sorry, I've been somewhat busy. @monarchdodra nits are addressed, waiting for more comments if any.

@monarchdodra
Copy link
Collaborator

I had nothing but nitpicks.

@alexrp
Copy link
Contributor

alexrp commented Oct 31, 2012

Are we good to go?

@DmitryOlshansky
Copy link
Member Author

Let me fold revamped unit test and squash it then we are good to go.

@monarchdodra monarchdodra mentioned this pull request Oct 31, 2012
@DmitryOlshansky
Copy link
Member Author

I'm done.

@DmitryOlshansky DmitryOlshansky mentioned this pull request Nov 3, 2012
@DmitryOlshansky
Copy link
Member Author

Squashed and waiting.

@dnadlinger
Copy link
Contributor

Could you separate out the random whitespace/style changes all over the modules to a separate commit? Other than that, looks good to me. Let's get this merged.

@DmitryOlshansky
Copy link
Member Author

The whitespace changes are not random it's a deliberate pass over std.algo to bring existing code to the prevalent standard.

Anyway will do.

@DmitryOlshansky
Copy link
Member Author

Here you go. Hopefully we can finally get this in.

@dnadlinger
Copy link
Contributor

Thanks – and sorry for being such a stickler, but it is helpful to be able to quickly skip such commits when browsing the history (git blame, …).

@DmitryOlshansky
Copy link
Member Author

Ehm! Cool, now the win32 build runs out of memory...

@DmitryOlshansky
Copy link
Member Author

Ping.
Is it out of reach of 2.061 or is there any other problem?

@andralex
Copy link
Member

I assume "cannot read file b6395.d" is unrelated. Merging now.

andralex added a commit that referenced this pull request Dec 12, 2012
@andralex andralex merged commit 0fcca44 into dlang:master Dec 12, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants