-
-
Notifications
You must be signed in to change notification settings - Fork 741
Fix Issue 10322 - ensure RandomSample is initialized before use #1533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm going to paste this on every pull that has indent changes from now on. To view the diff ignoring whitespace changes, add https://github.com/D-Programming-Language/phobos/pull/1533/files?w=1 Makes reviewing much easier in a lot of cases. |
Good call. It's a shame there isn't a meld-like view in GitHub that I'm aware of -- makes these things much easier desktop-side. What do you think of the solution here? I was a little nervous about the delegate-based approach, whether it had safety implications, but it does make for a more elegant and overall faster solution. |
Oh, one technical note -- there's no unittest in place for the case where .popFront() is the first public function to be called, just because it's a bit tricky to work out how this should be handled in a unittest scenario. The measure of whether it's working or not is the statistical distribution of items in the (remainder of) the sample. Without the check for initialization, then (with Algorithm D) the popFront'ed sample would just give you an evenly distributed sample of size Probably this is something whose testing should be part of a separate quality-of-randomness test suite. |
Although if .popFront() didn't correctly account for this, it would fail, because |
DON'T MERGE for now. I think there may be an issue with the current code. Closing, will re-open once it's fixed or shown to be a false alarm. |
The problem seems to be that inside the |
OK, the reason is now clear. A delegate is a reference type. When you do something like this: auto sample = randomSample(...);
sample.popFront();
foreach (s; sample)
{
...
} ... the This doesn't matter if It's (i) a clear sign that RandomSample also needs to be a reference range, and (ii) currently a fundamental flaw in the delegate-based approach, which is a shame, because it does allow for speedup :-( |
The added patches should fix the identified issue with the delegate. I owe thanks to Artur Skawina for suggesting this technique -- I would have reverted to an As Artur points out there is a small amount of maintenance risk here should anyone subsequently modify the code to try and call the |
Hmm, looks like my bounds for the new test case are too narrow :( |
Any chance of getting some review here? :-) I think there are 2 principal issues to address -- (i) is the delegate-based approach to the skip() function OK, or would it be preferable to go back to an |
Apologies. I did a quick read, it looks ok in principle. I'll have to brush up on the delegate skills though (as well as study the thread you created with the difficulties you faced). Things would have been simpler with a simple enum (eg: In regards to the delegate problems: Does D outright ban simple member function pointers? It seems it would have been simpler and safer when a struct stores it in itself... |
I'm not at all precious or possessive about my code, so if another solution would be preferred I'm happy to implement it. Using delegates minimises the amount of |
No, I understand what you did, and why, and it's a smart choice. In this particular case though, using delegates is working against us. Delegates store a pointer to this, which basically means you have an internal pointer, and these are banned for a reason (as you have probably noticed). You are working around it, but... just :puke: I was going to suggest If a language feature is working against us, it means we are using the wrong language feature: "pointer to member functions, eg: delegates". In particular, if we already have "this", then we don't actually need the functions to be "member", do we? Why not just make static skipA(ref typeof(this) that);
static skipD(ref typeof(this) that); Then your code basically becomes: if (!skip)
skip = &skipA;
skip(this); Thoughts? |
Using an indirect call instead of a test is overkill, so that must go. I'm not sure how the benchmark was set up but the test is almost always better. Perhaps the matter is an icache spill which could be fixed by having an if test followed by two function calls (instead of inline code). |
Yea, I found myself feeling similar about the workaround to be honest. I'll tweak accordingly.
OK, I'll give it a go. :-) |
OK, I worked out why the indirect call was seemingly faster than the test. The original if ((_alphaInverse * _toSelect) > _available)
{
_algorithmA = true;
return skipA();
} Once that's moved out of private size_t skip()
{
assert(_skip != Skip.None);
// Step D1: if the number of points still to select is greater
// than a certain proportion of the remaining data points, i.e.
// if n >= alpha * N where alpha = 1/13, we carry out the
// sampling with Algorithm A.
if (_skip == Skip.A)
{
return skipA();
}
else if ((_alphaInverse * _toSelect) > _available)
{
// We shouldn't get here unless the current selected
// algorithm is D.
assert(_skip == Skip.D);
_skip = Skip.A;
return skipA();
}
else
{
assert(_skip == Skip.D);
return skipD();
}
} This results in a significant speed-up over all the previous implementations. You can check my benchmarking code here: http://ubuntuone.com/2dFQHQiv8dSgxTzyeD48XW This .tar.gz file contains sample code plus four different versions of std.random: the first is std.random as in current Phobos, the second is the delegate-based approach, the 3rd is that of the penultimate commit (4ad68f9) and the 4th is the current final version in this pull request, i.e. with the extra check moved into skip(). The benchmark results contained in the .zip file were the result of compiling with:
|
Assuming this version is OK, would you like me to clean up the patchset? I can probably reduce it to one single patch or a smaller series of more coherent ones. |
I take it these test failures aren't my fault ... :-\ Anyway, ping? :-) |
.... ping? I seem to be getting into test-failure hell here. |
Don't worry about it. We all know there are odd issues in the auto tester. We don't actually wait for 10/10 to make a merge. A quick look at the auto tester history is good enough.
Yeah... sorry. It's on my to do list. It's on my todo list. I'll get to it sometime this week. |
Hey, it's all green. Ping? :-) |
Sorry for the delay. This looks good for pulling, but 2 things: It's rare, but your test can fail. Randomly (literally) failing tests will be a pain for every one. You bumped up the tolerance for count1 (WebDrake@f480ac0), can you also bump count99's from 250 to 300 ?
|
Yes, I already bumped up the tolerance but not enough it seems :-( I think that in the long run this might be worth taking out and handing over to a more intense random test suite which will use larger sample size, but for now I'll leave it in with the increased tolerance as you suggest. I'll update later today with a single-commit version of all this. |
This patch fixes a problem where the public methods .index() or .popFront() might be called without the first value of the sample having been determined, which would then cause spurious results. The runtime initialization check currently performed in .front has been extended to those methods. The private boolean checks in the previous implementation have been replaced with an enum indicating the algorithm to be used (A, D or None) with None indicating that the sample has not been initialized. Step D1 of Algorithm D has been moved to the skip() function, which results in a significant performance boost. Unittests have been introduced to cover the cases where .index or .popFront() are called before .front. Finally, the .index method has been made a @Property, which I take to be an oversight of the original code.
Single-patch rebase, and I've updated the pull request summary. :-) |
Oh, and the patch includes updated tolerance for the .popFront() test as you requested. |
All tests green. Ping? :-) |
Fix Issue 10322 - ensure RandomSample is initialized before use
:-) |
Thanks! This was a fun one to work on -- I think I learned useful stuff doing it :-) |
assert(count0 == 0); | ||
assert(count1 < 300, text("1: ", count1, " > 300.")); | ||
assert(4_700 < count99, text("99: ", count99, " < 4700.")); | ||
assert(count99 < 5_300, text("99: ", count99, " > 5300.")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This failed recently with count99 == 5321.https://auto-tester.puremagic.com/test_data.ghtml?projectid=1&runid=66683&logid=6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that was always a possibility. This is rather a crude test -- mostly a sanity check -- and in the unittesting scenario we don't have time to run something really statistically rigorous. The upper/lower bounds were set to be quite forgiving, but there's always a possibility of a rare event where we do just get an extreme of variance.
If it repeats frequently then I'd look into it further, but most likely it's just a rare one-off that we should be aware of as possible. I suppose we could make the bounds even more forgiving, but given how many unittest runs have taken place without a failure, I'd rather not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeterministic unit tests are always dangerous. However, let's assume for the moment that there is no better way of writing the tests in question. In that case, the fact that the tests can fail with a small probability should be noted somewhere, and in such a way that it is obvious even to the non-statistically inclined (preferably in the immediate vicinity of the assertions in question).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you like it noted? Via an extra paragraph added to the comment above the asserts? Or in the assert messages themselves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either of those would be fine. Just make it very obvious to anybody going to the failing line in a text editor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do. Thanks for bringing this to my attention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW it was obvious.
Please, let's not have any non-deterministic tests in the test suite. Everything should be repeatable. This test is not only hurting because it fails, it's hurting because I can't have any idea why it would fail -- the seed used is gone. I'd much rather use hard-coded seeds, and forget about testing the "quality" of the random algorithm. If anyone can find a seed that causes it to be poor quality, let's examine that, figure out if we have to add that seed, and be done with it. I'm considering that for unit tests, we should change |
This patch fixes a problem where the public methods .index() or .popFront() might be called without the first value of the sample having been determined, which would then cause spurious results. The runtime initialization check currently performed in .front has been extended to those methods, with a number of tweaks to the implementation that result in an overall performance improvement.
In addition, the .index method has been made a @Property, which I take to be an oversight of the original code.