Fix non-deterministic test failures #2196

Merged: 1 commit merged into develop on Sep 24, 2018

Conversation

@menshikh-iv (Contributor) commented Sep 23, 2018

Some of the non-deterministic tests fail frequently, and I spend a lot of time rerunning them again and again (especially at release time, when we need to build gensim many times). The most frequent failures are:

  • TestFastTextModel.test_cbow_hs_online
  • TestWord2VecModel.test_cbow_hs_fromfile
  • TestDoc2VecModel.testLoadOldModel

These failures almost always happen on Python 3.

Here I pin PYTHONHASHSEED for testing purposes (I will do the same thing for the gensim-wheels repo too).
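
For context: on Python 3, string hashing is randomized per interpreter run unless PYTHONHASHSEED is set, which changes dict/set ordering (and anything seeded via the built-in hash()) from run to run. Below is a minimal sketch of that effect, not the actual CI change from this PR:

```python
# Minimal sketch (not the actual CI change): show how PYTHONHASHSEED
# controls str hashing across fresh Python 3 interpreter runs.
import os
import subprocess
import sys

def hash_in_fresh_interpreter(seed):
    """Hash the same string in a new interpreter started with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output([sys.executable, "-c", "print(hash('gensim'))"], env=env)
    return out.decode().strip()

# Randomized hashing (the Python 3 default): the two runs usually differ.
print(hash_in_fresh_interpreter("random"), hash_in_fresh_interpreter("random"))

# Pinned seed: every run produces the same hash, so hash-dependent behaviour
# (dict/set ordering, hash()-based seeding) becomes reproducible across runs.
print(hash_in_fresh_interpreter("0"), hash_in_fresh_interpreter("0"))
```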

CC: @gojomo @piskvorky, I hope you have no objections.

See also MacPython/gensim-wheels#12

@menshikh-iv merged commit da5c603 into develop on Sep 24, 2018
@menshikh-iv deleted the pin-seed branch on September 24, 2018 at 09:33
@piskvorky (Owner) commented Sep 24, 2018

  • Pros: More reliable builds, faster releases, less confusion for maintainers and contributors investigating unrelated CI failures.
  • Cons: Moving away from functional tests (does the algo still work acceptably?) towards more menial bit-checking (the algo may be broken for a different seed; the ground truth needs redefining with each algo change).

Overall, worth it IMO (pros >> cons).

Of course, ideally we'd want both: reliable AND functional. I still don't know why these tests are so flaky (not enough training data / epochs? too non-deterministic to be reliable?).

@gojomo (Collaborator) commented Sep 25, 2018

I'd have to see the individual failures to be sure, but if it really is a small set of commonly-failing tests, they could probably be improved to be 'tighter', rather than forcing the whole environment into string-hash determinism. In particular, the fact that it's two variations of the "cbow_hs" test that commonly fail makes me very suspicious of that test's design.

And for TestDoc2VecModel.testLoadOldModel, I'm a bit more baffled - as there's no new training there, normal sources of randomness shouldn't risk changes in substantive results. It's loading all static models! There must be a true problem in the test, I'd think.

@menshikh-iv (Contributor, Author) commented

@gojomo
About "individual failures": see an example - MacPython/gensim-wheels#12 (comment)

About determinism: I agree that the issue may be in the test design itself (I'm almost sure of it), and the current pinning is a workaround (not a fix). If you have time to look into the tests, that would be pretty nice (if we can fix the problematic tests themselves instead of "global pinning", that would be great).

@gojomo (Collaborator) commented Sep 25, 2018

I don't have time to dig deeper, but:

  • The TestFastTextModel.test_cbow_hs_online test looks really fishy - I wouldn't expect a single iteration over some tiny unit-test dataset to give very consistent results anywhere, and the test failure doesn't look like a "near miss": -0.96 is nowhere near the target of 0.0.

  • The TestWord2VecModel.test_cbow_hs_fromfile failure looks like a near miss that might happen rarely, or could be driven to a negligible rate with slight tightening.

  • The TestDoc2VecModel.testLoadOldModel remains surprising, because a frozen model loaded from disk should give the same ranked results to any future most_similar() tests. It might be due to perfect ties in similarity value resulting in different ranked orders in differently-string-hashed dictionaries. If so, a perfect tie should be very odd in any sufficiently-trained/varied model, so perhaps any old model that includes such a tie should be replaced. But if not, another fix might be to sort most_similar() results by both similarity score and key lexicography (see the sketch after this list).
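
A hypothetical illustration of that tie-breaking idea (not gensim's actual most_similar() implementation): sort candidates by descending similarity and then by key, so exact ties come back in the same order regardless of string-hash randomization.

```python
# Hypothetical tie-break sketch, not gensim's actual most_similar() code:
# order results by descending similarity, then lexicographically by key,
# so that exact similarity ties are returned in a stable order.
def stable_ranking(results):
    """results: iterable of (key, similarity) pairs."""
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))

print(stable_ranking([("dog", 0.9), ("cat", 0.9), ("fish", 0.7)]))
# -> [('cat', 0.9), ('dog', 0.9), ('fish', 0.7)] on every run
```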

I'm not philosophically opposed to forcing determinism in some cases, having argued for it before when frustrated by jittery test results – but I've also seen other bugs that such determinism would've hidden. (In particular, some of the older flakiness in these tests was likely due to the longstanding bug in inference learning-rate-decay.) Here, if these tests are often failing under some build/test systems (like in this case the MacPython setup), but never failing in our own setup, then there's some other source of variance we may want to understand.
