intermittent failure: Test CBOW w/ hierarchical softmax - #531

Open
tmylk opened this issue Nov 17, 2015 · 4 comments
Labels: bug · difficulty medium · testing

tmylk commented Nov 17, 2015

Need to restart the Travis and AppVeyor builds a couple of times in order to get this test set to pass. Is there a way to make it more robust?

-------------------- >> end captured logging << ---------------------

FAIL: Test CBOW w/ hierarchical softmax

Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\gensim\test\test_word2vec.py", line 246, in test_cbow_hs
    self.model_sanity(model)
  File "C:\Python27\lib\site-packages\gensim\test\test_word2vec.py", line 226, in model_sanity
    self.assertLess(t_rank, 50)
AssertionError: 64 not less than 50
-------------------- >> begin captured logging << --------------------
gensim.models.word2vec: INFO: collecting all words and their counts
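For context, the failing assertion is a rank-based sanity check: after training a small CBOW model with hierarchical softmax, an expected related word must rank within the top 50 nearest neighbours of a query word. Below is a minimal sketch of that kind of check, assuming the gensim 0.12-era API (`Word2Vec.most_similar`, `model.vocab`) and using a placeholder corpus and word pair rather than the actual bundled test data:

```python
from gensim.models.word2vec import Word2Vec

# Placeholder corpus; the real test trains on a small bundled dataset.
sentences = [["human", "interface", "computer", "system"],
             ["graph", "trees", "user", "interface"]] * 50

# CBOW (sg=0) with hierarchical softmax (hs=1, negative=0).
model = Word2Vec(sentences, sg=0, hs=1, negative=0,
                 size=100, min_count=1, iter=10, workers=2)

query, expected = "graph", "trees"   # placeholder word pair
neighbours = [w for w, _ in model.most_similar(query, topn=len(model.vocab))]
t_rank = neighbours.index(expected)
assert t_rank < 50, "%d not less than 50" % t_rank
```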

tmylk added the testing label on Nov 17, 2015
gojomo commented Nov 17, 2015

An odd thing here is that the tolerances were picked such that I saw no failures in 100-200 local test runs... yet the auto-testing systems seem far more prone to outlier results, sometimes almost in a bursty fashion.

In particular, a Windows build that requires many tries to pass, compared to Linux builds that "usually" pass, might be indicative of a true training degradation on that platform, requiring a fix other than testing adjustments.

Have you noticed whether it's always this exact test (test_cbow_hs) and Python version (2.7 as opposed to 2.6/3.x)? Or are you just seeing at least one of the many similar tests failing across multiple configs?

More generally I see the problem as:

We're attempting quick tests on a necessarily-thin amount of bundled data on algorithms that have some inherent variance.

We may need to seed the tests to achieve perfect run-for-run reproducibility. At least then, a true bug would break a test that previously passed predictably.

Simply expanding the tolerances on a still-random process can make the failures less frequent, but might not eliminate them entirely. Then, on the rare occasions test failures still happen, we'd be unsure whether the code really changed in some tangible way or it was just a very unlucky outlier run. Making the tolerances so generous that we "never" (practically) see a failure could also hide many other kinds of calculation-degradation bugs we'd usually want tests to catch.
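As a rough illustration of the seeding option above: the Word2Vec constructor already takes a seed parameter, and combined with a single worker thread (and, on Python 3, a pinned PYTHONHASHSEED) a training run becomes reproducible run-for-run. The corpus below is a placeholder, not the bundled test data:

```python
from gensim.models.word2vec import Word2Vec

# Placeholder corpus standing in for the small bundled test dataset.
sentences = [["human", "interface", "computer", "system"],
             ["graph", "trees", "user", "interface"]] * 50

# Fixed seed + a single worker thread => identical vectors run after run.
# (On Python 3, PYTHONHASHSEED must also be pinned, because the default
# per-word vector initialisation hashes the word string.)
model_a = Word2Vec(sentences, sg=0, hs=1, negative=0, seed=42, workers=1, iter=10)
model_b = Word2Vec(sentences, sg=0, hs=1, negative=0, seed=42, workers=1, iter=10)

assert (model_a["graph"] == model_b["graph"]).all()   # reproducible
```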

piskvorky commented
I'm -1 on seeding the tests with an exact RNG seed. If the inherent variance is such that the range of accepted values has to become effectively meaningless for the test to pass, then the test is not very useful in the first place (it only checks syntax).

So let's try to come up with other solutions. Would more data help? Different parameters (iterations)?

tmylk commented Jan 9, 2016

@gojomo Do you have any ideas on how to fix this? More data or different params?

gojomo commented Jan 12, 2016

PR #581 has a tuned version of test_cbow_hs that in my tests should pass far more reliably. (The tuned test requires a parameter introduced by the earlier work in that PR.)

I suspect the trigger has been thread scheduling that introduces far more randomness on the build machines than in my local tests. (That may be further aggravated by how large the worker jobs are compared to the small unit-test dataset, and maybe even further by the race issue that was in #571.)

The parameters before this check-in had been chosen after no failures in (well over) 200 runs. And, testing them again, they ran thousands of times without a failure on my OSX test machine, in both Py2.7 and Py3.4.

But I did notice a slightly higher spread of (passing) values on Py3.4, probably due to differences in thread scheduling and CPU time-slicing. And by forcing far more randomness into my local tests (explicitly seeding with a random number), I could reproduce failure rates much like those seen on the CI machines... until the adjusted parameters in this commit, which have now executed 1500 times on both Py2.7 and Py3.4 without failure. The CI machines may yet hold surprises; we'll see.

There are also still some mysteries in this particular test, in my local runs. In a number of settings I tried, it seemed like increasing the number of iterations could make the expected results (related words close to each other, tests passing) less likely: 10 iterations did well, 30+ iterations did awfully. That's suspicious and non-intuitive enough that it might indicate something wrong with this exact training mode, but I'm stumped as to what it could be.

While test_cbow_hs has been the major offender, I believe one or two other tests across word2vec/doc2vec were also failing every so often. If/when those recur, the same tuning (larger window, smaller batch_words) may help there as well.
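A rough sketch of the kind of tuning described here, not the exact values from PR #581: a larger window and a smaller batch_words, so each worker job stays small relative to the tiny bundled corpus. The corpus and parameter values below are placeholders chosen only for illustration:

```python
from gensim.models.word2vec import Word2Vec

# Placeholder corpus standing in for the small bundled test dataset.
sentences = [["human", "interface", "computer", "system"],
             ["graph", "trees", "user", "interface"]] * 50

# Larger window and smaller batch_words than the defaults; the exact values
# used in PR #581 may differ -- these are illustrative only.
model = Word2Vec(sentences, sg=0, hs=1, negative=0,
                 size=100, min_count=1, iter=10, workers=2,
                 window=8,            # wider context window
                 batch_words=1000)    # smaller per-worker job size
```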

gojomo added a commit that referenced this issue Jan 15, 2016
menshikh-iv added the bug, difficulty medium, and test before incubator labels on Oct 3, 2017
menshikh-iv added and then removed the good first issue label, and removed the test before incubator label, on Oct 16, 2017