intermittent failure: Test CBOW w/ hierarchical softmax - #531

Open
tmylk opened this issue Nov 17, 2015 · 4 comments
Labels: bug · difficulty medium · testing

tmylk commented Nov 17, 2015

Need to restart the Travis and AppVeyor builds a couple of times in order to get this test set to pass. Is there a way to make it more robust?

-------------------- >> end captured logging << ---------------------

FAIL: Test CBOW w/ hierarchical softmax

Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\gensim\test\test_word2vec.py", line 246, in test_cbow_hs
    self.model_sanity(model)
  File "C:\Python27\lib\site-packages\gensim\test\test_word2vec.py", line 226, in model_sanity
    self.assertLess(t_rank, 50)
AssertionError: 64 not less than 50
-------------------- >> begin captured logging << --------------------
gensim.models.word2vec: INFO: collecting all words and their counts
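For context, the failing assertion is a rank-based sanity check: after training a small CBOW model with hierarchical softmax, an expected related word must rank within the top 50 nearest neighbours of a query word. Below is a minimal sketch of that kind of check, assuming the gensim 0.12-era API (`Word2Vec.most_similar`, `model.vocab`) and using a placeholder corpus and word pair rather than the actual bundled test data:

```python
from gensim.models.word2vec import Word2Vec

# Placeholder corpus; the real test trains on a small bundled dataset.
sentences = [["human", "interface", "computer", "system"],
             ["graph", "trees", "user", "interface"]] * 50

# CBOW (sg=0) with hierarchical softmax (hs=1, negative=0).
model = Word2Vec(sentences, sg=0, hs=1, negative=0,
                 size=100, min_count=1, iter=10, workers=2)

query, expected = "graph", "trees"   # placeholder word pair
neighbours = [w for w, _ in model.most_similar(query, topn=len(model.vocab))]
t_rank = neighbours.index(expected)
assert t_rank < 50, "%d not less than 50" % t_rank
```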

tmylk added the testing label on Nov 17, 2015
gojomo commented Nov 17, 2015

An odd thing here is that the tolerances were picked such that I saw no failures in 100-200 local test runs... yet the auto-testing systems seem far more prone to outlier results, sometimes almost in a bursty fashion.

In particular, a Windows build that requires many tries to pass, compared to Linux builds that "usually" pass, might be indicative of a true training degradation on that platform, requiring a fix other than testing adjustments.

Have you noticed whether it's always this exact test (test_cbow_hs) and Python version (2.7 as opposed to 2.6/3.x)? Or are you just seeing at least one of the many similar tests failing across multiple configs?

More generally I see the problem as:

We're attempting quick tests on a necessarily-thin amount of bundled data on algorithms that have some inherent variance.

We may need to seed the tests to achieve perfect run-for-run reproducibility. At least then, a true bug would break a test that previously passed predictably.

Simply expanding the tolerances on a still-random process can make the failures less frequent, but might not eliminate them entirely. Then, on the rare occasions test failures still happen, we'd be unsure whether the code really changed in some tangible way or it was just a very unlucky outlier run. Making the tolerances so generous that we "never" (practically) see a failure could also hide many other kinds of calculation-degradation bugs we'd usually want tests to catch.
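As a rough illustration of the seeding option above: the Word2Vec constructor already takes a seed parameter, and combined with a single worker thread (and, on Python 3, a pinned PYTHONHASHSEED) a training run becomes reproducible run-for-run. The corpus below is a placeholder, not the bundled test data:

```python
from gensim.models.word2vec import Word2Vec

# Placeholder corpus standing in for the small bundled test dataset.
sentences = [["human", "interface", "computer", "system"],
             ["graph", "trees", "user", "interface"]] * 50

# Fixed seed + a single worker thread => identical vectors run after run.
# (On Python 3, PYTHONHASHSEED must also be pinned, because the default
# per-word vector initialisation hashes the word string.)
model_a = Word2Vec(sentences, sg=0, hs=1, negative=0, seed=42, workers=1, iter=10)
model_b = Word2Vec(sentences, sg=0, hs=1, negative=0, seed=42, workers=1, iter=10)

assert (model_a["graph"] == model_b["graph"]).all()   # reproducible
```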

piskvorky commented
I'm -1 on seeding the tests with an exact RNG seed. If the inherent variance is such that the range of accepted values has to become effectively meaningless for the test to pass, then the test is not very useful in the first place (it only checks syntax).

So let's try to come up with other solutions. Would more data help? Different parameters (iterations)?

tmylk commented Jan 9, 2016

@gojomo Do you have any ideas on how to fix this? More data or different params?

gojomo commented Jan 12, 2016

PR #581 has a tuned version of test_cbow_hs that in my tests should pass far more reliably. (The tuned test requires a parameter introduced by the earlier work in that PR.)

I suspect the trigger has been thread scheduling that introduces far more randomness on the build machines than in my local tests. (That may be further aggravated by how large the worker jobs are compared to the small unit-test dataset, and maybe even further by the race issue that was in #571.)

The parameters before this check-in had been chosen after no failures in (well over) 200 runs. And, testing them again, they ran thousands of times without a failure on my OSX test machine, in both Py2.7 and Py3.4.

But I did notice a slightly higher spread of (passing) values on Py3.4, probably due to differences in thread scheduling and CPU time-slicing. And by forcing far more randomness into my local tests (explicitly seeding with a random number), I could reproduce failure rates much like those seen on the CI machines... until the adjusted parameters in this commit, which have now executed 1500 times on both Py2.7 and Py3.4 without failure. The CI machines may yet hold surprises; we'll see.

There are also still some mysteries in this particular test, in my local runs. In a number of settings I tried, it seemed like increasing the number of iterations could make the expected results (related words close to each other, tests passing) less likely: 10 iterations did well, 30+ iterations did awfully. That's suspicious and non-intuitive enough that it might indicate something wrong with this exact training mode, but I'm stumped as to what it could be.

While test_cbow_hs has been the major offender, I believe one or two other tests across word2vec/doc2vec were also failing every so often. If/when those recur, the same tuning (larger window, smaller batch_words) may help there as well.
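A rough sketch of the kind of tuning described here, not the exact values from PR #581: a larger window and a smaller batch_words, so each worker job stays small relative to the tiny bundled corpus. The corpus and parameter values below are placeholders chosen only for illustration:

```python
from gensim.models.word2vec import Word2Vec

# Placeholder corpus standing in for the small bundled test dataset.
sentences = [["human", "interface", "computer", "system"],
             ["graph", "trees", "user", "interface"]] * 50

# Larger window and smaller batch_words than the defaults; the exact values
# used in PR #581 may differ -- these are illustrative only.
model = Word2Vec(sentences, sg=0, hs=1, negative=0,
                 size=100, min_count=1, iter=10, workers=2,
                 window=8,            # wider context window
                 batch_words=1000)    # smaller per-worker job size
```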

gojomo added a commit that referenced this issue Jan 15, 2016
menshikh-iv added the bug, difficulty medium, and test before incubator labels on Oct 3, 2017
menshikh-iv added and then removed the good first issue label, and removed the test before incubator label, on Oct 16, 2017