Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve/prune docs/tutorial of TranslationMatrix functionality #2977

Open
gojomo opened this issue Oct 8, 2020 · 0 comments
Open

Improve/prune docs/tutorial of TranslationMatrix functionality #2977

gojomo opened this issue Oct 8, 2020 · 0 comments
Labels
bug Issue described a bug documentation Current issue related to documentation testing Issue related with testing (code, documentation, etc)

Comments

@gojomo
Copy link
Collaborator

gojomo commented Oct 8, 2020

The concerning test failure at #2944 now seems to me to be a false alarm. With more testing across many seeds, it appears the extremely flimsy BackMappingTranslationTest.test_infer_vector() was only passing in the base case (float64 randoms downcast to float32s) due to a lucky seeding, and only failing in the changed case due to unlucky seeding of the slightly-different stream of (float32 from the start) random numbers.

I've disabled the flimsy test, and it's questionable whether the BackMappingTranslationMatrix should even exist. It's perhaps 10 lines of using (not specializing-via-subclass) the actual TranslationMatrix class, and over-specialized on Doc2Vec models – whereas the TranslationMatrix functionality could and should be general to any vector-set, requiring just a few lines to apply to word-vectors, doc-vectors, or others. (And, calling the translation/projection infer_vector is unnecessarily prone to confusion with the different 'inference' that's native to Doc2Vec.)

I still think the TranslationMatrix itself is an under-appreciated bit of functionality, and I even strongly suspect – subject to experimentation – it could be part of a recommended solution for evolving a model to include more words that's far more robust/theoretically-defensible/performant than the build_vocab(..., update=True) & then incrementally .train() approach.

But, it'll need at the very least better docs/tutorial examples. The existing docs/notebook/tranlsation_matrix.ipynb is muddled & hard to run. (The test data it's using links to an all-in-Chinese Baidu download page that seems to require a login before raw .txt download.) It demos the BackmappingTranslationMatrix class in a later 'experimental' area I have trouble following even though it reuses some of the IMDB-dataset Doc2Vec tutorial I wrote.

I only have time to disable the BackMappingTranslationTest.test_infer_vector test right now, and this is pretty fringe functionality, so there's no urgency to clean it up - but this issue it to keep it under consideration, when the right person comes along.

@piskvorky piskvorky added bug Issue described a bug testing Issue related with testing (code, documentation, etc) documentation Current issue related to documentation labels Oct 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug documentation Current issue related to documentation testing Issue related with testing (code, documentation, etc)
Projects
None yet
Development

No branches or pull requests

2 participants