CBOW #176
Conversation
Allows people initializing Word2Vec models using positional arguments to keep the same syntax
Looks great! Could you add some basic CBOW sanity checks (a unit test)? @sebastien-j, do you think this could somehow be integrated with the existing pull request at #162?
I added a unit test. This pull request is mostly complementary to #162, so it could most likely be integrated with it. The negative sampling framework introduced there could also be used to extend CBOW.
Great work, thanks Sebastien! Merging now. Whom should I credit, do you have a Twitter account? Help with testing/merging the other pull request would be much appreciated :)
I had seen the question, but not Tomas Mikolov's answer. Gensim is already ready for those changes 👍
Do you mean you implemented them already?
It's already implemented. The parameter |
I have added the CBOW functionality.
It should replicate the original code by Mikolov et al., except that it combines context words by taking the average of the corresponding vectors instead of their sum. This agrees with the textual description of the CBOW model in [1] (although Figure 1 in that paper shows a sum).
Here are some baseline results on the Google analogy dataset (questions-words.txt), with the vocabulary restricted to the 30,000 most frequent words. The training corpora are the "text8" and "fil9" datasets found at http://mattmahoney.net/dc/textdata.html. In all cases, I used vectors of dimension 640 and a window of size 10; the remaining hyper-parameters take their default values.
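For readers unfamiliar with the benchmark: each questions-words.txt entry is scored by the usual vector-offset rule, answering "a is to b as c is to ?" with the word whose vector is closest to b − a + c. A minimal NumPy sketch of that scoring rule (the vocabulary and vector values are made up for illustration, not taken from these experiments):

```python
import numpy as np

# Toy embedding table standing in for a trained model; the words and
# vector values are illustrative only.
vocab = ["king", "man", "woman", "queen", "apple"]
vecs = np.array([
    [0.9, 0.8, 0.1],   # king
    [0.7, 0.1, 0.1],   # man
    [0.7, 0.1, 0.9],   # woman
    [0.9, 0.8, 0.9],   # queen
    [1.0, 0.5, 0.2],   # apple (unrelated filler word)
])
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize rows

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' by maximizing cos(d, b - a + c)."""
    target = vecs[vocab.index(b)] - vecs[vocab.index(a)] + vecs[vocab.index(c)]
    target /= np.linalg.norm(target)
    scores = vecs @ target
    for w in (a, b, c):  # the question words themselves are excluded
        scores[vocab.index(w)] = -np.inf
    return vocab[int(np.argmax(scores))]

print(analogy("man", "king", "woman"))  # → queen
```

A question counts as correct only when the top-ranked word matches exactly, which is what the percentages below measure.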
With "text8":
Skip-gram: 29.3% correct
CBOW: 15.5% (11.5% if I used the sum of the vectors instead of their average)
With "fil9":
Skip-gram: 52.5%
CBOW: 46.2% (29.9% using the sum)
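The sum-versus-average distinction above only rescales the projection vector for a fixed context size, but it changes gradient magnitudes during training (and the scale varies with context size near sentence boundaries), which is presumably why the two variants score differently. A toy NumPy sketch of the two projection rules (dimensions and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Four context-word vectors of dimension 5 (arbitrary values).
context_vectors = rng.normal(size=(4, 5))

# The two ways of forming CBOW's projection-layer input discussed above:
h_sum = context_vectors.sum(axis=0)    # as drawn in Figure 1 of [1]
h_mean = context_vectors.mean(axis=0)  # as described in the text of [1] and in this PR

# For a fixed context size the two differ only by a constant factor.
assert np.allclose(h_mean, h_sum / len(context_vectors))
```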
Note: By default, my code always used "Optimization 1" during training. What would be the most efficient way to test the other optimizations before incorporating the changes into gensim?
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.