CBOW #176

Merged
merged 4 commits into from Mar 12, 2014

Conversation

sebastien-j
Contributor

I have added the CBOW functionality.

It should replicate the original code by Mikolov et al., except that it combines context words by taking the average of the corresponding vectors instead of their sum. This agrees with the textual description of the CBOW model in [1] (although Figure 1 in that paper indicates a sum).
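
To illustrate the difference between the two ways of combining context vectors, here is a minimal pure-Python sketch. The function name `combine_context` and the toy vectors are hypothetical and are not taken from the pull request's code; the sketch only shows the averaging-versus-summing step, not the training loop.

```python
def combine_context(vectors, use_mean=True):
    """Combine context word vectors into one hidden-layer vector.

    Hypothetical helper for illustration only: use_mean=True averages
    the vectors, use_mean=False sums them.
    """
    if not vectors:
        raise ValueError("need at least one context vector")
    dim = len(vectors[0])
    # Component-wise sum over all context vectors.
    total = [sum(vec[i] for vec in vectors) for i in range(dim)]
    if use_mean:
        # Divide by the number of context words to get the average.
        return [x / len(vectors) for x in total]
    return total


# Three toy 2-dimensional context vectors (made-up values).
context = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mean_vec = combine_context(context, use_mean=True)
sum_vec = combine_context(context, use_mean=False)
```

With the sum, the magnitude of the combined vector grows with the number of context words; with the average it does not.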

Here are some baseline results on the Google analogy dataset (questions-words.txt) with the vocabulary restricted to the 30,000 most frequent words. The training corpora are the "text8" and "fil9" datasets found at "http://mattmahoney.net/dc/textdata.html". In all cases, I used vectors of dimension 640 and a window of size 10. The remaining hyper-parameters keep their default values.

With "text8":

Skip-gram: 29.3% correct
CBOW: 15.5% (11.5% if I used the sum of the vectors instead of their average)

With "fil9":

Skip-gram: 52.5%
CBOW: 46.2% (29.9% using the sum)
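
For readers unfamiliar with how such accuracy figures are obtained, the following is a rough sketch of one common scoring rule for analogy questions of the form "a is to b as c is to ?": pick the word whose vector is closest (by cosine similarity) to b - a + c, excluding the three question words. The toy vocabulary, the vectors, and the helper names are invented for illustration; actual implementations differ in details such as vector normalisation, so this is not necessarily the exact procedure behind the numbers above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding a, b and c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Made-up 3-dimensional vectors for a five-word toy vocabulary.
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, -1.0],
}

# Each question is (a, b, c, expected answer).
questions = [("man", "woman", "king", "queen")]
correct = sum(solve_analogy(toy, a, b, c) == d for a, b, c, d in questions)
accuracy = 100.0 * correct / len(questions)
```

The reported percentage is then the share of questions answered correctly.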

Note: By default, my system always used "Optimization 1" during training. What would be the most efficient way to test the other optimizations before incorporating the changes into Gensim?


[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

Allows people initializing Word2Vec models using positional arguments to
keep the same syntax
@piskvorky
Owner

Looks great! Could you add some basic CBOW sanity checks (a unit test)?

@sebastien-j, do you think this could somehow be integrated with the existing pull request at #162?

@sebastien-j
Contributor Author

I added a unit test. This pull request is mostly complementary to #162, so it could most likely be integrated with it. The negative sampling framework introduced there could also be used to extend CBOW.

@piskvorky
Owner

Great work, thanks Sebastien! Merging now. Whom should I credit, do you have a twitter account?

Help with testing/merging the other pull request would be much appreciated :)

piskvorky added a commit that referenced this pull request Mar 12, 2014
@piskvorky piskvorky merged commit 90df291 into piskvorky:develop Mar 12, 2014
piskvorky added a commit that referenced this pull request Mar 12, 2014
@piskvorky
Owner

@sebastien-j
Contributor Author

I had seen the question, but not Tomas Mikolov's answer. Gensim is already ready for those changes 👍

@piskvorky
Owner

Do you mean you implemented them already?
Or that they can be implemented? (which I do not doubt :)

@sebastien-j
Contributor Author

It's already implemented. The parameter cbow_mean controls whether to use the sum or the average. We might have to change the default learning rate though.
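
As a rough usage sketch, selecting CBOW with either combination rule might look like the following. The `sg` and `cbow_mean` keyword names are gensim `Word2Vec` parameters; the helpers `cbow_kwargs` and `train_cbow` are hypothetical names introduced here, `min_count=1` is only chosen so that a tiny corpus works, and the window of 10 mirrors the experiments above. The gensim import is deferred into the function so the snippet can be read and loaded without gensim installed.

```python
def cbow_kwargs(use_mean=True, window=10):
    """Build keyword arguments that select CBOW in gensim's Word2Vec.

    Hypothetical helper: sg=0 selects CBOW rather than skip-gram, and
    cbow_mean=1 averages the context vectors while cbow_mean=0 sums them.
    """
    return {"sg": 0, "cbow_mean": 1 if use_mean else 0, "window": window}


def train_cbow(sentences, use_mean=True):
    """Train a CBOW model on an iterable of tokenised sentences."""
    # Imported lazily; calling this function requires gensim.
    from gensim.models import Word2Vec

    return Word2Vec(sentences, min_count=1, **cbow_kwargs(use_mean))


mean_config = cbow_kwargs(use_mean=True)
sum_config = cbow_kwargs(use_mean=False)
```

For example, `train_cbow([["hello", "world"], ["hello", "gensim"]], use_mean=False)` would train with summed context vectors, in which case a different learning rate may be appropriate, as noted above.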
