CBOW #176
Conversation
Allows people initializing Word2Vec models using positional arguments to keep the same syntax
Looks great! Could you add some basic CBOW sanity checks (a unit test)? @sebastien-j, do you think this could somehow be integrated with the existing pull request at #162?
I added a unit test. This pull request is mostly complementary to #162, so it could most likely be integrated with it. The negative sampling framework introduced there could also be used to extend CBOW.
Great work, thanks Sebastien! Merging now. Whom should I credit, do you have a Twitter account? Help with testing/merging the other pull request would be much appreciated :)
I had seen the question, but not Tomas Mikolov's answer. Gensim is already ready for those changes 👍
Do you mean you implemented them already?
It's already implemented. The parameter |
I have added the CBOW functionality.
It should replicate the original code by Mikolov et al., except that it combines context words by taking the average of the corresponding vectors instead of their sum. This agrees with the textual description of the CBOW model in [1] (although Figure 1 in that paper shows a sum).
Here are some baseline results on the Google analogy dataset (questions-words.txt), with the vocabulary restricted to the 30,000 most frequent words. The training corpora are the "text8" and "fil9" datasets found at http://mattmahoney.net/dc/textdata.html. In all cases, I used vectors of dimension 640 and a window of size 10; the remaining hyper-parameters take their default values.
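For readers unfamiliar with the benchmark: each questions-words.txt entry is scored by the usual vector-offset rule, answering "a is to b as c is to ?" with the word whose vector is closest to b − a + c. A minimal NumPy sketch of that scoring rule (the vocabulary and vector values are made up for illustration, not taken from these experiments):

```python
import numpy as np

# Toy embedding table standing in for a trained model; the words and
# vector values are illustrative only.
vocab = ["king", "man", "woman", "queen", "apple"]
vecs = np.array([
    [0.9, 0.8, 0.1],   # king
    [0.7, 0.1, 0.1],   # man
    [0.7, 0.1, 0.9],   # woman
    [0.9, 0.8, 0.9],   # queen
    [1.0, 0.5, 0.2],   # apple (unrelated filler word)
])
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize rows

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' by maximizing cos(d, b - a + c)."""
    target = vecs[vocab.index(b)] - vecs[vocab.index(a)] + vecs[vocab.index(c)]
    target /= np.linalg.norm(target)
    scores = vecs @ target
    for w in (a, b, c):  # the question words themselves are excluded
        scores[vocab.index(w)] = -np.inf
    return vocab[int(np.argmax(scores))]

print(analogy("man", "king", "woman"))  # → queen
```

A question counts as correct only when the top-ranked word matches exactly, which is what the percentages below measure.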
With "text8":
Skip-gram: 29.3% correct
CBOW: 15.5% (11.5% if I used the sum of the vectors instead of their average)
With "fil9":
Skip-gram: 52.5%
CBOW: 46.2% (29.9% using the sum)
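The sum-versus-average distinction above only rescales the projection vector for a fixed context size, but it changes gradient magnitudes during training (and the scale varies with context size near sentence boundaries), which is presumably why the two variants score differently. A toy NumPy sketch of the two projection rules (dimensions and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Four context-word vectors of dimension 5 (arbitrary values).
context_vectors = rng.normal(size=(4, 5))

# The two ways of forming CBOW's projection-layer input discussed above:
h_sum = context_vectors.sum(axis=0)    # as drawn in Figure 1 of [1]
h_mean = context_vectors.mean(axis=0)  # as described in the text of [1] and in this PR

# For a fixed context size the two differ only by a constant factor.
assert np.allclose(h_mean, h_sum / len(context_vectors))
```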
Note: By default, my code always used "Optimization 1" during training. What would be the most efficient way to test the other optimizations before incorporating the changes into gensim?
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.