
W2v negsam #162

Merged: 25 commits merged into RaRe-Technologies:develop on May 10, 2014

4 participants
@cod3licious commented Feb 8, 2014

Added negative sampling to the Python train_sentence function for the skip-gram model, plus all the additionally needed functions. In build_vocab, if negative sampling is used, a big table for the noise word distribution is created and saved in model.table. This takes quite a bit of RAM and should probably be deleted before saving the model (it's only needed for training anyway).
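For illustration, here is a minimal sketch of how such a noise table could be built from the vocabulary counts (the names make_table, power and table_size are illustrative, not necessarily what this PR uses): each word gets a share of slots proportional to count**0.75, so drawing a uniform random slot samples noise words from the smoothed unigram distribution.

import numpy as np

def make_table(vocab, index2word, table_size=100000000, power=0.75):
    """Noise-word table for negative sampling; this is the large array that eats RAM."""
    vocab_size = len(index2word)
    table = np.zeros(table_size, dtype=np.uint32)
    train_words_pow = float(sum(vocab[w].count ** power for w in vocab))
    widx = 0
    d1 = vocab[index2word[widx]].count ** power / train_words_pow  # cumulative probability mass
    for tidx in range(table_size):
        table[tidx] = widx
        if tidx / float(table_size) > d1:
            widx = min(widx + 1, vocab_size - 1)  # move on to the next word
            d1 += vocab[index2word[widx]].count ** power / train_words_pow
    return table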

@piskvorky (Member) commented Feb 9, 2014

Great, thanks @cod3licious :)

I'll review and merge asap. I'm kinda overwhelmed at the moment with life stuff, sorry. Reviewing the python3 port is also still in my "gensim queue".

@mfcabrera (Collaborator) commented Mar 17, 2014

Hey, I would love to see this in Gensim soon. I am currently using Word2Vec for my master's thesis, and negative sampling definitely works better for me, but I want to stop using the original C version. Is there a way I could help (a particular test / verification)?

@piskvorky (Member) commented Mar 17, 2014

Sure! There are many things you can help with, @mfcabrera:

  1. review the memory requirements that Franziska mentions above
  2. run different combinations of HS/negative, check and report accuracies or any problems
  3. integrate with gensim, i.e. pull request #177
  4. add optimized version in Cython
@sebastien-j (Contributor) commented Mar 19, 2014

Hi all,

A few months ago, using the Gensim implementation of word2vec as a starting point, I added negative sampling (https://github.com/sebastien-j/Word2vec). The modified files are not ready to be merged into gensim yet, but they might be useful.

Here are some potential issues with my implementation. First, the different .py and .pyx files should be merged instead of having one for each kind of model. Second, I did not use the look-up table to compute the sigmoid function, but that can be changed quickly. Third, I was generating the vocabulary Huffman tree and using it during training to see whether a word was in the vocabulary or not. Moreover, the random number generation seems wasteful, but I didn't profile the code to see whether this is a bottleneck or not. There are also additional hyper-parameters that may not be useful in gensim. Finally, the code for "optimization 2" is not written.

If you have any questions, feel free to ask. I'll try to help.

Sébastien

@piskvorky (Member) commented Mar 19, 2014

I wouldn't worry about the sigmoid tables. That optimization doesn't bring much.

But I worry about merging this without a cython version. People will complain it's too slow :)

@sebastien-j (Contributor) commented Mar 19, 2014

Putting the sigmoid table back is a really easy task anyway. I can try to make a cleaner version of the Cython code. What would be the best way to submit it? Just make a new pull request?

@cod3licious, why are there both 'hs' and 'negative' parameters? I think one of these should be sufficient.

@piskvorky (Member) commented Mar 20, 2014

Best to make a pull request to @cod3licious 's w2v-negsam branch, so all the changes are in this single pull request, which can then be merged into gensim proper.

@sebastien-j (Contributor) commented Mar 20, 2014

@cod3licious , could you please update your w2v-negsam branch so that it includes the recent changes made in the develop branch of piskvorky/gensim? It would help me add Cython functionality.

Thank you.

@cod3licious (Author) commented Mar 26, 2014

@sebastien-j concerning the 'hs' and 'negative' parameters: both are present in the original C code, so I thought I'd add them here as well. This makes it possible to train the model using both methods rather than just one, which might be useful in some cases.

I'm at work right now but I'll try to include the other changes asap. Thanks for helping out :)

@sebastien-j (Contributor) commented Mar 28, 2014

@cod3licious, thanks. I guess having both 'hs' and 'negative' doesn't hurt and could in some cases be useful, although I still find it somewhat weird.

@piskvorky, I sent a pull request to cod3licious:w2v-negsam (for the Cython version). As I mention in my message there, I am unsure about the "best" way to generate random numbers in order to sample from the vocabulary.

@piskvorky (Member) commented Apr 8, 2014

@cod3licious @sebastien-j OK, great. Let's finish this pull request :)

Re. RNG: your approach seems fine, generating the random numbers directly. Certainly faster than calling external methods. But did you check the performance (speed, accuracy) with this RNG approach? No problems there?

@sebastien-j (Contributor) commented Apr 9, 2014

@piskvorky , the performance seems ok. The speeds I report correspond to a single experiment on my laptop. There was generally some other light work done simultaneously.

You may want to compare these results to those obtained with Mikolov's word2vec tool. (I never got it to work properly on my Windows system.)

On fil9 (~123M words; http://mattmahoney.net/dc/textdata.html):


Hierarchical softmax:

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=1):
93,252 wps, 50.5% accuracy (restrict_vocab=30000)

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=0):
304,471 wps, 37.9%


Negative sampling:

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, hs=0, negative=5):
66,645 wps, 45.4%

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=0, hs=0, negative=5):
278,443 wps, 32.8%


To see the impact of the RNG, I also tried using numpy's random.randint inside the fast_sentence functions (using "with gil" to generate the random number, then releasing the GIL):

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, hs=0, negative=5):
40,064 wps, 45.1%

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=0, hs=0, negative=5):
184,715 wps, 33.1%

There is a clear speed penalty, but no obvious performance gain.
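For context, a rough Python sketch of the two sampling strategies being compared (purely illustrative; the real code lives in the Cython kernel, where the inline generator also avoids re-acquiring the GIL). The inline variant uses the same simple linear congruential generator as the Google C tool:

import numpy as np

def sample_noise_inline(table, k, next_random=1):
    """Draw k noise indices with word2vec's inline LCG (next_random * 25214903917 + 11),
    so the hot loop needs no external calls."""
    indices = []
    for _ in range(k):
        next_random = (next_random * 25214903917 + 11) & 0xFFFFFFFFFFFFFFFF  # emulate unsigned 64-bit overflow
        indices.append(int(table[(next_random >> 16) % len(table)]))
    return indices, next_random

def sample_noise_numpy(table, k):
    """The same sampling via numpy's RNG; inside a nogil Cython loop this requires
    taking the GIL back, which is where the measured slowdown comes from."""
    return table[np.random.randint(0, len(table), size=k)]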

@piskvorky (Member) commented Apr 10, 2014

@sebastien-j great, thanks again.

Is this waiting only for a merge from @cod3licious now? Do we need anything else?

@sebastien-j (Contributor) commented Apr 10, 2014

@piskvorky , the python version of cbow with negative sampling is not in the pull request yet. However, @cod3licious has done it in cod3licious/word2vec/trainmodel.py, so integrating it into gensim should be easy.

@piskvorky (Member) commented Apr 10, 2014

Wait, I'm confused -- we are integrating your changes into the w2v-negsam branch of @cod3licious 's fork, right?

How many pieces to integrate are there, in what branches?

I'd suggest putting everything into a single branch. It will then be easier to reason about the changes, test them and ultimately merge into gensim.

@sebastien-j (Contributor) commented Apr 10, 2014

We are indeed integrating my changes into the w2v-negsam branch of @cod3licious 's gensim fork.

To recap (as far as I can tell), @cod3licious first implemented all four training methods (python only) in the master branch of cod3licious/word2vec. She then forked gensim, made the w2v-negsam branch and added skip-gram with negative sampling there (no CBOW).

I then made a pull request for CBOW (h.softmax only), which was merged into gensim. Once that was done, @cod3licious updated her w2v-negsam branch to incorporate those changes. At that point, the w2v-negsam branch had python versions of skip-gram (h.softmax and neg. sampling) and CBOW (h.softmax only). My pull request into w2v-negsam contains the cython version for all training methods.

Thus, the only missing part in w2v-negsam is the python version of cbow with negative sampling. However, adding it should be easy since it is already in cod3licious/word2vec.

@piskvorky (Member) commented Apr 11, 2014

OK. I'll ping @cod3licious via email, I think she's not receiving github notifications.

I've added you as gensim collaborator @sebastien-j , so you can merge/commit directly. Please use with care :-)

@sebastien-j (Contributor) commented Apr 13, 2014

I made a mistake in test_word2vec.py. I sent a pull request to @cod3licious 's w2v-negsam branch in order to fix it (and also to remove trailing whitespace).

@piskvorky (Member) commented Apr 22, 2014

How about we open a new pull request, one that you have full control over, @sebastien-j ?

May be easier and quicker. There's always new functionality being added into gensim, and the longer we wait with merging this PR, the more work it will be to resolve conflicts later & get the PR up-to-date.

@sebastien-j (Contributor) commented Apr 22, 2014

I don't think that is necessary. @cod3licious gave me access to her gensim repository. I'll try to add the remaining functionality soon.

@piskvorky (Member) commented Apr 22, 2014

Ah, cool, I didn't know :)

Big thanks to both of you for polishing & pushing this!

@sebastien-j (Contributor) commented Apr 23, 2014

I added some more Python code to cover all the training algorithms. Most of it is taken directly from @cod3licious 's word2vec repository. There are a few points that should be discussed.

  1. If one wants to use both hierarchical softmax and negative sampling simultaneously, the Python and Cython versions do not behave in the same way. syn0 is updated twice for each context-target pair in the Cython version (once for h.s., once for negative sampling), whereas there is only one update in the Python version. The latter corresponds to what is done in the original Google code.
  2. In the Python version of negative sampling, @cod3licious added a criterion excluding context words from being noise. I don't know whether that helps or not. In any case, I think we should be consistent between the two versions.
  3. Right now, CBOW uses the average of the context word vectors, which corresponds to the description of the model in the original paper, but not to the Google code, where the sum is employed. From limited experiments, the mean seems to work better, but there is no guarantee it does in all cases. We could maybe add an additional parameter letting the user choose (a small sketch of the two options follows this list).
  4. I added an additional method (models_negative_equal) in test_word2vec.py. We could later use "if" statements in models_equal instead, but I could not do so now without rebasing.
  5. In the Python version of CBOW, numpy's sum is used. I imported it as np_sum, but there may be a more practical way.
  6. The latest updates cause a conflict in word2vec.py, but it shouldn't be too hard to fix. Is there a way to do so without rebasing?
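A small sketch of point 3, the mean-vs-sum choice for the CBOW projection (the names cbow_input_layer and cbow_mean are illustrative, not the final API):

import numpy as np

def cbow_input_layer(model, word2_indices, cbow_mean=True):
    """Build the CBOW projection from the context word vectors.
    The paper describes averaging the context vectors; the Google C code sums them."""
    l1 = np.sum(model.syn0[word2_indices], axis=0)  # sum of context word vectors
    if cbow_mean and len(word2_indices) > 0:
        l1 = l1 / len(word2_indices)  # average instead, as in the original paper
    return l1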

piskvorky and others added some commits Apr 23, 2014

sebastien-j added some commits Apr 13, 2014

Add python negative sampling
  These modifications are mostly copied from @cod3licious 's word2vec repository.

Remove 'np.' and change index exclusion
  Index exclusion now matches code in @cod3licious 's word2vec repository. However, this differs from the cython version and from the original word2vec implementation (or at least I think it does).

  I don't know if we should exclude indices in word2_indices.
@piskvorky (Member) commented Apr 23, 2014

Re. differences and variation: I think it's best to aim at replicating the C code as closely as possible (rather than the original paper).

The code paths seem pretty complex. We'll need to try some larger corpora (for example text8 or text9) on the various option combinations, comparing the model/accuracy with the C version, to make sure there are no "major" issues.

Thanks for the work and testing as usual @sebastien-j ! I have rebased this PR on top of the current develop branch, to make your work a little easier. The result is in the w2v-negsam branch of my fork (I can't push into the @cod3licious fork). Let's use that, to avoid more merging problems.

sebastien-j added some commits Apr 24, 2014

Remove additional constraint
  Also reformat some comments.

CBOW: Use the sum by default
  The mean can also be used.
@sebastien-j (Contributor) commented Apr 24, 2014

I have updated the w2v-negsam branch of cod3licious/gensim.

Some comments on the points I raised in my previous post:

  1. Modifying the Cython version to match the C code would be moderately demanding. As this issue only arises when using both hierarchical softmax and negative sampling, I think we should modify this later in another pull request rather than delaying this one too much.
  2. I removed the additional constraint for noise words.
  3. The CBOW algorithms now use the sum of the context word vectors by default, but users can choose to use the average instead.
  4. No longer applies.

There is another difference between the Cython version and the C code. The logistic function approximation used with negative sampling is not the same. I tried doing it as in the C code, but for reasons I don't understand, it didn't work well. Right now, I employ the approximation used in hierarchical softmax (see commit 9332b43).
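For reference, the sigmoid lookup table in the Google C tool works roughly as below (constants are the C defaults; the gensim/Cython names may differ). Training replaces exp() calls with a clamped table lookup:

import numpy as np

EXP_TABLE_SIZE = 1000
MAX_EXP = 6

# Precompute sigmoid(x) for x spanning [-MAX_EXP, MAX_EXP)
EXP_TABLE = np.exp((np.arange(EXP_TABLE_SIZE) / float(EXP_TABLE_SIZE) * 2 - 1) * MAX_EXP)
EXP_TABLE = EXP_TABLE / (EXP_TABLE + 1.0)

def fast_sigmoid(f):
    """Clamped table lookup instead of computing 1/(1+exp(-f)) exactly."""
    if f <= -MAX_EXP:
        return 0.0
    if f >= MAX_EXP:
        return 1.0
    return EXP_TABLE[int((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2.0))]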

I agree that some additional testing is needed to make sure that everything is ok. @piskvorky , do you mind running some tests with the Google tool and posting the results here? Installing the C version on my Windows laptop is not straightforward. I could run the same tests for this pull request.

@piskvorky , do you want me to push the changes I make into your w2v-negsam branch?

@piskvorky (Member) commented Apr 24, 2014

OK. I'll run a few combinations on text9 (cbow on/off, some negative sampling; any other parameter suggestions?) using the C tool. I'll post the accuracy+time results here.

I don't think pushing will work; my w2v-negsam branch is already rebased on top of the latest develop.

It's probably easier if you rebase the changes you've made since my rebase (=since yesterday) on top of my w2v-negsam. Or else I could rebase your new code on top of my w2v-negsam again, and push that again into my w2v-negsam.

In any case, let's start using the rebased branch asap, or this will turn into a git nightmare :)

@piskvorky (Member) commented Apr 24, 2014

Sorry, scratch that. Now I see you've actually used the rebased branch already 👍

@sebastien-j (Contributor) commented Apr 24, 2014

We should at least check all 4 basic combinations {skip-gram, CBOW} x {hierarchical softmax, negative sampling}. To check the Python version, using text8 may be useful.

@piskvorky (Member) commented Apr 26, 2014

-train text8 -output vectors_10.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -threads 4 -binary 1:
time 1x
accuracy 26.6%

-train text8 -output vectors_00.bin -cbow 1 -size 200 -window 5 -negative 0 -hs 1 -threads 4 -binary 1
time 0.25x
accuracy 15.7%

-train text8 -output vectors_15.bin -cbow 0 -size 200 -window 5 -negative 5 -hs 0 -threads 4 -binary 1
time 0.69x
accuracy 14.8%

-train text8 -output vectors_05.bin -cbow 1 -size 200 -window 5 -negative 5 -hs 0 -threads 4 -binary 1
time 0.2x
accuracy 13.2%

(I never used negative sampling; so I'm not sure whether -negative 5 was a good choice here)

@sebastien-j (Contributor) commented Apr 29, 2014

With the same hyper-parameters (and 1 worker):

Skip-gram h. softmax: Cython 26.7%, 94.3k wps; Python 26.8%, 1016 wps

CBOW h. softmax: Cython 14.2%, 315.4k wps; Python 14.2%, 3363 wps

Skip-gram neg. sampling: Cython 13.8%, 70.4k wps; Python 14.1%, 996 wps

CBOW neg. sampling: Cython 13.0%, 293.4k wps; Python 12.6%, 4001 wps


I was able to run the C tool on a virtual machine. On fil9 (with the same hyper-parameters):

Skip-gram h. softmax: Cython 50.5%, C 48.9%

CBOW h. softmax: Cython 33.2%, C 35.5%

Skip-gram neg. sampling: Cython 45.4%, C 44.4%

CBOW neg. sampling: Cython 39.2%, C 42.3%

@piskvorky (Member) commented Apr 30, 2014

Sounds good! Are we ready to merge?

@sebastien-j (Contributor) commented May 4, 2014

Yes, I think we are ready to merge (but you might want to review it just to be sure...).


if model.hs:
    # work on the entire tree at once, to push as much work into numpy's C routines as possible (performance)
    l2a = deepcopy(model.syn1[word.point])  # 2d matrix, codelen x layer1_size

@piskvorky (Member) commented May 4, 2014

For numpy arrays, simply y = np.array(x) should be faster than y = deepcopy(x)

@piskvorky (Member) commented May 4, 2014

Actually, for fancy indexing (indexing by an array), NumPy should create a copy automatically. Is this explicit copy even necessary?
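(A quick illustration of that point: fancy indexing returns a copy, so modifying the result cannot touch the original array, and the extra explicit copy may indeed be redundant.)

import numpy as np

a = np.arange(10)
b = a[[1, 3, 5]]   # fancy (array) indexing returns a new array, not a view
b[0] = 99
assert a[1] == 1   # the original array is untouched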

if model.negative:
    # use this word (label = 1) + k other random words not from this sentence (label = 0)
    word_indices = [word.index]
    while len(word_indices) < model.negative+1:

@piskvorky (Member) commented May 4, 2014

PEP8: space around binary operators: x + 1, not x+1

d1 = self.vocab[self.index2word[widx]].count**power / train_words_pow
for tidx in range(int(table_size)):
    self.table[tidx] = widx
    if tidx/table_size > d1:

@piskvorky (Member) commented May 4, 2014

PEP8 again.

    word_indices.append(w)
l2b = deepcopy(model.syn1neg[word_indices])  # 2d matrix, k+1 x layer1_size
fb = 1. / (1. + exp(-dot(l1, l2b.T)))  # propagate hidden -> output
gb = (labels - fb) * alpha  # vector of error gradients multiplied by the learning rate

@piskvorky (Member) commented May 4, 2014

I have trouble following the flow here. What's the purpose of this (static?) array of nearly all zeros?

@piskvorky (Member) commented May 6, 2014

I had a look at the negative sampling code. Python and Cython code do slightly different things. Seems the Python code may enter an infinite loop with degenerate vocabulary, in that negative sampling while loop? Cython is OK.

Simply taking negative + 1 random indices and setting labels array accordingly with 0s/1s, without the explicit while loop, sounds safer.
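A hedged sketch of that alternative (illustrative names, not the PR's final code): draw the noise indices in one shot and set the labels array directly, so no while loop can spin forever on a degenerate vocabulary.

import numpy as np

def negative_examples(model, word, k):
    """word.index is the positive example (label 1); k table draws are the negatives (label 0)."""
    noise = model.table[np.random.randint(0, len(model.table), size=k)]
    word_indices = np.concatenate(([word.index], noise))
    labels = np.zeros(k + 1)
    labels[0] = 1.0
    return word_indices, labels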

@piskvorky (Member) commented May 4, 2014

I looked at the code, but only spotted some very minor things (style inconsistencies). I could fix those myself after merging.

As for the code logic, it's hard to check in detail. That's why I suggested comparison with the existing implementation, on a real corpus. As a sort of high-level check.

I see you added some unit tests as well, which is great. Are there any more things that could be tested automatically?

piskvorky added a commit that referenced this pull request May 10, 2014

@piskvorky piskvorky merged commit 1b3a955 into RaRe-Technologies:develop May 10, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
@piskvorky (Member) commented May 10, 2014

Merged. Thanks @sebastien-j and @cod3licious !

Further changes and fixes can happen directly over develop branch.

From your tests it seems the Cython CBOW is consistently worse than the C CBOW... any idea why?
