
port phrases & negative sampling in word2vec #130

Closed · piskvorky opened this issue Oct 23, 2013 · 7 comments

@piskvorky commented Oct 23, 2013

A few days ago Tomas Mikolov et al. published a new article on word2vec: http://arxiv.org/pdf/1310.4546.pdf

It deals with negative sampling and phrases, and the results look good => let's port it from the C code to gensim.

Solution outline

Negative sampling: add another parameter negative to the constructor and to train_sentence. Implement it as a code branch both in the clean, readable NumPy version (where it says TODO: add negative sampling? in word2vec.py) and inside word2vec_inner.pyx (the optimized Cython code).
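
As a sketch of what that code branch could compute (a minimal NumPy illustration of one skip-gram update with negative sampling; the function name and the arrays syn0, syn1neg and the unigram noise table mirror the C implementation and are assumptions, not the final gensim API):

```python
import numpy as np

def train_sg_pair_neg(syn0, syn1neg, table, word, context, alpha, negative=5):
    # One skip-gram update with negative sampling (hedged sketch).
    # syn0/syn1neg: input/output embedding matrices; table: precomputed
    # noise table sampled from the unigram^0.75 distribution, as in the C code.
    l1 = syn0[context]                     # input vector of the context word
    neu1e = np.zeros_like(l1)              # accumulated gradient for l1
    targets = [(word, 1.0)]                # the true word, label 1 ...
    while len(targets) < negative + 1:     # ... plus `negative` noise words, label 0
        w = table[np.random.randint(len(table))]
        if w != word:
            targets.append((w, 0.0))
    for target, label in targets:
        l2 = syn1neg[target]
        f = 1.0 / (1.0 + np.exp(-np.dot(l1, l2)))  # sigmoid(l1 . l2)
        g = (label - f) * alpha            # logistic-loss gradient * learning rate
        neu1e += g * l2
        syn1neg[target] += g * l1          # update the output vector in place
    syn0[context] += neu1e                 # finally, update the input vector
```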

Phrases: I'm thinking of a transformation class that takes a "normal" sentence iterator as input and calculates the necessary frequencies etc., like in the C code: phrases = Phrase(sentences, passes=1). Then when you iterate over phrases, it yields the same sentences, only with words merged into phrases. For example, for sentence in Phrase([['new', 'york', 'is', ...], ...]) yields ['new_york', 'is', ...]. And of course, for training, instead of Word2Vec(sentences) you'd do Word2Vec(Phrase(sentences)).
The passes parameter controls the number of merging passes, so that we can form longer phrases than bigrams, too (new_york_times etc.).
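
A rough sketch of what such a class could look like, using the bigram score from the paper, score(a, b) = (count(ab) − δ) · N / (count(a) · count(b)); all names and defaults here are illustrative, not a final API:

```python
from collections import defaultdict

class Phrase(object):
    # Hypothetical sketch of the proposed transformation class. `sentences`
    # is assumed to be a restartable iterable; only a single merging pass is
    # shown (passes > 1 would rerun this to build trigrams like new_york_times).
    def __init__(self, sentences, min_count=5, threshold=100.0, passes=1):
        self.sentences = sentences
        self.min_count, self.threshold, self.passes = min_count, threshold, passes
        self.counts, self.total = defaultdict(int), 0
        for sentence in sentences:
            for a, b in zip(sentence, sentence[1:]):
                self.counts[a] += 1          # unigram count (all but last token)
                self.counts[(a, b)] += 1     # bigram count
            if sentence:
                self.counts[sentence[-1]] += 1
            self.total += len(sentence)      # N = total number of tokens

    def score(self, a, b):
        # (count(ab) - delta) * N / (count(a) * count(b)), with delta = min_count
        pab = self.counts[(a, b)]
        if pab < self.min_count:
            return 0.0
        return (pab - self.min_count) * float(self.total) / (self.counts[a] * self.counts[b])

    def __iter__(self):
        for sentence in self.sentences:
            out, i = [], 0
            while i < len(sentence):
                if i + 1 < len(sentence) and self.score(sentence[i], sentence[i + 1]) > self.threshold:
                    out.append(sentence[i] + '_' + sentence[i + 1])
                    i += 2                   # greedy: skip past the merged pair
                else:
                    out.append(sentence[i])
                    i += 1
            yield out
```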

@sumitborar commented Oct 31, 2013

I have code ready for the phrases. Could you take a look?

Also, I found that the C code is greedy in building the phrases, and as a result it sometimes misses important phrases.

For example, in the sentence "this site is especially interesting", if "this_site" becomes a phrase then "site_is" is ignored.

This could be solved either by also adding "site_is" as a potential bigram, or by doing trigrams.
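
To make the greedy behavior concrete, a toy illustration (the phrase set below is made up, standing in for bigrams that scored above the threshold):

```python
tokens = ['this', 'site', 'is', 'especially', 'interesting']
phrases = {('this', 'site'), ('site', 'is')}  # both pairs score above threshold

out, i = [], 0
while i < len(tokens):
    if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
        out.append(tokens[i] + '_' + tokens[i + 1])
        i += 2  # greedy: once "this_site" is merged, "site_is" is never considered
    else:
        out.append(tokens[i])
        i += 1

print(out)  # ['this_site', 'is', 'especially', 'interesting']
```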

@piskvorky commented Oct 31, 2013

Sure. Start the pull request, @sumitborar, and I'll comment there.

@piskvorky commented Nov 7, 2013

@sumitborar ping.

@cod3licious commented Feb 6, 2014

I have Python code for negative sampling and also the CBOW model. It should be pretty easy to integrate back into the original gensim code. Feedback very much appreciated! https://github.com/cod3licious/word2vec

@piskvorky commented May 18, 2014

Negative sampling is finished: added in #162, including optimized Cython versions.

Phrases are still not merge-ready.

@tmylk commented Jan 23, 2016

Phrases implemented in #261

@tmylk closed this Jan 23, 2016

@hwsamuel commented Mar 10, 2016

I'm having the following problem with bigrams that might be related to what @sumitborar noted. My objective is to get a vector representation of phrases within a corpus. I can create bigrams and then call model = word2vec.Word2Vec(bigrams[sentences], min_count=1). The problem is that the model filters out many bigrams that are important. For example, I see "this_report" in the bigrams list, but on inspecting the model via model.vocab.keys(), I don't see "this_report". The Word2Vec model seems to be filtering some bigrams that I don't want filtered. I've tried different values of min_count, including 0, but it doesn't seem to help.
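
To make the setup concrete, here is a minimal sketch of the pipeline described above (the toy corpus and thresholds are made-up assumptions, and model.vocab is the attribute name from gensim versions of that era). Note that Phrases and Word2Vec each apply their own, independent min_count, so both may need tuning:

```python
from gensim.models import Phrases, Word2Vec

# Toy corpus, repeated so the bigram is frequent enough to be merged.
sentences = [['this', 'report', 'is', 'long'],
             ['this', 'report', 'is', 'short']] * 20

# Phrases has its own min_count/threshold deciding which bigrams to merge
# (a deliberately low threshold here, so the toy bigram passes).
bigrams = Phrases(sentences, min_count=1, threshold=0.1)

# Word2Vec then applies a *second* frequency cutoff to the merged tokens;
# a bigram produced by Phrases can still be dropped here if it is rare.
model = Word2Vec(bigrams[sentences], min_count=1)
print('this_report' in model.vocab)
```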
