port phrases & negative sampling in word2vec #130
A few days ago Tomas Mikolov et al. published a new article on word2vec: http://arxiv.org/pdf/1310.4546.pdf
It deals with negative sampling and phrases, and the results look good => let's port it from the C code to gensim.
Negative sampling: add another parameter
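For reference, a minimal sketch of the sampling side of this, assuming the noise distribution from the paper (unigram counts raised to the 3/4 power). The function names and the `k` parameter are made up for illustration; this is not the C code and not gensim's API.

```python
import random
from collections import Counter

def build_noise_distribution(sentences, power=0.75):
    """Return (words, probs): unigram counts raised to `power`, normalised.

    `sentences` is an iterable of token lists. The 0.75 default follows
    the paper; everything else here is a hypothetical helper.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    words = sorted(counts)
    weights = [counts[word] ** power for word in words]
    total = sum(weights)
    return words, [weight / total for weight in weights]

def draw_negatives(words, probs, k, rng=random):
    """Draw k negative samples (with replacement) from the noise distribution."""
    return rng.choices(words, weights=probs, k=k)
```

Here `k` plays the role of the extra parameter: the number of negative words drawn per positive example.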
Phrases: I'm thinking of a transformation class that takes a "normal" sentence iterator as input and calculates the necessary frequencies etc., like in the C code:
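To make the idea concrete, here is a rough Python sketch (not the C code itself, and not a proposed final API). It assumes the bigram score from the paper, score(a, b) = (count(a b) - delta) / (count(a) * count(b)); the function names, the `delta` default, and the `threshold` argument are placeholders.

```python
from collections import Counter

def count_unigrams_and_bigrams(sentences):
    """One pass over an iterable of token lists, collecting frequencies."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams

def bigram_score(a, b, unigrams, bigrams, delta=1.0):
    """Score from the paper; delta discounts very rare pairs."""
    denominator = unigrams[a] * unigrams[b]
    if denominator == 0:
        return 0.0  # unseen word: never treat the pair as a phrase
    return (bigrams[(a, b)] - delta) / denominator

def candidate_phrases(unigrams, bigrams, threshold, delta=1.0):
    """All bigrams whose score exceeds the threshold."""
    return sorted(
        pair for pair in bigrams
        if bigram_score(pair[0], pair[1], unigrams, bigrams, delta) > threshold
    )
```

A transformation class could wrap these counts and rewrite each sentence on the fly, joining accepted pairs with "_".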
I have code ready for the phrases. Could you take a look?
Also, I found that the C code is greedy in building the phrases and, as a result, sometimes misses important phrases.
For example, in the sentence:
"this site is especially interesting"
This could be solved either by also adding "site_is" as a potential bigram or by doing trigrams.
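A small sketch of what greedy merging does with overlapping candidates. The set of accepted bigrams below is hypothetical and chosen only to illustrate the effect on the example sentence; it does not come from real counts or from the C code's output.

```python
def greedy_merge(tokens, accepted):
    """Left-to-right merge: once a word is used in a phrase, it is consumed."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in accepted:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip the second word, so it cannot start another phrase
        else:
            out.append(tokens[i])
            i += 1
    return out

def all_candidate_bigrams(tokens, accepted):
    """Non-greedy alternative: report every accepted pair, overlaps included."""
    return [a + "_" + b for a, b in zip(tokens, tokens[1:]) if (a, b) in accepted]

tokens = "this site is especially interesting".split()
accepted = {("this", "site"), ("site", "is")}  # hypothetical, for illustration only

print(greedy_merge(tokens, accepted))
print(all_candidate_bigrams(tokens, accepted))
```

With these assumed candidates, the greedy pass emits "this_site" and never considers "site_is", while the non-greedy listing keeps both.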
I'm having the following problem with bigrams that might be related to what @sumitborar noted. First of all, my objective is to get a vector representation of phrases within a corpus. I can create bigrams and then call