
Added phrase support #135

Closed
sumitborar wants to merge 2 commits

Conversation

@sumitborar commented Nov 10, 2013

No description provided.

@RadixSeven commented on gensim/models/word2vec.py in 1e2ffbe, Nov 10, 2013

Why get rid of this?

@RadixSeven commented on gensim/models/word2vec.py in 1e2ffbe, Nov 10, 2013

Needs a doc comment. Seeing what you've done here, I did the same in my own feature branch (it'll make merging easier). You can grab my doc comment if you'd like. The commit is 1ff7ae7a92c84ce186926bc62b4262417031c787

@RadixSeven commented on gensim/models/word2vec.py in 1e2ffbe, Nov 10, 2013

Getting rid of the Cython warning is not a good idea.

@RadixSeven commented on gensim/models/word2vec.py in 1e2ffbe, Nov 10, 2013

Now I understand. This is based on an old version and thus has the changes since then marked as deletions.

@RadixSeven (Contributor) commented Nov 10, 2013

I was puzzled by a number of your changes. On several lines, you undo improvements contributed by Radim and by myself. After thinking about it, I am not sure it was intentional. Maybe there was a problem in how you merged?

@sumitborar (Author) commented Nov 10, 2013

That was unintentional. Looks like there was a problem with the merge. I will re-merge and submit the changes again.


```python
last_word = word
last_word_count = word_count
new_sentences.append(new_sentence)
```

@piskvorky (Member) commented Nov 10, 2013

this is not going to work -- you are creating the entire corpus as a list in memory! that wouldn't scale very well :)

the bigram sentences must be coming out as a stream (one sentence at a time).
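
A minimal sketch of that streaming shape (`good_bigrams` here is a hypothetical set of accepted phrases produced by an earlier counting pass):

```python
def iter_phrased(sentences, good_bigrams):
    """Yield one transformed sentence at a time instead of building a list."""
    for sentence in sentences:
        out, i = [], 0
        while i < len(sentence):
            # greedy merge: join "new", "york" into "new_york" if it is a known phrase
            if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in good_bigrams:
                out.append(sentence[i] + '_' + sentence[i + 1])
                i += 2
            else:
                out.append(sentence[i])
                i += 1
        yield out  # generator: the full corpus is never held in memory
```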

```python
if bigram_word in vocab:
    vocab[bigram_word].count += 1
else:
    vocab[bigram_word] = Vocab(count=1)
```

@piskvorky (Member) commented Nov 10, 2013

this won't scale -- there are too many word combinations, cannot store them all in memory.

the original C code solves this by pruning infrequent entries from memory from time to time (a hack).
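
For reference, that pruning hack looks roughly like this (the cap and counter names are illustrative, following ReduceVocab in the C code):

```python
def prune_vocab(vocab, min_reduce):
    """Forget all entries rarer than min_reduce to cap memory use."""
    for key in list(vocab):
        if vocab[key].count < min_reduce:
            del vocab[key]

# inside the counting loop, whenever the dict grows past a fixed cap:
# if len(vocab) > max_vocab_size:
#     prune_vocab(vocab, min_reduce)
#     min_reduce += 1  # each round raises the bar, so pruning stays effective
```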

@sumitborar (Author) commented Nov 11, 2013

I can use temporary files to store the sentences and then reload them. I was thinking of doing this, but it would be slower.

As for the second case, I am not sure what a good solution is. We could do the pruning, but as you pointed out it's a hack.

@sumitborar (Author) commented Nov 11, 2013

For word combinations, we can do a two-pass approach: in the first pass we collect all the unigrams and their frequencies, and in the second pass we only add combinations of words which are above the threshold. This would reduce the size of the dictionary.
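
A sketch of that two-pass idea (the tiny inline corpus and the min_count threshold are illustrative; a real corpus would be a re-iterable stream such as LineSentence):

```python
from collections import defaultdict

sentences = [['new', 'york', 'taxi'], ['new', 'york']]  # stand-in for a real corpus
min_count = 2  # hypothetical frequency threshold

# Pass 1: unigram counts only -- small enough to keep in memory.
unigram_counts = defaultdict(int)
for sentence in sentences:
    for word in sentence:
        unigram_counts[word] += 1

# Pass 2: count a bigram only when both of its parts cleared the threshold,
# which keeps the bigram dictionary much smaller.
bigram_counts = defaultdict(int)
for sentence in sentences:
    for a, b in zip(sentence, sentence[1:]):
        if unigram_counts[a] >= min_count and unigram_counts[b] >= min_count:
            bigram_counts[(a, b)] += 1
```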

@piskvorky (Member) commented Nov 11, 2013

My idea was to stream the sentences from an iterator object, like I described in issue #130. Is there a reason this wouldn't work?

Basically, separate the phrase building as corpus preprocessing, rather than part of word2vec training.

About performance: unless something is super slow, the performance is not the primary objective. The primary objective is that it's easy to use and works over large corpora. If the user wants the fastest input possible, he can serialize the sentences to disk in LineSentence format and avoid any online transformations/processing.
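
For reference, that fast path is gensim's existing LineSentence: a plain-text file with one sentence per line and whitespace-separated tokens (the file name here is illustrative):

```python
from gensim.models.word2vec import LineSentence

# corpus.txt: one preprocessed sentence per line, tokens separated by spaces
sentences = LineSentence('corpus.txt')  # streams from disk, restartable iterator
```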

@RadixSeven (Contributor) commented Nov 11, 2013

Streaming the sentences and outputting the combined words as you read each sentence won't work because you don't know the bigram frequencies until you've read the whole corpus (imagine that "New York" only occurs twice - once at the beginning and once at the end). You need at least two passes, one to count the unigrams and bigrams and one to keep the better-than-unigram bigrams.

Storing the bigram counts in memory could be challenging. The Google n-gram dump reports a 23-fold increase between unigrams and bigrams. I'm working with a 2.7G-word corpus in which there are 2.8M terms, so a rough estimate in my case would be 64M bigrams; that works out to 128 bytes/bigram on an 8GiB machine. Probably enough. As an upper bound, their 1T-word corpus gave 314M bigrams. That gives you 26 bytes (3.25 ints) to store the average bigram on an 8GiB machine, which is probably not enough (pointer to object, pointer to string, and dict overhead leaves you with only 2 bytes for the bigram text).

To get all the bigrams in constant RAM, the best algorithm I can think of is (see the sketch below):

1. echo each bigram to a file, and also remember unigram probabilities in the same pass
2. do a merge-sort on the file, leaving a count field for duplicates (under Unix: sort tmpfile | uniq -c)
3. keep only those bigrams that exceed their unigram probabilities
4. process the corpus, outputting the kept bigrams wherever they appear
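
A sketch of those four steps (file names and the acceptance test are illustrative):

```python
import subprocess
from collections import defaultdict

def find_bigrams(sentences, tmp='bigrams.tmp', counted='bigram_counts.txt'):
    """Constant-RAM bigram detection via an on-disk merge sort."""
    unigram_counts, total = defaultdict(int), 0
    # Step 1: echo each bigram to a file; unigram counts fit comfortably in RAM.
    with open(tmp, 'w') as out:
        for sentence in sentences:
            for word in sentence:
                unigram_counts[word] += 1
                total += 1
            for a, b in zip(sentence, sentence[1:]):
                out.write('%s %s\n' % (a, b))
    # Step 2: out-of-core sort plus duplicate counting -- the Unix pipeline above.
    subprocess.check_call('sort %s | uniq -c > %s' % (tmp, counted), shell=True)
    # Step 3: keep bigrams that occur more often than independent unigram
    # probabilities would predict: count(ab) > count(a) * count(b) / N.
    kept = set()
    with open(counted) as lines:
        for line in lines:
            count, a, b = line.split()
            if int(count) > unigram_counts[a] * unigram_counts[b] / float(total):
                kept.add((a, b))
    return kept  # step 4: rescan the corpus and merge these wherever they appear
```
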
@piskvorky (Member) commented Nov 11, 2013

Yes, I think N passes are needed, where N is the number of phrase detection passes (N=1 for bigrams). Each pass must collect stats from its predecessor. I don't think that's a problem though.

To avoid misunderstandings, this is the flow that I had in mind in #130:

```python
sentences = LineSentence('/where/ever.txt')  # for example

bigram = Phrases(sentences)  # collect the necessary count stats
model1 = Word2Vec(bigram[sentences], ...)  # transform sentences on the fly

trigram = Phrases(sentences, gram=3)  # same as trigram = Phrases(bigram[sentences])
model = Word2Vec(trigram[sentences], ...)
```

I didn't look into this very deeply, but I think this should work.
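
A rough skeleton of what such a Phrases object could look like, using the word2phrase.c scoring rule for illustration; everything here is a sketch of the proposed API, not a final design:

```python
from collections import defaultdict

class Phrases(object):
    def __init__(self, sentences, min_count=5, threshold=10.0):
        self.min_count, self.threshold = min_count, threshold
        self.counts, self.total = defaultdict(int), 0
        for sentence in sentences:  # single pass: collect the count stats
            for word in sentence:
                self.counts[word] += 1
                self.total += 1
            for pair in zip(sentence, sentence[1:]):
                self.counts[pair] += 1

    def score(self, a, b):
        # word2phrase.c scoring: (count(ab) - min_count) * N / (count(a) * count(b))
        pa, pb = self.counts.get(a, 0), self.counts.get(b, 0)
        if pa == 0 or pb == 0:
            return 0.0
        return (self.counts.get((a, b), 0) - self.min_count) * self.total / float(pa * pb)

    def __getitem__(self, sentences):
        """Lazy transformation: sentences in, phrased sentences out, one at a time."""
        for sentence in sentences:
            out, i = [], 0
            while i < len(sentence):
                if i + 1 < len(sentence) and \
                        self.score(sentence[i], sentence[i + 1]) > self.threshold:
                    out.append(sentence[i] + '_' + sentence[i + 1])
                    i += 2
                else:
                    out.append(sentence[i])
                    i += 1
            yield out
```

Because `__getitem__` is a generator, `Word2Vec(bigram[sentences], ...)` would consume one phrased sentence at a time, matching the flow above.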

@piskvorky (Member) commented Nov 11, 2013

re. RAM for count stats: out-of-core sorting is a proper, solid solution. But doing it the same way the C tool does it may be easier to start with, and perhaps good enough :)

Incidentally, another of my open source tools, sqlitedict, is great for managing such out-of-core mappings. It uses sqlite as backend but acts like a dict.
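
Usage would look something like this (the file name is illustrative):

```python
from sqlitedict import SqliteDict

# counts live on disk in sqlite, but read and write like an ordinary dict
bigram_counts = SqliteDict('./bigram_counts.sqlite', autocommit=True)
bigram_counts['new york'] = bigram_counts.get('new york', 0) + 1
print(bigram_counts['new york'])
bigram_counts.close()
```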

@RadixSeven (Contributor) commented Nov 11, 2013

You're right that a decent approximation is all we probably need. The improvement in the model from detecting less frequent phrases as phrases is probably not very high. So, as long as the phrase detection gets the most important ones, it is probably good enough.

@mfcabrera (Collaborator) commented Oct 24, 2014

Hi, what is the status of this? Is someone currently working on this? I would like to give it a shot.

@piskvorky (Member) commented Oct 24, 2014

Not that I know of -- you're welcome @mfcabrera :)

This pull request is unmergeable -- 1) seriously out of date; and 2) corpus preprocessing (incl. phrase detection) should really be independent of word2vec, as described above.

@piskvorky (Member) commented Nov 2, 2014

@mfcabrera progress? Let me know if you need any help -- should be a straightforward job :)

@mfcabrera (Collaborator) commented Nov 2, 2014

Due to time constraints I only started last Friday. I am currently working on it, writing a separate module for this. My code is based on the word2phrase.c implementation, although some functionality/logic will be repeated from the word2vec code, particularly learning the vocabulary. When I have something working I will create a pull request :). If I understood correctly, in the first pass I need to calculate the counts/stats for the n-grams, and then generate the bigrams on the fly from the 1-gram sentence iterator.

@piskvorky (Member) commented Nov 2, 2014

That's right. The point is for the result object to be usable as an on-the-fly transformation.

I'm not sure what you mean by "repeat the learn vocabulary from word2vec" -- why would we repeat that?

@mfcabrera (Collaborator) commented Nov 3, 2014

@piskvorky, I meant that to learn the stats, word2phrase learns the vocabulary (unigrams and bigrams) from the input/training file (LearnVocabFromTrainFile). The word2vec module (and the C version as well) does that too, but without the bigrams, so "some" of the logic is repeated. I have a couple of questions: this should be a standalone module, right? (e.g. phrases.py) And where should it be located? (models?)

@piskvorky (Member) commented Nov 3, 2014

Yes, definitely standalone. This will be useful all around, not just in word2vec. Putting it in models is fine.

@piskvorky (Member) commented Nov 3, 2014

Superseded by #258, closing here.

@piskvorky closed this Nov 3, 2014
