Phrase support based on word2phrase #258
Conversation
Created this pull request to replace the one on #135.
Cheers, will go over it & comment.
vals = ['%s:%r' % (key, self.__dict__[key]) for key in sorted(self.__dict__) if not key.startswith('_')]
return "<" + ', '.join(vals) + ">"

class Phrases(object):
Inherit from utils.SaveLoad, to get free object save/load.
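(A minimal sketch of what that suggestion buys -- the constructor parameters here are illustrative, not the PR's actual signature:)

```python
from gensim import utils

class Phrases(utils.SaveLoad):
    """Detect common phrases (collocations) in a stream of sentences."""
    def __init__(self, min_count=5, threshold=10.0):
        self.min_count = min_count
        self.threshold = threshold
        self.vocab = {}

# SaveLoad provides persistence for free:
# phrases = Phrases()
# phrases.save('/tmp/phrases.model')
# phrases = Phrases.load('/tmp/phrases.model')
```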
word = sentence[-1]
vocab[word] += 1

if len(vocab) > max_vocab_size * 0.7:
This will be super slow if we threshold on 0.7 and then prune back down to the same 0.7. There has to be some margin between the threshold and the pruned dict size. Like this, the pruning will happen on practically every sentence once 0.7 is reached, no?
Yeah, I was thinking something similar. Word2phrase avoids this by pruning not at 0.7 but by slowly increasing a minimum threshold (min_reduce) starting at 1. I.e. once 0.7 * max_vocab_size is reached for the first time, it removes every word with a count of 1; the second time, every word with a count less than or equal to 2, and so on (this is the way it was implemented before -- should I go back to something similar?).
Well, in my little snippet I left the threshold as if len(vocab) > max_vocab_size, so the margin between the threshold & cutoff was 0.3. But doing it the "powerlaw" way (each min_reduce layer will probably be as large as everything that comes after it put together) is ok too -- I leave that to your judgement :)
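(For reference, a rough sketch of the word2phrase-style pruning being discussed: drop everything at or below a growing min_reduce each time the dict exceeds the cap. The names max_vocab_size and vocab are taken from the snippet above; the rest is illustrative.)

```python
def prune_vocab(vocab, min_reduce):
    """Drop every entry whose count is <= min_reduce, in place."""
    for word in list(vocab):
        if vocab[word] <= min_reduce:
            del vocab[word]

# inside the counting loop, roughly:
# if len(vocab) > max_vocab_size:
#     prune_vocab(vocab, min_reduce)
#     min_reduce += 1  # each prune removes slightly more frequent words
```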
vocabulary (based on Word2Phrase)
Looks good! 👍 Thanks Miguel. I'll run this on the English Wikipedia, hopefully tonight, and see what comes out.
Tried on EN wiki:
Hi Radim, thanks for the testing and feedback.
So in other words, how can I improve this to make it "mergeable"? :)
Maybe you can use https://github.com/maciejkula/glove-python/blob/master/glove/corpus_cython.pyx
@ogrisel We're in the process of merging GloVe into gensim with Maciej, but that's intended as an optional module (~requires a compiler). I'd like these new phrases to be generally available, even without a mandatory C/C++/Cython compilation. Or maybe compile optionally, hmm. Have a fallback pure-Python code path if compilation fails? @mfcabrera I tried creating a 10m string=>int dict and it's "only" about 1.1GB (64bit), from a fresh py27 shell. A unicode=>int dict is ~1.9GB. So 3.2GB is realistic -- the dict only needs to overallocate a little on resize, or Python not return freed memory to the OS, and we're there :( I'd propose storing utf8 strings to save RAM, instead of unicode. And keeping ...
Let me ping optimization guru @larsmans, maybe he has some memory-saving ideas too :)
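(To illustrate the utf8 suggestion -- a rough sketch, with `sentences` standing in for the input stream; gensim's utils.to_utf8 returns a utf8-encoded bytestring:)

```python
from collections import defaultdict
from gensim import utils

vocab = defaultdict(int)
for sentence in sentences:
    for word in sentence:
        # utf8 bytestring keys take noticeably less RAM than unicode keys
        vocab[utils.to_utf8(word)] += 1
```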
You could also use a Marisa Trie data structure, but you will also need a compiler:
I have good experience with marisa-trie; @IsaacHaze and I have experimented with it and stored huge n-gram tables from Wikipedia dumps in only a few hundred MB. We've since abandoned them because they only store strings and storing numbers in them was too tricky for our purposes. Our solution for big string tables is now SQLite, which comes with Python.
Oh, the other thing with that lib is that you cannot add a string to an existing trie. You have to have the strings in a set before you can build the trie. So the peak memory usage is still huge.
That's an interesting direction, thanks Olivier. I'm also thinking we should be able to utilize the known powerlaw distribution better, somehow. And accuracy is not very important here, so maybe throw in some approximative algo too. If we could afford an extra pass, we could precompute all sorts of handy things from the stream... A bloom filter for stuff that only appears once/twice = the vast majority of tokens/bigrams, so we can safely ignore those... Maybe worth sacrificing an extra pass for?
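(A rough, pure-Python illustration of that Bloom-filter idea -- not part of this PR; the bit-array size and the md5-based hashing are arbitrary choices:)

```python
import hashlib

class BloomFilter(object):
    """Probabilistic set: no false negatives, small chance of false positives."""
    def __init__(self, num_bits=2 ** 24, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (seed, key)).encode("utf8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```

The extra pass could fill one filter with everything seen once and a second filter with everything seen again, so the real counting pass can skip tokens and bigrams that never made it into the second filter.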
@larsmans do you use an automatically filled ROWID or INTEGER PRIMARY KEY to attribute a unique integer to each token? Do you use some kind of map(split document into words) / sort / reduce(unique) pipeline to build a list of unique words that is then injected in batch into SQLite?
@larsmans said:
It is true that you cannot add keys (strings) to an existing marisa-trie (you have to supply them to its constructor), but they don't have to be in a set (yielding them will work fine). The big limitation is that (as far as I know) you cannot change the corresponding values (counts) -- for instance, when you're collecting counts from a corpus. I do remember reading a paper by Hal Daumé and someone else on approximate counting in huge (streaming) collections...
Amit Goyal and Hal Daumé III, "Approximate Scalable Bounded Space Sketch for Large Data NLP"
Thanks for the link @IsaacHaze, sounds simple enough. I like :) @mfcabrera can you change strings to utf8 for now & get rid of the 0.7 multiplier & add some tests? I'd propose we merge as-is, and work on the fancier approximate-min-sketch / trie extension as a separate PR.
@ogrisel
Hi @piskvorky I just updated the code with the changes and added some basic tests. Please review :) Sorry for the delay, life got in the way. Regarding the approximate-min-sketch, I would also like to work on it; it looks really interesting. I have just one question: do you mean to have a pure Python implementation? I found this Python wrapper for the Madoka C++ library.
logging.basicConfig(stream=sys.stderr, level=logging.DEBUG, format="%(asctime)s\t%(levelname)-8s\t%(filename)s:%(lineno)-4d\t%(message)s")
from gensim.models.word2vec import Text8Corpus

sentences = Text8Corpus("/Users/miguel/Downloads/text8")
It would be better to load the file from sys.argv[1] instead of "/Users/miguel/Downloads/text8", which is only going to work on Macs of people named "Miguel" :)
@ogrisel Haha, hi -- this was for a personal test :) The full test is "path"-free, but sure thing, I will update this! :)
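(i.e. something along these lines:)

```python
import sys
from gensim.models.word2vec import Text8Corpus

# take the corpus path from the command line instead of hard-coding it
sentences = Text8Corpus(sys.argv[1])
```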
How about making the ...
Alternatively this could be implemented as a wrapper / adapter:
@mfcabrera thanks. I want to make a new release over the weekend, so will review & merge asap :) @ogrisel that's exactly what the current PR is meant to do, isn't it? Except in gensim, the transformation method is called ... But now I see ...
Actually I read the TransformationABC docstring. I think it would be more explicit to have:
transformer = SomeTransformer(parameters)
new_corpus = transformer.transform(original_corpus)
for words in new_corpus.get_texts():
    # do my stuff
The ...
This is a Gensim convention. I can't say I agree with it, but it's pretty ingrained.
Not to hijack @mfcabrera's pull request; your example in gensim would be:
transformer = SomeTransformer(original_corpus, parameters)
for words in transformer[original_corpus]:
    # do my stuff
but feel free to alias ... In fact, if more people like the long syntax, we can even add such an alias to "standard gensim" (to the ...
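(A hedged sketch of what such an alias could look like -- SomeTransformer and its internals are made up for illustration, not actual gensim code:)

```python
class SomeTransformer(object):
    def __init__(self, corpus, **parameters):
        self.parameters = parameters
        # ... collect whatever statistics are needed from `corpus` here ...

    def __getitem__(self, corpus):
        # lazily transform each document of the input corpus
        return (self._transform_one(document) for document in corpus)

    def _transform_one(self, document):
        return document  # placeholder for the actual transformation

    # explicit spelling for those who prefer it: transformer.transform(corpus)
    transform = __getitem__
```

Both `transformer[corpus]` and `transformer.transform(corpus)` would then return the same lazy generator.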
For some reason, GitHub won't allow me to open a PR to your fork of gensim @mfcabrera. Weird. It doesn't even offer your fork in the PR menu. Can you please have a look at my branch https://github.com/piskvorky/gensim/tree/mfcabrera-phrases ? It's a fork off your ... It contains the following changes:
That happens sometimes, for some reason. It's probably a bug. When that happens, open the PR page to any other repo / branch and edit the URL in your browser to point the source to the right repo name / branch name. Then you can create the PR.
That worked, thanks Olivier 👍
The move to utf8 (binary) seemed to do the trick; now memory is at 1.8GB with the default ... I increased the default to 40m (~3.6GB). Processing 1 billion words takes about 15 mins; the entire EN Wikipedia in 33 mins. Let me know if there are any more comments, otherwise I'll merge this & release gensim 0.10.3 tomorrow. @mfcabrera I never tried the Madoka wrapper, but if it works, sure! To be clear: anything compiled must be optional, with a fallback to the current pure-Python version if compilation fails / the 3rd party module cannot be imported. Maintaining these C++ wrappers over multiple OS / platforms / compilers is no fun, and I'd prefer not to be tied into maintaining that. A fast, pure Python/SciPy implementation would be ideal, yes :)
I had a look and Madoka seems nice. Let's continue with that in a new PR @mfcabrera -- any improvements will be welcome! I'm also thinking where else we could use these sketches. If we bring another library in, let's make the most of it.
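(For anyone picking this up later, a minimal pure Python/NumPy count-min sketch -- the kind of "approximate-min-sketch" counter discussed above. The width/depth values are arbitrary and this is not the Madoka API:)

```python
import hashlib
import numpy as np

class CountMinSketch(object):
    """Fixed-memory approximate counter: may overestimate counts, never underestimates."""
    def __init__(self, width=2 ** 20, depth=4):
        self.width, self.depth = width, depth
        self.table = np.zeros((depth, width), dtype=np.uint32)

    def _cells(self, key):
        for row in range(self.depth):
            digest = hashlib.md5(("%d:%s" % (row, key)).encode("utf8")).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row, col] += count

    def query(self, key):
        return int(min(self.table[row, col] for row, col in self._cells(key)))
```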
@piskvorky have you had a look at the phrases extracted on the full Wikipedia, do they seem legit? Do you get stuff that looks like spurious bi- or tri-grams, which don't correspond to meaningful phrases such as named entities like "New York" and "New York Times", for instance?
@ogrisel the word2vec formula is basically a variant of PMI (pointwise mutual info, without taking the log), which is known to overestimate scores for low-frequency collocations. That's why they put the extra ... You can see the top bigrams from wiki (min_count 10, threshold 4.0) here.
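(For context, this is roughly the score in question -- the exact normalizer (total word count vs. vocabulary size) varies between word2phrase and other implementations, so treat this as a sketch:)

```python
def bigram_score(count_a, count_b, count_ab, min_count, corpus_size):
    """PMI-like word2phrase score: large when the bigram `a b` occurs far more
    often than the individual counts of `a` and `b` would suggest by chance."""
    return float(count_ab - min_count) * corpus_size / (count_a * count_b)

# a bigram is promoted to a phrase when this score exceeds the `threshold` parameter
```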
Looks great. Have you had a look at the bottom bigrams? Have you tried to do several passes to extract trigrams or more?
Bottom by score: you can see those in that printout above, too. The list is sorted by raw bigram count (=most frequent bigrams), but in the 3rd column there's the pseudo-PMI score. Stuff like ...
Bottom by frequency (=count 11, because min_count was 10): ...
Bottom by frequency is less important IMO, because getting these rare bigrams wrong will be... rare.
Alright, thanks. Have you tried to run 2 passes to extract trigrams like "new_york_times", for instance?
@mfcabrera I don't want to wait any longer with the release. I'll merge my own changes via #261, without waiting for this PR to be updated.
@ogrisel not on Wiki yet; I'll post the >2-gram results later.
@piskvorky Sorry for not updating the PR! I was (I am) on vacation with funky internet access. I am glad the changes were merged. I will start working on the min-sketch thingy when I get back. :D!
First approach to learning phrases, based on word2phrase. Please comment. No tests and no proper documentation yet, but they will come.