
Phrase support based on word2phrase #258

Closed
wants to merge 8 commits

Conversation

@mfcabrera (Contributor)

First approach to learning phrases based on word2phrase. Please comment. No tests or proper documentation yet, but they will come.

@mfcabrera (Contributor Author)

Created this pull request to replace the one in #135.

@mfcabrera changed the title from "Phrases" to "Phrase support baed on word2phrase" on Nov 3, 2014
@piskvorky mentioned this pull request on Nov 3, 2014
@piskvorky (Owner)

Cheers, will go over it & comment.

vals = ['%s:%r' % (key, self.__dict__[key]) for key in sorted(self.__dict__) if not key.startswith('_')]
return "<" + ', '.join(vals) + ">"

class Phrases(object):
@piskvorky (Owner)

inherit from utils.SaveLoad, to get free object save/load.
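
A minimal illustration of that suggestion (not the actual PR code; the parameter names here are placeholders). gensim's utils.SaveLoad gives the subclass save()/load() methods:

from gensim import utils

class Phrases(utils.SaveLoad):
    """Detect frequent bigrams ("phrases") from a stream of sentences."""
    def __init__(self, sentences=None, min_count=5, threshold=10.0):
        ...  # learn vocab/bigram counts from `sentences`

# persistence then comes for free from utils.SaveLoad:
# bigram = Phrases(sentences); bigram.save('/tmp/phrases'); bigram = Phrases.load('/tmp/phrases')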

@mfcabrera changed the title from "Phrase support baed on word2phrase" to "Phrase support based on word2phrase" on Nov 4, 2014
word = sentence[-1]
vocab[word] += 1

if len(vocab) > max_vocab_size * 0.7:
@piskvorky (Owner)

This will be super slow if we threshold on 0.7 and then prune to the same 0.7. There has to be some margin between the threshold and the pruned dict size. Like this, the pruning will happen on practically every sentence once 0.7 is reached, no?

@mfcabrera (Contributor Author)

Yeah, I was thinking something similar. word2phrase avoids this by pruning not at 0.7 but by slowly increasing a minimum count threshold (min_reduce), starting at 1. I.e. once 0.7 * max_vocab_size is reached for the first time, it removes every word with a count of 1; the second time, every word with a count less than or equal to 2, and so on (this is the way it was implemented before -- should I go back to something similar?).
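
A rough sketch of that word2phrase-style pruning, just to make the idea concrete (names are illustrative, not the PR code):

from collections import defaultdict

def learn_vocab(sentences, max_vocab_size):
    vocab = defaultdict(int)
    min_reduce = 1
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1
        if len(vocab) > 0.7 * max_vocab_size:
            # drop all words with count <= min_reduce, then raise the bar, so each
            # successive prune removes a (roughly) power-law-sized slice of the tail
            vocab = defaultdict(int, ((w, c) for w, c in vocab.items() if c > min_reduce))
            min_reduce += 1
    return vocab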

@piskvorky (Owner)

Well, in my little snippet I left the threshold as if len(vocab) > max_vocab_size. So the margin between the threshold & cutoff was 0.3. But doing it the "powerlaw" way (each min_reduce layer will probably be as large as everything that comes after it put together) is ok too -- I leave that to your judgement :)

@piskvorky (Owner)

Looks good! 👍 Thanks Miguel.

I'll run this on the English Wikipedia, hopefully tonight, and see what comes out.

@piskvorky (Owner)

Tried on EN wiki:

  • default max_vocab=500m is way too large. I killed the process when it reached ~20GB of RAM.
  • Then I set 10m max_vocab, which used ~3GB RAM (even that is too much for a default, I think).
  • The pruning step with 10m (7m really, since we multiply that number by 0.7 -- a meaningless operation now that I think of it, why not just pass 7m directly?) happens once every ~10k documents. So over the entire wiki (~3.5m docs), this would be 350x => anything that appears less than min_reduce=350 times within a chunk of ~10k documents has no chance to get in.
  • Unless that defaultdict consumes way more than dict (I don't think so), I'm not sure how to cut down RAM. Must be Python's object overhead rather than the dict itself.
  • Doubling max_vocab to 20m (=effective 14m) increases the interval between prunes to ~100k documents. RAM goes up 2x = ~6.5GB.

@mfcabrera (Contributor Author)

Hi Radim, thanks for the testing and feedback.

  • I used the C version's default. But yeah, it might be too big for Python.
  • I don't know whether 3GB is too much, assuming you have at least 4GB of RAM in your box.
  • Yeah, we can calculate that 0.7 * max_vocab_size once at the beginning.
  • So, if I understood correctly, I should decrease the number of times pruning happens. Any ideas here?
  • If the memory issue is Python's fault, any ideas? Cython? Using NumPy arrays with a fixed-maximum-length string dtype and mimicking the C code?

So in other words, how can I improve this to make it "mergeable"? :)
Where could I get your test set so I can reproduce it (I mean, what preprocessing did you apply to it)?

@ogrisel commented Nov 5, 2014

Unless that defaultdict consumes way more than dict (I don't think so), I'm not sure how to cut down RAM. Must be the Python's object overhead, rather than dict itself.

Maybe you can use the unordered_map[string, int] that comes from the standard C++ library and is wrapped by default in recent versions of Cython. See for instance:

https://github.com/maciejkula/glove-python/blob/master/glove/corpus_cython.pyx

@piskvorky (Owner)

@ogrisel We're in the process of merging GloVe into gensim with Maciej, but that's intended as an optional module (~requires a compiler). I'd like these new phrases to be generally available, even without a mandatory C/C++/Cython compilation.

Or maybe compile optionally, hmm. Have a fallback pure-Python code if compilation fails?

@mfcabrera I tried creating a 10m string=>int dict and it's "only" about 1.1GB (64-bit), from a fresh py27 shell. A unicode=>int dict is ~1.9GB. So 3.2GB is realistic -- the dict only needs to overallocate a little on resize, or Python not return freed memory to the OS, and we're there :(

I'd propose storing utf8 strings to save RAM, instead of unicode. And keeping max_vocab=10m as default (real 10m, not 7m).
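
For illustration, storing utf8 keys means something along these lines (a py2-era sketch with a local helper; gensim's utils has a similar to_utf8 helper):

def to_utf8(text):
    # keep vocab keys as compact utf8 bytestrings instead of unicode objects
    if isinstance(text, unicode):
        return text.encode('utf8')
    return text

vocab[to_utf8(word)] += 1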

@piskvorky (Owner)

Let me ping optimization guru @larsmans, maybe he has some memory-saving ideas too :)

@ogrisel commented Nov 5, 2014

You could also use a Marisa trie data structure, but you will also need a compiler:

https://github.com/kmike/marisa-trie

@larsmans (Contributor) commented Nov 5, 2014

I have good experience with marisa-trie; @IsaacHaze and I have experimented with it and stored huge n-gram tables from Wikipedia dumps in only a few hundred MB. We've since abandoned them because they only store strings and storing numbers in them was too tricky for our purposes. Our solution for big string tables is now SQLite, which comes with Python.

@larsmans (Contributor) commented Nov 5, 2014

Oh, the other thing with that lib is that you cannot add a string to an existing trie. You have to have the strings in a set before you can build the trie. So the peak memory usage is still huge.
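
For illustration, a minimal sketch with the kmike/marisa-trie Python package linked above, showing the build-once limitation:

import marisa_trie

# all keys must be known up front: the trie is built in the constructor and
# cannot be extended afterwards (keys must be unicode strings)
trie = marisa_trie.Trie([u'new', u'york', u'new_york'])

u'new_york' in trie    # membership test
trie[u'new_york']      # stable integer id for the key
# there is no __setitem__: per-key counts have to live in a separate array,
# indexed by these ids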

@piskvorky (Owner)

That's an interesting direction, thanks Olivier.

I'm also thinking we should be able to utilize the known powerlaw distribution better, somehow. And accuracy is not very important here, so maybe throw in some approximative algo too.

If we could afford an extra pass, we could precompute all sort of handy things from the stream... A bloom filter for stuff that only appears once/twice = vast majority of tokens/bigrams, so we can safely ignore those...

Maybe worth sacrificing an extra pass for?
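
Just to sketch what such an extra pass could look like (purely illustrative, not part of this PR):

import hashlib

class BloomFilter(object):
    # approximate set membership: false positives possible, false negatives not
    def __init__(self, num_bits=2**27, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.md5(('%d:%s' % (seed, key)).encode('utf8')).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# pass 1: add each token/bigram to `seen_once`; if it was already there, add it to `seen_twice`.
# pass 2: only count items found in `seen_twice`, so hapaxes never enter the vocab dict.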

@ogrisel commented Nov 5, 2014

@larsmans do you use an automatically filled ROWID or an INTEGER PRIMARY KEY to assign a unique integer to each token?

Do you use some kind of map (split documents into words) / sort / reduce (unique) pipeline to build a list of unique words that is then injected in batch into SQLite?

@IsaacHaze

@larsmans said:

Oh, the other thing with that lib is that you cannot add a string to an existing trie. You have to have the strings in a set before you can build the trie. So the peak memory usage is still huge.

It is true that you cannot add keys (strings) to an existing marisa-trie (you have to supply them to its constructor), but they don't have to be in a set (yielding them will work fine). The big limitation is that (as far as I know) you cannot change the corresponding values (counts), for instance when you're collecting counts from a corpus.

I do remember reading a paper by Hal Daume and someone else on approximate counting in huge (streaming) collections...

@IsaacHaze

Amit Goyal and Hal Daumé III, "Approximate Scalable Bounded Space Sketch for Large Data NLP".

@piskvorky (Owner)

Thanks for the link @IsaacHaze , sounds simple enough. I like :)
(assuming we have enough good hash functions)

@mfcabrera can you change strings to utf8 for now & get rid of the 0.7 multiplier & add some tests? I'd propose we merge as-is, and work on the fancier approximate-min-sketch / trie extension as a separate PR.
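
For reference, the approximate counting idea boils down to something like this count-min sketch (a toy sketch for illustration, not gensim code):

import hashlib

class CountMinSketch(object):
    # fixed memory regardless of vocabulary size; estimates never undercount
    def __init__(self, width=2**20, depth=4):
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        digest = hashlib.md5(('%d:%s' % (row, key)).encode('utf8')).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.tables[row][self._index(key, row)] += count

    def estimate(self, key):
        return min(self.tables[row][self._index(key, row)] for row in range(self.depth))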

@larsmans (Contributor) commented Nov 5, 2014

@ogrisel INTEGER PRIMARY KEY, but SQLite actually maps the first of those to the ROWID that it generates anyway as an optimization. The other reason to switch is that we needed several indexes to perform different operations, and maintaining these got kind of hairy. Otherwise, marisa-trie might have been fine.

@mfcabrera (Contributor Author)

Hi @piskvorky, I just updated the code with the changes and added some basic tests. Please review :) Sorry for the delay, life got in the way. Regarding the approximate-min-sketch, I would also like to work on it -- it looks really interesting. I have just one question: do you mean a pure Python implementation? I found this Python wrapper for the Madoka C++ library.

logging.basicConfig(stream=sys.stderr, level=logging.DEBUG, format="%(asctime)s\t%(levelname)-8s\t%(filename)s:%(lineno)-4d\t%(message)s")
from gensim.models.word2vec import Text8Corpus

sentences = Text8Corpus("/Users/miguel/Downloads/text8")
@ogrisel

It would be better to load the file from sys.argv[1] instead of "/Users/miguel/Downloads/text8", which is only going to work on Macs belonging to people named "Miguel" :)

@mfcabrera (Contributor Author)

@ogrisel Haha, hi, this was for a personal test :) -- the full test is "path"-free, but sure thing, I will update this! :)

@ogrisel commented Nov 12, 2014

How about making the Corpus class phrase-aware by automatically embedding a Phrases instance, so that it returns phrase-aware tokens when calling the get_texts method?

@ogrisel commented Nov 12, 2014

How about making the Corpus class phrase-aware by automatically embedding a Phrases instance, so that it returns phrase-aware tokens when calling the get_texts method?

Alternatively, this could be implemented as a wrapper / adapter:

PhraseCorpus could be initialized by passing any Corpus instance as a constructor parameter, and export a get_texts method that internally calls the get_texts method of the underlying corpus and collapses detected phrases with the _ marker.
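
Something along these lines (a hypothetical adapter mirroring the suggestion above, not code from this PR):

class PhraseCorpus(object):
    def __init__(self, corpus, phrases):
        self.corpus = corpus      # any corpus exposing get_texts()
        self.phrases = phrases    # a trained Phrases model

    def get_texts(self):
        for tokens in self.corpus.get_texts():
            # collapse detected collocations, e.g. ["new", "york"] -> ["new_york"]
            yield self.phrases[tokens]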

@piskvorky (Owner)

@mfcabrera thanks. I want to make a new release over the weekend, so I will review & merge ASAP :)

@ogrisel that's exactly what the current PR is meant to do, isn't it? Except in gensim, the transformation method is called __getitem__, so that calling new_corpus = phrase_corpus_object[old_corpus] gives you the transformed corpus (iterable).

But now I see Phrases.__getitem__ returns only a generator instead of an iterable. I'll change this to use the TransformationABC properly before merging. Thanks for bringing this up.

@ogrisel commented Nov 12, 2014

The __getitem__ API is really not intuitive to me. Why not use a method with a more explicit name like get_texts?

@ogrisel commented Nov 12, 2014

Actually I read the TransformationABC docstring. I think it would be more explicit to have:

transformer = SomeTransformer(parameters)
new_corpus = transformer.transform(original_corpus)
for words in new_corpus.get_texts():
     # do my stuff

The collection[key] syntax should be reserved to dict-like, list-like or array-like lookups in my opinion.

@larsmans (Contributor)

The collection[key] syntax should be reserved to dict-like, list-like or array-like lookups in my opinion.

This is a Gensim convention. I can't say I agree with it, but it's pretty ingrained.

@piskvorky (Owner)

Not to hijack @mfcabrera's pull request; your example in gensim would be:

transformer = SomeTransformer(original_corpus, parameters)
for words in transformer[original_corpus]:
     # do my stuff

but feel free to alias __getitem__ and __iter__ methods to whatever name you like, should be trivial.

In fact, if more people like the long syntax, we can even add such an alias to "standard gensim" (to the TransformationABC), I don't care either way.

@piskvorky (Owner)

For some reason, GitHub won't allow me to open a PR against your fork of gensim, @mfcabrera. Weird. It doesn't even offer your fork in the PR menu.

Can you please have a look at my branch https://github.com/piskvorky/gensim/tree/mfcabrera-phrases ? It's a fork off your phrases branch, meant to be merged into your branch = into this PR. Cheers.

It contains the following changes:

  • fixes for py3k compatibility
  • fixes for consistent unicode output & utf8/unicode input
  • bigram[corpus] now returns an iterable (instead of just a generator) @ogrisel -- see the usage sketch after this list
  • bigram[doc] works too
  • I added module intro & updated all docs, including the HTML Sphinx docs
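
A rough usage sketch of what that means in practice (parameter values are illustrative; `sentences` is any iterable of token lists):

from gensim.models.phrases import Phrases

bigram = Phrases(sentences, min_count=10, threshold=4.0)   # learn phrases from a corpus

print(bigram[[u'the', u'new', u'york', u'times', u'is', u'a', u'newspaper']])
# -> e.g. [u'the', u'new_york', u'times', u'is', u'a', u'newspaper']

for phrased in bigram[sentences]:   # or transform a whole corpus as an iterable
    pass  # feed into word2vec, LSI, ...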

@ogrisel commented Nov 15, 2014

For some reason, github won't allow me to open a PR to your fork of gensim @mfcabrera . Weird. It doesn't even offer your fork in the PR menu.

That happens for some reason. It's probably a bug. When that happens, open the PR page to any other repo / branch and edit the URL in your browser to point the source to the right repo name / branch name. Then you can create the PR.

@piskvorky (Owner)

That worked, thanks Olivier 👍

@piskvorky (Owner)

The move to utf8 (binary) seemed to do the trick, now memory is at 1.8GB with default max_vocab_size=20m, and the prune step happens once every ~120m words. I think that's fine.

I increased the default to 40m (~3.6GB). Processing 1 billion words takes about 15 mins; entire EN Wikipedia in 33 mins.

Let me know if there are any more comments, otherwise I'll merge this & release gensim 0.10.3 tomorrow.

@mfcabrera I never tried the Madoka wrapper, but if it works, sure! To be clear: anything compiled must be optional, with fallback to the current pure-Python version, if compilation fails / 3rd party module cannot be imported. Maintaining these C++ wrappers over multiple OS / platforms / compilers is no fun, I'd prefer not to be tied into maintaining that. A fast, pure Python/SciPy implementation would be ideal, yes :)

@piskvorky (Owner)

I had a look and Madoka seems nice. Let's continue with that in a new PR @mfcabrera -- any improvements will be welcome!

I'm also thinking where else we could use these sketches. If we bring another library in, let's make the most of it.

@ogrisel commented Nov 15, 2014

I increased the default to 40m (~3.6GB). Processing 1 billion words takes about 15 mins; entire EN Wikipedia in 33 mins.

Let me know if there are any more comments, otherwise I'll merge this & release gensim 0.10.3 tomorrow.

@piskvorky have you had a look at the phrases extracted on the full Wikipedia? Do they seem legit? Do you get stuff that looks like spurious bi- or tri-grams that don't correspond to meaningful phrases, such as named entities like "New York" and "New York Times", for instance?

@piskvorky (Owner)

@ogrisel the word2vec formula is basically a variant of PMI (pointwise mutual info, without taking the log), which is known to overestimate scores for low freq collocations.

That's why they put the extra min_count parameter, to ignore spurious low freq bigrams.
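
Concretely, the score is roughly this (a sketch of the word2phrase-style formula, not lifted verbatim from the PR):

# pseudo-PMI: large when the bigram occurs much more often than its parts would predict;
# subtracting min_count discounts rare bigrams, which plain PMI over-scores
score = (bigram_count - min_count) / float(worda_count) / wordb_count * corpus_word_count
# the bigram "worda_wordb" is kept if score > threshold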

You can see the top bigrams from wiki (min_count 10, threshold 4.0) here.

@ogrisel commented Nov 16, 2014

Looks great. Have you had a look at the bottom bigrams? Have you tried to do several passes to extract trigrams or more?

@piskvorky (Owner)

Bottom by score: you can see those in that printout above, too. The list is sorted by raw bigram count (=most frequent bigrams), but in the 3rd column there's the pseudo-PMI score. Stuff like new_york is actually near the threshold limit of 4.0, so that would be the "bottom".

Bottom by frequency (=count 11, because min_count was 10):

diego_benaglio  11      4.00591027209
aaron_prioleau  11      4.00556663873
cs_târgovişte   11      4.00512323324
relative_gravimeters    11      4.00401001349
lwf_geraldine   11      4.00389337039
minna_jørgensen 11      4.00379507267
videos_qtvr     11      4.00365570359
futaba_kun      11      4.00365203611
ha_tumah        11      4.003477472
textured_polygons       11      4.00337699452
fairhair_harald 11      4.00306019345
peard_judy      11      4.00231019083
fried_twinkie   11      4.00227647297
gardel_patrice  11      4.00168650245
ira_oira        11      4.00070921986
goncharov_isabelle      11      4.00053637698
bernardino_ludovisi     11      4.00029325041
articulata_hypothesis   11      4.0002873923

Bottom by frequency is less important IMO, because getting these rare bigrams wrong will be ...rare.

@ogrisel commented Nov 17, 2014

Alright, thanks. Have you tried to run 2 passes to extract trigrams like "new_york_times" for instance?

@piskvorky (Owner)

@mfcabrera I don't want to wait any longer with the release. I'll merge my own changes via #261 , without waiting for this PR to be updated.

@piskvorky closed this on Nov 17, 2014
@piskvorky (Owner)

@ogrisel not on Wiki yet; I'll post the >2-gram results later.

@mfcabrera (Contributor Author)

@piskvorky Sorry for not updating the PR! I was (I am) on vacation with funky internet access. I am glad the changes were merged. I will start working on the min-sketch thingy when I get back. :D
