
Phrase support based on word2phrase #258

Closed
wants to merge 8 commits

Conversation

@mfcabrera (Contributor)

First approach to learning phrases based on word2phrase. Please comment. No tests or proper documentation yet, but they will come.

@mfcabrera (Contributor Author)

Created this pull request to replace the one in #135.

@mfcabrera changed the title from "Phrases" to "Phrase support baed on word2phrase" on Nov 3, 2014
@piskvorky mentioned this pull request on Nov 3, 2014
@piskvorky (Owner)

Cheers, will go over it & comment.

vals = ['%s:%r' % (key, self.__dict__[key]) for key in sorted(self.__dict__) if not key.startswith('_')]
return "<" + ', '.join(vals) + ">"

class Phrases(object):
@piskvorky (Owner)

inherit from utils.SaveLoad, to get free object save/load.
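
A minimal illustration of that suggestion (not the actual PR code; the parameter names here are placeholders). gensim's utils.SaveLoad gives the subclass save()/load() methods:

from gensim import utils

class Phrases(utils.SaveLoad):
    """Detect frequent bigrams ("phrases") from a stream of sentences."""
    def __init__(self, sentences=None, min_count=5, threshold=10.0):
        ...  # learn vocab/bigram counts from `sentences`

# persistence then comes for free from utils.SaveLoad:
# bigram = Phrases(sentences); bigram.save('/tmp/phrases'); bigram = Phrases.load('/tmp/phrases')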

@mfcabrera changed the title from "Phrase support baed on word2phrase" to "Phrase support based on word2phrase" on Nov 4, 2014
word = sentence[-1]
vocab[word] += 1

if len(vocab) > max_vocab_size * 0.7:
@piskvorky (Owner)

This will be super slow if we threshold on 0.7 and then prune to the same 0.7. There has to be some margin between the threshold and the pruned dict size. Like this, the pruning will happen on practically every sentence once 0.7 is reached, no?

@mfcabrera (Contributor Author)

Yeah, I was thinking something similar. word2phrase avoids this by pruning not at 0.7 but by slowly increasing a minimum count threshold (min_reduce), starting at 1. I.e. once 0.7 * max_vocab_size is reached for the first time, it removes every word with a count of 1; the second time, every word with a count less than or equal to 2, and so on (this is the way it was implemented before -- should I go back to something similar?).
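
A rough sketch of that word2phrase-style pruning, just to make the idea concrete (names are illustrative, not the PR code):

from collections import defaultdict

def learn_vocab(sentences, max_vocab_size):
    vocab = defaultdict(int)
    min_reduce = 1
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1
        if len(vocab) > 0.7 * max_vocab_size:
            # drop all words with count <= min_reduce, then raise the bar, so each
            # successive prune removes a (roughly) power-law-sized slice of the tail
            vocab = defaultdict(int, ((w, c) for w, c in vocab.items() if c > min_reduce))
            min_reduce += 1
    return vocab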

@piskvorky (Owner)

Well, in my little snippet I left the threshold as if len(vocab) > max_vocab_size. So the margin between the threshold & cutoff was 0.3. But doing it the "powerlaw" way (each min_reduce layer will probably be as large as everything that comes after it put together) is ok too -- I leave that to your judgement :)

@piskvorky (Owner)

Looks good! 👍 Thanks Miguel.

I'll run this on the English Wikipedia, hopefully tonight, and see what comes out.

@piskvorky (Owner)

Tried on EN wiki:

  • default max_vocab=500m is way too large. I killed the process when it reached ~20GB of RAM.
  • Then I set 10m max_vocab, which used ~3GB RAM (even that is too much for a default, I think).
  • The pruning step with 10m (7m really, since we multiply that number by 0.7 -- a meaningless operation now that I think of it, why not just pass 7m directly?) happens once every ~10k documents. So over the entire wiki (~3.5m docs), this would be 350x => anything that appears less than min_reduce=350 times within a chunk of ~10k documents has no chance to get in.
  • Unless that defaultdict consumes way more than dict (I don't think so), I'm not sure how to cut down RAM. Must be Python's object overhead rather than the dict itself.
  • Doubling max_vocab to 20m (=effective 14m) increases the interval between prunes to ~100k documents. RAM goes up 2x = ~6.5GB.

@mfcabrera (Contributor Author)

Hi Radim, thanks for the testing and feedback.

  • I used the C version's default. But yeah, it might be too big for Python.
  • I don't know whether 3GB is too much, assuming you have at least 4GB of RAM in your box.
  • Yeah, we can calculate that 0.7 * max_vocab_size once at the beginning.
  • So, if I understood correctly, I should decrease the number of times pruning happens. Any ideas here?
  • If the memory issue is Python's fault, any ideas? Cython? Using NumPy arrays with a fixed-maximum-length string dtype and mimicking the C code?

So in other words, how can I improve this to make it "mergeable"? :)
Where could I get your test set so I can reproduce it (I mean, what preprocessing did you apply to it)?

@ogrisel commented Nov 5, 2014

Unless that defaultdict consumes way more than dict (I don't think so), I'm not sure how to cut down RAM. Must be the Python's object overhead, rather than dict itself.

Maybe you can use the unordered_map[string, int] that comes from the standard C++ library and is wrapped by default in recent versions of Cython. See for instance:

https://github.com/maciejkula/glove-python/blob/master/glove/corpus_cython.pyx

@piskvorky (Owner)

@ogrisel We're in the process of merging GloVe into gensim with Maciej, but that's intended as an optional module (~requires a compiler). I'd like these new phrases to be generally available, even without a mandatory C/C++/Cython compilation.

Or maybe compile optionally, hmm. Have a fallback pure-Python code if compilation fails?

@mfcabrera I tried creating a 10m string=>int dict and it's "only" about 1.1GB (64-bit), from a fresh py27 shell. A unicode=>int dict is ~1.9GB. So 3.2GB is realistic -- the dict only needs to overallocate a little on resize, or Python not return freed memory to the OS, and we're there :(

I'd propose storing utf8 strings to save RAM, instead of unicode. And keeping max_vocab=10m as default (real 10m, not 7m).
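
For illustration, storing utf8 keys means something along these lines (a py2-era sketch with a local helper; gensim's utils has a similar to_utf8 helper):

def to_utf8(text):
    # keep vocab keys as compact utf8 bytestrings instead of unicode objects
    if isinstance(text, unicode):
        return text.encode('utf8')
    return text

vocab[to_utf8(word)] += 1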

@piskvorky (Owner)

Let me ping optimization guru @larsmans, maybe he has some memory-saving ideas too :)

@ogrisel commented Nov 5, 2014

You could also use a Marisa trie data structure, but you will also need a compiler:

https://github.com/kmike/marisa-trie

@larsmans (Contributor) commented Nov 5, 2014

I have good experience with marisa-trie; @IsaacHaze and I have experimented with it and stored huge n-gram tables from Wikipedia dumps in only a few hundred MB. We've since abandoned them because they only store strings and storing numbers in them was too tricky for our purposes. Our solution for big string tables is now SQLite, which comes with Python.

@larsmans (Contributor) commented Nov 5, 2014

Oh, the other thing with that lib is that you cannot add a string to an existing trie. You have to have the strings in a set before you can build the trie. So the peak memory usage is still huge.
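
For illustration, a minimal sketch with the kmike/marisa-trie Python package linked above, showing the build-once limitation:

import marisa_trie

# all keys must be known up front: the trie is built in the constructor and
# cannot be extended afterwards (keys must be unicode strings)
trie = marisa_trie.Trie([u'new', u'york', u'new_york'])

u'new_york' in trie    # membership test
trie[u'new_york']      # stable integer id for the key
# there is no __setitem__: per-key counts have to live in a separate array,
# indexed by these ids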

@piskvorky (Owner)

That's an interesting direction, thanks Olivier.

I'm also thinking we should be able to utilize the known powerlaw distribution better, somehow. And accuracy is not very important here, so maybe throw in some approximative algo too.

If we could afford an extra pass, we could precompute all sort of handy things from the stream... A bloom filter for stuff that only appears once/twice = vast majority of tokens/bigrams, so we can safely ignore those...

Maybe worth sacrificing an extra pass for?
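
Just to sketch what such an extra pass could look like (purely illustrative, not part of this PR):

import hashlib

class BloomFilter(object):
    # approximate set membership: false positives possible, false negatives not
    def __init__(self, num_bits=2**27, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.md5(('%d:%s' % (seed, key)).encode('utf8')).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# pass 1: add each token/bigram to `seen_once`; if it was already there, add it to `seen_twice`.
# pass 2: only count items found in `seen_twice`, so hapaxes never enter the vocab dict.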

@ogrisel commented Nov 5, 2014

@larsmans do you use an automatically filled ROWID or an INTEGER PRIMARY KEY to assign a unique integer to each token?

Do you use some kind of map (split documents into words) / sort / reduce (unique) pipeline to build a list of unique words that is then injected in batch into SQLite?

@IsaacHaze

@larsmans said:

Oh, the other thing with that lib is that you cannot add a string to an existing trie. You have to have the strings in a set before you can build the trie. So the peak memory usage is still huge.

It is true that you cannot add keys (strings) to an existing marisa-trie (you have to supply them to its constructor), but they don't have to be in a set (yielding them will work fine). The big limitation is that (as far as I know) you cannot change the corresponding values (counts), for instance when you're collecting counts from a corpus.

I do remember reading a paper by Hal Daume and someone else on approximate counting in huge (streaming) collections...

@IsaacHaze

Amit Goyal and Hal Daumé III, "Approximate Scalable Bounded Space Sketch for Large Data NLP".

@piskvorky (Owner)

Thanks for the link @IsaacHaze , sounds simple enough. I like :)
(assuming we have enough good hash functions)

@mfcabrera can you change strings to utf8 for now & get rid of the 0.7 multiplier & add some tests? I'd propose we merge as-is, and work on the fancier approximate-min-sketch / trie extension as a separate PR.
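
For reference, the approximate counting idea boils down to something like this count-min sketch (a toy sketch for illustration, not gensim code):

import hashlib

class CountMinSketch(object):
    # fixed memory regardless of vocabulary size; estimates never undercount
    def __init__(self, width=2**20, depth=4):
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        digest = hashlib.md5(('%d:%s' % (row, key)).encode('utf8')).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.tables[row][self._index(key, row)] += count

    def estimate(self, key):
        return min(self.tables[row][self._index(key, row)] for row in range(self.depth))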

@larsmans (Contributor) commented Nov 5, 2014

@ogrisel INTEGER PRIMARY KEY, but SQLite actually maps the first of those to the ROWID that it generates anyway as an optimization. The other reason to switch is that we needed several indexes to perform different operations, and maintaining these got kind of hairy. Otherwise, marisa-trie might have been fine.

@mfcabrera (Contributor Author)

Hi @piskvorky, I just updated the code with the changes and added some basic tests. Please review :) Sorry for the delay, life got in the way. Regarding the approximate-min-sketch, I would also like to work on it -- it looks really interesting. I have just one question: do you mean a pure Python implementation? I found this Python wrapper for the Madoka C++ library.

logging.basicConfig(stream=sys.stderr, level=logging.DEBUG, format="%(asctime)s\t%(levelname)-8s\t%(filename)s:%(lineno)-4d\t%(message)s")
from gensim.models.word2vec import Text8Corpus

sentences = Text8Corpus("/Users/miguel/Downloads/text8")
@ogrisel

It would be better to load the file from sys.argv[1] instead of "/Users/miguel/Downloads/text8", which is only going to work on Macs belonging to people named "Miguel" :)

@mfcabrera (Contributor Author)

@ogrisel Haha, hi, this was for a personal test :) -- the full test is "path"-free, but sure thing, I will update this! :)

@ogrisel commented Nov 12, 2014

How about making the Corpus class phrase-aware by automatically embedding a Phrases instance, so that it returns phrase-aware tokens when calling the get_texts method?

@ogrisel commented Nov 12, 2014

How about making the Corpus class phrase-aware by automatically embedding a Phrases instance, so that it returns phrase-aware tokens when calling the get_texts method?

Alternatively, this could be implemented as a wrapper / adapter:

PhraseCorpus could be initialized by passing any Corpus instance as a constructor parameter, and export a get_texts method that internally calls the get_texts method of the underlying corpus and collapses detected phrases with the _ marker.
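
Something along these lines (a hypothetical adapter mirroring the suggestion above, not code from this PR):

class PhraseCorpus(object):
    def __init__(self, corpus, phrases):
        self.corpus = corpus      # any corpus exposing get_texts()
        self.phrases = phrases    # a trained Phrases model

    def get_texts(self):
        for tokens in self.corpus.get_texts():
            # collapse detected collocations, e.g. ["new", "york"] -> ["new_york"]
            yield self.phrases[tokens]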

@piskvorky (Owner)

@mfcabrera thanks. I want to make a new release over the weekend, so I will review & merge ASAP :)

@ogrisel that's exactly what the current PR is meant to do, isn't it? Except in gensim, the transformation method is called __getitem__, so that calling new_corpus = phrase_corpus_object[old_corpus] gives you the transformed corpus (iterable).

But now I see Phrases.__getitem__ returns only a generator instead of an iterable. I'll change this to use the TransformationABC properly before merging. Thanks for bringing this up.

@ogrisel commented Nov 12, 2014

The __getitem__ API is really not intuitive to me. Why not use a method with a more explicit name like get_texts?

@ogrisel commented Nov 12, 2014

Actually I read the TransformationABC docstring. I think it would be more explicit to have:

transformer = SomeTransformer(parameters)
new_corpus = transformer.transform(original_corpus)
for words in new_corpus.get_texts():
     # do my stuff

The collection[key] syntax should be reserved to dict-like, list-like or array-like lookups in my opinion.

@larsmans (Contributor)

The collection[key] syntax should be reserved to dict-like, list-like or array-like lookups in my opinion.

This is a Gensim convention. I can't say I agree with it, but it's pretty ingrained.

@piskvorky (Owner)

Not to hijack @mfcabrera's pull request; your example in gensim would be:

transformer = SomeTransformer(original_corpus, parameters)
for words in transformer[original_corpus]:
     # do my stuff

but feel free to alias __getitem__ and __iter__ methods to whatever name you like, should be trivial.

In fact, if more people like the long syntax, we can even add such an alias to "standard gensim" (to the TransformationABC), I don't care either way.

@piskvorky (Owner)

For some reason, GitHub won't allow me to open a PR against your fork of gensim, @mfcabrera. Weird. It doesn't even offer your fork in the PR menu.

Can you please have a look at my branch https://github.com/piskvorky/gensim/tree/mfcabrera-phrases ? It's a fork off your phrases branch, meant to be merged into your branch = into this PR. Cheers.

It contains the following changes:

  • fixes for py3k compatibility
  • fixes for consistent unicode output & utf8/unicode input
  • bigram[corpus] now returns an iterable (instead of just a generator) @ogrisel -- see the usage sketch after this list
  • bigram[doc] works too
  • I added module intro & updated all docs, including the HTML Sphinx docs
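
A rough usage sketch of what that means in practice (parameter values are illustrative; `sentences` is any iterable of token lists):

from gensim.models.phrases import Phrases

bigram = Phrases(sentences, min_count=10, threshold=4.0)   # learn phrases from a corpus

print(bigram[[u'the', u'new', u'york', u'times', u'is', u'a', u'newspaper']])
# -> e.g. [u'the', u'new_york', u'times', u'is', u'a', u'newspaper']

for phrased in bigram[sentences]:   # or transform a whole corpus as an iterable
    pass  # feed into word2vec, LSI, ...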

@ogrisel commented Nov 15, 2014

For some reason, github won't allow me to open a PR to your fork of gensim @mfcabrera . Weird. It doesn't even offer your fork in the PR menu.

That happens for some reason. It's probably a bug. When that happens, open the PR page to any other repo / branch and edit the URL in your browser to point the source to the right repo name / branch name. Then you can create the PR.

@piskvorky (Owner)

That worked, thanks Olivier 👍

@piskvorky (Owner)

The move to utf8 (binary) seemed to do the trick, now memory is at 1.8GB with default max_vocab_size=20m, and the prune step happens once every ~120m words. I think that's fine.

I increased the default to 40m (~3.6GB). Processing 1 billion words takes about 15 mins; entire EN Wikipedia in 33 mins.

Let me know if there are any more comments, otherwise I'll merge this & release gensim 0.10.3 tomorrow.

@mfcabrera I never tried the Madoka wrapper, but if it works, sure! To be clear: anything compiled must be optional, with fallback to the current pure-Python version, if compilation fails / 3rd party module cannot be imported. Maintaining these C++ wrappers over multiple OS / platforms / compilers is no fun, I'd prefer not to be tied into maintaining that. A fast, pure Python/SciPy implementation would be ideal, yes :)

@piskvorky (Owner)

I had a look and Madoka seems nice. Let's continue with that in a new PR @mfcabrera -- any improvements will be welcome!

I'm also thinking where else we could use these sketches. If we bring another library in, let's make the most of it.

@ogrisel commented Nov 15, 2014

I increased the default to 40m (~3.6GB). Processing 1 billion words takes about 15 mins; entire EN Wikipedia in 33 mins.

Let me know if there are any more comments, otherwise I'll merge this & release gensim 0.10.3 tomorrow.

@piskvorky have you had a look at the phrases extracted on the full Wikipedia? Do they seem legit? Do you get stuff that looks like spurious bi- or tri-grams that don't correspond to meaningful phrases, such as named entities like "New York" and "New York Times", for instance?

@piskvorky (Owner)

@ogrisel the word2vec formula is basically a variant of PMI (pointwise mutual info, without taking the log), which is known to overestimate scores for low freq collocations.

That's why they put the extra min_count parameter, to ignore spurious low freq bigrams.
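
Concretely, the score is roughly this (a sketch of the word2phrase-style formula, not lifted verbatim from the PR):

# pseudo-PMI: large when the bigram occurs much more often than its parts would predict;
# subtracting min_count discounts rare bigrams, which plain PMI over-scores
score = (bigram_count - min_count) / float(worda_count) / wordb_count * corpus_word_count
# the bigram "worda_wordb" is kept if score > threshold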

You can see the top bigrams from wiki (min_count 10, threshold 4.0) here.

@ogrisel commented Nov 16, 2014

Looks great. Have you had a look at the bottom bigrams? Have you tried to do several passes to extract trigrams or more?

@piskvorky (Owner)

Bottom by score: you can see those in that printout above, too. The list is sorted by raw bigram count (=most frequent bigrams), but in the 3rd column there's the pseudo-PMI score. Stuff like new_york is actually near the threshold limit of 4.0, so that would be the "bottom".

Bottom by frequency (=count 11, because min_count was 10):

diego_benaglio  11      4.00591027209
aaron_prioleau  11      4.00556663873
cs_târgovişte   11      4.00512323324
relative_gravimeters    11      4.00401001349
lwf_geraldine   11      4.00389337039
minna_jørgensen 11      4.00379507267
videos_qtvr     11      4.00365570359
futaba_kun      11      4.00365203611
ha_tumah        11      4.003477472
textured_polygons       11      4.00337699452
fairhair_harald 11      4.00306019345
peard_judy      11      4.00231019083
fried_twinkie   11      4.00227647297
gardel_patrice  11      4.00168650245
ira_oira        11      4.00070921986
goncharov_isabelle      11      4.00053637698
bernardino_ludovisi     11      4.00029325041
articulata_hypothesis   11      4.0002873923

Bottom by frequency is less important IMO, because getting these rare bigrams wrong will be ...rare.

@ogrisel commented Nov 17, 2014

Alright, thanks. Have you tried to run 2 passes to extract trigrams like "new_york_times" for instance?

@piskvorky (Owner)

@mfcabrera I don't want to wait any longer with the release. I'll merge my own changes via #261 , without waiting for this PR to be updated.

@piskvorky closed this on Nov 17, 2014
@piskvorky (Owner)

@ogrisel not on Wiki yet; I'll post the >2-gram results later.

@mfcabrera (Contributor Author)

@piskvorky Sorry for not updating the PR! I was (I am) on vacation with funky internet access. I am glad the changes were merged. I will start working on the min-sketch thingy when I get back. :D
