
big doc-vector refactor/enhancements #356

Merged
merged 51 commits into from Jun 28, 2015

Conversation

@gojomo (Member) commented Jun 10, 2015

Ready for review/testing!

The headline changes are optimized doc-vector inference, and the separation of doc-vectors (during training or comparison) from the word vocabulary – allowing many more docs than words (via memmap backing) and avoiding some confusion.

Many smaller changes include a DM-concatenative-context mode (as recommended in the Paragraph Vectors paper), an optional 'lock_factor' to attenuate training of some vectors, and other optimizations/cleanup.

See gensim/test/test_doc2vec.py to do a quick check on a new system or survey some API possibilities. See docs/notebook/doc2vec-IMDB.ipynb for a walkthrough of reproducing the PV paper IMDB sentiment experiment.

If you used Doc2Vec previously, a few key changes to note:

  • LabeledSentence is now TaggedDocument, and the (one or more) document labels that correspond to vectors are now referred to as 'tags'. Greatest memory efficiency is possible by using only int tags, contiguous and ascending from 0.
  • Doc vectors are stored, accessed, and compared through a constituent '.docvecs' field of the Doc2Vec model, rather than the old accessors/comparison methods (which still work for words). So: d2v_model.docvecs[doc_tag] or d2v_model.docvecs.most_similar(doc_tag), rather than calling these on d2v_model directly.
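For illustration, the new conventions can be sketched as follows. (This uses a namedtuple stand-in for TaggedDocument – gensim's real class is itself a namedtuple with `words` and `tags` fields – so the snippet runs without gensim; the commented lines show the corresponding real model calls.)

```python
from collections import namedtuple

# Stand-in mirroring gensim's TaggedDocument (itself a namedtuple).
TaggedDocument = namedtuple('TaggedDocument', 'words tags')

raw_docs = [['the', 'quick', 'fox'], ['lazy', 'dog', 'sleeps']]

# Plain int tags, contiguous and ascending from 0, give the
# greatest memory efficiency.
corpus = [TaggedDocument(words=words, tags=[i])
          for i, words in enumerate(raw_docs)]

# With a real gensim model, access then goes through the .docvecs field:
#   model = Doc2Vec(corpus)
#   model.docvecs[0]                # vector for the first document
#   model.docvecs.most_similar(0)   # docs most similar to it
```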

Some top needs:

  • testing on diverse, larger datasets and systems - though the ability to handle doc-vector sets much larger than RAM is theoretically there, I haven't forced an overflow yet
  • improving save() to externalize numpy arrays as with the prior models – maybe a recursive utils.SaveLoad?
  • refactoring the DocvecsArray similarity-testing methods – currently, just a quick-and-dirty copy-and-adapt from Word2Vec, but would ideally share code (and the optimizations pending in other @sebastien-j / @KCzar PRs) with word-vecs
gojomo added 30 commits Mar 2, 2015
…for padding; layer1_size potentially different than vector_size; parameter renames for clarity; one-time neg_lables precalc
…ence_dm_concat, infer_vector_dm_concat methods
pep8 & python2 fixes to doc2vec notebook
@gojomo (Member Author) commented Jun 28, 2015

@piskvorky the notebook & most other work was done in py3.4/OSX 1st, but I've also regularly run the notebook and other tests on py2.7/ubuntu. Thanks for the PR, merged!

@akhudek commented Jun 28, 2015

@gojomo The vocabulary phase definitely isn't the problem – I had logging on during the attempt and saw vocabulary building complete with about the same memory usage as word2vec. From memory (I lost the exact numbers in the crash; I should have logged to a file), there are about 4 million words before pruning, and far fewer afterwards. The crash happened immediately after a log line about resetting the layer weights.

The training script is very straightforward, as you can see below. A single line from the input data definitely won't blow memory.

import sys
import gzip
import logging
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class TaggedLineSentence(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        for uid, line in enumerate(gzip.open(self.filename, 'rb')):
            yield TaggedDocument(words=line.split(), tags=[uid])

sentences = TaggedLineSentence(sys.argv[1])

model = Doc2Vec(alpha=0.025, min_alpha=0.025, docvecs_mapfile='mapfile')  # use fixed learning rate
model.build_vocab(sentences)

for epoch in range(10):
    print(epoch)
    model.train(sentences)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

# store the model
model.save(sys.argv[2])
@gojomo (Member Author) commented Jun 28, 2015

Hmm. If build_vocab() completes I think you would see 'mapfile' appear in the working directory: Doc2Vec should have written random initialization vectors across its entire extent. (Do you see the '0' epoch print?)
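As a rough sanity check of that "entire extent", the fully initialized mapfile's on-disk size should be predictable from the doc count and vector size – this sketch assumes float32 (4-byte) vectors, and the example numbers are hypothetical:

```python
def expected_mapfile_bytes(num_docs, vector_size, dtype_bytes=4):
    """Bytes a doc-vector memmap should occupy once fully
    initialized, assuming float32 (4-byte) vectors."""
    return num_docs * vector_size * dtype_bytes

# e.g. 1,000,000 docs x 300 dims x 4 bytes = 1.2 GB on disk,
# which could be compared against os.path.getsize('mapfile')
size = expected_mapfile_bytes(1000000, 300)
```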

@akhudek commented Jun 28, 2015

No, it doesn't make it that far. I've just tried with a small subset of the data. After finishing the vocab with the small data, it's using ~100 MB of memory. Somewhere between resetting layer weights and starting training, memory usage jumps to 1.3 GB. This probably makes sense, since OS X is no doubt loading the entire mmap'd file into memory. In this case the mmap'd file does get created.

I'm beginning to suspect that this might be OS X behaving poorly with the mmap'd file, trying to load the entire thing into its concept of virtual memory, resulting in swap-file death. I've tried to limit the process's memory via ulimit, but it doesn't seem to work.

I'm going to try this on a linux system to see if it's OS related.

2015-06-28 17:10:58,509 : INFO : collected 143273 word types from a corpus of 31045724 words and 1000000 documents
2015-06-28 17:10:58,571 : INFO : total 29599 word types after removing those with count<5
2015-06-28 17:10:58,572 : INFO : constructing a huffman tree from 29599 words
2015-06-28 17:10:59,766 : INFO : built huffman tree with maximum node depth 23
2015-06-28 17:10:59,795 : INFO : resetting layer weights
0
2015-06-28 17:11:44,514 : INFO : training model with 1 workers on 29599 vocabulary and 300 features, using 'skipgram'=0 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-06-28 17:11:45,519 : INFO : PROGRESS: at 0.73% words, alpha 0.02500, 225655 words/s
2015-06-28 17:11:46,523 : INFO : PROGRESS: at 1.46% words, alpha 0.02500, 225018 words/s
2015-06-28 17:11:47,532 : INFO : PROGRESS: at 2.19% words, alpha 0.02500, 223860 words/s
@piskvorky (Member) commented Jun 28, 2015

Let me merge this PR into develop.

We can continue the discussion here, as well as open new PRs for fixes / improvements.

piskvorky added a commit that referenced this pull request Jun 28, 2015
big doc-vector refactor/enhancements
@piskvorky piskvorky merged commit 1d5bd88 into RaRe-Technologies:develop Jun 28, 2015
1 check passed: continuous-integration/travis-ci/pr – The Travis CI build passed
@piskvorky (Member) commented Jun 28, 2015

And it goes without saying -- massive thanks to @gojomo for his epic refactor!

@craigpfeifer commented Jun 29, 2015

Great changes! When will these be available via pip?

@gojomo (Member Author) commented Jun 29, 2015

I don't know @piskvorky's plans for a numbered release, but you can always pip install from a github branch. For example, this should do the trick (because 'develop' is the default branch for /piskvorky/gensim):

pip install git+https://github.com/piskvorky/gensim.git
@e9t commented Jun 29, 2015

Are you sure you want to totally remove LabeledSentence?

Some versioning schemes (e.g. semver) expect minor-version changes to be backwards-compatible, so those used to that convention may be confused, since this feature merge arrives as a minor version bump.
Rather than raising an AttributeError,

AttributeError: 'module' object has no attribute 'LabeledSentence'

how about emitting a DeprecationWarning instead?
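For illustration, a backwards-compatible shim along those lines might look like this (hypothetical code, not gensim's actual implementation; TaggedDocument is stood in by a namedtuple here, matching its real definition):

```python
import warnings
from collections import namedtuple

# gensim defines TaggedDocument as a namedtuple with these fields.
TaggedDocument = namedtuple('TaggedDocument', 'words tags')

class LabeledSentence(TaggedDocument):
    """Deprecated alias for TaggedDocument, kept for backwards compatibility."""
    def __new__(cls, words, tags):
        warnings.warn(
            "LabeledSentence is deprecated; use TaggedDocument instead",
            DeprecationWarning)
        return super(LabeledSentence, cls).__new__(cls, words, tags)
```

Old code keeps working, but users see a nudge toward the new name.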

Anyways, thanks for the great work!

@gojomo (Member Author) commented Jun 30, 2015

Remaining @cscorley & @e9t suggestions handled on #373 – thanks for review!

@e9t commented Jun 30, 2015

@gojomo, I got your notebook code to work on my Macbook and reproduced similar results. (Awesome!)
Even when I tried using my own dataset, the code ran just fine with sequential integer tags as you suggested. However, when I replaced the tags with strings (I wanted the tags to be actual document IDs), I got a Segmentation Fault during the docvec training phase as shown below:

# Load corpus
labeledTrainData_clean.tsv
testData_clean.tsv
unlabeledTrainData_clean.tsv
# Set-up Doc2Vec Training & Evaluation Models
Doc2Vec(dm/c,d100,n5,w5,mc2,t8)
Doc2Vec(dbow,d100,n5,mc2,t8)
Doc2Vec(dm/m,d100,n5,w10,mc2,t8)
# Bulk Training
START 2015-07-01 01:20:52.256642
*0.392200 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8) 54.5s 0.5s
*0.392000 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t8)_inferred 54.5s 2.1s
Segmentation fault: 11

I'm using a Macbook, so this is probably related to numpy/numpy#4007.
Do you think we can fix this?

@gojomo (Member Author) commented Jun 30, 2015

I wouldn't be confident it's related to the numpy issue – I've not had random/numpy segfaults while developing this on OSX. (All the many segfaults I've worked through have been traceable to my own genuine bugs.)

I didn't think any Macbooks had 8 cores; did you manually set that number of workers? (I'd recommend just the number of real cores, though that shouldn't risk a segfault.)

Does it happen every time, at roughly the same time, when using string tags? And never when using int IDs?

The string tags should definitely work – they just may prove unwieldy if you get to tens-of-millions of docs. Are your tags relatively short, and just one tag to a document?

Is there a chance any of your documents are > 10,000 words? (That's currently the hard limit, though I hope to loosen it, and going over it should only cause truncation rather than crashes.)

If you can reliably trigger segfaults with a small test case, I'll have other ideas for details to collect. (First among them: enabling core dumps and checking exactly where the fault is happening.)

@e9t commented Jun 30, 2015

I didn't think any Macbooks had 8 cores; did you manually set that number of workers? (I'd recommend just the number of real cores, though that shouldn't risk a segfault.)

My Macbook has an i7 CPU with 4 physical cores, but hyperthreading gives 8 logical cores, which is probably why multiprocessing.cpu_count() returned 8.
I just manually set the number of workers to 4.

Does it happen every time, at roughly the same time, when using string tags? And never when using int IDs?

Until now, yes. (The sample isn't so big, but I have run each 5+ consecutive times.)

The string tags should definitely work – they just may prove unwieldy if you get to tens-of-millions of docs. Are your tags relatively short, and just one tag to a document?

Yes and yes, max length of the tags is 6 and just one tag per document.

Is there a chance any of your documents are > 10,000 words? (That's currently the hard limit, though I hope to loosen it, and going over it should only cause truncation rather than crashes.)

No, document lengths are all shorter than 10,000 words.

If you can reliably trigger segfaults with a small test case, I'll have other ideas for details to collect. (First among them: enabling core dumps and checking exactly where the fault is happening.)

I'll see if I can create a toy case, and see what I can do to spot the problem. It may just be a special case, or my own bug as you suggested, but if I find anything noteworthy, I'll open a new issue. Thanks!

@gojomo (Member Author) commented Jun 30, 2015

I'd somehow overlooked that the i7 has hyperthreading (I usually do this dev in an OSX VM with 4 virtual cores)... it turns out 8 virtual cores do incrementally help: about a 14% speedup over my previous configuration of 4, in a quick test. So by all means keep using 8.

Unless you're writing your own C/cython code, any segfaulting bug is far more likely to come from the optimized cython/BLAS code, which can easily overrun boundaries or hold pointers past memory reuse. So please keep me posted on any triggering patterns you find. (There is one other report of a segfault on the discussion list.)

@e9t commented Jul 1, 2015

You are referring to this discussion right?

I looked at my crash logs and found the same exception type and codes EXC_BAD_ACCESS (SIGSEGV) and KERN_INVALID_ADDRESS.
However, the traceback looks a bit different from the previous segfault report:

Thread 2 Crashed:
0   libBLAS.dylib                   0x00007fff9325bc95 cblas_sdot + 976
1   libBLAS.dylib                   0x00007fff9322ee8d SDOT + 16
2   doc2vec_inner.so                0x00000001102b71ea __pyx_f_5trunk_6gensim_6models_13doc2vec_inner_our_dot_double + 10 (doc2vec_inner.c:1455)
3   doc2vec_inner.so                0x00000001102c41fa __pyx_f_5trunk_6gensim_6models_13doc2vec_inner_fast_document_dbow_hs + 202 (doc2vec_inner.c:1678)
4   doc2vec_inner.so                0x00000001102c3909 __pyx_pw_5trunk_6gensim_6models_13doc2vec_inner_1train_document_dbow + 13769 (doc2vec_inner.c:4102)
5   org.python.python               0x000000010086091a PyEval_EvalFrameEx + 21166
6   org.python.python               0x00000001007e39e5 gen_send_ex + 169
7   org.python.python               0x00000001007c86f1 PyIter_Next + 16
8   org.python.python               0x0000000100859c31 builtin_sum + 378
9   org.python.python               0x0000000100860869 PyEval_EvalFrameEx + 20989
10  org.python.python               0x000000010085b4b5 PyEval_EvalCodeEx + 1622
11  org.python.python               0x0000000100863b38 fast_function + 321
12  org.python.python               0x00000001008606fb PyEval_EvalFrameEx + 20623
13  org.python.python               0x000000010085b4b5 PyEval_EvalCodeEx + 1622
14  org.python.python               0x00000001007e922f function_call + 372
15  org.python.python               0x00000001007c8e2a PyObject_Call + 103
16  org.python.python               0x0000000100860daf PyEval_EvalFrameEx + 22339
17  org.python.python               0x0000000100863ac2 fast_function + 203
18  org.python.python               0x00000001008606fb PyEval_EvalFrameEx + 20623
19  org.python.python               0x0000000100863ac2 fast_function + 203
20  org.python.python               0x00000001008606fb PyEval_EvalFrameEx + 20623
21  org.python.python               0x000000010085b4b5 PyEval_EvalCodeEx + 1622
22  org.python.python               0x00000001007e922f function_call + 372
23  org.python.python               0x00000001007c8e2a PyObject_Call + 103
24  org.python.python               0x00000001007da54c method_call + 136
25  org.python.python               0x00000001007c8e2a PyObject_Call + 103
26  org.python.python               0x00000001008631c4 PyEval_CallObjectWithKeywords + 93
27  org.python.python               0x000000010089445b t_bootstrap + 70
28  libsystem_pthread.dylib         0x00007fff8f102268 _pthread_body + 131
29  libsystem_pthread.dylib         0x00007fff8f1021e5 _pthread_start + 176
30  libsystem_pthread.dylib         0x00007fff8f10041d thread_start + 13

I've run the code three times in a row, and the crash logs all point to the same place.
So perhaps numpy wasn't the problem after all?

@gojomo (Member Author) commented Jul 1, 2015

Yes, that's the discussion. That is a different crash location... but if the bug is some other code running a little earlier and clobbering unintended addresses with illegal values, the ultimate crash could happen in a variety of places. (However, the other thread reports not seeing the crash again, perhaps specifically since dropping the 'sample' parameter... which it doesn't appear you're using at all. So it's still unclear whether the incidents are related.)

In your original output, it looked like at least one pass in one training mode (dm/c) completed, but then the crash occurred during the 1st pass in the second (dbow) mode. So, is that the repeated 'same place' a crash occurs: always that mode, when run second? If you set logging to DEBUG, does it indicate about the same amount of progress through the data on each crash? How about if you only run the DBOW mode – still crashing the same proportion of the way through? How about if you leave out DBOW mode entirely – does it crash somewhere else?

(I'm somewhat skeptical the numpy issue is related – that would seem to necessarily trigger earlier if present. If you step through the triggering steps outlined in the original gensim-related-report, at #131 (comment), can you get that error?)

@gojomo (Member Author) commented Jul 5, 2015

@e9t – FYI, I found a way to reliably trigger segfaults locally, and fixed the bug responsible – see the commits about "str doctags trigger bad indexes" on PR #380. Specifically, if both using string doc tags, and some tags repeat before all tags are discovered, some tags could be assigned too-high indexes into the vector array (because of a bad assumption of exactly one training example per tag). That'd eventually lead to out-of-bounds accesses or writes.

Using repeated tags, while not exactly the mode described in the PV paper, is definitely a supported use-case. It might be reasonable to do so to create tag vectors for tags representing metadata/set-membership that many documents share. But also, feeding the training process "[tag-A] 1 2 3 . 4 5 6 ." as one large example, or "[tag-A] 1 2 3 ." then "[tag-A] 4 5 6 ." as two smaller examples, will cause approximately the same training for [tag-A] to occur. (In pure DBOW, the training is essentially identical; in other modes where the 'window' setting comes into play, there will be differences related to when the window reaches across the sentence boundary.)
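The two equivalent feeding styles described above can be sketched as follows (using a namedtuple stand-in for TaggedDocument so the snippet is self-contained; tag and token values are made up):

```python
from collections import namedtuple

TaggedDocument = namedtuple('TaggedDocument', 'words tags')

# One large example under a single tag...
one_big = [TaggedDocument(['1', '2', '3', '.', '4', '5', '6', '.'], ['tag-A'])]

# ...or the same text as two smaller examples repeating the tag:
two_small = [TaggedDocument(['1', '2', '3', '.'], ['tag-A']),
             TaggedDocument(['4', '5', '6', '.'], ['tag-A'])]

# Either way, 'tag-A' sees the same total training words
# (identical in pure DBOW; near-identical in windowed modes,
# where the window reaching across the boundary differs).
```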

Can you check if the fix in that PR resolves your crash? (The essential change is just the one line: https://github.com/piskvorky/gensim/pull/380/files#diff-e71d1aecc3d6bb450f077300f2cf763dR293 )

@e9t commented Jul 5, 2015

Sorry for the late reply.

Well, I've enabled DEBUG logging, and found that the crash does not occur at exactly the "same place": all three runs crashed during the 1st pass of the third (dm/m) mode, but each at a different job – 3, 7, and 11 respectively.

The GOOD NEWS is that yes, that bug fix #380 did the trick.
My code now runs perfectly. Thanks for figuring it out!

@piskvorky (Member) commented Jul 5, 2015

@gojomo @cscorley I think this release warrants a version bump to 0.12 (rather than just 0.11.2)... what do you think?

Changelog: https://github.com/piskvorky/gensim/blob/develop/CHANGELOG.txt

@gojomo (Member Author) commented Jul 5, 2015

👍 – version increments are free, and features (and API changes) warrant it. (About to make a few small CHANGELOG tweaks.)

@gojomo gojomo deleted the gojomo:bigdocvec_pr branch Jul 9, 2015
@vierja commented Aug 21, 2015

@gojomo is the infer_vector only applicable to a doc2vec model? Or can it be used with word2vec?

@gojomo (Member Author) commented Aug 21, 2015

It only works for Doc2Vec. (And, if presenting text for inference that contains new words, those words are treated like any other unknown words – dropped before analysis.)

It's possible a similar mechanism could be offered for inferring word vectors. In many ways the work in #435 supports similar goals.
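The dropping of unknown words before inference amounts to a vocabulary filter; sketched in plain Python (the vocab set and tokens here are made up, and the commented line shows the real call on a trained model):

```python
# Hypothetical vocabulary from a trained Doc2Vec model
vocab = {'the', 'quick', 'brown', 'fox'}

doc_words = ['the', 'quick', 'shiny', 'fox', 'jumps']

# Words absent from the model's vocabulary are ignored during inference
known_words = [w for w in doc_words if w in vocab]

# With a real trained model, inference is then just:
#   vector = model.infer_vector(doc_words)
```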

@vierja commented Aug 21, 2015

@gojomo thanks for the reply.

I meant: is it possible to generate a paragraph-vector representation from a Word2Vec model, using the same technique/code as infer_vector? I have a Word2Vec model – can I use it to infer vectors for a list of paragraphs without training a Doc2Vec model from scratch?

@gojomo (Member Author) commented Aug 21, 2015

The Doc2Vec algorithms (from the 'Paragraph Vectors' paper) do not start with word vectors, then create doc vectors. Rather, they train doc vectors from text (and only sometimes, in some training methods, generate word vectors as part of that process).

So: no. Inference requires a trained-up Doc2Vec model; preexisting word vectors aren't a typical or required input for creating doc-vectors (in this algorithm).

(It's intuitively plausible that seeding some kinds of Doc2Vec models with pre-existing word vectors might offer a benefit, but a few small experiments I've done in that direction have had mixed results. There's other research about ways to create sentence/document vectors which do in fact require word vectors first; those algorithms aren't currently in gensim.)

7 participants