Batch sentences in word2vec #535

Merged
merged 38 commits into develop from sentence_batching on Nov 28, 2015

Conversation

piskvorky
Owner

This PR changes the way "jobs" are processed in word2vec:

  1. Jobs of sentences are now prepared in a separate thread. The control flow logic is cleaner, and this should also help with performance in cases where the input stream is slow (better chance that the job producer is fast enough to feed all workers).
  2. Each job now contains a predefined number of words (was: predefined number of sentences). This helps performance when the documents are very short (tweets, etc.).

The results are bit-for-bit identical when controlling for the exact same settings (alpha decay etc.), so this is purely an internal refactoring / optimisation. In practice the results may differ slightly, because the job batches are a different size than before, which leads to a slightly different alpha decay.
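For illustration, here is a minimal sketch of the producer/queue pattern described in points 1 and 2 above. The names (`job_producer`, `batch_words`, `job_queue`) are assumptions for illustration, not the PR's actual code; it only shows the general idea of grouping sentences into word-count-limited jobs on a separate thread:

```python
# Illustrative sketch (not gensim's actual implementation): a producer thread
# groups input sentences into jobs of at most `batch_words` words and feeds
# worker threads through a bounded queue.
import threading
import queue

def job_producer(sentences, job_queue, batch_words=10000, num_workers=4):
    """Group sentences into word-count-limited jobs and push them to the queue."""
    job, job_size = [], 0
    for sentence in sentences:
        job.append(sentence)
        job_size += len(sentence)
        if job_size >= batch_words:
            job_queue.put(job)    # blocks if the workers are falling behind
            job, job_size = [], 0
    if job:
        job_queue.put(job)        # flush the last, partially filled job
    for _ in range(num_workers):
        job_queue.put(None)       # sentinel: tells each worker to exit

# Usage: run the producer in its own thread, so a slow input stream
# does not stall the workers waiting on the queue.
job_queue = queue.Queue(maxsize=8)
sentences = [["the", "quick", "brown", "fox"]] * 100
producer = threading.Thread(target=job_producer, args=(sentences, job_queue))
producer.daemon = True
producer.start()
```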

Benchmarks on the text8 corpus

256 dim, 1 worker, 10,000-word long sentences

  • old 0.12.3: 161k words/s; 27.2% (3343/12268)
  • new batched: 161k words/s; 27.5% (3379/12268)

256 dim, 4 workers, 10,000-word long sentences

  • old 0.12.3: 346k words/s; 27.9% (3417/12268)
  • new batched: 345k words/s; 27.9% (3426/12268)

256 dim, 1 worker, 10-word long sentences

  • old 0.12.3: 161k words/s; 24.6% analogy (3014/12268)
  • new batched: 192k words/s; 24.6% analogy (3013/12268)

256 dim, 4 workers, 10-word long sentences

  • old 0.12.3: 323k words/s; 25.7% analogy (3149/12268)
  • new batched: 429k words/s; 24.7% analogy (3029/12268)

(each result comes from only a single run, so there may be considerable variance)

TODO:

  • add CBOW version of Cython-optimized batching.
  • remove dead code (leave only batching)

olavurmortensen and others added 30 commits September 21, 2015 16:01
Simplify job loop + merge latest gensim
@piskvorky
Owner Author

The improvement on short sentences is nice, but not nearly as great as observed in @gojomo's batching experiments (which I can't find anymore -- can you give the link, Gordon?).

@gojomo
Collaborator

gojomo commented Nov 19, 2015

I put my partial work on this (Doc2Vec batching) in PR #536 to be easily found.

@piskvorky
Owner Author

Some more experiments: same dataset, same settings, but using CBOW (training a model with 4 workers on a 71,290-word vocabulary and 256 features, using sg=0 hs=0 sample=0 negative=5).

10,000-word sentences
new batching: 1,469k words/s; 7.5% (925/12268)
old 0.12.3: 1,257k words/s; 8.0% (976/12268)

10-word sentences
new batching: 1,176k words/s; 6.8% (831/12268)
old 0.12.3: 414k words/s; 6.7% (820/12268)

(again, each result comes from only a single run, so there may be considerable variance)

@gojomo this CBOW lift on short documents is more in line with your earlier experiments.

@gojomo
Collaborator

gojomo commented Nov 20, 2015

Aha... before batching, each sentence presented to CBOW essentially creates len(sentence) training examples for the NN. (One context per word.) Comparatively, each sentence presented to skip-gram training creates about len(sentence)*window context-word-to-target-word examples. (In fact, the window on both sides would mean double that, but the random window reduction that weights nearer words more highly halves it back to just window times each target position.) So the skip-gram no-GIL sessions were equal in number, but already relatively longer compared to GIL time. (Composing the sum/mean context takes much less time than the forward/backprop.) That already meant less chance for contention at GIL entry/exit, so less chance for speedup from batching.

I think my crude early tests (just concatenating sets of 10+ texts without code changes) were likely plain DBOW Doc2Vec (with no word-training and thus no 'window' of any kind) – so the NN-examples were again len(sentence), meaning relatively brief no-GIL sessions, and thus good chances for speedup from batching.

One upshot: if you repeat the SG tests with a smaller window (such as '2') you may see a bigger relative speedup from batching. (Similarly, smaller dimensions.)
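As a back-of-the-envelope illustration of the example counts described above (an illustrative sketch only, not gensim code):

```python
# Roughly how many NN training examples a single sentence yields in each mode.
def approx_examples(sentence_len, window, mode):
    if mode == "cbow":
        # one summed/averaged context per target word
        return sentence_len
    elif mode == "sg":
        # ~2*window neighbours per target, but the random window reduction
        # (effective window uniform in 1..window) halves that on average
        return sentence_len * window
    raise ValueError(mode)

for window in (2, 5, 10):
    print(window, approx_examples(10, window, "cbow"), approx_examples(10, window, "sg"))
# Smaller windows shrink each no-GIL skip-gram session, so batching's
# relative benefit should grow -- in line with the suggestion above.
```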

@piskvorky
Owner Author

Travis fails; it looks like the change in word2vec broke some doc2vec test.

@gojomo does this error ring any immediate bell to you? If not, I'll dig deeper. I thought the API didn't change at all though, so not sure what I missed.

@gojomo
Collaborator

gojomo commented Nov 20, 2015

Looks like it's getting a tuple rather than a 'document'-shaped object (something with words and tags properties)... perhaps a method assuming each text example is just a sequence-of-tokens needs overriding?
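For reference, a hedged illustration of what such a 'document'-shaped object looks like; a namedtuple with the same words/tags fields stands in here purely for illustration (gensim's own class for this is TaggedDocument):

```python
# Hedged illustration: doc2vec expects 'document'-shaped objects exposing
# `words` and `tags`, not plain tuples or token lists. A namedtuple with
# those fields serves as a stand-in.
from collections import namedtuple

TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

doc = TaggedDocument(words=["the", "quick", "brown", "fox"], tags=["doc_0"])
print(doc.words, doc.tags)
```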

@piskvorky
Owner Author

@gojomo I fixed the doc2vec compatibility in d8b4134.

I noticed though that the pure-Python code path of doc2vec doesn't work -- it complains about missing neg_label. The tests don't catch it because they use the compiled-path. Not directly related to this PR, but a fix would be welcome.

@piskvorky
Owner Author

@tmylk I'm seeing an unrelated unit test error from "keywords" again -- can you fix it?

piskvorky added a commit that referenced this pull request Nov 28, 2015
piskvorky merged commit dbe0b96 into develop Nov 28, 2015
piskvorky deleted the sentence_batching branch November 28, 2015 06:41