Batch sentences in word2vec #535

Merged
merged 38 commits into develop from sentence_batching on Nov 28, 2015

Conversation

piskvorky
Owner

This PR changes the way "jobs" are processed in word2vec:

  1. Jobs of sentences are now prepared in a separate thread. The control flow logic is cleaner, and this should also help with performance in cases where the input stream is slow (better chance that the job producer is fast enough to feed all workers).
  2. Each job now contains a predefined number of words (was: predefined number of sentences). This helps performance when the documents are very short (tweets, etc.).

The results are bit-for-bit identical when controlling for the exact same settings (alpha decay etc.), so this is purely an internal refactoring / optimisation. In practice the results may differ slightly, because the job batches are a different size than before, which leads to a slightly different alpha decay.
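For illustration, here is a minimal sketch of the producer/queue pattern described in points 1 and 2 above. The names (`job_producer`, `batch_words`, `job_queue`) are assumptions for illustration, not the PR's actual code; it only shows the general idea of grouping sentences into word-count-limited jobs on a separate thread:

```python
# Illustrative sketch (not gensim's actual implementation): a producer thread
# groups input sentences into jobs of at most `batch_words` words and feeds
# worker threads through a bounded queue.
import threading
import queue

def job_producer(sentences, job_queue, batch_words=10000, num_workers=4):
    """Group sentences into word-count-limited jobs and push them to the queue."""
    job, job_size = [], 0
    for sentence in sentences:
        job.append(sentence)
        job_size += len(sentence)
        if job_size >= batch_words:
            job_queue.put(job)    # blocks if the workers are falling behind
            job, job_size = [], 0
    if job:
        job_queue.put(job)        # flush the last, partially filled job
    for _ in range(num_workers):
        job_queue.put(None)       # sentinel: tells each worker to exit

# Usage: run the producer in its own thread, so a slow input stream
# does not stall the workers waiting on the queue.
job_queue = queue.Queue(maxsize=8)
sentences = [["the", "quick", "brown", "fox"]] * 100
producer = threading.Thread(target=job_producer, args=(sentences, job_queue))
producer.daemon = True
producer.start()
```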

Benchmarks on the text8 corpus

256 dim, 1 worker, 10,000-word long sentences

  • old 0.12.3: 161k words/s; 27.2% (3343/12268)
  • new batched: 161k words/s; 27.5% (3379/12268)

256 dim, 4 workers, 10,000-word long sentences

  • old 0.12.3: 346k words/s; 27.9% (3417/12268)
  • new batched: 345k words/s; 27.9% (3426/12268)

256 dim, 1 worker, 10-word long sentences

  • old 0.12.3: 161k words/s; 24.6% analogy (3014/12268)
  • new batched: 192k words/s; 24.6% analogy (3013/12268)

256 dim, 4 workers, 10-word long sentences

  • old 0.12.3: 323k words/s; 25.7% analogy (3149/12268)
  • new batched: 429k words/s; 24.7% analogy (3029/12268)

(each result comes from only a single run, so there may be considerable variance)

TODO:

  • add CBOW version of Cython-optimized batching.
  • remove dead code (leave only batching)

olavurmortensen and others added 30 commits September 21, 2015 16:01
Simplify job loop + merge latest gensim
@piskvorky
Owner Author

The improvement on short sentences is nice, but not nearly as great as observed in @gojomo's batching experiments (which I can't find anymore -- can you give the link, Gordon?).

@gojomo
Collaborator

gojomo commented Nov 19, 2015

I put my partial work on this (Doc2Vec batching) in PR #536 to be easily found.

@piskvorky
Owner Author

Some more experiments: same dataset, same settings, but using CBOW (training a model with 4 workers on a 71,290-word vocabulary and 256 features, using sg=0 hs=0 sample=0 negative=5).

10,000-word sentences
new batching: 1,469k words/s; 7.5% (925/12268)
old 0.12.3: 1,257k words/s; 8.0% (976/12268)

10-word sentences
new batching: 1,176k words/s; 6.8% (831/12268)
old 0.12.3: 414k words/s; 6.7% (820/12268)

(again, each result comes from only a single run, so there may be considerable variance)

@gojomo this CBOW lift on short documents is more in line with your earlier experiments.

@gojomo
Collaborator

gojomo commented Nov 20, 2015

Aha... before batching, each sentence presented to CBOW essentially creates len(sentence) training examples for the NN. (One context per word.) Comparatively, each sentence presented to skip-gram training creates about len(sentence)*window context-word-to-target-word examples. (In fact, the window on both sides would mean double that, but the random window reduction that weights nearer words more highly halves it back to just window times each target position.) So the skip-gram no-GIL sessions were equal in number, but already relatively longer compared to GIL time. (Composing the sum/mean context takes much less time than the forward/backprop.) That already meant less chance for contention at GIL entry/exit, so less chance for speedup from batching.

I think my crude early tests (just concatenating sets of 10+ texts without code changes) were likely plain DBOW Doc2Vec (with no word-training and thus no 'window' of any kind) – so the NN-examples were again len(sentence), meaning relatively brief no-GIL sessions, and thus good chances for speedup from batching.

One upshot: if you repeat the SG tests with a smaller window (such as '2') you may see a bigger relative speedup from batching. (Similarly, smaller dimensions.)
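As a back-of-the-envelope illustration of the example counts described above (an illustrative sketch only, not gensim code):

```python
# Roughly how many NN training examples a single sentence yields in each mode.
def approx_examples(sentence_len, window, mode):
    if mode == "cbow":
        # one summed/averaged context per target word
        return sentence_len
    elif mode == "sg":
        # ~2*window neighbours per target, but the random window reduction
        # (effective window uniform in 1..window) halves that on average
        return sentence_len * window
    raise ValueError(mode)

for window in (2, 5, 10):
    print(window, approx_examples(10, window, "cbow"), approx_examples(10, window, "sg"))
# Smaller windows shrink each no-GIL skip-gram session, so batching's
# relative benefit should grow -- in line with the suggestion above.
```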

@piskvorky
Owner Author

Travis fails; it looks like the change in word2vec broke some doc2vec test.

@gojomo does this error ring any immediate bell to you? If not, I'll dig deeper. I thought the API didn't change at all though, so not sure what I missed.

@gojomo
Collaborator

gojomo commented Nov 20, 2015

Looks like it's getting a tuple rather than a 'document'-shaped object (something with words and tags properties)... perhaps a method assuming each text example is just a sequence-of-tokens needs overriding?
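For reference, a hedged illustration of what such a 'document'-shaped object looks like; a namedtuple with the same words/tags fields stands in here purely for illustration (gensim's own class for this is TaggedDocument):

```python
# Hedged illustration: doc2vec expects 'document'-shaped objects exposing
# `words` and `tags`, not plain tuples or token lists. A namedtuple with
# those fields serves as a stand-in.
from collections import namedtuple

TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

doc = TaggedDocument(words=["the", "quick", "brown", "fox"], tags=["doc_0"])
print(doc.words, doc.tags)
```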

@piskvorky
Owner Author

@gojomo I fixed the doc2vec compatibility in d8b4134.

I noticed though that the pure-Python code path of doc2vec doesn't work -- it complains about missing neg_label. The tests don't catch it because they use the compiled-path. Not directly related to this PR, but a fix would be welcome.

@piskvorky
Owner Author

@tmylk I'm seeing an unrelated unit test error from "keywords" again -- can you fix it?

piskvorky added a commit that referenced this pull request Nov 28, 2015
piskvorky merged commit dbe0b96 into develop Nov 28, 2015
piskvorky deleted the sentence_batching branch November 28, 2015 06:41