Race Condition in Muli-threaded word2vec #571

elmarhaussmann · 2015-12-31T14:28:12Z

The current development branch uses batch-based multi-threading. Some parts of the code are not synchronized properly and susceptible to race conditions. In particular, workers execute self.jobs_left -= 1 and the job producer: self.jobs_left += 1. This is not an atomic operation. Depending on timing and release of the GIL this leads to a wrong number of jobs left. Since the termination condition for training checks whether jobs are left, training often never finishes properly.

I verified this by outputting the number of jobs left at the end of training. The longer the training before, the higher the chances of a corrupted value and training not terminating.

Access should be properly synchronized (via locking) or another thread-safe data structure could be used instead.

The text was updated successfully, but these errors were encountered:

piskvorky · 2015-12-31T17:31:50Z

I don't see how this could be non-atomic. Could you post (gist) your verification tests?

piskvorky · 2015-12-31T17:39:57Z

OK, it looks like in-place increment is not atomic in Python, so this may indeed lead to a race condition.

Still interested to see your tests though. I've never seen the "training often never finishes" in any runs or tests, so I'm curious what numbers you got.

elmarhaussmann · 2015-12-31T17:40:36Z

As far as I know only single opcodes are guaranteed to be thread-safe and "+=" consists of reading, adding and writing (loosely speaking - didn't verify the exact opcodes) the value of the variable. I can whip up some code showing this tomorrow.

elmarhaussmann · 2015-12-31T17:47:42Z

I ran some computations on huge data (>1billion) words and 12 threads - the process never finished (2 of 2 times) and was hanging at the end. I also experienced it on smaller input (~10 million words), but it's less likely to occur (maybe 1 in 4 times), which makes sense.
I cannot reproduce this for certain. My current test/workaround consists of changing the termination condition like in this snippet: https://gist.github.com/elmar-haussmann/d7670cac549cafe0862c. I can clearly see in my output that there are jobs left, but nothing is being computed anymore.

piskvorky · 2015-12-31T17:53:24Z

Sure, thanks for reporting! What version of gensim was that on? Develop or some release?

elmarhaussmann · 2015-12-31T17:57:25Z

I used the current develop branch (from two days ago). Batch based multi-threading scales a lot better in my experiments! (not sure if it's in a release version already)

phdowling · 2016-01-07T10:00:46Z

Acquiring a lock before the increment and releasing it after would probably be an easy fix, or do you see any problems there?

from multiprocessing import Lock
lock = Lock()

...
with lock:
    self.jobs_left -= 1

piskvorky · 2016-01-07T11:14:01Z

Yes, either that, or using a C counter (like in the page I linked to), making the operation atomic.

If using a lock, can you time the difference before/after the change? Say on the text9 corpus. Just as a sanity check, to make sure nothing unexpectedly horrible happened to performance.

Cheers!

piskvorky · 2016-01-07T11:18:18Z

Still curious to see some stats from @elmar-haussmann too. I find it weird that he sees a problem (word2vec hangs) 25% of the time, while we've run thousands of tests and runs by this point, and never saw it. I would like to get to the bottom of this disparity, to make sure there's not another issue being masked.

elmarhaussmann · 2016-01-07T21:02:33Z

Can you be a bit more specific on what kind of stats you think are valuable and I'll be happy to try and produce them.

I am using Anaconda Python 3.5.1 (and the Intel MKL BLAS libraries). I'd assume a different Python interpreter can play a role in this.

gojomo · 2016-01-12T10:50:34Z

I've seen occasional hangs in unit test runs, both locally (OSX) and on the TravisCI (Linux) and AppVeyor (Windows) builds, which are likely due to this same issue. Also, there were some intermittent new logged warnings of the form WARNING:gensim.models.word2vec:supplied example count (2949) did not equal expected count (3000) suggesting in some cases a mismatched count at the end of training.

This #581 should fix; it only uses local vars in each thread for running counts, and leverages the existing None signal values in the queues to detect when all threads are done with all jobs. @elmar-haussmann, can you give it a try and see if it resolves your issues?

It also updates the batch_words parameter to be settable on the model, rather than just as a parameter to train() – as it may be helpful to adjust in small-corpus situations (like unit tests) using the simple do-everything constructor. (Learning-rate alpha is only updated per job-batch, so a tiny corpus handled in a single or few jobs won't get a smooth alpha-reduction unless forcing a smaller batch_words. Also if the cython routines are adapted to accept larger batches – as in my out-of-date https://github.com/gojomo/gensim/tree/batching branch – people might want to try larger batch_words, too.)

tmylk · 2016-01-23T21:40:04Z

@elmar-haussmann Please confirm this is fixed by #531. The #531 has been merged to development branch of gensim.

elmarhaussmann · 2016-01-25T10:22:41Z

I am currently running a large test (50 billion words). Will report as soon as it's done (~2 days).

elmarhaussmann · 2016-01-27T00:26:18Z

Can confirm this is fixed. Wasn't able to reproduce any hangs.

piskvorky · 2016-01-27T02:22:02Z

Thank you so much @elmar-haussmann for bringing this up and helping us, and to @gojomo for fixing!

@tmylk #531 is an issue that is still open, what do you mean by linking to it?

menshikh-iv · 2017-10-02T14:33:58Z

Fixed

This was referenced Jan 11, 2016

Docker instructions in README.rst #579

Closed

threadsafe job counts; configurable batch_size #581

Merged

gojomo mentioned this issue Jan 12, 2016

intermittent failure: Test CBOW w/ hierarchical softmax - #531

Open

gojomo added a commit that referenced this issue Jan 15, 2016

CHANGELOG.txt for #571, #531

dd6f6b7

menshikh-iv closed this as completed Oct 2, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race Condition in Muli-threaded word2vec #571

Race Condition in Muli-threaded word2vec #571

elmarhaussmann commented Dec 31, 2015

piskvorky commented Dec 31, 2015

piskvorky commented Dec 31, 2015

elmarhaussmann commented Dec 31, 2015

elmarhaussmann commented Dec 31, 2015

piskvorky commented Dec 31, 2015

elmarhaussmann commented Dec 31, 2015

phdowling commented Jan 7, 2016

piskvorky commented Jan 7, 2016

piskvorky commented Jan 7, 2016

elmarhaussmann commented Jan 7, 2016

gojomo commented Jan 12, 2016

tmylk commented Jan 23, 2016

elmarhaussmann commented Jan 25, 2016

elmarhaussmann commented Jan 27, 2016

piskvorky commented Jan 27, 2016

menshikh-iv commented Oct 2, 2017

Race Condition in Muli-threaded word2vec #571

Race Condition in Muli-threaded word2vec #571

Comments

elmarhaussmann commented Dec 31, 2015

piskvorky commented Dec 31, 2015

piskvorky commented Dec 31, 2015

elmarhaussmann commented Dec 31, 2015

elmarhaussmann commented Dec 31, 2015

piskvorky commented Dec 31, 2015

elmarhaussmann commented Dec 31, 2015

phdowling commented Jan 7, 2016

piskvorky commented Jan 7, 2016

piskvorky commented Jan 7, 2016

elmarhaussmann commented Jan 7, 2016

gojomo commented Jan 12, 2016

tmylk commented Jan 23, 2016

elmarhaussmann commented Jan 25, 2016

elmarhaussmann commented Jan 27, 2016

piskvorky commented Jan 27, 2016

menshikh-iv commented Oct 2, 2017