Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race Condition in Muli-threaded word2vec #571

Closed
elmarhaussmann opened this issue Dec 31, 2015 · 16 comments
Closed

Race Condition in Muli-threaded word2vec #571

elmarhaussmann opened this issue Dec 31, 2015 · 16 comments

Comments

@elmarhaussmann
Copy link

The current development branch uses batch-based multi-threading. Some parts of the code are not synchronized properly and susceptible to race conditions. In particular, workers execute self.jobs_left -= 1 and the job producer: self.jobs_left += 1. This is not an atomic operation. Depending on timing and release of the GIL this leads to a wrong number of jobs left. Since the termination condition for training checks whether jobs are left, training often never finishes properly.

I verified this by outputting the number of jobs left at the end of training. The longer the training before, the higher the chances of a corrupted value and training not terminating.

Access should be properly synchronized (via locking) or another thread-safe data structure could be used instead.

@piskvorky
Copy link
Owner

I don't see how this could be non-atomic. Could you post (gist) your verification tests?

@piskvorky
Copy link
Owner

OK, it looks like in-place increment is not atomic in Python, so this may indeed lead to a race condition.

Still interested to see your tests though. I've never seen the "training often never finishes" in any runs or tests, so I'm curious what numbers you got.

@elmarhaussmann
Copy link
Author

As far as I know only single opcodes are guaranteed to be thread-safe and "+=" consists of reading, adding and writing (loosely speaking - didn't verify the exact opcodes) the value of the variable. I can whip up some code showing this tomorrow.

@elmarhaussmann
Copy link
Author

I ran some computations on huge data (>1billion) words and 12 threads - the process never finished (2 of 2 times) and was hanging at the end. I also experienced it on smaller input (~10 million words), but it's less likely to occur (maybe 1 in 4 times), which makes sense.
I cannot reproduce this for certain. My current test/workaround consists of changing the termination condition like in this snippet: https://gist.github.com/elmar-haussmann/d7670cac549cafe0862c. I can clearly see in my output that there are jobs left, but nothing is being computed anymore.

@piskvorky
Copy link
Owner

Sure, thanks for reporting! What version of gensim was that on? Develop or some release?

@elmarhaussmann
Copy link
Author

I used the current develop branch (from two days ago). Batch based multi-threading scales a lot better in my experiments! (not sure if it's in a release version already)

@phdowling
Copy link
Contributor

Acquiring a lock before the increment and releasing it after would probably be an easy fix, or do you see any problems there?

from multiprocessing import Lock
lock = Lock()

...
with lock:
    self.jobs_left -= 1

@piskvorky
Copy link
Owner

Yes, either that, or using a C counter (like in the page I linked to), making the operation atomic.

If using a lock, can you time the difference before/after the change? Say on the text9 corpus. Just as a sanity check, to make sure nothing unexpectedly horrible happened to performance.

Cheers!

@piskvorky
Copy link
Owner

Still curious to see some stats from @elmar-haussmann too. I find it weird that he sees a problem (word2vec hangs) 25% of the time, while we've run thousands of tests and runs by this point, and never saw it. I would like to get to the bottom of this disparity, to make sure there's not another issue being masked.

@elmarhaussmann
Copy link
Author

Can you be a bit more specific on what kind of stats you think are valuable and I'll be happy to try and produce them.

I am using Anaconda Python 3.5.1 (and the Intel MKL BLAS libraries). I'd assume a different Python interpreter can play a role in this.

@gojomo
Copy link
Collaborator

gojomo commented Jan 12, 2016

I've seen occasional hangs in unit test runs, both locally (OSX) and on the TravisCI (Linux) and AppVeyor (Windows) builds, which are likely due to this same issue. Also, there were some intermittent new logged warnings of the form WARNING:gensim.models.word2vec:supplied example count (2949) did not equal expected count (3000) suggesting in some cases a mismatched count at the end of training.

This #581 should fix; it only uses local vars in each thread for running counts, and leverages the existing None signal values in the queues to detect when all threads are done with all jobs. @elmar-haussmann, can you give it a try and see if it resolves your issues?

It also updates the batch_words parameter to be settable on the model, rather than just as a parameter to train() – as it may be helpful to adjust in small-corpus situations (like unit tests) using the simple do-everything constructor. (Learning-rate alpha is only updated per job-batch, so a tiny corpus handled in a single or few jobs won't get a smooth alpha-reduction unless forcing a smaller batch_words. Also if the cython routines are adapted to accept larger batches – as in my out-of-date https://github.com/gojomo/gensim/tree/batching branch – people might want to try larger batch_words, too.)

@tmylk
Copy link
Contributor

tmylk commented Jan 23, 2016

@elmar-haussmann Please confirm this is fixed by #531. The #531 has been merged to development branch of gensim.

@elmarhaussmann
Copy link
Author

I am currently running a large test (50 billion words). Will report as soon as it's done (~2 days).

@elmarhaussmann
Copy link
Author

Can confirm this is fixed. Wasn't able to reproduce any hangs.

@piskvorky
Copy link
Owner

Thank you so much @elmar-haussmann for bringing this up and helping us, and to @gojomo for fixing!

@tmylk #531 is an issue that is still open, what do you mean by linking to it?

@menshikh-iv
Copy link
Contributor

Fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants