Getting OpenNMT-tf to train reproducibly #9
Hi @atebbifakhr, My understanding is that XLA JIT compilation is not currently enabled by default in TensorFlow. I assume that you're not enabling XLA and therefore that, if there is in fact a source of non-determinism, it's not an XLA-originated op. Can you tell me more about your model and settings? There remain various sources of non-determinism in TensorFlow which are not addressed by the patch.
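For reference, here is a minimal sketch of how XLA JIT can be left explicitly off so it can be ruled out as a variable (assuming TF 1.x graph/session mode; the TF 2.x switch is noted in a comment):

```python
import tensorflow as tf

config = tf.compat.v1.ConfigProto()
# OFF is already the default; setting it explicitly documents the intent
# and rules XLA out as a source of variation.
config.graph_options.optimizer_options.global_jit_level = (
    tf.compat.v1.OptimizerOptions.OFF)
sess = tf.compat.v1.Session(config=config)

# The TF 2.x equivalent switch is: tf.config.optimizer.set_jit(False)
```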
Hi @duncanriach, I'm using [...]. I decided to call [...].
Until now, I was unaware of non-determinism issues with [...]. I have personally never seen [...]. You've also said that the computed gradients are non-deterministic. If non-determinism is appearing in the computed gradients, then it is, by definition, being injected before the gradients are applied. Another op in your model may be injecting non-determinism in back-prop. I recommend making sure that the examples being fed into the model are deterministic and that your trainable variables are initialized deterministically; once you have confirmed that, it becomes possible to debug and locate the source of non-determinism in the model. Unfortunately, I have not had time to release the debugging tool yet, which makes it harder for others to debug. If you can provide me with a simple-as-possible, self-contained example that clearly demonstrates non-determinism, then I may be able to debug it relatively quickly and identify the source, or sources, of non-determinism in it. Self-contained means that all the files needed are provided, including training data or code that generates synthetic data. Simple-as-possible means that it's as simple as possible while still demonstrating the issue. Also, I'm assuming that the seq2seq model you're using is Google's Seq2seq. Please confirm.
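To illustrate the seeding I have in mind, here is a minimal sketch (the value of `SEED` is illustrative) that pins Python-, NumPy-, and TF-level randomness so that the input examples and the variable initialization repeat from run to run:

```python
import random

import numpy as np
import tensorflow as tf

SEED = 123  # illustrative value

random.seed(SEED)                   # Python-level shuffling / sampling
np.random.seed(SEED)                # NumPy-side data generation
tf.compat.v1.set_random_seed(SEED)  # TF initializers and TF-side shuffling
# (in TF 2.x: tf.random.set_seed(SEED))
```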
I prepared this notebook so that you can replicate the problem.
Thanks for providing that code, @atebbifakhr! Nice and simple and self-contained. I love it. I have been able to reproduce the non-determinism, but not the determinism when the cross-entropy op is removed. It seems that the two pkl files generated in that case still differ. Perhaps I'm doing something wrong though. Please will you run again and confirm that you're definitely seeing the pkl files matching when you remove the cross-entropy op? In any case, this example is great because it gives me something specific to run and debug.
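For what it's worth, this is roughly how I'm comparing the two dumps (a sketch; the file names are illustrative and I'm assuming each pkl file holds a sequence of arrays):

```python
import pickle

import numpy as np

# Illustrative names; in practice these are the dumps from two
# otherwise-identical training runs.
with open("run1.pkl", "rb") as f:
    run1 = pickle.load(f)
with open("run2.pkl", "rb") as f:
    run2 = pickle.load(f)

# Bit-exact equality is the bar for determinism; np.allclose would hide
# small run-to-run differences.
for i, (a, b) in enumerate(zip(run1, run2)):
    print(i, np.array_equal(np.asarray(a), np.asarray(b)))
```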
Hey, I'm running this locally so that I can instrument and debug it. My machine contains a 12GB TITAN V. I'm getting this error:
Are you familiar with this error and how to resolve it?
In the model, I reduced the size so that it fits in memory. Anyway, I am able to locally reproduce the non-determinism and also the determinism (without the cross-entropy op). I'm not sure why I could not reproduce the determinism on Colab; possibly operator error, since the process is very manual. Well done for isolating this source of non-determinism! Thank you. I also want to acknowledge that all of the work that has gone into TensorFlow determinism so far made it possible to isolate a single op as a source of non-determinism without using the non-determinism debugging tool. This is because removing that one op reveals the underlying determinism that we now have. I intend to instrument this model and confirm the non-determinism, and also that the cross-entropy op is the only source. Then we can look at potential fixes or work-arounds.
Hi @duncanriach, thanks for your reply. It's strange that you couldn't reproduce the determinism by removing the cross-entropy op; that has never happened to me! It's fine to reduce the model size if it doesn't fit into memory, but note that sometimes you might need to run a couple of times to see the non-determinism. Anyway, thanks for your effort; looking forward to hearing from you.
Hi @duncanriach, do you have any update on this?
Hey @atebbifakhr, Sorry, I have not gotten to this yet. Will do as soon as I can and get back to you.
Hi @atebbifakhr, I looked into this more deeply. Removing the cross-entropy op [...]. The fact that the gradients are deterministic for the first step but the trainable variables are not suggests that non-determinism is being introduced in the gradient update step. I hope to continue investigating soon.
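One way to narrow this down further (a sketch, assuming a TF 2.x eager-style training step; `model`, `optimizer`, `loss_fn`, and the batch data are placeholders) is to fingerprint the gradients and the trainable variables after each step in two runs and see where they first diverge:

```python
import numpy as np
import tensorflow as tf

def fingerprint(tensors):
    # Sum of all elements, accumulated in float64: identical values give
    # identical fingerprints, so a mismatch across runs flags divergence.
    return sum(np.sum(np.asarray(t), dtype=np.float64) for t in tensors)

def train_step(model, optimizer, loss_fn, x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return grads

# In each run, after each step:
#   grads = train_step(model, optimizer, loss_fn, x, y)
#   print("step grads:", fingerprint(grads))
#   print("step vars: ", fingerprint(model.trainable_variables))
# If the gradient fingerprints match across runs but the variable
# fingerprints do not, the divergence is in the update (apply) step.
```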
Hi @atebbifakhr, After further investigation, there seem to be two or three sources of non-determinism in this system.
There is more work to do on this issue, but I wanted to give you an interim update. I've also added your name to the credits section of this repo in recognition of your effort in enabling me to reproduce and isolate the problems you've been seeing.
I updated my previous comment to include additional information that came from further investigation.
Hi,
I'm trying to use this patch with TensorFlow 2.0, but training is still non-deterministic. I guess it is due to XLA optimization. How can I disable XLA?
Best,