Reproducibility issue with transformers (BERT) and tf2.2 #19
Beautifully presented. Thanks, @MFreidank. I made a copy of your colab code and have been looking at it. The primary issue right now is that the trainable variables are not matching between runs:
I can see that you have them matching, and I don't understand why that would be different for me. Have you changed the colab code in some way since you ran it? The second issue I see is that you're setting …
Oh, I see. You have to restart the runtime to get the same initial trainable variables. I can hopefully provide a workaround for that too.
So, the solution for getting the same initial trainable variables every time you run the block of code that starts with the definition of …
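For illustration, a minimal sketch of the kind of re-seeding that avoids having to restart the runtime; the helper name, seed value, and model-construction line are placeholders of mine, not code from the notebook:

```python
# Sketch: re-seed every RNG immediately before rebuilding the model, so each run
# of the cell starts from the same initial trainable variables without a runtime restart.
import os
import random

import numpy as np
import tensorflow as tf

def reset_seeds(seed=42):
    os.environ["PYTHONHASHSEED"] = str(seed)  # note: only affects hashing if set before Python starts
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

reset_seeds()
# model = build_model()  # placeholder: rebuild the BERT model after re-seeding
```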
And ... solved. By removing …
You were so close. If only you had coded exactly what your notes required. :-)
Please confirm that your issue has been solved. Train your model for much longer, at least for one whole epoch, and confirm that it reaches the accuracy you expect while also achieving perfect, bit-exact reproducibility.
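A checksum over the trainable variables is a convenient way to confirm this; the following helper is a sketch of mine (the actual function used later in the thread may differ), to be run at the end of each training run and compared across runs:

```python
# Sketch: reduce all trainable variables to a single number; two bit-identical
# training runs give exactly the same value, and any divergence changes it.
import numpy as np

def trainable_variable_sum(model):
    return float(np.sum([np.sum(v.numpy()) for v in model.trainable_variables]))
```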
@duncanriach Thank you! I can reproduce the resolution and things are now deterministic in the scenario above; I should have taken my own advice from the notes, based on your workaround in the TensorFlow issue thread ;) There is one issue remaining, though: changing … Could you have a look at this? I updated my colab notebook to reflect the current state and to expose the issue mentioned above. It almost looks like Keras is doing some non-deterministic operations in between epochs. Update: For …
Will do.
These between-epoch issues are common and there are several possible sources. Let's see if we can get determinism without you needing to limit the training to one epoch ...
Running in colab, with my old copy of your code (with the fixes), I'm now no longer seeing reproducibility on 5 steps in one epoch on GPU. This is very concerning and I have not yet figured out what the issue is. Also, looking at your updated colab code and notes, it seems that one epoch with 10 steps on the GPU is not operating reproducibly, which does not match what you wrote above.
Just to recap where we're at and the solutions we have:
With these three adjustments, there is still some non-determinism. However, rather than being totally different on every run, the final state of the trainable variables is now one of a discrete set of values. The number of possible values seems to increase with the number of steps per epoch:

With …

With …

What this suggests to me is that there may be some non-determinism in the interaction between the data-loader (based on …). I'll investigate more tomorrow.
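For reference, loader-side determinism settings of roughly this shape rule out the data pipeline as a source of variation (a sketch, assuming a tf.data-based input pipeline; the option was named experimental_deterministic in the TF 2.x versions discussed here):

```python
# Sketch: force deterministic ordering and a seeded shuffle so the data loader
# cannot inject run-to-run variation into the training batches.
import tensorflow as tf

def make_pipeline_deterministic(dataset, seed=42, buffer_size=1000):
    options = tf.data.Options()
    options.experimental_deterministic = True  # preserve element order under parallel maps
    dataset = dataset.with_options(options)
    return dataset.shuffle(buffer_size, seed=seed, reshuffle_each_iteration=True)
```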
@duncanriach Thank you so much for your work and drive on this, and for the conclusive summary of where we stand and what we know.
I've been trying different batch sizes and numbers of steps. There seems to be a non-determinism effect that kicks in with larger batch sizes and a seemingly independent effect related to the number of steps (and/or perhaps the number of examples trained). This reminds me of the unresolved aspects of issue 9 (for OpenNMT). I have not yet gotten the non-determinism debug tool working with this model, which will enable me to dig in more deeply to isolate the remaining source, or sources, of non-determinism. I'm also learning more about BERT and about transformers in general. I presume that each step of training this model runs the complete sentence example (or batch of sentence examples) through the encoder and then the decoder, then calculates the loss, and then back-propagates the gradients to the trainable variables. If we see non-determinism in the trainable variables appear on any given step, it will have been caused by that example (or the examples in that batch) interacting with the trainable variables, as they have been trained during the previous steps, via a non-deterministic op or process. Since this is an RNN model, and a relatively complex one, there is extensive iterative munging happening (although I believe that it will be unrolled), unlike with a non-recurrent DNN. There may be different opportunities for non-determinism to be injected. There may also be the use of sparse operations (for things like sparse embeddings), some of which have been suspect for a while (but have not yet been fully investigated). I intend to keep investigating this issue. BTW, in a comment in the code, you mention that the data loading and preparation is deterministic. Did you confirm that? If so, how?
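To locate the first step at which two runs diverge, a per-step logging callback along these lines can help (a sketch; the callback name is hypothetical):

```python
# Sketch: log a checksum of the trainable variables after every training batch,
# so the logs of two runs can be diffed to find the first step at which they diverge.
import numpy as np
import tensorflow as tf

class TrainableVariableChecksum(tf.keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        checksum = sum(float(np.sum(v.numpy())) for v in self.model.trainable_variables)
        print(f"step {batch}: trainable-variable sum = {checksum!r}")
```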
@MFreidank, we (@wenscarl and I) have isolated the remaining source of nondeterminism in this model. See this comment on TensorFlow Issue 39751 for more information about the source. We have also confirmed that this was the only remaining source of nondeterminism in the model by temporarily replacing the use of … We are close to releasing a patch for the TensorFlow segment sum ops, which will, when applied via …
Update: @wenscarl has confirmed that the patch we are about to release (to be enabled via …
@duncanriach Thank you very much for your contributions towards making TensorFlow deterministic. I am using huggingface/transformers BERT with TF 2.2, and I was wondering when the patch will be released.
Hi @Zminghua, I don't have an estimated release date for the patch, but it's relatively high priority for us. The patch will work with TensorFlow version 2.3 and earlier. A recently discovered problem, which we're attempting to find a solution for, is that from TensorFlow version 2.4 onwards the TensorFlow API no longer exposes the mechanisms that allow a dynamic patch to be applied from outside the distributed package. This means that we'll have to focus on getting these solutions into upstream stock TensorFlow rather than relying on the theoretically quick triage route provided by patching.
Hi @duncanriach, by putting the "fwd9m" sub-directory in my project directory and then importing it as follows …, my code has become fully deterministic when running on GPU. Really, thank you again ~
Oh, you're welcome. Right, you can just clone the code and use it, of course, rather than waiting for the PyPI release.
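For anyone following along, the usage looks roughly like this (a sketch, assuming the fwd9m package exposes an enable_determinism() entry point for TensorFlow; check the repository README for the exact import path):

```python
# Sketch: apply the dynamic determinism patch before building or training the model.
import tensorflow as tf
from fwd9m.tensorflow import enable_determinism  # assumed module layout

enable_determinism()  # patches non-deterministic ops (e.g. segment sum) at runtime
```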
Update: we have confirmed that …
I cloned the repository and followed the instructions from above, i.e. … Much appreciated.
Hi @phqtuyen, please pull the … This was a bug that only showed up with stock TensorFlow versions 1.14 through 2.0. It was fixed in the incomplete and un-merged … Let me know how it goes.
This should be fixed in TF 2.7 by PR 51861. Will someone please confirm, so that this issue can be closed?
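For anyone re-testing on recent stock TensorFlow, the setup is roughly the following (a sketch: to my understanding, tf.keras.utils.set_random_seed is available from TF 2.7 and tf.config.experimental.enable_op_determinism from TF 2.8, with the TF_DETERMINISTIC_OPS=1 environment variable playing the latter's role on earlier versions):

```python
# Sketch: seed all RNGs and enable op-level determinism on recent stock TensorFlow.
import os
import tensorflow as tf

os.environ["TF_DETERMINISTIC_OPS"] = "1"        # TF 2.7 and earlier
tf.keras.utils.set_random_seed(42)              # seeds Python, NumPy, and TF RNGs (TF >= 2.7)
if hasattr(tf.config.experimental, "enable_op_determinism"):
    tf.config.experimental.enable_op_determinism()  # TF >= 2.8
```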
Dear @duncanriach,
Thank you for your contributions, work, and guidance towards making TensorFlow deterministic in recent releases.
Unfortunately, for popular Keras NLP models (BERT) some problems seem to remain (see also the related issue in this repository, #14).
In spite of combining learnings from … I am still arriving at the following short, non-deterministic colab notebook example.
My results for the sum of model weights (as computed with a function you had suggested) after training for only 5 steps are (differences highlighted below):

5159916282
3093758523
This variance becomes increasingly pronounced when the model is trained for longer periods of time.
Could you please help identify the source of non-determinism and provide guidance on how we can resolve it?
As transformers is a very popular package (29.1k GitHub stars), I expect that many other people are silently impacted by this phenomenon.
Note: As shown above, I have observed that the same code becomes fully deterministic when running on the colab CPU runtime.
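A quick way to verify the CPU-vs-GPU contrast within the same notebook is to hide the GPU from TensorFlow before any op runs (a sketch using the standard tf.config device API):

```python
# Sketch: hide the GPU so the identical script runs on CPU, confirming that the
# remaining nondeterminism is GPU-side rather than in the data pipeline or model code.
import tensorflow as tf

tf.config.set_visible_devices([], "GPU")   # must be called before any op touches the GPU
print(tf.config.get_visible_devices())     # should list only CPU devices
```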