Reproducibility issue with transformers (BERT) and tf2.2 #19
Beautifully presented. Thanks, @MFreidank. I made a copy of your colab code and have been looking at it. The primary issue right now is that the trainable variables are not matching between runs:
I can see that you have them matching, and I don't understand why that would be different for me. Have you changed the colab code in some way since you ran it? The second issue I see is that you're setting …
Oh, I see. You have to restart the runtime to get the same initial trainable variables. I can hopefully provide a workaround for that too.
So, the solution for getting the same initial trainable variables every time you run the block of code that starts with the definition of …
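For illustration, a minimal sketch of the kind of re-seeding that avoids having to restart the runtime; the helper name, seed value, and model-construction line are placeholders of mine, not code from the notebook:

```python
# Sketch: re-seed every RNG immediately before rebuilding the model, so each run
# of the cell starts from the same initial trainable variables without a runtime restart.
import os
import random

import numpy as np
import tensorflow as tf

def reset_seeds(seed=42):
    os.environ["PYTHONHASHSEED"] = str(seed)  # note: only affects hashing if set before Python starts
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

reset_seeds()
# model = build_model()  # placeholder: rebuild the BERT model after re-seeding
```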
And ... solved. By removing …
You were so close. If only you had coded exactly what your notes required. :-)
Please confirm that your issue has been solved. Train your model for much longer, at least for one whole epoch, and confirm that it reaches the accuracy you expect while also achieving perfect, bit-exact reproducibility.
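A checksum over the trainable variables is a convenient way to confirm this; the following helper is a sketch of mine (the actual function used later in the thread may differ), to be run at the end of each training run and compared across runs:

```python
# Sketch: reduce all trainable variables to a single number; two bit-identical
# training runs give exactly the same value, and any divergence changes it.
import numpy as np

def trainable_variable_sum(model):
    return float(np.sum([np.sum(v.numpy()) for v in model.trainable_variables]))
```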
@duncanriach Thank you! I can reproduce the resolution and things are now deterministic in the scenario above; I should have taken my own advice from the notes, based on your workaround in the TensorFlow issue thread ;) There is one issue remaining, though: changing … Could you have a look at this? I updated my colab notebook to reflect the current state and to expose the issue mentioned above. It almost looks like Keras is doing some non-deterministic operations in between epochs. Update: For …
Will do.
These between-epoch issues are common and there are several possible sources. Let's see if we can get determinism without you needing to limit the training to one epoch ...
Running in colab, with my old copy of your code (with the fixes), I'm now no longer seeing reproducibility on 5 steps in one epoch on GPU. This is very concerning and I have not yet figured out what the issue is. Also, looking at your updated colab code and notes, it seems that one epoch with 10 steps on the GPU is not operating reproducibly, which does not match what you wrote above.
Just to recap where we're at and the solutions we have:
With these three adjustments, there is still some non-determinism. However, rather than being totally different on every run, the final state of the trainable variables is now one of a discrete set of values. The number of possible values seems to increase with the number of steps per epoch:

With …

With …

What this suggests to me is that there may be some non-determinism in the interaction between the data-loader (based on …). I'll investigate more tomorrow.
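For reference, loader-side determinism settings of roughly this shape rule out the data pipeline as a source of variation (a sketch, assuming a tf.data-based input pipeline; the option was named experimental_deterministic in the TF 2.x versions discussed here):

```python
# Sketch: force deterministic ordering and a seeded shuffle so the data loader
# cannot inject run-to-run variation into the training batches.
import tensorflow as tf

def make_pipeline_deterministic(dataset, seed=42, buffer_size=1000):
    options = tf.data.Options()
    options.experimental_deterministic = True  # preserve element order under parallel maps
    dataset = dataset.with_options(options)
    return dataset.shuffle(buffer_size, seed=seed, reshuffle_each_iteration=True)
```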
@duncanriach Thank you so much for your work and drive on this, and for the conclusive summary of where we stand and what we know.
I've been trying different batch sizes and numbers of steps. There seems to be a non-determinism effect that kicks in with larger batch sizes and a seemingly independent effect related to the number of steps (and/or perhaps the number of examples trained). This reminds me of the unresolved aspects of issue 9 (for OpenNMT). I have not yet gotten the non-determinism debug tool working with this model, which will enable me to dig in more deeply to isolate the remaining source, or sources, of non-determinism. I'm also learning more about BERT and about transformers in general. I presume that each step of training this model runs the complete sentence example (or batch of sentence examples) through the encoder and then the decoder, then calculates the loss, and then back-propagates the gradients to the trainable variables. If we see non-determinism in the trainable variables appear on any given step, it will have been caused by that example (or the examples in that batch) interacting with the trainable variables, as they have been trained during the previous steps, via a non-deterministic op or process. Since this is an RNN model, and a relatively complex one, there is extensive iterative munging happening (although I believe that it will be unrolled), unlike with a non-recurrent DNN. There may be different opportunities for non-determinism to be injected. There may also be the use of sparse operations (for things like sparse embeddings), some of which have been suspect for a while (but have not yet been fully investigated). I intend to keep investigating this issue. BTW, in a comment in the code, you mention that the data loading and preparation is deterministic. Did you confirm that? If so, how?
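To locate the first step at which two runs diverge, a per-step logging callback along these lines can help (a sketch; the callback name is hypothetical):

```python
# Sketch: log a checksum of the trainable variables after every training batch,
# so the logs of two runs can be diffed to find the first step at which they diverge.
import numpy as np
import tensorflow as tf

class TrainableVariableChecksum(tf.keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        checksum = sum(float(np.sum(v.numpy())) for v in self.model.trainable_variables)
        print(f"step {batch}: trainable-variable sum = {checksum!r}")
```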
@MFreidank, we (@wenscarl and I) have isolated the remaining source of nondeterminism in this model. See this comment on TensorFlow Issue 39751 for more information about the source. We have also confirmed that this was the only remaining source of nondeterminism in the model by temporarily replacing the use of … We are close to releasing a patch for the TensorFlow segment sum ops, which will, when applied via …
Update: @wenscarl has confirmed that the patch we are about to release (to be enabled via …
@duncanriach Thank you very much for your contributions towards making TensorFlow deterministic. I am using huggingface/transformers BERT with TF 2.2, and I was wondering when the patch will be released.
Hi @Zminghua, I don't have an estimated release date for the patch, but it's relatively high priority for us. The patch will work with TensorFlow version 2.3 and earlier. A recently discovered problem, which we're attempting to find a solution for, is that from TensorFlow version 2.4 onwards the TensorFlow API no longer exposes the mechanisms that allow a dynamic patch to be applied from outside the distributed package. This means that we'll have to focus on getting these solutions into upstream stock TensorFlow rather than relying on the theoretically quick triage route provided by patching.
Hi @duncanriach, by putting the "fwd9m" sub-directory in my project directory and then importing it as follows …, my code has become fully deterministic when running on GPU. Really, thank you again ~
Oh, you're welcome. Right, you can just clone the code and use it, of course, rather than waiting for the PyPI release.
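For anyone following along, the usage looks roughly like this (a sketch, assuming the fwd9m package exposes an enable_determinism() entry point for TensorFlow; check the repository README for the exact import path):

```python
# Sketch: apply the dynamic determinism patch before building or training the model.
import tensorflow as tf
from fwd9m.tensorflow import enable_determinism  # assumed module layout

enable_determinism()  # patches non-deterministic ops (e.g. segment sum) at runtime
```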
Update: we have confirmed that …
I cloned the repository and followed the instructions from above, i.e. … Much appreciated.
Hi @phqtuyen, please pull the … This was a bug that only showed up with stock TensorFlow versions 1.14 through 2.0. It was fixed in the incomplete and un-merged … Let me know how it goes.
This should be fixed in TF 2.7 by PR 51861. Will someone please confirm, so that this issue can be closed?
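For anyone re-testing on recent stock TensorFlow, the setup is roughly the following (a sketch: to my understanding, tf.keras.utils.set_random_seed is available from TF 2.7 and tf.config.experimental.enable_op_determinism from TF 2.8, with the TF_DETERMINISTIC_OPS=1 environment variable playing the latter's role on earlier versions):

```python
# Sketch: seed all RNGs and enable op-level determinism on recent stock TensorFlow.
import os
import tensorflow as tf

os.environ["TF_DETERMINISTIC_OPS"] = "1"        # TF 2.7 and earlier
tf.keras.utils.set_random_seed(42)              # seeds Python, NumPy, and TF RNGs (TF >= 2.7)
if hasattr(tf.config.experimental, "enable_op_determinism"):
    tf.config.experimental.enable_op_determinism()  # TF >= 2.8
```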
Dear @duncanriach,
Thank you for your contributions, work, and guidance towards making TensorFlow deterministic in recent releases.
Unfortunately, for popular Keras NLP models (BERT) some problems seem to remain (see also the related issue in this repository, #14).
In spite of combining learnings from … I am still arriving at the following short, non-deterministic colab notebook example.
My results for the sum of model weights (as computed with a function you had suggested) after training for only 5 steps are (differences highlighted below):

5159916282
3093758523
This variance becomes increasingly pronounced when the model is trained for longer periods of time.
Could you please help identify the source of non-determinism and provide guidance on how we can resolve it?
As transformers is a very popular package (29.1k GitHub stars), I expect that many other people are silently impacted by this phenomenon.
Note: As shown above, I have observed that the same code becomes fully deterministic when running on the colab CPU runtime.
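A quick way to verify the CPU-vs-GPU contrast within the same notebook is to hide the GPU from TensorFlow before any op runs (a sketch using the standard tf.config device API):

```python
# Sketch: hide the GPU so the identical script runs on CPU, confirming that the
# remaining nondeterminism is GPU-side rather than in the data pipeline or model code.
import tensorflow as tf

tf.config.set_visible_devices([], "GPU")   # must be called before any op touches the GPU
print(tf.config.get_visible_devices())     # should list only CPU devices
```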