Getting OpenNMT-tf to train reproducibly #9
Hi @atebbifakhr, My understanding is that XLA JIT compilation is not currently enabled by default in TensorFlow. I assume that you're not enabling XLA and therefore that, if there is in fact a source of non-determinism, it's not an XLA-originated op. Can you tell me more about your model and settings? There remain various sources of non-determinism in TensorFlow which are not addressed by the patch.
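For reference, here is a minimal sketch of how XLA JIT can be left explicitly off so it can be ruled out as a variable (assuming TF 1.x graph/session mode; the TF 2.x switch is noted in a comment):

```python
import tensorflow as tf

config = tf.compat.v1.ConfigProto()
# OFF is already the default; setting it explicitly documents the intent
# and rules XLA out as a source of variation.
config.graph_options.optimizer_options.global_jit_level = (
    tf.compat.v1.OptimizerOptions.OFF)
sess = tf.compat.v1.Session(config=config)

# The TF 2.x equivalent switch is: tf.config.optimizer.set_jit(False)
```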
Hi @duncanriach, I'm using [...]. I decided to call [...].
Until now, I was unaware of non-determinism issues with [...]. I have personally never seen [...]. You've also said that the computed gradients are non-deterministic. If non-determinism is appearing in the computed gradients, then it is, by definition, being injected before the gradients are applied. Another op in your model may be injecting non-determinism in back-prop. I recommend making sure that the examples being fed into the model are deterministic and that your trainable variables are initialized deterministically; once you have confirmed that, it becomes possible to debug and locate the source of non-determinism in the model. Unfortunately, I have not had time to release the debugging tool yet, which makes it harder for others to debug. If you can provide me with a simple-as-possible, self-contained example that clearly demonstrates non-determinism, then I may be able to debug it relatively quickly and identify the source, or sources, of non-determinism in it. Self-contained means that all the files needed are provided, including training data or code that generates synthetic data. Simple-as-possible means that it's as simple as possible while still demonstrating the issue. Also, I'm assuming that the seq2seq model you're using is Google's Seq2seq. Please confirm.
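To illustrate the seeding I have in mind, here is a minimal sketch (the value of `SEED` is illustrative) that pins Python-, NumPy-, and TF-level randomness so that the input examples and the variable initialization repeat from run to run:

```python
import random

import numpy as np
import tensorflow as tf

SEED = 123  # illustrative value

random.seed(SEED)                   # Python-level shuffling / sampling
np.random.seed(SEED)                # NumPy-side data generation
tf.compat.v1.set_random_seed(SEED)  # TF initializers and TF-side shuffling
# (in TF 2.x: tf.random.set_seed(SEED))
```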
I prepared this notebook so that you can replicate the problem.
Thanks for providing that code, @atebbifakhr! Nice and simple and self-contained. I love it. I have been able to reproduce the non-determinism, but not the determinism when the cross-entropy op is removed. It seems that the two pkl files generated in that case still differ. Perhaps I'm doing something wrong though. Please will you run again and confirm that you're definitely seeing the pkl files matching when you remove the cross-entropy op? In any case, this example is great because it gives me something specific to run and debug.
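For what it's worth, this is roughly how I'm comparing the two dumps (a sketch; the file names are illustrative and I'm assuming each pkl file holds a sequence of arrays):

```python
import pickle

import numpy as np

# Illustrative names; in practice these are the dumps from two
# otherwise-identical training runs.
with open("run1.pkl", "rb") as f:
    run1 = pickle.load(f)
with open("run2.pkl", "rb") as f:
    run2 = pickle.load(f)

# Bit-exact equality is the bar for determinism; np.allclose would hide
# small run-to-run differences.
for i, (a, b) in enumerate(zip(run1, run2)):
    print(i, np.array_equal(np.asarray(a), np.asarray(b)))
```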
Hey, I'm running this locally so that I can instrument and debug it. My machine contains a 12GB TITAN V. I'm getting this error:
Are you familiar with this error and how to resolve it?
In the model, I reduced the size so that it fits in memory. Anyway, I am able to locally reproduce the non-determinism and also the determinism (without the cross-entropy op). I'm not sure why I could not reproduce the determinism on Colab; possibly operator error, since the process is very manual. Well done for isolating this source of non-determinism! Thank you. I also want to acknowledge that all of the work that has gone into TensorFlow determinism so far made it possible to isolate a single op as a source of non-determinism without using the non-determinism debugging tool. This is because removing that one op reveals the underlying determinism that we now have. I intend to instrument this model and confirm the non-determinism, and also that the cross-entropy op is the only source. Then we can look at potential fixes or work-arounds.
Hi @duncanriach, thanks for your reply. It's strange that you couldn't reproduce the determinism by removing the cross-entropy op; that has never happened to me! It's fine to reduce the model size if it doesn't fit into memory, but note that sometimes you might need to run a couple of times to see the non-determinism. Anyway, thanks for your effort; looking forward to hearing from you.
Hi @duncanriach, do you have any update on this?
Hey @atebbifakhr, Sorry, I have not gotten to this yet. Will do as soon as I can and get back to you.
Hi @atebbifakhr, I looked into this more deeply. Removing the cross-entropy op [...]. The fact that the gradients are deterministic for the first step but the trainable variables are not suggests that non-determinism is being introduced in the gradient update step. I hope to continue investigating soon.
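One way to narrow this down further (a sketch, assuming a TF 2.x eager-style training step; `model`, `optimizer`, `loss_fn`, and the batch data are placeholders) is to fingerprint the gradients and the trainable variables after each step in two runs and see where they first diverge:

```python
import numpy as np
import tensorflow as tf

def fingerprint(tensors):
    # Sum of all elements, accumulated in float64: identical values give
    # identical fingerprints, so a mismatch across runs flags divergence.
    return sum(np.sum(np.asarray(t), dtype=np.float64) for t in tensors)

def train_step(model, optimizer, loss_fn, x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return grads

# In each run, after each step:
#   grads = train_step(model, optimizer, loss_fn, x, y)
#   print("step grads:", fingerprint(grads))
#   print("step vars: ", fingerprint(model.trainable_variables))
# If the gradient fingerprints match across runs but the variable
# fingerprints do not, the divergence is in the update (apply) step.
```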
Hi @atebbifakhr, After further investigation, there seem to be two or three sources of non-determinism in this system.
There is more work to do on this issue, but I wanted to give you an interim update. I've also added your name to the credits section of this repo in recognition of your effort in enabling me to reproduce and isolate the problems you've been seeing.
I updated my previous comment to include additional information that came from further investigation.
Hi,
I'm trying to use this patch with TensorFlow 2.0, but training is still non-deterministic. I guess it is due to XLA optimization. How can I disable XLA?
Best,