Backward pass of broadcasting on GPU is non-deterministic #2652
Unfortunately, the reduction ops on GPU use asynchronous atomic adds, and are therefore fundamentally nondeterministic for floating point. Making them deterministic would require either tree-structured reductions or integer math, both significantly slower. I can leave this open with contributions welcome if you'd like (with an adjusted title), but it'll be a lot of work if someone tries to take it on, and it's unclear how best to make it happen automatically. Even if one added deterministic reductions as an option (either as a separate op or as an attr on the existing ops), we'd need an unpleasant global flag to turn this on when building the backward pass.
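To make the point above concrete: floating-point addition is not associative, so any reduction whose accumulation order varies between runs (as with asynchronous atomic adds) can legitimately produce different results for identical inputs. A minimal Python illustration:

```python
# Floating-point addition is not associative: the same three values
# summed in two different orders give two different results. This is
# why a reduction whose order depends on thread scheduling (e.g.
# asynchronous atomic adds) is nondeterministic for floats.
a, b, c = 1e20, -1e20, 1.0

left = (a + b) + c   # large terms cancel first, then 1.0 survives
right = a + (b + c)  # 1.0 is absorbed by the large magnitude first

print(left, right)   # prints: 1.0 0.0
```

The same effect occurs at much smaller magnitude differences once millions of partial sums are involved, which is the typical case for a GPU reduction.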
I can understand if that's the case. Thanks for the response.
By the way, pure warp-shuffle (shfl_down, or shfl_xor for keep_dim) based block reduction doesn't seem to be that much slower than warp-shuffle+atomic |
@MetaP Do you have a link for that? I don't quite follow, especially the bit about keep_dim since that doesn't change the computation structure. |
Here's the link: https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/ |
Cc @zheng-xq @benoitsteiner in case more GPU knowledgeable folk want to take a look. Determinism would certainly be nice to have if we can get it. |
The shfl_down results are only useful within a single warp. That technique itself would take a second pass to accumulate the results for each block. In general, there is no guarantee of determinism on GPU, so we are not sure how much effort we want to spend on it. Even if we can fix this particular kernel, we have other cuDNN kernels that do have non-determinism.
@zheng-xq could you give some examples of other cuDNN kernels that have non-determinism? I'd like to explore this a little, just for educational purposes, because as you mentioned it's probably not worth the effort unless some major thing happens down the road.
Which op exactly is non-deterministic here? These are the ops in the graph:
Do you expect that? For reference, I tried to run this (with both TF 1.4.1 and TF 1.12.0), and it seems deterministic to me (GTX 980, CUDA 9.1).
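A simple way to check a claim like this is to run the op repeatedly on a fixed input and compare results bitwise. The harness below uses NumPy's `np.sum` as a stand-in for the GPU op under test; with a real TensorFlow graph one would compare the computed gradient values across runs instead:

```python
import numpy as np

# Bitwise-reproducibility check: run `fn` several times on identical
# input and verify every result matches the first. np.sum here is a
# placeholder for the op (or gradient) under test.
def is_deterministic(fn, x, trials=5):
    ref = fn(x)
    return all(fn(x) == ref for _ in range(trials))

rng = np.random.RandomState(0)
x = rng.randn(1024).astype(np.float32)
print(is_deterministic(np.sum, x))  # prints: True (CPU sum has a fixed order)
```

Note that a check like this can only demonstrate non-determinism, never prove its absence: a nondeterministic kernel may still happen to produce identical results across a handful of runs on a particular GPU and driver.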
The current high-level status is that there are now solutions for TensorFlow GPU determinism in the cuDNN-backed ops (convolutions and max-pooling) and in bias_add. Please see the following repo for up-to-date status: https://github.com/NVIDIA/tensorflow-determinism
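For anyone landing here later: per that repo, the determinism solutions are enabled through environment variables set before TensorFlow initializes. The flag names below are taken from the NVIDIA/tensorflow-determinism README; availability depends on the TF version in use, so treat this as a sketch rather than a guaranteed API:

```python
import os

# Request deterministic GPU kernels, per NVIDIA/tensorflow-determinism.
# Must be set before TensorFlow is imported/initialized. Which flag
# applies (and which ops it covers) depends on the TF version.
os.environ["TF_DETERMINISTIC_OPS"] = "1"     # broader op coverage (newer TF)
os.environ["TF_CUDNN_DETERMINISTIC"] = "1"   # cuDNN ops only (TF >= 1.14)
```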
Result:
As you can see, the results are consistent across CPU runs but inconsistent across GPU runs.
No doubt a CUDA reduction-order issue, but it'd be really nice if we could have deterministic reductions. I am using TF 0.8.0 (self-compiled against cuDNN v5). cuDNN version is 5005 (not RC).