scan_by_key results are non-deterministic for floats [NVBug 3477443] #1587
Comments
Note that there are currently two versions of scan by key: one in CUB and one in Thrust. Eventually we'll be updating Thrust to use CUB's implementation, so the fix should go into CUB.
All our scan implementations rely on the decoupled look-back approach. That means that CUB and Thrust scans don't provide run-to-run determinism for scan operations on floating-point types. In other words, this issue isn't unique to the scan-by-key variant of the algorithm.

Here's an illustration of the tile states of five thread blocks. Each thread block reads the tile state of its predecessor block. If the predecessor's tile state contains only the tile aggregate, this aggregate is added to the partial sum, and the tile state before it gets inspected. If one of the predecessor states has the full prefix, it gets added to the collected aggregate, and the result is stored in the tile state of the current thread block.

Multiple thread blocks run concurrently, and updates of the tile states become visible non-deterministically. For the example above, it's possible that thread block four will observe the final prefix. In this case it'll write

I don't see how this algorithm might provide run-to-run determinism. The original paper that proposes this algorithm states that:
I suggest we update the documentation and remove this guarantee. We could add a stable scan implementation that would use a different algorithm for floating-point values.
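To make the look-back described above concrete, here is a minimal host-side sketch. It is not CUB's or Thrust's code: the function name `lookback_exclusive_prefix` and its parameters are hypothetical, and it assumes the predecessor that already holds a full prefix built it by summing left to right. The only point is that the index of the first predecessor with a visible prefix changes how the float additions are grouped.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side model of one tile's look-back; not CUB's or Thrust's code.
// aggregates[i] is the sum of tile i alone. first_with_prefix is the nearest
// predecessor whose full inclusive prefix happens to be visible; on a GPU this
// index depends on timing.
float lookback_exclusive_prefix(const std::vector<float>& aggregates,
                                int tile, int first_with_prefix)
{
    assert(0 <= first_with_prefix && first_with_prefix < tile);
    assert(tile <= static_cast<int>(aggregates.size()));

    // Assumption: that predecessor built its inclusive prefix left to right.
    float inclusive = 0.0f;
    for (int i = 0; i <= first_with_prefix; ++i) inclusive += aggregates[i];

    // Walk back from tile - 1, collecting aggregate-only tile states.
    float collected = 0.0f;
    for (int i = tile - 1; i > first_with_prefix; --i) collected += aggregates[i];

    // Same mathematical sum, but grouped differently for each first_with_prefix.
    return inclusive + collected;
}
```

With aggregates `{16777216.0f, 1.0f, 1.0f, 1.0f, 1.0f}` (2^24 followed by four ones) and `tile = 5`, IEEE 754 single-precision arithmetic gives 16777216 when `first_with_prefix = 4`, because each `+ 1` is rounded away, and 16777220 when `first_with_prefix = 0`, because the ones are summed to 4 before being added. Both are valid roundings of the same sum, which is the kind of run-to-run difference discussed here.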
Thanks for looking into this! I was hoping we had some new tech in the regular decoupled scan algorithm that managed to address this limitation, but it sounds like we don't. I re-ran the test program using the regular scan algorithms (which are documented to produce the same results run-to-run) and confirmed that they also differ. I think you're right, the docs are just out of date. Let's just fix the docs for now. If there's significant interest we can look into adding some stable versions later.
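For reference, one way a stable version could fix the order of additions is a three-phase reduce-then-scan. The sketch below is a host-side illustration only, under the assumption that a fixed combination order is what "stable" would mean here; `fixed_order_inclusive_scan` and `tile_size` are hypothetical names, and nothing in this thread says CUB or Thrust would take this approach.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical host-side sketch of a reduce-then-scan inclusive sum.
// Every addition happens in an order fixed by the input size and tile_size,
// so repeated calls on the same input return bitwise-identical results.
std::vector<float> fixed_order_inclusive_scan(const std::vector<float>& in,
                                              std::size_t tile_size)
{
    std::vector<float> out(in.size());
    if (in.empty() || tile_size == 0) return out;
    const std::size_t num_tiles = (in.size() + tile_size - 1) / tile_size;

    // Phase 1: each tile reduces its own elements (independent per tile).
    std::vector<float> aggregates(num_tiles, 0.0f);
    for (std::size_t t = 0; t < num_tiles; ++t) {
        const std::size_t begin = t * tile_size;
        const std::size_t end = std::min(begin + tile_size, in.size());
        for (std::size_t i = begin; i < end; ++i) aggregates[t] += in[i];
    }

    // Phase 2: exclusive scan of the aggregates, always left to right.
    std::vector<float> tile_prefix(num_tiles, 0.0f);
    float running = 0.0f;
    for (std::size_t t = 0; t < num_tiles; ++t) {
        tile_prefix[t] = running;
        running += aggregates[t];
    }

    // Phase 3: each tile scans its elements, seeded with its fixed prefix.
    for (std::size_t t = 0; t < num_tiles; ++t) {
        const std::size_t begin = t * tile_size;
        const std::size_t end = std::min(begin + tile_size, in.size());
        float acc = tile_prefix[t];
        for (std::size_t i = begin; i < end; ++i) {
            acc += in[i];
            out[i] = acc;
        }
    }
    return out;
}
```

The trade-off in this sketch is that the input is read twice (once in phase 1, once in phase 3), whereas a single-pass look-back scan reads it once. That cost is a plausible reason to keep any such variant separate from the default algorithms.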
I have created a follow-up question, NVIDIA/cccl#794. Please take a look; I appreciate your help.
The docs have been updated to reflect that the scan-by-key algorithms are non-deterministic. From
Closing as fixed.
While floating-point reduction is non-associative and some floating-point error is expected, the scan algorithms guarantee consistent results "run-to-run" on the same device. This is not the case for the scan_by_key algorithms, which currently produce different results (within fp error) run-to-run on the same device.

We should look into this and see if it's possible to provide the same guarantee for the keyed algorithms.