Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

scan_by_key results are non-deterministic for floats [NVBug 3477443] #1587

Closed
alliepiper opened this issue Jan 10, 2022 · 5 comments
Closed
Assignees
Labels
P2: nice to have Desired, but not necessary. repro: verified The provided repro has been validated. thrust type: enhancement New feature or request.

Comments

@alliepiper
Copy link
Collaborator

While floating point reduction is non-associative and some floating point error is expected, the scan algorithms guarantee consistent results "run-to-run" on the same device. This is not the case for the scan_by_key algorithms, which currently produce different results (within fp error) run-to-run on the same device.

We should look into this and see if it's possible to provide the same guarantee for the keyed algorithms.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <thrust/copy.h>
#include <thrust/fill.h>
#include <thrust/sequence.h>
#include <thrust/scan.h>
#include <iostream>

int main()
{
  auto const numElements = 250000;
  thrust::device_vector<double> data(numElements, 0.1);
  thrust::device_vector<double> keys(numElements, 1);

  thrust::device_vector<double> out1(numElements);
  thrust::device_vector<double> out2(numElements);

  thrust::host_vector<double> hostOut1(numElements);
  thrust::host_vector<double> hostOut2(numElements);

  thrust::inclusive_scan_by_key(keys.begin(), keys.end(), data.begin(), out1.begin());
  thrust::inclusive_scan_by_key(keys.begin(), keys.end(), data.begin(), out2.begin());

  // copy all of out1 and out2 to the host
  thrust::copy(out1.begin(), out1.end(), hostOut1.begin());
  thrust::copy(out2.begin(), out2.end(), hostOut2.begin());

  // Check the outputs are exactly the same
  for(int i = 0; i <numElements; i++) {
    if (hostOut1[i] != hostOut2[i]) {
      std::cout << "Element "<< i << " is not equal" << std::endl;
    }
  }

  return 0;
}
@alliepiper alliepiper added type: enhancement New feature or request. P2: nice to have Desired, but not necessary. repro: verified The provided repro has been validated. labels Jan 10, 2022
@alliepiper alliepiper changed the title scan_by_key results are non-deterministic for floats scan_by_key results are non-deterministic for floats [NVBug 3477443] Jan 10, 2022
@alliepiper
Copy link
Collaborator Author

Note that there are currently two versions of scan by key -- one in CUB and one in Thrust. Eventually we'll be updating Thrust to use CUB's implementation, so the fix should go in to CUB.

@gevtushenko
Copy link
Collaborator

All our scan implementations rely on decoupled look-back approach. That means that CUB and Thrust scans don't provide run-to-run determinism for scan operations on floating-point types. In other words, this issue isn't unique to the scan-by-key variant of the algorithm.

Here's an illustration of the tile states of five thread blocks. Each thread block reads the tile states of the predecessor block. If the tile state of the predecessor block contains only tile aggregate, this aggregate is added to the partial sum, and the previous tile state gets inspected. If one of the predecessor states has the full prefix, it gets added to the collected aggregate, and the result is stored in the tile state of the current thread block.

image

Multiple thread blocks run concurrently. Updates of the tile states get visible non-deterministically. For the example above, it's possible that thread block four will observe the final prefix. In this case it'll write (a + b + c + d) + e as its result. It's also possible that only tile aggregate will be observed. In this case, the result might be (a + b + c) + (d + e).

I don't see how this algorithm might provide run-to-run determinism. The original paper that proposes this algorithm states that:

An interesting implication is that scan results are not necessarily deterministic for pseudo-associative operators. For example, the prefix sum across a given floating-point dataset may vary from run to run because the number and order of scan operators applied by CUB may vary from one run to the next.

I suggest we update the documentation and remove this guarantee. We could add a stable scan implementation that would use a different algorithm for floating-point values.

@alliepiper
Copy link
Collaborator Author

Thanks for looking into this!

I was hoping we had some new tech in the regular decoupled scan algorithm that managed to address this limitation, but it sounds like we don't. I re-ran the test program using the regular scan algorithms (which are documented to be the same run-to-run) and confirmed that they also differ. I think you're right, the docs are just out of date.

Let's just fix the docs for now. If there's significant interest we can look into adding some stable versions later.

@lilohuang
Copy link
Contributor

I have created a follow up question NVIDIA/cccl#794, please take a look, and I appreciate your help.

@ericniebler
Copy link
Collaborator

The docs have been updated to reflect that the scan-by-key algorithms are non-deterministic. From exclusive_scan_by_key:

Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

Closing as fixed.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
P2: nice to have Desired, but not necessary. repro: verified The provided repro has been validated. thrust type: enhancement New feature or request.
Projects
Archived in project
Development

No branches or pull requests

5 participants