Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race conditions in CUDA kernels causing incorrect energies #2883

Merged
merged 2 commits into from
Feb 9, 2021

Conversation

prckent
Copy link
Contributor

@prckent prckent commented Feb 8, 2021

Proposed changes

Fix reported bugs by Andrea Zen / Dario Alfe where the legacy CUDA code will give clearly wrong energies for certain calculations. The symptom is a total energy as well as several subcomponents of the total energy that vary significantly with walker count. This problem was initially discovered via non-sensical DMC results, but the problems also occur in VMC.

Bugs are traced to handful of potential race conditions in the kernels associated with Coulomb energy evaluation. These bugs are present in Volta generation and more recent GPUs due to their independent thread scheduling. Our regular testing for e.g. carbon diamond did not identify them.

A side effect of these fixes is that the legacy CUDA code now appears to be completely deterministic.

Despite the extra syncthreads calls, running the performance tests showed no change in runtime within measurement error.

I do not have a small reproducer for the problems.

Thanks to @jefflarkin for reminders of CUDA tools (cuda-memcheck) that helped track these problems down efficiently.

The racecheck tool still issues a warning in find_core_electrons_PBC_kernel in NLPP.cu . It looks to be a false warning but needs more study. Otherwise the code is clean of races in VMC for the standard spline+j1+j2 inputs that I tried.

The initcheck tool flags a couple of uninitialized memory references associated with determinant setup and update.

energies

What type(s) of changes does this code introduce?

  • Bugfix

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

nitrogen: AMD Rome + NVIDIA V100 32GB. LLVM dev + CUDA 11.2.0

Checklist

  • Yes. This PR is up to date with current the current state of 'develop'
  • No, only CUDA changes. We have not reformatted the CUDA. Code added or changed in the PR has been clang-formatted.
  • NA. This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • NA. Documentation has been added (if appropriate)

( Created the PR from within vscode, hence initial draft. )

@prckent prckent marked this pull request as ready for review February 8, 2021 19:22
Copy link
Contributor

@ye-luo ye-luo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this make deterministic tests passing more robustly?
https://cdash.qmcpack.org/CDash/viewTest.php?buildid=183717&onlydelta

@@ -312,6 +312,7 @@ __global__ void find_core_electrons_PBC_kernel(T** R,
disp[tid][1] = r[tid][1] - i[ion][1];
disp[tid][2] = r[tid][2] - i[ion][2];
dist[tid] = min_dist<T>(disp[tid][0], disp[tid][1], disp[tid][2], L, Linv);
__syncthreads();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doubt this really matters.

@@ -1412,6 +1412,7 @@ __global__ void calc_ratio_grad_lapl(T** Ainv_list,
// ratio to make it w.r.t. new position
if (tid < 4)
ratio_prod[(tid + 1) * BS1] /= ratio_prod[0];
__syncthreads();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doubt this really matter.

@ye-luo
Copy link
Contributor

ye-luo commented Feb 8, 2021

I think the bug is not releated to Volta independent thread scheduling. Deterministic tests are failing on my Pascal as well. I think more efficient warp scheduling tricks the bug.

The diamond test just has too low electron counts.

@prckent
Copy link
Contributor Author

prckent commented Feb 8, 2021

As you mention, some of the deterministic tests are failing. I don't think any of these changes will help with them.

An isolated water molecule in a periodic supercell tripped the problem. This has the same electron count as our diamond tests. However the wavefunction is large. Possibly we could make a workflow test out of it.

There are some excess syncthreads calls to try to quiet racecheck. Since they didn't slow the code noticeably and they make the code a bit easier to "parse" mentally for threading issues, I prefer to keep them.

@prckent
Copy link
Contributor Author

prckent commented Feb 8, 2021

Actually the diamondC_2x1x1 are working locally, see below. Hopefully the runs on bora will also improve:

[pk7@nitrogen build_llvmdev_legacy_cuda_mixed_precision]$ ctest -R deterministic-diamondC_2x1x1
Test project /home/pk7/projects/qmc/git_QMCPACK_prckent/build_llvmdev_legacy_cuda_mixed_precision
      Start 453: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1
 1/48 Test #453: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1 ..............................   Passed    1.69 sec
      Start 454: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-kinetic
 2/48 Test #454: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-kinetic ......................   Passed    0.03 sec
      Start 455: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-totenergy
 3/48 Test #455: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-totenergy ....................   Passed    0.03 sec
      Start 456: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-eeenergy
 4/48 Test #456: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-eeenergy .....................   Passed    0.03 sec
      Start 457: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-samples
 5/48 Test #457: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-samples ......................   Passed    0.03 sec
      Start 458: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-potential
 6/48 Test #458: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-potential ....................   Passed    0.03 sec
      Start 459: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-ionion
 7/48 Test #459: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-ionion .......................   Passed    0.03 sec
      Start 460: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-localecp
 8/48 Test #460: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-localecp .....................   Passed    0.03 sec
      Start 461: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-nonlocalecp
 9/48 Test #461: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-nonlocalecp ..................   Passed    0.03 sec
      Start 462: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-flux
10/48 Test #462: deterministic-diamondC_2x1x1_pp-vmc_sdj-1-1-flux .........................   Passed    0.03 sec
      Start 463: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1
11/48 Test #463: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1 ......................   Passed    1.64 sec
      Start 464: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-kinetic
12/48 Test #464: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-kinetic ..............   Passed    0.04 sec
      Start 465: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-totenergy
13/48 Test #465: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-totenergy ............   Passed    0.03 sec
      Start 466: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-eeenergy
14/48 Test #466: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-eeenergy .............   Passed    0.04 sec
      Start 467: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-samples
15/48 Test #467: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-samples ..............   Passed    0.04 sec
      Start 468: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-potential
16/48 Test #468: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-potential ............   Passed    0.04 sec
      Start 469: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-ionion
17/48 Test #469: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-ionion ...............   Passed    0.04 sec
      Start 470: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-localecp
18/48 Test #470: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-localecp .............   Passed    0.03 sec
      Start 471: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-nonlocalecp
19/48 Test #471: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-nonlocalecp ..........   Passed    0.03 sec
      Start 472: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-flux
20/48 Test #472: deterministic-diamondC_2x1x1_pp-vmc_sdj_excited-1-1-flux .................   Passed    0.03 sec
      Start 473: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1
21/48 Test #473: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1 ...............   Passed    1.66 sec
      Start 474: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-kinetic
22/48 Test #474: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-kinetic .......   Passed    0.04 sec
      Start 475: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-totenergy
23/48 Test #475: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-totenergy .....   Passed    0.03 sec
      Start 476: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-eeenergy
24/48 Test #476: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-eeenergy ......   Passed    0.03 sec
      Start 477: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-samples
25/48 Test #477: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-samples .......   Passed    0.03 sec
      Start 478: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-potential
26/48 Test #478: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-potential .....   Passed    0.03 sec
      Start 479: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-ionion
27/48 Test #479: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-ionion ........   Passed    0.03 sec
      Start 480: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-localecp
28/48 Test #480: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-localecp ......   Passed    0.03 sec
      Start 481: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-nonlocalecp
29/48 Test #481: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-nonlocalecp ...   Passed    0.03 sec
      Start 482: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-flux
30/48 Test #482: deterministic-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-1-flux ..........   Passed    0.03 sec
      Start 483: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1
31/48 Test #483: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1 ..............................   Passed    1.60 sec
      Start 484: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-kinetic
32/48 Test #484: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-kinetic ......................   Passed    0.03 sec
      Start 485: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-totenergy
33/48 Test #485: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-totenergy ....................   Passed    0.04 sec
      Start 486: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-eeenergy
34/48 Test #486: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-eeenergy .....................   Passed    0.03 sec
      Start 487: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-samples
35/48 Test #487: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-samples ......................   Passed    0.03 sec
      Start 488: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-potential
36/48 Test #488: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-potential ....................   Passed    0.03 sec
      Start 489: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-ionion
37/48 Test #489: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-ionion .......................   Passed    0.03 sec
      Start 490: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-localecp
38/48 Test #490: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-localecp .....................   Passed    0.03 sec
      Start 491: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-nonlocalecp
39/48 Test #491: deterministic-diamondC_2x1x1_pp-dmc_sdj-1-1-nonlocalecp ..................   Passed    0.03 sec
      Start 492: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1
40/48 Test #492: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1 ......................   Passed    1.60 sec
      Start 493: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-kinetic
41/48 Test #493: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-kinetic ..............   Passed    0.04 sec
      Start 494: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-totenergy
42/48 Test #494: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-totenergy ............   Passed    0.03 sec
      Start 495: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-eeenergy
43/48 Test #495: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-eeenergy .............   Passed    0.03 sec
      Start 496: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-samples
44/48 Test #496: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-samples ..............   Passed    0.03 sec
      Start 497: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-potential
45/48 Test #497: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-potential ............   Passed    0.03 sec
      Start 498: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-ionion
46/48 Test #498: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-ionion ...............   Passed    0.03 sec
      Start 499: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-localecp
47/48 Test #499: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-localecp .............   Passed    0.03 sec
      Start 500: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-nonlocalecp
48/48 Test #500: deterministic-diamondC_2x1x1_pp-dmc_sdj_excited-1-1-nonlocalecp ..........   Passed    0.03 sec

100% tests passed, 0 tests failed out of 48

Label Time Summary:
QMCPACK                     =   8.18 sec*proc (5 tests)
QMCPACK-checking-results    =   1.46 sec*proc (43 tests)
deterministic               =   9.65 sec*proc (48 tests)
quality_unknown             =   9.65 sec*proc (48 tests)

Total Test time (real) =   9.69 sec

@ye-luo
Copy link
Contributor

ye-luo commented Feb 8, 2021

The bora failure is sporadic. let us hope your fix helps it.

@ye-luo ye-luo merged commit a53fcc2 into QMCPACK:develop Feb 9, 2021
@prckent prckent deleted the fixcuda branch March 23, 2021 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants