
GPU CUDA delayed updates #1279

Merged
merged 33 commits into from
Feb 7, 2019

Conversation


@atillack atillack commented Dec 20, 2018

To get things moving forward, here is the pull request for my GPU delayed updates code. I've presented some aspects of this work (including profiles) at the annual ECP meeting and at Nvidia's GTC earlier this year (2018), for those interested in a more in-depth view. It is labeled work in progress; here is my todo list:

  • implement VMC, VMC w/ drift, and DMC parts

  • a bit of code merging is needed to use the config changes from Ye's CPU delayed updates instead of the ones I've been using

    • the only difference is configurability per QMC block instead of globally in the slaterdeterminant definition, but I can live with losing that
  • extend to complex code path

  • code cleanup (leftover from some of the different strategies tried)

I tested and profiled the code extensively using the NiO system and see about a 1.5x speedup for DMC blocks of the 256-atom NiO cell on Summit. This is about what one would expect if the runtime of the update_inverse kernels were reduced to close to nothing per update step.
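
For readers new to the scheme: delayed (rank-k) updates accumulate k accepted single-particle moves and apply them to A^-1 in one blocked operation via the Sherman-Morrison-Woodbury identity. A sketch in the notation used later in this thread (the exact conventions for dU, V', and the lemma matrix in the code may differ):

A'               = A + dU * V'                        (k accepted rank-1 updates applied at once)
Lemma            = I + V' * A^-1 * dU                 (small k x k matrix)
A'^-1            = A^-1 - (A^-1 * dU) * Lemma^-1 * (V' * A^-1)
det(A') / det(A) = det(Lemma)                         (matrix determinant lemma, used for acceptance ratios)

The point is that the per-move O(N^2) inverse updates are replaced by matrix-matrix products every k moves, which is consistent with the update_inverse cost nearly disappearing as described above.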

Andreas Tillack added 8 commits December 22, 2017 14:09
…o drift work.

The drift version is currently not heavily optimized and is usually slower than the original code path. Based on runtime traces, the performance degradation is mostly due to:

- Two rows of the updated A inverse are needed for calculating gradients in
  DiracDeterminant_CUDA's calc_gradient and det_lookahead functions. Without
  drift, only the calculation in det_lookahead is required.
- The calc_lemma_gradient kernel in determinant_update.cu may be optimized
  further.
…update algorithm. The code in this case chooses k=0 (old code path).
…rnel in determinant_update.cu.

Speed optimization now comes down to optimizing the two kernels update_onemove and calc_lemma_column.
Conflicts:
	src/Particle/MCWalkerConfiguration.h
	src/QMCDrivers/VMC/VMC_CUDA.cpp
	src/QMCWaveFunctions/EinsplineSet.h
	src/QMCWaveFunctions/EinsplineSetCuda.cpp
	src/QMCWaveFunctions/Fermion/DiracDeterminantBase.h
	src/QMCWaveFunctions/Fermion/DiracDeterminantCUDA.cpp
	src/QMCWaveFunctions/Fermion/DiracDeterminantCUDA.h
	src/QMCWaveFunctions/Fermion/SlaterDet.h
	src/QMCWaveFunctions/Jastrow/OneBodyJastrowOrbitalBspline.cpp
	src/QMCWaveFunctions/Jastrow/OneBodyJastrowOrbitalBspline.h
	src/QMCWaveFunctions/Jastrow/TwoBodyJastrowOrbitalBspline.cpp
	src/QMCWaveFunctions/Jastrow/TwoBodyJastrowOrbitalBspline.h
	src/QMCWaveFunctions/OrbitalBase.h
	src/QMCWaveFunctions/SPOSetBase.cpp
	src/QMCWaveFunctions/SPOSetBase.h
	src/QMCWaveFunctions/TrialWaveFunction.h
	src/QMCWaveFunctions/TrialWaveFunction_CUDA.cpp
… Jastrows).

Code has similar performance and correctness to previous version.
@ghost ghost assigned atillack Dec 20, 2018
@ghost ghost added the in progress label Dec 20, 2018
@qmc-robot

Can one of the maintainers verify this patch?

@PDoakORNL
Contributor

ok to test

@prckent
Contributor

prckent commented Dec 20, 2018

To help development & merge velocity, you could do the complex implementation in a later PR if you wish. The GPU code did not have a complex code path until #84 by you and @yingwaili.

@ye-luo
Contributor

ye-luo commented Dec 21, 2018

Okay to test

@ghost ghost assigned ye-luo Dec 21, 2018
@@ -29,6 +29,26 @@ cublas_inverse (cublasHandle_t handle,
int N, int rowStride, int numMats,
bool useHigherPrecision = true);

void
cublas_lemma_mats (cublasHandle_t handle,
Contributor

@ye-luo ye-luo Dec 21, 2018

cuda_inverse.h/cu is for matrix inversion.
cublas_lemma_mats, cublas_ainv_row, cublas_smw_update are not generic wrapper functions of cublas.
Please move them to src/QMCWaveFunctions/Fermion

Contributor

No production code should be in sandbox. Put distinct functionality in different files or simply rename the cuda_inverse.h file e.g. cuda_matrices.h

Contributor

The path I put previously was wrong; it was just my local path. Functions under Numerics should not be used by only a specific algorithm, so cuda_matrices.h is not good. Please create a new set of .h and .cu files under src/QMCWaveFunctions/Fermion.

Contributor

OK

@ye-luo
Contributor

ye-luo commented Dec 21, 2018

SM1 needs a fix at the moment.
short-diamondC_2x1x1_pp-vmc_sdj-1-16 is failing.

Regarding the complex code path, you can either fix it or protect it with a macro, depending on how much effort is needed.
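
A minimal sketch of what protecting the complex path with a macro could look like, assuming the usual QMC_COMPLEX build flag; the guard location and the kdelay variable here are purely illustrative, not the actual code:

#ifdef QMC_COMPLEX
  // complex builds: delayed updates not supported yet, fall back to
  // the original rank-1 update path (equivalent to k = 0)
  kdelay = 0;
#else
  // real builds: delayed (rank-k) updates enabled
#endif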

@ye-luo ye-luo self-requested a review December 21, 2018 08:51
@atillack
Author

atillack commented Jan 2, 2019

Happy New Year! The hamiltonian unit test should work now. Thank you @ye-luo for your fix.

@prckent prckent added this to the V3.7.0 Release milestone Jan 3, 2019
@atillack
Author

atillack commented Jan 3, 2019

Complex code path implemented and tested to be working on NiO S32 on SummitDev.

Andreas Tillack added 5 commits January 4, 2019 13:19
This way, delay_rank sets the delay rank for the entire run and this
also allows overriding the delay rank per QMC block in the config file
for GPU delayed updates.
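
For reference, a sketch of the two places the delay rank can then be set, using the attribute names that appear later in this conversation (delay_rank on the slaterdeterminant, kdelay on a qmc block); the surrounding elements and values are illustrative:

<determinantset>
  <slaterdeterminant delay_rank="32">
    ...
  </slaterdeterminant>
</determinantset>

<qmc method="vmc" move="pbyp" gpu="yes" kdelay="64">
  <!-- kdelay overrides the global delay_rank for this block only -->
  ...
</qmc>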
@atillack
Author

atillack commented Feb 5, 2019

@ye-luo Thanks. Yes, I'll add a warning and cap DU at 64.

@atillack
Author

atillack commented Feb 5, 2019

@ye-luo Great idea! Please check the current commit (I am doing the same right now). I added a synchronization at the only place that differs between drift and no drift, where we're coming out of gpu::kernelStream into the default stream.
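
For context, a minimal sketch of the kind of cross-stream hand-off being described, recording an event on gpu::kernelStream and making the default stream wait on it; apart from gpu::kernelStream, the names here are illustrative and this is not the actual commit:

// make work queued on the default stream wait until the delayed-update
// kernels launched on gpu::kernelStream have finished
cudaEvent_t evt;
cudaEventCreateWithFlags(&evt, cudaEventDisableTiming);
cudaEventRecord(evt, gpu::kernelStream);   // mark the end of the kernelStream work
cudaStreamWaitEvent(0, evt, 0);            // default stream waits on that event
cudaEventDestroy(evt);
// (a blunter alternative is cudaStreamSynchronize(gpu::kernelStream))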

@ye-luo
Contributor

ye-luo commented Feb 5, 2019

@atillack Sadly, adding this stream synchronization doesn't help. I meant synchronization of threads within a kernel that has some launch parameter of 64 (block size?).

@atillack
Author

atillack commented Feb 5, 2019

@ye-luo Yeah, unfortunately there is no other somewhat suspicious location that stands out. I'll add the warning and cap at k = 64 and work on finding the fix later.

What makes me suspicious of numerical precision being the culprit is that with a Psi.recompute call after each step, everything seems fine (this is my NiO 128 atom data analyzed over the last 10 steps of 20 blocks of 1 step each):

                            LocalEnergy               Variance           ratio 
k = 1:
avg  series 1  -11782.962089 +/- 25.977604   473.905789 +/- 40.549188   0.0402 
k = 2:
avg  series 1  -11782.905483 +/- 25.761714   503.596322 +/- 47.569417   0.0427 
k = 4:
avg  series 1  -11783.675988 +/- 25.934824   498.788050 +/- 40.720720   0.0423 
k = 8:
avg  series 1  -11783.242652 +/- 25.829249   486.117116 +/- 44.136982   0.0413 
k = 16:
avg  series 1  -11783.116865 +/- 26.016353   498.418402 +/- 44.100940   0.0423 
k = 32:
avg  series 1  -11783.063903 +/- 25.766134   515.954829 +/- 42.950156   0.0438 
k = 64:
avg  series 1  -11782.884046 +/- 25.917708   486.847984 +/- 53.088808   0.0413 
k = 128:
avg  series 1  -11782.704666 +/- 25.630978   505.765987 +/- 34.084771   0.0429 

@ye-luo
Contributor

ye-luo commented Feb 5, 2019

Psi.recompute wipes out any bad stuff accumulated during the PbyP moves. If the sampling goes wrong, the average value will go wrong even if each individual Psi is correct. Could you try to run the VMC wider (more nodes) and longer (more blocks) and also check k = 80?
Are you using 1 walker per GPU? Try 32.

@atillack
Author

atillack commented Feb 5, 2019

I am running with 4 GPUs and 128 walkers/GPU.

@ye-luo
Contributor

ye-luo commented Feb 5, 2019

I didn't understand why your error bar was around 25.
I'm expecting some value around sqrt(400/(4*128*20))= 0.197642354 Hartree.
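
Spelling that estimate out, assuming roughly independent samples and a variance of about 400 Ha^2:

error bar ≈ sqrt( variance / N_samples )
          = sqrt( 400 / (4 GPUs * 128 walkers/GPU * 20 blocks) )
          = sqrt( 400 / 10240 )
          ≈ 0.198 Ha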

@atillack
Author

atillack commented Feb 5, 2019

Psi.recompute on the GPU only updates the A inverse - this is why, when AinvU (V'A^-1 * dU) and the lemma matrix were slightly inconsistent, the SMW update (A^-1' = A^-1 - AinvU Lemma^-1 (V'A^-1)) started accumulating those errors, leading to the observed NaNs.

I think there might still be some error accumulation going on that gets larger for bigger delay ranks. Part of the fix could be to go to higher precision for the SMW update (like we do for the full A^-1 update).

Here is the k = 80 run data (same as above, NiO 128 atoms, 4 GPUs, 128 walkers/GPU, 1 warmup step, 1 step per block, 20 blocks, last 10 analyzed):
k = 80:
avg series 1 -11782.911679 +/- 25.797730 479.054177 +/- 50.768536 0.0407

@atillack
Author

atillack commented Feb 5, 2019

@ye-luo The error bar is likely that large because the VMC block I am running is the very first one and is not yet equilibrated. I get the same error bar that you get when I run the "official" input files.

@ye-luo
Contributor

ye-luo commented Feb 5, 2019

The recomputing of A^-1 should not be the source. It is also applied in the CPU mixed precision code and I can safely go up to k = 512. I'm starting to worry about the V100 since I was running on Summit. Were you running on SummitDev?

@atillack
Author

atillack commented Feb 5, 2019

The current numbers are on SummitDev, but I also get similar results on Summit. Btw, the k = 80 above was accidentally at k = 64 (the cap-at-64 code works, and I forgot I had already enabled it after lunch). Here is the k = 80 data:
LocalEnergy Variance ratio
avg series 1 -11784.798230 +/- 25.943351 548.457051 +/- 38.092702 0.0465

…ue to observed numerical errors beyond that.
@atillack
Author

atillack commented Feb 5, 2019

@ye-luo Here is what I get when I run with k=80 using the following VMC input block:

<qmc method="vmc" move="pbyp" gpu="yes" kdelay="80">
<estimator name="LocalEnergy" hdf5="no" />
<parameter name="walkers">128</parameter>
<parameter name="stepsbetweensamples"> 1 </parameter>
<parameter name="warmupSteps"> 5 </parameter>
<parameter name="substeps"> 5 </parameter>
<parameter name="steps"> 2 </parameter>
<parameter name="blocks"> 5 </parameter>
<parameter name="timestep"> 1.0 </parameter>
<parameter name="usedrift"> no </parameter>
</qmc>

                        LocalEnergy               Variance           ratio 

NiO-fcc-S32-vmc series 1 -11865.977789 +/- 0.700271 408.558316 +/- 4.459451 0.0344

What is your VMC block?

@prckent
Contributor

prckent commented Feb 5, 2019

Q. Does Kepler (e.g. on Titan) have the same behaviors?

@ye-luo
Contributor

ye-luo commented Feb 5, 2019

@atillack Your VMC block seems fine. I don't set kdelay via the VMC block; I did <slaterdeterminant delay_rank="80">. The error bar in your last reply seems reasonable. Anyway, this is not the real issue.
Your k = 80 results confuse me. Are they actually all capped at k = 64?
What is the real k = 80 result? Is it problematic on SummitDev?

@atillack
Author

atillack commented Feb 5, 2019

@ye-luo The most recent k=80 result I posted was on SummitDev and is truly k=80. It makes no difference how the delay rank is set, but the individual section setting can override the global delay rank setting.

@atillack
Author

atillack commented Feb 5, 2019

@ye-luo I have working results up to k=128, but I also run into the diverging variance at k=256. From my experience, recomputing during the warmup steps helps, which makes me believe it is numerical instability when more than a handful of steps go by with continuous use of SMW updates rather than full ones.

@ye-luo
Contributor

ye-luo commented Feb 5, 2019

@atillack I tried SummitDev and I got correct numbers for k=80,96,128. So I believe the issue I'm encountering is related to Summit.
@prckent I'm afraid some kernels written for pre-Volta architectures may no longer be safe on Volta. I have not gotten a chance to try Titan yet.

Unless there is any other concern, I will approve and merge the code tomorrow. The CI machine is under maintenance today.

@atillack
Author

atillack commented Feb 5, 2019

@ye-luo @prckent I am running on Titan right now.

@atillack
Author

atillack commented Feb 6, 2019

@ye-luo @prckent It's fixed and working now!

Thanks @ye-luo for the idea of looking for CUDA threading issues. In the kernel that finishes the calculation of the lemma and ainvu matrices, there was the possibility of one thread changing data underneath another thread that was still trying to use that data...

Here is my current SummitDev @ k=256 run:
k = 256:
NiO-fcc-S32-vmc series 1 -11865.801500 +/- 0.545776 426.167319 +/- 23.228991 0.0359

Rerunning the other tests...
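
A minimal sketch of the class of bug described here, an intra-block race where one thread overwrites shared data that another thread is still reading, and the __syncthreads() barriers that prevent it; the kernel and variable names are illustrative and not the actual determinant_update.cu code (assumes n <= 64):

__global__ void finish_lemma_ainvu(float* lemma, const float* ainvu, int n)
{
  __shared__ float scratch[64];           // 64 matches the block size discussed above
  const int i = threadIdx.x;

  // stage one element per thread so the whole block can reuse it
  scratch[i] = ainvu[blockIdx.x * n + i];
  __syncthreads();                        // all stores visible before any thread reads

  // each thread reads the values staged by the other threads
  float acc = 0.0f;
  for (int j = 0; j < n; ++j)
    acc += scratch[j];

  // without this barrier, the overwrite below can change scratch[i]
  // while a slower thread is still reading it in the loop above
  __syncthreads();

  scratch[i] = acc;                       // reuse of the shared buffer is now safe
  lemma[blockIdx.x * n + i] = acc;
}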

@atillack
Author

atillack commented Feb 6, 2019

@ye-luo @prckent Everything is working. Here is my current VMC w/o drift series on SummitDev:

                            LocalEnergy               Variance           ratio 
k = 1:
NiO-fcc-S32-vmc  series 1  -11865.578301 +/- 0.851429   404.349249 +/- 11.021452   0.0341 
k = 2:
NiO-fcc-S32-vmc  series 1  -11865.349947 +/- 0.916794   396.919920 +/- 7.895504   0.0335 
k = 4:
NiO-fcc-S32-vmc  series 1  -11866.000019 +/- 0.772399   424.014092 +/- 22.672240   0.0357 
k = 8:
NiO-fcc-S32-vmc  series 1  -11865.711592 +/- 0.675608   403.633995 +/- 5.521675   0.0340 
k = 16:
NiO-fcc-S32-vmc  series 1  -11865.642270 +/- 0.453154   401.845648 +/- 11.238411   0.0339 
k = 32:
NiO-fcc-S32-vmc  series 1  -11865.794974 +/- 0.935828   409.392814 +/- 12.941130   0.0345 
k = 64:
NiO-fcc-S32-vmc  series 1  -11865.310495 +/- 0.519327   403.773517 +/- 11.573975   0.0340 
k = 128:
NiO-fcc-S32-vmc  series 1  -11865.702223 +/- 0.690361   421.433592 +/- 6.115831   0.0355 
k = 256:
NiO-fcc-S32-vmc  series 1  -11865.801500 +/- 0.545776   426.167319 +/- 23.228991   0.0359 

@ye-luo ye-luo changed the title [WIP] GPU delayed updates GPU delayed updates Feb 7, 2019
@ye-luo ye-luo changed the title GPU delayed updates GPU CUDA delayed updates Feb 7, 2019
@ye-luo
Contributor

ye-luo commented Feb 7, 2019

I verified that runs with up to k = 256 are correct on Summit.

@ye-luo ye-luo merged commit bf1a98e into QMCPACK:develop Feb 7, 2019
@ghost ghost removed the in progress label Feb 7, 2019