
GPU CUDA delayed updates #1279

Merged
merged 33 commits into QMCPACK:develop on Feb 7, 2019

atillack (Contributor) commented Dec 20, 2018

To get things moving forward, here is the pull request for my GPU delayed-updates code. I presented some aspects of this work (including profiles) at the annual ECP meeting and at Nvidia's GTC earlier this year (2018), for those who want a more in-depth view. It is labeled work in progress, and here is my todo list:

  • implement VMC, VMC w/ drift, and DMC parts

  • a bit of code merging is needed to use the config changes from Ye's CPU delayed updates instead of the ones I've been using

    • the only difference is configurability per QMC block instead of globally in the slaterdeterminant definition, but I can live with losing that
  • extend to complex code path

  • code cleanup (leftover from some of the different strategies tried)

I tested and profiled the code extensively using the NiO system and see about a 1.5x speedup for DMC blocks of the 256-atom NiO cell on Summit. This is about what one would expect if the runtime of the update_inverse kernels were reduced to close to nothing per update step.
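For readers new to the approach: delayed updates accumulate k accepted single-particle moves and fold them into the inverse in one rank-k Sherman–Morrison–Woodbury step, which is what turns many small update_inverse launches into a few large matrix products. A sketch of the algebra (my notation, not lifted from this PR's code):

A_{\mathrm{new}}^{-1} \;=\; A^{-1} \;-\; \bigl(A^{-1}U\bigr)\,\bigl(I + V^{T}A^{-1}U\bigr)^{-1}\,\bigl(V^{T}A^{-1}\bigr)

where the columns of U and V encode the k delayed row changes, and the small k x k matrix I + V^{T}A^{-1}U is (up to convention) the "lemma" matrix that the lemma kernels discussed below refer to.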

First implementation of delayed updates into VMC_CUDA. Both w/ and w/o drift work.

The drift version is currently not heavily optimized and is usually slower than the
original code path. Based on runtime traces, the performance degradation is mostly
due to:

- Two rows of the updated A inverse are needed for calculating gradients in
  DiracDeterminant_CUDA's calc_gradient and det_lookahead functions. Without
  drift, only the calculation in det_lookahead is required.
- The calc_lemma_gradient kernel in determinant_update.cu may be optimized
  further.

jefflarkin commented on src/QMCDrivers/QMCDriver.cpp in 7ca8750 Apr 5, 2018

May want to check the value of kdelay here so that you can return an error if invalid rather than crashing.
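For illustration only, a minimal sketch of the kind of guard suggested here, using a hypothetical sanitizeDelayRank helper rather than the actual QMCDriver.cpp logic:

// Hypothetical helper, not the actual QMCDriver.cpp code: reject invalid
// kdelay values up front instead of crashing later in the CUDA path.
#include <iostream>

int sanitizeDelayRank(int kdelay)
{
  if (kdelay < 0)
  {
    std::cerr << "Warning: kdelay = " << kdelay
              << " is invalid; falling back to the original update path (k = 0).\n";
    return 0;
  }
  return kdelay;
}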

atillack added some commits Apr 12, 2018

Added output in the case a user chooses a negative k for the delayed update algorithm. The code in this case chooses k=0 (old code path).
Replaced second smw_update call with a small addon to the update_onemove kernel in determinant_update.cu.

Speed optimization now comes down to optimizing the two kernels update_onemove and calc_lemma_column.
Merge remote-tracking branch 'upstream/develop' into delayed_updates
Conflicts:
	src/Particle/MCWalkerConfiguration.h
	src/QMCDrivers/VMC/VMC_CUDA.cpp
	src/QMCWaveFunctions/EinsplineSet.h
	src/QMCWaveFunctions/EinsplineSetCuda.cpp
	src/QMCWaveFunctions/Fermion/DiracDeterminantBase.h
	src/QMCWaveFunctions/Fermion/DiracDeterminantCUDA.cpp
	src/QMCWaveFunctions/Fermion/DiracDeterminantCUDA.h
	src/QMCWaveFunctions/Fermion/SlaterDet.h
	src/QMCWaveFunctions/Jastrow/OneBodyJastrowOrbitalBspline.cpp
	src/QMCWaveFunctions/Jastrow/OneBodyJastrowOrbitalBspline.h
	src/QMCWaveFunctions/Jastrow/TwoBodyJastrowOrbitalBspline.cpp
	src/QMCWaveFunctions/Jastrow/TwoBodyJastrowOrbitalBspline.h
	src/QMCWaveFunctions/OrbitalBase.h
	src/QMCWaveFunctions/SPOSetBase.cpp
	src/QMCWaveFunctions/SPOSetBase.h
	src/QMCWaveFunctions/TrialWaveFunction.h
	src/QMCWaveFunctions/TrialWaveFunction_CUDA.cpp
Minor bugfixes caused by incomplete merge with new/renamed files (AoS Jastrows).

Code has similar performance and correctness to previous version.

qmc-robot (Collaborator) commented Dec 20, 2018

Can one of the maintainers verify this patch?

PDoakORNL (Contributor) commented Dec 20, 2018

ok to test

prckent (Contributor) commented Dec 20, 2018

To help development & merge velocity, you could do the complex implementation in a later PR if you wished. The GPU code was not complex until #84 by you and @yingwaili

ye-luo (Contributor) commented Dec 21, 2018

Okay to test

@@ -29,6 +29,26 @@ cublas_inverse (cublasHandle_t handle,
int N, int rowStride, int numMats,
bool useHigherPrecision = true);

void
cublas_lemma_mats (cublasHandle_t handle,

ye-luo (Contributor) commented Dec 21, 2018

cuda_inverse.h/cu is for matrix inversion.
cublas_lemma_mats, cublas_ainv_row, cublas_smw_update are not generic wrapper functions of cublas.
Please move them to src/QMCWaveFunctions/Fermion

prckent (Contributor) commented Dec 21, 2018

No production code should be in sandbox. Put distinct functionality in different files or simply rename the cuda_inverse.h file e.g. cuda_matrices.h

ye-luo (Contributor) commented Dec 22, 2018

The path I put previously was wrong; it was just my local path. Functions under Numerics should not be used by only one specific algorithm, so cuda_matrices.h is not a good choice. Please create a new set of .h and .cu files under src/QMCWaveFunctions/Fermion.

prckent (Contributor) commented Dec 22, 2018

OK

atillack (Author, Contributor) commented Jan 7, 2019

ye-luo (Contributor) commented Dec 21, 2018

SM1 needs a fix at the moment.
short-diamondC_2x1x1_pp-vmc_sdj-1-16 is failing.

Regarding the complex build, you can either fix the complex code path or protect it with a macro, depending on how much effort is needed.

ye-luo self-requested a review Dec 21, 2018

atillack (Contributor, Author) commented Jan 2, 2019

Happy New Year! The hamiltonian unit test should work now. Thank you @ye-luo for your fix.

prckent added this to the V3.7.0 Release milestone Jan 3, 2019

atillack (Contributor, Author) commented Jan 3, 2019

The complex code path is implemented and tested to be working with NiO S32 on SummitDev.

atillack added some commits Jan 4, 2019

Implemented pass-through of Ye's delay_rank parameter to my code path.
This way, delay_rank sets the delay rank for the entire run, while the
per-QMC-block setting in the config file can still override it for GPU
delayed updates.
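To make the two entry points concrete, here is a hedged sketch of an input combining the global delay_rank with a per-block kdelay override (element names taken from snippets quoted later in this thread; the surrounding wavefunction and simulation elements are abbreviated):

<slaterdeterminant delay_rank="32">
  <!-- determinant definitions as usual -->
</slaterdeterminant>

<qmc method="vmc" move="pbyp" gpu="yes" kdelay="64">
  <!-- this block overrides the global delay rank of 32 with k = 64 -->
  <parameter name="walkers">128</parameter>
</qmc>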

atillack (Contributor, Author) commented Feb 5, 2019

@ye-luo Great idea! Please check the current commit (I am doing the same right now). I added a synchronization at the only place that differs between drift and no drift, where we come out of gpu::kernelStream into the default stream.
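For reference, a hedged sketch of one way to order work when leaving a side stream for the default stream; this is illustrative only, not the actual commit (which may simply synchronize gpu::kernelStream directly):

// Illustrative only: make work queued on a side stream (e.g. gpu::kernelStream)
// visible to the legacy default stream before dependent kernels launch there.
#include <cuda_runtime.h>

void syncSideStreamWithDefault(cudaStream_t sideStream)
{
  cudaEvent_t done;
  cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
  cudaEventRecord(done, sideStream);   // marks completion of the queued work
  cudaStreamWaitEvent(0, done, 0);     // default stream waits; the host does not block
  cudaEventDestroy(done);
  // The blunter alternative is cudaStreamSynchronize(sideStream), which blocks the host.
}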

ye-luo (Contributor) commented Feb 5, 2019

@atillack Sadly, adding this stream synchronization doesn't help. I meant synchronization of the threads within a kernel that has some launch parameter of 64 (block size?).

atillack (Contributor, Author) commented Feb 5, 2019

@ye-luo Yeah, unfortunately there is no other somewhat suspicious location that stands out. I'll add the warning and cap at k = 64 and work on finding the fix later.

What makes me suspect numerical precision as the culprit is that with a Psi.recompute call after each step everything seems fine (this is my NiO 128-atom data, analyzed over the last 10 of 20 blocks of 1 step each):

                            LocalEnergy               Variance           ratio 
k = 1:
avg  series 1  -11782.962089 +/- 25.977604   473.905789 +/- 40.549188   0.0402 
k = 2:
avg  series 1  -11782.905483 +/- 25.761714   503.596322 +/- 47.569417   0.0427 
k = 4:
avg  series 1  -11783.675988 +/- 25.934824   498.788050 +/- 40.720720   0.0423 
k = 8:
avg  series 1  -11783.242652 +/- 25.829249   486.117116 +/- 44.136982   0.0413 
k = 16:
avg  series 1  -11783.116865 +/- 26.016353   498.418402 +/- 44.100940   0.0423 
k = 32:
avg  series 1  -11783.063903 +/- 25.766134   515.954829 +/- 42.950156   0.0438 
k = 64:
avg  series 1  -11782.884046 +/- 25.917708   486.847984 +/- 53.088808   0.0413 
k = 128:
avg  series 1  -11782.704666 +/- 25.630978   505.765987 +/- 34.084771   0.0429 

ye-luo (Contributor) commented Feb 5, 2019

Psi.recompute wipes out any bad stuff accumulated during the PbyP moves. If the sampling goes wrong, the average value will go wrong even if each individual Psi is correct. Could you try to run the VMC wider (more nodes) and longer (more blocks), and also check k = 80?
Are you using 1 walker per GPU? Try 32.

atillack (Contributor, Author) commented Feb 5, 2019

I am running with 4 GPUs and 128 walkers/GPU.

ye-luo (Contributor) commented Feb 5, 2019

I don't understand why your error bar is around 25.
I'm expecting something around sqrt(400/(4*128*20)) = 0.197642354 Hartree.
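For context, that estimate is the usual standard error of the mean under the rough assumption of uncorrelated samples, with a variance of about 400 Ha^2 read off the tables above and N = GPUs x walkers/GPU x steps:

\sigma_{\bar{E}} \approx \sqrt{\frac{\sigma^{2}}{N}} = \sqrt{\frac{400}{4 \times 128 \times 20}} \approx 0.198\ \text{Ha}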

atillack (Contributor, Author) commented Feb 5, 2019

Psi.recompute on the GPU only updates the A inverse. This is why, when AinvU (V'A^-1 · dU) and the lemma matrix were slightly inconsistent, the SMW update (A^-1' = A^-1 - AinvU · Lemma^-1 · (V'A^-1)) started accumulating those errors, leading to the observed NaNs.

I think there might still be some error accumulation going on that grows with larger delay ranks. Part of the fix could be to go to higher precision for the SMW update (like we do for the full A^-1 update).

Here is the k = 80 run data (same setup as above: NiO 128 atoms, 4 GPUs, 128 walkers/GPU, 1 warmup step, 1 step per block, 20 blocks, last 10 analyzed):
k = 80:
avg  series 1  -11782.911679 +/- 25.797730   479.054177 +/- 50.768536   0.0407
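On the higher-precision idea above, a minimal host-side sketch (illustrative only, not the PR's CUDA code) of promoting the small k x k lemma solve to double while the bulk data stays in single precision:

// Illustrative sketch, not QMCPACK code: solve Lemma * x = b in double even
// when Lemma and b are stored in float, to limit error growth at large k.
#include <cmath>
#include <vector>

std::vector<double> solveLemmaDouble(const std::vector<float>& lemma_f,
                                     const std::vector<float>& b_f, int k)
{
  std::vector<double> L(lemma_f.begin(), lemma_f.end()); // promote to double
  std::vector<double> b(b_f.begin(), b_f.end());
  for (int col = 0; col < k; ++col)
  {
    int piv = col; // partial pivoting keeps the small solve well conditioned
    for (int r = col + 1; r < k; ++r)
      if (std::fabs(L[r * k + col]) > std::fabs(L[piv * k + col]))
        piv = r;
    if (piv != col)
    {
      for (int c = 0; c < k; ++c)
        std::swap(L[col * k + c], L[piv * k + c]);
      std::swap(b[col], b[piv]);
    }
    for (int r = col + 1; r < k; ++r)
    {
      const double f = L[r * k + col] / L[col * k + col];
      for (int c = col; c < k; ++c)
        L[r * k + c] -= f * L[col * k + c];
      b[r] -= f * b[col];
    }
  }
  std::vector<double> x(k);
  for (int r = k - 1; r >= 0; --r)
  {
    double s = b[r];
    for (int c = r + 1; c < k; ++c)
      s -= L[r * k + c] * x[c];
    x[r] = s / L[r * k + r];
  }
  return x; // caller applies x in the SMW correction, then demotes as needed
}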

atillack (Contributor, Author) commented Feb 5, 2019

@ye-luo The error bar is likely that large because the VMC block I am running is the very first one and is not yet equilibrated. I get the same error bar as you when I run the "official" input files.

ye-luo (Contributor) commented Feb 5, 2019

Recomputing A^-1 should not be the source. It is also applied in the CPU mixed-precision build, and there I can safely go up to k = 512. I'm starting to worry about the V100 since I was running on Summit. Were you running on SummitDev?

atillack (Contributor, Author) commented Feb 5, 2019

The current numbers are from SummitDev, but I also get similar results on Summit. By the way, the k = 80 run above was accidentally at k = 64 (the cap-at-64 code works, and I forgot I had already enabled it after lunch). Here is the real k = 80 data:
                        LocalEnergy               Variance           ratio
avg  series 1  -11784.798230 +/- 25.943351   548.457051 +/- 38.092702   0.0465

Added warning to cap GPU delayed updates at k = 64 for VMC w/ drift due to observed numerical errors beyond that.

atillack (Contributor, Author) commented Feb 5, 2019

@ye-luo Here is what I get when I run with k=80 using the following VMC input block:

<qmc method="vmc" move="pbyp" gpu="yes" kdelay="80">
<estimator name="LocalEnergy" hdf5="no" />
<parameter name="walkers">128</parameter>
<parameter name="stepsbetweensamples"> 1 </parameter>
<parameter name="warmupSteps"> 5 </parameter>
<parameter name="substeps"> 5 </parameter>
<parameter name="steps"> 2 </parameter>
<parameter name="blocks"> 5 </parameter>
<parameter name="timestep"> 1.0 </parameter>
<parameter name="usedrift"> no </parameter>
</qmc>

                        LocalEnergy               Variance           ratio 

NiO-fcc-S32-vmc series 1 -11865.977789 +/- 0.700271 408.558316 +/- 4.459451 0.0344

What is your VMC block?

prckent (Contributor) commented Feb 5, 2019

Q: Does Kepler (e.g. on Titan) show the same behavior?

ye-luo (Contributor) commented Feb 5, 2019

@atillack Your VMC block seems fine. I don't set kdelay via the VMC block; I used <slaterdeterminant delay_rank="80">. The error bar in your last reply seems reasonable. Anyway, this is not the real issue.
Your k = 80 results confuse me. Are they actually all capped at k = 64?
What is the real k = 80 result? Is it problematic on SummitDev?

atillack (Contributor, Author) commented Feb 5, 2019

@ye-luo The most recent k=80 result I posted was on SummitDev and is truly k=80. It makes no difference how the delay rank is set, but the individual section setting can override the global delay-rank setting.

atillack (Contributor, Author) commented Feb 5, 2019

@ye-luo I have working results up to k=128, but I also run into the diverging variance at k=256. In my experience, recomputing during the warmup steps helps, which makes me believe it is numerical instability when more than a handful of steps go by with continuous use of SMW updates rather than full recomputations.

ye-luo (Contributor) commented Feb 5, 2019

@atillack I tried SummitDev and got correct numbers for k=80, 96, 128. So I believe the issue I'm encountering is related to Summit.
@prckent I'm afraid some kernels written for pre-Volta architectures may no longer be safe on Volta. I have not gotten a chance to try Titan yet.

Unless there is any other concern, I will approve and merge the code tomorrow. The CI machine is under maintenance today.

atillack (Contributor, Author) commented Feb 5, 2019

@ye-luo @prckent I am running on Titan right now.

atillack (Contributor, Author) commented Feb 6, 2019

@ye-luo @prckent It's fixed and working now!

Thanks @ye-luo for the idea of looking for CUDA threading issues: in the kernel that finishes the calculation of the lemma and AinvU matrices, there was the possibility of one thread changing data underneath another thread that was still trying to use it...
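For illustration, a hedged sketch (not the actual determinant_update.cu kernel) of that kind of intra-block hazard and the barrier that removes it:

// Illustrative only: one thread must not overwrite ainvu[] while a neighboring
// thread in the same block still needs the old value. Assumes the data for one
// walker is handled by a single thread block, so __syncthreads() is sufficient.
__global__ void finish_lemma_ainvu(float* lemma, float* ainvu, const float* delta, int n)
{
  const int tid = threadIdx.x;
  float neighbor_old = 0.0f;
  if (tid < n)
    neighbor_old = ainvu[(tid + 1) % n]; // read old data another thread will overwrite
  __syncthreads();                       // barrier: all reads finish before any write below
  if (tid < n)
  {
    ainvu[tid] += delta[tid];            // in-place update ("changing data underneath")
    lemma[tid] = neighbor_old * delta[tid];
  }
}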

Here is my current SummitDev @ k=256 run:
k = 256:
NiO-fcc-S32-vmc series 1 -11865.801500 +/- 0.545776 426.167319 +/- 23.228991 0.0359

Rerunning the other tests...

atillack (Contributor, Author) commented Feb 6, 2019

@ye-luo @prckent Everything is working. Here is my current VMC w/o drift series on SummitDev:

                            LocalEnergy               Variance           ratio 
k = 1:
NiO-fcc-S32-vmc  series 1  -11865.578301 +/- 0.851429   404.349249 +/- 11.021452   0.0341 
k = 2:
NiO-fcc-S32-vmc  series 1  -11865.349947 +/- 0.916794   396.919920 +/- 7.895504   0.0335 
k = 4:
NiO-fcc-S32-vmc  series 1  -11866.000019 +/- 0.772399   424.014092 +/- 22.672240   0.0357 
k = 8:
NiO-fcc-S32-vmc  series 1  -11865.711592 +/- 0.675608   403.633995 +/- 5.521675   0.0340 
k = 16:
NiO-fcc-S32-vmc  series 1  -11865.642270 +/- 0.453154   401.845648 +/- 11.238411   0.0339 
k = 32:
NiO-fcc-S32-vmc  series 1  -11865.794974 +/- 0.935828   409.392814 +/- 12.941130   0.0345 
k = 64:
NiO-fcc-S32-vmc  series 1  -11865.310495 +/- 0.519327   403.773517 +/- 11.573975   0.0340 
k = 128:
NiO-fcc-S32-vmc  series 1  -11865.702223 +/- 0.690361   421.433592 +/- 6.115831   0.0355 
k = 256:
NiO-fcc-S32-vmc  series 1  -11865.801500 +/- 0.545776   426.167319 +/- 23.228991   0.0359 

ye-luo changed the title from "[WIP] GPU delayed updates" to "GPU delayed updates" Feb 7, 2019

ye-luo changed the title from "GPU delayed updates" to "GPU CUDA delayed updates" Feb 7, 2019

ye-luo approved these changes Feb 7, 2019

ye-luo (Contributor) commented Feb 7, 2019

I verified that runs with up to k = 256 are correct on Summit.

ye-luo merged commit bf1a98e into QMCPACK:develop Feb 7, 2019

2 checks passed: rhea-cpu, rhea-gpu

wafflebot removed the in progress label Feb 7, 2019
