Split splines across multiple GPUs on a given node #1101

Merged
40 commits merged into QMCPACK:develop from atillack:split_splines_gpu on Dec 7, 2018

Conversation

atillack (Contributor) commented Oct 9, 2018

Here is my work-in-progress version of the split splines code for GPU with promising performance numbers:

| Run | GPUs | Walkers/GPU | Original code (s) | New code w/ split splines (s) | Split-spline cost (new/original) |
|---|---|---|---|---|---|
| NiO S32 on SummitDev | 2 | 128 | 129.0 | 148.0 | 1.15 |
| -- | 4 | 128 | 129.0 | 236.5 | 1.83 |
| NiO S32 on Summit | 3 | 128 | 91.3 | 148.1 | 1.62 |
| -- | 3 | 250 | 145.1 | 223.2 | 1.54 |
| -- | 6 | 128 | 92.5 | 223.2 | 2.41 |
| -- | 6 | 250 | 147.3 | 403.7 | 2.74 |
| NiO S64 on Summit | 3 | 45 | 230.5 | 334.7 | 1.45 |
| -- | 6 | 45 | 230.5 | 463.8 | 2.01 |

Currently, only the spline type used for our NiO example is fully implemented. I will of course implement the missing bits once we have finalized the code here.

In order to use the code and split the spline-data memory across multiple GPUs, the following is needed:

  • CUDA MPS must be running,
  • all GPUs on the node must be visible to each MPI rank,
  • the options gpu="yes" and gpusharing="yes" must be set in the <determinantset type="einspline"> section of the <wavefunction> definition block (a minimal example follows below).
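For illustration, a minimal input fragment with both options enabled. The file names and other attributes here are placeholders, not taken from this PR; only gpu and gpusharing are the options discussed above.

```xml
<wavefunction name="psi0" target="e">
  <!-- hypothetical orbital file and ion source names; only gpu/gpusharing matter here -->
  <determinantset type="einspline" href="orbitals.h5" source="ion0"
                  gpu="yes" gpusharing="yes">
    <!-- determinant definitions as usual -->
  </determinantset>
</wavefunction>
```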

I tested the original code, based on an older development tree, on both SummitDev (2 and 4 GPUs) and Summit (3 and 6 GPUs); this is the data above. After updating to the current development tree I could only run on two GPUs on SummitDev to confirm that performance levels are still the same. With four GPUs I get an MPI error (coll:ibm:module: datatype_prepare_recvv overflowed integer range, triggered by comm->gatherv_in_place from gatherv in the new spline adaptor code). I get the exact same error with vanilla 3.5.0 on SummitDev. Since I can't run on Summit right now, I can't verify whether this is a regression in 3.5.0 (3.4.0 did not have these issues on Summit) or just the particular MPI version on SummitDev.

To finish:

  • Output error and exit when split splines and UVM are used together for now
  • PhaseFactors.cu is already finished but needs testing of the QMC_COMPLEX code path
  • Implement double complex evaluation functions
  • Implement float and double wave function evaluation functions
  • Fill in the missing split splines memory allocations/break-ups in einspline/multi_bspline_create_cuda.cu
  • Implement the MPI host distribution to follow the split splines
  • Add documentation in manual and test case

@wafflebot wafflebot bot added the in progress label Oct 9, 2018

qmc-robot (Collaborator) commented Oct 9, 2018

Can one of the maintainers verify this patch?


@prckent prckent changed the title Split splines across multiple GPUs on a given node [WIP] Split splines across multiple GPUs on a given node Oct 9, 2018

prckent (Contributor) commented Oct 9, 2018

OK to test

Added complex double eval function stub for split splines code and variable name typo for QMC_COMPLEX case.
atillack (Contributor) commented Oct 9, 2018

Rhea (gpu) should compile and run fine now.

ye-luo (Contributor) commented Oct 9, 2018

In v3.5, I changed how the spline tables are collected, using an in-place gather so that no extra scratch memory is required. MPI is notorious for being limited to the 32-bit integer range, and you might have an inferior MPI implementation. We need to find a workaround; hopefully this is not a bug.
Could you go to the gatherv function in src/spline/einspline_util.hpp and print out buffer->coefs_size and ncol on the crashed run?
There is an internal control that switches between two algorithms. Try reducing (1<<20) and see whether that works around the issue.
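To make the fallback strategy concrete, here is a hedged sketch, not the actual einspline_util.hpp routine: the layout, the function name, and the non-in-place form are my assumptions. The idea is that gathering the table one grid-point row at a time keeps every per-call count and displacement far below the 32-bit limit that a single table-sized gather can overflow.

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// Hedged sketch: gather a coefficient table of nrow grid points x ncol splines,
// where each rank owns a contiguous slice of the ncol splines for every grid
// point.  Doing one MPI_Gatherv per grid point keeps every count/displacement
// at most ncol, far below the 32-bit int range of MPI's count arguments.
void gather_spline_rows(const double* local,   // nrow x ncol_local[rank], row-major
                        double* global,        // nrow x ncol, row-major; used on root only
                        std::size_t nrow, std::size_t ncol,
                        const std::vector<int>& ncol_local,  // slice width per rank
                        const std::vector<int>& col_offset,  // slice start per rank
                        int root, MPI_Comm comm)
{
  int rank;
  MPI_Comm_rank(comm, &rank);
  const int my_ncol = ncol_local[rank];

  for (std::size_t row = 0; row < nrow; ++row)
    MPI_Gatherv(local + row * my_ncol, my_ncol, MPI_DOUBLE,
                global + row * ncol, ncol_local.data(), col_offset.data(),
                MPI_DOUBLE, root, comm);
}
```

The trade-off is many small collectives instead of one large one, which is why a size threshold such as (1<<20) decides which of the two code paths is taken.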

ye-luo (Contributor) commented Oct 9, 2018

I'm trying to understand the algorithm, but directly reading the code may not be the best way, so let me throw out some questions:

  1. Since you are dispatching spline evaluation to remote devices, the result vector is allocated as unified memory and relies on CUDA to migrate the data back to the device for the determinant calculation. Am I right?
  2. When you list 6 GPUs, how many MPI ranks do you have on the node? 6 or 1?
  3. Assuming the previous answer is 6: the spline table on the host is not distributed among the MPI ranks within the group; only the table on the GPU is distributed when the data is copied to the GPU. Am I right?
  4. How do you share the GPU memory address across different MPI ranks? Or does CUDA handle it for you?
  5. How much faster is this compared to the UVM I implemented (one MPI rank with one GPU, where the part of the spline that doesn't fit stays on the host)?
atillack (Contributor) commented Oct 9, 2018

@ye-luo:

To your earlier question:

On SummitDev with 4 MPI ranks: buffer->coefs_size = 446,772,240; ncol = 816. (Working on reducing 1<<20 right now.)

Algorithm questions:

  1. Yes.
  2. 6 MPI ranks. Each rank sees all six GPUs.
  3. Yes, you are correct.
  4. CUDA IPC memory handles (cudaIpcGetMemHandle and cudaIpcOpenMemHandle); a minimal sketch of this mechanism follows below.
  5. Good question. Let me see if I can do a fair comparison by triggering your UVM code to put half the spline table on the GPU and the other half on the host.
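For readers unfamiliar with CUDA IPC, here is a minimal sketch of the mechanism named in point 4. The helper name and the MPI plumbing are illustrative assumptions, not the code in this PR.

```cpp
#include <cuda_runtime.h>
#include <mpi.h>
#include <cstddef>

// Hedged sketch: the rank 'owner' allocates one slab of the split spline table
// on its GPU and exports a CUDA IPC handle; every other rank on the node opens
// the handle and obtains a device pointer into the owner's allocation.
// Requires the processes to share the device (e.g. via CUDA MPS).
// Hypothetical helper, not QMCPACK code.
void* share_device_slab(std::size_t bytes, int owner, MPI_Comm node_comm)
{
  int rank;
  MPI_Comm_rank(node_comm, &rank);

  void* dptr = nullptr;
  cudaIpcMemHandle_t handle;

  if (rank == owner)
  {
    cudaMalloc(&dptr, bytes);            // this rank's slice of the spline table
    cudaIpcGetMemHandle(&handle, dptr);  // export the allocation to other processes
  }

  // The handle is an opaque, fixed-size struct; broadcast it over the node communicator.
  MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, owner, node_comm);

  if (rank != owner)
    cudaIpcOpenMemHandle(&dptr, handle, cudaIpcMemLazyEnablePeerAccess);

  return dptr;  // device pointer usable from this rank's CUDA context
}
```

The opaque handle can be shared between processes by any means; broadcasting it over a node-local communicator is one common choice.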
atillack (Contributor) commented Oct 9, 2018

@ye-luo Also, I left your UVM code in place and it should still work even with the split splines...

ye-luo (Contributor) commented Oct 9, 2018

@atillack Great work.
Some effort is also needed on the spline builder to distribute the table on the host, and it should connect to the GPU part seamlessly. I think eventually we will first split the table over MPI and then split between device and host using UVM if the table doesn't fit on the device.

atillack (Contributor) commented Oct 9, 2018

@ye-luo Agreed.

atillack (Contributor) commented Oct 10, 2018

@ye-luo You are spot-on with your comment on the spline table MPI problem. I went to 1<<16 (to have a lower bound) and that runs through fine with 4 MPI ranks on SummitDev.

atillack (Contributor) commented Oct 10, 2018

@ye-luo 1<<19 of course works too... (446,772,240 / 816 > (1<<19))
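For reference, assuming the internal switch compares its threshold against coefs_size/ncol (an assumption based on the numbers quoted here): 446,772,240 / 816 = 547,515, and 2^19 = 524,288 < 547,515 < 1,048,576 = 2^20, which is consistent with the default 1<<20 taking the failing code path while 1<<19 (and 1<<16) select the other one.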

atillack (Contributor) commented Oct 10, 2018

@ye-luo To your question of how much faster this implementation is compared to your UVM code: I set the available spline memory to 1738 MB (half of the spline table) so that half ends up on the GPU and the other half on the CPU for the 128-atom NiO system.

When I do this with split splines using two MPI ranks (this implementation) on SummitDev, the DMC block (5 blocks of 5 steps) takes 149 seconds (within the typical fluctuation of the number above). Your UVM code, also using two MPI ranks, each with the same amount of spline-table memory on the GPU (and the rest on the host), takes a total of 364 seconds. These runs were on the same node, one after another.

atillack (Contributor) commented Nov 30, 2018

Almost finished.

atillack (Contributor) commented Nov 30, 2018

@prckent I am not sure how to properly set up a test case. In theory, only gpusharing="yes" needs to be set in the <determinantset> section of a test case with enough orbitals (like NiO), but CUDA MPS also needs to be present, which is not within the test framework's control (I think).

ye-luo (Contributor) commented Nov 30, 2018

Is there a way in the code to test whether MPS is enabled? If the user turns on the flag but MPS is not ready, the code can turn off this feature or abort.
You can grab an existing diamondC_2x1x1 test and enable sharing as a test.

prckent (Contributor) commented Nov 30, 2018

Keep it simple. Let's worry about MPS settings when we have proven that they are needed. For oxygen, where we run the nightlies, I don't think we need to be concerned. Later PRs can improve the situation as needed.

atillack (Contributor) commented Dec 2, 2018

@prckent @ye-luo I think Ye's idea is a good one, and it was simple enough to implement (see next commit). If CUDA MPS is off, a warning is generated and the split-splines functionality is disabled. This way we can add gpusharing="yes" to a test case and nothing breaks if there is no CUDA MPS.
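As an illustration of this kind of check, here is a hedged sketch, not the PR's actual implementation: one rank per node queries the MPS control daemon and broadcasts the verdict. The command name nvidia-cuda-mps-control and the exit-status convention come from the commits below; the exact query string and helper name are assumptions.

```cpp
#include <mpi.h>
#include <cstdlib>

// Hedged sketch: detect whether a CUDA MPS control daemon answers on this node.
// Only one rank per node runs the check (the commits note that calling
// nvidia-cuda-mps-control from every rank can skew results); the verdict is
// broadcast to the other ranks on the node.
bool mps_available(MPI_Comm node_comm)
{
  int rank;
  MPI_Comm_rank(node_comm, &rank);

  int ok = 0;
  if (rank == 0)
  {
    // A non-zero exit status from the control utility is taken to mean
    // "no usable MPS daemon", matching the convention described in the commits.
    int status = std::system("echo get_server_list | nvidia-cuda-mps-control >/dev/null 2>&1");
    ok = (status == 0) ? 1 : 0;
  }
  MPI_Bcast(&ok, 1, MPI_INT, 0, node_comm);
  return ok == 1;
}
```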

atillack added some commits Dec 2, 2018

Add test for running CUDA MPS daemon (tested on Titan, SummitDev, and Summit). This allows a warning in case CUDA MPS is not available and the user tries to use split splines.

Merge remote-tracking branch 'upstream/develop' into split_splines_gpu
Conflicts:
	src/einspline/multi_bspline_cuda_d_impl.h
	src/einspline/multi_bspline_cuda_s_impl.h
	src/einspline/multi_bspline_cuda_z_impl.h

Change MPS test to false if nvidia-cuda-mps-control is anything but 0, as this is safer and is also necessary on Summit.

Integrate MPS test and make sure nvidia-cuda-mps-control is only called once per node, as contention may skew results.
prckent (Contributor) commented Dec 6, 2018

Currently checking this on oxygen. If it works, I think we should merge. The tests and docs can be updated later. Enough tooth pulling via this PR.

@prckent prckent changed the title [WIP] Split splines across multiple GPUs on a given node Split splines across multiple GPUs on a given node Dec 6, 2018

prckent approved these changes Dec 6, 2018 and left a comment:

Will wait for comments on Friday before merging.

prckent (Contributor) commented Dec 7, 2018

Note: Unable to get this to work with MPS on oxygen. Following the config instructions at https://docs.nvidia.com/deploy/mps/index.html still generates a segv. Probably I don't have the correct/magical combination of CUDA-aware OpenMPI and environment settings. Without MPS running, the code is appropriately benign.

Can someone confirm that this is working? I currently don't have access to a fully updated multiple GPU machine.

atillack (Contributor) commented Dec 7, 2018

@prckent I can run on SummitDev and Summit. Documentation is updated. Can you send me the output which segfaults?

prckent (Contributor) commented Dec 7, 2018

The LaTeX is slightly broken. I will push a fix and improvements, then merge.

MPS investigations on oxygen will not occur for a while. Having Summit work is the key thing due to upcoming INCITE usage. (Hence #1054 is an important problem.)

ye-luo approved these changes Dec 7, 2018

@prckent prckent merged commit a22d633 into QMCPACK:develop Dec 7, 2018

2 checks passed: qmcpack rhea (build finished), qmcpack rhea (gpu) (build finished).

@wafflebot wafflebot bot removed the in progress label Dec 7, 2018

@atillack atillack deleted the atillack:split_splines_gpu branch Dec 12, 2018
