Split splines across multiple GPUs on a given node #1101

Merged
40 commits merged into QMCPACK:develop from atillack:split_splines_gpu on Dec 7, 2018

Conversation

atillack (Contributor) commented Oct 9, 2018

Here is my work-in-progress version of the split splines code for GPU with promising performance numbers:

| Run | GPUs | Walkers/GPU | Original code (s) | New code w/ split splines (s) | Split-spline cost (new/original) |
|---|---|---|---|---|---|
| NiO S32 on SummitDev | 2 | 128 | 129.0 | 148.0 | 1.15 |
| -- | 4 | 128 | 129.0 | 236.5 | 1.83 |
| NiO S32 on Summit | 3 | 128 | 91.3 | 148.1 | 1.62 |
| -- | 3 | 250 | 145.1 | 223.2 | 1.54 |
| -- | 6 | 128 | 92.5 | 223.2 | 2.41 |
| -- | 6 | 250 | 147.3 | 403.7 | 2.74 |
| NiO S64 on Summit | 3 | 45 | 230.5 | 334.7 | 1.45 |
| -- | 6 | 45 | 230.5 | 463.8 | 2.01 |

Currently, only the spline type used for our NiO example is fully implemented. I will of course implement the missing bits once we have finalized the code here.

In order to use the code and split the spline-data memory across multiple GPUs, the following is needed:

  • CUDA MPS must be running,
  • all GPUs on the node must be visible to each MPI rank,
  • the options gpu="yes" and gpusharing="yes" must be set in the <determinantset type="einspline"> section of the <wavefunction> definition block (a minimal example follows below).
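For illustration, a minimal input fragment with both options enabled. The file names and other attributes here are placeholders, not taken from this PR; only gpu and gpusharing are the options discussed above.

```xml
<wavefunction name="psi0" target="e">
  <!-- hypothetical orbital file and ion source names; only gpu/gpusharing matter here -->
  <determinantset type="einspline" href="orbitals.h5" source="ion0"
                  gpu="yes" gpusharing="yes">
    <!-- determinant definitions as usual -->
  </determinantset>
</wavefunction>
```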

I tested the original code, based on an older development tree, on both SummitDev (2 and 4 GPUs) and Summit (3 and 6 GPUs); this is the data above. After updating to the current development tree I could only run on two GPUs on SummitDev to confirm that performance levels are still the same. With four GPUs I get an MPI error (coll:ibm:module: datatype_prepare_recvv overflowed integer range, triggered by comm->gatherv_in_place from gatherv in the new spline adaptor code). I get the exact same error with vanilla 3.5.0 on SummitDev. Since I can't run on Summit right now, I can't verify whether this is a regression in 3.5.0 (3.4.0 did not have these issues on Summit) or just the particular MPI version on SummitDev.

To finish:

  • Output error and exit when split splines and UVM are used together for now
  • PhaseFactors.cu is already finished but needs testing of the QMC_COMPLEX code path
  • Implement double complex evaluation functions
  • Implement float and double wave function evaluation functions
  • Fill in the missing split splines memory allocations/break-ups in einspline/multi_bspline_create_cuda.cu
  • Implement the MPI host distribution to follow the split splines
  • Add documentation in manual and test case

@wafflebot wafflebot bot added the in progress label Oct 9, 2018

qmc-robot (Collaborator) commented Oct 9, 2018

Can one of the maintainers verify this patch?


@prckent prckent changed the title Split splines across multiple GPUs on a given node [WIP] Split splines across multiple GPUs on a given node Oct 9, 2018

prckent (Contributor) commented Oct 9, 2018

OK to test

Added complex double eval function stub for split splines code and variable name typo for QMC_COMPLEX case.
atillack (Contributor) commented Oct 9, 2018

Rhea (gpu) should compile and run fine now.

ye-luo (Contributor) commented Oct 9, 2018

In v3.5, I changed how the spline tables are collected, using an in-place gather so that no extra scratch memory is required. MPI is notorious for being limited to the 32-bit integer range, and you might have an inferior MPI implementation. We need to find a workaround; hopefully this is not a bug.
Could you go to the gatherv function in src/spline/einspline_util.hpp and print out buffer->coefs_size and ncol on the crashed run?
There is an internal control that switches between two algorithms. Try reducing (1<<20) and see whether that works around the issue.
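To make the fallback strategy concrete, here is a hedged sketch, not the actual einspline_util.hpp routine: the layout, the function name, and the non-in-place form are my assumptions. The idea is that gathering the table one grid-point row at a time keeps every per-call count and displacement far below the 32-bit limit that a single table-sized gather can overflow.

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// Hedged sketch: gather a coefficient table of nrow grid points x ncol splines,
// where each rank owns a contiguous slice of the ncol splines for every grid
// point.  Doing one MPI_Gatherv per grid point keeps every count/displacement
// at most ncol, far below the 32-bit int range of MPI's count arguments.
void gather_spline_rows(const double* local,   // nrow x ncol_local[rank], row-major
                        double* global,        // nrow x ncol, row-major; used on root only
                        std::size_t nrow, std::size_t ncol,
                        const std::vector<int>& ncol_local,  // slice width per rank
                        const std::vector<int>& col_offset,  // slice start per rank
                        int root, MPI_Comm comm)
{
  int rank;
  MPI_Comm_rank(comm, &rank);
  const int my_ncol = ncol_local[rank];

  for (std::size_t row = 0; row < nrow; ++row)
    MPI_Gatherv(local + row * my_ncol, my_ncol, MPI_DOUBLE,
                global + row * ncol, ncol_local.data(), col_offset.data(),
                MPI_DOUBLE, root, comm);
}
```

The trade-off is many small collectives instead of one large one, which is why a size threshold such as (1<<20) decides which of the two code paths is taken.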

ye-luo (Contributor) commented Oct 9, 2018

I'm trying to understand the algorithm, but directly reading the code may not be the best way, so let me throw out some questions:

  1. Since you are dispatching spline evaluation to remote devices, the result vector is allocated as unified memory and relies on CUDA to migrate the data back to the device for the determinant calculation. Am I right?
  2. When you list 6 GPUs, how many MPI ranks do you have on the node? 6 or 1?
  3. Assuming the previous answer is 6: the spline table on the host is not distributed among the MPI ranks within the group; only the table on the GPU is distributed when the data is copied to the GPU. Am I right?
  4. How do you share the GPU memory address across different MPI ranks? Or does CUDA handle it for you?
  5. How much faster is this compared to the UVM I implemented (one MPI rank with one GPU, where the part of the spline that doesn't fit stays on the host)?
atillack (Contributor) commented Oct 9, 2018

@ye-luo:

To your earlier question:

On SummitDev with 4 MPI ranks: buffer->coefs_size = 446,772,240; ncol = 816. (Working on reducing 1<<20 right now.)

Algorithm questions:

  1. Yes.
  2. 6 MPI ranks. Each rank sees all six GPUs.
  3. Yes, you are correct.
  4. CUDA IPC memory handles (cudaIpcGetMemHandle and cudaIpcOpenMemHandle); a minimal sketch of this mechanism follows below.
  5. Good question. Let me see if I can do a fair comparison by triggering your UVM code to put half the spline table on the GPU and the other half on the host.
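For readers unfamiliar with CUDA IPC, here is a minimal sketch of the mechanism named in point 4. The helper name and the MPI plumbing are illustrative assumptions, not the code in this PR.

```cpp
#include <cuda_runtime.h>
#include <mpi.h>
#include <cstddef>

// Hedged sketch: the rank 'owner' allocates one slab of the split spline table
// on its GPU and exports a CUDA IPC handle; every other rank on the node opens
// the handle and obtains a device pointer into the owner's allocation.
// Requires the processes to share the device (e.g. via CUDA MPS).
// Hypothetical helper, not QMCPACK code.
void* share_device_slab(std::size_t bytes, int owner, MPI_Comm node_comm)
{
  int rank;
  MPI_Comm_rank(node_comm, &rank);

  void* dptr = nullptr;
  cudaIpcMemHandle_t handle;

  if (rank == owner)
  {
    cudaMalloc(&dptr, bytes);            // this rank's slice of the spline table
    cudaIpcGetMemHandle(&handle, dptr);  // export the allocation to other processes
  }

  // The handle is an opaque, fixed-size struct; broadcast it over the node communicator.
  MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, owner, node_comm);

  if (rank != owner)
    cudaIpcOpenMemHandle(&dptr, handle, cudaIpcMemLazyEnablePeerAccess);

  return dptr;  // device pointer usable from this rank's CUDA context
}
```

The opaque handle can be shared between processes by any means; broadcasting it over a node-local communicator is one common choice.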
atillack (Contributor) commented Oct 9, 2018

@ye-luo Also, I left your UVM code in place and it should still work even with the split splines...

ye-luo (Contributor) commented Oct 9, 2018

@atillack Great work.
Some effort is also needed on the spline builder to distribute the table on the host, and it should connect to the GPU part seamlessly. I think eventually we will first split the table over MPI and then split between device and host using UVM if the table doesn't fit on the device.

atillack (Contributor) commented Oct 9, 2018

@ye-luo Agreed.

atillack (Contributor) commented Oct 10, 2018

@ye-luo You are spot-on with your comment on the spline table MPI problem. I went to 1<<16 (to have a lower bound) and that runs through fine with 4 MPI ranks on SummitDev.

atillack (Contributor) commented Oct 10, 2018

@ye-luo 1<<19 of course works too... (446,772,240 / 816 > (1<<19))
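For reference, assuming the internal switch compares its threshold against coefs_size/ncol (an assumption based on the numbers quoted here): 446,772,240 / 816 = 547,515, and 2^19 = 524,288 < 547,515 < 1,048,576 = 2^20, which is consistent with the default 1<<20 taking the failing code path while 1<<19 (and 1<<16) select the other one.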

atillack (Contributor) commented Oct 10, 2018

@ye-luo To your question of how much faster this implementation is compared to your UVM code: I set the available spline memory to 1738 MB (half of the spline table) so that half ends up on the GPU and the other half on the CPU for the 128-atom NiO system.

When I do this with split splines using two MPI ranks (this implementation) on SummitDev, the DMC block (5 blocks of 5 steps) takes 149 seconds (within the typical fluctuation of the number above). Your UVM code, also using two MPI ranks, each with the same amount of spline-table memory on the GPU (and the rest on the host), takes a total of 364 seconds. These runs were on the same node, one after another.

atillack (Contributor) commented Nov 30, 2018

Almost finished.

atillack (Contributor) commented Nov 30, 2018

@prckent I am not sure how to properly set up a test case. In theory, only gpusharing="yes" needs to be set in the <determinantset> section of a test case with enough orbitals (like NiO), but CUDA MPS also needs to be present, which is not within the test framework's control (I think).

ye-luo (Contributor) commented Nov 30, 2018

Is there a way in the code to test whether MPS is enabled? If the user turns on the flag but MPS is not ready, the code can turn off this feature or abort.
You can grab an existing diamondC_2x1x1 test and enable sharing as a test.

prckent (Contributor) commented Nov 30, 2018

Keep it simple. Let's worry about MPS settings when we have proven that they are needed. For oxygen, where we run the nightlies, I don't think we need to be concerned. Later PRs can improve the situation as needed.

atillack (Contributor) commented Dec 2, 2018

@prckent @ye-luo I think Ye's idea is a good one, and it was simple enough to implement (see next commit). If CUDA MPS is off, a warning is generated and the split-splines functionality is disabled. This way we can add gpusharing="yes" to a test case and nothing breaks if there is no CUDA MPS.
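As an illustration of this kind of check, here is a hedged sketch, not the PR's actual implementation: one rank per node queries the MPS control daemon and broadcasts the verdict. The command name nvidia-cuda-mps-control and the exit-status convention come from the commits below; the exact query string and helper name are assumptions.

```cpp
#include <mpi.h>
#include <cstdlib>

// Hedged sketch: detect whether a CUDA MPS control daemon answers on this node.
// Only one rank per node runs the check (the commits note that calling
// nvidia-cuda-mps-control from every rank can skew results); the verdict is
// broadcast to the other ranks on the node.
bool mps_available(MPI_Comm node_comm)
{
  int rank;
  MPI_Comm_rank(node_comm, &rank);

  int ok = 0;
  if (rank == 0)
  {
    // A non-zero exit status from the control utility is taken to mean
    // "no usable MPS daemon", matching the convention described in the commits.
    int status = std::system("echo get_server_list | nvidia-cuda-mps-control >/dev/null 2>&1");
    ok = (status == 0) ? 1 : 0;
  }
  MPI_Bcast(&ok, 1, MPI_INT, 0, node_comm);
  return ok == 1;
}
```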

atillack added some commits Dec 2, 2018

Add test for running CUDA MPS daemon (tested on Titan, SummitDev, and Summit). This allows a warning in case CUDA MPS is not available and the user tries to use split splines.

Merge remote-tracking branch 'upstream/develop' into split_splines_gpu
Conflicts:
	src/einspline/multi_bspline_cuda_d_impl.h
	src/einspline/multi_bspline_cuda_s_impl.h
	src/einspline/multi_bspline_cuda_z_impl.h

Change MPS test to false if nvidia-cuda-mps-control is anything but 0, as this is safer and is also necessary on Summit.

Integrate MPS test and make sure nvidia-cuda-mps-control is only called once per node, as contention may skew results.
prckent (Contributor) commented Dec 6, 2018

Currently checking this on oxygen. If it works, I think we should merge. The tests and docs can be updated later. Enough tooth pulling via this PR.

@prckent prckent changed the title [WIP] Split splines across multiple GPUs on a given node Split splines across multiple GPUs on a given node Dec 6, 2018

prckent approved these changes Dec 6, 2018 and left a comment:

Will wait for comments on Friday before merging.

prckent (Contributor) commented Dec 7, 2018

Note: Unable to get this to work with MPS on oxygen. Following the config instructions at https://docs.nvidia.com/deploy/mps/index.html still generates a segv. Probably I don't have the correct/magical combination of CUDA-aware OpenMPI and environment settings. Without MPS running, the code is appropriately benign.

Can someone confirm that this is working? I currently don't have access to a fully updated multiple GPU machine.

atillack (Contributor) commented Dec 7, 2018

@prckent I can run on SummitDev and Summit. Documentation is updated. Can you send me the output which segfaults?

prckent (Contributor) commented Dec 7, 2018

The LaTeX is slightly broken. I will push a fix and improvements, then merge.

MPS investigations on oxygen will not occur for a while. Having Summit work is the key thing due to upcoming INCITE usage. (Hence #1054 is an important problem.)

ye-luo approved these changes Dec 7, 2018

@prckent prckent merged commit a22d633 into QMCPACK:develop Dec 7, 2018

2 checks passed: qmcpack rhea (build finished), qmcpack rhea (gpu) (build finished).

@wafflebot wafflebot bot removed the in progress label Dec 7, 2018

@atillack atillack deleted the atillack:split_splines_gpu branch Dec 12, 2018
