
Matrix update engines direct inversion #3470

Merged

Conversation

PDoakORNL
Contributor

Proposed changes

Connect the CUDA direct inversion with minimum change to both develop and my design for performant and flexible acceleration of DiracDeterminantBatched.

What type(s) of changes does this code introduce?

  • New feature
  • Refactoring (no functional changes, no api changes)

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

Leconte

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is simply a reminder of what we are going to look for before merging your code.

  • Yes/No. This PR is up to date with the current state of 'develop'
  • Yes/No. Code added or changed in the PR has been clang-formatted
  • Yes/No. This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • Yes/No. Documentation has been added (if appropriate)

@PDoakORNL PDoakORNL changed the title [WIP] Matrix update engines direct inversion Matrix update engines direct inversion Sep 23, 2021
@PDoakORNL
Contributor Author

The failing test is failing in CI.

@ye-luo
Contributor

ye-luo commented Sep 23, 2021

Test this please

@ye-luo ye-luo self-requested a review September 23, 2021 14:54
remove extra resize_fill_constant

small cleanup

formatting files

separating timer for DDB inverse and batched inverse
@PDoakORNL PDoakORNL force-pushed the matrix_update_engines_direct_inversion branch from 13821fa to 78253d2 Compare September 24, 2021 12:51
ye-luo
ye-luo previously requested changes Sep 24, 2021
Contributor

@ye-luo ye-luo left a comment


Investigating a test failure

@ye-luo
Contributor

ye-luo commented Sep 27, 2021

MP passes test_wavefunction_trial but FP doesn't.
I think the failure was caused by DiracMatrixComputeCUDA.hpp

mw_invertTranspose(
      CUDALinearAlgebraHandles& cuda_handles,
      RefVector<DualMatrix<TMAT>>& a_mats,
      RefVector<DualMatrix<TMAT>>& inv_a_mats,
      DualVector<LogValue>& log_values,
      const std::vector<bool>& compute_mask)
  1. a_mats are not padded; inv_a_mats are padded. So lda was incorrect in a few places. The expectation is to use the transpose to add padding before the inversion (see the sketch after this comment).
  2. This routine is expected to synchronize the device and capture all errors.
  3. log_values is only needed on the host, so the dual-space piece can be kept internal to DiracMatrixComputeCUDA. Once this function is done, the device memory needs to be up to date. The transfer batch will be handled by DDB; DDB should do D2H in evaluateLog but not in the recompute used after branching.
  4. Do we really need compute_mask?

Related to point 2: cuBLAS_LU::computeInverseAndDetLog_batched calls cuBLAS::getri_batched directly, but I found computeGetri_batched; I think we need to complete it by making it blocking and checking the infos.
I'm also thinking of fusing computeInverseAndDetLog_batched into DiracMatrixComputeCUDA.
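
A minimal sketch of what point 1 implies, using a hypothetical repacking helper rather than the real DiracMatrixComputeCUDA code: the unpadded source matrix has a leading dimension of n, the padded destination has a larger leading dimension, and the transpose that adds the padding must use both strides.

    // Hypothetical helper, not the QMCPACK implementation: transpose an unpadded
    // n x n matrix (lda = n) into a padded buffer whose rows are padded_n long
    // (lda = padded_n). Using the wrong lda on either side reads or writes with
    // the wrong stride, which matches the symptom described in point 1.
    #include <cstddef>

    template<typename T>
    void transpose_into_padded(const T* a, T* a_padded, std::size_t n, std::size_t padded_n)
    {
      for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
          a_padded[i * padded_n + j] = a[j * n + i]; // transpose while repacking
    }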

@prckent
Contributor

prckent commented Sep 27, 2021

not a change request: a_mats are not padded while inv_a_mats are. What is our logic for this (memory usage?) or is it a historical thing?

@ye-luo
Contributor

ye-luo commented Sep 27, 2021

not a change request: a_mats are not padded while inv_a_mats are. What is our logic for this (memory usage?) or is it a historical thing?

It is more of a historical thing. When orbitals are produced by SPOSet they are not padded, so a_mats are not padded.
inv_a_mats is padded because we frequently do vector operations on its rows (see the sketch below).
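
To illustrate the padding choice, here is a sketch with hypothetical names (not the QMCPACK routine): rounding each row of the inverse up to an alignment boundary keeps every row start aligned, which is what makes the frequent per-row vector operations efficient.

    // Hypothetical sketch: pad the row length up to a multiple of the alignment.
    #include <cstddef>

    constexpr std::size_t padded_row_length(std::size_t n,
                                            std::size_t alignment_bytes = 64,
                                            std::size_t element_bytes   = sizeof(double))
    {
      const std::size_t per_block = alignment_bytes / element_bytes;
      return ((n + per_block - 1) / per_block) * per_block;
    }
    // Example: n = 29 orbitals with 64-byte alignment and double elements gives
    // rows of length 32, so inv_a_mat is stored as a 29 x 32 (padded) matrix
    // while a_mat coming from the SPOSet stays 29 x 29.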

@PDoakORNL
Contributor Author

Any batch you think is useless is for unit testing. They aren't useless unless you trust NVIDIA and AMD completely.

@PDoakORNL
Contributor Author

I will just drop the compute_mask from the API. My original direction on this was that all determinant data should be in per-crowd data structures and not stored in the determinant, engine, or matrix objects. It didn't make much difference for async CUDA transfers, so it's premature optimization. I just didn't want to change it in order to get this code in and be done diverging. Since this has already been delayed for at least a couple of months, I might as well cut it.

@PDoakORNL
Contributor Author

PDoakORNL commented Sep 27, 2021

So what is wrong with the unit test for test_DiracMatrixComputeCUDA? I don't think there is anything wrong with mw_invertTranspose at full precision. It looks to me like the full-precision mw_invertTranspose works perfectly, as does the batched calculation of the log dets.

I will go over those unit tests again. I don't think it's likely that they are broken. So either something is up with the input, or the higher-level code expects additional state in the individual walkers to be updated in some way.
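
For reference, the kind of check such a unit test performs can be written against a small analytic case. This is a Catch2-style sketch only; the matrices and tolerance are illustrative, and the real test exercises mw_invertTranspose rather than this hand-rolled multiply.

    #include <catch2/catch.hpp>
    #include <vector>

    TEST_CASE("full precision inverse recovers the identity", "[inversion]")
    {
      const int n = 2;
      // Row-major 2x2 matrix and its analytic inverse (det = 10).
      std::vector<double> a    = {4.0, 7.0, 2.0, 6.0};
      std::vector<double> ainv = {0.6, -0.7, -0.2, 0.4};
      // A * A^-1 should be the identity to full double precision.
      for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
        {
          double s = 0.0;
          for (int k = 0; k < n; ++k)
            s += a[i * n + k] * ainv[k * n + j];
          CHECK(s == Approx(i == j ? 1.0 : 0.0).margin(1e-12));
        }
    }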

@ye-luo
Contributor

ye-luo commented Sep 27, 2021

Any batch you think is useless is for unit testing. They aren't useless unless you trust NVIDIA and AMD completely.

I don't get what you mean here. We should have batching in unit tests. I don't even trust NVIDIA.

@PDoakORNL
Contributor Author

I see what you mean. I was mixing two issues here. And really, those unit tests should be moved to the Platform unit tests. computeGetrf_batched is only in the API for the unit test; otherwise it could just be in the .cu. The computeGetri_batched call only exists in either file for the unit test, and nothing is done in the function, so the unit test should just call the cuBLAS wrapper itself. I'll make an issue for moving the cuBLAS wrapper unit tests.
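
For the blocking-and-check-infos point raised above, a direct use of the cuBLAS batched routines could look roughly like the following. This is a sketch only: everything except the cuBLAS and CUDA runtime calls is a hypothetical name, not the computeGetrf_batched/computeGetri_batched wrappers, and it assumes cublasSetStream(handle, stream) was already called.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // d_A, d_Ainv: device arrays of device pointers to n x n column-major matrices.
    // d_pivots: device buffer of n * batch ints; d_infos: device buffer of batch ints.
    void invert_batched_blocking(cublasHandle_t handle, cudaStream_t stream, int n,
                                 double* const* d_A, double* const* d_Ainv,
                                 int* d_pivots, int* d_infos, int batch)
    {
      // LU factorization, then inversion from the LU factors.
      if (cublasDgetrfBatched(handle, n, d_A, n, d_pivots, d_infos, batch) != CUBLAS_STATUS_SUCCESS)
        throw std::runtime_error("cublasDgetrfBatched launch failed");
      if (cublasDgetriBatched(handle, n, d_A, n, d_pivots, d_Ainv, n, d_infos, batch) != CUBLAS_STATUS_SUCCESS)
        throw std::runtime_error("cublasDgetriBatched launch failed");
      // Make the routine blocking and surface per-matrix errors: synchronize,
      // then copy the info codes to the host and check each one.
      cudaStreamSynchronize(stream);
      std::vector<int> infos(batch);
      cudaMemcpy(infos.data(), d_infos, batch * sizeof(int), cudaMemcpyDeviceToHost);
      for (int ib = 0; ib < batch; ++ib)
        if (infos[ib] != 0)
          throw std::runtime_error("batched inversion failed for matrix " + std::to_string(ib));
    }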

@ye-luo ye-luo force-pushed the matrix_update_engines_direct_inversion branch from b71988f to 555d3d8 Compare September 28, 2021 04:31
@PDoakORNL PDoakORNL requested a review from ye-luo October 1, 2021 14:00
@ye-luo
Contributor

ye-luo commented Oct 1, 2021

FYI, I ran one NiO a64 benchmark in mixed precision and saw inversion taking 7% of the total time. Before this PR, the inversion time was 3x that. Encouraging. I will do more tests once the non-deterministic fix upstream is merged.

@PDoakORNL PDoakORNL force-pushed the matrix_update_engines_direct_inversion branch from 0a34412 to 22cc717 Compare October 6, 2021 17:14
@PDoakORNL
Contributor Author

OK, by propagating the changes from #3514 I believe everything slated to possibly be const in the API is now const. So this is ready to be reviewed again.

@ye-luo
Contributor

ye-luo commented Oct 6, 2021

Test this please

@PDoakORNL
Contributor Author

What is the status on this? What is the status of the nondeterminism in develop?

@ye-luo
Contributor

ye-luo commented Oct 19, 2021

I need to do a performance check. The additional transfer after matrix inversion concerns me, but as long as it doesn't slow down the full run too much (<20%), we will merge and improve later. I will let you know this week.

@ye-luo
Contributor

ye-luo commented Oct 20, 2021

Start testing in-house

Contributor

@ye-luo ye-luo left a comment


Pros: the acceleration is real.
Cons: it needs additional device memory.
I have prepared a PR for allowing selection of the inverter from the input, so users have a larger degree of freedom.
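
A rough illustration (hypothetical names only, not the actual input-selection PR) of what choosing an inverter from the input could look like on the C++ side:

    #include <memory>
    #include <stdexcept>
    #include <string>

    struct InverterBase { virtual ~InverterBase() = default; /* invert interface */ };
    struct HostInverter : InverterBase {};   // stands in for the host-side path
    struct CUDAInverter : InverterBase {};   // stands in for the cuBLAS direct inversion

    // Map an input string (e.g. from the driver section of the input) to an implementation.
    std::unique_ptr<InverterBase> makeInverter(const std::string& kind)
    {
      if (kind == "host")
        return std::make_unique<HostInverter>();
      if (kind == "cuda")
        return std::make_unique<CUDAInverter>();
      throw std::runtime_error("unknown inverter kind: " + kind);
    }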

@ye-luo ye-luo enabled auto-merge October 20, 2021 15:07
@ye-luo ye-luo merged commit 14cdc13 into QMCPACK:develop Oct 20, 2021
@PDoakORNL PDoakORNL deleted the matrix_update_engines_direct_inversion branch February 23, 2022 00:14