
Matrix update engines direct inversion #3470

Merged

Conversation

PDoakORNL
Contributor

Proposed changes

Connect the CUDA direct inversion with minimum change to both develop and my design for performant and flexible acceleration of DiracDeterminantBatched.

What type(s) of changes does this code introduce?

  • New feature
  • Refactoring (no functional changes, no api changes)

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

Leconte

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is simply a reminder of what we are going to look for before merging your code.

  • Yes/No. This PR is up to date with the current state of 'develop'
  • Yes/No. Code added or changed in the PR has been clang-formatted
  • Yes/No. This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • Yes/No. Documentation has been added (if appropriate)

@PDoakORNL PDoakORNL changed the title [WIP] Matrix update engines direct inversion Matrix update engines direct inversion Sep 23, 2021
@PDoakORNL
Contributor Author

The failing test is failing in CI.

@ye-luo
Contributor

ye-luo commented Sep 23, 2021

Test this please

@ye-luo ye-luo self-requested a review September 23, 2021 14:54
remove extra resize_fill_constant

small cleanup

formatting files

separating timer for DDB inverse and batched inverse
@PDoakORNL PDoakORNL force-pushed the matrix_update_engines_direct_inversion branch from 13821fa to 78253d2 Compare September 24, 2021 12:51
ye-luo
ye-luo previously requested changes Sep 24, 2021
Contributor

@ye-luo ye-luo left a comment


Investigating a test failure

@ye-luo
Contributor

ye-luo commented Sep 27, 2021

MP passes test_wavefunction_trial but FP doesn't.
I think the failure was caused by DiracMatrixComputeCUDA.hpp

mw_invertTranspose(
      CUDALinearAlgebraHandles& cuda_handles,
      RefVector<DualMatrix<TMAT>>& a_mats,
      RefVector<DualMatrix<TMAT>>& inv_a_mats,
      DualVector<LogValue>& log_values,
      const std::vector<bool>& compute_mask)
  1. a_mats are not padded; inv_a_mats are padded. So lda was incorrect in a few places. The expectation is to use the transpose to add padding before the inversion (see the sketch after this comment).
  2. This routine is expected to synchronize the device and capture all errors.
  3. log_values is only needed on the host, so the dual-space piece can be kept internal to DiracMatrixComputeCUDA. Once this function is done, the device memory needs to be up to date. The transfer batch will be handled by DDB; DDB should do D2H in evaluateLog but not in the recompute used after branching.
  4. Do we really need compute_mask?

Related to point 2: cuBLAS_LU::computeInverseAndDetLog_batched calls cuBLAS::getri_batched directly, but I found computeGetri_batched; I think we need to complete it by making it blocking and checking the infos.
I'm also thinking of fusing computeInverseAndDetLog_batched into DiracMatrixComputeCUDA.
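
A minimal sketch of what point 1 implies, using a hypothetical repacking helper rather than the real DiracMatrixComputeCUDA code: the unpadded source matrix has a leading dimension of n, the padded destination has a larger leading dimension, and the transpose that adds the padding must use both strides.

    // Hypothetical helper, not the QMCPACK implementation: transpose an unpadded
    // n x n matrix (lda = n) into a padded buffer whose rows are padded_n long
    // (lda = padded_n). Using the wrong lda on either side reads or writes with
    // the wrong stride, which matches the symptom described in point 1.
    #include <cstddef>

    template<typename T>
    void transpose_into_padded(const T* a, T* a_padded, std::size_t n, std::size_t padded_n)
    {
      for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
          a_padded[i * padded_n + j] = a[j * n + i]; // transpose while repacking
    }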

@prckent
Contributor

prckent commented Sep 27, 2021

not a change request: a_mats are not padded while inv_a_mats are. What is our logic for this (memory usage?) or is it a historical thing?

@ye-luo
Contributor

ye-luo commented Sep 27, 2021

not a change request: a_mats are not padded while inv_a_mats are. What is our logic for this (memory usage?) or is it a historical thing?

It is more of a historical thing. When orbitals are produced by SPOSet they are not padded, so a_mats are not padded.
inv_a_mats is padded because we frequently do vector operations on its rows (see the sketch below).
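
To illustrate the padding choice, here is a sketch with hypothetical names (not the QMCPACK routine): rounding each row of the inverse up to an alignment boundary keeps every row start aligned, which is what makes the frequent per-row vector operations efficient.

    // Hypothetical sketch: pad the row length up to a multiple of the alignment.
    #include <cstddef>

    constexpr std::size_t padded_row_length(std::size_t n,
                                            std::size_t alignment_bytes = 64,
                                            std::size_t element_bytes   = sizeof(double))
    {
      const std::size_t per_block = alignment_bytes / element_bytes;
      return ((n + per_block - 1) / per_block) * per_block;
    }
    // Example: n = 29 orbitals with 64-byte alignment and double elements gives
    // rows of length 32, so inv_a_mat is stored as a 29 x 32 (padded) matrix
    // while a_mat coming from the SPOSet stays 29 x 29.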

@PDoakORNL
Contributor Author

Any batch you think is useless is for unit testing. They aren't useless unless you trust NVIDIA and AMD completely.

@PDoakORNL
Contributor Author

I will just drop the compute_mask from the API. My original direction on this was that all determinant data should be in per-crowd data structures and not stored in the determinant, engine, or matrix objects. It didn't make much difference for async CUDA transfers, so it's premature optimization. I just didn't want to change it in order to get this code in and be done diverging. Since this has already been delayed for at least a couple of months, I might as well cut it.

@PDoakORNL
Contributor Author

PDoakORNL commented Sep 27, 2021

So what is wrong with the unit test for test_DiracMatrixComputeCUDA? I don't think there is anything wrong with mw_invertTranspose at full precision. It looks to me like the full-precision mw_invertTranspose works perfectly, as does the batched calculation of the log dets.

I will go over those unit tests again. I don't think it's likely that they are broken. So either something is up with the input, or the higher-level code expects additional state in the individual walkers to be updated in some way.
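
For reference, the kind of check such a unit test performs can be written against a small analytic case. This is a Catch2-style sketch only; the matrices and tolerance are illustrative, and the real test exercises mw_invertTranspose rather than this hand-rolled multiply.

    #include <catch2/catch.hpp>
    #include <vector>

    TEST_CASE("full precision inverse recovers the identity", "[inversion]")
    {
      const int n = 2;
      // Row-major 2x2 matrix and its analytic inverse (det = 10).
      std::vector<double> a    = {4.0, 7.0, 2.0, 6.0};
      std::vector<double> ainv = {0.6, -0.7, -0.2, 0.4};
      // A * A^-1 should be the identity to full double precision.
      for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
        {
          double s = 0.0;
          for (int k = 0; k < n; ++k)
            s += a[i * n + k] * ainv[k * n + j];
          CHECK(s == Approx(i == j ? 1.0 : 0.0).margin(1e-12));
        }
    }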

@ye-luo
Contributor

ye-luo commented Sep 27, 2021

Any batch you think is useless is for unit testing. They aren't useless unless you trust NVIDIA and AMD completely.

I don't get what you mean here. We should have batching in unit tests. I don't even trust NVIDIA.

@PDoakORNL
Contributor Author

I see what you mean. I was mixing two issues here. And really, those unit tests should be moved to the Platform unit tests. computeGetrf_batched is only in the API for the unit test; otherwise it could just be in the .cu. The computeGetri_batched call only exists in either file for the unit test, and nothing is done in the function, so the unit test should just call the cuBLAS wrapper itself. I'll make an issue for moving the cuBLAS wrapper unit tests.
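
For the blocking-and-check-infos point raised above, a direct use of the cuBLAS batched routines could look roughly like the following. This is a sketch only: everything except the cuBLAS and CUDA runtime calls is a hypothetical name, not the computeGetrf_batched/computeGetri_batched wrappers, and it assumes cublasSetStream(handle, stream) was already called.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // d_A, d_Ainv: device arrays of device pointers to n x n column-major matrices.
    // d_pivots: device buffer of n * batch ints; d_infos: device buffer of batch ints.
    void invert_batched_blocking(cublasHandle_t handle, cudaStream_t stream, int n,
                                 double* const* d_A, double* const* d_Ainv,
                                 int* d_pivots, int* d_infos, int batch)
    {
      // LU factorization, then inversion from the LU factors.
      if (cublasDgetrfBatched(handle, n, d_A, n, d_pivots, d_infos, batch) != CUBLAS_STATUS_SUCCESS)
        throw std::runtime_error("cublasDgetrfBatched launch failed");
      if (cublasDgetriBatched(handle, n, d_A, n, d_pivots, d_Ainv, n, d_infos, batch) != CUBLAS_STATUS_SUCCESS)
        throw std::runtime_error("cublasDgetriBatched launch failed");
      // Make the routine blocking and surface per-matrix errors: synchronize,
      // then copy the info codes to the host and check each one.
      cudaStreamSynchronize(stream);
      std::vector<int> infos(batch);
      cudaMemcpy(infos.data(), d_infos, batch * sizeof(int), cudaMemcpyDeviceToHost);
      for (int ib = 0; ib < batch; ++ib)
        if (infos[ib] != 0)
          throw std::runtime_error("batched inversion failed for matrix " + std::to_string(ib));
    }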

@ye-luo ye-luo force-pushed the matrix_update_engines_direct_inversion branch from b71988f to 555d3d8 Compare September 28, 2021 04:31
@PDoakORNL PDoakORNL requested a review from ye-luo October 1, 2021 14:00
@ye-luo
Contributor

ye-luo commented Oct 1, 2021

FYI, I ran one NiO a64 benchmark in mixed precision and saw inversion taking 7% of the total time. Before this PR, the inversion time was 3x that. Encouraging. I will do more tests once the non-deterministic fix upstream is merged.

@PDoakORNL PDoakORNL force-pushed the matrix_update_engines_direct_inversion branch from 0a34412 to 22cc717 Compare October 6, 2021 17:14
@PDoakORNL
Contributor Author

OK, by propagating the changes from #3514 I believe everything slated to possibly be const in the API is now const. So this is ready to be reviewed again.

@ye-luo
Contributor

ye-luo commented Oct 6, 2021

Test this please

@PDoakORNL
Contributor Author

What is the status on this? What is the status of the nondeterminism in develop?

@ye-luo
Contributor

ye-luo commented Oct 19, 2021

I need to do a performance check. The additional transfer after matrix inversion concerns me, but as long as it doesn't slow down the full run too much (<20%), we will merge and improve later. I will let you know this week.

@ye-luo
Contributor

ye-luo commented Oct 20, 2021

Start testing in-house

Contributor

@ye-luo ye-luo left a comment


Pros: the acceleration is real.
Cons: it needs additional device memory.
I have prepared a PR for allowing selection of the inverter from the input, so users have a larger degree of freedom.
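
A rough illustration (hypothetical names only, not the actual input-selection PR) of what choosing an inverter from the input could look like on the C++ side:

    #include <memory>
    #include <stdexcept>
    #include <string>

    struct InverterBase { virtual ~InverterBase() = default; /* invert interface */ };
    struct HostInverter : InverterBase {};   // stands in for the host-side path
    struct CUDAInverter : InverterBase {};   // stands in for the cuBLAS direct inversion

    // Map an input string (e.g. from the driver section of the input) to an implementation.
    std::unique_ptr<InverterBase> makeInverter(const std::string& kind)
    {
      if (kind == "host")
        return std::make_unique<HostInverter>();
      if (kind == "cuda")
        return std::make_unique<CUDAInverter>();
      throw std::runtime_error("unknown inverter kind: " + kind);
    }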

@ye-luo ye-luo enabled auto-merge October 20, 2021 15:07
@ye-luo ye-luo merged commit 14cdc13 into QMCPACK:develop Oct 20, 2021
@PDoakORNL PDoakORNL deleted the matrix_update_engines_direct_inversion branch February 23, 2022 00:14