Integrate rocsolver for LU and inversion #3756

Closed
ye-luo opened this issue Jan 24, 2022 · 5 comments
ye-luo commented Jan 24, 2022

Is your feature request related to a problem? Please describe.
After #3755, the LU and inversion solver in DelayedUpdateCUDA.h is #ifdef'd to run on the host on AMD GPUs. Because the rocSOLVER API differs from cuSOLVER's, we will need a rocSolverInverter equivalent to cuSolverInverter.

Describe the solution you'd like
Note that the rocSOLVER buffer-handling API does not look the same as cuSOLVER's.
We also need a unit test that covers both cuSolverInverter and rocSolverInverter.
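
To illustrate the buffer-handling difference, here is a minimal sketch of the two getrf entry points (illustrative only; handle setup, allocations of `A_d`/`ipiv_d`/`info_d`, and error checks are omitted, and those variable names are placeholders). cuSOLVER requires an explicit workspace query plus a user-provided workspace buffer, while rocSOLVER's getrf takes no workspace argument and manages temporary device memory through the rocBLAS handle.

```cpp
// cuSOLVER (cusolverDn.h): query the workspace size, allocate it, then factorize.
int lwork = 0;
cusolverDnDgetrf_bufferSize(cusolver_h, n, n, A_d, lda, &lwork);
double* work_d = nullptr;
cudaMalloc(&work_d, sizeof(double) * lwork);
cusolverDnDgetrf(cusolver_h, n, n, A_d, lda, work_d, ipiv_d, info_d);

// rocSOLVER (rocsolver/rocsolver.h): no workspace argument; temporary device
// memory is managed internally through the rocblas_handle.
rocsolver_dgetrf(rocblas_h, n, n, A_d, lda, ipiv_d, info_d);
```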

@markdewing (Contributor) commented:

Trying to get straight all the different inversion methods, options, and code paths.

They get set up in SlaterDetBuilder::putDeterminant

QMC_CUDA - legacy CUDA acceleration in DiracDeterminantCUDA. The rest of this applies only if QMC_CUDA is not defined.

input options:

  • 'matrix_inverter' is 'gpu' or 'host'. Defaults to 'gpu'. (converted to matrix_inverter_kind which is DetMatInvertor::HOST or DetMatInvertor::ACCEL)
  • use_batch (option name is 'batch'). Defaults to 'yes' if ENABLE_OFFLOAD is defined, 'no' otherwise.
  • useGPU (option name is 'gpu'). Only available if ENABLE_CUDA or ENABLE_OFFLOAD is defined. Defaults to 'yes'.
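
Roughly, these are read with the usual OhmmsAttributeSet pattern. The snippet below is a paraphrase of the idea, not the exact SlaterDetBuilder code; the variable names and the node being parsed (`cur`) are assumptions.

```cpp
// Sketch: read the three attributes, then convert matrix_inverter to the enum.
std::string use_batch("no");            // real default depends on ENABLE_OFFLOAD
std::string useGPU("yes");
std::string matrix_inverter("gpu");
OhmmsAttributeSet sdAttrib;
sdAttrib.add(use_batch, "batch");
sdAttrib.add(useGPU, "gpu");
sdAttrib.add(matrix_inverter, "matrix_inverter");
sdAttrib.put(cur);                      // cur: the xmlNodePtr for the determinant node
const DetMatInvertor matrix_inverter_kind =
    (matrix_inverter == "host") ? DetMatInvertor::HOST : DetMatInvertor::ACCEL;
```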

The selections:

  • use_batch is 'yes'
    • useGPU is 'yes' and ENABLE_CUDA and ENABLE_OFFLOAD
      • DiracDeterminantBatched<MatrixDelayedUpdateCUDA>
      • "Running on NVIDIA GPU via CUDA acceleration and OpenMP offload"
    • any of the previous options not set
      • DiracDeterminantBatched<MatrixUpdateOMPTarget>
      • "Running on an accelerator via OpenMP offload. Only SM1 update supported. delay_rank is ignored"
  • use_batch is 'no'
    • useGPU is 'yes' and ENABLE_CUDA
      • DiracDeterminant<DelayedUpdateCUDA>
      • "Running on an NVIDIA GPU via CUDA acceleration"
    • useGPU is 'no' (or ENABLE_CUDA is not defined)
      • DiracDeterminant<DelayedUpdate>
      • "Running on CPU"

Looking at the classes to see how the inversion method fits in

  • DiracMatrix uses the CPU LAPACK calls (Xgetrf/Xgetri) from computeInvertAndLog, which is called from invert_transpose. This is the go-to (or default) class for doing the inversion on the CPU.

  • DiracDeterminant can select the inverter based on the matrix_inverter_kind (in the invertPsiM function; sketched after this list)

    • HOST - DiracMatrix is always the host_inverter_
    • ACCEL - calls invert_transpose on the update engine (which is of the templated type)
  • DelayedUpdate uses DiracMatrix as the matrix inversion engine (So DiracDeterminant<DelayedUpdate> will always run on the CPU)
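
Paraphrased, the selection in DiracDeterminant::invertPsiM has this shape (member names, types, and argument lists below are schematic, not the actual signatures):

```cpp
// HOST always uses DiracMatrix on the CPU; ACCEL defers to the update engine,
// which for DelayedUpdateCUDA means the cuSOLVER (or, eventually, rocSOLVER) path.
void invertPsiM(const ValueMatrix& logdetT, ValueMatrix& invMat)
{
  if (matrix_inverter_kind_ == DetMatInvertor::HOST)
    host_inverter_.invert_transpose(logdetT, invMat, log_value_);
  else
    updateEng_.invert_transpose(logdetT, invMat, log_value_);
}
```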

Now the other non-batched choice:

  • DelayedUpdateCUDA - the function called is chosen (at compile time) in invert_transpose
    • QMC_CUDA2HIP defined - uses host_inverter (of type DiracMatrix)
      • This compile-time branch will get replaced with rocSolverInverter, and be guarded by ENABLE_ROCM instead (The goal for this issue)
    • otherwise - cusolver_inverter (of type cuSolverInverter)
      • cuSolverInverter calls cusolverDnDgetrf / cusolverDnDgetrs
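
The change this issue asks for would roughly turn that compile-time branch into the following (a sketch only: rocSolverInverter does not exist yet, and the member names, argument lists, and guard spelling are assumptions):

```cpp
// Current shape of DelayedUpdateCUDA::invert_transpose (paraphrased):
#if defined(QMC_CUDA2HIP)
  host_inverter.invert_transpose(logdetT, Ainv, log_value);                 // CPU fallback on AMD
#else
  cusolver_inverter.invert_transpose(logdetT, Ainv, Ainv_gpu, log_value);   // cuSOLVER path
#endif

// Proposed shape for this issue (hypothetical):
#if defined(ENABLE_ROCM)
  rocsolver_inverter.invert_transpose(logdetT, Ainv, Ainv_gpu, log_value);  // new rocSolverInverter
#else
  cusolver_inverter.invert_transpose(logdetT, Ainv, Ainv_gpu, log_value);
#endif
```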

Looking at the batched options:

  • DiracDeterminantBatched selects the inverter based on matrix_inverter_kind in mw_invertPsiM
    • HOST - DiracMatrix is always the host_inverter_ using a loop over walkers (that is, no batching)
    • ACCEL - calls mw_invertTranspose on accel_inverter (type is the template type ::DetInverter)
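
Schematically (not the exact code; list types, member names, and the resource argument are placeholders):

```cpp
// HOST: per-walker loop through DiracMatrix (no batching).
// ACCEL: one batched call into the engine's DetInverter.
if (matrix_inverter_kind_ == DetMatInvertor::HOST)
{
  for (int iw = 0; iw < num_walkers; ++iw)
    host_inverter_.invert_transpose(logdetT_list[iw], a_inv_list[iw], log_values[iw]);
}
else
{
  accel_inverter_.mw_invertTranspose(resource, logdetT_list, a_inv_list, log_values);
}
```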

The DET_ENGINE options:

  • MatrixDelayedUpdateCUDA, DetInverter type is DiracMatrixComputeCUDA
    • calls mw_computeInvertAndLog, which calls cuBLAS_LU::computeInverseAndDetLog_batched (in QMCWaveFunctions/detail/CUDA/cuBLAS_LU.cu), which eventually calls cublasDgetrfBatched / cublasDgetriBatched (see the sketch after this list)
  • MatrixUpdateOMPTarget, DetInverter type is DiracMatrixComputeOMPTarget
    • That uses DiracMatrix as detEng. The mw_invertTranspose call loops over walkers and calls DiracMatrix::invert_transpose on each one (i.e., uses the CPU for matrix inversion)
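
For reference, the batched cuBLAS calls at the bottom of that path have this shape (minimal sketch; building the device arrays of per-walker matrix pointers and all error checking are omitted, and the variable names are placeholders):

```cpp
// In-place LU factorization of batch_size n x n matrices, then inversion from the LU factors.
// Aarray_d and Carray_d are device arrays of device pointers, one matrix per walker.
cublasDgetrfBatched(cublas_h, n, Aarray_d, lda, pivots_d, infos_d, batch_size);
cublasDgetriBatched(cublas_h, n, Aarray_d, lda, pivots_d, Carray_d, ldc, infos_d, batch_size);
```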

@markdewing (Contributor) commented:

Files, classes, and code involved in testing matrix inversion

  • test_dirac_matrix.cpp

    • uses gen_inverse.py to generate comparison data
    • Utilities/for_testing/checkMatrix.hpp - the checkMatrix function for comparing matrices
  • test_DiracMatrixComputeCUDA.cpp, benchmark_DiracMatrixComputeCUDA.cpp

    • uses Containers/tests/makeRngSpdMatrix.hpp to create a random symmetric positive definite matrix
  • test_cuBLAS_LU.cpp

    • uses tests/scripts/inversion_ref.py to generate reference data
  • test_DiracMatrixComputeOMPTarget.cpp

  • test_DiracDeterminant.cpp

  • test_DiracDeterminantBatched.cpp
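
For the unit test requested in the issue description, a hedged sketch of what it could look like in the style of the existing Catch2 tests (the inverter interface, checkMatrix fields, and MakeRngSpdMatrix usage below are assumptions based on the files listed above, not verified signatures):

```cpp
// Hypothetical test: compare an accelerated inverter against the host DiracMatrix reference.
TEST_CASE("SolverInverter matches host DiracMatrix", "[wavefunction]")
{
  const int n = 4;
  Matrix<double> m(n, n), minv_host(n, n), minv_accel(n, n);
  testing::MakeRngSpdMatrix<double> makeRngSpdMatrix;    // helper from Containers/tests/makeRngSpdMatrix.hpp
  makeRngSpdMatrix(m);

  std::complex<double> log_host, log_accel;

  DiracMatrix<double> host_inverter;                     // CPU reference path
  host_inverter.invert_transpose(m, minv_host, log_host);

  SolverInverterUnderTest accel_inverter;                // cuSolverInverter or rocSolverInverter (alias is hypothetical)
  accel_inverter.invert_transpose(m, minv_accel, log_accel);

  auto check = checkMatrix(minv_host, minv_accel);       // from Utilities/for_testing/checkMatrix.hpp
  CHECKED_ELSE(check.result) { FAIL(check.result_message); }
  CHECK(log_host.real() == Approx(log_accel.real()));
}
```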

ye-luo commented Feb 17, 2022

Please ask questions and request documentation/write-ups. I think we need to do better on documentation.
That might help you go through the code rather than rediscovering all the pieces by yourself.

@markdewing (Contributor) commented:

Is it expected that QMC_CUDA2HIP needs to be set for this to work? And not just ENABLE_ROCM?

ye-luo commented Mar 8, 2022

Do you need any bits of HIP to run a test? I assume yes (to allocate memory), so QMC_CUDA2HIP needs to be set.
