
Integrate rocsolver for LU and inversion #3756

ye-luo opened this issue Jan 24, 2022 · 5 comments

ye-luo commented Jan 24, 2022

Is your feature request related to a problem? Please describe.
After #3755, the LU and inversion solver in DelayedUpdateCUDA.h is #ifdef'd to run on the host on AMD GPUs. Because the rocSOLVER API differs from cuSOLVER's, we will need a rocSolverInverter equivalent to cuSolverInverter.

Describe the solution you'd like
Note that the rocSOLVER buffer-handling API differs from cuSOLVER's.
We also need a unit test covering both cuSolverInverter and rocSolverInverter.
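For context, a pseudocode sketch of the buffer-handling difference mentioned above (argument lists abbreviated; verify against the cuSOLVER and rocSOLVER headers before relying on this):

```
// cuSOLVER: the caller queries the workspace size and allocates it explicitly.
cusolverDnDgetrf_bufferSize(handle, m, n, A, lda, &lwork);
// ... allocate device workspace of lwork doubles ...
cusolverDnDgetrf(handle, m, n, A, lda, workspace, devIpiv, devInfo);

// rocSOLVER: no bufferSize call; the workspace lives on the rocblas handle
// (managed automatically by default, or user-set via rocblas_set_workspace).
rocsolver_dgetrf(handle, m, n, A, lda, devIpiv, devInfo);
```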


Trying to get straight all the different inversion methods, options, and code paths.

They get set up in SlaterDetBuilder::putDeterminant

QMC_CUDA - legacy CUDA acceleration in DiracDeterminantCUDA. The rest of this applies only if QMC_CUDA is not defined.

input options:

  • 'matrix_inverter' is 'gpu' or 'host'. Defaults to 'gpu'. (converted to matrix_inverter_kind which is DetMatInvertor::HOST or DetMatInvertor::ACCEL)
  • use_batch (option name is 'batch'). Defaults to 'yes' if ENABLE_OFFLOAD is defined, 'no' otherwise.
  • useGPU (option name is 'gpu'). Only available if ENABLE_CUDA or ENABLE_OFFLOAD is defined. Defaults to 'yes'.

The selections:

  • use_batch is 'yes'
    • useGPU is 'yes' and ENABLE_CUDA and ENABLE_OFFLOAD
      • DiracDeterminantBatched<MatrixDelayedUpdateCUDA>
      • "Running on NVIDIA GPU via CUDA acceleration and OpenMP offload"
    • any of the previous options not set
      • DiracDeterminantBatched<MatrixUpdateOMPTarget>
      • "Running on an accelerator via OpenMP offload. Only SM1 update supported. delay_rank is ignored"
  • use_batch is 'no'
    • useGPU is 'yes' and ENABLE_CUDA
      • DiracDeterminant<DelayedUpdateCUDA>
      • "Running on an NVIDIA GPU via CUDA acceleration"
    • useGPU is 'no' (or ENABLE_CUDA is not defined)
      • DiracDeterminant<DelayedUpdate>
      • "Running on CPU"
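The selection tree above can be condensed into a small sketch. This is illustrative only: the real dispatch lives in SlaterDetBuilder::putDeterminant and uses preprocessor guards, not runtime booleans, and the function name here is made up.

```cpp
#include <cassert>
#include <string>

// Hypothetical condensation of the dispatch described above.
// ENABLE_CUDA / ENABLE_OFFLOAD are modeled as booleans for readability.
std::string selectDetEngine(bool use_batch, bool useGPU,
                            bool enable_cuda, bool enable_offload)
{
  if (use_batch)
  {
    if (useGPU && enable_cuda && enable_offload)
      return "DiracDeterminantBatched<MatrixDelayedUpdateCUDA>";
    // any of the previous options not set
    return "DiracDeterminantBatched<MatrixUpdateOMPTarget>";
  }
  if (useGPU && enable_cuda)
    return "DiracDeterminant<DelayedUpdateCUDA>";
  return "DiracDeterminant<DelayedUpdate>"; // CPU
}
```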

Looking at the classes to see how the inversion method fits in

  • DiracMatrix uses the CPU LAPACK calls (Xgetrf/Xgetri) from computeInvertAndLog, which is called from invert_transpose. This is the go-to (default) class for doing the inversion on the CPU.

  • DiracDeterminant can select the inverter based on the matrix_inverter_kind (in the invertPsiM function)

    • HOST - DiracMatrix is always the host_inverter_
    • ACCEL - calls invert_transpose on the update engine (which is of the templated type)
  • DelayedUpdate uses DiracMatrix as the matrix inversion engine (So DiracDeterminant<DelayedUpdate> will always run on the CPU)

Now the other non-batched choice:

  • DelayedUpdateCUDA - the function called is chosen (at compile time) in invert_transpose
    • QMC_CUDA2HIP defined - uses host_inverter (of type DiracMatrix)
      • This compile-time branch will get replaced with rocSolverInverter, and be guarded by ENABLE_ROCM instead (The goal for this issue)
    • otherwise - cusolver_inverter (of type cuSolverInverter)
      • cuSolverInverter calls cusolverDnDgetrf / cusolverDnDgetrs
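As a reference for what the getrf/getrs pair computes: factor A = P·L·U with partial pivoting, then solve A·X = I column by column to get the inverse. Below is a plain-CPU sketch of that math only, not the cuSolverInverter implementation (which runs these steps on the device via cuSOLVER):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Illustrative analogue of getrf (in-place LU with partial pivoting)
// followed by getrs with the identity as right-hand side.
Mat invertViaLU(Mat a)
{
  const int n = static_cast<int>(a.size());
  std::vector<int> piv(n);
  // "getrf": LU factorization, L (unit diagonal) and U stored in place.
  for (int k = 0; k < n; ++k)
  {
    int p = k;
    for (int i = k + 1; i < n; ++i)
      if (std::fabs(a[i][k]) > std::fabs(a[p][k])) p = i;
    piv[k] = p;
    std::swap(a[k], a[p]);
    for (int i = k + 1; i < n; ++i)
    {
      a[i][k] /= a[k][k];
      for (int j = k + 1; j < n; ++j)
        a[i][j] -= a[i][k] * a[k][j];
    }
  }
  // "getrs" with RHS = identity: permute, then forward/back substitution.
  Mat x(n, std::vector<double>(n, 0.0));
  for (int c = 0; c < n; ++c)
  {
    std::vector<double> b(n, 0.0);
    b[c] = 1.0;
    for (int k = 0; k < n; ++k) std::swap(b[k], b[piv[k]]); // apply P
    for (int i = 0; i < n; ++i)                             // solve L y = P b
      for (int j = 0; j < i; ++j) b[i] -= a[i][j] * b[j];
    for (int i = n - 1; i >= 0; --i)                        // solve U x = y
    {
      for (int j = i + 1; j < n; ++j) b[i] -= a[i][j] * b[j];
      b[i] /= a[i][i];
    }
    for (int i = 0; i < n; ++i) x[i][c] = b[i];
  }
  return x;
}
```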

Looking to the batched options:

  • DiracDeterminantBatched selects the inverter based on matrix_inverter_kind in mw_invertPsiM
    • HOST - DiracMatrix is always the host_inverter_ using a loop over walkers (that is, no batching)
    • ACCEL - calls mw_invertTranspose on accel_inverter (type is the template type ::DetInverter)

The DET_ENGINE options:

  • MatrixDelayedUpdateCUDA, DetInverter type is DiracMatrixComputeCUDA
    • calls mw_computeInvertAndLog, which calls cuBLAS_LU::computeInverseAndDetLog_batched (in QMCWaveFunctions/detail/CUDA/), which eventually calls cublasDgetrfBatched / cublasDgetriBatched
  • MatrixUpdateOMPTarget, DetInverter type is DiracMatrixComputeOMPTarget
    • That uses DiracMatrix as detEng. The mw_invertTranspose call loops over walkers and calls DiracMatrix::invert_transpose on each one (i.e., the matrix inversion runs on the CPU)
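The HOST and OMPTarget fallbacks described above amount to a serial loop over walkers rather than a truly batched inversion. A minimal sketch of that pattern, with hypothetical names (invert2x2 stands in for a per-walker DiracMatrix::invert_transpose):

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

using Mat2 = std::array<std::array<double, 2>, 2>;

// Stand-in for a single-walker CPU inversion (analytic 2x2 inverse).
Mat2 invert2x2(const Mat2& m)
{
  const double det = m[0][0] * m[1][1] - m[0][1] * m[1][0];
  Mat2 r{{{m[1][1] / det, -m[0][1] / det},
          {-m[1][0] / det, m[0][0] / det}}};
  return r;
}

// "mw_" (multi-walker) wrapper: one serial inversion per walker,
// i.e., no batching -- the pattern used by the HOST/OMPTarget paths.
std::vector<Mat2> mw_invert(const std::vector<Mat2>& walker_mats)
{
  std::vector<Mat2> out;
  out.reserve(walker_mats.size());
  for (const Mat2& m : walker_mats)
    out.push_back(invert2x2(m));
  return out;
}
```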


Files, classes, and code involved in testing matrix inversion

  • test_dirac_matrix.cpp

    • uses to generate comparison data
    • Utilities/for_testing/checkMatrix.hpp - the checkMatrix function for comparing matrices
  • test_DiracMatrixComputeCUDA.cpp, benchmark_DiracMatrixComputeCUDA.cpp

    • uses Containers/tests/makeRngSpdMatrix.hpp to create a random symmetric positive definite matrix
  • test_cuBLAS_LU.cpp

    • uses tests/scripts/ to generate reference data
  • test_DiracMatrixComputeOMPTarget.cpp

  • test_DiracDeterminant.cpp

  • test_DiracDeterminantBatched.cpp


ye-luo commented Feb 17, 2022

Please ask questions and request documentation/write-ups. I think we need to do better on documentation.
That might help you go through the code rather than rediscovering all the pieces by yourself.


Is it expected that QMC_CUDA2HIP needs to be set for this to work? And not just ENABLE_ROCM?


ye-luo commented Mar 8, 2022

Do you need any bits of HIP to run a test? I assume yes (at least to allocate memory), so set QMC_CUDA2HIP.
