GETRF optimizations #308
Conversation
Looks good! I've opened a PR on your local fork with the requested changes to PIVOT.
else if(m <= 64)
    blk = n;
else if(m <= 352)
    blk = 16;
else if(m <= 8960)
    blk = n;
else
    blk = 16;
This is rather odd. Why does it switch between 16 and n?
The block sizes for the small cases are always chosen so that the specialized kernels can be used.
For larger panel matrices, I would expect the block size to always increase with the matrix size; in practice, however, there are many factors that can affect these values (tuned experimentally here)...
My initial guess is that there are more tuned GEMM cases for square matrices than for panel matrices, although I did not investigate further.
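For reference, the selection logic in the diff can be sketched as a plain function. This is a hedged reconstruction: the thresholds (64, 352, 8960) come from the fragment shown, any small-m branches that precede it in the original file are omitted, and the name `pick_block_size` is invented here.

```cpp
// Hypothetical reconstruction of the block-size heuristic from the diff.
// Thresholds were tuned experimentally per the discussion; the small-m
// branches of the original file (not shown in the fragment) are omitted.
int pick_block_size(int m, int n)
{
    int blk;
    if(m <= 64)
        blk = n;   // small panels: take the whole panel as one block
    else if(m <= 352)
        blk = 16;  // mid-size panels: fixed block size
    else if(m <= 8960)
        blk = n;   // larger panels: full-width block again
    else
        blk = 16;  // very tall panels: back to a fixed block
    return blk;
}
```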
- Eliminate dynamic allocation
- Extract swap function
- Drop gfx803 from default build architectures (ROCm#288)
- Added getri_npvt (ROCm#305)
- Added nullptr checks for ipiv in getri
- Implemented getri_npvt routines
- Added test cases for getri_npvt routines
- Updated changelog
- Minor corrections
- Changed pivot from a template argument to a function argument
- Add pivot argument to template functions
48b6e86 to a239c0c
The code is cleaner.
The library is faster.
The binary is smaller (by 22%).
Beautiful.
The only thing I'll add is that we should wait for the extended tests to finish before merging.
- hipMemcpy(hA, AA[0], sizeof(T) * lda * n, hipMemcpyDeviceToHost);
+ hipMemcpy(hA.data(), AA[0], sizeof(T) * lda * n, hipMemcpyDeviceToHost);
Oops. I hope that was at least a compilation error when you went to use the function.
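It would indeed fail to compile: `hipMemcpy` takes raw `void*` arguments, and `std::vector` has no implicit conversion to a pointer, so `.data()` is required. A minimal host-only sketch of the corrected pattern, with `std::memcpy` standing in for `hipMemcpy` and the helper name `copy_to_host` invented here:

```cpp
#include <cstring>
#include <vector>

// Hedged sketch: std::memcpy stands in for hipMemcpy, which likewise
// takes raw pointers. Passing `hA` instead of `hA.data()` would not
// compile, since std::vector does not convert implicitly to void*.
void copy_to_host(std::vector<double>& hA, const double* dA)
{
    std::memcpy(hA.data(), dA, sizeof(double) * hA.size());
}
```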
- const rocblas_int pivot);
+ const bool pivot);
It's a small change, but I appreciate it. The first time I saw this code, I thought const rocblas_int pivot was some sort of pivot index.
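A hypothetical illustration of why the bool reads better: a flag parameter clearly dispatches between the pivoted and non-pivoting code paths, whereas an integer in the same position looks like data. The helper `getrf_variant` is invented for this sketch and is not rocSOLVER API.

```cpp
#include <string>

// Hypothetical helper: a bool `pivot` unambiguously toggles partial
// pivoting, while a rocblas_int in the same slot could be mistaken
// for a pivot index.
std::string getrf_variant(bool pivot)
{
    return pivot ? "getrf (partial pivoting)" : "getrf_npvt";
}
```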
/** This kernel executes an optimized scaled rank-update (scal + ger)
    for panel matrices (matrices with fewer than 128 columns).
    Useful to speed up the factorization of block-columns in getrf **/
This is the perfect level of detail for me. I don't instantly understand it all, but there's enough information for me to find references for the parts I don't get.
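For reference, the math behind that fused scal + ger can be sketched as one elimination step in plain, sequential C++. This is a hedged host-side sketch of the operation, not the HIP kernel; the name `scal_ger_step` is invented, and a column-major panel with leading dimension `lda` is assumed.

```cpp
#include <cstddef>
#include <vector>

// One elimination step on a column-major m x n panel A (leading
// dimension lda) at pivot column j: scale the subdiagonal part of
// column j by 1/pivot (scal), then apply the rank-1 update to the
// trailing submatrix (ger). Sequential sketch, not the HIP kernel.
void scal_ger_step(std::vector<double>& A, int m, int n, int lda, int j)
{
    const std::size_t ld = static_cast<std::size_t>(lda);
    const double pivot = A[j + j * ld];

    // scal: column j below the diagonal is divided by the pivot
    for(int i = j + 1; i < m; ++i)
        A[i + j * ld] /= pivot;

    // ger: A(i,k) -= A(i,j) * A(j,k) for i > j, k > j
    for(int k = j + 1; k < n; ++k)
        for(int i = j + 1; i < m; ++i)
            A[i + k * ld] -= A[i + j * ld] * A[j + k * ld];
}
```

On the 2x2 column-major panel {2, 6, 4, 8} (i.e. [[2, 4], [6, 8]]), one step at j = 0 produces the L factor entry 3 and the Schur complement -4.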
- update row swapping methods (laswp)
- 2- and 3-step recursion/iteration
- add local iamax+ger+scal
- rebase develop / fix merge conflicts
- tuning new blocksizes, normal case
- tuning new blocksizes, batch cases
- back to 2-step recursion
- tuning new blocksizes for non-pivoting versions
- remove specialized kernels for small panel matrices
- update workspace requirements
- Changelog and documentation
- GETRF suggestions (#5): eliminate dynamic allocation; extract swap function
- Changed pivot from a template argument to a function argument (#6); add pivot argument to template functions
- Use new swap helper
- fix workspace-size bug
- add launch bounds
- variable thread-group sizes

Co-authored-by: Cory Bloor <Cordell.Bloor@amd.com>
Co-authored-by: Troy Alderson <58866654+tfalders@users.noreply.github.com>
A series of different optimizations, focused on large matrices.