GETRF optimizations #308
Conversation
Looks good! I've opened a PR on your local fork with the requested changes to PIVOT.
else if(m <= 64)
    blk = n;
else if(m <= 352)
    blk = 16;
else if(m <= 8960)
    blk = n;
else
    blk = 16;
This is rather odd. Why does it switch between 16 and n?
The block sizes for the small cases are always chosen so that the specialized kernels can be used.
For larger panel matrices, I would expect the block size to always increase with the matrix size; in practice, however, there are many factors that can affect these values (tuned experimentally here)...
My initial guess is that there are more tuned GEMM cases for square matrices than for panel matrices, although I did not investigate further.
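For reference, the selection logic in the diff can be sketched as a plain function. This is a hedged reconstruction: the thresholds (64, 352, 8960) come from the fragment shown, any small-m branches that precede it in the original file are omitted, and the name `pick_block_size` is invented here.

```cpp
// Hypothetical reconstruction of the block-size heuristic from the diff.
// Thresholds were tuned experimentally per the discussion; the small-m
// branches of the original file (not shown in the fragment) are omitted.
int pick_block_size(int m, int n)
{
    int blk;
    if(m <= 64)
        blk = n;   // small panels: take the whole panel as one block
    else if(m <= 352)
        blk = 16;  // mid-size panels: fixed block size
    else if(m <= 8960)
        blk = n;   // larger panels: full-width block again
    else
        blk = 16;  // very tall panels: back to a fixed block
    return blk;
}
```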
- Eliminate dynamic allocation
- Extract swap function
- Drop gfx803 from default build architectures (ROCm#288)
- Added getri_npvt (ROCm#305)
- Added nullptr checks for ipiv in getri
- Implemented getri_npvt routines
- Added test cases for getri_npvt routines
- Updated changelog
- Minor corrections
- Changed pivot from a template argument to a function argument
- Add pivot argument to template functions
48b6e86 to a239c0c
The code is cleaner.
The library is faster.
The binary is smaller (by 22%).
Beautiful.
The only thing I'll add is that we should wait for the extended tests to finish before merging.
- hipMemcpy(hA, AA[0], sizeof(T) * lda * n, hipMemcpyDeviceToHost);
+ hipMemcpy(hA.data(), AA[0], sizeof(T) * lda * n, hipMemcpyDeviceToHost);
Oops. I hope that was at least a compilation error when you went to use the function.
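It would indeed fail to compile: `hipMemcpy` takes raw `void*` arguments, and `std::vector` has no implicit conversion to a pointer, so `.data()` is required. A minimal host-only sketch of the corrected pattern, with `std::memcpy` standing in for `hipMemcpy` and the helper name `copy_to_host` invented here:

```cpp
#include <cstring>
#include <vector>

// Hedged sketch: std::memcpy stands in for hipMemcpy, which likewise
// takes raw pointers. Passing `hA` instead of `hA.data()` would not
// compile, since std::vector does not convert implicitly to void*.
void copy_to_host(std::vector<double>& hA, const double* dA)
{
    std::memcpy(hA.data(), dA, sizeof(double) * hA.size());
}
```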
- const rocblas_int pivot);
+ const bool pivot);
It's a small change, but I appreciate it. The first time I saw this code, I thought const rocblas_int pivot was some sort of pivot index.
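A hypothetical illustration of why the bool reads better: a flag parameter clearly dispatches between the pivoted and non-pivoting code paths, whereas an integer in the same position looks like data. The helper `getrf_variant` is invented for this sketch and is not rocSOLVER API.

```cpp
#include <string>

// Hypothetical helper: a bool `pivot` unambiguously toggles partial
// pivoting, while a rocblas_int in the same slot could be mistaken
// for a pivot index.
std::string getrf_variant(bool pivot)
{
    return pivot ? "getrf (partial pivoting)" : "getrf_npvt";
}
```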
/** This kernel executes an optimized scaled rank-update (scal + ger)
    for panel matrices (matrices with fewer than 128 columns).
    Useful to speed up the factorization of block-columns in getrf **/
This is the perfect level of detail for me. I don't instantly understand it all, but there's enough information for me to find references for the parts I don't get.
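For reference, the math behind that fused scal + ger can be sketched as one elimination step in plain, sequential C++. This is a hedged host-side sketch of the operation, not the HIP kernel; the name `scal_ger_step` is invented, and a column-major panel with leading dimension `lda` is assumed.

```cpp
#include <cstddef>
#include <vector>

// One elimination step on a column-major m x n panel A (leading
// dimension lda) at pivot column j: scale the subdiagonal part of
// column j by 1/pivot (scal), then apply the rank-1 update to the
// trailing submatrix (ger). Sequential sketch, not the HIP kernel.
void scal_ger_step(std::vector<double>& A, int m, int n, int lda, int j)
{
    const std::size_t ld = static_cast<std::size_t>(lda);
    const double pivot = A[j + j * ld];

    // scal: column j below the diagonal is divided by the pivot
    for(int i = j + 1; i < m; ++i)
        A[i + j * ld] /= pivot;

    // ger: A(i,k) -= A(i,j) * A(j,k) for i > j, k > j
    for(int k = j + 1; k < n; ++k)
        for(int i = j + 1; i < m; ++i)
            A[i + k * ld] -= A[i + j * ld] * A[j + k * ld];
}
```

On the 2x2 column-major panel {2, 6, 4, 8} (i.e. [[2, 4], [6, 8]]), one step at j = 0 produces the L factor entry 3 and the Schur complement -4.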
- update row swapping methods (laswp)
- 2- and 3-step recursion/iteration
- add local iamax+ger+scal
- rebase develop / fix merge conflicts
- tuning new blocksizes, normal case
- tuning new blocksizes, batch cases
- back to 2-step recursion
- tuning new blocksizes for non-pivoting versions
- remove specialized kernels for small panel matrices
- update workspace requirements
- Changelog and documentation
- GETRF suggestions (#5): eliminate dynamic allocation; extract swap function
- Changed pivot from a template argument to a function argument (#6); add pivot argument to template functions
- Use new swap helper
- fix workspace-size bug
- add launch bounds
- variable thread-group sizes

Co-authored-by: Cory Bloor <Cordell.Bloor@amd.com>
Co-authored-by: Troy Alderson <58866654+tfalders@users.noreply.github.com>
A series of different optimizations, focused on large matrices.