
GETRF optimizations #308

Merged — 19 commits merged on Aug 30, 2021
Conversation

jzuniga-amd
Collaborator

A series of different optimizations; focus on large matrices.


@tfalders tfalders left a comment


Looks good! I've opened a PR on your local fork with the requested changes to PIVOT.

Comment on lines 148 to 155
else if(m <= 64)
blk = n;
else if(m <= 352)
blk = 16;
else if(m <= 8960)
blk = n;
else
blk = 16;
Collaborator

This is rather odd. Why does it switch between 16 and n?

Collaborator Author

The block sizes for the small cases are always chosen so that the specialized kernels can be used.

For larger panel matrices I would normally expect the block size to grow with the matrix size, but in practice many factors affect these values (they were tuned experimentally here). My initial guess is that there are more tuned cases for GEMMs with square matrices than with panel matrices, though I did not investigate further.
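Read as a standalone function, the heuristic under discussion amounts to a lookup keyed on the panel height m. A minimal sketch (hypothetical name pick_blk; the quoted snippet begins at an else if, so whatever its first branch tests is elided here and the thresholds are copied verbatim from the review comment):

```cpp
// Hypothetical reconstruction of the quoted block-size heuristic.
// Returns the panel block size `blk` for an m-by-n panel factorization.
int pick_blk(int m, int n)
{
    if(m <= 64)
        return n;    // small case: whole panel, specialized kernels apply
    else if(m <= 352)
        return 16;   // mid-size: fixed narrow blocks
    else if(m <= 8960)
        return n;    // large: full-width blocks (experimentally tuned)
    else
        return 16;   // very large: back to fixed narrow blocks
}
```

This makes the back-and-forth between 16 and n that the reviewer asked about easy to see at a glance.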


@cgmb cgmb left a comment


The code is cleaner.
The library is faster.
The binary is smaller (by 22%).
Beautiful.

The only thing I'll add is that we should wait for the extended tests to finish before merging.

- hipMemcpy(hA, AA[0], sizeof(T) * lda * n, hipMemcpyDeviceToHost);
+ hipMemcpy(hA.data(), AA[0], sizeof(T) * lda * n, hipMemcpyDeviceToHost);
Collaborator

Oops. I hope that was at least a compilation error when you went to use the function.
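It would indeed fail to compile: a class type like std::vector has no implicit conversion to void*, which is what hipMemcpy's parameters take. A small stand-in (copy_bytes is a hypothetical helper with a hipMemcpy-like raw-pointer signature, assuming hA is a host-side std::vector):

```cpp
#include <cstring>
#include <vector>

// Illustrative stand-in for a raw-pointer copy API such as hipMemcpy
// (real signature: hipMemcpy(void* dst, const void* src, size_t, kind)).
void copy_bytes(void* dst, const void* src, std::size_t nbytes)
{
    std::memcpy(dst, src, nbytes);
}

// With std::vector<double> hA:
//   copy_bytes(hA, src, n);        // compile error: no conversion from
//                                  // std::vector<double> to void*
//   copy_bytes(hA.data(), src, n); // correct: pass the raw buffer
```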

- const rocblas_int pivot);
+ const bool pivot);
Collaborator

It's a small change, but I appreciate it. The first time I saw this code, I thought const rocblas_int pivot was some sort of pivot index.
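The readability gain is easy to demonstrate in isolation (factor_v1/factor_v2 are hypothetical helpers, not rocSOLVER's actual internals):

```cpp
// Before: an int parameter is ambiguous at the call site --
// factor_v1(m, n, 1) could mean "pivot row 1" or "pivoting enabled".
int factor_v1(int m, int n, int pivot) { return pivot ? m : n; }

// After: factor_v2(m, n, true) can only mean "pivoting enabled".
int factor_v2(int m, int n, bool pivot) { return pivot ? m : n; }
```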

Comment on lines +26 to +28
/** This kernel executes an optimized scaled rank-update (scal + ger)
for panel matrices (matrices with less than 128 columns).
Useful to speedup the factorization of block-columns in getrf **/
Collaborator

This is the perfect level of detail for me. I don't instantly understand it all, but there's enough information for me to find references for the parts I don't get.
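For readers who want the reference semantics spelled out: the fused step scales the pivot column by 1/pivot (scal) and then applies a rank-1 update to the trailing panel (ger), which is the inner step of a right-looking LU panel factorization. A plain CPU sketch in column-major layout (scal_ger is an illustrative name, not the rocSOLVER kernel):

```cpp
#include <vector>

// Reference version of the fused scal + ger step on column j of an
// m-by-n panel A (column-major, leading dimension lda), pivot assumed
// already applied and nonzero.
void scal_ger(int m, int n, std::vector<double>& A, int lda, int j)
{
    const double pivot = A[j + j * lda];

    // scal: divide the subdiagonal part of column j by the pivot
    for(int i = j + 1; i < m; ++i)
        A[i + j * lda] /= pivot;

    // ger: rank-1 update of the trailing submatrix
    for(int k = j + 1; k < n; ++k)
        for(int i = j + 1; i < m; ++i)
            A[i + k * lda] -= A[i + j * lda] * A[j + k * lda];
}
```

On the GPU, fusing the two avoids writing the scaled column to memory and reading it back for the update, which is why it pays off for narrow panels.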

@jzuniga-amd jzuniga-amd merged commit 31de9aa into ROCm:develop Aug 30, 2021
jzuniga-amd added a commit to jzuniga-amd/rocSOLVER that referenced this pull request Sep 8, 2021
* update row swapping methods (laswp)
* 2- and 3-step recursion/iteration
* add local iamax+ger+scal
* rebase develop / fix merge conflicts
* tuning new blocksizes normal case
* tuning new blocksizes batch cases
* back to 2-step recursion
* tuning new blocksizes for non-pivoting versions
* remove specialized kernels for small panel matrices
* update workspace requirements
* Changelog and documentation
* GETRF suggestions (#5)
     - Eliminate dynamic allocation
     - Extract swap function
* Changed pivot from a template argument to a function argument (#6)
     - Changed pivot from a template argument to a function argument
     - Add pivot argument to template functions
* Use new swap helper
* fix workspace-size bug
* add launch bounds
* variable thread-group sizes

Co-authored-by: Cory Bloor <Cordell.Bloor@amd.com>
Co-authored-by: Troy Alderson <58866654+tfalders@users.noreply.github.com>
jzuniga-amd added a commit that referenced this pull request Sep 8, 2021
@jzuniga-amd jzuniga-amd deleted the lufact_optim branch October 6, 2022 19:04
3 participants