Remove uncessary ctype being passed to GroupedGEMMQuant kernel#2922
Conversation
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
|
/te-ci pytorch |
Greptile SummaryThis PR removes the Confidence Score: 5/5Safe to merge — minimal, focused change with no functional regressions. The PR makes three symmetrical one-line deletions across two implementation files and a test. The cuDNN wrapper accepts No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[ForwardGroupedMLP] -->|fc1_glu_kwargs — keeps c_dtype| B[grouped_gemm_quant_wrapper_sm100\nFC1 SwiGLU GEMM]
A -->|fc2_quant_kwargs — c_dtype removed| C[grouped_gemm_quant_wrapper_sm100\nFC2 GEMM]
D[BackwardGroupedMLP] -->|fc1_dgrad_kwargs — c_dtype removed| E[grouped_gemm_quant_wrapper_sm100\nFC1 dgrad GEMM]
B -->|intermediate C tensor still allocated| F[SwiGLU activation]
C -->|no intermediate C tensor| G[FC2 output D]
E -->|no intermediate C tensor| H[grad_input]
Reviews (2): Last reviewed commit: "Merge pull request #4 from ksivaman/pr_f..." | Re-trigger Greptile |
ksivaman
left a comment
There was a problem hiding this comment.
This shouldn't be merged in as a standalone commit as it requires changes from cudnn-FE for correctly inferring this type. Once those changes are released, this should be a part of a bigger PR that raises the min requires FE for enabled the fusion and removes all of the other cumbersome FE version checks for the features as well.
|
@ksivaman if your concern is that the title isnt inline with what the change acheives, then it makes sense and I have changed the title. But this is a completely harmless change and I would argue a clean syntactic change, since c_type isnt even needed by the kernel and we are not using the intermediate C. So I dont think it makes sense to block this merge to get it to main and to 2.15, given that it might reduce the testing effort to a lot of folks. |
ksivaman
left a comment
There was a problem hiding this comment.
After offline discussion, this has been tested with cudnnFE 1.22.0 and title is accurate. LGTM
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Remove c_dtype from fusible ops test
|
/te-ci pytorch |
* remove ctype to eliminate memory usage from the cudnn kernel Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * Remove c_dtype from fusible ops test Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…A#2922) * remove ctype to eliminate memory usage from the cudnn kernel Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * Remove c_dtype from fusible ops test Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Description
We dont depend on the intermediate C that can be potentially outputted by the grouped_gemm_quant kernel for Fused MOE blocks.
For future:
Removing this c_type would eliminate unecessary intermediate high precision C tensor being outputted from the cudnn kernel. Although to see any real memory benefit we would need to move to newer cudnn release.
Please include a brief summary of the changes, relevant motivation and context.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: