Conversation
When kernel_size == stride (non-overlapping patches), Conv3d is mathematically equivalent to reshape + nn.Linear. This avoids the im2col/col2im overhead and replaces MIOpen's implicit GEMM backward-weight path with standard rocBLAS matmul backward.

Profiling on MI355X (gfx950) shows the backward-weight GEMM (kernel_batched_gemm_xdlops_bwd_weight) consumed 79.3% of compute time. With use_linear=True, this kernel is eliminated entirely, yielding a 2.87x end-to-end training speedup with identical loss convergence.

Enabled via config: use_linear: !!bool True (default False, fully backward compatible).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
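For reference, a minimal self-contained sketch of the equivalence the description relies on: a Conv3d patch embedding with kernel_size == stride produces the same values as cutting the volume into non-overlapping patches and applying an nn.Linear with the flattened conv weights. The tensor sizes and variable names below are illustrative, not taken from the repo.

```python
import torch
import torch.nn as nn

B, C, D, H, W = 2, 4, 16, 16, 16
patch = 4                      # kernel_size == stride == patch size
embed = 32

conv = nn.Conv3d(C, embed, kernel_size=patch, stride=patch)

# Equivalent linear layer: flatten the conv weights into an (embed, C*p^3) matrix.
linear = nn.Linear(C * patch ** 3, embed)
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(embed, -1))
    linear.bias.copy_(conv.bias)

x = torch.randn(B, C, D, H, W)

# Conv path: (B, embed, D/p, H/p, W/p)
y_conv = conv(x)

# Reshape path: split into non-overlapping p^3 patches, flatten each patch to a
# vector, and apply the same weights as a plain matmul.
xp = x.reshape(B, C, D // patch, patch, H // patch, patch, W // patch, patch)
xp = xp.permute(0, 2, 4, 6, 1, 3, 5, 7).reshape(B, -1, C * patch ** 3)
y_lin = linear(xp)             # (B, num_patches, embed)

# Bring the conv output into the same (B, num_patches, embed) layout and compare.
y_conv_flat = y_conv.flatten(2).transpose(1, 2)
print(torch.allclose(y_conv_flat, y_lin, atol=1e-5))  # True
```

Presumably the use_linear flag simply selects which module type the patch-embedding layer instantiates; the explicit weight copy above is only there to show that the two paths agree numerically.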
pzhanggit left a comment:
@nicholasmalaya thank you very much for the optimization, Nick!
@TsChala the PR looks good to me. Could you do a test run on Frontier when it's back from maintenance? We should extend the changes to other models for better performance as well. Thanks
Thanks for the edits @nicholasmalaya! @pzhanggit I ran some tests on the JHUTDB dataset today. Using the Turbulence Transformer I see around a 2x speed-up!
TsChala left a comment:
This looks good to me. After this PR is merged I can work on expanding it to the other models as well.
Fix smooth layer and expand_projections regressions from #48

This PR (follow-on to ORNL#48) addresses three potential issues:

1. smooth layer silently dropped when notransposed=True: In hMLP_output.__init__ and forward, ORNL#48 added extra indentation to the 'if self.smooth' blocks, so the smooth layer is skipped when self.notransposed=True and self.smooth=True. This PR reverts to the previous behavior by un-indenting.

2. expand_projections shape mismatch with use_linear=True: In BaseModel.expand_projections, ORNL#48 allows new_debed.out_head to be either nn.Linear or nn.ConvTranspose3d, but a later parameter copy is not generalized for both possibilities and can produce a shape mismatch when use_linear=True. This PR generalizes the copy to handle both cases (see the sketch below).

3. Collapse redundant if/else in hMLP_output.forward: After un-indenting the smooth block in (1), the remaining 'if self.notransposed / else' branches reduce to 'x = self.out_head(x)'. This PR collapses them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
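For illustration, a hedged sketch of the isinstance-based generalization described in issue 2. The function copy_out_head, its arguments, and the assumption that the expansion happens along the output dimension are hypothetical and not the actual expand_projections code; the point is only that the parameter copy must branch on the head's module type.

```python
import torch
import torch.nn as nn

def copy_out_head(old_head: nn.Module, new_head: nn.Module, n_copy: int) -> None:
    """Copy the first n_copy output channels/features from old_head into new_head,
    handling both nn.Linear and nn.ConvTranspose3d weight layouts."""
    with torch.no_grad():
        if isinstance(old_head, nn.Linear):
            # nn.Linear stores weights as (out_features, in_features)
            new_head.weight[:n_copy].copy_(old_head.weight[:n_copy])
        elif isinstance(old_head, nn.ConvTranspose3d):
            # nn.ConvTranspose3d stores weights as (in_channels, out_channels, kD, kH, kW)
            new_head.weight[:, :n_copy].copy_(old_head.weight[:, :n_copy])
        else:
            raise TypeError(f"unsupported head type: {type(old_head)}")
        if old_head.bias is not None:
            # bias is (out_features,) / (out_channels,) in both cases
            new_head.bias[:n_copy].copy_(old_head.bias[:n_copy])

# Example usage with the nn.Linear case introduced by use_linear=True:
old = nn.Linear(256, 3)
new = nn.Linear(256, 5)
copy_out_head(old, new, n_copy=3)  # first 3 output rows of `new` now match `old`
```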