Conversation

ngc92 (Collaborator) commented Oct 9, 2025

This requires one more matrix multiplication in the backward pass, but halves the amount of activation memory that has to be kept compared to the current setup.
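The trade-off described above can be sketched as follows. This is a minimal NumPy illustration, not the PR's actual code: the two-matmul ReLU MLP, the function names, and the choice of which tensor to keep are all assumptions made for the example. The point is that the recompute variant saves only the block input between forward and backward and re-derives the hidden activation with one extra matmul, instead of keeping it resident.

```python
import numpy as np

# Hypothetical MLP block: y = relu(x @ W1) @ W2
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W1 = rng.standard_normal((8, 16))
W2 = rng.standard_normal((16, 8))

def forward(x, W1, W2):
    h = x @ W1                    # hidden pre-activation
    y = np.maximum(h, 0) @ W2
    return y, h                   # baseline keeps h around for backward

def backward_store(x, h, dy, W1, W2):
    # Baseline: uses the stored hidden activation h.
    a = np.maximum(h, 0)
    dW2 = a.T @ dy
    dh = (dy @ W2.T) * (h > 0)    # ReLU gradient mask
    return dh @ W1.T, x.T @ dh, dW2

def backward_recompute(x, dy, W1, W2):
    # Memory-saving variant: only x was saved; the first matmul is
    # redone here (the "one more matrix multiplication" from the PR).
    h = x @ W1
    a = np.maximum(h, 0)
    dW2 = a.T @ dy
    dh = (dy @ W2.T) * (h > 0)
    return dh @ W1.T, x.T @ dh, dW2
```

Both variants produce identical gradients; the recompute path spends one extra matmul of FLOPs per block in exchange for not holding the hidden activation between forward and backward, which is what frees memory for the larger batch size in the new runs below.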

This lets us achieve reasonable MFU even for the 14B model on 4x RTX 4090:

Old:

| Model | nGPU | DType | Batch | TPS | SOL | TTB |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-14B⁶ | 4 | fp8 | 4 | 3.2k | 22% | 87h |
| Qwen2.5-14B⁷ | 4 | bf16 | 4 | 2.5k | 33% | 111h |

New:

| Model | nGPU | DType | Batch | TPS | SOL | TTB |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-14B⁸ | 4 | fp8 | 8 | 6.0k | 42% | 47h |
| Qwen2.5-14B⁹ | 4 | bf16 | 8 | 4.5k | 58% | 62h |

ngc92 merged commit 61d62bf into dev on Oct 10, 2025 · 3 checks passed