Conversation

ngc92 (Collaborator) commented Oct 9, 2025

This requires one more matrix multiplication in the backward pass, but halves the amount of activation memory that has to be kept compared to the current setup.
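The trade-off described above can be sketched as follows. This is a minimal NumPy illustration, not the PR's actual code: the two-matmul ReLU MLP, the function names, and the choice of which tensor to keep are all assumptions made for the example. The point is that the recompute variant saves only the block input between forward and backward and re-derives the hidden activation with one extra matmul, instead of keeping it resident.

```python
import numpy as np

# Hypothetical MLP block: y = relu(x @ W1) @ W2
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W1 = rng.standard_normal((8, 16))
W2 = rng.standard_normal((16, 8))

def forward(x, W1, W2):
    h = x @ W1                    # hidden pre-activation
    y = np.maximum(h, 0) @ W2
    return y, h                   # baseline keeps h around for backward

def backward_store(x, h, dy, W1, W2):
    # Baseline: uses the stored hidden activation h.
    a = np.maximum(h, 0)
    dW2 = a.T @ dy
    dh = (dy @ W2.T) * (h > 0)    # ReLU gradient mask
    return dh @ W1.T, x.T @ dh, dW2

def backward_recompute(x, dy, W1, W2):
    # Memory-saving variant: only x was saved; the first matmul is
    # redone here (the "one more matrix multiplication" from the PR).
    h = x @ W1
    a = np.maximum(h, 0)
    dW2 = a.T @ dy
    dh = (dy @ W2.T) * (h > 0)
    return dh @ W1.T, x.T @ dh, dW2
```

Both variants produce identical gradients; the recompute path spends one extra matmul of FLOPs per block in exchange for not holding the hidden activation between forward and backward, which is what frees memory for the larger batch size in the new runs below.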

This lets us achieve reasonable MFU even for the 14B model on 4x RTX 4090:

Old:

| Model | nGPU | DType | Batch | TPS | SOL | TTB |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-14B⁶ | 4 | fp8 | 4 | 3.2k | 22% | 87h |
| Qwen2.5-14B⁷ | 4 | bf16 | 4 | 2.5k | 33% | 111h |

New:

| Model | nGPU | DType | Batch | TPS | SOL | TTB |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-14B⁸ | 4 | fp8 | 8 | 6.0k | 42% | 47h |
| Qwen2.5-14B⁹ | 4 | bf16 | 8 | 4.5k | 58% | 62h |

ngc92 merged commit 61d62bf into dev on Oct 10, 2025 · 3 checks passed