Add Gemma 4 FLOPs & fix sliding window flops computations #3592
copybara-service[bot] merged 1 commit into main
Conversation
This Pull Request introduces Gemma 4 FLOPs calculations and significantly improves the accuracy of existing FLOPs math, particularly for sliding window attention and mixed attention architectures. The addition of a comprehensive test suite covering multiple model families is a major highlight and ensures the reliability of these critical metrics.
🔍 General Feedback
- Great Test Coverage: The new `maxtext_utils_flops_test.py` is excellent. It uses a robust `6 * params * tokens` verification strategy that provides high confidence in the computed TFLOPs across various architectures.
- Improved Accuracy: The fixes for sliding window area and vision encoder scaling (backward pass) are well-timed and correct.
- Inconsistency in Shared KV Projections: There is a potential logic error in how `share_kv_projections` is applied to mixed attention models in the main caller. One unit test specifically assumes local layers do not share KV projections even when the flag is True, but the code currently applies it to both.
- MoE Fallback Logic: The fallback for MoE layer detection is now more generalized, which is good, but might be too broad for future hybrid architectures.
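The `6 * params * tokens` verification strategy mentioned above can be sketched as follows (a minimal illustration; the function names and tolerance are hypothetical, not the test's actual API):

```python
def approx_train_flops(num_params: int, num_tokens: int) -> float:
    """Rule-of-thumb training FLOPs: roughly 2 FLOPs per parameter
    per token for the forward pass and 4 for the backward pass."""
    return 6.0 * num_params * num_tokens


def tflops_matches_estimate(computed_tflops: float, num_params: int,
                            num_tokens: int, rel_tol: float = 0.15) -> bool:
    """Check a computed per-step TFLOPs value against the 6ND estimate,
    leaving slack for attention FLOPs the rule of thumb ignores."""
    expected_tflops = approx_train_flops(num_params, num_tokens) / 1e12
    return abs(computed_tflops - expected_tflops) <= rel_tol * expected_tflops


# Example: a 1B-parameter model processing 4096 tokens per step
# should report roughly 6 * 1e9 * 4096 / 1e12 ≈ 24.6 TFLOPs.
```

A check like this catches gross errors (a dropped backward pass, a doubled layer count) while tolerating the attention-dependent terms that the simple estimate omits.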
This pull request significantly improves the accuracy of FLOPs and MFU (Model Flops Utilization) calculations across multiple architectures, with a focus on Gemma 4 and corrected sliding window logic. The implementation is thorough, including a new comprehensive test suite that validates calculations for 12 different model configurations.
🔍 General Feedback
- Accuracy Improvements: The switch to a precise triangular overlap formula for sliding window attention and the inclusion of backward pass FLOPs for vision encoders are excellent updates that prevent MFU over-estimation.
- Architectural Coverage: The addition of Gemma 4 specific logic and the generalization of MoE layer detection make the utilities much more robust for future model support.
- Testing: The new
maxtext_utils_flops_test.pyis a great addition, providing clear manual-calculation-based verification for various architectures. - Suggestions: I've provided a few suggestions to further generalize the MoE layer detection and ensure consistent dimension usage in MoE FFN calculations.
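For reference, the triangular overlap formula for sliding window attention (`max_target_length * window - 0.5 * window**2`) can be compared against an exact count of attended positions under causal sliding-window attention. A small sketch (variable names are illustrative; note the closed form undercounts the exact sum by `window / 2`, which is negligible when the sequence length dominates the window size):

```python
def exact_attended_positions(seq_len: int, window: int) -> int:
    """Exact number of (query, key) pairs: query i attends to
    min(i + 1, window) keys under a causal sliding window."""
    return sum(min(i + 1, window) for i in range(seq_len))


def closed_form_area(seq_len: int, window: int) -> float:
    """Closed form used for the FLOPs fix: the full rectangle of
    seq_len * window minus the triangle cut off at the start."""
    return seq_len * window - 0.5 * window**2


# For seq_len=8, window=3: exact is 21, closed form is 19.5;
# the gap is exactly window / 2 = 1.5.
```

Using the full `seq_len * window` rectangle instead would over-count attention FLOPs for the early positions, which is exactly the over-estimation the PR corrects.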
I love the test! I am not sure how we have gotten this far without testing our tflops calculation...
Description
Adds TFLOPs calculations for the Gemma 4 architecture (including MoE) and fixes several inaccuracies in existing FLOPs math (sliding window overlap, vision encoder scaling, and shared KV projections).
- MoE FLOPs now use `moe_mlp_dim` and generalized MoE layer detection (`num_experts > 1`).
- Sliding window attention FLOPs use the corrected overlap area (`max_target_length * window - 0.5 * window**2`).
- QKV FLOPs now account for `share_kv_projections`.

Tests

Added `maxtext_utils_flops_test.py` to validate FLOPs calculations across 12 model architectures.

Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.