It would be useful to have a utility that estimates GPU memory usage for a given model + training configuration before launching a job. This would help users right-size their parallelism strategy and cluster allocation without trial-and-error.
See Megatron Memory Estimator for a similar tool and helpful ideas on what to account for.
Motivation
Currently, the only way to know if a configuration fits in memory is to run it and see if it OOMs. A memory estimator would save GPU hours and iteration time, especially when exploring large model configurations.
Challenges / Open Questions
- A general-purpose memory calculator is hard to maintain if the underlying framework (e.g. PyTorch, FSDP, Megatron) doesn't expose memory accounting APIs. Estimates can drift as framework internals change.
- Activation memory depends heavily on which optimizations are enabled (activation checkpointing, offloading, mixed precision, etc.), making precise estimates difficult.
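Even without framework-level accounting, a back-of-envelope estimate covering the dominant terms (weights, gradients, optimizer states, activations) could be a useful starting point. A minimal sketch, assuming mixed-precision training with Adam and ZeRO-3-style sharding of states across data-parallel ranks; the function name, parameters, and byte counts are illustrative assumptions, not an existing API, and activation memory is left as a caller-supplied input since (per the point above) it depends heavily on which optimizations are enabled:

```python
# Rough per-GPU memory estimate for mixed-precision training with Adam.
# All formulas are approximations; real frameworks allocate differently
# (fragmentation, temporary buffers, communication buckets, etc.).

def estimate_training_memory_gb(
    num_params: float,          # total model parameters
    dp_shards: int = 1,         # ranks that weights/grads/optimizer states are sharded over
    bytes_param: int = 2,       # bf16/fp16 weights
    bytes_grad: int = 2,        # bf16/fp16 gradients
    bytes_optim: int = 12,      # Adam: fp32 master weights + two fp32 moments
    activation_gb: float = 0.0, # caller-supplied activation estimate (config-dependent)
) -> float:
    """Per-GPU memory estimate in GiB, ignoring framework overhead."""
    state_bytes = num_params * (bytes_param + bytes_grad + bytes_optim) / dp_shards
    return state_bytes / 2**30 + activation_gb

# Example: 7B parameters fully sharded over 8 GPUs, plus ~20 GiB of activations.
per_gpu = estimate_training_memory_gb(7e9, dp_shards=8, activation_gb=20.0)
```

Even an estimate within 10-20% of actual usage would be enough to rule out configurations that clearly don't fit.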