Add memory calculator to estimate GPU memory requirements for a given model configuration #1592

@terrykong

Description

It would be useful to have a utility that estimates GPU memory usage for a given model + training configuration before launching a job. This would help users right-size their parallelism strategy and cluster allocation without trial-and-error.

See the Megatron Memory Estimator for a similar tool and ideas on what to account for.
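As a rough illustration of what such a calculator could compute, here is a minimal sketch for the model-state portion (weights + gradients + optimizer state) under assumed mixed-precision Adam training. The byte counts (2 bytes/param for bf16 weights and grads, 12 bytes/param for fp32 master weights plus Adam moments) and the uniform sharding across tensor-parallel and FSDP/ZeRO ranks are assumptions, and activation memory is deliberately left out:

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Hypothetical config; field names are illustrative, not from any framework."""
    num_params: float                    # total model parameters
    tp: int = 1                          # tensor-parallel size
    dp_shard: int = 1                    # FSDP / ZeRO-3 sharding degree
    bytes_per_param: int = 2             # bf16 weights (assumed)
    grad_bytes_per_param: int = 2        # bf16 gradients (assumed)
    optimizer_bytes_per_param: int = 12  # fp32 master + Adam m, v (4+4+4, assumed)


def estimate_model_state_gib(cfg: TrainConfig) -> float:
    """Per-GPU model-state memory in GiB, assuming even sharding of
    weights, grads, and optimizer state across tp * dp_shard ranks."""
    params_per_gpu = cfg.num_params / (cfg.tp * cfg.dp_shard)
    bytes_per_param = (cfg.bytes_per_param
                       + cfg.grad_bytes_per_param
                       + cfg.optimizer_bytes_per_param)
    return params_per_gpu * bytes_per_param / 2**30


# Example: 7B parameters, TP=2, sharded over 4 data-parallel ranks
cfg = TrainConfig(num_params=7e9, tp=2, dp_shard=4)
print(f"{estimate_model_state_gib(cfg):.1f} GiB model states per GPU")  # ≈ 13.0 GiB
```

A real estimator would additionally account for activations, temporary buffers, CUDA context, and fragmentation, which is where most of the maintenance burden lies.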

Motivation

Currently, the only way to know if a configuration fits in memory is to run it and see if it OOMs. A memory estimator would save GPU hours and iteration time, especially when exploring large model configurations.

Challenges / Open Questions

  • Building a general-purpose memory calculator is hard to maintain if the underlying framework (e.g. PyTorch, FSDP, Megatron) doesn't expose memory accounting APIs. Estimates can drift as framework internals change.
  • Activation memory depends heavily on which optimizations are enabled (activation checkpointing, offloading, mixed precision, etc.), making precise estimates difficult.
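To make the second point concrete, one published approximation (Korthikanti et al., "Reducing Activation Recomputation in Large Transformer Models") gives per-layer activation bytes for a standard fp16/bf16 transformer layer; the sketch below applies that estimate, where `s` is sequence length, `b` micro-batch size, `h` hidden size, and `a` the number of attention heads. The full-checkpointing branch (keeping only the 2·s·b·h-byte layer input) is a simplifying assumption; selective checkpointing and offloading land somewhere in between:

```python
def act_bytes_per_layer(s: int, b: int, h: int, a: int,
                        checkpointing: bool = False) -> float:
    """Approximate activation memory per transformer layer in bytes
    (half-precision activations), per Korthikanti et al. (2022).

    With full activation checkpointing, only the layer input is stored
    and the rest is recomputed in the backward pass (assumption)."""
    if checkpointing:
        return 2 * s * b * h
    return s * b * h * (34 + 5 * a * s / h)


# Example: s=4096, b=1, h=4096, a=32 heads
full = act_bytes_per_layer(4096, 1, 4096, 32)
ckpt = act_bytes_per_layer(4096, 1, 4096, 32, checkpointing=True)
print(f"full: {full / 2**30:.2f} GiB/layer, checkpointed: {ckpt / 2**20:.0f} MiB/layer")
```

The two modes differ by roughly two orders of magnitude here, which is why an estimator that ignores which optimizations are enabled can be off by far more than the headroom users typically leave.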
