It would be useful to have a utility that estimates GPU memory usage for a given model + training configuration before launching a job. This would help users right-size their parallelism strategy and cluster allocation without trial-and-error.
See Megatron Memory Estimator for a similar tool and helpful ideas on what to account for.
Motivation
Currently, the only way to know if a configuration fits in memory is to run it and see if it OOMs. A memory estimator would save GPU hours and iteration time, especially when exploring large model configurations.
Challenges / Open Questions
- A general-purpose memory calculator is hard to maintain if the underlying framework (e.g. PyTorch, FSDP, Megatron) doesn't expose memory accounting APIs. Estimates can drift as framework internals change.
- Activation memory depends heavily on which optimizations are enabled (activation checkpointing, offloading, mixed precision, etc.), making precise estimates difficult.
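Even without framework-level accounting, a back-of-envelope estimate covering the dominant terms (weights, gradients, optimizer states, activations) could be a useful starting point. A minimal sketch, assuming mixed-precision training with Adam and ZeRO-3-style sharding of states across data-parallel ranks; the function name, parameters, and byte counts are illustrative assumptions, not an existing API, and activation memory is left as a caller-supplied input since (per the point above) it depends heavily on which optimizations are enabled:

```python
# Rough per-GPU memory estimate for mixed-precision training with Adam.
# All formulas are approximations; real frameworks allocate differently
# (fragmentation, temporary buffers, communication buckets, etc.).

def estimate_training_memory_gb(
    num_params: float,          # total model parameters
    dp_shards: int = 1,         # ranks that weights/grads/optimizer states are sharded over
    bytes_param: int = 2,       # bf16/fp16 weights
    bytes_grad: int = 2,        # bf16/fp16 gradients
    bytes_optim: int = 12,      # Adam: fp32 master weights + two fp32 moments
    activation_gb: float = 0.0, # caller-supplied activation estimate (config-dependent)
) -> float:
    """Per-GPU memory estimate in GiB, ignoring framework overhead."""
    state_bytes = num_params * (bytes_param + bytes_grad + bytes_optim) / dp_shards
    return state_bytes / 2**30 + activation_gb

# Example: 7B parameters fully sharded over 8 GPUs, plus ~20 GiB of activations.
per_gpu = estimate_training_memory_gb(7e9, dp_shards=8, activation_gb=20.0)
```

Even an estimate within 10-20% of actual usage would be enough to rule out configurations that clearly don't fit.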