Forward max_memory_padding to _chunked_apply in optimize() #513

Merged
orionarcher merged 2 commits into TorchSim:main from niklashoelter:fix/forward-max-memory-padding
Mar 20, 2026

Conversation

Contributor

@niklashoelter niklashoelter commented Mar 18, 2026

The optimize() function extracts several attributes from the InFlightAutoBatcher and passes them to _chunked_apply(), which creates a BinningAutoBatcher for FIRE initialization. However, max_memory_padding was not forwarded, causing the BinningAutoBatcher to use its default of 1.0 (no safety margin). This can lead to OOM errors during optimizer initialization on large workloads, because the memory estimation fills 100% of GPU memory with a bare forward pass, leaving no headroom for the additional state allocated by fire_init() (velocities, dt, alpha, etc.).

Summary

When passing an InFlightAutoBatcher with a custom max_memory_padding to optimize(), the padding value is not forwarded to the internal _chunked_apply() call used for optimizer initialization (e.g. FIRE init). This causes the BinningAutoBatcher created inside _chunked_apply() to default to max_memory_padding=1.0, effectively using no safety margin during memory estimation for the init phase.

We observed OOM errors during FIRE initialization on large workloads (~4000 structures, 24 GB GPU) that we believe are caused by this. The memory estimator determines batch sizes that fill 100% of GPU memory based on a bare forward pass, leaving no headroom for the additional state allocated by fire_init() (velocities, dt, alpha, etc.). Reducing max_memory_padding had no effect, since the value was not reaching the BinningAutoBatcher.
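The role of the padding factor can be illustrated with a minimal sketch. This is not TorchSim's actual code; it only assumes the semantics described above, namely that max_memory_padding scales the estimated memory ceiling, so values below 1.0 reserve headroom for state allocated after the bare forward pass:

```python
# Illustrative sketch (not TorchSim source): how a padding factor below 1.0
# leaves headroom when turning a measured memory ceiling into a batch budget.
def effective_memory_budget(
    measured_max_scaler: float, max_memory_padding: float = 1.0
) -> float:
    """Scale the measured per-batch memory ceiling by a safety factor."""
    return measured_max_scaler * max_memory_padding

# With the default of 1.0, batches may fill 100% of the measured ceiling:
full = effective_memory_budget(1000.0)  # no headroom
# A padding of 0.9 reserves ~10% for state allocated later (e.g. velocities
# and other arrays created by fire_init):
padded = effective_memory_budget(1000.0, 0.9)
print(full, padded)
```

Because the padding never reached the BinningAutoBatcher, every init batch was sized against the unpadded ceiling, regardless of what the user configured.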

Fix

Forward max_memory_padding from the InFlightAutoBatcher to _chunked_apply() in runners.py, alongside the other attributes that are already forwarded (max_memory_scaler, memory_scales_with, max_atoms_to_try, oom_error_message).
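The shape of the change can be sketched as follows. The attribute and parameter names come from the PR description; the InFlightAutoBatcher stand-in and the body of _chunked_apply are simplified stubs, not TorchSim's actual implementations:

```python
# Hedged sketch of the forwarding fix; only the listed attributes are modeled.
from dataclasses import dataclass


@dataclass
class InFlightAutoBatcher:  # stand-in exposing only the forwarded attributes
    max_memory_scaler: float
    memory_scales_with: str
    max_atoms_to_try: int
    oom_error_message: str
    max_memory_padding: float = 1.0


def _chunked_apply(fn, state, **batcher_kwargs):
    # In TorchSim this constructs a BinningAutoBatcher from these kwargs;
    # here we return them so the forwarding is visible.
    return batcher_kwargs


batcher = InFlightAutoBatcher(400.0, "n_atoms", 500, "OOM during init", 0.9)
kwargs = _chunked_apply(
    None,  # fire_init in the real call
    None,  # state in the real call
    max_memory_scaler=batcher.max_memory_scaler,
    memory_scales_with=batcher.memory_scales_with,
    max_atoms_to_try=batcher.max_atoms_to_try,
    oom_error_message=batcher.oom_error_message,
    max_memory_padding=batcher.max_memory_padding,  # the fix: now forwarded
)
print(kwargs["max_memory_padding"])
```

Without the last keyword argument, the stand-in (like the real BinningAutoBatcher) would fall back to its default of 1.0.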

Before a pull request can be merged, the following items must be checked:

  • Doc strings have been added in the Google docstring format.
  • Run ruff on your code.
  • Tests have been added for any new functionality or bug fixes.

Collaborator

@orionarcher orionarcher left a comment


Good catch!

@orionarcher orionarcher merged commit 8c9ddea into TorchSim:main Mar 20, 2026
70 of 72 checks passed


Development

Successfully merging this pull request may close these issues.

max_memory_padding not forwarded to BinningAutoBatcher during optimizer init in optimize()

3 participants