Fix FIRE NaN velocity sentinel corrupting retained systems during autobatcher swaps by janosh · Pull Request #490 · TorchSim/torch-sim

janosh · 2026-03-04T16:50:58Z

Fixes a bug in _ase_fire_step where the NaN-velocity initialization sentinel skips the entire FIRE step (force transformation, power calculation, dt/alpha/n_pos updates, velocity mixing) for all systems when any system has NaN velocities.

This matters when the InFlightAutoBatcher swaps in a new state mid-optimization — the new state has NaN velocities from fire_init, which causes retained systems to skip a FIRE step entirely.

Evidence

Clone-and-compare test: two identical systems, 10 FIRE steps, then inject NaN velocities into system 1 only and run one more step. System 0 should be completely unaffected:

Metric	Main (buggy)	With fix
Position diff	`3.76e-04`	`0.0`
dt diff	`1.33e-02`	`0.0`
alpha diff	`9.70e-04`	`0.0`
n_pos diff	`1`	`0`

test_fire_nan_velocities_dont_affect_other_systems fails on all 4 parametrizations on main, passes on all 4 with the fix.

Fix

Decouple NaN zeroing from the FIRE logic branch. Only skip FIRE mixing when nan_velocities.all() (first step, matching ASE). When a subset has NaN (autobatcher swap), zero them and run normal FIRE — newly zeroed systems naturally get power=0 → negative mask → dt decrease and velocity reset.

…ched optimization The NaN-velocity sentinel in `_ase_fire_step` used an `if/else` that skipped force transformation AND FIRE mixing for ALL systems whenever ANY system had NaN velocities. When the InFlightAutoBatcher swaps in a new state (NaN velocities from fire_init), retained systems got their FIRE state (dt, alpha, n_pos, velocity mixing) skipped entirely. Proven with a clone-and-compare test: on main, system 0's positions, dt, alpha, and n_pos all differ when system 1 has NaN velocities injected. With the fix, all diffs are exactly 0. Fix: decouple NaN zeroing from the FIRE logic branch. Only skip FIRE mixing when ALL velocities are NaN (first step, matching ASE behavior). When a subset has NaN (autobatcher swap), zero them and proceed with normal FIRE logic — newly zeroed systems naturally get power=0 → negative mask → dt decrease and velocity reset. Note: this is separate from Killian's FixSymmetry NaN error, which also reproduces with LBFGS and appears to be an autobatcher-level issue.

Use nan_to_num_() instead of conditional masked zeroing in _ase_fire_step, fix _vv_fire_step using zeros_like(positions) instead of velocities, and add unit cell filter to NaN isolation test parametrization.

janosh · 2026-03-04T18:23:36Z

failing CI is intermittent MACE download issue. @orionarcher do you want to review? if not, @abhijeetgangan feel free to merge?

janosh force-pushed the fix-ase-fire-nan-velocity-autobatcher branch from 2154c6c to caa7482 Compare March 4, 2026 17:05

janosh changed the title ~~Fix ASE FIRE NaN velocity handling breaking FixSymmetry + autobatcher~~ Fix FIRE NaN velocity sentinel corrupting retained systems during autobatcher swaps Mar 4, 2026

janosh marked this pull request as draft March 4, 2026 17:06

janosh marked this pull request as ready for review March 4, 2026 17:07

abhijeetgangan approved these changes Mar 4, 2026

View reviewed changes

Simplify NaN velocity handling and improve test coverage

a806bbf

Use nan_to_num_() instead of conditional masked zeroing in _ase_fire_step, fix _vv_fire_step using zeros_like(positions) instead of velocities, and add unit cell filter to NaN isolation test parametrization.

abhijeetgangan merged commit 4a45e7e into main Mar 4, 2026
129 of 130 checks passed

abhijeetgangan deleted the fix-ase-fire-nan-velocity-autobatcher branch March 4, 2026 20:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix FIRE NaN velocity sentinel corrupting retained systems during autobatcher swaps#490

Fix FIRE NaN velocity sentinel corrupting retained systems during autobatcher swaps#490
abhijeetgangan merged 2 commits intomainfrom
fix-ase-fire-nan-velocity-autobatcher

janosh commented Mar 4, 2026 •

edited

Loading

Uh oh!

janosh commented Mar 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

janosh commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Evidence

Fix

Uh oh!

janosh commented Mar 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

janosh commented Mar 4, 2026 •

edited

Loading