What's the issue, what's expected?: python mnist.py --enable-msamp --opt-level=O2 should work with the versions pinned in pyproject.toml. Specifically, it should work with torch==2.2.1, given that torch is unpinned.
How to reproduce it?:
Build MS-AMP with torch==2.2.1, then run python mnist.py --enable-msamp --opt-level=O2.
Log message or snapshot?:
$ python mnist.py --enable-msamp --opt-level=O2
[2024-03-05 14:56:15,819] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
msamp is enabled, opt_level: O2
Traceback (most recent call last):
File "/home/a/MS-AMP/examples/mnist.py", line 185, in <module>
    main()
File "/home/a/MS-AMP/examples/mnist.py", line 176, in main
train(args, model, device, train_loader, optimizer, epoch)
File "/home/a/MS-AMP/examples/mnist.py", line 73, in train
scaler.step(optimizer)
File "/home/a/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 447, in step
self.unscale_(optimizer)
File "/home/a/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 337, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
File "/home/a/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 255, in _unscale_grads_
assert isinstance(param, torch.Tensor)
AssertionError
Additional information:
This occurs because optimizer.param_groups[:,'params'] contains ScalingParameters. ScalingParameter subclasses ScalingTensor, which subclasses nothing, so the isinstance check fails.
Commenting out the assertion line manually fixes the issue. I do not know how to reasonably fix this without resorting to that.
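For illustration, here is a minimal self-contained sketch (using stand-in classes, not the real torch or MS-AMP types) of why a duck-typed tensor wrapper trips an isinstance assertion when it does not inherit from the framework's Tensor class:

```python
# Stand-in for torch.Tensor (illustrative only).
class Tensor:
    pass


# Stand-in for MS-AMP's ScalingTensor: note it has no Tensor base class,
# so it can mimic a tensor's interface yet still fail isinstance checks.
class ScalingTensorSketch:
    def __init__(self, data):
        self.data = data


param = ScalingTensorSketch([1.0, 2.0])

# This mirrors the failing line in torch/cuda/amp/grad_scaler.py:
#     assert isinstance(param, torch.Tensor)
assert not isinstance(param, Tensor)  # the check the scaler relies on fails
```

Since Python's isinstance only consults the class hierarchy (or an `__instancecheck__` hook), any wrapper type that forgoes subclassing will fail this assertion no matter how tensor-like its behavior is.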