PyTorch's `torch.optim.Optimizer.zero_grad` accepts a `set_to_none` keyword argument. `FusedAdam` from TE doesn't accept this kwarg, despite inheriting from `torch.optim.Optimizer`. This breaks the base-class interface and leads to various issues. For example, `torch.distributed.checkpoint` assumes `zero_grad` accepts `set_to_none` when it initializes the optimizer states in this code line, so it is currently broken with the TE `FusedAdam` optimizer.
I understand that TE `FusedAdam` has a `set_grad_none` attribute, but its `zero_grad` method should still accept the `set_to_none` kwarg; otherwise some PyTorch functionality is broken.
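To make the request concrete, here is a minimal sketch of the kind of compatibility shim I have in mind. It is pure Python and does not import TE or PyTorch: `LegacyFusedOptimizer` is a hypothetical stand-in that only mimics the two things described above (a `set_grad_none` attribute and a `zero_grad()` with no kwargs), and `SetToNoneCompat` / `CompatOptimizer` are illustrative names, not existing TE classes. The real fix might look different.

```python
class LegacyFusedOptimizer:
    """Hypothetical stand-in: zero_grad() takes no kwargs and is driven
    by a set_grad_none attribute. This is NOT the real TE FusedAdam."""

    def __init__(self, grads, set_grad_none=True):
        # grads maps a parameter name to its gradient (list of floats or None)
        self.grads = grads
        self.set_grad_none = set_grad_none

    def zero_grad(self):
        for name, grad in self.grads.items():
            if grad is None:
                continue
            # Either drop the gradient or overwrite it with zeros
            self.grads[name] = None if self.set_grad_none else [0.0] * len(grad)


class SetToNoneCompat:
    """Mixin that lets zero_grad accept the torch-style set_to_none kwarg."""

    def zero_grad(self, set_to_none=None):
        if set_to_none is None:
            # No override given: keep the optimizer's configured behavior
            return super().zero_grad()
        previous = self.set_grad_none
        self.set_grad_none = set_to_none
        try:
            super().zero_grad()
        finally:
            # Restore the attribute so the override applies to this call only
            self.set_grad_none = previous


class CompatOptimizer(SetToNoneCompat, LegacyFusedOptimizer):
    pass


opt = CompatOptimizer({"w": [1.0, 2.0]}, set_grad_none=False)
opt.zero_grad(set_to_none=True)  # the call style that generic PyTorch code uses
```

With something like this, callers that pass `set_to_none` keep working, while existing code that relies on `set_grad_none` sees no change in behavior.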