Add support for CPU and sparse tensors #252
Follow-up to #243.
Added CPU tensor support: every check for `cuda.FloatTensor` and `cuda.HalfTensor` now also accepts `FloatTensor` and `HalfTensor`. Whether a tensor lives on CPU or GPU makes no difference algorithmically.
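As a rough sketch of the type-check change (the names here are illustrative, not the exact diff): wherever a CUDA tensor type was previously required, the CPU equivalent is now allowed too.

```python
import torch

# Illustrative only -- the actual change edits the loss-scaling utilities.
# Type checks that previously accepted only CUDA float/half tensors now
# also accept their CPU counterparts; the scaling math is device-agnostic.
ALLOWED_TYPES = (
    'torch.cuda.FloatTensor', 'torch.cuda.HalfTensor',
    'torch.FloatTensor', 'torch.HalfTensor',
)

def check_type(tensor):
    if tensor.type() not in ALLOWED_TYPES:
        raise TypeError("Expected a float/half tensor (CPU or CUDA), got {}"
                        .format(tensor.type()))
```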
Added sparse tensor support: the overflow check needs a special case for sparse tensors, since the previous logic of `float(model_grad.float().sum())` does not work for sparse gradients (see the sketch below). For additional safety, since sparse tensors cannot be distributed, I also added an option to explicitly `all_reduce_overflow` so that all nodes receive the same overflow state per update. Because some parameters are not distributed in sparse use-cases (PyTorch doesn't support `DistributedDataParallel` with sparse tensors), some nodes can otherwise end up with overflow states inconsistent with other nodes, which can un-sync their parameter updates after loss scaling.
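A minimal sketch of an overflow check that special-cases sparse gradients (the helper name and exact structure are assumptions, not the PR's code): for a sparse tensor, the sum has to be taken over the backing values, since summing the sparse tensor directly the way the dense path does is not supported.

```python
import torch

def grad_has_overflow(grad):
    """Return True if grad contains inf/nan.

    Sparse gradients are handled by summing their backing value
    buffer, since float(grad.float().sum()) does not work on
    sparse tensors the way it does on dense ones.
    """
    if grad.is_sparse:
        # Operate on the dense values behind the sparse layout.
        cpu_sum = float(grad.coalesce().values().float().sum())
    else:
        cpu_sum = float(grad.float().sum())
    # A sum containing inf/-inf/nan shows up as one of these
    # (note inf - inf = nan, and nan != nan).
    return (cpu_sum in (float('inf'), -float('inf'))
            or cpu_sum != cpu_sum)
```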
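And a hedged sketch of what the opt-in overflow all-reduce might look like (the `all_reduce_overflow` name comes from this PR; the surrounding code is an assumption): each rank contributes its local overflow flag, and the max across ranks becomes the shared decision, so every node skips or applies the update together.

```python
import torch
import torch.distributed as dist

def sync_overflow(local_overflow, all_reduce_overflow=True):
    """Combine per-rank overflow flags so all ranks agree.

    If any rank overflowed, every rank sees overflow=True and skips
    the step, keeping parameters in sync even when some grads
    (e.g. sparse embedding grads) were never all-reduced themselves.
    """
    if not (all_reduce_overflow and dist.is_initialized()):
        return local_overflow
    # CPU tensor works with the gloo backend; NCCL would need a
    # CUDA tensor here instead.
    flag = torch.tensor([1.0 if local_overflow else 0.0])
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())
```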