Add support for CPU and sparse tensors #252

Open
azgo14 wants to merge 1 commit into master

Conversation

@azgo14 commented Apr 8, 2019

Followup to #243


1) Added CPU tensor support: every check for `cuda.FloatTensor` and `cuda.HalfTensor` now also accepts `FloatTensor` and `HalfTensor`. Whether a tensor lives on CPU or GPU should make no algorithmic difference.

2) Added sparse tensor support: the overflow check needs a special case for sparse tensors, since the previous logic of `float(model_grad.float().sum())` does not work for them. For additional safety, since sparse tensors cannot be distributed, I added an explicit `all_reduce_overflow` option to ensure all nodes see the same overflow state on every update. Because some parameters are not distributed in sparse use cases (PyTorch doesn't support DistributedDataParallel with sparse tensors), nodes could otherwise end up with inconsistent overflow states, which can un-sync their parameter updates after loss scaling. A sketch of both changes follows below.
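The following is a minimal illustrative sketch of the two changes, not the PR's actual code; the helper names (`is_supported_tensor`, `grad_has_overflow`, `sync_overflow`) and the plumbing of the `all_reduce_overflow` flag are assumptions made for the example:

```python
import torch
import torch.distributed as dist

# (1) Relaxed type check: accept CPU tensors alongside their CUDA counterparts.
# The original checks only allowed torch.cuda.FloatTensor / torch.cuda.HalfTensor.
_ALLOWED_TYPES = {
    "torch.FloatTensor", "torch.HalfTensor",
    "torch.cuda.FloatTensor", "torch.cuda.HalfTensor",
}

def is_supported_tensor(t):
    # Tensor.type() returns the fully qualified type string, e.g. "torch.FloatTensor".
    return t.type() in _ALLOWED_TYPES

# (2a) Sparse-aware overflow check.
def grad_has_overflow(model_grad):
    if model_grad.is_sparse:
        # float(model_grad.float().sum()) is not supported for sparse layouts,
        # so sum over the stored values instead.
        grad_sum = float(model_grad.coalesce().values().float().sum())
    else:
        grad_sum = float(model_grad.float().sum())
    # inf - inf yields nan, so these comparisons catch both inf and nan.
    return grad_sum == float('inf') or grad_sum == -float('inf') or grad_sum != grad_sum

# (2b) Optional all-reduce of the overflow flag across ranks.
def sync_overflow(overflow, all_reduce_overflow=False):
    # Assumes a backend that supports CPU tensors (e.g. gloo) if the flag tensor
    # lives on CPU. With all_reduce_overflow=True, every rank sees an overflow
    # if any rank does, so all ranks skip (or apply) the same update.
    if all_reduce_overflow and dist.is_available() and dist.is_initialized():
        flag = torch.tensor([1.0 if overflow else 0.0])
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)
        overflow = bool(flag.item())
    return overflow
```

Reducing with `ReduceOp.MAX` means an overflow on any one rank forces every rank to skip that step, which is what keeps loss-scaled updates synchronized when some gradients are not distributed.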
@azgo14 marked this pull request as ready for review April 8, 2019 19:25
@mcarilli (Contributor) commented Apr 8, 2019

This is great! The main shortcoming right now is that I think it will only work for a Python-only install of Apex.

If Apex is installed with the cpp extensions, it performs gradient unscaling using my custom multi-tensor kernels, which can only handle contiguous tensors. I can and will hack that logic to accept sparse tensors as well, but I need to clean up handling of fused optimizers and come up with a unified strategy for checkpointing first.
