Add support for CPU and sparse tensors #252

Open
azgo14 wants to merge 1 commit into master

Conversation

@azgo14 commented Apr 8, 2019

Followup to #243


1) Added CPU tensor support: every check for `cuda.FloatTensor` and `cuda.HalfTensor` now also accepts `FloatTensor` and `HalfTensor`. Whether a tensor lives on CPU or GPU should make no algorithmic difference.

2) Added sparse tensor support: the overflow check needs a special case for sparse tensors, since the previous logic of `float(model_grad.float().sum())` does not work for them. For additional safety, since sparse tensors cannot be distributed, I added an explicit `all_reduce_overflow` option to ensure all nodes see the same overflow state on every update. Because some parameters are not distributed in sparse use cases (PyTorch doesn't support DistributedDataParallel with sparse tensors), nodes could otherwise end up with inconsistent overflow states, which can un-sync their parameter updates after loss scaling. A sketch of both changes follows below.
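The following is a minimal illustrative sketch of the two changes, not the PR's actual code; the helper names (`is_supported_tensor`, `grad_has_overflow`, `sync_overflow`) and the plumbing of the `all_reduce_overflow` flag are assumptions made for the example:

```python
import torch
import torch.distributed as dist

# (1) Relaxed type check: accept CPU tensors alongside their CUDA counterparts.
# The original checks only allowed torch.cuda.FloatTensor / torch.cuda.HalfTensor.
_ALLOWED_TYPES = {
    "torch.FloatTensor", "torch.HalfTensor",
    "torch.cuda.FloatTensor", "torch.cuda.HalfTensor",
}

def is_supported_tensor(t):
    # Tensor.type() returns the fully qualified type string, e.g. "torch.FloatTensor".
    return t.type() in _ALLOWED_TYPES

# (2a) Sparse-aware overflow check.
def grad_has_overflow(model_grad):
    if model_grad.is_sparse:
        # float(model_grad.float().sum()) is not supported for sparse layouts,
        # so sum over the stored values instead.
        grad_sum = float(model_grad.coalesce().values().float().sum())
    else:
        grad_sum = float(model_grad.float().sum())
    # inf - inf yields nan, so these comparisons catch both inf and nan.
    return grad_sum == float('inf') or grad_sum == -float('inf') or grad_sum != grad_sum

# (2b) Optional all-reduce of the overflow flag across ranks.
def sync_overflow(overflow, all_reduce_overflow=False):
    # Assumes a backend that supports CPU tensors (e.g. gloo) if the flag tensor
    # lives on CPU. With all_reduce_overflow=True, every rank sees an overflow
    # if any rank does, so all ranks skip (or apply) the same update.
    if all_reduce_overflow and dist.is_available() and dist.is_initialized():
        flag = torch.tensor([1.0 if overflow else 0.0])
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)
        overflow = bool(flag.item())
    return overflow
```

Reducing with `ReduceOp.MAX` means an overflow on any one rank forces every rank to skip that step, which is what keeps loss-scaled updates synchronized when some gradients are not distributed.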
@azgo14 marked this pull request as ready for review April 8, 2019 19:25
@mcarilli (Contributor) commented Apr 8, 2019

This is great! The main shortcoming right now is that I think it will only work for a Python-only install of Apex.

If Apex is installed with the cpp extensions, it performs gradient unscaling using my custom multi-tensor kernels, which can only handle contiguous tensors. I can and will hack that logic to accept sparse tensors as well, but I need to clean up handling of fused optimizers and come up with a unified strategy for checkpointing first.
