
SyncBatchNorm doesn't support 2-dimensional input? #194

Closed
flymark2010 opened this issue Mar 11, 2019 · 19 comments · Fixed by #590

@flymark2010

Hi,
I'm facing an issue where the program crashes when the input to SyncBatchNorm is two-dimensional. Here's the code:

import torch
import apex

model = apex.parallel.SyncBatchNorm(4).cuda()
data = torch.rand((8,4)).cuda()
output = model(data)

When running the code, the following error is raised:

Traceback (most recent call last):
  File "syncbn_test.by", line 7, in <module>
    output = model(data)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/apex/parallel/optimized_sync_batchnorm.py", line 81, in forward
    return SyncBatchnormFunction.apply(input, self.weight, self.bias, self.running_mean, self.running_var, self.eps, self.training or not self.track_running_stats, exponential_average_factor, self.process_group, self.channel_last)
  File "/usr/local/lib/python3.5/dist-packages/apex/parallel/optimized_sync_batchnorm_kernel.py", line 27, in forward
    mean, var_biased = syncbn.welford_mean_var(input)
RuntimeError: Dimension out of range (expected to be in range of [-2, 1], but got 2) (maybe_wrap_dim at /pytorch/aten/src/ATen/core/WrapDimMinimal.h:18)

Everything runs OK when data is a 4-dimensional tensor.
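For example (a minimal sketch based on the report above; the spatial sizes are arbitrary):

import torch
import apex

# The same module with a 4-D (N, C, H, W) input runs without error.
model = apex.parallel.SyncBatchNorm(4).cuda()
data4d = torch.rand((8, 4, 16, 16)).cuda()
output = model(data4d)  # OK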

Here is my environment:

Ubuntu 16.04
Python 3.5.2
PyTorch 1.0.1, installed with "pip install torch"
apex installed with:
  pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
CUDA 10.0
NVIDIA driver 410.72
@mcarilli
Contributor

@jjsjann123 have you observed this?

@jjsjann123
Collaborator

Yeah, this is an oversight on my side.
I mistakenly thought this wasn't supported by torch.nn.BatchNormXd, but it is.

I'll patch it soon.

@flymark2010
Author

Thanks very much. Hoping for a reply once it's fixed!

@jjsjann123
Collaborator

I'll update this thread once it's fixed; it should be sometime this week (currently occupied by some other work).

@DTennant
Contributor

I'm facing the same issue; hoping you can fix this soon. Many thanks!

@jjsjann123
Collaborator

The support should be straightforward, as we already have channel_last kernels there that handle the same data strides as a 2-dimensional tensor.
I'll update later today or early tomorrow, as soon as I finish the work at hand.

jjsjann123 added a commit that referenced this issue Mar 15, 2019
  supporting 2 dimensional input, resolving issue #194

Implementation:
  for 2d input, switching channel_last flag to true for better memory access
pattern in the kernel.
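To illustrate why the channel_last path applies directly (a sketch, not apex code): a contiguous (N, C) tensor already stores channels as the fastest-moving dimension, which is the same element layout as an NHWC tensor with H = W = 1.

import torch

# A contiguous (N, C) tensor is bitwise-identical to a channel-last
# (N, H, W, C) tensor with H = W = 1, so the channel_last kernel can
# consume 2-D input without any data movement.
x2d = torch.rand(8, 4)
x4d = x2d.view(8, 1, 1, 4)              # NHWC view with H = W = 1
assert x2d.stride() == (4, 1)           # channels are contiguous
assert torch.equal(x4d.reshape(8, 4), x2d)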
@jjsjann123
Collaborator

@flymark2010 Sorry for the long wait for such a simple fix (priority issue on my side :/ )
PR with the fix has been issued. Feel free to cherry-pick; I'll push for the merge as well.

mcarilli pushed a commit that referenced this issue Mar 22, 2019
supporting 2 dimensional input, resolving issue #194

Implementation:
  for 2d input, switching channel_last flag to true for better memory access
pattern in the kernel.
@mcarilli
Contributor

Should be fixed via 0a99154.
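With a build that contains that commit, the original repro from the top of the thread should run cleanly:

import torch
import apex

# Original repro; on a build containing the fix this no longer raises.
model = apex.parallel.SyncBatchNorm(4).cuda()
data = torch.rand((8, 4)).cuda()
output = model(data)
print(output.shape)  # torch.Size([8, 4]) -- batch norm preserves input shape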

@hiyijian

I am using the latest master and got the issue below:
[screenshot of the error]
PS:
self.bn5 = SyncBatchNorm(512)

@jjsjann123
Collaborator

Thanks a lot for reporting this, @hiyijian. I should have done a better job unit-testing this.

Any chance I can have a repro? If it's something not related to 2-dimensional tensors, maybe we should open a new issue and tag me on that one :)

@zhuyingSeu

Hello, I'm running into the same problem.
apex was installed on 16 Sep.

import torch
import apex
model = apex.parallel.SyncBatchNorm(4).cuda()
data = torch.rand((8,4)).cuda()
output = model(data)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/gpfs01/user_home/zhuying/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/gpfs01/user_home/zhuying/.conda/envs/py36/lib/python3.6/site-packages/apex/parallel/optimized_sync_batchnorm.py", line 85, in forward
    return SyncBatchnormFunction.apply(input, z, self.weight, self.bias, self.running_mean, self.running_var, self.eps, self.training or not self.track_running_stats, exponential_average_factor, self.process_group, self.channel_last, self.fuse_relu)
  File "/gpfs01/user_home/zhuying/.conda/envs/py36/lib/python3.6/site-packages/apex/parallel/optimized_sync_batchnorm_kernel.py", line 27, in forward
    mean, var_biased = syncbn.welford_mean_var(input)
RuntimeError: Dimension out of range (expected to be in range of [-2, 1], but got 2) (maybe_wrap_dim at /tmp/pip-req-build-p5q91txh/c10/core/WrapDimMinimal.h:20)

@diggerdu
Contributor

> (quoting @zhuyingSeu's repro and traceback above)

Same error.

@diggerdu
Contributor

@zhuyingSeu a workaround is to use the pure-Python fallback version.
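For example (a sketch; the module path is inferred from the optimized_sync_batchnorm.py paths in the tracebacks above, so verify it against your installed apex):

# Workaround sketch: import the pure-Python SyncBatchNorm directly,
# bypassing the CUDA-extension version. The module path is an
# assumption inferred from the tracebacks above.
from apex.parallel.sync_batchnorm import SyncBatchNorm

bn = SyncBatchNorm(4).cuda()  # drop-in for apex.parallel.SyncBatchNorm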

@tycallen

tycallen commented Nov 2, 2019

Same error, and when I try to use the Python fallback, I get another error.
The model runs fine when not using syncbn.

Traceback (most recent call last):
  File "dist_train.py", line 74, in <module>
    main()
  File "dist_train.py", line 70, in main
    num_query
  File "/root/gg/tyc/processor/processor.py", line 82, in do_train
    scores, feats = model(img, target)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 539, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/apex/parallel/distributed.py", line 560, in forward
    result = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 539, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/root/gg/tyc/model/make_model.py", line 230, in forward
    out = self.model.features(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 539, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 539, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/models/densenet.py", line 74, in forward
    new_features = layer(*features)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 539, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/models/densenet.py", line 50, in forward
    bottleneck_output = bn_function(*prev_features)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/models/densenet.py", line 23, in bn_function
    bottleneck_output = conv(relu(norm(concated_features)))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 539, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 343, in forward
    return self.conv2d_forward(input, self.weight)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 340, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

@jjsjann123
Collaborator

Oops, I somehow missed this thread; let me take a look.

@jjsjann123
Collaborator

I broke this earlier in #275 (https://github.com/NVIDIA/apex/pull/275/files#diff-f22c2ebe466a49fc0a46a5e7fabb2004R74-R85) 🤦‍♂️

Will push a fix shortly
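In the meantime, a quick illustrative check (not apex code; `model` stands in for your own network) to locate which submodules hold half-precision weights, i.e. the mismatch shown in the traceback above:

import torch

# Illustrative debugging aid: list any half-precision parameters so a
# Float/Half mismatch like the one above can be traced to a module.
# `model` is a placeholder for your own network.
for mod_name, module in model.named_modules():
    for p_name, param in module.named_parameters(recurse=False):
        if param.dtype == torch.float16:
            print(f"{mod_name}.{p_name}: {param.dtype}")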

@jjsjann123
Collaborator

PR issued #590

@tycallen

tycallen commented Nov 7, 2019

> PR issued #590

It works! Thank you.

@jjsjann123
Collaborator

Glad to help and thanks for reporting the issue!
