
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc) #50

Closed
yxchng opened this issue Aug 6, 2022 · 3 comments

yxchng commented Aug 6, 2022

```
 ** On entry to SGEMM  parameter number 10 had an illegal value
Traceback (most recent call last):
  File "check_flops.py", line 34, in <module>
    model(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/tiger/convnext/models/natnext510_511.py", line 620, in forward
    x = self.forward_features(x)
  File "/opt/tiger/convnext/models/natnext510_511.py", line 616, in forward_features
    return self.features(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/tiger/convnext/models/natnext510_511.py", line 434, in forward
    new_features = layer(features)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/tiger/convnext/models/natnext510_511.py", line 392, in forward
    bottleneck_output = self.bottleneck_fn(prev_features)
  File "/opt/tiger/convnext/models/natnext510_511.py", line 349, in bottleneck_fn
    bottleneck_output = self.block(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/tiger/convnext/models/natnext510_511.py", line 128, in forward
    x = self.attn(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/tiger/convnext/natten/nattencuda.py", line 121, in forward
    qkv = self.qkv(x).reshape(B, H, W, 3, self.num_heads, self.head_dim).permute(3, 0, 4, 1, 2, 5)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/linear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
```

My input shape is torch.Size([1, 256, 14, 14]). Why am I getting this error?

yxchng commented Aug 6, 2022

Seems like it expects channels_last format. Will you support channels_first? Frequent permutation is quite a hit on performance.

alihassanijr commented Aug 7, 2022

Thank you for your interest.

That's correct: the module expects channels-last inputs, similar to many other attention implementations (although those are typically flattened across the spatial axes into 3D tensors).
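
To make the requirement concrete, here is a minimal sketch; the `Linear` below is only a stand-in for the qkv projection in nattencuda.py, not the actual module:

```python
import torch
import torch.nn as nn

# Minimal stand-in for the qkv projection inside the attention module:
# nn.Linear acts on the *last* dimension, so it needs channels-last input.
qkv = nn.Linear(256, 3 * 256)

x = torch.randn(1, 256, 14, 14)          # (B, C, H, W): last dim is 14, not 256, so qkv(x) would fail
x = x.permute(0, 2, 3, 1).contiguous()   # (B, H, W, C): last dim is now the 256 channels
out = qkv(x)                             # works: shape (1, 14, 14, 768)
print(out.shape)
```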

Since the model is not a CNN, and convolutions are the only operations in the model that require a channels-first layout, it makes sense to keep inputs channels-last and avoid frequent permutations.

Some permutation and reshape operations are unavoidable because of the multi-head split. The current structure is optimized for speed: all linear projections expect channels-last inputs, and our NA kernel expects inputs of shape Batch x Heads x Height x Width x Dim, which is again the most efficient way of computing outputs.
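
For reference, a shape trace of the qkv projection line shown in the traceback above; `num_heads=8` is an assumed value, while the channel count and spatial size match the reported input:

```python
import torch
import torch.nn as nn

B, H, W, dim, num_heads = 1, 14, 14, 256, 8   # num_heads is assumed
head_dim = dim // num_heads

x = torch.randn(B, H, W, dim)                 # channels-last input
qkv = nn.Linear(dim, 3 * dim)

# (B, H, W, 3*dim) -> (B, H, W, 3, heads, head_dim) -> (3, B, heads, H, W, head_dim)
out = qkv(x).reshape(B, H, W, 3, num_heads, head_dim).permute(3, 0, 4, 1, 2, 5)
q, k, v = out.unbind(0)                       # each: (B, heads, H, W, head_dim)
print(q.shape)                                # torch.Size([1, 8, 14, 14, 32])
```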

To be clear, channels-last is usually the more efficient format, hence the movement towards channels-last integration since torch 1.11, although other factors also affect how much of a speedup you get when switching to channels-last (to name a few: cuDNN kernels being called instead of naive ATen kernels, and architecture-specific kernels from NVIDIA packages).
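
As a side note, PyTorch's channels-last support referenced here is a memory format rather than a change of logical shape; a small sketch of the difference (not NATTEN-specific):

```python
import torch
import torch.nn as nn

# torch.channels_last is a *memory format*: the tensor still reports shape
# (B, C, H, W), but the data is stored with channels last in memory, which
# lets cuDNN pick faster convolution kernels. This is separate from the
# explicit (B, H, W, C) layout the attention module expects.
conv = nn.Conv2d(256, 256, kernel_size=3, padding=1).to(memory_format=torch.channels_last)
x = torch.randn(1, 256, 14, 14).to(memory_format=torch.channels_last)
y = conv(x)
print(y.shape)                                             # still torch.Size([1, 256, 14, 14])
print(y.is_contiguous(memory_format=torch.channels_last))  # check whether the layout survived the conv
```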

But thank you for bringing this to our attention; the shape requirements should be included in our documentation, and we will update it in future versions.

alihassanijr added the documentation label on Aug 7, 2022
alihassanijr commented

Closing this due to inactivity. If you still have questions feel free to open it back up.
