[Bug] crash on poolings with larger-than-317 pool sizes #2094

Closed
ganler opened this issue Jun 26, 2022 · 11 comments
Labels: triaged (Issue has been triaged by maintainers)



ganler commented Jun 26, 2022

Description

A simple 2D pooling layer (AvgPool, MaxPool, etc.) whose kernel sizes are greater than or equal to 317 leads to a TRT crash. This can be worked around on the user side by using multiple smaller poolings (see the sketch below), but I just want to report this case to improve TRT's robustness. :-)
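For reference, a minimal sketch of the workaround mentioned above: split the 317x317 pooling into two 1-D poolings whose windows tile the input, so each kernel stays small. This is exact for MaxPool (and for AvgPool when the window covers the input exactly, as here); the SplitPoolNet name is just illustrative.

import torch

ksize = (317, 317)

class SplitPoolNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Pool over the height first, then over the width.
        self.pool_h = torch.nn.MaxPool2d(kernel_size=(ksize[0], 1))
        self.pool_w = torch.nn.MaxPool2d(kernel_size=(1, ksize[1]))

    def forward(self, x):
        return self.pool_w(self.pool_h(x))

x = torch.rand((1, 3, *ksize), dtype=torch.float32)
reference = torch.nn.MaxPool2d(kernel_size=ksize)(x)
assert torch.equal(SplitPoolNet()(x), reference)  # same result, smaller kernels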

I am new here, so I am just cc'ing some active developers: @nvpohanh @zerollzeng @ttyio

Environment

TensorRT Version: 8.4.1.5
NVIDIA GPU: 3080Ti
NVIDIA Driver Version: 510.73.08
CUDA Version: 11.6
CUDNN Version: 8.4.1
Operating System: Ubuntu 20.04
Python Version (if applicable): 3.8
PyTorch Version (if applicable): 1.13.0.dev20220624+cu113

Relevant Files

The ONNX model can be generated by the script below.

Steps To Reproduce

# Build the network and export to ONNX (opset 14)
import torch

ksize = (317, 317)

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.pool = torch.nn.MaxPool2d(kernel_size=ksize)

    def forward(self, x):
        return self.pool(x)


x = torch.rand((1, 3, *ksize), dtype=torch.float32)
torch.onnx.export(Net().eval(), x, "output.onnx", opset_version=14)

# Check if the model is strictly valid
import onnx
onnx_model = onnx.load("output.onnx")
onnx.checker.check_model(onnx_model, full_check=True)

# Compile with TensorRT; this crashes.
import tensorrt as trt
builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
network = builder.create_network(1 << (int)(
    trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 * 1 << 30)
parser = trt.OnnxParser(network, trt.Logger(trt.Logger.WARNING))
assert parser.parse(onnx_model.SerializeToString())
builder.build_engine(network, config)

Errors:

"""
[06/25/2022-18:56:24] [TRT] [E] [network.cpp::addPoolingNd::1025] Error Code 3: API Usage Error (Parameter check failed at: optimizer/api/network.cpp::addPoolingNd::1025, condition: allDimsGtEq(windowSize, 1) && volume(windowSize) < MAX_KERNEL_DIMS_PRODUCT(nbSpatialDims)
)
[1]    930974 segmentation fault (core dumped)  ipython
"""
zerollzeng added the triaged label (Issue has been triaged by maintainers) Jun 28, 2022
zerollzeng self-assigned this Jun 28, 2022
zerollzeng (Collaborator) commented Jun 28, 2022

We limit the maximum kernel volume for a 2D pooling kernel to less than 100,000 (volume(windowSize) < MAX_KERNEL_DIMS_PRODUCT(nbSpatialDims==2)). In your case 317x317=100,489, which exceeds the limit. Is this a real case in your scenario?
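For illustration, a minimal user-side guard based on the limit quoted above (the 100,000 constant is taken from this comment, not from a public TensorRT API):

import math

MAX_KERNEL_DIMS_PRODUCT_2D = 100_000  # limit quoted above; not a public constant

def pooling_window_ok(window_size):
    # True if every dim is >= 1 and the window volume stays under the limit.
    return all(d >= 1 for d in window_size) and math.prod(window_size) < MAX_KERNEL_DIMS_PRODUCT_2D

print(pooling_window_ok((316, 316)))  # True  (99,856 < 100,000)
print(pooling_window_ok((317, 317)))  # False (100,489 >= 100,000)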

ganler (Author) commented Jun 28, 2022

@zerollzeng Thanks for the explanation. This use case is not from a real-world model but from our project on automatic model generation. Thanks for the feedback; I will lower the kernel size during generation. BTW, just curious: does TensorRT's Python API intend to report such errors via exceptions rather than crashing?

@drproktor

Hi @zerollzeng,
first of all: Thank you for your work on TensorRT!

I just wanted to mention that we actually use this in a real-world segmentation network in our project. We face the problem that we get a hard crash (SEGFAULT) when the user increases the image above a certain size. Even if there is a limit on the kernel size, I would expect an exception to be thrown rather than a hard crash. The problem stems from the unchecked return value here: https://github.com/onnx/onnx-tensorrt/blob/0462dc31ae78f48744b6141ae376df1f96d3f459/onnx2trt_utils.cpp#L1511
Since I am unfamiliar with the project structure: should I post this issue in that project as well?
Moreover, is there an option to change TensorRT's behavior in the case of an error, e.g. raising exceptions instead of returning nullptr for unsupported configurations? (A small sketch of the kind of check meant here follows at the end of this comment.)

Thank you very much!
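A minimal sketch of the missing check, written against the TensorRT Python API (assuming add_pooling_nd mirrors the C++ addPoolingNd and returns None when the parameter check fails; the input name and shape are placeholders):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

inp = network.add_input("x", trt.float32, (1, 3, 317, 317))
# addPoolingNd returns nullptr (None in Python) when the window is too large;
# checking here avoids touching an invalid layer later and segfaulting.
pool = network.add_pooling_nd(inp, trt.PoolingType.MAX, trt.Dims([317, 317]))
if pool is None:
    raise RuntimeError("add_pooling_nd rejected the pooling window; "
                       "reduce the kernel size or split the pooling")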

@zerollzeng (Collaborator)

Hi @drproktor Thank you :-)

The hard crash looks like a bug to me. Could you please share a reproducer with us? Many thanks!

drproktor commented Sep 11, 2023

Hi @zerollzeng,

To reproduce, you can still use the steps described by @ganler in the initial post. It also occurs with the latest version of TensorRT. Is this sufficient?

If it helps, I would be willing to provide a patch for the problem. There are multiple places within https://github.com/onnx/onnx-tensorrt/blob/0462dc31ae78f48744b6141ae376df1f96d3f459/onnx2trt_utils.cpp
where the return value of the add*Layer functions is not checked for potential nullptrs. These should probably be caught as well?

Thank you very much!

@drproktor

@zerollzeng I created an issue in the onnx-tensorrt repository onnx/onnx-tensorrt#937

@zerollzeng (Collaborator)

> Moreover, is there an option to change TensorRT's behavior in the case of an error, e.g. raising exceptions instead of returning nullptr for unsupported configurations?

Filed internal bug 4291317 for this.

@nvpohanh (Collaborator)

If the goal is to do GlobalAvgPool or GlobalMaxPool, it is recommended to use the Reduce layer instead of the Pooling layer for better performance and better support.
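A minimal sketch of that suggestion using the TensorRT Python API (a Reduce layer over the spatial axes; the input name and shape are placeholders):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

inp = network.add_input("x", trt.float32, (1, 3, 317, 317))
# Global max pooling expressed as a reduction over H and W (axes 2 and 3 of NCHW).
spatial_axes = (1 << 2) | (1 << 3)
reduce_layer = network.add_reduce(inp, trt.ReduceOperation.MAX,
                                  spatial_axes, keep_dims=True)
network.mark_output(reduce_layer.get_output(0))  # output shape (1, 3, 1, 1)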

@oxana-nvidia (Collaborator)

The segfault issue is expected to be fixed in TRT 9.1.

@drproktor

> If the goal is to do GlobalAvgPool or GlobalMaxPool, it is recommended to use the Reduce layer instead of the Pooling layer for better performance and better support.

@nvpohanh: Thanks for the information. But in this case it's really just a very large pooling op, not a global op.

> The segfault issue is expected to be fixed in TRT 9.1.
@oxana-nvidia Good to know and thank you!

@oxana-nvidia (Collaborator)

@drproktor We are currently considering removing this limitation, but it will require more time to support. Likely it will need to wait at least until the next major release.
