[Bug] crash on poolings with larger-than-317 pool sizes #2094

Closed
ganler opened this issue Jun 26, 2022 · 11 comments
Labels: triaged (Issue has been triaged by maintainers)



ganler commented Jun 26, 2022

Description

A simple 2D pooling layer (AvgPool, MaxPool, etc.) whose kernel sizes are greater than or equal to 317 leads to a TRT crash. This can be worked around on the user side by using multiple smaller poolings (see the sketch below), but I just want to report this case to improve TRT's robustness. :-)
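For reference, a minimal sketch of the workaround mentioned above: split the 317x317 pooling into two 1-D poolings whose windows tile the input, so each kernel stays small. This is exact for MaxPool (and for AvgPool when the window covers the input exactly, as here); the SplitPoolNet name is just illustrative.

import torch

ksize = (317, 317)

class SplitPoolNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Pool over the height first, then over the width.
        self.pool_h = torch.nn.MaxPool2d(kernel_size=(ksize[0], 1))
        self.pool_w = torch.nn.MaxPool2d(kernel_size=(1, ksize[1]))

    def forward(self, x):
        return self.pool_w(self.pool_h(x))

x = torch.rand((1, 3, *ksize), dtype=torch.float32)
reference = torch.nn.MaxPool2d(kernel_size=ksize)(x)
assert torch.equal(SplitPoolNet()(x), reference)  # same result, smaller kernels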

I am new here, so I am just cc'ing some active developers: @nvpohanh @zerollzeng @ttyio

Environment

TensorRT Version: 8.4.1.5
NVIDIA GPU: 3080Ti
NVIDIA Driver Version: 510.73.08
CUDA Version: 11.6
CUDNN Version: 8.4.1
Operating System: Ubuntu 20.04
Python Version (if applicable): 3.8
PyTorch Version (if applicable): 1.13.0.dev20220624+cu113

Relevant Files

The ONNX model can be generated by the script below.

Steps To Reproduce

# Build the network and export to ONNX (opset 14)
import torch

ksize = (317, 317)

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.pool = torch.nn.MaxPool2d(kernel_size=ksize)

    def forward(self, x):
        return self.pool(x)


x = torch.rand((1, 3, *ksize), dtype=torch.float32)
torch.onnx.export(Net().eval(), x, "output.onnx", opset_version=14)

# Check if the model is strictly valid
import onnx
onnx_model = onnx.load("output.onnx")
onnx.checker.check_model(onnx_model, full_check=True)

# Compile with TensorRT; this crashes.
import tensorrt as trt
builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
network = builder.create_network(1 << (int)(
    trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 * 1 << 30)
parser = trt.OnnxParser(network, trt.Logger(trt.Logger.WARNING))
assert parser.parse(onnx_model.SerializeToString())
builder.build_engine(network, config)

Errors:

"""
[06/25/2022-18:56:24] [TRT] [E] [network.cpp::addPoolingNd::1025] Error Code 3: API Usage Error (Parameter check failed at: optimizer/api/network.cpp::addPoolingNd::1025, condition: allDimsGtEq(windowSize, 1) && volume(windowSize) < MAX_KERNEL_DIMS_PRODUCT(nbSpatialDims)
)
[1]    930974 segmentation fault (core dumped)  ipython
"""
zerollzeng added the triaged label (Issue has been triaged by maintainers) Jun 28, 2022
zerollzeng self-assigned this Jun 28, 2022
zerollzeng (Collaborator) commented Jun 28, 2022

We limit the maximum kernel volume for a 2D pooling kernel to less than 100,000 (volume(windowSize) < MAX_KERNEL_DIMS_PRODUCT(nbSpatialDims==2)). In your case 317x317=100,489, which exceeds the limit. Is this a real case in your scenario?
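For illustration, a minimal user-side guard based on the limit quoted above (the 100,000 constant is taken from this comment, not from a public TensorRT API):

import math

MAX_KERNEL_DIMS_PRODUCT_2D = 100_000  # limit quoted above; not a public constant

def pooling_window_ok(window_size):
    # True if every dim is >= 1 and the window volume stays under the limit.
    return all(d >= 1 for d in window_size) and math.prod(window_size) < MAX_KERNEL_DIMS_PRODUCT_2D

print(pooling_window_ok((316, 316)))  # True  (99,856 < 100,000)
print(pooling_window_ok((317, 317)))  # False (100,489 >= 100,000)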

ganler (Author) commented Jun 28, 2022

@zerollzeng Thanks for the explanation. This use case is not from a real-world model but from our project on automatic model generation. Thanks for the feedback; I will lower the kernel size during generation. BTW, just curious: does TensorRT's Python API intend to report such errors via exceptions rather than crashing?

@drproktor

Hi @zerollzeng,
first of all: Thank you for your work on TensorRT!

I just wanted to mention that we actually use this in a real-world segmentation network in our project. We face the problem that we get a hard crash (SEGFAULT) when the user increases the image above a certain size. Even if there is a limit on the kernel size, I would expect an exception to be thrown rather than a hard crash. The problem stems from the unchecked return value here: https://github.com/onnx/onnx-tensorrt/blob/0462dc31ae78f48744b6141ae376df1f96d3f459/onnx2trt_utils.cpp#L1511
Since I am unfamiliar with the project structure: should I post this issue in that project as well?
Moreover, is there an option to change TensorRT's behavior in the case of an error, e.g. raising exceptions instead of returning nullptr for unsupported configurations? (A small sketch of the kind of check meant here follows at the end of this comment.)

Thank you very much!
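A minimal sketch of the missing check, written against the TensorRT Python API (assuming add_pooling_nd mirrors the C++ addPoolingNd and returns None when the parameter check fails; the input name and shape are placeholders):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

inp = network.add_input("x", trt.float32, (1, 3, 317, 317))
# addPoolingNd returns nullptr (None in Python) when the window is too large;
# checking here avoids touching an invalid layer later and segfaulting.
pool = network.add_pooling_nd(inp, trt.PoolingType.MAX, trt.Dims([317, 317]))
if pool is None:
    raise RuntimeError("add_pooling_nd rejected the pooling window; "
                       "reduce the kernel size or split the pooling")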

@zerollzeng (Collaborator)

Hi @drproktor Thank you :-)

The hard crash looks like a bug to me. Could you please share a reproducer with us? Many thanks!

drproktor commented Sep 11, 2023

Hi @zerollzeng,

To reproduce, you can still use the steps described by @ganler in the initial post. It also occurs with the latest version of TensorRT. Is this sufficient?

If it helps, I would be willing to provide a patch for the problem. There are multiple places within https://github.com/onnx/onnx-tensorrt/blob/0462dc31ae78f48744b6141ae376df1f96d3f459/onnx2trt_utils.cpp
where the return value of the add*Layer functions is not checked for potential nullptrs. These should probably be caught as well?

Thank you very much!

@drproktor

@zerollzeng I created an issue in the onnx-tensorrt repository onnx/onnx-tensorrt#937

@zerollzeng (Collaborator)

> Moreover, is there an option to change TensorRT's behavior in the case of an error, e.g. raising exceptions instead of returning nullptr for unsupported configurations?

Filed internal bug 4291317 for this.

@nvpohanh (Collaborator)

If the goal is to do GlobalAvgPool or GlobalMaxPool, it is recommended to use the Reduce layer instead of the Pooling layer for better performance and better support.
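A minimal sketch of that suggestion using the TensorRT Python API (a Reduce layer over the spatial axes; the input name and shape are placeholders):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

inp = network.add_input("x", trt.float32, (1, 3, 317, 317))
# Global max pooling expressed as a reduction over H and W (axes 2 and 3 of NCHW).
spatial_axes = (1 << 2) | (1 << 3)
reduce_layer = network.add_reduce(inp, trt.ReduceOperation.MAX,
                                  spatial_axes, keep_dims=True)
network.mark_output(reduce_layer.get_output(0))  # output shape (1, 3, 1, 1)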

@oxana-nvidia (Collaborator)

The segfault issue is expected to be fixed in TRT 9.1.

@drproktor

> If the goal is to do GlobalAvgPool or GlobalMaxPool, it is recommended to use the Reduce layer instead of the Pooling layer for better performance and better support.

@nvpohanh: Thanks for the information. But in this case it's really just a very large pooling op, not a global op.

> The segfault issue is expected to be fixed in TRT 9.1.
@oxana-nvidia Good to know and thank you!

@oxana-nvidia (Collaborator)

@drproktor We are currently considering removing this limitation, but it will require more time to support. Likely it will need to wait at least until the next major release.
