Is there any plan to solve the problem with tensor number < 2^31-1? #3815

Ehrich-wen · 2024-04-23T06:45:36Z

In the "Known Issues" section of the TensorRT 10.0 release notes, it is mentioned that "UNets with tensors containing >2^31 elements may fail during the engine building step." This limitation severely restricts our SD deployments, especially for video models. Are there any plans to fix it in a future version?

Environment

TensorRT Version: V10.0

NVIDIA GPU: A100/A10/H800/H100/T4/2080TI

NVIDIA Driver Version: >510

CUDA Version: 12.0

lix19937 · 2024-04-23T14:29:09Z

> UNets with tensors containing >2^31 elements may fail during the engine building step

Does 2^31 refer to the maximum value of int32? Or is the limit hard-coded in the built-in plugin/layer implementations?

Ehrich-wen · 2024-04-24T03:33:01Z

> > UNets with tensors containing >2^31 elements may fail during the engine building step
>
> Does 2^31 refer to the maximum value of int32? Or is the limit hard-coded in the built-in plugin/layer implementations?

Yes, I guess so. However, more and more models have deep tensors whose element counts exceed the int32 range, which forces us to look for other solutions. TensorRT's excellent performance still makes it our first choice, so we hope the team will consider supporting this in a future version.

zerollzeng · 2024-04-25T15:16:15Z

Let me check internally.

zerollzeng · 2024-04-25T15:18:13Z

This should be fixed in TRT 10.0.1, which we have just released. Could you please give it a try?

zerollzeng · 2024-04-25T15:18:37Z

https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html#rel-10-0-1

Ehrich-wen · 2024-04-26T09:04:37Z

> https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html#rel-10-0-1

Awesome, thank you very much!

Ehrich-wen · 2024-04-26T11:42:08Z

@zerollzeng I tried version 10.0.1 but still got this error. Am I using it incorrectly?
`&&&& RUNNING TensorRT.trtexec [TensorRT v100001] # /usr/src/tensorrt/bin/trtexec --onnx=./onnx/unet.onnx --fp16 --minShapes=x:1x8x1x72x128 --optShapes=x:1x8x16x72x128 --maxShapes=x:1x8x40x72x128 --saveEngine=unet.plan
[04/26/2024-19:28:32] [W] The engine can stream its weights but it will not at runtime because --weightStreamingBudget unset or set to 0.
[04/26/2024-19:28:32] [I] === Model Options ===
[04/26/2024-19:28:32] [I] Format: ONNX
[04/26/2024-19:28:32] [I] Model: ./onnx/unet.onnx
[04/26/2024-19:28:32] [I] Output:
[04/26/2024-19:28:32] [I] === Build Options ===
[04/26/2024-19:28:32] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[04/26/2024-19:28:32] [I] avgTiming: 8
[04/26/2024-19:28:32] [I] Precision: FP32+FP16
[04/26/2024-19:28:32] [I] LayerPrecisions:
[04/26/2024-19:28:32] [I] Layer Device Types:
[04/26/2024-19:28:32] [I] Calibration:
[04/26/2024-19:28:32] [I] Refit: Disabled
[04/26/2024-19:28:32] [I] Strip weights: Disabled
[04/26/2024-19:28:32] [I] Version Compatible: Disabled
[04/26/2024-19:28:32] [I] ONNX Plugin InstanceNorm: Disabled
[04/26/2024-19:28:32] [I] TensorRT runtime: full
[04/26/2024-19:28:32] [I] Lean DLL Path:
[04/26/2024-19:28:32] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[04/26/2024-19:28:32] [I] Exclude Lean Runtime: Disabled
[04/26/2024-19:28:32] [I] Sparsity: Disabled
[04/26/2024-19:28:32] [I] Safe mode: Disabled
[04/26/2024-19:28:32] [I] Build DLA standalone loadable: Disabled
[04/26/2024-19:28:32] [I] Allow GPU fallback for DLA: Disabled
[04/26/2024-19:28:32] [I] DirectIO mode: Disabled
[04/26/2024-19:28:32] [I] Restricted mode: Disabled
[04/26/2024-19:28:32] [I] Skip inference: Disabled
[04/26/2024-19:28:32] [I] Save engine: dynamic_unet.plan
[04/26/2024-19:28:32] [I] Load engine:
[04/26/2024-19:28:32] [I] Profiling verbosity: 0
[04/26/2024-19:28:32] [I] Tactic sources: Using default tactic sources
[04/26/2024-19:28:32] [I] timingCacheMode: local
[04/26/2024-19:28:32] [I] timingCacheFile:
[04/26/2024-19:28:32] [I] Enable Compilation Cache: Enabled
[04/26/2024-19:28:32] [I] errorOnTimingCacheMiss: Disabled
[04/26/2024-19:28:32] [I] Preview Features: Use default preview flags.
[04/26/2024-19:28:32] [I] MaxAuxStreams: -1
[04/26/2024-19:28:32] [I] BuilderOptimizationLevel: -1
[04/26/2024-19:28:32] [I] Calibration Profile Index: 0
[04/26/2024-19:28:32] [I] Weight Streaming: Enabled
[04/26/2024-19:28:32] [I] Debug Tensors:
[04/26/2024-19:28:32] [I] Input(s)s format: fp32:CHW
[04/26/2024-19:28:32] [I] Output(s)s format: fp32:CHW
[04/26/2024-19:28:32] [I] Input build shape (profile 0): x=1x8x1x72x128+1x8x16x72x128+1x8x40x72x128
[04/26/2024-19:28:32] [I] Input calibration shapes: model
[04/26/2024-19:28:32] [I] === System Options ===
[04/26/2024-19:28:32] [I] Device: 0
[04/26/2024-19:28:32] [I] DLACore:
[04/26/2024-19:28:32] [I] Plugins:
[04/26/2024-19:28:32] [I] setPluginsToSerialize:
[04/26/2024-19:28:32] [I] dynamicPlugins:
[04/26/2024-19:28:32] [I] ignoreParsedPluginLibs: 0
[04/26/2024-19:28:32] [I]
[04/26/2024-19:28:32] [I] === Inference Options ===
[04/26/2024-19:28:32] [I] Batch: Explicit
[04/26/2024-19:28:32] [I] Input inference shape : x=1x8x16x72x128
[04/26/2024-19:28:32] [I] Iterations: 10
[04/26/2024-19:28:32] [I] Duration: 3s (+ 200ms warm up)
[04/26/2024-19:28:32] [I] Sleep time: 0ms
[04/26/2024-19:28:32] [I] Idle time: 0ms
[04/26/2024-19:28:32] [I] Inference Streams: 1
[04/26/2024-19:28:32] [I] ExposeDMA: Disabled
[04/26/2024-19:28:32] [I] Data transfers: Enabled
[04/26/2024-19:28:32] [I] Spin-wait: Disabled
[04/26/2024-19:28:32] [I] Multithreading: Disabled
[04/26/2024-19:28:32] [I] CUDA Graph: Disabled
[04/26/2024-19:28:32] [I] Separate profiling: Disabled
[04/26/2024-19:28:32] [I] Time Deserialize: Disabled
[04/26/2024-19:28:32] [I] Time Refit: Disabled
[04/26/2024-19:28:32] [I] NVTX verbosity: 0
[04/26/2024-19:28:32] [I] Persistent Cache Ratio: 0
[04/26/2024-19:28:32] [I] Optimization Profile Index: 0
[04/26/2024-19:28:32] [I] Weight Streaming Budget: Disabled
[04/26/2024-19:28:32] [I] Inputs:
[04/26/2024-19:28:32] [I] Debug Tensor Save Destinations:
[04/26/2024-19:28:32] [I] === Reporting Options ===
[04/26/2024-19:28:32] [I] Verbose: Disabled
[04/26/2024-19:28:32] [I] Averages: 10 inferences
[04/26/2024-19:28:32] [I] Percentiles: 90,95,99
[04/26/2024-19:28:32] [I] Dump refittable layers:Disabled
[04/26/2024-19:28:32] [I] Dump output: Disabled
[04/26/2024-19:28:32] [I] Profile: Disabled
[04/26/2024-19:28:32] [I] Export timing to JSON file:
[04/26/2024-19:28:32] [I] Export output to JSON file:
[04/26/2024-19:28:32] [I] Export profile to JSON file:
[04/26/2024-19:28:32] [I]
[04/26/2024-19:28:32] [I] === Device Information ===
[04/26/2024-19:28:32] [I] Available Devices:
[04/26/2024-19:28:32] [I] Device 0: "NVIDIA A10" UUID: GPU-b8ea513b-1327-f986-cb26-e50f9e100686
[04/26/2024-19:28:32] [I] Device 1: "NVIDIA A10" UUID: GPU-79ebcb00-d01a-2b16-7fed-07dc154b8399
[04/26/2024-19:28:32] [I] Device 2: "NVIDIA A10" UUID: GPU-bbe1e200-4079-3c27-4c14-f0d75f1bd3ef
[04/26/2024-19:28:32] [I] Device 3: "NVIDIA A10" UUID: GPU-c1ab49e8-26c0-b11d-3098-f165a093703d
[04/26/2024-19:28:32] [I] Device 4: "NVIDIA A10" UUID: GPU-fb53045d-5a9d-5d6b-6880-cfc459a64ec7
[04/26/2024-19:28:32] [I] Device 5: "NVIDIA A10" UUID: GPU-52ccfc8a-86f2-ce71-fa18-2f282e2da474
[04/26/2024-19:28:32] [I] Device 6: "NVIDIA A10" UUID: GPU-b4756f72-fe6b-aa8a-6a85-61341b7bc00f
[04/26/2024-19:28:32] [I] Device 7: "NVIDIA A10" UUID: GPU-8cb64d70-dbfe-b159-e0d1-867223c8bbf4
[04/26/2024-19:28:32] [I] Selected Device: NVIDIA A10
[04/26/2024-19:28:32] [I] Selected Device ID: 0
[04/26/2024-19:28:32] [I] Selected Device UUID: GPU-b8ea513b-1327-f986-cb26-e50f9e100686
[04/26/2024-19:28:32] [I] Compute Capability: 8.6
[04/26/2024-19:28:32] [I] SMs: 72
[04/26/2024-19:28:32] [I] Device Global Memory: 22515 MiB
[04/26/2024-19:28:32] [I] Shared Memory per SM: 100 KiB
[04/26/2024-19:28:32] [I] Memory Bus Width: 384 bits (ECC enabled)
[04/26/2024-19:28:32] [I] Application Compute Clock Rate: 1.695 GHz
[04/26/2024-19:28:32] [I] Application Memory Clock Rate: 6.251 GHz
[04/26/2024-19:28:32] [I]
[04/26/2024-19:28:32] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[04/26/2024-19:28:32] [I]
[04/26/2024-19:28:32] [I] TensorRT version: 10.0.1
[04/26/2024-19:28:32] [I] Loading standard plugins
[04/26/2024-19:28:32] [I] [TRT] [MemUsageChange] Init CUDA: CPU +1, GPU +0, now: CPU 16, GPU 19428 (MiB)
[04/26/2024-19:28:35] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1750, GPU +312, now: CPU 1902, GPU 19740 (MiB)
[04/26/2024-19:28:35] [I] Start parsing network model.
[04/26/2024-19:28:35] [I] [TRT] ----------------------------------------------------------------
[04/26/2024-19:28:35] [I] [TRT] Input filename: ./onnx/unet.onnx
[04/26/2024-19:28:35] [I] [TRT] ONNX IR version: 0.0.8
[04/26/2024-19:28:35] [I] [TRT] Opset version: 17
[04/26/2024-19:28:35] [I] [TRT] Producer name: pytorch
[04/26/2024-19:28:35] [I] [TRT] Producer version: 2.2.1
[04/26/2024-19:28:35] [I] [TRT] Domain:
[04/26/2024-19:28:35] [I] [TRT] Model version: 0
[04/26/2024-19:28:35] [I] [TRT] Doc string:
[04/26/2024-19:28:35] [I] [TRT] ----------------------------------------------------------------
[04/26/2024-19:28:35] [W] [TRT] ModelImporter.cpp:420: Make sure input t has Int64 binding.
[04/26/2024-19:28:41] [E] Error[4]: [graphShapeAnalyzer.cpp::processCheck::872] Error Code 4: Internal Error (/input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Mul_6_output: tensor volume exceeds 2147483647, dimensions are [80,9216,9216])
[04/26/2024-19:28:41] [E] [TRT] ModelImporter.cpp:828: While parsing node number 937 [Mul -> "/input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Mul_6_output_0"]:
[04/26/2024-19:28:41] [E] [TRT] ModelImporter.cpp:831: --- Begin node ---
input: "/input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Einsum_output_0"
input: "/input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Constant_27_output_0"
output: "/input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Mul_6_output_0"
name: "/input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Mul_6"
op_type: "Mul"

[04/26/2024-19:28:41] [E] [TRT] ModelImporter.cpp:832: --- End node ---
[04/26/2024-19:28:41] [E] [TRT] ModelImporter.cpp:836: ERROR: ModelImporter.cpp:194 In function parseNode:
[6] Invalid Node - /input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Mul_6
[graphShapeAnalyzer.cpp::processCheck::872] Error Code 4: Internal Error (/input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Mul_6_output: tensor volume exceeds 2147483647, dimensions are [80,9216,9216])
[04/26/2024-19:28:41] [E] Failed to parse onnx file
[04/26/2024-19:28:42] [I] Finished parsing network model. Parse time: 6.81613
[04/26/2024-19:28:42] [E] Parsing model failed
[04/26/2024-19:28:42] [E] Failed to create engine from model or file.
[04/26/2024-19:28:42] [E] Engine set up failed`
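
For reference, the volume in the error message can be checked by hand. The minimal sketch below multiplies the reported dimensions [80,9216,9216] and compares the product with 2147483647, the int32 maximum quoted in the log. The helper name `tensor_volume` is hypothetical and not part of any TensorRT API. The observation that 9216 equals 72 x 128 (the spatial size in the input shapes) is only an arithmetic coincidence noted here, not something the log states.

```python
import math

INT32_MAX = 2**31 - 1  # 2147483647, the limit quoted in the error message


def tensor_volume(dims):
    """Return the number of elements in a tensor with the given dimensions."""
    return math.prod(dims)


# Dimensions reported for Mul_6_output in the log above.
dims = [80, 9216, 9216]
volume = tensor_volume(dims)

print(volume)              # 6794772480
print(volume > INT32_MAX)  # True
print(72 * 128)            # 9216
```

The product is roughly 3.2 times the int32 maximum, so this tensor is well past the limit rather than marginally over it.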

zerollzeng · 2024-05-01T11:59:04Z

It looks like we didn't fix it completely. Could you please share the model with us? I think this needs further investigation. Thanks!

Ehrich-wen · 2024-05-06T02:34:27Z

@zerollzeng Sorry, I can't upload the ONNX file. Here are the original project and model; please verify with them.
https://github.com/Doubiiu/DynamiCrafter
https://huggingface.co/Doubiiu/DynamiCrafter_1024/blob/main/model.ckpt

zerollzeng · 2024-05-19T03:02:58Z

Could you please upload an ONNX model converted directly from the public repo, so that it contains no sensitive information? I ask because I'm quite busy these days...

dzzhang96 · 2024-05-22T23:39:50Z

I have the same issue. My model has a 1x3x512x512x512 input, and it exceeds the 2^31 limit. I failed to build the engine. @zerollzeng
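
A quick calculation, offered as a sketch: the 1x3x512x512x512 input by itself holds 402,653,184 elements, which is below 2^31, so the oversized tensor is presumably an intermediate feature map rather than the input. That is an assumption; the comment does not say which tensor failed. The snippet below computes the smallest channel count at which a 512x512x512 volume crosses the int32 limit. `min_channels_over_limit` is a hypothetical helper, not a TensorRT function.

```python
import math

INT32_MAX = 2**31 - 1  # 2147483647


def min_channels_over_limit(spatial_dims, limit=INT32_MAX):
    """Smallest channel count C such that C * prod(spatial_dims) > limit."""
    per_channel = math.prod(spatial_dims)
    return limit // per_channel + 1


input_volume = math.prod([1, 3, 512, 512, 512])
min_channels = min_channels_over_limit([512, 512, 512])

print(input_volume)              # 402653184
print(input_volume > INT32_MAX)  # False
print(min_channels)              # 16
```

Under this arithmetic, any full-resolution activation with 16 or more channels would exceed the limit.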

dzzhang96 · 2024-05-22T23:58:13Z

I uploaded a dummy model for testing here. @zerollzeng Thanks!

zerollzeng · 2024-06-07T09:24:54Z

> I have the same issue. My model has a 1x3x512x512x512 input, and it exceeds the 2^31 limit. I failed to build the engine. @zerollzeng

We currently do not support convolutions on tensors whose volume exceeds the int32 range.

dzzhang96 · 2024-06-07T09:29:59Z

> > I have the same issue. My model has a 1x3x512x512x512 input, and it exceeds the 2^31 limit. I failed to build the engine. @zerollzeng
>
> We currently do not support convolutions on tensors whose volume exceeds the int32 range.

The 10.0.1 release notes say that the issue “UNets with tensors containing >2^31 elements may fail during the engine building step” was fixed, but it is not. @zerollzeng

Ehrich-wen · 2024-06-18T12:23:45Z

> > > I have the same issue. My model has a 1x3x512x512x512 input, and it exceeds the 2^31 limit. I failed to build the engine. @zerollzeng
> >
> > We currently do not support convolutions on tensors whose volume exceeds the int32 range.
>
> The 10.0.1 release notes say that the issue “UNets with tensors containing >2^31 elements may fail during the engine building step” was fixed, but it is not. @zerollzeng

Have you tried the latest version, 10.1.0? It seems to have been resolved there.

dzzhang96 · 2024-06-27T18:05:04Z

> > > > I have the same issue. My model has a 1x3x512x512x512 input, and it exceeds the 2^31 limit. I failed to build the engine. @zerollzeng
> > >
> > > We currently do not support convolutions on tensors whose volume exceeds the int32 range.
> >
> > The 10.0.1 release notes say that the issue “UNets with tensors containing >2^31 elements may fail during the engine building step” was fixed, but it is not. @zerollzeng
>
> Have you tried the latest version, 10.1.0? It seems to have been resolved there.

Really? I checked the release notes and they do not say the issue is fixed. Have you tried 10.1.0? :)

lix19937 · 2024-06-29T08:12:34Z

> TensorRT Release 10.0.1
>
> UNets with tensors containing >2^31 elements may have failed during the engine building step.

I ran into a matmul whose element count exceeds 2^31, and it failed. I used a plugin as a workaround (WAR) to solve it.
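
The comment above used a custom plugin as the workaround. A different idea, shown here only as a hedged sketch and not something this thread confirms works with TensorRT, is to split the leading (batch) dimension of the oversized matmul result so that each piece stays under the int32 limit. `plan_chunks` is a hypothetical helper that only does the arithmetic; it does not modify a model or call any TensorRT API.

```python
import math

INT32_MAX = 2**31 - 1  # 2147483647


def plan_chunks(dims, limit=INT32_MAX):
    """Split dims[0] into chunk sizes so each chunk's volume is <= limit."""
    per_slice = math.prod(dims[1:])
    if per_slice > limit:
        raise ValueError("a single slice already exceeds the limit")
    max_slices = limit // per_slice           # slices that fit in one chunk
    full, rest = divmod(dims[0], max_slices)  # full chunks plus a remainder
    return [max_slices] * full + ([rest] if rest else [])


# Shape taken from the error message earlier in the thread.
chunks = plan_chunks([80, 9216, 9216])
print(chunks)  # [25, 25, 25, 5]
```

For the [80,9216,9216] tensor from the earlier log, at most 25 slices of 9216x9216 fit under the limit, so the leading dimension would need at least four pieces.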

zerollzeng self-assigned this Apr 25, 2024

zerollzeng added the triaged Issue has been triaged by maintainers label Apr 25, 2024
