Is there any plan to solve the problem with tensor number < 2^31-1? #3815

Open
Ehrich-wen opened this issue Apr 23, 2024 · 17 comments

@Ehrich-wen

The "Known Issues" section of the TensorRT 10.0 release notes states that "UNets with tensors containing >2^31 elements may fail during the engine building step." This limitation severely restricts our Stable Diffusion (SD) deployments, especially for video models. Are there any plans to solve this problem in future versions?

Environment

TensorRT Version: V10.0

NVIDIA GPU: A100/A10/H800/H100/T4/2080TI

NVIDIA Driver Version: >510

CUDA Version: 12.0

@lix19937

> UNets with tensors containing >2^31 elements may fail during the engine building step

Does 2^31 refer to the maximum value of int32? Or is the limit hard-coded in the built-in plugin/layer implementations?

@Ehrich-wen
Author

> > UNets with tensors containing >2^31 elements may fail during the engine building step
>
> Does 2^31 refer to the maximum value of int32? Or is the limit hard-coded in the built-in plugin/layer implementations?

Yes, I guess so. However, in more and more models the element count of intermediate tensors exceeds the int32 range, which forces us to look for other solutions. TensorRT's performance still makes it the best choice, so we hope NVIDIA will consider supporting this in a future version.
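To illustrate why a 2^31 - 1 ceiling is typical of signed 32-bit counters, here is a minimal Python sketch. It only demonstrates int32 wraparound in general; that TensorRT stores tensor volumes in an int32 is the guess made in this thread, not something confirmed here, and the helper name `as_int32` is made up for the illustration.

```python
import ctypes

# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1


def as_int32(n):
    """Reinterpret n as a signed 32-bit integer (ctypes wraps silently on overflow)."""
    return ctypes.c_int32(n).value


print(as_int32(INT32_MAX))      # 2147483647
print(as_int32(INT32_MAX + 1))  # -2147483648
```

An element count one past the limit wraps to a negative number, which is why code that keeps volumes in an int32 usually has to reject such tensors outright.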

@zerollzeng
Collaborator

Let me check internally.

@zerollzeng
Collaborator

This should be fixed in TRT 10.0.1, which we have just released. Could you please give it a try?

@zerollzeng self-assigned this Apr 25, 2024
@zerollzeng added the "triaged" label (Issue has been triaged by maintainers) Apr 25, 2024
@Ehrich-wen
Author

@zerollzeng I tried the V10.0.1 version, but still got this error. Am I using it incorrectly?
`&&&& RUNNING TensorRT.trtexec [TensorRT v100001] # /usr/src/tensorrt/bin/trtexec --onnx=./onnx/unet.onnx --fp16 --minShapes=x:1x8x1x72x128 --optShapes=x:1x8x16x72x128 --maxShapes=x:1x8x40x72x128 --saveEngine=unet.plan
[04/26/2024-19:28:32] [W] The engine can stream its weights but it will not at runtime because --weightStreamingBudget unset or set to 0.
[04/26/2024-19:28:32] [I] === Model Options ===
[04/26/2024-19:28:32] [I] Format: ONNX
[04/26/2024-19:28:32] [I] Model: ./onnx/unet.onnx
[04/26/2024-19:28:32] [I] Output:
[04/26/2024-19:28:32] [I] === Build Options ===
[04/26/2024-19:28:32] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[04/26/2024-19:28:32] [I] avgTiming: 8
[04/26/2024-19:28:32] [I] Precision: FP32+FP16
[04/26/2024-19:28:32] [I] LayerPrecisions:
[04/26/2024-19:28:32] [I] Layer Device Types:
[04/26/2024-19:28:32] [I] Calibration:
[04/26/2024-19:28:32] [I] Refit: Disabled
[04/26/2024-19:28:32] [I] Strip weights: Disabled
[04/26/2024-19:28:32] [I] Version Compatible: Disabled
[04/26/2024-19:28:32] [I] ONNX Plugin InstanceNorm: Disabled
[04/26/2024-19:28:32] [I] TensorRT runtime: full
[04/26/2024-19:28:32] [I] Lean DLL Path:
[04/26/2024-19:28:32] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[04/26/2024-19:28:32] [I] Exclude Lean Runtime: Disabled
[04/26/2024-19:28:32] [I] Sparsity: Disabled
[04/26/2024-19:28:32] [I] Safe mode: Disabled
[04/26/2024-19:28:32] [I] Build DLA standalone loadable: Disabled
[04/26/2024-19:28:32] [I] Allow GPU fallback for DLA: Disabled
[04/26/2024-19:28:32] [I] DirectIO mode: Disabled
[04/26/2024-19:28:32] [I] Restricted mode: Disabled
[04/26/2024-19:28:32] [I] Skip inference: Disabled
[04/26/2024-19:28:32] [I] Save engine: dynamic_unet.plan
[04/26/2024-19:28:32] [I] Load engine:
[04/26/2024-19:28:32] [I] Profiling verbosity: 0
[04/26/2024-19:28:32] [I] Tactic sources: Using default tactic sources
[04/26/2024-19:28:32] [I] timingCacheMode: local
[04/26/2024-19:28:32] [I] timingCacheFile:
[04/26/2024-19:28:32] [I] Enable Compilation Cache: Enabled
[04/26/2024-19:28:32] [I] errorOnTimingCacheMiss: Disabled
[04/26/2024-19:28:32] [I] Preview Features: Use default preview flags.
[04/26/2024-19:28:32] [I] MaxAuxStreams: -1
[04/26/2024-19:28:32] [I] BuilderOptimizationLevel: -1
[04/26/2024-19:28:32] [I] Calibration Profile Index: 0
[04/26/2024-19:28:32] [I] Weight Streaming: Enabled
[04/26/2024-19:28:32] [I] Debug Tensors:
[04/26/2024-19:28:32] [I] Input(s)s format: fp32:CHW
[04/26/2024-19:28:32] [I] Output(s)s format: fp32:CHW
[04/26/2024-19:28:32] [I] Input build shape (profile 0): x=1x8x1x72x128+1x8x16x72x128+1x8x40x72x128
[04/26/2024-19:28:32] [I] Input calibration shapes: model
[04/26/2024-19:28:32] [I] === System Options ===
[04/26/2024-19:28:32] [I] Device: 0
[04/26/2024-19:28:32] [I] DLACore:
[04/26/2024-19:28:32] [I] Plugins:
[04/26/2024-19:28:32] [I] setPluginsToSerialize:
[04/26/2024-19:28:32] [I] dynamicPlugins:
[04/26/2024-19:28:32] [I] ignoreParsedPluginLibs: 0
[04/26/2024-19:28:32] [I]
[04/26/2024-19:28:32] [I] === Inference Options ===
[04/26/2024-19:28:32] [I] Batch: Explicit
[04/26/2024-19:28:32] [I] Input inference shape : x=1x8x16x72x128
[04/26/2024-19:28:32] [I] Iterations: 10
[04/26/2024-19:28:32] [I] Duration: 3s (+ 200ms warm up)
[04/26/2024-19:28:32] [I] Sleep time: 0ms
[04/26/2024-19:28:32] [I] Idle time: 0ms
[04/26/2024-19:28:32] [I] Inference Streams: 1
[04/26/2024-19:28:32] [I] ExposeDMA: Disabled
[04/26/2024-19:28:32] [I] Data transfers: Enabled
[04/26/2024-19:28:32] [I] Spin-wait: Disabled
[04/26/2024-19:28:32] [I] Multithreading: Disabled
[04/26/2024-19:28:32] [I] CUDA Graph: Disabled
[04/26/2024-19:28:32] [I] Separate profiling: Disabled
[04/26/2024-19:28:32] [I] Time Deserialize: Disabled
[04/26/2024-19:28:32] [I] Time Refit: Disabled
[04/26/2024-19:28:32] [I] NVTX verbosity: 0
[04/26/2024-19:28:32] [I] Persistent Cache Ratio: 0
[04/26/2024-19:28:32] [I] Optimization Profile Index: 0
[04/26/2024-19:28:32] [I] Weight Streaming Budget: Disabled
[04/26/2024-19:28:32] [I] Inputs:
[04/26/2024-19:28:32] [I] Debug Tensor Save Destinations:
[04/26/2024-19:28:32] [I] === Reporting Options ===
[04/26/2024-19:28:32] [I] Verbose: Disabled
[04/26/2024-19:28:32] [I] Averages: 10 inferences
[04/26/2024-19:28:32] [I] Percentiles: 90,95,99
[04/26/2024-19:28:32] [I] Dump refittable layers:Disabled
[04/26/2024-19:28:32] [I] Dump output: Disabled
[04/26/2024-19:28:32] [I] Profile: Disabled
[04/26/2024-19:28:32] [I] Export timing to JSON file:
[04/26/2024-19:28:32] [I] Export output to JSON file:
[04/26/2024-19:28:32] [I] Export profile to JSON file:
[04/26/2024-19:28:32] [I]
[04/26/2024-19:28:32] [I] === Device Information ===
[04/26/2024-19:28:32] [I] Available Devices:
[04/26/2024-19:28:32] [I] Device 0: "NVIDIA A10" UUID: GPU-b8ea513b-1327-f986-cb26-e50f9e100686
[04/26/2024-19:28:32] [I] Device 1: "NVIDIA A10" UUID: GPU-79ebcb00-d01a-2b16-7fed-07dc154b8399
[04/26/2024-19:28:32] [I] Device 2: "NVIDIA A10" UUID: GPU-bbe1e200-4079-3c27-4c14-f0d75f1bd3ef
[04/26/2024-19:28:32] [I] Device 3: "NVIDIA A10" UUID: GPU-c1ab49e8-26c0-b11d-3098-f165a093703d
[04/26/2024-19:28:32] [I] Device 4: "NVIDIA A10" UUID: GPU-fb53045d-5a9d-5d6b-6880-cfc459a64ec7
[04/26/2024-19:28:32] [I] Device 5: "NVIDIA A10" UUID: GPU-52ccfc8a-86f2-ce71-fa18-2f282e2da474
[04/26/2024-19:28:32] [I] Device 6: "NVIDIA A10" UUID: GPU-b4756f72-fe6b-aa8a-6a85-61341b7bc00f
[04/26/2024-19:28:32] [I] Device 7: "NVIDIA A10" UUID: GPU-8cb64d70-dbfe-b159-e0d1-867223c8bbf4
[04/26/2024-19:28:32] [I] Selected Device: NVIDIA A10
[04/26/2024-19:28:32] [I] Selected Device ID: 0
[04/26/2024-19:28:32] [I] Selected Device UUID: GPU-b8ea513b-1327-f986-cb26-e50f9e100686
[04/26/2024-19:28:32] [I] Compute Capability: 8.6
[04/26/2024-19:28:32] [I] SMs: 72
[04/26/2024-19:28:32] [I] Device Global Memory: 22515 MiB
[04/26/2024-19:28:32] [I] Shared Memory per SM: 100 KiB
[04/26/2024-19:28:32] [I] Memory Bus Width: 384 bits (ECC enabled)
[04/26/2024-19:28:32] [I] Application Compute Clock Rate: 1.695 GHz
[04/26/2024-19:28:32] [I] Application Memory Clock Rate: 6.251 GHz
[04/26/2024-19:28:32] [I]
[04/26/2024-19:28:32] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[04/26/2024-19:28:32] [I]
[04/26/2024-19:28:32] [I] TensorRT version: 10.0.1
[04/26/2024-19:28:32] [I] Loading standard plugins
[04/26/2024-19:28:32] [I] [TRT] [MemUsageChange] Init CUDA: CPU +1, GPU +0, now: CPU 16, GPU 19428 (MiB)
[04/26/2024-19:28:35] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1750, GPU +312, now: CPU 1902, GPU 19740 (MiB)
[04/26/2024-19:28:35] [I] Start parsing network model.
[04/26/2024-19:28:35] [I] [TRT] ----------------------------------------------------------------
[04/26/2024-19:28:35] [I] [TRT] Input filename: ./onnx/unet.onnx
[04/26/2024-19:28:35] [I] [TRT] ONNX IR version: 0.0.8
[04/26/2024-19:28:35] [I] [TRT] Opset version: 17
[04/26/2024-19:28:35] [I] [TRT] Producer name: pytorch
[04/26/2024-19:28:35] [I] [TRT] Producer version: 2.2.1
[04/26/2024-19:28:35] [I] [TRT] Domain:
[04/26/2024-19:28:35] [I] [TRT] Model version: 0
[04/26/2024-19:28:35] [I] [TRT] Doc string:
[04/26/2024-19:28:35] [I] [TRT] ----------------------------------------------------------------
[04/26/2024-19:28:35] [W] [TRT] ModelImporter.cpp:420: Make sure input t has Int64 binding.
[04/26/2024-19:28:41] [E] Error[4]: [graphShapeAnalyzer.cpp::processCheck::872] Error Code 4: Internal Error (/input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Mul_6_output: tensor volume exceeds 2147483647, dimensions are [80,9216,9216])
[04/26/2024-19:28:41] [E] [TRT] ModelImporter.cpp:828: While parsing node number 937 [Mul -> "/input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Mul_6_output_0"]:
[04/26/2024-19:28:41] [E] [TRT] ModelImporter.cpp:831: --- Begin node ---
input: "/input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Einsum_output_0"
input: "/input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Constant_27_output_0"
output: "/input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Mul_6_output_0"
name: "/input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Mul_6"
op_type: "Mul"

[04/26/2024-19:28:41] [E] [TRT] ModelImporter.cpp:832: --- End node ---
[04/26/2024-19:28:41] [E] [TRT] ModelImporter.cpp:836: ERROR: ModelImporter.cpp:194 In function parseNode:
[6] Invalid Node - /input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Mul_6
[graphShapeAnalyzer.cpp::processCheck::872] Error Code 4: Internal Error (/input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Mul_6_output: tensor volume exceeds 2147483647, dimensions are [80,9216,9216])
[04/26/2024-19:28:41] [E] Failed to parse onnx file
[04/26/2024-19:28:42] [I] Finished parsing network model. Parse time: 6.81613
[04/26/2024-19:28:42] [E] Parsing model failed
[04/26/2024-19:28:42] [E] Failed to create engine from model or file.
[04/26/2024-19:28:42] [E] Engine set up failed`
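
The arithmetic behind the `tensor volume exceeds 2147483647` error in the log above can be checked with a short Python sketch. The dimensions `[80, 9216, 9216]` are taken from the error message. Note that 9216 equals 72 x 128, the trailing dimensions in the trtexec shapes; reading it as the attention sequence length is an assumption, and the log does not say where the leading 80 comes from.

```python
from math import prod

INT32_MAX = 2**31 - 1  # 2147483647, the bound quoted in the error message

# Dimensions reported for Mul_6_output in the log.
dims = [80, 9216, 9216]
volume = prod(dims)

print(volume)              # 6794772480
print(volume > INT32_MAX)  # True
print(72 * 128)            # 9216
```

The tensor holds roughly 3.2 times as many elements as the limit allows, so small reductions in the maximum shape would not be enough to get under it.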

@zerollzeng
Collaborator

Looks like we didn't fix it completely, could you please share the model with us? I think we need further investigation on it. Thanks!

@Ehrich-wen
Author

@zerollzeng Sorry, I can't upload the ONNX file. Here are the original project and model; please use them to verify.
https://github.com/Doubiiu/DynamiCrafter
https://huggingface.co/Doubiiu/DynamiCrafter_1024/blob/main/model.ckpt

@zerollzeng
Collaborator

Could you please upload an ONNX model converted directly from the public repo, so that it doesn't contain any sensitive information? I'm asking because I'm quite busy these days...

@dzzhang96

I have the same issue. My model has a 1x3x512x512x512 input and it exceeds the 2^31 limitation, so I failed to build the engine. @zerollzeng
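
As a sanity check on the numbers in this comment, the following Python sketch computes the element count of a 1x3x512x512x512 tensor. The input alone has 3 x 512^3 elements, which is below 2^31, so presumably an intermediate tensor is what crosses the limit. The model's actual channel counts are not given in the thread; `min_channels_to_exceed` is a hypothetical helper that only shows how many channels at full resolution would be enough.

```python
from math import prod

INT32_MAX = 2**31 - 1

input_shape = [1, 3, 512, 512, 512]
input_volume = prod(input_shape)

print(input_volume)              # 402653184
print(input_volume > INT32_MAX)  # False


def min_channels_to_exceed(spatial, limit=INT32_MAX):
    """Smallest channel count whose full-resolution feature map exceeds limit."""
    per_channel = prod(spatial)
    return limit // per_channel + 1


print(min_channels_to_exceed([512, 512, 512]))  # 16
```

Under these assumptions, any layer that outputs 16 or more channels at the full 512x512x512 resolution would produce a tensor over the limit.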

@dzzhang96

I uploaded a dummy model for testing here. @zerollzeng Thanks!

@zerollzeng
Collaborator

> I have the same issue. My model has a 1x3x512x512x512 input and it exceeds the 2^31 limitation, so I failed to build the engine. @zerollzeng

We currently do not support tensor volumes beyond the int32 range for convolution.

@dzzhang96

dzzhang96 commented Jun 7, 2024

> > I have the same issue. My model has a 1x3x512x512x512 input and it exceeds the 2^31 limitation, so I failed to build the engine. @zerollzeng
>
> We currently do not support tensor volumes beyond the int32 range for convolution.

The release notes for 10.0.1 say that the issue “UNets with tensors containing >2^31 elements may fail during the engine building step” was fixed, but it is not. @zerollzeng

@Ehrich-wen
Author

> > > I have the same issue. My model has a 1x3x512x512x512 input and it exceeds the 2^31 limitation, so I failed to build the engine. @zerollzeng
> >
> > We currently do not support tensor volumes beyond the int32 range for convolution.
>
> The release notes for 10.0.1 say that the issue “UNets with tensors containing >2^31 elements may fail during the engine building step” was fixed, but it is not. @zerollzeng

Have you tried the latest version, 10.1.0? It seems to have been resolved there.

@dzzhang96

> > > > I have the same issue. My model has a 1x3x512x512x512 input and it exceeds the 2^31 limitation, so I failed to build the engine. @zerollzeng
> > >
> > > We currently do not support tensor volumes beyond the int32 range for convolution.
> >
> > The release notes for 10.0.1 say that the issue “UNets with tensors containing >2^31 elements may fail during the engine building step” was fixed, but it is not. @zerollzeng
>
> Have you tried the latest version, 10.1.0? It seems to have been resolved there.

Really? I checked the release notes and they do not say the issue is fixed. Have you tried 10.1.0? :)

@lix19937

lix19937 commented Jun 29, 2024

TensorRT Release 10.0.1

UNets with tensors containing >2^31 elements may have failed during the engine building step.

I came across a MatMul whose element count exceeds 2^31, and it failed. I used a plugin as a workaround (WAR) to solve it.
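
The plugin used for this workaround is not shown in the thread. As a different, untested idea, the Python sketch below plans how to split an oversized tensor along its leading axis so that every piece stays within the int32 range. It uses the `[80, 9216, 9216]` shape from the earlier log. `plan_chunks` is a hypothetical helper, and whether restructuring a model this way satisfies the TensorRT builder is an assumption that has not been verified here.

```python
from math import prod

INT32_MAX = 2**31 - 1


def plan_chunks(shape, limit=INT32_MAX):
    """Return chunk sizes along axis 0 so that each chunk's volume is <= limit."""
    per_slice = prod(shape[1:])  # elements in one slice along axis 0
    if per_slice > limit:
        raise ValueError("a single slice along axis 0 already exceeds the limit")
    max_rows = limit // per_slice  # most slices that fit in one chunk
    sizes = []
    remaining = shape[0]
    while remaining > 0:
        step = min(max_rows, remaining)
        sizes.append(step)
        remaining -= step
    return sizes


print(plan_chunks([80, 9216, 9216]))  # [25, 25, 25, 5]
```

Each slice along axis 0 holds 84,934,656 elements, so at most 25 slices fit under the limit and the 80 slices need four chunks.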
