Description
Issue Summary
I'm seeing a significant performance discrepancy between Windows and Linux when building engines through TensorRT Model Optimizer's DeviceModel. The difference in compilation time is extreme:
- Windows (local machine, Python → ModelOpt API): ~2 minutes
- Linux (server, Python → ModelOpt API): ~3 hours
- Direct trtexec on the same Linux machine (CLI): ~10 minutes
Environment Details
Windows Environment (Fast)
- CPU: Intel Core i9-9900X (12 cores)
- GPU: RTX 5080
- CUDA: 12.9
- TensorRT: 10.12.0.36
- TRT-Model-Optimizer: 0.35.0
- Driver: 570+
- Compilation Time: ~2 minutes
Linux Environment (Slow)
- CPU: Intel(R) Xeon(R) Platinum 8457C (32 cores)
- GPU: NVIDIA H20
- CUDA: 12.2
- TensorRT: 10.12.0.36
- TRT-Model-Optimizer: 0.29.0
- Driver: 535.161.08 (limited to this version due to corporate policy)
- Compilation Time: ~3 hours
Code Reference
The issue appears to be related to the TensorRT engine builder implementation:
https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py#L121C1-L203C19
Detailed Problem Description
1. Model Optimizer Performance Issue
When using modelopt.torch._deploy with the following configuration:

```python
from modelopt.torch._deploy._runtime import RuntimeRegistry  # import path assumed; may vary by version

self.deployment = {
    "runtime": "TRT",
    "precision": "stronglyTyped",
}

# Later in the code:
client = RuntimeRegistry.get(self.deployment)
compiled_model = client.ir_to_compiled(onnx_bytes)
```
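For reference, the build time on each platform can be measured with a simple wall-clock wrapper around the compile call (a minimal sketch; `onnx_bytes` is assumed to already hold the serialized ONNX model):

```python
import time

start = time.perf_counter()
compiled_model = client.ir_to_compiled(onnx_bytes)
print(f"Engine build took {(time.perf_counter() - start) / 60:.1f} min")
# ~2 min on the Windows machine, ~3 h on the Linux server
```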
The compilation stalls after printing:

```
[I] [TRT] Compiler backend is used during engine build.
```

and takes approximately 3 hours to complete on Linux, while the same code finishes in ~2 minutes on Windows.
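While the process sits at that log line, a periodic stack dump shows which Python frame the build is blocked in, even if it is waiting inside a native TensorRT call. A diagnostic sketch using only the standard library:

```python
import faulthandler
import sys

# Dump all thread stacks to stderr every 60 s while the build runs,
# to show where the multi-hour stall is spent.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)
compiled_model = client.ir_to_compiled(onnx_bytes)
faulthandler.cancel_dump_traceback_later()
```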
2. Direct trtexec Performance (Baseline)
When I extract and run the exact same trtexec command that Model Optimizer generates internally:

```bash
trtexec --onnx=/tmp/model.fp8.onnx \
    --stronglyTyped \
    --saveEngine=/tmp/model.fp8.engine \
    --skipInference \
    --builderOptimizationLevel=4 \
    --exportLayerInfo=/tmp/model.fp8.engine.graph.json
```
The compilation completes in ~10 minutes on the same Linux system.
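To keep the comparison apples-to-apples with the Python path, the same invocation can also be timed via subprocess, ruling out shell-environment differences (a sketch; the flags are copied verbatim from the command above):

```python
import subprocess
import time

cmd = [
    "trtexec",
    "--onnx=/tmp/model.fp8.onnx",
    "--stronglyTyped",
    "--saveEngine=/tmp/model.fp8.engine",
    "--skipInference",
    "--builderOptimizationLevel=4",
    "--exportLayerInfo=/tmp/model.fp8.engine.graph.json",
]
start = time.perf_counter()
subprocess.run(cmd, check=True)
print(f"trtexec build took {(time.perf_counter() - start) / 60:.1f} min")  # ~10 min on this host
```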
Analysis
This indicates that:
- The Linux hardware is capable of fast TensorRT compilation (~10 minutes with direct trtexec)
- Model Optimizer introduces significant overhead specifically on Linux
- The Windows installation of Model Optimizer does not show this overhead
Note that the two environments also differ in Model Optimizer version (0.35.0 vs 0.29.0), CUDA (12.9 vs 12.2), and driver (570+ vs 535.161.08), so the overhead may be tied to the older Linux stack rather than to the OS itself.
Expected Behavior
Model Optimizer should have similar performance characteristics across platforms, especially since the Linux hardware is the more powerful of the two (32-core Xeon vs 12-core i9).
Request
Could you please investigate:
- Why Model Optimizer has such different performance on Linux vs Windows?
- Whether there are Linux-specific optimizations or configurations that can be applied (version details from both machines can be gathered with the snippet below)
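To make triage easier, the following snippet can be run on both machines to collect the relevant versions (a sketch; `modelopt.__version__` is assumed to be exposed as in recent releases, and `torch` is used only for the CUDA/GPU details):

```python
import modelopt
import tensorrt
import torch

print("ModelOpt:", modelopt.__version__)
print("TensorRT:", tensorrt.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
```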
Additional Information
- Model: Custom UNet with FP8 quantization (a rough sketch of the quantization flow is at the end of this report)
- Both systems have adequate memory and disk space
- No other processes are competing for GPU resources during compilation
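For context, the FP8 ONNX model comes from a standard ModelOpt post-training quantization flow along these lines (a rough sketch; the actual UNet and calibration loop are not part of this report, so a stand-in module is used):

```python
import torch
import modelopt.torch.quantization as mtq

# Hypothetical stand-in for the custom UNet; the real model is not shown here.
model = torch.nn.Conv2d(3, 3, 3).cuda()

def forward_loop(m):
    # Calibration pass over representative inputs (shapes are illustrative).
    for _ in range(8):
        m(torch.randn(1, 3, 64, 64, device="cuda"))

# Quantize with ModelOpt's default FP8 configuration, then export to ONNX as usual.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```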