
[Question] Performance Degradation: TensorRT Model Optimizer vs Direct trtexec on Linux #360

@yy6768

Description

Issue Summary

I'm experiencing a significant performance discrepancy when using TensorRT Model Optimizer's DeviceModel between Windows and Linux environments. The compilation time difference is extreme:

  • Windows (local, Python -> ModelOpt API): ~2 minutes
  • Linux (server, Python -> ModelOpt API): ~3 hours
  • Direct trtexec on Linux (CLI -> trtexec): ~10 minutes

Environment Details

Windows Environment (Fast)

  • CPU: Intel Core i9-9900X (12 cores)
  • GPU: RTX 5080
  • CUDA: 12.9
  • TensorRT: 12.0.36
  • TRT-Model-Optimizer: 0.35.0
  • Driver: 570+
  • Compilation Time: ~2 minutes

Linux Environment (Slow)

  • CPU: Intel(R) Xeon(R) Platinum 8457C (32 cores)
  • GPU: NVIDIA H20
  • CUDA: 12.2
  • TensorRT: 12.0.36
  • TRT-Model-Optimizer: 0.29.0 (see the version-check sketch after this list)
  • Driver: 535.161.08 (limited to this version due to corporate policy)
  • Compilation Time: ~3 hours
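
Since the two environments also differ in CUDA (12.9 vs. 12.2) and Model Optimizer (0.35.0 vs. 0.29.0) versions, a quick way to capture the exact toolchain on each machine for a side-by-side comparison (a minimal sketch, assuming each package exposes a standard __version__ attribute):

import torch
import tensorrt
import modelopt  # assumes modelopt exposes __version__ like most packages

# Print the versions actually loaded at runtime on each machine.
print("torch           :", torch.__version__)
print("torch CUDA      :", torch.version.cuda)
print("TensorRT        :", tensorrt.__version__)
print("Model Optimizer :", modelopt.__version__)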

Code Reference

The issue appears to be related to the TensorRT engine builder implementation:
https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py#L121C1-L203C19

Detailed Problem Description

1. Model Optimizer Performance Issue

When using modelopt._deploy with the following configuration:

# Import path assumed from the repo layout (modelopt/torch/_deploy/_runtime/).
from modelopt.torch._deploy._runtime import RuntimeRegistry

self.deployment = {
    "runtime": "TRT",
    "precision": "stronglyTyped",
}

# Later in the code: look up the TRT runtime client and compile the ONNX bytes.
client = RuntimeRegistry.get(self.deployment)
compiled_model = client.ir_to_compiled(onnx_bytes)

The compilation gets stuck at:

[I] [TRT] Compiler backend is used during engine build.

The build then takes approximately 3 hours to complete on Linux, while the same code finishes in about 2 minutes on Windows.
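
To narrow down whether those 3 hours are spent inside the TensorRT builder itself or in the Python-side _deploy layer, the compile call can be wrapped in a profiler (a minimal sketch; client and onnx_bytes are the objects from the snippet above):

import cProfile
import pstats

# Profile only the compile call so the report shows whether the time is spent
# waiting on the TensorRT builder or in Python-side overhead (ONNX handling,
# process management, etc.).
profiler = cProfile.Profile()
profiler.enable()
compiled_model = client.ir_to_compiled(onnx_bytes)
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)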

2. Direct trtexec Performance (Baseline)

When I extract and run the exact same trtexec command that the Model Optimizer generates internally:

trtexec --onnx=/tmp/model.fp8.onnx \
        --stronglyTyped \
        --saveEngine=/tmp/model.fp8.engine \
        --skipInference \
        --builderOptimizationLevel=4 \
        --exportLayerInfo=/tmp/model.fp8.engine.graph.json

The compilation completes in ~10 minutes on the same Linux system.
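
To also rule out overhead from how the command is launched, the same command can be timed from Python through a subprocess call, which is presumably close to how the engine_builder linked above invokes trtexec (a minimal sketch; paths and flags are copied from the command shown):

import subprocess
import time

cmd = [
    "trtexec",
    "--onnx=/tmp/model.fp8.onnx",
    "--stronglyTyped",
    "--saveEngine=/tmp/model.fp8.engine",
    "--skipInference",
    "--builderOptimizationLevel=4",
    "--exportLayerInfo=/tmp/model.fp8.engine.graph.json",
]

start = time.perf_counter()
# Run trtexec as a child process and capture its output for inspection.
result = subprocess.run(cmd, capture_output=True, text=True)
elapsed = time.perf_counter() - start

print(f"trtexec exited with code {result.returncode} after {elapsed / 60:.1f} min")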

Analysis

This indicates that:

  1. The Linux hardware is capable of fast TensorRT compilation (10 min with direct trtexec)
  2. The Model Optimizer introduces significant overhead specifically on Linux
  3. The Windows version of Model Optimizer doesn't have this overhead

Expected Behavior

The Model Optimizer should have similar performance characteristics across platforms, especially given that the Linux machine has more capable hardware (32-core Xeon vs. 12-core i9).

Request

Could you please investigate:

  1. Why the Model Optimizer performs so differently on Linux vs. Windows.
  2. Whether there are Linux-specific optimizations or configurations that can be applied.

Additional Information

  • Model: Custom UNet with FP8 quantization
  • Both systems have adequate memory and disk space
  • No other processes are competing for GPU resources during compilation
