
[Question] Performance Degradation: TensorRT Model Optimizer vs Direct trtexec on Linux #360

@yy6768

Description

Issue Summary

I'm experiencing a significant performance discrepancy when using TensorRT Model Optimizer's DeviceModel between Windows and Linux environments. The compilation time difference is extreme:

  • Windows (local, Python -> ModelOpt API): ~2 minutes
  • Linux (server, Python -> ModelOpt API): ~3 hours
  • Direct trtexec on Linux (CLI -> trtexec): ~10 minutes

Environment Details

Windows Environment (Fast)

  • CPU: Intel Core i9-9900X (12 cores)
  • GPU: RTX 5080
  • CUDA: 12.9
  • TensorRT: 12.0.36
  • TRT-Model-Optimizer: 0.35.0
  • Driver: 570+
  • Compilation Time: ~2 minutes

Linux Environment (Slow)

  • CPU: Intel(R) Xeon(R) Platinum 8457C (32 cores)
  • GPU: NVIDIA H20
  • CUDA: 12.2
  • TensorRT: 12.0.36
  • TRT-Model-Optimizer: 0.29.0 (see the version-check sketch after this list)
  • Driver: 535.161.08 (limited to this version due to corporate policy)
  • Compilation Time: ~3 hours
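
Since the two environments also differ in CUDA (12.9 vs. 12.2) and Model Optimizer (0.35.0 vs. 0.29.0) versions, a quick way to capture the exact toolchain on each machine for a side-by-side comparison (a minimal sketch, assuming each package exposes a standard __version__ attribute):

import torch
import tensorrt
import modelopt  # assumes modelopt exposes __version__ like most packages

# Print the versions actually loaded at runtime on each machine.
print("torch           :", torch.__version__)
print("torch CUDA      :", torch.version.cuda)
print("TensorRT        :", tensorrt.__version__)
print("Model Optimizer :", modelopt.__version__)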

Code Reference

The issue appears to be related to the TensorRT engine builder implementation:
https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py#L121C1-L203C19

Detailed Problem Description

1. Model Optimizer Performance Issue

When using modelopt._deploy with the following configuration:

# Import path assumed from the repo layout (modelopt/torch/_deploy/_runtime/).
from modelopt.torch._deploy._runtime import RuntimeRegistry

self.deployment = {
    "runtime": "TRT",
    "precision": "stronglyTyped",
}

# Later in the code: look up the TRT runtime client and compile the ONNX bytes.
client = RuntimeRegistry.get(self.deployment)
compiled_model = client.ir_to_compiled(onnx_bytes)

The compilation gets stuck at:

[I] [TRT] Compiler backend is used during engine build.

The build then takes approximately 3 hours to complete on Linux, while the same code finishes in about 2 minutes on Windows.
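
To narrow down whether those 3 hours are spent inside the TensorRT builder itself or in the Python-side _deploy layer, the compile call can be wrapped in a profiler (a minimal sketch; client and onnx_bytes are the objects from the snippet above):

import cProfile
import pstats

# Profile only the compile call so the report shows whether the time is spent
# waiting on the TensorRT builder or in Python-side overhead (ONNX handling,
# process management, etc.).
profiler = cProfile.Profile()
profiler.enable()
compiled_model = client.ir_to_compiled(onnx_bytes)
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)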

2. Direct trtexec Performance (Baseline)

When I extract and run the exact same trtexec command that the Model Optimizer generates internally:

trtexec --onnx=/tmp/model.fp8.onnx \
        --stronglyTyped \
        --saveEngine=/tmp/model.fp8.engine \
        --skipInference \
        --builderOptimizationLevel=4 \
        --exportLayerInfo=/tmp/model.fp8.engine.graph.json

The compilation completes in ~10 minutes on the same Linux system.
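
To also rule out overhead from how the command is launched, the same command can be timed from Python through a subprocess call, which is presumably close to how the engine_builder linked above invokes trtexec (a minimal sketch; paths and flags are copied from the command shown):

import subprocess
import time

cmd = [
    "trtexec",
    "--onnx=/tmp/model.fp8.onnx",
    "--stronglyTyped",
    "--saveEngine=/tmp/model.fp8.engine",
    "--skipInference",
    "--builderOptimizationLevel=4",
    "--exportLayerInfo=/tmp/model.fp8.engine.graph.json",
]

start = time.perf_counter()
# Run trtexec as a child process and capture its output for inspection.
result = subprocess.run(cmd, capture_output=True, text=True)
elapsed = time.perf_counter() - start

print(f"trtexec exited with code {result.returncode} after {elapsed / 60:.1f} min")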

Analysis

This indicates that:

  1. The Linux hardware is capable of fast TensorRT compilation (10 min with direct trtexec)
  2. The Model Optimizer introduces significant overhead specifically on Linux
  3. The Windows version of Model Optimizer doesn't have this overhead

Expected Behavior

The Model Optimizer should have similar performance characteristics across platforms, especially given that the Linux machine has more capable hardware (32-core Xeon vs. 12-core i9).

Request

Could you please investigate:

  1. Why the Model Optimizer performs so differently on Linux vs. Windows.
  2. Whether there are Linux-specific optimizations or configurations that can be applied.

Additional Information

  • Model: Custom UNet with FP8 quantization
  • Both systems have adequate memory and disk space
  • No other processes are competing for GPU resources during compilation
