Understanding int8 vs fp16 Performance Differences with trtexec Quantization Logs #3200

@WoodieDudy

Description

Hello,

I'm trying to understand the performance difference between the fp16 and int8 quantized versions of my model using trtexec, and I'd like to know what insights I can get from the trtexec logs.

Environment details (using the pytorch:23.07-py3 Docker image):

  • TensorRT Version: v8.6.1.6
  • Driver Version: 470.82.01
  • CUDA Version: 12.1
  • GPU: V100

I've attached the logs for the trtexec commands I ran; representative command lines are sketched below.
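A representative pair of builds for this comparison might look like the following (a sketch, not the exact attached commands; model.onnx is a placeholder path, and the flags are as listed by trtexec --help in TensorRT 8.6):

trtexec --onnx=model.onnx --fp16 --verbose

trtexec --onnx=model.onnx --int8 --fp16 --verbose

For a Q/DQ (explicitly quantized) network, --int8 must be passed so that TensorRT accepts the QuantizeLinear/DequantizeLinear nodes; keeping --fp16 as well lets the layers that are not quantized run in fp16 rather than fp32.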

For context, I've provided a snippet of how I quantized and exported my model:

import os

import torch
import torch.nn as nn
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

# Monkey-patch torch.nn layers with their quantized counterparts. This
# must run *before* the model is instantiated; otherwise the model is
# built from the plain nn.Linear and no quantizers are inserted.
quant_modules.initialize()

class OneLayer(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        d_model = 512
        d_ff = 2048
        self.lin1 = nn.Linear(d_model, d_ff, bias=False)

    def forward(self, x):
        return self.lin1(x)

block = OneLayer().eval()

# Export TensorQuantizer nodes as ONNX QuantizeLinear/DequantizeLinear ops.
quant_nn.TensorQuantizer.use_fb_fake_quant = True

input_shape = (1, 128, 512)   # example shape: (batch, seq_len, d_model)
dest_dir = "."                # placeholder output directory
save_name = "one_layer.onnx"  # placeholder file name

torch.onnx.export(
    block,
    torch.rand(input_shape),
    os.path.join(dest_dir, save_name),
    opset_version=13,  # Q/DQ export needs opset >= 13
    verbose=False,
    input_names=["x"],
)
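One detail the snippet above glosses over: the TensorQuantizer modules need calibrated amax ranges before export. A minimal sketch of the usual pytorch_quantization calibration flow, assuming the default max calibrator (calibrate and data_loader are hypothetical names; data_loader is an iterable of example input tensors):

import torch
from pytorch_quantization import nn as quant_nn

def calibrate(model, data_loader):
    # Put every TensorQuantizer into calibration mode.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    # Feed a few batches so the calibrators record activation ranges.
    with torch.no_grad():
        for batch in data_loader:
            model(batch)

    # Compute amax from the collected statistics and re-enable quantization.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                # (histogram calibrators need a method argument here)
                module.load_calib_amax()
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

Without calibrated ranges, the exported Q/DQ scales won't reflect the real activation distribution, which can skew any fp16-vs-int8 comparison.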

If someone could point me to an article or resource on how to interpret trtexec logs and what insights can be extracted from them (e.g., which kernels are used), I would be very grateful.
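From what I can tell, trtexec itself can already surface which tactics (kernels) were selected and how long each layer took. A sketch of the relevant flags (names as listed by trtexec --help in 8.6; model.onnx and profile.json are placeholder paths):

trtexec --onnx=model.onnx --int8 --fp16 \
    --profilingVerbosity=detailed \
    --dumpLayerInfo --dumpProfile --separateProfileRun \
    --exportProfile=profile.json

With --profilingVerbosity=detailed, the dumped layer information includes the selected tactic names and precisions, while --dumpProfile / --exportProfile report per-layer latencies, so it's possible to check whether a given layer actually ran in int8 or fell back to fp16/fp32.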

Thank you for your assistance!

Labels: triaged (issue has been triaged by maintainers)
