
Myelin ForeignNode takes most time at engine build & inference #2577

Closed
fxmarty opened this issue Dec 30, 2022 · 2 comments
Labels
triaged Issue has been triaged by maintainers

Comments

fxmarty commented Dec 30, 2022

Description

The model and logs can be found here (uploading): https://huggingface.co/fxmarty/bugged-myelin-tensorrt-gptj/tree/main

A netron view can be found here: http://netron.app?url=https://huggingface.co/fxmarty/bugged-myelin-tensorrt-gptj/blob/main/decoder_model.onnx

I'd be glad if you could advise on why the Myelin ForeignNode takes such a long time at engine build. Based on this answer, I should not be too worried about inference itself, but how can I be sure that the very long engine build time is expected? The Myelin ForeignNode takes ~20 s of the engine build for this small 300 MB model, but up to 500 s for GPT-J-6B. Is this expected?

The command run is:

trtexec --onnx=decoder_model.onnx --shapes=input_ids:1x10,attention_mask:1x10 --verbose --dumpProfile --profilingVerbosity=detailed > trtexec_log.txt
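
An equivalent build through the TensorRT Python API makes it easy to time the build step in isolation (a minimal sketch, assuming the TensorRT 8.4 Python bindings and the same static shapes as the trtexec call above):

# Minimal sketch: build the same engine via the TensorRT Python API and time it.
import time
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("decoder_model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# Same static shapes as --shapes=input_ids:1x10,attention_mask:1x10
profile.set_shape("input_ids", (1, 10), (1, 10), (1, 10))
profile.set_shape("attention_mask", (1, 10), (1, 10), (1, 10))
config.add_optimization_profile(profile)

start = time.perf_counter()
# The Myelin / ForeignNode compilation happens inside this call.
serialized_engine = builder.build_serialized_network(network, config)
print(f"Engine build took {time.perf_counter() - start:.1f} s")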

The profiling yields:

[12/30/2022-19:58:07] [I] === Profile (1860 iterations ) ===
[12/30/2022-19:58:07] [I]                                                                                                                       Layer   Time (ms)   Avg. Time (ms)   Median Time (ms)   Time %
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.7/attn/Range_1       10.36           0.0056             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.6/attn/Range_1        9.37           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.5/attn/Range_1        9.39           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.4/attn/Range_1        9.47           0.0051             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.3/attn/Range_1        8.99           0.0048             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.2/attn/Range_1        8.91           0.0048             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.1/attn/Range_1        8.96           0.0048             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.0/attn/Range_1        9.18           0.0049             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.7/attn/Range        9.03           0.0049             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.6/attn/Range        8.85           0.0048             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.5/attn/Range        8.77           0.0047             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.4/attn/Range        8.99           0.0048             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.3/attn/Range        8.78           0.0047             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.2/attn/Range        8.76           0.0047             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.1/attn/Range        8.81           0.0047             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.0/attn/Range        8.91           0.0048             0.0041      0.3
[12/30/2022-19:58:07] [I]  {ForeignNode[/transformer/h.7/attn/Constant_33_output_0 + (Unnamed Layer* 5111) [Shuffle].../transformer/h.7/attn/Cast_5]}      135.21           0.0727             0.0655      4.5
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.7/attn/Cos)        9.50           0.0051             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.7/attn/Sin)        9.33           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.6/attn/Cos)        9.43           0.0051             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.6/attn/Sin)        9.29           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.5/attn/Cos)        9.32           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.5/attn/Sin)        9.18           0.0049             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.4/attn/Cos)        9.30           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.4/attn/Sin)        9.25           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.3/attn/Cos)        9.31           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.3/attn/Sin)        9.21           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.2/attn/Cos)        9.31           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.2/attn/Sin)        9.26           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.1/attn/Cos)        9.28           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.1/attn/Sin)        9.22           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.0/attn/Cos)        9.46           0.0051             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.0/attn/Sin)        9.24           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                            {ForeignNode[transformer.h.0.ln_1.bias.../Cast]}     2563.58           1.3783             1.3650     85.6
[12/30/2022-19:58:07] [I]                                                                                                                       Total     2993.23           1.6093             1.5800    100.0
[12/30/2022-19:58:07] [I] 

The first foreign node seems to come from:

[image]

The second one should be:

[image]

This last matmul is huge, but I somewhat doubt it should account for the ~86% of inference time that the profile attributes to this ForeignNode, so I wonder whether there is a bug at inference as well.
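
One way to sanity-check the absolute numbers would be to time the same ONNX model in ONNX Runtime with identical shapes and compare against the TensorRT total above (a minimal sketch; int64 inputs are assumed, which is what this kind of exported model usually expects):

# Minimal sketch: baseline latency of the same ONNX model in ONNX Runtime.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "decoder_model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
feeds = {
    "input_ids": np.random.randint(0, 1000, size=(1, 10), dtype=np.int64),
    "attention_mask": np.ones((1, 10), dtype=np.int64),
}

for _ in range(10):  # warm-up
    sess.run(None, feeds)

n_iters = 100
start = time.perf_counter()
for _ in range(n_iters):
    sess.run(None, feeds)
print(f"Average latency: {(time.perf_counter() - start) / n_iters * 1000:.2f} ms")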

Environment

TensorRT Version: 8.4.3.1
NVIDIA GPU: NVIDIA GeForce RTX 3060 Laptop GPU (but could reproduce on an A100 as well)
NVIDIA Driver Version: 515.86.01
CUDA Version: 11.7
CUDNN Version: 8.7.0
Operating System: Linux 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
PyTorch Version (if applicable): 1.13.1 (version used for torch.onnx.export())
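
For completeness, the export mentioned above goes through torch.onnx.export; a simplified sketch of that path (the GPT-J config values, opset version and dynamic axes here are illustrative and may not match decoder_model.onnx exactly):

# Illustrative sketch of the export path with torch.onnx.export (PyTorch 1.13.1).
import torch
from transformers import GPTJConfig, GPTJForCausalLM

# Hypothetical small GPT-J-style config; the real ~300 MB checkpoint may differ.
model = GPTJForCausalLM(GPTJConfig(n_layer=8, n_embd=1024, n_head=16)).eval()
model.config.use_cache = False  # keep the exported graph to a single forward pass

input_ids = torch.ones(1, 10, dtype=torch.int64)
attention_mask = torch.ones(1, 10, dtype=torch.int64)

torch.onnx.export(
    model,
    (input_ids, {"attention_mask": attention_mask}),  # kwargs passed as the trailing dict
    "decoder_model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
)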

Related

#2308 #2576 huggingface/optimum#605

fxmarty changed the title from "Myelin ForeignNode takes most time at inference, seem to consume most of the model" to "Myelin ForeignNode takes most time at engine build & inference" on Dec 30, 2022
zerollzeng (Collaborator) commented

I think this is expected, but I don't know much about the details. @zhenhuaw-me may know more about this.

zerollzeng added the triaged label on Jan 3, 2023

fxmarty commented Jan 4, 2023

This was on a very small model, so it is expected that the last big matmul takes most of the time.

fxmarty closed this as completed on Jan 4, 2023