
Myelin ForeignNode takes most time at engine build & inference #2577

Closed
fxmarty opened this issue Dec 30, 2022 · 2 comments
Labels
triaged Issue has been triaged by maintainers

Comments

fxmarty commented Dec 30, 2022

Description

The model and logs can be found here (uploading): https://huggingface.co/fxmarty/bugged-myelin-tensorrt-gptj/tree/main

A netron view can be found here: http://netron.app?url=https://huggingface.co/fxmarty/bugged-myelin-tensorrt-gptj/blob/main/decoder_model.onnx

I'd be glad if you could advise on why the Myelin ForeignNode takes such a long time at engine build. Based on this answer, I should not be too worried about inference itself, but how can I be sure that the very long engine build time is expected? The Myelin ForeignNode takes ~20 s of the engine build for this small 300 MB model, but up to 500 s for GPT-J-6B. Is this expected?

The command run is:

trtexec --onnx=decoder_model.onnx --shapes=input_ids:1x10,attention_mask:1x10 --verbose --dumpProfile --profilingVerbosity=detailed > trtexec_log.txt
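
An equivalent build through the TensorRT Python API makes it easy to time the build step in isolation (a minimal sketch, assuming the TensorRT 8.4 Python bindings and the same static shapes as the trtexec call above):

# Minimal sketch: build the same engine via the TensorRT Python API and time it.
import time
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("decoder_model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# Same static shapes as --shapes=input_ids:1x10,attention_mask:1x10
profile.set_shape("input_ids", (1, 10), (1, 10), (1, 10))
profile.set_shape("attention_mask", (1, 10), (1, 10), (1, 10))
config.add_optimization_profile(profile)

start = time.perf_counter()
# The Myelin / ForeignNode compilation happens inside this call.
serialized_engine = builder.build_serialized_network(network, config)
print(f"Engine build took {time.perf_counter() - start:.1f} s")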

The profiling yields:

[12/30/2022-19:58:07] [I] === Profile (1860 iterations ) ===
[12/30/2022-19:58:07] [I]                                                                                                                       Layer   Time (ms)   Avg. Time (ms)   Median Time (ms)   Time %
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.7/attn/Range_1       10.36           0.0056             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.6/attn/Range_1        9.37           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.5/attn/Range_1        9.39           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.4/attn/Range_1        9.47           0.0051             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.3/attn/Range_1        8.99           0.0048             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.2/attn/Range_1        8.91           0.0048             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.1/attn/Range_1        8.96           0.0048             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                               /transformer/h.0/attn/Range_1        9.18           0.0049             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.7/attn/Range        9.03           0.0049             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.6/attn/Range        8.85           0.0048             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.5/attn/Range        8.77           0.0047             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.4/attn/Range        8.99           0.0048             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.3/attn/Range        8.78           0.0047             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.2/attn/Range        8.76           0.0047             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.1/attn/Range        8.81           0.0047             0.0041      0.3
[12/30/2022-19:58:07] [I]                                                                                                 /transformer/h.0/attn/Range        8.91           0.0048             0.0041      0.3
[12/30/2022-19:58:07] [I]  {ForeignNode[/transformer/h.7/attn/Constant_33_output_0 + (Unnamed Layer* 5111) [Shuffle].../transformer/h.7/attn/Cast_5]}      135.21           0.0727             0.0655      4.5
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.7/attn/Cos)        9.50           0.0051             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.7/attn/Sin)        9.33           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.6/attn/Cos)        9.43           0.0051             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.6/attn/Sin)        9.29           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.5/attn/Cos)        9.32           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.5/attn/Sin)        9.18           0.0049             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.4/attn/Cos)        9.30           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.4/attn/Sin)        9.25           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.3/attn/Cos)        9.31           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.3/attn/Sin)        9.21           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.2/attn/Cos)        9.31           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.2/attn/Sin)        9.26           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.1/attn/Cos)        9.28           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.1/attn/Sin)        9.22           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.0/attn/Cos)        9.46           0.0051             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                                              PWN(/transformer/h.0/attn/Sin)        9.24           0.0050             0.0051      0.3
[12/30/2022-19:58:07] [I]                                                                            {ForeignNode[transformer.h.0.ln_1.bias.../Cast]}     2563.58           1.3783             1.3650     85.6
[12/30/2022-19:58:07] [I]                                                                                                                       Total     2993.23           1.6093             1.5800    100.0
[12/30/2022-19:58:07] [I] 

The first foreign node seems to come from:

[image]

The second one should be:

[image]

This last matmul is huge, but I somewhat doubt it should account for the ~86% of inference time that the profile attributes to this ForeignNode, so I wonder whether there is a bug at inference as well.
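
One way to sanity-check the absolute numbers would be to time the same ONNX model in ONNX Runtime with identical shapes and compare against the TensorRT total above (a minimal sketch; int64 inputs are assumed, which is what this kind of exported model usually expects):

# Minimal sketch: baseline latency of the same ONNX model in ONNX Runtime.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "decoder_model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
feeds = {
    "input_ids": np.random.randint(0, 1000, size=(1, 10), dtype=np.int64),
    "attention_mask": np.ones((1, 10), dtype=np.int64),
}

for _ in range(10):  # warm-up
    sess.run(None, feeds)

n_iters = 100
start = time.perf_counter()
for _ in range(n_iters):
    sess.run(None, feeds)
print(f"Average latency: {(time.perf_counter() - start) / n_iters * 1000:.2f} ms")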

Environment

TensorRT Version: 8.4.3.1
NVIDIA GPU: NVIDIA GeForce RTX 3060 Laptop GPU (but could reproduce on an A100 as well)
NVIDIA Driver Version: 515.86.01
CUDA Version: 11.7
CUDNN Version: 8.7.0
Operating System: Linux 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
PyTorch Version (if applicable): 1.13.1 (version used for torch.onnx.export())
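
For completeness, the export mentioned above goes through torch.onnx.export; a simplified sketch of that path (the GPT-J config values, opset version and dynamic axes here are illustrative and may not match decoder_model.onnx exactly):

# Illustrative sketch of the export path with torch.onnx.export (PyTorch 1.13.1).
import torch
from transformers import GPTJConfig, GPTJForCausalLM

# Hypothetical small GPT-J-style config; the real ~300 MB checkpoint may differ.
model = GPTJForCausalLM(GPTJConfig(n_layer=8, n_embd=1024, n_head=16)).eval()
model.config.use_cache = False  # keep the exported graph to a single forward pass

input_ids = torch.ones(1, 10, dtype=torch.int64)
attention_mask = torch.ones(1, 10, dtype=torch.int64)

torch.onnx.export(
    model,
    (input_ids, {"attention_mask": attention_mask}),  # kwargs passed as the trailing dict
    "decoder_model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
)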

Related

#2308 #2576 huggingface/optimum#605

fxmarty changed the title from "Myelin ForeignNode takes most time at inference, seem to consume most of the model" to "Myelin ForeignNode takes most time at engine build & inference" on Dec 30, 2022
zerollzeng (Collaborator) commented

I think this is expected, but I don't know much about the details. @zhenhuaw-me may know more about this.

zerollzeng added the triaged label on Jan 3, 2023

fxmarty commented Jan 4, 2023

This was on a very small model, so it is expected that the last big matmul takes most of the time.

fxmarty closed this as completed on Jan 4, 2023