Embedding_bag operator on GPU #3319

Open
rishucoding opened this issue Sep 13, 2023 · 7 comments

@rishucoding

Hello,

NVIDIA's MLPerf submissions suggest using the TensorRT framework for performant inference deployment. For DLRM (deep-learning-based recommendation system) inference on GPU, I have the following questions:

  • Does TensorRT modify the backend (CUDA/C++ source code) of the embedding_bag operator, or does it use the exact same vanilla PyTorch CUDA kernels? (A minimal sketch of the embedding stage I am referring to is included below.)

  • What are the benefits of using vanilla PyTorch over TensorRT for DLRM inference?
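
For reference, this is the kind of embedding stage I mean. It is only a minimal PyTorch sketch; the table size, embedding dimension, and indices are illustrative and not taken from the MLPerf DLRM configuration:

```python
import torch

# Illustrative only: table size, embedding dim, and indices are made up and
# are not taken from the MLPerf DLRM configuration.
num_embeddings, dim = 100_000, 64
bag = torch.nn.EmbeddingBag(num_embeddings, dim, mode="sum")

# One sparse feature: a flat index list plus per-sample offsets.
indices = torch.tensor([3, 17, 42, 7, 99])
offsets = torch.tensor([0, 2])   # sample 0 -> [3, 17], sample 1 -> [42, 7, 99]
pooled = bag(indices, offsets)   # shape (2, 64): one pooled vector per sample
```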

Please share your thoughts. Thanks!

@zerollzeng
Collaborator

@nvpohanh ^ ^

@zerollzeng added the triaged ("Issue has been triaged by maintainers") label Sep 17, 2023
@nvpohanh
Collaborator

For the Gather operation, TRT generates the kernel dynamically and tries to fuse it with other pointwise operations where possible. That means we do not use the same Gather kernels as PyTorch does.
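
To illustrate the decomposition being described, here is a rough eager-PyTorch sketch. It is not the kernel TensorRT generates; it only shows that the pooled embedding lookup amounts to a Gather followed by a reduction that a graph compiler is free to fuse with neighboring pointwise work:

```python
import torch

# Rough sketch only: sizes and indices are made up for illustration.
table = torch.randn(1000, 64)
indices = torch.tensor([3, 17, 42, 7, 99])
offsets = torch.tensor([0, 2])

# Fused PyTorch op: embedding_bag performs the lookup and pooling together.
fused = torch.nn.functional.embedding_bag(indices, table, offsets, mode="sum")

# Equivalent explicit form: a Gather, then a per-bag sum.
gathered = table[indices]                                    # Gather
unfused = torch.stack([gathered[0:2].sum(0), gathered[2:].sum(0)])

assert torch.allclose(fused, unfused)
```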

@nvpohanh
Collaborator

> What are the benefits of using vanilla PyTorch over TensorRT for DLRM inference?

Our MLPerf-Inference submission uses TensorRT for the DLRM benchmark: https://github.com/mlcommons/inference_results_v3.1/tree/main/closed/NVIDIA

Using TensorRT allows more aggressive fusions like Gemm+Pointwise fusions.
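
As a rough illustration of the Gemm+Pointwise pattern (layer sizes below are made up and are not taken from the MLPerf submission), each Linear followed by a ReLU is the kind of GEMM-plus-pointwise pair that such a fusion can collapse into a single kernel:

```python
import torch

# Illustrative DLRM-style bottom-MLP fragment; sizes are assumptions, not the
# MLPerf submission's network definition.
bottom_mlp = torch.nn.Sequential(
    torch.nn.Linear(13, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 64),
    torch.nn.ReLU(),
)
dense_features = torch.randn(2048, 13)
out = bottom_mlp(dense_features)   # eager PyTorch runs these as separate kernels
```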

@ttyio
Collaborator

ttyio commented Oct 10, 2023

Closing since there has been no activity for more than 3 weeks. Thanks all!

@ttyio closed this as completed Oct 10, 2023
@rishucoding
Author

Thanks @nvpohanh for the comments. Could you share the source code for the TRT implementation of the Gather kernel used in the embedding stage for DLRMs? Also, could you compare the TRT Gather kernel with the PyTorch embedding-stage CUDA kernel (link)?

@zerollzeng
Collaborator

@nvpohanh ^ ^

@zerollzeng reopened this Feb 13, 2024
@rishucoding
Author

Hi -- could you please share your comments on my follow-up question? Thanks.
