
cuda driver version #83

Closed
Kelsey2018 opened this issue Oct 24, 2023 · 6 comments
Labels: bug (Something isn't working) · triaged (Issue has been triaged by maintainers)

Comments

@Kelsey2018

I encountered this error, "ERROR: This container was built for NVIDIA Driver Release 535.86 or later, but version 520.56.06 was detected and compatibility mode is UNAVAILABLE."

So, can TensorRT-LLM support a lower driver version?

@ljayx

ljayx commented Oct 24, 2023

Have you tried the forward compatibility mode described here?
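As a quick sanity check before going the forward-compatibility route, you can compare the host driver against the release the container was built for. A minimal sketch, assuming `nvidia-smi` is on the PATH; the 535.86 threshold is taken from the error message above:

```python
# Compare the installed NVIDIA driver against the release the container
# was built for. Requires nvidia-smi on the PATH; threshold is illustrative.
import subprocess

REQUIRED = (535, 86)  # "NVIDIA Driver Release 535.86 or later" from the error

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    text=True,
).strip().splitlines()[0]

installed = tuple(int(p) for p in out.split(".")[:2])
print(f"installed driver {out}, container needs >= {REQUIRED[0]}.{REQUIRED[1]}")
if installed < REQUIRED:
    print("Driver is too old: upgrade it, or rely on CUDA forward "
          "compatibility (supported on data-center GPUs only).")
```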

@BasicCoder
Contributor

Please see #46.

@kaiyux
Member

kaiyux commented Oct 24, 2023

@Kelsey2018 We noticed that issue and fixed it in the internal codebase. We plan to release an update to the main branch by the end of this week or early next week. Until then, please try the changes in #46 as a workaround. Thank you.

@juney-nvidia added the bug and triaged labels on Oct 25, 2023
@Kelsey2018
Author

Thanks a lot for replying! The performance of TensorRT-LLM is commendable; however, I am curious about the distinctive features that set it apart from vLLM's acceleration strategy. Both approaches employ paged attention (which I understand to be the equivalent of the paged KV cache in TRT-LLM) along with dynamic batching to handle multiple requests efficiently.

Will there be additional documentation in the future that explains the unique attributes of TensorRT-LLM?
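For anyone comparing the two on the paged-KV-cache point, here is a toy sketch of the block-allocation idea both systems share. The class and method names are illustrative only, not vLLM or TensorRT-LLM internals:

```python
# Toy sketch of a paged KV cache: the cache is split into fixed-size blocks,
# and each sequence holds a list of block ids instead of one contiguous slab.
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size            # tokens stored per block
        self.free = list(range(num_blocks))     # pool of free block ids
        self.tables = {}                        # seq_id -> list of block ids
        self.lengths = {}                       # seq_id -> tokens written

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Return (block_id, offset) saying where the next token's KV goes."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:            # current block full, or none yet
            table.append(self.free.pop())       # grab a fresh block on demand
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.block_size

    def release(self, seq_id: int) -> None:
        """Sequence finished: return its blocks to the pool immediately."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are handed out on demand and returned as soon as a sequence finishes, no request reserves a contiguous max-length buffer up front, which is what lets both systems keep more sequences in flight per GPU.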

@jdemouth-nvidia
Collaborator

The most fundamental technical difference is that TensorRT-LLM relies on TensorRT, which is a graph compiler that can produce optimised kernels for your graph. As we continue to improve TensorRT, there will be less and less need for "manual" intervention to optimise new networks, in terms of kernels as well as taking advantage of numerical optimisations like INT4, INT8, or FP8. I hope that helps a bit.
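To make the "graph compiler" point concrete, here is a minimal sketch using the plain TensorRT Python API with a one-layer toy network standing in for a real LLM graph (TensorRT-LLM drives this same machinery through its own builder):

```python
# Hand TensorRT a network graph and let it compile an optimised engine.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Toy graph: a single ReLU stands in for a full transformer network.
x = network.add_input("x", trt.float32, (1, 16))
relu = network.add_activation(x, trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow the compiler to pick FP16 kernels

# Kernel selection and fusion happen here, inside the compiler.
engine = builder.build_serialized_network(network, config)
```

Lower precisions such as INT8 are likewise a builder-config concern (a flag plus calibration data) rather than hand-written kernels, which is the "less manual intervention" point above.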

@Shixiaowei02
Collaborator

This issue has been fixed in the main branch, so I will close it now; feel free to reopen it anytime if you have new questions. Thanks!
