Description
Hello NVIDIA team,
Thank you for developing and open-sourcing this framework. 🙏
It seems that the current Python example for running TensorRT-LLM is not thread-safe, and I wanted to bring this up.
According to the TensorRT guidelines, a TensorRT engine is thread-safe as long as a new ExecutionContext is used for each thread. [1]
[1]: https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#thread-safety
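
For reference, here is a minimal sketch of the pattern those guidelines describe, using the TensorRT Python API: one engine shared across threads, one `IExecutionContext` per thread. The engine path and the worker body are hypothetical placeholders.

```python
import threading

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Hypothetical engine path; substitute your serialized engine file.
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

def worker():
    # Each thread creates its own execution context; the engine itself
    # is safe to share, per the TensorRT best-practices guide.
    context = engine.create_execution_context()
    ...  # run inference with this thread-local context

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```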
However, in the Python TensorRT GenerationSession [2], it seems that we are using the same ExecutionContext [3][4] for a single GenerationSession.
This would make calling the GenerationSession.decode function in a concurrent manner unsafe.
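
To illustrate what I mean, the only workaround I can see today is to serialize calls to the shared session with a lock, which of course gives up concurrency. This is just a sketch; the `decode` arguments below are stand-ins for the real signature.

```python
import threading

# One lock guarding the single shared GenerationSession (and its
# single ExecutionContext); hypothetical wrapper for illustration.
decode_lock = threading.Lock()

def safe_decode(session, *args, **kwargs):
    # Serializing decode calls avoids concurrent use of the shared
    # ExecutionContext, but removes any parallelism between requests.
    with decode_lock:
        return session.decode(*args, **kwargs)
```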
Will there be a version of GenerationSession where a new ExecutionContext is created for each call to decode?
Please let me know if my understanding is incorrect.