Concurrency / thread-safety issue with Python TRT-LLM example #225

@hchoi-moveworks

Description

Hello NVIDIA team,

Thank you for developing and open-sourcing this framework. 🙏

It seems that the current Python example for running TensorRT-LLM is not thread-safe, and I wanted to bring this up.

According to the TensorRT best-practices guidelines, the TensorRT runtime is thread-safe as long as each thread uses its own ExecutionContext. [1]
[1]: https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#thread-safety
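
For reference, the safe pattern from [1] looks roughly like the sketch below: one shared engine, one ExecutionContext per thread. This is a minimal illustration using the plain TensorRT Python API, not TRT-LLM code; the engine path `"model.plan"` and the worker body are placeholders.

```python
import threading

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(path: str) -> trt.ICudaEngine:
    # One deserialized engine can be shared by all threads.
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_engine("model.plan")  # placeholder path

def worker() -> None:
    # Each thread creates its own context; this per-thread
    # ExecutionContext is what makes concurrent use of the
    # shared engine safe.
    context = engine.create_execution_context()
    ...  # bind I/O buffers and call context.execute_async_v2(...) here

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```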

However, in the Python TensorRT GenerationSession [2], it seems that the same ExecutionContext [3][4] is reused for a single GenerationSession.

This would make calling the GenerationSession::decode function concurrently unsafe.

Would there be a new GenerationSession where a new ExecutionContext is created for each call to decode?
Please let me know if my understanding is incorrect.
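
In the meantime, one conservative workaround might be to serialize decode calls on a shared session with a lock. This is a hedged sketch, not TRT-LLM API: the `session` argument and the pass-through decode signature below are assumptions for illustration.

```python
import threading

# Since the session shares one ExecutionContext, allow only one
# thread at a time to run decode() on it.
_decode_lock = threading.Lock()

def safe_decode(session, *args, **kwargs):
    # Only one thread may touch the session's ExecutionContext at a time.
    with _decode_lock:
        return session.decode(*args, **kwargs)
```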

Metadata

Labels

question (Further information is requested), triaged (Issue has been triaged by maintainers)
