Description
Hello NVIDIA team,
Thank you for developing and open-sourcing this framework. 🙏
It seems that the current Python example for running TensorRT-LLM is not thread-safe, and I wanted to bring this up.
According to the TensorRT guidelines, a TensorRT engine is thread-safe as long as a new ExecutionContext is used for each thread. [1]
[1]: https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#thread-safety
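
For reference, here is a minimal sketch of the pattern those guidelines describe, using the TensorRT Python API: one engine shared across threads, one `IExecutionContext` per thread. The engine path and the worker body are hypothetical placeholders.

```python
import threading

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Hypothetical engine path; substitute your serialized engine file.
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

def worker():
    # Each thread creates its own execution context; the engine itself
    # is safe to share, per the TensorRT best-practices guide.
    context = engine.create_execution_context()
    ...  # run inference with this thread-local context

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```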
However, in the Python TensorRT GenerationSession [2], it seems that we are using the same ExecutionContext [3][4] for a single GenerationSession.
This would make calling the GenerationSession.decode function in a concurrent manner unsafe.
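
To illustrate what I mean, the only workaround I can see today is to serialize calls to the shared session with a lock, which of course gives up concurrency. This is just a sketch; the `decode` arguments below are stand-ins for the real signature.

```python
import threading

# One lock guarding the single shared GenerationSession (and its
# single ExecutionContext); hypothetical wrapper for illustration.
decode_lock = threading.Lock()

def safe_decode(session, *args, **kwargs):
    # Serializing decode calls avoids concurrent use of the shared
    # ExecutionContext, but removes any parallelism between requests.
    with decode_lock:
        return session.decode(*args, **kwargs)
```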
Will there be a version of GenerationSession where a new ExecutionContext is created for each call to decode?
Please let me know if my understanding is incorrect.