# TensorRT-LLM
https://nvidia.github.io/TensorRT-LLM/  

TensorRT-LLM is the inference acceleration engine that NVIDIA built on top of TensorRT and optimized specifically for LLMs.

## Installation
https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/docs/source/installation.md

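## Usage
Deploying a large model with TensorRT-LLM takes roughly three steps:
- download the pretrained model weights;
- build a fully optimized engine for the model;
- deploy the engine.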

Step 0: Install the required dependencies inside the Docker container
```
pip install -r examples/bloom/requirements.txt
git lfs install
```
Step 1: Download the BLOOM-560M model from Hugging Face
```
cd examples/bloom
rm -rf ./bloom/560M
mkdir -p ./bloom/560M && git clone https://huggingface.co/bigscience/bloom-560m ./bloom/560M
```
Step 2: Build the engine
```
# Single GPU on BLOOM 560M
python build.py --model_dir ./bloom/560M/ \
                --dtype float16 \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./bloom/560M/trt_engines/fp16/1-gpu/
```
Note: for details on the parameters, see https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/examples/bloom  
Step 3: Run the engine
```
python summarize.py --test_trt_llm \
                    --hf_model_location ./bloom/560M/ \
                    --data_type fp16 \
                    --engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/
```
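Steps 2 and 3 can also be assembled from Python, which helps when the same paths are reused across builds. The sketch below only constructs and prints the two command lines; it uses exactly the flags shown above, while the helper names `build_command` and `run_command` and the constants are hypothetical and introduced here for illustration.

```python
import shlex

# Paths taken from the steps above; both are relative to examples/bloom.
MODEL_DIR = "./bloom/560M/"
ENGINE_DIR = MODEL_DIR + "trt_engines/fp16/1-gpu/"


def build_command(model_dir: str, engine_dir: str, dtype: str = "float16") -> list[str]:
    """Assemble the Step 2 build.py invocation as an argument list."""
    return [
        "python", "build.py",
        "--model_dir", model_dir,
        "--dtype", dtype,
        "--use_gemm_plugin", dtype,
        "--use_gpt_attention_plugin", dtype,
        "--output_dir", engine_dir,
    ]


def run_command(model_dir: str, engine_dir: str) -> list[str]:
    """Assemble the Step 3 summarize.py invocation as an argument list."""
    return [
        "python", "summarize.py",
        "--test_trt_llm",
        "--hf_model_location", model_dir,
        "--data_type", "fp16",
        "--engine_dir", engine_dir,
    ]


# Dry run: print the commands instead of executing them.
for argv in (build_command(MODEL_DIR, ENGINE_DIR), run_command(MODEL_DIR, ENGINE_DIR)):
    print(shlex.join(argv))
```

To actually execute a command, each list could be passed to `subprocess.run(argv, check=True, cwd="examples/bloom")`, assuming the repository root is the current directory.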

```python
# Pseudocode: the `tensorrtllm` names below are illustrative, not the real tensorrt_llm API
import tensorrtllm as trtllm

# Initialize the model
model = trtllm.LargeLanguageModel('./path_to_your_model')

# Apply kernel fusion and quantization
optimization_flags = trtllm.OptimizationFlag.FUSE_OPERATIONS | trtllm.OptimizationFlag.QUANTIZE
optimized_model = model.optimize(flags=optimization_flags)
```
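The idea behind kernel fusion can be illustrated without a GPU. In the toy sketch below, the unfused path runs a bias add and a ReLU as two separate passes with an intermediate buffer, while the fused path computes both in a single pass. This is a plain-Python analogy under simplified assumptions, not how TensorRT-LLM implements its fused kernels; all function names are made up for the example.

```python
def bias_add(xs, bias):
    """First 'kernel': add a scalar bias to every element."""
    return [x + bias for x in xs]


def relu(xs):
    """Second 'kernel': clamp negative values to zero."""
    return [max(x, 0.0) for x in xs]


def unfused(xs, bias):
    tmp = bias_add(xs, bias)  # pass 1 writes an intermediate buffer
    return relu(tmp)          # pass 2 reads it back


def fused(xs, bias):
    # Single pass: no intermediate buffer is materialized.
    return [max(x + bias, 0.0) for x in xs]


xs = [-2.0, -0.5, 0.0, 1.5]
assert unfused(xs, 0.5) == fused(xs, 0.5)
print(fused(xs, 0.5))  # [0.0, 0.0, 0.5, 2.0]
```

On a GPU the benefit comes from fewer kernel launches and less traffic to global memory, which this analogy mirrors through the missing `tmp` list.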

```python
# Enable in-flight batching and paged attention (continues the pseudocode above)
runtime_parameters = {
    'in_flight_batching': True,
    'paged_attention': True,
}

# Build the engine with these runtime optimizations
engine = optimized_model.build_engine(runtime_parameters=runtime_parameters)
```
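Paged attention stores the KV cache in fixed-size blocks and keeps a per-sequence block table, so memory is reserved only as tokens arrive and is returned as soon as a sequence finishes. The toy allocator below sketches just that bookkeeping; it holds no tensors, and the class name `PagedKVCache` and its methods are hypothetical, not part of TensorRT-LLM.

```python
class PagedKVCache:
    """Toy block-table allocator: tracks block ownership only, no tensor data."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> physical block ids
        self.lengths: dict[str, int] = {}             # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of `seq_id`."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # every owned block is full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return every block of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")  # 3 tokens need 2 blocks
cache.append_token("b")      # 1 token needs 1 block
print(cache.block_tables, "free:", cache.free_blocks)
```

Because blocks need not be contiguous, a released block can be handed to any other sequence, which is what keeps fragmentation low.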

```python
# Run inference (pseudocode; `engine` comes from the previous cell)
input_data = [...]  # your input data here
results = engine.execute_with_inflight_batching(input_data)
```
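To see why in-flight batching helps, the simulation below counts decode steps for a queue of requests under two policies. With static batching a batch runs until its longest request finishes; with in-flight batching a finished request frees its slot immediately and a waiting request takes it on the next step. The request lengths are made-up numbers, each request is assumed to need at least one step, and the scheduler is a simplified sketch rather than TensorRT-LLM's actual one.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch occupies the device until its longest request is done."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def inflight_batching_steps(lengths, batch_size):
    """Finished requests leave at once; waiting requests fill the free slots."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step for every active request
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [8, 1, 1, 1, 8, 1]  # output tokens per request (illustrative)
print(static_batching_steps(lengths, batch_size=2))    # 17
print(inflight_batching_steps(lengths, batch_size=2))  # 11
```

The gap grows with the variance of output lengths: when all requests have the same length, both policies take the same number of steps.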

## Support for multiple types of LLMs

```python
# Pseudocode: the `tensorrtllm` names below are illustrative, not the real tensorrt_llm API
import tensorrtllm as trtllm

# Define and load different LLMs
llama_model = trtllm.LargeLanguageModel('./path_to_llama_model')
chatglm_model = trtllm.LargeLanguageModel('./path_to_chatglm_model')

# Build optimized engines for different LLMs
llama_engine = llama_model.build_engine()
chatglm_engine = chatglm_model.build_engine()
```

## Lower hardware resource requirements

```python
# Pseudocode: the `tensorrtllm` names below are illustrative, not the real tensorrt_llm API
import tensorrtllm as trtllm

# Initialize the model
model = trtllm.LargeLanguageModel('./path_to_your_model')

# Optimize the model with energy-efficient settings
optimized_model = model.optimize(energy_efficient=True)

# Monitor energy consumption
energy_usage = optimized_model.monitor_energy_usage()
```
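One concrete way to reason about hardware requirements is the memory needed just to hold the weights, which scales with the number of bytes per parameter. The estimate below is a back-of-the-envelope sketch: it assumes a nominal 560 million parameters for BLOOM-560M (the exact count differs slightly) and ignores activations, the KV cache, and runtime overhead.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "fp8": 1}

NUM_PARAMS = 560_000_000  # nominal size, an assumption for illustration


def weight_bytes(num_params: int, dtype: str) -> int:
    """Memory for the weights alone, in bytes."""
    return num_params * BYTES_PER_PARAM[dtype]


for dtype in BYTES_PER_PARAM:
    gigabytes = weight_bytes(NUM_PARAMS, dtype) / 1e9
    print(f"{dtype}: {gigabytes:.2f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is why lower precision can let the same model fit on a smaller GPU.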

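## Simple and practical

```python
# Pseudocode: the `tensorrtllm` names below are illustrative, not the real tensorrt_llm API
import tensorrtllm as trtllm

# Initialize and load the model
model = trtllm.LargeLanguageModel('./path_to_your_model')
input_data = [...]  # your input data here

# Perform common operations through easy-to-understand methods
optimized_model = model.optimize()
engine = optimized_model.build_engine()
results = engine.execute(input_data)
```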

## Model quantization

```python
# Pseudocode: the `tensorrtllm` names below are illustrative, not the real tensorrt_llm API
import tensorrtllm as trtllm

# Initialize the model
model = trtllm.LargeLanguageModel('./path_to_your_model')
input_data = [...]  # your input data here

# Enable quantization
quantized_model = model.enable_quantization(precision='FP8')

# Build and execute the quantized model
engine = quantized_model.build_engine()
result = engine.execute(input_data)
```
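The cell above requests FP8. As a simpler stand-in that shows the same trade-off, the sketch below applies symmetric per-tensor INT8 quantization in plain Python: values are divided by a scale, rounded to integers in [-127, 127], and multiplied back on use. INT8 is swapped in here because it is easier to show than FP8 encoding, and the function names and sample weights are made up for the example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization onto the integer range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floating-point values from the integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # illustrative values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error per value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Small values such as `0.003` collapse to zero at this scale, which is the accuracy cost that per-channel scales or calibration try to reduce.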

## Adapting to new architectures

```python
# Pseudocode: the `tensorrtllm` names below are illustrative, not the real tensorrt_llm API
import tensorrtllm as trtllm

# Initialize the model
model = trtllm.LargeLanguageModel('./path_to_your_model')

# Update the model with new kernels or architectures
updated_model = model.update_components(
    new_kernels='./path_to_new_kernels',
    new_architectures='./path_to_new_architectures',
)

# Re-optimize and deploy the updated model
updated_engine = updated_model.build_engine()
```