trt give random output value, diffs with onnxruntime #2928
I ran inference with this:

Does your model have dynamic shapes? If yes, you need to set the input shapes. If trtexec can build the engine normally, then the issue is in your Python scripts. I think we do support LLaMA, and you can already build the engine with trtexec.
Thanks, I have converted it to .engine with:

```
trtexec --onnx=decoder-merge-0.onnx \
    --minShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1,position_ids:1x1,past_key_in:1x32x0x128,past_value_in:1x32x0x128 \
    --optShapes=hidden_in:1x1x4096,attn_mask:1x1x1x2,position_ids:1x1,past_key_in:1x32x1x128,past_value_in:1x32x1x128 \
    --maxShapes=hidden_in:1x64x4096,attn_mask:1x1x64x192,position_ids:1x64,past_key_in:1x32x192x128,past_value_in:1x32x192x128 \
    --fp16 --saveEngine=decoder.engine
```

Let me try the inference values later.
@zerollzeng I got wrong output values from TRT, which differ from onnxruntime. Here is the reproduction:

```
trtexec --onnx=decoder-merge-0.onnx \
    --minShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1,position_ids:1x1,past_key_in:1x32x0x128,past_value_in:1x32x0x128 \
    --optShapes=hidden_in:1x1x4096,attn_mask:1x1x1x2,position_ids:1x1,past_key_in:1x32x1x128,past_value_in:1x32x1x128 \
    --maxShapes=hidden_in:1x64x4096,attn_mask:1x1x64x192,position_ids:1x64,past_key_in:1x32x192x128,past_value_in:1x32x192x128 \
    --fp16 --saveEngine=decoder-merge-0.engine
```
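For context, the min/opt/max flags define a TensorRT optimization profile: every runtime input shape must lie within the per-dimension min/max bounds. A minimal pure-Python sketch of that bound check (illustration only, not the TensorRT API; only `past_key_in` is shown):

```python
# Bounds taken from the trtexec command above (past_key_in only).
MIN = {"past_key_in": (1, 32, 0, 128)}
MAX = {"past_key_in": (1, 32, 192, 128)}

def shape_in_profile(name, shape):
    """Return True if `shape` fits inside the profile's min/max bounds."""
    lo, hi = MIN[name], MAX[name]
    return all(l <= s <= h for l, s, h in zip(lo, shape, hi))

print(shape_in_profile("past_key_in", (1, 32, 49, 128)))   # True
print(shape_in_profile("past_key_in", (1, 32, 200, 128)))  # False: 200 > 192
```

A runtime shape outside these bounds is rejected by the engine, which is why the min/opt/max choice matters for the cache dimension.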
```
$ llama.onnx git:(add-trt-backend) cd data && ls *
attn_mask.npy hidden_in.npy position_ids.npy
```

```python
# inference with trt
trt_wrapper = TrtWrapper('path/to/decoder-merge-0.engine')
trt_outputs = trt_wrapper.forward(_inputs)

# with ort
ort_wrapper = OrtWrapper('path/to/decoder-merge-0.onnx')
ort_outputs = ort_wrapper.forward(_inputs)
```
```
(base) ➜ llama git:(add-trt-backend) python3 trt_wrapper.py
2023-05-09 16:20:05.846 | DEBUG | __main__:__init__:128 - /home/khj/下载/7b-onnx/alpaca-onnx-7B-fp16/models/decoder-merge-0.onnx loaded
False
False
False
7.645
```

Running it again:

```
(base) ➜ llama git:(add-trt-backend) python3 trt_wrapper.py
2023-05-09 16:22:33.620 | DEBUG | __main__:__init__:128 - /home/khj/下载/7b-onnx/alpaca-onnx-7B-fp16/models/decoder-merge-0.onnx loaded
False
False
False
4.492
```

The TRT outputs give me random values. cc @lingffff
Could you try with Polygraphy? See https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy/examples/cli/run/01_comparing_frameworks for a quick check.
Ah... why do we have to learn so many tools QaQ... I will try it later.
@tpoisonooo Hi, I tested the precision with versions 8.6.0.12 and 8.6.1.6 before, and it seems good to me. I then used Polygraphy to test accuracy:

```
polygraphy run --onnxrt /home/oldpan/code/models/GPT/LLAMA/alpaca.onnx/decoder-merge-4.onnx \
    --save-results=/home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json \
    --data-loader-script /home/oldpan/code/convert/tools/data_loader.py

polygraphy run /home/oldpan/code/models/normal/tensorrt_engine/llama-trt/decoder-merge-4.trt \
    --model-type engine --trt \
    --data-loader-script /home/oldpan/code/convert/tools/data_loader.py \
    --load-outputs /home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json \
    --atol 1e-2 --rtol 1e-3
```

The output is:
The data_loader.py is:

```python
import numpy as np
from polygraphy.json import save_json

INPUT_SHAPE_1 = (1, 1, 4096)
INPUT_SHAPE_2 = (1, 1, 1, 50)
INPUT_SHAPE_3 = (1, 1)
INPUT_SHAPE_4 = (1, 32, 49, 128)
INPUT_SHAPE_5 = (1, 32, 49, 128)
# --shapes=hidden_in:1x1x4096,attn_mask:1x1x1x50,position_ids:1x1,past_key_in:1x32x49x128,past_value_in:1x32x49x128

def load_data():
    for _ in range(1):
        yield {"hidden_in": np.ones(shape=INPUT_SHAPE_1, dtype=np.float16),
               "attn_mask": np.ones(shape=INPUT_SHAPE_2, dtype=np.float16),
               "position_ids": np.ones(shape=INPUT_SHAPE_3, dtype=np.int64),
               "past_key_in": np.ones(shape=INPUT_SHAPE_4, dtype=np.float16),
               "past_value_in": np.ones(shape=INPUT_SHAPE_5, dtype=np.float16)}  # Still totally real data
```
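For anyone unfamiliar with the `--atol`/`--rtol` flags: the comparison passes when the elementwise difference stays within the combined absolute and relative tolerance. A sketch of that check (a simplified model of the tolerance formula, not Polygraphy's actual code):

```python
import numpy as np

def outputs_match(trt_out, ort_out, atol=1e-2, rtol=1e-3):
    """Pass if |trt - ort| <= atol + rtol * |ort| for every element."""
    return bool(np.all(np.abs(trt_out - ort_out) <= atol + rtol * np.abs(ort_out)))

ref = np.array([1.0, 2.0, 3.0])
print(outputs_match(ref + 5e-3, ref))  # True: 0.005 is within the tolerance
print(outputs_match(ref + 5e-2, ref))  # False: 0.05 exceeds it
```

With fp16 engines, some drift from the fp32 onnxruntime reference is expected, which is why a relative tolerance is used on top of the absolute one.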
Hi @Oldpan, I noticed that you set "--minShapes=past_key_in:1x32x1x128" but @tpoisonooo set "--minShapes=past_key_in:1x32x0x128". Maybe this "zero tensor" feature causes the problem?
There is a cache in llama, so "--minShapes=past_key_in:1x32x0x128". Otherwise I would have to export two kinds of models. Please check:

```python
if past_key_value is not None:
    kv_seq_len += past_key_value[0].shape[-2]
...
if past_key_value is not None:
    # reuse k, v, self_attention
    key_states = torch.cat([past_key_value[0], key_states], dim=2)
    value_states = torch.cat([past_key_value[1], value_states], dim=2)
past_key_value = (key_states, value_states) if use_cache else None
```
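The zero-length `past_key_in` trick is just an empty concatenation on the first decoding step. A small NumPy sketch of why the `1x32x0x128` min shape is harmless numerically (shapes taken from the thread; NumPy stands in for torch here):

```python
import numpy as np

# First decoding step: the KV cache has sequence length 0, so
# concatenating along the sequence axis (axis 2) is a no-op.
past_key = np.zeros((1, 32, 0, 128), dtype=np.float16)  # empty cache
new_key = np.ones((1, 32, 1, 128), dtype=np.float16)    # current token's key

merged = np.concatenate([past_key, new_key], axis=2)
print(merged.shape)  # (1, 32, 1, 128)
```

onnxruntime handles this zero-sized input fine, which is consistent with the report that the suspect behavior only appears on the TensorRT side.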
@tpoisonooo Hi, have you tested LLaMA's performance on TensorRT, and how fast are tokens generated?
On my GTX1060 Ti, I got 3.5~5 ms per backbone decoder layer with the LLaMA 7B fp16 model, i.e. 1000 / (5 * 32) ≈ 6 tokens/second.
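The throughput arithmetic above, spelled out (assuming one sequential pass through all 32 decoder layers per token):

```python
# Back-of-envelope: ~5 ms per decoder layer, 32 layers in LLaMA 7B,
# so one token costs 5 * 32 = 160 ms of decoder time.
ms_per_layer = 5
num_layers = 32
tokens_per_second = 1000 / (ms_per_layer * num_layers)
print(tokens_per_second)  # 6.25
```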
I wonder when this problem will be solved, thank you.

I wonder if there is any progress on this issue.
Changing np.ones to np.random.rand shows FAILED in this case.
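For concreteness, this is what the modified data loader would look like (a sketch assuming the same input names and shapes as the data_loader.py earlier in the thread; `position_ids` stays integer-typed):

```python
import numpy as np

# Same inputs as data_loader.py above, but with random values instead
# of ones; with this input the TRT-vs-ORT comparison reportedly fails.
INPUT_SHAPES = {
    "hidden_in": (1, 1, 4096),
    "attn_mask": (1, 1, 1, 50),
    "position_ids": (1, 1),
    "past_key_in": (1, 32, 49, 128),
    "past_value_in": (1, 32, 49, 128),
}

def load_data():
    for _ in range(1):
        feed = {}
        for name, shape in INPUT_SHAPES.items():
            if name == "position_ids":
                feed[name] = np.ones(shape, dtype=np.int64)
            else:
                feed[name] = np.random.rand(*shape).astype(np.float16)
        yield feed
```

All-ones inputs can mask accuracy problems, so a failure that only shows up with random data still points at a real numerical divergence.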
Is this fixed?
Not yet. If you want to run LLaMA inference on CUDA, try https://github.com/InternLM/lmdeploy
@tpoisonooo Is this approach any different from LLaMA on Optimum? I already converted to TensorRT with Optimum.
Does this test have a runnable demo for TensorRT?
How did you convert to TensorRT with Optimum? I can't find any reference about this.
Hi all, we successfully converted to TensorRT! Each LlamaDecoderLayer was split into three segments: pre, mid, and post. The KV cache between the pre and mid segments reverts to PyTorch for computation. The dimensions inside pre and post do not undergo a transpose operation, enabling batch processing. The mid segment has no parameters, eliminating the need for 32 repetitions; instead, a single mid parameter is placed on each card. See https://github.com/torchpipe/LLM.TensorRT.Serve
Description

I am using the Python TensorRT API to convert the ONNX model, and the script does not finish even after 2 hours. But

```
trtexec --onnx=model.onnx --fp16
```

stops normally and gives me model.engine.

Environment
TensorRT Version: 8.6.1.6 GA, here is download url
NVIDIA GPU: GTX1660
NVIDIA Driver Version: 515.86.01
CUDA Version: cu117
CUDNN Version: 8.4.1
Operating System: ubuntu20.04
Python Version (if applicable): 3.9
Tensorflow Version (if applicable): -
PyTorch Version (if applicable): torch2.0
Baremetal or Container (if so, version):
Relevant Files
fp16 onnx model download here https://huggingface.co/tpoisonooo/alpaca.onnx/blob/fp16/decoder-merge-0.onnx
single script download here: https://github.com/tpoisonooo/llama.onnx/blob/add-trt-backend/tools/onnx-to-trt.py
Steps To Reproduce

Set `onnx_model_dir` and run the script: it does not finish, while `trtexec` works.

Notes
This ONNX is part of the LLaMA huggingface format. Since LLaMA needs a `cache` and there is an `if` op here, I had to build an empty tensor to hack it. So `past_key_in.min_shape` is `[1,32,0,128]`, and it works on onnxruntime.