trt give random output value, diffs with onnxruntime #2928
I ran inference with this:

Does your model have dynamic shapes? If yes, you need to set the input shapes. If trtexec can build the engine normally, then the issue is in your Python scripts. I think we do support LLaMA, and you can already build the engine with trtexec.
Thanks, I have converted it to .engine with:

```
trtexec --onnx=decoder-merge-0.onnx \
    --minShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1,position_ids:1x1,past_key_in:1x32x0x128,past_value_in:1x32x0x128 \
    --optShapes=hidden_in:1x1x4096,attn_mask:1x1x1x2,position_ids:1x1,past_key_in:1x32x1x128,past_value_in:1x32x1x128 \
    --maxShapes=hidden_in:1x64x4096,attn_mask:1x1x64x192,position_ids:1x64,past_key_in:1x32x192x128,past_value_in:1x32x192x128 \
    --fp16 --saveEngine=decoder.engine
```

Let me try the inference values later.
@zerollzeng I got wrong output values from TRT, which differ from onnxruntime. Here is the reproduction:

```
trtexec --onnx=decoder-merge-0.onnx \
    --minShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1,position_ids:1x1,past_key_in:1x32x0x128,past_value_in:1x32x0x128 \
    --optShapes=hidden_in:1x1x4096,attn_mask:1x1x1x2,position_ids:1x1,past_key_in:1x32x1x128,past_value_in:1x32x1x128 \
    --maxShapes=hidden_in:1x64x4096,attn_mask:1x1x64x192,position_ids:1x64,past_key_in:1x32x192x128,past_value_in:1x32x192x128 \
    --fp16 --saveEngine=decoder-merge-0.engine
```
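For context, the min/opt/max flags define a TensorRT optimization profile: every runtime input shape must lie within the per-dimension min/max bounds. A minimal pure-Python sketch of that bound check (illustration only, not the TensorRT API; only `past_key_in` is shown):

```python
# Bounds taken from the trtexec command above (past_key_in only).
MIN = {"past_key_in": (1, 32, 0, 128)}
MAX = {"past_key_in": (1, 32, 192, 128)}

def shape_in_profile(name, shape):
    """Return True if `shape` fits inside the profile's min/max bounds."""
    lo, hi = MIN[name], MAX[name]
    return all(l <= s <= h for l, s, h in zip(lo, shape, hi))

print(shape_in_profile("past_key_in", (1, 32, 49, 128)))   # True
print(shape_in_profile("past_key_in", (1, 32, 200, 128)))  # False: 200 > 192
```

A runtime shape outside these bounds is rejected by the engine, which is why the min/opt/max choice matters for the cache dimension.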
```
$ llama.onnx git:(add-trt-backend) cd data && ls *
attn_mask.npy hidden_in.npy position_ids.npy
```

```python
# inference with trt
trt_wrapper = TrtWrapper('path/to/decoder-merge-0.engine')
trt_outputs = trt_wrapper.forward(_inputs)

# with ort
ort_wrapper = OrtWrapper('path/to/decoder-merge-0.onnx')
ort_outputs = ort_wrapper.forward(_inputs)
```
```
(base) ➜ llama git:(add-trt-backend) python3 trt_wrapper.py
2023-05-09 16:20:05.846 | DEBUG | __main__:__init__:128 - /home/khj/下载/7b-onnx/alpaca-onnx-7B-fp16/models/decoder-merge-0.onnx loaded
False
False
False
7.645
```

Running it again:

```
(base) ➜ llama git:(add-trt-backend) python3 trt_wrapper.py
2023-05-09 16:22:33.620 | DEBUG | __main__:__init__:128 - /home/khj/下载/7b-onnx/alpaca-onnx-7B-fp16/models/decoder-merge-0.onnx loaded
False
False
False
4.492
```

The TRT outputs give me random values. cc @lingffff
Could you try with Polygraphy? See https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy/examples/cli/run/01_comparing_frameworks for a quick check.
Ah... why do we have to learn so many tools QaQ... I will try it later.
@tpoisonooo Hi, I tested the precision with versions 8.6.0.12 and 8.6.1.6 before, and it seems good to me. I then used Polygraphy to test accuracy:

```
polygraphy run --onnxrt /home/oldpan/code/models/GPT/LLAMA/alpaca.onnx/decoder-merge-4.onnx \
    --save-results=/home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json \
    --data-loader-script /home/oldpan/code/convert/tools/data_loader.py

polygraphy run /home/oldpan/code/models/normal/tensorrt_engine/llama-trt/decoder-merge-4.trt \
    --model-type engine --trt \
    --data-loader-script /home/oldpan/code/convert/tools/data_loader.py \
    --load-outputs /home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json \
    --atol 1e-2 --rtol 1e-3
```

The output is:
The data_loader.py is:

```python
import numpy as np
from polygraphy.json import save_json

INPUT_SHAPE_1 = (1, 1, 4096)
INPUT_SHAPE_2 = (1, 1, 1, 50)
INPUT_SHAPE_3 = (1, 1)
INPUT_SHAPE_4 = (1, 32, 49, 128)
INPUT_SHAPE_5 = (1, 32, 49, 128)
# --shapes=hidden_in:1x1x4096,attn_mask:1x1x1x50,position_ids:1x1,past_key_in:1x32x49x128,past_value_in:1x32x49x128

def load_data():
    for _ in range(1):
        yield {"hidden_in": np.ones(shape=INPUT_SHAPE_1, dtype=np.float16),
               "attn_mask": np.ones(shape=INPUT_SHAPE_2, dtype=np.float16),
               "position_ids": np.ones(shape=INPUT_SHAPE_3, dtype=np.int64),
               "past_key_in": np.ones(shape=INPUT_SHAPE_4, dtype=np.float16),
               "past_value_in": np.ones(shape=INPUT_SHAPE_5, dtype=np.float16)}  # Still totally real data
```
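For anyone unfamiliar with the `--atol`/`--rtol` flags: the comparison passes when the elementwise difference stays within the combined absolute and relative tolerance. A sketch of that check (a simplified model of the tolerance formula, not Polygraphy's actual code):

```python
import numpy as np

def outputs_match(trt_out, ort_out, atol=1e-2, rtol=1e-3):
    """Pass if |trt - ort| <= atol + rtol * |ort| for every element."""
    return bool(np.all(np.abs(trt_out - ort_out) <= atol + rtol * np.abs(ort_out)))

ref = np.array([1.0, 2.0, 3.0])
print(outputs_match(ref + 5e-3, ref))  # True: 0.005 is within the tolerance
print(outputs_match(ref + 5e-2, ref))  # False: 0.05 exceeds it
```

With fp16 engines, some drift from the fp32 onnxruntime reference is expected, which is why a relative tolerance is used on top of the absolute one.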
Hi @Oldpan, I noticed that you set "--minShapes=past_key_in:1x32x1x128" but @tpoisonooo set "--minShapes=past_key_in:1x32x0x128". Maybe this "zero tensor" feature causes the problem?
There is a cache in llama, so "--minShapes=past_key_in:1x32x0x128". Otherwise I would have to export two kinds of models. Please check:

```python
if past_key_value is not None:
    kv_seq_len += past_key_value[0].shape[-2]
...
if past_key_value is not None:
    # reuse k, v, self_attention
    key_states = torch.cat([past_key_value[0], key_states], dim=2)
    value_states = torch.cat([past_key_value[1], value_states], dim=2)
past_key_value = (key_states, value_states) if use_cache else None
```
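The zero-length `past_key_in` trick is just an empty concatenation on the first decoding step. A small NumPy sketch of why the `1x32x0x128` min shape is harmless numerically (shapes taken from the thread; NumPy stands in for torch here):

```python
import numpy as np

# First decoding step: the KV cache has sequence length 0, so
# concatenating along the sequence axis (axis 2) is a no-op.
past_key = np.zeros((1, 32, 0, 128), dtype=np.float16)  # empty cache
new_key = np.ones((1, 32, 1, 128), dtype=np.float16)    # current token's key

merged = np.concatenate([past_key, new_key], axis=2)
print(merged.shape)  # (1, 32, 1, 128)
```

onnxruntime handles this zero-sized input fine, which is consistent with the report that the suspect behavior only appears on the TensorRT side.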
@tpoisonooo Hi, have you tested LLaMA's performance on TensorRT, and how fast are tokens generated?
On my GTX1060 Ti, I got 3.5~5 ms per backbone decoder layer with the LLaMA 7B fp16 model, i.e. 1000 / (5 * 32) ≈ 6 tokens/second.
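The throughput arithmetic above, spelled out (assuming one sequential pass through all 32 decoder layers per token):

```python
# Back-of-envelope: ~5 ms per decoder layer, 32 layers in LLaMA 7B,
# so one token costs 5 * 32 = 160 ms of decoder time.
ms_per_layer = 5
num_layers = 32
tokens_per_second = 1000 / (ms_per_layer * num_layers)
print(tokens_per_second)  # 6.25
```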
I wonder when this problem will be solved, thank you.

I wonder if there is any progress on this issue.
Changing np.ones to np.random.rand shows FAILED in this case.
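For concreteness, this is what the modified data loader would look like (a sketch assuming the same input names and shapes as the data_loader.py earlier in the thread; `position_ids` stays integer-typed):

```python
import numpy as np

# Same inputs as data_loader.py above, but with random values instead
# of ones; with this input the TRT-vs-ORT comparison reportedly fails.
INPUT_SHAPES = {
    "hidden_in": (1, 1, 4096),
    "attn_mask": (1, 1, 1, 50),
    "position_ids": (1, 1),
    "past_key_in": (1, 32, 49, 128),
    "past_value_in": (1, 32, 49, 128),
}

def load_data():
    for _ in range(1):
        feed = {}
        for name, shape in INPUT_SHAPES.items():
            if name == "position_ids":
                feed[name] = np.ones(shape, dtype=np.int64)
            else:
                feed[name] = np.random.rand(*shape).astype(np.float16)
        yield feed
```

All-ones inputs can mask accuracy problems, so a failure that only shows up with random data still points at a real numerical divergence.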
Is this fixed?
Not yet. If you want to run LLaMA inference on CUDA, try https://github.com/InternLM/lmdeploy
@tpoisonooo Is this approach any different from LLaMA on Optimum? I already converted to TensorRT with Optimum.
Does this test have a runnable demo for TensorRT?
How did you convert to TensorRT with Optimum? I can't find any reference about this.
Hi all, we successfully converted to TensorRT! Each LlamaDecoderLayer was split into three segments: pre, mid, and post. The KV cache between the pre and mid segments reverts to PyTorch for computation. The dimensions inside pre and post do not undergo a transpose operation, enabling batch processing. The mid segment has no parameters, eliminating the need for 32 repetitions; instead, a single mid parameter is placed on each card. See https://github.com/torchpipe/LLM.TensorRT.Serve
Description

I am using the Python TensorRT API to convert the ONNX model, and the script does not finish even after 2 hours. But

```
trtexec --onnx=model.onnx --fp16
```

stops normally and gives me model.engine.

Environment
TensorRT Version: 8.6.1.6 GA, here is download url
NVIDIA GPU: GTX1660
NVIDIA Driver Version: 515.86.01
CUDA Version: cu117
CUDNN Version: 8.4.1
Operating System: ubuntu20.04
Python Version (if applicable): 3.9
Tensorflow Version (if applicable): -
PyTorch Version (if applicable): torch2.0
Baremetal or Container (if so, version):
Relevant Files
fp16 onnx model download here https://huggingface.co/tpoisonooo/alpaca.onnx/blob/fp16/decoder-merge-0.onnx
single script download here: https://github.com/tpoisonooo/llama.onnx/blob/add-trt-backend/tools/onnx-to-trt.py
Steps To Reproduce

Set `onnx_model_dir` and run the script: it does not finish, while `trtexec` works.

Notes
This ONNX is part of the LLaMA huggingface format. Since LLaMA needs a `cache` and there is an `if` op here, I had to build an empty tensor to hack it. So `past_key_in.min_shape` is `[1,32,0,128]`, and it works on onnxruntime.