trt give random output value, diffs with onnxruntime #2928

Open
tpoisonooo opened this issue May 4, 2023 · 20 comments
Labels: triaged (Issue has been triaged by maintainers)
@tpoisonooo

Description

I am using the TensorRT Python API to convert an ONNX model, but the script does not finish even after 2 hours.

However, trtexec --onnx=model.onnx --fp16 finishes normally and gives me model.engine.

Environment

TensorRT Version: 8.6.1.6 GA
NVIDIA GPU: GTX1660
NVIDIA Driver Version: 515.86.01
CUDA Version: cu117
CUDNN Version: 8.4.1
Operating System: ubuntu20.04
Python Version (if applicable): 3.9
Tensorflow Version (if applicable): -
PyTorch Version (if applicable): torch2.0
Baremetal or Container (if so, version):

Relevant Files

FP16 ONNX model, download here: https://huggingface.co/tpoisonooo/alpaca.onnx/blob/fp16/decoder-merge-0.onnx
Single conversion script, download here: https://github.com/tpoisonooo/llama.onnx/blob/add-trt-backend/tools/onnx-to-trt.py

Steps To Reproduce

  1. Download the ONNX model and save it to onnx_model_dir.
  2. Install the TensorRT Python package and run the script:
$ python3 onnx-to-trt.py  onnx_model_dir   output_engine_dir

This script does not finish.

  3. trtexec, however, works:
$ trtexec --onnx=/path/to/onnx_models/decoder-merge-0.onnx --fp16
$ ls
.. decoder.engine

Notes

This ONNX model is part of LLaMA in Hugging Face format.

Since LLaMA needs a KV cache, and there is an If operator here, I have to build an empty tensor to work around it at the first decoding step.

So past_key_in.min_shape is [1,32,0,128]; this works in onnxruntime.
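As an illustration, the empty-tensor workaround amounts to feeding a tensor whose sequence axis has length zero (a NumPy sketch; the shape follows the note above, the variable names are mine):

```python
import numpy as np

# "No cache yet" at the first decoding step is expressed as tensors whose
# sequence axis (dim 2) has length 0 -- matching past_key_in.min_shape [1,32,0,128].
empty_past_key = np.zeros((1, 32, 0, 128), dtype=np.float16)
empty_past_value = np.zeros((1, 32, 0, 128), dtype=np.float16)

print(empty_past_key.shape)  # (1, 32, 0, 128)
print(empty_past_key.size)   # 0 -- a valid, zero-element tensor
```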

@lingffff

lingffff commented May 5, 2023

But for trtexec --onnx=model.onnx --fp16, it would stop normally and give me model.engine.

I ran inference with this model.engine but got wrong outputs.
Does the current TensorRT 8.6.1.6 GA support LLaMA?

@zerollzeng
Collaborator

trtexec --onnx=/path/to/onnx_models/decoder-merge-0.onnx --fp16

Does your model have dynamic shapes? If yes, then you need to set the input shapes.

If trtexec can build the engine normally, then the issue is likely in your Python script. And I think we support LLaMA, since you can already build the engine with trtexec.

@zerollzeng zerollzeng self-assigned this May 7, 2023
@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label May 7, 2023
@tpoisonooo
Author

trtexec --onnx=/path/to/onnx_models/decoder-merge-0.onnx --fp16

Does your model have dynamic shapes? If yes, then you need to set the input shapes.

If trtexec can build the engine normally, then the issue is likely in your Python script. And I think we support LLaMA, since you can already build the engine with trtexec.

Thanks, I have converted it to .engine with

trtexec --onnx=decoder-merge-0.onnx 
--minShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1,position_ids:1x1,past_key_in:1x32x0x128,past_value_in:1x32x0x128  
--optShapes=hidden_in:1x1x4096,attn_mask:1x1x1x2,position_ids:1x1,past_key_in:1x32x1x128,past_value_in:1x32x1x128  
--maxShapes=hidden_in:1x64x4096,attn_mask:1x1x64x192,position_ids:1x64,past_key_in:1x32x192x128,past_value_in:1x32x192x128
--fp16  --saveEngine=decoder.engine

Let me check the inference values later.
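As a sanity check on the command above, every input's min/opt/max shapes must be elementwise ordered (min <= opt <= max per dimension), or engine building fails. A small sketch with the shape values copied from the trtexec flags (the helper function is mine):

```python
# (min, opt, max) shapes per input, copied from the trtexec flags above.
profiles = {
    "hidden_in":     ((1, 1, 4096),    (1, 1, 4096),    (1, 64, 4096)),
    "attn_mask":     ((1, 1, 1, 1),    (1, 1, 1, 2),    (1, 1, 64, 192)),
    "position_ids":  ((1, 1),          (1, 1),          (1, 64)),
    "past_key_in":   ((1, 32, 0, 128), (1, 32, 1, 128), (1, 32, 192, 128)),
    "past_value_in": ((1, 32, 0, 128), (1, 32, 1, 128), (1, 32, 192, 128)),
}

def profile_ok(min_s, opt_s, max_s):
    # Each dimension must satisfy min <= opt <= max.
    return all(a <= b <= c for a, b, c in zip(min_s, opt_s, max_s))

for name, (mn, op, mx) in profiles.items():
    assert profile_ok(mn, op, mx), f"bad profile for {name}"
```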

@tpoisonooo
Author

tpoisonooo commented May 9, 2023

@zerollzeng I got wrong output values from TRT, which differ from onnxruntime's. Here is the reproduction:

  1. Download the ONNX model, taking decoder-merge-0.onnx as an example: https://huggingface.co/tpoisonooo/alpaca.onnx/blob/fp16/decoder-merge-0.onnx
  2. Generate the .engine with trtexec as mentioned before:
trtexec --onnx=decoder-merge-0.onnx 
--minShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1,position_ids:1x1,past_key_in:1x32x0x128,past_value_in:1x32x0x128  
--optShapes=hidden_in:1x1x4096,attn_mask:1x1x1x2,position_ids:1x1,past_key_in:1x32x1x128,past_value_in:1x32x1x128  
--maxShapes=hidden_in:1x64x4096,attn_mask:1x1x64x192,position_ids:1x64,past_key_in:1x32x192x128,past_value_in:1x32x192x128
--fp16  --saveEngine=decoder-merge-0.engine
  3. Download the test data: https://github.com/tpoisonooo/llama.onnx/tree/add-trt-backend/data ; there are three NumPy arrays
$ llama.onnx git:(add-trt-backend) cd data && ls *
attn_mask.npy  hidden_in.npy  position_ids.npy
  4. Open this single Python script and set the onnx/engine file paths: https://github.com/tpoisonooo/llama.onnx/blob/add-trt-backend/llama/trt_wrapper.py#L157
    # inference with trt
    trt_wrapper = TrtWrapper('path/to/decoder-merge-0.engine')
    trt_outputs = trt_wrapper.forward(_inputs)

    # with ort
    ort_wrapper = OrtWrapper('path/to/decoder-merge-0.onnx')
    ort_outputs = ort_wrapper.forward(_inputs)
  5. Run it; it prints the np.allclose results and diff.max():
(base) ➜  llama git:(add-trt-backend) python3 trt_wrapper.py
2023-05-09 16:20:05.846 | DEBUG    | __main__:__init__:128 - /home/khj/下载/7b-onnx/alpaca-onnx-7B-fp16/models/decoder-merge-0.onnx loaded
False
False
False
7.645

# again
(base) ➜  llama git:(add-trt-backend) python3 trt_wrapper.py
2023-05-09 16:22:33.620 | DEBUG    | __main__:__init__:128 - /home/khj/下载/7b-onnx/alpaca-onnx-7B-fp16/models/decoder-merge-0.onnx loaded
False
False
False
4.492

TRT outputs give me random values (note the max diff changes between runs).
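The three allclose lines and the final diff number in the printout above presumably come from a comparison along these lines (my sketch of the check, not the actual code in trt_wrapper.py):

```python
import numpy as np

def compare(trt_outputs, ort_outputs, atol=1e-2, rtol=1e-3):
    """Return per-output allclose flags and the largest absolute difference."""
    flags = []
    max_diff = 0.0
    for trt_out, ort_out in zip(trt_outputs, ort_outputs):
        flags.append(bool(np.allclose(trt_out, ort_out, atol=atol, rtol=rtol)))
        diff = np.abs(trt_out.astype(np.float32) - ort_out.astype(np.float32))
        max_diff = max(max_diff, float(diff.max()))
    return flags, max_diff

# With outputs differing by several units, allclose is False and max_diff is
# large -- the symptom reported above.
a = [np.zeros((2, 3), dtype=np.float16)]
b = [np.full((2, 3), 7.645, dtype=np.float16)]
flags, max_diff = compare(a, b)
print(flags, round(max_diff, 3))
```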

cc @lingffff

@tpoisonooo tpoisonooo changed the title onnx2tensorrt not finish for long time trt give random output value, diffs with onnxruntime May 9, 2023
@zerollzeng
Collaborator

Could you try with Polygraphy? See https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy/examples/cli/run/01_comparing_frameworks

A quick check would be something like: polygraphy run decoder-merge-0.onnx --trt --fp16 --onnxrt --trt-min-shapes xxx --trt-opt-shapes xxx --trt-max-shapes xxx --input-shapes xxx --data-loader-script data_loader.py

Refer to polygraphy run -h

@tpoisonooo
Author

tpoisonooo commented May 10, 2023

Ah.. why do we have to learn so many tools QaQ .. I will try it later.

@Oldpan

Oldpan commented May 11, 2023

@tpoisonooo Hi, I tested the precision using versions 8.6.0.12 and 8.6.1.6 before, and it seems good to me.
First, I converted it using trtexec:
./trtexec --onnx=/home/oldpan/code/models/GPT/LLAMA/alpaca.onnx/decoder-merge-5.onnx \
--minShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1,position_ids:1x1,past_key_in:1x32x1x128,past_value_in:1x32x1x128 \
--optShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1029,position_ids:1x1,past_key_in:1x32x1028x128,past_value_in:1x32x1028x128 \
--maxShapes=hidden_in:1x1x4096,attn_mask:1x1x1x2049,position_ids:1x1,past_key_in:1x32x2048x128,past_value_in:1x32x2048x128 \
--shapes=hidden_in:1x1x4096,attn_mask:1x1x1x50,position_ids:1x1,past_key_in:1x32x49x128,past_value_in:1x32x49x128 \
--fp16 --saveEngine=decoder-merge-5.trt

And then used Polygraphy to test accuracy:

polygraphy run --onnxrt /home/oldpan/code/models/GPT/LLAMA/alpaca.onnx/decoder-merge-4.onnx --save-results=/home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json --data-loader-script /home/oldpan/code/convert/tools/data_loader.py
polygraphy run /home/oldpan/code/models/normal/tensorrt_engine/llama-trt/decoder-merge-4.trt --model-type engine --trt --data-loader-script /home/oldpan/code/convert/tools/data_loader.py --load-outputs /home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json  --atol 1e-2 --rtol 1e-3

The output is

[I] RUNNING | Command: /home/oldpan/miniconda3/envs/develop/bin/polygraphy run /home/oldpan/code/models/normal/tensorrt_engine/llama-trt/decoder-merge-4.trt --model-type engine --trt --data-loader-script /home/oldpan/code/project/llama.ddeploy/tools/data_loader.py --load-outputs /home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json --atol 1e-2 --rtol 1e-3
[I] Saving custom input data to custom_inputs.json
[I] trt-runner-N0-05/11/23-10:21:57     | Activating and starting inference
[I] Loading bytes from /home/oldpan/code/models/normal/tensorrt_engine/llama-trt/decoder-merge-4.trt
[W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[W] Input tensor: position_ids | Buffer dtype (int64) does not match expected input dtype (int32), attempting to cast. 
[I] trt-runner-N0-05/11/23-10:21:57    
    ---- Inference Input(s) ----
    {hidden_in [dtype=float16, shape=(1, 1, 4096)],
     attn_mask [dtype=float16, shape=(1, 1, 1, 50)],
     position_ids [dtype=int32, shape=(1, 1)],
     past_key_in [dtype=float16, shape=(1, 32, 49, 128)],
     past_value_in [dtype=float16, shape=(1, 32, 49, 128)]}
[I] trt-runner-N0-05/11/23-10:21:57    
    ---- Inference Output(s) ----
    {past_key [dtype=float16, shape=(1, 32, 50, 128)],
     past_value [dtype=float16, shape=(1, 32, 50, 128)],
     hidden_out [dtype=float16, shape=(1, 1, 4096)]}
[I] trt-runner-N0-05/11/23-10:21:57     | Completed 1 iteration(s) in 1.745 ms | Average inference time: 1.745 ms.
[I] Loading inference results from /home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json
[I] Accuracy Comparison | trt-runner-N0-05/11/23-10:21:57 vs. onnxrt-runner-N0-04/22/23-22:27:42
[I]     Comparing Output: 'past_key' (dtype=float16, shape=(1, 32, 50, 128)) with 'past_key' (dtype=float16, shape=(1, 32, 50, 128))
[I]         Tolerance: [abs=0.01, rel=0.001] | Checking elemwise error
[I]         trt-runner-N0-05/11/23-10:21:57: past_key | Stats: mean=0.97991, std-dev=0.15628, var=0.024423, median=1, min=-1.999 at (0, 10, 49, 114), max=2.8555 at (0, 4, 49, 49), avg-magnitude=0.98727
[I]         onnxrt-runner-N0-04/22/23-22:27:42: past_key | Stats: mean=0.97991, std-dev=0.15628, var=0.024423, median=1, min=-2 at (0, 10, 49, 114), max=2.8555 at (0, 4, 49, 49), avg-magnitude=0.98727
[I]         Error Metrics: past_key
[I]             Minimum Required Tolerance: elemwise error | [abs=0.0019531] OR [rel=2.0484] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=1.3869e-06, std-dev=2.5162e-05, var=6.3313e-10, median=0, min=0 at (0, 0, 0, 0), max=0.0019531 at (0, 10, 49, 43), avg-magnitude=1.3869e-06
[I]             Relative Difference | Stats: mean=2.0463e-05, std-dev=0.0048362, var=2.3389e-05, median=0, min=0 at (0, 0, 0, 0), max=2.0484 at (0, 30, 49, 10), avg-magnitude=2.0463e-05
[I]         PASSED | Output: 'past_key' | Difference is within tolerance (rel=0.001, abs=0.01)
[I]     Comparing Output: 'past_value' (dtype=float16, shape=(1, 32, 50, 128)) with 'past_value' (dtype=float16, shape=(1, 32, 50, 128))
[I]         Tolerance: [abs=0.01, rel=0.001] | Checking elemwise error
[I]         trt-runner-N0-05/11/23-10:21:57: past_value | Stats: mean=0.97994, std-dev=0.14484, var=0.020979, median=1, min=-1.041 at (0, 31, 49, 55), max=1.0186 at (0, 30, 49, 121), avg-magnitude=0.98395
[I]         onnxrt-runner-N0-04/22/23-22:27:42: past_value | Stats: mean=0.97994, std-dev=0.14484, var=0.020979, median=1, min=-1.041 at (0, 31, 49, 55), max=1.0186 at (0, 30, 49, 121), avg-magnitude=0.98395
[I]         Error Metrics: past_value
[I]             Minimum Required Tolerance: elemwise error | [abs=0.00024414] OR [rel=0.00092421] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=2.099e-09, std-dev=6.0771e-07, var=3.6931e-13, median=0, min=0 at (0, 0, 0, 0), max=0.00024414 at (0, 0, 49, 39), avg-magnitude=2.099e-09
[I]             Relative Difference | Stats: mean=5.1702e-08, std-dev=5.9317e-06, var=3.5185e-11, median=0, min=0 at (0, 0, 0, 0), max=0.00092421 at (0, 17, 49, 117), avg-magnitude=5.1702e-08
[I]         PASSED | Output: 'past_value' | Difference is within tolerance (rel=0.001, abs=0.01)
[I]     Comparing Output: 'hidden_out' (dtype=float16, shape=(1, 1, 4096)) with 'hidden_out' (dtype=float16, shape=(1, 1, 4096))
[I]         Tolerance: [abs=0.01, rel=0.001] | Checking elemwise error
[I]         trt-runner-N0-05/11/23-10:21:57: hidden_out | Stats: mean=1.0164, std-dev=1.0491, var=1.1007, median=0.98901, min=-3.5781 at (0, 0, 1181), max=9.3516 at (0, 0, 3840), avg-magnitude=1.1887
[I]         onnxrt-runner-N0-04/22/23-22:27:42: hidden_out | Stats: mean=1.0164, std-dev=1.0492, var=1.1007, median=0.9895, min=-3.5781 at (0, 0, 1181), max=9.3594 at (0, 0, 3840), avg-magnitude=1.1887
[I]         Error Metrics: hidden_out
[I]             Minimum Required Tolerance: elemwise error | [abs=0.0078125] OR [rel=0.4306] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.0005626, std-dev=0.00064283, var=4.1323e-07, median=0.00048828, min=0 at (0, 0, 0), max=0.0078125 at (0, 0, 3840), avg-magnitude=0.0005626
[I]             Relative Difference | Stats: mean=0.0017539, std-dev=0.011641, var=0.0001355, median=0.00062073, min=0 at (0, 0, 0), max=0.4306 at (0, 0, 1582), avg-magnitude=0.0017539
[I]         PASSED | Output: 'hidden_out' | Difference is within tolerance (rel=0.001, abs=0.01)
[I]     PASSED | All outputs matched | Outputs: ['past_key', 'past_value', 'hidden_out']

The data_loader.py is:

import numpy as np
from polygraphy.json import save_json

INPUT_SHAPE_1 = (1, 1, 4096)
INPUT_SHAPE_2 = (1, 1, 1, 50)
INPUT_SHAPE_3 = (1, 1)
INPUT_SHAPE_4 = (1, 32, 49, 128)
INPUT_SHAPE_5 = (1, 32, 49, 128)

# --shapes=hidden_in:1x1x4096,attn_mask:1x1x1x50,position_ids:1x1,past_key_in:1x32x49x128,past_value_in:1x32x49x128
def load_data():
    for _ in range(1):
        yield {"hidden_in": np.ones(shape=INPUT_SHAPE_1, dtype=np.float16),
               "attn_mask": np.ones(shape=INPUT_SHAPE_2, dtype=np.float16),
               "position_ids": np.ones(shape=INPUT_SHAPE_3, dtype=np.int64),
               "past_key_in": np.ones(shape=INPUT_SHAPE_4, dtype=np.float16),
               "past_value_in": np.ones(shape=INPUT_SHAPE_5, dtype=np.float16)}  # Still totally real data

@lingffff

Hi @Oldpan , I notice that you set "--minShapes=past_key_in:1x32x1x128" but @tpoisonooo set "--minShapes=past_key_in:1x32x0x128". Maybe this "zero tensor" feature causes the problem?

@tpoisonooo
Author

tpoisonooo commented May 12, 2023

@tpoisonooo Hi, I tested the precision using versions 8.6.0.12 and 8.6.1.6 before, and it seems good to me. [full trtexec command, Polygraphy log, and data_loader.py quoted verbatim from @Oldpan's comment above]

There is a KV cache in LLaMA, hence "--minShapes=past_key_in:1x32x0x128"; otherwise I would have to export two kinds of decoder.onnx. cc @lingffff @Oldpan

Please check past_key_value in modeling_llama.py https://github.com/huggingface/transformers/blob/273f5ba0266b223c1d611bd00d4a4b2d58771a33/src/transformers/models/llama/modeling_llama.py#L213

        if past_key_value is not None:
            kv_seq_len += past_key_value[0].shape[-2]

        # ...

        if past_key_value is not None:
            # reuse k, v, self_attention
            key_states = torch.cat([past_key_value[0], key_states], dim=2)
            value_states = torch.cat([past_key_value[1], value_states], dim=2)

        past_key_value = (key_states, value_states) if use_cache else None
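Under that logic, the first decode step works even with a zero-length cache: the concat is simply a no-op along the sequence axis. A NumPy sketch of the cache update (np.concatenate standing in for torch.cat; the names mirror the snippet, the driver loop is mine):

```python
import numpy as np

def cache_step(past_key_value, key_states, value_states, use_cache=True):
    # Mirrors the Hugging Face update: append new K/V along the sequence axis.
    if past_key_value is not None:
        key_states = np.concatenate([past_key_value[0], key_states], axis=2)
        value_states = np.concatenate([past_key_value[1], value_states], axis=2)
    return (key_states, value_states) if use_cache else None

# Start from the empty cache ([1, 32, 0, 128]) and run three decode steps.
cache = (np.zeros((1, 32, 0, 128), np.float16),
         np.zeros((1, 32, 0, 128), np.float16))
for _ in range(3):
    new_k = np.ones((1, 32, 1, 128), np.float16)
    new_v = np.ones((1, 32, 1, 128), np.float16)
    cache = cache_step(cache, new_k, new_v)

print(cache[0].shape)  # (1, 32, 3, 128): the sequence axis grows by one per step
```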

@zhaohb

zhaohb commented May 12, 2023

@tpoisonooo Hi, have you tested LLaMA's performance on TensorRT, and how fast are tokens generated?

@tpoisonooo
Author

@tpoisonooo Hi, have you tested LLaMA's performance on TensorRT, and how fast are tokens generated?

On my GTX1060 Ti, I get 3.5~5 ms per backbone decoder layer with the LLaMA 7B fp16 model, i.e. 1000 / (5 * 32) ≈ 6 tokens/second.
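The arithmetic behind that estimate (32 being the number of decoder layers in LLaMA 7B, and taking the worst-case 5 ms per layer):

```python
layers = 32          # decoder layers in LLaMA 7B
ms_per_layer = 5.0   # worst-case per-layer latency from the measurement above
tokens_per_second = 1000.0 / (ms_per_layer * layers)
print(tokens_per_second)  # 6.25, i.e. roughly 6 tokens/second
```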

@coderchem

I wonder when this problem will be solved. Thank you.

@user-ZJ

user-ZJ commented Jun 16, 2023

I wonder if there is any progress on this issue.

@user-ZJ

user-ZJ commented Jun 19, 2023

@tpoisonooo Hi, I tested the precision using versions 8.6.0.12 and 8.6.1.6 before, and it seems good to me. [full trtexec command, Polygraphy log, and data_loader.py quoted verbatim from @Oldpan's comment above]

Changing np.ones to np.random.rand makes this case show FAILED.
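For reference, swapping in random data would look roughly like this (my sketch, not the loader actually used; note that np.random.rand takes the shape as positional arguments and returns float64, so an explicit cast to the expected dtypes is needed):

```python
import numpy as np

# Input shapes and dtypes taken from the data_loader.py quoted above.
INPUT_SPECS = {
    "hidden_in":     ((1, 1, 4096),     np.float16),
    "attn_mask":     ((1, 1, 1, 50),    np.float16),
    "position_ids":  ((1, 1),           np.int64),
    "past_key_in":   ((1, 32, 49, 128), np.float16),
    "past_value_in": ((1, 32, 49, 128), np.float16),
}

def load_data():
    for _ in range(1):
        feed = {}
        for name, (shape, dtype) in INPUT_SPECS.items():
            if np.issubdtype(dtype, np.integer):
                feed[name] = np.ones(shape, dtype=dtype)  # keep valid position ids
            else:
                feed[name] = np.random.rand(*shape).astype(dtype)
        yield feed

feed = next(load_data())
print(feed["hidden_in"].dtype, feed["hidden_in"].shape)  # float16 (1, 1, 4096)
```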

@tikikun

tikikun commented Jul 9, 2023

is this fixed?

@tpoisonooo
Author

is this fixed?

Not yet. If you want to run LLaMA inference on CUDA, try https://github.com/InternLM/lmdeploy

@tikikun

tikikun commented Jul 11, 2023

@tpoisonooo Is this approach any different from LLaMA on Optimum? I already converted to TensorRT via Optimum:
huggingface/optimum#975

@Rane2021

Rane2021 commented Aug 2, 2023

@tpoisonooo hi, do you test llama's performance on tensorrt and how fast the token was generated?

On my GTX1060 TI, I got 3.5~5ms per backbone decoder with llama 7B fp16 model, AKA 1000 / (5*32) = 6 token/second

Is there a runnable TensorRT demo for this test?

@xiaobai52HZ

different

How did you convert to TensorRT via Optimum? I can't find a reference for this.

@tp-nan

tp-nan commented Nov 5, 2023

Hi all, we successfully converted to TensorRT!

Each LlamaDecoderLayer was split into three segments: pre, mid, and post. The KV cache between the pre and mid segments reverts to PyTorch for computation. The dimensions inside pre and post do not undergo a transpose operation, enabling batch processing. The mid segment has no parameters, eliminating the need for 32 repetitions; instead, a single mid parameter set is placed on each card. See https://github.com/torchpipe/LLM.TensorRT.Serve
