
Dynamic dimensions required for input: input, but no shapes were provided. Automatically overriding shape to: 1x3x608x608 #1111

Closed
vilmara opened this issue Mar 10, 2021 · 13 comments
Labels
Topic: Dynamic Shape · triaged (Issue has been triaged by maintainers)

Comments

@vilmara

vilmara commented Mar 10, 2021

Description

I am trying to convert the pre-trained PyTorch YOLOv4 (darknet) model to TensorRT INT8 with dynamic batching, to later deploy it on DS-Triton. I am following the general steps in NVIDIA-AI-IOT/yolov4_deepstream, but I am running into issues, first with dynamic dimensions at the ONNX-TRT conversion step, and then when loading the model on DS-Triton:

Environment

TensorRT Version: 7.2.1
NVIDIA GPU: T4
NVIDIA Driver Version: 450.51.06
CUDA Version: 11.1
CUDNN Version: 8.0.4
Operating System: Ubuntu 18.04
Python Version (if applicable): 1.8
Tensorflow Version (if applicable):
PyTorch Version (if applicable): container image nvcr.io/nvidia/pytorch:20.11-py3
Baremetal or Container (if so, version): container image deepstream:5.1-21.02-triton

Relevant Files

YOLOV4 pre-trained model weights and cfg downloaded from
https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov4.cfg
https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.weights

Steps To Reproduce

Complete Pipeline: PyTorch YOLOv4 (darknet) --> ONNX --> TensorRT --> DeepStream-Triton

Step 1: download cfg file and weights from the above link

Step 2: git clone repository pytorch-YOLOv4
$ sudo git clone https://github.com/Tianxiaomo/pytorch-YOLOv4.git

Step 3: Convert model YOLOv4 PyTorch --> ONNX | Dynamic Batch size

$ sudo docker run --gpus all -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v /pytorch-YOLOv4/:/workspace/pytorch-YOLOv4/ nvcr.io/nvidia/pytorch:20.11-py3
$ cd /workspace/pytorch-YOLOv4
$ python demo_darknet2onnx.py "/workspace/pytorch-YOLOv4/models_cfg_weights/yolov4.cfg" "/workspace/pytorch-YOLOv4/models_cfg_weights/yolov4.weights" "/workspace/pytorch-YOLOv4/data/dog.jpg" -1

Result:

Onnx model exporting done
The model expects input shape:  ['batch_size', 3, 608, 608]
Saved model: yolov4_-1_3_608_608_dynamic.onnx
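
For reference, the dynamic-batch export that demo_darknet2onnx.py performs boils down to passing dynamic_axes to torch.onnx.export. A minimal sketch, assuming a loaded darknet model object; the output names below are illustrative, not taken from the script:

```python
import torch

def export_dynamic_onnx(model, onnx_path="yolov4_-1_3_608_608_dynamic.onnx"):
    # Export with a named dynamic batch axis; "input" must match the tensor
    # name later passed to trtexec's --minShapes/--optShapes/--maxShapes.
    model.eval()
    dummy = torch.randn(1, 3, 608, 608)  # H and W stay fixed at 608x608
    torch.onnx.export(
        model, dummy, onnx_path,
        opset_version=11,
        input_names=["input"],
        output_names=["boxes", "confs"],           # illustrative names
        dynamic_axes={"input": {0: "batch_size"},  # batch dimension left dynamic
                      "boxes": {0: "batch_size"},
                      "confs": {0: "batch_size"}},
    )
```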

Step 4: Convert model ONNX --> TensorRT | Dynamic Batch size
$ sudo docker run --gpus all -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY -v /pytorch-YOLOv4/:/workspace/pytorch-YOLOv4/ deepstream:5.1-21.02-triton

$ /usr/src/tensorrt/bin/trtexec --onnx=yolov4_-1_3_608_608_dynamic.onnx --explicitBatch --minShapes=\'data\':1x3x608x608 --optShapes=\'data\':2x3x608x608 --maxShapes=\'data\':8x3x608x608 --workspace=4096 --buildOnly --saveEngine=yolov4_-1_3_608_608_dynamic.onnx_int8.engine --int8

Note: trtexec automatically overrides the engine shape to 1x3x608x608 instead of keeping the dynamic batching

[03/09/2021-22:24:24] [W] Dynamic dimensions required for input: input, but no shapes were provided. Automatically overriding shape to: 1x3x608x608
[03/09/2021-22:24:24] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
[03/09/2021-22:24:25] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[03/09/2021-22:43:52] [I] [TRT] Detected 1 inputs and 8 output network tensors.
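
For context, the --minShapes/--optShapes/--maxShapes flags map to an optimization profile on the builder config. A rough TensorRT 7.x Python sketch of the same build follows; note that setting the INT8 flag without a calibrator is exactly what produces the "Calibrator is not being used" warning above:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_dynamic_engine(onnx_path):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            return None
    config = builder.create_builder_config()
    config.max_workspace_size = 4 << 30  # 4 GiB, matching --workspace=4096
    profile = builder.create_optimization_profile()
    profile.set_shape("input",           # must match the ONNX input name
                      (1, 3, 608, 608),  # min
                      (2, 3, 608, 608),  # opt
                      (8, 3, 608, 608))  # max
    config.add_optimization_profile(profile)
    config.set_flag(trt.BuilderFlag.INT8)  # without a calibrator, TRT warns as above
    return builder.build_engine(network, config)
```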

$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_-1_3_608_608_dynamic.onnx_int8.engine --int8
Result BS=1:

.
[03/09/2021-22:48:45] [E] [TRT] Parameter check failed at: engine.cpp::enqueue::445, condition: batchSize > 0 && batchSize <= mEngine.getMaxBatchSize(). Note: Batch size was: 2, but engine max batch size was: 1
[03/09/2021-22:48:45] [I] Warmup completed 312 queries over 200 ms
[03/09/2021-22:48:45] [I] Timing trace has 4704 queries over 3.0043 s
[03/09/2021-22:48:45] [I] Trace averages of 10 runs:
.
[03/09/2021-22:46:29] [I] Host Latency
[03/09/2021-22:46:29] [I] min: 6.81131 ms (end to end 11.6827 ms)
[03/09/2021-22:46:29] [I] max: 10.3354 ms (end to end 21.7613 ms)
[03/09/2021-22:46:29] [I] mean: 7.02095 ms (end to end 12.1098 ms)
[03/09/2021-22:46:29] [I] median: 7.00833 ms (end to end 12.0729 ms)
[03/09/2021-22:46:29] [I] percentile: 7.2074 ms at 99% (end to end 12.4701 ms at 99%)
[03/09/2021-22:46:29] [I] throughput: 163.949 qps
[03/09/2021-22:46:29] [I] walltime: 3.02533 s
[03/09/2021-22:46:29] [I] Enqueue Time
[03/09/2021-22:46:29] [I] min: 1.49683 ms
[03/09/2021-22:46:29] [I] max: 1.841 ms
[03/09/2021-22:46:29] [I] median: 1.52332 ms
[03/09/2021-22:46:29] [I] GPU Compute
[03/09/2021-22:46:29] [I] min: 5.86343 ms
[03/09/2021-22:46:29] [I] max: 9.38628 ms
[03/09/2021-22:46:29] [I] mean: 6.0721 ms
[03/09/2021-22:46:29] [I] median: 6.05927 ms
[03/09/2021-22:46:29] [I] percentile: 6.25732 ms at 99%
[03/09/2021-22:46:29] [I] total compute time: 3.01176 s

Result BS=2:
Error:
03/09/2021-22:48:45] [E] [TRT] Parameter check failed at: engine.cpp::enqueue::445, condition: batchSize > 0 && batchSize <= mEngine.getMaxBatchSize(). Note: Batch size was: 2, but engine max batch size was: 1

Step 5: Config the DS-Triton files as described in the sample NVIDIA-AI-IOT/yolov4_deepstream

Step 6: Run YOLOV4 INT8 mode with Dynamic shapes with DS-Triton
$ deepstream-app -c deepstream_app_config_yoloV4.txt
Error: "unable to autofill for 'yolov4_nvidia', either all model tensor configuration should specify their dims or none"

root@1101333383d9:/workspace/Deepstream_5.1_Triton/samples/configs/deepstream-app-trtis# deepstream-app -c source1_primary_yolov4.txt
I0309 23:25:10.628131 260 metrics.cc:219] Collecting metrics for GPU 0: Tesla T4
I0309 23:25:10.634856 260 metrics.cc:219] Collecting metrics for GPU 1: Tesla T4
I0309 23:25:10.641297 260 metrics.cc:219] Collecting metrics for GPU 2: Tesla T4
I0309 23:25:10.647843 260 metrics.cc:219] Collecting metrics for GPU 3: Tesla T4
I0309 23:25:10.706528 260 pinned_memory_manager.cc:199] Pinned memory pool is created at '0x7febf8000000' with size 268435456
I0309 23:25:10.710959 260 cuda_memory_manager.cc:99] CUDA memory pool is created on device 0 with size 67108864
I0309 23:25:10.710967 260 cuda_memory_manager.cc:99] CUDA memory pool is created on device 1 with size 67108864
I0309 23:25:10.710972 260 cuda_memory_manager.cc:99] CUDA memory pool is created on device 2 with size 67108864
I0309 23:25:10.710976 260 cuda_memory_manager.cc:99] CUDA memory pool is created on device 3 with size 67108864
I0309 23:25:10.991848 260 server.cc:141]
.
| Backend | Config | Path |
.
.

I0309 23:25:10.991880 260 server.cc:184]
.
| Model | Version | Status |
.
.

I0309 23:25:10.991971 260 tritonserver.cc:1620]
.
| Option                           | Value                                                                                                                            |
.
| server_id                        | triton                                                                                                                           |
| server_version                   | 2.5.0                                                                                                                            |
| server_extensions                | classification sequence model_repository schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tens |
|                                  | or_data statistics                                                                                                               |
| model_repository_path[0]         | /workspace/Deepstream_5.1_Triton/samples/trtis_model_repo                                                                        |
| model_control_mode               | MODE_EXPLICIT                                                                                                                    |
| strict_model_config              | 0                                                                                                                                |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                        |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                         |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                         |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                                                         |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                                                         |
| min_supported_compute_capability | 6.0                                                                                                                              |
| strict_readiness                 | 1                                                                                                                                |
| exit_timeout                     | 30                                                                                                                               |
.

E0309 23:25:22.300254 260 model_repository_manager.cc:1705] unable to autofill for 'yolov4_nvidia', either all model tensor configuration should specify their dims or none.
ERROR: infer_trtis_server.cpp:1044 Triton: failed to load model yolov4_nvidia, triton_err_str:Internal, err_msg:failed to load 'yolov4_nvidia', no version is available
ERROR: infer_trtis_backend.cpp:45 failed to load model: yolov4_nvidia, nvinfer error:NVDSINFER_TRTIS_ERROR
ERROR: infer_trtis_backend.cpp:184 failed to initialize backend while ensuring model:yolov4_nvidia ready, nvinfer error:NVDSINFER_TRTIS_ERROR
0:00:14.399726167   260 0x564fdec902f0 ERROR          nvinferserver gstnvinferserver.cpp:362:gst_nvinfer_server_logger:<primary_gie> nvinferserver[UID 1]: Error in createNNBackend() <infer_trtis_context.cpp:246> [UID = 1]: failed to initialize trtis backend for model:yolov4_nvidia, nvinfer error:NVDSINFER_TRTIS_ERROR
I0309 23:25:22.300489 260 server.cc:280] Waiting for in-flight requests to complete.
I0309 23:25:22.300497 260 server.cc:295] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
0:00:14.399831360   260 0x564fdec902f0 ERROR          nvinferserver gstnvinferserver.cpp:362:gst_nvinfer_server_logger:<primary_gie> nvinferserver[UID 1]: Error in initialize() <infer_base_context.cpp:81> [UID = 1]: create nn-backend failed, check config file settings, nvinfer error:NVDSINFER_TRTIS_ERROR
0:00:14.399843072   260 0x564fdec902f0 WARN           nvinferserver gstnvinferserver_impl.cpp:439:start:<primary_gie> error: Failed to initialize InferTrtIsContext
0:00:14.399868241   260 0x564fdec902f0 WARN           nvinferserver gstnvinferserver_impl.cpp:439:start:<primary_gie> error: Config file path: /workspace/Deepstream_5.1_Triton/samples/configs/deepstream-app-trtis/config_infer_primary_yolov4.txt
0:00:14.400284532   260 0x564fdec902f0 WARN           nvinferserver gstnvinferserver.cpp:460:gst_nvinfer_server_start:<primary_gie> error: gstnvinferserver_impl start failed
** ERROR: <main:655>: Failed to set pipeline to PAUSED
Quitting
ERROR from primary_gie: Failed to initialize InferTrtIsContext
Debug info: gstnvinferserver_impl.cpp(439): start (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInferServer:primary_gie:
Config file path: /workspace/Deepstream_5.1_Triton/samples/configs/deepstream-app-trtis/config_infer_primary_yolov4.txt
ERROR from primary_gie: gstnvinferserver_impl start failed
Debug info: gstnvinferserver.cpp(460): gst_nvinfer_server_start (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInferServer:primary_gie
App run failed
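
For what it's worth, the "unable to autofill" error means Triton could not derive a complete model configuration on its own; supplying an explicit config.pbtxt in the model repository usually sidesteps autofill. A sketch under assumed names and dims (the output tensor name and its dims are placeholders, not read from the engine):

```
name: "yolov4_nvidia"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 608, 608 ]   # batch dim is excluded when max_batch_size > 0
  }
]
output [
  {
    name: "boxes"           # placeholder; must match the engine's output binding
    data_type: TYPE_FP32
    dims: [ 22743, 1, 4 ]   # placeholder dims
  }
]
```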

I think the problem is with trtexec. Is there a sample/tool that shows how to optimize a YOLO PyTorch-ONNX model to a TensorRT engine in INT8 mode with full INT8 calibration and dynamic input shapes?
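
For the "full INT8 calibration" part, the usual programmatic route is an IInt8EntropyCalibrator2 implementation fed with representative preprocessed batches (trtexec can then reuse the resulting cache via --calib). A minimal sketch, assuming pycuda and an iterable of (N, 3, 608, 608) float32 arrays; all names here are mine, not from any NVIDIA sample:

```python
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import tensorrt as trt

class YoloEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, batch_size=1, cache_file="yolov4_int8.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)  # iterable of (N, 3, 608, 608) float32 arrays
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.device_input = cuda.mem_alloc(batch_size * 3 * 608 * 608 * 4)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None  # tells TRT the calibration data is exhausted
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```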

@vilmara changed the title from "Unable to autofill for 'yolov4_nvidia', either all model tensor configuration should specify their dims or none." to "Dynamic dimensions required for input: input, but no shapes were provided. Automatically overriding shape to: 1x3x608x608" Mar 10, 2021
@pranavm-nvidia
Collaborator

@vilmara It looks like the model input is called input, but you're using data:

--minShapes=\'data\':1x3x608x608 --optShapes=\'data\':2x3x608x608 --maxShapes=\'data\':8x3x608x608

@vilmara
Author

vilmara commented Mar 10, 2021

Hi @pranavm-nvidia, thanks for your prompt reply. You are right; I tried with the input name "input" and got the same result, with trtexec generating the engine with a static batch size (Dynamic dimensions required for input: input, but no shapes were provided. Automatically overriding shape to: 1x3x608x608). Please see below:

Input shape ONNX model:
[screenshot: the ONNX model's input has a dynamic batch_size dimension]

Step: Generating the engine with the input name 'input'
$ /usr/src/tensorrt/bin/trtexec --onnx=yolov4_-1_3_608_608_dynamic.onnx --explicitBatch --minShapes=\'input\':1x3x608x608 --optShapes=\'input\':2x3x608x608 --maxShapes=\'input\':8x3x608x608 --workspace=4096 --buildOnly --saveEngine=yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine --int8

&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=yolov4_-1_3_608_608_dynamic.onnx --explicitBatch --minShapes='input':1x3x608x608 --optShapes='input':2x3x608x608 --maxShapes='input':8x3x608x608 --workspace=4096 --buildOnly --saveEngine=yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine --int8
[03/10/2021-00:43:28] [I] === Model Options ===
[03/10/2021-00:43:28] [I] Format: ONNX
[03/10/2021-00:43:28] [I] Model: yolov4_-1_3_608_608_dynamic.onnx
[03/10/2021-00:43:28] [I] Output:
[03/10/2021-00:43:28] [I] === Build Options ===
[03/10/2021-00:43:28] [I] Max batch: explicit
[03/10/2021-00:43:28] [I] Workspace: 4096 MB
[03/10/2021-00:43:28] [I] minTiming: 1
[03/10/2021-00:43:28] [I] avgTiming: 8
[03/10/2021-00:43:28] [I] Precision: FP32+INT8
[03/10/2021-00:43:28] [I] Calibration: Dynamic
[03/10/2021-00:43:28] [I] Safe mode: Disabled
[03/10/2021-00:43:28] [I] Save engine: yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine
[03/10/2021-00:43:28] [I] Load engine:
[03/10/2021-00:43:28] [I] Builder Cache: Enabled
[03/10/2021-00:43:28] [I] NVTX verbosity: 0
[03/10/2021-00:43:28] [I] Inputs format: fp32:CHW
[03/10/2021-00:43:28] [I] Outputs format: fp32:CHW
[03/10/2021-00:43:28] [I] Input build shape: input=1x3x608x608+2x3x608x608+8x3x608x608
[03/10/2021-00:43:28] [I] Input calibration shapes: model
[03/10/2021-00:43:28] [I] === System Options ===
[03/10/2021-00:43:28] [I] Device: 0
[03/10/2021-00:43:28] [I] DLACore:
[03/10/2021-00:43:28] [I] Plugins:
[03/10/2021-00:43:28] [I] === Inference Options ===
[03/10/2021-00:43:28] [I] Batch: Explicit
[03/10/2021-00:43:28] [I] Input inference shape: input=2x3x608x608
[03/10/2021-00:43:28] [I] Iterations: 10
[03/10/2021-00:43:28] [I] Duration: 3s (+ 200ms warm up)
[03/10/2021-00:43:28] [I] Sleep time: 0ms
[03/10/2021-00:43:28] [I] Streams: 1
[03/10/2021-00:43:28] [I] ExposeDMA: Disabled
[03/10/2021-00:43:28] [I] Spin-wait: Disabled
[03/10/2021-00:43:28] [I] Multithreading: Disabled
[03/10/2021-00:43:28] [I] CUDA Graph: Disabled
[03/10/2021-00:43:28] [I] Skip inference: Enabled
[03/10/2021-00:43:28] [I] Inputs:
[03/10/2021-00:43:28] [I] === Reporting Options ===
[03/10/2021-00:43:28] [I] Verbose: Disabled
[03/10/2021-00:43:28] [I] Averages: 10 inferences
[03/10/2021-00:43:28] [I] Percentile: 99
[03/10/2021-00:43:28] [I] Dump output: Disabled
[03/10/2021-00:43:28] [I] Profile: Disabled
[03/10/2021-00:43:28] [I] Export timing to JSON file:
[03/10/2021-00:43:28] [I] Export output to JSON file:
[03/10/2021-00:43:28] [I] Export profile to JSON file:
[03/10/2021-00:43:28] [I]
----------------------------------------------------------------
Input filename:   yolov4_-1_3_608_608_dynamic.onnx
ONNX IR version:  0.0.6
Opset version:    11
Producer name:    pytorch
Producer version: 1.8
Domain:
Model version:    0
Doc string:
----------------------------------------------------------------
[03/10/2021-00:43:39] [W] [TRT] /home/jenkins/workspace/OSS/L0_MergeRequest/oss/parsers/onnx/onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[03/10/2021-00:43:39] [W] [TRT] /home/jenkins/workspace/OSS/L0_MergeRequest/oss/parsers/onnx/onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[03/10/2021-00:43:39] [W] [TRT] /home/jenkins/workspace/OSS/L0_MergeRequest/oss/parsers/onnx/onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[03/10/2021-00:43:39] [W] [TRT] /home/jenkins/workspace/OSS/L0_MergeRequest/oss/parsers/onnx/onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[03/10/2021-00:43:39] [W] [TRT] /home/jenkins/workspace/OSS/L0_MergeRequest/oss/parsers/onnx/onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[03/10/2021-00:43:39] [W] [TRT] /home/jenkins/workspace/OSS/L0_MergeRequest/oss/parsers/onnx/onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[03/10/2021-00:43:39] [W] [TRT] /home/jenkins/workspace/OSS/L0_MergeRequest/oss/parsers/onnx/onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[03/10/2021-00:43:39] [W] [TRT] Output type must be INT32 for shape outputs
[03/10/2021-00:43:39] [W] [TRT] Output type must be INT32 for shape outputs
[03/10/2021-00:43:39] [W] [TRT] Output type must be INT32 for shape outputs
[03/10/2021-00:43:39] [W] [TRT] Output type must be INT32 for shape outputs
[03/10/2021-00:43:39] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
[03/10/2021-00:43:39] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[03/10/2021-00:48:29] [I] [TRT] Detected 1 inputs and 8 output network tensors.
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=yolov4_-1_3_608_608_dynamic.onnx --explicitBatch --minShapes='input':1x3x608x608 --optShapes='input':2x3x608x608 --maxShapes='input':8x3x608x608 --workspace=4096 --buildOnly --saveEngine=yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine --int8

Running the model | BS=2
$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine --int8 --batch=2

Note: I got batch dimension errors again; the engine appears to have been converted to a static batch size of 1x3x608x608 instead of keeping the dynamic shape, and the reported throughput is wrong

[03/10/2021-00:52:09] [I] === Model Options ===
[03/10/2021-00:52:09] [I] Format: *
[03/10/2021-00:52:09] [I] Model:
[03/10/2021-00:52:09] [I] Output:
[03/10/2021-00:52:09] [I] === Build Options ===
[03/10/2021-00:52:09] [I] Max batch: 2
[03/10/2021-00:52:09] [I] Workspace: 16 MB
[03/10/2021-00:52:09] [I] minTiming: 1
[03/10/2021-00:52:09] [I] avgTiming: 8
[03/10/2021-00:52:09] [I] Precision: FP32+INT8
[03/10/2021-00:52:09] [I] Calibration: Dynamic
[03/10/2021-00:52:09] [I] Safe mode: Disabled
[03/10/2021-00:52:09] [I] Save engine:
[03/10/2021-00:52:09] [I] Load engine: yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine
[03/10/2021-00:52:09] [I] Builder Cache: Enabled
[03/10/2021-00:52:09] [I] NVTX verbosity: 0
[03/10/2021-00:52:09] [I] Inputs format: fp32:CHW
[03/10/2021-00:52:09] [I] Outputs format: fp32:CHW
[03/10/2021-00:52:09] [I] Input build shapes: model
[03/10/2021-00:52:09] [I] Input calibration shapes: model
[03/10/2021-00:52:09] [I] === System Options ===
[03/10/2021-00:52:09] [I] Device: 0
[03/10/2021-00:52:09] [I] DLACore:
[03/10/2021-00:52:09] [I] Plugins:
[03/10/2021-00:52:09] [I] === Inference Options ===
[03/10/2021-00:52:09] [I] Batch: 2
[03/10/2021-00:52:09] [I] Input inference shapes: model
[03/10/2021-00:52:09] [I] Iterations: 10
[03/10/2021-00:52:09] [I] Duration: 3s (+ 200ms warm up)
[03/10/2021-00:52:09] [I] Sleep time: 0ms
[03/10/2021-00:52:09] [I] Streams: 1
[03/10/2021-00:52:09] [I] ExposeDMA: Disabled
[03/10/2021-00:52:09] [I] Spin-wait: Disabled
[03/10/2021-00:52:09] [I] Multithreading: Disabled
[03/10/2021-00:52:09] [I] CUDA Graph: Disabled
[03/10/2021-00:52:09] [I] Skip inference: Disabled
[03/10/2021-00:52:09] [I] Inputs:
[03/10/2021-00:52:09] [I] === Reporting Options ===
[03/10/2021-00:52:09] [I] Verbose: Disabled
[03/10/2021-00:52:09] [I] Averages: 10 inferences
[03/10/2021-00:52:09] [I] Percentile: 99
[03/10/2021-00:52:09] [I] Dump output: Disabled
[03/10/2021-00:52:09] [I] Profile: Disabled
[03/10/2021-00:52:09] [I] Export timing to JSON file:
[03/10/2021-00:52:09] [I] Export output to JSON file:
[03/10/2021-00:52:09] [I] Export profile to JSON file:
[03/10/2021-00:52:09] [I]
[03/10/2021-00:52:20] [W] Dynamic dimensions required for input: input, but no shapes were provided. Automatically overriding shape to: 1x3x608x608
[03/10/2021-00:52:21] [I] Starting inference threads
[03/10/2021-00:52:21] [E] [TRT] Parameter check failed at: engine.cpp::enqueue::445, condition: batchSize > 0 && batchSize <= mEngine.getMaxBatchSize(). Note: Batch size was: 2, but engine max batch size was: 1
[03/10/2021-00:52:21] [E] [TRT] Parameter check failed at: engine.cpp::enqueue::445, condition: batchSize > 0 && batchSize <= mEngine.getMaxBatchSize(). Note: Batch size was: 2, but engine max batch size was: 1
[03/10/2021-00:52:21] [E] [TRT] Parameter check failed at: engine.cpp::enqueue::445, condition: batchSize > 0 && batchSize <= mEngine.getMaxBatchSize(). Note: Batch size was: 2, but engine max batch size was: 1
.
[03/10/2021-00:52:24] [I] min: 1.95605 ms (end to end 2.39429 ms)
[03/10/2021-00:52:24] [I] max: 2.0972 ms (end to end 2.50848 ms)
[03/10/2021-00:52:24] [I] mean: 2.06553 ms (end to end 2.50324 ms)
[03/10/2021-00:52:24] [I] median: 2.06543 ms (end to end 2.50348 ms)
[03/10/2021-00:52:24] [I] percentile: 2.06836 ms at 99% (end to end 2.50635 ms at 99%)
**[03/10/2021-00:52:24] [I] throughput: 1565.76 qps**
[03/10/2021-00:52:24] [I] walltime: 3.0043 s
[03/10/2021-00:52:24] [I] Enqueue Time
[03/10/2021-00:52:24] [I] min: 0.0118408 ms
[03/10/2021-00:52:24] [I] max: 0.0271606 ms
[03/10/2021-00:52:24] [I] median: 0.0124512 ms
[03/10/2021-00:52:24] [I] GPU Compute
[03/10/2021-00:52:24] [I] min: 0.00195312 ms
[03/10/2021-00:52:24] [I] max: 0.00463867 ms
[03/10/2021-00:52:24] [I] mean: 0.00305392 ms
[03/10/2021-00:52:24] [I] median: 0.00292969 ms
[03/10/2021-00:52:24] [I] percentile: 0.00390625 ms at 99%
[03/10/2021-00:52:24] [I] total compute time: 0.00718283 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine --int8 --batch=2
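
(Note: the microsecond-scale GPU Compute times above, around 0.003 ms, together with the repeated enqueue errors suggest that no inference actually executed in this run; the 1565.76 qps figure is counting failed enqueues, not real work.)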

@pranavm-nvidia
Collaborator

@vilmara Can you try using --shapes to set the inference shapes in your second command? --batch is meant for implicit batch networks and is deprecated.
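
For reference, --shapes at load time corresponds to setting the binding shape on the execution context before enqueuing; roughly, in the TensorRT 7.x Python API (sketch only; engine deserialization and buffer allocation omitted):

```python
import tensorrt as trt

def make_context(engine, batch=2):
    # Explicit-batch engines ignore the legacy --batch flag; the runtime
    # shape is chosen per context via set_binding_shape instead.
    context = engine.create_execution_context()
    context.set_binding_shape(0, (batch, 3, 608, 608))  # binding 0 is "input"
    assert context.all_binding_shapes_specified
    return context  # then bind buffers and call context.execute_v2(bindings)
```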

@vilmara
Author

vilmara commented Mar 10, 2021

@pranavm-nvidia, please see the results below, with batch size 1 and batch size 2

With Max batch: 1
$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine --int8

Throughput: 154.158 qps

&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine --int8
[03/10/2021-01:20:54] [I] === Model Options ===
[03/10/2021-01:20:54] [I] Format: *
[03/10/2021-01:20:54] [I] Model:
[03/10/2021-01:20:54] [I] Output:
[03/10/2021-01:20:54] [I] === Build Options ===
[03/10/2021-01:20:54] [I] Max batch: 1
[03/10/2021-01:20:54] [I] Workspace: 16 MB
[03/10/2021-01:20:54] [I] minTiming: 1
[03/10/2021-01:20:54] [I] avgTiming: 8
[03/10/2021-01:20:54] [I] Precision: FP32+INT8
[03/10/2021-01:20:54] [I] Calibration: Dynamic
[03/10/2021-01:20:54] [I] Safe mode: Disabled
[03/10/2021-01:20:54] [I] Save engine:
[03/10/2021-01:20:54] [I] Load engine: yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine
[03/10/2021-01:20:54] [I] Builder Cache: Enabled
[03/10/2021-01:20:54] [I] NVTX verbosity: 0
[03/10/2021-01:20:54] [I] Inputs format: fp32:CHW
[03/10/2021-01:20:54] [I] Outputs format: fp32:CHW
[03/10/2021-01:20:54] [I] Input build shapes: model
[03/10/2021-01:20:54] [I] Input calibration shapes: model
[03/10/2021-01:20:54] [I] === System Options ===
[03/10/2021-01:20:54] [I] Device: 0
[03/10/2021-01:20:54] [I] DLACore:
[03/10/2021-01:20:54] [I] Plugins:
[03/10/2021-01:20:54] [I] === Inference Options ===
[03/10/2021-01:20:54] [I] Batch: 1
[03/10/2021-01:20:54] [I] Input inference shapes: model
[03/10/2021-01:20:54] [I] Iterations: 10
[03/10/2021-01:20:54] [I] Duration: 3s (+ 200ms warm up)
[03/10/2021-01:20:54] [I] Sleep time: 0ms
[03/10/2021-01:20:54] [I] Streams: 1
[03/10/2021-01:20:54] [I] ExposeDMA: Disabled
[03/10/2021-01:20:54] [I] Spin-wait: Disabled
[03/10/2021-01:20:54] [I] Multithreading: Disabled
[03/10/2021-01:20:54] [I] CUDA Graph: Disabled
[03/10/2021-01:20:54] [I] Skip inference: Disabled
[03/10/2021-01:20:54] [I] Inputs:
[03/10/2021-01:20:54] [I] === Reporting Options ===
[03/10/2021-01:20:54] [I] Verbose: Disabled
[03/10/2021-01:20:54] [I] Averages: 10 inferences
[03/10/2021-01:20:54] [I] Percentile: 99
[03/10/2021-01:20:54] [I] Dump output: Disabled
[03/10/2021-01:20:54] [I] Profile: Disabled
[03/10/2021-01:20:54] [I] Export timing to JSON file:
[03/10/2021-01:20:54] [I] Export output to JSON file:
[03/10/2021-01:20:54] [I] Export profile to JSON file:
[03/10/2021-01:20:54] [I]
[03/10/2021-01:21:05] [W] Dynamic dimensions required for input: input, but no shapes were provided. Automatically overriding shape to: 1x3x608x608
[03/10/2021-01:21:06] [I] Starting inference threads
[03/10/2021-01:21:09] [I] Warmup completed 15 queries over 200 ms
[03/10/2021-01:21:09] [I] Timing trace has 466 queries over 3.02287 s
[03/10/2021-01:21:09] [I] Trace averages of 10 runs:
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 10.3064 ms - Host latency: 11.2575 ms (end to end 21.3018 ms, enqueue 1.58119 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.2913 ms - Host latency: 7.24501 ms (end to end 12.5531 ms, enqueue 1.541 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.31232 ms - Host latency: 7.26759 ms (end to end 12.5699 ms, enqueue 1.54997 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.34031 ms - Host latency: 7.29834 ms (end to end 12.6406 ms, enqueue 1.54017 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.33834 ms - Host latency: 7.29172 ms (end to end 12.6288 ms, enqueue 1.50324 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.33023 ms - Host latency: 7.28496 ms (end to end 12.6171 ms, enqueue 1.49459 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.35021 ms - Host latency: 7.30355 ms (end to end 12.6524 ms, enqueue 1.56425 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.33003 ms - Host latency: 7.28468 ms (end to end 12.6167 ms, enqueue 1.49919 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.3475 ms - Host latency: 7.30049 ms (end to end 12.6528 ms, enqueue 1.53017 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.31423 ms - Host latency: 7.26859 ms (end to end 12.58 ms, enqueue 1.5191 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.32168 ms - Host latency: 7.27675 ms (end to end 12.6064 ms, enqueue 1.52623 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.33661 ms - Host latency: 7.29273 ms (end to end 12.6216 ms, enqueue 1.50298 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.335 ms - Host latency: 7.28808 ms (end to end 12.635 ms, enqueue 1.5046 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.37668 ms - Host latency: 7.32964 ms (end to end 12.7022 ms, enqueue 1.5267 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.46835 ms - Host latency: 7.42238 ms (end to end 12.8584 ms, enqueue 1.57943 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.59392 ms - Host latency: 7.54982 ms (end to end 13.1529 ms, enqueue 1.54055 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.39971 ms - Host latency: 7.35326 ms (end to end 12.7646 ms, enqueue 1.56095 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.37819 ms - Host latency: 7.33138 ms (end to end 12.7025 ms, enqueue 1.56886 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.34188 ms - Host latency: 7.29685 ms (end to end 12.6504 ms, enqueue 1.52433 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.3759 ms - Host latency: 7.33035 ms (end to end 12.6989 ms, enqueue 1.52706 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.3494 ms - Host latency: 7.30215 ms (end to end 12.6496 ms, enqueue 1.54875 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.40233 ms - Host latency: 7.35695 ms (end to end 12.7567 ms, enqueue 1.53451 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.3105 ms - Host latency: 7.26461 ms (end to end 12.5787 ms, enqueue 1.5144 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.34429 ms - Host latency: 7.30092 ms (end to end 12.6401 ms, enqueue 1.51837 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.37424 ms - Host latency: 7.32961 ms (end to end 12.6914 ms, enqueue 1.56772 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.35863 ms - Host latency: 7.31173 ms (end to end 12.6829 ms, enqueue 1.52521 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.35978 ms - Host latency: 7.31337 ms (end to end 12.6607 ms, enqueue 1.52924 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.32866 ms - Host latency: 7.28274 ms (end to end 12.6234 ms, enqueue 1.53496 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.34381 ms - Host latency: 7.29774 ms (end to end 12.6369 ms, enqueue 1.52136 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.34426 ms - Host latency: 7.29944 ms (end to end 12.6243 ms, enqueue 1.4979 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.59482 ms - Host latency: 7.54722 ms (end to end 13.1421 ms, enqueue 1.54148 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.42817 ms - Host latency: 7.38113 ms (end to end 12.8258 ms, enqueue 1.55986 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.37895 ms - Host latency: 7.33132 ms (end to end 12.7084 ms, enqueue 1.52351 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.38203 ms - Host latency: 7.3373 ms (end to end 12.7233 ms, enqueue 1.54202 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.36753 ms - Host latency: 7.32026 ms (end to end 12.6802 ms, enqueue 1.54561 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.35054 ms - Host latency: 7.30452 ms (end to end 12.6658 ms, enqueue 1.54045 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.3325 ms - Host latency: 7.2875 ms (end to end 12.6056 ms, enqueue 1.49817 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.33772 ms - Host latency: 7.29207 ms (end to end 12.6232 ms, enqueue 1.50571 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.38477 ms - Host latency: 7.33865 ms (end to end 12.7271 ms, enqueue 1.54875 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.3635 ms - Host latency: 7.31834 ms (end to end 12.6756 ms, enqueue 1.503 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.36074 ms - Host latency: 7.31372 ms (end to end 12.6804 ms, enqueue 1.58777 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.35349 ms - Host latency: 7.30757 ms (end to end 12.6509 ms, enqueue 1.52463 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.3543 ms - Host latency: 7.30801 ms (end to end 12.6764 ms, enqueue 1.52825 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.3792 ms - Host latency: 7.33228 ms (end to end 12.7003 ms, enqueue 1.54377 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.35442 ms - Host latency: 7.30952 ms (end to end 12.6701 ms, enqueue 1.52278 ms)
[03/10/2021-01:21:09] [I] Average on 10 runs - GPU latency: 6.4991 ms - Host latency: 7.45127 ms (end to end 12.935 ms, enqueue 1.54097 ms)
[03/10/2021-01:21:09] [I] Host Latency
[03/10/2021-01:21:09] [I] min: 7.22192 ms (end to end 12.4314 ms)
[03/10/2021-01:21:09] [I] max: 14.8126 ms (end to end 27.6749 ms)
[03/10/2021-01:21:09] [I] mean: 7.40889 ms (end to end 12.879 ms)
[03/10/2021-01:21:09] [I] median: 7.30386 ms (end to end 12.6557 ms)
[03/10/2021-01:21:09] [I] percentile: 14.8004 ms at 99% (end to end 27.6423 ms at 99%)
[03/10/2021-01:21:09] [I] throughput: 154.158 qps
[03/10/2021-01:21:09] [I] walltime: 3.02287 s
[03/10/2021-01:21:09] [I] Enqueue Time
[03/10/2021-01:21:09] [I] min: 1.46539 ms
[03/10/2021-01:21:09] [I] max: 1.80786 ms
[03/10/2021-01:21:09] [I] median: 1.48773 ms
[03/10/2021-01:21:09] [I] GPU Compute
[03/10/2021-01:21:09] [I] min: 6.26685 ms
[03/10/2021-01:21:09] [I] max: 13.8603 ms
[03/10/2021-01:21:09] [I] mean: 6.45483 ms
[03/10/2021-01:21:09] [I] median: 6.34949 ms
[03/10/2021-01:21:09] [I] percentile: 13.8505 ms at 99%
[03/10/2021-01:21:09] [I] total compute time: 3.00795 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine --int8

With BS=2 | --shapes='input':2x3x608x608

$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine --int8 --shapes=\'input\':2x3x608x608

Throughput: 0 qps

&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine --int8 --shapes='input':2x3x608x608
[03/10/2021-01:25:58] [I] === Model Options ===
[03/10/2021-01:25:58] [I] Format: *
[03/10/2021-01:25:58] [I] Model:
[03/10/2021-01:25:58] [I] Output:
[03/10/2021-01:25:58] [I] === Build Options ===
[03/10/2021-01:25:58] [I] Max batch: explicit
[03/10/2021-01:25:58] [I] Workspace: 16 MB
[03/10/2021-01:25:58] [I] minTiming: 1
[03/10/2021-01:25:58] [I] avgTiming: 8
[03/10/2021-01:25:58] [I] Precision: FP32+INT8
[03/10/2021-01:25:58] [I] Calibration: Dynamic
[03/10/2021-01:25:58] [I] Safe mode: Disabled
[03/10/2021-01:25:58] [I] Save engine:
[03/10/2021-01:25:58] [I] Load engine: yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine
[03/10/2021-01:25:58] [I] Builder Cache: Enabled
[03/10/2021-01:25:58] [I] NVTX verbosity: 0
[03/10/2021-01:25:58] [I] Inputs format: fp32:CHW
[03/10/2021-01:25:58] [I] Outputs format: fp32:CHW
[03/10/2021-01:25:58] [I] Input build shape: input=2x3x608x608+2x3x608x608+2x3x608x608
[03/10/2021-01:25:58] [I] Input calibration shapes: model
[03/10/2021-01:25:58] [I] === System Options ===
[03/10/2021-01:25:58] [I] Device: 0
[03/10/2021-01:25:58] [I] DLACore:
[03/10/2021-01:25:58] [I] Plugins:
[03/10/2021-01:25:58] [I] === Inference Options ===
[03/10/2021-01:25:58] [I] Batch: Explicit
[03/10/2021-01:25:58] [I] Input inference shape: input=2x3x608x608
[03/10/2021-01:25:58] [I] Iterations: 10
[03/10/2021-01:25:58] [I] Duration: 3s (+ 200ms warm up)
[03/10/2021-01:25:58] [I] Sleep time: 0ms
[03/10/2021-01:25:58] [I] Streams: 1
[03/10/2021-01:25:58] [I] ExposeDMA: Disabled
[03/10/2021-01:25:58] [I] Spin-wait: Disabled
[03/10/2021-01:25:58] [I] Multithreading: Disabled
[03/10/2021-01:25:58] [I] CUDA Graph: Disabled
[03/10/2021-01:25:58] [I] Skip inference: Disabled
[03/10/2021-01:25:58] [I] Inputs:
[03/10/2021-01:25:58] [I] === Reporting Options ===
[03/10/2021-01:25:58] [I] Verbose: Disabled
[03/10/2021-01:25:58] [I] Averages: 10 inferences
[03/10/2021-01:25:58] [I] Percentile: 99
[03/10/2021-01:25:58] [I] Dump output: Disabled
[03/10/2021-01:25:58] [I] Profile: Disabled
[03/10/2021-01:25:58] [I] Export timing to JSON file:
[03/10/2021-01:25:58] [I] Export output to JSON file:
[03/10/2021-01:25:58] [I] Export profile to JSON file:
[03/10/2021-01:25:58] [I]
[03/10/2021-01:26:10] [I] Starting inference threads
[03/10/2021-01:26:13] [I] Warmup completed 0 queries over 200 ms
[03/10/2021-01:26:13] [I] Timing trace has 0 queries over 3.04473 s
[03/10/2021-01:26:13] [I] Trace averages of 10 runs:
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 12.1201 ms - Host latency: 14.0069 ms (end to end 25.1735 ms, enqueue 1.49378 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.137 ms - Host latency: 12.0237 ms (end to end 20.2425 ms, enqueue 1.57295 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.0741 ms - Host latency: 11.9609 ms (end to end 20.111 ms, enqueue 1.5246 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.0996 ms - Host latency: 11.986 ms (end to end 20.1477 ms, enqueue 1.57426 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.0957 ms - Host latency: 11.9826 ms (end to end 20.1541 ms, enqueue 1.52883 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.0296 ms - Host latency: 11.9163 ms (end to end 20.0398 ms, enqueue 1.61193 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.1328 ms - Host latency: 12.0197 ms (end to end 20.1968 ms, enqueue 1.52691 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.0667 ms - Host latency: 11.9539 ms (end to end 20.0923 ms, enqueue 1.57367 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.316 ms - Host latency: 12.2041 ms (end to end 20.5112 ms, enqueue 1.47996 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.9389 ms - Host latency: 12.8307 ms (end to end 21.843 ms, enqueue 1.5741 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.4337 ms - Host latency: 12.3211 ms (end to end 20.8643 ms, enqueue 1.53197 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.4604 ms - Host latency: 12.3465 ms (end to end 20.8491 ms, enqueue 1.57727 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.3613 ms - Host latency: 12.2481 ms (end to end 20.7303 ms, enqueue 1.53014 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.0905 ms - Host latency: 11.9771 ms (end to end 20.1365 ms, enqueue 1.57793 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.0721 ms - Host latency: 11.9593 ms (end to end 20.1225 ms, enqueue 1.52898 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.0807 ms - Host latency: 11.9682 ms (end to end 20.098 ms, enqueue 1.53143 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.0834 ms - Host latency: 11.9693 ms (end to end 20.126 ms, enqueue 1.6184 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.1005 ms - Host latency: 11.9877 ms (end to end 20.1294 ms, enqueue 1.48313 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.5925 ms - Host latency: 12.4824 ms (end to end 21.1131 ms, enqueue 1.52883 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.7158 ms - Host latency: 12.6058 ms (end to end 21.4158 ms, enqueue 1.48391 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.3664 ms - Host latency: 12.2531 ms (end to end 20.7108 ms, enqueue 1.53105 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.6887 ms - Host latency: 12.5782 ms (end to end 21.3066 ms, enqueue 1.53381 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.2405 ms - Host latency: 12.1284 ms (end to end 20.4625 ms, enqueue 1.56887 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.3204 ms - Host latency: 12.208 ms (end to end 20.577 ms, enqueue 1.52947 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.2967 ms - Host latency: 12.1829 ms (end to end 20.5491 ms, enqueue 1.52668 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.4952 ms - Host latency: 12.3848 ms (end to end 20.9506 ms, enqueue 1.57275 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.1634 ms - Host latency: 12.0503 ms (end to end 20.3115 ms, enqueue 1.57158 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.2122 ms - Host latency: 12.1003 ms (end to end 20.3671 ms, enqueue 1.5252 ms)
[03/10/2021-01:26:13] [I] Average on 10 runs - GPU latency: 10.4627 ms - Host latency: 12.3501 ms (end to end 20.8417 ms, enqueue 1.52703 ms)
[03/10/2021-01:26:13] [I] Host Latency
[03/10/2021-01:26:13] [I] min: 11.7104 ms (end to end 19.7689 ms)
[03/10/2021-01:26:13] [I] max: 21.8932 ms (end to end 39.9739 ms)
[03/10/2021-01:26:13] [I] mean: 12.2437 ms (end to end 20.7003 ms)
[03/10/2021-01:26:13] [I] median: 12.1147 ms (end to end 20.4098 ms)
[03/10/2021-01:26:13] [I] percentile: 13.0126 ms at 99% (end to end 29.7882 ms at 99%)
[03/10/2021-01:26:13] [I] throughput: 0 qps
[03/10/2021-01:26:13] [I] walltime: 3.04473 s
[03/10/2021-01:26:13] [I] Enqueue Time
[03/10/2021-01:26:13] [I] min: 1.4682 ms
[03/10/2021-01:26:13] [I] max: 1.94559 ms
[03/10/2021-01:26:13] [I] median: 1.48224 ms
[03/10/2021-01:26:13] [I] GPU Compute
[03/10/2021-01:26:13] [I] min: 9.82251 ms
[03/10/2021-01:26:13] [I] max: 20.0069 ms
[03/10/2021-01:26:13] [I] mean: 10.3562 ms
[03/10/2021-01:26:13] [I] median: 10.2275 ms
[03/10/2021-01:26:13] [I] percentile: 11.1216 ms at 99%
[03/10/2021-01:26:13] [I] total compute time: 3.02401 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_-1_3_608_608_dynamic.onnx_int8_trtexec_3.engine --int8 --shapes='input':2x3x608x608

@pranavm-nvidia
Collaborator

Looks right - with batch size 1, the latency is 6.45ms and with batch size 2 it's 10.35ms. Those numbers seem reasonable to me.
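
(Worked out: batch 2 moves 2 images in about 10.35 ms, roughly 193 images/s, while batch 1 moves 1 image in about 6.45 ms, roughly 155 images/s, so batching does improve effective throughput even though per-query latency grows.)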

@vilmara
Author

vilmara commented Mar 10, 2021

@pranavm-nvidia, and the throughput? With shapes > 1 it shows 0 qps (I guess qps means FPS here)

@pranavm-nvidia
Collaborator

@vilmara Yeah, I hadn't noticed that before. That looks like a bug in trtexec. I think the generated engine should be fine, though.

@vilmara
Author

vilmara commented Mar 10, 2021

@pranavm-nvidia, it seems the model generated with trtexec has issues when deployed on DS-Triton. Is there another sample/tool that shows how to optimize a YOLO PyTorch-ONNX model to a TensorRT engine in INT8 mode with full INT8 calibration and dynamic input shapes? I have reported the DS-Triton issue here: triton-inference-server/server#2606
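
If it helps while waiting for a dedicated sample: on the API side, a calibrator like the sketch earlier in this thread would be wired into a dynamic-shape build roughly as below (TensorRT 7.x; the static calibration profile is an assumption about how this release expects calibration shapes, so treat it as a sketch):

```python
import tensorrt as trt

def configure_int8(builder, config, calibrator):
    # Attach the calibrator and pin calibration to one static shape,
    # since the network input itself is dynamic.
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = calibrator
    calib_profile = builder.create_optimization_profile()
    calib_profile.set_shape("input",
                            (1, 3, 608, 608),   # min
                            (1, 3, 608, 608),   # opt
                            (1, 3, 608, 608))   # max
    config.set_calibration_profile(calib_profile)
```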

@pranavm-nvidia
Collaborator

You could look at https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleDynamicReshape
@ttyio added the Release: 7.x, Topic: Dynamic Shape, and triaged (Issue has been triaged by maintainers) labels Mar 11, 2021
@vilmara
Author

vilmara commented Mar 16, 2021

You could look at https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleDynamicReshape

Hi @pranavm-nvidia, I will take a look at it later on.

Regarding the === Build and Inference Batch Options === in trtexec: what options should I use to build the engine with dynamic input shapes so that it can later be deployed on DS with BS > 1? Right now I am getting the error "TensorRT engine only supports max-batch 1" with DS

[screenshot: trtexec help output for === Build and Inference Batch Options ===]

See below:
Build the engine with dynamic shapes:
$ /usr/src/tensorrt/bin/trtexec --onnx=yolov4_-1_3_608_608_dynamic.onnx --explicitBatch --minShapes=\'input\':1x3x608x608 --optShapes=\'input\':4x3x608x608 --maxShapes=\'input\':8x3x608x608 --workspace=4096 --saveEngine=yolov4_-1_3_608_608_dynamic_int8_.engine --int8

Run the inference with trtexec and default batch size
$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_-1_3_608_608_dynamic_onnx_int8_trtexec_4.engine --int8
Result:

[03/16/2021-00:23:14] [I] Host Latency
[03/16/2021-00:23:14] [I] min: 7.01904 ms (end to end 12.0889 ms)
[03/16/2021-00:23:14] [I] max: 7.89343 ms (end to end 13.8339 ms)
[03/16/2021-00:23:14] [I] mean: 7.15021 ms (end to end 12.3533 ms)
[03/16/2021-00:23:14] [I] median: 7.09982 ms (end to end 12.2517 ms)
[03/16/2021-00:23:14] [I] percentile: 7.88986 ms at 99% (end to end 13.818 ms at 99%)
[03/16/2021-00:23:14] [I] throughput: 160.912 qps
[03/16/2021-00:23:14] [I] walltime: 3.02029 s
[03/16/2021-00:23:14] [I] Enqueue Time
[03/16/2021-00:23:14] [I] min: 1.4646 ms
[03/16/2021-00:23:14] [I] max: 1.79004 ms
[03/16/2021-00:23:14] [I] median: 1.48828 ms
[03/16/2021-00:23:14] [I] GPU Compute
[03/16/2021-00:23:14] [I] min: 6.0675 ms
[03/16/2021-00:23:14] [I] max: 6.93729 ms
[03/16/2021-00:23:14] [I] mean: 6.1978 ms
[03/16/2021-00:23:14] [I] median: 6.14783 ms
[03/16/2021-00:23:14] [I] percentile: 6.9351 ms at 99%
[03/16/2021-00:23:14] [I] total compute time: 3.01213 s

Print the engine's input and output shapes:

input shape :  (-1, 3, 608, 608)
out shape :  (-1, 22743, 1, 4)
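
A printout like this can be reproduced from the serialized engine with a few lines of the TensorRT Python API (sketch; the -1 entries are the dynamic batch dimension):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("yolov4_-1_3_608_608_dynamic_onnx_int8_trtexec_4.engine", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())
for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "out"
    print(kind, "shape : ", tuple(engine.get_binding_shape(i)))
```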

Deploy the engine with DS

Run inference on DS with max_batch_size=1
$ deepstream-app -c source1_primary_yolov4.txt

I0316 01:15:35.232182 159 model_repository_manager.cc:810] loading: yolov4_nvidia:1
I0316 01:15:46.895954 159 plan_backend.cc:333] Creating instance yolov4_nvidia_0_0_gpu0 on GPU 0 (7.5) using yolov4_-1_3_608_608_dynamic_onnx_int8_trtexec_4.engine
I0316 01:15:47.333165 159 plan_backend.cc:666] Created instance yolov4_nvidia_0_0_gpu0 on GPU 0 with stream priority 0 and optimization profile default[0];
I0316 01:15:47.334265 159 model_repository_manager.cc:983] successfully loaded 'yolov4_nvidia' version 1
INFO: infer_trtis_backend.cpp:206 TrtISBackend id:1 initialized model: yolov4_nvidia

Runtime commands:
        h: Print this help
        q: Quit

        p: Pause
        r: Resume

NOTE: To expand a source in the 2D tiled display and view object details, left-click on the source.
      To go back to the tiled display, right-click anywhere on the window.


**PERF:  FPS 0 (Avg)
**PERF:  0.00 (0.00)
** INFO: <bus_callback:181>: Pipeline ready

** INFO: <bus_callback:167>: Pipeline running

**PERF:  138.28 (138.17)
**PERF:  141.00 (139.60)
** INFO: <bus_callback:204>: Received EOS. Exiting ...

Quitting
I0316 01:16:00.336337 159 model_repository_manager.cc:837] unloading: yolov4_nvidia:1
I0316 01:16:00.338973 159 server.cc:280] Waiting for in-flight requests to complete.
I0316 01:16:00.338986 159 server.cc:295] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I0316 01:16:00.378079 159 model_repository_manager.cc:966] successfully unloaded 'yolov4_nvidia' version 1
I0316 01:16:01.339052 159 server.cc:295] Timeout 29: Found 0 live models and 0 in-flight non-inference requests
App run successful

Run inference on DS with max_batch_size=4
$ deepstream-app -c source1_primary_yolov4.txt
Error:

E0316 01:20:12.879238 195 model_repository_manager.cc:1705] unable to autofill for 'yolov4_nvidia', configuration specified max-batch 4 but TensorRT engine only supports max-batch 1
ERROR: infer_trtis_server.cpp:1044 Triton: failed to load model yolov4_nvidia, triton_err_str:Internal, err_msg:failed to load 'yolov4_nvidia', no version is available
ERROR: infer_trtis_backend.cpp:45 failed to load model: yolov4_nvidia, nvinfer error:NVDSINFER_TRTIS_ERROR
ERROR: infer_trtis_backend.cpp:184 failed to initialize backend while ensuring model:yolov4_nvidia ready, nvinfer error:NVDSINFER_TRTIS_ERROR
0:00:14.484600140   195 0x56007e1c7cf0 ERROR          nvinferserver gstnvinferserver.cpp:362:gst_nvinfer_server_logger:<primary_gie> nvinferserver[UID 1]: Error in createNNBackend() <infer_trtis_context.cpp:246> [UID = 1]: failed to initialize trtis backend for model:yolov4_nvidia, nvinfer error:NVDSINFER_TRTIS_ERROR
I0316 01:20:12.879481 195 server.cc:280] Waiting for in-flight requests to complete.
I0316 01:20:12.879488 195 server.cc:295] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
0:00:14.484704250   195 0x56007e1c7cf0 ERROR          nvinferserver gstnvinferserver.cpp:362:gst_nvinfer_server_logger:<primary_gie> nvinferserver[UID 1]: Error in initialize() <infer_base_context.cpp:81> [UID = 1]: create nn-backend failed, check config file settings, nvinfer error:NVDSINFER_TRTIS_ERROR
0:00:14.484716684   195 0x56007e1c7cf0 WARN           nvinferserver gstnvinferserver_impl.cpp:439:start:<primary_gie> error: Failed to initialize InferTrtIsContext
0:00:14.484722696   195 0x56007e1c7cf0 WARN           nvinferserver gstnvinferserver_impl.cpp:439:start:<primary_gie> error: Config file path: /workspace/Deepstream_5.1_Triton/samples/configs/deepstream-app-trtis/config_infer_primary_yolov4.txt
0:00:14.485106084   195 0x56007e1c7cf0 WARN           nvinferserver gstnvinferserver.cpp:460:gst_nvinfer_server_start:<primary_gie> error: gstnvinferserver_impl start failed
** ERROR: <main:655>: Failed to set pipeline to PAUSED
Quitting
ERROR from primary_gie: Failed to initialize InferTrtIsContext
Debug info: gstnvinferserver_impl.cpp(439): start (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInferServer:primary_gie:
Config file path: /workspace/Deepstream_5.1_Triton/samples/configs/deepstream-app-trtis/config_infer_primary_yolov4.txt
ERROR from primary_gie: gstnvinferserver_impl start failed
Debug info: gstnvinferserver.cpp(460): gst_nvinfer_server_start (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInferServer:primary_gie
App run failed

@pranavm-nvidia
Collaborator

@vilmara It looks like you've built the TRT engine correctly. Regarding the deepstream issue, you'd probably want to ask here: https://forums.developer.nvidia.com/c/accelerated-computing/intelligent-video-analytics/deepstream-sdk/15

@vilmara
Author

vilmara commented Mar 16, 2021

Hi @pranavm-nvidia, thanks for helping me build the TRT engine correctly. I have submitted the new issue on the forum.

@ttyio
Collaborator

ttyio commented May 27, 2021

Closing since there is no remaining issue in this thread, according to the last comment. Thanks!
