TensorRT OSS 9.1.0 Release
Signed-off-by: Simeng Liu <simengl@nvidia.com>
SimengLiu-nv committed Oct 19, 2023
1 parent 42fccbf commit b8adcfa
Showing 145 changed files with 8,685 additions and 1,389 deletions.
18 changes: 18 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,23 @@
# TensorRT OSS Release Changelog

## 9.1.0 GA - 2023-10-18

Key Features and Updates:

- Updated the [trt_python_plugin](samples/python/python_plugin) sample.
  - Python plugins API reference is part of the official TRT Python API.
- Added samples demonstrating the usage of the progress monitor API.
  - Check [sampleProgressMonitor](samples/sampleProgressMonitor) for the C++ sample.
  - Check [simple_progress_monitor](samples/python/simple_progress_monitor) for the Python sample.
- Removed dependencies related to python<3.8 from the Python samples, as python<3.8 is no longer supported for them.
- Demo changes
  - Added LAMBADA dataset accuracy checks in the [HuggingFace](demo/HuggingFace) demo.
  - Enabled structured sparsity and FP8 quantized batch matrix multiplications (BMMs) in attention in the [NeMo](demo/NeMo) demo.
  - Replaced deprecated APIs in the [BERT](demo/BERT) demo.
- Updated tooling
  - Polygraphy v0.49.1


## 9.0.1 GA - 2023-09-07

Key Features and Updates:
16 changes: 8 additions & 8 deletions README.md
@@ -26,7 +26,7 @@ You can skip the **Build** section to enjoy TensorRT with Python.
To build the TensorRT-OSS components, you will first need the following software packages.

**TensorRT GA build**
* TensorRT v9.0.1.4
* TensorRT v9.1.0.4
* Available from direct download links listed below

**System Packages**
@@ -36,7 +36,7 @@ To build the TensorRT-OSS components, you will first need the following software
* cuda-11.8.0 + cuDNN-8.9
* [GNU make](https://ftp.gnu.org/gnu/make/) >= v4.1
* [cmake](https://github.com/Kitware/CMake/releases) >= v3.13
* [python](<https://www.python.org/downloads/>) >= v3.6.9, <= v3.10.x
* [python](<https://www.python.org/downloads/>) >= v3.8, <= v3.10.x
* [pip](https://pypi.org/project/pip/#history) >= v19.0
* Essential utilities
* [git](https://git-scm.com/downloads), [pkg-config](https://www.freedesktop.org/wiki/Software/pkg-config/), [wget](https://www.gnu.org/software/wget/faq.html#download)
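
The raised lower bound in the list above (python >= v3.8, <= v3.10.x) can be checked before running the samples. This is a minimal sketch that assumes only the range stated in the list; the helper name `python_supported` is hypothetical and not part of the repository.

```python
import sys

# Bounds taken from the requirement list above: >= v3.8, <= v3.10.x.
MIN_PYTHON = (3, 8)
MAX_PYTHON = (3, 10)


def python_supported(version_info=None):
    """Return True if (major, minor) falls inside the supported range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


if __name__ == "__main__":
    status = "supported" if python_supported() else "outside the supported range"
    print("Python {}.{} is {}".format(sys.version_info[0], sys.version_info[1], status))
```

Comparing only `(major, minor)` keeps every 3.10.x patch release inside the upper bound.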
@@ -73,16 +73,16 @@ To build the TensorRT-OSS components, you will first need the following software
If using the TensorRT OSS build container, TensorRT libraries are preinstalled under `/usr/lib/x86_64-linux-gnu` and you may skip this step.

Else download and extract the TensorRT GA build from [NVIDIA Developer Zone](https://developer.nvidia.com) with the direct links below:
- [TensorRT 9.0.1.4 for CUDA 11.8, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/9.0.1/tars/tensorrt-9.0.1.4.linux.x86_64-gnu.cuda-11.8.tar.gz)
- [TensorRT 9.0.1.4 for CUDA 12.2, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/9.0.1/tars/tensorrt-9.0.1.4.linux.x86_64-gnu.cuda-12.2.tar.gz)
- [TensorRT 9.1.0.4 for CUDA 11.8, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/9.1.0/tars/tensorrt-9.1.0.4.linux.x86_64-gnu.cuda-11.8.tar.gz)
- [TensorRT 9.1.0.4 for CUDA 12.2, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/9.1.0/tars/tensorrt-9.1.0.4.linux.x86_64-gnu.cuda-12.2.tar.gz)


**Example: Ubuntu 20.04 on x86-64 with cuda-12.2**

```bash
cd ~/Downloads
tar -xvzf tensorrt-9.0.1.4.linux.x86_64-gnu.cuda-12.2.tar.gz
export TRT_LIBPATH=`pwd`/TensorRT-9.0.1.4
tar -xvzf tensorrt-9.1.0.4.linux.x86_64-gnu.cuda-12.2.tar.gz
export TRT_LIBPATH=`pwd`/TensorRT-9.1.0.4
```

## Setting Up The Build Environment
@@ -96,9 +96,9 @@ For Linux platforms, we recommend that you generate a docker container for build
```bash
./docker/build.sh --file docker/ubuntu-20.04.Dockerfile --tag tensorrt-ubuntu20.04-cuda12.2
```
**Example: CentOS/RedHat 7 on x86-64 with cuda-11.8**
**Example: CentOS/RedHat 7 on x86-64 with cuda-12.2**
```bash
./docker/build.sh --file docker/centos-7.Dockerfile --tag tensorrt-centos7-cuda11.8 --cuda 11.8.0
./docker/build.sh --file docker/centos-7.Dockerfile --tag tensorrt-centos7-cuda12.2 --cuda 12.2.0
```

2. #### Launch the TensorRT-OSS build container.
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
9.0.1.4
9.1.0.4
55 changes: 55 additions & 0 deletions cmake/toolchains/cmake_aarch64_cross.toolchain
@@ -0,0 +1,55 @@
#
# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR aarch64)

set(TRT_PLATFORM_ID "aarch64")

set(CUDA_PLATFORM_ID "sbsa-linux")

set(CMAKE_C_COMPILER /usr/bin/aarch64-linux-gnu-gcc-8)
set(CMAKE_CXX_COMPILER /usr/bin/aarch64-linux-gnu-g++-8)

set(CMAKE_C_FLAGS "" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS "" CACHE STRING "" FORCE)

set(CMAKE_C_COMPILER_TARGET aarch64-linux-gnu)
set(CMAKE_CXX_COMPILER_TARGET aarch64-linux-gnu)

set(CMAKE_C_COMPILER_FORCED TRUE)
set(CMAKE_CXX_COMPILER_FORCED TRUE)

set(CUDA_ROOT /usr/local/cuda/targets/${CUDA_PLATFORM_ID} CACHE STRING "CUDA ROOT dir")

set(CUDNN_LIB /usr/lib/aarch64-linux-gnu/libcudnn.so)

set(BUILD_LIBRARY_ONLY 1)

set(CUDA_TOOLKIT_ROOT_DIR ${CUDA_ROOT})
set(CUDA_INCLUDE_DIRS ${CUDA_ROOT}/include)

set(RT_LIB /usr/aarch64-linux-gnu/lib/librt.so)

set(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)
set(CMAKE_CUDA_HOST_COMPILER ${CMAKE_CXX_COMPILER} CACHE STRING "" FORCE)
set(CMAKE_CUDA_FLAGS "-I${CUDA_INCLUDE_DIRS} -Xcompiler=\"-fPIC ${CMAKE_CXX_FLAGS}\"" CACHE STRING "" FORCE)
set(CMAKE_CUDA_COMPILER_FORCED TRUE)

set(CUDA_LIBS -L${CUDA_ROOT}/lib)

set(ADDITIONAL_PLATFORM_LIB_FLAGS ${CUDA_LIBS} -lcublas -lcudart -lstdc++ -lm)
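
A toolchain file like the one above is normally handed to CMake through the standard `CMAKE_TOOLCHAIN_FILE` variable. The Python sketch below only assembles such a configure command and does not run it. The directory names are placeholders, the helper `cmake_configure_command` is hypothetical, and any project-specific options are deliberately left out. The `-S` and `-B` options assume CMake 3.13 or newer, which matches the minimum version listed in the README.

```python
import shlex


def cmake_configure_command(source_dir, build_dir, toolchain_file):
    """Return the argv list for a CMake configure step that uses a toolchain file."""
    return [
        "cmake",
        "-S", source_dir,
        "-B", build_dir,
        "-DCMAKE_TOOLCHAIN_FILE=" + toolchain_file,
    ]


# Placeholder paths: adjust to wherever the repository is checked out.
command = cmake_configure_command(
    "TensorRT",
    "TensorRT/build",
    "TensorRT/cmake/toolchains/cmake_aarch64_cross.toolchain",
)
print(shlex.join(command))
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting.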
56 changes: 21 additions & 35 deletions demo/BERT/builder.py
@@ -107,10 +107,7 @@ def attention_layer_opt(prefix, config, init_dict, network, input_tensor, imask)
Ball = init_dict[prefix + BQKV]

# FC_attention
if config.use_int8:
mult_all = network.add_convolution_nd(input_tensor, 3 * hidden_size, (1, 1), Wall, Ball)
else:
mult_all = network.add_fully_connected(input_tensor, 3 * hidden_size, Wall, Ball)
mult_all = network.add_convolution_nd(input_tensor, 3 * hidden_size, (1, 1), Wall, Ball)

if config.use_qat:
dr_qkv = max(
@@ -217,24 +214,20 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, imas

# FC0
B_aout = init_dict[prefix + B_AOUT]
if config.use_int8:
if not config.use_int8 and use_custom_fc():
W_aoutT = init_dict[prefix + W_AOUT + "_notrans"]
attention_out_fc = custom_fc(config, network, attention_heads, hidden_size, W_aoutT)
else:
W_aout = init_dict[prefix + W_AOUT]
attention_out_fc = network.add_convolution_nd(attention_heads, hidden_size, (1, 1), W_aout, B_aout)
B_aout = None

if not config.use_int8_skipln:
if config.use_int8 and not config.use_int8_skipln:
attention_out_fc.set_output_type(0, trt.DataType.HALF if config.use_fp16 else trt.DataType.FLOAT)

if config.use_qat:
if config.use_int8 and config.use_qat:
dr_fc_aout = init_dict[prefix + 'attention_output_add_local_input_quantizer_amax']
set_output_range(attention_out_fc, dr_fc_aout)
elif use_custom_fc():
W_aoutT = init_dict[prefix + W_AOUT + "_notrans"]
attention_out_fc = custom_fc(config, network, attention_heads, hidden_size, W_aoutT)
else:
W_aout = init_dict[prefix + W_AOUT]
attention_out_fc = network.add_fully_connected(attention_heads, hidden_size, W_aout, B_aout)
B_aout = None

skiplayer = skipln(prefix + "attention_output_layernorm_",config, init_dict, network, attention_out_fc.get_output(0), input_tensor, B_aout)
attention_ln = skiplayer.get_output(0)
@@ -245,10 +238,7 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, imas
# FC1 + GELU
B_mid = init_dict[prefix + B_MID]
W_mid = init_dict[prefix + W_MID]
if config.use_int8:
mid_dense = network.add_convolution_nd(attention_ln, config.intermediate_size, (1, 1), W_mid, B_mid)
else:
mid_dense = network.add_fully_connected(attention_ln, config.intermediate_size, W_mid, B_mid)
mid_dense = network.add_convolution_nd(attention_ln, config.intermediate_size, (1, 1), W_mid, B_mid)

mid_dense_out = mid_dense.get_output(0)
POW = network.add_constant((1, 1, 1, 1, 1), trt.Weights(np.ascontiguousarray([3.0], dtype=np.float32)))
@@ -281,21 +271,18 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, imas
# FC2
# Dense to hidden size
B_lout = init_dict[prefix + B_LOUT]
if config.use_int8 and not config.use_fc2_gemm:
W_lout = init_dict[prefix + W_LOUT]
out_dense = network.add_convolution_nd(intermediate_act, hidden_size, (1, 1), W_lout, B_lout)
B_lout = None

if not config.use_int8_skipln:
out_dense.set_output_type(0, trt.DataType.HALF if config.use_fp16 else trt.DataType.FLOAT)
elif use_custom_fc():
prefer_conv = config.use_int8 and not config.use_fc2_gemm
if not prefer_conv and use_custom_fc():
W_loutT = init_dict[prefix + W_LOUT + "_notrans"]
out_dense = custom_fc(config, network, intermediate_act, hidden_size, W_loutT)
else:
W_lout = init_dict[prefix + W_LOUT]
out_dense = network.add_fully_connected(intermediate_act, hidden_size, W_lout, B_lout)
out_dense = network.add_convolution_nd(intermediate_act, hidden_size, (1, 1), W_lout, B_lout)
B_lout = None

if config.use_int8 and not config.use_int8_skipln:
out_dense.set_output_type(0, trt.DataType.HALF if config.use_fp16 else trt.DataType.FLOAT)

if config.use_qat:
dr_fc_out = init_dict[prefix + 'output_add_local_input_quantizer_amax']
set_output_range(out_dense, dr_fc_out)
@@ -334,7 +321,7 @@ def squad_output(prefix, config, init_dict, network, input_tensor):
B_out = init_dict[prefix + SQD_B]

W = network.add_constant((1, hidden_size, 2), W_out)
dense = network.add_fully_connected(input_tensor, 2, W_out, B_out)
dense = network.add_convolution_nd(input_tensor, 2, (1, 1), W_out, B_out)

OUT = network.add_shuffle(dense.get_output(0))
OUT.second_transpose = (1, 0, 2, 3, 4)
@@ -402,7 +389,7 @@ def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_
explicit_batch_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

with trt.Builder(TRT_LOGGER) as builder, builder.create_network(explicit_batch_flag) as network, builder.create_builder_config() as builder_config:
builder_config.max_workspace_size = workspace_size * (1024 * 1024)
builder_config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_size * (1024 * 1024))
builder_config.avg_timing_iterations = 8
if config.use_fp16:
builder_config.set_flag(trt.BuilderFlag.FP16)
@@ -451,10 +438,11 @@ def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_
squad_logits = squad_output("cls_", config, weights_dict, network, bert_out)
squad_logits_out = squad_logits.get_output(0)

squad_logits_out.name = "logits_out"
network.mark_output(squad_logits_out)

build_start_time = time.time()
engine = builder.build_engine(network, builder_config)
serialized_engine = builder.build_serialized_network(network, builder_config)
build_time_elapsed = (time.time() - build_start_time)
TRT_LOGGER.log(TRT_LOGGER.INFO, "build engine in {:.3f} Sec".format(build_time_elapsed))

@@ -469,7 +457,7 @@ def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_

if config.use_int8 and not config.use_qat:
calibrator.free()
return engine
return serialized_engine

def generate_calibration_cache(sequence_lengths, workspace_size, config, weights_dict, squad_json, vocab_file, calibrationCacheFile, calib_num):
"""
@@ -488,7 +476,7 @@ def generate_calibration_cache(sequence_lengths, workspace_size, config, weights
config.use_fp16 = False
config.is_calib_mode = True

with build_engine([1], workspace_size, sequence_lengths, config, weights_dict, squad_json, vocab_file, calibrationCacheFile, calib_num, False) as engine:
with build_engine([1], workspace_size, sequence_lengths, config, weights_dict, squad_json, vocab_file, calibrationCacheFile, calib_num, False) as serialized_engine:
TRT_LOGGER.log(TRT_LOGGER.INFO, "calibration cache generated in {:}".format(calibrationCacheFile))

config.use_fp16 = saved_use_fp16
@@ -553,9 +541,7 @@ def main():
else:
raise RuntimeError("You need either specify TF checkpoint using option --ckpt or ONNX using option --onnx to build TRT BERT model.")

with build_engine(args.batch_size, args.workspace_size, args.sequence_length, config, weights_dict, args.squad_json, args.vocab_file, calib_cache, args.calib_num, args.verbose) as engine:
TRT_LOGGER.log(TRT_LOGGER.VERBOSE, "Serializing Engine...")
serialized_engine = engine.serialize()
with build_engine(args.batch_size, args.workspace_size, args.sequence_length, config, weights_dict, args.squad_json, args.vocab_file, calib_cache, args.calib_num, args.verbose) as serialized_engine:
TRT_LOGGER.log(TRT_LOGGER.INFO, "Saving Engine to {:}".format(args.output))
with open(args.output, "wb") as fout:
fout.write(serialized_engine)
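
The builder.py hunks above swap `builder_config.max_workspace_size` for `set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, ...)`, and `build_engine` plus `engine.serialize()` for `build_serialized_network`. A minimal sketch of that pattern follows. The helper name `build_and_save` and its parameters are hypothetical. The TensorRT objects are passed in so the snippet does not import TensorRT itself. The `None` check assumes that `build_serialized_network` returns `None` when the build fails.

```python
MIB = 1024 * 1024


def build_and_save(builder, network, builder_config, workspace_pool, workspace_mib, output_path):
    """Build a serialized engine and write it to output_path.

    builder, network and builder_config are expected to be the TensorRT objects
    created in build_engine(); workspace_pool is expected to be
    trt.MemoryPoolType.WORKSPACE.
    """
    # Replaces the `builder_config.max_workspace_size = ...` assignment.
    builder_config.set_memory_pool_limit(workspace_pool, workspace_mib * MIB)

    # Replaces `builder.build_engine(...)` followed by `engine.serialize()`.
    serialized_engine = builder.build_serialized_network(network, builder_config)
    if serialized_engine is None:  # assumption: None signals a failed build
        raise RuntimeError("engine build failed")

    # The serialized engine is written directly, as in main() above.
    with open(output_path, "wb") as fout:
        fout.write(serialized_engine)
    return serialized_engine
```

With TensorRT installed, a call might look like `build_and_save(builder, network, builder_config, trt.MemoryPoolType.WORKSPACE, args.workspace_size, args.output)`.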
37 changes: 10 additions & 27 deletions demo/BERT/builder_varseqlen.py
@@ -107,10 +107,7 @@ def attention_layer_opt(prefix, config, init_dict, network, input_tensor, mask_i
Ball = init_dict[prefix + BQKV]

# FC_attention
if config.use_int8:
mult_all = network.add_convolution_nd(input_tensor, 3 * hidden_size, (1, 1), Wall, Ball)
else:
mult_all = network.add_fully_connected(input_tensor, 3 * hidden_size, Wall, Ball)
mult_all = network.add_convolution_nd(input_tensor, 3 * hidden_size, (1, 1), Wall, Ball)

if config.use_qat:
dr_qkv = max(
@@ -202,10 +199,7 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, resi
# FC0
B_aout = init_dict[prefix + B_AOUT]
W_aout = init_dict[prefix + W_AOUT]
if config.use_int8:
attention_out_fc = network.add_convolution_nd(attention_heads, hidden_size, (1, 1), W_aout, B_aout)
else:
attention_out_fc = network.add_fully_connected(attention_heads, hidden_size, W_aout, B_aout)
attention_out_fc = network.add_convolution_nd(attention_heads, hidden_size, (1, 1), W_aout, B_aout)
if config.use_int8 and config.use_qat:
dr_fc_aout = init_dict[prefix + 'attention_output_add_local_input_quantizer_amax']
set_output_range(attention_out_fc, dr_fc_aout)
@@ -225,10 +219,7 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, resi
# FC1 + GELU
B_mid = init_dict[prefix + B_MID]
W_mid = init_dict[prefix + W_MID]
if config.use_int8:
mid_dense = network.add_convolution_nd(attention_ln, config.intermediate_size, (1, 1), W_mid, B_mid)
else:
mid_dense = network.add_fully_connected(attention_ln, config.intermediate_size, W_mid, B_mid)
mid_dense = network.add_convolution_nd(attention_ln, config.intermediate_size, (1, 1), W_mid, B_mid)

gelu_layer = add_gelu(network, mid_dense.get_output(0))

@@ -247,10 +238,7 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, resi
B_lout = init_dict[prefix + B_LOUT]
W_lout = init_dict[prefix + W_LOUT]

if config.use_int8:
out_dense = network.add_convolution_nd(intermediate_act, hidden_size, (1, 1), W_lout, B_lout)
else:
out_dense = network.add_fully_connected(intermediate_act, hidden_size, W_lout, B_lout)
out_dense = network.add_convolution_nd(intermediate_act, hidden_size, (1, 1), W_lout, B_lout)
if config.use_int8 and config.use_qat:
dr_fc_out = init_dict[prefix + 'output_add_local_input_quantizer_amax']
set_output_range(out_dense, dr_fc_out)
@@ -327,6 +315,7 @@ def bert_model(config, init_dict, network, input_tensor, residual, mask_idx, cu_

squad_logits = squad_output("cls_", config, init_dict, network, prev_input)
squad_logits_out = squad_logits.get_output(0)
squad_logits_out.name = "logits_out"
network.mark_output(squad_logits_out)


@@ -339,11 +328,7 @@ def squad_output(prefix, config, init_dict, network, input_tensor):
W_out = init_dict[prefix + SQD_W]
B_out = init_dict[prefix + SQD_B]

if config.use_int8:
dense = network.add_convolution_nd(input_tensor, 2, (1, 1), W_out, B_out)
else:
dense = network.add_fully_connected(input_tensor, 2, W_out, B_out)

dense = network.add_convolution_nd(input_tensor, 2, (1, 1), W_out, B_out)
OUT = network.add_shuffle(dense.get_output(0))
if config.use_int8 and config.interleaved:
OUT.second_transpose = (1, 2, 0, 3)
@@ -397,7 +382,7 @@ def build_engine(batch_sizes, workspace_size, sequence_length, config, weights_d
explicit_batch_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

with trt.Builder(TRT_LOGGER) as builder, builder.create_network(explicit_batch_flag) as network, builder.create_builder_config() as builder_config:
builder_config.max_workspace_size = workspace_size * (1024 * 1024)
builder_config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_size * (1024 * 1024))
builder_config.avg_timing_iterations = 8
if config.use_fp16:
builder_config.set_flag(trt.BuilderFlag.FP16)
@@ -454,7 +439,7 @@ def build_engine(batch_sizes, workspace_size, sequence_length, config, weights_d
bert_model(config, weights_dict, network, embeddings, residual, mask_idx, cu_seqlens, max_seqlen)

build_start_time = time.time()
engine = builder.build_engine(network, builder_config)
serialized_engine = builder.build_serialized_network(network, builder_config)
build_time_elapsed = (time.time() - build_start_time)
TRT_LOGGER.log(TRT_LOGGER.INFO, "build engine in {:.3f} Sec".format(build_time_elapsed))

@@ -467,7 +452,7 @@ def build_engine(batch_sizes, workspace_size, sequence_length, config, weights_d
f.flush()
os.fsync(f)

return engine
return serialized_engine

def main():
parser = argparse.ArgumentParser(description="TensorRT BERT Sample", formatter_class=argparse.ArgumentDefaultsHelpFormatter)
@@ -533,9 +518,7 @@ def main():
"PyTorch using option --pytorch, or Pickle weight dictionary using option --pickle "
"to build TRT BERT model.")

with build_engine(args.max_batch_size, args.workspace_size, args.max_sequence_length, config, weights_dict, args.squad_json, args.vocab_file, calib_cache, args.calib_num, args.verbose) as engine:
TRT_LOGGER.log(TRT_LOGGER.VERBOSE, "Serializing Engine...")
serialized_engine = engine.serialize()
with build_engine(args.max_batch_size, args.workspace_size, args.max_sequence_length, config, weights_dict, args.squad_json, args.vocab_file, calib_cache, args.calib_num, args.verbose) as serialized_engine:
TRT_LOGGER.log(TRT_LOGGER.INFO, "Saving Engine to {:}".format(args.output))
with open(args.output, "wb") as fout:
fout.write(serialized_engine)
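
Both BERT builders above replace every `add_fully_connected` call with `add_convolution_nd` using a `(1, 1)` kernel. The two compute the same values when the input's trailing spatial dimensions are 1x1, which is the assumption made here. The sketch below illustrates that equivalence in plain Python with no TensorRT involved; the helper names `fully_connected` and `conv_1x1` are hypothetical.

```python
def fully_connected(x, weights, bias):
    """y[o] = sum_c weights[o][c] * x[c] + bias[o] for a flat input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def conv_1x1(x_chw, weights, bias):
    """1x1 convolution over a C x H x W input; weights are out_channels x in_channels."""
    channels = len(x_chw)
    height = len(x_chw[0])
    width = len(x_chw[0][0])
    return [
        [
            [
                sum(weights[o][c] * x_chw[c][h][w] for c in range(channels)) + bias[o]
                for w in range(width)
            ]
            for h in range(height)
        ]
        for o in range(len(weights))
    ]


x = [1.0, 2.0, 3.0]
W = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
b = [0.0, 1.0]

fc_out = fully_connected(x, W, b)  # [-2.0, 4.0]

# Reshape the vector to C x 1 x 1, convolve, then flatten the 1 x 1 spatial dims.
conv_out = conv_1x1([[[v]] for v in x], W, b)
conv_flat = [plane[0][0] for plane in conv_out]

assert fc_out == conv_flat
```

Each output channel of the 1x1 convolution is a dot product over the input channels at a single spatial position, so with only one position it reduces to the fully connected layer.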