
v14.test2: latest TensorRT library

Pre-release
@github-actions released this 23 Oct 00:58 · 119 commits to master since this release

This is a preview release for TensorRT 9.1.0, following the v14.test release.

  • Like the v14.test release, this one requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs has been dropped.

  • TensorRT 9.1.0 is officially documented as supporting Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and the NVIDIA GH200 Grace Hopper™ Superchip, and on Linux only. The Windows build is downloaded from here and can also be used on other GPU models.

    • On Windows, some users have reported crashes when using it in mpv (#65). This problem occurred in an earlier build of this release and has since been fixed.
  • Added parameters bf16 (#64), custom_env, and custom_args to the TRT backend (see the sketch below).

    • For Waifu2xModel.swin_unet_art, fp16 execution is more accurate, faster, and uses less GPU memory than bf16 execution (benchmark).
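A minimal sketch of how the new backend parameters might be used from a VapourSynth script; the concrete values passed to custom_env and custom_args are illustrative assumptions, not recommendations from this release:

```python
import vapoursynth as vs
import vsmlrt

core = vs.core

# Placeholder input; any RGBS/RGBH clip works here.
clip = core.std.BlankClip(width=1920, height=1080, format=vs.RGBS)

backend = vsmlrt.Backend.TRT(
    fp16=True,
    bf16=False,                                  # new in this release (#64)
    custom_env={"CUDA_MODULE_LOADING": "LAZY"},  # assumed: extra environment variables for the engine build
    custom_args=["--noTF32"],                    # assumed: extra flags forwarded to trtexec
)

flt = vsmlrt.Waifu2x(clip, model=vsmlrt.Waifu2xModel.swin_unet_art, backend=backend)
```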
  • Device memory usage of the Waifu2xModel.swin_unet_art model is reduced compared to TensorRT 9.0.1 on A10G with 1080p input (2.66 fps at 7.0GB VRAM usage) under the default auxiliary stream heuristic.

    • Under the default heuristic, TensorRT 9.0.1 uses 7 auxiliary streams versus 3 for TensorRT 9.1.0, which costs significantly more device memory with no performance gain.
    • Setting max_aux_streams=3 lowers the device memory usage of TensorRT 9.0.1 to ~8.9GB, and max_aux_streams=0 to ~7.3GB.
    • TensorRT 9.1.0 with max_aux_streams=0 uses ~6.7GB of device memory (see the sketch below).
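A short sketch of capping the auxiliary streams instead of relying on the heuristic, continuing the setup (imports and clip) from the sketch above; the value 0 simply mirrors the measurement and is only an example:

```python
# max_aux_streams=0 disables auxiliary streams entirely; leave it unset to keep the default heuristic.
backend = vsmlrt.Backend.TRT(fp16=True, max_aux_streams=0)
flt = vsmlrt.Waifu2x(clip, model=vsmlrt.Waifu2xModel.swin_unet_art, backend=backend)
```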
  • Users should use the same TensorRT version as the one provided (9.1.0), because runtime version checking is disabled in this release.

  • Added support for RIFE v4.8 - v4.12 and v4.12 ~ v4.13 lite (ensemble) models, which are also available for previous vs-mlrt releases (simply download the new model file here and update vsmlrt.py; see the sketch below). The v4.8 and v4.9 models should run at the same speed as v4.7, while the v4.10 - v4.12 models are all heavier than previous models. Ensemble models are heavier than their non-ensemble counterparts.

    • Starting from RIFE v4.11, all RIFE models are temporarily moved here and packaged individually.
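A sketch of selecting one of the newly supported RIFE models after downloading the onnx file and updating vsmlrt.py, continuing the setup from the first sketch; the exact enum member name is an assumption based on the existing naming scheme:

```python
# 2x interpolation with an assumed enum member for RIFE v4.12.
interp = vsmlrt.RIFE(
    clip,
    multi=2,
    model=vsmlrt.RIFEModel.v4_12,          # assumed name; check RIFEModel in your vsmlrt.py
    backend=vsmlrt.Backend.TRT(fp16=True),
)
```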
  • RIFE models with the v2 representation for the TRT backend now have improved accuracy, contributed by @charlessuh (#66 (comment)). This has been backported to master.

    • This improvement may be very slightly less efficient if the onnx file is renamed. It is advised to keep the onnx file name unchanged and to adjust the vsmlrt.RIFE() function call instead.
    • By default, vsmlrt.RIFE() in vsmlrt.py uses the v1 representation. The v2 representation is enabled with the vsmlrt.RIFE(_implementation=2) function call (see the sketch below).
      Sample error message: input: for dimension number 1 in profile 0 does not match network definition (got min=11, opt=11, max=11), expected min=opt=max=7)
    • The v2 representation is still considered experimental.
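A sketch contrasting the default v1 representation with the experimental v2 one, again continuing the earlier setup; only the _implementation argument differs:

```python
backend = vsmlrt.Backend.TRT(fp16=True)

rife_v1 = vsmlrt.RIFE(clip, multi=2, backend=backend)                     # v1 representation (default)
rife_v2 = vsmlrt.RIFE(clip, multi=2, backend=backend, _implementation=2)  # v2 representation (experimental)
```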
  • Added support for the SAFA v0.1 video enhancement model (see the sketch after this list).

    • This model takes arbitrarily sized video and uses both spatial and temporal information to improve visual quality.
    • Note that this model is non-deterministic by nature, and existing backends do not support manual seeding.
    • ~17 fps on an RTX 4090 with TRT (fp16), 1080p input and non-adaptive mode. Adaptive mode is about 2x slower than non-adaptive mode, uses more memory, and does not support CUDA graphs execution.
    • This model is not supported by the NCNN_VK backend, due to the same issue as the RIFE v2 representation.
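A hypothetical sketch of invoking SAFA from vsmlrt.py, continuing the earlier setup; the SAFA entry point, the SAFAModel enum, and the adaptive switch are assumptions modelled on how the other networks are exposed, so check your vsmlrt.py for the actual names:

```python
enhanced = vsmlrt.SAFA(
    clip,                              # arbitrarily sized RGBS/RGBH video
    model=vsmlrt.SAFAModel.v0_1,       # assumed enum member for SAFA v0.1
    adaptive=False,                    # assumed switch; non-adaptive mode is ~2x faster (see above)
    backend=vsmlrt.Backend.TRT(fp16=True, use_cuda_graph=True),
)
```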
  • Also check the release notes of the v14.test release.


  • This pre-release uses TRT 9.1.0 + CUDA 12.2.2 + cuDNN 8.9.5, which can only run on driver >= 525 with 10-series or later GPUs, and has improved support for the self-attention operations found in transformer models.
  • vsmlrt.py in all branches can be used interchangeably.

  • TensorRT 9.0.1 is for Large Language Models (LLMs) on A100, A10G, L4, L40, and H100 GPUs, on x86 Linux only. Model Waifu2xModel.swin_unet_art is 1.2x faster compared to TensorRT 8.6.1 on A10G with 720p input (6.3 fps at 4GB VRAM usage), thanks to multi-head attention fusion (requires fp16).

This pre-release is now feature complete. Development now switches to the v14.test3 pre-release.