
v14.test2: latest TensorRT library

Pre-release
@github-actions released this 23 Oct 00:58 · 119 commits to master since this release

This is a preview release for TensorRT 9.1.0, following the v14.test release.

  • Like the v14.test release, this one requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs has been dropped.

  • TensorRT 9.1.0 is officially documented as supporting Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and the NVIDIA GH200 Grace Hopper™ Superchip, and on Linux only. The Windows build is downloaded from here and can also be used on other GPU models.

    • On Windows, some users have reported crashes when using it in mpv (#65). This problem occurred in an earlier build of this release and has since been fixed.
  • Added parameters bf16 (#64), custom_env, and custom_args to the TRT backend (see the sketch below).

    • For Waifu2xModel.swin_unet_art, fp16 execution is more accurate, faster, and uses less GPU memory than bf16 execution (benchmark).
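A minimal sketch of how the new backend parameters might be used from a VapourSynth script; the concrete values passed to custom_env and custom_args are illustrative assumptions, not recommendations from this release:

```python
import vapoursynth as vs
import vsmlrt

core = vs.core

# Placeholder input; any RGBS/RGBH clip works here.
clip = core.std.BlankClip(width=1920, height=1080, format=vs.RGBS)

backend = vsmlrt.Backend.TRT(
    fp16=True,
    bf16=False,                                  # new in this release (#64)
    custom_env={"CUDA_MODULE_LOADING": "LAZY"},  # assumed: extra environment variables for the engine build
    custom_args=["--noTF32"],                    # assumed: extra flags forwarded to trtexec
)

flt = vsmlrt.Waifu2x(clip, model=vsmlrt.Waifu2xModel.swin_unet_art, backend=backend)
```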
  • Device memory usage of the Waifu2xModel.swin_unet_art model is reduced compared to TensorRT 9.0.1 on A10G with 1080p input (2.66 fps at 7.0GB VRAM usage) under the default auxiliary stream heuristic.

    • Under the default heuristic, TensorRT 9.0.1 uses 7 auxiliary streams versus 3 for TensorRT 9.1.0, which costs significantly more device memory with no performance gain.
    • Setting max_aux_streams=3 lowers the device memory usage of TensorRT 9.0.1 to ~8.9GB, and max_aux_streams=0 to ~7.3GB.
    • TensorRT 9.1.0 with max_aux_streams=0 uses ~6.7GB of device memory (see the sketch below).
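A short sketch of capping the auxiliary streams instead of relying on the heuristic, continuing the setup (imports and clip) from the sketch above; the value 0 simply mirrors the measurement and is only an example:

```python
# max_aux_streams=0 disables auxiliary streams entirely; leave it unset to keep the default heuristic.
backend = vsmlrt.Backend.TRT(fp16=True, max_aux_streams=0)
flt = vsmlrt.Waifu2x(clip, model=vsmlrt.Waifu2xModel.swin_unet_art, backend=backend)
```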
  • Users should use the same TensorRT version as the one provided (9.1.0), because runtime version checking is disabled in this release.

  • Added support for RIFE v4.8 - v4.12 and v4.12 ~ v4.13 lite (ensemble) models, which are also available for previous vs-mlrt releases (simply download the new model file here and update vsmlrt.py; see the sketch below). The v4.8 and v4.9 models should run at the same speed as v4.7, while the v4.10 - v4.12 models are all heavier than previous models. Ensemble models are heavier than their non-ensemble counterparts.

    • Starting from RIFE v4.11, all RIFE models are temporarily moved here and packaged individually.
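A sketch of selecting one of the newly supported RIFE models after downloading the onnx file and updating vsmlrt.py, continuing the setup from the first sketch; the exact enum member name is an assumption based on the existing naming scheme:

```python
# 2x interpolation with an assumed enum member for RIFE v4.12.
interp = vsmlrt.RIFE(
    clip,
    multi=2,
    model=vsmlrt.RIFEModel.v4_12,          # assumed name; check RIFEModel in your vsmlrt.py
    backend=vsmlrt.Backend.TRT(fp16=True),
)
```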
  • RIFE models with the v2 representation for the TRT backend now have improved accuracy, contributed by @charlessuh (#66 (comment)). This has been backported to master.

    • This improvement may be very slightly less efficient if the onnx file is renamed. It is advised to keep the onnx file name unchanged and to adjust the vsmlrt.RIFE() function call instead.
    • By default, vsmlrt.RIFE() in vsmlrt.py uses the v1 representation. The v2 representation is enabled with the vsmlrt.RIFE(_implementation=2) function call (see the sketch below).
      Sample error message: input: for dimension number 1 in profile 0 does not match network definition (got min=11, opt=11, max=11), expected min=opt=max=7)
    • The v2 representation is still considered experimental.
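A sketch contrasting the default v1 representation with the experimental v2 one, again continuing the earlier setup; only the _implementation argument differs:

```python
backend = vsmlrt.Backend.TRT(fp16=True)

rife_v1 = vsmlrt.RIFE(clip, multi=2, backend=backend)                     # v1 representation (default)
rife_v2 = vsmlrt.RIFE(clip, multi=2, backend=backend, _implementation=2)  # v2 representation (experimental)
```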
  • Added support for the SAFA v0.1 video enhancement model (see the sketch after this list).

    • This model takes arbitrarily sized video and uses both spatial and temporal information to improve visual quality.
    • Note that this model is non-deterministic by nature, and existing backends do not support manual seeding.
    • ~17 fps on an RTX 4090 with TRT (fp16), 1080p input and non-adaptive mode. Adaptive mode is about 2x slower than non-adaptive mode, uses more memory, and does not support CUDA graphs execution.
    • This model is not supported by the NCNN_VK backend, due to the same issue as the RIFE v2 representation.
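A hypothetical sketch of invoking SAFA from vsmlrt.py, continuing the earlier setup; the SAFA entry point, the SAFAModel enum, and the adaptive switch are assumptions modelled on how the other networks are exposed, so check your vsmlrt.py for the actual names:

```python
enhanced = vsmlrt.SAFA(
    clip,                              # arbitrarily sized RGBS/RGBH video
    model=vsmlrt.SAFAModel.v0_1,       # assumed enum member for SAFA v0.1
    adaptive=False,                    # assumed switch; non-adaptive mode is ~2x faster (see above)
    backend=vsmlrt.Backend.TRT(fp16=True, use_cuda_graph=True),
)
```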
  • Also check the release notes of the v14.test release.


  • This pre-release uses TRT 9.1.0 + CUDA 12.2.2 + cuDNN 8.9.5, which can only run on driver >= 525 with 10-series or later GPUs, and has improved support for the self-attention operations found in transformer models.
  • vsmlrt.py in all branches can be used interchangeably.

  • TensorRT 9.0.1 is for Large Language Models (LLMs) on A100, A10G, L4, L40, and H100 GPUs, on x86 Linux only. Model Waifu2xModel.swin_unet_art is 1.2x faster compared to TensorRT 8.6.1 on A10G with 720p input (6.3 fps at 4GB VRAM usage), thanks to multi-head attention fusion (requires fp16).

This pre-release is now feature complete. Development now switches to the v14.test3 pre-release.