Releases: AmusementClub/vs-mlrt
v14: latest libraries
Compared to the previous stable (v13.2) release:
General
- External models are no longer packaged.
vsmlrt.py
- Plugin invocation order in the `get_plugin_path()` function is sorted to reduce memory consumption.
- Added support for RIFE v4.7 ~ v4.16 (lite, ensemble) models.
- Added support for SCUNet models for image denoising.
TRT
plugin and runtime libraries
- Upgraded to TensorRT 10.0.1.
- Maxwell and Pascal GPUs are no longer supported. Other backends still support these GPUs.
- Reduced GPU memory usage for dynamically shaped engines when the actual tile size is smaller than the maximum tile size set during engine building.
- Reduced engine build time.
- Added long path support for engines on Windows.
- cuDNN is no longer a strict runtime dependency.
vsmlrt.py
- The cuDNN tactic is no longer enabled by default.
- TF32 acceleration is disabled by default.
- The default maximum workspace is `None`, i.e. the total memory size of the GPU.
- Added parameters `builder_optimization_level`, `max_aux_streams`, `bf16` (#64), `custom_env`, `custom_args`, `short_path` and `engine_folder` (#90):
  - `builder_optimization_level`: "adjust how long TensorRT should spend searching for tactics with potentially better performance" (link)
  - `max_aux_streams`: within-inference multi-streaming, "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." (link)
  - `bf16`: "TensorRT supports the bfloat16 (brain float) floating point format on NVIDIA Ampere and later architectures ... Note that not all layers support bfloat16." (link)
  - `custom_env`, `custom_args`: custom environment variables and arguments for the trtexec engine build.
  - `short_path`: whether to shorten the engine name.
    - On Windows, this can help with the maximum path length limitation, and is enabled by default.
  - `engine_folder`: specifies a custom directory for engines.
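As a rough illustration of how the new `TRT` parameters might be combined in a script, the sketch below collects them into keyword arguments and passes them to `Backend.TRT`. The parameter names come from the list above; the specific values, the argument types (a dict for `custom_env`, a list of strings for `custom_args`) and the engine directory name are assumptions chosen only for illustration.

```python
# Keyword arguments for the TRT backend. Names follow the release notes;
# all values below are illustrative assumptions, not recommended settings.
trt_kwargs = dict(
    fp16=True,
    builder_optimization_level=3,   # higher values spend longer searching for tactics
    max_aux_streams=0,              # 0 avoids auxiliary streams to limit memory usage
    bf16=False,
    custom_env={"CUDA_MODULE_LOADING": "LAZY"},  # assumed to be a str -> str mapping
    custom_args=["--verbose"],      # assumed to be extra trtexec arguments
    short_path=True,
    engine_folder="engines",        # hypothetical directory name
)

try:
    from vsmlrt import Backend  # requires VapourSynth and vs-mlrt v14 or later
    backend = Backend.TRT(**trt_kwargs)
except ImportError:
    backend = None  # vs-mlrt is not installed; the kwargs can still be inspected
```

Keeping the settings in a plain dictionary makes it easy to log or diff them between runs before an engine is built.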
known issues
- According to the documentation, there is an up to 4x performance regression for networks containing "GridSample" ops compared to TensorRT 9.2. This affects RIFE and SAFA models.
- trtexec may report errors like:
  [E] Error[9]: Skipping tactic 0xded5318b4a444b84 due to exception Cask convolution execution
  [E] Error[2]: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
  This issue has been submitted to NVIDIA.
ORT
- Upgraded to ONNX Runtime v1.18.0.
interface
- The `ORT_*` backends now support fp16 I/O. The semantics of the `fp16` flag in these backends are as follows:
  - Enabling `fp16` will use a built-in quantization that converts a fp32 onnx to a fp16 onnx. If the input video is of half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the `output_format` option (0 = fp32, 1 = fp16).
  - Disabling `fp16` will not use the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
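The `fp16` semantics described above can be summarised as a small decision function. This is only a model of the rules as stated in these notes, not code from `vsmlrt.py`; the function name `resolve_ort_formats` and its parameters are hypothetical.

```python
from typing import Tuple


def resolve_ort_formats(
    fp16: bool,
    clip_is_fp16: bool,
    output_format: int = 0,
    onnx_input_is_fp16: bool = False,
    onnx_output_is_fp16: bool = False,
) -> Tuple[str, str]:
    """Return (input_format, output_format), each 'fp16' or 'fp32'."""
    name = {False: "fp32", True: "fp16"}
    if fp16:
        # Built-in fp32 -> fp16 conversion: the input follows the clip,
        # the output follows output_format (0 = fp32, 1 = fp16).
        if output_format not in (0, 1):
            raise ValueError("output_format must be 0 (fp32) or 1 (fp16)")
        return name[clip_is_fp16], name[output_format == 1]
    # No conversion: the onnx file dictates both formats, and the clip must match.
    if clip_is_fp16 != onnx_input_is_fp16:
        raise ValueError("input clip format must match the onnx input format")
    return name[onnx_input_is_fp16], name[onnx_output_is_fp16]
```

For example, `resolve_ort_formats(True, clip_is_fp16=True, output_format=1)` describes a fully fp16 I/O path, while `fp16=False` leaves both formats to the onnx file.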
CUDA
- Reduced execution overhead.
- Added support for TF32 acceleration. This is disabled by default.
- Added experimental `prefer_nhwc` flag to reduce the number of layout transformations when using tensor cores. This is disabled by default.
OV
- Upgraded to OpenVINO 2024.2.0.
- Added experimental `OV_NPU` backend for Intel NPUs.
MIGX
- Added support for MIGraphX backend for AMD GPUs. Currently this backend is Linux only.
Community contributions
- `scripts/vsmlrt.py`: update esrgan janai models by @hooke007 in #53
- `scripts/vsmlrt.py`: add more esrgan janai models by @hooke007 in #82
- `vsmigx`: allow fp16 input & output by @abihf in #86
- `scripts/vsmlrt.py`: fix fp16 precision issues of RIFE v2 representations by @charlessuh in #66 (comment)
Benchmark
NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8
1920x1080 RGBS, TRT backend, CUDA graphs enabled, fp16
Measurements: FPS / Device Memory (MB)
model | 1 stream | 2 streams | 3 streams |
---|---|---|---|
dpir color | 10.99 / 1715.172 | 11.62 / 3048.540 | 11.64 / 4381.912 |
waifu2x upconv_7_{anime_style_art_rgb, photo} | 22.38 / 2016.352 | 32.66 / 3734.880 | 32.54 / 5453.404 |
waifu2x cunet / cugan | 12.41 / 4359.284 | 15.53 / 8363.392 | 15.47 / 12367.504 |
waifu2x swin_unet | 3.80 / 7304.332 | 4.06 / 14392.408 | 4.06 / 21276.380 |
real-esrgan (v2/v3, xsx2) | 16.65 / 955.480 | 22.53 / 1645.904 | 22.49 / 2336.324 |
scunet color | 4.20 / 2847.708 | 4.33 / 6646.884 | 4.33 / 9792.736 |
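Each cell in the table above packs two numbers, so comparing stream counts takes a little parsing. The sketch below splits an "FPS / Device Memory (MB)" cell and prints how throughput and memory scale relative to one stream; the helper name `parse_cell` is hypothetical and the row is copied from the table.

```python
from typing import Tuple


def parse_cell(cell: str) -> Tuple[float, float]:
    """Split an 'FPS / Device Memory (MB)' cell into two floats."""
    fps, mem = (float(part) for part in cell.split("/"))
    return fps, mem


# Row copied from the table above: real-esrgan (v2/v3, xsx2), 1 to 3 streams.
row = ["16.65 / 955.480", "22.53 / 1645.904", "22.49 / 2336.324"]
cells = [parse_cell(c) for c in row]

base_fps, base_mem = cells[0]
for streams, (fps, mem) in enumerate(cells, start=1):
    print(f"{streams} stream(s): {fps / base_fps:.2f}x FPS, {mem / base_mem:.2f}x memory")
```

Cells such as `OOM` or `N/A` in the other tables are not numeric and would need to be filtered out before calling `parse_cell`.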
Also check benchmarks from previous pre-releases v14.test4 (NVIDIA RTX 2080 Ti/3090/4090 GPUs) and v14.test3 (NVIDIA RTX 4090 and AMD RX 7900 XTX GPUs).
This release uses CUDA 12.4.1, cuDNN 8.9.7, TensorRT 10.0.1, ONNX Runtime v1.18.0, OpenVINO 2024.2.0 and ncnn 20220915 b16f8ca.
Full Changelog: v13.2...v14
v14.test4: latest TensorRT and ONNX Runtime libraries
This is a preview release for TensorRT 10.0.0, following the `v14.test`, `v14.test2` and `v14.test3` releases.
- The `TRT` backend no longer supports Maxwell and Pascal GPUs. Other backends still support these GPUs. As with those releases, the current release requires driver version >= 525.
- Added support for SwinIR models for image restoration, which are only supported by the `TRT` backend and the `ORT_CPU` backend from vs-mlrt v14.test4 or later. SwinIR-M and SwinIR-L models exhibit precision issues with the fp16 implementation; this is under investigation.
- Added support for SCUNet models for image denoising, which are only supported by the `TRT` backend and the `ORT_CPU` backend from vs-mlrt v14.test4 or later.
- Added `engine_folder` argument to the `TRT` backend in vsmlrt.py to specify a custom directory for engines.
- Starting with this pre-release, for dynamically shaped engines, the TRT runtime allocates GPU memory based on the actual tile size, whereas in previous releases the runtime had to allocate GPU memory based on the maximum tile size set at engine compile time. This feature requires TensorRT 10 or later.
- The `ORT_*` backends now support fp16 I/O. The semantics of the `fp16` flag are as follows:
  - Enabling `fp16` will use a built-in quantization that converts a fp32 onnx to a fp16 onnx. If the input video is of half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the `output_format` option (0 = fp32, 1 = fp16).
  - Disabling `fp16` will not use the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
- Reduced the overhead of the `ORT_CUDA` backend.
- Added support for TF32 acceleration to the `ORT_CUDA` backend. Disabled by default.
- Added experimental `prefer_nhwc` flag to the `ORT_CUDA` backend to reduce the number of layout transformations when using tensor cores.
- For production use of the `TRT` backend, continue to use vsmlrt v13.2. For RIFE and SAFA acceleration on the `TRT` backend, continue to use any old release.
- Also check the release notes of the previous pre-releases.
benchmark 1
- RTX 4090
- processor clock @ 2520 MHz
- Intel Icelake server @ 2100 MHz
- Driver 551.86
- Windows 10 21H2 (19044.1415)
- TensorRT 10.0.0
- VapourSynth-Classic R57.A8, vapoursynth-plugin v0.96g3
1920x1080 rgbs, CUDA graphs enabled, fp16
Measurements: FPS / Device Memory (MB)
general
model | 1 stream | 2 streams | 3 streams |
---|---|---|---|
dpir gray | 22.05 / 1818.796 | 25.30 / 3111.114 | 25.33 / 4403.488 |
dpir color | 18.30 / 1851.632 | 25.13 / 3176.808 | 25.17 / 4501.984 |
waifu2x upconv_7_{anime_style_art_rgb, photo} | 20.45 / 2148.716 | 41.22 / 3867.240 | 61.21 / 5585.764 |
waifu2x upresnet10 | 17.91 / 1716.588 | 34.53 / 2941.540 | 42.33 / 4166.492 |
waifu2x cunet / cugan | 13.89 / 4391.292 | 25.74 / 8346.248 | 25.96 / 12301.202 |
waifu2x swin_unet | 4.62 / 7436.692 | 5.43 / 14426.812 | 5.43 / 21412.840 |
real-esrgan (v2/v3, xsx2) | 17.06 / 1087.844 | 33.41 / 1778.264 | 38.26 / 2468.684 |
scunet gray | 5.29 / 3590.320 | 5.40 / 6678.768 | 5.40 / 9767.208 |
scunet color | 5.13 / 3555.568 | 5.48 / 6611.308 | 5.47 / 9667.048 |
swinir-s (2x, color) | 1.63 / 15897.048 | N/A | N/A |
swinir-m* (2x, color, 720p) | 1.05 / 11305.268 | N/A | N/A |
swinir-l* (4x, color, 720p) | 0.61 / 15391.316 | N/A | N/A |
*: swinir-m and swinir-l exhibit precision issues.
rife
v2, fp16 i/o
version | 1 stream | 2 streams | 3 streams | 4 streams | 5 streams |
---|---|---|---|---|---|
v4.4-v4.5 | 136.92/778.432 | 273.80/1149.204 | 414.80/1522.028 | 553.70/1892.796 | 574.31/2263.568 |
v4.6 | 136.01/800.960 | 275.26/1192.212 | 411.01/1585.516 | 544.30/1979.764 | 550.01/2368.020 |
v4.7-v4.9 | 98.20/1302.724 | 195.78/2187.548 | 210.12/3074.420 | 210.45/3957.196 | 210.66/4844.068 |
v4.10-v4.15 | 84.41/1595.592 | 160.93/2773.280 | 161.96/3953.020 | 162.04/5132.760 | 162.07/6310.448 |
{v4.12, v4.13, v4.15, v4.16}_lite | 93.39/1333.444 | 187.32/2255.132 | 197.71/3178.872 | 198.01/4098.508 | 197.95/5022.248 |
v4.14 lite | 81.83/1595.292 | 153.40/2779.424 | 154.19/3963.260 | 154.28/5149.140 | 154.30/6332.980 |
benchmark 2
NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8
Measurements: (1080p, fp16) FPS / Device Memory (MB)
model | ORT_CUDA NCHW | ORT_CUDA NHWC | ORT_DML |
---|---|---|---|
dpir color | 4.54 / 2573.3 | 5.98 / 2470.9 | 8.45 / 2364.5 |
dpir color (2 streams) | 4.66 / 4854.9 | 6.30 / 4680.8 | 9.48 / 4630.9 |
waifu2x upconv7 | 10.98 / 5432.5 | 3.18 / 3017.8 | 12.48 / 4493.0 |
waifu2x upconv7 (2 streams) | 14.96 / 10397.1 | 3.25 / 5780.9 | 21.72 / 8891.7 |
waifu2x cunet / cugan | 4.70 / 7955.6 | 4.49 / 6290.6 | OOM |
waifu2x cunet / cugan (2 streams) | 5.11 / 15721.9 | 4.78 / 12312.0 | OOM |
waifu2x swin_unet_art | 2.98 / 23518.5 | 3.05 / 22812.0 | N/A |
realesrgan | 8.99 / 1647.7 | 11.20 / 1127.5 | 11.99 / 1346.6 |
realesrgan (2 streams) | 10.69 / 3034.5 | 13.58 / 1994.1 | 17.34 / 2601.6 |
rife v4.4 (1920x1088) | 61.42 / 1100.9 | 56.02 / 1162.3 | 44.73 / 882.4 |
rife v4.4 (1920x1088, 2 streams) | 106.48 / 1953.4 | 92.88 / 2071.9 | 68.80 / 1670.7 |
scunet color | N/A | N/A | N/A |
benchmark 3
NVIDIA GeForce RTX 2080 Ti, 4352 shaders @ 1700 MHz, driver 552.22, Windows 10 LTSC 21H2 (19044.1415), Python 3.11.9, vapoursynth-classic R57.A8
Measurements: (1080p, fp16) FPS / Device Memory (MB)
model | TRT | ORT_CUDA | ORT_DML | ORT_CUDA NHWC |
---|---|---|---|---|
dpir color (1 stream) | 7.08 / 1899 | 3.10 / 2602 | 4.99 / 2341 | 4.26 / 2411 |
dpir color (2 streams) | 8.06 / 3376 | 3.30 / 5016 | 5.85 / 4619 | 4.74 / 4650 |
waifu2x upconv7 (1 stream) | 11.47 / 2014 | 7.01 / 4949 | 7.45 / 4501 | 1.59 / 2923 |
waifu2x upconv7 (2 streams) | 21.44 / 3782 | 10.11 / 9732 | 13.23 / 8940 | 1.77 / 5674 |
waifu2x cunet / cugan (1 stream) | 7.41 / 4664 | 3.10 / 10067 | OOM | 0.77 / 6188 |
waifu2x cunet / cugan (2 streams) | 10.92 / 8863 | OOM | OOM | OOM |
waifu2x swin_unet_art (1 stream) | 2.35 / 7234 | OOM | N/A | OOM |
waifu2x swin_unet_art (2 streams) | OOM | OOM | N/A | OOM |
realesrgan (1 stream) | 8.66 / 1268 | 5.33 / 1545 | 6.39 / 1316 | 6.96 / 1033 |
realesrgan (2 streams) | 13.20 / 2166 | 7.78 / 2932 | 10.22 / 2571 | 10.25 / 1895 |
rife v4.4 (1920x1088, fp16 i/o, 1 stream) | 64.97 / 609 | 46.60 / 967 | 32.18 / 723 | 48.5... |
v14.test3: latest TensorRT, MIGraphX backend
This is a preview release for TensorRT 9.2.0, following the `v14.test` and `v14.test2` releases.
- Same as those releases, it requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.
- TensorRT 9.2.0 is officially documented as "for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only". The Windows build is downloaded from here, and can be used on other GPU models.
- Users should use the same version of TensorRT as provided (9.2.0) because runtime version checking is disabled in this release.
- Added support for AnimeJaNai V3 models, contributed by @hooke007 in #82.
- Added support for RIFE v4.13 ~ v4.16 (lite, ensemble) models, which are also available for previous vs-mlrt releases (simply download the new model file here and update `vsmlrt.py`).
  - The v4.13 ~ v4.15 models should have the same execution speed as the v4.10 ~ v4.12 models.
  - The v4.13 lite model, the v4.15 lite model and the v4.16 lite model should all have the same execution speed as the v4.12 lite model, while the v4.14 lite model may run slower.
- Added support for fractional video frame interpolation in RIFE.
  - Playback in video players should also set `video_player=True` (#59 (comment)). This change is experimental.
- Fixed an issue that caused the `TRT` backend to crash during script reloading (#65). It is also fixed in the latest iteration of the `v14.test2` release.
- RIFE v4.7+ models with v2 representation are not working with dynamic shapes (#72). This has been reported to TensorRT developers.
- Initial MIGraphX support (experimental) for AMD GPUs.
  - fp16 I/O contributed by @abihf in #86.
  - Multi-stream execution, device selection, hip graphs and dynamic shapes are not explicitly supported for now.

  preliminary benchmark on Radeon RX 7900 XTX [1]:
  - resolution: 1920x1080
  - measurements: fps / device memory (MB)

  model | fp32 | fp16 |
  ---|---|---|
  dpir gray | 2.33 / 2829 | 7.29 / 1702 |
  dpir color | 2.27 / 2861 | 7.03 / 1734 |
  waifu2x upconv7 | 6.31 / 4540 | 12.90 / 2503 |
  waifu2x upresnet10 | 6.65 / 3077 | 13.63 / 1775 |
  waifu2x cunet / cugan | 3.69 / 6711 | 8.36 / 3591 |
  waifu2x swin_unet [2] | 2.19 / 9791 | 4.53 / 5236 |
  realesrgan [3] | 5.75 / 1961 | 11.57 / 959 |
  rife [4] | N/A | N/A |
- Also check the release notes of the `v14.test` and `v14.test2` releases.
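To make the idea of fractional frame interpolation mentioned above concrete, here is a minimal sketch of the general technique: each output frame is mapped back onto the source timeline, and the fractional part of that position becomes the interpolation timestep between two neighbouring source frames. This is not taken from `vsmlrt.py`; the function name `interpolation_plan` and the 24 to 60 fps example are assumptions for illustration.

```python
from fractions import Fraction
from typing import List, Tuple


def interpolation_plan(multi: Fraction, num_output_frames: int) -> List[Tuple[int, Fraction]]:
    """For each output frame, return (left source frame index, timestep in [0, 1))."""
    plan = []
    for n in range(num_output_frames):
        position = Fraction(n) / multi  # position on the source timeline
        left = position.numerator // position.denominator
        plan.append((left, position - left))
    return plan


# Example: 24 fps -> 60 fps corresponds to a rate multiplier of 5/2.
plan = interpolation_plan(Fraction(60, 24), 5)
for left, timestep in plan:
    print(f"blend source frames {left} and {left + 1} at t = {timestep}")
```

Using `Fraction` rather than floats keeps the timesteps exact, so output frames that coincide with a source frame get a timestep of exactly 0.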
benchmark
- RTX 4090
- processor clock @ 2520 MHz
- Intel Icelake server @ 2100 MHz
- Driver 551.86
- Windows 10 21H2 (19044.1415)
- TensorRT 9.2.0
- VapourSynth-Classic R57.A8, vapoursynth-plugin v0.96g3
1920x1080 rgbs, CUDA graphs enabled, fp16
Measurements: FPS / Device Memory (MB)
general
model | 1 stream | 2 streams | 3 streams |
---|---|---|---|
dpir gray | 21.93/1757.352 | 25.48/3049.696 | 25.31/4342.044 |
dpir color | 18.24/1790.184 | 25.11/3115.360 | 25.22/4440.540 |
waifu2x upconv_7_{anime_style_art_rgb, photo} | 19.58/2148.716 | 39.87/3867.240 | 59.94/5585.768 |
waifu2x upresnet10 | 17.40/1655.144 | 34.22/2880.096 | 42.78/4105.048 |
waifu2x cunet / cugan | 13.64/4391.292 | 25.09/8346.248 | 25.19/12301.208 |
waifu2x swin_unet | 4.62/14989.772 | OOM | OOM |
real-esrgan (v2/v3, xsx2) | 16.77/1136.996 | 33.99/1876.568 | 41.44/2616.140 |
rife
v2, fp16 i/o
version | 1 stream | 2 streams | 3 streams | 4 streams | 5 streams |
---|---|---|---|---|---|
v4.4-v4.5 | 150.20/622.784 | 301.05/835.860 | 448.90/1053.024 | 615.84/1268.152 | 787.57/1481.224 |
v4.6 | 147.63/624.832 | 294.53/837.904 | 452.26/1055.072 | 603.63/1270.200 | 764.31/1485.320 |
v4.7-v4.9 | 132.06/747.712 | 268.63/1075.476 | 403.54/1405.284 | 494.98/1737.152 | 496.41/2064.908 |
v4.10-v4.15 | 119.09/862.400 | 238.68/1304.852 | 346.98/1749.352 | 349.48/2195.904 | 349.80/2638.356 |
{v4.12, v4.13, v4.15, v4.16}_lite | 123.72/782.528 | 250.81/1151.252 | 377.27/1522.020 | 403.14/1894.844 | 403.79/2263.568 |
v4.14 lite | 117.97/839.872 | 234.67/1265.940 | 320.23/1696.104 | 321.88/2124.224 | 321.18/2552.340 |
- This pre-release uses trt 9.2.0 + cuda 12.3.1 + cudnn 8.9.6, which requires a minimum driver version of 525 and is compatible with 10 series and newer GPUs, with no significant performance improvement measured. `vsmlrt.py` in all branches can be used interchangeably.
[1] RDNA3, Navi 31, 12288 shaders, processor clock @ 2399 MHz, memory clock @ 1249 MHz, driver 6.0.32831, PCIe 4.0 x16, MIGraphX 2.8.0, ROCm 6.0.2, Linux 6.7.0-060700-generic, VapourSynth-Classic R57.A8
[2] tested on MIGraphX 2.10.0 ROCm/AMDMIGraphX@ecd5adc, requires MIGraphX 2.9.0 ROCm/AMDMIGraphX@2d4a6507c3ad41f9d7ea36de1d7fb257cc788585s and replacing edge padding by reflection padding
[3] tested on MIGraphX 2.10.0 ROCm/AMDMIGraphX@ecd5adc, requires MIGraphX 2.9.0 ROCm/AMDMIGraphX@2d4a6507c3ad41f9d7ea36de1d7fb257cc788585s
[4] missing support for `GridSample` operation
v14.test2: latest TensorRT library
This is a preview release for TensorRT 9.1.0, following the `v14.test` release.
- Same as the `v14.test` release, it requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.
- TensorRT 9.1.0 is officially documented as "for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only" on Linux. The Windows build is downloaded from here, and can be used on other GPU models.

  On Windows, some users have reported crashes when using it in mpv (#65). This problem occurred in an earlier version of this release and is now fixed.
- Added parameters `bf16` (#64), `custom_env` and `custom_args` to the `TRT` backend.
  - fp16 execution of `Waifu2xModel.swin_unet_art` is more accurate, faster and uses less GPU memory than bf16 execution (benchmark)
- Device memory usage of model `Waifu2xModel.swin_unet_art` is reduced compared to TensorRT 9.0.1 on A10G with 1080p input (at 2.66 fps with 7.0GB VRAM usage) with default auxiliary stream heuristic.
  - TensorRT 9.0.1 uses 7 auxiliary streams, compared to 3 streams in TensorRT 9.1.0, which results in significantly more device memory usage with no performance gain.
  - Setting `max_aux_streams=3` lowers device memory usage of TensorRT 9.0.1 to ~8.9GB, and `max_aux_streams=0` corresponds to ~7.3GB usage.
  - TensorRT 9.1.0 with `max_aux_streams=0` uses ~6.7GB device memory.
- Users should use the same version of TensorRT as provided (9.1.0) because runtime version checking is disabled in this release.
- Added support for RIFE v4.8 - v4.12, v4.12 ~ v4.13 lite (ensemble) models, which are also available for previous vs-mlrt releases (simply download the new model file here and update `vsmlrt.py`). v4.8 and v4.9 models should have the same execution speed as v4.7, while the v4.10-v4.12 models are all equally heavy and heavier than previous models. Ensemble models are heavier than their non-ensemble counterparts.
  - Starting from RIFE v4.11, all rife models are temporarily moved here with individual packaging.
- RIFE models with v2 representation for the `TRT` backend now have improved accuracy, contributed by @charlessuh (#66 (comment)). This has been backported to master.
  - This improvement may be very slightly inefficient under onnx file renaming. It is advised to keep the onnx file name unchanged and change the function call to `vsmlrt.RIFE()`.
  - By default `vsmlrt.RIFE()` in `vsmlrt.py` uses v1 representation. The v2 representation is enabled with the `vsmlrt.RIFE(_implementation=2)` function call.

    Sample Error Message

    input: for dimension number 1 in profile 0 does not match network definition (got min=11, opt=11, max=11), expected min=opt=max=7)
  - v2 representation is still considered experimental.
- Added support for SAFA v0.1 video enhancement model.
  - This model takes arbitrary sized video and uses both spatial and temporal information to improve visual quality.
  - Note that this model is non-deterministic by nature, and existing backends do not support manual seeding.
  - ~17 fps on RTX 4090 with `TRT(fp16)`, 1080p input and non-adaptive mode. Adaptive mode is about 2x slower than non-adaptive mode, uses more memory and does not support cuda graphs execution.
  - This representation is not supported by the `NCNN_VK` backend for the same issue as RIFE v2 representation.
- Also check the release notes of the `v14.test` release.
- This pre-release uses trt 9.1.0 + cuda 12.2.2 + cudnn 8.9.5, which can only run on driver >= 525 and 10 series and later gpus, with improved support for self-attentions found in transformer models. `vsmlrt.py` in all branches can be used interchangeably.
- TensorRT 9.0.1 is "for Large Language Models (LLMs) on A100, A10G, L4, L40, and H100 GPUs only" on x86 Linux. Model `Waifu2xModel.swin_unet_art` is 1.2x faster compared to TensorRT 8.6.1 on A10G with 720p input (at 6.3 fps with 4GB VRAM usage), thanks to multi-head attention fusion (requires fp16).

This pre-release is now feature complete. Development now switches to the `v14.test3` pre-release.
v14.test: latest TensorRT library
This is a preview release for TensorRT 8.6.1.
- It requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.
- Added parameters `builder_optimization_level` and `max_aux_streams` to the `TRT` backend.
  - `builder_optimization_level`: "adjust how long TensorRT should spend searching for tactics with potentially better performance" (link)
  - `max_aux_streams`: within-inference multi-streaming, "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." (link)
    - It is advised to lower `max_aux_streams` to 0 on heavy models like `Waifu2xModel.swin_unet_art` to reduce memory usage. Check the benchmark data at the bottom.
- Following TensorRT 8.6.1, the `cudnn` tactic source of the `TRT` backend is disabled by default. `tf32` is also disabled by default in vsmlrt.py.
- Added parameter `short_path` to the `TRT` backend, which shortens the engine path and is enabled on Windows by default.
- Model `Waifu2xModel.swin_unet_art` does not seem to work with `builder_optimization_level=5` in the `TRT` backend before TRT 9.0. Use `builder_optimization_level=4` or lower instead.
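The notes above do not say how `short_path` shortens the engine path. One common way to keep file names under a path length limit is to replace a long descriptive name with a fixed-length digest, sketched below. The hashing scheme, the function name `short_engine_path` and the example engine name are all assumptions for illustration and may differ from what `vsmlrt.py` actually does.

```python
import hashlib
import os


def short_engine_path(engine_folder: str, long_name: str, digest_chars: int = 16) -> str:
    """Map an arbitrarily long engine name to a short, stable file path."""
    digest = hashlib.sha256(long_name.encode("utf-8")).hexdigest()[:digest_chars]
    return os.path.join(engine_folder, f"{digest}.engine")


# Hypothetical long engine name encoding model, shapes and build options.
long_name = "model.onnx.1920x1088_fp16_no-tf32_workspace-none_opt3_aux0_trt-8.6.1_cuda-12.1_device-name"
path = short_engine_path("engines", long_name)
print(path)
```

Because the digest depends only on the long name, the same build settings always map to the same file, so a cached engine can still be found on later runs.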
Less than 5% performance improvement among built-in models compared to 13.1/13.2, 24% device memory usage reduction on DPIR and 35% on RealESRGAN.
Version information:
- The `v13.2` release uses trt 8.5.1 + cuda 11.8.0, which can run on driver >= 450 and 900 series and later gpus.
- The `v14.test` pre-release uses trt 8.6.1 + cuda 12.1.1, which can only run on driver >= 525 and 10 series and later gpus, with no significant performance improvement measured.
- `vsmlrt.py` in both branches can be used interchangeably.

- Added support for RIFE v4.7 model ("optimized for anime scenes"), which is also available for previous vs-mlrt releases (simply download the new model file here and update `vsmlrt.py`). It is more computationally intensive than v4.6.
- This pre-release is now feature complete. Development now switches to the `trt-latest` branch and the `v14.test2` pre-release.
v13.2: latest ort library, DirectML backend
- Added support for DirectML backend through ONNX Runtime. It is available for all dx12 devices and can be accessed through `backend=Backend.ORT_DML()`. waifu2x swin_unet models may not be supported on this backend, and RIFE models may be poorly supported.
- Asset `vsmlrt-windows-x64-vk.*.7z` is renamed to `vsmlrt-windows-x64-generic-gpu.*.7z` and includes backends `OV_CPU`, `OV_GPU`, `ORT_CPU`, `ORT_DML`, `NCNN_VK`. The `cuda` asset continues to include all backends in this release.
- Update onnxruntime to microsoft/onnxruntime@73584f9.
Note
- Backend `OV_GPU` may produce reduced precision output. This is under investigation.
benchmark 1
NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 532.03, Windows 10 21H2 LTSC (19044.1415), Python 3.11.3, vapoursynth-classic R57.A8
Measurements: (1080p, fp16) FPS / Device Memory (MB)
model | ORT_CUDA | ORT_DML |
---|---|---|
dpir | 4.25 / 2573.3 | 7.01 / 2371.0 |
dpir (2 streams) | 4.58 / 5506.2 | 8.85 / 4643.1 |
waifu2x upconv7 | 9.10 / 5248.1 | 9.65 / 4503.1 |
waifu2x upconv7 (2 streams) | 11.15 / 2966.9 | 18.52 / 8911.2 |
waifu2x cunet / cugan | 4.06 / 7875.7 | 6.36 / 8973.7 |
waifu2x cunet / cugan (2 streams) | N/A | 9.51 / 17849.1 |
waifu2x swin_unet_art | N/A | N/A |
realesrgan | 7.52 / 1901.7 | 8.54 / 1352.4 |
realesrgan (2 streams) | 11.15 / 2966.9 | 15.58 / 2608.7 |
rife | 34.30 / 1109.1 | 2.12 / 1417.8 |
rife (2 streams) | 61.45 / 2051.4 | 4.27 / 2740.9 |
benchmark 2
AMD Radeon Pro V620 MxGPU, 4608 shaders @ 2390 MHz, Adrenalin 21.20.02.13, Windows Server 2019, Python 3.11.3, vapoursynth-classic R57.A8
Measurements: (1080p, fp16) FPS / Device Memory (MB)
model | NCNN_VK | ORT_DML |
---|---|---|
dpir | 1.70 / 3248.4 | 4.75 / 2308.1 |
dpir (2 streams) | 1.74 / 6099.5 | 4.86 / 4584.6 |
waifu2x upconv7 | 5.18 / 6872.3 | 14.51 / 4448.5 |
waifu2x upconv7 (2 streams) | 6.14 / 13701 | 15.98 / 8861.2 |
waifu2x cunet / cugan (2x2 tiles) | 1.07 / 3159.8 | 5.57 / 2196.7 |
waifu2x cunet / cugan (2x2 tiles, 2 streams) | 1.07 / 3159.8 | 6.08 / 4357.8 |
waifu2x swin_unet_art | N/A | N/A |
realesrgan | 3.86 / 2699.7 | 9.59 / 1290.4 |
realesrgan (2 streams) | 4.43 / 5355.8 | 10.58 / 2545.3 |
rife | N/A | 2.68 / 1353.5 |
rife (2 streams) | N/A | 4.44 / 2673.3 |
v13.1: latest ov & ort library, new models
- Update openvino to openvinotoolkit/openvino@b0ffec4, improved support for RIFE in both `OV_CPU` and `OV_GPU`. (benchmark on arc a380)
- Fix a typo in vsmlrt.py's `__all__`.
- Added parameter `num_threads` in the `OV_CPU` backend.
- Update onnxruntime to microsoft/onnxruntime@8ed3dfe.
- The default value of `workspace` of the TRT backend will be changed to `None` in the next release, and `use_cudnn` will be changed to `False`.
The following models can be found in external models:
- Add support for waifu2x `swin_unet` models.
- Add support for `ensemble` configuration of RIFE. (RIFE v2 acceleration is experimental and may result in reduced quality on the `TRT` backend with fp16 enabled.)
Note
- Backend `OV_GPU` may produce reduced precision output. This is under investigation.
Contributed Models
Please see PR #42 for policy.
- AnimeJaNai_v2.7z: `RealESRGANv2/{animejanaiV2L1, animejanaiV2L2, animejanaiV2L3}.onnx`, contributed by @hooke007 in #53.
- AnimeJaNai_v3.7z: `RealESRGANv2/{animejanaiV3-HD-L1, animejanaiV3-HD-L2, animejanaiV3-HD-L3}.onnx`, contributed by @hooke007 in #82.
v13: fp16 i/o, faster dynamic shapes for TRT backend
- Added support for fp16 I/O format and faster dynamic shapes in the `TRT` backend.
  - Thanks to @hooke007 @grobalt @MysteryDove @chainikdn and many other users on the svp forum, it became clear that reducing the system bandwidth requirement is crucial to 4K RIFE performance on powerful GPUs (https://github.com/AmusementClub/vs-mlrt/discussions/19). The `TRT` backend now accepts fp16 clips in addition to fp32, and the output format can be specified via parameter `output_format` (0 for fp32 and 1 for fp16).

    As the only portable way to convert fp32 clips to fp16 is via `resize` (`std.Expr` only supports fp16 when the cpu supports the f16c instruction set extension), and to conserve memory bandwidth, you could use the following snippet for RIFE to perform the necessary padding, YUV/RGB conversion and FP16 conversion in one go:

    ```python
    th = (src.height + 31) // 32 * 32  # adjust 32 and 31 to match specific AI network input resolution requirements.
    tw = (src.width + 31) // 32 * 32  # same.
    padded = src.resize.Bicubic(tw, th, format=vs.RGBS if WANT_FP32 else vs.RGBH, matrix_in_s="709", src_width=tw, src_height=th)
    flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend, output_format=1)  # fp16 output
    oh = src.height * (flt.height // th)  # not necessary for RIFE (i.e. oh = src.height), but required for super-resolution upscalers.
    ow = src.width * (flt.width // tw)
    res = flt.resize.Bicubic(ow, oh, format=vs.YUV420P8, matrix_s="709", src_width=ow, src_height=oh)
    ```
  - Faster dynamic shapes introduced in TensorRT 8.5.1 improve performance and device memory usage of dynamically shaped engines (#20 (comment) and the following benchmark).

    Dynamically shaped models can be created by specifying `static_shape=False`, `min_shapes` (minimum size), `opt_shapes` (optimization size) and `max_shapes` (maximum size) in the `TRT` backend.

    Engine cache placement policy of `vsmlrt.py` is direct-mapped:
    - If dynamic shapes are used, the engine name is determined by `min_shapes`, `opt_shapes` and `max_shapes` (among others).
    - Otherwise, the engine name is determined by `opt_shapes`.
    - `opt_shapes` is usually set to `tilesize` in each specific model's interface if not initialized.
  - `workspace` can now be set to None for unlimited workspace size. (#21)
  - Add flag `force_fp16`. This flag forces fp16 computation during inference, and is disabled by default.
    - It reduces memory usage during engine build, and allows more engines to be built, e.g. 1080p rife on gpus with 4gb memory. (successfully built dynamically shaped rife with `min_shapes=(64, 64), opt_shapes=(1920, 1088), max_shapes=(3840, 2176)` on a 4gb ampere)
    - It may reduce engine build time.
- Introduce a new simplified backend interface `BackendV2`.
- Disable tiling support for rife due to incompatible inference design.
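The padding arithmetic in the fp16 snippet above can be checked without VapourSynth. The sketch below isolates it into two small helpers; the names `pad_to_multiple` and `output_size` are hypothetical and exist only for this illustration.

```python
def pad_to_multiple(size: int, multiple: int = 32) -> int:
    """Round size up to the nearest multiple, e.g. (size + 31) // 32 * 32 for 32."""
    return (size + multiple - 1) // multiple * multiple


def output_size(src_size: int, padded_in: int, padded_out: int) -> int:
    """Scale the original size by the model's integer upscaling factor."""
    return src_size * (padded_out // padded_in)


# A 1920x1080 source is padded to 1920x1088 before inference.
th = pad_to_multiple(1080)
tw = pad_to_multiple(1920)
print(tw, th)

# RIFE keeps the resolution, so the crop target is the original height;
# a 2x upscaler would return 2176 rows for 1088 input rows.
print(output_size(1080, th, th), output_size(1080, th, 2 * th))
```

Sizes that are already a multiple of 32, such as 1920, are left unchanged, which is why only the height is padded for 1080p input.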
dynamic shapes
In conclusion, dynamic shapes should be much more flexible when dealing with different video resolutions (no engine re-compilation is required), and incurs almost no performance degradation starting with TensorRT 8.5. Only increased device memory usage will be a concern.
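As a rough sketch of why dynamic shapes avoid engine re-compilation, the check below tests whether a given video size falls inside a profile's `min_shapes` to `max_shapes` range. The profile values are the ones quoted in these notes; the helper `fits_profile` and the list of test sizes are assumptions for illustration, not part of `vsmlrt.py`.

```python
from typing import Tuple

Shape = Tuple[int, int]  # (width, height), as in min_shapes=(64, 64)


def fits_profile(shape: Shape, min_shapes: Shape, max_shapes: Shape) -> bool:
    """True if every dimension lies within the dynamic shape range."""
    return all(lo <= s <= hi for s, lo, hi in zip(shape, min_shapes, max_shapes))


profile = dict(
    static_shape=False,
    min_shapes=(64, 64),
    opt_shapes=(1920, 1088),
    max_shapes=(3840, 2176),
)

for shape in [(1280, 736), (1920, 1088), (3840, 2176), (4096, 2176)]:
    ok = fits_profile(shape, profile["min_shapes"], profile["max_shapes"])
    print(shape, "fits the existing profile" if ok else "is outside the profile")
```

A statically shaped engine corresponds to the degenerate case where the minimum, optimization and maximum sizes are all equal, so any other resolution requires a new engine.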
benchmark
- configuration: nvidia a10 (ecc disabled), driver 527.41 (tcc), windows server 2022, python 3.11.1, vapoursynth-classic R57.A7, `Backend.TRT(fp16=True, use_cuda_graph=True, tf32=False, use_cudnn=False)`, `CUDA_MODULE_LOADING=LAZY`
- Statically shaped engines for each model are compiled separately, while the dynamically shaped engine is compiled once with (1) `static_shape=False, min_shapes=(64, 64), opt_shapes=<max-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions>` or (2) `static_shape=False, min_shapes=(64, 64), opt_shapes=<min-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions>`.

  `opt_shapes` may be lowered for faster engine generation.
- measurements: fps / device memory (MB)
model | 1 stream static | 1 stream dynamic (1) | 1 stream dynamic (2) | 2 streams static | 2 streams dynamic (1) | 2 streams dynamic (2) |
---|---|---|---|---|---|---|
waifu2x upconv7 1920x1080 | 17.4 / 1992 | 17.5 / 1998 | 17.4 / 2040 | 21.2 / 3694 | 21.2 / 3756 | 20.9 / 3850 |
waifu2x upconv7 1280x720 | 37.2 / 1046 | 38.5 / 1930 | 37.8 / 1976 | 46.3 / 1818 | 48.2 / 3628 | 46.5 / 3722 |
waifu2x upconv7 720x480 | 102.2 / 544 | 104.4 / 1894 | 102.2 / 1940 | 123.1 / 834 | 128.1 / 3556 | 123.4 / 3650 |
dpir color 1920x1080 | 7.3 / 2114 | 7.3 / 2114 | 7.2 / 2116 | 7.5 / 3656 | 7.4 / 3992 | 7.4 / 4002 |
dpir color 1280x720 | 16.4 / 1122 | 16.3 / 2086 | 16.2 / 2092 | 16.7 / 1810 | 16.7 / 3936 | 16.7 / 3946 |
dpir color 720x480 | 41.5 / 604 | 41.5 / 2068 | 41.6 / 2074 | 44.3 / 863 | 44.3 / 3900 | 44.2 / 3910 |
real-esrgan v2 1920x1080 | 12.3 / 1320 | 12.3 / 1320 | 12.3 / 1320 | 13.0 / 2196 | 13.2 / 2402 | 12.9 / 2402 |
real-esrgan v2 1280x720 | 26.9 / 736 | 26.9 / 1256 | 27.0 / 1256 | 29.1 / 1130 | 29.3 / 2274 | 29.2 / 2274 |
real-esrgan v2 720x480 | 73.2 / 422 | 73.2 / 1220 | 72.7 / 1220 | 78.7 / 570 | 78.4 / 2202 | 78.1 / 2202 |
cugan 1920x1080 | 9.4 / 4648 | 9.4 / 4726 | 9.2 / 4618 | 9.8 / 8754 | 10.2 / 9210 | 9.9 / 8996 |
cugan 1280x720 | 20.5 / 2214 | 20.5 / 4662 | 20.0 / 4554 | 21.2 / 4050 | 22.9 / 9082 | 22.4 / 8868 |
cugan 720x480 | 54.8 / 996 | 53.7 / 4626 | 52.9 / 4518 | 57.7 / 1690 | 59.9 / 9019 | 58.9 / 8796 |
rife v4.4 1920x1088 | 92.8 / 590 | 92.2 / 594 | 89.5 / 606 | 178.6 / 920 | 177.6 / 942 | 169.5 / 974 |
rife v4.4 1280x736 | 206.0 / 410 | 199.2 / 534 | 199.1 / 550 | 394.2 / 560 | 377.5 / 822 | 374.0 / 854 |
rife v4.4 736x480 | 497.3 / 316 | 442.2 / 504 | 492.3 / 520 | 903.2 / 376 | 809.3 / 762 | 874.1 / 794 |
*The gap is large on rife because of underutilization, and will disappear when using more streams.
v12.3.test
This is a preview release for https://github.com/AmusementClub/vs-mlrt/releases/tag/v13.