Releases · AmusementClub/vs-mlrt

25 Apr 01:01

github-actions

v14

5c0bc0c

v14: latest libraries Latest

Compared to the previous stable (v13.2) release:

General

External models are no longer packaged.

vsmlrt.py

Plugin invocation order in the get_plugin_path() function is sorted to reduce memory consumption.
Added support for RIFE v4.7 ~ v4.16 (lite, ensemble) models.
Added support for SCUNet models for image denoising.

TRT

plugin and runtime libraries

Upgraded to TensorRT 10.0.1.
Maxwell and Pascal GPUs are no longer supported. Other backends still support these GPUs.
Reduced GPU memory usage for dynamically shaped engines when the actual tile size is smaller than the maximum tile size set during engine building.
Reduced engine build time.
Added long path support for engines on Windows.
cuDNN is no longer a strict runtime dependency.

vsmlrt.py

The cuDNN tactic is no longer enabled by default.
TF32 acceleration is disabled by default.
The maximum workspace now defaults to None, which lets TensorRT use up to the total memory size of the GPU.
Added parameters builder_optimization_level, max_aux_streams, bf16 (#64), custom_env, custom_args, short_path and engine_folder (#90):
- builder_optimization_level: "adjust how long TensorRT should spend searching for tactics with potentially better performance" link
- max_aux_streams: Within-inference multi-streaming, "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." link
- bf16: "TensorRT supports the bfloat16 (brain float) floating point format on NVIDIA Ampere and later architectures ... Note that not all layers support bfloat16." link
- custom_env, custom_args: custom environment variable and arguments for trtexec engine build.
- short_path: whether to shorten engine name.
  - On Windows, this could be useful in addressing the maximum path length limitation, and is enabled by default.
- engine_folder: used to specify custom directory for engines.
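As a rough illustration of how these options fit together, the sketch below collects them as keyword arguments for the TRT backend. The values are illustrative assumptions rather than recommended settings. The value types (a dict for custom_env, a list of strings for custom_args) are assumed, and the engine directory name is hypothetical. The final call is left commented out so that the snippet runs without vsmlrt installed.

```python
# Keyword arguments for the TRT backend, using the parameter names listed above.
# All values are illustrative assumptions, not recommended settings.
trt_options = {
    "fp16": True,
    "builder_optimization_level": 3,  # higher levels let TensorRT search tactics for longer
    "max_aux_streams": 0,             # no auxiliary streams, which may lower memory usage
    "bf16": False,                    # bfloat16 needs NVIDIA Ampere or later GPUs
    "custom_env": {"CUDA_MODULE_LOADING": "LAZY"},  # extra environment for the trtexec build
    "custom_args": ["--verbose"],     # extra command-line arguments for trtexec
    "short_path": True,               # shorten engine file names
    "engine_folder": "engines",       # hypothetical directory for built engines
}

# With vsmlrt installed, the options would be passed like this:
# import vsmlrt
# backend = vsmlrt.Backend.TRT(**trt_options)
```

Keeping the options in one dict makes it easy to reuse the same engine settings across several model calls in a script.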

known issues

According to the documentation, there is an up to 4x performance regression for networks containing "GridSample" ops compared to TensorRT 9.2. This affects RIFE and SAFA models.
trtexec may report errors like:
- [E] Error[9]: Skipping tactic 0xded5318b4a444b84 due to exception Cask convolution execution
- [E] Error[2]: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
This issue has been submitted to NVIDIA.

ORT

Upgraded to ONNX Runtime v1.18.0.

interface

The ORT_* backends now support fp16 I/O. The semantics of the fp16 flag in these backends are as follows:
- Enabling fp16 will use a built-in quantization that converts a fp32 onnx to a fp16 onnx. If the input video is of half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the output_format option (0 = fp32, 1 = fp16).
- Disabling fp16 will not use the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
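The two rules above can be modelled as a small decision function. This is a sketch of the described semantics only: ort_io_formats and its parameters are hypothetical names for illustration and are not part of vsmlrt.py.

```python
def ort_io_formats(fp16, clip_is_fp16, output_format=0,
                   onnx_input_is_fp16=False, onnx_output_is_fp16=False):
    """Return (input, output) precision names following the rules above.

    Hypothetical helper for illustration; it is not part of vsmlrt.py.
    """
    if fp16:
        # Built-in conversion: the input follows the clip format,
        # the output follows output_format (0 = fp32, 1 = fp16).
        model_input = "fp16" if clip_is_fp16 else "fp32"
        model_output = "fp16" if output_format == 1 else "fp32"
        return model_input, model_output

    # No conversion: the onnx file decides, and the clip has to match its input.
    if clip_is_fp16 != onnx_input_is_fp16:
        raise ValueError("clip format does not match the input format of the onnx file")
    model_input = "fp16" if onnx_input_is_fp16 else "fp32"
    model_output = "fp16" if onnx_output_is_fp16 else "fp32"
    return model_input, model_output


print(ort_io_formats(fp16=True, clip_is_fp16=True, output_format=1))
```

With fp16 disabled, output_format plays no role in this model because the output format is inferred from the onnx file.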

CUDA

Reduced execution overhead.
Added support for TF32 acceleration. This is disabled by default.
Added experimental prefer_nhwc flag to reduce the number of layout transformations when using tensor cores. This is disabled by default.

OV

Upgraded to OpenVINO 2024.2.0.
Added experimental OV_NPU backend for Intel NPUs.

MIGX

Added support for MIGraphX backend for AMD GPUs. Currently this backend is Linux only.

Community contributions

scripts/vsmlrt.py: update esrgan janai models by @hooke007 in #53
scripts/vsmlrt.py: add more esrgan janai models by @hooke007 in #82
vsmigx: allow fp16 input & output by @abihf in #86
scripts/vsmlrt.py: fix fp16 precision issues of RIFE v2 representations by @charlessuh in #66 (comment)

Benchmark

NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8

1920x1080 RGBS, TRT backend, CUDA graphs enabled, fp16

Measurements: FPS / Device Memory (MB)

model	1 stream	2 streams	3 streams
dpir color	10.99 / 1715.172	11.62 / 3048.540	11.64 / 4381.912
waifu2x upconv_7_{anime_style_art_rgb, photo}	22.38 / 2016.352	32.66 / 3734.880	32.54 / 5453.404
waifu2x cunet / cugan	12.41 / 4359.284	15.53 / 8363.392	15.47 / 12367.504
waifu2x swin_unet	3.80 / 7304.332	4.06 / 14392.408	4.06 / 21276.380
real-esrgan (v2/v3, xsx2)	16.65 / 955.480	22.53 / 1645.904	22.49 / 2336.324
scunet color	4.20 / 2847.708	4.33 / 6646.884	4.33 / 9792.736

Also check benchmarks from previous pre-releases v14.test4 (NVIDIA RTX 2080 Ti/3090/4090 GPUs) and v14.test3 (NVIDIA RTX 4090 and AMD RX 7900 XTX GPUs).

This release uses CUDA 12.4.1, cuDNN 8.9.7, TensorRT 10.0.1, ONNX Runtime v1.18.0, OpenVINO 2024.2.0 and ncnn 20220915 b16f8ca.

Full Changelog: v13.2...v14

Contributors

charlessuh, abihf, and hooke007

Assets 13

contrib-models.v14.7z

5.63 MB 2024-04-27T06:35:46Z
models.v14.7z

814 MB 2024-04-27T06:35:46Z
scripts.v14.7z

14.7 KB 2024-04-27T06:32:39Z
vsmlrt-cuda.v14.7z

1.38 GB 2024-04-27T06:38:59Z
vsmlrt-windows-x64-cpu.v14.7z

842 MB 2024-04-27T06:43:11Z
vsmlrt-windows-x64-cuda.v14.7z

1.94 GB 2024-04-27T06:58:00Z
vsmlrt-windows-x64-generic-gpu.v14.7z

843 MB 2024-04-27T06:46:05Z
VSNCNN-Windows-x64.v14.7z

2.13 MB 2024-04-27T06:34:23Z
VSORT-Windows-x64.v14.7z

33.9 MB 2024-04-27T06:36:39Z
VSOV-Windows-x64.v14.7z

17.4 MB 2024-04-27T06:35:01Z
Source code (zip)

2024-04-27T06:24:15Z
Source code (tar.gz)

2024-04-27T06:24:15Z

27 Mar 03:27

github-actions

v14.test4

e1826de

v14.test4: latest TensorRT and ONNX Runtime libraries Pre-release

This is a preview release for TensorRT 10.0.0, following the v14.test, v14.test2 and v14.test3 releases.

The TRT backend no longer supports Maxwell and Pascal GPUs. Other backends still support these GPUs. Same as those releases, the current release requires driver version >= 525.
Added support for SwinIR models for image restoration, which are only supported by the TRT backend and the ORT_CPU backend from vs-mlrt v14.test4 or later. SwinIR-M and SwinIR-L models exhibit precision issues with the fp16 implementation; this is under investigation.
Added support for SCUNet models for image denoising, which are only supported by the TRT backend and the ORT_CPU backend from vs-mlrt v14.test4 or later.
Added engine_folder argument to the TRT backend in vsmlrt.py to specify custom directory for engines.
Starting with this pre-release, for dynamically shaped engines, the trt runtime allocates gpu memory based on the actual tile size, whereas in previous releases, the runtime would have to allocate gpu memory based on the maximum tile size set at engine compile time. This feature requires TensorRT 10 or later.
The ORT_* backends now support fp16 I/O. The semantics of the fp16 flag are as follows:
- Enabling fp16 will use a built-in quantization that converts a fp32 onnx to a fp16 onnx. If the input video is of half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the output_format option (0 = fp32, 1 = fp16).
- Disabling fp16 will not use the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
Reduced the overhead of the ORT_CUDA backend.
Added support for TF32 acceleration to the ORT_CUDA backend. Disabled by default.
Added experimental prefer_nhwc flag to the ORT_CUDA backend to reduce the number of layout transformations when using tensor cores.
For production use of the TRT backend, continue to use vsmlrt v13.2. For RIFE and SAFA acceleration on the TRT backend, continue to use any old release.
Also check the release notes of the previous pre-releases.

benchmark 1

previous benchmark

RTX 4090
- processor clock @ 2520 MHz
Intel Icelake server @ 2100 MHz
Driver 551.86
Windows 10 21H2 (19044.1415)
TensorRT 10.0.0
VapourSynth-Classic R57.A8, vapoursynth-plugin v0.96g3

1920x1080 rgbs, CUDA graphs enabled, fp16

Measurements: FPS / Device Memory (MB)

general

model	1 stream	2 streams	3 streams
dpir gray	22.05 / 1818.796	25.30 / 3111.114	25.33 / 4403.488
dpir color	18.30 / 1851.632	25.13 / 3176.808	25.17 / 4501.984

waifu2x upconv_7_{anime_style_art_rgb, photo}	20.45 / 2148.716	41.22 / 3867.240	61.21 / 5585.764
waifu2x upresnet10	17.91 / 1716.588	34.53 / 2941.540	42.33 / 4166.492
waifu2x cunet / cugan	13.89 / 4391.292	25.74 / 8346.248	25.96 / 12301.202
waifu2x swin_unet	4.62 / 7436.692	5.43 / 14426.812	5.43 / 21412.840

real-esrgan (v2/v3, xsx2)	17.06 / 1087.844	33.41 / 1778.264	38.26 / 2468.684

scunet gray	5.29 / 3590.320	5.40 / 6678.768	5.40 / 9767.208
scunet color	5.13 / 3555.568	5.48 / 6611.308	5.47 / 9667.048

swinir-s (2x, color)	1.63 / 15897.048	N/A	N/A
swinir-m* (2x, color, 720p)	1.05 / 11305.268	N/A	N/A
swinir-l* (4x, color, 720p)	0.61 / 15391.316	N/A	N/A

*: swinir-m and swinir-l exhibit precision issues.

rife

v2, fp16 i/o

version	1 stream	2 streams	3 streams	4 streams	5 streams
v4.4-v4.5	136.92/778.432	273.80/1149.204	414.80/1522.028	553.70/1892.796	574.31/2263.568
v4.6	136.01/800.960	275.26/1192.212	411.01/1585.516	544.30/1979.764	550.01/2368.020
v4.7-v4.9	98.20/1302.724	195.78/2187.548	210.12/3074.420	210.45/3957.196	210.66/4844.068
v4.10-v4.15	84.41/1595.592	160.93/2773.280	161.96/3953.020	162.04/5132.760	162.07/6310.448
{v4.12, v4.13, v4.15, v4.16}_lite	93.39/1333.444	187.32/2255.132	197.71/3178.872	198.01/4098.508	197.95/5022.248
v4.14 lite	81.83/1595.292	153.40/2779.424	154.19/3963.260	154.28/5149.140	154.30/6332.980

benchmark 2

previous benchmark

NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8

Measurements: (1080p, fp16) FPS / Device Memory (MB)

model	ORT_CUDA NCHW	ORT_CUDA NHWC	ORT_DML
dpir color	4.54 / 2573.3	5.98 / 2470.9	8.45 / 2364.5
dpir color (2 streams)	4.66 / 4854.9	6.30 / 4680.8	9.48 / 4630.9
waifu2x upconv7	10.98 / 5432.5	3.18 / 3017.8	12.48 / 4493.0
waifu2x upconv7 (2 streams)	14.96 / 10397.1	3.25 / 5780.9	21.72 / 8891.7
waifu2x cunet / cugan	4.70 / 7955.6	4.49 / 6290.6	OOM
waifu2x cunet / cugan (2 streams)	5.11 / 15721.9	4.78 / 12312.0	OOM
waifu2x swin_unet_art	2.98 / 23518.5	3.05 / 22812.0	N/A
realesrgan	8.99 / 1647.7	11.20 / 1127.5	11.99 / 1346.6
realesrgan (2 streams)	10.69 / 3034.5	13.58 / 1994.1	17.34 / 2601.6
rife v4.4 (1920x1088)	61.42 / 1100.9	56.02 / 1162.3	44.73 / 882.4
rife v4.4 (1920x1088, 2 streams)	106.48 / 1953.4	92.88 / 2071.9	68.80 / 1670.7
scunet color	N/A	N/A	N/A

benchmark 3

NVIDIA GeForce RTX 2080 Ti, 4352 shaders @ 1700 MHz, driver 552.22, Windows 10 LTSC 21H2 (19044.1415), Python 3.11.9, vapoursynth-classic R57.A8

Measurements: (1080p, fp16) FPS / Device Memory (MB)

model	TRT	ORT_CUDA	ORT_DML	ORT_CUDA NHWC
dpir color (1 stream)	7.08 / 1899	3.10 / 2602	4.99 / 2341	4.26 / 2411
dpir color (2 streams)	8.06 / 3376	3.30 / 5016	5.85 / 4619	4.74 / 4650

waifu2x upconv7 (1 stream)	11.47 / 2014	7.01 / 4949	7.45 / 4501	1.59 / 2923
waifu2x upconv7 (2 streams)	21.44 / 3782	10.11 / 9732	13.23 / 8940	1.77 / 5674

waifu2x cunet / cugan (1 stream)	7.41 / 4664	3.10 / 10067	OOM	0.77 / 6188
waifu2x cunet / cugan (2 streams)	10.92 / 8863	OOM	OOM	OOM

waifu2x swin_unet_art (1 stream)	2.35 / 7234	OOM	N/A	OOM
waifu2x swin_unet_art (2 streams)	OOM	OOM	N/A	OOM

realesrgan (1 stream)	8.66 / 1268	5.33 / 1545	6.39 / 1316	6.96 / 1033
realesrgan (2 streams)	13.20 / 2166	7.78 / 2932	10.22 / 2571	10.25 / 1895

rife v4.4 (1920x1088, fp16 i/o, 1 stream)	64.97 / 609	46.60 / 967	32.18 / 723	48.5...

Assets 14

03 Dec 01:13

github-actions

v14.test3

1426a2b

v14.test3: latest TensorRT, MIGraphX backend Pre-release

This is a preview release for TensorRT 9.2.0, following the v14.test and v14.test2 releases.

Same as those releases, it requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.
TensorRT 9.2.0 is officially documented as for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only. The Windows build is downloaded from here, and can be used on other GPU models.
Users should use the same version of TensorRT as provided (9.2.0) because runtime version checking is disabled in this release.
Added support for AnimeJaNai V3 models, contributed by @hooke007 in #82.
Added support for RIFE v4.13 ~ v4.16 (lite, ensemble) models, which are also available for previous vs-mlrt releases (simply download the new model file here and update vsmlrt.py).
- The v4.13 ~ v4.15 models should have the same execution speed as the v4.10 - v4.12 models.
- The v4.13 lite model, the v4.15 lite model and the v4.16 lite model should all have the same execution speed as the v4.12 lite model, while the v4.14 lite model may run slower.
Added support for fractional video frame interpolation in RIFE.
- Playback in video players should also set video_player=True (#59 (comment)). This change is experimental.
Fixed an issue that caused the TRT backend to crash during script reloading (#65). It is also fixed in the latest iteration of the v14.test2 release.
RIFE v4.7+ models with v2 representation are not working with dynamic shapes (#72). This has been reported to TensorRT developers.
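Fractional frame interpolation, mentioned above, means the output frame count is a non-integer multiple of the source frame count. The sketch below shows the underlying arithmetic only, under the assumption that output frame n sits at source position n / multi. It illustrates the general idea and is not taken from vsmlrt.py; source_position is a hypothetical helper.

```python
from fractions import Fraction
from math import floor


def source_position(n, multi):
    """Map output frame n to (left source frame index, timestep in [0, 1)).

    Assumes output frame n lies at source position n / multi, where multi is
    the (possibly fractional) frame rate multiplier.
    """
    pos = Fraction(n) / multi
    left = floor(pos)        # index of the source frame just before the output frame
    return left, pos - left  # fraction of the way towards the next source frame


# Example: a 5/2 multiplier, as in converting 24 fps to 60 fps.
multi = Fraction(5, 2)
for n in range(6):
    print(n, source_position(n, multi))
```

Using Fraction rather than float keeps the timesteps exact, so output frames that coincide with a source frame get a timestep of exactly 0.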

Initial MIGraphX support (experimental) for AMD GPUs.

fp16 I/O contributed by @abihf in #86.
Multi-stream execution, device selection, hip graphs and dynamic shapes are not explicitly supported for now.

preliminary benchmark on Radeon RX 7900 XTX ¹:

resolution: 1920x1080
measurements: fps / device memory (MB)

model	fp32	fp16
dpir gray	2.33 / 2829	7.29 / 1702
dpir color	2.27 / 2861	7.03 / 1734
waifu2x upconv7	6.31 / 4540	12.90 / 2503
waifu2x upresnet10	6.65 / 3077	13.63 / 1775
waifu2x cunet / cugan	3.69 / 6711	8.36 / 3591
waifu2x swin_unet ²	2.19 / 9791	4.53 / 5236
realesrgan ³	5.75 / 1961	11.57 / 959
rife ⁴	N/A	N/A

Also check the release notes of the v14.test and v14.test2 releases.

benchmark

RTX 4090
- processor clock @ 2520 MHz
Intel Icelake server @ 2100 MHz
Driver 551.86
Windows 10 21H2 (19044.1415)
TensorRT 9.2.0
VapourSynth-Classic R57.A8, vapoursynth-plugin v0.96g3

1920x1080 rgbs, CUDA graphs enabled, fp16

Measurements: FPS / Device Memory (MB)

general

model	1 stream	2 streams	3 streams
dpir gray	21.93/1757.352	25.48/3049.696	25.31/4342.044
dpir color	18.24/1790.184	25.11/3115.360	25.22/4440.540

waifu2x upconv_7_{anime_style_art_rgb, photo}	19.58/2148.716	39.87/3867.240	59.94/5585.768
waifu2x upresnet10	17.40/1655.144	34.22/2880.096	42.78/4105.048
waifu2x cunet / cugan	13.64/4391.292	25.09/8346.248	25.19/12301.208
waifu2x swin_unet	4.62/14989.772	OOM	OOM

real-esrgan (v2/v3, xsx2)	16.77/1136.996	33.99/1876.568	41.44/2616.140

rife

v2, fp16 i/o

version	1 stream	2 streams	3 streams	4 streams	5 streams
v4.4-v4.5	150.20/622.784	301.05/835.860	448.90/1053.024	615.84/1268.152	787.57/1481.224
v4.6	147.63/624.832	294.53/837.904	452.26/1055.072	603.63/1270.200	764.31/1485.320
v4.7-v4.9	132.06/747.712	268.63/1075.476	403.54/1405.284	494.98/1737.152	496.41/2064.908
v4.10-v4.15	119.09/862.400	238.68/1304.852	346.98/1749.352	349.48/2195.904	349.80/2638.356
{v4.12, v4.13, v4.15, v4.16}_lite	123.72/782.528	250.81/1151.252	377.27/1522.020	403.14/1894.844	403.79/2263.568
v4.14 lite	117.97/839.872	234.67/1265.940	320.23/1696.104	321.88/2124.224	321.18/2552.340

This pre-release uses trt 9.2.0 + cuda 12.3.1 + cudnn 8.9.6, which requires a minimum driver version of 525 and is compatible with 10 series and newer GPUs, with no significant performance improvement measured.
vsmlrt.py in all branches can be used interchangeably.

¹ RDNA3, Navi 31, 12288 shaders, processor clock @ 2399 MHz, memory clock @ 1249 MHz, driver 6.0.32831, PCIe 4.0 x16, MIGraphX 2.8.0, ROCm 6.0.2, Linux 6.7.0-060700-generic, VapourSynth-Classic R57.A8
² tested on MIGraphX 2.10.0 ROCm/AMDMIGraphX@ecd5adc, requires MIGraphX 2.9.0 ROCm/AMDMIGraphX@2d4a6507c3ad41f9d7ea36de1d7fb257cc788585 and replacing edge padding by reflection padding
³ tested on MIGraphX 2.10.0 ROCm/AMDMIGraphX@ecd5adc, requires MIGraphX 2.9.0 ROCm/AMDMIGraphX@2d4a6507c3ad41f9d7ea36de1d7fb257cc788585
⁴ missing support for the GridSample operation

Contributors

abihf and hooke007

Assets 14

23 Oct 00:58

github-actions

v14.test2

5aadcab

v14.test2: latest TensorRT library Pre-release

This is a preview release for TensorRT 9.1.0, following the v14.test release.

Same as v14.test release, it requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.
TensorRT 9.1.0 is officially documented as for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only on Linux. The Windows build is downloaded from here, and can be used on other GPU models.
- ~~On Windows, some users have reported crashes when using it in mpv~~ (#65). This problem occurs on an earlier version of this release, which is now fixed.
Add parameters bf16 (#64), custom_env and custom_args to the TRT backend.
- fp16 execution of Waifu2xModel.swin_unet_art is more accurate, faster and uses less GPU memory than bf16 execution (benchmark)
Device memory usage of model Waifu2xModel.swin_unet_art is reduced compared to TensorRT 9.0.1 on A10G with 1080p input (at 2.66 fps with 7.0GB VRAM usage) with default auxiliary stream heuristic.
- TensorRT 9.0.1 using 7 auxiliary streams compared to TensorRT 9.1.0 (3 streams) results in significantly more device memory with no performance gain.
- Setting max_aux_streams=3 lowers device memory usage of TensorRT 9.0.1 to ~8.9GB, and max_aux_streams=0 corresponds to ~7.3GB usage.
- TensorRT 9.1.0 with max_aux_streams=0 uses ~6.7GB device memory.
Users should use the same version of TensorRT as provided (9.1.0) because runtime version checking is disabled in this release.
Added support for RIFE v4.8 - v4.12, v4.12 ~ v4.13 lite (ensemble) models, which are also available for previous vs-mlrt releases (simply download the new model file here and update vsmlrt.py). v4.8 and v4.9 models should have the same execution speed as v4.7, while v4.10-v4.12 models are equally heavier than previous models. Ensemble models are heavier than their non-ensemble counterparts.
- Starting from RIFE v4.11, all rife models are temporarily moved here with individual packaging.
RIFE models with v2 representation for the TRT backend now have improved accuracy, contributed by @charlessuh (#66 (comment)). This has been backported to master.
- This improvement may be very slightly inefficient under onnx file renaming. It is advised to keep onnx file name unchanged and change the function call to vsmlrt.RIFE().
- By default vsmlrt.RIFE() in vsmlrt.py uses v1 representation. The v2 representation is enabled with vsmlrt.RIFE(_implementation=2) function call.
  
  Sample Error Message
  input: for dimension number 1 in profile 0 does not match network definition (got min=11, opt=11, max=11), expected min=opt=max=7)
- v2 representation is still considered experimental.
Added support for SAFA v0.1 video enhancement model.
- This model takes arbitrary sized video and uses both spatial and temporal information to improve visual quality.
- Note that this model is non-deterministic by nature, and existing backends do not support manual seeding.
- ~17 fps on RTX 4090 with TRT(fp16), 1080p input and non-adaptive mode. Adaptive mode is about 2x slower than non-adaptive mode, uses more memory and does not support cuda graphs execution.
- This representation is not supported by the NCNN_VK backend for the same issue as RIFE v2 representation.
Also check the release note of v14.test release.

This pre-release uses trt 9.1.0 + cuda 12.2.2 + cudnn 8.9.5, which can only run on driver >= 525 and 10 series and later gpus, with improved support for self-attentions found in transformer models.
vsmlrt.py in all branches can be used interchangeably.

TensorRT 9.0.1 is for Large Language Models (LLMs) on A100, A10G, L4, L40, and H100 GPUs only on x86 Linux. Model Waifu2xModel.swin_unet_art is 1.2x faster compared to TensorRT 8.6.1 on A10G with 720p input (at 6.3 fps with 4GB VRAM usage), thanks to multi-head attention fusion (requires fp16).

This pre-release is now feature complete. Development has now switched to the v14.test3 pre-release.

Contributors

charlessuh

Assets 14

13 Mar 12:20

github-actions

v14.test

daf9620

v14.test: latest TensorRT library Pre-release

This is a preview release for TensorRT 8.6.1.

It requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.
Add parameters builder_optimization_level and max_aux_streams to the TRT backend.
- builder_optimization_level: "adjust how long TensorRT should spend searching for tactics with potentially better performance" link
- max_aux_streams: Within-inference multi-streaming, "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." link
  - It is advised to lower max_aux_streams to 0 on heavy models like Waifu2xModel.swin_unet_art to reduce memory usage. Check the benchmark data at the bottom.
Following TensorRT 8.6.1, cudnn tactic source of the TRT backend is disabled by default. tf32 is also disabled by default in vsmlrt.py.
Add parameter short_path to the TRT backend, which shortens engine path and is enabled on Windows by default.
Model Waifu2xModel.swin_unet_art does not seem to work with builder_optimization_level=5 on the TRT backend before TRT 9.0. Use builder_optimization_level=4 or lower instead.

Less than 5% performance improvement among built-in models compared to 13.1/13.2, 24% device memory usage reduction on DPIR and 35% on RealESRGAN.

Version information:

v13.2 release uses trt 8.5.1 + cuda 11.8.0, which can run on driver >= 450 and 900 series and later gpus.
v14.test pre-release uses trt 8.6.1 + cuda 12.1.1, which can only run on driver >= 525 and 10 series and later gpus, with no significant performance improvement measured.
vsmlrt.py in both branches can be used interchangeably.

Added support for RIFE v4.7 model ("optimized for anime scenes"), which is also available for previous vs-mlrt releases (simply download the new model file here and update vsmlrt.py). It is more computational intensive than v4.6.

This pre-release is now feature complete. Development has now switched to the trt-latest branch and the v14.test2 pre-release.

Assets 14

28 May 15:01

github-actions

v13.2

d9e4111

v13.2: latest ort library, DirectML backend

Added support for DirectML backend through ONNX Runtime. It is available for all dx12 devices and can be accessed through backend=Backend.ORT_DML(). waifu2x swin_unet models may not be supported on this backend, and RIFE models may be poorly supported.
Asset vsmlrt-windows-x64-vk.*.7z is renamed to vsmlrt-windows-x64-generic-gpu.*.7z and includes backends OV_CPU, OV_GPU, ORT_CPU, ORT_DML, NCNN_VK. The cuda asset continues to include all backends in this release.
Update onnxruntime to microsoft/onnxruntime@73584f9.

Note

Backend OV_GPU may produce reduced precision output. This is under investigation.

benchmark 1

NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 532.03, Windows 10 21H2 LTSC (19044.1415), Python 3.11.3, vapoursynth-classic R57.A8

Measurements: (1080p, fp16) FPS / Device Memory (MB)

model	ORT_CUDA	ORT_DML
dpir	4.25 / 2573.3	7.01 / 2371.0
dpir (2 streams)	4.58 / 5506.2	8.85 / 4643.1
waifu2x upconv7	9.10 / 5248.1	9.65 / 4503.1
waifu2x upconv7 (2 streams)	11.15 / 2966.9	18.52 / 8911.2
waifu2x cunet / cugan	4.06 / 7875.7	6.36 / 8973.7
waifu2x cunet / cugan (2 streams)	N/A	9.51 / 17849.1
waifu2x swin_unet_art	N/A	N/A
realesrgan	7.52 / 1901.7	8.54 / 1352.4
realesrgan (2 streams)	11.15 / 2966.9	15.58 / 2608.7
rife	34.30 / 1109.1	2.12 / 1417.8
rife (2 streams)	61.45 / 2051.4	4.27 / 2740.9

benchmark 2

AMD Radeon Pro V620 MxGPU, 4608 shaders @ 2390 MHz, Adrenalin 21.20.02.13, Windows Server 2019, Python 3.11.3, vapoursynth-classic R57.A8

Measurements: (1080p, fp16) FPS / Device Memory (MB)

model	NCNN_VK	ORT_DML
dpir	1.70 / 3248.4	4.75 / 2308.1
dpir (2 streams)	1.74 / 6099.5	4.86 / 4584.6
waifu2x upconv7	5.18 / 6872.3	14.51 / 4448.5
waifu2x upconv7 (2 streams)	6.14 / 13701	15.98 / 8861.2
waifu2x cunet / cugan (2x2 tiles)	1.07 / 3159.8	5.57 / 2196.7
waifu2x cunet / cugan (2x2 tiles, 2 streams)	1.07 / 3159.8	6.08 / 4357.8
waifu2x swin_unet_art	N/A	N/A
realesrgan	3.86 / 2699.7	9.59 / 1290.4
realesrgan (2 streams)	4.43 / 5355.8	10.58 / 2545.3
rife	N/A	2.68 / 1353.5
rife (2 streams)	N/A	4.44 / 2673.3

Assets 14

29 Jan 15:00

github-actions

v13.1

cf2bfbf

v13.1: latest ov & ort library, new models

Update openvino to openvinotoolkit/openvino@b0ffec4 , improved support for RIFE in both OV_CPU and OV_GPU. (benchmark on arc a380)
Fix a typo in vsmlrt.py's __all__.
Added parameter num_threads in the OV_CPU backend
Update onnxruntime to microsoft/onnxruntime@8ed3dfe.
The default value of workspace of the TRT backend will be changed to None in the next release, and use_cudnn will be changed to False.

the following models can be found on external models:

Add support for waifu2x swin_unet models.
Add support for ensemble configuration of RIFE. (RIFE v2 acceleration is experimental and may result in reduced quality on the TRT backend with fp16 enabled.)

Note

Backend OV_GPU may produce reduced precision output. This is under investigation.

Assets 14

20 Apr 04:26

AkarinVS

contrib-models

668058b

Contributed Models Pre-release

Please see PR #42 for policy.

AnimeJaNai_v2.7z: RealESRGANv2/{animejanaiV2L1, animejanaiV2L2, animejanaiV2L3}.onnx, contributed by @hooke007 in #53.
AnimeJaNai_v3.7z: RealESRGANv2/{animejanaiV3-HD-L1, animejanaiV3-HD-L2, animejanaiV3-HD-L3}.onnx, contributed by @hooke007 in #82.

Contributors

hooke007

Assets 4

14 Jan 10:49

github-actions

v13

5b7840f

v13: fp16 i/o, faster dynamic shapes for TRT backend

Added support for fp16 I/O format and faster dynamic shapes in the TRT backend.
- Thanks to @hooke007 @grobalt @MysteryDove @chainikdn and many other users on the svp forum, it became clear that reducing the system bandwidth requirement is crucial to 4K RIFE performance on powerful GPUs (https://github.com/AmusementClub/vs-mlrt/discussions/19). The TRT backend now accepts fp16 clips in addition to fp32 clips, and the output format can be specified via the output_format parameter (0 for fp32 and 1 for fp16).
As the only portable way to convert fp32 clips to fp16 is via resize (std.Expr only supports fp16 when the cpu supports the f16c instruction set extension), and to conserve memory bandwidth, you could use the following snippet for RIFE to perform the necessary padding, YUV/RGB conversion and FP16 conversion in one go:
```
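import vapoursynth as vs
import vsmlrt

# Assumed to be defined by the script: src (source clip), backend (e.g. vsmlrt.Backend.TRT(fp16=True))
# and WANT_FP32 (bool, True to feed fp32 RGBS instead of fp16 RGBH to the model).
th = (src.height + 31) // 32 * 32  # round up to a multiple of 32; adjust 32 and 31 to match specific AI network input resolution requirements.
tw = (src.width  + 31) // 32 * 32  # same.
padded = src.resize.Bicubic(tw, th, format=vs.RGBS if WANT_FP32 else vs.RGBH, matrix_in_s="709", src_width=tw, src_height=th)  # pad + YUV->RGB + float conversion in one call
flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend, output_format=1) # fp16 output
oh = src.height * (flt.height // th)  # not necessary for RIFE (i.e. oh = src.height), but required for super-resolution upscalers.
ow = src.width  * (flt.width  // tw)
res = flt.resize.Bicubic(ow, oh, format=vs.YUV420P8, matrix_s="709", src_width=ow, src_height=oh)  # crop the padding and convert back to YUV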
```
- Faster dynamic shapes introduced in TensorRT 8.5.1 improve performance and device memory usage of dynamically shaped engines (#20 (comment) and the following benchmark).
  
  Dynamically shaped models can be created by specifying static_shape=False, min_shapes (minimum size), opt_shapes (optimization size) and max_shapes (maximum size) in the TRT backend.
  
  Engine cache placement policy of vsmlrt.py is direct-mapped:
  - If dynamic shapes are used, the engine name is determined by min_shapes, opt_shapes and max_shapes (among others).
  - Otherwise, engine name is determined by opt_shapes.
  - opt_shapes is usually set to tilesize in each specific model's interface if not initialized.
- workspace can now be set to None for unlimited workspace size. (#21)
- Add flag force_fp16. This flag forces fp16 computation during inference, and is disabled by default.
  - It reduces memory usage during engine build, and allows more engines to be built, e.g. 1080p rife on gpus with 4gb memory. (successfully build dynamically shaped rife with min_shapes=(64, 64), opt_shapes=(1920, 1088), max_shapes=(3840, 2176) on a 4gb ampere)
  - It may reduce engine build time.
Introduce a new simplified backend interface BackendV2.
Disable tiling support for rife due to incompatible inference design.
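The size arithmetic used in the RIFE snippet above can be checked in isolation. The sketch below reproduces only the rounding and scaling steps in plain Python; padded_size and output_size are hypothetical helper names, and the multiple of 32 follows the snippet rather than any general rule.

```python
def padded_size(width, height, multiple=32):
    """Round both dimensions up to the next multiple (32 in the snippet above)."""
    tw = (width + multiple - 1) // multiple * multiple
    th = (height + multiple - 1) // multiple * multiple
    return tw, th


def output_size(src_width, src_height, padded, filtered):
    """Scale the source size by the integer factor the model applied to the padded clip."""
    (tw, th), (fw, fh) = padded, filtered
    return src_width * (fw // tw), src_height * (fh // th)


# 1080p is padded to 1088 lines; the width is already a multiple of 32.
print(padded_size(1920, 1080))  # (1920, 1088)

# For a 2x upscaler the padded clip becomes 3840x2176, and the crop target is 3840x2160.
print(output_size(1920, 1080, (1920, 1088), (3840, 2176)))  # (3840, 2160)
```

For RIFE the filtered clip has the same size as the padded clip, so output_size simply returns the source size.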

dynamic shapes

In conclusion, dynamic shapes should be much more flexible when dealing with different video resolutions (no engine re-compilation is required), and incurs almost no performance degradation starting with TensorRT 8.5. Only increased device memory usage will be a concern.
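A minimal sketch of preparing the dynamic-shape options described above follows. The helper names are hypothetical, the shapes are assumed to be (width, height) pairs, and the example values match the ones quoted in these notes. The backend call is left commented out so the snippet runs without vsmlrt installed.

```python
def dynamic_shape_options(min_shapes, opt_shapes, max_shapes):
    """Build keyword arguments for a dynamically shaped TRT engine.

    Hypothetical helper; it only checks min <= opt <= max in each dimension.
    """
    for lo, opt, hi in zip(min_shapes, opt_shapes, max_shapes):
        if not lo <= opt <= hi:
            raise ValueError("shapes must satisfy min <= opt <= max in each dimension")
    return {
        "static_shape": False,
        "min_shapes": min_shapes,
        "opt_shapes": opt_shapes,
        "max_shapes": max_shapes,
    }


def fits(width, height, options):
    """Check whether a clip size lies inside the range the engine was built for."""
    (min_w, min_h), (max_w, max_h) = options["min_shapes"], options["max_shapes"]
    return min_w <= width <= max_w and min_h <= height <= max_h


options = dynamic_shape_options((64, 64), (1920, 1088), (3840, 2176))
print(fits(1280, 720, options))   # True: no engine re-compilation needed
print(fits(4096, 2160, options))  # False: wider than max_shapes

# With vsmlrt installed, the options would be passed like this:
# import vsmlrt
# backend = vsmlrt.Backend.TRT(fp16=True, **options)
```

A clip outside the range would need an engine built with a larger max_shapes, at the cost of higher device memory usage.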

benchmark

configuration: nvidia a10 (ecc disabled), driver 527.41 (tcc), windows server 2022, python 3.11.1, vapoursynth-classic R57.A7, Backend.TRT(fp16=True, use_cuda_graph=True, tf32=False, use_cudnn=False), CUDA_MODULE_LOADING=LAZY
- Statically shaped engines for each model are compiled separately, while the dynamically shaped engine is compiled once with (1) static_shape=False, min_shapes=(64, 64), opt_shapes=<max-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions> or (2) static_shape=False, min_shapes=(64, 64), opt_shapes=<min-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions>.
  
  opt_shapes may be lowered for faster engine generation.

measurements: fps / device memory (MB)

model	1 stream static	1 stream dynamic (1)	1 stream dynamic (2)	2 streams static	2 streams dynamic (1)	2 streams dynamic (2)
waifu2x upconv7 1920x1080	17.4 / 1992	17.5 / 1998	17.4 / 2040	21.2 / 3694	21.2 / 3756	20.9 / 3850
waifu2x upconv7 1280x720	37.2 / 1046	38.5 / 1930	37.8 / 1976	46.3 / 1818	48.2 / 3628	46.5 / 3722
waifu2x upconv7 720x480	102.2 / 544	104.4 / 1894	102.2 / 1940	123.1 / 834	128.1 / 3556	123.4 / 3650
dpir color 1920x1080	7.3 / 2114	7.3 / 2114	7.2 / 2116	7.5 / 3656	7.4 / 3992	7.4 / 4002
dpir color 1280x720	16.4 / 1122	16.3 / 2086	16.2 / 2092	16.7 / 1810	16.7 / 3936	16.7 / 3946
dpir color 720x480	41.5 / 604	41.5 / 2068	41.6 / 2074	44.3 / 863	44.3 / 3900	44.2 / 3910
real-esrgan v2 1920x1080	12.3 / 1320	12.3 / 1320	12.3 / 1320	13.0 / 2196	13.2 / 2402	12.9 / 2402
real-esrgan v2 1280x720	26.9 / 736	26.9 / 1256	27.0 / 1256	29.1 / 1130	29.3 / 2274	29.2 / 2274
real-esrgan v2 720x480	73.2 / 422	73.2 / 1220	72.7 / 1220	78.7 / 570	78.4 / 2202	78.1 / 2202
cugan 1920x1080	9.4 / 4648	9.4 / 4726	9.2 / 4618	9.8 / 8754	10.2 / 9210	9.9 / 8996
cugan 1280x720	20.5 / 2214	20.5 / 4662	20.0 / 4554	21.2 / 4050	22.9 / 9082	22.4 / 8868
cugan 720x480	54.8 / 996	53.7 / 4626	52.9 / 4518	57.7 / 1690	59.9 / 9019	58.9 / 8796
rife v4.4 1920x1088	92.8 / 590	92.2 / 594	89.5 / 606	178.6 / 920	177.6 / 942	169.5 / 974
rife v4.4 1280x736	206.0 / 410	199.2 / 534	199.1 / 550	394.2 / 560	377.5 / 822	374.0 / 854
rife v4.4 736x480	497.3 / 316	442.2 / 504	492.3 / 520	903.2 / 376	809.3 / 762	874.1 / 794

*The gap is large on rife because of underutilization, and will disappear when using more streams.

Contributors

chainikdn, MysteryDove, and 2 other contributors

Assets 13

09 Jan 15:10

github-actions

v12.3.test

97952cb

v12.3.test Pre-release

Pre-release

This is a preview release for https://github.com/AmusementClub/vs-mlrt/releases/tag/v13.

Assets 13
