Releases: AmusementClub/vs-mlrt
v12.2
Update vsmlrt.py:
- Introduce a new release artifact ext-models.v12.2.7z, which comes from External Models and is not bundled into the full binary release packages (i.e. the cpu, cuda and vk packages). Please refer to their release notes for details on how to use those models.
- Export a new API, vsmlrt.inference, for inference of custom models:

  import vsmlrt
  output = vsmlrt.inference(clips, "path/to/onnx", backend=vsmlrt.Backend.TRT(fp16=True))

  If you encounter errors like "Cannot find input tensor with name "input" in the network inputs! Please make sure the input tensor names are correct.", you could use vsmlrt.inference(..., input_name=None) or export the model with its input name set to "input".
- Fix trt inference of cugan-pro (3x) models. (#15)
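For example, a minimal sketch of the input-name workaround described above (the model path is a placeholder; reading the name from the graph may require the onnx package, see Installation Notes):

```python
import vsmlrt

# input_name=None lets vsmlrt read the input tensor's name from the
# ONNX graph instead of assuming it is called "input".
output = vsmlrt.inference(
    clips,
    "path/to/onnx",
    backend=vsmlrt.Backend.TRT(fp16=True),
    input_name=None,
)
```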
External Models
More models!
In addition to bundled models, vs-mlrt can also be used to run these models:
- anime-segmentation/isnet_is.onnx: anime character segmentation at a0a563c, RGBS -> GRAYS, requires mod64 input
- oidn/rt_ldr.onnx: image denoising from the Intel® Open Image Denoise library, RGBS, requires mod16 input
- ppocr/ml_PP-OCRv3_det.onnx: multilingual text detection model from PaddleOCR, RGBS -> GRAYS, requires mod32 input
- waifu2x swin_unet: waifu2x's swin_unet models, supported by the Python wrapper via vsmlrt.Waifu2xModel.{swin_unet_art,swin_unet_art_scan,swin_unet_photo{,_v2}}.
  - file list:
    - waifu2x/swin_unet_art/{scale2x, scale4x, noise0, noise0_scale2x, ..., noise3_scale4x}.onnx
    - waifu2x/swin_unet_art_scan/{scale4x, noise0_scale4x, ..., noise3_scale4x}.onnx
    - waifu2x/swin_unet_photo{_v2}/{scale4x, noise0_scale4x, ..., noise3_scale4x}.onnx
  - v2 models handle paddings internally, which reduces PCIe traffic.
- safa/safa_{v0.1,v0.2,v0.3,v0.4}_{non_adaptive,adaptive1x,adaptive}.onnx: SAFA video enhancement models. Individually packaged.
- ArtCNN/ArtCNN_{C4F32,C16F64}{_Chroma,_DS}: ArtCNN models for anime super-resolution and restoration.
With more to come.
Also check onnx models provided by the avs-mlrt community.
Usage
If an external model is not supported by the Python wrapper, you can use the generic vsmlrt.inference
API to run these models (requires release v12.2 or later).
import vsmlrt
output = vsmlrt.inference(rgbs, "path/to/onnx", backend=vsmlrt.Backend.TRT(fp16=True))
The rife model requires auxiliary inputs and should be used through the vsmlrt.RIFE or vsmlrt.RIFEMerge interfaces.
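Some external models listed above have mod requirements on the input size. A minimal sketch of padding to mod16 for oidn/rt_ldr.onnx and cropping back afterwards (the border-based padding scheme is an assumption, not a requirement of the model):

```python
from vapoursynth import core
import vsmlrt

# rt_ldr.onnx requires mod16 input: pad on the right/bottom,
# run inference, then crop the padding away again.
pad_w = -rgbs.width % 16   # rgbs: an RGBS input clip
pad_h = -rgbs.height % 16
padded = core.std.AddBorders(rgbs, right=pad_w, bottom=pad_h)
denoised = vsmlrt.inference(padded, "oidn/rt_ldr.onnx",
                            backend=vsmlrt.Backend.TRT(fp16=True))
output = core.std.Crop(denoised, right=pad_w, bottom=pad_h)
```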
v12.1
This minor release fixes #9: now if vsort/vstrt fails to load required cuda DLLs, they won't crash the entire process.
However, if vs-mlrt is correctly installed, this shouldn't happen. Please report an issue if you can't access the core.trt or core.ort namespaces. A common mistake is forgetting to extract the vsmlrt-cuda.v12.1.7z package for the VSORT-Windows-x64.v12.1.7z or VSTRT-Windows-x64.v12.1.7z packages. If in doubt, CUDA users should use the fully bundled release vsmlrt-windows-x64-cuda.v12.1.7z.
Note: we explicitly do not support using both PyTorch and vs-mlrt plugins in the same vpy script, as PyTorch uses its own set of CUDA DLLs, which might conflict with the ones vs-mlrt uses. As those DLLs are not explicitly versioned (e.g. nvinfer.dll instead of nvinfer-x.y.z.dll), there is nothing we can do.
v12: latest CUDA libraries
Compared to v11, this release updated CUDA dependencies to CUDA 11.8.0, cuDNN 8.6.0 and TensorRT 8.5.1:
- Added support for the NVIDIA 40 series GPUs.
- Added support for RIFE on the trt backend.
Known issues
- Performance of the OV_CPU or ORT_CUDA(fp16=True) backends for RIFE is lower than expected, which is under investigation. Please consider ORT_CPU or ORT_CUDA(fp16=False) for now.
- The NCNN_VK backend does not support RIFE.
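A sketch of the suggested workaround (RIFE usage as described in the v11 notes below):

```python
import vsmlrt

# clip: an RGBS clip that already carries scene-change props.
# fp16=False sidesteps the ORT_CUDA(fp16=True) slowdown noted above.
output = vsmlrt.RIFE(clip, multi=2, backend=vsmlrt.Backend.ORT_CUDA(fp16=False))
```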
Installation Notes
For some advanced features, vsmlrt.py requires the numpy and onnx packages to be available. You might need to run pip install onnx numpy.
Benchmark
Configuration: NVIDIA RTX 3090, driver 526.47, windows server 2019, vs r60, python 3.11.0, 1080p fp16
Backends: ort-cuda, trt from vs-mlrt v12.
For the trt backend, the engine is created without the CUDA_MODULE_LOADING=LAZY environment variable set, and benchmarking is then run with it set, to reduce device memory consumption.
Data format: fps / GPU memory usage (MB)
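To get the same memory-saving behavior in your own scripts, the variable can be set at the top of the script; a sketch (the assumption here is that it takes effect only if set before the CUDA runtime is initialized, i.e. before the first trt filter is created):

```python
import os

# Must run before any CUDA-backed filter is instantiated.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"
```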
rife(model=44, 1920x1088)
backend | 1 stream | 2 streams |
---|---|---|
ort-cuda | 53.62/1771 | 83.34/2748 |
trt | 71.30/ 626 | 107.3/ 962 |
dpir color
backend | 1 stream | 2 streams |
---|---|---|
ort-cuda | 4.64/3230 | |
trt | 10.32/1992 | 11.61/3475 |
waifu2x upconv_7
backend | 1 stream | 2 streams |
---|---|---|
ort-cuda | 11.07/5916 | 15.04/10899 |
trt | 18.38/2092 | 31.64/ 3848 |
waifu2x cunet
backend | 1 stream | 2 streams |
---|---|---|
ort-cuda | 4.63/8541 | 5.32/16148 |
trt | 11.44/4771 | 15.59/ 8972 |
realesrgan v2/v3
backend | 1 stream | 2 streams |
---|---|---|
ort-cuda | 8.84/2283 | 11.10/4202 |
trt | 14.59/1324 | 21.37/2174 |
v11 RIFE support
Added support for the RIFE video frame interpolation algorithm.
There are two APIs for RIFE:
- vsmlrt.RIFE is a high-level API for interpolating a clip. Set the multi argument to specify the fps factor. Just remember to perform scene detection on the input clip.
- vsmlrt.RIFEMerge is a novel temporal std.MaskedMerge-like interface for RIFE. Use it if you want to precisely control the frames and/or time point of the interpolation.
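A minimal sketch of the high-level API (scene detection via the misc plugin is one option; the threshold is illustrative):

```python
from vapoursynth import core
import vsmlrt

# RIFE consults scene-change props to avoid interpolating across cuts.
clip = core.misc.SCDetect(rgbs, threshold=0.1)  # rgbs: an RGBS clip
# multi=2 doubles the frame rate.
output = vsmlrt.RIFE(clip, multi=2, backend=vsmlrt.Backend.ORT_CUDA(fp16=False))
```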
Known issues
- vstrt doesn't support RIFE for the moment [1]. The next release of TensorRT should include RIFE support and we will release v12 when that happens.
- The vstrt backend also doesn't yet support the latest RTX 4000 series GPUs. This will be fixed after upgrading to the upcoming TensorRT 8.5 release. RTX 4000 series GPU owners, please use the other CUDA backends for now.
- Users of the OV_GPU backend may experience errors like "Exceeded max size of memory object allocation: Requested 11456040960 bytes but max alloc size is 4294959104 bytes". Please consider tiling for now (see the sketch below). The reason is that the OpenVINO library follows the OpenCL standard's restriction on memory object allocation (CL_DEVICE_MAX_MEM_ALLOC_SIZE). For most existing Intel GPUs (Gen9 and later), the driver imposes a maximum allocation size of ~4 GiB [2].

[1] The grid_sample operator is not yet supported; see https://github.com/onnx/onnx-tensorrt/blob/main/docs/operators.md.
[2] This value is derived from here, which states that a device not supporting sharedSystemMemCapabilities has a maximum allowed allocation size of 4294959104 bytes.
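As a sketch of the tiling workaround mentioned above (tile count and overlap are illustrative; parameter names follow vsmlrt.py's tiles/tilesize/overlap arguments):

```python
import vsmlrt

# Splitting each frame into 2x2 tiles keeps every OpenCL allocation
# well below the ~4 GiB limit; overlap hides seams between tiles.
output = vsmlrt.Waifu2x(
    rgbs, noise=-1, scale=2,
    tiles=2, overlap=16,
    backend=vsmlrt.Backend.OV_GPU(fp16=True),
)
```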
v11.test
internal testing only.
Added support for the RIFE video frame interpolation algorithm. Some features are still being implemented. The Python RIFE model wrapper interface is still subject to change.
Known issue
- Users of the OV_GPU backend may experience errors like "Exceeded max size of memory object allocation: Requested 11456040960 bytes but max alloc size is 4294959104 bytes". Please consider tiling for now. The reason is that the OpenVINO library follows the OpenCL standard's restriction on memory object allocation (CL_DEVICE_MAX_MEM_ALLOC_SIZE). For most existing Intel GPUs (Gen9 and later), the driver imposes a maximum allocation size of ~4 GiB [1].

[1] This value is derived from here, which states that a device not supporting sharedSystemMemCapabilities has a maximum allowed allocation size of 4294959104 bytes.
Model Release 20220923, RIFE model
New models (compared to the previous model release):
- RIFE v4.0 from vs-rife v2.0.0: rife/rife_v4.0.onnx, config: fastmode=True, ensemble=False
- RIFE v4.2, v4.3, v4.4, v4.5, v4.6, v4.7, v4.8, v4.9, v4.10 from Practical-RIFE: rife/rife_{v4.2,v4.3,v4.4,v4.5,v4.6,v4.7,v4.8,v4.9,v4.10}.onnx, config: fastmode=True, ensemble=False
- Other provided RIFE models can be found here, including the v2 representation of the RIFE v4.7-v4.10 models. Sorry for the inconvenience.
Notes:
- For RIFE on ort-gpu, vs-mlrt v11 or later is suggested for best performance. And (as of v11), only ov-cpu, ort-cpu, ort-cuda and trt (pending a new TensorRT release) support RIFE. Specifically, ncnn-vk does not support RIFE due to the missing gridsample op.
v10: new vulkan based vsncnn (AMD GPU supported)
Release Highlight
Vulkan based AMD GPU support added with the new vsncnn-vk backend.
Major features
- Introduced the ncnn-based vsncnn plugin, which supports any GPU with Vulkan support (NVidia, AMD, Intel integrated & discrete).
  - Good news for AMD GPU users! vs-mlrt has finally achieved full platform coverage: from x86 CPUs to GPUs of all three major vendors.
  - Please refer to the benchmark below for performance details. TL;DR: it's comparable to vsort-cuda on most networks (except waifu2x-cunet), but (significantly) slower than vstrt. Owing to its C++ implementation, it's generally faster than Python-based ncnn implementations.
  - Hint: if your GPU has enough memory, please consider setting num_streams>1 to extract more performance.
  - Even though it's possible to use software-based Vulkan implementations (as we did in the GHA tests), if you want to do CPU-only inference, it's much better to use vsov-cpu (or vsort-cpu).
- Introduced a new, smaller Vulkan-based GPU binary package (vsmlrt-windows-x64-vk.v10.7z) that only includes vsov-{cpu,gpu}, vsort-cpu and vsncnn-vk. Use this if you only use an Intel/AMD GPU or don't want to download 1GB of data in exchange for a backend that is merely 2~8x faster. Now there shouldn't be any reason not to use vs-mlrt.
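A sketch of selecting the new backend (the model choice mirrors the benchmark below; num_streams=2 follows the hint above):

```python
import vsmlrt

# Vulkan-based ncnn backend: works on NVidia, AMD and Intel GPUs alike.
# A second stream trades extra GPU memory for throughput.
output = vsmlrt.Waifu2x(
    rgbs, noise=-1, scale=2,
    model=vsmlrt.Waifu2xModel.upconv_7_anime_style_art_rgb,
    backend=vsmlrt.Backend.NCNN_VK(fp16=True, num_streams=2),
)
```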
Benchmark
Configuration: NVIDIA RTX 3090, driver 516.94, windows server 2019, vs r60, python 3.10.7, 1080p fp16
Backends: ncnn-vk, ort-cuda, trt from vs-mlrt v10, dpir-ncnn v2.0.0, w2xncnnvk r2
Data format: fps / GPU memory usage (MB)
dpir color
backend | 1 stream | 2 streams |
---|---|---|
ncnn-vk | 4.33/3347 | 4.72/6119 |
ort-cuda | 4.56/3595 | |
trt | 10.64/2595 | 11.10/4593 |
dpir-ncnn | 3.68/3326 |
waifu2x upconv_7
backend | 1 stream | 2 streams |
---|---|---|
ncnn-vk | 9.46/6820 | 14.71/13468 |
ort-cuda | 12.10/6411 | 13.98/11273 |
trt | 21.32/3317 | 29.10/ 5053 |
w2xncnnvk | 6.68/6931 | 12.70/13626 |
waifu2x cunet
backend | 1 stream | 2 streams |
---|---|---|
ncnn-vk | 1.46/11908 | 1.53/23574 |
ort-cuda | 4.85/ 8793 | 5.18/16231 |
trt | 11.60/ 4960 | 15.60/ 9057 |
w2xncnnvk | 1.38/11966 | 1.58/23687 |
realesrgan v2/v3
backend | 1 stream | 2 streams |
---|---|---|
ncnn-vk | 7.23/2781 | 8.35/5330 |
ort-cuda | 9.05/2669 | 10.18/4539 |
trt | 15.93/1667 | 19.58/2543 |
v10.pre
This is a pre-release for testing & benchmarking purposes only.
For production use, please use the official v10 release.
Release Highlight
Vulkan based AMD GPU support added with the new vsncnn-vk backend.
Major features
- Introduced ncnn-based vsncnn plugin that supports any GPU with Vulkan support (NVidia, AMD, Intel integrated & discrete). Good news for AMD GPU users! vs-mlrt has finally achieved full platform coverage: from x86 CPU to GPU of all three major vendors.
- Introduced a new, smaller Vulkan-based GPU binary package (vsmlrt-windows-x64-vk.v10.pre.7z) that only includes vsov-{cpu,gpu}, vsort-cpu and vsncnn-vk. Use this if you only use an Intel/AMD GPU or don't want to download 1GB of data in exchange for a backend that is merely 3x faster. Now there shouldn't be any reason not to use vs-mlrt.
v9.2
Fixed issues
- In vs-mlrt v9 and v9.1 on Windows, the ORT_CUDA backend may fail with out-of-memory errors when processing a non-initial frame. This has been fixed and performance should be improved.
- The use_cuda_graph parameter of the ORT_CUDA backend now works properly on Windows. It is, however, currently not recommended for use.
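A sketch of where the parameter lives (kept at its default, per the recommendation above; the surrounding DPIR call is illustrative):

```python
import vsmlrt

# use_cuda_graph now functions on Windows, but the advice above
# is to leave it disabled for now.
backend = vsmlrt.Backend.ORT_CUDA(fp16=True, use_cuda_graph=False)
output = vsmlrt.DPIR(rgbs, strength=5, backend=backend)
```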
Full Changelog: v9.1...v9.2