
v13: fp16 i/o, faster dynamic shapes for TRT backend

github-actions released this 14 Jan 10:49 · 132 commits to master since this release
  • Added support for fp16 I/O format and faster dynamic shapes in the TRT backend.

    Because the only portable way to convert fp32 clips to fp16 is via resize (std.Expr supports fp16 only when the CPU supports the F16C instruction set extension), and to conserve memory bandwidth, you can use the following snippet for RIFE to perform the necessary padding, YUV/RGB conversion and fp16 conversion in one go:

    import vapoursynth as vs
    import vsmlrt

    # src is the YUV source clip, backend is a vsmlrt backend instance,
    # WANT_FP32 selects between fp32 (RGBS) and fp16 (RGBH) network input.
    th = (src.height + 31) // 32 * 32  # adjust 32 and 31 to match the specific AI network's input resolution requirements
    tw = (src.width  + 31) // 32 * 32  # same
    # pad to (tw, th) and convert to RGB fp16/fp32 in a single resize call
    padded = src.resize.Bicubic(tw, th, format=vs.RGBS if WANT_FP32 else vs.RGBH, matrix_in_s="709", src_width=tw, src_height=th)
    flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend, output_format=1)  # fp16 output
    oh = src.height * (flt.height // th)  # not necessary for RIFE (i.e. oh = src.height), but required for super-resolution upscalers
    ow = src.width  * (flt.width  // tw)
    # crop the padding away and convert back to YUV
    res = flt.resize.Bicubic(ow, oh, format=vs.YUV420P8, matrix_s="709", src_width=ow, src_height=oh)
    • The faster dynamic shapes introduced in TensorRT 8.5.1 improve the performance and device memory usage of dynamically shaped engines (see #20 (comment) and the benchmark below).

      Dynamically shaped models can be created by specifying static_shape=False, min_shapes (minimum size), opt_shapes (optimization size) and max_shapes (maximum size) in the TRT backend (see the sketch after this list).

      The engine cache placement policy of vsmlrt.py is direct-mapped:

      • If dynamic shapes are used, the engine name is determined by min_shapes, opt_shapes and max_shapes (among others).
      • Otherwise, the engine name is determined by opt_shapes.
      • If opt_shapes is not set explicitly, it is usually taken from the tilesize in each specific model's interface.
    • workspace can now be set to None for unlimited workspace size. (#21)

    • Add flag force_fp16. This flag forces fp16 computation during inference and is disabled by default.

      • It reduces memory usage during engine build and allows more engines to be built, e.g. 1080p rife on GPUs with 4 GB memory (a dynamically shaped rife engine with min_shapes=(64, 64), opt_shapes=(1920, 1088), max_shapes=(3840, 2176) was successfully built on a 4 GB Ampere GPU).
      • It may reduce engine build time.
  • Introduce a new simplified backend interface BackendV2.

  • Disable tiling support for rife due to incompatible inference design.
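For reference, here is a minimal sketch of how the new options fit together when building a dynamically shaped engine. The parameter names come from this release; the shape values are illustrative (they match the 4 GB example above), and `padded` is the padded RGBH/RGBS clip from the RIFE snippet:

    import vsmlrt

    # Sketch only: tune the shape values to the resolutions you actually feed in.
    backend = vsmlrt.Backend.TRT(
        fp16=True,                # fp16 computation
        force_fp16=False,         # set True to force fp16 and reduce engine-build memory/time
        workspace=None,           # None = unlimited workspace size
        static_shape=False,       # build a dynamically shaped engine
        min_shapes=(64, 64),      # smallest accepted input size
        opt_shapes=(1920, 1088),  # size the engine is optimized for
        max_shapes=(3840, 2176),  # largest accepted input size
    )

    # The single resulting engine serves any input size between min_shapes and
    # max_shapes without re-compilation.
    flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend)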

dynamic shapes

In conclusion, dynamic shapes should be much more flexible when dealing with different video resolutions (no engine re-compilation is required) and incur almost no performance degradation starting with TensorRT 8.5. The only remaining concern is increased device memory usage.

benchmark

  • configuration: NVIDIA A10 (ECC disabled), driver 527.41 (TCC), Windows Server 2022, Python 3.11.1, vapoursynth-classic R57.A7, Backend.TRT(fp16=True, use_cuda_graph=True, tf32=False, use_cudnn=False), CUDA_MODULE_LOADING=LAZY

    • Statically shaped engines for each model are compiled separately, while the dynamically shaped engine is compiled once with (1) static_shape=False, min_shapes=(64, 64), opt_shapes=<max-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions> or (2) static_shape=False, min_shapes=(64, 64), opt_shapes=<min-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions>.

      opt_shapes may be lowered for faster engine generation.
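As a sketch, the two dynamic configurations correspond roughly to the following backends; the concrete resolutions shown are the smallest and largest benchmarked sizes of the non-rife models, and the remaining flags are taken from the configuration line above:

    import vsmlrt

    # shared flags from the benchmark configuration
    common = dict(fp16=True, use_cuda_graph=True, tf32=False, use_cudnn=False,
                  static_shape=False, min_shapes=(64, 64), max_shapes=(1920, 1080))

    backend_dyn1 = vsmlrt.Backend.TRT(opt_shapes=(1920, 1080), **common)  # (1) optimized for the largest size
    backend_dyn2 = vsmlrt.Backend.TRT(opt_shapes=(720, 480), **common)    # (2) optimized for the smallest size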

measurements: fps / device memory (MB)

| model | 1 stream static | 1 stream dynamic (1) | 1 stream dynamic (2) | 2 streams static | 2 streams dynamic (1) | 2 streams dynamic (2) |
| --- | --- | --- | --- | --- | --- | --- |
| waifu2x upconv7 1920x1080 | 17.4 / 1992 | 17.5 / 1998 | 17.4 / 2040 | 21.2 / 3694 | 21.2 / 3756 | 20.9 / 3850 |
| waifu2x upconv7 1280x720 | 37.2 / 1046 | 38.5 / 1930 | 37.8 / 1976 | 46.3 / 1818 | 48.2 / 3628 | 46.5 / 3722 |
| waifu2x upconv7 720x480 | 102.2 / 544 | 104.4 / 1894 | 102.2 / 1940 | 123.1 / 834 | 128.1 / 3556 | 123.4 / 3650 |
| dpir color 1920x1080 | 7.3 / 2114 | 7.3 / 2114 | 7.2 / 2116 | 7.5 / 3656 | 7.4 / 3992 | 7.4 / 4002 |
| dpir color 1280x720 | 16.4 / 1122 | 16.3 / 2086 | 16.2 / 2092 | 16.7 / 1810 | 16.7 / 3936 | 16.7 / 3946 |
| dpir color 720x480 | 41.5 / 604 | 41.5 / 2068 | 41.6 / 2074 | 44.3 / 863 | 44.3 / 3900 | 44.2 / 3910 |
| real-esrgan v2 1920x1080 | 12.3 / 1320 | 12.3 / 1320 | 12.3 / 1320 | 13.0 / 2196 | 13.2 / 2402 | 12.9 / 2402 |
| real-esrgan v2 1280x720 | 26.9 / 736 | 26.9 / 1256 | 27.0 / 1256 | 29.1 / 1130 | 29.3 / 2274 | 29.2 / 2274 |
| real-esrgan v2 720x480 | 73.2 / 422 | 73.2 / 1220 | 72.7 / 1220 | 78.7 / 570 | 78.4 / 2202 | 78.1 / 2202 |
| cugan 1920x1080 | 9.4 / 4648 | 9.4 / 4726 | 9.2 / 4618 | 9.8 / 8754 | 10.2 / 9210 | 9.9 / 8996 |
| cugan 1280x720 | 20.5 / 2214 | 20.5 / 4662 | 20.0 / 4554 | 21.2 / 4050 | 22.9 / 9082 | 22.4 / 8868 |
| cugan 720x480 | 54.8 / 996 | 53.7 / 4626 | 52.9 / 4518 | 57.7 / 1690 | 59.9 / 9019 | 58.9 / 8796 |
| rife v4.4 1920x1088 | 92.8 / 590 | 92.2 / 594 | 89.5 / 606 | 178.6 / 920 | 177.6 / 942 | 169.5 / 974 |
| rife v4.4 1280x736 | 206.0 / 410 | 199.2 / 534 | 199.1 / 550 | 394.2 / 560 | 377.5 / 822 | 374.0 / 854 |
| rife v4.4 736x480 | 497.3 / 316 | 442.2 / 504 | 492.3 / 520 | 903.2 / 376 | 809.3 / 762 | 874.1 / 794 |

*The gap is large for rife because of GPU underutilization; it will disappear when more streams are used.