[FFMPEG QSV][Sample decode] use derived data instead of sw copy to improve the performance #1550
Comments
Hi @fulinjie, I'd like to request more info about the change. I don't understand it well; mostly, I don't understand where the improvement comes from. Basically, there can be two modes when the app eventually needs system memory at output (e.g. to dump data to disk):
Which mode are you talking about? P.S. I'm aware of very slow gtt map access on BXT. As far as I remember it was resolved by "GPUCopy".
There is also the vaGetImage/vaPutImage API, which might perform better...
Yes, I know. But it's not applicable to MSDK as far as I understand.
Hi @dmitryermilov, thanks for the detailed answer to my question.
To produce system memory eventually like the two modes you provided.
This is exactly the way we want to process Decode and CSC in video memory and maps to system memory.
Above cmdline in ffmpeg qsv works in this mode.
It currently works in mode 2, and mode 1 is the expected way to improve performance by avoiding the memory copy. I think we'd better include mode 1 in ffmpeg qsv as well.
Yes, we used to enable GPUCopy to dump NV12 data for performance improvement, but in this case,
I think there is a misunderstanding. Let's resolve it together. It's still not clear where the 20x improvement comes from. It can't be due to a system->system copy. At least I'm very skeptical about it (but of course I can be wrong). But it can be due to an inefficient "video->system" copy.
In this case you won't be able to use GPUCopy because the GPU copy functionality is hidden inside MSDK.
What is the difference here? Which calls do you mean under "Map/Unmap"?
Sure! Actually, while writing the comments above I realized that I hadn't noticed there is CSC in the pipeline. Probably the perf problem is due to this. For highest performance in the "decode->CSC->file dump" pipeline, the memory shared between the decoder and VPP should be video memory, and the memory output by VPP should be system memory (produced by GPUCopy). Do you agree? It's important to get an agreement here. If so, can you please check that the decoder is initialized with IOPattern=MFX_IOPATTERN_OUT_VIDEO_MEMORY and VPP is initialized with IOPattern=MFX_IOPATTERN_IN_VIDEO_MEMORY | MFX_IOPATTERN_OUT_SYSTEM_MEMORY?
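The IOPattern check requested above could be sketched roughly as follows. This is only an illustration, not code from MSDK or ffmpeg: the helper name io_patterns_match_proposal is hypothetical, and the flag values are local stand-ins so the snippet compiles without the SDK. Real code should include mfxvideo.h and compare the IOPattern field of the mfxVideoParam passed to decoder and VPP initialization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the Media SDK IOPattern flags. The numeric values
 * are illustrative only; real code should take the definitions from the
 * SDK headers (mfxvideo.h) instead of redefining them. */
enum {
    MFX_IOPATTERN_IN_VIDEO_MEMORY   = 0x01,
    MFX_IOPATTERN_IN_SYSTEM_MEMORY  = 0x02,
    MFX_IOPATTERN_OUT_VIDEO_MEMORY  = 0x10,
    MFX_IOPATTERN_OUT_SYSTEM_MEMORY = 0x20
};

/* Hypothetical helper: returns true when the decoder outputs video memory
 * and VPP reads video memory and writes system memory, i.e. the
 * configuration proposed in the comment above. */
bool io_patterns_match_proposal(uint16_t dec_io, uint16_t vpp_io)
{
    const uint16_t vpp_expected =
        MFX_IOPATTERN_IN_VIDEO_MEMORY | MFX_IOPATTERN_OUT_SYSTEM_MEMORY;
    return dec_io == MFX_IOPATTERN_OUT_VIDEO_MEMORY && vpp_io == vpp_expected;
}

/* Usage: the proposed setup passes, an all-system-memory setup does not. */
void io_pattern_example(void)
{
    assert(io_patterns_match_proposal(
        MFX_IOPATTERN_OUT_VIDEO_MEMORY,
        MFX_IOPATTERN_IN_VIDEO_MEMORY | MFX_IOPATTERN_OUT_SYSTEM_MEMORY));
    assert(!io_patterns_match_proposal(
        MFX_IOPATTERN_OUT_SYSTEM_MEMORY,
        MFX_IOPATTERN_IN_SYSTEM_MEMORY | MFX_IOPATTERN_OUT_SYSTEM_MEMORY));
}
```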
As observed in the "perf top" log, copyVideoToSys() is heavily used (highlighted in red). That's the reason for my suspicion. The same issue exists in ffmpeg vaapi as well, and could also be addressed by eliminating the copy from derived (mapped) data to destination data.
Yes, there is no memory copy in this pipeline, if I understood it correctly.
Avoiding the memory copy in mode 2 could address this performance issue, according to my test.
I've tried this pipeline with: fulinjie/ffmpeg@cefed3e I thought the highest performance may depend on the fourcc/format (not quite sure). For tiled formats (like NV12), GPU copy may have the best performance, since tiled data in video memory is slow to access if we simply use deriveImage/MapBuffer. For linear formats (like rgb32), derive/map followed by direct access to the linear data in video memory may have the best performance, since reading linear data in video memory is no different from copying it into system memory.
Yes, I agree.
It means that GPUCopy is not involved :) But let's focus on it a bit later. Well, I think I realized what these two patches do. There is one idea/concern. I need your help to check.
The provided cmdline is not runnable. For network reasons, APL is not accessible currently. Results: And the reason is that there is an extra memory copy in raw_encode() for Pipeline 2 to copy data from AVFrame to AVPacket for file dump. If results on APL are still needed, I will provide them next week.
Yes, please share them.
On APL,
frame= 103 fps=1.4 q=-0.0 Lsize= 3337200kB time=00:00:01.81 bitrate=15033598.0kbits/s speed=0.0239x
frame= 1585 fps= 57 q=-0.0 Lsize=N/A time=00:00:26.54 bitrate=N/A speed=0.956x
Thanks for the quick reply! Although I agree that the results from KBL are a point for discussion. I mean, on systems where gtt map works fast, an additional memcopy might indeed be a bottleneck.
Got your point.
Is there the same issue (false impression) in sample_decode?
Is there any issue/PR filed on this in the libva repo or somewhere else? From above discussion,
So IMHO, a query in libva may also be needed to tell whether data of a specific format can be accessed quickly through direct mapping on a specific platform.
In case of output system memory from MSDK, MSDK "touches" this slow gtt mapped memory in copyVideoToSys.
Not yet..
I fully agree with you and am ready to take part in this proposal discussion. It would still be great to understand why GPUCopy doesn't work in this case. @fulinjie, do you have bandwidth to investigate it? BTW, @fulinjie, JFYI: intel/media-driver#623. Probably we should check if the same can improve system memory access on BXT.
Sure. There may have been some "false impression" in the previous sample_decode test (59 fps result).
Thanks a lot, will work on the framework side to support better performance.
So it works?! Great! Because originally from the conversation above I got an impression that it doesn't work on BXT.
Right. It's because the output memory from MSDK is video memory. The app accesses it externally (not inside MSDK), so GPUCopy isn't involved.
@fulinjie can we close this one?
@daleksan, this could be closed at this stage, since a query in libva is needed first. And we can use hwmap=mode=direct explicitly to force deriving data instead of sw copy.
Patches to illustrate the issue:
ffmpeg:
fulinjie/ffmpeg@b0c99fb
msdk:
fulinjie@c19a421
Test cases:
4K input -> Decode + CSC -> output to null (or /dev/null)
CMDLINE:
ffmpeg -hwaccel qsv -c:v h264_qsv -i Nature.h264 -vf scale_qsv=format=rgb32,hwdownload,format=rgb32 -f null -
./sample_decode h264 -i Nature.h264 -vaapi -rgb4 -o /dev/null
Tested on APL:
sample_decode:
59.98 fps
ffmpeg qsv:
3.4 fps -> 61 fps
After applying these patches to ffmpeg qsv and the MSDK core library, performance improves greatly (nearly 20x) and matches that of sample_decode.
A performance improvement can also be observed on a Core platform (KBL).
The possible root cause is as follows:
Currently, in the libmfx core library, sw copy is used to copy derived data to system memory.
Since the derived data in video memory can be used directly (as sample_decode does), the sw copy can be redundant in some situations. And if the frame size is large, performance can be greatly affected.
Is it possible to add a bypass that hands the derived image data directly to DST without copying? (Or does one already exist somewhere in the core?)
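To make the difference between the two paths concrete, here is a minimal simulation in plain C. It does not call libva or MSDK: the mapped_surface struct and all function names are hypothetical, and an ordinary byte array stands in for the pointer an application would get by deriving and mapping a surface. It only shows that the copy path does one extra pass over every visible byte before the consumer sees the same data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a mapped surface: `data` plays the role of the
 * pointer obtained by deriving and mapping the image, `pitch` is the row
 * stride in bytes and may be larger than the visible row size. */
typedef struct {
    const uint8_t *data;
    size_t pitch;
    size_t row_bytes;
    size_t rows;
} mapped_surface;

/* Consumer stand-in for "dump to file": sums every visible byte. */
uint64_t consume_rows(const uint8_t *base, size_t pitch,
                      size_t row_bytes, size_t rows)
{
    uint64_t sum = 0;
    for (size_t y = 0; y < rows; y++)
        for (size_t x = 0; x < row_bytes; x++)
            sum += base[y * pitch + x];
    return sum;
}

/* Copy path: copy each row into a tightly packed system buffer first,
 * then consume the copy. sys_buf must hold rows * row_bytes bytes. */
uint64_t consume_via_copy(const mapped_surface *s, uint8_t *sys_buf)
{
    assert(s && sys_buf);
    for (size_t y = 0; y < s->rows; y++)
        memcpy(sys_buf + y * s->row_bytes, s->data + y * s->pitch,
               s->row_bytes);
    return consume_rows(sys_buf, s->row_bytes, s->row_bytes, s->rows);
}

/* Direct path: consume straight from the mapped pointer, honoring pitch. */
uint64_t consume_direct(const mapped_surface *s)
{
    assert(s);
    return consume_rows(s->data, s->pitch, s->row_bytes, s->rows);
}
```

Both functions return the same result for the same surface; only the copy path touches the data twice. Whether the direct path is actually faster on real hardware depends on how quickly the mapped memory can be read, which this simulation does not model.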