This repository has been archived by the owner on May 17, 2023. It is now read-only.

[FFMPEG QSV][Sample decode] use derived data instead of sw copy to improve the performance #1550

Closed
fulinjie opened this issue Jul 26, 2019 · 17 comments

@fulinjie
Contributor

Patches to illustrate the issue:
ffmpeg:
fulinjie/ffmpeg@b0c99fb
msdk:
fulinjie@c19a421

Test cases:
4K input -> Decode + CSC -> output to null (or /dev/null)
CMDLINE:
ffmpeg -hwaccel qsv -c:v h264_qsv -i Nature.h264 -vf scale_qsv=format=rgb32,hwdownload,format=rgb32 -f null -

./sample_decode h264 -i Nature.h264 -vaapi -rgb4 -o /dev/null

Tested on APL:

sample_decode:
59.98 fps

ffmpeg qsv:
3.4 fps -> 61 fps

After applying these patches to ffmpeg qsv and the MSDK core library, performance improves dramatically (nearly 20x) and matches the performance of sample_decode.

Performance improvement could also be observed on core platform (KBL).

The likely root cause:
Currently, in the libmfx core library, a sw copy is used to copy the derived data to system memory.
Since the derived data in video memory could be used directly (as sample_decode does), the sw copy is redundant in some situations. And when the frame size is large, the performance is heavily affected.

Is it possible to add a bypass that derives the image data directly to DST without copying? (Or does such a path already exist somewhere in the core?)

@dmitryermilov
Contributor

dmitryermilov commented Jul 26, 2019

Hi @fulinjie ,

I'd like to request more info about the change; I don't understand it well. Mostly, I don't understand where the improvement comes from.
MSDK can produce either system or video memory. To get the ball rolling, the first question is: in the case above, how is MSDK initialized by ffmpeg? To produce system or video memory?

Basically, there are two modes when the app eventually needs system memory at the output (e.g. to dump data to disk):

  • Mode 1: 1) the app initializes MSDK to produce video memory; 2) the application gets the video memory and maps it (vaDeriveImage/vaMapBuffer) to system memory.

  • Mode 2: 1) the app initializes MSDK to produce system memory; 2) MSDK internally decodes to video memory and then internally copies it to system memory, either by sw copy (vaDeriveImage -> vaMapBuffer -> memcpy -> vaUnmapBuffer -> vaDestroyImage) or by GPUCopy; 3) the application gets system memory.

Which mode are you talking about?

P.S. I'm aware of the very slow gtt map access on BXT. As far as I remember, it was resolved by GPUCopy.
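
The mode 2 sw-copy path corresponds to a VA-API call sequence roughly like the following (an illustrative, non-runnable sketch of the sequence, not the actual libmfx code; `dpy`, `surf`, and `dst_sys_mem` are assumed to exist):

```c
/* Illustrative sketch of the mode 2 sw-copy sequence (not libmfx source).
 * Assumes a valid VADisplay dpy, a decoded VASurfaceID surf, and a
 * system-memory destination dst_sys_mem of sufficient size. */
VAImage img;
vaDeriveImage(dpy, surf, &img);      /* derive an image from the surface   */

void *src = NULL;
vaMapBuffer(dpy, img.buf, &src);     /* map it into the CPU address space  */

/* This is the copy that mode 1 avoids: reading through a (possibly slow)
 * gtt mapping and duplicating the whole frame into system memory. */
memcpy(dst_sys_mem, src, img.data_size);

vaUnmapBuffer(dpy, img.buf);
vaDestroyImage(dpy, img.image_id);
```

In mode 1 the application stops after vaMapBuffer and consumes the mapped pointer directly, which is what the patches in this issue aim for.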

@dvrogozh
Contributor

There is also the vaGetImage/vaPutImage API, which might perform better...

@dmitryermilov
Contributor

Yes, I know. But as far as I understand, it's not applicable to MSDK.

@dmitryermilov dmitryermilov self-assigned this Jul 26, 2019
@fulinjie
Contributor Author

> Hi @fulinjie ,

Hi @dmitryermilov , thanks for the detailed answer to my question.
I'm looking for a performance improvement for ffmpeg qsv, as a requirement from a customer:
HW decode + CSC (rgb4), then dump to system memory.

> I'd like to request more info about the change; I don't understand it well. Mostly, I don't understand where the improvement comes from.
> MSDK can produce either system or video memory. To get the ball rolling, the first question is: in the case above, how is MSDK initialized by ffmpeg? To produce system or video memory?

To produce system memory eventually, as in the two modes you described.

> • Mode 1: 1) the app initializes MSDK to produce video memory; 2) the application gets the video memory and maps it (vaDeriveImage/vaMapBuffer) to system memory.

This is exactly the way we want it: process decode and CSC in video memory, then map to system memory.

> • Mode 2: 1) the app initializes MSDK to produce system memory; 2) MSDK internally decodes to video memory and then internally copies it to system memory, either by sw copy (vaDeriveImage -> vaMapBuffer -> memcpy -> vaUnmapBuffer -> vaDestroyImage) or by GPUCopy; 3) the application gets system memory.

The cmdline above runs ffmpeg qsv in this mode.
What my patches do, in effect, is force mode 2 (Map/memcpy/Unmap/system memory) to act like mode 1 (vaDeriveImage/vaMapBuffer/system memory).

> Which mode are you talking about?

It currently works in mode 2; mode 1 is the expected way to improve performance by avoiding the memory copy.
(A 4K frame is large, so an additional sw copy (or even gpu copy) per frame can greatly hurt performance.)

I think we'd better support mode 1 in ffmpeg qsv as well.
Any hints on the IOPatterns/flags/code path in the libmfx core library would be very helpful.

> P.S. I'm aware of the very slow gtt map access on BXT. As far as I remember, it was resolved by GPUCopy.

Yes, we used to enable GPUCopy to dump NV12 data for better performance, but in this case
GPUCopy only improved APL from 3.7 fps to 24 fps, which still doesn't match the sample_decode benchmark.
That's the reason for looking for a better solution.

@dmitryermilov
Contributor

dmitryermilov commented Jul 26, 2019

I think there is a misunderstanding. Let's resolve it together.
A few points to discuss:

It's still not clear where the 20x improvement comes from. It can't be due to a system->system copy; at least I'm very skeptical about that (though of course I can be wrong). But it can be due to an inefficient video->system copy.

> > • Mode 1: 1) the app initializes MSDK to produce video memory; 2) the application gets the video memory and maps it (vaDeriveImage/vaMapBuffer) to system memory.
>
> This is exactly the way we want it: process decode and CSC in video memory, then map to system memory.

In this case you won't be able to use GPUCopy, because the gpu copy functionality is hidden inside MSDK.

> What my patches do, in effect, is force mode 2 (Map/memcpy/Unmap/system memory) to act like mode 1 (vaDeriveImage/vaMapBuffer/system memory).

What is the difference here? Which calls do you mean by "Map/Unmap"?

> Any hints on the IOPatterns/flags/code path in the libmfx core library would be very helpful.

Sure! Actually, while writing the comments above I realized I hadn't noticed that there is CSC in the pipeline. Probably the perf problem is due to this. For highest performance in the "decode->CSC->file dump" pipeline, the memory shared between the decoder and VPP should be video memory, and the memory output by VPP should be system memory (produced by GPUCopy). Do you agree? It's important to get agreement here. If so, can you please check that the decoder is initialized with IOPattern=MFX_IOPATTERN_OUT_VIDEO_MEMORY, and VPP is initialized with IOPattern=MFX_IOPATTERN_IN_VIDEO_MEMORY | MFX_IOPATTERN_OUT_SYSTEM_MEMORY?
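
The IOPattern check suggested above amounts to verifying fields like these at init time (a sketch against the Media SDK API; the variable names are illustrative and the codec/VPP parameters are omitted):

```c
/* Sketch: the IOPattern split suggested above (illustrative, not ffmpeg code). */
mfxVideoParam dec_par = {0}, vpp_par = {0};

/* Decoder writes into video memory shared with VPP. */
dec_par.IOPattern = MFX_IOPATTERN_OUT_VIDEO_MEMORY;

/* VPP reads video memory and outputs system memory (eligible for GPUCopy). */
vpp_par.IOPattern = MFX_IOPATTERN_IN_VIDEO_MEMORY |
                    MFX_IOPATTERN_OUT_SYSTEM_MEMORY;

MFXVideoDECODE_Init(session, &dec_par);  /* codec params omitted here */
MFXVideoVPP_Init(session, &vpp_par);     /* VPP params omitted here   */
```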

@fulinjie
Contributor Author

> It's still not clear where the 20x improvement comes from. It can't be due to a system->system copy; at least I'm very skeptical about that (though of course I can be wrong). But it can be due to an inefficient video->system copy.

In the "perf top" log, copyVideoToSys() is heavily used (highlighted in red). That's the reason for my suspicion.

The same issue exists in ffmpeg vaapi as well, and could likewise be addressed by eliminating the copy from the derived (mapped) data to the destination.

> > • Mode 1: 1) the app initializes MSDK to produce video memory; 2) the application gets the video memory and maps it (vaDeriveImage/vaMapBuffer) to system memory.
>
> > This is exactly the way we want it: process decode and CSC in video memory, then map to system memory.
>
> In this case you won't be able to use GPUCopy, because the gpu copy functionality is hidden inside MSDK.

Yes; if I understand correctly, there is no memory copy at all in that pipeline.

> > What my patches do, in effect, is force mode 2 (Map/memcpy/Unmap/system memory) to act like mode 1 (vaDeriveImage/vaMapBuffer/system memory).
>
> What is the difference here? Which calls do you mean by "Map/Unmap"?

Avoiding the memory copy in mode 2 addresses this performance issue, according to my test.

> > Any hints on the IOPatterns/flags/code path in the libmfx core library would be very helpful.
>
> Sure! Actually, while writing the comments above I realized I hadn't noticed that there is CSC in the pipeline. Probably the perf problem is due to this. For highest performance in the "decode->CSC->file dump" pipeline, the memory shared between the decoder and VPP should be video memory, and the memory output by VPP should be system memory (produced by GPUCopy). Do you agree? It's important to get agreement here.

I've tried this pipeline with fulinjie/ffmpeg@cefed3e:
the memory shared between the decoder and VPP is video memory, and the VPP output memory is system memory (using GPUCopy); the result on APL is 24 fps (double-confirmed by the customer).

I think the best-performing path may depend on the fourcc/format (not quite sure):

For tiled formats (like NV12), gpu copy may perform best, since tiled data in video memory is slow to access through a plain vaDeriveImage/vaMapBuffer.

For linear formats (like rgb32), deriving/mapping and accessing the linear data in video memory directly may perform best, since reading linear data in place is no slower than copying it into system memory first.

> If so, can you please check that the decoder is initialized with IOPattern=MFX_IOPATTERN_OUT_VIDEO_MEMORY, and VPP is initialized with IOPattern=MFX_IOPATTERN_IN_VIDEO_MEMORY | MFX_IOPATTERN_OUT_SYSTEM_MEMORY?

@dmitryermilov
Contributor

dmitryermilov commented Jul 26, 2019

> I think the best-performing path may depend on the fourcc/format (not quite sure):

Yes, I agree.

> In the "perf top" log, copyVideoToSys() is heavily used (highlighted in red). That's the reason for my suspicion.

It means that GPUCopy is not involved :) But let's come back to that a bit later.

Well, I think I've realized what these two patches do. There is one idea/concern I need your help to check.
Since you already have an environment, can you please run (with the patches applied):
ffmpeg -hwaccel qsv -c:v h264_qsv -i Nature.h264 -vf scale_qsv=format=rgb32,hwdownload,format=rgb32 -f /dev/null
and share the resulting fps? Will it be ~61 or ~3.4? I'll explain the idea later.
I'm intentionally interested in writing to a file (even if it's a dummy like /dev/null).

@fulinjie
Contributor Author

> Well, I think I've realized what these two patches do. There is one idea/concern I need your help to check.
> Since you already have an environment, can you please run (with the patches applied):
> ffmpeg -hwaccel qsv -c:v h264_qsv -i Nature.h264 -vf scale_qsv=format=rgb32,hwdownload,format=rgb32 -f /dev/null
> and share the resulting fps? Will it be ~61 or ~3.4? I'll explain the idea later.
> I'm intentionally interested in writing to a file (even if it's a dummy like /dev/null).

The provided cmdline is not runnable.
IMHO, what you intended to test may be:
'ffmpeg -hwaccel qsv -c:v h264_qsv -i Nature.h264 -vf scale_qsv=format=rgb32,hwdownload,format=rgb32 -f rawvideo /dev/null'

For network reasons, APL is not accessible at the moment.
Instead, I tested on KBL (50+ fps -> 90+ fps after applying the two patches).

Results:
Pipeline 1, -f null -: 97 fps
Pipeline 2, -f rawvideo /dev/null: 55 fps
Before these patches: about 50+ fps

The reason is that pipeline 2 has an extra memory copy in raw_encode(), which copies data from the AVFrame to the AVPacket for the file dump:
https://github.com/FFmpeg/FFmpeg/blob/47b6ca0b022a413e392707464f2423795aa89bfb/libavcodec/rawenc.c#L52

If the APL result is still needed, I will provide it next week.

@dmitryermilov
Contributor

> If the APL result is still needed, I will provide it next week.

Yes, please share them.

@fulinjie
Contributor Author

On APL,

  1. ffmpeg -hwaccel qsv -c:v h264_qsv -i Nature.h264 -vf scale_qsv=format=rgb32,hwdownload,format=rgb32 -f rawvideo /dev/null

frame= 103 fps=1.4 q=-0.0 Lsize= 3337200kB time=00:00:01.81 bitrate=15033598.0kbits/s speed=0.0239x

  2. ffmpeg -hwaccel qsv -c:v h264_qsv -i Nature.h264 -vf scale_qsv=format=rgb32,hwdownload,format=rgb32 -f null -

frame= 1585 fps= 57 q=-0.0 Lsize=N/A time=00:00:26.54 bitrate=N/A speed=0.956x

@dmitryermilov
Contributor

dmitryermilov commented Jul 28, 2019

Thanks for the quick reply!
Well, the figures above confirm my concern: yes, we can eliminate MSDK's internal memcopy and return the pointers obtained from vaMapBuffer to the application (ffmpeg is the app for MSDK in this case), but access to this 'mapped' memory is extremely slow! Do you see my point? The access may happen inside MSDK or on the app side, but either way it will exist somewhere in a real usage case.
Running with "-f null -" may suggest the performance is 20 times higher, but it's a false impression.
Accessing this memory from a downstream component (whatever it is: file writer, sw encoder, sw screen renderer) will kill that performance.

Although, I agree that the KBL results are a point for discussion. I mean, on systems where the gtt map works fast, an additional memcopy might indeed be a bottleneck.
Actually, a few months ago we had several discussions about this. We came to the conclusion that we need to make libva map into system memory provided by the app. Currently libva doesn't provide such an API. Somehow we abandoned further discussion of the libva proposal for a while..

@fulinjie
Contributor Author

> Well, the figures above confirm my concern: yes, we can eliminate MSDK's internal memcopy and return the pointers obtained from vaMapBuffer to the application (ffmpeg is the app for MSDK in this case), but access to this 'mapped' memory is extremely slow! Do you see my point? The access may happen inside MSDK or on the app side, but either way it will exist somewhere in a real usage case.

Got your point.
One concern: since the "mapped memory" is accessible, the slow access itself should be treated as an APL (Atom) issue and fixed in the media driver, like intel/media-driver#620.
I think @FurongZhang is working on this.

> Running with "-f null -" may suggest the performance is 20 times higher, but it's a false impression.
> Accessing this memory from a downstream component (whatever it is: file writer, sw encoder, sw screen renderer) will kill that performance.

Is there the same issue (false impression) in sample_decode?
Customers may take sample_decode as the benchmark and require matching performance from ffmpeg-qsv.

> Although, I agree that the KBL results are a point for discussion. I mean, on systems where the gtt map works fast, an additional memcopy might indeed be a bottleneck.
> Actually, a few months ago we had several discussions about this. We came to the conclusion that we need to make libva map into system memory provided by the app. Currently libva doesn't provide such an API. Somehow we abandoned further discussion of the libva proposal for a while..

Is there an issue/PR filed on this in the libva repo or somewhere else?
We may need to track it.

From the above discussion:

  1. if direct map access is slow, gpu copy gives the best performance;
  2. if direct map access is fast, using the mapped memory and avoiding the additional memcopy gives the best performance.

So IMHO a query may also be needed in libva for whether data of a specific format can be accessed quickly via direct mapping on a specific platform.
Hi @XinfengZhang , what do you think of these two API requirements?
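
The two requests above (mapping into app-provided system memory, plus a fast-mappability query) could look roughly like the following as libva entry points. These signatures are purely hypothetical sketches for discussion; nothing like them exists in libva today:

```c
/* Hypothetical libva additions, sketched for this discussion only. */

/* Request 1: map a surface directly into caller-provided system memory. */
VAStatus vaMapSurfaceToUserPtr(VADisplay dpy, VASurfaceID surface,
                               void *user_ptr, size_t size);

/* Request 2: query whether direct CPU mapping of this format is fast
 * on the current platform (vs. needing a gpu copy). */
VAStatus vaQueryDirectMapPerf(VADisplay dpy, unsigned int fourcc,
                              int *is_fast_out);
```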

@dmitryermilov
Contributor

dmitryermilov commented Jul 29, 2019

> Is there the same issue (false impression) in sample_decode?

In the case of system memory output from MSDK, MSDK "touches" this slow gtt-mapped memory in copyVideoToSys.

> Is there an issue/PR filed on this in the libva repo or somewhere else?
> We may need to track it.

Not yet..

> From the above discussion:
>
>   1. if direct map access is slow, gpu copy gives the best performance;
>   2. if direct map access is fast, using the mapped memory and avoiding the additional memcopy gives the best performance.
>
> So IMHO a query may also be needed in libva for whether data of a specific format can be accessed quickly via direct mapping on a specific platform.
> Hi @XinfengZhang , what do you think of these two API requirements?

I fully agree with you and am ready to take part in this proposal discussion. It would still be great to understand why GPUCopy doesn't work in this case. @fulinjie , do you have the bandwidth to investigate it?

BTW, @fulinjie , JFYI: intel/media-driver#623 . We should probably check whether the same change can improve system memory access on BXT.

@fulinjie
Contributor Author

> I fully agree with you and am ready to take part in this proposal discussion. It would still be great to understand why GPUCopy doesn't work in this case. @fulinjie , do you have the bandwidth to investigate it?

Sure.
GPUCopy takes effect for both ffmpeg and sample_decode, improving the "actual access" performance from 1.4/3.4 fps to 24/30 fps.

There may be some "false impression" in the previous sample_decode test (the 59 fps result).
For ./sample_decode h264 -i Nature.h264 -vaapi -rgb4 -o /dev/null, the performance on APL is actually 4.5 fps.

> BTW, @fulinjie , JFYI: intel/media-driver#623 . We should probably check whether the same change can improve system memory access on BXT.

Thanks a lot; will work on the framework side to support better performance.

@dmitryermilov
Contributor

> Sure.
> GPUCopy takes effect for both ffmpeg and sample_decode, improving the "actual access" performance from 1.4/3.4 fps to 24/30 fps.

So it works?! Great! From the conversation above I had originally gotten the impression that it doesn't work on BXT.

> There may be some "false impression" in the previous sample_decode test (the 59 fps result).
> For ./sample_decode h264 -i Nature.h264 -vaapi -rgb4 -o /dev/null, the performance on APL is actually 4.5 fps.

Right. That's because the output memory from MSDK is video memory; the app accesses it externally (not inside MSDK), so GPUCopy isn't involved.

@daleksan
Contributor

@fulinjie can we close this one?

@fulinjie
Contributor Author

@daleksan , this can be closed at this stage, since the query in libva is needed first. Meanwhile, we can use hwmap=mode=direct explicitly to force deriving the data instead of a sw copy.
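
For reference, the hwmap approach mentioned above would look something like this as an ffmpeg command line. The hwmap filter and its mode=direct flag are real ffmpeg features, but this exact chain is an untested sketch and may need adjusting for a given pipeline:

```shell
# Sketch: map the VPP output directly instead of copying it with hwdownload.
ffmpeg -hwaccel qsv -c:v h264_qsv -i Nature.h264 \
       -vf 'scale_qsv=format=rgb32,hwmap=mode=read+direct,format=rgb32' \
       -f null -
```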
