
frame->sequence vmaf score aggregation may bias "easy" content too much #20

Closed

rbultje opened this issue Sep 8, 2016 · 5 comments

rbultje commented Sep 8, 2016

I'll try to explain what I did and hopefully it will make sense. I took a bunch of the Netflix-provided "4K" files from xiph's website (http://media.xiph.org/video/derf/). For the purposes of this discussion, I'll focus on the example where I took DinnerScene and FoodMarket2. I downsampled them to 360p, 8-bit, 100 frames using ffmpeg, and then encoded them using my own VP9 encoder (http://www.twoorioles.com/eve-vp9-video-encoder) at fixed base quantizers (DinnerScene: 40, 90, 140, 190; FoodMarket2: every even quantizer between 40 and 254). "Base quantizer" here means that "normal" P-frames are encoded using that quantizer (or some value roughly around it), and "ARF frames" are encoded at a lower quantizer determined by all kinds of things in the encoder. I don't think that's very relevant, nor is it relevant that this is VP9; you could probably reproduce this using x264 or libvpx also. An example of file sizes for these 100-frame, 360p sequences:

21523 res/DinnerScene.q140.ivf
9510 res/DinnerScene.q190.ivf
118710 res/DinnerScene.q40.ivf
46032 res/DinnerScene.q90.ivf
362298 res/FoodMarket2.q140.ivf
123366 res/FoodMarket2.q190.ivf
2222365 res/FoodMarket2.q40.ivf
922392 res/FoodMarket2.q90.ivf

As for quantizers in P-frames: I confirmed that the quantizer of the second frame in each sequence above is 39, 90, 141, 190 for DinnerScene and 40, 90, 141, 193 for FoodMarket2. I hope that explains my input data.

Next, I simply concatenated the decoded sequences for every combination of DinnerScene+FoodMarket2 (where the base quantizer of FoodMarket2 is equal to or greater than the base quantizer of DinnerScene) and measured the sequence VMAF (the last line "Aggregate: [..] VMAF_score:([0-9.]+)" from your tools) and PSNR, both "geometric" and "arithmetic". I use libvpx terminology here: "average" avg_psnr = avg(sse_to_psnr(frame_sse)) and "global" glb_psnr = sse_to_psnr(avg(frame_sse)), where sse_to_psnr is the typical formula from Wikipedia. For file size, I simply used the sum of the file sizes of the two input IVF files. This is not exact (since there are two IVF file headers), but it's close enough. Together, this allows me to plot a typical bitrate vs. quality curve. Note that I also plot VMAF logarithmically (basically 10*log10(100.0 / (100.0 - vmaf_score))); I understand that this is probably strange, but it slightly increases resolution near the 100.0 score in the graphs.
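For clarity, here is a minimal sketch of the three aggregations, assuming per-frame SSE values and an aggregate VMAF score are already available (the placeholder values and the exact helper signature are mine, not from libvpx or the VMAF tools):

```python
import numpy as np

def sse_to_psnr(sse, num_samples, peak=255.0):
    # Standard PSNR from a summed squared error over num_samples samples.
    mse = sse / num_samples
    return 10.0 * np.log10(peak * peak / mse)

frame_sse = np.array([1.2e5, 3.4e5, 2.2e5])  # placeholder per-frame SSE values
n = 640 * 360                                # luma samples per 360p frame

# "average" (arithmetic) PSNR: mean of the per-frame PSNRs
avg_psnr = np.mean([sse_to_psnr(s, n) for s in frame_sse])
# "global" PSNR: PSNR computed from the mean per-frame SSE
glb_psnr = sse_to_psnr(np.mean(frame_sse), n)

# logarithmic VMAF as plotted below: expands resolution near a score of 100
vmaf_score = 94.3  # placeholder aggregate score
lvmaf = 10.0 * np.log10(100.0 / (100.0 - vmaf_score))
```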

[Three plots: combined bitrate vs. glb PSNR, avg PSNR, and logarithmic VMAF for the DinnerScene+FoodMarket2 combinations]

So, in these graphs, each dot is a combination. It's easiest to read each graph as 4 lines, where each line is a combination of DinnerScene at q=40/90/140/190 and FoodMarket2 at q=dinnerscene_q+factor, where factor is an even number between 0 and 120. The leftmost dot in each line ("largest file") has factor=0 and the rightmost dot per line ("smallest file") has factor=120. The factor is the delta quantizer between the two files. Let's first look at glb PSNR and avg PSNR. It's no surprise that glb PSNR biases towards a higher relative weight for hard content, and avg PSNR biases towards a higher relative weight for easy content. This is well established and mathematically easy to demonstrate. I'm not making any statements on which one is better; they are just there to make a point.

For example, according to glb PSNR, the x value of the top line's factor=56 point is almost identical to the x value of the second line's factor=0 point, but the second line's y value is higher. So at the same combined average bitrate (1.16 Mbps), glb PSNR (slightly) prefers dinnerscene at q=90 and foodmarket2 at q=90 over dinnerscene at q=40 and foodmarket2 at q=96.

Likewise, according to avg PSNR, the top line's x value at factor=120 is identical to the second line's x value at factor=58 and the third line's at factor=4. However, the top point's y value is slightly better than the second point's, and both are much, much better than the third point's. Therefore, at the same combined average bitrate (426/431 kbps), avg PSNR prefers dinnerscene at q=40 and foodmarket2 at q=160 over dinnerscene at q=90 and foodmarket2 at q=148, and both are much preferred over dinnerscene at q=140 and foodmarket2 at q=144.

Now, let's look at the third graph, lVMAF (logarithmic VMAF). This graph looks a lot more like avg PSNR than glb PSNR, but not exactly, because we at least have intersection points, which the avg PSNR graph didn't have. This allows us to compare versions of the same sequence (with fragments encoded at different base quantizers) that have the same file size as well as the same aggregate VMAF score. I chose the point on the top line at factor=116 (dinnerscene q=40 and foodmarket2 q=158) and the point on the second line at factor=54 (dinnerscene q=90 and foodmarket2 q=144), which give a combined bitrate of 454-457 kbps and nearly identical VMAF scores of 94.317 and 94.34 (logarithmic VMAF = 12.454/12.472). I'd like to attach these files here, but each of them (losslessly re-compressed) is 20 MB and the bug tracker doesn't like that (10 MB/file limit). I suspect you can easily re-create these files and the experiment yourself, or I can upload them somewhere else where 20 MB per file for 2 files is not an issue.

The point is that when I visually look at the differences between the two files, I don't think I agree. Here's the last frame of dinnerscene at q=40 and q=90:

[Image: dinnerscene, last frame, q=40]

[Image: dinnerscene, last frame, q=90]

And here's the 4th frame of foodmarket2 at q=158 and q=144:

[Image: foodmarket2, 4th frame, q=158]

[Image: foodmarket2, 4th frame, q=144]

To me, visually, the differences between the second set of images (q=144 vs. q=158) are much greater than between the first set of images (q=40 vs. q=90), and thus the combination of q=90 + q=144 should be preferred (possibly by a pretty big margin) over the combination of q=40 + q=158. The fact that it keeps the relative quality of the two sequences closer together is a bonus (to me, psychologically, it would seem that "good" quality in one segment does not erase my recent memory of "horrible" quality in another), but I'm not enough of an expert in this field to make a case for whether that's relevant or not.

This is one example, obviously. When you go to extreme cases (try ToddlerFountain+DinnerScene), you get seemingly crazy peaks where the scores claim that DinnerScene at q=40 and ToddlerFountain at q=180 (factor=140) is about the same as DinnerScene at q=90 and ToddlerFountain at q=176 (factor=86), which in turn is a lot better than DinnerScene at q=140 and ToddlerFountain at q=174 (factor=34), all at the same combined bitrate of about 900 kbps. At some level this seems almost obvious, since what could possibly be the visual difference between ToddlerFountain at q=180, 176 or 174? Whereas the visual difference between q=140 and q=40 should be pretty significant, right? But it seems to call for seemingly absurd (?) quality variations between easy and hard content. The optimal factor (i.e. the one giving the optimal VMAF score) in this case would likely be around 115 (or at least somewhere roughly in the middle of 140 and 86), which means an effective quantizer ratio of about 10x, since the effective quantizer in VP9 increases by a factor of 1.02 per index, and pow(1.02, 115) = 9.75.

Example plot of logarithmic VMAF scores for dinnerscene at q=40, 90, 140, 190 and toddlerfountain at q=dinnerscene_q+factor, where factor runs over all even numbers between 0 and 200:

[Plot: combined bitrate vs. logarithmic VMAF for the DinnerScene+ToddlerFountain combinations]

I guess I have an intrinsic bias in me that wants to believe that larger relative quality differences would have a negative impact on the overall viewing experience, but I'm not an expert in this field, so I don't know if that makes sense or not. Beyond the examples given here, I've looked at some other examples, and I feel that what I presented here for one example is generally applicable to VMAF, i.e. that it may be biased a little too much towards higher overall scores because of high frame scores on easy content, regardless of the scores on hard content. I hope that makes sense as a "potential bug" report.


li-zhi commented Sep 10, 2016

Hi Ronald,

Your issue touches on one of the open questions for VMAF: how to pool the per-frame scores into a final score. I have a similar intrinsic bias to yours, and I think it is well warranted according to many psycho-visual studies. Currently we report the mean per-frame score as the aggregate score, just for its simplicity. But you can very easily change it to something else that weighs bad-quality frames more heavily.

One option is to use the harmonic mean, as advocated by Ioannis. Under https://github.com/Netflix/vmaf/blob/master/python/core/result.py, there is a field score_aggregate_method for the BasicResult class. Currently it defaults to np.mean; you can simply change it to ListStats.harmonic_mean. Or you can use set_score_aggregate_method() (follow the example in https://github.com/Netflix/vmaf/blob/master/python/script/run_vmaf_cross_validation.py).

Another option is to use an Lp norm with p < 1.0. For example, you can replace np.mean with partial(ListStats.lp_norm, p=0.5).

A third option is to first convert VMAF into a 'distortion' measure (for example, by letting dst = 2^(-VMAF)), and then pool dst using an Lp norm with p > 1 (e.g. p = 2, 3, or 4).
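If you want to experiment with the math outside the library first, the three options boil down to something like the following sketch (plain numpy on placeholder per-frame scores; note that ListStats.lp_norm's exact normalization may differ from the power mean used here, and mapping the pooled distortion back to a VMAF-like number via -log2 is only my addition for comparability):

```python
import numpy as np

# Per-frame VMAF scores, e.g. parsed from run_vmaf.py output (placeholder values, all > 0).
scores = np.array([97.2, 95.8, 61.3, 88.0, 99.1])

# Option 1: harmonic mean -- low-scoring frames pull the aggregate down hard.
harmonic_mean = len(scores) / np.sum(1.0 / scores)

# Option 2: power mean with p < 1 (in the spirit of ListStats.lp_norm with p=0.5).
p = 0.5
lp_pooled = np.mean(scores ** p) ** (1.0 / p)

# Option 3: convert to a 'distortion' dst = 2^(-VMAF), pool with p > 1,
# then map back to a VMAF-like number for comparison.
dst = 2.0 ** (-scores)
p2 = 2.0
dst_pooled = np.mean(dst ** p2) ** (1.0 / p2)
vmaf_from_dst = -np.log2(dst_pooled)

print(np.mean(scores), harmonic_mean, lp_pooled, vmaf_from_dst)
```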

Try these options, and let us know what you think.

Best,
Zhi


pavan4 commented Oct 5, 2016

It is an interesting point.

I worked with different metrics for aggregating the per-frame scores into a final score as part of my dissertation research.

This method performed relatively well (based on experimental observations in our research) compared to the mean: we aggregated the per-frame scores on a per-scene basis (mean), and then took the mean of the scene means over the worst-performing 10% of scenes. The reason is that during our subjective quality experiments we observed that people pivoted their scores based on the worst-performing scenes. This might not be the optimal strategy, but it definitely gave better results than the mean over the entire score set.

Might be worth trying for you.
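A minimal sketch of that scheme, assuming you already have per-frame scores and some scene segmentation (both the scene-cut detection and the exact percentile handling are left open):

```python
import numpy as np

def worst_scenes_pooling(frame_scores, scene_ids, worst_fraction=0.10):
    """Mean of per-scene mean scores, restricted to the worst-performing scenes."""
    frame_scores = np.asarray(frame_scores, dtype=float)
    scene_ids = np.asarray(scene_ids)
    scene_means = np.array([frame_scores[scene_ids == s].mean()
                            for s in np.unique(scene_ids)])
    scene_means.sort()  # ascending, so the worst scenes come first
    k = max(1, int(np.ceil(worst_fraction * len(scene_means))))
    return scene_means[:k].mean()

# frame_scores could come from run_vmaf.py; scene_ids from any scene-cut detector.
```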


li-zhi commented Oct 10, 2016

@pavan4 Thanks for the recommendation. Your method is similar to percentile-based pooling, where the aggregate score is taken at, say, the 10th percentile of the distribution. Personally I also find this type of approach more robust than merely taking the mean.
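For reference, percentile-based pooling on the raw per-frame scores is essentially a one-liner (the 10th percentile here is just an example cutoff):

```python
import numpy as np

def percentile_pooling(frame_scores, pct=10):
    # Aggregate score taken at the pct-th percentile of the per-frame distribution.
    return float(np.percentile(np.asarray(frame_scores, dtype=float), pct))
```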


rbultje commented Nov 8, 2016

[Plot: same experiment as the third chart in the original post, but pooled with harmonic_mean]

Here's an example of what I get (this is otherwise the same experiment as the third chart in my original post) when using harmonic_mean instead of mean for averaging. What's most significant is the reduction of the wide variation at the low-to-mid end of the graph, so from a practical perspective, my immediate issue is resolved when using harmonic_mean.

Something that might be useful is a way for run_vmaf.py to accept multiple --pool arguments, so we don't have to run it twice to compare pooling mechanisms.

@li-zhi
Copy link
Contributor

li-zhi commented Nov 10, 2016

@rbultje: I intend to keep the interface of run_vmaf.py simple for basic usage. It is actually quite easy to regenerate results using multiple pooling mechanisms by running it just once, since the per-frame scores are given. The exact formulas for harmonic_mean and the other methods can be found at:
https://github.com/Netflix/vmaf/blob/master/python/tools/stats.py
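For example, after one run that dumps the per-frame scores, several poolings can be compared offline. In this sketch the output flag and the JSON layout are assumptions, not the documented interface; adapt the parsing to whatever format you actually get:

```python
import json
import numpy as np

# Assumed to have been produced once by something like:
#   run_vmaf.py yuv420p 640 360 ref.yuv dis.yuv --out-fmt json > result.json
# (flag name and JSON layout are assumptions)
with open('result.json') as f:
    result = json.load(f)
scores = np.array([frame['VMAF_score'] for frame in result['frames']], dtype=float)

poolings = {
    'mean': np.mean(scores),
    'harmonic_mean': len(scores) / np.sum(1.0 / scores),
    '10th_percentile': np.percentile(scores, 10),
}
for name, value in sorted(poolings.items()):
    print('%s: %.3f' % (name, value))
```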

li-zhi closed this as completed Apr 4, 2018