
NUMA for high core count CPUs. #1142

Closed
bradleysepos opened this issue Jan 23, 2018 · 22 comments · Fixed by #1824

bradleysepos (Contributor) commented Jan 23, 2018

Currently, we disable NUMA when building x265, e.g. https://github.com/HandBrake/HandBrake/blob/master/contrib/x265_8bit/module.defs#L17. With high core count CPUs on the consumer market, we probably should revisit this for performance.

@jstebbins Any thoughts?

mqudsi commented Jan 23, 2018

Sorry for the mess in the previous thread, I was going to offer to re-create.

I note that NUMA isn't explicitly disabled for x264. Is it disabled by default? I see the same behavior with both x264 and x265.

bradleysepos (Contributor, Author) commented Jan 23, 2018

x264 simply does not scale past 6-8 cores. Given HD, slow enough encoder settings, and HandBrake's filters, it's generally possible to saturate 12-16 cores.

x265 can saturate more on its own but may require NUMA on high core count systems.

bradleysepos (Contributor, Author) commented Jan 23, 2018

As a workaround, you can launch more than one instance of HandBrake to run multiple jobs in parallel.

mqudsi commented Jan 23, 2018

Yes, I considered that as well, but the similarity of the performance profiles for both is causing me to question that assumption.

This is with all user-selectable filters disabled, of course, and HD content. I can try at placebo speed and see if that does the trick, but I see the same behavior with both x264 and x265: there's a very clear delineation between the load on cores 0-8 and the load on cores 9-15 (though the FPS encode rate obviously differs greatly between the two).

mqudsi commented Jan 23, 2018

To further clarify: if it were simply a matter of there not being enough source data to work with, or of general limitations of the encoding algorithm in question, I would expect the thread utilization for 16 identically instantiated threads in a pool to average out to more or less the same suboptimal value over time. Instead, I am seeing 8 constantly fully loaded threads and 8 that just never see the same workload (quite apart from seeing the same on both x264 and x265).

bradleysepos (Contributor, Author) commented Jan 23, 2018

It certainly correlates. 🤷🏻‍♂️ I'm fairly certain the cause is different.

mqudsi commented Jan 23, 2018

FWIW, simply specifying ENABLE_LIBNUMA=ON and patching the build to link against libnuma.a breaks the thread pool:

x265 [info]: HEVC encoder version 2.6
x265 [info]: build info [Linux][clang 6.0.0][64 bit] 8bit+10bit+12bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
x265 [info]: Main profile, Level-3.1 (Main tier)
x265 [warning]: No thread pool allocated, --wpp disabled
x265 [warning]: No thread pool allocated, --lookahead-slices disabled

maximd33 (Contributor) commented Jan 23, 2018

NUMA support should not increase the number of threads used, but it does influence thread/memory distribution:
https://bitbucket.org/multicoreware/x265/commits/62b8fe990df5e560834e9b567a913238d8dd398e

mqudsi commented Jan 23, 2018

Thanks for that link, @maximd33

I understand very well how NUMA works (though I haven't dealt with libnuma before). In my case, though, enabling NUMA completely disabled the thread pool.

I think the link you mentioned might hold the clue:

except --threads N is now --pool N

It seems that a different parameter might be needed when calling x265 linked against libnuma (if the change to the CLI argument reflects actual API changes).
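For illustration, a minimal sketch of the renamed option, building argument lists only. The `--pool` spelling follows the commit linked above, and the per-NUMA-node count syntax ("8,8" meaning eight threads on each of two nodes) follows the x265 documentation for NUMA-enabled builds; exact flag spellings may vary by x265 version:

```python
# Sketch (assumption): argument-list construction only, nothing is
# executed. Flag names follow the x265 commit/docs referenced above
# and may differ between versions.

def x265_cmd(infile, outfile, pool_spec):
    """Build an x265 invocation using the newer --pool argument
    (formerly --threads) for NUMA-aware thread pool sizing."""
    return [
        "x265",
        "--input", infile,
        "--pool", str(pool_spec),  # e.g. "16", or "8,8" per NUMA node
        "--wpp",                   # wavefront parallelism needs a pool
        outfile,
    ]

cmd = x265_cmd("in.y4m", "out.hevc", "8,8")
```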

maximd33 (Contributor) commented Jan 23, 2018

AndreiPukrov commented Feb 13, 2018

Hm, I would support that, as I think it would then fix my issue reported here. If needed, I can offer to run some tests on a dual-CPU system.

Running two instances of HandBrake works if you have two movies. However, I often have only one to work with (as do, I think, 80% of other users), and having HandBrake use the full power of the computer would be the preferred option.

maximd33 (Contributor) commented Feb 17, 2018

@AndreiPukrov can you try x265 in command-line mode and check whether more cores help?

Hydrad commented Jun 5, 2018

I have a dual Xeon 6146, with 12 cores each. I was converting several 12 GB files using x265 10-bit in HandBrake 1.1.0. A single instance of HandBrake seemed to be trying to use all 24 cores. When I limited the affinity to just 6 cores, the CPU utilization went up from 7 to 11 percent, and all 6 active cores were running between 85 and 100%. I then started 3 instances of HandBrake and let them all use all cores; total CPU utilization averaged 38%. I then limited each of the three instances to 6 cores each, with no overlapping cores. One CPU then had all cores running 85 to 100%; the other CPU had just 6 cores, for the third instance, running in that same 85-100% range. Total CPU utilization for the three instances, 6 cores each, averaged 61%. That is about a 60% increase in CPU utilization, which I think translates into a 60% increase in compute power for each instance.

You would definitely be better served to limit the number of cores that HandBrake uses. Whether 6 cores is the optimum remains to be seen. I have an overclocked i7-8700K with 6 cores. On that box, HandBrake typically pegs all 6 cores at 100% almost continually during the run. That machine has a faster hard drive, so maybe it is better able to feed the CPUs. It must also have better cooling, because the Xeon box was getting pretty close to TJMax, averaging 86-92 C using the stock CPU coolers. The 8700K did not have any thermal problems.
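On Linux, the manual affinity experiment above can be sketched as follows; the `taskset` wrapping and `HandBrakeCLI` invocation are illustrative assumptions (on Windows, affinity would instead be set through Task Manager or `start /affinity`):

```python
# Sketch (assumptions): Linux-only, illustrative. Partition logical
# cores into disjoint groups and pin one HandBrakeCLI instance to each
# group, mirroring the manual affinity experiment described above.
# Commands are only built here, not run.

def partition_cores(total_cores, group_size):
    """Split cores 0..total_cores-1 into disjoint contiguous groups."""
    return [list(range(i, min(i + group_size, total_cores)))
            for i in range(0, total_cores, group_size)]

def pinned_cmd(cores, infile, outfile):
    """Wrap a HandBrakeCLI job in taskset to pin it to the given cores."""
    cpu_list = ",".join(str(c) for c in cores)
    return ["taskset", "-c", cpu_list,
            "HandBrakeCLI", "-i", infile, "-o", outfile]

groups = partition_cores(24, 6)  # dual 12-core Xeon -> four 6-core groups
cmds = [pinned_cmd(g, f"in{i}.mkv", f"out{i}.mkv")
        for i, g in enumerate(groups)]
```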

mqudsi commented Aug 1, 2018

I have a solution that, at least for myself, has no disadvantages. MP4 and MKV both support lossless concatenation at the stream level. I split the input into n equally sized streams at keyframes (where n is the number of virtual cores), run the jobs, and concatenate the results.

The only concerns are player support (which should be fine) and the container overhead of having multiple streams (which is minimal). There are also the obvious issues with H.265 dynamic bitrate adaptation and CRF calculation, but for large enough videos these should level off at an average early on.

For HandBrake jobs that do not use horribly performing filters (I'm looking at you, EEDI2 bob, and you, KNLMeans), this could be a very good default option to offer.
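The split step can be sketched as choosing n-1 boundaries from the keyframe timestamps. This assumes the keyframe times have already been extracted (e.g. with `ffprobe -skip_frame nokey -show_frames`); the nearest-keyframe heuristic and helper name are my own illustration, not an ffmpeg or HandBrake API:

```python
# Sketch (assumptions): keyframe timestamps are given as a list of
# seconds; we pick the keyframe closest to each ideal boundary
# duration*i/n so each of n workers gets a roughly equal span.

def split_points(keyframes, duration, n):
    """Return n-1 split points, each the keyframe nearest to an
    ideal equally spaced boundary."""
    return [min(keyframes, key=lambda t: abs(t - duration * i / n))
            for i in range(1, n)]

# Keyframes every 2 s in a 60 s clip, split for 4 workers:
kf = [t * 2.0 for t in range(31)]
points = split_points(kf, 60.0, 4)  # boundaries near 15 s, 30 s, 45 s
```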

bradleysepos (Contributor, Author) commented Sep 4, 2018

@jstebbins I'm interested in revisiting this but don't have access to a multi-socket system. It seems that on Linux we can simply document a dependency on libnuma-dev?

bradleysepos added this to the Unscheduled milestone Sep 4, 2018
juanpc2018 commented Sep 6, 2018

What do you mean, "x264 simply does not scale past 6-8 cores"?
I'm using H.264 and it works well: near 100% on all cores (over 95%, depending on the audio channel/settings), encoding Full HD 24 fps from 30 Mbps to 8 Mbps, audio passthru, no subtitles, no chapters (only 1).
32 cores, dual-CPU AMD Opteron 6386,
HandBrake 1.1.1,
Windows 8.1 x64.

The problem may be Power Options; it needs to be set to minimum CPU usage 99%, "High Performance".

I had another problem with HandBrake at first: I was mixing different CPUs,
from the 6200 and 6300 series, or different 6200s or different 6300s, in asymmetric configurations,
16+4, 16+8, etc...
16+16 but different series (6200 + 6300) and different speeds (6276 + 6386).
The BIOS is not designed for asymmetric configurations...

Some apps/software crashed immediately, usually software that requires or installs the Visual C++ x86 Redistributable, but Windows 8.1 x64 boots perfectly.
True 100% 64-bit software does not crash...
For example:
Chrome and Firefox crashed,
but Opera and IE11 worked OK.
DaVinci Resolve 12.5 worked OK, if I remember correctly.

Maybe it was something to do with the 4 GB limit,
or maybe 64-bit CPUs have 32-bit emulation in hardware.
For example:
WinRAR's built-in benchmark with symmetric CPUs should work OK, but does not; all cores at 50%, as if there were only one CPU.

Opteron 6000 does not have NUMA or SMT/HT.

bradleysepos (Contributor, Author) commented Sep 6, 2018

what do you mean "x264 simply does not scale past 6-8 cores. "

Exactly what it says, and what you confirmed by "depends on the audio channel/settings".

In normal usage, x264 doesn't scale linearly past about that many cores. The decoder workload and other tasks, such as filters and audio encoding, can saturate more cores.

With respect, I don't think your comments are adding much to the discussion here...

juanpc2018 commented Sep 6, 2018

Maybe it's a language barrier...
I really don't understand what you're saying... LOL Jajajaja
https://en.wikipedia.org/wiki/Amdahl%27s_law

Audio encoders are very different, and they interrupt the OS to ask for CPU time, lowering CPU load.

That's the problem with the x86_64 architecture: it needs lots of very fast interrupts to fake multitasking, and each interrupt adds latency.
All OSes today are fake multitask.
A real-time multitasking OS needs 1024 cores...
Not even the fastest dual CPU in the world, an overclocked 64-core/128-thread, can run software in real time.
https://youtu.be/sTVyE37uglc?t=14m47s

The x86_64 architecture is too complex;
that's why RISC could displace x86_64 in the future.
Today, 96-core ARM CPUs from Cavium/Gigabyte (ThunderX) are almost the same as Intel in performance.
A 1000-core CPU is possible with a RISC architecture = real-time OS, true multitask.

jstebbins added a commit to jstebbins/HandBrake that referenced this issue Jan 20, 2019
Threadripper and other modern CPUs are now multi-core modules that
benefit from having NUMA available.

Adds a dependency for libnuma.

Fixes HandBrake#1142
bradleysepos pushed a commit to jstebbins/HandBrake that referenced this issue Apr 4, 2019
bradleysepos removed this from the Unscheduled milestone Aug 5, 2019
bradleysepos added this to the 1.3.0 milestone Aug 5, 2019
vinas1 commented Nov 29, 2019

> I have a dual Xeon 6146, with 12 cores each. [...] You would definitely be better served to limit the number of cores that HandBrake uses. (quoting @Hydrad, Jun 5, 2018)

With a 48-thread Threadripper, x264 1080p (veryslow), a single-instance job loads 12 threads to 100% and another separate 12 to 55%; total usage is around 60% CPU. The remaining 24 logical processors are around 15-25% loaded. Things seem to be working much better in the 1.3.0 release on HEDT machines. Launching separate instances for better density may no longer be necessary.

marshalleq commented Dec 9, 2019

It is still not clear to me whether, or what, I should do about NUMA for HandBrake. I, like a gazillion others, now have this issue because AMD's new processors include NUMA-style chiplets even in the desktop range (albeit high-end desktop); then there's Threadripper. AMD seems to state that a lot of this is handled by the BIOS, but that seems contradictory. I can certainly see that a single encode in HandBrake on my 1950X Threadripper runs across both NUMA nodes, so it still does not seem to be NUMA aware.

sr55 (Contributor) commented Dec 9, 2019

Threadripper 3 and all desktop Ryzen parts do not have NUMA zones, so they are unaffected.

Older Threadrippers do. In HandBrake, most of the encoders are not NUMA aware and probably won't be made so. x265 is NUMA aware, and this is enabled in HandBrake. That said, it won't necessarily scale to 32 threads anyway. Depending on the source, settings, etc., it could top out at only 6-8, so you wouldn't know whether it was working correctly or not.

mqudsi commented Dec 9, 2019

I'm going to just repeat my previous comment: HB should bypass the issue of individual codec NUMA awareness altogether. Split all input (larger than a certain threshold in minutes/bytes) into n chunks at keyframes for n cores, hand a chunk to each core (don't worry about work stealing), then concatenate the results at the stream level (not file concatenation) where supported (x264/x265/VP8/VP9/AV1 at the very least), or else at the container level (supported by MP4 and MKV, I believe). Unless there are specific technical issues with this (implementation or result/quality/compatibility), you can easily reach 100% utilization of all cores and achieve enormous throughput improvements.

I have done this locally in my ffmpeg pipeline (I no longer use HB for this reason) and the results are amazing.
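The final stitching step described above can be sketched with ffmpeg's concat demuxer (`-f concat` with `-c copy`), which rejoins segments at the stream level without re-encoding; file names here are illustrative:

```python
# Sketch (assumptions): file names are illustrative and the command is
# only built, not executed. ffmpeg's concat demuxer reads a list file
# of "file '<name>'" lines and, with -c copy, concatenates the
# segments without re-encoding.

def concat_list(chunk_files):
    """Contents of the list file the concat demuxer reads."""
    return "".join(f"file '{name}'\n" for name in chunk_files)

def concat_cmd(list_path, outfile):
    """ffmpeg invocation that stitches the listed chunks together."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", outfile]

listing = concat_list(["chunk0.mkv", "chunk1.mkv", "chunk2.mkv"])
join = concat_cmd("chunks.txt", "joined.mkv")
```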

sr55 (Contributor) commented Dec 9, 2019

Splitting at keyframes isn't necessarily the best idea. While it may not always be noticeable, it can lead to subtle output artefacts. I've only ever typically noticed this with animation or learning/presentation style material myself, but in theory, if you look closely, you may see it in other content too.

HandBrake passes full, uncompressed frames to the encoders. The encoders make decisions based on previous and future frames, which is why you can get some oddity at the split points. If you know exactly where to cut the content, it can work well, but it's not something you can trivially and reliably automate.

Regardless, HandBrake's engine doesn't suit that kind of workflow, so it's not something we'd implement.

HandBrake does support multiple GUIs/encodes running at the same time, and in the future we'll likely add support for running multiple encodes in one GUI, allowing for a better UX.

mqudsi commented Dec 9, 2019

If the raw, decoded output of sections split at keyframes is concatenated and then identically re-encoded with hard-coded settings, it should be possible to get automated, reliable results. There are definitely things that need to be controlled for, though: while variable bitrate is fine, dynamic settings dependent on content (e.g. anamorphic aspect ratio) should ideally be disabled, precalculated, or synced (this gets complicated, but it can be done by running one job until these values are available and then starting the rest, unless doing two-pass encoding).

Audio gets neglected in these discussions, but it's imperative to pick split points at a common multiple of the audio frame duration and the video frame duration. With higher-bandwidth audio tracks it's less of an issue, but at lower bitrates audio artifacts are readily introduced. For myself, I just chunk and encode the video as a first step, then encode the audio (without chunking) and mux it into the resulting container; I believe this could theoretically introduce some A/V drift, but I haven't had any bad experiences since switching to this method a year or so ago.

The thresholds I mentioned earlier play a role here. It's obviously a trade-off between hard concatenation points (mainly affecting theoretically optimal encoding rather than user-visible artifacts) and chunked encoding, but if the chunks are long enough relative to the overall length and the aforementioned points are kept in mind, concatenation points should be (relatively) rare and far enough apart that the result is not, on the whole, adversely affected. You wouldn't want to split a twenty-second clip into 64 chunks; even if the result did not contain any artifacts, the file would probably be larger than the unchunked alternative (and unless you're encoding something like AV1, it would also take about as long) thanks to the overhead. For chunks of sufficient length, even without accounting for the variables mentioned above, there will generally be sufficient content for the hysteresis to be minimized and for each chunk to arrive at a local optimum sufficiently close to that computed by the encoders processing the other chunks.
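The common-multiple constraint on split points can be made concrete. Audio codecs encode fixed-size frames (AAC, for instance, uses 1024-sample frames), so the smallest split interval landing on both an audio-frame and a video-frame boundary is the least common multiple of the two frame durations as rationals. A sketch, where the AAC frame size, 48 kHz sample rate, and 25 fps are illustrative assumptions:

```python
# Sketch (assumptions): 1024-sample AAC frames at 48 kHz and 25 fps
# video are illustrative numbers. The smallest interval that is a
# whole number of both audio frames and video frames is the LCM of
# the two frame durations, computed over rationals.
from fractions import Fraction
from math import gcd, lcm

def fraction_lcm(a: Fraction, b: Fraction) -> Fraction:
    """LCM of two positive rationals: lcm of numerators over gcd of
    denominators (Fraction keeps values in lowest terms)."""
    return Fraction(lcm(a.numerator, b.numerator),
                    gcd(a.denominator, b.denominator))

video_frame = Fraction(1, 25)        # 25 fps -> 1/25 s per frame
audio_frame = Fraction(1024, 48000)  # AAC: 1024 samples at 48 kHz
interval = fraction_lcm(video_frame, audio_frame)  # 8/25 s = 0.32 s
```

Every split point that is a multiple of this interval falls exactly on both an audio-frame and a video-frame boundary (here, 8 video frames and 15 audio frames).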
