D5.5: Write an assembly superoptimiser supporting AVX and upcoming Intel processor extensions for the MPIR library and optimise MPIR for modern processors #118
Comments
|
That sounds great! Thanks for the report. |
Hi all, I'm just writing to let you know that our project to write a superoptimiser …
Instead there are performance counters that do this, however they are …
Unfortunately, due to what we think is a Linux kernel bug (N.B: we are …
We have spent about 6 weeks tracking down this issue, and as of this …
I'm just reporting the issue here so that it doesn't come as a surprise …
As far as I can see, no action needs to be taken by anyone, and I don't …
We'll use this ticket to keep people updated about our progress with fixing …
Note that no other ODK deliverables depend on this deliverable, so I don't …
Bill.
On 30 June 2016 at 15:16, Nicolas M. Thiéry notifications@github.com
|
Another possibility is if anyone has any recent Intel OSX machines that we …
By the way, we have verified that the various tools that Linux provides to …
We'll post more when we know more, including links to various tickets, if …
Bill.
On 17 August 2016 at 13:06, Bill Hart goodwillhart@googlemail.com wrote:
|
Interesting report--that sounds frustrating! Perhaps it's not so bad, in terms of delays. As you wrote, the work on the superoptimiser is already done, as is the work on tracking down and addressing the issues in Linux. Assuming the Linux devs acknowledge the bug and will patch it, while there may be a delay on that it's not the end of the world since no other ODK work is waiting on it. In principle the deliverable could be said to be satisfied by building one's own kernel (and that's only necessary as a temporary measure). |
Unfortunately we won't be able to run a custom kernel on our research …
We've decided at this point that we will not persist longer than the end of …
We strongly suspect the crippling of the facility may not be an accident …
Bill.
On 19 August 2016 at 12:33, Erik Bray notifications@github.com wrote:
|
Alex Kruppa just informed me that he has found a solution to the …
The solution was apparently to use a(n existing) library which programs the …
We still don't get rock solid timings. But we now have cycle counts that …
Unfortunately, a few days ago, we managed to crash one of our …
Bill.
On 19 August 2016 at 13:17, Bill Hart goodwillhart@googlemail.com wrote:
|
@ClementPernet (WP leader) and @wbhart (Kaiserslautern=lead beneficiary) |
We had about two months of delays due to the issues I reported previously. The …
On 21 November 2016 at 17:23, bpilorget notifications@github.com wrote:
|
Dear M18 deliverable leaders,
Just a reminder that reports are due for mid-February, to buy us some time for proofreading, feedback, and final submission before February 28th. See our README for details on the process. In practice, I'll be offline February 12-19, and the week right after will be pretty busy. Therefore, it would be helpful if a first draft could be available sometime this week, so that I can have a head start reviewing it. Thanks in advance! |
Currently proofreading the issue description; don't edit to avoid conflicts. |
Done with the proofreading! The issue description is fair game again. This looks very nice. I just left a few little TODO's; @wbhart can you take care of them? It would be nice as well if the github description contained a link to the upcoming blog post. Maybe you can create a stub page for this post or just guess in advance what its URL will be? And then I believe this is good to go. Yeah! |
Thanks Nicolas. I will take care of the TODOs. I can link to the blog. One
is written, another will be done today.
|
I wrote a blog for this deliverable here [1].
I still need to write a blog on the parallelisation of the FFT. I'll
probably do that on Monday, as well as tidy up those minor TODOs you've
made for me.
[1]
https://wbhart.blogspot.de/2017/02/assembly-superoptimisation-in-mpir.html
|
@wbhart this blogpost is super-great! Any pointer to more detailed information, like:
- what's the limitation of traditional rdtsc in the context of a super-optimizer?
- how did you get around the issue, any code to share if we meet the same issue?
- when doing super optimization, are you using brute force, or do you use some heuristic when exploring the possible combinations?
|
Thanks for the feedback.
The rdtsc instruction no longer gives cycle timing. We don't know why:
there are multiple theories. But it nowadays gives some kind of scaled
value, and the scaling changes.
I don't have all the details of how we precisely got around the issue,
since I didn't write the code and the person who did is no longer working
on ODK. The things I recall:
1) Turn off hyperthreading
2) Use RDPMC performance counters (this is incredibly hard to do and an
entire art in itself; they are probably not available by default on your
system)
3) The address of memory locations that are accessed for your timings
should not be the same as other memory locations accessed, modulo 4096.
This is particularly relevant for variables on the stack, which means you
need to shift the stack address to get reliable timings.
4) There is an enormous penalty for switching between SSE and AVX on some
processors (something like 70 cycles)
There are many more things. It took literally months to solve the problems.
That's all I remember.
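To make point 2) a bit more concrete, here is a minimal sketch of the kind of read sequence involved. This is illustrative only, not our actual harness: it assumes RDPMC access has already been enabled (e.g. via the Linux perf subsystem), and it assumes fixed counter 1 is the unhalted core-cycle counter, which you should verify on your own CPU.

```cpp
#include <cstdint>

// Serialize the pipeline so earlier instructions cannot leak into the
// measured region. CPUID is the classic (heavyweight) serializing instruction.
static inline void serialize()
{
    uint32_t a = 0, b, c, d;
    asm volatile("cpuid" : "+a"(a), "=b"(b), "=c"(c), "=d"(d) : : "memory");
}

// Read a fixed-function performance counter with RDPMC. Bit 30 of ECX selects
// the fixed counters; index 1 is *assumed* here to be unhalted core cycles.
// RDPMC faults unless user-mode counter access has been enabled.
static inline uint64_t rdpmc_fixed(uint32_t index)
{
    uint32_t lo, hi;
    asm volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"((1u << 30) | index));
    return ((uint64_t)hi << 32) | lo;
}

// Time one call of fn in core cycles. Pin the thread to one core first,
// since the counters are per logical CPU.
template <class F>
uint64_t cycles_for(F &&fn)
{
    serialize();
    uint64_t start = rdpmc_fixed(1);
    fn();
    serialize();
    uint64_t stop = rdpmc_fixed(1);
    return stop - start;
}
```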
We are using brute force, but with an algorithm which only examines valid
reorderings. It works out which assembly instructions commute and then uses
a clever algorithm to ensure that the tree of all combinations is pruned
carefully to only include those which are valid reorderings. I can only
refer you to the source code here, as there was not time to write up how
this is done, precisely. Sorry that I don't recall precisely how it works,
and the person who came up with it is no longer working on the project.
Alex Best might find time to explain it to you if you email him. But
remember that we aren't paying him any more.
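To give a rough idea of what "only examines valid reorderings" means, here is my own toy sketch, not the actual ajs algorithm (which prunes the tree far more cleverly): model each instruction by what it reads and writes, and enumerate only those orderings that keep every dependent pair in its original relative order.

```cpp
#include <functional>
#include <set>
#include <string>
#include <vector>

// Toy model: an instruction with the registers/memory slots it reads and writes.
struct Insn {
    std::string text;
    std::set<std::string> reads, writes;
};

// Two instructions must keep their relative order if one writes something
// the other reads or writes (RAW, WAR and WAW hazards).
static bool depends(const Insn &a, const Insn &b)
{
    auto intersects = [](const std::set<std::string> &x, const std::set<std::string> &y) {
        for (const auto &e : x) if (y.count(e)) return true;
        return false;
    };
    return intersects(a.writes, b.reads) || intersects(a.reads, b.writes) ||
           intersects(a.writes, b.writes);
}

// Enumerate every ordering that is a topological order of the dependency DAG,
// i.e. every "valid reordering" in the sense used above.
static void enumerate(const std::vector<Insn> &insns,
                      std::vector<int> &order, std::vector<bool> &used,
                      const std::function<void(const std::vector<int> &)> &emit)
{
    if (order.size() == insns.size()) { emit(order); return; }
    for (size_t i = 0; i < insns.size(); ++i) {
        if (used[i]) continue;
        bool ok = true;
        // i may come next only if no unplaced earlier instruction must precede it.
        for (size_t j = 0; j < i; ++j)
            if (!used[j] && depends(insns[j], insns[i])) { ok = false; break; }
        if (!ok) continue;
        used[i] = true; order.push_back((int)i);
        enumerate(insns, order, used, emit);
        order.pop_back(); used[i] = false;
    }
}
```

Seeded with the instruction list being optimised (and `order`/`used` initially empty/false), every sequence this emits preserves the original data dependencies, so it is safe to assemble and time; the real ajs source remains the authoritative reference for how the pruning is actually done.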
|
I forgot to say, obviously you need to time the code a number of times and
then throw away any outliers. On modern CPUs there can be interruptions for
any reason. One has to handle things like context switches, parts of the
CPU being switched off to preserve power and so on. For many, many reasons,
things aren't as consistent as they used to be, so some of the timings have
to be excluded. Of course you don't want to throw away all the smallest
timings if they are real timings, e.g. due to the fact that the code you
are timing takes a fractional number of cycles (which will yield random
looking values when timed with an integer counter for cycles).
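As a concrete illustration (again just a sketch, not the ajs code): collect many samples, drop the largest ones, and report the smallest value that occurs several times rather than the raw minimum or the mean.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Summarise noisy cycle counts: sort, drop the largest tail (interrupts,
// context switches, power-state changes), then report the smallest value seen
// at least `min_repeats` times, so a one-off low reading is not trusted blindly.
uint64_t robust_cycles(std::vector<uint64_t> samples, size_t min_repeats = 3)
{
    if (samples.empty()) return 0;
    std::sort(samples.begin(), samples.end());
    samples.resize(samples.size() - samples.size() / 10);   // drop the top ~10%

    std::map<uint64_t, size_t> freq;
    for (uint64_t s : samples) ++freq[s];
    for (const auto &kv : freq)                 // ascending order of cycle count
        if (kv.second >= min_repeats) return kv.first;
    return samples.front();                     // fall back to the minimum
}
```

When the timed code takes a fractional number of cycles per run, two adjacent counts can both be "real", which is why a repeated-smallest rule like this is preferable to blindly taking the single minimum.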
|
TODOs for this report have now been fixed. I've also added a link to the relevant blog post. |
@nthiery All done I think. |
Submitted! Thanks @wbhart, and congratulations on yet another deliverable! |
I just checked this ticket again, as I wanted to see what happened after I left the project. A few remarks in addition to Bill's earlier comments, keyed to his numbered points:
On (1), turning off hyperthreading: this is really rather important. With hyper-threading, multiple programs execute on the same core's execution units, and there is no way of telling how many clock cycles each program used, as they are both using CPU resources concurrently. Fortunately, it turned out that disabling a logical CPU via Linux's …
On (2), enabling RDPMC: doing it, e.g., via a kernel module, is tricky business. I never got it to work reliably, as the kernel has its own idea of whether it should be enabled or not. Very fortunately, the PMU-tools library offers an easy-to-use interface to the kernel's perf system that lets you enable RDPMC in a way that lets the kernel know that it should be enabled, so the kernel will not spuriously switch it off again. Enabling RDPMC through PMU-tools, then using a combination of CPUID (for serializing) and RDPMC gave very accurate timings; this is the best way to time short functions that I've found yet, and it is reasonably easy to do.
On (3), stack addresses: it is worth pointing out that the stack alignment does not change during one program execution, so within one program run, either all the timings will be fast (with no partial address alias stall), or they will all be slow. However, the stall may mask other delays during function entry, when those delays might affect execution time without the alias stall. It's best to avoid the stall with an appropriate alloca() call during program init.
On (4), the SSE/AVX switching penalty: even greater than that, around 100 cycles, iirc. This stall exists only on Haswell (and probably Broadwell), but not on Skylake. It is a little unfortunate not being able to use a mix of SSE2 and AVX on Skylake, but for code that may also be used on Haswell/Broadwell, the stall must be avoided at all cost. For strictly Skylake-only code, mixing is not an issue. All the best, |
Context and Problem statement
MPIR is a highly optimised library for bignum arithmetic, forked from GMP. It is a fundamental building block for many open source mathematical computational components (SageMath, FLINT, Nemo, Eiffelroom, GMPY, Advanpix, PHP and MPIR.net), and therefore its fine optimization on a variety of processor architectures is important for the High Performance aims of OpenDreamKit.
For this deliverable, the task was to implement a superoptimizer which tries valid permutations (i.e., permutations that do not change program behaviour) of the instructions in assembly functions, times each permutation, and chooses the fastest one. In addition, new MPIR functions for recent processor architectures were to be written, making use of recently added features such as AVX2 instructions, and optimized with the superoptimizer where applicable.
For bignum arithmetic, the difference between hand-optimised assembly code and C code compiled by an optimising compiler such as GCC is usually a factor of 4-12. But each new processor microarchitecture requires new assembly language code to be written. One can reuse older assembly code, but each new microarchitecture can do around 20% better than the previous one if hand optimisation is done. In addition to that, speedups due to superoptimisation can be anywhere from 5% to 100%.
In MPIR, we are typically comparing superoptimised code that was written for a previous, but related microarchitecture, and so if the job is done properly, we expect about 20% improvement. We see that, and more, below.
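To make the comparison concrete, the sketch below shows a portable C-style version of an `add_n`-style kernel (adding two arrays of 64-bit "limbs" with carry propagation). It is illustrative only; the MPIR routines discussed below are hand-written assembly that keeps the running carry in the CPU's carry flag (ADC/ADX), which compiled C typically cannot match.

```cpp
#include <cstddef>
#include <cstdint>

typedef uint64_t limb;

// Portable sketch of an add_n-style kernel: rp[] = up[] + vp[], returning the
// final carry. The assembly versions carry through the CPU flags instead of
// recomputing the carry with comparisons as done here.
limb add_n(limb *rp, const limb *up, const limb *vp, size_t n)
{
    limb carry = 0;
    for (size_t i = 0; i < n; ++i) {
        limb u = up[i], s = u + vp[i];
        limb c1 = s < u;                 // carry out of u + v
        limb r = s + carry;
        carry = c1 | (r < s);            // carry out of the full sum
        rp[i] = r;
    }
    return carry;
}
```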
Work completed
For the first six months of the project, we wrote the `ajs` superoptimizer (https://github.com/akruppa/ajs), based on the open-source `AsmJit` library (https://github.com/asmjit/asmjit), a complete just-in-time and remote assembler for C++.
For the second six months, we solved several problems with the `ajs` superoptimizer, especially erratic timings that had put the concept in jeopardy, and, with contributions from Jens Nurmann, wrote and/or optimized a set of core functions for MPIR and some auxiliary functions used internally (see below).
The `ajs` superoptimizer
The biggest problem with the superoptimizer was the highly erratic timings it measured for function executions. This made it practically impossible to have it automatically choose (one of) the fastest permutations for a given function.
The major problem was that the RDTSC(P) instructions no longer count cpu core cycles, but cycles of a fixed-frequency counter, i.e., elapsed natural time. Due to extensive clock scaling features of recent cpus, the measured time depended much more on power saving decisions made by the cpu than on the (comparatively small) speedup by finding a good permutation. This is especially true as functions may have to be superoptimized in several pieces, e.g., separately for lead-in, core loop, and lead-out, to reduce the search space so that decent permutations are found within acceptable time.
The solution we used was the RDPMC instruction, which provides low-latency access to performance measurement counters, including the "second fixed-function counter" (FFC2) which does, in fact, count cpu core clock cycles. The problem was enabling access to this counter from user-mode applications, which requires setting some bits in MSR/CR. Attempts to do so via kernel modules we wrote turned out to be unreliable, as the kernel disabled the bits again (and my modules killed machines on multiple occasions).
Eventually an excellent solution to this problem was found in the jevents library of the pmu-tools (https://github.com/andikleen/pmu-tools/) which provides an API to the perf subsystem of the Linux kernel. This allows enabling RDPMC to read FFC2 without the kernel spuriously disabling it again.
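For illustration, the sketch below shows the underlying mechanism that jevents wraps (it is not the actual ajs code): open a hardware cycle-counting event with perf_event_open, mmap its control page, and pass the counter index the kernel reports there to RDPMC. A production version would also honour the page's seqlock and offset fields, and the kernel may place the event on a programmable counter rather than on FFC2; the reading is still core cycles.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/perf_event.h>

// Open an unhalted-core-cycles event for the current thread and map its
// control page; perf then enables user-space RDPMC for this counter.
static perf_event_mmap_page *open_cycles_counter()
{
    perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.exclude_kernel = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return nullptr; }

    void *p = mmap(nullptr, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? nullptr : static_cast<perf_event_mmap_page *>(p);
}

static inline uint64_t rdpmc(uint32_t ecx)
{
    uint32_t lo, hi;
    asm volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(ecx));
    return ((uint64_t)hi << 32) | lo;
}

int main()
{
    perf_event_mmap_page *pc = open_cycles_counter();
    if (!pc || pc->index == 0) { fprintf(stderr, "RDPMC not available\n"); return 1; }

    // pc->index is 1-based; RDPMC wants a 0-based counter number.
    uint64_t t0 = rdpmc(pc->index - 1);
    /* ... code under test ... */
    uint64_t t1 = rdpmc(pc->index - 1);
    printf("%llu cycles\n", (unsigned long long)(t1 - t0));
}
```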
The resulting timings within one program run were much more stable than before, usually giving the same cycle count for a given function. The timings still vary by 1 cycle occasionally (very rarely by 2 or more); we have tried to find the source of the remaining variance, but to no avail.
Another major source of error, but invariant within one super-optimizer run, was the alignment of the stack, which appears to be randomly chosen at program start. The writes to the stack (PUSH/CALL) on a function call could alias (mod 4096) with the measured function's input operands, causing "partial address alias" stalls which inflated execution time by as much as 10 cycles. This problem was solved by forcing a particular stack alignment.
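As an illustration of the kind of fix involved (a sketch under our own naming, not the actual ajs code), the stack can be shifted once at program start with alloca() so that stack addresses no longer coincide, modulo 4096, with the measured function's operands.

```cpp
#include <alloca.h>
#include <cstdint>
#include <cstdio>

// Stand-in for the real measurement driver (hypothetical).
static uint64_t timed_region(const uint64_t *operand)
{
    uint64_t sum = 0;
    for (int i = 0; i < 256; ++i) sum += operand[i];
    return sum;
}

int main()
{
    static uint64_t operand[256];        // buffer handed to the timed code
    char local;                          // approximates main's stack addresses

    // If the stack and the operand buffer (nearly) alias modulo 4096, writes
    // such as PUSH/CALL can trigger partial-address-alias stalls. Shift the
    // stack by half a page before any timing; alloca'd memory stays live for
    // the rest of main(), so all deeper frames are shifted too.
    uintptr_t clash = ((uintptr_t)&local - (uintptr_t)operand) & 0xfffu;
    if (clash < 128u || clash > 4096u - 128u) {
        volatile char *pad = (volatile char *)alloca(2048);
        pad[0] = 0;                      // keep the allocation from being elided
    }

    printf("%llu\n", (unsigned long long)timed_region(operand));
}
```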
Other problems that occurred within `ajs` and which were solved: `asmjit` choosing instruction encodings differently from other assemblers, which changes instruction alignment. We now manually annotate those instructions that require the long form; all others use the short form. This requires manual work to annotate and verify the resulting instruction encodings.
All in all, fixing the aforementioned problems in `ajs` consumed well over 2 months of time on the project. The code to generate permutations that honour data dependencies is quite powerful; however, subtle interactions with the cpu hardware made it very time-consuming to get the nearly cycle-accurate timings we required.
Optimized functions for MPIR
We now review the functions that have been optimized on various processor microarchitectures (Intel Haswell and Skylake and AMD Bulldozer).
Whilst these aren't the most recent architectures from the major chip manufacturers, they are coming into widespread use around now. Indeed it is difficult to get access to more recent machines. Naturally access to the particular architecture is required in order to optimise for it.
For Haswell and Skylake, the following set of core functions was re-implemented or existing code optimized to take advantage of the respective micro-architecture: `add_n`, `sub_n`, `addmul_1`, `submul_1`, `addlsh1_n`, `sublsh1_n`, `com_n`, `copyi`, `copyd`, `rshift1`, `lshift1`, `rshift`, `lshift`, `mul_1`, `mul_basecase`.
The only AMD CPU to which we could gain access was a Bulldozer which is a fairly old and poorly designed microarchitecture; in particular, new instruction set extensions like AVX2 are so slow on Bulldozer (and Piledriver) that they are best avoided. This left little room for optimization, and we opted not to write new code for this outdated cpu, but to cherry-pick existing code that performs well.
We are very grateful to Jens Nurmann who contributed significant amounts of code and expertise on AVX2 programming, to Brian Gladman for porting the new code to the Microsoft Visual C build system, and to William Stein for granting us access to a Bulldozer machine.
Haswell microarchitecture
For Haswell, new AVX2 versions of `com_n`, `copyd`, `copyi`, `lshift`, `lshift1`, `rshift`, `rshift1` were written anew and super-optimized.
The `addmul_1`, `submul_1`, `mul_1`, `mul_basecase`, and `sqr_basecase` functions for Haswell in the GMP library were copied as these are extremely well optimized already - we did not think we could produce better in what little time we had left. Attempts to super-optimize these functions did not find better code.
Existing `add_n`, `sub_n`, `karaadd`, `karasub`, `hgcd2` functions were modified for Haswell and super-optimized, while `sumdiff_n` and `nsumdiff_n` were written anew.
To give a summary of the speedups obtained, we include here results obtained with the `mpir_bench` program (https://github.com/akruppa/mpir_bench_two). Higher values are better (function executions per unit time); the apparent slow-down for size < 512 GCD is to be investigated.
The new code can be found in the directory https://github.com/akruppa/mpir/tree/master/mpn/x86_64/haswell .
Skylake microarchitecture
For Skylake, `add_n`, `sub_n`, `mul_1`, `add_err1_n` and `sub_err1_n` were written anew and super-optimized. The `addmul_1`, `mul_basecase` and `sqr_basecase` functions were taken from GMP. The other functions for Haswell are used as fall-backs.
The new code can be found in the directory https://github.com/akruppa/mpir/tree/master/mpn/x86_64/skylake .
Bulldozer microarchitecture
On Bulldozer, the speed gains obtained are much more humble than on Haswell and Skylake, as relatively few functions were replaced by faster ones. This microarchitecture is not a profitable target for code optimization any more.
The new code can be found in the directory https://github.com/akruppa/mpir/tree/master/mpn/x86_64/bulldozer .
Additional work
Since the end of the project, we have added preliminary Broadwell CPU support. This does not include any superoptimisation at this point. Broadwell is essentially a revision of Haswell, but with some Skylake features. We have added processor detection to MPIR and sped up this CPU by making use of the Haswell code written for this project. Work is underway to make some of the new Skylake code available to Broadwell chips, and to write new assembly code for Broadwell. Many thanks to our volunteers, Jens Nurmann and David Cleaver who have agreed to work on this.
Future work
The superoptimizer works reasonably reliably now and can be used to optimize more functions in MPIR and other software projects. At this stage MPIR is the only project that has made use of the superoptimiser; however, we have already received a support request, so we expect there to be more use cases soon.
The division and GCD functions in MPIR are worthwhile targets for additional optimization work.
The new Zen microarchitecture of AMD was released towards the end of our project, and looks promising for scientific computation. An optimization effort here would be worthwhile; it will require access to such a machine.
Source code
The `ajs` superoptimizer can be found at https://github.com/akruppa/ajs . The optimized functions for MPIR are merged into the main MPIR repository at https://github.com/wbhart/mpir .
Testing this code
Build instructions for MPIR are as follows:
Download MPIR-3.0.0 from http://mpir.org/
Note that you also need to have the latest yasm to build MPIR: http://yasm.tortall.net/
To build yasm, download the tarball:
To test MPIR, download the tarball:
A Haswell, Skylake, or Bulldozer CPU is required to test the changes referred to above.