
D5.5: Write an assembly superoptimiser supporting AVX and upcoming Intel processor extensions for the MPIR library and optimise MPIR for modern processors #118

Closed
minrk opened this issue Sep 8, 2015 · 21 comments



minrk commented Sep 8, 2015

Context and Problem statement

MPIR is a highly optimised library for bignum arithmetic forked from GMP. It is a fundamental building block for many open source mathematical computing components (SageMath, FLINT, Nemo, Eiffelroom, GMPY, Advanpix, PHP and MPIR.net), and therefore its careful optimisation on a variety of processor architectures is important for the High Performance aims of OpenDreamKit.

For this deliverable, the task was to implement a superoptimizer which tries valid permutations (i.e., ones that do not change program behaviour) of the instructions in assembly functions, times each permutation, and chooses the fastest one. In addition, new MPIR functions for recent processor architectures were to be written, making use of recently added features such as AVX2 instructions, and optimized with the superoptimizer where applicable.
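The core idea, enumerating only those reorderings that respect data dependencies, can be sketched in a few lines (Python here purely for illustration; ajs itself is written in C++ and this is not its actual code):

```python
from itertools import permutations

def valid_orders(instrs, deps):
    """Yield every reordering of `instrs` that respects the dependency
    pairs in `deps`: (a, b) means instruction a must stay before b."""
    for order in permutations(range(len(instrs))):
        pos = {idx: p for p, idx in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in deps):
            yield [instrs[i] for i in order]

# Toy 3-instruction sequence: instruction 2 consumes the result of 0,
# so only orderings keeping 0 before 2 are valid (3 of the 6).
instrs = ["mov rax, [rsi]", "mov rbx, [rdx]", "add rax, 1"]
deps = {(0, 2)}
orders = list(valid_orders(instrs, deps))
```

A real superoptimizer then assembles, runs, and times each candidate and keeps the fastest; since the number of permutations grows factorially, the function is usually split into smaller pieces before searching.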

The difference between hand-optimised assembly code and C code compiled by an optimising compiler such as GCC is usually a factor of 4-12 for bignum arithmetic. But each new processor microarchitecture requires new assembly language code to be written: one can reuse older assembly code, but each new microarchitecture can do around 20% better than the previous one if hand optimisation is done. On top of that, speedups due to superoptimisation can be anywhere from 5% to 100%.

In MPIR, we are typically comparing superoptimised code that was written for a previous, but related microarchitecture, and so if the job is done properly, we expect about 20% improvement. We see that, and more, below.

Work completed

For the first six months of the project, we wrote the ajs superoptimizer (https://github.com/akruppa/ajs), based on the open-source AsmJit library (https://github.com/asmjit/asmjit), a complete Just-In-Time and remote assembler for C++.

For the second six months, we solved several problems with the ajs superoptimizer, especially erratic timings that had put the concept in jeopardy, and, with contributions from Jens Nurmann, wrote and/or optimized a set of core functions for MPIR and some auxiliary functions used internally (see below).

The ajs superoptimizer

The biggest problem with the superoptimizer was the highly erratic timings it measured for function executions. This made it practically impossible to have it automatically choose (one of) the fastest permutations for a given function.

The major problem was that the RDTSC(P) instructions no longer count CPU core cycles, but cycles of a fixed-frequency counter, i.e., elapsed wall-clock time. Due to the extensive clock-scaling features of recent CPUs, the measured time depended far more on power-saving decisions made by the CPU than on the (comparatively small) speedup from finding a good permutation. This is especially true as functions may have to be superoptimized in several pieces, e.g., separately for lead-in, core loop, and lead-out, to reduce the search space so that decent permutations are found within acceptable time.

The solution we used was the RDPMC instruction, which provides low-latency access to performance measurement counters, including the "second fixed-function counter" (FFC2), which does, in fact, count CPU core clock cycles. The problem was enabling access to this counter from user-mode applications, which requires setting some bits in MSR/CR registers. Attempts to do so via kernel modules we wrote turned out unreliable, as the kernel disabled the bits again (and our modules killed machines on multiple occasions).

Eventually an excellent solution to this problem was found in the jevents library of the pmu-tools (https://github.com/andikleen/pmu-tools/) which provides an API to the perf subsystem of the Linux kernel. This allows enabling RDPMC to read FFC2 without the kernel spuriously disabling it again.

The resulting timings within one program run were much more stable than before, usually yielding the same cycle count on each run of a given function. The timings still vary by 1 cycle occasionally (very rarely by 2 or more); we have tried to find the source of the remaining variance, but to no avail.
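Given such nearly deterministic counts, ranking permutations only needs a simple aggregation rule over repeated runs. A minimal sketch of one reasonable policy (illustrative only, not necessarily the rule ajs implements): take the most frequent cycle count, breaking ties towards the smaller value, since noise from interrupts or stalls only ever inflates a measurement.

```python
from collections import Counter

def stable_cycles(samples):
    """Return the most common cycle count among repeated timings of one
    candidate, preferring the smaller value on a tie (noise inflates)."""
    counts = Counter(samples)
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)

# e.g. 20 timings of one permutation, two inflated by a cycle or two:
samples = [104] * 18 + [105, 106]
stable_cycles(samples)  # 104
```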

Another major source of error, invariant within one superoptimizer run, was the alignment of the stack, which appears to be chosen randomly at program start. The writes to the stack (PUSH/CALL) on a function call could alias (mod 4096) with the measured function's input operands, causing "partial address alias" stalls which inflated execution time by as much as 10 cycles. This problem was solved by forcing a particular stack alignment.
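The aliasing condition itself is easy to state: two distinct addresses can trigger the stall when they agree modulo the 4 KiB page size, because the CPU's memory disambiguation initially compares only the low 12 address bits. A tiny illustrative predicate:

```python
PAGE = 4096  # 4 KiB: only the low 12 address bits are compared

def may_alias(addr_a, addr_b):
    """True if two distinct addresses fall on the same offset modulo
    4 KiB, the condition under which a load can falsely match an
    earlier store and raise a 'partial address alias' stall."""
    return addr_a != addr_b and addr_a % PAGE == addr_b % PAGE

# A stack slot and an operand buffer sharing the low 12 bits (0x010):
may_alias(0x7FFD00001010, 0x555500005010)  # True
```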

Other problems that occurred within ajs and which were solved:

  • Jump instructions were always encoded in long form by asmjit, changing instruction alignment compared to other assemblers. We now manually annotate those instructions that require the long form; all others use the short form. This requires manual work to annotate and verify the resulting instruction encodings.
  • Support for the new registers introduced with AVX2, and for instructions with 4 operands
  • Various fixes and extensions to the asm parsing code

All in all, fixing the aforementioned problems in ajs consumed well over 2 months of project time. The code to generate permutations that honour data dependencies is quite powerful; however, subtle interactions with the CPU hardware made it very time-consuming to get the nearly cycle-accurate timings we required.

Optimized functions for MPIR

We now review the functions that have been optimized on the various processor microarchitectures (Intel Haswell and Skylake, and AMD Bulldozer).

Whilst these aren't the most recent architectures from the major chip manufacturers, they are coming into widespread use now; indeed, it is difficult to get access to more recent machines. Naturally, access to a particular architecture is required in order to optimise for it.

For Haswell and Skylake, the following set of core functions was re-implemented or existing code optimized to take advantage of the respective micro-architecture: add_n, sub_n, addmul_1, submul_1, addlsh1_n, sublsh1_n, com_n, copyi, copyd, rshift1, lshift1, rshift, lshift, mul_1, mul_basecase.

The only AMD CPU to which we could gain access was a Bulldozer, a fairly old and poorly designed microarchitecture; in particular, new instruction set extensions like AVX2 are so slow on Bulldozer (and Piledriver) that they are best avoided. This left little room for optimization, so we opted not to write new code for this outdated CPU, but to cherry-pick existing code that performs well.

We are very grateful to Jens Nurmann who contributed significant amounts of code and expertise on AVX2 programming, to Brian Gladman for porting the new code to the Microsoft Visual C build system, and to William Stein for granting us access to a Bulldozer machine.

Haswell microarchitecture

For Haswell, new AVX2 versions of com_n, copyd, copyi, lshift, lshift1, rshift, rshift1 were written and super-optimized.

The addmul_1, submul_1, mul_1, mul_basecase, and sqr_basecase functions for Haswell in the GMP library were copied, as these are extremely well optimized already; we did not think we could produce better in the little time we had left. Attempts to super-optimize these functions did not find better code.

Existing add_n, sub_n, karaadd, karasub, hgcd2 functions were modified for Haswell and super-optimized, while sumdiff_n and nsumdiff_n were written anew.

To summarize the speedups obtained, we include here results obtained with the mpir_bench program (https://github.com/akruppa/mpir_bench_two). Higher values are better (function executions per unit time); the apparent slow-down for GCD at sizes below 512 remains to be investigated.

**Program multiply (weight 1.00)**

| Size 1 | Size 2 | Old | New |
| ---: | ---: | ---: | ---: |
| 128 | 128 | 108222650 | 107111633 |
| 512 | 512 | 22816149 | 26895874 |
| 8192 | 8192 | 228124 | 289984 |
| 131072 | 131072 | 3884 | 5015 |
| 2097152 | 2097152 | 173 | 203 |
| 128 | 128 | 108109328 | 107223557 |
| 512 | 512 | 17689648 | 20384648 |
| 8192 | 8192 | 155145 | 189057 |
| 131072 | 131072 | 2771 | 3479 |
| 2097152 | 2097152 | 118 | 133 |
| 15000 | 10000 | 80120 | 91788 |
| 20000 | 10000 | 61030 | 71776 |
| 30000 | 10000 | 37966 | 42448 |
| 16777216 | 512 | 501 | 658 |
| 16777216 | 262144 | 24.6 | 28.7 |

**Program gcd (weight 0.50)**

| Size 1 | Size 2 | Old | New |
| ---: | ---: | ---: | ---: |
| 128 | 128 | 3729465 | 3646816 |
| 512 | 512 | 767983 | 554155 |
| 8192 | 8192 | 10974 | 15908 |
| 131072 | 131072 | 175 | 223 |
| 1048576 | 1048576 | 9.38 | 11.5 |

**Program gcdext (weight 0.50)**

| Size 1 | Size 2 | Old | New |
| ---: | ---: | ---: | ---: |
| 128 | 128 | 2628011 | 2036197 |
| 512 | 512 | 595026 | 451973 |
| 8192 | 8192 | 7900 | 11192 |
| 131072 | 131072 | 129 | 171 |
| 1048576 | 1048576 | 6.04 | 7.94 |

The new code can be found in the directory https://github.com/akruppa/mpir/tree/master/mpn/x86_64/haswell .

Skylake microarchitecture

For Skylake, add_n, sub_n, mul_1, add_err1_n and sub_err1_n were written anew and super-optimized. The addmul_1, mul_basecase and sqr_basecase functions were taken from GMP. The other functions for Haswell are used as fall-backs.

**Program multiply (weight 1.00)**

| Size 1 | Size 2 | Old | New |
| ---: | ---: | ---: | ---: |
| 128 | 128 | 123326551 | 123312872 |
| 512 | 512 | 29477397 | 33899135 |
| 8192 | 8192 | 298474 | 358841 |
| 131072 | 131072 | 4924 | 6024 |
| 2097152 | 2097152 | 213 | 246 |
| 128 | 128 | 123340235 | 123340948 |
| 512 | 512 | 22551903 | 25322713 |
| 8192 | 8192 | 208058 | 238204 |
| 131072 | 131072 | 3497 | 4316 |
| 2097152 | 2097152 | 142 | 155 |
| 15000 | 10000 | 104503 | 112647 |
| 20000 | 10000 | 80121 | 89101 |
| 30000 | 10000 | 47871 | 54247 |
| 16777216 | 512 | 611 | 693 |
| 16777216 | 262144 | 29.1 | 33.6 |

**Program gcd (weight 0.50)**

| Size 1 | Size 2 | Old | New |
| ---: | ---: | ---: | ---: |
| 128 | 128 | 4387356 | 4373122 |
| 512 | 512 | 814864 | 682194 |
| 8192 | 8192 | 11468 | 18970 |
| 131072 | 131072 | 208 | 274 |
| 1048576 | 1048576 | 11.3 | 14.1 |

**Program gcdext (weight 0.50)**

| Size 1 | Size 2 | Old | New |
| ---: | ---: | ---: | ---: |
| 128 | 128 | 2750101 | 2562046 |
| 512 | 512 | 640358 | 557060 |
| 8192 | 8192 | 8526 | 13743 |
| 131072 | 131072 | 155 | 212 |
| 1048576 | 1048576 | 7.50 | 9.83 |

The new code can be found in the directory https://github.com/akruppa/mpir/tree/master/mpn/x86_64/skylake .

Bulldozer microarchitecture

On Bulldozer, the speed gains obtained are much more modest than on Haswell and Skylake, as relatively few functions were replaced by faster ones. This microarchitecture is no longer a profitable target for code optimization.

**Program multiply (weight 1.00)**

| Size 1 | Size 2 | Old | New |
| ---: | ---: | ---: | ---: |
| 128 | 128 | 55322152 | 55550756 |
| 512 | 512 | 12248577 | 12586138 |
| 8192 | 8192 | 139406 | 138848 |
| 131072 | 131072 | 2406 | 2421 |
| 2097152 | 2097152 | 101 | 105 |
| 128 | 128 | 55781257 | 51370568 |
| 512 | 512 | 7690668 | 8710261 |
| 8192 | 8192 | 90386 | 83592 |
| 131072 | 131072 | 1587 | 1584 |
| 2097152 | 2097152 | 64.0 | 65.9 |
| 15000 | 10000 | 44703 | 45193 |
| 20000 | 10000 | 33852 | 35294 |
| 30000 | 10000 | 20000 | 20199 |
| 16777216 | 512 | 268 | 294 |
| 16777216 | 262144 | 12.7 | 13.4 |

**Program gcd (weight 0.50)**

| Size 1 | Size 2 | Old | New |
| ---: | ---: | ---: | ---: |
| 128 | 128 | 2597029 | 2611829 |
| 512 | 512 | 284031 | 289573 |
| 8192 | 8192 | 6800 | 6810 |
| 131072 | 131072 | 108 | 107 |
| 1048576 | 1048576 | 5.77 | 5.77 |

**Program gcdext (weight 0.50)**

| Size 1 | Size 2 | Old | New |
| ---: | ---: | ---: | ---: |
| 128 | 128 | 1270472 | 1239850 |
| 512 | 512 | 223972 | 218197 |
| 8192 | 8192 | 4944 | 4924 |
| 131072 | 131072 | 78.1 | 78.0 |
| 1048576 | 1048576 | 3.65 | 3.65 |

The new code can be found in the directory https://github.com/akruppa/mpir/tree/master/mpn/x86_64/bulldozer .

Additional work

Since the end of the project, we have added preliminary Broadwell CPU support; this does not yet include any superoptimisation. Broadwell is essentially a revision of Haswell, but with some Skylake features. We have added processor detection to MPIR and sped MPIR up on this CPU by making use of the Haswell code written for this project. Work is underway to make some of the new Skylake code available to Broadwell chips, and to write new assembly code for Broadwell. Many thanks to our volunteers, Jens Nurmann and David Cleaver, who have agreed to work on this.

Future work

The superoptimizer now works reasonably reliably and can be used to optimize more functions in MPIR and other software projects. At this stage MPIR is the only project that has made use of the superoptimiser; however, we have already received a support request, so we expect more use cases soon.

The division and GCD functions in MPIR are worthwhile targets for additional optimization work.

The new Zen microarchitecture from AMD was released towards the end of our project and looks promising for scientific computation. An optimization effort here would be worthwhile; it will require access to such a machine.

Source code

The ajs superoptimizer can be found at https://github.com/akruppa/ajs . The optimized functions for MPIR are merged into the main MPIR repository at https://github.com/wbhart/mpir .

Testing this code

Build instructions for MPIR are as follows:

Download MPIR-3.0.0 from http://mpir.org/

Note that you also need to have the latest yasm to build MPIR: http://yasm.tortall.net/

To build yasm, unpack its tarball and run:

```
./configure
make
```

To build and test MPIR, unpack its tarball and run:

```
./configure --enable-gmpcompat --with-yasm=/path_to_yasm/yasm
make
make check
```

A Haswell, Skylake, or Bulldozer CPU is required to test the changes referred to above.

@minrk minrk added this to the D5.5 milestone Sep 8, 2015
@wbhart wbhart changed the title D5.5: Extend the existing assembly superoptimiser for AVX and upcoming Intel processor extensions for the MPIR library. D5.5: Write an assembly superoptimiser supporting AVX and upcoming Intel processor extensions for the MPIR library and optimise MPIR for modern processors Jun 26, 2016
@nthiery nthiery assigned wbhart and unassigned ClementPernet Jun 30, 2016

wbhart commented Jun 30, 2016

  • D 5.5 Write an assembly superoptimiser supporting AVX and upcoming Intel processor extensions for the MPIR library and optimise MPIR for modern processors
    • Due month 18
    • Alex Best has used the asmjit library to write an assembly superoptimiser supporting AVX [1]
    • The superoptimiser is now working on many of the architectures we have access to and can be used to optimise MPIR asm code
    • Have been working closely with a new MPIR contributor, Jens Nurmann, who has been writing new Skylake assembly code and also updating code for older two and three port processors (he now has access to the superoptimiser and is using it)
    • Have been validating the superoptimiser on Piledriver and a new Haswell machine purchased by Uni KL
    • Have developed a possible strategy to extend superoptimiser to Windows ABI
    • First version of MPIR addmul_1 function written using new mulx instruction provided by Intel explicitly for bignum arithmetic
    • Alex Kruppa will write new assembly code for modern processors (esp. AVX) and superoptimise it
    • Project is exactly on schedule

[1] https://github.com/alexjbest/ajs/


nthiery commented Jun 30, 2016

That sounds great! Thanks for the report.


wbhart commented Aug 17, 2016

Hi all,

I'm just writing to let you know that our project to write a superoptimiser
and optimise for modern processors has hit an unexpected snag. On all
modern Intel CPUs, the rdtsc and rdtscp instructions no longer give cycle
accurate timings required for superoptimisation.

Instead there are performance counters that do this, however they are
switched off by default. To turn them on, one needs to run ring 0 code,
which means it can only be done from a Linux kernel module. We have written
such a module.

Unfortunately, due to what we think is a Linux kernel bug (N.B: we are
not 100% sure yet, but the kernel devs already have noncritical tickets
open for similar things), the counters are almost immediately switched off
by the kernel after they are switched on (it looks like some bit mask may
be incorrect). If we are right about the cause, this can't be fixed unless
we fix the Linux kernel. We can do that, but it will take months to get the
patch into a Linux kernel release. That's months we don't have.

We have spent about 6 weeks tracking down this issue, and as of this
moment, we are about 2 weeks behind schedule due to this bug.

I'm just reporting the issue here so that it doesn't come as a surprise
later when we announce that we cannot superoptimise for modern Intel CPUs.
As far as we know, the issue does not affect recent AMD CPUs (we have not
verified this yet). However, we currently do not have access to any recent
AMDs that we haven't already superoptimised for in the past.

As far as I can see, no action needs to be taken by anyone, and I don't
think there is anything anyone can do to help, except perhaps give us
access to recent AMD machines (later than Piledriver). They would need to
be machines on which you have root access and are prepared to run a small
number of things as root.

We'll use this ticket to keep people updated about our progress with fixing
the issue. Note that it will not affect us delivering a superoptimiser as a
piece of code, since this is already written. It just won't work on Intel
CPUs made after about 2012. Obviously it will affect us delivering
superoptimised MPIR routines for modern Intel CPUs, though we can deliver
these later than planned, once the bug is fixed.

Note that no other ODK deliverables depend on this deliverable, so I don't
see any potential issues for the ODK project as a whole.

Bill.



wbhart commented Aug 17, 2016

Another possibility is if anyone has any recent Intel OSX machines that we
can gain access to. As they don't use the Linux kernel, the bug likely
doesn't exist there.

By the way, we have verified that the various tools that Linux provides to
control these registers do not work as advertised. Also, there seems to be
an access privilege bug for at least one of the relevant device files in
/dev. This does not escalate privileges, but rather denies access when it
shouldn't, so is not a security concern. So there are definitely real bugs
here. What we haven't yet verified is whether a kernel bug is responsible
for turning off access to the relevant performance counters (perhaps it is
even deliberate).

We'll post more when we know more, including links to various tickets, if
appropriate.

Bill.



embray commented Aug 19, 2016

Interesting report--that sounds frustrating! Perhaps it's not so bad, in terms of delays. As you wrote, the work on the superoptimiser is already done, as is the work on tracking down and addressing the issues in Linux. Assuming the Linux devs acknowledge the bug and will patch it, while there may be a delay on that it's not the end of the world since no other ODK work is waiting on it. In principle the deliverable could be said to be satisfied by building one's own kernel (and that's only necessary as a temporary measure).


wbhart commented Aug 19, 2016

Unfortunately we won't be able to run a custom kernel on our research
machines here. Also, we are completely unsure what is buggy at this point.
There is a particular bit which we need to set in a given CPU register. The
Linux tools for setting it don't work. On one older machine, with an older
kernel, we are able to set it with about 50% probability. On the machine we
need to set it on, we haven't figured out any way to set it, nor have we
figured out why it can't be set. It's a frustratingly slow grind to try to
sort the issue out.

We've decided at this point that we will not persist longer than the end of
the month. After that time we will abandon recent Intel and only
superoptimise for AMD, which seems to work.

We strongly suspect the crippling of the facility may not be an accident
(but have no proof of this).

Bill.



wbhart commented Aug 25, 2016

Alex Kruppa just informed me that he has found a solution to the
superoptimisation issue on Haswell (and presumably all other recent Intel
CPUs). This means we can get the project back on schedule. Once we clean
everything up, we will be about 4 weeks behind the originally planned
schedule due to this issue, which we've been working on for more than 7
weeks. However, we are now much more hopeful about delivering on all the
promised deliverables.

The solution was apparently to use an existing library which programs the
kernel perf subsystem. This both works around all the bugs in the other
tools that we found, and cooperates with the kernel so it doesn't clobber
the performance registers.

We still don't get rock solid timings. But we now have cycle counts that
are good enough to run a superoptimiser and get maybe 90% of the potential
benefit.

Unfortunately, a few days ago, we managed to crash one of our
machines whilst trying different potential solutions, killing the kernel
and the remote access panel (no idea how), taking down three of our
websites. Hopefully the machine will be restarted some time in the next few
days.

Bill.


bpilorget commented Nov 21, 2016

@ClementPernet (WP leader) and @wbhart (Kaiserslautern=lead beneficiary)
This deliverable is due for February 2017


wbhart commented Nov 21, 2016

We had about two months of delays due to issues I reported previously. The
superoptimiser is done and will be delivered. We are pretty sure we will be
able to deliver an optimised version of MPIR for modern processors in time,
though it has certainly been a tough project and we aren't completely out
of the woods yet!



nthiery commented Feb 6, 2017

Dear M18 deliverable leaders,

Just a reminder that reports are due for mid-february, to buy us some time for proofreading, feedback, and final submission before February 28th. See our README for details on the process.

In practice, I'll be offline February 12-19, and the week right after will be pretty busy. Therefore, it would be helpful if a first draft could be available sometime this week, so that I can have a head start reviewing it.

Thanks in advance!

@minrk minrk mentioned this issue Feb 16, 2017

nthiery commented Feb 24, 2017

Currently proofreading the issue description; please don't edit it for now, to avoid conflicts.


nthiery commented Feb 24, 2017

Done with the proofreading! The issue description is fair game again.

This looks very nice. I just left a few little TODO's; @wbhart can you take care of them? It would be nice as well if the github description contained a link to the upcoming blog post. Maybe you can create a stub page for this post or just guess in advance what its URL will be?

And then I believe this is good to go. Yeah!


wbhart commented Feb 24, 2017 via email


wbhart commented Feb 24, 2017 via email

@serge-sans-paille

@wbhart this blogpost is super-great! Any pointers to more detailed information, like:

  • what are the limitations of traditional rdtsc in the context of a super-optimizer?
  • how did you get around the issue? any code to share if we meet the same issue?
  • when doing superoptimization, are you using brute force, or some heuristics when exploring the possible combinations?


wbhart commented Feb 25, 2017 via email


wbhart commented Feb 25, 2017 via email


wbhart commented Feb 27, 2017

TODOs for this report have now been fixed. I've also added a link to the relevant blog post.


wbhart commented Feb 27, 2017

@nthiery All done I think.


nthiery commented Feb 27, 2017

Submitted! Thanks @wbhart, and congratulations on yet another deliverable!

@nthiery nthiery closed this as completed Feb 27, 2017

akruppa commented Apr 19, 2017

I just checked this ticket again, as I wanted to see what happened after I left the project. A few remarks on the RDPMC instruction, in addition to Bill's earlier comments:

  1. Turn off hyperthreading

This is really rather important; with hyper-threading, multiple programs execute on the same core's execution units, and there is no way of telling how many clock cycles each program used, as both are using CPU resources concurrently. Fortunately, it turned out that disabling a logical CPU via Linux's
/sys/devices/system/cpu/cpu<n>/online
file appears to be about as good as disabling hyper-threading entirely via the BIOS; at least, I found no significant timing differences between a CPU whose logical sibling was set offline and a system where hyper-threading was disabled globally. This means that only a single logical CPU needs to be switched off (and the process to be timed needs to be pinned to that physical CPU), while the rest of the system continues to run at full performance.

  2. Use RDPMC performance counters (this is incredibly hard to do and an entire art in itself; they are probably not available by default on your system)

Doing it, e.g., via a kernel module, is tricky business. I never got it to work reliably, as the kernel has its own idea of whether it should be enabled or not. Very fortunately, the pmu-tools library offers an easy-to-use interface to the kernel's perf subsystem that lets you enable RDPMC in a way the kernel knows about, so the kernel will not spuriously switch it off again. Enabling RDPMC through pmu-tools, then using a combination of CPUID (for serializing) and RDPMC, gave very accurate timings; this is the best way to time short functions that I have found yet, and it is reasonably easy to do.

  3. The address of memory locations that are accessed for your timings should not be the same as other memory locations accessed, modulo 4096. This is particularly relevant for variables on the stack, which means you need to shift the stack address to get reliable timings.

It is worth pointing out that the stack alignment does not change during one program execution, so within one program run either all the timings will be fast (with no partial address alias stall) or all will be slow. However, the stall may mask other delays during function entry that would affect execution time without the alias stall. It is best to avoid the stall with an appropriate alloca() call during program init.

  4. There is an enormous penalty for switching between SSE and AVX on some processors (something like 70 cycles)

Even greater than that, around 100 cycles, IIRC. This stall exists only on Haswell (and probably Broadwell), but not on Skylake. It is a little unfortunate not being able to use a mix of SSE2 and AVX on Skylake, but for code that may also run on Haswell/Broadwell, the stall must be avoided at all costs. For strictly Skylake-only code, mixing is not an issue.
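As a footnote to point 1, finding which logical CPU to take offline means reading the core's sibling list from sysfs. A hypothetical helper (illustrative only; the parsing follows the kernel's standard cpulist syntax, and the sysfs write itself needs root):

```python
def sibling_of(cpu, siblings_list):
    """Parse the contents of
    /sys/devices/system/cpu/cpu<n>/topology/thread_siblings_list
    (kernel cpulist syntax, e.g. "0,4" or "2-3") and return the
    hyper-thread sibling of `cpu`, or None if there is none."""
    ids = []
    for part in siblings_list.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    others = [i for i in ids if i != cpu]
    return others[0] if others else None

# sibling_of(0, "0,4") -> 4: one would then write 0 to
# /sys/devices/system/cpu/cpu4/online (as root) to take it offline.
```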

All the best,
Alex
