D5.7: Take advantage of multiple cores in the matrix Fourier Algorithm component of the FFT for integer and polynomial arithmetic, and include assembly primitives for SIMD processor instructions (e.g. AVX, etc.), especially in the FFT butterflies #120
Comments
@ClementPernet (WP leader) and @wbhart (Lead beneficiary)
This is not a massive project and should be no problem to deliver on time. (In reply to bpilorget, 21 November 2016.)
I have removed the non-existent "Knight's Bridge" from the description of this deliverable. We now have more of an idea of what is required here, and we think SIMD support such as SSE and AVX is what is required.
The multiple core version of the FFT is now implemented.
Dear M18 deliverable leaders,

Just a reminder that reports are due by mid-February, to buy us some time for proofreading, feedback, and final submission before February 28th. See our README for details on the process. In practice, I'll be offline February 12-19, and the week right after will be pretty busy. Therefore, it would be helpful if a first draft could be available sometime this week, so that I can have a head start reviewing it. Thanks in advance!
Hi @wbhart,
Yeah, I haven't formally gone through and checked everything off on the checklist, and am still doing some minor edits around the edges. But basically it's ready to go.

I just wrote a blog post on the quadratic sieve [1]. I hope to find time for a blog post on the assembly superoptimiser too, perhaps tomorrow.

[1] https://wbhart.blogspot.de/2017/02/integer-factorisation-in-flint.html
On 23 February 2017 at 17:21, Nicolas M. Thiéry wrote:

> Hi @wbhart,
> What's the status of the current report? Ready for final proofreading and submission?
> Same question for your other deliverable reports?
Thanks for the feedback. I'll try to read the current reports tomorrow then.

Note that the deadline is only about the reports; so, no rush for the blogs: they can easily wait a couple of days, unless you are planning to include their content in the report.

Cheers,
I'm not planning on including the blogs in the report. They have a wider scope and are really designed to draw focus down to the specific ODK deliverables.
Ok, sounds good. Then, let's focus for now on getting the reports done and submitted :-)
Hi @wbhart: I probably won't get to proofread the issue description this afternoon. Could you add a little section of context at the top of the issue description, in the vein of that of D5.5 and D5.6? Please also implement the other little cosmetic changes I did in D5.5 and D5.6 (reminder of what Bulldozer/... are, code blocks for shell instructions, software names in backticks, paragraphs on a single line, ...). Thanks!
Sure thing. No problem.
Thanks; ping me when you are done!
I think this is done now. I will add a link to a blog article as soon as one is written (either today or tomorrow).
@nthiery All done here too I think, with the exception of a blog post, which should appear today or tomorrow.
Ok. It's ready to submit on my side too. Ping me when the blog post is online.
@nthiery I have added a link to the now complete blog post.
Submitted!
# Report on parallelising the FFT

## Problem statement
Given two polynomials of length n, the time to multiply them using classical schoolbook multiplication is O(n^2). But there are numerous algorithms which do better. The Karatsuba method already takes time O(n^log_2(3)), and other methods, such as Toom-Cook, improve the exponent further.
The Fast Fourier Transform (FFT) technique allows multiplication of such polynomials in O(n log(n)) operations. It is a technique that goes back as far as Gauß, but has seen extensive development since then, with over 800 papers on the method and related techniques, and applications ranging from signal processing to string searching and polynomial and integer arithmetic.
The version of the FFT that is used in `Flint` and `MPIR` is the Schönhage-Strassen method. Instead of performing the convolution over the complex numbers, which would make use of imprecise floating point numbers subject to rounding error, it makes use of an exact ring, namely Z/pZ where p = 2^(2^n) + 1. This technique allows exact multiplication of polynomials and integers with nearly linear complexity.

In summary, the existing FFT in `Flint` is used for both large integer multiplication and polynomial multiplication. The purpose of this task was to parallelise the FFT in `Flint`.

Typically, parallelising the FFT algorithm is difficult. However, `Flint` makes use of a cache-friendly implementation of the FFT based on the Matrix Fourier Algorithm, which breaks one very large FFT convolution up into many smaller FFTs. The existing FFT implementation in `Flint` (and `MPIR`) is world class and includes numerous technical optimisations, such as the truncated Fourier transform.

## The method

In order to thread the FFT in `Flint`, we used `OpenMP`. We threaded it at the level of the Matrix Fourier Algorithm. This involved separating the temporary storage that is used throughout the algorithm on a per-thread level, and then adding `OpenMP` primitives to the part of the Matrix Fourier Algorithm that breaks the FFT into many smaller FFTs.

We also threaded the code which splits large integers into FFT coefficients. Unfortunately it is difficult, or even impossible, to fully parallelise the recombination that happens after the FFT convolution has run, so this wasn't attempted. However, it accounts for a negligible portion of the run time.
Fortunately, once the Matrix Fourier Algorithm becomes more efficient than a single large FFT (due to its cache-aware properties), the threaded version also becomes more efficient than the single-threaded version. In fact, the tuning crossover was found to be at exactly the same point! This is an interesting coincidence and made tuning very easy.
To maximise the benefit of threads, we combine parts of the small inward FFTs, the relevant pointwise multiplications and parts of the outward inverse FFTs into combined blocks that each run on a single thread without interruption. The whole FFT convolution consists of many of these smaller blocks. This was by design rather than accident!
The algorithm in `Flint` also combines the truncated Fourier transform and the Matrix Fourier Algorithm in such a way that the entire large FFT breaks down exactly into the smaller threaded blocks discussed above, with no additional pieces that have to be dealt with serially. This is due to an innovation in the `Flint` FFT which isn't available elsewhere. Again, this was a design feature, not an accident. The method is exceptionally technical, and a full description is well beyond the scope of this report.

In fact, we were able to preserve every single one of the technical tricks mentioned above in our parallel implementation of the FFT in `Flint`.

## Results

The new code for the threaded Matrix Fourier Algorithm has been implemented as part of this deliverable and merged into the main `Flint` repository.

Here are timings of the new code in `Flint` on a single core versus four and eight cores, for various sized integer multiplications on a 64-bit machine.

## Testing the parallel FFT
The `Flint` repository is available here.
To build and test the code mentioned above, you must have `GMP`/`MPIR` and `MPFR` installed on your machine (refer to your system documentation for how to do this), then configure, build, and run the `Flint` test suite. Full instructions on how to build `Flint` are available in the `Flint` documentation, available at the `Flint` website.

The description of the FFT interface is well beyond the scope of this report, but it can be found in the `Flint` documentation (625 pp.). There is also additional information specific to the FFT in the `Flint` FFT README.

# Report on writing assembly primitives for the FFT butterflies

## Problem statement
For this deliverable, our task was to improve existing functions, or write new ones, to use features of recent microprocessors (especially AVX2) to speed up the Schönhage-Strassen FFT butterflies. Such assembly primitives are provided by the `MPIR` library.

The main operations used in the FFT butterflies are multi-word additions, subtractions and shifts, provided by `MPIR` functions such as `mpn_add_n`, `mpn_sub_n`, `mpn_lshift` and `mpn_rshift`. Some of these operations already had assembly primitives available as part of the `MPIR` library. However, these were not optimised for recent architectures using AVX, for example. In this task, we also added a new assembly primitive, as described below, which is used directly in the FFT butterflies (where most of the FFT work is actually done).

Every year or two, Intel and AMD release new CPU microarchitectures. The ones we focused on for this deliverable were Intel Haswell and Skylake and AMD Bulldozer. These are not the most recent architectures, but they are coming into widespread use at the present time.
## Results
The microarchitectures for which we optimized the code are mainly Intel Haswell and Intel Skylake, and to a lesser extent AMD Bulldozer. For Bulldozer (and Piledriver) it should be noted that the opportunities for optimization are rather limited: the microarchitecture generally performs poorly, especially in hyper-threading mode, and the AVX instructions in particular are so slow as to be practically useless. The newer AMD Steamroller fares better, but we did not have access to one.
For Haswell and Skylake, the `mpn_lshift1`, `mpn_rshift1`, `mpn_lshift` and `mpn_rshift` functions have been written anew using AVX2 instructions, which gave a large speed-up over the previous code. The `mpn_add_n`/`mpn_sub_n` functions (which are identical, performance-wise) have been modified from existing code and optimized according to the respective microarchitecture. An `mpn_sumdiff_n` function (which computes a+b and a-b in a single pass) has been introduced into `MPIR`; this function existed for older processors, but not for recent x86_64.

We are very grateful to Jens Nurmann, who contributed significant amounts of code and expertise on AVX2 programming.
### Haswell microarchitecture
Timings in cycles per limb:
(1) The sum of the times of `mpn_add_n` and `mpn_sub_n`.
(2) The sum of the times of `mpn_add_n`, `mpn_sub_n` and `mpn_neg_n`.

Timings for the full Schönhage-Strassen large integer multiplication (`mpn_mul_n`) in seconds:

Note that these timings include the effect of code improvements made for D5.5 (#118), in particular better `mpn_mul_basecase` and Karatsuba code.

### Skylake microarchitecture

Timings in cycles per limb:
Of note here is the speed of `mpn_add_n`/`mpn_sub_n`, at essentially 1 cycle per limb (c/l) for the core loop. This is optimal both in terms of the data dependency chain and memory accesses, as Skylake can in theory execute two reads and one write per clock cycle. In practice, the instruction scheduler presumably falls into a bad pattern after running at 1 c/l for a while, and from then on runs the loop at only ~1.2 c/l. Jens Nurmann found that inserting a meaningless AVX2 instruction into the core loop (which does not otherwise use AVX2) breaks up this bad scheduling pattern, allowing these critically important core functions to run at the optimal speed reliably.
Timings for `mpn_mul_n` in seconds:
### Bulldozer microarchitecture
Much less optimization effort was made for Bulldozer than for Haswell and Skylake, owing to the age and poor performance of this processor. No code was written from scratch, but among all the existing implementations for a given function, the one that ran fastest on Bulldozer was chosen.
Among those functions that were replaced by faster versions, these three are relevant to the FFT butterflies:
Timings for `mpn_mul_n` in seconds:
Unfortunately, the improvements to the `mpn_lshift`/`mpn_rshift` functions are barely visible in the integer multiplication benchmark on Bulldozer.
All code written for this deliverable has been committed to Alex Kruppa's fork of the `MPIR` repository at https://github.com/akruppa/mpir and merged into the main `MPIR` repository at https://github.com/wbhart/mpir. It will be included in the MPIR-3.0.0 release, available at the `MPIR` website.

Build instructions for `MPIR` are as follows. Download MPIR-3.0.0 from: http://mpir.org/

Note that you also need the latest `Yasm` assembler to build `MPIR`: http://yasm.tortall.net/

To build `Yasm`, download the tarball and follow its build instructions. To test `MPIR`, download the tarball, then configure, build, and run the test suite.

A Haswell, Skylake, or Bulldozer CPU is required to test the changes referred to above.