
D5.13: Parallelise the Singular sparse polynomial multiplication algorithms and provide parallel versions of the Singular sparse polynomial division and GCD algorithms. #111

Open
minrk opened this issue Sep 8, 2015 · 38 comments

Comments


@minrk minrk commented Sep 8, 2015

Singular is a computer algebra system aimed at computations in algebraic geometry and is one of the key components used by SageMath. Since computations with multivariate polynomials are at the core of Singular, their performance impacts the whole system.

The aim of this deliverable was to modernize Singular's sparse multivariate arithmetic by 1) updating the algorithms to the state of the art and 2) applying thread-level parallelism to achieve decent scaling on multi-core machines. The operations we focused on are multiplication, divisibility testing, and the computation of the greatest common divisor. The implementation was carried out in the C library Flint, which is used by Singular but also by independent systems; the latter thus also benefit from the improvements. Among many other applications, this tackles the long-standing slowness of multivariate rational fractions in SageMath.
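For illustration, here is a minimal sketch of the three operations through Flint's fmpz_mpoly C interface (the function names follow the public Flint API; the input polynomials and thread count are arbitrary choices for the example):

```c
#include <flint/fmpz_mpoly.h>

int main(void)
{
    const char *vars[] = {"x", "y", "z"};
    fmpz_mpoly_ctx_t ctx;
    fmpz_mpoly_t f, g, p, q, d;

    fmpz_mpoly_ctx_init(ctx, 3, ORD_DEGREVLEX);  /* Z[x,y,z], degrevlex order */
    fmpz_mpoly_init(f, ctx);
    fmpz_mpoly_init(g, ctx);
    fmpz_mpoly_init(p, ctx);
    fmpz_mpoly_init(q, ctx);
    fmpz_mpoly_init(d, ctx);

    fmpz_mpoly_set_str_pretty(f, "(1+x+y+2*z^2)^10", vars, ctx);
    fmpz_mpoly_set_str_pretty(g, "(1+x+y+z+x*y*z)^10", vars, ctx);

    flint_set_num_threads(4);          /* opt in to the parallel algorithms */

    fmpz_mpoly_mul(p, f, g, ctx);      /* sparse multiplication */
    fmpz_mpoly_divides(q, p, f, ctx);  /* divisibility test; sets q = p/f and returns 1 here */
    fmpz_mpoly_gcd(d, p, q, ctx);      /* greatest common divisor */

    flint_printf("product has %wd terms\n", fmpz_mpoly_length(p, ctx));

    fmpz_mpoly_clear(f, ctx);
    fmpz_mpoly_clear(g, ctx);
    fmpz_mpoly_clear(p, ctx);
    fmpz_mpoly_clear(q, ctx);
    fmpz_mpoly_clear(d, ctx);
    fmpz_mpoly_ctx_clear(ctx);
    return 0;
}
```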

@minrk minrk added this to the D5.13 milestone Sep 8, 2015
@nthiery nthiery assigned wdecker and unassigned ClementPernet Mar 22, 2016
@IzabelaFaguet IzabelaFaguet commented Apr 16, 2019

Hello everyone!
We are organising an ODK report-writing sprint from August 24th to August 31st,
a good opportunity to finish the final reports in a pleasant and friendly environment.
Would someone like to participate?

The link to the poll: https://framadate.org/tfuHjZgcSU8pHI45

@wbhart wbhart commented May 28, 2019

Hi all, time for an update on progress. Daniel Schultz has been working extremely hard on this ticket and we now have all the parallel algorithms in Flint that we promised and performance seems to be very good.

The next step is to make this available to users, which we plan to do via Singular. Some of the functionality will be made available directly from Singular, e.g. multiplication, division and gcd. But to maximise the impact for Singular users, we also need to use the new code to speed up Factory, which Singular uses internally for various operations such as factorisation, rational functions and primary decomposition.

We expect the biggest improvements in rational functions, though we are hopeful we will see some improvement in factorisation too (rewriting all the factorisation code to benefit fully from the sdmp format we use would be a very major project, and this lies beyond the scope of the ODK project).

We believe that the approach we are taking will benefit the greatest number of Singular users and downstream beneficiaries. However, it is not at all trivial work to make use of our code in Factory, so we won't know if that aspect is completely successful until we implement it for at least one coefficient ring. We are going to start with Z/pZ, though the new code in Flint will happily handle Z, Q, GF(q) (with and without Conway polynomials) just fine as well.
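As a sketch of what the Z/pZ path looks like at the Flint level, here is a minimal example using the nmod_mpoly module (the function names follow the public Flint API; the polynomials and the prime modulus are arbitrary choices for the example):

```c
#include <flint/nmod_mpoly.h>

int main(void)
{
    const char *vars[] = {"x", "y"};
    nmod_mpoly_ctx_t ctx;
    nmod_mpoly_t a, b, g;

    /* bivariate polynomials over Z/pZ with p = 1000003, an arbitrary prime */
    nmod_mpoly_ctx_init(ctx, 2, ORD_LEX, UWORD(1000003));
    nmod_mpoly_init(a, ctx);
    nmod_mpoly_init(b, ctx);
    nmod_mpoly_init(g, ctx);

    /* two inputs sharing the nontrivial common factor (x+y+1)^8 */
    nmod_mpoly_set_str_pretty(a, "(x+y+1)^8*(x-y)", vars, ctx);
    nmod_mpoly_set_str_pretty(b, "(x+y+1)^8*(x+y)", vars, ctx);

    flint_set_num_threads(8);          /* enable thread-level parallelism */

    if (nmod_mpoly_gcd(g, a, b, ctx))  /* returns 1 on success */
    {
        nmod_mpoly_print_pretty(g, vars, ctx);
        flint_printf("\n");
    }

    nmod_mpoly_clear(a, ctx);
    nmod_mpoly_clear(b, ctx);
    nmod_mpoly_clear(g, ctx);
    nmod_mpoly_ctx_clear(ctx);
    return 0;
}
```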

@nthiery nthiery commented Aug 25, 2019

For the record: Daniel told me that he had a version of the report for review, just not yet in ODK's repo. He will be in Cernay from tomorrow to Wednesday, and I'll make a first pass of review.

@nthiery nthiery commented Aug 26, 2019

Suggestion: please prepare a demo in notebook format; see #289.

@nthiery nthiery commented Aug 29, 2019

I have just been through the report. Nice write-up! And an impressive job behind the scenes, even if I don't have the expertise to appreciate all its refinements :-)

I made some minor changes, notably in the abstract on the issue description above.

@nthiery nthiery commented Aug 29, 2019

@wbhart: did you get a chance to review the latest version? Notably the technical details? If not, would you have a chance to do it today?
@ClementPernet: please review and let me know when done.

Then we can submit. It's pretty close!

@embray embray commented Aug 29, 2019

I am reviewing the text and have a few small typo fixes to submit.

@embray embray commented Aug 29, 2019

I don't understand the point here:

Since this slows down the rest of SINGULAR by about a factor of two, it may not be advantageous to disable omalloc in practice. Nevertheless, we have disabled this to test the efficiency of the parallel conversion routines.

This seems to suggest that you would never want to disable omalloc for a typical Singular install, in which case these parallelized routines are not useful except with a custom-built Singular solely for that purpose? Or does this statement only refer to the "parallel conversion routines" (the point of which is not clear here, but I could be missing something)?

@embray embray commented Aug 29, 2019

I fixed some minor typos in 110444b.

@nthiery nthiery commented Aug 29, 2019

@wbhart: I just recompiled the latest version, with my and Erik's edits. The pdf is available from the Report link above. If you had already reviewed the most recent version by Daniel, you can also just have a look at the abstract and the diff in the sources.

@wbhart wbhart commented Aug 29, 2019

@nthiery This all looks really good, albeit a little too honest at points. But better to defend reality than some fiction, I think. And I am immensely proud of what we've achieved, despite the obvious limitations put on us by reality.

There is just one typo on page 2: "long outstanding calculations". I would not regard them as outstanding. They are quite average, I think, especially since they are taking so long. Perhaps "long running computations" was intended.

@tthsqe12 tthsqe12 commented Aug 29, 2019

@wbhart "outstanding" has two possible meanings, and the second, less common one was intended here.
@embray That part was referring only to the parallel conversion routines, which would be disabled if omalloc is being used. If this was not clear, please let me know how you might make it clearer.

@wbhart wbhart commented Aug 29, 2019

I do think Daniel has covered it very well in the report. There's really nothing useful I could add to that. He describes in detail the limitations of the existing polynomial formats, omalloc, system malloc etc.

It all comes down to whether this is accepted as reality or not.

@nthiery nthiery commented Aug 29, 2019

@ClementPernet: just waiting for your green light (or reviews), and I'll submit.

@wbhart wbhart commented Aug 29, 2019

You have probably submitted by now, but I was quite rushed for time today. Here is a more detailed description of the essential nontriviality of the work we have been doing.

There are four main areas where fast polynomial arithmetic can be expected to speed up important research applications using Singular. I will discuss each in turn in the interests of the "honesty" we were talking about.

  1. Basic arithmetic for arithmetic's sake.
  2. Groebner bases.
  3. Primary decomposition/factorisation.
  4. Rational functions.

No. 1. Here we are talking about users implementing algorithms in Singular that need to do basic arithmetic, i.e. multiplication, division and gcd of polynomials. As the default polynomial format in Singular is a linked-list sparse distributed format, a conversion to and from Flint's array-based sparse distributed format is unavoidable. You still get a big speedup in practice, but as you can see from the timings, conversion costs may actually dominate, meaning you don't get a linear speedup with the number of cores. This behaviour is inherent, and we have expended much effort in minimising the cost. The user still wins, however, so it is worth the effort.
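To quantify why conversion caps the scaling, here is a standard Amdahl-style bound (an illustrative sketch; the conversion fraction c is hypothetical, not a figure from the report). If a fraction c of the single-threaded wall time goes into format conversion and only the remaining 1 - c is parallelised across p threads, then

```latex
% Amdahl-style bound; c is a hypothetical conversion fraction
S(p) = \frac{T_1}{T_p} = \frac{1}{c + \frac{1 - c}{p}} \le \frac{1}{c}
```

so, for example, c = 0.2 caps the achievable speedup at 5x no matter how many cores are used.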

No. 2. Here you can only expect to gain when there are large divisions done in the Buchberger algorithm for Groebner bases, which is not used for every Groebner basis application. However, it does still have significant applications. We have encountered such examples in real world applications recently (albeit with orderings that we nearly but don't quite support yet). Many such examples exist in real research with orderings that we do support and we expect big gains here eventually. Work under this heading (supporting additional orderings, etc.) will go on for years after ODK and is well beyond the scope of the deliverable.

No. 3. Primary decomposition depends heavily on factorisation of multivariate polynomials. But factorisation can be a life's work, compared to a four-year ODK project, and so lies completely outside the scope of ODK. Daniel has already begun work on a new factoring engine in Singular, based on the ODK work, and the preliminary results are extremely encouraging.

No. 4. Hans Schoenemann and I have been implementing fast rational functions based on the ODK work. Here conversion costs do NOT occur: rational functions are constructed in the Flint format and never leave that format. One can do Groebner bases over rational function fields, and this is a very important application in Singular.

One should temper one's expectations, though. Although the single-core gains may make orders of magnitude of difference here, it is common for the rational functions to be too small to benefit from parallelisation. However, when "coefficient explosion" occurs, it becomes critical. One should also realise that in such cases it is often better to use a modular Groebner basis algorithm, if available; on the other hand, that preference may be merely due to the historical lack of fast rational functions, and even with the modular algorithm one still benefits from the ODK work, since the same computations have to be done mod p, which we implement.

The work to implement rational functions is basically done (by me), independently of ODK funding but directly based on the fast ODK arithmetic. It needs some new interpreter features to be added by Hans when he returns from holidays. Experience tells us the speedups could potentially be orders of magnitude (seconds compared to weeks in some real-world cases).

I should also mention that we are already benefiting from the ODK work in our computer algebra system Oscar. For example, we depend on it entirely for an important application one of our collaborators has found in group theory (again using fast rational functions). That is all completely independent of ODK, but it is worth knowing that the benefits are being multiplied across multiple systems. From an Oscar perspective, we incur conversion costs only when converting from the Flint format to the Singular format for a Groebner basis computation, where the conversion cost is usually negligible compared to the Groebner basis computation itself, which can be doubly exponential in the worst case. Other systems are free to approach things in a similar way, and then the cost of conversion becomes a moot point.

I hope this explains some of the hard reality of the problem domain and clarifies our current understanding, which has certainly evolved over the four years. Nothing is for free, but many systems will be getting benefits from this work for years to come.

@nthiery nthiery commented Aug 29, 2019

Thanks @wbhart! I have not submitted yet, so please add this to the .tex file.

@wbhart wbhart commented Aug 29, 2019

By the way, here are some blog stats for all articles on ODK related stuff, in case someone is collecting that. I can't give any more details, due to privacy regulations, but the aggregate figures here should be ok.

[attached image: blogreaders (aggregate blog readership statistics)]

@nthiery nthiery added Submitted and removed needs review labels Aug 29, 2019
@nthiery nthiery commented Aug 29, 2019

@wbhart: all good! I did some minor tweaks, and submitted!

Thank you @tthsqe12 for all the work, and good luck with your next endeavour!

@ClementPernet: still useful to review at some point. I could resubmit later in case you have non-trivial suggestions.

@ClementPernet ClementPernet commented Aug 30, 2019

Thanks for this great piece of work (both the report and the code!)
I've done a quick review and will share some remarks right now, before diving into more detail later on.

  1. Minor edit: you should add [htb] to your table environments so that LaTeX can place them closer to where they belong.
  2. I'm not sure I understand the argument for disabling turbo boost. The reason invoked (to obtain a "reasonable efficiency", defined as speed-up / nthreads) looks like a dirty recipe to produce good-looking data. Overall, what really matters is neither the speed-up nor the efficiency but the wall time of the computation: it is extremely easy to get a linear-scaling speedup if you parallelise stupidly slow code, and it gets ever harder as the sequential code becomes more efficient. I know you are dealing with extremely efficient sequential code, but maybe you should just add an argument explaining that by doing so you obtain better timings overall.
  3. If time permits, a few graphical representations (curves) replacing or complementing some of the tables would make the data easier to read.
@wbhart wbhart commented Aug 30, 2019

I assume I can still modify the report. Apparently I need to correct something I wrote.

I will also attend to Clement's 1 and 2. I'll see if Daniel has time to work on 3.

Just to answer 2: with turbo enabled, the processor can speed up some cores (for example when only one is busy), which skews the data; disabling turbo prevents this. It's impossible to get linear scaling on modern CPUs without disabling turbo boost. I will try to explain this in the report.

It doesn't give better timings overall, it is done purely to get meaningful data.
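To spell out the skew with a hypothetical example (the frequencies below are made up, not measurements from the report): suppose a single busy core turbos to frequency f_1 while p busy cores sustain only f_p < f_1. Then even perfectly scaling code measures

```latex
% illustrative model of turbo-boost skew; f_1 and f_p are hypothetical
S(p) = \frac{T_1}{T_p} \approx p \cdot \frac{f_p}{f_1} < p
```

e.g. f_1 = 3.5 GHz against f_p = 2.8 GHz makes 8 perfectly scaling threads look like a 6.4x speedup, understating the parallel efficiency of the code itself.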

@ClementPernet ClementPernet commented Aug 30, 2019

@wbhart : could you also update the progress report on Task T5.4 (which is essentially this deliverable) in section 1.2.5.2 of https://github.com/OpenDreamKit/OpenDreamKit/blob/master/ReportingPeriod3/TechnicalReport/WP5.tex
You can also just send me a paragraph if you want me to edit the file.
Thanks

@wbhart wbhart commented Aug 30, 2019

I've made a PR addressing points 1 and 2, as well as various comments from people here, correcting a few things (mainly spelling).

I will now look at WP5.tex. Daniel is looking into whether he can make a graph showing efficiency that we can add to the report. If it is done by the end of the day, he'll add a PR for that.

@wbhart wbhart commented Aug 30, 2019

Done.

@nthiery nthiery commented Aug 30, 2019

Very final version submitted!

@wbhart wbhart commented Aug 30, 2019

Thank you!

@ClementPernet ClementPernet commented Oct 18, 2019

@wbhart @tthsqe12 @wdecker :
I am preparing the presentation of WP5. There are 4 deliverables due for M48 and this is one of them. Can you prepare a few slides (about 3-4) summarizing the main achievements of this deliverable, and the perspectives and new directions which follow from this work?
My slides for WP5 will be beamer slides, so you can

  • prepare beamer slides,
  • or just list the content in plain text, and I can beamerize them.

I will need to have this ready before Sunday the 27th.
Thanks.
@wbhart wbhart commented Oct 18, 2019

Sure. I hope that @tthsqe12 can work on this.

@tthsqe12 tthsqe12 commented Oct 25, 2019

@ClementPernet Please let me know if you received my slides via email.

@ClementPernet ClementPernet commented Oct 25, 2019

Yes, I am currently merging them into my presentation, and will push it to this repo.
A few remarks:

  • great set of achievements and perspectives, thanks!
  • the slides look a bit too verbose (a lot of text and no graphics). As there will be 4 such tasks being reviewed, we should consume the reviewers' reading energy with parsimony! I'll probably edit the text to make it less verbose.
  • some figures (table or graph) illustrating key improvements would have a great impact. Last year, in the status report for D5.13, we presented a table for multivariate multiplication:
    ReportingPeriod2/ReviewPresentations/WP5/WP5.pdf
    Have these timings improved since then? Or could you select whichever of tables 1, 2, 3, 4, 5 in the deliverable report is most representative (select 2-3 tables)? Then I would recommend a speed-up graph representing these tables: displaying the speed-up w.r.t. the sequential execution as a function of the number of threads, with the ideal speed-up y = x also shown.
    Other deliverables will display similar figures, so it matters to give a consistent type of information.

Let me know first which data is most important, and then whether you can produce the graphs; otherwise, I can do it on my way to Lux.
Thanks for your help.
@tthsqe12 tthsqe12 commented Oct 25, 2019

The timings for the table in last year's status report were taken while the server's speed was unstable due to other jobs. I can turn Figure 1 of the final report into such a graph with x = y plotted and send it to you right away. Feel free to cut down on my verbosity.

@ClementPernet ClementPernet commented Oct 25, 2019

> The timings for the table in last year's status report were taken while the server's speed was unstable due to other jobs. I can turn Figure 1 of the final report into such a graph with x = y plotted and send it to you right away.

Great, please do!

> Feel free to cut down on my verbosity.

Will do, it's always easier to cut down on too much information than to interpolate from too little!
