Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

D5.14: Implementations of exact linear algebra algorithms on distributed memory et heterogenous architectures: clusters and accelerators. Solving large linear systems over the rationals is the target application. #112

Open
minrk opened this issue Sep 8, 2015 · 11 comments

Comments

@minrk
Copy link
Contributor

@minrk minrk commented Sep 8, 2015

Context

Computational linear algebra is a key tool delivering high computing throughput to applications requiring large scale computations. In numerical computing, dealing with floating point arithmetic and approximations, a long history of efforts has led to the design of a full stack of technology for numerical HPC: from the design of stable and fast algorithms, to their implementation in standardized libraries such as LAPACK and BLAS, and their parallelization on shared memory servers or supercomputers with distributed memory.

On the other hand, computational mathematics relies on linear algebra with exact arithmetic, i.e. multiprecision integers and rationals, finite fields, etc. This leads to significant differences in the algorithmic and implementation approaches. Over the last 20 years, a continuous stream of research has improved the exact linear algebra algorithmic; simultaneously, software projects such as LinBox and fflas-ffpack were created to deliver a similar set of kernel linear algebra routines as LAPACK but for exact arithmetic.

The parallelization of these kernels has only been partially addressed in the past, and was mostly focused on shared memory architectures.

Achievements of the deliverable

This deliverable aims at enhancing these libraries so that they can exploit various large scale parallel architectures, including large multi-cores, clusters, and accelerators. The target application is the solver of linear systems of the field of multi-precision rational numbers. For this application, several algorithmic approaches have be studied and experimented, namely a Chinese Remainder based solver and a p-adic lifting solver. The former exposes a much higher level of parallelism in its tasks, while the latter requires many fewer operations asymptotically. We first focus on the algorithmic aspects, with the presentation of a new p-adic lifting algorithm based on Chen and Storjohann's algorithm. We then present the implementation and related experiments of a MPI-Chinese-remainder-based rational solver for distributed computing, and an implementation of the new p-adic algorithm. Lastly we report on the support for GPU made available in fflas-ffpack and related benchmarks.

@minrk minrk added this to the D5.14 milestone Sep 8, 2015
@nthiery nthiery modified the milestones: Month 48: 2019-08-31, D5.14 Mar 22, 2016
@ClementPernet

This comment has been minimized.

Copy link
Contributor

@ClementPernet ClementPernet commented Jan 27, 2017

I gave a presentation at the Runtime Day ("Journée Runtime") in Paris about our experience in parallelization of recursive tasks for LinBox.
The conference programme: http://calcul.math.cnrs.fr/spip.php?article275
The slides of my presentation : http://calcul.math.cnrs.fr/Documents/Journees/janv2017/runtime_pernet.pdf

@nthiery

This comment has been minimized.

Copy link
Contributor

@nthiery nthiery commented Jan 27, 2017

@bpilorget

This comment has been minimized.

Copy link
Contributor

@bpilorget bpilorget commented Jan 11, 2018

#122 was merged into #112 after 3rd amendment to the contract

@IzabelaFaguet

This comment has been minimized.

Copy link
Contributor

@IzabelaFaguet IzabelaFaguet commented Apr 16, 2019

Hello everyone!
We are organising an ODK report writing sprint From August 24th to August 31st,
a good opportunity to finish the final reports in a pleasant and friendly environment.
Would someone like to participate?

The link to the poll: https://framadate.org/tfuHjZgcSU8pHI45

@nthiery nthiery added the in writing label Aug 25, 2019
@nthiery

This comment has been minimized.

Copy link
Contributor

@nthiery nthiery commented Aug 26, 2019

Suggestion: please prepare a demo in notebook format; see #289.

@ClementPernet

This comment has been minimized.

Copy link
Contributor

@ClementPernet ClementPernet commented Aug 31, 2019

Report is ready. If someone want to give it one more pair eyes, it would be most welcome! @wbhart , @stevelinton @KBelabas or any other volunteer ?

@nthiery nthiery added needs review and removed in writing labels Aug 31, 2019
@wbhart

This comment has been minimized.

Copy link
Contributor

@wbhart wbhart commented Aug 31, 2019

p1. lead -> led
p1. implementations approaches -> implementation approaches
p1. will be studied and experimented -> were studied and experimented with
p2. asymptotically much fewer operations -> many fewer operations asymptotically
p2. a careful attention -> paying careful attention
p2. and compute with -> and computes with
p2. numerous arithmetic operation -> numerous arithmetic operations
p2. reaching a -> reaching an
p3. allow to terminate -> allow one to terminate
p4. with over the arbitrary -> over arbitrary
p4. having better cache efficiency and more suitable -> which has better cache efficiency and is more suitable
p4. share with -> shared with
p5. it was parallelization of these operations was not of high priority
and we chose to delay to future work -> parallelization of these operations was not of high priority
and we chose to delay this to future work
p5. some prototype -> a prototype
p5. a solving a -> solution of a
p5. 100 bits entries -> 100 bit entries
p6. were obtained on -> were conducted on
p6. not of a prototype benchmarking code -> not of prototype benchmarking code
p6. denote by the INV -> denote by INV
p6. is a constant for any l ?? Do you mean constant not depending on l? Obviously l^6 is a constant for all l, which is why I ask.
p7. in sequential ?? Do you mean, "for sequential code" or "in the sequel" or "one after the other". Again in the next sentence.
p7. Now in parallel -> Now for parallel code ??
p8. number primes -> number of primes
p8. may increases -> may increase
p8. matrices, make it negligible -> matrices make it negligible
p8. parallelization pays of -> parallelization pays off
p8. tune up at a finer grain -> fine tune
p8. Graphic Processing Unit (GPU) -> Graphics Processing Units (GPUs)
p8. among these applications -> among the applications
p8. Gaussian elimination, require branching -> Gaussian elimination require branching
p8. relying exploiting -> exploiting
p8. only need to -> only needs to
p9. Figure 5: leftmost column correspond -> the leftmost column corresponds
p9. He can then transparently run his code -> The user can then transparently run their code
p9. up to a factor 39.5 and up to 2.46 with respect to a parallel execution on -> up to a factor of 39.5 faster and up to 2.46 times faster with respect to a parallel execution
p9. negligible compared to as ?? Something is missing in this sentence
p9. This is the reason why other routines :: should not be a new paragraph
p9. many matrix multiplication -> many matrix multiplications ??

I'm actually quite surprised that memory -> GPU overhead is still a problem. I thought this was 10's of gigabytes per second these days on modern GPU's and much higher for data centre GPUs. The code must be very high performance to hit this bottleneck.

@ClementPernet

This comment has been minimized.

Copy link
Contributor

@ClementPernet ClementPernet commented Aug 31, 2019

Thanks @wbhart for your review, i'm applying your fixes.
Regarding your remark on GPU memory communication: yes the code is very high performance: multiplying 2 10000x10000 matrices of double takes below 1s (0.9s) while there is 2.4Gb of data (3 operands of 10^8 times 8 bytes.

@wbhart

This comment has been minimized.

Copy link
Contributor

@wbhart wbhart commented Aug 31, 2019

Ah I see. I found a page that says whilst transfer from CPU to GPU is pretty quick, people are seeing much slower speeds in the other direction, i.e. down around 2.5GB/s, which would totally dominate your runtime.

Nice results anyway!

@ClementPernet

This comment has been minimized.

Copy link
Contributor

@ClementPernet ClementPernet commented Aug 31, 2019

Done fixing the report with @wbhart remarks. Thanks again.
@nthiery : I think the report is good to go now.

@nthiery nthiery added Submitted and removed needs review labels Aug 31, 2019
@nthiery

This comment has been minimized.

Copy link
Contributor

@nthiery nthiery commented Aug 31, 2019

Deliverable submitted; thanks for the timely report! Now I can go light-hearted for the evening :-)

I can't wait until my next occasion to do solving over the rational with Sage, and benefit from all the cool stuff here. Kudos to whole team at Grenoble!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.