D5.14: Implementations of exact linear algebra algorithms on distributed memory et heterogenous architectures: clusters and accelerators. Solving large linear systems over the rationals is the target application. #112
Comments
I gave a presentation at the Runtime Day ("Journée Runtime") in Paris about our experience in parallelization of recursive tasks for LinBox.
On Fri, Jan 27, 2017 at 12:59:29AM -0800, Clément Pernet wrote:
I gave a presentation at the Runtime Day ("Journée Runtime") in Paris
about our experience in parallelization of recursive tasks for LinBox.
The conference programme:
http://calcul.math.cnrs.fr/spip.php?article275
The slides of my presentation:
http://calcul.math.cnrs.fr/Documents/Journees/janv2017/runtime_pernet.pdf
Cool! Any feedback from the audience?
You may want to add this to our activities page
http://opendreamkit.org/activities/
Cheers,
Nicolas
Hello everyone! The link to the poll: https://framadate.org/tfuHjZgcSU8pHI45
Suggestion: please prepare a demo in notebook format; see #289.
Report is ready. If someone wants to give it one more pair of eyes, that would be most welcome! @wbhart, @stevelinton, @KBelabas, or any other volunteer?
p1. lead -> led. I'm actually quite surprised that memory -> GPU transfer overhead is still a problem. I thought this was tens of gigabytes per second these days on modern GPUs, and much higher for data-centre GPUs. The code must be very high performance to hit this bottleneck.
Thanks @wbhart for your review, I'm applying your fixes.
Ah I see. I found a page that says that whilst transfer from CPU to GPU is pretty quick, people are seeing much slower speeds in the other direction, i.e. down around 2.5 GB/s, which would totally dominate your runtime. Nice results anyway!
Deliverable submitted; thanks for the timely report! Now I can go light-hearted into the evening :-) I can't wait for my next occasion to do solving over the rationals with Sage and benefit from all the cool stuff here. Kudos to the whole team at Grenoble!
Context
Computational linear algebra is a key tool delivering high computing throughput to applications requiring large scale computations. In numerical computing, dealing with floating point arithmetic and approximations, a long history of efforts has led to the design of a full stack of technology for numerical HPC: from the design of stable and fast algorithms, to their implementation in standardized libraries such as LAPACK and BLAS, and their parallelization on shared memory servers or supercomputers with distributed memory.
On the other hand, computational mathematics relies on linear algebra with exact arithmetic, i.e. multiprecision integers and rationals, finite fields, etc. This leads to significant differences in the algorithmic and implementation approaches. Over the last 20 years, a continuous stream of research has improved exact linear algebra algorithmics; simultaneously, software projects such as LinBox and fflas-ffpack were created to deliver a set of kernel linear algebra routines similar to LAPACK's, but for exact arithmetic.
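To illustrate the difference in flavour, here is a toy exact solver over the rationals, using naive Gauss-Jordan elimination on Python's `Fraction` type. This is an illustration only, not how LinBox or fflas-ffpack work internally: exact arithmetic gives an answer with no rounding error, but the entries grow in bit size as the elimination proceeds, which is precisely why dedicated algorithms and kernels are needed.

```python
# Toy exact linear solver: Gauss-Jordan elimination over the rationals.
# Illustration only -- not the LinBox / fflas-ffpack implementation.
from fractions import Fraction

def solve_exact(A, b):
    """Solve A x = b exactly over Q; A is n x n, entries coercible to Fraction."""
    n = len(A)
    # augmented matrix [A | b] with exact rational entries
    M = [[Fraction(A[i][j]) for j in range(n)] + [Fraction(b[i])]
         for i in range(n)]
    for col in range(n):
        # find a nonzero pivot and swap it into place
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        # eliminate the column everywhere else -- every step is exact
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [M[r][j] - f * M[col][j] for j in range(n + 1)]
    return [M[i][n] / M[i][i] for i in range(n)]

# The 3x3 Hilbert system is notoriously ill-conditioned in floating point,
# but exact arithmetic recovers the integer solution [3, -24, 30] directly.
H = [[Fraction(1, i + j + 1) for j in range(3)] for i in range(3)]
print(solve_exact(H, [1, 1, 1]))
```

The price of exactness shows up in the intermediate fractions: for larger ill-conditioned systems their numerators and denominators blow up, which is what the CRT and p-adic techniques discussed below are designed to control.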
The parallelization of these kernels has only been partially addressed in the past, and was mostly focused on shared memory architectures.
Achievements of the deliverable
This deliverable aims at enhancing these libraries so that they can exploit various large-scale parallel architectures, including large multi-cores, clusters, and accelerators. The target application is the solving of linear systems over the field of multi-precision rational numbers. For this application, several algorithmic approaches have been studied and experimented with, namely a Chinese-remainder-based solver and a p-adic lifting solver. The former exposes a much higher level of parallelism in its tasks, while the latter requires asymptotically many fewer operations. We first focus on the algorithmic aspects, presenting a new p-adic lifting algorithm based on Chen and Storjohann's algorithm. We then present the implementation and related experiments of an MPI Chinese-remainder-based rational solver for distributed computing, and an implementation of the new p-adic algorithm. Lastly, we report on the GPU support made available in fflas-ffpack and related benchmarks.
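To make the p-adic approach concrete, here is a sketch of classical Dixon lifting, the algorithm family the deliverable's new solver builds on. It is an illustrative toy, not the LinBox API or the new Chen-Storjohann-based variant: one matrix inverse is computed modulo a prime p, the solution is then lifted digit by digit to a solution modulo p^k, and rational reconstruction recovers the exact fraction in each entry.

```python
# Sketch of Dixon's p-adic lifting for solving A x = b over Q.
# Illustrative only; names and parameters are not the LinBox API.
import math
from fractions import Fraction

def mat_inv_mod(A, p):
    """Invert an n x n integer matrix modulo a prime p (Gauss-Jordan)."""
    n = len(A)
    M = [[A[i][j] % p for j in range(n)] + [int(i == j) for j in range(n)]
         for i in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] % p != 0)
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], -1, p)          # modular inverse (Python 3.8+)
        M[col] = [v * inv % p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(M[r][j] - f * M[col][j]) % p for j in range(2 * n)]
    return [row[n:] for row in M]

def rational_reconstruct(a, m):
    """Recover n/d with |n|, d <= sqrt(m/2) from a = n/d mod m
    (Wang's algorithm: a truncated extended Euclidean run)."""
    r0, r1 = m, a % m
    t0, t1 = 0, 1
    bound = math.isqrt(m // 2)
    while r1 > bound:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
    return Fraction(r1, t1)  # Fraction reduces and normalizes the sign

def dixon_solve(A, b, p, k):
    """Solve A x = b over Q by lifting the solution to a p-adic expansion mod p^k."""
    n = len(A)
    C = mat_inv_mod(A, p)       # one inverse mod p, reused at every lifting step
    r, x, pk = list(b), [0] * n, 1
    for _ in range(k):
        # next p-adic digit of the solution: xi = A^{-1} r  (mod p)
        xi = [sum(C[i][j] * r[j] for j in range(n)) % p for i in range(n)]
        x = [x[i] + pk * xi[i] for i in range(n)]
        # the residual update is exact: r - A*xi is divisible by p by construction
        r = [(r[i] - sum(A[i][j] * xi[j] for j in range(n))) // p
             for i in range(n)]
        pk *= p
    return [rational_reconstruct(v % pk, pk) for v in x]
```

For example, `dixon_solve([[2, 1], [1, 3]], [1, 1], p=97, k=6)` recovers the exact solution `[Fraction(2, 5), Fraction(1, 5)]`. The structure also makes the parallelization trade-off mentioned above visible: each lifting step depends on the previous residual, so the p-adic scheme is inherently sequential across digits, whereas a Chinese-remainder solver solves independently modulo many primes at once.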