/
WP5.tex
273 lines (228 loc) · 15.7 KB
/
WP5.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
\subsubsection{WorkPackage 5: High Performance Mathematical Computing}
\label{hpc}
%Explain, task per task, the work carried out in WP during the reporting period giving details of the work carried out by each beneficiary involved.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\paragraph{Overview}
Workpackage 5 is about the development of high performance computing tools in
mathematical virtual research environments. It is addressed at the level
of each kernel library composing the computational tools of the project (\Pari,
\GAP, \Linbox, \MPIR, \Sage, \Singular, ...), and also at the level of interfacing and exposing
core parallel features to higher level programming interfaces.
Key results obtained over the period for WorkPackage 5 are the following:
%% Only list deliverables produced in the reporting period
\begin{compactitem}
%% \item A closer integration of \Linbox in \Sage with improved reliability and
%% computing efficiency.
\item A full-featured parallelisation engine, supporting POSIX threads and
MPI, for PARI in production release of the software
\item Release of GAP-4.9 allowing compilation in HPC-GAP compatibility mode.
% \item A new super-optimizer for vectorized assembly code and its
% exploitation to improve the performances of the MPIR code.
\item A new symmetric matrix factorization algorithm over finite fields, and
its high-performance implementation in the \texttt{fflas-ffpack} library.
\item Major redesign the the polynomial arithmetic used in Singular delivering
state of the art efficiency.
\end{compactitem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subparagraph{Milestones}
\subparagraph{\longmilestoneref{hpc-prototype}}
\emph{“User story: Astrid wants to run compute intensive routines
involving both dense linear algebra and combinatorics. She has
access through a JupyterHub-based VRE to a high end multi-core
machine which includes a vanilla \Sage installation. She
automatically benefits from the HPC features of the underlying
specialized libraries (\Linbox, ...). This is a proof of concept
of the overall framework to integrate the HPC advances of
specialized libraries into a general purpose VRE.
%
It will prepare the final integration of a broader set of such
parallel features for the end of the project.”}
With Deliverable~\delivref{hpc}{LinBox-algo}, we developped, released and integrated in the
\Sage the LinBox library and its core dependencies: fflas-ffpack and givaro.
When installing the latest \Sage release on a multithreaded multicore server, it
only takes one configure option to let fflas-ffpack use a multi-threaded BLAS
and therefore expose its parallel speed-up to the end-user of Sage. This feature
is compliant with the use of a higher level of parallelism, through process
workstealing queues that \textit{Astrid} may be using in her combinatorics code, as those
exposed in \delivref{hpc}{sage-HPCcombi}. Now that this first proof of concept has been
successfully achieved, we are working in exposing the more advanced parallel
routines of fflas-ffpack into \Sage, following~\delivref{hpc}{LinBox-DSL}. It
should in particular make Gaussian elimination and related routines enjoy a
better scaling with respect to available CPU cores.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\paragraph{Tasks}
\subparagraph{\longtaskref{hpc}{hpc-pari}}
Deliverable~\longdelivref{hpc}{pari-hpc1} has been merged
with \longdelivref{hpc}{pari-hpc2} in the revised workplan.
Deliverable~\delivref{hpc}{pari-hpc2} is on time or even ahead of schedule.
The generic engine for Deliverable~\delivref{hpc}{pari-hpc1} was written and
finalized in 2015 and 2016 and is in production since PARI-2.9 (released
11/2016), it supports sequential evaluation (no parallelism), POSIX threads and
MPI within the same code base.
Development work since then is targeted at using the engine wherever it makes
sense in the code base. The current situation (PARI-2.11, released 07/2018)
includes:
\begin{itemize}
\item fast (near linear time) Chinese remaindering;
\item fast linear algebra over Q and cyclotomic fields, a critical component of
the new "Modular Forms" package;
\item polynomial resultant in $\mathbb{Z}[X]$ via Chinese remainders;
\item computation of classical modular polynomials for about 20 classical
invariants (j, Weber functions, small eta quotients...);
\item discrete logarithm over finite fields (prime fields and
$\mathbb{F}_{p^e}$ for word-sized prime $p$) - polynomial resultant in
$\mathbb{Z}[X] \times \mathbb{Z}[X,Y]$ via
Chinese remainders / evaluation;
\item APRCL primality proof.
\end{itemize}
More work is underway regarding the development of ECPP, integer
factorization engines, class group and units computations.
Compared to the pre-existing sequential code, the parallel version adds about 15 extra lines of
C on average (conveniently encapsulated in a separate "worker" object). As usual
in "embarassingly parallel" code, the measured speedup is essentially linear in
the number of available cores. The resulting GP binary has been successfully
crash-tested by a panel of users during the 2017 PARI/GP Atelier in Lyon:
basically, substituting the \texttt{Configure} instruction by \texttt{Configure --mt=pthreads}
before compilation transparently yields the expected speedups for the target
functions on participants dual-core laptops.
\subparagraph{\longtaskref{hpc}{hpc-gap}}
\label{hpc@hpc-gap}
No deliverable is due for the evaluation period but steady progress was made on
Deliverable~\longdelivref{hpc}{GAP-HPC-report}. Over this period, eight releases were cut
incorporating contributions to Deliverable~\longdelivref{component-architecture}{hpc-configure}.
Another major direction of efforts is the HPC-GAP integration:
HPC-GAP is a fork of \GAP initiated during the \scienceproject project, which
enables multithreaded calculations. Now that HPC-GAP has reached
maturity, it's critical for its widespread use and long term
maintainability to merge it back into \GAP's master branch.
The first step towards this long-standing goal, which is at the core of
Task~\localtaskref{hpc}{hpc-gap}, is the major release of \GAP 4.9, published in 2018 and allowing compilation in HPC-GAP
compatibility mode. It comes together with the new manual book called ``HPC-GAP Reference Manual'',
which is also available online at \url{https://www.gap-system.org/Manuals/doc/hpc/chap0.html}.
HPC-GAP integration required a major
refactorization of \GAP's build system
mainly developed by our external collaborator Max Horn (Giessen) into \GAP's master
branch during the Sage-GAP-Days-85 we organized in March 2017.
An overview of the most important changes introduced in GAP 4.9.1,
with links to corresponding GitHub entries,
can be found in the \GAP manual: \url{https://www.gap-system.org/Manuals/doc/changes/chap2.html}.
Many \GAP packages also have been updated to work in \GAP 4.9 and make use of its new features.
\subparagraph{\longtaskref{hpc}{hpc-linbox}}
\label{hpc@hpc-linbox}
The deliverable~\longdelivref{hpc}{LinBox-algo} due for this reporting period is
delivered on time. The contribution, as detailed in the report is twofold:
\begin{compactitem}
\item Several algorithmic innovations have been produced: a new matrix invariant
for rank profiles~\cite{DumPerSul:fcrpmgbd16}, a new symmetric matrix
factorization algorithm and its high performance implementation has been
proposed~\cite{DuPe18}; new representation and algorithms for quasi-separable
matrices~\cite{Pernet:cqm16,PerSto:tsegqm17}, and new developments on
interactive certificates for the security of large scale distributed
computations on unsafe
resources~\cite{DumKalTho:lticmpdsm16,DumLucPer:cftearp17}. These results have been
presented in the main venues in the domain:
the international conference ISSAC'16-17-18 and the Journal of Symbolic computation.
\item Major software developments have happened in the LinBox ecosystem, increasing
robustness, maintainability and introducing new computational features, such
as the new symmetric factorization algorithm or a parallel triangular matrix
inversion implementation. Not only have these library been closely integrated
within \Sage, but the interface has been fully rethought thanks to a broader
support of C++ in Cython.
\end{compactitem}
Deliverable~\longdelivref{hpc}{LinBox-distributed} is making good progress
thanks to the work of our \ODK engineer Hongguang Zhu. Several code prototypes
already implement a Chinese remainder based parallel rational solver on a distributed
infrastructure. We are now working on improving the communication patterns,
and on designing a parallel rational vector reconstruction algorithm, as it
has now become the bottleneck in this computation.
\subparagraph{\longtaskref{hpc}{hpc-singular}}
\label{hpc@hpc-singular}
The only deliverable under consideration for this reporting period
is~\longdelivref{hpc}{singular-polyarith}
Multivariate polynomials are represented in Singular using the sdmp format. While this data structure is generally amenable to parallelization, the implementation and some of the algorithms in Singular are not. Since the last update much work has been invested in updating the algorithms and data structures and making Singular polynomial arithmetic competitive with other systems. This work has been done in the Singular submodule Flint, whose code is available at \url{https://github.com/wbhart/flint2}. We also now support polynomial exponents of unlimited size with the three basic monomial orderings of lex, deglex, and degrevlex. Suggestions by colleagues in the HPC community including Bernard Parisse, Michael Monagan, Roman Pearce, and Micka\"el Gastineau have been invaluable.
The serial implementations of the operations of multiplication, division and GCD are complete in both the dense and sparse cases and the performance is competitive with other systems. The parallel implementation of multiplication is also complete with competitive performance as well. We are on track to have division and GCD parallelized and delivered on time.
We have also been able to parallelize polynomial root clustering, which is a
major engine in Singular that benefits from fine-grained
parallelization. Performance improvements continue to be made by Remi Imbach
(now at NYU following his ODK contract). The fully working implementation is
at \url{https://github.com/rimbach/Ccluster}.
\subparagraph{\longtaskref{hpc}{hpc-mpir}}
\label{hpc@hpc-mpir}
All deliverables for this task have been delivered at the previous
reporting period.
\subparagraph{\longtaskref{hpc}{hpc-combi}}
\label{hpc@hpc-combi}
The goal of this task is to use combinatorics as a source of challenges to
experiment on various HPC techniques. Recall that from the first reporting
period, deliverable~\longdelivref{hpc}{sage-paral-tree} successfully
implemented a MapReduce programming model on large datasets described by a
recursion tree. Since it is written purely in Python, the code doesn't
perform well when the computation in each node is short. A good technology
for handling such situations with fine grain parallelism, is
\software{Cilk++}. However since both Intel and GCC teams decided to
discontinue \software{Cilk++} support, we looked for some replacement and
also to scale from shared memory to cluster. It turns out that there isn't
currently any widespread well maintained replacement.
On the other hand, while writing our code to experiment with various
solutions, we came to realize that designing new algorithms based on the
vector primitive of the modern processor (AVX instruction set) allows to
gain large speedup ($2\times$ to $50\times$) on the handling of small
combinatorial objects. By requiring the invention of new algorithms, this
becomes definitely a research question. As a preliminary results, we started
to develop a library called HPCombi which can already speed up some
computation of \software{GAP} and \software{Sage} through the
\software{libsemigroup} library. This work resulted in a keynote talk at the
PASCO 2017 conference and an invitation to the Scottish Programming Language
Seminar 2018.
On the less experimental side, progress has been made into Sage to compute
with integer vector which are ubiquitous in combinatorics. Using Cython, we
wrote a bridge called \software{PPLPy} for the \software{PPL} library which
now provides access to the PPL library to any Python user. This package use
general linear programming techniques to enumerate integer vectors.
Improving \Sage capabilities in polytope computations and linear algebra is
critical for combinatorics. Similarly two other interfaces have been developed
to \software{LaTTe} (counting integral points in polytopes) and
\software{PolyMake} (the reference software for Polyhedral computations).
Finally, some more combinatorial objects were optimized using the Cython
technology, namely permutations and Lyndon words (which are a combinatorial
tool to deals with vectors up to a cyclic permutation).
\subparagraph{\longtaskref{hpc}{pythran}}
\label{hpc@hpc-pythran}
The goal of this task is to make Pythran easily integratable in large-scale
project, taking into account native dependencies, compilation time, memory
footprint, speed and size of compiled binaries as well as multi-platform
support. Integration with \software{cython} is a possible mean to achieve
this goal.
Two projects have been selected for this task: \software{scipy} and
\software{scikit-image}. These projects are relevant for \software{pythran}
because they have many small to medium kernels that can benefit from
compilation. Even though \software{pythran} has not been selected as a scipy
backend, the exchanges with the community have led to a great deal of
improvements of which all \software{Pythran} users take advantage. The
\software{scikit-image} community is still examining the possibility of using
\software{pythran} as an acceleration mean.
Integration of \software{pythran} as a \software{cython} backend for
\software{numpy} has improved in various aspects: better error detection,
more supported expression patterns and improved performance for the compiled
expressions.
Deliverable~\longdelivref{hpc}{sage-HPCcombi} is shared with
Task~\longlocaltaskref{hpc}{hpc-combi}, the status of which we reported on above.
\subparagraph{\longtaskref{hpc}{hpc-jupyter}}
\label{hpc@hpc-jupyter}
%% It is common for academic High Performance Computing (HPC) clusters to make
%% use of schedulers based on Sun Grid Engine with Son of Grid Engine as one of
%% the most popular. It is used, for example, on the institutional HPC systems
%% in the Universities of Sheffield and Manchester in the United Kingdom. It is also used
%% on the regional N8 HPC facility, a system shared by the eight most research
%% intensive universities in the North of England.
All deliverables for this task have been delivered at the previous
reporting period.
% LocalWords: subsubsection hpc compactitem super-optimizer vectorized factorization
% LocalWords: texttt fflas-ffpack longmilestoneref emph JupyterHub-based specialized
% LocalWords: longtaskref longdelivref delivref finalized mathbb embarassingly taskref
% LocalWords: scienceproject refactorization organized LinBox-algo DumPerSul:fcrpmgbd16
% LocalWords: Pernet:cqm16,PerSto:tsegqm17 DumKalTho:lticmpdsm16,DumLucPer:cftearp17
% LocalWords: Cython Hongguang singular-polyarith sdmp parallelization deglex degrevlex
% LocalWords: Monagan Micka parallelized parallelize Imbach hpc-mpir sage-paral-tree
% LocalWords: Cilk libsemigroup optimized pythran integratable scipy scikit-image numpy
% LocalWords: sage-HPCcombi hpc-jupyter