D5.2: Facility to compile Pythran compliant user kernels and Sage code and automatically take advantage of multi-cores and SIMD instruction units in Cython #115

minrk · 2015-09-08T20:17:22Z

WP5: High Performance Mathematical Computing
Lead Institution: Université Joseph Fourier (UJF)
Due: 2017-02-28 (month 18)
Nature: Demonstrator
Task: T5.7 (T5.7: Pythran #105) Pythran
Proposal: p.51
Final report

The Python programming language is widely used in the development of computational mathematics systems like SageMath for its expressiveness and flexibility. Yet, as an interpreted language, it suffers from inherent inefficiencies.

Over the years several tools have been developed to overcome this barrier. A major player is Cython, which is both an extension of the Python language, and a compiler generating compilable C code. At the cost of additional work from the developer (e.g. type annotations), Cython can deliver performances similar to that of a compiled language. It's being used intensively in SageMath. Another emerging player is Pythran, a Python to C++ compiler for a subset of the Python language, with a focus on scientific/numerical computing. It takes advantage of type inference features of C++ as well as multi-cores and SIMD instruction units to deliver high performance without the need for additional work from the developer. In particular, it includes a C++ implementation of a major subset of the Numpy API, optimized for speed, with support for expression templates that minimize the number of memory transfers needed to compute complex expressions (Numpy is the fundamental package for scientific computing with Python). However, unlike Cython, Pythran does not support user defined classes, a key feature in systems like SageMath.

This deliverable is a step toward taking the best of both worlds, and helping bridge the gap between numerical and exact computing. It proposes to incorporate Pythran support for Numpy within Cython, which consequently provides high performance numerical linear algebra to high level mathematical software developers, especially within SageMath.

As an illustration, the new Pythran backend in Cython achieves a speed-up of about 4 on the following typical Numpy based function:

def sqrt_sum (numpy.ndarray[FLOATTYPE_t, ndim=1] a,
              numpy.ndarray[FLOATTYPE_t, ndim=1] b):
    return numpy.sqrt(numpy.sqrt(a*a+b*b))

The text was updated successfully, but these errors were encountered:

bpilorget · 2016-05-09T12:38:39Z

Deliverable postponed to month 18 due to hiring issues.

bpilorget · 2016-11-21T16:20:27Z

@ClementPernet (WP leader and lead beneficiary)
This deliverable is due for February 2017

ClementPernet · 2017-01-19T10:08:27Z

The aim of this project was to optimize the overall performance of Cython code
that uses Numpy arrays.

Indeed, when operations are done on Numpy arrays, Cython relies on the original
Numpy package to compute them. This involves a fall back to the Python
interpreter. It thus misses several optimization opportunities, especially with
complex expressions.

The Pythran project is a Python to C++ compiler, that aims at optimizing
scientific Python. It thus supports only a subset of the Python language.
It also has a full C++ implementation of a major set of the Numpy API. Some of
the advantage of this implementation is that it supports expression templates
and SIMD instructions. Expression templates allow to "fuse" loops that can
occurs when expressions with multiple operators are computed. For instance,
the expression "a+b*c" will original be transformed by Cython in two call: one
for the multiplication of b by c, and one for the addition of the result of
this multiplication and the addition by a. Each call will end-up in one loop,
that will read memory, compute the operation and write back to memory. The
second loop will have the same pattern. In nowadays architecture, memory
bandwidth is often the limiting factor in this kind of operation. It is thus
really interesting to merge these loops, and load/store the memory only once.
Expression templating is a C++ technique that allows to evaluate expressions
only when they are stored to memory. Thus, in this case, the two loops will be
automatically "merged" by the C++ compiler, and we'll get an optimized version
of this code. Note that this technique is used for instance by the C++ wrapper
of the GMP library;

The project has been focused on using this Pythran backend for Numpy arrays in
Cython when possible. Indeed, Pythran has a few limitations regarding the Numpy
arrays it can handle:

array "views" are not supported. That means that arrays must be stored in
contiguous memory. Fortran and C-style format are supported.
the endianess of the integers must be the same that the one of the targeted
architecture (note that Cython has the same limitation)

We thus need to be able to fall-back to the Cython implementation if we are not
in one of those cases.

The integration within Cython works this way:

at the function level, for every argument that is an Numpy array and
supported by Pythran, we change its type by a fused type [1]. This fused type
is either a Pythran Numpy buffer or the original Cython buffer type
for a variable defined as a Numpy array, we change it directly to a Pythran
buffer, if type and endianess is supported.

Cython has a comprehensive suite test regarding the Numpy features it supports.
This test suite is still valid after our patch, except for some tests that were
already failing (with my setup).

A PR is on its way to Cython to integrate this into the project. We are waiting
for fixes in Pythran to be merged into its master branch
(serge-sans-paille/pythran#616 and
serge-sans-paille/pythran#628) to send it publicly.

Some benchmarks:

simple computation:

def sqrt_sum(Numpy.ndarray[FLOATTYPE_t, ndim=1] a, Numpy.ndarray[FLOATTYPE_t, ndim=1] b):
return Numpy.sqrt(Numpy.sqrt(aa+bb))

On my computer (Core i7-6700HQ), this gives there results, with an array of
100000000 32-bit floats as input:

for the classical Cython version: 960ms
for the Cython version using the Pythran backend: 761ms
for the Cython version using the Pythran backend using SIMD instructions: 243ms

which makes a speedup of ~3.9x using SIMD instructions.

this example from the Cython documentation: http://cython.readthedocs.io/en/latest/src/tutorial/Numpy.html

The input we use is the following:

N = 1000
f = np.arange(N*N, dtype=np.int).reshape((N,N))
g = np.arange(81, dtype=np.int).reshape((9, 9))

for the classical Cython version: 155ms
for the Cython version using the Pythran backend: 150ms
for the Cython version using the Pythran backend using SIMD instructions: 149ms

It seems that the Pythran backend does not manage to benefit much from SIMD
instructions in this case. We thus still got an average speedup of ~3%.

Summary of current status:

integration within Cython working
waiting for PRs to be merged within Pythran
benchmarks showing good results :)

What remains to be done:

documentation within Cython about this integration
pushing a PR to Cython (with an explanation on the dev ML)

Code:

the current Cython code can be gathered from here:
git clone --branch=feature/pythran --depth 1 ssh://git@git.geekou.info/cython-pythran
the associated Pythran version can be gathered here (waiting for PR to be merged):
git clone --branch=fix/py3_double_define --depth 1 https://github.com/aguinet/pythran

nthiery · 2017-02-06T17:51:32Z

Dear M18 deliverable leaders,

Just a reminder that reports are due for mid-february, to buy us some time for proofreading, feedback, and final submission before February 28th. See our README for details on the process.

In practice, I'll be offline February 12-19, and the week right after will be pretty busy. Therefore, it would be helpful if a first draft could be available sometime this week, so that I can have a head start reviewing it.

Thanks in advance!

nthiery · 2017-02-10T13:36:20Z

Hi @ClementPernet,
I just proofread your post above and made minor edits. It's a good start for the report. In fact, I suggest that you move the text to this issue's description.

nthiery · 2017-02-23T22:12:39Z

Hi @ClementPernet,
Thanks for the report!
Is it in final form? Did you have someone else proofread it already?
The following file is missing in the repo: benchs/float/graph.pdf.
Could you add one or two sentences to this github issue description (aka abstract for the report) about what was achieved for this deliverable?
Thanks,

nthiery · 2017-02-23T22:14:42Z

Since this seems almost ready, if you get a chance to do the above tomorrow morning, that would be great, as I could then submit it in the afternoon and tick the box to focus on the others :-)

aguinet · 2017-02-24T07:46:52Z

Hello @nthiery ,

I'll add this one in the morning, sorry about the missing file!

ClementPernet · 2017-02-24T09:21:40Z

I added the files. They were by default not added due to a gitignore rule. So I forced them to be added.
@aguinet Can I let you write up the abstract as requested by @nthiery ?

ClementPernet · 2017-02-24T10:34:23Z

The Cython PR seem faily close to be merged: cython/cython#1607
I was delaying the polishing of this report, hoping to be able to declare the code as merged, but this seems to be unfeasable. @aguinet do you confirm ?
@nthiery apart from that the report is in its final form. I meant to go through the deliverable checklist one last time to check if we missed something. I'll do it before this afternoon.

aguinet · 2017-02-25T10:49:32Z

@ClementPernet it seems complicated yes, I don't know how long it will take.

About the abstract, if I understand correctly, I have to make something shorter than what's a the top of this issue?

nthiery · 2017-02-26T22:33:54Z

Hi @aguinet,
Sorry, I missed your call for info. Please expand a bit further the abstract (i.e. the first comment on this issue description): What Pythran is (done), what Cython is, what's been achieved in this deliverable to fuse their feature sets, how it is relevant to OpenDreamKit.
Thanks!

ClementPernet · 2017-02-27T12:58:00Z

@aguinet What's the status of the abstract ? We need to polish the deliverable today. Keep us updated.

aguinet · 2017-02-27T14:28:25Z

@ClementPernet @nthiery

I added some notes about what's Cython and its usage within SageMath and OpenDreamKit.

The aim of this project was to optimize the overall performance of Cython code
that uses Numpy arrays.

Cython is a compiler for the Python and Cython language . It basically converts
Python code to a C/C++ module that calls directly the C Python API. The Cython
language is an extension of Python that allows to write C-like code within a
Python module. This code is directly inserted into the generated module. This
allows to write C python module with the flexibility of a Python script. This
is used in numerous Python project, with SageMath being one of them.

In Cython, when operations are done on Numpy arrays, Cython relies on the
original Numpy package to compute them. This involves a fall back to the Python
interpreter. It thus misses several optimisation opportunities, espacially with
complex expressions. Optimizing these Numpy calls can improve the overall
performances of applications that rely on Numpy to do their computation work.
The interest for the OpenDreamKit project is the overall performance gain that
could be achieve within various SageMath workload.

The Pythran project is a Python to C++ compiler, that aims at optimizing
scientific Python. It thus supports only a subset of the Python language.
It also has a full C++ implementation of a major set of the Numpy API. Some of
the advantage of this implementation is that it supports expression templates
and SIMD instructions. Expression templates allow to "fuse" loops that can
occurs when expressions with multiple operators are computed. For instance,
the expression "a+b*c" will original be transformed by Cython in two call: one
for the multiplication of b by c, and one for the addition of the result of
this multiplication and the addition by a. Each call will end-up in one loop,
that will read memory, compute the operation and write back to memory. The
second loop will have the same pattern. In nowadays architecture, memory
bandwith is often the limiting factor in this kind of operation. It is thus
really intereseting to merge these loops, and load/store the memory only once.
Expression templating is a C++ technique that allows to evalute expressions
only when they are stored to memory. Thus, in this case, the two loops will be
automatically "merged" by the C++ compiler, and we'll get an optimized version
of this code. Note that this technique is used for instance by the C++ wrapper
of the GMP library;

The project has been focused on using this Pythran backend for numpy arrays in
Cython when possible. Indeed, Pythran has a few limitations regarding the numpy
arrays it can handle:

* array "views" are not supported. That means that arrays must be stored in
  conitguous memory. Fortran and C-style format are supported.
* the endianess of the integers must be the same that the one of the targeted
  architecture (note that Cython has the same limitation)

We thus need to be able to fall-back to the Cython implementation if we are not
in one of those cases.

The integration within Cython works this way:

* at the function level, for every argument that is an numpy array and
  supported by Pythran, we change its type by a fused type [1]. This fused type
  is whether a Pythran numpy buffer or the original Cython buffer type
* for variable defined as numpy array, we change them directly to a Pythran
  buffer, if type and endianess is supported. 

Cython has a comprehensive suite test regarding the Numpy features it supports.
This test suite is still valid after our patch, except for some tests that was
already failing (with my setup).

A PR is on its way to Cython to integrate this into the project. We are waiting
for fixes in Pythran to be merged into its master branch
(https://github.com/serge-sans-paille/pythran/pull/616 and
https://github.com/serge-sans-paille/pythran/pull/628) to send it publicly.

Some benchmarks: 
----------------

* simple computation:

def sqrt_sum(numpy.ndarray[FLOATTYPE_t, ndim=1] a, numpy.ndarray[FLOATTYPE_t, ndim=1] b):
    return numpy.sqrt(numpy.sqrt(a*a+b*b))

On my computer (Core i7-6700HQ), this gives there results, with an array of
100000000 32-bit floats as input:

- for the classical Cython version: 960ms
- for the Cython version using the Pythran backend: 761ms
- for the Cython version using the Pythran backend using SIMD instructions: 243ms

which makes a speedup of ~3.9x using SIMD instructions.

* this example from the Cython documentation:
  http://cython.readthedocs.io/en/latest/src/tutorial/numpy.html

The input we use is the following:

N = 1000
f = np.arange(N*N, dtype=np.int).reshape((N,N))
g = np.arange(81, dtype=np.int).reshape((9, 9))

- for the classical Cython version: 155ms
- for the Cython version using the Pythran backend: 150ms
- for the Cython version using the Pythran backend using SIMD instructions: 149ms

It seems the Pythran backend don't manage to benefit a lot from SIMD
instructions in this case. We thus still got an average speedup of ~3%.

Sage integration:
-----------------

Further work needs to be done to test the overall benefit of these improvements
within Sage.

Summary of current status:
--------------------------

A PR has been pushed to Cython that is currently being reviewed :
https://github.com/cython/cython/pull/1607.

The latest version of the master banch of Pythran needs to be used to test
this.

[1] https://wiki.sagemath.org/Cython

nthiery · 2017-02-27T18:48:37Z

Thanks @aguinet; the above text reads nicely. I am uncomfortable though: in good parts this text duplicates the content the report without being quite shorter; so that does not fits the need (remember: the text is meant to appear at the beginning of the report, playing the role of an abstract).
Can you or @ClementPernet make a shorter version of this text (maybe 4-5 paragraphs?), and move whatever content that would not yet be in report.tex in there?
Worst case, this deliverable can be submitted tomorrow. However, in case that can be done tonight, that would be good so that I can focus on the remaining deliverables tomorrow afternoon (I am teaching in the morning).
Thanks in advance,
Nicolas

aguinet · 2017-02-27T19:33:44Z

Okay i'll do that tonight!

ClementPernet · 2017-02-27T19:38:36Z

I merged new content of Adrien's post to the deliverable. I will write up an abstract later tonight.

ClementPernet · 2017-02-27T19:39:56Z

Ok @aguinet , send me your abstract when it's ready and I'll merge everything together.

ClementPernet · 2017-02-27T21:46:34Z

@nthiery : Updated version of the report pushed. Ready for proof-reading.

nthiery · 2017-02-27T23:09:42Z

Thanks @ClementPernet, @aguinet.

I moved your material from github-issue-description.tex into, well, the github issue description :-), and allowed myself to work some more on it. I reverted the lead to "UJF", since the change of name is not yet in the grant agreement, and this resulted in the report showing both affiliation. I also added @aguinet as coauthor; on other occasions, I would have prefereed to ask for confirmation, but I needed to move fast :-)

The report is now submitted to the EU!

Congrats @aguinet on this first contribution, and thanks @aguinet and @ClementPernet for this deliverable! I had been dreaming about having the best of Cythan and Pythran worlds at my finger tip, and you made it a reality. I am looking forward further collaborations between those two projects.

minrk added this to the D5.2 milestone Sep 8, 2015

minrk added UGA deliverable WP5 labels Sep 8, 2015

minrk assigned ClementPernet Nov 3, 2015

bpilorget modified the milestones: Month 9: 2016-05-31, Month 18: 2017-02-28 May 9, 2016

minrk mentioned this issue Feb 10, 2017

T5.7: Pythran #105

Closed

ClementPernet added the Pythran label Feb 16, 2017

nthiery closed this as completed Feb 27, 2017

nthiery added the ReportingPeriod1 label May 2, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

D5.2: Facility to compile Pythran compliant user kernels and Sage code and automatically take advantage of multi-cores and SIMD instruction units in Cython #115

D5.2: Facility to compile Pythran compliant user kernels and Sage code and automatically take advantage of multi-cores and SIMD instruction units in Cython #115

minrk commented Sep 8, 2015 •

edited by nthiery

bpilorget commented May 9, 2016

bpilorget commented Nov 21, 2016

ClementPernet commented Jan 19, 2017 •

edited by nthiery

nthiery commented Feb 6, 2017

nthiery commented Feb 10, 2017

nthiery commented Feb 23, 2017

nthiery commented Feb 23, 2017

aguinet commented Feb 24, 2017

ClementPernet commented Feb 24, 2017

ClementPernet commented Feb 24, 2017

aguinet commented Feb 25, 2017

nthiery commented Feb 26, 2017 •

edited

ClementPernet commented Feb 27, 2017

aguinet commented Feb 27, 2017

nthiery commented Feb 27, 2017

aguinet commented Feb 27, 2017

ClementPernet commented Feb 27, 2017

ClementPernet commented Feb 27, 2017

ClementPernet commented Feb 27, 2017 •

edited

nthiery commented Feb 27, 2017

D5.2: Facility to compile Pythran compliant user kernels and Sage code and automatically take advantage of multi-cores and SIMD instruction units in Cython #115

D5.2: Facility to compile Pythran compliant user kernels and Sage code and automatically take advantage of multi-cores and SIMD instruction units in Cython #115

Comments

minrk commented Sep 8, 2015 • edited by nthiery

bpilorget commented May 9, 2016

bpilorget commented Nov 21, 2016

ClementPernet commented Jan 19, 2017 • edited by nthiery

nthiery commented Feb 6, 2017

nthiery commented Feb 10, 2017

nthiery commented Feb 23, 2017

nthiery commented Feb 23, 2017

aguinet commented Feb 24, 2017

ClementPernet commented Feb 24, 2017

ClementPernet commented Feb 24, 2017

aguinet commented Feb 25, 2017

nthiery commented Feb 26, 2017 • edited

ClementPernet commented Feb 27, 2017

aguinet commented Feb 27, 2017

nthiery commented Feb 27, 2017

aguinet commented Feb 27, 2017

ClementPernet commented Feb 27, 2017

ClementPernet commented Feb 27, 2017

ClementPernet commented Feb 27, 2017 • edited

nthiery commented Feb 27, 2017

minrk commented Sep 8, 2015 •

edited by nthiery

ClementPernet commented Jan 19, 2017 •

edited by nthiery

nthiery commented Feb 26, 2017 •

edited

ClementPernet commented Feb 27, 2017 •

edited