Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

D5.2: Facility to compile Pythran compliant user kernels and Sage code and automatically take advantage of multi-cores and SIMD instruction units in Cython #115

Closed
minrk opened this issue Sep 8, 2015 · 20 comments

Comments

@minrk
Copy link
Contributor

minrk commented Sep 8, 2015

The Python programming language is widely used in the development of computational mathematics systems like SageMath for its expressiveness and flexibility. Yet, as an interpreted language, it suffers from inherent inefficiencies.

Over the years several tools have been developed to overcome this barrier. A major player is Cython, which is both an extension of the Python language, and a compiler generating compilable C code. At the cost of additional work from the developer (e.g. type annotations), Cython can deliver performances similar to that of a compiled language. It's being used intensively in SageMath. Another emerging player is Pythran, a Python to C++ compiler for a subset of the Python language, with a focus on scientific/numerical computing. It takes advantage of type inference features of C++ as well as multi-cores and SIMD instruction units to deliver high performance without the need for additional work from the developer. In particular, it includes a C++ implementation of a major subset of the Numpy API, optimized for speed, with support for expression templates that minimize the number of memory transfers needed to compute complex expressions (Numpy is the fundamental package for scientific computing with Python). However, unlike Cython, Pythran does not support user defined classes, a key feature in systems like SageMath.

This deliverable is a step toward taking the best of both worlds, and helping bridge the gap between numerical and exact computing. It proposes to incorporate Pythran support for Numpy within Cython, which consequently provides high performance numerical linear algebra to high level mathematical software developers, especially within SageMath.

As an illustration, the new Pythran backend in Cython achieves a speed-up of about 4 on the following typical Numpy based function:

def sqrt_sum (numpy.ndarray[FLOATTYPE_t, ndim=1] a,
              numpy.ndarray[FLOATTYPE_t, ndim=1] b):
    return numpy.sqrt(numpy.sqrt(a*a+b*b))
@bpilorget
Copy link
Contributor

Deliverable postponed to month 18 due to hiring issues.

@bpilorget
Copy link
Contributor

@ClementPernet (WP leader and lead beneficiary)
This deliverable is due for February 2017

@ClementPernet
Copy link
Contributor

ClementPernet commented Jan 19, 2017

The aim of this project was to optimize the overall performance of Cython code
that uses Numpy arrays.

Indeed, when operations are done on Numpy arrays, Cython relies on the original
Numpy package to compute them. This involves a fall back to the Python
interpreter. It thus misses several optimization opportunities, especially with
complex expressions.

The Pythran project is a Python to C++ compiler, that aims at optimizing
scientific Python. It thus supports only a subset of the Python language.
It also has a full C++ implementation of a major set of the Numpy API. Some of
the advantage of this implementation is that it supports expression templates
and SIMD instructions. Expression templates allow to "fuse" loops that can
occurs when expressions with multiple operators are computed. For instance,
the expression "a+b*c" will original be transformed by Cython in two call: one
for the multiplication of b by c, and one for the addition of the result of
this multiplication and the addition by a. Each call will end-up in one loop,
that will read memory, compute the operation and write back to memory. The
second loop will have the same pattern. In nowadays architecture, memory
bandwidth is often the limiting factor in this kind of operation. It is thus
really interesting to merge these loops, and load/store the memory only once.
Expression templating is a C++ technique that allows to evaluate expressions
only when they are stored to memory. Thus, in this case, the two loops will be
automatically "merged" by the C++ compiler, and we'll get an optimized version
of this code. Note that this technique is used for instance by the C++ wrapper
of the GMP library;

The project has been focused on using this Pythran backend for Numpy arrays in
Cython when possible. Indeed, Pythran has a few limitations regarding the Numpy
arrays it can handle:

  • array "views" are not supported. That means that arrays must be stored in
    contiguous memory. Fortran and C-style format are supported.
  • the endianess of the integers must be the same that the one of the targeted
    architecture (note that Cython has the same limitation)

We thus need to be able to fall-back to the Cython implementation if we are not
in one of those cases.

The integration within Cython works this way:

  • at the function level, for every argument that is an Numpy array and
    supported by Pythran, we change its type by a fused type [1]. This fused type
    is either a Pythran Numpy buffer or the original Cython buffer type
  • for a variable defined as a Numpy array, we change it directly to a Pythran
    buffer, if type and endianess is supported.

Cython has a comprehensive suite test regarding the Numpy features it supports.
This test suite is still valid after our patch, except for some tests that were
already failing (with my setup).

A PR is on its way to Cython to integrate this into the project. We are waiting
for fixes in Pythran to be merged into its master branch
(serge-sans-paille/pythran#616 and
serge-sans-paille/pythran#628) to send it publicly.

Some benchmarks:

  • simple computation:

def sqrt_sum(Numpy.ndarray[FLOATTYPE_t, ndim=1] a, Numpy.ndarray[FLOATTYPE_t, ndim=1] b):
return Numpy.sqrt(Numpy.sqrt(aa+bb))

On my computer (Core i7-6700HQ), this gives there results, with an array of
100000000 32-bit floats as input:

  • for the classical Cython version: 960ms
  • for the Cython version using the Pythran backend: 761ms
  • for the Cython version using the Pythran backend using SIMD instructions: 243ms

which makes a speedup of ~3.9x using SIMD instructions.

The input we use is the following:

N = 1000
f = np.arange(N*N, dtype=np.int).reshape((N,N))
g = np.arange(81, dtype=np.int).reshape((9, 9))

  • for the classical Cython version: 155ms
  • for the Cython version using the Pythran backend: 150ms
  • for the Cython version using the Pythran backend using SIMD instructions: 149ms

It seems that the Pythran backend does not manage to benefit much from SIMD
instructions in this case. We thus still got an average speedup of ~3%.

Summary of current status:

  • integration within Cython working
  • waiting for PRs to be merged within Pythran
  • benchmarks showing good results :)

What remains to be done:

  • documentation within Cython about this integration
  • pushing a PR to Cython (with an explanation on the dev ML)

Code:

  • the current Cython code can be gathered from here:
    git clone --branch=feature/pythran --depth 1 ssh://git@git.geekou.info/cython-pythran

  • the associated Pythran version can be gathered here (waiting for PR to be merged):
    git clone --branch=fix/py3_double_define --depth 1 https://github.com/aguinet/pythran

@nthiery
Copy link
Contributor

nthiery commented Feb 6, 2017

Dear M18 deliverable leaders,

Just a reminder that reports are due for mid-february, to buy us some time for proofreading, feedback, and final submission before February 28th. See our README for details on the process.

In practice, I'll be offline February 12-19, and the week right after will be pretty busy. Therefore, it would be helpful if a first draft could be available sometime this week, so that I can have a head start reviewing it.

Thanks in advance!

@minrk minrk mentioned this issue Feb 10, 2017
@nthiery
Copy link
Contributor

nthiery commented Feb 10, 2017

Hi @ClementPernet,
I just proofread your post above and made minor edits. It's a good start for the report. In fact, I suggest that you move the text to this issue's description.

@nthiery
Copy link
Contributor

nthiery commented Feb 23, 2017

Hi @ClementPernet,
Thanks for the report!
Is it in final form? Did you have someone else proofread it already?
The following file is missing in the repo: benchs/float/graph.pdf.
Could you add one or two sentences to this github issue description (aka abstract for the report) about what was achieved for this deliverable?
Thanks,

@nthiery
Copy link
Contributor

nthiery commented Feb 23, 2017

Since this seems almost ready, if you get a chance to do the above tomorrow morning, that would be great, as I could then submit it in the afternoon and tick the box to focus on the others :-)

@aguinet
Copy link

aguinet commented Feb 24, 2017

Hello @nthiery ,

I'll add this one in the morning, sorry about the missing file!

@ClementPernet
Copy link
Contributor

I added the files. They were by default not added due to a gitignore rule. So I forced them to be added.
@aguinet Can I let you write up the abstract as requested by @nthiery ?

@ClementPernet
Copy link
Contributor

The Cython PR seem faily close to be merged: cython/cython#1607
I was delaying the polishing of this report, hoping to be able to declare the code as merged, but this seems to be unfeasable. @aguinet do you confirm ?
@nthiery apart from that the report is in its final form. I meant to go through the deliverable checklist one last time to check if we missed something. I'll do it before this afternoon.

@aguinet
Copy link

aguinet commented Feb 25, 2017

@ClementPernet it seems complicated yes, I don't know how long it will take.

About the abstract, if I understand correctly, I have to make something shorter than what's a the top of this issue?

@nthiery
Copy link
Contributor

nthiery commented Feb 26, 2017

Hi @aguinet,
Sorry, I missed your call for info. Please expand a bit further the abstract (i.e. the first comment on this issue description): What Pythran is (done), what Cython is, what's been achieved in this deliverable to fuse their feature sets, how it is relevant to OpenDreamKit.
Thanks!

@ClementPernet
Copy link
Contributor

@aguinet What's the status of the abstract ? We need to polish the deliverable today. Keep us updated.

@aguinet
Copy link

aguinet commented Feb 27, 2017

@ClementPernet @nthiery

I added some notes about what's Cython and its usage within SageMath and OpenDreamKit.

The aim of this project was to optimize the overall performance of Cython code
that uses Numpy arrays.

Cython is a compiler for the Python and Cython language . It basically converts
Python code to a C/C++ module that calls directly the C Python API. The Cython
language is an extension of Python that allows to write C-like code within a
Python module. This code is directly inserted into the generated module. This
allows to write C python module with the flexibility of a Python script. This
is used in numerous Python project, with SageMath being one of them.

In Cython, when operations are done on Numpy arrays, Cython relies on the
original Numpy package to compute them. This involves a fall back to the Python
interpreter. It thus misses several optimisation opportunities, espacially with
complex expressions. Optimizing these Numpy calls can improve the overall
performances of applications that rely on Numpy to do their computation work.
The interest for the OpenDreamKit project is the overall performance gain that
could be achieve within various SageMath workload.

The Pythran project is a Python to C++ compiler, that aims at optimizing
scientific Python. It thus supports only a subset of the Python language.
It also has a full C++ implementation of a major set of the Numpy API. Some of
the advantage of this implementation is that it supports expression templates
and SIMD instructions. Expression templates allow to "fuse" loops that can
occurs when expressions with multiple operators are computed. For instance,
the expression "a+b*c" will original be transformed by Cython in two call: one
for the multiplication of b by c, and one for the addition of the result of
this multiplication and the addition by a. Each call will end-up in one loop,
that will read memory, compute the operation and write back to memory. The
second loop will have the same pattern. In nowadays architecture, memory
bandwith is often the limiting factor in this kind of operation. It is thus
really intereseting to merge these loops, and load/store the memory only once.
Expression templating is a C++ technique that allows to evalute expressions
only when they are stored to memory. Thus, in this case, the two loops will be
automatically "merged" by the C++ compiler, and we'll get an optimized version
of this code. Note that this technique is used for instance by the C++ wrapper
of the GMP library;

The project has been focused on using this Pythran backend for numpy arrays in
Cython when possible. Indeed, Pythran has a few limitations regarding the numpy
arrays it can handle:

* array "views" are not supported. That means that arrays must be stored in
  conitguous memory. Fortran and C-style format are supported.
* the endianess of the integers must be the same that the one of the targeted
  architecture (note that Cython has the same limitation)

We thus need to be able to fall-back to the Cython implementation if we are not
in one of those cases.

The integration within Cython works this way:

* at the function level, for every argument that is an numpy array and
  supported by Pythran, we change its type by a fused type [1]. This fused type
  is whether a Pythran numpy buffer or the original Cython buffer type
* for variable defined as numpy array, we change them directly to a Pythran
  buffer, if type and endianess is supported. 

Cython has a comprehensive suite test regarding the Numpy features it supports.
This test suite is still valid after our patch, except for some tests that was
already failing (with my setup).

A PR is on its way to Cython to integrate this into the project. We are waiting
for fixes in Pythran to be merged into its master branch
(https://github.com/serge-sans-paille/pythran/pull/616 and
https://github.com/serge-sans-paille/pythran/pull/628) to send it publicly.

Some benchmarks: 
----------------

* simple computation:

def sqrt_sum(numpy.ndarray[FLOATTYPE_t, ndim=1] a, numpy.ndarray[FLOATTYPE_t, ndim=1] b):
    return numpy.sqrt(numpy.sqrt(a*a+b*b))

On my computer (Core i7-6700HQ), this gives there results, with an array of
100000000 32-bit floats as input:

- for the classical Cython version: 960ms
- for the Cython version using the Pythran backend: 761ms
- for the Cython version using the Pythran backend using SIMD instructions: 243ms

which makes a speedup of ~3.9x using SIMD instructions.

* this example from the Cython documentation:
  http://cython.readthedocs.io/en/latest/src/tutorial/numpy.html

The input we use is the following:

N = 1000
f = np.arange(N*N, dtype=np.int).reshape((N,N))
g = np.arange(81, dtype=np.int).reshape((9, 9))

- for the classical Cython version: 155ms
- for the Cython version using the Pythran backend: 150ms
- for the Cython version using the Pythran backend using SIMD instructions: 149ms

It seems the Pythran backend don't manage to benefit a lot from SIMD
instructions in this case. We thus still got an average speedup of ~3%.

Sage integration:
-----------------

Further work needs to be done to test the overall benefit of these improvements
within Sage.

Summary of current status:
--------------------------

A PR has been pushed to Cython that is currently being reviewed :
https://github.com/cython/cython/pull/1607.

The latest version of the master banch of Pythran needs to be used to test
this.

[1] https://wiki.sagemath.org/Cython

@nthiery
Copy link
Contributor

nthiery commented Feb 27, 2017

Thanks @aguinet; the above text reads nicely. I am uncomfortable though: in good parts this text duplicates the content the report without being quite shorter; so that does not fits the need (remember: the text is meant to appear at the beginning of the report, playing the role of an abstract).
Can you or @ClementPernet make a shorter version of this text (maybe 4-5 paragraphs?), and move whatever content that would not yet be in report.tex in there?
Worst case, this deliverable can be submitted tomorrow. However, in case that can be done tonight, that would be good so that I can focus on the remaining deliverables tomorrow afternoon (I am teaching in the morning).
Thanks in advance,
Nicolas

@aguinet
Copy link

aguinet commented Feb 27, 2017

Okay i'll do that tonight!

@ClementPernet
Copy link
Contributor

I merged new content of Adrien's post to the deliverable. I will write up an abstract later tonight.

@ClementPernet
Copy link
Contributor

Ok @aguinet , send me your abstract when it's ready and I'll merge everything together.

@ClementPernet
Copy link
Contributor

ClementPernet commented Feb 27, 2017

@nthiery : Updated version of the report pushed. Ready for proof-reading.

@nthiery
Copy link
Contributor

nthiery commented Feb 27, 2017

Thanks @ClementPernet, @aguinet.

I moved your material from github-issue-description.tex into, well, the github issue description :-), and allowed myself to work some more on it. I reverted the lead to "UJF", since the change of name is not yet in the grant agreement, and this resulted in the report showing both affiliation. I also added @aguinet as coauthor; on other occasions, I would have prefereed to ask for confirmation, but I needed to move fast :-)

The report is now submitted to the EU!

Congrats @aguinet on this first contribution, and thanks @aguinet and @ClementPernet for this deliverable! I had been dreaming about having the best of Cythan and Pythran worlds at my finger tip, and you made it a reality. I am looking forward further collaborations between those two projects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants