
Conversation

toto6
Contributor

toto6 commented Sep 13, 2017

  • Replace cudamat with cupy for the GPU implementations (cupy is still under active development, while cudamat is not)
  • Use the new DA class instead of the old deprecated one

TODO for another PR:

  • Performance is still a bit lower than with cudamat (though better than the CPU for large matrices). Some speedups should be possible by tweaking the code

@Slasnista
Contributor

Hi @aje,

I have seen that you have just updated the GPU code and I look forward to using it!

Here are some suggestions about GPU computing, maybe for a next PR:

  1. What do you think about merging the gpu.da and the standard da scripts? We could just add a boolean parameter in the init of each OTDA class to place the computations on the GPU.

  2. In the same direction, what about keeping the same organization of modules, functions and classes across the repo, and using a boolean parameter for each function to specify whether the computation should run on the GPU or on the CPU, as in deep learning libraries (where you specify the device to use: tensorflow, theano and pytorch rely on this)?

Let me know if you want to talk about it directly. We could plan a Skype / Hangout ;)

@ncourty
Collaborator

ncourty commented Sep 14, 2017

Hi @aje and @Slasnista, thanks for your work on POT! It seems to me that the direction pointed out by @Slasnista is the right one for GPU integration in POT: seamless integration with a boolean that can either be set upon installation or triggered by a test on the availability of cupy. However, I think it would be better to do this in another PR.

@toto6
Contributor Author

toto6 commented Sep 14, 2017

I agree it would be better to have a boolean than the current version where I directly duplicate code. I'm not sure how it could be done, as I'm not familiar with other Python libraries using the GPU. We should definitely do a Skype about it.

@rflamary
Collaborator

Hello

The way I see it, this can be solved rather elegantly with the following functions in utils.py:

import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = False


def get_array_module(*args):
    """ return the proper module (numpy or cupy) depending on the input arrays """
    if cp:
        return cp.get_array_module(*args)
    else:
        return np


def to_gpu(*args):
    """ upload numpy arrays to the gpu and return them """
    return (cp.asarray(x) for x in args)


def is_gpu(np):
    """ test if a module is cupy or numpy """
    if 'cupy' in np.__name__:
        return True
    else:
        return False

get_array_module returns the module that should be used (numpy or cupy), depending on whether cupy is installed and whether the arrays you give it are already on the GPU.

The second function transfers all the arrays given to it to the GPU.

The last one allows for a quick test to know whether a module is numpy or cupy.

Now you just need to add, at the beginning of any function (for instance sinkhorn_knopp):

def sinkhorn_knopp(a, b, M, reg, numItermax=1000, stopThr=1e-9, verbose=False, log=False, **kwargs):

    np=get_array_module(a,b,M)
    
    a = np.asarray(a, dtype=np.float64) # probably to change depending on is_gpu(np)
    b = np.asarray(b, dtype=np.float64)
    M = np.asarray(M, dtype=np.float64)

This will automatically replace the local np and use either numpy or cupy.

This means that the user obviously has to take care of sending the arrays to the GPU, or else we add a parameter force_gpu that will force its use when cupy is installed:

if force_gpu:
    a, b, M = to_gpu(a, b, M)
np = get_array_module(a, b, M)

This is the main strength of cupy (being mostly compatible with numpy) and we should take advantage of it instead of re-implementing everything.

Finally, note that my proposal does not break when cupy is not installed; it will live happily using numpy locally.

@Slasnista
Contributor

I really like your solution @rflamary. It should work with every function, shouldn't it?

We might be able to do this with a decorator applied to any function. Then we would just have to make sure that the tests pass on the GPU too. Does it sound tractable or am I forgetting some issue?

@toto6
Contributor Author

toto6 commented Sep 14, 2017

It seems like a good solution to me. I will try to make it work.

@rflamary
Collaborator

@Slasnista I don't know about the decorator, how do you handle the local np with it?

I indeed think it should work with every function that uses the numpy subset implemented in cupy.

@aje I think we should also handle the case when cupy is not installed in to_gpu:

def to_gpu(*args):
    """ upload numpy arrays to the gpu and return them """
    if cp:
        return (cp.asarray(x) for x in args)
    else:
        return args

Now the code does not explode with force_gpu. Maybe we should add a warning to tell the user that cupy is not installed and that the default backend is used...

Add a function pairwiseEuclidean that can be used with numpy or cupy. cupy (GPU) is used if the parameter gpu==True and cupy is available; otherwise the computation is done with numpy.
This function is faster than scipy.spatial.distance.cdist for sqeuclidean even when computing on the CPU (numpy).
TODO:
 - add a parameter "gpu" in the init of all classes extending BaseTransport
 - pass the parameter "gpu" to the function pairwiseEuclidean
 - change the function sinkhorn_knopp in bregman.py to use cupy or numpy
 - change the function sinkhorn_lpl1_mm in da.py to use cupy or numpy
 - same for the other functions...
- modified the sinkhorn_knopp code to run on numpy or cupy depending on the type of the input matrices
- at the moment the GPU version is barely faster than the CPU version; with the test I added I obtain these results:
```
Normal, time:   4.96 sec
   GPU, time:   4.65 sec
```

- TODO:
 - improve the performance of sinkhorn_knopp on GPU
 - add GPU support for LpL1
Before
```
Normal, time:   4.96 sec
   GPU, time:   4.65 sec
```
After
```
Normal, time:   4.21 sec
   GPU, time:   3.45 sec
```
Before
```
Normal, time:   4.21 sec
   GPU, time:   3.45 sec
```
After
```
Normal, time:   3.70 sec
   GPU, time:   2.65 sec
```
@Slasnista
Contributor

@rflamary, a to_gpu decorator would allow you to do something like this to handle the CPU <-> GPU conversion:

@to_gpu
def ot_function(X, y, reg=1., verbose=False):
    """ot_function
    """

    ...

    return Coupling

If the decorator is well designed, it may be able to turn any numpy array given as an argument into a cupy array, and also transform the returned cupy array back into a numpy array after the function has run.

After that, we would just have to put this decorator on every function that performs a heavy computation that would be sped up on a GPU. Does it sound good to you?
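A minimal sketch of what such a decorator could look like (assuming cupy may or may not be installed; the actual implementation may differ):

```python
import functools

import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None


def to_gpu(func):
    """Move numpy array arguments to the GPU before the call and bring a
    cupy result back to numpy afterwards (sketch)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if cp is not None:
            args = tuple(cp.asarray(a) if isinstance(a, np.ndarray) else a
                         for a in args)
        out = func(*args, **kwargs)
        if cp is not None and isinstance(out, cp.ndarray):
            out = cp.asnumpy(out)
        return out
    return wrapper
```

The back-conversion to numpy at the end is what introduces a per-call overhead, which is the design point discussed next.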

@rflamary
Collaborator

rflamary commented Sep 20, 2017

Hello,
@Slasnista

  • I agree that a decorator is both elegant and the way to go for handling the upload to the GPU. Still, I don't want to convert back to a numpy array, since it introduces an overhead every time you call the function that the user might not need. My opinion is that if the user wants to use the GPU, he can handle GPU arrays in the output. Still, one could add a parameter to_array=False that says whether the array should be converted automatically.
  • the decorator can only handle the array conversion; since we still need to get the local backend inside the function, it does not spare us all the modifications.

@aje

  • I'm not convinced by the use of xp instead of np. This makes the function non-standard numpy, and if a user wants to use the function somewhere else by removing the xp=get... line, it would force him to import numpy as xp (highly irregular). I also think it is a good idea to keep np to keep the code easy to read for outsiders; maybe a simple comment above the line can say ("overwrite the backend with numpy or cupy depending on the array format").
  • could you try to code the decorator for loading the arrays onto the GPU instead of handling it inside the function, as proposed by @Slasnista? If not, maybe @Slasnista can propose an implementation.

Improve the benchmark comparison script between CPU and GPU.
For me it now outputs the following:
        scipy's cdist, time:  10.28 sec
pairwiseEuclidean CPU, time:   0.63 sec
pairwiseEuclidean GPU, time:   0.33 sec
Sinkhorn CPU, time:   3.58 sec
Sinkhorn GPU, time:   2.63 sec
@rflamary
Collaborator

Also, I think we should have default dtypes for GPU and CPU. The CPU can handle float64 very easily, but the GPU has a very large overhead with it (which explains the limited performance gain of cupy for the moment). I'm pretty sure that using float32 would lead to a tremendous computational gain (at the cost of less numerical stability, obviously).

Maybe we should let the user define the dtype through the array passed and not force it inside the sinkhorn function.
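As an illustration of "not forcing the dtype", here is a hypothetical helper (not an existing POT function) that keeps whatever floating dtype the caller passed and only promotes non-float inputs:

```python
import numpy as np


def as_float(a, default=np.float64):
    """Keep the caller's floating dtype (e.g. float32 for GPU arrays) and
    only promote integer/boolean inputs to the default (sketch)."""
    a = np.asarray(a)
    return a if a.dtype.kind == 'f' else a.astype(default)
```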

@toto6
Contributor Author

toto6 commented Sep 20, 2017

I don't understand this decorator thing; I have never used one.

@toto6
Contributor Author

toto6 commented Sep 20, 2017

Concerning the overhead with float64 or float32, I think this is not the main cause of the bad performance here.

I remember that with cudamat I had significantly better performance (and with float64). The timings with cudamat were the following:
5000
Normal sinkhorn, time: 1.21 sec
min:5.251E-12 max::8.839E-05 mean::1.600E-07 std::5.317E-07
GPU sinkhorn, time: 0.09 sec
min:5.251E-12 max::8.839E-05 mean::1.600E-07 std::5.317E-07

10000
Normal sinkhorn, time: 4.37 sec
min:4.508E-13 max::4.601E-05 mean::4.000E-08 std::1.366E-07
GPU sinkhorn, time: 0.25 sec
min:4.508E-13 max::4.601E-05 mean::4.000E-08 std::1.366E-07

15000
Normal sinkhorn, time: 10.68 sec
min:1.696E-13 max::2.031E-05 mean::1.778E-08 std::6.191E-08
GPU sinkhorn, time: 0.49 sec
min:1.696E-13 max::2.031E-05 mean::1.778E-08 std::6.191E-08

20000
Normal sinkhorn, time: 22.55 sec
min:1.144E-13 max::1.383E-05 mean::1.000E-08 std::3.506E-08
GPU sinkhorn, time: 0.89 sec
min:1.144E-13 max::1.383E-05 mean::1.000E-08 std::3.506E-08

25000
Normal sinkhorn, time: 32.90 sec
min:8.452E-14 max::8.882E-06 mean::6.400E-09 std::2.248E-08
GPU sinkhorn, time: 1.39 sec
min:8.452E-14 max::8.882E-06 mean::6.400E-09 std::2.248E-08

30000
Normal sinkhorn, time: 47.70 sec
min:6.306E-14 max::9.280E-06 mean::4.444E-09 std::1.572E-08
GPU sinkhorn, time: 1.93 sec
min:6.306E-14 max::9.280E-06 mean::4.444E-09 std::1.572E-08

@rflamary
Collaborator

@aje I agree with you, this difference in performance is clearly a problem and I'm disappointed by the performance of cupy so far. I will try it on my machine to see whether the gain is still so small on a different GPU.

@rflamary
Collaborator

Hello @aje

I just tested cupy a bit with the following script:

import numpy as np
import matplotlib.pylab as pl
import cupy as cp
import ot

#%% parameters and data generation
n=5000 # nb samples
tp=np.float64

A=np.random.randn(n,n).astype(tp)
B=np.random.randn(n,n).astype(tp)

A_gpu=cp.asarray(A)
B_gpu=cp.asarray(B)

ot.tic()
A.dot(B)
ot.toc()

ot.tic()
A_gpu.dot(B_gpu)
ot.toc()

and got the following timings for float32:

Elapsed time : 0.601469993591 s
Elapsed time : 0.0522150993347 s

and for float64:

Elapsed time : 1.05664110184 s
Elapsed time : 1.37450909615 s

I've got a Titan X that is a bit old and sucks at float64, but I think it comes mainly from the fact that cupy is part of chainer, which is a deep learning framework, and they simply don't care about float64 (they even talk about going 16 or 8 bit in neural networks nowadays).

@toto6
Contributor Author

toto6 commented Sep 21, 2017

Ah, I have similar results. I will try to add a parameter to choose which float type (32 or 64) to use in the computation.

@rflamary
Collaborator

rflamary commented May 9, 2018

Hello everyone and sorry for the long wait.

I have been playing with this PR now that I have a working GPU again and you can see a few modifications here:
https://github.com/rflamary/POT/tree/pr32

I think it's nice and you get a computational gain on large matrices, but I have a major problem with the effect of the decorator functions: they mess up the documentation and the function signature.

Basically, autocomplete for the sinkhorn function now gives you sinkhorn(*args, **kwargs), which is not good at all. Same with the documentation, which no longer has a proper signature.

There is supposedly a module called decorator that provides this sort of thing, but I have not been able to use it properly (it keeps the signature but the function does not work).

This is a major problem because documentation is very important and we cannot have good documentation for the functions that go through the decorators (the signature is part of the doc).

The only solution in my opinion (unless one of you manages to use the decorator module) is to provide separate functions that go through the decorators. They will not have a proper signature, but at least the original ones will, and since they can be used as drop-in replacements it's OK.
Those functions can be defined in a separate gpu.py file or directly in ot.bregman. The latter would require only small changes to the OptimalTransport classes.

What do you think?
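For what it's worth, the external decorator package is used roughly as in the sketch below; it rebuilds the wrapper with the original argument list, so the signature survives introspection (whether it plays well with the GPU upload logic here is exactly the open question):

```python
import inspect

from decorator import decorator


@decorator
def gpu_wrapper(func, *args, **kwargs):
    # The upload of numpy arguments to the GPU would happen here (sketch).
    return func(*args, **kwargs)


@gpu_wrapper
def sinkhorn_like(a, b, M, reg, numItermax=1000):
    """Placeholder standing in for the real sinkhorn signature."""
    return reg


print(inspect.signature(sinkhorn_like))
# (a, b, M, reg, numItermax=1000) rather than (*args, **kwargs)
```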

@agramfort
Collaborator

agramfort commented May 9, 2018 via email

@toto6
Contributor Author

toto6 commented May 10, 2018

Since the decorator uploads/downloads matrices to/from the GPU at the beginning/end of the functions that carry it, a possibility is to remove the decorator and add one line of code at the beginning/end of these functions to do the same thing (this line would call a function that uploads/downloads if necessary). This way, there is no need to duplicate the functions. But it would require the functions that currently have the decorator to receive the gpu/to_np arguments through kwargs.

But I will check the decorator module to see if I can make it work without removing the decorator.
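A minimal sketch of that kwargs-based approach, with hypothetical helper names (maybe_to_gpu / maybe_to_np are placeholders, not existing POT functions):

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None


def maybe_to_gpu(*arrays, gpu=False):
    """Upload the arrays with cp.asarray only if gpu=True and cupy exists."""
    if gpu and cp is not None:
        return tuple(cp.asarray(a) for a in arrays)
    return arrays


def maybe_to_np(G, to_np=False):
    """Download a cupy result back to numpy if requested."""
    if to_np and cp is not None and isinstance(G, cp.ndarray):
        return cp.asnumpy(G)
    return G


def some_ot_function(a, b, M, **kwargs):
    # one line at the beginning of the function...
    a, b, M = maybe_to_gpu(a, b, M, gpu=kwargs.pop('gpu', False))
    G = a[:, None] * b[None, :] * M      # stand-in for the actual computation
    # ...and one line at the end
    return maybe_to_np(G, to_np=kwargs.pop('to_np', False))
```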

@rflamary
Collaborator

Ok thank you.

Please merge my branch first. It will help you merge master; I already handled most conflicts.

@toto6
Contributor Author

toto6 commented May 10, 2018

I merged the branch.
Is the decorator module you speak of this one: https://github.com/aje/POT/commit/90636c44d278de19ae68534bdb63003d9bde874d#diff-55ba92bcceef627e33e92d3440cc23fc ?

@rflamary
Collaborator

Yes indeed, I did not want to add a dependency, and since the module is a single .py file it seemed a good idea to put it in externals.

rflamary mentioned this pull request Aug 23, 2018
@rflamary
Collaborator

Hello @toto6,

Thank you for your work on this PR. It was a big help when refactoring ot.gpu.
But since it is now outdated, it cannot be merged anymore and I will close it.

rflamary closed this Jun 27, 2019