
Conversation

toto6
Contributor

toto6 commented Sep 13, 2017

  • Replace cudamat with cupy for the GPU implementations (cupy is still under active development, while cudamat is not)
  • Use the new DA class instead of the old deprecated one

TODO for another PR:

  • Performance is still a bit lower than with cudamat (though better than the CPU for large matrices). Some speedups should be possible by tweaking the code

@Slasnista
Contributor

Hi @aje,

I have seen that you have just updated the GPU code and I look forward to using it!

Here are some suggestions about GPU computing, maybe for a next PR:

  1. What do you think about merging the gpu.da and the standard da scripts? We could just add a boolean parameter in the init of each OTDA class to place the computations on the GPU.

  2. In the same direction, what about keeping the same organization of modules, functions and classes across the repo, and using a boolean parameter for each function to specify whether the computation should run on the GPU or on the CPU, as in deep learning libraries (where you specify the device to use: tensorflow, theano and pytorch rely on this)?

Let me know if you want to talk about it directly. We could plan a Skype / Hangout ;)

@ncourty
Collaborator

ncourty commented Sep 14, 2017

Hi @aje and @Slasnista, thanks for your work on POT! It seems to me that the direction pointed out by @Slasnista is the right one for GPU integration in POT: seamless integration with a boolean that can either be set upon installation or triggered by a test on the availability of cupy. However, I think it would be better to do this in another PR.

@toto6
Contributor Author

toto6 commented Sep 14, 2017

I agree it would be better to have a boolean than the current version where I directly duplicate code. I'm not sure how it could be done, as I'm not familiar with other Python libraries using the GPU. We should definitely do a Skype about it.

@rflamary
Collaborator

Hello

The way I see it, this can be solved rather elegantly with the following functions in utils.py:

import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = False


def get_array_module(*args):
    """ return the proper module (numpy or cupy) depending on the input arrays """
    if cp:
        return cp.get_array_module(*args)
    else:
        return np


def to_gpu(*args):
    """ upload numpy arrays to the gpu and return them """
    return (cp.asarray(x) for x in args)


def is_gpu(np):
    """ test if a module is cupy or numpy """
    if 'cupy' in np.__name__:
        return True
    else:
        return False

get_array_module returns the module that should be used (numpy or cupy), depending on whether cupy is installed and whether the arrays you give it are already on the GPU.

The second function transfers all the arrays given to it to the GPU.

The last one allows for a quick test to know whether a module is numpy or cupy.

Now you just need to add, at the beginning of any function (for instance sinkhorn_knopp):

def sinkhorn_knopp(a, b, M, reg, numItermax=1000, stopThr=1e-9, verbose=False, log=False, **kwargs):

    np=get_array_module(a,b,M)
    
    a = np.asarray(a, dtype=np.float64) # probably to change depending on is_gpu(np)
    b = np.asarray(b, dtype=np.float64)
    M = np.asarray(M, dtype=np.float64)

This will automatically replace the local np and use either numpy or cupy.

This means that the user obviously has to take care of sending the arrays to the GPU, or else we add a parameter force_gpu that will force its use when cupy is installed:

if force_gpu:
    a, b, M = to_gpu(a, b, M)
np = get_array_module(a, b, M)

This is the main strength of cupy (being mostly compatible with numpy) and we should take advantage of it instead of re-implementing everything.

Finally, note that my proposal does not break when cupy is not installed; it will live happily using numpy locally.

@Slasnista
Contributor

I really like your solution @rflamary. It should work with every function, shouldn't it?

We might be able to do this with a decorator applied to any function. Then we would just have to make sure that the tests pass on the GPU too. Does it sound tractable or am I forgetting some issue?

@toto6
Contributor Author

toto6 commented Sep 14, 2017

It seems like a good solution to me. I will try to make it work.

@rflamary
Collaborator

@Slasnista I don't know about the decorator, how do you handle the local np with it?

I indeed think it should work with every function that uses the numpy subset implemented in cupy.

@aje I think we should also handle the case when cupy is not installed in to_gpu:

def to_gpu(*args):
    """ upload numpy arrays to the gpu and return them """
    if cp:
        return (cp.asarray(x) for x in args)
    else:
        return args

Now the code does not explode with force_gpu. Maybe we should add a warning to tell the user that cupy is not installed and that the default backend is used...

Add a function pairwiseEuclidean that can be used with numpy or cupy. cupy (GPU) is used if the parameter gpu==True and cupy is available; otherwise the computation is done with numpy.
This function is faster than scipy.spatial.distance.cdist for sqeuclidean even when computing on the CPU (numpy).
TODO:
 - add a parameter "gpu" in the init of all classes extending BaseTransport
 - pass the parameter "gpu" to the function pairwiseEuclidean
 - change the function sinkhorn_knopp in bregman.py to use cupy or numpy
 - change the function sinkhorn_lpl1_mm in da.py to use cupy or numpy
 - same for the other functions...
- modified the sinkhorn_knopp code to run on numpy or cupy depending on the type of the input matrices
- at the moment the GPU version is barely faster than the CPU version; with the test I added I obtain these results:
```
Normal, time:   4.96 sec
   GPU, time:   4.65 sec
```

- TODO:
 - improve the performance of sinkhorn_knopp on GPU
 - add GPU support for LpL1
Before
```
Normal, time:   4.96 sec
   GPU, time:   4.65 sec
```
After
```
Normal, time:   4.21 sec
   GPU, time:   3.45 sec
```
Before
```
Normal, time:   4.21 sec
   GPU, time:   3.45 sec
```
After
```
Normal, time:   3.70 sec
   GPU, time:   2.65 sec
```
@Slasnista
Contributor

@rflamary, a to_gpu decorator would allow you to do something like this to handle the CPU <-> GPU conversion:

@to_gpu
def ot_function(X, y, reg=1., verbose=False):
    """ot_function
    """

    ...

    return Coupling

If the decorator is well designed, it may be able to turn any numpy array given as an argument into a cupy array, and also transform the returned cupy array back into a numpy array after the function has run.

After that, we would just have to put this decorator on every function that performs a heavy computation that would be sped up on a GPU. Does it sound good to you?
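A minimal sketch of what such a decorator could look like (assuming cupy may or may not be installed; the actual implementation may differ):

```python
import functools

import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None


def to_gpu(func):
    """Move numpy array arguments to the GPU before the call and bring a
    cupy result back to numpy afterwards (sketch)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if cp is not None:
            args = tuple(cp.asarray(a) if isinstance(a, np.ndarray) else a
                         for a in args)
        out = func(*args, **kwargs)
        if cp is not None and isinstance(out, cp.ndarray):
            out = cp.asnumpy(out)
        return out
    return wrapper
```

The back-conversion to numpy at the end is what introduces a per-call overhead, which is the design point discussed next.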

@rflamary
Collaborator

rflamary commented Sep 20, 2017

Hello,
@Slasnista

  • I agree that a decorator is both elegant and the way to go for handling the upload to the GPU. Still, I don't want to convert back to a numpy array, since it introduces an overhead every time you call the function that the user might not need. My opinion is that if the user wants to use the GPU, he can handle GPU arrays in the output. Still, one could add a parameter to_array=False that says whether the array should be converted automatically.
  • the decorator can only handle the array conversion; since we still need to get the local backend inside the function, it does not spare us all the modifications.

@aje

  • I'm not convinced by the use of xp instead of np. This makes the function non-standard numpy, and if a user wants to use the function somewhere else by removing the xp=get... line, it would force him to import numpy as xp (highly irregular). I also think it is a good idea to keep np to keep the code easy to read for outsiders; maybe a simple comment above the line can say ("overwrite the backend with numpy or cupy depending on the array format").
  • could you try to code the decorator for loading the arrays onto the GPU instead of handling it inside the function, as proposed by @Slasnista? If not, maybe @Slasnista can propose an implementation.

Improve the benchmark comparison script between CPU and GPU.
For me it now outputs the following:
        scipy's cdist, time:  10.28 sec
pairwiseEuclidean CPU, time:   0.63 sec
pairwiseEuclidean GPU, time:   0.33 sec
Sinkhorn CPU, time:   3.58 sec
Sinkhorn GPU, time:   2.63 sec
@rflamary
Collaborator

Also, I think we should have default dtypes for GPU and CPU. The CPU can handle float64 very easily, but the GPU has a very large overhead with it (which explains the limited performance gain of cupy for the moment). I'm pretty sure that using float32 would lead to a tremendous computational gain (at the cost of less numerical stability, obviously).

Maybe we should let the user define the dtype through the array passed and not force it inside the sinkhorn function.
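As an illustration of "not forcing the dtype", here is a hypothetical helper (not an existing POT function) that keeps whatever floating dtype the caller passed and only promotes non-float inputs:

```python
import numpy as np


def as_float(a, default=np.float64):
    """Keep the caller's floating dtype (e.g. float32 for GPU arrays) and
    only promote integer/boolean inputs to the default (sketch)."""
    a = np.asarray(a)
    return a if a.dtype.kind == 'f' else a.astype(default)
```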

@toto6
Contributor Author

toto6 commented Sep 20, 2017

I don't understand this decorator thing; I have never used one.

@toto6
Contributor Author

toto6 commented Sep 20, 2017

Concerning the overhead with float64 or float32, I think this is not the main cause of the bad performance here.

I remember that with cudamat I had significantly better performance (and with float64). The timings with cudamat were the following:
5000
Normal sinkhorn, time: 1.21 sec
min:5.251E-12 max::8.839E-05 mean::1.600E-07 std::5.317E-07
GPU sinkhorn, time: 0.09 sec
min:5.251E-12 max::8.839E-05 mean::1.600E-07 std::5.317E-07

10000
Normal sinkhorn, time: 4.37 sec
min:4.508E-13 max::4.601E-05 mean::4.000E-08 std::1.366E-07
GPU sinkhorn, time: 0.25 sec
min:4.508E-13 max::4.601E-05 mean::4.000E-08 std::1.366E-07

15000
Normal sinkhorn, time: 10.68 sec
min:1.696E-13 max::2.031E-05 mean::1.778E-08 std::6.191E-08
GPU sinkhorn, time: 0.49 sec
min:1.696E-13 max::2.031E-05 mean::1.778E-08 std::6.191E-08

20000
Normal sinkhorn, time: 22.55 sec
min:1.144E-13 max::1.383E-05 mean::1.000E-08 std::3.506E-08
GPU sinkhorn, time: 0.89 sec
min:1.144E-13 max::1.383E-05 mean::1.000E-08 std::3.506E-08

25000
Normal sinkhorn, time: 32.90 sec
min:8.452E-14 max::8.882E-06 mean::6.400E-09 std::2.248E-08
GPU sinkhorn, time: 1.39 sec
min:8.452E-14 max::8.882E-06 mean::6.400E-09 std::2.248E-08

30000
Normal sinkhorn, time: 47.70 sec
min:6.306E-14 max::9.280E-06 mean::4.444E-09 std::1.572E-08
GPU sinkhorn, time: 1.93 sec
min:6.306E-14 max::9.280E-06 mean::4.444E-09 std::1.572E-08

@rflamary
Collaborator

@aje I agree with you, this difference in performance is clearly a problem and I'm disappointed by the performance of cupy so far. I will try it on my machine to see whether the gain is still so small on a different GPU.

@rflamary
Collaborator

Hello @aje

I just tested cupy a bit with the following script:

import numpy as np
import matplotlib.pylab as pl
import cupy as cp
import ot

#%% parameters and data generation
n=5000 # nb samples
tp=np.float64

A=np.random.randn(n,n).astype(tp)
B=np.random.randn(n,n).astype(tp)

A_gpu=cp.asarray(A)
B_gpu=cp.asarray(B)

ot.tic()
A.dot(B)
ot.toc()

ot.tic()
A_gpu.dot(B_gpu)
ot.toc()

and got the following timings for float32:

Elapsed time : 0.601469993591 s
Elapsed time : 0.0522150993347 s

and for float64:

Elapsed time : 1.05664110184 s
Elapsed time : 1.37450909615 s

I've got a Titan X that is a bit old and sucks at float64, but I think it comes mainly from the fact that cupy is part of chainer, which is a deep learning framework, and they simply don't care about float64 (they even talk about going 16 or 8 bit in neural networks nowadays).

@toto6
Contributor Author

toto6 commented Sep 21, 2017

Ah, I have similar results. I will try to add a parameter to choose which float type (32 or 64) to use in the computation.

@rflamary
Collaborator

rflamary commented May 9, 2018

Hello everyone and sorry for the long wait.

I have been playing with this PR now that I have a working GPU again and you can see a few modifications here:
https://github.com/rflamary/POT/tree/pr32

I think it's nice and you get a computational gain on large matrices, but I have a major problem with the effect of the decorator functions: they mess up the documentation and the function signature.

Basically, autocomplete for the sinkhorn function now gives you sinkhorn(*args, **kwargs), which is not good at all. Same with the documentation, which no longer has a proper signature.

There is supposedly a module called decorator that provides this sort of thing, but I have not been able to use it properly (it keeps the signature but the function does not work).

This is a major problem because documentation is very important and we cannot have good documentation for the functions that go through the decorators (the signature is part of the doc).

The only solution in my opinion (unless one of you manages to use the decorator module) is to provide separate functions that go through the decorators. They will not have a proper signature, but at least the original ones will, and since they can be used as drop-in replacements it's OK.
Those functions can be defined in a separate gpu.py file or directly in ot.bregman. The latter would require only small changes to the OptimalTransport classes.

What do you think?
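For what it's worth, the external decorator package is used roughly as in the sketch below; it rebuilds the wrapper with the original argument list, so the signature survives introspection (whether it plays well with the GPU upload logic here is exactly the open question):

```python
import inspect

from decorator import decorator


@decorator
def gpu_wrapper(func, *args, **kwargs):
    # The upload of numpy arguments to the GPU would happen here (sketch).
    return func(*args, **kwargs)


@gpu_wrapper
def sinkhorn_like(a, b, M, reg, numItermax=1000):
    """Placeholder standing in for the real sinkhorn signature."""
    return reg


print(inspect.signature(sinkhorn_like))
# (a, b, M, reg, numItermax=1000) rather than (*args, **kwargs)
```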

@agramfort
Collaborator

agramfort commented May 9, 2018 via email

@toto6
Contributor Author

toto6 commented May 10, 2018

Since the decorator uploads/downloads matrices to/from the GPU at the beginning/end of the functions that carry it, a possibility is to remove the decorator and add one line of code at the beginning/end of these functions to do the same thing (this line would call a function that uploads/downloads if necessary). This way, there is no need to duplicate the functions. But it would require the functions that currently have the decorator to receive the gpu/to_np arguments through kwargs.

But I will check the decorator module to see if I can make it work without removing the decorator.
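A minimal sketch of that kwargs-based approach, with hypothetical helper names (maybe_to_gpu / maybe_to_np are placeholders, not existing POT functions):

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None


def maybe_to_gpu(*arrays, gpu=False):
    """Upload the arrays with cp.asarray only if gpu=True and cupy exists."""
    if gpu and cp is not None:
        return tuple(cp.asarray(a) for a in arrays)
    return arrays


def maybe_to_np(G, to_np=False):
    """Download a cupy result back to numpy if requested."""
    if to_np and cp is not None and isinstance(G, cp.ndarray):
        return cp.asnumpy(G)
    return G


def some_ot_function(a, b, M, **kwargs):
    # one line at the beginning of the function...
    a, b, M = maybe_to_gpu(a, b, M, gpu=kwargs.pop('gpu', False))
    G = a[:, None] * b[None, :] * M      # stand-in for the actual computation
    # ...and one line at the end
    return maybe_to_np(G, to_np=kwargs.pop('to_np', False))
```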

@rflamary
Collaborator

Ok thank you.

Please merge my branch first. It will help you merge master; I already handled most conflicts.

@toto6
Contributor Author

toto6 commented May 10, 2018

I merged the branch.
Is the decorator module you speak of this one: https://github.com/aje/POT/commit/90636c44d278de19ae68534bdb63003d9bde874d#diff-55ba92bcceef627e33e92d3440cc23fc ?

@rflamary
Collaborator

Yes indeed, I did not want to add a dependency, and since the module is a single .py file it seemed a good idea to put it in externals.

rflamary mentioned this pull request Aug 23, 2018
@rflamary
Collaborator

Hello @toto6,

Thank you for your work on this PR. It was a big help when refactoring ot.gpu.
But since it is now outdated, it cannot be merged anymore and I will close it.

rflamary closed this Jun 27, 2019