GPU changes: #32
Conversation
- Replace cudamat by cupy for GPU implementations (cupy is still in active development, while cudamat is not)
- Use the new DA class instead of the old deprecated one

TODO for another PR:
- Performance is still a bit lower than with cudamat (even if better than CPU for large matrices). Some speedups should be possible by tweaking the code
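(For context, a minimal sketch, not from the PR, of why cupy is a natural replacement for cudamat: it mirrors the numpy API, so the same array code runs unchanged on either backend.)

```python
import numpy as np

try:
    import cupy as cp  # optional GPU backend
except ImportError:
    cp = None

def squared_norms(X):
    """Row-wise squared norms; the same code runs on numpy and cupy arrays."""
    return (X ** 2).sum(axis=1)

X = np.random.randn(1000, 50)
print(squared_norms(X))                  # CPU path via numpy
if cp is not None:
    print(squared_norms(cp.asarray(X)))  # GPU path via cupy, unchanged code
```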
hi @aje, I have seen that you have just updated the GPU code and I look forward to using it! Here are some suggestions about GPU computing, maybe for a next PR:
Let me know if you want to talk about it directly. We could plan a skype / hangout ;) |
Hi @aje and @Slasnista, thanks for your work on POT! It seems to me that the direction pointed out by @Slasnista is the right one for GPU integration in POT: seamless integration with a boolean that can either be set upon installation or triggered by a test on the availability of cupy. However, I think it would be better to do this in another PR. |
I agree it would be better to have a boolean than the current version, where I directly duplicate code. I'm not sure how it could be done, as I'm not familiar with other Python libraries using the GPU. For sure we can do a skype about it. |
Hello

The way I see it, this can be solved rather elegantly with the following functions in utils.py:

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = False


def get_array_module(*args):
    """ return the proper module (numpy or cupy) depending on the input arrays """
    if cp:
        return cp.get_array_module(*args)
    else:
        return np


def to_gpu(*args):
    """ upload numpy arrays to the gpu and return them """
    return (cp.asarray(x) for x in args)


def is_gpu(np):
    """ test if a module is cupy or numpy """
    if 'cupy' in np.__name__:
        return True
    else:
        return False
```

`get_array_module` returns the module that should be used (numpy or cupy) depending on whether cupy is installed and whether the arrays you give it are already on the GPU. The second function transfers all the arrays given to it to the GPU. The last one allows a quick test of whether a module is numpy or cupy.

Now you just need to add at the beginning of any function (for instance sinkhorn_knopp):

```python
def sinkhorn_knopp(a, b, M, reg, numItermax=1000, stopThr=1e-9,
                   verbose=False, log=False, **kwargs):
    np = get_array_module(a, b, M)
    a = np.asarray(a, dtype=np.float64)  # probably to change depending on is_gpu(np)
    b = np.asarray(b, dtype=np.float64)
    M = np.asarray(M, dtype=np.float64)
```

This will automatically replace the local np and use either numpy or cupy. This means that obviously the user has to take care of sending the arrays to the GPU, or we add a parameter `force_gpu` that will force its use when cupy is installed:

```python
if force_gpu:
    a, b, M = to_gpu(a, b, M)
    np = get_array_module(a, b, M)
```

This is the main strength of cupy (being mostly compatible with numpy), and we should do that instead of re-implementing everything. Finally, note that my proposition does not bug when cupy is not installed; it will live happily using numpy locally. |
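A usage sketch of the pattern above (assuming the helpers from that comment are in scope; `sinkhorn_knopp` here stands for the modified POT function):

```python
# Usage sketch of the proposal above, not committed code.
import numpy as np

a = np.ones(1000) / 1000          # uniform source weights
b = np.ones(1000) / 1000          # uniform target weights
M = np.random.rand(1000, 1000)    # cost matrix

G = sinkhorn_knopp(a, b, M, reg=1.0)                  # runs on CPU (numpy)

a_gpu, b_gpu, M_gpu = to_gpu(a, b, M)                 # caller uploads explicitly
G_gpu = sinkhorn_knopp(a_gpu, b_gpu, M_gpu, reg=1.0)  # same code, on GPU
```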
I really like your solution @rflamary. It should work with every function, shouldn't it? We might be able to do this with a decorator applied to any function. Then we would just have to make sure that the tests pass on GPU too. Does it sound tractable, or am I forgetting some issue? |
It seems like a good solution to me. I will try to make it work. |
@Slasnista I don't know about the decorator, how do you handle the local np with it? I think indeed it should work with every function that uses the numpy subset implemented in cupy.

@aje I think we should also handle the case when cupy is not installed in to_gpu:

```python
def to_gpu(*args):
    """ upload numpy arrays to the gpu and return them """
    if cp:
        return (cp.asarray(x) for x in args)
    else:
        return args
```

Now we have code that does not explode with `force_gpu`. Maybe we should add a warning to tell the user that cupy is not installed and that the default backend is used... |
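That warning could look like this (a sketch, not committed code):

```python
import warnings

try:
    import cupy as cp
except ImportError:
    cp = False

def to_gpu(*args):
    """ upload numpy arrays to the gpu, or return them unchanged (with a
    warning) when cupy is not installed """
    if cp:
        return (cp.asarray(x) for x in args)
    warnings.warn("cupy is not installed, using the default numpy backend")
    return args
```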
Add a function pairwiseEuclidean that can be used with numpy or cupy. cupy (GPU) is used if the parameter gpu==True and cupy is available; otherwise it computes with numpy. This function is faster than scipy.spatial.distance.cdist for sqeuclidean even when computing on the CPU (numpy).
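The actual pairwiseEuclidean lives in the PR; a minimal sketch of the idea (function name illustrative) is the classic ||x||^2 + ||y||^2 - 2 x.y expansion, which replaces cdist's explicit double loop with a single matrix product:

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None

def pairwise_sqeuclidean(X, Y, gpu=False):
    """Pairwise squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y,
    on GPU when gpu=True and cupy is available, otherwise on CPU."""
    xp = cp if (gpu and cp is not None) else np
    X = xp.asarray(X)
    Y = xp.asarray(Y)
    X2 = (X ** 2).sum(axis=1)[:, None]   # column vector of row norms of X
    Y2 = (Y ** 2).sum(axis=1)[None, :]   # row vector of row norms of Y
    return X2 + Y2 - 2 * X.dot(Y.T)
```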
TODO:
- add parameter "gpu" in __init__ of all classes extending BaseTransport
- pass parameter "gpu" to function pairwiseEuclidean
- change in file bregman.py the function sinkhorn_knopp to use cupy or numpy
- change in file da.py the function sinkhorn_lpl1_mm to use cupy or numpy
- same but for other functions...
- modified sinkhorn knopp code to be executed on numpy or cupy depending on the type of input matrices
- at the moment the GPU version is slow compared to CPU. With the test I added I obtain these results:

```
Normal, time: 4.96 sec
GPU, time: 4.65 sec
```

- TODO:
  - improve performances of sinkhorn knopp for GPU
  - add GPU support for LpL1
Before:

```
Normal, time: 4.96 sec
GPU, time: 4.65 sec
```

After:

```
Normal, time: 4.21 sec
GPU, time: 3.45 sec
```
Before:

```
Normal, time: 4.21 sec
GPU, time: 3.45 sec
```

After:

```
Normal, time: 3.70 sec
GPU, time: 2.65 sec
```
@rflamary, a decorator:

```python
@to_gpu
def ot_function(X, y, reg=1., verbose=False):
    """ot_function
    """
    ...
    return Coupling
```

If the decorator is well designed, it may be able to turn any numpy array given as an argument into a cupy array, and also transform the returned cupy array back into a numpy array after the function has run. After that, we would just have to put this decorator on every function that performs a heavy computation that would be sped up on a GPU. Does it sound good to you? |
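A sketch of what such a decorator could look like (hypothetical, reusing the optional-cupy pattern from the earlier comments):

```python
import functools
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None

def gpu_io(func):
    """Hypothetical decorator: upload numpy array arguments to the GPU,
    run the function, and bring the result back as a numpy array."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if cp is None:                     # no cupy: plain CPU call
            return func(*args, **kwargs)
        args = [cp.asarray(a) if isinstance(a, np.ndarray) else a
                for a in args]
        out = func(*args, **kwargs)
        return cp.asnumpy(out) if isinstance(out, cp.ndarray) else out
    return wrapper
```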
Hello,
|
Improve the benchmark comparison script between CPU and GPU. For me it now outputs the following:

```
scipy's cdist, time: 10.28 sec
pairwiseEuclidean CPU, time: 0.63 sec
pairwiseEuclidean GPU, time: 0.33 sec
Sinkhorn CPU, time: 3.58 sec
Sinkhorn GPU, time: 2.63 sec
```
Also, I think we should have default dtypes for GPU and CPU. The CPU can handle float64 very easily, but the GPU has a very large overhead with it (which explains the limited performance gain of cupy for the moment). I'm pretty sure that using float32 would lead to tremendous computational gains (at the cost of less numerical stability, obviously). Maybe we should let the user define the dtype through the array passed and not force it inside the sinkhorn function. |
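One way to implement "let the user define the dtype through the array passed" (a sketch, with a hypothetical helper name):

```python
import numpy as np

def as_common_dtype(a, b, M):
    """Hypothetical helper: keep the caller's dtype instead of forcing
    float64, so float32 inputs keep the whole computation in float32."""
    dtype = np.result_type(a, b, M)  # e.g. float32 if all inputs are float32
    return (np.asarray(a, dtype=dtype),
            np.asarray(b, dtype=dtype),
            np.asarray(M, dtype=dtype))
```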
I don't understand this decorator thing, I have never used one |
Concerning the overhead with float64 or float32, I think this is not the main cause of the bad performances here. I remember that with cudamat I had significantly better performances (and with float64). [cudamat benchmark figure for matrix sizes 10000 to 30000 not recovered] |
@aje I agree with you, this difference in performance is clearly a problem, and I'm disappointed by the performance of cupy so far. I will try it on my machine to see if the gain is still so small on a different GPU. |
hello @aje I just tested cupy a bit with the following script:

```python
import numpy as np
import matplotlib.pylab as pl
import cupy as cp
import ot

#%% parameters and data generation

n = 5000  # nb samples
tp = np.float64

A = np.random.randn(n, n).astype(tp)
B = np.random.randn(n, n).astype(tp)

A_gpu = cp.asarray(A)
B_gpu = cp.asarray(B)

ot.tic()
A.dot(B)
ot.toc()

ot.tic()
A_gpu.dot(B_gpu)
ot.toc()
```

and got the following performances for float32:

and for float64:

I've got a Titan X that is a bit old and sucks at float64, but I think it comes mainly from the fact that cupy is part of chainer, which is a deep learning framework, and they simply don't care about float64 (they even talk about going to 16 or 8 bits in neural networks nowadays). |
ah, I have similar results. I will try to add a parameter to choose which float type (32 or 64) to use in the computation
Hello everyone, and sorry for the long wait. I have been playing with this PR now that I have a working GPU again, and you can see a few modifications here.

I think it's nice and you get a computational gain on large matrices, but I have a major problem with the effect of the decorator functions: they screw with the documentation and function signature. Basically, autocomplete for the sinkhorn function now gives you

There is supposedly a module called `decorator` that can preserve the signature.

This is a major problem, because documentation is very important and we cannot have a good one on the functions that go through the decorators (the signature is part of the doc). The only solution in my opinion (unless one of you manages to use the decorator module) is to provide separate functions which go through the decorators. They will not have a proper signature, but at least the original ones will, and since you can use these functions as drop-in replacements it's OK. What do you think? |
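For reference, a sketch of what functools.wraps recovers on Python 3: the docstring is copied, and inspect.signature follows `__wrapped__` back to the original signature, though some autocomplete tools still show the generic wrapper, which is presumably the problem described above. Names here are illustrative.

```python
import functools
import inspect

def passthrough(func):
    @functools.wraps(func)  # copies __doc__/__name__, sets __wrapped__
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@passthrough
def sinkhorn_like(a, b, M, reg, numItermax=1000):
    """Docstring survives the decorator."""
    return None

print(inspect.signature(sinkhorn_like))  # (a, b, M, reg, numItermax=1000)
print(sinkhorn_like.__doc__)             # Docstring survives the decorator.
```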
maybe it's worth looking at how tensorly (https://github.com/tensorly/tensorly) handles different backends without duplicating much algorithmic code.

my 2c |
Since the decorator uploads/downloads matrices to/from the GPU at the beginning/end of the functions that have it, one possibility is to remove the decorator and add one line of code at the beginning/end of these functions to do the same thing (this line would call a function that uploads/downloads if necessary). This way there is no need to duplicate the functions, but it would require the functions that had the decorator to receive the gpu/to_np arguments through kwargs. I will also check the decorator module to see if I can make it work without removing the decorator. |
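A sketch of that explicit upload/download variant (helper and function names hypothetical):

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None

def maybe_to_gpu(arrays, gpu):
    """One line at the top of a function: upload if requested and possible."""
    if gpu and cp is not None:
        return [cp.asarray(a) for a in arrays]
    return list(arrays)

def maybe_to_np(x, to_np):
    """One line before returning: download back to numpy if requested."""
    if to_np and cp is not None and isinstance(x, cp.ndarray):
        return cp.asnumpy(x)
    return x

def some_ot_function(a, b, M, reg, **kwargs):
    gpu = kwargs.pop('gpu', False)      # gpu/to_np passed through kwargs
    to_np = kwargs.pop('to_np', True)
    a, b, M = maybe_to_gpu((a, b, M), gpu)
    G = M * reg                          # placeholder for the heavy computation
    return maybe_to_np(G, to_np)
```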
Ok, thank you. Please merge my branch first; it will help you merge master. I already handled most conflicts. |
I merged the branch. |
yes indeed, I did not want to add a dependency, and since the module is a single .py file it seemed a good idea to put it in externals |
Hello @toto6, thank you for your work on this PR. It was a big help when refactoring ot.gpu. |