
[MRG] New ot.gpu with cupy #67

Merged
merged 7 commits into master from new_gpu on Sep 28, 2018

Conversation

rflamary
Collaborator

This PR is a cupy implementation of the functions currently implemented in ot.gpu. I also removed all the classes that were deprecated anyway. It still needs properly updated tests, but I like this solution since it stays mostly compatible with the old ot.gpu (see the sketch below).
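
A minimal sketch of that compatibility (my illustration, not code from the PR; it assumes cupy is installed and a CUDA GPU is available). The ot.gpu solvers mirror the CPU API and return numpy arrays by default:

import numpy as np
import ot
import ot.gpu  # requires cupy

n = 100
xs = np.random.randn(n, 2).astype(np.float32)
xt = np.random.randn(n, 2).astype(np.float32)
a = ot.unif(n)

# CPU version
M = ot.dist(xs, xt)
G = ot.sinkhorn(a, a, M, 1)

# GPU version: same calls, just through the ot.gpu module
M_g = ot.gpu.dist(xs, xt)            # computed on GPU, returned as numpy
G_g = ot.gpu.sinkhorn(a, a, M_g, 1)  # idem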

I have received a large number of queries about ot.gpu, but cudamat is not maintained and the problem will only grow, so we need to do something before release 0.5.

This solution is far less elegant than PR #32 of @toto6 with all the decorators, but having a cupy-specific implementation leaves more room for code optimization than a generic implementation IMHO. This means we can make it better in the future without compromising the numpy implementation.

I give an example of use of the ot.gpu functions below with different formats for input/output, i.e. whether they are numpy.array or cupy.array. The output was obtained on my Titan X GPU after two runs of the script in IPython.

import numpy as np
import pylab as pl
import ot
import ot.gpu

#%%
n=2000

tp=np.float32

xs=np.random.randn(n,2).astype(tp)
xt=np.random.randn(n,2).astype(tp)

w=ot.unif(n)

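# class labels (two classes), used by sinkhorn_lpl1_mm below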
lab=np.zeros(n)
lab[n//2:]=1


print('Upload data to GPU:')
print('===================')
ot.tic()
xs2,xt2= ot.gpu.to_gpu(xs,xt)
ot.toc()

#%% test dist computation

ot.tic()
M=ot.dist(xs.copy(),xt.copy())
t0=ot.toq()


ot.tic()
M1=ot.gpu.dist(xs.copy(),xt.copy(),to_numpy=True)
t1=ot.toq()

ot.tic()
M2=ot.gpu.dist(xs.copy(),xt.copy(),to_numpy=False)
t2=ot.toq()

ot.tic()
M3=ot.gpu.dist(xs2,xt2,to_numpy=False)
t3=ot.toq()

print('\nDist computation:')
print('===================')
print('CPU                     : {:1.4f}s'.format(t0))
print('GPU (src=cpu,tgt=cpu)   : {:1.4f}s (x{:1.2f})'.format(t1,t0/t1))
print('GPU (src=cpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t2,t0/t2))
print('GPU (src=gpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t3,t0/t3))
print('Err= {:e}'.format(np.abs(M-M1).max()))

#%% Sinkhorn computation

reg=1

ot.tic()
G=ot.sinkhorn(w,w,M.copy(),reg)
t0=ot.toq()

ot.tic()
G1=ot.gpu.sinkhorn(w,w,M.copy(),reg,to_numpy=True)
t1=ot.toq()

ot.tic()
G2=ot.gpu.sinkhorn(w,w,M.copy(),reg,to_numpy=False)
t2=ot.toq()

ot.tic()
G3=ot.gpu.sinkhorn(w,w,M3,reg,to_numpy=False)
t3=ot.toq()

print('\nSinkhorn computation:')
print('=======================')
print('CPU                     : {:1.4f}s'.format(t0))
print('GPU (src=cpu,tgt=cpu)   : {:1.4f}s (x{:1.2f})'.format(t1,t0/t1))
print('GPU (src=cpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t2,t0/t2))
print('GPU (src=gpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t3,t0/t3))
print('Err= {:e}'.format(np.abs(G-G1).max()))


#%% test sinkhorn with multiple target distributions

reg=1

w2=np.random.rand(n,20)
w2/=w2.sum(0,keepdims=True)

ot.tic()
wass=ot.sinkhorn(w,w2,M.copy(),reg)
t0=ot.toq()

ot.tic()
wass1=ot.gpu.sinkhorn(w,w2,M.copy(),reg,to_numpy=True)
t1=ot.toq()

ot.tic()
wass2=ot.gpu.sinkhorn(w,w2,M.copy(),reg,to_numpy=False)
t2=ot.toq()

ot.tic()
wass3=ot.gpu.sinkhorn(w,w2,M3,reg,to_numpy=False)
t3=ot.toq()

print('\nSinkhorn multiple target:')
print('==========================')
print('CPU                     : {:1.4f}s'.format(t0))
print('GPU (src=cpu,tgt=cpu)   : {:1.4f}s (x{:1.2f})'.format(t1,t0/t1))
print('GPU (src=cpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t2,t0/t2))
print('GPU (src=gpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t3,t0/t3))
print('Err= {:e}'.format(np.abs(wass-wass1).max()))


#%% Sinkhorn lpl1 computation
ot.tic()
G1p=ot.da.sinkhorn_lpl1_mm(w,lab,w,M.copy(),reg)
t0=ot.toq()

ot.tic()
G1p1=ot.gpu.da.sinkhorn_lpl1_mm(w,lab,w,M.copy(),reg,to_numpy=True)
t1=ot.toq()

ot.tic()
G1p2=ot.gpu.da.sinkhorn_lpl1_mm(w,lab,w,M.copy(),reg,to_numpy=False)
t2=ot.toq()

ot.tic()
G1p2=ot.gpu.da.sinkhorn_lpl1_mm(w,lab,w,M3,reg,to_numpy=False)
t3=ot.toq()

print('\nSinkhorn lpl1 :')
print('==========================')
print('CPU                     : {:1.4f}s'.format(t0))
print('GPU (src=cpu,tgt=cpu)   : {:1.4f}s (x{:1.2f})'.format(t1,t0/t1))
print('GPU (src=cpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t2,t0/t2))
print('GPU (src=gpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t3,t0/t3))
print('Err= {:e}'.format(np.abs(G1p-G1p1).max()))

The output I have is the following:

Upload data to GPU:
===================
Elapsed time : 0.28782010078430176 s

Dist computation:
===================
CPU                     : 0.1933s
GPU (src=cpu,tgt=cpu)   : 0.5164s (x0.37)
GPU (src=cpu,tgt=gpu)   : 0.0010s (x184.93)
GPU (src=gpu,tgt=gpu)   : 0.0011s (x180.36)
Err= 0.000000e+00

Sinkhorn computation:
=======================
CPU                     : 1.8513s
GPU (src=cpu,tgt=cpu)   : 0.6724s (x2.75)
GPU (src=cpu,tgt=gpu)   : 0.2524s (x7.33)
GPU (src=gpu,tgt=gpu)   : 0.0011s (x1727.06)
Err= 1.985125e-12

Sinkhorn multiple target:
==========================
CPU                     : 12.7924s
GPU (src=cpu,tgt=cpu)   : 1.1502s (x11.12)
GPU (src=cpu,tgt=gpu)   : 0.9587s (x13.34)
GPU (src=gpu,tgt=gpu)   : 0.0011s (x11933.96)
Err= 1.294231e-09

Sinkhorn lpl1 :
==========================
CPU                     : 22.6899s
GPU (src=cpu,tgt=cpu)   : 2.9365s (x7.73)
GPU (src=cpu,tgt=gpu)   : 2.7254s (x8.33)
GPU (src=gpu,tgt=gpu)   : 2.5752s (x8.81)
Err= 2.574980e-19

@rflamary rflamary changed the title [WIP] New gpu with cupy [WIP] New ot.gpu with cupy Sep 25, 2018
@LeoGautheron

I don't know why, but I obtain much faster CPU times than yours with my i7-6700HQ, GTX 980M and 32GB of memory.

Upload data to GPU:
===================
Elapsed time : 0.4076845645904541 s

Dist computation:
===================
CPU                     : 0.0094s
GPU (src=cpu,tgt=cpu)   : 0.5155s (x0.02)
GPU (src=cpu,tgt=gpu)   : 0.0000s (x93839.17)
GPU (src=gpu,tgt=gpu)   : 0.0000s (x93839.17)
Err= 0.000000e+00

Sinkhorn computation:
=======================
CPU                     : 0.1406s
GPU (src=cpu,tgt=cpu)   : 0.1562s (x0.90)
GPU (src=cpu,tgt=gpu)   : 0.1093s (x1.29)
GPU (src=gpu,tgt=gpu)   : 0.0000s (x1406278.61)
Err= 3.419000e-12

Sinkhorn multiple target:
==========================
CPU                     : 0.9373s
GPU (src=cpu,tgt=cpu)   : 0.2968s (x3.16)
GPU (src=cpu,tgt=gpu)   : 0.2812s (x3.33)
GPU (src=gpu,tgt=gpu)   : 0.0000s (x9372515.68)
Err= 1.284586e-09

Sinkhorn lpl1 :
==========================
CPU                     : 1.9049s
GPU (src=cpu,tgt=cpu)   : 1.0466s (x1.82)
GPU (src=cpu,tgt=gpu)   : 1.0154s (x1.88)
GPU (src=gpu,tgt=gpu)   : 1.0154s (x1.88)
Err= 3.523657e-19

@rflamary
Collaborator Author

rflamary commented Sep 26, 2018

Hello @LeoGautheron ,

This is because my CPU has a low frequency (1.5GHz), so it can't really fight an i7 ;) and I'm not sure my numpy install properly uses its multiple cores. Also, kudos on the GTX, which is clearly better than my old Titan X.

Still, note that the time is clearly dominated by uploads/downloads to/from GPU memory. We have to be clear in the documentation that the best speed is obtained with matrices that are already cupy arrays, as in the sketch below.
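
A minimal sketch of that recommended pattern (my illustration, not code from the PR; it assumes cupy is installed and uses cupy.asnumpy for the final download):

import numpy as np
import cupy as cp
import ot
import ot.gpu

n = 2000
xs = np.random.randn(n, 2).astype(np.float32)
xt = np.random.randn(n, 2).astype(np.float32)
w = ot.unif(n)

# pay the host-to-device transfer cost once...
xs_g, xt_g = ot.gpu.to_gpu(xs, xt)

# ...then chain the computations without leaving GPU memory
M_g = ot.gpu.dist(xs_g, xt_g, to_numpy=False)
G_g = ot.gpu.sinkhorn(w, w, M_g, 1, to_numpy=False)

# download only the final result
G = cp.asnumpy(G_g)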

These performances are not very impressive, to be frank, but I think it is important that we provide a working ot.gpu for release 0.5. We can definitely do better, but unless you have kickass optimization tricks to share on short notice, I think we should merge this (after a proper documentation update).

What do you think?

@LeoGautheron

I think it's good to go; I see no optimization to do right now.

@rflamary
Collaborator Author

OK then, thank you for the feedback; we need to provide GPU support for the users (30k+ downloads on pip).

And sorry again for your zombie PR; I stole some stuff from it anyway, so not all is lost.

@rflamary rflamary changed the title [WIP] New ot.gpu with cupy [MRG] New ot.gpu with cupy Sep 27, 2018
@rflamary rflamary merged commit 8f6c455 into master Sep 28, 2018
@rflamary rflamary deleted the new_gpu branch December 5, 2018 11:49