
[MRG] New ot.gpu with cupy #67

Merged
merged 7 commits into master from new_gpu on Sep 28, 2018

Conversation

rflamary
Collaborator

This PR is a cupy implementation of the functions currently implemented in ot.gpu. I also removed all the classes that were deprecated anyway. It still needs properly updated tests, but I like this solution since it stays mostly compatible with the old ot.gpu (see the sketch below).
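
A minimal sketch of that compatibility (my illustration, not code from the PR; it assumes cupy is installed and a CUDA GPU is available). The ot.gpu solvers mirror the CPU API and return numpy arrays by default:

import numpy as np
import ot
import ot.gpu  # requires cupy

n = 100
xs = np.random.randn(n, 2).astype(np.float32)
xt = np.random.randn(n, 2).astype(np.float32)
a = ot.unif(n)

# CPU version
M = ot.dist(xs, xt)
G = ot.sinkhorn(a, a, M, 1)

# GPU version: same calls, just through the ot.gpu module
M_g = ot.gpu.dist(xs, xt)            # computed on GPU, returned as numpy
G_g = ot.gpu.sinkhorn(a, a, M_g, 1)  # idem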

I have received a large number of queries about ot.gpu, but cudamat is not maintained and the problem will only grow, so we need to do something before release 0.5.

This solution is far less elegant than PR #32 of @toto6 with all the decorators, but having a cupy-specific implementation leaves more room for code optimization than a generic implementation IMHO. This means we can make it better in the future without compromising the numpy implementation.

I give an example of use of the ot.gpu functions below with different formats for input/output, i.e. whether they are numpy.array or cupy.array. The output was obtained on my Titan X GPU after two runs of the script in IPython.

import numpy as np
import pylab as pl
import ot
import ot.gpu

#%%
n=2000

tp=np.float32

xs=np.random.randn(n,2).astype(tp)
xt=np.random.randn(n,2).astype(tp)

w=ot.unif(n)

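# class labels (two classes), used by sinkhorn_lpl1_mm below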
lab=np.zeros(n)
lab[n//2:]=1


print('Upload data to GPU:')
print('===================')
ot.tic()
xs2,xt2= ot.gpu.to_gpu(xs,xt)
ot.toc()

#%% test dist computation

ot.tic()
M=ot.dist(xs.copy(),xt.copy())
t0=ot.toq()


ot.tic()
M1=ot.gpu.dist(xs.copy(),xt.copy(),to_numpy=True)
t1=ot.toq()

ot.tic()
M2=ot.gpu.dist(xs.copy(),xt.copy(),to_numpy=False)
t2=ot.toq()

ot.tic()
M3=ot.gpu.dist(xs2,xt2,to_numpy=False)
t3=ot.toq()

print('\nDist computation:')
print('===================')
print('CPU                     : {:1.4f}s'.format(t0))
print('GPU (src=cpu,tgt=cpu)   : {:1.4f}s (x{:1.2f})'.format(t1,t0/t1))
print('GPU (src=cpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t2,t0/t2))
print('GPU (src=gpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t3,t0/t3))
print('Err= {:e}'.format(np.abs(M-M1).max()))

#%% Sinkhorn computation

reg=1

ot.tic()
G=ot.sinkhorn(w,w,M.copy(),reg)
t0=ot.toq()

ot.tic()
G1=ot.gpu.sinkhorn(w,w,M.copy(),reg,to_numpy=True)
t1=ot.toq()

ot.tic()
G2=ot.gpu.sinkhorn(w,w,M.copy(),reg,to_numpy=False)
t2=ot.toq()

ot.tic()
G3=ot.gpu.sinkhorn(w,w,M3,reg,to_numpy=False)
t3=ot.toq()

print('\nSinkhorn computation:')
print('=======================')
print('CPU                     : {:1.4f}s'.format(t0))
print('GPU (src=cpu,tgt=cpu)   : {:1.4f}s (x{:1.2f})'.format(t1,t0/t1))
print('GPU (src=cpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t2,t0/t2))
print('GPU (src=gpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t3,t0/t3))
print('Err= {:e}'.format(np.abs(G-G1).max()))


#%% test sinkhorn with multiple target distributions

reg=1

w2=np.random.rand(n,20)
w2/=w2.sum(0,keepdims=True)

ot.tic()
wass=ot.sinkhorn(w,w2,M.copy(),reg)
t0=ot.toq()

ot.tic()
wass1=ot.gpu.sinkhorn(w,w2,M.copy(),reg,to_numpy=True)
t1=ot.toq()

ot.tic()
wass2=ot.gpu.sinkhorn(w,w2,M.copy(),reg,to_numpy=False)
t2=ot.toq()

ot.tic()
wass3=ot.gpu.sinkhorn(w,w2,M3,reg,to_numpy=False)
t3=ot.toq()

print('\nSinkhorn multiple target:')
print('==========================')
print('CPU                     : {:1.4f}s'.format(t0))
print('GPU (src=cpu,tgt=cpu)   : {:1.4f}s (x{:1.2f})'.format(t1,t0/t1))
print('GPU (src=cpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t2,t0/t2))
print('GPU (src=gpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t3,t0/t3))
print('Err= {:e}'.format(np.abs(wass-wass1).max()))


#%% Sinkhorn lpl1 computation
ot.tic()
G1p=ot.da.sinkhorn_lpl1_mm(w,lab,w,M.copy(),reg)
t0=ot.toq()

ot.tic()
G1p1=ot.gpu.da.sinkhorn_lpl1_mm(w,lab,w,M.copy(),reg,to_numpy=True)
t1=ot.toq()

ot.tic()
G1p2=ot.gpu.da.sinkhorn_lpl1_mm(w,lab,w,M.copy(),reg,to_numpy=False)
t2=ot.toq()

ot.tic()
G1p2=ot.gpu.da.sinkhorn_lpl1_mm(w,lab,w,M3,reg,to_numpy=False)
t3=ot.toq()

print('\nSinkhorn lpl1 :')
print('==========================')
print('CPU                     : {:1.4f}s'.format(t0))
print('GPU (src=cpu,tgt=cpu)   : {:1.4f}s (x{:1.2f})'.format(t1,t0/t1))
print('GPU (src=cpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t2,t0/t2))
print('GPU (src=gpu,tgt=gpu)   : {:1.4f}s (x{:1.2f})'.format(t3,t0/t3))
print('Err= {:e}'.format(np.abs(G1p-G1p1).max()))

The output I have is the following:

Upload data to GPU:
===================
Elapsed time : 0.28782010078430176 s

Dist computation:
===================
CPU                     : 0.1933s
GPU (src=cpu,tgt=cpu)   : 0.5164s (x0.37)
GPU (src=cpu,tgt=gpu)   : 0.0010s (x184.93)
GPU (src=gpu,tgt=gpu)   : 0.0011s (x180.36)
Err= 0.000000e+00

Sinkhorn computation:
=======================
CPU                     : 1.8513s
GPU (src=cpu,tgt=cpu)   : 0.6724s (x2.75)
GPU (src=cpu,tgt=gpu)   : 0.2524s (x7.33)
GPU (src=gpu,tgt=gpu)   : 0.0011s (x1727.06)
Err= 1.985125e-12

Sinkhorn multiple target:
==========================
CPU                     : 12.7924s
GPU (src=cpu,tgt=cpu)   : 1.1502s (x11.12)
GPU (src=cpu,tgt=gpu)   : 0.9587s (x13.34)
GPU (src=gpu,tgt=gpu)   : 0.0011s (x11933.96)
Err= 1.294231e-09

Sinkhorn lpl1 :
==========================
CPU                     : 22.6899s
GPU (src=cpu,tgt=cpu)   : 2.9365s (x7.73)
GPU (src=cpu,tgt=gpu)   : 2.7254s (x8.33)
GPU (src=gpu,tgt=gpu)   : 2.5752s (x8.81)
Err= 2.574980e-19

@rflamary rflamary changed the title [WIP] New gpu with cupy [WIP] New ot.gpu with cupy Sep 25, 2018
@LeoGautheron

I don't know why, but I obtain much faster CPU times than yours with my i7-6700HQ, GTX 980M and 32GB of memory.

Upload data to GPU:
===================
Elapsed time : 0.4076845645904541 s

Dist computation:
===================
CPU                     : 0.0094s
GPU (src=cpu,tgt=cpu)   : 0.5155s (x0.02)
GPU (src=cpu,tgt=gpu)   : 0.0000s (x93839.17)
GPU (src=gpu,tgt=gpu)   : 0.0000s (x93839.17)
Err= 0.000000e+00

Sinkhorn computation:
=======================
CPU                     : 0.1406s
GPU (src=cpu,tgt=cpu)   : 0.1562s (x0.90)
GPU (src=cpu,tgt=gpu)   : 0.1093s (x1.29)
GPU (src=gpu,tgt=gpu)   : 0.0000s (x1406278.61)
Err= 3.419000e-12

Sinkhorn multiple target:
==========================
CPU                     : 0.9373s
GPU (src=cpu,tgt=cpu)   : 0.2968s (x3.16)
GPU (src=cpu,tgt=gpu)   : 0.2812s (x3.33)
GPU (src=gpu,tgt=gpu)   : 0.0000s (x9372515.68)
Err= 1.284586e-09

Sinkhorn lpl1 :
==========================
CPU                     : 1.9049s
GPU (src=cpu,tgt=cpu)   : 1.0466s (x1.82)
GPU (src=cpu,tgt=gpu)   : 1.0154s (x1.88)
GPU (src=gpu,tgt=gpu)   : 1.0154s (x1.88)
Err= 3.523657e-19

@rflamary
Collaborator Author

rflamary commented Sep 26, 2018

Hello @LeoGautheron ,

This is because my CPU has a low frequency (1.5GHz), so it can't really fight an i7 ;) and I'm not sure my numpy install properly uses its multiple cores. Also, kudos on the GTX, which is clearly better than my old Titan X.

Still, note that the time is clearly dominated by uploads/downloads to/from GPU memory. We have to be clear in the documentation that the best speed is obtained with matrices that are already cupy arrays, as in the sketch below.
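
A minimal sketch of that recommended pattern (my illustration, not code from the PR; it assumes cupy is installed and uses cupy.asnumpy for the final download):

import numpy as np
import cupy as cp
import ot
import ot.gpu

n = 2000
xs = np.random.randn(n, 2).astype(np.float32)
xt = np.random.randn(n, 2).astype(np.float32)
w = ot.unif(n)

# pay the host-to-device transfer cost once...
xs_g, xt_g = ot.gpu.to_gpu(xs, xt)

# ...then chain the computations without leaving GPU memory
M_g = ot.gpu.dist(xs_g, xt_g, to_numpy=False)
G_g = ot.gpu.sinkhorn(w, w, M_g, 1, to_numpy=False)

# download only the final result
G = cp.asnumpy(G_g)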

These performances are not very impressive, to be frank, but I think it is important that we provide a working ot.gpu for release 0.5. We can definitely do better, but unless you have kickass optimization tricks to share on short notice, I think we should merge this (after a proper documentation update).

What do you think?

@LeoGautheron

I think it's good to go; I see no optimization to do right now.

@rflamary
Collaborator Author

OK then, thank you for the feedback; we need to provide GPU support for the users (30k+ downloads on pip).

And sorry again for your zombie PR; I stole some stuff from it anyway, so not all is lost.

@rflamary rflamary changed the title [WIP] New ot.gpu with cupy [MRG] New ot.gpu with cupy Sep 27, 2018
@rflamary rflamary merged commit 8f6c455 into master Sep 28, 2018
@rflamary rflamary deleted the new_gpu branch December 5, 2018 11:49