GpuDnnReduction is slower than GpuCAReduceCuda #6432

Open
danielS91 opened this issue Sep 20, 2017 · 1 comment

@danielS91

I recently updated to the current Theano dev version. While doing so, I noticed a significant slowdown when training Keras models.
It looks like GpuDnnReduction is much slower than the previously used GpuCAReduceCuda. I put together a minimal example to reproduce the problem:

import numpy as np
import theano
import pygpu
from time import time

print('theano: {}'.format(theano.__version__))
print('pygpu: {}'.format(pygpu.__version__))

dtype = theano.config.floatX
sizes = [1000, 10000, 20000]

a = theano.tensor.matrix()
f_max = theano.function([a], a.max(axis=1))
f_min = theano.function([a], a.min(axis=1))
f_sum = theano.function([a], a.sum(axis=1))

# print graphs
funcs = {'max': f_max, 'min': f_min, 'sum': f_sum}
for f in funcs:
    print(f)
    theano.printing.debugprint(funcs[f], print_type=True)
    # first call
    funcs[f](np.random.random((5, 5)).astype(dtype))

# time execution
n_runs = 100
for f in funcs:
    print(f)
    for s in sizes:
        data = np.random.random((s, s)).astype(dtype)
        t1 = time()
        for i in range(n_runs):
            funcs[f](data)
        print("{}x{}:\t{}".format(s, s, (time()-t1)/n_runs))

I tested this on both a Maxwell GPU (Titan X) and a Pascal GPU (GTX 1080 Ti), which, for whatever reason, is slower than the Titan X in this benchmark:

When running with Theano defaults, the output looks like this:

Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX TITAN X (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1
max
HostFromGpu(gpuarray) [id A] <TensorType(float32, vector)> 'max'   3
 |GpuDnnReduction{red_op='maximum', axis=(1,), acc_dtype='float32', dtype='float32', return_indices=False} [id B] <GpuArrayType<None>(float32, vector)> ''   2
   |GpuContiguous [id C] <GpuArrayType<None>(float32, matrix)> ''   1
     |GpuFromHost<None> [id D] <GpuArrayType<None>(float32, matrix)> ''   0
       |<TensorType(float32, matrix)> [id E] <TensorType(float32, matrix)>

...

max
1000x1000:	0.000675480365753
10000x10000:	0.0677583003044
20000x20000:	0.275841779709
sum
1000x1000:	0.00063549041748
10000x10000:	0.0676817893982
20000x20000:	0.275875630379
min
1000x1000:	0.000633809566498
10000x10000:	0.0676900100708
20000x20000:	0.275860497952

--------------------------------

Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX 1080 Ti (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1

...

max
1000x1000:	0.000814170837402
10000x10000:	0.0918325400352
20000x20000:	0.366781489849
sum
1000x1000:	0.000770909786224
10000x10000:	0.0916589212418
20000x20000:	0.366722888947
min
1000x1000:	0.000764350891113
10000x10000:	0.0915873193741
20000x20000:	0.366739079952

When running with THEANO_FLAGS='optimizer_excluding=local_dnn_reduction', the output looks like this:

Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX TITAN X (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1
max
HostFromGpu(gpuarray) [id A] <TensorType(float32, vector)> 'max'   2
 |GpuCAReduceCuda{maximum}{1} [id B] <GpuArrayType<None>(float32, vector)> ''   1
   |GpuFromHost<None> [id C] <GpuArrayType<None>(float32, matrix)> ''   0
     |<TensorType(float32, matrix)> [id D] <TensorType(float32, matrix)>

...

max
1000x1000:	0.000498871803284
10000x10000:	0.0555803990364
20000x20000:	0.221592080593
sum
1000x1000:	0.000458009243011
10000x10000:	0.054373550415
20000x20000:	0.21805713892
min
1000x1000:	0.000489480495453
10000x10000:	0.055532169342
20000x20000:	0.221575219631

--------------------------------

Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX 1080 Ti (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1

...

max
1000x1000:	0.000663950443268
10000x10000:	0.0790155386925
20000x20000:	0.315941710472
sum
1000x1000:	0.000606820583344
10000x10000:	0.0785833787918
20000x20000:	0.314622268677
min
1000x1000:	0.000656189918518
10000x10000:	0.0790068006516
20000x20000:	0.316200020313

For now, I ended up disabling all related optimizers: local_dnn_reduction, local_cudnn_maxandargmax, and local_dnn_argmax (a sketch of doing this per compiled function is below).
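
For reference, the same exclusions can also be applied per compiled function from Python instead of globally via THEANO_FLAGS. This is only a minimal sketch, assuming Mode.excluding() accepts the same optimizer tags as optimizer_excluding:

import theano

# Sketch: build a mode that excludes the cuDNN reduction/argmax optimizers,
# so functions compiled with it fall back to GpuCAReduceCuda.
mode_no_dnn_red = theano.compile.get_default_mode().excluding(
    'local_dnn_reduction', 'local_cudnn_maxandargmax', 'local_dnn_argmax')

a = theano.tensor.matrix()
f_max_ca = theano.function([a], a.max(axis=1), mode=mode_no_dnn_red)
theano.printing.debugprint(f_max_ca)  # should show GpuCAReduceCuda{maximum}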

@nouiz (Member) commented Oct 5, 2017

Thanks for the report. I have been able to reproduce it on a GTX 750 too.

I'll open an issue with NVIDIA about this. Note that I modified your test to better isolate the cuDNN reduction time (so we mostly avoid moving the data to the GPU inside the timed loop):

import numpy as np
import theano
import pygpu
from time import time

print('theano: {}'.format(theano.__version__))
print('pygpu: {}'.format(pygpu.__version__))

dtype = theano.config.floatX
sizes = [1000, 10000]  # , 20000]

a = theano.shared(np.random.random((sizes[0], sizes[0])).astype(dtype))
f_max = theano.function([], a.max(axis=1))
f_min = theano.function([], a.min(axis=1))
f_sum = theano.function([], a.sum(axis=1))

# print graphs
funcs = {'max': f_max, 'min': f_min, 'sum': f_sum}
for f in funcs:
    print(f)
    theano.printing.debugprint(funcs[f], print_type=True)
    # first call
    funcs[f]()

# time execution
n_runs = 100
for f in funcs:
    print(f)
    for s in sizes:
        data = np.random.random((s, s)).astype(dtype)
        a.set_value(data)
        t1 = time()
        for i in range(n_runs):
            funcs[f]()
        print("{}x{}:\t{}".format(s, s, (time() - t1)/n_runs))
