GpuDnnReduction is slower than GpuCAReduceCuda #6432

Open
danielS91 opened this issue Sep 20, 2017 · 1 comment

@danielS91

I recently updated to the current Theano dev version. While doing so, I noticed a significant slowdown when training Keras models.
It looks like GpuDnnReduction is much slower than the previously used GpuCAReduceCuda. I put together a minimal example to reproduce the problem:

import numpy as np
import theano
import pygpu
from time import time

print('theano: {}'.format(theano.__version__))
print('pygpu: {}'.format(pygpu.__version__))

dtype = theano.config.floatX
sizes = [1000, 10000, 20000]

a = theano.tensor.matrix()
f_max = theano.function([a], a.max(axis=1))
f_min = theano.function([a], a.min(axis=1))
f_sum = theano.function([a], a.sum(axis=1))

# print graphs
funcs = {'max': f_max, 'min': f_min, 'sum': f_sum}
for f in funcs:
    print(f)
    theano.printing.debugprint(funcs[f], print_type=True)
    # first call
    funcs[f](np.random.random((5, 5)).astype(dtype))

# time execution
n_runs = 100
for f in funcs:
    print(f)
    for s in sizes:
        data = np.random.random((s, s)).astype(dtype)
        t1 = time()
        for i in range(n_runs):
            funcs[f](data)
        print("{}x{}:\t{}".format(s, s, (time()-t1)/n_runs))

I tested this on both a Maxwell GPU (Titan X) and a Pascal GPU (GTX 1080 Ti), which, for whatever reason, is slower than the Titan X in this benchmark:

When running with Theano defaults, the output looks like this:

Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX TITAN X (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1
max
HostFromGpu(gpuarray) [id A] <TensorType(float32, vector)> 'max'   3
 |GpuDnnReduction{red_op='maximum', axis=(1,), acc_dtype='float32', dtype='float32', return_indices=False} [id B] <GpuArrayType<None>(float32, vector)> ''   2
   |GpuContiguous [id C] <GpuArrayType<None>(float32, matrix)> ''   1
     |GpuFromHost<None> [id D] <GpuArrayType<None>(float32, matrix)> ''   0
       |<TensorType(float32, matrix)> [id E] <TensorType(float32, matrix)>

...

max
1000x1000:	0.000675480365753
10000x10000:	0.0677583003044
20000x20000:	0.275841779709
sum
1000x1000:	0.00063549041748
10000x10000:	0.0676817893982
20000x20000:	0.275875630379
min
1000x1000:	0.000633809566498
10000x10000:	0.0676900100708
20000x20000:	0.275860497952

--------------------------------

Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX 1080 Ti (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1

...

max
1000x1000:	0.000814170837402
10000x10000:	0.0918325400352
20000x20000:	0.366781489849
sum
1000x1000:	0.000770909786224
10000x10000:	0.0916589212418
20000x20000:	0.366722888947
min
1000x1000:	0.000764350891113
10000x10000:	0.0915873193741
20000x20000:	0.366739079952

When running with THEANO_FLAGS='optimizer_excluding=local_dnn_reduction', the output looks like this:

Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX TITAN X (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1
max
HostFromGpu(gpuarray) [id A] <TensorType(float32, vector)> 'max'   2
 |GpuCAReduceCuda{maximum}{1} [id B] <GpuArrayType<None>(float32, vector)> ''   1
   |GpuFromHost<None> [id C] <GpuArrayType<None>(float32, matrix)> ''   0
     |<TensorType(float32, matrix)> [id D] <TensorType(float32, matrix)>

...

max
1000x1000:	0.000498871803284
10000x10000:	0.0555803990364
20000x20000:	0.221592080593
sum
1000x1000:	0.000458009243011
10000x10000:	0.054373550415
20000x20000:	0.21805713892
min
1000x1000:	0.000489480495453
10000x10000:	0.055532169342
20000x20000:	0.221575219631

--------------------------------

Using cuDNN version 7001 on context None
Mapped name None to device cuda: GeForce GTX 1080 Ti (0000:01:00.0)
theano: 0.10.0beta2+119.gbc20630
pygpu: 0.7.1

...

max
1000x1000:	0.000663950443268
10000x10000:	0.0790155386925
20000x20000:	0.315941710472
sum
1000x1000:	0.000606820583344
10000x10000:	0.0785833787918
20000x20000:	0.314622268677
min
1000x1000:	0.000656189918518
10000x10000:	0.0790068006516
20000x20000:	0.316200020313

For now, I ended up disabling all related optimizers: local_dnn_reduction, local_cudnn_maxandargmax, and local_dnn_argmax (a sketch of doing this per compiled function is below).
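
For reference, the same exclusions can also be applied per compiled function from Python instead of globally via THEANO_FLAGS. This is only a minimal sketch, assuming Mode.excluding() accepts the same optimizer tags as optimizer_excluding:

import theano

# Sketch: build a mode that excludes the cuDNN reduction/argmax optimizers,
# so functions compiled with it fall back to GpuCAReduceCuda.
mode_no_dnn_red = theano.compile.get_default_mode().excluding(
    'local_dnn_reduction', 'local_cudnn_maxandargmax', 'local_dnn_argmax')

a = theano.tensor.matrix()
f_max_ca = theano.function([a], a.max(axis=1), mode=mode_no_dnn_red)
theano.printing.debugprint(f_max_ca)  # should show GpuCAReduceCuda{maximum}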

@nouiz (Member) commented Oct 5, 2017

Thanks for the report. I have been able to reproduce it on a GTX 750 too.

I'll open an issue with NVIDIA about this. Note that I modified your test to better isolate the cuDNN reduction time (so we mostly avoid moving the data to the GPU inside the timed loop):

import numpy as np
import theano
import pygpu
from time import time

print('theano: {}'.format(theano.__version__))
print('pygpu: {}'.format(pygpu.__version__))

dtype = theano.config.floatX
sizes = [1000, 10000]  # , 20000]

a = theano.shared(np.random.random((sizes[0], sizes[0])).astype(dtype))
f_max = theano.function([], a.max(axis=1))
f_min = theano.function([], a.min(axis=1))
f_sum = theano.function([], a.sum(axis=1))

# print graphs
funcs = {'max': f_max, 'min': f_min, 'sum': f_sum}
for f in funcs:
    print(f)
    theano.printing.debugprint(funcs[f], print_type=True)
    # first call
    funcs[f]()

# time execution
n_runs = 100
for f in funcs:
    print(f)
    for s in sizes:
        data = np.random.random((s, s)).astype(dtype)
        a.set_value(data)
        t1 = time()
        for i in range(n_runs):
            funcs[f]()
        print("{}x{}:\t{}".format(s, s, (time() - t1)/n_runs))
