
caffe conv kernel for theano. tests work, but needs integration and some... #2002

Merged
merged 28 commits into from Aug 5, 2014

Conversation

stencilman
Contributor

The Caffe convolution works and passes its tests; however, the code needs some cleaning, and the spots are marked with TODO comments. I created a new file, theano/sandbox/cuda/tests/test_conv_gemm.py, that calls GpuConvMM (the im2col-plus-GEMM idea behind the kernel is sketched below).

TODO:

  • Add support for the full mode

Other possible follow up in gh-2015
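
For context (this is not code from the diff): the Caffe kernel computes each convolution as an im2col lowering followed by a single GEMM. A rough NumPy sketch of that idea, assuming unit stride, 'valid' mode and the cross-correlation convention:

    import numpy as np

    def im2col(img, kh, kw):
        # img: (channels, H, W) -> columns: (channels * kh * kw, out_h * out_w)
        c, h, w = img.shape
        out_h, out_w = h - kh + 1, w - kw + 1
        cols = np.empty((c * kh * kw, out_h * out_w), dtype=img.dtype)
        row = 0
        for ch in range(c):
            for i in range(kh):
                for j in range(kw):
                    cols[row] = img[ch, i:i + out_h, j:j + out_w].reshape(-1)
                    row += 1
        return cols, out_h, out_w

    def conv_gemm(img, filters):
        # filters: (n_filters, channels, kh, kw); one GEMM does the whole convolution
        n_f, c, kh, kw = filters.shape
        cols, out_h, out_w = im2col(img, kh, kw)
        out = filters.reshape(n_f, -1).dot(cols)   # (n_f, c*kh*kw) x (c*kh*kw, out_h*out_w)
        return out.reshape(n_f, out_h, out_w)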

NEWS.txt

  • Add faster convolution (Arjun Jain, Frederic B.)

@@ -0,0 +1,33 @@
// Copyright 2014 BVLC and contributors.
Member

This is not ok. You need to add the full license (either inline as a comment, or in a separate file that is referred to here).

@abergeron
Member

The code is a little rough around the edges, but I'm pretty sure we want to merge after the issues are taken care of.

@stencilman
Contributor Author

I think Fred wanted to clean it up and merge it properly, and that is why I left it as is. However, I will make all the changes you suggest.


@nouiz
Member

nouiz commented Jul 30, 2014

I made a PR to your branch with some progress. It returns bad values right now in some cases. The variables m, n and k don't have the right values. Can you check in the Caffe code, in the file vision_layer.hpp, how we should compute M_, N_ and K_? I have some difficulty finding the correspondence to image.shape[*] and to the filter shape. I'll try to continue tomorrow if I can.
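
For reference, the usual im2col/GEMM mapping (worth double-checking against vision_layer.hpp, since this is from memory and ignores grouping, padding and strides) would be, in terms of the image and filter shapes:

    def gemm_dims(img_shape, kern_shape):
        # img_shape: (batch, channels, img_h, img_w)
        # kern_shape: (n_filters, channels, kern_h, kern_w)
        # 'valid' mode, unit stride, no padding assumed.
        _, channels, img_h, img_w = img_shape
        n_filters, _, kern_h, kern_w = kern_shape
        out_h, out_w = img_h - kern_h + 1, img_w - kern_w + 1
        M = n_filters                    # M_: number of output feature maps
        K = channels * kern_h * kern_w   # K_: shared inner dimension of the GEMM
        N = out_h * out_w                # N_: one column per output pixel (per image)
        return M, N, K

    # e.g. gemm_dims((16, 3, 240, 240), (16, 3, 5, 5)) -> (16, 55696, 75)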

… cuda_ndarray gpuval = cuda_ndarray.conv(img, kern, mode, subsample). So, I made the changes in the test_conv_cuda_ndarray _test_dummy().

I see that the CPU version is computed using py_conv(), which in turn calls scipy.signal.convolve2d. How can the result 'gpuval' now be the same as scipy.signal.convolve2d instead of scipy.signal.correlate?

Also, this still passes tests for all image, kernel, channel and batch sizes: https://github.com/stencilman/Theano-1/blob/fb66035292ef070b86466bf61c9c42b8faaa0a1c/theano/sandbox/cuda/tests/test_conv_gemm.py
…nv_cuda_ndarray.py. I rotated the kernel by 180 degrees before the convolution, and this now gives the same result as GpuConvMM. So, I think the CUDA/C part is completely fine and the correct arguments are being passed to the cuBLAS function.
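
A quick SciPy check of that flip relation (correlating with a kernel is the same as convolving with the kernel rotated by 180 degrees), which is why the CPU reference needs the rotation to match the GEMM-based correlation:

    import numpy as np
    from scipy import signal

    rng = np.random.RandomState(0)
    img = rng.rand(8, 8).astype('float32')
    kern = rng.rand(3, 3).astype('float32')

    corr = signal.correlate2d(img, kern, mode='valid')
    conv_rot180 = signal.convolve2d(img, kern[::-1, ::-1], mode='valid')

    # flipping the kernel by 180 degrees turns convolution into correlation
    assert np.allclose(corr, conv_rot180)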
@stencilman
Contributor Author

Fred, I made two commits, please have a look. I do not think there is any problem with the CUDA/C code. Thanks!

@@ -0,0 +1,193 @@
/*
Member

You still have to keep the original copyright notice, which was deleted here.

@nouiz mentioned this pull request Aug 5, 2014
@stencilman
Contributor Author

Thanks! Oh, support for non-square kernels is great. I will try to look into it later this week.

@stencilman
Contributor Author

Hi all, I am sorry for being so negative, but I am very disheartened by the speed of Theano. The fprop is awesome and the speed is exactly like in Torch7; however, the bprop in Theano is 5x slower!! I really wanted to use Theano, but if I can't get it to be faster, I will have no option.

I would be very grateful if anyone could provide any input. :-( Thank you in advance.
Below I attach my network structure and profile results. I only use convolutions (GpuCorrMM) everywhere.

... building the model
** DataLayer
DataLayer out_size:  (16, 3, 240, 240)
** Created DataLayer
** LCNLayer
LCNLayer in_size:  (16, 3, 240, 240)
LCNLayer out_size: (16, 3, 240, 240)
LCNLayer filtersnp.shape: (1, 3, 9, 9)
** Created LCNLayer
** ConvPoolLayer
ConvPoolLayer filter_size:  5
ConvPoolLayer in_size: (16, 3, 240, 240)
ConvPoolLayer out_size: (16, 16, 120, 120)
** Created ConvPoolLayer
** ConvPoolLayer
ConvPoolLayer filter_size:  5
ConvPoolLayer in_size: (16, 16, 120, 120)
ConvPoolLayer out_size: (16, 16, 60, 60)
** Created ConvPoolLayer
** ConvPoolLayer
ConvPoolLayer filter_size:  5
ConvPoolLayer in_size: (16, 16, 60, 60)
ConvPoolLayer out_size: (16, 16, 60, 60)
** Created ConvPoolLayer
** ConvPoolLayer
ConvPoolLayer filter_size:  9
ConvPoolLayer in_size: (16, 16, 60, 60)
ConvPoolLayer out_size: (16, 512, 60, 60)
** Created ConvPoolLayer
** ConvPoolLayer
ConvPoolLayer filter_size:  1
ConvPoolLayer in_size: (16, 512, 60, 60)
ConvPoolLayer out_size: (16, 256, 60, 60)
** Created ConvPoolLayer
** ConvPoolLayer
Linear Activation for this layer
ConvPoolLayer filter_size:  1
ConvPoolLayer in_size: (16, 256, 60, 60)
ConvPoolLayer out_size: (16, 4, 60, 60)
** Created ConvPoolLayer
...started Compute
In DataLayer
out_size:  (16, 3, 240, 240)
++ Computed DataLayer 0
In LCNLayer
++ Computed LCNLayer 1
In ConvPoolLayer, filter size:  (16, 3, 5, 5)
++ Computed ConvPoolLayer 3
In ConvPoolLayer, filter size:  (16, 16, 5, 5)
++ Computed ConvPoolLayer 4
In ConvPoolLayer, filter size:  (16, 16, 5, 5)
++ Computed ConvPoolLayer 5
In ConvPoolLayer, filter size:  (512, 16, 9, 9)
++ Computed ConvPoolLayer 6
In ConvPoolLayer, filter size:  (256, 512, 1, 1)
++ Computed ConvPoolLayer 16
In ConvPoolLayer, filter size:  (4, 256, 1, 1)
++ Computed ConvPoolLayer 18
Using MSE
Regularization: 0.0
Not using RMSPROP
Not using Momentum
learning_rate: 0.02
==> Compiling theano funcitons...
==> Done compiling theano funcitons.
==> training!
==> doing train epoch on train data:
==> online epoch # 0             [batchSize = 16] 
Exec: rm data/conffast/train.log
[===================================================================>]
Avg Error 0.11
Per sample train-time 211.91msec


Function profiling
==================
  Message: /home/ajain/Projects/deep_nets/python/lib/machine.py:239
  Time in 48 calls to Function.__call__: 1.490243e+02s
  Time in Function.fn.__call__: 1.490195e+02s (99.997%)
  Time in thunks: 1.485269e+02s (99.666%)
  Total compile time: 5.829459e+00s
    Number of Apply nodes: 369
    Theano Optimizer time: 5.523921e+00s
       Theano validate time: 3.618696e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.642159e-01s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  94.5%    94.5%     140.298s       1.54e-01s     C      912      19   theano.sandbox.cuda.blas.GpuCorrMM
   2.0%    96.5%       3.038s       1.58e-02s     C      192       4   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   1.3%    97.8%       1.943s       1.09e-03s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuContiguous
   0.6%    98.4%       0.901s       4.27e-04s     C     2112      44   theano.sandbox.cuda.basic_ops.GpuElemwise
   0.5%    98.9%       0.683s       5.47e-04s     Py    1248      26   theano.sandbox.cuda.basic_ops.GpuReshape
   0.3%    99.2%       0.411s       2.14e-03s     C      192       4   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.2%    99.4%       0.296s       3.09e-03s     Py      96       2   theano.tensor.extra_ops.RepeatOp
   0.2%    99.6%       0.291s       1.01e-03s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
   0.2%    99.7%       0.247s       8.58e-04s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMax
   0.1%    99.8%       0.179s       1.24e-03s     C      144       3   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.1%    99.9%       0.116s       3.02e-04s     C      384       8   theano.sandbox.cuda.basic_ops.GpuCAReduce
   0.1%   100.0%       0.092s       3.82e-04s     C      240       5   theano.sandbox.cuda.basic_ops.GpuAlloc
   0.0%   100.0%       0.007s       4.19e-06s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.0%   100.0%       0.006s       2.19e-06s     C     2688      56   theano.compile.ops.Shape_i
   0.0%   100.0%       0.005s       3.32e-06s     C     1440      30   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.0%   100.0%       0.004s       7.37e-06s     C      576      12   theano.tensor.basic.Join
   0.0%   100.0%       0.003s       2.49e-06s     C     1152      24   theano.tensor.elemwise.Elemwise
   0.0%   100.0%       0.002s       2.78e-06s     C      864      18   theano.tensor.subtensor.Subtensor
   0.0%   100.0%       0.002s       2.92e-06s     C      672      14   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.001s       2.20e-05s     Py      48       1   theano.sandbox.cuda.basic_ops.GpuFlatten
   ... (remaining 2 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  91.5%    91.5%     135.923s       2.57e-01s     C      528       11   GpuCorrMM{valid, (1, 1), pad=0}
   2.9%    94.5%       4.375s       1.14e-02s     C      384        8   GpuCorrMM{full, (1, 1), pad=0}
   2.0%    96.5%       3.038s       1.58e-02s     C      192        4   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}
   1.3%    97.8%       1.943s       1.09e-03s     C     1776       37   GpuContiguous
   0.5%    98.3%       0.683s       5.47e-04s     Py    1248       26   GpuReshape{4}
   0.3%    98.5%       0.411s       2.14e-03s     C      192        4   GpuFromHost
   0.2%    98.7%       0.296s       3.09e-03s     Py      96        2   RepeatOp
   0.2%    98.9%       0.243s       1.01e-03s     C      240        5   GpuElemwise{add,no_inplace}
   0.2%    99.1%       0.235s       1.22e-03s     C      192        4   GpuDownsampleFactorMaxGrad{(1, 1),True}
   0.1%    99.2%       0.210s       8.73e-04s     C      240        5   GpuElemwise{maximum,no_inplace}
   0.1%    99.3%       0.203s       1.06e-03s     C      192        4   GpuDownsampleFactorMax{(1, 1),True}
   0.1%    99.5%       0.179s       1.24e-03s     C      144        3   HostFromGpu
   0.1%    99.6%       0.163s       6.81e-04s     C      240        5   GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)]
   0.1%    99.7%       0.156s       6.51e-04s     C      240        5   GpuElemwise{Mul}[(0, 0)]
   0.1%    99.7%       0.095s       3.30e-04s     C      288        6   GpuCAReduce{add}{1,0,1,1}
   0.1%    99.8%       0.083s       4.34e-04s     C      192        4   GpuAlloc{memset_0=True}
   0.0%    99.8%       0.056s       5.88e-04s     C       96        2   GpuDownsampleFactorMaxGrad{(2, 2),True}
   0.0%    99.9%       0.044s       4.58e-04s     C       96        2   GpuDownsampleFactorMax{(2, 2),True}
   0.0%    99.9%       0.026s       5.39e-04s     C       48        1   GpuElemwise{sqr,no_inplace}
   0.0%    99.9%       0.024s       2.48e-04s     C       96        2   GpuElemwise{TrueDiv}[(0, 0)]
   ... (remaining 37 Ops account for   0.09%(0.14s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  35.0%    35.0%      52.035s       1.08e+00s     48   313   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
  30.3%    65.3%      44.946s       9.36e-01s     48   327   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
  15.1%    80.4%      22.398s       4.67e-01s     48   367   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   5.4%    85.8%       8.042s       1.68e-01s     48   325   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   3.6%    89.4%       5.390s       1.12e-01s     48   355   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   1.2%    90.6%       1.771s       3.69e-02s     48   322   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=
   1.2%    91.8%       1.742s       3.63e-02s     48   190   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   1.0%    92.7%       1.431s       2.98e-02s     48   341   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.7%    93.4%       0.968s       2.02e-02s     48   364   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=
   0.5%    93.8%       0.669s       1.39e-02s     48   300   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.4%    94.2%       0.561s       1.17e-02s     48    44   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.4%    94.6%       0.552s       1.15e-02s     48    61   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    94.9%       0.486s       1.01e-02s     48   326   GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
   0.3%    95.2%       0.468s       9.74e-03s     48   224   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    95.5%       0.430s       8.95e-03s     48   311   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    95.8%       0.393s       8.20e-03s     48   232   GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
   0.3%    96.1%       0.389s       8.11e-03s     48   122   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    96.3%       0.381s       7.93e-03s     48   208   GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:}.0, Join.0)
   0.2%    96.6%       0.371s       7.74e-03s     48   366   GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
   0.2%    96.8%       0.364s       7.58e-03s     48   353   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   ... (remaining 349 Apply instances account for 3.19%(4.74s) of the runtime)

Function profiling
==================
  Message: /home/ajain/Projects/deep_nets/python/lib/machine.py:243
  Time in 0 calls to Function.__call__: 0.000000e+00s
  Total compile time: 3.771852e+00s
    Number of Apply nodes: 0
    Theano Optimizer time: 3.601604e+00s
       Theano validate time: 1.030920e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.514251e-01s

Function profiling
==================
  Message: Sum of all printed profiles at exit excluding Scan op profile.
  Time in 48 calls to Function.__call__: 1.490243e+02s
  Time in Function.fn.__call__: 1.490195e+02s (99.997%)
  Time in thunks: 1.485269e+02s (99.666%)
  Total compile time: 9.601311e+00s
    Number of Apply nodes: 369
    Theano Optimizer time: 9.125525e+00s
       Theano validate time: 4.649615e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 4.156411e-01s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  94.5%    94.5%     140.298s       1.54e-01s     C      912      19   theano.sandbox.cuda.blas.GpuCorrMM
   2.0%    96.5%       3.038s       1.58e-02s     C      192       4   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   1.3%    97.8%       1.943s       1.09e-03s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuContiguous
   0.6%    98.4%       0.901s       4.27e-04s     C     2112      44   theano.sandbox.cuda.basic_ops.GpuElemwise
   0.5%    98.9%       0.683s       5.47e-04s     Py    1248      26   theano.sandbox.cuda.basic_ops.GpuReshape
   0.3%    99.2%       0.411s       2.14e-03s     C      192       4   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.2%    99.4%       0.296s       3.09e-03s     Py      96       2   theano.tensor.extra_ops.RepeatOp
   0.2%    99.6%       0.291s       1.01e-03s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
   0.2%    99.7%       0.247s       8.58e-04s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMax
   0.1%    99.8%       0.179s       1.24e-03s     C      144       3   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.1%    99.9%       0.116s       3.02e-04s     C      384       8   theano.sandbox.cuda.basic_ops.GpuCAReduce
   0.1%   100.0%       0.092s       3.82e-04s     C      240       5   theano.sandbox.cuda.basic_ops.GpuAlloc
   0.0%   100.0%       0.007s       4.19e-06s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.0%   100.0%       0.006s       2.19e-06s     C     2688      56   theano.compile.ops.Shape_i
   0.0%   100.0%       0.005s       3.32e-06s     C     1440      30   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.0%   100.0%       0.004s       7.37e-06s     C      576      12   theano.tensor.basic.Join
   0.0%   100.0%       0.003s       2.49e-06s     C     1152      24   theano.tensor.elemwise.Elemwise
   0.0%   100.0%       0.002s       2.78e-06s     C      864      18   theano.tensor.subtensor.Subtensor
   0.0%   100.0%       0.002s       2.92e-06s     C      672      14   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.001s       2.20e-05s     Py      48       1   theano.sandbox.cuda.basic_ops.GpuFlatten
   ... (remaining 2 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  91.5%    91.5%     135.923s       2.57e-01s     C      528       11   GpuCorrMM{valid, (1, 1), pad=0}
   2.9%    94.5%       4.375s       1.14e-02s     C      384        8   GpuCorrMM{full, (1, 1), pad=0}
   2.0%    96.5%       3.038s       1.58e-02s     C      192        4   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}
   1.3%    97.8%       1.943s       1.09e-03s     C     1776       37   GpuContiguous
   0.5%    98.3%       0.683s       5.47e-04s     Py    1248       26   GpuReshape{4}
   0.3%    98.5%       0.411s       2.14e-03s     C      192        4   GpuFromHost
   0.2%    98.7%       0.296s       3.09e-03s     Py      96        2   RepeatOp
   0.2%    98.9%       0.243s       1.01e-03s     C      240        5   GpuElemwise{add,no_inplace}
   0.2%    99.1%       0.235s       1.22e-03s     C      192        4   GpuDownsampleFactorMaxGrad{(1, 1),True}
   0.1%    99.2%       0.210s       8.73e-04s     C      240        5   GpuElemwise{maximum,no_inplace}
   0.1%    99.3%       0.203s       1.06e-03s     C      192        4   GpuDownsampleFactorMax{(1, 1),True}
   0.1%    99.5%       0.179s       1.24e-03s     C      144        3   HostFromGpu
   0.1%    99.6%       0.163s       6.81e-04s     C      240        5   GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)]
   0.1%    99.7%       0.156s       6.51e-04s     C      240        5   GpuElemwise{Mul}[(0, 0)]
   0.1%    99.7%       0.095s       3.30e-04s     C      288        6   GpuCAReduce{add}{1,0,1,1}
   0.1%    99.8%       0.083s       4.34e-04s     C      192        4   GpuAlloc{memset_0=True}
   0.0%    99.8%       0.056s       5.88e-04s     C       96        2   GpuDownsampleFactorMaxGrad{(2, 2),True}
   0.0%    99.9%       0.044s       4.58e-04s     C       96        2   GpuDownsampleFactorMax{(2, 2),True}
   0.0%    99.9%       0.026s       5.39e-04s     C       48        1   GpuElemwise{sqr,no_inplace}
   0.0%    99.9%       0.024s       2.48e-04s     C       96        2   GpuElemwise{TrueDiv}[(0, 0)]
   ... (remaining 37 Ops account for   0.09%(0.14s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  35.0%    35.0%      52.035s       1.08e+00s     48   313   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
  30.3%    65.3%      44.946s       9.36e-01s     48   327   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
  15.1%    80.4%      22.398s       4.67e-01s     48   367   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   5.4%    85.8%       8.042s       1.68e-01s     48   325   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   3.6%    89.4%       5.390s       1.12e-01s     48   355   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   1.2%    90.6%       1.771s       3.69e-02s     48   322   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=
   1.2%    91.8%       1.742s       3.63e-02s     48   190   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   1.0%    92.7%       1.431s       2.98e-02s     48   341   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.7%    93.4%       0.968s       2.02e-02s     48   364   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=
   0.5%    93.8%       0.669s       1.39e-02s     48   300   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.4%    94.2%       0.561s       1.17e-02s     48    44   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.4%    94.6%       0.552s       1.15e-02s     48    61   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    94.9%       0.486s       1.01e-02s     48   326   GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
   0.3%    95.2%       0.468s       9.74e-03s     48   224   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    95.5%       0.430s       8.95e-03s     48   311   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    95.8%       0.393s       8.20e-03s     48   232   GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
   0.3%    96.1%       0.389s       8.11e-03s     48   122   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    96.3%       0.381s       7.93e-03s     48   208   GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:}.0, Join.0)
   0.2%    96.6%       0.371s       7.74e-03s     48   366   GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
   0.2%    96.8%       0.364s       7.58e-03s     48   353   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   ... (remaining 349 Apply instances account for 3.19%(4.74s) of the runtime)

@nouiz
Member

nouiz commented Aug 6, 2014

If you look at these parts of the profile:

Class
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  94.5%    94.5%     140.298s       1.54e-01s     C      912      19   theano.sandbox.cuda.blas.GpuCorrMM

we see that we spend ~95% of the time inside the GpuCorrMM op. Here is more detail:

Ops
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  91.5%    91.5%     135.923s       2.57e-01s     C      528       11   GpuCorrMM{valid, (1, 1), pad=0}
   2.9%    94.5%       4.375s       1.14e-02s     C      384        8   GpuCorrMM{full, (1, 1), pad=0}

This tells us that we spend much more time in the valid mode than in the full mode of GpuCorrMM. So the problem is related to the new op with this profile.

Can you rerun the profiling with this extra Theano flag: profile_memory=True? This will tell us which shapes for the valid mode cause the slow case.
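
One way to pass those flags, assuming they are set before the first import of theano (setting THEANO_FLAGS on the command line works as well):

    import os
    # Theano parses THEANO_FLAGS at import time, so set the flags first.
    os.environ['THEANO_FLAGS'] = 'profile=True,profile_memory=True'
    import theano

The time and memory profiles are then printed when the process exits.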

Fred


@stencilman
Contributor Author

Hi Fred, thanks a ton for your reply. Please find below the complete log after also using profile_memory=True.

What I don't understand is why 'valid' is getting called at all. I only call conv2d with 'full'. Perhaps this happens during the backprop, but I am really not sure why. Any help would be greatly appreciated. Thanks a lot!!!
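
Most likely the 'valid' calls do come from the backprop itself: the gradient of a 'full'-mode convolution with respect to the filters is a 'valid'-mode correlation between the layer input and the output gradient (which would also explain weight-gradient shapes such as (512, 16, 9, 9) in the profile below). A small single-channel NumPy sketch of that identity, checked against finite differences:

    import numpy as np
    from scipy import signal

    rng = np.random.RandomState(0)
    x = rng.rand(6, 6)      # layer input (single channel)
    w = rng.rand(3, 3)      # filter
    gy = rng.rand(8, 8)     # gradient arriving at the 'full' output (6 + 3 - 1 = 8)

    def cost(w_):
        # forward pass: 'full'-mode convolution, then a dummy scalar cost
        return np.sum(gy * signal.convolve2d(x, w_, mode='full'))

    # the analytic filter gradient is a *valid*-mode correlation
    grad_w = signal.correlate2d(gy, x, mode='valid')

    # finite-difference check of every filter entry
    eps, num = 1e-6, np.zeros_like(w)
    for p in range(3):
        for q in range(3):
            dw = np.zeros_like(w)
            dw[p, q] = eps
            num[p, q] = (cost(w + dw) - cost(w - dw)) / (2 * eps)

    assert np.allclose(grad_w, num, atol=1e-4)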

Using gpu device 3: GeForce GTX TITAN Black
/home/ajain/Theano/theano/gof/vm.py:716: UserWarning: CVM does not support memory profile, using Stack VM.
  'CVM does not support memory profile, using Stack VM.')

Function profiling
==================
  Message: /home/ajain/Projects/deep_nets/python/lib/machine.py:239
  Time in 48 calls to Function.__call__: 1.493318e+02s
  Time in Function.fn.__call__: 1.493235e+02s (99.994%)
  Time in thunks: 1.478504e+02s (99.008%)
  Total compile time: 5.620848e+00s
    Number of Apply nodes: 369
    Theano Optimizer time: 5.321193e+00s
       Theano validate time: 3.487010e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.601519e-01s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  94.4%    94.4%     139.572s       1.53e-01s     C      912      19   theano.sandbox.cuda.blas.GpuCorrMM
   2.1%    96.5%       3.036s       1.58e-02s     C      192       4   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   1.3%    97.8%       1.980s       1.11e-03s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuContiguous
   0.6%    98.4%       0.910s       4.31e-04s     C     2112      44   theano.sandbox.cuda.basic_ops.GpuElemwise
   0.5%    98.9%       0.666s       5.34e-04s     Py    1248      26   theano.sandbox.cuda.basic_ops.GpuReshape
   0.3%    99.1%       0.382s       1.99e-03s     C      192       4   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.2%    99.3%       0.296s       3.08e-03s     Py      96       2   theano.tensor.extra_ops.RepeatOp
   0.2%    99.5%       0.296s       1.03e-03s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
   0.2%    99.7%       0.252s       8.75e-04s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMax
   0.1%    99.8%       0.178s       1.24e-03s     C      144       3   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.1%    99.9%       0.121s       3.16e-04s     C      384       8   theano.sandbox.cuda.basic_ops.GpuCAReduce
   0.1%   100.0%       0.096s       4.01e-04s     C      240       5   theano.sandbox.cuda.basic_ops.GpuAlloc
   0.0%   100.0%       0.015s       5.75e-06s     C     2688      56   theano.compile.ops.Shape_i
   0.0%   100.0%       0.013s       7.38e-06s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.0%   100.0%       0.011s       7.58e-06s     C     1440      30   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.0%   100.0%       0.007s       5.66e-06s     C     1152      24   theano.tensor.elemwise.Elemwise
   0.0%   100.0%       0.006s       1.06e-05s     C      576      12   theano.tensor.basic.Join
   0.0%   100.0%       0.005s       6.05e-06s     C      864      18   theano.tensor.subtensor.Subtensor
   0.0%   100.0%       0.004s       5.89e-06s     C      672      14   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.002s       5.11e-06s     C      336       7   theano.tensor.elemwise.Prod
   ... (remaining 2 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  91.5%    91.5%     135.223s       2.56e-01s     C      528       11   GpuCorrMM{valid, (1, 1), pad=0}
   2.9%    94.4%       4.348s       1.13e-02s     C      384        8   GpuCorrMM{full, (1, 1), pad=0}
   2.1%    96.5%       3.036s       1.58e-02s     C      192        4   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}
   1.3%    97.8%       1.980s       1.11e-03s     C     1776       37   GpuContiguous
   0.5%    98.2%       0.666s       5.34e-04s     Py    1248       26   GpuReshape{4}
   0.3%    98.5%       0.382s       1.99e-03s     C      192        4   GpuFromHost
   0.2%    98.7%       0.296s       3.08e-03s     Py      96        2   RepeatOp
   0.2%    98.9%       0.247s       1.03e-03s     C      240        5   GpuElemwise{add,no_inplace}
   0.2%    99.0%       0.237s       1.23e-03s     C      192        4   GpuDownsampleFactorMaxGrad{(1, 1),True}
   0.1%    99.2%       0.210s       8.77e-04s     C      240        5   GpuElemwise{maximum,no_inplace}
   0.1%    99.3%       0.206s       1.07e-03s     C      192        4   GpuDownsampleFactorMax{(1, 1),True}
   0.1%    99.4%       0.178s       1.24e-03s     C      144        3   HostFromGpu
   0.1%    99.5%       0.164s       6.82e-04s     C      240        5   GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)]
   0.1%    99.6%       0.157s       6.56e-04s     C      240        5   GpuElemwise{Mul}[(0, 0)]
   0.1%    99.7%       0.100s       3.47e-04s     C      288        6   GpuCAReduce{add}{1,0,1,1}
   0.1%    99.8%       0.087s       4.51e-04s     C      192        4   GpuAlloc{memset_0=True}
   0.0%    99.8%       0.059s       6.14e-04s     C       96        2   GpuDownsampleFactorMaxGrad{(2, 2),True}
   0.0%    99.8%       0.046s       4.80e-04s     C       96        2   GpuDownsampleFactorMax{(2, 2),True}
   0.0%    99.9%       0.023s       4.84e-04s     C       48        1   GpuElemwise{sqr,no_inplace}
   0.0%    99.9%       0.022s       2.32e-04s     C       96        2   GpuElemwise{TrueDiv}[(0, 0)]
   ... (remaining 37 Ops account for   0.12%(0.18s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
  34.9%    34.9%      51.585s       1.07e+00s     48   313  14400.0        0.3 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(256, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0) 
  30.4%    65.3%      44.946s       9.36e-01s     48   327  72900.0        1.6 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1) 
    input 1: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(512, 16, 9, 9), strides=(1296, 81, 9, 1) 
  15.1%    80.4%      22.270s       4.64e-01s     48   367   2109.4        0.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
    input 1: dtype=float32, shape=(3, 16, 240, 240), strides=(921600, 57600, 240, 1) 
    output 0: dtype=float32, shape=(16, 3, 5, 5), strides=(75, 25, 5, 1) 
   5.4%    85.8%       8.016s       1.67e-01s     48   325  72900.0        8.9 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
    input 1: dtype=float32, shape=(16, 512, 9, 9), strides=(41472, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
   3.6%    89.4%       5.313s       1.11e-01s     48   355   2812.5        0.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1) 
    input 1: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1) 
    output 0: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
   1.2%    90.6%       1.770s       3.69e-02s     48   322                     GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:
    input 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
    input 1: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1) 
    input 2: dtype=int64, shape=8, strides=c 
    input 3: dtype=int64, shape=8, strides=c 
    input 4: dtype=int64, shape=8, strides=c 
    input 5: dtype=int64, shape=8, strides=c 
    output 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
   1.2%    91.7%       1.732s       3.61e-02s     48   190  72900.0       41.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(512, 16, 9, 9), strides=(1296, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
   1.0%    92.7%       1.416s       2.95e-02s     48   341    703.1        0.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 64, 64), strides=(65536, 4096, 64, 1) 
    input 1: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
   0.7%    93.3%       0.969s       2.02e-02s     48   364                     GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:
    input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
    input 1: dtype=float32, shape=(16, 16, 240, 240), strides=(921600, 57600, 240, 1) 
    input 2: dtype=int64, shape=8, strides=c 
    input 3: dtype=int64, shape=8, strides=c 
    input 4: dtype=int64, shape=8, strides=c 
    input 5: dtype=int64, shape=8, strides=c 
    output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
   0.4%    93.8%       0.664s       1.38e-02s     48   300    112.5        0.2 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(4, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(256, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(4, 256, 1, 1), strides=(256, 1, 0, 0) 
   0.4%    94.2%       0.551s       1.15e-02s     48    44    427.1        0.8 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 3, 240, 240), strides=(172800, 57600, 240, 1) 
    input 1: dtype=float32, shape=(1, 3, 9, 9), strides=(0, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 1, 248, 248), strides=(61504, 0, 248, 1) 
   0.4%    94.5%       0.549s       1.14e-02s     48    61    427.1        0.8 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 3, 240, 240), strides=(172800, 57600, 240, 1) 
    input 1: dtype=float32, shape=(1, 3, 9, 9), strides=(0, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 1, 248, 248), strides=(61504, 0, 248, 1) 
   0.3%    94.9%       0.492s       1.03e-02s     48   326                     GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
    input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(4624, 2367488, 68, 1) 
    output 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1) 
   0.3%    95.2%       0.464s       9.67e-03s     48   224  14400.0       30.3 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1) 
    input 1: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0) 
    output 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1) 
   0.3%    95.5%       0.425s       8.86e-03s     48   311  14400.0       33.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(512, 256, 1, 1), strides=(256, 1, 0, 0) 
    output 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1) 
   0.3%    95.7%       0.396s       8.26e-03s     48   232                     GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
    input 0: dtype=float32, shape=(512, 16, 60, 60), strides=(3600, 1843200, 60, 1) 
    output 0: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1) 
   0.3%    96.0%       0.388s       8.07e-03s     48   122   2812.5        7.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1) 
    input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
    output 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1) 
   0.3%    96.3%       0.385s       8.02e-03s     48   208                     GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:
    input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(2367488, 4624, 68, 1) 
    input 1: dtype=int64, shape=(4,), strides=c 
    output 0: dtype=float32, shape=(8192, 1, 60, 60), strides=(3600, 0, 60, 1) 
   0.2%    96.5%       0.367s       7.64e-03s     48   366                     GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
    input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(59536, 952576, 244, 1) 
    output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
   0.2%    96.8%       0.365s       7.61e-03s     48   353   2812.5        7.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1) 
    input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
    output 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1) 
   ... (remaining 349 Apply instances account for 3.24%(4.79s) of the runtime)

Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
    Max if no gc (allow_gc=False): 1619415KB (1619415KB)
    Max if linker=cvm(default): 765866KB (765866KB)
    Memory saved if views are used: 2970455KB (2970455KB)
    Memory saved if inplace ops are used: 649542KB (649542KB)
    Memory saved if gc is enabled: 853549KB (853549KB)

    <Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>

     151519232B  [(16, 512, 68, 68)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[[ 0.]]]]}, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0)
     151519232B  [(512, 16, 68, 68)] v GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
     151519232B  [(16, 512, 68, 68)] i GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuReshape{4}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
     151519232B  [(512, 16, 68, 68)] v GpuDimShuffle{1,0,2,3}(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
     151519232B  [(16, 512, 68, 68)] v GpuContiguous(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
     151519232B  [(16, 512, 68, 68)] c GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
     117964800B  [(16, 512, 60, 60)] v GpuContiguous(GpuElemwise{maximum,no_inplace}.0)
     117964800B  [(512, 16, 60, 60)] v GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
     117964800B  [(16, 512, 60, 60)] c GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
     117964800B  [(16, 512, 60, 60)] c GpuElemwise{add,no_inplace}(GpuReshape{4}.0, GpuDimShuffle{x,0,x,x}.0)
     117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMaxGrad{(1, 1),True}.0, MakeVector.0)
     117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMax{(1, 1),True}.0, Join.0)
     117964800B  [(16, 512, 60, 60)] v GpuSubtensor{::, ::, int64:int64:, int64:int64:}(GpuCorrMM{full, (1, 1), pad=0}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
     117964800B  [(16, 512, 60, 60)] i GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)](GpuElemwise{maximum,no_inplace}.0, GpuElemwise{add,no_inplace}.0)
     117964800B  [(8192, 1, 60, 60)] c GpuDownsampleFactorMaxGrad{(1, 1),True}(GpuReshape{4}.0, GpuDownsampleFactorMax{(1, 1),True}.0, GpuReshape{4}.0)
     117964800B  [(16, 512, 60, 60)] i GpuElemwise{Mul}[(0, 0)](GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)].0, GpuCorrMM{valid, (1, 1), pad=0}.0)
     117964800B  [(16, 512, 60, 60)] c GpuElemwise{maximum,no_inplace}(CudaNdarrayConstant{[[[[  9.99999997e-07]]]]}, GpuElemwise{add,no_inplace}.0)
     117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:}.0, Join.0)
     117964800B  [(512, 16, 60, 60)] v GpuDimShuffle{1,0,2,3}(GpuElemwise{maximum,no_inplace}.0)
     117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuElemwise{Mul}[(0, 0)].0, MakeVector.0)
   ... (remaining 349 Apply account for 2804535640B/5365158232B ((52.27%)) of the Apply with dense outputs sizes)

    <created/inplace/view> is taken from the Op's declaration.
    Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.

Function profiling
==================
  Message: /home/ajain/Projects/deep_nets/python/lib/machine.py:243
  Time in 0 calls to Function.__call__: 0.000000e+00s
  Total compile time: 3.708201e+00s
    Number of Apply nodes: 0
    Theano Optimizer time: 3.540352e+00s
       Theano validate time: 1.026495e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.489460e-01s

Function profiling
==================
  Message: Sum of all printed profiles at exit excluding Scan op profile.
  Time in 48 calls to Function.__call__: 1.493318e+02s
  Time in Function.fn.__call__: 1.493235e+02s (99.994%)
  Time in thunks: 1.478504e+02s (99.008%)
  Total compile time: 9.329049e+00s
    Number of Apply nodes: 369
    Theano Optimizer time: 8.861545e+00s
       Theano validate time: 4.513505e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 4.090979e-01s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  94.4%    94.4%     139.572s       1.53e-01s     C      912      19   theano.sandbox.cuda.blas.GpuCorrMM
   2.1%    96.5%       3.036s       1.58e-02s     C      192       4   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   1.3%    97.8%       1.980s       1.11e-03s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuContiguous
   0.6%    98.4%       0.910s       4.31e-04s     C     2112      44   theano.sandbox.cuda.basic_ops.GpuElemwise
   0.5%    98.9%       0.666s       5.34e-04s     Py    1248      26   theano.sandbox.cuda.basic_ops.GpuReshape
   0.3%    99.1%       0.382s       1.99e-03s     C      192       4   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.2%    99.3%       0.296s       3.08e-03s     Py      96       2   theano.tensor.extra_ops.RepeatOp
   0.2%    99.5%       0.296s       1.03e-03s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
   0.2%    99.7%       0.252s       8.75e-04s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMax
   0.1%    99.8%       0.178s       1.24e-03s     C      144       3   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.1%    99.9%       0.121s       3.16e-04s     C      384       8   theano.sandbox.cuda.basic_ops.GpuCAReduce
   0.1%   100.0%       0.096s       4.01e-04s     C      240       5   theano.sandbox.cuda.basic_ops.GpuAlloc
   0.0%   100.0%       0.015s       5.75e-06s     C     2688      56   theano.compile.ops.Shape_i
   0.0%   100.0%       0.013s       7.38e-06s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.0%   100.0%       0.011s       7.58e-06s     C     1440      30   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.0%   100.0%       0.007s       5.66e-06s     C     1152      24   theano.tensor.elemwise.Elemwise
   0.0%   100.0%       0.006s       1.06e-05s     C      576      12   theano.tensor.basic.Join
   0.0%   100.0%       0.005s       6.05e-06s     C      864      18   theano.tensor.subtensor.Subtensor
   0.0%   100.0%       0.004s       5.89e-06s     C      672      14   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.002s       5.11e-06s     C      336       7   theano.tensor.elemwise.Prod
   ... (remaining 2 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  91.5%    91.5%     135.223s       2.56e-01s     C      528       11   GpuCorrMM{valid, (1, 1), pad=0}
   2.9%    94.4%       4.348s       1.13e-02s     C      384        8   GpuCorrMM{full, (1, 1), pad=0}
   2.1%    96.5%       3.036s       1.58e-02s     C      192        4   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}
   1.3%    97.8%       1.980s       1.11e-03s     C     1776       37   GpuContiguous
   0.5%    98.2%       0.666s       5.34e-04s     Py    1248       26   GpuReshape{4}
   0.3%    98.5%       0.382s       1.99e-03s     C      192        4   GpuFromHost
   0.2%    98.7%       0.296s       3.08e-03s     Py      96        2   RepeatOp
   0.2%    98.9%       0.247s       1.03e-03s     C      240        5   GpuElemwise{add,no_inplace}
   0.2%    99.0%       0.237s       1.23e-03s     C      192        4   GpuDownsampleFactorMaxGrad{(1, 1),True}
   0.1%    99.2%       0.210s       8.77e-04s     C      240        5   GpuElemwise{maximum,no_inplace}
   0.1%    99.3%       0.206s       1.07e-03s     C      192        4   GpuDownsampleFactorMax{(1, 1),True}
   0.1%    99.4%       0.178s       1.24e-03s     C      144        3   HostFromGpu
   0.1%    99.5%       0.164s       6.82e-04s     C      240        5   GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)]
   0.1%    99.6%       0.157s       6.56e-04s     C      240        5   GpuElemwise{Mul}[(0, 0)]
   0.1%    99.7%       0.100s       3.47e-04s     C      288        6   GpuCAReduce{add}{1,0,1,1}
   0.1%    99.8%       0.087s       4.51e-04s     C      192        4   GpuAlloc{memset_0=True}
   0.0%    99.8%       0.059s       6.14e-04s     C       96        2   GpuDownsampleFactorMaxGrad{(2, 2),True}
   0.0%    99.8%       0.046s       4.80e-04s     C       96        2   GpuDownsampleFactorMax{(2, 2),True}
   0.0%    99.9%       0.023s       4.84e-04s     C       48        1   GpuElemwise{sqr,no_inplace}
   0.0%    99.9%       0.022s       2.32e-04s     C       96        2   GpuElemwise{TrueDiv}[(0, 0)]
   ... (remaining 37 Ops account for   0.12%(0.18s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
  34.9%    34.9%      51.585s       1.07e+00s     48   313  14400.0        0.3 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(256, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0) 
  30.4%    65.3%      44.946s       9.36e-01s     48   327  72900.0        1.6 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1) 
    input 1: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(512, 16, 9, 9), strides=(1296, 81, 9, 1) 
  15.1%    80.4%      22.270s       4.64e-01s     48   367   2109.4        0.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
    input 1: dtype=float32, shape=(3, 16, 240, 240), strides=(921600, 57600, 240, 1) 
    output 0: dtype=float32, shape=(16, 3, 5, 5), strides=(75, 25, 5, 1) 
   5.4%    85.8%       8.016s       1.67e-01s     48   325  72900.0        8.9 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
    input 1: dtype=float32, shape=(16, 512, 9, 9), strides=(41472, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
   3.6%    89.4%       5.313s       1.11e-01s     48   355   2812.5        0.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1) 
    input 1: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1) 
    output 0: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
   1.2%    90.6%       1.770s       3.69e-02s     48   322                     GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:
    input 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
    input 1: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1) 
    input 2: dtype=int64, shape=8, strides=c 
    input 3: dtype=int64, shape=8, strides=c 
    input 4: dtype=int64, shape=8, strides=c 
    input 5: dtype=int64, shape=8, strides=c 
    output 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
   1.2%    91.7%       1.732s       3.61e-02s     48   190  72900.0       41.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(512, 16, 9, 9), strides=(1296, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
   1.0%    92.7%       1.416s       2.95e-02s     48   341    703.1        0.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 64, 64), strides=(65536, 4096, 64, 1) 
    input 1: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
   0.7%    93.3%       0.969s       2.02e-02s     48   364                     GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:
    input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
    input 1: dtype=float32, shape=(16, 16, 240, 240), strides=(921600, 57600, 240, 1) 
    input 2: dtype=int64, shape=8, strides=c 
    input 3: dtype=int64, shape=8, strides=c 
    input 4: dtype=int64, shape=8, strides=c 
    input 5: dtype=int64, shape=8, strides=c 
    output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
   0.4%    93.8%       0.664s       1.38e-02s     48   300    112.5        0.2 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(4, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(256, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(4, 256, 1, 1), strides=(256, 1, 0, 0) 
   0.4%    94.2%       0.551s       1.15e-02s     48    44    427.1        0.8 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 3, 240, 240), strides=(172800, 57600, 240, 1) 
    input 1: dtype=float32, shape=(1, 3, 9, 9), strides=(0, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 1, 248, 248), strides=(61504, 0, 248, 1) 
   0.4%    94.5%       0.549s       1.14e-02s     48    61    427.1        0.8 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 3, 240, 240), strides=(172800, 57600, 240, 1) 
    input 1: dtype=float32, shape=(1, 3, 9, 9), strides=(0, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 1, 248, 248), strides=(61504, 0, 248, 1) 
   0.3%    94.9%       0.492s       1.03e-02s     48   326                     GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
    input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(4624, 2367488, 68, 1) 
    output 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1) 
   0.3%    95.2%       0.464s       9.67e-03s     48   224  14400.0       30.3 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1) 
    input 1: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0) 
    output 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1) 
   0.3%    95.5%       0.425s       8.86e-03s     48   311  14400.0       33.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(512, 256, 1, 1), strides=(256, 1, 0, 0) 
    output 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1) 
   0.3%    95.7%       0.396s       8.26e-03s     48   232                     GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
    input 0: dtype=float32, shape=(512, 16, 60, 60), strides=(3600, 1843200, 60, 1) 
    output 0: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1) 
   0.3%    96.0%       0.388s       8.07e-03s     48   122   2812.5        7.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1) 
    input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
    output 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1) 
   0.3%    96.3%       0.385s       8.02e-03s     48   208                     GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:
    input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(2367488, 4624, 68, 1) 
    input 1: dtype=int64, shape=(4,), strides=c 
    output 0: dtype=float32, shape=(8192, 1, 60, 60), strides=(3600, 0, 60, 1) 
   0.2%    96.5%       0.367s       7.64e-03s     48   366                     GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
    input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(59536, 952576, 244, 1) 
    output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
   0.2%    96.8%       0.365s       7.61e-03s     48   353   2812.5        7.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1) 
    input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
    output 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1) 
   ... (remaining 349 Apply instances account for 3.24%(4.79s) of the runtime)

Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
    Max if no gc (allow_gc=False): 1619415KB (1619415KB)
    Max if linker=cvm(default): 765866KB (765866KB)
    Memory saved if views are used: 2970455KB (2970455KB)
    Memory saved if inplace ops are used: 649542KB (649542KB)
    Memory saved if gc is enabled: 853549KB (853549KB)

    <Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>

     151519232B  [(16, 512, 68, 68)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[[ 0.]]]]}, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0)
     151519232B  [(512, 16, 68, 68)] v GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
     151519232B  [(16, 512, 68, 68)] i GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuReshape{4}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
     151519232B  [(512, 16, 68, 68)] v GpuDimShuffle{1,0,2,3}(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
     151519232B  [(16, 512, 68, 68)] v GpuContiguous(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
     151519232B  [(16, 512, 68, 68)] c GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
     117964800B  [(16, 512, 60, 60)] v GpuContiguous(GpuElemwise{maximum,no_inplace}.0)
     117964800B  [(512, 16, 60, 60)] v GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
     117964800B  [(16, 512, 60, 60)] c GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
     117964800B  [(16, 512, 60, 60)] c GpuElemwise{add,no_inplace}(GpuReshape{4}.0, GpuDimShuffle{x,0,x,x}.0)
     117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMaxGrad{(1, 1),True}.0, MakeVector.0)
     117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMax{(1, 1),True}.0, Join.0)
     117964800B  [(16, 512, 60, 60)] v GpuSubtensor{::, ::, int64:int64:, int64:int64:}(GpuCorrMM{full, (1, 1), pad=0}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
     117964800B  [(16, 512, 60, 60)] i GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)](GpuElemwise{maximum,no_inplace}.0, GpuElemwise{add,no_inplace}.0)
     117964800B  [(8192, 1, 60, 60)] c GpuDownsampleFactorMaxGrad{(1, 1),True}(GpuReshape{4}.0, GpuDownsampleFactorMax{(1, 1),True}.0, GpuReshape{4}.0)
     117964800B  [(16, 512, 60, 60)] i GpuElemwise{Mul}[(0, 0)](GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)].0, GpuCorrMM{valid, (1, 1), pad=0}.0)
     117964800B  [(16, 512, 60, 60)] c GpuElemwise{maximum,no_inplace}(CudaNdarrayConstant{[[[[  9.99999997e-07]]]]}, GpuElemwise{add,no_inplace}.0)
     117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:}.0, Join.0)
     117964800B  [(512, 16, 60, 60)] v GpuDimShuffle{1,0,2,3}(GpuElemwise{maximum,no_inplace}.0)
     117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuElemwise{Mul}[(0, 0)].0, MakeVector.0)
   ... (remaining 349 Apply account for 2804535640B/5365158232B ((52.27%)) of the Apply with dense outputs sizes)

    <created/inplace/view> is taken from the Op's declaration.
    Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.

@nouiz
Member

nouiz commented Aug 6, 2014

If the forward pass is a valid convolution, the grad will contain both a
valid and a full convolution. If the forward pass is full, the grad will
also contain a valid and a full convolution.

Why do you use the full mode in the forward pass? That is very strange.
Normally, what I have seen is that people use the valid mode in the forward
pass. Are you sure your torch implementation also uses the full mode in the
forward pass?
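
A minimal sketch of what this means in practice (the variable names are mine, and whether the graph ends up on GpuCorrMM depends on the optimization flags discussed in this PR): taking the gradient of a single conv2d with border_mode='full' introduces both a full and a valid convolution in the graph, which is why GpuCorrMM{valid} dominates your profile even though the forward pass only uses 'full'.

import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d

images = T.tensor4('images')    # (batch, channels, rows, cols)
filters = T.tensor4('filters')  # (nfilters, channels, krows, kcols)

out = conv2d(images, filters, border_mode='full')
cost = out.sum()
g_filters, g_images = T.grad(cost, [filters, images])

f = theano.function([images, filters], [g_filters, g_images])
# The printed graph contains both a valid and a full convolution node.
theano.printing.debugprint(f)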

I need to leave. I'll see if I can work on that tonight. If the speed
difference is real, I see two ways to identify it:

  1. Run your code with cuda-memcheck. If we have a problem, like a call to
    gemm with numbers that are too big causing a bad memory read, we will
    see the error. That would be the easy case.
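
Another quick check would be to time the heaviest convolution in isolation (a rough sketch; the shapes are copied from Apply node 313 in the profile above, the rest of the names are made up, and it assumes the graph is optimized to GpuCorrMM as in that profile) to confirm the slowdown reproduces outside the full model:

import time
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d

images = T.tensor4('images')
filters = T.tensor4('filters')
# Same shapes as the 51.6s Apply above: a valid correlation of
# (256, 16, 60, 60) against (512, 16, 60, 60) -> (256, 512, 1, 1).
f = theano.function([images, filters],
                    conv2d(images, filters, border_mode='valid'))

img = np.random.rand(256, 16, 60, 60).astype('float32')
kern = np.random.rand(512, 16, 60, 60).astype('float32')
f(img, kern)  # warm-up: compilation and first GPU launch
t0 = time.time()
for _ in range(10):
    f(img, kern)
print('%.3fs per call' % ((time.time() - t0) / 10))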

I looked at the caffe code and it works differently in the backward case
than what we did. Check the method Backward_gpu in
src/caffe/layers/conv_layers.cu.

Here is the profile of convnet-benchmark; you can compare it to what is on
the web site. We see that the grad, the way we implement it in GpuCorrMM,
is slower:

CONFIG: input = 3 x 128 x 128 * ker = 3 x 96 x 11 x 11 ( bs = 128 , stride
= 1 )
gemm theano.sandbox.cuda.blas.GpuConvMM fprof: 1246.53857689 GFLOP/s ( tm =
0.0996497273445 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop weights: 0.0 GFLOP/s ( tm =
0.355364978313 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop inputs: 0.0 GFLOP/s ( tm =
1.75031024218 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop both: 0.0 GFLOP/s ( tm =
2.09745848179 )

CONFIG: input = 64 x 64 x 64 * ker = 64 x 128 x 9 x 9 ( bs = 128 , stride =
1 )
gemm theano.sandbox.cuda.blas.GpuConvMM fprof: 1814.27996141 GFLOP/s ( tm =
0.293620705605 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop weights: 0.0 GFLOP/s ( tm =
1.16873198748 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop inputs: 0.0 GFLOP/s ( tm =
0.506856739521 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop both: 0.0 GFLOP/s ( tm =
1.77047175169 )

CONFIG: input = 128 x 32 x 32 * ker = 128 x 128 x 9 x 9 ( bs = 128 , stride
= 1 )
gemm theano.sandbox.cuda.blas.GpuConvMM fprof: 1163.07152933 GFLOP/s ( tm =
0.168252289295 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop weights: 0.0 GFLOP/s ( tm =
0.437334001064 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop inputs: 0.0 GFLOP/s ( tm =
0.230055272579 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop both: 0.0 GFLOP/s ( tm =
0.706533968449 )

CONFIG: input = 128 x 16 x 16 * ker = 128 x 128 x 7 x 7 ( bs = 128 , stride
= 1 )
gemm theano.sandbox.cuda.blas.GpuConvMM fprof: 362.523363232 GFLOP/s ( tm =
0.0566917657852 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop weights: 0.0 GFLOP/s ( tm =
0.0665702223778 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop inputs: 0.0 GFLOP/s ( tm =
0.102097034454 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop both: 0.0 GFLOP/s ( tm =
0.168358981609 )

@nouiz
Member

nouiz commented Aug 6, 2014

Part of the difference is that we compute separate grads for the weights
and the inputs, but caffe shares some work by computing both at the same
time.

We need to implement a new op, GPU corrgrad, that will compute both grads
at the same time. Can you start it? I'll do the optimization needed.
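
To make the idea concrete, here is a rough NumPy sketch (nothing like the real GPU kernels; the function names and layout are made up for illustration) of the work a combined op could share: the im2col unrolling is done once, then one gemm gives the weight gradient and a second gemm plus a col2im gives the input gradient.

import numpy as np

def im2col(img, kh, kw):
    # img: (channels, H, W) -> (channels*kh*kw, out_h*out_w), valid mode
    c, H, W = img.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=img.dtype)
    idx = 0
    for ch in range(c):
        for i in range(kh):
            for j in range(kw):
                cols[idx] = img[ch, i:i + out_h, j:j + out_w].ravel()
                idx += 1
    return cols

def col2im(cols, c, H, W, kh, kw):
    # scatter-add the columns back into an image of shape (c, H, W)
    out_h, out_w = H - kh + 1, W - kw + 1
    img = np.zeros((c, H, W), dtype=cols.dtype)
    idx = 0
    for ch in range(c):
        for i in range(kh):
            for j in range(kw):
                img[ch, i:i + out_h, j:j + out_w] += cols[idx].reshape(out_h, out_w)
                idx += 1
    return img

def corr_both_grads(img, weights, top_grad):
    # img: (c, H, W), weights: (nf, c, kh, kw), top_grad: (nf, out_h, out_w)
    nf, c, kh, kw = weights.shape
    cols = im2col(img, kh, kw)                         # shared between both grads
    tg = top_grad.reshape(nf, -1)                      # (nf, out_h*out_w)
    g_weights = tg.dot(cols.T).reshape(weights.shape)  # gemm 1: grad wrt weights
    g_cols = weights.reshape(nf, -1).T.dot(tg)         # gemm 2: grad wrt columns
    g_img = col2im(g_cols, c, img.shape[1], img.shape[2], kh, kw)
    return g_weights, g_img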

Keep me updated on what you do and when, as I'll also work on it tonight.
I don't know if that is enough, but if we can prevent duplicate work that
would be great.
Max if linker=cvm(default): 765866KB (765866KB)
Memory saved if views are used: 2970455KB (2970455KB)
Memory saved if inplace ops are used: 649542KB (649542KB)
Memory saved if gc is enabled: 853549KB (853549KB)

<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>

 151519232B  [(16, 512, 68, 68)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[[ 0.]]]]}, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0)
 151519232B  [(512, 16, 68, 68)] v GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
 151519232B  [(16, 512, 68, 68)] i GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuReshape{4}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
 151519232B  [(512, 16, 68, 68)] v GpuDimShuffle{1,0,2,3}(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
 151519232B  [(16, 512, 68, 68)] v GpuContiguous(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
 151519232B  [(16, 512, 68, 68)] c GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
 117964800B  [(16, 512, 60, 60)] v GpuContiguous(GpuElemwise{maximum,no_inplace}.0)
 117964800B  [(512, 16, 60, 60)] v GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
 117964800B  [(16, 512, 60, 60)] c GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
 117964800B  [(16, 512, 60, 60)] c GpuElemwise{add,no_inplace}(GpuReshape{4}.0, GpuDimShuffle{x,0,x,x}.0)
 117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMaxGrad{(1, 1),True}.0, MakeVector.0)
 117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMax{(1, 1),True}.0, Join.0)
 117964800B  [(16, 512, 60, 60)] v GpuSubtensor{::, ::, int64:int64:, int64:int64:}(GpuCorrMM{full, (1, 1), pad=0}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
 117964800B  [(16, 512, 60, 60)] i GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)](GpuElemwise{maximum,no_inplace}.0, GpuElemwise{add,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] c GpuDownsampleFactorMaxGrad{(1, 1),True}(GpuReshape{4}.0, GpuDownsampleFactorMax{(1, 1),True}.0, GpuReshape{4}.0)
 117964800B  [(16, 512, 60, 60)] i GpuElemwise{Mul}[(0, 0)](GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)].0, GpuCorrMM{valid, (1, 1), pad=0}.0)
 117964800B  [(16, 512, 60, 60)] c GpuElemwise{maximum,no_inplace}(CudaNdarrayConstant{[[[[  9.99999997e-07]]]]}, GpuElemwise{add,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:}.0, Join.0)
 117964800B  [(512, 16, 60, 60)] v GpuDimShuffle{1,0,2,3}(GpuElemwise{maximum,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuElemwise{Mul}[(0, 0)].0, MakeVector.0)

... (remaining 349 Apply account for 2804535640B/5365158232B ((52.27%)) of the Apply with dense outputs sizes)

<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.

Function profiling

Message: /home/ajain/Projects/deep_nets/python/lib/machine.py:243
Time in 0 calls to Function.call: 0.000000e+00s
Total compile time: 3.708201e+00s
Number of Apply nodes: 0
Theano Optimizer time: 3.540352e+00s
Theano validate time: 1.026495e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 1.489460e-01s

Function profiling

Message: Sum of all printed profiles at exit excluding Scan op profile.
Time in 48 calls to Function.call: 1.493318e+02s
Time in Function.fn.call: 1.493235e+02s (99.994%)
Time in thunks: 1.478504e+02s (99.008%)
Total compile time: 9.329049e+00s
Number of Apply nodes: 369
Theano Optimizer time: 8.861545e+00s
Theano validate time: 4.513505e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 4.090979e-01s

Class

<% time> <sum %> <#call> <#apply>
94.4% 94.4% 139.572s 1.53e-01s C 912 19 theano.sandbox.cuda.blas.GpuCorrMM
2.1% 96.5% 3.036s 1.58e-02s C 192 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
1.3% 97.8% 1.980s 1.11e-03s C 1776 37 theano.sandbox.cuda.basic_ops.GpuContiguous
0.6% 98.4% 0.910s 4.31e-04s C 2112 44 theano.sandbox.cuda.basic_ops.GpuElemwise
0.5% 98.9% 0.666s 5.34e-04s Py 1248 26 theano.sandbox.cuda.basic_ops.GpuReshape
0.3% 99.1% 0.382s 1.99e-03s C 192 4 theano.sandbox.cuda.basic_ops.GpuFromHost
0.2% 99.3% 0.296s 3.08e-03s Py 96 2 theano.tensor.extra_ops.RepeatOp
0.2% 99.5% 0.296s 1.03e-03s C 288 6 theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
0.2% 99.7% 0.252s 8.75e-04s C 288 6 theano.sandbox.cuda.blas.GpuDownsampleFactorMax
0.1% 99.8% 0.178s 1.24e-03s C 144 3 theano.sandbox.cuda.basic_ops.HostFromGpu
0.1% 99.9% 0.121s 3.16e-04s C 384 8 theano.sandbox.cuda.basic_ops.GpuCAReduce
0.1% 100.0% 0.096s 4.01e-04s C 240 5 theano.sandbox.cuda.basic_ops.GpuAlloc
0.0% 100.0% 0.015s 5.75e-06s C 2688 56 theano.compile.ops.Shape_i
0.0% 100.0% 0.013s 7.38e-06s C 1776 37 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 100.0% 0.011s 7.58e-06s C 1440 30 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 100.0% 0.007s 5.66e-06s C 1152 24 theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.006s 1.06e-05s C 576 12 theano.tensor.basic.Join
0.0% 100.0% 0.005s 6.05e-06s C 864 18 theano.tensor.subtensor.Subtensor
0.0% 100.0% 0.004s 5.89e-06s C 672 14 theano.tensor.opt.MakeVector
0.0% 100.0% 0.002s 5.11e-06s C 336 7 theano.tensor.elemwise.Prod
... (remaining 2 Classes account for 0.00%(0.00s) of the runtime)

Ops

<% time> <sum %> <#call> <#apply>
91.5% 91.5% 135.223s 2.56e-01s C 528 11 GpuCorrMM{valid, (1, 1), pad=0}
2.9% 94.4% 4.348s 1.13e-02s C 384 8 GpuCorrMM{full, (1, 1), pad=0}
2.1% 96.5% 3.036s 1.58e-02s C 192 4 GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}
1.3% 97.8% 1.980s 1.11e-03s C 1776 37 GpuContiguous
0.5% 98.2% 0.666s 5.34e-04s Py 1248 26 GpuReshape{4}
0.3% 98.5% 0.382s 1.99e-03s C 192 4 GpuFromHost
0.2% 98.7% 0.296s 3.08e-03s Py 96 2 RepeatOp
0.2% 98.9% 0.247s 1.03e-03s C 240 5 GpuElemwise{add,no_inplace}
0.2% 99.0% 0.237s 1.23e-03s C 192 4 GpuDownsampleFactorMaxGrad{(1, 1),True}
0.1% 99.2% 0.210s 8.77e-04s C 240 5 GpuElemwise{maximum,no_inplace}
0.1% 99.3% 0.206s 1.07e-03s C 192 4 GpuDownsampleFactorMax{(1, 1),True}
0.1% 99.4% 0.178s 1.24e-03s C 144 3 HostFromGpu
0.1% 99.5% 0.164s 6.82e-04s C 240 5 GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)]
0.1% 99.6% 0.157s 6.56e-04s C 240 5 GpuElemwise{Mul}[(0, 0)]
0.1% 99.7% 0.100s 3.47e-04s C 288 6 GpuCAReduce{add}{1,0,1,1}
0.1% 99.8% 0.087s 4.51e-04s C 192 4 GpuAlloc{memset_0=True}
0.0% 99.8% 0.059s 6.14e-04s C 96 2 GpuDownsampleFactorMaxGrad{(2, 2),True}
0.0% 99.8% 0.046s 4.80e-04s C 96 2 GpuDownsampleFactorMax{(2, 2),True}
0.0% 99.9% 0.023s 4.84e-04s C 48 1 GpuElemwise{sqr,no_inplace}
0.0% 99.9% 0.022s 2.32e-04s C 96 2 GpuElemwise{TrueDiv}[(0, 0)]
... (remaining 37 Ops account for 0.12%(0.18s) of the runtime)

Apply

<% time> <sum %> <#call> <Gflops/s>
34.9% 34.9% 51.585s 1.07e+00s 48 313 14400.0 0.3 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(256, 16, 60, 60), strides=(57600, 3600, 60, 1)
input 1: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1)
output 0: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0)
30.4% 65.3% 44.946s 9.36e-01s 48 327 72900.0 1.6 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1)
input 1: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1)
output 0: dtype=float32, shape=(512, 16, 9, 9), strides=(1296, 81, 9, 1)
15.1% 80.4% 22.270s 4.64e-01s 48 367 2109.4 0.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1)
input 1: dtype=float32, shape=(3, 16, 240, 240), strides=(921600, 57600, 240, 1)
output 0: dtype=float32, shape=(16, 3, 5, 5), strides=(75, 25, 5, 1)
5.4% 85.8% 8.016s 1.67e-01s 48 325 72900.0 8.9 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1)
input 1: dtype=float32, shape=(16, 512, 9, 9), strides=(41472, 81, 9, 1)
output 0: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1)
3.6% 89.4% 5.313s 1.11e-01s 48 355 2812.5 0.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1)
input 1: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1)
output 0: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1)
1.2% 90.6% 1.770s 3.69e-02s 48 322 GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:
input 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1)
input 1: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1)
input 2: dtype=int64, shape=8, strides=c
input 3: dtype=int64, shape=8, strides=c
input 4: dtype=int64, shape=8, strides=c
input 5: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1)
1.2% 91.7% 1.732s 3.61e-02s 48 190 72900.0 41.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1)
input 1: dtype=float32, shape=(512, 16, 9, 9), strides=(1296, 81, 9, 1)
output 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1)
1.0% 92.7% 1.416s 2.95e-02s 48 341 703.1 0.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 16, 64, 64), strides=(65536, 4096, 64, 1)
input 1: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1)
output 0: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1)
0.7% 93.3% 0.969s 2.02e-02s 48 364 GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:
input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1)
input 1: dtype=float32, shape=(16, 16, 240, 240), strides=(921600, 57600, 240, 1)
input 2: dtype=int64, shape=8, strides=c
input 3: dtype=int64, shape=8, strides=c
input 4: dtype=int64, shape=8, strides=c
input 5: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1)
0.4% 93.8% 0.664s 1.38e-02s 48 300 112.5 0.2 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(4, 16, 60, 60), strides=(57600, 3600, 60, 1)
input 1: dtype=float32, shape=(256, 16, 60, 60), strides=(57600, 3600, 60, 1)
output 0: dtype=float32, shape=(4, 256, 1, 1), strides=(256, 1, 0, 0)
0.4% 94.2% 0.551s 1.15e-02s 48 44 427.1 0.8 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 3, 240, 240), strides=(172800, 57600, 240, 1)
input 1: dtype=float32, shape=(1, 3, 9, 9), strides=(0, 81, 9, 1)
output 0: dtype=float32, shape=(16, 1, 248, 248), strides=(61504, 0, 248, 1)
0.4% 94.5% 0.549s 1.14e-02s 48 61 427.1 0.8 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 3, 240, 240), strides=(172800, 57600, 240, 1)
input 1: dtype=float32, shape=(1, 3, 9, 9), strides=(0, 81, 9, 1)
output 0: dtype=float32, shape=(16, 1, 248, 248), strides=(61504, 0, 248, 1)
0.3% 94.9% 0.492s 1.03e-02s 48 326 GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(4624, 2367488, 68, 1)
output 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1)
0.3% 95.2% 0.464s 9.67e-03s 48 224 14400.0 30.3 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1)
input 1: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0)
output 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1)
0.3% 95.5% 0.425s 8.86e-03s 48 311 14400.0 33.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1)
input 1: dtype=float32, shape=(512, 256, 1, 1), strides=(256, 1, 0, 0)
output 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1)
0.3% 95.7% 0.396s 8.26e-03s 48 232 GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
input 0: dtype=float32, shape=(512, 16, 60, 60), strides=(3600, 1843200, 60, 1)
output 0: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1)
0.3% 96.0% 0.388s 8.07e-03s 48 122 2812.5 7.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1)
input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1)
output 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1)
0.3% 96.3% 0.385s 8.02e-03s 48 208 GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:
input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(2367488, 4624, 68, 1)
input 1: dtype=int64, shape=(4,), strides=c
output 0: dtype=float32, shape=(8192, 1, 60, 60), strides=(3600, 0, 60, 1)
0.2% 96.5% 0.367s 7.64e-03s 48 366 GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(59536, 952576, 244, 1)
output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1)
0.2% 96.8% 0.365s 7.61e-03s 48 353 2812.5 7.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1)
input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1)
output 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1)
... (remaining 349 Apply instances account for 3.24%(4.79s) of the runtime)

Memory Profile
(Sparse variables are ignored)

(For values in brackets, it's for linker = c|py

Max if no gc (allow_gc=False): 1619415KB (1619415KB)
Max if linker=cvm(default): 765866KB (765866KB)
Memory saved if views are used: 2970455KB (2970455KB)
Memory saved if inplace ops are used: 649542KB (649542KB)
Memory saved if gc is enabled: 853549KB (853549KB)

<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>

 151519232B  [(16, 512, 68, 68)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[[ 0.]]]]}, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0)
 151519232B  [(512, 16, 68, 68)] v GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
 151519232B  [(16, 512, 68, 68)] i GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuReshape{4}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
 151519232B  [(512, 16, 68, 68)] v GpuDimShuffle{1,0,2,3}(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
 151519232B  [(16, 512, 68, 68)] v GpuContiguous(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
 151519232B  [(16, 512, 68, 68)] c GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
 117964800B  [(16, 512, 60, 60)] v GpuContiguous(GpuElemwise{maximum,no_inplace}.0)
 117964800B  [(512, 16, 60, 60)] v GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
 117964800B  [(16, 512, 60, 60)] c GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
 117964800B  [(16, 512, 60, 60)] c GpuElemwise{add,no_inplace}(GpuReshape{4}.0, GpuDimShuffle{x,0,x,x}.0)
 117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMaxGrad{(1, 1),True}.0, MakeVector.0)
 117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMax{(1, 1),True}.0, Join.0)
 117964800B  [(16, 512, 60, 60)] v GpuSubtensor{::, ::, int64:int64:, int64:int64:}(GpuCorrMM{full, (1, 1), pad=0}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
 117964800B  [(16, 512, 60, 60)] i GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)](GpuElemwise{maximum,no_inplace}.0, GpuElemwise{add,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] c GpuDownsampleFactorMaxGrad{(1, 1),True}(GpuReshape{4}.0, GpuDownsampleFactorMax{(1, 1),True}.0, GpuReshape{4}.0)
 117964800B  [(16, 512, 60, 60)] i GpuElemwise{Mul}[(0, 0)](GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)].0, GpuCorrMM{valid, (1, 1), pad=0}.0)
 117964800B  [(16, 512, 60, 60)] c GpuElemwise{maximum,no_inplace}(CudaNdarrayConstant{[[[[  9.99999997e-07]]]]}, GpuElemwise{add,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:}.0, Join.0)
 117964800B  [(512, 16, 60, 60)] v GpuDimShuffle{1,0,2,3}(GpuElemwise{maximum,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuElemwise{Mul}[(0, 0)].0, MakeVector.0)

... (remaining 349 Apply account for 2804535640B/5365158232B ((52.27%)) of the Apply with dense outputs sizes)

<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.



@stencilman
Copy link
Contributor Author

Hi Fred,

I use full because I want the output to have the same shape as the input. I can change to using valid in the fprop, but does that change anything?

Yes, I checked the caffe code and how they calculate gradients. I can add this to GpuCorrMM. How can I add a grad function that calls the CUDA code to compute the gradient values?

I tried using cuda-memcheck, but it always seems to get stuck. I will run cuda-memcheck on a minimal program instead and let you know.

Yes, I do think the backprop is as slow as I reported.

It would be amazing if you have some time to look at it later. Thank you! I am happy to do anything to make it fast, I really need this.

Thanks a lot,
Warm regards,
Arjun
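
For context on the "how can I add a grad function" question above, here is a minimal, hypothetical sketch of where a grad() method plugs into a Theano Op. This is a toy op, not the actual GpuCorrMM code; a real correlation op's grad() would return calls to other (GPU) correlation ops rather than plain arithmetic.

import theano
import theano.tensor as T
from theano import gof


class DoubleOp(gof.Op):
    """Toy op computing y = 2 * x; only meant to show where grad() fits."""

    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def make_node(self, x):
        x = T.as_tensor_variable(x)
        return gof.Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        x, = inputs
        output_storage[0][0] = 2 * x

    def grad(self, inputs, output_grads):
        # grad() returns symbolic expressions; a GPU correlation op would
        # build them from other ops (e.g. other correlations) instead of 2 * g.
        g, = output_grads
        return [2 * g]


x = T.matrix('x')
y = DoubleOp()(x)
gx = theano.grad(y.sum(), x)  # theano.grad calls DoubleOp.grad under the hood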

@nouiz
Copy link
Member

nouiz commented Aug 7, 2014

I'm looking into it. Finally, I don't think we need to create a new op that will do both grads at the same time. The fact that they are computed together in caffe/torch is only because they don't have a granularity of operations as small as Theano's.

The problem is related to the fact that we use the same code for the convolution in full mode, but they use another version in that case. I started to work on it.

Fred


@stencilman
Copy link
Contributor Author

It is great to know we might not need to do a new op!

Thanks for working on it; let me know if I can help in any way, I would be more than happy to! It would help us a lot here to have fast convolution in Theano. :-)

@nouiz
Copy link
Member

nouiz commented Aug 7, 2014

I'll go sleep. I have something that runs, but doesn't return the right answer:

https://github.com/nouiz/Theano/tree/conv_gemm

If you can review it and try to fix it, it would be great.


@stencilman
Copy link
Contributor Author

Great, thanks! I will have a look and let you know if I can find anything.

@stencilman
Copy link
Contributor Author

I am a bit confused: why do you want to call col2im in the 'full' mode? Why do you want to handle the 'valid' and 'full' modes differently in the CUDA code?

@stencilman
Copy link
Contributor Author

I added support for non-square kernels here: https://github.com/stencilman/Theano-1/commit/85b8a90f553699e67e85291cad6350e6abb5a944

It passed all tests. Is that what you were trying to do?

@nouiz
Copy link
Member

nouiz commented Aug 7, 2014

I'm trying to do the same as in this function:

cunn_SpatialConvolutionMM_updateGradInput

In particular this part:

https://github.com/torch/cunn/blob/master/SpatialConvolutionMM.cu#L303


@stencilman
Copy link
Contributor Author

OK, updateGradInput calculates the gradients, but I am still a bit confused as to why. Is it because you want to get it working first and then change it to a different function?

All tests for the full mode will fail, right, if you calculate the gradient instead of doing the correlation?

Please do explain.

And yes, the non-square stuff works perfectly, I will create a PR.

@stencilman
Copy link
Contributor Author

I created a PR (#2023) for the code that handles non-square kernels.

About your branch and the changes you made last night, I am unsure why you want to do what cunn_SpatialConvolutionMM_updateGradInput does. Can you please explain?

@stencilman
Copy link
Contributor Author

Also, why do you think it is the 'full' corr that makes it slow? And how is the algorithm different for us in the full mode than for them?

What I find is that the forward pass is super fast, even when it is full. Somehow the backward pass is very slow. It would really help me a lot if you could fix this or tell me how to make the backprop faster. Thanks a lot.

@nouiz
Copy link
Member

nouiz commented Aug 8, 2014

I started to write that reply yesterday. But forgot to hit reply. Here it
is:

On Thu, Aug 7, 2014 at 10:25 AM, Arjun Jain notifications@github.com
wrote:

Also, why do you think it is the 'full' corr that makes it slow? And how
is the algorithm different for us in the full mode than them?

I ran the convnet-benchmark before my PR and saw this result (tm is the time)

CONFIG: input = 3 x 128 x 128 * ker = 3 x 96 x 11 x 11 ( bs = 128 , stride = 1 )
theano.tensor.nnet.conv.conv2d fprof: 295.136034231 GFLOP/s ( tm = 0.420881271362 )
theano.tensor.nnet.conv.conv2d bprop weights: 0.0 GFLOP/s ( tm = 0.672349274158 )
theano.tensor.nnet.conv.conv2d bprop inputs: 0.0 GFLOP/s ( tm = 51.4428064823 )
theano.tensor.nnet.conv.conv2d bprop both: 0.0 GFLOP/s ( tm = 53.2438390255 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft fprof: 605.508708927 GFLOP/s ( tm = 0.20514523983 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop weights: 0.0 GFLOP/s ( tm = 0.206289708614 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop inputs: 0.0 GFLOP/s ( tm = 1.18310427666 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop both: 0.0 GFLOP/s ( tm = 1.39372771978 )
gemm theano.sandbox.cuda.blas.GpuCorrMM fprof: 1243.54620238 GFLOP/s ( tm = 0.0998895168304 )
gemm theano.sandbox.cuda.blas.GpuCorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.346038997173 )
gemm theano.sandbox.cuda.blas.GpuCorrMM bprop inputs: 0.0 GFLOP/s ( tm = 1.75310575962 )
gemm theano.sandbox.cuda.blas.GpuCorrMM bprop both: 0.0 GFLOP/s ( tm = 2.26795125008 )

Here we see that it takes 1753 ms (second to last line) for the bprop wrt the inputs. This is the full mode. On the web site of this benchmark (which was run on the same GPU, a Titan Black), the timing for torch7 was 91 ms.

With the code in Theano's master, we don't call col2im. Their implementation of the full convolution uses it instead of im2col. My guess is that this is the cause.

what i find is that the forward is super fast, even if it is full. Somehow
in the backward it is very slow. It will really help me a lot if you can
fix this or tell me how to make it faster for back prop. Thanks a lot.

I won't be available tonight, and I won't be available for a week after Friday afternoon. My best guess for now is the full mode. But you are right that in your profile, it was the valid mode that caused the problem.

Can you check to make sure your torch implementation also uses the full mode in the forward pass? Also, I'm pretty surprised that with that setting you would get the profile you showed me.

@stencilman
Copy link
Contributor Author

I am sorry @nouiz, I still don't understand how full is any different from valid. For me, full is just valid with some padding, and that is how we implement it.

col2im, in my opinion, has nothing to do with the full or valid modes; it is only useful for calculating the gradient. Please correct me if I am wrong.

The convnet-benchmark results @nouiz reports are clearly weird, no? 99 ms for fprop vs. 1753 ms for bprop? Why?

I will try to get to the bottom of the slow speed.

From tomorrow morning I will also not be available until next Friday, as I am going to Vancouver for a conference. I will see what I can do today.
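
As a quick single-channel check of the "full is just valid with padding" point above, here is a sketch using only scipy (illustration only, not the GpuCorrMM code):

import numpy as np
from scipy.signal import convolve2d

# 'full' convolution equals 'valid' convolution on an input zero-padded
# by (kernel_size - 1) on each border (single channel shown here).
img = np.random.rand(5, 5)
kern = np.random.rand(3, 3)

full = convolve2d(img, kern, mode='full')
padded = np.pad(img, 2, mode='constant')            # 2 = 3 - 1
valid_on_padded = convolve2d(padded, kern, mode='valid')

assert np.allclose(full, valid_on_padded)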

@nouiz
Copy link
Member

nouiz commented Aug 8, 2014

On Fri, Aug 8, 2014 at 9:35 AM, Arjun Jain notifications@github.com wrote:

I am sorry @nouiz https://github.com/nouiz , I still dont understand
how the full is any different from valid. For me, full is just valid with
some padding, and that is how we implement it.

col2im in my opinion has nothing to do with the full or valid modes, it is
only useful for calculating the gradient, please correct me if I am wrong.

When the fprop is valid, the bprop wrt the inputs will be a convolution in full mode.

You are right that the full mode is valid with padding. But when we have two similar algorithms (conv with and without padding), it sometimes happens that the fastest implementation isn't the same one. That is the case here. Since we know the padding is filled with zeros, we can write an implementation that doesn't do the computation against the zeros. My guess is that the im2col+gemm implementation does the multiplications with the zeros, but the col2im+gemm implementation doesn't. I didn't look closely enough to be sure of that. But as caffe and torch7 do it that way, I suppose they have a good reason not to use the same implementation in that case.

The convnet-benchmark results as @nouiz https://github.com/nouiz report

are clearly weird, no? It takes 99msec fprop vs 1753ms bprop? Why?

I guess it is the reason I wrote above.


@stencilman
Copy link
Contributor Author

Thanks a lot for your reply @nouiz. Hmm, I don't think so. I don't think it is the zero padding that is making it slow. Also, I think col2im and im2col do not deal with padding differently.

IMHO, the difference in speed between this and torch will not be orders of magnitude because of the zero padding. It is something else. @nouiz, I would be very grateful if you could tell me what Theano does for the bprop. What size convolutions? Maybe we try to manually fprop a convolution of the same size and see how much time it takes? What do you think?
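
A minimal sketch of the kind of manual timing proposed above. The shapes are made up (substitute the ones from the profile), and the first call is excluded because it includes compilation:

import time
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv

# Made-up shapes; replace with the shapes seen in the profile above.
img = theano.shared(np.random.rand(16, 16, 124, 124).astype('float32'))
kern = theano.shared(np.random.rand(16, 16, 5, 5).astype('float32'))

out = conv.conv2d(img, kern, border_mode='valid')
cost = out.sum()
g_img = theano.grad(cost, img)          # a full-mode convolution in the graph

fprop = theano.function([], out)
bprop_inputs = theano.function([], g_img)

for name, fn in [('fprop', fprop), ('bprop wrt inputs', bprop_inputs)]:
    fn()                                # warm-up: compilation, first allocations
    t0 = time.time()
    for _ in range(10):
        fn()
    print('%s: %.4f s per call' % (name, (time.time() - t0) / 10))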

@stencilman
Copy link
Contributor Author

Btw @nouiz, we seem to have a misunderstanding of what caffe/torch does. Correct me if I am wrong: you seem to think they use col2im + gemm for the full mode. I don't think so. They don't even have a full mode. They use col2im only when they are doing their bprop. I don't understand how Theano does the bprop.

@nouiz
Copy link
Member

nouiz commented Aug 8, 2014

The bprop is also a convolution. So when you say that they use col2im+gemm during their bprop, it means that they use that implementation for the convolution in the bprop.

The grad graph of conv2d is defined in this method:

https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L775

The grad with respect to the filters is also a convolution, check here:

https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L898

You see there that this grad is always a valid convolution. Here you see that in the full mode we swap the input/filters of that convolution compared to the valid mode:

https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L848

The grad with respect to the image is also a convolution, check here:

https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L943

Here you see that the mode of that convolution is full if the original conv is valid. Otherwise it is a valid convolution.

https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L916

So in the case where the fprop is valid, the bprop with respect to the inputs is a full convolution. In that case (bprop with respect to the inputs), caffe and torch7 use the col2im+gemm code.
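
To make that concrete, a small sketch (using the standard conv2d, not GpuCorrMM) that builds both gradients and prints their graphs; per the conv.py code linked above, the gradient wrt the images is built as a full-mode convolution and the gradient wrt the filters as a valid one:

import theano
import theano.tensor as T
from theano.tensor.nnet import conv

x = T.tensor4('x')   # images
w = T.tensor4('w')   # filters
y = conv.conv2d(x, w, border_mode='valid')

gx = theano.grad(y.sum(), x)   # gradient wrt the images
gw = theano.grad(y.sum(), w)   # gradient wrt the filters

# Inspect the gradient graphs: both contain ConvOp nodes, built as
# described above (full mode for gx, valid mode for gw).
theano.printing.debugprint(gx)
theano.printing.debugprint(gw)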


@stencilman
Copy link
Contributor Author

Thanks a ton for that comment @nouiz ! That clarifies things a lot!

So the grads are conv ops too. Is it possible that the "unroll" done for the bprop conv doesn't have the optimal shapes, and thus it is not fast?

I am still not sure if we need col2im or what gemm+col2im does. I have a suspicion the problem has something to do with the sizes...

@nouiz
Copy link
Member

nouiz commented Aug 8, 2014

I haven't been able to work on it today, and I won't work on it next week. So if one of you can continue, it would be great.

Do you know if the torch code in full mode updates the weights in place? If so, that could explain it. We don't do that now. We could do it, but it is better to do that later. To make it work not in place, we probably need to initialize the allocated memory to zero.

Also, it is possible that we need to allocate bigger memory buffers that are used with the zero padding.

@stencilman
Copy link
Contributor Author

I will try to continue to look into this.

Maybe this info helps you figure out the solution: they (torch) create a 'module' and a 'criterion'; when you call module:forward() it runs the forward pass, and when you call module:backward() it runs the backward pass (recursively).
(module: https://github.com/torch/torch7-distro/blob/master/extra/nn/Module.lua#L28, criterion: https://github.com/torch/torch7-distro/blob/master/extra/nn/Criterion.lua)

function Module:backward(input, gradOutput, scale)
   scale = scale or 1
   self:updateGradInput(input, gradOutput)
   self:accGradParameters(input, gradOutput, scale)
   return self.gradInput
end

And you then use it like this:

-- create closure to evaluate f(X) and df/dX
local feval = function(x)
     local target = <your data>

      -- reset gradients
      gradParameters:zero()  -- dE_dw

      -- evaluate function for complete mini batch
      local output = model:forward(batchGPU.data)

      local err = criterion:forward(output, target)
      ave_err = ave_err + err

      -- estimate df/dW
      local df_do = criterion:backward(output, target)
      model:backward(batchGPU.data, df_do)
      -- return f and df/dX
      return err, gradParameters
end

-- optimize on current mini-batch using the above closure
optimMethod(feval, parameters, conf.optimState)

where optimMethod is (https://github.com/torch/optim/blob/master/sgd.lua).
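
For comparison, a rough, hypothetical Theano equivalent of that forward/backward + SGD closure (names and shapes are made up); here the gradients and the parameter update live inside one compiled function instead of explicit forward()/backward() calls:

import numpy as np
import theano
import theano.tensor as T

floatX = theano.config.floatX

x = T.matrix('x')                        # mini-batch of inputs
t = T.matrix('t')                        # one-hot targets
W = theano.shared(np.zeros((100, 10), dtype=floatX), name='W')

y = T.nnet.softmax(T.dot(x, W))
cost = T.nnet.categorical_crossentropy(y, t).mean()
gW = theano.grad(cost, W)                # plays the role of accGradParameters

lr = np.asarray(0.01, dtype=floatX)
train_step = theano.function([x, t], cost,
                             updates=[(W, W - lr * gW)])

# Each call does the forward pass, the backward pass and the SGD update:
# err = train_step(batch_x, batch_t)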

@nouiz
Copy link
Member

nouiz commented Aug 9, 2014

I don't have the time to look, but where are the following "parameters" used in cunn_SpatialConvolutionMM_accGradParameters? In particular:

gradWeight
gradBias
finput
fgradInput

This would help solve some of the questions I had when writing the Theano version. Also, I think we will need to update the license to also include the torch7 license/copyright. We can do this after we fix the speed issue in this PR, just before merging.


@stencilman
Copy link
Contributor Author

gradWeight: https://github.com/torch/cunn/blob/master/SpatialConvolutionMM.cu#L407
gradBias: https://github.com/torch/cunn/blob/master/SpatialConvolutionMM.cu#L426

finput and fgradInput are used in updateGradInput (which transfers the gradients for the chain rule):
fgradInput or gradInput_n: https://github.com/torch/cunn/blob/master/SpatialConvolutionMM.cu#L306

@f0k f0k mentioned this pull request Jul 1, 2015