caffe conv kernel for theano. tests work, but needs integration and some... #2002

Merged
merged 28 commits into Theano:master on Aug 5, 2014


@stencilman
Contributor

The caffe convolution works and passes the tests; however, the code needs some cleaning, which is marked with TODO comments. I created a new file, theano/sandbox/cuda/tests/test_conv_gemm.py, that calls GpuConvMM.

TODO:

  • Add support for the full mode

Other possible follow-ups are in gh-2015

NEWS.txt

  • Add faster convolution (Arjun Jain, Frederic B.)
@abergeron abergeron commented on an outdated diff Jul 29, 2014
theano/sandbox/cuda/caffe_common.hpp
@@ -0,0 +1,33 @@
+// Copyright 2014 BVLC and contributors.
@abergeron
abergeron Jul 29, 2014 Member

This is not ok. You need to add the full license (either inline as a comment, or in a separate file that is referred to here).

@abergeron abergeron commented on an outdated diff Jul 29, 2014
theano/sandbox/cuda/caffe_common.hpp
@@ -0,0 +1,33 @@
+// Copyright 2014 BVLC and contributors.
+
+#ifndef CAFFE_COMMON_HPP_
+#define CAFFE_COMMON_HPP_
+
+//#include <boost/shared_ptr.hpp>
@abergeron
abergeron Jul 29, 2014 Member

Rather than commenting includes, just remove them.

@abergeron abergeron commented on an outdated diff Jul 29, 2014
theano/sandbox/cuda/conv_gemm.cu
@@ -0,0 +1,168 @@
+// Copyright 2014 BVLC and contributors.
@abergeron
abergeron Jul 29, 2014 Member

Same thing here, add the full license.

@abergeron abergeron commented on an outdated diff Jul 29, 2014
theano/sandbox/cuda/conv_gemm.cu
@@ -0,0 +1,168 @@
+// Copyright 2014 BVLC and contributors.
+#undef _GLIBCXX_ATOMIC_BUILTINS
+#include <Python.h>
+#include "cuda_ndarray.cuh"
+#include "caffe_common.hpp"
+// Author: Arjun Jain
+// Kernel for fast unfold+copy
+// (borrowed from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu)
@abergeron
abergeron Jul 29, 2014 Member

This is puzzling, did you write this kernel for caffe?

Also the source you cite is not for the im2col stuff that's just below.
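For reference, here is a minimal NumPy sketch of what an im2col ("unfold+copy") kernel computes, assuming a single (channels, height, width) image, a square kernel, and the usual Caffe-style row ordering; it is illustrative only and not copied from the PR's CUDA code:

import numpy as np

def im2col(img, ksize, stride=1, pad=0):
    # Rearrange every ksize x ksize patch of a (channels, height, width) image
    # into one column, which is what the CUDA "unfold+copy" kernel produces.
    channels, height, width = img.shape
    padded = np.pad(img, ((0, 0), (pad, pad), (pad, pad)), mode='constant')
    height_col = (height + 2 * pad - ksize) // stride + 1
    width_col = (width + 2 * pad - ksize) // stride + 1
    cols = np.empty((channels * ksize * ksize, height_col * width_col),
                    dtype=img.dtype)
    for c in range(channels):
        for i in range(ksize):
            for j in range(ksize):
                row = (c * ksize + i) * ksize + j
                patch = padded[c,
                               i:i + stride * height_col:stride,
                               j:j + stride * width_col:stride]
                cols[row] = patch.ravel()
    return cols

cols = im2col(np.arange(2 * 4 * 4, dtype='float32').reshape(2, 4, 4), ksize=3)
print(cols.shape)  # (2 * 3 * 3, 2 * 2) = (18, 4)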

@abergeron abergeron commented on an outdated diff Jul 29, 2014
theano/sandbox/cuda/tests/test_conv_gemm.py
@@ -0,0 +1,152 @@
+"""
+Tests for GPU convolution
+"""
+import sys
+import time
+import unittest
+import matplotlib.pyplot as plt
@abergeron
abergeron Jul 29, 2014 Member

Don't import matplotlib for a test.

@abergeron
abergeron Jul 29, 2014 Member

Especially if you don't use it.

@abergeron abergeron commented on an outdated diff Jul 29, 2014
theano/sandbox/cuda/conv_gemm.cu
+}
+
+
+
+CudaNdarray* validMM(const CudaNdarray *input,
+ CudaNdarray *weight,
+ CudaNdarray *output)
+{
+
+ // TODO: This needs to be done in the singleton!
+ // Initialize CUBLAS
+ cublasHandle_t handle;
+ cublasStatus_t status = cublasCreate(&handle);
+ if (status != CUBLAS_STATUS_SUCCESS) {
+ std::cerr << "!!!! CUBLAS initialization error\n";
+ }
@abergeron
abergeron Jul 29, 2014 Member

We have a global cublas handle now, helpfully called 'handle'. You should use it instead of making your own.

@abergeron abergeron commented on an outdated diff Jul 29, 2014
theano/sandbox/cuda/conv_gemm.cu
+ columns->devdata, m,
+ weight->devdata, k,
+ &beta,
+ output->devdata + elt * op_stride, m
+ );
+
+
+ cudaError_t err = cudaGetLastError();
+ if (err != cudaSuccess) {
+ printf("error in validMM: %s\n", cudaGetErrorString(err));
+ }
+
+ }
+
+ // TODO: How is columns and output deallocated?
+ // device_free(columns->devdata);
@abergeron
abergeron Jul 29, 2014 Member

To deallocate a CudaNdarray simply Py_DECREF() it.

@abergeron abergeron commented on an outdated diff Jul 29, 2014
theano/sandbox/cuda/conv_gemm.cu
+ int n = CudaNdarray_HOST_DIMS(weight)[1];
+ int k = CudaNdarray_HOST_DIMS(columns)[0];
+
+ //Caffe::getRef().getCublasHandle().get()
+ status = cublasSgemm(handle,
+ CUBLAS_OP_N, CUBLAS_OP_N,
+ m, n, k,
+ &alpha,
+ columns->devdata, m,
+ weight->devdata, k,
+ &beta,
+ output->devdata + elt * op_stride, m
+ );
+
+
+ cudaError_t err = cudaGetLastError();
@abergeron
abergeron Jul 29, 2014 Member

You should check the value of status for errors rather than calling cudaGetLastError()

@abergeron
Member

The code is a little rough around the edges, but I'm pretty sure we want to merge after the issues are taken care of.

@stencilman
Contributor

I think Fred wanted to clean it up and merge it properly, and that is why I left it so. However, I will make all the changes you suggest.


@nouiz
Member
nouiz commented Jul 30, 2014

I made a PR to your branch with some progress. It returns bad values in some cases right now. The variables m, n and k don't have the right values. Can you check in the Caffe code, in the file vision_layer.hpp, how M_, N_ and K_ should be computed? I have some difficulty finding the correspondence to image.shape[*] and to the filter. I'll try to continue tomorrow if I can.
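For reference, here is a small sketch of how the im2col + GEMM formulation is usually laid out, which is the correspondence the M_, N_ and K_ question is about; the shapes below follow the common Caffe-style convention and are illustrative, not taken from vision_layer.hpp:

import numpy as np

channels, height, width = 3, 7, 7        # one image of the batch
num_filters, ksize = 4, 3                # filters: (num_filters, channels, ksize, ksize)
height_out = height - ksize + 1          # valid mode, stride 1
width_out = width - ksize + 1

M = num_filters                          # rows of the reshaped filter matrix
K = channels * ksize * ksize             # shared inner dimension
N = height_out * width_out               # one column per output position

filters_mat = np.zeros((M, K), dtype='float32')   # filters reshaped to 2D
columns = np.zeros((K, N), dtype='float32')       # im2col output for this image
output = filters_mat.dot(columns)                 # (M, N): one output feature map per row
print(output.shape)                               # (4, 25)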

stencilman added some commits Jul 31, 2014
@stencilman stencilman Hi Fred, I tried it out, but for me it doesn't find conv() in package cuda_ndarray (gpuval = cuda_ndarray.conv(img, kern, mode, subsample)). So I made the changes in test_conv_cuda_ndarray._test_dummy() instead.

I see that the CPU version is computed using py_conv(), which in turn calls scipy.signal.convolve2d. How can the result 'gpuval' then match scipy.signal.convolve2d rather than scipy.signal.correlate?

Also, this still passes tests for all image, kernel, channel and batch sizes: https://github.com/stencilman/Theano-1/blob/fb66035292ef070b86466bf61c9c42b8faaa0a1c/theano/sandbox/cuda/tests/test_conv_gemm.py
80dd43e
@stencilman stencilman Look at what I did on line 117 in file theano/sandbox/cuda/tests/test_conv_cuda_ndarray.py. I rotated the kernel by 180 degrees before the convolution, and this now gives the same result as GpuConvMM. So I think the cuda/c part is completely fine and the correct arguments are being passed to the cublas function.
f18c849
@stencilman
Contributor

Fred, I made two commits, please have a look. I do not think there is any problem with the cuda/c code. Thanks!

@abergeron abergeron commented on the diff Jul 31, 2014
theano/sandbox/cuda/conv_gemm.cu
@@ -0,0 +1,193 @@
+/*
@abergeron
abergeron Jul 31, 2014 Member

You still have to keep the original copyright notice, which was deleted here.

@abergeron abergeron commented on the diff Jul 31, 2014
theano/sandbox/cuda/caffe_common.hpp
@@ -0,0 +1,53 @@
+/*
@abergeron
abergeron Jul 31, 2014 Member

Same here with the copyright notice.

@nouiz
Member
nouiz commented Jul 31, 2014

You shouldn't change the test file. Before your change, the test that fails for me is:

theano-nose theano/sandbox/cuda/tests/test_conv_cuda_ndarray.py:test_valid

An example of shapes that fail: img=(1, 1, 3, 3), kern=(1, 1, 2, 2)

The way it currently works is that the user specifies a convolution in the graph, and then an optimization changes it to use GpuConvMM. This optimization shouldn't convert it to a correlation.

Did you look for the values of M_, N_ and K_ in the Caffe code?
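The convolution-vs-correlation distinction can be checked against the same scipy reference the tests use; this is a standalone sketch, not the PR's code:

import numpy as np
from scipy import signal

rng = np.random.RandomState(0)
img = rng.rand(3, 3)
kern = rng.rand(2, 2)

conv = signal.convolve2d(img, kern, mode='valid')
corr = signal.correlate2d(img, kern, mode='valid')
corr_flipped = signal.correlate2d(img, kern[::-1, ::-1], mode='valid')

# Correlation with a 180-degree-rotated kernel equals convolution, which is
# why either the optimization or the test has to flip the filters before
# calling a correlation-style op.
assert np.allclose(conv, corr_flipped)
print(np.allclose(conv, corr))   # False for a non-symmetric kernel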

@nouiz nouiz referenced this pull request in nouiz/Theano Jul 31, 2014
Closed

Conv gemm aj #30

@stencilman
Contributor

How do you know the Caffe code does convolution and not correlation too? If you want, I can check and confirm by installing and running the Caffe code. In Torch7 they do correlation.

@stencilman
Contributor

I think maybe we can pass CUBLAS_OP_T instead of CUBLAS_OP_N. I am trying to get it to work such that it does a convolution and not a correlation.

@nouiz
Member
nouiz commented Jul 31, 2014

The optimization flips the filter before calling GpuConvMM, so the result should be a convolution.

Did you look at the m, n and k parameters? I think they are wrong and the cause of the bug that makes it behave like a correlation.

Fred


@stencilman
Contributor

Can you please point me to where the optimization flips the kernel? I am now looking at the n,m,k and will try to find the right values.

@nouiz
Member
nouiz commented Jul 31, 2014

https://github.com/stencilman/Theano-1/blob/conv_gemm/theano/sandbox/cuda/opt.py#L1294


@stencilman stencilman - fixed a bug in the caffe conv (values of _M, _N, _K)
- added a test that checks a variety of shapes and sizes of image and kernel
- removed the flip from local_conv_gemm in cuda/opt.py as the code never reached there for me
- added the flip to the kernel in the test code itself
c649d66
@stencilman
Contributor

Hey Fred, please check out my last commit. I think it fixes everything; please see the commit log for the details of the fixes. Most importantly, I removed the kernel flip in local_conv_gemm as this code was never reached for me. But everything looks good to me!! Please have a look, thanks a lot in advance!

@nouiz
Member
nouiz commented Aug 1, 2014

I did a PR to your PR. It is normal that with the test you did, you didn't find the optimization being used, as you manually specified the op to use in your test.

I still have failures when I run the tests like this:

theano-nose theano/sandbox/cuda/tests/test_conv_cuda_ndarray.py:test_valid -s

I won't have more time today for this. Can you try to check the problem in that case?


@stencilman
Contributor

Thanks a lot Fred! I confirm that I can run it using theano-nose, and yes, I do see failures for some cases. I will investigate tomorrow morning and get back.

@stencilman
Contributor

and I merged with your repo branch.

@stencilman
Contributor

Fred, to me it seems it is failing when the image stride is not (1, 1). Is that true? Why is it testing that case? We are not passing the stride values to the c/cuda code yet, so strides will not work. Please let me know if that is the case, or tell me if there is any other case it is failing for. Thank you in advance :-)

@stencilman
Contributor

I confirm that theano-nose theano/sandbox/cuda/tests/test_conv_cuda_ndarray.py:test_valid -s passes for all values if the image stride is (1, 1) and the kernel stride is (1, 1). We can definitely pass these strides to the cuda call and it will still work for all square stride values (but the nose test will still fail, as it calls conv with non-square stride values). Let me know how you want to handle this and I can fix it. Thank you!!

@nouiz
Member
nouiz commented Aug 1, 2014

I made a PR to this PR that finishes the valid mode support. I think there are 3 things left to do. I put them in the description of this PR and here:

  • Add support for the full mode
  • Remove The need of gpu_contiguous
  • Add support for strides (probably only square strides)

I would do them in this order. I think we should do the first one before merging this PR. The other 2 could wait for another PR if we can't do them quickly enough.

@stencilman can you start working on the full mode support?

@stencilman
Contributor

Adding support for the full mode can easily be accomplished by using the 'pad' parameter. We can pad by (kernel size - 1)/2 and that would make it a full conv (of course, if that is fractional we need to add one more pixel on one side). What do you think? How should we do it?

@benanne
Contributor
benanne commented Aug 1, 2014

You'll need to pad (kernel size - 1) zeros on the input on all sides to get a full convolution, not half of that :)
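This can be checked directly with the scipy reference; a small standalone sketch of the padding identity, not the PR's C code:

import numpy as np
from scipy import signal

rng = np.random.RandomState(0)
img = rng.rand(5, 5)
kern = rng.rand(3, 3)
k = kern.shape[0]

full = signal.convolve2d(img, kern, mode='full')
padded = np.pad(img, k - 1, mode='constant')          # (kernel size - 1) zeros on every side
full_via_pad = signal.convolve2d(padded, kern, mode='valid')

assert np.allclose(full, full_via_pad)
print(full.shape)    # (7, 7): a 5x5 image and a 3x3 kernel in full mode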

@stencilman
Contributor

Right!!! Sorry, I was thinking of the same-size conv which I use. Yes, you are right!


@benanne
Contributor
benanne commented Aug 1, 2014

Now that you mention it, it would be quite useful to have support for 'same' convolutions as well (and not just for this implementation).

@nouiz
Member
nouiz commented Aug 1, 2014

If we want to add this support quickly, we could just add support in the Conv2D python code, not in the C code. Then we could use the fast version on the GPU in this case. That way, we don't need to add support everywhere.

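Along those lines, a 'same' convolution can be built in Python on top of the existing 'full' mode by cropping its centre; this is a sketch assuming an odd kernel size so the centring is unambiguous:

import numpy as np
from scipy import signal

def conv2d_same(img, kern):
    # 'same' = the central img-sized window of the 'full' output.
    kh, kw = kern.shape
    full = signal.convolve2d(img, kern, mode='full')
    top, left = (kh - 1) // 2, (kw - 1) // 2
    return full[top:top + img.shape[0], left:left + img.shape[1]]

rng = np.random.RandomState(0)
img = rng.rand(6, 6)
kern = rng.rand(3, 3)
assert np.allclose(conv2d_same(img, kern), signal.convolve2d(img, kern, mode='same'))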

@stencilman
Contributor

Hmm, true, ok, let me try and get back!

@stencilman stencilman Including support for 'full' convolutions. It uses the existing pad functionality of the c/cuda code. If mode == full, I make pad = (filter_size - 1). This is passed to the C code, and everything else gets taken care of automatically.

The theano-nose for test_full does not pass; however, the test function test_gemm() passes. @nouiz: do you know why?
497a2d9
@stencilman
Contributor

I just added support for 'full' convolutions. It uses the existing pad functionality of the c/cuda code. If mode == full, I make pad = (filter_size - 1). This is passed to the C code, and everything else gets taken care of automatically.

The theano-nose for test_full does not pass; however, the test function test_gemm() passes. Fred, do you know why?

@stencilman
Contributor

I had a closer look at test_full() and found that line 187 asserts, and I do not know why. If I comment out the lines below, test_full() passes. Fred, do you have any comment? Thanks!

if cls is not None:
    assert any([isinstance(node.op, cls)
                for node in f.maker.fgraph.toposort()]), f.maker.fgraph.toposort()

@lamblin
Member
lamblin commented Aug 2, 2014

@stencilman : this assert is there to make sure that there is at least an op in the function f that is an instance of cls, here an instance of cuda.blas.GpuConvMM.
What is the content of f.maker.fgraph.toposort()? Do you see a convolution Op in there?
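Concretely, the check lamblin describes boils down to something like this helper; a sketch that just restates the assert from the test, where cls would be cuda.blas.GpuConvMM:

def assert_op_in_graph(f, cls):
    # List the op types in the optimized graph of a compiled Theano function f
    # and check that at least one apply node uses an instance of cls.
    nodes = f.maker.fgraph.toposort()
    print([type(node.op).__name__ for node in nodes])
    assert any(isinstance(node.op, cls) for node in nodes), nodes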

@stencilman
Contributor

Thanks a lot for your comment @lamblin; because of it I figured out what the problem was. Now test_full also passes for GpuConvMM! Please have a look, thanks!

@stencilman
Contributor

I am not sure how to remove the need for gpu_contiguous. If anyone can tell me how, or do it instead, that would be great. I think support for strides can come in a fresh PR. So, can this PR be merged already?

@stencilman
Contributor

Hi all! Sorry for being so annoying, but any updates on this? I am happy to make any more changes that are required for this to be merged. Thank you!

@abergeron abergeron commented on the diff Aug 4, 2014
theano/sandbox/cuda/blas.py
@@ -497,9 +498,179 @@ def c_code(self, node, name, inputs, outputs, sub):
gpu_ger_inplace = GpuGer(inplace=True)
+class GpuConvMM(GpuOp):
+ """
+ Author: Arjun Jain
+ Implement the caffe convolution
+ """
+ def __init__(self, border_mode,
+ subsample=(1, 1),
+ pad=0):
+ """
+ :param border_mode: "valid" or "full"
+ :param subsample: not yet supported
@abergeron
abergeron Aug 4, 2014 Member

This isn't true anymore, you are passing the subsample parameters to the C code now.

@nouiz
nouiz Aug 4, 2014 Member

We don't pass it to validMM(); we just allocate the output memory at the right size, but it will write to the full output memory region! So it is not working for now.

@abergeron abergeron commented on an outdated diff Aug 4, 2014
theano/sandbox/cuda/blas.py
+ subsample=(1, 1),
+ pad=0):
+ """
+ :param border_mode: "valid" or "full"
+ :param subsample: not yet supported
+ :param pad: not yet supported
+ """
+ self.border_mode = border_mode
+ self.subsample = subsample
+ self.pad = pad
+ if pad != 0:
+ raise NotImplementedError(
+ "GpuConvMM don't implement the pad parameter")
+ if subsample != (1, 1):
+ raise NotImplementedError(
+ "GpuConvMM we don't implement the subsample parameter")
@abergeron
abergeron Aug 4, 2014 Member

Could you relax this check so that it accepts valid values?

@nouiz nouiz commented on an outdated diff Aug 4, 2014
theano/sandbox/cuda/conv_gemm.cu
+ int height_col = (height + 2 * pad - ksize) / stride + 1;
+ int width_col = (width + 2 * pad - ksize) / stride + 1;
+ int num_kernels = channels * height_col * width_col;
+
+ // Launch
+ im2col_kernel <<<CAFFE_GET_BLOCKS(num_kernels), CAFFE_CUDA_NUM_THREADS>>> (
+ num_kernels, data_im, height, width, ksize,
+ pad, stride,
+ height_col, width_col, data_col
+ );
+}
+
+
+
+// Author: Arjun Jain
+CudaNdarray* validMM(const CudaNdarray *input,
@nouiz
nouiz Aug 4, 2014 Member

rename to corrMM (for correlation, which is what it actually does)

@stencilman
Contributor

Added the changes suggested by @nouiz. I did not make the changes suggested by @abergeron, as we do not pass these params to the corrMM functions; they are only used to allocate the right output memory size. Please let me know if anything else needs to be done before this can be merged. Thanks! :-)

@nouiz
Member
nouiz commented Aug 4, 2014

Before merging I think we should do 2 things I forgot:

  1. rename GpuConvMM to GpuCorrMM. That way it will have the correct name.
  2. we should document it. Can you add the info about it in this file:

doc/library/tensor/nnet/conv.txt

I started it, but I need to leave:

    - :func:`GpuCorrMM <theano.sandbox.cuda.blas.GpuCorrMM>`
      This is a GPU-only version of a correlation that uses a gemm call
      to perform the work. It needs extra memory for computation.
      You can enable it for calls to conv2d by setting
      'THEANO_FLAGS=optimizer_including=conv_gemm'
      in your environment. This is not enabled by default because it
      uses some extra memory. It doesn't support strides for now and requires
      square images and kernels.

and at the end of the file:

.. autofunction:: theano.sandbox.cuda.blas.GpuCorrMM
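To make the flag in the doc text above concrete, here is a minimal usage sketch; the script itself is illustrative, only the flag is taken from the doc text:

# Run with:
#   THEANO_FLAGS=device=gpu,floatX=float32,optimizer_including=conv_gemm python script.py
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv

images = T.ftensor4('images')
filters = T.ftensor4('filters')
out = conv.conv2d(images, filters, border_mode='valid')
f = theano.function([images, filters], out)

# With optimizer_including=conv_gemm, the convolution node should be replaced
# by GpuCorrMM in the optimized graph.
print(f.maker.fgraph.toposort())

img_val = np.random.rand(2, 3, 8, 8).astype('float32')
filt_val = np.random.rand(4, 3, 5, 5).astype('float32')
print(f(img_val, filt_val).shape)   # (2, 4, 4, 4) in valid mode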
@nouiz
Member
nouiz commented Aug 4, 2014

I did some profiling to know how much time is spent in gpu_contiguous. It was taking significant time for only 1 shape configuration, and only for the bprop wrt weights. So this can wait for another PR:

Function profiling
==================
  Message: gemm theano.sandbox.cuda.blas.GpuConvMM bprop wrt weights
  Time in 4 calls to Function.__call__: 1.536482e+00s
  Time in Function.fn.__call__: 1.536326e+00s (99.990%)
  Time in thunks: 1.533858e+00s (99.829%)
  Total compile time: 1.262300e-01s
    Number of Apply nodes: 9
    Theano Optimizer time: 1.171520e-01s
       Theano validate time: 4.892349e-04s
    Theano Linker time (includes C, CUDA code generation/compiling): 5.265951e-03s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  80.0%    80.0%       1.228s       3.07e-01s     C        4       1   theano.sandbox.cuda.blas.GpuConvMM
  20.0%   100.0%       0.306s       3.83e-02s     C        8       2   theano.sandbox.cuda.basic_ops.GpuContiguous
   0.0%   100.0%       0.000s       3.14e-06s     C       12       3   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.0%   100.0%       0.000s       2.92e-06s     C       12       3   theano.sandbox.cuda.basic_ops.GpuSubtensor
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  80.0%    80.0%       1.228s       3.07e-01s     C        4        1   GpuConvMM{valid, (1, 1), pad=0}
  20.0%   100.0%       0.306s       3.83e-02s     C        8        2   GpuContiguous
   0.0%   100.0%       0.000s       3.14e-06s     C       12        3   GpuDimShuffle{1,0,2,3}
   0.0%   100.0%       0.000s       2.92e-06s     C       12        3   GpuSubtensor{::, ::, ::int64, ::int64}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  80.0%    80.0%       1.228s       3.07e-01s      4     6   GpuConvMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
  19.5%    99.5%       0.298s       7.46e-02s      4     5   GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
   0.5%   100.0%       0.008s       1.95e-03s      4     3   GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
   0.0%   100.0%       0.000s       5.48e-06s      4     8   GpuSubtensor{::, ::, ::int64, ::int64}(GpuDimShuffle{1,0,2,3}.0, Constant{-1}, Co
   0.0%   100.0%       0.000s       5.01e-06s      4     7   GpuDimShuffle{1,0,2,3}(GpuConvMM{valid, (1, 1), pad=0}.0)
   0.0%   100.0%       0.000s       2.86e-06s      4     1   GpuDimShuffle{1,0,2,3}(<CudaNdarrayType(float32, 4D)>)
   0.0%   100.0%       0.000s       2.74e-06s      4     2   GpuSubtensor{::, ::, ::int64, ::int64}(GpuDimShuffle{1,0,2,3}.0, Constant{-1}, Co
   0.0%   100.0%       0.000s       1.55e-06s      4     0   GpuDimShuffle{1,0,2,3}(<CudaNdarrayType(float32, 4D)>)
   0.0%   100.0%       0.000s       5.36e-07s      4     4   GpuSubtensor{::, ::, ::int64, ::int64}(GpuSubtensor{::, ::, ::int64, ::int64}.0, 
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

@stencilman stencilman - added some documentation
- changed conv to corr as suggested by Fred
4c55bc4
@stencilman
Contributor

Hi, I changed the name from conv to corr and added it to the doc. I am curious, how do we get rid of the GpuContiguous? 20% is a lot of time for me. Also, if in my theano code I want to use the correlation directly and not the convolution, is there a clean way to do this, or do I need to keep a local Theano where I make the hacks? Thanks! Also, please let me know if the doc is ok and if any other changes are desired. Thanks a lot Fred for profiling and all the other help :-)
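On using the correlation directly: one possible sketch, based only on how the op is constructed and applied elsewhere in this thread (a border_mode argument to the constructor, two gpu_contiguous inputs); the exact call in the merged code may differ:

import theano
import theano.tensor as T
from theano.sandbox.cuda.basic_ops import gpu_contiguous
from theano.sandbox.cuda.blas import GpuCorrMM

images = T.ftensor4('images')
filters = T.ftensor4('filters')

# Apply the correlation op directly, bypassing the conv2d -> GpuCorrMM
# optimization (and the filter flip that comes with it).
out = GpuCorrMM('valid')(gpu_contiguous(images), gpu_contiguous(filters))
f = theano.function([images, filters], out)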

@lamblin
Member
lamblin commented Aug 4, 2014

Btw, there is probably a way of speeding up the execution of GpuContiguous itself in that specific case.

@stencilman
Contributor

@lamblin How? I would be happy to do it if I can. If it can be done quicker by you guys and you have the time, I would be grateful if you added this optimization; it would help me a lot. Thanks!

@lamblin
Member
lamblin commented Aug 5, 2014

I'll try to check that if I have time this week. That would need a little bit of profiling first to see if it can be done easily or not.

@stencilman
Contributor

Ok. Thanks @lamblin! Anything else we should change/add to this PR so that we can merge it?

@nouiz nouiz merged commit 369af1a into Theano:master Aug 5, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
@nouiz
Member
nouiz commented Aug 5, 2014

I just merged it. I read on the convnet benchmarks that Caffe updated its convolution code to support non-square kernels. I'll create a ticket with the possible extensions we could make to this.

@nouiz nouiz referenced this pull request Aug 5, 2014
Closed

Continue gemm convolution #2015

5 of 8 tasks complete
@stencilman
Contributor

Thanks! Oh, support for non-square kernels is great. I will try to look into it later this week.

@stencilman
Contributor

Hi all, I am sorry for being so negative, but I am very disheartened by the speed of Theano. The fprop is awesome and the speed is exactly like in Torch7; however, the bprop in Theano is 5x slower!! I really wanted to use Theano, but if I can't get it to be faster, I will have no option.

I will be very grateful if anyone could provide any input. :-( Thank you in advance.
Below I attach my network structure and profile results. I only use convolutions (GpuCorrMM) everywhere.

... building the model
** DataLayer
DataLayer out_size:  (16, 3, 240, 240)
** Created DataLayer
** LCNLayer
LCNLayer in_size:  (16, 3, 240, 240)
LCNLayer out_size: (16, 3, 240, 240)
LCNLayer filtersnp.shape: (1, 3, 9, 9)
** Created LCNLayer
** ConvPoolLayer
ConvPoolLayer filter_size:  5
ConvPoolLayer in_size: (16, 3, 240, 240)
ConvPoolLayer out_size: (16, 16, 120, 120)
** Created ConvPoolLayer
** ConvPoolLayer
ConvPoolLayer filter_size:  5
ConvPoolLayer in_size: (16, 16, 120, 120)
ConvPoolLayer out_size: (16, 16, 60, 60)
** Created ConvPoolLayer
** ConvPoolLayer
ConvPoolLayer filter_size:  5
ConvPoolLayer in_size: (16, 16, 60, 60)
ConvPoolLayer out_size: (16, 16, 60, 60)
** Created ConvPoolLayer
** ConvPoolLayer
ConvPoolLayer filter_size:  9
ConvPoolLayer in_size: (16, 16, 60, 60)
ConvPoolLayer out_size: (16, 512, 60, 60)
** Created ConvPoolLayer
** ConvPoolLayer
ConvPoolLayer filter_size:  1
ConvPoolLayer in_size: (16, 512, 60, 60)
ConvPoolLayer out_size: (16, 256, 60, 60)
** Created ConvPoolLayer
** ConvPoolLayer
Linear Activation for this layer
ConvPoolLayer filter_size:  1
ConvPoolLayer in_size: (16, 256, 60, 60)
ConvPoolLayer out_size: (16, 4, 60, 60)
** Created ConvPoolLayer
...started Compute
In DataLayer
out_size:  (16, 3, 240, 240)
++ Computed DataLayer 0
In LCNLayer
++ Computed LCNLayer 1
In ConvPoolLayer, filter size:  (16, 3, 5, 5)
++ Computed ConvPoolLayer 3
In ConvPoolLayer, filter size:  (16, 16, 5, 5)
++ Computed ConvPoolLayer 4
In ConvPoolLayer, filter size:  (16, 16, 5, 5)
++ Computed ConvPoolLayer 5
In ConvPoolLayer, filter size:  (512, 16, 9, 9)
++ Computed ConvPoolLayer 6
In ConvPoolLayer, filter size:  (256, 512, 1, 1)
++ Computed ConvPoolLayer 16
In ConvPoolLayer, filter size:  (4, 256, 1, 1)
++ Computed ConvPoolLayer 18
Using MSE
Regularization: 0.0
Not using RMSPROP
Not using Momentum
learning_rate: 0.02
==> Compiling theano funcitons...
==> Done compiling theano funcitons.
==> training!
==> doing train epoch on train data:
==> online epoch # 0             [batchSize = 16] 
Exec: rm data/conffast/train.log
[===================================================================>]
Avg Error 0.11
Per sample train-time 211.91msec


Function profiling
==================
  Message: /home/ajain/Projects/deep_nets/python/lib/machine.py:239
  Time in 48 calls to Function.__call__: 1.490243e+02s
  Time in Function.fn.__call__: 1.490195e+02s (99.997%)
  Time in thunks: 1.485269e+02s (99.666%)
  Total compile time: 5.829459e+00s
    Number of Apply nodes: 369
    Theano Optimizer time: 5.523921e+00s
       Theano validate time: 3.618696e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.642159e-01s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  94.5%    94.5%     140.298s       1.54e-01s     C      912      19   theano.sandbox.cuda.blas.GpuCorrMM
   2.0%    96.5%       3.038s       1.58e-02s     C      192       4   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   1.3%    97.8%       1.943s       1.09e-03s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuContiguous
   0.6%    98.4%       0.901s       4.27e-04s     C     2112      44   theano.sandbox.cuda.basic_ops.GpuElemwise
   0.5%    98.9%       0.683s       5.47e-04s     Py    1248      26   theano.sandbox.cuda.basic_ops.GpuReshape
   0.3%    99.2%       0.411s       2.14e-03s     C      192       4   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.2%    99.4%       0.296s       3.09e-03s     Py      96       2   theano.tensor.extra_ops.RepeatOp
   0.2%    99.6%       0.291s       1.01e-03s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
   0.2%    99.7%       0.247s       8.58e-04s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMax
   0.1%    99.8%       0.179s       1.24e-03s     C      144       3   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.1%    99.9%       0.116s       3.02e-04s     C      384       8   theano.sandbox.cuda.basic_ops.GpuCAReduce
   0.1%   100.0%       0.092s       3.82e-04s     C      240       5   theano.sandbox.cuda.basic_ops.GpuAlloc
   0.0%   100.0%       0.007s       4.19e-06s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.0%   100.0%       0.006s       2.19e-06s     C     2688      56   theano.compile.ops.Shape_i
   0.0%   100.0%       0.005s       3.32e-06s     C     1440      30   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.0%   100.0%       0.004s       7.37e-06s     C      576      12   theano.tensor.basic.Join
   0.0%   100.0%       0.003s       2.49e-06s     C     1152      24   theano.tensor.elemwise.Elemwise
   0.0%   100.0%       0.002s       2.78e-06s     C      864      18   theano.tensor.subtensor.Subtensor
   0.0%   100.0%       0.002s       2.92e-06s     C      672      14   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.001s       2.20e-05s     Py      48       1   theano.sandbox.cuda.basic_ops.GpuFlatten
   ... (remaining 2 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  91.5%    91.5%     135.923s       2.57e-01s     C      528       11   GpuCorrMM{valid, (1, 1), pad=0}
   2.9%    94.5%       4.375s       1.14e-02s     C      384        8   GpuCorrMM{full, (1, 1), pad=0}
   2.0%    96.5%       3.038s       1.58e-02s     C      192        4   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}
   1.3%    97.8%       1.943s       1.09e-03s     C     1776       37   GpuContiguous
   0.5%    98.3%       0.683s       5.47e-04s     Py    1248       26   GpuReshape{4}
   0.3%    98.5%       0.411s       2.14e-03s     C      192        4   GpuFromHost
   0.2%    98.7%       0.296s       3.09e-03s     Py      96        2   RepeatOp
   0.2%    98.9%       0.243s       1.01e-03s     C      240        5   GpuElemwise{add,no_inplace}
   0.2%    99.1%       0.235s       1.22e-03s     C      192        4   GpuDownsampleFactorMaxGrad{(1, 1),True}
   0.1%    99.2%       0.210s       8.73e-04s     C      240        5   GpuElemwise{maximum,no_inplace}
   0.1%    99.3%       0.203s       1.06e-03s     C      192        4   GpuDownsampleFactorMax{(1, 1),True}
   0.1%    99.5%       0.179s       1.24e-03s     C      144        3   HostFromGpu
   0.1%    99.6%       0.163s       6.81e-04s     C      240        5   GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)]
   0.1%    99.7%       0.156s       6.51e-04s     C      240        5   GpuElemwise{Mul}[(0, 0)]
   0.1%    99.7%       0.095s       3.30e-04s     C      288        6   GpuCAReduce{add}{1,0,1,1}
   0.1%    99.8%       0.083s       4.34e-04s     C      192        4   GpuAlloc{memset_0=True}
   0.0%    99.8%       0.056s       5.88e-04s     C       96        2   GpuDownsampleFactorMaxGrad{(2, 2),True}
   0.0%    99.9%       0.044s       4.58e-04s     C       96        2   GpuDownsampleFactorMax{(2, 2),True}
   0.0%    99.9%       0.026s       5.39e-04s     C       48        1   GpuElemwise{sqr,no_inplace}
   0.0%    99.9%       0.024s       2.48e-04s     C       96        2   GpuElemwise{TrueDiv}[(0, 0)]
   ... (remaining 37 Ops account for   0.09%(0.14s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  35.0%    35.0%      52.035s       1.08e+00s     48   313   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
  30.3%    65.3%      44.946s       9.36e-01s     48   327   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
  15.1%    80.4%      22.398s       4.67e-01s     48   367   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   5.4%    85.8%       8.042s       1.68e-01s     48   325   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   3.6%    89.4%       5.390s       1.12e-01s     48   355   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   1.2%    90.6%       1.771s       3.69e-02s     48   322   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=
   1.2%    91.8%       1.742s       3.63e-02s     48   190   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   1.0%    92.7%       1.431s       2.98e-02s     48   341   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.7%    93.4%       0.968s       2.02e-02s     48   364   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=
   0.5%    93.8%       0.669s       1.39e-02s     48   300   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.4%    94.2%       0.561s       1.17e-02s     48    44   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.4%    94.6%       0.552s       1.15e-02s     48    61   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    94.9%       0.486s       1.01e-02s     48   326   GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
   0.3%    95.2%       0.468s       9.74e-03s     48   224   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    95.5%       0.430s       8.95e-03s     48   311   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    95.8%       0.393s       8.20e-03s     48   232   GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
   0.3%    96.1%       0.389s       8.11e-03s     48   122   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    96.3%       0.381s       7.93e-03s     48   208   GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:}.0, Join.0)
   0.2%    96.6%       0.371s       7.74e-03s     48   366   GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
   0.2%    96.8%       0.364s       7.58e-03s     48   353   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   ... (remaining 349 Apply instances account for 3.19%(4.74s) of the runtime)

Function profiling
==================
  Message: /home/ajain/Projects/deep_nets/python/lib/machine.py:243
  Time in 0 calls to Function.__call__: 0.000000e+00s
  Total compile time: 3.771852e+00s
    Number of Apply nodes: 0
    Theano Optimizer time: 3.601604e+00s
       Theano validate time: 1.030920e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.514251e-01s

Function profiling
==================
  Message: Sum of all printed profiles at exit excluding Scan op profile.
  Time in 48 calls to Function.__call__: 1.490243e+02s
  Time in Function.fn.__call__: 1.490195e+02s (99.997%)
  Time in thunks: 1.485269e+02s (99.666%)
  Total compile time: 9.601311e+00s
    Number of Apply nodes: 369
    Theano Optimizer time: 9.125525e+00s
       Theano validate time: 4.649615e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 4.156411e-01s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  94.5%    94.5%     140.298s       1.54e-01s     C      912      19   theano.sandbox.cuda.blas.GpuCorrMM
   2.0%    96.5%       3.038s       1.58e-02s     C      192       4   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   1.3%    97.8%       1.943s       1.09e-03s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuContiguous
   0.6%    98.4%       0.901s       4.27e-04s     C     2112      44   theano.sandbox.cuda.basic_ops.GpuElemwise
   0.5%    98.9%       0.683s       5.47e-04s     Py    1248      26   theano.sandbox.cuda.basic_ops.GpuReshape
   0.3%    99.2%       0.411s       2.14e-03s     C      192       4   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.2%    99.4%       0.296s       3.09e-03s     Py      96       2   theano.tensor.extra_ops.RepeatOp
   0.2%    99.6%       0.291s       1.01e-03s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
   0.2%    99.7%       0.247s       8.58e-04s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMax
   0.1%    99.8%       0.179s       1.24e-03s     C      144       3   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.1%    99.9%       0.116s       3.02e-04s     C      384       8   theano.sandbox.cuda.basic_ops.GpuCAReduce
   0.1%   100.0%       0.092s       3.82e-04s     C      240       5   theano.sandbox.cuda.basic_ops.GpuAlloc
   0.0%   100.0%       0.007s       4.19e-06s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.0%   100.0%       0.006s       2.19e-06s     C     2688      56   theano.compile.ops.Shape_i
   0.0%   100.0%       0.005s       3.32e-06s     C     1440      30   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.0%   100.0%       0.004s       7.37e-06s     C      576      12   theano.tensor.basic.Join
   0.0%   100.0%       0.003s       2.49e-06s     C     1152      24   theano.tensor.elemwise.Elemwise
   0.0%   100.0%       0.002s       2.78e-06s     C      864      18   theano.tensor.subtensor.Subtensor
   0.0%   100.0%       0.002s       2.92e-06s     C      672      14   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.001s       2.20e-05s     Py      48       1   theano.sandbox.cuda.basic_ops.GpuFlatten
   ... (remaining 2 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  91.5%    91.5%     135.923s       2.57e-01s     C      528       11   GpuCorrMM{valid, (1, 1), pad=0}
   2.9%    94.5%       4.375s       1.14e-02s     C      384        8   GpuCorrMM{full, (1, 1), pad=0}
   2.0%    96.5%       3.038s       1.58e-02s     C      192        4   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}
   1.3%    97.8%       1.943s       1.09e-03s     C     1776       37   GpuContiguous
   0.5%    98.3%       0.683s       5.47e-04s     Py    1248       26   GpuReshape{4}
   0.3%    98.5%       0.411s       2.14e-03s     C      192        4   GpuFromHost
   0.2%    98.7%       0.296s       3.09e-03s     Py      96        2   RepeatOp
   0.2%    98.9%       0.243s       1.01e-03s     C      240        5   GpuElemwise{add,no_inplace}
   0.2%    99.1%       0.235s       1.22e-03s     C      192        4   GpuDownsampleFactorMaxGrad{(1, 1),True}
   0.1%    99.2%       0.210s       8.73e-04s     C      240        5   GpuElemwise{maximum,no_inplace}
   0.1%    99.3%       0.203s       1.06e-03s     C      192        4   GpuDownsampleFactorMax{(1, 1),True}
   0.1%    99.5%       0.179s       1.24e-03s     C      144        3   HostFromGpu
   0.1%    99.6%       0.163s       6.81e-04s     C      240        5   GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)]
   0.1%    99.7%       0.156s       6.51e-04s     C      240        5   GpuElemwise{Mul}[(0, 0)]
   0.1%    99.7%       0.095s       3.30e-04s     C      288        6   GpuCAReduce{add}{1,0,1,1}
   0.1%    99.8%       0.083s       4.34e-04s     C      192        4   GpuAlloc{memset_0=True}
   0.0%    99.8%       0.056s       5.88e-04s     C       96        2   GpuDownsampleFactorMaxGrad{(2, 2),True}
   0.0%    99.9%       0.044s       4.58e-04s     C       96        2   GpuDownsampleFactorMax{(2, 2),True}
   0.0%    99.9%       0.026s       5.39e-04s     C       48        1   GpuElemwise{sqr,no_inplace}
   0.0%    99.9%       0.024s       2.48e-04s     C       96        2   GpuElemwise{TrueDiv}[(0, 0)]
   ... (remaining 37 Ops account for   0.09%(0.14s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  35.0%    35.0%      52.035s       1.08e+00s     48   313   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
  30.3%    65.3%      44.946s       9.36e-01s     48   327   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
  15.1%    80.4%      22.398s       4.67e-01s     48   367   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   5.4%    85.8%       8.042s       1.68e-01s     48   325   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   3.6%    89.4%       5.390s       1.12e-01s     48   355   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   1.2%    90.6%       1.771s       3.69e-02s     48   322   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=
   1.2%    91.8%       1.742s       3.63e-02s     48   190   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   1.0%    92.7%       1.431s       2.98e-02s     48   341   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.7%    93.4%       0.968s       2.02e-02s     48   364   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=
   0.5%    93.8%       0.669s       1.39e-02s     48   300   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.4%    94.2%       0.561s       1.17e-02s     48    44   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.4%    94.6%       0.552s       1.15e-02s     48    61   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    94.9%       0.486s       1.01e-02s     48   326   GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
   0.3%    95.2%       0.468s       9.74e-03s     48   224   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    95.5%       0.430s       8.95e-03s     48   311   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    95.8%       0.393s       8.20e-03s     48   232   GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
   0.3%    96.1%       0.389s       8.11e-03s     48   122   GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   0.3%    96.3%       0.381s       7.93e-03s     48   208   GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:}.0, Join.0)
   0.2%    96.6%       0.371s       7.74e-03s     48   366   GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
   0.2%    96.8%       0.364s       7.58e-03s     48   353   GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
   ... (remaining 349 Apply instances account for 3.19%(4.74s) of the runtime)
@nouiz
Member
nouiz commented Aug 6, 2014

If you look at this part of the profile:

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  94.5%    94.5%     140.298s       1.54e-01s     C      912      19   theano.sandbox.cuda.blas.GpuCorrMM

we see that we spend ~95% of the time inside the GpuCorrMM op. Here is more detail:

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  91.5%    91.5%     135.923s       2.57e-01s     C      528       11   GpuCorrMM{valid, (1, 1), pad=0}
   2.9%    94.5%       4.375s       1.14e-02s     C      384        8   GpuCorrMM{full, (1, 1), pad=0}

This tells us that we spend much more time in the valid mode than in the full mode of GpuCorrMM, so the problem is related to the new op in this profile.

Can you rerun the profiling with this extra theano flag: profile_memory=True? That will tell us which shapes for the valid mode cause this slow case.

Fred


@stencilman
Contributor

Hi Fred, thanks a ton for your reply. Please find below the complete log after also using profile_memory=True.

What I don't understand is why 'valid' is getting called at all: I only call conv2d with 'full'. Perhaps this happens during the backprop, but I am really not sure why. Any help would be greatly appreciated. Thanks a lot!!!
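For reference, a minimal sketch of how such a log is enabled (these are Theano's standard profiling flags; the exact invocation is my assumption, only profile_memory=True is stated above):

```python
# Enable the profiler plus memory profiling before building any theano.function.
# Equivalent environment-variable form: THEANO_FLAGS=profile=True,profile_memory=True
import theano
theano.config.profile = True          # print per-Class/Op/Apply timings at exit
theano.config.profile_memory = True   # also print the Memory Profile section
```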

Using gpu device 3: GeForce GTX TITAN Black
/home/ajain/Theano/theano/gof/vm.py:716: UserWarning: CVM does not support memory profile, using Stack VM.
  'CVM does not support memory profile, using Stack VM.')

Function profiling
==================
  Message: /home/ajain/Projects/deep_nets/python/lib/machine.py:239
  Time in 48 calls to Function.__call__: 1.493318e+02s
  Time in Function.fn.__call__: 1.493235e+02s (99.994%)
  Time in thunks: 1.478504e+02s (99.008%)
  Total compile time: 5.620848e+00s
    Number of Apply nodes: 369
    Theano Optimizer time: 5.321193e+00s
       Theano validate time: 3.487010e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.601519e-01s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  94.4%    94.4%     139.572s       1.53e-01s     C      912      19   theano.sandbox.cuda.blas.GpuCorrMM
   2.1%    96.5%       3.036s       1.58e-02s     C      192       4   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   1.3%    97.8%       1.980s       1.11e-03s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuContiguous
   0.6%    98.4%       0.910s       4.31e-04s     C     2112      44   theano.sandbox.cuda.basic_ops.GpuElemwise
   0.5%    98.9%       0.666s       5.34e-04s     Py    1248      26   theano.sandbox.cuda.basic_ops.GpuReshape
   0.3%    99.1%       0.382s       1.99e-03s     C      192       4   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.2%    99.3%       0.296s       3.08e-03s     Py      96       2   theano.tensor.extra_ops.RepeatOp
   0.2%    99.5%       0.296s       1.03e-03s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
   0.2%    99.7%       0.252s       8.75e-04s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMax
   0.1%    99.8%       0.178s       1.24e-03s     C      144       3   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.1%    99.9%       0.121s       3.16e-04s     C      384       8   theano.sandbox.cuda.basic_ops.GpuCAReduce
   0.1%   100.0%       0.096s       4.01e-04s     C      240       5   theano.sandbox.cuda.basic_ops.GpuAlloc
   0.0%   100.0%       0.015s       5.75e-06s     C     2688      56   theano.compile.ops.Shape_i
   0.0%   100.0%       0.013s       7.38e-06s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.0%   100.0%       0.011s       7.58e-06s     C     1440      30   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.0%   100.0%       0.007s       5.66e-06s     C     1152      24   theano.tensor.elemwise.Elemwise
   0.0%   100.0%       0.006s       1.06e-05s     C      576      12   theano.tensor.basic.Join
   0.0%   100.0%       0.005s       6.05e-06s     C      864      18   theano.tensor.subtensor.Subtensor
   0.0%   100.0%       0.004s       5.89e-06s     C      672      14   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.002s       5.11e-06s     C      336       7   theano.tensor.elemwise.Prod
   ... (remaining 2 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  91.5%    91.5%     135.223s       2.56e-01s     C      528       11   GpuCorrMM{valid, (1, 1), pad=0}
   2.9%    94.4%       4.348s       1.13e-02s     C      384        8   GpuCorrMM{full, (1, 1), pad=0}
   2.1%    96.5%       3.036s       1.58e-02s     C      192        4   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}
   1.3%    97.8%       1.980s       1.11e-03s     C     1776       37   GpuContiguous
   0.5%    98.2%       0.666s       5.34e-04s     Py    1248       26   GpuReshape{4}
   0.3%    98.5%       0.382s       1.99e-03s     C      192        4   GpuFromHost
   0.2%    98.7%       0.296s       3.08e-03s     Py      96        2   RepeatOp
   0.2%    98.9%       0.247s       1.03e-03s     C      240        5   GpuElemwise{add,no_inplace}
   0.2%    99.0%       0.237s       1.23e-03s     C      192        4   GpuDownsampleFactorMaxGrad{(1, 1),True}
   0.1%    99.2%       0.210s       8.77e-04s     C      240        5   GpuElemwise{maximum,no_inplace}
   0.1%    99.3%       0.206s       1.07e-03s     C      192        4   GpuDownsampleFactorMax{(1, 1),True}
   0.1%    99.4%       0.178s       1.24e-03s     C      144        3   HostFromGpu
   0.1%    99.5%       0.164s       6.82e-04s     C      240        5   GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)]
   0.1%    99.6%       0.157s       6.56e-04s     C      240        5   GpuElemwise{Mul}[(0, 0)]
   0.1%    99.7%       0.100s       3.47e-04s     C      288        6   GpuCAReduce{add}{1,0,1,1}
   0.1%    99.8%       0.087s       4.51e-04s     C      192        4   GpuAlloc{memset_0=True}
   0.0%    99.8%       0.059s       6.14e-04s     C       96        2   GpuDownsampleFactorMaxGrad{(2, 2),True}
   0.0%    99.8%       0.046s       4.80e-04s     C       96        2   GpuDownsampleFactorMax{(2, 2),True}
   0.0%    99.9%       0.023s       4.84e-04s     C       48        1   GpuElemwise{sqr,no_inplace}
   0.0%    99.9%       0.022s       2.32e-04s     C       96        2   GpuElemwise{TrueDiv}[(0, 0)]
   ... (remaining 37 Ops account for   0.12%(0.18s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
  34.9%    34.9%      51.585s       1.07e+00s     48   313  14400.0        0.3 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(256, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0) 
  30.4%    65.3%      44.946s       9.36e-01s     48   327  72900.0        1.6 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1) 
    input 1: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(512, 16, 9, 9), strides=(1296, 81, 9, 1) 
  15.1%    80.4%      22.270s       4.64e-01s     48   367   2109.4        0.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
    input 1: dtype=float32, shape=(3, 16, 240, 240), strides=(921600, 57600, 240, 1) 
    output 0: dtype=float32, shape=(16, 3, 5, 5), strides=(75, 25, 5, 1) 
   5.4%    85.8%       8.016s       1.67e-01s     48   325  72900.0        8.9 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
    input 1: dtype=float32, shape=(16, 512, 9, 9), strides=(41472, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
   3.6%    89.4%       5.313s       1.11e-01s     48   355   2812.5        0.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1) 
    input 1: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1) 
    output 0: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
   1.2%    90.6%       1.770s       3.69e-02s     48   322                     GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:
    input 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
    input 1: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1) 
    input 2: dtype=int64, shape=8, strides=c 
    input 3: dtype=int64, shape=8, strides=c 
    input 4: dtype=int64, shape=8, strides=c 
    input 5: dtype=int64, shape=8, strides=c 
    output 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
   1.2%    91.7%       1.732s       3.61e-02s     48   190  72900.0       41.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(512, 16, 9, 9), strides=(1296, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
   1.0%    92.7%       1.416s       2.95e-02s     48   341    703.1        0.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 64, 64), strides=(65536, 4096, 64, 1) 
    input 1: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
   0.7%    93.3%       0.969s       2.02e-02s     48   364                     GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:
    input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
    input 1: dtype=float32, shape=(16, 16, 240, 240), strides=(921600, 57600, 240, 1) 
    input 2: dtype=int64, shape=8, strides=c 
    input 3: dtype=int64, shape=8, strides=c 
    input 4: dtype=int64, shape=8, strides=c 
    input 5: dtype=int64, shape=8, strides=c 
    output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
   0.4%    93.8%       0.664s       1.38e-02s     48   300    112.5        0.2 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(4, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(256, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(4, 256, 1, 1), strides=(256, 1, 0, 0) 
   0.4%    94.2%       0.551s       1.15e-02s     48    44    427.1        0.8 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 3, 240, 240), strides=(172800, 57600, 240, 1) 
    input 1: dtype=float32, shape=(1, 3, 9, 9), strides=(0, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 1, 248, 248), strides=(61504, 0, 248, 1) 
   0.4%    94.5%       0.549s       1.14e-02s     48    61    427.1        0.8 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 3, 240, 240), strides=(172800, 57600, 240, 1) 
    input 1: dtype=float32, shape=(1, 3, 9, 9), strides=(0, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 1, 248, 248), strides=(61504, 0, 248, 1) 
   0.3%    94.9%       0.492s       1.03e-02s     48   326                     GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
    input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(4624, 2367488, 68, 1) 
    output 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1) 
   0.3%    95.2%       0.464s       9.67e-03s     48   224  14400.0       30.3 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1) 
    input 1: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0) 
    output 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1) 
   0.3%    95.5%       0.425s       8.86e-03s     48   311  14400.0       33.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(512, 256, 1, 1), strides=(256, 1, 0, 0) 
    output 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1) 
   0.3%    95.7%       0.396s       8.26e-03s     48   232                     GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
    input 0: dtype=float32, shape=(512, 16, 60, 60), strides=(3600, 1843200, 60, 1) 
    output 0: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1) 
   0.3%    96.0%       0.388s       8.07e-03s     48   122   2812.5        7.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1) 
    input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
    output 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1) 
   0.3%    96.3%       0.385s       8.02e-03s     48   208                     GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:
    input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(2367488, 4624, 68, 1) 
    input 1: dtype=int64, shape=(4,), strides=c 
    output 0: dtype=float32, shape=(8192, 1, 60, 60), strides=(3600, 0, 60, 1) 
   0.2%    96.5%       0.367s       7.64e-03s     48   366                     GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
    input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(59536, 952576, 244, 1) 
    output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
   0.2%    96.8%       0.365s       7.61e-03s     48   353   2812.5        7.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1) 
    input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
    output 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1) 
   ... (remaining 349 Apply instances account for 3.24%(4.79s) of the runtime)

Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
    Max if no gc (allow_gc=False): 1619415KB (1619415KB)
    Max if linker=cvm(default): 765866KB (765866KB)
    Memory saved if views are used: 2970455KB (2970455KB)
    Memory saved if inplace ops are used: 649542KB (649542KB)
    Memory saved if gc is enabled: 853549KB (853549KB)

    <Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>

     151519232B  [(16, 512, 68, 68)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[[ 0.]]]]}, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0)
     151519232B  [(512, 16, 68, 68)] v GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
     151519232B  [(16, 512, 68, 68)] i GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuReshape{4}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
     151519232B  [(512, 16, 68, 68)] v GpuDimShuffle{1,0,2,3}(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
     151519232B  [(16, 512, 68, 68)] v GpuContiguous(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
     151519232B  [(16, 512, 68, 68)] c GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
     117964800B  [(16, 512, 60, 60)] v GpuContiguous(GpuElemwise{maximum,no_inplace}.0)
     117964800B  [(512, 16, 60, 60)] v GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
     117964800B  [(16, 512, 60, 60)] c GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
     117964800B  [(16, 512, 60, 60)] c GpuElemwise{add,no_inplace}(GpuReshape{4}.0, GpuDimShuffle{x,0,x,x}.0)
     117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMaxGrad{(1, 1),True}.0, MakeVector.0)
     117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMax{(1, 1),True}.0, Join.0)
     117964800B  [(16, 512, 60, 60)] v GpuSubtensor{::, ::, int64:int64:, int64:int64:}(GpuCorrMM{full, (1, 1), pad=0}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
     117964800B  [(16, 512, 60, 60)] i GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)](GpuElemwise{maximum,no_inplace}.0, GpuElemwise{add,no_inplace}.0)
     117964800B  [(8192, 1, 60, 60)] c GpuDownsampleFactorMaxGrad{(1, 1),True}(GpuReshape{4}.0, GpuDownsampleFactorMax{(1, 1),True}.0, GpuReshape{4}.0)
     117964800B  [(16, 512, 60, 60)] i GpuElemwise{Mul}[(0, 0)](GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)].0, GpuCorrMM{valid, (1, 1), pad=0}.0)
     117964800B  [(16, 512, 60, 60)] c GpuElemwise{maximum,no_inplace}(CudaNdarrayConstant{[[[[  9.99999997e-07]]]]}, GpuElemwise{add,no_inplace}.0)
     117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:}.0, Join.0)
     117964800B  [(512, 16, 60, 60)] v GpuDimShuffle{1,0,2,3}(GpuElemwise{maximum,no_inplace}.0)
     117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuElemwise{Mul}[(0, 0)].0, MakeVector.0)
   ... (remaining 349 Apply account for 2804535640B/5365158232B ((52.27%)) of the Apply with dense outputs sizes)

    <created/inplace/view> is taken from the Op's declaration.
    Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.

Function profiling
==================
  Message: /home/ajain/Projects/deep_nets/python/lib/machine.py:243
  Time in 0 calls to Function.__call__: 0.000000e+00s
  Total compile time: 3.708201e+00s
    Number of Apply nodes: 0
    Theano Optimizer time: 3.540352e+00s
       Theano validate time: 1.026495e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.489460e-01s

Function profiling
==================
  Message: Sum of all printed profiles at exit excluding Scan op profile.
  Time in 48 calls to Function.__call__: 1.493318e+02s
  Time in Function.fn.__call__: 1.493235e+02s (99.994%)
  Time in thunks: 1.478504e+02s (99.008%)
  Total compile time: 9.329049e+00s
    Number of Apply nodes: 369
    Theano Optimizer time: 8.861545e+00s
       Theano validate time: 4.513505e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 4.090979e-01s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  94.4%    94.4%     139.572s       1.53e-01s     C      912      19   theano.sandbox.cuda.blas.GpuCorrMM
   2.1%    96.5%       3.036s       1.58e-02s     C      192       4   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   1.3%    97.8%       1.980s       1.11e-03s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuContiguous
   0.6%    98.4%       0.910s       4.31e-04s     C     2112      44   theano.sandbox.cuda.basic_ops.GpuElemwise
   0.5%    98.9%       0.666s       5.34e-04s     Py    1248      26   theano.sandbox.cuda.basic_ops.GpuReshape
   0.3%    99.1%       0.382s       1.99e-03s     C      192       4   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.2%    99.3%       0.296s       3.08e-03s     Py      96       2   theano.tensor.extra_ops.RepeatOp
   0.2%    99.5%       0.296s       1.03e-03s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
   0.2%    99.7%       0.252s       8.75e-04s     C      288       6   theano.sandbox.cuda.blas.GpuDownsampleFactorMax
   0.1%    99.8%       0.178s       1.24e-03s     C      144       3   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.1%    99.9%       0.121s       3.16e-04s     C      384       8   theano.sandbox.cuda.basic_ops.GpuCAReduce
   0.1%   100.0%       0.096s       4.01e-04s     C      240       5   theano.sandbox.cuda.basic_ops.GpuAlloc
   0.0%   100.0%       0.015s       5.75e-06s     C     2688      56   theano.compile.ops.Shape_i
   0.0%   100.0%       0.013s       7.38e-06s     C     1776      37   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.0%   100.0%       0.011s       7.58e-06s     C     1440      30   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.0%   100.0%       0.007s       5.66e-06s     C     1152      24   theano.tensor.elemwise.Elemwise
   0.0%   100.0%       0.006s       1.06e-05s     C      576      12   theano.tensor.basic.Join
   0.0%   100.0%       0.005s       6.05e-06s     C      864      18   theano.tensor.subtensor.Subtensor
   0.0%   100.0%       0.004s       5.89e-06s     C      672      14   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.002s       5.11e-06s     C      336       7   theano.tensor.elemwise.Prod
   ... (remaining 2 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  91.5%    91.5%     135.223s       2.56e-01s     C      528       11   GpuCorrMM{valid, (1, 1), pad=0}
   2.9%    94.4%       4.348s       1.13e-02s     C      384        8   GpuCorrMM{full, (1, 1), pad=0}
   2.1%    96.5%       3.036s       1.58e-02s     C      192        4   GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}
   1.3%    97.8%       1.980s       1.11e-03s     C     1776       37   GpuContiguous
   0.5%    98.2%       0.666s       5.34e-04s     Py    1248       26   GpuReshape{4}
   0.3%    98.5%       0.382s       1.99e-03s     C      192        4   GpuFromHost
   0.2%    98.7%       0.296s       3.08e-03s     Py      96        2   RepeatOp
   0.2%    98.9%       0.247s       1.03e-03s     C      240        5   GpuElemwise{add,no_inplace}
   0.2%    99.0%       0.237s       1.23e-03s     C      192        4   GpuDownsampleFactorMaxGrad{(1, 1),True}
   0.1%    99.2%       0.210s       8.77e-04s     C      240        5   GpuElemwise{maximum,no_inplace}
   0.1%    99.3%       0.206s       1.07e-03s     C      192        4   GpuDownsampleFactorMax{(1, 1),True}
   0.1%    99.4%       0.178s       1.24e-03s     C      144        3   HostFromGpu
   0.1%    99.5%       0.164s       6.82e-04s     C      240        5   GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)]
   0.1%    99.6%       0.157s       6.56e-04s     C      240        5   GpuElemwise{Mul}[(0, 0)]
   0.1%    99.7%       0.100s       3.47e-04s     C      288        6   GpuCAReduce{add}{1,0,1,1}
   0.1%    99.8%       0.087s       4.51e-04s     C      192        4   GpuAlloc{memset_0=True}
   0.0%    99.8%       0.059s       6.14e-04s     C       96        2   GpuDownsampleFactorMaxGrad{(2, 2),True}
   0.0%    99.8%       0.046s       4.80e-04s     C       96        2   GpuDownsampleFactorMax{(2, 2),True}
   0.0%    99.9%       0.023s       4.84e-04s     C       48        1   GpuElemwise{sqr,no_inplace}
   0.0%    99.9%       0.022s       2.32e-04s     C       96        2   GpuElemwise{TrueDiv}[(0, 0)]
   ... (remaining 37 Ops account for   0.12%(0.18s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
  34.9%    34.9%      51.585s       1.07e+00s     48   313  14400.0        0.3 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(256, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0) 
  30.4%    65.3%      44.946s       9.36e-01s     48   327  72900.0        1.6 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1) 
    input 1: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(512, 16, 9, 9), strides=(1296, 81, 9, 1) 
  15.1%    80.4%      22.270s       4.64e-01s     48   367   2109.4        0.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
    input 1: dtype=float32, shape=(3, 16, 240, 240), strides=(921600, 57600, 240, 1) 
    output 0: dtype=float32, shape=(16, 3, 5, 5), strides=(75, 25, 5, 1) 
   5.4%    85.8%       8.016s       1.67e-01s     48   325  72900.0        8.9 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
    input 1: dtype=float32, shape=(16, 512, 9, 9), strides=(41472, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
   3.6%    89.4%       5.313s       1.11e-01s     48   355   2812.5        0.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1) 
    input 1: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1) 
    output 0: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
   1.2%    90.6%       1.770s       3.69e-02s     48   322                     GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:
    input 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
    input 1: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1) 
    input 2: dtype=int64, shape=8, strides=c 
    input 3: dtype=int64, shape=8, strides=c 
    input 4: dtype=int64, shape=8, strides=c 
    input 5: dtype=int64, shape=8, strides=c 
    output 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
   1.2%    91.7%       1.732s       3.61e-02s     48   190  72900.0       41.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(512, 16, 9, 9), strides=(1296, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1) 
   1.0%    92.7%       1.416s       2.95e-02s     48   341    703.1        0.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 64, 64), strides=(65536, 4096, 64, 1) 
    input 1: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
   0.7%    93.3%       0.969s       2.02e-02s     48   364                     GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:
    input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
    input 1: dtype=float32, shape=(16, 16, 240, 240), strides=(921600, 57600, 240, 1) 
    input 2: dtype=int64, shape=8, strides=c 
    input 3: dtype=int64, shape=8, strides=c 
    input 4: dtype=int64, shape=8, strides=c 
    input 5: dtype=int64, shape=8, strides=c 
    output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
   0.4%    93.8%       0.664s       1.38e-02s     48   300    112.5        0.2 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(4, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(256, 16, 60, 60), strides=(57600, 3600, 60, 1) 
    output 0: dtype=float32, shape=(4, 256, 1, 1), strides=(256, 1, 0, 0) 
   0.4%    94.2%       0.551s       1.15e-02s     48    44    427.1        0.8 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 3, 240, 240), strides=(172800, 57600, 240, 1) 
    input 1: dtype=float32, shape=(1, 3, 9, 9), strides=(0, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 1, 248, 248), strides=(61504, 0, 248, 1) 
   0.4%    94.5%       0.549s       1.14e-02s     48    61    427.1        0.8 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 3, 240, 240), strides=(172800, 57600, 240, 1) 
    input 1: dtype=float32, shape=(1, 3, 9, 9), strides=(0, 81, 9, 1) 
    output 0: dtype=float32, shape=(16, 1, 248, 248), strides=(61504, 0, 248, 1) 
   0.3%    94.9%       0.492s       1.03e-02s     48   326                     GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
    input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(4624, 2367488, 68, 1) 
    output 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1) 
   0.3%    95.2%       0.464s       9.67e-03s     48   224  14400.0       30.3 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1) 
    input 1: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0) 
    output 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1) 
   0.3%    95.5%       0.425s       8.86e-03s     48   311  14400.0       33.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1) 
    input 1: dtype=float32, shape=(512, 256, 1, 1), strides=(256, 1, 0, 0) 
    output 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1) 
   0.3%    95.7%       0.396s       8.26e-03s     48   232                     GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
    input 0: dtype=float32, shape=(512, 16, 60, 60), strides=(3600, 1843200, 60, 1) 
    output 0: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1) 
   0.3%    96.0%       0.388s       8.07e-03s     48   122   2812.5        7.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
    input 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1) 
    input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
    output 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1) 
   0.3%    96.3%       0.385s       8.02e-03s     48   208                     GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:
    input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(2367488, 4624, 68, 1) 
    input 1: dtype=int64, shape=(4,), strides=c 
    output 0: dtype=float32, shape=(8192, 1, 60, 60), strides=(3600, 0, 60, 1) 
   0.2%    96.5%       0.367s       7.64e-03s     48   366                     GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
    input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(59536, 952576, 244, 1) 
    output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1) 
   0.2%    96.8%       0.365s       7.61e-03s     48   353   2812.5        7.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
    input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1) 
    input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1) 
    output 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1) 
   ... (remaining 349 Apply instances account for 3.24%(4.79s) of the runtime)

Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py
---
    Max if no gc (allow_gc=False): 1619415KB (1619415KB)
    Max if linker=cvm(default): 765866KB (765866KB)
    Memory saved if views are used: 2970455KB (2970455KB)
    Memory saved if inplace ops are used: 649542KB (649542KB)
    Memory saved if gc is enabled: 853549KB (853549KB)

    <Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>

     151519232B  [(16, 512, 68, 68)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[[ 0.]]]]}, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0)
     151519232B  [(512, 16, 68, 68)] v GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
     151519232B  [(16, 512, 68, 68)] i GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuReshape{4}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
     151519232B  [(512, 16, 68, 68)] v GpuDimShuffle{1,0,2,3}(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
     151519232B  [(16, 512, 68, 68)] v GpuContiguous(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
     151519232B  [(16, 512, 68, 68)] c GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
     117964800B  [(16, 512, 60, 60)] v GpuContiguous(GpuElemwise{maximum,no_inplace}.0)
     117964800B  [(512, 16, 60, 60)] v GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
     117964800B  [(16, 512, 60, 60)] c GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
     117964800B  [(16, 512, 60, 60)] c GpuElemwise{add,no_inplace}(GpuReshape{4}.0, GpuDimShuffle{x,0,x,x}.0)
     117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMaxGrad{(1, 1),True}.0, MakeVector.0)
     117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMax{(1, 1),True}.0, Join.0)
     117964800B  [(16, 512, 60, 60)] v GpuSubtensor{::, ::, int64:int64:, int64:int64:}(GpuCorrMM{full, (1, 1), pad=0}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
     117964800B  [(16, 512, 60, 60)] i GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)](GpuElemwise{maximum,no_inplace}.0, GpuElemwise{add,no_inplace}.0)
     117964800B  [(8192, 1, 60, 60)] c GpuDownsampleFactorMaxGrad{(1, 1),True}(GpuReshape{4}.0, GpuDownsampleFactorMax{(1, 1),True}.0, GpuReshape{4}.0)
     117964800B  [(16, 512, 60, 60)] i GpuElemwise{Mul}[(0, 0)](GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)].0, GpuCorrMM{valid, (1, 1), pad=0}.0)
     117964800B  [(16, 512, 60, 60)] c GpuElemwise{maximum,no_inplace}(CudaNdarrayConstant{[[[[  9.99999997e-07]]]]}, GpuElemwise{add,no_inplace}.0)
     117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:}.0, Join.0)
     117964800B  [(512, 16, 60, 60)] v GpuDimShuffle{1,0,2,3}(GpuElemwise{maximum,no_inplace}.0)
     117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuElemwise{Mul}[(0, 0)].0, MakeVector.0)
   ... (remaining 349 Apply account for 2804535640B/5365158232B ((52.27%)) of the Apply with dense outputs sizes)

    <created/inplace/view> is taken from the Op's declaration.
    Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
@nouiz
Member
nouiz commented Aug 6, 2014

If the forward pass uses a valid convolution, the grad will contain both a
valid and a full convolution.
If the forward pass uses a full convolution, you will likewise have both a
valid and a full convolution in the grad.
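A minimal sketch (plain Theano API, nothing from this PR assumed) showing why that happens: the gradient of a conv2d called with border_mode='full' still introduces a valid convolution in the compiled graph, which is why GpuCorrMM{valid, ...} dominates your profile even though the forward pass only uses 'full':

```python
import theano
import theano.tensor as T
from theano.tensor.nnet import conv

x = T.tensor4('x')   # images  (batch, channel, row, col)
w = T.tensor4('w')   # filters (nfilter, channel, krow, kcol)

out = conv.conv2d(x, w, border_mode='full')   # forward pass: 'full' only
cost = out.sum()
gx, gw = T.grad(cost, [x, w])                 # grads w.r.t. input and filters

f = theano.function([x, w], [gx, gw])
# The optimized graph contains both a full and a valid convolution; with the
# gemm convolution enabled they correspond to the GpuCorrMM{full,...} and
# GpuCorrMM{valid,...} nodes seen in the profile above.
theano.printing.debugprint(f)
```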

Why do you use the full mode in the forward pass? That is very strange.
Normally, what I have seen is that people use the valid mode in the forward
pass. Are you sure your Torch implementation also uses the full mode in the
forward pass?

I need to leave. I'll see if I can work on that tonight. If the speed
difference is real, I see two ways to identify it:

  1. Run your code with cuda-memcheck. If we have a problem like a call to
    gemm with too big a number that causes a bad memory read, we will see
    the error. That would be the easy case.

I looked at the Caffe code and it works differently in the backward case than
what we did. Check the method Backward_gpu in
src/caffe/layers/conv_layer.cu.
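For reference, a minimal NumPy sketch of that im2col/gemm view of the backward pass (stride 1, no padding, a single image); it illustrates the approach, it is not Caffe's actual code. Both gradients are GEMMs against the same unfolded buffer, instead of being expressed as extra valid/full convolutions:

```python
import numpy as np

def im2col(img, kh, kw):
    # (C, H, W) -> (C*kh*kw, oh*ow) columns, stride 1, no padding
    C, H, W = img.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, oh * ow), dtype=img.dtype)
    r = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                cols[r] = img[c, i:i + oh, j:j + ow].ravel()
                r += 1
    return cols

def col2im(cols, C, H, W, kh, kw):
    # adjoint of im2col: scatter-add the columns back into an image
    oh, ow = H - kh + 1, W - kw + 1
    img = np.zeros((C, H, W), dtype=cols.dtype)
    r = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                img[c, i:i + oh, j:j + ow] += cols[r].reshape(oh, ow)
                r += 1
    return img

C, H, W, F, kh, kw = 3, 8, 8, 4, 3, 3
x = np.random.randn(C, H, W).astype('float32')
w = np.random.randn(F, C * kh * kw).astype('float32')    # flattened filters

cols = im2col(x, kh, kw)                   # unfold once
out = w.dot(cols)                          # forward: one GEMM, shape (F, oh*ow)

g_out = np.random.randn(*out.shape).astype('float32')    # gradient from above
g_w = g_out.dot(cols.T)                    # weight grad: GEMM with the same buffer
g_x = col2im(w.T.dot(g_out), C, H, W, kh, kw)   # input grad: GEMM + col2im
```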

Here is the profile of convnet-benchmark; you can compare it to what is on
the web site, but we see that the grad, the way we implement it in
GpuCorrMM, is slower:

CONFIG: input = 3 x 128 x 128 * ker = 3 x 96 x 11 x 11 ( bs = 128 , stride = 1 )
gemm theano.sandbox.cuda.blas.GpuConvMM fprof: 1246.53857689 GFLOP/s ( tm = 0.0996497273445 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop weights: 0.0 GFLOP/s ( tm = 0.355364978313 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop inputs: 0.0 GFLOP/s ( tm = 1.75031024218 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop both: 0.0 GFLOP/s ( tm = 2.09745848179 )

CONFIG: input = 64 x 64 x 64 * ker = 64 x 128 x 9 x 9 ( bs = 128 , stride = 1 )
gemm theano.sandbox.cuda.blas.GpuConvMM fprof: 1814.27996141 GFLOP/s ( tm = 0.293620705605 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop weights: 0.0 GFLOP/s ( tm = 1.16873198748 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop inputs: 0.0 GFLOP/s ( tm = 0.506856739521 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop both: 0.0 GFLOP/s ( tm = 1.77047175169 )

CONFIG: input = 128 x 32 x 32 * ker = 128 x 128 x 9 x 9 ( bs = 128 , stride = 1 )
gemm theano.sandbox.cuda.blas.GpuConvMM fprof: 1163.07152933 GFLOP/s ( tm = 0.168252289295 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop weights: 0.0 GFLOP/s ( tm = 0.437334001064 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop inputs: 0.0 GFLOP/s ( tm = 0.230055272579 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop both: 0.0 GFLOP/s ( tm = 0.706533968449 )

CONFIG: input = 128 x 16 x 16 * ker = 128 x 128 x 7 x 7 ( bs = 128 , stride = 1 )
gemm theano.sandbox.cuda.blas.GpuConvMM fprof: 362.523363232 GFLOP/s ( tm = 0.0566917657852 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop weights: 0.0 GFLOP/s ( tm = 0.0665702223778 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop inputs: 0.0 GFLOP/s ( tm = 0.102097034454 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop both: 0.0 GFLOP/s ( tm = 0.168358981609 )
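As a sanity check on these numbers (my own arithmetic, not output of the benchmark script), the forward number of the first CONFIG follows directly from the flop count of a stride-1 valid correlation:

```python
bs, nchan, insize = 128, 3, 128
nfilt, kh, kw = 96, 11, 11
out = insize - kh + 1                              # 118 output rows/cols
flops = 2.0 * bs * nfilt * nchan * kh * kw * out * out
print(flops / 0.0996497273445 / 1e9)               # ~1246.5 GFLOP/s, as reported
```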

input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(4624, 2367488, 68, 1)
output 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1)
0.3% 95.2% 0.464s 9.67e-03s 48 224 14400.0 30.3 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1)
input 1: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0)
output 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1)
0.3% 95.5% 0.425s 8.86e-03s 48 311 14400.0 33.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1)
input 1: dtype=float32, shape=(512, 256, 1, 1), strides=(256, 1, 0, 0)
output 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1)
0.3% 95.7% 0.396s 8.26e-03s 48 232 GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
input 0: dtype=float32, shape=(512, 16, 60, 60), strides=(3600, 1843200, 60, 1)
output 0: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1)
0.3% 96.0% 0.388s 8.07e-03s 48 122 2812.5 7.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1)
input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1)
output 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1)
0.3% 96.3% 0.385s 8.02e-03s 48 208 GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:
input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(2367488, 4624, 68, 1)
input 1: dtype=int64, shape=(4,), strides=c
output 0: dtype=float32, shape=(8192, 1, 60, 60), strides=(3600, 0, 60, 1)
0.2% 96.5% 0.367s 7.64e-03s 48 366 GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(59536, 952576, 244, 1)
output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1)
0.2% 96.8% 0.365s 7.61e-03s 48 353 2812.5 7.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1)
input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1)
output 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1)
... (remaining 349 Apply instances account for 3.24%(4.79s) of the runtime)

Memory Profile
(Sparse variables are ignored)

(For values in brackets, it's for linker = c|py

Max if no gc (allow_gc=False): 1619415KB (1619415KB)
Max if linker=cvm(default): 765866KB (765866KB)
Memory saved if views are used: 2970455KB (2970455KB)
Memory saved if inplace ops are used: 649542KB (649542KB)
Memory saved if gc is enabled: 853549KB (853549KB)

<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>

 151519232B  [(16, 512, 68, 68)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[[ 0.]]]]}, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0)
 151519232B  [(512, 16, 68, 68)] v GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
 151519232B  [(16, 512, 68, 68)] i GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuReshape{4}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
 151519232B  [(512, 16, 68, 68)] v GpuDimShuffle{1,0,2,3}(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
 151519232B  [(16, 512, 68, 68)] v GpuContiguous(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
 151519232B  [(16, 512, 68, 68)] c GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
 117964800B  [(16, 512, 60, 60)] v GpuContiguous(GpuElemwise{maximum,no_inplace}.0)
 117964800B  [(512, 16, 60, 60)] v GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
 117964800B  [(16, 512, 60, 60)] c GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
 117964800B  [(16, 512, 60, 60)] c GpuElemwise{add,no_inplace}(GpuReshape{4}.0, GpuDimShuffle{x,0,x,x}.0)
 117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMaxGrad{(1, 1),True}.0, MakeVector.0)
 117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMax{(1, 1),True}.0, Join.0)
 117964800B  [(16, 512, 60, 60)] v GpuSubtensor{::, ::, int64:int64:, int64:int64:}(GpuCorrMM{full, (1, 1), pad=0}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
 117964800B  [(16, 512, 60, 60)] i GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)](GpuElemwise{maximum,no_inplace}.0, GpuElemwise{add,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] c GpuDownsampleFactorMaxGrad{(1, 1),True}(GpuReshape{4}.0, GpuDownsampleFactorMax{(1, 1),True}.0, GpuReshape{4}.0)
 117964800B  [(16, 512, 60, 60)] i GpuElemwise{Mul}[(0, 0)](GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)].0, GpuCorrMM{valid, (1, 1), pad=0}.0)
 117964800B  [(16, 512, 60, 60)] c GpuElemwise{maximum,no_inplace}(CudaNdarrayConstant{[[[[  9.99999997e-07]]]]}, GpuElemwise{add,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:}.0, Join.0)
 117964800B  [(512, 16, 60, 60)] v GpuDimShuffle{1,0,2,3}(GpuElemwise{maximum,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuElemwise{Mul}[(0, 0)].0, MakeVector.0)

... (remaining 349 Apply account for 2804535640B/5365158232B ((52.27%)) of the Apply with dense outputs sizes)

<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.



@nouiz
Member
nouiz commented Aug 6, 2014

Part of the difference is that we do separate grads for the weights and the
inputs, but Caffe shares some work because it computes both at the same
time.

We need to implement a new GPU corr-grad op that will compute both grads at
the same time. Can you start it? I'll do the optimization needed.

Keep me updated on what you do and when, as I'll also work on it tonight.
I don't know if that is enough, but if we can prevent duplicate work,
that would be great.
On Aug 6, 2014 at 16:12, "Frédéric Bastien" frederic.bastien@gmail.com
wrote:

If the forward is valid, in the grad you will have a valid and a full
convolution. If the forward is full, you will also have a valid and a full
convolution in the grad.

Why do you use the full mode in the forward? That is very strange.
Normally, what I have seen is that people use the valid mode in the forward.
Are you sure your Torch implementation also uses the full mode in the forward?

I need to leave. I'll see if I can work on that tonight. If the speed
difference is real, I see two ways to identify it:

  1. Run your code with cuda-memcheck. If we have a problem like a call to
    gemm with numbers that are too big, causing a bad memory read, we will
    see the error. That would be the easy case.

I looked at the Caffe code and it works differently in the backward case
than what we did. Check the method Backward_gpu in
src/caffe/layers/conv_layers.cu.

Here is the profile of convnet-benchmark; you can compare it to what is on
the web site, but we see that the grad, the way we implement it in
GpuCorrMM, is slower:

CONFIG: input = 3 x 128 x 128 * ker = 3 x 96 x 11 x 11 ( bs = 128 , stride = 1 )
gemm theano.sandbox.cuda.blas.GpuConvMM fprof: 1246.53857689 GFLOP/s ( tm = 0.0996497273445 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop weights: 0.0 GFLOP/s ( tm = 0.355364978313 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop inputs: 0.0 GFLOP/s ( tm = 1.75031024218 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop both: 0.0 GFLOP/s ( tm = 2.09745848179 )

CONFIG: input = 64 x 64 x 64 * ker = 64 x 128 x 9 x 9 ( bs = 128 , stride = 1 )
gemm theano.sandbox.cuda.blas.GpuConvMM fprof: 1814.27996141 GFLOP/s ( tm = 0.293620705605 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop weights: 0.0 GFLOP/s ( tm = 1.16873198748 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop inputs: 0.0 GFLOP/s ( tm = 0.506856739521 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop both: 0.0 GFLOP/s ( tm = 1.77047175169 )

CONFIG: input = 128 x 32 x 32 * ker = 128 x 128 x 9 x 9 ( bs = 128 , stride = 1 )
gemm theano.sandbox.cuda.blas.GpuConvMM fprof: 1163.07152933 GFLOP/s ( tm = 0.168252289295 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop weights: 0.0 GFLOP/s ( tm = 0.437334001064 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop inputs: 0.0 GFLOP/s ( tm = 0.230055272579 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop both: 0.0 GFLOP/s ( tm = 0.706533968449 )

CONFIG: input = 128 x 16 x 16 * ker = 128 x 128 x 7 x 7 ( bs = 128 , stride = 1 )
gemm theano.sandbox.cuda.blas.GpuConvMM fprof: 362.523363232 GFLOP/s ( tm = 0.0566917657852 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop weights: 0.0 GFLOP/s ( tm = 0.0665702223778 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop inputs: 0.0 GFLOP/s ( tm = 0.102097034454 )
gemm theano.sandbox.cuda.blas.GpuConvMM bprop both: 0.0 GFLOP/s ( tm = 0.168358981609 )

On Wed, Aug 6, 2014 at 12:00 PM, Arjun Jain notifications@github.com
wrote:

Hi Fred, thanks a ton for your reply. Please find below the complete log
after also using profile_memory=True.

What I don't understand is why 'valid' is getting called at all; I only
call conv2d with 'full'. Perhaps this happens during the backprop, but I am
really not sure why. Any help would be greatly appreciated. Thanks a lot!!!
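
(As a reference, this kind of profiling is typically enabled through Theano's
config flags; a minimal sketch, not necessarily the exact settings used for
the log below:

import theano
theano.config.profile = True         # per-Op timing profile
theano.config.profile_memory = True  # memory profile (falls back to the Stack VM)

The same flags can also be set from the environment via THEANO_FLAGS.)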

Using gpu device 3: GeForce GTX TITAN Black
/home/ajain/Theano/theano/gof/vm.py:716: UserWarning: CVM does not support memory profile, using Stack VM.
'CVM does not support memory profile, using Stack VM.')

Function profiling

Message: /home/ajain/Projects/deep_nets/python/lib/machine.py:239
Time in 48 calls to Function.call: 1.493318e+02s
Time in Function.fn.call: 1.493235e+02s (99.994%)
Time in thunks: 1.478504e+02s (99.008%)
Total compile time: 5.620848e+00s
Number of Apply nodes: 369
Theano Optimizer time: 5.321193e+00s
Theano validate time: 3.487010e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 2.601519e-01s

Class

<% time> <sum %> <#call> <#apply>
94.4% 94.4% 139.572s 1.53e-01s C 912 19 theano.sandbox.cuda.blas.GpuCorrMM
2.1% 96.5% 3.036s 1.58e-02s C 192 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
1.3% 97.8% 1.980s 1.11e-03s C 1776 37 theano.sandbox.cuda.basic_ops.GpuContiguous
0.6% 98.4% 0.910s 4.31e-04s C 2112 44 theano.sandbox.cuda.basic_ops.GpuElemwise
0.5% 98.9% 0.666s 5.34e-04s Py 1248 26 theano.sandbox.cuda.basic_ops.GpuReshape
0.3% 99.1% 0.382s 1.99e-03s C 192 4 theano.sandbox.cuda.basic_ops.GpuFromHost
0.2% 99.3% 0.296s 3.08e-03s Py 96 2 theano.tensor.extra_ops.RepeatOp
0.2% 99.5% 0.296s 1.03e-03s C 288 6 theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
0.2% 99.7% 0.252s 8.75e-04s C 288 6 theano.sandbox.cuda.blas.GpuDownsampleFactorMax
0.1% 99.8% 0.178s 1.24e-03s C 144 3 theano.sandbox.cuda.basic_ops.HostFromGpu
0.1% 99.9% 0.121s 3.16e-04s C 384 8 theano.sandbox.cuda.basic_ops.GpuCAReduce
0.1% 100.0% 0.096s 4.01e-04s C 240 5 theano.sandbox.cuda.basic_ops.GpuAlloc
0.0% 100.0% 0.015s 5.75e-06s C 2688 56 theano.compile.ops.Shape_i
0.0% 100.0% 0.013s 7.38e-06s C 1776 37 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 100.0% 0.011s 7.58e-06s C 1440 30 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 100.0% 0.007s 5.66e-06s C 1152 24 theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.006s 1.06e-05s C 576 12 theano.tensor.basic.Join
0.0% 100.0% 0.005s 6.05e-06s C 864 18 theano.tensor.subtensor.Subtensor
0.0% 100.0% 0.004s 5.89e-06s C 672 14 theano.tensor.opt.MakeVector
0.0% 100.0% 0.002s 5.11e-06s C 336 7 theano.tensor.elemwise.Prod
... (remaining 2 Classes account for 0.00%(0.00s) of the runtime)

Ops

<% time> <sum %> <#call> <#apply>
91.5% 91.5% 135.223s 2.56e-01s C 528 11 GpuCorrMM{valid, (1, 1), pad=0}
2.9% 94.4% 4.348s 1.13e-02s C 384 8 GpuCorrMM{full, (1, 1), pad=0}
2.1% 96.5% 3.036s 1.58e-02s C 192 4 GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}
1.3% 97.8% 1.980s 1.11e-03s C 1776 37 GpuContiguous
0.5% 98.2% 0.666s 5.34e-04s Py 1248 26 GpuReshape{4}
0.3% 98.5% 0.382s 1.99e-03s C 192 4 GpuFromHost
0.2% 98.7% 0.296s 3.08e-03s Py 96 2 RepeatOp
0.2% 98.9% 0.247s 1.03e-03s C 240 5 GpuElemwise{add,no_inplace}
0.2% 99.0% 0.237s 1.23e-03s C 192 4 GpuDownsampleFactorMaxGrad{(1, 1),True}
0.1% 99.2% 0.210s 8.77e-04s C 240 5 GpuElemwise{maximum,no_inplace}
0.1% 99.3% 0.206s 1.07e-03s C 192 4 GpuDownsampleFactorMax{(1, 1),True}
0.1% 99.4% 0.178s 1.24e-03s C 144 3 HostFromGpu
0.1% 99.5% 0.164s 6.82e-04s C 240 5 GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)]
0.1% 99.6% 0.157s 6.56e-04s C 240 5 GpuElemwise{Mul}[(0, 0)]
0.1% 99.7% 0.100s 3.47e-04s C 288 6 GpuCAReduce{add}{1,0,1,1}
0.1% 99.8% 0.087s 4.51e-04s C 192 4 GpuAlloc{memset_0=True}
0.0% 99.8% 0.059s 6.14e-04s C 96 2 GpuDownsampleFactorMaxGrad{(2, 2),True}
0.0% 99.8% 0.046s 4.80e-04s C 96 2 GpuDownsampleFactorMax{(2, 2),True}
0.0% 99.9% 0.023s 4.84e-04s C 48 1 GpuElemwise{sqr,no_inplace}
0.0% 99.9% 0.022s 2.32e-04s C 96 2 GpuElemwise{TrueDiv}[(0, 0)]
... (remaining 37 Ops account for 0.12%(0.18s) of the runtime)

Apply

<% time> <sum %> <#call> <Gflops/s>
34.9% 34.9% 51.585s 1.07e+00s 48 313 14400.0 0.3 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(256, 16, 60, 60), strides=(57600, 3600, 60, 1)
input 1: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1)
output 0: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0)
30.4% 65.3% 44.946s 9.36e-01s 48 327 72900.0 1.6 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1)
input 1: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1)
output 0: dtype=float32, shape=(512, 16, 9, 9), strides=(1296, 81, 9, 1)
15.1% 80.4% 22.270s 4.64e-01s 48 367 2109.4 0.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1)
input 1: dtype=float32, shape=(3, 16, 240, 240), strides=(921600, 57600, 240, 1)
output 0: dtype=float32, shape=(16, 3, 5, 5), strides=(75, 25, 5, 1)
5.4% 85.8% 8.016s 1.67e-01s 48 325 72900.0 8.9 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1)
input 1: dtype=float32, shape=(16, 512, 9, 9), strides=(41472, 81, 9, 1)
output 0: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1)
3.6% 89.4% 5.313s 1.11e-01s 48 355 2812.5 0.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1)
input 1: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1)
output 0: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1)
1.2% 90.6% 1.770s 3.69e-02s 48 322 GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:
input 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1)
input 1: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1)
input 2: dtype=int64, shape=8, strides=c
input 3: dtype=int64, shape=8, strides=c
input 4: dtype=int64, shape=8, strides=c
input 5: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1)
1.2% 91.7% 1.732s 3.61e-02s 48 190 72900.0 41.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1)
input 1: dtype=float32, shape=(512, 16, 9, 9), strides=(1296, 81, 9, 1)
output 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1)
1.0% 92.7% 1.416s 2.95e-02s 48 341 703.1 0.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 16, 64, 64), strides=(65536, 4096, 64, 1)
input 1: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1)
output 0: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1)
0.7% 93.3% 0.969s 2.02e-02s 48 364 GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:
input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1)
input 1: dtype=float32, shape=(16, 16, 240, 240), strides=(921600, 57600, 240, 1)
input 2: dtype=int64, shape=8, strides=c
input 3: dtype=int64, shape=8, strides=c
input 4: dtype=int64, shape=8, strides=c
input 5: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1)
0.4% 93.8% 0.664s 1.38e-02s 48 300 112.5 0.2 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(4, 16, 60, 60), strides=(57600, 3600, 60, 1)
input 1: dtype=float32, shape=(256, 16, 60, 60), strides=(57600, 3600, 60, 1)
output 0: dtype=float32, shape=(4, 256, 1, 1), strides=(256, 1, 0, 0)
0.4% 94.2% 0.551s 1.15e-02s 48 44 427.1 0.8 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 3, 240, 240), strides=(172800, 57600, 240, 1)
input 1: dtype=float32, shape=(1, 3, 9, 9), strides=(0, 81, 9, 1)
output 0: dtype=float32, shape=(16, 1, 248, 248), strides=(61504, 0, 248, 1)
0.4% 94.5% 0.549s 1.14e-02s 48 61 427.1 0.8 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 3, 240, 240), strides=(172800, 57600, 240, 1)
input 1: dtype=float32, shape=(1, 3, 9, 9), strides=(0, 81, 9, 1)
output 0: dtype=float32, shape=(16, 1, 248, 248), strides=(61504, 0, 248, 1)
0.3% 94.9% 0.492s 1.03e-02s 48 326 GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(4624, 2367488, 68, 1)
output 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1)
0.3% 95.2% 0.464s 9.67e-03s 48 224 14400.0 30.3 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1)
input 1: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0)
output 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1)
0.3% 95.5% 0.425s 8.86e-03s 48 311 14400.0 33.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1)
input 1: dtype=float32, shape=(512, 256, 1, 1), strides=(256, 1, 0, 0)
output 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1)
0.3% 95.7% 0.396s 8.26e-03s 48 232 GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
input 0: dtype=float32, shape=(512, 16, 60, 60), strides=(3600, 1843200, 60, 1)
output 0: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1)
0.3% 96.0% 0.388s 8.07e-03s 48 122 2812.5 7.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1)
input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1)
output 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1)
0.3% 96.3% 0.385s 8.02e-03s 48 208 GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:
input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(2367488, 4624, 68, 1)
input 1: dtype=int64, shape=(4,), strides=c
output 0: dtype=float32, shape=(8192, 1, 60, 60), strides=(3600, 0, 60, 1)
0.2% 96.5% 0.367s 7.64e-03s 48 366 GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(59536, 952576, 244, 1)
output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1)
0.2% 96.8% 0.365s 7.61e-03s 48 353 2812.5 7.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1)
input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1)
output 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1)
... (remaining 349 Apply instances account for 3.24%(4.79s) of the runtime)

Memory Profile
(Sparse variables are ignored)

(For values in brackets, it's for linker = c|py

Max if no gc (allow_gc=False): 1619415KB (1619415KB)
Max if linker=cvm(default): 765866KB (765866KB)
Memory saved if views are used: 2970455KB (2970455KB)
Memory saved if inplace ops are used: 649542KB (649542KB)
Memory saved if gc is enabled: 853549KB (853549KB)

<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>

 151519232B  [(16, 512, 68, 68)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[[ 0.]]]]}, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0)
 151519232B  [(512, 16, 68, 68)] v GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
 151519232B  [(16, 512, 68, 68)] i GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuReshape{4}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
 151519232B  [(512, 16, 68, 68)] v GpuDimShuffle{1,0,2,3}(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
 151519232B  [(16, 512, 68, 68)] v GpuContiguous(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
 151519232B  [(16, 512, 68, 68)] c GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
 117964800B  [(16, 512, 60, 60)] v GpuContiguous(GpuElemwise{maximum,no_inplace}.0)
 117964800B  [(512, 16, 60, 60)] v GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
 117964800B  [(16, 512, 60, 60)] c GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
 117964800B  [(16, 512, 60, 60)] c GpuElemwise{add,no_inplace}(GpuReshape{4}.0, GpuDimShuffle{x,0,x,x}.0)
 117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMaxGrad{(1, 1),True}.0, MakeVector.0)
 117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMax{(1, 1),True}.0, Join.0)
 117964800B  [(16, 512, 60, 60)] v GpuSubtensor{::, ::, int64:int64:, int64:int64:}(GpuCorrMM{full, (1, 1), pad=0}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
 117964800B  [(16, 512, 60, 60)] i GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)](GpuElemwise{maximum,no_inplace}.0, GpuElemwise{add,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] c GpuDownsampleFactorMaxGrad{(1, 1),True}(GpuReshape{4}.0, GpuDownsampleFactorMax{(1, 1),True}.0, GpuReshape{4}.0)
 117964800B  [(16, 512, 60, 60)] i GpuElemwise{Mul}[(0, 0)](GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)].0, GpuCorrMM{valid, (1, 1), pad=0}.0)
 117964800B  [(16, 512, 60, 60)] c GpuElemwise{maximum,no_inplace}(CudaNdarrayConstant{[[[[  9.99999997e-07]]]]}, GpuElemwise{add,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:}.0, Join.0)
 117964800B  [(512, 16, 60, 60)] v GpuDimShuffle{1,0,2,3}(GpuElemwise{maximum,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuElemwise{Mul}[(0, 0)].0, MakeVector.0)

... (remaining 349 Apply account for 2804535640B/5365158232B ((52.27%)) of the Apply with dense outputs sizes)

<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.

Function profiling

Message: /home/ajain/Projects/deep_nets/python/lib/machine.py:243
Time in 0 calls to Function.call: 0.000000e+00s
Total compile time: 3.708201e+00s
Number of Apply nodes: 0
Theano Optimizer time: 3.540352e+00s
Theano validate time: 1.026495e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 1.489460e-01s

Function profiling

Message: Sum of all printed profiles at exit excluding Scan op profile.
Time in 48 calls to Function.call: 1.493318e+02s
Time in Function.fn.call: 1.493235e+02s (99.994%)
Time in thunks: 1.478504e+02s (99.008%)
Total compile time: 9.329049e+00s
Number of Apply nodes: 369
Theano Optimizer time: 8.861545e+00s
Theano validate time: 4.513505e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 4.090979e-01s

Class

<% time> <sum %> <#call> <#apply>
94.4% 94.4% 139.572s 1.53e-01s C 912 19 theano.sandbox.cuda.blas.GpuCorrMM
2.1% 96.5% 3.036s 1.58e-02s C 192 4 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
1.3% 97.8% 1.980s 1.11e-03s C 1776 37 theano.sandbox.cuda.basic_ops.GpuContiguous
0.6% 98.4% 0.910s 4.31e-04s C 2112 44 theano.sandbox.cuda.basic_ops.GpuElemwise
0.5% 98.9% 0.666s 5.34e-04s Py 1248 26 theano.sandbox.cuda.basic_ops.GpuReshape
0.3% 99.1% 0.382s 1.99e-03s C 192 4 theano.sandbox.cuda.basic_ops.GpuFromHost
0.2% 99.3% 0.296s 3.08e-03s Py 96 2 theano.tensor.extra_ops.RepeatOp
0.2% 99.5% 0.296s 1.03e-03s C 288 6 theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad
0.2% 99.7% 0.252s 8.75e-04s C 288 6 theano.sandbox.cuda.blas.GpuDownsampleFactorMax
0.1% 99.8% 0.178s 1.24e-03s C 144 3 theano.sandbox.cuda.basic_ops.HostFromGpu
0.1% 99.9% 0.121s 3.16e-04s C 384 8 theano.sandbox.cuda.basic_ops.GpuCAReduce
0.1% 100.0% 0.096s 4.01e-04s C 240 5 theano.sandbox.cuda.basic_ops.GpuAlloc
0.0% 100.0% 0.015s 5.75e-06s C 2688 56 theano.compile.ops.Shape_i
0.0% 100.0% 0.013s 7.38e-06s C 1776 37 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 100.0% 0.011s 7.58e-06s C 1440 30 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 100.0% 0.007s 5.66e-06s C 1152 24 theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.006s 1.06e-05s C 576 12 theano.tensor.basic.Join
0.0% 100.0% 0.005s 6.05e-06s C 864 18 theano.tensor.subtensor.Subtensor
0.0% 100.0% 0.004s 5.89e-06s C 672 14 theano.tensor.opt.MakeVector
0.0% 100.0% 0.002s 5.11e-06s C 336 7 theano.tensor.elemwise.Prod
... (remaining 2 Classes account for 0.00%(0.00s) of the runtime)

Ops

<% time> <sum %> <#call> <#apply>
91.5% 91.5% 135.223s 2.56e-01s C 528 11 GpuCorrMM{valid, (1, 1), pad=0}
2.9% 94.4% 4.348s 1.13e-02s C 384 8 GpuCorrMM{full, (1, 1), pad=0}
2.1% 96.5% 3.036s 1.58e-02s C 192 4 GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}
1.3% 97.8% 1.980s 1.11e-03s C 1776 37 GpuContiguous
0.5% 98.2% 0.666s 5.34e-04s Py 1248 26 GpuReshape{4}
0.3% 98.5% 0.382s 1.99e-03s C 192 4 GpuFromHost
0.2% 98.7% 0.296s 3.08e-03s Py 96 2 RepeatOp
0.2% 98.9% 0.247s 1.03e-03s C 240 5 GpuElemwise{add,no_inplace}
0.2% 99.0% 0.237s 1.23e-03s C 192 4 GpuDownsampleFactorMaxGrad{(1, 1),True}
0.1% 99.2% 0.210s 8.77e-04s C 240 5 GpuElemwise{maximum,no_inplace}
0.1% 99.3% 0.206s 1.07e-03s C 192 4 GpuDownsampleFactorMax{(1, 1),True}
0.1% 99.4% 0.178s 1.24e-03s C 144 3 HostFromGpu
0.1% 99.5% 0.164s 6.82e-04s C 240 5 GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)]
0.1% 99.6% 0.157s 6.56e-04s C 240 5 GpuElemwise{Mul}[(0, 0)]
0.1% 99.7% 0.100s 3.47e-04s C 288 6 GpuCAReduce{add}{1,0,1,1}
0.1% 99.8% 0.087s 4.51e-04s C 192 4 GpuAlloc{memset_0=True}
0.0% 99.8% 0.059s 6.14e-04s C 96 2 GpuDownsampleFactorMaxGrad{(2, 2),True}
0.0% 99.8% 0.046s 4.80e-04s C 96 2 GpuDownsampleFactorMax{(2, 2),True}
0.0% 99.9% 0.023s 4.84e-04s C 48 1 GpuElemwise{sqr,no_inplace}
0.0% 99.9% 0.022s 2.32e-04s C 96 2 GpuElemwise{TrueDiv}[(0, 0)]
... (remaining 37 Ops account for 0.12%(0.18s) of the runtime)

Apply

<% time> <sum %> <#call> <Gflops/s>
34.9% 34.9% 51.585s 1.07e+00s 48 313 14400.0 0.3 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(256, 16, 60, 60), strides=(57600, 3600, 60, 1)
input 1: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1)
output 0: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0)
30.4% 65.3% 44.946s 9.36e-01s 48 327 72900.0 1.6 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1)
input 1: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1)
output 0: dtype=float32, shape=(512, 16, 9, 9), strides=(1296, 81, 9, 1)
15.1% 80.4% 22.270s 4.64e-01s 48 367 2109.4 0.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1)
input 1: dtype=float32, shape=(3, 16, 240, 240), strides=(921600, 57600, 240, 1)
output 0: dtype=float32, shape=(16, 3, 5, 5), strides=(75, 25, 5, 1)
5.4% 85.8% 8.016s 1.67e-01s 48 325 72900.0 8.9 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1)
input 1: dtype=float32, shape=(16, 512, 9, 9), strides=(41472, 81, 9, 1)
output 0: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1)
3.6% 89.4% 5.313s 1.11e-01s 48 355 2812.5 0.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1)
input 1: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1)
output 0: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1)
1.2% 90.6% 1.770s 3.69e-02s 48 322 GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:
input 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1)
input 1: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1)
input 2: dtype=int64, shape=8, strides=c
input 3: dtype=int64, shape=8, strides=c
input 4: dtype=int64, shape=8, strides=c
input 5: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1)
1.2% 91.7% 1.732s 3.61e-02s 48 190 72900.0 41.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1)
input 1: dtype=float32, shape=(512, 16, 9, 9), strides=(1296, 81, 9, 1)
output 0: dtype=float32, shape=(16, 512, 68, 68), strides=(2367488, 4624, 68, 1)
1.0% 92.7% 1.416s 2.95e-02s 48 341 703.1 0.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 16, 64, 64), strides=(65536, 4096, 64, 1)
input 1: dtype=float32, shape=(16, 16, 60, 60), strides=(57600, 3600, 60, 1)
output 0: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1)
0.7% 93.3% 0.969s 2.02e-02s 48 364 GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:
input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1)
input 1: dtype=float32, shape=(16, 16, 240, 240), strides=(921600, 57600, 240, 1)
input 2: dtype=int64, shape=8, strides=c
input 3: dtype=int64, shape=8, strides=c
input 4: dtype=int64, shape=8, strides=c
input 5: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1)
0.4% 93.8% 0.664s 1.38e-02s 48 300 112.5 0.2 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(4, 16, 60, 60), strides=(57600, 3600, 60, 1)
input 1: dtype=float32, shape=(256, 16, 60, 60), strides=(57600, 3600, 60, 1)
output 0: dtype=float32, shape=(4, 256, 1, 1), strides=(256, 1, 0, 0)
0.4% 94.2% 0.551s 1.15e-02s 48 44 427.1 0.8 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 3, 240, 240), strides=(172800, 57600, 240, 1)
input 1: dtype=float32, shape=(1, 3, 9, 9), strides=(0, 81, 9, 1)
output 0: dtype=float32, shape=(16, 1, 248, 248), strides=(61504, 0, 248, 1)
0.4% 94.5% 0.549s 1.14e-02s 48 61 427.1 0.8 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 3, 240, 240), strides=(172800, 57600, 240, 1)
input 1: dtype=float32, shape=(1, 3, 9, 9), strides=(0, 81, 9, 1)
output 0: dtype=float32, shape=(16, 1, 248, 248), strides=(61504, 0, 248, 1)
0.3% 94.9% 0.492s 1.03e-02s 48 326 GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
input 0: dtype=float32, shape=(512, 16, 68, 68), strides=(4624, 2367488, 68, 1)
output 0: dtype=float32, shape=(512, 16, 68, 68), strides=(73984, 4624, 68, 1)
0.3% 95.2% 0.464s 9.67e-03s 48 224 14400.0 30.3 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1)
input 1: dtype=float32, shape=(256, 512, 1, 1), strides=(512, 1, 0, 0)
output 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1)
0.3% 95.5% 0.425s 8.86e-03s 48 311 14400.0 33.1 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 256, 60, 60), strides=(921600, 3600, 60, 1)
input 1: dtype=float32, shape=(512, 256, 1, 1), strides=(256, 1, 0, 0)
output 0: dtype=float32, shape=(16, 512, 60, 60), strides=(1843200, 3600, 60, 1)
0.3% 95.7% 0.396s 8.26e-03s 48 232 GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
input 0: dtype=float32, shape=(512, 16, 60, 60), strides=(3600, 1843200, 60, 1)
output 0: dtype=float32, shape=(512, 16, 60, 60), strides=(57600, 3600, 60, 1)
0.3% 96.0% 0.388s 8.07e-03s 48 122 2812.5 7.1 GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous
input 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1)
input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1)
output 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1)
0.3% 96.3% 0.385s 8.02e-03s 48 208 GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:
input 0: dtype=float32, shape=(16, 512, 60, 60), strides=(2367488, 4624, 68, 1)
input 1: dtype=int64, shape=(4,), strides=c
output 0: dtype=float32, shape=(8192, 1, 60, 60), strides=(3600, 0, 60, 1)
0.2% 96.5% 0.367s 7.64e-03s 48 366 GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
input 0: dtype=float32, shape=(16, 16, 244, 244), strides=(59536, 952576, 244, 1)
output 0: dtype=float32, shape=(16, 16, 244, 244), strides=(952576, 59536, 244, 1)
0.2% 96.8% 0.365s 7.61e-03s 48 353 2812.5 7.5 GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguou
input 0: dtype=float32, shape=(16, 16, 124, 124), strides=(246016, 15376, 124, 1)
input 1: dtype=float32, shape=(16, 16, 5, 5), strides=(400, 25, 5, 1)
output 0: dtype=float32, shape=(16, 16, 120, 120), strides=(230400, 14400, 120, 1)
... (remaining 349 Apply instances account for 3.24%(4.79s) of the runtime)

Memory Profile
(Sparse variables are ignored)

(For values in brackets, it's for linker = c|py

Max if no gc (allow_gc=False): 1619415KB (1619415KB)
Max if linker=cvm(default): 765866KB (765866KB)
Memory saved if views are used: 2970455KB (2970455KB)
Memory saved if inplace ops are used: 649542KB (649542KB)
Memory saved if gc is enabled: 853549KB (853549KB)

<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>

 151519232B  [(16, 512, 68, 68)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[[ 0.]]]]}, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0)
 151519232B  [(512, 16, 68, 68)] v GpuContiguous(GpuDimShuffle{1,0,2,3}.0)
 151519232B  [(16, 512, 68, 68)] i GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuReshape{4}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
 151519232B  [(512, 16, 68, 68)] v GpuDimShuffle{1,0,2,3}(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
 151519232B  [(16, 512, 68, 68)] v GpuContiguous(GpuIncSubtensor{InplaceInc;::, ::, int64:int64:, int64:int64:}.0)
 151519232B  [(16, 512, 68, 68)] c GpuCorrMM{full, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
 117964800B  [(16, 512, 60, 60)] v GpuContiguous(GpuElemwise{maximum,no_inplace}.0)
 117964800B  [(512, 16, 60, 60)] v GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
 117964800B  [(16, 512, 60, 60)] c GpuCorrMM{valid, (1, 1), pad=0}(GpuContiguous.0, GpuContiguous.0)
 117964800B  [(16, 512, 60, 60)] c GpuElemwise{add,no_inplace}(GpuReshape{4}.0, GpuDimShuffle{x,0,x,x}.0)
 117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMaxGrad{(1, 1),True}.0, MakeVector.0)
 117964800B  [(16, 512, 60, 60)] v GpuReshape{4}(GpuDownsampleFactorMax{(1, 1),True}.0, Join.0)
 117964800B  [(16, 512, 60, 60)] v GpuSubtensor{::, ::, int64:int64:, int64:int64:}(GpuCorrMM{full, (1, 1), pad=0}.0, Constant{4}, Constant{-4}, Constant{4}, Constant{-4})
 117964800B  [(16, 512, 60, 60)] i GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)](GpuElemwise{maximum,no_inplace}.0, GpuElemwise{add,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] c GpuDownsampleFactorMaxGrad{(1, 1),True}(GpuReshape{4}.0, GpuDownsampleFactorMax{(1, 1),True}.0, GpuReshape{4}.0)
 117964800B  [(16, 512, 60, 60)] i GpuElemwise{Mul}[(0, 0)](GpuElemwise{Composite{[Cast{float32}(EQ(i0, i1))]}}[(0, 0)].0, GpuCorrMM{valid, (1, 1), pad=0}.0)
 117964800B  [(16, 512, 60, 60)] c GpuElemwise{maximum,no_inplace}(CudaNdarrayConstant{[[[[  9.99999997e-07]]]]}, GpuElemwise{add,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuSubtensor{::, ::, int64:int64:, int64:int64:}.0, Join.0)
 117964800B  [(512, 16, 60, 60)] v GpuDimShuffle{1,0,2,3}(GpuElemwise{maximum,no_inplace}.0)
 117964800B  [(8192, 1, 60, 60)] v GpuReshape{4}(GpuElemwise{Mul}[(0, 0)].0, MakeVector.0)

... (remaining 349 Apply account for 2804535640B/5365158232B ((52.27%)) of the Apply with dense outputs sizes)

<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.



@stencilman
Contributor

Hi Fred,

I use full because I want the same shape of output as input. I can change to using valid in fprop, but does that change anything?

Yes, I checked the Caffe code and how they calculate the gradients. I can add this to GpuCorrMM. How can I add a grad function that can call a CUDA function to get the gradient value?
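
As a sketch of the general pattern only (illustrative, not this PR's code): a Theano Op exposes its gradient through a grad() method that returns symbolic expressions; for a GPU op those expressions would be built from other GPU Ops whose own C/CUDA code does the actual work. A toy, self-contained example using the standard Op API:

import numpy as np
import theano
import theano.tensor as T

class Square(theano.Op):
    # Toy Op: elementwise square, with its gradient defined symbolically.
    def make_node(self, x):
        x = T.as_tensor_variable(x)
        return theano.Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        x, = inputs
        output_storage[0][0] = x * x

    def grad(self, inputs, output_grads):
        x, = inputs
        gz, = output_grads
        # Return a graph expression, not a value; a GPU conv op would
        # return expressions built from other conv/corr Ops here.
        return [2 * x * gz]

x = T.vector('x')
cost = Square()(x).sum()
g = theano.function([x], T.grad(cost, x))
print(g(np.asarray([1, 2, 3], dtype=theano.config.floatX)))  # [2. 4. 6.]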

I tried using cuda-memcheck, but it always seems to get stuck. I will run cuda-memcheck on a minimal program instead and let you know.

Yes, I do think the back prop is as slow as I report.

Would be amazing if you have some time to look at it later. Thank you! I am happy to do anything to make it fast, I really need this.

Thanks a lot,
Warm regards,
Arjun

@nouiz
Member
nouiz commented Aug 7, 2014

I'm looking into it; finally, I don't think we need to create a new op that
will do both grads at the same time. The fact that they are together is only
because they don't have a granularity of operations as small as Theano's.

The problem is related to the fact that we use the same code for the
convolution in full mode, but they use another version in that case. I
started to work on it.

Fred


@stencilman
Contributor

It is great to know we might not need to do a new op!

Thanks for working on it; let me know if I can help in any way, I would be more than happy to! It would help us a lot here to have fast convolution in Theano. :-)

@nouiz
Member
nouiz commented Aug 7, 2014

I'll go sleep now. I have something that runs, but doesn't return the right answer:

https://github.com/nouiz/Theano/tree/conv_gemm

If you can review it and try to fix it, it would be great.


@stencilman
Contributor

Great, thanks! I will have a look and let you know if I can find anything.

@stencilman
Contributor

I am a bit confused: why do you want to call col2im in the 'full' mode? Why do you want to handle the 'valid' and 'full' modes differently in the CUDA code?

@stencilman
Contributor

I added support for non-square kernels here: stencilman/Theano-1@85b8a90

It passed all tests. Is that what you were trying to do?

@nouiz
Member
nouiz commented Aug 7, 2014

I'm trying to do as in this function:

cunn_SpatialConvolutionMM_updateGradInput

In particular this part:

https://github.com/torch/cunn/blob/master/SpatialConvolutionMM.cu#L303


@stencilman
Contributor

OK, updateGradInput calculates the gradients, but I am still a bit confused as to why. Is it because you want to get it to work first and then change it to a different function?

All tests for the full mode will fail, right, if you calculate the gradient instead of doing the corr?

Please do explain.

And yes, the non-square stuff works perfectly, I will create a PR.

@stencilman
Contributor

I created a PR (#2023) for the code which handles non-square kernels.

About your branch and the changes you made last night, I am unsure why you want to do what cunn_SpatialConvolutionMM_updateGradInput does. Can you please explain?

@stencilman
Contributor

Also, why do you think it is the 'full' corr that makes it slow? And how is the algorithm different for us in the full mode than for them?

What I find is that the forward is super fast, even if it is full. Somehow the backward is very slow. It would really help me a lot if you could fix this or tell me how to make it faster in the backprop. Thanks a lot.

@nouiz
Member
nouiz commented Aug 8, 2014

I started to write that reply yesterday. But forgot to hit reply. Here it
is:

On Thu, Aug 7, 2014 at 10:25 AM, Arjun Jain notifications@github.com
wrote:

Also, why do you think it is the 'full' corr that makes it slow? And how
is the algorithm different for us in the full mode than them?

I ran the convnet-benchmark before my PR and saw this result (tm is the
time)

CONFIG: input = 3 x 128 x 128 * ker = 3 x 96 x 11 x 11 ( bs = 128 , stride = 1 )
theano.tensor.nnet.conv.conv2d fprof: 295.136034231 GFLOP/s ( tm = 0.420881271362 )
theano.tensor.nnet.conv.conv2d bprop weights: 0.0 GFLOP/s ( tm = 0.672349274158 )
theano.tensor.nnet.conv.conv2d bprop inputs: 0.0 GFLOP/s ( tm = 51.4428064823 )
theano.tensor.nnet.conv.conv2d bprop both: 0.0 GFLOP/s ( tm = 53.2438390255 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft fprof: 605.508708927 GFLOP/s ( tm = 0.20514523983 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop weights: 0.0 GFLOP/s ( tm = 0.206289708614 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop inputs: 0.0 GFLOP/s ( tm = 1.18310427666 )
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop both: 0.0 GFLOP/s ( tm = 1.39372771978 )
gemm theano.sandbox.cuda.blas.GpuCorrMM fprof: 1243.54620238 GFLOP/s ( tm = 0.0998895168304 )
gemm theano.sandbox.cuda.blas.GpuCorrMM bprop weights: 0.0 GFLOP/s ( tm = 0.346038997173 )
gemm theano.sandbox.cuda.blas.GpuCorrMM bprop inputs: 0.0 GFLOP/s ( tm = 1.75310575962 )
gemm theano.sandbox.cuda.blas.GpuCorrMM bprop both: 0.0 GFLOP/s ( tm = 2.26795125008 )

Here we see that it takes 1753ms (second-to-last line) for the bprop wrt the
inputs. This is the full mode. On the web site of this benchmark (which was
run on the same GPU, a Titan Black) the timing for Torch7 was 91ms.

With the code in Theano's master, we don't call col2im. Their
implementation of the full convolution uses it instead of im2col. My guess
is that this is the cause.

what i find is that the forward is super fast, even if it is full. Somehow
in the backward it is very slow. It will really help me a lot if you can
fix this or tell me how to make it faster for back prop. Thanks a lot.

I won't be available tonight and I won't be available for a week after
Friday afternoon. My best guess for now is the full mode. But you are right
that in your profile, it was the valid mode that caused the problem.

Can you check to make sure your Torch implementation also uses the full mode
in the forward? Also, I'm pretty surprised that with that setting, you
would get the profile you showed me.

@stencilman
Contributor

I am sorry @nouiz , I still don't understand how the full mode is any different from valid. For me, full is just valid with some padding, and that is how we implement it.

col2im, in my opinion, has nothing to do with the full or valid modes; it is only useful for calculating the gradient. Please correct me if I am wrong.

The convnet-benchmark results @nouiz reports are clearly weird, no? It takes 99ms for the fprop vs 1753ms for the bprop? Why?

I will try to get to the bottom of the slow speed.

From tomorrow morning, I will also not be available until next Friday, as I am going to Vancouver for a conference tomorrow morning. I will see what I can do today.

@nouiz
Member
nouiz commented Aug 8, 2014

On Fri, Aug 8, 2014 at 9:35 AM, Arjun Jain notifications@github.com wrote:

I am sorry @nouiz https://github.com/nouiz , I still dont understand
how the full is any different from valid. For me, full is just valid with
some padding, and that is how we implement it.

col2im in my opinion has nothing to do with the full or valid modes, it is
only useful for calculating the gradient, please correct me if I am wrong.

When the fprop is valid, the bprop wrt the inputs will be a convolution in
full mode.

You are right that the full mode is valid with padding. When we have two
similar algorithms (conv with and without padding), it sometimes happens that
the fastest implementation isn't the same. That is the case here. Since we
know the padding is done with zeros, we can make an implementation that
doesn't do the computation against the zeros. My guess is that the
im2col+gemm implementation will do the multiplications with the zeros, but
the col2im+gemm implementation doesn't. I didn't look closely enough to be
sure of that. But as Caffe and Torch7 do it that way, I suppose they have a
good reason not to use the same implementation in that case.
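
To illustrate that equivalence with a tiny single-channel example (our own sketch, using SciPy rather than the GPU code):

import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(8, 8).astype('float32')
kern = np.random.rand(3, 3).astype('float32')

full = convolve2d(img, kern, mode='full')
# 'full' is just 'valid' on an input zero-padded by (kh - 1, kw - 1) ...
padded = np.pad(img, ((2, 2), (2, 2)), mode='constant')
valid_on_padded = convolve2d(padded, kern, mode='valid')

print(np.allclose(full, valid_on_padded))  # True
# ... so a specialized kernel can skip the multiplications against the zeros.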

The convnet-benchmark results as @nouiz https://github.com/nouiz report

are clearly weird, no? It takes 99msec fprop vs 1753ms bprop? Why?

I guess it is the reason I wrote above.


@stencilman
Contributor

Thanks a lot for your reply @nouiz . Hmm, I don't think so. I don't think it is the zero padding that is making it slow. Also, I think col2im and im2col do not deal with padding differently.

IMHO, the difference in speed between this and Torch will not be orders of magnitude because of the zero padding; it is something else. @nouiz I would be very grateful if you could tell me what Theano does in the bprop case. What size convolutions? Maybe we could manually fprop the same-size convolution and see how much time it takes? What do you think?
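
A minimal sketch of the kind of manual timing being suggested (our own code, with the shapes borrowed from the slowest GpuCorrMM call in the profile above):

# Run with THEANO_FLAGS=device=gpu,floatX=float32 so the conv is moved to the GPU.
import time
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv

img = T.tensor4('img')
kern = T.tensor4('kern')
f = theano.function([img, kern], conv.conv2d(img, kern, border_mode='valid'))

img_val = np.random.rand(256, 16, 60, 60).astype('float32')
kern_val = np.random.rand(512, 16, 60, 60).astype('float32')

f(img_val, kern_val)  # warm-up (compilation, first GPU kernels)
t0 = time.time()
for _ in range(10):
    f(img_val, kern_val)
print('mean time per call: %.3fs' % ((time.time() - t0) / 10))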

@stencilman
Contributor

Btw @nouiz we seem to have a misunderstanding of what Caffe/Torch does. Correct me if I am wrong: you seem to think they use col2im + gemm for the full mode. I don't think so; they don't even have a full mode. They use col2im only when they are doing their bprop. I don't understand how Theano does the bprop.

@nouiz
Member
nouiz commented Aug 8, 2014

The bprop is also a convolution. So when you say that they use col2im+gemm
during their bprop, it means that they use that implementation for the
convolution in the bprop.

The grad graph of conv2d is defined in this method:

https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L775

The grad against the filters is also a convolution, check here:

https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L898

You can see there that this grad is always a valid convolution. Here you see
that in the full mode we swap the input/filter of that convolution compared
to the valid mode:

https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L848

The grad against the image is also a convolution, check here:

https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L943

Here you see that in that case the mode of that convolution is full if
the original conv is valid. Otherwise it is a valid convolution:

https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/conv.py#L916

So in the case where the fprop is valid, the bprop against the inputs is a full
convolution. In that case (bprop against the inputs) Caffe and Torch7 use the
col2im+gemm code.
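
To see this in a concrete graph (a small sketch of ours, not this PR's code), one can ask Theano for the gradients of a valid conv2d and print them; the gradient wrt the image shows up as a full convolution and the gradient wrt the filters as a valid one:

import theano
import theano.tensor as T
from theano.tensor.nnet import conv

img = T.tensor4('img')    # (batch, channels, rows, cols)
kern = T.tensor4('kern')  # (nfilters, channels, krows, kcols)
cost = conv.conv2d(img, kern, border_mode='valid').sum()

g_img, g_kern = T.grad(cost, [img, kern])
# Both gradients are themselves convolution Ops in the printed graph.
theano.printing.debugprint(g_img)
theano.printing.debugprint(g_kern)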


@stencilman
Contributor

Thanks a ton for that comment @nouiz ! That clarifies things a lot!

So the grads are conv ops too. Is it possible that the "unroll" while doing the bprop conv doesn't have the optimal shapes and thus is not fast?

I am still not sure if we need col2im or what gemm+col2im does. I have a suspicion the problem has something to do with the sizes.

@nouiz
Member
nouiz commented Aug 8, 2014

I haven't been able to work on it today. I won't work on it next week. So
if one of you can continue, it would be great.

Do you know if the Torch code in full mode updates the weights in place? If
so, that could explain it. We don't do that now. We can do it, but it is
better to do it later. To make it work not in place, we probably need to
initialize the allocated memory to zero.

Also, it is possible that we need to allocate bigger memory buffers that are
used with the zero padding.

@stencilman
Contributor

I will try to continue to look into this.

Maybe this info helps you in figuring out the solution: they (Torch) create a 'module' and a 'criterion'; when you call module:forward(), it goes forward, and when you call module:backward(), it goes backward (recursively).
(module: https://github.com/torch/torch7-distro/blob/master/extra/nn/Module.lua#L28, criterion: https://github.com/torch/torch7-distro/blob/master/extra/nn/Criterion.lua)

function Module:backward(input, gradOutput, scale)
   scale = scale or 1
   self:updateGradInput(input, gradOutput)
   self:accGradParameters(input, gradOutput, scale)
   return self.gradInput
end

And you:

-- create closure to evaluate f(X) and df/dX
local feval = function(x)
     local target = <your data>

      -- reset gradients
      gradParameters:zero()  -- dE_dw

      -- evaluate function for complete mini batch
      local output = model:forward(batchGPU.data)

      local err = criterion:forward(output, target)
      ave_err = ave_err + err

      -- estimate df/dW
      local df_do = criterion:backward(output, target)
      model:backward(batchGPU.data, df_do)
      -- return f and df/dX
      return err, gradParameters
end

-- optimize on current mini-batch using the above closure
optimMethod(feval, parameters, conf.optimState)

where optimMethod is (https://github.com/torch/optim/blob/master/sgd.lua).
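
For comparison, a rough Theano analogue of that Torch closure (a sketch with made-up variable names, not this PR's code) would look like:

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
y = T.ivector('y')
w = theano.shared(np.zeros((784, 10), dtype='float32'), name='w')
b = theano.shared(np.zeros(10, dtype='float32'), name='b')

p_y = T.nnet.softmax(T.dot(x, w) + b)
cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])

# theano.grad plays the role of criterion:backward + model:backward,
# and the updates list plays the role of the optimizer step.
g_w, g_b = T.grad(cost, [w, b])
lr = np.float32(0.01)
train_step = theano.function([x, y], cost,
                             updates=[(w, w - lr * g_w), (b, b - lr * g_b)])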

@nouiz
Member
nouiz commented Aug 9, 2014

I don't have the time to look, but where are the "parameters" used in
cunn_SpatialConvolutionMM_accGradParameters?
In particular:

gradWeight
gradBias
finput
fgradInput

This would help solve some of the questions I had when writing the Theano
version. Also, we will need to update the license to also include the Torch7
license/copyright, I think. We can do this after we fix the speed issue in
this PR, just before merging.


@stencilman
Contributor

gradWeight: https://github.com/torch/cunn/blob/master/SpatialConvolutionMM.cu#L407
gradBias: https://github.com/torch/cunn/blob/master/SpatialConvolutionMM.cu#L426

finput and fgradInput are used in updateGradInput (which transfers the gradients for the chain rule):
fgradInput or gradInput_n: https://github.com/torch/cunn/blob/master/SpatialConvolutionMM.cu#L306

@f0k f0k referenced this pull request Jul 1, 2015
Closed

[WIP] CpuCorrMM closes #3026 #3089
