Create custom layer #100

Closed

Sowasa opened this issue Sep 22, 2015 · 11 comments

@Sowasa

Sowasa commented Sep 22, 2015

How do I get started implementing a new layer?
I want to implement a few more exotic pooling functions, so basically I want to change the function that takes the max / average, and maybe even the gradient for that layer.
It would be great to get a step-by-step recommendation on how to do such things.

@scott-gray
Collaborator

As far as the basic pooling algorithm goes, I'd have a look at the CPU backend numpy code. My kernels are implemented in the same way (though I have a much faster kernel now for bprop avg pooling that will be released soon).

The GPU kernels are currently written in assembly. The primary reason for this is that I wanted to leverage the fp16x2 atomic adds that Maxwell has but which are inexplicably unexposed in the CUDA or PTX APIs.

But if you're only concerned about fp32, then there's no reason they can't be written in CUDA C. Because our framework sits on top of pycuda, it's rather trivial to integrate some custom CUDA code. Take a look at the bottom of float_ew for some examples of this. Most of what float_ew does is dynamically building CUDA C and compiling (and caching) it on the fly.
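For reference, the basic pattern for integrating hand-written CUDA C through pycuda looks roughly like this (a minimal sketch, not the actual float_ew code; the kernel and names are made up):

```python
# Minimal sketch of running a hand-written CUDA C kernel through pycuda.
# This is not the float_ew machinery, just the pattern it builds on.
import numpy as np
import pycuda.autoinit                    # creates a context on the default device
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule  # compiles (and caches) CUDA C at runtime

mod = SourceModule("""
__global__ void scale(float *out, const float *in, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = alpha * in[i];
}
""")
scale = mod.get_function("scale")

x = gpuarray.to_gpu(np.random.rand(1024).astype(np.float32))
y = gpuarray.empty_like(x)
scale(y.gpudata, x.gpudata, np.float32(2.0), np.int32(x.size),
      block=(128, 1, 1), grid=((x.size + 127) // 128, 1))
```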

I'm curious, what pooling operation are you interested in? Maybe I can give you more explicit guidance if I knew more of what you were trying to do.

@Sowasa
Author

Sowasa commented Sep 22, 2015

In particular, I am interested in these two:
http://www.matthewzeiler.com/pubs/arxiv2012/arxiv2012.pdf
http://www.matthewzeiler.com/pubs/iclr2013/iclr2013.pdf
And in parametric ReLUs, but that should be doable.

For now I want to start with the first, which is differentiable pooling. My plan is to first implement it without learnable parameters, so I will set them by hand in the layer definition. Then I can already run some checks, and I will only implement the gradients if everything looks good and some toy experiments succeed.

So basically what I need first is a pooling layer that convolves with a Gaussian kernel and has a parameter to set an aspect of the weight filler (for now only the variance). The layer is fairly standard, just like any other pooling layer, with one (or later more) additional parameter. However, Gaussian pooling with different variances makes it necessary to convolve with each pooling kernel separately, and I assume for now the tricky part is how to do this efficiently.
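Roughly, the forward pass I have in mind looks like this (a hypothetical numpy sketch with the mean and variance set by hand, single feature map, no padding, not optimized):

```python
import numpy as np

def gaussian_kernel(size=11, var=4.0, mean=(0.0, 0.0)):
    """2D Gaussian pooling kernel, normalized to sum to one.
    `mean` is an offset (in pixels) from the window center."""
    r = (size - 1) / 2.0
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    g = np.exp(-((x - mean[0]) ** 2 + (y - mean[1]) ** 2) / (2.0 * var))
    return g / g.sum()

def gaussian_pool_fprop(img, kernel, stride=2):
    """Slide the kernel over `img` with the given stride and take the
    weighted sum at each position (valid region only, no padding)."""
    K = kernel.shape[0]
    H, W = img.shape
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    out = np.empty((out_h, out_w), dtype=img.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = img[i * stride:i * stride + K, j * stride:j * stride + K]
            out[i, j] = (window * kernel).sum()
    return out

# e.g. a 32x32 feature map, 11x11 Gaussian, stride 2 -> 11x11 output
pooled = gaussian_pool_fprop(np.random.rand(32, 32).astype(np.float32),
                             gaussian_kernel(11, var=4.0), stride=2)
```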

Can you give me some guidance on how to achieve this within your framework without slowing everything down too much? A few steps would already be highly appreciated.

@scott-gray
Collaborator

Well, I could outline the best approach to writing those kernels; I just need to better understand how the computation works. For example, do you truncate the Gaussian tails of the kernel? Or does the filter size just match the image size (and then maybe the mean moves around as the sliding operation)?

The stochastic pooling seems pretty doable. Should be pretty similar to regular max pooling.
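For reference, a minimal numpy sketch of the train-time forward pass of stochastic pooling as described in the paper (one activation is sampled per window with probability proportional to its value; inputs are assumed non-negative, e.g. post-ReLU):

```python
import numpy as np

def stochastic_pool_fprop(img, K=2, stride=2, rng=np.random):
    """Train-time stochastic pooling: sample one activation per KxK window
    (with the given stride) with probability proportional to its value.
    Assumes non-negative inputs."""
    H, W = img.shape
    out_h, out_w = (H - K) // stride + 1, (W - K) // stride + 1
    out = np.zeros((out_h, out_w), dtype=img.dtype)
    for i in range(out_h):
        for j in range(out_w):
            w = img[i * stride:i * stride + K, j * stride:j * stride + K].ravel()
            s = w.sum()
            if s > 0:
                out[i, j] = w[rng.choice(w.size, p=w / s)]
            # an all-zero window leaves the output at zero
    return out
```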

@Sowasa
Author

Sowasa commented Sep 23, 2015

Basically the kernel works like any other pooling kernel, except that here a Gaussian can move around inside it. It is of an a priori defined size, maybe a bit bigger than in usual max pooling (e.g. 11x11), so that the tails are not truncated too early and there is still a reasonable averaging window in the extreme case where it just does average pooling (which happens if the mean is in the center and the variance is very large).

The kernel is centered on every other pixel of the image, depending on the stride, and then convolved at each location. So I'd say the forward pass is again quite similar to max pooling in practice, except that a convolution is needed each time, which might make it more costly.

The first experiments I want to run will test whether more than one pooling kernel per feature map is needed. So the design should allow, say, 2 or 3 pooling kernels per feature map in parallel; those experiments will show whether this is necessary.
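Reusing the gaussian_kernel and gaussian_pool_fprop helpers from the sketch above, multiple pooling kernels per feature map would just stack their outputs, e.g.:

```python
import numpy as np

# hypothetical extension of the earlier sketch: 3 Gaussian pooling kernels
# with different variances per feature map, outputs kept separate so the
# number of feature maps expands
kernels = [gaussian_kernel(11, var=v) for v in (1.0, 4.0, 16.0)]
fmap = np.random.rand(32, 32).astype(np.float32)
pooled = np.stack([gaussian_pool_fprop(fmap, k, stride=2) for k in kernels])
# pooled.shape == (3, 11, 11): one pooled map per kernel
```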

@scott-gray
Collaborator

You say it can be similar to an average. Does this mean you also divide by the pooling window size? Or is each kernel position just the summed pointwise multiplication of the underlying image pixels?

If you have multiple pooling kernels, how does that affect the output? Does that expand the feature map dimension, or do you sum each kernel into a single output?

From what you describe, I think I can make this only slightly more expensive than average pooling. And with my new code that is quite fast (bprop is no longer bottlenecked on atomic adds in the overlapping-filter case).

@Sowasa
Author

Sowasa commented Sep 23, 2015

In the Gaussian-average case this is taken care of by the fact that a Gaussian low-pass kernel is normalized so that it always sums to one (afaik), regardless of its size, which means it becomes flatter as it gets bigger. So it is just a convolution with the kernel, and no normalization is necessary afterwards if the kernel was scaled properly.
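For instance, reusing the gaussian_kernel helper from the sketch above:

```python
# the normalized kernel always sums to one, and with a very large variance it
# approaches the flat average-pooling kernel (1/121 for an 11x11 window)
k = gaussian_kernel(11, var=1e6)
assert np.isclose(k.sum(), 1.0)
assert np.allclose(k, 1.0 / 121, atol=1e-6)
```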

@Sowasa
Author

Sowasa commented Sep 23, 2015

The outputs of the different kernels should be kept separate for now, hence expanding the feature maps, as some experiments will be needed to see what works best.

@scott-gray
Collaborator

Ok, so I wrote the kernels tonight. You can find them here: https://gist.github.com/scott-gray/5a3cd70465dcd2fe1df1

Not sure whether you wanted padding, so I implemented fprop both ways. The way bprop works, it needs the padding logic anyway, so there's just one version of it. Both of these kernels should run at the same speed as the new avg pooling kernels.

I'll write the pycuda wrappers and the numpy code to test them tomorrow night or over the weekend. I'll add some more comments to explain how they work, too.

I need to ponder the best way to implement multiple kernels a bit more. I guess the answer depends on how many you want to have.

@Sowasa
Author

Sowasa commented Sep 25, 2015

This is amazing. I will wait for your tests and then start playing around.

The question of how many kernels per feature map are needed is hard to answer, as it is not clear to me from the paper, so flexibility in that regard would be best, to leave room for some experiments.

@scott-gray
Collaborator

Ok, I just updated the gist with a basic cpu/gpu implementation and a single test. You should be able to take things from there. I opted not to implement padding, as in this case it makes the derivative really complicated when using the efficient method used here to compute bprop. The other way to compute bprop is pretty much a copy of the forward code plus a bunch of atomic adds to the output. For large filters that way can be really slow, but padding should be straightforward with it.

You'll also note I added normalization to the Gaussian. I had left that out previously. This is actually what makes bprop so tricky when padding is used (each filter position can potentially be normalized differently).

As far as multiple filters go, you have two choices. You can keep the code as is and have multiple parallel layers. In fprop you would output to a slice of the full output, and in bprop you would accumulate the deltas with the beta param. Take a look at the Inception class in the layer_gpu.py code for an example of this.
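Schematically, that parallel-layer option looks something like this (just the general idea in numpy, not neon's actual Inception code; branch.fprop / branch.bprop are hypothetical stand-ins for the single-kernel layer):

```python
# general idea only: K parallel pooling branches share one output buffer.
# fprop writes each branch into its own channel slice; bprop accumulates
# every branch's deltas into the shared input deltas (the beta-style
# accumulate in the real kernels).

def fprop_parallel(x, branches, out):
    C = out.shape[0] // len(branches)          # channels per branch
    for k, branch in enumerate(branches):
        out[k * C:(k + 1) * C] = branch.fprop(x)

def bprop_parallel(deltas_out, branches, deltas_in):
    C = deltas_out.shape[0] // len(branches)
    deltas_in[...] = 0.0
    for k, branch in enumerate(branches):
        deltas_in += branch.bprop(deltas_out[k * C:(k + 1) * C])
```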

Or you could pack an additional filter dimension into one of the block indexes and use integer division to extract it. With careful consideration of block index ordering you could maximize cache usage, so you can get more filters at little extra cost. Bprop would use atomic adds on the output.
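The packing/unpacking is just this integer arithmetic (shown in plain Python; K_FILTERS and C are made-up values):

```python
# made-up sizes, just to show the arithmetic the kernel would do
K_FILTERS = 3          # pooling kernels per feature map
C = 64                 # feature maps

# launch K_FILTERS * C blocks along one grid dimension, then inside the
# kernel recover both indices from the packed block index:
def unpack(block_idx):
    k = block_idx // C   # which pooling kernel
    c = block_idx % C    # which feature map
    return k, c

assert unpack(2 * C + 5) == (2, 5)
```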

If you want to make the var and mean params learnable, then I'd store them in device memory instead of passing them as params. If you have multiple filters you could have an array of such params.
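For example, with pycuda that could be a small gpuarray (hypothetical layout, one (mean_y, mean_x, var) row per filter):

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

# hypothetical layout: one (mean_y, mean_x, var) row per pooling filter,
# kept in device memory so it can be updated like any other weight and
# passed to the kernel as a pointer (params.gpudata) instead of as scalars
params = gpuarray.to_gpu(np.array([[0.0, 0.0, 1.0],
                                   [0.0, 0.0, 4.0],
                                   [0.0, 0.0, 16.0]], dtype=np.float32))
```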

This implementation is tantalizingly close to how convolution works. The main difference is that you don't reduce over the input feature maps. If you're going to use big filters, I suppose modifying the fprop and bprop conv kernels might be a really efficient way to implement this.

@Sowasa
Author

Sowasa commented Oct 5, 2015

Thanks a lot, this is all very helpful!
