
What's the most effective and elegant way to set lr_mult and decay_mult for each trainable layer? #669

Closed
kli-casia opened this issue Apr 26, 2016 · 6 comments

Comments


kli-casia commented Apr 26, 2016

There are a lot of useful CNN models defined in Caffe's prototxt files. When one wants to define the same model using Lasagne, one must account for the lr_mult and decay_mult hyper-parameters of the Caffe model.

For example, in Caffe, a Convolutional layer is defined as follows.

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 10
    decay_mult: 1
  }
  param {
    lr_mult: 20
    decay_mult: 0
  }
  convolution_param {
    num_output: 96
    kernel_size: 11
    stride: 4
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}

Suppose the base learning rate is 0.1 and the weight decay is 0.0005.

The first

param {
    lr_mult: 10
    decay_mult: 1
  }

is for W: the effective learning rate of W is 10 * 0.1 = 1, and the effective weight decay of W is 1 * 0.0005 = 0.0005.

The following

param {
    lr_mult: 20
    decay_mult: 0
  }

applies to the bias b in the same way: its effective learning rate is 20 * 0.1 = 2 and its effective weight decay is 0 * 0.0005 = 0.

So, my question is: what is the most effective way to define the same model (and perhaps the loss) in Lasagne?
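
(For reference, the layer itself maps to something like the following in Lasagne, assuming an input layer l_data defined elsewhere; it is the per-parameter lr_mult/decay_mult that has no direct counterpart in the layer definition.)

from lasagne.layers import Conv2DLayer
from lasagne.init import Normal, Constant

# rough Lasagne counterpart of the "conv1" layer above (the ReLU that
# usually follows in Caffe is the default nonlinearity here)
l_conv1 = Conv2DLayer(l_data, num_filters=96, filter_size=11, stride=4,
                      W=Normal(std=0.01), b=Constant(0.))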


kli-casia commented Apr 26, 2016

I noticed that there is a similar question on lasagne-users
https://groups.google.com/forum/#!msg/lasagne-users/2z-6RrgiHkE/lHghzLDgCgAJ

One answer is

params = lasagne.layers.get_all_params(l_out, trainable=True)
grads = theano.grad(loss, params)
for idx, param in enumerate(params):
    grad_scale = ... # obtain multiplier for that parameter in some way
    if grad_scale != 1:
        grads[idx] *= grad_scale
updates = lasagne.updates.nesterov_momentum(grads, params, ...)

You can use whichever way you like for the "obtain multiplier" step -- maintain a dictionary of param -> multiplier, or set param.tag.grad_scale to some value when you build the model (every Theano variable has a tag attribute that can be used freely, e.g., layer.W.tag.grad_scale = .5).

I don't know how to set tags when building the model. Can someone give me an example? Thanks.


f0k commented Apr 26, 2016

I noticed that there is a similar question on lasagne-users

Which is not by coincidence, since support questions should be posted on the mailing list, not on the issue tracker. The issue tracker is to be reserved for bug reports and feature discussions (i.e., things that require a change in Lasagne's codebase).

I don't know how to set tags when building the model. Can someone give me an example?

from lasagne.layers import InputLayer, DenseLayer

layer = InputLayer((None, 10))
layer = DenseLayer(layer, 100)
layer.W.tag.grad_scale = 10

Then the loop would become:

params = lasagne.layers.get_all_params(l_out, trainable=True)
grads = theano.grad(loss, params)
for idx, param in enumerate(params):
    grad_scale = getattr(param.tag, 'grad_scale', 1)
    if grad_scale != 1:
        grads[idx] *= grad_scale
updates = lasagne.updates.nesterov_momentum(grads, params, ...)
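
Putting it all together, a self-contained sketch of the whole flow might look like this (the small dense model, the squared-error loss and the multiplier values are just placeholders):

import theano
import theano.tensor as T
import lasagne

# build a small model and tag parameters with learning-rate multipliers,
# mirroring Caffe's lr_mult for the weights and the bias
l_in = lasagne.layers.InputLayer((None, 10))
l_hid = lasagne.layers.DenseLayer(l_in, 100)
l_hid.W.tag.grad_scale = 10
l_hid.b.tag.grad_scale = 20
l_out = lasagne.layers.DenseLayer(l_hid, 1, nonlinearity=None)

# placeholder squared-error training loss
target = T.matrix('target')
prediction = lasagne.layers.get_output(l_out)
loss = lasagne.objectives.squared_error(prediction, target).mean()

# scale each gradient by the multiplier stored on its parameter's tag
params = lasagne.layers.get_all_params(l_out, trainable=True)
grads = theano.grad(loss, params)
grads = [g * getattr(p.tag, 'grad_scale', 1) for g, p in zip(grads, params)]
updates = lasagne.updates.nesterov_momentum(grads, params, learning_rate=0.1)

train_fn = theano.function([l_in.input_var, target], loss, updates=updates)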

f0k closed this as completed Apr 26, 2016

kli-casia commented Apr 26, 2016

Thank you very much, f0k; you always help me in a timely manner.
I will post support questions on the mailing list from now on.

@jiqiujia

In this way, we can only set the lr_mult. How about decay_mult?

@f0k
Copy link
Member

f0k commented Feb 22, 2017

How about decay_mult?

For a different L2 decay per layer, you can use regularize_layer_params_weighted. Its first argument is a dictionary mapping layers to regularization strengths (see the example at the top of the linked documentation page). The result is a penalty term you add to your loss function.
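
A minimal sketch, assuming two layers l_hid and l_out built elsewhere (each coefficient plays the role of Caffe's decay_mult * weight_decay; by default only parameters tagged as regularizable are included, so biases are left out, matching decay_mult: 0 for b):

from lasagne.regularization import regularize_layer_params_weighted, l2

# per-layer L2 strengths, standing in for decay_mult * weight_decay
layer_decay = {l_hid: 0.0005, l_out: 0.001}
loss = loss + regularize_layer_params_weighted(layer_decay, l2)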

@jiqiujia

@f0k Oh, yes! That's it. Thank you very much!
