Regularization #285

Merged (1 commit, Jun 9, 2015)

Conversation

@ebenolson (Member)

No description provided.

@ebenolson mentioned this pull request on Jun 3, 2015
@f0k (Member) commented Jun 3, 2015

That's a good start! However, I think get_loss() is too unspecific a name, and I don't really like the interface. It doesn't allow you to regularize all the weights of a network at once, for example, and supporting three different types for the first argument makes the code quite convoluted.
What do you think of my proposal from September 2014? It features separate functions for separate cases, which turns them into one-liners and also makes explicit which case you're using. I like penalize instead of regularize, though.

Apart from weight and output penalization, is there anything else we should put into this module? We had that discussion before -- there are regularization methods in different lasagne submodules, but maybe we can collect helper functions for them here in a central place. For example, it would be possible to do a function like apply_dropout() that takes a network and inserts dropout layers in front of every layer, or every dense layer.

@ebenolson (Member Author)

Personally I prefer something verbose like regularize(layers.get_all_layers(output_layer)), because I got burned a while ago by assuming regularization.l2(layer) would only penalize that layer's weights. But as long as the names are explicit, one-liners are probably fine.

How about breaking it down into something like this:

cost_layer(layer, penalty)
cost_layers(layers, penalty)
cost_all_layers(layer, penalty)
weighted_cost_layers(layer_dict, penalty)
cost_layer_output(layer, penalty, inputs)

Not sure if the last is needed; seems like penalty(layer.get_output(inputs)) is pretty clear.
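
For concreteness, here is a rough sketch of a couple of these proposed helpers. This is purely illustrative -- the names and signatures belong to this proposal only (they get revised later in the thread), and filtering on the regularizable tag is an assumption based on the discussion below.

def cost_layers(layers, penalty):
    # Penalize the regularizable parameters of exactly the layers passed in.
    return sum(penalty(p)
               for layer in layers
               for p in layer.get_params(regularizable=True))

def cost_all_layers(layer, penalty):
    # Same, but expand to the full network below the given output layer first.
    from lasagne.layers import get_all_layers
    return cost_layers(get_all_layers(layer), penalty)

def weighted_cost_layers(layer_dict, penalty):
    # Per-layer coefficients, e.g. {l_hidden: 0.001, l_out: 0.005}.
    return sum(coeff * cost_layers([layer], penalty)
               for layer, coeff in layer_dict.items())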

@ebenolson (Member Author)

For example, it would be possible to do a function like apply_dropout() that takes a network and inserts dropout layers in front of every layer, or every dense layer.

I did something along these lines while trying to implement batchwise dropout. Unfortunately I had to get pretty invasive, and it ended up not being that much faster anyway. I should probably look into it again at some point, though; in principle it should speed things up quite a bit.

@f0k (Member) commented Jun 3, 2015

Personally I prefer something verbose like regularize(layers.get_all_layers(output_layer))

Doesn't sound too bad, but that might burn somebody else assuming that regularize(layer, l2) would regularize the full network similar to how get_output(layer) propagates through the network. That's why I'd prefer separate functions, which should make people look twice just to understand the difference.

How about breaking it down into something like this:

Looks good, except that cost_layer reads a bit weird. What about penalize_layer, or penalize_weights? The latter makes clear what exactly we're penalizing, but it doesn't allow the distinction of penalize_layer and penalize_layers. We could merge them into one function, though -- I'd just like to keep up the distinction between "penalize only the layers I'm giving you" and "penalize the network I'm giving you" so people don't confuse them.

Not sure if the last is needed; seems like penalty(layer.get_output(inputs)) is pretty clear.

Thinking about it, that proposal predated #182. It would actually be a bad idea to provide penalize_output, because that would trick people into calling get_output() multiple times.

unfortunately I had to get pretty invasive, and it ended up not being that much faster anyway.

Batchwise dropout could just become a different layer, couldn't it? This doesn't need a helper function of the kind I suggested. (And the helper function shouldn't make any difference in performance, just provide a different way to obtain a particular graph.) By the way, back in 2012, batchwise dropout wasn't giving me as good results on MNIST as itemwise dropout (i.e., batchwise dropout was not on par with the 2012 dropout paper).

@ebenolson (Member Author)

I'm not sure about using weights in the name, isn't the point of the regularizable tag that there could in theory be regularizable parameters that aren't weights?

I also somewhat prefer names like cost, loss or penalty to penalize or regularize, because the function is just giving us an expression, not actually performing an action on the network.

Merging penalize_layer and penalize_layers seems good to me, although I don't really like having an argument named layers that can be a single layer.

@ebenolson (Member Author)

Batchwise dropout could just become a different layer, couldn't it? This doesn't need a helper function of the kind I suggested. (And the helper function shouldn't make any difference in performance, just provide a different way to obtain a particular graph.) By the way, back in 2012, batchwise dropout wasn't giving me as good results on MNIST as itemwise dropout (i.e., batchwise dropout was not on par with the 2012 dropout paper).

I think maybe we're talking about different things -- I was trying to implement this paper from Feb 2015. It uses submatrices to avoid calculating dropped-out activations, and at least the way I was doing it required modifying the W matrices of the layers following the BatchwiseDropoutLayer. Since that can't be done until the full network is constructed, I was using helper functions.

@f0k (Member) commented Jun 3, 2015

isn't the point of the regularizable tag that there could in theory be regularizable parameters that aren't weights?

That's true. I mostly stuck to the term because it's called "weight regularization". So params would be more in line with what we're calling it in lasagne. penalize_params and penalize_all_params?

the function is just giving us an expression, not actually performing an action on the network

That's true... but params_penalty and all_params_penalty don't work that well. And I think

hidden, out = get_output([l_hidden, l_out])
loss = something(out, target) + alpha * penalize_all_params(outlayer, l2) + beta * l1(hidden)

still reads clearly. But I see your point!

although I don't really like having an argument named layers that can be a single layer.

In #182, I came up with layer_or_layers for that. It's a bit verbose, but it's very clear. We don't have that everywhere, though -- get_output(layer_or_layers), but get_all_layers(layer), although both accept single layers and lists of layers.

I think maybe we're talking about different things -- I was trying to implement this paper from Feb 2015.

Ahaa, sorry, I had heard about batchwise dropout, but had never read the paper. Now everything is clear :)

@ebenolson (Member Author)

penalize_params and penalize_all_params?

I know there are similar things elsewhere, but to me penalize_all_params sounds more like it would include b than penalize all layers.

How do you feel about penalize_layer and penalize_network or penalize_all_layers?

@f0k (Member) commented Jun 3, 2015

How do you feel about penalize_layer and penalize_network or penalize_all_layers?

It's okay, it just doesn't tell what exactly we're penalizing. I like network instead of all_layers! And I see your point about all_params... even penalize_params doesn't sound as if it would only include params tagged as regularizable, because there is no connection to the tag name. Ah, those pesky name discussions.

Maybe somebody else has a good idea... we have l1 and l2, both of which take a tensor expression and return a scalar expression, and we want to provide an easy way to apply that to a network output, the regularizable parameters of one or more layers, or the regularizable parameters of a full network.

What about:

def apply_penalty(things, penalty, **kwargs):
    return sum(penalty(thing, **kwargs) for thing in things)

The use cases would be:

loss = apply_penalty(get_output(layer), l1)  # or simply l1(get_output(network))
loss = apply_penalty(layer.get_params(regularizable=True), l2)
loss = apply_penalty((layer.get_params(regularizable=True) for layer in layers), l2)
loss = apply_penalty(get_all_params(layer, regularizable=True), l2)

Would that be too verbose? It's very clear what each line does, and we don't need to discuss any names.
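
For reference, the penalty functions themselves are just tensor-to-scalar expressions. A minimal sketch of l1 and l2 along these lines (assuming Theano, and matching the "tensor expression in, scalar expression out" contract described above):

import theano.tensor as T

def l1(x):
    # Sum of absolute values: tensor expression in, scalar expression out.
    return T.sum(abs(x))

def l2(x):
    # Sum of squares: tensor expression in, scalar expression out.
    return T.sum(x ** 2)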

@benanne added this to the First release milestone on Jun 3, 2015
@benanne (Member) commented Jun 3, 2015

I don't have much to add to this discussion at the moment... you guys seem to have all the bases covered :)

I guess it would be nice to have a shortcut for this one:

loss = apply_penalty(get_all_params(layer, regularizable=True), l2)

This is by far going to be the most common use case, i.e. "add regularization terms for all regularizable parameters in a network". If we can avoid 'exposing' the tagging mechanism here, that would be nice. Something like:

loss = penalize_net(layer, l2)

but maybe with a different name, I don't know :) Just something that encapsulates the regularizable=True and the get_all_params call.

@f0k (Member) commented Jun 4, 2015

I guess it would be nice to have a shortcut for this one:
[...]
but maybe with a different name

Yes, that's exactly the trouble. Without a shortcut, it's completely clear what it does. And if we provide a shortcut, we need a name that a) clearly says whether it's going to affect a single layer or also the layers feeding into it and b) what exactly it's applying the penalty function to (namely, the regularizable parameters).
penalize_network and penalize_layer seem too unspecific to me; they suggest penalizing the mere existence of the network or layer.
penalize_network_weights and penalize_layer_weights seem slightly wrong to Eben, because they could be applying the penalty to regularizable parameters that are not "weights".
Conversely, penalize_network_params and penalize_layer_params don't say that they only affect regularizable parameters (and we can't make **tags part of the signature, because we already need to support **kwargs passed on to the penalty function).
apply_weight_decay sounds the most natural, but it doesn't say whether it's going to affect a single layer or a full network, and it's a misnomer once you use a different penalty function than L1 or L2 (e.g., some sparsity target, however useful that would be).

If you come up with a good name, I'm all up for a shortcut! I think names are pretty important. Bad names make any code a lot harder to work with (see iter_train, iter_valid and iter_test in the old MNIST example), and we don't want the framework to be hard to work with.

@ebenolson (Member Author)

penalize_network

A couple of other quibbles with this: we never use "network" elsewhere in the code, and if you have a complicated structure with multiple output branches, this won't actually penalize the entire network.

Are there any plans for Lasagne to have a network container class like nolearn's NeuralNet?

@ebenolson (Member Author)

we could have a shortcut l2_weight_decay(layer) that does apply_penalty(get_all_params(layer, regularizable=True), l2) - that's probably the main use case.

@f0k (Member) commented Jun 4, 2015

and if you have a complicated structure with multiple output branches, this won't actually penalize the entire network.

Oh yes, just like get_all_layers or get_output, this should accept either a single output layer or a list of all your output layers (and the shortcut would transparently support this if it uses get_all_params under the hood).

we could have a shortcut l2_weight_decay(layer) that does apply_penalty(get_all_params(layer, regularizable=True), l2)

But that doesn't make clear it's going to regularize the full network. People might inadvertently use it as 0.005 * l2_weight_decay(l_out) + 0.001 * l2_weight_decay(l_hid). (I guess that's about what happened to you?)

@ebenolson (Member Author)

You're right, that's no good.

OK, I think the best so far are penalize_network_params and penalize_layer_params.

They don't specify regularizable=True, but that will be documented. To make it more obvious, we could have a default argument tags={'regularizable': True} that users can override if needed.

And even if the user doesn't realize it, is there a common case where that would cause a bad outcome? Presumably regularizable=False is being set for a good reason.
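
As a sketch, the default-tags idea could look like this (hypothetical name and signature, following the proposal above):

def penalize_layer_params(layer, penalty, tags={'regularizable': True}, **kwargs):
    # By default only parameters tagged regularizable=True are penalized;
    # callers can override the tags if they really want something else.
    return sum(penalty(p, **kwargs) for p in layer.get_params(**tags))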

@f0k (Member) commented Jun 4, 2015

Are there any plans for Lasagne to have a network container class like nolearn's NeuralNet?

Not really, see #2 for reasons. For all purposes, a network in Lasagne can be represented as a list (not a set, it must be ordered) of its output layers. This gives a fixed order of input layers and a fixed order of parameters.

OK, I think the best so far are penalize_network_params and penalize_layer_params.
They don't specify regularizable=True, but that will be documented.

My issue is not so much about whether the functionality is clear from the documentation, but whether it's clear from just reading the code. Compare:

loss = apply_penalty(get_all_params(layer, regularizable=True), l2)  # undoubtedly clear, albeit verbose
loss = penalize_net(layer, l2)  # assume it might do weight decay of the network because of 'l2' and 'net'
loss = penalize_network_params(layer, l2)  # assume it excludes biases because of 'l2', but not sure
loss = penalize_network_weights(layer, l2)  # assume it excludes biases because of 'l2' and 'weights'

If we go for ..._params, having tags= as an argument would probably be a good thing so the shortcut lives up to its more general name, but that doesn't help with the doubts expressed above.

I think we had that before, but what about:

loss = regularize_network_params(layer, l2)  # hints both at 'params' and 'regularizable'

"regularize" is a bit generic, but on the other hand, "regularizable" is a bit generic as well and "penalty" could be any kind of regularization, not just weight decay.

My favourites are penalize_network_weights and regularize_network_params.

In any case, I'd like this to be expressed in terms of apply_penalty; it seems a useful function to have (although it doesn't do much, it allows for very readable code when the weight-decay shortcut is not what you need).

@benanne (Member) commented Jun 4, 2015

But that doesn't make clear it's going to regularize the full network. People might inadvertently use it as 0.005 * l2_weight_decay(l_out) + 0.001 * l2_weight_decay(l_hid). (I guess that's about what happened to you?)

I think this argument is being given too much weight. get_output() also does not explicitly indicate that it gets the output of the network, not the individual layer (or layers) that you pass to it. It just does that by convention. By this line of reasoning we would have to rename this function as well.

It's nice to have clear and descriptive names, but it's a trade-off, and there is a point where it just becomes too cumbersome. I think l2_weight_decay would be fine, but I'd rather have a shortcut where I can just specify l2 or l1 or whatever else we come up with as a parameter. As our design goals state:

Pragmatism: making common use cases easy is more important than supporting every possible use case out of the box.

I think pragmatism trumps clarity here. We need this shortcut; the fact that we can't come up with a descriptive name for it is not a good enough reason not to have it.

@f0k (Member) commented Jun 4, 2015

I think this argument is being given too much weight.

I was referring to #285 (comment): "I got burned a while ago by assuming regularization.l2(layer) would only penalize that layer's weights". Even a knowledgeable user can fall for this, so we shouldn't invite the devil.

We need this shortcut, just because we can't come up with a descriptive name for it is not a good enough reason not to have it.

I'm not opposed to the shortcut, but I think names matter a lot to the quality of a framework, and we should choose them wisely. We've covered the pros and cons quite well now, we just need to decide.

I'd rather have a shortcut where I can just specify l2 or l1 or whatever else we come up with as a parameter.

What do you think of our suggestions then? All of them allow this.

@benanne (Member) commented Jun 4, 2015

Even a knowledgeable user can fall for this, so we shouldn't invite the devil.

True, but we can't guard ourselves against every possible assumption that any user might make. We shouldn't take this too far. They should also read the docs ;)

What do you think of our suggestions then? All of them allow this.

regularize_network_params is the best option, I think. penalize is a word people might be less familiar with, and we should avoid coming up with 'better' names for things if there's some other term 99% of our userbase is more familiar with. Even though I guess it's technically true that you can do other things than regularize with this function, the main use case will be regularization. It also neatly hints at the fact that it applies only to parameters with regularizable=True (at least by default).

@f0k (Member) commented Jun 4, 2015

True, but we can't guard ourselves against every possible assumption that any user might make. We shouldn't take this too far.

Sure, I didn't want to take it further, layer vs. network nicely solves this ambiguity. I just wanted to avoid going back to something like l2_weight_decay!

regularize_network_params is the best option, I think.

Great. Let's go for this (plus regularize_layer_params for a single layer), if Eben is fine with it, too.
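
For concreteness, a sketch of how the two shortcuts could be expressed in terms of apply_penalty as defined earlier in the thread (signatures are illustrative; the merged implementation may differ in detail):

from lasagne.layers import get_all_params

def regularize_layer_params(layer, penalty, **kwargs):
    # Penalize only the regularizable parameters of the given layer(s),
    # without walking the graph below them.
    layers = layer if isinstance(layer, (list, tuple)) else [layer]
    params = [p for l in layers for p in l.get_params(regularizable=True)]
    return apply_penalty(params, penalty, **kwargs)

def regularize_network_params(layer, penalty, **kwargs):
    # Penalize the regularizable parameters of the whole network ending in
    # the given output layer(s), via get_all_params.
    return apply_penalty(get_all_params(layer, regularizable=True),
                         penalty, **kwargs)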

@ebenolson (Member Author)

regularization.regularize_* seems a bit redundant to me, but I'm ok with it. I'll try to rework this later today.

@ebenolson (Member Author)

Reworked; will do docs later.

if len(params) == 0:
    return 0
else:
    return sum(penalty(p, **kwargs) for p in params)
Member:

Can you remove the len check? sum returns zero for an empty sequence anyway, and using len will require params to have a length -- it's more flexible if we allow generators as well.
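
A minimal sketch of the simplification being suggested here (the single-tensor fallback added elsewhere in the diff is omitted):

def apply_penalty(params, penalty, **kwargs):
    # sum() of an empty generator is already 0, so no length check is
    # needed, and generator inputs work just as well as lists.
    return sum(penalty(p, **kwargs) for p in params)

assert apply_penalty([], lambda p: p ** 2) == 0  # empty input gives 0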

Member Author:

Fixed this.

@f0k (Member) commented Jun 5, 2015

Thanks, looks quite good! What about changing param and params to tensor and tensors in l1, l2 and apply_penalty? Those can just as well be used to regularize the code of an autoencoder, for example. Only the regularize_* functions are specific to layer params.
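
For example, the same penalty functions could be applied to an autoencoder's hidden code rather than to layer parameters. A sketch, assuming l_hidden and l_out are existing Lasagne layers, target is a Theano tensor, and get_output accepts a list of layers as discussed above (#182):

import theano.tensor as T
from lasagne.layers import get_output
from lasagne.regularization import l1

# L1 sparsity penalty on the hidden code of an autoencoder: the penalty is
# applied to a tensor expression, not to any layer parameters.
hidden_code, reconstruction = get_output([l_hidden, l_out])
loss = T.mean((reconstruction - target) ** 2) + 0.01 * l1(hidden_code)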

try:
    return sum(penalty(x, **kwargs) for x in tensor_or_tensors)
except (TypeError, ValueError):
    return penalty(tensor_or_tensors, **kwargs)
Member:

What case is the ValueError for? Isn't the TypeError enough to distinguish between iterable and non-iterable tensor_or_tensors?

Member Author:

Theano raises ValueError when attempting to iterate over a vector.

Member:

Thanks! 👍

@ebenolson (Member Author)

I think all the comments have been addressed. If there aren't any more, I'll squash this to a single commit.

@benanne (Member) commented Jun 8, 2015

looks good to me, go for it :)

The only thing worth mentioning is the use of isinstance to distinguish a layer instance from a list of layer instances; I think @f0k's way of using try/except for this is better because it's more in line with the duck-typing approach.

We have a bunch of isinstance checks elsewhere in the library though, so there's no point in worrying about this specific one right now, I think. We can fix this globally when/if it becomes a problem.

@benanne (Member) commented Jun 8, 2015

If you're done, can you remove the [WIP] tag? I'm a bit uncomfortable merging a PR like that :)

@ebenolson changed the title from "[WIP] Regularization" to "Regularization" on Jun 8, 2015
@ebenolson (Member Author)

Done. Not sure why Coveralls is still showing yellow; it seems like the check passed.

@benanne (Member) commented Jun 8, 2015

Yeah, I've seen it on other PRs as well, I think there's something wrong with it today.

@benanne (Member) commented Jun 8, 2015

Hmm, I just noticed one thing that's missing: the rst file to make the documentation show up. It's a bit unfortunate because I was hoping to refer to the docs in the API change announcement on the mailing list. I don't know how many people are using the old regularization module though, not too many I imagine. So it probably isn't too big of an issue.

I'll merge this when I have some time to write the announcement as well.

@benanne (Member) commented Jun 9, 2015

Merging.

@benanne added a commit that referenced this pull request on Jun 9, 2015
@benanne merged commit 3518f27 into Lasagne:master on Jun 9, 2015