Add recurrent layers #294

Merged

merged 116 commits on Aug 6, 2015

10 participants
@craffel
Member

craffel commented Jun 11, 2015

Addresses #17. Includes layers for LSTM, GRU, "vanilla" (dense) RNN, and an RNN with arbitrary connection patterns. Thanks to @skaae for the tests and the GRU layer (+ more), and @JackKelly for some comment fixes and additional functionality. Some misc comments that may warrant addressing:

  • We added examples to the examples directory. Do you want these? The usage of the recurrent layers is different enough that an example is probably warranted, but there is also an example in the top docstring of the recurrent submodule, which covers the absolute basics of how to reshape things to make it work. Any examples should probably be supplemented with more fleshed-out material, like @skaae's Penn Treebank example and my speech recognition example.
  • We didn't resolve these two issues: craffel#30 and craffel#15, which are essentially the same thing - i.e., the ability to use the RNNs for sequence prediction. It's probably a feature we want to have, as it's a very hot topic at the moment.
  • The LSTM layer (and, to some extent, the GRU layer) has a LOT of __init__ arguments because there are so many things to initialize. Most of the time people initialize all weight matrices in the same way, although I'm sure that's not always true; it's actually quite common to initialize the forget gate bias vector differently than the rest.
  • The peepholes kwarg of the LSTM layer decides whether to include connections from the cell to the gates; sometimes people use these and sometimes they don't, so it's important that we include this functionality. However, people also sometimes omit other connections, and we don't have functionality for that. What I'd really like is something where, if one of the arguments is passed as None, the connection is not used, but this would make the code pretty ugly - we'd essentially have to define a new dot method which does nothing when one of the arguments is None, or something along those lines (see the sketch after this list). So, in short, I'd rather not address this now.
  • The grad_clipping argument in all layers clips the gradients, which is very common for RNNs. I think this is pretty common for other network architectures too, so what I'd really like is for it to be moved out to a global function in updates, or something similar. But this isn't something I'm familiar enough with to know how feasible it is; @skaae would be better placed to comment on that.
  • hid_init (and cell_init) are allowed to be TensorVariables; see our discussion here: craffel#27 (comment). To summarize, I'd rather have them be variable-size shared variables which are added with add_param, but we can't because of the way we allow for variable batch size. Any ideas on that discussion are appreciated.
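
For concreteness, here is a hypothetical sketch of the None-connection idea from the peepholes point above (the maybe_dot helper is purely illustrative; nothing like it exists in this PR):

import theano.tensor as T

def maybe_dot(x, W):
    """Return T.dot(x, W), or 0 if the connection is disabled (W is None)."""
    return 0 if W is None else T.dot(x, W)

# e.g. a peephole contribution that simply vanishes when its weight is passed as None:
# ingate_preact += maybe_dot(cell_previous, W_cell_to_ingate)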

Ok, that's about it, let me know what you think! Would be great to get this merged before the first release.

@benanne

Member

benanne commented Jun 11, 2015

Ooooh it's happening :D I'm excited! I will try to look at this in detail and reply to each of your points tomorrow.

@skaae

Member

skaae commented Jun 11, 2015

The grad_clipping argument is meant to be implemented as in http://arxiv.org/pdf/1308.0850.pdf; it is described on page 6.

From page 6:

...all the experiments in this paper clipped the derivative of the loss with respect to the 
network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) 
to lie within a predefined range

I think that is what grad_clipping does now, but I would like to know if you think it is correct.
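
For reference, a minimal Theano sketch of that kind of clipping, assuming theano.gradient.grad_clip (which leaves the forward value unchanged and only clips the gradient flowing back through the wrapped expression):

import theano
import theano.tensor as T

x = T.matrix('x')
clipped = theano.gradient.grad_clip(x, -1.0, 1.0)  # gradient w.r.t. x limited to [-1, 1]
cost = T.sum(clipped ** 2)
g = T.grad(cost, x)  # the clipping applies here, not to the forward computation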

We can implement decoders using GRU units with the current structure of Lasagne, but we cannot do it for the LSTM layers because they output both a hidden state and a cell value.

Nice work Colin :)

@benanne

Member

benanne commented Jun 11, 2015

Travis isn't too happy about this so far; I can spot two problems (but there may be more):

  • This PR makes use of Theano's test value functionality and for some reason this makes a fairly random subset of tests error out. I don't really see the pattern yet, but I'm sure it's not too hard to figure out
  • The added examples do not follow the format of the ones we already have, and our example testing code is trying to run them as tests and failing as a result
@craffel

Member

craffel commented Jun 11, 2015

This PR makes use of Theano's test value functionality and for some reason this makes a fairly random subset of tests error out. I don't really see the pattern yet, but I'm sure it's not too hard to figure out

Hmm, any intuition as to why these tests would pass on my system but not on Travis? Looking at the error, why would Theano need a default value at all?

The added examples do not follow the format of the ones we already have, and our example testing code is trying to run them as tests and failing as a result

I can try to look at the tests for the examples and make our examples match, or I can try to change the tests for the examples so that they cover our examples, or I can nuke the recurrent examples. Any thoughts?

@skaae

Member

skaae commented Jun 11, 2015

I think I added the test_values to the lstm.py example. You can remove those and the theano.config.compute_test_value = 'raise' (line 5); they are only there to illustrate how to debug.
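
For anyone unfamiliar with that mechanism, a minimal sketch of what those lines do, and why the setting leaks into other tests run in the same process:

import numpy as np
import theano
import theano.tensor as T

# Process-global: from here on, *every* symbolic variable needs a test value,
# which is why unrelated tests start erroring out once this example is imported.
theano.config.compute_test_value = 'raise'

x = T.matrix('x')
x.tag.test_value = np.zeros((3, 4), dtype=theano.config.floatX)
y = x.sum()  # the graph is evaluated on the test value as it is built, so errors surface here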

@benanne

Member

benanne commented Jun 11, 2015

I can try to look at the tests for the examples and make our examples match, or I can try to change the tests for the examples so that they cover our examples, or I can nuke the recurrent examples. Any thoughts?

The examples will surely be useful to have, so I don't think we should nuke them.

There's also #215, which we should take care of soon. This will change the format of the existing examples somewhat (but they will still have a main() method to keep them testable). So changing the recurrent examples to be more like the other ones right now might be a wasted effort as well.

lasagne/layers/recurrent.py
+
+
+class GRULayer(Layer):
+ r"""Gated Recurrent Unit (GRU) Layer

@skaae

skaae Jun 12, 2015

Member

What is this 'r'?

@craffel

craffel Jun 12, 2015

Member

It marks the docstring as a raw string, which tells Python not to interpret backslash escape sequences, e.g. the \s in \sigma. http://stackoverflow.com/a/4780104
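
A quick illustration of the difference:

print(len("\nabla"))   # 5: "\n" is interpreted as a newline character
print(len(r"\nabla"))  # 6: the raw-string prefix keeps the backslash literally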

lasagne/layers/recurrent.py
+ def get_output_shape_for(self, input_shape):
+ return input_shape[0], input_shape[1], self.num_units
+
+ def get_output_for(self, input, mask=None, *args, **kwargs):

@skaae

skaae Jun 12, 2015

Member

Remove *args

@craffel

craffel Jun 12, 2015

Member

Fixed in b96c57b

lasagne/layers/recurrent.py
+ gradient_steps=-1,
+ grad_clipping=False,
+ unroll_scan=False,
+ precompute_input=True):

@skaae

skaae Jun 12, 2015

Member

All recurrent layers seem to be missing **kwargs in __init__.

@craffel

craffel Jun 12, 2015

Member

You're right, will add.

@craffel

craffel Jun 12, 2015

Member

Fixed in 694094f

@craffel

Member

craffel commented Jun 12, 2015

I'm really confused by the Travis errors of the kind

ValueError: Cannot compute test value: input 0 (<TensorType(float64, 4D)>) of Op Shape(<TensorType(float64, 4D)>) missing default value

They don't seem to be isolated to the new recurrent tests. Anyone know what's up?

@benanne

Member

benanne commented Jun 12, 2015

As I said, it's something to do with Theano's test_value machinery, which I'm unfamiliar with, so I can't pinpoint it exactly. It does look like it's "leaking" out to other tests that are unrelated to the recurrent stuff.

It seems like the lstm.py example uses test values and changes Theano settings related to test values. Since the current example testing code tries to load up all examples, this code is run and the settings are changed from then on.

@craffel

Member

craffel commented Jun 12, 2015

Ah, I see what you're getting at. I will remove the test value stuff from the lstm.py example... although I'm currently thinking the recurrent examples should just be moved to a recipe, because of the way the current examples are used for testing and are in the midst of a revamp.

@craffel

Member

craffel commented Jun 12, 2015

Looks like all tests are passing now, except the example tests, because the recurrent examples have no main method. I propose we either

  • Move the examples out into a separate recipe (I like this idea more, because I think they could be fleshed out into something more interesting)
  • Add a main method to all the examples

Any preference?

@skaae

Member

skaae commented Jun 12, 2015

I think we should have at least a simple recurrent example. We could run it on your delay data?
Let me know if I should modify craffel#41 to be a recipe.

@craffel

Member

craffel commented Jun 12, 2015

I think we should have at least a simple recurrent example.

I think it depends on what we want out of the examples. If they're just intended as a simple use case to get people started with the library, without going into much depth, then I don't think there needs to be one, and putting some nice recurrent recipes in is fine (including your Penn Treebank example, my noisy speech recognition benchmark example, and the delay task/other simple tasks example). But the usage of the recurrent layers is a bit different than the dense layers because of the requisite reshaping, so I'd be happy to include a simple example too. Anyone else have input?

Let me know if I should modify craffel#41 to be a recipe.

I think you should either way; it makes more sense there to me.

@benanne

Member

benanne commented Jun 12, 2015

I reckon it would be good to have a simple recurrent example bundled with the library too - one that's tested just like the existing examples, so it can double as an integration test :)

@craffel

Member

craffel commented Jun 12, 2015

Ok, I'll boil them all down to a single example with a main function then!

@dnouri

Contributor

dnouri commented Jun 12, 2015

I reckon it would be good to have a simple recurrent example bundled with the library too - one that's tested just like the existing examples, so it can double as an integration test :)

+1

lasagne/layers/recurrent.py
+ nonlinearity_updategate=nonlinearities.sigmoid,
+ W_in_to_hidden_update=init.Normal(0.1),
+ W_hid_to_hidden_update=init.Normal(0.1),
+ b_hidden_update=init.Constant(0.),

@skaae

skaae Jun 13, 2015

Member

This should probably be init.Normal(0.1)?

I think we should change the init of all biases to zero?

@craffel

craffel Jun 15, 2015

Member

You're right - I didn't notice you had used Normals here; I'll change them to Constants. Issue here: #53

@craffel referenced this pull request in craffel/nntools on Jun 13, 2015

Closed

Revamp examples #52

lasagne/utils.py
+ out_ = fn(*step_input)
+ if not isinstance(out_, (list, tuple)):
+ out_ = [out_]
+ output.append(out_)

@skaae

skaae Jun 14, 2015

Member

We need an additional isinstance check. The value returned from step can be either a TensorVariable, a list, or a tuple. The function crashes for tuples because it'll try to add prev_vals (a tuple) to a list.

Replacing lines 348-349 with

            if isinstance(out_, T.TensorVariable):
                out_ = [out_]
            if isinstance(out_, tuple):
                out_ = list(out_)

solves the problem. This is not an issue for the current code, because all the step functions currently return TensorVariables or lists.
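
For clarity, the proposed check as a self-contained helper (the name is illustrative, not the actual utils code):

import theano.tensor as T

def normalize_step_output(out_):
    """Coerce a step function's return value into a list of output tensors."""
    if isinstance(out_, T.TensorVariable):
        out_ = [out_]
    if isinstance(out_, tuple):
        out_ = list(out_)
    return out_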

@craffel

craffel Jun 15, 2015

Member

Do you want to create a PR for this?

@skaae

skaae Jun 15, 2015

Member

I can create a PR. If it's easier for you to copy-paste the lines, that is fine as well.

@f0k

f0k Jul 21, 2015

Member

Is this solved? I'm wondering because it doesn't say I'm looking at an outdated diff, but maybe the comment was not close enough to the line that needed changing?

@craffel

craffel Jul 21, 2015

Member

No, see craffel#60. I was waiting for @skaae to do it, but I can do it if it's really as simple as copy-pasting.

@superbock

superbock commented Jun 15, 2015

Hi,

first of all I'd like to say a big thanks for the functionality (and Lasagne in general of course!). If it would have existed a year ago it would have saved me a lot of time.

I'd like to add a few comments/suggestions about this PR from an RNN/LSTM user perspective. But before I do so, I should say that I haven't used Lasagne or this RNN implementation, so some comments may be totally inappropriate or stupid...

  • I think that the need to reshape the output of a RecurrentLayer before being able to connect it to another layer is a bit counterintuitive. It would be nice if the layer handled this itself.
  • Although I see the point of having a CustomRecurrentLayer, I don't like it too much. How about using just the RecurrentLayer and adding options to choose the layer class to be used for input_to_hidden and hidden_to_hidden, with appropriate defaults (e.g. DenseLayer, to get the fully connected recurrent layer which is probably the most common)?
  • Instead of having the backwards flag in all layers, I'd add a BidirectionalLayer which handles the flipping/reversing of the input and output of the backward layer and also the stacking/summing/pooling of the outputs. Although I used bidirectional networks only with RNNs, this functionality is not limited to RNNs.
  • A possible solution for the numerous __init__ arguments of the LSTMLayer can be that you handle the gates and the cell as separate classes (or instances thereof). This would remove a lot of duplicated code. It may also make the whole thing more readable.
  • I'd put the unroll_scan function in the very same file, since it is related pretty closely (solely?) to recurrent networks.

@JackKelly

Contributor

JackKelly commented Jun 15, 2015

I think that the need to reshape the output of a RecurrentLayer before being able to connect it to another layer is a bit counterintuitive. It would be nice if the layer handled this itself.

When I first started using @craffel's fork, I thought the same as you (i.e. that the layer should do all the necessary reshaping). But when I started playing with things like doing 1D convolutions over the time axis, I started to like the fact that you have to explicitly reshape, because it makes it easier (at least in my head) to track what's going on in the network.

I'd add a BidirectionalLayer

I agree. It's very easy to add a BidirectionalLayer. I've been using this in my own experiments:

from lasagne.layers import ElemwiseSumLayer

def BidirectionalLayer(layer_class, *args, **kwargs):
    # Build a forward and a backward copy of the layer and sum them elementwise.
    kwargs.pop('backwards', None)
    l_fwd = layer_class(*args, backwards=False, **kwargs)
    l_bck = layer_class(*args, backwards=True, **kwargs)
    return ElemwiseSumLayer([l_fwd, l_bck])

Then, you can easily define bidirectional RNN and bidirectional LSTM layers:

from lasagne.layers import LSTMLayer, RecurrentLayer

def BLSTMLayer(*args, **kwargs):
    """Configures forward and backwards LSTM layers to create a
    bidirectional LSTM.
    """
    return BidirectionalLayer(LSTMLayer, *args, **kwargs)


def BidirectionalRecurrentLayer(*args, **kwargs):
    """Configures forward and backwards RecurrentLayers to create a
    bidirectional recurrent layer."""
    return BidirectionalLayer(RecurrentLayer, *args, **kwargs)
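
For completeness, a hypothetical usage sketch of these helpers (names and sizes are illustrative; inputs follow the (batch, seq_len, features) layout used by the recurrent layers):

from lasagne.layers import InputLayer

l_in = InputLayer(shape=(10, 20, 5))      # (batch, seq_len, features)
l_blstm = BLSTMLayer(l_in, num_units=8)   # forward + backward LSTM, summed elementwise
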
@skaae

Member

skaae commented Jun 15, 2015

Here are a few comments:

Reshape: I think the general idea with Lasagne is that the settings are not hidden from the user. Secondly, if we were to implement such a feature, I guess a layer would need to know its parent, which would complicate things? But maybe the feature would fit into Lasagne + nolearn?

CustomRecurrentLayer: We discussed this at length in one of the issues :) Probably somewhere in #17 or in Colin's repository. To me the only downside of CustomRecurrentLayer is that it does not do its calculations with stacked matrices.

Backwards argument: You could write a small function that sets up a bidirectional layer? We did have a bidirectional layer in an earlier version of the code, but it was removed because we thought the current setup is simpler.

Unroll scan: I think that is a good idea.

@benanne

Member

benanne commented Jun 15, 2015

Regarding unroll_scan, I don't think moving this into the recurrent submodule makes sense, because it's very likely that it has other use cases that do not involve recurrence. scan is often used to implement RNNs, but it is much more general than that. If anything I would put it under theano_extensions, especially because it mimics the scan interface.

@craffel

Member

craffel commented Jun 15, 2015

I think that the need to reshape the output of a RecurrentLayer before being able to connect it to another layer is a bit counterintuitive. It would be nice if the layer handled this itself.

There are a few reasons for this; apart from @JackKelly's reason above, it also avoids some redundant reshapes if you are stacking recurrent layers, and it also makes the whole system more flexible with respect to how many "sample" dimensions you have vs. "feature" dimensions. Of course, it's also very simple to write a function which does this automatically for you in your own code, if you prefer.
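
For concreteness, a minimal sketch of the explicit reshaping pattern being described (assuming the usual (batch, seq_len, features) layout; the layer names and sizes here are illustrative):

from lasagne.layers import InputLayer, RecurrentLayer, ReshapeLayer, DenseLayer

N_BATCH, SEQ_LEN, N_FEATURES, N_HIDDEN = 10, 20, 5, 8

l_in = InputLayer(shape=(N_BATCH, SEQ_LEN, N_FEATURES))
l_rec = RecurrentLayer(l_in, num_units=N_HIDDEN)
# Collapse (batch, seq_len, hidden) to (batch * seq_len, hidden) so a DenseLayer
# is applied independently to every time step...
l_flat = ReshapeLayer(l_rec, (-1, N_HIDDEN))
l_dense = DenseLayer(l_flat, num_units=1)
# ...then restore the separate batch and time axes.
l_out = ReshapeLayer(l_dense, (N_BATCH, SEQ_LEN, 1))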

Although I see the point of having a CustomRecurrentLayer, I don't like it too much. How about using just the RecurrentLayer and adding options to choose the layer class to be used for input_to_hidden and hidden_to_hidden, with appropriate defaults (e.g. DenseLayer, to get the fully connected recurrent layer which is probably the most common)?

I'm not really seeing how this is substantially different from what we have, aside from the fact that there are currently two separate classes. Using the CustomRecurrentLayer you can use arbitrary layer structures for the input-to-hidden and hidden-to-hidden connections (which is more flexible than an option "to choose the layer class to be used for input_to_hidden and hidden_to_hidden"), and using the RecurrentLayer you have "appropriate defaults" for the connections; that's exactly the point behind the way the code is now.

Instead of having the backwards flag in all layers, I'd add a BidirectionalLayer which handles the flipping/reversing of the input and output of the backward layer and also the stacking/summing/pooling of the outputs. Although I used bidirectional networks only with RNNs, this functionality is not limited to RNNs.

As noted above, we did initially do it this way, but there was a lot of discussion to come to the format it's in now. First, note that when backwards=True, the recurrent layers themselves "handle the flipping/reversing of the input and output". Second, one nice thing about not using a BidirectionalLayer is that it allows the user to decide how they want the layer outputs to be combined - concatenation? Element-wise sum? Processed separately? Etc. As I recall, there was also a strange Theano optimization bug which made things substantially slower with the BidirectionalLayer I wrote, although this may have been an artifact of #104 (it was many, many months ago now...). If, for your own code, it would be convenient to have the same bidirectional structure in many places, it's straightforward to hand-roll your own, as @JackKelly pointed out.

A possible solution for the numerous __init__ arguments of the LSTMLayer can be that you handle the gates and the cell as separate classes (or instances thereof). This would remove a lot of duplicated code. It may also make the whole thing more readable.

I'm having trouble picturing what you mean by this, can you be more specific?

I'd put the unroll_scan function in the very same file, since it is related pretty closely (solely?) to recurrent networks.

I don't care much either way, I think it's up to @benanne and @f0k.

@lim0606

lim0606 commented Jun 15, 2015

Hi, everyone

First of all, thank you so much for this cool functionality.

I have a question about using arbitrary connections in this recurrent functionality.

I’d like to use this recurrent functionality to implement DRAW (http://arxiv.org/abs/1502.04623)
. I know there is a awesome DRAW example for lasagne written by @skaae (https://github.com/skaae/lasagne-draw), but I want to implement DRAW with combinations of Lasagne feedforward layers and recurrent ones by explicitly representing connectivities; thus, someone can easily make variants of it.

As far as I understand from the original issue related to Lasagne recurrence (#17) and this thread, the current PR seems to support arbitrary connections. However, I'm not sure whether the arbitrary connections in this PR include the case where a connection goes from one hidden layer at t-1 to a different hidden layer at t (if my wording is unclear, please see the following figures).

This is possible:

    y (t-1)    →     y (t)     → ...
      ↑                ↑
hidden_n (t-1) → hidden_n (t)  → ...
     …                ...
hidden_3 (t-1) → hidden_3 (t)  → ...
      ↑                ↑
hidden_2 (t-1) → hidden_2 (t)  → ...
      ↑                ↑
hidden_1 (t-1) → hidden_1 (t)  → ...
      ↑                ↑
input(t-1)     →    input (t)  → ...

Is this possible?

    y (t-1)    →     y (t)     → ...
      ↑                ↑
hidden_n (t-1) → hidden_n (t)  → ...
     …                ...
hidden_3 (t-1) → hidden_3 (t)  → ...
      ↑        ↘      ↑
hidden_2 (t-1) → hidden_2 (t)  → ...
      ↑        ↘      ↑
hidden_1 (t-1) → hidden_1 (t)  → ...
      ↑               ↑
input(t-1)     →    input (t)  → ...

Can anyone help me represent this kind of arbitrary connection from t-1 to t with this recurrent functionality?

Best regards,

Jaehyun

@superbock

superbock commented Jun 15, 2015

I was not fully aware of the full "history", but after reading issue #17 things are much clearer now. Sorry for the noise.

As for the CustomRecurrentLayer thing, I don't have a problem with the way it is. It's just that I didn't like it too much the very first time, exactly like @benanne said in #17 (comment) 😏. Maybe renaming CustomRecurrentLayer to RecurrentLayer and RecurrentLayer to FullyConnectedRecurrentLayer would be more expressive, though?

A possible solution for the numerous __init__ arguments of the LSTMLayer can be that you handle the gates and the cell as separate classes (or instances thereof). This would remove a lot of duplicated code. It may also make the whole thing more readable.

I'm having trouble picturing what you mean by this, can you be more specific?

You could e.g. do something like this:

class Cell(object):
    def __init__(self, W_in=init.Normal(0.1),
                 W_hid=init.Normal(0.1),
                 b=init.Normal(0.1),
                 nonlinearity=nonlinearities.tanh):
        # just a container for the initializers and the nonlinearity
        self.W_in, self.W_hid = W_in, W_hid
        self.b, self.nonlinearity = b, nonlinearity

class Gate(object):
    def __init__(self, W_in=init.Normal(0.1),
                 W_hid=init.Normal(0.1),
                 W_cell=init.Normal(0.1),
                 b=init.Normal(0.1),
                 nonlinearity=nonlinearities.sigmoid):
        # as above, plus the cell-to-gate (peephole) initializer
        self.W_in, self.W_hid, self.W_cell = W_in, W_hid, W_cell
        self.b, self.nonlinearity = b, nonlinearity

class LSTMLayer(Layer):
    def __init__(self, incoming, num_units,
                 ingate=Gate(),
                 forgetgate=Gate(),
                 cell=Cell(),
                 outgate=Gate(), ...):

Then there would be way fewer arguments in the __init__ method of the LSTMLayer. Cell could of course just be a Gate with a different nonlinearity and W_cell set to None (or Gate could extend the Cell class). No peepholes could be realised easily by setting W_cell to None.


@craffel

Member

craffel commented Jun 15, 2015

@lim0606 wrote

is this possible?

    y (t-1)    →     y (t)     → ...
      ↑                ↑
hidden_n (t-1) → hidden_n (t)  → ...
     …                ...
hidden_3 (t-1) → hidden_3 (t)  → ...
      ↑        ↘      ↑ 
hidden_2 (t-1) → hidden_2 (t)  → ...
      ↑        ↘      ↑
hidden_1 (t-1) → hidden_1 (t)  → ...
      ↑               ↑
input(t-1)     →    input (t)  → ...

Hm, not that I can think of off the top of my head; @skaae do you have any ideas here?

@superbock wrote

As for the CustomRecurrentLayer thing, I don't have a problem with the way it is. It's just I didn't like it too much the very first time, exactly like @benanne said in #17 (comment) 😏.

Right, note that it was something I initially suggested; please read above and below that to get the full context as to why it ended up the way it did.

Maybe renaming the CustomRecurrentLayer to RecurrentLayer and RecurrentLayer to FullyConnectedRecurrentLayer would be more expressive though?

The naming convention is based on @f0k's suggestion, I think: #17 (comment)

you could e.g. do something like this:

That could work, although in order to stack the matrices for dot product efficiency as we do, the Cell/Gate classes would simply hold on to the parameters; they couldn't have functions to actually compute anything. The main disadvantage here is that the convention across Lasagne is that the layers themselves take parameter initializers as keyword args, with the obvious exception of CustomRecurrentLayer.

Cell could be of course just a Gate with a different nonlinearity and W_cell set to None (or Gate extending the Cell class). No peepholes could be realised easily by setting W_cell to None.

Well, that would rely on some logic that doesn't exist yet, where if a parameter is None, no dot product is computed. I proposed this here: craffel#32 (comment) and at the top of this PR, because I agree that it would make arbitrary connections pretty slick, but it would make the code pretty messy in places. For example, it would make stacking the W_in_* matrices and using slice much messier, at least at first glance.
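
For context, a rough sketch of the stacking/slicing trick being referred to (illustrative only; the actual LSTMLayer code differs in its details):

import numpy as np
import theano
import theano.tensor as T

num_inputs, num_units = 5, 8

def rand(shape):
    return np.random.randn(*shape).astype(theano.config.floatX)

# Concatenate the four input-to-gate matrices column-wise so that a single dot
# product computes all gate pre-activations at once.
W_in_stacked = theano.shared(np.concatenate(
    [rand((num_inputs, num_units)) for _ in range(4)], axis=1))

x = T.matrix('x')                 # (batch, num_inputs)
preact = T.dot(x, W_in_stacked)   # (batch, 4 * num_units)

def slice_w(n):
    # Pre-activation of the n-th gate; a W_* passed as None would break this
    # fixed layout, which is the messiness mentioned above.
    return preact[:, n * num_units:(n + 1) * num_units]

ingate, forgetgate, cellinput, outgate = [slice_w(n) for n in range(4)]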

@craffel

Member

craffel commented Jun 15, 2015

Ok, I merged all of the examples into a single one:
https://github.com/craffel/nntools/blob/recurrent/examples/recurrent.py
It now uses a more standard task (the "addition task" from the original LSTM paper). It tries to use most of the recurrent-specific functionality, which should be helpful as an actual example. The hyperparameters could probably be tuned some, if anyone wants to do that at some point. I don't see a real reason to have separate LSTM and GRU examples, although a CustomRecurrentLayer example could be useful... Now let's see if Travis likes it. Edit: damn it, I forgot to give main an argument specifying the number of iterations. I'll fix that, but I guess it'll be a few hours until that commit gets tested. I guess now would be a good time to ask if anyone knows why py.test skips all the example tests on my system:

lasagne/tests/test_examples.py::test_example[mnist_conv_cc] SKIPPED
lasagne/tests/test_examples.py::test_example[mnist] SKIPPED
lasagne/tests/test_examples.py::test_example[recurrent] SKIPPED
lasagne/tests/test_examples.py::test_example[mnist_conv] SKIPPED
lasagne/tests/test_examples.py::test_example[mnist_conv_dnn] SKIPPED
@skaae

Member

skaae commented Jun 15, 2015

@lim0606: I don't think you can implement that without writing your own scan. Maybe you could create an LSTM layer that only does a single step - a feed-forward LSTM :) Then you could just stack and wire those up inside your scan?

@benanne

Member

benanne commented Jun 15, 2015

I guess now would be a good time to ask if anyone knows why py.test skips all the example tests on my system:

Make sure to run py.test with the --runslow option, else they are skipped. This bit of info should be added to the development docs.

@craffel

Member

craffel commented Jun 15, 2015

Make sure to run py.test with the --runslow option, else they are skipped. This bit of info should be added to the development docs.

Got it - I haven't used py.test much, so I figured it was some command-line option I didn't know about. Looks like the test passed.

@lim0606

lim0606 commented Jun 16, 2015

@craffel @skaae Thank you for comments!

Jaehyun

@f0k

Member

f0k commented Jun 16, 2015

The peepholes kwarg of the LSTM layer decides whether to include connections from the cell to the gates; [...] What I'd really like is something where if one of the arguments is passed as None, then the connection is not used, but this would make the code pretty ugly - we'd essentially have to define a new dot method which does nothing when one of the arguments is None or something.

I haven't taken a look at the code yet, but are you saying that allowing one of the constructor arguments to be set to None would make the peepholes argument obsolete, plus make the class a lot more flexible? If so, I'd be in favor of not introducing the peepholes argument in the first place (we cannot easily remove it after the release). Setting a parameter to None is in line with how bias-less layers are implemented, and I think your solution of a maybe_dot helper method would be elegant enough.
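
For comparison, the existing bias-less convention mentioned here, as a minimal sketch using the standard DenseLayer API:

from lasagne.layers import InputLayer, DenseLayer

l_in = InputLayer(shape=(None, 100))
l_dense = DenseLayer(l_in, num_units=50, b=None)  # no bias parameter is created or added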

For example, it would make stacking the W_in_* matrices and using slice much more messy, at least at first glance.

Ah, I see... so it's more involved than adding a maybe_dot function.

I was not fully aware of the full "history", but after reading issue #17 things are much clearer now. Sorry for the noise.

Wow, you're one of the few chosen people that know the full history of #17 then :) No need to be sorry, I think that's the longest thread in all of Lasagne and I wouldn't have expected anybody to read it. We might want to document the design choices you wondered about in our FAQ (#298).

Maybe renaming the CustomRecurrentLayer to RecurrentLayer and RecurrentLayer to FullyConnectedRecurrentLayer would be more expressive though?

That might be a bit long, and FullyConnectedRecurrentLayer only tells part of the story -- note that CustomRecurrentLayer also allows to have in->hid or hid->hid be a stack of two or more fully connected layers. What about RecurrentLayer and VanillaRecurrentLayer? The latter would be the one that has single dense layers for in->hid and hid->hid and uses the stacked weight matrices trick to be faster. In any case, I'd use RecurrentLayer for the one that we expect to be used the most. I don't know which one that would be, I leave that up for the RNN experts here to decide.

The main disadvantage here is that the convention across Lasagne is that the layers themselves take parameter initializers as keyword args

I wouldn't mind much about breaking with that convention and group parameters if that makes things easier. I didn't look at the code yet, though.

/edit: I'll have a closer look tomorrow or the day after. Thanks for the joint effort! Merging this before the first release is not really necessary because it doesn't introduce any backward-incompatible changes, but I agree that it would be nice to have it in! We just need to be sure we have a good interface, because it's more difficult to change once it's released.

@craffel

craffel Jun 16, 2015

Member

Setting a parameter to None is in line with how bias-less layers are implemented, and I think your solution of a maybe_dot helper method would be elegant enough.
...
Ah, I see... so it's more involved than adding a maybe_dot function.

Yes, that's what I was initially thinking, and that's why I like it, but yeah, I can't think of a simple way off the top of my head to make it elegant given the matrix stacking we do for efficiency. Doesn't mean it couldn't be done.
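For context, the efficiency trick in question looks roughly like this (simplified sketch with made-up names; the real layer stacks the hidden-to-hidden weights as well):

```python
import theano.tensor as T

num_units = 8  # made-up size, just for the sketch

def slice_w(x, n):
    # Recover gate n's pre-activation from the stacked result.
    return x[:, n * num_units:(n + 1) * num_units]

def gate_preactivations(input, W_in_to_ingate, W_in_to_forgetgate,
                        W_in_to_cell, W_in_to_outgate, b_stacked):
    # One big dot product replaces four small ones.  If one of the
    # W_in_to_* matrices were None, the concatenation and the fixed
    # slicing offsets would no longer line up, which is why a maybe_dot
    # helper alone doesn't cover this case.
    W_in_stacked = T.concatenate([W_in_to_ingate, W_in_to_forgetgate,
                                  W_in_to_cell, W_in_to_outgate], axis=1)
    preactivation = T.dot(input, W_in_stacked) + b_stacked
    return [slice_w(preactivation, n) for n in range(4)]
```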

We might want to document the design choices you wondered about in our FAQ (#298).

Not a bad idea, although condensing down all the information in #17, the issues and PRs in https://github.com/craffel/nntools, and even this PR sounds like a massive (but probably worthwhile) undertaking.

What about RecurrentLayer and VanillaRecurrentLayer?

This was the plan at some point, I think, but

In any case, I'd use RecurrentLayer for the one that we expect to be used the most.

I would expect the "vanilla", non-customizable recurrent layer to be used more. It's still the case that there are relatively few papers which have hidden-to-hidden/input-to-hidden connections which are not just dense (excepting LSTM, GRU, etc, of course), even though it's a thing people have tried.

@benanne

benanne Jun 17, 2015

Member

I would expect the "vanilla", non-customizable recurrent layer to be used more. It's still the case that there are relatively few papers which have hidden-to-hidden/input-to-hidden connections which are not just dense (excepting LSTM, GRU, etc, of course), even though it's a thing people have tried.

I agree, my gut feeling is that most people will just want the vanilla one, so I'm in favour of keeping the current naming scheme.

@craffel

craffel Jun 17, 2015

Member

Ok, so in that case it sounds like the pending items for this PR are:

Anything else you want to see before merging?

@benanne

benanne Jun 17, 2015

Member

I will try to do an in-depth review of the code ASAP to ensure that everything is in line with Lasagne's overall design philosophy, and also w.r.t. coding style. In terms of features/functionality I don't think I have anything to add.

@f0k

f0k Jun 17, 2015

Member

Should we condense the LSTM/GRU parameters? I think the class-style approach that @superbock suggested is a good solution, although it's considered bad practice to have classes with no methods

Thinking again, how is that supposed to work? (Please actually read as: "How is that supposed to work", not as "I don't see how this can work" :) ) If Cell gets the parameter specifications, who's calling add_param() and how? Should Cell and Gate be derived from Layer so they have their own add_param() and get_params() methods?
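One way this can work (and roughly what the PR ends up doing, as far as I can tell; names approximate): the Gate only holds initializer specs, and the layer registers them itself via add_param() in its constructor, e.g. with a helper along these lines:

```python
def add_gate_params(layer, gate, num_inputs, num_units, gate_name):
    # The layer, not the Gate, owns the shared variables; the Gate is just
    # a bag of initializer specs plus a nonlinearity.
    return (layer.add_param(gate.W_in, (num_inputs, num_units),
                            name="W_in_to_{}".format(gate_name)),
            layer.add_param(gate.W_hid, (num_units, num_units),
                            name="W_hid_to_{}".format(gate_name)),
            layer.add_param(gate.b, (num_units,),
                            name="b_{}".format(gate_name),
                            regularizable=False))
```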

To maintain potential backwards compatibility, I could change the code so that Nones were allowed for the peephole connections only, and remove the peepholes kwarg.

Could you just post a short snippet of creating a layer without peephole connections a) with the peepholes kwarg and b) with Nones? I'd like to see which one is cleaner.
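For illustration, the two styles would look roughly as follows with the eventual Gate-based interface (made-up shapes; variant (b) is the hypothetical interface under discussion and is not valid against the code as merged):

```python
import lasagne

l_in = lasagne.layers.InputLayer((None, 20, 10))  # (batch, seq_len, features)

# (a) with the peepholes kwarg
l_lstm_a = lasagne.layers.LSTMLayer(l_in, num_units=128, peepholes=False)

# (b) with per-gate Nones instead of a peepholes kwarg -- hypothetical:
# in this design the Nones themselves would disable the peephole
# connections, so no extra kwarg would be needed.
l_lstm_b = lasagne.layers.LSTMLayer(
    l_in, num_units=128,
    ingate=lasagne.layers.Gate(W_cell=None),
    forgetgate=lasagne.layers.Gate(W_cell=None),
    outgate=lasagne.layers.Gate(W_cell=None))
```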

The third point is easy to solve, isn't it? As for the other issues, I agree we can defer them. The latter seems really interesting, though.

@craffel

craffel Aug 4, 2015

Member

I think the pattern we should encourage is lasagne.random.get_rng().seed(SEED) wherever possible.

This seems better. Can I use this, @f0k?

@f0k

f0k Aug 4, 2015

Member

This seems better.

That's fine as well. We should just ensure tests don't change the RNG without setting it back afterwards.
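A sketch of that pattern in a test, assuming the get_rng/set_rng helpers in lasagne.random; swapping in a fresh RandomState lets the original one be restored untouched afterwards:

```python
import numpy as np
import lasagne

def test_deterministic_init():
    old_rng = lasagne.random.get_rng()
    lasagne.random.set_rng(np.random.RandomState(1234))
    try:
        # Anything drawing from Lasagne's RNG is now reproducible, e.g.:
        W = lasagne.init.GlorotUniform().sample((10, 20))
    finally:
        # Restore the module-level RNG so other tests are unaffected.
        lasagne.random.set_rng(old_rng)
```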

@f0k f0k referenced this pull request Aug 4, 2015

Merged

Split layer docs #335

lasagne/layers/recurrent.py
+ b=lasagne.init.Constant(0.), nonlinearity=lasagne.nonlinearities.sigmoid)
+
+ Simple class to hold the parameters for a gate connection. We define
+ a gate loosely as something which compute the linear mix of two inputs,

@f0k

f0k Aug 4, 2015

Member

*which computes

@craffel

craffel Aug 4, 2015

Member

Fixed.

lasagne/layers/recurrent.py
+ cell=lasagne.layers.Gate(
+ W_cell=None, nonlinearity=lasagne.nonlinearities.tanh),
+ outgate=lasagne.layers.Gate(),
+ nonlinearity_out=lasagne.nonlinearities.tanh,

@f0k

f0k Aug 4, 2015

Member

Why not simply nonlinearity? That would make things easier if we ever decide to do #334.

@craffel

craffel Aug 4, 2015

Member

I'm not sure what you mean, what is nonlinearity?

Edit - oh, you mean instead of calling it nonlinearity_out, call it nonlinearity? Well, it's because there are a lot of nonlinearities in the LSTM layer; this differentiates it from the other ones (which come from the gates). There used to be lots of nonlinearity arguments in the LSTM layer, e.g. nonlinearity_ingate, etc. Now that there's only one that appears in the initializer, I suppose we could change its name.

@f0k

f0k Aug 5, 2015

Member

Yes, that's what I meant, and that's also what I assumed it originated from :) +1 for renaming to be more consistent with other layers; this could come in handy some time (and it's more difficult to rename it after it's out).
/edit: Including the attribute (self.nonlinearity instead of self.nonlinearity_out).

@craffel

craffel Aug 5, 2015

Member

Ok, changed.

lasagne/layers/recurrent.py
+ outgate=lasagne.layers.Gate(),
+ nonlinearity_out=lasagne.nonlinearities.tanh,
+ cell_init=b=lasagne.init.Constant(0.),
+ hid_init=b=lasagne.init.Constant(0.), backwards=False, learn_init=False,

@f0k

f0k Aug 4, 2015

Member

The two b= seem to be misplaced.

@craffel

craffel Aug 4, 2015

Member

Looks like those showed up in craffel@ca8288a, fixed.

lasagne/layers/recurrent.py
+ peepholes : bool
+ If True, the LSTM uses peephole connections.
+ When False, `W_cell_to_ingate`, `W_cell_to_forgetgate` and
+ `W_cell_to_outgate` are ignored.

@f0k

f0k Aug 4, 2015

Member

Those arguments don't exist since introducing the Gate class. You mean ingate.W_cell etc., right?

@craffel

craffel Aug 4, 2015

Member

Indeed, good catch, fixed.

lasagne/layers/recurrent.py
+ # Stack biases into a (4*num_units) vector
+ self.b_stacked = T.concatenate(
+ [self.b_ingate, self.b_forgetgate,
+ self.b_cell, self.b_outgate], axis=0)

@f0k

f0k Aug 4, 2015

Member

I think I'd move the stacking to get_output_for(); the stacked matrices aren't needed anywhere else, and they don't make too much sense as attributes (or is there a use case for them?). They could be mistaken for "real" parameters (i.e., shared variables) when inspecting a LSTMLayer instance interactively, while they're just an implementation detail of the forward pass.
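A toy sketch of that suggestion (made-up class and names): the per-gate biases stay as attributes, but the stacked vector only exists as a local variable of the forward pass.

```python
import numpy as np
import theano
import theano.tensor as T

class ToyGatedLayer(object):
    """Toy stand-in for the real layer: per-gate biases are attributes,
    but no self.b_stacked is ever created."""
    def __init__(self, num_units):
        zeros = np.zeros(num_units, dtype=theano.config.floatX)
        self.b_ingate = theano.shared(zeros.copy(), name='b_ingate')
        self.b_forgetgate = theano.shared(zeros.copy(), name='b_forgetgate')
        self.b_cell = theano.shared(zeros.copy(), name='b_cell')
        self.b_outgate = theano.shared(zeros.copy(), name='b_outgate')

    def get_output_for(self, input):
        # Stacking happens here, as an implementation detail of the forward
        # pass; input's last dimension is assumed to be 4*num_units.
        b_stacked = T.concatenate([self.b_ingate, self.b_forgetgate,
                                   self.b_cell, self.b_outgate], axis=0)
        return input + b_stacked
```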

@craffel

craffel Aug 4, 2015

Member

Hmm... I wouldn't be so sure there isn't a use-case for them as attributes (cc @skaae), although I can't think of any off the top of my head. Moving them into get_output_for should be straightforward.

@f0k

f0k Aug 5, 2015

Member

I wouldn't be so sure there isn't a use-case for them as attributes

Maybe. But I assume they're just here because one of the earlier implementations actually stacked the numpy arrays before turning them into a shared variable, and this had to happen in the constructor. Let's see if @skaae has any objections against moving that code.

@skaae

skaae Aug 5, 2015

Member

I guess the stacking is in the constructor because the original code just initialized the stacked matrices. I think it's better to move it.

@craffel

craffel Aug 5, 2015

Member

Ok, moved it.

lasagne/layers/recurrent.py
+ # Treat all dimensions after the second as flattened feature dimensions
+ if input.ndim > 3:
+ input = input.reshape((input.shape[0], input.shape[1],
+ T.prod(input.shape[2:])))

@f0k

f0k Aug 4, 2015

Member

Easier: input = T.flatten(input, 3). Could also be (unmeasurably) faster.
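A quick check of the equivalence (toy shapes):

```python
import numpy as np
import theano
import theano.tensor as T

x = T.tensor4('x')  # e.g. (batch, seq_len, height, width)
flat_a = x.reshape((x.shape[0], x.shape[1], T.prod(x.shape[2:])))
flat_b = T.flatten(x, 3)  # same result, shorter to write

data = np.zeros((2, 5, 3, 4), dtype=theano.config.floatX)
assert np.allclose(flat_a.eval({x: data}), flat_b.eval({x: data}))
```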

@craffel

craffel Aug 4, 2015

Member

Nice! Changed.

lasagne/layers/recurrent.py
+
+ Gated Recurrent Unit (GRU) Layer
+
+ Implements the updates proposed in [1]_, which computes the output by

@f0k

f0k Aug 5, 2015

Member

"Implements the updates" -- what kind of updates? Updates of what? This doesn't have anything to do with how parameters are updated during training, right? (I'm sure that it doesn't, but that was the first I thought of when reading "updates" in conjunction with Lasagne.)

@craffel

craffel Aug 5, 2015

Member

Ah, yes, that's a little ambiguous. "Updates" in the context of an RNN unit means "how do you set the internal states given new input"; this is also the terminology used by scan. I will clarify.
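A toy illustration of that sense of "updates" with theano.scan (made-up step function; the input and hidden state are assumed to have the same size):

```python
import theano
import theano.tensor as T

x_seq = T.matrix('x_seq')  # one sequence: (n_steps, n_features)
h0 = T.vector('h0')        # initial hidden state

def step(x_t, h_prev):
    # The "update": how the internal state is set given new input.
    return T.tanh(x_t + h_prev)

h_seq, _ = theano.scan(step, sequences=x_seq, outputs_info=h0)
```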

+class GRULayer(MergeLayer):
+ r"""
+ lasagne.layers.recurrent.GRULayer(incoming, num_units,
+ resetgate=lasagne.layers.Gate(W_cell=None),

@f0k

f0k Aug 5, 2015

Member

What happens if W_cell is not None? Will it be used or ignored? What happens if W_cell is None for a LSTMLayer? Will it break?
--> Should Gate be split into two classes, one with W_cell and one without, or just kept as it is?

@craffel

craffel Aug 5, 2015

Member

What happens if W_cell is not None? Will it be used or ignored?

It will be ignored.

What happens if W_cell is None for a LSTMLayer? Will it break?

If peepholes=True, yes; the layer will try to access the W_cell attribute, which doesn't exist for a Gate initialized with W_cell=None. We could probably throw a more helpful error there.
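For example, a more helpful error could come from a small check along these lines (hypothetical helper, not in the merged code):

```python
def check_peephole_gates(ingate, forgetgate, outgate, peepholes=True):
    # Hypothetical early check: with peepholes enabled, every gate needs a
    # W_cell initializer (the Gate simply doesn't set the attribute when
    # constructed with W_cell=None).
    if peepholes and not all(hasattr(gate, 'W_cell')
                             for gate in (ingate, forgetgate, outgate)):
        raise ValueError(
            "peepholes=True requires a W_cell initializer for ingate, "
            "forgetgate and outgate; provide W_cell or pass peepholes=False.")
```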

Should Gate be split into two classes, one with W_cell and one without, or just kept as it is?

I don't think so, they would be completely identical save about 2 lines. We discussed this somewhere...

lasagne/layers/recurrent.py
+ updategate=lasagne.layers.Gate(W_cell=None),
+ hidden_update=lasagne.layers.Gate(
+ W_cell=None, lasagne.nonlinearities.tanh),
+ hid_init=lasagne.init.Constant(0.), learn_init=True, backwards=False,

@f0k

f0k Aug 5, 2015

Member

The order should be hid_init, backwards, learn_init (also below in the actual constructor definition), to match the docstring and to match LSTMLayer.

@craffel

craffel Aug 5, 2015

Member

Good catch, thanks, fixed.

lasagne/layers/recurrent.py
+ # Stack gate biases into a (3*num_units) vector
+ self.b_stacked = T.concatenate(
+ [self.b_resetgate, self.b_updategate,
+ self.b_hidden_update], axis=0)

@f0k

f0k Aug 5, 2015

Member

Again, stacking the weights and biases should be moved into get_output_for().

lasagne/layers/recurrent.py
+ hid_out = hid_out[:, ::-1, :]
+
+ self.hid_out = hid_out
+ return hid_out

@f0k

f0k Aug 5, 2015

Member

Why this? Setting an instance attribute in get_output_for() seems brittle. People should rather be required to obtain the output in the regular way.

@skaae

skaae Aug 5, 2015

Member

I think that should be removed. I guess I added it because I had not figured out that Lasagne supported a regular way to get that output.
There is a similar problem in line 992 with the LSTM cell. I don't think you can refer to the cell in any way other than through the class attribute?

@craffel

craffel Aug 5, 2015

Member

I don't think you can refer to the cell in any way other than through the class attribute?

Yes, that's addressed in #337, right? I removed both for now. Edit: Also had to remove a few tests. If this is actually not OK, please speak up!

lasagne/utils.py
+ Parameters
+ ----------
+
+ fun : function

@f0k

f0k Aug 5, 2015

Member

fn, not fun

@craffel

craffel Aug 5, 2015

Member

Fixed.

@f0k

f0k Aug 5, 2015

Member

Okay, I've read through everything (finally!) and gave some more comments.

Two last things that bug me are:

  1. There's a lot of duplicated code between the three classes (CustomRecurrentLayer, LSTMLayer, GRULayer) in the constructors, in get_output_shape_for() and get_output_for(). It would be nice to factor that out into a common base class, but that can wait.
  2. This PR has 110 commits, some of which only change a single word, or fix things in an example that was subsequently removed. In other PRs we usually take care to squash certain commits to not fill up Lasagne's history with things not worth preserving (such as spelling or pep8 mistakes being introduced and removed), and maybe also to not skew the commit-count-based contributor statistics. In any case, this is a long-running PR, it definitely has some history worth preserving, and cleaning up the history by cherry-picking and squashing commits would be very cumbersome (compared to our usual PRs of 3 or 4 commits by a single author). So I'd be fine with making an exception.

PS: Looking forward to playing around with this! :)

@benanne

benanne Aug 5, 2015

Member

Yeah, let's not waste any time cleaning this up. It's gone on long enough :)

@craffel

craffel Aug 5, 2015

Member

Okay, I've read through everything (finally!) and gave some more comments.

I've addressed everything now, I think.

There's a lot of duplicated code between the three classes (CustomRecurrentLayer, LSTMLayer, GRULayer) in the constructors, in get_output_shape_for() and get_output_for(). It would be nice to factor that out into a common base class, but that can wait.

Agreed, it bothers me too. The main argument against it (which was raised by @skaae elsewhere) is that it will be hard to make it equally readable/modifiable after refactoring for code sharing. Either way, this is not something that will change the interface or behavior, so I am very supportive of not trying to do it now.

This PR has 110 commits, some of which only change a single word, or fix things in an example that was subsequently removed. In other PRs we usually take care to squash certain commits to not fill up Lasagne's history with things not worth preserving (such as spelling or pep8 mistakes being introduced and removed), and maybe also to not skew the commit-count-based contributor statistics. In any case, this is a long-running PR, it definitely has some history worth preserving, and cleaning up the history by cherry-picking and squashing commits would be very cumbersome (compared to our usual PRs of 3 or 4 commits by a single author). So I'd be fine with making an exception.

I can do some squashing of trivial commits, but I'd also like to preserve the commit history as much as possible - a major difference here is that I and many others have been using this codebase for many months, and it (like Lasagne) has changed a good deal in that time, so it's very nice for me to be able to retrieve the codebase at a certain point in time for backwards/parallel-compatibility. Squashing functional-change commits would make this impossible. But, you're agreeing with me here, so let's leave it at that :)

@f0k

f0k Aug 6, 2015

Ah, I see, this was Søren's trick for providing multiple outputs. @skaae, I hope you're fine with merging the PR without this hack and introducing a proper solution for multiple-output layers afterwards (Lasagne#337)!

@skaae

skaae Aug 6, 2015

Yes. One very easy solution is to use a GRU instead.

@f0k

f0k Aug 6, 2015

Member

I've addressed everything now, I think.

Yes, looks good! Thanks!

I can do some squashing of trivial commits, but I'd also like to preserve the commit history as much as possible - a major difference here is that I and many others have been using this codebase for many months

That's a good point. You'll probably want to keep that branch of yours alive for a while.

It's gone on long enough :)

All right, let's merge it then! I'll wait until the afternoon, just to give people a last chance to chime in.

@f0k

f0k Aug 6, 2015

Member

Will merge the beast at 15:00 CEST. Get your champagne out.

f0k added a commit that referenced this pull request Aug 6, 2015

@f0k f0k merged commit 4c882e6 into Lasagne:master Aug 6, 2015

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
coverage/coveralls Coverage remained the same at 100.0%
@JeffreyDF

JeffreyDF Aug 6, 2015

Contributor

Yay!

@ebenolson

ebenolson Aug 6, 2015

Member

I think the merge closed some unrelated issues..

@f0k

f0k Aug 6, 2015

Member

I think the merge closed some unrelated issues..

I was wondering about that. Github always linked the issues in the commit messages to the ones in @craffel's repository, so I hoped it wouldn't close our issues now. Let's reopen them one by one then...

@benanne

benanne Aug 6, 2015

Member

Sweet! Nice work everyone :)

@craffel

craffel Aug 6, 2015

Member

Woo!

I was wondering about that. Github always linked the issues in the commit messages to the ones in @craffel's repository, so I hoped it wouldn't close our issues now. Let's reopen them one by one then...

Funny, I've never seen that before. Sorry about that, I guess I will stop doing that.

@JackKelly

JackKelly Aug 7, 2015

Contributor

Awesome work, everyone! Thank you!

craffel added a commit to craffel/Lasagne-tutorial that referenced this pull request Nov 9, 2015

Change recurrent layers to accept mask input layer
As discussed here:
Lasagne/Lasagne#294 (comment)
when the mask is supplied as a kwarg to get_output, it is not possible
to pass different masks to different layers in a larger network.  A
good example would be when there are two network branches which are
receiving different input.  We decided here:
Lasagne/Lasagne#337
that the best approach (for now) is to make the recurrent layers
subclass MergeLayer, and have two inputs: One for the actual layer
input, and another for the mask.  This commit implements that change
with a new keyword arg to the constructor of all of the recurrent
layer classes, `mask_input`.  It also updates all of the tests and
the example.
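A minimal usage sketch of the resulting interface (made-up shapes):

```python
import lasagne

# Made-up shapes: 16 sequences of 20 steps with 10 features each.
l_in = lasagne.layers.InputLayer((16, 20, 10))
l_mask = lasagne.layers.InputLayer((16, 20))

# The mask is now an input layer of its own, so different branches of a
# larger network can be given different masks.
l_lstm = lasagne.layers.LSTMLayer(l_in, num_units=32, mask_input=l_mask)
output = lasagne.layers.get_output(l_lstm)
```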