# Lasagne/Lasagne

Merged
merged 116 commits into from Aug 6, 2015
+2,396 −0

## Conversation

Member

### craffel commented Jun 11, 2015

Addresses #17. Includes layers for LSTM, GRU, "vanilla" (dense) RNN, and an RNN with arbitrary connection patterns. Thanks to @skaae for the tests and the GRU layer (+more), and @JackKelly for some comment fixes and additional functionality. Some misc comments that may warrant addressing:

- We added examples to the examples directory. Do you want these? The usage of the recurrent layers is different enough that an example is probably warranted, but there is also an example in the top docstring of the recurrent submodule, which covers the absolute basics in terms of how to reshape things to make it work. Any examples should probably be supplemented with more fleshed-out stuff, like @skaae's Penn Treebank example and my speech recognition example.
- We didn't resolve these two issues: craffel#30 and craffel#15, which are essentially the same thing - i.e., the ability to use the RNNs for sequence prediction. It's probably a feature we want to have, as it's a very hot topic at the moment.
- The LSTM layer (and the GRU layer, to some extent) has a LOT of `__init__` arguments because there are so many things to initialize. Most of the time people initialize all weight matrices in the same way, although I'm sure that's not always true. It's actually quite common to initialize the forget gate bias vector differently than the rest.
- The `peepholes` kwarg of the LSTM layer decides whether to include connections from the cell to the gates; sometimes people use these, sometimes they don't, so it's important that we include this functionality. However, people also sometimes don't include other connections, and we don't have functionality for this. What I'd really like is something where if one of the arguments is passed as None, then the connection is not used, but this would make the code pretty ugly - we'd essentially have to define a new dot method which does nothing when one of the arguments is None or something. So, in short, I'd rather not address this now.
- The `grad_clipping` argument in all layers clips the gradients, which is very common for RNNs. I think, however, this is pretty common for other network architectures too, so what I'd really like to see is for it to be moved out to a global function in updates, or something. But this isn't something I'm familiar enough with to know how feasible that is; @skaae would be better placed to comment on that.
- `hid_init` (and `cell_init`) are allowed to be TensorVariables, see our discussion here: craffel#27 (comment). To summarize that, I'd rather have them be able to be variable-size shared variables which are added with `add_param`, but we can't because of the way we allow for variable batch size. Any ideas on that discussion are appreciated.

Ok, that's about it, let me know what you think! Would be great to get this merged before the first release.
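To make the "dot method which does nothing when one of the arguments is None" idea concrete, here is a minimal numpy sketch; the name `maybe_dot` and the numpy stand-in are illustrative, not code from the PR:

```python
import numpy as np

def maybe_dot(x, W):
    # Return the additive identity when the weight is None, so a
    # missing connection simply drops out of the pre-activation sum.
    if W is None:
        return 0
    return np.dot(x, W)

# With a weight matrix the product is computed as usual; with None,
# the "connection" contributes nothing to whatever it is added to.
h = np.ones((2, 3))
W = np.ones((3, 4))
assert maybe_dot(h, W).shape == (2, 4)
assert maybe_dot(h, None) == 0
```

The catch discussed later in the thread is that the real layers stack their weight matrices into one big matrix for dot-product efficiency, which this per-connection helper does not capture.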
Member

### benanne commented Jun 11, 2015

 Ooooh it's happening :D I'm excited! I will try to look at this in detail and reply to each of your points tomorrow.
Member

### skaae commented Jun 11, 2015

The gradient clipping (`grad_clipping`) is meant to be implemented as in http://arxiv.org/pdf/1308.0850.pdf. From page 6:

> ...all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range

I think that is what the gradient clipping does now, but I would like to know if you think it is correct. We can implement decoders using GRU units with the current structure of Lasagne, but we cannot do it for the LSTM layers because they output both a hidden state and a cell value. Nice work Colin :)
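For intuition, the clipping described in that passage amounts to an element-wise clamp of the gradient into a fixed range. A numpy sketch of the behaviour (illustrative only, not the Theano code in the PR):

```python
import numpy as np

def clip_gradient(grad, clip_range):
    # Element-wise clamp of the gradient into [-clip_range, clip_range],
    # as described on p.6 of Graves (2013).
    return np.clip(grad, -clip_range, clip_range)

g = np.array([-100.0, -0.5, 0.0, 0.5, 100.0])
clipped = clip_gradient(g, 1.0)  # large components are clamped to ±1
```

In the PR this is applied inside the step function, i.e. to the derivatives flowing into the recurrent pre-activations, rather than to the final parameter gradients.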
Member

### benanne commented Jun 11, 2015

Travis isn't too happy about this so far. I can spot two problems (but there may be more):

- This PR makes use of Theano's test value functionality, and for some reason this makes a fairly random subset of tests error out. I don't really see the pattern yet, but I'm sure it's not too hard to figure out.
- The added examples do not follow the format of the ones we already have, and our example testing code is trying to run them as tests and failing as a result.
Member

### craffel commented Jun 11, 2015

> This PR makes use of Theano's test value functionality and for some reason this makes a fairly random subset of tests error out. I don't really see the pattern yet, but I'm sure it's not too hard to figure out

Hmm, any intuition as to why these tests would pass on my system but not on Travis? Looking at the error, why would Theano need a default value at all?

> The added examples do not follow the format of the ones we already have, and our example testing code is trying to run them as tests and failing as a result

I can try to look at the tests for the examples and make our examples match, or I can try to change the tests for the examples so that they cover our examples, or I can nuke the recurrent examples. Any thoughts?
Member

### skaae commented Jun 11, 2015

I think I added the test values to the lstm.py example. You can remove those and the `theano.config.compute_test_value = 'raise'` (line 5). They are only there to illustrate how to debug.
Member

### benanne commented Jun 11, 2015

> I can try to look at the tests for the examples and make our examples match, or I can try to change the tests for the examples so that they cover our examples, or I can nuke the recurrent examples. Any thoughts?

The examples will surely be useful to have, so I don't think we should nuke them. There's also #215, which we should take care of soon. This will change the format of the existing examples somewhat (but they will still have a main() method to keep them testable). So changing the recurrent examples to be more like the other ones right now might be a wasted effort as well.

### skaae reviewed Jun 12, 2015

```diff
+
+
+class GRULayer(Layer):
+    r"""Gated Recurrent Unit (GRU) Layer
```

#### skaae Jun 12, 2015

Member

What is this 'r'?

#### craffel Jun 12, 2015

Member

It tells Python to ignore escape codes, e.g. the `\s` in `\sigma`. http://stackoverflow.com/a/4780104
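A quick illustration of what the `r` prefix does (standard Python behaviour, nothing PR-specific):

```python
# A raw string leaves backslashes untouched, so LaTeX markup like
# \sigma survives in a docstring instead of being treated as an escape.
raw = r"\n"      # two characters: backslash + 'n'
escaped = "\n"   # one character: a newline

assert len(raw) == 2
assert len(escaped) == 1
assert r"\sigma" == "\\sigma"  # same contents, different spelling
```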

### skaae reviewed Jun 12, 2015

```diff
+    def get_output_shape_for(self, input_shape):
+        return input_shape[0], input_shape[1], self.num_units
+
+    def get_output_for(self, input, mask=None, *args, **kwargs):
```

#### skaae Jun 12, 2015

Member

Remove `*args`

Member

Fixed in b96c57b

### skaae reviewed Jun 12, 2015

#### skaae Jun 12, 2015

Member

All recurrent layers seem to be missing `**kwargs` in `__init__`

Member

Fixed in 694094f

Member

### craffel commented Jun 12, 2015

I'm really confused by the Travis errors of the kind

```
ValueError: Cannot compute test value: input 0 () of Op Shape() missing default value
```

They don't seem to be isolated to the new recurrent tests. Anyone know what's up?
Member

### benanne commented Jun 12, 2015

 As I said it's something to do with Theano's test_value machinery, which I am unfamiliar with so I can't pinpoint it exactly. It does look like it's "leaking" out to other tests that are unrelated to the recurrent stuff. It seems like the lstm.py example uses test values and changes Theano settings related to test values. Since the current example testing code tries to load up all examples, this code is run and the settings are changed from then on.
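The "leak" is just process-wide module state: a flag flipped while importing one example stays flipped for every test that runs afterwards in the same process. A self-contained illustration (`config` here is a stand-in for `theano.config`, not the real object):

```python
class Config:
    # Stand-in for theano.config: one mutable object shared by the
    # whole process.
    compute_test_value = 'off'

config = Config()

def import_lstm_example():
    # The example sets the flag at module level, i.e. at import time...
    config.compute_test_value = 'raise'

def unrelated_test():
    # ...and an unrelated test run later still sees the changed value.
    return config.compute_test_value

import_lstm_example()
assert unrelated_test() == 'raise'
```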
Member

### craffel commented Jun 12, 2015

Ah, I see what you're getting at. I will remove the test value stuff from the lstm.py example... although I am currently thinking the recurrent examples should just be moved to a recipe, because of the way the current examples are used for testing and are in the midst of a revamp.
Member

### craffel commented Jun 12, 2015

Looks like all tests are passing now, except the example tests, because the recurrent examples have no main method. I propose we either:

1. Move the examples out into a separate recipe (I like this idea more, because I think they could be fleshed out into something more interesting)
2. Add a main method to all the examples

Any preference?
Member

### skaae commented Jun 12, 2015

I think we should have at least a simple recurrent example. We can run it on your delay data? Let me know if I should modify craffel#41 to be a recipe.
Member

### craffel commented Jun 12, 2015

> I think we should have at least a simple recurrent example.

I think it depends on what we want out of the examples - if they're just intended as a simple use case to get people started with the library, without going into much depth, then I don't think there needs to be one, and I think putting some nice recurrent recipes in is fine (including your Penn Treebank example, my noisy speech recognition benchmark example, and the delay task/other simple tasks example). But the usage of the recurrent layers is a bit different than the dense layers because of the requisite reshaping, so I'd be happy to include a simple example too. Anyone else have input?

> Let me know if I should modify craffel#41 to be a recipe.

I think you should either way; it makes more sense there to me.
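For readers wondering what "the requisite reshaping" refers to: a recurrent layer produces a 3-D (batch, sequence, features) output, while a dense layer expects 2-D input, so the leading axes are flattened and later restored. A numpy sketch of the pattern (shapes illustrative; this is not the Lasagne API):

```python
import numpy as np

batch, seq_len, num_units = 4, 10, 8
recurrent_out = np.zeros((batch, seq_len, num_units))

# Flatten (batch, seq_len) into one axis so a dense layer can treat
# every timestep as an independent row...
flat = recurrent_out.reshape((batch * seq_len, num_units))
assert flat.shape == (40, 8)

# ...then restore the sequence structure afterwards.
restored = flat.reshape((batch, seq_len, num_units))
assert restored.shape == (4, 10, 8)
```

In Lasagne itself this is done with ReshapeLayer around the non-recurrent layers, which is the step a first-time user can find surprising.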
Member

### benanne commented Jun 12, 2015

 I reckon it would be good to have a simple recurrent example bundled with the library too - one that's tested just like the existing examples, so it can double as an integration test :)
Member

### craffel commented Jun 12, 2015

 Ok, I'll boil them all down to a single example with a main function then!
Contributor

### dnouri commented Jun 12, 2015

 I reckon it would be good to have a simple recurrent example bundled with the library too - one that's tested just like the existing examples, so it can double as an integration test :) +1

### skaae reviewed Jun 13, 2015

```diff
+                 nonlinearity_updategate=nonlinearities.sigmoid,
+                 W_in_to_hidden_update=init.Normal(0.1),
+                 W_hid_to_hidden_update=init.Normal(0.1),
+                 b_hidden_update=init.Constant(0.),
```

#### skaae Jun 13, 2015

Member

This should probably be init.Normal(0.1)?

I think we should change the init of all biases to zero?

#### craffel Jun 15, 2015

Member

You're right, I didn't notice you had used Normals here, I'll change them to Constants. Issue here: #53

Closed

### skaae reviewed Jun 14, 2015

```diff
+            out_ = fn(*step_input)
+            if not isinstance(out_, (list, tuple)):
+                out_ = [out_]
+            output.append(out_)
```

#### skaae Jun 14, 2015

Member

We need an additional isinstance check. The returned value from step can be either TensorVariable, list or tuple.
The function crashes for tuples because it'll try to add prev_vals (a tuple) to a list.

Replacing lines 348-349 with

```python
if isinstance(out_, T.TensorVariable):
    out_ = [out_]
if isinstance(out_, tuple):
    out_ = list(out_)
```

solves the problem. This is not a problem for the current code because all step functions return TensorVariables or lists.

#### craffel Jun 15, 2015

Member

Do you want to create a PR for this?

#### skaae Jun 15, 2015

Member

I can create a PR. If it's easier for you to just copy the lines over, that's fine as well.

#### f0k Jul 21, 2015

Member

Is this solved? I'm wondering because it doesn't say I'm looking at an outdated diff, but maybe the comment was not close enough to the line that needed changing?

#### craffel Jul 21, 2015

Member

No, see craffel#60 . I was waiting for @skaae to do it, but I can do it if it's really as simple as copy-pasting.

### superbock commented Jun 15, 2015

Hi, first of all I'd like to say a big thanks for this functionality (and Lasagne in general, of course!). If it had existed a year ago it would have saved me a lot of time. I'd like to add a few comments/suggestions about this PR from an RNN/LSTM user perspective. But before I do so, I'd like to add that I haven't used Lasagne nor this RNN implementation, so some comments may be totally inappropriate or stupid...

- I think the need to reshape the output of a RecurrentLayer before being able to connect it to another layer is a bit counterintuitive. It would be nice if the layer would handle this itself.
- Although I see the point of having a CustomRecurrentLayer, I don't like it too much. How about using just the RecurrentLayer and adding options to choose the layer class to be used for input_to_hidden and hidden_to_hidden, with appropriate defaults (e.g. DenseLayer, to have a fully connected recurrent layer, which is probably the most common case)?
- Instead of having the backwards flag in all layers, I'd add a BidirectionalLayer which handles the flipping/reversing of the input and output of the backward layer and also the stacking/summing/pooling of the outputs. Although I have used bidirectional networks only with RNNs, this functionality is not limited to RNNs.
- A possible solution for the numerous `__init__` arguments of the LSTMLayer could be to handle the gates and the cell as separate classes (or instances thereof). This would remove a lot of duplicated code. It may also make the whole thing more readable.
- I'd put the unroll_scan function in the very same file, since it is related pretty closely (solely?) to recurrent networks.
Contributor

### JackKelly commented Jun 15, 2015

> I think the need to reshape the output of a RecurrentLayer before being able to connect it to another layer is a bit counterintuitive. It would be nice if the layer would handle this itself.

When I first started using @craffel's fork, I thought the same as you (i.e. that the layer should do all the necessary reshaping). But when I started playing with things like doing 1D convolutions over the time axis, I started to like the fact that you have to explicitly reshape, because it makes it easier (at least in my head) to track what's going on in the network.

> I'd add a BidirectionalLayer

I agree. It's very easy to add a BidirectionalLayer layer. I've been using this in my own experiments:

```python
def BidirectionalLayer(layer_class, *args, **kwargs):
    kwargs.pop('backwards', None)
    l_fwd = layer_class(*args, backwards=False, **kwargs)
    l_bck = layer_class(*args, backwards=True, **kwargs)
    return ElemwiseSumLayer([l_fwd, l_bck])
```

Then, you can easily define bidirectional RNN and bidirectional LSTM layers:

```python
def BLSTMLayer(*args, **kwargs):
    """Configures forward and backwards LSTM layers to create a
    bidirectional LSTM."""
    return BidirectionalLayer(LSTMLayer, *args, **kwargs)

def BidirectionalRecurrentLayer(*args, **kwargs):
    """Configures forward and backwards RecurrentLayers to create a
    bidirectional recurrent layer."""
    return BidirectionalLayer(RecurrentLayer, *args, **kwargs)
```
Member

### skaae commented Jun 15, 2015

Here's a few comments:

- Reshape: I think the general idea with Lasagne is that the settings are not hidden from the user. Secondly, if we were to implement such a feature, I guess a layer would need to know its parent, which would complicate things? But maybe the feature would fit into Lasagne + nolearn?
- CustomRecurrentLayer: We discussed this at length in one of the issues :) Probably somewhere in #17 or in Colin's repository. To me the only downside with CustomRecurrentLayer is that it does not do calculations with stacked matrices.
- Backwards argument: You could write a small function that sets up a bidirectional layer? We did have a bidirectional layer in an earlier version of the code, but it was removed because we thought the current setup is simpler.
- Unroll scan: I think that is a good idea.
Member

### benanne commented Jun 15, 2015

 regarding unroll_scan, I don't think moving this into the recurrence submodule makes sense because it's very likely that it has other use cases that do not involve recurrence. scan is often used to implement RNNs but it is much more general than that. If anything I would put it under theano_extensions, especially because it mimics the scan interface.
Member

### craffel commented Jun 15, 2015

> I think the need to reshape the output of a RecurrentLayer before being able to connect it to another layer is a bit counterintuitive. It would be nice if the layer would handle this itself.

There are a few reasons for this; apart from @JackKelly's reason above, it also avoids some redundant reshapes if you are stacking recurrent layers, and it also makes the whole system more flexible with respect to how many "sample" dimensions you have vs. "feature" dimensions. Of course, it's also very simple to write a function which does this automatically for you in your own code, if you prefer.

> Although I see the point of having a CustomRecurrentLayer, I don't like it too much. How about using just the RecurrentLayer and adding options to choose the layer class to be used for input_to_hidden and hidden_to_hidden, with appropriate defaults (e.g. DenseLayer, to have a fully connected recurrent layer, which is probably the most common case)?

I'm not really seeing how this is substantially different from what we have, aside from the fact that there are currently two separate classes. Using the CustomRecurrentLayer you can use arbitrary layer structures for the input-to-hidden and hidden-to-hidden connections (which is more flexible than an option to choose the layer class to be used for input_to_hidden and hidden_to_hidden), and using the RecurrentLayer you have "appropriate defaults" for the connections; that's exactly the point behind the way the code is now.

> Instead of having the backwards flag in all layers, I'd add a BidirectionalLayer which handles the flipping/reversing of the input and output of the backward layer and also the stacking/summing/pooling of the outputs. Although I have used bidirectional networks only with RNNs, this functionality is not limited to RNNs.

As noted above, we did initially do it this way, but there was a lot of discussion to come to the format it's in now. First, note that when backwards=True, the recurrent layers themselves handle the flipping/reversing of the input and output. Second, one nice thing about not using a BidirectionalLayer is that it allows the user to decide how they want the layer outputs to be combined - concatenation? Element-wise sum? Processed separately? Etc. As I recall, there was also a strange Theano optimization bug which made things substantially slower with the BidirectionalLayer I wrote, although this may have been an artifact of #104 (it was many many months ago now...). If, for your own code, it would be convenient to have the same bidirectional structure in many places, it's straightforward to hand-roll your own as @JackKelly pointed out.

> A possible solution for the numerous `__init__` arguments of the LSTMLayer could be to handle the gates and the cell as separate classes (or instances thereof). This would remove a lot of duplicated code. It may also make the whole thing more readable.

I'm having trouble picturing what you mean by this, can you be more specific?

> I'd put the unroll_scan function in the very same file, since it is related pretty closely (solely?) to recurrent networks.

I don't care much either way, I think it's up to @benanne and @f0k.

### lim0606 commented Jun 15, 2015

Hi, everyone. First of all, thank you so much for this cool functionality. I have a question about using arbitrary connections in this recurrent functionality. I'd like to use it to implement DRAW (http://arxiv.org/abs/1502.04623). I know there is an awesome DRAW example for Lasagne written by @skaae (https://github.com/skaae/lasagne-draw), but I want to implement DRAW with combinations of Lasagne feedforward layers and recurrent ones by explicitly representing connectivities; thus, someone can easily make variants of it. As far as I understand from the original issue related to Lasagne recurrence (#17) and here, the current PR seems to support arbitrary connections. However, I'm not sure the arbitrary connections in this PR include the case where the connection happens from a hidden at t-1 to a different hidden at t (if my wording is bad, please see the following figures).

This is possible:

```
y        (t-1) → y        (t) → ...
    ↑                ↑
hidden_n (t-1) → hidden_n (t) → ...
    ...              ...
hidden_2 (t-1) → hidden_2 (t) → ...
    ↑                ↑
hidden_1 (t-1) → hidden_1 (t) → ...
    ↑                ↑
input    (t-1) → input    (t) → ...
```

Is this possible?

```
y        (t-1) → y        (t) → ...
    ↑                ↑
hidden_n (t-1) → hidden_n (t) → ...
    ...              ...
hidden_2 (t-1) → hidden_2 (t) → ...
    ↑       ↘        ↑
hidden_1 (t-1) → hidden_1 (t) → ...
    ↑       ↘        ↑
input    (t-1) → input    (t) → ...
```

Can anyone help me to represent the arbitrary connection from t-1 to t with this recurrent functionality? Best regards, Jaehyun

### superbock commented Jun 15, 2015

I was not fully aware of the full "history", but after reading issue #17 things are much clearer now. Sorry for the noise. As for the CustomRecurrentLayer thing, I don't have a problem with the way it is. It's just I didn't like it too much the very first time, exactly like @benanne said in #17 (comment) 😏. Maybe renaming the CustomRecurrentLayer to RecurrentLayer and RecurrentLayer to FullyConnectedRecurrentLayer would be more expressive though?

> > A possible solution for the numerous `__init__` arguments of the LSTMLayer could be to handle the gates and the cell as separate classes (or instances thereof). This would remove a lot of duplicated code. It may also make the whole thing more readable.
>
> I'm having trouble picturing what you mean by this, can you be more specific?

You could e.g. do something like this:

```python
class Cell(object):
    def __init__(self, W_in=init.Normal(0.1), W_hid=init.Normal(0.1),
                 b=init.Normal(0.1), nonlinearity=nonlinearities.sigmoid):
        ...

class Gate(object):
    def __init__(self, W_in=init.Normal(0.1), W_hid=init.Normal(0.1),
                 W_cell=init.Normal(0.1), b=init.Normal(0.1),
                 nonlinearity=nonlinearities.tanh):
        ...

class LSTMLayer(Layer):
    def __init__(self, incoming, num_units, ingate=Gate(),
                 forgetgate=Gate(), cell=Cell(), outgate=Gate(), ...):
        ...
```

Then there would be way fewer arguments in the `__init__` method of the LSTMLayer. Cell could of course just be a Gate with a different nonlinearity and W_cell set to None (or Gate could extend the Cell class). No peepholes could be realised easily by setting W_cell to None.
Member

### craffel commented Jun 15, 2015

@lim0606 wrote:

> Is this possible?
>
> ```
> y        (t-1) → y        (t) → ...
>     ↑                ↑
> hidden_n (t-1) → hidden_n (t) → ...
>     ...              ...
> hidden_3 (t-1) → hidden_3 (t) → ...
>     ↑       ↘        ↑
> hidden_2 (t-1) → hidden_2 (t) → ...
>     ↑       ↘        ↑
> hidden_1 (t-1) → hidden_1 (t) → ...
>     ↑                ↑
> input    (t-1) → input    (t) → ...
> ```

Hm, not that I can think of off the top of my head; @skaae do you have any ideas here?

@superbock wrote:

> As for the CustomRecurrentLayer thing, I don't have a problem with the way it is. It's just I didn't like it too much the very first time, exactly like @benanne said in #17 (comment) 😏.

Right, note that it was something I initially suggested; please read above and below that to get the full context as to why it ended up the way it did.

> Maybe renaming the CustomRecurrentLayer to RecurrentLayer and RecurrentLayer to FullyConnectedRecurrentLayer would be more expressive though?

The naming convention is based on @f0k's suggestion, I think: #17 (comment)

> You could e.g. do something like this:

That could work, although in order to stack the matrices for dot product efficiency as we do, the Cell/Gate classes would simply just hold on to the parameters; they couldn't have functions to actually compute anything. The main disadvantage here is that the convention across Lasagne is that the layers themselves take parameter initializers as keyword args, with the obvious exception of CustomRecurrentLayer.

> Cell could of course just be a Gate with a different nonlinearity and W_cell set to None (or Gate could extend the Cell class). No peepholes could be realised easily by setting W_cell to None.

Well, that would rely on some logic that doesn't exist yet where if a parameter is None, no dot product is computed; I proposed this here: craffel#32 (comment) and at the top of this PR, because I agree that it would make arbitrary connections pretty slick, but it would make the code pretty messy in places. For example, it would make stacking the W_in_* matrices and using slice much more messy, at least at first glance.
Member

### craffel commented Jun 15, 2015

Ok, I merged all of the examples into a single one: https://github.com/craffel/nntools/blob/recurrent/examples/recurrent.py It now uses a more standard task (the "addition task" from the original LSTM paper). It tries to use most of the recurrent-specific stuff, which should be helpful as an actual example. The hyperparameters could probably be tuned some, if anyone wants to do that at some point. I don't see a real reason to have separate LSTM and GRU examples, although a CustomRecurrentLayer example could be useful... Now let's see if Travis likes it.

Edit - damn it, I forgot to give main an argument specifying the number of iterations. I'll fix that, but I guess it'll be a few hours until that commit gets tested. I guess now would be a good time to ask if anyone knows why py.test skips all the example tests on my system:

```
lasagne/tests/test_examples.py::test_example[mnist_conv_cc] SKIPPED
lasagne/tests/test_examples.py::test_example[mnist] SKIPPED
lasagne/tests/test_examples.py::test_example[recurrent] SKIPPED
lasagne/tests/test_examples.py::test_example[mnist_conv] SKIPPED
lasagne/tests/test_examples.py::test_example[mnist_conv_dnn] SKIPPED
```
Member

### skaae commented Jun 15, 2015

@lim0606: I don't think you can implement that without writing your own scan. Maybe you could create an LSTM layer that only does a single step - a feedforward LSTM :) Then you could just stack and wire those inside your scan?
Member

### benanne commented Jun 15, 2015

> I guess now would be a good time to ask if anyone knows why py.test skips all the example tests on my system:

Make sure to run py.test with the `--runslow` option, else they are skipped. This bit of info should be added to the development docs.
Member

### craffel commented Jun 15, 2015

> Make sure to run py.test with the `--runslow` option, else they are skipped. This bit of info should be added to the development docs.

Got it. I haven't used py.test much, so I figured it was some command-line option I didn't know about. Looks like the test passed.

### lim0606 commented Jun 16, 2015

 @craffel @skaae Thank you for comments! Jaehyun
Member

### f0k commented Jun 16, 2015

> The peepholes kwarg of the LSTM layer decides whether to include connections from the cell to the gates; [...] What I'd really like is something where if one of the arguments is passed as None, then the connection is not used, but this would make the code pretty ugly - we'd essentially have to define a new dot method which does nothing when one of the arguments is None or something.

I haven't taken a look at the code yet, but are you saying that allowing one of the constructor arguments to be None would make the peepholes argument obsolete, plus make the class a lot more flexible? If so, I'd be in favor of not introducing the peepholes argument in the first place (we cannot easily remove it after the release). Setting a parameter to None is in line with how bias-less layers are implemented, and I think your solution of a maybe_dot helper method would be elegant enough.

> For example, it would make stacking the W_in_* matrices and using slice much more messy, at least at first glance.

Ah, I see... so it's more involved than adding a maybe_dot function.

> I was not fully aware of the full "history", but after reading issue #17 things are much clearer now. Sorry for the noise.

Wow, you're one of the few chosen people that know the full history of #17 then :) No need to be sorry, I think that's the longest thread in all of Lasagne and I wouldn't have expected anybody to read it. We might want to document the design choices you wondered about in our FAQ (#298).

> Maybe renaming the CustomRecurrentLayer to RecurrentLayer and RecurrentLayer to FullyConnectedRecurrentLayer would be more expressive though?

That might be a bit long, and FullyConnectedRecurrentLayer only tells part of the story -- note that CustomRecurrentLayer also allows in->hid or hid->hid to be a stack of two or more fully connected layers. What about RecurrentLayer and VanillaRecurrentLayer? The latter would be the one that has single dense layers for in->hid and hid->hid and uses the stacked weight matrices trick to be faster. In any case, I'd use RecurrentLayer for the one that we expect to be used the most. I don't know which one that would be; I leave that up for the RNN experts here to decide.

> The main disadvantage here is that the convention across Lasagne is that the layers themselves take parameter initializers as keyword args

I wouldn't mind much about breaking with that convention and grouping parameters if that makes things easier. I didn't look at the code yet, though.

/edit: I'll have a closer look tomorrow or the day after. Thanks for the joint effort! Merging this before the first release is not really necessary because it doesn't introduce any backward-incompatible changes, but I agree that it would be nice to have it in! We just need to be sure we have a good interface, because it's more difficult to change once it's released.
Member

### craffel commented Jun 16, 2015

> Setting a parameter to None is in line with how bias-less layers are implemented, and I think your solution of a maybe_dot helper method would be elegant enough. ... Ah, I see... so it's more involved than adding a maybe_dot function.

Yes, that's what I was initially thinking, and that's why I like it, but yeah, I can't think of a simple way off the top of my head to make it elegant given the matrix stacking we do for efficiency. Doesn't mean it couldn't be done.

> We might want to document the design choices you wondered about in our FAQ (#298).

Not a bad idea, although condensing down all the information in #17, https://github.com/craffel/nntools 's issues and PRs, and even this PR sounds like a massive (but probably worthwhile) undertaking.

> What about RecurrentLayer and VanillaRecurrentLayer?

This was the plan at some point, I think.

> In any case, I'd use RecurrentLayer for the one that we expect to be used the most.

I would expect the "vanilla", non-customizable recurrent layer to be used more. It's still the case that there are relatively few papers which have hidden-to-hidden/input-to-hidden connections which are not just dense (excepting LSTM, GRU, etc., of course), even though it's a thing people have tried.
Member

### benanne commented Jun 17, 2015

> I would expect the "vanilla", non-customizable recurrent layer to be used more. It's still the case that there are relatively few papers which have hidden-to-hidden/input-to-hidden connections which are not just dense (excepting LSTM, GRU, etc., of course), even though it's a thing people have tried.

I agree; my gut feeling is that most people will just want the vanilla one, so I'm in favour of keeping the current naming scheme.
Member

### craffel commented Jun 17, 2015

Ok, so in that case it sounds like the pending items for this PR are:

1. Should we condense the LSTM/GRU parameters? I think the class style that @superbock suggested is a good solution, although it's considered bad practice to have classes with no methods (see e.g. http://pylint-messages.wikidot.com/messages:r0903 )... We could use a namedtuple (https://docs.python.org/2/library/collections.html#namedtuple-factory-function-for-tuples-with-named-fields) instead. Not sure how the docstring works for that, though; I think that may be why I didn't use them in the past (e.g. https://github.com/craffel/pretty-midi/blob/master/pretty_midi/pretty_midi.py#L1160 )
2. Should we allow weight vectors to be None (esp. for the LSTMLayer)? I think this would be nice, but I can't think of a clean solution off the top of my head. I'd rather leave this for future work if it's OK with you all. To maintain potential backwards compatibility, I could change the code so that Nones were allowed for the peephole connections only, and remove the peepholes kwarg. #294 (diff)
3. craffel#30 and craffel#15, which I think we can hold off on for now.

Anything else you want to see before merging?
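To make the namedtuple alternative concrete, a small sketch (the `Gate` field names here are hypothetical, chosen to mirror the per-gate kwargs; this is not the interface that was merged):

```python
from collections import namedtuple

# Group each gate's initializers into one value instead of passing
# W_in_to_*, W_hid_to_*, b_*, etc. as separate keyword arguments.
Gate = namedtuple('Gate', ['W_in', 'W_hid', 'W_cell', 'b', 'nonlinearity'])

ingate = Gate(W_in=0.1, W_hid=0.1, W_cell=0.1, b=0.0, nonlinearity='sigmoid')

assert ingate.b == 0.0
assert ingate.nonlinearity == 'sigmoid'
```

The docstring concern mentioned above is real: a namedtuple comes with a generated docstring, so documenting the fields per-gate takes extra work compared to a hand-written class.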
Member

### benanne commented Jun 17, 2015

 I will try to do an in-depth review of the code ASAP to ensure that everything is in line with Lasagne's overall design philosophy, and also w.r.t. coding style. In terms of features/functionality I don't think I have anything to add.
Member

### f0k commented Jun 17, 2015

> Should we condense the LSTM/GRU parameters? I think the class-style approach that @superbock suggested is a good solution, although it's considered bad practice to have classes with no methods

Thinking again, how is that supposed to work? (Please actually read this as "How is that supposed to work?", not as "I don't see how this can work." :) ) If Cell gets the parameter specifications, who's calling add_param() and how? Should Cell and Gate be derived from Layer so they have their own add_param() and get_params() methods?

> To maintain potential backwards compatibility, I could change the code so that Nones were allowed for the peephole connections only, and remove the peepholes kwarg.

Could you just post a short snippet of creating a layer without peephole connections a) with the peepholes kwarg and b) with Nones? I'd like to see which one is cleaner. The third point is easy to solve, isn't it? About the other issues, I agree we can defer them. The latter seems really interesting, though.
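One possible division of labor, as a hedged sketch (Gate, ToyLayer, and the parameter names here are illustrative stand-ins, not Lasagne's actual implementation): the Gate only stores initializer specifications and owns no parameters itself, and the layer calls its own add_param() with them.

```python
class Gate(object):
    """Container for one gate's initializer specs; owns no parameters."""
    def __init__(self, W_in=0.1, W_hid=0.1, W_cell=None, b=0.0,
                 nonlinearity=None):
        self.W_in = W_in
        self.W_hid = W_hid
        self.W_cell = W_cell
        self.b = b
        self.nonlinearity = nonlinearity

class ToyLayer(object):
    """Stand-in for a Layer with an add_param() registry."""
    def __init__(self, ingate):
        self.params = {}
        # The layer, not the Gate, registers the parameters:
        self.W_in = self.add_param(ingate.W_in, 'W_in_to_ingate')
        self.b = self.add_param(ingate.b, 'b_ingate')

    def add_param(self, spec, name):
        # Real code would instantiate the spec as a shared variable here.
        self.params[name] = spec
        return spec
```

Under this scheme, Gate does not need to derive from Layer; it is a passive bundle of specs that the layer's constructor unpacks.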
Member

### craffel commented Aug 4, 2015

> I think the pattern we should encourage is lasagne.random.get_rng().seed(SEED) wherever possible.

This seems better. Can I use this, @f0k?
Member

### f0k commented Aug 4, 2015

> This seems better.

That's fine as well. We should just ensure tests don't change the RNG without setting it back afterwards.
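The "setting it back afterwards" part could look like this in a test (a numpy sketch; a plain RandomState stands in for whatever lasagne.random.get_rng() returns):

```python
import numpy as np

rng = np.random.RandomState()  # stand-in for lasagne.random.get_rng()
state = rng.get_state()        # remember the state before the test
try:
    rng.seed(1234)             # deterministic draws for the test body
    sample = rng.uniform(size=3)
finally:
    rng.set_state(state)       # leave the RNG exactly as we found it
```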

Merged

### f0k reviewed Aug 4, 2015

+ b=lasagne.init.Constant(0.), nonlinearity=lasagne.nonlinearities.sigmoid)
+
+ Simple class to hold the parameters for a gate connection. We define
+ a gate loosely as something which compute the linear mix of two inputs,

Member

*which computes

#### craffel Aug 4, 2015

Member

Fixed.

 Change np.random.seed to get_rng in test_recurrent 
 a3194e0 

### f0k reviewed Aug 4, 2015

+ cell=lasagne.layers.Gate(
+     W_cell=None, nonlinearity=lasagne.nonlinearities.tanh),
+ outgate=lasagne.layers.Gate(),
+ nonlinearity_out=lasagne.nonlinearities.tanh,

#### f0k Aug 4, 2015

Member

Why not simply nonlinearity? That would make things easier if we ever decide to do #334.

#### craffel Aug 4, 2015

Member

I'm not sure what you mean, what is nonlinearity?

Edit: oh, you mean instead of calling it nonlinearity_out, call it nonlinearity? Well, it's because there are a lot of nonlinearities in the LSTM layer; this differentiates it from the other ones (which come from the gates). It used to be that there were lots of nonlinearities in the layer's signature, e.g. nonlinearity_ingate, etc. Now that there's only one that appears in the initializer, I suppose we could change its name.

#### f0k Aug 5, 2015

Member

Yes, that's what I meant, and that's also what I assumed it originated from :) +1 for renaming to be more compatible with other layers; this could come in handy some time (and it's more difficult to rename it after it's out).
/edit: Including the attribute (self.nonlinearity instead of self.nonlinearity_out).

Member

Ok, changed.

### f0k reviewed Aug 4, 2015

+ outgate=lasagne.layers.Gate(),
+ nonlinearity_out=lasagne.nonlinearities.tanh,
+ cell_init=b=lasagne.init.Constant(0.),
+ hid_init=b=lasagne.init.Constant(0.), backwards=False, learn_init=False,

#### f0k Aug 4, 2015

Member

The two b= seem to be misplaced.

#### craffel Aug 4, 2015

Member

Looks like those showed up in craffel@ca8288a, fixed.

### f0k reviewed Aug 4, 2015

+ peepholes : bool
+     If True, the LSTM uses peephole connections.
+     When False, W_cell_to_ingate, W_cell_to_forgetgate and
+     W_cell_to_outgate are ignored.

#### f0k Aug 4, 2015

Member

Those arguments don't exist since introducing the Gate class. You mean ingate.W_cell etc., right?

#### craffel Aug 4, 2015

Member

Indeed, good catch, fixed.

### f0k reviewed Aug 4, 2015

+ # Stack biases into a (4*num_units) vector
+ self.b_stacked = T.concatenate(
+     [self.b_ingate, self.b_forgetgate,
+      self.b_cell, self.b_outgate], axis=0)

#### f0k Aug 4, 2015

Member

I think I'd move the stacking to get_output_for(), the stacked matrices aren't needed anywhere else and they don't make too much sense as attributes (or is there a use case for them?). They could be mistaken for "real" parameters (i.e., shared variables) when inspecting a LSTMLayer instance interactively, while they're just an implementation detail of the forward pass.
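For reference, the stacking in question is just a concatenation of per-gate vectors; a numpy sketch with made-up sizes (the names mirror the snippet above, the values are placeholders):

```python
import numpy as np

num_units = 4
# One bias vector per gate (placeholder values).
b_ingate = np.zeros(num_units)
b_forgetgate = np.ones(num_units)  # e.g. forget bias initialized to 1
b_cell = np.zeros(num_units)
b_outgate = np.zeros(num_units)

# Computed on the fly inside get_output_for() rather than
# stored as an attribute in the constructor.
b_stacked = np.concatenate([b_ingate, b_forgetgate, b_cell, b_outgate])
```

Since this is cheap and purely derived from the real parameters, recomputing it in the forward pass loses nothing.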

#### craffel Aug 4, 2015

Member

Hmm... I wouldn't be so sure there isn't a use-case for them as attributes (cc @skaae), although I can't think of any off the top of my head. Moving them into get_output_for should be straightforward.

#### f0k Aug 5, 2015

Member

I wouldn't be so sure there isn't a use-case for them as attributes

Maybe. But I assume they're just here because one of the earlier implementations actually stacked the numpy arrays before turning them into a shared variable, and this had to happen in the constructor. Let's see if @skaae has any objections against moving that code.

#### skaae Aug 5, 2015

Member

I guess the stacking is in the constructor because the original code just initialized the stacked matrices. I think it's better to move it.

Member

Ok, moved it.

### f0k reviewed Aug 4, 2015

+ # Treat all dimensions after the second as flattened feature dimensions
+ if input.ndim > 3:
+     input = input.reshape((input.shape[0], input.shape[1],
+                            T.prod(input.shape[2:])))

#### f0k Aug 4, 2015

Member

Easier: input = T.flatten(input, 3). Could also be (unmeasurably) faster.

Member

Nice! Changed.
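The equivalence can be sketched in numpy (an illustration assuming a 4-d input; T.flatten(x, 3) keeps the first two axes and merges the rest into one):

```python
import numpy as np

x = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
# numpy analogue of T.flatten(x, 3): keep 2 leading axes, merge the rest.
flat = x.reshape(x.shape[0], x.shape[1], -1)
```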

### craffel added some commits Aug 4, 2015

 Fix misc docstring issues in recurrent 
 69f492e 
 Change flattening from reshape to flatten 
 03a52c0 

### f0k reviewed Aug 5, 2015

+
+ Gated Recurrent Unit (GRU) Layer
+
+ Implements the updates proposed in [1]_, which computes the output by

#### f0k Aug 5, 2015

Member

"Implements the updates" -- what kind of updates? Updates of what? This doesn't have anything to do with how parameters are updated during training, right? (I'm sure that it doesn't, but that was the first I thought of when reading "updates" in conjunction with Lasagne.)

#### craffel Aug 5, 2015

Member

Ah, yes, that's a little ambiguous. "Updates" in the context of an RNN unit means "how do you set the internal states given new input"; this is also the terminology used by scan. I will clarify.
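In that sense, the per-time-step "update" of a GRU hidden state looks roughly like this (a numpy sketch of the standard equations, not Lasagne's actual code; the stacked weight shapes are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wx, Wh, b):
    """One GRU update. Wx: (n_in, 3*n), Wh: (n, 3*n), b: (3*n,),
    with the three blocks ordered (reset, update, candidate)."""
    n = h_prev.shape[-1]
    pre_x = x_t.dot(Wx) + b
    pre_h = h_prev.dot(Wh)
    r = sigmoid(pre_x[..., :n] + pre_h[..., :n])          # reset gate
    z = sigmoid(pre_x[..., n:2*n] + pre_h[..., n:2*n])    # update gate
    c = np.tanh(pre_x[..., 2*n:] + r * pre_h[..., 2*n:])  # candidate
    return (1 - z) * h_prev + z * c                       # the "update"
```

No gradient updates anywhere: it is purely how the internal state is set given new input, as scan's terminology also uses the word.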

### f0k reviewed Aug 5, 2015

+class GRULayer(MergeLayer):
+    r"""
+    lasagne.layers.recurrent.GRULayer(incoming, num_units,
+        resetgate=lasagne.layers.Gate(W_cell=None),

#### f0k Aug 5, 2015

Member

What happens if W_cell is not None? Will it be used or ignored? What happens if W_cell is None for a LSTMLayer? Will it break?
--> Should Gate be split into two classes, one with W_cell and one without, or just kept as it is?

#### craffel Aug 5, 2015

Member

What happens if W_cell is not None? Will it be used or ignored?

It will be ignored.

What happens if W_cell is None for a LSTMLayer? Will it break?

If peepholes=True, yes; the layer will try to access the W_cell attribute, which doesn't exist for a Gate initialized with W_cell=None. We could probably throw a more helpful error there.

Should Gate be split into two classes, one with W_cell and one without, or just kept as it is?

I don't think so, they would be completely identical save about 2 lines. We discussed this somewhere...
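The "more helpful error" could be as simple as a guard like this (a sketch; the hypothetical Gate here stores plain values where the real class stores initializers):

```python
class Gate(object):
    def __init__(self, W_cell=0.1):
        if W_cell is not None:
            self.W_cell = W_cell  # attribute is simply absent otherwise

def check_peepholes(ingate, peepholes=True):
    """Fail with a readable message instead of a bare AttributeError."""
    if peepholes and not hasattr(ingate, 'W_cell'):
        raise ValueError(
            "peepholes=True requires the gate to be created with a "
            "W_cell initializer (got W_cell=None)")
```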

### f0k reviewed Aug 5, 2015

+ updategate=lasagne.layers.Gate(W_cell=None),
+ hidden_update=lasagne.layers.Gate(
+     W_cell=None, lasagne.nonlinearities.tanh),
+ hid_init=lasagne.init.Constant(0.), learn_init=True, backwards=False,

#### f0k Aug 5, 2015

Member

The order should be hid_init, backwards, learn_init (also below in the actual constructor definition), to match the docstring and to match LSTMLayer.

#### craffel Aug 5, 2015

Member

Good catch, thanks, fixed.

### f0k reviewed Aug 5, 2015

+ # Stack gate biases into a (3*num_units) vector
+ self.b_stacked = T.concatenate(
+     [self.b_resetgate, self.b_updategate,
+      self.b_hidden_update], axis=0)

#### f0k Aug 5, 2015

Member

Again, stacking the weights and biases should be moved into get_output_for().

### f0k reviewed Aug 5, 2015

+ hid_out = hid_out[:, ::-1, :]
+
+ self.hid_out = hid_out
+ return hid_out

#### f0k Aug 5, 2015

Member

Why this? Setting a class attribute in get_output_for() seems brittle. People should better be forced to obtain the output in the regular way.

#### skaae Aug 5, 2015

Member

I think that should be removed. I guess I added it because I had not figured out that Lasagne supported a regular way to get that output.
There is a similar problem in line 992 with the LSTM cell. I don't think you can refer to the cell in any way other than through the class attribute?

#### craffel Aug 5, 2015

Member

I don't think you can refer to the cell in any way other than through the class attribute?

Yes, that's addressed in #337, right? I removed both for now. Edit: Also had to remove a few tests. If this is actually not OK, please speak up!

### f0k reviewed Aug 5, 2015

 + Parameters + ---------- + + fun : function

#### f0k Aug 5, 2015

Member

fn, not fun

Member

Fixed.

Member

### f0k commented Aug 5, 2015

Okay, I've read through everything (finally!) and gave some more comments. Two last things that bug me:

- There's a lot of duplicated code between the three classes (CustomRecurrentLayer, LSTMLayer, GRULayer) in the constructors, in get_output_shape_for() and get_output_for(). It would be nice to factor that out into a common base class, but that can wait.
- This PR has 110 commits, some of which only change a single word, or fix things in an example that was subsequently removed. In other PRs we usually take care to squash certain commits so as not to fill up Lasagne's history with things not worth preserving (such as spelling or pep8 mistakes being introduced and removed), and maybe also to not skew the commit-count-based contributor statistics. In any case, this is a long-running PR, it definitely has some history worth preserving, and cleaning up the history by cherry-picking and squashing commits would be very cumbersome (compared to our usual PRs of 3 or 4 commits by a single author). So I'd be fine with making an exception.

PS: Looking forward to playing around with this! :)
Member

### benanne commented Aug 5, 2015

 Yeah, let's not waste any time cleaning this up. It's gone on long enough :)

### craffel added some commits Aug 5, 2015

 Change nonlinearity_out to nonlinearity in LSTMLayer 
 5106c1c 
 Move parameter stacking to get_output_for 
 9040ffc 
 Fix kwarg order in GRULayer 
 16403a5 
 Change wording; updates to recurrent step 
 5f3033e 
 Remove self storage of hid_ and cell_out 
 f4d15ba 
 Fix fun -> fn in unroll_scan docstring 
 b7bb816 
Member

### craffel commented Aug 5, 2015

> Okay, I've read through everything (finally!) and gave some more comments.

I've addressed everything now, I think.

> There's a lot of duplicated code between the three classes (CustomRecurrentLayer, LSTMLayer, GRULayer) in the constructors, in get_output_shape_for() and get_output_for(). It would be nice to factor that out into a common base class, but that can wait.

Agreed, it bothers me too. The main argument against it (which was raised by @skaae elsewhere) is that it will be hard to make it equally readable/modifiable after refactoring for code sharing. Either way, this will not be something that changes the interface or behavior, so I am very supportive of not trying to do it now.

> This PR has 110 commits, some of which only change a single word, or fix things in an example that was subsequently removed. In other PRs we usually take care to squash certain commits to not fill up Lasagne's history with things not worth preserving, and maybe also to not skew the commit-count-based contributor statistics. In any case, this is a long-running PR, it definitely has some history worth preserving, and cleaning up the history by cherry-picking and squashing commits would be very cumbersome. So I'd be fine with making an exception.

I can do some squashing of trivial commits, but I'd also like to preserve the commit history as much as possible. A major difference here is that I and many others have been using this codebase for many months, and it (like Lasagne) has changed a good deal in that time, so it's very nice for me to be able to retrieve the codebase at a certain point in time for backwards/parallel compatibility. Squashing functional-change commits would make this impossible. But you're agreeing with me here, so let's leave it at that :)

### f0k commented on lasagne/layers/recurrent.py in f4d15baAug 6, 2015

 Ah, I see, this was Søren's trick for providing multiple outputs. @skaae, I hope you're fine with merging the PR without this hack and introducing a proper solution for multiple-output layers afterwards (Lasagne#337)!

### skaae replied Aug 6, 2015

Yes. One very easy solution is to use a GRU instead.
Member

### f0k commented Aug 6, 2015

> I've addressed everything now, I think.

Yes, looks good! Thanks!

> I can do some squashing of trivial commits, but I'd also like to preserve the commit history as much as possible - a major difference here is that I and many others have been using this codebase for many months

That's a good point. You'll probably want to keep that branch of yours alive for a while.

> It's gone on long enough :)

All right, let's merge it then! I'll wait until the afternoon, just to give people a last chance to chime in.
Member

### f0k commented Aug 6, 2015

 Will merge the beast at 15:00 CEST. Get your champagne out.

### f0k added a commit that referenced this pull request Aug 6, 2015

 Merge pull request #294 from craffel/recurrent 
Add recurrent layers
 4c882e6 

### f0k merged commit 4c882e6 into Lasagne:master Aug 6, 2015 2 checks passed

#### 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage remained the same at 100.0%
Details
Contributor

 Yay!
Member

### ebenolson commented Aug 6, 2015

 I think the merge closed some unrelated issues..
Member

### f0k commented Aug 6, 2015

> I think the merge closed some unrelated issues..

I was wondering about that. GitHub always linked the issues in the commit messages to the ones in @craffel's repository, so I hoped it wouldn't close our issues now. Let's reopen them one by one then...
Member

### benanne commented Aug 6, 2015

 Sweet! Nice work everyone :)
Member

### craffel commented Aug 6, 2015

Woo!

> I was wondering about that. GitHub always linked the issues in the commit messages to the ones in @craffel's repository, so I hoped it wouldn't close our issues now.

Funny, I've never seen that before. Sorry about that; I guess I will stop doing that.

Open

Contributor

### JackKelly commented Aug 7, 2015

 Awesome work, everyone! Thank you!

### craffel added a commit to craffel/Lasagne-tutorial that referenced this pull request Nov 9, 2015

 Change recurrent layers to accept mask input layer 
As discussed here:
Lasagne/Lasagne#294 (comment)
when mask was supplied as a kwarg to get_output, it's not possible
to pass different masks to different layers in a larger network.  A
good example would be when there are two network branches which are
receiving different input.  We decided here:
Lasagne/Lasagne#337
that the best approach (for now) is to make the recurrent layers
subclass MergeLayer, and have two inputs: One for the actual layer
input, and another for the mask.  This commit implements that change
with a new keyword arg to the constructor of all of the recurrent
layer classes, mask_input.  It also updates all of the tests and
the example.
 093bc87
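The effect of such a mask in a recurrent step can be sketched in numpy (an illustration of the idea, not Lasagne's actual code): where the mask is 0, the previous hidden state is carried through unchanged, so shorter sequences in a padded batch stop updating after their true length.

```python
import numpy as np

def masked_step(h_new, h_prev, mask_t):
    """Apply a binary mask: keep h_new where mask is 1, h_prev where 0."""
    m = mask_t[:, None]  # broadcast (batch,) over the unit dimension
    return m * h_new + (1 - m) * h_prev
```

Feeding the mask through a separate InputLayer (the mask_input kwarg described above) is what lets two branches of a network use different masks.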