Arbitrary expressions as Layer parameters #11

Closed
benanne opened this Issue Sep 11, 2014 · 18 comments

@benanne
Member

benanne commented Sep 11, 2014

Currently, Layer parameters are assumed to be Theano shared variables. That way we can call get_value() and set_value() on them, and pass them to theano.function in the updates argument.

However, it would be cool if Layer parameters could be arbitrary Theano expressions. There are a few use cases for this:

  • autoencoders with tied weights. If you have two layers l1 and l2, you might want to do something like l2 = nntools.layers.DenseLayer(l1, W=l1.W.T). This is currently not possible because l1.W.T is not a shared variable, but rather an expression.
  • sometimes the domain of parameters is restricted, and you might want to reparameterize things. For example, you might want all values to be positive. In that case you could reparameterize as follows: l2 = nntools.layers.DenseLayer(l1, W=T.exp(V)), where V is a shared variable. This ensures that all values in W are positive. (A rough sketch of both use cases follows right after this list.)
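
For concreteness, here is roughly what both use cases would look like if arbitrary expressions were accepted. This is hypothetical usage under the proposal; the input layer, the layer sizes and the batch size are made up for illustration:

import numpy as np
import theano
import theano.tensor as T
import nntools

l_in = nntools.layers.InputLayer((128, 784))

# tied weights: the decoder reuses the encoder's weight matrix, transposed
l1 = nntools.layers.DenseLayer(l_in, num_units=100)
l2 = nntools.layers.DenseLayer(l1, num_units=784, W=l1.W.T)

# reparameterization: learn log-weights V, so that W = exp(V) stays positive
V = theano.shared(np.random.randn(100, 10).astype(theano.config.floatX))
l3 = nntools.layers.DenseLayer(l1, num_units=10, W=T.exp(V))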

However, this requires some modifications to nntools.layers.get_all_params(), because we need a way to get the actual shared variables containing parameter values, not just the Theano expressions built on top of them. Otherwise there's no way to update them.

Given an arbitrary Theano expression, it is fairly easy to get a list of all shared variables that occur in it by traversing the Theano graph. This isn't very 'clean' I suppose, but it is definitely possible.
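
For reference, a sketch of that traversal. It relies on theano.gof.graph.inputs(), which returns the leaf variables a list of expressions depends on; the function name is just a suggestion:

import theano
from theano.gof import graph

def shared_variables_in(expressions):
    """Return the shared variables that the given Theano expressions depend on."""
    return [v for v in graph.inputs(expressions)
            if isinstance(v, theano.compile.SharedVariable)]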

However, that just gives us a list of all shared variables, and we have no way of knowing if all of those contain learnable parameters. Perhaps some of them should not be touched by the learning.

We could assume that all shared variables represent learnable parameters by default. This would usually be the case. But then we need to provide a way for the user to specify that a given variable is not to be touched (for example, this could be a variable that contains a binary mask that restricts some of the parameters to be zero). Perhaps an extra attribute on the Layer instances that lists all "non-trainable" shared variables.

Should we support arbitrary expressions as Layer parameters? If so, there will be some added complexity, but it might be worth it. If we do not support this, a new Layer subclass has to be implemented for every new parameterization (so to support autoencoders with tied weights, we would need to implement a TransposedDenseLayer, for example).

What do you guys think?

@f0k

Member

f0k commented Sep 11, 2014

As we discussed via email before, my only concern is that allowing arbitrary expressions will make it more difficult to store/load models in HDF5 format (but still possible if I don't insist on restoring the original configuration of variables, just an equivalent network with respect to the forward pass). That's a minor disadvantage compared to the flexibility we would gain in defining and training models without writing any additional classes.

Regarding the added complexity, get_all_params() and its siblings would be easy to extend to find the shared variables in all parameter expressions. Looking at the example you put up and nntools.updates, providing a way for a user to exclude certain variables from being updated just boils down to excluding some of the variables in computing the updates.

Another problem is regularization. Currently nntools.regularization takes a Layer and then returns regularization expressions for all parameters found in the graph. I see two problems: a) It's not easily possible to obtain expressions for a subset of the parameters and b) if get_all_params() returns the shared variables for the parameter expressions rather than the parameter expressions, then we might not regularize the correct things -- e.g., when parameterizing some weight matrix as T.exp(W), an L1 regularizer should probably still affect the weight matrix and not its logarithm. One way to solve this would be to have nntools.regularization functions take Theano expressions rather than a Layer instance, and to have get_all_params() return the parameter expressions with a second helper function get_shared_variables(expressions, exclude=None) scraping the shared variables from the former, optionally excluding particular variables (this could be implicitly called in each of the nntools.updates functions just to be sure and to make it easier to use).
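
A sketch of the proposed get_shared_variables() helper, building on the same graph traversal as above (hypothetical; the exact name and placement would still have to be decided):

import theano
from theano.gof import graph

def get_shared_variables(expressions, exclude=None):
    """Scrape the shared variables out of a list of parameter expressions,
    optionally excluding particular variables (e.g. fixed binary masks)."""
    exclude = set(exclude) if exclude is not None else set()
    return [v for v in graph.inputs(expressions)
            if isinstance(v, theano.compile.SharedVariable) and v not in exclude]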

@benanne

Member

benanne commented Sep 11, 2014

The regularization thing was an afterthought :) So that could probably be revamped completely. I agree that if possible, the stuff in nntools.regularization should take Theano expressions as input, not layers, so that it could be used in isolation. But that should probably be discussed in a separate issue (I have a list with another 15 or so issues to write up, I'll continue tomorrow).

@benanne

Member

benanne commented Mar 2, 2015

I guess the current idea is that we are not supporting this - in other words, it is safe to assume that layer parameters are always shared variables in the code (and indeed, this assumption is already made in many parts of the code anyway). If users want to reparameterize a layer, the preferred method is to subclass it.
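
For completeness, a rough sketch of what such a subclass could look like for the tied-weights case. The TransposedDenseLayer name is hypothetical; the sketch assumes the current Layer/add_param() API and a 2-D input to the tied layer:

import theano.tensor as T
import lasagne

class TransposedDenseLayer(lasagne.layers.Layer):
    """Sketch of a dense layer whose weight matrix is the transpose of
    another DenseLayer's W, i.e. tied weights for an autoencoder decoder."""
    def __init__(self, incoming, tied_layer, b=lasagne.init.Constant(0.),
                 nonlinearity=lasagne.nonlinearities.rectify, **kwargs):
        super(TransposedDenseLayer, self).__init__(incoming, **kwargs)
        self.nonlinearity = nonlinearity
        # reuse the encoder's shared variable directly; it is not registered
        # again here, but it still gets trained because the tied layer
        # already registered it as one of its own parameters
        self.W = tied_layer.W
        self.num_units = tied_layer.input_shape[1]  # assumes a 2-D input
        self.b = self.add_param(b, (self.num_units,), name="b",
                                regularizable=False)

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], self.num_units)

    def get_output_for(self, input, **kwargs):
        return self.nonlinearity(T.dot(input, self.W.T) + self.b)

An autoencoder would then use something like TransposedDenseLayer(l_hidden, tied_layer=l_hidden) as its decoder.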

@benanne benanne closed this Mar 2, 2015

@f0k f0k referenced this issue in craffel/nntools Jun 2, 2015

Merged

refactor recurrent, update examples, add tests #27

@f0k f0k reopened this Aug 25, 2015

@f0k

Member

f0k commented Aug 25, 2015

Reopening as this issue came up again on the mailing list: https://groups.google.com/forum/#!topic/lasagne-users/ABiUmAIT-ho

To paraphrase, there are two assumptions in our code and/or user code, one of which we would have to break to support this. The assumptions are: a) Every self.b and self.W is a shared variable, and b) self.params contains self.b and self.W.

I'd suggest breaking the first one, such that when a user supplies a custom expression for a constructor parameter W, that layer's self.W will be set to that expression, and that expression will also be used as the key in self.params.

Furthermore, both our code and user code rely on get_all_params() returning a duplicate-free set of shared variables. I'd suggest changing it to extract all shared variables from the parameter expressions by default. This means that when somebody sets W=T.exp(some_matrix), some_matrix will be treated as the actual parameter for most purposes -- i.e., training, regularization, counting parameters, storing and restoring values, and so on. It will also practically inherit all tags the layer gave to W.
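
In code, the proposed behaviour would be roughly this (hypothetical; none of it is implemented yet, and the layer sizes are made up):

import numpy as np
import theano
import theano.tensor as T
import lasagne

l_in = lasagne.layers.InputLayer((None, 50))
l1 = lasagne.layers.DenseLayer(l_in, num_units=100)
some_matrix = theano.shared(np.zeros((100, 10), dtype=theano.config.floatX))
l2 = lasagne.layers.DenseLayer(l1, num_units=10, W=T.exp(some_matrix))

# get_all_params() would unwrap T.exp(some_matrix) and return some_matrix,
# so updates, regularization and (get|set)_all_param_values() all act on it:
assert some_matrix in lasagne.layers.get_all_params(l2)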

Sander had the objection that if an expression depends on multiple shared variables, it's unclear whether all of them are meant to be trained, and said:

maybe we should settle on a convention where people set a certain attribute on a shared variable that should not be trained. Kind of like how we have parameters with 'trainable' tags right now. But that starts to get pretty complex.

My suggestion would be to not bother with this, and just have all shared variables involved in an expression be collected by get_all_params(). Keeping some underlying parameters fixed can easily be solved by making them constants instead of shared variables. The only use case that wouldn't be covered would be setting W to some expression that involves two shared variables, both of which should be trained, but only one of which should be regularized. I think this is exotic enough to not bother with it for now, following our design goal Pragmatism. People can always subclass a layer to accomplish more complex parameterization.

Then I think we have to make the conceptual difference between the 'virtual' parameter variables (which might be expressions, e.g. self.W and self.b) and the 'real' underlying parameter variables more explicit. Else this could lead to a lot of confusion in the long run. We might also want to split get_all_params() into two separate functions, one that gets the virtual parameters (which may include expressions), and one that gets the real parameters (which are guaranteed to be shared variables).

I'd strongly vote for having get_all_params() return the 'real' parameter variables by default, because that way most existing code would continue to work without changes. This could be realized via a utils.extract_shared_vars() helper function called from within get_all_params() if extract_shared=True is given, so people could use get_all_params(..., extract_shared=False) to obtain the 'virtual' parameter variables.

And as mentioned before, we should have some mechanism to mark variables as non-trainable in that case.

I disagree. Layers give tags to 'virtual' parameters, and those should just apply to all 'real' variables contained within. Everything else is unnecessarily complex for now. (And as I said, there's always T.constant if it's just about trainability.)

/edit: There also was the suggestion of only allowing parameter expressions that depend on exactly one shared variable, but I'm not actually sure how that would help.

/edit edit: If we allow arbitrary expressions, it's also possible that a 'virtual' parameter doesn't have any 'real' parameter at all -- e.g., because it is a constant, or random.
It's also possible that a 'virtual' parameter is computed by a network and suddenly corresponds to a whole lot of 'real' parameters that would all inherit whatever tags the 'virtual' parameter had. There's no way around that except for changing the tagging system to tag shared variables directly, instead of using a params dictionary.

@skaae

Member

skaae commented Aug 25, 2015

@f0k: I agree that it would be nice to support autoencoders. Can you elaborate on your last edit? Would it be a problem if a variable was a constant?
It would be pretty cool to support parameters being generated from networks :)

@benanne

Member

benanne commented Aug 25, 2015

Keeping some underlying parameters fixed can easily be solved by making them constants instead of shared variables.

But then how would you change their values during training? I might not want to perform gradient updates on a given variable, but that doesn't mean I don't want to change it at all :)

The only use case that wouldn't be covered would be setting W to some expression that involves two shared variables, both of which should be trained, but only one of which should be regularized. I think this is exotic enough to not bother with it for now, following our design goal Pragmatism. People can always subclass a layer to accomplish more complex parameterization.

Agreed, I could live with that.

I'd strongly vote for having get_all_params() to return the 'real' parameter variables by default, because that way most existing code would continue to work without changes.

I think we need to reflect on nomenclature here. If we are going to refer to both real and virtual parameters as 'params', we are going to hopelessly confuse people. I agree that letting get_all_params() return the variables is the easiest and most straightforward thing to do, but I can't get behind a situation where Layer.params potentially contains expressions, and get_all_params() only returns shared variables. Referring to both of these as 'params' is a bad idea. Maybe we should call the latter 'variables' or something? That's not really satisfactory either I guess. 'real params' and 'virtual params' is too cumbersome, but this terminology will do nicely for this discussion at least.

Also let's keep in mind that we want to shield any users who do not wish to use this feature (i.e. the majority of them, most likely) from ever having to deal with any of this. From what's discussed so far I think this would be the case, but let's just keep it in mind nevertheless.

I disagree. Layers give tags to 'virtual' parameters, and those should just apply to all 'real' variables contained within. Everything else is unnecessarily complex for now. (And as I said, there's always T.constant if it's just about trainability.)

Fair enough, except for the T.constant thing :p

@f0k

Member

f0k commented Aug 26, 2015

But then how would you change their values during training? I might not want to perform gradient updates on a given variable, but that doesn't mean I don't want to change it at all :)

Oh, right.

People can always subclass a layer to accomplish more complex parameterization.

Agreed, I could live with that.

If T.constant is not an option and subclassing is too much effort, people can also easily exclude a variable from training by removing it from the params list before passing it to lasagne.updates.something. That's not too difficult either, actually. They could even use the SharedVariable.tag property for that, without us having to dirty our hands ;)
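
For example, excluding a variable from training without any library support could look like this (network and loss are assumed to exist already; the 'do_not_train' tag attribute is made up, anything could be stored on SharedVariable.tag):

params = lasagne.layers.get_all_params(network, trainable=True)
params = [p for p in params if not getattr(p.tag, 'do_not_train', False)]
updates = lasagne.updates.nesterov_momentum(loss, params, learning_rate=0.01)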

I think we need to reflect on nomenclature here. If we are going to refer to both real and virtual parameters as 'params', we are going to hopelessly confuse people.

Yes... my idea was that only very few people would need to deal with that. Most people won't use the feature, and the ones that do DenseLayer(..., W=otherlayer.W.T) probably wouldn't think about what get_all_params() does if it just continues to work.

Referring to both of these as 'params' is a bad idea. Maybe we should call the latter 'variables' or something? That's not really satisfactory either I guess. 'real params' and 'virtual params' is too cumbersome, but this terminology will do nicely for this discussion at least.

Referring to the latter as variables would mean we'd need to change a lot of documentation and function names (get_all_variables(), set_all_variable_values()) and be explicit about the distinction everywhere. That'd be very much against our fifth design goal. I'd try to avoid distinguishing the two as much as possible, to not confuse users who don't use the feature at all. That is, I'd continue to refer to everything as "parameters", and only where needed, add a note about what happens in the special case of a parameter being a "parameter expression".

I agree that letting get_all_params() return the variables is the easiest and most straightforward thing to do, but I can't get behind a situation where Layer.params potentially contains expressions, and get_all_params() only returns shared variables.

I think that's totally fine if get_all_params() is documented to do that in the special case of Layer.params containing parameter expressions. All uses of get_all_params() in the library expect it to return shared variables. And again, we could give it a boolean flag that tells whether to unwrap parameter expressions or not, in case any user actually wants to obtain the parameter expressions for some reason (but what would be a use case? note that we couldn't even guarantee that it's free of duplicates in this case).

@benanne

Member

benanne commented Aug 26, 2015

You are probably right. This is a less common use case and keeping everything as params is the best way to allow people to completely ignore this if they don't need it. Renaming essential functions like get_all_params() at this point would be a terrible idea anyway.

The one situation I'm still concerned about is when people don't use get_all_params() to acquire the list of parameters (well, shared variables). I could imagine people accidentally passing parameter expressions to the lasagne.updates functions for example. Maybe we should take precautions to ensure that they get a meaningful error message in that case.
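
Such a precaution could be a small check at the entry points of lasagne.updates, roughly along these lines (a sketch only; the exact wording and placement would be sorted out in a PR):

import theano

def _check_params(params):
    """Raise a helpful error if any 'parameter' is not a shared variable."""
    for param in params:
        if not isinstance(param, theano.compile.SharedVariable):
            raise ValueError(
                "expected shared variables as parameters, got %r -- did you "
                "pass a parameter expression instead of the underlying "
                "shared variable(s)?" % param)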

Maybe we should start drafting a PR for this, to see if any other hurdles come up that we've missed.

@f0k

Member

f0k commented Aug 26, 2015

The one situation I'm still concerned about is when people don't use get_all_params() to acquire the list of parameters (well, shared variables). I could imagine people accidentally passing parameter expressions to the lasagne.updates functions for example. Maybe we should take precautions to ensure that they get a meaningful error message in that case.

Good point!

Looking into the code, I think we should probably even change Layer.get_params() to unwrap parameter expressions, rather than get_all_params(). That would only put users at risk who directly access Layer.params.keys(). And I think I wouldn't provide a way to obtain the parameter expressions, not until we find a use case for that. For visualization purposes and the like, one can still just access Layer.W and Layer.b to obtain the expressions.

Maybe we should start drafting a PR for this, to see if any other hurdles come up that we've missed.

One thing came to my mind: the recurrent layers already support taking a Theano expression for the initial hidden state, but not with the intention of including all shared variables in that expression in training. The recurrent layers do have a flag telling whether to learn the initial hidden state, though, so at least for training their behaviour wouldn't change if we officially allowed Theano expressions for parameters. We'd need to check what it means for get_all_param_values(), though.

@skaae, @craffel: What kind of expression do you usually pass for hid_init in a recurrent layer (if you pass an expression instead of a callable, that is)?

@craffel

Member

craffel commented Aug 26, 2015

@skaae, @craffel: What kind of expression do you usually pass for hid_init in a recurrent layer (if you pass an expression instead of a callable, that is)?

This is @skaae's change, but I've never seen it not just be a TensorVariable.

@f0k

Member

f0k commented Aug 26, 2015

But any expression involving tensors is a TensorVariable:

In [1]: import theano
Using gpu device 0: GeForce GT 640 (CNMeM is disabled)
In [2]: x = theano.tensor.matrix()

In [3]: type(x)
Out[3]: theano.tensor.var.TensorVariable

In [4]: type(x**2)
Out[4]: theano.tensor.var.TensorVariable

The question is whether the expression could be the output of a neural network, for example (i.e., whether it could depend on some shared variables). If it's always just a symbolic input variable (i.e., its owner attribute is None), we're fine!
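
Continuing the session above, the owner check tells the two cases apart:

In [5]: x.owner is None        # a plain symbolic input variable
Out[5]: True

In [6]: (x**2).owner is None   # an expression built from an apply node
Out[6]: False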

@craffel

Member

craffel commented Aug 26, 2015

Sorry, to be more specific, I think the use-case it was intended for was TensorVariable expressions which do not involve any shared variables (although we don't do anything to enforce this).

@f0k

Member

f0k commented Aug 26, 2015

although we don't do anything to enforce this

That's okay though, I'd just be worried if we'd knowingly break existing use cases.

I think I could do a first PR sometime next week, to further check and discuss implications.

@skaae

Member

skaae commented Sep 9, 2015

Sorry for the slow answer.

The only use case I know of for hid_init being a TensorVariable is for language models.
For recurrent networks, I think a more flexible solution might be to allow hid_init to be a layer?

@f0k

Member

f0k commented Sep 10, 2015

The only use case I know of for hid_init being a TensorVariable is for language models.

But in that use case, what kind of TensorVariable would that be? Just a T.matrix(), or some complex expression involving shared variables?

For recurrent networks, I think a more flexible solution might be to allow hid_init to be a layer?

Would be doable, given that the recurrent layers are MergeLayers already.

@skaae

Member

skaae commented Sep 10, 2015

But in that use case, what kind of TensorVariable would that be? Just a T.matrix(), or some complex expression involving shared variables?

A T.matrix.

Would be doable, given that the recurrent layers are MergeLayers already.

Yes. But couldn't we remove the support for hid_init being a tensor variable and instead allow hid_init to be a layer?
Unless I'm missing something, the layer case supports the language model use case.

@f0k

Member

f0k commented Sep 10, 2015

Unless I'm missing something, the layer case supports the language model use case.

Sure, if you want it to be a basic tensor variable, you'd need to set it to an InputLayer instance then, which has that variable (or expression) as its input_var.

But couldn't we remove the support for hid_init being a tensor variable

The point of this Issue (#11) is to allow all network parameters to be Theano expressions, so that would naturally include the case of it being a T.matrix(). No wait, actually, it would just allow it to be a T.vector(), since it's also a vector when it's a shared variable, and the recurrent layer code shouldn't have to distinguish these cases. So we'll break backwards compatibility (but probably for very very few users). Unless we change the recurrent layer code to do the T.dot(ones, ...) expansion depending on hid_init.ndim instead of isinstance(hid_init, ...) and allow add_param() to skip the dimensionality check, or accept multiple allowed shapes. Okay, thinking aloud here :)
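
To make the thinking aloud concrete, the ndim-based expansion inside a recurrent layer could look roughly like this (num_batch is the symbolic batch size; this is only a sketch of the idea, not a finished design):

if self.hid_init.ndim == 1:
    # a single (num_units,) initial state, broadcast across the mini-batch
    hid_init = T.dot(T.ones((num_batch, 1)), self.hid_init.dimshuffle('x', 0))
else:
    # already (num_batch, num_units), e.g. an expression computed per example
    hid_init = self.hid_init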

and instead allow hid_init to be a layer?

That would be a good idea then, as it allows hid_init to be a matrix (with a separate initial state for each example in a mini-batch) instead of a vector.

@f0k

Member

f0k commented Oct 15, 2015

Closing. The main idea has been implemented in #404, and what was discussed in the end for recurrent layers has been moved to a new issue #462.
