
Simple "reapply" functionality #720

Open
justheuristic opened this issue Jul 7, 2016 · 12 comments

@justheuristic

justheuristic commented Jul 7, 2016

Greetings!

It's probably a minor issue in most image-CNN-related cases, but when dealing with text, multi-input NNs, reinforcement learning, or long-term memory networks, some layers need to be applied at multiple spots with the same weights.

Some usage cases:

  • Text processing: apply the same embedding and text convolution / recurrent processing to both an advertisement's title and its description.
  • Reinforcement learning: the critic subnetwork in the deterministic policy gradient algorithm has to evaluate both the "optimal" and the actual (i.e., with exploration) actor output, hence it must be applied to both of them.

As far as I understood the docs, one natural way to do so in Lasagne is by passing another layer's params on creation, e.g.

l_d0 = DenseLayer(bottom_layer, 10)
l_d0_again = DenseLayer(other_bottom_layer, 10, W=l_d0.W, b=l_d0.b)

or, alternatively, create several networks and use get_output:

out1 = lasagne.layers.get_output(my_nn, {my_nn_input: input1})
out2 = lasagne.layers.get_output(my_nn, {my_nn_input: input2})

The problem with the first approach is that it takes a great many lines of code for larger networks, especially e.g. an LSTM with many more parameters. Not to mention it's easy to make a mistake that way.

The problem with the second approach is that it forces you to step outside the Lasagne layer graph, making routine tasks (e.g. getting all params, regularizing, applying flags) more complicated and verbose.

It may be a good idea to introduce some simpler method to do that.

For example, in Blocks, layers can be .apply-ed to as many spots as necessary with all params shared:

>>> from blocks.bricks import Linear
>>> from blocks.initialization import IsotropicGaussian, Constant
>>> linear = Linear(input_dim=10, output_dim=5,
...                 weights_init=IsotropicGaussian(),
...                 biases_init=Constant(0.01))
>>> y = linear.apply(x)

This complicates the code a lot, and I personally don't think it's the best solution, since most layers are only applied once.

Another approach is a .reapply() method for layers with params that would work like a cloning constructor reusing the original layer's weights, but that too makes cloning large nets cumbersome:

l_d0 = DenseLayer(bottom_layer, 10)
l_d0_again = l_d0.reapply(other_bottom_layer)

In one of our libraries for reinforcement learning, we use a generic function to clone a Lasagne network with or without keeping the params.
The source can be found here - https://github.com/yandexdataschool/AgentNet/blob/master/agentnet/utils/clone.py
Some usage examples - https://github.com/yandexdataschool/AgentNet/blob/master/tests/test_clone_and_targets.py#L45

The question is: is there a recommended way to apply network parts multiple times in Lasagne?

P.S. If our clone_network fits the Lasagne spirit, we'd be glad to contribute the code (or you can just grab it at will; as far as I understand, the license allows it). The code has almost no dependencies apart from Theano and Lasagne.

@f0k
Member

f0k commented Jul 8, 2016

In one of our libraries for reinforcement learning we use a generic function to clone a lasagne network with or without keeping the params.

Interesting to see that this works -- so basically, copy.deepcopy is able to recreate a network?

is there any way to implement functionality of applying network parts multiple times in Lasagne?

There's one solution that you haven't mentioned: Construct your network such that it processes both parts at once, and split them up afterwards. I.e., instead of:

out1 = lasagne.layers.get_output(my_nn, {my_nn_input: input1})
out2 = lasagne.layers.get_output(my_nn, {my_nn_input: input2})

do:

input = T.concatenate((input1, input2), axis=0)
...
l_out1 = lasagne.layers.SliceLayer(my_nn, slice(None, len(input1)), axis=0)
l_out2 = lasagne.layers.SliceLayer(my_nn, slice(len(input1), None), axis=0)

Now you can continue constructing your network, and don't have to worry about collecting parameters from the different fragments. This will also result in a more efficient solution, since Theano can process both input parts in a single minibatch. However, it won't work well for recurrent networks if the two input parts have very different sequence lengths (the shorter one would have to be padded).
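The equivalence behind this trick can be sanity-checked outside of Theano. Below is a minimal NumPy sketch (a single linear map stands in for the shared network; all names are made up), showing that one application to the concatenated batch, followed by slicing, matches two separate applications, as long as the network contains no operation that mixes examples across the batch axis:

```python
import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(8, 5)  # weights of a stand-in "network": one linear map

def f(x):
    # stand-in for the shared network, applied along the batch axis
    return x.dot(W)

input1 = rng.randn(3, 8)  # batch of 3 examples
input2 = rng.randn(4, 8)  # batch of 4 examples

# two separate applications (the get_output-per-input approach)
out1, out2 = f(input1), f(input2)

# one application to the concatenated batch, then slice afterwards
joint = f(np.concatenate((input1, input2), axis=0))
out1_j, out2_j = joint[:len(input1)], joint[len(input1):]

assert np.allclose(out1, out1_j) and np.allclose(out2, out2_j)
```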

if our clone_network can fit Lasagne spirit

It would be nice to have such functionality, but we'll have to investigate whether copy.deepcopy really does what we need or whether we would have to extend the Layer interface in some way to allow cloning in all cases. But in general, yes, this would be a nice addition if it's easy to do.


Thinking a little more, another solution would be a special layer class that embeds the application of a network to its input(s):

class MacroLayer(MergeLayer):
    # (needs: import itertools)
    def __init__(self, incomings, network, input_layers=None, **kwargs):
        super(MacroLayer, self).__init__(incomings, **kwargs)
        self.network = network
        all_layers = lasagne.layers.get_all_layers(network)
        if input_layers is None:
            input_layers = [layer for layer in all_layers
                            if getattr(layer, 'input_layer', None) is None]
        self.input_layers = input_layers
        # expose the embedded network's parameters so get_params() finds them
        self.params = dict(itertools.chain.from_iterable(
            layer.params.items() for layer in all_layers))

    def get_output_shape_for(self, input_shapes):
        return lasagne.layers.get_output_shape(
            self.network, dict(zip(self.input_layers, input_shapes)))

    def get_output_for(self, inputs, **kwargs):
        return lasagne.layers.get_output(
            self.network, dict(zip(self.input_layers, inputs)), **kwargs)

Each branch of your network would become a MacroLayer then, each linked to its own InputLayer instance, but both containing the same network for their computation steps. lasagne.layers.get_all_layers() wouldn't expand the MacroLayer, but you could still retrieve all its parameters. Ideally, this layer would be derived from the new layer base class (#678), not MergeLayer or Layer, so it can be used both with single or multiple input layers and also produce multiple outputs.
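As a side note, the self.params line in the sketch above merges each embedded layer's parameter dictionary into one. The idiom itself is plain Python; the snippet below uses made-up string keys, whereas in Lasagne, Layer.params maps Theano shared variables to sets of tags:

```python
import itertools

# hypothetical per-layer parameter dicts (name -> set of tags);
# in Lasagne, Layer.params maps shared variables to tag sets
layer1_params = {'W1': {'trainable', 'regularizable'}, 'b1': {'trainable'}}
layer2_params = {'W2': {'trainable', 'regularizable'}}

# merge all dicts into one, later dicts winning on key collisions
merged = dict(itertools.chain.from_iterable(
    d.items() for d in (layer1_params, layer2_params)))

assert set(merged) == {'W1', 'b1', 'W2'}
```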

@justheuristic
Author

justheuristic commented Jul 9, 2016

@f0k

Interesting to see that this works -- so basically, copy.deepcopy is able to recreate a network?

I was surprised myself, but the cloned networks seem to work perfectly right. I even lost a bet (for a sandwich) when I promised to find a way to break the code :)

Construct your network such that it processes both parts at once, and split them up afterwards. I.e., instead of:

That's awesome! You have probably just saved one of my friends 10h of compilation time in his High Energy Physics research :)

MacroLayer

On one hand, that's much more code if you just want to copy a single layer.
On the other, it seems much better if the network has to be reapplied many times, and it probably has some common ground with the "recurrent container layer" idea from the pull requests.
I believe it all boils down to looking at typical use cases and seeing which option results in more intuitive code.

if it's easy to do

That piece of code has no dependencies apart from a check_list helper that converts things (tuples, sets) to a list, which must have some analogue in the Lasagne utils.
Again, if you think it's useful, I'll prepare a pull request. May I?

@f0k
Member

f0k commented Jul 11, 2016

probably has some common ground with the "recurrent container layer" idea from pull requests

Definitely. The recurrent container does that as well, and in both cases we need to figure out the best API for linking the embedded network to the outer inputs.

Again, if you think it's useful, i'll prepare a pull request - may i?

How can we know that copy.deepcopy is really unbreakable? It's nice that you had a bet on it, but maybe a sandwich just wasn't enough of an incentive :)

Looking at your implementation, I'm worried about the following:

  • if share_params is False, are you sure the deepcopied shared variables are independent from the originals? E.g., theano.clone creates new instances of shared variables that have the same underlying storage.
  • if share_params is True, you only share the SharedVariable instances, but recreate parameter expressions.
  • it's quite a lot of code, can't this be simplified?
  • I'm not sure about the bottom_layers / share_params / share_inputs API. In particular, it seems you allow bottom_layers to include parameters as well, and it's a bit inelegant to allow bottom_layers to be either a list or a dictionary -- this should probably be split up, with different names (e.g., a shared_layers list and a replacements dictionary).

If we can make the implementation and API a little nicer and we are sure it cannot break, I'd be fine with including it. @benanne, what's your opinion?

@f0k f0k added this to the v0.3 milestone Jul 11, 2016
@benanne
Member

benanne commented Jul 13, 2016

Construct your network such that it processes both parts at once, and split them up afterwards. I.e., instead of:

That's awesome! You have probably just saved one of my friends 10h of compilation time in his High Energy Physics research :)

We should probably document this trick somewhere, it's super useful and it's something that keeps coming up on the mailing list as well.

If we can make the implementation and API a little nicer and we are sure it cannot break, I'd be fine with including it. @benanne, what's your opinion?

Sure, sounds good!

@justheuristic
Author

Sorry for disappearing

if share_params is False, are you sure the deepcopied shared variables are independent from the originals? E.g., theano.clone creates new instances of shared variables that have the same underlying storage.

  • Yes, in the sense that they can have different values

if share_params is True, you only share the SharedVariable instances, but recreate parameter expressions.

  • Theano expressions are copied unless explicitly asked not to, but the parameters they have are shared.

it's quite a lot of code, can't this be simplified?

  • Certainly, by removing some of the "if list, do this; elif dict, do that; else do something else" branches. The question is which of them you want removed.

I'm not sure about the bottom_layers / share_params / share_inputs API. In particular, it seems you allow bottom_layers to include parameters as well, and it's a bit inelegant to allow bottom_layers to be either a list or a dictionary -- this should probably be split up, with different names (e.g., a shared_layers list and a replacements dictionary).

  • Agreed.

I will now try to assemble the thing.

@redst4r

redst4r commented Oct 13, 2016

Hi,

I have a similar problem: I want to apply the same set of convolutions (shared weights) to lots of different 'color' channels. As suggested by @f0k, I slice the channels, create conv layers that all share the same W/b variables, and merge the output later.

This works quite fine in theory; however, when I try to compile the architecture into a Theano function, the compilation takes a lot of time. Looking at the Theano profiler, it seems most of the compilation time is consumed by the optimizer, in particular a MergeOptimizer.

I tried to construct a minimal example:

from lasagne.layers import get_output, InputLayer, DenseLayer, SliceLayer, ConcatLayer, Conv2DLayer, ReshapeLayer
from lasagne.nonlinearities import softmax
import theano
import theano.tensor as T
import time

theano.config.profile = True
theano.config.profile_optimizer = True

n_channel = 34  # <- replicates the conv-layer 34 times, processing the channels independently
n_classes = 3

# ----------------------------------------------------------------------------------------
input_var = T.tensor4('inputs')
in_layer = InputLayer((None, n_channel, 100, 100), input_var=input_var)

# ------- construct a prototype conv-layer, replicate it a couple of times ----------
proto_params = {'name': 'conv1', 'num_filters': 512, 'filter_size': 3, 'pad': 'valid'}
dummy_in = InputLayer((None,1,100,100))
prototype = Conv2DLayer(dummy_in, **proto_params)

conv1_layers = []
for i in range(n_channel):
    theslice = SliceLayer(in_layer, indices=slice(i, i + 1), axis=1, name='slice_' + str(i))
    dup = Conv2DLayer(theslice, W=prototype.W, b=prototype.b, **proto_params)   # duplicate with shared weights
    conv1_layers.append(dup)

# ---------merge their outputs ---------------------
merger = ConcatLayer(conv1_layers, axis=1, name='concat')
reshaper = ReshapeLayer(merger, shape=([0], -1))
dense_last = DenseLayer(reshaper, n_classes, name='fcEND', nonlinearity=softmax)

pred = get_output(dense_last)

# ---------compiling ---------------------
t = time.time()
theano.function([input_var], [pred])
print(time.time() - t)

Here's the relevant part of the theano-profiler output:

Function profiling
==================
  Message: <ipython-input-11-61897b361e17>:1
  Time in 0 calls to Function.__call__: 0.000000e+00s
  Total compile time: 1.362505e+02s
    Number of Apply nodes: 345
    Theano Optimizer time: 1.259392e+02s
       Theano validate time: 4.307070e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.023595e+01s
       Import time 2.670388e-01s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 260.675s
Optimizer Profile
-----------------
 SeqOptimizer  OPT_FAST_RUN  time 125.939s for 342/345 nodes before/after optimization
   117.453s for callback
       0.431s for fgraph.validate()
   callbacks_time
        <theano.gof.opt.MergeFeature object at 0x7f5f4311ed68> , 115.58974885940552
        <theano.gof.destroyhandler.DestroyHandler object at 0x7f5f420c07b8> , 0.7037270069122314
        <theano.tensor.opt.ShapeFeature object at 0x7f5f431737b8> , 0.6588032245635986
        <theano.compile.function_module.Supervisor object at 0x7f5f4311ec88> , 0.17537736892700195
        <theano.gof.toolbox.ReplaceValidate object at 0x7f5f43160e80> , 0.06815862655639648
        <theano.gof.toolbox.PreserveVariableAttributes object at 0x7f5f4311ec50> , 0.03383636474609375
        Updater{gpu_local_optimizations} , 0.009509563446044922
        Updater{gpu_local_optimizations} , 0.00867009162902832
        Updater{gpu_local_optimizations} , 0.007185935974121094
        Updater{gpu_local_optimizations} , 0.005219221115112305
        <theano.gof.opt.ChangeTracker object at 0x7f5f42fe4160> , 0.004741668701171875
        Updater{canonicalize} , 0.004576206207275391
        Updater{gpu_local_optimizations} , 0.0034682750701904297
        Updater{canonicalize} , 0.002482175827026367
        <theano.gof.opt.ChangeTracker object at 0x7f5f43061630> , 0.0003104209899902344
        Updater{gpu_cut_transfers} , 0.00021028518676757812
        Updater{local_dnn_conv_inplace} , 0.00014972686767578125
        Updater{topo_constant_folding} , 8.988380432128906e-05
        Updater{specialize} , 6.508827209472656e-05
        <theano.gof.opt.ChangeTracker object at 0x7f5f4205c278> , 3.0040740966796875e-05
        Updater{gpu_cut_transfers} , 2.5033950805664062e-05
        Updater{local_dot_to_dot22} , 4.5299530029296875e-06
        <theano.gof.opt.ChangeTracker object at 0x7f5f4257fbe0> , 2.384185791015625e-06
        <theano.gof.opt.ChangeTracker object at 0x7f5f42fe0748> , 2.384185791015625e-06
   time      - (name, class, index, nodes before, nodes after) - validate time
   121.205341s - ('merge3', 'MergeOptimizer', 50, 345, 345) - 0.000s
     MergeOptimizer
       nb fail= 6512 merged=    0 constant=    0
       time replace=121.21 validate=0.00 callback=116.15
       callbacks_time
            <theano.gof.toolbox.PreserveVariableAttributes object at 0x7f5f4311ec50> , 0.02354121208190918
            <theano.gof.toolbox.ReplaceValidate object at 0x7f5f43160e80> , 0.05088996887207031
            <theano.tensor.opt.ShapeFeature object at 0x7f5f431737b8> , 0.1858060359954834
            <theano.gof.destroyhandler.DestroyHandler object at 0x7f5f420c07b8> , 0.4132673740386963
            <theano.gof.opt.MergeFeature object at 0x7f5f4311ed68> , 115.3201892375946
   1.616800s - ('gpu_opt', 'SeqOptimizer', 14, 209, 847) - 0.022s
     SeqOptimizer      gpu_opt  time 1.617s for 209/847 nodes before/after optimization
       0.709s for callback
           0.022s for fgraph.validate()

On a side note, if I do the same thing but untie the weights (i.e. they are not shared across channels), compile/optimize time drops to 5 seconds.

Any idea what's going on? This is a small example (150 s compilation time is OK), but if I scale it up (not a single convolution but several, i.e. conv-conv-maxpool-conv-conv-maxpool ...) it quickly goes beyond hours, which is not acceptable for me.

I'm not an expert in theano and I'm not sure how to exactly read the profiler output, so help is greatly appreciated :)

@f0k
Member

f0k commented Mar 5, 2017

I have a similar problem: I want to apply the same set of convolutions (shared weights) to lots of different 'color' channels. As suggested by @f0k, I slice the channels, create conv layers that all share the same W/b variables, and merge the output later.

Sorry for the delay; it's probably not relevant any more now, but for that use case I wouldn't suggest slicing and doing multiple convolutions. Instead, reshape the tensor from (batchsize, channels, rows, cols) to (batchsize * channels, 1, rows, cols), then do a normal 2d convolution. This will use the same set of weights for all channels, since they are now interpreted as different examples in a batch. In the end, reshape back to (batchsize, ...) to restore the original grouping. (Note that if you want to do multiple such convolutions and only pool in between, you can keep the (batchsize * channels, ...) layout until the very end.)
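The reshape bookkeeping in this suggestion can be checked with plain NumPy; a sketch with made-up sizes, showing that folding channels into the batch axis and reshaping back restores the original grouping:

```python
import numpy as np

batchsize, channels, rows, cols = 2, 34, 5, 5
x = np.arange(batchsize * channels * rows * cols).reshape(
    (batchsize, channels, rows, cols))

# fold channels into the batch axis: each channel becomes a 1-channel "example"
folded = x.reshape((batchsize * channels, 1, rows, cols))

# ... a normal 2d convolution would go here, sharing its weights across
# all channels, because they are now separate entries in the batch ...

# restore the original grouping afterwards
restored = folded.reshape((batchsize, channels, rows, cols))
assert np.array_equal(restored, x)

# the i-th channel of the j-th example is the (j*channels + i)-th folded example
assert np.array_equal(folded[1 * channels + 3, 0], x[1, 3])
```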

@guoxuesong

guoxuesong commented Mar 6, 2017

@f0k
I don't think

input = T.concatenate((input1, input2), axis=0)
...
l_out1 = lasagne.layers.SliceLayer(my_nn, slice(None, len(input1)), axis=0)
l_out2 = lasagne.layers.SliceLayer(my_nn, slice(len(input1), None), axis=0)

is equal to

out1 = lasagne.layers.get_output(my_nn, {my_nn_input: input1})
out2 = lasagne.layers.get_output(my_nn, {my_nn_input: input2})

@justheuristic's my_nn can carry a meaning that lets us think in concepts. In your version, my_nn is just an implementation detail. When working on a really complicated network, thinking in concepts is important.

In my example:

I have a deep conv network that maps a 3x64x64 image to a 64x1x1 feature point, and a transform network that maps a 64x1x1 point to another 64x1x1 point; I want to join them together.

I would like to keep a transform_network instance that means 'transform the 64x1x1 input to a 64x1x1 output in the special way we trained'. I cannot accept keeping an instance of transform_network with its input_layer replaced by conv_network, because that would mean 'transform the 64x1x1 point that is the conv result of a 3x64x64 input to a 64x1x1 output in the special way we trained', and that meaning is useless for me.

Generally, I would like to keep some Lasagne networks that have clear meanings, later use them as templates to compose high-level concepts, and maybe use the high-level ones to compose even more complicated ones, and so on.

@f0k
Member

f0k commented Mar 6, 2017

I don't think [...] equal to [...]

If you don't use batch normalization, the results will be the same. The former computation will use more memory and be faster. However, it only works if the two input tensors have the same shape (except for the first dimension). So this is a useful recipe for Siamese networks, but not for all possible use cases.
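The batch-normalization caveat is easy to demonstrate: any operation that computes statistics over the batch axis sees different statistics for the joint batch than for its parts. The batchnorm below is a simplified NumPy stand-in, not Lasagne's implementation:

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    # simplified batch normalization: standardize over the batch axis
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.RandomState(0)
input1, input2 = rng.randn(3, 4), rng.randn(5, 4)

joint = batchnorm(np.concatenate((input1, input2), axis=0))
separate = batchnorm(input1)

# the first len(input1) rows of the joint result differ from the separate
# result, because the joint batch statistics include input2's examples
assert not np.allclose(joint[:len(input1)], separate)
```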

I cannot accept keeping an instance of transform_network with input_layer replaced by conv_network

Okay... so you want to keep an instance of transform_network, another of conv_network, and a third one that is a copy of the transform_network on top of the conv_network. For that purpose, reviving the clone functionality discussed in this thread or the MacroLayer (#720 (comment)) would be a possible solution. What was holding us back from merging the cloning code as it was were the API, the amount of code, and some doubts about how to handle parameter expressions. I'm still open to a clean lasagne.layers.clone_network function! Regarding the API, instead of full-blown implementations, I'd like to look at docstrings of possible clone_network() functions to converge on something.

@guoxuesong

guoxuesong commented Mar 7, 2017

I agree clone_network is perfect for the case where params are not shared. Let's focus on sharing params.

An example: there is a layer1, a layer_contains_layer1, an another_layer_contains_layer1, and a yet_another_layer_contains_layer1, and final_layer is a layer containing both layer_contains_layer1 and another_layer_contains_layer1, like this:

final_layer = lasagne.layers.ElemwiseMergeLayer(
    (layer_contains_layer1, another_layer_contains_layer1), T.add)

now let's do this:

final_layer2 = lasagne.layers.clone_network(
    final_layer,
    bottom_layers={another_layer_contains_layer1: yet_another_layer_contains_layer1},
    share_params=True)

The another_layer_contains_layer1 in final_layer will be replaced by yet_another_layer_contains_layer1, which contains the real layer1. On the other hand, layer_contains_layer1 in final_layer will be cloned, which means that in that part of final_layer2, the place of layer1 is taken by a new layer that shares params with layer1. So final_layer2 contains layer1 and a clone of layer1 at the same time, which does not look good.

The result I would prefer is that final_layer2 keeps as many layers identical to final_layer as possible. Because final_layer is "some concept of layer1", if we replace a part of final_layer with "another concept of layer1", the resulting final_layer2 should still be "some concept of layer1". "Some concept of layer1" means we should be able to do something like lasagne.layers.join_layer(final_layer2, {layer1: layer2}): by replacing layer1 in final_layer2 with layer2, we get "some concept of layer2".

Currently, lasagne.layers.clone_network does not support this: lasagne.layers.clone_network(final_layer2, bottom_layers={layer1: layer2}, share_params=True) would only replace the real layer1; the cloned layer1 would not be replaced with layer2.

@justheuristic
Author

justheuristic commented Aug 2, 2017

Okay, I'm currently working on a much simpler way to do this.

A simple version that just calls get_output:

class reapply(MergeLayer):
    def __init__(self, layer_or_layers, replacements):
        self.layer_or_layers = layer_or_layers
        self.keys, values = zip(*replacements.items())
        MergeLayer.__init__(self, list(values))

    def get_output_for(self, inputs, **kwargs):
        return get_output(self.layer_or_layers, dict(zip(self.keys, inputs)), **kwargs)

    def get_output_shape_for(self, input_shapes, **kwargs):
        if isinstance(self.layer_or_layers, Layer):
            return self.layer_or_layers.output_shape
        else:
            return [l.output_shape for l in self.layer_or_layers]

    def get_params(self, **kwargs):
        return get_all_params(self.layer_or_layers, **kwargs)

The only problem is, it won't clone several layers at once.

@astooke
Contributor

astooke commented Mar 16, 2018

Not exactly on the topic of the thread, but perhaps here is a way around it. I came across this thread while working in reinforcement learning. It was annoying having multiple copies of the same layer: one for "stepping" while interacting with the environment, the other for training on a minibatch. One solution is to specify different behaviors in get_output_for using a kwarg step_or_train, and just feed that kwarg to L.get_output() when building the Theano functions. It assumes the data is organized as concatenated trajectory (segments) of the same length.

It seems to be working, although I haven't done a lot of learning with it yet... looks right though?

Another benefit is that the input data formats (and ndims) stay the same whether or not there is recurrence in the network, because the reshaping to (n_batch, n_step, *dims) happens only where it is needed, inside the layer.

class RecurrentLayer(L.MergeLayer):
    # (__init__ creating self.W_xh, self.W_hh, self.b, self.num_units and
    # self.nonlinearity is omitted here)

    def step(self, x, hprev):
        h = self.nonlinearity(x.dot(self.W_xh) + hprev.dot(self.W_hh) + self.b)
        return h

    def get_output_shape_for(self, input_shapes):
        n_batch_x_step = input_shapes[0][0]
        assert input_shapes[0][0] == input_shapes[1][0]
        return n_batch_x_step, self.num_units

    def get_output_for(self, inputs, step_or_train="train", **kwargs):
        xs, hprevs = inputs  # (hprevs is all initial states, one timestep)
        n_batch = hprevs.shape[0]
        if step_or_train == "train":
            # Assume data is concatenated trajectories of the same length
            n_step = xs.shape[0] // n_batch  # inferred number of time steps
            xs = T.reshape(xs, (n_batch, n_step, -1))  # (flatten the rest)
            xs = xs.dimshuffle(1, 0, 2)  # put the time dimension first
            hs, _ = theano.scan(fn=self.step, sequences=[xs], outputs_info=hprevs)
            hs = hs.dimshuffle(1, 0, 2)  # put the batch dimension first again
            hs = hs.reshape((n_batch * n_step, -1))  # fold time back into the batch
        elif step_or_train == "step":
            xs = xs.reshape((n_batch, -1))  # (flatten the rest)
            hs = self.step(xs, hprevs)
        else:
            raise ValueError("Unrecognized step_or_train: {}".format(step_or_train))
        return hs
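The index bookkeeping in get_output_for above (concatenated trajectories, reshaped so that time comes first for scan, then folded back into the batch axis) can be mimicked with plain NumPy to check that the round trip is lossless; the sizes below are made up:

```python
import numpy as np

n_batch, n_step, feat = 4, 3, 2
# concatenated trajectories of equal length, as the layer assumes:
# rows are ordered trajectory-by-trajectory
xs_flat = np.arange(n_batch * n_step * feat, dtype=float).reshape(
    (n_batch * n_step, feat))

xs = xs_flat.reshape((n_batch, n_step, feat))  # split into trajectories
xs = xs.transpose(1, 0, 2)                     # time first (dimshuffle(1, 0, 2))

# after scanning over time, put the batch dimension first again and flatten
hs = xs.transpose(1, 0, 2).reshape((n_batch * n_step, feat))
assert np.array_equal(hs, xs_flat)

# time step t of trajectory j sits at flat row j*n_step + t
assert np.array_equal(xs[1, 2], xs_flat[2 * n_step + 1])
```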
