Training loops #12

Open
benanne opened this issue Sep 12, 2014 · 31 comments

Comments

@benanne
Member

benanne commented Sep 12, 2014

So far the only thing that's been implemented is a bunch of tools to generate Theano expressions for neural nets. There is no actual training code in the library yet.

We should provide some premade 'training loops' for common use cases, that take care of things like compiling the necessary Theano functions and updating the parameters given a dataset.

It would be great if we could rely on Python generators for this - although at this point I'm not sure if they offer enough flexibility. But if they do, it would be great to be able to avoid adding another class / abstraction.

We could provide a few different types of training loops, for different dataset sizes and approaches. For example, some datasets fit into GPU memory, so we should provide a loop that loads up the data into a shared variable and then iterates over that in batches. But a lot of datasets actually don't, so then we'd have to load a new 'chunk' of data into the shared variable at regular intervals.
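
To make the first case concrete, the shared-variable pattern I have in mind boils down to something like this (a rough sketch in plain Theano, with a toy logistic regression standing in for the network expressions and load_chunks() as a placeholder for user-side data loading):

import numpy as np
import theano
import theano.tensor as T

floatX = theano.config.floatX
batch_size = 128

# toy model (logistic regression) standing in for the nntools-built expressions
x = T.matrix('x')
y = T.ivector('y')
W = theano.shared(np.zeros((784, 10), dtype=floatX))
b = theano.shared(np.zeros(10, dtype=floatX))
p = T.nnet.softmax(T.dot(x, W) + b)
cost = -T.mean(T.log(p)[T.arange(y.shape[0]), y])
updates = [(param, param - 0.1 * T.grad(cost, param)) for param in (W, b)]

# the current chunk lives in GPU memory; mini-batches are sliced on the device
X_chunk_shared = theano.shared(np.zeros((batch_size, 784), dtype=floatX))
y_chunk_shared = theano.shared(np.zeros(batch_size, dtype='int32'))
index = T.lscalar('index')
train_batch = theano.function(
    [index], cost, updates=updates,
    givens={x: X_chunk_shared[index * batch_size:(index + 1) * batch_size],
            y: y_chunk_shared[index * batch_size:(index + 1) * batch_size]})

for X_chunk, y_chunk in load_chunks():  # placeholder: yields numpy arrays
    X_chunk_shared.set_value(X_chunk)   # one host-to-device transfer per chunk
    y_chunk_shared.set_value(y_chunk)
    for i in range(len(X_chunk) // batch_size):
        train_batch(i)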

Until now I've always reimplemented this type of thing specifically for the problems I was working on (layers.py only provided tools to generate Theano expressions, nothing else). But I've definitely felt like I was reinventing the wheel a couple of times :)

I don't have a concrete idea yet of how we should implement this, input is very welcome.

@dnouri
Member

dnouri commented Sep 12, 2014

I think the right term for those chunks of data that you want to load into memory one after the other is 'batch' (versus minibatch).

There are actually two cases hidden here, I think. One is where your GPU doesn't have enough RAM; the other is where the CPU also doesn't have enough RAM. But maybe it's possible to generalize the two.

I think what Krizhevsky did for the second case is to load batches asynchronously: the CPU does the I/O of loading a new batch into CPU memory while the GPU is busy working on the current one.

Thinking a little further, I believe it should be possible to write one training loop that fits both use cases, because the case of having one batch can be regarded as a special case of having multiple batches with n=1. It doesn't sound like you'd want two entirely different implementations for those. But probably I'm just stating the obvious here.

@benanne
Member Author

benanne commented Sep 12, 2014

I did something similar for the galaxy challenge. I think the loading of batches into CPU memory is something that should be offloaded to the 'dataset generator' that is provided by the user though.

Maybe we can provide a function to do that kind of thing as well, but I think data loading is separate from the training loop code. What I meant with 'training loop' is the code that consumes a dataset generator (or just a static dataset variable) and proceeds to train a neural net with it :)

EDIT: also, I prefer to refer to this unit of data as a 'chunk' because 'batch' and 'minibatch' get confusing. Some people refer to 'minibatches' as 'batches' (I believe I did the same in nntools.layers.InputLayer). We should probably form a consensus on what terminology to use for these things.

@f0k
Member

f0k commented Oct 13, 2014

I think what Krizhevsky did for the second case is that he loads batches in an asynchronous way where the CPU is doing the I/O of loading a new batch into CPU memory while the GPU is busy working with the current.

This is easy to do since Theano releases the GIL around blocking GPU operations, so you just need to wrap your dataset generator in a Python thread. From my code:

def feed_minibatches(self, batchsize, epochs=-1):
    import Queue
    import numpy as np
    queue = Queue.Queue(maxsize=self._num_cached)
    end_marker = object()
    # define producer
    def producer():
        for batch in self._reader.feed_minibatches(batchsize, epochs):
            queue.put(np.array(batch))  # create a copy because some data readers reuse memory
        queue.put(end_marker)
    # start producer
    import threading
    thread = threading.Thread(target=producer)
    thread.daemon = True
    thread.start()
    # run as consumer
    item = queue.get()
    while item is not end_marker:
        yield item
        queue.task_done()
        item = queue.get()

It's currently not possible with Theano to also load it into GPU memory asynchronously, because Theano does everything on the NULL stream, which is implicitly synchronized wrt. all other streams.

EDIT: also, I prefer to refer to this unit of data as a 'chunk' because 'batch' and 'minibatch' get confusing. Some people refer to 'minibatches' as 'batches' (I believe I did the same in nntools.layers.InputLayer). We should probably form a consensus on what terminology to use for these things.

+1 for chunks, I also used that to name a part of a dataset that fits into memory at once. However, I noticed that transferring chunks to GPU memory and iterating over them wasn't any faster (or was even a little slower) than just transferring mini-batches (in which case you don't need a shared variable, just hand your numpy mini-batch directly as an input to your compiled Theano function). So we're only left with two cases:
a) ask the user to provide a data matrix that is then copied to a shared variable in GPU memory and iterated over, optionally shuffling on the way
b) ask the user to provide a mini-batch generator that is then iterated over (plus the size of an epoch in batches so we know when an epoch has ended), optionally in a separate thread
I'm not sure how much time a) actually saves over b) in case both are possible. Is it worth supporting a)?
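
For comparison, variant b) doesn't need a shared variable at all; roughly (a sketch, assuming x, y, cost and updates are built from the network as usual, and minibatch_generator() / num_epochs come from the user):

import theano

# variant b): no shared variable, no chunks; each numpy mini-batch is handed
# straight to the compiled function (one host-to-device copy per mini-batch)
train_fn = theano.function([x, y], cost, updates=updates)

for epoch in range(num_epochs):
    for X_batch, y_batch in minibatch_generator():  # user-supplied, one epoch's worth
        loss = train_fn(X_batch, y_batch)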

It would be great if we could rely on Python generators for this - although at this point I'm not sure if they offer enough flexibility. But if they do, it would be great to be able to avoid adding another class / abstraction.

I think a Python generator + epoch size is enough for the interface. We'd need to figure out if and how to support datasets and models that have multiple inputs / input layers and/or multiple targets. Have the generator return a tuple of numpy arrays? Or a dictionary? How can the user define which array to pass to which input layer, or bind to which variable? Possibly by handing a list of Theano variables to the training loop call? Something like:

input_layer = nntools.layers.InputLayer(shape=(100,))
hidden_layer = nntools.layers.DenseLayer(nntools.layers.dropout(input_layer, 0.2), num_units=1337, nonlinearity=nntools.nonlinearities.tanh)
output_layer = nntools.layers.DenseLayer(hidden_layer, num_units=10, nonlinearity=nntools.nonlinearities.sigmoid)

cost = nntools.objectives.Objective(output_layer, nntools.objectives.crossentropy)
updates = nntools.updates.momentum(cost.get_loss(), nntools.layers.get_all_params(output_layer), 0.1)

inputreader = MyFancyDatasetReader(...)
targetreader = MyFancyDatasetReader(...)
readboth = itertools.izip(inputreader.feed_minibatches(128), targetreader.feed_minibatches(128))
epochsize = inputreader.num_datapoints // 128
iter_train = nntools.trainloops.minibatchwise(
    generator=readboth, epochsize=epochsize,
    assignments=[input_layer.input_var, cost.target_var],
    updates=updates,
    monitors=[cost.get_loss()],
    threaded=True,
    num_epochs=20)
for epoch, monitors in enumerate(iter_train):
    print "Epoch %d: crossent loss %.3g" % (epoch, monitors[0])

Here the generator returns a tuple of two numpy arrays at each call, one being a mini-batch of inputs, the other being a mini-batch of corresponding targets. The assignments list tells the training loop which Theano variables to pass which array to. That's one way it could work.
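
Such a generator could be as simple as this sketch (plain numpy slicing over in-memory arrays X and y; one epoch is len(X) // batchsize batches):

def minibatch_generator(X, y, batchsize=128):
    # yields (inputs, targets) tuples of numpy arrays, in the same order as the
    # assignments=[input_layer.input_var, cost.target_var] list above
    while True:  # keep producing; the epochsize argument tells the loop where epochs end
        for start in range(0, len(X) - batchsize + 1, batchsize):
            yield X[start:start + batchsize], y[start:start + batchsize]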

For validation, the training loop generator would need to accept another generator/epochsize pair, or the user could just generate a second nntools.trainloops.minibatchwise instance with a different generator/epochsize pair and updates=None and call it on each iteration of the training loop to obtain the validation error for the current epoch.
(An early stopping training loop generator would definitely need two generator/epochsize pairs, though.)
For k-fold cross-validation, things become more involved. The training loop generator would either need the full dataset as numpy arrays (which should always be an option), or an iterable of generator/epochsize pairs. The latter would hardly be easier for the end user than just doing the cross-validation herself or himself, so maybe we shouldn't care about this at all.

@dnouri
Member

dnouri commented Oct 13, 2014

Interesting about the release of the GIL and the example code provided.
I was looking to put a thread around my batch generator, but will still
have to figure out a nice enough interface for that. (I'll have to
re-read what you wrote about it too.)

So I was working on this sklearn interface class for nntools and here's
the BatchIterator, which I basically stole from syhw's dnn.py:

https://github.com/dnouri/nolearn/blob/master/nolearn/nntools.py#L32

I'm also having the batch iterator yield Xb, yb, and not using a shared
variable for the "chunk". I like the flexibility; it allows me to do
data augmentation while I generate batches, FlipBatchGenerator,
CropBatchGenerator etc. (I think I've seen others do this kind of
thing in the net's layers?)

By the way here's an example of how you'd instantiate a NeuralNet.

    nn = NeuralNet(
        layers=[
            ('input', layers.InputLayer),
            ('hidden1', layers.DenseLayer),
            ('hidden2', layers.DenseLayer),
            ('output', layers.DenseLayer),
            ],
        update_learning_rate=0.01,
        update_momentum=0.9,

        input_shape=(100, 3072),
        hidden1_num_units=500,
        hidden2_num_units=500,
        output_num_units=10,
        output_nonlinearity=softmax,

        max_epochs=100,
        )

And then you'd call nn.fit(X, y) and eventually nn.predict(X), just like
with sklearn. nn.fit() will do the training loop. You can pass a
callback function which gets called with every epoch (and you could do
early stopping there, or draw the loss curve). Another one when
training is finished.

Would be happy to hear people's opinions on this.

@f0k
Member

f0k commented Oct 14, 2014

I like the flexibility; it allows me to do data augmentation while I generate batches, FlipBatchGenerator, CropBatchGenerator etc.

Yep, you can easily pipeline your generators for this. When you extend your dataset class to also provide random access rather than just iteration, you can also cut out spectrogram excerpts and shuffle the data on the fly. That's outside the responsibility of nntools, though, it should just accept a generator. (Maybe I'll publish my data reader classes in a separate repository after some refactoring.)
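
Just to illustrate the pipelining (a sketch, not part of nntools; it assumes image batches shaped (batchsize, channels, height, width), and crop_batches / reader are placeholders):

import numpy as np

def flip_batches(batches, p=0.5):
    # wraps a mini-batch generator and randomly mirrors part of each image batch
    for X_batch, y_batch in batches:
        X_batch = X_batch.copy()                   # don't touch the reader's buffer
        flip = np.random.rand(len(X_batch)) < p
        X_batch[flip] = X_batch[flip, :, :, ::-1]  # flip along the width axis
        yield X_batch, y_batch

# pipelining: crop_batches(flip_batches(reader.feed_minibatches(128)))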

By the way here's an example of how you'd instantiate a NeuralNet. [...] Would be happy to hear people's opinions on this.

I think separating out the layer parameters into their own keyword arguments is a bit awkward, and it may also lead to name clashes (what if you call one of the layers 'update' or 'max'?). I'd include the layer constructor arguments in the layer definition list:

layers=[
        ('input', layers.InputLayer, {'shape': (100, 3072)}),
        ('hidden1', layers.DenseLayer, {'num_units': 500}),
        ('hidden2', layers.DenseLayer, {'num_units': 500}),
        ('output', layers.DenseLayer, {'num_units': 10, 'nonlinearity': softmax}),
        ],

If you want to allow other architectures than a stack of layers, you may want to support something like {'num_units': 500, 'input_layer': 'hidden1'}, where you specify the input layer for a layer via its name (or via their names for a MultipleInputsLayer). That's how Krizhevsky does it in cuda-convnet. If you only want to support stacks of layers, then your layers don't need names at all.

@dnouri
Member

dnouri commented Oct 14, 2014

On 10/14/2014 03:51 PM, Jan Schlüter wrote:

I like the flexibility; it allows me to do data augmentation while I
generate batches, FlipBatchGenerator, CropBatchGenerator etc.

Yep, you can easily pipeline your generators for this. When you extend
your dataset class to also provide random access rather than just
iteration, you can also cut out spectrogram excerpts and shuffle the
data on the fly. That's outside the responsibility of nntools, though,
it should just accept a generator. (Maybe I'll publish my data reader
classes in a separate repository after some refactoring.)

Yes, I used the same idea in cuda-convnet, where Alex's example code
does cropping inside this dataset provider. (In fact I used it to crop
spectrograms as well.) I'm still wondering how it would compare to do
this on-the-fly augmentation inside a dedicated layer. Maybe it can be
considerably faster?

I agree it's outside the scope of nntools to include a lot of fancy
iterator implementations. But if nntools grows this train loop code,
then it'd probably make sense to also have this concept of a batch iterator.

By the way here's an example of how you'd instantiate a NeuralNet.
[...] Would be happy to hear people's opinions on this.

I think separating out the layer parameters into their own keyword
arguments is a bit awkward, also it may lead to name clashes (what if
you call one of the layers |'update'|?). I'd include the layer
constructor arguments in the layer definition list:

It's awkward but there's a reason it's like that. Namely scikit-learn's
grid search. I can now do

GridSearchCV(..., param_grid={"input1_num_units": [500, 1000]}, ...)

So for this the list of parameters has to be flat.

(I should maybe document at least the motivation for stuff like this
before asking around for opinions. Sorry for that.)

I like your idea of using the names to support multi-input layers.

@f0k
Member

f0k commented Oct 14, 2014

I'm still wondering how it would compare to do this on-the-fly augmentation inside a dedicated layer. Maybe it can be considerably faster?

Not in general. For one, some operations cannot be done by the network in mini-batches (e.g., my pipelines usually generate mini-batches by shuffling an epoch worth of spectrogram excerpts formed on-demand, so forming the spectrogram excerpts cannot be done by the network). Secondly, all these on-the-fly augmentation operations can be done on the CPU while the GPU is busy, in which case they come basically for free. So even embarrassingly parallel operations such as input normalization can be faster when done by the mini-batch generator.

But if nntools grows this train loop code, then it'd probably make sense to also have this concept of a batch iterator.

Why yes, it should have this concept in that it accepts a batch iterator as an input for the training loop generator. It should also use a batch iterator in one of the examples. And for convenience, it might provide a wrapper that places a given batch iterator into a separate thread (using my feed_minibatches() code excerpt with minimal changes). The library should not provide any implementation of a batch iterator, though, this would open a can of worms.

It's awkward but there's a reason it's like that. Namely scikit-learn's grid search.

Oh, I see. I guess your approach is fine then, but I'm glad I don't have to use it ;)

@dnouri
Member

dnouri commented Oct 15, 2014

On 10/14/2014 11:36 PM, Jan Schlüter wrote:

I'm still wondering how it would compare to do this on-the-fly
augmentation inside a dedicated layer. Maybe it can be considerably
faster?

Not in general. For one, some operations cannot be done by the network
in mini-batches (e.g., my pipelines usually generate mini-batches by
shuffling an epoch worth of spectrogram excerpts formed on-demand, so
forming the spectrogram excerpts cannot be done by the network).
Secondly, all these on-the-fly augmentation operations can be done on
the CPU while the GPU is busy, in which case they come basically for
free. So even embarrassingly parallel operations such as input
normalization can be faster when done by the mini-batch generator.

Very good points!

But if nntools grows this train loop code, then it'd probably
make sense to also have this concept of a batch iterator.

Why yes, it should have this concept in that it accepts a batch iterator
as an input for the training loop generator. It should also use a batch
iterator in one of the examples. And for convenience, it might provide a
wrapper that places a given batch iterator into a separate thread (using
my |feed_minibatches()| code excerpt with minimal changes). The library
should not provide any implementation of a batch iterator, though, this
would open a can of worms.

I'll try and stare at feed_minibatches() a little bit more, because I
definitely like the idea of making it easy for users to have their batch
generators do their work on the side, in a separate thread.

It's awkward but there's a reason it's like that. Namely
scikit-learn's grid search.

Oh, I see. I guess your approach is fine then, but I'm glad I don't have
to use it ;)

Yeah sure you're not forced to use it. ;-)

I think that parameter trick is okay-awkward. But I take it maybe
you're not a big fan of scikit-learn in general. In which case I'd be
curious to hear why.

@f0k
Member

f0k commented Oct 17, 2014

I'll try and stare at feed_minibatches() a little bit more, because I
definitely like the idea of making it easy for users to have their batch
generators do their work on the side, in a separate thread.

OK, to make it clear, throwing out what was specific to my implementation it looks like this:

def threaded_generator(generator, num_cached=50):
    import Queue
    queue = Queue.Queue(maxsize=num_cached)
    sentinel = object()  # guaranteed unique reference

    # define producer (putting items into queue)
    def producer():
        for item in generator:
            queue.put(item)
        queue.put(sentinel)

    # start producer (in a background thread)
    import threading
    thread = threading.Thread(target=producer)
    thread.daemon = True
    thread.start()

    # run as consumer (read items from queue, in current thread)
    item = queue.get()
    while item is not sentinel:
        yield item
        queue.task_done()
        item = queue.get()

Feel free to credit me if you commit it somewhere :)
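
Usage is then just wrapping whatever mini-batch generator you already have (my_batches() and train_fn being placeholders):

# batches are prepared in a background thread while the main thread is blocked
# in Theano's GPU calls
for X_batch, y_batch in threaded_generator(my_batches(), num_cached=10):
    train_fn(X_batch, y_batch)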

I think that parameter trick is okay-awkward. But I take it maybe
you're not a big fan of scikit-learn in general. In which case I'd be
curious to hear why.

No, that's not the problem, I've even contributed something a few years ago. It's just that this put-everything-into-the-constructor-as-kwargs approach is not well suited to something with as modular an architecture as neural networks (I've seen another sklearn-like neural network class before, it had the same touch of awkwardness to it). It works quite fine for shallow things like GMMs, though.

@dnouri
Member

dnouri commented Oct 17, 2014

On 10/17/2014 12:21 PM, Jan Schlüter wrote:

I'll try and stare at feed_minibatches() a little bit more, because I
definitely like the idea of making it easy for users to have their batch
generators do their work on the side, in a separate thread.

OK, to make it clear, throwing out what was specific to my
implementation it looks like this:

...

Feel free to credit me if you commit it somewhere :)

Sweet, will do.

I think that parameter trick is okay-awkward. But I take it maybe
you're not a big fan of scikit-learn in general. In which case I'd be
curious to hear why.

No, that's not the problem, I've even contributed something a few years
ago. It's just that this put-everything-into-the-constructor-as-kwargs
approach is not well suited to something with as modular an architecture
as neural networks (I've seen another sklearn-like neural network class
before, it had the same touch of awkwardness to it). It works quite fine
for shallow things like GMMs, though.

Ah, the kwargs is what bothers you. I can try to make it work both
ways, i.e. also with passing dicts in the layers list as you suggested
earlier. Maybe that'll help control your distaste. Anyway the
flat-kwargs is what's necessary to be able to use some of the tools
inside sklearn conveniently, e.g. pipelines, grid search. I think that
for smaller nets, grid search is a pretty cool feature that then comes
for free here.

@benanne
Member Author

benanne commented Oct 18, 2014

This is easy to do since Theano releases the GIL around blocking GPU operations, so you just need to wrap your dataset generator in a Python thread.

I have a general purpose Python generator function that takes another generator function as input and runs this in a separate process, to achieve the same thing.

The reason I went for processes instead of threads is because some libraries, such as h5py, do not play nice with the GIL at all. h5py just claims the GIL for as long as it takes to read in a chunk of data, and nothing else happens during that time. Very frustrating.

So my solution was to use processes instead of threads, but this has plenty of disadvantages too: higher memory usage and slowdowns because the multiprocessing module pickles objects for transfer between different processes, which adds unnecessary overhead. It also introduces limitations on the sizes of the objects transferred (between 2 and 4GB max, I think).

At any rate, it might be useful to have both a 'thread' and a 'process' version of this.
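
A process-based version of the same idea could look roughly like this (a sketch of the idea, not my actual code; it assumes a fork-based start method, and uses None as the sentinel since object identity doesn't survive pickling):

def buffered_generator_process(generator, num_cached=10):
    import multiprocessing
    queue = multiprocessing.Queue(maxsize=num_cached)
    sentinel = None  # items are pickled, so use None instead of an identity sentinel

    # producer runs in a child process and fills the queue
    def producer(q):
        for item in generator:
            q.put(item)
        q.put(sentinel)

    process = multiprocessing.Process(target=producer, args=(queue,))
    process.daemon = True
    process.start()

    # consume in the current process
    item = queue.get()
    while item is not sentinel:
        yield item
        item = queue.get()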

However, I noticed that transferring chunks to GPU memory and iterating over them wasn't faster (or even a little slower) than just transferring mini-batches (in which case you don't need a shared variable, just hand your numpy mini-batch directly as an input to your compiled Theano function).

Are there any numbers on this? I have a feeling the size of the model could affect this (i.e. how much time is spent in computation vs. data transfer), as well as how recent the hardware is (i.e. CPU-to-GPU transfer bandwidth).

If I recall correctly, I started using 'chunks' precisely because transferring individual batches incurred so much overhead, but that was years ago. Newer hardware and bigger models may have made this approach obsolete. But if that is the reason, maybe we should still consider supporting the chunk-based approach for users who are working with smaller models / stuck with older hardware.

@f0k
Member

f0k commented Oct 18, 2014

I have a general purpose Python generator function that takes another generator function as input and runs this in a separate process, to achieve the same thing. [...] At any rate, it might be useful to have both a 'thread' and a 'process' version of this.

Yep, I also have both versions. Good to know about the GIL in h5py. I was able to use the threaded version for now, which was quite a bit faster.

Are there any numbers on this? I have a feeling the size of the model could affect this [...]

Hmm, no, we will need to benchmark this. My experience is also from several years ago, when I was still using cudamat, and I think I just compared the execution times of an empty loop over the training data (i.e., a loop copying each mini-batch to the GPU, but not doing anything with it). Separately copying mini-batches was faster than copying chunks of a gigabyte or so. But it's quite possible that things have changed, so maybe we should support both.
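
Something along these lines should be enough to redo the comparison on current hardware (a sketch; the sum() is only there so the transferred data is actually used, and the array sizes are arbitrary):

import time
import numpy as np
import theano
import theano.tensor as T

floatX = theano.config.floatX
data = np.random.rand(100000, 784).astype(floatX)
batchsize = 128

# per-mini-batch transfer: hand each numpy batch to a compiled function
x = T.matrix('x')
f_batch = theano.function([x], x.sum())
start = time.time()
for i in range(0, len(data) - batchsize + 1, batchsize):
    f_batch(data[i:i + batchsize])
print('per-minibatch transfer: %.3fs' % (time.time() - start))

# chunk transfer: upload everything to a shared variable, slice on the GPU
data_shared = theano.shared(data[:batchsize])  # initialized with a dummy slice
idx = T.lscalar('idx')
f_chunk = theano.function(
    [idx], data_shared[idx * batchsize:(idx + 1) * batchsize].sum())
start = time.time()
data_shared.set_value(data)  # the one big host-to-device copy counts, too
for i in range(len(data) // batchsize):
    f_chunk(i)
print('shared-variable chunk:  %.3fs' % (time.time() - start))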

@benanne benanne modified the milestone: First release May 6, 2015
@f0k f0k closed this as completed in 592e61f Aug 6, 2015
@f0k f0k reopened this Aug 6, 2015
@f0k
Member

f0k commented Aug 6, 2015

Was inadvertently closed on merging in recurrent layers, due to a commit message that actually referenced an issue in @craffel's fork.

@benanne
Member Author

benanne commented Aug 30, 2015

I'd like to revive this discussion again. I think it would be nice if the next release of Lasagne came with some training loop code that covers the most common use cases.

I know there is still some discussion about whether we need this in Lasagne at all, but I think we do. It would be nice to provide a more complete package. Of course the focus remains firmly on neural networks, but training is a pretty essential aspect of NNs so I think this still fits within our philosophy of "do one thing and do it well". And as with everything in Lasagne it would of course be completely optional to use.

I haven't thought too much about the API yet. The only thing I'm certain of is that we should rely on Python generators where we can for maximal flexibility. I think it would be nice to have something that lets you say "okay, here is my network architecture definition, now please train it on this dataset and use sensible defaults for all the things I didn't specify."

Preferably it should be possible to do this in 2-3 lines of code. Currently a lot of code uses the same boilerplate to construct shared variables for the dataset, iterate through it in batches, evaluate periodically, ... I'm pretty sure we could provide something that would be sufficient for 90% of use cases. This would save people a lot of thinking and typing (or copy-pasting).

It would allow our users to focus on designing neural networks rather than designing training code, without the need for external dependencies. Of course I know there is nolearn, but one thing I dislike about it is how it requires you to specify your network architecture in a non-standard way, for the sake of more complete scikit-learn compatibility. It feels a bit like using a different library altogether and imposes certain limits on what you can do that don't exist with vanilla Lasagne.

If I recall correctly, the main motivation was to allow scikit-learn's cross-validation / parameter sweeping infrastructure to be used. Are there any other reasons for this? I'd like to have something that avoids this in the main library, maybe at the cost of not supporting all the fancy scikit-learn stuff out of the box :)

The interface wouldn't even have to match scikit-learn -- although 'fit' and 'predict' are pretty common now across many libraries. We could take a 'scikit-learn light' approach as the authors of Keras did: the API is familiar to scikit-learn users, but things aren't necessarily fully compatible with that library. I think that's a nice compromise because neural networks don't fit into the scikit-learn paradigm all that well, in my opinion.

I'll see if I can come up with some API ideas, but I think it would be nice to brainstorm a little first, so we can align our expectations.

@dnouri
Member

dnouri commented Aug 30, 2015

Of course I know there is nolearn, but one thing I dislike about it is how it requires you to specify your network architecture in a non-standard way, for the sake of more complete scikit-learn compatibility. It feels a bit like using a different library altogether and imposes certain limits on what you can do that don't exist with vanilla Lasagne.

This is a very good point. I'll add the possibility of passing a layer object (or a list of layers?) as the layer parameter. This way you'll be able to build your model as with vanilla Lasagne and pass it in when you instantiate nolearn.lasagne.NeuralNet.

We could take a 'scikit-learn light' approach as the authors of Keras did: the API is familiar to scikit-learn users, but things aren't necessarily fully compatible with that library.

I think Keras 'breaks' the sklearn interface for no good reason really. Instead of passing a list of layers at instantiation, you have to call model.add for each layer. That doesn't make a lot of sense to me. Keras also has a special compile method that accepts loss and optimizer. Why not pass that in the __init__?

Is there anything else I've overlooked that would make the sklearn API an awkward choice?

Lastly, in terms of lines of code, consider how you can use nolearn.lasagne today:

net = NeuralNet(... all settings in one place ...)
net.fit(X, y)
ypred = net.predict(X)

@benanne
Member Author

benanne commented Aug 30, 2015

Yep, that is pretty sweet :) If I could just pass in a layer representing my network I would already be pretty happy with that!

With regards to Keras's compile method, I suppose one reason to split this would be to make this step a bit more explicit, because it can take a long time. But I also think it's probably a bit redundant, basically the object initialization is split into two separate calls.

I was also thinking about supporting more complicated use cases, i.e. what if there are multiple input layers, or what if there is no label. Maybe we could transparently support those, while assuming the default case of inputs + labels. Perhaps users could specify some kind of mapping between the layers (or Theano variables) and the various bits of data that are passed in.

I also like net.evaluate() in Keras, which is an addition to the API that they made if I'm not mistaken. In many cases you don't really care about the predictions themselves, just about accuracy (or some other metric).

@ebenolson
Member

I think it would be good to have a separation between model container / training loop and dataset container / batch generator.

The model would have train and predict (+evaluate?) methods, which can be passed either simple arrays / shared variables, or something that produces minibatches (a Dataset object). We could provide some that cover simple cases like a directory of images, but ideally the user would define a subclass to handle their own data loading (along with things like preprocessing or realtime augmentation).

I was also thinking about supporting more complicated use cases, i.e. what if there are multiple input layers, or what if there is no label. Maybe we could transparently support those, while assuming the default case of inputs + labels. Perhaps users could specify some kind of mapping between the layers (or Theano variables) and the various bits of data that are passed in.

I think this would work pretty well just by having the Model constructor accept a list of output layers, and an optional list of input layers/variables.

I do think this would be a key addition to the library, but it could also become quite a mess trying to support every possible thing people may want (logging, web monitoring, callbacks for early stopping, checkpointing, etc.). I think it would be best to really focus on the most common cases, while making the code as clean and obvious as possible so that people can adapt it to their own needs.

@benanne
Member Author

benanne commented Aug 31, 2015

I think it would be good to have a separation between model container / training loop and dataset container / batch generator.

agreed.

The model would have train and predict (+evaluate?) methods, which can be passed either simple arrays / shared variables, or something that produces minibatches (a Dataset object). We could provide some that cover simple cases like a directory of images, but ideally the user would define a subclass to handle their own data loading (along with things like preprocessing or realtime augmentation).

I don't think we need a class for this, if we make it possible to pass in either arrays or generators, I think we can cover all the bases. An additional Dataset abstraction is probably unnecessary.

I do think this would be a key addition to the library, but it could also become quite a mess trying to support every possible thing people may want (logging, web monitoring, callbacks for early stopping, checkpointing, etc.). I think it would be best to really focus on the most common cases, while making the code as clean and obvious as possible so that people can adapt it to their own needs.

I think we can easily sort this out by enabling people to make the training loop explicit as a for loop. This can be done either by providing a generator-based interface for training as well, or just by having an incremental version of the fit method that people could call inside a for loop themselves. Then they can write whatever logging / monitoring / checkpointing code they want around that. (incidentally I don't think Lasagne needs to provide any of those things, but I'm guessing we're agreed on that.)

@dnouri
Member

dnouri commented Aug 31, 2015

With regards to Keras's compile method, I suppose one reason to split this would be to make this step a bit more explicit, because it can take a long time.

nolearn.lasagne has net.initialize() that does the same thing as compile but it's implicitly called during fit() if it hasn't been called by the user before.

I was also thinking about supporting more complicated use cases, i.e. what if there are multiple input layers, or what if there is no label.

nolearn.lasagne (and I think Keras too) allow you to pass a mapping of {layer_name: input_value} as X. Multiple outputs aren't supported yet, but it's something on my list. The use case for no labels is unsupervised training? When I train auto-encoders, I will usually just pass the same thing (maybe transposed) as X and y.

I also like net.evaluate() in Keras

nolearn.lasagne has net.score(X, y), which is the same thing but the name's from scikit-learn. Again, Keras uses a different method name for no apparent reason.

I do think this would be a key addition to the library, but it could also become quite a mess trying to support every possible thing people may want (logging, web monitoring, callbacks for early stopping, checkpointing, etc.).

Callbacks (an idea which I believe Keras took from nolearn.lasagne) help a lot there. There's a couple of well-defined hooks and people can add their own behaviour easily. You can easily cover today all of the use cases that @ebenolson mentions with the available hooks: logging, web monitoring, callbacks for early stopping, checkpointing. As an example, check out EarlyStopping in the kfkd tutorial, or consider how nolearn.lasagne.NeuralNet does its own logging using a callback.

I think we can easily sort this out by enabling people to make the training loop explicit as a for loop.

I know, I know, I can't stop, but nolearn.lasagne allows you to do this:

net = NeuralNet(
    # ...
    max_epochs=1,  # a little hacky but you'll survive
    )

for X, y in mygenerator:
    net.fit(X, y)
    # fun stuff here

I still prefer using the hooks though because it's plug-and-play. It's more declarative, and encourages code reuse. Here's an example of a NeuralNet that saves checkpoints, but only if the validation error has improved:

net = NeuralNet(
    # bunch of stuff...
    on_epoch_finished=[
        SaveWeights(
            'models/model-{epoch}-{timestamp}.pkl',
            only_best=True,
            ),
        ],
    )

@benanne
Member Author

benanne commented Aug 31, 2015

nolearn.lasagne has net.initialize() that does the same thing as compile but it's implicitly called during fit() if it hasn't been called by the user before.

But a difference I guess is that you specify certain parameters when calling compile(), which nolearn expects to receive in __init__(). It probably makes more sense to group everything together, although there is something to be said for splitting up long argument lists into multiple steps; it might make things easier to remember because of the 'hierarchical' structure. But it does seem a bit arbitrary.

nolearn.lasagne (and I think Keras too) allow you to pass a mapping of {layer_name: input_value} as X. Multiple outputs aren't supported yet, but it's something on my list.

Awesome!

The use case for no labels is unsupervised training? When I train auto-encoders, I will usually just pass the same thing (maybe transposed) as X and y.

In some cases it might be limiting to treat labels as a special type of input. Most tasks fit the (X, y)-paradigm (i.e. data + labels), but there are many that don't. I'd be okay with having features that are specific to this paradigm, because it is so common. But there are many multiple input / multiple output situations as well, e.g. training siamese networks with pairwise / triplet / quadruplet losses, models with example weights, various forms of masking, ... so I think we should have something generic that can handle all of it.
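
For example, if the training function is compiled from whatever list of variables the user supplies, the generator only has to yield matching tuples (a sketch; every name here is made up):

import theano

# x_a, x_b, targets, example_weights, cost, updates come from the user's model
# code; batch_generator() is user code yielding matching tuples of numpy arrays
train_fn = theano.function([x_a, x_b, targets, example_weights],
                           cost, updates=updates)

for batch in batch_generator():
    loss = train_fn(*batch)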

nolearn.lasagne has net.score(X, y), which is the same thing but the name's from scikit-learn. Again, Keras uses a different method name for no apparent reason.

It's beginning to look like I need to take a more thorough look at the nolearn docs :)

I still prefer using the hooks though because it's plug-and-play. It's more declarative, and encourages reuse between code.

I'm not a huge fan of callbacks myself, I feel that iterators / generators are a much more natural way of passing control back and forth. But perhaps we can accommodate both styles.

@ebenolson
Member

net = NeuralNet(
    # bunch of stuff...
    on_epoch_finished=[
        SaveWeights(
            'models/model-{epoch}-{timestamp}.pkl',
            only_best=True,
            ),
        ],
    )

This is sorta what I think should be avoided. Now there needs to be a callback API, and some standard for accessing history. I think it's a lot less intuitive than a simple for loop, where you can see exactly what's happening without digging into library code.

net = NeuralNet(
    # ...
    max_epochs=1,  # a little hacky but you'll survive
    )

Maybe a partial_fit method would be useful, and in keeping with sklearn? Although I agree passing max_epochs=1 isn't the end of the world.

@dnouri
Member

dnouri commented Aug 31, 2015

In some cases it might be limiting to treat labels as a special type of input. Most tasks fit the (X, y)-paradigm (i.e. data + labels), but there are many that don't. [...] But there are are many multiple input / multiple output situations as well, e.g. training siamese networks with pairwise / triplet / quadruplet losses,

I've trained a siamese net with nolearn, and it's working pretty well. I've even been using your trick of having an input that's double the size of the labels, and it was straightforward.

models with example weights, various forms of masking, ... so I think we should have something generic that can handle all of it.

I don't know about these use cases; you'd have to enlighten me.

This is sorta what I think should be avoided. Now there needs to be a callback API, and some standard for accessing history.

The standard is that the callback gets passed the NeuralNet instance, so I'd argue it's fairly straightforward. But I can see your point.

Maybe a partial_fit method would be useful, and in keeping with sklearn?

That's a nice idea.

@dnouri
Member

dnouri commented Aug 31, 2015

I've trained a siamese net with nolearn, and it's working pretty well. I've even been using your trick of having an input that's double the size of the labels, and it was straightforward.

I should say that I did have to write a custom batch iterator for that, so maybe not super straightforward. The nice thing is that it seems flexible enough all around to allow me to do things like this.

It's beginning to look like I need to take a more thorough look at the nolearn docs :)

The nolearn.lasagne docs are a big void right now. :-( There's only a couple of tutorials so far. Need to fix...

@benanne
Member Author

benanne commented Aug 31, 2015

This is sorta what I think should be avoided. Now there needs to be a callback API, and some standard for accessing history. I think it's a lot less intuitive than a simple for loop, where you can see exactly what's happening without digging into library code.

Agreed.

Maybe a partial_fit method would be useful, and in keeping with sklearn? Although I agree passing max_epochs=1 isn't the end of the world.

'end of the world' is a bit strong indeed, but it is a bit of a wart imo. We should avoid forcing users to use things in unintuitive ways, which I think this is an instance of.

I think partial_fit is a great idea, a lot better than overloading fit to do both and needing a hack to make it work.

I've trained a siamese net with nolearn, and it's working pretty well. I've even been using your trick of having an input that's double the size of the labels, and it was straightforward.

Fair enough, I just want to make sure we don't end up forcing our users to shoehorn their use case into it -- it should be natural to extend whatever we provide to support these more exotic configurations (within reason of course, we probably can't support / think of everything).

I think the BatchIterator stuff might already go a bit too far. What benefits does this abstraction offer over just working with Python generators? Every new class or concept adds cognitive overhead, so for anything that goes into the main library I'd like to keep this to an absolute minimum.

I don't know about these use cases; you'd have to enlighten me.

I was just trying to think of use cases where you have additional data streams that are more label-like, or more input-like, or something in between, to demonstrate that this dichotomy between input and labels isn't always as clear-cut as it is for many regression/classification problems. Just to show that hardcoding this dichotomy everywhere probably isn't the right way to go. Because it is so common, we can of course provide tools to make this case easier, but it should also be easy to go beyond this.

The nolearn.lasagne docs are a big void right now. :-( There's only a couple of tutorials so far. Need to fix...

Maybe we should strive to merge it (or some derivative of it) into Lasagne proper, and fix the docs in the process? It sounds like there's already a ton of useful stuff in there that we could use (more than I previously thought) and it seems silly to just reimplement it.

@dnouri
Member

dnouri commented Aug 31, 2015

'end of the world' is a bit strong indeed, but it is a bit of a wart imo. We should avoid forcing users to use things in unintuitive ways, which I think this is an instance of.

+1

I think the BatchIterator stuff might already go a bit too far. What benefits does this abstraction offer over just working with Python generators?

So if you want to support a fit() method that takes the entire dataset (as is often done in sklearn), you'll need to split it into batches, and that's what BatchIterator is for. The good news is that a lot of the time you don't need to worry about it; it'll do the simple thing of generating batches of a given (default) size in the background.

If what you want is test-time augmentation, which I often use, you subclass. Here's an example:

class MyBatchIterator(BatchIterator):
    def transform(self, X, y):
        return X + 1, y  # silly example always adds 1 to X

I think this is an example of an okay pattern that you can easily remember. For some users it's arguably easier to 'fill in the blanks' than to take control of the train loop, batch generation etc. For the advanced user, it might seem restrictive. I guess it's basically about the difference between a framework and a library.

But certainly, if you're striving for one way to do this, and you want to have control over the train loop yourself, then I guess you'd probably get rid of fit() altogether and only provide the equivalent of partial_fit()?

Maybe we should strive to merge it (or some derivative of it) into Lasagne proper, and fix the docs in the process?

Would be happy to try that. It would be a good opportunity to clean things up, too. That said, I'm not sure it'll be easy to get a consensus on many things. You're making a good point about not forcing users to shoehorn their use cases; it's hard anticipating all the things people might want to do.

@benanne
Member Author

benanne commented Aug 31, 2015

So if you want to support a fit() method that takes the entire dataset (such as it's often done in sklearn), you'll need to split it into batches, and that's what BatchIterator is for.

If I'm not mistaken you don't specifically need BatchIterator, anything which implements the Python iterator interface should do, right? In that case, that's more or less what I had in mind. I'd prefer to use a generator for something like this, but it sounds like that's already possible.

But certainly, if you're striving for one way to do this, and you want to have control over the train loop yourself, then I guess you'd probably get rid of fit() altogether and only provide the equivalent of partial_fit()?

I think we should provide both, so that there is an easier solution for the 80% of use cases where such fine-grained control is not needed (i.e. fit()). Also I think it would be counterintuitive for many people if we only provide partial_fit() and not fit().

That said, I'm not sure it'll be easy to get a consensus on many things. You're making a good point about not forcing users to shoehorn their use cases; it's hard anticipating all the things people might want to do.

Luckily we have a few people among our contributors who are very good at this :)

@dnouri
Member

dnouri commented Aug 31, 2015

If I'm not mistaken you don't specifically need BatchIterator, anything which implements the Python iterator interface should do, right?

Yes, you're right.

I think we should provide both, so that there is an easier solution for the 80% of use cases where such fine-grained control is not needed (i.e. fit()).

Sounds good to me.

@dnouri
Member

dnouri commented Sep 7, 2015

FWIW, nolearn.lasagne.NeuralNet now has a partial_fit method as discussed here, and it'll now also accept an output layer as its first argument (in addition to the funny list of tuples). Usage example:

layer = InputLayer(shape=(None, 1, 28, 28))
layer = Conv2DLayer(layer, filter_size=5, num_filters=8)
layer = MaxPool2DLayer(layer, pool_size=2)
layer = Conv2DLayer(layer, filter_size=5, num_filters=8)
layer = MaxPool2DLayer(layer, pool_size=2)
layer = DenseLayer(layer, num_units=128)
layer = DenseLayer(layer, nonlinearity=softmax, num_units=10)

net = NeuralNet(
    layer,
    update=nesterov_momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,
    )

for X_epoch, y_epoch in generate_data():
    net.partial_fit(X_epoch, y_epoch)

Docs is still an outstanding issue (or the opposite of that).

@jonhoo
Contributor

jonhoo commented Nov 27, 2015

I don't know if it factors into this discussion at all, but it would be useful if the generator also efficiently supported datasets that do not fit into memory. The most straightforward way to do this is probably to allow passing in numpy.memmap'd ndarrays (and possibly to use borrow=True).

@benanne
Member Author

benanne commented Nov 27, 2015

As mentioned in the discussion before, I think data loading should be offloaded to the user entirely, as supporting various data formats directly in the library would make things a lot more complex for us. I think it's outside of the scope of the library.

Luckily Python's concept of generators provides the perfect interface for this. We should just make it so you can pass in any generator you like. Then it becomes easy to support things like data loading from disk, asynchronous data loading, on the fly data augmentation, ... without having to worry about it inside the library.
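
The memmap case mentioned above then just becomes another user-side generator, e.g. (a sketch):

import numpy as np

def memmap_minibatches(path, shape, batchsize=128, dtype='float32'):
    # stream mini-batches from a file that doesn't fit into memory
    X = np.memmap(path, dtype=dtype, mode='r', shape=shape)
    for start in range(0, shape[0] - batchsize + 1, batchsize):
        # np.asarray forces the read and hands Theano a regular in-memory array
        yield np.asarray(X[start:start + batchsize])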

Of course, we should also support just passing in (a set of) numpy arrays, since that is such a common use case.

I've been working on a draft for this for the past week or so, hopefully I'll be able to submit a PR sometime soon.

@alexander-rakhlin

alexander-rakhlin commented Apr 9, 2017

@f0k thank you for the threaded_generator code, it significantly sped up my processing.
Just out of curiosity: you are using queue.task_done() but not queue.join(). Is task_done needed then?

  1. It seems that without queue.join(), task_done makes no difference
  2. Consuming the queue happens in the main thread, so there are no background threads to wait for, therefore join isn't required
