Training loops #12
I think the right term for those chunks of data that you want to load into memory one after the other is 'batch' (versus minibatch). There are actually two cases hidden here, I think: one is where your GPU doesn't have enough RAM, the other is where the CPU doesn't have enough RAM either. But maybe it's possible to generalize the two. I think what Krizhevsky did for the second case is to load batches asynchronously: the CPU does the I/O of loading a new batch into CPU memory while the GPU is busy working with the current one. Thinking a little further, I believe it should be possible to write one training loop that fits both use cases, because the case of having one batch can be regarded as a special case of having multiple batches where n=1. It doesn't sound like you wanted two entirely different implementations for those. But probably I'm just stating the obvious here.
I did something similar for the galaxy challenge. I think the loading of batches into CPU memory is something that should be offloaded to the 'dataset generator' provided by the user, though. Maybe we can provide a function to do that kind of thing as well, but I think data loading is separate from the training loop code. What I meant with 'training loop' is the code that consumes a dataset generator (or just a static dataset variable) and proceeds to train a neural net with it :) EDIT: also, I prefer to refer to this unit of data as a 'chunk', because 'batch' and 'minibatch' get confusing. Some people refer to 'minibatches' as 'batches' (I believe I did the same in …).
This is easy to do since Theano releases the GIL around blocking GPU operations, so you just need to wrap your dataset generator in a Python thread. From my code:

```python
def feed_minibatches(self, batchsize, epochs=-1):
    import Queue
    queue = Queue.Queue(maxsize=self._num_cached)
    end_marker = object()

    # define producer
    def producer():
        for batch in self._reader.feed_minibatches(batchsize, epochs):
            # create a copy because some data readers reuse memory
            queue.put(np.array(batch))
        queue.put(end_marker)

    # start producer
    import threading
    thread = threading.Thread(target=producer)
    thread.daemon = True
    thread.start()

    # run as consumer
    item = queue.get()
    while item is not end_marker:
        yield item
        queue.task_done()
        item = queue.get()
```

It's currently not possible with Theano to also load it into GPU memory asynchronously, because Theano does everything on the NULL stream, which is implicitly synchronized wrt. all other streams.
+1 for chunks, I also used that term for a part of a dataset that fits into memory at once. However, I noticed that transferring chunks to GPU memory and iterating over them wasn't faster (or was even a little slower) than just transferring mini-batches (in which case you don't need a shared variable, just hand your numpy mini-batch directly as an input to your compiled Theano function). So we're only left with two cases: …
I think a Python generator + epoch size is enough for the interface. We'd need to figure out if and how to support datasets and models that have multiple inputs / input layers and/or multiple targets. Have the generator return a tuple of numpy arrays? Or a dictionary? How can the user define which array to pass to which input layer, or bind to which variable? Possibly by handing a list of Theano variables to the training loop call? Something like:

```python
input_layer = nntools.layers.InputLayer(shape=(100,))
hidden_layer = nntools.layers.DenseLayer(nntools.layers.dropout(input_layer, 0.2),
                                         num_units=1337,
                                         nonlinearity=nntools.nonlinearities.tanh)
output_layer = nntools.layers.DenseLayer(hidden_layer, num_units=10,
                                         nonlinearity=nntools.nonlinearities.sigm)
cost = nntools.objectives.Objective(output_layer, nntools.objectives.crossentropy)
updates = nntools.updates.momentum(cost.get_loss(),
                                   nntools.layers.get_all_params(output_layer), 0.1)

inputreader = MyFancyDatasetReader(...)
targetreader = MyFancyDatasetReader(...)
readboth = itertools.izip(inputreader.feed_minibatches(128),
                          targetreader.feed_minibatches(128))
epochsize = inputreader.num_datapoints // 128

iter_train = nntools.trainloops.minibatchwise(
    generator=readboth, epochsize=epochsize,
    assignments=[input_layer.input_var, cost.target_var],
    updates=updates,
    monitors=[cost.get_loss()],
    threaded=True,
    num_epochs=20)
for epoch, monitors in enumerate(iter_train):
    print "Epoch %d: crossent loss %.3g" % (epoch, monitors[0])
```

Here the generator returns a tuple of two numpy arrays at each call, one being a mini-batch of inputs, the other being a mini-batch of corresponding targets. For validation, the training loop generator would need to accept another generator/epochsize pair, or the user could just generate a second …
Interesting about the release of the GIL, and thanks for the example code. So I was working on this sklearn interface class for nntools; here it is: https://github.com/dnouri/nolearn/blob/master/nolearn/nntools.py#L32 I'm also asking the batch iterator to iterate over Xb, yb, and not using a shared variable. By the way, here's an example of how you'd instantiate a NeuralNet: …

And then you'd call nn.fit(X, y) and eventually nn.predict(X), just like with scikit-learn. Would be happy to hear people's opinions on this.
Yep, you can easily pipeline your generators for this. When you extend your dataset class to also provide random access rather than just iteration, you can also cut out spectrogram excerpts and shuffle the data on the fly. That's outside the responsibility of nntools, though, it should just accept a generator. (Maybe I'll publish my data reader classes in a separate repository after some refactoring.)
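To illustrate the kind of generator pipelining described above, here is a hedged Python 3 sketch (not actual nntools/Lasagne API; the `random_crops` and `normalize` names are made up for this example). Each stage is an ordinary generator consuming another, so random excerpting, shuffling, and normalization compose freely and all run on the CPU:

```python
import numpy as np

def random_crops(dataset, batchsize, crop_len, rng=None):
    """Yield mini-batches of random fixed-length excerpts from a list of
    variable-length 1-D arrays (e.g. spectrograms along the time axis)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    while True:
        batch = []
        for _ in range(batchsize):
            x = dataset[rng.integers(len(dataset))]      # pick a random example
            start = rng.integers(len(x) - crop_len + 1)  # pick a random excerpt
            batch.append(x[start:start + crop_len])
        yield np.stack(batch)

def normalize(batches, mean, std):
    """Pipeline stage: normalize each incoming mini-batch on the CPU."""
    for batch in batches:
        yield (batch - mean) / std
```

A pipeline is then just nested generators, e.g. `normalize(random_crops(data, 32, 50), mean, std)`, and the whole thing can be handed to the training loop or wrapped in a background thread.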
I think separating out the layer parameters into their own keyword arguments is a bit awkward, and it may lead to name clashes (what if you call one of the layers …?). How about:

```python
layers=[
    ('input', layers.InputLayer, {'shape': (100, 3072)}),
    ('hidden1', layers.DenseLayer, {'num_units': 500}),
    ('hidden2', layers.DenseLayer, {'num_units': 500}),
    ('output', layers.DenseLayer, {'num_units': 10, 'nonlinearity': softmax}),
],
```

If you want to allow other architectures than a stack of layers, you may want to support something like …
On 10/14/2014 03:51 PM, Jan Schlüter wrote: …

Yes, I used the same idea in cuda-convnet, where Alex's example code … I agree it's outside the scope of nntools to include a lot of fancy …

It's awkward, but there's a reason it's like that: namely, scikit-learn's GridSearchCV(..., param_grid={"input1_num_units": [500, 1000]}, ...). So for this, the list of parameters has to be flat. (I should maybe document at least the motivation for stuff like this.) I like your idea of using the names to support multi-input layers.
Not in general. For one, some operations cannot be done by the network in mini-batches (e.g., my pipelines usually generate mini-batches by shuffling an epoch's worth of spectrogram excerpts formed on-demand, so forming the spectrogram excerpts cannot be done by the network). Secondly, all these on-the-fly augmentation operations can be done on the CPU while the GPU is busy, in which case they come basically for free. So even embarrassingly parallel operations such as input normalization can be faster when done by the mini-batch generator.
Why yes, it should have this concept in that it accepts a batch iterator as an input for the training loop generator. It should also use a batch iterator in one of the examples. And for convenience, it might provide a wrapper that places a given batch iterator into a separate thread (using my threaded generator code above).
Oh, I see. I guess your approach is fine then, but I'm glad I don't have to use it ;)
On 10/14/2014 11:36 PM, Jan Schlüter wrote: …

Very good points!

I'll try and stare at feed_minibatches() a little bit more, because I …

Yeah, sure, you're not forced to use it. ;-) I think that parameter trick is okay-awkward. But I take it maybe …
OK, to make it clear, throwing out what was specific to my implementation, it looks like this:

```python
def threaded_generator(generator, num_cached=50):
    import Queue
    queue = Queue.Queue(maxsize=num_cached)
    sentinel = object()  # guaranteed unique reference

    # define producer (putting items into queue)
    def producer():
        for item in generator:
            queue.put(item)
        queue.put(sentinel)

    # start producer (in a background thread)
    import threading
    thread = threading.Thread(target=producer)
    thread.daemon = True
    thread.start()

    # run as consumer (read items from queue, in current thread)
    item = queue.get()
    while item is not sentinel:
        yield item
        queue.task_done()
        item = queue.get()
```

Feel free to credit me if you commit it somewhere :)
No, that's not the problem, I've even contributed something a few years ago. It's just that this put-everything-into-the-constructor-as-kwargs approach is not well suited to something with as modular an architecture as neural networks (I've seen another sklearn-like neural network class before, and it had the same touch of awkwardness to it). It works quite fine for shallow things like GMMs, though.
On 10/17/2014 12:21 PM, Jan Schlüter wrote: …

Sweet, will do.

Ah, the kwargs is what bothers you. I can try to make it work both …
I have a general purpose Python generator function that takes another generator function as input and runs this in a separate process, to achieve the same thing. The reason I went for processes instead of threads is because some libraries, such as h5py, do not play nice with the GIL at all. h5py just claims the GIL for as long as it takes to read in a chunk of data, and nothing else happens during that time. Very frustrating. So my solution was to use processes instead of threads, but this has plenty of disadvantages too: higher memory usage and slowdowns because the multiprocessing module pickles objects for transfer between different processes, which adds unnecessary overhead. It also introduces limitations on the sizes of the objects transferred (between 2 and 4GB max, I think). At any rate, it might be useful to have both a 'thread' and a 'process' version of this.
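A minimal sketch of such a process-based variant (Python 3; the `process_generator` name is made up). It assumes a Unix 'fork' start method so the nested producer function does not need to be pickled; only the yielded items cross the process boundary, which is where the pickling overhead and size limits mentioned above come in:

```python
import multiprocessing as mp

def process_generator(source, num_cached=10):
    """Like the threaded version, but runs the producer in a separate
    process, so it is unaffected by libraries that hold the GIL (h5py).
    Items must be picklable; None is used as the end marker, so the
    wrapped iterable must never yield None."""
    ctx = mp.get_context("fork")  # fork: producer needn't be pickled (Unix only)
    queue = ctx.Queue(maxsize=num_cached)

    def producer():
        for item in source:
            queue.put(item)   # each item is pickled for the transfer
        queue.put(None)       # end marker

    proc = ctx.Process(target=producer)
    proc.daemon = True
    proc.start()
    item = queue.get()
    while item is not None:
        yield item
        item = queue.get()
    proc.join()
```

The interface is the same as the threaded wrapper, so the two could indeed be offered side by side and swapped depending on whether the data source plays nice with the GIL.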
Are there any numbers on this? I have a feeling the size of the model could affect this (i.e. how much time is spent in computation vs. data transfer), as well as how recent the hardware is (i.e. CPU-to-GPU transfer bandwidth). If I recall correctly, I started using 'chunks' precisely because transferring individual batches incurred so much overhead, but that was years ago. Newer hardware and bigger models may have made this approach obsolete. But if that is the reason, maybe we should still consider supporting the chunk-based approach for users who are working with smaller models / stuck with older hardware.
Yep, I also have both versions. Good to know about the GIL in h5py. I was able to use the threaded version for now, which was quite a bit faster.
Hmm, no, we will need to benchmark this. My experience is also from several years ago, I was still using cudamat, and I think I just compared the execution times of an empty loop over the training data (i.e., a loop copying each mini-batch to the GPU, but not doing anything with it). Separately copying mini-batches was faster than copying chunks of a gigabyte or so. But it's well possible that things have changed, so maybe we should support both.
Was inadvertently closed on merging in recurrent layers, due to a commit message that actually referenced an issue in @craffel's fork.
I'd like to revive this discussion again. I think it would be nice if the next release of Lasagne came with some training loop code that covers the most common use cases. I know there is still some discussion about whether we need this in Lasagne at all, but I think we do. It would be nice to provide a more complete package. Of course the focus remains firmly on neural networks, but training is a pretty essential aspect of NNs, so I think this still fits within our philosophy of "do one thing and do it well". And as with everything in Lasagne, it would of course be completely optional to use.

I haven't thought too much about the API yet. The only thing I'm certain of is that we should rely on Python generators where we can, for maximal flexibility. I think it would be nice to have something that lets you say "okay, here is my network architecture definition, now please train it on this dataset and use sensible defaults for all the things I didn't specify." Preferably it should be possible to do this in 2-3 lines of code. Currently a lot of code uses the same boilerplate to construct shared variables for the dataset, iterate through it in batches, evaluate periodically, ... I'm pretty sure we could provide something that would be sufficient for 90% of use cases. This would save people a lot of thinking and typing (or copy-pasting). It would allow our users to focus on designing neural networks rather than designing training code, without the need for external dependencies.

Of course I know there is nolearn, but one thing I dislike about it is how it requires you to specify your network architecture in a non-standard way, for the sake of more complete scikit-learn compatibility. It feels a bit like using a different library altogether, and imposes certain limits on what you can do that don't exist with vanilla Lasagne. If I recall correctly, the main motivation was to allow scikit-learn's cross-validation / parameter sweeping infrastructure to be used. Are there any other reasons for this? I'd like to have something that avoids this in the main library, maybe at the cost of not supporting all the fancy scikit-learn stuff out of the box :)

The interface wouldn't even have to match scikit-learn -- although 'fit' and 'predict' are pretty common now across many libraries. We could take a 'scikit-learn light' approach as the authors of Keras did: the API is familiar to scikit-learn users, but things aren't necessarily fully compatible with that library. I think that's a nice compromise, because neural networks don't fit into the scikit-learn paradigm all too well, in my opinion. I'll see if I can come up with some API ideas, but I think it would be nice to brainstorm a little first, so we can align our expectations.
This is a very good point. I'll add the possibility of passing a layer object (or a list of layers?) as the …
I think Keras 'breaks' the sklearn interface for no good reason, really. Instead of passing a list of layers at instantiation, you have to call … Is there anything else I've overlooked that would make the sklearn API an awkward choice? Lastly, in terms of lines of code, consider how you can use nolearn.lasagne today:

```python
net = NeuralNet(... all settings in one place ...)
net.fit(X, y)
ypred = net.predict(X)
```
Yep, that is pretty sweet :) If I could just pass in a layer representing my network, I would already be pretty happy with that! With regards to Keras's …

I was also thinking about supporting more complicated use cases, i.e. what if there are multiple input layers, or what if there is no label. Maybe we could transparently support those, while assuming the default case of inputs + labels. Perhaps users could specify some kind of mapping between the layers (or Theano variables) and the various bits of data that are passed in. I also like …
I think it would be good to have a separation between model container / training loop and dataset container / batch generator. The model would have …
I think this would work pretty well just by having the …

I do think this would be a key addition to the library, but it could also become quite a mess trying to support every possible thing people may want (logging, web monitoring, callbacks for early stopping, checkpointing, etc.). I think it would be best to really focus on the most common cases, while making the code as clean and obvious as possible, so that people can adapt it to their own needs.
agreed.
I don't think we need a class for this; if we make it possible to pass in either arrays or generators, I think we can cover all the bases. An additional …
I think we can easily sort this out by enabling people to make the training loop explicit as a for loop …
nolearn.lasagne has …

nolearn.lasagne (and I think Keras too) allows you to pass a mapping of …

nolearn.lasagne has …

Callbacks (an idea which I believe Keras took from nolearn.lasagne) help a lot there. There are a couple of well-defined hooks, and people can easily add their own behaviour. You can already cover all of the use cases that @ebenolson mentions with the available hooks: logging, web monitoring, early stopping, checkpointing. As an example, check out …
I know, I know, I can't stop, but nolearn.lasagne allows you to do this:

```python
net = NeuralNet(
    # ...
    max_epochs=1,  # a little hacky but you'll survive
)
for X, y in mygenerator:
    net.fit(X, y)
    # fun stuff here
```

I still prefer using the hooks, though, because it's plug-and-play. It's more declarative and encourages reuse between code. Here's an example of a …

```python
net = NeuralNet(
    # bunch of stuff...
    on_epoch_finished=[
        SaveWeights(
            'models/model-{epoch}-{timestamp}.pkl',
            only_best=True,
        ),
    ],
)
```
But a difference, I guess, is that you specify certain parameters when calling …
Awesome!
In some cases it might be limiting to treat labels as a special type of input. Most tasks fit the (X, y) paradigm (i.e. data + labels), but there are many that don't. I'd be okay with having features that are specific to this paradigm, because it is so common. But there are many multiple input / multiple output situations as well, e.g. training siamese networks with pairwise / triplet / quadruplet losses, models with example weights, various forms of masking, ... so I think we should have something generic that can handle all of it.
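One generic option along these lines (a hypothetical Python 3 sketch, not a proposed final API): have the generator yield a dict per mini-batch, and tell the training loop which entries feed which positions of the compiled function. The `train` name and signature are made up for illustration:

```python
def train(fn, data_generator, input_order, epochsize, num_epochs):
    """Hypothetical sketch: `fn` is a compiled training function (e.g. a
    Theano function), `data_generator` yields dicts mapping names to
    arrays, and `input_order` says which entries to pass in which
    position. Yields (epoch, mean loss) pairs."""
    for epoch in range(num_epochs):
        losses = []
        for _ in range(epochsize):
            batch = next(data_generator)
            losses.append(fn(*(batch[name] for name in input_order)))
        yield epoch, sum(losses) / len(losses)
```

The point is that the loop works the same whether the dict holds (input, target), several inputs, masks, or example weights; the plain (X, y) case is just `input_order=["X", "y"]`, so the common paradigm stays easy without being hardcoded.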
It's beginning to look like I need to take a more thorough look at the nolearn docs :)
I'm not a huge fan of callbacks myself; I feel that iterators / generators are a much more natural way of passing control back and forth. But perhaps we can accommodate both styles.
This is sorta what I think should be avoided. Now there needs to be a callback API, and some standard for accessing history. I think it's a lot less intuitive than a simple for loop, where you can see exactly what's happening without digging into library code.
Maybe a …
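For comparison, the explicit-loop style being argued for above might look roughly like this (a Python 3 toy sketch with made-up names; `losses_per_epoch` stands in for a real training-loop generator yielding one validation loss per epoch):

```python
def run_training(losses_per_epoch, patience=3):
    """Explicit-loop version of early stopping + checkpointing: the
    control flow lives in plain user code instead of registered
    callbacks. Returns the best loss and the checkpointed epochs."""
    best_loss, bad_epochs, checkpoints = float("inf"), 0, []
    for epoch, loss in enumerate(losses_per_epoch):
        if loss < best_loss:
            best_loss, bad_epochs = loss, 0
            checkpoints.append(epoch)  # stand-in for saving weights to disk
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # early stopping: no improvement for `patience` epochs
    return best_loss, checkpoints
```

Everything a callback API would hide (history access, stopping criteria, what gets saved when) is visible in the loop body, which is the trade-off against the plug-and-play hooks discussed above.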
I've trained a siamese net with nolearn, and it's working pretty well. I've even been using your trick of having an input that's double the size of the labels, and it was straightforward.
I don't know about these use cases; you'd have to enlighten me.
The standard is that the callback gets passed the …
That's a nice idea.
I should say that I did have to write a custom batch iterator for that, so maybe not super straightforward. The nice thing is that it seems flexible enough all around to allow me to do things like this.
The nolearn.lasagne docs are a big void right now. :-( There's only a couple of tutorials so far. Need to fix...
Agreed.
'end of the world' is a bit strong indeed, but it is a bit of a wart imo. We should avoid forcing users to use things in unintuitive ways, which I think this is an instance of. I think …
Fair enough, I just want to make sure we don't end up forcing our users to shoehorn their use case into it -- it should be natural to extend whatever we provide to support these more exotic configurations (within reason, of course; we probably can't support / think of everything). I think the …
I was just trying to think of use cases where you have additional data streams that are more label-like, or more input-like, or something in between, to demonstrate that this dichotomy between input and labels isn't always as clear-cut as it is for many regression/classification problems. Just to show that hardcoding this dichotomy everywhere probably isn't the right way to go. Because it is so common, we can of course provide tools to make this case easier, but it should also be easy to go beyond this.
Maybe we should strive to merge it (or some derivative of it) into Lasagne proper, and fix the docs in the process? It sounds like there's already a ton of useful stuff in there that we could use (more than I previously thought), and it seems silly to just reimplement it.
+1
So if you want to support a … If what you want is test-time augmentation, which I often use, you subclass. Here's an example:

```python
class MyBatchIterator(BatchIterator):
    def transform(self, X, y):
        return X + 1, y  # silly example: always adds 1 to X
```

I think this is an example of an okay pattern that you can easily remember. For some users, it's arguably easier 'filling in the blanks' as opposed to taking control over the train loop, batch generation, etc. For the advanced user, it might seem restrictive. I guess it's basically about the difference between framework and library. But certainly, if you're striving for one way to do this, and you want to have control over the train loop yourself, then I guess you'd probably get rid of …
Would be happy to try that. It would be a good opportunity to clean things up, too. That said, I'm not sure it'll be easy to get a consensus on many things. You're making a good point about not forcing users to shoehorn their use cases; it's hard anticipating all the things people might want to do.
If I'm not mistaken, you don't specifically need …

I think we should provide both, so that there is an easier solution for the 80% of use cases where such fine-grained control is not needed (i.e. …).

Luckily we have a few people among our contributors who are very good at this :)
Yes, you're right.
Sounds good to me.
FWIW, …

```python
layer = InputLayer(shape=(None, 1, 28, 28))
layer = Conv2DLayer(layer, filter_size=5, num_filters=8)
layer = MaxPool2DLayer(layer, pool_size=2)
layer = Conv2DLayer(layer, filter_size=5, num_filters=8)
layer = MaxPool2DLayer(layer, pool_size=2)
layer = DenseLayer(layer, num_units=128)
layer = DenseLayer(layer, nonlinearity=softmax, num_units=10)

net = NeuralNet(
    layer,
    update=nesterov_momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,
)

for X_epoch, y_epoch in generate_data():
    net.partial_fit(X_epoch, y_epoch)
```

Docs are still an outstanding issue (or the opposite of that).
I don't know if it factors into this discussion at all, but it would be useful if the generator also efficiently supported datasets that do not fit into memory. The most straightforward way to do this is probably to allow passing in …
As mentioned in the discussion before, I think data loading should be offloaded to the user entirely, as supporting various data formats directly in the library would make things a lot more complex for us. I think it's outside of the scope of the library. Luckily Python's concept of generators provides the perfect interface for this. We should just make it so you can pass in any generator you like. Then it becomes easy to support things like data loading from disk, asynchronous data loading, on-the-fly data augmentation, ... without having to worry about it inside the library. Of course, we should also support just passing in (a set of) numpy arrays, since that is such a common use case. I've been working on a draft for this for the past week or so, hopefully I'll be able to submit a PR sometime soon.
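As an illustration of how out-of-memory datasets stay entirely on the user's side of the generator interface, here is a minimal Python 3 sketch using numpy's memory mapping (the `iterate_from_disk` name is made up):

```python
import numpy as np

def iterate_from_disk(filename, batchsize):
    """Yield successive mini-batches from an .npy file without loading
    the whole array into memory. Each batch is copied into a fresh
    array so consumers (e.g. a background thread) can hold on to it."""
    data = np.load(filename, mmap_mode="r")  # memory-mapped, lazy reads
    for start in range(0, len(data), batchsize):
        yield np.array(data[start:start + batchsize])
```

Since this is just another generator, it can be wrapped in the threaded (or process-based) wrapper from earlier in the thread to overlap disk I/O with GPU work, with no support needed from the library.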
@f0k, thank you for the threaded_generator code; it significantly sped up my processing.
So far the only thing that's been implemented is a bunch of tools to generate Theano expressions for neural nets. There is no actual training code in the library yet.
We should provide some premade 'training loops' for common use cases, that take care of things like compiling the necessary Theano functions and updating the parameters given a dataset.
It would be great if we could rely on Python generators for this - although at this point I'm not sure if they offer enough flexibility. But if they do, it would be great to be able to avoid adding another class / abstraction.
We could provide a few different types of training loops, for different dataset sizes and approaches. For example, some datasets fit into GPU memory, so we should provide a loop that loads up the data into a shared variable and then iterates over that in batches. But a lot of datasets actually don't, so then we'd have to load a new 'chunk' of data into the shared variable at regular intervals.
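The chunk-based case could be skeletonized roughly like this (a hypothetical Python 3 sketch, not library code; `upload` stands in for refilling a GPU-side shared variable, e.g. via its `set_value` method, and `train_batch` for a compiled training function that indexes mini-batches out of the uploaded chunk):

```python
def train_in_chunks(chunks, batchsize, upload, train_batch):
    """Skeleton of a chunk-based loop: each chunk fits in device memory,
    so there is one big host-to-device transfer per chunk, followed by
    many cheap mini-batch steps indexing into the uploaded chunk."""
    losses = []
    for chunk in chunks:
        upload(chunk)  # e.g. shared_x.set_value(chunk) in Theano
        for i in range(len(chunk) // batchsize):
            losses.append(train_batch(i))  # compiled function, indexed by batch
    return losses
```

The in-memory case then falls out as a single chunk (the n=1 special case discussed above), so one loop could serve both dataset sizes.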
Until now I've always reimplemented this type of thing specifically for the problems I was working on (layers.py only provided tools to generate Theano expressions, nothing else). But I've definitely felt like I was reinventing the wheel a couple of times :)
I don't have a concrete idea yet of how we should implement this, input is very welcome.