I recently saw PR #1239, so I figured I'd mention something I've been working on and find out whether it's useful to other people. I feel this feature is related to #1239, since one of its core aims is to use encoded images and preprocess them while training. It differs in that it supports multi-task datasets and reads images over HTTP rather than from a database or the local filesystem.
Here's my use case:
I do my training on EC2 and when I start up a new instance, I need to fetch my training data from S3. Sometimes I just want to train for one epoch or less to help set learning rates and other parameters. It's wasteful to download the full dataset to each machine if I'm not going to use all of it. The disks on g2.2xlarge instances are only 60 GB, so I often can't fit my entire dataset on each anyway.
The second part of my use case is that I often vary how I preprocess my training examples during development. Therefore, the input here is not a set of pre-cropped and resized images, but the original images along with cropping and resizing parameters.
Here's the feature:
It's a new data layer that takes as input a protobuf that enumerates training examples each with:
(1) A URL to a jpeg image.
(2) A crop region.
(3) A set of training labels.
Each minibatch of images is fetched together with libcurl-multi. That way, if all of your images are stored in the same place, such as S3, then DNS resolution is reused across requests. The images are decoded with libjpeg-turbo, which offers better performance than what OpenCV uses (although you lose support for anything other than JPEG). cv::resize is used for resizing, although it's pretty slow unless you use OpenCV 3, which is built with IPP support.
Now, an alternative is to configure a set of EBS volumes with the training data and run NFS or GlusterFS to serve the data to the training machines. I'm not saying one method is better than the other. S3 is certainly easier to configure, and it's cheaper per GB per month, but a private network file system might offer better performance. This feature is meant to be something to try and measure. And for online training, it might be easier to consume resources from an HTTP server.
Here's what the input protobuf message definitions look like for some FooTask. This example dataset has four sets of training labels for multi-task learning.
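Something along these lines, where the message names and the last two label fields are illustrative placeholders rather than the exact definitions (only `classification_label` and `regression_target` are referenced later):

```protobuf
// Illustrative sketch of the message definitions (proto2).
message CropRegion {
  optional int32 x = 1;
  optional int32 y = 2;
  optional int32 width = 3;
  optional int32 height = 4;
}

message FooTaskLabels {
  // Four sets of training labels for multi-task learning.
  optional int32 classification_label = 1;
  optional float regression_target = 2;
  optional int32 attribute_label = 3;   // hypothetical placeholder
  optional float saliency_target = 4;   // hypothetical placeholder
}

message FooTaskExample {
  required string image_url = 1;        // (1) URL to a jpeg image
  optional CropRegion crop = 2;         // (2) crop region
  optional FooTaskLabels labels = 3;    // (3) training labels
}

message FooTaskDataset {
  repeated FooTaskExample example = 1;
}
```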
And here's what an input protobuf might look like (encoded with text_format for visualization).
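For example, with made-up bucket names and label values:

```protobuf
example {
  image_url: "https://my-bucket.s3.amazonaws.com/train/000001.jpg"
  crop { x: 32 y: 16 width: 224 height: 224 }
  labels {
    classification_label: 7
    regression_target: 0.42
    attribute_label: 3
    saliency_target: 0.9
  }
}
example {
  image_url: "https://my-bucket.s3.amazonaws.com/train/000002.jpg"
  crop { x: 0 y: 0 width: 256 height: 256 }
  labels {
    classification_label: 2
    regression_target: -1.3
    attribute_label: 0
    saliency_target: 0.1
  }
}
```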
Now say you're not interested in doing multi-task learning, but just want to do classification on the `classification_label` field. In the network's protobuf definition, you can specify the following to select only `classification_label` as output.
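A sketch only — the layer type and param names here (`FooTaskData`, `foo_task_data_param`, `source`, `batch_size`, `label`) are illustrative, not necessarily the actual ones:

```protobuf
layer {
  name: "train_data"
  type: "FooTaskData"                # illustrative layer type name
  top: "data"
  top: "classification_label"
  foo_task_data_param {
    source: "train_dataset.pb"       # the dataset protobuf described above
    batch_size: 64
    label: "classification_label"    # select a single label field
  }
}
```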
Or if you want both `classification_label` and `regression_target` for training, then you can specify both. The size of top will always be N + 1, where N is the number of selected label outputs.
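Continuing the same sketch, with both labels selected (so N = 2 and the size of top is 3):

```protobuf
layer {
  name: "train_data"
  type: "FooTaskData"
  top: "data"                        # top size is N + 1 = 3
  top: "classification_label"
  top: "regression_target"
  foo_task_data_param {
    source: "train_dataset.pb"
    batch_size: 64
    label: "classification_label"    # repeated field: one entry
    label: "regression_target"       # per selected label output
  }
}
```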