Proto data layer #1266

Closed — kmatzen wants to merge 6 commits

Conversation

@kmatzen (Contributor) commented Oct 12, 2014

I recently saw PR #1239, so I figured I'd mention something I've been working on to find out whether it's useful to other people. This feature is related to #1239 in that one of its core aims is to store encoded images and preprocess them during training. It differs in that it also supports multi-task datasets and reads images over HTTP rather than from a database or the local filesystem.

Here's my use case:
I do my training on EC2, and when I start a new instance, I need to fetch my training data from S3. Sometimes I only want to train for an epoch or less to help set learning rates and other parameters, and it's wasteful to download the full dataset to each machine if I'm not going to use all of it. The disks on g2.2xlarge instances are only 60 GB, so I often can't fit the entire dataset on each one anyway.

The second part of my use case is that I often vary how I preprocess training examples during development. The input here is therefore not a set of pre-cropped, pre-resized images, but the original images along with cropping and resizing parameters.

Here's the feature:
It's a new data layer that takes as input a protobuf enumerating training examples, each with:
(1) A URL to a jpeg image.
(2) A crop region.
(3) A set of training labels.

Each minibatch of images is fetched together with libcurl's multi interface. That way, if all of your images are stored in the same place, such as S3, the DNS resolution is reused across requests. The images are decoded with libjpeg-turbo, which offers better performance than what OpenCV uses (although you lose support for anything other than JPEG). cv::resize handles resizing, although it's fairly slow unless you use OpenCV 3, which is built with IPP support.
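Roughly, the batched fetch looks like the following (a simplified sketch, not the exact code in this PR; fetch_batch and append_body are illustrative names):

#include <curl/curl.h>
#include <string>
#include <vector>

// Append downloaded bytes to the std::string passed via CURLOPT_WRITEDATA.
static size_t append_body(char* ptr, size_t size, size_t nmemb, void* userdata) {
  static_cast<std::string*>(userdata)->append(ptr, size * nmemb);
  return size * nmemb;
}

// Fetch one minibatch of image URLs concurrently. Connections and DNS
// results are shared across the easy handles inside the multi handle, so
// requests against a single host such as S3 avoid repeated lookups.
// curl_global_init(CURL_GLOBAL_ALL) should be called once at startup.
std::vector<std::string> fetch_batch(const std::vector<std::string>& urls) {
  std::vector<std::string> bodies(urls.size());
  CURLM* multi = curl_multi_init();
  std::vector<CURL*> handles;
  for (size_t i = 0; i < urls.size(); ++i) {
    CURL* easy = curl_easy_init();
    curl_easy_setopt(easy, CURLOPT_URL, urls[i].c_str());
    curl_easy_setopt(easy, CURLOPT_WRITEFUNCTION, append_body);
    curl_easy_setopt(easy, CURLOPT_WRITEDATA, &bodies[i]);
    curl_multi_add_handle(multi, easy);
    handles.push_back(easy);
  }
  int still_running = 0;
  do {  // Drive all transfers until the whole minibatch has arrived.
    curl_multi_perform(multi, &still_running);
    curl_multi_wait(multi, NULL, 0, 1000, NULL);
  } while (still_running > 0);
  for (size_t i = 0; i < handles.size(); ++i) {
    curl_multi_remove_handle(multi, handles[i]);
    curl_easy_cleanup(handles[i]);
  }
  curl_multi_cleanup(multi);
  return bodies;  // Raw JPEG bytes, ready for libjpeg-turbo decoding.
}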

Now, an alternative is to configure a set of EBS volumes with the training data and run NFS or GlusterFS to serve it to the training machines. I'm not saying one method is better than the other. S3 is certainly easier to configure and cheaper per GB-month of storage, but a private network filesystem might offer better performance. This feature is meant to be something to try and measure. It might also be the case that, for online training, consuming resources from an HTTP server is easier.

Here's what the input protobuf message definitions look like for some hypothetical FooTask. This example dataset has four sets of training labels for multi-task learning.

message ProtobufManifest {
  repeated ProtobufRecord records = 1;
  repeated string labels = 2;
  repeated uint32 label_dims = 3;
}

message ProtobufRecord {
  optional string image_url = 1;
  optional int32 crop_x1 = 2;
  optional int32 crop_x2 = 3;
  optional int32 crop_y1 = 4;
  optional int32 crop_y2 = 5;

  extensions 100 to max;
}

message FooTaskRecord {
  extend ProtobufRecord {
    optional FooTaskRecord parent = 103;
  }

  optional uint32 classification_label = 1;
  optional float regression_target = 2;
  repeated uint32 classification_multilabel = 3;
  repeated float regression_multitarget = 4;
}
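For example, a manifest in this format could be built with the generated C++ API (a sketch only; the header name foo_task.pb.h is assumed):

#include <fstream>
#include <string>
#include <google/protobuf/text_format.h>
#include "foo_task.pb.h"  // Hypothetical header generated from the messages above.

int main() {
  caffe::ProtobufManifest manifest;

  // One record: where the image lives, how to crop it, and its labels.
  caffe::ProtobufRecord* rec = manifest.add_records();
  rec->set_image_url("http://s3.amazonaws.com/footask/123.jpg");
  rec->set_crop_x1(10);
  rec->set_crop_x2(20);
  rec->set_crop_y1(15);
  rec->set_crop_y2(25);

  // Task-specific labels go into the extension.
  caffe::FooTaskRecord* task =
      rec->MutableExtension(caffe::FooTaskRecord::parent);
  task->set_classification_label(1);
  task->set_regression_target(1.5f);
  task->add_classification_multilabel(1);
  task->add_regression_multitarget(1.2f);

  // Declare which label fields exist and their per-example dimensions.
  manifest.add_labels("classification_label");
  manifest.add_label_dims(1);

  // Text format for inspection; a binary serialization would work as well.
  std::string text;
  google::protobuf::TextFormat::PrintToString(manifest, &text);
  std::ofstream out("my_manifest.protobuf");
  out << text;
  return 0;
}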

And here's what an input protobuf might look like (encoded with text_format for visualization).

records {
  image_url: "http://s3.amazonaws.com/footask/123.jpg"
  crop_x1: 10
  crop_x2: 20
  crop_y1: 15
  crop_y2: 25
  [caffe.FooTaskRecord.parent] {
    classification_label: 1
    regression_target: 1.5
    classification_multilabel: 1
    classification_multilabel: 2
    classification_multilabel: 3
    classification_multilabel: 4
    classification_multilabel: 5
    regression_multitarget: 1.2
    regression_multitarget: 3.4
    regression_multitarget: 5.6
  }
}
records {
  image_url: "http://s3.amazonaws.com/footask/456.jpg"
  crop_x1: 10
  crop_x2: 20
  crop_y1: 15
  crop_y2: 25
  [caffe.FooTaskRecord.parent] {
    classification_label: 1
    regression_target: 1.5
    classification_multilabel: 1
    classification_multilabel: 2
    classification_multilabel: 3
    classification_multilabel: 4
    classification_multilabel: 5
    regression_multitarget: 1.2
    regression_multitarget: 3.4
    regression_multitarget: 5.6
  }
}
labels: "classification_label"
labels: "regression_target"
labels: "classification_multilabel"
labels: "regression_multitarget"
label_dims: 1
label_dims: 1
label_dims: 5
label_dims: 3
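Selecting label fields by name like this suggests a reflection-based lookup; the sketch below shows one way it could work (ExtractLabel is an illustrative name, and this is not necessarily how the PR implements it):

#include <string>
#include <vector>
#include <google/protobuf/message.h>
#include "foo_task.pb.h"  // Hypothetical generated header, as above.

// Copy the label field named `name` out of a record's task extension
// into a flat float buffer, one slot per configured label dimension.
void ExtractLabel(const caffe::ProtobufRecord& record,
                  const std::string& name, std::vector<float>* out) {
  const caffe::FooTaskRecord& task =
      record.GetExtension(caffe::FooTaskRecord::parent);
  const google::protobuf::Descriptor* desc = task.GetDescriptor();
  const google::protobuf::Reflection* refl = task.GetReflection();
  const google::protobuf::FieldDescriptor* field = desc->FindFieldByName(name);
  if (field == NULL) return;  // Unknown label name; real code would error out.

  if (field->is_repeated()) {
    // e.g. classification_multilabel, regression_multitarget.
    for (int i = 0; i < refl->FieldSize(task, field); ++i) {
      if (field->cpp_type() == google::protobuf::FieldDescriptor::CPPTYPE_FLOAT)
        out->push_back(refl->GetRepeatedFloat(task, field, i));
      else
        out->push_back(static_cast<float>(refl->GetRepeatedUInt32(task, field, i)));
    }
  } else {
    // e.g. classification_label, regression_target.
    if (field->cpp_type() == google::protobuf::FieldDescriptor::CPPTYPE_FLOAT)
      out->push_back(refl->GetFloat(task, field));
    else
      out->push_back(static_cast<float>(refl->GetUInt32(task, field)));
  }
}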

Now say you're not interested in multi-task learning and just want to do classification on the classification_label field. In the network's protobuf definition, you can specify the following to select only classification_label as output.

layers {
  top: "data"
  top: "classification_label"
  name: "train"
  type: PROTOBUF_DATA
  include {
    phase: TRAIN
  }
  protobuf_data_param {
    source: "my_manifest.protobuf"
    batch_size: 32
    shuffle: true
    crop_height: 224
    crop_width: 224
    labels: "classification_label"
  }
}

Or, if you want both classification_label and regression_target for training, you can specify both. The size of top will always be N + 1, where N is the number of selected label outputs (the extra top is the image data itself).

layers {
  top: "data"
  top: "classification_label"
  top: "regression_target"
  name: "train"
  type: PROTOBUF_DATA
  include {
    phase: TRAIN
  }
  protobuf_data_param {
    source: "my_manifest.protobuf"
    batch_size: 32
    shuffle: true
    crop_height: 224
    crop_width: 224
    labels: "classification_label"
    labels: "regression_target"
  }
}
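To make the N + 1 rule concrete, the configuration above yields three tops. Assuming the usual NxCxHxW blob layout, their shapes would be roughly (this is inferred from the parameters, not documented behavior of the PR):

// batch_size: 32, crop 224x224, two selected labels with label_dims 1 each.
top[0]->Reshape(32, 3, 224, 224);  // "data": decoded, cropped, resized images
top[1]->Reshape(32, 1, 1, 1);      // "classification_label" (label_dims: 1)
top[2]->Reshape(32, 1, 1, 1);      // "regression_target" (label_dims: 1)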

@sguada (Contributor) commented Oct 12, 2014

@kmatzen I think creating another data layer doesn't make much sense. The idea behind #1238 and #1239 is to simplify things and have common interfaces. What I would do is create another Database that is backed by web-stored images.

Regarding adding more dependencies: they would need to be optional, since not everybody may want to use them. Transformations like cropping should be handled by transformation_param, as sketched below.

Regarding multi-task and other ways to have multiple labels: we are planning to separate labels from data, so one can have different labels and/or tasks for the same data (see the plan in #523).
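For reference, a minimal sketch of that kind of configuration, assuming the transform_param fields Caffe's data layers exposed at the time (the data layer and source here are illustrative):

layers {
  top: "data"
  top: "label"
  name: "train"
  type: DATA
  include {
    phase: TRAIN
  }
  data_param {
    source: "my_db"
    batch_size: 32
  }
  transform_param {
    crop_size: 224   # random crop at TRAIN time
    mirror: true     # random horizontal flip
  }
}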

@sguada (Contributor) commented Nov 22, 2014

@kmatzen would you mind opening a separate PR to add the option of using either libjpeg or libjpeg-turbo in Caffe?

@shelhamer (Member) commented:
@kmatzen rather than special-casing this input format, I would like to advocate for

  1. data layers in Python, for their generality (#1703: reform the boost::python wrapper, including layers implemented in Python)
  2. a socket data layer that consumes blobs and is decoupled from the producer (#238: Socket Data Layer)

but thank you for pull-requesting your own solution. Note that a solution to #1896 will open up decentralized layers too. Closing in favor of (1) and (2).

shelhamer closed this on Mar 10, 2015.