Proto data layer #1266

Closed — kmatzen wants to merge 6 commits

Conversation

@kmatzen (Contributor) commented Oct 12, 2014

I recently saw PR #1239, so I figured I'd mention something I've been working on to find out whether it's useful to other people. This feature is related to #1239 in that one of its core aims is to store encoded images and preprocess them during training. It differs in that it also supports multi-task datasets and reads images over HTTP rather than from a database or the local filesystem.

Here's my use case:
I do my training on EC2, and when I start a new instance, I need to fetch my training data from S3. Sometimes I only want to train for an epoch or less to help set learning rates and other parameters, and it's wasteful to download the full dataset to each machine if I'm not going to use all of it. The disks on g2.2xlarge instances are only 60 GB, so I often can't fit the entire dataset on each one anyway.

The second part of my use case is that I often vary how I preprocess training examples during development. The input here is therefore not a set of pre-cropped, pre-resized images, but the original images along with cropping and resizing parameters.

Here's the feature:
It's a new data layer that takes as input a protobuf enumerating training examples, each with:
(1) A URL to a jpeg image.
(2) A crop region.
(3) A set of training labels.

Each minibatch of images is fetched together with libcurl's multi interface. That way, if all of your images are stored in the same place, such as S3, the DNS resolution is reused across requests. The images are decoded with libjpeg-turbo, which offers better performance than what OpenCV uses (although you lose support for anything other than JPEG). cv::resize handles resizing, although it's fairly slow unless you use OpenCV 3, which is built with IPP support.
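Roughly, the batched fetch looks like the following (a simplified sketch, not the exact code in this PR; fetch_batch and append_body are illustrative names):

#include <curl/curl.h>
#include <string>
#include <vector>

// Append downloaded bytes to the std::string passed via CURLOPT_WRITEDATA.
static size_t append_body(char* ptr, size_t size, size_t nmemb, void* userdata) {
  static_cast<std::string*>(userdata)->append(ptr, size * nmemb);
  return size * nmemb;
}

// Fetch one minibatch of image URLs concurrently. Connections and DNS
// results are shared across the easy handles inside the multi handle, so
// requests against a single host such as S3 avoid repeated lookups.
// curl_global_init(CURL_GLOBAL_ALL) should be called once at startup.
std::vector<std::string> fetch_batch(const std::vector<std::string>& urls) {
  std::vector<std::string> bodies(urls.size());
  CURLM* multi = curl_multi_init();
  std::vector<CURL*> handles;
  for (size_t i = 0; i < urls.size(); ++i) {
    CURL* easy = curl_easy_init();
    curl_easy_setopt(easy, CURLOPT_URL, urls[i].c_str());
    curl_easy_setopt(easy, CURLOPT_WRITEFUNCTION, append_body);
    curl_easy_setopt(easy, CURLOPT_WRITEDATA, &bodies[i]);
    curl_multi_add_handle(multi, easy);
    handles.push_back(easy);
  }
  int still_running = 0;
  do {  // Drive all transfers until the whole minibatch has arrived.
    curl_multi_perform(multi, &still_running);
    curl_multi_wait(multi, NULL, 0, 1000, NULL);
  } while (still_running > 0);
  for (size_t i = 0; i < handles.size(); ++i) {
    curl_multi_remove_handle(multi, handles[i]);
    curl_easy_cleanup(handles[i]);
  }
  curl_multi_cleanup(multi);
  return bodies;  // Raw JPEG bytes, ready for libjpeg-turbo decoding.
}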

Now, an alternative is to configure a set of EBS volumes with the training data and run NFS or GlusterFS to serve it to the training machines. I'm not saying one method is better than the other. S3 is certainly easier to configure and cheaper per GB-month of storage, but a private network filesystem might offer better performance. This feature is meant to be something to try and measure. It might also be the case that, for online training, consuming resources from an HTTP server is easier.

Here's what the input protobuf message definitions look like for some hypothetical FooTask. This example dataset has four sets of training labels for multi-task learning.

message ProtobufManifest {
  repeated ProtobufRecord records = 1;
  repeated string labels = 2;
  repeated uint32 label_dims = 3;
}

message ProtobufRecord {
  optional string image_url = 1;
  optional int32 crop_x1 = 2;
  optional int32 crop_x2 = 3;
  optional int32 crop_y1 = 4;
  optional int32 crop_y2 = 5;

  extensions 100 to max;
}

message FooTaskRecord {
  extend ProtobufRecord {
    optional FooTaskRecord parent = 103;
  }

  optional uint32 classification_label = 1;
  optional float regression_target = 2;
  repeated uint32 classification_multilabel = 3;
  repeated float regression_multitarget = 4;
}
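For example, a manifest in this format could be built with the generated C++ API (a sketch only; the header name foo_task.pb.h is assumed):

#include <fstream>
#include <string>
#include <google/protobuf/text_format.h>
#include "foo_task.pb.h"  // Hypothetical header generated from the messages above.

int main() {
  caffe::ProtobufManifest manifest;

  // One record: where the image lives, how to crop it, and its labels.
  caffe::ProtobufRecord* rec = manifest.add_records();
  rec->set_image_url("http://s3.amazonaws.com/footask/123.jpg");
  rec->set_crop_x1(10);
  rec->set_crop_x2(20);
  rec->set_crop_y1(15);
  rec->set_crop_y2(25);

  // Task-specific labels go into the extension.
  caffe::FooTaskRecord* task =
      rec->MutableExtension(caffe::FooTaskRecord::parent);
  task->set_classification_label(1);
  task->set_regression_target(1.5f);
  task->add_classification_multilabel(1);
  task->add_regression_multitarget(1.2f);

  // Declare which label fields exist and their per-example dimensions.
  manifest.add_labels("classification_label");
  manifest.add_label_dims(1);

  // Text format for inspection; a binary serialization would work as well.
  std::string text;
  google::protobuf::TextFormat::PrintToString(manifest, &text);
  std::ofstream out("my_manifest.protobuf");
  out << text;
  return 0;
}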

And here's what an input protobuf might look like (encoded with text_format for visualization).

records {
  image_url: "http://s3.amazonaws.com/footask/123.jpg"
  crop_x1: 10
  crop_x2: 20
  crop_y1: 15
  crop_y2: 25
  [caffe.FooTaskRecord.parent] {
    classification_label: 1
    regression_target: 1.5
    classification_multilabel: 1
    classification_multilabel: 2
    classification_multilabel: 3
    classification_multilabel: 4
    classification_multilabel: 5
    regression_multitarget: 1.2
    regression_multitarget: 3.4
    regression_multitarget: 5.6
  }
}
records {
  image_url: "http://s3.amazonaws.com/footask/456.jpg"
  crop_x1: 10
  crop_x2: 20
  crop_y1: 15
  crop_y2: 25
  [caffe.FooTaskRecord.parent] {
    classification_label: 1
    regression_target: 1.5
    classification_multilabel: 1
    classification_multilabel: 2
    classification_multilabel: 3
    classification_multilabel: 4
    classification_multilabel: 5
    regression_multitarget: 1.2
    regression_multitarget: 3.4
    regression_multitarget: 5.6
  }
}
labels: "classification_label"
labels: "regression_target"
labels: "classification_multilabel"
labels: "regression_multitarget"
label_dims: 1
label_dims: 1
label_dims: 5
label_dims: 3
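Selecting label fields by name like this suggests a reflection-based lookup; the sketch below shows one way it could work (ExtractLabel is an illustrative name, and this is not necessarily how the PR implements it):

#include <string>
#include <vector>
#include <google/protobuf/message.h>
#include "foo_task.pb.h"  // Hypothetical generated header, as above.

// Copy the label field named `name` out of a record's task extension
// into a flat float buffer, one slot per configured label dimension.
void ExtractLabel(const caffe::ProtobufRecord& record,
                  const std::string& name, std::vector<float>* out) {
  const caffe::FooTaskRecord& task =
      record.GetExtension(caffe::FooTaskRecord::parent);
  const google::protobuf::Descriptor* desc = task.GetDescriptor();
  const google::protobuf::Reflection* refl = task.GetReflection();
  const google::protobuf::FieldDescriptor* field = desc->FindFieldByName(name);
  if (field == NULL) return;  // Unknown label name; real code would error out.

  if (field->is_repeated()) {
    // e.g. classification_multilabel, regression_multitarget.
    for (int i = 0; i < refl->FieldSize(task, field); ++i) {
      if (field->cpp_type() == google::protobuf::FieldDescriptor::CPPTYPE_FLOAT)
        out->push_back(refl->GetRepeatedFloat(task, field, i));
      else
        out->push_back(static_cast<float>(refl->GetRepeatedUInt32(task, field, i)));
    }
  } else {
    // e.g. classification_label, regression_target.
    if (field->cpp_type() == google::protobuf::FieldDescriptor::CPPTYPE_FLOAT)
      out->push_back(refl->GetFloat(task, field));
    else
      out->push_back(static_cast<float>(refl->GetUInt32(task, field)));
  }
}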

Now say you're not interested in multi-task learning and just want to do classification on the classification_label field. In the network's protobuf definition, you can specify the following to select only classification_label as output.

layers {
  top: "data"
  top: "classification_label"
  name: "train"
  type: PROTOBUF_DATA
  include {
    phase: TRAIN
  }
  protobuf_data_param {
    source: "my_manifest.protobuf"
    batch_size: 32
    shuffle: true
    crop_height: 224
    crop_width: 224
    labels: "classification_label"
  }
}

Or, if you want both classification_label and regression_target for training, you can specify both. The size of top will always be N + 1, where N is the number of selected label outputs (the extra top is the image data itself).

layers {
  top: "data"
  top: "classification_label"
  top: "regression_target"
  name: "train"
  type: PROTOBUF_DATA
  include {
    phase: TRAIN
  }
  protobuf_data_param {
    source: "my_manifest.protobuf"
    batch_size: 32
    shuffle: true
    crop_height: 224
    crop_width: 224
    labels: "classification_label"
    labels: "regression_target"
  }
}
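To make the N + 1 rule concrete, the configuration above yields three tops. Assuming the usual NxCxHxW blob layout, their shapes would be roughly (this is inferred from the parameters, not documented behavior of the PR):

// batch_size: 32, crop 224x224, two selected labels with label_dims 1 each.
top[0]->Reshape(32, 3, 224, 224);  // "data": decoded, cropped, resized images
top[1]->Reshape(32, 1, 1, 1);      // "classification_label" (label_dims: 1)
top[2]->Reshape(32, 1, 1, 1);      // "regression_target" (label_dims: 1)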

@sguada (Contributor) commented Oct 12, 2014

@kmatzen I think creating another data layer doesn't make much sense. The idea behind #1238 and #1239 is to simplify things and have common interfaces. What I would do is create another Database that is backed by web-stored images.

Regarding adding more dependencies: they would need to be optional, since not everybody may want to use them. Transformations like cropping should be handled by transformation_param, as sketched below.

Regarding multi-task and other ways to have multiple labels: we are planning to separate labels from data, so one can have different labels and/or tasks for the same data (see the plan in #523).
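For reference, a minimal sketch of that kind of configuration, assuming the transform_param fields Caffe's data layers exposed at the time (the data layer and source here are illustrative):

layers {
  top: "data"
  top: "label"
  name: "train"
  type: DATA
  include {
    phase: TRAIN
  }
  data_param {
    source: "my_db"
    batch_size: 32
  }
  transform_param {
    crop_size: 224   # random crop at TRAIN time
    mirror: true     # random horizontal flip
  }
}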

@sguada (Contributor) commented Nov 22, 2014

@kmatzen would you mind opening a separate PR to add the option of using either libjpeg or libjpeg-turbo in Caffe?

@shelhamer (Member) commented:
@kmatzen rather than special-casing this input format, I would like to advocate for

  1. data layers in Python, for their generality (#1703: reform the boost::python wrapper, including layers implemented in Python)
  2. a socket data layer that consumes blobs and is decoupled from the producer (#238: Socket Data Layer)

but thank you for pull-requesting your own solution. Note that a solution to #1896 will open up decentralized layers too. Closing in favor of (1) and (2).

shelhamer closed this on Mar 10, 2015.