RNN + LSTM Layers #3948

Merged
merged 3 commits into BVLC:master from jeffdonahue:recurrent-layer on Jun 2, 2016

Conversation

Contributor

jeffdonahue commented Apr 5, 2016

This PR includes the core functionality (with minor changes) of #2033 -- the RNNLayer and LSTMLayer implementations (as well as the parent RecurrentLayer class) -- without the COCO data downloading/processing tools or the LRCN example.

Breaking off this chunk for merge should make users who are already using these layer types on their own happy, without adding a large review/maintenance burden for the examples (which have already broken multiple times due to changes in the COCO data distribution format...). On the other hand, without any example on how to format the input data for these layers, it will be fairly difficult to get started, so I'd still like to follow up with at least a simple sequence example for official inclusion in Caffe (maybe memorizing a random integer sequence -- I think I have some code for that somewhere) soon after the core functionality is merged.

There's still at least one documentation TODO: I added expose_hidden to allow direct access (via bottoms/tops) to the recurrent model's 0th timestep and Tth timestep hidden states, but didn't add anything to the list of bottoms/tops -- still need to do that. Otherwise, this should be ready for review.

@weiliu89 weiliu89 added a commit to weiliu89/caffe that referenced this pull request Apr 7, 2016

@weiliu89 weiliu89 fix conflict of merging #3948 b1678f3

@longjon longjon commented on an outdated diff Apr 8, 2016

include/caffe/layers/recurrent_layer.hpp
+#include <utility>
+#include <vector>
+
+#include "caffe/blob.hpp"
+#include "caffe/common.hpp"
+#include "caffe/layer.hpp"
+#include "caffe/net.hpp"
+#include "caffe/proto/caffe.pb.h"
+
+namespace caffe {
+
+template <typename Dtype> class RecurrentLayer;
+
+/**
+ * @brief An abstract class for implementing recurrent behavior inside of an
+ * unrolled network. This Layer type cannot be instantiated -- instaed,

longjon Apr 8, 2016

Contributor

typo: "instaed"

shelhamer added the focus label Apr 8, 2016

@weiliu89 weiliu89 added a commit to weiliu89/caffe that referenced this pull request Apr 9, 2016

@weiliu89 weiliu89 Merge pull request #3948 from jeffdonahue/recurrent-layer
RNN + LSTM Layers
8afb9c5

It doesn't work with the current net_spec.py. Specifically: 1) it fails when using L.LSTM() or L.RNN(), since only RecurrentParameter is defined in caffe.proto; 2) it fails when using L.Recurrent(), since RecurrentLayer is not registered (it is an abstract class).

I did a simple hack by adding the following in the param_name_dict() function in net_spec.py

param_names += ['recurrent', 'recurrent']
param_type_names += ['LSTM', 'RNN']
Owner

shelhamer commented Apr 11, 2016

@weiliu89 the recurrent parameter for these layers, like the convolution parameter for DeconvolutionLayer, is defined in net spec by naming it directly:

n = caffe.NetSpec()
...
n.lstm = L.LSTM(n.data, recurrent_param=dict(num_output=10))
...

Whether to map these shared parameter types as you suggest here, or as suggested for DeconvolutionLayer in #3954, could be handled by a separate PR, since recurrent layers are not the only instance of this.

@shelhamer shelhamer commented on an outdated diff Apr 16, 2016

include/caffe/layers/recurrent_layer.hpp
+ * @param top output Blob vector (length 1)
+ * -# @f$ (T \times N \times D) @f$
+ * the time-varying output @f$ y @f$, where @f$ D @f$ is
+ * <code>recurrent_param.num_output()</code>.
+ * Refer to documentation for particular RecurrentLayer implementations
+ * (such as RNNLayer and LSTMLayer) for the definition of @f$ y @f$.
+ */
+ virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
+ const vector<Blob<Dtype>*>& top);
+ virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
+ const vector<Blob<Dtype>*>& top);
+ virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
+ const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
+
+ /// @brief A helper function, useful for stringifying timestep indices.
+ virtual string int_to_str(const int t) const;

shelhamer Apr 16, 2016

Owner

It's a little surprising to see a helper like this show up in the recurrent layer, but if there weren't any use for it elsewhere then I suppose it could live here. That said, there is already format.hpp and its format_int() function that was added for cross-platform compatibility in b72b031. How about making use of that instead?

@shelhamer shelhamer commented on an outdated diff Apr 16, 2016

src/caffe/layers/recurrent_layer.cpp
+ }
+}
+
+template <typename Dtype>
+void RecurrentLayer<Dtype>::Reset() {
+ // "Reset" the hidden state of the net by zeroing out all recurrent outputs.
+ for (int i = 0; i < recur_output_blobs_.size(); ++i) {
+ caffe_set(recur_output_blobs_[i]->count(), Dtype(0),
+ recur_output_blobs_[i]->mutable_cpu_data());
+ }
+}
+
+template <typename Dtype>
+void RecurrentLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
+ const vector<Blob<Dtype>*>& top) {
+ // Hacky fix for test time... reshare all the shared blobs.

shelhamer Apr 16, 2016

Owner

It might be worth elaborating on the reason for this, since it isn't immediately clear that the cause is the test net sharing weights with the train net during solving. One could be confused into thinking there is some issue with weight sharing at test time in general, but simply instantiating a net in the TEST phase is actually fine.

@shelhamer shelhamer commented on an outdated diff Apr 16, 2016

include/caffe/layers/lstm_layer.hpp
+
+#include "caffe/blob.hpp"
+#include "caffe/common.hpp"
+#include "caffe/layer.hpp"
+#include "caffe/layers/recurrent_layer.hpp"
+#include "caffe/net.hpp"
+#include "caffe/proto/caffe.pb.h"
+
+namespace caffe {
+
+template <typename Dtype> class RecurrentLayer;
+
+/**
+ * @brief Processes sequential inputs using a "Long Short-Term Memory" (LSTM)
+ * [1] style recurrent neural network (RNN). Implemented as a network
+ * unrolled the LSTM computation in time.

shelhamer Apr 16, 2016

Owner

Grammar. Implemented by unrolling the LSTM computation through time in the network?

@shelhamer shelhamer commented on an outdated diff Apr 16, 2016

src/caffe/test/test_lstm_layer.cpp
+#include "caffe/filler.hpp"
+#include "caffe/layers/lstm_layer.hpp"
+
+#include "caffe/test/test_caffe_main.hpp"
+#include "caffe/test/test_gradient_check_util.hpp"
+
+namespace caffe {
+
+template <typename TypeParam>
+class LSTMLayerTest : public MultiDeviceTest<TypeParam> {
+ typedef typename TypeParam::Dtype Dtype;
+
+ protected:
+ LSTMLayerTest() : num_output_(7) {
+ blob_bottom_vec_.push_back(&blob_bottom_);
+ blob_bottom_vec_.push_back(&blob_bottom_flush_);

shelhamer Apr 16, 2016

Owner

flush might be better as continuation since 1 == continue and that is the language elsewhere in the implementation. Up to you.

Owner

shelhamer commented Apr 16, 2016

LGTM overall—my only comments were about comments and naming (and that one int -> string function). @longjon are you done with your review?

Contributor

ajtulloch commented Apr 17, 2016

Looks great. Thanks for this @jeffdonahue. We've been using a variant of this for a while and it has performed great.

One thing we can additionally PR/gist (if it's useful) is a wrapper around the LSTM layer that allows for arbitrary length (batched) forward propagation - which came in handy when doing inference on arbitrary length sequences (relaxing the constraint around T_ while preserving memory efficiency for the forward pass by reusing activations across timesteps).

Contributor

jeffdonahue commented May 3, 2016

@shelhamer @longjon thanks for the review! Fixed as suggested.

@ajtulloch glad to hear it's been working for you guys, thanks for looking it over! I'm not sure I understand the idea of the wrapper though. I think this implementation should be able to do what you're saying -- memory efficient forward propagation over arbitrarily long sequences -- by feeding in T_=1 (1xNx...) data to the RecurrentLayer and setting cont=0 at the first timestep of the sequence, then cont=1 through the end (then starting over with cont=0 at the start of the next sequence). This should reuse the activation memory as you mentioned (using just O(N) memory rather than O(NT)). (In fact, this capability is the point of having the cont input in the first place.) Maybe your wrapper is a friendly interface that handles all the bookkeeping for this? In that case it definitely sounds like it would be helpful. Or maybe I'm totally misunderstanding?
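(A minimal NumPy sketch of the streaming scheme described above -- this is not Caffe code; the vanilla-RNN cell, names, and shapes are hypothetical. It shows that feeding one timestep at a time and zeroing the carried state wherever cont == 0 reproduces running each sequence separately from a zero state, while keeping only O(N) hidden state in memory.)

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, b):
    # One vanilla-RNN timestep: h_t = tanh(x_t Wxh + h_{t-1} Whh + b)
    return np.tanh(x @ Wxh + h_prev @ Whh + b)

rng = np.random.default_rng(0)
N, D, H = 2, 3, 4
Wxh = rng.standard_normal((D, H))
Whh = rng.standard_normal((H, H))
b = rng.standard_normal(H)

seq_a = rng.standard_normal((4, N, D))   # first sequence, T=4
seq_b = rng.standard_normal((3, N, D))   # second sequence, T=3

# Reference: run each sequence separately from a zero hidden state.
def run(seq):
    h = np.zeros((N, H)); outs = []
    for x in seq:
        h = rnn_step(x, h, Wxh, Whh, b)
        outs.append(h)
    return outs

reference = run(seq_a) + run(seq_b)

# Streaming: feed one 1 x N x D slice at a time; cont = 0 at each
# sequence start zeroes the carried state, so only O(N) memory is kept.
stream = np.concatenate([seq_a, seq_b])
cont = np.ones(len(stream)); cont[0] = cont[len(seq_a)] = 0

h = np.zeros((N, H)); streamed = []
for t, x in enumerate(stream):
    h = rnn_step(x, cont[t] * h, Wxh, Whh, b)
    streamed.append(h)

assert all(np.allclose(r, s) for r, s in zip(reference, streamed))
```

The final assertion holds because zeroing the carried state at a cont == 0 timestep is exactly equivalent to restarting the recurrence from scratch for the new sequence.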

Contributor

ajtulloch commented May 3, 2016 (edited)

@jeffdonahue yeah, the only contribution was allowing variable-T_ inputs while still batching the i2h transform -- this was substantially faster than the approach you describe (T_ = 1 and looping, which I initially did), IIRC ~3x for some of our models. It costs a bit more memory (NxT_xD for the batched i2h activations, versus only NxD for the h/c states, for arbitrary T_), but saves the NxT_xD that full unrolling spends on the h/c states. https://gist.github.com/ajtulloch/2b7a98de642df934456001de238ed5c7 is the CPU impl -- it's a bit niche so I wouldn't advocate pulling it at all, but it might be handy for someone who hits this issue in the future.

Contributor

jeffdonahue commented May 3, 2016

Ah -- batching the input transformation regardless of sequence length indeed makes sense. Thanks in advance for posting the code!
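(The batched input transform can be illustrated with a similar hypothetical NumPy sketch -- not the gist's actual implementation: the i2h product for all T timesteps is computed as one large (T*N) x D GEMM up front, and only the inherently sequential h2h recurrence is looped.)

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, D, H = 6, 2, 3, 4
Wxh = rng.standard_normal((D, H))
Whh = rng.standard_normal((H, H))
b = rng.standard_normal(H)
X = rng.standard_normal((T, N, D))

# Naive loop: one small GEMM per timestep for the input transform.
h = np.zeros((N, H)); naive = []
for t in range(T):
    h = np.tanh(X[t] @ Wxh + h @ Whh + b)
    naive.append(h)

# Batched i2h: fold T into the batch dimension and do one
# (T*N) x D GEMM, then loop only the sequential h2h part.
i2h = (X.reshape(T * N, D) @ Wxh).reshape(T, N, H)  # NxT_xH activations kept
h = np.zeros((N, H)); batched = []
for t in range(T):
    h = np.tanh(i2h[t] + h @ Whh + b)
    batched.append(h)

assert all(np.allclose(a, c) for a, c in zip(naive, batched))
```

The arithmetic is identical; the win is purely that one large matrix product is far more efficient than T small ones, at the cost of keeping the i2h activations for all timesteps.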

@niketanpansare niketanpansare added a commit to niketanpansare/systemml that referenced this pull request May 10, 2016

@niketanpansare niketanpansare [SYSTEMML-540] Added LSTM and RNN compatibility to Caffe.java
Used the proto file from BVLC/caffe#3948
a416baf

MinaRe commented May 13, 2016

Dear all

I have a very big matrix (rows are IDs and columns are labels) and I was wondering how I can do the training in Caffe with just fully connected layers?

Thanks a lot.

@niketanpansare niketanpansare added a commit to niketanpansare/systemml that referenced this pull request May 13, 2016

@niketanpansare niketanpansare [SYSTEMML-540] Added LSTM and RNN compatibility to Caffe.java
Used the proto file from BVLC/caffe#3948
9579358

When will this be merged?

yshean commented May 21, 2016

Has anyone successfully merged @jeffdonahue's caffe:recurrent-layer with BVLC's caffe:master? Why does the assertion CHECK_EQ(2 + num_recur_blobs + static_input_, unrolled_net_->num_inputs()); fail during make runtest?

[----------] 9 tests from LSTMLayerTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] LSTMLayerTest/0.TestForward
F0521 03:29:55.683001  5650 recurrent_layer.cpp:142] Check failed: 2 + num_recur_blobs + static_input_ == unrolled_net_->num_inputs() (4 vs. 0) 

@myfavouritekk myfavouritekk added a commit to myfavouritekk/caffe that referenced this pull request May 24, 2016

@myfavouritekk myfavouritekk Merge pull request #3948 from jeffdonahue/recurrent-layer
RNN + LSTM Layers

* jeffdonahue/recurrent-layer:
  Add LSTMLayer and LSTMUnitLayer, with tests
  Add RNNLayer, with tests
  Add RecurrentLayer: an abstract superclass for other recurrent layer types
3dcc5f8

@aralph aralph added a commit to aralph/caffe that referenced this pull request Jun 1, 2016

@aralph aralph add LSTM, RNN and Recurrent layers. Additions according to PR 'RNN + …
…LSTM Layers #3948' by jeffdonahue for BVLC/caffe master.
d6031ee
Contributor

jeffdonahue commented Jun 2, 2016

Thanks again for the reviews, everyone. Sorry for the delays -- I wanted to do some additional testing, but I'm now comfortable enough with this to merge.

@jeffdonahue jeffdonahue merged commit 58b10b4 into BVLC:master Jun 2, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Contributor

ajtulloch commented Jun 2, 2016

Very nice work @jeffdonahue.

Member

naibaf7 commented Jun 2, 2016

@jeffdonahue
Now also available on the OpenCL branch.

jeffdonahue deleted the jeffdonahue:recurrent-layer branch Jun 2, 2016

Any plans for a release?

Contributor

antran89 commented Jun 7, 2016

Could you post a link to a working tutorial/example on using these layers? It would make things easier for new learners. I know you have one somewhere.

@yjxiong yjxiong pushed a commit to yjxiong/caffe that referenced this pull request Jun 15, 2016

@jeffdonahue jeffdonahue + yjxiong Merge pull request #3948 from jeffdonahue/recurrent-layer
RNN + LSTM Layers
702db71

Great work!!! @jeffdonahue I used https://github.com/jeffdonahue/caffe/tree/recurrent-rebase-cleanup/ as the example to run ./examples/coco_caption/train_language_model.sh. The code I used is BVLC master. It converges well at the beginning but diverges after Iteration 2399, as follows:

I0630 15:15:16.417166 23801 solver.cpp:228] Iteration 2397, loss = 61.5563
I0630 15:15:16.417196 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 3.13294 (* 20 = 62.6589 loss)
I0630 15:15:16.417207 23801 sgd_solver.cpp:106] Iteration 2397, lr = 0.1
I0630 15:15:16.533344 23801 solver.cpp:228] Iteration 2398, loss = 61.561
I0630 15:15:16.533375 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 3.13485 (* 20 = 62.6971 loss)
I0630 15:15:16.533386 23801 sgd_solver.cpp:106] Iteration 2398, lr = 0.1
I0630 15:15:16.655758 23801 solver.cpp:228] Iteration 2399, loss = 61.5369
I0630 15:15:16.655824 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 2.98118 (* 20 = 59.6236 loss)
I0630 15:15:16.655838 23801 sgd_solver.cpp:106] Iteration 2399, lr = 0.1
I0630 15:15:16.776641 23801 solver.cpp:228] Iteration 2400, loss = 78.3731
I0630 15:15:16.776676 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 87.3366 (* 20 = 1746.73 loss)
I0630 15:15:16.776690 23801 sgd_solver.cpp:106] Iteration 2400, lr = 0.1
I0630 15:15:16.892026 23801 solver.cpp:228] Iteration 2401, loss = 95.2123
I0630 15:15:16.892060 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 87.3365 (* 20 = 1746.73 loss)
I0630 15:15:16.892071 23801 sgd_solver.cpp:106] Iteration 2401, lr = 0.1
I0630 15:15:17.007628 23801 solver.cpp:228] Iteration 2402, loss = 112.041
I0630 15:15:17.007663 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 87.3365 (* 20 = 1746.73 loss)
I0630 15:15:17.007675 23801 sgd_solver.cpp:106] Iteration 2402, lr = 0.1
I0630 15:15:17.123337 23801 solver.cpp:228] Iteration 2403, loss = 128.873
I0630 15:15:17.123373 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 87.3365 (* 20 = 1746.73 loss)
I0630 15:15:17.123384 23801 sgd_solver.cpp:106] Iteration 2403, lr = 0.1
I0630 15:15:17.239030 23801 solver.cpp:228] Iteration 2404, loss = 145.734
I0630 15:15:17.239061 23801 solver.cpp:244] Train net output #0: cross_entropy_loss = 87.3365 (* 20 = 1746.73 loss)
I0630 15:15:17.239074 23801 sgd_solver.cpp:106] Iteration 2404, lr = 0.1

Any suggestion?

@jeffdonahue I am new to Caffe. Do you have any example of how to use the RNN layer?
Any help will be appreciated.

agethen commented Jul 26, 2016

@jeffdonahue May I ask for a clarification?
Suppose we have an encoder-decoder structure with two RNN/LSTM layers: the encoder reads features X, the decoder outputs its state H, and the encoder's state is copied to the decoder by setting expose_hidden: true and connecting the blobs.

I can see in RecurrentLayer::Reshape that the recur_input_blobs share their data with the bottom blobs -- but they do not share their diff (unlike the top blobs)! Can the hidden-state/cell-state gradient then still travel backwards from decoder to encoder? Or is this a misunderstanding on my side?
Thank you very much!

Hello, what makes it necessary to switch the dimension order of the bottom blob from N x T x ... to T x N x ...? In that case, the batch_size in the prototxt is actually the number of unrolled timesteps, right?
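(One way to see the motivation for the T-major ordering -- a hypothetical NumPy sketch, not taken from the PR itself: with T x N x ... layout, timestep t for the entire batch is a single contiguous slice, which is what each per-timestep layer of the unrolled net consumes; with N x T x ... it would be a strided gather across the buffer.)

```python
import numpy as np

T, N, D = 4, 3, 2
data = np.arange(T * N * D)

# T-major layout (T x N x D): timestep 0 for the whole batch
# is one contiguous slice of the underlying buffer.
x_tmajor = data.reshape(T, N, D)
step0 = x_tmajor[0]               # shape (N, D)
assert step0.flags["C_CONTIGUOUS"]

# N-major layout (N x T x D): the same timestep-0 batch is a
# strided, non-contiguous view.
x_nmajor = np.ascontiguousarray(x_tmajor.transpose(1, 0, 2))
step0_n = x_nmajor[:, 0, :]       # shape (N, D)
assert not step0_n.flags["C_CONTIGUOUS"]

# Both layouts hold the same values for timestep 0.
assert np.array_equal(step0, step0_n)
```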

@fxbit fxbit added a commit to Yodigram/caffe that referenced this pull request Sep 1, 2016

@jeffdonahue @fxbit jeffdonahue + fxbit Merge pull request #3948 from jeffdonahue/recurrent-layer
RNN + LSTM Layers
fd13748

irfan798 referenced this pull request in Evolving-AI-Lab/ppgn Mar 11, 2017

Closed

Can't backpropagate to EmbedLayer input #8
