LSTM version #3

markovyao · 2017-01-25T11:03:57Z

This is great work. Is there any plan to develop an LSTM version?

ifrosio · 2017-01-27T19:35:01Z

Not immediately, but it shouldn't be hard to implement it in TF.
If you have any version with LSTM, please let us know.

ieow · 2017-02-03T08:42:14Z

Is it possible to implement an LSTM in this GA3C architecture?
An RNN (LSTM) requires serialized input, but in GA3C multiple agents push experiences to a shared queue, so the experiences do not arrive as one serial input. The batch given to the training thread would therefore be mixed and could not be used as RNN training input.
Correct me if I am wrong.
Thanks

adi-sharma · 2017-02-05T22:33:49Z

Should be straightforward, as the state for Atari games is already defined as 4 frames together (see section 4.1 of the original DQN paper - https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) and that is what GA3C uses. If you supply those frames serially, the LSTM version of GA3C will work.

mbz · 2017-02-06T16:58:12Z

Implementing the LSTM version without lots of code change depends on how long the sequences of training data should be. If the sequences are as long as TMAX frames (which I think is the case) then the current architecture works since Trainers receive sequences of TMAX frames. But if the training data should be any longer (i.e. multiple TMAX frames merged together) it becomes a little bit more complicated.

etienne87 · 2017-02-06T18:09:43Z

In the case of an LSTM, shouldn't the batch be organized in (N, T, C, H, W) format?

mbz · 2017-02-06T19:04:44Z

@etienne87 you are correct, but please look here. What the Trainer receives is in (N, T, C, H, W) format, but it merges the T dimension so that the data ends up in (N, C, H, W) format. In a recurrent model these concatenations are unnecessary.
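
To make the shape difference concrete, here is a minimal NumPy sketch. It is not the repository's code: the sizes and variable names are made up for illustration, and each chunk stands in for one agent's experience sequence.

```python
import numpy as np

T, C, H, W = 5, 4, 84, 84  # illustrative sizes, not taken from Config.py
chunks = [np.zeros((T, C, H, W), dtype=np.float32) for _ in range(3)]

# Feed-forward style: time steps from all agents share one batch axis.
flat = np.concatenate(chunks)   # shape (3 * T, C, H, W)

# Recurrent style: every sequence keeps its own time axis.
stacked = np.stack(chunks)      # shape (3, T, C, H, W)

print(flat.shape, stacked.shape)
```

With `np.concatenate` the boundaries between sequences are lost unless the lengths are tracked separately, which is presumably why a recurrent trainer would prefer the stacked layout.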

etienne87 · 2017-02-07T16:22:24Z

@mbz, thanks for pointing to this. Now I am super confused by this part of the code! Can you take a look at #6? I don't see how these concatenations work at all!

I would suggest modifying ThreadTrainer.py to:

    if self.server.model.rnn:
        print('todo')  # recurrent batching not implemented yet
    else:
        while batch_size <= Config.TRAINING_MIN_BATCH_SIZE:
            x_, r_, a_ = self.server.training_q.get()
            if batch_size == 0:
                x__ = x_; r__ = r_; a__ = a_
            else:
                x__ = np.concatenate((x__, x_))
                r__ = np.concatenate((r__, r_))
                a__ = np.concatenate((a__, a_))
            batch_size += x_.shape[0]

etienne87 · 2017-02-09T12:11:04Z

An LSTM would require a reset_state function that addresses a specific row of the batch, right?

    class NetworkVP():
        [...]
        def reset_state(self, idx):
            # todo: zero the LSTM state of row idx only
            self.lstm_state_c[idx, ...] = 0
            self.lstm_state_h[idx, ...] = 0

Sorry for the pseudocode, I am not an expert with TF.
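
The per-row reset in the pseudocode above can be sketched with plain NumPy arrays held on the Python side. `LstmStateStore` is a hypothetical helper, not part of GA3C; TensorFlow tensors do not support this kind of item assignment, so in a TF 1.x graph the state would presumably have to be fed through placeholders or updated with scatter ops.

```python
import numpy as np

class LstmStateStore:
    """Hypothetical helper: holds one (c, h) row per agent."""

    def __init__(self, num_agents, num_cells):
        self.c = np.zeros((num_agents, num_cells), dtype=np.float32)
        self.h = np.zeros((num_agents, num_cells), dtype=np.float32)

    def reset_state(self, idx):
        # Zero only the rows of the agents whose episode just ended.
        self.c[idx, ...] = 0
        self.h[idx, ...] = 0

store = LstmStateStore(num_agents=4, num_cells=8)
store.c += 1.0
store.h += 1.0
store.reset_state([1, 3])  # agents 1 and 3 start a new episode
```

Rows 0 and 2 keep their state while rows 1 and 3 are zeroed.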

etienne87 · 2017-02-10T14:25:20Z

Another confusion I have about this (because I have little experience with TF): it seems we need two graphs, one for prediction (using a dynamic_rnn) and one (maybe using a static tf.rnn?) for backprop if feeding (N, T, C, H, W). Or is there a way to use a gradient applier like in myosuda?
Sorry if this is not really the right place to ask.

mbz · 2017-02-11T17:54:06Z

@etienne87 I'm not sure if I understand your first question about reset_state correctly. Can you please provide more details?

About having separate graphs, there are different ways of implementing the same logic in TF. We are not using separate graphs simply because it's not necessary. Can you please elaborate on why you think having two graphs is necessary?

etienne87 · 2017-02-11T18:00:05Z

@mbz ok! What I mean: in classic A3C, it seems we can just backprop at the end of a rollout (T_MAX steps) by re-using the already computed predictions. Here, on the other hand, it seems we need to recompute the predictions from the samples and actions. In short: should X be (N, H, W, C) in the predictor thread and (N, T, H, W, C) in the train function? Maybe I misunderstood something about TF's internal mechanisms?

Also, about the reset: at the beginning of each episode you probably want to reset the c and h of your LSTM to zero. So, as @ppwwyyxx is suggesting, should the LSTM state be saved inside each ProcessAgent?

ppwwyyxx · 2017-02-11T18:03:12Z

@mbz I have implemented A3C-LSTM with a long sequence length. You don't have to send the whole sequence into the graph. What I did was maintain the current LSTM hidden state for every game simulator on the Python side, and at every step feed the new frame together with the hidden state of each simulator to the graph.
This way the sequence length can be as long as one episode.

markovyao · 2017-02-20T14:48:38Z

@ppwwyyxx I have built an LSTM and stored the hidden states.
However, I have two questions. 1. Where should the stored states be reset or initialized before each episode? 2. How should the class Experience be handled in LSTM training?

ppwwyyxx · 2017-02-20T15:52:13Z

@markovyao
The states are maintained in Python, inside each agent (simulator), so you can easily set them when needed (e.g. right after the agent reaches the end of an episode).
Since each agent maintains its own hidden states, it can do the following on its own:

keep the hidden states in its own experience history buffer and give it to the network for training
send the hidden states to predictor to get the next action
request the predictor to send back the next hidden state and keep it
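
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: `fake_predictor` is a made-up stand-in for the real predictor round trip, and `Agent`, `NCELLS`, and the history layout are hypothetical names, not tensorpack or GA3C code.

```python
import numpy as np

NCELLS = 8  # illustrative state size

def fake_predictor(frame, c, h):
    """Stand-in for the predictor: returns (action, next_c, next_h)."""
    next_c = c + frame.mean()
    next_h = np.tanh(next_c)
    return 0, next_c, next_h

class Agent:
    def __init__(self):
        self.history = []  # experience buffer, including hidden states
        self.reset_states()

    def reset_states(self):
        self.c = np.zeros(NCELLS, dtype=np.float32)
        self.h = np.zeros(NCELLS, dtype=np.float32)

    def step(self, frame, done):
        # Keep the state *before* the step so training can start from it.
        c_in, h_in = self.c, self.h
        action, self.c, self.h = fake_predictor(frame, c_in, h_in)
        self.history.append((frame, action, c_in, h_in))
        if done:
            self.reset_states()  # next episode starts from zeros
        return action

agent = Agent()
frame = np.ones((4, 4), dtype=np.float32)
agent.step(frame, done=False)
agent.step(frame, done=True)
```

Storing the input state with each experience means a training chunk can start in the middle of an episode without replaying earlier frames.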

ricky1203 · 2017-02-21T05:17:18Z

@markovyao
An alternative implementation to @ppwwyyxx's solution:

create matrix variables to store the LSTM hidden states
assign every agent a unique agent_index
use tf.gather to select the hidden states from the matrix and pass them to tf.nn.dynamic_rnn:
init_state = tf.gather(_init_state, self._input_agent_indexs)
use tf.scatter_update to reset/update the LSTM hidden states according to agent_index/last_is_over, for example:

        if tc.is_training:
            # 1 where the episode continues, 0 where it ended (state is zeroed)
            need_reset_states = tf.reshape(tf.ones_like(self._input_is_over) - self._input_is_over, (-1, 1))
            op_updates = [tf.scatter_update(initial_rnn_states[idx], self._input_agent_indexs,
                                            rnn_output_states_array[idx] * tf.cast(need_reset_states, rnn_output_states_array[idx].dtype))
                          for idx in range(len(rnn_output_states_array))]
        else:
            # in predict mode, the is_over is for last state
            batch_size = tf.shape(self._input_agent_indexs)[0]
            op_resets = []   # was used below without being defined
            op_updates = []
            for idx in range(len(initial_rnn_states)):
                shape_states = tf.shape(initial_rnn_states[idx])
                op = tf.scatter_update(initial_rnn_states[idx], self._input_agent_indexs, tf.zeros((batch_size, shape_states[1]), dtype=initial_rnn_states[idx].dtype))
                op_resets.append(op)
                op = tf.scatter_update(initial_rnn_states[idx], self._input_agent_indexs, rnn_output_states_array[idx])
                op_updates.append(op)

In predict/train, call the update/reset ops when needed.

This implementation is useful if you have many LSTM networks or modify the model frequently during development, because the model only exposes the update/reset ops.
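
The gather/scatter idea can be illustrated with a NumPy analogue, where fancy indexing plays the role of tf.gather and indexed assignment plays the role of tf.scatter_update. This is a sketch, not the TF code from the model: the array names and sizes are made up, and adding 1.0 merely stands in for running the RNN.

```python
import numpy as np

num_agents, ncells = 4, 3
states = np.zeros((num_agents, ncells), dtype=np.float32)  # one row per agent

agent_idx = np.array([2, 0])                      # agents in this batch
is_over = np.array([0.0, 1.0], dtype=np.float32)  # agent 0 just finished

init_state = states[agent_idx]   # analogue of tf.gather
new_state = init_state + 1.0     # stand-in for the RNN output state

# Analogue of tf.scatter_update: write states back, zeroing finished episodes.
keep = (1.0 - is_over).reshape(-1, 1)
states[agent_idx] = new_state * keep
```

Agent 2 keeps its updated state, agent 0 is reset to zeros, and agents 1 and 3 are untouched because they were not part of the batch.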

etienne87 · 2017-02-21T15:02:43Z

@ricky1203: could you perhaps provide an example or a link in context?

ricky1203 · 2017-02-22T04:14:33Z

@etienne87
Check def _create_rnn_from_cell() in model.py.

Note: because the hidden states are stored in the model, an agent should predict/train on a single model (GPU device) for the duration of an episode.

Golly · 2017-03-03T06:25:57Z

@etienne87
Have you had any success developing the LSTM version?

etienne87 · 2017-03-03T12:08:29Z

@Golly not so much, to be honest. Also, I think I first need to test the idea referred to in #16; otherwise the LSTM version will need to recompute TMAX steps before each backward pass and update.

etienne87 · 2017-03-28T17:00:32Z

Coming back to this problem with a slightly better understanding of variable-length RNNs:
I think the easiest way to code the LSTM version is to keep track of the c and h states in the experience queues.

In ThreadTrainer::run:

    while not self.exit_flag:
        batch_size = 0
        ids = []
        lengths = []
        while batch_size <= Config.TRAINING_MIN_BATCH_SIZE:
            idx, x_, r_, a_, c_, h_ = self.server.training_q.get()
            lengths.append(x_.shape[0])
            if batch_size == 0:
                x__ = x_; r__ = r_; a__ = a_; c__ = c_; h__ = h_
            else:
                x__ = np.concatenate((x__, x_))
                r__ = np.concatenate((r__, r_))
                a__ = np.concatenate((a__, a_))
                c__ = np.concatenate((c__, c_))
                h__ = np.concatenate((h__, h_))

            ids.append(idx)
            batch_size += x_.shape[0]

        if Config.TRAIN_MODELS:
            self.server.train_model(x__, r__, a__, c__, h__, lengths)

In NetworkVP::_create_graph:

# assumes TF 1.x with: from tensorflow.contrib import rnn
self.d1 = ... #result of feedforward encoder
self.lstm = rnn.BasicLSTMCell(256, state_is_tuple=True)
self.step_sizes = tf.placeholder(tf.int32, [None], name='stepsize') #given by ThreadTrainer, otherwise assume np.ones((batch_predict_size))
batch_size = tf.shape(self.step_sizes)[0]
d1 = tf.reshape(self.d1, [batch_size, -1, 256]) #this will not work without a special function
self.c0 = tf.placeholder(tf.float32, [None, 256])
self.h0 = tf.placeholder(tf.float32, [None, 256])
self.initial_lstm_state = rnn.LSTMStateTuple(self.c0, self.h0)
lstm_outputs, self.lstm_state = tf.nn.dynamic_rnn(self.lstm, d1,
                                                  initial_state=self.initial_lstm_state,
                                                  sequence_length=self.step_sizes,
                                                  time_major=False)
self._state = tf.reshape(lstm_outputs, [-1, 256])  #pass this vector to pi, v

In NetworkVP::predict_p_and_v:

step_sizes = np.ones((c.shape[0],),dtype=np.int32)
feed_dict = self.__get_base_feed_dict()
feed_dict.update({self.x: x, self.step_sizes:step_sizes, self.c0:c, self.h0:h})
p, v, rnn_state = self.sess.run([self.softmax_p, self.logits_v, self.lstm_state], feed_dict=feed_dict)
return p, v, rnn_state.c, rnn_state.h

In NetworkVP::train:

step_sizes = np.array(lengths)
r = np.reshape(r, (r.shape[0],))  # flatten the rewards before feeding them
feed_dict = self.__get_base_feed_dict()
feed_dict.update({self.x: x, self.y_r: r, self.action_index: a, self.step_sizes: step_sizes, self.c0: c, self.h0: h})
self.sess.run(self.train_op, feed_dict=feed_dict)

I think the only thing I am missing is how to "unpack" the sequence of encoded states in the _create_graph method:

d1 = tf.reshape(self.d1, [batch_size,-1,256]) will not work when sequence lengths are variable. Does anybody know TF well enough to tell me how to use step_sizes to create a list of (nstep, 256) tensors?

etienne87 · 2017-03-30T15:52:43Z

Anyway, there is a first implementation here that works fine as long as you don't have too many incomplete experiences (of length < Config.TIME_MAX).

I "solved" the issue by padding sequences in ThreadTrainer.py.
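
A minimal NumPy sketch of such padding, assuming the trainer holds a flat array of concatenated steps plus the list of per-sequence lengths. `pad_sequences` is a hypothetical helper written for illustration, not the function from the linked code.

```python
import numpy as np

def pad_sequences(flat, lengths, time_max):
    """Split a (sum(lengths), D) array into a zero-padded (N, time_max, D) array."""
    padded = np.zeros((len(lengths), time_max) + flat.shape[1:], dtype=flat.dtype)
    start = 0
    for i, n in enumerate(lengths):
        padded[i, :n] = flat[start:start + n]  # real steps first, zeros after
        start += n
    return padded

flat = np.arange(16, dtype=np.float32).reshape(8, 2)  # 8 steps, 2 features
padded = pad_sequences(flat, lengths=[5, 3], time_max=5)
print(padded.shape)
```

The second sequence has only 3 real steps, so its last 2 time steps are zero rows; the lengths still have to be passed along so the RNN and the loss can ignore them.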

To be optimal, we would need to dynamically batch the data after the feedforward encoder (before the LSTM) in order to feed a (N, TIME_MAX, 256) tensor to tf.nn.dynamic_rnn. However, I am not convinced that padding really slows things down, as most experience batches should be full (sequence length equal to TIME_MAX).

I will now test on Pong, fuse with GAE branch. If someone wants to help me understand how to improve this you are welcome! :-)

etienne87 · 2017-04-12T08:56:33Z

Hmm, actually there was still an error in my code: I forgot to mask the loss for the padded inputs!
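
For readers unfamiliar with loss masking, here is a small NumPy sketch of the idea. It is an illustration only, not the fix from the linked code; `masked_mean` is a hypothetical name. In TF the mask could presumably be built with tf.sequence_mask instead.

```python
import numpy as np

def masked_mean(per_step_loss, lengths):
    """Average a (N, T) loss over real steps only, ignoring padded ones."""
    n, t = per_step_loss.shape
    # mask[i, j] is 1 when step j of sequence i is real, 0 when it is padding.
    mask = (np.arange(t)[None, :] < np.asarray(lengths)[:, None])
    mask = mask.astype(per_step_loss.dtype)
    return (per_step_loss * mask).sum() / mask.sum()

loss = np.ones((2, 5), dtype=np.float32)
loss[1, 3:] = 100.0  # garbage values at the padded positions
value = masked_mean(loss, lengths=[5, 3])
```

Without the mask, the padded positions would contribute to the gradient even though they correspond to no real experience.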

I propose a first fix here

Apparently this now works better (at least for CartPole-v0)

In Config.py :

    TIME_MAX = 5
    STACKED_FRAMES = 4
    IMAGE_WIDTH = 1
    IMAGE_HEIGHT = 4
    EPISODES = 4000
    ANNEALING_EPISODE_COUNT = 4000
    BETA_START = 0.01
    BETA_END = 0.01
    LEARNING_RATE_START = 0.0003
    LEARNING_RATE_END = 0.0003
    RMSPROP_DECAY = 0.99
    RMSPROP_MOMENTUM = 0.0
    RMSPROP_EPSILON = 0.1
    DUAL_RMSPROP = False
    USE_GRAD_CLIP = False
    GRAD_CLIP_NORM = 40.0 
    LOG_EPSILON = 1e-6
    TRAINING_MIN_BATCH_SIZE = 16
    USE_RNN = True
    NCELLS = 256
    MIN_POLICY = 0.0
    USE_LOG_SOFTMAX = True

wgeul · 2020-02-25T22:58:08Z

TIME_MAX

Out of interest, can I ask why you've removed this page? What were your findings wrt performance of the addition of LSTM?

Edit: Found your model here: https://github.com/etienne87/GA3C , thanks!

ifrosio closed this as completed Jan 27, 2017

ifrosio reopened this Jan 27, 2017

ppwwyyxx mentioned this issue Feb 23, 2017

Adding LSTM cells in the A3C example tensorpack/tensorpack#159

Closed

ppwwyyxx mentioned this issue Mar 16, 2017

Registering and Passing Variables Through A3C Graph tensorpack/tensorpack#192

Closed

nczempin mentioned this issue Mar 24, 2017

Trying to compare this to universe-starter-agent (A3C) #22

Closed
