
Question Regarding Sequence Length #17

Open
Davide236 opened this issue May 24, 2024 · 3 comments

@Davide236

Firstly, I wanted to thank you for the great project, it helped me understand Recurrent PPO better.

I mostly have one main question about how the training is done, especially regarding the Actor and Critic hidden states.

In my situation, to better understand the workflow of the project, I use only 1 worker.

If I understand correctly, this worker will collect data (observations, actions, hidden states etc.) from the environment until termination. In the example of CartPole it would be something like this:
Step 1:
Observations: [-0.0058, -0.0000, -0.0079, 0.0000]
Hidden States (Actor): (tensor([[[0., 0., 0., ..., 0.]]]), tensor([[[0., 0., 0., ..., 0.]]])), i.e. two all-zero tensors of shape [1, 1, 64] (hidden and cell state)

Then for Step 2 we get new Observations and Hidden States and so forth.

My question is about the training phase and the actual sequence length. If my batch contains 4 episodes (padded to the length of the longest episode), then sequence_length = longest_episode_length, right? Assuming, of course, that I want my sequence length to be the length of my episode.

Moreover, during training, are we using only the first hidden state of each episode? If that is the case, then for every episode fed to the networks, the hidden states will always be the initial states, like:
Hidden States (Actor): (tensor([[[0., 0., 0., ..., 0.]]]), tensor([[[0., 0., 0., ..., 0.]]])), i.e. two all-zero tensors of shape [1, 1, 64]

Hence if this is the case, what is the point of saving the hidden states during the episodes?

I might also have misunderstood this part, so I apologize in advance if this is a nonsense question :)

@MarcoMeter
Owner

Hi @Davide236

[...] what is the point of saving the hidden states during the episodes?

In PPO we sample a fixed number of steps regardless of any episode terminations. As a consequence, an episode might still be ongoing once the worker has collected its steps. So during the next PPO update cycle, the worker must continue from the still ongoing episode. Because of this, we need to save the previous hidden state of that truncated episode for the next optimization. The initial hidden state of an episode is always 0. If only complete episodes were sampled, there would be no need to store hidden states. When stepping through this code, make sure not to examine just the very first PPO update ;) Then you are very likely to notice the truncated episodes and that the hidden states in question are not 0.
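
To make this concrete, here is a minimal sketch of the hidden-state bookkeeping during sampling (hypothetical names and shapes, not the exact code of this repository): the state is stored before each step and is only reset to zeros when an episode actually terminates.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a single-layer LSTM with hidden_size=64 and a fixed
# number of worker steps per PPO update.
hidden_size, obs_dim, worker_steps = 64, 4, 8
lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)

# The initial hidden state of an episode is always zeros.
hxs = torch.zeros(1, 1, hidden_size)
cxs = torch.zeros(1, 1, hidden_size)

stored_hxs, stored_cxs = [], []

for t in range(worker_steps):
    obs = torch.randn(1, 1, obs_dim)   # stand-in for the environment observation
    # Store the state *before* stepping the cell, so optimization can later
    # re-initialize sequences from the correct point.
    stored_hxs.append(hxs)
    stored_cxs.append(cxs)
    _, (hxs, cxs) = lstm(obs, (hxs, cxs))
    done = (t == 4)                    # pretend the episode terminates at step 4
    if done:
        # A terminated episode starts over from zeros. A merely truncated
        # episode keeps (hxs, cxs) and carries it into the next update cycle.
        hxs = torch.zeros(1, 1, hidden_size)
        cxs = torch.zeros(1, 1, hidden_size)

# After the loop, (hxs, cxs) belongs to the still ongoing episode and is reused
# as the starting state when sampling resumes for the next PPO update.
```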

We may also choose a fixed sequence length that is shorter than the maximum episode length. So when we split an episode into sequences of fixed length, we need to retrieve the initial hidden state for each sequence. To ease implementation, we simply store all hidden states during sampling and then pick the ones we need during optimization. This is also known as truncated back-propagation through time.
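
As a rough sketch (again with hypothetical names and shapes, not the repository's exact code), splitting one stored episode into fixed-length sequences and picking each sequence's initial hidden state could look like this:

```python
import torch

# Hypothetical shapes: an episode of 20 steps split into sequences of length 8.
seq_length, episode_length, hidden_size, obs_dim = 8, 20, 64, 4

# Pretend these were collected during sampling (one entry per step).
episode_obs = torch.randn(episode_length, obs_dim)
episode_hxs = torch.randn(episode_length, hidden_size)
episode_cxs = torch.randn(episode_length, hidden_size)

sequences = []
for start in range(0, episode_length, seq_length):
    end = start + seq_length
    sequences.append({
        "obs": episode_obs[start:end],  # padded later if shorter than seq_length
        "hxs": episode_hxs[start],      # initial hidden state of this sequence
        "cxs": episode_cxs[start],      # initial cell state of this sequence
    })

# During optimization, the recurrent cell is unrolled only over seq_length steps
# per sequence, starting from the stored initial states: truncated BPTT.
```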

@Davide236
Author

@MarcoMeter, thank you so much for your answer!

So if I understand correctly: if we sample only full episodes and use a fixed sequence length equal to the episode length, then all hidden states used during training will be zeros.

In this case, theoretically, do the benefits of Recurrent PPO still apply, or does this become more akin to a standard PPO implementation?

Furthermore, would you generally recommend using a fixed sequence length shorter than the actual episode, or not?

@MarcoMeter
Owner

Hidden states are still produced by iteratively feeding the recurrent cell, and during optimization the cell is unrolled by feeding it the entire sequence.
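
As a small illustration (hypothetical dimensions, not the repository's code): even with an all-zero initial state, unrolling the cell over the whole sequence yields a different hidden state at every step, so the policy can still condition on history.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, roughly matching the CartPole example above.
seq_length, obs_dim, hidden_size = 10, 4, 64
lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)

obs_sequence = torch.randn(1, seq_length, obs_dim)  # one (padded) episode
h0 = torch.zeros(1, 1, hidden_size)                 # zero initial hidden state
c0 = torch.zeros(1, 1, hidden_size)                 # zero initial cell state

outputs, _ = lstm(obs_sequence, (h0, c0))
print(outputs.shape)  # torch.Size([1, 10, 64]): a distinct recurrent output per step
```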

The sequence length is just another hyperparameter to tweak and depends on the task.
