
Question Regarding Sequence Length #17

Open
Davide236 opened this issue May 24, 2024 · 3 comments

@Davide236

Firstly, I wanted to thank you for the great project, it helped me understand Recurrent PPO better.

I mostly have one main question about how the training is done, especially regarding the Actor and Critic hidden states.

In my situation, to better understand the workflow of the project, I use only 1 worker.

If I understand correctly, this worker will collect data (observations, actions, hidden states etc.) from the environment until termination. In the example of CartPole it would be something like this:
Step 1:
Observations: [-0.0058, -0.0000, -0.0079, 0.0000]
Hidden States (Actor): (tensor([[[0., 0., 0., ..., 0.]]]), tensor([[[0., 0., 0., ..., 0.]]])), i.e. two all-zero tensors of shape [1, 1, 64] (hidden and cell state)

Then for Step 2 we get new Observations and Hidden States and so forth.

My question is about the training phase and the actual sequence length. If my batch contains 4 episodes (padded to the length of the longest episode), then sequence_length = longest_episode_length, right? Assuming, of course, that I want my sequence length to be the length of my episode.

Moreover, during training, are we using only the first hidden state of each episode? If that is the case, then for every episode fed to the networks, the hidden states will always be the initial states, like:
Hidden States (Actor): (tensor([[[0., 0., 0., ..., 0.]]]), tensor([[[0., 0., 0., ..., 0.]]])), i.e. two all-zero tensors of shape [1, 1, 64]

Hence if this is the case, what is the point of saving the hidden states during the episodes?

I might also have misunderstood this part, so I apologize in advance if this is a nonsense question :)

@MarcoMeter
Owner

Hi @Davide236

[...] what is the point of saving the hidden states during the episodes?

In PPO we sample a fixed number of steps regardless of any episode terminations. As a consequence, an episode might still be ongoing once the worker has collected its steps. So during the next PPO update cycle, the worker must continue from the still ongoing episode. Because of this, we need to save the previous hidden state of that truncated episode for the next optimization. The initial hidden state of an episode is always 0. If only complete episodes were sampled, there would be no need to store hidden states. When stepping through this code, make sure not to examine just the very first PPO update ;) Then you are very likely to notice the truncated episodes and that the hidden states in question are not 0.
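
To make this concrete, here is a minimal sketch of the hidden-state bookkeeping during sampling (hypothetical names and shapes, not the exact code of this repository): the state is stored before each step and is only reset to zeros when an episode actually terminates.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a single-layer LSTM with hidden_size=64 and a fixed
# number of worker steps per PPO update.
hidden_size, obs_dim, worker_steps = 64, 4, 8
lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)

# The initial hidden state of an episode is always zeros.
hxs = torch.zeros(1, 1, hidden_size)
cxs = torch.zeros(1, 1, hidden_size)

stored_hxs, stored_cxs = [], []

for t in range(worker_steps):
    obs = torch.randn(1, 1, obs_dim)   # stand-in for the environment observation
    # Store the state *before* stepping the cell, so optimization can later
    # re-initialize sequences from the correct point.
    stored_hxs.append(hxs)
    stored_cxs.append(cxs)
    _, (hxs, cxs) = lstm(obs, (hxs, cxs))
    done = (t == 4)                    # pretend the episode terminates at step 4
    if done:
        # A terminated episode starts over from zeros. A merely truncated
        # episode keeps (hxs, cxs) and carries it into the next update cycle.
        hxs = torch.zeros(1, 1, hidden_size)
        cxs = torch.zeros(1, 1, hidden_size)

# After the loop, (hxs, cxs) belongs to the still ongoing episode and is reused
# as the starting state when sampling resumes for the next PPO update.
```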

We may also choose a fixed sequence length that is shorter than the maximum episode length. So when we split an episode into sequences of fixed length, we need to retrieve the initial hidden state for each sequence. To ease implementation, we simply store all hidden states during sampling and then pick the ones we need during optimization. This is also known as truncated back-propagation through time.
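
As a rough sketch (again with hypothetical names and shapes, not the repository's exact code), splitting one stored episode into fixed-length sequences and picking each sequence's initial hidden state could look like this:

```python
import torch

# Hypothetical shapes: an episode of 20 steps split into sequences of length 8.
seq_length, episode_length, hidden_size, obs_dim = 8, 20, 64, 4

# Pretend these were collected during sampling (one entry per step).
episode_obs = torch.randn(episode_length, obs_dim)
episode_hxs = torch.randn(episode_length, hidden_size)
episode_cxs = torch.randn(episode_length, hidden_size)

sequences = []
for start in range(0, episode_length, seq_length):
    end = start + seq_length
    sequences.append({
        "obs": episode_obs[start:end],  # padded later if shorter than seq_length
        "hxs": episode_hxs[start],      # initial hidden state of this sequence
        "cxs": episode_cxs[start],      # initial cell state of this sequence
    })

# During optimization, the recurrent cell is unrolled only over seq_length steps
# per sequence, starting from the stored initial states: truncated BPTT.
```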

@Davide236
Author

@MarcoMeter, thank you so much for your answer!

So if I understand correctly: if we sample only full episodes and use a fixed sequence length equal to the episode length, then all hidden states used during training will be zeros.

In this case, theoretically, do the benefits of Recurrent PPO still apply, or does this become more akin to a standard PPO implementation?

Furthermore, would you generally recommend using a fixed sequence length shorter than the actual episode, or not?

@MarcoMeter
Owner

Hidden states are still produced by iteratively feeding the recurrent cell, and during optimization the cell is unrolled by feeding it the entire sequence.
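
As a small illustration (hypothetical dimensions, not the repository's code): even with an all-zero initial state, unrolling the cell over the whole sequence yields a different hidden state at every step, so the policy can still condition on history.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, roughly matching the CartPole example above.
seq_length, obs_dim, hidden_size = 10, 4, 64
lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)

obs_sequence = torch.randn(1, seq_length, obs_dim)  # one (padded) episode
h0 = torch.zeros(1, 1, hidden_size)                 # zero initial hidden state
c0 = torch.zeros(1, 1, hidden_size)                 # zero initial cell state

outputs, _ = lstm(obs_sequence, (h0, c0))
print(outputs.shape)  # torch.Size([1, 10, 64]): a distinct recurrent output per step
```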

The sequence length is just another hyperparameter to tweak and depends on the task.
