Training details about MineAgent #9

Open
mansicer opened this issue Apr 2, 2023 · 9 comments

mansicer commented Apr 2, 2023

Hi. Thank you for releasing this valuable benchmark! I'm working on implementing the PPO agent reported in the paper, but I found some discrepancies between the code and the paper.

Trimmed action space

As mentioned in #4, the code below does not correspond to the 89 action dims in Appendix G.2.

action_dim=[3, 3, 4, 25, 25, 8],
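
For reference, these per-branch sizes sum to 68, not 89:

```python
# Quick sanity check on the mismatch: the listed branch sizes sum to 68,
# which does not match the 89 action dims reported in Appendix G.2.
print(sum([3, 3, 4, 25, 25, 8]))  # 68
```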

About the compass observation

The paper says the compass observation has shape (2,), but I see an input of shape (4,) in your code:

"compass": torch.rand((B, 4), device=device),

Training on MultiDiscrete action space

Is the 89-dimension action space in the paper a MultiDiscrete action space like the original MineDojo action space, or do you simply treat it as a Discrete action space?

In addition, could you release the training code for the three task groups in the paper (or share it via my GitHub email)? It would be very helpful for baseline comparisons!


iSach commented Jun 12, 2023

Hello,

Did you manage to reimplement the training code for the agents with PPO?

I'm getting some issues with the nested dicts despite using the multi-input policy.

mansicer (Author) commented

@iSach Hi. I reimplemented PPO based on the CleanRL code. I use Gym's AsyncVectorEnv for sampling and manually preprocess the batched Dict observations before feeding them to the network. Feel free to elaborate on the issues you're running into.
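
Roughly speaking (a minimal sketch under my assumptions, not my exact code), the preprocessing looks something like:

```python
import numpy as np
import torch

def dict_obs_to_tensors(obs: dict, device: str = "cpu") -> dict:
    """Recursively convert a batched Dict observation, as returned by
    gym.vector.AsyncVectorEnv (numpy arrays with a leading num_envs dim),
    into float tensors the policy network can consume."""
    out = {}
    for key, value in obs.items():
        if isinstance(value, dict):
            out[key] = dict_obs_to_tensors(value, device)
        else:
            out[key] = torch.as_tensor(np.asarray(value), dtype=torch.float32, device=device)
    return out
```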


iSach commented Jun 13, 2023

> @iSach Hi. I reimplemented PPO based on the CleanRL code. I use Gym's AsyncVectorEnv for sampling and manually preprocess the batched Dict observations before feeding them to the network. Feel free to elaborate on the issues you're running into.

I'm not very familiar with running more complex environments like these (I've only run very basic envs from Gym's tutorials). Do you have a repo or a gist to look at?

My main issue is dealing with the nested dicts in the env's observation space. I tried to implement a custom features extractor based on SimpleFeatureFusion, but I can't get anything running at all.
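
The only direction I can think of (a hypothetical sketch on my part, assuming a gym Dict observation space and SB3's MultiInputPolicy; the wrapper name is my own) is lifting the nested keys to the top level:

```python
import gym
from gym import spaces

class FlattenNestedDictWrapper(gym.ObservationWrapper):
    """Hypothetical sketch: lift nested Dict entries to the top level
    (obs["a"]["b"] -> obs["a/b"]) so a single-level Dict policy such as
    SB3's MultiInputPolicy can consume the observations."""

    def __init__(self, env):
        super().__init__(env)
        self.observation_space = spaces.Dict(self._flatten_space(env.observation_space))

    def _flatten_space(self, space, prefix=""):
        flat = {}
        for key, subspace in space.spaces.items():
            if isinstance(subspace, spaces.Dict):
                flat.update(self._flatten_space(subspace, prefix=f"{prefix}{key}/"))
            else:
                flat[f"{prefix}{key}"] = subspace
        return flat

    def observation(self, obs, prefix=""):
        flat = {}
        for key, value in obs.items():
            if isinstance(value, dict):
                flat.update(self.observation(value, prefix=f"{prefix}{key}/"))
            else:
                flat[f"{prefix}{key}"] = value
        return flat
```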

mansicer (Author) commented

> Do you have a repo or a gist to look at?

Unfortunately, not at the moment. I don't think my previous code is bug-free or worth referencing. However, I'd suggest starting from their provided code, like run_env_in_loop.py, and first trying to feed the environment observations into the network. That's how I started from their example code.
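
Concretely, something in the spirit of their README/example code (the task_id below is an arbitrary choice on my part):

```python
import minedojo

# Create a task, inspect the (nested) observation dict, and step with no-op
# actions before wiring anything into a policy network.
env = minedojo.make(task_id="harvest_wool_with_shears_and_sheep", image_size=(160, 256))
obs = env.reset()
print({k: getattr(v, "shape", type(v)) for k, v in obs.items()})  # what the network must consume

for _ in range(10):
    obs, reward, done, info = env.step(env.action_space.no_op())
env.close()
```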


iSach commented Jun 17, 2023

> Do you have a repo or a gist to look at?
>
> Unfortunately, not at the moment. I don't think my previous code is bug-free or worth referencing. However, I'd suggest starting from their provided code, like run_env_in_loop.py, and first trying to feed the environment observations into the network. That's how I started from their example code.

I tried, but I'm running into so many problems with PPO because of the unusual environment that I can't get clean training code working. I don't understand why they would release everything except the code for reproducing the results, especially given how few tasks are demonstrated in the example code.


elcajas commented Jun 19, 2023

A few questions about the policy training:

  • Do you start the PPO update when the PPO buffer is full, or after a certain number of env steps?
  • Do you use a data loader in the PPO update? What is the batch size?
  • How many PPO update iterations do you apply?
  • What is the maximum capacity of the SI buffer?
  • Does the value function head also update the backbone model parameters?
  • Since a trimmed version of the action space is used, does the agent still use the MulticategoricalActor?
  • When using the MineCLIP reward, how do you store states with their corresponding rewards? How do you calculate the rewards for the first 15 steps of an episode?
  • When adding successful trajectories to the SI buffer, when do you update the mean and std of the reward?

I would appreciate it if you could clarify the points above. It would also be helpful if you released the policy training code in the future.

mansicer (Author) commented

Hi @elcajas,

Since the authors have not replied to this issue, I did not continue reimplementing PPO in MineDojo. As for what I can share: I implemented PPO based on the CleanRL version and used a vectorized env to speed up sampling. The network backbone is similar to the FeatureFusion module from this repo.
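
As a rough illustration of what I mean by the backbone (a minimal sketch only, not the repo's actual module; the modality names and sizes are placeholders):

```python
import torch
import torch.nn as nn

class TinyFeatureFusion(nn.Module):
    """Sketch of a FeatureFusion-style backbone: encode each modality
    separately, concatenate the features, and fuse them with an MLP.
    Keys and sizes here are placeholders, not the MineAgent values."""

    def __init__(self, compass_dim=4, gps_dim=3, rgb_feat_dim=512, hidden=256):
        super().__init__()
        self.compass_mlp = nn.Sequential(nn.Linear(compass_dim, 64), nn.ReLU())
        self.gps_mlp = nn.Sequential(nn.Linear(gps_dim, 64), nn.ReLU())
        self.rgb_mlp = nn.Sequential(nn.Linear(rgb_feat_dim, 256), nn.ReLU())
        self.fusion = nn.Sequential(nn.Linear(64 + 64 + 256, hidden), nn.ReLU())

    def forward(self, obs: dict) -> torch.Tensor:
        feats = torch.cat(
            [
                self.compass_mlp(obs["compass"]),
                self.gps_mlp(obs["gps"]),
                self.rgb_mlp(obs["rgb_feat"]),  # e.g. precomputed image features
            ],
            dim=-1,
        )
        return self.fusion(feats)  # shared trunk for the policy and value heads
```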

> Do you start the PPO update when the PPO buffer is full, or after a certain number of env steps?

After a fixed number of env steps.

> Do you use a data loader in the PPO update? What is the batch size? Other hyper-parameters...

I refer to the CleanRL code and Table A.3 from the MineDojo paper.

> Does the value function head also update the backbone model parameters?

Yes.

> Since a trimmed version of the action space is used, does the agent still use the MulticategoricalActor?

No. Using the default discrete version of PPO is fine; see the sketch at the end of this comment for roughly what I mean.

> When using the MineCLIP reward, how do you store states with their corresponding rewards? How do you calculate the rewards for the first 15 steps of an episode?

Unfortunately I haven't tried that.

> When adding successful trajectories to the SI buffer, when do you update the mean and std of the reward?

I'm not clear on what you mean here. Could you provide some details?

In general, that's just my own experience, and I haven't worked on this recently. I sincerely hope the authors and the community will open-source some RL approaches for this benchmark.
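
To illustrate treating the trimmed action set as a flat Discrete space (a hypothetical sketch; the paper's actual 89-action mapping may differ, and the branch values below are placeholders):

```python
import numpy as np

def build_action_table(env):
    """Hypothetical sketch: map each flat Discrete index to a full MineDojo
    MultiDiscrete action vector, starting from the no-op action."""
    no_op = np.asarray(env.action_space.no_op())
    table = [no_op]                 # index 0: do nothing

    forward = no_op.copy()
    forward[0] = 1                  # index 1: move forward (placeholder branch value)
    table.append(forward)

    attack = no_op.copy()
    attack[5] = 3                   # index 2: functional-action branch (placeholder value)
    table.append(attack)

    return table

# At rollout time, PPO's Categorical policy outputs a flat index and the env
# receives the corresponding full action vector:
#   obs, reward, done, info = env.step(action_table[flat_index])
```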

mansicer (Author) commented

I also found a bug in the example code. See #11.


AsWali commented Feb 26, 2024

@elcajas I have the same questions as you. Did you get any further?
