Is save checkpoint not yet supported for ppo ray trainer? #256
Comments
Yes, we haven't fully developed and tested this feature yet. Contributions are welcome.
I'm happy to look into it, but how have you been saving models?
Hi @mickel-liu, have you figured this out? I have no choice but to use
Hi, I did look into the code and found that the checkpoint-saving feature is not yet implemented. But saving checkpoints wasn't actually what I was looking for: I want the actual model checkpoints, not the intermediate training states this repo refers to. So I changed the code on my fork, and it now saves model checkpoints after a pre-set number of iterations. Here's the code in my fork: https://github.com/mickelliu/OpenRLHF/blob/a7f21aa26ac027fcf30ca1c588e01cf07c67cb6f/openrlhf/trainer/ppo_trainer.py#L428-L442 Regardless of whether the ckpt feature is officially implemented,
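For readers who land here looking for the same workaround: the idea above (save the full model every N PPO iterations instead of only trainer state) can be sketched roughly as follows. This is a hypothetical simplification, not the fork's actual code; the `maybe_save_model` helper and its arguments are invented for illustration, and the commented `save_pretrained` call assumes a Hugging Face-style model.

```python
import os


def maybe_save_model(model, step, save_step, save_path):
    """Save a full model checkpoint every `save_step` iterations.

    `step` is the 1-based PPO iteration counter; `save_step == -1`
    disables saving, mirroring the flag discussed in this issue.
    Returns the checkpoint directory if a save happened, else None.
    """
    if save_step > 0 and step % save_step == 0:
        ckpt_dir = os.path.join(save_path, f"step_{step}")
        os.makedirs(ckpt_dir, exist_ok=True)
        # With a Hugging Face-style model this would be something like:
        # model.save_pretrained(ckpt_dir)
        return ckpt_dir
    return None
```

Calling this at the end of each PPO iteration gives you usable model checkpoints without touching the (unimplemented) trainer-state checkpointing path.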
Thanks for the quick reply and for sharing your code! I'm glad to know that saving the trained model is that simple. Although the checkpointing feature would be a great addition, this fix seems to solve my issue.
When I set `save_step` to a value other than -1, the program raises an exception.

OpenRLHF/openrlhf/trainer/ppo_trainer.py, lines 378 to 385 in 3c91875

These three args are indeed not included in `train_ppo_ray.py`, and I don't see `args.save_path` being used anywhere. I did see this issue mentioned in #133; wondering if there's any update.
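For context, wiring the missing flags into the Ray entry point would amount to registering them on its argument parser. The sketch below is an assumption: the exact argument names and defaults must be checked against `ppo_trainer.py` lines 378 to 385, and only `save_step` and `save_path` are taken from this thread.

```python
import argparse

# Hypothetical additions to train_ppo_ray.py's parser; names assumed
# from this issue, not confirmed against the repo's actual flag set.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--save_step", type=int, default=-1,
    help="Save a checkpoint every N PPO steps; -1 disables saving.",
)
parser.add_argument(
    "--save_path", type=str, default="./ckpt",
    help="Directory where checkpoints are written.",
)

args = parser.parse_args(["--save_step", "100"])
```

With the flags registered, the trainer's checkpoint code path would at least receive the values it expects instead of raising on a missing attribute.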