Why does the function `compute_adv_surrogate` in IPO return `(adv_r - penalty * adv_c) / (1 + penalty)` instead of `adv_r`? And when there is more than one constraint, how should I modify the algorithm? Thanks a lot!
That is a good question. If we returned `adv_r - penalty * adv_c` alone, the update direction would become biased when the penalty value grows large. Dividing by `1 + penalty` keeps the surrogate's scale stable and lets training interpolate between two extremes: when penalty = 0, the update is equivalent to classical reinforcement learning algorithms such as Policy Gradient or PPO, and when penalty = $+\infty$, the update simply minimizes the cost.
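A minimal sketch of the surrogate discussed above, in plain Python. The single-constraint version mirrors the formula from the question; the multi-constraint variant (`compute_adv_surrogate_multi`) is only one plausible extension I am suggesting here — one penalty per cost advantage, normalized by the sum of penalties — not the library's actual API:

```python
def compute_adv_surrogate(adv_r, adv_c, penalty):
    """Penalty-weighted surrogate advantage.

    Dividing by (1 + penalty) keeps the result on the same scale as
    adv_r: penalty = 0 recovers adv_r exactly, and as penalty grows
    the result approaches -adv_c (pure cost minimization).
    """
    return (adv_r - penalty * adv_c) / (1.0 + penalty)


def compute_adv_surrogate_multi(adv_r, adv_cs, penalties):
    """Hypothetical multi-constraint extension (an assumption, not the
    repo's implementation): each constraint i contributes its own
    penalty-weighted cost advantage, and the same normalization idea
    is applied with the sum of all penalties."""
    total_penalty = sum(penalties)
    weighted_costs = sum(p * c for p, c in zip(penalties, adv_cs))
    return (adv_r - weighted_costs) / (1.0 + total_penalty)
```

With penalty = 0 both functions return `adv_r` unchanged, and with a very large penalty the single-constraint surrogate tends toward `-adv_c`, matching the two extremes described above.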
We will provide performance curves of IPO on multiple environments as soon as possible to validate this design with experimental results. We will also consider your suggestion and run experiments with the settings you described. Thank you again for your feedback; a Pull Request implementing your ideas is also welcome.