refactor: change the file layout of omnisafe #35
Conversation
LGTM
I have reviewed this PR in detail.
LGTM
self.lagrangian_multiplier = 2.0

def compute_loss_pi(self, data: dict):
    # Policy loss
Suggested change:
- # Policy loss
+ """compute loss for policy"""
""" | ||
Update actor, critic, running statistics | ||
""" |
""" | |
Update actor, critic, running statistics | |
""" | |
"""Update actor, critic, running statistics""" |
""" | ||
Pre-process data, e.g. standardize observations, rescale rewards if | ||
enabled by arguments. | ||
""" |
""" | |
Pre-process data, e.g. standardize observations, rescale rewards if | |
enabled by arguments. | |
""" | |
"""Pre-process data, e.g. standardize observations, rescale rewards if enabled by arguments.""" |
docs/source/BaseRL/TRPO.rst (outdated)
@@ -313,7 +313,7 @@ TRPO describes an approximate policy iteration scheme based on the policy improv
 Note that for now, we assume exact evaluation of the advantage values :math:`A^R_{\pi}`.

 It follows from Equation :ref:`(11) <trpo-eq-11>` that TRPO is guaranteed to generate a monotonically improving sequence of policies :math:`J\left(\pi_0\right) \leq J\left(\pi_1\right) \leq J\left(\pi_2\right) \leq \cdots`.
-To see this, let :math:`M_i(\pi)=L_{\pi_i}(\pi)-C D_{\mathrm{KL}}^{\max }\left(\pi_i, \pi\right)`.
+To see this, let :math:`M_i(\pi)=L_{\pi_i}(\pi)-C D_{\mathrm{}}^{\max }\left(\pi_i, \pi\right)`.
missing KL: the KL subscript was dropped in this change, so :math:`D_{\mathrm{}}^{\max}` should read :math:`D_{\mathrm{KL}}^{\max}`.
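For reference, the monotonic-improvement step this passage relies on is the standard argument from Schulman et al.'s TRPO paper: the KL term vanishes at :math:`\pi_i`, so the surrogate :math:`M_i` touches :math:`J` there, and maximizing it cannot decrease :math:`J`:

```latex
% The max-KL divergence of a policy from itself is zero, so
M_i(\pi_i) = L_{\pi_i}(\pi_i) = J(\pi_i),
\qquad
J(\pi) \ge M_i(\pi) \quad \text{for all } \pi .
% Hence, if \pi_{i+1} maximizes M_i,
J(\pi_{i+1}) \ge M_i(\pi_{i+1}) \ge M_i(\pi_i) = J(\pi_i).
```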
Approve.
Description

Motivation and Context

Why is this change required? What problem does it solve?
If it fixes an open issue, please link to the issue here.
You can use the syntax `close #15213` if this solves the issue #15213.

Types of changes

What types of changes does your code introduce? Put an `x` in all the boxes that apply:

Checklist

Go over all the following points, and put an `x` in all the boxes that apply. If you are unsure about any of these, don't hesitate to ask. We are here to help!

- `make format`. (required)
- `make lint`. (required)
- `make test` pass. (required)