Update 02.PPO.ipynb #22

Merged
merged 1 commit on Apr 13, 2020
2 changes: 1 addition & 1 deletion 02.PPO.ipynb
@@ -40,7 +40,7 @@
"\n",
"There are two kinds of algorithms of PPO: PPO-Penalty and PPO-Clip. Here, we'll implement PPO-clip version.\n",
"\n",
"TRPO computes the gradients with a complex second-order method. On the other hand, PPO tries to solve the problem with a first-order methods that keep new polices close to old. To simplify the surrogate objective, let $r(\\theta)$ denote the probability ratio\n",
"TRPO computes the gradients with a complex second-order method. On the other hand, PPO tries to solve the problem with a first-order methods that keep new policies close to old. To simplify the surrogate objective, let $r(\\theta)$ denote the probability ratio\n",
"\n",
"$$ L^{CPI}(\\theta) = \\hat {\\mathbb{E}}_t \\left [ {\\pi_\\theta(a_t|s_t) \\over \\pi_{\\theta_{old}}(a_t|s_t)} \\hat A_t\\right] = \\hat {\\mathbb{E}}_t \\left [ r_t(\\theta) \\hat A_t \\right ].$$\n",
"\n",
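For reference, below is a minimal PyTorch sketch of the probability ratio and the clipped surrogate objective that the edited cell describes. It is not code from this PR or notebook: the names (`ppo_clip_loss`, `log_probs`, `old_log_probs`, `advantages`, `clip_eps`) are illustrative assumptions, and the $[1-\epsilon, 1+\epsilon]$ clipping follows the standard PPO-Clip formulation rather than anything shown in this hunk.

```python
import torch


def ppo_clip_loss(log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate loss (negated so an optimizer can minimize it)."""
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t),
    # computed in log space for numerical stability.
    ratio = torch.exp(log_probs - old_log_probs)

    # Unclipped surrogate: L^CPI = r_t(theta) * A_t.
    surrogate = ratio * advantages

    # PPO-Clip: clamp the ratio to [1 - eps, 1 + eps], take the elementwise
    # minimum with the unclipped term, and average over the batch.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(surrogate, clipped).mean()
```

Computing the ratio as `exp(log_probs - old_log_probs)` avoids dividing two potentially tiny probabilities, which is why most first-order implementations work with log-probabilities rather than raw probabilities.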