Exercise 3.23 #8
Comments
Thanks for your enthusiasm. I will review all of your issues after work today. If you can provide a similar review for later chapters, that would be a great help! |
Thanks for all the advice. I will fix that! I think I understand what I was thinking at the time. Because q* is the q_pi of the optimal policy, which in my definition is a greedy one, I reasoned that if under high it is always better to search instead of waiting, then in any high state we should always choose search, for the same reason. That is, if under high our optimal policy chooses wait, we should always wait; that would justify my answer. However, if we are under high but choose search, it would imply we should choose search again whenever we are under high. It took me a while (until a few chapters later) to realize that the action values here do not work like that. They are a kind of action preference from which a greedy policy is picked, and many q* are possible for the same optimal policy. |
Hello there,
Thanks! I've just started going through the book, and your solutions are a great resource, at the very least to give a starting point for thought or to compare approaches. Right now I have a bit of time set aside to study this, but I'm not sure how much momentum I'll manage to maintain over the next couple of months.
Do you prefer me opening issues or directly making pull requests?
Thanks,
Jean
On Wed, 22 Jan 2020 at 10:41, YIFAN WANG ***@***.*** wrote:
|
I am happy to hear that! You can have a look at the solutions for Chapter 12, which are the longest answers you will have to consider in this book. Nevertheless, the book is quite easy to read; I personally read one chapter per day, perhaps a little longer for Chapters 7 and 12, which bring great changes to the flow. I prefer opening issues here, so that we can sometimes debate and I can manage my source code well. PS: If it is OK with you, would you tell me your full name (or maybe Jean is enough?) so I can include you in the contributor list in the readme file, with the specific problems you corrected, if you wish. |
Hello,
I'm not sure about the solution here.
Firstly, it seems the Bellman optimality equation for q_{*}(s,a) is wrong. It looks like there is a typo where the sum $r + \gamma \max_{a'} q_{*}(s',a')$ inside the expectation has been replaced by the product $r \cdot \gamma \max_{a'} q_{*}(s',a')$.
This error has then been propagated to the two solutions given as examples.
If we ignore this typo, I agree with the solution for q_{*}(high, search).
However, I think the answer for q_{*}(high, wait) is wrong.
If the action 'wait' is taken, then the next state s' must be 'high', from which the actions a' ∈ {'search', 'wait'} are possible. I think the answer should be
q_{*}(high, wait) = r_{wait} + \gamma \max{q_{*}(high, wait), q_{*}(high, search)}
What do you think?
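For what it's worth, the equation above can be checked numerically with value iteration on the recycling-robot MDP (Example 3.3 in the book). Below is a minimal Python sketch; the numeric values for alpha, beta, r_search, r_wait, and gamma are illustrative choices of mine, not values from the book (the book only requires r_search > r_wait).

```python
# Value-iteration sketch for the recycling-robot MDP (Sutton & Barto, Example 3.3).
# Parameter values below are illustrative assumptions, not from the book.
alpha, beta = 0.8, 0.4        # prob. of keeping the same battery level while searching
r_search, r_wait = 2.0, 1.0   # expected rewards; the book assumes r_search > r_wait
gamma = 0.9                   # discount factor

# transitions[(s, a)] = list of (probability, reward, next_state)
transitions = {
    ('high', 'search'):  [(alpha, r_search, 'high'), (1 - alpha, r_search, 'low')],
    ('high', 'wait'):    [(1.0, r_wait, 'high')],
    ('low', 'search'):   [(beta, r_search, 'low'), (1 - beta, -3.0, 'high')],
    ('low', 'wait'):     [(1.0, r_wait, 'low')],
    ('low', 'recharge'): [(1.0, 0.0, 'high')],
}
actions = {'high': ['search', 'wait'], 'low': ['search', 'wait', 'recharge']}

# Iterate q(s,a) <- sum_{s',r} p(s',r|s,a) [r + gamma * max_{a'} q(s',a')]
q = {sa: 0.0 for sa in transitions}
for _ in range(1000):
    q = {(s, a): sum(p * (r + gamma * max(q[(s2, a2)] for a2 in actions[s2]))
                     for p, r, s2 in transitions[(s, a)])
         for (s, a) in transitions}

# At the fixed point, q(high, wait) = r_wait + gamma * max over next actions,
# since 'wait' leaves the state at 'high' with probability 1.
print(q[('high', 'wait')])
```

Since waiting in 'high' is deterministic, the sum over (s', r) collapses to the single bracketed term in the equation above, which is exactly what the converged q values satisfy.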