Exercise 3.23 #8
Comments
Thanks for your enthusiasm. I will review all of your issues after work today. If you can provide a similar review for later chapters, that would be a great help! |
Thanks for all the advice. I will fix that! I think I understand what I was thinking at the time. Because q* is the q_pi of the optimal policy, which in my definition is a greedy one, I reasoned that if under high it is always better to search instead of waiting, then in any high state we should always choose search, for the same reason. That is, if under high our optimal policy chooses wait, we should always wait; that would justify my answer. However, if we are under high but choose search, it would imply we should choose search again whenever we are under high. It took me a while (until a few chapters later) to realize that the action values here do not work like that. They are a kind of action preference from which a greedy policy is picked, and many q* are possible for the same optimal policy. |
Hello there,
Thanks! I've just started going through the book, and your solutions are a great resource, at the very least to give a starting point for thought or to compare approaches. Right now I have a bit of time set aside to study this, but I'm not sure how much momentum I'll manage to maintain over the next couple of months.
Do you prefer me opening issues or directly making pull requests?
Thanks,
Jean
On Wed, 22 Jan 2020 at 10:41, YIFAN WANG ***@***.*** wrote:
|
I am happy to hear that! You can have a look at the solutions for Chapter 12, which are the longest answers you will have to consider in this book. Nevertheless, the book is quite easy to read; I personally read one chapter per day, perhaps a little longer for Chapters 7 and 12, which bring great changes to the flow. I prefer opening issues here, so that we can sometimes debate and I can manage my source code well. PS: If it is OK with you, would you tell me your full name (or maybe Jean is enough?) so I can include you in the contributor list in the readme file, with the specific problems you corrected, if you wish. |
Hello,
I'm not sure about the solution here.
Firstly, it seems the Bellman optimality equation for q_{*}(s,a) is wrong. It looks like there is a typo where the sum $r + \gamma \max_{a'} q_{*}(s',a')$ inside the expectation has been replaced by the product $r \cdot \gamma \max_{a'} q_{*}(s',a')$.
This error has then been propagated to the two solutions given as examples.
If we ignore this typo, I agree with the solution for q_{*}(high, search).
However, I think the answer for q_{*}(high, wait) is wrong.
If the action 'wait' is taken, then the next state s' must be 'high', from which the actions a' ∈ {'search', 'wait'} are possible. I think the answer should be
q_{*}(high, wait) = r_{wait} + \gamma \max{q_{*}(high, wait), q_{*}(high, search)}
What do you think?
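For what it's worth, the equation above can be checked numerically with value iteration on the recycling-robot MDP (Example 3.3 in the book). Below is a minimal Python sketch; the numeric values for alpha, beta, r_search, r_wait, and gamma are illustrative choices of mine, not values from the book (the book only requires r_search > r_wait).

```python
# Value-iteration sketch for the recycling-robot MDP (Sutton & Barto, Example 3.3).
# Parameter values below are illustrative assumptions, not from the book.
alpha, beta = 0.8, 0.4        # prob. of keeping the same battery level while searching
r_search, r_wait = 2.0, 1.0   # expected rewards; the book assumes r_search > r_wait
gamma = 0.9                   # discount factor

# transitions[(s, a)] = list of (probability, reward, next_state)
transitions = {
    ('high', 'search'):  [(alpha, r_search, 'high'), (1 - alpha, r_search, 'low')],
    ('high', 'wait'):    [(1.0, r_wait, 'high')],
    ('low', 'search'):   [(beta, r_search, 'low'), (1 - beta, -3.0, 'high')],
    ('low', 'wait'):     [(1.0, r_wait, 'low')],
    ('low', 'recharge'): [(1.0, 0.0, 'high')],
}
actions = {'high': ['search', 'wait'], 'low': ['search', 'wait', 'recharge']}

# Iterate q(s,a) <- sum_{s',r} p(s',r|s,a) [r + gamma * max_{a'} q(s',a')]
q = {sa: 0.0 for sa in transitions}
for _ in range(1000):
    q = {(s, a): sum(p * (r + gamma * max(q[(s2, a2)] for a2 in actions[s2]))
                     for p, r, s2 in transitions[(s, a)])
         for (s, a) in transitions}

# At the fixed point, q(high, wait) = r_wait + gamma * max over next actions,
# since 'wait' leaves the state at 'high' with probability 1.
print(q[('high', 'wait')])
```

Since waiting in 'high' is deterministic, the sum over (s', r) collapses to the single bracketed term in the equation above, which is exactly what the converged q values satisfy.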