
Exercise 3.23 #8

Closed
wissam124 opened this issue Jan 21, 2020 · 4 comments

wissam124 commented Jan 21, 2020

Hello,

I'm not sure about the solution here.

First, the Bellman optimality equation for $q_*(s,a)$ seems wrong. It looks like there is a typo where the sum $r + \gamma \max_{a'} q_*(s',a')$ has been turned into the product $r \cdot \gamma \max_{a'} q_*(s',a')$.

This error has then been propagated to the two solutions given as examples.
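
For reference, the general Bellman optimality equation for action values is

$$q_*(s,a) = \sum_{s',r} p(s',r \mid s,a)\left[r + \gamma \max_{a'} q_*(s',a')\right].$$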

If we ignore this typo, I agree with the solution for $q_*(\text{high}, \text{search})$.

However, I think the answer for $q_*(\text{high}, \text{wait})$ is wrong.
If the action 'wait' is taken, then the next state $s'$ must be 'high', from which the actions $a' \in \{\text{search}, \text{wait}\}$ are possible. I think the answer should be

$$q_*(\text{high}, \text{wait}) = r_{\text{wait}} + \gamma \max\left\{q_*(\text{high}, \text{search}),\; q_*(\text{high}, \text{wait})\right\}.$$

What do you think?
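
As a sanity check, here is a quick value-iteration sketch for the recycling robot of Example 3.3. The values of $\alpha$, $\beta$, $r_{\text{search}}$, and $r_{\text{wait}}$ are arbitrary placeholders I picked, since the exercise leaves them symbolic:

```python
# Value iteration on action values for the recycling robot (Example 3.3).
# alpha, beta, r_search, r_wait are arbitrary placeholder values.
alpha, beta = 0.8, 0.6          # P(stay high | search), P(stay low | search)
r_search, r_wait, gamma = 2.0, 1.0, 0.9

# transitions[(s, a)] -> list of (probability, next_state, reward)
transitions = {
    ('high', 'search'):  [(alpha, 'high', r_search), (1 - alpha, 'low', r_search)],
    ('high', 'wait'):    [(1.0, 'high', r_wait)],
    ('low', 'search'):   [(beta, 'low', r_search), (1 - beta, 'high', -3.0)],
    ('low', 'wait'):     [(1.0, 'low', r_wait)],
    ('low', 'recharge'): [(1.0, 'high', 0.0)],
}
actions = {'high': ['search', 'wait'], 'low': ['search', 'wait', 'recharge']}

q = {sa: 0.0 for sa in transitions}   # initialize q(s, a) = 0
for _ in range(1000):                 # synchronous sweeps until convergence
    q = {(s, a): sum(p * (r + gamma * max(q[s2, a2] for a2 in actions[s2]))
                     for p, s2, r in transitions[s, a])
         for (s, a) in transitions}

# At the fixed point, q*(high, wait) = r_wait + gamma * max over BOTH actions:
lhs = q['high', 'wait']
rhs = r_wait + gamma * max(q['high', 'search'], q['high', 'wait'])
print(lhs, rhs)   # these agree, with the max taken over {search, wait}
```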

@LyWangPX (Owner)

Thanks for your passion. I will review all of your issues after my work today. If you can provide a similar review for the later chapters, that would be a great help!

@LyWangPX (Owner)

Thanks for all the advice. I will fix that!

I think I now remember what I was thinking at the time.

I was thinking that, because $q_*$ is the $q_\pi$ of the optimal policy, which by my definition is a greedy one, then if under 'high' it is always better to search than to wait, we should always choose 'search' whenever we are in 'high'. That is, if under 'high' our optimal policy chooses 'wait', we should always wait, which justified my answer. Conversely, if we are under 'high' but choose 'search', it indicates we should choose 'search' again whenever we are under 'high'.

It took me a while (about a few chapters later) to realize that the action values here do not work like that. They are more like action preferences from which the greedy policy is read off, and $q_*(s,a)$ is defined for every action, including the ones the optimal policy never takes.
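
In equation form, the defining relation holds for every state-action pair, with the max appearing only over successor actions:

$$q_*(s,a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a\right],$$

so $q_*(\text{high}, \text{wait})$ is well defined even when the greedy action in 'high' is 'search'.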

wissam124 (Author) commented Jan 22, 2020 via email

@LyWangPX (Owner)

I am happy to hear that! You can read a little of the Chapter 12 solutions, which are the longest answers you will have to consider in this book. Nevertheless, the book is quite easy to read; I personally read one chapter per day, perhaps a little longer for Chapters 7 and 12, which bring great changes to the flow.

I prefer opening issues here so that sometimes we can debate, and I can manage my source code well.

PS: If it is OK with you, would you like to tell me your full name (maybe Jean is enough?) so I can include you in the contributor list in the README file, along with the specific problems you corrected, if you wish.
