Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tweaked UCB calculation for uniform exploratin of actions in vanilla MCTS #99

Merged
merged 6 commits into from
Sep 25, 2022

Conversation

BoZenKhaa
Copy link
Contributor

@BoZenKhaa BoZenKhaa commented Sep 20, 2022

With the current UCB implementation in vanilla MCTS, the first two iterations explore the same action twice:
old[from the example notebook with n_iter=2]

This is because the current implementation will set UCB value of an action to the default q value when sn == 1 && n(sanode) == 0 (parent node visited once and action not yet used). This means that if the action used in a state during the first iteration returned a positive reward, it would be picked again in the second iteration since it has a higher value than the unexplored actions.

I think the usual UCB implementation first explores all available actions in some order before focusing on a specific action.

With this patch, this seems to be happening:
new [same as previous, but with the patch applied]

@codecov
Copy link

codecov bot commented Sep 20, 2022

Codecov Report

Base: 86.12% // Head: 86.62% // Increases project coverage by +0.50% 🎉

Coverage data is based on head (95ef9e1) compared to base (5984055).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #99      +/-   ##
==========================================
+ Coverage   86.12%   86.62%   +0.50%     
==========================================
  Files          10       10              
  Lines         490      486       -4     
==========================================
- Hits          422      421       -1     
+ Misses         68       65       -3     
Impacted Files Coverage Δ
src/vanilla.jl 90.27% <100.00%> (+1.76%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@BoZenKhaa
Copy link
Contributor Author

I think the conditions could be further simplified such as

        if  n(sanode) == 0
            return sanode # if unexplored action is found, return it
        else
            UCB = q(sanode) + c*sqrt(log(sn)/n(sanode))
        end	

and moving the c==0 outside of the UCB...

So all together:

function best_sanode_UCB(snode::StateNode, c::Float64)
    if c==0
        return argmax(q, children(snode))
    end

    best_UCB = -Inf
    best=first(children(snode))
    sn = total_n(snode)
    for sanode in children(snode)
        if n(sanode) == 0
            return sanode # if unexplored action is found, return it
        else
            UCB = q(sanode) + c*sqrt(log(sn)/n(sanode))
        end
	
        if UCB > best_UCB
            best_UCB = UCB
            best = sanode
        end
    end
    return best
end

Copy link
Member

@zsunberg zsunberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Thanks!

@zsunberg zsunberg merged commit dd3531b into JuliaPOMDP:master Sep 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants