Make policy used only for initial guidance #743
Comments
I like this idea (read: I'm experimenting with it as well), it seems so counter-intuitive that policy is still a direct factor in the allocation of search effort even if we've searched thousands of nodes below a tree and have win-rates (which are better guides) for the subtrees.
Yeah, also, this detail of policy affecting U was invented by DeepMind with no justification.
I've never liked heavy weighting of policy in search even when millions of playouts have been done. I like to think of it more as a tool to guide very early search, or to know which unknown children to expand first. Eager to hear about the results.
The original paper about P-UCT had the prior as an additive, not a multiplicative factor. But I'm sure DM was well aware of that :-/
Maybe this is a case of them trying an experiment that led to a strength gain and so they kept it without a theoretical basis to do so.
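To make the distinction discussed above concrete, here is a minimal sketch (not lc0's actual code) comparing the multiplicative PUCT term DeepMind uses with an additive-prior variant closer in spirit to the original P-UCT formulation. The `Node` struct, constants, and the `1 + visits` denominator are assumptions for illustration only.

```cpp
#include <cmath>

struct Node {
  double q;       // running mean value (win rate) of this child
  double policy;  // prior probability P from the network
  int visits;     // n_i, visits to this child so far
};

// DeepMind-style PUCT: the prior multiplies the exploration term, so a
// low-policy child keeps a small U bonus no matter how many parent
// visits N accumulate.
double ScoreMultiplicative(const Node& c, int parent_visits, double c_puct) {
  double u = c_puct * c.policy * std::sqrt(static_cast<double>(parent_visits)) /
             (1 + c.visits);
  return c.q + u;
}

// Additive-prior variant: the prior is a separate bonus, so the
// exploration term itself eventually dominates and every child gets
// explored once enough visits accumulate.
double ScoreAdditive(const Node& c, int parent_visits, double c_puct,
                     double prior_weight) {
  double u = c_puct * std::sqrt(static_cast<double>(parent_visits)) /
             (1 + c.visits);
  return c.q + prior_weight * c.policy + u;
}
```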
This is from Discord, and I didn't want to lose it:
I agree about not maximizing
I think the clamping isn't necessary. Search should be able to handle 0 policy without breaking.
I tried various experiments with pow(policy, softmax + increasing_term) to flatten out the policy as the number of playouts increases, but nothing that actually improves strength so far.
@gcp your problem might be that the change you described brings all policies to 0 instead of 1.
I was tuning the PUCT to compensate at the same time. But maybe the point is that the policy distribution has to be renormalized.
@kmcrage I tried your formula, but I used x + y * sqrt(N) as the exponent in the pow. This seems to be worth about 40 Elo, but note that I am testing this without the DM formula that was increasing PUCT. In any case this confirms the idea is worth exploring further.
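For reference, a minimal sketch of the exponent-based flattening experiment described in the last few comments: each prior is replaced by pow(policy, x + y * sqrt(N)), where N is the parent visit count. The values of `x` and `y` are placeholders rather than tuned constants, and whether the result should be renormalized (the open question above) is left as an option.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Transform priors with an exponent that grows with the parent visit
// count, as in the experiment described above. Renormalization is
// optional because the thread left that question open.
std::vector<double> TransformPriors(const std::vector<double>& priors,
                                    int parent_visits, double x, double y,
                                    bool renormalize) {
  double exponent = x + y * std::sqrt(static_cast<double>(parent_visits));
  std::vector<double> out(priors.size());
  for (size_t i = 0; i < priors.size(); ++i) {
    out[i] = std::pow(priors[i], exponent);
  }
  if (renormalize) {
    double sum = std::accumulate(out.begin(), out.end(), 0.0);
    if (sum > 0.0) {
      for (double& p : out) p /= sum;
    }
  }
  return out;
}
```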
Definitely interesting discussion, and still relevant. As gcp pointed out, the P-UCT formula DeepMind gives as their source has the additive term
OK, I'll take a look and close the obsolete ones. |
I filed #1279 if someone can help formalize the math there to make policy dynamic based on search finding a discrepancy between the P predicting low V of a child and realizing it was wrong on the first visit to that child. I suppose what's notably different from @Ttl's approach in leela-zero/leela-zero#2337 is that the recalculation would happen "immediately" on first visit when the child V conflicts with its P, instead of "eventually" after some number of visits.
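Purely as an illustration of that "immediate" recalculation idea, and not the actual math from #1279 or leela-zero/leela-zero#2337, here is a small sketch under made-up assumptions: when the value found on a child's first visit is much better than its low prior suggested, nudge the prior upward so the subtree isn't starved of visits. The threshold-free blend and the 0.5 factor are arbitrary placeholders.

```cpp
#include <algorithm>

// Hypothetical illustration only: boost a child's prior when the value
// seen on its first visit contradicts a low prior. The 0.5 blend factor
// is made up, not taken from any PR.
double AdjustPriorOnFirstVisit(double prior, double first_visit_value,
                               double parent_value) {
  double surprise = first_visit_value - parent_value;  // >0: better than expected
  if (surprise <= 0.0) return prior;  // no conflict, keep the network's prior
  double boosted = prior + 0.5 * surprise;
  return std::clamp(boosted, prior, 1.0);  // never lower the prior, cap at 1
}
```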
Looks like Stoofvlees blundered here and lc0 happened to find
Specifically, g4's policy is 0.77% while its value of 0.0639 is better than the parent node's 0.0315.
Although there is some potential that the MovesLeft head could help in this situation as well, because at a glance the position after the winning move does seem favorable and closer to mate than the highest-policy move.
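To put rough numbers on that example, here is a back-of-the-envelope computation; only the 0.77% policy and the value figures come from the post, while the visit counts and `c_puct` are hypothetical. It shows how a tiny prior crushes the exploration bonus in the multiplicative formula even when the child's Q is clearly better.

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // Only the policy/value numbers come from the post; the rest is assumed.
  double parent_q = 0.0315, child_q = 0.0639;
  double child_policy = 0.0077;   // 0.77% prior for g4
  double c_puct = 2.0;            // illustrative constant
  int parent_visits = 1000000, child_visits = 100;

  double u = c_puct * child_policy *
             std::sqrt(static_cast<double>(parent_visits)) /
             (1 + child_visits);
  // Even with a million parent visits, the exploration bonus stays tiny
  // (~0.15 here); a sibling with, say, 50% policy would get a U term
  // roughly 65x larger at the same visit count, so the +0.03 value edge
  // struggles to attract search.
  std::printf("parent Q=%.4f child Q=%.4f U=%.4f\n", parent_q, child_q, u);
  return 0;
}
```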
Currently policy has a large effect on search even when there are millions of nodes. This causes weird behaviour and can lead to significant oversights. To fix this, I'm proposing that rather than maximizing `Q + c_puct * policy * sqrt(N / n_i)`, we instead maximize `Q + f(policy, N) + c_puct * sqrt(N / n_i)`. I don't have a great idea of what `f` should be, but a reasonable first approximation is `e^-(aN + policy)`; see the graph at https://www.desmos.com/calculator/g3kmkjp5zs
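A minimal selection-loop sketch of the proposal, using the `e^-(aN + policy)` form of `f` verbatim from the text above; `a`, `c_puct`, the `Child` layout, and the `n_i + 1` guard are assumptions, not part of the proposal.

```cpp
#include <cmath>
#include <vector>

struct Child {
  double q;       // mean value of this child
  double policy;  // network prior P
  int visits;     // n_i
};

// Proposed score: Q + f(policy, N) + c_puct * sqrt(N / n_i), with
// f(policy, N) = e^-(a*N + policy) as written in the proposal. The f
// term decays as the parent accumulates visits, so the prior stops
// influencing selection once a node has been searched heavily.
int SelectChild(const std::vector<Child>& children, int parent_visits,
                double c_puct, double a) {
  int best = 0;
  double best_score = -1e30;
  for (size_t i = 0; i < children.size(); ++i) {
    const Child& c = children[i];
    double f = std::exp(-(a * parent_visits + c.policy));
    double u = c_puct * std::sqrt(static_cast<double>(parent_visits) /
                                  (c.visits + 1));  // +1 avoids division by zero
    double score = c.q + f + u;
    if (score > best_score) {
      best_score = score;
      best = static_cast<int>(i);
    }
  }
  return best;
}
```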