
Make policy used only for initial guidance #743

Open

oscardssmith opened this issue Feb 18, 2019 · 16 comments

Comments

@oscardssmith
Contributor

Currently policy has a large effect on search even when there are millions of nodes. This causes weird behaviour and can lead to significant oversights. To fix this, I'm proposing that rather than maximizing Q + c_puct*policy*sqrt(N / n_i), we instead maximize Q + f(policy, N) + c_puct*sqrt(N/n_i). I don't have a great idea of what f should be, but a reasonable first approximation is e^-(aN+policy); see the link for the graph: https://www.desmos.com/calculator/g3kmkjp5zs
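
For concreteness, here is a minimal sketch of what that score could look like, assuming f(policy, N) = e^-(aN+policy) as above; the function and parameter names, and the +1 guard on the child visit count, are illustrative rather than part of the proposal:

#include <cmath>

// Sketch of the proposed additive score: Q + f(policy, N) + c_puct * sqrt(N / n_i),
// with f(policy, N) = exp(-(a * N + policy)); 'a' controls how quickly the policy
// bonus decays as the parent visit count N grows.
float ProposedScore(float q, float policy, float parent_n, float child_n,
                    float c_puct, float a) {
  const float f = std::exp(-(a * parent_n + policy));               // policy term, now additive
  const float u = c_puct * std::sqrt(parent_n / (child_n + 1.0f));  // exploration, no policy factor
  return q + f + u;
}

As the linked graph suggests, the f term becomes negligible once N is large, so the search is then driven almost entirely by Q and the visit counts.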

@gcp
Contributor

gcp commented Feb 19, 2019

I like this idea (read: I'm experimenting with it as well); it seems so counter-intuitive that policy is still a direct factor in the allocation of search effort even if we've searched thousands of nodes below a tree and have win rates (which are better guides) for the subtrees.

@oscardssmith
Contributor Author

Yeah, also: this detail of policy affecting U was invented by DeepMind with no justification.

@roy7

roy7 commented Feb 19, 2019

I've never liked heavy weighting of policy in search even when millions of playouts have been done. I like to think of it more as a tool to guide very early search or to know which unexplored children to expand first.

Eager to hear about the results.

@gcp
Contributor

gcp commented Feb 19, 2019

> Yeah, also: this detail of policy affecting U was invented by DeepMind with no justification.

The original paper about P-UCT had the prior as an additive, not a multiplicative factor. But I'm sure DM was well aware of that :-/

@roy7

roy7 commented Feb 19, 2019

Maybe this is a case of them trying an experiment that led to a strength gain and so they kept it without a theoretical basis to do so.

@RedDenver

This is from Discord and I didn't want to lose it:

The U computation comes from several places; here's how cpuct is computed:

inline float ComputeCpuct(const SearchParams& params, uint32_t N) {
  const float init = params.GetCpuct();
  const float k = params.GetCpuctFactor();
  const float base = params.GetCpuctBase();
  return init + (k ? k * FastLog((N + base) / base) : 0.0f);
}

then

float puct_mult =
        cpuct * std::sqrt(std::max(node->GetChildrenVisits(), 1u));

then

float GetU(float numerator) const {
    return numerator * GetP() / (1 + GetNStarted());
  }

where numerator=puct_mult
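
Putting those three snippets together, a consolidated sketch of the current multiplicative U term would look roughly like this (FastLog is replaced by std::log here, and the parameter names are illustrative):

#include <algorithm>
#include <cmath>
#include <cstdint>

// Consolidated sketch of the current U term from the snippets above:
// U(child) = cpuct(N) * sqrt(max(children_visits, 1)) * P(child) / (1 + n_started(child)).
float CurrentU(float cpuct_init, float cpuct_factor, float cpuct_base,
               uint32_t parent_n, uint32_t children_visits,
               float p, uint32_t n_started) {
  const float cpuct =
      cpuct_init +
      (cpuct_factor ? cpuct_factor * std::log((parent_n + cpuct_base) / cpuct_base)
                    : 0.0f);
  const float puct_mult =
      cpuct * std::sqrt(static_cast<float>(std::max(children_visits, 1u)));
  return puct_mult * p / (1.0f + n_started);
}

Note that the prior P multiplies the whole exploration term, which is exactly the behaviour this issue proposes to change.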

@kmcrage

kmcrage commented Mar 9, 2019

I agree about not maximizing
Q + c_puct * policy * sqrt(N / n_i)
but instead of adding an extra term, I thought we could maximize
Q + c_puct * (1. - pow(1. - policy, 1. + c_pol * N)) * sqrt(N / n_i)
where c_pol is a small constant. Then as N increases the effect of policy dies out.
I'd also look at clamping policy at something slightly more than zero.
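
A minimal sketch of this variant, with illustrative names; the +1 in the child-visit denominator is an assumption (borrowed from the GetU snippet above) to avoid division by zero for unvisited children:

#include <cmath>

// kmcrage's proposal: the policy factor 1 - (1 - policy)^(1 + c_pol * N) approaches 1
// as the parent visit count N grows, so the effect of policy dies out over time.
float DecayedPolicyScore(float q, float policy, float parent_n, float child_n,
                         float c_puct, float c_pol) {
  const float policy_factor =
      1.0f - std::pow(1.0f - policy, 1.0f + c_pol * parent_n);
  return q + c_puct * policy_factor * std::sqrt(parent_n / (child_n + 1.0f));
}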

@oscardssmith
Contributor Author

I think the clamping isn't necessary. Search should be able to handle 0 policy without breaking.

@gcp
Contributor

gcp commented Mar 13, 2019

I tried various experiments with pow(policy, softmax + increasing_term) to flatten out the policy as the number of playouts increases, but nothing that actually improves strength so far.

@oscardssmith
Contributor Author

@gcp your problem might be that the change you described brings all policies to 0 instead of 1.

@gcp
Contributor

gcp commented Mar 13, 2019

I was tuning the PUCT to compensate at the same time. But maybe it's a fair point that the policy distribution has to be renormalized.
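
As a rough illustration of the flatten-and-renormalize idea being discussed, here is one possible sketch; the exponent schedule and names are assumptions, not gcp's exact experiment. The exponent shrinks toward zero as N grows, so all priors move toward 1 (addressing the point above about them collapsing to 0), and renormalization keeps them a distribution:

#include <cmath>
#include <cstddef>
#include <vector>

// Flatten the policy by raising each prior to an exponent that shrinks as the parent
// visit count N grows, then renormalize so the priors sum to 1 again.
std::vector<float> FlattenedPolicy(const std::vector<float>& policy, float parent_n,
                                   float flatten_rate) {
  const float exponent = 1.0f / (1.0f + flatten_rate * parent_n);
  std::vector<float> flattened(policy.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < policy.size(); ++i) {
    flattened[i] = std::pow(policy[i], exponent);
    sum += flattened[i];
  }
  for (float& p : flattened) p /= sum;  // renormalize back to a distribution
  return flattened;
}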

@gcp
Contributor

gcp commented Mar 15, 2019

@kmcrage I tried your formula, but I used x + y * sqrt(N) as the exponent in the pow. This seems to be worth about 40 Elo, but note that I am testing this without the DM formula that was increasing PUCT.

In any case this confirms the idea is worth exploring further.
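
In other words, the policy factor becomes 1 - (1 - policy)^(x + y*sqrt(N)); a minimal sketch with illustrative constant names:

#include <cmath>

// gcp's variant of kmcrage's formula: the decay exponent grows with sqrt(N) rather than
// linearly in N. x and y are tuning constants.
float SqrtDecayPolicyFactor(float policy, float parent_n, float x, float y) {
  return 1.0f - std::pow(1.0f - policy, x + y * std::sqrt(parent_n));
}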

@Naphthalin
Contributor

Definitely an interesting discussion, and still relevant. As gcp pointed out, the P-UCT formula DeepMind gives as their source has the additive term -2/p(s,a) * sqrt( log(N) / N ), but it was intentionally not used by DeepMind. Possible attempts are #1150, #1173, #1235, #1251 and #1252, but nothing has been found to be superior yet. General discussion should probably happen in #1231; @oscardssmith it's your decision which of the issues you would like to keep open :)
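
For reference, a sketch of what an additive-prior selection score in that spirit might look like. Only the -2/p(s,a) * sqrt(log(N)/N) term is taken from the comment above; the sqrt(log(N)/n_i) shape of the exploration term and the +1 visit guard are assumptions:

#include <cmath>

// Sketch of an additive-prior score: the prior enters as a separate penalty term
// (large for low-prior moves) instead of multiplying the exploration term.
float AdditivePriorScore(float q, float p, float parent_n, float child_n, float c) {
  const float exploration = c * std::sqrt(std::log(parent_n) / (child_n + 1.0f));
  const float prior_penalty = -2.0f / p * std::sqrt(std::log(parent_n) / parent_n);
  return q + exploration + prior_penalty;
}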

@oscardssmith
Contributor Author

OK, I'll take a look and close the obsolete ones.

@Mardak
Contributor

Mardak commented May 8, 2020

I filed #1279 if someone can help formalize the math there to make policy dynamic, based on search finding a discrepancy between the P predicting low V of a child and the V actually seen on the first visit to that child.

I suppose what's notably different from @Ttl's approach in leela-zero/leela-zero#2337 is that the recalculation would happen "immediately" on the first visit, when the child V conflicts with its P, instead of "eventually" after some number of visits.

@Mardak
Contributor

Mardak commented May 27, 2020

Looks like Stoofvlees blundered here and lc0 happened to find 71. g4 even though the network 63600 Policy head was very unfavorable, especially with policy-softmax-temp=1. This is another case where the #1279 dynamic policy based on the Value head of the subsequent position could help:

setoption name FpuStrategyAtRoot value absolute
position fen 8/8/5p2/p3k1pp/7P/1PP1KpP1/8/8 w - - 0 71
go nodes 9

info e3d2 N:       1 (P:  0.44%) (M: 52.1) (Q: -0.97788) (U: 0.01332) (V: -0.9779) 
info b3b4 N:       1 (P:  0.37%) (M: 61.9) (Q: -0.97444) (U: 0.01130) (V: -0.9744) 
info e3d3 N:       1 (P:  0.50%) (M: 55.0) (Q: -0.97294) (U: 0.01524) (V: -0.9729) 
info e3f2 N:       1 (P:  0.34%) (M: 54.1) (Q: -0.95270) (U: 0.01027) (V: -0.9527) 
info c3c4 N:       1 (P:  0.48%) (M: 63.3) (Q: -0.92873) (U: 0.01467) (V: -0.9287) 
info h4g5 N:       1 (P:  2.54%) (M: 81.3) (Q: -0.22665) (U: 0.07714) (V: -0.2267) 
info g3g4 N:       1 (P:  0.77%) (M: 58.1) (Q:  0.06395) (U: 0.02335) (V:  0.0639) 
info e3f3 N:       1 (P: 94.56%) (M: 65.7) (Q:  0.14880) (U: 2.87272) (V:  0.1488) 
info node N:       9 (P: 100.0%) (M: 65.1) (Q: -0.53213) (V:  0.0315) 

Specifically, g4's policy is only 0.77%, yet its value of 0.0639 is better than the parent node's 0.0315.

stoofvlees dynamic policy
https://www.chess.com/computer-chess-championship#event=preccc14-gpu-powers-ii&game=35

Although there is some potential that the MovesLeft head could help in this situation as well, because at a glance the position after the winning move does seem favorable and closer to mate than the one after the highest-policy move.
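
To make the idea concrete, here is a hypothetical sketch of a first-visit policy adjustment; the trigger condition, the boost, and the names are assumptions for illustration, not the actual #1279 proposal:

// Hypothetical first-visit policy adjustment (illustrative only): if a low-P child's
// first-visit value turns out better than the parent's value, raise its effective prior
// so the U term no longer starves it of visits.
float AdjustPolicyOnFirstVisit(float child_p, float child_v, float parent_v,
                               float policy_floor) {
  if (child_v > parent_v && child_p < policy_floor) {
    return policy_floor;  // clamp the effective prior up to a floor
  }
  return child_p;
}

A real implementation would also need to renormalize the siblings' priors after such a boost so they still sum to 1.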
