
Make policy used only for initial guidance #743

Open

oscardssmith opened this issue Feb 18, 2019 · 16 comments

Comments

@oscardssmith
Contributor

Currently policy has a large effect on search even when there are millions of nodes. This causes weird behaviour and can lead to significant oversights. To fix this, I'm proposing that rather than maximizing Q + c_puct*policy*sqrt(N / n_i), we instead maximize Q + f(policy, N) + c_puct*sqrt(N/n_i). I don't have a great idea of what f should be, but a reasonable first approximation is e^-(aN+policy); see the link for the graph: https://www.desmos.com/calculator/g3kmkjp5zs
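
For concreteness, here is a minimal sketch of what that score could look like, assuming f(policy, N) = e^-(aN+policy) as above; the function and parameter names, and the +1 guard on the child visit count, are illustrative rather than part of the proposal:

#include <cmath>

// Sketch of the proposed additive score: Q + f(policy, N) + c_puct * sqrt(N / n_i),
// with f(policy, N) = exp(-(a * N + policy)); 'a' controls how quickly the policy
// bonus decays as the parent visit count N grows.
float ProposedScore(float q, float policy, float parent_n, float child_n,
                    float c_puct, float a) {
  const float f = std::exp(-(a * parent_n + policy));               // policy term, now additive
  const float u = c_puct * std::sqrt(parent_n / (child_n + 1.0f));  // exploration, no policy factor
  return q + f + u;
}

As the linked graph suggests, the f term becomes negligible once N is large, so the search is then driven almost entirely by Q and the visit counts.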

@gcp
Contributor

gcp commented Feb 19, 2019

I like this idea (read: I'm experimenting with it as well); it seems so counter-intuitive that policy is still a direct factor in the allocation of search effort even if we've searched thousands of nodes below a tree and have win rates (which are better guides) for the subtrees.

@oscardssmith
Contributor Author

Yeah, also: this detail of policy affecting U was invented by DeepMind with no justification.

@roy7

roy7 commented Feb 19, 2019

I've never liked heavy weighting of policy in search even when millions of playouts have been done. I like to think of it more as a tool to guide very early search or to know which unexplored children to expand first.

Eager to hear about the results.

@gcp
Contributor

gcp commented Feb 19, 2019

> Yeah, also: this detail of policy affecting U was invented by DeepMind with no justification.

The original paper about P-UCT had the prior as an additive, not a multiplicative factor. But I'm sure DM was well aware of that :-/

@roy7

roy7 commented Feb 19, 2019

Maybe this is a case of them trying an experiment that led to a strength gain and so they kept it without a theoretical basis to do so.

@RedDenver

This is from Discord and I didn't want to lose it:

The U computation comes from several places; here's how cpuct is computed:

inline float ComputeCpuct(const SearchParams& params, uint32_t N) {
  const float init = params.GetCpuct();
  const float k = params.GetCpuctFactor();
  const float base = params.GetCpuctBase();
  return init + (k ? k * FastLog((N + base) / base) : 0.0f);
}

then

float puct_mult =
        cpuct * std::sqrt(std::max(node->GetChildrenVisits(), 1u));

then

float GetU(float numerator) const {
    return numerator * GetP() / (1 + GetNStarted());
  }

where numerator=puct_mult
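
Putting those three snippets together, a consolidated sketch of the current multiplicative U term would look roughly like this (FastLog is replaced by std::log here, and the parameter names are illustrative):

#include <algorithm>
#include <cmath>
#include <cstdint>

// Consolidated sketch of the current U term from the snippets above:
// U(child) = cpuct(N) * sqrt(max(children_visits, 1)) * P(child) / (1 + n_started(child)).
float CurrentU(float cpuct_init, float cpuct_factor, float cpuct_base,
               uint32_t parent_n, uint32_t children_visits,
               float p, uint32_t n_started) {
  const float cpuct =
      cpuct_init +
      (cpuct_factor ? cpuct_factor * std::log((parent_n + cpuct_base) / cpuct_base)
                    : 0.0f);
  const float puct_mult =
      cpuct * std::sqrt(static_cast<float>(std::max(children_visits, 1u)));
  return puct_mult * p / (1.0f + n_started);
}

Note that the prior P multiplies the whole exploration term, which is exactly the behaviour this issue proposes to change.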

@kmcrage

kmcrage commented Mar 9, 2019

I agree about not maximizing
Q + c_puct * policy * sqrt(N / n_i)
but instead of adding an extra term, I thought we could maximize
Q + c_puct * (1. - pow(1. - policy, 1. + c_pol * N)) * sqrt(N / n_i)
where c_pol is a small constant. Then as N increases the effect of policy dies out.
I'd also look at clamping policy at something slightly more than zero.
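
A minimal sketch of this variant, with illustrative names; the +1 in the child-visit denominator is an assumption (borrowed from the GetU snippet above) to avoid division by zero for unvisited children:

#include <cmath>

// kmcrage's proposal: the policy factor 1 - (1 - policy)^(1 + c_pol * N) approaches 1
// as the parent visit count N grows, so the effect of policy dies out over time.
float DecayedPolicyScore(float q, float policy, float parent_n, float child_n,
                         float c_puct, float c_pol) {
  const float policy_factor =
      1.0f - std::pow(1.0f - policy, 1.0f + c_pol * parent_n);
  return q + c_puct * policy_factor * std::sqrt(parent_n / (child_n + 1.0f));
}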

@oscardssmith
Contributor Author

I think the clamping isn't necessary. Search should be able to handle 0 policy without breaking.

@gcp
Contributor

gcp commented Mar 13, 2019

I tried various experiments with pow(policy, softmax + increasing_term) to flatten out the policy as the number of playouts increases, but nothing that actually improves strength so far.

@oscardssmith
Contributor Author

@gcp your problem might be that the change you described brings all policies to 0 instead of 1.

@gcp
Contributor

gcp commented Mar 13, 2019

I was tuning the PUCT to compensate at the same time. But maybe it's a fair point that the policy distribution has to be renormalized.
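
As a rough illustration of the flatten-and-renormalize idea being discussed, here is one possible sketch; the exponent schedule and names are assumptions, not gcp's exact experiment. The exponent shrinks toward zero as N grows, so all priors move toward 1 (addressing the point above about them collapsing to 0), and renormalization keeps them a distribution:

#include <cmath>
#include <cstddef>
#include <vector>

// Flatten the policy by raising each prior to an exponent that shrinks as the parent
// visit count N grows, then renormalize so the priors sum to 1 again.
std::vector<float> FlattenedPolicy(const std::vector<float>& policy, float parent_n,
                                   float flatten_rate) {
  const float exponent = 1.0f / (1.0f + flatten_rate * parent_n);
  std::vector<float> flattened(policy.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < policy.size(); ++i) {
    flattened[i] = std::pow(policy[i], exponent);
    sum += flattened[i];
  }
  for (float& p : flattened) p /= sum;  // renormalize back to a distribution
  return flattened;
}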

@gcp
Contributor

gcp commented Mar 15, 2019

@kmcrage I tried your formula, but I used x + y * sqrt(N) as the exponent in the pow. This seems to be worth about 40 Elo, but note that I am testing this without the DM formula that was increasing PUCT.

In any case this confirms the idea is worth exploring further.
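
In other words, the policy factor becomes 1 - (1 - policy)^(x + y*sqrt(N)); a minimal sketch with illustrative constant names:

#include <cmath>

// gcp's variant of kmcrage's formula: the decay exponent grows with sqrt(N) rather than
// linearly in N. x and y are tuning constants.
float SqrtDecayPolicyFactor(float policy, float parent_n, float x, float y) {
  return 1.0f - std::pow(1.0f - policy, x + y * std::sqrt(parent_n));
}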

@Naphthalin
Contributor

Definitely an interesting discussion, and still relevant. As gcp pointed out, the P-UCT formula DeepMind gives as their source has the additive term -2/p(s,a) * sqrt( log(N) / N ), but it was intentionally not used by DeepMind. Possible attempts are #1150, #1173, #1235, #1251 and #1252, but nothing has been found to be superior yet. General discussion should probably happen in #1231; @oscardssmith it's your decision which of the issues you would like to keep open :)
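
For reference, a sketch of what an additive-prior selection score in that spirit might look like. Only the -2/p(s,a) * sqrt(log(N)/N) term is taken from the comment above; the sqrt(log(N)/n_i) shape of the exploration term and the +1 visit guard are assumptions:

#include <cmath>

// Sketch of an additive-prior score: the prior enters as a separate penalty term
// (large for low-prior moves) instead of multiplying the exploration term.
float AdditivePriorScore(float q, float p, float parent_n, float child_n, float c) {
  const float exploration = c * std::sqrt(std::log(parent_n) / (child_n + 1.0f));
  const float prior_penalty = -2.0f / p * std::sqrt(std::log(parent_n) / parent_n);
  return q + exploration + prior_penalty;
}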

@oscardssmith
Contributor Author

OK, I'll take a look and close the obsolete ones.

@Mardak
Contributor

Mardak commented May 8, 2020

I filed #1279 if someone can help formalize the math there to make policy dynamic, based on search finding a discrepancy between the P predicting low V of a child and the V actually seen on the first visit to that child.

I suppose what's notably different from @Ttl's approach in leela-zero/leela-zero#2337 is that the recalculation would happen "immediately" on the first visit, when the child V conflicts with its P, instead of "eventually" after some number of visits.

@Mardak
Contributor

Mardak commented May 27, 2020

Looks like Stoofvlees blundered here and lc0 happened to find 71. g4 even though the network 63600 Policy head was very unfavorable, especially with policy-softmax-temp=1. This is another case where the #1279 dynamic policy based on the Value head of the subsequent position could help:

setoption name FpuStrategyAtRoot value absolute
position fen 8/8/5p2/p3k1pp/7P/1PP1KpP1/8/8 w - - 0 71
go nodes 9

info e3d2 N:       1 (P:  0.44%) (M: 52.1) (Q: -0.97788) (U: 0.01332) (V: -0.9779) 
info b3b4 N:       1 (P:  0.37%) (M: 61.9) (Q: -0.97444) (U: 0.01130) (V: -0.9744) 
info e3d3 N:       1 (P:  0.50%) (M: 55.0) (Q: -0.97294) (U: 0.01524) (V: -0.9729) 
info e3f2 N:       1 (P:  0.34%) (M: 54.1) (Q: -0.95270) (U: 0.01027) (V: -0.9527) 
info c3c4 N:       1 (P:  0.48%) (M: 63.3) (Q: -0.92873) (U: 0.01467) (V: -0.9287) 
info h4g5 N:       1 (P:  2.54%) (M: 81.3) (Q: -0.22665) (U: 0.07714) (V: -0.2267) 
info g3g4 N:       1 (P:  0.77%) (M: 58.1) (Q:  0.06395) (U: 0.02335) (V:  0.0639) 
info e3f3 N:       1 (P: 94.56%) (M: 65.7) (Q:  0.14880) (U: 2.87272) (V:  0.1488) 
info node N:       9 (P: 100.0%) (M: 65.1) (Q: -0.53213) (V:  0.0315) 

Specifically, g4's policy is only 0.77%, yet its value of 0.0639 is better than the parent node's 0.0315.

stoofvlees dynamic policy
https://www.chess.com/computer-chess-championship#event=preccc14-gpu-powers-ii&game=35

Although there is some potential that the MovesLeft head could help in this situation as well, because at a glance the position after the winning move does seem favorable and closer to mate than the one after the highest-policy move.
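
To make the idea concrete, here is a hypothetical sketch of a first-visit policy adjustment; the trigger condition, the boost, and the names are assumptions for illustration, not the actual #1279 proposal:

// Hypothetical first-visit policy adjustment (illustrative only): if a low-P child's
// first-visit value turns out better than the parent's value, raise its effective prior
// so the U term no longer starves it of visits.
float AdjustPolicyOnFirstVisit(float child_p, float child_v, float parent_v,
                               float policy_floor) {
  if (child_v > parent_v && child_p < policy_floor) {
    return policy_floor;  // clamp the effective prior up to a floor
  }
  return child_p;
}

A real implementation would also need to renormalize the siblings' priors after such a boost so they still sum to 1.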
