Pipeline description: self-play temperatures #8

Vandertic opened this issue Nov 2, 2018 · 0 comments
This post describes how, in SAI, we use two standard LZ tuning parameters that are generally left at their default values in the LZ learning pipeline: --randomtemp and --softmax_temp.

These are temperatures in the sense of probability distributions, or more precisely in the sense in which the term is used for Gibbs measures. A probability distribution over a finite set can be perturbed by a temperature parameter: a value of 1 leaves the distribution unchanged, values larger than 1 flatten it towards the uniform distribution, and values less than 1 sharpen it, down to the limit (temperature 0) where the point with the highest original probability gets probability 1 and all other points get probability 0.
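Concretely, this perturbation amounts to raising each probability to the power 1/temperature and renormalizing. A minimal Python sketch (just an illustration, not code from the SAI/LZ source):

```python
import numpy as np

def apply_temperature(p, temp):
    """Perturb the finite distribution p with a temperature.

    temp == 1 leaves p unchanged, temp > 1 flattens it towards uniform,
    temp < 1 sharpens it, and temp == 0 puts all mass on argmax(p).
    """
    if temp == 0:
        out = np.zeros_like(p)
        out[np.argmax(p)] = 1.0
        return out
    q = np.power(p, 1.0 / temp)
    return q / q.sum()

p = np.array([0.5, 0.3, 0.2])
print(apply_temperature(p, 1.0))   # unchanged:  [0.5   0.3   0.2  ]
print(apply_temperature(p, 2.0))   # flatter:   ~[0.415 0.322 0.263]
print(apply_temperature(p, 0.5))   # sharper:   ~[0.658 0.237 0.105]
print(apply_temperature(p, 0.0))   # point mass: [1.    0.    0.   ]
```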

  • --randomtemp (default 1) perturbs the probability of choosing one move among the visited ones. By default, when the option -m x is used, the first x moves are chosen with temperature 1 (probability proportional to the visit counts), and the following moves are chosen with temperature 0 (the most visited move is chosen with probability 1).
  • --softmax_temp (default 1) perturbs the policy probability, and is applied right after the policy itself is computed, hence before the Dirichlet noise. (Both are sketched in the code after this list.)
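To fix ideas, here is a rough Python sketch of where the two temperatures enter a self-play move; the function names and the Dirichlet-noise parameters are illustrative and not taken from the actual SAI code:

```python
import numpy as np

rng = np.random.default_rng()

def root_prior(raw_policy, softmax_temp=1.0, noise_eps=0.25, noise_alpha=0.03):
    """--softmax_temp acts on the raw network policy, before Dirichlet noise."""
    prior = np.power(raw_policy, 1.0 / softmax_temp)
    prior /= prior.sum()
    noise = rng.dirichlet(noise_alpha * np.ones_like(prior))
    return (1.0 - noise_eps) * prior + noise_eps * noise

def pick_move(visits, move_number, randomtemp=1.0, random_moves=30):
    """--randomtemp acts on the visit counts once the search is done:
    temperature randomtemp for the first random_moves moves (the -m option),
    temperature 0 (always the most visited move) afterwards."""
    if move_number >= random_moves:
        return int(np.argmax(visits))
    weights = np.power(visits.astype(float), 1.0 / randomtemp)
    return rng.choice(len(visits), p=weights / weights.sum())
```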

What we observed in our 7x7 experiments, both with LZ and with SAI, is that a peculiar problem appears when these parameters are kept at their default values: as the generations of nets go on, the policy concentrates too much and too fast, to the point where, in the opening of the game, only one move is really considered.
In principle this is the expected good behaviour, since we would like play to converge to the perfect game, where the policy knows exactly the best move in every situation.
Nevertheless, we observed that the policy typically concentrates too much before the value estimation gets very good. In this way the policy converges to a suboptimal move, which is then really hard to correct.

Value and policy are the two estimators that evolve between generations of nets. Both should converge to a theoretical limit, which is the perfect game. They may converge at very different speeds, and in particular the policy appears to be much faster.
The policy is trained on the UCT visits of the root node's children, which depend on the policy of the previous generation at the same node and on the value of the previous generation at subsequent nodes.
We have a mathematical argument (not fully rigorous) showing that if the average value of the children does not change between generations, the policy very quickly concentrates entirely on the move with the highest average value. This is true even if the best move is only barely better than the second one, and the convergence is exponential.
We expect this behaviour to still be present when the value is not fixed but changes slowly between generations.
Notice moreover that once the policy has converged to one move, the other subtrees are not explored much, so the value head cannot be trained much on them, and it becomes difficult for the learning pipeline to correct the situation.
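As a toy illustration of this mechanism (much cruder than the actual argument): suppose that at each generation the search multiplies the policy of move $i$ by a fixed quality factor $g_i > 0$, largest for the best move, and that the next net is trained to reproduce the resulting visit shares. Then

$$\pi^{(n+1)}_i \propto g_i\,\pi^{(n)}_i \qquad\Longrightarrow\qquad \frac{\pi^{(n)}_i}{\pi^{(n)}_{\mathrm{best}}} = \frac{\pi^{(0)}_i}{\pi^{(0)}_{\mathrm{best}}}\left(\frac{g_i}{g_{\mathrm{best}}}\right)^{n} \to 0,$$

so every move except the best one loses its policy mass at a geometric rate, even when $g_i$ is only slightly smaller than $g_{\mathrm{best}}$.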

The proposed solution is to use a higher value of softmax_temp. It must be remarked that the mathematical argument above says that the limit policy distribution is concentrated on a single move whenever softmax_temp is less than or equal to 1, while values above 1 give a limit policy with nonzero probability on the second-best move, so it makes sense to experiment with this parameter to correct the unwanted behaviour. We have experimented with 1.5 and 1.25, and the latter seems able to correct the excessive concentration while still letting the policy be trained well.
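In the toy model above, adding softmax_temp $T$ turns the update into $\pi^{(n+1)}_i \propto (g_i\,\pi^{(n)}_i)^{1/T}$; for $T > 1$ the exponents of $g_i$ form a geometric series summing to $1/(T-1)$, so the iteration settles on the non-degenerate fixed point $\pi^*_i \propto g_i^{1/(T-1)}$ (for $T = 1.25$, $\pi^* \propto g_i^{4}$) instead of a point mass. A few lines of Python, with made-up quality factors, show the difference:

```python
import numpy as np

def generation(pi, g, T):
    """One generation of the toy model: the search multiplies the prior by the
    quality factors g, then softmax_temp T flattens the result (power 1/T)."""
    q = np.power(g * pi, 1.0 / T)
    return q / q.sum()

g = np.array([1.00, 0.98, 0.90])      # best move only barely better (illustrative)
for T in (1.0, 1.25):
    pi = np.full(3, 1.0 / 3.0)        # start from a uniform policy
    for _ in range(500):              # 500 "generations"
        pi = generation(pi, g, T)
    print(T, np.round(pi, 3))

# Prints approximately:
#   1.0  [1.    0.    0.   ]   -- all policy mass collapses on the best move
#   1.25 [0.388 0.358 0.254]   -- proportional to g**4, the other moves survive
```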

Of course, one unwanted consequence of this choice is that not only does the policy get flatter, but the visit counts of the root node's children get flatter too. This may be a problem when the move to play is chosen with randomness. For this reason, when setting --softmax_temp 1.25 we also add --randomtemp 0.8, so that the best move is chosen more often. It is also recommended to use --blunder_thr 0.25 together with these, in particular with high values of -m.
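For reference, a sketch of how these options might be combined on the engine command line; the binary name, weights file and -m value are placeholders, not a prescribed configuration:

```
./sai -w weights.gz -m 30 --softmax_temp 1.25 --randomtemp 0.8 --blunder_thr 0.25
```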

Finally, let us remark that our experiments on 7x7 with the official komi of 9.5 showed that in the first part of the training the nets learn perfect play for white, which therefore wins with higher and higher probability (fair komi is 9 for 7x7 go with area scoring). After that, black's play oscillates a lot, as the nets learn that one strategy after another does not work.
When the temperatures are left at 1, this also ruins white's estimates and playing style, so that white periodically unlearns how to win. One can observe the strength of new nets oscillating widely over a period of tens of nets.

When the temperatures are corrected, this does not happen. Perfect play by white is very stable over hundreds of generations, while the strength of black oscillates somewhat (a consequence of playing with komi 9.5). Moreover, the final policy, while still allowing the search to find the better move, is not completely concentrated; in particular, if there are two different moves in the perfect game tree, both retain a reasonable fraction of the probability.
