
The SAI learning pipeline is different from Leela Zero's.

Leela Zero is based on the AlphaGo Zero paper, while SAI follows a modification of what is described in the AlphaZero paper.

The main difference is that the AlphaGo Zero paper and the Leela Zero project use gating, meaning that a newly trained network is promoted to best network, and hence used for self-play games, only if there is some statistical evidence that it is an improvement over the previous one. More precisely, a 400-game match is played, and if the new network wins at least 55% of the games (a 2σ deviation from fair coin tossing), it is immediately promoted.
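As an illustration, the following Python sketch (with a hypothetical passes_gating helper; not the actual Leela Zero server logic) checks whether a match result clears that 2σ threshold.

```python
import math

def passes_gating(wins, games=400, sigma=2.0):
    # Promote only if the win count exceeds the fair-coin expectation by at
    # least `sigma` standard deviations; for 400 games the threshold is
    # 220 wins, i.e. 55%. This is only an illustration of the criterion
    # described above, not the project's actual gating code.
    expected = games / 2.0
    std_dev = math.sqrt(games * 0.25)   # binomial std dev with p = 0.5
    return wins >= expected + sigma * std_dev

print(passes_gating(220))  # True: 55% of 400 games meets the 2-sigma bar
print(passes_gating(215))  # False
```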

In the AlphaZero paper there is no gating: training is continuous, on a moving buffer of 1 million games, and every 1000 training steps a new network is produced, immediately promoted, and put to play 200k games.
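For contrast, here is a rough, non-authoritative sketch of that gating-free scheme; self_play and train_steps are assumed callables standing in for the real self-play and training code.

```python
from collections import deque

def alphazero_loop(net, self_play, train_steps,
                   buffer_games=1_000_000, games_per_net=200_000,
                   steps_per_net=1_000):
    # Gating-free loop: a moving window of the most recent games, a new
    # network every `steps_per_net` training steps, promoted unconditionally.
    buffer = deque(maxlen=buffer_games)
    while True:
        buffer.extend(self_play(net, games_per_net))
        net = train_steps(net, buffer, steps_per_net)
        yield net   # the new network replaces the old one immediately
```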

In our experimental runs with 7x7 and 9x9 SAI, we saw that the second approach is indeed quite robust, even if, without gating, some oscillations in strength can be observed. In the end we settled on something in between: we follow the AlphaZero approach, but at each generation we train a small number of candidate networks and promote the one that performs best against the previously playing network, even if it wins fewer than 50% of the games. These promotion matches consist of a small number of games, far fewer than 400; the aim is not so much to choose the best candidate as to avoid the very bad ones, to ensure robustness.
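A minimal sketch of this promotion rule, assuming a hypothetical play_match(candidate, reference, games) callable that returns the candidate's number of wins:

```python
def pick_promotion(candidates, current_net, play_match, games=50):
    # Every candidate plays a short match against the current network; the
    # candidate with the most wins is promoted, even if it scores below 50%.
    # The short matches only serve to discard clearly bad candidates.
    wins = {candidate: play_match(candidate, current_net, games)
            for candidate in candidates}
    return max(wins, key=wins.get)
```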

Some quantitative analysis of the pipeline hyperparameters can be found here.

SAI pipeline

The SAI pipeline cycle is faster than AlphaZero's, because we don't have huge resources and we want to be as efficient as possible, doing smaller intermediate steps while checking progress.

The cycle is as follows (a schematic code sketch is given after the list).

  1. gen=0, current_net=random, n=1;
  2. current_net plays 2560 whole self-play games, with variable komi distributed according to current_net's evaluation;
  3. current_net starts playing branches of self-play games, from random positions of previous games;
  4. when the game count reaches 2560 self-play games, training starts, based on the self-play games of the last n generations;
  5. during training, a variable number of candidate networks are generated (currently, 10 networks, 2000 training steps apart);
  6. as soon as candidates are available, promotion matches are added between the new candidate networks and current_net. These matches can be identified because they are 50 games long;
  7. when promotion matches end, the best candidate network is identified; denote it by chosen_net;
  8. current_net finishes playing branches of self-play games until the game count reaches 5120;
  9. reference matches are added between several recent networks (the ones promoted at generations gen-k, with k in {1, 2, 5, 8, 11}) and chosen_net, to get a more precise evaluation of chosen_net's Elo. These matches can be identified because they are 40 games long;
  10. if gen is a multiple of 4, panel matches are added between the 16 networks in the panel and chosen_net, again to get an even more precise evaluation of chosen_net's Elo. These matches can be identified because they are 30 games long;
  11. gen++, current_net=chosen_net, if reasonable then n++;
  12. go to step 2;
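The cycle can be summarized with the following Python sketch. It is only schematic: all the helper functions (self_play, self_play_branches, train_candidates, play_match, panel_networks, window_should_grow) are hypothetical names standing in for the real server code, and the fact that branch games, training and matches actually run concurrently is ignored.

```python
REFERENCE_OFFSETS = (1, 2, 5, 8, 11)   # generations used for reference matches

def sai_pipeline(random_net, self_play, self_play_branches, train_candidates,
                 play_match, panel_networks, window_should_grow):
    gen, n = 0, 1                        # step 1
    current_net = random_net
    promoted = {0: current_net}          # promoted network of each generation
    while True:
        # step 2: 2560 whole self-play games with variable komi
        self_play(current_net, whole_games=2560)
        # steps 3 and 8: branch games from random positions of earlier games,
        # until the total game count reaches 5120
        self_play_branches(current_net, total_games=5120)
        # steps 4-5: train on the last n generations, producing ~10 candidates,
        # one every 2000 training steps
        candidates = train_candidates(window=n, num=10, steps_between=2000)
        # steps 6-7: 50-game promotion matches against current_net
        wins = {c: play_match(c, current_net, games=50) for c in candidates}
        chosen_net = max(wins, key=wins.get)
        # step 9: 40-game reference matches against recently promoted networks
        for k in REFERENCE_OFFSETS:
            if gen - k in promoted:
                play_match(chosen_net, promoted[gen - k], games=40)
        # step 10: 30-game panel matches every 4 generations
        if gen % 4 == 0:
            for panel_net in panel_networks():
                play_match(chosen_net, panel_net, games=30)
        # steps 11-12: advance the generation and, when reasonable, the window
        gen += 1
        current_net = chosen_net
        promoted[gen] = current_net
        if window_should_grow(gen):
            n += 1
```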