Test policy head 8x8x73 rather than plain vector #637

Open
mooskagh opened this Issue Dec 31, 2018 · 7 comments

mooskagh commented Dec 31, 2018

(Initial discussion about that was at glinscott/leela-chess#47)

Unlike our one-dimensional 1858-element vector representation of the policy head (with a fully connected layer before it), AlphaZero has convolutional layers in its policy head, with 73 planes of 8x8 as output.

With the plain 1..1858 vector, the neural network has to learn every move independently: if it has learned when to play e2f4, it still doesn't know anything about e3f5. With conv layers, it would automatically learn shifted versions of the same patterns.
(The counterargument is that the decisions are already made in the main residual tower and the policy head only encodes them, but that doesn't sound very convincing to me without tests.)
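
For illustration, here is a minimal sketch of the two head shapes being compared. This is not the actual lczero-training code; the layer sizes (a 1x1 conv to 32 filters for the current head, 3x3 convs with 256 and 73 filters for the AlphaZero-style head) are assumptions based on the descriptions in this thread and the A0 paper.

```python
# Rough sketch only; layer sizes are illustrative, not the real training code.
import tensorflow as tf

def flat_policy_head(tower):
    """Current Lc0-style head: small 1x1 conv, then a fully connected layer to 1858 move logits."""
    x = tf.keras.layers.Conv2D(32, 1, activation='relu')(tower)  # [N, 8, 8, 32]
    x = tf.keras.layers.Flatten()(x)                             # [N, 2048]
    return tf.keras.layers.Dense(1858)(x)                        # one logit per possible move

def az_policy_head(tower):
    """AlphaZero-style head: stays convolutional, 73 planes of 8x8 move logits."""
    x = tf.keras.layers.Conv2D(256, 3, padding='same', activation='relu')(tower)
    return tf.keras.layers.Conv2D(73, 3, padding='same')(x)      # [N, 8, 8, 73] = 4672 logits

tower = tf.keras.Input(shape=(8, 8, 256))    # residual tower output (channels-last here)
print(flat_policy_head(tower).shape)         # (None, 1858)
print(az_policy_head(tower).shape)           # (None, 8, 8, 73)
```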

DeepMind also tried using a flat policy head like we do, and they wrote:
We also tried using a flat distribution over moves for chess and shogi; the final result was almost identical although training was slightly slower.

<pure speculation>
It feels to me that this discrepancy between us and AlphaZero may be the reason why Lc0 struggles to reach AlphaZero's level. Our policy head does miss tactics sometimes, and we also had to introduce policy softmax to fight that (although AlphaZero does not have it).
</pure speculation>

(However, so far no one has agreed with me that changing the policy head may help; people say that our policy head is good enough, according to the training graphs.)

It's not clear to me what "almost identical" means in their paper. It could very well be that Lc0's level is already "almost identical" to A0's, and a proper policy head is what's missing to make the last step.

Tilps commented Jan 3, 2019

I'm coming round to needing to experiment here, but not because of plain vector vs. not.

The choice of 32 filters rather than 73 was made back when our network size was small (5 blocks), a number of residual blocks that isn't even really enough to cross the board and back before we hit the policy head. Such a network couldn't reasonably be expected to reason correctly about moves across the board. In that scenario the direct 8x8x73 output is potentially weaker, since it can't deal with the long-range moves, whereas the fully connected layer allows a 'this piece looks good to move' / 'this place looks good to move to' pairing to generate the policy over the distance. That pairing is probably not optimal in general, but it can probably be done easily with 32 filters, so 73 is unlikely to be a clear win until you have more layers and more than 64 filters. Also, 73 filters make the fully connected weights list much, much larger, which would have been significant in terms of network size back then.

Now that we're dealing with 20b 256 filters, our convolution stack is much larger, so having a larger fully connected layer is not unreasonable for our normal use case.

Given that our backends already support policy conv block output that isn't 32, testing more filters (but keeping the flat array) makes sense to me. In fact I'm willing to do a quick 1 LR 10b reinforcement run to see if it shows a significantly different training curve compared to T35.

I'm thinking 80 filters (64+16, a rounder binary number than 73) is the candidate.

While this might be a bit slower to start than the 8x8x73 output, we can test it without any significant changes (just 3 lines in the training repository).
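
To make the size argument concrete, here is a back-of-the-envelope count of the fully connected layer's weights (policy conv filters x 64 squares x 1858 moves, ignoring biases) for the filter counts mentioned above:

```python
# Rough FC weight counts for different policy conv filter choices (biases ignored).
for filters in (32, 73, 80):
    print(f"{filters:>2} policy filters -> {filters * 64 * 1858:,} FC weights")
# 32 -> 3,805,184
# 73 -> 8,680,576
# 80 -> 9,512,960
```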

Ttl commented Jan 3, 2019

One relatively easy way to test this would be to find a mapping from the 8x8x73 convolutional policy to the current 1858-move flat policy. Then it would be as simple as adding that mapping to the end of the model, and it wouldn't require modifying the training data or the policy in lc0. The mapping should be a matrix of size 1858 x 4672 with 1 and 0 entries.

I'm not familiar enough with the policy encoding to trust that I can do it bug-free, but I can train the network if someone finds a way to generate the mapping.
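
A hedged sketch of how such a fixed mapping could be bolted onto the end of the model, assuming the matrix has already been generated correctly; the `build_mapping` placeholder below is hypothetical, and filling it in is exactly the part that needs the real move encoding:

```python
# Sketch only: apply a fixed, non-trainable 0/1 mapping from the flattened
# 8*8*73 = 4672 convolutional policy outputs to the 1858 Lc0 move slots.
import numpy as np
import tensorflow as tf

def build_mapping():
    # Hypothetical placeholder. The real matrix needs the actual Lc0 move
    # encoding; it is stored transposed (4672 x 1858) so it can sit on the
    # right-hand side of a matmul.
    m = np.zeros((8 * 8 * 73, 1858), dtype=np.float32)
    # m[az_flat_index, lc0_move_index] = 1.0  for every encoded move
    return m

mapping = tf.constant(build_mapping())       # fixed, never trained

def az_to_lc0_policy(az_logits):
    """az_logits: [batch, 4672] flattened conv-head output -> [batch, 1858] Lc0 policy."""
    return tf.matmul(az_logits, mapping)
```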

Tilps commented Jan 3, 2019

I'm training a CCRL supervised 256x20 SE with 80 filters for the policy convolution. (Should be done in about 36 hours from now.) It will be interesting to see whether its policy head's fully connected layer turns out to be a mapping of that kind, or whether it learns to do something much more complicated.

Tilps commented Jan 6, 2019

Training is complete. Elo testing under time control vs. the old network was positive but clearly within error bars after 200 games.
Given the pretty clear loss advantages on the training graph, I plan on moving ahead with using it in training, though probably only as a stop-gap until the full A0 policy head architecture option gets a thorough vetting and a quality implementation.

Ttl commented Jan 7, 2019

I tested the AZ-style convolutional policy head with an additional learnable fully connected layer after it to map to the lc0 policy. Match result on a GTX 1080 Ti, TC 10+0.5s:

Score of lc0_az_pol2_200k vs lc0_se_sig_200k: 123 - 92 - 185  [0.539] 400
Elo difference: 26.98 +/- 24.98, LOS: 98.28 %, DrawRatio: 46.3 %

It's clearly better in TensorBoard too, with the policy loss being smaller at all steps. Policy loss was 1.467 at the end vs. 1.477 with the current policy head.
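
As a quick sanity check of how that Elo number follows from the raw score, here is the arithmetic using the standard logistic expected-score model (assumed to be what the match tool reports):

```python
# Sanity check: derive the Elo difference from the raw match score above.
import math

wins, losses, draws = 123, 92, 185
games = wins + losses + draws                     # 400
p = (wins + 0.5 * draws) / games                  # 0.53875, the reported 0.539
elo = -400 * math.log10(1 / p - 1)                # ~26.98, matching the report
print(games, round(p, 3), round(elo, 2))
```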

ASilver commented Jan 8, 2019

Worth noting that in their paper, AlphaZero was shown to have significant variance midway through its training, though it converged to a similar strength at the end.

ozabluda commented Feb 13, 2019

PR #712 is merged
