Method: In each residual block, the second convolution and the following batch norm have twice as many channels. The first half of these channels is passed through a sigmoid and multiplied element-wise with the second half.

With this change, the policy and value losses of a 128x10 net come very close to those of a 160x10 net, while evaluation is slightly faster (4800 nps vs. 4400). Compared to a regular 128x10, the total benefit is 0.02 better policy loss and 0.005 better value loss, at the cost of slower evaluation (a 1000 nps difference). Because convolutions scale quadratically with the number of filters while the multiplication is only linear, the relative time cost shrinks for a bigger 256x20 network, especially if the self-scaling adds a fixed number of filters, say 64.
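The method can be sketched in PyTorch roughly as follows (a minimal sketch, not the actual training code; module and variable names are my own, and this shows the post-BN gating placement):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfGatedResidualBlock(nn.Module):
    """Residual block where the second conv/BN produce 2*C channels;
    the sigmoid of the first half gates the second half."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        # Second conv and its batch norm have twice as many channels.
        self.conv2 = nn.Conv2d(channels, 2 * channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(2 * channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        gate, value = out.chunk(2, dim=1)
        out = torch.sigmoid(gate) * value  # self-scaling: half gates the other half
        return F.relu(out + x)             # usual skip connection

x = torch.randn(1, 128, 8, 8)
y = SelfGatedResidualBlock(128)(x)
print(y.shape)  # torch.Size([1, 128, 8, 8])
```

Note the block's output has the original channel count, so the skip connection is unchanged; only `conv2`/`bn2` grow.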
- Not putting a sigmoid on one of the halves goes horribly wrong. Same for (1 + a) * b.
- Using only a 1x1 conv for the sigmoid branch is a bit faster but gives slightly worse results. Probably not worth saving that bit of compute.
- Sigmoid is better than tanh.
- Applying the gating after the first conv-bn is better than after the second conv-bn in the residual blocks by another 0.01, bringing the total improvement to 0.03.
- The network benefits even when this is applied after the SE (roughly the same as before the SE, with a small additional benefit from the doubled SE input size, which is thus also slightly slower). This is a sign that SE and self-gating are orthogonal improvements.
- Applying it to both convs looks good.
- Applying two gated convs and two SEs is worse than the baseline. Unlike self-scaling, SE apparently doesn't like being applied twice.
- Applying it before the BN, immediately after the conv, is better than post-BN, by about 0.005 for the version with one self-scale. However, with pre-BN the BN can no longer be folded into the conv for inference. This means pre-BN is faster at training (fewer channels to normalise over) but slower at testing, where post-BN's weights fold into the conv at zero cost.
- Even pre-BN, a * b and (1 + a) * (1 + b) are worse.
- Including the bias in the conv again for pre-BN doesn't seem to benefit much.
- Swish activation gains very little (maybe 0.002) over ReLU. To get any decent results, it's important to apply Swish only within the residual blocks, not in the skip path.
- Some sort of hybrid between Swish and self-scaling doesn't work well. To be precise, (a + b).sigmoid() * b and a.sigmoid() * (a + b).
- Replacing ReLU with ELU brings a 0.01 improvement. ReLU can't be fused anyway, so this isn't a big speed loss.
- Post-BN self-scale with ELU is just as good as pre-BN, except it's worse at the first LR. Post-BN is nicer because the BN weights can be completely fused and it doesn't require a change to the weight format. The worse results at the first LR might just be due to a lower effective LR for pre-BN, since its gradient norm is lower than post-BN's (0.93 vs. 0.95).
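The pre-BN vs. post-BN placements compared above can be sketched like this (again a hypothetical illustration, with my own helper name; only the order of BN and gating differs):

```python
import torch
import torch.nn as nn

def self_gate(t):
    """Split channels in half; sigmoid of the first half scales the second."""
    a, b = t.chunk(2, dim=1)
    return torch.sigmoid(a) * b

conv = nn.Conv2d(128, 256, 3, padding=1, bias=False)  # doubled output channels
bn_post = nn.BatchNorm2d(256)  # post-BN: normalise all 256 channels, then gate
bn_pre = nn.BatchNorm2d(128)   # pre-BN: gate first, normalise the remaining 128

x = torch.randn(1, 128, 8, 8)
y_post = self_gate(bn_post(conv(x)))  # BN foldable into conv at inference
y_pre = bn_pre(self_gate(conv(x)))    # fewer channels to normalise in training,
                                      # but BN can't be folded into the conv
assert y_post.shape == y_pre.shape == (1, 128, 8, 8)
```

This makes the trade-off visible: post-BN keeps the conv-BN pair adjacent so the BN folds away at inference, while pre-BN normalises only half as many channels during training.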