
Extraction of internal features from the tower representation #34

Closed
kblomdahl opened this issue Oct 3, 2018 · 5 comments
We should investigate the internal features of the tower representation. Doing so should provide several benefits:

  • It should help us identify good features that we can compute directly, and feed into the neural network.
  • It should help us improve the neural network architecture.

kblomdahl commented Oct 3, 2018

The default network architecture (ResNet) seems to have an odd bias towards strong activations close to the edge of the board, as can be observed in the diagram below:

This behavior makes sense: consider a set of normally distributed weights W applied to a positive input patch x:

y = W [ x₁ x₂ x₃ x₄ x₅ x₆ x₇ x₈ x₉ ]

If W ∈ N(μ, σ) and μ < 0, then along the edge x₇, x₈, and x₉ are zero-padded, and since each dropped term wᵢ xᵢ is negative in expectation:

W [ x₁ x₂ x₃ x₄ x₅ x₆ x₇ x₈ x₉ ] ≤ W [ x₁ x₂ x₃ x₄ x₅ x₆ 0 0 0 ]
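The inequality can be checked numerically. A minimal sketch, using hypothetical weight and patch values (standard library only, not the actual network):

```python
# Numerical sketch of the edge-bias argument: with negative-mean weights
# and positive inputs, zeroing x7..x9 (zero padding at the board edge)
# can only raise the dot product.
import random

random.seed(0)

# Weights drawn from N(mu, sigma) with mu < 0, as in the text.
mu, sigma = -0.5, 0.1
w = [random.gauss(mu, sigma) for _ in range(9)]

# A positive input patch x1..x9.
x = [random.uniform(0.1, 1.0) for _ in range(9)]

interior = sum(wi * xi for wi, xi in zip(w, x))              # full patch
edge = sum(wi * xi for wi, xi in zip(w, x[:6] + [0.0] * 3))  # x7..x9 padded

print(interior <= edge)  # True: zero padding increases the activation
```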

This hypothesis can be confirmed by investigating the final weights and checking the mean of each convolution layer's kernel:

{
    "01_upsample/conv_1": {
        "mean": -0.0003626140533015132,
        "std": 0.00519569544121623
    },
    "02_residual/conv_1": {
        "mean": -0.00011233328405069187,
        "std": 0.002601742511615157
    },
    "02_residual/conv_2": {
        "mean": -0.00011419910879340023,
        "std": 0.0026016614865511656
    },
    "03_residual/conv_1": {
        "mean": -0.00010304508032277226,
        "std": 0.0026021271478384733
    },
    "03_residual/conv_2": {
        "mean": -6.648160342592746e-05,
        "std": 0.002603317843750119
    },
    "04_residual/conv_1": {
        "mean": -0.0003849344211630523,
        "std": 0.0025755600072443485
    },
    "04_residual/conv_2": {
        "mean": 0.00016120942018460482,
        "std": 0.002599172294139862
    },
    "05_residual/conv_1": {
        "mean": 0.00021969537192489952,
        "std": 0.002594882855191827
    },
    "05_residual/conv_2": {
        "mean": 0.00045546123874373734,
        "std": 0.002564028138294816
    },
    "06_residual/conv_1": {
        "mean": -0.00022857918520458043,
        "std": 0.0025941154453903437
    },
    "06_residual/conv_2": {
        "mean": 0.00020613202650565654,
        "std": 0.0025959957856684923
    },
    "07_residual/conv_1": {
        "mean": -0.0020295707508921623,
        "std": 0.001631724531762302
    },
    "07_residual/conv_2": {
        "mean": -2.099659468512982e-05,
        "std": 0.002604082226753235
    },
    "08_residual/conv_1": {
        "mean": -0.0021902318112552166,
        "std": 0.001408747280947864
    },
    "08_residual/conv_2": {
        "mean": 7.066841476444097e-07,
        "std": 0.002604166977107525
    },
    "09_residual/conv_1": {
        "mean": -0.0021506529301404953,
        "std": 0.0014684603083878756
    },
    "09_residual/conv_2": {
        "mean": -2.3711704670859035e-06,
        "std": 0.002604165580123663
    },
    "10_residual/conv_1": {
        "mean": -0.0015733279287815094,
        "std": 0.002075168304145336
    },
    "10_residual/conv_2": {
        "mean": -0.00012044789764331654,
        "std": 0.0026013797614723444
    },
    "11p_policy/conv_1": {
        "mean": -0.007533475290983915,
        "std": 0.06204431504011154
    },
    "11p_policy/linear_1": {
        "mean": -0.01642797514796257,
        "std": 0.038641564548015594
    },
    "11v_value/conv_1": {
        "mean": -0.0014275580178946257,
        "std": 0.08837682008743286
    },
    "11v_value/linear_1": {
        "mean": 0.00015922913735266775,
        "std": 0.0526527501642704
    },
    "11v_value/linear_2": {
        "mean": -0.010312804952263832,
        "std": 0.06787863373756409
    }
}

As can be seen, most of the convolution layers have a negative mean.
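Statistics like the dump above can be produced with a small helper. A sketch assuming the checkpoint weights have already been loaded into plain Python lists (the layer name is taken from the dump, the sample values are made up):

```python
# Hypothetical helper reproducing per-layer weight statistics, assuming
# a dict of layer name -> flat list of weight values.
import json
from statistics import fmean, pstdev

def summarize(weights):
    """Return {layer: {"mean": ..., "std": ...}} for each weight tensor."""
    return {
        name: {"mean": fmean(values), "std": pstdev(values)}
        for name, values in weights.items()
    }

# Toy example (not real checkpoint data):
stats = summarize({"02_residual/conv_1": [-0.1, 0.0, 0.1, -0.2]})
print(json.dumps(stats, indent=4))
```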

edge_tower_representation.zip

Conclusion

The current neural network architecture has a systematic bias towards the edge of the board. This problem is presumably exacerbated by the fact that we clip activations to the range [0, 6].

kblomdahl commented Oct 3, 2018

Some approaches to solving the problem above immediately spring to mind:

  • Padding with a different constant instead of zero (we could pad with the median of the input distribution, which is a truncated normal distribution).
  • Use a per-activation, instead of per-channel, bias after each convolution layer.

Neither of these approaches is supported by cuDNN when running with the NCHW_VECT_C layout and INT8x4.
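The first idea can be sketched in plain Python (the median value below is hypothetical; cuDNN itself only supports implicit zero padding, which is the point):

```python
# Plain-Python stand-in for constant padding (not cuDNN): pad with a
# hypothetical median of the input distribution instead of zero.
def pad_constant(row, pad, value):
    """Pad a 1-D activation row on both sides with `value`."""
    return [value] * pad + list(row) + [value] * pad

row = [0.8, 1.2, 0.5]
median = 0.67  # hypothetical median of the truncated-normal inputs

print(pad_constant(row, 1, 0.0))     # zero padding
print(pad_constant(row, 1, median))  # median padding
```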


Neural style transfer research has encountered similar artifacts with a similar cause, but reached no conclusion beyond using super-resolution.

kblomdahl commented:

Even if it might turn out to be tricky to implement said algorithms in cuDNN, we can still train the architectures in TensorFlow and verify whether they result in a lower loss:

  • Green is per-activation offset
  • Gray is per-channel offset

As one can observe, they are effectively equivalent, with the difference being well within the margin of error.


kblomdahl commented Oct 6, 2018

An alternative attack vector is to prevent the weights from acquiring a non-zero mean in the first place, at which point the zero padding no longer matters.

The most likely culprit for the negative weight mean is the residual connection, which in the AlphaZero architecture looks like this, where Rₖ is the output of the k-th residual block and Lₖ₋₁ is the output of the previous residual connection:

Lₖ = Lₖ₋₁ + Rₖ

Rₖ ∈ N(0, 1)

Note that if Lₖ₋₁ has non-zero variance then the variance of Lₖ increases with k. This would normally not be a huge problem, but since we clip our activations at six, the optimizer must continuously decrease the mean of Lₖ to maintain the expressiveness of the network.

A solution to this is to instead interpolate between Lₖ₋₁ and Rₖ when computing the residual connection, i.e. Lₖ = α Lₖ₋₁ + (1 − α) Rₖ. A few interpretations for different values of α:

  • α = 0.5 - This is the current architecture without the exploding activations.
  • α ≠ 0.5 - This is what is called a highway network, a common value is α = 0.9.

We could also make α a trainable parameter.
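The variance argument and the interpolated variant can be checked with a toy simulation, using independent unit-variance Gaussians standing in for Rₖ (depth and sample count are arbitrary, not the actual network):

```python
# Toy check: a plain residual sum L_k = L_{k-1} + R_k grows in variance
# with depth, while the interpolated form L_k = a*L_{k-1} + (1-a)*R_k
# stays bounded.
import random
from statistics import pvariance

random.seed(0)
DEPTH, N = 20, 10_000

def simulate(mix):
    """Variance of L_DEPTH over N independent rollouts of the tower."""
    finals = []
    for _ in range(N):
        l = random.gauss(0.0, 1.0)      # L_0
        for _ in range(DEPTH):
            r = random.gauss(0.0, 1.0)  # R_k ~ N(0, 1)
            l = mix(l, r)
        finals.append(l)
    return pvariance(finals)

plain = simulate(lambda l, r: l + r)                # L_k = L_{k-1} + R_k
highway = simulate(lambda l, r: 0.5 * l + 0.5 * r)  # alpha = 0.5

print(plain, highway)  # plain grows roughly as DEPTH + 1; highway stays small
assert highway < plain
```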

Results

This change seems to have a positive effect on the final loss of the training:

Network     Value Loss   Policy Loss
Baseline    0.8162       1.936
α = 0.5     0.7942       1.887

It also seems to improve the actual playing strength of the engine:

dg-v060.per-channel-offset v dg-v060.highway (38/500 games)
unknown results: 9 23.68%
board size: 19   komi: 7.5
                             wins              black         white       avg cpu
dg-v060.per-channel-offset     10 26.32%       4  21.05%     6  31.58%    173.60
dg-v060.highway                19 50.00%       9  47.37%     10 52.63%    139.36
                                               13 34.21%     16 42.11%

kblomdahl commented:

When looking at the activations after introducing the α interpolation, the border problem seems to have been largely resolved. But an unintended side-effect of forcing Lₖ ∈ N(0, 1) is that the quantization resolution became worse, which has negative effects on the playing strength.

If we want to preserve 99.9% of all values, assuming the values follow a normal distribution with mean 0 and variance 1, then we should clip the activations at 3.09023. This is a far cry from 6.0, and shows that we were effectively wasting half of the quantized range.
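The 3.09023 threshold is just the 99.9th percentile of N(0, 1), which can be verified with the standard library:

```python
# The clipping threshold above is the 99.9th percentile of N(0, 1).
from statistics import NormalDist

clip = NormalDist(mu=0.0, sigma=1.0).inv_cdf(0.999)
print(round(clip, 5))  # 3.09023
```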

This inefficient use of the quantized range presumably also had negative effects before introducing α, since the early parts of the architecture were not using the entire range, resulting in low resolution during the early parts of the inference.

kblomdahl added a commit that referenced this issue Oct 8, 2018
Tweak the neural network architecture based on discoveries made in #34:

- Change the residual blocks to a highway architecture to avoid
  the magnitude of the activations exploding.
- Clip relu activations at 3.09023 instead of 6.0
- Reduce batch size from 1024 to 512 (improve performance, and does not
  seem to have any impact on the final accuracy).
- Disable adversarial training by default.

Also make some quality of life changes to the script:

- Read _big SGF files_ instead of TFRecords. This allows us to simplify
  and streamline the training pipeline.
- Add a '--tower' option, that allows us to inspect the internal
  features of the network.
- Add a measurement of the orthogonality of the weights to '--print'.