
Residual block fusion winograd optimization #1428

Merged
merged 9 commits into LeelaChessZero:master on Sep 25, 2020

Conversation

@ankan-ban ankan-ban (Member) commented Sep 23, 2020

A relatively small/easy optimization to further speed up the custom winograd convolution path.

In the original implementation of the custom winograd path, we had three passes for each convolution:

  1. input transform.
  2. GEMM
  3. output transform (fused with ReLU, bias addition, and optionally the SE operation)

Because the residual tower is made up of many such convolutions one after the other, we can fuse the output transform of the first convolution with the input transform of the next convolution, doing them in the same kernel. So, after this optimization, each convolution does:

  1. GEMM
  2. Output Transform, followed by SE/relu/bias, followed by Input transform (for next convolution)
    For the second convolution in each block, we also need to store the untransformed output (to use it as the skip connection).
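
Roughly, the fused kernel looks like the structural sketch below. This is not the actual lc0 kernel: the Winograd transform math is stubbed out, the tensor layout, SE block, and fp16 handling are omitted, and the names (FusedOutputInputTransformKernel, OutputTransform4x4, InputTransform4x4) are made up for illustration.

```cuda
// Structural sketch only - not lc0 code. One thread handles one channel of
// one 8x8 board; the real kernel tiles work differently and supports fp16.
#include <cuda_runtime.h>

// Stand-ins for the real Winograd transforms; the actual math is omitted.
__device__ void OutputTransform4x4(float* board, const float* transformed) {}
__device__ void InputTransform4x4(float* transformed, const float* board) {}

template <bool kStoreSkip>
__global__ void FusedOutputInputTransformKernel(
    int C,                       // channel count of the current convolution
    const float* gemm_out,       // GEMM result, still in transform space
    float* next_transformed_in,  // input-transformed activations for next conv
    float* skip_out,             // untransformed output kept for the residual add
    const float* bias,
    const float* skip_in) {      // skip connection input (or nullptr)
  int c = blockIdx.x * blockDim.x + threadIdx.x;
  if (c >= C) return;

  float board[64];  // one 8x8 plane per thread (simplified layout)

  // 1. Winograd output transform of this convolution's GEMM result.
  OutputTransform4x4(board, &gemm_out[c * 36]);

  // 2. Bias, optional residual add, ReLU (the SE block would also sit here).
  for (int i = 0; i < 64; i++) {
    float v = board[i] + bias[c];
    if (skip_in) v += skip_in[c * 64 + i];
    board[i] = v > 0.0f ? v : 0.0f;
  }

  // 3. For the second convolution of a block, also store the untransformed
  //    output so the next block can use it as its skip connection.
  if (kStoreSkip) {
    for (int i = 0; i < 64; i++) skip_out[c * 64 + i] = board[i];
  }

  // 4. Winograd input transform for the next convolution, so each
  //    convolution now needs only the GEMM plus this one fused kernel.
  InputTransform4x4(&next_transformed_in[c * 36], board);
}
```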

This optimization results in a 7-17% speedup, with the largest gains on 384x30 networks:

lc0 benchmark (nps)

GPU         network           baseline  optimized  perf gain
-------------------------------------------------------------
Titan RTX   32390   (256x20)   57757     61721       6.9 %
Titan RTX   sv-3010 (384x30)   17443     20084      15.1 %
A100        32390   (256x20)   107536    120719     12.3 %
A100        sv-3010 (384x30)   41785     48815      16.8 %

residual block fusion optimization: perform winograd output transform of the first convolution and the input transform of the second convolution in the same kernel; ~7% speed improvement for the 30x384 network on Titan RTX.
keep transformed tensor across residual blocks: ~3% extra speedup, making the total speedup close to 10%.
add backend-opt (default true): also add support for networks without SE.
cn4750 (Contributor) commented Sep 23, 2020

EDIT: Ignore the testing below. See my updated testing in the next message.
Testing on my RTX 2070 Super seems to show a very tiny regression when using this patch on J92-190 (384x30):

lc0 benchmark

GPU    network           baseline  optimized  perf gain
--------------------------------------------------------
2070S  J92-190 (384x30)  11845     11723      -1.0 %

Also a regression on backendbench:

lc0 backendbench (to 80 mbs)

GPU    network           baseline  optimized  perf gain
--------------------------------------------------------
2070S  J92-190 (384x30)  6176      5476       -11.3 %

cn4750 (Contributor) commented Sep 23, 2020

Edit: Ignore this bench too, since Visual Studio bugged out and never actually switched branches for me, so it was still using master. 😢
Additional, more thorough testing reveals no significant change:

lc0 backendbench (to 80 mbs)

GPU    network           baseline  optimized  perf gain
--------------------------------------------------------
2070S  J92-190 (384x30)  6118      6117       0.0 %

Normal benchmark seems too noisy to measure accurately and is about even under more tests.
I guess I'm bandwidth bound?

ankan-ban (Member, Author) commented Sep 23, 2020

@cn4750, which version of CUDA are you using? The optimization is applicable only to the custom winograd path, which is enabled by default only with CUDA 11 or later.
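
For reference, a minimal sketch of how such a gate can be checked at runtime (not the actual lc0 logic; the function name is made up): the CUDA runtime reports its version via cudaRuntimeGetVersion(), encoded as major*1000 + minor*10, so 11000 corresponds to CUDA 11.0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch only: gate the custom winograd path on the CUDA runtime version.
bool UseCustomWinogradByDefault() {
  int runtime_version = 0;  // e.g. 11000 for CUDA 11.0
  cudaRuntimeGetVersion(&runtime_version);
  return runtime_version >= 11000;  // CUDA 11 or later
}

int main() {
  printf("custom winograd by default: %s\n",
         UseCustomWinogradByDefault() ? "yes" : "no");
  return 0;
}
```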

cn4750 (Contributor) commented Sep 23, 2020

@cn4750, which version of CUDA are you using? The optimization is applicable only to the custom winograd path, which is enabled by default only with CUDA 11 or later.

       _
|   _ | |
|_ |_ |_| v0.27.0-dev+git.dirty built Sep 23 2020
Loading weights file from: C:\nets\J92-190
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 11.0.0
Cudnn version: 7.6.5
Latest version of CUDA supported by the driver: 11.0.0
GPU: GeForce RTX 2070 SUPER
GPU memory: 8 Gb
GPU clock frequency: 1815 MHz
GPU compute capability: 7.5

cn4750 (Contributor) commented Sep 23, 2020

Actually using properly built executables this time, and I see proper gains for large nets:

lc0 benchmark (80 mbs)

GPU    network           baseline  optimized  perf gain
--------------------------------------------------------
2070S  J92-190 (384x30)  10325     11089       7.4 %

lc0 backendbench (80 mbs)

GPU    network           baseline  optimized  perf gain
--------------------------------------------------------
2070S  703810  (128x10)  68790     68909       0.2 %
2070S  J92-190 (384x30)   6176      6855      11.0 %

address review comment: c_input_ is set to C in the constructor, so this is a no-op change, but it makes things consistent with the rest of the code.
@borg323 borg323 (Member) left a comment
I commented on some minor stuff that I have seen in other kernels as well.

fix res_block_fusing path for bigger filter counts:
- add a launch bound to the transform kernel to make sure it supports at least 512 threads per block (for running networks with 512 filters).
- disable the optimization for now when the filter count is more than 512.
- adding a launch bound for 1024 threads makes the kernel very slow (too many register spills).
- TODO: optimize this kernel further (find a way to reduce register pressure, or run multiple CTAs for the same 'C' dimension).
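
For illustration, this is roughly what such a launch bound looks like; the kernel name and body below are placeholders, not the lc0 kernel.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel, not lc0 code. __launch_bounds__(512) promises the
// kernel is never launched with more than 512 threads per block, so the
// compiler can budget registers for that size. A bound of 1024 instead
// forces a much smaller register budget per thread, which spills heavily
// in the real transform kernel; hence the 512 limit and the fallback to
// the unfused path for larger filter counts.
__global__ void __launch_bounds__(512)
OutputInputTransformKernel(int C, const float* in, float* out) {
  int c = blockIdx.x * blockDim.x + threadIdx.x;
  if (c < C) out[c] = in[c];  // placeholder body
}
```
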
address review comments:
- blas_size -> bias_size
- no need to add bias_size to the scratch offset
Tilps (Contributor) commented Sep 24, 2020

Does anyone have any non-regression confirmation results for this? A few thousand games without a crash and equal Elo at a fixed node count (for example)?

disable res block fusing for more than 384 filters: it's slower on V100, maybe because of register spilling.
@borg323 borg323 merged commit 2b7d7f5 into LeelaChessZero:master Sep 25, 2020
lp200 commented Sep 25, 2020

__launch_bounds__(384) seems to be quite slow on some GPUs.

borg323 pushed a commit that referenced this pull request Sep 28, 2020
* residual block fusion optimization

perform winograd output transform of the first convolution and the input transform of the second convolution in the same kernel.
~7% speed improvement for the 30x384 network on Titan RTX.

* keep transformed tensor across residual blocks

- 3% or so extra speedup, making the total speedup close to 10%.

* add backend-opt (default true)

- also add support for networks without SE

* fix non-se path

* fix meson.build to work with old compiler versions

* address review comment

c_input_ is set to C in the constructor, so this is a no-op change, but it makes things consistent with the rest of the code.

* fix res_block_fusing path for bigger filter counts

- add a launch bound to the transform kernel to make sure it supports at least 512 threads per block (for running networks with 512 filters).
- disable the optimization for now when the filter count is more than 512.
- adding a launch bound for 1024 threads makes the kernel very slow (too many register spills).
- TODO: optimize this kernel further (find a way to reduce register pressure, or run multiple CTAs for the same 'C' dimension).

* address review comments

- blas_size -> bias_size
- no need to add bias_size to scratch offset

* disable res block fusing for more than 384 filters

- it's slower on V100, maybe because of register spilling.