Residual block fusion winograd optimization #1428
Conversation
Perform the winograd output transform of the first convolution and the input transform of the second convolution in the same kernel. ~7% speed improvement for a 30x384 network on a Titan RTX.
keep transformed tensor across residual blocks
- 3% or so extra speedup, making the total speedup close to 10%.
add backend-opt (default true)
- also add support for networks without SE
EDIT: Ignore the testing below. See my updated testing in the next message.
Edit: Ignore this benchmark too; Visual Studio bugged out and never actually switched branches for me, so it was still using master. 😢
@cn4750, which version of CUDA are you using? The optimization applies only to the custom winograd path, which is enabled by default only with CUDA 11 or later.
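A minimal sketch of how such a compile-time CUDA-version gate can look; the `CUDART_VERSION` macro is standard CUDA, but the helper name is hypothetical and not lc0's actual code:

```cpp
#include <cuda_runtime.h>

// Hypothetical helper: only consider the custom winograd path when the
// toolkit the binary was built against is CUDA 11.0 or newer.
static bool CustomWinogradAvailable() {
#if defined(CUDART_VERSION) && CUDART_VERSION >= 11000
  return true;   // CUDA 11+: the fused winograd kernels are compiled in.
#else
  return false;  // Older toolkits fall back to the regular convolution path.
#endif
}
```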
Actually using properly built executables this time, and I see proper gains for large nets:
c_input_ is set to C in the constructor, so this is a NOP change, but it makes things consistent with the rest of the code.
I commented on some minor stuff that I have seen in other kernels as well.
fix res_block_fusing path for bigger filter counts
- Add a launch bound to the transform kernel to make sure it supports at least 512 threads per block (for running networks with 512 filters); see the sketch after this list.
- Disable the optimization for now when the filter count is more than 512.
- Adding a launch bound for 1024 threads makes the kernel very slow (too many register spills).
- TODO: optimize this kernel further (find a way to reduce register pressure, or run multiple CTAs for the same 'C' dimension).
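A hedged sketch of that guard, with hypothetical kernel and function names rather than the actual lc0 identifiers: the transform kernel carries a launch bound of 512 so the compiler budgets registers for at most 512 threads per block, and the host side falls back to the unfused path for larger filter counts.

```cpp
// Hypothetical fused transform kernel: one thread per channel, so the block
// size equals the filter count C. __launch_bounds__(512) tells the compiler
// to budget registers for blocks of up to 512 threads; a 1024-thread bound
// causes heavy register spilling and makes the kernel very slow.
template <typename T>
__global__ void __launch_bounds__(512) OutputInputTransform(
    int N, int C, const T* input, T* output) {
  int n = blockIdx.x;   // board index in the batch
  int c = threadIdx.x;  // channel handled by this thread
  if (n >= N || c >= C) return;
  // The real kernel performs the output transform of conv1 fused with the
  // input transform of conv2 here; this placeholder only shows the mapping.
  output[n * C + c] = input[n * C + c];
}

// Host-side guard (illustrative): fall back to the unfused path when the
// filter count exceeds what the 512-thread launch bound supports.
inline bool UseResBlockFusing(int num_filters) { return num_filters <= 512; }
```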
address review comments
- blas_size -> bias_size
- no need to add bias_size to the scratch offset
Does anyone have any non-regression confirmation results for this? A few thousand games without a crash and equal Elo at a fixed node count, for example?
disable res block fusing for more than 384 filters
- it's slower on V100, maybe because of register spilling.
__launch_bounds__(384) seems to be quite slow on some GPUs.
* residual block fusion optimization: perform the winograd output transform of the first convolution and the input transform of the second convolution in the same kernel. ~7% speed improvement for a 30x384 network on Titan RTX.
* keep transformed tensor across residual blocks: 3% or so extra speedup, making the total speedup close to 10%.
* add backend-opt (default true): also add support for networks without SE.
* fix non-se path
* fix meson.build to work with old compiler versions
* address review comment: c_input_ is set to C in the constructor, so this is a NOP change but makes things consistent with the rest of the code.
* fix res_block_fusing path for bigger filter counts: add a launch bound to the transform kernel so it supports at least 512 threads per block (for networks with 512 filters); disable the optimization for now when the filter count is more than 512; a 1024-thread launch bound makes the kernel very slow (too many register spills); TODO: optimize this kernel further (reduce register pressure, or run multiple CTAs for the same 'C' dimension).
* address review comments: blas_size -> bias_size; no need to add bias_size to the scratch offset.
* disable res block fusing for more than 384 filters: it's slower on V100, maybe because of register spilling.
A relatively small/easy optimization to speed up the custom winograd convolution path further.
In the original implementation of the custom winograd path, each convolution needed three passes: the input transform, a batched GEMM, and the output transform.
Because the residual tower is made up of many such convolutions one after another, we can fuse the output transform of the first convolution with the input transform of the next convolution, doing them in the same kernel. So, after this optimization, each convolution only needs the batched GEMM followed by a single kernel that performs its output transform fused with the input transform of the next convolution.
For the second convolution in each block, we also need to store the untransformed output (to use it as the skip connection).
This optimization results in a 5-15% speedup with 384x30 networks.
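A minimal sketch of the call structure before and after the fusion, using hypothetical function names (an illustration of the idea, not lc0's actual code):

```cpp
// Hypothetical pass functions; in the real backend these are CUDA kernels and
// cuBLAS batched-GEMM calls. Names are illustrative, not lc0's identifiers.
void InputTransform(const float* board, float* transformed, int N, int C) {}
void BatchedGemm(const float* transformed, const float* weights, float* out,
                 int N, int C) {}
void OutputTransform(const float* in, float* board, int N, int C) {}
// Fused kernel: output transform of the previous convolution plus input
// transform of the next one, done in a single pass over the data.
void OutputInputTransformFused(const float* in, float* transformed,
                               int N, int C) {}

// Before the optimization: three passes per convolution.
void ConvolveUnfused(const float* weights, float* board, float* transformed,
                     float* scratch, int N, int C) {
  InputTransform(board, transformed, N, C);
  BatchedGemm(transformed, weights, scratch, N, C);
  OutputTransform(scratch, board, N, C);  // plus bias / relu / skip as needed
}

// After the optimization: the transformed tensor is kept across the residual
// block, so each convolution is just a GEMM plus one fused transform kernel.
void ResidualBlockFused(const float* w1, const float* w2, float* transformed,
                        float* scratch, int N, int C) {
  BatchedGemm(transformed, w1, scratch, N, C);
  OutputInputTransformFused(scratch, transformed, N, C);  // conv1 -> conv2
  BatchedGemm(transformed, w2, scratch, N, C);
  OutputInputTransformFused(scratch, transformed, N, C);  // also stores the
  // untransformed output of conv2 for use as the next block's skip connection
}
```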