
Speed-up OpenCL backend with Batch computation #115

Closed
frpays opened this issue Jun 25, 2018 · 16 comments
Labels
enhancement New feature or request

Comments

@frpays
Contributor

frpays commented Jun 25, 2018

Use BLAS Batch computation as a model.

@frpays frpays added the enhancement New feature or request label Jun 25, 2018
@frpays frpays assigned frpays and unassigned frpays Jun 25, 2018
@frpays frpays added the lc0 label Jun 25, 2018
@Ttl
Member

Ttl commented Jun 26, 2018

I have added OpenCL batching support for LZGo in here: https://github.com/Ttl/leela-zero/tree/batching_test

It's missing batching of innerproduct, which is not used in LZGo, but otherwise it's basically a complete implementation. It shouldn't be too hard to port this to lc0.

@frpays
Contributor Author

frpays commented Jun 27, 2018

> I have added OpenCL batching support for LZGo in here: https://github.com/Ttl/leela-zero/tree/batching_test
>
> It's missing batching of innerproduct, which is not used in LZGo, but otherwise it's basically a complete implementation. It shouldn't be too hard to port this to lc0.

Thank you for pointing this out to me.

Were you able to join the inner sgemm operations along the batch dimension in the conv3x3 Winograd algorithm? This is what yields a 2x performance gain. See "First step batched gemm" in https://ai.intel.com/winograd-2/ .

Batching the BLAS layer was basically a matter of reordering the dimensions of the M/V temporary matrices so that all the inner sgemms for the same tile but different batch indices are contiguous in memory. Then you can join all the sgemm operations of a tile into one.

  • Did you do that? I am unsure from reading the code.

  • What speed-up do you achieve?

  • What do you mean by "Proper batching support"?

//TODO: Proper batching support

  • Why such a small batch size? The BLAS default batch size is 256, and cudnn's is 512.

#define MAX_BATCH 8

@Ttl
Member

Ttl commented Jun 27, 2018

I don't think the memory layout is like that in my implementation. I wasn't aware of that approach.

I don't have a computer with a GPU to test on right now. On my laptop the gain is zero because even a single batch can fully utilize the Intel integrated GPU (it only gets 20 nps). The smaller 64x5 network had a 30% performance increase. I would expect that a computer with a proper GPU would see a larger gain.

Proper batching support means launching only one kernel instead of doing a loop. It's not that critical, since convolve1 is very fast compared to convolve3.
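
Something along these lines, as a rough sketch only (the kernel name, argument layout and work sizes here are invented for illustration, not taken from the LZGo or lc0 sources):

```cpp
// Illustrative sketch: contrast a per-batch-entry launch loop with a single
// batched NDRange launch. Kernel name and arguments are hypothetical.
#include <CL/cl2.hpp>

void run_convolve1(cl::CommandQueue& queue, cl::Kernel& convolve1,
                   size_t board_squares, size_t channels,
                   size_t batch_size, bool batched) {
  if (!batched) {
    // Loop version: one small kernel launch per batch entry.
    for (size_t b = 0; b < batch_size; ++b) {
      convolve1.setArg(0, static_cast<cl_int>(b));  // which batch entry to process
      queue.enqueueNDRangeKernel(convolve1, cl::NullRange,
                                 cl::NDRange(board_squares, channels),
                                 cl::NullRange);
    }
  } else {
    // "Proper" batching: a single launch whose global size covers the whole
    // batch. The local work size is unchanged; only the global size grows.
    convolve1.setArg(0, static_cast<cl_int>(batch_size));
    queue.enqueueNDRangeKernel(convolve1, cl::NullRange,
                               cl::NDRange(board_squares, channels, batch_size),
                               cl::NullRange);
  }
}
```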

Is that large a batch size stronger at equal time compared to a smaller batch size? MCTS is most efficient with a batch size of one, and everything above that forces potentially worse moves into the batch. For the strongest play you want the smallest batch size that can fully utilize the GPU. It's also good to keep in mind that the Go board has about 5.6 times more squares than the chess board, so every convolution is much bigger, which calls for a smaller batch size.

@Ttl
Member

Ttl commented Jun 28, 2018

I updated the code to increase the sgemm dimensions instead of adding batches of smaller multiplications. I also added a proper convolve1 batching implementation.

No performance difference from the previous code on this laptop. I should be able to test on a better GPU this weekend. I'll update the results here.

You did some weird column-major reordering during scatter. I just increased the tiles dimension to tiles * batch_size and kept the old order, since the OpenCL sgemm doesn't support anything else. Is that memory layout faster on CPU? The old order had the benefit that one of the matrices was transposed, so sgemm could read both matrices in row order. Without the transpose, one of the matrices needs to be read in column order, which isn't that good for memory locality.

@frpays
Contributor Author

frpays commented Jun 28, 2018

Profiling shows that the only operation that matters is the 3x3 convolution. This operation takes more than 95% of the total computation time.

It's not clear to me whether you joined the inner gemms or not. If you did, you are definitely entitled to a 2x speed-up. I will have a look at your code tomorrow or this weekend.

If you increased the tiles dimension, that means you did join the batch_size gemms into one sgemm. Actually, it's very simple: how many gemms do you have in conv3x3? If you still have batch_size * tiles gemms, you can do better.

Here is what I did in detail:

The initial non-batched algorithm looks like this:

foreach batch in batches
    V := scatter(input, batch)
    foreach tile in tiles
        M[tile] := W[tile] x V[tile]
    output := gather(M, batch)

You can make M and V batch_size times larger and batch the scatter and gather:

V := scatter(input)                 // all batches at once
foreach tile in tiles
    foreach batch in batches
        M[batch][tile] := W[tile] x V[batch][tile]
output := gather(M)                 // all batches at once

With the proper reordering of M and V, the inner multiplications become contiguous along the columns of M and V. Note that W is fixed and does not change along the batch, so for each tile the per-batch matrix multiplications can be joined into a single one. The input/output columns are batch_size times larger, and we reuse W[tile] across the whole batch; this is what gives us the speed-up.

V := scatter(input)
foreach tile in tiles
    M[tile] := W[tile] x V[tile]    // columns now span the whole batch
output := gather(M)

I also left a comment about the method here: #87
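
To make the last step concrete, here is a minimal CBLAS sketch of the idea. It is illustrative only: the buffer names, the toy dimensions and the row-major layout are assumptions for the example, not the actual lc0 or LZGo code. U_b is the transformed weight matrix for one Winograd tile index b; when the transformed inputs/outputs for that tile index are stored contiguously across the whole batch, the batch_size small sgemms collapse into one call whose column dimension is batch_size times larger, and U_b is read only once:

```cpp
#include <cblas.h>

// Illustrative dimensions; kTilesP is the number of 4x4 input tiles per position.
constexpr int kOutputs = 128, kChannels = 128, kTilesP = 16, kBatch = 8;

// Unbatched: V_b / M_b hold one position per batch entry,
// (channels x P) and (outputs x P), row-major; one sgemm per entry.
void convolve3_tile_unbatched(const float* U_b, const float* V_b, float* M_b) {
  for (int batch = 0; batch < kBatch; ++batch) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                kOutputs, kTilesP, kChannels,
                1.0f,
                U_b, kChannels,
                V_b + batch * kChannels * kTilesP, kTilesP,
                0.0f,
                M_b + batch * kOutputs * kTilesP, kTilesP);
  }
}

// Batched: V_b / M_b are reordered so that the P columns of every position in
// the batch are contiguous, (channels x (batch*P)) and (outputs x (batch*P)).
// One sgemm per tile index; U_b is reused across the whole batch.
void convolve3_tile_batched(const float* U_b, const float* V_b, float* M_b) {
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              kOutputs, kBatch * kTilesP, kChannels,
              1.0f,
              U_b, kChannels,
              V_b, kBatch * kTilesP,
              0.0f,
              M_b, kBatch * kTilesP);
}
```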

@Ttl
Member

Ttl commented Jun 29, 2018

> Profiling shows that the only operation that matters is the 3x3 convolution. This operation takes more than 95% of the total computation time.

I'm talking about the SGEMM performance. You changed CblasRowMajor, CblasTrans, CblasNoTrans to CblasColMajor, CblasNoTrans, CblasNoTrans. Is that faster on CPU? On GPU it would be slower due to the worse memory access pattern. I kept the original memory layout.

I first had 16 * batch_size batches of outputs x tiles x channels SGEMMs, which I have now changed to 16 SGEMMs of outputs x (tiles * batch_size) x channels, as you did. The total work is theoretically the same, but larger SGEMM dimensions should be more efficient in practice. I'm not sure where you get the 2x performance increase from. The approach with more batches doesn't need a larger W; its offset is taken mod 16 to reuse it.

Larger SGEMMs should be especially faster for chess, whose 8x8 board only has 16 tiles, and tuning the SGEMM kernel for larger dimensions should give an additional performance gain. Go has 100 tiles, but for practical reasons on the GPU it's necessary to zero-pad the matrix to either 104 or 128, depending on the device. With a larger batch size the relative amount of padding is smaller, which should also give an additional performance gain.
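
As a rough illustration of the padding argument (assuming, purely for the sake of the example, that the tiles dimension is padded up to a multiple of 32, one of the two cases mentioned above):

```cpp
#include <cstdio>

// Fraction of the SGEMM tiles dimension wasted on zero padding when it is
// rounded up to a multiple of `pad`. Numbers are illustrative only.
int main() {
  const int pad = 32;
  for (int tiles : {100, 16}) {      // Go board, chess board
    for (int batch : {1, 4, 8}) {
      int n = tiles * batch;
      int padded = (n + pad - 1) / pad * pad;
      std::printf("tiles=%3d batch=%d: %4d -> %4d (%.0f%% padding)\n",
                  tiles, batch, n, padded, 100.0 * (padded - n) / padded);
    }
  }
}
```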

@frpays
Contributor Author

frpays commented Jun 29, 2018

I changed both CblasRowMajor to CblasColMajor and CblasTrans to CblasNoTrans on the W matrix. Since these flags have opposite effects, the memory access pattern is the same with respect to W. I did not test transposing W; I don't know whether that would be faster or slower on CPU.

The initial dimensions of M and V are [batches][wtiles][channels][tiles]. I inverted dimensions 3 and 4, which causes their transposition in the multiplication. Then I inverted dimensions 1 and 2, giving the layout [wtiles][batches][tiles][channels], which enables the merging of the inner sgemms ([wtiles][batches x tiles][channels]).
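
For anyone following along, here is a small self-contained check of that equivalence, with toy dimensions rather than the actual backend code: the same W buffer, stored channels x outputs in row-major order, is read through exactly the same addresses whether it is passed as CblasRowMajor + CblasTrans or as CblasColMajor + CblasNoTrans, provided V and M are stored with their last two dimensions swapped as described above.

```cpp
#include <cblas.h>
#include <cassert>
#include <cmath>
#include <vector>

int main() {
  const int O = 3, K = 4, P = 5;            // outputs, channels, tiles (toy sizes)
  std::vector<float> W(K * O), V(K * P);    // W: K x O row-major, V: K x P row-major
  for (int i = 0; i < K * O; ++i) W[i] = 0.1f * i;
  for (int i = 0; i < K * P; ++i) V[i] = 0.3f * i - 1.0f;

  // Variant 1: row-major, W transposed.  M1 = W^T * V, stored O x P row-major.
  std::vector<float> M1(O * P);
  cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
              O, P, K, 1.0f, W.data(), O, V.data(), P, 0.0f, M1.data(), P);

  // Variant 2: column-major, no transpose.  V and M have their last two
  // dimensions swapped (Vt is P x K row-major, M2 is P x O row-major),
  // which is the reordering described above.  W is passed through untouched.
  std::vector<float> Vt(P * K), M2(P * O);
  for (int k = 0; k < K; ++k)
    for (int p = 0; p < P; ++p) Vt[p * K + k] = V[k * P + p];
  cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
              O, P, K, 1.0f, W.data(), O, Vt.data(), K, 0.0f, M2.data(), O);

  // The two results agree: M1 (O x P row-major) equals M2 (P x O row-major) transposed.
  for (int o = 0; o < O; ++o)
    for (int p = 0; p < P; ++p)
      assert(std::fabs(M1[o * P + p] - M2[p * O + o]) < 1e-4f);
  return 0;
}
```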

> I'm talking about the SGEMM performance.

You mentioned a "proper convolve1 batching implementation". While it is intellectually satisfying, I pointed out that it has no bearing on the overall performance.

> I first had 16 * batch_size batches of outputs x tiles x channels SGEMMs, which I have now changed to 16 SGEMMs of outputs x (tiles * batch_size) x channels, as you did. The total work is theoretically the same, but larger SGEMM dimensions should be more efficient in practice. I'm not sure where you get the 2x performance increase from.

The total work is the same, but the gemm implementation can possibly reuse the W matrix across a whole batch, hence a possible speed-up. That said, it's perfectly possible that the GPU gets no speed-up. It really depends on the gemm implementation and on whether the limiting factor is memory bandwidth or FPU throughput. Even with BLAS, not all vendors will benefit equally from larger operations.
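
A back-of-the-envelope way to see why the gain depends on the limiting factor (illustrative numbers only, assuming 4-byte floats and that each operand is read from or written to memory once per sgemm call):

```cpp
#include <cstdio>

// Arithmetic intensity (FLOPs per byte) of one Winograd-tile sgemm of shape
// outputs x (batch * tiles) x channels. Joining the batch into one call reuses
// the W matrix (outputs x channels), so FLOPs grow with the batch while W is
// only loaded once; a memory-bound gemm benefits, a compute-bound one does not.
int main() {
  const double outputs = 128, channels = 128, tiles = 16;
  for (int batch : {1, 4, 8}) {
    double n = batch * tiles;
    double flops = 2.0 * outputs * n * channels;
    double bytes = 4.0 * (outputs * channels      // W, loaded once per call
                          + channels * n          // V
                          + outputs * n);         // M
    std::printf("batch=%d: %.1f FLOPs/byte\n", batch, flops / bytes);
  }
}
```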

@Ttl
Member

Ttl commented Jun 30, 2018

Some performance tests on a GTX 1080 Ti using a 19x19 board. Gains should be larger than this on lc0 with its smaller 8x8 board.

192x15 network:

| Batch size | n/s |
| --- | --- |
| 1 | 923 |
| 4 | 1060 |
| 8 | 1128 |

64x5 network:

| Batch size | n/s |
| --- | --- |
| 1 | 3883 |
| 4 | 9992 |
| 8 | 10160 |

20x128 network for 9x9 board. These results should be closer to chess:

| Batch size | n/s |
| --- | --- |
| 1 | 1403 |
| 4 | 4168 |
| 8 | 4688 |
| 16 | 4992 |
| 128 | 4992 |

I also tried tuning the SGEMM for a larger batch size. It didn't change the results for the 192x15 network, but with the SGEMM tuned for a larger batch, the 64x5 network got 11920 n/s at a batch size of 8, and the 20x128 network for the 9x9 board got 5840 n/s at a batch size of 16 and 6520 n/s at a batch size of 128.

I would expect this to be about 4 times faster on lc0.

@frpays
Contributor Author

frpays commented Jun 30, 2018

Very nice! I cannot wait to see that ported to lc0.

I am just worried about the tuning: does it change anything? Shouldn't you check the maximum batch_size against the device capabilities?

I am not sure the tuning code is all that robust, as there are some devices that are not supported by the OpenCL backend, possibly due to a too-large local workgroup size. I am working on this, but it's not clear what to do about it.

@Ttl
Member

Ttl commented Jun 30, 2018

batch_size only affects the global size, which is not limited by the device; the local size is constant. The maximum batch size is only limited by the device memory. Since we don't have to do backpropagation, there is no need to store gradients and memory use is very small. Large batch sizes can be used even on devices with a small amount of memory.

Tuning for larger batch sizes mainly allows using a larger tile size in the SGEMM. There isn't really any point in using a 32-wide tile when calculating a 16-wide SGEMM, since half of the computation would be wasted. With a larger batch size it's possible to use a larger tile, which can give some performance gains.

I'm not aware of any issues with insufficient local size. The tuner checks that the kernel can be compiled and that it gives the correct output, so it should find the best parameters even if the device has some limitations on the local size.

@frpays
Contributor Author

frpays commented Jul 1, 2018

That's very valuable information, thank you.
It looks like I am destined to port this to lc0, unless someone else volunteers. But it will take me a while, that's for sure.
I will keep you informed and will probably get back to you if I get stuck.

@frpays
Contributor Author

frpays commented Jul 18, 2018

#191

@Ttl
Member

Ttl commented Jul 23, 2018

I implemented F(4x4, 3x3) Winograd transformations for LZGo (leela-zero/leela-zero#1643). You might want to check them out since they seem to give a very nice performance increase.

@frpays
Contributor Author

frpays commented Aug 11, 2018

Sorry for the delay, and many thanks for the hints!
Be sure that your outstanding work on LZGo is followed closely by the lc0 community!
I will definitely look into your development and will probably try to port it to our chess implementation of Leela. @borg323 was also very interested in examining your patch more closely.

@oscardssmith
Contributor

@frpays should this be closed?

@frpays
Contributor Author

frpays commented Dec 27, 2018

Implemented by #191

@frpays frpays closed this as completed Dec 27, 2018