
Batch computation for BLAS. #87

Merged: 71 commits, Jul 1, 2018

Conversation

@frpays (Contributor) commented Jun 15, 2018

x2 overall speedup.

I was able to reorder the dimensions of the V/M matrices so that I could join the inner sgemms of the Winograd convolve3 into one single sgemm for every tile. Since the weights-tile matrix does not change, this leads to about a x2 speed-up.
I also made various optimisations to the scatter/gather methods of the Winograd convolve3, and significantly reduced allocations and the pressure on new/delete. The vectors are now allocated once for the whole computation. I also removed an unnecessary copy that dealt with the skip connection.
Finally, I reduced the memory footprint of the computation, removing the input buffer and relying on 3 rotating buffers (sketched below).
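
A minimal sketch of the rotating-buffer idea (the function, names, and exact rotation pattern are hypothetical, not the PR's actual code):

    #include <algorithm>
    #include <vector>

    // Hypothetical sketch: three reusable buffers are cycled across layers,
    // so the forward pass never allocates and needs no separate input buffer.
    void ForwardSketch(size_t max_buffer_size, size_t num_layers) {
      std::vector<float> buffer1(max_buffer_size);
      std::vector<float> buffer2(max_buffer_size);
      std::vector<float> buffer3(max_buffer_size);

      float* conv_in = buffer1.data();   // input of the current convolution
      float* conv_out = buffer2.data();  // output of the current convolution
      float* res = buffer3.data();       // tensor kept for the skip connection

      for (size_t layer = 0; layer < num_layers; ++layer) {
        // Convolve(conv_in, conv_out); AddSkipConnection(res, conv_out); ...
        std::swap(conv_in, conv_out);    // rotate roles instead of allocating
      }
    }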

Benchmarks:

https://docs.google.com/spreadsheets/d/1eU2gb67V69_7yXFzIaF2WI4djF1pXGAih-rwhrstmg0

(an earlier, less complete version: https://docs.google.com/spreadsheets/d/1vjPisPj2mbjc2Ic481E57h9O8o7ClyqYVXgU4HWoADQ/)

References:

https://arxiv.org/pdf/1509.09308.pdf
https://ai.intel.com/winograd/
https://ai.intel.com/winograd-2/

Details

The initial non-batched algorithm looks like this:

foreach batch in batches
    V := scatter(input, batch)
    foreach tile in tiles
        M[tile] := W[tile] x V[tile]
    output := gather(M, batch)

You can make M and V batch_size times larger and batch the scatter and gather.

V := scatter(input)
foreach tile in tiles
    foreach batch in batches
        M[batch][tile] := W[tile] x V[batch][tile]
output := gather(M)

With the proper reordering of M and V, the inner multiplication can be made contiguous along the columns of M and V. Note that W is fixed and does not change along the batch. So the per-tile inner matrix multiplications are joined into one multiplication per tile across the whole batch. The input/output columns are batch_size times larger.

V := scatter(input)
foreach tile in tiles
    M[tile] := W[tile] x V[tile]    (one sgemm spanning the whole batch)
output := gather(M)

We reuse W[tile] for the whole batch; this is what gives us the speedup.
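
A minimal sketch of that final per-tile multiplication with cblas_sgemm (the layout, names, and dimensions are assumptions for illustration, not the PR's exact code):

    #include <cblas.h>
    #include <vector>

    constexpr int kWinogradAlpha = 4;
    constexpr int kWinogradTile = kWinogradAlpha * kWinogradAlpha;

    // One sgemm per Winograd tile, spanning the whole batch. U holds the
    // transformed weights (fixed across the batch); V holds the transformed
    // inputs, reordered so the batch dimension is contiguous along the columns.
    void BatchedWinogradGemm(const std::vector<float>& U,  // per tile: outputs x channels
                             const std::vector<float>& V,  // per tile: channels x cols
                             std::vector<float>* M,        // per tile: outputs x cols
                             int outputs, int channels,
                             int tiles_per_image, int batch_size) {
      const int cols = tiles_per_image * batch_size;  // batch_size times wider
      for (int tile = 0; tile < kWinogradTile; ++tile) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    outputs, cols, channels, 1.0f,
                    &U[tile * outputs * channels], channels,
                    &V[tile * channels * cols], cols, 0.0f,
                    &(*M)[tile * outputs * cols], cols);
      }
    }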

Known problems

This change is known to be affected by #98, but is not responsible for it.

frpays added 30 commits June 9, 2018 21:54
static constexpr auto kWinogradAlpha = 4;
static constexpr auto kWinogradTile = kWinogradAlpha * kWinogradAlpha;

std::vector<float> V_;
Member

Why uppercase V_ and M_?

@frpays (Contributor, Author) Jun 29, 2018

Matrices often get a single-character uppercase name in code dealing with algebra. The original code used uppercase, and it helped me read the code.
As a maintainer, I would vote to keep the uppercase.

class WinogradConvolution3 {
 public:
  // Create the zero-padded U matrix
  static std::vector<float> ZeropadU(const std::vector<float>& U,
Member

Why uppercase U?

@frpays (Contributor, Author)

Same answer.

@frpays (Contributor, Author)

It's very common throughout algebra software.
Alternatively we can use matrixU_, but U_ is fine.

const size_t outputs_pad,
const size_t channels_pad);

// Transform the weights
Member

Sorry for nitpicking, but as there is another round of changes anyway...
Comments are sentences and should end with a period (also in fully_connected_layer.h).
Also, in general, function comments should be in the 3rd person (what the function does), e.g. "Transforms the weights.".

Also, from the comment it's not clear how exactly it transforms them.

@frpays (Contributor, Author)

Sorry, I cannot find anything better than "[building the] filter Transform" in the literature.
https://ai.intel.com/winograd-2/

const size_t outputs,
const size_t channels);

// Allocate for the largest batch size
Member

Allocate what? And why is the function called "WinogradConvolution" and not "Allocate"?

@frpays (Contributor, Author)

It's a class that does the forward inference of a 3x3 convolution layer.
The matrices M and U are memory resources that are allocated once and for all for the duration of the computation. The batch size of the computation may be larger than the actual largest internal batch. M and U are allocated for the largest batch size.

So the class name is WinogradConvolution3. But the comment can be enhanced.

@frpays (Contributor, Author)

I settled for:

  // The instance will allocate memory resources for the
  // largest batch size, and the largest input and output
  // layers.
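
A hypothetical sketch of the allocation being described (the size formula and kTilesPerImage are assumptions, not the PR's exact code):

    // Hypothetical sketch: V_ and M_ are sized once, in the constructor, for
    // the largest batch and the largest layers, then reused everywhere.
    WinogradConvolution3::WinogradConvolution3(const size_t max_batch_size,
                                               const size_t max_input_layers,
                                               const size_t max_output_layers)
        // kTilesPerImage: assumed number of Winograd tiles per 8x8 board.
        : V_(max_batch_size * kWinogradTile * max_input_layers * kTilesPerImage),
          M_(max_batch_size * kWinogradTile * max_output_layers * kTilesPerImage) {}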

const size_t channels,
const size_t outputs_pad,
const size_t channels_pad) {
// Fill with zeroes
Member

Period at the end of the sentence.


namespace lczero {

// Fully connected layer
Member

No need for the comment if it's just the class name.

}
}

void Batchnorm::OffsetMeans(std::vector<float>& bn_means,
Member

@dubslow
Many style guides (including the Google style guide that we happen to use) disallow mutable references because they can lead to surprises at the call site (a habit from the C language, where there are no references). E.g. (extreme example):

int x = 32;
ComputeSomethingXTimes(x);  // Unexpected that x may change.
DoSomethingElseXTimes(x);  // x is 0 now.

void ComputeSomethingXTimes(int& x) {
   while (x--) ComputeSomething();
}
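
For contrast, a sketch of the pointer style the guide prefers (same hypothetical names), which this thread later settles on:

    void ComputeSomethingXTimes(int* x) {
       while ((*x)--) ComputeSomething();
    }

    int x = 32;
    ComputeSomethingXTimes(&x);  // The &x makes the possible mutation visible.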

void forward(std::vector<float>& input, std::vector<float>& output_pol,
std::vector<float>& output_val) {
private:
// A cap on the max batch size since it's consume a lot of memory
Member

"it consumes", period at the end.

public:
Convolution1() = delete;

// Batched forward inference
Member

Period.

// The instance will allocate memory resources for the
// largest batch size, and the largest input and output
// layers.
WinogradConvolution3(const size_t max_batch_size,
Member

I've just realized it's a constructor!
Could you move that declaration to be the first in the class?

}
}

void Batchnorm::OffsetMeans(std::vector<float>& bn_means,
Member

Still not fixed.

@frpays (Contributor, Author) commented Jun 30, 2018

@mooskagh I did not see your comment about Weights. So I moved the methods there.
It really helps; network_blas and network_opencl really benefit from it.
It's now so clear and simple:


    conv1.OffsetMeans();
    conv2.OffsetMeans();

    conv1.InvertStddev();
    conv2.InvertStddev();

Please review.

@frpays (Contributor, Author) commented Jun 30, 2018

I am anticipating...

If you want me to move the batch normalization methods out of ConvBlock (where they really fit well, in my humble opinion), please advise where to put them and how to avoid passing by non-const reference.
I would really like to keep the clarity and simplicity that network_opencl and network_blas have just gained.

Code after moving the methods.

network_blas:

    conv1.OffsetMeans();
    conv2.OffsetMeans();

    conv1.InvertStddev();
    conv2.InvertStddev();
  }

  weights_.policy.OffsetMeans();
  weights_.policy.InvertStddev();

  weights_.value.OffsetMeans();
  weights_.value.InvertStddev();

network_opencl:

      std::vector<float> batchnorm_means_1 = conv1.OffsetMeans();
      std::vector<float> batchnorm_means_2 = conv2.OffsetMeans();

      std::vector<float> batchnorm_stddivs_1 = conv1.InvertStddev();
      std::vector<float> batchnorm_stddivs_2 = conv2.InvertStddev();

Previously the code was as follows (but passing non-const references):

      Transforms::OffsetBatchNormMeans(conv1.bn_means, conv1.biases);
      Transforms::OffsetBatchNormMeans(conv2.bn_means, conv2.biases);

      Transforms::InvertBatchNormStddev(conv1.bn_stddivs);
      Transforms::InvertBatchNormStddev(conv2.bn_stddivs);
    }

    Transforms::OffsetBatchNormMeans(weights_.policy.bn_means,
                                     weights_.policy.biases);
    Transforms::InvertBatchNormStddev(weights_.policy.bn_stddivs);

    Transforms::OffsetBatchNormMeans(weights_.value.bn_means,
                                     weights_.value.biases);
    Transforms::InvertBatchNormStddev(weights_.value.bn_stddivs);

or

      std::vector<float> batchnorm_means_1 = conv1.bn_means;  // copy ctor
      Transforms::OffsetBatchNormMeans(batchnorm_means_1, conv1.biases);

      std::vector<float> batchnorm_means_2 = conv2.bn_means;  // copy ctor
      Transforms::OffsetBatchNormMeans(batchnorm_means_2, conv2.biases);

      std::vector<float> batchnorm_stddivs_1 = conv1.bn_stddivs;  // copy ctor
      Transforms::InvertBatchNormStddev(batchnorm_stddivs_1);

      std::vector<float> batchnorm_stddivs_2 = conv2.bn_stddivs;  // copy ctor
      Transforms::InvertBatchNormStddev(batchnorm_stddivs_2);

@frpays (Contributor, Author) commented Jun 30, 2018

@mooskagh The PR is OK to merge in this state.

But a question: I have seen this in loader.cc.

void PopulateLastIntoVector(FloatVectors* vecs, Weights::Vec* out) {
  *out = std::move(vecs->back());
  vecs->pop_back();
}

void PopulateConvBlockWeights(FloatVectors* vecs, Weights::ConvBlock* block) {
  PopulateLastIntoVector(vecs, &block->bn_stddivs);
  PopulateLastIntoVector(vecs, &block->bn_means);
  PopulateLastIntoVector(vecs, &block->biases);
  PopulateLastIntoVector(vecs, &block->weights);
}
}  // namespace

Weights LoadWeightsFromFile(const std::string& filename) {
  FloatVectors vecs = LoadFloatsFromFile(filename);

Does it mean I can do the same, i.e. use a non-const pointer?

@mooskagh (Member)

Yes, a non-const pointer is what should be used instead of a non-const reference!
That way, at the call site it's clear that the value may be modified.

@mooskagh (Member)

As for methods in the Network class, that's intended as a const-only pure data class (eventually to be replaced with protobufs, most likely).
As for where to put them... src/neural/blas/utils.{h,cc}? E.g.

void OffsetMeans(Network::ConvBlock* conv_block);
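
A hypothetical sketch of what such a helper would do, folding the biases into the batchnorm means (the body is an assumption inferred from the Transforms calls quoted above):

    // Hypothetical sketch: fold the convolution biases into the batchnorm
    // means so no separate bias addition is needed in the forward pass.
    void OffsetMeans(Network::ConvBlock* conv_block) {
      for (size_t i = 0; i < conv_block->bn_means.size(); ++i) {
        conv_block->bn_means[i] -= conv_block->biases[i];
      }
    }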

@frpays (Contributor, Author) commented Jun 30, 2018

I really dislike util referencing neural.
I'd rather put them back in Batchnorm.
Would you agree?

@frpays (Contributor, Author) commented Jun 30, 2018

@mooskagh Put it back in Batchnorm with the correct signature.

@frpays (Contributor, Author) commented Jun 30, 2018

I was tempted to fix the InvertStddev formula; I think it's not completely correct.
But it has been copied into the tf and cudnn backends, and I think if the formula has to be fixed, then it has to be fixed in all 4 backends. So I rolled back.

I have nothing to add to this PR for the moment. Please review.

@Ttl (Member) commented Jul 1, 2018

The InvertStddev formula is correct. That's how it's defined in the batch normalization paper (https://arxiv.org/abs/1502.03167) and calculated in tensorflow during training.

@frpays (Contributor, Author) commented Jul 1, 2018

The InvertStddev formula is correct. That's how it's defined in the batch normalization paper (https://arxiv.org/abs/1502.03167) and calculated in tensorflow during training.

Alright. That looked suspicious, thanks for making it clear.
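
For reference, the paper normalizes with x_hat = (x - mean) / sqrt(variance + epsilon), so the stored values can be inverted once at load time. A minimal sketch of that precomputation (the epsilon value is an assumption):

    #include <cmath>
    #include <vector>

    // Minimal sketch of the precomputation under discussion: replace each
    // stored value with 1 / sqrt(value + epsilon), per the batchnorm paper.
    void InvertStddev(std::vector<float>* stddivs) {
      constexpr float kEpsilon = 1e-5f;  // assumed; a common default
      for (auto& v : *stddivs) v = 1.0f / std::sqrt(v + kEpsilon);
    }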

@frpays merged commit 09c8a8e into LeelaChessZero:master on Jul 1, 2018