
Batch computation for BLAS. #87

Merged: 71 commits, Jul 1, 2018

Conversation

@frpays (Contributor) commented Jun 15, 2018

x2 overall speedup.

I was able to reorder the dimensions of the V/M matrices so that I could join the inner sgemms of the Winograd convolve3 into one single sgemm for every tile. Since the weights-tile matrix does not change, this leads to about a x2 speed-up.
I also made various optimisations to the scatter/gather methods of the Winograd convolve3, and significantly reduced allocations and the pressure on new/delete. The vectors are now allocated once for the whole computation. I also removed an unnecessary copy that dealt with the skip connection.
Finally, I reduced the memory footprint of the computation, removing the input buffer and relying on 3 rotating buffers (sketched below).
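
A minimal sketch of the rotating-buffer idea (the function, names, and exact rotation pattern are hypothetical, not the PR's actual code):

    #include <algorithm>
    #include <vector>

    // Hypothetical sketch: three reusable buffers are cycled across layers,
    // so the forward pass never allocates and needs no separate input buffer.
    void ForwardSketch(size_t max_buffer_size, size_t num_layers) {
      std::vector<float> buffer1(max_buffer_size);
      std::vector<float> buffer2(max_buffer_size);
      std::vector<float> buffer3(max_buffer_size);

      float* conv_in = buffer1.data();   // input of the current convolution
      float* conv_out = buffer2.data();  // output of the current convolution
      float* res = buffer3.data();       // tensor kept for the skip connection

      for (size_t layer = 0; layer < num_layers; ++layer) {
        // Convolve(conv_in, conv_out); AddSkipConnection(res, conv_out); ...
        std::swap(conv_in, conv_out);    // rotate roles instead of allocating
      }
    }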

Benchmarks:

https://docs.google.com/spreadsheets/d/1eU2gb67V69_7yXFzIaF2WI4djF1pXGAih-rwhrstmg0

(an earlier, less complete version: https://docs.google.com/spreadsheets/d/1vjPisPj2mbjc2Ic481E57h9O8o7ClyqYVXgU4HWoADQ/)

References:

https://arxiv.org/pdf/1509.09308.pdf
https://ai.intel.com/winograd/
https://ai.intel.com/winograd-2/

Details

The initial non-batched algorithm looks like this:

foreach batch in batches
    V := scatter(input, batch)
    foreach tile in tiles
        M[tile] := W[tile] x V[tile]
    output := gather(M, batch)

You can make M and V batch_size times larger and batch the scatter and gather.

V := scatter(input)
foreach tile in tiles
    foreach batch in batches
        M[batch][tile] := W[tile] x V[batch][tile]
output := gather(M)

With the proper reordering of M and V, the inner multiplication can be made contiguous along the columns of M and V. Note that W is fixed and does not change along the batch. So the per-tile inner matrix multiplications are joined into one multiplication per tile across the whole batch. The input/output columns are batch_size times larger.

V := scatter(input)
foreach tile in tiles
    M[tile] := W[tile] x V[tile]    (one sgemm spanning the whole batch)
output := gather(M)

We reuse W[tile] for the whole batch; this is what gives us the speedup.
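
A minimal sketch of that final per-tile multiplication with cblas_sgemm (the layout, names, and dimensions are assumptions for illustration, not the PR's exact code):

    #include <cblas.h>
    #include <vector>

    constexpr int kWinogradAlpha = 4;
    constexpr int kWinogradTile = kWinogradAlpha * kWinogradAlpha;

    // One sgemm per Winograd tile, spanning the whole batch. U holds the
    // transformed weights (fixed across the batch); V holds the transformed
    // inputs, reordered so the batch dimension is contiguous along the columns.
    void BatchedWinogradGemm(const std::vector<float>& U,  // per tile: outputs x channels
                             const std::vector<float>& V,  // per tile: channels x cols
                             std::vector<float>* M,        // per tile: outputs x cols
                             int outputs, int channels,
                             int tiles_per_image, int batch_size) {
      const int cols = tiles_per_image * batch_size;  // batch_size times wider
      for (int tile = 0; tile < kWinogradTile; ++tile) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    outputs, cols, channels, 1.0f,
                    &U[tile * outputs * channels], channels,
                    &V[tile * channels * cols], cols, 0.0f,
                    &(*M)[tile * outputs * cols], cols);
      }
    }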

Known problems

This change is known to be affected by #98, but is not responsible for it.

frpays added 30 commits June 9, 2018 21:54
static constexpr auto kWinogradAlpha = 4;
static constexpr auto kWinogradTile = kWinogradAlpha * kWinogradAlpha;

std::vector<float> V_;
Member

Why uppercase V_ and M_?

@frpays (Contributor, Author) Jun 29, 2018

Matrices often get a single-character uppercase name in code dealing with algebra. The original code used uppercase, and it helped me read the code.
As a maintainer, I would vote to keep the uppercase.

class WinogradConvolution3 {
 public:
  // Create the zero-padded U matrix
  static std::vector<float> ZeropadU(const std::vector<float>& U,
Member

Why uppercase U?

@frpays (Contributor, Author)

Same answer.

@frpays (Contributor, Author)

It's very common throughout algebra software.
Alternatively we can use matrixU_, but U_ is fine.

const size_t outputs_pad,
const size_t channels_pad);

// Transform the weights
Member

Sorry for nitpicking, but as there is another round of changes anyway...
Comments are sentences and should end with a period (also in fully_connected_layer.h).
Also, in general, function comments should be in the 3rd person (what the function does), e.g. "Transforms the weights.".

Also, from the comment it's not clear how exactly it transforms them.

@frpays (Contributor, Author)

Sorry, I cannot find anything better than "[building the] filter Transform" in the literature.
https://ai.intel.com/winograd-2/

const size_t outputs,
const size_t channels);

// Allocate for the largest batch size
Member

Allocate what? And why is the function called "WinogradConvolution" and not "Allocate"?

@frpays (Contributor, Author)

It's a class that does the forward inference of a 3x3 convolution layer.
The matrices M and U are memory resources that are allocated once and for all for the duration of the computation. The batch size of the computation may be larger than the actual largest internal batch. M and U are allocated for the largest batch size.

So the class name is WinogradConvolution3. But the comment can be enhanced.

@frpays (Contributor, Author)

I settled for:

  // The instance will allocate memory resources for the
  // largest batch size, and the largest input and output
  // layers.
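
A hypothetical sketch of the allocation being described (the size formula and kTilesPerImage are assumptions, not the PR's exact code):

    // Hypothetical sketch: V_ and M_ are sized once, in the constructor, for
    // the largest batch and the largest layers, then reused everywhere.
    WinogradConvolution3::WinogradConvolution3(const size_t max_batch_size,
                                               const size_t max_input_layers,
                                               const size_t max_output_layers)
        // kTilesPerImage: assumed number of Winograd tiles per 8x8 board.
        : V_(max_batch_size * kWinogradTile * max_input_layers * kTilesPerImage),
          M_(max_batch_size * kWinogradTile * max_output_layers * kTilesPerImage) {}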

const size_t channels,
const size_t outputs_pad,
const size_t channels_pad) {
// Fill with zeroes
Member

Period at the end of the sentence.


namespace lczero {

// Fully connected layer
Member

No need for the comment if it's just the class name.

}
}

void Batchnorm::OffsetMeans(std::vector<float>& bn_means,
Member

@dubslow
Many style guides (including the Google style guide that we happen to use) disallow mutable references because they can lead to surprises at the call site (a habit from the C language, where there are no references). E.g. (extreme example):

int x = 32;
ComputeSomethingXTimes(x);  // Unexpected that x may change.
DoSomethingElseXTimes(x);  // x is 0 now.

void ComputeSomethingXTimes(int& x) {
   while (x--) ComputeSomething();
}
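
For contrast, a sketch of the pointer style the guide prefers (same hypothetical names), which this thread later settles on:

    void ComputeSomethingXTimes(int* x) {
       while ((*x)--) ComputeSomething();
    }

    int x = 32;
    ComputeSomethingXTimes(&x);  // The &x makes the possible mutation visible.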

void forward(std::vector<float>& input, std::vector<float>& output_pol,
std::vector<float>& output_val) {
private:
// A cap on the max batch size since it's consume a lot of memory
Member

"it consumes", period at the end.

public:
Convolution1() = delete;

// Batched forward inference
Member

Period.

// The instance will allocate memory resources for the
// largest batch size, and the largest input and output
// layers.
WinogradConvolution3(const size_t max_batch_size,
Member

I've just realized it's a constructor!
Could you move that declaration to be the first in the class?

}
}

void Batchnorm::OffsetMeans(std::vector<float>& bn_means,
Member

Still not fixed.

@frpays (Contributor, Author) commented Jun 30, 2018

@mooskagh I did not see your comment about Weights. So I moved the methods there.
It really helps; network_blas and network_opencl really benefit from it.
It's now so clear and simple:


    conv1.OffsetMeans();
    conv2.OffsetMeans();

    conv1.InvertStddev();
    conv2.InvertStddev();

Please review.

@frpays (Contributor, Author) commented Jun 30, 2018

I am anticipating...

If you want me to move the batch normalization methods out of ConvBlock (where they really fit well, in my humble opinion), please advise where to put them and how to avoid passing by non-const reference.
I would really like to keep the clarity and simplicity that network_opencl and network_blas have just gained.

Code after moving the methods.

network_blas:

    conv1.OffsetMeans();
    conv2.OffsetMeans();

    conv1.InvertStddev();
    conv2.InvertStddev();
  }

  weights_.policy.OffsetMeans();
  weights_.policy.InvertStddev();

  weights_.value.OffsetMeans();
  weights_.value.InvertStddev();

network_opencl:

      std::vector<float> batchnorm_means_1 = conv1.OffsetMeans();
      std::vector<float> batchnorm_means_2 = conv2.OffsetMeans();

      std::vector<float> batchnorm_stddivs_1 = conv1.InvertStddev();
      std::vector<float> batchnorm_stddivs_2 = conv2.InvertStddev();

Previously the code was as follows (but passing non-const references):

      Transforms::OffsetBatchNormMeans(conv1.bn_means, conv1.biases);
      Transforms::OffsetBatchNormMeans(conv2.bn_means, conv2.biases);

      Transforms::InvertBatchNormStddev(conv1.bn_stddivs);
      Transforms::InvertBatchNormStddev(conv2.bn_stddivs);
    }

    Transforms::OffsetBatchNormMeans(weights_.policy.bn_means,
                                     weights_.policy.biases);
    Transforms::InvertBatchNormStddev(weights_.policy.bn_stddivs);

    Transforms::OffsetBatchNormMeans(weights_.value.bn_means,
                                     weights_.value.biases);
    Transforms::InvertBatchNormStddev(weights_.value.bn_stddivs);

or

      std::vector<float> batchnorm_means_1 = conv1.bn_means;  // copy ctor
      Transforms::OffsetBatchNormMeans(batchnorm_means_1, conv1.biases);

      std::vector<float> batchnorm_means_2 = conv2.bn_means;  // copy ctor
      Transforms::OffsetBatchNormMeans(batchnorm_means_2, conv2.biases);

      std::vector<float> batchnorm_stddivs_1 = conv1.bn_stddivs;  // copy ctor
      Transforms::InvertBatchNormStddev(batchnorm_stddivs_1);

      std::vector<float> batchnorm_stddivs_2 = conv2.bn_stddivs;  // copy ctor
      Transforms::InvertBatchNormStddev(batchnorm_stddivs_2);

@frpays (Contributor, Author) commented Jun 30, 2018

@mooskagh The PR is OK to merge in this state.

But a question: I have seen this in loader.cc.

void PopulateLastIntoVector(FloatVectors* vecs, Weights::Vec* out) {
  *out = std::move(vecs->back());
  vecs->pop_back();
}

void PopulateConvBlockWeights(FloatVectors* vecs, Weights::ConvBlock* block) {
  PopulateLastIntoVector(vecs, &block->bn_stddivs);
  PopulateLastIntoVector(vecs, &block->bn_means);
  PopulateLastIntoVector(vecs, &block->biases);
  PopulateLastIntoVector(vecs, &block->weights);
}
}  // namespace

Weights LoadWeightsFromFile(const std::string& filename) {
  FloatVectors vecs = LoadFloatsFromFile(filename);

Does it mean I can do the same, i.e. use a non-const pointer?

@mooskagh (Member)

Yes, a non-const pointer is what should be used instead of a non-const reference!
That way, at the call site it's clear that the value may be modified.

@mooskagh (Member)

As for methods in the Network class, that's intended as a const-only pure data class (eventually to be replaced with protobufs, most likely).
As for where to put them... src/neural/blas/utils.{h,cc}? E.g.

void OffsetMeans(Network::ConvBlock* conv_block);
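
A hypothetical sketch of what such a helper would do, folding the biases into the batchnorm means (the body is an assumption inferred from the Transforms calls quoted above):

    // Hypothetical sketch: fold the convolution biases into the batchnorm
    // means so no separate bias addition is needed in the forward pass.
    void OffsetMeans(Network::ConvBlock* conv_block) {
      for (size_t i = 0; i < conv_block->bn_means.size(); ++i) {
        conv_block->bn_means[i] -= conv_block->biases[i];
      }
    }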

@frpays (Contributor, Author) commented Jun 30, 2018

I really dislike util referencing neural.
I'd rather put them back in Batchnorm.
Would you agree?

@frpays (Contributor, Author) commented Jun 30, 2018

@mooskagh Put it back in Batchnorm with the correct signature.

@frpays (Contributor, Author) commented Jun 30, 2018

I was tempted to fix the InvertStddev formula; I think it's not completely correct.
But it has been copied into the tf and cudnn backends, and I think if the formula has to be fixed, then it has to be fixed in all 4 backends. So I rolled back.

I have nothing to add to this PR for the moment. Please review.

@Ttl (Member) commented Jul 1, 2018

The InvertStddev formula is correct. That's how it's defined in the batch normalization paper (https://arxiv.org/abs/1502.03167) and calculated in tensorflow during training.

@frpays (Contributor, Author) commented Jul 1, 2018

The InvertStddev formula is correct. That's how it's defined in the batch normalization paper (https://arxiv.org/abs/1502.03167) and calculated in tensorflow during training.

Alright. That looked suspicious, thanks for making it clear.
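
For reference, the paper normalizes with x_hat = (x - mean) / sqrt(variance + epsilon), so the stored values can be inverted once at load time. A minimal sketch of that precomputation (the epsilon value is an assumption):

    #include <cmath>
    #include <vector>

    // Minimal sketch of the precomputation under discussion: replace each
    // stored value with 1 / sqrt(value + epsilon), per the batchnorm paper.
    void InvertStddev(std::vector<float>* stddivs) {
      constexpr float kEpsilon = 1e-5f;  // assumed; a common default
      for (auto& v : *stddivs) v = 1.0f / std::sqrt(v + kEpsilon);
    }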

@frpays merged commit 09c8a8e into LeelaChessZero:master on Jul 1, 2018