Batch computation for BLAS. #87
Conversation
static constexpr auto kWinogradAlpha = 4;
static constexpr auto kWinogradTile = kWinogradAlpha * kWinogradAlpha;

std::vector<float> V_;
Why uppercase V_ and M_?
Matrices often get a single-character uppercase name in code dealing with algebra. The original code used uppercase, and it helped me read the code.
As a maintainer, I would vote to keep the uppercase.
class WinogradConvolution3 {
 public:
  // Create the zero-padded U matrix
  static std::vector<float> ZeropadU(const std::vector<float>& U,
Why uppercase U?
Same answer.
It's very common throughout algebra software.
Alternatively we can use matrixU_, but U_ is fine.
                                     const size_t outputs_pad,
                                     const size_t channels_pad);

  // Transform the weights
Sorry for nitpicking, but since there is another round of changes anyway...
Comments are sentences and should end with a period (also in fully_connected_layer.h).
Also, in general, function comments should be in the 3rd person (what the function does), e.g. "Transforms the weights.".
Also, from the comment it's not clear how exactly it transforms them.
Sorry, I cannot find anything better than "[building the] filter Transform" in the literature.
https://ai.intel.com/winograd-2/
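For reference, the filter transform in the Lavin & Gray paper (for F(2x2, 3x3), which matches kWinogradAlpha = 4 above) maps each 3x3 filter $g$ to a 4x4 tile:

$$U = G\,g\,G^{\top}, \qquad G = \begin{pmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{pmatrix}$$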
                                     const size_t outputs,
                                     const size_t channels);

  // Allocate for the largest batch size
Allocate what? And why is the function called "WinogradConvolution" and not "Allocate"?
It's a class that does the forward inference of a 3x3 convolution layer.
The matrices M and U are memory resources that are allocated once, up front, for the whole duration of the computation. The largest batch size may be bigger than the actual internal batch being processed at any moment, so M and U are allocated for the largest batch size.
So the class name is WinogradConvolution3. But the comment can be enhanced.
I settled for
// The instance will allocate memory resources for the
// largest batch size, and the largest input and output
// layers.
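A minimal sketch of what that allocation amounts to (assumptions: the member names V_/M_ come from the diff above, kTiles = 16 for an 8x8 board covered by 2x2 output tiles, and the size formulas are illustrative; the real code derives them from the network dimensions):

#include <vector>

class WinogradConvolution3 {
 public:
  WinogradConvolution3(const size_t max_batch_size,
                       const size_t max_input_layers,
                       const size_t max_output_layers)
      // Allocate once, for the largest batch and the largest layers.
      : V_(max_batch_size * kWinogradTile * max_input_layers * kTiles),
        M_(max_batch_size * kWinogradTile * max_output_layers * kTiles) {}

 private:
  static constexpr size_t kWinogradAlpha = 4;
  static constexpr size_t kWinogradTile = kWinogradAlpha * kWinogradAlpha;
  static constexpr size_t kTiles = 16;  // 8x8 board, 2x2 output tiles.

  std::vector<float> V_;  // Transformed input.
  std::vector<float> M_;  // Product, before the output transform.
};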
                          const size_t channels,
                          const size_t outputs_pad,
                          const size_t channels_pad) {
  // Fill with zeroes
Period at the end of the sentence.
namespace lczero {

// Fully connected layer
No need for the comment if it's just the class name.
src/neural/blas/batchnorm.cc (outdated)
  }
}

void Batchnorm::OffsetMeans(std::vector<float>& bn_means,
@dubslow
Many style guides (including the Google style guide that we happen to use) disallow mutable references, because they can lead to surprises at the calling point (that's a habit from the C language, where there are no references). E.g. (extreme example):

int x = 32;
ComputeSomethingXTimes(x);  // Unexpected that x may change.
DoSomethingElseXTimes(x);   // x is 0 now.

void ComputeSomethingXTimes(int& x) {
  while (x--) ComputeSomething();
}
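For contrast, a minimal sketch of the pointer alternative (reusing the hypothetical names from the example above): the caller has to write &x, so the mutation is visible at the call site.

void ComputeSomething();  // As in the example above.

void ComputeSomethingXTimes(int* x) {
  while ((*x)--) ComputeSomething();
}

int x = 32;
ComputeSomethingXTimes(&x);  // The &x makes it clear that x may change.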
src/neural/network_blas.cc (outdated)
  void forward(std::vector<float>& input, std::vector<float>& output_pol,
               std::vector<float>& output_val) {

 private:
  // A cap on the max batch size since it's consume a lot of memory
"it consumes", period at the end.
src/neural/blas/convolution1.h (outdated)
 public:
  Convolution1() = delete;

  // Batched forward inference
Period.
  // The instance will allocate memory resources for the
  // largest batch size, and the largest input and output
  // layers.
  WinogradConvolution3(const size_t max_batch_size,
I've just realized it's a constructor!
Could you move that declaration to be the first in the class?
src/neural/blas/batchnorm.cc (outdated)
  }
}

void Batchnorm::OffsetMeans(std::vector<float>& bn_means,
Still not fixed.
@mooskagh I did not see your comment about Weights. So I moved the methods there.
Please review.
I am anticipating... if you want me to move the Batch Normalization methods out of Weights. Code after moving the methods:
network_blas:
network_opencl:
Previously the code was: (but with passing a non-const reference)
or
@mooskagh The PR is OK to merge in this state. But a question: I have seen this in loader.cc.
Does it mean I can do the same, i.e. use a non-const ptr*?
Yes, a non-const ptr is what should be used instead of a non-const reference!
As for the methods: void OffsetMeans(Network::ConvBlock* conv_block);
I really dislike
@mooskagh Put it back in Batchnorm with the correct signature.
I was tempted to fix the InvertStddev formula; I think it's not completely correct. I have nothing to add to this PR for the moment. Please review.
The InvertStddev formula is correct. That's how it's defined in the batch normalization paper (https://arxiv.org/abs/1502.03167) and how it's calculated in TensorFlow during training.
Alright. That looked suspicious, thanks for making it clear.
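For reference, a minimal sketch of the formula in question, assuming the weights store the variance (the function name follows the conversation above; kEpsilon = 1e-5f is an assumption and must match the value used during training):

#include <cmath>
#include <vector>

void InvertStddev(std::vector<float>* bn_stddevs) {
  constexpr float kEpsilon = 1e-5f;
  // The paper normalizes by sqrt(variance + epsilon); store its inverse.
  for (auto& v : *bn_stddevs) v = 1.0f / std::sqrt(v + kEpsilon);
}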
x2 overall speedup.
I was able to re-order the dimensions of the V/M matrices, so that I could join the inner sgemms of the Winograd convolve3 into one single sgemm for every tile. Since the weights-tile matrix does not change, this leads to about a x2 speed-up.
I also made various optimisations to the scatter/gather methods of the Winograd convolve3, and significantly reduced the allocations and the pressure on new/delete: the vectors are now allocated once for the whole computation. I also removed an unnecessary copy that dealt with the skip-connection.
Finally, I reduced the memory footprint of the computation, removing the input buffer and relying on 3 rotating buffers.
Benchmarks:
https://docs.google.com/spreadsheets/d/1eU2gb67V69_7yXFzIaF2WI4djF1pXGAih-rwhrstmg0
(formerly, though less complete: https://docs.google.com/spreadsheets/d/1vjPisPj2mbjc2Ic481E57h9O8o7ClyqYVXgU4HWoADQ/)
References:
https://arxiv.org/pdf/1509.09308.pdf
https://ai.intel.com/winograd/
https://ai.intel.com/winograd-2/
Details
The initial non-batched algorithm looks like this:
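(A hedged sketch, not the PR's actual code: the tensor layout, kTiles = 16, and the cblas_sgemm usage are assumptions inferred from the declarations quoted earlier.)

#include <cblas.h>

constexpr int kWinogradTile = 16;  // kWinogradAlpha * kWinogradAlpha.
constexpr int kTiles = 16;         // 8x8 board covered by 2x2 output tiles.

// One small sgemm per Winograd tile, for a single position:
// M[tile] (outputs x kTiles) = U[tile] (outputs x channels) * V[tile] (channels x kTiles).
void ConvolveOne(const int outputs, const int channels, const float* U,
                 const float* V, float* M) {
  for (int tile = 0; tile < kWinogradTile; ++tile) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, outputs, kTiles,
                channels, 1.0f, &U[tile * outputs * channels], channels,
                &V[tile * channels * kTiles], kTiles, 0.0f,
                &M[tile * outputs * kTiles], kTiles);
  }
}

This loop runs once per position in the batch, between the scatter and the gather.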
You can make M and V batch_size times larger and batch the scatter and gather. With the proper reordering of M and V, the inner multiplication can be contiguous along the columns of M and V. Note that W is fixed and does not change along the batch, so the per-tile inner matrix multiplications are joined into one multiplication. The input/output columns are batch_size times larger.
We reuse each tile of W across the whole batch; this is what gives us the speedup.
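A matching hedged sketch of the batched version (same assumptions as the sketch above): the batch is folded into the column dimension of V and M, so the per-tile sgemms of the whole batch collapse into one sgemm per tile, reusing the same U tile.

void ConvolveBatch(const int batch_size, const int outputs,
                   const int channels, const float* U, const float* V,
                   float* M) {
  const int cols = batch_size * kTiles;  // batch_size times more columns.
  for (int tile = 0; tile < kWinogradTile; ++tile) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, outputs, cols,
                channels, 1.0f, &U[tile * outputs * channels], channels,
                &V[tile * channels * cols], cols, 0.0f,
                &M[tile * outputs * cols], cols);
  }
}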
Known problems
This change is known to be affected by #98, but is not responsible for it.