Slow-ish run time on MSVC #9
Hi Patrik, very interesting! I would not expect such performance issues with a model architecture like the one you describe. OK, but back to the relevant topics. :)

// gemm_test.cpp
#include <array>
#include <chrono>
#include <iostream>
#include <string>
#include <eigen3/Eigen/Dense>

using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;

template <typename Mat>
void run_test(const std::string& name)
{
    using namespace std::chrono;
    for (int i = 0; i < 10; ++i)
    {
        float checksum = 0.0f; // to prevent the compiler from optimizing everything away
        Mat a_rm(2048, 2048);
        Mat b_rm(2048, 2048);
        const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
        const auto c_rm = a_rm * b_rm;
        checksum += c_rm(123, 234);
        const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
        const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
        if (i > 1) // skip the stats of the first iterations (warm-up, CPU caches etc.)
        {
            std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
        }
    }
}

int main()
{
    run_test<ColMajorMatrixXf>("col major");
    run_test<RowMajorMatrixXf>("row major");
}
This all does not yet explain the difference in performance you see between Keras and fdeep.
You can achieve a yes to questions 1 and 2 on Linux like that.
However, I'm not sure how to do it on Windows. Perhaps you have to use Manual device placement for this. But I would be happy to receive your model.
Hey, thank you for the quick reply! :-) I still wouldn't rule out Row- vs. ColMajor as a potential issue, even with your benchmark; it might depend quite a bit on the matrix sizes. But I agree with you, it could well be something else. I'm going to check out Very Sleepy. The VS profiler can display Inclusive/Exclusive too; I think it's just a matter of my inexperience in making sense of this not-so-trivial profiler result.
It seems like at least part of the answer is using multiple cores. All of these operations should be quite easily parallelisable, as most of them are independent. Do you think that would be doable without too much effort in fdeep?
I experimented with enabling OpenMP so that Eigen can parallelise its internals, but the success was very limited with the models I tested: CPU usage increased a lot, but the run time did not go down much. If you send me the model, I will measure it here. Otherwise you can try to disable all but one CPU core, perhaps simply in the Windows Task Manager for this one process, to see the "clean" Keras run time.
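For illustration, here is a minimal sketch of how Eigen's GEMM threading can be queried and pinned to one thread directly in code, as an alternative to restricting cores in the Task Manager. Eigen::nbThreads and Eigen::setNbThreads are standard Eigen calls; the matrix sizes are arbitrary and this is not code from the thread.

// eigen_threads_sketch.cpp -- illustration only: querying and limiting Eigen's
// internal GEMM threading (effective when compiled with OpenMP, e.g. -fopenmp or /openmp).
#include <iostream>
#include <eigen3/Eigen/Dense>

int main()
{
    // How many threads Eigen would use for large matrix products.
    std::cout << "Eigen threads: " << Eigen::nbThreads() << std::endl;

    // Pin Eigen to a single thread to get a "clean" single-core measurement.
    Eigen::setNbThreads(1);

    const Eigen::MatrixXf a = Eigen::MatrixXf::Random(2048, 2048);
    const Eigen::MatrixXf b = Eigen::MatrixXf::Random(2048, 2048);
    const Eigen::MatrixXf c = a * b; // now runs on one core regardless of the OpenMP settings
    std::cout << c(0, 0) << std::endl;
}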
Great idea. I've pinned the processes to one CPU core. Here is what I'm getting now:
It seems like matconvnet/Matlab is bloody good :-) Hmm!
I don't know matconvnet, but the numbers look suspicious to me. Perhaps it fans out to multiple cores using multiprocessing instead of threads, and thus is not as easy to "catch" on one CPU. Disabling all but one CPU core system-wide could help to make sure matconvnet cannot sneak away from the restriction. ;)
In an attempt to get the question about the performance of row-major vs. column-major out of the way, I adjusted the GEMM benchmark code to use randomized matrix sizes. It still looks as though there is no significant difference:

// gemm_test.cpp
#include <array>
#include <chrono>
#include <iostream>
#include <random>
#include <string>
#include <eigen3/Eigen/Dense>

using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;

template <typename Mat>
void run_test(const std::string& name, int s1, int s2, int s3)
{
    using namespace std::chrono;
    float checksum = 0.0f; // to prevent the compiler from optimizing everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    for (size_t i = 0; i < 10; ++i)
    {
        Mat a_rm(s1, s2);
        Mat b_rm(s2, s3);
        const auto c_rm = a_rm * b_rm;
        checksum += c_rm(0, 0);
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
    std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}

int main()
{
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 2048);
    for (std::size_t i = 0; i < 100; ++i)
    {
        int s1 = dis(gen);
        int s2 = dis(gen);
        int s3 = dis(gen);
        std::cout << s1 << " " << s2 << " " << s3 << std::endl;
        run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
        run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
        std::cout << "--------" << std::endl;
    }
}
Your Eigen benchmark is interesting. I am actually quite surprised that we don't see a difference. I would really expect to see one, since in one case it can read a cache line from memory and then use the whole line, while in the other case it has to fetch much more often. But that's about as far as my knowledge goes, so it's quite possible I'm missing something.
I also checked the Task Manager while it was running, and it was strictly capped at 12.5% (which is 1/8, i.e. exactly one of the "eight" cores (4+4 from hyper-threading)). But great idea with the msconfig option; I disabled all cores except the first and measured again. For Visual Studio, I ran the benchmark and tested /O2, /O3, /Oi, /Os, /Ot, /GL, /arch:AVX2, /fp:precise, /fp:fast and various combinations thereof. The best result was with AVX2 and O2 (or O3) and the rest on default, so enabling "more aggressive" options like /fp:fast, /Ot etc. either made no difference or even made it 5-10 ms slower. I also ran my benchmark on WSL with gcc and clang. The results are quite interesting.
The benchmarks were made with Eigen 3.3.4 in all cases. I think we can conclude some things from the benchmarks so far. First: Matlab/matconvnet is bloody good, also on a single core. Perhaps not too surprising, since linear algebra is their main business. But it is a bit surprising that Keras/TensorFlow (or Eigen, for that matter) can't get close to it. Second, it seems like the MSVC optimiser really sucks in this scenario: 48 ms was the best I could achieve, while gcc and clang easily achieve <= 20 ms. Even without -march=native, gcc and clang are faster than MSVC with AVX2. Third, with regards to fdeep, we can probably conclude that it does a very good job, and there might not be much that we can do? Or what do you think? What do you think about (optional) parallelising in fdeep on a per-filter basis?
Thank you for your measurements, I think they confirm my interpretation of my VS profiler results. I think the limited success with OpenMP is probably because Eigen parallelises the GEMM, which is what makes sense for Eigen to do. But the convolutions probably consist of many small GEMM operations, so it doesn't make sense to parallelise on the GEMM level. I also don't think that's what Keras/TensorFlow does; I think they parallelise on a filter level, or something along those lines, which is what makes the most sense for CNN predictions. I might play around with that.
I think the execution graph might be serial, but it should still parallelise well on a filter level, don't you think? I would guess this would be the optimal granularity for parallelisation.
I suppose nothing could be optimised there? Like using ...
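To make the per-filter parallelisation idea more concrete, here is a rough, self-contained sketch (not fdeep code; the matrix sizes are invented): the rows of the im2col filter matrix are split into chunks, and each chunk's share of the GEMM runs on its own thread.

// filter_parallel_sketch.cpp -- illustration of filter-level parallelism on top of im2col:
// each task multiplies a subset of the filter rows with the shared im2col matrix.
#include <future>
#include <iostream>
#include <vector>
#include <eigen3/Eigen/Dense>

using Mat = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

int main()
{
    const Eigen::Index num_filters = 64;
    const Eigen::Index patch_size = 3 * 3 * 64 + 1;  // fz * fy * fx + bias row
    const Eigen::Index out_positions = 32 * 32;      // out_height * out_width

    const Mat b = Mat::Random(num_filters, patch_size);   // one row per filter
    const Mat a = Mat::Random(patch_size, out_positions); // im2col matrix of the input

    const int num_chunks = 4;
    const Eigen::Index rows_per_chunk = num_filters / num_chunks;
    Mat result(num_filters, out_positions);

    std::vector<std::future<void>> tasks;
    for (int i = 0; i < num_chunks; ++i)
    {
        tasks.push_back(std::async(std::launch::async, [&, i]()
        {
            // Each task writes only its own (disjoint) block of output rows.
            result.middleRows(i * rows_per_chunk, rows_per_chunk).noalias() =
                b.middleRows(i * rows_per_chunk, rows_per_chunk) * a;
        }));
    }
    for (auto& task : tasks)
    {
        task.get();
    }
    std::cout << result(0, 0) << std::endl;
}

Whether this actually beats one large GEMM depends on the sizes involved; Eigen's GEMM is already heavily optimised, so chunking can also hurt.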
Currently one (im2col) convolution consists of just one single large matrix multiplication. I don't know what TensorFlow does, and when I tried to find out, I found it quite hard to read from their source code how it works.
Probably, yes. At least without im2col I can well imagine how to parallelize the outer loop. But to become faster than im2col, one would need a lot of filters and CPU cores, since it is intrinsically a lot slower than im2col.
Perhaps the filling of the matrices for the GEMM can be parallelized, but I'm not sure how. I struggled a lot to get it working at all in the first place, and I already can't remember what it actually does. I'm glad I have the automatic tests making sure there are no big mistakes. ;-) However, I guess we cannot avoid copying the values completely, because the matrix used in the GEMM to emulate a usual convolution looks a lot different from the actual tensors passed between the layers. Sorry, I currently can not help more with this problem; I'm kind of stumped. Perhaps you can make more sense of all this? :-) Here is a 7-minute high-level overview of how im2col works.
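For readers following along, here is a tiny self-contained illustration of the im2col idea described above (single channel, one 3x3 filter, stride 1, no padding; this is not fdeep code): the input patches are copied column-wise into a matrix so that the whole convolution becomes one matrix multiplication.

// im2col_sketch.cpp -- minimal im2col example: convolution as a single GEMM.
#include <iostream>
#include <eigen3/Eigen/Dense>

using Mat = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

int main()
{
    const int in_h = 4, in_w = 4, fy = 3, fx = 3;
    const int out_h = in_h - fy + 1, out_w = in_w - fx + 1; // 2x2 output

    Mat input(in_h, in_w);
    input.setRandom();
    Mat kernel(fy, fx);
    kernel.setRandom();

    // a: one column per output position, one row per kernel coefficient.
    Mat a(fy * fx, out_h * out_w);
    for (int yf = 0; yf < fy; ++yf)
        for (int xf = 0; xf < fx; ++xf)
            for (int y = 0; y < out_h; ++y)
                for (int x = 0; x < out_w; ++x)
                    a(yf * fx + xf, y * out_w + x) = input(y + yf, x + xf);

    // b: the flattened kernel as one row; more filters would simply add more rows.
    Mat b(1, fy * fx);
    for (int yf = 0; yf < fy; ++yf)
        for (int xf = 0; xf < fx; ++xf)
            b(0, yf * fx + xf) = kernel(yf, xf);

    // One GEMM produces all output positions (for all filters) at once.
    const Mat result = b * a; // 1 x (out_h * out_w)
    std::cout << result << std::endl;
}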
Oh yes! You are of course right, this processes all filters at the same time. Wow. I really gotta watch that im2col video now, thank you :-) You have some very good insight and raise good points; thank you for the explanations. I think we can probably close this then. I guess the conclusion is that there isn't any issue (apart from the MSVC optimiser), and there are parallelisation options, but none of them would be trivial. I will have a good think about it, and we can always re-open or add something if either of us or anyone else has additional knowledge or ideas. Thank you very much for all the help!
As one more or less desperate attempt to speed up the im2col convolution for the VGG network, I made some of the parameters known at compile time for this special case. For testing I hacked it into the code in this ugly way:

template <std::size_t strides_y, std::size_t strides_x, std::size_t offset_y, std::size_t offset_x, std::size_t fy, std::size_t fx>
inline tensor3 convolve_im2col_opt(
    std::size_t out_height,
    std::size_t out_width,
    const std::vector<filter>& filters,
    const tensor3& in_padded)
{
    //std::cout << "convolve_im2col_opt\n";
    const std::size_t fz = filters.front().shape().depth_;
    RowMajorMatrixXf a(fz * fy * fx + 1, out_height * out_width);
    Eigen::Index a_y = 0;
    for (std::size_t zf = 0; zf < fz; ++zf)
    {
        for (std::size_t yf = 0; yf < fy; ++yf)
        {
            for (std::size_t xf = 0; xf < fx; ++xf)
            {
                Eigen::Index a_x = 0;
                for (std::size_t y = 0; y < out_height; ++y)
                {
                    for (std::size_t x = 0; x < out_width; ++x)
                    {
                        a(a_y, a_x++) = in_padded.get(zf,
                            offset_y + strides_y * y + yf,
                            offset_x + strides_x * x + xf);
                    }
                }
                ++a_y;
            }
        }
    }
    Eigen::Index a_x = 0;
    for (std::size_t y = 0; y < out_height; ++y)
    {
        for (std::size_t x = 0; x < out_width; ++x)
        {
            a(a_y, a_x++) = static_cast<float_type>(1);
        }
    }
    ++a_y;
    RowMajorMatrixXf b(filters.size(), fz * fy * fx + 1);
    Eigen::Index b_y = 0;
    Eigen::Index b_x = 0;
    for (std::size_t f = 0; f < filters.size(); ++f)
    {
        b_x = 0;
        const filter& filter = filters[f];
        for (std::size_t zf = 0; zf < fz; ++zf)
        {
            for (std::size_t yf = 0; yf < fy; ++yf)
            {
                for (std::size_t xf = 0; xf < fx; ++xf)
                {
                    b(b_y, b_x++) = filter.get(zf, yf, xf);
                }
            }
        }
        b(b_y, b_x++) = filter.get_bias();
        ++b_y;
    }
    const auto result = b * a;
    return tensor3(shape3(filters.size(), out_height, out_width),
        eigen_mat_to_values(result));
}

And I added the following to the beginning of the regular im2col convolution function:

if (strides_y == 1 && strides_x == 1 && offset_y == 0 && offset_x == 0 && fy == 3 && fx == 3)
{
    return convolve_im2col_opt<1,1,0,0,3,3>(out_height, out_width, filters, in_padded);
}

However, this does not speed things up at all. I would have expected at least a tiny speed-up.
Hey, thanks for this! I've also tried some more things, let me go step by step:
I wonder whether it would help making everything fixed size. One thing that would actually be good to measure is how much time is spent in each (conv) layer, averaged over multiple runs. Do you have any better idea how to do this than, for example, writing to a separate CSV file in each layer and then summing up and printing the times in Python?

In the link about im2col that you posted, he afterwards introduces a faster convolution (mainly for 3x3 kernels) from this paper https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Lavin_Fast_Algorithms_for_CVPR_2016_paper.pdf, based on Winograd. They get roughly a 2x speed-up on VGG for a batch size of 1, on the GPU. It's overall quite cool, and I've seen this used in some deep learning repositories across GitHub. I think this could speed up things on the CPU as well. But implementing this, and doing it efficiently, is quite beyond my skill level (or would take an insane amount of time).

This is a good resource as well (Maratyszcza/NNPACK#30), and maybe this (https://eigen.tuxfamily.org/dox-devel/unsupported/TensorConvolution_8h_source.html) would be a good first try too; it at least seems to be a multi-core implementation. (TensorFlow seems to actually use the Eigen convolution, while Caffe2 and others seem to use NNPACK, tensorflow/tensorflow#9319.) Some other resources, but probably not as interesting as NNPACK: http://www.tvmlang.org/2018/01/16/opt-mali-gpu.html, tiny-dnn/tiny-dnn#109 (comment). Okay, this ended up being quite a long post... :-)
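One possible alternative to the per-layer CSV idea above would be a small in-process scoped timer that accumulates elapsed time per label and prints a summary at the end. Here is a rough sketch; fake_conv_layer is just a stand-in for a layer's forward pass, not real fdeep code.

// layer_timer_sketch.cpp -- accumulate per-layer run time in-process instead of via CSV files.
#include <chrono>
#include <iostream>
#include <map>
#include <string>

class scoped_timer
{
public:
    explicit scoped_timer(const std::string& label) :
        label_(label), start_(std::chrono::high_resolution_clock::now())
    {
    }
    ~scoped_timer()
    {
        const auto end = std::chrono::high_resolution_clock::now();
        totals()[label_] += std::chrono::duration<double, std::milli>(end - start_).count();
    }
    static std::map<std::string, double>& totals()
    {
        static std::map<std::string, double> t;
        return t;
    }
private:
    std::string label_;
    std::chrono::high_resolution_clock::time_point start_;
};

void fake_conv_layer() // stand-in for a convolution layer's forward pass
{
    const scoped_timer timer("conv2d_3x3");
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i)
    {
        x += i * 0.5;
    }
}

int main()
{
    for (int run = 0; run < 100; ++run)
    {
        fake_conv_layer();
    }
    for (const auto& entry : scoped_timer::totals())
    {
        std::cout << entry.first << ": " << entry.second << " ms total" << std::endl;
    }
}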
Awesome analysis and research! Your method of measuring is fine, I guess. A change that only decreases the runtime by 1 ms (out of 45) is probably quite boring anyway. ;)
I suggest we first prove that knowing more hyperparameters at compile time would really help, since this seems like a huge change.
Perhaps we could split the models into small models with one layer each for easier measurement. But in the end it probably won't make much of a difference. I also heard about Winograd convolutions, but did not follow this path, because it seemed very long to me. O:) Also, there are two smaller things I would like to try out. Probably the performance win would not be that big, but perhaps something worthwhile anyway:
With this commit the filter matrices are now pregenerated when loading a model. This saves some time during prediction, even if it is not very much.
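As an illustration of what pregenerating the filter matrices means, here is a schematic sketch (the filter struct and conv_layer class below are stand-ins, not fdeep's actual types): the GEMM's filter matrix is packed once when the layer is constructed, so each prediction only has to build the input-dependent im2col matrix.

// pregenerated_filters_sketch.cpp -- build the im2col filter matrix once at load time.
#include <cstddef>
#include <iostream>
#include <vector>
#include <eigen3/Eigen/Dense>

using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

struct filter
{
    std::vector<float> weights; // flattened fz * fy * fx coefficients
    float bias;
};

class conv_layer
{
public:
    explicit conv_layer(const std::vector<filter>& filters)
    {
        // Done once when the model is loaded: pack all filters (plus a bias column) into the GEMM lhs.
        const auto n = static_cast<Eigen::Index>(filters.front().weights.size());
        filter_mat_.resize(static_cast<Eigen::Index>(filters.size()), n + 1);
        for (std::size_t f = 0; f < filters.size(); ++f)
        {
            const auto row = static_cast<Eigen::Index>(f);
            for (Eigen::Index i = 0; i < n; ++i)
            {
                filter_mat_(row, i) = filters[f].weights[static_cast<std::size_t>(i)];
            }
            filter_mat_(row, n) = filters[f].bias;
        }
    }

    RowMajorMatrixXf forward(const RowMajorMatrixXf& im2col_input) const
    {
        // Per prediction only the input-dependent im2col matrix has to be built;
        // the filter matrix is reused unchanged.
        return filter_mat_ * im2col_input;
    }

private:
    RowMajorMatrixXf filter_mat_;
};

int main()
{
    const std::vector<filter> filters(8, filter{std::vector<float>(3 * 3 * 4, 0.5f), 0.1f});
    const conv_layer layer(filters);
    const RowMajorMatrixXf im2col_input = RowMajorMatrixXf::Random(3 * 3 * 4 + 1, 16 * 16);
    std::cout << layer.forward(im2col_input)(0, 0) << std::endl;
}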
OK, here is the next commit.
I was hoping that perhaps the reason the last commit (avoid GEMM result copy) did not bring much in terms of performance is that Eigen does not like to write its matrix-multiplication results into unaligned memory. To test this hypothesis I benchmarked Eigen for both cases, i.e. multiplying two matrices into unaligned external memory and multiplying two matrices into 16-byte-aligned external memory. However, the results are sobering:

#include <array>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <eigen3/Eigen/Dense>

using Mat = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

void run_test_unaligned(int s1, int s2, int s3)
{
    using namespace std::chrono;
    float checksum = 0.0f; // to prevent the compiler from optimizing everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    for (size_t i = 0; i < 10; ++i)
    {
        Mat a(s1, s2);
        Mat b(s2, s3);
        const std::size_t num_bytes = a.rows() * b.cols() * sizeof(float);
        float* ptr = (float*)std::malloc(num_bytes);
        // Map the raw buffer as an explicitly unaligned matrix and write the GEMM result into it.
        Eigen::Map<Mat, Eigen::Unaligned> c(
            ptr,
            static_cast<Eigen::Index>(a.rows()),
            static_cast<Eigen::Index>(b.cols()));
        c.noalias() = a * b;
        checksum += ptr[0];
        std::free(ptr);
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
    std::cout << "unaligned (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}

void run_test_aligned(int s1, int s2, int s3)
{
    using namespace std::chrono;
    float checksum = 0.0f; // to prevent the compiler from optimizing everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    for (size_t i = 0; i < 10; ++i)
    {
        Mat a(s1, s2);
        Mat b(s2, s3);
        const std::size_t num_bytes = a.rows() * b.cols() * sizeof(float);
        // aligned_alloc is C11 (not available on MSVC); the size must be a multiple of the alignment.
        float* ptr = (float*)aligned_alloc(16, num_bytes);
        Eigen::Map<Mat> c(
            ptr,
            static_cast<Eigen::Index>(a.rows()),
            static_cast<Eigen::Index>(b.cols()));
        c.noalias() = a * b;
        checksum += ptr[0];
        std::free(ptr);
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
    std::cout << "aligned (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}

int main()
{
    for (std::size_t i = 0; i < 100; ++i)
    {
        run_test_unaligned(1024, 2048, 768);
        run_test_aligned(1024, 2048, 768);
        std::cout << "--------" << std::endl;
    }
}
So there probably is nothing to gain here.
Oh, but good news for you (hopefully I didn't make a mistake while measuring): the time for a forward pass with your model went down from 17 ms to 10 ms on my machine.

Edit: If I take a look at your model architecture it seems plausible, since you are using quite small data tensors (height and width) but a huge number of filters, especially in the last layers. So your forward pass probably benefits from the im2col filter matrices being precalculated when the model is loaded.

Edit 2: Just double-checked; the performance gain for your model is real on my system. When I check out this version, 100 forward passes take 1715 ms. With that version it is 988 ms.
Wow, this is awesome! Timings for my model with the very latest fdeep commit:
That is really an amazing improvement. It's obviously still not as fast as I'd like (29 ms means "barely real-time" at 30 fps), but this is very hopeful and I think I can work with that for now. :-) Also because the gcc timings are quite amazing:
I played around with the MSVC flags again, but none of them make it any faster (and /openmp makes it slower). And just for reference, it's the same speed with AVX (1) and AVX2, so the speed gain comes from SSE->AVX, and there is none from AVX->AVX2. I also quickly tested cd90b82 (which doesn't have the tensor-no-copy improvement yet) and I think it's like 1 ms or so slower, so it does indeed seem like a small improvement. I think this is too small to be measured correctly by my crude benchmark, but let's say it is indeed 1 ms; then it's actually still a 3% improvement, which I'd say is significant. :-) Btw, the other speed gains in your tables are also really good. You say "saves some time during prediction, even if it is not very much", but it's actually around 10-20%! And regarding this: "interestingly the implementation have some wiggle room on the details of a convolution": I was actually quite surprised (and very happy!) that the prediction values for a test example were pretty much exactly the same for Matconvnet, Keras and fdeep. I think I checked up to 4 or 6 floating-point digits. Even though Matlab probably uses quite different padding, convolutions, matrix multiply, etc... :-) I think this is definitely awesome, thank you very much for these improvements! 🎉 🎉 🎉
Could you try to find out whether MSVC is so slow with the matrix multiplication in Eigen or with the actual fdeep code? Perhaps you can run the GEMM benchmark from above on MSVC and GCC and compare the results.
Note: I had to disable the aligned test on MSVC, as it doesn't support C11's aligned_alloc (a portable workaround is sketched after this comment). VS2017 15.6.0 Preview 3:
clang-6 is around 730 ms, so between the two. So interestingly, there is "only" a 20-25% difference. Quite a bit, though, but probably not out of the ordinary, as clang-6 is also 20% slower than gcc. Now more interestingly, I took the first 4 random matrix sizes from your random benchmark and ran these again on both compilers:
Now MSVC is suddenly significantly slower. So it seems to depend on the matrix size...? I'm a bit at a loss. Another, but perhaps unrelated, question is why the run time from your gcc-compiled executable is half that of mine. But your CPU has a 3.9 GHz turbo and mine has a 3.5 GHz turbo, so maybe it's that. Or because I'm running it in WSL. I just double-checked, I'm definitely using Eigen 3.3.4. And I see you used gcc with C++11 while I used 14/17, but I checked and it doesn't make any difference.
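Regarding the disabled aligned test: MSVC does not provide C11 aligned_alloc, but it does offer _aligned_malloc/_aligned_free, so a small wrapper would presumably let the aligned benchmark build there too. A sketch of such a wrapper (an assumption about how one could adapt the benchmark, not what was actually done in the thread):

// aligned_alloc_sketch.cpp -- portable 16-byte-aligned allocation for the benchmark above.
#include <cstdlib>
#include <iostream>
#ifdef _MSC_VER
#include <malloc.h> // _aligned_malloc / _aligned_free
#endif

float* alloc_aligned(std::size_t alignment, std::size_t num_bytes)
{
#ifdef _MSC_VER
    return static_cast<float*>(_aligned_malloc(num_bytes, alignment));
#else
    // C11 aligned_alloc: num_bytes must be a multiple of alignment.
    return static_cast<float*>(aligned_alloc(alignment, num_bytes));
#endif
}

void free_aligned(float* ptr)
{
#ifdef _MSC_VER
    _aligned_free(ptr);
#else
    std::free(ptr);
#endif
}

int main()
{
    float* ptr = alloc_aligned(16, 1024 * sizeof(float));
    ptr[0] = 1.0f;
    std::cout << ptr[0] << std::endl;
    free_aligned(ptr);
}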
I just tried something else, just to make really sure it's not some weird MSVC flag combination or some flag I'm missing. I created a "plain" project with the VS "New Project" wizard, pasted the GEMM test code, and ran it in x64 Release mode. And it was even 2x slower than MSVC on the command line 🤣 What gcc version did you compile with? I just tried it in a Linux VM instead of WSL and got the same results.
It's a cross-platform project, so there isn't one specific system or machine, but MSVC is one of the important platforms. But yeah, it might be worth testing it on another Windows machine.
Interesting results for Whole Program "Optimization". ;) I'm using g++ 5.4.0 (the default version on Ubuntu 16.04.6, I guess). By the way, you can use gcc (i.e. MinGW) to produce native Windows executables in many cases. Perhaps this helps with the run time.
Hmm, this is interesting. I tested on an i5-3550, which only has AVX and no AVX2 (both our main CPUs have AVX2). VS2017 is more or less as fast as on my i7. It kind of looks to me as if this has to do with AVX vs. AVX2: MSVC cannot really optimise the code better when AVX2 is available, it's just as slow as with AVX. g++ and clang, on the other hand, seem roughly as fast as MSVC when only AVX is available on a CPU. But if AVX2 is available, then g++ and clang can really make use of it and blow MSVC out of the water. In any case I think fdeep's performance is really good now; the numbers on g++ with AVX2 are extremely impressive. The rest seems up to the compiler optimisers and/or Eigen.
Good explanation. It could be true; at least it makes sense. So now the most can be gained by making MSVC utilize AVX2 in Eigen as well as g++ does. Perhaps this is a job for the devs of Eigen. If we are lucky they only need to adjust one of their many ...
I've posted to the Eigen list.
The people on the Eigen mailing list were extremely helpful, and found out that it is because MSVC doesn't emit FMA instructions for this code. Btw, one person on the Eigen mailing list also found a related gcc issue; the link to the gcc bug report is here for reference.
Awesome work! 👍 In the end I think we made good progress performance-wise here, not only for your model but for fdeep overall. :) And along the way you also helped the Eigen people, the gcc devs, and the users of each. 🥇
Yep, definitely, a pretty good outcome. Thank you very much again for your help and your code changes! :-)
Hi!
First of all thank you for this great library! :-)
I've got a fairly small model (18 layers) for real-time applications, basically mainly consisting of 5 blocks of Conv2D/ReLU/MaxPool2D, and input size 64x64x3. I'm unfortunately seeing some speed problems with fdeep.

A forward pass takes around 11 ms in Keras, and it's taking 60 ms in fdeep. (I've measured by calling predict 100x in a for-loop and then averaging - a bit crude, but it should do the trick for this purpose). I've compiled with the latest VS2017 15.5.5, Release mode, and default compiler flags (/O2). If I enable AVX2 and intrinsics, it goes down to 50 ms, but that is still way too slow. (I've tried without im2col, but it's even slower, around >10x).

I've run the VS profiler, but I'm not 100% sure I'm interpreting the results correctly. I think around 30%+5% of the total time is spent in Eigen's gebp and gemm functions, where we probably can't do much. Except maybe: I think I've seen you're using RowMajor storage for the Eigen matrices. Eigen is supposedly more optimised for its default, ColMajor storage. Would it be hard to change that in fdeep?

Another 30% seems to be spent in convolve_im2col. But I'm not 100% sure where. I first thought it was the memcpy in eigen_mat_to_values, but eigen_mat_to_values itself contains only very few profiler samples.

There's also a lot of internal::transform and std::transform showing up in the profiler as well (internal::transform<ContainerOut>(reuse_t{}, f, std::forward<ContainerIn>(xs));), but I couldn't really figure out what the actual code is that this executes.

I also saw that I think you pre-instantiate some convolution functions for common kernels. Most of my convolution kernels are 3x3, and it looks like you only instantiate n x m kernels for n and m equal to 1 and 2. Could it help adding 3x3 there?

So yeah, I'm really not sure about all of it. If indeed the majority of time is spent in Eigen's functions, then the RowMajor thing could indeed be a major problem.

I'm happy to send you the model and an example input via email if you wanted to have a look.

Here are some screenshots of the profiler:
Thank you very much!